NVIDIA NeMo Curator Enhances Vietnamese Language Data Processing
Open-source large language models (LLMs) are often proficient in English, but they face challenges with other languages, particularly those in Southeast Asia, due to a scarcity of training data. Addressing this issue, Viettel Solutions, a subsidiary of Viettel Corporation, has adopted NVIDIA’s NeMo Curator to enhance the processing of high-quality Vietnamese language data, as reported by NVIDIA.
Challenges with Language Models
LLMs typically excel in English due to abundant training data. However, languages like Vietnamese often lack sufficient data, which affects model performance. NVIDIA’s NeMo Curator offers a solution by enabling the creation of high-quality datasets necessary for training effective language models.
Viettel’s Collaboration with NVIDIA
Viettel Solutions has used NeMo Curator to curate the training data for its Llama 3 ViettelSolution 8B model, which now ranks among the top models on the VMLU leaderboard. According to Tuan Nguyen, Head of Data Analytics at Viettel Solutions, the tool’s GPU-accelerated features, such as deduplication and filtering, increased model accuracy by 10%, cut training time threefold, and reduced dataset size by 60%.
Data Curation Pipeline
The data curation process consists of downloading datasets from several sources, reformatting Unicode, deduplicating, and applying quality filtering. The sources are the Vietnamese subsets of C4, OSCAR, and Wikipedia, which are combined into a single training dataset. NeMo Curator applies both heuristic and classifier-based filtering to remove noise while preserving content diversity.
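The stages described above can be sketched in plain Python. This is a minimal illustration using only the standard library, not NeMo Curator's API: the function names, the choice of NFC normalization, hash-based exact deduplication, and the `min_words` threshold are all assumptions made for the example, and it runs on the CPU rather than the GPU.

```python
import hashlib
import unicodedata


def normalize_unicode(text):
    """Normalize to NFC so visually identical Vietnamese strings compare equal.

    NFC is an assumption here; the article only says Unicode is reformatted.
    """
    return unicodedata.normalize("NFC", text).strip()


def exact_deduplicate(docs):
    """Keep the first occurrence of each document, compared by a SHA-256 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def passes_min_words(doc, min_words=5):
    """Hypothetical heuristic: drop documents with too few whitespace-separated words."""
    return len(doc.split()) >= min_words


def curate(sources, min_words=5):
    """Combine sources, normalize, deduplicate, then apply a simple quality filter."""
    combined = [normalize_unicode(doc) for docs in sources for doc in docs]
    unique = exact_deduplicate(combined)
    return [doc for doc in unique if passes_min_words(doc, min_words)]
```

Normalizing before deduplicating matters for Vietnamese in particular: the same accented text can be encoded with precomposed or combining characters, and the two byte sequences would otherwise hash differently and survive as separate documents.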
Advanced Filtering Techniques
Heuristic filtering removes low-quality content using predefined rules, while classifier-based filtering uses a trained model to distinguish high-quality from low-quality data. Combining the two approaches keeps the dataset broad while raising its overall quality, which is crucial for effective language model training.
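A two-stage filter of this kind might look like the sketch below. It is an illustration under stated assumptions rather than NeMo Curator's implementation: the two rules, their thresholds, and `toy_quality_score` are hypothetical, and the toy scorer is only a placeholder for a trained quality classifier.

```python
def symbol_ratio_ok(text, max_ratio=0.3):
    """Hypothetical rule: reject text dominated by non-alphanumeric symbols."""
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_ratio


def repetition_ok(text, max_dup_fraction=0.5):
    """Hypothetical rule: reject text in which most words are repeats."""
    words = text.split()
    if not words:
        return False
    dup_fraction = 1 - len(set(words)) / len(words)
    return dup_fraction <= max_dup_fraction


def toy_quality_score(text):
    """Placeholder scorer (NOT a trained model): fraction of alphabetic words."""
    words = [w.strip(".,!?;:\"'()") for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


def filter_documents(docs, heuristics, classifier, threshold=0.5):
    """Apply the cheap rules first, then keep documents scoring at or above threshold."""
    kept = []
    for doc in docs:
        if not all(rule(doc) for rule in heuristics):
            continue  # failed a heuristic; skip the classifier entirely
        if classifier(doc) >= threshold:
            kept.append(doc)
    return kept
```

Running the rules before the classifier is a deliberate ordering: rule checks are cheap, so obviously bad documents are discarded before the more expensive model is called. In practice `classifier` would be any callable that returns a quality score from a trained model.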
Impact on Dataset Quality
The curation process significantly reduces dataset size by removing low-quality and redundant content, with classifier-based filtering alone accounting for a 45% reduction. The result is a smaller, cleaner dataset that is better suited to pretraining language models.
Conclusion
NVIDIA’s NeMo Curator provides a robust tool for processing high-quality Vietnamese language data, enhancing the performance of language models. By improving data quality and efficiency, it supports Viettel Solutions’ goal of leading in generative AI and developing AI-powered products for the Vietnamese market.