Close Menu
AsiaTokenFundAsiaTokenFund
  • Home
  • Crypto News
    • Bitcoin
    • Altcoin
  • Web3
    • Blockchain
  • Trading
  • Regulations
    • Scams
  • Submit Article
  • Contact Us
  • Terms of Use
    • Privacy Policy
    • DMCA
What's Hot

Best meme coin to buy now for x100 in this bull run, with BTC above 100k, is it Pepe, Shiba, or Pepeto?

May 8, 2025

Solana Shows Strength As it Recovers Past $150 Mark, But Analysts Expect Ruvi AI (RUVI) To be the Next 100x Gem and Skyrocket by 15,500% in 2025

May 8, 2025

Ripple Backs Global Expansion as Hidden Road Sets Foot in Abu Dhabi

May 8, 2025
Facebook X (Twitter) Instagram
Facebook X (Twitter) YouTube LinkedIn
AsiaTokenFundAsiaTokenFund
ATF Capital
  • Home
  • Crypto News
    • Bitcoin
    • Altcoin
  • Web3
    • Blockchain
  • Trading
  • Regulations
    • Scams
  • Submit Article
  • Contact Us
  • Terms of Use
    • Privacy Policy
    • DMCA
AsiaTokenFundAsiaTokenFund

NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

0
By Aggregated - see source on May 7, 2025 Blockchain
Share
Facebook Twitter LinkedIn Pinterest Email


Joerg Hiller
May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, integrated with NeMo Curator. This innovative pipeline optimizes data quality and quantity for superior AI model training.





NVIDIA has integrated its Nemotron-CC pipeline into the NeMo Curator, offering a groundbreaking approach to curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset leverages a 6.3-trillion-token English language collection from Common Crawl, aiming to enhance the accuracy of LLMs significantly, according to NVIDIA.

Advancements in Data Curation

The Nemotron-CC pipeline addresses the limitations of traditional data curation methods, which often discard potentially useful data due to heuristic filtering. By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of content lost by filtering.

Innovative Pipeline Features

The pipeline’s data curation process begins with HTML-to-text extraction using tools like jusText and FastText for language identification. It then applies deduplication to remove redundant data, utilizing NVIDIA RAPIDS libraries for efficient processing. The process includes 28 heuristic filters to ensure data quality and a PerplexityFilter module for further refinement.

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.

Impact on LLM Training

Training LLMs with the Nemotron-CC dataset yields significant improvements. For instance, a Llama 3.1 model trained on a 1 trillion-token subset of Nemotron-CC achieved a 5.6-point increase in the MMLU score compared to models trained on traditional datasets. Furthermore, models trained on long horizon tokens, including Nemotron-CC, saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available for developers aiming to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, enabling users to optimize the pipeline for specific needs. The integration into NeMo Curator allows for seamless development of both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.

Image source: Shutterstock


Credit: Source link

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

Related Posts

Robinhood Plans Blockchain-Based Trading for US Securities in Europe: Report

May 8, 2025

$MELANIA Lightning Buy Nets Traders $100M in 150 Seconds, Financial Times Reveals

May 7, 2025

DePIN $18.9B boom, +230K users

May 7, 2025
Leave A Reply Cancel Reply

What's New Here!

Best meme coin to buy now for x100 in this bull run, with BTC above 100k, is it Pepe, Shiba, or Pepeto?

May 8, 2025

Solana Shows Strength As it Recovers Past $150 Mark, But Analysts Expect Ruvi AI (RUVI) To be the Next 100x Gem and Skyrocket by 15,500% in 2025

May 8, 2025

Ripple Backs Global Expansion as Hidden Road Sets Foot in Abu Dhabi

May 8, 2025

Memecoins Explode as Bitcoin Inches Close to $100K—Will Memes Catch the Investors’ Attention?

May 8, 2025
AsiaTokenFund
Facebook X (Twitter) LinkedIn YouTube
  • Home
  • Crypto News
    • Bitcoin
    • Altcoin
  • Web3
    • Blockchain
  • Trading
  • Regulations
    • Scams
  • Submit Article
  • Contact Us
  • Terms of Use
    • Privacy Policy
    • DMCA
© 2025 asiatokenfund.com - All Rights Reserved!

Type above and press Enter to search. Press Esc to cancel.

Ad Blocker Enabled!
Ad Blocker Enabled!
Our website is made possible by displaying online advertisements to our visitors. Please support us by disabling your Ad Blocker.