HuggingFaceTB/smollm-corpus
General NLP | EN | odc-by
HuggingFaceTB/smollm-corpus is an English (EN) General NLP dataset of 236,980,453 rows, distributed in Parquet format under the odc-by license. It falls in the 100M<n<1B size category and has been downloaded 32.3K times.
About HuggingFaceTB/smollm-corpus
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models.
You can find more details about the models trained on this dataset in our SmolLM blog post.
...
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: 236,980,453
- Size: 100M<n<1B
- Creator: HuggingFaceTB
- Year: 2024
- License: odc-by
- Downloads: 32,318
- Likes: 468
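
The size category listed above is a bucket over the row count. As a minimal sketch of how the two fields relate, the snippet below maps a row count to a bucket label. The bucket boundaries and labels are an assumption modeled on the power-of-ten style of the `100M<n<1B` tag, and `SIZE_BUCKETS` and `size_category` are hypothetical names introduced only for illustration, not part of any library.

```python
# Exclusive upper bounds paired with bucket labels.
# Assumption: buckets step by powers of ten, matching the "100M<n<1B" tag style.
SIZE_BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
    (1_000_000_000, "100M<n<1B"),
]


def size_category(num_rows: int) -> str:
    """Return the label of the first bucket whose upper bound exceeds num_rows."""
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    for upper_bound, label in SIZE_BUCKETS:
        if num_rows < upper_bound:
            return label
    # Hypothetical catch-all: larger buckets are omitted from this sketch.
    return "n>=1B"


# Row count taken from the details list above.
print(size_category(236_980_453))  # 100M<n<1B
```

With 236,980,453 rows, the count sits between 100 million and 1 billion, which is why the dataset carries the 100M<n<1B tag.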