allenai/c4
Text Generation · Fill Mask · AF, AM, AR · odc-by
allenai/c4 is a text generation and fill-mask dataset from allenai, listed with the language codes AF, AM, AR, and containing 1,837,702,356 records in Parquet format. It is distributed under the odc-by license, falls in the 10B<n<100B size category, and has been downloaded more than 1M times.
About allenai/c4
C4
Dataset Summary
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on the Common Crawl dataset: "https://commoncrawl.org".
This is the processed version of Google's C4 dataset.
We prepared five variants of th...
Details
- Task: Text Generation, Fill Mask
- Language: AF, AM, AR
- Format: Parquet
- Rows / instances: 1,837,702,356
- Size: 10B<n<100B
- Creator: allenai
- Year: 2026
- License: odc-by
- Downloads: 1,029,647
- Likes: 601
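
With this many rows, downloading the whole dataset is rarely practical, so streaming is a natural way to inspect it. The following is a minimal sketch, not an official recipe: it assumes the Hugging Face `datasets` library is installed, that a configuration named "en" exists, and that each record carries a "text" field. None of those three details is stated in the listing above. The helper names `first_texts` and `stream_c4` are hypothetical.

```python
from itertools import islice


def first_texts(records, n=3, max_chars=80):
    """Return the (truncated) 'text' field of the first n records.

    `records` can be any iterable of dicts, so this helper also works on
    plain lists. The 'text' key is an assumption about the record schema.
    """
    return [rec["text"][:max_chars] for rec in islice(records, n)]


def stream_c4(config="en", split="train"):
    """Open allenai/c4 as a streaming dataset (nothing is downloaded up front).

    The config name "en" is an assumption; check the dataset card for the
    variants that are actually available.
    """
    # Imported lazily so first_texts() stays usable without the library.
    from datasets import load_dataset

    return load_dataset("allenai/c4", config, split=split, streaming=True)


# Example use (requires network access and the `datasets` package):
#     for snippet in first_texts(stream_c4(), n=3):
#         print(snippet)
```

Streaming yields records one at a time, so `islice` stops after `n` items instead of iterating over the full corpus.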