CC Net
Text Corpora, Multi-Lingual
CC Net is a multilingual text corpus from Wenzek et al. (2019), distributed in JSON format. The listing gives no exact record count and describes it only as "A LOT!".
About CC Net
A version of the Common Crawl corpus that has been cleaned and deduplicated. The pipeline preserves the structure of documents and filters them based on their distance to Wikipedia, that is, on how closely their text resembles Wikipedia content.
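The two steps named above, deduplication and distance-based filtering, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the paragraph-level hashing scheme, and the numeric distance score are assumptions made for illustration, and the description above does not say how the distance to Wikipedia is actually computed.

```python
import hashlib


def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    return " ".join(paragraph.lower().split())


def dedup_paragraphs(documents):
    """Drop paragraphs already seen earlier, keeping the order of what remains.

    Paragraphs are assumed to be separated by newlines (an assumption of this sketch).
    """
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            key = hashlib.sha1(normalize(para).encode("utf-8")).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:  # documents left with no paragraphs are dropped entirely
            result.append("\n".join(kept))
    return result


def filter_by_distance(scored_docs, max_distance):
    """Keep texts whose hypothetical distance-to-Wikipedia score is at most max_distance.

    scored_docs is a list of (text, distance) pairs; how the score is produced
    is left open here.
    """
    return [text for text, distance in scored_docs if distance <= max_distance]
```

Deduplicating per paragraph rather than per document removes repeated fragments while leaving the rest of each document in its original order, which is one way to keep document structure intact.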
Details
- Task: Text Corpora
- Language: Multi-Lingual
- Format: JSON
- Rows / instances: A LOT!
- Creator: Wenzek et al.
- Year: 2019