Symato/cc

General NLP | VI

Symato/cc is a general NLP dataset in Vietnamese (VI), published by Symato in Parquet format.

About Symato/cc

What is Symato CC? A project to download all WARC data from Common Crawl and extract the Vietnamese content in Markdown and plaintext formats. About 1% of Common Crawl is Vietnamese, so extracting all of it should yield a large corpus (~10TB of plaintext). Main contributors ...
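The filtering step described above can be illustrated with a small sketch. This is not Symato's actual pipeline, which is not described here; it is a simple heuristic that assumes Vietnamese text can be recognised by the share of letters carrying marks that are rare in other Latin-script languages (horn, hook above, dot below, and the letter đ). The function names and the 0.05 threshold are hypothetical choices for illustration.

```python
import unicodedata

# Combining marks that are distinctive of Vietnamese orthography
# (assumption: horn, hook above, dot below; common accents like acute
# or grave are deliberately excluded because French, Spanish, etc. use them).
DISTINCTIVE_MARKS = {"\u031B", "\u0309", "\u0323"}


def vietnamese_score(text: str) -> float:
    """Return the fraction of letters that look distinctively Vietnamese."""
    total = 0
    distinctive = 0
    # NFC first so that base letter + combining marks form one character.
    for ch in unicodedata.normalize("NFC", text):
        if not ch.isalpha():
            continue
        total += 1
        lower = ch.lower()
        # NFD splits a letter into its base character and combining marks.
        decomposed = unicodedata.normalize("NFD", lower)
        if lower == "đ" or any(m in DISTINCTIVE_MARKS for m in decomposed):
            distinctive += 1
    return distinctive / total if total else 0.0


def is_probably_vietnamese(text: str, threshold: float = 0.05) -> bool:
    """Hypothetical filter: keep a document if its score reaches the threshold."""
    return vietnamese_score(text) >= threshold


if __name__ == "__main__":
    samples = [
        "Xin chào, tôi là người Việt Nam và tôi đang học tiếng Việt.",
        "The quick brown fox jumps over the lazy dog.",
        "Voilà une phrase écrite en français.",
    ]
    for s in samples:
        print(is_probably_vietnamese(s), round(vietnamese_score(s), 3), s)
```

A heuristic like this is cheap enough to run over very large crawls, but it will misjudge short or unaccented text; a trained language identifier would be the more robust choice for a real pipeline.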

Details

Task: General NLP
Language: VI
Format: Parquet
Rows / instances: N/A
Creator: Symato
Year: 2023
