m-a-p/PIN-200M
General NLP · EN, ZH
m-a-p/PIN-200M is a General NLP dataset in English and Chinese (EN, ZH), published by m-a-p in Parquet format.
About m-a-p/PIN-200M
PIN-200M
A mini version of "PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents"
Paper: https://arxiv.org/abs/2406.13923
This dataset contains around 200M samples in PIN format and occupies around 312 TB of storage....
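As a rough planning aid, the two approximate figures quoted above can be turned into an average per-sample footprint. The sketch below is illustrative only: the helper names are hypothetical, "TB" is assumed to mean decimal terabytes (10^12 bytes), and real samples will vary widely around the mean.

```python
# Figures quoted in the dataset description; both are approximate.
NUM_SAMPLES = 200_000_000      # "around 200M samples"
TOTAL_BYTES = 312 * 10**12     # "around 312 TB", ASSUMING decimal terabytes


def average_sample_bytes(total_bytes: int, num_samples: int) -> float:
    """Return the mean storage footprint of one sample, in bytes."""
    return total_bytes / num_samples


def samples_within_budget(budget_bytes: int, total_bytes: int, num_samples: int) -> int:
    """Estimate how many average-sized samples fit in a storage budget.

    Uses integer arithmetic (budget * samples // total) to avoid
    floating-point rounding in the estimate.
    """
    return budget_bytes * num_samples // total_bytes


if __name__ == "__main__":
    avg_mb = average_sample_bytes(TOTAL_BYTES, NUM_SAMPLES) / 10**6
    in_one_tb = samples_within_budget(10**12, TOTAL_BYTES, NUM_SAMPLES)
    print(f"average sample size: about {avg_mb:.2f} MB")
    print(f"average-sized samples per 1 TB: about {in_one_tb:,}")
```

Under these assumptions the mean works out to roughly 1.56 MB per sample, which suggests working with a subset or reading the data incrementally rather than downloading everything at once.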
Details
- Task: General NLP
- Language: EN, ZH
- Format: Parquet
- Rows / instances: N/A
- Creator: m-a-p
- Year: 2026