allenai/olmOCR-mix-1025
General NLP, English, odc-by
Created by allenai in 2025, allenai/olmOCR-mix-1025 is a General NLP dataset in English, distributed in Parquet format. It has 1.9K downloads and 34 likes, is released under the odc-by license, and falls in the 100K<n<1M size category.
About allenai/olmOCR-mix-1025
olmOCR-mix-1025
olmOCR-mix-1025 is a dataset of ~270,000 PDF pages that have been OCRed into plain text in natural reading order using gpt-4.1 and a special prompting strategy that preserves any born-digital content from each page.
This data...
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Size: 100K<n<1M
- Creator: allenai
- Year: 2025
- License: odc-by
- Downloads: 1915
- Likes: 34
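The size label `100K<n<1M` is a bucket, not an exact count. As a minimal sketch, the snippet below parses such a label into numeric bounds and checks that the ~270,000 pages mentioned in the description fall inside it. The helper names (`parse_size_category`, `in_size_category`) are hypothetical, and two things are assumptions rather than facts from this page: that the bounds are strict, and that the bucket counts pages (the row count is listed as N/A).

```python
import re

# Multipliers for the suffixes used in size labels such as "100K<n<1M".
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000, "T": 1_000_000_000_000}


def parse_size_category(label: str) -> tuple[int, int]:
    """Parse a label like '100K<n<1M' into (low, high) integer bounds."""
    match = re.fullmatch(r"(\d+)([KMBT]?)<n<(\d+)([KMBT]?)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized size label: {label!r}")
    low_num, low_suf, high_num, high_suf = match.groups()
    return int(low_num) * _SUFFIX[low_suf], int(high_num) * _SUFFIX[high_suf]


def in_size_category(count: int, label: str) -> bool:
    """Return True if count lies strictly between the label's bounds (assumed strict)."""
    low, high = parse_size_category(label)
    return low < count < high


if __name__ == "__main__":
    # ~270,000 pages is the approximate figure from the description above.
    print(parse_size_category("100K<n<1M"))
    print(in_size_category(270_000, "100K<n<1M"))
```

Only the two-sided `low<n<high` form is handled here; open-ended labels would need an extra case.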