allenai/olmOCR-mix-0225

General NLP, English, odc-by

allenai/olmOCR-mix-0225 is an English-language General NLP dataset distributed in Parquet format under the odc-by license. It falls in the 100K<n<1M size category and has been downloaded 686 times.

About allenai/olmOCR-mix-0225

olmOCR-mix-0225 is a dataset of ~250,000 PDF pages that have been OCRed into plain text in a natural reading order using gpt-4o-2024-08-06 and a special prompting strategy that preserves any born-digital content from each page...

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Size: 100K<n<1M
Creator: allenai
Year: 2025
License: odc-by
Downloads: 686
Likes: 171