allenai/olmOCR-bench

General NLP · EN · odc-by

allenai/olmOCR-bench is an English-language General NLP dataset published by allenai in Parquet format. It is distributed under the odc-by license, falls in the 1K<n<10K size category, and has been downloaded about 7.3K times.

About allenai/olmOCR-bench

olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF docum...

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: N/A
Size: 1K<n<10K
Creator: allenai
Year: 2025
License: odc-by
Downloads: 7,282
Likes: 250