allenai/olmOCR-bench
General NLP, EN, odc-by
allenai/olmOCR-bench is an English (EN) General NLP dataset published by allenai in Parquet format. It is distributed under the odc-by license, falls in the 1K<n<10K size category, and has been downloaded about 7.3K times.
About allenai/olmOCR-bench
olmOCR-bench
olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have.
This benchmark evaluates the ability of OCR systems to accurately convert PDF docum...
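To illustrate what a unit test case that captures "properties of the output" might look like, here is a minimal sketch in Python. It assumes one simple kind of property, namely that a given phrase must appear in the converted text. The names `PresenceCase`, `passes`, and `pass_rate` are hypothetical and are not taken from the benchmark's actual schema or tooling.

```python
from dataclasses import dataclass


@dataclass
class PresenceCase:
    """Hypothetical test case: a phrase that must appear in the OCR output."""
    pdf: str   # name of the source PDF (illustrative field, not the real schema)
    text: str  # phrase expected somewhere in the converted text


def normalize(s: str) -> str:
    # Collapse whitespace runs and lowercase, so that line wrapping
    # or casing differences do not cause spurious failures.
    return " ".join(s.split()).lower()


def passes(case: PresenceCase, ocr_output: str) -> bool:
    # The case passes if the normalized phrase occurs in the normalized output.
    return normalize(case.text) in normalize(ocr_output)


def pass_rate(cases, outputs) -> float:
    # `outputs` maps a PDF name to its converted text.
    # A PDF with no output counts as a failure for all of its cases.
    results = [passes(c, outputs.get(c.pdf, "")) for c in cases]
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    cases = [
        PresenceCase("a.pdf", "Hello World"),
        PresenceCase("b.pdf", "missing"),
    ]
    outputs = {"a.pdf": "hello\n  world again", "b.pdf": "nothing"}
    print(pass_rate(cases, outputs))
```

Scoring each case as a simple pass or fail keeps the aggregate easy to interpret: the score is the fraction of checked properties that an OCR system's output satisfies.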
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Size: 1K<n<10K
- Creator: allenai
- Year: 2025
- License: odc-by
- Downloads: 7,282
- Likes: 250