The Semantic Scholar Open Research Corpus (S2ORC)

Text Corpora, Knowledge Base, English, Benchmark

The Semantic Scholar Open Research Corpus (S2ORC) is an English-language text corpus and knowledge base that is also used as a benchmark dataset. It comprises 136M paper nodes and 467M citation edges, distributed in JSON format.

This dataset is used as an LLM benchmark.

About The Semantic Scholar Open Research Corpus (S2ORC)

The dataset contains 136M+ paper nodes, including 12.7M+ full-text papers, connected by 467M+ citation edges.
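To make the node and edge framing concrete, here is a minimal Python sketch that builds a citation graph from JSON records and counts its nodes and edges. It assumes one JSON object per line and uses the field names `paper_id` and `outbound_citations`. Both the layout and the field names are assumptions for illustration, not something this page specifies, so check them against the files you actually download.

```python
import json
from collections import defaultdict


def build_citation_graph(lines):
    """Build {paper_id: set of cited paper_ids} from JSON records, one per line.

    The field names "paper_id" and "outbound_citations" are assumed for
    illustration; adjust them to match the real schema.
    """
    graph = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        paper_id = record["paper_id"]
        graph[paper_id]  # register the paper as a node even if it cites nothing
        for cited in record.get("outbound_citations", []):
            graph[paper_id].add(cited)  # citation edge: paper_id -> cited
            graph[cited]  # register the cited paper as a node too
    return graph


def count_nodes_and_edges(graph):
    """Return (number of paper nodes, number of citation edges)."""
    nodes = len(graph)
    edges = sum(len(targets) for targets in graph.values())
    return nodes, edges


# Tiny made-up sample, not real corpus data.
sample = [
    '{"paper_id": "p1", "outbound_citations": ["p2", "p3"]}',
    '{"paper_id": "p2", "outbound_citations": ["p3"]}',
    '{"paper_id": "p3", "outbound_citations": []}',
]
graph = build_citation_graph(sample)
print(count_nodes_and_edges(graph))  # (3, 3)
```

At the scale quoted above (136M+ nodes, 467M+ edges), an in-memory dictionary of sets is unlikely to be practical. Streaming the files and writing edges to disk or a graph store would be a more realistic design.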

Details

Task: Text Corpora, Knowledge Base
Language: English
Format: JSON
Rows / instances: 467M edges, 136M nodes
Creator: Lo et al.
Year: 2020
