OpenWebTextCorpus

Text Corpora, English

The OpenWebTextCorpus dataset is an English text corpus released by Gokaslan et al. in 2019, comprising 8,013,769 examples.

About OpenWebTextCorpus

The dataset contains the text of millions of web pages collected from URLs shared on Reddit, totalling 38 GB of text data.
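
For a rough sense of scale, the size and example count quoted for this dataset imply an average document of a few kilobytes. The following is a minimal sketch of that arithmetic, assuming the quoted size means 38 gigabytes in decimal units (10^9 bytes) and that each of the 8,013,769 examples is one web page's text; neither assumption is stated in the source, and the function name is hypothetical.

```python
# Back-of-the-envelope estimate of the average example size.
# Assumptions (not stated in the source): 1 GB = 10**9 bytes,
# and each example corresponds to the text of one web page.

TOTAL_BYTES = 38 * 10**9      # 38 GB of text data
NUM_EXAMPLES = 8_013_769      # rows / instances


def average_bytes_per_example(total_bytes: int, num_examples: int) -> float:
    """Return the mean size of one example in bytes."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return total_bytes / num_examples


avg = average_bytes_per_example(TOTAL_BYTES, NUM_EXAMPLES)
print(f"about {avg / 1000:.1f} kB per example")
```

Under these assumptions the estimate comes to roughly 4.7 kB per example. It is only a mean: individual pages can be far shorter or longer.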

Details

Task: Text Corpora
Language: English
Format: n/a
Rows / instances: 8,013,769
Creator: Gokaslan et al.
Year: 2019
