anisoleai/fineweb-tokenized

Text Generation · EN · odc-by

anisoleai/fineweb-tokenized is an English (EN) text generation dataset distributed in Parquet format. It is released under the odc-by license, falls in the n>1T size category, and has been downloaded 150.8K times.

About anisoleai/fineweb-tokenized

FineWeb Tokenized: > 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer.

What is it? This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset (currently in-progress, tokenization of the ~15 trill...
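
Because the listing puts this Parquet dataset in the n>1T size category, streaming a few rows is likely more practical than a full download. The following is a minimal sketch, not the creator's documented workflow. It assumes the Hugging Face `datasets` library is installed, that the repository is reachable on the Hub under the name shown here, and that a `train` split exists. The split name and the column layout are not stated in this listing, and `peek` is a hypothetical helper name.

```python
from itertools import islice

# Repository name exactly as given in this listing.
REPO_ID = "anisoleai/fineweb-tokenized"


def peek(n=3, split="train"):
    """Return the first n rows as dicts, streaming instead of downloading everything."""
    # Imported lazily so this module still loads when `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields rows on the fly; split="train" is an assumption.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek(2)` and inspecting the keys of each returned row is one way to discover the actual column names before writing any training code against them.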

Details

Task: Text Generation
Language: EN
Format: Parquet
Rows / instances: N/A
Size: n>1T
Creator: anisoleai
Year: 2026
License: odc-by
Downloads: 150849
Likes: 2
