bigcode/starcoderdata

Text Generation · CODE · Benchmark

The bigcode/starcoderdata dataset is a code text-generation resource published by bigcode in 2023. It has 22K downloads and 523 likes. Its license is listed as "other", and it falls in the 100M<n<1B size category.

This dataset is used as an LLM benchmark.

About bigcode/starcoderdata

StarCoder Training Dataset

Dataset description

This is the dataset used for training StarCoder and StarCoderBase. It contains 783GB of code in 86 programming languages, and includes 54GB GitHub Issues + 13GB Jupyter notebooks in scri...

Details

Task: Text Generation
Language: CODE
Format: Parquet
Rows / instances: N/A
Size: 100M<n<1B
Creator: bigcode
Year: 2023
License: other
Downloads: 22045
Likes: 523
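
Given the repository name and the Parquet format listed above, one plausible way to read a slice of the data is through the Hugging Face `datasets` library. The sketch below is illustrative only and makes several assumptions that this page does not confirm: that the data is organized into per-language subdirectories (the name "python" is a guess), that a "train" split exists, and that access to the repository has already been granted. The helper names `build_load_kwargs` and `load_subset` are hypothetical.

```python
from typing import Any, Dict

DATASET_ID = "bigcode/starcoderdata"  # repository name taken from this page


def build_load_kwargs(language_dir: str, streaming: bool = True) -> Dict[str, Any]:
    """Assemble keyword arguments for datasets.load_dataset.

    `language_dir` is an ASSUMED per-language subdirectory name (e.g. "python");
    this page does not list the directory layout.
    """
    if not language_dir:
        raise ValueError("language_dir must be a non-empty string")
    return {
        "path": DATASET_ID,
        "data_dir": language_dir,
        "split": "train",        # assumption: a single "train" split exists
        "streaming": streaming,  # stream records instead of downloading hundreds of GB
    }


def load_subset(language_dir: str, streaming: bool = True):
    """Load one language subset; needs the `datasets` package and network access."""
    # Imported lazily so build_load_kwargs can be used without the dependency.
    from datasets import load_dataset

    return load_dataset(**build_load_kwargs(language_dir, streaming))
```

With access in place, `next(iter(load_subset("python")))` would return the first record of the streamed subset. The field names inside each record are not listed on this page, so inspect a record before relying on any particular key.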
