Text Generation Datasets
There are 137 text generation datasets in our directory, 4 of which are benchmarks. Each entry links to its source, paper, and download; browse the full list below or filter by language.
Text generation is the task of producing new, coherent text from a prompt, the core capability behind chatbots and writing assistants.
Updated June 2026
- OpenSQZ/AutoMathText-V2 | Text Generation, Question Answering | EN, ZH
- 64bits/lima_vicuna_format | Text Generation | EN
- clzoro/GLM-5.1-1000000x | Text Generation, Question Answering | EN, ZH
- codefuse-ai/CodeExercise-Python-27k | Text Generation | EN
- codeparrot/apps | Text Generation | CODE
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M | Text Generation | ZH, EN
- cais/wmdp | Text Generation | EN
- KingNish/reasoning-base-20k | Text Generation | EN
- OpenMed/Medical-Reasoning-SFT-Mega | Text Generation, Question Answering | EN
- Social Bias Inference Corpus (SBIC) | Classification, Text Generation | English
- JailbreakV-28K/JailBreakV-28k | Text Generation, Question Answering | English
- codeparrot/codecomplex | Text Generation | CODE
- codeparrot/github-code | Text Generation | CODE
- commoncrawl/host-index-testing-v2 | Text Generation | English
- nvidia/ToolScale | Text Generation | EN
- Locutusque/UltraTextbooks | Text Generation | EN, CODE
- kaist-ai/CoT-Collection | Text Generation, Text Classification | EN
- E2E | Text Generation | English
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k | Text Generation, Question Answering | ZH
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT | Text Generation, Question Answering | ZH
- coral-nlp/german-commons | Text Generation | DE
- silk-road/Wizard-LM-Chinese-instruct-evol | Text Generation, Question Answering | ZH, EN
- DAMO-NLP-SG/multimodal_textbook | Text Generation, Summarization | EN
- Nanbeige/ToolMind | Text Generation | EN
- LDJnr/Puffin | Question Answering, Text Generation | EN
- danish-foundation-models/danish-dynaword | Text Generation | DA
- anisoleai/fineweb-tokenized | Text Generation | EN
- data-is-better-together/10k_prompts_ranked | Text Classification, Text Generation, Reinforcement Learning | EN
- DataMuncher-Labs/UltiMath | Text Generation | EN
- legacy-datasets/mc4 | Text Generation, Fill Mask | AF, AM, AR
- lvwerra/stack-exchange-paired | Text Generation, Question Answering | EN
- arcinstitute/opengenome2 | Text Generation | English | Benchmark
- davanstrien/haiku_dpo | Text Generation, Reinforcement Learning | English
- Mxode/Chinese-Instruct | Text Generation, Question Answering | ZH
- tencent/CL-bench | Text Generation | EN
- shibing624/alpaca-zh | Text Generation | ZH
- declare-lab/HarmfulQA | Text Generation, Text Classification | EN
- allenai/dolma | Text Generation | EN
- uonlp/CulturaX | Text Generation, Fill Mask | AF, ALS, AM
- BAAI/Infinity-Instruct | Text Generation | EN, ZH
- lazarus19/Vibe-Coding-Instruct | Text Generation | EN
- tiiuae/falcon-refinedweb | Text Generation | EN
- nvidia/Nemotron-Personas-Korea | Text Generation | KO
- defunct-datasets/the_pile_books3 | Text Generation, Fill Mask | EN
- nvidia/OpenMathReasoning | Question Answering, Text Generation | EN
- shibing624/medical | Text Generation | ZH
- angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k | Text Generation, Question Answering | EN
- argilla/FinePersonas-v0.1 | Text Generation | EN
- WNT3D/Ultimate-Offensive-Red-Team | Text Generation, Question Answering, Text Classification | EN
- JeanKaddour/minipile | Text Generation, Fill Mask | EN
- Pageshift-Entertainment/LongPage | Text Generation | EN
- math-ai/StackMathQA | Text Generation, Question Answering | EN
- Open-Orca/OpenOrca | Text Classification, Token Classification, Table Question Answering, Question Answering, Zero Shot Classification, Summarization, Feature Extraction, Text Generation | EN
- togethercomputer/RedPajama-Data-1T | Text Generation | EN
- IlyaGusev/gpt_roleplay_realm | Text Generation | RU, EN
- proj-persona/PersonaHub | Text Generation, Text Classification, Token Classification, Fill Mask, Table Question Answering | EN, ZH
- euirim/goodwiki | Text Generation, Summarization | EN
- bigcode/the-stack | Text Generation | CODE
- yahma/alpaca-cleaned | Text Generation | EN
- bigcode/the-stack-v2 | Text Generation | CODE
- Salesforce/xlam-function-calling-60k | Question Answering, Text Generation, Reinforcement Learning | EN
- JosephusCheung/GuanacoDataset | Text Generation, Question Answering | ZH, EN, JA
- HuggingFaceH4/ultrafeedback_binarized | Text Generation | EN
- actava/chi-bench | Text Generation | EN
- allenai/WildChat-1M | Text Generation, Question Answering | English
- bigcode/the-stack-dedup | Text Generation | CODE
- iamtarun/python_code_instructions_18k_alpaca | Question Answering, Text Generation | English
- zwhe99/DeepMath-103K | Text Generation | EN
- openbmb/UltraData-Math | Text Generation | EN, ZH
- LooksJuicy/ruozhiba | Text Generation | ZH
- nvidia/Nemotron-Personas-USA | Text Generation | EN
- vicgalle/alpaca-gpt4 | Text Generation, Question Answering | EN
- stanfordnlp/SHP | Text Generation, Question Answering | EN
- HuggingFaceFW/finetranslations | Text Generation, Translation | ABK, ABQ, ABS
- agentlans/high-quality-english-sentences | Text Classification, Text Generation, Feature Extraction, Sentence Similarity | EN
- HuggingFaceFW/finewiki | Text Generation | English
- llamafactory/tiny-supervised-dataset | Text Generation, Question Answering | EN, ZH
- BAAI/TACO | Text Generation | CODE
- mlabonne/orpo-dpo-mix-40k | Text Generation | EN
- peteromallet/dataclaw-peteromallet | Text Generation | EN
- Open-Orca/SlimOrca | Text Classification, Token Classification, Table Question Answering, Question Answering, Zero Shot Classification, Summarization, Feature Extraction, Text Generation | EN
- opencsg/chinese-fineweb-edu | Text Generation | ZH
- HuggingFaceH4/CodeAlpaca_20K | Text Generation | English
- shareAI/ShareGPT-Chinese-English-90k | Question Answering, Text Generation | EN, ZH
- silk-road/alpaca-data-gpt4-chinese | Text Generation | ZH, EN
- ccdv/pubmed-summarization | Summarization, Text Generation | EN
- sunzeyeah/chinese_chatgpt_corpus | Text Generation, Question Answering, Reinforcement Learning | ZH
- ShadenA/MathNet | Question Answering, Text Generation, Image To Text | EN, PT, ES
- nvidia/Nemotron-Personas-France | Text Generation | FR
- CharlieDreemur/OpenManus-RL | Text Generation | EN
- wikimedia/wikisource | Text Generation, Fill Mask | AR, AS, AZ
- osunlp/TravelPlanner | Text Generation | EN
- OpenDataArena/MMFineReason-SFT-123K-Qwen3-VL-235B-Thinking | Visual Question Answering, Question Answering, Text Generation | EN
- TeichAI/DeepSeek-v4-Pro-Agent | Text Generation | EN
- sujet-ai/Sujet-Finance-Instruct-177k | Text Generation, Question Answering | EN
- KBlueLeaf/danbooru2023-metadata-database | Image Classification, Text To Image, Image To Text, Image To Image, Text Retrieval, Text Generation, Text Classification | EN, JA
- nvidia/Nemotron-Pretraining-Specialized-v1 | Text Generation | English
- opencsg/Fineweb-Edu-Chinese-V2.2 | Text Generation, Question Answering | ZH
- opencsg/chinese-cosmopedia | Text Generation | ZH
- Squish42/bluemoon-fandom-1-1-rp-cleaned | Text Generation | EN
- LDJnr/Pure-Dove | Question Answering, Text Generation | EN
- BAAI/Infinity-Preference | Text Generation, Question Answering | EN, ZH
- galaxyMindAiLabs/stem-reasoning-complex | Text Generation, Question Answering | EN, ZH
- Limour/b-corpus | Text Generation | ZH
- OpenMed/Medical-Reasoning-SFT-Trinity-Mini | Text Generation, Question Answering | EN
- zai-org/LongCite-45k | Text Generation, Question Answering | EN, ZH
- selfrag/selfrag_train_data | Text Generation | EN
- ResplendentAI/NSFW_RP_Format_DPO | Text Generation | EN
- Vikhrmodels/GrandMaster-PRO-MAX | Text Generation | RU, EN
- xingyaoww/code-act | Text Generation | EN
- MiniMaxAI/SynLogic | Text Generation | EN, ZH
- nlpai-lab/kullm-v2 | Text Generation | KO
- codefuse-ai/Evol-instruction-66k | Text Generation | EN
- defunct-datasets/amazon_us_reviews | Summarization, Text Generation, Fill Mask, Text Classification | EN
- Clinton/Text-to-sql-v1 | Text Generation | EN
- opencsg/chinese-fineweb-edu-v2 | Text Generation | ZH
- newfacade/LeetCodeDataset | Text Generation | EN
- Jackrong/Claude-opus-4.6-TraceInversion-9000x | Text Generation | EN, ZH, KO
- nyuuzyou/google-code-archive | Text Generation | CODE, EN | Benchmark
- ClusterlabAi/101_billion_arabic_words_dataset | Text Generation | AR
- Qwen/WebWorldData | Text Generation | EN, ZH
- llm-wizard/alpaca-gpt4-data-zh | Text Generation | ZH
- manu/project_gutenberg | Text Generation | FR, EN, ZH
- Modotte/MathX-5M | Question Answering, Text Generation | English
- omarkamali/wikipedia-monthly | Text Generation | AB, ACE, ADY
- nvidia/Nemotron-Pretraining-Code-v1 | Text Generation | English
- ServiceNow-AI/eva | Text Generation, Other | EN
- shibing624/roleplay-zh-sharegpt-gpt4-data | Text Generation | ZH
- tasksource/bigbench | Multiple Choice, Question Answering, Text Classification, Text Generation, Zero Shot Classification | EN | Benchmark
- IlyaGusev/ru_turbo_alpaca | Text Generation | RU
- silk-road/ChatHaruhi-54K-Role-Playing-Dialogue | Text Generation | EN, ZH
- Jackrong/Claude-opus-4.7-TraceInversion-5000x | Text Generation | EN, ZH, KO
- openbmb/RLHF-V-Dataset | Text Generation, Visual Question Answering | EN
- PleIAs/French-PD-Newspapers | Text Generation | FR
- maya-research/IndicVault | Question Answering, Text Generation | HI, TE, EN | Benchmark
- nvidia/Nemotron-Pretraining-SFT-v1 | Text Generation | English
- facebook/kilt_tasks | Fill Mask, Question Answering, Text Classification, Text Generation, Text Retrieval | EN