Text Classification Datasets
There are 33 text classification datasets in our directory, 1 of which is a benchmark. Each entry links to its source, paper, and download. Browse the full list below or filter by language.
Text classification is the task of assigning predefined categories or labels to a piece of text, for example labelling it by topic or intent.
Updated June 2026
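To make the task definition concrete, here is a minimal sketch of assigning one of a fixed set of labels to a piece of text by counting keyword matches. The label names, keyword lists, and the `classify` function are hypothetical and chosen only for illustration; they are not taken from any dataset listed here, and a real system would normally train a model on one of these datasets instead.

```python
from collections import Counter

# Hypothetical labels and keywords, for illustration only.
LABEL_KEYWORDS = {
    "sports": {"match", "team", "goal", "league"},
    "finance": {"stock", "market", "bank", "shares"},
}


def classify(text: str, default: str = "other") -> str:
    """Return the label whose keywords occur most often in the text."""
    # Lowercase, split on whitespace, and strip basic trailing punctuation.
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    scores = Counter()
    for label, keywords in LABEL_KEYWORDS.items():
        scores[label] = sum(1 for t in tokens if t in keywords)
    label, score = max(scores.items(), key=lambda kv: kv[1])
    # Fall back to the default label when no keyword matched at all.
    return label if score > 0 else default


print(classify("The team scored a late goal in the match."))
print(classify("Bank shares fell as the stock market dipped."))
```

The predefined label set is what distinguishes classification from open-ended generation: every input maps to one of a known, finite list of categories.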
- aps/super_glue | Text Classification, Token Classification, Question Answering | EN
- Historical Portuguese Corpora (HPC) | Text Corpora, Text Classification | Portuguese
- TweetSentBR | Text Classification | Portuguese
- Mercadolibre Data Challenge 2019 | Text Classification | Portuguese, Spanish
- Constructive Comments Corpus (C3) | Text Classification | English
- CogComp/trec | Text Classification | EN
- stanfordnlp/imdb | Text Classification | EN
- Vietnamese Students’ Feedback Corpus (UIT-VSFC) | Text Classification, Sentiment Analysis | Vietnamese
- kaist-ai/CoT-Collection | Text Generation, Text Classification | EN
- community-datasets/yahoo_answers_topics | Text Classification | EN
- lmarena-ai/arena-human-preference-55k | Text Classification | EN
- data-is-better-together/10k_prompts_ranked | Text Classification, Text Generation, Reinforcement Learning | EN
- data-is-better-together/fineweb-c | Text Classification | LVS, KOR, KIN
- declare-lab/HarmfulQA | Text Generation, Text Classification | EN
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0 | Text Classification | EN
- WNT3D/Ultimate-Offensive-Red-Team | Text Generation, Question Answering, Text Classification | EN
- Open-Orca/OpenOrca | Text Classification, Token Classification, Table Question Answering, Question Answering, Zero Shot Classification, Summarization, Feature Extraction, Text Generation | EN
- proj-persona/PersonaHub | Text Generation, Text Classification, Token Classification, Fill Mask, Table Question Answering | EN, ZH
- dair-ai/emotion | Text Classification | EN
- AdaptLLM/finance-tasks | Text Classification, Question Answering, Zero Shot Classification | EN
- ade-benchmark-corpus/ade_corpus_v2 | Text Classification, Token Classification | EN
- agentlans/high-quality-english-sentences | Text Classification, Text Generation, Feature Extraction, Sentence Similarity | EN
- Open-Orca/SlimOrca | Text Classification, Token Classification, Table Question Answering, Question Answering, Zero Shot Classification, Summarization, Feature Extraction, Text Generation | EN
- musicdsl/reamixed-project-files | Audio Classification, Text Classification | English
- KBlueLeaf/danbooru2023-metadata-database | Image Classification, Text To Image, Image To Text, Image To Image, Text Retrieval, Text Generation, Text Classification | EN, JA
- Brianferrell787/financial-news-multisource | Text Classification, Text Retrieval, Other | EN
- JanosAudran/financial-reports-sec | Fill Mask, Text Classification | EN
- toxigen/toxigen-data | Text Classification | English
- defunct-datasets/amazon_us_reviews | Summarization, Text Generation, Fill Mask, Text Classification | EN
- ronantakizawa/github-top-developers | Text Classification, Time Series Forecasting, Text Retrieval | EN
- papluca/language-identification | Text Classification | AR, BG, DE
- tasksource/bigbench | Multiple Choice, Question Answering, Text Classification, Text Generation, Zero Shot Classification | EN | Benchmark
- facebook/kilt_tasks | Fill Mask, Question Answering, Text Classification, Text Generation, Text Retrieval | EN