Video description Models
Our directory lists 15 AI and NLP models for the video description task. Browse the full list below, or explore models by provider.
Updated June 2026
- Gemini 2.5 Pro (Jun 2025) | Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR) | Google DeepMind
- Gemini 2.5 Pro (May 2025) | Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR) | Google DeepMind
- Gemini 2.5 Pro (Mar 2025) | Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR) | Google DeepMind
- ERNIE-4.5-VL-424B-A47B (文心大模型4.5) | Language modeling/generation, Visual question answering, Video description, Speech recognition (ASR), Quantitative reasoning, Code generation, Translation, Question answering, Character recognition (OCR) | Baidu
- Gemini 2.0 Pro | Code generation, Language modeling/generation, Question answering, Visual question answering, Speech recognition (ASR), Video description | Google DeepMind
- Amazon Nova Pro | Language modeling/generation, Retrieval-augmented generation, Visual question answering, Image captioning, Video description, Character recognition (OCR), Code generation, Translation | Amazon
- Qwen3-Omni-30B-A3B | Language modeling/generation, Question answering, Visual question answering, Image captioning, Video description, Speech recognition (ASR), Speech synthesis, Speech-to-text, Text-to-speech (TTS) | Alibaba
- Gemini 2.5 Deep Think | Language modeling/generation, Mathematical reasoning, Code generation, Visual question answering, Question answering, Visual puzzles, Video description, Speech recognition (ASR), Speech-to-text | Google, Google DeepMind
- Seed1.5-VL | Visual question answering, Video description, Language modeling/generation, Question answering, Character recognition (OCR) | ByteDance
- Apollo 7B | Video description | Meta AI, Stanford University
- NVILA 15B | Visual question answering, Video description, Language modeling/generation, Question answering, Character recognition (OCR) | NVIDIA, Massachusetts Institute of Technology (MIT), University of California (UC) Berkeley, University of California San Diego, University of Washington, Tsinghua University
- Oryx 34B | Visual question answering, Video compression, Image captioning, Video description, Language modeling/generation | Tsinghua University, Tencent, Nanyang Technological University
- LLaVA-OV-72B | Image captioning, Visual question answering, Video description, Object recognition, Action recognition, Language modeling/generation | ByteDance, Nanyang Technological University, Chinese University of Hong Kong (CUHK), Hong Kong University of Science and Technology (HKUST)
- Reka Core | Chat, Language modeling/generation, Image captioning, Code generation, Code autocompletion, Question answering, Visual question answering, Video description, Speech recognition (ASR), Speech-to-text, Quantitative reasoning | Reka AI
- PaLI-X | Image captioning, Video description, Character recognition (OCR), Visual question answering | Google Research