Visual Question Answering Models
Visual question answering is the task of answering natural-language questions about the contents of an image. Our directory lists 64 AI and NLP models for this task. Browse the full list below, or explore models by provider.
Updated June 2026
- VILA1.5-13B: Chat, Visual question answering, Image captioning, Language modeling/generation, Question answering. Providers: NVIDIA, Massachusetts Institute of Technology (MIT)
- EXAONE 1.0: Translation, Language modeling/generation, Visual question answering. Provider: LG
- Wu Dao 2.0: Image captioning, Chat, Image generation, Text-to-image, Language modeling/generation, Question answering, Visual question answering. Provider: Beijing Academy of Artificial Intelligence / BAAI
- Claude Opus 4.5: Code generation, Language modeling/generation, Quantitative reasoning, Search, Visual question answering, Translation, Image captioning, Instruction interpretation, Mathematical reasoning, Visual puzzles, Code autocompletion, Chat, Character recognition (OCR), Language modeling, Language generation, Text autocompletion, Retrieval-augmented generation, System control. Provider: Anthropic
- Claude Sonnet 4.5: Language modeling/generation, Code generation, System control, Question answering, Quantitative reasoning, Mathematical reasoning, Visual question answering. Provider: Anthropic
- Claude Opus 4.1: Language modeling/generation, Question answering, System control, Code generation, Search, Quantitative reasoning, Mathematical reasoning, Visual question answering. Provider: Anthropic
- Grok 4: Language modeling/generation, Question answering, Search, Visual question answering, Character recognition (OCR), Image captioning, Quantitative reasoning. Provider: xAI
- Gemini 2.5 Pro (Jun 2025): Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR). Provider: Google DeepMind
- Claude Sonnet 4: Code generation, Language modeling/generation, Quantitative reasoning, Search, Visual question answering, Translation, Image captioning, Instruction interpretation, Mathematical reasoning, Visual puzzles, Code autocompletion, Chat, Character recognition (OCR), Language modeling, Language generation, Text autocompletion, Retrieval-augmented generation, System control. Provider: Anthropic
- Claude Opus 4: Code generation, Language modeling/generation, Quantitative reasoning, Search, Visual question answering, Translation, Image captioning, Instruction interpretation, Mathematical reasoning, Visual puzzles, Code autocompletion, Chat, Character recognition (OCR), Language modeling, Language generation, Text autocompletion, Retrieval-augmented generation, System control. Provider: Anthropic
- Gemini 2.5 Pro (May 2025): Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR). Provider: Google DeepMind
- Gemini 2.5 Pro (Mar 2025): Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR). Provider: Google DeepMind
- ERNIE-4.5-VL-424B-A47B (文心大模型4.5): Language modeling/generation, Visual question answering, Video description, Speech recognition (ASR), Quantitative reasoning, Code generation, Translation, Question answering, Character recognition (OCR). Provider: Baidu
- GPT-4.5: Language modeling/generation, Question answering, Quantitative reasoning, Translation, Visual question answering, Code generation, Instruction interpretation. Provider: OpenAI
- Claude 3.7 Sonnet: Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Translation, Instruction interpretation, Visual question answering. Provider: Anthropic
- Grok 3: Chat, Language modeling/generation, Question answering, Code generation, Visual question answering. Provider: xAI
- Gemini 2.0 Pro: Code generation, Language modeling/generation, Question answering, Visual question answering, Speech recognition (ASR), Video description. Provider: Google DeepMind
- Amazon Nova Pro: Language modeling/generation, Retrieval-augmented generation, Visual question answering, Image captioning, Video description, Character recognition (OCR), Code generation, Translation. Provider: Amazon
- Pixtral Large: Vision-language generation, Visual question answering, Mathematical reasoning, Character recognition (OCR), Language modeling/generation, Question answering. Provider: Mistral AI
- GPT-4o mini: Chat, Language modeling/generation, Code generation, Visual question answering. Provider: OpenAI
- GPT-4 Turbo (Apr 2024): Chat, Language modeling/generation, Image generation, Speech synthesis, Table tasks, Visual question answering, Image captioning. Provider: OpenAI
- Qwen-VL-Max: Chat, Image captioning, Face recognition, Visual question answering. Provider: Alibaba
- GPT-4 Turbo (Nov 2023): Chat, Language modeling/generation, Image generation, Speech synthesis, Table tasks, Visual question answering, Image captioning. Provider: OpenAI
- ChatGLM3-6B: Chat, Visual question answering, Code generation. Provider: Z.ai (Zhipu AI)
- Qwen3-Omni-30B-A3B: Language modeling/generation, Question answering, Visual question answering, Image captioning, Video description, Speech recognition (ASR), Speech synthesis, Speech-to-text, Text-to-speech (TTS). Provider: Alibaba
- Gemini 2.5 Deep Think: Language modeling/generation, Mathematical reasoning, Code generation, Visual question answering, Question answering, Visual puzzles, Video description, Speech recognition (ASR), Speech-to-text. Providers: Google, Google DeepMind
- Seed1.5-VL: Visual question answering, Video description, Language modeling/generation, Question answering, Character recognition (OCR). Provider: ByteDance
- Llama 4 Scout: Chat, Code generation, Visual question answering, Language modeling/generation, Question answering. Provider: Meta AI
- Llama 4 Maverick: Chat, Code generation, Visual question answering, Language modeling/generation, Question answering. Provider: Meta AI
- Llama 4 Behemoth (preview): Chat, Code generation, Visual question answering, Translation, Language modeling/generation, Quantitative reasoning, Question answering. Provider: Meta AI
- Kimi k1.5: Language modeling/generation, Code generation, Quantitative reasoning, Question answering, Visual question answering, Translation, Image captioning, Visual puzzles. Provider: Moonshot
- o3: Language modeling/generation, Question answering, Quantitative reasoning, Code generation, Visual question answering, Search, Instruction interpretation, Visual puzzles. Provider: OpenAI
- NVILA 15B: Visual question answering, Video description, Language modeling/generation, Question answering, Character recognition (OCR). Providers: NVIDIA, Massachusetts Institute of Technology (MIT), University of California (UC) Berkeley, University of California San Diego, University of Washington, Tsinghua University
- Llama 3.2 11B: Visual question answering, Image captioning, Object detection. Provider: Meta AI
- Oryx 34B: Visual question answering, Video compression, Image captioning, Video description, Language modeling/generation. Providers: Tsinghua University, Tencent, Nanyang Technological University
- Harrison.rad.1: Visual question answering, Medical diagnosis. Provider: Harrison.ai
- LLaVA-OV-72B: Image captioning, Visual question answering, Video description, Object recognition, Action recognition, Language modeling/generation. Providers: ByteDance, Nanyang Technological University, Chinese University of Hong Kong (CUHK), Hong Kong University of Science and Technology (HKUST)
- Grok-2: Chat, Language modeling/generation, Question answering, Code generation, Visual question answering. Provider: xAI
- SenseChat 5.5: Vision-language generation, Visual question answering, Language modeling/generation, Question answering, Chat, Quantitative reasoning. Provider: SenseTime
- Ernie 4.0 Turbo: Vision-language generation, Language modeling/generation, Question answering, Chat, Visual question answering. Provider: Baidu
- Cambrian-1-34B: Image captioning, Visual question answering, Character recognition (OCR). Provider: New York University (NYU)
- Reka Core: Chat, Language modeling/generation, Image captioning, Code generation, Code autocompletion, Question answering, Visual question answering, Video description, Speech recognition (ASR), Speech-to-text, Quantitative reasoning. Provider: Reka AI
- MM1-30B: Chat, Image captioning, Visual question answering. Provider: Apple
- Gemini 1.5 Pro: Language modeling, Visual question answering. Provider: Google DeepMind
- CogAgent: Instruction interpretation, Visual question answering. Providers: Tsinghua University, Z.ai (Zhipu AI)
- VILA-13B: Chat, Visual question answering, Image captioning, Language modeling/generation, Question answering. Providers: NVIDIA, Massachusetts Institute of Technology (MIT)
- Gemini 1.0 Ultra: Language modeling, Visual question answering, Chat, Translation. Provider: Google DeepMind
- Gemini 1.0 Pro: Language modeling, Visual question answering, Chat, Translation. Provider: Google DeepMind
- Volcano 13B: Language modeling/generation, Visual question answering. Providers: Korea University, Korea Advanced Institute of Science and Technology (KAIST), LG
- SPHINX (Llama 2 13B): Visual question answering, Image captioning. Providers: Shanghai AI Lab, Chinese University of Hong Kong (CUHK), ShanghaiTech University
- mPLUG-Owl2: Visual question answering, Image captioning, Language modeling/generation. Provider: Alibaba
- CogVLM-17B: Image captioning, Visual question answering, Chat. Providers: Tsinghua University, Z.ai (Zhipu AI), Beihang University
- LLaVA 1.5: Chat, Question answering, Visual question answering. Providers: University of Wisconsin Madison, Microsoft Research
- PaLI-3: Visual question answering, Character recognition (OCR), Image captioning. Providers: Google DeepMind, Google Research, Google Cloud
- GPT-4V: Language modeling, Visual question answering. Provider: OpenAI
- Qwen-VL: Image captioning, Chat, Question answering, Visual question answering. Provider: Alibaba
- GPT-4 (Jun 2023): Language modeling, Language modeling/generation, Question answering, Visual question answering. Provider: OpenAI
- PaLI-X: Image captioning, Video description, Character recognition (OCR), Visual question answering. Provider: Google Research
- InstructBLIP: Visual question answering, Chat. Providers: Salesforce Research, Hong Kong University of Science and Technology (HKUST), Nanyang Technological University
- LLaVA: Chat, Question answering, Visual question answering. Providers: University of Wisconsin Madison, Microsoft Research, Columbia University
- GPT-4 (Mar 2023): Language modeling, Language modeling/generation, Question answering, Visual question answering. Provider: OpenAI
- PaLM-E: Visual question answering, Robotic manipulation, Image captioning, Language generation. Providers: Google, TU Berlin
- BLIP-2 (Q-Former): Visual question answering, Image captioning. Provider: Salesforce Research
- AltCLIP_M9: Language modeling/generation, Chat, Visual question answering, Image generation. Provider: Beijing Academy of Artificial Intelligence / BAAI