Visual Storytelling Dataset (VIST)

Multi-Modal Learning | English

The Visual Storytelling Dataset (VIST) is an English-language dataset for multi-modal learning. It provides 81,743 labeled examples, distributed in JSON format.

About Visual Storytelling Dataset (VIST)

The dataset contains 81,743 unique photos in 20,211 sequences, aligned to both descriptive and story language. VIST was previously known as the Sequential Image Narrative Dataset (SIND).
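
As a rough illustration of the photo/sequence structure described above, the sketch below groups JSON photo records into ordered sequences and counts unique photos. The field names (`sequence_id`, `photo_id`, `order`, `text`) and the sample records are hypothetical, chosen only for this example; they are not the documented VIST schema, so check the actual annotation files before adapting it.

```python
import json
from collections import defaultdict

# Hypothetical sample data: these field names are illustrative assumptions,
# NOT the documented VIST schema.
SAMPLE_JSON = """
[
  {"sequence_id": "s1", "photo_id": "p1", "order": 0, "text": "We arrived at the park."},
  {"sequence_id": "s1", "photo_id": "p2", "order": 1, "text": "The kids ran to the swings."},
  {"sequence_id": "s2", "photo_id": "p3", "order": 0, "text": "The cake was ready."}
]
"""


def group_by_sequence(records):
    """Group photo records by sequence id and sort each group by position."""
    sequences = defaultdict(list)
    for record in records:
        sequences[record["sequence_id"]].append(record)
    for items in sequences.values():
        items.sort(key=lambda r: r["order"])
    return dict(sequences)


records = json.loads(SAMPLE_JSON)
sequences = group_by_sequence(records)

# A photo may appear in more than one sequence, so count unique ids separately.
unique_photos = {r["photo_id"] for r in records}

print(len(unique_photos), "unique photos in", len(sequences), "sequences")
```

Counting unique photo ids separately from sequences mirrors how the dataset reports its two totals.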

Details

Task: Multi-Modal Learning
Language: English
Format: JSON
Rows / instances: 81,743
Creator: Huang et al.
Year: 2016