2021 Dec 17;21(24):8448. doi: 10.3390/s21248448

Table 3.

Main available datasets for conversational agents—part A.

General-Purpose Datasets

| Dataset | Source | Description | Size | Used for |
| --- | --- | --- | --- | --- |
| DailyDialog [213] | hand written, manually labeled | daily interactions | 13,118 dialogs, 7.9 turns | general purpose |
| [216] | subtitles | interaction–response pairs | | purpose |
| Movie dialogue dataset [217] | movie metadata as knowledge triples | OMDb, MovieLens, and Reddit | 3.1 M simulated QA pairs | Movies QA and recommendation |
| Cornell Movie Dialogues Corpus [218] | short conversations from film scripts | movie metadata | 220 K conversations | understanding linguistic style |
| Ubuntu dialogue corpus [224] | Ubuntu chat stream | human–human chat | 930 K conversations | response generation |

Question-Answering Datasets

| Dataset | Source | Description | Size | Used for |
| --- | --- | --- | --- | --- |
| SQuAD Version 1.1 [227] | questions and answers on Wikipedia articles | 100 K questions on Wikipedia articles | 100 K Q&A | machine reading comprehension |
| SQuAD Version 2 [228] | questions and answers and additional questions | SQuAD 1.1 + 50 K questions with no answers | 100 K Q&A + 50 K questions with no answers | machine reading comprehension |
| CNN/Daily Mail comprehension [229] | queries from the CNN and Daily Mail websites | context–query–answer triples | 1 M stories + associated queries | machine reading training dataset |
| Natural Questions dataset [230] | Google search queries + Wikipedia answers by crowd workers | Google question + long answer + short answers | 307,372 training examples | training & evaluation of answering systems |
| TriviaQA [231] | crowdworkers questions | question-answer-evidence triples | 95 K question-answer pairs + 6 evidence documents per question | reading comprehension |