Table 1. A summary of the existing COVID-19 datasets collected for misinformation detection or situational information classification.
The summary covers each dataset's size, language, sample sources (main input), annotation method, and class granularity. All listed datasets are designed for misinformation detection except those released for situational information classification.
Dataset | Size | Main input | Language | Classification task | Annotation method |
---|---|---|---|---|---|
Patwa et al. (2021) | 10.7K | Twitter, Facebook, fact-checking websites | English | Binary (fake, real) | Manually + fact-checked claims |
COVIDLies (Hossain et al., 2020) | 6.7K | | English | Binary (fake, real) | Manually |
Helmstetter & Paulheim (2021) | 400K | | English | Binary (fake, non-fake) | Weak supervision |
ReCOVery (Zhou et al., 2020) | 146K | News articles, Twitter | English | Binary (fake, real) | Distant supervision |
CoAID (Cui & Lee, 2020) | 301.1K | Social posts, user engagements, news articles | English | Binary (fake, real) | Distant supervision |
MM-COVID (Li et al., 2020b) | 11.1K | | Multi-lingual | Binary (fake, real) | Manually |
ArCOV19-Rumors (Haouari et al., 2020b) | 9.4K | | Arabic | Three classes (false, true, other) | Manually |
Alqurashi et al. (2021) | 8.7K | | Arabic | Binary (misleading, not-misleading) | Manually |
COVID-19-FAKES (Elhadad, Li & Gebali, 2020) | 0.4K | | Arabic + English | Binary (fake, real) | Automatically (13 ML algorithms) |
Mahlous & Al-Laith (2021) (a) | 2.5K | | Arabic | Binary (fake, genuine) | Manually |
Mahlous & Al-Laith (2021) (b) | 14.9K | | Arabic | Binary (fake, genuine) | Automatically |
CLEF-2021 CheckThat! Lab (task 3A) (Shahi, Struß & Mandl, 2021) | 1.2K | News articles | Multi-lingual | Four classes (false, true, partially false, other) | Manually |
FakeCovid (Shahi & Nandini, 2020) | 5.1K | Several social media platforms | Multi-lingual | Three classes (false, true, partially false) | Manually |
Alsudias & Rayson (2020) | 2K | | Arabic | Three classes (false, true, unrelated) | Manually |
CMU-MisCOV19 (Memon & Carley, 2020) | 0.5K | | English | Multi-class (17 classes) | Manually |
Li et al. (2020a) | 3K | | English | Multi-class (eight situational classes) | Manually |
ArCorona (Mubarak & Hassan, 2020) | 8K | | Arabic | Multi-class (17 classes, mixing situational and misinformation classes) | Manually |
AraCOVID19-MFH (Ameur & Aliane, 2021) | 10.8K | | Arabic | Ten independent tasks (each with 2–4 classes) | Manually |
Alam et al. (2020) | 722 | | Arabic | Ten independent tasks (each with 2–3 classes) | Manually |
Our dataset (ArCOV19-MCM) | 6.7K | | Arabic | Multi-class (19 misinformation classes) | Manually |
Our dataset (ArCOV19-MLM) | 6.7K | | Arabic | Multi-label (19 misinformation classes) | Manually |
Our dataset (ArCOV19-Sit) | 4.2K | | Arabic | Multi-class (six situational classes) | Manually |