2024 May 27;10:e2037. doi: 10.7717/peerj-cs.2037

Table 1. Overview of prevalent deepfake detection datasets, summarizing their key attributes.

| Dataset name | Origin | Size | Resolution | Diversity | Variety of attributes | Realism | Availability |
|---|---|---|---|---|---|---|---|
| CelebA | Academic | 200k images | Various | Various | 40 attribute annotations per image | High | Public |
| DeeperForensics-1.0 | Academic | 60,000 videos | 1,920 × 1,080 pixels | 100 actors from 26 countries of various skin tones and ages | Eight naturally expressed feelings | High | Public |
| Deepfake-TIMIT | Academic | 640 videos | Low Quality: 64 × 64, High Quality: 128 × 128 | Various | Face Swap | Low/High | Public |
| Celeb-DF | Academic | 5,639 videos | 256 × 256 pixels | 59 celebrities of various ethnicities and ages | Face Swap | High | Public |
| FaceForensics (FF) | Academic | 1,004 videos | Various | Various | Face-to-Face Reenactment | High | Public |
| UADFV | University of Albany | 49 videos | 294 × 500 pixels | Various | Physiological cues | Medium | Public |
| FaceForensics++ (FF++) | Academic | 1,000 videos | VGA, HD, FullHD | Male 60%, Female 40% | Face | High | Public |
| Deepfake detection challenge (DFDC) | Facebook | 5,000 videos | Various | 74% female, 26% male, and diverse ethnicities | Face swap | High | Public |
| HOHA-based | Academic | 600 videos | 360 × 240 pixels | Various | Human actions | High | Public |
| FakeAVCeleb | Real YouTube videos of celebrities | Audio and Video | Various | Covers four racial backgrounds | Deepfake videos and corresponding synthesized cloned audios | High, with perfect lip-syncing | Public |
| Attack Agnostic | Combines two audio DeepFakes and one anti-spoofing dataset | Audio clips | Audio dataset, no visual resolution | Varied types of spoofing attacks | Audio DeepFakes designed to challenge detection algorithms | Designed to improve detection generalization | Public |
| ADD 2022 | Audio Deep Synthesis Detection challenge | Covers a range of real-life scenarios | Audio dataset, no visual resolution | Real-life and challenging scenarios | Tracks for low-quality fake audio detection, partially fake audio | Aims to push boundaries of detection research | Public |
| TweepFake | Twitter, based on advancements in language modeling like GPT-2 | 25,572 tweets | Various | 23 bots imitating 17 human accounts | Tweets generated using various techniques like Markov Chains, LSTM | Real tweets posted on Twitter | Public |
| ADBT | Twitter, focused on detecting bot-generated messages | Extended from existing work | Various | Generated tweets designed to mimic human writing | Tweets generated to test detection models against language models | Generated content difficult to distinguish from human-written | Public |