. 2022 Mar 10;81(14):19967–19998. doi: 10.1007/s11042-022-12802-6

Table 1.

Summary of the benchmark datasets for fashion tasks

Task Dataset Number of photos Description Year of publication
Virtual Try-On LookBook [8] 84,748 Composed of 9,732 top product images and 75,016 fashion model images 2016
DeepFashion [43] 78,979 Selected from the In-shop Clothes Benchmark and associated with several sentences as captions and a segmentation map. 2016
VITON [16] 32,506 Collected from around 19,000 frontal-view woman and top clothing image pairs, yielding 16,253 pairs after cleaning 2018
FashionTryOn [106] 28,714 Comprises 28,714 clothing-person-person triplets, each consisting of a clothing item image and two model images in different poses. 2019
FashionOn [22] 22,566 Pairs of person images wearing the same clothes in different poses. 2019
Fashion Parsing Fashionista [93] 158,235 Outfit information in the form of tags, comments, and links 2012
Paper Doll [94] 339,797 Annotated with metadata tags denoting characteristics, e.g., color, style, occasion, clothing type, brand 2013
Chictopia10k [36] 10,000 Contains real-world annotated images in the wild with arbitrary postures, views and backgrounds 2015
LIP [13] 50,462

■ Focuses on semantic understanding of people and contains images with elaborate pixel-wise annotations with 19 semantic human part labels and 2D human poses with 16 key points.

■ Images collected from real-world scenarios contain humans appearing with challenging poses and views, occlusions, and various appearances.

2017
MHP v1.0 [105] 4,980 ■ Instance-aware setting with fine-grained pixel-level annotations covering 7 body parts and 11 clothes categories. 2017
MHP v2.0 [85] 25,403

■ Annotated images with 58 fine-grained semantic categories: 11 body parts and 47 clothes categories

■ Images captured in real-world scenes with various viewpoints, poses, occlusions, interactions, and backgrounds

2018
Crowd Instance-level Human Parsing (CIHP) [103] 38,280

■ Multi-person images

■ Pixel-wise annotations at the instance level

2018
ModaNet [18] 55,176 Annotated with pixel-level labels, bounding boxes, and polygons 2018
DeepFashion2 [109] 491,000

■ Diverse images of 13 popular clothing categories from both commercial shopping stores and consumers.

■ Labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks, and per-pixel mask.

2019
Fashionpedia [24] 48,000 Contains high-resolution images (1710 × 2151) annotated with 294 fine-grained attributes 2020
RichWear [1] 322,198 Street fashion dataset containing various text labels for fashion analysis. The images are collected from an Asian social network site and focus on street styles in Japan and other Asian areas. 2021
Fashion landmark detection DeepFashion-C [43] 289,222 Annotated with clothing bounding box, pose variation type, landmark visibility, clothing type, category, and attributes 2016
Fashion Landmark Dataset (FLD) [44] 123,016 Annotated with clothing type, pose variation type, landmark visibility, clothing bounding box, and human body joints 2016
Unconstrained Landmark Database (ULD) [95] 30,000

■ Collected from fashion blogs, forums, and the consumer-to-shop retrieval benchmark of DeepFashion [43]

■ Contains substantial foreground scatters and background clutters

2017
DeepFashion2 [109] 491,000 Used in diverse tasks such as fashion parsing, clothes detection, pose estimation, segmentation, and retrieval. 2019
Human Pose Estimation MPII Human Pose [60] 2.5 × 10^4 ■ Data are from YouTube videos. It covers 410 human activities, and each image is provided with an activity label 2014
MSCOCO [88] 328,000 ■ Data are from the Internet. It is used for diverse activities. 2014
AI Challenger [2] 300,000

■ Data are crawled from the Internet.

■ Provides three sub-datasets for human keypoint detection, attribute-based zero-shot recognition, and Chinese image captioning.

2017
PoseTrack [25] 550 video sequences ■ Focuses on 3 aspects: (1) single-frame multi-person pose estimation, (2) multi-person pose estimation in videos, and (3) multi-person articulated tracking. 2017
Pose Transfer Human3.6M [87] 3.6M

■ Contains 3.6 million different 3D articulated poses captured from a set of male and female actors.

■ Provides synchronized 2D and 3D data (including time-of-flight, high-quality image, and motion capture data), accurate 3D human models of the actors, and mixed reality settings

2014
Market-1501 [70] 32,668 ■ Contains over 32,000 annotated boxes, plus a distractor set of over 500K images produced using the Deformable Part Model (DPM) as the pedestrian detector. 2015
DeepFashion [43] 52,712 The In-shop Clothes Retrieval Benchmark of DeepFashion is used for pose transfer 2016
SMPL-NPT [5] 24,000 Contains 24,000 synthesized body meshes and is used for 3D pose transfer 2020
SMG-3D [54] 8,000 Contains 8,000 pairs of naturally plausible body meshes of 40 identities and 200 poses; 35 identities and 180 poses are used as the training set 2021
Clothing Simulation MG-Cloth [108] 356 scans Contains 3D scans of people with different body shapes, poses, and clothes. 2019
DeepFashion3D [99] 2,078 models Contains 3D garment models with 10 different clothing categories and 563 garment instances 2020
AFRIFASHION1600 [82] 1,600 African fashion dataset curated to improve visibility, inclusion and familiarity of African fashion in computer vision tasks 2021
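
Several rows of Table 1 quote both a total and the components it is built from, so the figures can be cross-checked with simple arithmetic. The sketch below does this for three rows, using only numbers copied from the table. The dictionary layout and the function name `check_totals` are illustrative, and reading the VITON total as two images per pair (one person image, one clothing image) is an assumption rather than something the table states.

```python
# Consistency checks on figures quoted in Table 1.
# All numbers are copied from the table; the structure is illustrative only.
table_figures = {
    # 9,732 product images + 75,016 model images
    "LookBook": {"total": 84_748, "parts": [9_732, 75_016]},
    # Assumption: each of the 16,253 pairs contributes two images
    "VITON": {"total": 32_506, "parts": [16_253, 16_253]},
    # 11 body parts + 47 clothes categories
    "MHP v2.0 categories": {"total": 58, "parts": [11, 47]},
}


def check_totals(figures):
    """Return {name: bool} telling whether the parts sum to the stated total."""
    return {
        name: sum(entry["parts"]) == entry["total"]
        for name, entry in figures.items()
    }


results = check_totals(table_figures)
for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'mismatch'}")
```

Under these readings all three rows are internally consistent; a mismatch would suggest a transcription error in either the total or one of its components.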