Table 1.
Summary of the benchmark datasets for fashion tasks
| Task | Dataset | Number of photos | Description | Publication year |
|---|---|---|---|---|
| Virtual Try-On | LookBook [8] | 84,748 | Composed of 9,732 top product images and 75,016 fashion model images | 2016 |
| | DeepFashion [43] | 78,979 | Selected from the In-shop Clothes Benchmark; each image is associated with several sentences as captions and a segmentation map | 2016 |
| | VITON [16] | 32,506 | Collected from around 19,000 frontal-view woman and top clothing image pairs, yielding 16,253 pairs after cleaning | 2018 |
| | FashionTryOn [106] | 28,714 | Comprises 28,714 clothing-person-person triplets, each consisting of a clothing item image and two model images in different poses | 2019 |
| | FashionOn [22] | 22,566 | Pairs of images of a person wearing the same clothes in different poses | 2019 |
| Fashion Parsing | Fashionista [93] | 158,235 | Outfit information in the form of tags, comments, and links | 2012 |
| | Paper Doll [94] | 339,797 | Annotated with metadata tags denoting characteristics, e.g., color, style, occasion, clothing type, and brand | 2013 |
| | Chictopia10k [36] | 10,000 | Contains real-world annotated images in the wild with arbitrary postures, views, and backgrounds | 2015 |
| | LIP [13] | 50,462 | ■ Focuses on semantic understanding of the person; contains images with elaborate pixel-wise annotations of 19 semantic human part labels and 2D human poses with 16 key points ■ Images collected from real-world scenarios contain humans appearing in challenging poses and views, with occlusions and varied appearances | 2017 |
| | MHP v1.0 [105] | 4,980 | Instance-aware setting with fine-grained pixel-level annotations covering 7 body parts and 11 clothes categories | 2017 |
| | MHP v2.0 [85] | 25,403 | ■ Annotated images with 58 fine-grained semantic categories: 11 body parts and 47 clothes categories ■ Images captured in real-world scenes with various viewpoints, poses, occlusions, interactions, and backgrounds | 2018 |
| | Crowd Instance-level Human Parsing (CIHP) [103] | 38,280 | ■ Multi-person images ■ Instance-level pixel-wise annotations | 2018 |
| | ModaNet [18] | 55,176 | Annotated with pixel-level labels, bounding boxes, and polygons | 2018 |
| | DeepFashion2 [109] | 491,000 | ■ Diverse images of 13 popular clothing categories from both commercial shopping stores and consumers ■ Labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks, and per-pixel mask | 2019 |
| | Fashionpedia [24] | 48,000 | Contains 294 fine-grained attributes on high-resolution images (1710 × 2151) | 2020 |
| | RichWear [1] | 322,198 | Street fashion dataset containing various text labels for fashion analysis. The images are collected from an Asian social network site and focus on street styles in Japan and other Asian areas | 2021 |
| Fashion Landmark Detection | DeepFashion-C [43] | 289,222 | Annotated with clothing bounding box, pose variation type, landmark visibility, clothing type, category, and attributes | 2016 |
| | Fashion Landmark Dataset (FLD) [44] | 123,016 | Annotated with clothing type, pose variation type, landmark visibility, clothing bounding box, and human body joints | 2016 |
| | Unconstrained Landmark Database (ULD) [95] | 30,000 | ■ Collected from fashion blogs, forums, and the consumer-to-shop retrieval benchmark of DeepFashion [43] ■ Contains substantial foreground scatter and background clutter | 2017 |
| | DeepFashion2 [109] | 491,000 | Used in diverse tasks such as fashion parsing, clothes detection, pose estimation, segmentation, and retrieval | 2019 |
| Human Pose Estimation | MPII Human Pose [60] | 2.5 × 10^4 | Data are from YouTube videos. Covers 410 human activities, and each image is provided with an activity label | 2014 |
| | MSCOCO [88] | 328,000 | Data are from the Internet and cover diverse activities | 2014 |
| | AI Challenger [2] | 300,000 | ■ Data are crawled from the Internet ■ Provides three sub-datasets for human keypoint detection, attribute-based zero-shot recognition, and Chinese image captioning | 2017 |
| | PoseTrack [25] | 550 video sequences | Focuses on three aspects: (1) single-frame multi-person pose estimation, (2) multi-person pose estimation in videos, and (3) multi-person articulated tracking | 2017 |
| Pose Transfer | Human3.6M [87] | 3.6M | ■ Contains 3.6 million different 3D articulated poses captured from a set of male and female actors ■ Provides synchronized 2D and 3D data (including time of flight, high-quality image, and motion capture data), accurate 3D human models of the actors, and mixed reality settings | 2014 |
| | Market-1501 [70] | 32,668 | Contains over 32,000 annotated boxes, plus a distractor set of over 500K images produced using the Deformable Part Model (DPM) as the pedestrian detector | 2015 |
| | DeepFashion [43] | 52,712 | The In-shop Clothes Retrieval Benchmark of DeepFashion is used for pose transfer | 2016 |
| | SMPL-NPT [5] | 24,000 | Contains 24,000 synthesized body meshes; used for 3D pose transfer | 2020 |
| | SMG-3D [54] | 8,000 | Contains 8,000 pairs of naturally plausible body meshes of 40 identities and 200 poses; 35 identities and 180 poses are used as the training set | 2021 |
| Clothing Simulation | MG-Cloth [108] | 356 scans | Contains 3D scans of people with different body shapes, poses, and clothes | 2019 |
| | DeepFashion3D [99] | 2,078 models | Contains 3D garment models with 10 different clothing categories and 563 garment instances | 2020 |
| | AFRIFASHION1600 [82] | 1,600 | African fashion dataset curated to improve visibility, inclusion, and familiarity of African fashion in computer vision tasks | 2021 |
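
As a quick consistency check on two of the rows above, the short sketch below recomputes the LookBook and VITON image totals from the counts given in their descriptions. The variable names are illustrative, and the factor of two images per VITON pair (one person image plus one clothing image) is an assumption made for this check, not something the table states outright.

```python
# Counts copied from the LookBook and VITON rows of the table.
lookbook_product_images = 9_732
lookbook_model_images = 75_016
lookbook_total = lookbook_product_images + lookbook_model_images

viton_pairs = 16_253
# Assumption: each pair contributes one person image and one clothing image.
images_per_pair = 2
viton_total = viton_pairs * images_per_pair

print(lookbook_total)  # 84748, matching the "Number of photos" entry for LookBook
print(viton_total)     # 32506, matching the "Number of photos" entry for VITON
```

Under that assumption, both totals agree with the "Number of photos" column. The same check does not apply to every row: for FashionTryOn, for example, the listed figure equals the number of triplets rather than the number of individual images.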