Abstract
The dataset presents raw data from egocentric (first-person view) and exocentric (third-person view) perspectives, comprising 47,166 frame images. Egocentric and exocentric frame images were extracted from iPhone videos recorded simultaneously. The egocentric view captures the details of close-range hand gestures and the attentiveness of the iPhone wearer, while the exocentric view captures the hand gestures of all participants from a top-down view. The data provides frame images of two, three, and four people engaged in interactive games such as Poker, Checkers, and Dice. Furthermore, the data was collected in real environments under natural, white, yellow, and dim lighting conditions. The dataset contains diverse hand gestures, including challenging instances such as motion blur, extreme deformation, sharp shadows, and extremely dim light. Researchers working on artificial intelligence (AI) interaction games in extended reality can create sub-datasets from the metadata for one or both perspectives, facilitating AI understanding of hand gestures in human interactive games. Researchers can also extract hand gestures relevant to hand-object interaction studies, such as hands deformed by holding a chess piece, blurred hands gripping dice containers, and hands obscured by playing cards. Researchers can annotate bounding boxes and hand edges for semi-supervised and supervised hand detection, hand segmentation, and hand classification to improve the ability of AI to distinguish each player's hand gestures. Unsupervised and self-supervised research can also be conducted directly on this dataset.
Keywords: First-person view, Third-person view, Interactive game, Frame images, Extended reality
Specifications Table
| Subject | Artificial Intelligence, Human-Computer Interaction |
| Specific subject area | Hand gesture recognition in interactive scenes, extended reality interactive games |
| Type of data | Raw frame images (sequential video frames) |
| Data collection | The data in this dataset are continuous frame images converted from videos recorded simultaneously in egocentric and exocentric views. Four lighting conditions were used: natural light (normal, weak, dark), white light (bright), yellow light (soft), and dim light (insufficient light). Two iPhone 13s, one on a tripod and one in a chest-strap phone mount, recorded simultaneously. Six volunteers completed two-person, three-person, and four-person interactive games during the recording. |
| Data source location | Universiti Teknologi Malaysia, 81310, Skudai Johor, Malaysia. |
| Data accessibility | Repository name: Mendeley Data Data identification number: DOI: 10.17632/bxr7kx84y6.2 Direct URL to data: https://data.mendeley.com/datasets/bxr7kx84y6/2 |
| Related research article | None |
1. Value of the Data
- This dataset contributes to improving the ability of smart terminal devices to learn hand details, enhancing the accuracy of predicting the hand's next action. It also improves the user experience of smart-glasses wearers engaging in interactive games in extended reality.
- The dataset contains many rapidly moving, deforming, sharply shadowed, and dimly lit hand gestures. These gestures typically occur in human-computer interaction scenarios in extended reality, such as rapid card tosses, dice rolls, and fingers holding pieces.
- Researchers, game developers, and project managers building artificial intelligence (AI) interaction games in extended reality can capitalize on the paired first-person and third-person views and the diverse gestural features in the sequential frame images.
- The dataset will assist computer vision researchers in developing deep learning algorithms that detect, segment, classify, and recognize static or sequential image-based hand gestures more accurately.
2. Background
Many related datasets have been created for hand detection, segmentation, and classification, including EgoGesture [1], EgoHOS [2], HaGRID [3], Chalearn IsoGD [4], Jester [5], Egohands [6], NVGesture [7], EgoCom [8], etc. However, these existing datasets include only one view, either egocentric or exocentric. Although Charades-Ego [9] and EgoExoLearn [10] contain both first-person and third-person views, they do not include the hands of multiple people in interactive scenarios. Therefore, this paper aims to construct a dataset with multifarious hand gestures under varied lighting conditions. The purposes of creating this dataset are as follows:
- To provide a dataset of hand gestures from two, three, or four players completing interactive games under natural light, white light, yellow light, and dim light in the real world.
- To produce frame images with hand gestures involving movement, rotation, deformation, blur, sharp shadows, and dark environments.
- To provide metadata from which researchers can create sub-datasets for either one viewpoint or both the egocentric and exocentric views.
- To provide raw data publicly so that researchers can preprocess it as needed for the training, validation, and testing of supervised, semi-supervised, unsupervised, and self-supervised deep learning models for hand detection, segmentation, and classification.
3. Data Description
The dataset contains two folders: Egocentric View and Exocentric View. Each folder contains Two persons, Three persons, and Four persons subfolders. Each subfolder is separated by lighting condition: natural light, white light, yellow light, and dim light. Each lighting-condition folder has four game folders. The subfolders in the egocentric view (first-person view) and exocentric view (third-person view) contain the frame images from videos recorded simultaneously from the two perspectives. These frame images are named after the frame number; for example, "00000062.jpg" corresponds to the sixty-second frame of the original video. Each frame image is annotated with time information. Frames with the same name and time information, located in identically named subfolders under the egocentric and exocentric view folders, show two perspectives of the same hand gestures. The folder structure is shown in Fig. 1. The number of frame images is given in Table 1.
Fig. 1.
Folder structure.
Table 1.
The number of frame images.
| Folder Name | Subfolder Name | Natural Light | White Light | Yellow Light | Dim Light |
|---|---|---|---|---|---|
| Egocentric View | Two persons | 1745 | 1135 | 1422 | 1560 |
| | Three persons | 2077 | 2353 | 2058 | 1822 |
| | Four persons | 2152 | 2155 | 2541 | 2099 |
| Exocentric View | Two persons | 1990 | 1210 | 1495 | 1644 |
| | Three persons | 2263 | 2373 | 2075 | 1844 |
| | Four persons | 2167 | 2245 | 2557 | 2184 |
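Given this layout, a sub-dataset pairing the two perspectives can be assembled by matching identically named frames across the two view folders. The sketch below assumes the folder names exactly as listed in Table 1; the `paired_frames` helper is illustrative and not part of the dataset itself.

```python
from pathlib import Path


def paired_frames(root, group, light, game):
    """Yield (ego_path, exo_path) pairs for frames present in both views.

    Folder names follow the structure described above; the exact casing of
    the dataset's directory names is an assumption here.
    """
    ego_dir = Path(root) / "Egocentric View" / group / light / game
    exo_dir = Path(root) / "Exocentric View" / group / light / game
    ego_names = {p.name for p in ego_dir.glob("*.jpg")}
    exo_names = {p.name for p in exo_dir.glob("*.jpg")}
    # Only frames captured in both perspectives form a usable pair.
    for name in sorted(ego_names & exo_names):
        yield ego_dir / name, exo_dir / name
```

Frames that appear in only one view (see the Limitations section) are simply skipped by the intersection.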
The dataset contains 47,166 continuous frame images. These images were extracted from videos recorded at 30 FPS (frames per second). The videos record scenes of two, three, and four people playing Dice, Gomoku, Chinese Chess, Checkers, Aeroplane Chess, Poker, and Meeting Chess under different lighting conditions in real scenes. The rules of each interactive game are listed in Table 2. Additionally, the Dice games were recorded for two to three minutes, and the other games for four to five minutes.
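Because frames are named after their index in a 30 FPS recording, the timestamp of any image within the original video can be recovered from its filename. A minimal sketch (the `frame_timestamp` helper is hypothetical):

```python
def frame_timestamp(filename, fps=30.0):
    """Convert a frame-numbered filename such as '00000062.jpg' into
    seconds elapsed in the original video, assuming the stated 30 FPS
    recording rate."""
    frame_index = int(filename.split(".")[0])
    return frame_index / fps
```

For example, `frame_timestamp("00000062.jpg")` places that frame just over two seconds into the recording.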
Table 2.
A brief introduction to interactive games rules.
| Game names | Rules | The number of volunteers |
|---|---|---|
| Dice | Players concurrently shake the dice in a dice container and turn it upside down on the table, displaying the numbers. Victory conditions: one variant is a single round of dice tossing, with the highest number winning. Another is multiple rounds of dice rolling: at the end of each round, dice showing one or six red points are placed on the table, and the remaining dice continue to be shaken; the first player to place all dice on the table wins. | 2, 3, or 4 |
| Gomoku | Two players take turns [11]. Victory condition: the first player to line up five pieces on the board wins. | 2 |
| Chinese Chess | Two players take turns. When a piece moves to a position held by an opponent's piece, it captures that piece, which is removed from the board. The game ends in one of two situations: checkmate, in which a player threatens to capture the opponent's King and the opponent has no way to resolve the threat, and that player wins; or stalemate, in which a player without any valid move loses [12]. | 2 |
| Meeting Chess | Two players take turns. Victory condition: the first player to line up two pieces of their color on the board wins. | 2 |
| Checkers | The checkers board has six corners and can be played by up to six people. Each person fills a corner with pieces of the same color and takes turns moving according to the rules. In a game between two or four people, the pieces are placed on opposite corners; for three people, the pieces are placed one corner apart so they are evenly distributed. Victory condition: the first person to reach and fill the opposite corner wins [13]. | 3 or 4 |
| Aeroplane Chess | Two to four players each try to get all their plane pieces from their hangars, located at the corners of the board, into the base of their own color in the center of the board. Each player takes a turn by rolling the dice. Victory condition: fly into the center base on an exact roll to win; the rest play on until only one loser remains [14]. | 3 or 4 |
| Poker | Two ways to play: first, each player takes turns taking cards, one at a time, until no cards are left on the table; the player holding the number three starts, and the others choose whether to play based on the cards in their hands. Second, each player takes turns placing a card on the table; if a card matches the number and suit of a card already on the table, that card is taken into the player's hand. Victory conditions: in the first variant, the first player without a card wins; in the second, the first player without a card loses. | 3 or 4 |
Furthermore, natural light changes over time, so the natural light differed when recording the groups of two, three, and four persons, corresponding to normal, weak, and dark conditions, respectively. The dim light comes from the ceiling on the side opposite the window; since volunteers sit in positions that block the light, the number of people affects the dim light intensity. In Figs. 2 and 3, panels (a), (b), and (c) show the following: (a) two persons playing four games under the different light conditions, with normal natural light; (b) three persons playing four games under the different light conditions, with weak natural light; (c) four persons playing four games under the different light conditions, with dark natural light.
Fig. 2.
Frame images in each group of egocentric view hands.
Fig. 3.
Frame images in each group of exocentric view hands.
In particular, these videos include fast-moving, rotating, and deforming hands caused by pose changes such as movement, deformation, and occlusion. Concretely, hand movement generates gestures that are blurred, fast, slow, up or down, and left or right. Deformation derives from hand rotation, pieces or cards held in the hand, and fingers moving pieces. Occlusion includes holding pieces or cards, holding containers, and interacting with other hands. Normally one image covers more than one gesture; for example, in Checkers, players take turns holding pieces, so each frame image in that subfolder contains gestures for both movement and deformation. The diverse hand gestures involved in the interactive activities are presented in Table 3. Correspondingly, the converted frame images contain many blurred hand gestures. Moreover, because of the varied lighting conditions, the dataset includes hands in the dark and hands followed by clear shadows caused by light interference, as shown in Fig. 4.
Table 3.
Typical hand gestures in interactive activities.
The first five gesture columns describe Movement (Blur, Fast, Slow, Up or down, Left or right), the next three Deformation (Hand rotation, Pieces or cards in hand, Finger moving pieces), and the last three Occlusion (Pieces or cards, Containers, Other hands).

| Game name | Blur | Fast | Slow | Up or down | Left or right | Hand rotation | Pieces or cards in hand | Finger moving pieces | Pieces or cards | Containers | Other hands |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dice | ✓ | ✓ | - | ✓ | ✓ | ✓ | ✓ | - | - | ✓ | ✓ |
| Gomoku | - | - | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | - |
| Chinese Chess | - | - | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | - |
| Meeting Chess | - | - | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ | - | - |
| Checkers | - | - | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ | - | - |
| Aeroplane Chess | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ | - | ✓ |
| Poker | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - | ✓ | - | ✓ |
Fig. 4.
Special data. (a) Blurred hands in the dark, (b) clear shadows following hands.
4. Experimental Design, Materials and Methods
The dataset involves six volunteers. They completed two-player, three-player, and four-player interactive activities such as Poker, Chinese Chess, and Checkers. The process was recorded by two iPhone 13s in parallel to capture hand interaction data. The whole experimental procedure comprises the following steps:
The first step is equipment preparation: two iPhone 13s, a floor-standing tripod, and a chest-mounted phone holder. The parameters are shown in Table 4.
Table 4.
The details of equipment.
| Devices | Parameter | Image of devices |
|---|---|---|
| iPhone 13 camera | Dual-camera system; 12MP main and ultra-wide cameras; Portrait mode with Focus and Depth Control; optical zoom options; screen resolution: 2532×1170 pixels, 460 ppi; video recording: 720p HD at 30 fps; format: High Efficiency; composition: Grid, Level, View Outside the Frame; Photographic Styles; Prioritize Faster Shooting: on; Lens Correction: on | ![]() |
| Tripod | Retractable tripod, 360-degree rotating phone clamp | ![]() |
| Chest-mounted mobile phone holder | Adjustable chest strap, J-shaped base, adapter (adjustable angle rotation), reinforcement strap | ![]() |
| Others | Poker, dice, chess board, chess pieces | ![]() |
The second step is to set up the acquisition scene. The props are one table, chairs, sofas, chessboards, etc. The first iPhone is placed in a chest-mounted mobile phone holder, which one of the volunteers wears to record the first-person view videos; the angle between the mobile phone and the chest is about 60 degrees. The second iPhone is placed on a floor-standing tripod to record the third-person view videos; the angle between the mobile phone and the vertical plane is close to 30 degrees. The overall layout is shown in Fig. 5. Natural light comes from the windows in the wall; white light and yellow light come from bulbs on the ceiling; dim light comes from the bulbs on the ceiling opposite the windows. Filming was done indoors under natural light, white light (bright), yellow light (soft), and dim light, as shown in Fig. 6.
Fig. 5.
Acquisition scene setup.
Fig. 6.
The light condition indoors. (a) Natural light (normal), (b) White light (bright), (c) Yellow light (soft), (d) Dim light.
The third step is data collection. Videos were recorded in the constructed scene under the four lighting conditions: natural light, white light, yellow light, and dim light. The interactive activities for the two-person group were Dice, Gomoku, Chinese Chess, and Meeting Chess; those for the three-person and four-person groups were Poker, Aeroplane Chess, Checkers, and Dice. For the first-person view, volunteers wore the chest-mounted mobile phone holder and recorded in landscape orientation; for the third-person view, the mobile phone recorded in portrait orientation. Notably, the photographer standing next to the tripod-mounted iPhone and the volunteer wearing the chest-mounted holder shouted a countdown together and pressed their record buttons simultaneously.
The fourth step is to convert the videos into continuous frame images. Python code converted each video into continuous frame images with annotated time information, saving one image every ten frames. The first-person view frame images were then rotated 90 degrees counterclockwise with Python code. Moreover, images without hands were manually deleted, and where multiple images showed the same gesture, only one was kept. Furthermore, Python code invoked the YOLO-Face [15] pre-trained model to detect volunteers' identifiable features in the frame images, such as eyes, noses, and mouths; these features were then masked with a mosaic without quality loss.
Limitations
Although the videos from the two perspectives were recorded simultaneously, the number of frame images with hands, or with the same gesture, differs because of the differing viewpoints. As a result, the final numbers of egocentric and exocentric frames are not the same. In particular, because the first-person view has a narrower recording range and the volunteers move while playing, their hands do not always appear within the camera's field of view. This leads to exocentric frame images significantly outnumbering egocentric ones in a few interactive scenes.
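To quantify this mismatch in a downloaded copy, per-view frame totals can be tallied directly from the folder tree. The `count_frames` helper below is hypothetical and assumes the folder names from Table 1:

```python
from pathlib import Path


def count_frames(root, view):
    """Count the .jpg frame images under one view folder (e.g.
    'Egocentric View') by walking its subfolders recursively."""
    return sum(1 for _ in (Path(root) / view).rglob("*.jpg"))
```

Comparing `count_frames(root, "Egocentric View")` with `count_frames(root, "Exocentric View")` reproduces the imbalance described above.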
Ethics Statement
The collected data contain only interactive hand frames and no other personal details. Participation was voluntary, and the volunteers freely chose to provide footage of their hands. The volunteers are classmates known to the authors. In addition, we manually checked each frame image and removed or masked frames containing volunteers' identifiable features to ensure their privacy.
CRediT Author Statement
Cui Cui: Conceptualization, Methodology, Software, Data Curation, Writing – original draft; Mohd Shahrizal Sunar: Supervision, Writing – review & editing; Goh Eg Su: Supervision, Writing – review & editing.
Acknowledgements
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Contributor Information
Cui Cui, Email: cui@graduate.utm.my.
Mohd Shahrizal Sunar, Email: shahrizal@utm.my.
Data Availability
References
- 1. Zhang Y., Cao C., Cheng J., Lu H. EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Trans. Multimedia. 2018;20:1038–1050. doi: 10.1109/TMM.2018.2808769.
- 2. Zhang L., Zhou S., Stent S., Shi J. Fine-grained egocentric hand-object segmentation: dataset, model, and applications. 2022. http://arxiv.org/abs/2208.03826
- 3. Alexander K., Karina K., Alexander N., Roman K., Andrei M. HaGRID – HAnd Gesture Recognition Image Dataset. 2024 IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV). 2024; pp. 4560–4569.
- 4. Wan J., Lin C., Wen L., Li Y., Miao Q., Escalera S., Anbarjafari G., Guyon I., Guo G., Li S.Z. ChaLearn looking at people: IsoGD and ConGD large-scale RGB-D gesture recognition. IEEE Trans. Cybern. 2022;52:3422–3433. doi: 10.1109/TCYB.2020.3012092.
- 5. Materzynska J., Berger G., Bax I., Memisevic R. The Jester dataset: a large-scale video dataset of human gestures. 2019 IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW). 2019; pp. 2874–2882.
- 6. Bambach S., Lee S., Crandall D.J., Yu C. Lending a hand: detecting hands and recognizing activities in complex egocentric interactions. 2015 IEEE Int. Conf. Comput. Vis. (ICCV). 2015; pp. 1949–1957.
- 7. Molchanov P., Yang X., Gupta S., Kim K., Tyree S., Kautz J. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. 2016 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). 2016; pp. 4207–4215.
- 8. Northcutt C.G., Zha S., Lovegrove S., Newcombe R. EgoCom: a multi-person multi-modal egocentric communications dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2023;45:6783–6793. doi: 10.1109/TPAMI.2020.3025105.
- 9. Sigurdsson G.A., Gupta A., Schmid C., Farhadi A., Alahari K. Actor and observer: joint modeling of first and third-person videos. 2018 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). 2018; pp. 7396–7404.
- 10. Huang Y., Chen G., Xu J., Zhang M., Yang L., Pei B., Zhang H., Dong L., Wang Y., Wang L., Qiao Y. EgoExoLearn: a dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world. 2024. doi: 10.48550/arXiv.2403.16182.
- 11. Gomoku, Wikipedia (2024). https://en.wikipedia.org/w/index.php?title=Gomoku&oldid=1230102610 (accessed June 24, 2024).
- 12. Xiangqi, Wikipedia (2024). https://en.wikipedia.org/w/index.php?title=Xiangqi&oldid=1230006915 (accessed June 24, 2024).
- 13. Chinese checkers, Wikipedia (2024). https://en.wikipedia.org/w/index.php?title=Chinese_checkers&oldid=1227109425 (accessed June 24, 2024).
- 14. Asiapac Editorial. Gateway to Old School Games. Asiapac Books Pte Ltd; 2011.
- 15. G. Jocher, A. Chaurasia, J. Qiu, Ultralytics YOLO (2023). https://github.com/ultralytics/ultralytics (accessed September 10, 2024).