Abstract
Automatic sign language recognition provides better services to the deaf as it helps bridge the communication gap between them and the rest of society. Hand gestures, the primary mode of sign language communication, play a key role in improving sign language recognition. This article presents a video dataset of the hand gestures of Indian sign language (ISL) words used in emergency situations. The videos of eight ISL words have been collected from 26 individuals (12 males and 14 females) in the age group of 22 to 26 years, with two samples from each individual, in an indoor environment under normal lighting conditions. Such a video dataset is highly needed for the automatic recognition of emergency situations from sign language for the benefit of the deaf. The dataset is useful for researchers working on vision based sign language recognition (SLR) as well as hand gesture recognition (HGR). Moreover, support vector machine based and deep learning based classification of the emergency gestures has been carried out, and the baseline classification performance shows that the database can be used as a benchmarking dataset for developing novel and improved techniques for recognizing the hand gestures of emergency words in Indian sign language.
Keywords: Indian sign language recognition, Hand gestures, Emergency words, Video data
Specifications table
Subject | Computer Vision and Pattern Recognition |
Specific subject area | Automatic sign language recognition |
Type of data | Videos |
How data were acquired | The videos in this dataset were collected by asking the participants to stand comfortably behind a black colored board and present the hand gestures in front of the board. A Sony Cyber-shot DSC-W810 digital camera with 20.1-megapixel resolution has been used for capturing the videos. |
Data format | Raw videos as well as cropped videos. The data are organized in two sets: one set contains the captured video sequences in their original (raw) format, and the other set contains the video sequences obtained after cropping out the excessive background objects and downsampling the frames to a uniform size of 500x600 pixels. |
Parameters for data collection | All the videos have been collected against a plain black background with the camera placed at a fixed distance. Both male and female subjects from various parts of India with varying hand sizes and skin tones have been included in the data collection. Two sample videos have been collected from each participant with a short time gap between them. The data collection has been done on different days and at different times in an indoor environment under normal lighting conditions. No restriction has been imposed on the speed of hand movements, so as to keep the gesture presentations as natural as possible. |
Description of data collection | Videos for a set of eight hand gestures representing the ISL words namely, ‘accident’, ‘call’, ‘doctor’, ‘help’, ‘hot’, ‘lose’, ‘pain’ and ‘thief’ have been included in the dataset. |
Data source location | Department of Computer Science, Central University of Kerala Periya, Kasaragod, Kerala India-671320 |
Data accessibility | Repository name: Mendeley Data. Data identification number: DOI: 10.17632/2vfdm42337.1. Direct URL to data: https://data.mendeley.com/datasets/2vfdm42337/draft?a=c5c2265d-5dd2-4e67-8656-0af6527a9937 |
Value of the data
The lack of publicly available datasets is a major challenge that hinders developments in automatic SLR. The dataset presented in this article is the first publicly available dataset of the hand gestures of emergency ISL words. The data will be useful for researchers developing novel techniques for improving automatic recognition of ISL gestures [1,2].
Improvement in this field is of great benefit to society, as it provides a communication platform for the Deaf to convey their urgent messages to the authorities.
This dataset can act as a basic benchmarking database for a set of hand gestures of emergency ISL words. It can serve as a starting point for expanding the dataset by replicating the samples or adding new samples of the gestures in different views and background conditions, to further develop and improve SLR and HGR techniques [3].
Data description
The dataset contains the RGB videos of hand gestures of eight ISL words, namely, ‘accident’, ‘call’, ‘doctor’, ‘help’, ‘hot’, ‘lose’, ‘pain’ and ‘thief’, which are commonly used to convey messages or seek support during emergency situations. All the words included in this dataset, except the word ‘doctor’, are dynamic hand gestures. The videos were captured from 26 adult individuals, including 12 males and 14 females, in the age group of 22 to 26 years. The subjects who participated in the data collection are not representatives of a particular region but come from various parts of India.
For dynamic gestures, recognition depends primarily on motion features derived from silhouettes, shapes or edges and their variation over time, rather than on skin color features. Even though skin color variations therefore play only a small role, extreme care has been taken during data collection to include participants with a wide range of skin tones, so that the dependency of gesture recognition performance on human skin color can be studied.
At times, the skin color may closely resemble the background color (including the person's clothing), which can severely affect the classification rate. Hence, all the videos in this dataset have been collected against a black background under normal lighting conditions in an indoor environment. Such a black background can be constructed easily and at very low cost with a board painted black and placed in front of the camera. As these are emergency related words meant for the deaf to communicate with the world, high recognition rates with few false positives and few false negatives are essential. The plain black background in the videos helps to increase hand gesture recognition performance with less computational overhead.
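To illustrate how the plain black background reduces the segmentation effort, the following minimal Python/OpenCV sketch isolates the hand regions of a frame with a single global threshold. The threshold value, the morphological clean-up and the example file path are illustrative assumptions and not part of the authors' processing pipeline.

```python
import cv2
import numpy as np

def segment_hands(frame_bgr, thresh=40):
    """Separate the brighter hand regions from the plain black background.

    A simple global threshold is usually sufficient because the black board
    keeps the background intensity close to zero.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # Remove small speckles left by sensor noise or board texture.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)

cap = cv2.VideoCapture("Cropped_Data/accident_Cropped/accident_crop_003_02.avi")
ok, frame = cap.read()
if ok:
    cv2.imwrite("accident_003_02_hands.png", segment_hands(frame))
cap.release()
```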
The dataset has been built with the objective of providing a benchmark for emergency hand gesture recognition, with the corresponding classification results serving as a reference for further improvements in ISL recognition. The dataset is organized in two folders, namely ‘Raw_Data’ and ‘Cropped_Data’. The folder ‘Raw_Data’ contains the original ISL videos of size 1280x720 pixels. The folder ‘Cropped_Data’ contains the video sequences obtained after cropping out the excessive background and rescaling the frames to a uniform size of 500x600 pixels. Fig. 1 shows a set of keyframe sequences for sample videos of all eight hand gestures in the ‘Cropped_Data’ set.
Fig. 1.
The key frame sequences of the hand gestures of the ISL words included in the ‘Cropped_Data’ set.
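As a rough sketch of how the ‘Cropped_Data’ set could be reproduced from the raw videos, the snippet below crops each 1280x720 frame to a region of interest and rescales it to 500x600 pixels. The crop rectangle, the codec and the interpretation of 500x600 as width x height are assumptions, since the article does not specify these details.

```python
import cv2

def crop_and_resize_video(src_path, dst_path, crop_box, size=(500, 600)):
    """Crop each frame to `crop_box` (x, y, w, h) and resize to `size` (width, height)."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back if FPS metadata is missing
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"XVID"), fps, size)
    x, y, w, h = crop_box
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = frame[y:y + h, x:x + w]                # drop the excessive background
        out.write(cv2.resize(roi, size))
    cap.release()
    out.release()

# Hypothetical crop region; adjust it to the area actually occupied by the board.
crop_and_resize_video("Raw_Data/accident_Raw/accident_003_02.avi",
                      "accident_crop_003_02.avi", crop_box=(340, 60, 600, 660))
```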
The dataset contains a total of 824 sample videos in .avi format. The raw videos are labelled using the format ISLword_XXX_YY, where:
• ISLword corresponds to one of the words ‘accident’, ‘call’, ‘doctor’, ‘help’, ‘hot’, ‘lose’, ‘pain’ and ‘thief’.
• XXX is an identifier of the participant and is in the range 001 to 026.
• YY is 01 or 02 and identifies the sample number for each subject.
For example, the file named accident_003_02 is the video sequence of the second sample of the ISL gesture of the word ‘accident’ presented by the 3rd participant.
The cropped videos are labelled using the format ISLword_crop_XXX_YY, where:
• ISLword corresponds to one of the words ‘accident’, ‘call’, ‘doctor’, ‘help’, ‘hot’, ‘lose’, ‘pain’ and ‘thief’.
• XXX is an identifier of the participant and is in the range 001 to 026.
• YY is 01 or 02 and identifies the sample number for each subject.
For example, the file named accident_crop_003_02 is the video sequence of the second sample of the ISL gesture of the word ‘accident’ presented by the 3rd participant, obtained after cropping and downsampling to 500x600 pixels. Table 1 and Table 2 show the file and folder organization of the raw data and cropped data sets, respectively.
Table 1.
Organization of raw videos in the dataset.
Folder | File Name | Description |
---|---|---|
accident_Raw | accident_001_01 to accident_026_01, accident_001_02 to accident_026_02 | 52 sample videos of ISL hand gestures for the word ‘accident’ presented by 26 subjects. |
call_Raw | call_001_01 to call_026_01, call_001_02 to call_026_02 | 52 sample videos of ISL hand gestures for the word ‘call’ presented by 26 subjects. |
doctor_Raw | doctor_001_01 to doctor_026_01, doctor_001_02 to doctor_026_02 | 52 sample videos of ISL hand gestures for the word ‘doctor’ presented by 26 subjects. |
help_Raw | help_001_01 to help_026_01, help_001_02 to help_026_02 | 52 sample videos of ISL hand gestures for the word ‘help’ presented by 26 subjects. |
hot_Raw | hot_001_01 to hot_026_01, hot_001_02 to hot_026_02 | 52 sample videos of ISL hand gestures for the word ‘hot’ presented by 26 subjects. |
lose_Raw | lose_001_01 to lose_018_01, lose_020_01 to lose_026_01, lose_001_02 to lose_018_02, lose_020_02 to lose_026_02 | 50 sample videos of ISL hand gestures for the word ‘lose’ presented by 25 subjects. |
pain_Raw | pain_001_01 to pain_026_01, pain_001_02 to pain_026_02 | 52 sample videos of ISL hand gestures for the word ‘pain’ presented by 26 subjects. |
thief_Raw | thief_001_01 to thief_019_01, thief_021_01 to thief_026_01, thief_001_02 to thief_019_02, thief_021_02 to thief_026_02 | 50 sample videos of ISL hand gestures for the word ‘thief’ presented by 25 subjects. |
Table 2.
Organization of cropped videos in the dataset.
Folder | File Name | Description |
---|---|---|
accident_Cropped | accident_crop_xxx_yy | 52 sample videos of ISL hand gestures for the word ‘accident’ presented by 26 subjects. |
call_Cropped | call_crop_xxx_yy | 52 sample videos of ISL hand gestures for the word ‘call’ presented by 26 subjects. |
doctor_Cropped | doctor_crop_xxx_yy | 52 sample videos of ISL hand gestures for the word ‘doctor’ presented by 26 subjects. |
help_Cropped | help_crop_xxx_yy | 52 sample videos of ISL hand gestures for the word ‘help’ presented by 26 subjects. |
hot_Cropped | hot_crop_xxx_yy | 52 sample videos of ISL hand gestures for the word ‘hot’ presented by 26 subjects. |
lose_Cropped | lose_crop_xxx_yy | 50 sample videos of ISL hand gestures for the word ‘lose’ presented by 25 subjects. |
pain_Cropped | pain_crop_xxx_yy | 52 sample videos of ISL hand gestures for the word ‘pain’ presented by 26 subjects. |
thief_Cropped | thief_crop_xxx_yy | 50 sample videos of ISL hand gestures for the word ‘thief’ presented by 25 subjects. |
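The naming convention above can be parsed programmatically. The helper below is a hypothetical example (it is not distributed with the dataset) of how the word, participant identifier and sample number could be recovered from a file name of either set.

```python
import re
from pathlib import Path

# Matches e.g. 'accident_003_02.avi' and 'accident_crop_003_02.avi'.
NAME_PATTERN = re.compile(r"^(?P<word>[a-z]+)(?:_crop)?_(?P<subject>\d{3})_(?P<sample>\d{2})$")

def parse_clip_name(path):
    """Return (ISL word, participant id, sample number) for a dataset file name."""
    match = NAME_PATTERN.match(Path(path).stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return match.group("word"), int(match.group("subject")), int(match.group("sample"))

print(parse_clip_name("Cropped_Data/accident_Cropped/accident_crop_003_02.avi"))
# ('accident', 3, 2)
```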
Experimental design
The hand gestures included in this dataset follow the style and movements specified in the ISL dictionary published by the Ramakrishna Mission Vivekananda University, Coimbatore, Tamil Nadu, India [4,5]. The videos of the ISL words and their descriptions were shown to the participants to ensure effective presentation of the gestures. A Sony Cyber-shot DSC-W810 digital camera with a frame size of 1280x720 pixels was used for the data collection. The data collection process received ethical clearance from the Institutional Human Ethics Committee (IHEC) of the Central University of Kerala, India. All the individuals went through the detailed informed consent form and signed their consent for voluntary participation.
The videos in this dataset were collected by asking the participants to stand comfortably behind a black colored board. The participants were asked to present the eight hand gestures in front of the board, one by one, and the procedure was repeated twice to capture two sample videos of each gesture. A single frame of the video of the word ‘accident’ in its original (raw) form and after cropping is shown in Fig. 2(a) and Fig. 2(b), respectively.
Fig. 2.
(a) A single frame of the video for the hand gesture of the word ‘accident’ in original form (b) the corresponding frame obtained after cropping and downsampling.
All the videos were taken with the camera fixed at the same distance from the black board. Human hands are highly flexible, and the style and speed of hand movements have shown great variation across individuals while presenting the gestures. No restriction was imposed on the speed or time duration while capturing the video samples. Hence, the duration of the videos varies from one to three seconds depending on the speed of gesture presentation by different individuals.
Data analysis
The ISL gestures in the cropped dataset have been analysed by classifying them with a conventional feature driven approach using a multiclass support vector machine (SVM) [6] as well as with a data driven approach using a deep learning model. In both cases, 50% of the dataset is used for training and the remaining 50% for testing.
In the feature based approach, a set of keyframes is extracted from each video sequence through a fast and efficient method based on image entropy and density clustering, as proposed in [7]. The keyframe extraction eliminates redundant information and ensures that all the videos have an equal number of frames. The appearance features, namely the three-dimensional wavelet transform descriptors [8], are extracted from the keyframes and used for training and testing the SVM classifier. SVM is a supervised machine learning approach used for binary and multiclass pattern recognition. Its memory efficient operation based on the data points called support vectors and the availability of versatile kernel functions make it a widely adopted choice for image and video feature classification as well. It is extensively used in classification problems with comparatively few training samples and has shown good performance. A multiclass SVM is utilized in this work, and an average classification accuracy of 90% is obtained.
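A minimal sketch of this feature driven pipeline is given below, assuming the keyframes have already been extracted with the method of [7] so that every video is represented by a fixed-length grayscale volume. The sub-band energy descriptor is a simplified stand-in for the three-dimensional wavelet descriptors of [8], and the RBF kernel with a one-vs-rest multiclass strategy (in the spirit of [6]) is an assumption rather than the authors' exact SVM configuration.

```python
import numpy as np
import pywt
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def wavelet_descriptor(keyframes, wavelet="haar"):
    """Simple 3-D DWT descriptor from a (K, H, W) grayscale keyframe volume.

    A single-level 3-D wavelet decomposition is applied and the energy of each
    of the eight sub-bands is used as the feature vector.
    """
    volume = np.asarray(keyframes, dtype=np.float32)
    coeffs = pywt.dwtn(volume, wavelet)                      # dict of 8 sub-bands ('aaa' ... 'ddd')
    return np.array([np.mean(np.square(c)) for c in coeffs.values()])

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """X_*: lists of keyframe volumes; y_*: ISL word labels (50/50 split as in the text)."""
    f_train = np.stack([wavelet_descriptor(v) for v in X_train])
    f_test = np.stack([wavelet_descriptor(v) for v in X_test])
    clf = make_pipeline(StandardScaler(), OneVsRestClassifier(SVC(kernel="rbf")))
    clf.fit(f_train, y_train)
    return clf.score(f_test, y_test)                          # average classification accuracy
```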
In the deep learning approach, a pre-trained convolutional neural network (CNN), namely GoogLeNet [9], is combined with a long short-term memory (LSTM) network [10] for gesture classification. The videos are converted into sequences of feature vectors by passing the frames through the GoogLeNet network and taking the activations of its last pooling layer. The classification model is built with a sequence input layer followed by a bidirectional LSTM layer with 2000 hidden units and a dropout layer. The output of the dropout layer is transformed by a fully connected layer into the size suitable for classification by a softmax layer and a final classification layer. The network is trained for 20 epochs on the sequences of feature vectors, with 10% of the training data used for validation, an adaptive moment estimation (Adam) optimizer, a batch size of 16 and an initial learning rate of 0.0001. The performance of the classification model is evaluated on the test videos, and an average classification accuracy of 96.25% is achieved.
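The PyTorch sketch below mirrors the described architecture: GoogLeNet features taken at the last pooling layer feed a bidirectional LSTM with 2000 hidden units, followed by dropout, a fully connected layer and a softmax (applied inside the cross-entropy loss). The framework choice, the dropout probability and the omitted training loop (20 epochs, batch size 16, Adam with learning rate 0.0001, 10% of the training set for validation) are assumptions or simplifications, as the article does not state the implementation used.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen GoogLeNet backbone as a per-frame feature extractor; with the final
# classification layer replaced by Identity, it returns the 1024-D activations
# that follow the last (global average) pooling layer.
backbone = models.googlenet(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()
backbone.eval()

class GestureLSTM(nn.Module):
    """Bidirectional LSTM classifier over per-frame CNN feature sequences."""
    def __init__(self, feat_dim=1024, hidden=2000, num_classes=8, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)            # dropout probability assumed
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                             # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return self.fc(self.dropout(out[:, -1]))      # logits; softmax applied in the loss

model = GestureLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # Adam, initial learning rate 0.0001
criterion = nn.CrossEntropyLoss()                            # cross-entropy with implicit softmax

# Forward pass on one clip: 12 random frames stand in for preprocessed 224x224 RGB frames.
with torch.no_grad():
    frame_features = backbone(torch.randn(12, 3, 224, 224)).unsqueeze(0)   # (1, 12, 1024)
logits = model(frame_features)                                             # (1, 8)
```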
The classification performance of both methods has also been evaluated using precision, recall and F-score values for each gesture class, as shown in Table 3. The reported average classification accuracies, precision, recall and F-score values can be considered as baseline performance measures that further work on this dataset may improve upon.
Table 3.
Classification performance of multiclass SVM as well as deep learning model on the ISL words in the cropped set.
ISL Word | SVM Precision (%) | SVM Recall (%) | SVM F-score (%) | Deep Learning Precision (%) | Deep Learning Recall (%) | Deep Learning F-score (%) |
---|---|---|---|---|---|---|
Accident | 96.55 | 93.33 | 94.92 | 100 | 100 | 100 |
Call | 96.15 | 83.33 | 89.29 | 90.32 | 93.33 | 91.80 |
Doctor | 90.63 | 96.67 | 93.55 | 93.75 | 100 | 96.77 |
Help | 96.55 | 93.33 | 94.92 | 100 | 93.33 | 96.55 |
Hot | 92.59 | 83.33 | 87.72 | 100 | 93.33 | 96.55 |
Lose | 96.30 | 86.67 | 91.23 | 96.67 | 96.67 | 96.67 |
Pain | 93.10 | 90.00 | 91.53 | 96.55 | 93.33 | 94.92 |
Thief | 68.29 | 93.33 | 78.87 | 93.75 | 100 | 96.77 |
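The per-class values in Table 3 can be computed from predicted and true labels with standard tooling; the scikit-learn based helper below is one possible way to do so and is not the authors' evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def per_class_report(y_true, y_pred, class_names):
    """Print per-gesture precision, recall and F-score (in %), as in Table 3."""
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=class_names, zero_division=0)
    for name, p, r, f in zip(class_names, precision, recall, fscore):
        print(f"{name:>8s}  P={100 * p:6.2f}%  R={100 * r:6.2f}%  F={100 * f:6.2f}%")
    print(f"Average accuracy: {100 * accuracy_score(y_true, y_pred):.2f}%")
```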
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
Acknowledgments
The author, Adithya V., thanks the Kerala State Council for Science, Technology and Environment (KSCSTE), Kerala, India, for the research fellowship. The authors express their gratitude to the Central University of Kerala, India, for the research support. The authors would also like to acknowledge all the individuals who participated in the data collection process.
Footnotes
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.dib.2020.106016.
Contributor Information
V. Adithya, Email: adithyaushas88@gmail.com.
R. Rajesh, Email: rajeshr@cukerala.ac.in.
References
- 1. Pisharady P.K., Saerbeck M. Recent methods and databases in vision-based hand gesture recognition: a review. Comput. Vis. Image Underst. 2015;141:152–165. doi:10.1016/j.cviu.2015.08.004.
- 2. Wadhawan A., Kumar P. Sign language recognition systems: a decade systematic literature review. Arch. Comput. Methods Eng. 2019. doi:10.1007/s11831-019-09384-2.
- 3. Cheok M.J., Omar Z., Jaward M.H. A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn. Cybern. 2019;10:131–153. doi:10.1007/s13042-017-0705-5.
- 4. Faculty of Disability Management and Special Education. Indian Sign Language (ISL) Dictionary. third ed. Ramakrishna Mission Vivekananda University; Coimbatore, India: 2016.
- 5. Faculty of Disability Management and Special Education, Ramakrishna Mission Vivekananda University, Coimbatore, India. http://www.indiansignlanguage.org, 2020 (accessed 26 April 2020).
- 6. Rifkin R., Klautau A. In defense of one-vs-all classification. J. Mach. Learn. Res. 2004;5:101–141.
- 7. Tang H., Xiao W., Liu H., Sebe N. Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion. Neurocomputing. 2019;331:424–433. doi:10.1016/j.neucom.2018.11.038.
- 8. Weeks M., Bayoumi M.A. Three-dimensional discrete wavelet transform architectures. IEEE Trans. Signal Process. 2002;50:2050–2063. doi:10.1109/TSP.2002.800402.
- 9. Khan A., Sohail A., Zahoora U., Qureshi A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020. doi:10.1007/s10462-020-09825-6.
- 10. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi:10.1162/neco.1997.9.8.1735.