Dataset for the recognition of Kurdish sound dialects

Karwan M Hama Rawf; Sarkhel H Taher Karim; Ayub O Abdulrahman; Karzan J Ghafoor

2024 Feb 22;53:110231. doi: 10.1016/j.dib.2024.110231

PMCID: PMC10907190 PMID: 38435729

Abstract

A dialect recognition system (DRS) is a significant subject within the field of speech analysis. The performance of speech recognition systems is adversely affected by speaker characteristics such as age, gender, and dialect. To address variation in dialect, a DRS can be incorporated into a speech recognition system, which can then select the appropriate speech recognition model for the identified dialect. Currently, there is a lack of datasets suitable for developing automatic dialect recognition systems for the Kurdish language. The proposed dataset is assessed using experimental data gathered by personnel of the Computer Science Department at the University of Halabja. Because the Kurdish language has three main dialects, Northern Kurdish (Badini variant), Central Kurdish (Sorani variant), and Hawrami, all three are included in the dataset.

Keywords: Dialect recognition, Kurdish dialect, Sorani, Hawrami, Badini

Specifications Table

Subject	Sound Recognition, Artificial Intelligence, Signal Processing, Multimedia
Specific subject area	This dataset is dedicated to the recognition and analysis of Kurdish sound dialects. It serves as a resource for researchers and linguists, facilitating the exploration of linguistic variation within the Kurdish language and advancing studies in Kurdish dialectology, sociolinguistics, and speech technology.
Data format	Raw Data, Segmented Data by Praat, Annotation, Matrix
Type of data	Sound (wave), Matrix (MFCC)
Data collection	The Kurdish Language Dialects (KuLD) dataset comprises three primary dialects: Sorani, Badini, and Hawrami. Samples were taken from various television broadcasts (Speda TV, NRT, and GK Sat). The sounds were converted into wave format, and each sample is exactly one second long. In addition, each sample was converted to a matrix.
Data source location	Samples were taken from various television broadcasts (Speda TV, NRT, and GK Sat) and collected by researchers at the University of Halabja, IKR/IRAQ.
Data accessibility	Repository name: Mendeley Data (A Dataset for the Classification of Different Kurdish Dialects) Data identification number: 10.17632/srkp2j4v93.1 Direct URL to data: https://data.mendeley.com/datasets/srkp2j4v93/1 Instructions for accessing these data: [1]
Related research article	K.J. Ghafoor, K.M. Hama Rawf, A.O. Abdulrahman, S.H. Taher, Kurdish Dialect Recognition using 1D CNN, ARO-The Scientific Journal of Koya University. 9 (2021) 10–14. doi: 10.14500/aro.10837 [2].

1. Value of the Data

• The dataset provides a sound database for recognizing the three main Kurdish dialects (Sorani, Badini, and Hawrami).
• The KuLD dataset is the first annotated dataset designed specifically for the dialects of the Kurdish language. It is of significant value to researchers in computer science and computational linguistics who have an interest in the Kurdish language and its dialects.
• Sharing the dataset publicly promotes research collaboration, validation, and reproducibility, thereby advancing linguistic comprehension and speech technology in the Kurdish language domain.

2. Data Description

Dialect recognition algorithms categorize spoken language into groups based on shared linguistic characteristics. Speech recognition accuracy is affected by the speakerʼs age, gender, and dialect. A speech recognition system can accommodate this variation by including a dialect recognition system: once the dialect is detected, the system can switch to the appropriate speech recognition model. Unfortunately, to the best of the authors’ knowledge, no suitable dataset is available for automated Kurdish dialect identification. This work aims to fill that gap [3]. In the context of human authentication and evidence, one of the most important forms of biometric information to collect is the individualʼs unique voice tone [4].

Different dialects of the Kurdish language are spoken in four major Middle Eastern nations [5]; the main Kurdish-speaking areas lie within the territories of Turkey, Iraq, Iran, and Syria. Three major dialects (Sorani, Badini, and Hawrami) [6] are spoken in the Kurdistan Region of Iraq. According to estimates, the total number of Kurdish speakers exceeds 40 million [7,8].

Within the domain of linguistic variation, both Dialect Recognition Systems (DRS) and Kurdish Sign Language (KuSL) recognition aim to improve the accessibility of communication. A DRS addresses the speaker-related factors that affect the accuracy of speech recognition and can be incorporated into speech recognition systems to accommodate different dialects. The KuSL identification system detects Kurdish signs with high precision using a customised convolutional neural network (CNN) structure. Although the two domains use distinct data formats, images for signs and audio waves for dialects, both rely on feature extraction, pattern identification, and deep learning. This parallel effort opens prospects for integrated frameworks that address language variety in a comprehensive way [9].

Collecting data is a prerequisite for any model built with a machine learning technique. The proposed dataset was collected by a number of faculty members in the Computer Science Department at the University of Halabja, and standard procedures and guidelines were followed throughout the data collection process. Speaker age and gender were used as additional conditioning factors, since including such constraints during data collection may lead to a more generalizable model. The recordings come from various TV shows and TV interviews broadcast by Speda TV, NRT, and GK Sat.

The folder "A Dataset for the Classification of Different Kurdish Dialects" contains two main subdirectories, namely the Kurdish Language Dialects (KuLD) Dataset (Matrix files) and the Kurdish Language Dialects (KuLD) Dataset (Sound files), as shown in Fig. 1. The KuLD Dataset (Matrix) subdirectory is further divided into three subdirectories, namely Badini, Hawrami, and Sorani, each containing 2000 matrix files for the corresponding dialect. In parallel, the KuLD Dataset (Sound) subdirectory also consists of three subdirectories, one per dialect, each containing a sequentially numbered series of 2000 sound files. Every sample lasts exactly 1 s, so the dataset is 6000 s long in total.

Fig. 1. The proposed dataset structure.
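The layout described above can be checked programmatically after downloading the data. The following is a minimal sketch, not part of the published dataset tooling: the dialect folder names are taken from the description above and may differ in the actual archive, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed folder names, taken from the textual description of the dataset;
# the names inside the downloaded archive may differ.
DIALECTS = ("Badini", "Hawrami", "Sorani")
SAMPLES_PER_DIALECT = 2000
SECONDS_PER_SAMPLE = 1


def expected_totals():
    """Return (number of samples, total duration in seconds) for the full dataset."""
    n_samples = len(DIALECTS) * SAMPLES_PER_DIALECT
    return n_samples, n_samples * SECONDS_PER_SAMPLE


def count_files(root):
    """Count the files in each dialect subdirectory of `root`.

    A missing subdirectory is reported as 0 rather than raising an error,
    so the result can be compared directly with SAMPLES_PER_DIALECT.
    """
    counts = {}
    for dialect in DIALECTS:
        folder = Path(root) / dialect
        if folder.is_dir():
            counts[dialect] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[dialect] = 0
    return counts
```

Calling `count_files` once on the sound subdirectory and once on the matrix subdirectory, and comparing each count with `SAMPLES_PER_DIALECT`, gives a quick completeness check of a local copy.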

Based on preliminary experiments, 2000 sound units per dialect were determined to be a sufficient sample size for dialect recognition. In these experiments, various machine-learning algorithms were trained on different sample sizes and evaluated on a held-out test set. Although this sample size is relatively small, it is sufficient for the purposes of this study, and steps were taken to ensure the dataset's representativeness of Kurdish dialect diversity by collecting data from various sources.

3. Experimental Design, Materials and Methods

The KuLD sound dataset was collected from people of different ages who appeared on television programs, and it was gathered at the University of Halabja, IKR/IRAQ. Various tools are used in the conversion of video content into waveform audio representations. The speaker's speech was segmented into one-second intervals after extracting just a limited number of phrases. The overall quantity of sound units per dialect amounts to 2000, whereas the cumulative number of sound units within the collection reaches 6000.

Praat played an essential part in the acoustic analysis used to identify and classify the Kurdish dialects. This section describes the procedures involved in sound analysis and recognition. First, data covering a range of Kurdish dialects were compiled in order to capture a wide array of linguistic traits; the data sources are described in Section 3.1. The captured audio was then preprocessed in Praat, which covered segmentation, labeling, and feature extraction; the use of Praat for each stage is described in Section 3.2. Mel-frequency cepstral coefficients (MFCCs) were an essential component of the approach: each wave file was converted into MFCCs through spectrum analysis in Praat, giving a representation of the sound that is better suited to machine learning methods. Finally, the MFCCs were transformed into a matrix, so that each matrix represents a single second of audio.

3.1. Data collection

Modern advances in technology have completely altered the process of identifying a specific dialect. Researchers can better analyse and detect dialectal speech using modern algorithms and tools driven by machine learning, deep learning, and artificial intelligence. These advances help preserve linguistic variety, cultural history, and cross-cultural communication by improving dialect pattern interpretation.

Finding a good data source is usually the first task when building a database. The primary focus here is on collecting high-quality samples of properly spoken Kurdish dialects from several sources. Kurdish speech was obtained from various TV programs, such as interviews, news, debate analysis, and documentary programs. The gathered material was originally produced in a wide range of media formats, including video, audio, and various sound file extensions. It was then exported as wave files in preparation for the segmentation process.

During the data collection phase, strict guidelines were implemented to maintain data quality. Precise recording protocols were used, and a comprehensive evaluation was undertaken to detect and correct any conflicts or irregularities in the recorded audio. To enhance the validity of the dataset, a portion of the recorded sound units underwent a careful assessment by researchers who had received specialized training. Samples that showed anomalies or were of poor quality were identified and then re-evaluated or excluded from the dataset.

3.2. Sound segmentation and pre-processing

The obtained data is processed and segmented using Praat, a highly versatile program for speech analysis. It provides a diverse array of standard and non-standard techniques, encompassing spectrographic analysis and neural networks. The original signal is divided into portions, and one second is taken from each portion that has a strong intensity signal, as illustrated in Fig. 2.

Fig. 2. Praat layout for selecting a one-second wave sound from the original longer sound.
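The selection shown in Fig. 2 was done by hand in Praat. As a rough illustration only, the sketch below automates a similar idea with the Python standard library: it slides a one-second window over a recording and keeps the window with the highest energy. It assumes 16-bit mono PCM input, treats summed squared amplitude as a stand-in for "strong intensity", and uses hypothetical function names; it is not the authors' procedure.

```python
import struct
import wave


def read_mono_16bit(path):
    """Read a 16-bit mono PCM WAV file and return (sample_rate, samples)."""
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch assumes 16-bit mono PCM input")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # WAV stores samples as little-endian signed 16-bit integers.
    return rate, list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def loudest_second(samples, rate):
    """Return (start_index, segment) for the one-second window with the most energy.

    Energy is the sum of squared sample values; it is updated incrementally
    as the window slides forward by one sample at a time.
    """
    if len(samples) < rate:
        raise ValueError("recording is shorter than one second")
    energy = sum(s * s for s in samples[:rate])
    best_energy, best_start = energy, 0
    for start in range(1, len(samples) - rate + 1):
        energy += samples[start + rate - 1] ** 2 - samples[start - 1] ** 2
        if energy > best_energy:
            best_energy, best_start = energy, start
    return best_start, list(samples[best_start:best_start + rate])


def write_mono_16bit(path, rate, samples):
    """Write samples as a 16-bit mono PCM WAV file."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

A typical use would read a longer excerpt with `read_mono_16bit`, pass the result to `loudest_second`, and save the returned segment with `write_mono_16bit`. Manual selection, as used for KuLD, has the advantage that a listener can reject windows that are loud but contain music or overlapping speakers.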

The selected sound is then saved as a wave file. In the second step, the wave file is opened in Praat and converted to an MFCC file by analyzing the spectrum. Finally, the MFCC file is transformed into matrix format, which is the last stage in producing one second of sound data, as shown in Fig. 3a and 3b.

Fig. 3a. Annotating signals and doing spectrum analysis for an MFCC file.

Fig. 3b. Using the Praat framework to convert an MFCC file to a matrix.
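To make the wave-to-MFCC-to-matrix step more concrete, the following sketch computes a textbook MFCC matrix in plain Python (framing, Hamming window, power spectrum, mel filterbank, logarithm, cosine transform). It is an illustration under stated assumptions, not a reimplementation of Praat: the number of coefficients, window length, time step, and number of filters are illustrative defaults chosen here, the source does not state the settings used for KuLD, and Praat's own algorithm differs in its details. The naive DFT is slow and is only meant for short signals.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * f / sample_rate) for f in edges_hz]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling slope
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc_matrix(samples, sample_rate, n_coeff=12, win_s=0.015, step_s=0.005,
                n_filters=20):
    """Return a list of frames, each a list of n_coeff cepstral coefficients.

    All parameter defaults are illustrative assumptions, not the KuLD settings.
    """
    frame_len = int(round(win_s * sample_rate))
    hop = int(round(step_s * sample_rate))
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        spec = power_spectrum(samples[start:start + frame_len])
        log_energy = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                      for row in bank]
        # DCT-II of the log filterbank energies, skipping the 0th coefficient.
        rows.append([sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                         for m, e in enumerate(log_energy))
                     for c in range(1, n_coeff + 1)])
    return rows
```

The result has one row per analysis frame and one column per coefficient, which is the sense in which a one-second sample becomes a matrix. In practice an FFT-based library would replace the naive DFT used here.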

In this study, the research team annotated and preprocessed the speech data by hand. Instead of using Python code for labeling and preprocessing, the team chose manual annotation in Praat, a software application developed specifically for speech analysis. This choice gave the authors a more complete and precise understanding of the speech data and enabled them to classify and refine annotations with greater accuracy, capturing the subtleties of speech features. Using Praat's functionality, the team prepared the voice dataset carefully, with the aim of producing high-quality data.

3.3. Labelling

The process of dataset labeling plays a crucial role in the building of new datasets. It involves providing meaningful and specific labels to the particular data within a dataset. The process includes the categorization or classification of data according to specific criteria or characteristics. The proposed dataset labeling process consists of two sequential steps. Firstly, the sounds of each dialect, namely Sorani, Badini, and Hawrami, should be separated into individual folders. Subsequently, the files within these folders should be labeled from 1 to 2000 according to the following scheme: for Sorani files, the labels should range from s1 to s2000; for Badini files, the labels should range from b1 to b2000; and for Hawrami files, the labels should range from h1 to h2000.
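The naming scheme above maps directly onto a small helper for generating and parsing labels. This is a minimal sketch with hypothetical function names; the file extension is not specified in the text, so the parser simply ignores whatever extension is present.

```python
import re
from pathlib import Path

# Prefix-to-dialect mapping and label range taken from the scheme described above.
PREFIX_TO_DIALECT = {"s": "Sorani", "b": "Badini", "h": "Hawrami"}
SAMPLES_PER_DIALECT = 2000


def label_names(prefix, count=SAMPLES_PER_DIALECT):
    """Return the labels prefix1 .. prefix<count>, e.g. s1 .. s2000 for Sorani."""
    if prefix not in PREFIX_TO_DIALECT:
        raise ValueError(f"unknown dialect prefix: {prefix!r}")
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def parse_label(filename):
    """Return (dialect, index) for a file name such as 's17.wav' or 'h2000'."""
    stem = Path(filename).stem
    match = re.fullmatch(r"([sbh])([1-9]\d*)", stem)
    if match is None:
        raise ValueError(f"not a valid label: {filename!r}")
    index = int(match.group(2))
    if index > SAMPLES_PER_DIALECT:
        raise ValueError(f"index out of range 1..{SAMPLES_PER_DIALECT}: {filename!r}")
    return PREFIX_TO_DIALECT[match.group(1)], index
```

Because the dialect is encoded in both the folder and the file name, `parse_label` can also be used to cross-check that no file has been placed in the wrong dialect folder.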

Limitations

Building a sound dataset for dialect detection presents limitations in terms of data availability, speaker variability, annotation issues, data bias, restricted scope, data privacy and ethics, and scalability. The challenges include limited access to varied data across dialects, dealing with speaker-related variation, guaranteeing correct annotations, addressing biases, taking the extent of the dataset into consideration, and protecting privacy and ethical principles. Overcoming these restrictions through thorough curation, diversified representation, correct annotation, attention to ethics, and scalability ensures the quality and dependability of the sound collection for dialect identification research.

Although some of these limitations may have been addressed for the Kurdish dialects widely spoken in the Kurdistan Region of Iraq, namely Sorani, Badini, and Hawrami, the development of datasets for other Kurdish dialects, including Zazaki, Northern Kurmanji, Bakhtyari, and Luri, still faces the challenges outlined above. The scarcity and inconsistency of data for less widely spoken varieties may continue to impede the creation of extensive databases. A further limitation was the difficulty of locating an appropriate data source for the Hawrami dialect, which led to a dataset of 2000 samples per dialect. Although this is smaller than the datasets used in some studies, it is larger than the datasets typically used for dialect recognition. To overcome the size constraint, in-person interviews with speakers fluent in the Hawrami dialect could be considered. However, this approach presents additional challenges, including the substantial cost of face-to-face interviews, the difficulty of acquiring the necessary permissions, and the considerable time commitment needed.

Ethics Statement

Ethical considerations for dialect recognition datasets include informed consent from participants, privacy and data security, cultural sensitivity, research transparency, institutional review, and data sharing and openness. These ethical guidelines preserve participants' rights, maintain study integrity, and promote responsible and accountable dialect recognition sound dataset building.

This dataset includes information gathered from TV broadcasts (Speda TV, NRT, and GK Sat). These programs are freely available to everyone, which made it possible to take samples from them. We certify that the information collected is not being used for dishonest or illegal purposes.

In order to effectively address the ethical considerations associated with data collection, several measures were implemented. Firstly, prior informed consent was obtained from all participants who voluntarily participated in the interview process. Secondly, a secure server was utilized to store the collected data, ensuring its confidentiality and protection. Additionally, consultations were conducted with members of the communities from which the data was being gathered, ensuring that the data collection process adhered to cultural norms and values, thereby promoting cultural respect. Precautionary measures were implemented to mitigate any risk to the participants and to guarantee that the data collected was exclusively used for the studyʼs objectives.

CRediT authorship contribution statement

Karwan M. Hama Rawf: Conceptualization, Data curation, Writing – original draft, Software. Sarkhel H. Taher Karim: Data curation, Writing – review & editing. Ayub O. Abdulrahman: Visualization, Data curation, Investigation, Writing – review & editing. Karzan J. Ghafoor: Data curation, Methodology, Validation, Software.

Acknowledgments

The authors would like to extend their sincere thanks to everyone who helped construct this dataset, both within and outside the University of Halabja. We also express our appreciation to the television broadcasters Speda TV, NRT, and GK Sat.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

A Dataset for the Classification of Different Kurdish Dialects (Original data) (Mendeley Data).

References

1. Rawf K.M., Kareem S., Abdulrahman A., Ghafoor K. A dataset for the classification of different Kurdish dialects. Mendeley Data. 2023;V2. doi: 10.17632/srkp2j4v93.2.
2. Ghafoor K.J., Hama Rawf K.M., Abdulrahman A.O., Taher S.H. Kurdish dialect recognition using 1D CNN. ARO- Sci. J. Koya Univ. 2021;9(2):10–14. doi: 10.14500/aro.10837.
3. Işik G., Artuner H. A dataset for Turkish dialect recognition and classification with deep learning. 2018 26th Signal Processing and Communications Applications Conference (SIU), Izmir, Turkey. 2018; pp. 1–4.
4. Abdul Z.K. Kurdish speaker identification based on one dimensional convolutional neural network. Comput. Methods Differ. Eq. 2019;7(4):566–572.
5. Abdalla P.A., Qadir A.M., Shakor M.Y., Saeed A.M., Jabar A.T., Salam A.A., Amin H.H.H. A vast dataset for Kurdish handwritten digits and isolated characters recognition. Data Br. 2023;47. doi: 10.1016/j.dib.2023.109014.
6. Al-Talabani A., Abdul Z., Ameen A. Kurdish dialects and neighbor languages automatic recognition. ARO- Sci. J. Koya Univ. 2017;5(1):20–23. doi: 10.14500/aro.10167.
7. Eppler E., Benedikt J. A perceptual dialectological approach to linguistic variation and spatial analysis of Kurdish varieties. J. Linguist. Geogr. 2017;5(2):109–130. doi: 10.1017/jlg.2017.6.
8. Badawi S., Saeed A.M., Ahmed S.A., Abdalla P.A., Hassan D.A. Kurdish News Dataset Headlines (KNDH) through multiclass classification. Data Br. 2023;48. doi: 10.1016/j.dib.2023.109120.
9. Hama Rawf K.M., Mohammed A.A., Abdulrahman A.O., Abdalla P.A., Ghafor K.J. A comparative study using 2D CNN and transfer learning to detect and classify arabic-script-based sign language. Acta Inform. Malaysia. 2023;7(1):08–14. doi: 10.26480/aim.01.2023.08.14.
