Abstract
Sound Event Classification (SEC) and Sound Event Detection (SED) are gaining momentum across various domains. With the rise of machine learning, identifying specific sound sources amid background noise in outdoor environments has become a major focus. Recognizable sound types are many and vary depending on context, ranging from vehicles, trains, and aircraft to human and animal activity. This work introduces two open-access datasets, DataSEC and DataSED, created to address gaps identified in the existing dataset literature. Together, they provide over 35 hours of authentic, non-synthesized .wav audio, collected from sound level meter measurements and online repositories. DataSEC consists of 4292 audio samples, with each sample representing a single event that has been classified into one of 22 defined sound classes and 28 subclasses. DataSED comprises 712 real-world recordings containing multiple events, accompanied by over 4000 labels provided in .csv format. These datasets extend across a range of urban to rural environments and have been designed to support research in real-world sound event classification and automated analysis of environmental noise.
Subject terms: Acoustics, Scientific data
Background & Summary
The automation of acoustic analysis represents a rapidly evolving interdisciplinary domain of research, combining acoustics, signal processing, and machine learning. Two particularly important tasks in audio analysis are Sound Event Classification (SEC), which involves the classification and assignment of categorical labels to audio segments with the aim of identifying sound sources, and Sound Event Detection (SED), which determines the occurrence of sound events within the time component of a recording and includes the precise timestamp of an event’s onset and offset. Together, the two tasks address the two-fold problem of classifying acoustic information and specifying the respective temporal characteristics across various auditory scenes. SED has gained significant research attention over the past several years and is central to many applications, including urban acoustic planning1, healthcare2–4, bioacoustics monitoring5–7, surveillance in security8, multimedia event detection9, large-scale event analysis10, industrial noise monitoring11, smart-home technology12 and wildlife monitoring13.
Although deep learning has transformed image recognition and many other fields, progress in recognising everyday sounds has not been as rapid, partly because standardised, large-scale audio datasets are lacking14–16. The number of applications is such that new neural networks for SEC and SED are continually being developed, and older ones need to be updated. Both processes continuously require detailed and tuned datasets, which become extremely valuable because they should be as specific as possible. However, the currently available datasets do not take into account the complexity of realistic environments, which include multiple sounds or mixtures of sounds with a constantly present background noise17,18. This is particularly true for outdoor applications, where the lack of high-quality, large and diverse datasets is recognised as a significant challenge for SED19,20.
In recent years, several datasets have been developed to support sound event classification (SEC) and sound event detection (SED) tasks, including UrbanSound21 and UrbanSound8K21, ESC-5014, AudioSet22, and FSD50K17. Many of these datasets were partially constructed using audio clips from collaborative platforms such as Freesound.org23, which has become a widely used resource for real-world soundscape data. These datasets have significantly contributed to the advancement of the field and are commonly used for benchmarking and training. However, they also present limitations that can reduce their applicability to real-world environmental noise analysis. These include a limited number of classes14, the use of synthetic or pre-processed content18, class imbalance17, constrained clip durations19, limited contextual diversity17,18, and lack of access to original recordings18. Such limitations are particularly relevant for outdoor environments, where overlapping sounds, fluctuating background noise, and high acoustic variability make sound detection and classification especially challenging17–20. While synthetic datasets have been proposed to address some of these issues, they often fail to reflect the acoustic complexity and unpredictability of real-world soundscapes24–28.
The creation of an environmental sound database is a complex task that requires the collaboration of multidisciplinary expertise in different fields to make it manageable. There is a considerable amount of effort to create benchmark databases for use in environmental acoustics.
To address the main issue raised, this work presents DataSEC29 and DataSED30, two datasets in the form of .wav files, one specifically created for SEC and the other for SED of outdoor environmental noise. Both consist of real-world measurements covering a wide variety of sounds that can be heard outdoors in both urban and rural environments.
The principal task in methodological development, specifically in the developing field of machine learning applied to environmental noise analysis, is the proper definition of sound event categories (classes), which is a salient distinction of these datasets from others. The dataset has been categorised into 22 macro classes of events, some of which are further organised into sub-categories that share characteristics or can be considered the same for the purposes of the work. The subdivision of the dataset into subfolders facilitates the adaptation of the dataset to suit specific objectives of other potential users.
The main purpose of these datasets is to train, validate and test neural networks capable of recognising sound events. This can be used either to search for specific sounds in the environment, or to identify sounds to be removed from a larger track. Either purpose would avoid the need for manual identification, which is often costly and impractical. The main focus is on environmental acousticians who make long noise measurements and need to label specific sounds or remove unwanted sounds. At the same time, they may be of interest to, for example, ecologists studying bird/animal populations, who will collect thousands of hours of field recordings, but whose measurements will certainly be affected by many unwanted sounds that need to be removed15.
The work’s notable strengths lie in the rigorous methodology employed, with all samples drawn from both short and long measurements obtained from either online repositories or by the authors themselves using Class I sound level meters. Furthermore, the authors subjected all the processed audio files, amounting to around 35 hours of audio samples (around 18 from DataSEC and 17 from DataSED), to meticulous listening, review and processing, resulting in a more descriptive and refined dataset.
Each track in DataSEC may comprise a single event or multiple events of a single sound class. Conversely, DataSED consists of longer recordings that contain one or more sounds, either from the same class or from different classes, accompanied by background noise. DataSED is uploaded in two versions. The first version does not contain overlapping events of different classes, such that each moment in time is assigned to a single class only. In contrast, the second version incorporates overlapping events from multiple classes, thereby providing a more realistic representation of real-world conditions. These two versions are offered to support the development and evaluation of two types of applications: monophonic sound detection and polyphonic sound detection31. To mitigate the subjectivity inherent in the labelling process, the authors have performed ground truth annotations according to the 22 class divisions on the SED dataset, ensuring consistent evaluation. While all audio files are in .wav format, the ground truth labelling is in .csv format. The SEC and SED datasets consist of 4292 and 712 .wav files, respectively. The SED dataset comprises 4034 labels in its polyphonic version and 4309 labels in its monophonic version.
Methods
As the objective of the datasets is to provide a realistic depiction of environmental sounds occurring in outdoor real-world scenarios, it was essential to adhere to the principle of realism during the construction of the datasets. The design of the datasets should be such that it captures realistic soundscapes with their variability, while remaining descriptive in terms of distinct sound sources. Indeed, the principal objective of the dataset is to provide a comprehensive description of sounds occurring in the context of outdoor environmental noise evaluation, where the localization of the sound source or the identification of unwanted sound events must be precise. These datasets are of paramount importance in the development of a machine learning model for the purpose of classifying and detecting real-world environmental sound events.
In consideration of the variability of sounds that can be observed in nature and that may be present during an environmental noise measurement, it is necessary to undertake a preliminary step of subdivision into similar classes. The concept of similarity is employed in two distinct ways. Firstly, it is considered within the context of human perception, disregarding an analysis of the spatial and temporal characteristics of the sound. Secondly, the evaluation is conducted in accordance with the regulatory framework governing the assessment of noise impact. The classification of subjects is a pivotal aspect, as certain assessments do not necessitate the delineation of numerous classes, while in other instances, it is imperative. Furthermore, it is imperative to acknowledge that the number of classes inevitably complicates the task of SEC and SED. In both DataSEC and DataSED, the authors believe that a sufficient number of classes have been selected for consideration as distinct entities, thereby resolving the issues observed in previous datasets that lacked this distinguishing attribute.
The choice was of paramount importance in order to ensure an adequate number of classes that would allow for the identification of the different sources while also taking into account the need to consider sounds within them as similar, despite the variability exhibited by these sounds. It is acknowledged that the classes included in these sets may be subject to modification, whether in the present or at a future point.
Table 1 shows the 22 classes that were identified. Certain classes are composed of a diverse ensemble of sounds, which are regarded as equal within the domain of environmental acoustics. Conversely, other classes exhibit a delineation into subclasses due to the presence of sounds that diverge significantly from one another. However, for the purposes of this study, it was deemed sufficient to categorise them within the same macroclass. In such cases, the classification system is subdivided into 28 subclasses. This facilitates the identification of folders containing the relevant sounds by interested users of the datasets, and allows for the separation of these classes if necessary.
Table 1.
Classes of sound used in both DataSEC and DataSED.
| Class | Subclasses |
|---|---|
| Bells | — |
| Birds | — |
| Cat fights and moans | — |
| Chicken coop | — |
| Cicadas and crickets | Cicadas, Crickets |
| Crows seagulls and magpies | Crows, Seagulls, Magpies |
| Dog barkings and howlings | — |
| Glass breaking | — |
| Horn | — |
| Jet aircrafts | — |
| Lawn mower brush cutter and olive shaker | Lawn mower, Brush cutter, Olive shaker |
| Music | — |
| Propeller aircrafts | Airplanes, Helicopters |
| Sirens and alarms | Sirens, Alarms |
| Thunder fireworks and gunshot | Thunder, Fireworks, Gunshot |
| Train | — |
| Vacuum cleaner fan and hairdryer | Vacuum cleaner, Fan, Hairdryer |
| Vehicle idling | Car-truck idling, Motorbike idling |
| Vehicle pass-by | Car pass-by, Motorbike pass-by, Truck pass-by |
| Voices | — |
| Wind turbine | — |
| Workshop | Air compressor, Drill, Grinder, Jackhammer, Saw |
‘Bells’ are a clearly identifiable, occasional sound that can occur in both urban and rural settings, altering the local noise. ‘Birds’ are very common, but they are kept separate from ‘Crows seagulls and magpies’, whose calls are quite distinct from those of other birds. In fact, their calls can be much higher pitched and limited in time, which makes them recognisable. The calls of dogs and cats can also significantly alter noise levels at certain times, as can other natural sounds such as those in ‘Cicadas and crickets’, which are similar to each other. Since ‘Birds’ and ‘Cicadas and crickets’ sounds are almost constantly present in some recordings, rather than aiming to detect individual events, one could instead focus on characterizing the sound in the frequency domain and removing only the specific frequency bands associated with this noise. This approach would retain the affected measurement segment, albeit in modified form. In this context, the corresponding classes from the datasets presented in this article could prove useful. In the countryside, it is easy to find ‘Chicken coops’ and roosters. ‘Glass breaking’ is a loud noise heard in the vicinity of glass recycling bins, an intense and clearly identifiable sound. ‘Horn’, ‘Music’, ‘Voices’ and ‘Train’ are rather straightforward sounds. ‘Lawn mower brush cutter and olive shaker’ are all sounds that become frequent in the spring-summer period for gardening, public works or olive (or other) harvesting in both urban and rural environments. ‘Sirens and alarms’ are similar sounds, spurious in nature. Likewise, ‘Thunder fireworks and gunshot’ groups three different but similar sounds that are probably always anomalous and disturb measurements. ‘Vacuum cleaner fan and hairdryer’ are all sounds that can occur in noise measurements made near dwellings, with noises emitted inside propagating outside through open windows. The ‘Workshop’ class includes the most frequent sounds related to mechanical, construction, or industrial work.
‘Wind turbine’ is a very particular sound that is difficult to identify, but is now widespread in the countryside. Vehicle noise, on the other hand, has been divided into pass-by and idling because in some cases it is important to distinguish them, and because they are different sounds. These were then also divided into subclasses by vehicle type (motorbike, car, truck). A particular example of the hierarchical organisation is that of aircraft noise, which has been further divided into jet aircraft and propeller aircraft. The latter category encompasses both propeller airplanes and helicopters. The two propulsion systems have different operational characteristics, so jet engines and rotary mechanisms generate sound in distinctly different ways.
Using these classes as a basis, the authors then attempted to populate the datasets uniformly wherever possible. All of the audio samples included in both datasets are real recordings and do not contain any synthetic sounds. Field recordings made with Class I sound level meters were the main source of the measurements; mobile recorders were employed only in a very limited number of cases. The portion of the data that was not directly measured by the authors was retrieved from online repositories. For each class, except for voice and music, these data repositories are Freesound.org23 and AudioSet22. The authors consistently ensured that the files were released under Creative Commons licences (CC0, BY or BY-NC for Freesound.org, and CC BY 4.0 or CC BY-SA 4.0 for AudioSet). In all cases, for both DataSEC and DataSED, the files have been modified from their original versions: they have been cut, pasted, mixed, manually edited, and had their sampling rate, file type, and file name altered. The “Music” and “Voices” classes in DataSEC were filled with entries taken from two well-structured and detailed datasets, respectively: the MTG-Jamendo Dataset32 (https://mtg.github.io/mtg-jamendo-dataset/), under CC BY-NC-SA 3.0 and CC BY-SA licences, and CLIPS33 (https://github.com/paperswithcode/paperswithcode-data?tab=readme-ov-file), used with specific authorization. Both classes have many more entries than the others, but the authors decided not to reduce their number to match the others, in order to keep up with the latest research in voice and music recognition, two areas that have been well studied.
Each sound sample was assessed manually by the authors, who carefully listened to the sound files. All the audio files were standardised to a sampling rate of 44.1 kHz, mono-channel and .wav file type to ensure consistency across the recordings. The duration of each file is not fixed, as it is meant to be coherent with the natural duration of the events.
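The standardisation stated above (44.1 kHz, mono, .wav) can be checked on a downloaded copy with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the function names are illustrative, and it assumes the files are PCM-encoded so that the standard-library `wave` module can read them.

```python
import wave

EXPECTED_RATE = 44_100   # Hz, as stated for both datasets
EXPECTED_CHANNELS = 1    # mono


def describe_wav(path):
    """Return (sample_rate, channels, duration_s) for a PCM .wav file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_standardised(path):
    """True if the file is mono and sampled at 44.1 kHz."""
    rate, channels, _ = describe_wav(path)
    return rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
```

Because file durations are not fixed, the duration returned by `describe_wav` is also the quantity a user would need when batching or padding clips for training.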
All the tracks were then subjected to a meticulous listening process by the operators, utilising high-quality headphones in a silent environment. The tracks were labelled to assign a class when a particular one was identified. The term “label assignment” takes on different meanings for the classification and detection datasets. In the former case it consists in the sole assignment of a class to each measure, while in the latter it also provides the initial and final points of the designated class within the track. These operations, which are both extensive and time-consuming, were performed manually. For the detection labelling, the process involved listening to the track and using the “Labelling tool for audio files”, a Python-based tool specifically created34 to write the .csv file with the labels, as illustrated in Fig. 1.
Fig. 1.
Example of labelling procedure.
Data Record
DataSEC - dataset for sound event classification of environmental noise
DataSEC29 is stored on Zenodo, at 10.5281/zenodo.15340689. It consists of 4292 sounds, corresponding to real case recordings (18 hours and 26 minutes in total), which have been standardized as mono-channel .wav files with a sampling rate of 44.1 kHz. All samples have been meticulously examined by the authors, and all files were pre-processed to eliminate silences and superfluous parts, such as irrelevant background activity or overlapping sounds. This was done in order to enhance clarity and place greater emphasis on the target acoustic events. Consequently, each recording comprises a single sound event of a single class, with no sounds overlapping. It should be noted that background noise may be present during the specific event due to its inherent variability as a natural phenomenon, but it will not be present outside of the event.
The dataset has been constructed with the objective of achieving a minimum of 50 entries for each class, with a minimum of 20 entries for each subclass. In Table 2 the principal information is delineated, including the number of audio samples per class, the minimum, maximum and average duration of events, and also the total duration in seconds of the events for each class.
Table 2.
General properties of DataSEC.
| Class/Subclass | Min Duration [s] | Max Duration [s] | Avg. Duration [s] | Sum. Duration [s] | Number of audio samples |
|---|---|---|---|---|---|
| Bells | 2.7 | 73.1 | 28.6 | 1914.6 | 67 |
| Birds | 2.0 | 80.0 | 13.2 | 907.8 | 69 |
| Cat fights and moans | 2.4 | 67.0 | 23.4 | 1170.6 | 50 |
| Chicken coop | 1.0 | 55.1 | 12.0 | 693.6 | 58 |
| Cicadas and crickets | |||||
| /cicadas | 5.0 | 77.4 | 22.5 | 1215.6 | 54 |
| /crickets | 4.2 | 73.6 | 23.8 | 476.3 | 20 |
| Crows seagulls and magpies | |||||
| /crows | 2.4 | 169.6 | 25.1 | 1257.0 | 50 |
| /seagulls | 5.6 | 198.4 | 25.6 | 896.6 | 35 |
| /magpies | 1.5 | 85.4 | 26.3 | 551.8 | 21 |
| Dog barkings and howlings | 0.4 | 37.7 | 9.6 | 672.6 | 70 |
| Glass breaking | 0.2 | 60.5 | 36.6 | 3993.6 | 109 |
| Horn | 0.3 | 45.2 | 4.9 | 281.4 | 57 |
| Jet aircrafts | 7.2 | 87.3 | 27.8 | 2866.4 | 103 |
| Lawn mower brush cutter and olive shaker | |||||
| /lawn mower | 17.3 | 86.7 | 39.9 | 837.3 | 21 |
| /brush cutter | 7.2 | 248.9 | 47.2 | 4063.0 | 86 |
| /olive shaker | 9.2 | 90.1 | 28.1 | 561.5 | 20 |
| Music | 30.0 | 30.0 | 30.0 | 7350.0 | 245 |
| Propeller aircrafts | |||||
| /airplanes | 5.0 | 82.5 | 23.8 | 1188.0 | 50 |
| /helicopters | 5.8 | 228.1 | 42.5 | 2462.4 | 58 |
| Sirens and alarms | |||||
| /sirens | 2.4 | 99.1 | 16.8 | 1139.5 | 68 |
| /alarms | 1.9 | 84.0 | 20.7 | 765.8 | 37 |
| Thunder fireworks and gunshot | |||||
| /thunder | 3.7 | 30.9 | 13.2 | 342.9 | 26 |
| /fireworks | 0.6 | 174.6 | 24.4 | 1660.3 | 68 |
| /gunshot | 0.4 | 6.7 | 2.0 | 285.9 | 141 |
| Train | 3.9 | 230.1 | 35.2 | 2356.8 | 67 |
| Vacuum cleaner fan and hairdryer | |||||
| /fan | 2.7 | 301.0 | 43.1 | 1378.8 | 32 |
| /hairdryer | 6.4 | 123.0 | 20.5 | 657.0 | 32 |
| /vacuum cleaner | 10.3 | 163.0 | 52.4 | 1625.4 | 31 |
| Vehicle idling | |||||
| /car truck idling | 2.4 | 233.1 | 36.0 | 2912.4 | 81 |
| /motorbike idling | 5.0 | 481.5 | 39.1 | 1210.8 | 31 |
| Vehicle pass-by | |||||
| /car pass-by | 1.5 | 38.3 | 5.2 | 570.6 | 110 |
| /motorbike pass-by | 1.1 | 226.9 | 13.2 | 749.4 | 57 |
| /truck pass-by | 3.1 | 303.1 | 20.4 | 1019.4 | 50 |
| Voices | 0.2 | 29.9 | 3.2 | 6055.2 | 1900 |
| Wind turbine | 10.0 | 59.3 | 11.5 | 1146 | 100 |
| Workshop | |||||
| /air compressor | 0.6 | 135.5 | 28.6 | 859.2 | 30 |
| /drill | 2.4 | 81.8 | 20.3 | 913.8 | 45 |
| /grinder | 3.1 | 233.4 | 28.9 | 1097.4 | 38 |
| /jackhammer | 1.6 | 145.5 | 31.2 | 1966.8 | 63 |
| /saw | 1.9 | 79.1 | 14.7 | 617.4 | 42 |
The audio files have been methodically organised into folders that are specific to each class, with these folders being based on the 22 environmental acoustics classes and 28 subclasses. The audio files have been renamed according to a naming convention that followed the format of “category name - 001”, “category name - 002” and so on, allowing easy tracking and retrieval of files during model training. The dataset was provisioned in a format that could provide unambiguous acoustic events.
In the context of the SEC task, the dataset has been meticulously designed to support machine learning models. The isolation of individual events and the implementation of a systematic naming procedure are crucial elements in optimising the effectiveness of dataset for machine learning workflows. These measures are expected to facilitate the rapid extraction of features during the model training process.
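The folder organisation described above lends itself to deriving class and subclass labels directly from file paths. The sketch below assumes a layout of the form `<root>/<class>/<file>.wav` for classes without subclasses and `<root>/<class>/<subclass>/<file>.wav` otherwise; the exact folder names in the archive and the `index_dataset` helper are assumptions made for illustration.

```python
from pathlib import Path


def index_dataset(root):
    """Map each .wav file under root to its class and optional subclass folder."""
    root = Path(root)
    records = []
    for wav_path in sorted(root.rglob("*.wav")):
        parts = wav_path.relative_to(root).parts
        if len(parts) < 2:
            continue  # a file sitting directly in root has no class folder
        class_name = parts[0]
        # A file nested one level deeper sits inside a subclass folder.
        subclass = parts[1] if len(parts) > 2 else None
        records.append(
            {"path": wav_path, "class": class_name, "subclass": subclass}
        )
    return records
```

A user who wants to treat subclasses as separate targets can train on the `subclass` field where it is set and fall back to `class` elsewhere.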
DataSED - dataset for sound event detection of environmental noise
DataSED30 is also stored on Zenodo, at 10.5281/zenodo.15346092. It was developed to enable the temporal analysis of multiple sound events in real-world acoustic scenarios. Unlike the single-event DataSEC, DataSED contains continuous and unsegmented sound recordings, including simultaneous sound events. The files were obtained from the same sources as DataSEC and follow the same classification system, excluding the subclasses. In some cases, tracks have been manually modified to remove long periods of no particular change, or sounds have been manually added to create a much more dynamic track.
It is not mandatory for the entire track to be fully labelled, as there can be periods where no source is recognised. This means that there may be silences or periods of unrecognisable sound.
The authors opted for a dual approach with regard to the labelling of overlaps, and DataSED is uploaded in two versions. One version is for polyphonic sound detection, which is the most realistic: labels have been assigned to all sounds occurring in a portion of the track, even if they overlap. For instance, when a short and quiet dog bark occurs over a passing aeroplane, both events are labelled. In the other version, only the predominant source of the same example (i.e. the aeroplane) is labelled, with the dog bark remaining unlabelled. This version is intended for monophonic sound detection, an approach that would facilitate, for instance, the first SED applications oriented towards the removal of unwanted sound events.
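One common way to use such labels for training is to convert them into frame-level targets, with one row per time frame and one column per class. The sketch below illustrates how the dog-over-aeroplane example differs between the two versions; it is an assumed downstream usage rather than a procedure described by the authors, and the timestamps, the 0.5 s hop size and the `frame_targets` helper are hypothetical.

```python
import math


def frame_targets(events, classes, duration_s, hop_s=0.5):
    """Build a frames x classes 0/1 matrix from (class_name, start_s, end_s) events."""
    n_frames = math.ceil(duration_s / hop_s)
    column = {name: i for i, name in enumerate(classes)}
    targets = [[0] * len(classes) for _ in range(n_frames)]
    for name, start, end in events:
        first = int(start // hop_s)
        last = min(n_frames, math.ceil(end / hop_s))
        for frame in range(first, last):
            targets[frame][column[name]] = 1
    return targets


classes = ["Propeller aircrafts", "Dog barkings and howlings"]
# Hypothetical timestamps: a 4 s aeroplane pass with a short bark inside it.
polyphonic = [("Propeller aircrafts", 0.0, 4.0),
              ("Dog barkings and howlings", 1.5, 2.0)]
monophonic = [("Propeller aircrafts", 0.0, 4.0)]  # predominant source only

poly_targets = frame_targets(polyphonic, classes, duration_s=4.0)
mono_targets = frame_targets(monophonic, classes, duration_s=4.0)
```

In the polyphonic matrix a frame may have several active classes, which corresponds to a multi-label problem, whereas every frame of the monophonic matrix has at most one active class.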
DataSED comprises three distinct components: an audio files folder, a ground truth labels folder, and a soundtrack list file (that includes metadata of audio files).
There are 712 audio files in total, all of which have been standardised to mono-channel with a sampling rate of 44.1 kHz in .wav format. The files have been organised into a single folder and have been named in a consistent pattern, ranging from S-0001 to S-0712, where S indicates “Sample”. The shortest audio clip is 2.29 seconds, while the longest is 285.0 seconds, with an average audio duration of 87.18 seconds across all clips. The total duration of the audio content is approximately 17.02 hours. Further details can be found in Table 3. The dataset has been constructed with the objective of achieving a minimum of 100 labels for each class.
Table 3.
General properties of DataSED for monophonic version.
| Class/Subclass | Label Duration Min [s] | Label Duration Max [s] | Label Duration Average [s] | Label Duration Sum. [s] | Number of event labels |
|---|---|---|---|---|---|
| Bells | 0.4 | 115.1 | 11.2 | 1377.7 | 123 |
| Birds | 0.6 | 97.0 | 7.1 | 1590.7 | 225 |
| Cat fights and moans | 1.3 | 41.1 | 4.6 | 636.2 | 138 |
| Chicken coop | 0.5 | 23.6 | 3.3 | 559.4 | 169 |
| Cicadas and crickets | 0.7 | 111.1 | 6.4 | 2695.7 | 422 |
| Crows seagulls and magpies | 0.4 | 35.2 | 4.8 | 802.2 | 166 |
| Dog barkings and howlings | 0.4 | 100.9 | 5.0 | 1214.1 | 243 |
| Glass breaking | 0.6 | 19.5 | 3.3 | 395.1 | 121 |
| Horn | 0.4 | 19.1 | 2.6 | 353.5 | 135 |
| Jet aircrafts | 0.5 | 98.0 | 27.1 | 2709.6 | 100 |
| Lawn mower brush cutter and olive shaker | 0.6 | 264.1 | 22.7 | 3182.8 | 140 |
| Music | 1.7 | 247.3 | 21.5 | 3433.3 | 160 |
| Propeller aircrafts | 0.4 | 285.0 | 23.9 | 5219.6 | 218 |
| Sirens and alarms | 0.6 | 143.1 | 18.9 | 2790.9 | 148 |
| Thunder fireworks and gunshot | 0.6 | 95.3 | 10.2 | 1475.6 | 145 |
| Train | 0.8 | 227.5 | 29.8 | 4801.5 | 161 |
| Vacuum cleaner fan and hairdryer | 1.6 | 72.8 | 17.4 | 2369.0 | 136 |
| Vehicle idling | 1.2 | 119.4 | 15.8 | 2511.9 | 159 |
| Vehicle pass-by | 0.9 | 56.7 | 10.3 | 3283.2 | 318 |
| Voices | 0.4 | 83.6 | 5.9 | 2011.1 | 344 |
| Wind turbine | 6.9 | 107.1 | 33.9 | 3833.8 | 113 |
| Workshop | 0.9 | 166.1 | 11.0 | 4818.4 | 439 |
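The per-class columns of Table 3 are simple aggregates of the label durations. A minimal sketch of how such a summary could be recomputed from a list of labels is given below; the `label_statistics` helper and its input format are assumptions made for illustration, and the example values are the four ‘Bells’ labels of S-0002.wav from the monophonic annotations.

```python
from collections import defaultdict


def label_statistics(events):
    """Aggregate (class_name, event_length_s) pairs into per-class statistics."""
    lengths = defaultdict(list)
    for class_name, length in events:
        lengths[class_name].append(length)
    return {
        name: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sum": sum(values),
            "count": len(values),
        }
        for name, values in lengths.items()
    }


example = [("Bells", 10.31), ("Bells", 9.36), ("Bells", 9.84), ("Bells", 88.37)]
stats = label_statistics(example)
```

The same aggregation applies to the file durations of DataSEC when grouped by class or subclass.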
The ground truth labels of the classes are stored in the .csv format. The corresponding table is structured with seven descriptive columns. The columns contained within the ground truth files are as follows:
sound_name – File name associated with the audio sample.
class_name – Classification of the sound event, matching the categories of the pretraining dataset.
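start_perc – Relative position in the audio file at which the sound event starts, expressed as a fraction of the file duration between 0 and 1 (see Table 4).
end_perc – Relative position in the audio file at which the sound event ends, expressed in the same way.
start_time – The time in seconds at which the sound event starts.
end_time – The time in seconds at which the sound event ends.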
event_length – The duration of the event in seconds.
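The seven columns listed above can be read into typed records with the standard library. The sketch below assumes comma-separated files with a header row that uses exactly these column names; the `Event` class, the helper names and the 0.02 s tolerance (chosen to absorb two-decimal rounding) are illustrative.

```python
import csv
from dataclasses import dataclass

NUMERIC_COLUMNS = ("start_perc", "end_perc", "start_time", "end_time", "event_length")


@dataclass
class Event:
    sound_name: str
    class_name: str
    start_perc: float
    end_perc: float
    start_time: float
    end_time: float
    event_length: float


def read_events(handle):
    """Parse an open ground truth .csv file into a list of Event records."""
    events = []
    for row in csv.DictReader(handle):
        numbers = {key: float(row[key]) for key in NUMERIC_COLUMNS}
        events.append(Event(row["sound_name"], row["class_name"], **numbers))
    return events


def is_consistent(event, tol=0.02):
    """Check that event_length matches end_time - start_time within rounding."""
    return abs((event.end_time - event.start_time) - event.event_length) <= tol
```

A file would typically be opened with `open(path, newline="")` and passed to `read_events`; `is_consistent` then offers a quick sanity check on each row.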
Technical Validation
The datasets provide a robust foundation for training and evaluating machine learning models capable of distinguishing a wide variety of outdoor sound events. This objective is realised through the selection and curation of authentic audio recordings from a total of 22 distinctly defined environmental noise classes, and 28 subclasses.
A first step of validation lies in the way the files were selected: the authors listened to all signals. A second step concerns the fact that the sound signals were chosen only from measurements performed by the authors themselves using instruments of the highest quality (Class I sound level meters) or retrieved from well-known online repositories, choosing those that were declared as having been acquired using appropriate instruments and that were real and not synthetic sounds.
A third validation step lies in the repeated listening by multiple operators of all the signals included in the datasets. The same was done for the labelling process.
A fourth validation step resides in the conversion to standardised single-channel .wav files, with a sampling rate of 44.1 kHz, to ensure consistency across the recordings.
Finally, a statistically significant number of signals were maintained in each class during the process of populating them.
The labelling process is comprehensive in nature, incorporating a meticulous audio review and structured labelling in both mono-event and poly-event formats. This ensures the generation of high-quality training data. This positions the datasets as valuable assets, primarily for environmental acousticians, due to the manner in which classes are defined. However, it can be readily adapted to be useful for applications in ecological monitoring, urban planning, and smart home technologies.
Ground truths were also checked. In particular, in both the monophonic and polyphonic versions, it was checked that there were no overlaps between events of the same class (produced erroneously during labelling), and in the monophonic version, it was checked that there were no overlaps between events of different classes either. For each measurement, the authors had graphical representations of the labelling in order to validate the process visually as well, as in the example shown in Fig. 2 for a monophonic case. The structure of the corresponding annotation files is shown in Table 4.
Fig. 2.
Example of graphical representation for the labelling of a track in DataSED.
Table 4.
Structure of the ground truth annotation files in the monophonic version of DataSED.
| sound_name | class_name | start_perc | end_perc | start_time | end_time | event_length |
|---|---|---|---|---|---|---|
| S-0001.wav | Birds | 0 | 0.3 | 0 | 17.6 | 17.6 |
| S-0001.wav | Lawn mower brush cutter and olive shaker | 0.36 | 0.59 | 21.17 | 34.83 | 13.66 |
| S-0001.wav | Bells | 0.59 | 0.85 | 34.93 | 50.65 | 15.72 |
| S-0001.wav | Lawn mower brush cutter and olive shaker | 0.9 | 0.98 | 53.23 | 58.05 | 4.82 |
| S-0002.wav | Bells | 0 | 0.08 | 0 | 10.31 | 10.31 |
| S-0002.wav | Bells | 0.11 | 0.18 | 13.16 | 22.52 | 9.36 |
| S-0002.wav | Bells | 0.2 | 0.28 | 25.22 | 35.06 | 9.84 |
| S-0002.wav | Bells | 0.29 | 1 | 35.61 | 123.98 | 88.37 |
| S-0003.wav | Train | 0.06 | 0.25 | 5.52 | 23.71 | 18.19 |
| S-0003.wav | Voices | 0.29 | 0.5 | 27.88 | 47.82 | 19.93 |
| S-0003.wav | Vehicle pass-by | 0.66 | 0.71 | 62.88 | 68.38 | 5.5 |
| S-0003.wav | Voices | 0.72 | 1 | 68.62 | 95.62 | 27 |
| S-0004.wav | Train | 0 | 0.23 | 0 | 24.49 | 24.49 |
| S-0004.wav | Voices | 0.26 | 0.33 | 28.3 | 34.78 | 6.47 |
| S-0004.wav | Sirens and alarms | 0.44 | 0.48 | 46.63 | 51.22 | 4.6 |
| S-0004.wav | Train | 0.53 | 0.8 | 56.34 | 85.47 | 29.13 |
| S-0004.wav | Sirens and alarms | 0.82 | 0.85 | 87.28 | 90.39 | 3.11 |
| S-0004.wav | Train | 0.85 | 1 | 90.58 | 106.96 | 16.38 |
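The overlap checks described for the ground truths can be reproduced on annotation rows such as those in Table 4. The following is a minimal sketch and not the authors' validation code: `find_overlaps` and its tuple format are assumptions, and the final two-event example with a dog bark over an aeroplane uses hypothetical timestamps.

```python
from collections import defaultdict
from itertools import combinations


def find_overlaps(events, same_class_only):
    """Return overlapping event pairs within each file.

    events: (sound_name, class_name, start_time, end_time) tuples.
    same_class_only=True is the check applied to both versions;
    same_class_only=False is the stricter check for the monophonic version.
    """
    by_file = defaultdict(list)
    for event in events:
        by_file[event[0]].append(event)
    clashes = []
    for file_events in by_file.values():
        for a, b in combinations(file_events, 2):
            if same_class_only and a[1] != b[1]:
                continue
            # Two intervals overlap when each starts before the other ends.
            if a[2] < b[3] and b[2] < a[3]:
                clashes.append((a, b))
    return clashes


table4_rows = [
    ("S-0001.wav", "Birds", 0.0, 17.6),
    ("S-0001.wav", "Lawn mower brush cutter and olive shaker", 21.17, 34.83),
    ("S-0001.wav", "Bells", 34.93, 50.65),
    ("S-0001.wav", "Lawn mower brush cutter and olive shaker", 53.23, 58.05),
]
# Hypothetical polyphonic file: a bark labelled on top of an aeroplane.
polyphonic_rows = [
    ("X.wav", "Propeller aircrafts", 0.0, 10.0),
    ("X.wav", "Dog barkings and howlings", 4.0, 5.0),
]
```

The pairwise comparison is quadratic in the number of labels per file, which is acceptable here because each recording carries only a handful of labels.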
Usage Notes
A notable benefit of these datasets is their capacity for expansion. The resources have been meticulously designed to promote continuous contribution. These tools empower researchers and practitioners to continually enrich the data repository, thereby promoting community-driven enhancement and ensuring long-term relevance. In addition, the clear and thorough classification of sounds contributes to the refinement of the model training process. This, in turn, serves to reduce the probability of misclassification, a prevalent concern in earlier datasets.
As with other large-scale audio datasets, the proposed datasets are not without limitations. Further work on these datasets could address variability in recording distances, further balance class distribution, and broaden the range of environmental contexts captured. It is argued that, in particular, an expansion of both vocabulary and acoustic coverage has the potential to enhance generalizability and real-world applicability.
Acknowledgements
The paper was supported by the PRIN 2022 project 20223LMSZN, COMBINE “Sustainable condition monitoring of wind turbines using acoustic signals and machine learning techniques” funded by the Italian Ministry of University and Research in the context of NextGenerationEU.
Author contributions
Luca Fredianelli: Conceptualization; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Supervision; Validation; Writing - original draft; Writing - review & editing. Francesco Artuso: Data curation; Investigation; Methodology; Validation; Writing - original draft; Writing - review & editing. Geremia Pompei: Data curation; Investigation; Methodology; Validation; Writing - original draft; Writing - review & editing. Gaetano Licitra: Supervision; Validation; Writing - original draft; Writing - review & editing. Gino Iannace: Supervision; Validation; Writing - original draft; Writing - review & editing. Andac Akbaba: Data curation; Formal analysis; Investigation; Methodology; Validation; Writing - original draft; Writing - review & editing.
Code availability
The authors declare that the labelling of DataSED was performed manually by the operators using self-developed, published code34.
Data availability
The datasets created and analyzed in this research are publicly available on Zenodo.org.
• DataSEC (Dataset for Sound Event Classification of environmental noise) is available as a zipped file archive at 10.5281/zenodo.15340689. The main folder is titled DataSEC and contains subfolders named after the 22 sound event classes and 28 subclasses. Each subfolder holds the number of mono-channel .wav files (44.1 kHz) listed in Table 2 of the manuscript, for a total of 4292 files (18 hours 26 minutes). File names within the subfolders follow a consistent naming convention (e.g., category name – 001).
• DataSED (Dataset for Sound Event Detection of environmental noise) is available in zipped format at 10.5281/zenodo.15346092. The main folder, called DataSED, contains two subfolders: one with 712 mono-channel .wav files (44.1 kHz, labeled from S-0001.wav to S-0712.wav) and one with the corresponding ground truth annotations in .csv format. Each .csv file contains structured labels in seven columns:
• sound_name (name of the audio file),
• class_name (category of sound event),
• start_perc and end_perc (relative start and end positions of the event within the file; the example rows above store them as fractions between 0 and 1),
• start_time and end_time (respective timestamps in seconds), and
• event_length (duration of the event, in seconds).
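The seven-column layout listed above can be read with the Python standard library. The sketch below is illustrative only: it assumes a comma-separated file whose header row uses exactly the column names listed above, and `load_events` and the inline sample are hypothetical names and data (the sample rows are copied from the example table in this article), not part of the released tooling.

```python
import csv
import io
from collections import defaultdict

# Assumed file layout: comma-separated, with a header row that uses
# the seven column names described above.
SAMPLE_CSV = """sound_name,class_name,start_perc,end_perc,start_time,end_time,event_length
S-0003.wav,Train,0.06,0.25,5.52,23.71,18.19
S-0003.wav,Voices,0.29,0.5,27.88,47.82,19.93
S-0004.wav,Train,0,0.23,0,24.49,24.49
"""

NUMERIC = ("start_perc", "end_perc", "start_time", "end_time", "event_length")


def load_events(handle):
    """Group annotation rows by audio file and convert numeric columns to float."""
    events = defaultdict(list)
    for row in csv.DictReader(handle):
        for key in NUMERIC:
            row[key] = float(row[key])
        events[row["sound_name"]].append(row)
    return dict(events)


events = load_events(io.StringIO(SAMPLE_CSV))
for name, rows in events.items():
    print(name, [(r["class_name"], r["start_time"], r["end_time"]) for r in rows])
```

To read an actual annotation file, the `io.StringIO` object can be replaced by a file opened with `open(path, newline="")`. If the released files use a different delimiter or header spelling, the `csv.DictReader` call and the `NUMERIC` tuple would need to be adjusted accordingly.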
DataSEC and DataSED are released under Creative Commons licenses as outlined in the manuscript, and both are offered in a ready-to-use format for purposes of training, validation, and benchmarking sound event classification and detection models in environmental acoustics.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Bello, J. P. et al. SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution. Commun. ACM 62(2), 68–77 (2019).
- 2. Drugman, T. et al. Audio and contact microphones for cough detection. arXiv preprint (2020).
- 3. Hüwel, A., Adiloğlu, K. & Bach, J. H. Hearing aid research data set for acoustic environment recognition. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 706–710 (2020).
- 4. Messner, E. et al. Multi-channel lung sound classification with convolutional recurrent neural networks. Comput. Biol. Med. 122, 103831 (2020).
- 5. Cramer, J., Lostanlen, V., Farnsworth, A., Salamon, J. & Bello, J. P. Chirping up the right tree: Incorporating biological taxonomies into deep bioacoustic classifiers. Proc. IEEE Int. Conf. Acoust., Speech Signal Process. 901–905 (2020).
- 6. Lostanlen, V., Salamon, J., Farnsworth, A., Kelling, S. & Bello, J. P. Robust sound event detection in bioacoustic sensor networks. PLoS One 14(10), e0214168 (2019).
- 7. Xu, K., Cai, H., Liu, X., Gao, Z. & Zhang, B. North Atlantic Right Whale call detection with very deep convolutional neural networks. J. Acoust. Soc. Amer. 141(5), 3944–3945 (2017).
- 8. Crocco, M., Cristani, M., Trucco, A. & Murino, V. Audio surveillance: A systematic review. ACM Comput. Surv. 48(4), 1–46 (2016).
- 9. Wang, Y., Neves, L. & Metze, F. Audio-based multimedia event detection using deep recurrent neural networks. Proc. IEEE Int. Conf. Acoust., Speech Signal Process. 2742–2746 (2016).
- 10. Jansen, A. et al. Large-scale audio event discovery in one million YouTube videos. Proc. IEEE Int. Conf. Acoust., Speech Signal Process. 786–790 (2017).
- 11. Morrison, M. & Pardo, B. OtoMechanic: Auditory automobile diagnostics via query-by-example. Preprint at 10.48550/arXiv.1911.02073 (2019).
- 12. Shabbir, A. et al. Enhancing smart home environments: a novel pattern recognition approach to ambient acoustic event detection and localization. Frontiers in Big Data 7, 1419562 (2025).
- 13. Kath, H., Serafini, P. P., Campos, I. B., Gouvêa, T. S. & Sonntag, D. Leveraging transfer learning and active learning for sound event detection in passive acoustic monitoring of wildlife. 3rd Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE-2024) (2024).
- 14. Piczak, K. J. ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd ACM International Conference on Multimedia (2015).
- 15. Stowell, D., Wood, M., Pamuła, H., Stylianou, Y. & Glotin, H. Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge. Methods in Ecology and Evolution 10(3), 368–380 (2019).
- 16. Mkrtchian, G. & Furletov, Y. Classification of environmental sounds using neural networks. 2022 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO), 1–4 (2022).
- 17. Fonseca, E., Favory, X., Pons, J., Duchateau, J. & Serra, X. FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 829–852 (2021).
- 18. Cakir, E., Heittola, T., Huttunen, H. & Virtanen, T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(6), 1291–1303 (2017).
- 19. Liang, J. et al. Mind the Domain Gap: A Systematic Analysis on Bioacoustic Sound Event Detection. Proc. Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop (2023).
- 20. Mesaros, A., Heittola, T., Virtanen, T. & Plumbley, M. D. Sound Event Detection: A Tutorial. IEEE Signal Processing Magazine, 38(5),
- 21. Salamon, J., Jacoby, C. & Bello, J. P. A Dataset and Taxonomy for Urban Sound Research. Proceedings of the 22nd ACM International Conference on Multimedia (2014).
- 22. Gemmeke, J. F. et al. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP) (2017).
- 23. Font, F., Roma, G. & Serra, X. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia, 21, 411–412, 10.1145/2502081.2502245 (Association for Computing Machinery, New York, NY, USA, 2013).
- 24. Ghosh, S. et al. Synthio: Augmenting small-scale audio classification datasets with synthetic data. Proceedings of the International Conference on Learning Representations (ICLR) (2025).
- 25. Geiping, J. et al. How much data are augmentations worth? An investigation into scaling laws, invariance, and implicit regularization. Preprint at 10.48550/arXiv.2210.06441 (2022).
- 26. Zeng, W. et al. Infusion: Preventing customized text-to-image diffusion from overfitting. ACM Multimedia (2024).
- 27. Ghosh, S. et al. Compa: Addressing the gap in compositional reasoning in audio-language models. Preprint at 10.48550/arXiv.2310.08753 (2023).
- 28. Ghosal, D., Majumder, N., Mehrish, A. & Poria, S. Text-to-audio generation using instruction tuned LLM and latent diffusion model. Preprint at 10.48550/arXiv.2304.1373 (2023).
- 29. Fredianelli, L. et al. DataSEC - Dataset for Sound Event Classification of environmental noise. 10.5281/zenodo.17033970 (2025).
- 30. Fredianelli, L. et al. DataSED - Dataset for Sound Event Detection of environmental noise. 10.5281/zenodo.15346092 (2025).
- 31. Chan, T. K. & Chin, C. S. A Comprehensive Review of Polyphonic Sound Event Detection. IEEE Access 8, 103339–103373 (2020).
- 32. Bogdanov, D., Won, M., Tovstogan, P., Porter, A. & Serra, X. The MTG-Jamendo Dataset for Automatic Music Tagging. Machine Learning for Music Discovery Workshop, International Conference on Machine Learning. 10.5281/zenodo.3826813 (2019).
- 33. CLIPS Corpus Dataset. http://www.clips.unina.it/home (Retrievable at https://github.com/paperswithcode/paperswithcode-data?tab=readme-ov-file).
- 34. Pompei, G., Artuso, F., Akbaba, A. & Fredianelli, L. Labelling tool for audio files. 10.5281/zenodo.15438634 (2025).