PLOS One. 2024 Apr 2;19(4):e0299888. doi: 10.1371/journal.pone.0299888

Musical instrument classifier for early childhood percussion instruments

Brandon Rufino 1,2, Ajmal Khan 1,#, Tilak Dutta 2,3,#, Elaine Biddiss 1,2,4,*
Editor: John Blake
PMCID: PMC10986987  PMID: 38564622

Abstract

While the musical instrument classification task is well studied, there remains a gap in identifying non-pitched percussion instruments, which have greater overlap in frequency bands and greater variation in sound quality and play style than pitched instruments. In this paper, we present a musical instrument classifier for detecting tambourines, maracas and castanets, instruments that are often used in early childhood music education. We generated a dataset of diverse instruments (e.g., brand, materials, construction) played in different locations with varying background noise and play styles. We conducted sensitivity analyses to optimize feature selection, windowing time, and model selection. We deployed and evaluated our best model in a mixed reality music application with 12 families in a home setting. Our dataset comprised over 369,000 samples recorded in-lab and 35,361 samples recorded with families in a home setting. The Light Gradient Boosting Machine (LGBM) model performed best using an approximately 93 ms window with only 12 mel-frequency cepstral coefficients (MFCCs) and signal entropy. Our best LGBM model achieved over 84% accuracy across all three instrument families in-lab and over 73% accuracy when deployed to the home. To our knowledge, this dataset of over 369,000 samples of non-pitched instruments is the first of its kind. This work also suggests that a low-dimensional feature space is sufficient for the recognition of non-pitched instruments. Lastly, real-world deployment and testing of the algorithms with participants of diverse physical and cognitive abilities was also an important contribution towards more inclusive design practices. This paper lays the technological groundwork for a mixed reality music application that can detect children’s use of non-pitched percussion instruments to support early childhood music education and play.

Introduction

Musical play and learning, particularly prior to 7 years of age, can enhance outcomes in self-efficacy, perception, fine motor coordination and spatial reasoning [1]. Yet, there are significant opportunity gaps when it comes to early childhood music education [2–4]. Children with motor disabilities and those from low-income households are much less likely to learn to play a musical instrument and participate in early childhood music learning programs [5–7]. Barriers to early childhood music education include finding suitable programs [2], costs [3], and travel and time constraints [4]. These barriers are not easily overcome by traditional teacher-led programs. Music applications (“apps”) may offer opportunities to close opportunity gaps in early childhood music education for children with and without disabilities. However, a review of some of the most popular mainstream music apps for young children concluded that they generally do not promote diverse, frequent, and active music engagement [8]. Active engagement, wherein real-life instruments are manipulated, introducing sensory and physical task parameters, is vital to music and motor learning [9].

“Mixed reality” apps have recently emerged that detect and respond to audio signals from real-life, pitched instruments (e.g., piano, violin, guitar) [10]. However, in early childhood, non-pitched percussion instruments (e.g., shakers/maracas, tambourines and castanets) are more prominent and appropriate for developing motor abilities, especially for children with disabilities who may have limitations in fine motor control [11, 12]. Musical instrument and audio classifiers have been built with a wide range of machine learning models including k-nearest neighbors (KNN) [13], Multi-layered Perceptron (MLP), and boosting algorithms [14]. Harish et al. reported an accuracy of 79% for their SVM model, which outperformed other state-of-the-art models in classifying six pitched instruments, including voice, using spectral features [14]. Mittal et al. demonstrated best performance with a Naive Bayes classifier, achieving an accuracy of 97% in distinguishing between four drum instruments using a dataset composed of both live recordings and a drum simulator [15]. While the musical instrument classification task for pitched instruments and drums is widely studied [16], we were not able to find previous work classifying diverse non-pitched instruments like maracas, tambourines, and castanets. Classification of non-pitched instruments may pose additional challenges due to greater overlaps in frequency bands and variation in sound quality and play style than pitched instruments [17].

Pairing signal processing techniques with machine learning offers opportunities to address current limitations in the detection of non-pitched percussion instruments. Improving signal robustness, increasing recognition of families of instruments (e.g., any shaker regardless of size and material), and prioritizing computationally low-cost approaches are important pragmatic considerations for the development of classifiers that will enable children to play with the instruments they have on hand [18] and with technology they have in their home (i.e., a tablet, computer, or phone).

Another key consideration for designing an instrument classifier for use by children in a home-based music application is the complex sound environment. For instance, in our target application, the audio stream may include speech, background noise, game/app music, as well as sounds from the musical instruments of interest. There are several approaches to the classification task for polyphonic audio (i.e., audio with multiple sources of sound) [19–22]. However, the most common and straightforward approach is to extract features directly from the polyphonic signal to classify the musical instrument [23, 24]. This last approach minimizes computation and latency, which are important practical considerations for a classifier intended for a music application.

In this work, we aimed to develop an audio detection interface for non-pitched percussion instruments, specifically maracas, tambourines, and castanets, for use in early childhood music applications. To this end, this manuscript first describes the creation of a large database of non-pitched instrument audio samples. Second, we describe feature extraction and the development of machine learning models used in this classification task. Next, we present the performance of our classifier with (i) a test set recorded in-lab and, (ii) real-world data recorded “in the wild” in family homes. Lastly, we discuss key findings, particularly with respect to the algorithm’s intended application as an audio detection interface to support interactive early childhood music applications.

Methodology

Dataset

1. Non-pitched percussion dataset

Given the unavailability of databases with well-labelled samples of diverse, non-pitched percussion instruments, a novel dataset was created for this purpose consisting of four classes: (1) tambourines, (2) maracas/shakers, (3) castanets, and (4) noise (e.g., white noise, people speaking, environmental sounds). This dataset (described in Table 1) was collected between June 2019 and May 2021 in various locations (e.g., family homes, coffee shops, outdoors and basements) within Ontario, Canada. It will be referred to as the “in-lab” dataset to convey that the recordings were made in contrived rather than naturalistic conditions. Audio samples were recorded using a single microphone channel (i.e., mono) at 44.1 kHz. Most recorded samples are polyphonic (i.e., mixed with people speaking, coffee shop sounds such as cutlery, different frequencies of white noise, and background music) to reflect the anticipated sound environment of the target application.

Table 1. Data splits for “in-lab” dataset: training, validation and testing.

Samples calculated with ≈93 ms window and 50% overlap.

Class Training Samples Validation Samples Test Samples (“in-lab” variants)
Tambourines 88,339 12,217 11,491
Shakers 93,735 15,102 10,968
Castanets 34,363 9,773 4,102
Noise 73,207 8,506 7,273
Total 289,644 45,598 33,834

When recording samples for each instrument class, a variety of instruments was used to avoid bias in the brand and material of the instrument. For example, egg shakers, wooden maracas and plastic maracas were recorded for the shaker class. For the tambourine class, we used tambourines varying in diameter from 8 cm to 30 cm and of different materials such as cowhide, plastic and wood. For the castanet, hand-held and finger castanets were sampled—some of which were made from plastic and others from wood. For all three instrument classes, homemade instruments were also included. Our goal in doing so was to alleviate any financial barriers a family may face when taking part in musical play and learning at home given that not all families may have access to musical instruments. “Do It Yourself” (DIY) instructions were created for each instrument class using common household items. A homemade version of a shaker was constructed from a plastic water bottle partially filled with rice. For a homemade tambourine, we trialed a ring of three keys or a paper plate with four metal jingle bells attached around the circumference. Lastly, a homemade castanet was simulated using two wooden spoons, or a folded paper plate with approximately four coins glued onto each side. See the supporting information file (S1 Appendix) for an image of the homemade instrument options.

2. Dataset splits

For training and evaluating the classifier on the “in-lab” dataset, we split our corpus into approximately 80/10/10% training, validation and test sets. The validation set included instruments and ambient sound environments that were not part of the training set. Of note, we labelled and grouped samples by instrument before data splitting to ensure that there was no overlap in samples between the training and testing sets. The test set was composed entirely of either (1) instruments that were not included in training and validation, or (2) instruments that were included in training/validation but recorded in a totally new environment. This prevented the classifier from being biased toward certain brands and materials of instruments. Table 1 summarizes the data splits and Fig 1 summarizes the complete data collection process, including the at-home deployment test set (to be discussed in Classifier Evaluation).
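
To illustrate this grouping step, a minimal sketch using scikit-learn’s GroupShuffleSplit is shown below. The arrays and per-instrument identifiers are hypothetical placeholders; the authors’ actual splitting procedure is not specified beyond the description above.

```python
# Minimal sketch of instrument-grouped splitting (hypothetical variable names).
# Grouping by instrument ID before splitting ensures that windows from the same
# physical instrument never appear in both the training and test sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(1000, 13)                    # placeholder feature matrix (12 MFCCs + entropy)
y = np.random.randint(0, 4, size=1000)          # 4 classes: tambourine, shaker, castanet, noise
groups = np.random.randint(0, 20, size=1000)    # hypothetical per-instrument IDs

# Hold out ~10% of instruments (not samples) for the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

assert set(groups[train_idx]).isdisjoint(groups[test_idx])  # no instrument overlap
```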

Fig 1. Data collection summary including in-lab and at-home deployed samples.


Features and window length

1. Windowing length

The general approach for musical instrument classification is to discretize the signal based on a window time length and extract relevant features [16]. There are no established standards for windowing time in the musical instrument classification task. Previous successful work has used window lengths between 24–500 ms with 50% overlap between windows [25, 26]. For this work, a maximum window length of ≈93 ms was specified because tolerable latency in gameplay is reported to be under 100 ms [27]. A longer window length would also increase the likelihood of multiple instruments being played within the window or of the signal not being stationary (a key assumption for our features). As the Fast Fourier Transform (FFT) [28] was used, window lengths equal to a power of two samples were maintained to minimize computation time. Using a common audio recording sample rate of 44.1 kHz, we therefore experimented with the following window lengths: ≈23 ms (1024 samples), ≈46 ms (2048 samples) and ≈93 ms (4096 samples).
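
The correspondence between these power-of-two sample counts and window durations at 44.1 kHz can be verified with a few lines of arithmetic; this is a simple illustration rather than code from the study.

```python
# Window duration (ms) for power-of-two FFT lengths at a 44.1 kHz sampling rate.
SAMPLE_RATE = 44_100  # Hz

for n_samples in (1024, 2048, 4096):
    duration_ms = 1000 * n_samples / SAMPLE_RATE
    print(f"{n_samples} samples -> {duration_ms:.1f} ms")
# 1024 samples -> 23.2 ms
# 2048 samples -> 46.4 ms
# 4096 samples -> 92.9 ms
```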

2. Feature extraction and selection

Dimensionality for the musical instrument classification task ranges widely [29], with previous work using anywhere from 27 [30] to 162 time and spectral features [29]. A full table of the features extracted in this work can be found in the supporting information file (S2 Appendix); neighborhood component analysis (NCA) was used to select an optimal feature set from these candidates. NCA is a non-parametric method with the goal of maximizing the accuracy of a classification algorithm [31]. We used the fscnca tool provided by Matlab [32], which performs NCA with a regularization term. The first 12 mel-frequency cepstral coefficients (MFCCs) and signal entropy were found to contribute most to the model across all windowing times. Below, we detail the methods associated with obtaining these features.

MFCCs. MFCCs are used to represent the shapes of sounds and are calculated using standard procedures [33]. The bandwidth of interest and the number of mel-filterbanks used must be specified in line with the sampling rate [25, 34, 35]. A maximum frequency of 8 kHz was selected as a conservative approach in case our classifier is ever deployed to a hardware device that is limited to a sampling rate of 16 kHz (albeit, if the classifier were deployed to a 16 kHz device, it should be retrained using a down-sampled version of our dataset). For the interested reader, spectrograms of each instrument are presented in the supporting information file (S3 Appendix). Music classification uses anywhere from 20–40 mel-filterbanks to summarize the power spectrum [34]. As is typical, we discarded the first MFCC because it is related only to the signal energy and kept the subsequent coefficients (2–20) [25].
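
For readers working in Python, a comparable extraction could look like the sketch below using librosa (the authors’ own pipeline was Matlab-based). The file name is hypothetical, and the filterbank size is only one choice within the 20–40 range cited above.

```python
# Sketch of MFCC extraction with librosa (not the authors' MATLAB pipeline).
# Parameters follow the paper: 44.1 kHz audio, 4096-sample windows (~93 ms),
# 50% overlap, and an 8 kHz upper limit on the mel filterbank.
import librosa

y, sr = librosa.load("shaker_example.wav", sr=44_100, mono=True)  # hypothetical file

mfccs = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=20,        # extract the first 20 coefficients
    n_fft=4096,       # ~93 ms window
    hop_length=2048,  # 50% overlap
    n_mels=40,        # one choice within the 20-40 filterbank range cited in the text
    fmax=8000,        # conservative 8 kHz bandwidth
)

features = mfccs[1:, :]  # discard MFCC 0 (overall energy), keep coefficients 2-20
print(features.shape)    # (19, n_windows)
```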

Signal entropy (also referred to as Shannon’s entropy [36]). Musical instruments like castanets, with a sharp signal onset and decay, will contain more information (entropy) than flat white noise in the background. Signal entropy was calculated using equations provided in previous work [36]. To avoid underestimation, a bias correction term was applied, following the work by Moddemeijer [37].
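
A minimal histogram-based sketch of a per-window entropy estimate is shown below. The bias-correction term here is the simple Miller–Madow adjustment, used only as a stand-in for the Moddemeijer correction cited above, and the bin count is an arbitrary choice.

```python
# Illustrative histogram-based Shannon entropy of a windowed signal.
import numpy as np

def signal_entropy(window: np.ndarray, n_bins: int = 64) -> float:
    counts, _ = np.histogram(window, bins=n_bins)
    n = counts.sum()
    p = counts[counts > 0] / n
    h = -np.sum(p * np.log2(p))               # Shannon entropy (bits)
    k = np.count_nonzero(counts)              # number of occupied bins
    return h + (k - 1) / (2 * n * np.log(2))  # Miller-Madow correction, in bits

rng = np.random.default_rng(0)
noise = rng.normal(size=4096)                              # flat, noise-like window
burst = np.r_[rng.normal(size=256) * 5, np.zeros(3840)]    # sharp onset and decay
print(signal_entropy(noise), signal_entropy(burst))
```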

Classifier development

We developed models using KNN, SVM, MLP, and the boosting algorithms AdaBoost [38], XGBoost [39] and Light Gradient Boosting Machine (LGBM) [40]. We tuned the hyper-parameters of the LGBM models using Optuna [41], which offers a flexible API along with a sampling algorithm to efficiently explore the search space. We report the optimal parameters and other baseline model performances for the ≈93 ms window in the supporting information file (S4 Appendix). Fig 2 presents an overview of the model development and evaluation process.
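
A hedged sketch of what LGBM tuning with Optuna can look like is given below. The search space, trial count, and the training/validation arrays (X_train, y_train, X_val, y_val) are illustrative assumptions, not the study’s actual configuration (which is reported in S4 Appendix).

```python
# Sketch of LGBM hyper-parameter tuning with Optuna (illustrative search space).
import lightgbm as lgb
import optuna
from sklearn.metrics import accuracy_score

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMClassifier(**params)
    model.fit(X_train, y_train)            # assumes feature/label arrays already exist
    return accuracy_score(y_val, model.predict(X_val))

study = optuna.create_study(direction="maximize")  # maximize validation accuracy
study.optimize(objective, n_trials=50)
print(study.best_params)
```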

Fig 2. Development and evaluation of non-pitched percussion musical instrument classifier.


Classifier evaluation

In-lab test phase

Accuracy, recall, precision and F1 scores were calculated to quantify classifier performance and compared between models using window lengths of ≈23 ms, ≈46 ms, and ≈93 ms.
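
As a point of reference, these macro-averaged metrics can be computed with scikit-learn as sketched below; y_true and y_pred denote hypothetical per-window label arrays and are not part of the study’s code.

```python
# Macro-averaged evaluation metrics, as reported in the paper.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy {accuracy:.3f}  Precision {precision:.3f}  "
      f"Recall {recall:.3f}  F1 {f1:.3f}")
```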

Real-world deployed phase

The following describes the methodology associated with the real-world deployment phase to evaluate the performance of the classifier in the naturalistic home setting. This was considered an important step to identify any biases in our model due to the variation in environment, play styles and instruments and to ascertain its practical viability. In-home test data were collected in the context of a usability study wherein the classifier was embedded within an existing game application, Bootle Band.

Materials. Bootle Band is an early childhood music application developed at the Possibility Engineering And Research Lab (PEARL) at Holland Bloorview Kids Rehabilitation Hospital and available on the Apple App Store. In Bootle Band, children play their musical instrument to interact with cartoon characters and advance the narrative. Families were provided an iPad with Bootle Band installed as well as a set of musical instruments that included a maraca, an egg shaker, a castanet, and a tambourine (Rhythm Band Instruments, Texas, US). Video instructions for the DIY instruments were also provided within the application. The research version of Bootle Band has the flexibility to be used as a data collection platform as audio and video data can be recorded during play with the family’s permission.

Participants. Twelve families were recruited through Holland Bloorview Kids Rehabilitation Hospital and included 9 boys and 3 girls aged 4 to 9 years, with and without motor disabilities (see Table 3 for a list of diagnoses represented in this study). The recruitment period spanned from February 22nd, 2021 to May 1st, 2021. Ethical approval for this study was obtained from the Holland Bloorview Research Ethics Board and the University of Toronto Health Sciences Research Ethics Board. Participants and/or their guardians provided informed, written consent using approved e-consent procedures facilitated by REDCap. Children who were unable to consent provided assent.

Table 3. Participants demographics (n = 12).
Demographic Total participants
Diagnosis
Cerebral Palsy, GMFCS ≤ Level 3 3
Spinal Cord Injury 1
Hard of hearing 1
Hydrocephalus 1
MYBPC2 3
Hemiparesis 2
Polymicrogyria 1
ADHD 1
No reported diagnosis 4
Average annual family income
<$24,999 2
$25,000 to $49,999 3
$50,000 to $74,999 0
$75,000 to $99,999 1
$100,000 to $149,000 2
>$150,000 2
Prefer not to disclose 2
Ethnicity
White 3
Hispanic 1
Black 5
Asian 1
Mixed heritage 2

Protocol. Families played Bootle Band with the audio detection interface for 2 weeks at home. The game prompted the family to adjust the volume of the iPad to a comfortable level before beginning to play but did not place any restrictions on noise that may be caused by external factors. During play, audio data were recorded with the family’s permission. Roughly 27 minutes of audio data were collected. See Fig 3 for a dataset breakdown per participant. We also conducted an interview with the families at the end of the study and asked them whether they observed any technical difficulties with the detection of the instruments.

Fig 3. At-home deployed dataset from a usability study with 12 families (≈ 93 ms window and 50% overlap).


Analysis. A researcher reviewed the recordings from the play sessions and manually identified the 4 classes of sound (i.e., shaker, tambourine, castanet or noise). Ten percent of the labels were randomly reviewed by another researcher for accuracy and showed 95% agreement using Cohen’s kappa [42]. Once labelled, we evaluated the performance of our classifier in the real-world home setting.
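
Inter-rater agreement of this kind can be computed with scikit-learn’s implementation of Cohen’s kappa, as sketched below; the label lists are hypothetical examples rather than study data.

```python
# Illustrative Cohen's kappa between two raters' labels for the same windows.
from sklearn.metrics import cohen_kappa_score

labels_rater_1 = ["shaker", "noise", "castanet", "tambourine", "noise"]
labels_rater_2 = ["shaker", "noise", "castanet", "tambourine", "castanet"]

kappa = cohen_kappa_score(labels_rater_1, labels_rater_2)
print(f"Cohen's kappa: {kappa:.2f}")
```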

Results

In-lab performance

The LGBM classifier performed best on our validation set across all window lengths and for all performance metrics (Precision: 0.845; Recall: 0.835; F1: 0.839; Accuracy: 0.844; ≈93 ms window). The AdaBoost model had similar performance (Accuracy: 0.837, ≈93 ms window), with the logistic regression model (Accuracy: 0.676, ≈93 ms window) and the SVM (Accuracy: 0.756, ≈93 ms window) being the lowest performing. Henceforth, results reported are with reference to the LGBM model only. The interested reader is referred to the Supporting Information (S5 Appendix) for a detailed performance comparison of the different machine learning models.

Table 2 presents the performance of the LGBM model on the held-out “in-lab” test set. Using a ≈23 ms (1024 sample) window length, we see an overall accuracy of about 72.6%. This accuracy increases when we expand our window to ≈46 ms (2048 samples) and ≈93 ms (4096 samples): relative to the smallest window size of ≈23 ms, accuracy increased by 1.4 percentage points (to 74.0%) and 11.8 percentage points (to 84.4%), respectively. Another substantial increase in performance can be observed in the precision of our classifier. For example, the lowest precision score was in the tambourine class with a ≈23 ms window, at 67.5%. This precision improved to 86.7% when the window size was increased to ≈93 ms. With this information, we deployed the classifier trained with the ≈93 ms window into Bootle Band for real-world playtime evaluation. The confusion matrix analysis for the LGBM model with a window size of ≈93 ms is provided in the Supporting Information (S6 Appendix) for the interested reader.

Table 2. Performance of the non-pitched percussion instrument classifier on our “in-lab” test set for different window lengths.

Window Length (ms) Instrument Class Accuracy Precision Recall F1 Score
23 Tambourine – 0.6750 0.7285 0.7007
 Shaker – 0.7185 0.7801 0.7480
 Castanet – 0.8909 0.5181 0.6552
 Noise – 0.7158 0.8812 0.7900
 Macro Average 0.7255 0.7500 0.7270 0.7235
46 Tambourine – 0.6822 0.7480 0.7136
 Shaker – 0.7433 0.7963 0.7689
 Castanet – 0.8886 0.5303 0.6642
 Noise – 0.7304 0.8834 0.7997
 Macro Average 0.7395 0.7611 0.7395 0.7366
93 Tambourine – 0.8673 0.8063 0.8357
 Shaker – 0.8175 0.8781 0.8467
 Castanet – 0.8388 0.7512 0.7925
 Noise – 0.8553 0.9051 0.8795
 Macro Average 0.8441 0.8447 0.8352 0.8386

Real-world performance

Twelve participants aged 4 to 9 years with a range of diagnoses affecting motor abilities participated in the home trial with detailed demographics reported in Table 3.

Fig 4 summarizes the performance of our classifier against the real-world data collected as children played the game at home.

Fig 4. Non-pitched percussion musical instrument classifier performance to real-world playtime setting (participants = 12).


There was a significant decrease in accuracy across all instrument classes compared to the performance seen with the “in-lab” test set. When evaluated with the real-world data collected in family homes, the average accuracy decreased by approximately 11 percentage points to 73.3% [SD = 6.8%]. We also observed a significant decrease in performance across all other macro averages (recall, precision and F1 score). The largest drop in performance was seen in the precision metric, which decreased by about 14 percentage points (from 84.5% [SD = 0.6%] to 70.5% [SD = 5.6%]). The smallest reduction in performance was seen in the recall metric, with a decrease of 8.6 percentage points (from 83.5% [SD = 0.6%] to 74.9% [SD = 5.7%]).

Looking into individual instrument classes, we see a significant reduction in recall for the noise class (69.9% median recall for at-home variants vs. 90.5% median recall for in-lab variants). The tambourine class showed the worst precision of all 4 classes, with a median of 63.5%. The tambourine also showed the lowest F1 score, with a median of roughly 62.8%. All three instrument classes (i.e., excluding the noise class) demonstrated above 75% recall, and their decrease in performance was not statistically significant relative to the in-lab variants.

In terms of variation from one household to the next, we observed that four participants had below 70% accuracy (P01 = 67.0%, P03 = 65.5%, P06 = 68.2% and P11 = 68.5%). We also observed one outlier with an accuracy of 91.5% (P04). All other participants (n = 7) showed accuracies within the 70–80% range. Consistent with the quantitative analysis, when families were asked “Did you experience any technical difficulties with the detection of musical instruments?”, eleven out of twelve families reported no latency or detection issues with Bootle Band. One of the twelve caregivers noticed a very minor delay between when an instrument was played and the game’s reaction. However, the caregiver noted that their child never seemed to notice and it did not impede the play experience.

Discussion

The goal of this research was to design an audio detection interface for non-pitched percussion instruments common in early childhood to support musical play and learning opportunities through low-cost, home-based applications. To this end, we created a large dataset of diverse audio samples, which was used to design and evaluate a classifier that achieved 84.4% accuracy across 4 classes (shaker, castanet, tambourine, noise).

This shows that a machine learning classifier can find predictive patterns that are unique to families of percussion instruments. As anticipated, a reduction in performance of about 11 percentage points was observed when the classifier was deployed in a naturalistic home setting; however, the performance was sufficient to sustain user satisfaction across all families. During this research, we prioritized the development of a technological solution that would translate well to real-world implementation, and this focus shaped many of our design decisions. We also prioritized equity, diversity and inclusion in our research by including diverse children with and without disabilities in the home study, in an effort to ensure that this research is translatable to families most in need of low-cost, early childhood music applications. Key findings are as follows:

1) Multiple strategies are needed to address the high level of variability associated with non-pitched percussion instruments. Within the same class of instrument, there is substantial variation from one instrument to another depending on the material of fabrication, size, and construction. For example, a 12” cowhide tambourine purchased from a music store is sonically quite different from a 4” plastic tambourine purchased at a local toy store. To address this variability, we built a large dataset containing over 369,000 samples of diverse instruments recorded in varied contexts with which to train our classifier. While we attempted to capture a wide variety of instruments in each class, our dataset was not exhaustive and will need continued expansion and iteration to capture the range of variability associated with non-pitched percussion instruments. To address this limitation, we conducted an error analysis of our classifier to understand which instruments were more likely to be inaccurately classified. First, we observed frequent misclassifications between tambourines and shakers, likely owing to an overlap of frequencies between these two instruments. When this behavior was explored in greater depth, we identified that plastic tambourines were often misclassified as shakers (compared to tambourines that used jingle bells or were made from wood/cowhide) and that wooden shakers were more often misclassified than plastic shakers. These observations were important for guiding successful real-world deployment and recommendations for preferred instruments for use in the game. Another step taken to mitigate this limitation was the creation of the “DIY” instrument guides, which resulted in low-cost instruments that could be consistently detected by our classifier with an average accuracy of 92% on the in-lab test set. Understanding the limitations of our classifier enabled us to generate and test these mitigating strategies, as well as guide future developments.

2) Good classifier performance with low latency and computational demands typical of common mobile devices could be achieved using a small feature set. We found that only a handful of features (MFCCs and signal entropy) were needed to classify percussion instruments in a short window of time (≈93 ms). By minimizing our features, we were able to reduce bias in our model and minimize computation time. This is crucial for our target application which required latency of less than 100 ms to ensure a good user experience [27]. While larger window times were associated with an increase in classifier performance, it is unlikely that this would have translated to a better user experience when embedded into a gaming application. Using the high speed SHAP algorithm [43, 44], we identified the features which contributed the most to our LGBM classifier. SHAP combines local explanations and optimal credit allocation using the classic Shapley values from game theory and their related extensions. We found that the first 4 MFCCs and signal entropy were the main drivers for our predictions. This finding makes sense since the first few MFCCs summarize most of the variance within the spectral envelope. Looking at signal entropy, we observed that it plays an important role for the castanet and noise predictions. We believe this is because the castanet family has a very sharp onset and decay compared to other instruments like a tambourine or a shaker, thereby producing a distinct entropy value. The opposite is true for noise, where most samples carry little information which is distinct from the other three families. Our SHAP results can be found in the supporting information file (S7 Appendix).
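
A brief sketch of how TreeSHAP can be applied to a trained LGBM model is given below; the `model`, `X_val`, and `feature_names` objects are assumed to come from earlier steps and are placeholders rather than the study’s actual variables.

```python
# Sketch of computing SHAP values for a trained LGBM model using TreeSHAP.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)   # one array per class for multiclass models

# Summary plot of per-feature contributions (e.g., MFCCs 1-12 and signal entropy).
shap.summary_plot(shap_values, X_val, feature_names=feature_names)
```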

3) “In the wild” evaluation is essential to characterizing system performance. Our classifier achieved 84.4% accuracy across all three instrument families when evaluated with our in-lab test set with ≈93 ms windows. In deployment, it achieved 73.3% accuracy across all three instrument families. In the home setting, misclassifications could be attributed to many different events: children yelling, the microphone being covered when a child holds the iPad, and children playing two different instruments at the same time with their sibling/parent. Upon exploratory investigation, we noticed that when a child was playing Bootle Band in a quiet environment without many interruptions like those listed above, the classifier performed nearly as well as in in-lab testing. Understanding the source of errors enables opportunities to provide families with better instructions for an optimal user experience as well as to inform technological developments that could help to mitigate misclassification (e.g., programmatic volume controls associated with the game music, warnings if the microphone is covered). As an example of the latter, one feature we implemented to support deployment was allowing users, through a friendly front-end user interface, to toggle the probability threshold for predicting an instrument. For example, if a family was playing in a loud environment, then to reduce the number of false positives they could increase the probability threshold of our classifier to a maximum of 0.9. If the family was playing in a quiet environment, they could lower the probability threshold to 0.4. Doing this allowed the family to adjust the classifier prediction threshold based on the environment they were playing in. See the supporting information file for an image of this front-end user interface (S8 Appendix). Families were made aware of this feature in an onboarding session; however, it was not used in this study. Families did not experience dissatisfaction with the audio detection interface, so it may be that they did not feel a need for this feature. Future work will be conducted to quantify the extent to which specific environmental conditions impact accuracy and the effectiveness of mitigating strategies that can be designed into the technology or provided through training resources.
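
One way such a user-adjustable threshold could be applied to the classifier’s probability outputs is sketched below; the class index, threshold values, and function name are hypothetical and not drawn from the Bootle Band codebase.

```python
# Sketch of the adjustable decision threshold described above. A prediction is
# only emitted when the winning instrument probability exceeds the threshold;
# otherwise the window falls back to the noise class. Class indices are assumed.
import numpy as np

NOISE_CLASS = 3  # hypothetical index of the noise class

def predict_with_threshold(model, X, threshold=0.6):
    proba = model.predict_proba(X)                      # shape: (n_windows, 4)
    best = np.argmax(proba, axis=1)                     # most likely class per window
    confident = proba[np.arange(len(X)), best] >= threshold
    return np.where(confident, best, NOISE_CLASS)

# A family in a loud room might raise the threshold toward 0.9 to cut false
# positives; in a quiet room it could be lowered toward 0.4.
```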

Another important consideration when thinking about target application was the relative consequence of false negatives (i.e., no detection of the instrument being played) versus false positives (i.e., detection of the instrument when it is not being played). While both may lead to a bad gameplay experience, young children tend to respond more negatively to false negatives. Thus, for our application, it was important to have a high recall and classify all instances of the instrument, even if that may lead to some additional false positives. In our case, our lowest recall in deployment was with the noise family with a median of 69.9% which was most often misclassified as a castanet. Upon further investigation, in most cases this was because the game audio had a castanet sound playing periodically in the background or because the family was moving around their instruments (creating a sharp onset and decay type of sound). However, a low recall for the noise class is not necessarily a bad thing as a false negative for noise means a false positive for an instrument class which is generally more acceptable than the counterpart. As noted above, families were also provided with a user interface to adjust the prediction threshold given that the optimal balance between false negatives and false positives may vary from child to child depending on motor abilities, self-efficacy and other individual characteristics. The families in this study did not make use of this feature and were satisfied with the audio detection interface.

Only one family noted any latency issues, and the delay was considered barely perceivable by the caregiver and unperceivable to the child. One reason for this is the game deployment itself: when Bootle Band prompts a user to play an instrument, it listens for input over several windows. For example, Bootle Band may listen for a musical input for one second. In that second, with a ≈93 ms window and 50% overlap, inference is performed roughly 21 times. Therefore, even with an average accuracy of 73.3%, Bootle Band mitigates errors through gameplay programming.
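
The “roughly 21 inferences per second” figure follows directly from the hop length implied by 50% overlap, as the short check below illustrates.

```python
# Rough check of "inference is performed roughly 21 times" per second of audio
# with a 4096-sample (~93 ms) window and 50% overlap at 44.1 kHz.
SAMPLE_RATE = 44_100
WINDOW = 4096
HOP = WINDOW // 2                         # 2048 samples, ~46 ms between inferences

inferences_per_second = SAMPLE_RATE / HOP
print(round(inferences_per_second, 1))    # ~21.5 windows start per second
```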

In future work, we plan on expanding our dataset to include the real-world samples we recorded with our participants. Additionally, as Bootle Band is deployed commercially, it may be possible to collect a large variety of audio recordings that could be used for further quality improvement with the appropriate user permissions. Machine learning models are commonly diagnosed and updated in deployment, and we see our classifier following a similar path, which will hopefully lead to improved performance in real-world deployments. When more training data and real-world samples are available, we may also explore the performance of deep neural networks (DNNs) to further improve classification. DNNs have been used successfully in widespread applications from speech-based emotion recognition [45–47] to music recognition [48] to detecting emotion in music [49]. Recent work with DNNs has shown immense potential for musical instrument classification, as reviewed by Blaszke and Kostek [50], particularly for predominant instrument recognition in polyphonic audio. To our knowledge, no previous work with DNNs has explored their application to the percussion instruments (maracas, tambourines, castanets) or sound environment (e.g., game music, children’s play, home) of interest in this music application. While the classification accuracy obtained by the methods described herein appeared to provide children with a good user experience, DNNs might be a promising direction of exploration, particularly if more instrument families are added to the system. As reviewed by Blaszke and Kostek [50], the current state of the art for multiple instrument recognition yields F1 scores around 0.64, while their DNN approach provided substantial increases to 0.93 [50]. DNN approaches may also offer greater flexibility, allowing for more complex models for instruments that are difficult to classify and simpler, more computationally efficient models for instruments that are easily identified [50]. The model architecture proposed by Blaszke and Kostek also makes it possible to add new instruments [50]. Some challenges to using DNNs include the large datasets needed for training and computational costs [51].

In this work, we used a cut-off frequency of 8 kHz; however, we would like to explore the results of increasing our bandwidth to 20 kHz for MFCC generation, along with other signal processing techniques. Looking at the spectrograms for each instrument class, we expect that this might reduce the misclassifications between shakers and tambourines. With the long-term goal of integrating our classifier into a low-cost, at-home music education application, we are also interested in the family experience (usability, engagement, perceived value) of mixed reality music applications like Bootle Band, both on its own and when contrasted with typical touchscreen approaches. A parallel paper will describe the family experience with Bootle Band in a 2x2 crossover study design wherein children played Bootle Band with real-world instruments via the audio detection interface described herein or with virtual instruments via the touchscreen, following which their experiences were captured through game logs, interviews and questionnaires. In the future, it would also be important to understand how the audio detection interface supports different game designs and play mechanics to understand how it can or should be deployed in applications. Working towards a music-making platform, we are interested in advanced pattern recognition to identify how well a child might keep a steady beat or reproduce a rhythmic pattern.

A final point of note: the goal of maximizing accessibility directed many of our design decisions, particularly with respect to (i) the instruments included in the system, (ii) consideration of diverse play styles, and (iii) support of instrument variants and do-it-yourself instruments. The instruments included in our audio detection interface were selected through consultations with occupational and music therapists based on their prominence in early childhood musical play, and they also target different grasps and motor movements. In the early stages of development of our interface, we recognized that the sound characteristics of the musical instruments might vary quite significantly depending on the motor abilities of the child and how the instruments were played (i.e., the play style). This was a key consideration when developing the database to train the instrument detection algorithms, as was the need to accommodate instrument variants (e.g., tambourines of different brands or fabrications that sound slightly different). The latter was to ensure that families could use instruments that they might already have on hand. To reduce potential economic constraints even further, we also included “do-it-yourself” instruments in the development of our algorithms. This design decision ensured that, in the future, families would be able to participate in the musical experience using instruments fabricated from low-cost household items. The arts-and-crafts aspect of these homemade instruments provided an additional feature of the game which a few families reported enjoying. It should be noted that future work would be needed to expand the audio detection interface to the wide range of non-pitched instruments that a child might encounter in the real world, as this was not the focus of this first stage of research. We expect that the generalizability of our approach to new non-pitched percussion instruments would likely depend on the extent of overlap and similarity in sonic characteristics.

Conclusion

In this work, we developed a musical instrument classifier for non-pitched percussion instruments. By developing a non-pitched instrument classifier, we offer insight into how the manipulation of different variables (i.e., window length, feature extraction, and model selection) contributes to the performance of a musical instrument classifier. We evaluated our classifier on in-lab variants and observed accuracies over 84% using a ≈93 ms windowing time. We also deployed our classifier to 12 families within a music application called Bootle Band and observed over 73% accuracy across all instrument families. Our classifier is the first step toward designing a low-cost, at-home music-making platform for children in early childhood. Ultimately, we expect that these results will help eliminate barriers that children and their families face, such as scheduling time for musical play and learning, the cost of musical play and learning, and the availability of a music program suited to their children’s abilities.

Supporting information

S1 Appendix. Do It Yourself (DIY) instruments.

(PDF)

S2 Appendix. List of feature extraction.

Italics indicate features selected by NCA.

(PDF)

S3 Appendix. Spectrogram of castanet (top), tambourine (middle) and shaker (bottom).

Parameters: 44.1 kHz sampling rate, Hanning window with 50% overlap, and 4096-sample DFT.

(PDF)

S4 Appendix. Optimized LGBM parameters using Optuna.

(PDF)

S5 Appendix. Comparison of model performance on in-lab test set.

Using approximately 93 ms window and reporting macro-average result across all classes. Bold represents best performance.

(PDF)

S6 Appendix. Confusion matrix analysis for the LGBM model using a 93ms window.

(DOCX)

S7 Appendix. SHAP results.

‘Class 0’ = tambourines, ‘Class 1’ = shakers, ‘Class 2’ = castanets, ‘Class 3’ = noise.

(PDF)

S8 Appendix. Bootle Band user interface for probability threshold of LGBM model.

(PDF)


Acknowledgments

The authors would like to thank the Institute of Biomedical Engineering at the University of Toronto, Holland Bloorview Kids Rehabilitation Hospital, and the Bloorview Research Institute for allowing us to conduct our research. The authors would like to thank Alexander Hodge for supporting integration within the Bootle Band app as well as Nikki Ponte and Dorothy Choi who supported the real-world data collection. The authors would like to thank Dr. Azadeh Kushki for her insights and helping to shape this project.

Data Availability

The data that support the findings of this study may be made available on request from the corresponding author, E.B, in compliance with institutional and ethical standards of operation. Data cannot be shared publicly because research participants did not provide consent for public sharing of their data. To ensure the long-term stability and accessibility of our research data, we will designate a non-author institutional contact, the research ethics committee chair. This approach ensures that the data remains accessible over time, providing a reliable point of contact for interested researchers. Such an arrangement is particularly beneficial in cases where an author may change their email address, shift to a different institution, or become unavailable to respond to data access requests. Please see the contact information for the non-author institutional contact below: Deryk Beal Research Ethics Board Chair Holland Bloorview Kids Rehabilitation Hospital 150 Kilgour Road, Toronto, ON M4G 1R8 Tel: (416) 425-6220, ext.3582 E-mail: dbeal@hollandbloorview.ca.

Funding Statement

The authors would like to acknowledge research funding provided through the Collaborative Health Research Projects program (FRN 163978) by the Natural Sciences and Engineering Research Councils of Canada, the Canadian Institutes of Health Research, and the Social Sciences and Humanities Research Council of Canada (https://www.nserc-crsng.gc.ca/professors-professeurs/grants-subs/chrpprcs_eng.asp), as well as the Queen Elizabeth II-Graduate Scholarship in Science and Technology program funded through the Ontario provincial governments (https://osap.gov.on.ca/OSAPPortal/en/A-ZListofAid/PRDR019236.html). The authors would also like to thank Rhythm Band Instruments (TX, USA) for their generous donation of instruments in support of this research. We are also extremely grateful to the Ontario Brain Institute for funding and support through the integrated discovery program, CP-Net (http://cpnet.canchild.ca/). We would also like to thank Scotiabank for its generous support of the gaming hub at the Bloorview Research Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Hallam S. The power of music: Its impact on the intellectual, social and personal development of children and young people. Int J Music Educ. 2010;28: 269–289. doi: 10.1177/0255761410370658 [DOI] [Google Scholar]
  • 2.Chiarello LA, Palisano RJ, Orlin MN, Chang HJ, Begnoche D, An M. Understanding participation of preschool-age children with cerebral palsy. J Early Interv. 2012;34: 3–19. doi: 10.1177/1053815112443988 [DOI] [Google Scholar]
  • 3.Carlson E, Bitterman A, Daley T. Access to Educational and Community Activities for Young Children with Disabilities. National Center for Special Education Research, Tech. Rep. 2010. [Google Scholar]
  • 4.King GA, Law M, King S, Hurley P, Hanna S, Kertoy M, et al. Measuring children’s participation in recreation and leisure activities: Construct validation of the CAPE and PAC. Child Care Health Dev. 2007;33:28–39. doi: 10.1111/j.1365-2214.2006.00613.x [DOI] [PubMed] [Google Scholar]
  • 5.Halfon S, Friendly M. Inclusion of young children with disabilities in regulated child care in Canada: A snapshot of research, policy, and practice. Childcare Resource and Research Unit. 2013;57. [Google Scholar]
  • 6.Michelsen SI, Flachs EM, Uldall P, Eriksen EL, McManus V, Parkes J, et al. Frequency of participation of 8-12-year-old children with cerebral palsy: A multi-centre cross-sectional European study. Eur J Paediatr Neurol. 2009;13: 165–177. doi: 10.1016/j.ejpn.2008.03.005 [DOI] [PubMed] [Google Scholar]
  • 7.Darrow AA. Music educators’ perceptions regarding the inclusion of students with severe disabilities in music classrooms. J Music Ther. 1999;36: 254–273. doi: 10.1093/jmt/36.4.254 [DOI] [PubMed] [Google Scholar]
  • 8.Burton SL, Pearsall A. Music-based iPad app preferences of young children. Res Stud Music Educ. 2016;38: 75–91. doi: 10.1177/1321103X16642630 [DOI] [Google Scholar]
  • 9.Adamovich SV, Fluet GG, Tunik E, Merians AS. Sensorimotor training in virtual reality: A review. 2009;25: 29–44. doi: 10.3233/NRE-2009-0497 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Dannenberg RB, Sanchez M, Joseph A, Capell P, Joseph R, Saul R. A computer-based multimedia tutor for beginning piano students. Interface—J New Music Res. 1990;19: 155–173. doi: 10.1080/09298219008570563 [DOI] [Google Scholar]
  • 11.Salmon S. Inclusion and Orff-Schulwerk. Orff Echo. 2012. [Google Scholar]
  • 12.Wright L, Pastore V. Musical Instruments and the Motor Skills They Require. 2020. Available from: https://www.understood.org/en/learning-thinking-differences/child-learning-disabilities/movement-coordination-issues/musical-instruments-and-the-motor-skills-they-require [Google Scholar]
  • 13.Nagawade MS, Ratnaparkhe VR. Musical instrument identification using MFCC. 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT). 2017; 2198–2202. doi: 10.1109/RTEICT.2017.8256990 [DOI] [Google Scholar]
  • 14.Racharla K, Kumar V, Jayant CB, Khairkar A, Harish P. Predominant musical instrument classification based on spectral features. 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN). 2020; 617–622. doi: 10.1109/SPIN48934.2020.9071125 [DOI] [Google Scholar]
  • 15.Chhabra A, Singh AV, Srivastava R, Mittal V. Drum Instrument Classification Using Machine Learning. In 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN) 2020. Dec 18 (pp. 217–221). IEEE. [Google Scholar]
  • 16.Muller M, Ellis DPW, Klapuri A, Richard G. Signal processing for music analysis. IEEE J Sel Top Signal Process. 2011;5: 1088–1110. [Google Scholar]
  • 17.Brent W. Physical and Perceptual Aspects of Percussive Timbre [Ph.D. dissertation]. UC San Diego; 2010. [Google Scholar]
  • 18.Derer K, Polsgrove L, Rieth H. A survey of assistive technology applications in schools and recommendations for practice. J Spec Educ Technol. 1996;13: 62–80. doi: 10.1177/01626434960130020 [DOI] [Google Scholar]
  • 19.Eggink J, Brown G. A missing feature approach to instrument identification in polyphonic music. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03). 2003; 5: V–553. [Google Scholar]
  • 20.Kostek B. Musical instrument classification and duet analysis employing music information retrieval techniques. Proc IEEE. 2004;92: 712–729. doi: 10.1109/JPROC.2004.825903 [DOI] [Google Scholar]
  • 21.Martins LG, Burred JJ, Tzanetakis G, Lagrange M. Polyphonic instrument recognition using spectral clustering. In: Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007. 2007: 213–218. [Google Scholar]
  • 22.Burred JJ, Robel A, Sikora T. Polyphonic musical instrument recognition based on a dynamic model of the spectral envelope. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. 2009: 173–176. [Google Scholar]
  • 23.Little D, Pardo B. Learning musical instruments from mixtures of audio with weak labels. ISMIR 2008, 9th International Conference on Music Information Retrieval. 2008: 127–132. [Google Scholar]
  • 24.Essid S, Richard G, David B. Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Transactions on Audio, Speech, and Language Processing. 2006;14: 68–80. doi: 10.1109/TSA.2005.860351 [DOI] [Google Scholar]
  • 25.Sturm BL, Morvidone M, Daudet L. Musical instrument identification using multiscale mel-frequency cepstral coefficients. Proceedings of the European Signal Processing Conference. 2010; 477–481. [Google Scholar]
  • 26.Poliner G, Ellis D, Ehmann A, Gomez E, Streich S, Ong B. Melody transcription from music audio: Approaches and evaluation. IEEE Transactions on Audio, Speech, and Language Processing. 2007;15: 1247–1256. doi: 10.1109/TASL.2006.889797 [DOI] [Google Scholar]
  • 27.Claypool M, Claypool K. Latency and player actions in online games. Communications of the ACM. 2006; 49:40–45. doi: 10.1145/1167838.1167860 [DOI] [Google Scholar]
  • 28.Cooley JW, Tukey JW. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation. 1965;19: 297–301. [Google Scholar]
  • 29.Joder C, Essid S, Richard G. Temporal integration for audio classification with application to musical instrument classification. IEEE Transactions on Audio, Speech, and Language Processing. 2009;17: 174–186. doi: 10.1109/TASL.2008.2007613 [DOI] [Google Scholar]
  • 30.Bernhard S. Heart Sound Classifier. 2020. Available at: https://www.mathworks.com/matlabcentral/fileexchange/65286-heart-sound-classifier [Google Scholar]
  • 31.Yang W, Wang K, Zuo W. Neighborhood component feature selection for high-dimensional data. Journal of Computers. 2012;7: 162–168. doi: [DOI] [Google Scholar]
  • 32.The MathWorks Inc. Neighborhood Component Analysis (NCA) Feature Selection—MATLAB & Simulink. 2015. Available at: https://www.mathworks.com/help/stats/neighborhood-component-analysis.html [Google Scholar]
  • 33.Logan B. Mel frequency cepstral coefficients for music modeling. International Symposium on Music Information Retrieval. 2000. doi: 10.5281/zenodo.1416444 [DOI] [Google Scholar]
  • 34.Ganchev T, Fakotakis N, Kokkinakis G. Comparative evaluation of various MFCC implementations on the speaker verification task. Proc. of the SPECOM-2005. 2005;1: 191–194. [Google Scholar]
  • 35.Morvidone M, Sturm BL, Daudet L. Incorporating scale information with cepstral features: Experiments on musical instrument recognition. Pattern Recognition Letters. 2010;31: 1489–1497. doi: 10.1016/j.patrec.2009.12.035 [DOI] [Google Scholar]
  • 36.Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27: 379–423. [Google Scholar]
  • 37.Moddemeijer R. On estimation of entropy and mutual information of continuous distributions. Signal Processing. 1989; 16:233–248. [Google Scholar]
  • 38.Zhu J, Rosset S, Zou H, Hastie T. Multi-class AdaBoost. Statistics and its Interface. 2006;2: 349–360. [Google Scholar]
  • 39.Chen T, Guestrin C. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. 2016; 785–794. doi: 10.1145/2939672.2939785 [DOI] [Google Scholar]
  • 40.Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017; 3149–3157. [Google Scholar]
  • 41.Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-Generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019; 2623–2631. doi: 10.1145/3292500.3330701 [DOI] [Google Scholar]
  • 42.Cohen J. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement. 1960; 20: 37–46. [Google Scholar]
  • 43.Lundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017; 4768–4777. [Google Scholar]
  • 44.Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nature Machine Intelligence. 2020;2: 56–67. doi: 10.1038/s42256-019-0138-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Swain M, Maji B, Khan M, El Saddik A, Gueaieb W. Multilevel feature representation for hybrid transformers-based emotion recognition. In 2023 5th International Conference on Bio-engineering for Smart Technologies (BioSMART) 2023 Jun 7 (pp. 1–5). IEEE. [Google Scholar]
  • 46.Mustaqeem K, El Saddik A, Alotaibi FS, Pham NT. AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state network. Knowledge-Based Systems. 2023. Jun 21;270:110525. [Google Scholar]
  • 47.Ishaq M, Khan M, Kwon S. TC-Net: A Modest & Lightweight Emotion Recognition System Using Temporal Convolution Network. Computer Systems Science & Engineering. 2023. Sep 1;46(3). [Google Scholar]
  • 48.Poulose A, Reddy CS, Dash S, Sahu BJ. Music recommender system via deep learning. Journal of Information and Optimization Sciences. 2022. Jul 4;43(5):1081–8. [Google Scholar]
  • 49.Ramírez J, Flores MJ. Machine learning for music genre: multifaceted review and experimentation with audioset. Journal of Intelligent Information Systems. 2020. Dec;55(3):469–99. [Google Scholar]
  • 50.Blaszke M, Kostek B. Musical Instrument Identification Using Deep Learning Approach. Sensors. 2022;22: 3033. doi: 10.3390/s22083033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Zaman K, Sah M, Direkoglu C, Unoki M. A Survey of Audio Classification Using Deep Learning. IEEE Access. 2023. Sep 22. [Google Scholar]

Decision Letter 0

Kathiravan Srinivasan

21 Dec 2023

PONE-D-23-33875: Musical Instrument Classifier for Early Childhood Percussion Instruments (PLOS ONE)

Dear Dr. Biddiss,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: Please revise and resubmit your manuscript. The requested citations in the review comments are not a requirement for publication. 

==============================

Please submit your revised manuscript by Feb 04 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Kathiravan Srinivasan

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Competing Interests section:

“I have read the journal's policy and the authors of this manuscript have the following competing interests: Holland Bloorview is supporting the creation of a company called Pearl Interactives to commercialize products like Bootle Band so that it can be made widely available to those who can benefit from it. Elaine Biddiss and Ajmal Khan are shareholders in Pearl Interactives and may financially benefit from this interest if Pearl Interactives is successful in marketing products related to this research including Bootle Band. The terms of this arrangement have been reviewed and approved by Holland Bloorview Kids Rehabilitation Hospital and the University of Toronto in accordance with its policy on objectivity in research. We will continue to actively monitor, mitigate and manage any conflicts of interest. Our goal is to remain transparent and committed to the best interests of study participants, patients and families.”

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

3. In this instance it seems there may be acceptable restrictions in place that prevent the public sharing of your minimal data. However, in line with our goal of ensuring long-term data availability to all interested researchers, PLOS’ Data Policy states that authors cannot be the sole named individuals responsible for ensuring data access (http://journals.plos.org/plosone/s/data-availability#loc-acceptable-data-sharing-methods).

Data requests to a non-author institutional point of contact, such as a data access or ethics committee, helps guarantee long term stability and availability of data. Providing interested researchers with a durable point of contact ensures data will be accessible even if an author changes email addresses, institutions, or becomes unavailable to answer requests.

Before we proceed with your manuscript, please also provide non-author contact information (phone/email/hyperlink) for a data access committee, ethics committee, or other institutional body to which data requests may be sent. If no institutional body is available to respond to requests for your minimal data, please consider if there any institutional representatives who did not collaborate in the study, and are not listed as authors on the manuscript, who would be able to hold the data and respond to external requests for data access? If so, please provide their contact information (i.e., email address). Please also provide details on how you will ensure persistent or long-term data storage and availability.

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

5. We note that Figures S5 and S6 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

1. You may seek permission from the original copyright holder of Figure(s) [#] to publish the content specifically under the CC BY 4.0 license.

 We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

 Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

 In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

 2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. Please add paper contributions in the abstract.

2. Please add paper organization at the end of the introduction.

3. Please show the data distribution graph.

4. The paper didn’t consider the related work section.

5. Please add the following reference article for your discussion.

a. Music recommender system via deep learning, Journal of Information and Optimization Sciences 43 (5), 1081-1088.

6. A comparison of other machine learning models is missing.

7. State-of-the-art comparison is not considered in the analysis.

8. Confusion matrix analysis is missing.

9. Computation time analysis is not considered.

Reviewer #2: Decision: Major Revision

1. The paper mentions a dataset of diverse audio samples. How representative is this dataset of the wide range of non-pitched percussion instruments that children might encounter in real-world scenarios?

2. While the paper acknowledges challenges related to variability in home environments, the extent to which background noise and interruptions affect the classifier's performance could be explored in more detail. Are there specific environmental conditions that significantly impact accuracy?

3. The paper highlights the reduction in performance during real-world deployment. Could the paper delve deeper into the limitations of the model and potential causes for the decrease in accuracy?

4. The paper focuses on non-pitched percussion instruments. How generalizable is the proposed approach to other musical instruments, especially those with different sonic characteristics?

5. While the paper mentions the user interface feature allowing families to adjust the probability threshold, it could provide more insights into the effectiveness of this feature. Were families able to use it effectively, and did it enhance the overall user experience?

6. In the future work section, the paper mentions plans to expand the dataset. What specific strategies will be employed to ensure a more exhaustive dataset, and how might this impact the classifier's performance?

7. The paper suggests exploring deep neural networks in the future. Could the authors elaborate on the anticipated advantages and challenges of using deep learning for this particular classification task?

8. The paper mentions a parallel study to describe the family experience. Could the authors provide a brief overview of the methodology used to assess the family experience and its integration with the classifier's performance?

9. The paper discusses the consequence of false negatives versus false positives. How was the balance between minimizing false negatives and managing false positives determined, and what impact did this have on the overall user experience?

10. The manuscript, however, does not link well with recent literature on recognition that has appeared in relevant top-tier journals, e.g., the IEEE Intelligent Systems department article on "AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state network". New trends in AI for recognition, such as “ARTriViT: Automatic Face Recognition System Using ViT-Based Siamese Neural Networks with a Triplet Loss”, are also missing and should be included.

11. Why is the proposed approach suitable for solving this critical problem? A more convincing response is needed to clearly indicate the state-of-the-art (SOTA) development.

12. Extension to literature will be appreciated: 10.32604/csse.2023.037373, 10.1109/BioSMART58455.2023.10162089. Please cite in section II behind the CNN and DL to enrich the literature.

Reviewer #3: The authors present a model for the identification of non-pitched percussion musical instruments, with a focus on instruments that are used in childhood education.

The presented investigation is meant to be used as a backbone and proof of concept for a target mixed reality application.

In the introduction, I am missing a final paragraph describing the structure of the rest of the paper, that is usually present and useful to the reader.

In my opinion Methods should be changed to Methodology or Materials and Methods.

One weak point of the paper is the references and citations. Most of the papers are very old, which does not look good for a paper in a rapidly evolving field, like machine learning for audio.

The extended description of the MFCCs and entropy features is not in my opinion needed in a research paper, since the aim of the contribution is not there, but in the novelty aspect.

My main concern regarding the methodology is the window length. First of all, [23] is used as a reference for the selection of window, but it is related to a much different task (music transcription). In [44], even though a small window is used for MFCC extraction (1024samples), classification is performed for 4second frames.

Another thing that I suggest should be improved is the structure of the paper. First of all, there is a plan to use the classifier in a mixed reality application. This is mentioned several times, in different places. I believe that there should be a section or sub-section clarifying what kind of game this is, etc. It is difficult for the reader to follow the concept, when different parts are mixed up.

I feel the same about dataset creation. There is a section for the dataset creation of the classifier, and there is another section concerning an experiment with the classifier used within a game.

I believe it is an interesting work but needs rewriting with emphasis on:

1) Defining the structure of the paper in the introduction

2) Making more clear what kind of application is the task expected to be used in

3) What kind of experiments are held within the paper.

4) An update of the references with more recent works

5) A better justification of the window length, which seems quite small. Can humans identify the instruments in such a short window? What is the length of the recordings? For example, approximately how many ms is one hit of a maraca or tambourine? Is one hit broken into smaller samples in the proposed methodology? Is it possible that samples that belong to the same hit are present in both training and testing?

6) Has there been some kind of normalization of audio samples? Classifiers can adapt to different energy levels if the dataset is not normalized. I believe that the dataset is one of the strong points of the paper and it should be presented more clearly. Is it available? This could also strengthen both the contribution of the paper and also the transparency.

I understand that many of these questions maybe answered, but I believe that the reader should be able to find the information they want without having to carefully read the whole paper, and this is why I propose a better structure.

7) I don't understand the relevance of the demographics

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: 

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Comments PONE-D-23-33875.docx

pone.0299888.s009.docx (13KB, docx)
PLoS One. 2024 Apr 2;19(4):e0299888. doi: 10.1371/journal.pone.0299888.r002

Author response to Decision Letter 0


4 Feb 2024

Editorial comments:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

As requested, the PLOS ONE style templates were consulted and corresponding changes were made to adhere to PLOS ONE’s style requirements.

2. Thank you for stating the following in the Competing Interests section. Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

As requested, we have added the updated Competing Interests statement in our cover letter.

3. In this instance it seems there may be acceptable restrictions in place that prevent the public sharing of your minimal data. However, in line with our goal of ensuring long-term data availability to all interested researchers, PLOS’ Data Policy states that authors cannot be the sole named individuals responsible for ensuring data access (http://journals.plos.org/plosone/s/data-availability#loc-acceptable-data-sharing-methods).

Before we proceed with your manuscript, please also provide non-author contact information (phone/email/hyperlink) for a data access committee, ethics committee, or other institutional body to which data requests may be sent. Please also provide details on how you will ensure persistent or long-term data storage and availability.

As requested, our revised data availability statement is as follows:

“The data that support the findings of this study may be made available on request from the corresponding author, E.B, in compliance with institutional and ethical standards of operation. Data cannot be shared publicly because research participants did not provide consent for public sharing of their data. To ensure the long-term stability and accessibility of our research data, we will designate a non-author institutional contact, the research ethics committee chair. This approach ensures that the data remains accessible over time, providing a reliable point of contact for interested researchers. Such an arrangement is particularly beneficial in cases where an author may change their email address, shift to a different institution, or become unavailable to respond to data access requests. Please see the contact information for the non-author institutional contact below:

Deryk Beal

Research Ethics Board Chair

Holland Bloorview Kids Rehabilitation Hospital

150 Kilgour Road, Toronto, ON M4G 1R8

Tel: (416) 425-6220, ext.3582

E-mail: dbeal@hollandbloorview.ca

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

Our full ethics statement is mentioned in the ‘Methods’ section of our manuscript file as follows: “Ethical approval for this study was obtained from the Holland Bloorview Research Ethics Board and the University of Toronto Health Sciences Research Ethics Board (eREB project ID #0257). Participants and/or their guardians provided informed and written consent using approved e-consent procedures facilitated by REDCap. Children who were unable to consent provided assent.”

5. We note that Figures S5 and S6 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

As requested, we have uploaded the completed Content Permission Form as an "Other" file with our submission. In the figure caption of the copyrighted figures, we’ve included the following text: “Reprinted from Pearl Interactives under a CC BY license, with permission from Sharon Wong, CEO of Pearl Interactives, original copyright 2024.”

Reviewer comments:

Reviewer #1:

1. Please add paper contributions in the abstract.

As requested, we have added the following paper contributions to the abstract as follows:

“To our knowledge, the dataset compiled of 369,000 samples of non-pitched instruments is first of its kind. This work also suggests that a low feature space is sufficient for the recognition of non-pitched instruments. Lastly, real-world deployment and testing of the algorithms created with participants of diverse physical and cognitive abilities was also an important contribution towards more inclusive design practices. This paper lays the technological groundwork for a mixed reality music application that can detect children’s use of non-pitched, percussion instruments to support early childhood music education and play.”

2. Please add paper organization at the end of the introduction.

As requested, we have added a description of the paper organization to the end of the introduction. Thank you, we think this improves the paper’s readability:

“In this work, we aimed to develop an audio detection interface for non-pitched percussion instruments, specifically maracas, tambourines, and castanets, for use in early childhood music applications. To this end, this manuscript first describes the creation of a large database of non-pitched instrument audio samples. Second, we describe feature extraction and the development of machine learning models used in this classification task. Next, we present the performance of our classifier with (i) a test set recorded in-lab and (ii) real-world data recorded “in the wild” in family homes. To collect the latter, the audio detection interface was deployed in a music application called Bootle Band with 12 families with children of diverse abilities. Bootle Band is an early childhood music application developed at the Possibility Engineering And Research Lab (PEARL) at Holland Bloorview Kids Rehabilitation Hospital and available on the Apple App Store. Lastly, we discuss key findings, particularly with respect to the algorithm’s intended application as an audio detection interface to support interactive early childhood music applications.”

3. Please show the data distribution graph.

We have included the data distribution with respect to the instruments and data splits as Table 1. Please let us know if you would like additional information and what would be helpful and we are happy to provide. Thank you.

4. The paper didn’t consider the related work section.

Thank you for this suggestion. We have expanded the related work in the introduction section as follows:

“Musical instrument and audio classifiers have been built with a wide range of machine learning models including k-nearest neighbors (KNN) [35], Multi-layered Perceptron (MLP), and boosting algorithms [36]. Harish et al. reported accuracies of 79% for their SVM model which outperformed the other state-of-the-art models with in classifying six pitched instruments, including voice, using spectral features [36]. Mittal et al. demonstrated best performance with a Naive Bayes Classifier with an accuracy of 97% for distinguishing between 4 drum instruments using a dataset composed of both live recordings as well as a drum simulator [37]. While the musical instrument classification task for pitched instruments and drums are widely studied [10], we were not able to find previous work classifying diverse non-pitched instruments like maracas, tambourines, and castanets. Classification of non-pitched instruments may pose additional challenges due to greater overlaps in frequency bands and variation in sound quality and play style than pitched instruments [14]. “

We have also included greater discussion of deep learning techniques in the discussion as follows:

“As well reviewed by Blaszke and Kostek [50], the current state of the art for multiple instrument recognition yields F1 scores around 0.64 while their Deep Neural Nets (DNN) approach provided substantial increases to 0.93 [50]. DNN approaches may also offer greater flexibility allowing for more complex models for instruments that are difficult to classify and simpler, more computationally efficient models for instruments that are easily identified [50]. The model architecture proposed by Blaszke and Kostek also makes it possible to add new instruments [50]. Some challenges to using DNNs include the large datasets needed for training and computational costs [51].”

51) Zaman K, Sah M, Direkoglu C, Unoki M. A Survey of Audio Classification Using Deep Learning. IEEE Access. 2023 Sep 22.

5. Please add the following reference article for your discussion.

a. Music recommender system via deep learning, Journal of Information and Optimization Sciences 43 (5), 1081-1088.

The requested article has been added to the discussion section as requested.

6. A comparison of other machine learning models is missing.

In agreement and as requested, the comparison to other machine learning models is provided in the supporting file (S5) as follows:

S5 Appendix. Comparison of model performance on in-lab test set using approximately 93 ms window and reporting macro-average result across all classes. Bold represents best performance.

Please let us know if you prefer for the comparison table to appear in the main manuscript as opposed to the supporting information.

7. State-of-the-art comparison is not considered in the analysis.

The state-of-the-art comparison is now provided in the Results (In-lab Performance) analysis as follows:

“The LGBM classifier performed best on our validation set across all window lengths and for all performance metrics (Precision: 0.845; Recall: 0.835, F1: 0.839, Accuracy: 0.844, 93ms window). The AdaBoost model had similar performance (Accuracy: 0.837, 93ms window) with the logistic regression model (Accuracy: 0.676, 93ms window) and the SVM (Accuracy: 0.756, 93ms window) being the lowest performing.”
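For readers who wish to reproduce this kind of head-to-head comparison, a minimal, hypothetical sketch is shown below; it uses lightgbm and scikit-learn with default hyperparameters and synthetic stand-in features, since the study's exact settings and data are not given in this excerpt.

```python
# Hypothetical sketch of comparing the four model families named above on a
# precomputed feature matrix. Hyperparameters are library defaults and the
# features are synthetic stand-ins, not the study's settings or data.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for 13-dimensional per-window features (12 MFCCs + entropy).
X, y = make_classification(n_samples=2000, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LGBM": LGBMClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f"{name}: validation accuracy = {acc:.3f}")
```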

8. Confusion matrix analysis is missing.

As requested, we have added a confusion matrix to the supporting files (S6)

Please let us know if you prefer for the comparison table to appear in the main manuscript as opposed to the supporting information.

This was an important addition to support our discussion of the relative impact of false negatives and false positives provided in the discussion section as follows:

“Another important consideration when thinking about the target application was the relative consequence of false negatives (i.e., no detection of the instrument being played) versus false positives (i.e., detection of the instrument when it is not being played). While both may lead to a bad gameplay experience, young children tend to respond more negatively to false negatives. Thus, for our application, it was important to have a high recall and classify all instances of the instrument, even if that may lead to some additional false positives. In our case, our lowest recall in deployment was with the noise family, with a median of 69.9%, which was most often misclassified as a castanet. Upon further investigation, in most cases this was because the game audio had a castanet sound playing periodically in the background or because the family was moving around their instruments (creating a sharp onset and decay type of sound). However, a low recall for the noise class is not necessarily a bad thing, as a false negative for noise means a false positive for an instrument class, which is generally more acceptable than the counterpart.”
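As a purely illustrative sketch of the per-class recall analysis referred to here (the study's confusion matrix is in S6; the class labels and toy predictions below are placeholders, not the study's data):

```python
# Illustrative confusion-matrix / per-class recall analysis with placeholder data.
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

classes = ["maraca", "tambourine", "castanet", "noise"]
y_true = np.array(["maraca", "noise", "castanet", "noise", "tambourine", "noise"])
y_pred = np.array(["maraca", "castanet", "castanet", "noise", "tambourine", "castanet"])

cm = confusion_matrix(y_true, y_pred, labels=classes)
per_class_recall = recall_score(y_true, y_pred, labels=classes, average=None)
for name, r in zip(classes, per_class_recall):
    # In this toy example, noise recall drops because noise is mistaken for castanets.
    print(f"{name}: recall = {r:.2f}")
```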

9. Computation time analysis is not considered.

The model results using different window lengths is now provided in the Results in the “In-Lab performance” section as follows:

“Table 3 presents the performance of the LGBM model on the held-out “in-lab” test set. Using a ≈23 ms (1024 sample) window length we see an overall accuracy of about 72.6%. This accuracy increases when we expand our window to ≈46 ms (2048 samples) and ≈93 ms (4096 samples). Respectively, we see an increase in accuracy of 1.4% (to a total of 74.0%) and an increase of 11.8% (to a total of 84.4%) from our smallest window size of ≈23 ms. Another substantial increase in performance can be observed in the precision of our classifier. For example, the lowest precision score was in the Tambourine class with a ≈23 ms window at 67.5%. This precision improved to 86.7% when the window size was increased to 93 ms.”

Table 3. Performance of the non-pitched percussion instrument classifier on our “in-lab” test set across different window lengths.
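As a quick sanity check on the quoted window sizes, the millisecond values correspond to the sample counts under an assumed 44.1 kHz sampling rate (our assumption for illustration; the rate is not stated in this excerpt):

```python
# Window duration implied by each frame size, assuming a 44.1 kHz sampling rate
# (an illustrative assumption; 48 kHz would give slightly shorter durations).
SAMPLE_RATE_HZ = 44_100

for n_samples in (1024, 2048, 4096):
    duration_ms = 1000 * n_samples / SAMPLE_RATE_HZ
    print(f"{n_samples} samples: {duration_ms:.1f} ms")
# 1024 -> 23.2 ms, 2048 -> 46.4 ms, 4096 -> 92.9 ms
```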

We also provide results pertaining to the families’ perception of the computation time in the Results under the section heading, Real-World Deployment Performance, as follows:

“To coincide with the quantitative analysis, when families were asked “Did you experience any technical difficulties with the detection of musical instruments?”, eleven out of twelve families reported no latency or detection issues with Bootle Band. One out of the twelve caregivers noticed a very minor delay when an instrument was played vs. the reaction time of the game. However, the caregiver noted that their child never seemed to notice and it did not impede the play experience.”

Finally, in the discussion, we address computation time as follows:

“Good classifier performance with low latency and computational demands typical of common mobile devices could be achieved using a small feature set. We found that only a handful of features (MFCCs and signal entropy) were needed to classify percussion instruments in a short window of time (≈93 ms). By minimizing our features, we were able to reduce bias in our model and minimize computation time. This is crucial for our target application which required latency of less than 100 ms to ensure a good user experience [24]. While larger window times were associated with an increase in classifier performance, it is unlikely that this would have translated to a better user experience when embedded into a gaming application.”
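A minimal sketch of this kind of per-window feature extraction (12 MFCCs plus a signal-entropy value) is given below. The use of librosa, the 44.1 kHz sampling rate, and the histogram-based entropy estimate are illustrative assumptions, not necessarily the authors' exact implementation.

```python
# Sketch of per-window feature extraction: 12 MFCCs + a signal-entropy value.
# librosa, the 44.1 kHz rate, and the histogram-based entropy are illustrative
# assumptions rather than the authors' exact pipeline.
import numpy as np
import librosa

SR = 44_100
FRAME = 4096  # roughly 93 ms at 44.1 kHz

def extract_features(window: np.ndarray) -> np.ndarray:
    # 12 MFCCs computed over the single ~93 ms window
    mfcc = librosa.feature.mfcc(y=window, sr=SR, n_mfcc=12,
                                n_fft=FRAME, hop_length=FRAME)
    mfcc_mean = mfcc.mean(axis=1)  # one value per coefficient

    # Shannon entropy of the normalized amplitude histogram
    hist, _ = np.histogram(window, bins=64)
    p = hist / max(hist.sum(), 1)
    entropy = -np.sum(p * np.log2(p + 1e-12))

    return np.concatenate([mfcc_mean, [entropy]])  # 13-dimensional feature vector

# Example call on a random buffer standing in for one audio window.
features = extract_features(np.random.randn(FRAME).astype(np.float32))
print(features.shape)  # (13,)
```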

We also included in the discussion section, some notes regarding the deployment of the algorithm in the Bootle Band app and how that would have impacted the user’s perception of the computation time:

“Only one family noted any latency issues, and it was considered barely perceptible to the caregiver and imperceptible to the child. One reason for this is the game deployment itself. That is, when Bootle Band prompts a user to play an instrument, it listens for an input for several windows. For example, Bootle Band may listen to a musical input for one second. In that second, with a ≈93 ms window and 50% overlap, inference is performed roughly 21 times. Therefore, even with an average performance of 73.3%, Bootle Band mitigates errors through gameplay programming.”
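The "roughly 21 times" figure can be checked from the stated window length and overlap; the arithmetic below is our back-of-envelope check, not the game's scheduling code.

```python
# Back-of-envelope check of the "roughly 21 inferences per second" figure quoted above.
window_ms = 93.0
overlap = 0.5
hop_ms = window_ms * (1.0 - overlap)     # about 46.5 ms between successive window starts

listen_ms = 1000.0                       # one-second listening period
n_inferences = int(listen_ms // hop_ms)  # number of window starts that fit in that period
print(n_inferences)                      # 21
```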

Reviewer #2: Decision: Major Revision

1. The paper mentions a dataset of diverse audio samples. How representative is this dataset of the wide range of non-pitched percussion instruments that children might encounter in real-world scenarios?

Thank you for this comment. It is an important one. The dataset was created not with the intent of representing every non-pitched percussion instrument that a child might encounter in the real world. Rather, it was designed and instruments were selected with a focus on accessibility. To address this, we have added the following to the discussion section:

“The goal of maximizing accessibility directed many of our design decisions, particularly with respect to: (i) the instruments included in the system, (ii) consideration of diverse play style, (iii) support of instrument variants and do-it-yourself instruments. The instruments for our audio detection interface were selected through consultations with occupational and music therapists based on their prominence in early childhood musical play, and they also target different grasp and motor movements to play. In the early stages of development of our interface, we recognized that the sound characteristics of the musical instruments might vary quite significantly depending on the motor abilities of the child and how they were played (i.e., the play style). This was a key consideration when developing the database to train the instrument detection algorithms, as was the need to accommodate instrument variants (e.g., tambourines of different brands or fabrications that sound slightly different). The latter was to ensure that families could use instruments that they already might have on hand. To reduce potential economic constraints even further, we also included “do-it-yourself” instruments in the development of our algorithms. This design decision ensured that in the future, families would be able to participate in the musical experience using instruments fabricated from low-cost, household items. The arts and crafts aspect of these homemade instruments provided an additional feature to the game, which a few families reported enjoying. It should be noted that future work would be needed to expand the audio detection interface to the wide range of non-pitched instruments that a child might encounter in the real world, as this was not the focus of this first stage of research.”

2. While the paper acknowledges challenges related to variability in home environments, the extent to which background noise and interruptions affect the classifier's performance could be explored in more detail. Are there specific environmental conditions that significantly impact accuracy?

In response to the reviewer, the following is included in the discussion section:

“In the home setting, misclassifications could be attributed to many different events: children yelling, the microphone being covered when a child holds the iPad, and children playing two different instruments at the same time with their sibling/parent. Upon exploratory investigation we noticed that when the child was playing Bootle Band in a quiet environment without many interruptions like the ones listed above, the classifier performed with near equivalence to in-lab testing. Understanding the source of errors enables opportunities to provide families with better instructions for an optimal user experience as well as to inform technological developments that could help to mitigate misclassification (e.g., programmatic volume controls associated with the game music, warnings if the microphone is covered). As an example of the latter, one feature that we implemented to support deployment is that we allowed users, through a friendly front-end user interface, to toggle the probability threshold of predicting an instrument. For example, if a family was playing in a loud environment, then to reduce the number of false positives we allowed them to increase the probability threshold of our classifier to a max of 0.9. If the user were playing in a quiet environment, they could lower the probability threshold to 0.4. Doing this allowed the family to adjust the classifier prediction threshold based on the environment they were playing in. See the supporting information file for an image of this front-end user interface (S8). Families were made aware of this feature in an onboarding session, however, it was not used in this study. Families did not experience dissatisfaction with the audio detection interface, so it may be that they did not feel a need for this feature. Future work will be conducted to quantify the extent to which specific environmental conditions impact accuracy and the effectiveness of mitigating strategies that can be designed into the technology or provided through training resources as needed.”
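As a hypothetical sketch of how such a user-adjustable probability threshold could gate predictions (class names and threshold handling are illustrative; this is not Bootle Band's actual code):

```python
# Hypothetical gating of predictions by a user-adjustable probability threshold
# (0.4 to 0.9, as described above); illustrative only, not Bootle Band's code.
import numpy as np

CLASSES = ["maraca", "tambourine", "castanet", "noise"]

def gated_prediction(probabilities: np.ndarray, threshold: float = 0.6) -> str:
    """Return the top class only if its probability clears the threshold."""
    best = int(np.argmax(probabilities))
    if probabilities[best] >= threshold:
        return CLASSES[best]
    return "no detection"

# Noisy room: a higher threshold suppresses borderline (possibly false) detections.
print(gated_prediction(np.array([0.55, 0.20, 0.15, 0.10]), threshold=0.9))  # "no detection"
# Quiet room: a lower threshold accepts the same prediction.
print(gated_prediction(np.array([0.55, 0.20, 0.15, 0.10]), threshold=0.4))  # "maraca"
```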

3. The paper highlights the reduction in performance during real-world deployment. Could the paper delve deeper into the limitations of the model and potential causes for the decrease in accuracy?

The following is provided in the Discussion section pertaining to the reduction in performance in the real-world deployment.

“Our classifier achieved 84.4% accuracy across all three instrument families when evaluated with our in-lab test set with ≈93 ms windows. In deployment, it achieved 73.3% accuracy across all three instrument families. In the home setting, misclassifications could be attributed to many different events: children yelling, the microphone being covered when a child holds the iPad, and children playing two different instruments at the same time with their sibling/parent. Upon exploratory investigation we noticed that when the child was playing Bootle Band in a quiet environment without many interruptions like the ones listed above, the classifier performed with near equivalence to in-lab testing.”

4. The paper focuses on non-pitched percussion instruments. How generalizable is the proposed approach to other musical instruments, especially those with different sonic characteristics?

In response, we have added the following to the discussion of limitations. Thank you.

“It should be noted that future work would be needed to expand the audio detection interface to the wide range of non-pitched instruments that a child might encounter in the real world, as this was not the focus of this first stage of research. We expect that the generalizability of our approach to new non-pitched percussion instruments would likely depend on the extent of overlap and similarity in sonic characteristics.”

5. While the paper mentions the user interface feature allowing families to adjust the probability threshold, it could provide more insights into the effectiveness of this feature. Were families able to use it effectively, and did it enhance the overall user experience?

Thank you for your interest in this feature. As none of the families experienced dissatisfaction with the detection interface, this feature was not used. This clarification has been added to the manuscript as follows:

“Families were made aware of this feature in an onboarding session, however, it was not used in this study. Families did not experience dissatisfaction with the audio detection interface, so it may be that they did not feel a need for this feature.”

6. In the future work section, the paper mentions plans to expand the dataset. What specific strategies will be employed to ensure a more exhaustive dataset, and how might this impact the classifier's performance?

Thank you. In response, our plan for expanding the dataset is described in the discussion as follows:

“In future work, we plan on expanding our dataset to include the real-world samples we recorded with our participants. Additionally, as Bootle Band is deployed commercially, it might be possible to collect a large variety of audio recordings that could be used for further quality improvement with the appropriate user permissions. It is common that machine learning models are frequently diagnosed and updated in deployment, and we see our classifier following a similar path which will hopefully lead to improved performance in real-world deployments.”

7. The paper suggests exploring deep neural networks in the future. Could the authors elaborate on the anticipated advantages and challenges of using deep learning for this particular classification task?

We have added further discussion as requested by the reviewer as follows:

“As well reviewed by Blaszke and Kostek [50], the current state of the art for multiple instrument recognition yields F1 scores around 0.64 while their DNN approach provided substantial increases to 0.93 [50]. DNN approaches may also offer greater flexibility allowing for more complex models for instruments that are difficult to classify and simpler, more computationally efficient models for instruments that are easily identified [50]. The model architecture proposed by Blaszke and Kostek also makes it possible to add new instruments [50]. Some challenges to using DNNs include the large datasets needed for training and computational costs [51].”

51) Zaman K, Sah M, Direkoglu C, Unoki M. A Survey of Audio Classification Using Deep Learning. IEEE Access. 2023 Sep 22.

8. The paper mentions a parallel study to describe the family experience. Could the authors provide a brief overview of the methodology used to assess the family experience and its integration with the classifier's performance?

Thank you for your interest. This study has been described in better detail under the Discussion as follows:

“A parallel paper will describe the family experience with Bootle Band in this 2x2 crossover study design, wherein children played Bootle Band with real-world instruments via the audio detection interface described herein or with virtual instruments via the touchscreen, following which their experiences were captured through game logs, interviews, and questionnaires.”

9. The paper discusses the consequence of false negatives versus false positives. How was the balance between minimizing false negatives and managing false positives determined, and what impact did this have on the overall user experience?

We have included the following discussion with respect to false negatives and false positives and the overall user experience in the discussion

“Another important consideration when thinking about the target application was the relative consequence of false negatives (i.e., no detection of the instrument being played) versus false positives (i.e., detection of the instrument when it is not being played). While both may lead to a bad gameplay experience, young children tend to respond more negatively to false negatives. Thus, for our application, it was important to have a high recall and classify all instances of the instrument, even if that may lead to some additional false positives. In our case, our lowest recall in deployment was with the noise family, with a median of 69.9%, which was most often misclassified as a castanet. Upon further investigation, in most cases this was because the game audio had a castanet sound playing periodically in the background or because the family was moving around their instruments (creating a sharp onset and decay type of sound). However, a low recall for the noise class is not necessarily a bad thing, as a false negative for noise means a false positive for an instrument class, which is generally more acceptable than the counterpart. As noted above, families were also provided with a user interface to adjust the prediction threshold, given that the optimal balance between false negatives and false positives may vary from child to child depending on motor abilities, self-efficacy and other individual characteristics. The families in this study did not make use of this feature and were satisfied with the audio detection interface.”

10. The manuscript, however, does not link well with recent literature on recognition that has appeared in relevant top-tier journals, e.g., the IEEE Intelligent Systems department article on "AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state network". New trends in AI for recognition, such as “ARTriViT: Automatic Face Recognition System Using ViT-Based Siamese Neural Networks with a Triplet Loss”, are also missing and should be included.

We have added some of the suggested citations to the discussion as suggested.

11. Why is the proposed approach suitable for solving this critical problem? A more convincing response is needed to clearly indicate the state-of-the-art (SOTA) development.

Thank you for this comment. We have expanded the related work in the introduction section as follows:

“Musical instrument and audio classifiers have been built with a wide range of machine learning models including k-nearest neighbors (KNN) [35], Multi-layered Perceptron (MLP), and boosting algorithms [36]. Harish et al. reported accuracies of 79% for their SVM model which outperformed the other state-of-the-art models with in classifying six pitched instruments, including voice, using spectral features [36]. Mittal et al. demonstrated best performance with a Naive Bayes Classifier with an accuracy of 97% for distinguishing between 4 drum instruments using a dataset composed of both live recordings as well as a drum simulator [37]. While the musical instrument classification task for pitched instruments and drums are widely studied [10], we were not able to find previous work classifying diverse non-pitched instruments like maracas, tambourines, and castanets. Classification of non-pitched instruments may pose additional challenges due to greater overlaps in frequency bands and variation in sound quality and play style than pitched instruments [14]. “

We have also included greater discussion of deep learning techniques in the discussion as follows:

“As well reviewed by Blaszke and Kostek [50], the current state of the art for multiple instrument recognition yields F1 scores around 0.64 while their Deep Neural Nets (DNN) approach provided substantial increases to 0.93 [50]. DNN approaches may also offer greater flexibility allowing for more complex models for instruments that are difficult to classify and simpler, more computationally efficient models for instruments that are easily identified [50]. The model architecture proposed by Blaszke and Kostek also makes it possible to add new instruments [50]. Some challenges to using DNNs include the large datasets needed for training and computational costs [51].”

51) Zaman K, Sah M, Direkoglu C, Unoki M. A Survey of Audio Classification Using Deep Learning. IEEE Access. 2023 Sep 22.

Working off the prior work we shared above, we feel our contributions bring several unique differences. These are now stated in our abstract:

“To our knowledge, the dataset compiled of 369,000 samples of non-pitched instruments is first of its kind. This work also suggests that a low feature space is sufficient for the recognition of non-pitched instruments. Lastly, real-world deployment and testing of the algorithms created with participants of diverse physical and cognitive abilities was also an important contribution towards more inclusive design practices. This paper lays the technological groundwork for a mixed reality music application that can detect children’s use of non-pitched, percussion instruments to support early childhood music education and play.”

12. Extension to literature will be appreciated: 10.32604/csse.2023.037373, 10.1109/BioSMART58455.2023.10162089. Please cite in section II behind the CNN and DL to enrich the literature.

As requested, we have included the suggested citation.

Reviewer #3:

In the introduction, I am missing a final paragraph describing the structure of the rest of the paper, that is usually present and useful to the reader.

We agree and have added the requested paragraph as follows:

“In this work, we aimed to develop an audio detection interface for non-pitched percussion instruments, specifically maracas, tambourines, and castanets, for use in early childhood music applications. To this end, this manuscript first describes the creation of a large database of non-pitched instrument audio samples. Second, we describe feature extraction and the development of machine learning models used in this classification task. Next, we present the performance of our classifier with (i) a test set recorded in-lab and (ii) real-world data recorded “in the wild” in family homes. Lastly, we discuss key findings, particularly with respect to the algorithm’s intended application as an audio detection interface to support interactive early childhood music applications.”

In my opinion Methods should be changed to Methodology or Materials and Methods.

We agree and have changed the heading to Methodology.

One weak point of the paper is the references and citations. Most of the papers are very old, which does not look good for a paper in a rapidly evolving field, like machine learning for audio.

In agreement, we have redone the literature search and added more recent citations. This has been added to the Introduction section as follows:

“Musical instrument and audio classifiers have been built with a wide range of machine learning models including k-nearest neighbors (KNN) [35], Multi-layered Perceptron (MLP), and boosting algorithms [36]. Harish et al. reported accuracies of 79% for their SVM model which outperformed the other state-of-the-art models with in classifying six pitched instruments, including voice, using spectral features [36].Mittal et al. demonstrated best performance with a Naive Bayes Classifier with an accuracy of 97% for distinguishing between 4 drum instruments using a dataset composed of both live recordings as well as a drum simulator [37]. While the musical instrument classification task for pitched instruments and drums are widely studied [10], we were not able to find previous work classifying diverse non-pitched instruments like maracas, tambourines, and castanets. Classification of non-pitched instruments may pose additional challenges due to greater overlaps in frequency bands and variation in sound quality and play style than pitched instruments [14].”

We have also added some more recent literature to the discussion section as follows:

“DNN have been used successfully in widespread applications from speech-based emotion recognition [45-47] to music recognition [48] to detecting emotion in music [49]. Recent work with DNNs have shown immense potential for musical instrument classification as reviewed by Blaszke and Kostek [50]”

Citations added are as follows:

37) Chhabra A, Singh AV, Srivastava R, Mittal V. Drum Instrument Classification Using Machine Learning. In2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN) 2020 Dec 18 (pp. 217-221). IEEE.

45) Swain M, Maji B, Khan M, El Saddik A, Gueaieb W. Multilevel feature representation for hybrid transformers-based emotion recognition. In2023 5th International Conference on Bio-engineering for Smart Technologies (BioSMART) 2023 Jun 7 (pp. 1-5). IEEE.

46) Mustaqeem K, El Saddik A, Alotaibi FS, Pham NT. AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state network. Knowledge-Based Systems. 2023 Jun 21;270:110525.

47) Ishaq M, Khan M, Kwon S. TC-Net: A Modest & Lightweight Emotion Recognition System Using Temporal Convolution Network. Computer Systems Science & Engineering. 2023 Sep 1;46(3).

48) Poulose A, Reddy CS, Dash S, Sahu BJ. Music recommender system via deep learning. Journal of Information and Optimization Sciences. 2022 Jul 4;43(5):1081-8.

49) Ramírez J, Flores MJ. Machine learning for music genre: multifaceted review and experimentation with audioset. Journal of Intelligent Information Systems. 2020 Dec;55(3):469-99.

The extended description of the MFCCs and entropy features is not in my opinion needed in a research paper, since the aim of the contribution is not there, but in the novelty aspect.

Thank you. We have reduced the description of the MFCCs and entropy features and have provided appropriate references for the interested reader to refer to if they require more computational details.

My main concern regarding the methodology is the window length. First of all, [23] is used as a reference for the selection of the window length, but it is related to a much different task (music transcription). In [44], even though a small window is used for MFCC extraction (1024 samples), classification is performed for 4-second frames. A better justification of the window length, which seems quite small, is needed. Can humans identify the instruments in such a short window? What is the length of the recordings? For example, approximately how many ms is one hit of a maraca or tambourine? Is one hit broken into smaller samples in the proposed methodology? Is it possible that samples that belong to the same hit are present in both training and testing?

Thank you for this excellent point. The gaming literature suggests a “latency of less than 100 ms to ensure a good user experience [24]”, which motivated our decision to use a window length no greater than 100 ms. This is noted in the Methods section (Features and Window length).

As for your second point, it is an important one and we completely agree as test results could have been biased if the same instrument samples were within our training and testing set. We have added a sentence to the methods section (Dataset splits) to clarify that this was not the case.

“We labelled and grouped by instruments before data splitting to ensure that there was no overlap in samples between the training and testing sets. The test set comprised either (1) instruments that were not included in training and validation, or (2) instruments that were included in training/validation but recorded in a totally new environment. This prevented the classifier from being biased to certain brands and materials of instruments.”
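A minimal sketch of such an instrument-grouped split is shown below, using scikit-learn's GroupShuffleSplit; the grouping key, split ratio, and placeholder data are illustrative assumptions rather than the study's actual pipeline.

```python
# Sketch of an instrument-grouped split so that no physical instrument appears
# in both training and testing. GroupShuffleSplit and the 80/20 ratio are
# illustrative choices; the data below are placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(10, 13)                       # placeholder per-window feature vectors
y = np.array(["maraca"] * 5 + ["castanet"] * 5)  # placeholder labels
groups = np.array([f"instrument_{i // 2}" for i in range(10)])  # one group per physical instrument

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No instrument group appears in both sets.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```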

Another thing that I suggest should be improved is the structure of the paper. First of all, there is a plan to use the classifier in a mixed reality application. This is mentioned several times, in different places. I believe that there should be a section or sub-section clarifying what kind of game this is, etc. It is difficult for the reader to follow the concept, when different parts are mixed up. I feel the same about dataset creation. There is a section for the dataset creation of the classifier, and there is another section concerning an experiment with the classifier used within a game.

Thank you. Upon re-read, we realize that the structure of the paper was not clear. We have reorganized the paper significantly and we hope this makes it an easier read.

Has there been some kind of normalization of audio samples? Classifiers can adapt to different energy levels if the dataset is not normalized. I believe that the dataset is one of the strong points of the paper and it should be presented more clearly. Is it available? This could also strengthen both the contribution of the paper and also the transparency.

Normalization of the audio samples was carried out. The dataset is not yet available; we have not yet been able to secure the funding necessary to prepare it for release (e.g., to ensure that it is properly documented and, for the home dataset, that there is no identifying information in the recordings).
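For illustration, one common form of amplitude normalization is peak normalization, sketched below; this is an assumption for illustration, as the exact normalization method used in the study is not specified here.

```python
# Illustrative peak normalization so that clips recorded at different volumes
# share a comparable amplitude scale; the study's exact method is not specified.
import numpy as np

def peak_normalize(audio: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a mono audio buffer to a peak amplitude of 1.0."""
    return audio / (np.max(np.abs(audio)) + eps)

normalized = peak_normalize(np.random.randn(4096))
print(np.max(np.abs(normalized)))  # approximately 1.0
```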

I don't understand the relevance of the demographics

It is our institution’s policy to collect and report demographics fully (including socioeconomic status, ethnicity, gender, etc.) to promote equity, diversity, and inclusion in research.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

As requested, we have uploaded our figure files to the PACE digital diagnostic tool to ensure that they meet PLOS requirements. The changes made by PACE were adjusting the resolution to 300 PPI and converting the TIFF file to a valid TIF file. The updated figure files are included in our revised submission.

Attachment

Submitted filename: Editorial comments Response_Bootle Band.docx

pone.0299888.s010.docx (185.8KB, docx)

Decision Letter 1

John Blake

19 Feb 2024

Musical Instrument Classifier for Early Childhood Percussion Instruments

PONE-D-23-33875R1

Dear Dr. Biddiss,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

John Blake, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors,

Thank you for addressing all my comments; I have no further concerns about your paper.

Regards

Reviewer #2: The authors successfully addressed my comments and suggestions. Good Luck!

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Alwin Poulose

Reviewer #2: No

**********

Acceptance letter

John Blake

22 Mar 2024

PONE-D-23-33875R1

PLOS ONE

Dear Dr. Biddiss,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. John Blake

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Do It Yourself (DIY) instruments.

    (PDF)

    pone.0299888.s001.pdf (175.3KB, pdf)
    S2 Appendix. List of feature extraction.

    Italicized entries indicate features selected by NCA.

    (PDF)

    pone.0299888.s002.pdf (73.9KB, pdf)
    S3 Appendix. Spectrogram of castanet (top), tambourine (middle) and shaker (bottom).

    Parameters: 44.1 kHz sampling rate, Hanning window with 50% overlap, and a 4096-sample DFT.

    (PDF)

    pone.0299888.s003.pdf (135.7KB, pdf)
    S4 Appendix. Optimized LGBM parameters using Optuna.

    (PDF)

    pone.0299888.s004.pdf (102.7KB, pdf)
    S5 Appendix. Comparison of model performance on in-lab test set.

    Using an approximately 93 ms window; macro-averaged results are reported across all classes. Bold indicates the best performance.

    (PDF)

    pone.0299888.s005.pdf (74.1KB, pdf)
    S6 Appendix. Confusion matrix analysis for the LGBM model using a 93 ms window.

    (DOCX)

    pone.0299888.s006.docx (60.5KB, docx)
    S7 Appendix. SHAP results.

    ‘Class 0’ = tambourines, ‘Class 1’ = shakers, ‘Class 2’ = castanets, ‘Class 3’ = noise.

    (PDF)

    pone.0299888.s007.pdf (241.1KB, pdf)
    S8 Appendix. Bootle Band user interface for probability threshold of LGBM model.

    (PDF)

    pone.0299888.s008.pdf (149.9KB, pdf)
    Attachment

    Submitted filename: Comments PONE-D-23-33875.docx

    pone.0299888.s009.docx (13KB, docx)
    Attachment

    Submitted filename: Editorial comments Response_Bootle Band.docx

    pone.0299888.s010.docx (185.8KB, docx)

    Data Availability Statement

    The data that support the findings of this study may be made available on request from the corresponding author, E.B., in compliance with institutional and ethical standards of operation. Data cannot be shared publicly because research participants did not provide consent for public sharing of their data. To ensure the long-term stability and accessibility of our research data, we will designate a non-author institutional contact, the research ethics committee chair. This approach ensures that the data remain accessible over time, providing a reliable point of contact for interested researchers. Such an arrangement is particularly beneficial in cases where an author changes their email address, moves to a different institution, or becomes unavailable to respond to data access requests. Contact information for the non-author institutional contact: Deryk Beal, Research Ethics Board Chair, Holland Bloorview Kids Rehabilitation Hospital, 150 Kilgour Road, Toronto, ON M4G 1R8; Tel: (416) 425-6220, ext. 3582; E-mail: dbeal@hollandbloorview.ca.

