Patterns. 2024 Feb 12;5(3):100932. doi: 10.1016/j.patter.2024.100932

Propagating variational model uncertainty for bioacoustic call label smoothing

Georgios Rizos 1,5,, Jenna Lawson 2, Simon Mitchell 3, Pranay Shah 1, Xin Wen 1, Cristina Banks-Leite 2, Robert Ewers 2, Björn W Schuller 1,4,∗∗
PMCID: PMC10935495  PMID: 38487806

Summary

Along with propagating the input toward making a prediction, Bayesian neural networks also propagate uncertainty. This has the potential to guide the training process by rejecting predictions of low confidence, and recent variational Bayesian methods can do so without Monte Carlo sampling of weights. Here, we apply sample-free methods for wildlife call detection on recordings made via passive acoustic monitoring equipment in the animals’ natural habitats. We further propose uncertainty-aware label smoothing, where the smoothing probability is dependent on sample-free predictive uncertainty, in order to downweight data samples that should contribute less to the loss value. We introduce a bioacoustic dataset recorded in Malaysian Borneo, containing overlapping calls from 30 species. On that dataset, averaged across all target classes, our proposed method achieves absolute improvements over the point-estimate network baseline of around 1.5 percentage points in area under the receiver operating characteristic (AU-ROC), 13 points in F1, and 19.5 points in expected calibration error (ECE).

Keywords: variational Bayesian deep learning, uncertainty propagation, adaptive label smoothing, epistemic uncertainty, calibrated deep learning, bioacoustics, wildlife call detection, passive acoustic monitoring, machine audition

Graphical abstract


Highlights

  • Sample-free, Bayesian attentive ResNet with squeeze and excitation

  • Uncertainty-based, data-specific label smoothing

  • Bioacoustic call detection on two datasets, one of which is introduced here

  • Use of predictive uncertainty in label-smoothing parameterization

The bigger picture

Uncertainty awareness in deep learning enables models to focus on learning from well-annotated data and to place less confidence on uncertain predictions. This has the potential to foster trust in algorithmic decision making and enhance policy making in applications pertaining to conservation using recordings made by on-site passive acoustic monitoring equipment. Such analyses can automate the annotation process and reduce human presence in the field.


In this article, sample-free Bayesian neural networks are applied to bioacoustic call detection in order to improve both predictive and calibration performance. The authors further explore the use of Bayesian predictive uncertainty to guide the training process to focus less on samples for which the model predicts higher uncertainty and show promising results on two animal call-detection datasets, one of which is introduced here.

Introduction

Effective wildlife monitoring can guide action to ameliorate the effects of the global biodiversity crisis but poses an enormous scalability challenge.1,2 A potential solution for scalable bioacoustic data modeling3 is offered by the combination of audio sensing infrastructure4 and deep learning (DL), i.e., methods consisting of hierarchical stacking of linear processing layers and nonlinear pooling and activation operations. The monitoring of wildlife and environments using sound recorders—i.e., passive acoustic monitoring (PAM)4—allows for an automated, continuous monitoring solution that minimizes the duration of human presence in the field and, thus, the impact such presence can have on the behavior of the animals. Furthermore, the recordings no longer need to be limited to the amount of audio that experts can reasonably listen to, leading to great scalability both spatially and temporally. DL for bioacoustics offers the possibility of distilling the detection and categorization experience of ecology experts into a DL computational model. This can automate and expedite relevant labor, alleviating spurious annotation errors (as DL is known to be capable of doing5), such that the time of experts can be invested in a more fruitful manner. This scaled-up data enrichment can improve contributions to conservation- and ecology-related policy making.6

Many DL architectures that perform well in detecting specific signals in sound recordings—i.e., acoustic event detection (AED)—were originally designed for the visual classification domain.7 For a recent example, residual networks (hence ResNets,8 i.e., deep convolutional networks with residual connections every few layers for facilitating backward propagation of the error signal during training) were shown to outperform the competition in a study on AED.9 A ResNet similar to the winning method from the aforementioned study was also shown to be the best performer specifically for bioacoustic call detection in an extensive comparative study10 against a non-residual deep convolutional network,9 shallower networks of around two or three (1D or 2D) convolutional layers commonly used for AED,11,12,13,14 as well as a combination of convolutional and recurrent (i.e., designed for sequential data) layers previously used for the bioacoustic detection of Bornean gibbon calls.15 The success of the winning model of Rizos et al.10 was also due to the incorporation of attention methods, i.e., methods that entail the learning of weights that allow the model to focus on particular time frames16,17 or convolutional filters.18 Although both the former mechanism—attentive global sequence pooling—and the latter—squeeze and excitation (SE)—have been shown to contribute to improvements in the acoustic domain as well,12,19 including to the previously mentioned improved variant of ResNet for call detection,10 they have not necessarily been adopted in later acoustic ResNet-based call-detection studies.20,21 A recent alternate approach is BirdNET,20 an application of the Wide ResNet22 model (a variant of ResNet using larger filter numbers) to a large composite dataset for the detection of calls from 984 bird species, which achieves competitive results compared to other methods that have been tested on similar datasets with far fewer species. The pretrained BirdNET model has since been extensively applied on various datasets.23 Finally, although in this study we focus on the task of call detection, related tasks include cross-24 or within-species25 call type classification and individual identification.26 Such applications span a wide range of animal species, e.g., primates,10,14,15,27 whales,24 and birds.28,29

It is important, however, that the predictions made by the DL model are understood and trusted. Unfortunately, during this near-decade of DL advancement, a fixation by the DL community on deeper and more complicated architectures, as well as on traditional prediction performance evaluation measures, has led to an insidious DL model behavior: overconfident predictions,30 i.e., predictions made at a probability nearing 1, regardless of whether they are correct or not. Downstream software modules or policy makers making catastrophic decisions due to these overconfidently predicted misclassifications can foster deep mistrust in DL,30,31,32 something that has also been noted with respect to bioacoustics.3 However, early prediction calibration fixes30 are based on learning a transformation of the model outputs, which requires the existence of a labeled validation set, something that cannot be safely assumed in general. Another approach is label smoothing,33 a regularization method that has also been used with the intention of improving calibration.34 A smoothing probability hyperparameter, selected a priori, allows us to treat a ground-truth label annotation as noisy instead of binary (e.g., a probability of 0.9 of a call being present, instead of a fully confident 1). Although it was originally proposed as a means to improve predictive performance,33 its success in that regard35,36 has been inconsistent, as it has in other cases been shown to deteriorate predictive performance,34,37 without necessarily improving calibration.34 Label smoothing has also shown promise in some cases in a call-detection study21; however, calibration was not evaluated there.

A means of designing DL models with the ability to accompany their standard predictive output with a measure of uncertainty is Bayesian inference. Predictive uncertainty is a signal that the input sample may have been mislabeled.38 Bayesian neural networks (BNNs)39 have been shown to naturally offer better calibrated outputs as well as regularization compared to non-Bayesian, point-estimate versions of the same underlying architectures40 (see related surveys for in-depth discussion of why this happens,41 as well as lists of domain applications42). This is due to the predictive uncertainty, which describes a distribution from which less overconfident predictions can be sampled. BNNs employ distributional weight parameters, whose posterior distributions are calculated via Bayes’ rule and depend on the observed training set and a prior distribution assumption.39 Since, however, the integration for these posteriors is intractable (due to containing high-dimensional factors; see Blei et al. and Zhang et al.43,44), marginalizing the weights in order to obtain the statistical distribution of the outputs is often approximated via Monte Carlo (MC) sampling. As the uncertainty of stochastic parameters informs the output of each layer, and, hence, the input of each subsequent layer, we can understand the uncertainty information as being propagated through the entire network until the final layer calculates the output (or epistemic) uncertainty. Using MC-based approximation to calculate it, one has to use K MC samples, something that increases the computational load by a factor of K. MC-based approaches include Bayes by backprop45 and MC dropout46 and have been applied to a wide range of data domains, including audio.13,47
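
To make the K-fold cost concrete, the following minimal PyTorch sketch estimates predictive moments with K stochastic forward passes in the style of MC dropout; `model` is a hypothetical binary call detector returning one logit per clip, and the framework choice is our assumption.

```python
import torch

def mc_predictive_moments(model, x, k=20):
    """Estimate the predictive mean and variance with K stochastic forward
    passes (MC-dropout style); the K-fold increase in compute is explicit
    in the loop below."""
    model.train()  # keep dropout stochastic at prediction time (a simplification;
                   # in practice one would enable only the dropout modules)
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(k)])
    return probs.mean(dim=0), probs.var(dim=0)
```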

Uncertainty propagation in an MC sample-free manner can be performed by approximating the first two moments (i.e., expectation and variance) of the layer output pre-activations by leveraging the central limit theorem (CLT). This approach was first used for fast dropout,48 where it allowed for sampling from the much fewer pre-activations instead of the layer weights, and later in the context of BNNs.49 Later sample-free BNNs use closed-form, uncertainty-propagating, nonlinear activation functions50,51,52,53,54 and eschew the need for sampling even from pre-activations. Apart from avoiding costly weight sampling, this approach is also not subject to the stochasticity of MC-based approaches. This has been hypothesized to be the reason behind their improved prediction and calibration performance compared to MC-based methods.53,54,55 Propagation of more than two moments has been shown to be beneficial, e.g., in resisting adversarial attacks (i.e., targeted distortion of test data such that the output is misclassified56), but it also requires sampling for cubature,57 or unscented55 and particle58 filtering. Such models have been applied to computer vision tasks such as image classification and segmentation, on data ranging from standard benchmarks54 such as CIFAR,59 to medical and radar images,55 but never to audio and, specifically, to bioacoustic call detection.
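
As an illustration of such CLT-based moment matching, the sketch below gives the well-known closed-form first two moments of a ReLU applied to a Gaussian pre-activation, the kind of expression used by fast dropout and later sample-free BNNs; the exact activation formulas used in this article are detailed in the supplemental material, so treat this as a generic example.

```python
import numpy as np
from scipy.stats import norm

def relu_moments(mu, var):
    """First two moments of ReLU(a) for a ~ N(mu, var), in closed form.
    mu and var are arrays of pre-activation means and variances."""
    sigma = np.sqrt(var) + 1e-12                     # guard against zero variance
    z = mu / sigma
    cdf, pdf = norm.cdf(z), norm.pdf(z)
    mean = mu * cdf + sigma * pdf                    # E[max(a, 0)]
    second = (mu**2 + var) * cdf + mu * sigma * pdf  # E[max(a, 0)^2]
    return mean, np.maximum(second - mean**2, 0.0)   # clip tiny negative variances
```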

That being said, many recent moment-propagating BNN studies constitute Bayesian treatments of DL models with simple mechanisms, such as dense52,53,58 and convolutional51,55 layers interwoven with nonlinear activation functions,50,52,53,55 even in non-Bayesian uncertainty propagation.60 Although a sample-free Bayesian version of a dense layer-based network with ResNet-like skip connections has been proposed in Wu et al.,53 less consideration has been given to doing the same for more advanced concepts such as convolutional ResNets, SE, and attention. Furthermore, even though the sample-free Bayesian approach has been shown to be superior to MC-based BNNs,53,54,55 only the latter approach has been used in bioacoustics13 (and, in fact, on a shallower three-layer network of the kind that has been shown to underperform compared to deeper ResNets10).

Despite the demonstrated promise of sample-free BNNs, there does not yet exist an explicit utilization of sample-free predictive uncertainty as a signal for data-specific regularization during training. We believe that such explicit usage can guide the model to place less weight on learning from data that it assesses as noisily annotated, something that can potentially improve both predictive and calibration performance. We also believe that this is a very timely topic for investigation in the domain of bioacoustic call detection, a domain where the need for calibration of model output probabilities (along with traditional accuracy-based performance evaluation) has been repeatedly suggested.3,61 This is especially important as probabilistic, instead of categorical, outputs are considered to be more informative for downstream decision making.62

The contributions we make in this article are summarized as follows.

  • (1)

    We perform the first exploration of sample-free, uncertainty-propagating, variational Bayesian DL on bioacoustic call detection in order to exploit the regularization and the better calibration that such models exhibit. Specifically, we provide a sample-free Bayesian treatment of a complex DL architecture that has excelled in the call-detection task.10 It propagates activation expectations and variances through mechanisms such as global attention pooling and SE blocks. To our knowledge, this is the first time a moment-propagating version of the SE mechanism has been proposed and evaluated, although MC-based Bayesian methods have done so before.63 We further consider two variants of the underlying model concerning the type of local pooling: one using the known55,64 moment-propagating version of max-pooling, and one using a moment-propagating version, introduced here, of an attention-pooling method inspired by recent studies.65,66 Our results indicate that opting for a sample-free Bayesian DL method is indeed the most promising approach, as it outperforms the corresponding point-estimate baseline in most cases.

  • (2)

    We propose a regularization method that explicitly uses the propagated predictive uncertainty of a sample-free BNN model as a signal for adaptive label smoothing that is specific to each data sample. The rationale is that the importance of highly uncertain samples could be attenuated in the loss calculation. An overview of the whole approach is depicted in Figure 1. This approach achieves generally higher predictive and calibration performance compared to our other baselines when the underlying model uses maximum local pooling. In the case of attention local pooling, the comparison is less conclusive, as the best performer is either a variant in which the same smoothing probability is used for all samples in a batch (hence, data-sample agnostic) or no label smoothing at all. Our results indicate, however, that deterministic, moment-propagating BNNs—including our proposed method—exhibit high calibration performance also in bioacoustic call detection compared to point-estimate networks.

  • (3)

    Our methodology is evaluated on challenging, real-world, “in-the-wild” datasets; in bioacoustics for wildlife PAM, this is literally the case. The recordings may contain multiple background sounds other than the target calls. We obtain the best reported results on a spider monkey call-detection dataset previously used in Rizos et al.,10 and we further introduce a new dataset with annotations for 30 distinct species (29 bird species and the Bornean gibbon) with potentially overlapping calls. The latter, which we call the SAFE Project67 Multi-Species Multi-Task (SAFE-MSMT) dataset, is available at Zenodo: https://doi.org/10.5281/zenodo.7740620.

Figure 1.

An abstraction of the sample-free, moment-propagating variational Bayesian SE-ResNet model with multi-head attention that we use as a basis throughout this study

The point-estimate version follows the same architecture, but each layer, block, and nonlinearity does not use variational learning for inference or make affordances for propagating uncertainty. The outputs of the Bayesian SE-ResNet are used to parameterize a label-smoothing operation, and the loss calculation is performed using the smooth label.

Results

The common type of task between our two animal call-detection datasets is binary classification (i.e., positive class when one or more calls of a particular type are found in a recorded clip, negative class otherwise). The Osa Peninsula Spider Monkey Whinny (OSA-SMW) dataset was first introduced and described in Rizos et al.,10 and a single binary call-detection task is defined on it, where the focus is specifically the whinny call of Geoffroy’s spider monkey (Ateles geoffroyi). We introduce here the SAFE-MSMT dataset, the description and preprocessing details of which can be found in the supplemental experimental procedures (sub-section “SAFE-MSMT Dataset”). We consider the detection of calls for each species identified within the dataset as a separate binary task and have identified 30 species such that, for all tasks, there are positive examples for each class in all of the training, development, and testing sets. Zero, one, or more species’ calls may be audible per audio clip, which constitutes a multi-label classification problem. We approach this via a multi-task framework where each independent task is binary classification. This is realized by having one prediction layer per task, responsible for predicting the probability of the presence of a corresponding species call.

For evaluating our experiments, we opted to report the non-interpolated area under the precision-recall curve (AU-PR) of the positive class and the area under the receiver operating characteristic curve (AU-ROC) as prediction performance measures that average over all possible probability thresholds for classification. Test performance is measured using the model that achieved the best validation performance according to AU-PR, which is a stricter measure in class-imbalanced cases where the positive class is a minority, as AU-ROC is known to inflate due to the abundance of true negatives. We also report the unweighted average of the F1 of the positive and negative classes at a probability threshold of 0.5 (F1), as well as the expected calibration error (ECE) for measuring calibration quality, as suggested by Guo et al.,30 with 10 probability buckets. In order to provide a summary performance profile for the 30-task SAFE-MSMT dataset, we report here the weighted average of the per-task performance measures, where each weight is proportional to the number of positive instances per task. Even so, this is quite an austere evaluation, as, for some species, there are only a handful of positive samples (as few as four), which heavily restricts the predictive potential of supervised-learning-based approaches.
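
For reference, here is a minimal sketch of the ECE computation described above, for the binary case with 10 equal-width confidence buckets; the exact binning conventions of Guo et al. may differ in detail.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary prediction: bucket samples by the confidence of the
    predicted class, then sum the per-bucket |accuracy - mean confidence|
    gaps weighted by bucket occupancy."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)  # confidence of predicted class
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (preds[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece
```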

As the baseline in our comparisons, we used a variation of a modern, complex DL model that was the best-performing method in a comparative study on the bioacoustics domain.10 It combines a ResNet architecture, SE blocks, and multi-head global attentive pooling of sequential embeddings, and an output dense layer per binary task, for a total depth of 21 layers (instead of 28 in Rizos et al.10); hence, base SE-ResNet. A summary of its architecture, including parameter values and tensor shapes, can be found in Table 1, and more details are given in section “description of multi-attentive SE-ResNet.” It is designed to process log-Mel spectrograms as input, i.e., two-dimensional audio representations.

Table 1.

SE-ResNet with multiple-head attention implementation

Model operation                    Shape
Log-Mel spectrogram                (300, 128)
(ConvBlock @ 64, ReLU) & Pool      (150, 64, 64)
(SEBlock @ 64, ReLU) × 2 & Pool    (75, 32, 64)
(SEBlock @ 128, ReLU) × 2 & Pool   (37, 16, 128)
(SEBlock @ 256, ReLU) × 3 & Pool   (18, 8, 256)
(SEBlock @ 512, ReLU) × 2 & Pool   (9, 4, 512)
(ConvBlock @ 1024, ReLU)           (9, 4, 1024)
Reshape embedding                  (9, 4096)
4-head attention-based pooling     (4096 × 4)
Dense layer per task               (1) × tasks

The sample-free variational versions share the same architecture, albeit propagating moments throughout.

We compare the performance of the base SE-ResNet with those of (1) the uncertainty-propagating, variational Bayesian version developed for this article (section “crafting a competitive Bayesian SE-ResNet baseline”) and (2) the variant with the addition of our sample-free, uncertainty-aware label-smoothing technique (section “benefits of uncertainty usage in label smoothing”), a pictorial overview of which can be seen in Figure 1. In an effort to show whether our proposed approach is robust to variations in the base architecture, we identify the local pooling operation as a point of interest. This is due to it being less explored in the cited related literature on sample-free Bayesian DL,52,53,54,55 where only the max-pooling (max-pool) equivalent operation is considered. We further consider an attentive pooling (att-pool) operation that is similar to the recent eMPool66 and local importance pooling65 operations. Our att-pool employs an additional dense layer and a softmax nonlinear activation that learn a weighted average of the activations to be pooled. More details on the implementation of core mechanisms, the considerations made toward a Bayesian treatment, and technical propositions can be found in section “experimental procedures,” and full technical details in the supplemental experimental procedures. We summarize in Table 2 the predictive and calibration performance measure results that arose from our comparative analysis on animal call detection, which includes sample-free BNNs (for a higher-granularity report on certain endangered species from SAFE-MSMT, see Table S1). In all cases, we performed eight trials, for which we report mean and standard deviation.

Table 2.

Comparative study on two datasets between point-estimate neural networks and their sample-free Bayesian DL versions with and without uncertainty-aware label smoothing

SAFE Project Multi-Species Multi-Task

SE-ResNet              W-AU-PR        W-AU-ROC       W-F1           W-ECE
max-pool base          21.16±2.16     78.45±2.35     36.31±11.94    35.86±11.31
max-pool variational   22.44±2.00**   79.16±1.75     46.68±4.33     22.63±4.77
max-pool smooth        22.25±1.11     79.83±2.89     52.43±3.35**   17.00±7.19
max-pool ua-smooth     20.76±2.57     80.05±2.81**   49.61±3.71     16.21±3.19**
att-pool base          16.01±2.25     72.15±3.19     39.51±13.37    29.36±12.74
att-pool variational   20.38±2.70*    77.97±2.09*    47.96±3.61*    21.63±4.67
att-pool smooth        15.53±3.33     65.35±7.90     47.86±7.85     18.96±13.66*
att-pool ua-smooth     16.94±2.11     69.82±4.44     38.75±10.83    31.81±11.72

Osa Peninsula Spider Monkey Whinny

SE-ResNet              AU-PR          AU-ROC         F1             ECE
max-pool base          81.81±2.46     97.01±0.79     82.95±4.44     3.51±1.30
max-pool variational   82.74±1.14     97.14±0.34     80.31±3.14     4.56±1.16
max-pool smooth        82.55±1.60     97.26±0.43     82.79±4.17     3.66±1.35
max-pool ua-smooth     83.79±2.42*    97.47±0.38**   83.40±3.22*    3.46±1.21*
att-pool base          84.81±0.93     97.41±0.34     84.38±3.79     3.32±1.63**
att-pool variational   84.82±1.94     97.28±0.55     78.74±6.82     5.18±3.36
att-pool smooth        85.83±0.60**   97.47±0.37**   84.89±5.85**   3.53±2.49
att-pool ua-smooth     82.24±5.42     96.68±1.54     81.32±4.90     3.99±1.72

The proposed ua-smooth method distinguishes itself in the case where max-pool is used by the SE-ResNet. When att-pool is used, the highest performer is either variational for the SAFE-MSMT dataset or smooth for OSA-SMW. The choice of max-pool works better for SAFE-MSMT, whereas att-pool works better for OSA-SMW; thus, the choice of whether to use label smoothing, and of whether it should be uncertainty aware, should be made depending on the dataset. We denote by a single asterisk the best value (%) for each performance measure per dataset and per pooling-type choice, in order to more easily track the comparisons among methods based on the same backbone architecture. We further denote by a double asterisk the highest value per dataset, regardless of pooling choice.

SE-ResNet is a competitive point-estimate baseline

Although our goal is to show the benefits of sample-free Bayesian DL (with and without uncertainty-aware label smoothing) on bioacoustic call detection, we nevertheless performed one point-estimate neural architecture comparison, with a Wide ResNet22 that was used in a bird call classification study (BirdNET20). We made our own implementation of the architecture and trained it from scratch on the datasets we include in our study, using the same setup as for our own methods. This was done in the interest of a fair comparison and because the pretrained BirdNET is trained to predict neither all the bird species in our SAFE-MSMT dataset nor spider monkey whinnies from OSA-SMW. The results of the comparison with our SE-ResNet (both the maximum- and attention-pooling versions) are summarized in Table 3. We thus continue with sample-free, Bayesian treatments of only the SE-ResNet in the following.

Table 3.

Comparison of point-estimate neural network baselines

SAFE Project Multi-Species Multi-Task

Model             W-AU-PR       W-AU-ROC      W-F1          W-ECE
SE-ResNet (max)   21.16±2.16*   78.45±2.35*   36.31±11.94   35.86±11.31
SE-ResNet (att)   16.01±2.25    72.15±3.19    39.51±13.37   29.36±12.74
Wide ResNet       19.75±2.00    77.82±1.88    52.51±3.99*   12.32±5.01*

Osa Peninsula Spider Monkey Whinny

Model             AU-PR         AU-ROC        F1            ECE
SE-ResNet (max)   81.81±2.46    97.01±0.79    82.95±4.44    3.51±1.30
SE-ResNet (att)   84.81±0.93*   97.41±0.34*   84.38±3.79*   3.32±1.63*
Wide ResNet       74.79±2.15    95.62±0.57    76.22±5.97    5.30±2.69

The comparison is among our implementations of a Wide ResNet previously used for bird classification,20 an SE-ResNet previously used on the OSA-SMW dataset,10 and a variation of the latter using attention local pooling. Although the Wide ResNet achieves the best performance in W-F1 and W-ECE for the SAFE-MSMT dataset, it is surpassed by SE-ResNet with max-pooling in the other two measures. Furthermore, it is surpassed by both SE-ResNet variants in all measures in the OSA-SMW dataset. We denote by asterisks the best value (%) for each performance measure per dataset.

Crafting a competitive Bayesian SE-ResNet baseline

As a first step toward a more uncertainty-aware approach, we modify the base SE-ResNet such that it becomes a variational Bayesian, uncertainty-propagating version of itself. Linear operators such as dense and convolutional neural layers are replaced with locally reparameterized versions, as described respectively in Kingma et al.49 and Shridhar et al.51 The first two moments of the outputs are given in closed form and are linearly dependent on the respective (also stochastic) layer inputs and weights, yet mutually independent. The stochastic layer outputs are transformed by nonlinear activation functions such as ReLU and sigmoid, where the first two moments of the activations are approximated as previously described elsewhere.48,50,52,60 Regarding max-pooling of such normally distributed variables, the authors of several studies55,64 independently proposed co-pooling of the two moments, i.e., propagating only the moments of the random variable with the highest expected value. As for attention pooling, the moments of a weighted sum of normally distributed variables are well known, and we learn the probabilistic weights using attention. This way, information on the first two moments of all pooled variables is propagated.
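
The sketch below illustrates the closed forms this paragraph refers to, for a dense layer with elementwise-independent Gaussian inputs and weights, together with the two local pooling rules. It is a simplified illustration under those independence assumptions (attention weights are treated as point estimates here), not our exact implementation, which is detailed in the supplemental experimental procedures.

```python
import numpy as np

def dense_moments(m_x, v_x, m_w, v_w):
    """Moments of y = x @ W for independent x ~ N(m_x, v_x) and W ~ N(m_w, v_w):
    E[y] = m_x @ m_w, and each product term contributes
    Var[x w] = v_x m_w^2 + m_x^2 v_w + v_x v_w."""
    return m_x @ m_w, v_x @ m_w**2 + m_x**2 @ v_w + v_x @ v_w

def max_co_pool(m, v):
    """Max co-pooling: propagate the mean AND variance of the unit with the
    highest expected value along the pooled (last) axis."""
    idx = np.argmax(m, axis=-1, keepdims=True)
    return np.take_along_axis(m, idx, -1), np.take_along_axis(v, idx, -1)

def attention_pool_moments(w, m, v):
    """Weighted average of independent Gaussians with attention weights w
    (point estimates here) summing to 1 over the pooled axis: all pooled
    units contribute their moments."""
    return (w * m).sum(-1), (w**2 * v).sum(-1)
```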

In the results shown in Table 2, the Bayesian, uncertainty-propagating version of SE-ResNet with max co-pooling (variational max-pool) exhibits a slightly higher performance than base max-pool in terms of AU-PR and AU-ROC for the OSA-SMW dataset, and an improvement across all measures for SAFE-MSMT, including the highest AU-PR among all max-pool-based methods.

As for the attention-pooling variant (variational att-pool), we observe a higher performance compared to the point-estimate baseline in all measures for SAFE-MSMT but the same or lower performance in all measures for OSA-SMW.

Benefits of uncertainty usage in label smoothing

Label smoothing33 in loss calculation is the use of a label distribution that is an interpolation between the true distribution, as given by the annotators, and the uniform distribution. In the binary classification task, the latter corresponds to 0.5 probability for both the negative and the positive classes:

y_{i,c}^{smooth} = α · y_{i,c}^{uniform} + (1 − α) · y_{i,c}^{true}, (Equation 1)

where y_{i,c} refers to the label probability that class c is correct for data sample i, and α denotes the smoothing probability hyperparameter. The latter quantifies the degree to which we want the model not to overexert itself in trying to learn to classify that particular sample as per the ground truth y_{i,c}^{true}.
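
In code, Equation 1 for the binary case reduces to a one-liner (a sketch; y_true is 0 or 1, and the uniform target for both classes is 0.5):

```python
def smooth_label(y_true, alpha):
    """Equation 1 for binary classification: interpolate the hard label
    toward the uniform target of 0.5 by the smoothing probability alpha."""
    return alpha * 0.5 + (1.0 - alpha) * y_true
```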

Here, we propose a solution for data-specific label smoothing that is dependent on the uncertainty propagated throughout a BNN model and is also MC sample free. A description of the means by which we define such an uncertainty-aware smoothing probability α_i for sample i is found in section “experimental procedures,” and a schematic overview is depicted in Figure 1.

As seen in the results in Table 2, our uncertainty-aware label-smoothing method (ua-smooth), used on the BNN described in section “crafting a competitive Bayesian SE-ResNet baseline,” outperforms the max-pool-based variational method in terms of all measures except for W-AU-PR on the SAFE-MSMT dataset. In the att-pool case, we do not observe a similar behavior, as the only improvement is on F1 and ECE in the OSA-SMW dataset. In the max-pool case, the ua-smooth method also achieves better performance than the baseline in all cases except W-AU-PR on SAFE-MSMT.

Smoothing should be specific to data samples

How can we be sure, then, that the propagated model uncertainty contains information about which samples should use higher smoothing probabilities and that it is not simply a case of label smoothing being beneficial in general?

To answer this question, we perform one more series of experiments, with a label-smoothing variant (hence, smooth) that keeps the smoothing probability fixed across the training batch. Specifically, we calculate for every batch the average of the uncertainty-aware smoothing probabilities as per our proposed ua-smooth method and apply that average to all batch samples instead. This is not the commonly used, hyperparameter-based, fixed-value label smoothing: it still benefits from the uncertainty quantification provided by the BNN, whose values change per training step as the model learns to fit the training data, and it tracks the average value of the uncertainty-aware smoothing probability. It thus allows for a stricter comparison with the ua-smooth method, which we propose as the better means of performing uncertainty-aware label smoothing using a BNN.
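
Given per-sample smoothing probabilities (computed as in section “experimental procedures”), the smooth baseline differs from ua-smooth only in averaging them over the batch; a sketch, reusing the smooth_label helper defined under Equation 1:

```python
import numpy as np

def ua_smooth_targets(y_true, alphas):
    """ua-smooth: each sample keeps its own uncertainty-aware alpha."""
    return smooth_label(y_true, alphas)

def batch_smooth_targets(y_true, alphas):
    """smooth: the batch mean of the same alphas is applied to all samples,
    removing the data-specific component while tracking its average level."""
    return smooth_label(y_true, np.mean(alphas))
```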

We observe from Table 2 that, in the max-pool case, ua-smooth always outperforms smooth for all measures on OSA-SMW, whereas on SAFE-MSMT this holds only for W-AU-ROC and W-ECE. In the att-pool case, it is instead smooth that outperforms ua-smooth for all measures on OSA-SMW, whereas the comparison is also inconclusive on SAFE-MSMT, with ua-smooth performing better in terms of W-AU-PR and W-AU-ROC only.

Sample-free BNN outputs are calibrated

A recommendation on which Bayesian approach to use agnostically is not easy to make, although, on the SAFE-MSMT dataset, the calibration performance of the point-estimate baselines is worse than that of the corresponding Bayesian versions in nearly all cases. We do not observe the same behavior on the OSA-SMW dataset, although, in the max-pool cases on both datasets, it is our proposed ua-smooth method that achieves the best ECE performance among the corresponding competing methods.

Locally pooling normal random variables

Apart from max-pooling, we have also showcased the efficacy of a BNN approach based on newer, more elaborate local pooling methods.65,66 Although we see that, on the OSA-SMW dataset, attention pooling brings a clear improvement in all performance measures over the use of max-pooling, on SAFE-MSMT the behavior is reversed; i.e., max-pooling is overall the best-performing local pooling operation. We find that our proposed ua-smooth method manages to achieve the best performance compared to corresponding competing methods in the max-pooling case, but not for attention pooling, where either the naive smooth method works best (on OSA-SMW) or the sample-free BNN without any label smoothing does (on SAFE-MSMT).

Execution times

We further perform a wall-clock execution time measurement for all the competing methods on a machine equipped with an Nvidia GeForce GTX 1080 Ti graphics processing unit (GPU) with 11 GB of memory. The results are summarized in Table 4. The increase in execution times for the sample-free Bayesian methods is well known in the relevant literature.53,54,55

Table 4.

Training and prediction batch execution times in milliseconds for a batch size of eight on the SAFE-MSMT dataset

SE-ResNet              Training time   Prediction time
max-pool base          112             32
max-pool variational   706             166
max-pool smooth        714             166
max-pool ua-smooth     713             166
att-pool base          134             40
att-pool variational   753             176
att-pool smooth        763             176
att-pool ua-smooth     760             176

We measure both training time (including backpropagation) and prediction time. Regarding training times, the Bayesian methods are 6.3 and 5.6 times slower compared to the point-estimate baseline in the max-pool and att-pool cases, respectively. Regarding prediction times, the factors are, instead, 5.2 and 4.4.

The time per epoch of training is dependent on dataset size. For example, for SAFE-MSMT and using the max-pooling variants, an epoch of training requires 21s and 140s for point-estimate and Bayesian versions, respectively, whereas for OSA-SMW it is 88s and 560s. For SAFE-MSMT and using max-pooling, training requires around 20 min and 3 h for point-estimate and Bayesian versions, respectively. For OSA-SMW, the training times are, correspondingly, 1.5 h and 10 h. The higher overall training times for Bayesian methods can be explained by the fact that they require more epochs as they generally reach better parameter set optima.

Discussion

We now discuss (1) the insights extracted from our experiments regarding our proposed methodology in section “propagated uncertainty should be explicitly used”; (2) relations to similar methods and means by which our method should engender a re-evaluation thereof in section “rethinking label smoothing”; and (3) potential extensions, criticisms, and opportunities in sections “should we focus on the easy data then?” to “conclusions and future work.”

Propagated uncertainty should be explicitly used

Propagated predictive uncertainty, as per our variational variant of SE-ResNet, affects loss value calculation as it describes a predictive distribution from which multiple prediction instances can be sampled. This leads to an expected loss value calculation that is based on softer, less overconfident prediction outputs compared to a loss value based on point-estimate predictions; the utilization of epistemic uncertainty involving all potential output samples has been cited as a major regularizing strength of BNNs.41

In addition to the point-estimate base, we have designed the sample-free variational method to be a more advanced baseline, to more strictly compete with our proposed uncertainty-based label-smoothing method.

That being said, from our experiments with the moment-propagating “flavor” of BNNs, i.e., the variational Bayesian SE-ResNet, we observed promising (e.g., overall improvement on the SAFE-MSMT dataset) yet inconclusive results. As such, we recommend that the Bayesian property, as well as the type of uncertainty-aware label smoothing, should be treated as a kind of hyperparameter, not to be employed agnostically but only after experimental validation on the task under examination, including consideration of the relevant performance measures thereof.

However, the Bayesian formulations offer us another highly informative signal, something exclusive to them and unavailable to the baseline: the value itself of predictive variance, i.e., a proxy of epistemic uncertainty. There is a more explicit manner of utilizing it, which can, and indeed should, be used in the loss calculation, as, in our experiments, the ua-smooth method performs better than the corresponding variational in most performance measures in the case of models using max-pooling.

Usually, predictive uncertainty is used in downstream tasks, e.g., as a signal for data acquisition in active learning52,68 or toward the design of uncertainty-aware (e.g., risk-averse) reinforcement learning agents.69 Inversely, we believe that uncertainty should be used as a signal that guides learning in the self-same task, and by the self-same model, that is undergoing training; as per our experiments, not doing so may mean missing the opportunity afforded by the usage of a BNN and also disregards one-half of the BNN output. The sample-free manner of uncertainty propagation offers a more elegant and less stochastic means of doing so compared to MC-based methods.

More than that, our experiments with the batch-wide fixed smoothing method (smooth) indicate that a higher degree of label smoothing can be beneficial for data samples that the BNN is less confident in modeling and that, thus, the ua-smooth variant is preferable. That being said, for the OSA-SMW dataset in the att-pool case, it seems that smooth performs better than the other corresponding methods, indicating that Bayesian regularization may be beneficial for that dataset in any shape or form, most probably due to the positive-class sample scarcity in all binary classification tasks of this dataset.

Rethinking label smoothing

That being said, label smoothing has been considered one of the reasons for the high performance achieved by the student model in knowledge distillation,70 i.e., a learning framework involving a student model learning from the predictions of a teacher model that is itself trained with the true labels. Knowledge distillation utilizes the smooth prediction probabilities output by the teacher model in place of ground-truth labels. These output distributions are smoother, i.e., closer to the uniform, for data samples that the teacher model finds difficult to model, thus constituting data-specific smoothing. Moving away from the two-step, teacher-student framework (which is focused on model compression), we have shown in this study the usefulness of a means of smoothing that requires no more than a single model and a single training process and is also MC sample free. As indicated by our experiments, we believe that the underlying conception of label smoothing is still promising, with the caveat that it needs to be applied in an adaptive, intelligent, and data-specific manner; a fortiori in the uncertainty-propagating BNN context, where a guiding signal is provided by design.

The study that is conceptually closest to our own, in terms of attempting to improve accuracy and calibration, is the one performed in Seo et al.,71 in which the authors use the MC-based BNN approach proposed in Gal and Ghahramani,46 called MC dropout, and focus on image classification. They calculate a loss value as an interpolation of the cross entropy between the predictions and the true labels and the cross entropy between the predictions and the uniform distribution, where these two factors are weighted based on a normalized, MC-based estimate of the variance. Even though their loss calculation uses the predictions of a single execution, it also requires K executions for estimating the variance. As such, the authors use five MC samples and, subsequently, five propagations of the input through the entire model during training. Instead, we use both the expectation and the variance of the outputs in our loss calculation, as propagated through the entire network in closed-form approximation, constituting a more deterministic and elegant solution. Given the long-standing criticisms of MC dropout on whether its assumptions and approximations truly constitute a Bayesian method,72,73,74 and the fact that sample-free Bayesian methods have outperformed MC dropout before,53,54 we did not consider a direct comparison with this method necessary.
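
A sketch of the general shape of that MC-based interpolation loss, adapted to the binary setting; the normalization of the variance estimate below is our simplification, not necessarily the cited authors' exact choice.

```python
import torch
import torch.nn.functional as F

def variance_weighted_loss(logits_k, targets):
    """logits_k: (K, batch) logits from K stochastic forward passes; targets:
    (batch,) float labels. Interpolate cross-entropy to the true labels and
    to the uniform target, weighted per sample by a normalized MC variance."""
    probs_k = torch.sigmoid(logits_k)
    alpha = probs_k.var(dim=0)
    alpha = alpha / (alpha.max() + 1e-12)  # normalize to [0, 1] (our simplification)
    p = probs_k[0]                         # predictions of a single pass
    ce_true = F.binary_cross_entropy(p, targets, reduction="none")
    ce_unif = F.binary_cross_entropy(p, torch.full_like(p, 0.5), reduction="none")
    return ((1.0 - alpha) * ce_true + alpha * ce_unif).mean()
```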

Should we focus on the easy data then?

The underlying philosophy of our uncertainty-aware smoothing method is that high predictive uncertainty implies a training data sample that is, for whatever reason (e.g., difficulty, subjectivity, scarcity), difficult to model, and, as such, that our BNN should not over-penalize itself trying to memorize it. Similar assumptions have been made by past studies that focus on aleatory uncertainty,38 soft labels due to rater disagreement,75 or label smoothing.33,71 That being said, there is also an alternate way of thinking: data samples that are too easy to model should be the ones either ignored or downweighted, such that we avoid a flood of common samples dominating the loss calculation. A method that follows this paradigm is the focal loss,76 of which newer versions are also heteroscedastic, i.e., dependent on the input, as the degree of focus is itself dependent on an auxiliary output of the model.77 This is similar to our approach, albeit we are not using a separate output “head” but leverage the Bayesian predictive uncertainty. A combination of these two philosophies, and a means by which we can learn the degrees to which we should downweight both the easy and the difficult samples side by side, is something we would like to focus on in a future extension of this study, potentially by incorporating uncertainty decomposition methods.38

Generality of method

Although, in the study performed in Wu et al.,53 the authors validated their moment-propagating BNNs on small-scale, tabular datasets, in Schmitt and Roth54 such models have also been applied to standard image classification datasets such as MNIST,78 CIFAR,59 and ImageNet.79 Dera et al.55 have gone further, to image segmentation on both radar sensor and medical magnetic resonance images. Finally, Haußmann et al.52 have used the sample-free output uncertainty in a downstream active learning framework for budgeted image classification labeling. We not only build upon such models methodologically with our adaptive label smoothing but also apply them to a new domain, that of bioacoustic animal call detection. Given the above, we see it as highly likely that the performance of our method is transferable to any data domain in which it is beneficial to model uncertainty, including speech and textual language processing, multimodal domains such as video, as well as graph data.

Limitations of method

Even though the parameter space required for the sample-free Bayesian models is almost equivalent to that of the baseline (just one additional parameter per trainable layer, as described in the supplemental experimental procedures for variance parameterization), the prediction and training times are longer (see section “execution times”). Furthermore, the activation space is double that of the baseline, as we are propagating the variances as well as the expectations. That being said, this is a known and accepted behavior in the sample-free Bayesian DL literature.52,53,54,55 It is also reasonable, since other Bayesian formulations also require an increase in resources: e.g., MC sampling-based methods perform a number of propagations through the network equal to the number of MC samples, something that also introduces stochasticity in training.54

Conclusions and future work

Although the predictive uncertainty signal calculated by BNNs is often used to make decisions in a downstream task, such as identifying samples to annotate in active learning or addressing risk in reinforcement learning, in this article, we have used it to guide learning in the self-same task the neural network is being trained on. To that end, we have focused on deterministic (i.e., non-MC-based) BNNs that propagate feature variances along with expectations and utilized the end-to-end propagated output uncertainty to inform the degree of label smoothing that is applied in a data-specific manner. Our proposed sample-free variational Bayesian SE ResNet yields in most cases an improvement over the point-estimate baseline. Furthermore, our recommended variant with uncertainty-aware label smoothing brings further improvement in cases in which the maximum operation is used for local pooling.

Our methodology has been evaluated on two animal call-detection bioacoustics datasets, one of them introduced here for the first time, as well as in two variations pertaining to local hidden unit pooling. We find that the choice of pooling affects performance depending on the dataset, and it affects the success of uncertainty-aware label smoothing. As such, we submit that the use of uncertainty-aware label smoothing is a promising method that should be considered as a hyperparameter, to be incorporated based on validation performance. By using it, one incorporates the uncertainty value that is available to sample-free BNNs in the loss value calculation.

This work both advances work on moment-propagating BNNs that are of great use in the domain of DL and is of special interest to the application field of bioacoustics, where low signal-to-noise-ratio data often also receive weak annotation, leading to a need for soft, modest predictions that are highly calibrated (noted so far to be missing).3,61,62 Well-calibrated model outputs with meaningful prediction probabilities are required for downstream processing either by automatic decision-making software or human experts, especially in a collaborative human-machine setting, such as active learning. Although other types of BNN are known to perform well in terms of calibration,40,80 we have shown here that this also holds for the moment-propagating variety, with and without the use of our intelligent label smoothing.

It is important to note that this study has not been an extended comparative study of neural network architectures for acoustics, as in Rizos et al.10 Many promising point-estimate DL architectures exist, potentially focused on other data domains, that could prove to be excellent performers on one (e.g., see the experiment with the Wide ResNet-based BirdNET in section “SE-ResNet is a competitive point-estimate baseline,” as well as Table S1) or even both of the datasets we considered. Our results indicate that a sample-free Bayesian treatment of any existing point-estimate architecture is highly likely to bring further improvement, with or without our proposed uncertainty-aware label-smoothing approach. We further believe this study can stimulate research in uncertainty-aware local pooling and attention methods, in identifying informative data samples47 in an integrated manner with focal loss,77 and in trustworthy decision making in bioacoustics. Finally, we believe it is of interest to approach the newly introduced SAFE-MSMT dataset via a few-shot learning framework,81 to extract as much information as possible from the limited-size labeling.

Experimental procedures

Resource availability

Lead contact

Further information regarding the computational methodology and use of codebase should be directed to and will be fulfilled by the lead contact, G.R. (georgios.rizos12@imperial.ac.uk). Information regarding the SAFE-MSMT dataset should be addressed to R.E. (r.ewers@imperial.ac.uk), and regarding OSA-SMW to J.L. (j.lawson17@imperial.ac.uk).

Materials availability

This study did not generate new unique materials or reagents.

Data and code availability

The latest version of the code can be found at https://github.com/glam-imperial/sample-free-uncertainty-label-smoothing under DOI through Zenodo: https://doi.org/10.5281/zenodo.10253149 82 and is publicly available as of the date of publication. The SAFE-MSMT dataset introduced in this paper is to be found at Zenodo: https://doi.org/10.5281/zenodo.7740620 83; the contact for this dataset is R.E. The OSA-SMW data reported in this paper will be shared by J.L. upon request.

Description of multi-attentive SE-ResNet

The base DL architecture we use in this study is a close variant of the best performing method from the comparative study in Rizos et al.10 The sample-free Bayesian treatment is applied on the same architecture, whether uncertainty-aware label smoothing is used or not. Table 1 summarizes the number of layers used and related parameters.

We can divide the architecture into three modules: (1) the core audio processing module, which produces a sequence of learnt audio embeddings and is based on convolutional layers, residual blocks, local pooling (maximum or attentive), and SE blocks; (2) the multiple-head attention mechanism for weighted average pooling of the embeddings; and (3) the top module, a set of dense layers that process the averaged, recording-wide neural representation, where each layer makes a prediction corresponding to a separate binary call-detection task. There is one such layer for OSA-SMW and 30 for the SAFE-MSMT dataset. We extract spectrograms from sound waveforms sampled at a rate of 16 kHz, using a fast Fourier transform window of 128 ms, sliding at a hop length of 10 ms. Given a 3-s clip, we extract 128 Mel coefficients and end up with a log-Mel spectrogram with a sequence length of 300.
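
A sketch of this front end using librosa (the library choice is an assumption, and the exact frame count depends on the padding convention):

```python
import librosa
import numpy as np

# 3-s clip at 16 kHz; 128-ms FFT window (2,048 samples); 10-ms hop (160 samples)
waveform, sr = librosa.load("clip.wav", sr=16000, duration=3.0)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr,
                                     n_fft=2048, hop_length=160, n_mels=128)
log_mel = np.log(mel + 1e-6).T  # (time, mel) of roughly (300, 128), as in Table 1
```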

As seen in Table 1, the log-Mel spectrogram is first processed by a block (ConvBlock) of two convolutional layers, each with 64 filters and ReLU activations, and followed by a pooling operation without padding. The pooling operation can be either max- or attentive pooling. Then, the hidden units are processed by four blocks (SEBlock), where each is composed of two residual layers with SE mechanisms, and is followed by a pooling operation. The core module concludes with another ConvBlock, where the convolutions learn 1,024 filters, but this time not followed by pooling. In all cases, the convolutional layers learn 3×3 filters and corresponding biases, and the pooling operations are subsampling at a 2×2 ratio.

The above module transforms a log-Mel spectrogram input into a hidden tensor with sequence length of nine, width of four, and 1,024 features. We want to perform global pooling across the sequence length, and so first reshape the tensor to (9,4096). We then learn four weighted sequence-averaging operations, using four attention heads. Each head corresponds to a learnt linear transformation of each embedding frame to a single energy value, and the calculation of a probability vector by passing the energy values from the sequence through a softmax function. These probabilities are used for weighted averaging, leading to an averaged embedding per attention head; those are then concatenated to provide a single, sequence-wide representation of the input audio clip. This is processed by the top module, where the dense layer that corresponds to each task avails of the common base model for shared feature extraction. Each dense layer produces one logit per data sample, which is passed through a sigmoid function such that we obtain the probability that the sample is positive.
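
A minimal PyTorch sketch of this multi-head global pooling (layer sizes follow Table 1; this is an illustration, not our exact implementation):

```python
import torch
import torch.nn as nn

class GlobalAttentionPooling(nn.Module):
    """Each head maps every frame embedding to a scalar energy; a softmax
    over the sequence turns energies into weights; the heads' weighted
    averages are concatenated into one clip-wide representation."""
    def __init__(self, dim=4096, heads=4):
        super().__init__()
        self.energy = nn.Linear(dim, heads)

    def forward(self, x):                                # x: (batch, 9, 4096)
        weights = torch.softmax(self.energy(x), dim=1)   # (batch, 9, heads)
        pooled = torch.einsum("bsh,bsd->bhd", weights, x)
        return pooled.flatten(1)                         # (batch, heads * 4096)
```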

Epistemic uncertainty-aware label smoothing

We need to quantify the belief that an input sample has been noisily annotated and that, as such, the prediction error for it should contribute less to the loss value calculation. We design such a measure by adhering to the following desiderata: (1) it is in the [0,1] range, such that it can serve as the label-smoothing probability; (2) it is positively correlated with the propagated predictive variance, in order to reflect BNN uncertainty about the input sample categorization; and (3) it is also positively correlated with overconfident (i.e., close to 1) predictions, such that moderate predictions do not receive feedback reinforcement.

Consider the expected logit output E[h_{i,L}^t] of a dense prediction layer for the i-th acoustic data sample, where L denotes the last layer index and t denotes the task corresponding to that prediction layer. If we do utilize the logit variance V[h_{i,L}^t] and transform the normally distributed random variable via a sigmoid function (as detailed in the supplemental experimental procedures, section “sample-free variational attentive SE-ResNet”), we get the fully propagated, Bayesian expectation and variance of y_{i,POS}^{t,Bayes}, i.e., the probability that the input sample is from the positive (POS) class. Inversely, if we opt for a maximum a posteriori (MAP) approach for that final layer, by not utilizing the logit variance, we transform the logit expectation via the sigmoid and denote the probability by y_{i,POS}^{t,MAP}.

y_{i,POS}^{t,MAP} would still benefit from the moment propagation up until the final layer in terms of the learnt features and logits h_l (for l up to but excluding L), as well as from the Bayesian regularization of all layers. However, final-layer MAP makes the information encoded in the propagated uncertainty unavailable to the calculation of the predictive probability distribution. Inversely, y_{i,POS}^{t,Bayes} gets the full benefits of the Bayesian approach. A fully Bayesian treatment of even just the final layer has been shown to have a positive effect on addressing overconfidence, even when the rest of the model is parameterized with point-estimate weights.84

We, thus, attempt to capture this additional, Bayesian uncertainty information by defining the data-sample-specific smoothing probability as

α_i^t = |y_{i,POS}^{t,MAP} − y_{i,POS}^{t,Bayes}|. (Equation 2)

For a binary call-detection task, this is equivalent to half the Manhattan distance between the corresponding two-element discrete predictive probability distributions. A visualization of our adaptive smoothing probability for ranges of logit expectations and variances can be found in Figure 2.
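
A sketch of the computation: the supplement details the exact sigmoid-of-Gaussian transform we use; here, the common probit-style approximation sigmoid(mu / sqrt(1 + pi * var / 8)) stands in for the Bayesian output probability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ua_smoothing_alpha(logit_mean, logit_var):
    """Equation 2: the gap between the MAP probability (logit variance
    ignored) and the fully Bayesian probability (logit variance marginalized,
    approximated here) serves as the per-sample smoothing probability."""
    p_map = sigmoid(logit_mean)
    p_bayes = sigmoid(logit_mean / np.sqrt(1.0 + np.pi * logit_var / 8.0))
    return np.abs(p_map - p_bayes)
```

At zero logit variance the two probabilities coincide and alpha vanishes, and for extreme expected logits with high variance the gap widens, matching the behavior shown in Figure 2.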

Figure 2.

The value of our proposed adaptive, uncertainty-aware smoothing probability given the expectation and variance of the logit

For logit uncertainties close to 0, the smoothing probability α_i^t is also close to 0. For higher logit uncertainties V[h_{i,L}^t], α_i^t is higher for predictions that are closer to the extreme values of either 0 or 1. For moderate predictions close to 0.5, α_i^t is closer to 0, thus encouraging learning from the true signal instead of reinforcing a moderate prediction behavior.

Acknowledgments

G.R. would like to acknowledge the Engineering and Physical Sciences Research Council (EPSRC) grant no. 2021037.

Author contributions

G.R. conceived and designed the proposed methodology, coded the moment-propagating BNNs and smoothing methods, executed the experiments, wrote the article, and prepared figures. J.L. collected and annotated the OSA-SMW dataset and contributed to designing the predictive task and writing related parts. S.M. annotated the SAFE-MSMT dataset. P.S. contributed to coding the dense Bayesian neural layer and the automatic relevance determination prior and proposed the use of cold posteriors for training. X.W. preprocessed the SAFE-MSMT dataset, prepared initial versions of related figures and descriptions of related parts, and executed exploratory experiments on SAFE-MSMT. C.B.-L., R.E., and B.W.S. supervised the research. All authors discussed the results and commented on/edited the manuscript.

Declaration of interests

G.R. is affiliated with the University of Cambridge. This work was performed during his PhD candidacy at Imperial College London. J.L. is also affiliated with the UK Centre for Ecology and Hydrology. P.S. is now affiliated with Advai Ltd. P.S. and X.W. worked on this study as MSc students at Imperial College London. B.W.S. is also affiliated with the Technical University of Munich and audEERING GmbH.

Published: February 12, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.patter.2024.100932.

Contributor Information

Georgios Rizos, Email: georgios.rizos12@imperial.ac.uk.

Björn W. Schuller, Email: bjoern.schuller@imperial.ac.uk.

Supplemental information

Document S1. Table S1 and supplemental experimental procedures
mmc1.pdf (372.8KB, pdf)
Document S2. Article plus supplemental information
mmc2.pdf (2.9MB, pdf)

References

1. Witmer G.W. Wildlife population monitoring: some practical considerations. Wildl. Res. 2005;32:259–263. doi: 10.1071/WR04003.
2. Tuia D., Kellenberger B., Beery S., Costelloe B.R., Zuffi S., Risse B., Mathis A., Mathis M.W., van Langevelde F., Burghardt T., et al. Perspectives in machine learning for wildlife conservation. Nat. Commun. 2022;13:1–15. doi: 10.1038/s41467-022-27980-y.
3. Stowell D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ. 2022;10. doi: 10.7717/peerj.13152.
4. Turner W. Sensing biodiversity. Science. 2014;346:301–302. doi: 10.1126/science.1256014.
5. Veit A., Alldrin N., Chechik G., Krasin I., Gupta A., Belongie S. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE/CVF; 2017. Learning from noisy large-scale datasets with minimal supervision; pp. 839–847.
6. Arroyo-Rodríguez V., Fahrig L. Why is a landscape perspective important in studies of primates? Am. J. Primatol. 2014;76:901–909. doi: 10.1002/ajp.22282.
7. Hershey S., Chaudhuri S., Ellis D.P.W., Gemmeke J.F., Jansen A., Moore R.C., Plakal M., Platt D., Saurous R.A., Seybold B., et al. Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE; 2017. CNN architectures for large-scale audio classification; pp. 131–135.
8. He K., Zhang X., Ren S., Sun J. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE/CVF; 2016. Deep residual learning for image recognition; pp. 770–778.
9. Kong Q., Cao Y., Iqbal T., Wang Y., Wang W., Plumbley M.D. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020;28:2880–2894. doi: 10.1109/TASLP.2020.3030497.
10. Rizos G., Lawson J., Han Z., Butler D., Rosindell J., Mikolajczyk K., Banks-Leite C., Schuller B.W. Multi-attentive detection of the spider monkey whinny in the (actual) wild. Proceedings of Interspeech (ISCA) 2021:471–475. doi: 10.21437/Interspeech.2021-1969.
11. Hong S., Zou Y., Wang W. Gated multi-head attention pooling for weakly labelled audio tagging. Proceedings of Interspeech (ISCA) 2020:816–820. doi: 10.21437/Interspeech.2020-1197.
12. Naranjo-Alcazar J., Perez-Castanos S., Zuccarello P., Cobos M. Acoustic scene classification with squeeze-excitation residual networks. IEEE Access. 2020;8:112287–112296. doi: 10.1109/ACCESS.2020.3002761.
13. Kiskin I., Cobb A.D., Sinka M., Willis K., Roberts S.J. Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2021. Automatic acoustic mosquito tagging with Bayesian neural networks; pp. 351–366.
14. Dufourq E., Durbach I., Hansford J.P., Hoepfner A., Ma H., Bryant J.V., Stender C.S., Li W., Liu Z., Chen Q., et al. Automated detection of Hainan gibbon calls for passive acoustic monitoring. Remote Sens. Ecol. Conserv. 2021;7:475–487. doi: 10.1002/rse2.201.
15. Tzirakis P., Shiarella A., Ewers R., Schuller B.W. Computer audition for continuous rainforest occupancy monitoring: The case of Bornean gibbons' call detection. Proceedings of Interspeech (ISCA) 2020:1211–1215. doi: 10.21437/Interspeech.2020-2655.
16. Bahdanau D., Cho K.H., Bengio Y. Proceedings of the International Conference on Learning Representations. 2015. Neural machine translation by jointly learning to align and translate.
17. Luong M.T., Pham H., Manning C.D. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2015. Effective approaches to attention-based neural machine translation; pp. 1412–1421.
18. Hu J., Shen L., Sun G. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE/CVF; 2018. Squeeze-and-excitation networks; pp. 7132–7141.
19. Zhang Z., Wu B., Schuller B.W. Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE; 2019. Attention-augmented end-to-end multi-task learning for emotion prediction from speech; pp. 6705–6709.
20. Kahl S., Wood C.M., Eibl M., Klinck H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inf. 2021;61. doi: 10.1016/j.ecoinf.2021.101236.
21. Ruan W., Wu K., Chen Q., Zhang C. ResNet-based bio-acoustics presence detection technology of Hainan gibbon calls. Appl. Acoust. 2022;198. doi: 10.1016/j.apacoust.2022.108939.
22. Zagoruyko S., Komodakis N. Proceedings of the British Machine Vision Conference. British Machine Vision Association; 2016. Wide residual networks.
23. Pérez-Granados C. BirdNET: applications, performance, pitfalls and future opportunities. Ibis. 2023;165:1068–1075. doi: 10.1111/ibi.13193.
24. Shiu Y., Palmer K.J., Roch M.A., Fleishman E., Liu X., Nosal E.M., Helble T., Cholewiak D., Gillespie D., Klinck H. Deep neural networks for automated detection of marine mammal species. Sci. Rep. 2020;10:607–612. doi: 10.1038/s41598-020-57549-y.
25. Hantke S., Cummins N., Schuller B.W. Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE; 2018. What is my dog trying to tell me? The automatic recognition of the context and perceived emotion of dog barks; pp. 5134–5138.
26. Oikarinen T., Srinivasan K., Meisner O., Hyman J.B., Parmar S., Fanucci-Kiss A., Desimone R., Landman R., Feng G. Deep convolutional network for animal sound classification and source attribution using dual audio recordings. J. Acoust. Soc. Am. 2019;145:654–662. doi: 10.1121/1.5087827.
27. Clink D.J., Klinck H. Gibbonfindr: An R package for the detection and classification of acoustic signals. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1906.02572.
28. Goëau H., Glotin H., Vellinga W.P., Planqué R., Joly A. Proceedings of CLEF: Conference and Labs of the Evaluation Forum. 2016. LifeCLEF bird identification task 2016: The arrival of deep learning; pp. 440–449.
29. Rovithis E., Moustakas N., Vogklis K., Drossos K., Floros A. Towards citizen science for smart cities: A framework for a collaborative game of bird call recognition based on internet of sound practices. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2103.16988.
30. Guo C., Pleiss G., Sun Y., Weinberger K.Q. Proceedings of the International Conference on Machine Learning. PMLR; 2017. On calibration of modern neural networks; pp. 1321–1330.
31. Tomsett R., Preece A., Braines D., Cerutti F., Chakraborty S., Srivastava M., Pearson G., Kaplan L. Rapid trust calibration through interpretable and uncertainty-aware AI. Patterns (N. Y.). 2020;1. doi: 10.1016/j.patter.2020.100049.
32. Tomani C., Buettner F. Towards trustworthy predictions from deep neural networks with fast adversarial calibration. Proc. AAAI Conf. Artif. Intell. 2021;35:9886–9896.
33. Szegedy C., Vanhoucke V., Ioffe S., Shlens J., Wojna Z. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE/CVF; 2016. Rethinking the inception architecture for computer vision; pp. 2818–2826.
34. Singh A., Bay A., Sengupta B., Mirabile A. ICML Workshop on Uncertainty and Robustness in Deep Learning. 2021. On the dark side of calibration for modern neural networks.
35. Lukasik M., Bhojanapalli S., Menon A., Kumar S. Does label smoothing mitigate label noise? Proceedings of the International Conference on Machine Learning (PMLR) 2020:6448–6458.
36. Wei J., Liu H., Liu T., Niu G., Sugiyama M., Liu Y. Proceedings of the International Conference on Machine Learning. PMLR; 2021. To smooth or not? When label smoothing meets noisy labels; pp. 23589–23614.
37. Wang D.B., Feng L., Zhang M.L. Proceedings of Advances in Neural Information Processing Systems. 2021. Rethinking calibration of deep neural networks: Do not be afraid of overconfidence; pp. 11809–11820.
38. Kendall A., Gal Y. Proceedings of Advances in Neural Information Processing Systems. 2017. What uncertainties do we need in Bayesian deep learning for computer vision? pp. 5580–5590.
39. Mackay D.J.C. California Institute of Technology; 1992. Bayesian methods for adaptive models. PhD Thesis.
40. Maddox W.J., Garipov T., Izmailov P., Vetrov D., Wilson A.G. Proceedings of Advances in Neural Information Processing Systems. 2019. A simple baseline for Bayesian uncertainty in deep learning; pp. 13153–13164.
41. Wilson A.G. The case for Bayesian deep learning. Preprint at arXiv. 2020. doi: 10.48550/arXiv.2001.10995.
42. Wang H., Yeung D.Y. A survey on Bayesian deep learning. ACM Comput. Surv. 2020;53:1–37. doi: 10.1145/3409383.
43. Blei D.M., Kucukelbir A., McAuliffe J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017;112:859–877. doi: 10.1080/01621459.2017.1285773.
44. Zhang C., Bütepage J., Kjellström H., Mandt S. Advances in variational inference. IEEE Trans. Pattern Anal. Mach. Intell. 2019;41:2008–2026. doi: 10.1109/TPAMI.2018.2889774.
45. Blundell C., Cornebise J., Kavukcuoglu K., Wierstra D. Proceedings of the International Conference on Machine Learning. PMLR; 2015. Weight uncertainty in neural networks; pp. 1613–1622.
46. Gal Y., Ghahramani Z. Proceedings of the International Conference on Machine Learning. PMLR; 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning; pp. 1050–1059.
47. Rizos G., Schuller B.W. Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE; 2019. Modelling sample informativeness for deep affective computing; pp. 3482–3486.
48. Wang S., Manning C. Proceedings of the International Conference on Machine Learning. PMLR; 2013. Fast dropout training; pp. 118–126.
49. Kingma D.P., Salimans T., Welling M. Proceedings of Advances in Neural Information Processing Systems. 2015. Variational dropout and the local reparameterization trick; pp. 2575–2583.
50. Roth W., Pernkopf F. Proceedings of the NIPS Workshop on Bayesian Deep Learning. 2016. Variational inference in neural networks using an approximate closed-form objective.
51. Shridhar K., Laumann F., Liwicki M. Uncertainty estimations by softplus normalization in Bayesian convolutional neural networks with variational inference. Preprint at arXiv. 2018. doi: 10.48550/arXiv.1806.05978.
52. Haußmann M., Hamprecht F., Kandemir M. Proceedings of the International Joint Conference on Artificial Intelligence (ACM) 2019. Deep active learning with adaptive acquisition; pp. 2470–2476.
53. Wu A., Nowozin S., Meeds E., Turner R.E., Hernandez-Lobato J.M., Gaunt A.L. Proceedings of the International Conference on Learning Representations. 2018. Deterministic variational inference for robust Bayesian neural networks.
54. Schmitt J., Roth S. Proceedings of the DAGM German Conference on Pattern Recognition. Springer; 2021. Sampling-free variational inference for neural networks with multiplicative activation noise; pp. 33–47.
55. Dera D., Bouaynaya N.C., Rasool G., Shterenberg R., Fathallah-Shaykh H.M. Premium-CNN: Propagating uncertainty towards robust convolutional neural networks. IEEE Trans. Signal Process. 2021;69:4669–4684. doi: 10.1109/TSP.2021.3096804.
56. Goodfellow I.J., Shlens J., Szegedy C. Explaining and harnessing adversarial examples. Preprint at arXiv. 2014. doi: 10.48550/arXiv.1412.6572.
57. Wang P., Bouaynaya N.C., Mihaylova L., Wang J., Zhang Q., He R. Proceedings of the International Joint Conference on Neural Networks. IEEE; 2020. Bayesian neural networks uncertainty quantification with cubature rules; pp. 1–7.
58. Carannante G., Bouaynaya N.C., Mihaylova L. Proceedings of the International Conference on Information Fusion. IEEE; 2021. An enhanced particle filter for uncertainty quantification in neural networks; pp. 1–7.
59. Krizhevsky A. University of Toronto; 2009. Learning multiple layers of features from tiny images. Master's Thesis.
60. Tzelepis C., Patras I. Uncertainty propagation in convolutional neural networks: Technical report. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2102.06064.
61. Stowell D., Wood M.D., Pamuła H., Stylianou Y., Glotin H. Automatic acoustic detection of birds through deep learning: the first bird audio detection challenge. Methods Ecol. Evol. 2019;10:368–380. doi: 10.1111/2041-210x.13103.
62. Kitzes J., Schricker L. The necessity, promise and challenge of automated biodiversity surveys. Environ. Conserv. 2019;46:247–250. doi: 10.1017/S0376892919000146.
63. Krokos V., Bui Xuan V., Bordas S.P.A., Young P., Kerfriden P. A Bayesian multiscale CNN framework to predict local stress fields in structures with microscale features. Comput. Mech. 2022;69:733–766. doi: 10.1007/s00466-021-02112-3.
64. Haußmann M., Hamprecht F.A., Kandemir M. Proceedings of Uncertainty in Artificial Intelligence (PMLR) 2020. Sampling-free variational inference of Bayesian neural networks by variance backpropagation; pp. 563–573.
65. Gao Z., Wang L., Wu G. Proceedings of the International Conference on Computer Vision. IEEE; 2019. LIP: Local importance-based pooling; pp. 3355–3364.
66. Stergiou A., Poppe R. AdaPool: Exponential adaptive pooling for information-retaining downsampling. IEEE Trans. Image Process. 2023;32:251–266. doi: 10.1109/TIP.2022.3227503.
67. Ewers R.M., Didham R.K., Fahrig L., Ferraz G., Hector A., Holt R.D., Kapos V., Reynolds G., Sinun W., Snaddon J.L., Turner E.C. A large-scale forest fragmentation experiment: the stability of altered forest ecosystems project. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2011;366:3292–3302. doi: 10.1098/rstb.2011.0049.
68. Gal Y., Islam R., Ghahramani Z. Proceedings of the International Conference on Machine Learning. PMLR; 2017. Deep Bayesian active learning with image data; pp. 1183–1192.
69. Depeweg S., Hernandez-Lobato J.M., Doshi-Velez F., Udluft S. Proceedings of the International Conference on Machine Learning. PMLR; 2018. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning; pp. 1184–1193.
70. Hinton G., Vinyals O., Dean J. Distilling the knowledge in a neural network. Preprint at arXiv. 2015. doi: 10.48550/arXiv.1503.02531.
71. Seo S., Seo P.H., Han B. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE/CVF; 2019. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences; pp. 9030–9038.
72. Osband I. NIPS Workshop on Bayesian Deep Learning. 2016. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout.
73. Verdoja F., Kyrki V. Notes on the behavior of MC dropout. Preprint at arXiv. 2020. doi: 10.48550/arXiv.2008.02627.
74. Folgoc L.L., Baltatzis V., Desai S., Devaraj A., Ellis S., Martinez Manzanera O.E., Nair A., Qiu H., Schnabel J., Glocker B. Is MC dropout Bayesian? Preprint at arXiv. 2021. doi: 10.48550/arXiv.2110.04286.
75. Chou H.C., Lee C.C. Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE; 2019. Every rating matters: Joint learning of subjective labels and individual annotators for speech emotion classification; pp. 5886–5890.
76. Lin T.Y., Goyal P., Girshick R., He K., Dollár P. Proceedings of the International Conference on Computer Vision. IEEE; 2017. Focal loss for dense object detection; pp. 2980–2988.
77. Li X., Wang W., Hu X., Li J., Tang J., Yang J. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE/CVF; 2021. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection; pp. 11632–11641.
78. LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998;86:2278–2324. doi: 10.1109/5.726791.
79. Deng J., Dong W., Socher R., Li L.J., Li K., Li F.F. Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE; 2009. ImageNet: A large-scale hierarchical image database; pp. 248–255.
80. Osawa K., Swaroop S., Khan M.E.E., Jain A., Eschenhagen R., Turner R.E., Yokota R. Proceedings of Advances in Neural Information Processing Systems. 2019. Practical deep learning with Bayesian principles; pp. 4287–4299.
81. Nolasco I., Singh S., Vidaña-Vila E., Grout E., Morford J., Emmerson M.G., Jensen F.H., Kiskin I., Whitehead H., Strandburg-Peshkin A., et al. Detection and Classification of Acoustic Scenes and Events. 2022. Few-shot bioacoustic event detection at the DCASE 2022 challenge.
82. Rizos G. Code for the article "Propagating Variational Model Uncertainty for Bioacoustic Call Label Smoothing". Zenodo. 2023. doi: 10.5281/zenodo.10253149.
83. Trigg L., Mitchell S., Ewers R.M. Assessment of acoustic indices for monitoring phylogenetic and temporal patterns of biodiversity in tropical forests. Zenodo. 2023. doi: 10.5281/zenodo.7740620.
84. Kristiadi A., Hein M., Hennig P. Proceedings of the International Conference on Machine Learning. PMLR; 2020. Being Bayesian, even just a bit, fixes overconfidence in ReLU networks; pp. 5436–5446.


Data Availability Statement

The latest version of the code can be found at https://github.com/glam-imperial/sample-free-uncertainty-label-smoothing and is archived on Zenodo: https://doi.org/10.5281/zenodo.10253149,82 publicly available as of the date of publication. The SAFE-MSMT dataset introduced in this paper is available on Zenodo: https://doi.org/10.5281/zenodo.7740620,83 and the contact for this dataset is R.E. The OSA-SMW data reported in this paper will be shared by J.L. upon request.

