2021 Nov 1;10:e68837. doi: 10.7554/eLife.68837

Figure 1. DAS performance for fly song.

(A) Fly song (black, top) with manual annotations of sine (blue) and pulse (red) song. The spectrogram (bottom) shows the signal’s frequency content over time (see color bar). (B) DAS builds a hierarchical representation of song features relevant for annotation using a deep neural network. The network consists of three TCN blocks, which extract song features at multiple timescales. The output of the network is a confidence score for each sample and song type. (C) Confidence scores (top) for sine (blue) and pulse (red) for the signal in A. The confidence is transformed into annotation labels (bottom) based on a confidence threshold (0.5 for sine, 0.7 for pulse). Ground truth (bottom) from manual annotations is shown for comparison. (D) Confusion matrix for pulse from the test data set. Color indicates the percentage (see color bar) and text labels indicate the number of pulses for each quadrant. All confusion matrices are normalized such that columns sum to 100%. The concentration of values along the diagonal indicates high annotation performance. (E) Precision-recall curve for pulse depicts the performance characteristics of DAS for different confidence thresholds (from 0 to 1, black arrow points in the direction of increasing threshold). Recall decreases and precision increases with the threshold. The closer the curve is to the upper and right borders, the better. The red circle corresponds to the performance of DAS for a threshold of 0.7. The black circle depicts the performance of FlySongSegmenter (FSS) and the gray circles the performance of two human annotators. (F) Probability density function of temporal errors for all detected pulses (red shaded area), computed as the distance between each pulse annotated by DAS and the nearest manually annotated pulse. Lines depict the median temporal error for DAS (red line, 0.3 ms) and FSS (gray line, 0.1 ms). (G, H) Recall of DAS (red line) and FSS (gray line) as a function of the pulse carrier frequency (G) and signal-to-noise ratio (SNR) (H). Red shaded areas show the distributions of carrier frequencies (G) and SNRs (H) for all pulses. DAS outperforms FSS for all carrier frequencies and SNRs. (I) Same as in D but for sine. Color indicates the percentage (see color bar) and text labels indicate seconds of sine for each quadrant. (J) Same as in E but for sine. The blue circle depicts the performance for the confidence threshold of 0.5. (K) Distribution of temporal errors for all detected sine on- and offsets. Median temporal error is 12 ms for DAS (blue line) and 22 ms for FSS (gray line). (L, M) Recall for DAS (blue line) and FSS (gray line) as a function of sine duration (L) and SNR (M). Blue-shaded areas show the distributions of durations (L) and SNRs (M) for all sine songs. DAS outperforms FSS for all durations and SNRs.
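
The step in panel C from confidence scores to annotation labels can be illustrated with a minimal sketch. It assumes that a sample is labeled as song when its confidence exceeds the threshold (the legend does not state whether the comparison is strict) and it leaves out any post-processing that DAS may apply. The function names and the confidence values are hypothetical; only the thresholds (0.5 for sine, 0.7 for pulse) come from the legend.

```python
def confidence_to_labels(confidence, threshold):
    """Label each sample 1 (song) if its confidence exceeds the threshold, else 0."""
    return [1 if c > threshold else 0 for c in confidence]


def label_runs(labels):
    """Return (onset, offset) index pairs for runs of 1-labels; offset is exclusive."""
    runs = []
    start = None
    for i, label in enumerate(labels):
        if label and start is None:
            start = i                      # a run of song samples begins
        elif not label and start is not None:
            runs.append((start, i))        # the run ended at the previous sample
            start = None
    if start is not None:                  # a run that lasts until the final sample
        runs.append((start, len(labels)))
    return runs


# Made-up per-sample confidence scores, for illustration only.
sine_confidence = [0.1, 0.4, 0.6, 0.9, 0.8, 0.3, 0.2]
pulse_confidence = [0.0, 0.6, 0.8, 0.65, 0.1, 0.75, 0.9]

# Thresholds as given in the legend of panel C.
sine_labels = confidence_to_labels(sine_confidence, threshold=0.5)
pulse_labels = confidence_to_labels(pulse_confidence, threshold=0.7)

print(sine_labels, label_runs(sine_labels))
print(pulse_labels, label_runs(pulse_labels))
```

Raising the threshold can only turn labels from 1 to 0, which is why recall falls and precision tends to rise along the curves in E and J.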

Figure 1—figure supplement 1. DAS architecture and evaluation.

(A-D) Network architectures for annotating fly song from single (A) and multi-channel (B) recordings, mouse USVs and marmoset song (C), and bird song (Bengalese and Zebra finches) (D). See legend to the right. Each TCN block consists of stacks of residual blocks shown in E. See Table 4 for all network parameters. (E) A TCN block (left) consists of a stack of five residual blocks (right). Residual blocks process the input with a sequence of dilated convolution, rectification (ReLU) and normalization. The output of this sequence of steps is then added to the input. In successive residual blocks, the dilation rate of the convolution filters doubles from 1x in the first to 16x in the last layer (see numbers to the left of each block). The output of the last residual block is passed as an input to the next TCN block in the network. In addition, the outputs of all residual blocks in a network are linearly combined to predict the song. (F) Annotation performance is evaluated by comparing manual annotations (top) with labels produced by DAS (bottom). Gray indicates no song and orange indicates song. True negatives (TN) and true positives (TP) are samples for which DAS matches the manual labels. False negatives (FN) are samples for which the song was missed (blue frame) and reduce recall (TP/(FN+TP)). False positives (FP) correspond to samples that were falsely predicted as containing song (green frames) and reduce precision (TP/(TP+FP)). (G) Precision and recall are calculated from a confusion matrix which tabulates TP (orange), TN (gray), FP (green), FN (blue). In the example, precision is 3/(3+2) and recall is 3/(1+3).
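
The arithmetic in panels F and G can be checked with a short sketch. The two label sequences below are invented so that they reproduce the counts of the example (TP = 3, FP = 2, FN = 1); they are not the sequences drawn in the figure, and the function names are hypothetical.

```python
def confusion_counts(manual, predicted):
    """Count sample-wise TP, TN, FP and FN for two equally long 0/1 label sequences."""
    pairs = list(zip(manual, predicted))
    tp = sum(1 for m, p in pairs if m == 1 and p == 1)
    tn = sum(1 for m, p in pairs if m == 0 and p == 0)
    fp = sum(1 for m, p in pairs if m == 0 and p == 1)
    fn = sum(1 for m, p in pairs if m == 1 and p == 0)
    return tp, tn, fp, fn


def precision(tp, fp):
    """TP / (TP + FP); undefined (NaN) when nothing was predicted as song."""
    return tp / (tp + fp) if (tp + fp) else float("nan")


def recall(tp, fn):
    """TP / (TP + FN); undefined (NaN) when the manual labels contain no song."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Invented sequences (1 = song, 0 = no song) chosen to match the counts in panel G.
manual    = [0, 1, 1, 1, 1, 0, 0, 0]
predicted = [0, 1, 1, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(manual, predicted)
print(tp, tn, fp, fn)       # 3 2 2 1
print(precision(tp, fp))    # 3 / (3 + 2) = 0.6
print(recall(tp, fn))       # 3 / (3 + 1) = 0.75
```
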
Figure 1—figure supplement 2. Performance and the role of context for annotating fly pulse song.

(A) Recall (blue) and precision (orange) for fly pulse song for different distance thresholds. The distance threshold determines the maximal distance to a true pulse for a detected pulse to be a true positive. (B) Waveforms of true positive (left), false positive (middle), and false negative (right) pulses in fly song. Pulses were aligned to the peak, adjusted for sign, and normalized in amplitude to unit norm (see Clemens et al., 2018). (C) Waveforms (top) and confidence scores (bottom, see color bar) for pulses in different contexts. DAS exploits context effects to boost the detection of weak signals. An isolated ('Isolated', first row) weak pulse-like waveform is detected with low confidence, since similar waveforms often arise from noise. Manual annotators exploit context information (the fact that pulses often occur in trains at an interval of 40 ms) to annotate weak signals. DAS does the same: the same waveform is detected with much higher confidence due to the presence of a nearby pulse train ('Correct IPI', 2nd row). If the pulse is too close to another pulse ('Too close', 3rd row), it is likely noise and DAS detects it with lower confidence. Context effects do not affect strong signals. For instance, a missing pulse within a pulse train ('Missing', last row) does not reduce detection confidence of nearby pulses.
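
The role of the distance threshold in panel A can be sketched as follows. This is a simplified illustration, not the evaluation code used for the figure: it assumes that a detected pulse counts as a true positive when any true pulse lies within the threshold, that a true pulse counts as found when any detected pulse lies within the threshold, and that no one-to-one matching is enforced. The function names and pulse times are hypothetical.

```python
def nearest_distance(time, other_times):
    """Distance from one event time to the closest event in another list."""
    return min(abs(time - other) for other in other_times)


def event_precision_recall(detected, true, distance_threshold):
    """Precision and recall for event times, given a maximal allowed distance."""
    if not detected or not true:
        raise ValueError("both event lists must be non-empty")
    # Detected pulses that have a true pulse close enough count as true positives.
    hits = sum(1 for d in detected if nearest_distance(d, true) <= distance_threshold)
    # True pulses that have a detected pulse close enough count as found.
    found = sum(1 for t in true if nearest_distance(t, detected) <= distance_threshold)
    return hits / len(detected), found / len(true)


# Made-up pulse times in ms, for illustration only.
true_pulses = [100, 140, 180, 220]
detected_pulses = [101, 139, 300]

for threshold in (10, 100):
    p, r = event_precision_recall(detected_pulses, true_pulses, threshold)
    print(threshold, p, r)
```

With the narrow threshold, the detection at 300 ms is a false positive and the pulses at 180 and 220 ms are missed; with the wide threshold, every event finds a partner, so both precision and recall rise to 1.
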
Figure 1—figure supplement 3. Performance for multi-channel recordings of fly courtship song.

(A) Fly song (black) with manual annotation indicating sine (blue) and pulse (red). Traces (top) and spectrogram (bottom, see color bar) show data from the loudest of the nine audio channels. (B) Confidence scores (top) for sine (blue) and pulse (red). The confidence is transformed into annotation labels (bottom) based on a confidence threshold (0.5 for sine and pulse). Ground truth (bottom) from manual annotations is shown for comparison. DAS annotations were generated using separate networks for pulse and for sine. (C) Confusion matrix for pulse from a test data set. Color indicates the percentage (see color bar) and text labels indicate the number of pulses for each quadrant. All confusion matrices are normalized such that columns sum to 100%. The concentration of values along the diagonal indicates high annotation performance. (D) Precision-recall curve for pulse depicts the performance characteristics of DAS for different confidence thresholds (from 0 to 1). Recall decreases and precision increases with the threshold. The closer the curve is to the upper and right borders, the better. The red circle corresponds to the performance of DAS for a threshold of 0.5. The gray circle depicts the performance of FlySongSegmenter (FSS). (E) Probability density function (PDF) of temporal errors for all detected pulses (red shaded area), computed as the distance between each pulse annotated by DAS and its nearest manual pulse. Lines depict the median temporal error for DAS (red line, 0.3 ms) and FSS (gray line, 0.1 ms). (F, G) Recall of DAS (red line) and FSS (gray line) as a function of the pulse carrier frequency (F) and signal-to-noise ratio (SNR) (G). Red areas show the distributions of carrier frequencies (F) and SNRs (G) for all pulses. (H) Same as in C but for sine. Color indicates the percentage (see color bar) and text labels indicate seconds of sine for each quadrant. (I) Same as in D but for sine. The blue circle depicts the performance for the confidence threshold of 0.5 used in A. (J) Distribution of temporal errors for all detected sine on- and offsets. Median temporal error is 9 ms for DAS (blue line) and 14 ms for FSS (gray line). (K, L) Recall for DAS (blue line) and FSS (gray line) as a function of sine duration (K) and SNR (L). Blue-shaded areas show the distributions of durations (K) and SNRs (L) for all sine songs. DAS outperforms FSS for sine songs with short durations and SNRs <1.0.
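
The temporal error in panels E and J is described as the distance between each annotated event and its nearest manual event, summarized by the median. A minimal sketch of that computation is given below. It assumes that every detected event enters the distribution, without first filtering out false positives; the function name and the event times are hypothetical.

```python
from statistics import median


def temporal_errors(detected_ms, manual_ms):
    """Distance (ms) from each detected event to its nearest manual event."""
    return [min(abs(d - m) for m in manual_ms) for d in detected_ms]


# Made-up event times in ms, for illustration only.
manual_ms = [100.0, 140.0, 180.0]
detected_ms = [100.5, 139.0, 182.0]

errors = temporal_errors(detected_ms, manual_ms)
print(errors)          # [0.5, 1.0, 2.0]
print(median(errors))  # 1.0
```

The same function applies to sine on- and offsets if the lists hold boundary times instead of pulse times.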