2021 Nov 1;10:e68837. doi: 10.7554/eLife.68837

Figure 2. DAS performance for mouse ultrasonic vocalizations.

(A) Waveform (top) and spectrogram (bottom) of USVs produced by a female mouse in response to an anesthetized female intruder. Shaded areas (top) show manual annotations. (B) Confidence scores (top) and DAS and manual annotations (bottom) for the female USVs in A. Brief gaps in confidence are filled to smooth the annotations. (C) Example of male USVs with sex-specific characteristics produced in the same assay. (D) Confidence scores (top) and DAS and manual annotations (bottom) for the male USVs in C from a DAS network trained to detect female USVs. (E) Confusion matrix from a female-trained network for a test set of female USVs. Color indicates the percentage (see color bar) and text labels the seconds of song in each quadrant. (F) Distribution of temporal errors for syllable on- and offsets in female USVs. The median temporal error is 0.3 ms for DAS (brown line) and 0.4 ms for USVSEG (Tachibana et al., 2020), a method developed to annotate mouse USVs (gray line). (G) Recall of the female-trained network (brown line) as a function of SNR. The brown shaded area represents the distribution of SNRs for all samples containing USVs. Recall is high even at low SNR. (H) Confusion matrix of the female-trained DAS network for a test set of male USVs (see C, D for examples). Color indicates the percentage (see color bar) and text labels the seconds of song in each quadrant.
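As an illustration of the kind of quantity summarized in panel F, a median temporal error could be computed roughly as follows. This is a minimal sketch, not the authors' evaluation code: the nearest-neighbor matching, the function name `nearest_abs_errors`, the `tolerance` cutoff, and the onset times are all assumptions made for the example.

```python
from bisect import bisect_left
from statistics import median


def nearest_abs_errors(reference, predicted, tolerance):
    """Absolute time difference from each predicted event to its nearest reference event.

    Predicted events farther than `tolerance` (seconds) from every reference
    event are skipped, i.e. treated as false detections rather than timing errors.
    The matching rule and the cutoff are assumptions of this sketch.
    """
    ref = sorted(reference)
    errors = []
    for t in predicted:
        i = bisect_left(ref, t)
        # The nearest reference event is either just before or just after t.
        candidates = ref[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        err = min(abs(t - r) for r in candidates)
        if err <= tolerance:
            errors.append(err)
    return errors


# Made-up onset times in seconds, for illustration only.
manual_onsets = [0.100, 0.250, 0.400]
predicted_onsets = [0.101, 0.248, 0.403, 0.900]  # last one has no nearby match

errors = nearest_abs_errors(manual_onsets, predicted_onsets, tolerance=0.01)
median_error_ms = 1000 * median(errors)
```

The same function can be applied to offsets; pooling onset and offset errors before taking the median is one possible way to obtain a single summary value.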


Figure 2—figure supplement 1. Performance for marmoset vocalizations.


(A, C) Waveform (top) and spectrogram (bottom) of vocalizations from male and female marmosets. Shaded areas (top) show manual annotations of the different vocalization types, colored by type. Recordings are noisy (A, left) and clipped (orange), and individual vocalization types are variable (C). (B, D) DAS and manual annotation labels for the vocalization types in the recordings in A and C (see color bar in C). DAS annotates the syllable boundaries and types with high accuracy. Note the false negative in D. (E, F) Confusion matrices for the four vocalization types in the test set (see color bar), using the syllable boundaries from the manual annotations (E) or from DAS (F) as reference. Rows depict the probability with which DAS annotated each syllable as one of the four types in the test dataset. The types of most syllables were correctly annotated, resulting in the concentration of probability mass along the main diagonal. False positives and false negatives correspond to the first row in E and the first column in F, respectively. When using the true syllables for reference, there are no false positives (E, x=‘noise’, gray labels) since all detections are positives. By contrast, when using the predicted syllables as reference, there are no true negatives (F, y=‘noise’, gray labels), since all reference syllables are (true or false) positives. (G) Distribution of temporal errors for the on- and offsets of all detected syllables (purple-shaded area). The median temporal error is 4.4 ms for DAS (purple line) and 12.5 ms for the method by Oikarinen et al. (2019), developed to annotate marmoset calls (gray line).
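To make the row-wise reading of the confusion matrices in E and F concrete, the sketch below builds a row-normalized confusion matrix from paired syllable labels. It is only an illustration under stated assumptions: the function `row_normalized_confusion`, the class names (`noise`, `type1`, `type2`), the choice of reference labels as rows, and the label lists are hypothetical and are not taken from the paper's data or code.

```python
from collections import Counter


def row_normalized_confusion(reference_labels, predicted_labels, classes):
    """Return a matrix whose entry [i][j] is the fraction of syllables with
    reference label classes[i] that were annotated as classes[j].

    Rows with no reference syllables are left as all zeros.
    """
    counts = Counter(zip(reference_labels, predicted_labels))
    matrix = []
    for ref in classes:
        row_total = sum(counts[(ref, pred)] for pred in classes)
        row = [counts[(ref, pred)] / row_total if row_total else 0.0
               for pred in classes]
        matrix.append(row)
    return matrix


# Hypothetical labels: each position is one reference syllable and the label
# assigned to it by the annotator ('noise' meaning the syllable was missed).
classes = ["noise", "type1", "type2"]
reference = ["type1", "type1", "type1", "type1", "type2", "type2"]
predicted = ["type1", "type1", "type1", "type2", "type2", "noise"]

confusion = row_normalized_confusion(reference, predicted, classes)
```

In this toy example each non-empty row sums to 1, so probability mass on the diagonal corresponds to correctly typed syllables, and mass in the `noise` column corresponds to missed syllables.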