(A, C) Waveform (top) and spectrogram (bottom) of vocalizations from male and female marmosets. Shaded areas (top) show manual annotations of the different vocalization types, colored by type. Recordings are noisy (A, left) and clipped (orange), and individual vocalization types are variable (C). (B, D) DAS and manual annotation labels for the vocalization types in the recordings in A and C (see color bar in C). DAS annotates the syllable boundaries and types with high accuracy. Note the false negative in D. (E, F) Confusion matrices for the four vocalization types in the test set (see color bar), using the syllable boundaries from the manual annotations (E) or from DAS (F) as reference. Rows depict the probability with which DAS annotated each syllable as one of the four types in the test dataset. The type of most syllables was correctly annotated, resulting in the concentration of probability mass along the main diagonal. False positives and false negatives correspond to the first row in E and the first column in F, respectively. When using the true syllables for reference, there are no false positives (E, x=‘noise’, gray labels) since all detections are positives. By contrast, when using the predicted syllables as reference, there are no true negatives (F, y=‘noise’, gray labels), since all reference syllables are (true or false) positives. (G) Distribution of temporal errors for the on- and offsets of all detected syllables (purple-shaded area). The median temporal error is 4.4 ms for DAS (purple line) and 12.5 ms for the method by Oikarinen et al., 2019, developed to annotate marmoset calls (gray line).
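
The two summary quantities in this caption, a row-normalized confusion matrix and a median temporal error, can be illustrated with a minimal sketch. It is not the analysis code used for the figure. The function names, the integer encoding of syllable types, and the choice to match each detected on- or offset to its nearest manually annotated one are assumptions made for illustration only.

```python
import numpy as np


def row_normalized_confusion(true_labels, pred_labels, n_types):
    """Confusion matrix whose rows sum to 1 (rows: reference type, columns: predicted type).

    Labels are assumed to be integers in [0, n_types). This encoding is hypothetical.
    """
    counts = np.zeros((n_types, n_types), dtype=float)
    for t, p in zip(true_labels, pred_labels):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any reference syllable stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def median_temporal_error(true_times, pred_times):
    """Median absolute distance from each predicted event to the nearest reference event.

    Nearest-neighbor matching is an assumption, not necessarily the matching
    rule used to produce panel G.
    """
    true_times = np.asarray(true_times, dtype=float)
    pred_times = np.asarray(pred_times, dtype=float)
    nearest = np.abs(pred_times[:, None] - true_times[None, :]).min(axis=1)
    return float(np.median(nearest))


# Toy data, invented for illustration (times in seconds).
cm = row_normalized_confusion([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], n_types=3)
err = median_temporal_error([0.100, 0.500, 0.900], [0.104, 0.495, 0.910])
```

With this normalization, each row reads as the probability that a reference syllable of a given type receives each predicted type, so correct annotations concentrate on the main diagonal.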