(A-D) Network architectures for annotating fly song from single-channel (A) and multi-channel (B) recordings, mouse USVs and marmoset song (C), and bird song (Bengalese and Zebra finches) (D). See legend to the right. Each TCN block consists of a stack of residual blocks, shown in E. See Table 4 for all network parameters.
(E) A TCN block (left) consists of a stack of five residual blocks (right). Each residual block processes its input with a sequence of dilated convolution, rectification (ReLU), and normalization. The output of this sequence is then added to the input. In successive residual blocks, the dilation rate of the convolution filters doubles, from 1x in the first block to 16x in the last (see numbers to the left of each block). The output of the last residual block is passed as input to the next TCN block in the network. In addition, the outputs of all residual blocks in a network are linearly combined to predict the song.
(F) Annotation performance is evaluated by comparing manual annotations (top) with labels produced by DAS (bottom). Gray indicates no song, orange indicates song. True negatives (TN) and true positives (TP) are samples for which DAS matches the manual labels. False negatives (FN) are samples for which the song was missed (blue frame) and reduce recall (TP/(FN+TP)). False positives (FP) are samples that were falsely predicted as containing song (green frames) and reduce precision (TP/(TP+FP)).
(G) Precision and recall are calculated from a confusion matrix, which tabulates TP (orange), TN (gray), FP (green), and FN (blue). In the example, precision is 3/(3+2) = 0.6 and recall is 3/(1+3) = 0.75.
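The sample-wise evaluation described in F and G can be illustrated with a minimal sketch. The two label sequences below are hypothetical and were chosen only so that the counts reproduce the example in G (TP = 3, FP = 2, FN = 1); the number of true negatives and the function names are assumptions made for illustration, not taken from DAS.

```python
def confusion_counts(manual, predicted):
    """Count TP, TN, FP, FN for binary sample-wise labels (1 = song, 0 = no song)."""
    tp = sum(1 for m, p in zip(manual, predicted) if m == 1 and p == 1)
    tn = sum(1 for m, p in zip(manual, predicted) if m == 0 and p == 0)
    fp = sum(1 for m, p in zip(manual, predicted) if m == 0 and p == 1)
    fn = sum(1 for m, p in zip(manual, predicted) if m == 1 and p == 0)
    return tp, tn, fp, fn


def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(FN+TP).

    Returning 0.0 for an empty denominator is a convention chosen here.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (fn + tp) if (fn + tp) else 0.0
    return precision, recall


# Hypothetical labels: one missed sample at the song onset (FN),
# two falsely predicted samples after the song offset (FP).
manual    = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(manual, predicted)
precision, recall = precision_recall(tp, fp, fn)
print(tp, tn, fp, fn)      # 3 4 2 1
print(precision, recall)   # 0.6 0.75
```

True negatives do not enter either formula, which is why precision and recall remain informative when most samples contain no song.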
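The doubling of the dilation rate across the five residual blocks in E can be sketched in a few lines. The receptive-field formula assumes one dilated convolution per residual block, and the kernel size of 3 is a hypothetical value used only for illustration; the actual filter durations are listed in Table 4.

```python
def dilation_rates(n_blocks=5):
    """Dilation rate doubles with each residual block: 1, 2, 4, 8, 16 for five blocks."""
    return [2 ** i for i in range(n_blocks)]


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d samples.
    Assumes one convolution per residual block.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


KERNEL_SIZE = 3  # hypothetical value, not taken from Table 4

rates = dilation_rates(5)
rf = receptive_field(KERNEL_SIZE, rates)
print(rates)  # [1, 2, 4, 8, 16]
print(rf)     # 63
```

Under these assumptions, doubling the dilation lets the receptive field grow exponentially with the number of residual blocks, while the number of filter weights grows only linearly.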