(A, B) Number of syllables (log-scaled) in the train (A) and test (B) sets for each syllable type present in the test set. (C) Precision (blue) and recall (orange) for each syllable type, computed from the confusion matrix in Figure 3C. (D) Confusion matrices obtained using the true (left, same as Figure 3C) and the predicted (right) syllables as the reference. Colors were log-scaled to make the rare annotation errors more apparent (see color bar). The reference determines the syllable bounds; the label of each syllable is then the most frequent label among the samples within those bounds. When the true syllables are the reference, there are no false positives (left, y=0 (no song), gray line), since every reference syllable is a true syllable. By contrast, when the predicted syllables are the reference, there are no true negatives (right, x=0 (no song), gray line), since all reference syllables are (true or false) positives. The average false negative and false positive rates are 0.3% and 0.2%, respectively.