Abstract
We design a framework for studying prelinguistic child voice from 3 to 24 months based on state-of-the-art diarization algorithms. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing the loss function, to find the best performance. We also present a multiple-instance learning technique that allows us to pre-train our parameters on larger datasets with coarser segment boundary labels. Our best system achieved 43.8% DER on the test dataset, compared to 55.4% DER achieved by the LENA software. We also found that using a convolutional feature extractor instead of log-Mel features significantly improves the performance of neural diarization.
Index Terms—Child Speech, Language Development, Speaker Diarization, Voice Activity Detection, Multiple Instance Learning, Transfer Learning
1. INTRODUCTION
Mental health disorders and behavioral problems first emerge in early childhood [1], [2]. Early diagnosis and intervention may help to ameliorate or prevent some types of behavioral disorders, but findings for the effectiveness of interventions are often mixed [3]. Intensive assessments of parent-infant interactions in naturalistic home environments, as well as normative data, are needed. Yet such assessments, conducted manually, pose logistical challenges, including the time and labor required by researchers, as well as privacy concerns of participating families. Automatic or semi-automatic diarization has the potential to address these limitations and provide insight into parent-infant vocal interactions – including the relative timing, duration, volume, and tone of voice – and would permit the establishment of normative data while minimizing privacy concerns. A greater volume of normative data, in turn, would facilitate the creation of effective evidence-based interventions in support of child mental health [4].
In contrast to broadcast news recordings [5, 6], automatic diarization of naturalistic parent-child recordings is difficult because: (1) most speakers are usually recorded at a distance, (2) most utterances are informal and brief, and (3) infants’ voices and types of utterances differ significantly from those of adults. Diarization of distant recordings [7, 8] and informal speech [9] has been well-studied, and end-to-end neural solutions [10] exist. Automatic diarization of children’s speech, however, has only been extensively studied in the past three or four years, and remains challenging. Perhaps the most influential child speech diarization system is LENA [11, 12], a proprietary system that segments audio recorded from LENA wearable devices, and that has been commonly used as a baseline [12, 13]. Additional recent work on child speech diarization has been inspired by the Second DIHARD Challenge [14], which includes children’s speech (the Seedlings Corpus [15]) as one of its test corpora.
Because of the difficulty of the task, most systems submitting results to DIHARD use oracle voice activity detection (VAD) [16, 17, 18]; other solutions in the literature include automatic VAD [13, 19] or explicit models of one or more non-speech classes [12, 20, 21]. When oracle VAD is not provided, it is not always clear how a system should respond. Parent-infant interaction involves overlapping speech events, interspersed with long periods of silence. Instead of assigning each frame to a single type, it is more natural to formulate the problem as a multi-label sequence-to-sequence problem [10, 19, 21]. Furthermore, the permutation-invariant [22] labelling rules typical of other diarization tasks are less appropriate for infant databases, in which there is typically one infant, one female adult, one male adult, and sometimes one other child of a different age. Infant diarization accuracy may improve by pre-training and/or clustering models of 3 or 4 classes, e.g., key child, other child, female adult, male adult [12, 13, 19, 20, 21].
This paper demonstrates end-to-end neural diarization of infant vocalizations. We also present a way to pre-train the neural network on large datasets with inaccurate boundary labels. In addition, we study the effect of different loss functions and input segment lengths of the network. Sec. 2 is the system description, Sec. 3 describes our approach to pretraining, Sec. 4 describes experimental methods and results, Sec. 5 concludes.
2. SYSTEM DESCRIPTION
Suppose we have a dataset {(x, y)}, where x ∈ ℝ^T is a waveform of T samples, and y ∈ {0, 1}^{C×L} is the label. C is the number of speaker classes that we define, and L is the number of frames; y_{c,l} is the binary label indicating whether any speaker from class c is speaking at any sample in the range of the l-th frame. We wish to design a system F_θ, parameterized by θ, that approximates the true mapping

F_θ : ℝ^T → {0, 1}^{C×L}    (1)

from the audio to binary speaker class labels for each frame.

We decompose our system as follows:

F_θ = F_cls ∘ F_embed ∘ F_feat    (2)

where

F_feat : ℝ^T → ℝ^{D×L}    (3)

is a time-invariant mapping from signal samples to a D-dimensional feature space,

F_embed : ℝ^{D×L} → ℝ^{E×L}    (4)

maps each frame from feature space to an E-dimensional speaker embedding, and is conditioned on the rest of the frames. Finally,

F_cls : ℝ^{E×L} → [0, 1]^{C×L}    (5)

maps each frame in the embedding sequence to a set of per-class scores (thresholded into binary labels at test time), and does not depend on other frames.
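As a shape-only illustration of the decomposition in Eqs. 2–5, the following sketch uses random linear stand-ins for the three components (all weights and dimensions are hypothetical; the actual system uses the networks described below):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, D, E, C = 4096, 16, 23, 8, 3  # samples, frames, feature dim, embedding dim, classes

Wf = rng.standard_normal((T // L, D))  # hypothetical per-frame feature weights
We = rng.standard_normal((D, E))       # hypothetical embedding weights
Wc = rng.standard_normal((E, C))       # hypothetical classifier weights

def F_feat(x):
    # Eq. 3: each frame of T/L samples is mapped independently to a D-dim feature
    return x.reshape(L, T // L) @ Wf           # (L, D)

def F_embed(f):
    # Eq. 4: per-frame embeddings conditioned on the whole sequence
    # (here crudely, by subtracting the sequence mean; the paper uses BLSTM/MHA)
    return (f - f.mean(axis=0)) @ We           # (L, E)

def F_cls(e):
    # Eq. 5: frame-wise per-class scores, thresholded into binary labels
    return ((e @ Wc) > 0).astype(int)          # (L, C)

x = rng.standard_normal(T)                     # a waveform of T samples
y_hat = F_cls(F_embed(F_feat(x)))              # Eq. 2: F_cls o F_embed o F_feat
print(y_hat.shape)                             # one binary label per class per frame
```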
Features Ffeat:
Two types of features are tested.
The first tested feature vector is based on 23-dimensional log-Mel filterbank features with a window size of 25 ms and a hop size of 16 ms. We splice the features of 15 consecutive frames, and subsequently subsample the spliced feature matrix by a factor of 16 along the time dimension. We therefore compute a 345-dimensional feature vector every 256 ms.
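The splicing-and-subsampling arithmetic can be sketched in NumPy; random values stand in for actual log-Mel features, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)
logmel = rng.standard_normal((23, 1000))  # stand-in: 23 log-Mel dims x 1000 frames (16 ms hop)

def splice_and_subsample(feats, context=15, factor=16):
    # stack `context` consecutive frames into one vector, then keep every `factor`-th vector
    d, n = feats.shape
    win = np.lib.stride_tricks.sliding_window_view(feats, context, axis=1)   # (d, n-14, 15)
    spliced = win.transpose(1, 0, 2).reshape(n - context + 1, d * context)   # (n-14, 23*15=345)
    return spliced[::factor]

out = splice_and_subsample(logmel)
print(out.shape)  # one 345-dim vector per 16 frames, i.e. every 16 x 16 ms = 256 ms
```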
The second tested feature vector consists of a learned filterbank applied directly to waveform samples. This method is based on the encoder in [23], and consists of 12 blocks of zero-padded Conv1D, each followed by a LeakyReLU activation and a decimation pooling step that halves the time dimension. Twelve halvings correspond to one frame per 2^12 = 4096 samples, so at a waveform sampling rate of 16000 Hz this feature extractor produces a 288-dimensional feature vector every 256 ms.
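A minimal NumPy sketch of this kind of extractor shows how twelve decimate-by-two blocks reduce 16 kHz audio to one feature vector per 256 ms. The filters are random, and the kernel size and intermediate channel widths are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w):
    # x: (c_in, T); w: (c_out, c_in, k); zero-padded "same" 1-D convolution
    cin, T = x.shape
    cout, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    win = np.lib.stride_tricks.sliding_window_view(xp, k, axis=1)  # (c_in, T, k)
    return np.einsum('oik,itk->ot', w, win)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def decimate(x):
    # decimation pooling: halve the time dimension by discarding every other frame
    return x[:, ::2]

x = rng.standard_normal((1, 16000))  # one second of 16 kHz audio, one channel
for i in range(12):
    cout = 288 if i == 11 else 24    # illustrative widths; final block outputs 288 dims
    w = 0.1 * rng.standard_normal((cout, x.shape[0], 15))  # kernel size 15 assumed
    x = decimate(leaky_relu(conv1d_same(x, w)))

# 12 halvings: one frame per 2**12 = 4096 samples, i.e. roughly every 256 ms
print(x.shape)
```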
Embedding Fembed:
Two types of neural embeddings were tested.
The first tested neural embedding is a Bi-Directional LSTM (BLSTM) [24]. Similar to [10], we use 5 layers with 256 hidden units for both forward and backward LSTM.
The second tested neural embedding is a multi-headed self-attention (MHA) model. As in [10], we linearly transform each frame’s feature vector, then apply two encoder layers. As in [25], each encoder layer includes layer normalization [26], then multi-headed self-attention, then another layer norm, then a two-layer position-wise fully-connected network. After both encoder layers, another layer normalization is applied. For self-attention, we set both input and output dimensions to 256, and for the position-wise fully-connected network, we set the hidden layer size to 1024.
Classifier Fcls:
To map the speaker embedding of each frame to a binary label for each speaker class, we use either a linear predictor, or a two-layer fully-connected network with ReLU activation (denoted MLP), whose first hidden layer is the same size as the speaker embedding.
Focal Loss for Imbalanced Binary Labels:
We use Adam [27] to train our networks. The majority of frames are silence: without considering voice overlap, for a label tensor of size C × L, only T_on/(C·L) of the target entries will be 1, where T_on is the total number of frames during which someone is speaking. We therefore treat this as a class-imbalanced classification problem, and use focal loss [28] to balance our training. As in the best configuration of [28], we use α = 0.25 and γ = 2 as our hyper-parameters. When running on test data, we set the prediction threshold to 0.5.
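A NumPy sketch of the binary focal loss of [28] with α = 0.25 and γ = 2, illustrating how the (1 − p_t)^γ factor down-weights easy examples relative to hard ones (the example probabilities are hypothetical):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    # binary focal loss per entry; p: predicted probabilities, y: binary targets
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)          # probability assigned to the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -at * (1 - pt) ** gamma * np.log(pt)

# an easy negative (p = 0.1) contributes far less than a hard negative (p = 0.9)
easy = focal_loss(np.array([0.1]), np.array([0.0]))
hard = focal_loss(np.array([0.9]), np.array([0.0]))
print(easy, hard)
```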
3. PRE-TRAINING WITH MULTIPLE INSTANCE LEARNING
Because the time resolution of our diarization system is 256 ms per frame, we require high-resolution labels to train it. However, infant recordings mostly consist of non-speech vocalizations, so average speaker turns are extremely short [20] compared with typical diarization datasets [30]. In addition, overlapping vocalizations are frequent in infant recordings. Accurate labels are therefore relatively costly to acquire; our core training and test datasets are precisely labeled, but relatively small.
Due to the limited size of our own training set, we use the Braunwald [31] and Providence [32] Corpora from the CHILDES project as transfer learning datasets. During manual inspection, we noticed that in both datasets, although the labelled speaker turns are usually correct in terms of speaker class, the turn boundaries are relatively imprecise: most speaker turns contain silence at both the start and the end.
Because we do not have true binary speaker class labels for each frame in our transfer learning datasets, we cannot directly train the system described in Sec. 2. We therefore re-formulate the problem as a multiple-instance learning (MIL) problem, described below.
Given (x, s) as our transfer learning datum, where x ∈ ℝ^T and s ∈ {1, …, C} are respectively the audio samples and the speaker class label for a segment, we wish to find a mapping G_θ that shares the same parameters θ with F_θ. We wish to learn the parameters θ to maximize the accuracy of Eq. 1, but the true value of y is unknown (the time alignment of the utterance within the segment is unknown), therefore we design a classifier G_θ to maximize the accuracy of

ȳ ≈ G_θ(x)    (6)

where ȳ ∈ {0, 1}^C, the multiple-instance learning target, is either 1[s] (a one-hot vector for s) or 0 (the zero vector, for non-speech segments). Eq. 6 can be computed using a G_θ that computes the maximum over a segment, and compares the maximum to 1[s], thus

G_θ(x) = MaxPool(F_cls(F_embed(F_feat(x))))    (7)

or

G_θ(x) = F_cls(MaxPool(F_embed(F_feat(x))))    (8)

where MaxPool denotes global max-pooling over time.
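The two pooling orders of Eqs. 7 and 8 can be sketched with a random per-frame embedding sequence and a linear classifier; all dimensions and weights here are illustrative, not the trained system:

```python
import numpy as np

rng = np.random.default_rng(0)
C, E, L = 3, 8, 20                  # speaker classes, embedding size, frames in a segment
emb = rng.standard_normal((E, L))   # stand-in for F_embed(F_feat(x)): per-frame embeddings
W = rng.standard_normal((C, E))     # a linear F_cls for illustration

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Eq. 7: classify every frame, then max-pool the per-class scores over time
mil1 = sigmoid(W @ emb).max(axis=1)

# Eq. 8: max-pool the embeddings over time, then classify the pooled vector
mil2 = sigmoid(W @ emb.max(axis=1))

print(mil1.shape, mil2.shape)  # each yields one score per speaker class per segment
```

Either variant collapses the unknown time alignment into a single segment-level prediction that can be compared against the coarse label ȳ.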
4. EXPERIMENTAL METHODS AND RESULTS
Primary training, validation and test data were drawn from two studies of socioemotional development. Families with typically developing infants between 3 and 24 months of age were recruited from the community. Infants wore the LENA recorder for a total of 16 hours in the home. Reference labels were manually coded for 107 10-minute segments that, according to LENA segmentation, had the highest frequency of voice activity (23 at 3 months, 20 at 6 months, 22 at 9 months, 22 at 12 months, and 20 at 13–24 months). We split those into 87 for train, 10 for validation, and 10 for test. For each 10-minute segment, we then manually labelled four tiers, with cross-labeler validation at a precision of 0.2 seconds: CHN (key child), CXN (other child), FAN (female adult), and MAN (male adult). Neural nets were trained using audio waveforms normalized to [−1, 1], and divided into 20-second segments with 0.256 second frames. We consider CHN and CXN as the same speaker class, and FAN and MAN as separate speaker classes.
We use Diarization Error Rate (DER) [33] as our primary metric. It can be computed as

DER = (T_FA + T_miss + T_conf) / T_scored    (9)

where T_FA, T_miss, and T_conf are the durations of false-alarm speech, missed speech, and speaker confusion, and T_scored is the total duration of scored speech.
Note that DER in the case of infant speech will be much higher overall than in typical diarization tasks, since the denominator of DER is smaller when less voice activity is present.
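With hypothetical error durations, a quick sketch shows how sparse voice activity inflates DER even when the absolute errors are identical:

```python
def der(false_alarm, missed, confusion, scored_speech):
    # DER = (false alarm + missed speech + speaker confusion) / total scored speech time
    return (false_alarm + missed + confusion) / scored_speech

# identical absolute errors (in seconds), but sparse voice activity inflates DER
dense  = der(10.0, 20.0, 30.0, scored_speech=540.0)  # dense speech, e.g. broadcast news
sparse = der(10.0, 20.0, 30.0, scored_speech=120.0)  # long silences, e.g. infant recordings
print(round(dense, 3), round(sparse, 3))
```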
Networks were pre-trained on the transfer learning datasets, Braunwald and Providence, using two different MIL frameworks (MIL1 = Eq. 7 and MIL2 = Eq. 8). Variable-length segments were used, based on the start and end times in the labels; we kept only segments with durations between 1.28 s and 10.24 s. The best configurations using MIL1 and MIL2 achieved 17.1% and 16.7% classification accuracy, respectively, on the validation set.
For both MIL1 and MIL2, we pre-train with all combinations of F_feat and F_embed, and use a linear classifier with 3 classes (Child, Female, Male). We train with the Adam optimizer for 5 epochs, with a learning rate of 0.0005 and decay of 0.5 per epoch.
Configurations and results are listed in Table 1, using the notation introduced in Sec. 2. Configurations 1–4 and 9–12 used learning rate 0.001 with decay of 0.98 per epoch. Configurations 5–8 used learning rate 0.0005 with decay 0.94 per epoch. Configurations 13–20 used learning rate 0.0005 with decay 0.98.
Table 1.
Performance of different system configurations
| Config | F_feat | F_embed | F_cls | pre-train | params loaded | freeze F_feat (10 epochs) | freeze F_embed (10 epochs) | input len (s) | loss | DER |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Conv | BLSTM | linear | None | - | - | - | 20 | focal | 0.486 |
| 2 | Conv | MHA | linear | None | - | - | - | 20 | focal | 0.490 |
| 3 | LogMel | BLSTM | linear | None | - | - | - | 20 | focal | 0.535 |
| 4 | LogMel | MHA | linear | None | - | - | - | 20 | focal | 0.665 |
| 5 | Conv | BLSTM | MLP 2-layer | MIL1 | F_feat, F_embed | Yes | Yes | 20 | focal | 0.438 |
| 6 | Conv | MHA | MLP 2-layer | MIL1 | F_feat, F_embed | Yes | Yes | 20 | focal | 0.509 |
| 7 | LogMel | BLSTM | MLP 2-layer | MIL1 | F_embed | - | Yes | 20 | focal | 0.509 |
| 8 | LogMel | MHA | MLP 2-layer | MIL1 | F_embed | - | Yes | 20 | focal | 0.638 |
| 9 | Conv | BLSTM | MLP 2-layer | MIL1 | F_feat, F_embed | No | No | 20 | focal | 0.461 |
| 10 | Conv | MHA | MLP 2-layer | MIL1 | F_feat, F_embed | No | No | 20 | focal | 0.446 |
| 11 | LogMel | BLSTM | MLP 2-layer | MIL1 | F_embed | - | No | 20 | focal | 0.533 |
| 12 | LogMel | MHA | MLP 2-layer | MIL1 | F_embed | - | No | 20 | focal | 0.628 |
| 13 | Conv | BLSTM | linear | MIL2 | F_feat, F_embed, F_cls | Yes | Yes | 20 | focal | 0.465 |
| 14 | Conv | MHA | linear | MIL2 | F_feat, F_embed, F_cls | Yes | Yes | 20 | focal | 0.449 |
| 15 | LogMel | BLSTM | linear | MIL2 | F_embed, F_cls | - | Yes | 20 | focal | 0.521 |
| 16 | LogMel | MHA | linear | MIL2 | F_embed, F_cls | - | Yes | 20 | focal | 0.656 |
| 17 | Conv | BLSTM | linear | MIL2 | F_feat, F_embed, F_cls | No | No | 20 | focal | 0.514 |
| 18 | Conv | MHA | linear | MIL2 | F_feat, F_embed, F_cls | No | No | 20 | focal | 0.444 |
| 19 | LogMel | BLSTM | linear | MIL2 | F_embed, F_cls | - | No | 20 | focal | 0.523 |
| 20 | LogMel | MHA | linear | MIL2 | F_embed, F_cls | - | No | 20 | focal | 0.631 |
| LENA | - | - | - | - | - | - | - | - | - | 0.554 |
| Lavechin et al. [21] | SincNet [29] | BLSTM | MLP 3-layer | - | - | - | - | 2 | MSE | 0.586 |
| Ablation 1 | Conv | BLSTM | linear | None | - | - | - | 20 | BCE | 0.509 |
| Ablation 2 | Conv | BLSTM | linear | None | - | - | - | 2 | focal | 0.470 |
| Ablation 3 | Conv | BLSTM | MLP 2-layer | MIL1 | F_feat, F_embed | Yes | Yes | 2 | focal | 0.451 |
We ran two baselines developed by others on our test set: LENA [11, 12] and the system of Lavechin et al. [21]. We also ran three ablation studies on our own system, studying respectively the effect of the loss function (ablation 1) and of the input segment length, without and with pre-training (ablations 2 and 3); each ablation uses the same learning rate as the configuration it ablates. The effect of an additional speaker class for the non-key child is studied separately in Table 2.
Results are shown in Table 1. Ablation studies 1, 2 and 3, compared to configurations 1, 1, and 5, respectively, show that focal loss improves performance and that a shorter chunk size has an uncertain effect on performance. Our best system is based on configuration 5, which achieves 43.8% DER, compared to 55.4% and 58.6% DER achieved by the LENA software and Lavechin et al. [21] baselines. We also note that convolutional feature extractors work better than log-Mel features in most cases, which contradicts prior practice in diarization systems. This could be due to the varied forms of vocalization in the infant speaker class, which include extremely high-pitched vocalizations that may be ill-suited to log-Mel features.
Our best system (configuration 5) was also trained on a task with 4 speaker-class targets, in which the key child and other children are counted as separate classes. DER and frame error rate (the percentage of frames with a prediction error in at least one class) are reported in Table 2. The DER of LENA and of our system changed little, but the DER of [21] increased substantially.
Table 2.
DER and Frame Error Rate of each system on 4-speaker case
| System | DER | Frame Error Rate |
|---|---|---|
| Ours, config 5 | 0.497 | 0.338 |
| LENA | 0.581 | 0.353 |
| Lavechin et al., 2020[21] | 0.762 | 0.454 |
5. CONCLUSIONS
This paper offers two key contributions. First, we decomposed the child-speech diarization system into three separately analyzed components, F_feat, F_embed, and F_cls, and provided results for two configurations of each component. Second, we developed a pre-training procedure to enable transfer learning from datasets with coarse speaker segment labels. We found that convolutional features, combined with focal loss training and transfer learning, together achieve the most accurate system.
Fig. 1.

An overview of system components. Feature extractor, embedding generator, and classifier are successively applied to input waveform, yielding a prediction for each speaker class at each frame.
Fig. 2.

Two configurations for multiple-instance learning. Colored blocks are speaker embeddings; black-and-white boxes are speaker classes.
Acknowledgments
Thanks to Jiahao Xu from the University of Sydney for help with the server. This work was supported by funding from the National Institute on Drug Abuse (R34DA050256-01), the National Institute of Mental Health (R21MH112578-01) and the National Institute of Food and Agriculture, U.S. Department of Agriculture (ILLU-793-339).
Footnotes
Code can be found at https://github.com/JunzheJosephZhu/Child_Speech_Diarization
References
- [1]. Egger Helen Link and Angold Adrian, “Common emotional and behavioral disorders in preschool children: presentation, nosology, and epidemiology,” Journal of Child Psychology and Psychiatry, vol. 47, no. 3–4, pp. 313–337, 2006.
- [2]. Cree RA, Bitsko RH, Robinson LR, Holbrook JR, Danielson ML, Smith DS, Kaminski JW, Kenney MK, and Peacock G, “Health care, family, and community factors associated with mental, behavioral, and developmental disorders and poverty among children aged 2–8 years – United States, 2016,” MMWR, vol. 67, no. 5, pp. 1377–1383, 2018.
- [3]. Wright Barry and Edgington Elizabeth, “Evidence-based parenting interventions to promote secure attachment: Findings from a systematic review and meta-analysis,” Global Pediatric Health, vol. 3, pp. 1–14, 2014.
- [4]. Hoagwood Kimberly, Burns Barbara J., Kiser Laurel, Ringeisen Heather, and Schoenwald Sonja K., “Evidence-based practice in child and adolescent mental health services,” Psychiatric Services, vol. 52, no. 9, pp. 1179–1189, 2001.
- [5]. Canseco-Rodriguez Leonardo, Lamel Lori, and Gauvain Jean-Luc, “Speaker diarization from speech transcripts,” in Proc. Interspeech, 2004.
- [6]. Tranter Sue E. and Reynolds Douglas A., “An overview of automatic speaker diarization systems,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1557–1565, 2006.
- [7]. Pardo Jose M., Anguera Xavier, and Wooters Chuck, “Speaker diarization for multiple distant microphone meetings: Mixing acoustic features and inter-channel time differences,” in Proc. Interspeech, 2006.
- [8]. Stolcke Andreas, Friedland Gerald, and Imseng David, “Leveraging speaker diarization for meeting recognition from distant microphones,” in Proc. ICASSP, 2010.
- [9]. Castaldo Fabio, Colibro Daniele, Dalmasso Emanuele, Laface Pietro, and Vair Claudio, “Stream-based speaker segmentation using speaker factors and eigenvoices,” in Proc. ICASSP, 2008, pp. 4133–4136.
- [10]. Fujita Y, Kanda Naoyuki, Horiguchi Shota, Xue Yawen, Nagamatsu K, and Watanabe S, “End-to-end neural speaker diarization with self-attention,” in Proc. IEEE ASRU, 2019, pp. 296–303.
- [11]. Canault Mélanie, Normand Marie-Thérèse, Foudil Samy, Loundon Natalie, and Thai-Van Hung, “Reliability of the Language ENvironment Analysis system (LENA™) in European French,” Behavior Research Methods, vol. 48, 2015.
- [12]. Cristia Alejandrina, Ganesh Shobhana, Casillas M, and Ganapathy S, “Talker diarization in the wild: the case of child-centered daylong audio-recordings,” in Proc. Interspeech, 2018.
- [13]. Le Franc Adrien, Riebling Eric, Karadayi Julien, Wang Y, Scaff Camila, Metze F, and Cristia Alejandrina, “The ACLEW DiViMe: An easy-to-use diarization tool,” in Proc. Interspeech, 2018.
- [14]. Ryant Neville, Church K, Cieri C, Cristia Alejandrina, Du J, Ganapathy S, and Liberman M, “The second DIHARD diarization challenge: Dataset, task, and baselines,” in Proc. Interspeech, 2019.
- [15]. Bergelson Elika and Aslin Richard N., “Nature and origins of the lexicon in 6-mo-olds,” Proceedings of the National Academy of Sciences, vol. 114, no. 49, pp. 12916–12921, 2017.
- [16]. Sun L, Du J, Gao T, Lu Y, Tsao Y, Lee C, and Ryant N, “A novel LSTM-based speech preprocessor for speaker diarization in realistic mismatch conditions,” in Proc. ICASSP, 2018, pp. 5234–5238.
- [17]. Zajíc Z, Kunesová M, Zelinka J, and Hrúz M, “ZCU-NTIS speaker diarization system for the DIHARD 2018 challenge,” in Proc. Interspeech, 2018.
- [18]. Xie Jiamin, García-Perera LP, Povey D, and Khudanpur S, “Multi-PLDA diarization on children’s speech,” in Proc. Interspeech, 2019.
- [19]. García P, Villalba Jesús, Bredin H, Du Jun, Castan Diego, Cristia Alejandrina, Bullock Latané, Guo Ling, Okabe K, Nidadavolu Phani Sankar, Kataria S, Chen Sizhu, Galmant Léo, Lavechin Marvin, Sun Lei, Gill Marie-Philippe, Ben-Yair Bar, Abdoli S, Wang X, Bouaziz Wassim, Titeux Hadrien, Dupoux Emmanuel, Lee K, and Dehak Najim, “Speaker detection in the wild: Lessons learned from JSALT 2019,” arXiv:1912.00938, 2019.
- [20]. Najafian M and Hansen JHL, “Speaker independent diarization for child language environment analysis using deep neural networks,” in Proc. IEEE SLT, 2016, pp. 114–120.
- [21]. Lavechin Marvin, Bousbib Ruben, Bredin Hervé, Dupoux Emmanuel, and Cristia Alejandrina, “An open-source voice type classifier for child-centered daylong recordings,” 2020.
- [22]. Yu Dong, Kolbæk Morten, Tan Zheng-Hua, and Jensen J, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017, pp. 241–245.
- [23]. Stoller D, Ewert S, and Dixon S, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” in Proc. ISMIR, 2018.
- [24]. Hochreiter Sepp and Schmidhuber Jürgen, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [25]. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser L, and Polosukhin Illia, “Attention is all you need,” arXiv:1706.03762, 2017.
- [26]. Ba Jimmy, Kiros J, and Hinton Geoffrey E., “Layer normalization,” arXiv:1607.06450, 2016.
- [27]. Kingma Diederik P. and Ba Jimmy, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2015.
- [28]. Lin Tsung-Yi, Goyal Priya, Girshick Ross B., He Kaiming, and Dollár P, “Focal loss for dense object detection,” in Proc. IEEE ICCV, 2017, pp. 2999–3007.
- [29]. Ravanelli Mirco and Bengio Yoshua, “Speaker recognition from raw waveform with SincNet,” 2018.
- [30]. Anguera X, Bozonnet Simon, Evans Nicholas, Fredouille Corinne, Friedland Gerald, and Vinyals O, “Speaker diarization: A review of recent research,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, 2012.
- [31]. Braunwald Susan R., “Mother-child communication: The function of maternal-language input,” WORD, vol. 27, no. 1–3, pp. 28–50, 1971.
- [32]. Song Jae Yung, Demuth Katherine, Evans Karen, and Shattuck-Hufnagel Stefanie, “Durational cues to fricative codas in 2-year-olds’ American English: Voicing and morphemic factors,” The Journal of the Acoustical Society of America, vol. 133, no. 5, pp. 2931–2946, 2013.
- [33]. Fiscus Jonathan, Ajot Jerome, Michel Martial, and Garofolo John, “The rich transcription 2006 spring meeting recognition evaluation,” 2006, pp. 309–322.
