Figure 1.
Task design, diarization methods, and feature extraction. Participants took part in (A) a conversation with a study partner, and (B) a Describe the Picture (DtP) task. (C) Conversations were automatically diarized into a hypothesised number of unknown speakers. (D) DtP audio was manually segmented to extract participant speech only. (E,F) The audio for each hypothesised speaker, and for the participant’s speech, was converted into a fixed 192-dimensional embedding. (G) The cosine distance between each speaker embedding and the participant’s embedding was measured. (H) Speakers whose distance was below a predefined threshold (0.75; Spk1 and Spk3 in this example) were assigned to the participant. (I) The duration of each utterance was measured, and basic statistics of the durations were calculated.
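
Steps (G) to (I) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`cosine_distance`, `assign_participant`, `utterance_stats`) are hypothetical, cosine distance is assumed to be defined as one minus cosine similarity, the toy embeddings are synthetic placeholders rather than outputs of a real speaker-embedding model, and the particular statistics returned are an assumption, since the caption only mentions "basic statistics".

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_participant(speaker_embeddings, participant_embedding, threshold=0.75):
    """Return labels of diarized speakers whose distance to the participant
    embedding is below the threshold (0.75 in the caption's example)."""
    return [
        label
        for label, emb in speaker_embeddings.items()
        if cosine_distance(emb, participant_embedding) < threshold
    ]

def utterance_stats(utterances):
    """Basic duration statistics for a list of (start, end) times in seconds.
    The choice of statistics is an assumption for illustration."""
    durations = np.array([end - start for start, end in utterances], dtype=float)
    return {
        "count": int(durations.size),
        "mean": float(durations.mean()),
        "median": float(np.median(durations)),
        "std": float(durations.std()),
    }

# Synthetic 192-dimensional embeddings (placeholders, not real model output).
participant = np.ones(192)
spk1 = np.ones(192); spk1[:10] = 0.0      # close to the participant
spk2 = np.tile([1.0, -1.0], 96)           # orthogonal to the participant
spk3 = np.ones(192); spk3[:40] = 0.0      # also close to the participant

speakers = {"Spk1": spk1, "Spk2": spk2, "Spk3": spk3}
matched = assign_participant(speakers, participant, threshold=0.75)
stats = utterance_stats([(0.0, 2.0), (5.0, 9.0)])
```

In this toy setup, Spk1 and Spk3 fall below the threshold and are pooled as the participant, while Spk2 (distance 1.0) is excluded, mirroring the example in panel (H).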