Manfredi et al., 2017
|
Recordings of 18 synthesized voice samples in a soundproof booth. Smartphones were fixed at a 10 cm distance from the center of the sound source (loudspeaker). The voice samples consisted of sustained /a:/ utterances, each 2 s long, at two different median f0 values (120 Hz, 200 Hz), three levels of jitter (0.9%, 2.8%, 4.5%), and three levels of additive noise (97.4 dB, 23.8 dB, 17.6 dB) |
Smartphones, external microphone |
A basic, inexpensive smartphone (Wiko model CINK SLIM2), a high-level, expensive smartphone (HTC One), and a high-quality external microphone (Sennheiser model MD421U) |
Jitter, shimmer, noise-to-harmonics ratio |
PRAAT |
The results obtained with the three devices for the different jitter, shimmer, and amount of noise levels were significantly correlated. No absolute differences between devices were tested or reported |
Guidi et al., 2015
|
Two subjects were asked to read non-emotional text and to comment on displayed pictures, in the context of testing an Android application, while being recorded simultaneously by two identical smartphones and one external microphone. One smartphone was kept on the table and the other was held by the subject. The smartphone held by the subject and the external microphone were positioned at a 30 cm distance from the subjects’ mouths, whereas the smartphone on the table was placed at a 40 cm distance from the subjects’ mouths. In a second experiment, the subjects were recorded simultaneously by two different smartphones, both held close together by the subject (the exact distance is not specified) |
Smartphones, external microphone |
Two Samsung smartphones (I9300 Galaxy S III), an LG smartphone (Nexus 4 E960), and a high-quality external microphone (AKG P220) |
Mean f0, standard deviation of f0, and jitter |
C code in the Java Native Interface |
For both subjects, the three investigated features between the external microphone and hand-held smartphone, external microphone and smartphone on the table, and the two identical smartphones showed significant correlations. Weaker correlations, but still significant, were found for the extracted jitter between the external microphone and the hand-held smartphone. High correlations were also found in all features between the two different smartphones. No differences between devices were tested or reported |
Uloza et al., 2015
|
118 subjects (34 with healthy voices [23 females & 11 males; Mage = 41.8 years, SDage = 16.96] and 84 with pathological voices due to various voice disorders [50 females & 34 males; Mage = 49.87 years, SDage = 14.86]) were asked to phonate a sustained /a:/ vowel at a comfortable pitch and loudness level for at least 5 s. The voice samples were simultaneously recorded in a soundproof booth through two devices that were placed at a 10 cm distance from the subjects’ mouths |
Smartphone, external microphone |
Healthy vs. pathological voice groups. A Samsung smartphone (Galaxy Note 3), and a high-quality external microphone (AKG Perception 220) |
f0, jitter, shimmer, normalized noise energy (NNE), signal-to-noise ratio (SNR), and harmonics-to-noise ratio (HNR) |
Dr. Speech |
After splitting their sample and performing separate analyses for the healthy and pathological voices, they found in the healthy voice group significant differences for all the investigated acoustic voice parameters, except shimmer and f0, with the mean values from the external microphone recordings being higher. For the pathological voice group, no significant differences were found for the mean values of jitter, shimmer, and f0. For both groups, they showed significant correlations among the measured voice features reflecting pitch and amplitude perturbations (jitter and shimmer) and the features of voice signal turbulences (NNE, HNR, and SNR) captured both from the external and smartphone microphones |
Lin et al., 2012
|
11 healthy subjects (6 females & 5 males; Mage = 41.8 years, SDage = 16.7) were simultaneously recorded in a quiet room through a smartphone and an external microphone. All subjects were asked to read six sentences. The smartphone was placed approximately 13 cm from the subjects’ mouths, whereas the external microphone was placed approximately 5 cm from the subjects’ mouths. Vowel segments were used to extract the investigated vocal features |
Smartphone, external microphone |
An iPhone smartphone (model A1303), and a high-quality external microphone (AKG C420) |
f0, jitter, shimmer, signal-to-noise ratio (SNR), amplitude difference between the first two harmonics (H1 – H2), singing power ratio (SPR), and frequencies of formants one and two |
TF32, Adobe Audition |
The correlations between the vocal features captured by the smartphone and the external microphone were found to range from extremely to moderately high. The results showed a significant effect of the device used for shimmer, SNR, H1 – H2, and SPR and a significant device by vowel type interaction effect for shimmer and SPR in each vowel |
Brown et al., 2020
|
Crowdsourced voice data were collected via the “COVID-19 Sounds App” (web-based, Android, iOS). Subjects were asked to cough three times, breathe deeply through their mouth three to five times, and read a short sentence appearing on the screen three times |
Any device that connects to the internet (e.g., smartphone, laptop) |
Tested positive vs. negative for COVID-19, reported symptoms |
Duration, onset, tempo, period, root mean square energy, spectral centroid, roll-off frequency, zero-crossing, and mel-frequency cepstral coefficients (MFCCs) measures |
Python (librosa), VGGish |
Coughing sounds can distinguish COVID-19-positive from COVID-19-negative individuals with 80% precision. No device effects were tested or reported |
Han et al., 2021
|
Crowdsourced voice data were collected via the “COVID-19 Sounds App” (web-based, Android, iOS). Subjects were asked to cough three times, breathe deeply through their mouth three to five times, and read a short sentence appearing on the screen three times |
Any device that connects to the internet (e.g., smartphone, laptop) |
Tested positive vs. negative for COVID-19, reported symptoms |
Zero-crossing-rate (ZCR), root mean square frame energy, f0, harmonics-to-noise ratio (HNR), mel-frequency cepstral coefficients (MFCCs), prosodic, spectral, and voice quality features |
openSMILE |
When distinguishing individuals who tested positive from those who tested negative without taking their symptoms into account, the model achieves a sensitivity and specificity of 62 and 74%, respectively. When distinguishing recently tested positive individuals from healthy controls without any symptoms, the ROC-AUC and PR-AUC both increase from around 75 to 79%, while the sensitivity and specificity improve from 62 to 70% and from 74 to 75%, respectively. No device effects were tested or reported
Parsa & Jamieson, 2001
|
Different microphones were tested on how they recorded three different acoustic stimuli (broadband, pure tone, and voice samples) in a mini-anechoic chamber. Each of these signals was played back over a digital-to-analog converter. The voice samples consisted of sustained samples of the vowel /a:/ from 53 healthy subjects (33 females & 20 males; age range 22–59 years) and 100 subjects with voice disorders (63 females & 37 males; age range 21–58 years). The exact distance between the microphone and the digital speaker is not reported |
External microphones (specific devices were not reported) |
One high-quality, expensive external microphone used for clinical purposes, and three cheaper external microphones used for clinical purposes (microphones’ brand and model are not reported). Healthy or pathological subjects |
Four f0 perturbation measures, four amplitude perturbation measures, and four glottal noise measures |
Not reported |
Of the four f0 perturbation measures, the absolute jitter parameter was not significantly different for any of the microphone signals. All four amplitude perturbation measures were significantly different across all the microphones. The choice of microphone affected the accuracy of classifying voices as healthy or pathological
Titze & Winholtz, 1993
|
Four subjects phonating a sustained /a:/ vowel and synthesized voice samples played through a loudspeaker in a soundproof booth served as the acoustic signals, which were captured by different microphones at varying distances (4 cm, 30 cm, 1 m) and angles (0°, 45°, 90°). Each of the synthesized signals was 6 s long and varied in terms of signal modulations (e.g., amplitude modulations, frequency modulations) |
External microphones |
Four professional-grade microphones (AKG 451EB CK22, AKG 451EB CK1, EV DO54, AKG D224E), and two consumer-grade microphones (Realistic 33-985, Realistic 33-1063). Different angles and distances |
Amplitude and frequency measures |
GLIMPSES (Glottal Imaging by Processing External Signals) |
Some consumer-grade microphones used in conjunction with the same equipment and analysis programs inflated the frequency perturbation to a range of 0.1–0.2% and amplitude perturbation to a range of 1–2%. When the microphone distance was changed from 4 cm to 1 m, perturbation measures significantly increased |
Alsabek et al., 2020
|
Each subject was instructed to keep their head upright and asked to cough four times, take a deep breath, and count from one to ten. The total number of collected samples used in this study was 42 [(7 COVID-negative speakers × 3 recordings) + (7 COVID-positive speakers × 3 recordings)]. The captured speech signal underwent pre-processing, which involved the removal of noise using PRAAT |
Mobile phones (not further specified) |
Tested positive vs. negative for COVID-19 |
Mel-frequency cepstral coefficients (MFCCs) |
Not reported |
The subjects’ voice recordings showed a high correlation between COVID-negative and COVID-positive samples
This research |
Thirty subjects uttered two phrases in three emotional states while being recorded simultaneously by five common consumer-grade audio-recording devices. Participants were recorded using both low-proximity microphones (i.e., smartphone, laptop, and studio microphone) placed away from the source (60 cm) and high-proximity microphones (i.e., lavalier and headset microphones) placed close to the source (15–20 cm)
Smartphone, laptop, studio microphone, headset, lavalier |
Five distinct devices that were recording simultaneously: a studio microphone (Blue Yeti Logitech), a lavalier microphone (SmartLav+ Rode), a headset microphone (Beats by Dr. Dre EP), smartphone (Samsung A6), and a laptop (MacBook Pro 2017). To increase speaker variability, participants expressed three discrete emotions (neutral, happy, and sad) with two varying intonation types (phonetic amplification of “i” vs. “a”), and with or without wearing a headset |
Mean f0, and mean amplitude |
Python (Parselmouth) |
Significant differences were found between recording devices (e.g., amplification of the amplitude measure for high-proximity devices), which in turn led to lower predictive accuracy in an emotion prediction or biological sex prediction task
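
Several of the studies summarized above report significant correlations between devices without testing absolute differences. The following minimal sketch illustrates why the two are not interchangeable: a device can track a reference microphone perfectly (r close to 1) and still report systematically higher values. The jitter values, the 10% inflation, and the 0.5-point offset are hypothetical numbers chosen for illustration only; they are not taken from any of the cited studies, and the helper names `pearson_r` and `mean_difference` are likewise assumptions.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_difference(x, y):
    """Mean of the paired differences y - x (systematic offset of y over x)."""
    return mean(b - a for a, b in zip(x, y))


# Hypothetical jitter values (%) for the same five utterances.
reference_mic = [0.9, 1.4, 2.8, 3.6, 4.5]
# Hypothetical second device: 10% proportional inflation plus a 0.5-point offset.
smartphone = [1.1 * v + 0.5 for v in reference_mic]

r = pearson_r(reference_mic, smartphone)
offset = mean_difference(reference_mic, smartphone)
print(f"r = {r:.3f}, mean offset = {offset:.3f} percentage points")
```

Under these assumptions the correlation is essentially perfect while the second device reads higher on every utterance, which is the pattern a correlation-only analysis would leave undetected.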