PLOS Digital Health. 2024 Dec 19;3(12):e0000679. doi: 10.1371/journal.pdig.0000679

A voice-based algorithm can predict type 2 diabetes status in USA adults: Findings from the Colive Voice study

Abir Elbéji 1, Mégane Pizzimenti 1, Gloria Aguayo 1, Aurélie Fischer 1, Hanin Ayadi 1, Franck Mauvais-Jarvis 2,3, Jean-Pierre Riveline 4,5, Vladimir Despotovic 6, Guy Fagherazzi 1,*
Editor: Ludwig Christian Giuseppe Hinske 7
PMCID: PMC11658629  PMID: 39700066

Abstract

The pressing need to reduce undiagnosed type 2 diabetes (T2D) globally calls for innovative screening approaches. This study investigates the potential of using a voice-based algorithm to predict T2D status in adults, as the first step towards developing a non-invasive and scalable screening method. We analyzed pre-specified text recordings from 607 US participants from the Colive Voice study registered on ClinicalTrials.gov (NCT04848623). Using hybrid BYOL-S/CvT embeddings, we constructed gender-specific algorithms to predict T2D status, evaluated through cross-validation based on accuracy, specificity, sensitivity, and Area Under the Curve (AUC). The best models were stratified by key factors such as age, BMI, and hypertension, and compared to the American Diabetes Association (ADA) score for T2D risk assessment using Bland-Altman analysis. The voice-based algorithms demonstrated good predictive capacity (AUC = 75% for males, 71% for females), correctly predicting 71% of male and 66% of female T2D cases. Performance improved in females aged 60 years or older (AUC = 74%) and individuals with hypertension (AUC = 75%), with an overall agreement above 93% with the ADA risk score. Our findings suggest that voice-based algorithms could serve as a more accessible, cost-effective, and noninvasive screening tool for T2D. While these results are promising, further validation is needed, particularly for early-stage T2D cases and more diverse populations.

Author summary

Type 2 diabetes (T2D) is a major public health issue, affecting millions worldwide and leading to severe health complications if undiagnosed. Currently, diagnosing T2D relies on blood tests, which are invasive, costly, and challenging to implement on a large scale. This study explores a new, non-invasive approach: detecting T2D risk through voice analysis. Using data from the Colive Voice study, we developed a voice-based algorithm to predict T2D status in adults in the USA. The algorithm analyzes specific voice features and is designed to capture subtle differences in the voices of individuals with T2D compared to those without. We trained and tested the algorithm separately for men and women and observed promising results, with the algorithm showing accuracy levels comparable to traditional risk assessment tools, such as the American Diabetes Association (ADA) score. We also found that the algorithms performed better in certain subgroups, such as older women and individuals with hypertension. Our findings highlight the potential of voice analysis as an accessible and affordable screening tool for T2D, especially valuable for early detection in diverse populations and settings with limited resources. This innovative approach could transform diabetes screening by offering a practical, scalable solution for identifying those at risk.

Introduction

Diabetes mellitus (DM) is an endocrine disease in which the body cannot regulate blood glucose levels. It is one of the most severe and common chronic diseases of our time and was responsible for 6.7 million deaths in 2021 [1]. In 2022, about 1 in 10 people in the world was living with DM, and the number is expected to grow from 537 million adults to 643 million by 2030 and 783 million by 2045 as a result of population aging, economic development, urbanization, unhealthy eating habits, and sedentary lifestyles [1]. In the USA, according to the 2022 National Diabetes Statistics Report from the CDC [1,2], 37.3 million people, or 11.3% of the population, have diabetes. This total includes 28.7 million diagnosed cases and an estimated 8.5 million people who are living with undiagnosed diabetes.

One of the most urgent public health challenges in DM is reducing the number of undiagnosed cases worldwide. Currently, almost one in every two people with type 2 diabetes (T2D) is undiagnosed and, as a result, cannot begin treatment or preventive measures to avoid or delay complications [3]. Undiagnosed DM has been shown to be associated with a higher risk of death compared to normoglycemic individuals [4], as one-third of T2D patients do not present symptoms until complications appear [5]. From a health economics perspective, it has been previously reported that each undiagnosed diabetes case costs $4,250 per year in the USA [6], generating preventable healthcare expenditures.

Currently, screening campaigns rely on invasive blood glucose analysis, which costs around 825 billion dollars per year [7] and might be difficult to deploy at a large scale or to implement in countries or settings with limited resources or infrastructure. Alternative methods include scores to identify individuals at risk of developing diabetes during the next 5 to 10 years. The FINDRISC score [7,8] is widely used, although it is based on a questionnaire with limited detection capacity (AUC around 76%) and can be prone to errors or desirability biases.

In the United States (USA), the American Diabetes Association (ADA) diabetes risk test [9] was developed as a screening tool to classify high-risk subjects in the community and to raise awareness of modifiable risk factors and healthy lifestyles (5). The ADA diabetes risk test includes seven questions (total score of 0–11) regarding age, gender, gestational diabetes mellitus (GDM), family history of diabetes, high blood pressure, physical activity, and obesity (based on body mass index (BMI) via a weight-height chart). Individuals with a score of 5 or more are considered to be at high risk of having diabetes.

With the advancement of digital technologies and artificial intelligence, significant effort is being directed towards detecting diabetes through noninvasive methods. These methods range from human facial block color analysis using sparse representation classifiers [10], hair analysis through elemental composition [11,12,13], specialized eye exams aimed at detecting diabetic retinopathy [14], to voice analysis, which stands as one of the most promising technologies in healthcare applications. This includes early diagnosis of neurodegenerative diseases [15] and assisting in screening and monitoring symptoms of conditions like COVID-19 [16] through the analysis of subtle speech pattern alterations and vocal biomarkers.

Previous work has suggested that people with diabetes have different voice features than people without diabetes. People with T2D with poor glycemic control or with neuropathy are also more likely to have phonatory symptoms compared to controls [17], namely a higher average score for vocal grading, straining [18], and hoarseness [19], which affect patients’ quality of life. From an acoustic perspective, it has been shown that voice parameters like jitter, shimmer, smoothed amplitude perturbation quotient, noise-to-harmonic ratio, relative average perturbation, mean fundamental frequency, maximum phonation time, and amplitude perturbation quotient differ significantly between T2D patients and people without diabetes [20,21]. However, previous studies were limited by relatively small sample sizes, a lack of diversity in participant profiles, and a lack of validation with audio recordings captured in real-world settings.

Building on this groundwork, our study distinguishes itself by leveraging data from the Colive Voice program to develop and assess the performance of a voice-based AI algorithm for T2D status detection in the adult population in the USA. This initiative not only serves as a first step toward using voice analysis as a first-line T2D screening strategy but also offers insights into the complex nature of T2D and its interaction with voice characteristics. Accordingly, we place special emphasis on considering a wide array of demographic and health-related parameters. This holistic approach is crucial as these factors can significantly affect voice characteristics and, consequently, their potential as indicators for disease states.

Methods

Study population

In 2021, the Luxembourg Institute of Health initiated a worldwide, multilingual research program named Colive Voice. This ongoing project serves as a platform for identifying vocal biomarkers to screen for or monitor various chronic diseases and frequent health symptoms. To ensure diversity, Colive Voice collects voice recordings from participants above the age of 15 years, regardless of their health status and conditions, in English, French, German, and Spanish globally. Each participant contributes standardized vocal tasks, which are then annotated with clinical and demographic data.

Ethics statement

Colive Voice is registered on ClinicalTrials.gov (NCT04848623) and was approved by the National Research Ethics Committee of Luxembourg (study number 202103/01) in March 2021. All participants provided informed consent to take part in the study.

Collected data

Colive Voice participants are invited to complete a comprehensive questionnaire to gather a diverse range of information: demographic characteristics, lifestyle habits, anthropometric data, symptoms, drug use, and history of chronic diseases. Regarding diabetes, Colive Voice gathers data on the diagnosis, type of diabetes, duration since diagnosis, treatment categories, and HbA1c levels. For the present work, we included English-speaking participants from the USA and analyzed each gender separately. Participants were invited to record a standardized reading task using Article 25 of the Universal Declaration of Human Rights (Fig 1). All the collected raw audio data was processed and quality-checked to ensure consistency throughout the study. There was no missing data in this study, ensuring a robust and complete dataset for analysis.

Fig 1. General workflow.


Voice feature extraction

OpenSmile

OpenSmile [22] is an open-source toolkit, popularly used for generating handcrafted low-level descriptors (LLD) from audio inputs. These descriptors encapsulate key characteristics of audio signals over time such as pitch, intensity, and spectral properties. OpenSmile computes functionals on these LLD contours, capturing statistical attributes like peaks, means, and ranges to provide a higher-level overview of the audio signal. Among the feature sets that OpenSmile offers, the ComParE set stands out. Comprising 6373 static features, ComParE is notable for its comprehensive nature, offering a rich and extensive array of data points. This vast collection of features facilitates the detection of complex patterns in the audio data, offering an in-depth understanding of the audio source.
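The idea of computing functionals over a low-level descriptor contour can be illustrated with a minimal sketch. This is not OpenSmile's implementation: the `functionals` helper and the pitch values are hypothetical, and only a handful of statistics are shown, whereas ComParE applies many more functionals to many more descriptors.

```python
import statistics


def functionals(contour):
    """Summarize a frame-level descriptor contour with a few static statistics."""
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }


# Hypothetical pitch contour in Hz, one value per analysis frame.
pitch_hz = [198.0, 202.0, 205.0, 210.0, 207.0, 200.0]
features = functionals(pitch_hz)
print(features)
```

Each recording, whatever its duration, is thereby mapped to a fixed-length vector of statistics, which is what makes the features "static" and usable by standard classifiers.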

BYOL-S/CvT

The hybrid model, BYOL-S/CvT [23], is a new method that detects cognitive and physical load in speech. It uses both data-driven features from the self-supervised BYOL-S model trained on Audioset and handcrafted features from OpenSmile. This mix improves the model’s performance and helps it learn speech patterns better than traditional methods. The BYOL-S/CvT model is also efficient and fast, needing only a single step during the decision-making stage, and produces 2048-dimensional embeddings.

Data analysis

In this study, the authors adhered to the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) standards for the reporting of AI-based algorithm development and validation and used the corresponding checklist to guide the drafting of the manuscript.

To mitigate gender bias and manage imbalanced data challenges in our machine learning algorithm training, we first stratified the dataset based on gender. Following this, we used a simple random sampling technique to generate balanced group sizes, ensuring a more equitable and effective training process. Individuals without endocrine diseases, including diabetes, were selected randomly from the general USA population to create a control group that matched the size of the group of participants with T2D.
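The balancing step described above can be sketched as random undersampling of controls within each gender. This is an illustration under stated assumptions, not the authors' code: the `balance_by_gender` helper, the record fields (`gender`, `t2d`), and the toy cohort are all hypothetical.

```python
import random


def balance_by_gender(participants, seed=0):
    """Keep all T2D cases and randomly draw an equal number of controls per gender."""
    rng = random.Random(seed)
    balanced = []
    for gender in sorted({p["gender"] for p in participants}):
        cases = [p for p in participants if p["gender"] == gender and p["t2d"]]
        controls = [p for p in participants if p["gender"] == gender and not p["t2d"]]
        # Simple random sampling without replacement from the control pool.
        n_controls = min(len(cases), len(controls))
        balanced.extend(cases)
        balanced.extend(rng.sample(controls, n_controls))
    return balanced


# Hypothetical toy cohort: few T2D cases, many controls.
cohort = (
    [{"gender": "F", "t2d": True} for _ in range(3)]
    + [{"gender": "F", "t2d": False} for _ in range(10)]
    + [{"gender": "M", "t2d": True} for _ in range(2)]
    + [{"gender": "M", "t2d": False} for _ in range(8)]
)
balanced = balance_by_gender(cohort)
print(len(balanced))
```

Fixing the random seed keeps the control selection reproducible across runs.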

To enhance our algorithm’s performance, we first normalized the extracted features and embeddings using a standard scaler, which helps ensure consistent variance across all features. As high-dimensional inputs could lead to overfitting and poor generalization in machine learning algorithms, we used Principal Component Analysis (PCA) to reduce the dimensionality of the BYOL-S/CvT embeddings. For OpenSmile features, we performed feature selection with SelectKBest from scikit-learn.

Once the normalization, reduction, and feature selection processes were complete, the resulting features were fed into three different classifiers: Logistic Regression (LR), Support Vector Machine with a radial basis function kernel (SVM RBF), and Multi-Layer Perceptron classifiers (MLP).

To evaluate the performance and compare the classifiers, we used stratified 5-fold cross-validation, ensuring no data leakage via the Pipeline functionality from scikit-learn. This pipeline handled scaling and PCA reduction for the BYOL-S embeddings, as well as scaling and feature selection for OpenSmile features. We measured the algorithms’ performances using accuracy, specificity, sensitivity, and AUC metrics.

For optimal results, we fine-tuned the number of PCA components and the algorithms’ hyperparameters using the grid search function from scikit-learn. We then used the best feature-classifier combination to select the most performant algorithm for each gender.
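The workflow described in this subsection (scaling, PCA, a classifier, stratified 5-fold cross-validation, and a grid search, all wrapped in a scikit-learn Pipeline) could look roughly like the sketch below. It is an assumption-laden illustration: the data are synthetic stand-ins for voice embeddings, only Logistic Regression is shown, and the parameter grid is invented for the example rather than taken from the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for embeddings: 200 samples x 20 dimensions, balanced classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 20))
X[y == 1, :5] += 2.0  # shift a few dimensions so there is signal to find

# Scaler and PCA are refit on each training fold, so no test-fold
# information leaks into preprocessing.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the number of components and one hyperparameter.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "clf__C": [0.1, 1.0],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Placing the preprocessing steps inside the Pipeline is what prevents leakage: fitting the scaler or PCA on the full dataset before splitting would let held-out samples influence the transformation.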

Influence of cofactors and their impact on algorithms’ performance metrics

In order to highlight how different cofactors influence the efficacy of our predictive algorithms, we conducted a performance stratification analysis. This analysis was stratified by age (below 60 years versus 60 years or older) and BMI (below 25 kg/m2 versus 25 kg/m2 or above). Additionally, we examined conditions including hypertension, migraine, diagnosed depression, smoking, stress, and fatigue (measured by the Fatigue Severity Scale [24]), designating each condition’s status as either ‘present’ or ‘absent’, or as ‘severe’ or ‘mild’. To reinforce confidence in our performance metrics and facilitate comparisons, we employed a bootstrapping technique. This involved generating multiple subsamples for each combination of comorbidity and its status. The bootstrapping process, repeated for each comorbidity, involves sampling with replacement from the original dataset and subsequently recalculating the metrics for each subsample.
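A minimal sketch of the bootstrapping step for one subgroup is shown below. The helper names, the number of bootstrap iterations, and the toy labels are assumptions made for illustration; the source does not specify these details.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_metrics(y_true, y_pred, n_boot=200, seed=0):
    """Resample with replacement and recompute the metrics on each subsample."""
    rng = random.Random(seed)
    n = len(y_true)
    sens, spec = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        if 0 < sum(yt) < n:  # both classes are needed to compute both metrics
            s, c = sensitivity_specificity(yt, yp)
            sens.append(s)
            spec.append(c)
    return sens, spec


# Hypothetical predictions within one subgroup (e.g. hypertension present).
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 7 + [1] * 3
point = sensitivity_specificity(y_true, y_pred)
sens, spec = bootstrap_metrics(y_true, y_pred)
print(point, len(sens))
```

The resulting lists of per-iteration metrics give a distribution for each subgroup, from which a mean and standard deviation can be reported and between-subgroup comparisons can be made.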

With the objective of developing a screening tool in mind, the assessment of specificity and sensitivity metrics was prioritized, but AUC was also reported. High sensitivity guarantees that true cases are not missed, while high specificity reduces false alarms, optimizing resource use and building user trust. Performance metrics were computed independently for each bootstrap iteration within the respective groups. To evaluate the statistical significance of performance differences between categories, we employed the Mann-Whitney U test. Finally, to account for multiple comparisons, we adjusted the p-values obtained from these statistical tests using a Bonferroni correction.
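The comparison step can be sketched as follows, assuming two lists of bootstrap metric values per subgroup. This hand-written version uses the normal approximation without tie or continuity correction, purely to show the mechanics; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The example values are invented.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation
    (no tie or continuity correction)."""
    n1, n2 = len(a), len(b)
    # U counts the pairs in which a value from `a` exceeds one from `b`;
    # ties count as one half.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical bootstrap sensitivities for two categories of one cofactor.
present = [0.74, 0.76, 0.75, 0.77, 0.73]
absent = [0.65, 0.66, 0.64, 0.67, 0.63]
u, p = mann_whitney_u(present, absent)
adjusted = bonferroni([0.01, 0.04, 0.5])
print(u, p, adjusted)
```

The Bonferroni correction is conservative: with many cofactors and metrics tested, each raw p-value must be correspondingly smaller to remain significant after adjustment.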

Sensitivity analysis

As a sensitivity analysis, we conducted a Bland-Altman analysis between the voice-based algorithms and the ADA risk score, which serves as a gold standard for assessing T2D risk in the USA population [9]. Due to data limitations, physical activity levels and family history of diabetes were not available in Colive Voice and were set to zero for all participants by default. In this context, the modified ADA risk score ranges from zero, denoting no T2D risk, to seven, indicating a high risk.
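For readers unfamiliar with Bland-Altman analysis, the core computation is sketched below: the mean difference (bias) between two measurement methods, the limits of agreement at the conventional mean ± 1.96 standard deviations, and the share of observations falling within those limits. The source does not state how predicted probabilities and ADA scores were placed on a common scale, so this example simply assumes both are already expressed on the same 0 to 7 scale; the values are invented.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return the bias, the 95% limits of agreement, and the fraction within them."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, (lower, upper), within


# Hypothetical scores on a shared 0-7 scale (an assumption of this sketch).
voice_score = [3.5, 2.0, 5.5, 1.0, 4.0, 6.0, 2.5, 3.0]
ada_score = [3.0, 2.0, 5.0, 2.0, 4.0, 5.0, 3.0, 3.0]
bias, limits, within = bland_altman(voice_score, ada_score)
print(bias, limits, within)
```

A bias near zero with most differences inside the limits indicates that the two methods tend to rank and score individuals similarly, which is distinct from either method being accurate against the true diagnosis.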

Results

Population characteristics

We analyzed 323 females and 284 males based on T2D status. The majority were identified as white: 73.3% of females without T2D, 76.5% of females with T2D, 77.5% of males without T2D, and 71.8% of males with T2D (Table 1).

Significant differences were identified across the groups, including age, BMI (t-test p-value < 0.001), and prevalence of hypertension (chi2 p-value < 0.001). Those with T2D, in both genders, had higher average ages and BMIs than those without T2D. Specifically, females with T2D had an average age of 49.5 years and a BMI of 35.8 kg/m2, compared to 40.0 years and 28.0 kg/m2 in those without T2D. Male participants with T2D had an average age of 47.6 years and BMI of 32.8 kg/m2, whereas those without averaged 41.6 years and 26.6 kg/m2.

Hypertension was more prevalent among the T2D group. Among females with T2D, 50.0% reported hypertension, compared to 11.2% in the group without T2D. For males, a similar trend was observed, with 58.5% of those with T2D having hypertension, compared to 12.7% of those without T2D.

A history of diagnosed depression was also more prevalent in those with T2D (chi2 p-value < 0.01), especially in females: 61.7% with T2D reported depression, compared to 45.3% without T2D. Among males, the rates were 48.6% for those with T2D and 31.7% for those without T2D. Other health conditions and scores are included in Table 1.

Table 1. Study population characteristics.

| Characteristic | Female: Without T2D | Female: With T2D | Female: P-value | Male: Without T2D | Male: With T2D | Male: P-value |
|---|---|---|---|---|---|---|
| Participants (N) | 161 | 162 | - | 142 | 142 | - |
| Age (year) | 40.0 (13.5) | 49.5 (12.1) | <0.001 | 41.6 (14.0) | 47.6 (13.4) | <0.001 |
| Body Mass Index (kg/m2) | 28.0 (7.3) | 35.8 (8.9) | <0.001 | 26.6 (5.5) | 32.8 (8.5) | <0.001 |
| Ethnicity: White | 118 (73.3%) | 124 (76.5%) | 0.28 | 110 (77.5%) | 102 (71.8%) | 0.59 |
| Ethnicity: Black | 20 (12.4%) | 21 (13.0%) | | 10 (7.0%) | 12 (8.5%) | |
| Ethnicity: Other | 23 (14.3%) | 17 (10.5%) | | 22 (15.5%) | 28 (19.7%) | |
| Fatigue Severity Scale | 32.3 (13.4) | 40.3 (12.3) | <0.001 | 31.3 (12.8) | 40.3 (12.3) | <0.001 |
| Perceived stress (% yes) | 38 (23.6%) | 49 (30.3%) | 0.48 | 29 (20.4%) | 38 (26.7%) | 0.16 |
| Smoking (% yes) | 28 (17.4%) | 19 (11.7%) | 0.22 | 32 (22.5%) | 34 (23.9%) | 0.24 |
| Migraine (% yes) | 33 (20.5%) | 43 (26.5%) | 0.25 | 16 (11.3%) | 19 (13.4%) | 0.72 |
| Thyroidic disease (% yes) | 0 (0%) | 37 (22.8%) | <0.001 | 0 (0%) | 10 (0.7%) | <0.01 |
| Hypertension (% yes) | 18 (11.2%) | 81 (50.0%) | <0.001 | 18 (12.7%) | 83 (58.5%) | <0.001 |
| Diagnosed depression (% yes) | 73 (45.3%) | 100 (61.7%) | <0.01 | 45 (31.7%) | 69 (48.6%) | <0.01 |
| HbA1c (%) | - | 7.14 (1.8) | - | - | 7.20 (1.7) | - |
| Diabetes treatment (% yes) | - | 126 (77.8%) | - | - | 114 (80.3%) | - |
| Diabetes duration (year) | - | 8.9 (7.3) | - | - | 9.1 (7.6) | - |

The table presents clinical data describing the overall population of the study. Categorical data are represented by total numbers and percentages, with the calculated p-values derived from chi-square tests. Continuous data are represented by mean and standard deviation, with p-values calculated using the Student’s t-test.

Algorithms’ performances

In both males and females, MLP classifiers trained with BYOL-S/CvT embeddings significantly outperformed those trained solely on OpenSMILE features (Table 2).

Table 2. Results of the prediction models.

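| Group | Features | Dimensionality reduction | Classifier | Accuracy | Specificity | Sensitivity | AUC |
|---|---|---|---|---|---|---|---|
| Female | Opensmile ComParE 2016 (6373) | 200 selected features | LR | 0.60 (0.03) | 0.60 (0.03) | 0.62 (0.07) | 0.62 (0.02) |
| Female | Opensmile ComParE 2016 (6373) | 200 selected features | MLP Classifier | 0.63 (0.02) | 0.61 (0.02) | 0.74 (0.02) | 0.66 (0.02) |
| Female | Opensmile ComParE 2016 (6373) | 200 selected features | SVM RBF | 0.57 (0.02) | 0.57 (0.02) | 0.63 (0.03) | 0.61 (0.01) |
| Female | Byol-S embeddings (2048) | PCA, n_components = n_samples | LR | 0.67 (0.04) | 0.68 (0.04) | 0.65 (0.11) | 0.70 (0.06) |
| **Female** | **Byol-S embeddings (2048)** | **PCA, n_components = n_samples** | **MLP Classifier** | **0.67 (0.04)** | **0.66 (0.04)** | **0.67 (0.11)** | **0.71 (0.07)** |
| Female | Byol-S embeddings (2048) | PCA, n_components = n_samples | SVM RBF | 0.66 (0.04) | 0.65 (0.07) | 0.67 (0.11) | 0.71 (0.05) |
| Male | Opensmile ComParE 2016 (6373) | 100 selected features | LR | 0.56 (0.02) | 0.55 (0.01) | 0.58 (0.05) | 0.61 (0.05) |
| Male | Opensmile ComParE 2016 (6373) | 100 selected features | MLP Classifier | 0.61 (0.05) | 0.61 (0.06) | 0.63 (0.06) | 0.64 (0.05) |
| Male | Opensmile ComParE 2016 (6373) | 100 selected features | SVM RBF | 0.57 (0.05) | 0.57 (0.06) | 0.54 (0.05) | 0.57 (0.05) |
| Male | Byol-S embeddings (2048) | PCA, n_components = 100 | LR | 0.69 (0.04) | 0.66 (0.07) | 0.72 (0.03) | 0.73 (0.06) |
| **Male** | **Byol-S embeddings (2048)** | **PCA, n_components = 100** | **MLP Classifier** | **0.71 (0.02)** | **0.70 (0.02)** | **0.73 (0.03)** | **0.75 (0.05)** |
| Male | Byol-S embeddings (2048) | PCA, n_components = 100 | SVM RBF | 0.70 (0.04) | 0.64 (0.05) | 0.76 (0.03) | 0.78 (0.05) |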

Table 2 presents the mean and standard deviation (in parentheses) of the performance metrics across cross-validation folds. The selected algorithm for each gender group is highlighted in bold. Logistic Regression (LR), Multi-layer Perceptron (MLP), Support Vector Machine Radial basis function kernel (SVM RBF).

For the prediction of T2D in females, the classifier achieved a sensitivity of 0.67±0.11, specificity of 0.66±0.04, an AUC of 0.71±0.07 and a Brier score of 0.31. For the prediction of T2D in males, the reported performance metrics were a sensitivity of 0.73±0.03, specificity of 0.70±0.02, an AUC of 0.75±0.05 and a Brier score of 0.22 (Fig 2). The predicted probability of having T2D is then used for the sensitivity analysis with ADA risk score.
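Two of the reported metrics, the Brier score and the AUC, are computed from predicted probabilities rather than from hard class labels. The sketch below shows one straightforward way to compute each; the helper names and the toy predictions are hypothetical, and the AUC is calculated through its pairwise-ranking interpretation rather than by tracing the ROC curve.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative;
    ties count as one half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of T2D for six participants.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.5, 0.3, 0.1]
brier = brier_score(y_true, y_prob)
area = auc(y_true, y_prob)
print(round(brier, 3), round(area, 3))
```

Lower Brier scores indicate better-calibrated and more accurate probabilities (0 is perfect), while the AUC measures ranking quality only and is insensitive to calibration.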

Fig 2. Voice-based T2D status detection algorithms’ overall performance.


A: Predicted probability distribution by T2D status. B: Confusion matrix of the selected models. C: AUC-ROC curve of the selected models.

Performance stratification

The specificity and sensitivity metrics showed variability across various subgroups.

When stratifying by key demographics, notable differences were observed for females across age categories: females aged 60 and above exhibited higher specificity (0.74±0.12), sensitivity (0.74±0.07), and AUC (0.74±0.07) than females aged below 60 (specificity and sensitivity 0.65±0.04; AUC 0.65±0.03) (Table 3).

Table 3. Performance stratification of voice-based T2D status detection algorithms.

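| Category | Factor | Level | Females: Specificity | Females: Sensitivity | Females: AUC | Males: Specificity | Males: Sensitivity | Males: AUC |
|---|---|---|---|---|---|---|---|---|
| Demographics | Age | < 60 y | 0.65 (0.04) | 0.65 (0.04) | 0.65 (0.03) | 0.70 (0.04) | 0.74 (0.04) | 0.72 (0.03) |
| Demographics | Age | ≥ 60 y | 0.74 (0.12) | 0.74 (0.07) | 0.74 (0.07) | 0.70 (0.11) | 0.70 (0.10) | 0.70 (0.07) |
| Demographics | Body Mass Index | < 25 kg/m2 | 0.68 (0.06) | 0.58 (0.12) | 0.63 (0.07) | 0.70 (0.06) | 0.78 (0.09) | 0.74 (0.05) |
| Demographics | Body Mass Index | ≥ 25 kg/m2 | 0.65 (0.05) | 0.68 (0.04) | 0.67 (0.03) | 0.69 (0.05) | 0.72 (0.04) | 0.71 (0.03) |
| Comorbidities | Hypertension | Present | 0.76 (0.11) | 0.75 (0.05) | 0.75 (0.06) | 0.72 (0.11) | 0.76 (0.05) | 0.74 (0.06) |
| Comorbidities | Hypertension | Absent | 0.65 (0.04) | 0.61 (0.05) | 0.63 (0.03) | 0.69 (0.04) | 0.70 (0.05) | 0.70 (0.03) |
| Comorbidities | Migraine | Present | 0.86 (0.07) | 0.75 (0.07) | 0.80 (0.05) | 0.67 (0.12) | 0.71 (0.11) | 0.69 (0.09) |
| Comorbidities | Migraine | Absent | 0.62 (0.04) | 0.65 (0.04) | 0.65 (0.04) | 0.70 (0.04) | 0.74 (0.04) | 0.72 (0.03) |
| Lifestyle factors and symptoms | Smoking | Present | 0.60 (0.09) | 0.53 (0.12) | 0.57 (0.07) | 0.74 (0.09) | 0.76 (0.07) | 0.75 (0.06) |
| Lifestyle factors and symptoms | Smoking | Absent | 0.67 (0.04) | 0.69 (0.04) | 0.68 (0.03) | 0.69 (0.04) | 0.72 (0.04) | 0.71 (0.03) |
| Lifestyle factors and symptoms | Depressive symptoms | Severe | 0.75 (0.05) | 0.71 (0.05) | 0.73 (0.03) | 0.71 (0.07) | 0.71 (0.06) | 0.71 (0.04) |
| Lifestyle factors and symptoms | Depressive symptoms | Mild | 0.58 (0.05) | 0.61 (0.06) | 0.60 (0.04) | 0.69 (0.05) | 0.75 (0.05) | 0.72 (0.03) |
| Lifestyle factors and symptoms | Stress | Present | 0.76 (0.07) | 0.62 (0.07) | 0.69 (0.05) | 0.69 (0.09) | 0.77 (0.07) | 0.72 (0.06) |
| Lifestyle factors and symptoms | Stress | Absent | 0.63 (0.04) | 0.70 (0.04) | 0.66 (0.03) | 0.70 (0.04) | 0.72 (0.04) | 0.71 (0.03) |
| Lifestyle factors and symptoms | Fatigue | Severe | 0.68 (0.06) | 0.68 (0.05) | 0.68 (0.04) | 0.71 (0.06) | 0.73 (0.05) | 0.72 (0.04) |
| Lifestyle factors and symptoms | Fatigue | Mild | 0.65 (0.05) | 0.66 (0.06) | 0.65 (0.04) | 0.69 (0.05) | 0.73 (0.06) | 0.71 (0.04) |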

This table provides an overview of various metrics, differentiated by gender across different demographic factors, comorbidities, and lifestyle factors. The statistical significance of performance differences between categories was evaluated using the Mann-Whitney U test, with all results being statistically significant (p < 0.001).

Conversely, no noticeable disparities were observed among males.

When considering comorbidities, hypertension emerged as a significant enhancer of the algorithm’s performance in both genders. The presence of hypertension enhanced the sensitivity (0.75±0.05 for females and 0.76±0.05 for males), highlighting the algorithm’s efficiency in detecting T2D in individuals with hypertension. On the other hand, for females, migraine considerably increased specificity to 0.86±0.07 and sensitivity to 0.75±0.07, while for males with migraine, both specificity (0.67±0.12) and sensitivity (0.71±0.11) were lower. This suggests that migraine has a more pronounced impact on the accuracy of T2D detection in women than in men.

Lifestyle factors and symptoms also influence performance. The presence of depressive symptoms significantly impacts the algorithm’s performance in women, increasing both specificity (0.75±0.05) and sensitivity (0.71±0.05). Conversely, for men, the impact of depressive symptoms is less prominent, with a slight decrease in sensitivity (from 0.75±0.05 to 0.71±0.06) yet a stable specificity of 0.71±0.07. This demonstrates enhanced accuracy in detecting T2D in women with depression. Smoking and stress revealed gender-specific impacts: smoking led to higher sensitivity in males (0.76±0.07) but lower sensitivity in females (0.53±0.12). Similarly, stress resulted in increased sensitivity for men (0.77±0.07) but decreased sensitivity for women (0.62±0.07). Fatigue showed a uniform impact on specificity in both genders, with a slight increase in sensitivity in females with severe fatigue (0.68±0.05), whereas sensitivity in males remained stable (0.73±0.05).

Overall, the data indicates that the algorithm’s specificity and sensitivity are influenced by demographic factors, comorbidities, and lifestyle factors, with notable differences observed between genders. These findings underscore the importance of considering these variables in the development and refinement of diagnostic tools, ensuring more accurate and gender-specific healthcare strategies in managing and diagnosing T2D.

Agreement with ADA risk score

In the Bland-Altman analysis, the mean difference indicates the average bias between the algorithm’s scores and the ADA risk scores. This analysis indicates that the algorithm has a mean difference of 0.57 for females and -0.15 for males compared to the ADA risk score, with over 93% agreement within acceptable limits for both genders, showing consistent agreement across genders (see S1 Fig).

Furthermore, we calculated the AUC for the ADA score and found comparable results to the voice-based algorithm’s performance: AUC for the ADA risk score was 0.72 for females and 0.71 for males, compared to the algorithm’s AUC of 0.71 (0.07) for females and 0.75 (0.05) for males. These findings indicate that our voice-based algorithm performs similarly to the established ADA risk score, further supporting its potential as a reliable screening tool for T2D.

Discussion

In this study, using a large sample from the USA population, we developed voice-based algorithms to detect T2D status. Our goal was to explore the possibility of using a rapid, user-friendly voice recording as a T2D status predictor. We observed that the performance of the predictive algorithms was maximal when trained using the hybrid BYOL-S/CvT embeddings, achieving AUC scores of 0.75 and 0.71 for the male and female groups, respectively. Besides demonstrating overall fair to good performances, we also examined the influence of cofactors on voice-based T2D status prediction, which allowed us to identify key subgroups of the population with enhanced performances. In a sensitivity analysis, we have confirmed a strong agreement with the currently used questionnaire-based ADA risk score, a gold standard for T2D risk assessment in the USA.

Undiagnosed T2D or delayed diagnosis can accelerate the occurrence of serious diabetes-related complications, including cardiovascular diseases, neuropathy, retinopathy, and nephropathy [25]. One potential under-investigated effect of T2D is its impact on voice, which may be due to the disease’s influence on respiratory and neuromuscular functions [19,20,26]. It was already shown that pulmonary function is reduced in people with T2D compared to those with no diabetes [27]. For speech production, an individual needs a sufficient air intake, which then travels through the trachea and larynx, causing vocal fold vibrations. Articulating these vibrations into speech requires various small muscles in the neck and throat, connected by a large nerve network. Diabetes is commonly linked to peripheral neuropathy, but it can also impact other systems [28]. This includes potential nerve damage in the throat and neck region, which is vital for speech production. Research has suggested that diabetes can lead to voice changes, especially in those with poor glucose control, causing symptoms like hoarseness and strain [18,28]. These patients often have reduced maximum phonation times, indicating neuromuscular and respiratory alterations [18,20,28]. Building upon this, our study, with its larger sample size, offered a comprehensive exploration of the vocal and physiological complications associated with T2D. By assessing cofactors, we also highlighted how they influence voice patterns, providing valuable insights for future diagnostic strategies.

Key demographic indicators, mainly age, were central in T2D status prediction using voice, especially for women. This aligns with existing research that emphasizes the importance of this variable as a critical determinant of diabetes risk [29,30]. We observed that older females (≥60) exhibited higher specificity, sensitivity, and AUC compared to younger ones (<60), but no difference was observed in males. An adult woman’s average fundamental frequency range is 165 to 255 Hz, while a man’s is 85 to 155 Hz [31]. In females, hormonal changes related to menopause can affect vocal cords and larynx and, consequently, cause a drop in the fundamental frequency of the voice [32]. These hormonal variations may interact with the metabolic disruptions caused by diabetes, leading to observable changes in voice pitch. On the other hand, males, not subject to the same degree of hormonal fluctuations, may exhibit less noticeable alterations in fundamental frequency.

Additionally, hypertension emerged as a key influencer in T2D status detection using voice, improving the predictive performance for both genders by up to 6%. While hypertension is known to be associated with diabetes development [33], it is not commonly incorporated into standard T2D risk assessment tools and its correlation with voice changes remains relatively unexplored [33,34].

Another distinguishing feature of our study is the gender-specific analysis of voice-based algorithms. We found that while certain determinants of T2D status were consistently influential across genders, others displayed gender-specific variations. Discrepancies observed in the impact of conditions such as migraine on the voice-based T2D status detection algorithm’s performance might be traced back to inherent gender-based physiological differences. Women are also more likely to experience migraine than men, with more frequent and severe attacks [35]. Moreover, migraine and diabetes have already been shown to be associated in women [36], and our study confirms that this association can be captured by changes in female voices [37]. The varying impact of smoking on the algorithm’s performance between genders may reflect gender-specific vocal changes caused by smoking [37,38]. Depression affects voices differently in men and women, and depression has been linked to a higher risk of diabetes in women, but not in men [39,40]. This gender-specific association might explain the observed disparities in voice changes. The physiological and psychological stresses associated with depression may induce subtle voice changes that vary between genders, potentially due to hormonal or neurological differences. This variation might be more pronounced in women due to the combined impact of hormonal disruptions related to both diabetes and depression. Stress and fatigue, both of which can affect voice quality [41,42], seem to influence the algorithm’s performance in a gender-specific manner. These factors, known to play roles in glucose metabolism and insulin resistance [43], likely contribute to the voice patterns identified by the algorithm as indicative of T2D risk.

Such a sensitivity analysis is rarely performed in the field of vocal or digital biomarkers, as authors frequently report overall performances only. Our approach underscores that integrating the analysis of the influence of key demographic and health parameters is essential before developing any reliable voice-based screening tool. This helps to understand the potential physiological influence of these factors on either voice features or the health outcome of interest. Identifying the key sub-groups in the population is crucial to determining where the performance of these tools could be optimal.
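
The kind of subgroup sensitivity analysis described above can be sketched as follows. This is a minimal illustration, not the study's code: the record fields (`t2d`, `score`, `hypertension`), the helper names, and the toy values are all hypothetical. The AUC is computed from its pairwise (Mann-Whitney) definition so that the example needs only the Python standard library.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pairwise definition.

    Over all (positive, negative) pairs, counts how often the positive
    case receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_subgroup(records, key):
    """Group records by `key` and compute the AUC within each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {
        g: auc([r["t2d"] for r in rows], [r["score"] for r in rows])
        for g, rows in groups.items()
    }


# Toy, made-up records (not study data), only to show the call pattern.
records = [
    {"t2d": 1, "score": 0.9, "hypertension": True},
    {"t2d": 0, "score": 0.2, "hypertension": True},
    {"t2d": 1, "score": 0.6, "hypertension": True},
    {"t2d": 0, "score": 0.7, "hypertension": True},
    {"t2d": 1, "score": 0.4, "hypertension": False},
    {"t2d": 0, "score": 0.5, "hypertension": False},
    {"t2d": 1, "score": 0.8, "hypertension": False},
    {"t2d": 0, "score": 0.6, "hypertension": False},
]
subgroup_auc = auc_by_subgroup(records, "hypertension")
print(subgroup_auc)
```

Reporting the per-subgroup value next to the overall one makes it visible where a model performs better or worse, which a single pooled AUC would hide.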

Strengths and limitations

This work has several strengths. First, we used a sample of USA participants with standardized voice recordings, collected in an ecological, real-life setting, that is more comprehensive than existing datasets. Additionally, we performed the analysis separately, stratified for males and females, to account for major gender differences in voice characteristics and to mitigate gender bias. Voice features can vary significantly between males and females due to physiological and hormonal differences, which can affect the accuracy and performance of the algorithm if not accounted for. By developing separate models for each gender, we were able to fine-tune the algorithms for the specific characteristics of males and females, improving overall predictive performance and ensuring fairness and generalizability.

Beyond showing good overall performance, we performed additional analyses to identify important subgroups where the voice-based algorithms would perform even better. Our comparative analysis of cofactors emphasized the complex nature of T2D and its interaction with voice characteristics, providing some level of interpretability and explainability to the algorithms. Importantly, we were able to benchmark the voice-based algorithms against an existing screening strategy in the USA, and we demonstrated a strong agreement with the ADA risk score. This concordance reinforces the potential use of voice-based analysis as a viable first-line screening tool for T2D.
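
For readers unfamiliar with Bland-Altman analysis, the computation behind such an agreement check can be sketched as below. This is an illustrative sketch under stated assumptions, not the study's implementation: the function name and the toy values are hypothetical, and the only detail taken from the paper is the scaling of the predicted probability by a factor of 7 (S1 Fig note). The share of paired differences inside the limits of agreement is one common summary; whether it corresponds exactly to the agreement figure reported in the study is an assumption.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return the bias and 95% limits of agreement for paired measurements."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired sequences of length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {"bias": bias, "lower": lower, "upper": upper, "share_inside": inside}


SCALE = 7  # scaling factor stated in the S1 Fig note

# Made-up values, not study data.
predicted_probability = [0.2, 0.5, 0.8, 0.4, 0.9]
ada_score = [2, 3, 6, 3, 6]

scaled = [p * SCALE for p in predicted_probability]
result = bland_altman(scaled, ada_score)
print(result)
```

A bias close to zero with narrow limits indicates that the two measures can be used interchangeably on the chosen scale. A Bland-Altman plot shows the same differences against the pairwise means.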

There is also scope for further refinement before such algorithms can be considered ready for implementation as a screening tool, and several limitations of our study have to be acknowledged. First, due to data constraints in the ADA score calculation, two missing parameters, namely physical activity and family history of diabetes, were assigned a value of zero for all participants by default. While this approach reduces the variability of the ADA scores, it also creates a potential for misclassification. However, the impact of this limitation is likely modest, since the ADA score is primarily driven by age and BMI, which are available in our study. Even though the two measures represent different constructs, we still observed a strong agreement between the voice-based algorithms and the ADA risk score. Another limitation is that our study relied on a sample of English speakers only, with diverse T2D durations. To robustly establish the performance of a future screening tool in predicting T2D, a larger and more diverse dataset is needed, specifically targeting early-stage T2D and prediabetes cases. Additionally, conducting longitudinal studies will help to better understand how changes in voice characteristics correlate with the development and progression of T2D. This approach will provide insights into the main clinical diabetes-related parameters, such as glycemic control and diabetes-related complications, and help establish causal relationships. Furthermore, it is important to generalize this research across different populations, with diverse backgrounds and languages. Expanding datasets will allow a deeper examination of nuanced factors, comorbidities, and their interactions affecting voice-based screening tools in predicting T2D.
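
The effect of setting missing ADA items to zero can be made concrete with a small sketch. The point values below follow the publicly available ADA Diabetes Risk Test as we understand it (age bands, sex, gestational diabetes, family history, high blood pressure, physical inactivity, and BMI category). They are an assumption and should be checked against the official questionnaire. The function name and the example participant are hypothetical and do not come from the study.

```python
def ada_risk_score(age, male, bmi, hypertension,
                   gestational_diabetes=False,
                   family_history=None,
                   physically_inactive=None):
    """Approximate ADA-style risk score (point values are an assumption).

    Items passed as None are treated as missing and contribute 0 points,
    mirroring the default described in the text.
    """
    if age < 40:
        age_pts = 0
    elif age < 50:
        age_pts = 1
    elif age < 60:
        age_pts = 2
    else:
        age_pts = 3

    if bmi < 25:
        bmi_pts = 0
    elif bmi < 30:
        bmi_pts = 1
    elif bmi < 40:
        bmi_pts = 2
    else:
        bmi_pts = 3

    return (age_pts + bmi_pts
            + int(bool(male))
            + int(bool(hypertension))
            + int(bool(gestational_diabetes))
            + int(bool(family_history))        # None -> 0 points
            + int(bool(physically_inactive)))  # None -> 0 points


# Hypothetical participant: 62-year-old man, BMI 31, with hypertension.
score_missing = ada_risk_score(62, True, 31.0, True)
score_complete = ada_risk_score(62, True, 31.0, True,
                                family_history=True,
                                physically_inactive=True)
print(score_missing, score_complete)
```

Under these assumed point values, the two zero-filled items can lower a participant's score by at most two points, which is why the score remains dominated by age and BMI.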

Conclusion and perspectives

This work demonstrates the potential of using voice analysis in a diabetes context. A voice recording could soon be used as a scalable, non-invasive first-line diabetes screening strategy. Future research should focus on targeting individuals with early-stage T2D and prediabetes and on extending our findings to other populations in prospective studies. Given the high societal costs of undiagnosed diabetes in the USA, our findings open new perspectives to improve secondary prevention, reduce the impact of diabetes, and prevent severe complications and premature diabetes-related mortality.

Supporting information

S1 Fig

Bland-Altman plot showing the agreement between the voice-based algorithms’ predicted probability and the ADA risk score for both gender groups (A: Female group, B: Male group). Note: The predicted probability was scaled by a factor of 7 for harmonization.

(TIF)


Acknowledgments

We would like to thank all participants who contributed to the Colive Voice study, as well as our partners for their help in recruiting new participants. Special thanks go to Aurélie Fischer, Philippe Kayser, Luigi De Giovanni, Michael Schnell, and Aurore Dobosz for their substantial contribution to the Colive Voice study.

Data Availability

Audio data and source code used in this study are publicly available in a GitHub repository: https://github.com/LIHVOICE/Voice-and-diabetes-VOCADIAB.

Funding Statement

The Colive Voice study is funded by the Luxembourg Institute of Health. The French-speaking Diabetes Society, the Luxembourg Diabetes Society and the Luxembourg Diabetes Association further supported this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Sun H, Saeedi P, Karuranga S, Pinkepank M, Ogurtsova K, Duncan BB, et al. IDF Diabetes Atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Res Clin Pract. 2022;183: 109119. doi: 10.1016/j.diabres.2021.109119
  • 2. National Diabetes Statistics Report 2022. [cited 4 Sep 2023]. Available: https://repository.gheli.harvard.edu/repository/11854/.
  • 3. Ogurtsova K, Guariguata L, Barengo NC, Ruiz PL-D, Sacre JW, Karuranga S, et al. IDF diabetes Atlas: Global estimates of undiagnosed diabetes in adults for 2021. Diabetes Res Clin Pract. 2022;183: 109118. doi: 10.1016/j.diabres.2021.109118
  • 4. Wild SH, Smith FB, Lee AJ, Fowkes FG. Criteria for previously undiagnosed diabetes and risk of mortality: 15-year follow-up of the Edinburgh Artery Study cohort. Diabet Med. 2005;22. doi: 10.1111/j.1464-5491.2004.01433.x
  • 5. Standards of medical care for patients with diabetes mellitus. Diabetes Care. 2003;26 Suppl 1. doi: 10.2337/diacare.26.2007.s33
  • 6. Dall TM, Yang W, Gillespie K, Mocarski M, Byrne E, Cintina I, et al. The Economic Burden of Elevated Blood Glucose Levels in 2017: Diagnosed and Undiagnosed Diabetes, Gestational Diabetes Mellitus, and Prediabetes. Diabetes Care. 2019;42: 1661. doi: 10.2337/dc18-1226
  • 7. Zhou B, Lu Y, Hajifathalian K, Bentham J, Di Cesare M, Danaei G, et al. Worldwide trends in diabetes since 1980: a pooled analysis of 751 population-based studies with 4·4 million participants. Lancet. 2016;387: 1513–1530.
  • 8. Lim HM, Chia YC, Koay ZL. Performance of the Finnish Diabetes Risk Score (FINDRISC) and Modified Asian FINDRISC (ModAsian FINDRISC) for screening of undiagnosed type 2 diabetes mellitus and dysglycaemia in primary care. Prim Care Diabetes. 2020;14. doi: 10.1016/j.pcd.2020.02.008
  • 9. Lindström J, Tuomilehto J. The diabetes risk score: a practical tool to predict type 2 diabetes risk. Diabetes Care. 2003;26. doi: 10.2337/diacare.26.3.725
  • 10. Zhang B, Vijaya kumar BV, Zhang D. Noninvasive Diabetes Mellitus Detection Using Facial Block Color With a Sparse Representation Classifier. [cited 4 Sep 2023]. Available: https://ieeexplore.ieee.org/document/6675828.
  • 11. IEEE Xplore—Temporarily Unavailable. [cited 4 Sep 2023]. Available: https://ieeexplore.ieee.org/document/6675828.
  • 12. Saleh SAK, Fatani SH, Adly HM, Abdulkhaliq AA, Al-Amodi HS. Variations of Hair Trace Element Contents in Diabetic Females. Journal of Biosciences and Medicines. 2017;5: 49–56.
  • 13. Jaime Miranda J, Taype-Rondan A, Tapia JC, Gastanadui-Gonzalez MG, Roman-Carpio R. Hair follicle characteristics as early marker of type 2 diabetes. Med Hypotheses. 2016;95: 39. doi: 10.1016/j.mehy.2016.08.009
  • 14. Diabetic Retinopathy: Present and Past. Procedia Comput Sci. 2018;132: 1432–1440.
  • 15. Investigating voice as a biomarker: Deep phenotyping methods for early detection of Parkinson’s disease. J Biomed Inform. 2020;104: 103362. doi: 10.1016/j.jbi.2019.103362
  • 16. Fagherazzi G, Zhang L, Elbéji A, Higa E, Despotovic V, Ollert M, et al. A voice-based biomarker for monitoring symptom resolution in adults with COVID-19: Findings from the prospective Predi-COVID cohort study. PLOS Digital Health. 2022;1: e0000112. doi: 10.1371/journal.pdig.0000112
  • 17. Gölaç H, Atalik G, Türkcan AK, Yilmaz M. Disease related changes in vocal parameters of patients with type 2 diabetes mellitus. Logoped Phoniatr Vocol. 2022;47. doi: 10.1080/14015439.2021.1917653
  • 18. Hamdan AL, Jabbour J, Nassar J, Dahouk I, Azar ST. Vocal characteristics in patients with type 2 diabetes mellitus. Eur Arch Otorhinolaryngol. 2012;269. doi: 10.1007/s00405-012-1933-7
  • 19. Hamdan AL, Kurban Z, Azar ST. Prevalence of phonatory symptoms in patients with type 2 diabetes mellitus. Acta Diabetol. 2013;50. doi: 10.1007/s00592-012-0392-3
  • 20. Instrumental Acoustic Voice Characteristics in Adults with Type 2 Diabetes. J Voice. 2021;35: 116–121. doi: 10.1016/j.jvoice.2019.07.003
  • 21. Stogowska E, Kamiński KA, Ziółko B, Kowalska I. Voice changes in reproductive disorders, thyroid disorders and diabetes: a review. Endocrine Connections. 2022;11. doi: 10.1530/EC-21-0505
  • 22. Eyben F, Wöllmer M, Schuller B. Opensmile. [cited 4 Sep 2023]. doi: 10.1145/1873951.1874246
  • 23. Elbanna G, Biryukov A, Scheidwasser-Clow N, Orlandic L, Mainar P, Kegler M, et al. Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. ArXiv. 2022. Available: https://arxiv.org/pdf/2203.16637.pdf.
  • 24. Krupp LB, LaRocca NG, Muir-Nash J, Steinberg AD. The fatigue severity scale. Application to patients with multiple sclerosis and systemic lupus erythematosus. Arch Neurol. 1989;46. doi: 10.1001/archneur.1989.00520460115022
  • 25. Wu Y, Ding Y, Tanaka Y, Zhang W. Risk Factors Contributing to Type 2 Diabetes and Recent Advances in the Treatment and Prevention. Int J Med Sci. 2014;11: 1185. doi: 10.7150/ijms.10001
  • 26. Blood Glucose Estimation From Voice: First Review of Successes and Challenges. J Voice. 2022;36: 737.e1–737.e10. doi: 10.1016/j.jvoice.2020.08.034
  • 27. Davis TME, Drinkwater JJ, Davis WA. Pulmonary Function Trajectories Over 6 Years and Their Determinants in Type 2 Diabetes: The Fremantle Diabetes Study Phase II. Diabetes Care. 2024. [cited 17 Jan 2024]. doi: 10.2337/dc23-1726
  • 28. Patel K, Horak H, Tiryaki E. Diabetic neuropathies. Muscle Nerve. 2021;63: 22–30. doi: 10.1002/mus.27014
  • 29. Ganz ML, Wintfeld N, Li Q, Alas V, Langer J, Hammer M. The association of body mass index with the risk of type 2 diabetes: a case–control study nested in an electronic health records system in the United States. Diabetol Metab Syndr. 2014;6: 50. doi: 10.1186/1758-5996-6-50
  • 30. Yan Z, Cai M, Han X, Chen Q, Lu H. The Interaction Between Age and Risk Factors for Diabetes and Prediabetes: A Community-Based Cross-Sectional Study. Diabetes Metab Syndr Obes. 2023;16: 85. doi: 10.2147/DMSO.S390857
  • 31. Fitch JL, Holbrook A. Modal vocal fundamental frequency of young adults. Arch Otolaryngol. 1970;92. doi: 10.1001/archotol.1970.04310040067012
  • 32. Lã FMB, Ardura D. What Voice-Related Metrics Change With Menopause? A Systematic Review and Meta-Analysis Study. J Voice. 2022;36. doi: 10.1016/j.jvoice.2020.06.012
  • 33. Kim M-J, Lim N-K, Choi S-J, Park H-Y. Hypertension is an independent risk factor for type 2 diabetes: the Korean genome and epidemiology study. Hypertens Res. 2015;38: 783–789. doi: 10.1038/hr.2015.72
  • 34. Sakai M. Case study on analysis of vocal frequency to estimate blood pressure. [cited 4 Sep 2023]. Available: https://ieeexplore.ieee.org/document/7257173.
  • 35. Allais G, Chiarle G, Sinigaglia S, Airola G, Schiapparelli P, Benedetto C. Gender-related differences in migraine. Neurol Sci. 2020;41: 429. doi: 10.1007/s10072-020-04643-8
  • 36. Fagherazzi G, El Fatouhi D, Fournier A, Gusto G, Mancini FR, Balkau B, et al. Associations Between Migraine and Type 2 Diabetes in Women: Findings From the E3N Cohort Study. JAMA Neurol. 2019;76. doi: 10.1001/jamaneurol.2018.3960
  • 37. Schwedt TJ, Peplinski J, Garcia-Filion P, Berisha V. Altered speech with migraine attacks: A prospective, longitudinal study of episodic migraine without aura. Cephalalgia. 2019;39: 722. doi: 10.1177/0333102418815505
  • 38. Towards the Objective Speech Assessment of Smoking Status based on Voice Features: A Review of the Literature. J Voice. 2023;37: 300.e11–300.e20. doi: 10.1016/j.jvoice.2020.12.014
  • 39. Pan A, Lucas M, Sun Q, van Dam RM, Franco OH, Manson JE, et al. Bidirectional Association between Depression and Type 2 Diabetes in Women. Arch Intern Med. 2010;170: 1884.
  • 40. Demmer RT, Gelb S, Suglia SF, Keyes KM, Aiello AE, Colombo PC, et al. Sex Differences in the Association between Depression, Anxiety, and Type 2 Diabetes Mellitus. Psychosom Med. 2015;77: 467. doi: 10.1097/PSY.0000000000000169
  • 41. Greeley HP, Berg J, Friets E, Wilson J, Greenough G, Picone J, et al. Fatigue estimation using voice analysis. Behav Res Methods. 2007;39: 610–619. doi: 10.3758/bf03193033
  • 42. Van Puyvelde M, Neyt X, McGlone F, Pattyn N. Voice Stress Analysis: A New Framework for Voice and Effort in Human Performance. Front Psychol. 2018;9: 414457. doi: 10.3389/fpsyg.2018.01994
  • 43. Yaribeygi H, Maleki M, Butler AE, Jamialahmadi T, Sahebkar A. Molecular mechanisms linking stress and insulin resistance. EXCLI J. 2022;21: 317. doi: 10.17179/excli2021-4382
PLOS Digit Health. doi: 10.1371/journal.pdig.0000679.r001

Decision Letter 0

Henry Horng-Shing Lu, Ludwig Christian Giuseppe Hinske

5 Jul 2024

PDIG-D-24-00050

A voice-based algorithm can predict type 2 diabetes status in USA adults: Findings from the Colive Voice study

PLOS Digital Health

Dear Dr. Fagherazzi,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Sep 03 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Ludwig Christian Giuseppe Hinske, M.D.

Academic Editor

PLOS Digital Health

Journal Requirements:

Additional Editor Comments (if provided):

Dear authors,

Thank you for submitting your work to PLOS Digital Health. As you can see from the attached reviews, all reviewers found the manuscript interesting. However, major issues were identified, some of which could impact the value for the scientific community. Many of the comments address definition criteria as well as the methodological approach, especially with respect to sample size and the risk of massive overfitting.

We would like to encourage you to work on these aspects in a revised version of your manuscript.

Sincerely,

L. C. Hinske


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: No

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The study explores using voice-based algorithms to predict T2D status in US adults, aiming to develop a non-invasive, scalable screening method. The authors analyzed text recordings from 607 Colive Voice study participants and used hybrid BYOL-S/CvT embeddings to create gender-specific algorithms for T2D prediction. The algorithms were evaluated using cross-validation, and their performance was stratified by age, BMI, hypertension, and compared to the ADA score for T2D risk assessment.

Comments

The study did not provide detailed information on the recruitment process and inclusion/exclusion criteria for participants. There may be potential selection bias if certain groups of individuals were more likely to participate in the study.

The study included participants with diverse T2D durations, but did not specifically target early-stage T2D or prediabetes cases.

Due to data constraints, physical activity levels and family history of diabetes were not available and were assigned a default value of zero for all participants. This may introduce less variability in the ADA scores and potential misclassification.

While the study performed additional analyses to identify important subgroups and compared the influence of key demographic and health parameters, the interpretability and explainability of the algorithms could be further improved.

The study did not account for potential confounding factors that may influence voice characteristics, such as smoking, alcohol consumption, or other underlying health conditions.

The study used cross-sectional data, which limits the ability to establish causal relationships between voice characteristics and T2D status.

The study relied on a sample of English speakers only, which may limit the generalizability of the findings to other languages and populations.

The study did not include an external validation dataset to assess the performance of the developed algorithms.

Although the study used a larger sample size compared to previous studies, the sample size may still be insufficient to capture the full spectrum of voice variations associated with T2D.

While the study compared the performance of the voice-based algorithms with the ADA risk score, it did not compare them with other established screening methods, such as fasting blood glucose or HbA1c tests.

Reviewer #2: The manuscript by Abir Elbeji et al developed a novel screening tool for diagnosing type 2 diabetes mellitus by building a voice-based machine-learning algorithm.

Although the data presented is clearly interesting, there are several issues that will require further clarification prior to publication:

1. The authors are advised to clearly describe how type 2 DM is diagnosed and defined. What are the diagnostic criteria for type 2 DM in this study?

2. The authors are requested to calculate the AUC by diagnosing/evaluating participants with ADA risk scores.

3. The authors are also requested to compare the AUC above with that of the developed algorithm.

4. What do the numerical values in the brackets in Table 2 mean? Are they standard deviations or standard errors of the mean?

5. In line 297, the authors wrote, "notable differences were observed for females across...". In addition, they also wrote in line 305 that no noticeable disparities were observed among males. The authors are requested to clarify the difference between "notable difference in women" and "no noticeable disparities among males". The authors are advised to describe how they defined "notable/noticeable difference".

6. In line 315, the authors wrote that the presence of depression significantly influenced the algorithm's performance in women. The authors need to show the evidence of this significance.

7. In line 344, they wrote the AUC score of the algorithm in female T2DM as 0.72. However, I'm afraid that this might be 0.71 (I think this is an innocent mistake/mistyping).

Reviewer #3: The present manuscript highlights an algorithm to detect type 2 diabetes (t2d) based on multidimensional data features derived from voice recordings. I have a few comments/questions for the authors. Thank you for considering my comments.

1. Introduction: The authors say the FINDRISC (Finnish diabetes risk score) has limited detection capabilities (AUC of 76%). However, their algorithm's AUC is similar or lower (75% for males, 71% for females). Would this not be an argument to use a much simpler questionnaire to assess the risk for t2d?

2. Sample size: The overall sample size is N=607, reported in the abstract. However, the authors performed the analyses stratified by gender. The larger group of females is N=323, with 162 events (t2d) and 161 non-events. Thus, the effective sample size for their model is only N=161. This is a rather small study to develop a model based on 200 features after dimensionality reduction. The number of features is higher than the number of events observed; how do the authors mitigate massive overfitting?

3. Methods: The authors state they use TRIPOD reporting guidelines; however, it was not reported whether the study had missing data or not and how missing data was handled if present.

4. Methods: What was the rationale for performing the analysis separately, stratified for males and females? Why was the model not developed on all the data, and why is sex used as one of the prognostic factors along the features?

5. Methods: How was the number of components in the PCA determined?

6. Methods: I was wondering about the performance of a very simple logistic model using sex, age, hypertension, and BMI as diagnostic factors. Would that be feasible as a benchmark?

7. Discussion/Conclusion: The authors suggest the tool as a screening strategy for t2d; however, what is the optimal threshold to be used? For clinical implementation, this would require a decision curve analysis. Maybe the author could discuss this point.

--------------------

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Satoru Tada

Reviewer #3: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000679.r003

Decision Letter 1

Henry Horng-Shing Lu, Ludwig Christian Giuseppe Hinske

23 Oct 2024

A voice-based algorithm can predict type 2 diabetes status in USA adults: Findings from the Colive Voice study

PDIG-D-24-00050R1

Dear Dr. Fagherazzi,

We are pleased to inform you that your manuscript 'A voice-based algorithm can predict type 2 diabetes status in USA adults: Findings from the Colive Voice study' has been provisionally accepted for publication in PLOS Digital Health. We apologize for the long processing time; we were hoping to get feedback from all three reviewers of the initial submission. Thank you for reaching out to us and explaining the situation regarding the time constraints.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Ludwig Christian Giuseppe Hinske, M.D.

Academic Editor

PLOS Digital Health

***********************************************************

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revised manuscript shows significant improvement in terms of clarity, context, and scientific rigor. The additions, particularly the ADA risk score comparison, strengthen the paper's contribution to the field.

Reviewer #2: The authors have addressed my concerns.

I am pleased to inform you that this manuscript has now been accepted for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Satoru Tada, MD, PhD

**********

Associated Data

    Supplementary Materials

    S1 Fig

    Bland-Altman plot showing the agreement between the voice-based algorithms’ predicted probability and the ADA risk score for both gender groups (A: Female group, B: Male group). Note: The predicted probability was scaled by a factor of 7 for harmonization.

    (TIF)

    pdig.0000679.s001.tif (328.7KB, tif)
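
As a rough illustration of the agreement analysis described in the S1 Fig caption, the sketch below rescales predicted probabilities by the factor of 7 stated in the note and computes a Bland-Altman bias with 95% limits of agreement against a risk score. The function name, the input values, and the choice of mean ± 1.96 SD limits are assumptions made for illustration only. The authors' own code is in their repository and may differ.

```python
from statistics import mean, stdev


def bland_altman(probabilities, ada_scores, scale=7.0):
    """Return (bias, lower limit, upper limit, share of points within limits).

    Hypothetical helper: `scale` defaults to the factor of 7 mentioned in the
    S1 Fig note; everything else is an illustrative assumption.
    """
    # Rescale probabilities (0 to 1) so they sit on a range comparable to the score.
    scaled = [p * scale for p in probabilities]
    # Pairwise differences between the two measurements.
    diffs = [s - a for s, a in zip(scaled, ada_scores)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    # Fraction of participants whose difference falls inside the limits.
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, within


# Made-up values for demonstration; these are NOT study data.
probs = [0.2, 0.5, 0.8, 0.4, 0.9, 0.1]
ada = [2, 3, 6, 3, 5, 1]
bias, lower, upper, within = bland_altman(probs, ada)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f}), within={within:.0%}")
```

The share of points inside the limits is one common way to summarise overall agreement in a Bland-Altman analysis; whether it matches the exact agreement figure reported in the article is not confirmed here.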
    Attachment

    Submitted filename: Rebuttal letter_PLOS digital health.pdf

    pdig.0000679.s002.pdf (172.9KB, pdf)

    Data Availability Statement

    The audio data and source code used in this study are publicly available in a GitHub repository: https://github.com/LIHVOICE/Voice-and-diabetes-VOCADIAB


    Articles from PLOS Digital Health are provided here courtesy of PLOS