HemaSphere. 2025 Nov 12;9(11):e70227. doi: 10.1002/hem3.70227

Accurate diagnosis of hemoglobinopathies with machine learning based on high‐throughput proteomics

Shaodong Wei 1,2,^, Annelaura Bach Nielsen 2,3,^, Jens Helby 1,4, Lylia Drici 3, Christine Rasmussen 2, Juanjuan Wang 3, Matthias Mann 3,5, Jesper Petersen 1, Nicolai J Wewer Albrechtsen 2,3,4,6,^, Andreas Glenthøj 1,4,^
PMCID: PMC12606001  PMID: 41235355

Abstract

Hemoglobinopathies, such as sickle cell disease and thalassemias, impose a substantial global burden, particularly in endemic regions. Current diagnostic methods, such as high‐performance liquid chromatography (HPLC), capillary electrophoresis, and genetic testing, can be time‐consuming, expensive, or limited in detecting all variants. This study introduces a novel diagnostic framework that combines high‐throughput proteomics with machine learning to address these challenges. We processed red blood cells, whole blood, and plasma samples from 82 individuals (development cohort) and 45 individuals (validation cohort) with structural hemoglobin variants (hemoglobin S, hemoglobin C, hemoglobin D, and hemoglobin E) or β‐thalassemia trait, as confirmed by standard clinical testing. Tryptic peptides were analyzed using data‐independent acquisition mass spectrometry, and random forest classifiers were trained to identify structural variants or β‐thalassemia trait. Model performance was evaluated across 100 Monte Carlo cross‐validations. For structural variants, the classifier achieved an area under the receiver‐operating characteristic curve (AUC) of 1.000 and 99.9% prediction accuracy in the validation cohort, when comparing our proteomics‐based diagnostics to standard testing with HPLC and Sanger sequencing (gold standard). For β‐thalassemia trait, the mean AUC was 1.000, and the prediction accuracy was 96.9% in the validation cohort, and a single peptide alone yielded 92% accuracy in a simple decision tree. This high‐throughput proteomics approach offers a rapid, scalable, and potentially cost‐effective alternative to existing diagnostic workflows, requiring minimal sample preparation while reducing manual interpretation. By combining peptide‐level data with machine learning, it enables precise classification of hemoglobinopathies and demonstrates a compelling path for routine clinical evaluation of hereditary anemias.



INTRODUCTION

Hemoglobinopathies are among the most prevalent genetic disorders worldwide, with carrier rates exceeding 7%, placing many at risk of having severely affected children. 1 These conditions impose a substantial burden on global health and resources, particularly in endemic regions such as Africa, the Mediterranean, and Asia, as well as in previously non‐endemic countries. 1

Screening strategies include premarital, antenatal, and newborn screening, targeting at‐risk groups or the general population depending on the country's risk level as well as cultural and legal considerations. 2 , 3 , 4 Premarital and antenatal programs have led to a substantial reduction in the number of children born with thalassemia in regions where they have been successfully implemented. 5 , 6 , 7 , 8 , 9 In contrast, sickle cell disease screening is typically conducted after birth, which enables early diagnosis and treatment but does not affect the number of affected births. Clinical severity of hemoglobinopathies ranges from mild, asymptomatic disease to severe anemia requiring life‐long treatment, often complicated by multiple comorbidities. 10 , 11 Early and accurate diagnosis is critical for effective management, screening, and genetic counseling. The diagnostic workup for suspected hemoglobinopathies varies but often begins with standard hematological parameters, including hemoglobin concentration, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and ferritin levels. While these measurements are simple and cheap, they themselves have limited accuracy in detecting clinically relevant hemoglobinopathies. Specialized tests, such as high‐performance liquid chromatography (HPLC) or capillary electrophoresis, are typically required for a reliable diagnosis. Both rely on the differential migration of hemoglobin variants, such as hemoglobin S (HbS) or hemoglobin C (HbC), to identify structural hemoglobinopathies including sickle cell disease, or on the measurement of increased hemoglobin A2 (HbA2) levels for the diagnosis of β‐thalassemia trait. However, not all hemoglobin variants can be detected by these methods, and even in cases of typical pathological findings of hemoglobin variants, verification by an alternative method is recommended. 3 Collectively, these methods are often labor‐intensive, expensive, and time‐consuming.

Previous studies have highlighted the potential of mass spectrometry (MS) techniques for reliably identifying hemoglobinopathies. Using electrospray ionization‐MS, Wild et al. identified about 95% of the hemoglobin variants recognized by conventional tests in 250 samples. 12 Osta et al. demonstrated the effectiveness of MS as a first‐line screening tool for detecting HbS in newborns. 13 Other efforts have used MS for precise HbA2 measurement, enhancing detection of β‐thalassemia trait, 14 and rare hemoglobin variants have also been characterized with MS, highlighting its versatility. 15 Lidong et al. accurately identified hemoglobinopathies in 18 patients using advanced 21‐T Fourier transform ion cyclotron resonance MS. 16 MS‐based proteomics has since transitioned from basic research to clinical applications. 17 , 18

In this study, we combined high‐throughput proteomics with machine learning to develop a novel diagnostic framework for hemoglobinopathies. By leveraging peptide‐level data, our approach enabled the classification of structural hemoglobin variants and β‐thalassemia trait with high accuracy, providing a scalable and clinically relevant solution. Using random forest algorithms and rigorous validation protocols, we investigated its performance across multiple sample types and present a pathway for integration into routine clinical workflows.

MATERIALS AND METHODS

Study design

This study was an observational retrospective cohort study.

Ethical considerations

The study was approved by the Capital Region of Denmark (p‐2023‐15219). Data were stored and handled in accordance with permission from the Danish Data Protection Agency (10122009 HEH‐LHB).

Study population and setting

We collected samples in 2021–2023 from adults (≥18 years) undergoing blood investigations for suspected hemoglobinopathies at the Danish Red Blood Cell Center, Department of Hematology, Copenhagen University Hospital‐Rigshospitalet. Red blood cells (RBCs), whole blood, and plasma for the development cohort were collected between 01‐2023 and 03‐2023. The validation cohort, consisting of RBC samples, was collected between 02‐2021 and 04‐2022. Both cohorts included all consecutive samples from heterozygous carriers, as well as noncarrier samples that were processed in our laboratory during the same periods.

The development cohort included 82 individuals diagnosed with heterozygosity for either β‐thalassemia (β‐trait; n = 18) or a structural hemoglobin variant, specifically HbS (n = 15), hemoglobin E (HbE, n = 10), hemoglobin D (HbD, n = 3), and HbC (n = 4). Additionally, 32 tested individuals without thalassemias or structural hemoglobin variants served as noncarrier controls (Table 1). The validation cohort comprised 45 individuals, including those heterozygous for either β‐thalassemia (β‐trait; n = 18) or a structural variant—HbS (n = 7), HbE (n = 6), HbD (n = 1), and HbC (n = 1)—along with 12 noncarrier controls.

Table 1.

Participant characteristics, cohorts, and sample types.

Characteristic | Development (n = 82) | Validation (n = 45) | P value
Age (year) | 31.3 (16.2) | 33.7 (17.0) | 0.43
Male sex (%) | 26 (31.7) | 12 (26.7) | 0.43
MCV (fL) | 78.8 (10.6) | 74.6 (12.6) | 0.17
Hemoglobin (g/dL) | 12.3 (1.7) | 11.4 (17.5) | 0.02
Plasma iron (µmol/L) | 16.3 (8.05) | 14.6 (6.50) | 0.39
Ferritin (µg/L) | 122 (293) | 172 (347) | 0.43
Hydroxyurea treatment (%) | 0 (100) | 0 (100) | 1.00
Pregnant (%) | 29 (35.4) | 13 (28.9) | <0.001
Sample type: RBC | 82 | 45 |
Sample type: Whole blood | 82 | 0 |
Sample type: Plasma | 82 | 0 |
Genotype: HbC | 4 | 1 |
Genotype: HbD | 3 | 1 |
Genotype: HbE | 10 | 6 |
Genotype: HbS | 15 | 7 |
Genotype: β‐trait | 18 | 18 |
Genotype: Noncarrier | 32 | 12 |

Note: The participants' characteristics are presented with mean (standard deviation) for continuous variables and number (%) for categorical variables. All presented P values are after adjustment for multiple comparisons using the Benjamini–Hochberg method.

Abbreviations: β‐trait, β‐thalassemia trait; HbC, hemoglobin C; HbD, hemoglobin D; HbE, hemoglobin E; HbS, hemoglobin S; MCV, mean corpuscular volume; Noncarrier, individuals not carrying either structural variants or thalassemia genes; RBC, red blood cell.

The two cohorts were kept separate in the subsequent analyses to preserve independence between machine learning model development and validation.

Standard diagnostic testing (gold standard)

All samples underwent first‐line testing (hemoglobin, MCV, MCH, ferritin, and HPLC). Structural variants were identified by HPLC, and β‐thalassemia trait was identified by a high HbA2 fraction in the absence of a structural variant. All findings triggered second‐line genetic testing with Sanger sequencing of HBB as previously described in detail. 3 Structural variants of interest were E7V (HbS), E7K (HbC), E27K (HbE), and E122Q (HbD).

Sample collection and peptide preparation

Blood samples were collected in ethylenediaminetetraacetic acid tubes and subjected to standard diagnostic testing as previously described. 3 For MS analysis, the following fractions were obtained in sequence: (1) whole blood, (2) plasma, isolated by centrifugation at 1500 × g for 10 min, and (3) RBCs, obtained by triple washing and centrifugation at 1500 × g in saline, with the buffy coat and supernatant removed. RBCs, whole blood, and plasma samples were thawed and aliquoted into a 96‐well plate and diluted 1:40 with Lysis buffer (Tris 1 M, tris(2‐carboxyethyl)phosphine 0.5 M, and chloroacetamide 0.5 M in H2O), using an Agilent Bravo Liquid Handling Platform. The diluted samples were incubated at 95°C for 10 min to denature proteins, reduce disulfide bridges, and alkylate cysteines. After cooling for 15 min, a Trypsin/LysC mixture (1–100 µg protein) was added, followed by incubation at 37°C for 4 h. The enzyme reaction was quenched by adding 64 µL of 0.2% trifluoroacetic acid.

Samples were loaded onto Evotips (Evosep Biosystems, Denmark) according to the manufacturer's recommendations. The Evotips were washed once with 20 µL solvent B (99% acetonitrile, 0.1% formic acid [FA]) and centrifuged at 700 ×g for 60 s. Then, the Evotips were wetted with isopropanol for 5 min and equilibrated with 20 µL solvent A (0.1% FA in H2O), followed by sample loading, and washing with 20 µL Solvent A. For storage, 200 µL Solvent A was added to each Evotip and centrifuged at 700 ×g for 10 s to prevent drying.

Liquid chromatography and MS analysis

The samples were injected into an Exploris 480 system (Thermo Fisher Scientific) using an Evosep One instrument (Evosep Biosystems). A preset chromatographic method was used corresponding to 60 samples per day. The peptides were separated on an 8 cm Pepsep (Marslev, Denmark) column (100 μm ID/3 μm bead size Reprosil‐Pur C18 beads) at 1 μL/min flow rate with a 21‐min gradient. The heated capillary temperature was set to 275°C, the spray voltage to 2650 V, and the funnel radiofrequency to 40. The mass spectrometer was operated in data‐independent acquisition (DIA) mode with a full MS scan range of m/z 350–1650 at a resolution of 60,000 at 200 m/z. The AGC target was set to 300% with an injection time (IT) of 50 ms. The AGC value of the targeted MS2 experiment was set to 1000%. Thirty‐two windows of variable sizes were defined for target MS2 (tMS2) acquisition and subjected to high‐energy collisional dissociation (HCD) fragmentation with a normalized collision energy of 30%. Each tMS2 scan was acquired at a resolution of 30,000 with a maximum ion IT of 100 ms for a scan range of m/z 349.5–1650.5.

Data processing

The MS raw files were processed with DIA‐NN (version 1.9), 19 allowing one missed cleavage and N‐terminal methionine excision, with a minimum peptide length of six amino acids. A custom FASTA file was used for the library‐free search, comprising (1) the human reference proteome (downloaded from UniProt in May 2024) 20 and (2) mutation‐derived peptides, resulting either directly from mutations or from mutation‐induced shifts in cleavage sites. The mutated sequences were extracted from Ithanet, 21 which contained all known genetic variants (n = 1400 as of October 2024); tryptic digestion was performed in silico, and unique peptides relevant to our cohort were added to the reference proteome (Figure 1).
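
The in silico digestion step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the cleavage rule (cut after every K or R), the helper names, and the use of only the first 18 residues of HBB are assumptions made for illustration, and the check for uniqueness against the whole human proteome is omitted. Substitutions are numbered with the initiator methionine as position 1, as in E7V (HbS) and E7K (HbC).

```python
import re

# N-terminal fragment of HBB including the initiator methionine (illustration only).
HBB_NTERM = "MVHLTPEEKSAVTALWGK"

def apply_substitution(seq, change):
    """Apply a substitution such as 'E7V' (1-based, counting the initial Met)."""
    ref, pos, alt = change[0], int(change[1:-1]), change[-1]
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]

def tryptic_peptides(seq, missed=1, min_len=6):
    """Cleave after every K or R (simplified rule), allowing `missed` missed cleavages."""
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", seq)
    peptides = set()
    for i in range(len(pieces)):
        for j in range(i, min(i + missed, len(pieces) - 1) + 1):
            pep = "".join(pieces[i:j + 1])
            if len(pep) >= min_len:
                peptides.add(pep)
    return peptides

def digest_with_met_excision(seq):
    """Digest the sequence both with and without its leading methionine."""
    peptides = tryptic_peptides(seq)
    if seq.startswith("M"):
        peptides |= tryptic_peptides(seq[1:])
    return peptides

def mutation_derived(seq, change):
    """Peptides produced by the mutant sequence but not by the wild type."""
    mutant = apply_substitution(seq, change)
    return digest_with_met_excision(mutant) - digest_with_met_excision(seq)

hbs = mutation_derived(HBB_NTERM, "E7V")  # substitution only
hbc = mutation_derived(HBB_NTERM, "E7K")  # substitution creates a new cleavage site
print(sorted(hbs))
print(sorted(hbc))
```

In this toy digest, E7V yields peptides that differ from the wild type by one residue, whereas E7K introduces a lysine and therefore a new cleavage site, which is the "cleavage site shift" case described above.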

Figure 1.


The study design and sample cohorts. (A) The development cohort consists of three sample types: plasma, whole blood, and red blood cells (RBCs). The validation cohort only has RBC samples. All samples were analyzed using high‐throughput bottom‐up proteomics. Machine learning classifiers were developed to identify structural variants and β‐thalassemia. For structural variants, mutation‐derived peptides from the variants were used to build the machine learning classifier. In contrast, for β‐thalassemia trait, only wild‐type peptides from hemoglobin genes were used. (B) Development and validation cohort sizes. Participants were genetically diagnosed as hemoglobin C (HbC), hemoglobin D (HbD), hemoglobin E (HbE), hemoglobin S (HbS), β‐thalassemia trait, or noncarrier by first screening with high‐performance liquid chromatography (HPLC) and then confirming suspected variants with Sanger sequencing (gold standard). (C) The HBB protein sequence and the mutation sites of hemoglobin variants. Mutation‐derived peptides are unique, variant‐specific peptides resulting from mutations when allowing one missed cleavage during tryptic digestion.

The DIA‐NN peptide report underwent further processing using Python. The filtering, imputation, and correction steps described next were done separately for plasma, whole blood, and RBC data. A strict filtering process was applied to address missing data: (1) samples with low protein counts, defined as counts more than 1.5 times the interquartile range below the 25th percentile of the overall distribution, were removed, and (2) wild‐type peptides with more than 40% missing values across all samples were excluded. The data underwent log2 transformation, and any remaining missing values were imputed using a variational autoencoder implemented in the PIMMS software. 22 To avoid potential plate‐specific biases in downstream analyses, batch correction was performed using pyComBat. 23
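
The two filtering rules and the log2 transformation can be sketched as follows. This is a standard-library illustration under stated assumptions, not the authors' code: the function names and toy values are hypothetical, the low-count rule is read as the usual lower Tukey fence (Q1 minus 1.5 times the IQR), and imputation with PIMMS and batch correction with pyComBat are not reproduced.

```python
import math
import statistics

def low_count_samples(counts):
    """Samples whose protein count lies below Q1 - 1.5 * IQR (assumed reading of the rule)."""
    q1, _, q3 = statistics.quantiles(counts.values(), n=4, method="inclusive")
    fence = q1 - 1.5 * (q3 - q1)
    return {sample for sample, count in counts.items() if count < fence}

def drop_sparse_peptides(matrix, max_missing=0.40):
    """Keep peptides whose fraction of missing values (None) is at most `max_missing`."""
    return {
        pep: values
        for pep, values in matrix.items()
        if sum(v is None for v in values) / len(values) <= max_missing
    }

def log2_transform(matrix):
    """Log2-transform observed values and leave missing values as None."""
    return {
        pep: [None if v is None else math.log2(v) for v in values]
        for pep, values in matrix.items()
    }

# Toy protein counts per sample: s6 is a clear low outlier.
counts = {"s1": 100, "s2": 98, "s3": 102, "s4": 101, "s5": 99, "s6": 40}
removed = low_count_samples(counts)

# Toy peptide-by-sample intensities for the five remaining samples.
matrix = {
    "pepA": [1024.0, 2048.0, 512.0, 1024.0, 4096.0],  # 0% missing, kept
    "pepB": [1024.0, None, 512.0, None, 4096.0],      # 40% missing, kept
    "pepC": [None, None, 512.0, None, 4096.0],        # 60% missing, dropped
}
kept = drop_sparse_peptides(matrix)
logged = log2_transform(kept)
print(removed, sorted(kept), logged["pepA"])
```

Note that a peptide with exactly 40% missing values is retained here, because the text excludes only peptides with more than 40% missing.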

Machine learning classifiers for structural variants and β‐thalassemia trait

Tryptic mutation‐derived peptides from the hemoglobin HBB gene were used for machine learning classification of structural variants (Figure 1A). For thalassemia, only the wild‐type tryptic peptides (n = 63) from the HBA1, HBA2, HBB, HBD, HBG1, and HBG2 genes were used, so that the model would be clinically relevant, not biased by confounding factors, more interpretable, and easier to integrate into clinical workflows.

For the structural variants classifier, we utilized mutation‐derived peptides from the variants. The dataset was partitioned into 75% for training and 25% for testing. We employed the random forest algorithm to model the training data. The parameter “mtry” (number of variables randomly sampled as candidates at each split) was optimized using three repetitions of 10‐fold cross‐validation, and 5000 trees were used in each iteration. To mitigate potential bias due to data partitioning, the random splitting of training and testing data was repeated 100 times. In each random partition, the output probability from the machine learning model was calibrated using Platt scaling. Model performance was evaluated with a series of metrics, including area under the receiver‐operating characteristic curve (AUC), sensitivity, specificity, F1 score, and Matthews Correlation Coefficient (MCC). When the number of response levels exceeded two, these metrics were reported as the mean.
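
The repeated 75%/25% partitioning and the binary performance metrics can be sketched in plain Python. This is an illustration rather than the caret workflow: the random forest itself is not fitted, stratifying the split by class is an assumption, the helper names are hypothetical, and the class sizes are borrowed from the development cohort only to produce realistic split sizes.

```python
import math
import random

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), sampling the test fraction within each class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        idx = idx[:]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, F1 score, and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Toy labels: 18 positives and 32 negatives (illustrative class sizes).
labels = [1] * 18 + [0] * 32
rng = random.Random(0)
splits = [stratified_split(labels, 0.25, rng) for _ in range(100)]

# Metrics for one hypothetical confusion matrix.
metrics = binary_metrics(tp=8, fp=1, tn=9, fn=2)
print(len(splits), len(splits[0][1]), metrics)
```

In a full analysis, a classifier would be trained on each training partition and the metrics averaged over the 100 test partitions.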

For the β‐thalassemia trait classifier, we used the same strategy for data splitting, classifier development, and performance evaluation, except that the input data were wild‐type peptides from hemoglobin genes.

The machine learning models were built mainly using the R package “caret.”
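
The probability calibration step (Platt scaling) amounts to fitting a one-variable logistic model that maps raw classifier scores to calibrated probabilities. The sketch below is a simplified standard-library version and not the authors' R implementation: it uses plain gradient descent, omits the target smoothing of Platt's original formulation, and the scores, learning rate, and iteration count are made-up values.

```python
import math

def fit_platt(scores, labels, lr=0.5, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy raw scores (for example, random forest vote fractions) with some class overlap.
raw_scores = [0.10, 0.20, 0.30, 0.40, 0.60, 0.45, 0.70, 0.80, 0.90]
true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

a, b = fit_platt(raw_scores, true_labels)
low, high = calibrate(0.10, a, b), calibrate(0.90, a, b)
print(a, b, low, high)
```

Because the mapping is monotonic, calibration changes the reported probabilities but not the ranking of samples, so the AUC is unaffected.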

Decision trees

We used the “rpart” package in R to build and plot decision trees. To mitigate potential bias due to data partitioning, the random splitting of training and testing data was repeated 500 times. In each iteration, the prediction accuracy of the built decision tree was evaluated based on the test dataset and the validation cohort.
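
A decision tree restricted to a single split on a single peptide reduces to choosing one abundance threshold. The sketch below shows this idea with a Gini-based search in plain Python. It is not rpart, the function names are hypothetical, and the abundance values are invented log2-scale numbers rather than study data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(values, labels):
    """Find the threshold that minimises the weighted Gini impurity of the two sides."""
    pairs = list(zip(values, labels))
    cuts = sorted(set(values))
    best = None
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in pairs if x < thr]
        right = [y for x, y in pairs if x >= thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, thr, left, right)
    _, thr, left, right = best
    left_class = int(2 * sum(left) > len(left))      # majority class below threshold
    right_class = int(2 * sum(right) >= len(right))  # majority class at or above it
    return thr, left_class, right_class

def predict(stump, value):
    thr, left_class, right_class = stump
    return right_class if value >= thr else left_class

# Invented log2 abundances: label 1 marks the trait, label 0 marks noncarriers.
abundances = [28.1, 28.5, 29.0, 29.4, 30.6, 30.9, 31.2, 31.5]
trait = [0, 0, 0, 0, 1, 1, 1, 1]

stump = fit_stump(abundances, trait)
accuracy = sum(predict(stump, x) == y for x, y in zip(abundances, trait)) / len(trait)
print(stump, accuracy)
```

Repeating this fit over many random training partitions, and counting how often each peptide and threshold is chosen, gives the kind of frequency summary described above.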

Statistics

The differences in mutation‐derived peptide abundances between positive and negative groups were assessed using the Wilcoxon rank sum test in R. In cases where all abundance values were identical across groups, a P value could not be computed, and a value of 1 was assigned. The cohort characteristics were compared using Welch's t‐test for numeric variables and the chi‐square test for categorical variables. The reported P values from all statistical tests were two‐sided. P value correction was performed for multiple tests using the Benjamini–Hochberg method. P < 0.05 was considered significant.
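
For reference, the Benjamini–Hochberg adjustment can be written in a few lines. The sketch below is a plain Python version of the standard step-up procedure (the study used R), applied to made-up P values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.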

RESULTS

Eighty‐two individuals were analyzed in the development cohort. Three different sample types, namely RBC, whole blood, and plasma, were collected from them. To assess the generalizability of our results, we included a validation cohort of 45 individuals; only RBC samples were available for this cohort (Figure 1A). All individuals from both cohorts were tested using our standard clinical diagnostic workflow (which we defined as our gold standard), which consists of HPLC followed by Sanger sequencing for mutation verification. Based on these results, individuals were diagnosed as carriers of common clinically important structural hemoglobin variants (HbC, HbD, HbE, or HbS), as having β‐thalassemia trait, or as noncarriers if neither condition was present. Relevant cohort characteristics are shown in Table 1.

To evaluate if diagnosis could be directly assigned from the proteomic profiles, we performed MS measurements on the available samples. Hemoglobin proteins were well detected using this methodology. For all five hemoglobin proteins, all amino acid positions belonging to peptides longer than six amino acids (cutoff for confident identification) were observed (Figure S1). When examining coverage in each sample individually, HBA, HBB, and HBD had mean coverages of 96.1%, 98.7%, and 96.0%, respectively. In comparison, fetal hemoglobin proteins HBG1 and HBG2, known to have low abundances in adults, had mean coverages of 75.6% and 80.7%. As hemoglobin proteins have regions of overlapping amino acid sequences, we further examined the coverage of protein‐specific regions (based on tryptic peptides). These unique regions were all detected with high coverage (mean coverages of 98.8% for HBA, 100% for HBB, 97.2% for HBD, 59.8% for HBG1, and 92.4% for HBG2) (Table S1).
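
Sequence coverage as used here is the fraction of amino acid positions that fall inside at least one detected peptide. A minimal sketch, assuming exact string matching of peptides to the protein and using a short N-terminal HBB fragment with hypothetical detections:

```python
def sequence_coverage(protein, peptides):
    """Fraction of protein positions covered by at least one peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for k in range(start, start + len(pep)):
                covered[k] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)

# N-terminal HBB fragment (with initiator Met) and two hypothetical detected peptides.
fragment = "MVHLTPEEKSAVTALWGK"
detected = ["VHLTPEEK", "SAVTALWGK"]
coverage = sequence_coverage(fragment, detected)
print(f"{coverage:.1%}")
```

In this toy case only the initiator methionine is uncovered, so 17 of 18 positions are counted.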

Our variants and β‐thalassemia trait classifiers were built on the development cohort with RBC samples and validated in the validation cohort (Figure 1B). To increase the number of mutation‐derived peptides, one missed cleavage was allowed for trypsin digestion, and proteins both with and without a leading methionine were included, which resulted in multiple mutation‐derived peptides for a given variant (Figure 1C). These mutation‐derived peptides are specific and unique to each variant when searched against the entire human proteome. However, although they were expected to be discriminative, their diagnostic abilities were variable (Figure S2). For example, the HbS mutation‐derived peptide “MVHLTPVEK” was highly abundant in the HbS genotype (median abundance of 2^26) and could not be detected in other genotypes (abundance of zero), consistent with the expectation that only mutation carriers should have the mutated peptide. The peptide “VHLTPVEK” was detected at even higher levels (32 times higher than MVHLTPVEK) in all HbS carriers, but it was also detected at very low levels in some samples without the HbS genotype (128 times lower: median of 2^31 [HbS positive] vs. 2^24 [HbS negative], P < 0.001). This effect was due to minimal retention (0.78%) of material between runs, but it did not interfere with the outcome of our analysis. To visualize the abundances of mutation‐derived peptides in each variant, we log2‐transformed the raw values and summed the abundances of mutation‐derived peptides from the same variant (Figure 2A). In the development cohort, positive groups consistently had higher abundances of mutation‐derived peptides than negative groups. For example, when HbS mutation‐derived peptides are evaluated, the HbS genotype samples are the positive samples and have a median abundance of 2^86, whereas the other genotypes, grouped as negative, showed much lower abundances (median of 2^25).
To classify these structural variants, the machine learning variants classifier was built on the development cohort. The classifier achieved 100% prediction accuracy across 100 iterations of Monte Carlo cross‐validation (Figure 2A). To validate this finding, the classifier was applied to the validation cohort and showed 99.9% accuracy (one error out of 1500 predictions) (Figure 2B). For β‐thalassemia trait classification, mean AUC values of 1.000 and 1.000 were observed for development and validation cohorts, respectively (Figure 2C). The validation cohort had high values of accuracy (0.969), sensitivity (0.949), specificity (0.998), F1 score (0.974), and MCC (0.938) (Figure 2D). We further simplified our β‐thalassemia trait classifier with a decision tree through 500 Monte Carlo cross‐validations (Figure 3A). We found that a single peptide, “EFTPQMQAAYQKVVAGVANALAHK” from β‐globin, enabled an accuracy of 0.92 in both development and validation cohorts, and this peptide was the most frequently chosen peptide for decision trees (424/500 = 84.8%). Based on this single peptide, individuals with an abundance of 2^30.383 or higher are predicted to have β‐thalassemia trait with 92% accuracy (Figure 3B).
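
The fold changes and accuracy figures quoted in this section can be checked with simple arithmetic. The sketch below assumes the median abundances are powers of two (2^26, 2^31, and 2^24), which is the reading consistent with the stated 32-fold and 128-fold differences. It also assumes that the 1500 validation predictions correspond to the 15 structural variant carriers in the validation cohort predicted in each of 100 iterations, which is an inference and not stated in the text.

```python
# Median abundances read as powers of two (assumption, see lead-in).
vhltpvek_pos = 2 ** 31   # VHLTPVEK in HbS-positive samples
vhltpvek_neg = 2 ** 24   # VHLTPVEK in HbS-negative samples
mvhltpvek_pos = 2 ** 26  # MVHLTPVEK in HbS-positive samples

fold_between_peptides = vhltpvek_pos / mvhltpvek_pos  # VHLTPVEK vs MVHLTPVEK
fold_pos_vs_neg = vhltpvek_pos / vhltpvek_neg          # carriers vs non-carriers
carryover_percent = 100 / fold_pos_vs_neg              # retained material between runs

# Validation accuracy for structural variants: one error in 1500 predictions.
carriers = 7 + 6 + 1 + 1                               # HbS, HbE, HbD, HbC
predictions = carriers * 100                           # assumed: 100 iterations each
accuracy = (predictions - 1) / predictions

# Share of decision-tree iterations selecting the single peptide.
selection_rate = 424 / 500

print(fold_between_peptides, fold_pos_vs_neg, round(carryover_percent, 2))
print(predictions, round(100 * accuracy, 1), round(100 * selection_rate, 1))
```

Under these assumptions, the 0.78% retention figure is simply the reciprocal of the 128-fold difference between carrier and non-carrier samples.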

Figure 2.


Variants (A, B) and β‐thalassemia trait (C, D) classifiers for red blood cells (RBCs). (A) The X‐axis represents mutation‐derived peptides from different variants, and the Y‐axis shows the log2‐transformed abundances of peptides. Each dot represents a sample. For samples with multiple mutation‐derived peptides, their log2‐transformed abundances were summed. Samples were grouped based on the peptides being evaluated and the genotypes. All presented P values are after adjustment for multiple comparisons using the Benjamini–Hochberg method. (B) The confusion matrix for variants classifier achieved through 100 Monte Carlo cross‐validations. The number of samples in development was the test dataset (25%). Different metrics are used to evaluate the classifier performance, including area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, F1 score, and Matthews Correlation Coefficient (MCC). (C) The ROCs and mean AUCs through 100 Monte Carlo cross‐validations. The colored lines are the mean ROCs, and the shaded area is the 95% confidence interval. The thin light gray lines show individual ROCs in each iteration. The dashed gray line represents the AUC of 0.5. (D) The confusion matrix for β‐thalassemia trait classifiers achieved through 100 iterations and the corresponding performance metrics. HbC, hemoglobin C; HbD, hemoglobin D; HbE, hemoglobin E; HbS, hemoglobin S.

Figure 3.


The performance of decision trees for red blood cell (RBC) development and validation cohorts. (A) Five hundred iterations of random splitting of training and testing datasets in the development cohort. In each iteration, the tree performance was evaluated on the development testing dataset and the validation dataset. The specific tree structures, the mean prediction accuracy, and the frequency with which each tree structure occurred are shown. (B) Based on the 500 iterations, and considering the balance between prediction accuracy and frequency, a tree using a single peptide, EFTPQMQAAYQKVVAGVANALAHK from the HBB gene, was selected and is shown. Individuals with an abundance of 2^30.386 or higher are classified as β‐thalassemia trait. (C) The abundances of the selected peptide EFTPQMQAAYQKVVAGVANALAHK are shown across cohorts. The horizontal line represents the decision threshold defined by the selected decision tree.

To investigate whether our methodology is sample type dependent, we next applied it to whole blood. The variants classifier achieved 100% accuracy and an AUC of 1.000 (Figure 4A,B). For β‐thalassemia trait, our classifier reached an AUC of 0.991 (Figure 4C), with a prediction accuracy of 0.933, sensitivity of 0.855, specificity of 0.973, F1 score of 0.900, and MCC of 0.850 (Figure 4D).

Figure 4.


Variants (A, B) and β‐thalassemia trait classifiers (C, D) for whole blood. (A) The X‐axis represents mutation‐derived peptides from different variants, and the Y‐axis shows the log2‐transformed abundances of peptides. Each dot represents a sample. For samples with multiple mutation‐derived peptides, their log2‐transformed abundances were summed. Samples were grouped based on the peptides being evaluated and the genotypes. All presented P values are after adjustment for multiple comparisons using the Benjamini–Hochberg method. (B) The confusion matrix for variants classifier achieved through 100 Monte Carlo cross‐validations. The number of samples in development was the test dataset (25%). Different metrics are used to evaluate the classifier performance, including area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, F1 score, and Matthews Correlation Coefficient (MCC). (C) The ROCs and mean AUCs through 100 Monte Carlo cross‐validations. The colored lines are the mean ROCs, and the shaded area is the 95% confidence interval. The thin light gray lines show individual ROCs in each iteration. The dashed gray line represents the AUC of 0.5. (D) The confusion matrix for β‐thalassemia trait classifiers achieved through 100 iterations and the corresponding performance metrics. HbC, hemoglobin C; HbD, hemoglobin D; HbE, hemoglobin E; HbS, hemoglobin S.

Furthermore, we also evaluated our classifier performance in plasma (Figure 5). The variants classifier achieved 100% accuracy and an AUC of 1.000. The β‐thalassemia trait classifier achieved an accuracy of 0.846, a sensitivity of 0.760, a specificity of 0.889, an F1 score of 0.770, and an MCC of 0.650.

Figure 5.


Variants (A, B) and β‐thalassemia trait classifiers (C, D) for plasma. (A) The X‐axis represents mutation‐derived peptides from different variants, and the Y‐axis shows the log2‐transformed abundances of peptides. Each dot represents a sample. For samples with multiple mutation‐derived peptides, their log2‐transformed abundances were summed. Samples were grouped based on the peptides being evaluated and the genotypes. All presented P values are after adjustment for multiple comparisons using the Benjamini–Hochberg method. (B) The confusion matrix for variants classifier achieved through 100 Monte Carlo cross‐validations. The number of samples in development was the test dataset (25%). Different metrics are used to evaluate the classifier performance, including area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, F1 score, and Matthews Correlation Coefficient (MCC). (C) The ROCs and mean AUCs through 100 Monte Carlo cross‐validations. The colored lines are the mean ROCs, and the shaded area is the 95% confidence interval. The thin light gray lines show individual ROCs in each iteration. The dashed gray line represents the AUC of 0.5. (D) The confusion matrix for β‐thalassemia trait classifiers achieved through 100 iterations and the corresponding performance metrics. HbC, hemoglobin C; HbD, hemoglobin D; HbE, hemoglobin E; HbS, hemoglobin S.

DISCUSSION

We evaluated the feasibility of coupling high‐throughput proteomics with machine learning for rapid and precise diagnosis of common hemoglobinopathy carrier states.

Our machine learning classifiers, trained on tryptic peptide data, showed exceptional diagnostic performance. For structural hemoglobin variants, the model achieved near‐perfect accuracy (99.9%) in the validation cohort. For β‐thalassemia trait, mean AUC values were 1.000 and 1.000 for the development and validation cohorts, respectively. Of note, a single wild‐type peptide delivered 92% accuracy for β‐thalassemia trait in a simple decision tree, highlighting the potential for streamlined diagnostic workflows. High‐throughput proteomics provided several advantages, including minimal sample requirements and reduced need for manual interpretation, all while maintaining high accuracy. In principle, detecting a mutated hemoglobin peptide should directly reflect the corresponding genetic alteration. For the studied variants, the method achieved near‐perfect accuracy across RBCs, whole blood, and plasma. For β‐thalassemia trait, performance was strong in RBCs and whole blood; however, as expected given hemoglobin's intracellular localization, plasma was less well‐suited for detection.

While HPLC and capillary electrophoresis remain gold‐standard techniques, they still rely on hemoglobin migration patterns, which have multiple known pitfalls, most notably the inability to distinguish variants with overlapping elution profiles, and often require verification by suitable alternative methods to provide satisfactory diagnostic precision. 24 In a Danish setting, genetic testing by Sanger sequencing is the first‐choice validation method. In comparison, our proteomics‐based pipeline can potentially provide a precise, less labor‐intensive, and cost‐effective alternative to this gold‐standard two‐step approach in under 30 min.

Despite these promising findings, certain limitations warrant consideration. First, our model was developed specifically for β‐thalassemia trait and four key hemoglobin variants. Future studies should also incorporate α‐thalassemia and other less common variants. Second, our sample size was small for certain variants, such as HbC (n = 4) and HbD (n = 3), in the development cohort. Hence, although we found that our proteomics approach was promising for identifying these variants, findings should be interpreted with caution due to limited statistical power, and validation in a larger cohort is necessary before clinical implementation can be considered. Third, limited levels of retention of material between runs, exemplified by “VHLTPVEK,” in high‐cell‐count samples (RBC and whole blood), underscore the need for further optimization of sample preparation and loading protocol in future studies. Although additional blanks or wash gradients could help mitigate this issue, they would also reduce throughput and increase processing time and cost, potentially limiting clinical scalability. Notably, we did not observe this carryover effect in plasma samples, which contain significantly lower hemoglobin concentrations. This suggests that reducing input material in RBC and whole blood samples may be an effective and practical strategy to minimize carryover without compromising throughput, offering a clinically viable path forward. Fourth, all samples were processed on a unified platform at a single laboratory site, limiting the generalizability of our findings. Validation across different laboratories and proteomics workflows will be essential to assess reproducibility and support clinical implementation. 
Fifth, although the single‐peptide decision tree classifier for β‐thalassemia demonstrated strong performance and reflects the underlying biology of the disease, consistent with the quantitative deficiency in hemoglobin β‐chain, it may be vulnerable to biological and technical variability across individuals, populations, laboratories, or MS platforms. Accordingly, our primary classification approach is based on a broader panel of 63 peptides, which provides greater robustness by integrating multiple features and reducing reliance on any single measurement.
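
The trade‐off between a single‐peptide rule and a multi‐peptide panel can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration and not the authors' pipeline: the log‐intensity values are simulated, the panel size of nine hypothetical peptides is arbitrary, and a majority vote over per‐peptide decision stumps stands in for the random forest used in the study. A technical shift is applied to one peptide in the test set to mimic platform or batch variability.

```python
import random


def fit_stump(values, labels):
    """Depth-one decision tree on one feature: choose the cut-off that
    maximizes training accuracy, predicting carrier (1) below the cut-off."""
    distinct = sorted(set(values))
    # Candidate cut-offs are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def train_accuracy(cutoff):
        hits = sum((v < cutoff) == (y == 1) for v, y in zip(values, labels))
        return hits / len(labels)

    return max(candidates, key=train_accuracy)


def simulate(n_per_class, n_peptides, rng):
    """Synthetic log-intensities (assumption): carriers (label 1) have a
    lower mean than non-carriers (label 0) on every hypothetical peptide."""
    rows, labels = [], []
    for label, mean in ((0, 10.0), (1, 8.0)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(mean, 0.5) for _ in range(n_peptides)])
            labels.append(label)
    return rows, labels


def predict_single(row, cutoffs, peptide=0):
    """Classify from one peptide only."""
    return int(row[peptide] < cutoffs[peptide])


def predict_panel(row, cutoffs):
    """Classify by majority vote across all per-peptide stumps."""
    votes = sum(v < c for v, c in zip(row, cutoffs))
    return int(votes > len(cutoffs) / 2)


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


rng = random.Random(0)
N_PEPTIDES = 9  # hypothetical panel size, chosen only to keep the sketch small
train_rows, train_labels = simulate(50, N_PEPTIDES, rng)
test_rows, test_labels = simulate(100, N_PEPTIDES, rng)

# One cut-off per peptide, learned on the unshifted training data.
cutoffs = [
    fit_stump([row[j] for row in train_rows], train_labels)
    for j in range(N_PEPTIDES)
]

# Mimic technical variability: shift peptide 0 downward in the test set only.
drifted_rows = [[row[0] - 2.0] + row[1:] for row in test_rows]

single_acc = accuracy(lambda r: predict_single(r, cutoffs), drifted_rows, test_labels)
panel_acc = accuracy(lambda r: predict_panel(r, cutoffs), drifted_rows, test_labels)
print(f"single-peptide stump accuracy under drift: {single_acc:.2f}")
print(f"panel majority-vote accuracy under drift:  {panel_acc:.2f}")
```

In this toy setting the single‐peptide rule misclassifies most non‐carriers once its one measurement drifts, whereas the vote across the panel is largely unaffected because the remaining peptides still separate the groups. Real peptide panels are correlated and may drift together, so the sketch only conveys the intuition behind integrating multiple features.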

Future work should extend the model to rare hemoglobin variants and diverse clinical scenarios, such as homozygotes, compound heterozygotes, and transfused patients, to enhance its generalizability. These genotypes may produce more complex peptide profiles, including multiple mutation‐derived peptides and altered abundance patterns, which can complicate classification. The present study, with its limited sample size and focus on heterozygous carriers, does not address these scenarios. The findings should therefore not be generalized to these clinically important subgroups, and larger cohorts will be required to evaluate model performance across a broader spectrum of genotypes. Additionally, our approach could be applied to newborn screening using dried blood spots and integrated into laboratory information systems for automated and efficient diagnostics in reference laboratories.

In conclusion, we demonstrate that combining high‐throughput proteomics with machine learning represents a powerful tool for diagnosing common hemoglobinopathies. With further validation and refinement, this framework has the potential to advance the diagnostic landscape for hereditary anemias.

AUTHOR CONTRIBUTIONS

Shaodong Wei: Methodology; formal analysis; visualization; writing—review and editing; writing—original draft; investigation; data curation. Annelaura Bach Nielsen: Methodology; investigation; writing—review and editing; data curation; writing—original draft; formal analysis. Jens Helby: Conceptualization; writing—review and editing; supervision. Lylia Drici: Investigation; writing—review and editing. Christine Rasmussen: Investigation; writing—review and editing. Juanjuan Wang: Investigation; data curation; writing—review and editing. Matthias Mann: Conceptualization; writing—review and editing. Jesper Petersen: Conceptualization; investigation; data curation; writing—review and editing; supervision. Nicolai J. Wewer Albrechtsen: Conceptualization; project administration; writing—review and editing; supervision; funding acquisition. Andreas Glenthøj: Conceptualization; funding acquisition; writing—original draft; writing—review and editing; project administration; supervision.

CONFLICT OF INTEREST STATEMENT

A.G.: Agios, Novo Nordisk, Pharmacosmos, and Vertex Pharmaceuticals (consultancy/advisory board); Agios, Bristol Myers Squibb, Novo Nordisk, and Sanofi (research support). M.M. is an indirect shareholder in Evosep Biosystems. N.J.W.A. has received funding, served on scientific advisory panels, and/or speakers bureaus for Boehringer Ingelheim, MSD/Merck, Novo Nordisk, EvoSep, ROCHE, Janssen, and Mercodia. J.H. has received research funding, payment for advisory board participation, and a conference travel grant from Sanofi, and payment for advisory board participation from Disc Medicine. None of the other authors report any conflicts of interest.

Supporting information

Supporting Information.

HEM3-9-e70227-s001.docx (496KB, docx)

FUNDING

This work was supported by a grant from the Novo Nordisk Foundation (#0085102). N.J.W.A. is supported by the European Foundation for the Study of Diabetes Future Leader Award (NNF21SA0072746), Independent Research Fund Denmark, Sapere Aude (1052‐00003B), and the Novo Nordisk Foundation (NNF23OC0084970, NNF19OC0055001, and NNF24OC0088402). Novo Nordisk Foundation Center for Protein Research is supported financially by the Novo Nordisk Foundation (Grant agreement NNF14CC0001). A.G. is supported by Rigshospitalets Research Foundation.

Contributor Information

Nicolai J. Wewer Albrechtsen, Email: nicolai.albrechtsen@regionh.dk.

Andreas Glenthøj, Email: Andreas.glenthoej@regionh.dk.

DATA AVAILABILITY STATEMENT

Technical details can be made available from the corresponding author at andreas.glenthoej@regionh.dk. To comply with data privacy regulations, access to original data is possible only in the case of a collaborative agreement.

The custom code used to analyze the mass spectrometry proteomics data is available at https://github.com/WewerAlbrechtsenLab/hemoglobinopathy.

REFERENCES

  • 1. Modell B, Darlison M. Global epidemiology of haemoglobin disorders and derived service indicators. Bull World Health Organ. 2008;86(6):480‐487. doi:10.2471/BLT.06.036673
  • 2. Lobitz S, Telfer P, Cela E, et al. Newborn screening for sickle cell disease in Europe: recommendations from a Pan‐European Consensus Conference. Br J Haematol. 2018;183(4):648‐660. doi:10.1111/bjh.15600
  • 3. Gravholt EAE, Petersen J, Mottelson M, et al. The Danish national haemoglobinopathy screening programme: report from 16 years of screening in a low‐prevalence, non‐endemic region. Br J Haematol. 2024;204(1):329‐336. doi:10.1111/bjh.19103
  • 4. Gravholt EAE, Jørgensen FS, Holm C, et al. Optimisation of the Danish national haemoglobinopathy screening programme—a prospective intervention study. EJHaem. 2025;6(3):980. doi:10.1002/jha2.980
  • 5. Chakravorty S, Dick MC. Antenatal screening for haemoglobinopathies: current status, barriers and ethics. Br J Haematol. 2019;187(4):431‐440. doi:10.1111/bjh.16188
  • 6. Cao A, Kan YW. The prevention of thalassemia. Cold Spring Harbor Perspect Med. 2013;3(2):011775. doi:10.1101/cshperspect.a011775
  • 7. Goonasekera HW, Paththinige CS, Dissanayake VHW. Population screening for hemoglobinopathies. Annu Rev Genomics Hum Genet. 2018;19(1):355‐380. doi:10.1146/annurev-genom-091416-035451
  • 8. Theodoridou S, Prapas N, Balassopoulou A, et al. Efficacy of the national thalassaemia and sickle cell disease prevention programme in Northern Greece: 15‐year experience, practice and policy gaps for natives and migrants. Hemoglobin. 2018;42(4):257‐263. doi:10.1080/03630269.2018.1528986
  • 9. Weil LG, Charlton MR, Coppinger C, Daniel Y, Streetly A. Sickle cell disease and thalassaemia antenatal screening programme in England over 10 years: a review from 2007/2008 to 2016/2017. J Clin Pathol. 2020;73(4):183‐190. doi:10.1136/jclinpath-2019-206317
  • 10. Piel FB, Steinberg MH, Rees DC. Sickle cell disease. N Engl J Med. 2017;376(16):1561‐1573. doi:10.1056/NEJMra1510865
  • 11. Farmakis D, Porter J, Taher A, Domenica Cappellini M, Angastiniotis M, Eleftheriou A. 2021 Thalassaemia International Federation guidelines for the management of transfusion‐dependent thalassemia. HemaSphere. 2022;6(8):732. doi:10.1097/HS9.0000000000000732
  • 12. Wild BJ, Green BN, Cooper EK, et al. Rapid identification of hemoglobin variants by electrospray ionization mass spectrometry. Blood Cells Mol Dis. 2001;27(3):691‐704. doi:10.1006/bcmd.2001.0430
  • 13. El Osta M, Benoist JF, Naubourg P, et al. MALDI‐MS in first‐line screening of newborns for sickle cell disease: results from a prospective study in comparison to HPLC. Clin Chem Lab Med. 2024;62(6):1149‐1157. doi:10.1515/cclm-2023-1250
  • 14. Arsene CG, Kaiser P, Paleari R, et al. Determination of HbA2 by quantitative bottom‐up proteomics and isotope dilution mass spectrometry. Clin Chim Acta. 2018;487:318‐324. doi:10.1016/j.cca.2018.10.024
  • 15. Dakshinamoorthy Putchen D, Nambiar A, Ashok Menon A, Jayaram A, Ramaprasad S. Electrospray triple quadrupole mass spectrometry guides pathologists to suggest appropriate molecular testing in the identification of rare hemoglobin variants. J Mass Spectrom Adv Clin Lab. 2024;32:18‐23. doi:10.1016/j.jmsacl.2024.01.005
  • 16. He L, Rockwood AL, Agarwal AM, et al. Diagnosis of hemoglobinopathy and β‐thalassemia by 21 Tesla fourier transform ion cyclotron resonance mass spectrometry and tandem mass spectrometry of hemoglobin from blood. Clin Chem. 2019;65(8):986‐994. doi:10.1373/clinchem.2018.295766
  • 17. Guo T, Steen JA, Mann M. Mass‐spectrometry‐based proteomics: from single cells to clinical applications. Nature. 2025;638(8052):901‐911. doi:10.1038/s41586-025-08584-0
  • 18. Williams SA, Ostroff R, Hinterberg MA, et al. A proteomic surrogate for cardiovascular outcomes that is sensitive to multiple mechanisms of change in risk. Sci Transl Med. 2022;14(639):9625. doi:10.1126/scitranslmed.abj9625
  • 19. Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M. DIA‐NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods. 2020;17(1):41‐44. doi:10.1038/s41592-019-0638-x
  • 20. Bateman A, Martin MJ, Orchard S, et al. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 2025;53(D1):D609‐D617. doi:10.1093/nar/gkae1010
  • 21. Kountouris P, Lederer CW, Fanis P, Feleki X, Old J, Kleanthous M. IthaGenes: an interactive database for haemoglobin variations and epidemiology. PLoS One. 2014;9(7):103020. doi:10.1371/journal.pone.0103020
  • 22. Webel H, Niu L, Nielsen AB, et al. Imputation of label‐free quantitative mass spectrometry‐based proteomics data using self‐supervised deep learning. Nat Commun. 2024;15(1):5405. doi:10.1038/s41467-024-48711-5
  • 23. Behdenna A, Colange M, Haziza J, et al. pyComBat, a Python tool for batch effects correction in high‐throughput molecular data using empirical Bayes methods. BMC Bioinform. 2023;24(1):459. doi:10.1186/s12859-023-05578-5
  • 24. Bain BJ, Daniel Y, Henthorn J, et al. Significant haemoglobinopathies: a guideline for screening and diagnosis. Br J Haematol. 2023;201(6):1047‐1065. doi:10.1111/bjh.18794


Articles from HemaSphere are provided here courtesy of Wiley
