Summary
Multiple myeloma evolves unnoticed over years, and when diagnosed, organ damage is common. Electronic health records (EHR) can help in developing predictive models identifying ‘healthy’ people at risk. MM patients from Clalit Health Services (2002–2019) were matched with healthy controls. Stage I: EHR from 5 years prior to MM diagnosis were reviewed and >200 parameters were compared (patients vs. controls). Stage II: Establishing xgboost model predicting 5 year risk for MM, with validation. Stage III: A simplified logistic regression model for community, requiring 20 variables (Age; Hb; RBC; MCV; RDW; WBC; neutrophils; lymphocytes; monocytes; basophils; glucose; creatinine; total protein; albumin; calcium; uric acid; bilirubin; HDL‐C; LDL‐C; triglycerides). EHR from the pre‐MM period of 4256 patients were compared to controls. Future MM patients had higher ESR, lower Hb, ANC, neutrophil/lymphocyte ratio, higher globulins and ferritin, more immune deficiencies, MDS and FMF. They took fewer tranquilizers, anti‐diabetics and statins. Using labs from future MM (n = 19 129) and controls (n = 382 580, 20:1), a predictive model was developed (ROC AUC = 0.836). The simple LR model provided individual risk prediction for MM within 5 years (AUC = 0.72). Two models with machine learning predict the risk of myeloma in ‘healthy’ individuals within 5 years. The models can be used in practice.
Keywords: computer modelling, disease prediction, gradient boosted, logistic regression, multiple myeloma
Individuals who may develop multiple myeloma within 5 years. Stage I (left) identifies patient and control groups and variables that differ between them. Stage II (middle) develops a complex xgboost model to predict future MM patients. Stage III (right) develops a simplified model.

INTRODUCTION
Multiple myeloma (MM) evolves over years, 1 , 2 , 3 , 4 , 5 , 6 yet only a minority of patients with pre‐MM states (MGUS, smouldering MM) are diagnosed on time. 3 , 4 Unfortunately, with no markers or screening, many are missed. When MM is diagnosed, it is often already associated with organ damage. 3 , 4 , 7
We hypothesize that clinical and lab markers of individuals at risk could serve for identification prior to MM diagnosis. Database collection and machine learning (ML) techniques can assist. 8 , 9 , 10 , 11
We present risk prediction models using an ML approach to identify people at risk for developing MM within 5 years. Hopefully, irreversible complications can be prevented. 12 , 13 , 14 , 15 , 16
METHODS
Stage I: Data collection and comparison: Future MM versus controls
Goals
(1) Identify patients and controls, (2) review the charts prior to diagnosis (pre‐MM period—from 5 years prior to MM to 1 month prior to MM diagnosis) and collect data and (3) compare between MM patients and controls in the pre‐MM period. The aim was to identify key early indicators of MM, suggesting that signs of MM can be detected already in the pre‐MM period.
Data were extracted from Clalit Health Services (CHS), the largest healthcare organization in Israel (>5.3 × 10⁶ patients). The electronic health records (EHRs) contain socio‐demographic information, medical histories, encoded diagnoses, medications, physicians' notes, hospital admissions, laboratory and imaging results. This database, which spans more than three decades, allows data mining for analysis.
Patients
The study population included CHS members (40–85 years) diagnosed with MM from January 2002 to December 2019. Inclusion criteria were (1) diagnostic codes of MM, for example, multiple myeloma, plasma cell myeloma; (2) patients registered with MM in the national cancer registry (years 2000–2014), avoiding duplication. We excluded patients with MGUS or SMM, and with unconfirmed MM diagnosis (‘Suspected’, ‘not excluded’, ‘Rule/out’, ‘Differential diagnosis’, ‘history of MM’, amyloidosis).
MM diagnosis date was defined as either the earliest date documented, or the earliest date of anti‐MM medication prescription. The index date was set as the last day of the month in which the patient performed lab tests in the pre‐MM period. Data from the pre‐MM period were collected and MM patients without available labs during the pre‐MM period were excluded.
Controls
Controls were randomly chosen individuals without MM from the CHS database, matched for age, sex and geographic region. For each MM patient, 10 non‐MM controls were chosen to increase the statistical power of the analysis. The controls had performed lab tests (CBC, chemistry) in the same month (in the pre‐MM period) and at the same facility as their matched MM patients (same index date).
Study design and analysis
All available clinical and lab information was extracted from the pre‐MM period (5 years prior to the index date). We defined individuals who later developed MM as ‘Future MM’ patients.
Using univariate analysis, the available demographic, clinical, medication and lab variables (~200) from the pre‐MM period were compared between future MM patients and controls. We also evaluated the change of variables in each group over time during the 5 years prior to the index date.
Statistical testing
We used the Wilcoxon rank‐sum test to compare the values in the case and control groups for continuous variables; Fisher's exact test to compare categorical variables and calculate odds ratios; the Benjamini–Hochberg false discovery rate (FDR) to correct for multiple testing.
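The multiple-testing correction named above can be illustrated with a minimal sketch of the Benjamini–Hochberg procedure. The function name and the four p-values are hypothetical and are not study data; the study's own implementation is not described.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical raw p-values for four compared variables (illustration only).
raw_p = [0.001, 0.04, 0.03, 0.20]
q = benjamini_hochberg(raw_p)
significant = [qi < 0.05 for qi in q]
```

A variable is then reported as different between groups only if its q-value falls below the chosen FDR level (0.05 in this sketch).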
Stage II: Development of the MM predictive model (for the large health organization)
The goal
Establishing a risk prediction model for developing MM within 5 years from a given date.
Patient data
The entire CHS population was randomly divided into training and test cohorts in a 70–30 ratio, so that the individuals assigned to the test cohort did not overlap with the training cohort. From the training cohort (70% of the population), we used all the available lab tests of the future MM patients. Thus, for instance, a future MM patient who had performed labs three times during the pre‐MM period provided three samples or ‘lab units’ for the model development. Note that between stages I and II, more than 1 year passed and more patients, including patients whose data were not used for stage I, were included in the second stage.
Controls
As controls for the development of the model, we used lab units from matched non‐MM individuals from the training set. For each lab unit from the future MM cohort, 20 control lab units were matched from non‐MM matched individuals who performed blood tests in the same month and the same facility as the future MM patient. Note that this allowed a training set that was relatively enriched with MM patients compared to their ratio in the general population.
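The 20:1 matching of control lab units described above could be sketched as follows. The record fields (`lab_id`, `month`, `facility`), the function name and the sampling without replacement are assumptions made for illustration; the actual matching procedure is not detailed in the text.

```python
import random
from collections import defaultdict


def match_controls(case_units, control_units, ratio=20, seed=0):
    """For each case lab unit, draw `ratio` control lab units from the same month and facility."""
    rng = random.Random(seed)
    # Pool the control lab units by (month, facility).
    pools = defaultdict(list)
    for unit in control_units:
        pools[(unit["month"], unit["facility"])].append(unit)

    matched = {}
    for case in case_units:
        pool = pools[(case["month"], case["facility"])]
        # Sample without replacement; take the whole pool if it is smaller than `ratio`.
        matched[case["lab_id"]] = rng.sample(pool, min(ratio, len(pool)))
    return matched


# Toy data (hypothetical): one future MM lab unit and 51 candidate control lab units.
cases = [{"lab_id": "c1", "month": "2012-03", "facility": "A"}]
controls = [{"lab_id": f"k{i}", "month": "2012-03", "facility": "A"} for i in range(50)]
controls.append({"lab_id": "other", "month": "2012-04", "facility": "A"})
matched = match_controls(cases, controls)
```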
Variables
The model included over 200 demographic, clinical and lab variables extracted from the EHR for each patient at the index dates.
Model development
We applied a gradient boosting machine learning (ML) model using the xgboost library in Python. The model was trained on data from the training set (70%), which was further divided into three cross‐validation sets based on patient ID number, so that no patient could appear in more than one cross‐validation set. One of the advantages of xgboost is that it has an inherent methodology to handle missing data. Prior to model construction, xgboost hyperparameter exploration and optimization were performed on the cross‐validation sets using the ‘hyperopt’ library.
The following hyperparameters were optimized: max_depth, min_child_weight, gamma, subsample, colsample_bytree and learning_rate.
In the tuning process using the threefold cross‐validation strategy, the first two folds were used for training, and the third fold was used for validation. The hyperparameters were selected to maximize model performance on the validation fold.
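The patient-level fold assignment described above can be sketched as below. Hashing the patient ID is an assumption chosen for illustration (the text only states that folds were based on patient ID number), and the field name `patient_id` is hypothetical.

```python
import hashlib


def fold_of(patient_id, n_folds=3):
    """Map a patient ID to a fold deterministically, so all of a patient's lab units share one fold."""
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def split_by_patient(lab_units, n_folds=3):
    """Distribute lab units into folds without splitting any patient across folds."""
    folds = [[] for _ in range(n_folds)]
    for unit in lab_units:
        folds[fold_of(unit["patient_id"], n_folds)].append(unit)
    return folds


# Toy data (hypothetical): 30 patients with three lab units each.
units = [{"patient_id": pid, "lab": k} for pid in range(30) for k in range(3)]
folds = split_by_patient(units)
# As in the text: the first two folds train, the third validates.
train_units = folds[0] + folds[1]
validation_units = folds[2]
```

Keeping each patient inside a single fold prevents repeated lab units from the same person from appearing on both sides of the training/validation boundary.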
Model validation
Model performance was evaluated on individuals from the test cohort (30% of the whole CHS population that had been reserved to test the model performance). Of the test cohort, the test set included individuals who had lab tests in the year 2014 (5 years before the study end date). Patients with a known MM diagnosis at that time were excluded. Subjects who had more than one blood test during 2014 were included only once, and the date used was chosen randomly from among all the lab test dates for the given subject.
For model validation, the fitted model was run on the test set to create individual prediction scores. The overall model performance was then evaluated using a receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). In addition, we evaluated the performance of the model at multiple possible prediction score cut‐offs using positive predictive value (PPV) and lift scores.
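As an illustration of the AUC used for validation, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen future MM subject receives a higher score than a randomly chosen control. The labels and scores are invented for the example. The pairwise loop is written for clarity only; on a test set of the size used here, a rank-based computation would be needed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control (ties count 0.5)."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical labels (1 = future MM, 0 = control) and prediction scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
auc = roc_auc(labels, scores)
```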
Stage III: Model simplification (for any physician in the community)
The xgboost predictive model developed in stage II is complex and requires significant resources and facilities, which might limit its application to large organizations like CHS. Thus, in order to implement the predictive model on other platforms, we developed a simplified model that could be used by any community physician with limited computational resources. This model uses a simpler method, logistic regression, that can be programmed and implemented on any platform. Based on the results of the stage II model, we selected variables that were the most predictive, clinically relevant and commonly available in the community. Variables were chosen if they were routine and universally measured, readily available and not missing for any patient, clinically related to the prediction or exclusion of future myeloma, and ranked highly by weighted importance in the initial model analysis (stage II).
The 20 chosen variables were age; haemoglobin; RBC; MCV; RDW; WBC; neutrophil (absolute) count; lymphocyte (absolute) count; monocyte (absolute) count; basophil (absolute) count; glucose (serum); creatinine (serum); total serum protein; albumin (serum); calcium (serum); uric acid; total serum bilirubin; HDL‐cholesterol; LDL‐cholesterol; triglycerides (See also Table 4).
TABLE 4.
The simplified predictive model.
| Feature | Units | Coefficients | Example values |
|---|---|---|---|
| Intercept | Constant | −0.029529 | |
| Age | Years | 0.03132 | 75 |
| HB | g/dL | −0.227172 | 12.1 |
| RBC | 10³/μL | −0.236324 | 3.5 |
| MCV | fL | −0.00554 | 85 |
| RDW | % | −0.089781 | 14 |
| WBC | 10³/μL | −0.029068 | 3.8 |
| Neutrophils (absolute count) | 10³/μL | −0.088919 | 1.6 |
| Lymphocytes (absolute count) | 10³/μL | 0.027145 | 2.4 |
| Monocytes (absolute count) | 10³/μL | 0.001484 | 0.4 |
| Basophils (absolute count) | 10³/μL | −0.008366 | 0 |
| Glucose | mg/dL | −0.002343 | 80 |
| Creatinine | mg/dL | −0.119084 | 0.7 |
| Protein, total | g/dL | 0.539083 | 7.6 |
| Albumin | g/dL | −0.184958 | 3.8 |
| Calcium | mg/dL | −0.150997 | 10 |
| Uric Acid | mg/dL | 0.19732 | 6.9 |
| Bilirubin, total | mg/dL | −0.047166 | 0.7 |
| HDL cholesterol | mg/dL | −0.015262 | 60 |
| LDL cholesterol | mg/dL | −0.001835 | 93.2 |
| Triglycerides | mg/dL | −0.001437 | 139 |
| Calculated score, P (Y = 1) | | | 0.180 |
Note: Variables used in the simplified logistic regression model are presented. The table presents the name of the variable with the appropriate units, and the coefficients obtained from building the model. An example of its implementation is provided, showing the values of each variable for a given patient. For this patient, the calculated score, P is 0.180 (see Table 5).
The model was developed on the training set and validated on the test set, as in stage II.
The study was approved by the Helsinki Committee (IRB) of CHS.
RESULTS
Stage I: Data collection and comparison between future MM patients and controls
From the CHS database (entire cohort), we identified 4982 patients with MM. Of those, 4256 patients met the inclusion criteria and were matched (1:10) with 42,560 controls (Figure 1A). Table 1 displays several relevant clinical, demographic and socioeconomic characteristics of the patients. The mean age was 69.0 years (±10.3 SD), with 46% females and 60% with current or past smoking history. Being matched, the controls had the same age, sex and geographic distribution. Most other demographic and social variables were similar for future MM patients and the controls (except for number of children, marital status and birth area). The full list of the other demographic features of the future MM and control populations is presented in Table S1 in the supplement.
FIGURE 1.

Patients or labs associated with future MM patients are depicted in red; controls, in blue. (A) Stage I: Of the 4.7 million CHS members (at the time of the study), 4982 individuals developed MM during the study period; 4256 of them met the inclusion/exclusion criteria. Variables of these patients were compared with those of 42 560 (10:1) controls using univariate analysis. (B) Stage II: The gradient boosting model was developed using 70% of the CHS members (black arrow, left). The remaining 30% of the population was set aside and reserved for model validation (green arrow, right). For the model development (left portion), laboratory data were collected from future MM (3515) patients and controls: 19 129 lab units from future MM patients and 382 580 (20:1) lab units from controls. Validation (green, right portion) was performed using labs performed in the year 2014 (268 058 people): 368 from future MM patients and 267 690 from controls.
TABLE 1.
MM patient (demographic, epidemiological, social) characteristics.
| Parameters/characteristics | | MM patients, N | MM patients, (%) |
|---|---|---|---|
| N | | 4256 | 100 |
| Age | Years | 69.0 (±10.3)ᵃ | ‐ |
| Sex | Females | 1970 | 46.3 |
| Immigration status | Native Israeli | 1724 | 40.5 |
| Marital status | Married | 1855 | 43.6 |
| Socioeconomic status | Middle/high | 3497 | 82.2 |
| Smoking | Current/Past | 2554 | 60.0 |
ᵃ Mean (± standard deviation).
Differences between future MM patients and controls
Table 2 demonstrates lab variables of future MM patients as compared to those of controls, who did not eventually develop MM. A full list of compared parameters can be seen in the supplement (Table S2a–c). All the differences between the future MM patients and controls were statistically significant, after correction for multiple testing. The first section of Table 2 examines labs that are typically abnormal in MM patients, though these are not specific to MM: haemoglobin (Hb), erythrocyte sedimentation rate (ESR), proteins and uric acid. The next portion shows levels of immunoglobulin (Ig), abnormalities which are part of the pathophysiology of the disease. The third portion shows labs that are unrelated to MM. The N column provides the number of patients and controls for whom the particular parameter was available and used for comparison.
TABLE 2.
Laboratory differences between future MM patients and controls in the pre‐MM period.
| Lab test | Future MM, N | Future MM, median [IQR] | Controls, N | Controls, median [IQR] | p value |
|---|---|---|---|---|---|
| Lab tests often abnormal (but not specific) in MM | |||||
| Hb (g/dL) | 4032 | 12.5 [11.4–13.6] | 40 604 | 13.4 [12.3–14.4] | <0.0001 |
| ESR (mm/h) | 1811 | 40.0 [20.0–68.0] | 16 083 | 24.0 [13.0–40.0] | <0.0001 |
| Glob/Alb ratio | 3645 | 0.85 [0.68–1.07] | 36 034 | 0.71 [0.63–0.8] | <0.0001 |
| Protein (mg/24 h) | 1001 | 225.0 [97.0–966.0] | 6172 | 165.0 [84.0–423.2] | <0.0001 |
| Uric acid (mg/dL) | 3791 | 5.8 [4.8–6.95] | 37 919 | 5.5 [4.5–6.6] | <0.0001 |
| Lab tests associated with MM | |||||
| IgM (mg/dL) | 1224 | 44 [24–85] | 5118 | 81 [51–125] | <0.0001 |
| IgA (mg/dL) | 1266 | 140 [64–321] | 5597 | 229 [164–314] | <0.0001 |
| IgG (mg/dL) | 1219 | 1420 [935–2305] | 5175 | 1160 [955–1410] | <0.0001 |
| Lab tests unrelated to MM | |||||
| ANC (×10⁹/L) | 4013 | 3.7 [2.73–4.80] | 40 424 | 4.1 [3.21–5.2] | <0.0001 |
| Neut/Lymph ratio | 4009 | 1.88 [1.35–2.6] | 40 321 | 2.08 [1.56–2.85] | <0.0001 |
| Total cholesterol | 4013 | 171 [145–200] | 40 477 | 178 [152–206] | <0.0001 |
| LDL‐C | 3963 | 128 [106–153] | 40 016 | 133 [110–157] | <0.0001 |
| Triglycerides | 4002 | 164 [118–231] | 40 395 | 170 [124–235] | <0.0001 |
| LDH (u) | 3267 | 341 [299–390] | 31 817 | 353 [315–401] | <0.0001 |
| RDW | 3965 | 14.1 [13.3–15.1] | 39 882 | 13.8 [13.2–14.7] | <0.0001 |
| Ferritin (ng/mL) | 2425 | 85 [44–158] | 21 360 | 70 [34–141] | <0.0001 |
Note: All differences were significant after correction for multiple testing (FDR q < 0.05). p‐values displayed in the table are before correction for multiple comparisons.
Abbreviations: Alb, albumin; ANC, absolute neutrophil count; ESR, erythrocyte sedimentation rate; Glob, globulins; Hb, haemoglobin; Ig, immunoglobulin (M, A, G); IQR, interquartile range; LDH, lactate dehydrogenase; LDL‐C, low‐density lipoprotein cholesterol; lymph, lymphocyte count; N, number of individuals for whom the variable was available; Neut, neutrophil count; RDW, red cell distribution width.
Note that, in the pre‐MM period, future MM patients had higher ESR as well as lower Hb level, absolute neutrophil count (ANC) and neutrophil/lymphocyte (N/L) ratio, compared with the controls in the parallel period. Future MM patients also presented with higher levels of serum globulins, globulin/albumin ratio, urinary protein and serum IgG, than controls. Finally, Table 2 shows that, compared with controls, future MM patients had lower serum levels of total cholesterol, LDL‐cholesterol, triglycerides and LDH, as well as higher values of RDW and serum ferritin.
There were also statistically significant differences between future MM patients and controls in the prevalence of comorbidities in the pre‐MM period (Table S2b). Compared to the controls, future MM patients had higher rates of immune deficiencies (defined by the use of steroids or other immunosuppressive agents), a history of pneumococcal pneumonia, familial Mediterranean fever (FMF), pernicious anaemia, nephrotic syndrome and osteoporosis.
Differences between future MM patients and controls were also found in the use of medications in the pre‐MM period (Table S2c). Fewer future MM patients than controls were using tranquilizers, anti‐diabetics, calcium antagonists and statins.
We also found lab differences between future MM and control groups in trends over time prior to MM diagnosis. Several representative lab differences are displayed in Figure S2. When subdividing the lab values by each year for both groups, from the fifth year (−4.0) before MM diagnosis to the year of the index date (0), we found that as patients approached the date of MM diagnosis, the abnormalities in some variables progressed for future MM patients while they did not change for controls. For example, the median Hb level in the future MM group declined during the 5 years prior to MM diagnosis from 13.3 to 12.3 g/dL, a decline that might be ignored by the human eye in regular clinic visits, while in the control group, Hb level only declined from 13.6 to 13.2 g/dL. In the future MM group, the median erythrocyte sedimentation rate (ESR) rose during the pre‐MM period from 29 mm/h, 5 years prior to MM diagnosis, to 48 mm/h at the index date, whereas in the control group it rose only from 24 to 25 mm/h. The median globulin/albumin ratio of future MM patients and controls increased over the 5 years prior to MM diagnosis from 0.77 to 0.85 and from 0.70 to 0.71, respectively. A similar tendency was demonstrated in comparing the trend over time for RBC, Mentzer (MCV/RBC), Hct, Globulins, Total proteins, ESR, Albumin, % Macrocytes, Absolute neutrophil count, MCV, IgM, NLR, % lymphocytes, MCH, WBC, % neutrophils and HDL‐cholesterol (Figure S2).
Stage II: Development of the MM predictive model
Model development
From the training cohort, 3513 future MM patients met the inclusion criteria for the model development (Figure 1B, left). This group of future MM patients performed a total of 19 129 lab tests in the pre‐MM period. These were matched 1:20 to 382 580 labs from non‐MM controls, which together served as the learning units for the model development.
Validation of model
We validated the predictive model using 268 058 subjects (Figure 1B, right), who performed blood tests in the year 2014 (from the 30% of the CHS population reserved for validation) and had not been diagnosed with MM at the time of their blood tests. Of these, 368 (0.14%) patients were diagnosed with MM within the 5‐year window, reflecting the incidence of the disease in the general community.
Model performance
Figure 2 demonstrates the performance of the model in this population with a receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) was 0.836.
FIGURE 2.

Receiver operating characteristic (ROC) curve model performance using the patients' labs performed in the year 2014. Note that the area under the ROC curve is 0.836.
Table 3 displays the performance of diagnosing MM as a function of a given predictor threshold. When implementing the model, a decision needs to be made by the organization as to where to place the predictor threshold. For instance, choosing the predictor threshold of 0.7 (Table 3, bold) points to 2684 individuals (~1% of the test set population) above the threshold, thus defining them as having high risk to develop MM in the future. Eventually, MM will be diagnosed in 111 of these suspected subjects within 5 years. These values reflect a positive predictive value (PPV) of 4.14% and a lift of ~30. This means that, for this particular threshold, we can either closely follow or perform an MM diagnostic workup in 2684 (1%) suspected individuals, in order to identify the 111 patients who will truly develop the disease. Alternatively, we can prefer the less stringent predictor threshold of 0.5, marking 21 599 individuals (~8% of the test set population) above the threshold, defined as having high risk for future MM. Eventually, MM will be diagnosed in 195 of these suspected subjects within 5 years. These values reflect a PPV of 0.9% and a lift of ~6.6. This means that, for this particular threshold, we can either closely follow or perform an MM diagnostic workup in 21 599 (8%) suspected individuals, in order to identify the 195 patients who will truly develop the disease. Any other threshold can be selected, resulting in different numbers of suspected and diagnosed patients.
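The PPV and lift figures quoted above can be reproduced from the reported counts. The sketch below assumes that lift is defined as the PPV divided by the overall 5‐year incidence in the validation population (368 of 268 058), a definition that is consistent with the lift of 1.00 at threshold 0 in Table 3; the function name is hypothetical.

```python
def ppv_and_lift(n_flagged, n_true_positive, n_total, n_cases):
    """PPV among flagged subjects, and lift relative to the population base rate."""
    ppv = n_true_positive / n_flagged
    base_rate = n_cases / n_total
    return ppv, ppv / base_rate


# Validation population: subjects tested in 2014 and those diagnosed within 5 years.
N_TOTAL, N_CASES = 268_058, 368

# Threshold 0.7: 2684 subjects flagged, 111 later diagnosed with MM.
ppv_07, lift_07 = ppv_and_lift(2684, 111, N_TOTAL, N_CASES)
# Threshold 0.5: 21 599 subjects flagged, 195 later diagnosed with MM.
ppv_05, lift_05 = ppv_and_lift(21_599, 195, N_TOTAL, N_CASES)

print(f"threshold 0.7: PPV={ppv_07:.4f}, lift={lift_07:.2f}")
print(f"threshold 0.5: PPV={ppv_05:.4f}, lift={lift_05:.2f}")
```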
TABLE 3.
Model performance as a function of the predictor threshold.
| Predictor threshold | % Above threshold | PPV | Lift |
|---|---|---|---|
| 0.00 | 100.00 | 0.001 | 1.00 |
| 0.50 | 8.06 | 0.009 | 6.58 |
| 0.60 | 3.01 | 0.018 | 13.38 |
| **0.70** | **1.00** | **0.041** | **30.12** |
| 0.80 | 0.32 | 0.103 | 74.82 |
| 0.88 | 0.11 | 0.197 | 143.17 |
| 0.92 | 0.06 | 0.289 | 210.12 |
| 0.95 | 0.03 | 0.413 | 300.47 |
| 0.96 | 0.02 | 0.459 | 334.36 |
| 0.97 | 0.02 | 0.500 | 364.21 |
Note: The bold row is used as an example of how a given threshold is used for disease prediction (see text).
Abbreviation: PPV, positive predictive value.
Stage III: The simplified model
We applied a logistic regression model, which is much simpler and easier to implement on multiple computational platforms (in any community clinic). Figure S1 in the supplement provides the formula used for the logistic regression. Table 4 presents the simplified model. The 20 chosen variables are presented; all are easily available to any community physician and, in fact, are part of the routine workup. Also presented are the coefficients that weight each variable, determined upon building the model. Table 4 also provides an example for a given patient. By entering the lab values for each variable for this particular patient, the calculated model score is 0.180. The AUC of the ROC curve was 0.72.
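The worked example in Table 4 can be reproduced with a few lines of code. The sketch below assumes the standard logistic form, P = 1 / (1 + e^(−z)) with z equal to the intercept plus the sum of coefficient × value; the formula in Figure S1 is not reproduced here, so this form is an assumption, although it yields the tabulated score for the example patient. The variable and function names are hypothetical.

```python
import math

# Intercept, coefficients and example values copied from Table 4.
INTERCEPT = -0.029529
TABLE_4 = {
    # name: (coefficient, example value)
    "age": (0.03132, 75),
    "hb": (-0.227172, 12.1),
    "rbc": (-0.236324, 3.5),
    "mcv": (-0.00554, 85),
    "rdw": (-0.089781, 14),
    "wbc": (-0.029068, 3.8),
    "neutrophils_abs": (-0.088919, 1.6),
    "lymphocytes_abs": (0.027145, 2.4),
    "monocytes_abs": (0.001484, 0.4),
    "basophils_abs": (-0.008366, 0),
    "glucose": (-0.002343, 80),
    "creatinine": (-0.119084, 0.7),
    "protein_total": (0.539083, 7.6),
    "albumin": (-0.184958, 3.8),
    "calcium": (-0.150997, 10),
    "uric_acid": (0.19732, 6.9),
    "bilirubin_total": (-0.047166, 0.7),
    "hdl_c": (-0.015262, 60),
    "ldl_c": (-0.001835, 93.2),
    "triglycerides": (-0.001437, 139),
}
COEFFICIENTS = {name: coef for name, (coef, _) in TABLE_4.items()}
EXAMPLE_PATIENT = {name: value for name, (_, value) in TABLE_4.items()}


def mm_risk_score(values):
    """Logistic regression score: sigmoid of intercept plus weighted sum of the 20 variables."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


score = mm_risk_score(EXAMPLE_PATIENT)
print(f"{score:.3f}")  # 0.180, as in Table 4
```

All 20 values must be supplied in the units listed in Table 4, since the coefficients are unit‐dependent.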
Table 5 provides the performance of the logistic regression model for various thresholds. If the cut‐off is chosen as 0.18, as in the example from Table 4, the test will have a lift of 9.6. The patient in Table 4, whose score falls above that threshold, belongs to a group in which about 1% (Table 5, PPV column) may develop MM within the next 5 years. This is a patient who should be followed more closely or occasionally tested for the development of the disease.
TABLE 5.
Performance of the simplified logistic regression model.
| Probability threshold | % above threshold | PPV | Lift |
|---|---|---|---|
| 0.05 | 29% | 0.00 | 2.12 |
| 0.06 | 20% | 0.00 | 2.61 |
| 0.07 | 15% | 0.00 | 3.06 |
| 0.08 | 11% | 0.01 | 3.74 |
| 0.09 | 8% | 0.01 | 3.91 |
| 0.10 | 6% | 0.01 | 4.34 |
| 0.11 | 5% | 0.01 | 4.99 |
| 0.12 | 4% | 0.01 | 5.74 |
| 0.13 | 3% | 0.01 | 6.92 |
| 0.14 | 3% | 0.01 | 7.03 |
| 0.15 | 2% | 0.01 | 7.95 |
| 0.16 | 2% | 0.01 | 9.19 |
| 0.17 | 1% | 0.01 | 9.55 |
| **0.18** | **1%** | **0.01** | **9.60** |
| 0.19 | 1% | 0.02 | 11.04 |
| 0.20 | 1% | 0.02 | 12.76 |
Note: The bold row is used as an example of how a given threshold is used for disease prediction (see text).
Table S3 in the supplement demonstrates how the simplified logistic regression model (of Figure S1) can be implemented using a spreadsheet (in this case, Excel). It also provides and contrasts two additional examples of the model's use and the determination of the respective scores with various thresholds. In example 1, the score is 0.051. Using Table 5, it is determined that, with a threshold of 0.05, 29% of patients are above this threshold, meaning that a very large number of patients would have to be followed and tested. More MM would be discovered but with a higher cost to the healthcare system. In example 2, the score is 0.201. This is above the threshold of 0.2; only approximately 1% of patients are above that threshold, which might represent a smaller group with a less expensive approach.
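Reading a score against Table 5, as in the two examples above, amounts to finding the highest tabulated threshold that the score reaches. A minimal sketch follows; the function name is hypothetical, and treating a score equal to a threshold as meeting it is an assumption.

```python
import bisect

# (threshold, % above threshold, lift) rows copied from Table 5.
TABLE_5 = [
    (0.05, 29, 2.12), (0.06, 20, 2.61), (0.07, 15, 3.06), (0.08, 11, 3.74),
    (0.09, 8, 3.91), (0.10, 6, 4.34), (0.11, 5, 4.99), (0.12, 4, 5.74),
    (0.13, 3, 6.92), (0.14, 3, 7.03), (0.15, 2, 7.95), (0.16, 2, 9.19),
    (0.17, 1, 9.55), (0.18, 1, 9.60), (0.19, 1, 11.04), (0.20, 1, 12.76),
]
THRESHOLDS = [row[0] for row in TABLE_5]


def highest_threshold_met(score):
    """Return the Table 5 row with the largest threshold not exceeding the score, or None."""
    i = bisect.bisect_right(THRESHOLDS, score)
    return TABLE_5[i - 1] if i > 0 else None


example_1 = highest_threshold_met(0.051)  # example 1 from Table S3
example_2 = highest_threshold_met(0.201)  # example 2 from Table S3
```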
DISCUSSION
Digital tools can identify minor findings faster and earlier than people, and construct predictive diagnostic and therapeutic models. 8 , 9 , 10 , 11 , 17 , 18 This is relevant to MM, which is often diagnosed only after complications have developed. 1 , 3 , 4 , 5 , 19 , 20 , 21 , 22 Serum electrophoresis screening could theoretically diagnose MGUS/SMM, but this is not routine, and many cases are missed, which supports our approach.
Here, ~200 clinical and lab variables from EHRs were different between future MM patients and controls, prior to MM diagnosis. We focused on statistically significant and clinically meaningful parameters. MM‐related variables (ESR, globulins) were abnormal even before diagnosis, but interestingly, others (glucose, cholesterol) as well as some comorbidities (FMF, Table S2d) 23 showed differences.
Based on these differences, predictive models were developed. Both models can be applied in practice to predict the individual's risk to develop MM in the future. The high AUC reflects the good performance of both models. The models can assist in decision‐making for screening. Regulators can decide which threshold to use to screen suspected individuals. A small number of true‐positive patients with MM will eventually be diagnosed. This can be extended to other diseases also.
Early detection or identification of individuals at risk will hopefully lead to therapeutic paradigm shifts, and now with the availability of biological agents it is feasible. 1 , 12 , 13 , 14 , 16 , 24 , 25 , 26 , 27 , 28
Both models have limitations. First, we need to test many individuals to diagnose MM in a few. A threshold line is determined, and there is a trade‐off: a lower threshold leads to testing more, diagnosing more, but at a higher cost. With a higher threshold, more MM patients will be missed. However, since no myeloma screening is currently in practice, most of such patients are missed and diagnosed only with active disease. Also, the lack of genetic information makes risk stratification more difficult. 1 , 2 , 29 Finally, external validation can further strengthen the model.
Despite these limitations, the development of MM predictive models is a step towards the identification of a high‐risk population and the prevention of complications. Future developments will improve the models' accuracy. Nevertheless, the models, especially the simplified ones, can already be implemented by every community physician.
AUTHOR CONTRIBUTIONS
Moshe Mittelman, Ran Balicer, Howard S. Oster, Galit Shaham: Conceptualization, data analysis, writing manuscript, final approval. Ariel Israel, Michael Leshchinsky, Yatir Ben‐Shlomo, Eldad Kepten: Model development, data analysis and final approval. Osnat Jarchowsky Dolberg: Data analysis, writing the manuscript, final approval.
FUNDING INFORMATION
This project was supported in part by an unrestricted grant from Janssen Pharmaceuticals.
CONFLICT OF INTEREST STATEMENT
There are no conflicts of interest to disclose for any author.
ETHICS STATEMENT
The study protocol was approved by the CHS Institutional Review Board.
PATIENT CONSENT STATEMENT
As this was a retrospective study using de‐identified patient data, the requirement for informed consent was waived.
Supporting information
Data S1.
ACKNOWLEDGEMENTS
The authors wish to acknowledge the assistance of Noa Goldschmidt, our study coordinator, and of Yochi Menachem for assistance in preparing the manuscript.
Mittelman M, Israel A, Oster HS, Leshchinsky M, Ben‐Shlomo Y, Kepten E, et al. Can we identify individuals at risk to develop multiple myeloma? A machine learning‐based predictive model. Br J Haematol. 2025;207(2):387–394. 10.1111/bjh.20136
Moshe Mittelman, Ariel Israel and Howard S. Oster contributed equally to this study.
Preliminary work on this study was presented in ASH 2022: https://doi.org/10.1182/blood‐2022‐162438.
Contributor Information
Moshe Mittelman, Email: moshemt@gmail.com.
Howard S. Oster, Email: howardo@tlvmc.gov.il.
DATA AVAILABILITY STATEMENT
Raw data are not available because they are part of the Clalit Health Organization patient database.
REFERENCES
- 1. Landgren O. Advances in MGUS diagnosis, risk stratification, and management: introducing myeloma‐defining genomic events. Hematol Am Soc Hematol Educ Program. 2021;2021:662–672.
- 2. Oben B, Froyen G, Maclachlan KH, Leongamornlert D, Abascal F, Zheng‐Lin B, et al. Whole‐genome sequencing reveals progressive versus stable myeloma precursor conditions as two distinct entities. Nat Commun. 2021;12:1861.
- 3. Palumbo A, Anderson K. Multiple myeloma. N Engl J Med. 2011;364:1046–1060.
- 4. Rajkumar SV. Multiple myeloma: 2018 update on diagnosis, risk‐stratification, and management. Am J Hematol. 2018;93:981–1114.
- 5. Rustad EH, Yellapantula V, Leongamornlert D, Bolli N, Ledergor G, Nadeu F, et al. Timing the initiation of multiple myeloma. Nat Commun. 2020;11:1917.
- 6. Ramberger E, Sapozhnikova V, Ng YL, Dolnik A, Ziehm M, Popp O, et al. The proteogenomic landscape of multiple myeloma reveals insights into disease biology and therapeutic opportunities. Nat Cancer. 2024;5:1267–1284.
- 7. Kariyawasan CC, Hughes DA, Jayatillake MM, Mehta AB. Multiple myeloma: causes and consequences of delay in diagnosis. QJM. 2007;100:635–640.
- 8. Goshen R, Choman E, Ran A, Muller E, Kariv R, Chodick G, et al. Computer‐assisted flagging of individuals at high risk of colorectal cancer in a large health maintenance organization using the ColonFlag test. JCO Clin Cancer Inform. 2018;2:1–8.
- 9. Greene JA, Lea AS. Digital futures past ‐ the long arc of big data in medicine. N Engl J Med. 2019;381:480–485.
- 10. Haug CJ, Drazen JM. Artificial intelligence and machine learning in clinical medicine, 2023. N Engl J Med. 2023;388:1201–1208.
- 11. Nazha A, Komrokji R, Meggendorfer M, Jia X, Radakovich N, Shreve J, et al. Personalized prediction model to risk stratify patients with myelodysplastic syndromes. J Clin Oncol. 2021;39:3737–3746.
- 12. Lonial S, Jacobus S, Fonseca R, Weiss M, Kumar S, Orlowski RZ, et al. Randomized trial of lenalidomide versus observation in smoldering multiple myeloma. J Clin Oncol. 2020;38:1126–1137.
- 13. Mateos MV, Hernandez MT, Giraldo P, De la Rubia J, De Arriba F, Corral LL, et al. Lenalidomide plus dexamethasone for high‐risk smoldering multiple myeloma. N Engl J Med. 2013;369:438–447.
- 14. Vaxman I, Gertz MA. How I approach smoldering multiple myeloma. Blood. 2022;140:828–838.
- 15. Musto P, Engelhardt M, Caers J, Kaiser M, Van de Donk N, Terpos E, et al. 2021 European Myeloma Network review and consensus statement on smoldering multiple myeloma: how to distinguish (and manage) Dr. Jekyll and Mr. Hyde. Haematologica. 2021;106:2799–2812.
- 16. Salmasi G, Murray DL, Padmanabhan A. Myeloma therapy for monoclonal gammopathy of thrombotic significance. N Engl J Med. 2024;391:570–571.
- 17. Beam AL, Drazen JM, Kohane IS, Leong TY, Manrai AK, Rubin EJ. Artificial intelligence in medicine. N Engl J Med. 2023;388:1220–1221.
- 18. Oster HS, Crouch S, Smith A, Yu G, Abu Shrkihe B, Baruch S, et al. A predictive algorithm using clinical and laboratory parameters may assist in ruling out and in diagnosing MDS. Blood Adv. 2021;5:3066–3075.
- 19. Heider M, Nickel K, Hogner M, Bassermann F. Multiple myeloma: molecular pathogenesis and disease evolution. Oncol Res Treat. 2021;44:672–681.
- 20. Avet‐Loiseau H, Bahlis NJ. Smoldering multiple myeloma: taking the narrow over the wide path? Blood. 2024;143:2025–2028.
- 21. Cowan AJ, Green DJ, Kwok M, Lee S, Coffey DG, Holmberg LA, et al. Diagnosis and management of multiple myeloma: a review. JAMA. 2022;327:464–477.
- 22. Landgren O, Kyle RA, Pfeiffer RM, Katzmann JA, Caporaso NE, Hayes RB, et al. Monoclonal gammopathy of undetermined significance (MGUS) consistently precedes multiple myeloma: a prospective study. Blood. 2009;113:5412–5417. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Celik S, Tangi F, Oktenli C. Increased frequency of Mediterranean fever gene variants in multiple myeloma. Oncol Lett. 2014;8:1735–1738. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Dispenzieri A, Stewart AK, Chanan‐Khan A, Rajkumar SV, Kyle RA, Fonseca R, et al. Smoldering multiple myeloma requiring treatment: time for a new definition? Blood. 2013;122:4172–4181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Fermand JP, Bridoux F, Dispenzieri A, Jaccard A, Kyle RA, Leung N, et al. Monoclonal gammopathy of clinical significance: a novel concept with therapeutic implications. Blood. 2018;132:1478–1485. [DOI] [PubMed] [Google Scholar]
- 26. Leung N, Bridoux F, Hutchison CA, Nasr SH, Cockwell P, Fermand JP, et al. Monoclonal gammopathy of renal significance: when MGUS is no longer undetermined or insignificant. Blood. 2012;120:4292–4295. [DOI] [PubMed] [Google Scholar]
- 27. Rognvaldsson S, Love TJ, Thorsteinsdottir S, Reed ER, Óskarsson JÞ, Pétursdóttir Í, et al. Iceland screens, treats, or prevents multiple myeloma (iStopMM): a population‐based screening study for monoclonal gammopathy of undetermined significance and randomized controlled trial of follow‐up strategies. Blood Cancer J. 2021;11:94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Dimopoulos MA, Voorhees PM, Schjesvold F, Cohen YC, Hungria V, Sandhu I, et al. Daratumumab or active monitoring for high‐risk smoldering multiple myeloma. N Engl J Med. 2024. [DOI] [PubMed] [Google Scholar]
- 29. Xu M, Meng Y, Li Q, Charwudzi A, Qin H, Xiong S. Identification of biomarkers for early diagnosis of multiple myeloma by weighted gene co‐expression network analysis and their clinical relevance. Hematology. 2022;27:322–331. [DOI] [PubMed] [Google Scholar]
Associated Data
Supplementary Materials
Data S1.
Data Availability Statement
Raw data are not available because they are part of the Clalit Health Organization patient database.
