Towards a data-driven system for personalized cervical cancer risk stratification

Geir Severin R E Langberg; Jan F Nygård; Vinay Chakravarthi Gogineni; Mari Nygård; Markus Grasmair; Valeriya Naumova

doi:10.1038/s41598-022-16361-6

. 2022 Jul 15;12:12083. doi: 10.1038/s41598-022-16361-6

Towards a data-driven system for personalized cervical cancer risk stratification

Geir Severin R E Langberg ^1,^✉, Jan F Nygård ², Vinay Chakravarthi Gogineni ³, Mari Nygård ¹, Markus Grasmair ⁴, Valeriya Naumova ⁵

PMCID: PMC9287371 PMID: 35840652

Abstract

Mass-screening programs for cervical cancer prevention in the Nordic countries have been effective in reducing cancer incidence and mortality at the population level. Women who have been regularly diagnosed with normal screening exams represent a sub-population with a low risk of disease and distinctive screening strategies which avoid over-screening while identifying those with high-grade lesions are needed to improve the existing one-size-fits-all approach. Machine learning methods for more personalized cervical cancer risk estimation may be of great utility to screening programs shifting to more targeted screening. However, deriving personalized risk prediction models is challenging as effective screening has made cervical cancer rare and the exam results are strongly skewed towards normal. Moreover, changes in female lifestyle and screening habits over time can cause a non-stationary data distribution. In this paper, we treat cervical cancer risk prediction as a longitudinal forecasting problem. We define risk estimators by extending existing frameworks developed on cervical cancer screening data to incremental learning for longitudinal risk predictions and compare these estimators to machine learning methods popular in biomedical applications. As input to the prediction models, we utilize all the available data from the individual screening histories.Using data from the Cancer Registry of Norway, we find in numerical experiments that the models are strongly biased towards normal results due to imbalanced data. To identify females at risk of cancer development, we adapt an imbalanced classification strategy to non-stationary data. Using this strategy, we estimate the absolute risk from longitudinal model predictions and a hold-out set of screening data. Comparing absolute risk curves demonstrate that prediction models can closely reflect the absolute risk observed in the hold-out set. Such models have great potential for improving cervical cancer risk stratification for more personalized screening recommendations.

Subject terms: Cancer, Health care, Mathematics and computing

Introduction

Nation-wide cervical cancer screening programs in the Nordic countries have shown to be an effective cancer prevention strategy. These programs recommend repeated screening at regular intervals to detect precancerous lesions¹. Although the screening recommendations have become more accurate and efficient over the years, they are based on only the most recent screening results and are standardised across the whole screening population. Specifically, the Norwegian Cervical Cancer Screening Program (NCCSP) currently recommends a routine screening every 3 or 5 years for females aged 25–33 years and 34–69 years, provided their last screening was normal. An alternative strategy would be to adapt the recommendations to the individual risk of disease initiation as inferred from the full screening history. For instance, a more personalized approach could be to recommend a longer screening interval to a female older than 45 who had only negative results in the past, as she may be at considerably lower risk than a 30 year old female with several past abnormalities. More personalized recommendations in cervical cancer screening may reduce the large number of unnecessary screenings of females unlikely to develop the disease, while simultaneously preventing more cancer cases².

A step towards more individualized recommendations is utilizing data from existing cancer screening registries to derive prediction models for the individual risk of cervical cancer development. However, the data available from these registries contain only a few variables about previous exam results, necessary to organize and run the screening programs but no information about female lifestyle or habits. Moreover, due to most females having only normal results, the distribution of results is heavily skewed towards disease-free cases, and this distribution may also be changing over time due to temporal variations in female screening and lifestyle habits. In this paper, we use data from the NCCSP to evaluate the impact of data imbalance and data drift on model performance. We adapt machine learning methods to predict the individual time-varying risk of cervical cancer and compare their performances in numerical experiments.

Data

Our approach is based on data from the 1.7 million females in the NCCSP screening population between 1991–2015. As this data covers the Norwegian cervical cancer screening population, the prediction models derived herein can only be evaluated internally using hold-out methods. External model validation require data from a different country but differences in screening recommendations³ and data collection practices make it challenging to align information for comparability. However, synergy projects with Baltic countries and Sweden are being developed to investigate the potential to extending predictors to other countries.

In this paper we considered only histories with at least 3 exam results for hold-out model validation. In the NCCSP data, more than 75 % of the females have only normal results in their history. To have more variation in the training data we sampled training histories with probabilities proportional to the most severe result in each history, making it more likely to select females with at least one abnormal result. For the test set we used only histories where the first exam was taken no later than year 2000 and at the ages 20–30 (± 5 years from the recommended youngest age for the first exam). This sampling give more recent and complete test data for a comprehensive model evaluation. As our dataset ends before 2015, selecting test histories with the first exam from year 2000 gave female age range 20–53 in the test data, while the results in the training set were from female ages 20–72. The final training set included 10K histories and the test set included 50K histories.

The NCCSP data contains only the information necessary for the Cancer Registry of Norway to organize and run the screening program. Although previous works^4,5 deriving prediction models for cervical cancer risk stratification leverage personal lifestyle information, this information is in general unavailable for the whole screening population. It is also typically collected only once for each female, and thus does not capture temporal variations in the data. Therefore, there is large potential and benefits in providing prediction models based on the data routinely collected by the registries, as these may be integrated directly into the cancer screening programs. Specifically, the NCCSP data consists of timestamps, three types of medical exams (cytology, histology and human papillomavirus (HPV)) and the corresponding results. The HPV exams were introduced around 2005 to follow up on abnormal cytology, and as our dataset ends before 2015 it contains only a few HPV results. Due to the scarcity of HPV results in our data sample, we exclude all HPV data in this study and use only cytology and histology results. However, we plan to include more recent registry data with detailed HPV information in future work.

The primary cause for cervical cancer is persistent infection with HPV. This infection may lead to the development of low-grade lesions, progressing via high-grade precursor lesions (pre-cancer) to invasive cancer^6,7. Exposure to HPV occurs mainly via sexual contact which, together with individual lifestyle variations, makes the risk of cervical cancer both time-varying and in-homogeneous across the screening population⁷. To represent the risk of cervical cancer development, we consider three clinically actionable states, reflecting stages in disease initiation and progression. We label these states normal, low-grade and high-grade.

A normal state requires no additional exams before the next routine screening, while progression from normal to low-grade calls for closer follow-up – although low-grade lesions may spontaneously regress back to normal⁸. Progression from low-grade to high-grade requires immediate clinical action to prevent cancer. Each state is determined by the outcome of medical exams and corresponds to different risk-levels of disease development.

The NCCSP data is strongly skewed towards disease-free cases with more than 85 % of the individual results being normal and fewer than 5 % high-grade results. Due to the screening recommendations not being strictly adhered to in practice, the histories are irregular in time. This irregularity poses a significant challenge in prediction tasks if the time between the last examination and the time to predict amounts to several years (e.g. > 4 years). The panel in Fig. 1, illustrates these characteristics of the NCCSP data by showing to the left a Lexis diagram depicting screening histories, a histogram of screening intervals in the middle and a histogram of the proportion of female states in three age intervals to the right.

Cervical cancer screening data characteristics. Left: A Lexis diagram illustrating screening histories. Each history is depicted as a gray line spanning from the first to the last visit. Visits are indicated by a marker for the exam type (histology and cytology) and colored by the exam result. Middle: A histogram of the time between visits. Right: The proportion of female states (normal in blue, low-grade in orange and high-grade in red) in three age intervals.

The Lexis diagram in Fig. 1 illustrates the scarcity in screening histories sampled from the NCCSP data, where the median number of exams is 6. The histogram of screening intervals shows that the time between exams varies from just over 1 month and up to almost 20 years, illustrating data irregularity. Finally, the proportion of states is changing with female age, containing about 0.87 normal results for females younger than 36 and up to 0.93 normals for 46–69+ year old females. This drift in the state distribution could be attributed to changes in female lifestyle and screening habits.

State of the art

Popular prediction models in biomedical applications include logistic regression⁹ (LR), random forest¹⁰ (RF) and gradient tree boosting¹¹ (GTB). Ensemble methods such as RF and GTB may be strong performers on imbalanced data¹² but neither of these models is typically used with time-dependent data. Popular models in time-series prediction tasks such as long short-term memory¹³ (LSTM) networks expect regular and sufficiently sampled data. However, this is not the case with the NCCSP data, as described in the previous section.

An alternative to the LSTM, also capable of modelling cervical cancer data, is a continuous-time hidden Markov model (HMM) developed in a recent study¹⁴ for the disease dynamics observed in cervical cancer screening data. The model was learned from a subset of NCCSP data and validated against a hold-out set by using the HMM as a stochastic simulator to derive Kaplan-Meier estimates. However, the study did not evaluate the HMM on risk prediction tasks or presented a method for generating such predictions from the model.

A later work¹⁵ introduced a matrix factorization (MF) framework using historical data for cervical cancer risk prediction along with a method for classifying the female state from the risk estimate. The MF has become a popular approach to dealing with scarce and irregular data. The proposed framework was compared to a geometric deep learning (GDL) model based on¹⁶, and validated by the probability of agreement¹⁷ between Kaplan-Meier estimates derived from model predictions and a hold-out set. Despite highlighting a heavy class imbalance in their data sample, the authors did not evaluate model calibration or state drift – factors that may explain and affect model performance.

In¹⁸ the GDL approach was further adapted to handle the scarcity in the cervical cancer screening data. The method was evaluated in numerical experiments by predicting the future risk for individual females at a single randomly chosen time point. To extend methods for personalized risk prediction even further, incremental learning may be incorporated, allowing the models to update risk estimate after more data is available in the future.

Contribution

This is the first paper comparing several different machine learning methods for cervical cancer risk estimation, focusing on methods for incremental learning from longitudinal data, the impact of data imbalance on model estimates and classification with a time-varying state distribution. Specifically, we compare methods based on HMM, MF and GDL, as well as LR, RF and GTB. Our motivation for including the HMM, GDL and MF is that they were developed to handle scarce and irregular data for cervical cancer screening applications, and we further adapt these herein to incremental learning for longitudinal risk estimation. Moreover, to handle problems with both imbalance and temporal changes in the state distribution, we extend the classification method from¹⁵ to separate classifiers over different female age intervals. To evaluate their ability to predict the next exam results over time, we compare absolute risk curves derived from model predictions and a hold-out set of screening data to assess model calibration against the trend in the time-varying risk.

The rest of this paper is organized as follows. In “Predicting the risk of cervical cancer development” we outline the risk estimators that are based on extensions to HMM, MF and GDL. “Numerical experiments” section describes the numerical experiments and discuss the results on NCCSP data, followed by a conclusion and outline of future work in “Conclusions and future work”.

Predicting the risk of cervical cancer development

We represent a cervical cancer screening history with data recorded at times $t_{0} < t_{1} < \dots < t_{j}$ as a set of tuples $y_{t_{j}} = {\{(t_{i}, ρ_{t_{i}}, x_{t_{i}})\}}_{i = 0}^{j}$ . The history includes time points $t_{i}$ representing the female age at visit i when she was measured with medical exam $ρ_{t_{i}}$ to be in state $x_{t_{i}} = s$ . The potential female states $s \in S$ are numerically encoded with $s = 1$ for normal, $s = 2$ for low-grade and $s = 3$ for high-grade.

To estimate the individual future risk of cervical cancer, we assume that we know her screening history up to some time $t_{j}$ . The predicted risk at a later time point $\hat{t} > t_{j}$ is expressed as the triple of conditional probabilities $p (x_{\hat{t}} = s | y_{t_{j}})$ , $s = 1, 2, 3$ .

In the following sections we provide a detailed description of how we extend existing frameworks based on MF, GDL and HMM to incremental learning for longitudinal predictions. For the LR, RF and GTB predictors we use the implementations from publicly available software¹⁹.

Matrix factorization

The matrix factorization (MF) risk estimate as we define it herein is based on the Shifted Weighted Convolutional Matrix Factorization (SWCMF) from¹⁵. The SWCMF assumes that the discrete observed states are possibly inaccurate measurements of a continuous latent state, evolving slowly with time for each female. The MF risk estimator requires that we derive such latent state profiles from a hold-out set of screening histories before we can use it for predictions.

By organizing all the states $x_{t_{i}}$ from a screening history according to female age $t_{i}$ we obtain a scarce longitudinal vector $z$ spanning $T \geq t_{j}$ years. Here T is the maximum female age in the data and the times $t_{0} \leq t_{i} \leq T$ for when the female state was measured correspond to the observed entries in $z$ . Combining N such state vectors gives a partially observed matrix $Z$ .

With the SWCMF model from¹⁵, we estimate a complete matrix $\hat{M} \in R^{N \times T}$ of latent state profiles using the observed states in $Z$ and the medical exam type data. Each row ${\hat{M}}_{n}$ in $\hat{M}$ corresponds to a continuous latent profile estimated from $Z_{n}$ , and these profiles are used to construct the MF risk estimator.

As in¹⁵, we assume the probability of observing a female in state $x_{t_{i}} = s$ at time $t_{i}$ given the latent state $m_{t_{i}}$ is given by

p (x_{t_{i}} = s | m_{t_{i}}) = C exp ({(m_{t_{i}} - s)}^{2} / 2 σ^{2})

where $C = C (m_{t_{i}})$ is a normalizing factor. Here we estimate $σ$ using the same MLE procedure as in¹⁵. To estimate the risk of some female with screening history $y_{t_{j}}$ being in state s at time $\hat{t} > t_{j}$ , we consider the posterior predictive distribution

\begin{matrix} p (x_{\hat{t}} = s | y_{t_{j}}) & \propto \int p (x_{\hat{t}} = s | m, y_{t_{j}}) p (m | y_{t_{j}}) d m \\ = \int p (x_{\hat{t}} = s | m) p (m | y_{t_{j}}) d m . \end{matrix}

In (2) we assume $y_{\hat{t}}$ is conditionally independent from $y_{t_{0}}, \dots, y_{t_{j}}$ so the probability of observing state s when given history $y$ and latent profile $m$ becomes $p (x_{\hat{t}} = s | m)$ . Using Bayes’ rule $p (m | y_{t_{j}}) \propto p (y_{t_{j}} | m) p (m)$ we get

\begin{matrix} p (x_{\hat{t}} = s | y_{t_{j}}) \propto \int p (x_{\hat{t}} = s | m) p (y_{t_{j}} | m) p (m) d m \end{matrix}

In (3) the latent risk prior $p (m)$ is unknown so $p (m | y_{t_{j}})$ is really intractable but following the variational approximation approach we may use $\hat{M}$ to approximate $p (m | y_{t_{j}})$ . Thus, we can approximate (3) with

\begin{matrix} \hat{p} (x_{\hat{t}} = s | y_{t_{j}}) \propto \sum_{n = 1}^{N} p (x_{\hat{t}} = s | {\hat{M}}_{n, \hat{t}}) \hat{p} (y_{t_{j}} | {\hat{M}}_{n}) . \end{matrix}

In (4) we compute $p (x_{\hat{t}} = s | {\hat{M}}_{n, \hat{t}})$ from (1) and the data likelihood as

\begin{matrix} \hat{p} (y_{t_{j}} | {\hat{M}}_{n}) = \prod_{i = 0}^{j} C ({\hat{M}}_{n, t_{i}}) exp (\frac{{({\hat{M}}_{n, t_{i}} - x_{t_{i}})}^{2}}{2 σ^{2}}), \end{matrix}

Moreover, assuming data from a visit at time $t_{j + 1} > t_{j}$ is added to history $y_{t_{j}}$ , we can recursively update the data likelihood by

\begin{matrix} \hat{p} (y_{t_{j + 1}} | {\hat{M}}_{n}) \propto \hat{p} ({\hat{M}}_{n} | y_{t_{j}}) C ({\hat{M}}_{n, t_{j + 1}}) exp (\frac{{({\hat{M}}_{n, t_{j + 1}} - x_{t_{j + 1}})}^{2}}{2 σ^{2}}) \end{matrix}

This recursive update was not described in¹⁵ and allows us to do efficient adaptive learning by re-estimating the risk when more data is available.

Geometric deep learning

An alternative to using the SWCMF model¹⁵ for estimating latent state profiles is to use a geometric deep learning (GDL) approach based on¹⁶. We define the GDL risk estimate based on latent profiles derived with GDL and using (4) for risk predictions. To estimate latent profiles, the GDL leverages two similarity graphs where one encode similarities between screening histories and the other represents the temporal dependency of results. When estimating the latent state profiles, GDL use these graphs to determine the structure of the profiles.

In¹⁵ the authors used a k-nearest neighbour (NN) graph linking together similar histories in addition to a sequential graph for the temporal dependencies. In the k-NN graph, each node represents a screening history that is connected to k other most similar histories, where the similarity between histories is determined by some pre-defined measure. A potential drawback of the k-NN graph is that it each node has to have exactly k connections – even if one node is quite dissimilar from the others.

In this paper, we follow¹⁸ in constructing a graph with a variable number of connections for each node. This graph is learned directly from the data under a smoothness constraint where we assume that certain screening histories exhibit strong similarities to each other. The resulting graph will then contain nodes connecting together histories that are alike in the results and time of visits. Moreover, we assume that the risk of cancer development does not change rapidly within a year and use this to construct the second graph for the temporal dependency of results. Using these two graphs with the GDL we obtain latent state profiles that are changing slowly in time and reflect the similarities between screening histories in the data. The GDL approach is similar to the MF estimate except that they use different constraints to characterize the latent state profiles.

Hidden Markov model

The HMM risk estimate that we compare to the MF and GDL estimates is based on an extension of the HMM described in¹⁴ with a prediction module. In the HMM, each observed state is taken to originate from a discrete hidden state indicating the latent risk of cervical cancer development. Here we take the observed states to originate from some hidden states, labelled normal, low-risk and high-risk as in¹⁴. To define the HMM risk estimate, we first consider the probability

\begin{matrix} α (h_{t_{j}}) = p (h_{t_{j}} | y_{t_{j}}) \end{matrix}

of a female being in hidden state $h_{t_{j}}$ at time $t_{j}$ , conditioned on her screening history $y_{t_{j}}$ . We use this probability estimate to predict the risk at time $\hat{t} > t_{j}$ . To compute $α (h_{t_{j}})$ we initialize

\begin{matrix} α (h_{t_{0}}) = p (h | t_{0}) p (x_{t_{0}} | h_{t_{0}}, ρ_{t_{0}}) \end{matrix}

Here, $p (h | t_{0})$ is a prior over the hidden state at the time of the initial exam, and $p (x_{t_{0}} | h_{t_{0}}, ρ_{t_{0}})$ is the probability of the observed state conditioned on the medical exam and hidden state. The estimates for $p (h | t_{0})$ and $p (x_{t_{0}} | h_{t_{0}}, ρ_{t_{0}})$ are available from the parameters of the HMM in¹⁴. To reach $α (h_{t_{i}})$ for $t_{i} > t_{0}$ , we use our previous estimate $α (h_{t_{i - 1}})$ to compute the recursion

\begin{matrix} α (h_{t_{i}}) = p (x_{t_{i}} | h_{t_{i}}, ρ_{t_{i}}) \sum_{h_{t_{i - 1}}} p (h_{t_{i}} | h_{t_{i - 1}}) α (h_{t_{i - 1}}) . \end{matrix}

The transition probabilities between hidden states $p (h_{t_{i}} | h_{t_{i - 1}})$ are also given by the HMM parameters¹⁴.

Having used (5) to obtain our estimate for the hidden state probabilities at time $t_{j}$ , we predict the future risk at time $\hat{t} > t_{j}$ by approximating

\begin{matrix} p (x_{\hat{t}} = s | y_{t_{j}}) \propto \int_{ρ_{\hat{t}}} \int_{h_{\hat{t}}} \int_{h_{t_{j}}} p (x_{\hat{t}} = s | h_{\hat{t}}, ρ_{\hat{t}}) p (ρ_{\hat{t}} | h_{\hat{t}}) p (h_{\hat{t}} | h_{t_{j}}) α (h_{t_{j}}) d h_{t_{j}} d h_{\hat{t}} d ρ_{\hat{t}} . \end{matrix}

The probabilities $p (ρ_{\hat{t}} | h_{\hat{t}})$ we derive from the Poisson intensity estimates presented in¹⁴. To incorporate more data and update the HMM risk estimate, like we do with MF and GDL, we first update the $α$ estimate with (5) and then estimate the risk with (6).

Predicting the next state

Using any model to predict the risk of some female being in each state $s \in S$ gives a comprehensive overview of her risk. Predicting the exam result by classifying the female state from these risk estimates amounts to a multi-class classification problem. One approach to this task is to select the most probable state

\begin{matrix} {\hat{x}}_{\hat{t}} = arg max_{s \in S} \hat{p} (x_{\hat{t}} = s | y_{t}) . \end{matrix}

However, this method often fails to predict the minority states because data imbalance shifts the risk inference and classification towards normal. We refer to this classification rule as the default strategy as it selects the most probable state without considering data imbalance.

An alternative to the default strategy is to consider state-specific probability thresholds ${\{δ_{s} \in (0, 1)\}}_{s \in S}$ adapted to the skewed state distribution. To perform multi-class classification using these thresholds, we can construct a classification rule similar to¹⁵ where for each state we evaluate

\begin{matrix} \hat{p} (x_{\hat{t}} = s | y_{t}) \geq δ_{s} \Rightarrow {\hat{x}}_{\hat{t}} = s . \end{matrix}

If condition (7) holds we predict ${\hat{x}}_{\hat{t}} = s$ . We first evaluate (7) for s being the high-grade state and then the low-grade state. If the condition is not satisfied for either of these states, we predict normal. This means that we prioritize predicting high-grade over low-grade, and low-grade over and normal as we in our application is more tolerant towards false positives than false negatives.

Furthermore, taking $δ_{s} = δ_{s} (t)$ , we can adapt (7) to the label drift observed in our data by training a separate classifier for different female age intervals. Since the risk of HPV infection peaks in adolescence and early adulthood, and the risk of cervical cancer peaks in middle aged females⁷, we choose three age intervals: 20–35, 36–45 and 46–69+ for our experiments. Moreover, choosing only three age intervals we aim to avoid overfitting as increasing the number of intervals would also increase the risk of overfitting the classifier in each interval. We refer to using (7) with time-dependent thresholds as the adaptive strategy.

To derive the probability thresholds $δ (T_{k})$ for each female age interval $T_{k}$ , we maximize the K-category Matthews correlation coefficient (MCC)²⁰. The MCC summarizes the confusion matrix in a single score

\begin{matrix} R_{K} = \frac{n_{+} \times n - \sum_{s \in S} {\hat{n}}_{s} \times n_{s}}{\sqrt{(n^{2} - \sum_{s \in S} {\hat{n}}_{s}^{2}) \times (n^{2} - \sum_{s \in S} n_{s}^{2})}} \end{matrix}

to measure the quality of multi-class classifications. Here n is the total number of test samples, $n_{+}$ is the number of correct classifications, and $n_{s}$ and ${\hat{n}}_{s}$ are the number of times where state s was the ground truth and was correctly predicted, respectively. Higher $R_{K} \in [- 1, 1]$ means a more accurate classification. The thresholds $δ (T_{j})$ for age interval $T_{j}$ are obtained by computing

\begin{matrix} max_{δ (T_{k}) \in {(0, 1)}^{S}} R_{K} (x, \hat{x}) . \end{matrix}

This maximization problem is solved by using the differential evolution algorithm²¹.

Numerical experiments

The research for this study is approved by the South East Norway Regional Committee for Medical and Health Research Ethics (application ID: 11752). The health registry data used in this study does not originate from clinical trails and therefore the ethical committee granted this study with an exception from obtaining informed consent. All the research conducted herein accommodate the relevant guidelines and regulations.

In numerical experiments we study machine learning methods taking the individual screening history as input for cervical cancer risk prediction. The methods are based on: hidden Markov model¹⁴ (HMM), matrix factorization¹⁵ (MF), geometric deep learning¹⁶ (GDL), logistic regression⁹ (LR), random forest¹⁰ (RF) and gradient tree boosting¹¹ (GTB). To predict the individual risk of cervical cancer development we use (6) for the HMM estimate, and (4) for MF and GDL. Although LR, GTB and RF treat each exam result as independent we facilitate adaptive learning by re-fitting the models with additional data, using the current estimate for model parameters as initialization.

As input to the HMM, MF and GDL predictors we provide all the data up to six months prior to the result we want to predict. For MF and GDL, the input data consist of female states and the corresponding time stamps, while the HMM also utilize exam type information. The input features to LR, RF and GTB combine the cumulative counts of each state conditioned on the exam type over time, together with the corresponding time stamps. For LR we used Z-scoring with parameters estimated from a hold-out set to normalize the features. To derive the latent state profiles used by the MF and GDL estimators, we leverage the exam results and the time stamps from the 10K histories sampled for our training set to construct the input matrix. Moreover, data from the training set is also used to fit LR, RF and GTB.

To simulate an environment for doing longitudinal adaptive learning with MF, HMM and GDL, we masked parts of the screening histories in the test set with a moving window. This way we mimic histories growing over time as a female has more exams. We start by revealing only the first 2 results to fit the estimators, and move forward in time to predict the 3rd result. For any history with more than 3 results, we repeatedly update the model by including the previous data point before we move to predict at the next result.

The impact of data imbalance on risk estimation

After fitting the models, we assess how well their risk estimates are calibrated using the Brier score. This score measures the agreement between the predicted risk $\hat{p}$ and an indicator $o_{n, \hat{t}} (x_{n, \hat{t}} = s)$ for whether the result was actually $x_{n, \hat{t}} = s$ . We compute the Brier score over N cases as

\begin{matrix} B = \frac{1}{N} \sum_{n = 1}^{N} {(\hat{p} (x_{n, \hat{t}} = s | y_{n}) - o_{n, \hat{t}} (x_{n, \hat{t}} = s))}^{2} . \end{matrix}

In Table 1, we present Brier scores to evaluate the impact of class imbalance on model estimates. The scores were derived from model predictions aggregated over time and stratified by each ground truth state.

Table 1.

Brier scores stratified by female states.

Model	Normal	Low-grade	High-grade
MF	0.0830	0.644	0.700
HMM	0.0410	0.680	0.734
GDL	0.0430	0.683	0.863
GTB	0.0220	0.780	0.766
LR	0.0240	0.795	0.777
RF	0.0330	0.790	0.793

Open in a new tab

The prediction models are matrix factorization (MF), hidden Markov model (HMM), geometric deep learning (GDL) gradient tree boosting (GTB), logistic regression (LR), and random forest (RF).

Significant values are in bold.

From Table 1 we see that we have lower Brier scores for normal states and higher scores for low-grade and high-grade states, which indicates that the prediction models are strongly biased towards the normal state. Thus, the model estimates are clearly affected by the skew in the state distribution. The GDL is especially poor at high-grade predictions but improves on low-grade, while MF is the best calibrated on high-grade followed by HMM.

Probability thresholding for risk classification

One way to alleviate biased probability estimates in classification tasks is to use a classification rule adapted to the data imbalance when converting probabilities into class labels. Using the adaptive thresholding technique from “Predicting the risk of cervical cancer development”, we may also relax the effect of temporal drift in the state distribution by having a different classifier over female age intervals. In Fig. 2 we give the multi-class classification performance as $R_{K}$ scores achieved with the adaptive and the default classification strategies.

Classification performance as Matthews correlation coefficient ( $R_{K}$ ) over female age intervals. The prediction models are matrix factorization (MF), hidden Markov model (HMM), geometric deep learning (GDL) gradient tree boosting (GTB), logistic regression (LR), and random forest (RF), combined with either the adapted or default probability threshold method from “Predicting the risk of cervical cancer development”.

Comparing the $R_{K}$ scores in Fig. 2 indicates that classification thresholds adjusted to class imbalance improves model performance and is the favourable method over the default strategy, especially with older females. The MF and HMM attains the strongest prediction performance. which is consistent with the model calibration estimates in Table 1, while the GDL, GTB, LR and RF performances decreases more over the age intervals.

Evaluating classifier performance

To assess how well the classifiers reflect the trends in the observed data, we compare absolute risk curves from hold-out data and longitudinal model predictions, using the strategy (either default or adaptive probability thresholds) improving on the classification scores in Fig. 2. Here we give absolute risk as the proportion of each state measured over some small time interval of about 10 months. In Fig. 3, we plot risk curves derived from test data and from model predictions. Each row in the panel figure corresponds to a different prediction model and there is one column for each state to more easily distinguish between the curves visually. Stippled vertical lines indicate the age intervals 20–35, 36–45 and 46–69+. Note that the scale on the y-axis differs between the normal/low-grade and high-grade plots to better illustrate the model fit. The colored regions illustrate the difference between the observed (r(t)) and predicted absolute risk ( $\hat{r} (t)$ ) at time t. To quantify the relative deviation between the absolute risk curves, we define a performance indicator

\begin{matrix} η = \frac{\int |r (t) - \hat{r} (t)| d t}{\int r (t) d t} . \end{matrix}

Absolute risk estimated from observed data and model predictions. The $η$ score computed with (9) indicates model performance over female age intervals. The prediction models are matrix factorization (MF), hidden Markov model (HMM), geometric deep learning (GDL) gradient tree boosting (GTB), logistic regression (LR), and random forest (RF).

Ideally, $η = 0$ , implying perfect classification, while miss-classifications cause the predicted curve to deviate from the test curve, giving $η > 0$ .

The results in Fig. 3 indicate that the MF model is overall the most accurately calibrated against the trend in the reference curve from the hold-out registry screening data. The predictions from HMM and GDL improve over time, which may be attributed to an increasing amount of training data as older females have typically had more exams. The GTB, RF and LR estimates closely follow the reference curve for normal and low-grade but shows a large deviation in younger females which improves with older females.

Conclusions and future work

Machine learning methods for more targeted risk stratification can have a high utility to existing cervical cancer screening programs shifting to more personalized screening recommendations. However, deriving such methods from cancer registry data is challenging due to strong class imbalance and a non-stationary data distribution. In this paper, we compare machine learning models based on matrix factorization (MF), hidden Markov model (HMM), geometric deep learning (GDL), logistic regression (LR), random forest (RF) and gradient tree boosting (GTB) in cervical cancer risk estimation, using population-level data from the Cancer Registry of Norway.

To define the risk estimators based on HMM, MF and GDL, we extend existing methods with incremental learning mechanisms for longitudinal risk prediction. Results from numerical experiments showed that all the models studied herein suffered from data skewness and were strongly biased towards disease-free results. To predict the individual risk of cancer development we trained separate classifiers adapted to data imbalance over separate female age intervals. Comparing absolute risk curves derived from model predictions and hold-out data showed promising results for matrix factorization to capture the time-varying trend in the observed risk from the data. This methods may thus be useful to improve cervical cancer risk stratification for more personalized screening. We are currently working to elucidate the ability of predictions models to correctly predict individual females using a different representation of model performance.

The methods used in this paper may also be applied to data from other types of mass-screening programs such as breast, colorectal and prostate cancer. In this paper, we focus on using only the routinely collected cervical cancer registry data as we see this to currently have more societal impact and utility for improving healthcare delivery. Expanding the models to include data from more recent screening technology with additional biomarkers and, eventually, individual HPV vaccination status has the potential to improve model performance. In future work we will combine female lifestyle information with registry screening data, believing that including more detailed information about each individual can improve the risk prediction accuracy.

Author contributions

J.F.N. and G.S.R.E.L. conceived the experiments, G.S.R.E.L. conducted the experiments and M.G., V.N. and G.S.R.E.L. analyzed the results. All authors reviewed the manuscript.

Data availability

Due to individual privacy and ethical restrictions, the data used in this study are not publicly available. However, the data can be made available from the Cancer Registry of Norway pursuant the legal requirements mandated by the European GDPR.

Competing interests:

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Vaccarella S, et al. 50 years of screening in the Nordic countries: Quantifying the effects on cervical cancer incidence. Br. J. Cancer. 2014;111:965–969. doi: 10.1038/bjc.2014.362. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Pedersen K, et al. Advancing the evaluation of cervical cancer screening: Development and application of a longitudinal adherence metric. Eur. J. Public Health. 2017;27:1089–1094. doi: 10.1093/eurpub/ckx073. [DOI] [PubMed] [Google Scholar]
3.Perkins RB, et al. 2019 asccp risk-based management consensus guidelines for abnormal cervical cancer screening tests and cancer precursors. J. Lower Genital Tract Dis. 2020;24:102. doi: 10.1097/LGT.0000000000000525. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Rothberg MB, et al. A risk prediction model to allow personalized screening for cervical cancer. Cancer Causes Control. 2018;29:297–304. doi: 10.1007/s10552-018-1013-4. [DOI] [PubMed] [Google Scholar]
5.van der Waal D, et al. Risk prediction of cervical abnormalities: The value of sociodemographic and lifestyle factors in addition to HPV status. Prev. Med. 2020;130:105927. doi: 10.1016/j.ypmed.2019.105927. [DOI] [PubMed] [Google Scholar]
6.Cohen PA, Jhingran A, Oaknin A, Denny L. Cervical cancer. Lancet. 2019;393:169–182. doi: 10.1016/S0140-6736(18)32470-X. [DOI] [PubMed] [Google Scholar]
7.Schiffman M, Wentzensen N. Human papillomavirus infection and the multistage carcinogenesis of cervical cancer. Cancer Epidemiol. Prev. Biomark. 2013;22:553–560. doi: 10.1158/1055-9965.EPI-12-1406. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Castle PE, Schiffman M, Wheeler CM, Solomon D. Evidence for frequent regression of cervical intraepithelial neoplasia-grade 2. Obstet. Gynecol. 2009;113:18. doi: 10.1097/AOG.0b013e31818f5008. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Open, 2017).
10.Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]
11.Friedman JH. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001;29:1189–1232. doi: 10.1214/aos/1013203451. [DOI] [Google Scholar]
12.Krawczyk B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016;5:221–232. doi: 10.1007/s13748-016-0094-0. [DOI] [Google Scholar]
13.Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
14.Soper BC, Nygård M, Abdulla G, Meng R, Nygård JF. A hidden Markov model for population-level cervical cancer screening data. Stat. Med. 2020;39:3569–3590. doi: 10.1002/sim.8681. [DOI] [PubMed] [Google Scholar]
15.Langberg, G. S. R. E. et al. Matrix factorization for the reconstruction of cervical cancer screening histories and prediction of future screening results (2021). Accepted for minor revision. [DOI] [PMC free article] [PubMed]
16.Monti, F., Bronstein, M. M. & Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. arXiv preprintarXiv:1704.06803 (2017).
17.Stevens NT, Lu L. Comparing Kaplan-Meier curves with the probability of agreement. Stat. Med. 2020;39:4621–4635. doi: 10.1002/sim.8744. [DOI] [PubMed] [Google Scholar]
18.Gogineni, V. C. et al. Data-driven personalized cervical cancer risk prediction: A graph-perspective. In 2021 IEEE Statistical Signal Processing Workshop (SSP) 46–50 (IEEE, 2021).
19.Pedregosa F, et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]
20.Gorodkin J. Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem. 2004;28:367–374. doi: 10.1016/j.compbiolchem.2004.09.006. [DOI] [PubMed] [Google Scholar]
21.Storn R, Price K. Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997;11:341–359. doi: 10.1023/A:1008202821328. [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[CR1] 1.Vaccarella S, et al. 50 years of screening in the Nordic countries: Quantifying the effects on cervical cancer incidence. Br. J. Cancer. 2014;111:965–969. doi: 10.1038/bjc.2014.362. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Pedersen K, et al. Advancing the evaluation of cervical cancer screening: Development and application of a longitudinal adherence metric. Eur. J. Public Health. 2017;27:1089–1094. doi: 10.1093/eurpub/ckx073. [DOI] [PubMed] [Google Scholar]

[CR3] 3.Perkins RB, et al. 2019 asccp risk-based management consensus guidelines for abnormal cervical cancer screening tests and cancer precursors. J. Lower Genital Tract Dis. 2020;24:102. doi: 10.1097/LGT.0000000000000525. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Rothberg MB, et al. A risk prediction model to allow personalized screening for cervical cancer. Cancer Causes Control. 2018;29:297–304. doi: 10.1007/s10552-018-1013-4. [DOI] [PubMed] [Google Scholar]

[CR5] 5.van der Waal D, et al. Risk prediction of cervical abnormalities: The value of sociodemographic and lifestyle factors in addition to HPV status. Prev. Med. 2020;130:105927. doi: 10.1016/j.ypmed.2019.105927. [DOI] [PubMed] [Google Scholar]

[CR6] 6.Cohen PA, Jhingran A, Oaknin A, Denny L. Cervical cancer. Lancet. 2019;393:169–182. doi: 10.1016/S0140-6736(18)32470-X. [DOI] [PubMed] [Google Scholar]

[CR7] 7.Schiffman M, Wentzensen N. Human papillomavirus infection and the multistage carcinogenesis of cervical cancer. Cancer Epidemiol. Prev. Biomark. 2013;22:553–560. doi: 10.1158/1055-9965.EPI-12-1406. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Castle PE, Schiffman M, Wheeler CM, Solomon D. Evidence for frequent regression of cervical intraepithelial neoplasia-grade 2. Obstet. Gynecol. 2009;113:18. doi: 10.1097/AOG.0b013e31818f5008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Open, 2017).

[CR10] 10.Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]

[CR11] 11.Friedman JH. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001;29:1189–1232. doi: 10.1214/aos/1013203451. [DOI] [Google Scholar]

[CR12] 12.Krawczyk B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016;5:221–232. doi: 10.1007/s13748-016-0094-0. [DOI] [Google Scholar]

[CR13] 13.Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]

[CR14] 14.Soper BC, Nygård M, Abdulla G, Meng R, Nygård JF. A hidden Markov model for population-level cervical cancer screening data. Stat. Med. 2020;39:3569–3590. doi: 10.1002/sim.8681. [DOI] [PubMed] [Google Scholar]

[CR15] 15.Langberg, G. S. R. E. et al. Matrix factorization for the reconstruction of cervical cancer screening histories and prediction of future screening results (2021). Accepted for minor revision. [DOI] [PMC free article] [PubMed]

[CR16] 16.Monti, F., Bronstein, M. M. & Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. arXiv preprintarXiv:1704.06803 (2017).

[CR17] 17.Stevens NT, Lu L. Comparing Kaplan-Meier curves with the probability of agreement. Stat. Med. 2020;39:4621–4635. doi: 10.1002/sim.8744. [DOI] [PubMed] [Google Scholar]

[CR18] 18.Gogineni, V. C. et al. Data-driven personalized cervical cancer risk prediction: A graph-perspective. In 2021 IEEE Statistical Signal Processing Workshop (SSP) 46–50 (IEEE, 2021).

[CR19] 19.Pedregosa F, et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]

[CR20] 20.Gorodkin J. Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem. 2004;28:367–374. doi: 10.1016/j.compbiolchem.2004.09.006. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Storn R, Price K. Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997;11:341–359. doi: 10.1023/A:1008202821328. [DOI] [Google Scholar]

PERMALINK

Towards a data-driven system for personalized cervical cancer risk stratification

Geir Severin R E Langberg

Jan F Nygård

Vinay Chakravarthi Gogineni

Mari Nygård

Markus Grasmair

Valeriya Naumova

Abstract

Introduction

Data

Figure 1.

State of the art

Contribution

Predicting the risk of cervical cancer development

Matrix factorization

Geometric deep learning

Hidden Markov model

Predicting the next state

Numerical experiments

The impact of data imbalance on risk estimation

Table 1.

Probability thresholding for risk classification

Figure 2.

Evaluating classifier performance

Figure 3.

Conclusions and future work

Author contributions

Data availability

Competing interests:

Footnotes

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Towards a data-driven system for personalized cervical cancer risk stratification

Geir Severin R E Langberg

Jan F Nygård

Vinay Chakravarthi Gogineni

Mari Nygård

Markus Grasmair

Valeriya Naumova

Abstract

Introduction

Data

Figure 1.

State of the art

Contribution

Predicting the risk of cervical cancer development

Matrix factorization

Geometric deep learning

Hidden Markov model

Predicting the next state

Numerical experiments

The impact of data imbalance on risk estimation

Table 1.

Probability thresholding for risk classification

Figure 2.

Evaluating classifier performance

Figure 3.

Conclusions and future work

Author contributions

Data availability

Competing interests:

Footnotes

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases