eLife. 2023 May 25;12:e81752. doi: 10.7554/eLife.81752

Approximating missing epidemiological data for cervical cancer through Footprinting: A case study in India

Irene Man 1, Damien Georges 1, Maxime Bonjour 1, Iacopo Baussano 1
Editors: Belinda Nicolau2, Wendy S Garrett3
PMCID: PMC10212556  PMID: 37227260

Abstract

Local cervical cancer epidemiological data essential to project the context-specific impact of cervical cancer preventive measures are often missing. We developed a framework, hereafter named Footprinting, to approximate missing data on sexual behaviour, human papillomavirus (HPV) prevalence, or cervical cancer incidence, and applied it to an Indian case study. With our framework, we (1) identified clusters of Indian states with similar cervical cancer incidence patterns, (2) classified states without incidence data to the identified clusters based on similarity in sexual behaviour, (3) approximated missing cervical cancer incidence and HPV prevalence data based on available data within each cluster. Two main patterns of cervical cancer incidence, characterized by high and low incidence, were identified. Based on the patterns in the sexual behaviour data, all Indian states with missing data on cervical cancer incidence were classified to the low-incidence cluster. Finally, missing data on cervical cancer incidence and HPV prevalence were approximated based on the mean of the available data within each cluster. With the Footprinting framework, we approximated missing cervical cancer epidemiological data and made context-specific impact projections for cervical cancer preventive measures, to assist public health decisions on cervical cancer prevention in India and other countries.

Research organism: Viruses

Introduction

Cervical cancer is an important source of disease burden worldwide (de Martel et al., 2017). In 2020, the number of new cases and deaths due to cervical cancer worldwide were estimated to be 604,000 and 342,000, respectively (Sung et al., 2021). Vaccination against human papillomavirus (HPV), cervical cancer screening, and treatment of pre-cancer and cancer can reduce the burden of cervical cancer (Lei et al., 2020; Bouvard et al., 2021), but access to these preventive measures is still limited in many settings, especially in low- and middle-income countries (LMICs) (Bruni et al., 2021; de Sanjose and Tsu, 2019; Bonjour et al., 2021). To accelerate the scale-up of cervical cancer prevention worldwide, the World Health Organization developed a global strategy to eliminate cervical cancer as a public health problem (WHO, 2021). The strategy proposes an elimination target of 4 cases per 100,000 women-years (age-standardized) with three intervention targets: 90% of girls vaccinated against HPV by age 15; 70% of women receiving twice-lifetime screening with high-performance testing; and 90% of women having access to cervical pre-cancer and cancer treatment, and palliative care.

For the WHO’s aspirational global targets to be perceived as realistic, achievable, and equitable, they must be adapted to local context (Tsu, 2020). The local need for and impact of cervical cancer preventive measures depend on the burden in a given population, which is determined by context-specific sexual behaviour and HPV prevalence (Guan et al., 2012). Local data on these aspects are therefore crucial to derive projections of the health and economic impact of possible interventions. When based on adequate data, impact projections of cervical cancer preventive measures can help local health authorities set appropriate public health targets and allocate resources accordingly (Goldie et al., 2006).

However, local epidemiological data for cervical cancer needed to derive impact projections are sometimes missing. High-quality type- and age-specific data on HPV prevalence and cervical cancer incidence from local populations are often unavailable. The same holds for adequate data on sexual behaviour (e.g. data on sex outside marriage), which are also prone to biases such as social desirability bias and recall bias (Morris et al., 2014; Kelly et al., 2013). When essential epidemiological data for projections are missing, there are two main possible solutions: collection and approximation. Collection of new data in a local context would be the preferred option. However, this can be time- and resource-demanding and is therefore not always feasible. Alternatively, missing data on a given population can be approximated using available data from populations sharing similar characteristics.

In this paper, we propose a framework, hereafter named Footprinting, to approximate missing cervical cancer epidemiological data for a selected number of geographical units within a larger geographical target area, to derive impact projections of cervical cancer prevention. The framework is presented using a case study in India, the country with the world’s highest expected burden of cervical cancer (Bonjour et al., 2021) and very limited access to cervical cancer preventive measures (Sankaranarayanan et al., 2019). To assist local public health decision-making in India, we applied Footprinting to approximate missing Indian state-specific cervical cancer incidence and HPV prevalence data and so to enable impact projections of cervical cancer preventive measures with state-specific granularity.

Results

Footprinting framework

We developed a framework, Footprinting, to approximate missing data on the three key aspects of cervical cancer epidemiology: sexual behaviour, HPV prevalence, and cervical cancer incidence. For convenience of explaining the framework, missing data across geographical units are for the moment assumed to occur in a hierarchical manner, that is, geographical units are ordered according to their levels of data availability (Figure 1). At the highest level, there are a small number of geographical units for which data on all three key aspects are available. For the Indian case study, there were two such states (out of 25 states or groups of states) with the high-quality type- and age-specific HPV prevalence data needed for impact projections (Franceschi et al., 2005; Dutta et al., 2012; Kataria et al., 2022). The remaining geographical units are further divided into levels of intermediate and low data availability. In the Indian case study, there were 12 states with cervical cancer incidence (Bray et al., 2017; Report of National Cancer Registry Programme, 2020) and sexual behaviour data, which were assigned to the intermediate level of data availability. The remaining 11 states only had data on sexual behaviour (National Behavioural Surveillance Survey: General Population, 2006) and were assigned to the low level of data availability. The three data sources with increasing data availability are labelled as ‘Bottleneck’, ‘Pattern’, and ‘Footprint’ data (Figure 1). See Figure 1—source data 1 for the definition of the 25 states and a detailed overview of data availability by state.

Figure 1. Hierarchical structure of availability of cervical cancer epidemiological data.


Figure 1—source data 1. Overview of availability of cervical cancer epidemiological data by state.

To address the hierarchical form of missing data, we propose a three-step approach labelled as ‘Clustering’, ‘Classification’, and ‘Projection’. In brief, the approach first identifies clusters of geographical units sharing similar patterns of cervical cancer epidemiology and then uses the available data within each cluster to approximate data and extrapolate impact projections to geographical units with lower data availability. The details of the respective steps are as follows:

  1. Clustering step

  • In the Clustering step, clusters of geographical units sharing similar patterns of cervical cancer epidemiology are identified. This step corresponds to unsupervised learning in machine learning terminology (James et al., 2013). Clustering should be based on a source of Pattern data with sufficiently large coverage of the geographical units. In the Indian case study, cervical cancer incidence data were available in 14 out of 25 Indian states and were therefore suitable. As a constraint, each of the resulting clusters must contain at least one state with high data availability. This ensures that each Indian state with an intermediate level of data availability is matched with a state with a high level of data availability, which is needed to approximate missing data.

  2. Classification step

  • In the second step, geographical units with the lowest level of data availability, which have not yet been clustered, are classified into the identified clusters. This step corresponds to supervised learning in machine learning terminology (James et al., 2013). Classification is based on the similarity between geographical units according to the Footprint data, which should be available for the remaining unclustered geographical units. In the Indian case study, sexual behaviour data were suitable for this purpose. As in the Clustering step, the Classification step matches each Indian state with the lowest level of data availability to states with higher levels of data availability within the same cluster, in order to approximate the missing cervical cancer incidence and HPV prevalence data.

  3. Projection step

  • In the last step, missing data are approximated based on available data from other geographical units within the same cluster, for example, based on the mean or median of the available data. If the Classification step also provides the probability of belonging to each cluster, approximation can instead be based on weighted averages across clusters. With the approximated data, it is then possible to calibrate projection models, that is, HPV transmission and cervical cancer progression models, and derive context-specific impact projections for each geographical unit. Alternatively, as a less computationally demanding approach, it is also possible to calibrate projection models for the geographical units of the highest level of data availability only, and then scale the projections to the other geographical units within each cluster.
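The approximation in the Projection step can be sketched as follows. This is a minimal illustration rather than the implementation used in the case study: the unit names, incidence values, and class probabilities are hypothetical, and the helper functions `cluster_means` and `weighted_approximation` are names introduced here for illustration only.

```python
from statistics import mean

# Hypothetical age-specific incidence (cases per 100,000 women-years) for
# geographical units that already have data; values are illustrative only.
observed = {
    "unit_A": {"cluster": "low", "incidence": [10.0, 20.0, 30.0]},
    "unit_B": {"cluster": "low", "incidence": [14.0, 24.0, 34.0]},
    "unit_C": {"cluster": "high", "incidence": [30.0, 50.0, 70.0]},
}


def cluster_means(units):
    """Average the available incidence curves, age group by age group, within each cluster."""
    by_cluster = {}
    for unit in units.values():
        by_cluster.setdefault(unit["cluster"], []).append(unit["incidence"])
    return {
        label: [mean(values) for values in zip(*curves)]
        for label, curves in by_cluster.items()
    }


def weighted_approximation(means, probabilities):
    """Blend the cluster means using the probabilities of belonging to each cluster."""
    n_groups = len(next(iter(means.values())))
    return [
        sum(probabilities[label] * means[label][i] for label in means)
        for i in range(n_groups)
    ]


means = cluster_means(observed)
# A unit classified to the low cluster simply inherits that cluster's mean curve.
hard_estimate = means["low"]
# Alternative: weight the cluster means by (hypothetical) class probabilities.
soft_estimate = weighted_approximation(means, {"low": 0.75, "high": 0.25})
print(hard_estimate)
print(soft_estimate)
```

The weighted variant only makes sense when the classifier returns class probabilities; with hard cluster labels, the cluster mean (or median) is the natural choice.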

As previously mentioned, we assumed that data availability occurs hierarchically. However, the framework can also be applied with less stringent data requirements. Firstly, the source of Footprint data does not necessarily need to cover all geographical units. It is possible to train a classifier in the classification step with Footprint data available for only a part of clustered geographical units. Secondly, if none of the key cervical cancer epidemiological data (sexual behaviour, HPV prevalence, and cervical cancer incidence data) have large enough coverage to serve as Footprint data, alternative indicators of similarity, such as human development index and geographical distance, could also be used as substitutes. However, this might result in suboptimal classification, as we expect these indicators to correlate less well with cervical cancer risk. Finally, for the projection step, data on cervical cancer incidence, sexual behaviour, and HPV prevalence needed to calibrate projection models do not necessarily need to belong to the same geographical unit. Calibration can be performed as long as the three types of data are available within each cluster.

With these less stringent data requirements, the proposed framework should be sufficiently flexible to be applied to many situations. However, one should still be cautious in applying the framework when little data are available. If the data are not sufficiently granular, one might need to exclude geographical units with insufficient data or redefine bigger geographical units. Furthermore, one should assess the goodness-of-fit of the obtained clustering, performance of classification, correlation of data within different clusters, and calibration fits to ensure the validity of the final impact projections.

Clustering of cervical cancer incidence patterns in the Indian case study

As some Indian states have multiple cancer registries, we first obtained clusters of registry-specific cervical cancer incidence. See Materials and methods for the description of the source of cervical cancer incidence data and the statistical method used for clustering. Clusterings of registry-specific cervical cancer incidence were obtained for up to four prefixed clusters (Figure 2). Model fit improved substantially when increasing the number of prefixed clusters from two to three, with the Bayesian information criterion (BIC) decreasing from 6933 to 5700 (Table 1). A further increase in the number of prefixed clusters to four led to only a marginal improvement in model fit, with a small reduction in BIC from 5700 to 5532, and produced a poorly defined cluster of only one registry. With the number of prefixed clusters set at five, the clustering method no longer converged. We concluded that two or three clusters were appropriate to describe the patterns of cervical cancer incidence in the available data.
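The comparison of model fit can be made explicit with a short calculation. The sketch below uses only the BIC values reported in Table 1; the function name `bic_reductions` is introduced here for illustration.

```python
# BIC values reported in Table 1, keyed by the number of prefixed clusters.
bic = {2: 6933, 3: 5700, 4: 5532}


def bic_reductions(values):
    """Return the absolute and relative BIC reduction for each added cluster."""
    ks = sorted(values)
    out = {}
    for prev, nxt in zip(ks, ks[1:]):
        drop = values[prev] - values[nxt]
        out[(prev, nxt)] = (drop, drop / values[prev])
    return out


reductions = bic_reductions(bic)
for (prev, nxt), (drop, rel) in reductions.items():
    print(f"{prev} -> {nxt} clusters: BIC drops by {drop} ({rel:.1%})")
```

Going from two to three clusters lowers the BIC by 1233, whereas adding a fourth cluster lowers it by only 168, which is the contrast described in the text.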

Figure 2. Identified clusters of registry-specific cervical cancer incidence.

Clusterings under (A) 2, (B) 3, and (C) 4 prefixed clusters. Each panel within a row corresponds to a cluster within a k-clustering, with the cluster label given on top of the panel. The cervical cancer incidence data were extracted from volume XI of Cancer Incidence in Five Continents (CI5) (Bray et al., 2017) and the 2012–2016 report by the Indian National Centre for Disease Informatics and Research (NCDIR) (Report of National Cancer Registry Programme, 2020). Black: cluster mean of cervical cancer incidence; dark grey: registry incidence assigned to the cluster; light grey: registry incidence assigned to other clusters.

Figure 2—source data 1. Registry-specific cervical cancer incidence data from Cancer Incidence in Five Continents (CI5) and National Centre for Disease Informatics and Research (NCDIR).
Figure 2—source data 2. Estimated model parameters under Poisson regression models.


Figure 2—figure supplement 1. Registry-specific cervical cancer incidence data from Cancer Incidence in Five Continents (CI5) and National Centre for Disease Informatics and Research (NCDIR).


See Figure 1—source data 1 for whether registries belong to CI5 or NCDIR.
Figure 2—figure supplement 2. Mean age-specific cervical cancer incidence by cluster.


Table 1. Estimated parameters of clusters of cervical cancer incidence patterns.

Number of prefixed clusters | BIC* | Cluster label i | Number (%) of registries in cluster | Maximum incidence† | Maximum incidence pattern | Maximum incidence age group‡ | Maximum incidence age group pattern
2 | 6933 | 1 | 27 (82%) | 47 cases | Low | 60–64 years | Late
2 | 6933 | 2 | 6 (18%) | 91 cases | High | 55–59 years | Early
3 | 5700 | 1 | 19 (58%) | 38 cases | Low | 60–64 years | Late
3 | 5700 | 2 | 5 (15%) | 92 cases | High | 55–59 years | Early
3 | 5700 | 3 | 9 (27%) | 64 cases | Intermediate | 60–64 years | Late
4 | 5532 | 1 | 18 (55%) | 39 cases | Low | 60–64 years | Late
4 | 5532 | 2 | 5 (15%) | 92 cases | High | 55–59 years | Early
4 | 5532 | 3 | 9 (27%) | 64 cases | Intermediate | 60–64 years | Late
4 | 5532 | 4 | 1 (3%) | 20 cases | Very low | 60–64 years | Early

* Bayesian information criterion for evaluating the goodness-of-fit of obtained clustering.

† Maximum incidence given in cases per 100,000 women-years.

‡ Five-year age group in which the maximum incidence is located.

The cervical cancer incidence clusters differed in terms of magnitude of incidence and location of maximum incidence (Figure 2). When allowing two clusters, cluster 1 had a low maximum incidence of 47 cases per 100,000 women-years at age group 60–64 years, compared to cluster 2 with its higher maximum incidence of 91 cases per 100,000 women-years at the earlier age group of 55–59 years (Figure 2, Table 1). When allowing three clusters, we observed an additional cluster characterized by intermediate maximum incidence of 64 cases per 100,000 women-years at age group 60–64 years (Figure 2, Table 1). This third cluster mainly consisted of registries that had previously been assigned to the low-incidence cluster, that is, cluster 1 of the 2-clustering, while having a relatively high incidence (Figure 2, Table 2). See Figure 2—source data 2 for additional details of the obtained clusters.

Table 2. Clustering of cervical cancer incidence of Indian states based on clustering of registries.

State/group of states*
2-Clustering 3-Clustering 4-Clustering
1(low, late) 2(high, early) 1(low, late) 2(high, early) 3(interm., late) 1(low, late) 2(high, early) 3(interm., late) 4(very low, early)
Andhra Pradesh
Assam ●●● ●●● ●●
Delhi
Gujarat+Dadra and Nagar Haveli
Karnataka
Kerala +Lakshadweep ●● ●● ●●
Madhya Pradesh
Maharashtra ●●●●●● ●●●● ●●● ●●●● ●●●
Manipur ●● ●● ●●
Other North Eastern states
●●●●● ●●●● ●●●● ●●●● ●●●● ●●●●
Punjab +Chandigarh
Sikkim
Tamil Nadu +Puducherry
West Bengal +Andaman and Nicobar Islands

Each circle represents the count of one registry being assigned to the corresponding cluster. Grey shading represents the cluster including the highest number of registries, either exclusively or in a draw with another cluster.

Cluster labels and the corresponding patterns of maximum incidence and maximum incidence age group given in the second row were defined in the third, sixth, and eighth columns of Table 1, respectively.

* States or groups of states were defined as reported in the 2006 National Behaviour Surveillance Survey of the National AIDS Control Organization of India (National Behavioural Surveillance Survey: General Population, 2006).

Other North Eastern states included Arunachal Pradesh, Nagaland, Meghalaya, Mizoram, and Tripura.

The registry clusters were then used to derive clusters of Indian states based on the majority rule (Table 2). When using the 2-clustering of registries, none of the states were exclusively attributed to cluster 2; hence, the 2-clustering could not be used for the Classification step. When using the 3-clustering, again none of the states were exclusively attributed to cluster 2; however, 8 and 4 states were assigned to clusters 1 and 3, that is, the clusters with low and intermediate incidence, respectively. Hence, we combined clusters 2 and 3, that is, the clusters with high and intermediate incidence, into a new ‘high-incidence’ cluster, while keeping cluster 1 as the ‘low-incidence’ cluster. We note that, with these newly defined clusters, each cluster still contained at least one state with the highest level of data availability, which was necessary for the Projection step. However, with the new definition of clusters, we could no longer distinguish patterns of early or late peak of incidence.
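The majority rule and the subsequent merging of clusters can be sketched as below. The registry assignments are hypothetical placeholders, not the assignments of Table 2, and the treatment of ties (returned as `None` here) is an assumption of this sketch, since the text does not spell out how draws were resolved.

```python
from collections import Counter

# Merging of the 3-clustering labels described in the text:
# cluster 1 stays "low"; clusters 2 (high) and 3 (intermediate) become "high".
MERGED_LABEL = {1: "low", 2: "high", 3: "high"}


def majority_cluster(registry_labels):
    """Return the most frequent registry cluster, or None if the top count is tied."""
    ranked = Counter(registry_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: ties are left unresolved in this sketch
    return ranked[0][0]


# Hypothetical registry-level cluster labels per state (illustrative only).
registries_by_state = {
    "state_X": [1, 1, 3],
    "state_Y": [3, 3, 2],
    "state_Z": [1, 3],
}

state_clusters = {}
for state, labels in registries_by_state.items():
    winner = majority_cluster(labels)
    state_clusters[state] = MERGED_LABEL[winner] if winner is not None else None

print(state_clusters)
```

Applying the majority rule before merging mirrors the order described in the text; merging first and then taking the majority could give a different result for states whose registries are split across clusters 2 and 3.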

Classification of cervical cancer incidence patterns based on sexual behaviour data in the Indian case study

A random forest (RF) classifier was constructed using sexual behaviour data corresponding to the states with identified clusters. See Materials and methods for the details of the source of sexual behaviour data, the variables included, and the statistical method used for classification. The variables with the first, second, and third highest predictive values for cluster of cervical cancer incidence patterns were ‘proportion of urban male respondents reporting sex with non-regular partners in the last 12 months’, ‘median age of first sex in rural males’, and ‘median age of first sex in urban females’, respectively (Figure 3—source data 2). In particular, there was a good distinction between the high- and low-incidence clusters in terms of ‘proportion of urban male respondents reporting sex with non-regular partners in the last 12 months’, with high proportions associated with the high-incidence cluster (Figure 3). High values of ‘median age of first sex in females’ and low values of ‘median age of first sex in males’ were also associated with the high-incidence cluster, although the distinction was less clear.

Figure 3. Sexual behaviour data from National AIDS Control Organization (NACO) by Indian state.

Indian state-specific data on (A) median age of first sex, (B) proportion of respondents reporting sex with non-regular partners in the last 12 months, (C) proportion of male respondents reporting sex with commercial partners in the last 12 months, and (D) proportion of male respondents by number of commercial partners in the last 12 months. Each violin plot and the associated cloud of circles correspond to a sexual behaviour variable. Each circle corresponds to the data of a state (or group of states). The data were extracted from the 2006 National Behaviour Surveillance Survey of the National AIDS Control Organization of India (National Behavioural Surveillance Survey: General Population, 2006). Blue and red: Indian states identified in the high and low cervical cancer incidence clusters, respectively. Grey: states without cervical cancer incidence data and therefore with unknown cluster.

Figure 3—source data 1. Indian state-specific sexual behaviour data from National AIDS Control Organization (NACO).
Figure 3—source data 2. Predictive values of the sexual behaviour variables for cervical cancer incidence cluster.


Figure 3—figure supplement 1. Indian state-specific sexual behaviour data from National AIDS Control Organization (NACO).


States or groups of states as reported in the 2006 National Behaviour Surveillance Survey of the National AIDS Control Organization of India. Other North Eastern states include Arunachal Pradesh, Nagaland, Meghalaya, Mizoram, and Tripura.

The estimated out-of-bag error of the constructed classifier was 29%. When applying the constructed classifier to the Indian states with identified clusters, only Karnataka and other North Eastern states (2 of 14 states; 14%) were wrongly classified to the low-incidence cluster (Table 3). Visualization shows that the sexual behaviour data for these two states resemble those of the other states belonging to the low-incidence cluster, despite these two states being clustered into the high-incidence cluster (Figure 3, Figure 3—figure supplement 1).
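The validation count can be reproduced from the values listed in Table 3. The sketch below takes the identified cluster and the probability of belonging to the low-incidence cluster for the 14 states with incidence data, applies the 0.50 threshold stated in the table footnote, and lists the states whose classified cluster differs from the identified one. The function name `classify` is introduced here for illustration. Note that this is the error on the states used to build the classifier, which is a different quantity from the out-of-bag error of 29%.

```python
# (state, identified cluster, probability of the low-incidence cluster), from Table 3.
validation = [
    ("Andhra Pradesh", "Low", 0.60),
    ("Assam", "Low", 0.69),
    ("Delhi", "High", 0.42),
    ("Gujarat+Dadra and Nagar Haveli", "Low", 0.69),
    ("Karnataka", "High", 0.63),
    ("Kerala+Lakshadweep", "Low", 0.60),
    ("Madhya Pradesh", "High", 0.44),
    ("Maharashtra", "Low", 0.57),
    ("Manipur", "Low", 0.65),
    ("Other North Eastern states", "High", 0.53),
    ("Punjab+Chandigarh", "High", 0.41),
    ("Sikkim", "Low", 0.63),
    ("Tamil Nadu+Puducherry", "High", 0.38),
    ("West Bengal+Andaman and Nicobar Islands", "Low", 0.71),
]


def classify(p_low, threshold=0.50):
    """Assign the low-incidence cluster when its probability exceeds the threshold."""
    return "Low" if p_low > threshold else "High"


misclassified = [
    state for state, identified, p_low in validation if classify(p_low) != identified
]
error_rate = len(misclassified) / len(validation)

print(misclassified)
print(f"{len(misclassified)} of {len(validation)} states ({error_rate:.0%})")
```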

Table 3. Identified and classified cluster of cervical cancer incidence pattern by Indian state.

Cervical cancer incidence data* | State/group of states† | Identified cluster‡ | Classified cluster§ | Probability of belonging to the low-incidence cluster
Available | Andhra Pradesh | Low | Low | 0.60
Available | Assam | Low | Low | 0.69
Available | Delhi | High | High | 0.42
Available | Gujarat+Dadra and Nagar Haveli | Low | Low | 0.69
Available | Karnataka | High | Low | 0.63
Available | Kerala+Lakshadweep | Low | Low | 0.60
Available | Madhya Pradesh | High | High | 0.44
Available | Maharashtra | Low | Low | 0.57
Available | Manipur | Low | Low | 0.65
Available | Other North Eastern states¶ | High | Low | 0.53
Available | Punjab+Chandigarh | High | High | 0.41
Available | Sikkim | Low | Low | 0.63
Available | Tamil Nadu+Puducherry | High | High | 0.38
Available | West Bengal+Andaman and Nicobar Islands | Low | Low | 0.71
Unavailable | Bihar | - | Low | 0.67
Unavailable | Chhattisgarh | - | Low | 0.66
Unavailable | Goa+Daman and Diu | - | Low | 0.54
Unavailable | Haryana | - | Low | 0.66
Unavailable | Himachal Pradesh | - | Low | 0.58
Unavailable | Jammu and Kashmir | - | Low | 0.63
Unavailable | Jharkhand | - | Low | 0.71
Unavailable | Orissa | - | Low | 0.68
Unavailable | Rajasthan | - | Low | 0.66
Unavailable | Uttar Pradesh | - | Low | 0.64
Unavailable | Uttarakhand | - | Low | 0.69

* Availability of cervical cancer incidence data was based on the incidence data from volume XI of Cancer Incidence in Five Continents (CI5) (Bray et al., 2017) and the 2012–2016 report of the National Centre for Disease Informatics and Research (NCDIR) (Report of National Cancer Registry Programme, 2020).

† States/groups of states were defined as reported in the 2006 National Behaviour Surveillance Survey of the National AIDS Control Organization of India (National Behavioural Surveillance Survey: General Population, 2006).

‡ Identified clusters derived in the Clustering step.

§ Classified clusters derived in the Classification step. A given state was classified to the low-incidence cluster if the probability of belonging to the low-incidence cluster (given in the next column) was above 0.50. For the Indian states with available cervical cancer incidence data and hence already in an identified cluster, classification was done for the purpose of validation.

¶ Other North Eastern states included Arunachal Pradesh, Nagaland, Meghalaya, Mizoram, and Tripura.

Subsequently, the classifier was applied to classify the remaining states without cervical cancer incidence data and thus with unknown cluster. All 11 remaining Indian states received a higher probability of belonging to the low-incidence cluster (Table 3). Indeed, Figure 3 shows that the sexual behaviour data of the states with unknown cluster (indicated in grey) were generally closer to the sexual behaviour data of the states of the low-incidence cluster (indicated in red) than those of the high-incidence cluster (indicated in blue). Hence, we identified in total 19 and 6 states for the low- and high-incidence clusters, respectively.

Finally, missing cervical cancer incidence and HPV prevalence data were approximated based on the mean within each cluster (Figure 2—figure supplement 2). Approximation of HPV prevalence was based on the single prevalence survey we could identify per cluster (Franceschi et al., 2005; Dutta et al., 2012). We verified that the HPV prevalence reported by the survey corresponding to the high-incidence cluster was higher than the prevalence reported by the one corresponding to the low-incidence cluster: HPV prevalence of 16.9% vs 9.8% (in women in the age range 20–60 years). This 1.7-fold difference in HPV prevalence was of the same order of magnitude as the 1.9-fold difference we found for the age-standardized cervical cancer incidence between the two clusters (17.9 vs 9.01 cases per 100,000 women-years). The final step of deriving impact projections for cervical cancer preventive measures for the whole of India with state-specific granularity was reported elsewhere (de Carvalho et al., 2023; Man et al., 2022).

Discussion

In this paper, we developed the Footprinting framework to approximate missing cervical cancer epidemiological data in some geographical units when deriving impact projections of cervical cancer preventive measures for a larger geographical area. In brief, the framework identifies clusters of geographical units sharing similar patterns of cervical cancer epidemiology and uses the available data within each cluster to approximate data and extrapolate impact projections to geographical units with lower data availability. The framework was demonstrated using a case study approximating missing cervical cancer incidence and HPV prevalence data for a selection of Indian states. With the framework, we have derived, for the first time, impact projections of cervical cancer preventive measures for the whole of India with state-specific granularity (de Carvalho et al., 2023; Man et al., 2022).

This work has also generated a better understanding of cervical cancer epidemiology across India. We found that India can be divided into two main groups of 19 and 6 Indian states or groups of states that are characterized by low or high cervical cancer incidence, respectively. As expected, and in line with previous studies, individuals, in particular men, in high-incidence states had more sexual activity with non-regular partners, including commercial partners, than in low-incidence states (Vaccarella et al., 2006; Schulte-Frohlinde et al., 2022). While early sexual debut in women has also previously been suggested to be associated with high cervical cancer incidence and HPV positivity (Vaccarella et al., 2006; Schulte-Frohlinde et al., 2022), it was associated with lower cervical cancer incidence in the dataset we considered. We hypothesize that, for the Indian context, early sexual debut is common in states with a larger rural population, among whom less sexual activity occurs with non-regular partners, which is the main determining factor for a lower risk of cervical cancer. With the urbanization of rural areas, which often entails evolving socio-cultural norms, it is possible that more Indian states may shift to a high cancer incidence pattern, with an accompanying early peak in incidence (Baussano et al., 2016).

In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report (National Behavioural Surveillance Survey: General Population, 2006). We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available (e.g. age of sexual debut and number of sexual partners) so that the analyses can be easily applied to other settings (Vaccarella et al., 2006; Schulte-Frohlinde et al., 2022). In the Indian case study, the good classification performance shows that using the selected set was sufficient. As sexual behaviour variables are highly correlated, adding more variables might cause overfitting.

It should be noted that our Footprinting framework is similar to other extrapolation approaches previously used in model-based projection studies targeting large geographical areas with missing data, for example, a collection of LMICs or European countries (Brisson et al., 2020; Qendri et al., 2020). While the rationale behind different extrapolation approaches is similar, namely to approximate missing data from other geographical units with similar epidemiological indicators, there are also differences. Firstly, a strength of our framework is that it relies on the observed patterns of epidemiology in the data to select key geographical units from which impact projections are extrapolated to other units, instead of working with a predefined selection of key units. This allows the selection of key units that maximizes the representation of different epidemiological patterns in the data. Moreover, it also helps to pinpoint geographical units that could be interesting for future data collection efforts. Secondly, we used a newly developed clustering method (Subtil et al., 2017; Klich et al., 2021) that is able to assess the similarities between cervical cancer incidence of different geographical units based on the entire age-specific pattern, instead of clustering by age-standardized cervical cancer incidence or cervical cancer incidence in a certain age group only. Finally, we provided a detailed description of the clustering and classification steps and intermediate results, which makes them more reproducible and falsifiable.

Our application of Footprinting to the Indian case study also bears some resemblance to the extrapolation used for GLOBOCAN’s nationwide estimates of cervical cancer incidence in India (Bray et al., 2017). Essentially, in GLOBOCAN, missing incidence was extrapolated based on urban or rural residency as a footprint, while we used sexual behaviour for this purpose and considered each state separately, which is necessary for state-specific impact projections. As a result, we neglected the variation between rural and urban areas within Indian states, which is a limitation of our analysis. We expect that Footprinting with further stratification of states by rural/urban residency could improve the approximation. Furthermore, our nationwide estimate of cervical cancer incidence derived from aggregating the state-specific estimates (reported in a separate manuscript; Man et al., 2022) was lower than the estimate reported by GLOBOCAN. This could be explained by the use of different methods of extrapolation and the fact that we included data from 17 additional cancer registries with relatively low incidence not included in GLOBOCAN estimates.

Various possible adaptations of the proposed Footprinting framework are worth mentioning. Firstly, in the suboptimal situation where none of the relevant cervical cancer epidemiology data are available in some geographical units, data on indicators of human development and geographical location could be used as Footprint data. Secondly, while we focused on epidemiological data for cervical cancer, Footprinting could be used to approximate missing economic data (e.g. treatment or vaccine delivery costs) that are needed to assess the health economic impact of cervical cancer preventive measures, provided that relevant Footprint data can be defined and collected. It is important to note that, in general, the applicability of the proposed framework depends on the amount of data available. However, in our opinion, lack of data is a general challenge for approximating missing data, rather than a weakness particular to our methodology. By allowing possible adaptations, we believe that our framework is sufficiently flexible to effectively address missing data in many situations.

This work has provided a comprehensive framework for dealing with the important and ubiquitous challenge of missing data on cervical cancer epidemiology. By using the proposed framework, it is possible to derive robust and context-specific impact projections for cervical cancer preventive measures for a wide range of geographical settings. Such projections can assist local health authorities to plan and implement cervical cancer preventive strategies that are adapted to local needs and resources, intensifying efforts to reduce the high burden of cervical cancer still existing in many countries in low-resource settings.

Materials and methods

Data sources

In this section, we describe the data sources used in the Indian case study. The primary source of cervical cancer incidence data, which was used as Pattern data in the Clustering step, was cancer registry data from volume XI of Cancer Incidence in Five Continents (CI5) (Bray et al., 2017). It comprised incidence data from 16 cancer registries in 10 of the 25 Indian states (some states had more than one registry). In addition, cervical cancer incidence data were extracted from the 2012–2016 report by the Indian National Centre for Disease Informatics and Research (NCDIR) to provide data from 17 additional cancer registries not included in CI5 (Report of National Cancer Registry Programme, 2020). When data from a registry were reported in both CI5 and NCDIR, we only used the data from CI5. Combining the two sources provided incidence data for 33 registries in 14 Indian states. Cervical cancer incidence was reported as the number of cases per 100,000 women-years, stratified by 5-year age groups from 15 to 79 years. See Figure 2—figure supplement 1 and Figure 2—source data 1 for the extracted incidence data by state.
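The rule for combining the two registry sources can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the function name `combine_registries`, the registry names, and the incidence values are hypothetical and serve only to show the "CI5 takes precedence" rule described above.

```python
def combine_registries(ci5, ncdir):
    """Merge two {registry: incidence} mappings; CI5 wins for shared registries."""
    combined = dict(ncdir)   # start from the NCDIR entries
    combined.update(ci5)     # CI5 entries overwrite any overlapping registry
    return combined


# Hypothetical registries and age-specific incidence values (per 100,000
# women-years), for illustration only.
ci5 = {"Registry A": [0.5, 4.0, 12.0], "Registry B": [0.2, 3.1, 9.8]}
ncdir = {"Registry B": [0.3, 3.0, 10.1], "Registry C": [0.1, 2.2, 7.5]}

combined = combine_registries(ci5, ncdir)
print(sorted(combined))        # registries available after combining
print(combined["Registry B"])  # the CI5 values are retained for the overlap
```

In this toy input, one registry appears in both sources, so the combined set has three registries rather than four.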

Sexual behaviour data, which were used as Footprint data in the Classification step, were from the report of the National Behavioural Surveillance Survey by the National AIDS Control Organization (NACO) of India in 2006, which was the most recent edition at the moment of writing (National Behavioural Surveillance Survey: General Population, 2006). Data for all 25 Indian states were available in the survey. Sexual behaviour data by Indian state, in the form of aggregate statistics of survey respondents, were available for the following four groups, comprising 12 variables in total:

  • Median age of first sex – stratified by residence (urban/rural) and sex (male/female), resulting in four variables.

  • Proportion of respondents reporting sex with non-regular partners in the last 12 months – stratified by residence (urban/rural) and sex (male/female), resulting in four variables.

  • Proportion of male respondents reporting sex with commercial partners in the last 12 months – stratified by residence (urban/rural), resulting in two variables.

  • Proportion of male respondents by number of commercial partners in the last 12 months – restricted to respondents with at least one commercial partner and divided into three categories (1/2–3/>3). As the three proportions always sum up to one and are therefore correlated, we omitted one category, resulting in two variables.

See Figure 3—figure supplement 1 and Figure 3—source data 1 for the extracted sexual behaviour data by state.
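The omission of one category from the three proportions of commercial partner numbers can be sketched as follows. This is an illustrative Python example under stated assumptions: the source does not say which category was omitted, so the `drop` argument, the function name, and the example proportions are hypothetical.

```python
def drop_redundant_category(proportions, drop):
    """Remove one category from proportions that sum to one.

    The dropped value carries no extra information, because it equals
    one minus the sum of the remaining proportions.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to one")
    return {k: v for k, v in proportions.items() if k != drop}


# Hypothetical proportions by number of commercial partners in one state.
partners = {"1": 0.5, "2-3": 0.3, ">3": 0.2}
kept = drop_redundant_category(partners, drop=">3")  # assumed choice
recovered = 1.0 - sum(kept.values())                 # omitted value, recovered
print(kept, round(recovered, 6))
```

Whichever category is dropped, the remaining two variables determine it exactly, which is why only two variables are retained from this group.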

Method to cluster cervical cancer incidence patterns

The statistical method employed in the Clustering step to cluster registry-specific cervical cancer incidence data was a Poisson-regression-based CEM clustering algorithm (Subtil et al., 2017; Klich et al., 2021), described in detail in Appendix 1. Briefly, clusters of age-specific cervical cancer incidence were obtained by likelihood-based optimization under a Poisson regression model. The Poisson regression model for each cluster was characterized by three parameters: an intercept, one parameter for age, and one for the square of age. This parametric form was chosen to match the general pattern of incidence by age, namely, increasing from zero incidence in the youngest age group, then decreasing after reaching the maximum incidence (Figure 2—figure supplement 1). Application of the clustering method required prefixing the number of clusters k. The goodness-of-fit of each k-clustering was evaluated based on the BIC. To transform the obtained clustering of registry-specific data into a clustering of Indian states, for states with multiple registries we assigned each state to the cluster that included the highest number of its registries, that is, according to a majority rule.
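The ingredients of this step can be sketched in Python under several assumptions. The sketch is not the authors' implementation (described in Appendix 1): it shows only the quadratic-in-age Poisson log-likelihood, the assignment of registries to the most likely cluster for given parameters, the standard BIC formula, and the majority rule. The re-estimation of the cluster parameters is omitted, and all names, parameter values, age scaling, and counts are hypothetical. The tie-breaking behaviour of the majority rule is also an assumption, as the source does not specify one.

```python
import math
from collections import Counter


def poisson_loglik(cases, person_years, ages, beta):
    """Poisson log-likelihood with log(rate) = b0 + b1*age + b2*age**2."""
    b0, b1, b2 = beta
    total = 0.0
    for y, n, a in zip(cases, person_years, ages):
        mu = n * math.exp(b0 + b1 * a + b2 * a ** 2)  # expected case count
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total


def classification_step(registries, ages, betas):
    """Assign each registry to the cluster with the highest log-likelihood."""
    assignment = {}
    for name, (cases, person_years) in registries.items():
        assignment[name] = max(
            betas,
            key=lambda c: poisson_loglik(cases, person_years, ages, betas[c]),
        )
    return assignment


def bic(loglik, n_params, n_obs):
    """Standard BIC: lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2.0 * loglik


def majority_cluster(registry_clusters):
    """Cluster containing most of a state's registries (ties: first seen)."""
    return Counter(registry_clusters).most_common(1)[0][0]


# Hypothetical inputs: age-group midpoints in decades, two clusters that
# differ only in the intercept, and two registries with made-up counts.
ages = [2.2, 4.2, 6.2]
betas = {"high": (-13.0, 2.0, -0.18), "low": (-14.0, 2.0, -0.18)}
registries = {
    "Registry A": ([8, 42, 54], [100_000] * 3),
    "Registry B": ([3, 15, 20], [100_000] * 3),
}

assignment = classification_step(registries, ages, betas)
state_cluster = majority_cluster(["low", "high", "low"])
print(assignment, state_cluster)
```

In a full CEM run, this assignment would alternate with refitting the three parameters of each cluster until the assignments stop changing, and the procedure would be repeated for each candidate k before comparing BIC values.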

Method to classify cervical cancer incidence patterns based on sexual behaviour data

In the Classification step, we assigned the remaining states without cervical cancer incidence data to the identified clusters based on RF using the sexual behaviour data as Footprint data (Breiman, 2001). The RF classifier was constructed using sexual behaviour data from states with identified clusters. The predictive value of each variable was evaluated with the mean decrease in accuracy, which expressed how much the accuracy of the model decreased if the variable was excluded. The performance of the classification step was validated both by the out-of-bag error estimate and by applying the constructed classifier to the sexual behaviour data from states with identified clusters. Subsequently, the constructed classifier was applied to the sexual behaviour data from states without identified clusters, providing the probability to belong to each cluster. Each state was classified to the cluster receiving the highest probability. Classification was performed using the R package party version 1.3-7 with the following setting: cforest_control(teststat = 'quad', testtype = 'Univariate', mincriterion = 0.9, ntree = 50000, mtry = 3, maxdepth = 2, minsplit = 0, minbucket = 0).
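The final assignment rule and the out-of-bag check can be illustrated with a short Python sketch. It does not reproduce the conditional inference forest of the party package. It assumes that the votes of the individual trees are already available and uses the share of votes as a simple stand-in for the cluster probability, which may differ from how party computes probabilities. All function names, state names, and votes are hypothetical.

```python
from collections import Counter


def cluster_probabilities(tree_votes, clusters):
    """Share of trees voting for each cluster (a simple probability proxy)."""
    counts = Counter(tree_votes)
    return {c: counts[c] / len(tree_votes) for c in clusters}


def classify(probabilities):
    """Cluster receiving the highest estimated probability."""
    return max(probabilities, key=probabilities.get)


def oob_error(true_cluster, oob_votes, clusters):
    """Fraction of training states misclassified by their out-of-bag trees.

    oob_votes[state] holds only the votes of trees whose bootstrap sample
    did not contain that state.
    """
    wrong = sum(
        classify(cluster_probabilities(votes, clusters)) != true_cluster[state]
        for state, votes in oob_votes.items()
    )
    return wrong / len(oob_votes)


clusters = ["high", "low"]

# Hypothetical votes of 50 trees for one state without incidence data.
probs = cluster_probabilities(["low"] * 38 + ["high"] * 12, clusters)
predicted = classify(probs)

# Hypothetical out-of-bag votes for three states with known clusters.
truth = {"State 1": "high", "State 2": "low", "State 3": "low"}
oob = {
    "State 1": ["high", "high", "low"],
    "State 2": ["low", "low"],
    "State 3": ["high", "high", "low"],
}
error = oob_error(truth, oob, clusters)
print(probs, predicted, error)
```

Because each state is judged only by trees that did not see it during training, the out-of-bag error gives an internal estimate of classification performance without a separate validation set.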

Finally, in the projection step, missing data on cervical cancer incidence and HPV prevalence were approximated based on the mean within each cluster. Results were validated based on the ratios of HPV prevalence and cervical cancer incidence across clusters. Derivation of impact projections was reported elsewhere (de Carvalho et al., 2023; Man et al., 2022).
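The cluster-mean approximation can be sketched in Python as follows. The sketch assumes an unweighted mean over the states with data in each cluster and a single summary value per state. The source states only that the mean within each cluster was used, so the weighting, the function name, and the example values are assumptions.

```python
from statistics import mean


def approximate_by_cluster_mean(values, cluster_of):
    """Replace missing (None) values by the mean of observed values
    among states in the same cluster."""
    observed = {}
    for state, value in values.items():
        if value is not None:
            observed.setdefault(cluster_of[state], []).append(value)
    cluster_means = {c: mean(vs) for c, vs in observed.items()}
    return {
        state: value if value is not None else cluster_means[cluster_of[state]]
        for state, value in values.items()
    }


# Hypothetical incidence values (per 100,000 women-years); None marks a
# state without registry data.
incidence = {"State A": 10.0, "State B": 14.0, "State C": None, "State D": 30.0}
cluster_of = {"State A": "low", "State B": "low", "State C": "low", "State D": "high"}

filled = approximate_by_cluster_mean(incidence, cluster_of)
print(filled["State C"])
```

For age-specific data, the same function would be applied separately to each age group. A cluster without any observed value cannot be approximated this way, and the sketch raises a `KeyError` in that case.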

Acknowledgements

This study was funded by the Bill and Melinda Gates Foundation (grant numbers: OPP48979; INV-039876). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. For the authors identified as personnel of the International Agency for Research on Cancer or World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer or World Health Organization. The designations used and the presentation of the material in this article do not imply the expression of any opinion whatsoever on the part of WHO and the IARC about the legal status of any country, territory, city, or area, or of its authorities, or concerning the delimitation of its frontiers or boundaries.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Irene Man, Email: mani@iarc.who.int.

Belinda Nicolau, McGill University, Canada.

Wendy S Garrett, Harvard T.H. Chan School of Public Health, United States.

Funding Information

This paper was supported by the following grants:

  • Bill and Melinda Gates Foundation OPP48979 to Iacopo Baussano.

  • Bill and Melinda Gates Foundation INV-039876 to Iacopo Baussano.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing.

Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – review and editing.

Software, Validation, Methodology, Writing – review and editing.

Conceptualization, Resources, Formal analysis, Supervision, Validation, Investigation, Methodology, Writing – review and editing.

Additional files

Supplementary file 1. Appendix 1 - Poisson-regression-based CEM clustering algorithm.
elife-81752-supp1.docx (38.7KB, docx)
MDAR checklist

Data availability

All data used in the present study were openly available and extracted from http://ci5.iarc.fr for the cervical cancer incidence data published by the International Agency for Research on Cancer, from https://www.ncdirindia.org/All_Reports/Report_2020/resources/NCRP_2020_2012_16.pdf for the cervical cancer incidence data published by the National Centre for Disease Informatics and Research of India, and from https://www.aidsdatahub.org/sites/default/files/resource/national-bss-general-population-india-2006.pdf for the sexual behavior data published by the National AIDS Control Organisation, Ministry of Health and Family Welfare, Government of India. The extracted cervical cancer incidence and sexual behavior data are provided in Figure 2—source data 1 and Figure 3—source data 1, respectively. The computer code for the Poisson-regression-based CEM clustering algorithm is available from the authors upon reasonable request. The random forest analysis was done with the open-source R package party, available at https://cran.r-project.org/web/packages/party/index.html (Hothorn et al., 2023).

References

  1. Baussano I, Lazzarato F, Brisson M, Franceschi S. Human papillomavirus vaccination at a time of changing sexual behavior. Emerging Infectious Diseases. 2016;22:18–23. doi: 10.3201/eid2201.150791.
  2. Bonjour M, Charvat H, Franco EL, Piñeros M, Clifford GM, Bray F, Baussano I. Global estimates of expected and preventable cervical cancers among girls born between 2005 and 2014: a birth cohort analysis. The Lancet Public Health. 2021;6:e510–e521. doi: 10.1016/S2468-2667(21)00046-3.
  3. Bouvard V, Wentzensen N, Mackie A, Berkhof J, Brotherton J, Giorgi-Rossi P, Kupets R, Smith R, Arrossi S, Bendahhou K, Canfell K, Chirenje ZM, Chung MH, Del Pino M, de Sanjosé S, Elfström M, Franco EL, Hamashima C, Hamers FF, Herrington CS, Murillo R, Sangrajrang S, Sankaranarayanan R, Saraiya M, Schiffman M, Zhao F, Arbyn M, Prendiville W, Indave Ruiz BI, Mosquera-Metcalfe I, Lauby-Secretan B. The IARC perspective on cervical cancer screening. The New England Journal of Medicine. 2021;385:1908–1918. doi: 10.1056/NEJMsr2030640.
  4. Bray F, Mery L, Piñeros M. Cancer incidence in five continents. CI5. 2017. http://ci5.iarc.fr
  5. Breiman L. Random forests. Machine Learning. 2001;45:5–32. doi: 10.1023/A:1010933404324.
  6. Brisson M, Kim JJ, Canfell K, Drolet M, Gingras G, Burger EA, Martin D, Simms KT, Bénard É, Boily M-C, Sy S, Regan C, Keane A, Caruana M, Nguyen DTN, Smith MA, Laprise J-F, Jit M, Alary M, Bray F, Fidarova E, Elsheikh F, Bloem PJN, Broutet N, Hutubessy R. Impact of HPV vaccination and cervical screening on cervical cancer elimination: a comparative modelling analysis in 78 low-income and lower-middle-income countries. The Lancet. 2020;395:575–590. doi: 10.1016/S0140-6736(20)30068-4.
  7. Bruni L, Saura-Lázaro A, Montoliu A, Brotons M, Alemany L, Diallo MS, Afsar OZ, LaMontagne DS, Mosina L, Contreras M, Velandia-González M, Pastore R, Gacic-Dobo M, Bloem P. HPV vaccination introduction worldwide and WHO and UNICEF estimates of national HPV immunization coverage 2010-2019. Preventive Medicine. 2021;144:106399. doi: 10.1016/j.ypmed.2020.106399.
  8. de Carvalho TM, Man I, Georges D, Saraswati LR, Bhandari P, Kataria I. Health Economic Impact of the Introduction of Single-Dose HPV Vaccination in India. medRxiv. 2023. doi: 10.1101/2023.04.14.23288563.
  9. de Martel C, Plummer M, Vignat J, Franceschi S. Worldwide burden of cancer attributable to HPV by site, country and HPV type. International Journal of Cancer. 2017;141:664–670. doi: 10.1002/ijc.30716.
  10. de Sanjose S, Tsu VD. Prevention of cervical and breast cancer mortality in low- and middle-income countries: a window of opportunity. International Journal of Women’s Health. 2019;11:381–386. doi: 10.2147/IJWH.S197115.
  11. Dutta S, Begum R, Mazumder Indra D, Mandal SS, Mondal R, Biswas J, Dey B, Panda CK, Basu P. Prevalence of human papillomavirus in women without cervical cancer: a population-based study in eastern India. International Journal of Gynecological Pathology. 2012;31:178–183. doi: 10.1097/PGP.0b013e3182399391.
  12. Franceschi S, Rajkumar R, Snijders PJF, Arslan A, Mahé C, Plummer M, Sankaranarayanan R, Cherian J, Meijer CJLM, Weiderpass E. Papillomavirus infection in rural women in southern India. British Journal of Cancer. 2005;92:601–606. doi: 10.1038/sj.bjc.6602348.
  13. Goldie SJ, Goldhaber-Fiebert JD, Garnett GP. Chapter 18: public health policy for cervical cancer prevention: the role of decision science, economic evaluation, and mathematical modeling. Vaccine. 2006;24:S155–S163. doi: 10.1016/j.vaccine.2006.05.112.
  14. Guan P, Howell-Jones R, Li N, Bruni L, de Sanjosé S, Franceschi S, Clifford GM. Human papillomavirus types in 115,789 HPV-positive women: a meta-analysis from cervical infection to cancer. International Journal of Cancer. 2012;131:2349–2359. doi: 10.1002/ijc.27485.
  15. Hothorn T, Hothorn K, Strobl C, Zeileis A. Party: a laboratory for recursive partytioning. Version 1.3-13. CRAN. 2023. https://cran.r-project.org/web/packages/party/index.html
  16. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. Springer; 2013.
  17. Kataria I, Bhandari P, Saraswati LR, Siddiqui M, Sankaranarayanan R. Review of HPV prevalence data in India by RTI International. Personal communication. 2022.
  18. Kelly CA, Soler-Hampejsek E, Mensch BS, Hewett PC. Social desirability bias in sexual behavior reporting: evidence from an interview mode experiment in rural Malawi. International Perspectives on Sexual and Reproductive Health. 2013;39:014–021. doi: 10.1363/3901413.
  19. Klich A, Ecochard R, Subtil F. Trajectory clustering using mixed classification models. Statistics in Medicine. 2021;40:3425–3439. doi: 10.1002/sim.8975.
  20. Lei J, Ploner A, Elfström KM, Wang J, Roth A, Fang F, Sundström K, Dillner J, Sparén P. HPV vaccination and the risk of invasive cervical cancer. New England Journal of Medicine. 2020;383:1340–1348. doi: 10.1056/NEJMoa1917338.
  21. Man I, Georges D, de Carvalho TM, Ray Saraswati L, Bhandari P, Kataria I, Siddiqui M, Muwonge R, Lucas E, Berkhof J, Sankaranarayanan R, Bogaards JA, Basu P, Baussano I. Evidence-based impact projections of single-dose human papillomavirus vaccination in India: a modelling study. The Lancet Oncology. 2022;23:1419–1429. doi: 10.1016/S1470-2045(22)00543-5.
  22. Morris M, Vu L, Leslie-Cook A, Akom E, Stephen A, Sherard D. Comparing estimates of multiple and concurrent partnerships across population based surveys: implications for combination HIV prevention. AIDS and Behavior. 2014;18:783–790. doi: 10.1007/s10461-013-0618-6.
  23. National Behavioural Surveillance Survey: General Population. National AIDS Control Organisation, Ministry of Health and Family Welfare, Government of India. 2006. [February 1, 2021]. https://www.aidsdatahub.org/sites/default/files/resource/national-bss-general-population-india-2006.pdf
  24. Qendri V, Bogaards JA, Baussano I, Lazzarato F, Vänskä S, Berkhof J. The cost-effectiveness profile of sex-neutral HPV immunisation in European tender-based settings: a model-based assessment. The Lancet Public Health. 2020;5:e592–e603. doi: 10.1016/S2468-2667(20)30209-7.
  25. Report of National Cancer Registry Programme. National Centre for Disease Informatics and Research. 2020.
  26. Sankaranarayanan R, Basu P, Kaur P, Bhaskar R, Singh GB, Denzongpa P, Grover RK, Sebastian P, Saikia T, Oswal K, Kanodia R, Dsouza A, Mehrotra R, Rath GK, Jaggi V, Kashyap S, Kataria I, Hariprasad R, Sasieni P, Bhatla N, Rajaraman P, Trimble EL, Swaminathan S, Purushotham A. Current status of human papillomavirus vaccination in India’s cervical cancer prevention efforts. The Lancet Oncology. 2019;20:e637–e644. doi: 10.1016/S1470-2045(19)30531-5.
  27. Schulte-Frohlinde R, Georges D, Clifford GM, Baussano I. Predicting cohort-specific cervical cancer incidence from population-based surveys of human papilloma virus prevalence: a worldwide study. American Journal of Epidemiology. 2022;191:402–412. doi: 10.1093/aje/kwab254.
  28. Subtil F, Boussari O, Bastard M, Etard JF, Ecochard R, Génolini C. An alternative classification to mixture modeling for longitudinal counts or binary measures. Statistical Methods in Medical Research. 2017;26:453–470. doi: 10.1177/0962280214549040.
  29. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians. 2021;71:209–249. doi: 10.3322/caac.21660.
  30. Tsu VD. Cervical cancer elimination: are targets useful? The Lancet. 2020;395:539–540. doi: 10.1016/S0140-6736(20)30219-1.
  31. Vaccarella S, Franceschi S, Herrero R, Muñoz N, Snijders PJF, Clifford GM, Smith JS, Lazcano-Ponce E, Sukvirach S, Shin H-R, de Sanjosé S, Molano M, Matos E, Ferreccio C, Anh PTH, Thomas JO, Meijer CJLM, IARC HPV Prevalence Surveys Study Group. Sexual behavior, condom use, and human papillomavirus: pooled analysis of the IARC human papillomavirus prevalence surveys. Cancer Epidemiology, Biomarkers & Prevention. 2006;15:326–333. doi: 10.1158/1055-9965.EPI-05-0577.
  32. WHO. Global strategy to accelerate the elimination of cervical cancer as a public health problem. 2021. [October 21, 2021]. https://www.who.int/publications/i/item/9789240014107

Editor's evaluation

Belinda Nicolau

This study presents a useful framework for estimating missing data in cervical cancer epidemiology. The evidence supporting the authors' claims is solid, although validation studies in other populations will strengthen the methodology. The work will be of broad interest to researchers and policymakers interested in evaluating the impact of cervical cancer prevention measures.

Decision letter

Editor: Belinda Nicolau
Reviewed by: Esther Roura, Belinda Nicolau

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "'Footprinting' missing epidemiological data for cervical cancer: a case study in India" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, including Belinda Nicolau as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Wendy Garrett as the Senior Editor. The following individuals involved in the review of your submission have agreed to reveal their identity: Esther Roura (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Both reviewers' comments raised the question about the utility of this framework when little data are available on HPV prevalence, sexual behaviour, and cervical cancer incidence. What are the minimum data required to use this technique? What are the combinations of data that one can utilize with this methodology? Moreover, this methodology has not been validated; validation studies using data from countries where complete datasets are available (e.g., the US) will strengthen this methodology.

The authors should consider changing the name "footprinting", as it is not intuitive in the context of the manuscript. Also, the authors should provide a more comprehensive explanation of the advantages of this technique over previous approaches, discussing its strengths and limitations.

Similarly, a more comprehensive explanation of the use of the sexual behaviour data is necessary; how did the authors decide which statistics to include in the methodology (e.g., age of first sexual intercourse, number of sexual partners lifetime)? Is there a guide to know which variables are ideal to apply in the technique?

Reviewer #1 (Recommendations for the authors):

I have some general comments.

The word "footprinting" is not very intuitive in the context of the manuscript. It sounds a bit strange, especially in the title; in the manuscript it's fine.

I have doubts about the utility of this framework when little data is available on HPV prevalence, sexual behavior, and cervical cancer incidence. What are the minimum data required to use this technique? What combinations of data types do you think we can do with this methodology (e.g., we only have sexual behavior but can use it with the HDI)?

Regarding sexual behavior data, how do you decide the statistics to include in the methodology (e.g., the median age of first sex, lifetime number of sexual partners)? Is there a guide to know which variables are more recommended to apply in the technique?

It will be interesting to validate this methodology in other regions with less available data to confirm its potential and utility.

I also have more specific comments or questions.

When a cancer registry is available in both CI5C and NCDIR, and the information is not the same, which one do you get?

Can you explain how you decide the initial assignments in the CEM clustering algorithm (ranges of 100 repetitions)?

Considering that none of the clusterings obtained using the CEM clustering algorithm are appropriate (2, 3, or 4), have you considered another method of clustering instead of combining clusters 2 and 3?

Can you explain more extensively the differences between the previous approaches and your approach, with the strengths and limitations, and why this technique is more advantageous than the other ones?

Reviewer #2 (Recommendations for the authors):

The proposed framework's strength is difficult to evaluate because the steps and justification for the model variables were not clearly presented, nor were the models validated. Since the whole framework is built on one single imputation, how do the authors account for uncertainties about the estimation? Perhaps the authors could consider validating these models by simulating models using data from countries where complete datasets are available (e.g., the US).

The manuscript would be strengthened by a more explicit stepwise delimitation of how to apply this model to data. The paper would be strengthened by including evidence for the utility of estimating HPV and cervical cancer rates based on sexual behaviours.

Lastly, it seems that the impact assessment of this work has already been published. Why was the current manuscript sent for publication after the paper on impact assessment was published?

eLife. 2023 May 25;12:e81752. doi: 10.7554/eLife.81752.sa2

Author response


Essential revisions:

Both reviewers' comments raised the question about the utility of this framework when little data are available on HPV prevalence, sexual behaviour, and cervical cancer incidence. What are the minimum data required to use this technique? What are the combinations of data that one can utilize with this methodology?

We thank the editor for the comments and for summarizing the main comments of the reviewers. We have tried to incorporate them as much as possible in the revised manuscript. We believe that the current version is now more complete for readers who may be interested in adapting this technique to their needs.

As the editor and reviewers have pointed out, the applicability of the proposed methodology depends on the available data. In our opinion, this is a general challenge of approximating missing data, rather than a weakness particular to our methodology. In fact, we believe that our framework is flexible enough to address missing data in many situations.

To answer the editor’s first question, there are three minimum data requirements. In our opinion, these requirements are reasonable and flexible enough to be fulfilled in many situations.

1) The first requirement is a data source of cervical cancer incidence, sexual behaviour, or HPV prevalence that has large enough, but not necessarily complete, coverage over all geographical units of interest. This data source, called “Pattern data” in the manuscript, is used to identify the main patterns of cervical cancer epidemiology.

2) The second requirement is a source of cervical cancer incidence, sexual behaviour, or HPV prevalence data, or even an alternative proxy such as HDI and geographical location, with coverage over the remaining geographical units, which is used to classify the unclustered geographical units to the identified clusters. In the manuscript, this data source is called “Footprint data”. This data source needs to cover some geographical units with identified clusters, but not necessarily all. Coverage of a part of the clustered geographical units should be sufficient for training the classifier used in the classification step.

3) Finally, data of cervical cancer incidence, sexual behaviour, and HPV prevalence for one geographical unit within each cluster are needed for the calibration of the projection models. However, these data do not necessarily need to come from the same geographical unit, as the data within a cluster should be similar enough.

To answer the editor’s second question, any combination of two of the three key cervical cancer epidemiological data, i.e., sexual behaviour, HPV prevalence, and cervical cancer incidence data, can serve as “Pattern” and “Footprint” data. As mentioned under data requirement (2), even proxies such as HDI can be used as Footprint data.

While we think our framework is flexible enough to be applied to many situations, we would like to stress that, as a general principle, we should not try to approximate missing data when too little data are available to approximate from. This means that sometimes we might need to exclude some geographical units from the analysis, or we might need to define bigger geographical units if the data are not granular enough. Only by doing so can we ensure the quality of the approximated data and the final impact projections.

Finally, it is worth noting that there are various widely recognized sources of databases of cervical cancer epidemiological data by country:

- cervical cancer incidence from CI5,

- HPV prevalence from ICO/IARC HPV information centre,

- sexual behaviour from the Demographic and Health Surveys (DHS) Program.

Therefore, there are likely ample data for application of the framework when considering countries as geographical units. For application within countries with states/provinces/municipalities as geographical units, data availability can differ from country to country.

To clarify these points, we have added the following paragraph to the Method (lines 144-163, pages 7-8): “For convenience of explanation, we assumed earlier that data availability occurs hierarchically. However, the framework can also be applied with less stringent data requirements. First, the source of Footprint data need not necessarily cover all geographical units. It is still possible to train a classifier in the classification step with Footprint data available for only a part of clustered geographical units. Second, if none of the key cervical cancer epidemiological data (sexual behavior, HPV prevalence, and cervical cancer incidence data) have large enough coverage to serve as Footprint data, alternative indicators of similarity, such as human development index and geographical distance, could also be used as a substitute. However, the resulting classification performance might be suboptimal, as we expect these indicators to correlate less well with cervical cancer risk. Third, for the projection step, data of cervical cancer incidence, sexual behavior, and HPV prevalence needed for calibration of projection models need not necessarily belong to the same geographical unit. Calibration can be performed as long as the three types of data are available within each cluster.

With these less stringent data requirements, the proposed framework should be sufficiently flexible to be applied to many situations. However, one should still be cautious in applying the framework when there are little data. This means that, in some cases, we might need to exclude from the analysis some geographical units with too little data or redefine bigger geographical units if the data are not granular enough. Furthermore, we should assess the goodness-of-fit of the obtained clustering, performance of classification, correlation of data within different clusters, and calibration fits to ensure the validity of the final impact projections.”

Moreover, this methodology has not been validated; validation studies using data from countries where complete datasets are available (e.g., the US) will strengthen this methodology.

We agree that it would be very interesting to validate this proposed methodology in other regions. Unfortunately, it was beyond the scope of this work. Currently, we are working on a project in which we try to apply footprinting to a collection of low- and middle-income countries.

The authors should consider changing the name "footprinting", as it is not intuitive in the context of the manuscript.

We have changed the title into ‘Approximating missing epidemiological data for cervical cancer through “footprinting”: a case study in India’ to explain the purpose of “footprinting”.

Also, the authors should provide a more comprehensive explanation of the advantages of this technique over previous approaches, discussing its strengths and limitations.

To our knowledge, in the field of HPV and cervical cancer control, there are two publications on multi-country modelling for cervical cancer prevention with similar approaches (sometimes also called mapping) as the approach we proposed [ref #29 Brisson 2020. ref #30 Qendri 2020]. In these publications, similar data (sexual behaviour, cervical cancer data, and HPV prevalence) were used for clustering of countries. In our opinion (lines 329-334, page 16), an advantage of our approach is that we base our choice of key geographical units, from which impact projections are extrapolated, on the pattern discovered in the data. Other approaches work with prefixed key geographical units because projection models have been calibrated to these geographical units in previous publications. However, no formal analyses have been done to show how well these key geographical units represent the cervical cancer epidemiological patterns across the geographical area of interest.

In addition, we used a more elaborate method that clusters on the entire curve of age-specific cervical cancer incidence using a Poisson-regression-based CEM clustering algorithm (lines 336-338, page 16). In the other publications, only age-standardized cervical cancer incidence or cervical cancer incidence in a certain age group was used.
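The difference between clustering on the whole age-specific curve and clustering on an age-standardized summary can be illustrated with a minimal Python sketch. All numbers here are hypothetical and chosen only for illustration: the four age groups, the standard-population weights `STD_WEIGHTS`, and the rates for `unit_a` and `unit_b` do not come from the study data.

```python
# Hypothetical standard-population weights for four age groups (they sum to 1).
STD_WEIGHTS = [0.4, 0.3, 0.2, 0.1]

# Hypothetical age-specific incidence rates per 100,000 women-years.
unit_a = [5.0, 20.0, 30.0, 40.0]  # gradual rise with age
unit_b = [2.0, 10.0, 40.0, 62.0]  # low at young ages, steep rise later


def age_standardized_rate(rates, weights=STD_WEIGHTS):
    """Weighted average of age-specific rates (direct standardization)."""
    return sum(r * w for r, w in zip(rates, weights))


def max_age_specific_gap(rates_1, rates_2):
    """Largest absolute difference between two age-specific curves."""
    return max(abs(r1 - r2) for r1, r2 in zip(rates_1, rates_2))


asr_a = age_standardized_rate(unit_a)
asr_b = age_standardized_rate(unit_b)
gap = max_age_specific_gap(unit_a, unit_b)

print(f"ASR unit A: {asr_a:.1f}, ASR unit B: {asr_b:.1f}")
print(f"Largest age-specific difference: {gap:.1f}")
```

In this constructed example the two units have the same age-standardized rate but clearly different age-specific curves, so a summary-based clustering could not separate them while a curve-based clustering could.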

Finally, the other publications provided a less detailed description of their clustering/mapping steps. These steps were only reported briefly in supplementary material without intermediate results, whereas in this paper, we provided extensive details on each step with intermediate results, making them more reproducible and falsifiable.

To better explain the last two points, we have added the following underlined part to the Discussion (lines 338-342, pages 16-17): “Secondly, we made use of a newly developed clustering method that is able to assess the similarities between cervical cancer incidence of different geographical units based on the entire age-specific pattern, instead of clustering on age-standardized cervical cancer incidence or cervical cancer incidence in a certain age group only. Finally, we provided a more detailed description of the clustering/mapping steps and intermediate results, which makes them more reproducible and falsifiable.”

Similarly, a more comprehensive explanation of the use of the sexual behaviour data is necessary; how did the authors decide which statistics to include in the methodology (e.g., age of first sexual intercourse, number of sexual partners lifetime)? Is there a guide to know which variables are ideal to apply in the technique?

We have included sexual behaviour variables that have previously been shown to be risk factors of HPV infection and cervical cancer risk, e.g., age of sexual debut and number of sexual partners [ref #26 Vaccerella 2006, ref #27 Schulte-Frohlinde 2021]. Furthermore, we used variables that are commonly available so that the analyses can be easily applied to other settings.

As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance in the India case study shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.

To clarify these points we have included the following paragraph in the Discussion (lines 319-325, page 16): “In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report [15]. We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available so that the analyses can be easily applied to other settings, e.g., age of sexual debut and number of sexual partners [26, 27]. As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.”

Reviewer #1 (Recommendations for the authors):

I have some general comments.

The word "footprinting" is not very intuitive in the context of the manuscript. It sounds a bit strange, especially in the title; in the manuscript it's fine.

We have changed the title to ‘Approximating missing epidemiological data for cervical cancer through “footprinting”: a case study in India’ to explain the purpose of “footprinting”.

I have doubts about the utility of this framework when little data is available on HPV prevalence, sexual behavior, and cervical cancer incidence. What are the minimum data required to use this technique? What combinations of data types do you think we can do with this methodology (e.g., we only have sexual behavior but can use it with the HDI)?

While the proposed framework works better with more data, we think that it is flexible enough to be adapted to many cases with little data.

To answer the reviewer’s first question, there are three minimum data requirements. In our opinion, these requirements are reasonable and flexible enough to be fulfilled in many situations.

1) The first requirement is a data source of cervical cancer incidence, sexual behaviour, or HPV prevalence that has large enough, but not necessarily complete, coverage over all geographical units of interest. This data source, called “Pattern data” in the manuscript, is used to identify the main patterns of cervical cancer epidemiology.

2) The second requirement is a source of cervical cancer incidence, sexual behaviour, or HPV prevalence data, or even an alternative proxy such as HDI or geographical location, with coverage over the remaining geographical units, so that the unclustered geographical units can be classified to the identified clusters. In the manuscript, this data source is called “Footprint data”. This data source needs to cover some, but not necessarily all, of the geographical units with identified clusters. Coverage of a part of the clustered geographical units should be sufficient for training the classifier used in the classification step.

3) Finally, data of cervical cancer incidence, sexual behaviour, and HPV prevalence for one geographical unit within each cluster are needed for the calibration of the projection models. However, these data do not necessarily need to come from the same geographical unit, as the data within a cluster should be similar enough.

In the reviewer’s example with only sexual behaviour and HDI, data requirement (3) is not fulfilled. Without any HPV prevalence and cervical cancer data at all, it would not be possible to derive impact projections of cervical cancer intervention measures.
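Data requirement (3) can be expressed as a simple check, sketched below in Python. This is only an illustration of the rule stated above: the function name `missing_calibration_data`, the data-type labels, and the example units are hypothetical and do not correspond to the study's data or code.

```python
# Data types needed within each cluster to calibrate a projection model.
REQUIRED = ("incidence", "sexual_behaviour", "hpv_prevalence")


def missing_calibration_data(cluster_units, available):
    """Return {cluster: [data types not available in any unit of that cluster]}.

    cluster_units: dict mapping cluster name -> list of geographical units
    available:     dict mapping unit name -> set of data types it has
    The required data types may come from different units of the same cluster.
    """
    missing = {}
    for cluster, units in cluster_units.items():
        present = set()
        for unit in units:
            present |= available.get(unit, set())
        lacking = [t for t in REQUIRED if t not in present]
        if lacking:
            missing[cluster] = lacking
    return missing


# Hypothetical example: the "low" cluster pools data from two units,
# whereas the "high" cluster only has sexual behaviour data and HDI.
cluster_units = {"low": ["Unit A", "Unit B"], "high": ["Unit C"]}
available = {
    "Unit A": {"incidence", "sexual_behaviour"},
    "Unit B": {"hpv_prevalence", "sexual_behaviour"},
    "Unit C": {"sexual_behaviour", "hdi"},
}
result = missing_calibration_data(cluster_units, available)
print(result)
```

In this sketch the "low" cluster passes because its two units jointly provide all three data types, while the "high" cluster fails in the same way as the reviewer's example.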

To answer the reviewer’s second question, any combination of two of the three key cervical cancer epidemiological data, i.e., sexual behaviour, HPV prevalence, and cervical cancer incidence data, can serve as “Pattern” and “Footprint” data. As mentioned under data requirement (2), even proxies such as HDI can be used as Footprint data.

While we think our framework is flexible enough to be applied to many situations, we would like to stress that, as a general principle, we should not try to approximate missing data when too few data are available to approximate from. This means that sometimes we might need to exclude some geographical units from the analysis, or we might need to define bigger geographical units if the data are not granular enough. Only by doing so can we ensure the quality of the approximated data and the final impact projections.

Finally, it is worth noting that there are various widely recognized databases of cervical cancer epidemiological data by country:

- cervical cancer incidence from CI5,

- HPV prevalence from ICO/IARC HPV information centre,

- sexual behaviour from the Demographic and Health Surveys (DHS) Program.

Therefore, there are likely ample data for application of the framework when considering countries as geographical units. For application within countries with states/provinces/municipalities as geographical units, data availability can differ from country to country.

To clarify these points, we have added the following paragraph to the Method (lines 144-163, pages 7-8): “For convenience of explanation, we assumed earlier that data availability occurs hierarchically. However, the framework can also be applied with less stringent data requirements. First, the source of Footprint data need not necessarily cover all geographical units. It is still possible to train a classifier in the classification step with Footprint data available for only a part of the clustered geographical units. Second, if none of the key cervical cancer epidemiological data (sexual behavior, HPV prevalence, and cervical cancer incidence data) have large enough coverage to serve as Footprint data, alternative indicators of similarity, such as human development index and geographical distance, could also be used as a substitute. However, the resulting classification performance might be suboptimal, as we expect these indicators to correlate less well with cervical cancer risk. Third, for the projection step, data of cervical cancer incidence, sexual behavior, and HPV prevalence needed for calibration of projection models need not necessarily belong to the same geographical unit. Calibration can be performed as long as the three types of data are available within each cluster.

With these less stringent data requirements, the proposed framework should be sufficiently flexible to be applied to many situations. However, one should still be cautious in applying the framework when few data are available. This means that, in some cases, we might need to exclude from the analysis some geographical units with too little data or redefine bigger geographical units if the data are not granular enough. Furthermore, we should assess the goodness-of-fit of the obtained clustering, performance of classification, correlation of data within different clusters, and calibration fits to ensure the validity of the final impact projections.”

Regarding sexual behavior data, how do you decide the statistics to include in the methodology (e.g., the median age at first sexual intercourse, lifetime number of sexual partners)? Is there a guide to know which variables are more recommended to apply in the technique?

A similar comment was raised by Reviewer #2. We have included sexual behaviour variables that have previously been shown to be risk factors of HPV infection and cervical cancer risk, e.g., age of sexual debut and number of sexual partners [ref #26 Vaccerella 2006, ref #27 Schulte-Frohlinde 2021]. Furthermore, we used variables that are commonly available so that the analyses can be easily applied to other settings.

As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance in the India case study shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.

To clarify these points we have included the following paragraph in the Discussion (lines 319-325, page 16): “In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report [15]. We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available so that the analyses can be easily applied to other settings, e.g., age of sexual debut and number of sexual partners [26, 27]. As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.”

It will be interesting to validate this methodology in other regions with less available data to confirm its potential and utility.

We agree that it would be very interesting to validate this proposed methodology in other regions. Unfortunately, it was beyond the scope of this work. Currently, we are working on a project in which we try to apply footprinting to a collection of low- and middle-income countries.

I also have more specific comments or questions.

When a cancer registry is available in both CI5C and NCDIR, and the information is not the same, which one do you use?

When a cancer registry is present in both CI5 and NCDIR, we take the data present in CI5. To clarify this point, we have added the following sentence to the “Data sources” section (lines 171-172, page 9): “When data of a registry is both reported in CI5 and NCDIR, we only used the data from CI5.”

Note that, in principle, the data in CI5 and NCDIR should be the same if they come from the same cancer registry. However, differences may arise when aggregation was done for different periods.
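The precedence rule described above can be sketched in a few lines of Python. This is an illustration only: the function name `combine_registry_data`, the registry names, and the rates are hypothetical and are not taken from CI5 or NCDIR.

```python
def combine_registry_data(ci5, ncdir):
    """Merge two {registry: data} mappings, preferring CI5 when both report a registry."""
    combined = {}
    for registry, data in ncdir.items():
        combined[registry] = ("NCDIR", data)
    for registry, data in ci5.items():
        combined[registry] = ("CI5", data)  # overrides any NCDIR entry
    return combined


# Hypothetical registries with made-up incidence rates per 100,000 women-years.
ci5 = {"Registry A": 14.2, "Registry B": 9.8}
ncdir = {"Registry B": 10.1, "Registry C": 21.5}

combined = combine_registry_data(ci5, ncdir)
for registry, (source, rate) in sorted(combined.items()):
    print(registry, source, rate)
```

Keeping the source label next to each value makes it easy to trace which report a given registry's data came from.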

Can you explain how you decide the initial assignments in the CEM clustering algorithm (ranges of 100 repetitions)?

The initial assignments were randomly generated from a multinomial distribution. We added the underlined part to the Supplementary File (lines 34-35, page 3) to clarify this: “As different initial assignments could result in different final assignments, the above iterative process was repeated 100 times with different initial assignments, randomly generated from a multinomial distribution.”
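The repeated-initialization scheme can be sketched as follows. This is a simplified Python illustration and not the authors' implementation, which is described in Supplementary file 1 and is not reproduced here. The sketch assumes a separate rate per age group and cluster, estimated as pooled cases divided by pooled women-years, in place of the Poisson regression fit. It also assumes equal initial probabilities and a choice of the best run by classification log-likelihood. The function names and the example registries are hypothetical.

```python
import math
import random


def loglik(cases, pyears, rates):
    """Poisson log-likelihood of one unit's age-specific counts (constants dropped)."""
    total = 0.0
    for y, n, lam in zip(cases, pyears, rates):
        mu = max(n * lam, 1e-12)  # guard against log(0)
        total += y * math.log(mu) - mu
    return total


def fit_rates(units, labels, k, rng):
    """M-step: pooled age-specific rate per cluster (person-years assumed positive)."""
    n_ages = len(units[0][0])
    rates = []
    for c in range(k):
        members = [u for u, lab in zip(units, labels) if lab == c]
        if not members:  # empty cluster: borrow the rates of one random unit
            members = [rng.choice(units)]
        rates.append([
            sum(u[0][a] for u in members) / sum(u[1][a] for u in members)
            for a in range(n_ages)
        ])
    return rates


def cem_once(units, k, rng, max_iter=50):
    """One CEM run from a random initial assignment."""
    # Each unit's initial cluster is one multinomial draw with equal probabilities.
    labels = [rng.randrange(k) for _ in units]
    for _ in range(max_iter):
        rates = fit_rates(units, labels, k, rng)
        # C-step: move every unit to the cluster under which its counts are most likely.
        new_labels = [
            max(range(k), key=lambda c: loglik(u[0], u[1], rates[c])) for u in units
        ]
        if new_labels == labels:
            break
        labels = new_labels
    rates = fit_rates(units, labels, k, rng)
    total = sum(loglik(u[0], u[1], rates[lab]) for u, lab in zip(units, labels))
    return labels, total


def cem_cluster(units, k, n_starts=100, seed=0):
    """Repeat CEM from n_starts random initial assignments and keep the best run."""
    rng = random.Random(seed)
    runs = [cem_once(units, k, rng) for _ in range(n_starts)]
    return max(runs, key=lambda run: run[1])


# Hypothetical registries: (cases per age group, women-years per age group).
units = [
    ([2, 5, 8], [10000, 10000, 10000]),
    ([3, 6, 7], [10000, 10000, 10000]),
    ([10, 30, 45], [10000, 10000, 10000]),
    ([12, 28, 50], [10000, 10000, 10000]),
]
labels, total_loglik = cem_cluster(units, k=2)
print(labels)
```

Because a single CEM run can stop at a local optimum that depends on its starting point, repeating the run from many random assignments and comparing the resulting likelihoods reduces that dependence.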

Considering that none of the clusterings obtained using the CEM clustering algorithm are appropriate (2, 3, or 4), have you considered another method of clustering instead of combining clusters 2 and 3?

In the application considered, each Indian state can have multiple registries, while the sexual behaviour data were collected by Indian state. Hence, we needed to find a solution to deal with this. As the clustering obtained by combining clusters 2 and 3 already gave good separation of high and low cervical cancer incidence, we did not consider it necessary to find alternative solutions.

Can you explain more extensively the differences between the previous approaches and your approach, with the strengths and limitations, and why this technique is more advantageous than the other ones?

To our knowledge, there are two publications on multi-country modelling for cervical cancer prevention that use approaches similar to ours (sometimes also called mapping) [ref #29 Brisson 2020, ref #30 Qendri 2020]. In these publications, similar data (sexual behaviour, cervical cancer data, and HPV prevalence) were used for clustering of countries. In our opinion (lines 329-334, page 16), an advantage of our approach is that the key geographical units from which impact projections are extrapolated are chosen on the basis of the patterns discovered in the data. Other approaches work with prefixed key geographical units because projection models have been calibrated to these geographical units in previous publications. However, no formal analyses have been done to show how well these key geographical units represent the cervical cancer epidemiological patterns across the geographical area of interest.

In addition, we used a more elaborate method that clusters on the entire curve of age-specific cervical cancer incidence using a Poisson-regression-based CEM clustering algorithm (lines 336-338, page 16). In the other publications, only age-standardized cervical cancer incidence or cervical cancer incidence in a certain age group was used.

Finally, the other publications provided a less detailed description of their clustering/mapping steps. These steps were only reported briefly in supplementary material without intermediate results, whereas in this paper, we provided extensive details on each step with intermediate results, making them more reproducible and falsifiable.

To better explain the last two points, we have added the following underlined part to the Discussion (lines 338-342, pages 16-17): “Secondly, we made use of a newly developed clustering method that is able to assess the similarities between cervical cancer incidence of different geographical units based on the entire age-specific pattern, instead of clustering on age-standardized cervical cancer incidence or cervical cancer incidence in a certain age group only. Finally, we provided a more detailed description of the clustering/mapping steps and intermediate results, which makes them more reproducible and falsifiable.”

Reviewer #2 (Recommendations for the authors):

The proposed framework's strength is difficult to evaluate because the steps and justification for the model variables were not clearly presented, nor were the models validated. Since the whole framework is built on one single imputation, how do the authors account for uncertainties about the estimation?

Uncertainty was accounted for in various steps of the framework. Firstly, the classification step was done through random forest (line 210, page 10), which summarizes the uncertainty of prediction by combining multiple classification trees. As we mentioned in the Method (lines 136-137, page 7), the obtained classification probabilities can be used to weight projection outcomes. Furthermore, in the projection step, model calibration accounts for uncertainty in the target HPV prevalence by accepting model parameters that provide a fit within the confidence intervals.
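One way to weight projection outcomes by classification probability is sketched below in Python. This is a minimal illustration under stated assumptions rather than the procedure used in the study: the function name `weighted_projection`, the cluster labels, the probabilities (for example, the share of trees in a random forest voting for each cluster), and the projected incidence values are all hypothetical.

```python
def weighted_projection(class_probs, cluster_projections):
    """Average cluster-specific projections, weighted by classification probability.

    class_probs:         dict cluster -> probability that the unit belongs to it
    cluster_projections: dict cluster -> projected outcome for that cluster
    """
    total = sum(class_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("classification probabilities must sum to 1")
    return sum(p * cluster_projections[c] for c, p in class_probs.items())


# Hypothetical state classified to the low-incidence cluster with probability 0.8.
class_probs = {"high": 0.2, "low": 0.8}

# Hypothetical projected incidence per 100,000 women-years in each cluster.
cluster_projections = {"high": 20.0, "low": 10.0}

projected = weighted_projection(class_probs, cluster_projections)
print(f"Probability-weighted projection: {projected:.1f}")
```

A state classified with near certainty would simply inherit its cluster's projection, whereas a less certain classification pulls the result towards the other cluster's projection.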

Perhaps the authors could consider validating these models by simulating models using data from countries where complete datasets are available (e.g., the US).

We agree that it would be very interesting to validate this proposed methodology in other regions. Unfortunately, it was beyond the scope of this work. Currently, we are working on a project in which we try to apply footprinting to a collection of low- and middle-income countries.

The manuscript would be strengthened by a more explicit stepwise delimitation of how to apply this model to data.

We acknowledge that the framework could be more clearly presented and have added additional explanation in the following places to do so:

– Concerning the framework steps, in Method (lines 144-163, pages 7-8): “For convenience of explanation, we assumed earlier that data availability occurs hierarchically. However, the framework can also be applied with less stringent data requirements. First, the source of Footprint data need not necessarily cover all geographical units. It is still possible to train a classifier in the classification step with Footprint data available for only a part of the clustered geographical units. Second, if none of the key cervical cancer epidemiological data (sexual behavior, HPV prevalence, and cervical cancer incidence data) have large enough coverage to serve as Footprint data, alternative indicators of similarity, such as human development index and geographical distance, could also be used as a substitute. However, the resulting classification performance might be suboptimal, as we expect these indicators to correlate less well with cervical cancer risk. Third, for the projection step, data of cervical cancer incidence, sexual behavior, and HPV prevalence needed for calibration of projection models need not necessarily belong to the same geographical unit. Calibration can be performed as long as the three types of data are available within each cluster.

With these less stringent data requirements, the proposed framework should be sufficiently flexible to be applied to many situations. However, one should still be cautious in applying the framework when few data are available. This means that, in some cases, we might need to exclude from the analysis some geographical units with too little data or redefine bigger geographical units if the data are not granular enough. Furthermore, we should assess the goodness-of-fit of the obtained clustering, performance of classification, correlation of data within different clusters, and calibration fits to ensure the validity of the final impact projections.”

– Concerning selection of model variables (lines 319-325, page 16): “In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report [15]. We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available (e.g., age of sexual debut and number of sexual partners) so that the analyses can be easily applied to other settings [26, 27]. In the India case study, the good classification performance shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.”

The paper would be strengthened by including evidence for the utility of estimating HPV and cervical cancer rates based on sexual behaviours.

We have included sexual behaviour variables that have previously been shown to be risk factors of HPV infection and cervical cancer risk, e.g., age of sexual debut and number of sexual partners [ref #26 Vaccerella 2006, ref #27 Schulte-Frohlinde 2021]. Furthermore, we used variables that are commonly available so that the analyses can be easily applied to other settings.

As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance in the India case study shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.

To clarify these points we have included the following paragraph in the Discussion (lines 319-325, page 16): “In our analysis of classifying clusters of cervical cancer incidence, we only included some of the sexual behaviour variables available in the NACO report [15]. We selected variables that were previously shown to be risk factors of HPV infection and cervical cancer risk and that are commonly available so that the analyses can be easily applied to other settings, e.g., age of sexual debut and number of sexual partners [26, 27]. As far as we know, there is no established set of sexual behaviour variables for predicting the patterns of HPV prevalence and cervical cancer incidence. The good prediction performance shows that using the selected set is sufficient. As sexual behaviour variables are highly correlated, including more variables might even risk overfitting.”

Lastly, it seems that the impact assessment of this work has already been published. Why was the current manuscript sent for publication after the paper on impact assessment was published?

This manuscript was posted on medRxiv at the same time as we submitted the impact assessment paper. The other one was accepted more quickly.

Associated Data

    Supplementary Materials

    Figure 1—source data 1. Overview of availability of cervical cancer epidemiological data by state.
    Figure 2—source data 1. Registry-specific cervical cancer incidence data from Cancer Incidence in Five Continents (CI5) and National Centre for Disease Informatics and Research (NCDIR).
    Figure 2—source data 2. Estimated model parameters under Poisson regression models.
    Figure 3—source data 1. Indian state-specific sexual behaviour data from National AIDS Control Organization (NACO).
    Figure 3—source data 2. Predictive values of the sexual behavior variables for cervical cancer incidence cluster.
    Supplementary file 1. Appendix 1 - Poisson-regression-based CEM clustering algorithm.
    elife-81752-supp1.docx (38.7KB, docx)
    MDAR checklist

    Data Availability Statement

    All data used in the present study were openly available and extracted from http://ci5.iarc.fr for the cervical cancer incidence data published by the International Agency for Research on Cancer, from https://www.ncdirindia.org/All_Reports/Report_2020/resources/NCRP_2020_2012_16.pdf for the cervical cancer incidence data published by the National Centre for Disease Informatics and Research of India, and from https://www.aidsdatahub.org/sites/default/files/resource/national-bss-general-population-india-2006.pdf for the sexual behavior data published by the National AIDS Control Organisation, Ministry of Health and Family Welfare, Government of India. The extracted cervical cancer incidence and sexual behavior data are provided in Figure 2—source data 1 and Figure 3—source data 1, respectively. The computer code for the Poisson-regression-based CEM clustering algorithm is available from the authors upon reasonable request. The random forest analysis was done with the open-source R package party, available at https://cran.r-project.org/web/packages/party/index.html (Hothorn et al., 2023).


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
