Abstract
Risk evaluation to identify individuals who are at greater risk of cancer as a result of heritable pathogenic variants is a valuable component of individualized clinical management. Using principles of Mendelian genetics, Bayesian probability theory, and variant-specific knowledge, Mendelian models derive the probability of carrying a pathogenic variant and developing cancer in the future, based on family history. Existing Mendelian models are widely employed, but are generally limited to specific genes and syndromes. However, the upsurge of multi-gene panel germline testing has spurred the discovery of many new gene-cancer associations that are not presently accounted for in these models. We have developed PanelPRO, a flexible, efficient Mendelian risk prediction framework that can incorporate an arbitrary number of genes and cancers, overcoming the computational challenges that arise because of the increased model complexity. We implement an eleven-gene, eleven-cancer model, the largest Mendelian model created thus far, based on this framework. Using simulations and a clinical cohort with germline panel testing data, we evaluate model performance, validate the backward compatibility of our approach with existing Mendelian models, and illustrate its usage. Our implementation is freely available for research use in the PanelPRO R package.
Keywords: Mendelian models, risk prediction, germline panel gene testing, precision prevention, pathogenic variants
1. Introduction
Certain genetic variants are known to greatly increase one’s risk for developing cancer (Foulkes, 2008). Estimating the probability of carrying cancer-causing variants is important for personalized clinical management (Win, MacInnis, Dowty, & Jenkins, 2013). Mendelian models use principles of Mendelian genetics, probability theory, and gene-specific knowledge to estimate the probability that an individual receiving counseling (the counselee) is a pathogenic variant carrier (Murphy & Mutalik, 1969). As input, they take a potentially detailed pedigree of family history, which contains information on the relationships between individuals in the counselee’s family, as well as their ages, sexes, cancer diagnosis statuses, and ages at diagnosis. As training, they rely on external estimates of the allele frequencies of the cancer-causing variants by gene, as well as the sex-specific and age-dependent probabilities of developing cancer conditional on presence of a pathogenic variant (cancer penetrances). These are population-level quantities and can be estimated based on published literature. The model outputs the counselee’s carrier probabilities conditional on family history and the future risk of developing cancer.
Mendelian models have been widely adopted (Euhus, 2001; Riley et al., 2012), but existing models are generally limited to a small number of genes and cancers within an individual syndrome. For example, Mendelian models implemented in the free, open-source BayesMendel R package (S. Chen, Wang, Broman, Katki, & Parmigiani, 2004) include BRCAPRO, which computes carrier probabilities for pathogenic variants on BRCA1 and BRCA2 using family history of breast and ovarian cancers (Berry, Parmigiani, Sanchez, Schildkraut, & Winer, 1997; Parmigiani, Berry, & Aguilar, 1998); MMRpro, which considers Lynch syndrome genes MLH1, MSH2, and MSH6 using family history of colorectal and endometrial cancers (S. Chen et al., 2006); Pancpro, which considers an unspecified susceptibility gene for pancreatic cancer (W. Wang et al., 2007); and Melapro, which computes the carrier probability for a variant on CDKN2A using family history of melanoma (W. Wang et al., 2010). These models have been validated in the literature (Berry et al., 2002; S. Chen et al., 2006; W. Wang et al., 2007, 2010). Other Mendelian risk prediction models for breast and ovarian cancer susceptibility include Claus (Claus, Risch, & Thompson, 1994), IBIS (Tyrer-Cuzick) (Tyrer, Duffy, & Cuzick, 2004), and BOADICEA (A. C. Antoniou et al., 2008). In the context of future risk, non-Mendelian models are also prevalent; many summarize family history into binary or categorical predictors, followed by fitting logistic regression or another statistical learning model (Couch et al., 1997; Frank et al., 2002; Gail et al., 1989; Shattuck-Eidens et al., 1997; Vahteristo, Eerola, Tamminen, Blomqvist, & Nevanlinna, 2001).
In recent years, many deleterious gene variants have been linked with the increased risk of various cancers. For example, ATM, CHEK2, and PALB2 (A. C. Antoniou et al., 2014; Marabelli, Cheng, & Parmigiani, 2016; Schmidt et al., 2016) are among those associated with higher risk of breast cancer, in addition to the previously established BRCA1 and BRCA2. Furthermore, there is rising evidence that syndromes once thought to be distinct are determined by variants that increase the risk of multiple cancers (Hruban, Canto, Goggins, Schulick, & Klein, 2010; Kastrinos et al., 2009; Kastrinos & Syngal, 2011; Kote-Jarai et al., 2011; Moran et al., 2012). These discoveries point to the need for flexible, scalable models that account for these relationships, many of which may be known but not yet accounted for by existing models. Simultaneously, the decreasing cost of DNA sequencing (Plichta, Griffin, Thakuria, & Hughes, 2016) positions us to leverage data from multi-gene panel germline testing. As more associations are discovered using panel studies, these associations will also need to be effectively incorporated into models for quantitative risk assessment.
The development of a multi-syndrome model necessitates extending the Mendelian model framework from syndrome-specific to a fully generalizable framework that can account for associations between any number of genes and cancers. The increased model complexity implied by the inclusion of many genes and cancers also requires overcoming significant computational challenges. In particular, the Elston-Stewart peeling algorithm (Elston & Stewart, 1971) used to compute the carrier probabilities involves summing over all possible genotype configurations of a pedigree, a burden that is exponential with respect to the number of genes in the model. This computational cost must be appropriately addressed in order for larger, complicated models to return results in a reasonable runtime.
We present PanelPRO, a flexible, computationally efficient Mendelian risk model that incorporates an arbitrary number of genes and cancers. Generalizing beyond syndrome-specific approaches, we model carrier probabilities and cancer future risk probabilities for multi-syndrome associations for a large number of genes. For example, BRCAPRO identifies individuals at high risk for breast or ovarian cancers due to pathogenic BRCA1 and BRCA2 variants. PanelPRO can also incorporate penetrances of breast cancer specific to pathogenic variants of ATM, CHEK2, and PALB2, as well as return carrier probabilities for these genes. We note that the BOADICEA model (A. Lee et al., 2019) has recently been extended to account for information from all five of these genes, as well as the effects of polygenic risk scores and other risk factors. Additionally, BRCA1 and BRCA2 have been linked to other syndromes, including pancreatic cancer (Greer & Whitcomb, 2007). Therefore, it may be beneficial to incorporate the penetrance of pancreatic cancer for BRCA1 and BRCA2 pathogenic variant carriers, allowing us to borrow strength across additional cancers and heighten our understanding of cross-syndrome effects. The framework introduced is highly flexible and can be used to obtain the future risk of cancer for all cancers in the model.
G. Lee et al. (2021) present the R software package implementation for PanelPRO. That paper covers the informatics aspects of PanelPRO; detailed statistical topics are not included there and are the focus of this work. Specifically, G. Lee et al. (2021) present the package workflow from a user standpoint, including information on how the PanelPRO package pre-processes user-specified pedigrees and imputes missing ages prior to model evaluation. Their paper includes only synthetic example pedigrees, highlighting expected model inputs and illustrating package usage. It also provides computational benchmarking and some discussion of computational concerns. In contrast, here we investigate the methodological background on which the software is based, providing a formal statistical treatment, methodological details, and assumptions. We provide comprehensive notation for our statistical model, including the flexible incorporation of secondary cancers and additional risk modifiers, and derive conditions for model collapsibility. As previously mentioned in G. Lee et al. (2021), we reduce the computational expense of the peeling algorithm to a feasible level for clinical use by making some reasonable simplifying assumptions (Madsen, Braun, Peng, Parmigiani, & Trippa, 2018). We verify the comparability of the prediction performance of PanelPRO against existing models in simulations, and we evaluate model performance using simulated data and a validation study based on a high-risk multi-gene panel testing dataset from the USC-Stanford Hereditary Cancer Panel (HCP) Testing Study.
PanelPRO generalizes beyond syndrome-specific approaches while resolving computational challenges that arise from including many associations. Although the proposed model is general, allowing users to enter an arbitrary number of genes and cancers, for this study, we illustrate the model using a proposed PanelPRO-5BC, a five-gene (ATM, BRCA1, BRCA2, CHEK2, and PALB2) breast and ovarian cancer model; and a proposed PanelPRO-11, an eleven-gene (ATM, BRCA1, BRCA2, CDKN2A, CHEK2, EPCAM, MLH1, MSH2, MSH6, PALB2, and PMS2), eleven-cancer (brain, breast, colorectal, endometrial, gastric, kidney, melanoma, ovarian, pancreatic, prostate, and small intestine) model. These PanelPRO models are applied to simulated data to illustrate the properties and usage of our approach. For validation, we apply the PanelPRO-5BC and PanelPRO-11 models to the high-risk multi-gene panel testing HCP dataset (Idos et al., 2019). The PanelPRO R package is freely available for research purposes at https://projects.iq.harvard.edu/bayesmendel/panelpro (G. Lee et al., 2021). Code to perform the analysis and generate the figures in this paper is available at https://github.com/janewliang/PanelRePROducible (Liang et al., 2022b).
In Section 2, we detail the statistical framework for PanelPRO. In Section 3, we describe the data used for specifying the input parameters and for validating our models. The results from the simulation studies and data application are reported in Sections 4 and 5, respectively. We conclude with a discussion in Section 6.
2. Statistical Methods
The contents of Sections 2.1–2.2 expand on and provide additional context for the framework described in G. Lee et al. (2021), including a generalization of the genotype vectors to multiple variants/mutation states, as well as notation for net future risk and further context for net vs. crude penetrance and future risk estimates. The following sections, Sections 2.3–2.4, contain entirely new elements that cover extensions for secondary cancers and other additional risk modifiers and describe conditions for model collapsibility.
2.1. Proposed Model
Suppose that a given counselee has a family of size $I$, and let $G_i = (G_{1i}, \ldots, G_{Ki})$ be the genotype vector for the $i$th family member, $i = 1, \ldots, I$, where the counselee is indexed by 1. Each element $G_{ki}$ stores individual $i$’s carrier status for a pathogenic variant in the $k$th gene. In the simple case where only wild-type and heterozygous carriers need to be considered, $G_{ki}$ is binary; let $\mathcal{G}$ denote the resulting set of all $2^K$ possible genotypes. For example, in a two-gene model, $G_i$ has length $K = 2$ and there are $2^2 = 4$ possible genotypes (noncarrier for both genes, carrier of only the first pathogenic variant, carrier of only the second pathogenic variant, and carrier of pathogenic variants for both genes). More generally, additional states can be assumed: for instance, we can differentiate between carriers of variants on one allele vs. both alleles. Under this scenario, the three possible states would be encoded as 0, 1, or 2, corresponding to $3^K$ possible genotypes. One can also assume the presence of additional states implied by having multiple variant types for a given gene.
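The size of the genotype space can be checked by direct enumeration. The following is a minimal Python sketch, not code from the PanelPRO package; the function name `enumerate_genotypes` and the tuple encoding of genotype vectors are assumptions made for illustration.

```python
from itertools import product


def enumerate_genotypes(n_genes, n_states=2):
    """Return every genotype vector for `n_genes` genes.

    Each gene takes a state in {0, ..., n_states - 1}: 0 is a noncarrier,
    and nonzero values encode carrier states (hypothetical encoding).
    """
    return list(product(range(n_states), repeat=n_genes))


# Two genes with binary carrier status: 2^2 = 4 genotypes.
print(enumerate_genotypes(2))
# Two genes with three states each (0, 1, or 2): 3^2 = 9 genotypes.
print(len(enumerate_genotypes(2, n_states=3)))
```

Enumerating eleven binary genes in the same way yields 2048 genotype vectors, which is the space that the integration step has to cover for every relative.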
Each individual in the family also has a vector of phenotypes $H_i = (H_{1i}, \ldots, H_{Ri})$, where $R$ is the arbitrary but fixed number of cancers or other phenotypes in the model. For individual $i$, let $T_{ri}$ be the age of diagnosis for cancer $r$, $C_i$ be their censoring age (current age or age of death), and $\delta_{ri} = I(T_{ri} \le C_i)$ be the binary indicator for the occurrence of cancer $r$ before the censoring age. Then the observed history of cancer $r$ is $H_{ri} = (C_i, \delta_{ri})$ when $\delta_{ri} = 0$ and $H_{ri} = (C_i, \delta_{ri}, T_{ri})$ when $\delta_{ri} = 1$.
Denote the binary indicator of the ith individual being male as Ui. Finally, let H = (H1, …, HI) and U = (U1, …, UI) indicate the phenotypes and sexes of all family members, respectively.
Of interest is the estimation of the counselee’s conditional probabilities $P(G_1 = g_1 \mid H, U)$ for all genotypes $g_1 \in \mathcal{G}$. To obtain the posterior distribution, we apply Bayes’ rule as presented by Murphy and Mutalik (1969) and described by Lange (2003):
$$P(G_1 = g_1 \mid H, U) \propto P(G_1 = g_1)\, P(H \mid G_1 = g_1, U). \tag{1}$$
After computing the right-hand side, the posterior probabilities must be normalized so that they sum to 1. $P(G_1 = g_1)$ is the population-based frequency of the counselee’s genotype $g_1$, which we generally estimate from published literature. $P(H \mid G_1 = g_1, U)$ is the conditional probability of the observed cancers for the entire pedigree given that the counselee has genotype $g_1$. It can be computed by integrating over the entire set of possible family genotypes, using the law of total probability:
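$$P(H \mid G_1 = g_1, U) = \sum_{g_2, \ldots, g_I \in \mathcal{G}} P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) \prod_{i=1}^{I} P(H_i \mid G_i = g_i, U_i). \tag{2}$$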
In Eq. 2, the joint genotype distribution conditional on the counselee’s genotype, $P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1)$, is derived using properties of Mendelian inheritance. One must have estimates of the population allele frequency of each variant and know the exact relationship between the counselee and each relative in the pedigree. We assume genotype independence across the founding family members, followed by conditionally independent genotypes in subsequent generations (Berry et al., 1997). In order to model $P(H \mid G_1, \ldots, G_I, U)$ in the integration step, we assume that the relatives’ phenotypes are independent conditional on genotype and sex. $P(H_i \mid G_i, U_i)$ is computed from the sex-specific penetrances.
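To make the combination of Bayes’ rule and the sum over family genotypes concrete, the sketch below evaluates the posterior by brute force for a single gene and a mother-father-counselee trio. It is an illustration under simplifying assumptions, not the peeling algorithm or the package implementation: carrier status is binary, each carrier parent transmits the variant with probability 1/2, and the founder carrier probability and phenotype likelihoods are hypothetical inputs. The code sums the joint distribution of genotypes and phenotypes directly, which gives the same result as multiplying the counselee’s genotype frequency by the conditional sum and normalizing.

```python
from itertools import product


def carrier_posterior(founder_carrier_prob, lik):
    """Posterior [P(noncarrier | H), P(carrier | H)] for the counselee.

    founder_carrier_prob: carrier probability for each parent (hypothetical).
    lik: dict mapping "counselee", "mother", "father" to a pair
         (P(H_i | noncarrier), P(H_i | carrier)).
    """
    founder = (1.0 - founder_carrier_prob, founder_carrier_prob)
    unnorm = [0.0, 0.0]
    for g_c, g_m, g_f in product((0, 1), repeat=3):
        # Assumed transmission rule: each carrier parent passes the variant
        # on with probability 1/2, independently of the other parent.
        p_carrier = 1.0 - (1.0 - 0.5 * g_m) * (1.0 - 0.5 * g_f)
        p_child = p_carrier if g_c == 1 else 1.0 - p_carrier
        joint_genotype = founder[g_m] * founder[g_f] * p_child
        # Phenotypes are independent across relatives given genotype.
        joint_phenotype = (
            lik["counselee"][g_c] * lik["mother"][g_m] * lik["father"][g_f]
        )
        unnorm[g_c] += joint_genotype * joint_phenotype
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical inputs: the mother's history is 15 times more likely for a carrier.
likelihoods = {
    "counselee": (1.0, 1.0),
    "mother": (0.02, 0.30),
    "father": (1.0, 1.0),
}
print(carrier_posterior(0.01, likelihoods))
```

The loop visits every family genotype configuration, which is exactly the cost that grows quickly as relatives and genes are added.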
Two types of penetrances can be used. The net penetrance is the genotype-specific probability of developing a phenotype by a given age, in the absence of death and other competing risks. The crude penetrance is the genotype-specific probability of developing a phenotype by a given age prior to dying or developing other phenotypes. We typically use estimates reported in peer-reviewed literature, which are most often net. In such cases, we convert the estimates to crude when needed. Our implementation uses net penetrances to calculate the carrier probabilities. Explicitly,
$$P(H_i \mid G_i, U_i) = \prod_{r=1}^{R} P(T_{ri} = t_{ri} \mid G_i, U_i)^{\delta_{ri}}\, P(T_{ri} > C_i \mid G_i, U_i)^{1 - \delta_{ri}}, \tag{3}$$
where $T_{ri}$ is the random variable and $t_{ri}$ is the observed cancer age. We assume that the censoring process and deaths by causes unrelated to a given cancer are independent of the time to cancer diagnosis. We further assume that censoring and death by other causes are non-informative with respect to carrier status. An in-depth discussion of the trade-offs involved in these modeling choices is provided by Katki, Blackford, Chen, and Parmigiani (2008).
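A relative’s likelihood contribution under net penetrances can be sketched as follows: an affected relative contributes the probability of diagnosis at the observed age, and an unaffected relative contributes the probability of remaining cancer-free through the censoring age. This is a minimal sketch assuming yearly age bins, cumulative penetrance arrays indexed by age, and conditional independence across cancers; the function names and penetrance values are hypothetical.

```python
def phenotype_likelihood(cum_pen, censor_age, dx_age=None):
    """Likelihood of one cancer history given genotype and sex.

    cum_pen[a] is the net cumulative penetrance P(T <= a) at integer age a.
    dx_age is the observed diagnosis age, or None if unaffected at censoring.
    """
    if dx_age is None:
        # Unaffected: cancer-free through the censoring age.
        return 1.0 - cum_pen[censor_age]
    # Affected: probability mass in the yearly bin of the diagnosis age.
    previous = cum_pen[dx_age - 1] if dx_age > 0 else 0.0
    return cum_pen[dx_age] - previous


def individual_likelihood(cum_pens, censor_age, dx_ages):
    """Product of per-cancer likelihoods (assumes independence across cancers)."""
    lik = 1.0
    for cancer, cum_pen in cum_pens.items():
        lik *= phenotype_likelihood(cum_pen, censor_age, dx_ages.get(cancer))
    return lik


# Hypothetical penetrances over ages 0..3 for a single genotype and sex.
pens = {"breast": [0.0, 0.1, 0.3, 0.6], "ovarian": [0.0, 0.0, 0.1, 0.2]}
# Breast cancer diagnosed at age 2, no ovarian cancer by censoring age 3.
print(individual_likelihood(pens, censor_age=3, dx_ages={"breast": 2}))
```

Evaluating this function once per candidate genotype gives the per-relative terms that enter the sum over family genotypes.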
Once these carrier probabilities have been obtained, they can be used to calculate the future risk of developing any of the cancers in the model. The future risk that the counselee develops the cancer indexed by r can be computed using either the net or crude penetrances, and has a slightly different interpretation depending on which penetrance estimates are used.
The counselee’s t0-year net future risk for a given cancer indexed by r is
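$$\begin{aligned} P(T_{r1} \le C_1 + t_0 \mid T_{r1} > C_1, H, U) &= \sum_{g_1 \in \mathcal{G}} P(T_{r1} \le C_1 + t_0 \mid T_{r1} > C_1, G_1 = g_1, U_1)\, P(G_1 = g_1 \mid H, U) \\ &= \sum_{g_1 \in \mathcal{G}} \frac{P(T_{r1} \le C_1 + t_0 \mid G_1 = g_1, U_1) - P(T_{r1} \le C_1 \mid G_1 = g_1, U_1)}{P(T_{r1} > C_1 \mid G_1 = g_1, U_1)}\, P(G_1 = g_1 \mid H, U). \end{aligned} \tag{4}$$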
The numerator and denominator of the fractional term in the final expression can be computed from the net genotype-specific penetrances. This t0-year risk prediction can be interpreted as the probability of developing cancer r within t0 years, conditional on being disease-free at the counselee’s current age and assuming a hypothetical world where there is no death or other phenotypes.
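The net future risk calculation described above can be sketched numerically: for each genotype, the conditional probability of diagnosis within the horizon is computed from the cumulative net penetrance, and the results are averaged using the posterior carrier probabilities as weights. The function name, the dictionary layout, and all numbers below are hypothetical.

```python
def net_future_risk(posterior, cum_pen_by_genotype, current_age, horizon):
    """Posterior-weighted risk of diagnosis within `horizon` years.

    posterior: dict genotype -> P(G_1 = g | H, U).
    cum_pen_by_genotype: dict genotype -> list of net cumulative penetrances,
        indexed by integer age, for the counselee's sex.
    """
    risk = 0.0
    for genotype, prob in posterior.items():
        cum = cum_pen_by_genotype[genotype]
        # Diagnosis in (current_age, current_age + horizon], given
        # cancer-free at current_age.
        within = cum[current_age + horizon] - cum[current_age]
        risk += prob * within / (1.0 - cum[current_age])
    return risk


posterior = {0: 0.9, 1: 0.1}  # hypothetical carrier probabilities
penetrances = {0: [0.0, 0.01, 0.02, 0.03], 1: [0.0, 0.1, 0.2, 0.4]}
print(net_future_risk(posterior, penetrances, current_age=1, horizon=2))
```

Because the weights are posterior probabilities, a strong family history raises the predicted risk by shifting weight toward the high-penetrance genotypes.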
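More realistically, one may consider the crude future risk, which can be interpreted as the probability of developing cancer $r$ within $t_0$ years, conditional on being disease-free and alive at the current age and assuming that the counselee survives all competing risks and death. For individual $i$, let $T^{d}_{ri}$ be their age of death from causes other than cancer $r$, $T^{*}_{ri} = \min(T_{ri}, T^{d}_{ri})$ be their age of first outcome (either cancer $r$ or death from causes other than cancer $r$), and $J_{ri} = I(T_{ri} \le T^{d}_{ri})$ be the binary indicator for developing the $r$th cancer. The counselee’s crude risk can be calculated as
$$\begin{aligned} P(T^{*}_{r1} \le C_1 + t_0, J_{r1} = 1 \mid T^{*}_{r1} > C_1, H, U) &= \sum_{g_1 \in \mathcal{G}} P(T^{*}_{r1} \le C_1 + t_0, J_{r1} = 1 \mid T^{*}_{r1} > C_1, G_1 = g_1, U_1)\, P(G_1 = g_1 \mid H, U) \\ &= \sum_{g_1 \in \mathcal{G}} \frac{P(T^{*}_{r1} \in (C_1, C_1 + t_0], J_{r1} = 1 \mid G_1 = g_1, U_1)}{P(T^{*}_{r1} > C_1 \mid G_1 = g_1, U_1)}\, P(G_1 = g_1 \mid H, U). \end{aligned} \tag{5}$$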
The fractional term in the final expression is computed directly from the crude penetrances for cancer r and the probability distribution of dying by other causes. The distribution of death by other causes is an additional parameter that can be estimated based on peer-reviewed literature. PanelPRO outputs crude future risk estimates by default.
2.2. Computation
The integration step over the family genotypes in computing the probabilities in a Mendelian model is usually carried out by applying the Elston-Stewart peeling algorithm (Elston & Stewart, 1971). The peeling algorithm is broadly used and generalizable to any number of heritable sites specified in the model. It is linear with respect to the number of relatives $I$ in the counselee’s family, but sums over all possible genotypes in the pedigree. Therefore, computation time is exponential with respect to the number of genes $K$, making integration challenging for even a moderate number of genes. The peeling and paring algorithm (Madsen et al., 2018) restricts the set of possible genotypes $\mathcal{G}$ to the set $\mathcal{G}_M \subseteq \mathcal{G}$ of genotypes that contain up to $M$ deleterious variants. Then, one applies local integration over the “pared” set of genotypes to approximate $P(H \mid G_1 = g_1, U)$ as
$$P(H \mid G_1 = g_1, U) \approx \sum_{g_2, \ldots, g_I \in \mathcal{G}_M} P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) \prod_{i=1}^{I} P(H_i \mid G_i = g_i, U_i). \tag{6}$$
When M ≥ K, the peeling-paring algorithm is equivalent to the peeling algorithm.
For example, if the model has $K = 11$ genes, each with pathogenic variants that are fairly uncommon in the population, it would be very rare to observe an individual carrying variants in all 11 genes, or even in 9 or 10 of them. It is also possible that many genotypes with several such pathogenic variants may not be viable. Illustrating again with the scenario of two states for each gene, if $M = 2$ and $K = 11$, there are only $\binom{11}{0} + \binom{11}{1} + \binom{11}{2} = 67$ admissible genotypes in $\mathcal{G}_M$ that need to be summed over. This is a substantial reduction compared to the $2^K = 2048$ genotypes in the original space $\mathcal{G}$.
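The reduction from paring can be verified by filtering the full binary genotype space. This is a small illustrative sketch; the function name `pared_genotypes` is an assumption and does not correspond to a function in the PanelPRO package.

```python
from itertools import product
from math import comb


def pared_genotypes(n_genes, max_variants):
    """Binary genotype vectors carrying at most `max_variants` variants."""
    return [
        g for g in product((0, 1), repeat=n_genes) if sum(g) <= max_variants
    ]


n_pared = len(pared_genotypes(11, 2))
n_full = len(pared_genotypes(11, 11))  # M >= K recovers the full space
print(n_pared, n_full)
# The pared count matches the sum of binomial coefficients for 0, 1, 2 variants.
print(sum(comb(11, m) for m in range(3)))
```

Since the sum in the peeling step runs over genotypes for each relative, shrinking the per-person space from 2048 to 67 reduces the work at every step of the recursion.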
There are trade-offs in selecting the value for $M$. Higher values of $M$ lead to more precise approximations of the posterior genotypic distribution, but as $M$ approaches $K$, the computational complexity approaches $2^K$ once again. A smaller value of $M$ leads to higher efficiency, but this simplifying assumption has the cost of lower precision. These trade-offs are discussed in greater detail by Madsen et al. (2018). Our PanelPRO implementation allows users to set the value for $M$, but in both our simulations and data application, we used $M = 2$. Even as the number of genes in PanelPRO increases to 5, 7, 11, and potentially more, the computational expense of the integration step is kept feasible by imposing this restriction. The Lander-Green algorithm (Lander & Green, 1987) is an alternative algorithm that is expected to be more efficient when the number of genes $K$ exceeds the number of family members $I$ in the pedigree. However, it may be less effective for large pedigrees. Further discussion of genetic linkage algorithms and their computational considerations can be found in G. Lee et al. (2021).
2.3. Additional Modifiers
2.3.1. Secondary Cancers
Separate from the general approach of including primary cancers in a model, secondary cancers such as contralateral breast cancer modify risk estimation based on years since the primary cancer diagnosis. Suppose that primary breast cancer is indexed by r = b and contralateral breast cancer by r = c. Then can be expressed in terms of the contralateral breast cancer net penetrances as
| (7) |
Tci is a random variable, whereas and are the observed ages of diagnosis for breast and contralateral breast cancer, respectively.
2.3.2. Risk Modifiers
Our implementation supports modifying risk at the cancer penetrance level using risk or hazard ratios for interventions such as prophylactic mastectomies, oophorectomies, and hysterectomies. When calculating carrier probabilities, simply use the modified penetrances in place of the raw penetrances . Suppose family member i received a preventive intervention at age Tint,i, and let RRr, int (Tint,i|Gi, Ui) be the relative risk of developing cancer r for those that receive the intervention at age Tint,i, computed based on probabilities that are conditional on genotype and sex. Their modified penetrance for cancer r is
| (8) |
Suppose that instead of a relative risk, the hazard ratio for the intervention at age Tint,i, HRr, int (Tint,i|Gi, Ui), is available. Define as the survival function for cancer r (conditional on sex and genotype), and let be the corresponding hazard. Then the modified hazard, survival, and penetrance functions at age t for this intervention are:
| (9) |
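As an illustration of hazard-ratio-based modification, the sketch below converts a cumulative penetrance into yearly discrete-time hazards, scales the hazards from the intervention age onward, and rebuilds the survival and cumulative penetrance. The discrete-time formulation, the choice to apply the hazard ratio at ages greater than or equal to the intervention age, and the function name are all assumptions made for this example; they are not taken from the package.

```python
def modify_penetrance_with_hr(cum_pen, hazard_ratio, intervention_age):
    """Cumulative penetrance after scaling hazards from `intervention_age` on.

    cum_pen[a] is the unmodified cumulative penetrance at integer age a.
    """
    modified = []
    survival = 1.0          # modified survival function
    previous_cum = 0.0      # unmodified cumulative penetrance at age a - 1
    for age, cum in enumerate(cum_pen):
        at_risk = 1.0 - previous_cum
        # Discrete-time hazard implied by the unmodified penetrance.
        hazard = (cum - previous_cum) / at_risk if at_risk > 0 else 0.0
        if age >= intervention_age:
            hazard = min(1.0, hazard_ratio * hazard)
        survival *= 1.0 - hazard
        modified.append(1.0 - survival)
        previous_cum = cum
    return modified


baseline = [0.0, 0.1, 0.3, 0.6]  # hypothetical penetrance over ages 0..3
print(modify_penetrance_with_hr(baseline, hazard_ratio=0.5, intervention_age=2))
```

A hazard ratio of 1 leaves the penetrance unchanged, which is a convenient sanity check on any implementation of this step.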
2.3.3. Germline and Tumor Biomarker Testing
When germline or tumor marker testing results are available for some family members, this information can also be incorporated into the probability estimates. For each relative i, one substitutes with
| (10) |
where Vi is the observed germline testing results and Wi is the observed tumor marker testing results for relative i. The probabilities can be interpreted as test sensitivities and specificities. In practice, we often assume conditional independence of the different germline and marker testing results when robust joint estimates are not available. Currently, our implementation allows for modifying risk based on germline testing for any of the genes included in the model. It also supports modifying the risk of breast cancer conditional on BRCA1 or BRCA2 carrier status based on breast cancer tumor markers, ER, CK5/6, CK14, HER, and PR; and the risk of colorectal cancer conditional on MLH1, MSH2, MSH6, or PMS2 carrier status based on the colorectal cancer tumor marker, MSI. The implementation is designed to make it straightforward to extend risk modification to additional tumor markers or to modify the values according to user inputs.
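The role of test sensitivity and specificity can be illustrated for a single gene and a single tested individual. In the sketch below, the probability of the observed test result given carrier status is used as an extra likelihood factor, and untested individuals contribute a factor of 1. The function names and the sensitivity, specificity, and prior values are hypothetical.

```python
def germline_test_likelihood(carrier, result, sensitivity, specificity):
    """P(test result | carrier status) for one gene.

    result: True (positive), False (negative), or None (not tested).
    """
    if result is None:
        return 1.0  # no test performed, so no information is added
    if carrier:
        return sensitivity if result else 1.0 - sensitivity
    return 1.0 - specificity if result else specificity


def update_with_test(prior_carrier, result, sensitivity, specificity):
    """Carrier probability after multiplying in the test likelihood."""
    carrier_term = prior_carrier * germline_test_likelihood(
        True, result, sensitivity, specificity
    )
    noncarrier_term = (1.0 - prior_carrier) * germline_test_likelihood(
        False, result, sensitivity, specificity
    )
    return carrier_term / (carrier_term + noncarrier_term)


# Hypothetical values: 1% prior, 99% sensitivity, 99.9% specificity.
print(update_with_test(0.01, True, sensitivity=0.99, specificity=0.999))
```

In a pedigree model the same factor would multiply the tested relative’s phenotype likelihood for each candidate genotype before the sum over family genotypes is taken.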
2.4. Conditions for Collapsibility
It is possible to collapse a larger “full” model into a smaller submodel that contains a subset of the specified genes and/or cancers. In this section we derive the following important facts:
If the submodel contains a subset of the genes in the full model, both models will return the same posterior probabilities if the allele frequency for being a noncarrier is 1 for all genes not present in the submodel. In other words, the probability of carrying a pathogenic variant is zero for any of the genes that are absent in the submodel.
If the submodel contains a subset of the cancers in the full model, the submodel can be collapsed within the full model if the penetrances of the cancers not included in the submodel do not change depending on genotype.
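Suppose that one has fit a model with $K$ genes and wishes to collapse it into a submodel that only includes the genes with indices in the set $\mathcal{K}^{(*)}$. Let $\mathcal{K}^{(-*)}$ be the index set for genes not included in the submodel, such that $\mathcal{K}^{(*)} \cup \mathcal{K}^{(-*)} = \{1, \ldots, K\}$ is the full set of $K$ gene indices. Similarly, $G_i$ can be partitioned as $G_i = (G_i^{(*)}, G_i^{(-*)})$, where $G_i^{(*)}$ is the genotype vector for the genes included in the submodel and $G_i^{(-*)}$ is the genotype vector for the genes not included in the submodel. For the $i$th individual and $k$th gene, let $G_{ki} = 0$ represent a noncarrier and allow carriers of pathogenic variants to be represented using numbers other than zero. In order to obtain the same results from both the full model and submodel, a sufficient condition for collapsibility is that $P(G_{ki} = 0) = 1$ for all genes $k \in \mathcal{K}^{(-*)}$ and all individuals $i$ in the family.
Under this condition, the counselee’s genotype frequency is
$$P(G_1 = g_1) = \begin{cases} P(G_1^{(*)} = g_1^{(*)}) & \text{if } g_{k1} = 0 \text{ for all } k \in \mathcal{K}^{(-*)}, \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$
Under the Mendelian model’s conditional independence assumptions for the genotypes in the pedigree, we can also obtain
$$P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) = P(G_2^{(*)} = g_2^{(*)}, \ldots, G_I^{(*)} = g_I^{(*)} \mid G_1^{(*)} = g_1^{(*)}) \tag{12}$$
when $g_{ki} = 0$ for all $k \in \mathcal{K}^{(-*)}$ and $i = 1, \ldots, I$; and $P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) = 0$ otherwise.
Let $\mathcal{A}_i$ be the set of gene indices that satisfy $g_{ki} = 1$ for a given individual $i$, and let $\mathcal{A}_i^{(*)}$ be the analogous index set for only the genes that are included in the submodel. In the case where $g_{ki} = 0$ for all $k \in \mathcal{K}^{(-*)}$, it follows that $\mathcal{A}_i = \mathcal{A}_i^{(*)}$ and
$$P(H_i \mid G_i = g_i, U_i) = P(H_i \mid G_i^{(*)} = g_i^{(*)}, U_i). \tag{13}$$
Using these findings to compute Eq. 2, the expression for the posterior probabilities (Eq. 1) can be shown to be equivalent for the full model with $K$ genes and the submodel with $|\mathcal{K}^{(*)}|$ genes. Note that the equality below holds because the terms in the summation where $g_{ki} \neq 0$ for at least one $k \in \mathcal{K}^{(-*)}$ are zero, and therefore do not contribute.
$$\begin{aligned} P(G_1 = g_1 \mid H, U) &\propto P(G_1 = g_1) \sum_{g_2, \ldots, g_I} P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) \prod_{i=1}^{I} P(H_i \mid G_i = g_i, U_i) \\ &= P(G_1^{(*)} = g_1^{(*)}) \sum_{g_2^{(*)}, \ldots, g_I^{(*)}} P(G_2^{(*)} = g_2^{(*)}, \ldots, G_I^{(*)} = g_I^{(*)} \mid G_1^{(*)} = g_1^{(*)}) \prod_{i=1}^{I} P(H_i \mid G_i^{(*)} = g_i^{(*)}, U_i) \\ &\propto P(G_1^{(*)} = g_1^{(*)} \mid H, U). \end{aligned} \tag{14}$$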
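Now suppose that the full model has $R$ cancers or other phenotypes and that the submodel of interest contains only the phenotypes with indices in the set $\mathcal{R}^{(*)}$. Denote the complementary set as $\mathcal{R}^{(-*)}$. Let $H_i^{(*)}$ be the phenotype vector for individual $i$ corresponding to the phenotypes included in the submodel. In order to collapse the submodel within the full model, we require $P(H_{ri} \mid G_i = g_i, U_i)$ to be constant for all $g_i \in \mathcal{G}$ with respect to each individual $i$ and cancer $r \in \mathcal{R}^{(-*)}$. In words, each relative’s penetrances for the cancers/phenotypes not included in the submodel must not depend on genotype. When this value is constant across genotypes, the additional cancers do not contribute to the posterior probabilities (Eq. 1) after normalization:
$$\begin{aligned} P(G_1 = g_1 \mid H, U) &\propto P(G_1 = g_1) \sum_{g_2, \ldots, g_I} P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) \prod_{i=1}^{I} \Bigg[ P(H_i^{(*)} \mid G_i = g_i, U_i) \prod_{r \in \mathcal{R}^{(-*)}} P(H_{ri} \mid U_i) \Bigg] \\ &\propto P(G_1 = g_1) \sum_{g_2, \ldots, g_I} P(G_2 = g_2, \ldots, G_I = g_I \mid G_1 = g_1) \prod_{i=1}^{I} P(H_i^{(*)} \mid G_i = g_i, U_i) \\ &\propto P(G_1 = g_1 \mid H^{(*)}, U), \end{aligned} \tag{15}$$
where $H^{(*)} = (H_1^{(*)}, \ldots, H_I^{(*)})$.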
When the peeling-paring algorithm is used to restrict the possible genotypes to those in $\mathcal{G}_M$, one can simply substitute $\mathcal{G}_M$ for $\mathcal{G}$ in the derivations.
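Both collapsibility conditions can be checked numerically in the simplest possible setting, a single individual with no relatives, where the posterior is just the normalized product of genotype frequency and phenotype likelihood. The sketch below is a toy verification with hypothetical numbers, not a proof and not package code.

```python
def posterior(prior, lik):
    """Normalized product of genotype frequencies and likelihoods."""
    unnorm = {g: prior[g] * lik[g] for g in prior}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}


# Gene collapsibility: the second gene has carrier probability zero.
prior_full = {(0, 0): 0.99, (1, 0): 0.01, (0, 1): 0.0, (1, 1): 0.0}
lik_full = {(0, 0): 0.05, (1, 0): 0.40, (0, 1): 0.30, (1, 1): 0.60}
prior_sub = {(0,): 0.99, (1,): 0.01}
lik_sub = {(0,): 0.05, (1,): 0.40}
full = posterior(prior_full, lik_full)
sub = posterior(prior_sub, lik_sub)
print(full[(1, 0)], sub[(1,)])  # identical carrier probabilities

# Cancer collapsibility: an extra cancer whose likelihood is constant
# across genotypes cancels in the normalization.
lik_extra = {g: lik_sub[g] * 0.7 for g in lik_sub}
print(posterior(prior_sub, lik_extra)[(1,)], sub[(1,)])
```

The same cancellation argument extends to pedigrees because the constant factors and zero-probability genotypes behave identically inside the sum over family genotypes.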
3. Data
3.1. Model Parameters
In summary, the PanelPRO framework can be used to train a Mendelian model with any K genes and R cancers. The training involves meta-analysis or literature reviews to identify allele frequency inputs for each of the genes and age-dependent penetrance inputs for each of the gene-cancer combinations. PanelPRO converts these into clinically relevant carrier probabilities and absolute risk evaluations.
To illustrate, we present three different examples of PanelPRO models that collectively span 11 genes, plus a hypothetical PANC gene, and 11 cancers. PANC is an unspecified pancreatic cancer susceptibility gene used in the PancPRO (W. Wang et al., 2007) Mendelian model for identifying individuals at high risk for pancreatic cancer. The current PanelPRO package includes all the input estimates discussed here, as well as others, with the exception of PANC, which is only discussed here in the context of back-compatibility with PancPRO. Allele frequencies and penetrances used to construct the models are estimated based on existing literature. For BRCA1 and BRCA2, we used the non-Ashkenazi, Ashkenazi Jewish and Italian allele frequency estimates from BRCAPRO (A. Antoniou et al., 2002; S. Chen et al., 2004); for MLH1, MSH2, and MSH6, we used the allele frequency estimates from MMRpro (S. Chen et al., 2004, 2006); for the hypothetical PANC gene, we used the allele frequency estimate from Pancpro (S. Chen et al., 2004; W. Wang et al., 2007); and for CDKN2A, we used the allele frequency estimate from Melapro (Berwick et al., 2006; S. Chen et al., 2004). Allele frequency estimates for ATM, CHEK2, and PALB2 were taken from A. J. Lee et al. (2016). Finally, we estimated allele frequencies for EPCAM and PMS2 from a 25-gene panel study of 252,223 individuals (Rosenthal, Bernhisel, Brown, Kidd, & Manley, 2017).
Racially-informed (All Races, American Indian and Alaska Native, Asian, Black, White, Hispanic, White Hispanic, and White Non-Hispanic) probabilities of developing cancer (unconditional on genotype) were taken from the DevCan database (Statistical Research and Applications Branch, National Cancer Institute, 2020); penetrances for non-carriers are estimated by PanelPRO based on these values as well as the carrier penetrances and allele frequencies. When available, we took cancer penetrances for carriers from data included in the BayesMendel package: the BRCA1 and BRCA2 estimates for the probability of developing breast or ovarian cancer (J. Chen et al., 2020); the MLH1, MSH2, and MSH6 estimates for the probability of developing colorectal or endometrial cancer (Felton, Gilchrist, & Andrew, 2007; C. Wang, Wang, Hughes, Parmigiani, & Braun, 2020); the PANC estimates for the probability of developing pancreatic cancer (Klein et al., 2002; W. Wang et al., 2007); and the CDKN2A estimates for the probability of developing melanoma (Begg et al., 2005; Bishop et al., 2002; W. Wang et al., 2010). These gene×cancer combinations are colored orange in Figure 1. Combinations colored in blue were obtained based on a literature review and are also available in the All Syndromes Known to Man Evaluator (ASK2ME) (Braun, Yang, Griffin, Parmigiani, & Hughes, 2018) tool. Combinations in white are not included, as insufficient evidence of association has been accrued.
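One natural way to back out a non-carrier penetrance from a population-level penetrance is the law of total probability: the population curve is a mixture of the carrier and non-carrier curves, weighted by genotype probabilities. The sketch below illustrates that idea under the assumption that carrier genotype probabilities (derived from allele frequencies) are supplied directly; it is not necessarily the procedure used inside the PanelPRO package, and all names and numbers are hypothetical.

```python
def noncarrier_penetrance(pop_pen, carrier_pens, carrier_probs):
    """Solve the mixture pop = p0 * noncarrier + sum_g p_g * carrier_g.

    pop_pen: population cumulative penetrance by integer age.
    carrier_pens: list of cumulative penetrance lists, one per carrier genotype.
    carrier_probs: population probability of each carrier genotype.
    """
    p_noncarrier = 1.0 - sum(carrier_probs)
    result = []
    for age, pop in enumerate(pop_pen):
        carried = sum(p * pen[age] for p, pen in zip(carrier_probs, carrier_pens))
        # Clip at zero in case the inputs are mutually inconsistent.
        result.append(max(0.0, (pop - carried) / p_noncarrier))
    return result


population = [0.0, 0.012, 0.05]   # hypothetical population penetrance
carriers = [[0.0, 0.2, 0.6]]      # hypothetical carrier penetrance
print(noncarrier_penetrance(population, carriers, carrier_probs=[0.01]))
```

Because carriers are rare, the non-carrier curve stays close to the population curve, with the gap widening at ages where carrier penetrance is high.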
Figure 1:

Associations between the 11 genes, plus a hypothetical PANC gene (rows), and 11 cancers (columns) collectively incorporated in the PanelPRO models illustrated in this paper. PANC is only listed in order to discuss back-compatibility with PancPRO and is not a gene that can currently be specified in the PanelPRO package. Population-level allele frequencies are given in parentheses (the CHEK2 allele frequency is for the 1100delC variant only; the allele frequencies for all other genes are for any pathogenic variant). Orange cells correspond to penetrances that are available in the BayesMendel package, and blue cells correspond to penetrances based on a literature review that are also available in ASK2ME.
The sensitivities and specificities for tumor marker testing of ER, CK5/6, CK14, HER, and PR are from Lakhani et al. (2005, 2002); the sensitivity and specificity for MSI testing are from Chu, Chen, and Louis (2007). While the probability distribution of death from other causes does not come into play for the analysis in this paper, the PanelPRO package takes estimates for these values from DevCan (Statistical Research and Applications Branch, National Cancer Institute, 2020). The hazard ratios used for incorporating prophylactic mastectomies and oophorectomies into risk assessment are based on Katki (2007).
3.2. USC-Stanford HCP Cohort
In Section 5, we validate our method using data from the USC-Stanford Hereditary Cancer Panel (HCP) Testing Study, a prospective, multi-center study on multiplex gene panel testing for cancer susceptibility (Idos et al., 2019). A diverse group of 2000 patients was recruited from three medical centers: 1) USC Norris Comprehensive Cancer Center; 2) Los Angeles County + USC Medical Center; 3) Stanford University Cancer Institute. Eligible patients met clinical guideline criteria for genetic testing or had a ≥2.5% probability of mutation carriage calculated by the following validated models or algorithms: BayesMendel (S. Chen et al., 2004), BOADICEA (A. C. Antoniou et al., 2008), IBIS (Tyrer-Cuzick) (Tyrer et al., 2004), PREMM 1,2,6 (Euhus, 2001; Kastrinos et al., 2009; Kote-Jarai et al., 2011), National Comprehensive Cancer Network (NCCN) Guidelines, or a personal history of ≥10 cumulative lifetime tubular adenomas (Idos et al., 2019). Diagnostic yield and off-target mutation detection were evaluated for 25- or 28-gene multi-gene panels (Idos et al., 2019), and testing was performed by Myriad Genetic Laboratories (Salt Lake City, UT).
We validate two models, PanelPRO-5BC and PanelPRO-11, using this data. We excluded families if: the counselee is a carrier of one or more variants only in genes not included in the model; the counselee has a variant of uncertain significance (VUS), but not a pathogenic variant for any of the genes in the model; or the models cannot be applied to the pedigree (typically due to “loops”/inter-marriages or the reporting of cancer affection ages that are greater than the individual’s censoring age). We thus consider 1612 families for validating PanelPRO-5BC (average family size of 34.15) and 1468 families for PanelPRO-11 (average family size of 33.78). Tables 1 and 2 summarize the number of pathogenic variant carriers and cancer cases among the counselees in the study for PanelPRO-5BC and PanelPRO-11, respectively. The differing number of carriers and cancers in Table 1 vs Table 2 is due to our pre-processing approach, which is partially dependent on the genes included in the model.
Table 1:
Number of counselees in the HCP multi-gene panel testing study who are pathogenic variant carriers (Table 1a) and cancer cases (Table 1b) for genes and cancers modeled by PanelPRO-5BC. These counts are based on the 1612 families left after processing the data for validating PanelPRO-5BC. It is possible for a counselee to be a carrier of more than one pathogenic variant or diagnosed with more than one cancer.
| (a) Counselees who are pathogenic variant carriers. | ||||||
|---|---|---|---|---|---|---|
| ATM | BRCA1 | BRCA2 | CHEK2 | PALB2 | Any Carrier | Non-Carrier |
| 23 | 41 | 35 | 22 | 15 | 113 | 1499 |
| (b) Counselees who are cancer cases. | ||||
|---|---|---|---|---|
| Breast | Ovarian | Other Cancer | Any Cancer | No Cancer |
| 485 | 106 | 693 | 1161 | 451 |
Table 2:
Number of counselees in the HCP multi-gene panel testing study who are pathogenic variant carriers (Table 2a) and cancer cases (Table 2b) for genes and cancers modeled by PanelPRO-11. These counts are based on the 1468 families left after processing the data for validating PanelPRO-11. It is possible for a counselee to be a carrier of more than one pathogenic variant or diagnosed with more than one cancer.
| (a) Counselees who are pathogenic variant carriers. | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ATM | BRCA1 | BRCA2 | CDKN2A | CHEK2 | EPCAM | MLH1 | MSH2 | MSH6 | PALB2 | PMS2 | Any Carrier | Non-Carrier |
| 23 | 41 | 35 | 5 | 27 | 1 | 9 | 11 | 9 | 14 | 15 | 150 | 1318 |
| (b) Counselees who are cancer cases. | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brain | Breast | Colorectal | Endometrial | Gastric | Kidney | Melanoma | Ovarian | Pancreatic | Prostate | Small Intestine | Other Cancer | Any Cancer | No Cancer |
| 9 | 444 | 218 | 50 | 36 | 15 | 92 | 93 | 36 | 26 | 1 | 254 | 1055 | 413 |
Additionally, breast tumor marker and MSI testing results are reported for some of the families. In the PanelPRO-5BC validation subset, there are 204 families with at least one relative tested for ER, 167 for PR, and 55 for HER (a total of 226 families, or 14%, reported any results). In the PanelPRO-11 validation subset, there are 185 families with at least one relative tested for ER, 153 for PR, 45 for HER, and 3 for MSI (a total of 208 families, or 14.2%, reported any results). Information on preventative interventions like mastectomies, oophorectomies, and hysterectomies was not available.
We used racial or ancestry-informed allele frequencies and penetrances when available to match those reported in the data. Missing ages of cancer diagnosis and censoring ages were imputed by PanelPRO (G. Lee et al., 2021).
4. Simulation Studies
4.1. Data Generation
For each simulation study, we generated 1,000,000 families with structures sampled from families in the HCP study. Each simulated family includes the counselee and their grandparents, parents, and siblings, as well as any other offspring of the above. Genotypes for the founders of each family were sampled from non-Ashkenazi Jewish population-level allele frequencies. Then, the genotypes of their descendants were assigned based on Mendelian inheritance, with all genes assumed to be independent. The genotypes of non-blood relatives (i.e. spouses of the descendants) were sampled from the non-Ashkenazi Jewish population-level allele frequencies. Cancer status and age of diagnosis were simulated conditional on the individual’s genotype, based on the cancer penetrances for “All Races” used in the models. Based on the genotype and cancer diagnosis statuses of each individual, we simulated results from biomarker testing for ER, CK5/6, CK14, PR, and HER for those with breast cancer and MSI for those with colorectal cancer. The code to simulate families and reproduce these simulation studies is publicly available (Liang et al., 2022b).
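The genotype-generation step can be illustrated with a minimal Python sketch for a single nuclear family. This is not the simulation code of Liang et al. (2022b); the function names, the gene labels, and the allele frequencies are hypothetical, and founders' two alleles are assumed to be drawn independently from the population allele frequency. Cancer outcomes and biomarkers are not simulated here.

```python
import random


def sample_founder_genotype(allele_freq, rng):
    """Draw a founder's two alleles independently (1 = pathogenic, 0 = wild type)."""
    return tuple(int(rng.random() < allele_freq) for _ in range(2))


def inherit_genotype(mother, father, rng):
    """Mendelian transmission: one allele chosen at random from each parent."""
    return (rng.choice(mother), rng.choice(father))


def is_carrier(genotype):
    """A carrier has at least one pathogenic allele."""
    return any(genotype)


def simulate_nuclear_family(allele_freqs, n_children, rng):
    """Simulate genotypes gene by gene, treating genes as independent.

    allele_freqs maps a gene name to a (hypothetical) pathogenic allele frequency.
    Returns a dict: family member -> {gene: (allele, allele)}.
    """
    children = [f"child{i + 1}" for i in range(n_children)]
    family = {member: {} for member in ["mother", "father"] + children}
    for gene, freq in allele_freqs.items():
        family["mother"][gene] = sample_founder_genotype(freq, rng)
        family["father"][gene] = sample_founder_genotype(freq, rng)
        for child in children:
            family[child][gene] = inherit_genotype(
                family["mother"][gene], family["father"][gene], rng
            )
    return family


# Illustrative allele frequencies only; not the values used in the paper.
example_family = simulate_nuclear_family(
    {"GENE_A": 0.01, "GENE_B": 0.005}, n_children=3, rng=random.Random(0)
)
```

Extending the sketch to the three-generation structures described above would amount to applying `inherit_genotype` repeatedly down the pedigree and sampling spouses as additional founders.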
4.2. Performance Metrics
To evaluate the carrier probabilities from our models, we consider several performance metrics (Steyerberg et al., 2010). We use the area under the curve (AUC) as a measure for discrimination, the expected divided by the observed number of events (E/O) as a measure of calibration, and mean squared error (MSE) as a measure of overall accuracy. When computing these metrics, non-carriers are defined as individuals who are not variant carriers of any gene in the model. Percentile confidence intervals are obtained from 1000 bootstrap samples.
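As an illustration of how these metrics can be computed from predicted carrier probabilities and observed carrier labels, the following Python sketch implements the AUC, E/O, MSE, and a percentile bootstrap interval. It is a simplified stand-in, not the evaluation code used for the paper: the function names are ours, the AUC uses a quadratic-time pairwise comparison that would be too slow for 1,000,000 families, and redrawing bootstrap replicates that contain only one class is our assumption.

```python
import random


def auc(probs, labels):
    """Probability that a random carrier scores above a random non-carrier (ties count 1/2)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one carrier and one non-carrier")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def expected_over_observed(probs, labels):
    """Calibration: expected number of carriers divided by observed number."""
    observed = sum(labels)
    if observed == 0:
        raise ValueError("E/O is undefined when no carriers are observed")
    return sum(probs) / observed


def mse(probs, labels):
    """Mean squared difference between predicted probability and 0/1 carrier status."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def bootstrap_percentile_ci(metric, probs, labels, n_boot=1000, level=0.95, seed=0):
    """Percentile interval from n_boot resamples of individuals drawn with replacement."""
    metric(probs, labels)  # fail early if the metric is undefined on the full data
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([probs[i] for i in idx], [labels[i] for i in idx]))
        except ValueError:
            continue  # assumption: redraw replicates on which the metric is undefined
    stats.sort()
    lower = stats[int((1 - level) / 2 * (n_boot - 1))]
    upper = stats[int((1 + level) / 2 * (n_boot - 1))]
    return lower, upper
```

An E/O above 1 indicates that the model predicts more carriers than were observed, and a value below 1 indicates underprediction.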
4.3. Back-Compatibility
Four Mendelian risk prediction models are currently implemented in the BayesMendel R package (Table 3), each incorporating information from between one and three genes and cancers. Altogether, these models span seven genes (including a hypothetical PANC gene) and six cancers.
Table 3:
Genes and cancers incorporated in the four existing BayesMendel Mendelian risk prediction models.
| BayesMendel model | Gene(s) | Cancer(s) |
|---|---|---|
| BRCAPRO | BRCA1, BRCA2 | Breast, Ovarian |
| MMRpro | MLH1, MSH2, MSH6 | Colorectal, Endometrial |
| Pancpro | hypothetical PANC | Pancreatic |
| Melapro | CDKN2A | Melanoma |
To validate the reverse-compatibility of our approach, we fit these existing models using the PanelPRO and BayesMendel packages and evaluated their performance on 1,000,000 simulated families, using the same model parameters and settings. These families were simulated assuming only the presence of the gene-cancer associations incorporated in the four existing models (orange in Figure 1), and based on family structures sampled from the HCP cohort. Because PanelPRO and BRCAPRO from the BayesMendel package handle contralateral breast cancer using different approaches, we eliminated it from consideration when fitting the models, to obtain a more direct comparison. The resulting metrics and carrier probabilities are virtually identical between the two sets of models, as expected (Supplemental Figure S1 and Supplemental Table S1 (Liang et al., 2022a)).
4.4. PanelPRO-5BC
We then simulated and evaluated 1,000,000 families under PanelPRO-5BC, a model that obtains carrier probabilities for ATM, BRCA1, BRCA2, CHEK2, and PALB2 based on family history of breast and ovarian cancer. As a point of reference, we also used the PanelPRO package to evaluate the simulated families with the BRCAPRO submodel, which only considers BRCA1 and BRCA2. Figure 2 plots metrics with bootstrap confidence intervals for PanelPRO-5BC; these metrics are also reported in Supplemental Table S2 (Liang et al., 2022a).
Figure 2:

AUC, calibration, and MSE for the full PanelPRO-5BC model (black) and the BRCAPRO submodel (blue) evaluated on 1,000,000 families simulated based on PanelPRO-5BC, with (open points) and without (closed points) risk modifiers. “BRCAPRO genes” indicates any of the genes in BRCAPRO (BRCA1, BRCA2), and “Any” indicates any of the five genes in PanelPRO-5BC. 95% bootstrap percentile confidence intervals are also shown.
The results for the PanelPRO-5BC model and its BRCAPRO submodel are highly similar. The metrics for PanelPRO-5BC and BRCAPRO were also similar to each other across bootstrap replicates, as described in Section S7 of the supplementary material (Liang et al., 2022a). Incorporating simulated tumor marker testing for ER, PR, and HER in the model led to slightly better discrimination and precision of BRCA1 and BRCA2, as well as differences in calibration.
4.5. PanelPRO-11
As another illustration of the flexibility of our approach, we considered PanelPRO-11, an 11-gene (ATM, BRCA1, BRCA2, CDKN2A, CHEK2, EPCAM, MLH1, MSH2, MSH6, PALB2, and PMS2), 11-cancer (brain, breast, colorectal, endometrial, gastric, kidney, melanoma, ovarian, pancreatic, prostate, and small intestine) model. The performance metrics resulting from evaluating PanelPRO-11 on 1,000,000 simulated families are plotted in Figure 3 and are reported in detail in Supplemental Table S3 (Liang et al., 2022a). We also evaluated the three BayesMendel submodels (BRCAPRO, MMRpro, Melapro) that are nested within PanelPRO-11. Including biomarker testing results can improve the metrics, particularly discrimination, for related genes such as BRCA1, BRCA2, MLH1, MSH2, MSH6, and PMS2. The most noticeable differences in discrimination are for MSH6 and PMS2. Incorporating tumor biomarker testing does not appear to meaningfully improve the discrimination for BRCA1, BRCA2, MLH1, and MSH2, likely because (unlike MSH6 and PMS2) these genes already have very high AUCs (> 0.9 or even > 0.95) in the baseline simulations. If the AUCs are close to the maximum of 1 to begin with, the room for improvement after adding risk modifiers is very small. The very large calibration confidence intervals for EPCAM can be attributed to its low allele frequency and the correspondingly low number of cases among the 1,000,000 simulated counselees.
Figure 3:

AUC, calibration, and MSE for the full PanelPRO-11 model (black) and its BayesMendel submodels BRCAPRO, MMRpro, and Melapro (blue) evaluated on 1,000,000 families simulated based on PanelPRO-11, with (open points) and without (closed points) risk modifiers. “BRCAPRO genes”, “MMRpro genes”, and “Any BM” indicate any of the genes in BRCAPRO (BRCA1, BRCA2), MMRpro (MLH1, MSH2, MSH6), and any model in the BayesMendel package (BRCA1, BRCA2, MLH1, MSH2, MSH6, CDKN2A), respectively. “Any” indicates any of the eleven genes in PanelPRO-11. 95% bootstrap percentile confidence intervals are also shown.
Throughout, the full PanelPRO-11 model’s metrics are closely aligned with those of its submodels, with slightly better discrimination in some cases. In nearly all of the bootstrap samples, PanelPRO-11 has a better AUC and MSE. PanelPRO-11’s improvement in terms of calibration is less consistent, but the differences in model calibration are typically small (less than 2%) within bootstrap samples. Furthermore, whether or not PanelPRO improves over the submodel seems to be largely dependent on how well calibrated both models are; when both models are well-calibrated, PanelPRO’s calibration will be better about half the time. See Section S7 of the supplementary material for additional details (Liang et al., 2022a). PanelPRO-11 is thus able to produce the results of the existing BRCAPRO, MMRpro, and Melapro models, while also computing probabilities for being a carrier of ATM, CHEK2, EPCAM, PALB2, and PMS2.
The AUC for predicting pathogenic variant carriers of “Any” gene is 0.69. This is lower than the discrimination for most of the individual genes, many of which have AUCs above 0.8 or even 0.9. Groupings of genes that have individually high AUCs, like the BRCAPRO genes (BRCA1 and BRCA2) or the MMRpro genes (MLH1, MSH2, and MSH6), tend to retain good discrimination. However, including poorly-discriminating genes, as expected, pulls down the group AUC. This trend is observable for both PanelPRO-11 and the PanelPRO-5BC AUC results in Figure 2(a).
Figure 4 plots the AUCs from this simulation study against the relative risks of developing each of the 11 cancers in PanelPRO-11 by age 70, for pathogenic variant carriers compared to noncarriers. Both male and female relative risks are shown; see Supplemental Figure S2 for plots that separate between sexes (Liang et al., 2022a). Genes with lower AUCs correspond to those that have low relative risks for each of the eleven cancers. It appears that good discrimination for a given gene requires the gene to have a high relative risk for at least one cancer in the model. This trend is not expected for allele frequencies (Supplemental Figure S3 (Liang et al., 2022a)). Note that ATM and CHEK2 have both low penetrance and high prevalence, but their low AUC is likely attributable to the former. The AUCs, allele frequencies, and relative risks plotted in Figure 4 and Supplemental Figures S2–S3 are summarized in Supplemental Table S4 (Liang et al., 2022a).
Figure 4:

Area under the curve (AUC) from evaluating PanelPRO-11 on 1,000,000 simulated families plotted against the relative risk of developing each cancer by age 70, for pathogenic variant carriers compared to noncarriers.
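The relative risk by age 70 used on the horizontal axis can be sketched in Python as a ratio of cumulative penetrances. The representation of a penetrance as a list of per-age diagnosis probabilities, the function names, and the numbers in the example are illustrative assumptions, not the parameterization of the PanelPRO package.

```python
def cumulative_risk(age_density, by_age):
    """Probability of diagnosis by `by_age`; age_density[a - 1] is P(diagnosis at age a)."""
    return sum(age_density[:by_age])


def relative_risk(carrier_density, noncarrier_density, by_age=70):
    """Cumulative risk in carriers divided by cumulative risk in noncarriers."""
    baseline = cumulative_risk(noncarrier_density, by_age)
    if baseline == 0:
        raise ValueError("relative risk is undefined when the noncarrier risk is zero")
    return cumulative_risk(carrier_density, by_age) / baseline


# Hypothetical flat penetrances over ages 1..94, for illustration only.
carrier = [0.005] * 94
noncarrier = [0.001] * 94
example_rr = relative_risk(carrier, noncarrier)  # approximately 5
```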
We simulated an additional 1 million families generated by adding a latent polygenic risk score (PRS) to the assumptions of PanelPRO-11. For the founders of a given family, we simulated 20 independent SNPs, each with minor allele frequency 0.1, and passed them down to their descendants based on Mendelian laws of inheritance. Each family member was then assigned a score calculated as the sum of the 20 SNPs in their genotype (1 being carrier and 0 being noncarrier). Each score was mapped to a penetrance-multiplying factor ranging between 0.8 and 1.2, with smaller scores corresponding to multiplying factors below 1 and larger scores corresponding to multiplying factors above 1. We then multiplied each individual’s cancer penetrances (used for simulating cancer status) by the multiplying factor corresponding to their score, thereby modifying the simulated families’ cancer risk in a heritable fashion that is not accounted for by PanelPRO. We also simulated a set of 1 million families using the same scheme, but with a score based on 40 independent SNPs that map to multiplying factors ranging between 0.6 and 1.4.
The performance metrics for PanelPRO-11 (which does not account for the PRS) in these simulations with unmeasured genetic risk factors were virtually identical to those for the original simulation (Supplemental Figure S4 and Table S5 (Liang et al., 2022a)). We designed the PRS simulations and the multiplying factors assigned to each score such that the expected number of cancer cases should remain similar to the original simulation. Counselees with the mean PRS will be assigned cancer statuses based on cancer penetrances that are the same as those used in the model. The cancer risk for many of the other counselees (i.e. those whose scores map to multiplying factors close to 1) will also be similar to the risk accounted for by the cancer penetrances in PanelPRO. So it is not entirely unexpected that PanelPRO’s performance is largely unaffected when assessed by aggregate measures like the AUC.
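A rough Python sketch of the latent PRS construction for a founder is given below. The exact mapping from scores to multiplying factors is not specified in the text, so the piecewise-linear map here (score 0 to the lower bound, the mean score to 1, the maximum score to the upper bound) is purely an assumption chosen to agree with the statement that counselees with the mean PRS keep the model's penetrances. The function names and the cap of modified penetrances at 1 are likewise ours.

```python
import random


def sample_snp_carrier(maf, rng):
    """1 if at least one of the two alleles is the minor allele, else 0."""
    return int(rng.random() < maf or rng.random() < maf)


def founder_score(n_snps, maf, rng):
    """Score of a founder: number of SNPs (out of n_snps) at which they are a carrier."""
    return sum(sample_snp_carrier(maf, rng) for _ in range(n_snps))


def penetrance_multiplier(score, n_snps=20, maf=0.1, low=0.8, high=1.2):
    """Assumed piecewise-linear map: 0 -> low, mean score -> 1, n_snps -> high."""
    mean_score = n_snps * (1 - (1 - maf) ** 2)  # expected number of carried SNPs
    if score <= mean_score:
        return low + (1 - low) * score / mean_score
    return 1 + (high - 1) * (score - mean_score) / (n_snps - mean_score)


def modified_penetrance(penetrance, multiplier):
    """Scale a penetrance by the PRS multiplier, capping the result at 1."""
    return min(1.0, penetrance * multiplier)


rng = random.Random(0)
score = founder_score(n_snps=20, maf=0.1, rng=rng)
factor = penetrance_multiplier(score)
```

Under this assumed map, most simulated individuals have scores near the mean and therefore multiplying factors near 1, which is consistent with the small effect on aggregate metrics described above.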
5. Data Application
5.1. Model Validation
For model validation, we applied PanelPRO-5BC and PanelPRO-11 to the data from the HCP study (Section 3.2). We also ran the BayesMendel models nested within these models (BRCAPRO for PanelPRO-5BC and BRCAPRO, MMRpro, and Melapro for PanelPRO-11). The code to reproduce the validation results is available in Liang et al. (2022b). Some of the families in the HCP study report tumor marker testing results, so we ran the models with and without incorporating this extra risk-modifying information. To evaluate the models, we used the performance metrics described in Section 4.2 and obtained percentile confidence intervals from 1000 bootstrap samples.
The metrics and confidence intervals for PanelPRO-5BC are reported in Figure 5 and Supplemental Table S6; those for PanelPRO-11 are reported in Figure 6 and Supplemental Table S7 (Liang et al., 2022a). Metrics use outcome labels based on carrier status of individual genes as well as groups of genes. Both types remain fairly consistent between the full PanelPRO models and their submodels. The results for models run with risk modifiers are often similar to those without risk modifiers. The study includes a handful of noncarrier counselees for whom PanelPRO-11 greatly overpredicts the MSH6 carrier probability compared to MMRpro. Inspection of individual HCP pedigrees reveals that many of these counselees have no or limited family history of colorectal and endometrial cancer, but have family history for other cancers, such as brain, gastric, and ovarian cancer. Both models incorporate family history for colorectal and endometrial cancers, but only PanelPRO-11 incorporates associations between Lynch genes and these additional cancers, which are also associated with other genes. In this scenario, PanelPRO-11 points to inherited susceptibility, distributing probabilities across MLH1, MSH2, and MSH6 as well as other genes associated with these cancers. When we evaluate performance for Lynch syndrome genes MLH1, MSH2, and MSH6 only, PanelPRO slightly overpredicts on the overall average for these carrier probabilities compared to the MMRpro submodel, and suffers a slight worsening of the AUC for MSH6.
Figure 5:

AUC, calibration, and MSE for the full PanelPRO-5BC model (black) and the BRCAPRO submodel (blue) evaluated on the HCP cohort, with (open points) and without (closed points) risk modifiers. “BRCAPRO genes” indicates any of the genes in BRCAPRO (BRCA1, BRCA2), and “Any” indicates any of the five genes in PanelPRO-5BC. 95% bootstrap percentile confidence intervals are also shown.
Figure 6:

AUC, calibration, and MSE for the full PanelPRO-11 model (black) and its BayesMendel submodels BRCAPRO, MMRpro, and Melapro (blue) evaluated on the HCP cohort, with (open points) and without (closed points) risk modifiers. “BRCAPRO genes”, “MMRpro genes”, and “Any BM” indicate any of the genes in BRCAPRO (BRCA1, BRCA2), MMRpro (MLH1, MSH2, MSH6), and any model in the BayesMendel package (BRCA1, BRCA2, MLH1, MSH2, MSH6, CDKN2A), respectively. “Any” indicates any of the eleven genes in PanelPRO-11. 95% bootstrap percentile confidence intervals are also shown.
We acknowledge that the small sample size of the data-processed cohort makes it difficult to detect meaningful differences. The wide bootstrap confidence intervals are likely also attributable to the small number of cases. Similarly, the observation that incorporating risk modifiers into the models makes a limited impact on the results may be due to the relatively low number of families with tumor marker information. Section S7 of the supplementary material (Liang et al., 2022a) contains discussion on how often PanelPRO-5BC and PanelPRO-11 improve over the BayesMendel submodels in the 1000 bootstrap replicates; the full model does not consistently have better metrics across bootstraps. These results are again likely due to limitations in the sample size and ascertainment process for the HCP data that cause it to deviate from the makeup of the general population, for which the model parameters were estimated. The HCP cohort used for data validation is highly ascertained, such that much of the family history in the pedigrees includes breast and colorectal cancers. Family history for additional cancers incorporated by PanelPRO beyond those in the syndrome-specific BRCAPRO and MMRpro is underrepresented compared to the general population. We would expect PanelPRO to return better predictions for families that have history of cancers and/or gene-cancer associations not incorporated in the existing syndrome-specific models.
The AUC confidence intervals for some genes, especially low-penetrant genes, overlap with 0.5. However, we note that the lower bound for the “Any” AUC is consistently above 0.5 for both PanelPRO-5BC and PanelPRO-11. Moreover, it is still beneficial to include lower-penetrant genes so that PanelPRO can estimate the probabilities of carrying pathogenic variants of these genes.
A key contribution of the multi-syndrome predictions enabled by PanelPRO is the ability to acknowledge syndrome overlap in counseling individuals at risk. To illustrate, Figure 7 shows the probabilities of carrying a pathogenic variant of any PanelPRO-11 gene against the probabilities for carrying a pathogenic variant of any BRCAPRO gene (BRCA1 or BRCA2, left) and any MMRpro gene (MLH1, MSH2, or MSH6, right) in the HCP cohort. Of the 150 HCP counselees who are carriers of any pathogenic variant in PanelPRO-11, 61 had carrier probabilities above a 2.5% threshold for PanelPRO but not BRCAPRO and 0 had carrier probabilities above 2.5% for BRCAPRO but not PanelPRO. Similarly, 89 counselees had carrier probabilities above 2.5% for PanelPRO but not MMRpro, and 0 had carrier probabilities above 2.5% for MMRpro but not PanelPRO. This demonstrates that the potential refinement in counseling practice by jointly considering syndromes whose phenotypes overlap can affect a large proportion of families.
Figure 7:

Probabilities of carrying a pathogenic variant of any PanelPRO-11 gene plotted against the probabilities for carrying a pathogenic variant of any BRCAPRO gene (BRCA1 or BRCA2, left) and against the probabilities for carrying a pathogenic variant of any MMRpro gene (MLH1, MSH2, or MSH6, right) in the HCP cohort. All counselees shown are true pathogenic variant carriers. BRCA1 or BRCA2 carriers are colored blue; MLH1, MSH2, or MSH6 are colored orange; and carriers of pathogenic variants of any of the six other PanelPRO-11 genes are colored black. Reference lines are drawn at 2.5%.
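The threshold comparison summarized in the text and in Figure 7 amounts to counting the carriers flagged by one model but not the other. A minimal Python sketch follows; the function name and the example probabilities are hypothetical, and treating "above" as strictly greater than the threshold is our assumption.

```python
def discordant_counts(probs_full, probs_sub, threshold=0.025):
    """Count individuals whose probability exceeds the threshold under only one model.

    probs_full: any-gene carrier probabilities from the full model.
    probs_sub: any-gene carrier probabilities from the syndrome-specific submodel.
    Returns (flagged by full model only, flagged by submodel only).
    """
    pairs = list(zip(probs_full, probs_sub))
    only_full = sum(pf > threshold and ps <= threshold for pf, ps in pairs)
    only_sub = sum(ps > threshold and pf <= threshold for pf, ps in pairs)
    return only_full, only_sub


# Hypothetical probabilities for four true carriers, for illustration only.
full = [0.030, 0.010, 0.500, 0.080]
sub = [0.010, 0.005, 0.400, 0.020]
example_counts = discordant_counts(full, sub)  # (2, 0)
```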
In Figure S5 (Liang et al., 2022a), the points are now colored to represent the cancer types recorded in each counselee’s pedigree. As expected, both PanelPRO-11 and the syndrome-specific models tend to correctly identify carriers with family history of only the cancers included in the syndrome-specific models. However, PanelPRO often has the advantage when the counselee has family history of both the cancers that are included in the syndrome-specific models as well as additional cancer types and especially when the counselee only has other related cancers that are not included in the syndrome-specific model.
5.2. Sensitivity to Family Size
To examine the performance that can be achieved by PanelPRO-5BC and PanelPRO-11 when very extensive family history is available, we simulated additional families with a large number of relatives. Each simulated counselee has four maternal aunts, maternal uncles, paternal aunts, paternal uncles, sisters, and brothers. The counselee and their siblings each have four daughters and four sons, for a total of 112 simulated relatives in each family. These are unrealistically large for most clinical applications, and serve here to provide a best case scenario.
Performance metrics and bootstrap confidence intervals obtained from evaluating PanelPRO-5BC and PanelPRO-11 on the HCP cohort, simulated families with structures sampled from the HCP cohort, and large simulated families are reported in Supplemental Figures S10–S11 and Supplemental Tables S12–S13 (Liang et al., 2022a). To make the simulation results more comparable to the results from the HCP data, we computed these metrics based on a subset of the simulated families that has the same number of families as the HCP cohort, with the same number of pathogenic variant carriers for each gene. While the metrics usually look best when PanelPRO can take advantage of a detailed family history, the results from the real data and a simulation with fewer relatives often do not lag too far behind. We also note that the results from the HCP cohort are typically better calibrated than those from either set of simulated families, albeit with wide confidence intervals. The process of subsetting the simulated families to have the same number of carriers as the HCP cohort does not match the ascertainment process for the cohort, which is likely the driving force behind this phenomenon.
6. Discussion
We have developed PanelPRO, a general, computationally efficient framework for Mendelian risk models that incorporates an arbitrary number of genes and cancers or syndromes (groups of cancers that frequently occur together). Our model can leverage more family history information than its syndrome-specific counterparts by including multiple cancer types. It also allows for the development of models that address syndromes caused by potentially many hereditary genetic factors. Extending beyond syndrome-specific models, PanelPRO integrates multiple gene-cancer associations and is flexible enough to incorporate any number of additional associations as they may be discovered in the future. This more comprehensive approach for quantitative risk assessment can be used to make a better determination of which individuals should undergo genetic testing and targeted preventative interventions.
By limiting the maximum number of pathogenic variants to consider in an individual’s genotype, we keep the problem computationally feasible (Madsen et al., 2018) even as the number of genes in the model grows, trading off between speed and accuracy. G. Lee et al. (2021) describe the package in detail, provide practical examples for users, and offer further discussion on computational performance. For moderately-sized families, models with as many as 10 genes can be run within a few seconds, and models with as many as 20 genes can be run in under a minute, making this package suitable for clinical applications. As our simulation studies demonstrate, PanelPRO is fully reverse-compatible with existing models. These results suggest that the more general PanelPRO could easily be adopted by current users of the BayesMendel models. For simplicity and ease of use, some of the results from PanelPRO could be masked from viewers who are only interested in specific genes and cancers.
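To give a sense of why capping the number of simultaneous pathogenic variants helps, the following Python sketch counts the genotype configurations an individual can have. It assumes each gene is coded simply as carrier or noncarrier, and the cap of two variants in the example is a hypothetical choice, not necessarily the setting used in PanelPRO.

```python
from math import comb


def n_genotypes(n_genes, max_variants=None):
    """Number of carrier/noncarrier configurations with at most max_variants genes mutated.

    With no cap, every subset of genes is allowed, giving 2 ** n_genes configurations.
    """
    if max_variants is None or max_variants > n_genes:
        max_variants = n_genes
    return sum(comb(n_genes, k) for k in range(max_variants + 1))


unrestricted_11 = n_genotypes(11)      # 2048
capped_11 = n_genotypes(11, 2)         # 1 + 11 + 55 = 67
unrestricted_20 = n_genotypes(20)      # 1048576
capped_20 = n_genotypes(20, 2)         # 1 + 20 + 190 = 211
```

Without a cap the count doubles with every added gene, whereas with a fixed cap it grows only polynomially in the number of genes.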
Validation of models jointly predicting the carrier status of a large number of genes remains a challenge. The rich, multi-class nature of the labels, the relative rarity of pathogenic variants in certain genes, and the broad variety of gene panels that are emerging in clinical practice all contribute to the difficulty of assembling cohorts that can definitively establish performance. The HCP study considered here is among the best available, but nonetheless, the small sample size is not sufficient for addressing high resolution questions about the rarer genes. We do succeed in establishing the important conclusion that the five- and eleven-gene models can be implemented clinically without detriment to performance compared to current practice. While PanelPRO’s ability to discriminate pathogenic variant carriers of any gene in the model is reduced as one incorporates genes with low penetrances, doing so can still be beneficial for identifying additional at-risk individuals. Furthermore, PanelPRO’s carrier probabilities for any pathogenic variant detect a high number of true carriers that would have been missed by the syndrome-specific models. We hope that these results encourage the collection of more multi-gene panel testing data; as more suitable datasets become available, we plan to perform additional validation.
The PanelPRO package (G. Lee et al., 2021) includes some functionalities that were not illustrated in our analysis. Besides tumor markers, our models currently consider prophylactic surgeries and germline testing as risk modifiers, and additional modifiers can be flexibly incorporated. Our framework for incorporating risk modifiers generalizes beyond surgical interventions and can also be used to support polygenic risk scores, assuming that suitable parameter estimates (modified penetrances, risk ratios, or hazard ratios) can be obtained. An observed PRS could simply be treated as an additional risk modifier. This approach can be applied to the counselee and any number of relatives who report PRSs. A further generalization could also explicitly include individual SNPs as additional genes, and model risk modification for each. While computationally challenging at the moment, particularly for a multi-syndrome model, this approach would account for additional genetic aggregation among individuals for whom the PRS is unobserved. When the cancer penetrance information is available, PanelPRO can return carrier probabilities for genotypes with a given pathogenic variant on both alleles as well as one allele. A future package release will handle multiple gene variants for a given gene.
We made several modeling assumptions, such as assuming that genotypes are independent across first-generation family members, followed by conditionally independent genotypes for second-generation family members and so on for subsequent generations (Berry et al., 1997). Including many genes in the model therefore necessitates relying more heavily on these conditional independence assumptions. More importantly, we also assumed that the phenotypes are independent conditional on the individual’s genotype. Censoring and deaths by causes unrelated to a given cancer are taken to be independent of the time to cancer diagnosis and non-informative with respect to carrier status. We assumed an absence of competing risks, but as the number of cancers included in the model increases, many of the competing risks will naturally be accounted for. PanelPRO is further limited by uncertainties in its inputs. The model’s ability to estimate carrier probabilities based on family history depends in part on the input pedigree, which may contain errors or missing information (Braun et al., 2014). The model also relies on robust estimation of the model parameters (namely, the allele frequencies and penetrances, as well as parameters for secondary cancers and risk modifiers) based on existing literature. As the number of genes, cancers, and associations grows, more model parameters will need to be obtained and carefully evaluated to ensure that they improve model performance overall. Future versions of PanelPRO may also explicitly incorporate uncertainty in the input parameters, for example via the Monte Carlo approach in Parmigiani et al. (1998).
We would generally expect the inclusion of cancers that are associated with many genetic variants to provide more gains than cancers that are associated with only one or two, but the number of associations is not the only consideration. There is also the size/strength of the association (e.g. a highly penetrant vs. moderately penetrant gene) as well as the prevalence of deleterious variants and the rate of the cancer in the non-carrier population. Larger families and families with more complete information are more informative and should lead to more accurate predictions by PanelPRO (as well as other models). The probability provided by Mendelian models of the kind described here can be thought of as integrating over all unknowns (genotypes and phenotypes) of any pedigree that is a superset of the observed pedigree. This justifies looking at calibration across varying family sizes. However, this result relies on model assumptions whose degree of realism may begin to deteriorate as the family size increases (Huang, Braun, Lynch, & Parmigiani, 2021).
Despite the numerous practical challenges remaining in this area, we hope that the availability of our comprehensive framework and associated open source software will contribute to a more systematic consideration of the relations between inherited susceptibility and cancer phenotypes, support more systematic prevention efforts, decrease barriers to interdisciplinary management of families at high risk for cancer, and encourage the adoption of general-purpose early detection strategies (Cristiano et al., 2019; Fiala, Kulasingam, & Diamandis, 2019) among relatives of carriers.
Acknowledgments
We gratefully acknowledge support from the National Cancer Institute at the National Institutes of Health grants 5T32CA009337 (JWL) and 4P30CA006516 (GP).
Footnotes
Supporting Information
Supplemental figures, tables, and discussion can be viewed in the separate appendix file (Liang et al., 2022a). The PanelPRO R package is freely available for research purposes at https://projects.iq.harvard.edu/bayesmendel/panelpro (G. Lee et al., 2021). Code to perform the analysis and generate the figures in this paper is available at https://github.com/janewliang/PanelRePROducible (Liang et al., 2022b).
References
- Antoniou A, Pharoah P, McMullan G, Day N, Stratton M, Peto J, … Easton D (2002). A comprehensive model for familial breast cancer incorporating BRCA1, BRCA2 and other genes. British Journal of Cancer, 86(1), 76.
- Antoniou AC, Casadei S, Heikkinen T, Barrowdale D, Pylkäs K, Roberts J, … others (2014). Breast-cancer risk in families with mutations in PALB2. New England Journal of Medicine, 371(6), 497–506.
- Antoniou AC, Cunningham A, Peto J, Evans D, Lalloo F, Narod S, … others (2008). The BOADICEA model of genetic susceptibility to breast and ovarian cancers: updates and extensions. British Journal of Cancer, 98(8), 1457.
- Begg CB, Orlow I, Hummer AJ, Armstrong BK, Kricker A, Marrett LD, … others (2005). Lifetime risk of melanoma in CDKN2A mutation carriers in a population-based sample. Journal of the National Cancer Institute, 97(20), 1507–1515.
- Berry DA, Iversen ES Jr, Gudbjartsson DF, Hiller EH, Garber JE, Peshkin BN, … others (2002). BRCAPRO validation, sensitivity of genetic testing of BRCA1/BRCA2, and prevalence of other breast cancer susceptibility genes. Journal of Clinical Oncology, 20(11), 2701–2712.
- Berry DA, Parmigiani G, Sanchez J, Schildkraut J, & Winer E (1997). Probability of carrying a mutation of breast-ovarian cancer gene BRCA1 based on family history. Journal of the National Cancer Institute, 89(3), 227–237.
- Berwick M, Orlow I, Hummer AJ, Armstrong BK, Kricker A, Marrett LD, … others (2006). The prevalence of CDKN2A germ-line mutations and relative risk for cutaneous malignant melanoma: an international population-based study. Cancer Epidemiology and Prevention Biomarkers, 15(8), 1520–1525.
- Bishop DT, Demenais F, Goldstein AM, Bergman W, Bishop JN, Paillerets B. B. d., … others (2002). Geographical variation in the penetrance of CDKN2A mutations for melanoma. Journal of the National Cancer Institute, 94(12), 894–903.
- Braun D, Gorfine M, Katki HA, Ziogas A, Anton-Culver H, & Parmigiani G (2014). Extending Mendelian risk prediction models to handle misreported family history.
- Braun D, Yang J, Griffin M, Parmigiani G, & Hughes KS (2018). A clinical decision support tool to predict cancer risk for commonly tested cancer-related germline mutations. Journal of Genetic Counseling, 27(5), 1187–1199.
- Chen J, Bae E, Zhang L, Hughes K, Parmigiani G, Braun D, & Rebbeck TR (2020). Penetrance of breast and ovarian cancer in women who carry a BRCA1/2 mutation and do not use risk-reducing salpingo-oophorectomy: An updated meta-analysis. JNCI Cancer Spectrum, 4(4), pkaa029.
- Chen S, Wang W, Broman KW, Katki HA, & Parmigiani G (2004). BayesMendel: an R environment for Mendelian risk prediction. Statistical Applications in Genetics and Molecular Biology, 3(1), 1–19.
- Chen S, Wang W, Lee S, Nafa K, Lee J, Romans K, … others (2006). Prediction of germline mutations and cancer risk in the Lynch syndrome. JAMA, 296(12), 1479–1487.
- Chu H, Chen S, & Louis TA (2007). Random effects models in a meta-analysis of the accuracy of diagnostic tests within a gold standard in the presence of missing data.
- Claus EB, Risch N, & Thompson WD (1994). Autosomal dominant inheritance of early-onset breast cancer. Implications for risk prediction. Cancer, 73(3), 643–651.
- Couch FJ, DeShano ML, Blackwood MA, Calzone K, Stopfer J, Campeau L, … others (1997). BRCA1 mutations in women attending clinics that evaluate the risk of breast cancer. New England Journal of Medicine, 336(20), 1409–1415.
- Cristiano S, Leal A, Phallen J, Fiksel J, Adleff V, Bruhm DC, … Velculescu VE (2019). Genome-wide cell-free DNA fragmentation in patients with cancer. Nature (London), 570(7761), 385–389.
- Elston RC, & Stewart J (1971). A general model for the genetic analysis of pedigree data. Human Heredity, 21(6), 523–542.
- Euhus DM (2001). Understanding mathematical models for breast cancer risk assessment and counseling. The Breast Journal, 7(4), 224–232.
- Felton K, Gilchrist D, & Andrew S (2007). Constitutive deficiency in DNA mismatch repair: is it time for Lynch III? Clinical Genetics, 71(6), 499–500.
- Fiala C, Kulasingam V, & Diamandis EP (2019). Circulating tumor DNA for early cancer detection. The Journal of Applied Laboratory Medicine, 3(2), 300–313.
- Foulkes WD (2008). Inherited susceptibility to common cancers. New England Journal of Medicine, 359(20), 2143–2153.
- Frank TS, Deffenbaugh AM, Reid JE, Hulick M, Ward BE, Lingenfelter B, … others (2002). Clinical characteristics of individuals with germline mutations in BRCA1 and BRCA2: analysis of 10,000 individuals. Journal of Clinical Oncology, 20(6), 1480–1490.
- Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, & Mulvihill JJ (1989). Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. JNCI: Journal of the National Cancer Institute, 81(24), 1879–1886.
- Greer JB, & Whitcomb DC (2007). Role of BRCA1 and BRCA2 mutations in pancreatic cancer. Gut, 56(5), 601–605.
- Hruban RH, Canto M, Goggins M, Schulick R, & Klein AP (2010). Update on familial pancreatic cancer. Advances in Surgery, 44, 293.
- Huang T, Braun D, Lynch HT, & Parmigiani G (2021). Variation in cancer risk among families with genetic susceptibility. Genetic Epidemiology, 45(2), 209–221.
- Idos GE, Kurian AW, Ricker C, Sturgeon D, Culver JO, Kingham KE, … others (2019). Multicenter prospective cohort study of the diagnostic yield and patient experience of multiplex gene panel testing for hereditary cancer risk. JCO Precision Oncology, 3, 1–12.
- Kastrinos F, Mukherjee B, Tayob N, Wang F, Sparr J, Raymond VM, … Syngal S (2009). Risk of pancreatic cancer in families with Lynch syndrome. JAMA, 302(16), 1790–1795.
- Kastrinos F, & Syngal S (2011). Inherited colorectal cancer syndromes. Cancer Journal (Sudbury, Mass.), 17(6), 405.
- Katki HA (2007). Incorporating medical interventions into carrier probability estimation for genetic counseling. BMC Medical Genetics, 8(1), 1–11.
- Katki HA, Blackford A, Chen S, & Parmigiani G (2008). Multiple diseases in carrier probability estimation: accounting for surviving all cancers other than breast and ovary in BRCAPRO. Statistics in Medicine, 27(22), 4532–4548.
- Klein AP, Beaty TH, Bailey-Wilson JE, Brune KA, Hruban RH, & Petersen GM (2002). Evidence for a major gene influencing risk of pancreatic cancer. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 23(2), 133–149.
- Kote-Jarai Z, Leongamornlert D, Saunders E, Tymrakiewicz M, Castro E, Mahmud N, … others (2011). BRCA2 is a moderate penetrance gene contributing to young-onset prostate cancer: implications for genetic testing in prostate cancer patients. British Journal of Cancer, 105(8), 1230.
- Lakhani SR, Reis-Filho JS, Fulford L, Penault-Llorca F, van der Vijver M, Parry S, … others (2005). Prediction of BRCA1 status in patients with breast cancer using estrogen receptor and basal phenotype. Clinical Cancer Research, 11(14), 5175–5180.
- Lakhani SR, Van De Vijver MJ, Jacquemier J, Anderson TJ, Osin PP, McGuffog L, & Easton DF (2002). The pathology of familial breast cancer: predictive value of immunohistochemical markers estrogen receptor, progesterone receptor, HER-2, and p53 in patients with mutations in BRCA1 and BRCA2. Journal of Clinical Oncology, 20(9), 2310–2318.
- Lander ES, & Green P (1987). Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences, 84(8), 2363–2367.
- Lange K (2003). Mathematical and statistical methods for genetic analysis. Springer Science & Business Media.
- Lee A, Mavaddat N, Wilcox AN, Cunningham AP, Carver T, Hartley S, … others (2019). BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors. Genetics in Medicine, 21(8), 1708–1718.
- Lee AJ, Cunningham AP, Tischkowitz M, Simard J, Pharoah PD, Easton DF, & Antoniou AC (2016). Incorporating truncating variants in PALB2, CHEK2, and ATM into the BOADICEA breast cancer risk model. Genetics in Medicine, 18(12), 1190.
- Lee G, Liang JW, Zhang Q, Huang T, Choirat C, Parmigiani G, & Braun D (2021). Multi-syndrome, multi-gene risk modeling for individuals with a family history of cancer with the novel R package PanelPRO. eLife, 10, e68699.
- Liang JW, Idos GE, Hong C, Gruber SB, Parmigiani G, & Braun D (2022a). Additional figures, tables, and discussion for “Statistical methods for Mendelian models with multiple genes and cancers”.
- Liang JW, Idos GE, Hong C, Gruber SB, Parmigiani G, & Braun D (2022b). Code to reproduce analysis for “Statistical methods for Mendelian models with multiple genes and cancers”.
- Madsen T, Braun D, Peng G, Parmigiani G, & Trippa L (2018). Efficient computation of the joint probability of multiple inherited risk alleles from pedigree data. Genetic Epidemiology, 42(6), 528–538.
- Marabelli M, Cheng S-C, & Parmigiani G (2016). Penetrance of ATM gene mutations in breast cancer: A meta-analysis of different measures of risk. Genetic Epidemiology, 40(5), 425–431.
- Moran A, O’Hara C, Khan S, Shack L, Woodward E, Maher E, … Evans D (2012). Risk of cancer other than breast or ovarian in individuals with BRCA1 and BRCA2 mutations. Familial Cancer, 11(2), 235–242.
- Murphy EA, & Mutalik GS (1969). The application of Bayesian methods in genetic counselling. Human Heredity, 19(2), 126–151.
- Parmigiani G, Berry DA, & Aguilar O (1998). Determining carrier probabilities for breast cancer–susceptibility genes BRCA1 and BRCA2. The American Journal of Human Genetics, 62(1), 145–158.
- Plichta JK, Griffin M, Thakuria J, & Hughes KS (2016). What’s new in genetic testing for cancer susceptibility? Oncology, 30(9).
- Riley BD, Culver JO, Skrzynia C, Senter LA, Peters JA, Costalas JW, … others (2012). Essential elements of genetic cancer risk assessment, counseling, and testing: updated recommendations of the National Society of Genetic Counselors. Journal of Genetic Counseling, 21(2), 151–161.
- Rosenthal ET, Bernhisel R, Brown K, Kidd J, & Manley S (2017). Clinical testing with a panel of 25 genes associated with increased cancer risk results in a significant increase in clinically significant findings across a broad range of cancer histories. Cancer Genetics, 218, 58–68.
- Schmidt MK, Hogervorst F, Van Hien R, Cornelissen S, Broeks A, Adank MA, … others (2016). Age- and tumor subtype–specific breast cancer risk estimates for CHEK2*1100delC carriers. Journal of Clinical Oncology, 34(23), 2750.
- Shattuck-Eidens D, Oliphant A, McClure M, McBride C, Gupte J, Rubano T, … others (1997). BRCA1 sequence analysis in women at high risk for susceptibility mutations: risk factor analysis and implications for genetic testing. JAMA, 278(15), 1242–1250.
- Statistical Research and Applications Branch, National Cancer Institute. (2020). DevCan: Probability of developing or dying of cancer software. version 6.7.8.
- Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, … Kattan MW (2010). Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology (Cambridge, Mass.), 21(1), 128.
- Tyrer J, Duffy SW, & Cuzick J (2004). A breast cancer prediction model incorporating familial and personal risk factors. Statistics in Medicine, 23(7), 1111–1130.
- Vahteristo P, Eerola H, Tamminen A, Blomqvist C, & Nevanlinna H (2001). A probability model for predicting BRCA1 and BRCA2 mutations in breast and breast-ovarian cancer families. British Journal of Cancer, 84(5), 704–708.
- Wang C, Wang Y, Hughes KS, Parmigiani G, & Braun D (2020). Penetrance of colorectal cancer among mismatch repair gene mutation carriers: A meta-analysis. JNCI Cancer Spectrum.
- Wang W, Chen S, Brune KA, Hruban RH, Parmigiani G, & Klein AP (2007). PancPRO: risk assessment for individuals with a family history of pancreatic cancer. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 25(11), 1417.
- Wang W, Niendorf KB, Patel D, Blackford A, Marroni F, Sober AJ, … Tsao H (2010). Estimating CDKN2A carrier probability and personalizing cancer risk assessments in hereditary melanoma using MelaPRO. Cancer Research, 70(2), 552–559.
- Win AK, MacInnis RJ, Dowty JG, & Jenkins MA (2013). Criteria and prediction models for mismatch repair gene mutations: a review. Journal of Medical Genetics, 50(12), 785–793.