Abstract
The familial recurrence risk is the probability a person will have disease, given a reported family history. When family histories are obtained as simple counts of disease among family members, as is often the case in cancer registries or surveys, we propose methods to estimate recurrence risks based on truncated binomial distributions. By this approach, we are able to obtain unbiased estimates of risk for a person with at least k affected relatives, where k can be specified in order to determine how risk varies with k. We also derive robust variances of the recurrence risk estimate, to account for correlations within families, such as those induced by shared genes or shared environment, without explicitly modeling the factors that cause familial correlations. Furthermore, we illustrate how mixture models can be used to account for a sample composed of low-risk and high-risk families. Using simulations, we illustrate the properties of the proposed methods. Application of our methods to a family history survey of prostate cancer shows that the recurrence risk for prostate cancer increased from 16% when there was at least one affected relative, to 52% when there were at least five affected relatives.
Keywords: ascertainment, mixture likelihood, recurrence risk, truncated binomial
Introduction
The familial recurrence risk is the probability a person will have disease, given a reported family history. This familial recurrence risk can be clinically useful to guide discussions between physicians and their disease-free patients who are concerned about their disease risk. There are different approaches to estimate the familial recurrence risk. One approach is to use the detailed information about a family history, including relationships of all members in a pedigree and their age of disease onset, or censoring age. For this level of detail, sophisticated statistical methods have been developed to model segregation patterns of disease in pedigrees and the age-specific penetrance of disease. These types of models also provide a powerful approach to evaluate whether the observed data fits a Mendelian model of disease inheritance. However, many common and complex diseases do not fit a simple Mendelian model of inheritance, limiting the clinical utility of these types of parametric models. An alternative approach is to estimate the empiric risk of disease, conditional on the number of affected family members. This is the approach that we expand upon because it does not assume specific modes of inheritance. Furthermore, because details of age of onset of disease among family members can be difficult to accurately obtain in a broad clinical setting, we opt for modeling simpler types of collected data.
A challenge in estimating the recurrence risk is accounting for conditioning on family history, because restricting sampling and analyses of families to those with at least k affected members causes oversampling of multi-case families, resulting in biased risk estimates if ascertainment is ignored. Adjustment for ascertainment in pedigree studies has a long history [Apert 1914; Cannings and Thompson 1977; Elston and Sobel 1979; Epstein, et al. 2002; Ewens and Shute 1986; Ewens and Shute 1989; Fisher 1934; Haldane 1938; Morton 1959; Shute and Ewens 1988a; Shute and Ewens 1988b; Vieland and Hodge 1995; Weinberg 1912]. Many of these approaches were developed to estimate parameters for sophisticated genetic models, such as gene segregation parameters, as well as models for risk of disease. There are several limitations of these general models when considering how the recurrence risk of disease depends on having at least k affected relatives. First, the genetic models assume a specific mode of inheritance (e.g., dominant or recessive). More sophisticated models that account for environmental covariates and age of onset are useful but require more parameters to estimate and are computationally challenging to fit. Second, most methods have been developed for when there is a single proband (i.e., a single affected person who, independent of other family members, brings the family to the attention of the investigator). An excellent review of methods to estimate recurrence risks among siblings when at least one of them is affected, for general models of ascertainment, is provided by Olson and Cordell [Olson and Cordell 2000]. When evaluating the recurrence risk of disease conditional on having at least k affected relatives, the conditioning events for ascertainment correction become much more complicated.
In contrast to these more sophisticated models, we propose a simple empirical estimate of recurrence risk in order to address the important clinical concern of disease risk based on family history, while accounting for how families are sampled. When family histories are obtained in a clinic setting, with limited details of age of disease onset on relatives, our proposed method offers a reasonable solution to recurrence risk. In addition, we provide methods to estimate robust variances of the estimated risks, as well as methods to estimate the probability that a family is either a low-risk or high-risk family, conditional on their family history.
The events that lead to sampling of families can be complex, which makes correction for ascertainment a challenge. Two general schemes are complete and single ascertainment. Complete ascertainment occurs when every family with at least one affected member has an equal chance of being sampled, independent of the size of the family and the number of affecteds. This type of sampling can occur in population-based registries, for example when the Utah Population Database, a genealogy database, is linked to statewide medical data [Abbott, et al. 2018]. Alternatively, when family-history is routinely obtained in a clinic, and the patients attending the clinic are unselected with respect to their family size and family history, it is possible to adapt our proposed methods to this setting. In contrast to complete ascertainment, single ascertainment occurs when the probability of sampling families is proportional to the number of affected family members. The methods we develop focus solely on complete ascertainment.
In the Methods section, we derive new methods to estimate the recurrence risk of disease, conditional on at least k affected family members. We first derive methods for when family data are collected from a registry, and there is no defined proband. Our methods are based on a truncated binomial distribution. To allow for over-dispersion, which can occur when family members are correlated (e.g., due to shared genetic and environmental risk factors), we derive robust variances of the recurrence risk. We then extend our approach to when a single proband per family reports their family history, which can be collected in a clinical setting.
Methods
Statistical Methods
For complete ascertainment, each family can be viewed as sampled from a truncated binomial distribution. So, a family with at least k affected relatives represents a sample from a truncated binomial distribution where families with fewer than k affected relatives are “truncated” from the sample. For a family of size $s_i$, the probability of having $a_i$ affected family members, conditional on having at least k affected members, and when the recurrence risk is P, is given by the probability mass function of a truncated binomial,
$$f(a_i \mid s_i, k, P) = \frac{\binom{s_i}{a_i} P^{a_i} (1-P)^{s_i - a_i}}{1 - \sum_{j=0}^{k-1} \binom{s_i}{j} P^{j} (1-P)^{s_i - j}}, \qquad a_i = k, \ldots, s_i. \tag{1}$$
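As a minimal sketch of the truncated binomial probability described above, the following Python function evaluates the probability of a affected members in a family of size s, given at least k affected. The function name `truncated_binomial_pmf` is hypothetical, and the code assumes 0 < p < 1.

```python
from math import comb


def truncated_binomial_pmf(a, s, k, p):
    """P(A = a | A >= k) for A ~ Binomial(s, p), assuming 0 < p < 1."""
    if a < k or a > s:
        return 0.0  # outside the support of the truncated distribution

    def binom(j):
        return comb(s, j) * p**j * (1.0 - p) ** (s - j)

    # Probability that the family is ascertained (at least k affected).
    # Summing the upper tail directly is algebraically the same as
    # 1 minus the lower tail, but avoids cancellation for small p.
    tail = sum(binom(j) for j in range(k, s + 1))
    return binom(a) / tail


# Probabilities for a family of size 6 ascertained with k = 2 and risk 0.1.
probs = [truncated_binomial_pmf(a, s=6, k=2, p=0.1) for a in range(0, 7)]
```

The probabilities for a = 0 and a = 1 are zero by construction, and the remaining values sum to one.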
For a single sample from a truncated binomial distribution, Rider [Rider 1955] derived a moment estimator for an unbiased estimate of the binomial probability P. In our situation, this binomial probability is the absolute probability of disease. With N families, we have N truncated binomial samples, and extending Rider’s approach to this situation results in the recurrence risk estimate
| (2) |
Here, $\hat{P}_k$ is the estimated recurrence risk (the probability of disease), $a_i$ is the number of affected members in the ith family, $s_i$ is the size of the ith family (where size refers to the number of family members at risk for disease), k is the ascertainment criterion (e.g., strength of family history), and the sum is over N families that have at least k affected relatives.
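To illustrate how a Rider-type moment estimator can be computed from family counts, the sketch below uses factorial moments. For a binomial count, E[a(a−1)···(a−k)] = P × E[(s−k) a(a−1)···(a−k+1)], and both products vanish when a < k, so the identity still holds after truncation. This particular form is an assumption made for illustration and is not necessarily the form of expression (2); the function names and the example counts are hypothetical.

```python
def falling_factorial(a, m):
    """Return a * (a - 1) * ... * (a - m + 1); equals 1 when m is 0."""
    out = 1
    for j in range(m):
        out *= a - j
    return out


def moment_recurrence_risk(affected, sizes, k):
    """Factorial-moment estimate of risk from families with >= k affected."""
    num = 0
    den = 0
    for a, s in zip(affected, sizes):
        if a < k:
            continue  # family does not meet the ascertainment criterion
        num += falling_factorial(a, k + 1)
        den += (s - k) * falling_factorial(a, k)
    return num / den


# Hypothetical counts for three families: affected members and family sizes.
affected = [1, 2, 3]
sizes = [4, 5, 6]
risk_k1 = moment_recurrence_risk(affected, sizes, k=1)
risk_k2 = moment_recurrence_risk(affected, sizes, k=2)
```

With k = 1 the ratio reduces to the sum of a(a−1) over the sum of (s−1)a, which is the familiar moment estimator for a zero-truncated binomial when all families have the same size.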
To estimate the variance of $\hat{P}_k$, we propose using methods for estimating equations based on generalized method of moments [Hansen 1982]. Unlike maximum likelihood methods, generalized method of moments does not require complete specification of the distribution of the data. So, we can account for correlations within families, such as those induced by shared genes or shared environment, without explicitly modeling the factors that cause familial correlations. Although expression (2) was derived without an estimating equation, we can work backwards from expression (2) to determine the estimating equation that would give rise to the estimator $\hat{P}_k$. The estimating equation is
| (3) |
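By setting the estimating equation to 0 and solving for $P_k$, the estimator in expression (2) results. To derive the variance of $\hat{P}_k$, we take the following approach. Let $f_i$ be the contribution of the ith family to the estimating equation,
The derivative of $f_i$ with respect to $P_k$ is
Based on generalized method of moments [Hansen 1982] and robust variance estimation [White 1982], the variance of $\hat{P}_k$ is
$$\widehat{\mathrm{Var}}(\hat{P}_k) = \frac{\sum_{i=1}^{N} f_i^{2}}{\left( \sum_{i=1}^{N} \partial f_i / \partial P_k \right)^{2}},$$
with $f_i$ and its derivative evaluated at $\hat{P}_k$.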
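A robust (sandwich) variance for a one-parameter estimating equation is the sum of squared family contributions divided by the squared sum of their derivatives. The sketch below applies this to an assumed family contribution, f_i = a(a−1)···(a−k) − P (s−k) a(a−1)···(a−k+1), which is a factorial-moment form chosen for illustration and may differ from the estimating equation in expression (3). Function names are hypothetical.

```python
def falling_factorial(a, m):
    """Return a * (a - 1) * ... * (a - m + 1); equals 1 when m is 0."""
    out = 1
    for j in range(m):
        out *= a - j
    return out


def robust_variance(affected, sizes, k):
    """Return (risk estimate, sandwich variance), families independent."""
    rows = [(a, s) for a, s in zip(affected, sizes) if a >= k]
    num = sum(falling_factorial(a, k + 1) for a, s in rows)
    den = sum((s - k) * falling_factorial(a, k) for a, s in rows)
    p_hat = num / den
    # Assumed contribution of each family to the estimating equation.
    f = [
        falling_factorial(a, k + 1) - p_hat * (s - k) * falling_factorial(a, k)
        for a, s in rows
    ]
    # Derivative of each contribution with respect to the risk parameter.
    df = [-(s - k) * falling_factorial(a, k) for a, s in rows]
    variance = sum(x * x for x in f) / sum(df) ** 2
    return p_hat, variance


# Hypothetical counts for three families.
p_hat, var_hat = robust_variance([1, 2, 3], [4, 5, 6], k=1)
```

Because each family contributes a single term, correlation among members of the same family is absorbed into the squared contributions without being modeled explicitly.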
Familial Relative Risk
The familial relative risk is the ratio of the recurrence risk to the population risk of disease (i.e., population prevalence, denoted p), that is, $P_k / p$.
An estimate of the familial relative risk is likely to be most reliable when p is estimated from data independent of that used to estimate $P_k$. There are two reasons for this. First, when there is heterogeneity in disease risk across families, and the heterogeneity is not accounted for in analyses, the ascertainment-adjusted parameter reflects the true parameter value in the ascertained subpopulation, not the entire population [Burton, et al. 2000; Epstein, et al. 2002]. This point is clarified below by illustration with simulations. Second, if $P_k$ and p are estimated from the same data, the ratio $\hat{P}_k / \hat{p}$ is biased because it is the ratio of two correlated random variables. This point is expanded in the Appendix, along with ways to adjust for this bias, and a way to estimate the variance of $\hat{P}_k / \hat{p}$.
Extensions for Family History of Probands
In a clinical setting it is common to collect family history of disease. When affected subjects (i.e., probands) attend a clinic independent of their family size and independent of their family history, and if the family history provided by probands is complete (i.e., reporting of family history does not depend on the number of affected members), then we can view the probands and their families as sampled under complete ascertainment. The only change from our prior methods is that there is a well-defined proband. So, the recurrence risk when considering at least one affected family member can be estimated by excluding the proband (i.e., using $a_i - 1$ for the number of affecteds and $s_i - 1$ for the family size) and estimating the risk among the remaining relatives:
$$\hat{P} = \frac{\sum_{i=1}^{N} (a_i - 1)}{\sum_{i=1}^{N} (s_i - 1)}.$$
To compute the variance of this estimate by estimating equations, the corresponding estimating equation contributed by the ith family is
and the derivative of $f_i$ is
To extend estimation of the recurrence risk to a family history of k > 1 affected relatives, we exclude the proband (i.e., the single-ascertainment step), and then use a truncated binomial to condition on having at least (k – 1) affected relatives among the remaining family members.
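The proband adjustment described above can be sketched as follows. The code removes one proband from each family's counts and, for a family history of at least one affected member, takes the proportion affected among the remaining relatives. Function names and counts are hypothetical; for k > 1, the reduced counts would be passed to a truncated binomial estimator with threshold k − 1.

```python
def remove_proband(affected, sizes):
    """Drop one affected proband from each family's counts."""
    reduced_affected = [a - 1 for a in affected]
    reduced_sizes = [s - 1 for s in sizes]
    return reduced_affected, reduced_sizes


def proband_recurrence_risk(affected, sizes):
    """Proportion affected among relatives, excluding the proband."""
    rel_affected, rel_sizes = remove_proband(affected, sizes)
    return sum(rel_affected) / sum(rel_sizes)


# Hypothetical families, each counted with its proband included.
affected = [1, 2, 3]
sizes = [4, 5, 6]
risk_relatives = proband_recurrence_risk(affected, sizes)
```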
Modeling Mixture of Risk Groups
When there is heterogeneity in risk across families, it would be of interest to estimate whether a family originates from a particular risk group. We propose modeling heterogeneity by a mixture of truncated binomial distributions. Because there is limited information per family, it seems reasonable to limit to two types of risk groups: low versus high risk. The mixture model in turn can provide a posterior probability that a family is of high risk or low risk. The likelihood equation for a mixture of two truncated binomial distributions is
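$$L(\pi, P_A, P_B) = \prod_{i=1}^{N} \left[ \pi \, f(a_i \mid s_i, k, P_A) + (1 - \pi) \, f(a_i \mid s_i, k, P_B) \right], \tag{4}$$
where π is the probability that a family is in risk group-A with recurrence risk $P_A$, and 1 – π is the probability that a family is in risk group-B with recurrence risk $P_B$, and $f(a_i \mid s_i, k, P)$ is the probability mass function for a truncated binomial, given in expression (1).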
To fit this mixture model to data, we used the following expectation-maximization algorithm. Let $w_i$ be the posterior probability that family i comes from risk group-A, and initialize it by sampling from a uniform distribution. The mean of these posteriors across families is an estimate of the prior probability π, so initializing by a uniform distribution implicitly assumes an initial value of $\pi = 0.5$. These posteriors are used to estimate $P_A$ by including them in expression (2) as weights on each family's contribution.
A similar estimate for $P_B$ is obtained by replacing $w_i$ with $1 - w_i$. The estimates for π, $P_A$, and $P_B$ are used to update the likelihood of expression (4). An updated posterior for each family can then be computed as
$$w_i = \frac{\pi \, f(a_i \mid s_i, k, P_A)}{\pi \, f(a_i \mid s_i, k, P_A) + (1 - \pi) \, f(a_i \mid s_i, k, P_B)}.$$
This process is iterated until the change in the log-likelihood is very small. At convergence, the value of $w_i$ provides an estimate of the probability that a family comes from risk group-A with recurrence risk $P_A$, versus the probability that a family comes from risk group-B with recurrence risk $P_B$. Note that the estimated prior probability π represents the prior probability that a family is from risk group-A in the ascertained population.
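The iteration described above can be sketched in Python as follows. This is an EM-style loop under stated assumptions, not the implementation in the fam.recrisk package: the weighted risk update uses an assumed factorial-moment form in place of expression (2), risks are clamped away from 0 and 1 to keep the probabilities defined, and all function names, the seed, and the simulated data are hypothetical. Every family passed in must already have at least k affected members.

```python
import math
import random


def tb_pmf(a, s, k, p):
    """Truncated binomial probability of a affected among s, given a >= k."""
    def binom(j):
        return math.comb(s, j) * p**j * (1.0 - p) ** (s - j)
    return binom(a) / sum(binom(j) for j in range(k, s + 1))


def ffact(a, m):
    """Falling factorial a * (a - 1) * ... * (a - m + 1)."""
    out = 1
    for j in range(m):
        out *= a - j
    return out


def weighted_risk(affected, sizes, weights, k, eps=1e-6):
    """Weighted moment estimate of risk (assumed factorial-moment form)."""
    rows = list(zip(affected, sizes, weights))
    num = sum(w * ffact(a, k + 1) for a, s, w in rows)
    den = sum(w * (s - k) * ffact(a, k) for a, s, w in rows)
    p = num / den if den > 0 else 0.5
    return min(max(p, eps), 1.0 - eps)  # keep the pmf well defined


def fit_mixture(affected, sizes, k, tol=1e-8, max_iter=500, seed=1):
    """Alternate posterior weights and weighted risk estimates."""
    rng = random.Random(seed)
    n = len(affected)
    w = [rng.random() for _ in range(n)]  # initial posteriors for group A
    prev = -math.inf
    for _ in range(max_iter):
        pi = sum(w) / n
        p_a = weighted_risk(affected, sizes, w, k)
        p_b = weighted_risk(affected, sizes, [1.0 - x for x in w], k)
        fa = [pi * tb_pmf(a, s, k, p_a) for a, s in zip(affected, sizes)]
        fb = [(1.0 - pi) * tb_pmf(a, s, k, p_b) for a, s in zip(affected, sizes)]
        loglik = sum(math.log(x + y) for x, y in zip(fa, fb))
        w = [x / (x + y) for x, y in zip(fa, fb)]  # updated posteriors
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return pi, p_a, p_b, w


# Hypothetical data: 10% of families at 50% risk, 90% at 10% risk, k = 2.
sim = random.Random(2)
families = []
for i in range(2000):
    risk = 0.5 if i < 200 else 0.1
    s = sim.randint(3, 10)
    a = sum(sim.random() < risk for _ in range(s))
    if a >= 2:
        families.append((a, s))
affected = [a for a, _ in families]
sizes = [s for _, s in families]
pi_hat, p_a_hat, p_b_hat, post = fit_mixture(affected, sizes, k=2)
```

Which group ends up labeled A depends on the random starting values, so the two fitted risks should be compared before calling one of them the high-risk group.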
Simulations
To test the performance of the estimator corrected for ascertainment, we simulated data under two scenarios. Both scenarios simulated a population of 5,000 families, with family sizes ranging from 3 to 10. In Scenario-1, all subjects in all families had the same 10% chance of disease. In Scenario-2, 500 families had increased risk of disease at 50%, while the remaining 4,500 families had a risk of 10% for each family member. Simulations were repeated 1,000 times in order to determine the distribution of the familial recurrence risk estimates. The risks were estimated by two approaches. One approach was a naïve method that was based on the proportion of diseased subjects in the ascertained sample, without correcting for ascertainment. The second method used expression (2) that corrects for ascertainment.
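A simulation in the spirit of Scenario-1 can be sketched as below. Several choices are assumptions made for illustration: family sizes are drawn uniformly from 3 to 10, only 50 replicates are run rather than 1,000, and the ascertainment-corrected estimate uses a factorial-moment form that may differ from expression (2). Function names and the seed are hypothetical.

```python
import random


def ffact(a, m):
    """Falling factorial a * (a - 1) * ... * (a - m + 1)."""
    out = 1
    for j in range(m):
        out *= a - j
    return out


def corrected_risk(families, k):
    """Ascertainment-corrected moment estimate (assumed form)."""
    kept = [(a, s) for a, s in families if a >= k]
    num = sum(ffact(a, k + 1) for a, s in kept)
    den = sum((s - k) * ffact(a, k) for a, s in kept)
    return num / den


def naive_risk(families, k):
    """Proportion affected among ascertained families, no correction."""
    kept = [(a, s) for a, s in families if a >= k]
    return sum(a for a, s in kept) / sum(s for a, s in kept)


def simulate(n_families, risk, rng):
    """Simulate (affected, size) pairs with a common risk per member."""
    families = []
    for _ in range(n_families):
        s = rng.randint(3, 10)  # assumed uniform family sizes
        a = sum(rng.random() < risk for _ in range(s))
        families.append((a, s))
    return families


rng = random.Random(2019)
corrected, naive = [], []
for _ in range(50):
    fams = simulate(5000, 0.10, rng)
    corrected.append(corrected_risk(fams, k=1))
    naive.append(naive_risk(fams, k=1))
mean_corrected = sum(corrected) / len(corrected)
mean_naive = sum(naive) / len(naive)
```

Comparing `mean_corrected` with `mean_naive` shows the direction of the ascertainment bias: the naïve proportion is inflated because every retained family is guaranteed to contain at least k affected members.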
Application Data Set
Our methods were applied to a prostate cancer family history survey conducted at the Mayo Clinic. A survey of family-history of cancer was conducted on 5,486 men who underwent a radical prostatectomy, for clinically localized prostate cancer, in the Department of Urology at the Mayo Clinic during 1966–95. The 4,288 men who responded to the survey were included in analyses. Further details of this survey study are provided elsewhere [Schaid, et al. 1998]. Because prostate cancer history has been found to be accurately reported for first-degree relatives, but underreported for more distant relatives [Steinberg, et al. 1990], similar to other types of cancers [Bondy, et al. 1994; Love, et al. 1985], we restricted family history to first-degree relatives of probands. The original survey study was approved by the Mayo Clinic Institutional Review Board.
Results
Scenario-1
The results for Scenario-1 are illustrated in Figure 1, top panels. This figure shows that our proposed method to correct for ascertainment gives unbiased results, with estimates centered at 10%, whether the family history requires at least 1, 2, or 3 affected family members. In contrast, the naïve estimates that do not correct for ascertainment are all biased upward. Figure 1 also illustrates that increasing the number of affected relatives required for ascertainment (k) increases the variability of the estimated recurrence risk, because fewer families are available for analyses.
Figure 1.

Estimated familial recurrence risk when correcting for ascertainment (left panels) and when not correcting (right panels). The upper panels are for a homogeneous setting with a 10% chance of disease for each family member. The lower panels are for a heterogeneous setting when the families are a mixture of two types, with one type having a low 10% risk (90% of families) and another type having a high 50% risk (10% of families).
Scenario-2
The results for Scenario-2 are illustrated in Figure 1, bottom panels. This figure shows that our proposed method to correct for ascertainment gives reasonable results. The families are mixtures of two types, with one type having a low 10% risk and another type having a high 50% risk. As the strength of family history increases, the recurrence risk increases towards the higher 50% risk. The naïve uncorrected estimates are consistently larger than the ascertainment-corrected estimates, once again emphasizing that failure to correct for ascertainment falsely increases the estimated risk.
A point worth emphasizing is whether we want to estimate a parameter that represents the true parameter in the general population or the true parameter in the ascertained population. This issue was extensively discussed elsewhere [Burton, et al. 2000; Epstein, et al. 2002], with a main point that when there is heterogeneity in disease risk across families, and the heterogeneity is not accounted for in analyses, the ascertainment-adjusted parameter reflects the true parameter value in the ascertained subpopulation, not the entire population. To make this clear, in our Scenario-2, the true population consists of 5,000 families, of which 10% have a high 50% chance of disease, and 90% have a low 10% chance of disease. The probability of disease in the general population is the weighted average, 0.1×0.5 + 0.9×0.1 = 0.14. In order to obtain an unbiased estimate of this population risk we would need to know the risk group for each family, estimate the ascertainment-corrected recurrence risk within each risk group, and then take a weighted average of these to compute the population risk. To clarify these points when truncating according to k=2, we illustrate in Figure 2 the estimated risks from simulated data for each risk group separately, the weighted average, and the recurrence risk based on the pool of all families, the last of which does not account for the heterogeneity in risk across families. The weighted average provides an unbiased estimate of the disease risk in the entire population. However, this is not the risk of interest for the clinical question. Rather, the recurrence risk for the ascertained population (i.e., all families in the population with at least k affected relatives) is the main risk of interest. So, this motivates us to use the recurrence risk estimator in expression (2).
Figure 2.

Simulation results: Recurrence risk estimates for low risk (true risk of 10%) and high risk (true risk of 50%) groups, the weighted average of these risk estimates (weighted by frequency of type of family), and recurrence risk estimate in the pool of all ascertained families. Families were ascertained to have at least 2 affected members.
Results for Mixture of Risk Groups
Revisiting the simulations that generated Figure 2, we illustrate in Figure 3 an estimate of the posterior probabilities for a single simulation when two risk groups are assumed to exist, and ascertainment required at least two affected relatives. For this sample, the recurrence risk estimates were 51% and 11% for the high- and low-risk groups, respectively (close to the true values of 50% and 10%), and the estimated prior probability for the high-risk group for the ascertained population was 34%. The top panel of Figure 3 shows that the estimated posterior probabilities can be quite variable around the true values. This variation is influenced by random variation as well as characteristics of the families. The bottom panel of Figure 3 illustrates that the posterior probability that a family is from a high-risk group depends not only on the number of affecteds, but also the family size. Large families with few affecteds have a much lower posterior probability than smaller families with the same number of affecteds.
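The dependence of the posterior probability on family size can be checked with a short calculation. The sketch below plugs the estimates quoted above (51% and 11% risks, 34% prior for the high-risk group, ascertainment at k = 2) into Bayes' rule for two families that each have two affected members but differ in size. Function names are hypothetical.

```python
from math import comb


def tb_pmf(a, s, k, p):
    """Truncated binomial probability of a affected among s, given a >= k."""
    def binom(j):
        return comb(s, j) * p**j * (1.0 - p) ** (s - j)
    return binom(a) / sum(binom(j) for j in range(k, s + 1))


def posterior_high(a, s, k, p_high, p_low, prior_high):
    """Posterior probability that a family belongs to the high-risk group."""
    hi = prior_high * tb_pmf(a, s, k, p_high)
    lo = (1.0 - prior_high) * tb_pmf(a, s, k, p_low)
    return hi / (hi + lo)


# Two affected members in a small family versus a large family.
small_family = posterior_high(2, 3, k=2, p_high=0.51, p_low=0.11, prior_high=0.34)
large_family = posterior_high(2, 10, k=2, p_high=0.51, p_low=0.11, prior_high=0.34)
```

Two affected members out of ten is more typical of the low-risk group than two out of three, so `large_family` is smaller than `small_family`.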
Figure 3.


Posterior probabilities for a single simulation sample of Scenario-2 (mixture of low- and high-risk families), assuming ascertainment of at least two affected members in each family. The upper panel shows how the posterior probabilities vary within each true risk group. The lower panel shows red dots for posterior probabilities > 50% and black dots for posterior probabilities ≤ 50%. For this sample, the recurrence risk estimates were 47% and 9.5% for the two groups. This figure illustrates that the posterior probability that a family is from a high-risk group depends on both the number of affecteds and the family size. Large families with few affecteds have a much lower posterior probability than smaller families with the same number of affecteds.
Application to Prostate Cancer Family History Survey
The recurrence risk estimates for varying amounts of prostate cancer family history (i.e., amount of ascertainment) for the Mayo Clinic survey are summarized in Figure 4. The recurrence risks in the pool of all families ranged from 16% when there was at least one affected relative, to 52% when there were at least five affected relatives. When fitting a mixture model, the recurrence risks for the low-risk group were 13%, 18%, and 29% as the ascertainment criterion k increased from 1 to 2 to 3; for the high-risk group, these risks were 59%, 72%, and 75%, respectively. The width of the 95% confidence intervals increased as k increased, because the number of families decreased as k increased.
Figure 4.

Recurrence risk estimates for Mayo Clinic prostate cancer family history survey. 95% confidence intervals based on robust variance estimates are illustrated by vertical bars.
Discussion
When family history of disease is recorded as counts of disease within a family of a given size, we provide a straightforward procedure to estimate the recurrence risk of disease, and robust estimates of its variance. The estimated recurrence risk is representative of the ascertained population, and is an average risk, averaged over all possible families with at least k affected relatives. It is important to recognize that the risk estimates depend on ages of the family members. If family members are young, the risk estimates might underestimate the lifetime risk of disease. For our survey data, the surveys were sent to men after they received radical prostatectomy, some many years after surgery. Because prostate cancer occurs at older ages, the risks we estimate likely represent lifetime risk of disease.
We expect that more precise estimates of recurrence risk can be calculated with more detailed data on relatives, such as age of disease onset, age at death, current age, relationship to proband, environmental risk factors, and known genetic risk factors. In this situation, sophisticated modeling approaches are needed (e.g., BOADICEA for breast cancer risk [Antoniou, et al. 2008]), which often assume a genetic mode of disease inheritance. However, many common complex diseases do not fit a simple Mendelian mode of inheritance, such as prostate cancer, which has fit dominant, recessive, X-linked, and multifactorial models of disease transmission [Schaid 2004]. When detailed family data are unavailable, which often occurs for population registries or family history obtained in clinics, our procedure offers a practical solution. Future work that does not assume parametric models, but accounts for age of onset, censoring, missing data, and correlations within families, would provide more precise estimates of risk. Approaches that build on semi-parametric modeling of hazard rates, and use of copula models to account for correlations, offer promise, but the need to correct for ascertainment makes these approaches statistically challenging.
Highlighted by the discussions of Burton and Epstein [Burton, et al. 2000; Epstein, et al. 2002], it is important to recognize that the estimated recurrence risks, and prior probability of risk group for mixture models, represent the parameters of the ascertained population, not the entire population. For the recurrence risk for a disease-free person with at least k affected family members, the ascertained population parameter is the parameter of interest.
Acknowledgements
This research was supported by the U.S. Public Health Service, National Institutes of Health, grant number GM065450.
Appendix
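If the recurrence risk, $P_k$, and population prevalence, p, are estimated from the same data, the ratio $\hat{P}_k / \hat{p}$ is biased because it is the ratio of two correlated random variables. To see this, by taking a second order Taylor expansion of $\hat{P}_k / \hat{p}$, expanded around $P_k$ and p, the expected value of the ratio estimator is approximately
$$E\!\left( \frac{\hat{P}_k}{\hat{p}} \right) \approx \frac{P_k}{p} - \frac{\mathrm{Cov}(\hat{P}_k, \hat{p})}{p^{2}} + \frac{P_k \, \mathrm{Var}(\hat{p})}{p^{3}}.$$
By subtracting the bias term, we can construct an approximately unbiased estimator as
$$\frac{\hat{P}_k}{\hat{p}} + \frac{\widehat{\mathrm{Cov}}(\hat{P}_k, \hat{p})}{\hat{p}^{2}} - \frac{\hat{P}_k \, \widehat{\mathrm{Var}}(\hat{p})}{\hat{p}^{3}}.$$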
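By using a second order Taylor expansion, the variance of $\hat{P}_k / \hat{p}$ can be shown to be approximately
$$\mathrm{Var}\!\left( \frac{\hat{P}_k}{\hat{p}} \right) \approx \frac{\mathrm{Var}(\hat{P}_k)}{p^{2}} - \frac{2 P_k \, \mathrm{Cov}(\hat{P}_k, \hat{p})}{p^{3}} + \frac{P_k^{2} \, \mathrm{Var}(\hat{p})}{p^{4}}.$$
Note that when $\hat{P}_k$ and $\hat{p}$ are estimated from independent sets of data, $\mathrm{Cov}(\hat{P}_k, \hat{p}) = 0$ in the above equation. To derive the variances and covariance for $\hat{P}_k$ and $\hat{p}$, we use the estimating equation $\sum_i f_i = 0$ to estimate $P_k$, where
Here, the sum is over all N families, and $I(a_i \ge k)$ is an indicator of whether $a_i \ge k$. We also use the estimating equation $\sum_i g_i = 0$ to estimate p, where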
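Then, the covariance matrix for $\hat{P}_k$ and $\hat{p}$ is given by the use of estimating equations for generalized method of moments estimators [Hansen 1982; White 1982],
$$\widehat{\mathrm{Cov}}\begin{pmatrix} \hat{P}_k \\ \hat{p} \end{pmatrix} = A^{-1} B \left( A^{-1} \right)^{T}, \qquad A = \begin{pmatrix} \sum_i \partial f_i / \partial P_k & 0 \\ 0 & \sum_i \partial g_i / \partial p \end{pmatrix}, \qquad B = \begin{pmatrix} \sum_i f_i^{2} & \sum_i f_i g_i \\ \sum_i f_i g_i & \sum_i g_i^{2} \end{pmatrix}.$$
Note that the upper right corner and lower left corner of the matrix A are filled with zeros. In general, the upper right would contain $\sum_i \partial f_i / \partial p$, and the lower left, $\sum_i \partial g_i / \partial P_k$. But, because $f_i$ does not depend on p, and $g_i$ does not depend on $P_k$, the corresponding derivatives are equal to 0. After simplifying, the following variances and covariances result:
$$\widehat{\mathrm{Var}}(\hat{P}_k) = \frac{\sum_i f_i^{2}}{\left( \sum_i \partial f_i / \partial P_k \right)^{2}}, \qquad \widehat{\mathrm{Var}}(\hat{p}) = \frac{\sum_i g_i^{2}}{\left( \sum_i \partial g_i / \partial p \right)^{2}}, \qquad \widehat{\mathrm{Cov}}(\hat{P}_k, \hat{p}) = \frac{\sum_i f_i g_i}{\left( \sum_i \partial f_i / \partial P_k \right) \left( \sum_i \partial g_i / \partial p \right)}.$$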
The derivatives of the estimating equations are
When multiple $P_k$ values are computed from the same sample, the above approach can be used to estimate the covariance matrix for the vector of $P_k$ values and prevalence parameter p. Let this vector of K recurrence risks and prevalence be $\theta = (P_1, \ldots, P_K, p)$. The general formula for the covariance matrix is $A^{-1} B \left( A^{-1} \right)^{T}$. To define the elements of matrices A and B, let $f_{k,i}$ be the contribution that family i makes to the estimating equation for $P_k$. The matrix A has diagonal terms $\sum_i \partial f_{k,i} / \partial P_k$ and $\sum_i \partial g_i / \partial p$. The matrix B has terms
$$B_{k,k'} = \sum_i f_{k,i} f_{k',i}, \qquad B_{k,K+1} = \sum_i f_{k,i} \, g_i, \qquad B_{K+1,K+1} = \sum_i g_i^{2}.$$
So, based on the general estimating equations we provide, it is straightforward to estimate the covariance matrix when multiple $P_k$ values are computed from the same sample.
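The calculations in this Appendix can be sketched numerically as follows. The first function forms the robust variances and covariance from per-family contributions to the two estimating equations (supplied by the caller, with zero contributions for families that are not ascertained); the second applies the Taylor-series bias correction and variance approximation to the ratio of recurrence risk to prevalence. Function names and the numeric inputs are hypothetical.

```python
def sandwich_terms(f, df, g, dg):
    """Robust variances and covariance when the derivative matrix is diagonal.

    f, g  : per-family contributions to the two estimating equations
    df, dg: per-family derivatives with respect to each equation's parameter
    """
    a_f = sum(df)
    a_g = sum(dg)
    var_risk = sum(x * x for x in f) / a_f**2
    var_prev = sum(y * y for y in g) / a_g**2
    cov = sum(x * y for x, y in zip(f, g)) / (a_f * a_g)
    return var_risk, var_prev, cov


def relative_risk_summary(risk, prev, var_risk, var_prev, cov):
    """Ratio, bias-corrected ratio, and approximate variance of the ratio."""
    ratio = risk / prev
    # Second-order Taylor bias of the ratio, removed from the raw ratio.
    corrected = ratio + cov / prev**2 - risk * var_prev / prev**3
    # Delta-method approximation to the variance of the ratio.
    variance = (
        var_risk / prev**2
        - 2.0 * risk * cov / prev**3
        + risk**2 * var_prev / prev**4
    )
    return ratio, corrected, variance


# Hypothetical inputs chosen only to exercise the formulas.
v_risk, v_prev, c = sandwich_terms(
    f=[1.0, -1.0], df=[-2.0, -2.0], g=[0.5, -0.5], dg=[-4.0, -4.0]
)
ratio, corrected, variance = relative_risk_summary(
    risk=0.3, prev=0.1, var_risk=0.0004, var_prev=0.0001, cov=0.00005
)
```

When the recurrence risk and prevalence come from independent data sets, the covariance argument would be set to zero.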
Footnotes
Software: The software package fam.recrisk is available as an R package from the CRAN web site (https://cran.r-project.org/web/packages/pleio/index.html).
References
- Abbott D, Brockmeyer D, Neklason DW, Teerlink C, Cannon-Albright LA 2018. Population-based description of familial clustering of Chiari malformation Type I. J Neurosurg 128(2):460–465.
- Antoniou AC, Cunningham AP, Peto J, Evans DG, Lalloo F, Narod SA, Risch HA, Eyfjord JE, Hopper JL, Southey MC and others. 2008. The BOADICEA model of genetic susceptibility to breast and ovarian cancers: updates and extensions. Br J Cancer 98(8):1457–66.
- Apert E 1914. The laws of Naudin-Mendel. J Hered 5:492–497.
- Bondy ML, Strom SS, Colopy MW, Brown BW, Strong LC 1994. Accuracy of family history of cancer obtained through interviews with relatives of patients with childhood sarcoma. J Clin Epidemiol 47(1):89–96.
- Burton PR, Palmer LJ, Jacobs K, Keen KJ, Olson JM, Elston RC 2000. Ascertainment adjustment: where does it take us? Am J Hum Genet 67(6):1505–14.
- Cannings C, Thompson EA 1977. Ascertainment in the sequential sampling of pedigrees. Clin Genet 12:208–212.
- Elston RC, Sobel E 1979. Sampling considerations in the gathering and analysis of pedigree data. Am J Hum Genet 31:62–69.
- Epstein MP, Lin X, Boehnke M 2002. Ascertainment-adjusted parameter estimates revisited. Am J Hum Genet 70(4):886–95.
- Ewens WJ, Shute NCE 1986. A resolution of the ascertainment sampling problem. I. Theory. Theoret Pop Biol 30:388–412.
- Ewens WJ, Shute NCE 1989. Remarks on ascertainment. Genet Epidemiol 6:89–93.
- Fisher RA 1934. The effect of methods of ascertainment upon the estimation of frequencies. Ann Eugenics 6:13–25.
- Haldane J 1938. The estimation of the frequencies of recessive conditions in man. Ann Eugen 5:255–262.
- Hansen L 1982. Large sample properties of generalized method of moments estimators. Econometrica 50:1029–1054.
- Love RR, Evans AM, Josten DM 1985. The accuracy of patient reports of a family history of cancer. J Chron Dis 38:289–293.
- Morton NE 1959. Genetic tests under incomplete ascertainment. Am J Hum Genet 11:1–16.
- Olson JM, Cordell HJ 2000. Ascertainment bias in the estimation of sibling genetic risk parameters. Genet Epidemiol 18(3):217–35.
- Rider P 1955. Truncated binomial and negative binomial distributions. J Amer Stat Assoc 50:877–883.
- Schaid D 2004. The complex genetic epidemiology of prostate cancer. Hum Mol Genet 13(Review Issue):R103–R121. PMID: 14749351.
- Schaid D, McDonnell S, Blute M, Thibodeau S 1998. Evidence for autosomal dominant inheritance of prostate cancer. Am J Hum Genet 62:1425–1438.
- Shute NCE, Ewens WJ 1988a. A resolution of the ascertainment sampling problem. II. Generalizations and numerical results. Am J Hum Genet 43:374–386.
- Shute NCE, Ewens WJ 1988b. A resolution of the ascertainment sampling problem. III. Pedigrees. Am J Hum Genet 43:387–395.
- Steinberg GD, Carter BS, Beaty TH, Childs B, Walsh PC 1990. Family history and the risk of prostate cancer. Prostate 17(4):337–347.
- Vieland VJ, Hodge SE 1995. Inherent intractability of the ascertainment problem for pedigree data: a general likelihood framework. Am J Hum Genet 56:33–43.
- Weinberg W 1912. Methode und Fehlerquellen der Untersuchung auf Mendelschen Zahlen beim Menschen. Rass Ges Biol 9:165–174.
- White H 1982. Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25.
