Abstract
Neuro-oncology studies commonly assess the association between a prognostic factor (predictor) and a disease or outcome, such as the association between age and glioma. Predictors can be continuous (eg, age) or categorical (eg, race/ethnicity). Effects of categorical predictors are frequently easier to visualize and interpret than effects of continuous variables. This makes it attractive, and seemingly justifiable, to subdivide continuous predictors into categories (eg, age <50 years vs age ≥50 years). However, this approach results in a loss of information (and power) compared with analyzing the continuous version. This review outlines the use cases for continuous and categorized predictors and describes tips and pitfalls for interpreting these approaches.
Keywords: analysis, categorical, categorize, continuous, statistics
Studies of prognostic factors, such as clinical, radiologic, or genomic features, laboratory values, or drug use, are commonly performed in the field of neuro-oncology. A common practice when data values are continuous is to categorize (ie, “bin”) them into two or more levels. The results are easier to interpret and report, but they also present a simplified version of a potentially more complex system.
Several risk factors for brain tumors are measured on a continuous scale. Exposure to N-nitroso compounds in processed food, exposure to radiofrequency (non-ionizing) radiation from cellular and cordless phones, dosage of vitamin supplements in pregnancy, and age have all been studied as possible risk factors.1,2 In addition, Karnofsky performance status (KPS) and midline shift3 are continuous variables that are often dichotomized in analyses. Converting these measurements into classes such as “Low,” “Medium,” and “High” can result in misleading models and, subsequently, faulty treatment strategies.
What Are the Shortcomings of Introducing Categorization Instead of Preserving Continuity?
When a continuous predictor is divided into two or more categories, the behavior of the outcome is assumed to be constant within each category. For example, a population-based study from Olmsted County, Minnesota examined the relationship between glioma grade and age as a continuous variable (Figure 1). The figure shows that the majority (77%) of gliomas after age 40 years were grade IV glioblastomas, which accounted for only 12% of gliomas in adults between 20 and 40 years.4 The majority (68%) of gliomas in patients aged 20-39 years were grades II/III. If age were arbitrarily dichotomized at 50 years, a seemingly reasonable choice because this is the median age of glioblastoma and because the incidence of glioma increases in older patients, the difference in the proportion of patients with grade IV glioblastomas would be much less pronounced (43% for age <50 years vs 77% for age 50+ years). In addition, both the 1st percentile (~20 years) and the 49th percentile (~46 years) would fall in the same category, even though a patient older than 40 years is more likely to have a grade IV glioblastoma than a 20-year-old. In contrast, the 49th and 51st percentiles (~46 years and ~53 years, respectively), with similar proportions of grade IV glioblastomas, would fall in different categories. Hence, the categorization of a continuous variable can result in a massive loss of information. Some studies have tried to find a middle ground between these approaches by offering both types of analyses.5
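The within-category assumption described above can be made concrete with a minimal Python sketch. The risk curve is purely hypothetical (it is not fitted to the Olmsted County data), and the names `true_risk` and `dichotomized_risk` are chosen only for illustration.

```python
import math

CUT = 50  # pre-specified dichotomization point (years)

def true_risk(age):
    """Hypothetical smooth probability of a grade IV tumor by age (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(age - 45) / 8.0))

# Average the smooth curve within each category, as a dichotomized model would.
ages = list(range(20, 81))
low = [true_risk(a) for a in ages if a < CUT]
high = [true_risk(a) for a in ages if a >= CUT]
risk_low = sum(low) / len(low)
risk_high = sum(high) / len(high)

def dichotomized_risk(age):
    """Risk implied by the two-category model: constant within each category."""
    return risk_low if age < CUT else risk_high

for age in (20, 46, 53):
    print(f"age {age}: continuous {true_risk(age):.2f}, "
          f"dichotomized {dichotomized_risk(age):.2f}")
```

In this sketch the 20-year-old and the 46-year-old receive the same estimated risk, while the 46-year-old and the 53-year-old are assigned different risks even though their values on the smooth curve are closer to each other.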
Figure 1.
Number of glioma patients (with glioma grade and subtype) per age group from a population-based study of Olmsted County, Minnesota residents. Adapted from Ryan et al4.
For ease of interpretation, cut-points are frequently chosen from the collected data, with common groupings at medians or quartiles. However, such data-dependent selection of cut-points makes a study difficult to replicate and complicates comparisons with other literature.6,7 A 2005 review of multiple studies on the risk of childhood brain tumors1 highlights the different cut-points used for these risk factors in the absence of any standardized guidelines. To circumvent this complication, additional work would be needed to establish one “optimal” cut-point for the continuous variable of interest. Determining an optimal cut-point requires model-building techniques and validation. The process involves testing a series of candidate cut-points, which raises the problem of multiple comparisons.8 Proper adjustment for multiple comparisons is required to avoid overestimating the effect and inflating the false-positive (type I) error rate.9,10
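The multiple-comparisons problem can be sketched with a small simulation in Python. The setup is an assumption made for illustration: the predictor and outcome are simulated as independent (so any "significant" result is a false positive), every cut-point between the 10th and 90th percentiles is tried, and a pooled two-sample t test is used. The sample size, number of replicates, and function names are hypothetical choices.

```python
import math
import random

T_CRIT = 1.984  # approximate two-sided 5% critical value of t with 98 df (n = 100)

def t_stats_by_cut(x, y):
    """Pooled two-sample t statistic of y for each cut of x (inner 80% of ranks)."""
    n = len(x)
    ys = [yi for _, yi in sorted(zip(x, y))]  # outcome ordered by the predictor
    total = sum(ys)
    total_sq = sum(v * v for v in ys)
    lo, hi = n // 10, n - n // 10
    stats = {}
    s = q = 0.0
    for k in range(1, n):  # k = number of observations in the "low" group
        s += ys[k - 1]
        q += ys[k - 1] ** 2
        if k < lo or k > hi:
            continue
        n1, n2 = k, n - k
        ss1 = q - s * s / n1
        ss2 = (total_sq - q) - (total - s) ** 2 / n2
        pooled_var = (ss1 + ss2) / (n - 2)
        stats[k] = (s / n1 - (total - s) / n2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return stats

def false_positive_rates(n=100, reps=1000, seed=1):
    """Type I error for a pre-specified median cut vs the best cut found by scanning."""
    rng = random.Random(seed)
    fixed = scanned = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # independent of x: the null is true
        stats = t_stats_by_cut(x, y)
        fixed += abs(stats[n // 2]) > T_CRIT
        scanned += max(abs(t) for t in stats.values()) > T_CRIT
    return fixed / reps, scanned / reps

fpr_fixed, fpr_scanned = false_positive_rates()
print(f"pre-specified median cut: {fpr_fixed:.3f}")
print(f"minimum-P cut-point:      {fpr_scanned:.3f}")
```

Under these assumptions, the pre-specified cut should reject at roughly the nominal 5% rate, whereas selecting the cut-point with the largest test statistic should reject far more often unless the P value is adjusted.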
Categorization may also mask any nonlinear relationship between the risk factor and the outcome, leading to an unrealistic model. For example, if a risk factor with a “U”-shaped nonlinear relationship to the outcome were dichotomized (eg, at the median), the association of the risk factor with the outcome might be missed completely. As an illustration, presurgical serum lipid levels in patients with glioblastoma were examined using levels of low-density lipoprotein (LDL) dichotomized at ≥1.84 vs <1.84 mmol/L (corresponding to approximately 70 mg/dL), and low LDL was found to be an independent prognostic factor for overall survival.11 However, in other disease settings, LDL has demonstrated a “U”-shaped relationship with cardiovascular disease outcomes, which may have been missed with the dichotomized approach used in the glioblastoma analysis.12 These issues are illustrated in Figure 2, which depicts a hypothetical relationship between LDL and overall survival based on the hazard ratio curve found in other literature. To our knowledge, this relationship has not been assessed using nonlinear effects in the glioblastoma literature. In this depiction, both low and high LDL are associated with an increased risk of mortality, but dichotomizing at an LDL of 70 mg/dL would identify only the increased risk for low LDL, and a linear fit would identify only the increased risk for higher LDL. The most flexible fit, which avoids the pitfalls of both categorization and the linear fit, can be obtained using a spline.9,13 A spline, named for a tool used by shipbuilders, is a flexible curve that can be fit mathematically using a series of piecewise polynomials or other smoothing methods to model a nonlinear effect. Splines are one possible method for testing nonlinear effects, but not the only one: polynomial effects (eg, squared and cubic terms) can also be used. However, splines provide the most flexible fits, so they are preferred for assessing nonlinear effects. Most analysis software packages can perform spline fits. The best way to assess whether a nonlinear effect is present is to fit a model with a nonlinear effect and compare it with a model with a linear or categorical effect to see whether the nonlinear effect improves the goodness of fit of the model.
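The model comparison described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the analysis behind Figure 2: the data are simulated from a hypothetical U-shaped curve, the outcome is a continuous risk score fitted by ordinary least squares rather than a survival model, the spline is a piecewise-linear spline with arbitrarily chosen knots at 60, 100, and 150 mg/dL, and goodness of fit is compared with the Akaike information criterion (AIC).

```python
import math
import random

def fit_ols(rows, y):
    """Least-squares fit via the normal equations; returns (coefficients, RSS)."""
    p = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, p):
            factor = a[r][col] / a[col][col]
            for c in range(col, p + 1):
                a[r][c] -= factor * a[col][c]
    beta = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        beta[i] = (a[i][p] - sum(a[i][j] * beta[j] for j in range(i + 1, p))) / a[i][i]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    return beta, rss

def aic(rss, n, n_coef):
    """Gaussian AIC up to an additive constant; lower values indicate better fit."""
    return n * math.log(rss / n) + 2 * (n_coef + 1)

def hinge(x, knot):
    """Truncated linear basis function: zero below the knot, linear above it."""
    return max(0.0, x - knot)

# Three codings of the same predictor (LDL in mg/dL, rescaled by 100 for stability).
DESIGNS = {
    "dichotomized": lambda ldl: [1.0, 1.0 if ldl >= 70 else 0.0],
    "linear": lambda ldl: [1.0, ldl / 100],
    "spline": lambda ldl: [1.0, ldl / 100] + [hinge(ldl, k) / 100 for k in (60, 100, 150)],
}

rng = random.Random(7)
ldl = [rng.uniform(30, 200) for _ in range(400)]
# Hypothetical U-shaped truth with its minimum at 100 mg/dL, plus random noise.
risk = [0.0002 * (v - 100) ** 2 + rng.gauss(0, 0.5) for v in ldl]

aic_by_model = {}
for name, design in DESIGNS.items():
    rows = [design(v) for v in ldl]
    _, rss = fit_ols(rows, risk)
    aic_by_model[name] = aic(rss, len(ldl), len(rows[0]))
    print(f"{name:>12}: AIC = {aic_by_model[name]:.1f}")
```

With a truly U-shaped relationship, the spline coding should give the lowest AIC of the three. In practice, a restricted cubic spline or fractional polynomial from an established statistical package would normally replace the hand-written basis used here.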
Figure 2.
Depiction of continuous vs categorical estimates of risk for the hypothetical relationship between low-density lipoprotein (LDL) and mortality. The solid lines show risk estimates for LDL dichotomized at 70 mg/dL. The dashed line shows a spline fit to continuous LDL and the dotted line shows a linear fit to continuous LDL. The gray lines are reference lines.
What Are the Circumstances in Which Categorization Would Be Acceptable or Preferred?
Categorization can be useful in certain circumstances:
- When the goal is to compare your study’s results with previously published reports.
- When a study population must be deidentified (eg, coarsening some variables in national databases to avoid potential reidentification).
- When the continuous variable has a standardized categorization structure6,7 (eg, body mass index).
When categorization is desired for ease of model interpretation, the recommendation is to first model continuous predictors as continuous to arrive at a final model, and only afterward to categorize them into risk groups that capture the results.9
What Is the Negative Impact of Improper Categorization on Analysis?
The power to detect an association with a given outcome is often reduced when a categorized predictor is used instead of the predictor on its natural continuous scale. Specifically, dichotomizing a continuous variable at the median results in a loss of power equivalent to discarding about a third of the data.6,7 Consequently, a true effect may go undetected, and the probability of a false-negative finding (ie, type II error) increases.
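The loss of power can be examined with a short simulation in Python. The scenario is hypothetical: a normally distributed predictor with a modest linear effect on a continuous outcome, a sample size of 100, and a test of zero correlation (which, for the dichotomized predictor, is equivalent to a pooled two-sample t test). All parameter values and function names are assumptions for illustration.

```python
import math
import random

T_CRIT = 1.984  # approximate two-sided 5% critical value of t with 98 df (n = 100)

def pearson_t(x, y):
    """t statistic for testing zero Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))

def estimate_power(n=100, slope=0.25, reps=1000, seed=1):
    """Share of simulated studies that detect the effect, by predictor coding."""
    rng = random.Random(seed)
    hits_cont = hits_dich = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [slope * xi + rng.gauss(0, 1) for xi in x]
        median = sorted(x)[n // 2]
        x_bin = [1.0 if xi >= median else 0.0 for xi in x]  # median split
        hits_cont += abs(pearson_t(x, y)) > T_CRIT
        hits_dich += abs(pearson_t(x_bin, y)) > T_CRIT
    return hits_cont / reps, hits_dich / reps

power_cont, power_dich = estimate_power()
print(f"power, continuous predictor:   {power_cont:.2f}")
print(f"power, median-split predictor: {power_dich:.2f}")
```

Because both analyses use the same simulated studies, the difference between the two estimates reflects only the information discarded by the median split.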
Recommended Approaches for Evaluating Prognostic Factors and Outcomes
- If the relationship is unknown, nonlinear methods such as smoothing splines, restricted cubic splines, and fractional polynomials are recommended.9
- If the relationship is known to be linear, keep the variable continuous and use linear methods. Interpret the resulting effect estimate as the effect of each additional 1 unit (or of a meaningful unit change, such as 10 years of age), which readers can readily understand.
- If a continuous variable such as age is to be dichotomized, the choice of cut-point should be made before analysis and with some theoretical or clinical justification. Data-driven cut-points should be avoided. Never choose an “optimal” cut-point by minimizing the P value or maximizing statistics such as odds ratios.
- If a continuous variable is categorized, having 3 or more groups is preferable to just 2. Again, pre-specification of cut-points is strongly recommended.
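For the recommendation on reporting a linear effect per meaningful unit change, the conversion can be sketched in Python. The sketch assumes a model that is linear on the log scale (eg, logistic or Cox regression), so that ratios multiply across units; the hazard ratio of 1.03 per year is a made-up value for illustration only.

```python
import math

def rescale_ratio(ratio_per_unit, units):
    """Convert an odds or hazard ratio per 1 unit into the ratio per `units` units."""
    return math.exp(units * math.log(ratio_per_unit))

hr_per_year = 1.03  # hypothetical hazard ratio per 1-year increase in age
hr_per_decade = rescale_ratio(hr_per_year, 10)
print(f"HR per 10 years: {hr_per_decade:.2f}")  # about 1.34
```

Reporting the effect per 10 years conveys the same fitted model as the per-year estimate; only the unit of presentation changes.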
In conclusion, several risk factors in neuro-oncology research are inherently continuous but are frequently analyzed after conversion to a categorized form. This review briefly highlights the pros and cons of such practices and aims to raise awareness that a justifiable reason is needed when categorization is the preferred pathway. The trade-offs between power and interpretability, and between splines and quasi-continuous categorization, should be weighed carefully by statisticians in light of the goals of the study.
Contributor Information
Ruchi Gupta, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA.
Courtney N Day, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA.
William O Tobin, Department of Neurology, Mayo Clinic, Rochester, Minnesota, USA.
Cynthia S Crowson, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA; Division of Rheumatology, Mayo Clinic, Rochester, Minnesota, USA.
Funding
None.
Conflict of interest statement. The authors declare no conflict of interest.
References
- 1. Dietrich M, Block G, Pogoda JM, et al. A review: dietary and endogenously formed N-nitroso compounds and risk of childhood brain tumors. Cancer Causes Control. 2005;16(6):619–635.
- 2. Cancer.net. Brain Tumor: Risk Factors. https://www.cancer.net/cancer-types/brain-tumor/risk-factors. Accessed April 1, 2021.
- 3. Gamburg ES, Regine WF, Patchell RA, et al. The prognostic significance of midline shift at presentation on survival in patients with glioblastoma multiforme. Int J Radiat Oncol Biol Phys. 2000;48(5):1359–1362.
- 4. Ryan CS, Juhn YJ, Kaur H, et al. Long-term incidence of glioma in Olmsted County, Minnesota, and disparities in postglioma survival rate: a population-based study. Neurooncol Pract. 2020;7(3):288–298.
- 5. Ganesh A, Luengo-Fernandez R, Wharton RM, Rothwell PM, Oxford Vascular Study. Ordinal vs dichotomous analyses of modified Rankin Scale, 5-year outcome, and cost of stroke. Neurology. 2018;91(21):e1951–e1960.
- 6. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.
- 7. Naggara O, Raymond J, Guilbert F, et al. Analysis by categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms. Am J Neuroradiol. 2011;32:437–440.
- 8. Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994;86:829–835.
- 9. SAS Support. Croxford R. Continuous predictors in regression analyses. 2017. https://support.sas.com/resources/papers/proceedings17/0288-2017.pdf. Accessed April 1, 2021.
- 10. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25(1):127–141.
- 11. Liang R, Junhong L, Mao L, et al. Clinical significance of pre-surgical serum lipid levels in patients with glioblastoma. Oncotarget. 2017;8:85940–85948.
- 12. Zhang J, Chen L, Delzell E, et al. The association between inflammatory markers, serum lipids and the risk of cardiovascular events in patients with rheumatoid arthritis. Ann Rheum Dis. 2014;73(7):1301–1308.
- 13. Gauthier J, Wu QV, Gooley TA. Cubic splines to model relationships between continuous variables and outcomes: a guide for clinicians. Bone Marrow Transplant. 2020;55(4):675–680.


