Modeling and variable selection in epidemiologic analysis

S Greenland

Am J Public Health. 1989 Mar;79(3):340–349. doi: 10.2105/ajph.79.3.340

PMCID: PMC1349563 PMID: 2916724

Abstract

This paper provides an overview of problems in multivariate modeling of epidemiologic data, and examines some proposed solutions. Special attention is given to the task of model selection, which involves selection of the model form, selection of the variables to enter the model, and selection of the form of these variables in the model. Several conclusions are drawn, among them: a) model and variable forms should be selected based on regression diagnostic procedures, in addition to goodness-of-fit tests; b) variable-selection algorithms in current packaged programs, such as conventional stepwise regression, can easily lead to invalid estimates and tests of effect; and c) variable selection is better approached by direct estimation of the degree of confounding produced by each variable than by significance-testing algorithms. As a general rule, before using a model to estimate effects, one should evaluate the assumptions implied by the model against both the data and prior information.
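Conclusion (c) can be illustrated with a small change-in-estimate check. The sketch below is one possible reading of "direct estimation of the degree of confounding", not the paper's own procedure: it compares the crude odds ratio with a Mantel-Haenszel odds ratio adjusted for a single candidate confounder. The counts are hypothetical, the function names are invented for this illustration, and the 10% cutoff is a common convention assumed here rather than a value taken from the abstract.

```python
from typing import List, Tuple

# One 2x2 table per stratum of the candidate confounder:
# (exposed cases, exposed noncases, unexposed cases, unexposed noncases)
Table = Tuple[int, int, int, int]


def crude_odds_ratio(strata: List[Table]) -> float:
    """Odds ratio from the table collapsed over the candidate confounder."""
    a = sum(t[0] for t in strata)
    b = sum(t[1] for t in strata)
    c = sum(t[2] for t in strata)
    d = sum(t[3] for t in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata: List[Table]) -> float:
    """Mantel-Haenszel summary odds ratio across the strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


def confounding_ratio(strata: List[Table]) -> float:
    """Crude estimate divided by adjusted estimate; 1.0 means no change."""
    return crude_odds_ratio(strata) / mantel_haenszel_odds_ratio(strata)


# Hypothetical counts, chosen so that the stratum-specific odds ratio is 2
# in both strata while the covariate is related to exposure and to disease.
HYPOTHETICAL_STRATA: List[Table] = [
    (10, 90, 10, 180),  # covariate absent
    (60, 120, 10, 40),  # covariate present
]

# Assumed convention: treat a change of more than 10% as worth adjusting for.
THRESHOLD = 0.10

ratio = confounding_ratio(HYPOTHETICAL_STRATA)
print(f"crude OR    = {crude_odds_ratio(HYPOTHETICAL_STRATA):.2f}")
print(f"adjusted OR = {mantel_haenszel_odds_ratio(HYPOTHETICAL_STRATA):.2f}")
print(f"adjust for covariate: {abs(ratio - 1.0) > THRESHOLD}")
```

The decision here rests on how far the effect estimate moves when the covariate is controlled, not on a significance test of the covariate's coefficient, which is the contrast the abstract draws with stepwise algorithms.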


