Table 2.
Glossary of Terms Used in Secondary Dataset Analysis Research
Term | Meaning |
---|---|
Types of datasets (not mutually exclusive) | |
Administrative or claims data | Datasets generated from reimbursement claims, such as ICD-9 codes used to bill for clinical encounters, or discharge data such as discharge diagnoses |
Longitudinal data | Datasets that measure factors of interest within the same subjects over time |
Clinical registries | Datasets generated from registries of specific clinical conditions, such as the regional cancer registries used to create the Surveillance, Epidemiology, and End Results (SEER) Program dataset |
Population-based survey | A survey in which the target population is available and well defined, and a systematic approach is used to select members of that population to take part in the study. For example, SEER is a population-based survey because it aims to include data on all individuals with cancer cared for in the included regions |
Nationally representative survey | Survey sample that is designed to be representative of the target population on a national level. Often uses a complex sampling scheme. The Health and Retirement Study (HRS), for example, is nationally representative of community-dwelling adults over age 50 |
Panel survey | A longitudinal survey in which data are collected from the same panel of subjects over time. As one panel reaches the middle or end of its participation, a panel of new participants is enrolled. In the Medical Expenditure Panel Survey (MEPS), for example, individuals in the same household are surveyed several times over the course of 2 years |
Statistical sampling terms | |
Clustering | Sampling subjects within defined clusters, such as geographic regions or subjects treated by the same physicians. Even simple random samples can be prohibitively expensive for practical reasons such as geographic distance between selected subjects. Clustering reduces cost and improves the feasibility of the study but may decrease the precision of estimates (e.g., wider confidence intervals) |
Complex survey design | A survey design that is not a simple random selection of subjects. Surveys that incorporate stratification, clustering, and oversampling (with survey weights) are examples of complex designs. Statistical software that can account for complex survey designs is available and is often needed to generate accurate findings |
Oversampling | Intentionally sampling a greater proportion of a subgroup, increasing the precision of estimates for that subgroup. For example, in the HRS, African-Americans, Latinos, and residents of Florida are oversampled (see also survey weights) |
Stratification | In stratification, the target population is divided into relatively homogeneous groups (strata), and a pre-specified number of subjects is sampled from within each stratum. For example, in the National Ambulatory Medical Care Survey, physicians are divided by specialty within each geographic area targeted for the survey, and a certain number of each type of physician is then identified to participate and provide data about their patients |
Survey weights | Weights are used to account for unequal probabilities of subject selection, which arise from purposeful over- or under-sampling of certain types of subjects, and for non-response. The survey weight is the inverse of the probability of being selected. By applying survey weights, the effects of over- and under-sampling of certain types of patients can be corrected such that the data are representative of the entire target population |
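
The relationship between oversampling and survey weights in the glossary can be made concrete with a small worked example. The sketch below is illustrative only: the stratum names, population sizes, sample sizes, and case counts are hypothetical values chosen for the arithmetic, not figures from any dataset named in the table. It assumes a simple stratified design in which the weight is the inverse of the selection probability, and it ignores non-response adjustments and variance estimation, which real complex-survey software also handles.

```python
# Hypothetical stratified sample in which a small subgroup is oversampled.
# Every number below is a made-up illustration value.
strata = {
    # name: (population size N_h, sample size n_h, sampled subjects with the condition)
    "subgroup_A": (100, 10, 4),  # oversampled: 10 of 100 selected
    "subgroup_B": (900, 10, 2),  # 10 of 900 selected
}


def survey_weight(pop_size, sample_size):
    """Inverse of the selection probability n_h / N_h, written as N_h / n_h."""
    return pop_size / sample_size


def unweighted_prevalence(strata):
    """Naive estimate that treats every sampled subject equally."""
    cases = sum(c for _, _, c in strata.values())
    sampled = sum(n for _, n, _ in strata.values())
    return cases / sampled


def weighted_prevalence(strata):
    """Estimate in which each subject counts as `weight` people in the population."""
    weighted_cases = 0.0
    weighted_total = 0.0
    for pop_size, sample_size, cases in strata.values():
        w = survey_weight(pop_size, sample_size)
        weighted_cases += w * cases
        weighted_total += w * sample_size
    return weighted_cases / weighted_total


if __name__ == "__main__":
    for name, (pop_size, sample_size, _) in strata.items():
        print(name, "weight:", survey_weight(pop_size, sample_size))
    print("unweighted prevalence:", round(unweighted_prevalence(strata), 2))
    print("weighted prevalence:", round(weighted_prevalence(strata), 2))
```

In this hypothetical sample the unweighted prevalence is 6/20 = 0.30, because the oversampled subgroup makes up half of the sample but only a tenth of the population. Applying the weights (10 for subgroup A, 90 for subgroup B) gives (4 × 10 + 2 × 90) / 1000 = 0.22, the estimate for the full target population.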