Abstract
Objective
Data quality in epidemiological studies is a basic requirement for good scientific research. The aim of this study was to examine an important indicator of data quality, data completeness, by investigating predictors of missing data.
Methods
Baseline data of a cohort study, the population-based Hamburg City Health Study, were used. Missingness was investigated at three levels: the whole research unit, two segments (health service utilisation and psychosocial variables) and two sensitive items (income and number of sexual partners). Predictors of missingness were sociodemographic variables, cognitive abilities and the mode of data collection. Associations were estimated using binary and multinomial logistic regression models.
Results
Of 10 000 participants (mean age=62.4 years; 51.1% women), 32.9% had complete data at the unit level, 66.8% had partially missing data and 0.3% missed all items. The highest proportions of missing values were found for income (27.8%) and the number of sexual partners (36.7%). At the unit, segment and item levels, older age, female sex, low education, a foreign mother language and cognitive impairment were significant predictors of missingness.
Conclusion
For analysing population-based data, dealing with missingness is equally important at all levels of analysis. During the design and conduct of the study, the identified groups may be targeted to reach higher levels of data completeness.
Keywords: Methods, STATISTICS & RESEARCH METHODS, EPIDEMIOLOGIC STUDIES, EPIDEMIOLOGY
Strengths and limitations of this study.
Data completeness was investigated in a large variety of different questionnaire items in a population-based sample.
Assessing data completeness at the unit, segment and item levels provides a comprehensive and methodologically robust analysis.
The large sample size of N=10 000 provides high statistical power to detect significant predictors of data missingness.
The study is limited to participants with sufficient German language skills, which may affect the generalisability of the findings.
The specific age inclusion criterion (45–74 years) restricts the applicability of the findings to other age demographics.
Introduction
High data quality in epidemiological studies is a prerequisite for good scientific work.1 2 Guidelines with data quality indicators could make assessments of data quality comparable and enhance it.3 Data completeness is one of the most important indicators of data quality. Data can be missing at the level of a whole research unit, a specific segment (eg, a segment in a questionnaire), or a single item.4 An investigation of causes of missing data should be part of any data analysis, and the handling of missing values should be based on reasonable assumptions to avoid biased results.5 6 If data are missing, they may be missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR).7 If data are MCAR, the presence of a data element and any observed or unobserved other value in the dataset are independent of each other. In contrast, data are MAR when the presence of a data element only depends on some other measured variables. If the missingness additionally depends on unobserved data, they are MNAR. If data are MNAR, the causes of missing data are likely unknown to us. As many modern methods to deal with missing values depend on the assumption of data to be at least MAR,8 predictors of data completeness are of great practical relevance.
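The practical difference between the three mechanisms can be illustrated with a small simulation. The following is a minimal sketch and not part of the study's analysis: the binary covariate `older`, the outcome distribution and all missingness probabilities are hypothetical values chosen only for illustration. It compares the complete-case mean and a covariate-adjusted mean with the mean of the full data.

```python
import random


def simulate(mechanism, n=20000, seed=1):
    """Simulate one missingness mechanism (all parameters are hypothetical).

    Returns (full_mean, complete_case_mean, adjusted_mean), where the adjusted
    mean reweights the observed stratum means by the stratum sizes.
    """
    rng = random.Random(seed)
    full = {True: [], False: []}      # all values, by covariate group
    observed = {True: [], False: []}  # values that were not set to missing
    for _ in range(n):
        older = rng.random() < 0.5                   # fully observed covariate
        y = rng.gauss(-0.5 if older else 0.5, 1.0)   # value that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                          # independent of everything
        elif mechanism == "MAR":
            p_missing = 0.5 if older else 0.1        # depends on the observed covariate
        elif mechanism == "MNAR":
            p_missing = 0.5 if y < 0 else 0.1        # depends on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        full[older].append(y)
        if rng.random() >= p_missing:
            observed[older].append(y)

    def mean(values):
        return sum(values) / len(values)

    full_mean = mean(full[True] + full[False])
    complete_case_mean = mean(observed[True] + observed[False])
    adjusted_mean = sum(len(full[g]) / n * mean(observed[g]) for g in (True, False))
    return full_mean, complete_case_mean, adjusted_mean


for mechanism in ("MCAR", "MAR", "MNAR"):
    full_mean, cc_mean, adj_mean = simulate(mechanism)
    print(f"{mechanism}: full={full_mean:.3f} "
          f"complete-case={cc_mean:.3f} adjusted={adj_mean:.3f}")
```

Under these assumptions, the complete-case mean stays close to the full-data mean only for MCAR. Conditioning on the observed covariate removes the bias under MAR, but not under MNAR, which is why methods that rely on observed predictors require data to be at least MAR.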
Several predictors of data completeness have been investigated in the literature. Significant increases in item non-response were found with increasing age and cognitive impairment in a nursing home population,9 whereas no significant associations of age and socioeconomic status with data completeness were found in a sample of men diagnosed with prostate cancer.10 In a French national survey, more missing and inconsistent data were found among older respondents (50 years and above), respondents with low education and foreign nationals, and were associated with specific diseases such as neurological disorders.11 Sex differences were also found regarding the number of missing values: female respondents had more missing data in a study of Australian residents12 and in a randomised preventive intervention study in the USA.13 In the latter study, being unmarried, low income, low education and a poor personal health status were associated with higher numbers of missing values. Similarly, low education, poor health status and unhealthy lifestyle habits were predictors of missing values in a prospective cohort study in Greece.14 In the Religion Among Academic Scientists study, participants with a religious affiliation were more likely to answer questions that were deemed controversial.15
The mode of data collection also contributes to data quality.16 Fewer missing data are found in digital surveys than in paper-pencil surveys. In a pilot study for mixed-mode surveys, item non-response for selected health indicators was in most cases higher in a written-postal questionnaire than in computer-assisted survey forms.17 Additionally, the question topic matters. Sensitive questions, such as those on sexual activities, income or illegal behaviour, are more likely to be answered online than in paper-pencil surveys.18 Sensitive questions often cause discomfort and refusal to answer among respondents. In addition, social desirability bias often generates inaccurate survey results.19
Objectives and hypotheses
The starting point of the present data analysis was the research gap regarding missingness patterns in population-based epidemiologic studies with an extensive examination programme and a large number of questionnaires. Thus, our aim was to investigate the data completeness of the baseline questionnaire data of the Hamburg City Health Study (HCHS), a prospective population-based cohort study.20 Data completeness was investigated at the unit, segment and item level. At unit level, complete cases were differentiated from partially missing data and a unit level missingness. As data segments, health service utilisation (HSU) and psychosocial scales (PS) were defined. At item level, sensitive questions regarding income and sexual partners were examined. According to previous literature, the following predictors for data completeness were employed:
Sociodemographic characteristics: age, sex, education, mother language, religious affiliation
Cognitive impairment
Mode of data collection
Based on previous studies, the following hypotheses were formulated: We expected that older participants, women and participants with lower education, a foreign mother language, no religious affiliation or cognitive impairment would show higher proportions of missing values, and that digital data collection would be associated with higher data completeness than the paper-pencil mode.
Methods
Hamburg City Health Study
The data analysed are from the HCHS, an ongoing, single-centre, prospective, population-based cohort study aiming to identify risk factors for major chronic diseases. The HCHS is carried out by the University Medical Center Hamburg-Eppendorf in Germany and started in 2016. It aims to include 45 000 Hamburg residents between 45 and 74 years of age with sufficient knowledge of the German language, randomly selected from the residents’ registration office, for an extensive baseline assessment and a follow-up 6 years later. In addition to medical examinations, participants fill out validated self-report questionnaires before, during and after the baseline visit. The present study uses a data freeze of the first 10 000 participants of the baseline study, recruited from February 2016 to November 2018. Further details about the study can be found elsewhere.20 21
Assessment of predictor variables
The variables age and biological sex (male/female) were retrieved from the registration office. The other predictor variables were taken from the questionnaire that was administered in advance of the participants’ examination in the study centre. Education was grouped into high (12 or 13 years of education), medium (10 years of education) and no/low (up to 9 years of education), while participants with German as their mother language were compared with persons with a foreign first language. Cognitive impairment was measured by the Mini-Mental State Examination.22 Participants with a score of less than 27 points were considered to be cognitively impaired.23 The collection method was recorded and could be entirely paper-pencil based, entirely digital or a mixed mode of paper-pencil and digital. Participants decided for themselves which mode they preferred.
Assessment of dependent variables
Three selected thematic blocks were examined in this study: HSU, PS and sensitive items. The HSU items were recorded by (1) the commitment to the general practitioner questionnaire24 and (2) several items of the Scale of the German Health Interview and Examination Survey for Adults (DEGS).25 These were six questions about the use of primary care physicians, ambulatory care in medical practices, hospitals, ambulatory or inpatient rehabilitation clinics and the insurance status. The PS items are composed of the following psychosocial instruments:
Quality of life was recorded with the Short-Form Health Survey (SF-8)26 27 and with the Quality of Life Questionnaire EuroQol-5 Dimensions-5 Levels (EQ-5D-5L).28 29
The Patient Health Questionnaire-9 for depression (PHQ-9).30,32
The Generalised Anxiety Disorder Scale-7 (GAD-7).33
The Patient Health Questionnaire-15 for somatic symptoms (PHQ-15).34 35
The Hamburg Resilience Scale-6 (HRES-6).36
The Adverse Childhood Experiences Questionnaire (ACE).37 38
Thereby, 60 PS items were evaluated as a segment.
As sensitive variables, we included two items: (1) Income was recorded by the monthly household net income with 17 categories ranging from ‘less than 500€’ to ‘more than 8000€’. (2) The total number of sexual partners was asked with the question ‘How many partners have you had sexual intercourse with so far?’ with five categories (none, 1 partner, 2–5 partners, 6–15 partners, 16 or more partners).
Assessment of data completeness
We defined unit missingness as a missingness of all questionnaire items and distinguished further between complete cases with all items available and partially missing data at the unit level. For both the HSU and the PS items, we viewed them as a segment and differentiated between no missingness, partial missingness (at least one item missing) and total missingness in the respective segment. The sensitive variables were considered as single items. The number of sexual partners was queried in a questionnaire that participants were asked to fill out after the visit of the study centre. As some participants dropped out before beginning to fill out this questionnaire, these participants were excluded from the analysis for this question.
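The three-way distinction described above can be written as a small classification rule. This is a minimal sketch under stated assumptions: the function name, the use of `None` as the missing-value marker and the example participant (with shortened item lists) are hypothetical and do not reproduce the study's data management.

```python
def classify_completeness(responses):
    """Classify a list of item responses, where None marks a missing item."""
    if not responses:
        raise ValueError("at least one item is required")
    n_missing = sum(1 for value in responses if value is None)
    if n_missing == 0:
        return "complete"
    if n_missing == len(responses):
        return "total missingness"
    return "partial missingness"


# Hypothetical participant with a six-item segment and a shortened two-item segment.
participant = {
    "HSU": [1, 0, 2, None, 1, 1],
    "PS": [None, None],
}

# Segment level: each segment is classified on its own items.
segment_status = {name: classify_completeness(items)
                  for name, items in participant.items()}

# Unit level: all questionnaire items of the participant are pooled.
all_items = [value for items in participant.values() for value in items]
unit_status = classify_completeness(all_items)

print(segment_status, unit_status)
```

Applying the same rule to a pooled item list and to each segment separately mirrors the idea that a participant can be a partial case at the unit level while an entire segment is missing.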
Statistical analyses
At unit level, the total numbers of unit missingness, complete cases and partial missingness were calculated. Bivariate associations between predictors for complete cases versus partial missingness were analysed by the χ2 test. For the segment missingness, multivariable multinomial regression models were performed. We report adjusted ORs (aOR) with the corresponding 95% CI comparing the segment to be either totally missing or partially missing with the reference category that all questions were answered. For the two sensitive items, binary logistic regressions were employed. Here, we also included all predictor variables into the model and present aOR with the 95% CI and the minimal and maximal predicted probabilities for observed combinations of predictor variables. The predictive capabilities of the regression models at unit, segment and item level were compared by means of the area under the curve (AUC). As sensitivity analyses, we incorporated the reasons for missingness into the models for the sensitive items, that is, we compared participants who refused or did not know the answer to those who replied to the question. The analyses were conducted with SPSS (V.27) and the R package ggplot2.
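As an illustration of the bivariate step, a Pearson χ2 test for a 2×2 table can be computed directly from the cell counts. This is a minimal sketch using only the Python standard library, not the SPSS procedure used in the study; it applies no continuity correction, and whether the original analysis did so is not stated. The counts are the sex rows of table 2.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts from the sex rows of table 2:
# women: 1492 complete cases, 3602 partial; men: 1800 complete cases, 3081 partial.
statistic, p_value = chi_square_2x2(1492, 3602, 1800, 3081)
print(f"chi-square = {statistic:.1f}, p = {p_value:.2g}")
```

The closed-form p value shown here only holds for one degree of freedom; predictors with more than two categories, such as age or education, would need the general χ2 distribution.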
Patient and public involvement
None.
Results
The participants’ mean age was 62.4 years (SD=8.5), and the sample included 5108 (51.1%) women and 4892 (48.9%) men (table 1). A higher education was reported by 43.5% (n=4345) of the participants, while 7.6% (n=764) had a foreign first language. The items income (n=2777, 27.8%) and number of sexual partners (n=3665, 36.7%) showed the highest number of missing values. For both items, more than 10% of the participants explicitly refused to answer (income: n=1819, 18.2%; sexual partners: n=1093, 10.9%).
Table 1. Baseline characteristics of the study population and types of missing values.
| | Total (n=10 000), n (%) | ‘I don’t want to answer’, n (%) | ‘I don’t know’, n (%) | Other missings*, n (%) | Total missings, n (%) |
|---|---|---|---|---|---|
| Sex | | 0 | 0 | 0 | 0 |
| Female | 5108 (51.1) | | | | |
| Male | 4892 (48.9) | | | | |
| Age | | 0 | 0 | 0 | 0 |
| 45–54 years | 2273 (22.7) | | | | |
| 55–64 years | 3322 (33.2) | | | | |
| 65–78 years | 4405 (44.1) | | | | |
| Education | | 94 (0.9) | 16 (0.2) | 927 (9.3) | 1037 (10.4) |
| No or low† | 2082 (20.8) | | | | |
| Medium‡ | 2536 (25.4) | | | | |
| High§ | 4345 (43.5) | | | | |
| Income | | 1819 (18.2) | 186 (1.9) | 772 (7.7) | 2777 (27.8) |
| Low¶ | 1463 (14.6) | | | | |
| Medium** | 4189 (41.9) | | | | |
| High†† | 1571 (15.7) | | | | |
| Mother language | | 23 (0.2) | 9 (0.1) | 971 (9.7) | 1003 (10.0) |
| German | 8233 (82.3) | | | | |
| Other mother language | 764 (7.6) | | | | |
| Religious affiliation | | 134 (1.3) | 16 (0.2) | 574 (5.7) | 724 (7.2) |
| No | 5694 (56.9) | | | | |
| Yes | 3582 (35.8) | | | | |
| Cognitive impairment (<27 points MMSE) | | | | | 447 (4.5) |
| No | 7978 (79.8) | | | | |
| Yes | 1575 (15.8) | | | | |
| Number of sexual partners | | 1093 (10.9) | 239 (2.4) | 2333 (23.3) | 3665 (36.7) |
| 0 | 67 (0.7) | | | | |
| 1 | 1088 (10.9) | | | | |
| 2–5 | 2617 (26.2) | | | | |
| 6–15 | 1783 (17.8) | | | | |
| 16+ | 780 (7.8) | | | | |
| Mode of data collection | | | | | 108 (1.1) |
| Paper-pencil | 4737 (47.4) | | | | |
| Digital | 4545 (45.5) | | | | |
| Mixed‡‡ | 610 (6.1) | | | | |
*Other missings comprise unit missingness, segment missingness and omitted questions in the paper-pencil questionnaire.
†9 or fewer years of education.
‡10 years of education.
§12 or 13 years of education.
¶0–1999€ monthly household net income.
**2000–4999€ monthly household net income.
††Above 5000€ monthly household net income.
‡‡At least one change occurred, either from paper-pencil to digital or from digital to paper-pencil.
MMSE, Mini-Mental State Examination.
Unit missingness
A total of 25 participants (0.3%) had missing values for all questionnaire items, while 3292 (32.9%) were complete cases and 6683 (66.8%) had partially missing data. All predictors except religious affiliation were associated with missingness at the unit level (table 2). Missing data were more often recorded in female, older, low-educated, low-income and cognitively impaired participants and in participants with a foreign mother language. Digital data collection showed the highest proportion of complete cases (36.1%), while paper-pencil-based collection (31.6%) and the mixed mode (26.1%) had lower proportions of complete records.
Table 2. Variable distributions at unit level missingness comparing complete cases with partial missingness.
| | Complete case (n=3292), n (%) | Partial missingness (n=6683), n (%) | P value |
|---|---|---|---|
| Sex | | | <0.001 |
| Female | 1492 (29.3) | 3602 (70.7) | |
| Male | 1800 (36.9) | 3081 (63.1) | |
| Age | | | <0.001 |
| 45–54 years | 926 (40.8) | 1342 (59.2) | |
| 55–64 years | 1262 (38.1) | 2053 (61.9) | |
| 65–78 years | 1104 (25.1) | 3288 (74.9) | |
| Education | | | <0.001 |
| No or low* | 540 (25.9) | 1542 (74.1) | |
| Medium† | 861 (34.0) | 1675 (66.0) | |
| High‡ | 1891 (43.5) | 2454 (56.5) | |
| Income | | | <0.001 |
| Low§ | 468 (32.0) | 995 (68.0) | |
| Medium¶ | 1913 (45.7) | 2276 (54.3) | |
| High** | 911 (58.0) | 660 (42.0) | |
| Mother language | | | <0.001 |
| German | 3140 (38.1) | 5093 (61.9) | |
| Other mother language | 152 (19.9) | 612 (80.1) | |
| Religious affiliation | | | 0.208 |
| No | 2049 (36.0) | 3645 (64.0) | |
| Yes | 1243 (34.7) | 2339 (65.3) | |
| Cognitive impairment (<27 points MMSE) | | | <0.001 |
| No | 2950 (37.0) | 5028 (63.0) | |
| Yes | 342 (21.7) | 1233 (78.3) | |
| Number of sexual partners | | | <0.001 |
| 0 | 18 (26.9) | 49 (73.1) | |
| 1 | 515 (47.3) | 573 (52.7) | |
| 2–5 | 1360 (52.0) | 1257 (48.0) | |
| 6–15 | 991 (55.6) | 792 (44.4) | |
| 16+ | 408 (52.3) | 372 (47.7) | |
| Mode of data collection | | | <0.001 |
| Paper-pencil | 1494 (31.6) | 3241 (68.4) | |
| Digital | 1639 (36.1) | 2905 (63.9) | |
| Mixed†† | 159 (26.1) | 451 (73.9) | |
*9 or fewer years of education.
†10 years of education.
‡12 or 13 years of education.
§0–1999€ monthly household net income.
¶2000–4999€ monthly household net income.
**Above 5000€ monthly household net income.
††At least one change occurred, either from paper-pencil to digital or from digital to paper-pencil.
MMSE, Mini-Mental State Examination.
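To make the size of the bivariate associations in table 2 more tangible, the counts can be converted into a crude odds ratio. The following is a minimal sketch, assuming a Wald-type (Woolf) interval on the log scale; it yields an unadjusted estimate for illustration only and is not one of the adjusted ORs reported from the regression models.

```python
import math


def odds_ratio_with_ci(exposed_events, exposed_nonevents,
                       unexposed_events, unexposed_nonevents, z=1.96):
    """Crude odds ratio with a Wald (Woolf) confidence interval on the log scale."""
    odds_ratio = ((exposed_events * unexposed_nonevents)
                  / (exposed_nonevents * unexposed_events))
    se_log_or = math.sqrt(1 / exposed_events + 1 / exposed_nonevents
                          + 1 / unexposed_events + 1 / unexposed_nonevents)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Event = partial missingness; counts from the sex rows of table 2
# (women: 3602 partial, 1492 complete; men: 3081 partial, 1800 complete).
crude_or, lower, upper = odds_ratio_with_ci(3602, 1492, 3081, 1800)
print(f"crude OR women vs men = {crude_or:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because this estimate ignores age, education and the other predictors, it can differ from the adjusted estimates of the multivariable models.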
Segment missingness
In the HSU segment, 8140 participants (81.4%) answered all questions, and 1172 had partially missing values (11.7%), while for 688 participants (6.9%), the whole segment was missing. The whole PS segment was complete for 6981 participants (69.8%), 2377 (23.8%) had partially missing values and 642 (6.4%) were completely missing.
A large aOR for a segment to be totally missing compared with no missingness was observed for participants with a foreign mother language (aOR HSU segment: 2.49 (95% CI 1.82 to 3.40); aOR PS segment: 2.63 (95% CI 1.88 to 3.66); figure 1). For partial missingness, a strong predictor compared with no missingness was older age (aOR age 65–78 years vs 45–54 years HSU segment: 2.03 (95% CI 1.64 to 2.50); aOR PS segment: 2.32 (95% CI 1.99 to 2.71)).
Figure 1. Multivariable multinomial logistic regression analysis for non-response of the two segments health service utilisation (HSU) and psychosocial scales (PS). Sample size HSU model=8008. Sample size PS model=8008.
All sociodemographic characteristics were associated with the missingness of both the HSU segment and the PS segment. Male participants less often refused to answer questions compared with women, while for education, a gradient was apparent with no/low education showing the highest odds followed by medium education, while highly educated participants had the lowest odds of a segment to be totally/partially missing. Cognitive impairment was associated with higher odds of a segment to be either partially or totally missing. Regarding the data collection method, digital questionnaires were less often partially missing compared with paper-pencil versions. For a segment to be totally missing, no differences between digital and paper-pencil questionnaires were observed.
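Reported aORs and their intervals can be translated back to the log-odds scale on which logistic regression coefficients are estimated. The sketch below assumes that the reported 95% CIs are Wald-type intervals, which the source does not state explicitly; the function name is hypothetical. It uses the aOR for total missingness of the HSU segment among participants with a foreign mother language (2.49, 95% CI 1.82 to 3.40).

```python
import math


def log_scale_summary(odds_ratio, ci_lower, ci_upper, z=1.96):
    """Recover the log odds ratio, its approximate standard error and the
    midpoint of the CI on the log scale from a reported OR and 95% CI."""
    log_or = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    log_midpoint = (math.log(ci_upper) + math.log(ci_lower)) / 2
    return log_or, se, log_midpoint


# Reported aOR for total missingness of the HSU segment, foreign mother language.
log_or, se, log_midpoint = log_scale_summary(2.49, 1.82, 3.40)
print(f"log OR = {log_or:.3f}, SE = {se:.3f}, "
      f"CI midpoint on log scale = {log_midpoint:.3f}")
```

If the interval is of the Wald type, the log OR and the midpoint of the log-transformed interval agree up to rounding, which serves as a simple plausibility check for reported estimates.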
Item missingness
The missingness of the sensitive items regarding income and sexual partners was associated with having a foreign mother language, being female and having a low/medium education (figure 2). Age was more strongly associated with the missingness of the item number of sexual partners (aOR 65–78 years vs 45–54 years: 1.76 (95% CI 1.46 to 2.13); 55–64 years vs 45–54 years: 1.27 (95% CI 1.04 to 1.55)) than with the item income (aOR 65–78 years vs 45–54 years: 1.24 (95% CI 1.08 to 1.43); 55–64 years vs 45–54 years: 0.99 (95% CI 0.85 to 1.14)). In contrast to the HSU and PS segments, digital data collection was associated with a higher missingness of both sensitive items.
Figure 2. Multivariable binary logistic regression analysis for non-response to the sensitive items number of sexual partners and income. Sample size for the number of sexual partners regression model=6375. Sample size for the income regression model=8008.
In the sensitivity analysis, in which we included the reasons for missingness, most predictor variables showed similar associations to items not being present due to either refusal or lacking knowledge (online supplemental tables 1 and 2). However, men had higher odds of not knowing their number of sexual partners compared with women (aOR: 1.38 (95% CI 1.03 to 1.86)), while this association was reversed for a refusal to answer this item (aOR: 0.60 (95% CI 0.51 to 0.70)).
Predictions at unit, segment and item level
The predictive capabilities of the covariates were similar at unit, segment and item level except for the item income (AUC=0.59 (95% CI 0.58 to 0.61)), where the discrimination was poorer compared with the item number of sexual partners as well as the segment and unit missingness (online supplemental table 3). The predicted probabilities for observed combinations of predictor variables varied from 0.14 to 0.47 for the missingness of the item income, while the item number of sexual partners had a wider range from 0.06 to 0.54.
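The AUC used for this comparison can be read as the probability that a randomly chosen participant with a missing value receives a higher predicted probability of missingness than a randomly chosen participant without one. The following is a minimal sketch of this pairwise definition; the scores and labels are hypothetical and the function is not the routine used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve as the share of (missing, observed) pairs in
    which the case with a missing value has the higher predicted probability.
    Ties count as half. labels: 1 = value missing, 0 = value present."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of missingness for four participants.
example = auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])
print(example)
```

A value of 0.5 corresponds to no discrimination, so an AUC of 0.59 for the item income indicates that the covariates separate respondents from non-respondents only slightly better than chance.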
Discussion
In the present study, we investigated the data quality concept of data completeness at unit, segment and item level in the baseline assessment of a large prospective cohort study (HCHS) with a population representative sampling scheme.20 Only 33% of the participants provided complete data at unit level. At item level, the data completeness ranged from 90% to 96% for the sociodemographic characteristics and the measurement of cognitive impairment, while for the sensitive items, it was only 72% for income and 63% for the number of sexual partners. Data completeness depended on sociodemographic characteristics, cognitive impairment and the mode of data collection. Similar associations were found at unit, segment and item level. At all levels, missing values were positively associated with higher age, decreasing education, female sex, a foreign mother language and cognitive impairment. For the mode of data collection, the results differed between unit, segment and item level. Complete cases were more often recorded with digital data collection, and at the segment level, a partial missingness was also less frequently observed for the digital mode. However, at the item level, these associations were reversed. For both sensitive items, digital data collection was associated with a higher number of missing values.
The associations between sociodemographic characteristics and missingness were comparable to previous studies.9 11 13 Female sex and higher age were associated with increased missing values. By incorporating the reasons for missingness into the sensitivity analyses, it became apparent that male participants had lower odds of refusing to answer the question regarding sexual partners, while their odds of not knowing the answer were higher. In contrast, the reasons for missingness regarding the item income did not change the interpretation of predictors substantially. For the other sociodemographic characteristics, the hypothesised relationships of higher odds of missingness with decreasing education13 14 and a foreign mother language11 were also found in our study. For religious affiliation, we did not find any associations. Our results for the association between the mode of data collection and missingness depended on the level of analysis. At item level, our results differed from previous studies, where digital data collection was associated with fewer missing values compared with paper-pencil questionnaires,17 18 39 while in our study, digital data collection was associated with more missing values for the items income and number of sexual partners. This may be explained by the fact that participants filled out the questionnaires in a room where other participants may have been present. Therefore, this digital data collection must be distinguished from a situation with greater privacy. Additionally, our study population had a mean age of 62 years and it can be assumed that the digital affinity is higher in younger populations.
A strength of our study was the opportunity to investigate data completeness in a population-based study with a large number of questionnaires. We were also able to examine data completeness at unit, segment and item level, a distinction that has not been made in previous studies. At item level, we focused on two sensitive items: the number of sexual partners and income. However, certain limitations apply to our study. The extensive medical examinations required that the completion of the questionnaires was spread over three time points, that is, before, during and after the study visit. As the sequence of the questions was not randomised in the paper-pencil questionnaire, items that were collected comparatively late had a higher probability of being missing due to tiredness or termination of a questionnaire. The generalisability of our findings is limited to the age range of 45–74 years and to participants with sufficient German language skills. Furthermore, the selection of HSU and PS segments mainly reflects the different research interests of the authors in this multitopic survey.
Our results highlight that data are often not MCAR, but depend on the values of other variables, which might, for example, lead to biased prevalence estimates in descriptive epidemiology or biased effect estimates in analytical epidemiology. Hence, the question about adequate methods for handling missing values in statistical analyses arises. Among several possible methods, multiple imputation (MI) has been proposed and applied frequently as a flexible and practical approach.40 Therefore, we focus on the distinction between MI and a complete case analysis (CCA). In terms of bias, MI is in many cases advantageous over a CCA, when the data are not MCAR.6 14 Additionally, MI is often more efficient as observations with missing values remain in the analysis, while they are excluded in a CCA.6 Whereas we can identify whether the data are MCAR or not, it is not possible to distinguish between the mechanism being MAR or MNAR based on the data alone.41 Causal diagrams may be used to guide the decision which of these two assumptions is more plausible.41 42 While we were investigating intrinsic data quality, that is, data quality without a clear research question, causal diagrams depict the relationships between exposure, covariates, outcome and an indicator for missingness. Generally, MI is to be preferred over a CCA in the following cases6 41 43:
The value of the outcome influences the probability of being a complete case.
Auxiliary variables that predict the missing values are available.
Relatively many cases have only a few missing values.
The missing values occur in the covariates and not only in the exposure or outcome.
However, deficiencies in the reporting of MI analyses have been identified, such as missing details about the imputation procedure, missing comparisons of the distributions of non-imputed and imputed data and missing sensitivity analyses.44
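In an MI analysis, the model of interest is fitted separately in each imputed dataset and the results are then combined with Rubin's rules. The sketch below shows only this pooling step; the three estimates and variances are hypothetical numbers, and no imputation was carried out for the present study.

```python
def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = sum(estimates) / m
    # Average within-imputation variance.
    within = sum(variances) / m
    # Between-imputation variance of the point estimates.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical log odds ratios and variances from three imputed datasets.
pooled, total = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(f"pooled estimate = {pooled:.2f}, pooled SE = {total ** 0.5:.3f}")
```

The between-imputation component reflects the additional uncertainty caused by the missing values, so the pooled standard error is larger than the standard error from any single imputed dataset in this example.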
In summary, our results show that the MCAR assumption did not hold in a large population-based cohort study at unit, segment and item level. Because the number of participants with partially missing data was large, imputation methods such as MI seem useful in many cases to avoid biased results and increase the efficiency of the analysis. Yet, the prevention of missing data would be even better. Thus, considerations of reaching the highest possible level of data completeness should be implemented during the planning and conduct phase of the study and based on the identified associations between the characteristics of the participants, the mode of data collection and the missingness.
Acknowledgements
The authors wish to acknowledge the participants of the Hamburg City Health Study, the staff of the Epidemiological Study Centre, the cooperation partners, the patrons and the Deanery from the University Medical Centre Hamburg-Eppendorf. Founding Board: Adam, Gerhard; Blankenberg, Stefan (speaker); Koch-Gromus, Uwe; Gerloff, Christian; Jagodzinski, Annika (assessor). List of Investigators: Adam, Gerhard; Aarabi, Ghazal; Augustin, Matthias; Behrendt, Christian; Beikler, Thomas; Betz, Christian; Blankenberg, Stefan; Bokemeyer, Carsten; Brassen, Stefanie; Brekenfeld, Caspar; Briken, Peer; Busch, Chia-Jung; Büchel, Christian; Debus, Eike Sebastian; Fiehler, Jens; Gallinat, Jürgen; Gellißen, Simone; Gerloff, Christian; Girdauskas, Evaldas; Gosau, Martin; Hanning, Uta; Härter, Martin; Harth, Volker; Heydecke, Guido; Huber, Tobias; Jagodzinski, Annika; Johansen, Christoffer; Koch-Gromus, Uwe; Konnopka, Alexander; König, Hans-Helmut; Kromer, Robert; Kubisch, Christian; Kühn, Simone; Löwe, Bernd; Lund, Gunnar; Meyer, Christian; Nienhaus, Albert; Pantel, Klaus; Püschel, Klaus; Reichenspurner, Hermann; Sauter, Guido; Scherer, Martin; Schiffner, Ulrich; Schnabel, Renate; Schulz, Holger; Smeets, Ralf; Spitzer, Martin S; Terschüren, Claudia; Thomalla, Götz; von dem Knesebeck, Olaf; Waschki, Benjamin; Wenzel, Jan-Peer; Wegscheider, Karl; Zeller, Tanja; Zyriax, Birgit-Christiane. Steering Board: Augustin, Matthias; Blankenberg, Stefan; Gallinat, Jürgen; Gerloff, Christian; Härter, Martin; Jagodzinski, Annika; Johansen, Christoffer; Koch-Gromus, Uwe; Sauter, Guido; Zeller, Tanja; Wegscheider, Karl; Betz, Christian/ Heydecke, Guido/ Gosau, Martin. 
Research consortium: Aarabi, Ghazal; Andrees, Valerie; Behrendt, Christian; Brassen, Stefanie; Brekenfeld, Caspar; Brünahl, Christian; Busch, Chia-Jung; Freitag, Janina; Gallinat, Jürgen; Gellißen, Susanne; Girdauskas, Evaldas; Heidemann, Christoph; Hussein, Yassin; Klein, Verena; Kofahl, Christopher; Kohlmann, Sebastian; Konnopka, Alexander; Kühn, Simone; Lühmann, Dagmar; Lund, Gunnar; Nagel, Lina; Magnussen, Christina; Meyer, Christian; Petersen, Elina; Scherschel, Katharina; Schiffner, Ulrich; Schnabel, Renate; Schulz, Holger; Seedorf, Udo; Smeets, Ralf; Terschüren, Claudia; Thomalla, Götz; Waschki, Benjamin; Zeller, Tanja; Zyriax, Birgit-Christiane.
We acknowledge financial support from the Open Access Publication Fund of UKE - Universitätsklinikum Hamburg-Eppendorf.
Footnotes
Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Prepub: Prepublication history and additional supplemental material for this paper are available online. To view these files, please visit the journal online (https://doi.org/10.1136/bmjopen-2025-103154).
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient consent for publication: Not applicable.
Ethics approval: This study involves human participants and was approved by State of Hamburg Chamber of Medical Practitioners (PV5131). Participants gave informed consent to participate in the study before taking part.
Patient and public involvement: Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Data availability statement
Data are available upon reasonable request.
References
- 1. Stausberg J, Bauer U, Nasseh D, et al. Indicators of data quality: review and requirements from the perspective of networked medical research. GMS Med Inform Biom Epidemiol. 2019;15. doi: 10.3205/mibe000199.
- 2. Hassenstein MJ, Vanella P. Data Quality—Concepts and Problems. Encyclopedia. 2022;2:498–510. doi: 10.3390/encyclopedia2010032.
- 3. Schmidt CO, Richter A, Enzenbach C, et al. Assessment of a data quality guideline by representatives of German epidemiologic cohort studies. GMS Medizinische Informatik, Biometrie Und Epidemiologie. 2019;15:Doc09. doi: 10.3205/MIBE000203.
- 4. Schmidt CO, Struckmann S, Enzenbach C, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21:63. doi: 10.1186/s12874-021-01252-7.
- 5. Wirtz M. Über das Problem fehlender Werte: Wie der Einfluss fehlender Informationen auf Analyseergebnisse entdeckt und reduziert werden kann. Rehabilitation. 2004;43:109–15. doi: 10.1055/s-2003-814839.
- 6. Lee KJ, Tilling KM, Cornish RP, et al. Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. J Clin Epidemiol. 2021;134:79–88. doi: 10.1016/j.jclinepi.2021.01.008.
- 7. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92. doi: 10.1093/biomet/63.3.581.
- 8. Van Buuren S. Flexible imputation of missing data. CRC Press; 2018.
- 9. Kutschar P, Weichbold M, Osterbrink J. Effects of age and cognitive function on data quality of standardized surveys in nursing home populations. BMC Geriatr. 2019;19:244. doi: 10.1186/s12877-019-1258-0.
- 10. Hoque DME, Earnest A, Ruseckaite R, et al. A randomised controlled trial comparing completeness of responses of three methods of collecting patient-reported outcome measures in men diagnosed with prostate cancer. Qual Life Res. 2019;28:687–94. doi: 10.1007/s11136-018-2061-7.
- 11. Coste J, Quinquis L, Audureau E, et al. Non response, incomplete and inconsistent responses to self-administered health-related quality of life measures in the general population: patterns, determinants and impact on the validity of estimates - a population-based study in France using the MOS SF-36. Health Qual Life Outcomes. 2013;11:44. doi: 10.1186/1477-7525-11-44.
- 12. Barnett AG, McElwee P, Nathan A, et al. Identifying patterns of item missing survey data using latent groups: an observational study. BMJ Open. 2017;7:e017284. doi: 10.1136/bmjopen-2017-017284.
- 13.Slymen DJ, Drew JA, Wright BL, et al. Item non-response to lifestyle assessment in an elderly cohort. Int J Epidemiol. 1994;23:583–91. doi: 10.1093/ije/23.3.583. [DOI] [PubMed] [Google Scholar]
- 14.Tsiampalis T, Panagiotakos DB. Missing-data analysis: socio- demographic, clinical and lifestyle determinants of low response rate on self- reported psychological and nutrition related multi- item instruments in the context of the ATTICA epidemiological study. BMC Med Res Methodol. 2020;20:148. doi: 10.1186/s12874-020-01038-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Porter JR, Ecklund EH. Missing Data in Sociological Research: An Overview of Recent Trends and an Illustration for Controversial Questions, Active Nonrespondents and Targeted Samples. Am Soc . 2012;43:448–68. doi: 10.1007/s12108-012-9161-6. [DOI] [Google Scholar]
- 16.Tuten TL, Urban DJ, Bosnjak M. Internet surveys and data quality: A review. Online Social Sciences. 2002;1:7–26. [Google Scholar]
- 17.Schilling R, Hoebel J, Müters S, et al. Pilotstudie zur durchführung von mixed-mode-gesundheitsbefragungen in der erwachsenenbevölkerung (projektstudie GEDA 2.0) Robert Koch-Institut; 2015. [Google Scholar]
- 18.Kays K, Gathercoal K, Buhrow W. Does survey format influence self-disclosure on sensitive question items? Comput Human Behav. 2012;28:251–6. doi: 10.1016/j.chb.2011.09.007. [DOI] [Google Scholar]
- 19.Krumpal I. Determinants of social desirability bias in sensitive surveys: a literature review. Qual Quant. 2013;47:2025–47. doi: 10.1007/s11135-011-9640-9. [DOI] [Google Scholar]
- 20.Jagodzinski A, Johansen C, Koch-Gromus U, et al. Rationale and Design of the Hamburg City Health Study. Eur J Epidemiol. 2020;35:169–81. doi: 10.1007/s10654-019-00577-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.University Medical Center Hamburg Hamburg city health study. http://hchs.hamburg/ n.d. Available.
- 22.Folstein MF, Folstein SE, McHugh PR. “Mini-mental state”. J Psychiatr Res. 1975;12:189–98. doi: 10.1016/0022-3956(75)90026-6. [DOI] [PubMed] [Google Scholar]
- 23.Kukull WA, Larson EB, Teri L, et al. The Mini-Mental State Examination score and the clinical diagnosis of dementia. J Clin Epidemiol. 1994;47:1061–7. doi: 10.1016/0895-4356(94)90122-8. [DOI] [PubMed] [Google Scholar]
- 24.Hansen H, Schäfer I, Porzelt S, et al. Regional and patient-related factors influencing the willingness to use general practitioners as coordinators of the treatment in northern Germany - results of a cross-sectional observational study. BMC Fam Pract. 2020;21:110. doi: 10.1186/s12875-020-01180-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Gößwald A, Lange M, Kamtsiuris P, et al. DEGS: Studie zur Gesundheit Erwachsener in Deutschland. Bundesweite Quer- und Längsschnittstudie im Rahmen des Gesundheitsmonitorings des Robert Koch-Instituts. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz. 2012;55:775–80. doi: 10.1007/s00103-012-1498-z. [DOI] [PubMed] [Google Scholar]
- 26.Beierlein V, Morfeld M, Bergelt C, et al. Messung der gesundheitsbezogenen Lebensqualität mit dem SF-8. Diagnostica . 2012;58:145–53. doi: 10.1026/0012-1924/a000068. [DOI] [Google Scholar]
- 27.Ware JE, Kosinski M, Dewey JE, et al. How to score and interpret single-item health status measures: a manual for users of the SF-8 health survey, vol 15. Lincoln, RI: Quality-Metric Incorporated; 2001. [Google Scholar]
- 28.Hinz A, Kohlmann T, Stöbel-Richter Y, et al. The quality of life questionnaire EQ-5D-5L: psychometric properties and normative values for the general German population. Qual Life Res. 2014;23:443–7. doi: 10.1007/s11136-013-0498-2. [DOI] [PubMed] [Google Scholar]
- 29.Ludwig K, Graf von der Schulenburg J-M, Greiner W. German Value Set for the EQ-5D-5L. Pharmacoeconomics. 2018;36:663–74. doi: 10.1007/s40273-018-0615-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Kroenke K, Spitzer RL, Williams JBW. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med . 2001;16:606–13. doi: 10.1046/j.1525-1497.2001.016009606.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Löwe B, Gräfe K, Zipfel S, et al. Diagnosing ICD-10 depressive episodes: superior criterion validity of the Patient Health Questionnaire. Psychother Psychosom. 2004;73:386–90. doi: 10.1159/000080393. [DOI] [PubMed] [Google Scholar]
- 32.Löwe B, Unützer J, Callahan CM, et al. Monitoring depression treatment outcomes with the patient health questionnaire-9. Med Care. 2004;42:1194–201. doi: 10.1097/00005650-200412000-00006. [DOI] [PubMed] [Google Scholar]
- 33.Löwe B, Decker O, Müller S, et al. Validation and standardization of the Generalized Anxiety Disorder Screener (GAD-7) in the general population. Med Care. 2008;46:266–74. doi: 10.1097/MLR.0b013e318160d093. [DOI] [PubMed] [Google Scholar]
- 34.Gierk B, Kohlmann S, Toussaint A, et al. Assessing somatic symptom burden: a psychometric comparison of the patient health questionnaire-15 (PHQ-15) and the somatic symptom scale-8 (SSS-8) J Psychosom Res. 2015;78:352–5. doi: 10.1016/j.jpsychores.2014.11.006. [DOI] [PubMed] [Google Scholar]
- 35.Kroenke K, Spitzer RL, Williams JBW. The PHQ-15: validity of a new measure for evaluating the severity of somatic symptoms. Psychosom Med. 2002;64:258–66. doi: 10.1097/00006842-200203000-00008. [DOI] [PubMed] [Google Scholar]
- 36.Löwe B. Hamburg resilience scale (HRES-6)
- 37.Wingenfeld K, Schäfer I, Terfehr K, et al. Reliable, valide und ökonomische Erfassung früher Traumatisierung: Erste psychometrische Charakterisierung der deutschen Version des Adverse Childhood Experiences Questionnaire (ACE) Psychother Psych Med . 2011;61:e10–4. doi: 10.1055/s-0030-1263161. [DOI] [PubMed] [Google Scholar]
- 38.Richter D, Brähler E, Strauß B, editors. Diagnostische Verfahren in Der Sexualwissenschaft. 2014. Deutsche version des "adverse childhood experiences questionnaire. [Google Scholar]
- 39.Tourangeau R, Smith TW. Asking Sensitive Questions: The Impact of Data Collection Mode, Question Format, and Question Context. Public Opin Q. 1996;60:275. doi: 10.1086/297751. [DOI] [Google Scholar]
- 40.Carpenter JR, Smuk M. Missing data: A statistical framework for practice. Biometrical J. 2021;63:915–47. doi: 10.1002/bimj.202000196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Hughes RA, Heron J, Sterne JAC, et al. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019;48:1294–304. doi: 10.1093/ije/dyz032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Daniel RM, Kenward MG, Cousens SN, et al. Using causal diagrams to guide analysis in missing data problems. Stat Methods Med Res. 2012;21:243–56. doi: 10.1177/0962280210394469. [DOI] [PubMed] [Google Scholar]
- 43.Lee KJ, Carlin JB. Recovery of information from multiple imputation: a simulation study. Emerg Themes Epidemiol. 2012;9:3. doi: 10.1186/1742-7622-9-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Hayati Rezvan P, Lee KJ, Simpson JA. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol. 2015;15:30. doi: 10.1186/s12874-015-0022-1. [DOI] [PMC free article] [PubMed] [Google Scholar]


