PLOS ONE. 2023 May 18;18(5):e0285848. doi: 10.1371/journal.pone.0285848

Importance of missingness in baseline variables: A case study of the All of Us Research Program

Robert M Cronin 1,‡,*, Xiaoke Feng 2, Lina Sulieman 3, Brandy Mapes 4, Shawn Garbett 2, Ashley Able 4, Ryan Hale 4, Mick P Couper 5,6,7, Heather Sansbury 8, Brian K Ahmedani 9, Qingxia Chen 2,3,*
Editor: Bijan Najafi
PMCID: PMC10194909  PMID: 37200348

Abstract

Objective

The All of Us Research Program collects data from multiple information sources, including health surveys, to build a national longitudinal research repository that researchers can use to advance precision medicine. Missing survey responses pose challenges to study conclusions. We describe missingness in All of Us baseline surveys.

Study design and setting

We extracted survey responses from May 31, 2017, to September 30, 2020. Missing percentages for groups historically underrepresented in biomedical research were compared to represented groups. Associations of missing percentages with age, health literacy score, and survey completion date were evaluated. We used negative binomial regression to evaluate the effect of participant characteristics on the number of missed questions out of the total eligible questions for each participant.

Results

The dataset analyzed contained data for 334,183 participants who submitted at least one baseline survey. Almost all (97.0%) of the participants completed all baseline surveys, and only 541 (0.2%) participants skipped all questions in at least one of the baseline surveys. The median skip rate was 5.0% of the questions, with an interquartile range (IQR) of 2.5% to 7.9%. Historically underrepresented groups were associated with higher missingness (incidence rate ratio (IRR) [95% CI]: 1.26 [1.25, 1.27] for Black/African American compared to White). Missing percentages were similar by survey completion date, participant age, and health literacy score. Skipping specific questions was associated with higher missingness (IRRs [95% CI]: 1.39 [1.38, 1.40] for skipping income, 1.92 [1.89, 1.95] for skipping education, 2.19 [2.09, 2.30] for skipping sexual and gender questions).

Conclusion

Surveys in the All of Us Research Program will form an essential component of the data researchers can use to perform their analyses. Missingness was low in All of Us baseline surveys, but group differences exist. Additional statistical methods and careful analysis of surveys could help mitigate challenges to the validity of conclusions.

Introduction

Understanding the patterns of missing data is vital for any scientific research project. If data are incomplete, there are potential threats to the validity of conclusions that use those data [1]. Some of the most critical threats to validity include a loss of statistical power, data not missing completely at random, and sensitivity of results to how missing data are handled in the analysis. A loss of statistical power can occur in complete case analyses where much of the target population is removed due to missing key variables. Excluding participants from analyses because of missing data can undermine the original goals of the study. If data are not missing completely at random, meaning that missing cases differ from non-missing cases on key outcomes or covariates, conclusions could be biased. This point is crucial in a large cohort study where significant effort is expended to recruit and retain diverse populations. Finally, different analyses could yield different results depending on which variables are included and what strategies are used to account for missing data on those variables [2].

Health surveys are traditional methods to collect data from participants in biomedical research. Since participants can choose which questions they respond to in health surveys, the surveys may be especially susceptible to missing data [3]. Many articles describe the potential biases of testing hypotheses with datasets having critical missingness [4]. In recent years, other articles have shown a decline in survey response rates, threatening the validity of conclusions drawn from these studies [5–9]. There are multiple ways to handle missing survey data in general [10], including multiple imputation [11], inverse probability weighting [12, 13], full likelihood [14], fully Bayesian [15], or hybrid methods [16]. In surveys, missing data can occur when a subpopulation is not included in the survey’s sampling frame (noncoverage), when a sampled unit does not participate in the survey (total nonresponse), or when a responding sampled element fails to provide acceptable responses to one or more of the survey items (item nonresponse) [4]. Various methods have been developed to compensate for missing survey data and mitigate its effect on estimates. Weighting adjustments are often used to compensate for noncoverage and total nonresponse, while imputation methods that assign values for missing responses compensate for item nonresponse. To make the best use of these methods, it is essential to first understand the levels and patterns of missingness within health surveys.
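The two compensation strategies named above can be sketched generically. The adjustment cells, response counts, and item values below are illustrative toys, not the All of Us pipeline; Python is used here for brevity, although survey analyses of this kind are often done in R.

```python
from collections import Counter

def nonresponse_weights(sampled_cells, respondent_cells):
    """Weighting adjustment for total nonresponse: each respondent in
    adjustment cell c gets weight n_sampled(c) / n_responded(c), so
    respondents stand in for nonrespondents in the same cell."""
    n_sampled = Counter(sampled_cells)
    n_responded = Counter(respondent_cells)
    return {c: n_sampled[c] / n_responded[c] for c in n_responded}

def cell_mean_impute(values):
    """Imputation for item nonresponse: replace a skipped item (None)
    with the mean of the observed responses."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# 10 sampled units per cell; 5 responded in cell "a", 8 in cell "b"
weights = nonresponse_weights(["a"] * 10 + ["b"] * 10, ["a"] * 5 + ["b"] * 8)
```

In this toy example the respondents in cell "a" receive weight 10/5 = 2.0 and those in cell "b" weight 10/8 = 1.25, and `cell_mean_impute([1, None, 3])` fills the skipped item with the observed mean 2.0.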

The All of Us Research Program, hereafter referred to as All of Us, has set out to collect information from over 1 million participants of diverse backgrounds historically underrepresented in biomedical research to advance the science of precision medicine [17, 18]. The program collects data from participants through multiple sources, including electronic health records, digital health technology, biospecimens, and health surveys. These health surveys can augment and validate the information about participants from other sources, thereby helping researchers answer crucial biomedical research questions of precision medicine. Populations of diverse backgrounds that have been historically underrepresented in biomedical research may present additional missingness challenges in health surveys. It is anticipated that All of Us data will be heavily used by scientists worldwide [17]. Understanding the missingness in such an extensive program of participants usually underrepresented in biomedical research is of utmost importance.

All of Us created and launched seven surveys, three of which are available when a participant initially enrolls in the program; these are referred to here as baseline surveys. Participants will continue to receive surveys throughout the life of the program. The data from these surveys are currently available to researchers (https://www.researchallofus.org/); however, there is a gap in our understanding of missing data in these baseline surveys. By understanding what data are missing from the All of Us health surveys, why they are missing, and how to overcome missingness, researchers can understand the limitations of this data resource and best address their research questions with it.

The objective of this project was to use All of Us as a case study to demonstrate an approach to evaluating missingness in surveys and to identify characteristics associated with missingness in a large epidemiological cohort. In particular, we studied whether the demographic variables that define the historically underrepresented groups in biomedical research, enrollment date, and health literacy were associated with an increased risk of missingness in the remaining questions of the baseline surveys.

Methods

Overview

The initial three survey modules released at baseline were 1) The Basics, which covered basic demographic, socioeconomic, and health insurance questions; 2) Overall Health, which included the brief health literacy scale [19, 20], the overall health PROMIS scale [21], and questions important for collecting biospecimens, such as transplant and travel history; and 3) Lifestyle, which included questions about smoking, alcohol, and illicit drug use. The development of these surveys is described elsewhere [22]. These surveys contained branching logic, which ensured that specific questions, often referred to as “child” or follow-up questions, were presented to participants only when a relevant response was selected on a previous question. For example, if a participant had never had at least one drink of any kind of alcohol in their lifetime, they would not be asked how often and how much they drank. Our analyses counted as missing only questions that the participant saw and did not respond to; questions that the participant did not see were excluded. All of the potential questions and branching logic are available at: https://www.researchallofus.org/data-tools/survey-explorer/. Of the questions presented, a participant could skip any question and progress to the next one. Once a participant completed a survey, the data were sent to a raw data repository at the All of Us Data and Research Center at Vanderbilt University Medical Center. We extracted the survey responses from May 31, 2017, to September 30, 2020. All data presented were stripped of identifiable information. The Institutional Review Boards of the All of Us Research Program approved all study procedures, and informed consent was obtained from all participants.
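To make the branching-logic rule concrete, the toy schema below shows how a "child" question counts as eligible only when its parent's answer triggers it, and how missing items are then just the eligible-but-unanswered questions. The question ids and rules are hypothetical, not the actual All of Us survey definitions.

```python
# Hypothetical survey schema: each child question names a parent and the
# parent answer that makes it visible (None = always shown).
QUESTIONS = [
    {"id": "alcohol_ever", "parent": None,           "show_if": None},
    {"id": "alcohol_freq", "parent": "alcohol_ever", "show_if": "yes"},
    {"id": "smoke_ever",   "parent": None,           "show_if": None},
]

def eligible_questions(answers):
    """Return ids of questions the participant would actually see,
    applying the branching logic to their recorded answers."""
    seen = []
    for q in QUESTIONS:
        if q["parent"] is None or answers.get(q["parent"]) == q["show_if"]:
            seen.append(q["id"])
    return seen

def missing_items(answers):
    """Questions seen under branching logic but skipped (no answer)."""
    return [q for q in eligible_questions(answers) if q not in answers]
```

A participant who answered "no" to `alcohol_ever` never sees `alcohol_freq`, so skipping it does not count as missing; a participant who answered "yes" and then stopped has both `alcohol_freq` and `smoke_ever` counted as missing.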

Missingness analysis

Missingness can be evaluated at the level of an entire survey or of specific questions within a survey. Missing data can also be examined at the level of the participant, such as one participant not answering a set of questions, or at the level of the item, such as a set of participants skipping one question. In this manuscript, we evaluated missingness at the participant level because this allowed us to understand patterns of missingness by participant characteristics. We observed three types of missingness in this project: (a) no submission of an entire survey; (b) survey submission without answering any questions [23, 24]; and (c) item nonresponse, where some but not all questions were skipped within a survey. We defined item nonresponse as occurring when a participant saw a question but did not respond to it. Some questions also had explicit “Prefer not to answer” or “Don’t know” options. Participants who responded with one of these options were not counted as missing in the primary analysis but were analyzed in a sensitivity analysis described below.
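The three observed patterns can be written as a simple classifier. This is an illustrative sketch of the definitions above, not program code; the counts passed in are assumed to come from a branching-logic-aware tally like the one used in the study.

```python
def missingness_type(submitted, n_answered, n_eligible):
    """Classify one participant's survey into the three observed
    missingness patterns, or 'complete' if nothing was skipped."""
    if not submitted:
        return "no submission"            # (a) entire survey missing
    if n_answered == 0:
        return "skipped all questions"    # (b) submitted, nothing answered
    if n_answered < n_eligible:
        return "item nonresponse"         # (c) some eligible items skipped
    return "complete"
```

Note that an explicit "Prefer not to answer" counts as an answer here, matching the primary analysis; the sensitivity analysis would instead subtract those responses from `n_answered`.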

We reported the count and percentage of participants who did not submit each entire survey module or who skipped all of its questions. For participants who answered at least one question, we defined the missing percentage as the ratio of missing items to the number of corresponding branching-logic-based eligible questions. The demographic questions defining the groups historically underrepresented in biomedical research and the health literacy questions were treated as explanatory variables and excluded from the missing percentage calculation. The missing percentages for various underrepresented groups were compared to represented groups using Wilcoxon rank-sum or Kruskal-Wallis tests. Associations of missing percentages with age at enrollment, health literacy score, and enrollment date were evaluated using Spearman correlation coefficients. We performed a negative binomial regression to evaluate the impact of participant characteristics on a participant’s number of missed questions out of their eligible questions, allowing for overdispersion. The independent variables included race and ethnicity, age, educational attainment, household income, sexual and gender minority status, geography (non-urban versus urban), health literacy score, and enrollment since All of Us initiation, which we defined as the number of weeks from the start of All of Us (May 2017) to the participant’s enrollment date. The three continuous variables of age, health literacy score, and enrollment since All of Us initiation were modeled using five-knot natural cubic splines to allow for nonlinear associations. As race/ethnicity was associated with an increased risk of missingness, we also investigated whether the effect was moderated by education, age, and sex/gender by including two-way interaction terms in the model.
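The per-participant skip rate and the Spearman correlation used for the continuous associations can be sketched in pure Python. This is illustrative only (the study's analyses were done in R); the tie-handling uses standard average ranks.

```python
def missing_percentage(n_missing, n_eligible):
    """Participant-level skip rate: missed items over
    branching-logic-based eligible items, as a percentage."""
    return 100.0 * n_missing / n_eligible

def _ranks(xs):
    """Average ranks (1-based), assigning tied values their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over the tie group
        avg = (i + j) / 2 + 1           # mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, `missing_percentage(5, 100)` gives a 5.0% skip rate, and a perfectly monotone pair of variables has a Spearman coefficient of 1.0.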

A health literacy score was defined as the sum of the three individual questions of the brief health literacy scale in the Overall Health survey. For participants missing one or more of the three individual questions, multiple imputation was applied to the individual questions using the mice package in R [25]. Six complete datasets were generated from the multiple imputation models [26] and analyzed using the negative binomial regression method described above. The health literacy scale had about 6% missingness; we therefore used six imputations to allow for a <1% efficiency loss [26, 27]. Estimates and standard errors for regression coefficients across the six datasets were combined into single estimates by averaging, with standard errors computed using Rubin’s rules [27]. We performed the following additional sensitivity analyses: A) a negative binomial regression with complete case analysis, which removed participants with missing values on any variable included in the model; B) multiple imputation applied to the total health literacy score instead of the individual questions, repeating the same analysis with this imputed score; C) for participants who missed only one health literacy question, imputing the total score using the average of the other two and repeating the two analyses described above; and D) counting “Prefer not to answer” or “Don’t know” as missing when defining the missing percentage and repeating the primary analysis, with multiple imputation applied to the individual health literacy questions and negative binomial regression performed on the imputed datasets.
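Rubin's rules for pooling across imputed datasets, and the relative-efficiency argument behind the choice of six imputations, can be written out directly. This is a generic sketch in Python; the study used the mice package in R with m = 6.

```python
def pool_rubin(estimates, variances):
    """Rubin's rules: pool one regression coefficient across m imputed
    datasets. `estimates` are per-imputation point estimates; `variances`
    are their squared standard errors. Returns (pooled estimate, pooled SE)."""
    m = len(estimates)
    qbar = sum(estimates) / m                               # pooled point estimate
    ubar = sum(variances) / m                               # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = ubar + (1 + 1 / m) * b                      # total variance
    return qbar, total_var ** 0.5

def mi_relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to infinitely many, for
    fraction of missing information gamma: 1 / (1 + gamma / m)."""
    return 1 / (1 + gamma / m)
```

With roughly 6% missing information and m = 6 imputations, `mi_relative_efficiency(0.06, 6)` is about 0.99, i.e. under 1% efficiency loss, matching the rationale in the text. Note that the pooled SE exceeds the average within-imputation SE whenever the per-imputation estimates disagree.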

All analyses were performed using the R Programming Language 3.3.0 [25]. We considered P-values less than 0.05 statistically significant. Given the large sample size, 95% confidence intervals (CI) were reported.

Results

Descriptive analysis

The program had 334,183 participants who submitted at least The Basics baseline survey (the first baseline survey available for completion) between May 31, 2017, and September 30, 2020. Among those 334,183 participants, all three baseline surveys (The Basics, Overall Health, or Lifestyle) were completed by 323,693 (97.0%) participants, and only 541 (0.2%) participants skipped all questions in at least one of the three surveys. A subset of participants, 36,077 (10.8%), answered every eligible question. A vast majority, 250,304 (74.9%), skipped fewer than 10% of the questions, while very few, 522 (0.2%), skipped more than half of the questions. The median skip rate was 5.0% of the questions, with an interquartile range (IQR) of 2.5% to 7.9%.

Different population characteristics had different levels of missingness

Compared to the mean, participants who skipped the educational attainment, race and ethnicity, household income, or sexual and gender questions skipped significantly more of the remaining questions (Fig 1a). For example, the 5,741 participants (1.7%) who did not answer the race/ethnicity question skipped 13% of the remaining questions of the baseline surveys, while the average skip rate was about 5%. Participants from specific populations historically underrepresented in biomedical research skipped more than the average, including those with less than a high school education and Black or African American and Latino or Spanish participants. Other underrepresented groups, such as sexual and gender minorities, those in rural geographies, and older participants, had slightly lower missingness than the mean.

Fig 1. Missingness by groups underrepresented in biomedical research, consent date, age, and health literacy scores.

Fig 1

(A) Missing percentage mean (B) Consent date (C) Age (D) Health literacy score.

The participant missing percentage was similar over time, participant ages, and participant health literacy score

The participant missing percentage has been relatively stable in All of Us thus far (Fig 1b). There were slight deviations before the national launch of the program in May 2018 and after the start of the COVID-19 pandemic in March 2020. The participant missing percentage was almost constant across ages (Fig 1c). Between ages 68 and 78, the missing percentage decreased slightly, then stabilized after age 78. The health literacy score ranged between 3 and 15, with higher scores indicating higher subjective health literacy. In Fig 1d, we only included participants who answered all three health literacy questions. The participant missing percentage changed little as the health literacy score increased.

Multivariable analysis

In the negative binomial regression analysis (Fig 2a–2d), holding the other variables in the model constant, participants who skipped the household income, race and ethnicity, educational attainment, or sexual and gender questions were more likely to have a higher overall missingness rate than those who did not skip those questions (incidence rate ratios (IRRs) [95% CI]: 1.39 [1.38, 1.40] for skipping income, 1.69 [1.66, 1.72] for skipping race and ethnicity, 1.92 [1.89, 1.95] for skipping education, 2.19 [2.09, 2.30] for skipping sexual and gender questions). Participants from rural geographies had a lower incidence rate for missing questions than those from urban geographies (IRR: 0.93, 95% CI: [0.92, 0.94]). Except for geography, underrepresented groups had higher incidence rates than represented groups, especially racial and ethnic minority groups (IRRs [95% CI]: 1.26 [1.25, 1.27] for Black/African American; 1.15 [1.14, 1.16] for Hispanic/Latino; 1.22 [1.21, 1.23] for other race or a combination of two or more races, all compared to White) and those with lower educational attainment (IRRs [95% CI]: 1.14 [1.13, 1.15] for less than high school; 1.11 [1.10, 1.12] for high school, all compared to a college degree). The IRR increased before age 35 and after 65 but decreased between 35 and 65 (Fig 2b). In addition, participants with health literacy scores between 6 and 10 had higher IRRs than those with higher or lower scores (Fig 2c). Participants who enrolled near the beginning of the All of Us program or enrolled more recently had lower IRRs than those enrolled in the middle period (Fig 2d). The results were similar in the sensitivity analyses (see the supplemental document, S1–S5 Figs). Some race/ethnicity interactions with sex/gender, age, and education were significant, demonstrating that race/ethnicity effects were moderated by other baseline variables (see the supplemental document, S6 Fig). However, the largest IRRs were among the groups with missingness in the baseline variables.

Fig 2. Negative binomial analysis of missingness.

Fig 2

(A) Incidence Rate Ratio (B) Age (Years) (C) Health literacy score (D) Enrollment since All of Us initiation (in weeks).

Discussion

For All of Us, the three baseline modules studied in this article were required for all participants and contained information that will be commonly used in All of Us research. In this manuscript, we described the completeness of these survey data and identified key baseline variables associated with the overall missingness of the remaining variables from the three baseline survey modules. The fact that those often not included in biomedical research have higher rates of missingness is a potential red flag for studies, such as All of Us, that make special efforts to include such populations. Missingness in these background variables of race, sex and gender identity, education, time of enrollment, and geography could contribute to non-random missingness in surveys. If these populations fail to answer other questions or drop out altogether, any complete case analysis will compound these losses and undermine the efforts to include such groups.

Very few participants answered only one of the first three surveys. The low level of nonresponse to these surveys is promising, and using all of the surveys for analyses appears to be a reasonable approach. Almost no participants who started a survey simply clicked through without responding to any questions. A vast majority (74.9%) skipped fewer than 10% of the questions, while very few (0.2%) skipped more than half. While this number is low, researchers will need to review whether their population of interest may have “completed” surveys yet lack usable data for a portion of that population. Removing these participants from analyses may be a reasonable approach given their rarity and large amount of item missingness. Because a large proportion of participants answered most questions, researchers should be able to pursue hypotheses with the data without concern for bias.

Participants with different characteristics had different levels of missingness. Participants who skipped a few of the baseline questions, such as race and ethnicity, sex and gender, educational attainment, and household income, were more likely to skip other questions; they had the highest missingness rates compared to participants who answered these questions. These indicators were also some of the strongest risk factors for missingness in our regression models. These data suggest that participants unwilling to provide critical demographic details are also less willing to answer other survey questions. Participants from certain historically underrepresented populations skipped more than the mean, including those with less than a high school education and those who identify as Black or African American or Latino or Spanish. Although the differences were slight, certain underrepresented groups had lower missingness than the mean, such as sexual and gender minorities and those residing in rural areas. Other characteristics, such as consent date, health literacy score, and age, did not differ in missingness rates. Understanding missingness in these sociodemographic questions is crucial for anticipating missingness in other questions.

Sensitivity analyses demonstrated results similar to the primary analysis. Multiple imputation gave results similar to complete case analysis and to average-score imputation for the brief health literacy scale. A vast majority of participants completed all three brief health literacy scale items. Our approach for this scale could be applied to other scales in the All of Us dataset, such as the PROMIS overall health scale, to mitigate biases in the data. Considering “don’t know” and “prefer not to answer” as missing values also did not alter our results. However, researchers need to be cautious because certain questions, such as income, could have higher rates of missingness if these options are considered missing.

This study had several limitations. First, this is a snapshot of the data as of September 2020. While missingness may change over time, we noted that it has not historically changed a large amount. Second, we did not review additional surveys completed after baseline; reviewing them is an area of future work to help understand the missingness of all survey data. Third, we did not review other data sources in the program, such as electronic health records. Other data sources within All of Us may help compensate for missingness in surveys. Finally, additional variables may be important for missingness, such as whether enrollment staff helped participants with questions and responses they may not have understood. Obtaining and evaluating these additional variables could help researchers understand the causes of missingness.

This analysis will help researchers of the All of Us data understand and assess data missingness. This manuscript serves as a complementary follow-up to the initial All of Us survey development manuscript [22], detailing quality assessment efforts routinely undertaken by the All of Us Data and Research Center, in collaboration with program partners, to understand the composition of the All of Us survey data and to provide recommendations to researchers interested in applying similar checks to their data. This work will be incorporated into the All of Us Researcher Workbench as featured notebooks and educational documentation that researchers can use to evaluate and understand missing data as more data flow into the program.

The analyses presented in this manuscript can be applied by researchers to the All of Us survey data and to survey data from other large epidemiological cohorts, such as the Million Veterans Program or UK Biobank, to help identify and account for potential biases. Another key message is that identifying “leading indicators” of missingness (i.e., the predictors) could help survey designers target strategies to reduce differential missingness. The low missing percentage we found in the three baseline survey modules of the All of Us cohort offers some reassurance for substantive research but also points to the importance of adjusting for the critical variables associated with overall missingness in substantive analyses. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK Biobank, Million Veterans Program).

Supporting information

S1 Fig. Negative binomial regression with complete cases.

(PNG)

S2 Fig. Negative binomial regression with multiple imputation on health literacy.

We applied multiple imputation on the total health literacy score. Then we repeated the negative binomial regression on the imputed dataset.

(PNG)

S3 Fig. Negative binomial regression with averaging of two scores to impute an overall score for health literacy.

If participants missed only one health literacy score question, we used the average of the two non-missing scores to impute the missing health literacy score. Then we applied negative binomial regression with the complete cases.

(PNG)

S4 Fig. Negative binomial regression with averaging of two scores and multiple imputation to impute an overall score for health literacy.

If participants missed only one health literacy score question, we used the average of the two non-missing scores to impute the missing health literacy score. Then we applied multiple imputation on the total health literacy score and repeated negative binomial regression on the imputed dataset.

(PNG)

S5 Fig. Negative binomial regression with “prefer not to answer” and “don’t know” as missing.

Multiple imputation was applied to health literacy as above, and the negative binomial regression method was applied on the imputed dataset.

(PNG)

S6 Fig. Negative binomial regression with interaction terms.

Race/ethnicity interactions with sex/gender, age, and education were added to the model; significant interaction terms are shown in the forest plot.

(PNG)

Acknowledgments

We wish to thank our participants who have joined All of Us and contributed to PPI, helped refine early materials, engaged in developing and evaluating PPI, and provided other ongoing feedback. We thank the countless other co-investigators and staff across all awardees and partners, without whom All of Us would not have achieved its current goals.

All of Us PPI Committee Members: James McClain, Brian Ahmedani, Rob Cronin, Michael Manganiello, Kathy Mazor, Heather Sansbury, Alvaro Alonso, Sarra Hedden, Randy Bloom, Mick Couper, Scott Sutherland

We also wish to thank All of Us Research Program Director Josh Denny and our partners Verily, Vibrent, Scripps, and Leidos.

“Precision Medicine Initiative, PMI, All of Us, the All of Us logo, and The Future of Health Begins with You are service marks of the US Department of Health and Human Services.”

Data Availability

Data are owned by a third party, the All of Us Research Program. The data underlying this article were provided by the All of Us Research Program by permission, which can be sought by scientists and the public alike. The Researcher Workbench is a cloud-based platform where registered researchers can access Registered and Controlled Tier data, including the data presented here. Researchers/citizen scientists must verify their identity and complete the All of Us Research Program data access process to access the Researcher Workbench and Registered Tier data. Once this process is completed, the data will be made available to all persons. More information on data access can be found in the All of Us Research Hub (https://www.researchallofus.org/), as is the option to register for access. The authors did not have any special access privileges to this data that other researchers would not have.

Funding Statement

This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). Additional support included the National Heart, Lung, and Blood Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Little RJ, Rubin DB. Statistical analysis with missing data: John Wiley & Sons; 2019.
  • 2.Stavseth MR, Clausen T, Røislien J. How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data. SAGE open medicine. 2019;7:2050312118822912. doi: 10.1177/2050312118822912 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.LAAKSONEN S. SURVEY METHODOLOGY AND MISSING DATA: Tools and Techniques for Practitioners: SPRINGER; 2019.
  • 4.Brick JM, Kalton G. Handling missing data in survey research. Statistical methods in medical research. 1996;5(3):215–38. doi: 10.1177/096228029600500302 [DOI] [PubMed] [Google Scholar]
  • 5.Czajka JL, Beyler A. Background paper declining response rates in federal surveys: Trends and implications. Mathematica policy research. 2016;1:1–86. [Google Scholar]
  • 6.Williams D, Brick JM. Trends in US face-to-face household survey nonresponse and level of effort. Journal of Survey Statistics and Methodology. 2018. [Google Scholar]
  • 7.Luiten A, Hox J, de Leeuw E. Survey nonresponse trends and fieldwork effort in the 21st century: Results of an international study across countries and surveys. Journal of Official Statistics. 2020;36(3):469–87. [Google Scholar]
  • 8.McQuillan G, Kruszon-Moran D, Di H, Schaar D, Lukacs S, Fakhouri T, et al., editors. Assessing consent for and response to health survey components in an era of falling response rates: National Health and Nutrition Examination Survey, 2011–2018. Survey Research Methods; 2021. [DOI] [PMC free article] [PubMed]
  • 9.Boyle J, Berman L, Dayton J, Iachan R, Jans M, ZuWallack R. Physical measures and biomarker collection in health surveys: Propensity to participate. Research in Social and Administrative Pharmacy. 2021;17(5):921–9. doi: 10.1016/j.sapharm.2020.07.025 [DOI] [PubMed] [Google Scholar]
  • 10.Ibrahim JG, Chen M-H, Lipsitz SR, Herring AH. Missing-data methods for generalized linear models: A comparative review. Journal of the American Statistical Association. 2005;100(469):332–46. [Google Scholar]
  • 11.Rubin DB. Multiple imputation after 18+ years. Journal of the American statistical Association. 1996;91(434):473–89. [Google Scholar]
  • 12.Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association. 1994;89(427):846–66. [Google Scholar]
  • 13.Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Statistical methods in medical research. 2013;22(3):278–95. doi: 10.1177/0962280210395740 [DOI] [PubMed] [Google Scholar]
  • 14.Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 1977;39(1):1–22. [Google Scholar]
  • 15.Ibrahim JG, Chen MH, Lipsitz SR. Bayesian methods for generalized linear models with covariates missing at random. Canadian Journal of Statistics. 2002;30(1):55–78. [Google Scholar]
  • 16.Seaman SR, White IR, Copas AJ, Li L. Combining multiple imputation and inverse-probability weighting. Biometrics. 2012;68(1):129–37. doi: 10.1111/j.1541-0420.2011.01666.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, Jenkins G, et al. The "All of Us" Research Program. N Engl J Med. 2019;381(7):668–76. Epub 2019/08/15. doi: 10.1056/NEJMsr1809937 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Mapes BM, Foster CS, Kusnoor SV, Epelbaum MI, AuYoung M, Jenkins G, et al. Diversity and Inclusion for the All of Us Research Program: A Scoping Review. PLoS One. 2020. [DOI] [PMC free article] [PubMed]
  • 19.Chew LD, Griffin JM, Partin MR, Noorbaloochi S, Grill JP, Snyder A, et al. Validation of screening questions for limited health literacy in a large VA outpatient population. Journal of General Internal Medicine. 2008;23(5):561–6. doi: 10.1007/s11606-008-0520-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Wallston KA, Cawthon C, McNaughton CD, Rothman RL, Osborn CY, Kripalani S. Psychometric properties of the brief health literacy screen in clinical practice. Journal of General Internal Medicine. 2014;29(1):119–26. doi: 10.1007/s11606-013-2568-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Hays RD, Bjorner JB, Revicki DA, Spritzer KL, Cella D. Development of physical and mental health summary scores from the patient-reported outcomes measurement information system (PROMIS) global items. Quality of Life Research. 2009;18(7):873–80. doi: 10.1007/s11136-009-9496-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Cronin RM, Jerome RN, Mapes B, Andrade R, Johnston R, Ayala J, et al. Development of the Initial Surveys for the All of Us Research Program. Epidemiology. 2019;30(4):597–608. Epub 2019/05/03. doi: 10.1097/EDE.0000000000001028 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Bosnjak M, Tuten TL. Classifying response behaviors in web-based surveys. Journal of Computer-Mediated Communication. 2001;6(3):JCMC636. [Google Scholar]
  • 24.Lugtig P. Panel attrition: separating stayers, fast attriters, gradual attriters, and lurkers. Sociological Methods & Research. 2014;43(4):699–723. [Google Scholar]
  • 25.Team RC. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2013. [Google Scholar]
  • 26.Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci. 2007;8(3):206–13. Epub 20070605. doi: 10.1007/s11121-007-0070-9 . [DOI] [PubMed] [Google Scholar]
  • 27.Rubin DB. Multiple imputation for nonresponse in surveys: John Wiley & Sons; 2004.

Decision Letter 0

Ralph C A Rippe

2 Oct 2022

PONE-D-22-22898

Missingness patterns in the baseline health surveys of the All of Us Research Program

PLOS ONE

Dear Dr. Cronin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not (yet) meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. I had invited two reviewers with vastly different backgrounds (substantive and with statistical expertise on missing data) to provide their insights. The comments from two independent reviewers are attached below, and I concur with both in the suggestion that a major revision is required to further consider the manuscript for publication in PLOS ONE. The manuscript title suggests a statistical approach while it doesn't seem to adopt one. The manuscript seems to be indecisive in terms of the angle taken (substantive or methodological), as also reflected by the comments of reviewer 1. In addition, reviewer 2 mentions substantive reasons and arguments for deeper and exhaustive elaborations on the meaning and interpretation of the patterns suggested in the manuscript. It is very important to support the suggested patterns with established techniques or, if not available, with novel statistical technique(s) for which evidence of validity and reliability are provided. For a paper to be considered in PLOS ONE - which has focus on methodological and statistical rigor -, the manuscript should provide a scope and message beyond (just) the current dataset or applications to it. Therefore, could you please revise the manuscript to clearly address the comments attached herein?

Please submit your revised manuscript by Nov 16 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ralph C. A. Rippe, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Thank you for stating the following in the Sources of financial support of your manuscript: 

"This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). ( Additional support included the National Heart, Blood, and Lung Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC)"

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

"This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). (Additional support included the National Heart, Blood, and Lung Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC). This team worked closely with the NIH on the study design approval and reviewed the manuscript for completeness."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

5. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well. 

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: General comment:

After reading the paper I thought: what is the goal, and what is the core message from the paper? If I understand correctly, the authors are trying to predict whether values on specific questions are missing, using other variables as predictors. Although that may be an interesting thing to look at, I think it would be way too limited to dedicate a complete paper to. Normally, checking whether missing data show a systematic pattern is part of the statistical analysis. I don’t see any statistical analysis here, apart from the prediction of the missing values.

Detailed comments:

1. P. 1, last line: “Various methods…”: This sounds kind of repetitive since earlier it was said that “There are multiple ways to handle missing data in general”. Maybe it should be emphasized more that here you are specifically talking about survey data.

2. P. 4, line 2: “Missing data can be examined by the participant”: at first it read like the examining of the missing data was done by the participant. Please, reformulate.

3. P. 5, line 4: although the advice used to be to impute 5 times, it was later established that 5 imputations is actually quite a small number (see, e.g., Graham, Olchowski, and Gilreath, 2007). I would advise increasing the number of imputations to 100.

4. P. 5, line 7: “Rubin’s rules” needs a reference.

5. I don’t understand the negative binomial regression, nor its purpose. If I understand correctly, a number of predictors is used to predict whether a value on an eligible question is missing or not. These predictors have missing data as well, and are multiply imputed as well? And why is prediction of missing data on the basis of background variables important?

6. I think the layout of the figures is occasionally sloppy. For example, in Figure 1, one of the two categories of age is labeled “>= 65” rather than “≥ 65”, in Figure 2 the variable names overlap with the graph, and “(a) Incidence Rate Ratio” overlaps with the numbers on the X axis.

References:

Graham, J.W., Olchowski, A.E., & Gilreath, T.D (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213. doi: 10.1007/s11121-007-0070-9
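Reviewer 1's point about pooling across a larger number of imputations can be made concrete with a short Rubin's-rules sketch. The per-imputation estimates below are hypothetical simulated values, not numbers from the manuscript; in practice they come from refitting the analysis model on each completed dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100  # number of imputations, in line with Graham et al. (2007)

# Hypothetical estimates and squared standard errors from m completed datasets.
estimates = rng.normal(0.23, 0.02, m)
within_var = np.full(m, 0.02**2)

# Rubin's rules: pool the point estimates, then combine the average
# within-imputation variance with the between-imputation variance.
pooled = estimates.mean()
W = within_var.mean()
B = estimates.var(ddof=1)
total_var = W + (1 + 1 / m) * B
se = np.sqrt(total_var)
print(f"pooled estimate {pooled:.3f}, SE {se:.3f}")
```

With many imputations the (1 + 1/m) penalty on the between-imputation variance becomes negligible, which is one reason more imputations yield more stable pooled standard errors.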

Reviewer #2: I am very excited about this paper, which will help a lot of future papers using All of Us data, so I am generally positive about this paper. My only issues concern a deeper analysis of the data on race/ethnicity. Here I explain:

1- Please also include data on missingness of race/ethnicity.

2- Based on your findings, please discuss if multiple imputation would reduce or increase bias due to missing data in this case.

3- Race/ethnicity interacts with sex/gender, age, and SES. For example, income is higher among highly educated Whites than highly educated Blacks, and Black men are treated worse by some institutions than Black women. This means trust, and thus missingness, may not be shaped just by race but by the interaction of race and other social constructs. Why is this important? Because missingness may vary not just by race and education, but by race x education (due to reduced relevance of education in non-Whites). So, please kindly check not only the main effects of race, but also the interactions of race/ethnicity (at least for Blacks vs. Whites) with education, age, and sex/gender on missingness.

Again, thank you for your service and exceptional work. Looking forward to citing your work in our future All of Us papers.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review_PONE.docx

PLoS One. 2023 May 18;18(5):e0285848. doi: 10.1371/journal.pone.0285848.r002

Author response to Decision Letter 0


24 Nov 2022

Editor Comments:

I had invited two reviewers with vastly different backgrounds (substantive and with statistical expertise on missing data) to provide their insights. The comments from two independent reviewers are attached below, and I concur with both in the suggestion that a major revision is required to further consider the manuscript for publication in PLOS ONE. The manuscript title suggests a statistical approach while it doesn't seem to adopt one. The manuscript seems to be indecisive in terms of the angle taken (substantive or methodological), as also reflected by the comments of reviewer 1. In addition, reviewer 2 mentions substantive reasons and arguments for deeper and exhaustive elaborations on the meaning and interpretation of the patterns suggested in the manuscript. It is very important to support the suggested patterns with established techniques or, if not available, with novel statistical technique(s) for which evidence of validity and reliability are provided. For a paper to be considered in PLOS ONE - which has focus on methodological and statistical rigor -, the manuscript should provide a scope and message beyond (just) the current dataset or applications to it. Therefore, could you please revise the manuscript to clearly address the comments attached herein?

>>>>>>>>

Response:

We thank the Editor for their comments. We believe handling missingness is critical for statistical analyses of research questions. It is common for substantive researchers to ignore missing data, especially in large cohort studies where sample size is less of a concern. If variables are missing not at random (MNAR), complete case analysis or methods assuming missingness at random (MAR) could lead to biased conclusions. While the assumption on the missing mechanism is not testable, identifying factors associated with missingness is critical in lessening MNAR by controlling for those factors in the analysis. The findings presented in this manuscript can be useful to researchers using the All of Us survey data and can help survey designers for other large cohorts target strategies to reduce differential missingness. A similar analysis can be conducted for other large epidemiological cohorts, like the Million Veterans Program or UK Biobank, to help reduce potential biases and account for them.

While this paper studies an important question that will impact the downstream decision process of statistical method development and implementation, the goal is not to develop novel statistical methods. We clarified the objective of this paper in the introduction:

The objective of this project was to use All of Us as a case study to demonstrate an approach to evaluate the missingness of surveys and identify characteristics associated with missingness in a large epidemiological cohort. In particular, we studied if the demographical variables that define the historically underrepresented groups in biomedical research, enrollment date, and health literacy were associated with an increased risk of missingness in the remaining survey questions of the baseline surveys.

We also changed the title to “Importance of missingness in baseline variables: a case study of the All of Us Research Program.”

We appreciate the reviewer's comments and believe these comments have helped strengthen this manuscript significantly. We addressed the comments to better describe the angle for reviewer 1 comments and completed a deeper and exhaustive elaboration as recommended by reviewer 2.

>>>>>>>>

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

>>>>>>>>

Response: We fixed the styles to meet PLOS ONE requirements.

>>>>>>>>

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

>>>>>>>>

Response: We included the correct grant numbers in the funding information section.

>>>>>>>>

3. Thank you for stating the following in the Sources of financial support of your manuscript:

"This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). ( Additional support included the National Heart, Blood, and Lung Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC)"

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). (Additional support included the National Heart, Blood, and Lung Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC). This team worked closely with the NIH on the study design approval and reviewed the manuscript for completeness."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

>>>>>>>>

Response: The above funding statement is correct; we removed this information from the acknowledgments.

>>>>>>>>

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

>>>>>>>>

Response: We updated the data availability statement in the submission:

Data is owned by a third party, the All of Us Research Program. The data underlying this article were provided by the All of Us Research Program by permission that can be sought by scientists and the public alike. The Researcher Workbench is a cloud-based platform where registered researchers can access Registered and Controlled Tier data, including the data presented here. Researchers/citizen scientists must verify their identity and complete the All of Us Research Program data access process to access the Researcher Workbench and Registered Tier data. Once this process is completed, the data will be made available to all persons. More information on data access can be found in the All of Us Research Hub (https://www.researchallofus.org/) as is the option to register for access. The authors did not have any special access privileges to this data that other researchers would not have.

>>>>>>>>

5. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

[Note: HTML markup is below. Please do not edit.]

>>>>>>>>

Response: We added the full ethics statement to the method section of the manuscript.

>>>>>>>>

Reviewer #1: General comment:

After reading the paper I thought: what is the goal, and what is the core message from the paper? If I understand correctly, the authors are trying to predict whether values on specific questions are missing, using other variables as predictors. Although that may be an interesting thing to look at, I think it would be way too limited to dedicate a complete paper to. Normally, checking whether missing data show a systematic pattern is part of the statistical analysis. I don’t see any statistical analysis here, apart from the prediction of the missing values.

>>>>>>>>

Response: We apologize for any confusion. The objective of this article is to use All of Us as a case study to demonstrate an approach to evaluate the missingness of surveys and identify key variables, in particular, the variables defining underrepresented groups in biomedical research, that are associated with missingness in a large epidemiological cohort. Handling missingness is critical for additional statistical analyses of research questions. It is common for substantive researchers to ignore missing data, especially in large cohort studies where sample size is less of a concern. If variables are missing not at random (MNAR), complete case analysis or methods assuming missing completely at random (MCAR) could lead to biased conclusions. While the assumption on the missing mechanism is not testable, identifying factors associated with missingness is critical in lessening MNAR by controlling for those factors in the analysis. We agree that checking missing data mechanisms, especially for the individual questions related to a specific research question, is, and has to be, part of each statistical analysis. However, pre-identifying common variables explaining the missingness of the other variables will reduce the analysis burden in the substantive analysis, especially when the number of potential variables to evaluate is large (a total of 126 questions from the three baseline modules) and branching logic is involved, as in our study. As we clarified in our response to the Editor, while this paper studies an important question that will impact the downstream decision process of statistical method development and implementation, the goal is not to develop novel statistical methods. We added the following to the introduction:

The objective of this project was to use All of Us as a case study to demonstrate an approach to evaluate the missingness of surveys and identify characteristics associated with missingness in a large epidemiological cohort. In particular, we studied if the demographical variables that define the historically underrepresented groups in biomedical research, enrollment date, and health literacy were associated with an increased risk of missingness in the remaining survey questions of the baseline surveys.

We also changed the title to “Importance of missingness in baseline variables: a case study of the All of Us Research Program.”

For All of Us, the three baseline modules studied in this article were required for all participants and contained the information that could be commonly used in the All of Us research program. In this paper, we described the completeness of these survey data and identified some key variables associated with the overall missingness of the rest of the variables from the three baseline survey modules. The fact that those often not included in biomedical research have higher rates of missingness is a potential red flag for studies such as All of Us, which make special efforts to include such populations. If these populations fail to answer other questions or drop out altogether, any complete case analysis will mean that these losses undermine the efforts to include such groups. Another key message of our paper is identifying "leading indicators" of missingness (i.e., the predictors) that could help survey designers to target strategies to reduce differential missingness.

The findings presented in this manuscript can be useful to researchers using the All of Us survey data and can help survey designers for other large cohorts target strategies to reduce differential missingness. A similar analysis can be conducted for other large epidemiological cohorts, like the Million Veterans Program or UK Biobank, to help reduce potential biases and account for them. The fact that we found a low missing percentage in the three baseline survey modules of the All of Us cohort offers some reassurance for substantive research but also points to the importance of adjusting for the key variables associated with overall missingness in substantive analyses.

We added the following to the discussion:

Discussion:

For All of Us, the three baseline modules studied in this article were required for all participants and contain information commonly used in the All of Us Research Program. In this manuscript, we described the completeness of these survey data and identified some key baseline variables associated with the overall missingness of the rest of the variables from the three baseline survey modules. The fact that groups often not included in biomedical research have higher rates of missingness is a potential red flag for studies, such as All of Us, that make special efforts to include such populations. The missingness of the background variables of race, sex and gender identity, education, time of enrollment, and geography could contribute to non-random missingness in surveys. If these populations fail to answer other questions or drop out altogether, any complete case analysis will mean that these losses undermine the efforts to include such groups.

[…]

The analyses presented in this manuscript can be used by researchers working with the All of Us survey data and survey data from other large epidemiological cohorts, such as the Million Veterans Program or UK Biobank, to help reduce potential biases and account for them. Another key message is that identifying “leading indicators” of missingness (i.e., the predictors) could help survey designers target strategies to reduce differential missingness. The fact that we found a low missing percentage in the three baseline survey modules of the All of Us cohort offers some reassurance for substantive research but also points to the importance of adjusting for the critical variables associated with overall missingness in substantive analyses. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK Biobank, Million Veterans Program).

>>>>>>>>

Detailed comments:

1. P. 1, last line: “Various methods…”: This sounds kind of repetitive since earlier it was said that “There are multiple ways to handle missing data in general”. Maybe it should be emphasized more that here you are specifically talking about survey data.

>>>>>>>>

Response: We appreciate the comment and modified the introduction as follows:

“There are multiple ways to handle missing survey data in general.”

>>>>>>>>

2. P. 4, line 2: “Missing data can be examined by the participant”: at first it read like the examining of the missing data was done by the participant. Please, reformulate.

>>>>>>>>

Response: We appreciate the reviewer pointing out this potential confusion. We modified to the following:

“Missing data can be examined at the level of the participant.”

>>>>>>>>

3. P. 5, line 4: although the advice used to be to impute 5 times, it was later established that 5 imputations are actually quite a small number (see, e.g., Graham, Olchowski, and Gilreath, 2007). I would advise increasing the number of imputations to 100.

>>>>>>>>

Response: We appreciate the reviewer’s comment, reviewed the reference mentioned and Rubin[1], and included both pieces of information below. We imputed the health literacy score, which had about 6% (0.06) missing.

Graham et al. (2007) conclude: “With these assumptions, we recommend that one should use m = 20, 20, 40, 100, and >100 for true γ = 0.10, 0.30, 0.50, 0.70, and 0.90, respectively,” where γ is the fraction of missing information. Although γ equals the amount of missing data in the simplest case, it is typically rather less than the amount of missing data, per se, in more complicated situations (Rubin[1]).

Also, in Graham et al. (2007):

“Rubin[1] shows that the efficiency of an estimate based on m imputations is approximately (1 + γ/m)^(−1), where γ is the fraction of missing information for the quantity being estimated.... gains rapidly diminish after the first few imputations. ... In most situations, there is simply little advantage to producing and analyzing more than a few imputed datasets (pp. 548–549).”

The recommendation of using 100 × (missing fraction) imputations comes from this formula when allowing a 1% efficiency loss. With 6% missingness in the health literacy data we imputed, γ is less than 0.06. Therefore, 6 imputations lead to a <1% efficiency loss, 10 imputations to a <0.6% loss, and 100 imputations to a <0.06% loss.

Considering the above, together with the computational burden of the large data size (the other factor that Graham et al. (2007) point out), we increased the number of imputations to 6, which gives us a <1% efficiency loss.
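As a quick numerical check (an illustrative sketch, not code from the manuscript), the efficiency-loss figures above follow directly from Rubin's approximation:

```python
# Relative efficiency of an estimate based on m multiple imputations,
# following Rubin's approximation: efficiency ≈ (1 + γ/m)^(-1),
# where γ is the fraction of missing information.
# γ = 0.06 corresponds to the ~6% missingness of the health literacy score.

def efficiency_loss(gamma: float, m: int) -> float:
    """Return the efficiency loss (1 - relative efficiency) for m imputations."""
    return 1.0 - 1.0 / (1.0 + gamma / m)

gamma = 0.06
for m in (6, 10, 100):
    print(f"m={m:3d}: efficiency loss = {efficiency_loss(gamma, m):.4%}")
```

Running this confirms that 6 imputations already keep the loss under 1% for γ = 0.06.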

We updated the methods and results in the manuscript and included Graham et al. (2007) and Rubin in the reference:

Methods:

The health literacy scale had about 6% missingness. Therefore, we used six imputations to allow for a <1% efficiency loss[1, 3].
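For readers unfamiliar with the pooling step, a minimal sketch of Rubin's rules (using hypothetical per-imputation estimates, not values from our analysis) is:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine per-imputation point estimates and variances via Rubin's rules.

    estimates, variances: sequences of length m (one entry per imputed dataset).
    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation variance
    and B is the between-imputation variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()             # pooled point estimate
    w = u.mean()                 # within-imputation variance
    b = q.var(ddof=1)            # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b  # total variance
    return q_bar, t

# Hypothetical estimates from m = 6 imputed datasets
est = [1.02, 0.98, 1.05, 1.00, 0.97, 1.01]
var = [0.04, 0.05, 0.04, 0.05, 0.04, 0.05]
print(pool_rubin(est, var))
```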

>>>>>>>>

4. P. 5, line 7: “Rubin’s rules” needs a reference.

>>>>>>>>

Response: We added the following reference[1]:

1. Rubin DB. Multiple imputation for nonresponse in surveys. Vol. 81: John Wiley & Sons; 2004.

>>>>>>>>

5. I don’t understand the negative binomial regression, nor its purpose.

>>>>>>>>

Response: We used negative binomial regression because the dependent variable is the number of questions (excluding the questions included as the independent variables) with a missing value out of the total number of eligible questions based on branching logic (each participant could have a different number of eligible questions). Instead of Poisson regression, negative binomial regression was used to allow for over-dispersion.

>>>>>>>>

5. If I understand correctly, a number of predictors is used to predict whether a value on an eligible question is missing or not.

>>>>>>>>

Response: Predictors were used to study their associations with the level of missingness for an eligible participant.

We updated the methods as follows to improve clarity:

We performed a negative binomial regression, which allows for overdispersion, to evaluate the impact of participant characteristics on a participant's missingness, measured as the number of missed questions out of that participant's eligible questions.

>>>>>>>>

5. These predictors have missing data as well, and are multiply imputed as well?

>>>>>>>>

Response: Multiple imputation was only used for the health literacy scale. One of the key findings of this paper is that those who skip key demographic questions are more likely to skip other substantive variables in the surveys (see Figure 1). Imputing these predictor variables would mask this relationship. These variables could also be a leading indicator of later missingness in the surveys. We also updated the regression to evaluate missingness from the remaining questions and not include the predictors in the outcomes.

>>>>>>>>

5. And why is prediction of missing data on the basis of background variables important?

>>>>>>>>

Response:

The missingness of background variables such as race, sex and gender identity, education, time of enrollment, and geography could contribute to non-random missingness in surveys. Issues of missingness based on these background variables have affected surveys from other national programs such as NHANES and the UK Biobank[4-7]. Of the many papers (over 75) that have used the UK Biobank, we could find only two that explicitly discussed missing data[6, 7]. Our manuscript helps highlight the importance of evaluating this potential issue for All of Us and other datasets with survey data. Using complete case analysis or assuming missingness at random without evaluating background variables could bias research conclusions, as described above in our response to the general comment. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK Biobank, Million Veterans Program).

We added the following to the manuscript, as mentioned above to the response to the general comment from reviewer #1:

For All of Us, the three baseline modules studied in this article were required for all participants and contain information commonly used in the All of Us Research Program. In this manuscript, we described the completeness of these survey data and identified some key baseline variables associated with the overall missingness of the rest of the variables from the three baseline survey modules. The fact that groups often not included in biomedical research have higher rates of missingness is a potential red flag for studies, such as All of Us, that make special efforts to include such populations. The missingness of the background variables of race, sex and gender identity, education, time of enrollment, and geography could contribute to non-random missingness in surveys. If these populations fail to answer other questions or drop out altogether, any complete case analysis will mean that these losses undermine the efforts to include such groups.

[…]

The analyses presented in this manuscript can be used by researchers working with the All of Us survey data and survey data from other large epidemiological cohorts, such as the Million Veterans Program or UK Biobank, to help reduce potential biases and account for them. Another key message is that identifying “leading indicators” of missingness (i.e., the predictors) could help survey designers target strategies to reduce differential missingness. The fact that we found a low missing percentage in the three baseline survey modules of the All of Us cohort offers some reassurance for substantive research but also points to the importance of adjusting for the critical variables associated with overall missingness in substantive analyses. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK Biobank, Million Veterans Program).

>>>>>>>>

6. I think the layout of the figures is occasionally sloppy. For example, in Figure 1, one of the two categories of age is labeled “>= 65” rather than “≥ 65”, in Figure 2 the variable names overlap with the graph, and “(a) Incidence Rate Ratio” overlaps with the numbers on the X axis.

>>>>>>>>

Response: We appreciate the reviewer’s comments and have improved the layout of the figures.

>>>>>>>>

References:

Graham, J.W., Olchowski, A.E., & Gilreath, T.D (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213. doi: 10.1007/s11121-007-0070-9

Reviewer #2: I am very excited about this paper which will help a lot of future papers using All of Us data, so I am generally positive about this paper. My only issues are about more deep analysis of data on race/ethnicity. Here I explain:

1- Please also include data on missingness of race/ethnicity.

>>>>>>>>

Response: Missingness of the question for race/ethnicity is included in Figure 1a. We considered missing data for this question to be a skip or a “prefer not to answer” response.

Overall, we saw 5741 participants (1.7%) with missing data for race/ethnicity. We also added this to the results:

For example, the 5741 participants (1.7%) who did not answer the race/ethnicity question skipped 13% of the remaining questions of the baseline surveys, while the average skip rate was about 5% of the questions.

>>>>>>>>

2- Based on your findings, please discuss if multiple imputation would reduce or increase bias due to missing data in this case.

>>>>>>>>

Response: When the missing at random assumption is valid, multiple imputation will reduce the bias and lead to a more valid inference.

>>>>>>>>

3- Race/ethnicity interacts with sex/gender, age, and SES. For example, income is higher among high educated Whites than high educated Blacks, and Black men are treated worse by some institutions than Black women. This means, trust is missingness may not be just shaped by race but the interaction of race and other social constructs. Why this is important? Because missingness may vary not just by race and education, but race x education (due to reduced relevance of education in non-Whites). So, please kindly not only check the main effects of race, but also the interactions of race/ethnicity (at least for Blacks vs. Whites) with education, age, and sex/gender on missingness.

>>>>>>>>

Response: We greatly appreciate this suggestion by the reviewer. We added the interaction terms to our primary analysis and included the following in the methods:

As race/ethnicity was associated with an increased risk of missingness, we also investigated if the effect was moderated by education, age, and sex/gender by including the two-way interaction terms in the model.

We added an appendix figure showing only the effects of the interaction terms that were significant and the following to the results:

Some race/ethnicity interactions with sex/gender, age, and education were significant and demonstrated differential race/ethnicity effects moderated by other baseline variables (see the supplemental document, Fig S6). However, the most significant IRRs were among the groups with missingness in the baseline variables.

>>>>>>>>

Again, thank you for your service and exceptional work. Looking forward to cite your work in our future All of Us papers.

>>>>>>>>

Response: Thank you very much for your review!

>>>>>>>>

REFERENCES

1. Rubin DB. Multiple imputation for nonresponse in surveys: John Wiley & Sons; 2004.

2. Löwe B, Kroenke K, Herzog W, Gräfe K. Measuring depression outcome with a brief self-report instrument: sensitivity to change of the Patient Health Questionnaire (PHQ-9). Journal of affective disorders. 2004;81(1):61-6. doi: https://doi.org/10.1016/S0165-0327(03)00198-8.

3. Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention science. 2007;8(3):206-13.

4. Hartwell ML, Khojasteh J, Wetherill MS, Croff JM, Wheeler D. Using Structural Equation Modeling to Examine the Influence of Social, Behavioral, and Nutritional Variables on Health Outcomes Based on NHANES Data: Addressing Complex Design, Nonnormally Distributed Variables, and Missing Information. Current Developments in Nutrition. 2019;3(5). doi: 10.1093/cdn/nzz010.

5. Pridham G, Rockwood K, Rutenberg A. Strategies for handling missing data that improve Frailty Index estimation and predictive power: lessons from the NHANES dataset. GeroScience. 2022;44(2):897-923. doi: 10.1007/s11357-021-00489-w.

6. Lv X, Li Y, Li R, Guan X, Li L, Li J, et al. Relationships of sleep traits with prostate cancer risk: A prospective study of 213,999 UK Biobank participants. The Prostate. 2022;82(9):984-92. doi: https://doi.org/10.1002/pros.24345.

7. Foster HME, Celis-Morales CA, Nicholl BI, Petermann-Rocha F, Pell JP, Gill JMR, et al. The effect of socioeconomic deprivation on the association between an extended measurement of unhealthy lifestyle factors and health outcomes: a prospective analysis of the UK Biobank cohort. The Lancet Public Health. 2018;3(12):e576-e85. doi: 10.1016/S2468-2667(18)30200-7.

Editor Comments:

I had invited two reviewers with vastly different backgrounds (substantive and with statistical expertise on missing data) to provide their insights. The comments from two independent reviewers are attached below, and I concur with both in the suggestion that a major revision is required to further consider the manuscript for publication in PLOS ONE. The manuscript title suggests a statistical approach while it doesn't seem to adopt one. The manuscript seems to be indecisive in terms of the angle taken (substantive or methodological), as also reflected by the comments of reviewer 1. In addition, reviewer 2 mentions substantive reasons and arguments for deeper and exhaustive elaborations on the meaning and interpretation of the patterns suggested in the manuscript. It is very important to support the suggested patterns with established techniques or, if not available, with novel statistical technique(s) for which evidence of validity and reliability are provided. For a paper to be considered in PLOS ONE - which has focus on methodological and statistical rigor -, the manuscript should provide a scope and message beyond (just) the current dataset or applications to it. Therefore, could you please revise the manuscript to clearly address the comments attached herein?

>>>>>>>>

Response:

We thank the Editor for their comments. We believe handling missingness is critical for statistical analyses of research questions. It is common for substantive researchers to ignore missing data, especially in large cohort studies where sample sizes are less of a concern. If variables are missing not at random (MNAR), complete case analysis or methods assuming missingness at random (MAR) could lead to biased conclusions. While the assumption about the missing-data mechanism is not testable, identifying factors associated with missingness is critical to lessening MNAR by controlling for those factors in the analysis. The findings presented in this manuscript can be useful to researchers using the All of Us survey data and help survey designers for other large cohorts target strategies to reduce differential missingness. Similar analyses can be conducted for other large epidemiological cohorts, such as the Million Veterans Program or UK Biobank, to help reduce potential biases and account for them.

While this paper studies an important question that will impact the downstream decision process of statistical method development and implementation, the goal is not to develop novel statistical methods. We clarified the objective of this paper in the introduction:

The objective of this project was to use All of Us as a case study to demonstrate an approach to evaluating the missingness of surveys and identifying characteristics associated with missingness in a large epidemiological cohort. In particular, we studied whether the demographic variables that define the historically underrepresented groups in biomedical research, enrollment date, and health literacy were associated with an increased risk of missingness in the remaining survey questions of the baseline surveys.

We also changed the title to “Importance of missingness in baseline variables: a case study of the All of Us Research Program.”

We appreciate the reviewer's comments and believe these comments have helped strengthen this manuscript significantly. We addressed the comments to better describe the angle for reviewer 1 comments and completed a deeper and exhaustive elaboration as recommended by reviewer 2.

>>>>>>>>

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

>>>>>>>>

Response: We fixed the styles for PLOS ONE.

>>>>>>>>

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

>>>>>>>>

Response: We included the correct grant numbers in the funding information section.

>>>>>>>>

3. Thank you for stating the following in the Sources of financial support of your manuscript:

"This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). ( Additional support included the National Heart, Blood, and Lung Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC)"

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"This work was funded by the National Institutes of Health (https://allofus.nih.gov/) 5U2COD023196 (RMC, XF, LS, BMM, SG, AS, RJ, MPC, QC) and 3OT2OD026550 (BKA). (Additional support included the National Heart, Blood, and Lung Institute (https://www.nhlbi.nih.gov/) K23HL141447 (RMC). This team worked closely with the NIH on the study design approval and reviewed the manuscript for completeness."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

>>>>>>>>

Response: The above funding statement is correct; we removed this information from the acknowledgments.

>>>>>>>>

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

>>>>>>>>

Response: We updated the data availability statement in the submission:

Data is owned by a third party, the All of Us Research Program. The data underlying this article were provided by the All of Us Research Program by permission that can be sought by scientists and the public alike. The Researcher Workbench is a cloud-based platform where registered researchers can access Registered and Controlled Tier data, including the data presented here. Researchers/citizen scientists must verify their identity and complete the All of Us Research Program data access process to access the Researcher Workbench and Registered Tier data. Once this process is completed, the data will be made available to all persons. More information on data access can be found in the All of Us Research Hub (https://www.researchallofus.org/) as is the option to register for access. The authors did not have any special access privileges to this data that other researchers would not have.

>>>>>>>>

5. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

[Note: HTML markup is below. Please do not edit.]

>>>>>>>>

Response: We added the full ethics statement to the method section of the manuscript.

>>>>>>>>

Reviewer #1: General comment:

After reading the paper I thought: what is the goal, and what is the core message from the paper? If I understand correctly, the authors are trying to predict whether values on specific questions are missing, using other variables as predictors. Although that may be an interesting thing to look at, I think it would be way too limited to dedicate a complete paper to. Normally, checking whether missing data show a systematic pattern is part of the statistical analysis. I don’t see any statistical analysis here, apart from the prediction of the missing values.

>>>>>>>>

Response: We apologize for any confusion. The objective of this article is to use All of Us as a case study to demonstrate an approach to evaluate the missingness of surveys and identify key variables, in particular, the variables defining underrepresented groups in biomedical research, that are associated with missingness in a large epidemiological cohort. Handling missingness is critical for additional statistical analyses of research questions. It is common for substantive researchers to ignore missing data, especially in large cohort studies where small sample sizes are less of a concern. If variables are missing not at random (MNAR), complete case analysis or methods assuming missing completely at random (MCAR) could lead to biased conclusions. While the assumption on the missing mechanism is not testable, identifying factors associated with missingness is critical in lessening MNAR by controlling for those factors in the analysis. We agree that checking missing data mechanisms, especially the individual question related to the specific research question is, and has to be, part of its own statistical analysis. However, pre-identifying common variables explaining the missingness of the other variables will reduce the analysis burden in the substantive analysis, especially when the size of potential variables to evaluate is large (a total of 126 questions from the three baseline modules) and branching logic is involved as in our study. As we clarified in our response to the Editor, while this paper studies an important question that will impact the downstream decision process of statistical method development and implementation, the goal is not to develop novel statistical methods. We added the following to the introduction:

The objective of this project was to use All of Us as a case study to demonstrate an approach to evaluate the missingness of surveys and identify characteristics associated with missingness in a large epidemiological cohort. In particular, we studied if the demographical variables that define the historically underrepresented groups in biomedical research, enrollment date, and health literacy were associated with an increased risk of missingness in the remaining survey questions of the baseline surveys.

We also changed the title to “Importance of missingness in baseline variables: a case study of the All of Us Research Program.”

For All of Us, the three baseline modules studied in this article were required for all participants and contained the information that could be commonly used in the All of Us research program. In this paper, we described the completeness of these survey data and identified some key variables associated with the overall missingness of the rest of the variables from the three baseline survey modules. The fact that those often not included in biomedical research have higher rates of missingness is a potential red flag for studies such as All of Us, which make special efforts to include such populations. If these populations fail to answer other questions or drop out altogether, any complete case analysis will mean that these losses undermine the efforts to include such groups. Another key message of our paper is identifying "leading indicators" of missingness (i.e., the predictors) that could help survey designers to target strategies to reduce differential missingness.

The findings presented in this manuscript can be useful to the researchers of the All of Us survey data and helps survey designers for other large cohorts to target strategies to reduce differential missingness. Similar analysis can be conducted for other large epidemiological cohorts like the Million Veterans Program or UK Biobank to help reduce potential biases and account for them. The fact that we found a low missing percentage in the three baseline survey modules of the All of Us cohort offers some reassurance to substantive research but also points to the importance of adjusting for the key variables associated with overall missingness on substantive analyses.

We added the following to the discussion:

Discussion:

For All of Us, the three baseline modules studied in this article were required for all participants and contained the information that could be commonly used in the All of Us research program. In this manuscript, we described the completeness of these survey data and identified some key baseline variables associated with the overall missingness of the rest of the variables from the three baseline survey modules. The fact that those often not included in biomedical research have higher rates of missingness is a potential red flag for studies, such as All of Us, that make special efforts to include such populations. The missingness of these background variables of race, sex and gender identity, education, time of enrollment, and geography, could contribute to non-random missingness in surveys. If these populations fail to answer other questions or drop out altogether, any complete case analysis will mean that these losses undermine the efforts to include such groups.

[…]

The analyses presented in this manuscript can be used by researchers for the All of Us survey data and survey data from other large epidemiological cohorts like the Million Veterans Program or UK Biobank to help reduce potential biases and account for them. Another key message is that identifying “leading indicators” of missingness (i.e., the predictors) could help survey designers to target strategies to reduce differential missingness. The fact that we found a low missing percentage in the three baseline survey modules of the All of Us cohort offers some reassurance to substantive research but also points to the importance of adjusting for the critical variables associated with overall missingness on substantive analyses. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK biobank, Million Veterans Program).

>>>>>>>>

Detailed comments:

1. P. 1, last line: “Various methods…”: This sounds kind of repetitive since earlier it was said that “There are multiple ways to handle missing data in general”. Maybe it should be emphasized more that here you are specifically talking about survey data.

>>>>>>>>

Response: We appreciate the comment and modified the introduction as follows:

“There are multiple ways to handle missing survey data in general.”

>>>>>>>>

2. P. 4, line 2: “Missing data can be examined by the participant”: at first it read like the examining of the missing data was done by the participant. Please, reformulate.

>>>>>>>>

Response: We appreciate the reviewer pointing out this potential confusion. We modified to the following:

“Missing data can be examined at the level of the participant.”

>>>>>>>>

3. P. 5, line 4: although the advice used to be to impute 5 times, it was later established that 5 imputations are actually quite a small number (see, e.g., Graham, Olchowski, and Gilreath, 2007). I would advise increasing the number of imputations to 100.

>>>>>>>>

Response: We appreciate the reviewer’s comment, reviewed the reference mentioned and Rubin[1], and included both pieces of information below. We imputed the health literacy score, which had about 6% (0.06) missing.

The conclusions of Graham et al. (2007) state: “With these assumptions, we recommend that one should use m = 20, 20, 40, 100, and >100 for true γ = 0.10, 0.30, 0.50, 0.70, and 0.90, respectively,” where γ is the fraction of missing information. Although γ equals the amount of missing data in the simplest case, it is typically rather less than the amount of missing data, per se, in more complicated situations (Rubin[1]).

Also, in Graham et al. (2007):

“Rubin[1] shows that the efficiency of an estimate based on m imputations is approximately

(1 + γ/m)^(−1), where γ is the fraction of missing information for the quantity being estimated.... gains rapidly diminish after the first few imputations. ... In most situations, there is simply little advantage to producing and analyzing more than a few imputed datasets (pp. 548–549).”

The common recommendation of using a number of imputations equal to 100 times the missing percentage follows from this formula when allowing a 1% efficiency loss. With 6% missing in the data for health literacy, which we imputed, γ is less than 0.06. Therefore, 6 imputations lead to a <1% efficiency loss, 10 imputations lead to a <0.6% efficiency loss, and 100 imputations lead to a <0.06% efficiency loss.

Considering this above and the computation burden for the large data size, the other factor that Graham et al. (2007) pointed out, we increased the number of imputations to 6, which gives us a <1% efficiency loss.
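Rubin's relative-efficiency formula above makes this trade-off easy to check directly. A minimal sketch, assuming γ = 0.06 (matching the ~6% missingness of the health literacy score discussed here):

```python
# Rubin's approximate relative efficiency of an estimate based on m
# imputations is RE = (1 + gamma/m)^(-1), where gamma is the fraction
# of missing information. Here gamma = 0.06 is an illustrative value
# matching the ~6% missingness of the health literacy score.
def efficiency_loss(gamma: float, m: int) -> float:
    """Return the efficiency loss (1 - relative efficiency) in percent."""
    relative_efficiency = 1.0 / (1.0 + gamma / m)
    return (1.0 - relative_efficiency) * 100.0

for m in (6, 10, 100):
    print(f"m = {m:3d}: efficiency loss ~ {efficiency_loss(0.06, m):.2f}%")
```

Running this reproduces the figures quoted above: roughly 0.99%, 0.60%, and 0.06% loss for 6, 10, and 100 imputations, respectively.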

We updated the methods and results in the manuscript and included Graham et al. (2007) and Rubin in the reference:

Methods:

The health literacy scale had about 6% missingness. Therefore, we used six imputations to allow for a <1% efficiency loss[1, 3].

>>>>>>>>

4. P. 5, line 7: “Rubin’s rules” needs a reference.

>>>>>>>>

Response: We added the following reference[1]:

1. Rubin DB. Multiple imputation for nonresponse in surveys. Vol. 81: John Wiley & Sons; 2004.

>>>>>>>>

5. I don’t understand the negative binomial regression, nor its purpose.

>>>>>>>>

Response: We used negative binomial regression because the dependent variable is the number of questions (excluding the questions included as the independent variables) with a missing value out of the total number of eligible questions based on branching logic (each participant could have a different number of eligible questions). Instead of Poisson regression, negative binomial regression was used to allow for over-dispersion.

>>>>>>>>

5. If I understand correctly, a number of predictors is used to predict whether a value on an eligible question is missing or not.

>>>>>>>>

Response: Predictors were used to study their associations with the level of missingness for an eligible participant.

We updated the methods as follows to improve clarity:

We performed a negative binomial regression, allowing for overdispersion, to evaluate the impact of participant characteristics on a participant's missingness as a proportion of that participant's number of eligible questions.

>>>>>>>>

5. These predictors have missing data as well, and are multiply imputed as well?

>>>>>>>>

Response: Multiple imputation was only used for the health literacy scale. One of the key findings of this paper is that those who skip key demographic questions are more likely to skip other substantive variables in the surveys (see Figure 1). Imputing these predictor variables would mask this relationship. These variables could also be a leading indicator of later missingness in the surveys. We also updated the regression to evaluate missingness from the remaining questions and not include the predictors in the outcomes.

>>>>>>>>

5. And why is prediction of missing data on the basis of background variables important?

>>>>>>>>

Response:

The missingness of background variables of race, sex and gender identity, education, time of enrollment, and geography could contribute to non-random missingness in surveys. Issues of missingness based on these background variables have affected surveys from other national programs such as NHANES and the UK Biobank[4-7]. Of the many papers (over 75) that have used the UK Biobank, we could only find two that explicitly discussed missing data[6, 7]. Our manuscript helps highlight the importance of evaluating this potential issue for All of Us and other data sets with survey data. Using complete case analysis or assuming missingness at random, without evaluating background variables, could bias the research conclusions, as described above in response to the general comment. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK Biobank, Million Veterans Program).

We added the following to the manuscript, as mentioned above to the response to the general comment from reviewer #1:

For All of Us, the three baseline modules studied in this article were required for all participants and contained the information that could be commonly used in the All of Us research program. In this manuscript, we described the completeness of these survey data and identified some key baseline variables associated with the overall missingness of the rest of the variables from the three baseline survey modules. The fact that those often not included in biomedical research have higher rates of missingness is a potential red flag for studies, such as All of Us, that make special efforts to include such populations. The missingness of these background variables of race, sex and gender identity, education, time of enrollment, and geography, could contribute to non-random missingness in surveys. If these populations fail to answer other questions or drop out altogether, any complete case analysis will mean that these losses undermine the efforts to include such groups.

[…]

The analyses presented in this manuscript can be used by researchers for the All of Us survey data and survey data from other large epidemiological cohorts like the Million Veterans Program or UK Biobank to help reduce potential biases and account for them. Another key message is that identifying “leading indicators” of missingness (i.e., the predictors) could help survey designers to target strategies to reduce differential missingness. The fact that we found a low missing percentage in the three baseline survey modules of the All of Us cohort offers some reassurance to substantive research but also points to the importance of adjusting for the critical variables associated with overall missingness in substantive analyses. Researchers must be cautious when using complete case analysis or assuming missingness at random in All of Us or other large epidemiological cohorts (e.g., UK Biobank, Million Veterans Program).

>>>>>>>>

6. I think the layout of the figures is occasionally sloppy. For example, in Figure 1, one of the two categories of age is labeled “>= 65” rather than “≥ 65”, in Figure 2 the variable names overlap with the graph, and “(a) Incidence Rate Ratio” overlaps with the numbers on the X axis.

>>>>>>>>

Response: We appreciate the reviewer’s comments and have improved the layout of the figures.

>>>>>>>>

References:

Graham, J.W., Olchowski, A.E., & Gilreath, T.D (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213. doi: 10.1007/s11121-007-0070-9

Reviewer #2: I am very excited about this paper which will help a lot of future papers using All of Us data, so I am generally positive about this paper. My only issues are about more deep analysis of data on race/ethnicity. Here I explain:

1- Please also include data on missingness of race/ethnicity.

>>>>>>>>

Response: Missingness of the question for race/ethnicity is included in Figure 1a. We considered missing data for this question to be a skip or a “prefer not to answer” response.

Overall, we saw 5741 participants (1.7%) with missing data for race/ethnicity. We also added this to the results:

For example, the 5741 participants (1.7%) who did not answer the race/ethnicity question skipped 13% of the remaining questions of the baseline surveys, while the average skip rate was about 5% of the questions.

>>>>>>>>

2- Based on your findings, please discuss if multiple imputation would reduce or increase bias due to missing data in this case.

>>>>>>>>

Response: When the missing at random assumption is valid, multiple imputation will reduce the bias and lead to more valid inference.
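A toy numpy illustration of this point (entirely simulated, not drawn from the manuscript's data): when Y is missing at random given an observed predictor X, the complete-case mean of Y is biased, while imputation that conditions on X recovers the truth. A single regression imputation is used here as a simplified stand-in for full multiple imputation:

```python
# Toy MAR demonstration. Y depends on X; Y is more often missing when X
# is large, so missingness depends only on the observed X (MAR). The
# true mean of Y is 0.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
p_miss = 1 / (1 + np.exp(-2.0 * x))       # P(Y missing | X) rises with X
observed = rng.random(n) > p_miss

cc_mean = y[observed].mean()              # complete-case estimate: biased low

# Regression imputation using X (a simplified stand-in for multiple
# imputation; MI would repeat this with added noise and pool via
# Rubin's rules).
slope, intercept = np.polyfit(x[observed], y[observed], 1)
y_imp = y.copy()
y_imp[~observed] = intercept + slope * x[~observed]
imp_mean = y_imp.mean()

print(f"complete-case mean: {cc_mean:+.3f}, imputed mean: {imp_mean:+.3f}")
```

The complete-case mean is far from zero because the observed cases over-represent small X, whereas the imputed mean is close to the true value of 0.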

>>>>>>>>

3- Race/ethnicity interacts with sex/gender, age, and SES. For example, income is higher among high-educated Whites than high-educated Blacks, and Black men are treated worse by some institutions than Black women. This means trust, and thus missingness, may not be shaped just by race but by the interaction of race and other social constructs. Why is this important? Because missingness may vary not just by race and education, but by race x education (due to the reduced relevance of education in non-Whites). So, please kindly check not only the main effects of race, but also the interactions of race/ethnicity (at least for Blacks vs. Whites) with education, age, and sex/gender on missingness.

>>>>>>>>

Response: We greatly appreciate this suggestion by the reviewer. We added the interaction terms to our primary analysis and included the following in the methods:

As race/ethnicity was associated with an increased risk of missingness, we also investigated if the effect was moderated by education, age, and sex/gender by including the two-way interaction terms in the model.

We added an appendix figure showing only the effects of the interaction terms that were significant and the following to the results:

Some race/ethnicity interactions with sex/gender, age, and education were significant and demonstrated differential race/ethnicity effects moderated by other baseline variables (see the supplemental document, Fig. S6). However, the most significant IRRs were among the groups with missingness in the baseline variables.

>>>>>>>>

Again, thank you for your service and exceptional work. Looking forward to citing your work in our future All of Us papers.

>>>>>>>>

Response: Thank you very much for your review!

>>>>>>>>

REFERENCES

1. Rubin DB. Multiple imputation for nonresponse in surveys: John Wiley & Sons; 2004.

2. Löwe B, Kroenke K, Herzog W, Gräfe K. Measuring depression outcome with a brief self-report instrument: sensitivity to change of the Patient Health Questionnaire (PHQ-9). Journal of affective disorders. 2004;81(1):61-6. doi: https://doi.org/10.1016/S0165-0327(03)00198-8.

3. Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention science. 2007;8(3):206-13.

4. Hartwell ML, Khojasteh J, Wetherill MS, Croff JM, Wheeler D. Using Structural Equation Modeling to Examine the Influence of Social, Behavioral, and Nutritional Variables on Health Outcomes Based on NHANES Data: Addressing Complex Design, Nonnormally Distributed Variables, and Missing Information. Current Developments in Nutrition. 2019;3(5). doi: 10.1093/cdn/nzz010.

5. Pridham G, Rockwood K, Rutenberg A. Strategies for handling missing data that improve Frailty Index estimation and predictive power: lessons from the NHANES dataset. GeroScience. 2022;44(2):897-923. doi: 10.1007/s11357-021-00489-w.

6. Lv X, Li Y, Li R, Guan X, Li L, Li J, et al. Relationships of sleep traits with prostate cancer risk: A prospective study of 213,999 UK Biobank participants. The Prostate. 2022;82(9):984-92. doi: https://doi.org/10.1002/pros.24345.

7. Foster HME, Celis-Morales CA, Nicholl BI, Petermann-Rocha F, Pell JP, Gill JMR, et al. The effect of socioeconomic deprivation on the association between an extended measurement of unhealthy lifestyle factors and health outcomes: a prospective analysis of the UK Biobank cohort. The Lancet Public Health. 2018;3(12):e576-e85. doi: 10.1016/S2468-2667(18)30200-7.

Attachment

Submitted filename: Reviewer Comments v2022-11-17.docx

Decision Letter 1

Bijan Najafi

3 May 2023

Importance of missingness in baseline variables: a case study of the All of Us Research Program

PONE-D-22-22898R1

Dear Dr. Cronin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Bijan Najafi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thank you for your diligent work in revising the manuscript following the initial rounds of critiques. Unfortunately, we haven't received a response from the third reviewer despite our attempts to follow up. To prevent further delay in the peer-review process, and given that the two other reviewers, as well as I, have had the opportunity to review your revised manuscript and response letter, I believe I am able to make a fair judgment regarding your manuscript. Reviewer #1 has expressed a few minor concerns; however, I concur with Reviewer #2 about the scientific merit of your work and its potential impact. I also agree with both reviewers that your revisions have adequately addressed all major critiques. In order to avoid further delay, I recommend accepting your revised manuscript, contingent upon addressing the concerns raised by Reviewer #1 directly with the editorial team.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think the paper has improved a lot. I do have my doubts, however, about the impact this work will have. I wonder to what extent far-reaching conclusions can be drawn about missing data in such surveys on the basis of just one dataset. However, I’ll leave it up to the editorial board to decide whether the impact is high enough in order for the paper to be publishable. One small textual comment: p. 1, lines 1-2 of second paragraph: “Since health surveys rely on voluntary participation”: is there any research where participation is not voluntary? Please, reformulate.

Reviewer #3: Although I wasn't one of the original reviewers, I was asked to weigh in on the paper. In doing so, I also had a look at the original reviews and the responses.

Overall, I think the paper is excellent and I don't really have any additional comments. The only thing that came up for me was an initial query around the negative binomial regression. In the main text, it's clearly indicated that the "outcome" is a count and that the negative binomial has (reasonably in my view) been adopted to acknowledge the potential for overdispersion. Nevertheless, this isn't clear in the Abstract (where the outcome is "missingness" and my initial thought was "why not a logistic model?"). With that, my only concrete suggestion would be to amend the abstract to make it clearer.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: Yes: Sebastien Haneuse

**********

Attachment

Submitted filename: Review_PONE_2.docx

Acceptance letter

Bijan Najafi

10 May 2023

PONE-D-22-22898R1

Importance of missingness in baseline variables: a case study of the All of Us Research Program

Dear Dr. Cronin:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Bijan Najafi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Negative binomial regression with complete cases.

    (PNG)

    S2 Fig. Negative binomial regression with multiple imputation on health literacy.

    We applied multiple imputation on the total health literacy score. Then we repeated the negative binomial regression on the imputed dataset.

    (PNG)

    S3 Fig. Negative binomial regression with averaging of two scores to impute an overall score for health literacy.

    If participants missed only one health literacy score question, we used the average of the two non-missing scores to impute the missing health literacy score. Then we applied negative binomial regression with the complete cases.

    (PNG)

    S4 Fig. Negative binomial regression with averaging of two scores and multiple imputation to impute an overall score for health literacy.

    If participants missed only one health literacy score question, we used the average of the two non-missing scores to impute the missing health literacy score. Then we applied multiple imputation on the total health literacy score and repeated negative binomial regression on the imputed dataset.

    (PNG)

    S5 Fig. Negative binomial regression with “prefer not to answer” and “don’t know” as missing.

    Multiple imputation was applied to health literacy as above, and the negative binomial regression method was applied on the imputed dataset.

    (PNG)

    S6 Fig. Negative binomial regression with interaction terms.

    Race/ethnicity interactions with sex/gender, age, and education were added to the model; significant interaction terms are shown in the forest plot.

    (PNG)

    Attachment

    Submitted filename: Review_PONE.docx

    Attachment

    Submitted filename: Reviewer Comments v2022-11-17.docx

    Attachment

    Submitted filename: Review_PONE_2.docx

    Data Availability Statement

    Data is owned by a third party, the All of Us Research Program. The data underlying this article were provided by the All of Us Research Program by permission that can be sought by scientists and the public alike. The Researcher Workbench is a cloud-based platform where registered researchers can access Registered and Controlled Tier data, including the data presented here. Researchers/citizen scientists must verify their identity and complete the All of Us Research Program data access process to access the Researcher Workbench and Registered Tier data. Once this process is completed, the data will be made available to all persons. More information on data access can be found in the All of Us Research Hub (https://www.researchallofus.org/) as is the option to register for access. The authors did not have any special access privileges to this data that other researchers would not have.

