Author manuscript; available in PMC 2025 Sep 1.
Published in final edited form as: Stat (Int Stat Inst). 2024 Jul 10; 13(3): e714. doi: 10.1002/sta4.714

What is it that you say you do here? Advocating for the critical role of data scientists in research infrastructure

Chasz Griego 1, Nicky Agate 1, Ana-Maria Iosif 2, Amy M Crisp 3
PMCID: PMC11340204  NIHMSID: NIHMS2016074  PMID: 39184224

Abstract

Clinical and academic research continues to become more complex as our knowledge and technology advance. A substantial and growing number of specialists in biostatistics, data science, and library sciences are needed to support these research systems and promote high-caliber research. However, that support is often marginalized as optional rather than a fundamental component of research infrastructure. By building research infrastructure, an institution harnesses access to tools and support/service centers that host skilled experts who approach research with best practices in mind and domain-specific knowledge at hand. We outline the potential roles of data scientists and statisticians in research infrastructure and recommend guidelines for advocating for the institutional resources needed to support these roles in a sustainable and efficient manner for the long-term success of the institution. We provide these guidelines in terms of resource efficiency, monetary efficiency, and long-term sustainability. We hope this work contributes to—and provides shared language for—a conversation on a broader framework beyond metrics that can be used to advocate for needed resources.

Keywords: data scientists, statistics, collaboration, team science, research infrastructure

RESEARCH INFRASTRUCTURE

As knowledge and technology have continued to grow, the complexity of the research apparatus has grown with them. A great deal can now be accomplished, but the sheer scale of that possibility may leave the individual researcher feeling adrift once they reach the limits of their training. Fortunately, this growth has spurred the development and extension of various research professions, including those represented by the authors here: biostatistics, data science, and library sciences. With some focused effort, these fields can work synergistically to make the labyrinthine research system more manageable for a broader spectrum of investigators.

Consider, though, the term “research enablement,” broadly defined as “the guidance, support, and education to allow campus researchers to do their research in an effective and appropriate manner.” This definition aligns strongly with our efforts and ambitions to support statistics and data science in the university setting. However, we challenge the explicit use of the term “research enablement,” as it risks marginalizing and obfuscating the labor that is done to allow researchers to do their research effectively, appropriately, and efficiently.

Research “enablement” becomes tricky when considering the individuals who offer said guidance, support, and education. Are we enablers? In the colloquial sense of the word, an enabler promotes or facilitates addictive, harmful, or nefarious behaviors. Translated back into the context of research, this kind of enablement is entirely possible. If research is done poorly and does not follow best practices, any labor provided to assist with it risks encouraging repeated behavior, and everyone suffers as a result.

Unlike an enabler of addictive behavior, who does not necessarily create the behavior but instead lets it continue, a consultant who does the same is missing the objective of supporting effective and appropriate research. This can be even harder to avoid for those in staff roles, who may not have the same oversight and inherent valuation of their work that faculty do (Sharp et al., 2016; Devick et al., 2022). Staff analysts are often expected to assist on every project brought to them with little consideration for their time, recognition, or compensation, which can quickly lead to burnout and/or subpar work.

In research computing (RC), dedicated RC facilitators, especially within campus-supported research computing centers, serve as proactive and personalized guides, helping researchers identify and apply the computational approaches that have the greatest impact on their projects (Michael & Maas, 2016). These facilitation practices significantly enhance research computing business models by guiding the development of adaptive computing solutions. Unlike traditional, one-size-fits-all enterprise models that primarily serve researchers with the most prominent needs, these practices support a more tailored approach, addressing the diverse and evolving requirements of the entire research community, including nontraditional and would-be users in the social sciences, life sciences, and humanities. The term facilitator seems more apt here than enabler, but there is still a key distinction between the focused services offered by RC and the broader foundational guidance that research and data professionals can provide.

These points suggest moving away from terms like research enablement and instead moving towards research infrastructure. This term simultaneously captures the need for sustainability, strength, expertise, collaboration, and robustness but may require an overall cultural shift. This also emphasizes the point that research is collaborative and not casual. Research infrastructure harnesses access to tools and support/service centers that host skilled experts who approach research with best practices in mind and domain-specific experience at hand.

In this paper, we will outline potential roles of data scientists and statisticians in research infrastructure, provide examples of different models of that infrastructure, and recommend guidelines for advocating for the institutional resources needed to support these roles in a sustainable and efficient manner for long-term success of the institution.

RESEARCH INFRASTRUCTURE ACROSS THE INSTITUTION

Research infrastructure aims to create an “energy efficient” institution, where the level of labor supplied to research appropriately matches the desired quality and outcome. In the ideal case of a perfectly energy-efficient institution, all the labor and time devoted to a research project would contribute directly to the final product. While this scenario is nearly impossible, appropriate infrastructure can bring an institution closer to the realistic efficiency limit. This concept may seem easy to sell to the administration, but what often gets overlooked is the cumulative effect of inefficiencies and how they manifest. Even if a researcher is capable of performing a relatively simple analysis or literature review, they will spend many more hours accomplishing it than someone trained in these areas. Those lost hours are expensive, especially in a clinical setting (Parker, 2000).
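The cost of those lost hours lends itself to a back-of-envelope calculation. The sketch below is purely illustrative; the hourly rates and time estimates are hypothetical assumptions, not figures from this paper or from Parker (2000).

```python
def excess_cost(researcher_rate, specialist_rate,
                researcher_hours, specialist_hours):
    """Extra cost incurred when a researcher self-serves a task
    that a trained specialist could complete faster.

    All arguments are hypothetical: hourly rates in dollars and
    time-to-completion in hours for each person.
    """
    self_serve = researcher_rate * researcher_hours
    delegated = specialist_rate * specialist_hours
    return self_serve - delegated


# Hypothetical scenario: a clinician ($150/h) spends 40 h on an
# analysis a staff statistician ($75/h) could complete in 10 h.
saving = excess_cost(150, 75, 40, 10)
print(saving)  # 5250
```

Even with generous assumptions about the researcher's speed, the difference is rarely close to zero, which is the core of the “time is money” argument made below.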

Many institutions can improve efficiency through libraries, research cores, Biostatistics, Epidemiology and Research Design (BERD) Cores, and data and statistics consulting centers. These centralized units can employ numerous specialists across a broad spectrum of knowledge and experience surrounding discipline-specific tools and domain-specific practices. Among these specialists are data scientists, statisticians, and librarians who are qualified and devoted to collaborating with researchers to simultaneously guide them toward best practices and respond to problems or questions specific to their area of expertise.

However, the personnel who make up the research infrastructure cannot devote such an amount of effort effectively or efficiently if the larger system lacks sufficient resources and funding, ignores personnel needs, or fails to properly promote services. Welty et al. (2013) point out the often-overlooked necessities of software subscriptions and administrative support required in units of a certain size. Several sources also discuss the support needed to continue to train statisticians and provide them with opportunities for professional development in the context of collaborative science (Griffith et al., 2022; Mazumdar et al., 2015; Mehta et al., 2022). Oliver et al. (2019) address the challenge library services face in successfully advertising the types of support offered, which naturally extends to statistical support centers.

BUILDING AND ENHANCING RESEARCH INFRASTRUCTURE

Research infrastructure varies across institutions, depending on the unit, the type of institution, and, in particular, the model of support. In one paid model, a statistician joins a research grant as a co-investigator: some percentage of the statistician’s salary comes from the grant, and they support the project across a range of approaches, from data cleaning and simple analyses to more complicated time series analyses and Bayesian methods. High-level support (e.g., trial design, proposal development, advanced analyses) typically comes from a Ph.D.-level collaborator, who additionally supplies highly informed opinions and recommendations at all stages of the research life cycle (Wild & Pfannkuch, 1999; Welty et al., 2013).

Another paid model consists of support through hourly charges (Welty et al., 2013; Parker, 2000). This can be advantageous for small projects that only require specific support or that lack external or grant funding. In these cases, however, the funding to pay for these charges must come from somewhere, and it must encompass all the support activities performed (e.g., meetings, data cleaning, analyses, data visualizations, manuscript revisions). Many of these tasks tend to be overlooked by the researcher, who thinks they simply need assistance with an analysis. On top of all of this, and rarely encompassed by an hourly rate table, funding is needed to keep the lights on, cover administrative requirements, and so on.

In contrast, there are unpaid models such as those often used for library support. In this model, librarians may offer direct support such as reference management, finding secondary data sources, developing data management and sharing plans, and though less frequent, research planning and registered reports. The research infrastructure from the libraries also includes indirect support through workshops, online guides, and subscriptions to journals, databases, and a variety of other tools or platforms. In these models, the institution funds the entire unit without expecting to recuperate those costs directly.

While the two models differ greatly (libraries teach researchers “how to fish,” while paid consultants give researchers the “fish”), both sets of specialists serve in roles where they can teach researchers skills not provided elsewhere. In an academic health center (AHC) setting, for example, residents and fellows are often not directly trained in how to do research by their mentors, yet they are still required to do some form of “scholarly activity.” What that looks like varies greatly, and not all physicians wish to continue engaging in research throughout their careers. However, this requirement exists for a reason: at a minimum, the physician should gain the ability to assess the quality and limitations of published research in their field (Brearley et al., 2023; Enders et al., 2017). A consulting statistician can help provide this training as an inherent part of their services, but only if they are given the time, support, and opportunity to do so. Accomplishing this requires bridging the gap between the expectations of general unpaid educational support and the reality of a unit operating within a paid model.

Welty et al. (2013) discuss the advantages of a centralized statistics unit in terms of the distribution of resources between smaller projects and more extensive collaborations, as well as the continuity of knowledge, funding, and training of junior analysts. Centralized units also reduce the risk of “siloing,” where isolated statisticians face a greater challenge to continuing their professional growth in an environment lacking access to workplace peers (Griffith et al., 2022). An additional hazard of siloing is the hyper-specialization of an individual who is only approached to work on a specific subset of projects. This can impact both their professional development and the likelihood of others approaching them for collaborations.

However, there is an important distinction between a “silo” and a “swim lane.” Respecting one’s skills and authority allows one to perform at peak capacity. We would not ask a surgeon to let us handle the scalpel, but many physicians think that expertise is not needed for research and that they can do their own analyses. By excluding subject matter experts, there are no checks and balances to ensure the validity of assumptions and whether the techniques they have learned previously remain applicable in a new paradigm. Furthermore, outside experts can offer equipoise, because they are detached from the personal interest and investment often felt by the researcher.

While many data scientists have a sense for all these concepts, finding the language needed to lobby their institution for these necessary investments can be challenging. We provide here a succinct list of the points made above for reference.

  1. Time is money: While funds do have to be allocated to invest in these resources, it is critical to remember that much of the return on investment comes in the form of recuperated time for clinical/academic professionals.

  2. Centralization creates efficiency: Sharing resources reduces logistic and financial burdens. This applies to the entirety of the research infrastructure.

  3. Research is an educational opportunity: It wastes resources to ignore the value of early career professionals learning directly from the myriad of subject matter experts available at their institution.

  4. Data science is more complicated than you think and is also incredibly valuable: There is no shortage of references for this, but the 2018 position paper from the American Statistical Association, “Overview of Statistics as a Scientific Discipline and Practical Implications for the Evaluation of Faculty Excellence,” is a great place to start.

  5. Sustainability is critical: All the above points are pieces of the sustainability puzzle, and sustainability is key to success, no matter what your definition of success is. Without sustainability, an organization will end up back where it started, but with less time and money.

DISCUSSION

While there is a great need for value metrics for data science professionals across the spectrum of faculty and staff job descriptions, we cannot rely solely on measurements to make much-needed changes in academic culture. Furthermore, we must actively avoid the trap of valuing what we measure instead of measuring what we value. There must be a broader framework to create context for the metrics we use to advocate for the resources we need, and there must be clear and consistent language that communicates that framework across disciplines.

This advocacy does not just belong to the data scientist but also the research software engineer (RSE), a role that is equally new and unique to academia. Like a data scientist or statistician, an RSE can offer a quick consultation for a project—more specifically, one that is developing or utilizing research software—or an RSE can become more embedded and integral to study design, implementing research processes and teaching best practices (Jay et al., 2017; Katz et al., 2019). Guidance towards hiring and retaining data scientists and RSEs shows parallel solutions to create success for both non-traditional academic professions (Van Tuyl, 2023).

We have provided a broad overview of some common ways of integrating data scientists and statisticians into the larger academic research domain to give context to the points made above. This overview is by no means exhaustive, and none of the examples provided here are intended to be a one-size-fits-most solution. For example, the points made here will have limited applicability to isolated or embedded faculty and staff. While their need for advocacy certainly exists, that conversation will likely look very different given their specific challenges. There are cases in which centralization is neither feasible nor desired, but there may still be ways for the embedded analyst to advocate for efficiency and sustainability. We encourage the extension of these ideas to those scenarios. Further work also exists in the explicit linkage of value metrics to these concepts and in the need to measure and predict the sustainability of a unit.

Sustainable research infrastructure requires an understanding of what success will look like for the institution upon hiring faculty and staff in these specialized roles, and at the same time, there must be an equal understanding of the employee’s expectations for career success. Equal to the institutional accomplishment of offering specialized support and collaboration for researchers, these faculty and staff deserve the accomplishment and motivation that come with identity and ownership in the roles they play in a project. Moreover, these do not need to be in competition with each other, so long as the organization is willing to prioritize long-term growth.

Research is not a casual undertaking and should not be treated as such. We must actively fight against the encroaching idea that it is simply a box that must be checked for students and young professionals. One tool in this fight is to strive for efficient and sustainable research infrastructure so that the challenges and benefits are shared in a way that makes the inherent return on investment clear and easy to illustrate. The critical place of data science within that research infrastructure creates a unique opportunity to advocate for broader cultural change that paves the way for these advances in efficiency and sustainability.

Funding statement:

This work was supported in part by the Alfred P. Sloan Foundation [G-2022-19414] and by the National Institutes of Health [P50MH106438].

Footnotes

Conflict of interest: The authors declare no conflicts of interest with the publication of this work.

Ethics approval: This study did not include human subjects research and did not require Institutional Review Board approval.

Data availability statement:

No quantitative data were generated for this manuscript.

REFERENCES

  1. American Statistical Association (2018). Overview of Statistics as a Scientific Discipline and Practical Implications for the Evaluation of Faculty Excellence.
  2. Brearley AM, Rott KW, & Le LJ (2023). A Biostatistical Literacy Course: Teaching Medical and Public Health Professionals to Read and Interpret Statistics in the Published Literature. Journal of Statistics and Data Science Education, 31(3), 286–294. 10.1080/26939169.2023.2165987 [DOI] [Google Scholar]
  3. Devick KL, Gunn HJ, Price LL, Meinzen-Derr JK, Enders FT, Perkins SM, & Schulte PJ (2022). Collaborative biostatistics and epidemiology in academic medical centres: A survey to assess relationships with health researchers and ethical implications. Stat, 11(1), e481. 10.1002/sta4.481 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Enders FT, Lindsell CJ, Welty LJ, Benn EK, Perkins SM, Mayo MS, … & Oster RA (2017). Statistical competencies for medical research learners: What is fundamental? Journal of Clinical and Translational Science, 1(3), 146–152. 10.1017/cts.2016.31 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Griffith EH, Sharp JL, Bridges WC, Craig BA, Hanford KJ, & Stevens JR (2022). The academic collaborative statistician: Research, training and evaluation. Stat, 11(1), e483. 10.1002/sta4.483 [DOI] [Google Scholar]
  6. Jay C, Haines R, Vigo M, Matentzoglu N, Stevens R, Boyle J, … & Vega J (2017) Identifying the challenges of code/theory translation: report from the Code/Theory 2017 workshop. Research Ideas and Outcomes, 3, e13236. 10.3897/rio.3.e13236 [DOI] [Google Scholar]
  7. Katz DS, McHenry K, Reinking C & Haines R (2019). Research Software Development & Management in Universities: Case Studies from Manchester’s RSDS Group, Illinois’ NCSA, and Notre Dame’s CRC, 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science), Montreal, QC, Canada, 17–24. 10.1109/SE4Science.2019.00009 [DOI] [Google Scholar]
  8. Mazumdar M, Messinger S, Finkelstein DM, Goldberg JD, Lindsell CJ, Morton SC, … & Parker RA (2015). Evaluating academic scientists collaborating in team-based research: A proposed framework. Academic Medicine, 90(10), 1302–1308. 10.1097/ACM.0000000000000759 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Mehta CC, Stedman MR, Rao SR, & Podolsky R (2022). Advice for isolated statisticians collaborating in academic healthcare centre settings. Stat, 11(1), e492. 10.1002/sta4.492 [DOI] [Google Scholar]
  10. Michael L, & Maas B (2016). Research computing facilitators: The missing human link in needs-based research cyberinfrastructure. Research Bulletin. Louisville, CO: ECAR. [Google Scholar]
  11. Oliver JC, Kollen C, Hickson B, & Rios F (2019). Data science support at the academic library. Journal of Library Administration, 59(3), 241–257. 10.1080/01930826.2019.1583015 [DOI] [Google Scholar]
  12. Parker RA (2000). Estimating the value of an internal biostatistical consulting service. Statistics in Medicine, 19(16), 2131–2145. [DOI] [PubMed] [Google Scholar]
  13. Sharp JL, Wrenn J, & Gerard PD (2016). Identifying the perceived value of statistical consulting in a university setting. Journal of Statistical Theory and Practice, 10, 216–225. 10.1080/15598608.2015.1108254 [DOI] [Google Scholar]
  14. Van Tuyl S (Ed.) (2023). Hiring, Managing, and Retaining Data Scientists and Research Software Engineers in Academia: A Career Guidebook from ADSA and US-RSE. Zenodo. 10.5281/zenodo.8329337 [DOI] [Google Scholar]
  15. Welty LJ, Carter RE, Finkelstein DM, Harrell FE Jr., Lindsell CJ, Macaluso M, … & Ware JH (2013). Strategies for developing biostatistics resources in an academic health center. Academic Medicine, 88(4), 454–460. 10.1097/ACM.0b013e31828578ed [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Wild CJ, & Pfannkuch M (1999). Statistical thinking in empirical enquiry. International Statistical Review, 67(3), 223–248. 10.1111/j.1751-5823.1999.tb00442.x [DOI] [Google Scholar]
