Bioinformatics Core Survey Highlights the Challenges Facing Data Analysis Facilities

Julie A Dragon; Chris Gates; Shannan Ho Sui; John N Hutchinson; R Krishna Murthy Karuturi; Alper Kucukural; Shawn Polson; Alberto Riva; Matthew Lee Settles; Jyothi Thimmapuram; Stuart S Levine

2020 Jul;31(2):66–73. doi: 10.7171/jbt.20-3102-005

PMCID: PMC7192196 PMID: 32382253

Abstract

Over the last decade, the cost of -omics data creation has decreased 10-fold, whereas the need for analytical support for those data has increased exponentially. Consequently, bioinformaticians face a second wave of challenges: novel applications of existing approaches (e.g., single-cell RNA sequencing), integration of -omics data sets of differing size and scale (e.g., spatial transcriptomics), as well as novel computational and statistical methods, all of which require more sophisticated pipelines and data management. Nonetheless, bioinformatics cores are often asked to operate primarily under a cost-recovery model, with limited institutional support. Seeing the need to assess bioinformatics core operations, the Association of Biomolecular Resource Facilities Genomics Bioinformatics Research Group conducted a survey on staffing, services, financial models, and challenges to better understand the issues that bioinformatics core facilities currently face and will need to address going forward. Of the respondent groups, we chose to focus on the survey data from smaller cores, which made up the majority. Although all cores indicated similar challenges in terms of changing technologies and analysis needs, small cores tended to have the added challenge of funding their operations largely through cost-recovery models with heavy administrative burdens.

Keywords: ABRF, funding model, institutional support, omics

INTRODUCTION

In a time of rapid growth in the bioinformatics and genomics fields, bioinformatics core facilities face new and diverse challenges. From staffing, services offered, and financial models to software utilization and development, data management, and reporting, bioinformatics core facilities constantly work to adapt to the changing demands of technologies and their research communities. Previous surveys of core facilities have analyzed singular issues, including funding mechanisms, data management, scientific rigor and reproducibility, and services offered, but a general survey assessing the broad spectrum of challenges faced by core facilities has not been carried out in over 10 yr.¹⁻⁶ During that time, bioinformatics technologies and demands have continued to grow while financial support has fluctuated. In an effort to better understand the current challenges, the Association of Biomolecular Resource Facilities Genomics Bioinformatics Research Group (ABRF GBIRG) conducted a survey via Survey Monkey for all ABRF members from February to March 2019. Survey results were presented at the ABRF annual meeting in San Antonio, Texas, on March 24, 2019.

MATERIALS AND METHODS

GBIRG created an online survey composed of 27 multiple choice and open-ended questions that was sent to all members of ABRF on February 18 and March 12 of 2019 via Survey Monkey. Membership totaled 512 at the time, of whom 220 opened the email containing the survey link (43%). Twenty-seven ABRF members responded by March 14, 2019. We then reposted the survey to ABRF members as well as on the bioinfo-core website (http://bioinfo-core.org) and received an additional 29 responses, for a total of 56 respondents. We did not track which of the second-round respondents were ABRF members. Data were collated and analyzed using Microsoft Excel. Due to the limited number of responses from medium (staff/faculty = 5–8) and larger cores (staff/faculty > 8), the results were filtered to include only small cores (staff/faculty < 5), resulting in 43 respondents. Survey data were then “cleaned” to remove tangential and inappropriately assigned responses (i.e., where a respondent chose to write in “other” when that choice was available within a multiple choice question).
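The response counts and size bins described above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function and variable names are illustrative assumptions, and only the counts and the staff/faculty thresholds come from the text.

```python
# Counts reported in the Materials and Methods section.
MEMBERS = 512        # ABRF membership at the time of the survey
OPENED = 220         # members who opened the email with the survey link
FIRST_ROUND = 27     # responses received by March 14, 2019
SECOND_ROUND = 29    # additional responses after the survey was reposted
SMALL_CORES = 43     # respondents from cores with staff/faculty < 5


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


def core_size(staff):
    """Classify a core by staff/faculty count using the survey's bins."""
    if staff < 5:
        return "small"
    if staff <= 8:
        return "medium"
    return "large"


total = FIRST_ROUND + SECOND_ROUND       # all respondents
open_rate = pct(OPENED, MEMBERS)         # share of members who opened the email
small_share = pct(SMALL_CORES, total)    # share of respondents from small cores

print(total, open_rate, small_share)
```

Under these inputs the totals agree with the figures quoted in the text (56 respondents, a 43% open rate, and 77% of respondents from small cores).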

RESULTS

Staffing, effort, and services

Our initial questions asked about core staffing levels, how those efforts were distributed among staff, and where those efforts were applied most often. Of the 56 respondents, 77% were from small cores (staff/faculty < 5), leaving only 11% from medium (staff/faculty = 5–8) and 9% from large (staff/faculty > 8) cores. Small cores indicated that they spend the majority of their effort on data analysis, followed by core administration, data management, experimental design, software development, methods development, systems administration, and biostatistics (data not shown). As hypothesized, most cores broadly support many different types of analyses (Fig. 1A), and for the predominant analysis type, next-generation sequencing, they support an array of services: mainly RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing, but also variant calling, single-cell RNA-Seq, methylation sequencing (Methyl-Seq), metagenomics, long-read sequencing, HiC/3C/4C (i.e., chromosome conformation capture techniques), and single-cell DNA sequencing (Fig. 1B). Other services supported included network analysis, proteomics, machine learning, metabolomics, and general high-dimensional biostatistics. Only a small percentage of small-core respondents indicated that they support image analysis, data mining, flow cytometry, omics integration, and epigenomics (Fig. 1A). Based on these responses, it is clear that small cores cover a large number of services with a limited number of staff.

FIGURE 1. Technologies supported by small bioinformatics cores. A) Bioinformatics technologies. Response to survey question: Which kinds of bioinformatics do you support? B) Next-generation (nextgen) sequencing technologies. Response to survey question: What kinds of nextgen sequencing do you support? ATAC-seq, assay for transposase-accessible chromatin sequencing; ChIP-seq, chromatin immunoprecipitation sequencing; CRISPR, clustered regularly interspaced short palindromic repeats; CyTOF, mass cytometry; DNA-seq, DNA sequencing; FACS, fluorescence activated cell sorting; HiC/3C/4C, chromosome conformation capture techniques; MethylSeq, methylation sequencing.

Financial model

Bioinformatics cores typically rely on a combination of institutional support and other various sources of funding to sustain their services. Of the small core respondents, 88% received some institutional support—for example, in the forms of salary support, space costs, software and hardware costs, year-end deficit and travel expenses, service contracts, supplies, or a lump sum at the beginning of the year. Five respondents of the 43 (12%) do not receive any institutional support. The 3 “other” responses included “vouchers to PIs” and 2 descriptions of the core finances that were uninterpretable in the context of the question (Fig. 2A). In addition to institutional funding, the majority of cores who responded gain financial support through grants, fees (service and training), and funds from external institutes and departments, including industry (Fig. 2B).

FIGURE 2. Financial support models for small bioinformatics cores. A) Institutional support to cores. Response to survey question: What form does institutional financial support to the core take? B) Other sources of funding. Response to survey question: What other sources of funding are part of your core’s funding model? C) Charging mechanisms. Response to survey question: How do you charge for your work? $/h, dollars per hour; $/FTE/mo, dollars per full-time equivalent per month.

Core facilities most often use one of two models when charging for their work: an hourly model, in dollars per hour (internal and external rates may differ), and a fraction model (i.e., percent-effort model), in dollars per full-time equivalent per month, which varies by person. The most common approaches were a hybrid hourly and fraction model (33%) and the hourly model alone (30%), followed by the fraction model alone (9%) (Fig. 2C). The average hourly rate of the respondents was $79/h for internal users (minimum $10; median $75; maximum $150) and $119/h for external users (minimum $30; median $117; maximum $475). Ten respondents (23%) indicated that they do not charge for their work (we did not ask specifically how those who did not charge recovered their costs). The “other” responses were descriptions of how they set their fees (“project-based pricing” and “fixed cost per analysis type”), not how they charge them. The majority of cores (93%) do not charge for initial consultations, such as experiment design, without analysis or sequencing.
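To make the difference between the two charging models concrete, the following minimal Python sketch computes a bill under each. The mean hourly rates are taken from the survey results above; the project sizes and the monthly cost per full-time equivalent are hypothetical values chosen only for illustration, as are the function names.

```python
def hourly_charge(hours, rate_per_hour):
    """Hourly model: bill for the hours actually worked."""
    return hours * rate_per_hour


def fraction_charge(fte_fraction, months, rate_per_fte_month):
    """Fraction (percent-effort) model: bill a share of one person's time."""
    return fte_fraction * months * rate_per_fte_month


INTERNAL_RATE = 79       # mean internal rate reported by respondents ($/h)
EXTERNAL_RATE = 119      # mean external rate reported by respondents ($/h)
FTE_MONTH_RATE = 10_000  # hypothetical cost of one FTE per month ($)

# A hypothetical 40-hour analysis billed by the hour.
internal = hourly_charge(40, INTERNAL_RATE)
external = hourly_charge(40, EXTERNAL_RATE)

# A hypothetical 25% effort commitment for 6 months.
retainer = fraction_charge(0.25, 6, FTE_MONTH_RATE)

print(internal, external, retainer)
```

A hybrid model, as reported by a third of the small cores, would simply apply one function or the other depending on the project.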

Software and data

Bioinformatics facilities use existing software and develop custom software to support the analyses they perform. The ability to support customized analyses for applications in which existing software is either completely missing or does not prove comprehensive or flexible enough to meet the needs of an investigator can be a key factor in driving usage of core facilities. When asked “What percentage of your time is spent using existing software tools vs. developing customized software?” at least 3 respondents apparently misinterpreted the intent of the question and responded 0% based upon their percent usage of custom software (made apparent by a subsequent answer that they do not develop custom software); these responses were excluded from the results. A fourth response of 10%, although a significant outlier, was not removed. The majority of cores indicated that they utilize existing software tools most of the time (mean 77%). The reasons for developing custom software were lack of existing solutions, lack of flexibility in current solutions, or that the client requested the software (Fig. 3A).

FIGURE 3. Development and support of custom software by small bioinformatics facilities. A) Reasons for development of custom software. Response to survey question: If developing custom software for investigators, why? B) Cost recovery for custom software. Response to survey question: If you develop custom software, how is your time paid for?

Of the facilities that produce custom software, reimbursement mechanisms include billing the investigator (either %full-time equivalent or fee for service), institutional subsidy, or charging an hourly rate. Only a very small percentage (2%) of cores do not charge for the development of custom software (Fig. 3B).

We also asked about the use of workflow management systems, which are used to create reproducible and scalable data analyses. Most respondents (60%) did not use a workflow management system or stated they were unaware of them. Of the 40% who did employ one of these systems, the majority (82%) use an existing pipeline framework, such as Snakemake,⁷ Nextflow (https://www.nextflow.io/index.html), bcbio (https://github.com/bcbio/bcbio-nextgen), and Galaxy (https://usegalaxy.org), whereas 18% use custom or other pipelines.

Although software development does not seem to present a challenge to small bioinformatics cores, data storage does. Small cores varied on data storage, with the largest share (49%) responding that they use local enterprise storage, followed by local core storage (32%) and cloud storage (5%) (Fig. 4). Some cores (14%) do not store investigator data files. However, of the cores that store investigator data files, only 19% responded that storage costs are built into their fees. As subsequent responses to a question about future challenges indicate, small cores are concerned about data storage going forward.

FIGURE 4. Data storage models at small bioinformatics cores. Response to survey question: How do you store investigator data files?

Reporting and accreditation

Because bioinformatics cores work within academic environments, 1 component of their work is to report on grants and publications, which is one of many administrative tasks. Most are required to report on grant submissions (53%) and manuscripts (63%) that they have helped with, and of these, most are also required to report which grants were funded (70%) and which manuscripts were accepted for publication (78%) (Fig. 5A). To keep track of funded grants and published manuscripts, most rely on being notified by the researcher directly or by the funding agency or journal (56% for grants and 52% for publications), but others also rely on searches of NIH RePORTER by individuals or administrative staff (or a combination of all of these) (Fig. 5B Grants, Fig. 5C Publications). Cores vary on their requirements for authorship in publications, with 28% responding that they require authorship, 28% requiring an acknowledgment, and 9% requiring neither authorship nor acknowledgment; 35% responded that it depended on the level of involvement and/or that it was not a requirement but a preference.

FIGURE 5. Reporting and accreditation practices at small bioinformatics cores. A) Reporting practices for grants and papers. Response to survey questions: Are you required to report on grant submissions you have helped with? If yes, are you required to report on which were funded? Are you required to report on manuscripts you have assisted with? If yes, are you required to report on which were accepted for publication? B) Recording and notification for grants. Response to survey question: How do you track which grants are funded? C) Recording and notification for papers. Response to survey question: How do you track which manuscripts were published? PI, principal investigator.

Future

As the field advances, there is increased pressure for core facilities to support the latest cutting-edge technologies and approaches (often unattainable by individual laboratories). When asked about current and future technological challenges, respondents shared that they are concerned about their capacity to provide analysis support for single-cell sequencing, long-read sequencing, spatial transcriptomics, metagenomics, machine learning/artificial intelligence applications, and integration of multiple data types (i.e., “multi-omics”). Current and future administrative challenges identified included managing capacity and staffing (or lack thereof), data storage and management, and fee structure/cost recovery models. The question of administrative challenges elicited the most diverse responses, in particular the description of staffing needs and capacity. Responses included concerns about “keeping up with requests,” “burnout with overworked members that need to know a little bit about everything,” “lack of manpower to meet demand,” “needing more bodies,” “hiring good people and making them want to stay and not move to industry,” “too few hands, too much work,” “difficult to get approval for hiring new personnel,” “under-funded and under-staffed,” and general overcommitment and lack of personnel. Multiple respondents also expressed feeling undervalued for their work and described disagreement/dissatisfaction among administrators, staff, and clients around fees, subsidies, authorship, training, and complexity of the work, including demand for custom analyses.

DISCUSSION

The survey highlights the challenges that small cores are facing as the fields of bioinformatics and genomics continue to expand. Most small cores support a large variety of bioinformatics and sequencing technologies, in addition to administrative functions and data management, with fewer than 5 faculty and/or staff. Emerging sequencing technologies bring increased business and diverse researchers to small cores, and with this, the quantity and quality requirements for data analytics grow. Covering such a broad range of expertise among a small staff presents a significant problem (i.e., with a larger staff, more specialization is possible, whereas in small cores, everyone often has to be able to do everything). Dividing effort among the many technologies also costs time, as staff need to self-educate on a particular method or analysis as they go.

The small cores use a variety of financial models that primarily include institutional support coupled with additional support from user fees, external institutes, grants, consulting, and industry. As found in previous surveys investigating core finances, cores continue to depend on a variety of institutional and external support.¹,² Although most small cores receive some institutional subsidies and deficit coverage, most are covering their costs with user fees and grants. This places cores in a precarious situation due to lack of stable funding in a challenging funding climate. Pressure to recover costs through fees is a concern for these small cores (i.e., “pricing themselves out of business”), and a largely fee-based funding model contributes to this concern. The added costs of training and development, which require time for bioinformatics staff to develop expertise in new and emerging technologies, make full cost recovery especially difficult.

The main challenges for small cores identified by the survey were sufficient personnel for the workload, custom software development, data storage, and data reporting. Of these, the most common and pressing challenge is attaining an adequate staffing level to keep up with data analysis. Questions the survey did not ask but that we would like to address in the future include how staff are recruited, why current staff accepted their positions, and what they see as their career path moving forward, as well as questions about staff turnover and/or overall job satisfaction. We likewise did not ask several questions that are important in data management beyond storage, such as how data are delivered, whether core staff have concerns about this process and data security, and the role of the supporting department or institution in that process. In addition to public repositories for the ever-increasing genomics data, institutions will need thoughtful plans to support the storage needs for their cores and their clients.

The wealth of data being generated within research groups and cores makes it increasingly important to address issues of data management, sharing, and reusability. Multiple efforts at the NIH are now underway to create data ecosystems that adhere to the FAIR principles, i.e., that data be findable, accessible, interoperable, and reusable.⁸ This will require customizable data management plans in research projects that include plans for data acquisition/generation, data processing/analysis, storage and data retirement (backups), as well as multilevel, customizable hierarchy definitions for metadata to support different types of projects. Machine-readable metadata describing data processing pipelines must be part of an overall tracking system for more accurate data integration and management. Lastly, project-specific automatic portal/report generation modules are essential; these should not only make the data available to user group(s) within appropriate security and licensing levels but also make the analyses reproducible by subsequent users once made public. Cores are uniquely positioned to participate in these efforts: many have already established processes for data sharing and management and are exploring implementations that enable reproducibility. However, considerable work and effort will be needed to achieve these goals, presenting both an opportunity and a challenge for cores moving forward.
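Machine-readable metadata describing a data processing pipeline could take many forms. The following is a minimal sketch in Python, assuming a hypothetical record layout: the class names, field names, tool names, and example values are all illustrative and are not drawn from the survey, from the FAIR principles themselves, or from any NIH specification.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PipelineStep:
    """One processing step: which tool ran, at what version, with what settings."""
    tool: str
    version: str
    parameters: dict = field(default_factory=dict)


@dataclass
class AnalysisRecord:
    """Hypothetical project-level record tying inputs to the steps applied."""
    project_id: str
    data_type: str
    inputs: list
    steps: list
    license: str
    access_level: str


def to_json(record):
    """Serialize a record to JSON so that other software can read it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Illustrative values only; none of these names refer to real tools or projects.
record = AnalysisRecord(
    project_id="PROJ-0001",
    data_type="RNA-Seq",
    inputs=["sample_A.fastq.gz", "sample_B.fastq.gz"],
    steps=[
        PipelineStep("example-aligner", "1.0", {"threads": 4}),
        PipelineStep("example-counter", "2.1", {"min_count": 10}),
    ],
    license="CC-BY-4.0",
    access_level="internal",
)

print(to_json(record))
```

Keeping the tool versions and parameters alongside the inputs is one way a tracking system could let a later user reconstruct how a result was produced; a production system would likely also need identifiers, checksums, and controlled vocabularies, which are omitted here.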

The future of bioinformatics and genomics core facilities relies on the ability to staff them with trained and qualified personnel who can address the mounting data analysis demands. The trajectory of bioinformatics trends upward as new sequencing technologies are developed, and core facilities are a cost-effective way to support a diverse breadth of researchers within institutions. Because the scale, complexity, and impact of bioinformatics analysis continue to grow at an increasing rate, it is imperative that institutions support the ongoing training and professional development of current and future bioinformatics professionals.

ACKNOWLEDGMENTS

The Genomics Bioinformatics Research Group thanks the Association of Biomolecular Resource Facilities for use of their Survey Monkey and the ABRF membership for helping to gather these data. S.P. would like to thank the Delaware IDeA Networks for Biomedical Research Excellence (INBRE) [U.S. National Institutes of Health (NIH) P20 GM103446]. S.S.L. is funded by the National Cancer Institute (NCI) (P30-CA14051) and National Institute of Environmental Health Sciences (NIEHS) (P30-ES002109). J.A.D. is funded in part by Vermont Center for Immunobiology and Infectious Disease Center for Biomedical Research Excellence (COBRE) (NIH P30 GM118228), Northern New England Clinical and Translational Research Network (U54 GM115516-01), and the Translational Global Infectious Research COBRE (P20GM125498), P01 CA098993/6-10, and U01 CA196383-01. The authors also thank Dr. Marni Slavik for assistance in preparing the manuscript. The authors declare no conflicts of interest.

REFERENCES

1. Ogorzalek Loo R, Nicolet CM, Niece RL, Young M, Simpson JT. Association of biomolecular resource facilities survey: service laboratory funding. J Biomol Tech. 2009;20:180–185.
2. Haley R. Institutional management of core facilities during challenging financial times. J Biomol Tech. 2011;22:127–130.
3. Riley MB. University multi-user facility survey-2010. J Biomol Tech. 2011;22:131–135.
4. Needleman D, Adam D, Detwiler M, et al. DNA sequencing research group: 2006 general survey of DNA sequencing facilities. J Biomol Tech. 2007;18:113–119.
5. Hutchinson A, Blake S, Nicolet C, Rosato C, Grills G, Dagnall C. GVRG 2012 research study survey: a current snapshot of core services. J Biomol Tech. 2012;23(Suppl):S16.
6. Knudtson KL, Auer H, Brooks AI, et al. The ABRF MARG microarray survey 2005: taking the pulse of the microarray field. J Biomol Tech. 2006;17:176–186.
7. Köster J, Rahmann S. Snakemake--a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2522.
8. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
