Abstract
Objective
Growth in big data and its potential impact on the healthcare industry have driven the need for more data scientists. In health care, big data can be used to improve care quality, increase efficiency, lower costs, and drive innovation. Given the importance of data scientists to U.S. healthcare organizations, I examine the qualifications and skills these organizations require for data scientist positions and the specific focus of their work.
Materials and Methods
A content analysis of U.S. healthcare data scientist job postings was conducted using an inductive approach to capture and categorize core information about each posting and a deductive approach to evaluate skills required. Profiles were generated for 4 job focus areas.
Results
There is a spectrum of healthcare data scientist positions that varies based on hiring organization type, job level, and job focus area. The focus of these positions ranged from performance improvement to innovation and product development with some positions more broadly defined to address organizational-specific needs. Based on the job posting sample, the primary skills these organizations required were statistics, R, machine learning, storytelling, and Python.
Conclusions
These results deepen our understanding of the qualifications and skills required for data scientist positions and may help organizations identify skills and knowledge areas that have been overlooked in position postings.
Keywords: data science, data scientist, content analysis, analytics
INTRODUCTION
Growth in big data in health care and its potential impact on the industry have driven the need for more healthcare workers and researchers trained as data scientists capable of working with big data. Big data is complex, as it consists of heterogeneous and unstructured datasets, which may include text, images, and video, across multiple areas. Researchers have estimated that unstructured data represent approximately 95% of big data.1 Analytic techniques such as machine learning and artificial intelligence are used to enhance analysis of both structured and unstructured data.2 Critical skills required of data scientists include understanding data heterogeneity and fragmentation as well as data management and conceptualization.3
Big data can be used by healthcare organizations to find innovative solutions as well as to improve care quality and efficiency. Innovation stems from the opportunity to combine traditional data with new data forms, for example, integrating population clinical data with genomics data to improve drug therapies.4 To that end, open-access initiatives are underway to increase availability of a variety of data sources for data sharing.5 Additionally, big data could be used in innovative ways to achieve greater efficiency, thereby reducing U.S. healthcare expenditures.6 There are 5 key areas of savings potential: clinical operations (eg, comparative effectiveness research), research and development (eg, precision medicine), new business models (eg, online platforms), payment (eg, fraud detection), and public health (eg, improved surveillance).6 In sum, the work of data scientists holds the potential to have significant impact in health care.
With this in mind, I analyze the content of healthcare data scientist job postings to identify the qualifications and skills required for this work. These research results can be used by healthcare organizations for human resource planning and by those providing education and training to target efforts toward needed skills.
Background
A data scientist is a practitioner who has the knowledge and skills to do data science work. As noted by the National Institute of Standards and Technology Big Data Working Group, data scientists extract knowledge from data to drive action.7 Important knowledge areas include statistics (ie, a working knowledge of probability, distributions, hypothesis testing, and multivariate analysis); computer science, which encompasses an understanding of data structures, algorithms, and database systems (eg, Hadoop); and problem formulation (ie, the ability to formulate problems to bring about effective solutions).2 Data scientists differ from data engineers in that data scientist core expertise is in math, statistics, and machine learning, whereas data engineer core expertise is in advanced programming and distributed systems.8 Machine learning skills, in particular, are becoming mandatory for data scientists for building automated decision systems that provide future predictions. The ability to mine text is also a prerequisite for working with unstructured data, particularly in health care, where much of the clinical data is in a note format.
Data scientists’ defining feature is their ability to go broad (eg, full data analysis cycle) as well as deep for at least 1 aspect of the field such as statistics or big data. This combination of breadth and depth is often called a “T” formation, in which the breadth of skills is represented by the horizontal line and the depth of skills is represented by the vertical bar of the T.9 Given the nature of the work, data science is an inherently collaborative and creative field, as these scientists generally work within interdisciplinary team environments to find innovative ways to complete projects.9 Data communication is also an important skill and includes expertise with visualization tools used to explore raw data as part of the iterative data science process.10
A number of reports have profiled data scientist roles across industries in an attempt to categorize positions with common focus areas. For example, O’Reilly Strata industry research defined 4 data scientist profession profiles (data businessperson, data creative, data developer, data researcher) and then mapped these profiles to a basic set of technology domains and competencies based on the data scientist practitioners’ self-identification.9 In another report, Burning Glass Technologies identified 6 job categories in the data science and analytics landscape (data scientists and advanced analysts, data analysts, data systems developers, analytics managers, functional analysts, and data-driven decision makers) based on similarities in skillsets and functional roles.11 Burning Glass Technologies reported that data scientists require the strongest analytical skills as well as proficiency with specialized tools such as machine learning, Apache Hadoop, and data mining. Moreover, data scientists require generalized skills such as SQL, R, and data analysis.11 In its data science survey, O’Reilly Strata found that Python and Spark were among the tools that contributed most to data scientists’ salaries, while SQL, Excel, R, and Python were the most commonly used tools.12
Demand for data science professionals continues to grow rapidly. Recent cross-industry estimates project demand for data scientists to grow by 28% by 2020.11 Employers are struggling to meet demand for data scientists.11 This skill shortage is compounded by the hybrid nature of data scientist positions; that is, needing a mix of analytic skills and domain-specific expertise, which is difficult to develop in 1 individual.11 This difficulty in finding qualified data scientist candidates is leading organizations to seek creative ways to develop and grow workforce talent in-house.
Healthcare organizations need to formulate strategies to use big data analytics more effectively to achieve healthcare transformation.13–15 Training staff to use big data analytics is one recommended strategy for doing so.16,17 Having repeated exposure to the data science life cycle (eg, posing a question, collecting data, exploring the data, developing models, making inferences, and communicating results) helps develop data acumen.18 Researchers have identified many healthcare big data use cases, including analyzing care patterns and unstructured data, building predictive models, and providing decision support16; knowledge generation and dissemination, patient engagement, and personalized medicine14; risk and resource use predictive modeling; and population management,19 all with potential impact for healthcare delivery.
Research questions and significance
While the work of healthcare data scientists has increased in importance as outlined previously, no research to date has been conducted specifically with regard to healthcare data scientist positions. The purpose of this study was to identify, based on data scientist job postings, the qualifications and skills required for healthcare data scientist positions. This study was guided by the following research questions:
What are the types of data scientist positions for which U.S. healthcare organizations are hiring and what is the focus of the work?
What job qualifications and skills are required for data scientist positions in U.S. healthcare organizations?
How do U.S. healthcare data scientist job qualifications and skills vary by job focus area?
MATERIALS AND METHODS
The research questions were addressed by analyzing the content of healthcare data scientist job postings, which contain the qualifications and skills that employers require in relation to a specific position. Content analysis is a method for systematically describing documents and written communications. The objective is to condense a volume of text through identifying and describing meaningful categories, analyzing patterns, and thereby achieving new insights.20–23 Content analysis of job announcements has been used in a variety of professions to evaluate hiring trends24,25 as well as for research in health care.16,26 Content analysis consists of the 3 phases of preparation, organization, and reporting.20 Figure 1 outlines the steps followed for the content analysis process in this study, which were adapted from previous research.20,21,27,28
Figure 1.
Content Analysis Process.
Preparation phase
The preparation phase includes defining the data collection method, sampling strategy, and unit of analysis.20,21 The data collection method in this study consisted of capturing active job postings on Indeed.com for U.S. healthcare data scientist positions from February to April 2018. Indeed.com aggregates jobs from organization websites and job boards. The text of each job posting was copied into a separate word-processing document and saved with the job title and organization in the filename.
The sampling strategy was based on a convenience sample of job postings from Indeed.com for healthcare organizations such as health systems, hospitals, insurance companies, vendors, and recruiters. Inclusion criteria for the sample were having “data scientist” in the job posting job title for healthcare organization positions. All active job postings that met the criteria were used in the analysis. Job postings were captured daily and spot checked for position type and organization. Duplicate job postings were eliminated. Job postings were reviewed during the data collection period to assess progress towards data saturation (ie, an indication of optimal sample size).29 After 3 months of capturing job postings, few, if any, new types of positions were identified signaling data saturation had been reached. A total of 198 job postings were identified during the specified time period. The unit of analysis was an individual job posting. The data were reviewed multiple times to better understand the data overall as well as job posting components.
Organization phase
Both inductive and deductive approaches were used in the organization phase. In an inductive content analysis, the concepts are generated from the data and thus adhere to the naturalistic paradigm.27,30 Deductive content analysis is typically used to evaluate data in a new context or to test categories.20 The inductive approach was used in this study to identify and categorize core job posting information while the deductive approach was used to evaluate skills required for specific job categories.
The inductive content analysis process includes open coding, categorization, and abstraction.20 Open coding was completed by highlighting key parts of the document during the initial review, inserting comments via word-processing software, and identifying meaning units or codes for portions of the text. Category creation is a core feature of qualitative content analysis.31 Categories help increase understanding of the research topic and generate knowledge.32 From descriptive headings that were created to describe the content, possible categories were generated and recorded in a spreadsheet. Job postings were reviewed again to abstract basic information such as job title, degree requirements, and years of experience. Iterative reviews were conducted to identify classifications by type of hiring organization, job title, and job focus. Each job focus area was then analyzed in further detail.
Deductive content analysis included developing structured matrices based on skills and coding according to the matrix categories. The skills list was derived from previous data science surveys9,12 and included both general and specific skills. Skills were coded based on job posting required qualifications and key responsibilities. Coding rules consisted of identifying key terms that matched in the job posting and skills matrix, and then reviewing the related description in the job posting to ensure that the context was appropriate. For example, this job posting text “research, design and prototype robust and scalable models based on machine learning, data mining, and statistical modeling to answer key business problems” was matched to this skill “developing prototype models.” An initial coding was completed for each of the data scientist positions to ensure skills were holistically captured. Only qualifications that fit the skill categories were coded. Recoding of job skills was conducted again approximately 30 days after initial coding and any discrepancies were noted and reanalyzed.
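The key-term matching step of these coding rules can be illustrated with a short sketch. It is not the procedure used in the study, which was carried out manually; the skills matrix excerpt, its key terms, and the function name `candidate_skills` are hypothetical. The sketch only flags candidate matches, which would still require the contextual review described above.

```python
import re

# Hypothetical excerpt of a skills matrix: skill label -> key terms.
# The labels echo skills named in this paper; the term lists are assumptions.
SKILLS_MATRIX = {
    "Applying machine learning techniques": ["machine learning"],
    "Developing prototype models": ["prototype"],
    "Statistics": ["statistical", "statistics"],
    "Python": ["python"],
    "R": ["r"],
}


def candidate_skills(posting_text, matrix=SKILLS_MATRIX):
    """Return skills whose key terms appear as whole words in the posting.

    These are candidates only: each match still needs a manual review of
    the surrounding text to confirm that the context is appropriate.
    """
    text = posting_text.lower()
    found = []
    for skill, terms in matrix.items():
        # Word boundaries keep the term "r" from matching inside "research".
        if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms):
            found.append(skill)
    return found


posting = (
    "research, design and prototype robust and scalable models based on "
    "machine learning, data mining, and statistical modeling to answer "
    "key business problems"
)
print(candidate_skills(posting))
```

Whole-word matching matters for short terms such as "R", which would otherwise match almost every posting.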
Reporting phase
Descriptive statistics were generated for the overall sample as well as for identified skills. All data for each category were included in the frequency counts representing 100% of the sample. Frequency counts by job levels, hiring organization type, and job focus areas were also generated to allow for more detailed analysis.
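As an illustration of how such frequency counts and percentages can be produced, the following minimal sketch tallies the job focus area counts reported in Table 1. The function name `frequency_table` is an assumption, and the per-posting labels are reconstructed from the reported counts rather than taken from the original coding spreadsheet.

```python
from collections import Counter

# Job focus area counts as reported in Table 1 (N = 198), expanded into one
# label per posting so the tally mirrors counting coded postings.
reported = {
    "Innovation": 14,
    "Performance Improvement": 74,
    "Product Development": 48,
    "Nonspecific": 62,
}
labels = [area for area, n in reported.items() for _ in range(n)]


def frequency_table(values):
    """Return {category: (count, percentage of all values to 1 decimal)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


table = frequency_table(labels)
for area, (n, pct) in table.items():
    print(f"{area}: {n} ({pct}%)")
```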
Trustworthiness
The qualitative content analysis process must be described in sufficient detail to establish trustworthiness.20 Various aspects of trustworthiness include credibility, confirmability, and dependability.21,22,31 In this study, trustworthiness was addressed in the following ways. Credibility was demonstrated through data analysis focused on the research questions. Categories were chosen that reflect the research topic and covered the data. Confirmability of findings was demonstrated through tables, a figure and quotes that show a direct link between the results of the study and the data, thus providing an audit trail.33 Dependability was addressed by collecting the data during a specific time period to ensure consistency and by the use of a code-recode strategy to help address intrarater reliability,27,34,35 specifically, consistent content categorization using the coding protocol.36
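One way to quantify consistency in a code-recode strategy is to compare the two coding passes item by item. The sketch below computes simple percent agreement and Cohen's kappa. The study does not report which agreement statistic, if any, was calculated, so the choice of statistics, the function names, and the 10 example codes are all assumptions made for illustration.

```python
from collections import Counter


def percent_agreement(first, second):
    """Share of items given the same code in both passes."""
    if len(first) != len(second) or not first:
        raise ValueError("passes must be non-empty and the same length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(first)
    p_observed = percent_agreement(first, second)
    c1, c2 = Counter(first), Counter(second)
    # Chance agreement from each pass's marginal category proportions.
    p_expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_expected == 1:
        return 1.0  # both passes used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up job focus codes for 10 postings; not data from the study.
initial = ["PI", "PD", "PI", "NS", "IN", "PD", "NS", "PI", "PD", "NS"]
recode = ["PI", "PD", "PI", "NS", "IN", "PD", "PI", "PI", "PD", "NS"]

print(percent_agreement(initial, recode))
print(round(cohens_kappa(initial, recode), 3))
```

Percent agreement is easy to interpret but does not account for agreement expected by chance, which is why kappa is shown alongside it.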
RESULTS
Descriptive statistics
Descriptive statistics for the overall sample based on the inductive analysis are shown in Table 1.
Table 1.
Sample descriptive statistics (N = 198)
| | Frequency | Percentage |
|---|---|---|
| Hiring Organization Type | | |
| Academia | 8 | 4.0% |
| Biotechnology | 4 | 2.0% |
| Consulting | 12 | 6.1% |
| Health System | 32 | 16.2% |
| Home Health | 1 | 0.5% |
| Hospital | 5 | 2.5% |
| Insurance Company | 37 | 18.7% |
| Pharmacy/Pharma | 9 | 4.5% |
| Physician Practice | 1 | 0.5% |
| Recruiter | 13 | 6.6% |
| Research | 9 | 4.5% |
| Vendor | 67 | 33.8% |
| Total | 198 | 100.0% |
| Degree Required | | |
| Advanced Degree | 8 | 4.0% |
| Bachelor’s and Above | 82 | 41.4% |
| Master’s and Above | 76 | 38.4% |
| PhD and MD or Equivalent | 18 | 9.1% |
| Not Listed | 14 | 7.1% |
| Total | 198 | 100.0% |
| Degree Preferred | | |
| Advanced Degree | 7 | 3.5% |
| Master’s and Above | 38 | 19.2% |
| PhD | 32 | 16.2% |
| Not Listed | 121 | 61.1% |
| Total | 198 | 100.0% |
| Job Level | | |
| Data Science Associate | 11 | 5.6% |
| Data Scientist | 125 | 63.1% |
| Senior Data Scientist | 60 | 30.3% |
| Manager/Director | 2 | 1.0% |
| Total | 198 | 100.0% |
| Job Focus Area | | |
| Innovation | 14 | 7.1% |
| Performance Improvement | 74 | 37.4% |
| Product Development | 48 | 24.2% |
| Nonspecific | 62 | 31.3% |
| Total | 198 | 100.0% |
| Experience | | |
| 1–2 y | 33 | 16.7% |
| 3–4 y | 44 | 22.2% |
| 5–7 y | 70 | 35.4% |
| 8–10 y | 11 | 5.6% |
| 11–15 y | 2 | 1.0% |
| Not listed | 38 | 19.2% |
| Total | 198 | 100.0% |
| States | | |
| California | 36 | 18.2% |
| New York | 22 | 11.1% |
| Illinois | 16 | 8.1% |
| Massachusetts | 11 | 5.6% |
| Missouri | 9 | 4.5% |
| Other States | 104 | 52.5% |
| Total | 198 | 100.0% |
There were 4 job-level categories (as noted in Table 1). About two-thirds of the job postings were in the data scientist category and about one-third were classified as senior data scientists, including principal and lead roles. Data scientist–level jobs were predominantly found in vendor organizations, health systems, and insurance companies. The minimum required degree was most often a bachelor’s degree. Senior data scientist–level positions were found primarily at vendors and insurance companies. More years of experience were typically required for senior-level positions, with 5–7 years of experience required in half of these positions. Higher levels of education were also required, specifically at the master’s degree level.
Positions most often required a bachelor’s degree and some required a master’s degree. Specific degree areas included quantitative fields such as computer science, engineering, and statistics. The main job focus areas of the posted positions were performance improvement, product development, and innovation. However, about one-third of the job postings were nonspecific. Job postings were generally distributed proportionally across the U.S. The top states based on the number of job postings were California, New York, Illinois, Massachusetts, and Missouri (see Table 1).
Positions varied by hiring organization type. Positions in health systems tended to focus on performance improvement, while vendor positions focused more on product development. Data scientist positions at health systems were found in departments such as enterprise analytics, clinical strategy, informatics, or population health and at insurance companies in departments named clinical analytics or corporate analytics. More advanced health systems or those associated with academic medical centers had data science or artificial intelligence departments. Data scientist positions at insurance companies often did not have a specific job focus and generally supported broader data science needs.
A distribution of top skills based on the job postings is shown in Table 2. The top required skills were statistics, R, machine learning, storytelling (eg, communicating what the data analysis means in a compelling way), and Python. For data scientist–level roles, statistics, storytelling, and R were top skills. For senior-level roles, machine learning, R, and Python were top skills. Microsoft SQL Server, Oracle, and MySQL were top requested relational database skills (see Figure 2). R, Python, and SQL were the most frequently required programming languages. Tableau was the top visualization tool; Spark and Hive were the top big data management platforms.
Table 2.
U.S. healthcare data scientist top 20 skills overall and by job level
| Data Scientist Skills | Overall (All Job Postings) n = 3218 | Overall Percentage Distribution | Data Scientist Level n = 2006 | Senior Data Scientist Level n = 1094 |
|---|---|---|---|---|
| Statistics (eg, general linear model, analysis of variance) | 138 | 4% | 94 | 40 |
| R | 136 | 4% | 87 | 44 |
| Applying machine learning techniques | 133 | 4% | 85 | 44 |
| Storytelling; delivering actionable results | 132 | 4% | 91 | 38 |
| Python | 125 | 4% | 79 | 41 |
| Communicating findings | 117 | 4% | 79 | 36 |
| Developing products | 117 | 4% | 75 | 39 |
| Data-driven problem solving | 112 | 3% | 71 | 39 |
| Data manipulation | 108 | 3% | 70 | 34 |
| Developing algorithms | 106 | 3% | 66 | 34 |
| Setting up/maintaining data platforms | 97 | 3% | 65 | 28 |
| SQL | 95 | 3% | 63 | 27 |
| Implementing models into production | 89 | 3% | 54 | 33 |
| SAS | 84 | 3% | 54 | 27 |
| Work in multidisciplinary teams | 84 | 3% | 55 | 28 |
| Creating visualizations | 73 | 2% | 45 | 25 |
| Identifying business problems to address | 64 | 2% | 38 | 23 |
| Big and Distributed Data | 61 | 2% | 34 | 24 |
| Hadoop | 61 | 2% | 25 | 34 |
| Unstructured Data (eg, noSQL, text mining) | 56 | 2% | 34 | 20 |
| Other | 1230 | 38% | 742 | 436 |
| TOTAL | 3218 | 100% | 2006 | 1094 |
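The top skills by job level can be read off Table 2 by ranking the per-level counts. The following sketch does this for a subset of rows: the 5 skills with the highest overall counts plus communicating findings, with labels shortened. The function name `top_skills` is an assumption, and the ranking is purely illustrative.

```python
# Counts copied from Table 2 for a subset of skills:
# (data scientist level, senior data scientist level).
table2 = {
    "Statistics": (94, 40),
    "R": (87, 44),
    "Machine learning": (85, 44),
    "Storytelling": (91, 38),
    "Python": (79, 41),
    "Communicating findings": (79, 36),
}

DATA_SCIENTIST, SENIOR = 0, 1


def top_skills(counts, level_index, k=3):
    """Return the k skills with the highest count at one job level.

    sorted() is stable, so tied skills keep their listed order.
    """
    ranked = sorted(counts, key=lambda s: counts[s][level_index], reverse=True)
    return ranked[:k]


print(top_skills(table2, DATA_SCIENTIST))
print(top_skills(table2, SENIOR))
```

At the senior level, R and machine learning are tied at 44 postings each, so their relative order in the output reflects only the row order of the table.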
Figure 2.
Data Scientist Technical Skills.
Job focus area analysis
Four main job focus profiles were developed: performance improvers, product developers, modelers, and innovators. Each is described in this section (see Table 3 for a list of top skills and Figure 3 for a data map by job focus area).
Table 3.
U.S. healthcare data scientist top 20 skills by job focus area
| Data Scientist Skills | Overall (All Job Postings) n = 3218 | Performance Improvers n = 1181 | Percentage Distribution | Product Developers n = 775 | Percentage Distribution | Modelers n = 1084 | Percentage Distribution | Innovators n = 178 | Percentage Distribution |
|---|---|---|---|---|---|---|---|---|---|
| Statistics (eg, general linear model, analysis of variance) | 138 | 55 | 4.7% | 30 | 3.9% | 46 | 4.2% | 7 | 3.9% |
| R | 136 | 50 | 4.2% | 32 | 4.1% | 46 | 4.2% | | 0.0% |
| Applying machine learning techniques | 133 | 44 | 3.7% | 41 | 5.3% | 40 | 3.7% | 8 | 4.5% |
| Storytelling; delivering actionable results | 132 | 55 | 4.7% | 27 | 3.5% | 40 | 3.7% | 10 | 5.6% |
| Python | 125 | 42 | 3.6% | 41 | 5.3% | 39 | 3.6% | 3 | 1.7% |
| Communicating findings | 117 | 51 | 4.3% | 18 | 2.3% | 38 | 3.5% | 10 | 5.6% |
| Developing products | 117 | 42 | 3.6% | 30 | 3.9% | 36 | 3.3% | 9 | 5.1% |
| Data-driven problem solving | 112 | 40 | 3.4% | 25 | 3.2% | 41 | 3.8% | 6 | 3.4% |
| Data manipulation | 108 | 47 | 4.0% | 19 | 2.5% | 38 | 3.5% | 4 | 2.2% |
| Developing algorithms | 106 | 39 | 3.3% | 27 | 3.5% | 36 | 3.3% | 4 | 2.2% |
| Setting up/maintaining data platforms | 97 | 37 | 3.1% | 20 | 2.6% | 34 | 3.1% | 6 | 3.4% |
| SQL | 95 | 40 | 3.4% | 23 | 3.0% | 30 | 2.8% | 2 | 1.1% |
| Implementing models into production | 89 | 32 | 2.7% | 19 | 2.5% | 35 | 3.2% | 3 | 1.7% |
| SAS | 84 | 36 | 3.0% | 6 | 0.8% | 37 | 3.4% | 5 | 2.8% |
| Work in multidisciplinary teams | 84 | 32 | 2.7% | 22 | 2.8% | 23 | 2.1% | 7 | 3.9% |
| Creating visualizations | 73 | 31 | 2.6% | 15 | 1.9% | 25 | 2.3% | 2 | 1.1% |
| Identifying business problems to address | 64 | 26 | 2.2% | 11 | 1.4% | 23 | 2.1% | 4 | 2.2% |
| Big and Distributed Data | 61 | 17 | 1.4% | 13 | 1.7% | 26 | 2.4% | 5 | 2.8% |
| Hadoop | 61 | 18 | 1.5% | 16 | 2.1% | 25 | 2.3% | 2 | 1.1% |
| Unstructured Data (eg, noSQL, text mining) | 56 | 21 | 1.8% | 12 | 1.5% | 18 | 1.7% | 5 | 2.8% |
| Other | 2240 | 805 | 68.2% | 572 | 73.8% | 734 | 67.7% | 129 | 72.5% |
| TOTAL | 3218 | 1181 | 100.0% | 775 | 100.0% | 1084 | 100.0% | 178 | 100.0% |
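Because the 4 focus areas differ considerably in size, skills are easier to compare across areas as a share of each area's total coded skills than as raw counts. The sketch below recomputes the percentage distribution for 2 rows of Table 3; the function name `shares` is an assumption.

```python
# Column totals (n) and counts copied from Table 3 for two skills.
totals = {
    "Performance Improvers": 1181,
    "Product Developers": 775,
    "Modelers": 1084,
    "Innovators": 178,
}
counts = {
    "Python": {
        "Performance Improvers": 42,
        "Product Developers": 41,
        "Modelers": 39,
        "Innovators": 3,
    },
    "Applying machine learning techniques": {
        "Performance Improvers": 44,
        "Product Developers": 41,
        "Modelers": 40,
        "Innovators": 8,
    },
}


def shares(skill):
    """Percentage of each focus area's coded skills accounted for by one skill."""
    return {
        area: round(100 * n / totals[area], 1)
        for area, n in counts[skill].items()
    }


python_shares = shares("Python")
print(python_shares)
# Focus area in which the skill makes up the largest share of coded skills.
print(max(python_shares, key=python_shares.get))
```

Although Python was coded about as often for performance improvers (42) as for product developers (41), it accounts for a larger share of the product developer profile because that group has fewer coded skills overall.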
Figure 3.
Data Scientist Job Focus Data Map.
Performance Improvers
Performance improvers were sought to work on areas such as quality measures, financial performance, and patient outcomes. In some cases, these positions were specific to areas such as population health, decision support, or biomedical informatics. About two-thirds of these positions were at the data scientist job level vs senior level. Five to 7 years of experience was most commonly required. A bachelor’s degree was required about half the time, followed by a master’s degree. These positions were most often found in health systems or insurance companies. Top skills required included statistics, storytelling, and communicating findings (see Table 3). Examples of key responsibility areas included the following:
“Compile and analyze data to improve health outcomes and increase access to health services for populations at risk.”
“Responsible for defining, building, and improving statistical models to improve business processes and outcomes in one or more healthcare domains such as Clinical, Enrollment, Claims, and Finance.”
“Develops advanced statistical models to predict, quantify or forecast various operational and performance metrics in multiple healthcare domains.”
Product Developers
Product developers focused on a wide range of product development areas including population health, performance improvement, digital health, decision support, speech/language solutions, behavioral health, and claims analytics. The majority of these positions were at the data scientist level and most frequently required 5–7 years of experience. A higher educational level was required, specifically, at the master’s/PhD level. These positions were overwhelmingly found at vendors. Top skills required included machine learning, Python, and R (see Table 3). Examples of key responsibility areas included the following:
“Apply data mining and machine learning techniques to develop better personalization and recommendation for patients’ and doctors’ needs.”
“Develop analyses and user-centered software products that inform the day-to-day decision-making of physicians and hospitals.”
“Develop models to infer or predict disease conditions with a high level of accuracy.”
Modelers
Modeler job postings were nonspecific and required core data science skills. Organizations hiring for these positions were looking for data science bench strength, often machine learning pros. These positions were primarily at the data scientist level with some at a more senior level. Modeler positions typically required 5–7 years of experience. This group of positions most commonly required a bachelor’s degree. Organizations hiring modelers included insurance companies, vendors, and consulting organizations. Top skills required included R, statistics, data-driven problem solving, and machine learning (see Table 3). Examples of modeler key responsibility areas included the following:
“Serve as the ‘methodology-expert’ on the team and provide guidelines to team for applying appropriate algorithms or solutions to tackle different problems. Coach junior team members on various data science techniques.”
“Conduct statistical modeling and experimental design on a variety of healthcare datasets, including claims and pharmacy data, biometric data, and healthcare outcomes. Train and validate predictive and categorical algorithms from large datasets.”
Innovators
Innovators were an eclectic group addressing topical areas such as health standards, personalized or precision medicine, genomics, and biology but from a healthcare delivery or informatics perspective. The majority of innovators were at the data scientist level and required 5–7 years of experience. Educational degree requirements varied, with bachelor’s and master’s/PhD being most common. Top hiring organizations included biotech, vendors, and recruiters. Top skills required included communicating findings, storytelling, and product development (see Table 3). Examples of key responsibility areas included the following:
“Working within the innovation team, address real-world challenges such as the rising cost of healthcare, mobility in urban society and payment card fraud in financial welfare programs, using the power of Data Analytics.”
“Develop evidence generation strategies, identify evidence gaps and data sources, design and execute studies, and implement analyses to address molecule and disease area questions; develop next-generation AI for precision healthcare.”
“Design and implement efficient systems for processing large-scale biomedical datasets and conducting rapid experimentation of machine learning and NLP systems.”
DISCUSSION
This study, which was based on job postings, was the first to identify and analyze the required qualifications, skills, and job focus for healthcare data scientist positions. Of note, these positions are often tied to strategic initiatives such as performance improvement, innovation, and product development. The focus area of job postings varied by type of hiring organization, with health system roles generally more focused on performance improvement, and not surprisingly, vendor roles more focused on product development.
There is a spectrum of healthcare data scientist positions. Data scientist roles at health systems and insurance companies are evolving as organizations seek to grow the depth of their operational analytics work. This study found that product developers and modelers have more data science domain expertise and can apply those skills to a wider range of areas. At the more advanced end of the spectrum are the innovators, who have more ability and training in research and science, in particular, precision medicine.
Healthcare organizations are investing in data scientist positions, with vendors and health systems seeking the most applicants. Vendors and insurance companies advertised the most for positions at the senior level, signifying an advancing level of work. With respect to education, a bachelor’s degree was frequently requested though broader education and/or experience was often preferred. As data science expertise can be gained through a variety of channels, hiring requirements must be somewhat flexible.
Healthcare data scientist job skills generally parallel those outlined in previous research. For example, Spark and Hive were top requested skills for healthcare data scientists, as was also the case in previous cross-industry data scientist surveys. With respect to relational databases, Microsoft SQL Server was required more frequently than MySQL, and with respect to programming languages, R was required more frequently than SQL. Finally, machine learning skills were generally requested for healthcare data scientist positions, but the applications were not usually specified.
The mix of skills and work varies by job level. For all healthcare data scientist levels, a strong understanding of statistics was a given, whereas for more senior or broader roles, proficiency in machine learning was an absolute requirement. This indicates that individuals interested in a career in data science will need to gain expertise in programming languages such as R and Python. Data scientist–level positions were more closely aligned to analytics across a range of organizational areas. As noted, senior-level positions tended to require stronger programming and modeling skills, as evidenced by the higher ranking of Python and machine learning. These senior-level data scientists often mentor other team members.
While data scientist–level roles require technical and statistical expertise, the importance of data communication should not be underestimated. For example, storytelling and communicating findings were top required skills for many positions. Working effectively within a multidisciplinary team was also often listed as a requirement, signifying the broad base of the work. Proven problem-solving skills were also often listed, as they indicate that the candidate can deal with the challenges involved in addressing business problems in creative ways. Exposure to real-world problems helps develop data acumen.
As shown, data scientist positions require diverse qualifications and skills. Thus, organizations will need to consider candidates with a variety of backgrounds and experience. Critical to filling these positions will be prioritization of the top skills and domain expertise required as well as understanding what skills may be developed internally or through additional training. Building a well-defined understanding of talent needs will enable organizations to invest strategically in data science talent pipeline development.11 With this in mind, mentors can help grow talent within an organization. Importantly, organizational culture must fully appreciate this area to ensure retention of key data scientist roles.37
To help meet market demand, analytic skills need to be developed in the broader workforce and career paths need to be provided for analytic professionals to grow their skills over time. Further, educational programs should identify opportunities for students to develop and then apply data science skills. Additionally, as the increase in data impacts nearly every career path, data literacy needs to become institutionalized; moreover, students need to be exposed to data science as early as possible.11
Four limitations were identified in this study. First, the job posting data were collected at only 1 point in time and, as job requirements tend to change over time, this limits the transferability of the findings. Second, as a convenience sample was used, the results are not generalizable. Third, because the content analysis coding was based on the interpretation of 1 researcher, there is a potential for bias. To mitigate this third limitation, coding was reviewed and validated multiple times. Last, content analysis depends on job posting content, which varied; in some cases, postings were missing key items such as required skills or education. These missing data may impact dependability.
CONCLUSION
This study highlights the critical role of data scientists and how healthcare organizations are seeking data scientists to address specific priority areas. There is no “1-size-fits-all” data scientist; rather, the positions may vary broadly. Growing demand for healthcare data scientists requires enhanced professional development, training, and education. The necessary skills for data scientists will likely evolve as market needs change. Further study is needed to monitor trends over time.
AUTHOR CONTRIBUTIONS
MAM was the sole contributor.
Conflict of interest statement. None declared.
REFERENCES
- 1. Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manag 2015; 35 (2): 137–44.
- 2. Dhar V. Data Science and Prediction. 2012. http://hdl.handle.net/2451/31553. (accessed November 19, 2018).
- 3. Bhavnani SP, Munoz D, Bagai A. Data science in healthcare: implications for early career investigators. Circ Cardiovasc Qual Outcomes 2016; 9 (6): 683–7.
- 4. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2014; 2 (3): 1–10.
- 5. Krumholz H, Gross C, Blount K. Sea change in open science and data sharing: leadership by industry. Circ Cardiovasc Qual Outcomes 2014; 7 (4): 499–504.
- 6. Manyika J, Chui M, Brown B et al. Big Data: The Next Frontier for Innovation, Competition, and Productivity. 2011. https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation.
- 7. NIST Big Data Public Working Group. NIST SP 1500–1 NIST Big Data Interoperability Framework (NBDIF): Volume 1: Definitions. 2015. http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500–1.pdf.
- 8. Anderson J. Data Engineers vs. Data Scientists. 2018. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists.
- 9. Harris H, Murphy S, Vaisman M. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. 2013. http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf.
- 10. Larson D, Chang V. A review and future direction of agile, business intelligence, analytics and data science. Int J Inf Manag 2016; 36 (5): 700–10.
- 11. Burning Glass Technologies. The Quant Crunch: How Demand for Data Science Skills is Disrupting the Job Market. 2017. Boston, MA: Burning Glass Technologies; https://www.burning-glass.com/wp-content/uploads/The_Quant_Crunch.pdf
- 12. King J, Magoulas R. 2016 Data Science Salary Survey. 2016. https://www.oreilly.com/data/free/2016-data-science-salary-survey.csp.
- 13. Cortada J, Gordon D, Lenihan B. The Value of Analytics in Healthcare. Somers, NY: IBM Global Business Services; 2012.
- 14. Murdoch T, Detsky A. The inevitable application of big data to health care. JAMA 2013; 309 (13): 1351–2.
- 15. Wegener R, Sinha V. The Value of Big Data: How Analytics Differentiates Winners. Boston, MA: Bain & Company; 2013.
- 16. Wang Y, Kung L, Byrd TA. Big data analytics: understanding its capabilities and potential benefits for healthcare organizations. Technol Forecast Soc Change 2018; 126: 3–13.
- 17. Westra BL, Clancy TR, Sensmeier J, Warren JJ, Weaver C, Delaney CW. Nursing knowledge: big data science-implications for nurse leaders. Nurs Adm Q 2015; 39 (4): 304–10.
- 18. National Academies of Sciences. Data Science for Undergraduates: Opportunities and Options. Washington, DC: National Academy Press; 2018.
- 19. Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nat Rev Cardiol 2016; 13 (6): 350–9.
- 20. Elo S, Kyngas H. The qualitative content analysis process. J Adv Nurs 2008; 62 (1): 107–15.
- 21. Elo S, Kaariainen M, Kanste O, Polkki T, Utriainen K, Kyngas H. Qualitative content analysis: a focus on trustworthiness. SAGE Open 2014; 4 (1): 1–10.
- 22. Bengtsson M. How to plan and perform a qualitative study using content analysis. NursingPlus Open 2016; 2: 8–14.
- 23. Vaismoradi M, Turunen H, Bondas T. Content analysis and thematic analysis: implications for conducting a qualitative descriptive study. Nurs Health Sci 2013; 15 (3): 398–405.
- 24. Choi Y, Rasmussen E. What qualifications and skills are important for digital librarian positions in academic libraries: a job advertisement analysis. J Acad Librariansh 2009; 35 (5): 457–67.
- 25. Todd P, McKeen J, Gallupe R. The evolution of IS job skills: A content analysis of IS job advertisements from 1970 to 1990. MIS Q 1995; 19 (1): 1–27.
- 26. Sowles SJ, McLeary M, Optican A et al. A content analysis of an online pro-eating disorder community on Reddit. Body Image 2018; 24: 137–44.
- 27. Mayring P. Qualitative Content Analysis: Theoretical Foundation, Basic Procedures and Software Solution. 2014. https://www.ssoar.info/ssoar/bitstream/handle/document/39517/ssoar-2014-mayring-Qualitative_content_analysis_theoretical_foundation.pdf.
- 28. Mayring P. Qualitative content analysis. Forum Qualitative Soc Res 2000; 1 (2): 1–7.
- 29. Guthrie J, Petty R, Yongvanich K, Ricceri F. Using content analysis as a research method to inquire into intellectual capital reporting. J Intellect Cap 2004; 5 (2): 282–93.
- 30. Hsieh HF, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res 2005; 15 (9): 1277–88.
- 31. Graneheim UH, Lundman B. Qualitative content analysis in nursing research: Concepts, procedures and measures to achieve trustworthiness. Nurse Educ Today 2004; 24 (2): 105–12.
- 32. Cavanagh S. Content analysis: concepts, methods and applications. Nurse Res 1997; 4 (3): 5–13.
- 33. Bowen G. Supporting a grounded theory with an audit trail: an illustration. Int J Soc Res Methodol 2009; 12 (4): 305–16.
- 34. Krefting L. Rigor in qualitative research: the assessment of trustworthiness. Am J Occup Ther 1991; 45 (3): 214–22.
- 35. Mackey A, Gass S. Second Language Research: Methodology and Design. 2nd ed Florence, KY: Routledge; 2015.
- 36. Lacy S, Watson B, Riffe D, Lovejoy J. Issues and best practices in content analysis. Journal Mass Commun Q 2015; 92 (4): 791–811.
- 37. Dunn M, Bourne P. Building the biomedical data science workforce. PLoS Biol 2017; 15 (7): e2003082.



