Author manuscript; available in PMC: 2025 Sep 24.
Published in final edited form as: J Stat Data Sci Educ. 2024 Sep 24;32(4):331–344. doi: 10.1080/26939169.2024.2394541

Open Case Studies: Statistics and Data Science Education through Real-World Applications

Carrie Wright a, Qier Meng b, Michael R Breshock c, Lyla Atta c, Margaret A Taub b, Leah R Jager d, John Muschelli e, Stephanie C Hicks f,*
PMCID: PMC12002412  NIHMSID: NIHMS2019254  PMID: 40241937

Abstract

With unprecedented and growing interest in data science education, there are limited educator materials that provide meaningful opportunities for learners to practice statistical thinking, as defined by Wild and Pfannkuch, with messy data addressing real-world challenges. As a solution, Nolan and Speed advocated for bringing applications to the forefront in undergraduate statistics curriculum with the use of in-depth case studies to encourage and develop statistical thinking in the classroom. Limitations to this approach include the significant time investment required to develop a case study – namely, to select a motivating question and to create an illustrative data analysis – and the domain expertise needed. As a result, case studies based on realistic challenges, not toy examples, are scarce. To address this, we developed the Open Case Studies (opencasestudies.org) project, which offers a new statistical and data science education case study model. This educational resource provides self-contained, multimodal, peer-reviewed, and open-source guides (or case studies) from real-world examples for active experiences of complete data analyses. We developed an educator’s guide describing how to most effectively use the case studies, how to modify and adapt components of the case studies in the classroom, and how to contribute new case studies (opencasestudies.org/OCS_Guide).

Keywords: applied statistics, data science, statistical thinking, case studies, education, computing

1. Introduction

A major challenge in the practice of teaching data science and statistics is the limited availability of courses and course materials that provide meaningful opportunities for students to practice and apply statistical thinking, as defined by Wild and Pfannkuch (1999), with messy data addressing real-world challenges across diverse context domains. The importance of statistical thinking has been discussed widely in the literature (Hicks and Irizarry, 2018; Wood et al., 2018). Furthermore, studies have suggested that learners in undergraduate science and math courses perform better in classes with active learning experiences (Freeman et al., 2014). To address this problem, Nolan and Speed (1999) presented a model for developing case studies (also known as ‘labs’) for use in undergraduate statistics courses with a specific goal to “encourage and develop statistical thinking”. Specifically, the model calls for each case study to be:

“a substantial exercise with nontrivial solutions that leave room for different analyses, and for it to be a central part of the course. The lab should offer motivation and a framework for studying theoretical statistics, and it should give students experience with how statistics can be used to answer questions about a scientific problem. An important goal of this approach is to encourage and develop statistical thinking while imparting knowledge in mathematical statistics.” (Nolan and Speed, 1999)

Hicks and Irizarry (2018) stated that one of their five principles for teaching data science was to “organize the course around a set of diverse case studies” based on the model by Nolan and Speed (1999), with a goal of practicing statistical thinking and bringing real-world applications into the classroom. Case studies are also being used in the classroom across a diverse set of fields, including statistics (Weinberg and Abramowitz, 2000; Schafer and Ramsey, 2003; Khachatryan and Karst, 2017; Rivera et al., 2019; Donoghue et al., 2021), evolutionary biology (Reyes and McTavish, 2022), engineering (Romero et al., 1995), and environmental science (Theobold et al., 2021).

However, there are several limiting factors to scaling up the use of case studies. First, the process of selecting motivating questions (Arnold and Franklin, 2021), finding real-world and motivating data (Neumann et al., 2013; Donoho, 2017), describing the context around the data (Wood et al., 2018; Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective et al., 2018), and preparing diverse didactic data analyses requires a large initial investment in time and effort (Hicks and Irizarry, 2018). Second, the individuals who are most primed to develop effective and insightful case studies are practitioner-instructors (Kross and Guo, 2019), or practicing applied statisticians and data scientists, who teach and practice in a field-specific context. For these individuals, successfully constructing a diverse set of case studies across a wide range of contextual topics may require collaboration with individuals in other disciplines; this can be hard without protected time and effort from their academic institutions (Waller, 2018). Third, while there are rich repositories of data sets (Rivera et al., 2019), there are few collections of associated data analyses that show how the data can be used to demonstrate fundamental data science and statistical concepts, potentially with unexpected outcomes (Peng et al., 2021). This is especially true for complex and messy data, where analysis decisions must go beyond what can be summarized in a brief summary about the data, such as a README file (Vilhuber et al., 2022; Dogucu and Çetinkaya-Rundel, 2022). These challenges have resulted in a scarcity of case studies based on real-world challenges instead of simple toy examples. Moreover, many data repositories have different recommended processing and analysis of subsets of data, which are commonly used as “the” analysis, without proper discussion of alternative choices along the research pathway.

To address these challenges, we developed an open-source educational resource, the Open Case Studies (OCS) project (opencasestudies.org). This resource contains in-depth, self-contained, engaging (including in some cases interactive plots, tables, dashboards, and gifs), and peer-reviewed experiential guides (or case studies) that demonstrate illustrative data analyses covering a diverse range of statistical and data science topics to teach learners how to effectively derive knowledge from data. These guides can be used by instructors to bring applications to the forefront in the classroom or they can be used by independent learners outside of the classroom. Finally, we developed an educator’s guide describing how to most effectively use the case studies, how to modify and adapt components of the case studies in the classroom, and how to contribute new case studies (opencasestudies.org/OCS_Guide).

2. Putting OCS model into practice

2.1. An overview of the Open Case Studies model

The case-studies model described by Nolan and Speed (1999) divides each case study into five main components: (i) introduction, (ii) data description, (iii) background, (iv) investigations, and (v) theory, with an optional section for advanced analyses or related theoretical material. In our Open Case Studies (OCS) model, we expand these into thirteen components. Table 1 describes the components of the OCS model as well as the mapping between our model and the original model of Nolan and Speed (1999).

Table 1: Components of an Open Case Study.

Descriptions of the components of our Open Case Studies model (left) and their mapping to the components of the case studies model proposed by Nolan and Speed (1999) (right). We note that the model from Nolan and Speed (1999) orders ‘Data description’ before ‘Background’. However, Background is listed first here to more easily map to our Open Case Studies model.

Mapping of components between two case study models

| OCS component | OCS description | Nolan and Speed component | Nolan and Speed description |
|---|---|---|---|
| 1. Motivation | Motivating content at the start of the case study | | |
| 2. Main questions | Question(s) to explore | Introduction | Describes context of real-world question and motivation |
| 3. Learning objectives | Both data science and statistics learning objectives | | |
| 4. Context | Context of question(s) or data | | |
| 5. Limitations | Any limitations in case study or with data used | Background | Information to put question in context using non-technical language |
| 6. What are the data? | Summary of where the data came from and what the data contain | Data description | Documentation for data collected to address the question |
| 7. Data import | Analyses for importing data | | |
| 8. Data wrangling and exploration | Analyses for wrangling and exploring the data | | |
| 9. Data visualization | Analyses for data visualization | Investigations | Suggestions for answering the question (varies in difficulty) |
| 10. Data analysis | Analyses containing statistical concepts and methods to answer question(s) | Theory | Describes relevant statistical concepts and methodologies to answer the question |
| 11. Summary | Summary of results | | |
| 12. Suggested homework | Question(s) to explore further | | |
| 13. Additional information | Helpful links or packages used | Extended material (optional) | Describes advanced analyses or related theoretical material |

We highlight that while the structures of the two case-study models are similar, our OCS model has a different purpose. Briefly, Nolan and Speed (1999) designed case studies to be either (i) used in open-ended discussions in lecture or (ii) used as open-ended lab exercises where students do extensive analyses outside of class and write reports containing their observations and solutions. In both applications, the case studies are designed to be open-ended; the background may be initially discussed in class or as part of an assignment, but students work independently or in a group to create their own solutions and summarize their own findings in a full-length report to answer the original question. In contrast, we made a design choice to build case studies that are full-length, in-depth experiential guides that walk learners through the entire process of data analysis, with an emphasis on computing (Nolan and Lang, 2010), starting from a motivating question and ending with a summary of the results. Our goal is for educators either to directly use an entire case study in the classroom or to adapt a subset of the material for their use. For example, an educator can choose to show the solutions provided in the case study, show a different solution, or leave the discussion open-ended. Our reasoning for providing full-length guides is that it is typically easier for an educator to remove or modify material instead of creating it from scratch. In this way, we aim to reach a broader audience than just educators in a classroom, as any learner interested in a particular topic can walk through the case study to see an example of a complete data analysis. In addition, this method is particularly helpful for instructors who may not feel confident creating an analysis from scratch, especially if it is outside their main area of expertise, as our case studies are built with domain experts and are peer-reviewed.

2.2. Components of the Open Case Studies model

We will describe the thirteen individual components of our Open Case Studies model (Table 1) using one case study as an example. Currently all of our case studies showcase how to use the R statistical programming language (R Core Team, 2021) for data analyses, although other programming languages could be used with our model. Here, we use the “Exploring CO2 emissions across time” case study (opencasestudies.org/ocs-bp-co2-emissions), which explores global and country level carbon dioxide (CO2) emissions from the 1700s to 2014 (Figure 1). This case study also investigates how CO2 emission rates may relate to increasing temperatures and increasing rates of natural disasters in the United States (US). We also describe four other case studies (Table 2) and give example topics covered in all case studies (Table S1).

Figure 1: Example of a motivating figure in the “Exploring CO2 emissions across time” case study.

The complete case study can be found at (opencasestudies.org/ocs-bp-co2-emissions). Top row: Line plot showing the increase in CO2 emissions over time (left). Longitudinal heatmap plot highlighting that the US has been one of the top emission producing countries historically and currently (right). Bottom row: The top left plot shows CO2 emissions in Metric Tons from the United States over time in years. The bottom left plot shows the annual average temperature in Fahrenheit over time in years. The blue line shows a smoothed trend line using the locally estimated scatterplot smoothing (loess) method, a type of local polynomial regression fitting. The scatter plot on the right shows the relationship of scaled CO2 emissions and scaled average temperature in the United States where each data point is a year from 1980-2014. The values of emissions and average temperature were scaled by subtracting the mean and dividing by the standard deviation. The blue trend line shows the trend using the linear model method.

Table 2: Description of four example case studies in the OCS resource.

This table shows the topics covered in four individual case studies, as well as information about the raw data. EPA = the US Environmental Protection Agency, NASA = National Aeronautics and Space Administration, NCHS = the National Center for Health Statistics, NYTS = the National Youth Tobacco Survey, and NOAA = National Oceanic and Atmospheric Administration. CO2 emission data obtained from Gapminder was originally from the World Bank. School Shooting data was obtained from the Center for Homeland Defense and Security at the Naval Postgraduate School (NPS).

Example case studies in the OCS resource

| Topic | Question(s) | Data source(s) | Raw data | Data science skills | Statistical concepts |
|---|---|---|---|---|---|
| Air Pollution | Can we predict annual fine particulate air pollution concentrations using predictors such as population density, urbanization, and satellite data? | Gravimetric EPA air pollution data (from 2008) and predictor data from NASA, the US Census, and NCHS | Single curated CSV file | tidymodels, correlation visualizations, geospatial visualizations | machine learning, linear regression, random forest |
| Vaping | How has tobacco / nicotine product use by American youths changed since 2015? Is there a relationship between e-cigarette / vaping use and other tobacco / nicotine product use? | NYTS 2015-2019 survey data | Excel files and codebooks for each year | importing Excel files, importing multiple files efficiently, merging data, writing functions, functional programming, longitudinal visualizations | survey weighting, logistic regression with survey weighting, longitudinal data, codebooks |
| CO2 Emissions | How have global CO2 emission rates changed over time? In particular for the US, and how does the US compare to other countries? Are CO2 emissions in the US, global temperatures, and natural disaster rates in the US associated? | CO2 emissions (from 1751-2019), GDP and energy use data from Gapminder. US temperature and disaster data from NOAA | XLSX and CSV files | importing data from Excel files and CSV files, data joining, longitudinal data visualizations, plots with text and labels | correlation coefficient, relationship between correlation and linear regression, correlation vs. causation |
| US School Shootings | What has been the yearly rate of school shootings and where in the country have they occurred in the last 50 years (from January 1970 to June 2020)? | Open-source K-12 school shooting database (1970-2020) | Single CSV file, Google sheets | importing Google sheets, date formats, geocoding, interactive tables, R Markdown, maps, interactive dashboards | calculating percentages for data with missing values |

1. Motivation.

Currently, each case study begins with a motivating data visualization. This idea originated from Dr. Mine Çetinkaya-Rundel’s talk entitled ‘Let Them Eat Cake (First)!’, presented at the Harvard University Statistics Department’s 2018 David K. Pickard Memorial Lecture (Çetinkaya-Rundel, 2018). She argues that, similar to a recipe book about baking cakes, showing a learner a visualization first can be motivating and give learners a sense of what they will be doing. This practice of showing a visualization at the start of a data analysis and then showing learners the code for how to produce the data visualization enables the learners to have a better sense of the final product and can be motivating to learn the more challenging concepts needed to make the visualization. We also note that the motivating visual may not just be a figure, but could also be a dashboard or video with images.

The motivating figure from the CO2 emissions case study (Figure 1) is reproduced here. In the case studies, we also include text explaining the motivation for the case study. Our case studies are often motivated by a recent report or publication investigating a specific real-world question. In this section, we explain why the topic is of interest and define any terms that are needed to understand the main questions of interest (described in the next section).

2. Main questions.

In this section, we highlight and explicitly state a precise set of real-world question(s) or problem(s) before beginning the analysis (Ratan et al., 2019). When the case study is motivated by a previous publication, these questions may not be exactly the same as what was originally investigated in the paper or report. For example, a case study may only investigate a small subset of the results presented in the report or publication. Alternatively, a case study may not investigate the same question(s) at all, but rather use the data from the report or publication to demonstrate a specific data science or statistics learning objective. This framework also reiterates that many problems have a set of questions prior to analysis; finding an answer and engineering the question post-hoc is not recommended. Data exploration is a large component of the analysis framework and is shown in case studies, but the case studies demonstrate that data analyses should be guided by thoughtful questions that are determined prior to analysis.

In the CO2 emissions case study, the questions are:

  • How have global CO2 emission rates changed over time? In particular, how does the US compare to other countries?

  • Are CO2 emissions in the US, global temperatures, and natural disaster rates in the US associated?

3. Learning objectives.

Each case study includes a set of didactic learning objectives. We categorize each objective as related to either (i) data science or (ii) statistics, where the latter are concepts traditionally taught in a statistics curriculum, such as linear regression, multiple testing correction, and significance, and the former are concepts often appearing outside of a traditional statistics course, such as re-coding data values, scraping data from a website, or creating a dashboard for a data set. Other categories could be considered depending on the purpose of the case study. This separation also allows educators to adapt the material to computational frameworks and languages other than R, such as Python.

We include these learning objectives for three reasons: (i) to help educators select a case study that meets the objectives they want to teach, (ii) to help learners select a case study that demonstrates what they want to learn, and (iii) to provide both educators and learners with a clear understanding about the goals of a particular case study. For example, a study of the use of learning objectives in an undergraduate science course found that students find learning objectives helpful for narrowing and organizing their studying (Osueke et al., 2018).

For the CO2 emissions case study, we designed the case study around the following learning objectives:

  1. Data Science Learning Objectives:
    • Importing data from various types of Excel files and CSV files
    • Apply action verbs in dplyr (Wickham et al., 2022) for data wrangling
    • How to pivot between “long” and “wide” data sets
    • Joining together multiple datasets using dplyr
    • How to create effective longitudinal data visualizations with ggplot2 (Wickham, 2016)
    • How to add text, color, and labels to ggplot2 plots
    • How to create faceted ggplot2 plots
  2. Statistical Learning Objectives:
    • Correlation coefficient as a summary statistic
    • Relationship between correlation and linear regression
    • Correlation is not causation

In addition, by stating these objectives within the case studies, students may begin to identify how they can apply these concepts in future analyses. Finally, we provide an interactive search table of learning objectives on the Open Case Studies website (opencasestudies.org) to make it easier to find a case study that demonstrates a particular technique, method, or concept that an instructor or learner might be interested in.

4. Context.

The context section provides background information needed to understand the context of the question(s) of interest and the data that will be used to answer the questions (Wood et al., 2018; Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective et al., 2018). This may include information from the publication on which the case study is based, but also additional background literature. For an example from public health, the case study may describe what is currently known (or not known) about the health impact of the topic. This serves to demonstrate how the specific question(s) fit into a larger scientific context.

For the CO2 case study, the context section includes a discussion of the potential impacts of climate change on human health, an overview of the likely progression of warming in the coming years, and potential impacts on other components of the environment such as ocean acidity and rainfall quantities.

5. Limitations.

In addition to the motivation and context for each case study, it is important to formally describe limitations of the data presented as it provides important context for the educator or learner (Rivera et al., 2019). Examples of limitations include (i) limitations due to the available data, such as the use of surrogate variables or indicators, (ii) limitations in the methods used, such as annual average estimates for quantities that are likely to vary daily or monthly, and (iii) selection biases due to sampling of observed data. A key concept in data science is that the conclusions from an analysis can only be as good as the data that go into it and the methods used to analyze them, so presenting these limitations provides a valuable learning opportunity. Additional limitations based on the analysis are described in a later section of the case study, while this section helps learners consider what is and is not possible with the data before going further.

In the CO2 case study, we describe limitations about how the data are incomplete because only certain countries reported CO2 emissions for certain years. We describe how additional emissions were also produced by countries that are not included in the data. This helps the learners to understand that while the data will help us understand CO2 emissions, it will not provide the full picture.

6. What are the data?

To provide transparency about the data sources, we describe where and how the raw data were obtained and used in the case study. If the data are obtained from a website, survey or report, and where possible, we also describe how the data were originally collected. We typically describe what the variables are in each dataset later in the case study to better match the experience of the learners discovering the data after they import and explore it.

The data sources for the CO2 case study are from Gapminder (gapminder.org) (originally from the World Bank) and the United States National Oceanic and Atmospheric Administration. In the case study, we present a table with the different data sources and a brief description of each one, including sources to cite.

7. Data import.

Next, we describe the steps and give the code required to read the raw data into the analysis environment. Currently, all of our case studies describe analyses in the R programming language. In some cases, importing the raw data is fairly straightforward, and this section is quite short. Other case studies have longer and more involved data import sections that involve scraping data from a PDF, accessing data using an Application Programming Interface (API), or writing functions to efficiently access data from multiple files with the same format. Importantly, we describe all of our use of code in the case studies in a literate programming way (Knuth, 1984), meaning that we describe each step in a way that can be understandable by learners.

Since the data for the CO2 case study are stored in Excel and comma-separated values (CSV) files, we use the standard data import functions read_excel() from the readxl R package and read_csv() from the readr R package (Wickham and Hester, 2020) to import our data.
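As a rough illustration of what an import step like this involves, the sketch below parses a small CSV table using only the Python standard library. It is not the case study's code, which is written in R; the function name read_wide_csv, the column layout, and all numeric values are invented placeholders rather than real emissions data.

```python
import csv
import io

# Invented placeholder data in a "wide" layout: one row per country,
# one column per year. These numbers are NOT real emissions values.
RAW_CSV = """country,2012,2013,2014
A,10.0,11.5,
B,3.2,3.1,3.4
"""


def read_wide_csv(text):
    """Parse CSV text into a list of dicts, converting year columns to floats.

    Empty cells become None so that missing values stay visible.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {"country": record["country"]}
        for key, value in record.items():
            if key != "country":
                row[key] = float(value) if value else None
        rows.append(row)
    return rows


emissions_wide = read_wide_csv(RAW_CSV)
```

Keeping empty cells as explicit missing values, instead of dropping them or filling in zeros, mirrors the point made in the limitations section that only certain countries reported emissions for certain years.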

8. Data wrangling and exploration.

Typically one of the longest sections for many of our case studies is the wrangling section, which describes all of the steps required to take the imported raw data and get it into a state that is ready for analysis and creating visualizations. We also demonstrate how to perform exploratory data analysis (Tukey, 1962).

For example, in the CO2 case study, the raw data need to be converted from a “wide” to a “long” format so that each country-year observation is in a single row. After wrangling the data from each source, we demonstrate how to join together data sets from different sources by matching on country-year combinations. Ultimately, we create one large data set containing all the variables we want to use for our analysis (in the columns) with one record for each country-year combination (in the rows).
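The reshaping and joining steps described here can be sketched in a language-agnostic way with plain Python. This is an illustration only: the case study itself uses R, the helper names pivot_longer and full_join are hypothetical, the choice of a full join (keeping every country-year seen in either table) is an assumption, and all values are invented placeholders.

```python
# Invented placeholder tables; the values are NOT real data.
emissions_wide = [
    {"country": "A", "2013": 11.5, "2014": 12.0},
    {"country": "B", "2013": 3.1, "2014": 3.4},
]
energy_long = [
    {"country": "A", "year": 2013, "energy_use": 50.0},
    {"country": "A", "year": 2014, "energy_use": 52.0},
    {"country": "B", "year": 2014, "energy_use": 20.0},
]


def pivot_longer(rows, id_col, value_name):
    """Turn one row per country into one row per country-year."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                long_rows.append(
                    {id_col: row[id_col], "year": int(key), value_name: value}
                )
    return long_rows


def full_join(left, right, keys):
    """Combine two long tables, keeping every key combination seen in either."""
    merged = {}
    for row in left + right:
        key = tuple(row[k] for k in keys)
        merged.setdefault(key, {}).update(row)
    # Columns that a key never received are filled with None (missing).
    columns = {col for row in merged.values() for col in row}
    return [
        {col: row.get(col) for col in sorted(columns)}
        for _, row in sorted(merged.items())
    ]


emissions_long = pivot_longer(emissions_wide, "country", "emissions")
combined = full_join(emissions_long, energy_long, keys=("country", "year"))
```

The result has one record per country-year combination, with a missing energy_use value for country B in 2013 because that combination appears in only one of the two placeholder tables.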

9. Data visualization.

We show both simple and complex data visualizations to explore and demonstrate a variety of graphical design choices, including plot type and other aesthetic choices to best show the types of variables of interest. In addition, most case studies describe how to facet or combine plots together so that all the major findings of the case study are illustrated in a single data visualization.

In the CO2 case study, we create data visualizations for a subset of the variables. For example, we use line plots to visualize how CO2 emissions, in metric tons, have varied over time globally (Figure 1). We go into detail around coloring and labeling the lines, zooming in and out on the time-scale axis, as well as including informative plot titles and axis labels. We demonstrate that when looking at CO2 emissions from different countries across time, special consideration for labeling is required. We show that a heat map or tile plot does a great job of illustrating top country differences in a less overwhelming manner (Figure 1). We also demonstrate the utility of faceted plots to simultaneously visualize more variables over time. In addition, we show how to start looking for associations or trends in the data through scatter plots with trend lines, either smoothed lines fit using the locally estimated scatterplot smoothing (loess) method, a type of local polynomial regression fitting, or linear regression lines.

10. Data analysis.

Our case studies are intended to introduce how a particular statistical test or data science technique might be implemented and interpreted to answer the question(s) of interest. However, we also walk the learner through an unexpected outcome and how we diagnosed it (Peng et al., 2021). We provide background information about statistical concepts and how these concepts apply to our example analysis, as well as relevant method limitations.

The main topic of the analysis section for our CO2 emissions Open Case Study is correlation and how correlation is related to linear regression. We discuss background information such as a description about what summary statistics are, what the correlation coefficient is, and how the correlation coefficient is mathematically calculated. We also describe the limitations of correlation analysis and how correlation does not imply causation. We demonstrate how to implement assessments of correlation and how to interpret the results.
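One way to make the link between correlation and linear regression concrete is the short sketch below, written in plain Python, not the R used in the case study. It relies on a standard result: when both variables are standardized (mean subtracted, divided by the sample standard deviation), the least-squares slope equals Pearson's correlation coefficient. The function names and the data values are invented for illustration and are not real emissions or temperature measurements.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    """Pearson correlation: sum of products of z-scores over n - 1."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Invented placeholder values, NOT real emissions or temperature data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5]

r = pearson_r(x, y)
slope_raw = ols_slope(x, y)  # equals r * sd(y) / sd(x)
slope_scaled = ols_slope(standardize(x), standardize(y))  # equals r
```

On the original scale the slope is the correlation rescaled by the ratio of the two standard deviations, which is why the slope of a linear trend line through standardized values can be read directly as a correlation.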

11. Summary.

In this section we provide a summary figure that visually indicates some of the major findings of the case study. The goal of this visualization is to demonstrate how to communicate the results of the analysis to a broader audience (Khachatryan and Karst, 2017). This often involves combining plots and adding annotations. This summary figure is the motivating figure used at the beginning of the case study. Along with this figure, we provide a synopsis of the case study in which the motivation, context, and questions are restated and summarized, while the major steps of wrangling, data exploration, and analysis are described. The main findings of the analysis are discussed, with emphasis on what these findings might indicate for the larger context of the question, in addition to what still remains unknown, as well as how this might relate to limitations of the data and the analysis methods.

In the CO2 emissions Open Case Study, the summary figure (Figure 1) combines several of the plots from the case study together to demonstrate the major findings. The synopsis recaps what data we worked with (CO2 emissions for some countries from 1751-2014) and what we have shown in the analysis, including touching on the learning objectives outlined at the beginning. We give a simpler explanation about the statistical concepts that were discussed in the analysis section, in this case about correlation and regression. We discuss more about what we were able to answer or not answer in terms of the questions of interest. We describe how we discovered a dramatic increase in global CO2 emissions over time and that some countries appear to be especially responsible. We discuss that although the data suggest a relationship between temperature and CO2 emissions in the US, there are many other important factors to consider based on what we know about climate change. These include: the influence of CO2 emissions from other countries in the atmosphere, the influence of other greenhouse gases, the fact that the already existing CO2 in the atmosphere continues to trap heat for many years, and the fact that heat trapped in the ocean due to previous emissions causes delayed changes in surface temperatures. We also point out what the results of our analysis might mean for how we should consider mitigating climate change effects and how warming temperatures may impact society in the future.

12. Suggested homework.

Each case study suggests a homework activity for students to try on their own. These activities typically require the students to use the skills that they have learned on a new data set or to expand the analysis to evaluate another subset of the data. Students may also be asked to make visualizations based on these analyses.

The suggested homework activities for the CO2 emissions Open Case Study are to:

  • Create a plot with labels showing the countries with the lowest CO2 emission levels.

  • Plot CO2 emissions and other variables (e.g. energy use) on a scatter plot, calculate Pearson’s correlation coefficient, and discuss the results.

These suggestions would require learners to practice their visualization and analytic skills to further investigate the data with less guidance.

13. Additional information.

This section includes additional information about the broader scientific topic of the case study, the methods used to analyze the data, and the specific data sets used in the analysis. Information is provided as links to external online resources such as blog posts, scientific articles, scientific reports, and educational websites. We also provide links to documentation about the R packages used, as well as the specific package versions that were used. We also link to information about the specific subject-matter experts who contributed to the development of the case study.

The CO2 emissions Open Case Study includes links to resources for learning more about the various R packages used in the case study (such as here (Müller, 2020), readxl (Wickham and Bryan, 2022), readr (Wickham and Hester, 2020), dplyr (Wickham et al., 2022), magrittr (Bache and Wickham, 2022), stringr (Wickham, 2022), purrr (Henry and Wickham, 2020), tidyr (Wickham and Girlich, 2022), tibble (Müller and Wickham, 2022), forcats (Wickham, 2021), ggplot2 (Wickham, 2016), directlabels (Hocking, 2021), ggrepel (Slowikowski, 2021), broom (Robinson et al., 2022), patchwork (Pedersen, 2020)) and how they were used, as well as information about the statistical topics touched on, including correlation, regression and time series analysis. These go beyond some of the material presented in the case study, to help point instructors or learners to additional resources for topics of interest.

3. The OCS educational resource

The OCS resource can be found online (opencasestudies.org). In addition, we created an educator’s guide describing how to most effectively use the case studies, how to modify and adapt components of the case studies in the classroom, and how to contribute new case studies (opencasestudies.org/OCS_Guide).

3.1. Open Case Study website and search tool

Our case study resource is hosted on our Open Case Studies (OCS) website (Figure 2). To navigate the case studies, we provide an interactive search table, built using the DT package (Xie et al., 2021), that allows those interested to search through our case studies by topic, statistical learning objective, data science learning objective, and R packages demonstrated. This table includes links to the GitHub repository with the code and data for each case study, as well as links to websites that are rendered versions of each case study where the entire analysis can be read in full.

Figure 2: An overview of the OCS educational resource.

The Open Case Studies website contains a searchable database of all available case studies. Users can search by case study name, R packages used, learning objectives, and category. Each case study links to a website with a rendered version of the entire analysis and to a GitHub repository. The GitHub repository hosts the online lesson and all of the related code, data, image, plot, and document files needed to follow along or conduct new analyses. Some case studies now have interactive versions that include live quizzes and coding tutorials. All GitHub repositories for the current case studies are at our GitHub organization (github.com/opencasestudies).

3.2. Open Case Studies on GitHub

The code and data for each case study are hosted in a GitHub repository (Figure 2) in the Open Case Studies GitHub organization (github.com/opencasestudies). Our case studies are built in R Markdown, allowing text, images, and GIFs that describe the context and data analytic process to be interspersed with code chunks that show the actual code used in the analysis (Knuth, 1984). We developed these prior to the release of the Quarto publishing system (quarto.org/quarto) and also feel that our original R Markdown setup is slightly easier for instructors less familiar with these tools to modify. These case studies are then “knit” into rendered HTML files using GitHub Actions (GitHub Actions, 2023) (github.com/features/actions) for continuous integration and deployment. By continuous integration, we mean that changes are tracked and a history of the code from various authors is saved to a single main version (Shahin et al., 2017) using Git and GitHub. By continuous deployment, we mean that the website versions of the case studies are automatically rendered and made available to the public once a new version is established on GitHub. These website versions of the case studies are also hosted on GitHub. Currently, our case studies are all written using the R programming language; however, our current format could be extended to support tutorials using other programming languages as well. Our case studies have a table of contents that allows instructors and learners to easily navigate from section to section, so that they can focus on the materials most useful for their needs. In addition, each case study starts with a graphic or plot that describes the basic findings of the case study. Each case study is organized with the same basic structure so that learners can navigate case studies more easily and see patterns across case studies in how analysis is performed. See Figure S1 for an example from a case study on obesity rates (www.opencasestudies.org/ocs-bp-rural-and-urban-obesity/).

3.3. Open Case Study file structure

Each case study repository has a similar file structure, with a data directory containing both raw data and versions of the data in various processed forms to allow instructors and learners to modularize the case studies for their own purposes (Figure 3). For example, an instructor could skip the data import and wrangling sections of the case study and focus on the visualization and analysis sections using a fully cleaned data set. To support this modular style of instruction, each case study includes commands at the beginning of each section that import the data in the final state of the previous section. These different stages of the data are organized in a data folder with five categories: raw, imported, wrangled, simpler import, and extra. The raw data directory includes files in their original unaltered condition and in the original file format from the original data source (raw files may be CSV files, Excel files, or PDFs, among other file formats). The imported data directory includes files containing the data in a format that is directly compatible with R, such as RData files, which are often abbreviated as Rda. The wrangled data directory also includes an RData file that contains a clean and tidy version of the data that has been pre-processed and is ready for analysis, as well as CSV files for instructors who wish to demonstrate a simpler version of data import. The simpler import folder contains raw files that have been converted to CSV or other formats that can be more easily imported into R. The extra data folder contains data files that allow individuals to conduct analyses beyond what was done in the case study (the file format for these extra files varies). Each repository also contains a README file (Vilhuber et al., 2022; Dogucu and Çetinkaya-Rundel, 2022) that explains the modular aspect of the case study, as well as other information about how to use the case study for educational purposes (Table S2).
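The modular idea described above, where each section starts from the final state of the previous section, can be sketched as a simple lookup. This is only an illustration in Python and not the code used in the case studies (which are written in R); the section names, the folder names on disk, and the function `input_directory` are all assumptions made for this sketch.

```python
from pathlib import Path

# Which data stage each section reads from: the output of the previous section.
# Section names and folder names are hypothetical.
SECTION_INPUT_STAGE = {
    "data_import": "raw",
    "data_wrangling": "imported",
    "data_analysis": "wrangled",
}


def input_directory(data_root, section, simpler_import=False):
    """Return the folder a section should load its data from."""
    if section not in SECTION_INPUT_STAGE:
        raise KeyError(f"unknown section: {section}")
    stage = SECTION_INPUT_STAGE[section]
    # Instructors who want an easier import step can start from
    # files that were pre-converted to a simpler format.
    if section == "data_import" and simpler_import:
        stage = "simpler_import"
    return Path(data_root) / stage


# An instructor skipping import and wrangling starts from the wrangled data.
print(input_directory("data", "data_analysis"))
```

Keeping the mapping in one place makes it easy to see which sections can be skipped: any section can begin as long as the folder for the preceding stage is populated.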

Figure 3: An overview of the data file structure on GitHub.

A tree illustrating the repository data directory structure. Each bubble describes the type of data files that can be found in the sub-folders. All the files can be found in our Open Case Studies GitHub organization (github.com/opencasestudies).

3.4. Interactive elements in Open Case Studies

To make our case studies more experiential, we have introduced interactive elements including quizzes and coding exercises using the learnr (Schloerke et al., 2020) and gradethis (Aden-Buie et al., 2021) packages.

We include a mix of multiple choice questions and coding exercises in each case study. Coding exercises are embedded throughout the content of the case studies and give students a chance to write code for a specific step in the analysis. The answers to these exercises (the code/output used in the case study to complete these steps) are then hidden in a click-to-expand section right after the exercise window. Students can compare their own code and output with these answers. We also create exercise subsections at the end of the main sections of the case study. These exercise subsections include both multiple choice questions and coding exercises. Students can use them to test their understanding of the content in each section. All multiple-choice questions provide real-time feedback, giving hints after wrong answers and allowing students to retry the questions if they submitted a wrong answer. For most of the coding exercises, hints and/or solutions are available. With the help of the gradethis package, some of these coding problems also provide real-time feedback after students submit their code.
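The feedback behavior described above (a hint after a wrong answer, with the option to retry) can be sketched as follows. This is a minimal Python illustration of the logic only; it is not the learnr or gradethis API, and the class name, field names, and example question are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MultipleChoiceQuestion:
    """A multiple-choice question that gives a hint after each wrong answer."""
    prompt: str
    choices: list
    correct: str
    hints: dict = field(default_factory=dict)  # maps a wrong choice to its hint
    attempts: int = 0

    def submit(self, answer):
        """Record an attempt and return (is_correct, feedback_message)."""
        if answer not in self.choices:
            raise ValueError(f"{answer!r} is not one of the choices")
        self.attempts += 1
        if answer == self.correct:
            return True, "Correct!"
        # Wrong answers get a targeted hint if one exists; the learner may retry.
        return False, self.hints.get(answer, "Not quite. Please try again.")


# Hypothetical example question
question = MultipleChoiceQuestion(
    prompt="Which statistic measures linear association between two variables?",
    choices=["median", "Pearson's r", "range"],
    correct="Pearson's r",
    hints={"median": "The median summarizes a single variable."},
)
print(question.submit("median"))
print(question.submit("Pearson's r"))
```

Because each submission returns feedback immediately and the question stays open, a learner can retry until they reach the correct answer, and the attempt count could be used by an instructor to see which questions cause the most difficulty.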

3.5. Open Case Studies user feedback

Feedback on the Open Case Studies educational platform is actively being collected. Case study users are being asked to complete a Google Forms survey that is embedded in the OCS website and within the case studies themselves. The survey is part of an ongoing effort to produce evidence of the case studies’ utility that will be expanded upon in future work. However, here we summarize the results of a preliminary analysis of this survey as of November 2021 (Breshock, 2021).

The survey recruited educators, students, and self-learners to provide feedback on their experience using Open Case Studies. Their feedback was quite favorable. The educator respondents taught at levels ranging from high school to graduate programs. The student respondents were university undergraduate and graduate students. At the time of this analysis, 96% of respondents indicated that they were likely or very likely to recommend Open Case Studies to others. Both students and educators found many, if not all, of the case study sections to be useful and relevant to their interests.

A few of the educator respondents had already used a case study to teach, and a subset of these respondents provided qualitative feedback. They indicated that the case studies are a valuable resource for teaching and that using OCS data saved them time and effort. Their comments suggested that student comprehension of the material was better than with some of their previous teaching methods; however, we hope to conduct future studies to compare these approaches more formally. According to these educators, the primary value added by OCS was the readily available data and the student interaction.

The students and self-learners also provided favorable reviews of Open Case Studies. The majority of these respondents indicated that they found the case studies to be very useful and enjoyable. A majority also reported that they were very likely to refer back to OCS in the future. One respondent stated “Open Case Studies is intuitive, informative, and easy to access” (Breshock, 2021).

3.6. Open Case Studies popularity and usage

The survey detailed above also asked which case studies the respondents looked at. The responses indicated that Obesity and Air Pollution were the two most popular case studies. The Obesity case study likely has wide appeal because it involves many standard statistical tests that are commonly taught in statistics courses. The CO2 Emissions, Vaping, and Diet case studies were tied for third in popularity based on the survey data (Breshock, 2021).

Case study usage was also measured through website traffic data from Google Analytics. The traffic data indicated that the OCS website had visitors from Europe, South America, Africa, Asia and Oceania as well as North America. Google Analytics also kept track of which case studies were visited. From January 1st, 2021 to November 3rd, 2021, the most popular case study was Air Pollution, with 825 web-sessions. This was followed by Obesity (717 sessions), CO2 Emissions (535 sessions), Diet (473 sessions) and Vaping (443 sessions) (Breshock, 2021). The popularity of the Air Pollution case study is likely due to its machine learning content, a type of artificial intelligence and a topic of much interest for students and educators alike. These results match what we found in the survey responses in terms of case study popularity.

The survey also indicated ways in which people have used or are interested in using case studies, including: as discussion prompts, as a scaffold or example for projects, as extra practice to supplement course work, and for flipped classroom experiences. In our Educator’s guide (opencasestudies.org/OCS_Guide) we also suggest uses based on how other educators have told us they used the case studies, including as material for lectures, for homework assignments, as material for students to create written reports or presentations, and as a basis for additional questions that take the case study even further.

Students can be evaluated for their ability to interpret the case study materials to a depth greater than described in the case study itself, to perform additional analyses, or to create additional visualizations. They can also be evaluated for their understanding of the decision-making involved in the data analysis process by asking questions about why particular steps were taken in the case study and what alternatives one might take.

As future work, these survey results and website traffic data could be summarized and included in the Open Case Study Search tool on the website. Additionally, a 5-star rating system could be implemented and the average rating could be included in the search tool. This would allow users to assess both popularity and quality when selecting a case study to learn from. We also hope to do formal analysis of comparisons of the usage of case studies to other more typical teaching resources and how it impacts the instructor and learner experience.

4. Building your own case studies

For educators interested in constructing their own case studies, in this next section, we describe our recommendations for the process based on our experiences and challenges throughout this project. We also describe these recommendations in our Educator’s guide (opencasestudies.org/OCS_Guide).

4.1. Identifying questions and data for case studies

The process of choosing data sources and questions of interest is arguably the most important part of constructing a case study. We can either identify an interesting and publicly available data set and then ask a timely and engaging question about a topic related to the data, or we can identify an interesting question and then work to find publicly available data to answer this question. This process of linking a question to publicly available data often involves a bit of trial and error and reshaping of the question while keeping in mind and potentially adjusting what the case study is meant to demonstrate.

In our experience developing case studies, we found that identifying a data set first was often easier than relying on finding a data set to answer a particular question. While many of our case studies were specifically designed to address a public health challenge, we sometimes struggled to find publicly available data that was appropriate for the question or set of questions of interest. Collaboration with subject-matter experts can be especially helpful in addressing this challenge. For our case studies, we worked with public health experts in order to both identify interesting, timely, and testable questions and to find a public source of data to answer our questions.

We found we could use the difficulty of obtaining data in a standard format (e.g., Excel, CSV) as a teaching opportunity, and that being open-minded about the source of the data allowed us to demonstrate unconventional skills. For example, when we could not easily access the data stored in a table in a published report, we illustrated the data science skill of pulling data directly from a PDF. As future data scientists, our students need the skills to be flexible to access data that cannot simply be read in or imported as-is into R.

While we typically started developing each case study with a set of data science and statistical learning objectives in mind, there was sometimes a tension between finding a data set that would allow us to meet these specific objectives and allowing the data to guide the direction of the case study. We found that following opportunities presented by the data itself led us to give examples that were more authentic to a real-world data analysis situation. We recorded some of these challenges within the case studies themselves so students could better understand the process of finding the right data to answer a question of interest (and the potential need to refocus a question). The limitations section in particular provides some of the most useful material for class discussions about the types of questions the data can and cannot answer and how sometimes we must simplify our analysis to reflect the limitations of the data available to us.

As educators working during a time of reflection and social change around issues of gender and race in research, we also took care to point out some historically overlooked aspects of our data sets. For example, collecting data with surveys that provide a limited number of options about ethnicity, race, or racial and gender intersections limits our ability to accurately capture the diversity of the population being studied. As an example, we refer the reader to the case study about youth disconnection (opencasestudies.org/ocs-bp-youth-disconnection).

For some case studies, we focused on finding mostly clean and complete data to allow us to demonstrate certain concepts, like machine learning or how to create a dashboard. In these case studies where we knew that the analytical material was going to get quite intensive and lengthy, we specifically sought to find data sets that would allow us to jump right in with little difficulty in terms of gathering, cleaning, and importing the data.

Our overall suggestions for starting a case study are:

  • Choose topics and contexts that are of interest and timely: As Wood et al. (2018) describe in the updates to the Guidelines for Assessment and Instruction in Statistics Education College Report in 2016, it is important to provide learners with an immersive experience using data science and statistical skills and concepts in a context that is relevant. This helps to demonstrate the practical utility of data science and statistical thinking in real-world situations.

  • Be open-minded and flexible about data sources: Unlike performing a real analysis where an analyst might choose to avoid complications in accessing the data (when the option is available to go with a data set that is easier to access), such complications can provide teaching opportunities to prepare students for cases where they will not have a simpler option available.

  • Determine the level of flexibility based on the goals of the case study: If the case study is intended to demonstrate a specific statistical method or data import method, more effort may be required to find the right data to meet this specific teaching expectation. In our situation, we knew we were planning to make several case studies, thus we were able to let some of the case studies naturally flow in directions we didn’t initially intend. This ultimately led to some teaching opportunities we did not expect. However, for some of our case studies we were more rigid about our data needs.

  • Think about the scope of the case study: Keep in mind the 1) type of learners that the case study is intended for and 2) data analysis method goals that the case study is intended to demonstrate. Try to avoid a case study that is both intensive for data import/wrangling and intensive for data analysis. At a later point reevaluation of the overall direction and scope of the case study may be needed. If the case study is too long, consider splitting it into multiple case studies.

  • Keep it simple: Explaining a process at a beginner level often involved more space within a case study than anticipated. Keeping case study plans simple can help as unexpected teaching opportunities may arise that may require more instruction.

4.2. Do the analysis first but with a learner in mind

To present an analysis narrative, it is necessary to first perform the analysis before working on the narrative description. However, the case study itself should not simply be a reproduction of the process used to analyze the data. Instead it should contain simplifications and modifications to create a clear and coherent presentation for students. To do this, it is crucial to keep a good record of all the steps taken during this initial analysis, including explanations and comments to justify the analysis choices made along the way. Special care should be taken to record exactly how the raw data is obtained.

Often the way we would typically perform an analysis ourselves might not always be the best for instruction purposes. For example, an experienced data analyst might start by writing a function that wrangles multiple similar data files. However, this would not be the appropriate way to start a case study for beginners. Instead one might choose to focus on wrangling a single specific file in great detail before trying to generalize the code as a function. Thus we try to determine an overall process of data import and wrangling for the intended level of audience before really generating the dialogue that describes this process.

We also found that the data exploration steps and the steps involved in deciding how to wrangle the data often needed to be simplified for a case study. For example, we may ultimately decide to remove a data source from our analysis because we find errors in the data and dealing with these errors is beyond the scope for our intended audience. While it may be useful to tell students about these data errors (Peng et al., 2021) and how to address them, we also need to keep an appropriate level of detail so as not to overwhelm them.

Another situation where we might modify the analysis is if a process requires a considerable level of trial and error. Rather than showing the students all the iterations of the trial and error and all of the decisions around this process, we may only demonstrate a small portion so as not to make the case study too lengthy. In a case study about machine learning, for example, we aimed to achieve a certain level of performance so we spent a fair amount of time demonstrating how to optimize and tune parameters. While we briefly described our tuning strategy, we did not show all intermediate models, but ultimately showed two that were interesting and useful for describing parameter tuning.

To conclude, we may have gone through a learning process in our own analysis, eventually arriving at a more refined approach. Instead of describing the entire process to get to this point, we would sometimes simply present the final approach, yet describe in the narrative that in practice more effort would be required. While we do want to present a realistic depiction of the data analysis process, we also need to achieve clarity and focus.

4.3. Creating the case study narrative

Once an analyst has performed the analysis to address the questions of interest, it is time to start writing the narrative. First, we introduce and motivate the main topic by presenting some research related to the particular question evaluated in the case study.

Next, we describe the data import, wrangling, and analysis processes. As mentioned above, this will likely not be a faithful reproduction of our own analysis process, but will be recreated to best meet the pedagogical goals of the case study. In terms of added narrative, we do our best to guide students through the new information we are presenting. The first time we use a function, we describe what it does, its main arguments, and which package it comes from. We describe the thoughts behind our decision-making process from one step to the next, sometimes illustrating times where we try something and it does not work, to reflect a real-world data analysis.

We also describe jargon and background information where possible with click-to-expand sections so as not to disrupt the general flow of the case study. For example, an expanded section would explain how “piping” works, passing objects through a series of steps, to avoid slowing down students who are already familiar, while allowing us to not lose students that have never seen piping before. Other material for such expandable sections includes describing the “grammar of graphics” for the ggplot2 package or providing background statistical information before performing a statistical test. In some cases we describe a concept at great length in another case study so we link to the description there, but in general we at least minimally describe most concepts and methods in each case study to keep them as self-contained as possible. Similarly, we found including portions of Posit (formerly called RStudio) Cheatsheets (Posit Cheatsheets, 2023) (posit.co/resources/cheatsheets) to be very useful for certain topics, such as describing regular expressions or joining functions. In some cases we found it best to explain a concept or challenge with a simpler example first using a smaller data set imported into R or created in R ourselves. This material is also included in click-to-expand sections for students who might already be familiar with such concepts.

While constructing the narrative, we think about where we can include question opportunities. These opportunities include places for an instructor to start a discussion about the analysis decision-making process, such as why a particular graph choice is not always effective or why a wrangling method might not be reproducible. We may prompt students to try to remember how to perform a task that has already been shown earlier in the case study. In our interactive case studies we also include quiz questions and coding exercises, as described in Section 3.4.

Finally, we end the narrative by summarizing how to communicate the major findings of the analysis (Khachatryan and Karst, 2017). We also describe how the results fit into the greater context of the field, what the implications are, the limitations of the study, and what is still unknown. We finish by going through the case study to create a list of all the resources shown throughout the case study.

Through the process of creating this resource, we discovered a variety of challenges, as well as strategies that we used to overcome these challenges, as described in Table S3 and guidelines for creating new case studies (Supplemental Note 1).

4.4. Creating interactive case studies

We have also included interactive elements in a subset of our case studies using packages (learnr (Schloerke et al., 2020) and gradethis (Aden-Buie et al., 2021)) that build on the shiny (Chang et al., 2021) package, which allows R users to more easily create web applications. The learnr package allows users to create multiple choice questions and coding exercises, while gradethis allows for customization of the feedback provided to learners as they answer questions or perform exercises.

There are two methods to do this. One method is to host each exercise as an individual Shiny application and then embed these applications in the case study using inline frames (HTML ‘iframe’). The second method is to create one single application that incorporates exercises within the case study (Supplemental Note S2 [including Figures S2–S5] and Supplemental Note S3).

4.5. Contributing a case study to the OCS project

To provide greater access to learners, we have established two methods for others to contribute case studies to the OCS project. Our ability to integrate submitted case studies will depend on our capacity to review and integrate them into the existing repository. Thus we have created both a more casual process and a more official process. Case studies do not have to be about public health topics; they just need to address a relevant problem that can be explored with publicly available data.

1). Community Library Submissions.

These case studies will be shared publicly on our website as part of an Open Case Studies Community Repository for the benefit of other educators and learners, with minimal review from the Open Case Studies team. More information about how to submit, along with access to our submission form, can be found in our Educator’s guide (opencasestudies.org/OCS_Guide).

2). Official Library Submissions.

These submissions are intended to follow the guidelines of OCS and undergo peer review to be considered for inclusion in the official repository. Interested individuals first need to fill out a form, available in our Educator’s guide (opencasestudies.org/OCS_Guide), to contact our team and discuss the idea. Completed submissions will be submitted through GitHub.

5. Discussion and Conclusions

In this paper, we introduce a model for creating fully open-source, peer-reviewed, and complete case studies to create an archive of examples of best practices to guide students through data analyses involving real, complicated, messy, and interesting data. Our archive can be used in the classroom by instructors to guide students through any part of our case studies due to the easy navigation and common modularized architecture to structure the case studies. These can also be used by independent learners due to the thorough narrative, interactive elements, and complete analyses. Students and learners can learn about new topics or return to a case study to brush up on details of a particular method or technique. The data within our case studies and the narrated data analyses and data science methods can be used by instructors educating undergraduate and graduate students, as well as high school students in a variety of topics including statistics, public health, programming, and data science. This provides an opportunity for instructors to use data that is relevant to current public health concerns and therefore of interest to a large variety of students without the work required to identify such data or to determine what analyses are possible with such data. This will free instructors to focus on challenging the students with more interactive discussions in class and allow students to learn more about the decision processes required for analyzing data.

In summary, the OCS project is a repository of complete data analyses of real-world data that covers a variety of data science and statistical concepts. This initiative provides learners and instructors guidance in each case study about every step in the process to illustrate the consideration and decision-making skills required to effectively interpret data. The case studies also contextualize data science and statistical lessons to explore important and relevant public health concerns in an engaging manner. Each case study has a consistent framework grounded in the Nolan and Speed (1999) model. They are open access and provide recommendations on how to teach the material. With the OCS resources, educators can also make their own case studies and expand OCS if they contribute back. We believe these additions help bridge the gaps in the last mile of analysis education. While the case study model has been proposed for statistical education for some time (Nolan and Speed, 1999), we hope that this resource helps instructors and learners use this education method more easily and in a way that also includes more data science topics. More analysis is required to better understand how use of this teaching method compares to other methods for the education of data science and statistics. Regardless, we hope that the data and resources provided by our project will be helpful to instructors and learners.

Supplementary Material

Supp 1

6. Acknowledgements

The authors gratefully acknowledge the Johns Hopkins Data Science lab (jhudatascience.org), in particular Roger Peng, Jeff Leek, Brian Caffo, and Jessica Crowell for their support and valuable feedback on the Open Case Studies project. We would like to thank Ira Gooding for his feedback on incorporating case studies into the Coursera platform. In addition we would like to thank all the data science and statistics reviewers of our case studies, including: Shannon Ellis, Nicholas Horton, Leslie Myint, Mine Çetinkaya-Rundel, Michael Love, and Christina P. Knudson, as well as the following student reviewers: Jensen Stanton, Tina Trinh, and Ruby Ho. We would also like to acknowledge the topic reviewers including: Roger Peng, Tamar Mendelson, Brendan Saloner, Renee Johnson, Jessica Fanzo, Daniel Webster, Elizabeth Stuart, Aboozar Hadavand, Megan Latshaw, Kirsten Koehler, and Alexander McCourt. We would also like to acknowledge Ashkan Afshin and Erin Mullany for giving us access to the data for the case study titled “Exploring global patterns of dietary behaviors associated with health risk.” We would also like to thank the Johns Hopkins Bloomberg School of Public Health Department of Biostatistics for initially funding this project.

Funding:

The Open Case Study project reported in this publication was supported by a High-Impact Project grant in 2019-2020 by the Bloomberg American Health Initiative to create the majority of the case studies currently part of the project. A 2020 Digital Education & Learning Technology Acceleration (DELTA) Grant from the Office of the Provost at the Johns Hopkins University supported the creation of interactive case studies and many of the tools that support their use, such as the search tool. The Open Case Studies guide was funded as an extension to funding for the Genomic Data Science Community Network (GDSCN). The GDSCN is supported through a contract to Johns Hopkins University (75N92020P00235) from NHGRI. JM was supported by Streamline Data Science, U24HG010263-01 (ANVIL), UL1TR003098 (NIH/NCATS): Institutional Clinical and Translational Science Award, and UE5CA254170.

7. Data Availability

All data used in the case studies are publicly available and documented in each individual case study. Data from our survey and about our package can be found on GitHub (github.com/opencasestudies/OCS_analytics) in the Open Case Study Survey (Responses).xlsx file and on Zenodo (zenodo.org/doi/10.5281/zenodo.13228343). Our survey was reviewed by the Institutional Review Board at the Johns Hopkins Bloomberg School of Public Health. No personal or private information was collected unless a responder volunteered that information. Such information has been removed from the version available online. The survey was found to not qualify as human subjects research as defined by DHHS regulations 45 CFR 46.102, and did not require IRB oversight. IRB No.: 14965, approved 2020-02-26.

References

  1. Aden-Buie G, Chen D, Grolemund G and Schloerke B (2021), gradethis: Automated Feedback for Student Exercises in ’learnr’ Tutorials. https://pkgs.rstudio.com/gradethis/
  2. Arnold P and Franklin C (2021), ‘What Makes a Good Statistical Question?’, Journal of Statistics and Data Science Education 29(1), 122–130. https://www.tandfonline.com/doi/full/10.1080/26939169.2021.1877582
  3. Bache SM and Wickham H (2022), magrittr: A Forward-Pipe Operator for R. R package version 2.0.3 https://CRAN.R-project.org/package=magrittr
  4. Breshock MR (2021), ‘Expanding Access and Removing Barriers: Data Science Education with the Open Case Studies Digital Platform’, Johns Hopkins University Graduate Theses. http://jhir.library.jhu.edu/handle/1774.2/66820
  5. Çetinkaya-Rundel M. (2018), Let Them Eat Cake (First)! [Video]. YouTube. (2018, November 9). https://www.youtube.com/watch?v=RsVOrpXAPXo
  6. Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J, McPherson J, Dipert A and Borges B (2021), shiny: Web Application Framework for R. R package version 1.6.0 https://CRAN.R-project.org/package=shiny
  7. Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective, Computer Science and Telecommunications Board, Board on Mathematical Sciences and Analytics, Committee on Applied and Theoretical Statistics, Division on Engineering and Physical Sciences, Board on Science Education, Division of Behavioral and Social Sciences and Education and National Academies of Sciences, Engineering, and Medicine (2018), Data Science for Undergraduates: Opportunities and Options, National Academies Press. https://www.nap.edu/catalog/25104
  8. Dogucu M and Çetinkaya-Rundel M (2022), ‘Tools and Recommendations for Reproducible Teaching’, Journal of Statistics and Data Science Education 30(3), 251–260. https://www.tandfonline.com/doi/full/10.1080/26939169.2022.2138645
  9. Donoghue T, Voytek B and Ellis SE (2021), ‘Teaching Creative and Practical Data Science at Scale’, Journal of Statistics and Data Science Education 29(sup1), S27–S39.
  10. Donoho D. (2017), ‘50 Years of Data Science’, Journal of Computational and Graphical Statistics 26(4), 745–766. doi: 10.1080/10618600.2017.1384734
  11. Freeman S, Eddy SL, McDonough M, Smith MK, Okoroafor N, Jordt H and Wenderoth MP (2014), ‘Active Learning Increases Student Performance in Science, Engineering, and Mathematics’, Proceedings of the National Academy of Sciences 111(23), 8410–8415. Publisher: Proceedings of the National Academy of Sciences. https://www.pnas.org/doi/abs/10.1073/pnas.1319030111
  12. GitHub Actions (2023). https://github.com/features/actions
  13. Henry L and Wickham H (2020), purrr: Functional Programming Tools. R package version 0.3.4 https://CRAN.R-project.org/package=purrr
  14. Hicks SC and Irizarry RA (2018), ‘A Guide to Teaching Data Science’, The American Statistician 72(4), 382–391. 31105314.
  15. Hocking TD (2021), directlabels: Direct Labels for Multicolor Plots. R package version 2021.1.13 https://CRAN.R-project.org/package=directlabels
  16. Khachatryan D and Karst N (2017), ‘V for Voice: Strategies for Bolstering Communication Skills in Statistics’, Journal of Statistics Education 25(2), 68–78. https://www.tandfonline.com/doi/full/10.1080/10691898.2017.1305261
  17. Knuth DE (1984), ‘Literate Programming’, The Computer Journal 27(2), 97–111. doi: 10.1093/comjnl/27.2.97
  18. Kross S and Guo PJ (2019), Practitioners Teaching Data Science in Industry and Academia: Expectations, Workflows, and Challenges, in ‘Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems’, CHI ’19, Association for Computing Machinery, New York, NY, USA, pp. 1–14.
  19. Müller K. (2020), here: A Simpler Way to Find Your Files. R package version 1.0.1 https://CRAN.R-project.org/package=here
  20. Müller K and Wickham H (2022), tibble: Simple Data Frames. R package version 3.1.8 https://CRAN.R-project.org/package=tibble
  21. Neumann DL, Hood M and Neumann MM (2013), ‘Using Real-Life Data When Teaching Statistics: Student Perceptions of This Strategy in an Introductory Statistics Course’, Statistics Education Research Journal 12(2), 59–70. https://iase-web.org/ojs/SERJ/article/view/304
  22. Nolan D and Lang DT (2010), ‘Computing in the Statistics Curricula’, The American Statistician 64(2), 97–107.
  23. Nolan D and Speed TP (1999), ‘Teaching Statistics Theory through Applications’, The American Statistician 53, 370.
  24. Osueke B, Mekonnen B and Stanton JD (2018), ‘How Undergraduate Science Students Use Learning Objectives to Study’, Journal of Microbiology & Biology Education 19(2). https://journals.asm.org/doi/10.1128/jmbe.v19i2.1510
  25. Pedersen TL (2020), patchwork: The Composer of Plots. R package version 1.1.1 https://CRAN.R-project.org/package=patchwork
  26. Peng RD, Chen A, Bridgeford E, Leek JT and Hicks SC (2021), ‘Diagnosing Data Analytic Problems in the Classroom’, Journal of Statistics and Data Science Education 29(3), 267–276. https://www.tandfonline.com/doi/full/10.1080/26939169.2021.1971586
  27. Posit Cheatsheets (2023). https://posit.co/resources/cheatsheets/
  28. R Core Team (2021), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
  29. Ratan SK, Anand T and Ratan J (2019), ‘Formulation of Research Question – Stepwise Approach’, Journal of Indian Association of Pediatric Surgeons 24(1), 15. Publisher: Wolters Kluwer – Medknow Publications. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322175/
  30. Reyes LLS and McTavish EJ (2022), ‘Approachable Case Studies Support Learning and Reproducibility in Data Science: An Example from Evolutionary Biology’, Journal of Statistics and Data Science Education 30(3), 304–310.
  31. Rivera R, Marazzi M and Torres-Saavedra PA (2019), ‘Incorporating Open Data Into Introductory Courses in Statistics’, Journal of Statistics Education 27(3), 198–207. https://www.tandfonline.com/doi/full/10.1080/10691898.2019.1669506
  32. Robinson D, Hayes A and Couch S (2022), broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.0 https://CRAN.R-project.org/package=broom
  33. Romero R, Ferrer A, Capilla C, Zunica L, Balasch S, Serra V and Alcover R (1995), ‘Teaching Statistics to Engineers: An Innovative Pedagogical Experience’, Journal of Statistics Education 3(1), 5. https://www.tandfonline.com/doi/full/10.1080/10691898.1995.11910481
  34. Schafer DW and Ramsey FL (2003), ‘Teaching the Craft of Data Analysis’, Journal of Statistics Education 11(1).
  35. Schloerke B, Allaire J and Borges B (2020), learnr: Interactive Tutorials for R. R package version 0.10.1 https://CRAN.R-project.org/package=learnr
  36. Shahin M, Ali Babar M and Zhu L (2017), ‘Continuous Integration, Delivery and Deployment: A Systematic Review on Approaches, Tools, Challenges and Practices’, IEEE Access 5, 3909–3943. http://ieeexplore.ieee.org/document/7884954/
  37. Slowikowski K. (2021), ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.9.1 https://CRAN.R-project.org/package=ggrepel
  38. Theobold AS, Hancock SA and Mannheimer S (2021), ‘Designing Data Science Workshops for Data-Intensive Environmental Science Research’, Journal of Statistics and Data Science Education 29(sup1), S83–S94.
  39. Tukey JW (1962), ‘The Future of Data Analysis’, The Annals of Mathematical Statistics 33(1), 1–67.
  40. Vilhuber L, Son HH, Welch M, Wasser DN and Darisse M (2022), ‘Teaching for Large-Scale Reproducibility Verification’, Journal of Statistics and Data Science Education 30(3), 274–281. https://www.tandfonline.com/doi/full/10.1080/26939169.2022.2074582
  41. Waller LA (2018), ‘Documenting and Evaluating Data Science Contributions in Academic Promotion in Departments of Statistics and Biostatistics’, The American Statistician 72(1), 11–19.
  42. Weinberg SL and Abramowitz SK (2000), ‘Making General Principles Come Alive in the Classroom Using an Active Case Studies Approach’, Journal of Statistics Education 8(2). doi: 10.1080/10691898.2000.12131290
  43. Wickham H. (2016), ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag, New York. https://ggplot2.tidyverse.org
  44. Wickham H. (2021), forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1 https://CRAN.R-project.org/package=forcats
  45. Wickham H. (2022), stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.1 https://CRAN.R-project.org/package=stringr
  46. Wickham H and Bryan J (2022), readxl: Read Excel Files. R package version 1.4.0 https://CRAN.R-project.org/package=readxl
  47. Wickham H, François R, Henry L and Müller K (2022), dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr
  48. Wickham H and Girlich M (2022), tidyr: Tidy Messy Data. R package version 1.2.0 https://CRAN.R-project.org/package=tidyr
  49. Wickham H and Hester J (2020), readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr
  50. Wild CJ and Pfannkuch M (1999), ‘Statistical Thinking in Empirical Enquiry’, International Statistical Review 67(3), 223–248.
  51. Wood BL, Mocko M, Everson M, Horton NJ and Velleman P (2018), ‘Updated Guidelines, Updated Curriculum: The GAISE College Report and Introductory Statistics for the Modern Student’, CHANCE 31(2), 53–59. https://www.tandfonline.com/doi/full/10.1080/09332480.2018.1467642
  52. Xie Y, Cheng J and Tan X (2021), DT: A Wrapper of the JavaScript Library ’DataTables’. R package version 0.17 https://CRAN.R-project.org/package=DT
