STAR Protocols. 2024 Feb 29;5(1):102912. doi: 10.1016/j.xpro.2024.102912

Protocol for EHR laboratory data preprocessing and seasonal adjustment using R and RStudio

Victorine P Muse 1,3, Søren Brunak 1,2,4,∗∗
PMCID: PMC10918320  PMID: 38427569

Summary

Seasonality in laboratory healthcare data is associated with possible under- and overdiagnoses of patients in the clinic. Here, we present a protocol to analyze electronic health record data for seasonality patterns and adjust existing reference intervals for these changes using R software. We describe steps for preprocessing population-wide patient laboratory data into a single dataset. We then detail steps for defining strata, normalizing to median, and fitting data to selected functions.

For complete details on the use and execution of this protocol, please refer to Muse et al. (2023).1

Subject areas: Bioinformatics, Health Sciences, Systems biology

Graphical abstract


Highlights

  • Steps described for laboratory data cleaning using synthetic example data

  • Instructions for applying a low parameter sinusoidal model to investigate seasonality

  • Guidance on performing reference interval modifications on synthetic example data


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.



Before you begin

The protocol below exemplifies an approach for curating and cleaning an Electronic Health Record (EHR) dataset for population-wide assessment. It then goes on to show how to detect seasonality in this same data set and adjust for it in large-scale analyses. This protocol was developed on Danish registry and EHR data; when applied to other cohorts the protocol will likely require cohort-specific modifications, noted within the protocol.

Hardware

A supercomputer or a local machine with enough memory to manage your dataset. Requirements vary from project to project and therefore cannot be made definitive for this protocol. For reference, the dataset used by the authors to develop this protocol is approximately 30 GB and contains roughly 340 million laboratory test measurements.

Institutional permissions

Often when working with patient EHR data, extensive legal and ethical permissions, patient ID encryptions, and more are required but can vary between countries, institutions, and study types. Please be sure to check your specific data permissions before pursuing this protocol. All experiments must be conducted in compliance with the relevant institutional and national guidelines for processing personal data and all other applicable laws and regulations. This protocol was developed in accordance with rules from the Danish Health Data Authority and approved by The Danish Data Protection Agency (ref. 514-0255/18-3000, 514-0254/18-3000, SUND-2016-50), The Danish Health Data Authority (ref: FSEID-00003724 and FSEID-00003092) and The Danish Patient Safety Authority (3-3013-1731/1/). The study was approved as a registry study where patient consent is not needed in Denmark.

Install R, RStudio, and respective packages

Timing: 20–30 min

  • 1.

    Download R and RStudio, available at https://cran.r-project.org/ and https://www.rstudio.com/, respectively. R version 4.0.0 was used for this protocol and last tested December 23rd, 2023. RStudio is not required, but it is a free development environment that can make viewing, editing, and interacting with files for this protocol much easier.

  • 2.

    To use the code in this protocol, please install relevant packages as in the “key resources table” using this code in R:

> install.packages("package_name")
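As a convenience, all packages from the "key resources table" can be checked and installed in one sketch like the one below (package names are taken from the key resources table; installed versions may differ from those tested):

```r
# Install any packages from the key resources table that are missing
pkgs <- c("ggplot2", "dplyr", "reshape2", "stringr",
          "lubridate", "data.table", "minpack.lm")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```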

Pre-processing laboratory data

Timing: months (varies)

These steps cover the pre-processing needs for laboratory data if it has not been cleaned previously. This entails correcting typos, missing data, and mismatched labels, which are commonly introduced when collecting data from hospital and laboratory databases.

  • 3.

    The presumption at this step is that data has already been acquired via the correct pathways and requires cleaning due to multiple sources and laboratory assays. If data is already harmonized and no missing data issues are of concern, then the user may skip to the “step-by-step method details”.

  • 4.
    Please see the GitHub page (https://github.com/vmuse12/Lab_data_processing) for the full downloadable code for the pre-processing steps used in the Muse et al. paper.1 These steps can differ greatly from one institution to another due to differences in data collection systems and hospital systems.
    Note: The steps outlined here follow a dummy data set modeled after the real data available to our group, collected in Denmark during 2012–2015 (and still being collected), and should be used as a starting point for processing your own data. These data are not all physiologically possible, but an attempt was made to develop somewhat realistic data.
    • a.
      Run Step0_clean.R from GitHub.
      Note: This step loads in the respective data, labels the database source (in case multiple sources exist), and selects for the permission approved study window.
      • i.
        Apply a date filter to only allow for tests from 2012 to 2015 inclusive, due to the study design.
    • b.
      Run Step1_clean.R from GitHub.
      Note: This step conforms all testing codes to the same system using the NPU name harmonizing table (https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_test_lookup.tsv), which was carefully curated to reflect the active and usable test codes and English-language naming for the study time frame. Units are corrected using the unit correction table (https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_unit_lookup.tsv). The data is also split into two intermediate tables at this stage: those with numeric responses and those with non-numeric responses (such as negative/positive, but also failed/cancelled tests).
      • i.
        These tables are highly dependent on your data set, and those posted here are for demonstration only, as they are likely not fully equivalent for other cohorts.
      • ii.
        The NPU table can be modified to your institution’s system and use other systems such as LOINC codes as opposed to the Nordic NPU codes used in Denmark.
    • c.
      Run Step2_clean.R from GitHub.
      Note: This step loads in the quantitative data defined above and starts to process symbols such as “>”, “<”, and “=”. It also removes extra spaces, conforms decimal separator systems, and more.
      • i.
        A “FLAG” column is introduced to label tests that were below (−1), within (0), or above (1) the given reference interval for easy indexing of test results when working with these data.
        Note: This is not needed for seasonal investigations but is a useful step for other work, for example examining distributions of normal vs. abnormal test distributions in the populations or input features in a machine learning model where using continuous data is not an option. This labeling system will also enable the user to see how the distribution of tests changes before and after seasonal adjustment of the reference intervals.
    • d.
      Run Step3_clean.R from GitHub.
      Note: This step loads in the non-quantitative data and processes binary response data (positive/negative) into the FLAG system (1/0) so that it can be processed with the other data. This step is optional, as these data do not need seasonal correction, but it is required for standard laboratory data harmonization. It is also useful because the raw data contained typos and inconsistencies (e.g., neg, negativ, negative and pos, positiv, positive); conforming these responses to a clear binary system was therefore deemed relevant for other studies.
    • e.
      Run Step4_clean.R from GitHub.
      Note: This step corrects data to the appropriate numerical values, because the non-quantitative data set also included data with mixed text and numeric responses.
      • i.
        Examples of such issues stem from having “units” in the “value” field.
      • ii.
        Process the remainder of these tests as in step 4c for the “FLAG” system.
    • f.
      Run Step5_clean.R from GitHub.
      Note: This step takes the intermediate files with the “FLAG” data and merges them into one intermediate file for study use.
      • i.
        Please note this subset only includes data with a reference interval in a continuous system as well as a binary classifier (step 4d).
    • g.
      Run Step6_clean.R from GitHub.
      Note: This step processes all quantitative data cleaned in steps 4c and 4e that does not have reference interval information available and saves it in an intermediate file. The “FLAG” column is given an “NA” entry.
    • h.
      Run Step7_clean.R from GitHub.
      Note: This step takes all text responses that are not binary-related, merges them together, and saves them with “FLAG” entries set to “NA”.
    • i.
      Run Step8_clean.R from GitHub.
      Note: This step takes all quantitative data and preps it for further processing. It also removes non-linkable patients (e.g., tourists, who do not have the personal identification number assigned at birth in Denmark) and changes time stamps to local time (i.e., corrected for daylight saving time). Lastly, it removes redundant columns.
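As an illustration of the symbol handling (step 4c) and FLAG labeling (step 4c.i) described above, here is a minimal base-R sketch; the column names VALUE, REF_LOW, and REF_HIGH are hypothetical and not the ones used in the GitHub scripts:

```r
# Toy quantitative lab data with messy value strings
lab <- data.frame(
  VALUE    = c(" <5", "7,2", "150", "42"),
  REF_LOW  = c(4, 3.5, 100, 50),
  REF_HIGH = c(10, 8, 120, 120)
)

# Strip comparison symbols and spaces, harmonize the decimal separator
lab$VALUE_NUM <- as.numeric(gsub(",", ".", gsub("[<>= ]", "", lab$VALUE)))

# FLAG: -1 below, 0 within, 1 above the given reference interval
lab$FLAG <- ifelse(lab$VALUE_NUM < lab$REF_LOW, -1L,
            ifelse(lab$VALUE_NUM > lab$REF_HIGH, 1L, 0L))
```

The same three-valued FLAG convention is then reusable downstream, for example when comparing distributions before and after seasonal adjustment.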

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

Dummy NPU name harmonizing table GitHub, annotated in this protocol https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_test_lookup.tsv
https://doi.org/10.5281/zenodo.10598405
Dummy unit correction table GitHub, annotated in this protocol https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_unit_lookup.tsv
https://doi.org/10.5281/zenodo.10598405
Dummy dirty lab dataset GitHub https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_data_lite.tsv
https://doi.org/10.5281/zenodo.10598405
Dummy binary test list GitHub https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_binary_lookup.tsv
https://doi.org/10.5281/zenodo.10598405
Dummy person ID table for sex, date of birth (DOB), and date of death (DOD), if relevant GitHub https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_tpers_data.tsv
https://doi.org/10.5281/zenodo.10598405
Dummy text response table GitHub https://github.com/vmuse12/Lab_data_processing/blob/main/data_cleaning/raw_data/dummy_shown_list.tsv
https://doi.org/10.5281/zenodo.10598405
Step-by-step code In this protocol https://github.com/vmuse12/Lab_data_processing/
https://doi.org/10.5281/zenodo.10598405

Software and algorithms

R 4.0.0 R Core Team2 https://www.R-project.org/; RRID: SCR_001905
RStudio RStudio Team3 http://www.rstudio.com/; RRID: SCR_000432
ggplot2 3.4.0 R package Wickham4 https://cran.r-project.org/web/packages/ggplot2/index.html; RRID: SCR_014601
dplyr 1.1.3 R package Wickham et al.5 https://CRAN.R-project.org/package=dplyr; RRID: SCR_016708
reshape2 1.4.4 R package Wickham6 https://cran.r-project.org/web/packages/reshape2/index.html; RRID: SCR_022679
stringr 1.5.0 R package Wickham7 https://CRAN.R-project.org/package=stringr
lubridate 1.9.3 R package Grolemund and Wickham8 https://CRAN.R-project.org/package=lubridate
data.table 1.14.8 R package Dowle and Srinivasan9 https://CRAN.R-project.org/package=data.table
minpack.lm 1.2–4 R package Elzhov et al.10 https://CRAN.R-project.org/package=minpack.lm

Other

Computer with an operating system that can run R and RStudio versions as listed above https://posit.co/download/rstudio-desktop/

Materials and equipment

All code was run on a supercomputer hosted at the National Genome Center in Denmark. General specifications include 4 GPU nodes, extra compute nodes (thin/fat), and 1.2 PB of storage with 200 TB of high endurance flash storage. Please contact the authors if more detailed specifications are of interest.

It is possible to run the described code locally (the dummy data and code should run locally), but this will likely crash or take an enormous amount of time depending on your computer’s specifications and the size of your dataset. The preprocessing steps are more computationally demanding, while the mathematical modeling steps are more manageable in some cases. It is wise to test your computer’s capabilities beforehand rather than risk crashing local sessions and losing progress.
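One simple way to gauge local feasibility is to time a small sample read and scale up; the sketch below creates a toy tab-separated file so it is self-contained (substitute the path to your own data in practice):

```r
# Create a small toy .tsv so the example is self-contained
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(id = 1:10000, value = rnorm(10000)),
            tmp, sep = "\t", row.names = FALSE)

# Time a 1,000-row sample read; scale the elapsed time up to your real
# dataset's size to estimate whether a full local run is viable
t_sample <- system.time(dat <- read.delim(tmp, nrows = 1000))
print(t_sample["elapsed"])
```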

Step-by-step method details

Here we describe the steps needed to analyze your data for seasonality patterns and thereafter adjust existing reference intervals for these changes, specific to your patient cohort and research question. The steps here are available in a generalized way on GitHub (https://github.com/vmuse12/Lab_data_processing), with annotated sections where modifications are needed to alter code and input parameters to your study specifications.

Identify seasonality in your data

Timing: days to weeks (can vary based on dataset size)

Not all tests are seasonal, or seasonal in all parts of the world. These steps are crucial for characterizing your laboratory data set, identifying which tests may exhibit seasonality patterns, and quantifying to what extent. It may be prudent to set a threshold above which you deem it necessary to adjust for seasonality; for example, in some studies you may only care about seasonality changes >5%, while in others the threshold may be >1%. Tests found not to exhibit seasonality in this section can retain their originally defined reference intervals. Steps correlate to script labels as seen in the GitHub folder (https://github.com/vmuse12/Lab_data_processing).
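Because tests are normalized to the stratum median, the fitted amplitude from step 3 is a relative change, so such a threshold can be applied directly to the parameter table. A sketch with hypothetical fitted values and a 5% cutoff:

```r
# Hypothetical fitted amplitudes (relative to stratum median) and FDR values
fits <- data.frame(
  name      = c("SODIUM - P", "VITAMIN D - P", "ALBUMIN - P"),
  amplitude = c(0.004, 0.21, 0.006),
  a_fdr     = c(0.90, 1e-6, 0.12)
)

# Keep only tests with a significant seasonal swing of more than 5%
seasonal <- subset(fits, abs(amplitude) > 0.05 & a_fdr < 0.05)
seasonal$name
```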

  • 1.
    Load in your cleaned data set.
    Note: This step loads data as described in the preprocessing steps and prepares it for mathematical fitting. A summary of steps for this script is outlined here:
    • a.
      Assign age groups; this example uses 10-year groups. Patients over 100 years old are removed, as there is not sufficient data in this cohort.
    • b.
      Take the average lab value for patients with more than one unique test per day.
    • c.
      Define stratum groups (here, unique age/sex/unit/test/lab ID) and take the median value of the given test for each.
    • d.
      Normalize all tests to the respective stratum’s median calculated in 1c.
    • e.
      Apply a minimum threshold on unique observations as a filter (modifiable).
    • f.
      Save an intermediate data file for stratum counts and unique values for the next step.
  • 2.
    Analyze data by strata.
    Note: This step starts to look at trends at the week-year level and separates the data according to specific research questions (e.g., sex/age/mortality trends). A summary of steps is outlined here:
    • a.
      Calculate and store median values per week-year for each test.
    • b.
      Store counts of observations per week-year for later steps.
    • c.
      Save research question specific intermediate versions of these files for specific fitting to be performed at the next step.
      Note: The mock file only looks at data by unique test, but it is possible to investigate more features (sex/age/mortality).
  • 3.
    Fit data to chosen model.
    Note: This step pulls in all the specific data for the research question and fits it to the chosen seasonal mathematical model (here a low-parameter model as a starting point). This step has several parts that need user input specific to the user’s cohort size. During fitting, an output of parameter-fit values is iteratively stored and saved. The general approach is outlined here:
    • a.
      Load in patient value data and test count data and filter it for the specific study window.
    • b.
      Apply a filter for a minimum (here, 50) unique tests per week (by unique observation not by patient).
    • c.
      Define a weight function to make model fitting proportional to the number of unique tests available per week.
    • d.
      Define the desired mathematical model (here, a low-parameter sinusoidal model); the data is then fed through the model using non-linear least squares regression.
    • e.
      Apply another filtering step to avoid model convergence issues (here, ∼50% of weeks must have sufficient data, as set in step 3b, for the respective test to be processed).
    • f.
      Save parameter fits (and p-values, FDR, corrected for multiple testing) and stratum predicted values in a .tsv file and .pdf file with figures, respectively.
      CRITICAL: The code for these steps requires a lot of user input, as it is very specific to the cohort size and specifications. The models will not converge if this is not done properly. Please read the comments within each script and consider what is best for your research project and cohort size.
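The weighted fitting in step 3 can be sketched on simulated weekly data as below. The model form (amplitude, week offset, height) is one plausible low-parameter sinusoid, and the starting values are illustrative rather than the exact settings used in the GitHub scripts:

```r
library(minpack.lm)  # from the key resources table

set.seed(1)
weeks  <- 1:52
counts <- sample(50:500, 52, replace = TRUE)  # unique tests per week (steps 3b/3c)

# Simulated median-normalized weekly values with a 5% seasonal swing
y <- 0.05 * sin(2 * pi * (weeks - 10) / 52) + rnorm(52, sd = 0.01)

# Low-parameter sinusoidal model, weighted by weekly counts so that
# data-rich weeks dominate the non-linear least-squares fit (steps 3c/3d)
fit <- nlsLM(y ~ A * sin(2 * pi * (weeks - offset) / 52) + h,
             start   = list(A = 0.01, offset = 1, h = 0),
             weights = counts)
round(coef(fit), 3)
```

Note that the sign of A and the value of offset are only identified jointly (a negated amplitude is equivalent to a half-period offset shift), which is one reason boundary convergence can occur (see troubleshooting problem 3).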

Apply seasonality parameters to current reference intervals

Timing: days

Once you have calculated all the parameter fits, you can easily calculate new reference intervals for your original dataset.

  • 4.
    Identify tests with seasonality.
    Note: From step 3 you should have a table of all parameter-fit values for each test by stratum. This step then merges this information back onto the original data set. This should be completed for each iteration of reference interval adjustment you wish to use (sex/age/mortality/etc.). An overview of the steps:
    • a.
      Load in original lab data again with the age groups defined.
    • b.
      For any lab tests in which seasonal variation was detected as significant (using FDR multiple-testing correction), merge the data with the parameter data and save as one file.
      Note: In this example, we suggest requiring both the amplitude and the week-offset parameter to be deemed significant in order to continue.
  • 5.
    Calculate new reference intervals.
    Note: This script takes the data from step 4b and calculates new reference intervals, based on the existing ones, per week of the year by inputting the stratum-specific parameters into the fitted equation. A new FLAG value is calculated based on this using the same approach as in pre-processing step 4c. An overview of the steps:
    • a.
      Load in data from step 4b.
    • b.
      Calculate week and year information from each test in a new column.
    • c.
      Calculate new upper and lower reference values based on the fitted equation.
    • d.
      Calculate a new column of “FLAG” data, here termed “FLAG2” for easy comparison to the original FLAG value.
    • e.
      Save these data for future research endeavors.
      Optional: Steps 4 and 5 are optional if you are not interested in adjusting reference intervals. The end of step 3 should suffice if your only interest is detecting and characterizing seasonality.
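One plausible way to carry out steps 5c and 5d is to scale the original limits by the week-specific seasonal term; the parameter values below are hypothetical, and the multiplicative form is an illustrative assumption rather than the exact equation used in the scripts:

```r
# Hypothetical fitted parameters for one stratum (model form as in step 3)
A <- 0.05; offset <- 10
ref_low <- 3.5; ref_high <- 8.0   # original reference interval
week <- 2                         # week of year of the test result

# Relative seasonal shift for this week (values were median-normalized)
seasonal_term <- A * sin(2 * pi * (week - offset) / 52)
new_low  <- ref_low  * (1 + seasonal_term)
new_high <- ref_high * (1 + seasonal_term)

# Re-flag the result against the adjusted interval ("FLAG2", step 5d)
value <- 7.8
FLAG2 <- ifelse(value < new_low, -1L, ifelse(value > new_high, 1L, 0L))
```

In this toy case the original interval would flag 7.8 as normal (FLAG = 0), while the seasonally lowered upper limit flags it as high (FLAG2 = 1), illustrating how FLAG and FLAG2 can differ after adjustment.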

Expected outcomes

Expected results should take the form of a data table in a .tsv file similar to the one below. Depending on the stratum modeling you defined, there may be more columns of information. The results shown here are an example of possible parameter outputs. Additional outcomes would be figures of predicted seasonality patterns by stratum, if selected within the code.

Table 1 exemplifies what a typical result may look like using the data and code provided on GitHub. As mentioned in Step 5, the user can determine the threshold for what is deemed significantly changing. In this table, we can see that the “ALBUMIN / CREATININE; RATIO – U” and “ALKALINE PHOSPHATASE – P” tests have p-values <0.05 for the amplitude and offset parameters and therefore represent tests that likely experience significant seasonality changes throughout the year. Figure 1 demonstrates an example of a PDF output from Step 3 using the dummy data for thyrotropin (TSH), a known seasonally changing laboratory test.11

Table 1.

Examples of parameter fit output table, typically in a .tsv format

name                               Amplitude     A_pval        Offset        offset_pval   Height        h_pval
ALANINE TRANSAMINASE (ALAT) - P    0.021310171   0.000378179   2.447618948   1             -0.008224443  1
ALBUMIN - P                        0.006032911   0.098564193   6.949313116   1             -0.000892322  1
ALBUMIN - U                        0.022871674   0.003365622   3.570973976   1             0.002383952   1
ALBUMIN / CREATININE; RATIO - U    -0.035591689  2.60E-05      11.4368402    3.79E-07      0.029354362   3.31E-05
ALKALINE PHOSPHATASE - P           0.006836668   5.19E-06      8.533295192   4.75E-05      0.000626055   1
ALPHA-1 GLOBULIN - P               0.015825286   0.069753204   5.911935973   1             0.014806538   0.74840031

Figure 1.


An example of an expected output from the dummy data set and protocol

This is an automated output given by step 3f, demonstrating the model fit on thyrotropin (TSH) test data (example data). The color of the bars correlates with the available data per week (the function is weighted to these data), and the solid blue line is the estimated fit. The data are centered around 0, which represents the median value for the given stratum; the figure shows that an approximate change of ±100% occurs over the year.

Limitations

As noted in several steps, the fitting algorithm may not converge in some cases due to lack of data or other processing issues. It is therefore essential to have a data set robust enough to assess seasonality effectively across the patient cohort used. We suggest at least 50 observations per week, as shown in the example case. In some cases there may not be enough data to obtain a parameter fit, but this does not mean the test is not seasonal. Instead, alternative ways to detect patterns could be used, for example comparing data season by season (four subsets of data) or similar, more basic approaches (monthly assessments instead of weekly).
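As a sketch of the coarser monthly fallback on toy data, monthly medians can be compared, and a fit-free check such as a Kruskal-Wallis test across months can indicate whether any seasonal signal exists at all:

```r
set.seed(2)
# Toy data: 2,000 test results spread across a four-year window
dat <- data.frame(
  date  = as.Date("2012-01-01") + sample(0:1460, 2000, replace = TRUE),
  value = rnorm(2000, mean = 5)
)
dat$month <- format(dat$date, "%m")

# Monthly medians instead of weekly model fits
monthly <- aggregate(value ~ month, dat, median)

# Fit-free test for any month-to-month difference in distribution
kruskal.test(value ~ month, data = dat)
```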

Troubleshooting

Problem 1

Scripts crash due to memory limits or because the R limit of 2^31 cells in a matrix is reached.

Potential solution

  • Re-run code with more memory requested.

  • Break code into smaller steps.

Problem 2

Scripts may crash at step 3 because the parameter fitting cannot converge.

Potential solution

  • Increase the filtering requirements; you may need a larger cohort size.

  • The test may not be seasonal at all; plot the data and see what it looks like. You may need to manually remove it from the fitting algorithm to prevent the pipeline from crashing.

  • As noted before, if your cohort is small, look into fitting data by month or season, instead of weekly as described here.

  • You can introduce code breaks or skip errors using R’s “tryCatch” function, for example.
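A minimal sketch of that tryCatch pattern (the model formula and column names here are illustrative, not the exact ones in the scripts):

```r
# Return NULL instead of stopping the whole pipeline when a fit errors out
safe_fit <- function(df) {
  tryCatch(
    nls(y ~ A * sin(2 * pi * (week - offset) / 52) + h,
        data = df, start = list(A = 0.01, offset = 1, h = 0)),
    error = function(e) {
      message("fit failed: ", conditionMessage(e))
      NULL
    }
  )
}

# A data set missing the response column errors inside nls, but the
# wrapper catches it so the surrounding loop over tests can continue
bad <- data.frame(week = 1:52)
fit <- safe_fit(bad)
is.null(fit)  # TRUE
```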

Problem 3

In step 3 of the seasonal adjustment, some parameters may converge to boundary values (for example, offset = 52).

Potential solution

  • Change the starting input parameters to help the optimization.

Problem 4

You get warnings in various steps.

Potential solution

  • Some warnings are expected, for example when we extract all numeric values by calling the function “as.numeric”.

  • Check the source of the warning and how it affects your output. Some warnings are harmless, as they relate to package updates, changes, or similar.
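For the as.numeric case specifically, a small sketch of inspecting, rather than blindly silencing, the coercion warning:

```r
x <- c("7.2", "150", "POS", "<5")

# "NAs introduced by coercion" is expected here; suppress it deliberately
num <- suppressWarnings(as.numeric(x))

# Inspect which entries failed to coerce and still need text handling
which(is.na(num))  # entries 3 and 4 ("POS" and "<5")
```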

Problem 5

When running the scripts, certain files cannot be found.

Potential solution

  • Check the directory you set; when downloading the scripts from GitHub, it’s possible there will be directory issues depending on where you saved the information.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Søren Brunak (soren.brunak@cpr.ku.dk).

Technical contact

Further technical information requests should be directed to and will be fulfilled by the technical contact, Victorine Muse (victorine.muse@cpr.ku.dk).

Materials availability

This study did not generate any new materials.

Data and code availability

All fundamental code for this protocol is available at https://github.com/vmuse12/Lab_data_processing (https://doi.org/10.5281/zenodo.10598405). As noted in the protocol and within the code’s annotations, several steps need user input to adapt the code to other users’ datasets and research questions. Some steps of the code were removed/hidden because they contained patient-sensitive information; these sections are noted within the code as well. Referenced tables are available at https://github.com/vmuse12/Lab_data_processing/raw_data but may not be applicable to the user’s dataset, as they are mostly specific to Danish healthcare records and testing protocols.

The data originally used to develop this protocol are not publicly available as they contain person-sensitive information. Application for data access can be made to the Danish Health Data Authority (contact servicedesk@sundhedsdata.dk). All studies should be conducted in compliance with the Danish Act on Processing of Personal Data and all other applicable laws and regulations. Anyone wanting access to the data and to use them for research will be required to meet research credentialing requirements as outlined at the authority’s website: https://sundhedsdatastyrelsen.dk/da/english/health_data_and_registers/research_services. Requests are normally processed within 3–6 months.

Acknowledgments

We thank the Novo Nordisk Foundation (NNF14CC0001 and NNF17OC0027594) as well as the Danish Innovation Fund (5184-00102B) for providing funding for the study. V.P.M. is the recipient of a fellowship from the Novo Nordisk Foundation as part of the Copenhagen Bioscience PhD Programme, supported through grant (NNF19SA0035440).

Author contributions

V.P.M. and S.B. conceived the study. V.P.M. developed, tested, and validated the code. V.P.M. also drafted the protocol and repurposed the developed code for public sharing and use as seen on GitHub. S.B. supervised, edited, and proofed the final protocol and related documents shared on the GitHub page.

Declaration of interests

S.B. reports ownerships in Intomics A/S, Hoba Therapeutics ApS, Novo Nordisk A/S, Lundbeck A/S, and ALK-Abello A/S and managing board memberships in ProScion A/S and Intomics A/S.

Contributor Information

Victorine P. Muse, Email: victorine.muse@cpr.ku.dk.

Søren Brunak, Email: soren.brunak@cpr.ku.dk.

References

  • 1. Muse V.P., Aguayo-Orozco A., Balaganeshan S.B., Brunak S. Population-wide analysis of hospital laboratory tests to assess seasonal variation and temporal reference interval modification. Patterns. 2023;4. doi: 10.1016/j.patter.2023.100778.
  • 2. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2021. https://www.R-project.org
  • 3. RStudio Team. RStudio: Integrated Development Environment for R. RStudio Team; 2020.
  • 4. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer; 2009.
  • 5. Wickham H., François R., Henry L., Müller K., Vaughan D. dplyr: A Grammar of Data Manipulation. 2023. https://github.com/tidyverse/dplyr
  • 6. Wickham H. Reshaping data with the reshape package. J. Stat. Softw. 2007;21:1–20. doi: 10.18637/jss.v021.i12.
  • 7. Wickham H. stringr: Simple, Consistent Wrappers for Common String Operations. 2022. https://github.com/tidyverse/stringr
  • 8. Grolemund G., Wickham H. Dates and times made easy with lubridate. J. Stat. Softw. 2011;40:1–25. doi: 10.18637/jss.v040.i03.
  • 9. Dowle M., Srinivasan A. data.table: Extension of `data.frame`. 2023. https://github.com/Rdatatable/data.table
  • 10. Elzhov T., Mullen K., Spies A.-N., Bolker B. minpack.lm: R Interface to the Levenberg-Marquardt Nonlinear Least-Squares Algorithm Found in MINPACK, Plus Support for Bounds. 2023.
  • 11. Wang D., Cheng X., Yu S., Qiu L., Lian X., Guo X., Hu Y., Lu S., Yang G., Liu H. Data mining: Seasonal and temperature fluctuations in thyroid-stimulating hormone. Clin. Biochem. 2018;60:59–63. doi: 10.1016/j.clinbiochem.2018.08.008.



Articles from STAR Protocols are provided here courtesy of Elsevier
