Abstract
We developed a novel data mining pipeline that automatically extracts potential COVID-19 vaccine-related adverse events from a large Electronic Health Record (EHR) dataset. We applied this pipeline to the Optum® de-identified COVID-19 EHR dataset, which contains COVID-19 vaccine records between December 11, 2020 and January 20, 2022. We compared post-vaccination diagnoses between the COVID-19 vaccine group and the influenza vaccine group among 553,682 individuals without COVID-19 infection. We extracted 1,414 ICD-10 diagnosis categories (the first three characters of each ICD-10 code) within 180 days after the first dose of the COVID-19 vaccine. We then ranked the diagnosis codes using the adverse event rates and adjusted odds ratios based on the self-controlled case series analysis. We used inverse probability of censoring weighting to account for right-censored time-to-event records. Our results show that the COVID-19 vaccine has an adverse event rate similar to that of the influenza vaccine. We found 20 types of potential COVID-19 vaccine-related adverse events that may need further investigation.
1. Introduction
According to the Centers for Disease Control and Prevention (CDC), as of Feb 10, 2022, 252 million people (76% of the U.S. population) had received at least one dose of the COVID-19 vaccine. Although three phases of clinical trials were performed and the first Emergency Use Authorization (EUA) for the COVID-19 vaccine was issued in December 2020 [1], safety concerns remain one of the reasons for “vaccine hesitancy” [2]. Based on the COVID-19 vaccine trial record, adverse reactions of mild to moderate intensity, including headache, chills, fatigue, myalgia, and pain at the site of injection, were reported for mRNA vaccines [3]. Thomas et al. [4] studied 6 months of follow-up on 22,026 subjects who participated in a Pfizer COVID-19 vaccine efficacy trial. New adverse events identified included decreased appetite, lethargy, asthenia, malaise, night sweats, and hyperhidrosis [4]. The Vaccine Adverse Event Reporting System (VAERS), co-managed by the CDC and the Food and Drug Administration (FDA), and a smartphone-based health check-in application called V-safe are two commonly used data sources for COVID-19 vaccine studies. An early study confirmed those short-term side effects during the first month (December 14, 2020–January 13, 2021) after the EUA was issued [5]. A recent study using VAERS and V-safe not only showed common side effects after the first dose, second dose, and booster, but also found myocarditis cases related to the COVID-19 vaccine [6]. An important complementary approach for vaccine safety studies is retrospective analysis using EHR data, which provide a large and diverse study sample, including specific sub-groups that do not meet the inclusion criteria of clinical trials. Curtis and colleagues analyzed the clinical characteristics of COVID-19 vaccine recipients in an EHR dataset of 57.9 million patients [7]. McMurry et al. stated that severe adverse effects are rare among individuals receiving the Pfizer and Moderna COVID-19 vaccines, based on 1.2 million clinical notes [8].
There are several challenges for vaccine side-effect studies using EHR data [9]. EHR data can introduce biases through the search criteria, and these biases may be difficult to control because of missing data. For instance, healthier people are less likely to have extensive historical health information in their EHR because they need fewer healthcare encounters. EHR data are usually collected from multiple data sources using different concepts, terminologies, and record settings. To harmonize such data, for example, diagnosis codes need conversion between ICD-9 and ICD-10, and free-text lab test results require Natural Language Processing (NLP). Moreover, data discontinuity and latency of release are common issues for EHR data that cause missing follow-up data on adverse events. More importantly, large-scale EHR data require significant time and effort for preprocessing before they are ready for analysis.
To provide an efficient way to explore and process a large-scale EHR dataset, we previously developed a fast temporal query system with Event-level Inverted Index (ELII) [10, 11]. In this earlier work, we built the ELII database using the January 2022 version of the Optum® COVID-19 dataset, which contains 8.87 million patients in the U.S. with COVID-19 related health records. Leveraging the fast temporal query of ELII, a complicated cohort selection using multiple set intersections can be finished in minutes. We also implemented a sampling method for this study to accelerate data extraction and statistical analysis.
Existing COVID-19 vaccine studies often use case-control analysis with data from two independent groups: the case group receives the COVID-19 vaccine, and the control group receives the placebo. The method requires both groups to come from the same population and be independent of the exposure of interest [12], which is difficult to match perfectly in practice. Especially for vaccines with a high coverage rate like the COVID-19 vaccine, it would be difficult to recruit unvaccinated controls [13]. The within-person self-controlled case series method is an alternative study design originally developed for evaluation of vaccine safety [14]. It overcomes the difficulty of comparison group selection by utilizing intra-person comparisons in a population of individuals with both the case and control periods. In this paper, we design a self-controlled case series method using the COVID-19 vaccine as the case and the latest influenza vaccine before the pandemic as the control for each selected individual.
Censoring occurs if a patient’s follow-up stops before the end of the designed observation period, which may be a cause for missing data in that the events could have happened but were not recorded. This missing data scenario can be typical in COVID-19 EHR because of data release latency and short observation windows for patients with recent vaccinations. The simple solution is to exclude such patients from the study. However, some covariates may be associated with this censoring mechanism, so dropping or keeping the censored data may lead to biased results [15]. The inverse probability censoring weighted (IPCW) estimator [16] was developed to correct such bias for dependent censoring. In this study, we use the same implementation proposed in another study which uses the Kaplan–Meier estimator to calculate the probability of censoring at each time point [17].
To address such challenges, we introduce a workflow to identify potential adverse reactions of the COVID-19 vaccine. The pipeline can be applied to other types of healthcare data and the preliminary result and extracted data may be reused for further statistical analyses. Our contributions are the following:
A reusable computational pipeline using fast EHR temporal cohort selection with inverted index, parallel data extraction from a shared database, and patient sampling mode. This pipeline accelerated data exploration, extraction, and analysis for a large-scale EHR dataset;
No pre-defined adverse events are required; the pipeline automatically examines all possible diagnoses and provides a ranking based on their occurrence rates; and
Our pipeline leverages modern statistical methods such as self-controlled case series and IPCW estimator to achieve enhanced scientific rigor in secondary analysis of real-world data.
2. Methods
2.1. Workflow
Figure 1 shows the data processing workflow for comparing the diagnosis occurrence rate between two exposure events using EHR data. The first step was to initialize the sample of interest, which included subjects with index events (e.g., COVID-19 vaccine) and control index events (e.g., influenza vaccine). The query can include exclusive events. In this study, we excluded subjects who had COVID-19 infection during any period of time. The initial queries can be accomplished in various categories (e.g., Diagnoses, Procedures, and Lab Tests) using standard coding systems, such as International Statistical Classification of Diseases and Related Health Problems (ICD), Current Procedural Terminology (CPT), and Logical Observation Identifiers Names and Codes (LOINC). Multiple iterations may be needed to adjust the cohort, and researchers are given an estimated sample size for each query. Our pipeline supports a fast keyword-searching function. The input keywords can be translated into standard codes in addition to being utilized for text matching. The automatic code mapping function lets researchers skip the process of manually looking up the codes and returns more relevant events.
Figure 1: Workflow.
Once the initialized cohort has been identified, the next step is cohort filtering with detailed constraints. We can specify the age and gender of the cohort once we have obtained the demographic information for each patient. By extracting a subject’s first and last records, we can determine whether the patient had complete observation data, left censoring (lack of history), or right censoring (lack of follow-up). Temporal constraints can also be applied at this step. In this work, we focused only on influenza vaccination cases in 2019, and the time between receiving the COVID-19 vaccine and receiving the influenza vaccine should be no more than one year.
At this point, since a cohort with both index event and control index event has been precisely identified, we can design the self-controlled case series analysis. Because a subject’s record timeline could contain multiple index events or control index events, we need to determine which index dates are selected for analysis. We then defined two periods of time: the observation window captures the events after the index event for occurrence rate computation, and the history window is a period before the index date with the same length as the observation window. We can use the events that happened during the history window to determine 1) the health status of the subject at the index date, and 2) whether the observed diagnoses were existing conditions or newly diagnosed events.
After the case period and control period are clearly defined, the next step is to extract diagnoses during the observation window and history window. To unify the diagnosis codes, we converted ICD-9 codes to ICD-10 codes using 2018 release ICD-9 to ICD-10 General Equivalence Mappings (GEMs) [18]. The diagnosis date of each event was also extracted for computing the days since the index event. The “days” were used for calculating the weight of event occurrence at the next step.
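The conversion and day-offset steps above can be sketched as follows. This is a minimal illustration, not the authors' tool: the two mapping entries are a hand-written subset standing in for the full GEMs file, and the function names `to_icd10` and `days_since_index` are hypothetical.

```python
from datetime import date

# Illustrative subset of an ICD-9 -> ICD-10 lookup (decimal points removed).
# Assumption: a real run would load the full GEMs file instead of this dict.
ICD9_TO_ICD10 = {
    "4019": ["I10"],    # essential hypertension
    "25000": ["E119"],  # type 2 diabetes mellitus without complications
}


def to_icd10(code, code_type):
    """Return the ICD-10 codes for one diagnosis record (empty if unmapped)."""
    if code_type == "ICD10":
        return [code]
    if code_type == "ICD9":
        return ICD9_TO_ICD10.get(code, [])
    return []  # other code types (e.g. SNOMED) are not converted here


def days_since_index(diag_date, index_date):
    """Number of days from the index event to the diagnosis date."""
    return (diag_date - index_date).days


print(to_icd10("4019", "ICD9"))
print(days_since_index(date(2021, 3, 1), date(2021, 2, 1)))
```

A GEMs entry can map one ICD-9 code to several ICD-10 codes, which is why the lookup returns a list.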
Once the diagnosis events with the days from the index date are extracted, statistical analyses can be implemented according to each diagnosis code. Based on different diagnosis events, the subjects’ weights were calculated using IPCW, which adjusted the weights for the right-censored and complete subjects when computing the diagnosis occurrence rate. The subjects’ weights described above are used for adjustment in odds ratio generation. Ultimately, a ranking list of all observed diagnoses is generated based on their adjusted odds ratios between the index event and the control index event.
2.2. Dataset
We used Optum® COVID-19 dataset which is drawn from dozens of healthcare providers in the United States, including more than 700 hospitals and 7,000 clinics. In the January 2022 release, the dataset consisted of 8.87 million unique individuals who have documented clinical care with a documented diagnosis of COVID-19 or acute respiratory illness after 02/01/2020 and/or documented COVID-19 testing regardless of their results. The data incorporated a wide swath of raw clinical data, including new, unmapped COVID-specific clinical data points from both inpatient and ambulatory electronic medical records, which included patient-level information: demographics, diagnoses, procedures, lab tests, care settings, medications prescribed or administered, and mortality. These data are certified as de-identified by independent statistical experts following Health Insurance Portability and Accountability Act (HIPAA) statistical de-identification rules and are managed by the Optum® customer data use agreement.
The release contained 17 individual source files for different types of EHR records, such as patient demographics, diagnosis, medication, and lab. Each source file came with several types of attributes (i.e., the “columns”). For instance, the PATIENT source file contained demographic attributes such as gender, age, and race. The DIAGNOSIS source file contained attributes such as diagnosis code, diagnosis code type, and diagnosis status, whereas the IMMUNIZATION source file contained attributes such as drug name, National Drug Code (NDC), quantity of dose, and dose frequency. Definitions of these attribute types are given in the accompanying data dictionary provided by Optum®.
2.3. Query for COVID-19 Vaccine Cohort Initialization
Data for the current analysis were obtained from the January 2022 release of the Optum® COVID-19 dataset. The database contains 8,871,509 patients with recorded COVID-19 related events such as diagnoses and lab tests. In this paper, we performed three queries to establish a cohort:
To find patients who received the COVID-19 vaccine (index event): The COVID-19 vaccine records were retrieved by vaccine sale labeler (e.g. Moderna, Pfizer-BioNTech, and Janssen), vaccine long/short description (e.g. COVID-19, mRNA, LNP-S, PF, 100 mcg/0.5 mL dose), COVID-19 NDC (80777-273-10, 80777-273-15, 59267-1000-1, 0310-1222-10, and 59676-580-05), and vaccination CPT for COVID-19 vaccine (0001A, 0002A, 0011A, 0012A, 0021A, 0022A, 0031A, 0041A, 0042A). The descriptions and codes are provided by the CDC [19] and the American Medical Association (AMA) [20].
To find patients who received the influenza vaccine (control index event): The influenza vaccine records were retrieved by influenza vaccine NDC (e.g. 49281-0635-15) provided by CDC [21]. We also implemented string matching on the description fields using keywords (influenza and flu shot).
To find patients with confirmed COVID-19 infection: The COVID-19 infection records were retrieved from two tables: Diagnosis and Lab Test. For diagnosis, we searched the ICD-10 diagnosis code U071. Since we observed U071 records before the pandemic in the Optum® COVID-19 dataset (emergency use for other unspecified diseases), the query result only includes U071 records after February 2020, when community transmission of COVID-19 was first detected in the United States [22]. For the Lab Test table, we searched for the COVID-19 related Polymerase Chain Reaction (PCR) tests, antibody tests, and antigen tests using standard test codes and terms provided by Logical Observation Identifiers Names and Codes (LOINC) [23]. We also included non-standard lab test names related to the COVID-19 test using keyword matching. To determine the positive result of each test record, we manually reviewed all the possible texts from the result field of extracted COVID-19 tests. Based on our method, 990,842 (14.2%) patients of the entire population from the Optum® COVID-19 dataset were confirmed COVID-19 infection cases.
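After obtaining the three patient groups from the above queries, we initialized a cohort of patients without COVID-19 infection who received both the COVID-19 vaccine and the influenza vaccine, that is, the intersection of the two vaccine groups minus the COVID-19 infection group.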
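In set terms, the cohort initialization amounts to an intersection followed by a difference. The sketch below assumes each query returns a collection of patient IDs; the function name `initialize_cohort` and the IDs are hypothetical.

```python
def initialize_cohort(covid_vaccinated, flu_vaccinated, covid_infected):
    """Patients with both vaccines and no recorded COVID-19 infection."""
    both_vaccines = set(covid_vaccinated) & set(flu_vaccinated)
    return both_vaccines - set(covid_infected)


# Hypothetical patient IDs standing in for the three query results.
cohort = initialize_cohort(
    covid_vaccinated=["p1", "p2", "p3", "p4"],
    flu_vaccinated=["p2", "p3", "p4", "p5"],
    covid_infected=["p3"],
)
print(sorted(cohort))
```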
2.4. Self-controlled Case Series
Self-controlled case series analysis was performed on the patients from the initialized cohort. Figure 2 illustrates a patient timeline including a COVID-19 vaccine record and an influenza vaccine record. The 180-day observation window (exposure period) for each individual started on the day of their first dose of the COVID-19 vaccine. The 180 days before each individual’s vaccination date were considered as a history window (baseline period). We selected one influenza vaccine record between 6 and 18 months before the pandemic as the control event for each individual, for two reasons: 1) to ensure no overlap between the COVID-19 vaccine observation window and the influenza vaccine observation window; and 2) to keep the gap between the case and control periods as small as possible so that each patient had similar health status during the two periods.
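The window definitions above can be illustrated with a short sketch. It assumes half-open date intervals (start included, end excluded), which the text does not specify; `index_windows` and `overlaps` are hypothetical helper names, and the two dates are made up for the example.

```python
from datetime import date, timedelta

WINDOW_DAYS = 180  # length of both the observation and history windows


def index_windows(index_date, window_days=WINDOW_DAYS):
    """Return the history and observation windows around an index date."""
    span = timedelta(days=window_days)
    return {
        "history": (index_date - span, index_date),      # baseline period
        "observation": (index_date, index_date + span),  # exposure period
    }


def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one day."""
    return a[0] < b[1] and b[0] < a[1]


covid = index_windows(date(2021, 3, 1))  # hypothetical first COVID-19 dose
flu = index_windows(date(2019, 10, 1))   # hypothetical influenza vaccine
print(overlaps(covid["observation"], flu["observation"]))
```

A candidate influenza record would be rejected when `overlaps` returns `True` for the two observation windows.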
Figure 2: Pictorial representation of self-controlled case-series study design.
In this paper, we used the Optum® COVID-19 dataset, which contains records through January 20, 2022. This meant that the patients who received the COVID-19 vaccine after March 20, 2021, had COVID-19 vaccine observation windows of less than 180 days. Such patients were considered right-censored patients. For the patients whose observation terminated earlier than January 20, 2022, we used their last record date as the termination date because no indicator of censoring was given in the dataset. We extracted all patients’ diagnoses and days since the COVID-19 vaccine date, then re-weighted each patient using IPCW. The method uses the Kaplan–Meier estimator of the survival distribution to build a function of t that gives the probability that the censoring time is greater than t days. All the censored patients were then dropped from the occurrence rate calculation, and the remaining records’ weights were set to the inverse probability of censoring weight.
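The re-weighting step can be sketched as follows. This is one common form of IPCW under stated assumptions, not necessarily the exact implementation of [17]: each patient has a follow-up length in days (capped at 180) and a flag saying whether follow-up ended by censoring; the censoring distribution is estimated by a Kaplan–Meier product in which censoring plays the role of the event; and uncensored patients get weight 1/G(t−). The function names are hypothetical.

```python
from collections import Counter


def km_censoring_survival(times, censored):
    """Kaplan-Meier estimate of G(t-) = P(censoring time >= t).

    times: follow-up length in days for each patient.
    censored: True where follow-up ended by censoring.
    """
    at_time = Counter(times)
    censored_at_time = Counter(t for t, c in zip(times, censored) if c)
    n_at_risk = len(times)
    survival = 1.0
    steps = []  # (time, survival just after that time)
    for t in sorted(at_time):
        d = censored_at_time[t]  # number censored exactly at t
        if d:
            survival *= 1.0 - d / n_at_risk
            steps.append((t, survival))
        n_at_risk -= at_time[t]  # everyone observed at t leaves the risk set

    def g_before(t):
        value = 1.0
        for step_time, step_survival in steps:
            if step_time >= t:
                break
            value = step_survival
        return value

    return g_before


def ipcw_weights(times, censored):
    """Weight 0 for censored patients, 1 / G(t-) for the others."""
    g_before = km_censoring_survival(times, censored)
    return [0.0 if c else 1.0 / g_before(t) for t, c in zip(times, censored)]


# Four hypothetical patients: one lost to follow-up at day 100.
weights = ipcw_weights([180, 180, 100, 180], [False, False, True, False])
print(weights)
```

In this small example the three complete patients share the weight of the dropped patient, so the weights still add up to the number of patients.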
2.5. Diagnoses Extraction
For each patient, we extracted all diagnoses that occurred inside the 180-day observation window and the history window of the identified COVID-19 vaccine and influenza vaccine. The extraction query needed to examine or project six fields for records from the DIAGNOSIS table: 1) PTID, the patient ID for linking across tables; 2) DIAG DATE, the date of diagnosis in the format MM-DD-YYYY; 3) DIAGNOSIS CD, an ICD-9, ICD-10, or SNOMED code (98% of records are ICD-9 or ICD-10 codes; the decimal point of each ICD code is removed); 4) DIAGNOSIS CD TYPE, an indicator of whether the DIAGNOSIS CD is ICD-9, ICD-10, or SNOMED; 5) DIAGNOSIS STATUS, one of “Possible diagnosis of”, “Other diagnosis status”, “Family history of”, “History of”, “Unknown diagnosis status”, “Diagnosis of”; 6) PRIMARY DIAGNOSIS, an indicator of whether or not the diagnosis is documented as the principal diagnosis (1 = Yes, 0 = No/Unknown). The results only included ICD-9 and ICD-10 diagnosis codes with the “Diagnosis of” status. Rare diagnoses with fewer than 1,000 records in the dataset were removed. During the implementation of data extraction, we found that patient-by-patient queries were slower than batch queries when pulling data from the MongoDB database. Based on an experiment to find the most efficient batch size, we set the maximum number of patients per batch to 50,000. Batch processing also allowed an early stop once the occurrence rate result achieved an acceptable error (e.g., less than 0.5%). Extraction with an early stop, which we call fast mode, significantly reduced the processing time and the extracted data size while producing results close to those from the entire cohort. Among all diagnosis records, 24.77% were ICD-9 codes. We built an automatic conversion tool to map extracted ICD-9 codes to ICD-10 codes. We also grouped the ICD-10 codes into categories by keeping only their first three characters.
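The early-stop idea behind fast mode can be sketched as a loop over batches that halts once the margin of error of the estimated rate is small enough. The stopping rule below is an assumption (a normal-approximation margin with the worst-case proportion 0.5), since the text does not state how the error was computed; `estimate_rate_fast` and the synthetic batches are hypothetical.

```python
import math


def worst_case_margin(n, z=1.96):
    """Normal-approximation margin of error for a proportion, using p = 0.5."""
    return z * math.sqrt(0.25 / n)


def estimate_rate_fast(batches, max_error=0.005, z=1.96):
    """Consume batches of per-patient flags until the margin is acceptable.

    batches: iterable of lists of booleans (True if the patient had the
    diagnosis during the observation window).
    Returns (estimated occurrence rate, number of patients used).
    """
    hits = 0
    n = 0
    for batch in batches:
        hits += sum(batch)
        n += len(batch)
        if worst_case_margin(n, z) < max_error:
            break  # early stop: remaining batches are never extracted
    return hits / n, n


# Ten synthetic batches of 10,000 patients, each with a 10% occurrence rate.
synthetic = ([i % 10 == 0 for i in range(10_000)] for _ in range(10))
rate, used = estimate_rate_fast(synthetic)
print(rate, used)
```

Passing a generator keeps the sketch lazy: batches after the stopping point are never materialized, which mirrors skipping the corresponding database queries.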
We observed a large number of chronic disease occurrences during the observation window. Such events may not be considered adverse events after vaccination since they could be existing conditions or related to other diseases. We used the primary diagnosis indicator and the history window to avoid counting such unrelated diagnoses. In the observation windows, we only obtained primary diagnoses. In the history windows, we extracted diagnoses as baseline conditions. Diagnoses that occurred during both windows for an individual were considered existing conditions, so they were removed from the occurrence rate computation.
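A minimal sketch of the existing-condition filter for one patient is shown below. It assumes the comparison is made on three-character ICD-10 categories rather than on full codes, which the text leaves open; `icd10_category` and `new_diagnoses` are hypothetical names.

```python
def icd10_category(code):
    """Three-character ICD-10 category of a code stored without a decimal point."""
    return code[:3]


def new_diagnoses(observation_primary_codes, history_codes):
    """Categories seen as primary diagnoses after the index date but not before."""
    observed = {icd10_category(c) for c in observation_primary_codes}
    baseline = {icd10_category(c) for c in history_codes}
    return observed - baseline


# Hypothetical patient: hypertension (I10) is already present in the history window.
print(new_diagnoses(["G610", "I10"], ["I10", "E119"]))
```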
2.6. Diagnoses Ranking
We ranked the diagnosis codes using their occurrence rates after COVID-19 vaccine and influenza vaccine and an estimated weighting based on the diagnosis time and age of the patients. By using the IPCW method described in another study [17], the occurrence rate of adverse event X after the COVID-19 vaccine was defined as the count of IPCW weighted patients with X during the COVID-19 vaccine observation window divided by the sum of all patients’ IPCW weights. Based on the calculation of IPCW, the sum of all patients’ IPCW weights equals the number of patients.
Similarly, we computed the occurrence rate of adverse event X after the influenza vaccine. According to the observation window selection, a patient could be at most two years older when receiving the first dose of the COVID-19 vaccine than when receiving the most recent influenza vaccine. Considering that increasing age could affect the diagnosis rate, especially for older patients, we computed a weight for diagnosis X based on the age at the diagnosis date.
The overall occurrence rate at different ages for diagnosis X is calculated using the entire dataset with 8.87 million patients. For all n patients in our selected self-controlled cohort, the expected occurrence rate of X after the COVID-19 vaccine (or influenza vaccine) is the sum, over all patients, of the overall rate at the age of receiving the COVID-19 vaccine (or influenza vaccine). The ratio of these two sums is the weight based on age for diagnosis X. In this study, for example, the weight of age for I10-I16 (Hypertensive diseases) is 1.059, which means our cohort is 5.9% more likely to be diagnosed with I10-I16 after receiving the COVID-19 vaccine because of getting older. We removed the age bias by applying the weight to each diagnosis occurrence rate after the influenza vaccine. In addition, we found that some events may be affected by the diagnosis time of the year. For instance, transportation accidents have lower rates during November and December because of holidays. We calculated the weight of timing for each diagnosis by pre-computing the overall rate within a 180-day window starting on every day of the year using the entire database.
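One way to write the two weights, reconstructed from the verbal description rather than taken from the authors' formulas, is given below. The symbols are introduced here for illustration: R_X(a) is the overall occurrence rate of X at age a in the entire dataset, S_X(d) is the overall rate of X within the 180-day window starting on calendar day d, and a_i and d_i are the age and vaccination day of patient i for the COVID-19 vaccine (cov) or influenza vaccine (inf). The form of the timing weight is assumed by analogy with the age weight.

```latex
\[
W_{\mathrm{age}}(X) =
  \frac{\sum_{i=1}^{n} R_X\!\left(a_i^{\mathrm{cov}}\right)}
       {\sum_{i=1}^{n} R_X\!\left(a_i^{\mathrm{inf}}\right)},
\qquad
W_{\mathrm{time}}(X) =
  \frac{\sum_{i=1}^{n} S_X\!\left(d_i^{\mathrm{cov}}\right)}
       {\sum_{i=1}^{n} S_X\!\left(d_i^{\mathrm{inf}}\right)}
\]
```

Under this reading, a value of 1.059 for the age weight means the summed age-specific rates at the COVID-19 vaccination ages are 5.9% higher than at the influenza vaccination ages.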
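We then adjusted the diagnosis rate of X after the influenza vaccine using the weights of age and timing.
The ranking score is the adjusted odds ratio between the occurrence rate of X after the COVID-19 vaccine and this adjusted rate after the influenza vaccine.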
If the adjusted odds ratio of X is larger than 1, it indicates diagnosis X has a higher probability to happen after receiving the COVID-19 vaccine than after receiving the influenza vaccine. We computed the score for any diagnoses extracted from the previous step and generated a ranking list as the final results.
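The scoring and ranking step can be sketched as follows. The sketch assumes the weights act multiplicatively on the influenza vaccine rate before the odds are taken, which is an interpretation of the description rather than the authors' stated formula; the function names, diagnosis labels, rates, and weights are all hypothetical.

```python
def odds(rate):
    """Convert an occurrence rate (0 < rate < 1) into odds."""
    return rate / (1.0 - rate)


def adjusted_odds_ratio(rate_cov, rate_inf, w_age=1.0, w_time=1.0):
    """Odds ratio of the COVID-19 vaccine rate to the adjusted influenza rate."""
    adjusted_inf = rate_inf * w_age * w_time  # assumed multiplicative adjustment
    return odds(rate_cov) / odds(adjusted_inf)


def rank_diagnoses(table):
    """table: {code: (rate_cov, rate_inf, w_age, w_time)} -> codes sorted by AOR."""
    scored = {code: adjusted_odds_ratio(*row) for code, row in table.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rates and weights for three placeholder diagnosis categories.
ranking = rank_diagnoses({
    "dx_a": (0.0040, 0.0020, 1.05, 1.10),
    "dx_b": (0.0100, 0.0100, 1.00, 1.00),
    "dx_c": (0.0030, 0.0050, 1.02, 0.95),
})
for code, aor in ranking:
    print(code, round(aor, 2))
```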
3. Results
3.1. Study cohort
The January 2022 release of the Optum® COVID-19 dataset contained 8.87 million individuals who had COVID-19 related records. From the immunization table and procedure table, we extracted 2.1 million patients who received at least one dose of any brand of COVID-19 vaccine; about 1.68 million of them were COVID-19 free at any time before the data release. There were 1.08 million patients who also had at least one influenza vaccine record, and about 0.6 million received the influenza vaccine during the 6-18 months before the COVID-19 pandemic. After excluding the patients without complete 6-month records before their selected influenza vaccine date, the final analytical cohort had 553,682 patients (Figure 3). Table 1 displays demographic characteristics of the patients who met the search criteria. Most patients received an mRNA vaccine: Pfizer (59.38%) and Moderna (35.64%). More than 66% of the cohort were older patients (age 50+), which was consistent with the policy that the majority of COVID-19 vaccines were only available to older people during the first several months after the Food and Drug Administration (FDA) issued the EUA to Pfizer and Moderna. Although existing studies reported that males had a higher or equal acceptance rate of the COVID-19 vaccine [24, 25], there was a much higher proportion of females (61.1%). Our dataset also showed high rates of Caucasians (79.4%) and patients aged 50+ (66.36%); the reason was high influenza vaccine coverage in females, Caucasians, and older people [26]. According to this cohort, almost half of the COVID-19 vaccines were distributed to East North Central (25.14%) and West North Central (20.65%). Because of the latency of some data sources, few COVID-19 vaccine records occurred after September 2021.
Figure 3: Flowchart of cohort selection.
Table 1: Characteristics of study samples.
| Characteristics | No.(%) | Pfizer(%) | Moderna(%) | Janssen(%) | Unspecified/other(%) |
|---|---|---|---|---|---|
| All | |||||
| All | 553,682 (100.00) | 328,797 (59.38) | 197,305 (35.64) | 19,261 (3.48) | 8,319 (1.50) |
| Gender | |||||
| Female | 338,281 (61.10) | 200,937 (59.40) | 120,945 (35.75) | 11,145 (3.29) | 5,254 (1.55) |
| Male | 215,106 (38.85) | 127,681 (59.36) | 76,261 (35.45) | 8,103 (3.77) | 3,061 (1.42) |
| Age Group | |||||
| 0-17 | 31,605 (5.71) | 31,367 (99.25) | 109 (0.34) | 15 (0.05) | 114 (0.36) |
| 18-33 | 60,566 (10.94) | 39,156 (64.65) | 18,543 (30.62) | 2,366 (3.91) | 501 (0.83) |
| 34-49 | 94,096 (16.99) | 56,719 (60.28) | 32,007 (34.02) | 4,242 (4.51) | 1,128 (1.20) |
| 50-64 | 152,334 (27.51) | 84,874 (55.72) | 57,142 (37.51) | 7,772 (5.10) | 2,546 (1.67) |
| 65+ | 215,078 (38.85) | 116,679 (54.25) | 89,503 (41.61) | 4,866 (2.26) | 4,030 (1.87) |
3.2. Potential adverse events
We grouped all extracted diagnosis codes that occurred in the observation window of the COVID-19 vaccine or influenza vaccine by their first 3 characters (e.g., G610 => G61). The final ranking list included 1,414 grouped diagnosis codes. Of the 599 grouped diagnosis codes with an occurrence rate of at least 0.1%, only 20 codes had adjusted odds ratios ≥ 1.2. Table 2 lists the top diagnosis codes in our ranking list. The majority (544 codes, 90.8%) had adjusted odds ratios between 0.8 and 1.2. The overall diagnosis event occurrence rates after the COVID-19 vaccine and the influenza vaccine were close in this large EHR dataset; the result confirmed that there was no significant safety concern for COVID-19 vaccines compared to the influenza vaccine.
Table 2: Diagnosis codes with adjusted odds ratio ≥ 1.2. COV-vac: occurrence rate in percentage during COVID-19 vaccine observation window. Inf-vac: occurrence rate in percentage during influenza vaccine observation window. AOR: Adjusted Odds Ratio. χ2 test p-value: Pearson’s chi-squared test p-value.
| ICD-10 | Description | COV-vac(%) | Inf-vac(%) | AOR | χ2 test p-value |
|---|---|---|---|---|---|
| M | Diseases of the musculoskeletal system and connective tissue | ||||
| M89 | Other disorders of bone | 8,327(1.50) | 3,736(0.67) | 1.30 | <0.01 |
| M84 | Disorder of continuity of bone | 1,165(0.21) | 755(0.14) | 1.22 | <0.01 |
| M94 | Other disorders of cartilage | 2,227(0.40) | 2,225(0.40) | 1.20 | <0.01 |
| L | Diseases of the skin and subcutaneous tissue | ||||
| L24 | Irritant contact dermatitis | 836(0.15) | 640(0.12) | 2.18 | <0.01 |
| L23 | Allergic contact dermatitis | 2,178(0.39) | 1,304(0.24) | 1.57 | <0.01 |
| L25 | Unspecified contact dermatitis | 1,885(0.34) | 1,301(0.23) | 1.45 | <0.01 |
| L53 | Other erythematous conditions | 848(0.15) | 619(0.11) | 1.23 | <0.01 |
| G | Diseases of the nervous system | ||||
| G63 | Polyneuropathy in diseases classified elsewhere | 781(0.14) | 564(0.10) | 1.34 | <0.01 |
| G44 | Other headache syndromes | 6,356(1.15) | 5,999(1.08) | 1.21 | <0.01 |
| R | Symptoms, signs and abnormal clinical and laboratory findings | ||||
| R78 | Findings of drugs and other substances, not normally found in blood | 2,001(0.36) | 1,327(0.24) | 1.71 | <0.01 |
| R70 | Elevated erythrocyte sedimentation rate and abnormality of plasma viscosity | 996(0.18) | 754(0.14) | 1.70 | <0.01 |
| R90 | Abnormal findings on diagnostic imaging of central nervous system | 1,397(0.25) | 863(0.16) | 1.32 | <0.01 |
| R69 | Illness, unspecified | 1,007(0.18) | 1,832(0.33) | 1.32 | <0.01 |
| R44 | Other symptoms and signs involving general sensations and perceptions | 1,450(0.26) | 838(0.15) | 1.19 | <0.01 |
| J | Diseases of the respiratory system | ||||
| J11 | Influenza due to unidentified influenza virus | 282(0.05) | 3,177(0.57) | 1.77 | <0.01 |
| J10 | Influenza due to other identified influenza virus | 275(0.05) | 5,036(0.91) | 1.77 | <0.01 |
| E | Endocrine, nutritional and metabolic diseases | ||||
| E79 | Disorders of purine and pyrimidine metabolism | 1,900(0.34) | 1,504(0.27) | 2.18 | <0.01 |
| A | Certain infectious and parasitic diseases | ||||
| A49 | Bacterial infection of unspecified site | 1,539(0.28) | 1,392(0.25) | 1.43 | <0.01 |
| D | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism | ||||
| D84 | Other immunodeficiencies | 7,201(1.30) | 607(0.11) | 1.39 | <0.01 |
| F | Mental, Behavioral and Neurodevelopmental disorders | ||||
| F34 | Persistent mood affective disorders | 2,606(0.47) | 2,528(0.46) | 1.29 | <0.01 |
3.3. Processing time
The dataset was hosted on a shared ELII database using MongoDB 5.0.2. The whole pipeline was implemented using Python 3.8. The development machine was a 2019 Mac Pro with a 2.5 GHz 28-Core Intel Xeon W processor and 1.5 TB of 2933 MHz DDR4 memory. The operating system was macOS Catalina version 10.15.7. We optimized the cohort queries and data extraction queries using parallel computing, while the censoring adjustment and statistical analysis used sequential programming. A complete run of the pipeline for this paper using the entire cohort took 1 hour and 20 minutes. The optimized cohort initialization took 6 minutes and the data extraction took 23 minutes. For the fast mode, if we allowed a maximum 0.5% error with a 95% confidence interval, the early stop generated a reduced cohort of approximately 40,000 patients, and the processing time dropped to 13 minutes.
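The reduced cohort size of approximately 40,000 is in line with the standard sample-size formula for estimating a proportion. The calculation below is a sketch under assumptions not stated in the text: a normal approximation, the worst-case proportion p = 0.5, z = 1.96 for a 95% confidence level, and e = 0.005 as the maximum error.

```latex
\[
n \;\ge\; \frac{z^{2}\, p\,(1-p)}{e^{2}}
  \;=\; \frac{1.96^{2} \times 0.5 \times 0.5}{0.005^{2}}
  \;=\; \frac{0.9604}{0.000025}
  \;=\; 38{,}416
\]
```

Rounding this bound up to whole extraction batches would give a cohort close to the reported size.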
4. Discussion
4.1. Study contribution
The CDC provided COVID-19 vaccine information regarding safety concerns [27]. Table 3 shows the adjusted odds ratios for some side effects and adverse events of the COVID-19 vaccine reported by other studies based on the VAERS dataset, which include common symptoms within days [5] and rare diseases such as Guillain-Barré Syndrome (GBS) [28] and myocarditis [6, 29]. Our results show that the common COVID-19 vaccine side effects listed in Table 3 had slightly lower odds compared with the same side effects after the influenza vaccine. For each of the adverse events, except for GBS, there was no significant difference between the COVID-19 vaccine and the influenza vaccine. VAERS and EHR both have data quality issues: VAERS records were self-reported and unverified, and EHR data may have latency and underreporting because patients must engage with a healthcare service to have a diagnosis recorded. The EHR-based analysis can benefit studies using VAERS as a validation source. Moreover, EHR data contain more comprehensive and detailed information such as current conditions and diagnosis history, and usually include long-term data, which may enhance long-term vaccine safety studies.
Table 3: Adjusted odds ratio of side effects or adverse events reported from VAERS. COV-vac: occurrence rate in percentage during COVID-19 vaccine observation window. Inf-vac: occurrence rate in percentage during influenza vaccine observation window. AOR: Adjusted Odds Ratio. χ2 test p-value: Pearson’s chi-squared test p-value.
| Event | Related ICD10 | COV-vac, n (%) | Inf-vac, n (%) | AOR | χ² test p-value |
|---|---|---|---|---|---|
| Common Side Effects | | | | | |
| Headache | R51 | 19,180 (3.46) | 15,893 (2.87) | 0.90 | <0.01 |
| Fatigue | R53 | 44,537 (8.04) | 35,507 (6.41) | 0.89 | <0.01 |
| Dizziness | R42 | 22,506 (4.06) | 17,996 (3.25) | 0.93 | <0.01 |
| Chills | R68 | 10,606 (1.92) | 19,012 (3.43) | 0.93 | <0.01 |
| Fever | R50 | 9,661 (1.74) | 14,656 (2.65) | 0.83 | <0.01 |
| Nausea | R11 | 23,743 (4.29) | 19,196 (3.47) | 0.98 | 0.08 |
| Adverse Events | | | | | |
| Adverse effects, anaphylaxis | T78 | 4,185 (0.76) | 3,344 (0.60) | 0.98 | 0.31 |
| Urticaria | L50 | 2,422 (0.44) | 2,476 (0.45) | 0.97 | 0.27 |
| Guillain-Barré Syndrome | G610 | 75 (0.01) | 30 (0.01) | 1.57 | 0.04 |
| Myocarditis | B33, I40, I51 | 61 (0.01) | 73 (0.01) | 1.28 | 0.18 |
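To make the quantities in Table 3 concrete, the following minimal sketch computes a crude (unadjusted) odds ratio and a Pearson chi-squared p-value from a pair of event counts. It assumes both observation windows share a denominator of 553,682 individuals (the cohort size reported in the abstract) and, as a simplification, treats the two windows as independent samples. The function names are illustrative and are not part of the pipeline. The crude odds ratio is not expected to match the AOR column, which reflects the adjustment described in the methods.

```python
import math


def crude_odds_ratio(events_a, events_b, n):
    """Unadjusted odds ratio of an event in window A versus window B.

    Assumes both windows cover the same n individuals (an assumption
    made for this sketch).
    """
    odds_a = events_a / (n - events_a)
    odds_b = events_b / (n - events_b)
    return odds_a / odds_b


def pearson_chi2_2x2(events_a, events_b, n):
    """Pearson chi-squared statistic and p-value for a 2x2 table.

    Uses 1 degree of freedom and no continuity correction. Treating the
    two windows as independent groups is a simplification.
    """
    table = [[events_a, n - events_a], [events_b, n - events_b]]
    total = 2 * n
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Headache row of Table 3; the denominator is an assumption (see lead-in).
N = 553_682
headache_or = crude_odds_ratio(19_180, 15_893, N)
headache_chi2, headache_p = pearson_chi2_2x2(19_180, 15_893, N)
```

Comparing the crude value with the tabulated AOR for the same row gives a rough sense of how much the adjustment changes the estimate.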
We set a fixed 180-day observation window for each patient to extract abnormal events after they received the COVID-19 vaccine. If the window is too short, only short-term side effects are captured; if it is too long, more of the records are censored. A longer window can be used with future data releases. We also introduced a 180-day history window for a related purpose: removing pre-existing conditions. Since the majority of our study sample were older adults, we found a high rate of chronic diseases that had most likely developed before vaccination, so we removed such diagnoses if they occurred in the history window. Lengthening the history window removes more chronic diseases but also removes some recurring short-term conditions.
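The interaction between the two windows can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function name, the event representation as (date, ICD-10 code) pairs, and the choice to count the vaccination day as part of the observation window are all assumptions. Codes are truncated to their first three characters, following the diagnosis categories described in the abstract.

```python
from datetime import date, timedelta


def new_diagnosis_categories(events, index_date,
                             observation_days=180, history_days=180,
                             prefix_len=3):
    """Return diagnosis categories seen in the observation window
    but not in the history window.

    events: iterable of (diagnosis_date, icd10_code) pairs for one patient.
    index_date: date of the first vaccine dose.
    """
    history_start = index_date - timedelta(days=history_days)
    observation_end = index_date + timedelta(days=observation_days)

    history, observed = set(), set()
    for when, code in events:
        category = code[:prefix_len]
        if history_start <= when < index_date:
            history.add(category)   # pre-existing condition
        elif index_date <= when <= observation_end:
            observed.add(category)  # candidate post-vaccination event

    # Drop categories already recorded shortly before vaccination.
    return observed - history


# Hypothetical patient record used only for illustration.
example_events = [
    (date(2020, 1, 1), "R51"),     # before the history window: ignored
    (date(2020, 11, 1), "E11.9"),  # inside the history window
    (date(2021, 2, 1), "E11.9"),   # observed, but pre-existing
    (date(2021, 2, 10), "R51.9"),  # observed and new
    (date(2021, 9, 1), "R42"),     # after the observation window
]
example_result = new_diagnosis_categories(example_events, date(2021, 1, 15))
```

The example also shows the trade-off noted above: a headache code recorded before the history window does not block a later headache from being counted, whereas a longer history window would remove it.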
4.2. Limitation of EHR data
We observed that the overall diagnosis rate during a short period (less than one month) after the COVID-19 vaccination was lower than the rate after the influenza vaccine. One reason could be data capture and integration latency; other studies have also reported that side effect reports are rare in EHR notes compared to clinical trials and V-safe [8]. Another reason could be that people’s behavior in engaging with health care systems has changed since the COVID-19 pandemic started. Patients with mild to moderate reactions may have chosen to stay at home because they were concerned about COVID-19 infection or the high hospitalization rate. To analyze short-term adverse reactions to the COVID-19 vaccine precisely using EHR data, the analysis would need to include only patients who had hospital visits during the observation window, to avoid possible missing data. We also observed an imbalanced geographical distribution of COVID-19 vaccine cases, which may be caused by the role of Optum®’s clients in COVID-19 vaccine distribution in the early stage. Such limitations may bias the results of this work, but the bias could be reduced by using newer data in the future.
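The cohort restriction suggested above could be sketched as a simple filter. The data layout (a mapping from a patient identifier to an index date and a list of visit dates) and both function names are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def has_visit_in_window(visit_dates, index_date, window_days):
    """True if at least one visit falls within window_days after index_date."""
    window_end = index_date + timedelta(days=window_days)
    return any(index_date <= d <= window_end for d in visit_dates)


def restrict_to_engaged_patients(cohort, window_days):
    """Keep only patients with a recorded visit in the observation window.

    cohort: dict mapping patient_id -> (index_date, list of visit dates).
    """
    return {
        patient_id: record
        for patient_id, record in cohort.items()
        if has_visit_in_window(record[1], record[0], window_days)
    }


# Hypothetical cohort: only patient "a" has a visit within 30 days.
example_cohort = {
    "a": (date(2021, 1, 15), [date(2021, 1, 20)]),
    "b": (date(2021, 1, 15), [date(2021, 6, 1)]),
    "c": (date(2021, 1, 15), []),
}
engaged = restrict_to_engaged_patients(example_cohort, window_days=30)
```

Such a filter trades sample size for completeness of follow-up, and it may itself introduce selection bias toward patients who use health care more often.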
4.3. Generalization
The pipeline developed for this study was intended as a general-purpose approach to comparing two events using large EHR data. Many parameters can be easily adjusted as the study design evolves. Using the cohort initialization tool, the events of interest can be changed to other vaccines or to other types of events. The previous paragraph illustrates how to use the same pipeline to analyze the COVID-19 vaccine and COVID-19 infection. Furthermore, the backend database can host other EHR datasets for fast temporal query and data extraction. Future work could support event extraction not only from Diagnosis but also from Lab Test, Observation, and Medication.
5. Conclusion
In this study, we introduced a data mining pipeline that automatically generates a list of potential adverse events for the COVID-19 vaccine. From an EHR dataset with more than 8.87 million patients, we extracted 0.55 million individuals for a self-controlled case series analysis. Our fast temporal query system and sampling mode accelerated each iteration of the analysis life-cycle. Using an observation window of 180 days after the first dose of the COVID-19 vaccine, we observed that the occurrence rates of adverse events were similar between the COVID-19 vaccine and the influenza vaccine. We designed an adjusted odds ratio metric, which showed that the COVID-19 vaccine had a similar adverse event rate to the influenza vaccine. We provided a list of 20 diagnosis codes as potential adverse events for the COVID-19 vaccine that need further investigation.
Acknowledgment
We sincerely thank our colleagues Yashar Talebi, Lili Liu, and Zhouxuan Li for valuable project discussions. We thank our colleagues Yashar Talebi and Lili Liu for evaluating the computation results. We also thank our colleague Michael Phan for setting up the distributed MongoDB database. This work was supported in part by the National Institutes of Health (NIH) grant R01NS126690. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
References
- [1].FDA Takes Key Action in Fight Against COVID-19 By Issuing Emergency Use Authorization for First COVID-19 Vaccine. Published December 11, 2020. https://www.fda.gov/news-events/press-announcements/fda-takes-key-action-fight-against-covid-19-issuing-emergency-use-authorization-first-covid-19.
- [2].Sallam M. COVID-19 vaccine hesitancy worldwide: a concise systematic review of vaccine acceptance rates. Vaccines. 2021 Feb;9(2):160. doi: 10.3390/vaccines9020160.
- [3].Kaur RJ, Dutta S, Bhardwaj P, Charan J, Dhingra S, Mitra P, Singh K, Yadav D, Sharma P, Misra S. Adverse events reported from COVID-19 vaccine trials: a systematic review. Indian Journal of Clinical Biochemistry. 2021 Oct;36(4):427–39. doi: 10.1007/s12291-021-00968-z.
- [4].Thomas SJ, Moreira Jr ED, Kitchin N, Absalon J, Gurtman A, Lockhart S, Perez JL, Pérez Marc G, Polack FP, Zerbini C, Bailey R. Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine through 6 months. New England Journal of Medicine. 2021 Nov 4;385(19):1761–73. doi: 10.1056/NEJMoa2110345.
- [5].Gee J, Marquez P, Su J, Calvert GM, Liu R, Myers T, Nair N, Martin S, Clark T, Markowitz L, Lindsey N. First month of COVID-19 vaccine safety monitoring—United States, December 14, 2020–January 13, 2021. Morbidity and Mortality Weekly Report. 2021 Feb 26;70(8):283. doi: 10.15585/mmwr.mm7008e3.
- [6].Hause AM, Baggs J, Marquez P, Myers TR, Su JR, Blanc PG, Baumblatt JA, Woo EJ, Gee J, Shimabukuro TT, Shay DK. Safety monitoring of COVID-19 vaccine booster doses among adults—United States, September 22, 2021–February 6, 2022. Morbidity and Mortality Weekly Report. 2022 Feb 18;71(7):249. doi: 10.15585/mmwr.mm7107e1.
- [7].Curtis HJ, Inglesby P, Morton CE, MacKenna B, Walker AJ, Morley J, Mehrkar A, Bacon SC, Hickman G, Bates C, Croker R. Trends and clinical characteristics of COVID-19 vaccine recipients: a federated analysis of 57.9 million patients primary care records in situ using OpenSAFELY. MedRxiv. 2021 Jan 1 doi: 10.3399/BJGP.2021.0376.
- [8].McMurry R, Lenehan P, Awasthi S, Silvert E, Puranik A, Pawlowski C, Venkatakrishnan AJ, Anand P, Agarwal V, O’Horo JC, Gores GJ. Real-time analysis of a mass vaccination effort confirms the safety of FDA-authorized mRNA COVID-19 vaccines. Med. 2021 Aug 13;2(8):965–78. doi: 10.1016/j.medj.2021.06.006.
- [9].Sarri G, Bennett D, Debray T, Deruaz-Luyet A, Soriano Gabarro M, Largent JA, Li X, Liu W, Lund JL, Moga DC, Gokhale M. ISPE-endorsed guidance in using electronic health records for comparative effectiveness research in COVID-19: opportunities and trade-offs. Clinical Pharmacology & Therapeutics. 2022 Feb 16 doi: 10.1002/cpt.2560.
- [10].Huang Y, Li X, Zhang GQ. ELII: A novel inverted index for fast temporal query, with application to a large Covid-19 EHR dataset. Journal of Biomedical Informatics. 2021 May 1;117:103744. doi: 10.1016/j.jbi.2021.103744.
- [11].Zhang Guo-Qiang, Li Xiaojin, Huang Yan, Cui Licong. AMIA Annual Symposium Proceedings. volume 2022. American Medical Informatics Association; 2022. Temporal Cohort Logic.
- [12].Schulz KF, Grimes DA. Case-control studies: research in reverse. The lancet. 2002 Feb 2;359(9304):431–4. doi: 10.1016/S0140-6736(02)07605-5.
- [13].El-Gilany AH. Self-controlled case series study (SCCSS): a novel research method. Asploro Journal of Biomedical and Clinical Case Reports. 2019;2019(1):29.
- [14].Petersen I, Douglas I, Whitaker H. Self controlled case series methods: an alternative to standard epidemiological study designs. bmj. 2016 Sep 12:354. doi: 10.1136/bmj.i4515.
- [15].Willems SJ, Schat A, van Noorden MS, Fiocco M. Correcting for dependent censoring in routine outcome monitoring data by applying the inverse probability censoring weighted estimator. Statistical Methods in Medical Research. 2018 Feb;27(2):323–35. doi: 10.1177/0962280216628900.
- [16].Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics. 2000 Sep;56(3):779–88. doi: 10.1111/j.0006-341x.2000.00779.x.
- [17].Vock DM, Wolfson J, Bandyopadhyay S, Adomavicius G, Johnson PE, Vazquez-Benitez G, O’Connor PJ. Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting. Journal of biomedical informatics. 2016 Jun 1;61:119–31. doi: 10.1016/j.jbi.2016.03.009.
- [18].Butler RR. Icd-10 general equivalence mappings: Bridging the translation gap from icd-9. Journal of AHIMA. 2007 Oct;78(9):84–6.
- [19].COVID-19 Vaccine Related Codes. https://www.cdc.gov/vaccines/programs/iis/COVID-19-related-codes.html .
- [20].COVID-19 CPT vaccine and immunization codes. https://www.ama-assn.org/practice-management/cpt/COVID-19-cpt-vaccine-and-immunization-codes .
- [21].IIS: NDC Lookup Crosswalk. https://www2a.cdc.gov/vaccines/iis/iisstandards/vaccines.asp?rpt=ndc .
- [22].CDC COVID-19 Response Team, Bialek S, Bowen V, Chow N, Curns A, Gierke R, Hall A, Hughes M. Geographic differences in COVID-19 cases, deaths, and incidence—United States, February 12–April 7, 2020. Morbidity and Mortality Weekly Report. 2020 Apr 17;69(15):465. doi: 10.15585/mmwr.mm6915e4.
- [23].SARS-CoV-2 and COVID-19 related LOINC terms. https://loinc.org/sars-cov-2-and-covid-19/
- [24].Malik AA, McFadden SM, Elharake J, Omer SB. Determinants of COVID-19 vaccine acceptance in the US. EClinicalMedicine. 2020 Sep 1;26:100495. doi: 10.1016/j.eclinm.2020.100495.
- [25].Lazarus JV, Wyka K, Rauh L, Rabin K, Ratzan S, Gostin LO, Larson HJ, El-Mohandes A. Hesitant or not? The association of age, gender, and education with potential acceptance of a COVID-19 vaccine: a country-level analysis. Journal of Health Communication. 2020 Oct 2;25(10):799–807. doi: 10.1080/10810730.2020.1868630.
- [26].Kini A, Morgan R, Kuo H, Shea P, Shapiro J, Leng SX, Pekosz A, Klein SL. Differences and disparities in seasonal influenza vaccine, acceptance, adverse reactions, and coverage by age, sex, gender, and race. Vaccine. 2021 Apr 28 doi: 10.1016/j.vaccine.2021.04.013.
- [27].CDC: Safety of COVID-19 Vaccines. https://www.cdc.gov/coronavirus/2019-ncov/vaccines/safety/safety-of-vaccines.html .
- [28].Trimboli M, Zoleo P, Arabia G, Gambardella A. Guillain-Barré syndrome following BNT162b2 COVID-19 vaccine. Neurological Sciences. 2021 Nov;42(11):4401–2. doi: 10.1007/s10072-021-05523-5.
- [29].Boehmer TK, Kompaniyets L, Lavery AM, Hsu J, Ko JY, Yusuf H, Romano SD, Gundlapalli AV, Oster ME, Harris AM. Association between COVID-19 and myocarditis using hospital-based administrative data—United States, March 2020–January 2021. Morbidity and Mortality Weekly Report. 2021 Sep 9;70(35):1228. doi: 10.15585/mmwr.mm7035e5.


