AMIA Annual Symposium Proceedings. 2025 May 22;2024:329–338.

SeqTrial: Utility Preserving Sequential Clinical Trial Data Generator

Trisha Das 1, Afrah Shafquat 2, Mandis Beigi 2, Jacob Aptekar 2, Jason Mezey 3, Jimeng Sun 1
PMCID: PMC12099387  PMID: 40417577

Abstract

Clinical trial data used to evaluate new treatments have value beyond the original studies, but limitations in data access due to privacy concerns make further use of these data challenging. Digital twins offer a solution by simulating patient outcomes, providing less restricted data access, reducing costs and increasing sample sizes. However, existing research focuses on synthetic Electronic Healthcare Records (EHRs) and lacks personalized patient record generation. This paper introduces SeqTrial, a framework for generating personalized digital twins for sequential clinical trial event data. The method uses BioBERT word embeddings to capture biomedical term semantics, an attention mechanism to understand visit relationships, and synthesizes digital twins for each patient. SeqTrial generates utility-preserving digital twins capable of estimating clinical outcomes, while addressing data scarcity through self-supervised pretraining. The method demonstrates high fidelity and utility in generating synthetic sequential clinical trial data for patient outcome prediction while ensuring privacy protection. The code is available at https://github.com/trishad2/SeqTrial.git.

Introduction

Clinical trials are prospective studies that assess the impact of new interventions on human subjects through observed outcomes, involving tens to hundreds of participants over an extended time frame. Secondary analysis of clinical trial data can provide valuable insights on drug safety, differential drug efficacy, and other factors [1]. However, researchers not directly associated with a clinical trial encounter difficulties accessing high-quality patient-level data for secondary purposes due to privacy concerns [2]. There is increasing interest in using virtual/synthetic patients to simulate diverse patient outcomes based on accurate personal characteristics data. Digital twins, synthetic patient records corresponding to each real patient record, can serve as a solution that addresses privacy concerns when sharing clinical trial data, enabling secondary research purposes, including the development of improved prediction models and subject-level statistical analyses of disease progression. Additionally, from the perspective of a clinical trial investigator, using virtual patients potentially reduces the sample size required for participant recruitment, leading to a significant cost reduction in the clinical trial.

The process of creating digital twins for clinical trials has a close counterpart in the generation of synthetic Electronic Healthcare Records (EHRs). This involves the use of deep generative models to produce synthetic EHRs that preserve the statistical patterns found in real EHRs [3, 4, 5]. These synthetic EHRs can then be utilized to develop predictive models for health risks. However, there are significant distinctions between generating digital twins for clinical trials and generating synthetic EHRs. Firstly, clinical trial data exhibits substantial differences in structure and temporal patterns compared to EHRs [6]. Clinical trial data is less sparse, with more regular event incidence patterns and intervals, and generally involves a much smaller cohort of patients. Consequently, using large models for this data may not be optimal, making the application of EHR generation methods unsuitable for clinical trial data. Secondly, the primary objective of generating synthetic EHRs is to capture the overall characteristics of real EHR data. In contrast, generating digital twins aims to create virtual patients that closely match individual characteristics with high precision, usefulness, privacy and diversity.

To this end, we propose SeqTrial for personalized digital twin generation for Sequential clinical Trial event data. It is designed for two tasks: (i) predictive modeling for clinical outcome prediction and (ii) digital twin generation. Our main contributions are summarized as follows.

  • We propose SeqTrial, a lightweight variational autoencoder (VAE)-based generative model that creates digital twins to mimic real participants, preserving relationships across visits and events. We chose a VAE for its stability, reliability and affordability compared to other generative models. We utilize BioBERT [7], pre-trained on an extensive corpus of text data from PubMed, as our knowledge base for obtaining word embeddings that can effectively capture intricate linguistic patterns and relationships among events. For example, the knowledge that the medication Aspirin is given to a patient with the adverse event fever can be found in several PubMed articles [8, 9]. We assume that the BioBERT word embeddings for Aspirin and fever capture this relationship. We also utilize lightweight, attention-based transformer modules to capture the relationships among visits.

  • SeqTrial produces digital twins that preserve utility, fidelity and privacy. We prioritize maintaining realistic clinical outcome measures in generating synthetic clinical trial data while ensuring fidelity and privacy. Keeping the model architecture simple reduces overfitting, and privacy results demonstrate its ability to generate diverse synthetic data for data sharing. To address limited sample sizes, we utilize self-supervised pretraining with a masking technique akin to Masked Language Modeling (MLM) in NLP, enhancing data utilization and model performance through learning from unlabeled data.

In the rest of the paper, we review related work, discuss the proposed framework SeqTrial in detail, and present experimental results that demonstrate SeqTrial outperforms other baseline methods in generating synthetic sequential clinical trial event data while maintaining moderate privacy risk.

Related Work

The generation of synthetic patient records offers a promising solution to address privacy concerns associated with the release of health-related data from different institutions [10]. Current research in this area primarily focuses on generating synthetic Electronic Health Records (EHRs) using techniques such as generative adversarial networks (GANs) [3, 11, 12, 13], variational autoencoders (VAEs) [4, 14, 15], and language models [5]. The objective of these methods is to closely align global statistics with real EHRs. The generated synthetic data facilitates collaboration among institutions for the development of AI algorithms based on healthcare data, thereby addressing regulatory, intellectual property, and privacy concerns. Building on the success of synthetic EHR generation, there is emerging research on the generation of synthetic clinical trial data [10], including synthetic tabular clinical trial data [16, 6, 17]. However, these efforts often overlook the temporal structure of clinical trial data, limiting their ability to replicate the original clinical trial structure or support longitudinal modeling. There have been efforts to utilize probabilistic graphical models to capture the temporal distribution of clinical trial data for specific diseases like Alzheimer’s Disease [18] and Multiple Sclerosis [19]. However, these efforts primarily focus on estimating the uncertainty of patient trajectories. With the recent success of large language models (LLMs) in various fields, researchers have utilized LLMs for synthetic EHR generation. PromptEHR is one such model; it uses BART to generate synthetic EHR data but overlooks event relations from biomedical knowledge bases such as PubMed [5]. A recent study generates visit-level digital twins and combines all visits of a patient’s digital twin to create a sequential trajectory [20]. It assumes Markov properties between two consecutive visits and therefore cannot capture long-term relationships between visits.

To address the limitations of the previous synthetic data generation methods, we develop SeqTrial, which generates sequential clinical trial digital twins and focuses on both global and local statistics. In contrast to uncertainty estimating methods [18, 19], our proposed method aims to generate synthetic data that can be used alone or to augment the real data, enabling the use of more advanced machine learning models for improved predictive performance. Moreover, relations among biomedical events, which can be captured from biomedical knowledge bases like PubMed, can improve the generation of synthetic data. Therefore, in contrast to PromptEHR [5], we employ a BioBERT model pretrained on large-scale biomedical corpora to obtain informed event embeddings. SeqTrial, like the digital twin generation method TWIN [20], can be especially useful when we want to generate synthetic data for a specific group (e.g., a minority group). However, SeqTrial makes no limiting Markov assumptions and thus can use information from all previous visits while generating digital twins.

Methods

Problem Formulation: Assume there are $N$ patients in a clinical trial and the $n$th patient is represented by a sequence of visits in temporal order as $X_{n;1:T_n} = \{x_{n,1}, x_{n,2}, \ldots, x_{n,T_n}\}$, where $x_{n,t}$ denotes the events that occurred during the patient’s $t$th visit and $T_n$ is the total number of visits for that patient. There are three types of events: treatments, medications, and adverse events. The treatments are interventions being tested in the clinical trial, medications are interventions that differ from the primary treatment under investigation in the clinical trial, and adverse events represent negative outcomes. If a real patient is assigned to a specific arm (treatment or control arm) in a clinical trial, their corresponding digital twin should also receive the same treatment assignment. Each visit $x_{n,t}$ consists of two event sets, $x_{n,t} = \{x_{n,t}^{1}, x_{n,t}^{2}\}$. The event set $x_{n,t}^{1}$ contains the events of type treatment that occurred at the $t$th visit of patient $n$. All other events (medications and adverse events) are in the event set $x_{n,t}^{2}$. As the treatments of a clinical trial cannot be altered in the digital twins, we want to reconstruct only $x_{n,t}^{2}$ for each visit $t$ of patient $n$ and get $\hat{x}_{n,t}^{2}$. The final visit $t$ of patient $n$ is $\hat{x}_{n,t} = \{x_{n,t}^{1}, \hat{x}_{n,t}^{2}\}$. We get the reconstructed patient record for patient $n$,

$$\hat{X}_{n;1:T_n} = \{\hat{x}_{n,1}, \hat{x}_{n,2}, \ldots, \hat{x}_{n,T_n}\}. \quad (1)$$
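As a minimal sketch of how Eq. 1 assembles a twin, the snippet below keeps each real treatment set and swaps in generated non-treatment events. The `Visit` type, the function name `assemble_twin`, and the example event names are illustrative assumptions, not identifiers from the SeqTrial code.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a visit is (treatment events, other events).
Visit = Tuple[Set[str], Set[str]]


def assemble_twin(real_visits: List[Visit], generated_other: List[Set[str]]) -> List[Visit]:
    """Keep each real treatment set and replace the non-treatment events."""
    if len(real_visits) != len(generated_other):
        raise ValueError("need one generated event set per real visit")
    return [(set(treat), set(gen)) for (treat, _), gen in zip(real_visits, generated_other)]


# Toy patient with two visits; the generated sets stand in for decoder output.
real = [({"Docetaxel"}, {"Nausea"}), ({"Docetaxel"}, {"Fever", "Aspirin"})]
generated = [{"Nausea", "Fatigue"}, {"Fever"}]
twin = assemble_twin(real, generated)
```

The treatment sets in `twin` are identical to those in `real`, which mirrors the constraint that a twin stays in the same trial arm as its real patient.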

For each patient $n$, we also have an outcome label $y_n$. For example, $y_n$ can be the overall survival of that patient for a clinical trial where the goal was to observe overall survival of patients in different arms. We want the model to predict outcome labels so that the predicted labels $\hat{y}_n$ are realistic.

Digital Twin Generation: We use a variational autoencoder that utilizes transformer architecture for generating digital twins. The self-attention and cross-attention mechanisms in the transformer architecture help the model gain information from all visits while reconstructing a single visit. Figure 1 shows the overall architecture of SeqTrial.

Figure 1:

SeqTrial reconstructs partial raw events ($x_{n,t}^{2}$) of patient $n$. An additional Outcome Prediction Module (OPM) takes the reconstructed events ($\hat{x}_{n,t}^{2}$) to predict the outcome of interest. Real treatment events ($x_{n,t}^{1}$) are concatenated to the corresponding $\hat{x}_{n,t}^{2}$ to generate the digital twin of the input patient record $X_{n;1:T_n}$.

Visit Embeddings: We use two types of visit embeddings.

  • BioBERT embedding: We get an embedding for an event by passing the actual name of the event through BioBERT to get a representation of size 768. For example, if a visit contains the events Antibiotics, Fever, etc., each of these names is input to a pretrained BioBERT model. We then average-pool the event representations obtained from BioBERT to get the visit embedding. These BioBERT embeddings are input to the encoder.

  • Multihot embedding: We create a multihot representation of a visit by placing 1 if that event has occurred in that visit and otherwise placing 0 for that event. These multihot embeddings are input to the decoder. The intuition behind using multihot embeddings is that the output of the transformer decoder is meaningful to the user so that they can know which events will occur in the digital twin’s record.
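The two visit representations above can be sketched as follows. This is only an illustration: random 768-dimensional vectors stand in for real BioBERT outputs, the three-event vocabulary is made up, and the function names are hypothetical. The pooling step assumes a non-empty visit.

```python
import numpy as np


def multihot(visit_events, vocabulary):
    """Return a 0/1 vector with a 1 for every vocabulary event present in the visit."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    for name in visit_events:
        vec[index[name]] = 1.0
    return vec


def mean_pooled_embedding(visit_events, event_vectors):
    """Average the per-event vectors to get one visit embedding (visit must be non-empty)."""
    return np.mean([event_vectors[name] for name in visit_events], axis=0)


vocabulary = ["Antibiotics", "Fever", "Nausea"]
rng = np.random.default_rng(0)
# Assumption: random vectors replace the 768-d BioBERT representation of each event name.
event_vectors = {name: rng.normal(size=768) for name in vocabulary}

visit = ["Antibiotics", "Fever"]
decoder_input = multihot(visit, vocabulary)                 # multihot embedding
encoder_input = mean_pooled_embedding(visit, event_vectors)  # pooled visit embedding
```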

Encoder: The encoder is composed of three consecutive components: a fully connected layer, a positional encoder and a transformer encoder layer with single head attention. The BioBERT embeddings are the inputs to the encoder. The encoder is denoted by $q_\phi(z|x)$. For clarity, we use the abbreviated notation $x$ for the BioBERT embedding of $x_{n,t}$ and $z$ for the latent representation of $x_{n,t}$. Formally, we parameterize the encoder $q_\phi(z|x)$ using a Gaussian distribution, as $q_\phi(z|x) = \mathcal{N}(z|\mu, \sigma \cdot I_d)$, where $d$ is the number of dimensions of the latent space. The sampling process (1) obtains the mean and variance of the Gaussian distribution from the encoder, i.e., $\{\mu(x), \sigma(x)\} \sim q_\phi(z|x)$; (2) samples a random noise $\epsilon$ from a standard normal distribution, i.e., $\epsilon \sim \mathcal{N}(0, I_d)$; and (3) transforms the noise by scaling it with the obtained variance and adding it to the mean, as $z = \mu(x) + \sigma(x) \cdot \epsilon \cdot \beta$. Here, $\beta$ is a hyperparameter that we tune to adjust the amount of noise we want to introduce in the digital twins.
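A small numerical sketch of the three sampling steps may help. It assumes the mean and scale vectors are already available (the values below are made up) and only illustrates how the hyperparameter β scales the injected noise; `sample_latent` is a hypothetical name.

```python
import numpy as np


def sample_latent(mu, sigma, beta, rng):
    """Compute z = mu + sigma * eps * beta with eps drawn from N(0, I_d)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps * beta


mu = np.array([0.5, -1.0, 2.0])      # stand-in for the encoder's mean output
sigma = np.array([0.1, 0.2, 0.3])    # stand-in for the encoder's scale output

# beta = 0 is used here only to show the noise-free limit; it is outside the tuned range.
z_no_noise = sample_latent(mu, sigma, beta=0.0, rng=np.random.default_rng(0))
# Same seed, different beta: the deviation from mu grows linearly with beta.
z_small = sample_latent(mu, sigma, beta=1.0, rng=np.random.default_rng(1))
z_large = sample_latent(mu, sigma, beta=5.0, rng=np.random.default_rng(1))
```

With a fixed noise draw, a larger β moves the latent code further from the mean, which is how the amount of variation between a twin and its real patient can be adjusted.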

Decoder: The decoder $p_\theta(x|z)$ is composed of three consecutive components: a fully connected layer, a positional encoder, and a transformer decoder layer with single head attention. The first layer takes the multihot encodings of each visit $x_{n,t}$ and generates an embedding that goes to the positional encoder. The output of the positional encoder for each visit goes to the transformer decoder layer as input, with the corresponding $z_{n,t}$ as memory, to get the partially reconstructed visits $\hat{x}_{n,t}^{2}$. The cross-attention mechanism of the transformer decoder layer helps to attend to all the visits in the input sequence. Finally, using Eq. 1, we get the final reconstructed patients from SeqTrial.

Outcome Prediction Module: As our intention is to get utility-preserving digital twins, we introduce an Outcome Prediction Module (OPM). This module consists of several linear layers with ReLU activation function and estimates the outcome of interest from the reconstructed patient record in a supervised manner.

Pretraining: To address the challenge of limited sample sizes, we employ self-supervised pretraining using a masking technique similar to MLM that enhances data utilization by leveraging unlabeled data effectively. Specifically, we pre-train the encoder with self-supervised learning by masking random patient visits, generating multiple samples from one and using the masked-out visits as labels. This pre-training task involves predicting masked visits, and the pre-trained encoder serves as the starting point for subsequent training, improving the model’s performance.
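The masking step can be sketched as below. The paper does not state the masking rate or how a masked visit is represented, so the `mask_fraction` parameter and the use of zero vectors for masked visits are assumptions made for illustration, as is the function name.

```python
import numpy as np


def make_masked_samples(visits, n_samples, mask_fraction, rng):
    """Create several (masked input, masked positions, targets) triples from one patient.

    visits: array of shape (n_visits, embedding_dim).
    """
    n_visits = visits.shape[0]
    n_mask = min(n_visits, max(1, int(round(mask_fraction * n_visits))))
    samples = []
    for _ in range(n_samples):
        masked_idx = np.sort(rng.choice(n_visits, size=n_mask, replace=False))
        masked_input = visits.copy()
        masked_input[masked_idx] = 0.0          # assumption: masked visits become zero vectors
        samples.append((masked_input, masked_idx, visits[masked_idx].copy()))
    return samples


rng = np.random.default_rng(0)
patient_visits = rng.normal(size=(6, 4))        # toy patient: 6 visits, 4-d embeddings
samples = make_masked_samples(patient_visits, n_samples=3, mask_fraction=0.3, rng=rng)
```

Each call yields several training examples from a single patient, which is the sense in which masking increases the effective amount of training data.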

Training: Our training loss consists of two parts described below:

Generative Loss: The VAE’s loss function comprises two components: the reconstruction loss and the KL divergence. The process of minimizing this loss is equivalent to maximizing the Evidence Lower Bound. For the reconstruction loss, we employ binary cross entropy.

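$$L_1 = -\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(x_{n,t}^{2}\log\left(\hat{x}_{n,t}^{2}\right) + \left(1 - x_{n,t}^{2}\right)\log\left(1 - \hat{x}_{n,t}^{2}\right)\right) + \sum_{n=1}^{N}\sum_{t=1}^{T_n} D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right).$$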

Here, $T_n$ is the number of visits of patient $n$, and $\hat{x}_{n,t}^{2}$ is the reconstructed version of $x_{n,t}^{2}$ that we get from the decoder of the VAE as a multihot vector. $p_\theta(z)$ is the prior distribution $\mathcal{N}(z|0, I_d)$.

Outcome Loss: This loss can be either binary cross entropy loss if the task is a binary classification task (e.g., death prediction) or mean squared error if the outcome is numerical (e.g., overall survival, progression-free survival).

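$$L_2 = \begin{cases} -\dfrac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right], & \text{if } y_i \text{ is binary} \\ \dfrac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2, & \text{if } y_i \text{ is numerical.} \end{cases} \quad (2)$$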

The final objective function of SeqTrial is $L = L_1 + \alpha L_2$. Here, $\alpha$ is a non-negative hyperparameter to assign importance to $L_2$.
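The combined objective can be sketched numerically as follows. This is not the authors' training code: it uses NumPy on toy arrays, treats σ as a per-dimension standard deviation so that the KL term has its usual closed form against a standard normal prior, and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-7  # clipping constant to keep the logarithms finite


def bce_sum(target, prob):
    """Binary cross entropy summed over all entries (reconstruction term of L1)."""
    prob = np.clip(prob, EPS, 1 - EPS)
    return float(-np.sum(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), assuming sigma is a standard deviation."""
    return float(0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma)))


def outcome_loss(y, y_hat, binary):
    """L2: mean binary cross entropy for binary outcomes, mean squared error otherwise."""
    if binary:
        return bce_sum(y, y_hat) / len(y)
    return float(np.mean((y - y_hat) ** 2))


def total_loss(x2, x2_hat, mu, sigma, y, y_hat, alpha, binary):
    """L = L1 + alpha * L2."""
    l1 = bce_sum(x2, x2_hat) + kl_to_standard_normal(mu, sigma)
    return l1 + alpha * outcome_loss(y, y_hat, binary)


# Toy values: one visit with two events, a latent code that matches the prior,
# and a numerical outcome for two patients.
loss = total_loss(
    x2=np.array([1.0, 0.0]), x2_hat=np.array([0.5, 0.5]),
    mu=np.zeros(3), sigma=np.ones(3),
    y=np.array([1.0, 3.0]), y_hat=np.array([2.0, 3.0]),
    alpha=2.0, binary=False,
)
```

In this toy case the KL term is zero, the reconstruction term is 2 log 2, and the outcome term is 0.5, so a larger α would weight the outcome error more heavily relative to reconstruction.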

Inference: During inference, real test samples are fed to a trained model to produce corresponding digital twins or predictions. In contrast, traditional synthetic data generation methods create artificial data by sampling from a latent space, simulating patterns similar to real-world data without relying on actual observations.

Experimental Setup

Source Data: We downloaded and processed the publicly available datasets from Project Data Sphere [21]. The statistics of the datasets used are summarized in Table 1. The first dataset is a phase III breast cancer clinical trial (NCT00174655). A total of 2,887 patients were randomly assigned to the arms to evaluate the activity of Docetaxel, given either sequentially or in combination with Doxorubicin, followed by CMF (cyclophosphamide, methotrexate, and fluorouracil), in comparison to Doxorubicin alone or in combination with Cyclophosphamide, followed by CMF, in the adjuvant treatment of node-positive breast cancer patients. After data cleaning, the publicly available records of 971 patients were used in this project.

Table 1:

Statistics of datasets

| | Breast Cancer | NSCLC | Melanoma |
|---|---|---|---|
| Number of participants | 971 | 548 | 310 |
| Total number of visits | 8292 | 4000 | 3578 |
| Maximum number of visits | 14 | 26 | 37 |
| Types of treatments | 4 | 3 | 1 |
| Types of medications | 100 | 13 | 40 |
| Types of adverse events | 273 | 83 | 7 |

The second dataset is a Non-Small Cell Lung Cancer (NSCLC) trial (NCT00981058). Of the 1,093 trial participants, the records of 548 patients were publicly available. The main objective of this study was to evaluate the overall survival in patients with Stage IV squamous NSCLC treated with necitumumab plus gemcitabine and cisplatin chemotherapy versus gemcitabine and cisplatin chemotherapy alone.

The third dataset is Melanoma clinical trial data (NCT00522834). This is a Phase 3 trial of Elesclomol (STA-4783) in combination with Paclitaxel versus Paclitaxel alone for treatment of chemotherapy-naïve subjects with stage IV metastatic melanoma. Among 651 participants, records of 326 patients were publicly available. After cleaning and preprocessing, we have a total of 310 patient records.

Baseline Models: We compare SeqTrial with the following methods for synthetic EHR generation and clinical trial data generation:

  • EVA [4]: a generative model for generating synthetic electronic health records (EHR) using conditional variational autoencoders.

  • PAR [22]: a probabilistic auto-regressive model for sequential data.

  • PromptEHR [5]: a method for EHR generation with generative language models equipped with prompt learning.

  • TWIN [20]: a clinical trial digital twin generation method.

EVA and PromptEHR are synthetic EHR generation models, and PAR is a more general framework for any multi-sequence real-world data. These three models cannot handle constraints of specific treatment strategies for specific trial arms. They are also not designed for personalized synthetic record generation and thus do not apply to digital twin generation. TWIN is a personalized synthetic record generation model and can generate digital twins.

Performance Assessments: We assessed three aspects that have historically been used in synthetic clinical data generation research: fidelity, utility, and privacy [4, 20, 5].

Fidelity is the degree to which the synthetic data resembles real clinical trial data. It was evaluated on all datasets for all algorithms unless noted:

  • Dimension-wise probability r (DP) refers to the Bernoulli success probability of each feature (medication or adverse event) in the dataset.

  • Conditional probability r² (CP) is defined as the probability P(A|B) of event A occurring within the first 5 visits after event B occurred.

  • Event frequency r² (EF) is the frequency of events in the source and synthetic datasets.

  • Bigram frequency r² (BF) is calculated for each patient, where the sequence of events is obtained by ordering events by their respective dates of occurrence.

  • Rouge-L F1 score [23] measures the longest common subsequence (LCS) between a digital twin and the corresponding reference patient record, where the F1 score balances between precision and recall, providing a single value that indicates the overall similarity between the generated sequence and the reference sequence based on their LCS. For this score, we scanned each row (each visit) of a data CSV file and created a sentence by appending the column names that have 1 in the cell, and then aggregated all the visits for each patient to create a sequence for that patient. Rouge-L F1 can be calculated only for digital twins; SeqTrial and TWIN were evaluated.

  • Edit distance [24] is a measure of the similarity between two sequences in terms of the minimum number of operations required to transform one sequence into the other, where a smaller edit distance indicates that there are fewer differences between the digital twin and the real one. Edit distance can be calculated only for digital twins; SeqTrial and TWIN were evaluated with this measure.
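The two sequence-level metrics at the end of the list can be sketched from scratch as below. This is not the implementation used in the paper (which is not shown here); it assumes the comparison is made at the level of event tokens, and the example sequences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision (vs. candidate length) and recall (vs. reference length)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


# Toy event sequences for one real patient and its digital twin.
real = ["Fever", "Aspirin", "Nausea", "Fatigue"]
twin = ["Fever", "Nausea", "Fatigue", "Rash"]
f1 = rouge_l_f1(twin, real)        # LCS is Fever, Nausea, Fatigue
dist = edit_distance(real, twin)   # delete Aspirin, insert Rash
```

A twin identical to its real record would give an F1 of 1 and an edit distance of 0, so the two metrics move in opposite directions as similarity increases.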

Utility is the extent to which the synthetic data is useful for the downstream tasks. For utility evaluation, we trained LSTM models with real or synthetic data to predict the outcome of interest on real test data. We calculated Mean Squared Error (MSE) for numerical outcomes and area under the receiver-operating characteristic curve (AUROC) for binary outcomes. We selected a diverse set of downstream tasks to show our method works well on different tasks.

  • Overall Survival (OS) is the time to patient death and Progression-free survival (PFS) is the time to first evidence of patient disease progression or death. We analyzed the NSCLC dataset, using MSE to evaluate OS and PFS prediction.

  • Severe Outcome (SO) is a binary classification task for patient occurrences of a severe event. SO is 1 if the patient has faced any of the following events: 1) death, 2) serious adverse events, 3) local relapse, 4) distant relapse, 5) regional relapse. We analyzed the breast cancer dataset, using AUROC to evaluate SO prediction.

  • Last Visit Outcomes (LVO) are study-specific important outcomes measured at the last patient visit. For the breast cancer dataset, we considered MSE for Total bilirubin (TB), an assay of how well a patient’s liver is working. For the melanoma dataset, we considered MSE for Heart Rate (HR) and Respiratory Rate (RR).

Privacy is the extent to which the synthetic data protects the privacy of individuals in the real data and was evaluated by:

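  • Nearest neighbor adversarial accuracy compares, for each digital twin in a target distribution $T$, the distance to the nearest real person in a source distribution $S$, denoted $d_{TS}$, with the distance to the nearest other digital twin in the same target distribution, denoted $d_{TT}$ [25]; $d_{ST}$ and $d_{SS}$ are defined analogously for the real records. The formal representation is given by:
    $$AA_{TS} = \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left(d_{TS}(i) > d_{TT}(i)\right) + \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left(d_{ST}(i) > d_{SS}(i)\right)\right)$$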
  • Privacy loss is defined as the difference between the adversarial accuracy on the test set and the adversarial accuracy on the training set.

  • Singling out evaluates the risk of singling out any generated digital twin based on the uniqueness of a single attribute.

  • Membership disclosure evaluates the ability of an attacker (who possesses a fraction of the real records) to identify membership of records from the real to synthetic ones.
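The nearest neighbor adversarial accuracy above can be sketched with NumPy as follows. The Euclidean distance and the function names are assumptions; the paper does not state which distance was used. Privacy loss would then be the difference between this score computed against the test set and against the training set.

```python
import numpy as np


def nearest_distances(a, b, exclude_self=False):
    """For each row of a, the Euclidean distance to its nearest row in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    if exclude_self:
        np.fill_diagonal(d, np.inf)   # a record is not its own neighbor
    return d.min(axis=1)


def adversarial_accuracy(source, target):
    """Average of the two indicator means in the AA_TS formula."""
    d_ts = nearest_distances(target, source)
    d_tt = nearest_distances(target, target, exclude_self=True)
    d_st = nearest_distances(source, target)
    d_ss = nearest_distances(source, source, exclude_self=True)
    return 0.5 * (np.mean(d_ts > d_tt) + np.mean(d_st > d_ss))


rng = np.random.default_rng(0)
real_records = rng.normal(size=(20, 3))                             # toy "real" records
aa_copied = adversarial_accuracy(real_records, real_records.copy())   # twins copy real data
aa_far = adversarial_accuracy(real_records, real_records + 100.0)     # twins far from real data
```

The two toy extremes bracket the ideal value of 0.5: exact copies give 0 (a sign of overfitting and privacy risk), while synthetic data far from the real records gives 1 (easily distinguishable, low fidelity).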

We additionally did an ablation study by discarding the OPM from SeqTrial (SeqTrial w/o OPM). We assessed fidelity, utility, and privacy utilizing the Melanoma dataset in the ablation study.

Implementation Details: We split all datasets into train, test and validation sets. The test sets contain records of 50, 50 and 20 patients from the breast cancer, NSCLC and Melanoma clinical trial event datasets respectively. Similarly, the validation sets contain records of 50, 50 and 20 patients from the breast cancer, NSCLC and Melanoma clinical trial event datasets respectively. There is no overlap between any of the splits. We utilized PyTrial [26] for EVA, PromptEHR and TWIN, and the SDV [27] Python package for PAR. For utility evaluation, we used the same test sets but used 3 different seeds to sample training and validation sets. We utilize biobert-base-cased-v1.1 for pretraining the encoder for SeqTrial. We ran our implementations on an NVIDIA GPU with CUDA 12 support and 11264 MiB of dedicated memory.

We tuned the loss hyperparameter α and encoder hyperparameter β for all six scenarios of the three different datasets. For α, we tuned in the range of 1 to 500. For β, we tuned in the range of 1 to 20. We also tuned the hyperparameters for the LSTM models we used for utility prediction. The number of epochs we tuned ranges from 50 to 500. The number of recurrent layers we explored ranges from 1 to 3. The final set of hyperparameters is shown in Table 5.

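Table 5:

Final set of hyperparameters. α and β are the hyperparameters after tuning; LSTM epochs and LSTM layers are the LSTM hyperparameters for utility prediction.

| Dataset | Outcome | α | β | LSTM epochs | LSTM layers |
|---|---|---|---|---|---|
| Breast Cancer | Severe Outcome | 500 | 1.5 | 500 | 3 |
| Breast Cancer | Total Bilirubin | 100 | 2 | 200 | 3 |
| NSCLC | Overall Survival | 15 | 3 | 200 | 2 |
| NSCLC | Progression-free Survival | 500 | 1 | 200 | 3 |
| Melanoma | Heart Rate | 15 | 3 | 500 | 2 |
| Melanoma | Respiratory Rate | 50 | 5 | 500 | 2 |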

Results & Discussion

For fidelity (dimension-wise probability r, conditional probability r², event frequency r², bigram r² and Rouge-L F1, best score: 1; edit distance, best: low), SeqTrial had better scores than the other algorithms across all datasets (Table 2 and Table 3); only PAR had similar dimension-wise probability and event frequency performance, for the NSCLC and Melanoma datasets. Similarly, for utility, SeqTrial had the lowest MSE for Overall Survival and Progression-free Survival for NSCLC, the highest AUROC for Severe Outcome and the lowest MSE for Total Bilirubin for breast cancer, and the lowest MSE for heart rate and respiratory rate for Melanoma (Table 4). Moreover, in most comparisons the confidence intervals did not overlap the higher MSE or lower AUROC observed for the other algorithms (Table 4). The improvement in predictive performance over training on real data suggests that synthetic data generated by SeqTrial is well suited to these prediction tasks.

Table 2:

Fidelity scores for baseline models and SeqTrial. Bold values represent the best fidelity score as calculated by each metric across algorithms for each clinical trial dataset. DP: Dimension-wise probability (r), EF: Event frequency line (r²), BF: Bigram (r²), CP: Conditional probability (r²).

| Model | DP Breast | DP NSCLC | DP Melanoma | EF Breast | EF NSCLC | EF Melanoma | BF Breast | BF NSCLC | BF Melanoma | CP Breast | CP NSCLC | CP Melanoma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EVA | 0.13 | -0.04 | -0.06 | -1.79 | -7.27 | -14.23 | -4.42 | -7.02 | -29.53 | -0.42 | 0.62 | -0.22 |
| PAR | 0.87 | 1.00 | 0.95 | 0.75 | 0.99 | 0.89 | -0.08 | -0.06 | -5.16 | -4.49 | -0.21 | -6.34 |
| PromptEHR | 0.70 | 0.56 | 0.55 | 0.33 | 0.73 | 0.54 | -0.14 | 0.51 | 0.11 | 0.27 | 0.76 | 0.54 |
| TWIN | 0.09 | 0.41 | 0.48 | -0.40 | -0.19 | -0.10 | -0.35 | 0.02 | -0.79 | -0.62 | 0.42 | 0.19 |
| SeqTrial | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.86 | 0.92 | 0.77 | 0.56 | 0.94 | 0.90 |

Table 3:

Sequential fidelity scores for digital twin generation methods. Bold values represent the best fidelity score as calculated by each metric across algorithms for each clinical trial dataset.

| Model | Rouge-L F1 Breast | Rouge-L F1 NSCLC | Rouge-L F1 Melanoma | Edit Distance Breast | Edit Distance NSCLC | Edit Distance Melanoma |
|---|---|---|---|---|---|---|
| TWIN | 0.5868 | 0.6770 | 0.6739 | 56.5063 | 41.3629 | 40.8185 |
| SeqTrial | 0.8719 | 0.9121 | 0.8868 | 16.9322 | 6.7189 | 10.5815 |

Table 4:

Utility evaluation of models in different prediction scenarios. Bold scores are best in the corresponding columns. OS: overall survival, PFS: progression-free survival, SO: severe outcome, TB: total bilirubin, HR: heart rate, RR: respiratory rate

| Training data | NSCLC OS (MSE) | NSCLC PFS (MSE) | Breast Cancer SO (AUROC) | Breast Cancer TB (MSE) | Melanoma HR (MSE) | Melanoma RR (MSE) |
|---|---|---|---|---|---|---|
| Real | 2.67 ± 0.04 | 0.64 ± 0.00 | 0.45 ± 0.08 | 15.08 ± 1.63 | 262.60 ± 1.07 | 6.65 ± 0.16 |
| EVA | 2.72 ± 0.00 | 0.69 ± 0.00 | 0.52 ± 0.09 | 15.09 ± 1.63 | 262.81 ± 1.73 | 6.61 ± 0.12 |
| PAR | 2.62 ± 0.00 | 0.63 ± 0.00 | 0.54 ± 0.11 | 14.14 ± 0.00 | 264.84 ± 0.09 | 6.89 ± 0.01 |
| PromptEHR | 2.92 ± 0.00 | 0.77 ± 0.00 | 0.42 ± 0.02 | 14.16 ± 0.01 | 263.59 ± 0.99 | 6.63 ± 0.00 |
| TWIN | 2.68 ± 0.00 | 0.65 ± 0.01 | 0.51 ± 0.01 | 14.11 ± 0.03 | 265.71 ± 0.00 | 7.05 ± 0.00 |
| SeqTrial | 2.57 ± 0.00 | 0.63 ± 0.01 | 0.56 ± 0.03 | 14.11 ± 0.00 | 261.97 ± 0.00 | 6.46 ± 0.00 |

The privacy evaluations provided both an assessment of synthetic data quality and of privacy risk for each of the methods. For the nearest neighbor adversarial accuracy (best score: 0.5), all methods except SeqTrial scored close to 1, indicating poor synthesis quality: datasets synthesized with these methods are easily distinguishable from the real datasets. For NSCLC and Melanoma, scores for SeqTrial were close to 0.5, indicating that the adversarial classifier is not able to distinguish between digital twins generated using SeqTrial and real data. In contrast, the low adversarial accuracy for breast cancer indicates that the generator may have overfitted the training data, where the risk associated with overfitting is represented by the associated privacy loss. The privacy loss (best score: 0) for all methods across datasets remained low (≤ 0.1). However, given that the adversarial accuracy was high (close to 1) for all methods except SeqTrial, indicating low quality of synthesis, the low privacy loss is of limited significance. While the privacy loss for SeqTrial was low (≤ 0.1) for the NSCLC and Melanoma datasets, the privacy loss for breast cancer was 0.37, again indicating that SeqTrial may have overfitted to the training data. However, the privacy loss in this case may be justified by the significant improvement in the quality of synthesis for SeqTrial over the other baseline methods. The singling out risks (best score: 0) were comparable across methods and remained low overall (< 0.4) across datasets, while most evaluations of membership disclosure risk (best score: 0.5) remained ≤ 0.6, indicating low risk across methods and datasets. Membership disclosure risk for SeqTrial on the breast cancer data was 0.75, indicating moderate disclosure risk. This risk was also echoed by the privacy loss observed for this dataset.

Overall, when balancing fidelity and privacy, and considering the significant improvement in the quality of the digital twins generated by SeqTrial as compared to other baseline methods (Figure 2), the low to moderate privacy risk for SeqTrial is expected. The lower privacy risk associated with the other baseline methods is attributable to their low fidelity of synthesis, as indicated by the fidelity metrics and the high train and test adversarial accuracy scores. Figure 2 indicates that the area covered by SeqTrial is considerably larger than that of the other baselines across all datasets, suggesting that SeqTrial outperforms the other baseline methods overall.

Figure 2:

Fidelity vs. Privacy evaluation of SeqTrial against baseline methods. All values across metrics are normalized from 0 to 1, where 1 represents the ideal score for all metrics, i.e., low privacy risk and high fidelity. SeqTrial has the largest area covered under the red lines, showing that overall SeqTrial is better than the baseline models. DP: Dimension-wise probability, CP: Conditional probability, EF: Event frequency, BF: Bigram frequency, MD: Membership disclosure, PL: Privacy loss, SO: Singling out, AA_train: Adversarial Accuracy (Train), AA_test: Adversarial Accuracy (Test).

Finally, in the ablation study performed by discarding the OPM from SeqTrial (SeqTrial w/o OPM) on the Melanoma dataset, Table 6 shows that SeqTrial performs better than SeqTrial w/o OPM in terms of utility and privacy, and comparably in terms of fidelity. Lower MSE scores and fewer exact matches with real records indicate better utility and lower privacy risk respectively. Although both versions have high fidelity, as indicated by the high correlation r between dimension-wise probabilities, utility (as defined by the MSE in clinical outcome prediction) and privacy are best when digital twins are generated by SeqTrial.

Table 6:

Ablation study. Results are from the Melanoma dataset. HR: heart rate, RR: respiratory rate. The Dimension-wise probability (r) columns give the correlation between the dimension-wise probabilities of events. The MSE columns show the mean squared errors. The Exact match % columns show the percentage of digital twins that match the corresponding real data 100%.

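| Model | RR: MSE | RR: Exact match % | RR: Dimension-wise probability (r) | HR: MSE | HR: Exact match % | HR: Dimension-wise probability (r) |
| --- | --- | --- | --- | --- | --- | --- |
| SeqTrial | 6.46 | 0.03 | 1 | 261.96 | 0 | 0.99 |
| SeqTrial w/o OPM | 8.54 | 0.95 | 1 | 292.98 | 0.99 | 1 |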
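The fidelity and privacy columns of Table 6 can be illustrated with a short sketch. It assumes that real records and their digital twins are stored as aligned binary matrices (row i of the synthetic matrix is the twin of row i of the real matrix, and each column is one event); this encoding and the function names are assumptions made for illustration, not the SeqTrial implementation.

```python
import numpy as np


def dimension_wise_r(real, synth):
    """Pearson correlation between per-event occurrence probabilities."""
    p_real = real.mean(axis=0)    # probability of each event in the real data
    p_synth = synth.mean(axis=0)  # probability of each event in the twins
    return float(np.corrcoef(p_real, p_synth)[0, 1])


def exact_match_pct(real, synth):
    """Percentage of twins identical to their corresponding real record."""
    return float(100.0 * np.mean(np.all(real == synth, axis=1)))


# Toy data: four patients, three events (illustrative values only).
real = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
synth = real.copy()
synth[0] = [0, 0, 1]  # one twin differs from its real counterpart
```

A high r together with a low exact match percentage is the desirable combination: the twins reproduce the population-level event probabilities without copying individual records.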

Conclusion

This research introduces SeqTrial, a novel approach that produces high-quality personalized clinical trial digital twins capturing event-level and visit-level relationships. In our evaluation, SeqTrial outperforms the baseline methods in terms of fidelity and utility while maintaining moderate privacy risks. Considering the challenges of acquiring clinical trial data and the limited sizes of patient cohorts, we believe that SeqTrial offers a reliable, lightweight, and secure method for using and augmenting clinical trial data. In future work, we aim to extend the model to accommodate multiple outcomes of interest simultaneously and to reconstruct numerical data such as lab test results.

Footnotes

1. https://github.com/trishad2/SeqTrial.git

References

  • 1. Ebrahim S, Sohani ZN, Montoya L, Agarwal A, Thorlund K, Mills EJ, et al. Reanalyses of randomized clinical trial data. JAMA. 2014;312(10):1024–32. doi: 10.1001/jama.2014.9646.
  • 2. Doshi P. Data too important to share: do those who control the data control the message? BMJ. 2016. p. 352.
  • 3. Choi E, Biswal S, Malin B, Duke J, Stewart WF, Sun J. Generating multi-label discrete patient records using generative adversarial networks. Machine Learning for Healthcare Conference. PMLR; 2017. pp. 286–305.
  • 4. Biswal S, Ghosh S, Duke J, Malin B, Stewart W, Sun J. EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders. arXiv preprint arXiv:2012.10020. 2020.
  • 5. Wang Z, Sun J. PromptEHR: Conditional Electronic Healthcare Records Generation with Prompt Learning. Conference on Empirical Methods in Natural Language Processing. 2022.
  • 6. Beigi M, Shafquat A, Mezey J, Aptekar JW. Synthetic Clinical Trial Data while Preserving Subject-Level Privacy. NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research. 2022.
  • 7. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. doi: 10.1093/bioinformatics/btz682.
  • 8. Bartfai T, Conti B. Fever. ScientificWorldJournal. 2010;10:490–503. doi: 10.1100/tsw.2010.50. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2850202/
  • 9. Lanas A, McCarthy D, Voelker M, Brueckner A, Senn S, Baron JA. Short-term acetylsalicylic acid (aspirin) use for pain, fever, or colds—gastrointestinal adverse effects: a meta-analysis of randomized clinical trials. Drugs in R & D. 2011;11:277–88. doi: 10.2165/11593880-000000000-00000.
  • 10. Wang Z, Gao C, Glass LM, Sun J. Artificial Intelligence for In Silico Clinical Trials: A Review. arXiv preprint arXiv:2209.09023. 2022.
  • 11. Guan J, Li R, Yu S, Zhang X. Generation of Synthetic Electronic Medical Record Text. IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE Computer Society; 2018. pp. 374–80.
  • 12. Baowaly MK, Lin CC, Liu CL, Chen KT. Synthesizing electronic health records using improved generative adversarial networks. Journal of the American Medical Informatics Association. 2019;26(3):228–41. doi: 10.1093/jamia/ocy142.
  • 13. Zhang Z, Yan C, Lasko TA, Sun J, Malin BA. SynTEG: A framework for temporal structured electronic health data simulation. Journal of the American Medical Informatics Association. 2021;28(3):596–604. doi: 10.1093/jamia/ocaa262.
  • 14. Lee D, Yu H, Jiang X, Rogith D, Gudala M, Tejani M, et al. Generating sequential electronic health records using dual adversarial autoencoder. Journal of the American Medical Informatics Association. 2020;27(9):1411–9. doi: 10.1093/jamia/ocaa119.
  • 15. Allen A, Siefkas A, Pellegrini E, Burdick H, Barnes G, Calvert J, et al. A digital twins machine learning model for forecasting disease progression in stroke patients. Applied Sciences. 2021;11(12):5576.
  • 16. Emam KE, Mosquera L, Zheng C. Optimizing the synthesis of clinical trial data using sequential trees. Journal of the American Medical Informatics Association. 2021;28(1):3–13. doi: 10.1093/jamia/ocaa249.
  • 17. Azizi Z, Zheng C, Mosquera L, Pilote L, El Emam K. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open. 2021;11(4):e043497. doi: 10.1136/bmjopen-2020-043497.
  • 18. Bertolini D, Loukianov AD, Smith AM, Li-Bland D, Pouliot Y, Walsh JR, et al. Modeling Disease Progression in Mild Cognitive Impairment and Alzheimer’s Disease with Digital Twins. arXiv preprint arXiv:2012.13455. 2020.
  • 19. Walsh JR, Smith AM, Pouliot Y, Li-Bland D, Loukianov A, Fisher CK. Generating digital twins with multiple sclerosis using probabilistic neural networks. arXiv preprint arXiv:2002.02779. 2020.
  • 20. Das T, Wang Z, Sun J. TWIN: Personalized Clinical Trial Digital Twin Generation. KDD ’23. New York, NY, USA: Association for Computing Machinery; 2023. pp. 402–413.
  • 21. Green AK, Reeder-Hayes KE, Corty RW, Basch E, Milowsky MI, Dusetzina SB, et al. The project data sphere initiative: accelerating cancer research by sharing data. The Oncologist. 2015;20(5):464–e20. doi: 10.1634/theoncologist.2014-0431.
  • 22. Zhang K, Patki N, Veeramachaneni K. Sequential models in the synthetic data vault. arXiv preprint arXiv:2207.14406. 2022.
  • 23. Lin CY. Rouge: A package for automatic evaluation of summaries. Text summarization branches out. 2004. pp. 74–81.
  • 24. Navarro G. A guided tour to approximate string matching. ACM Computing Surveys (CSUR). 2001;33(1):31–88.
  • 25. Yale A, Dash S, Bhanot K, Guyon I, Erickson JS, Bennett KP. Synthesizing quality open data assets from private health research studies. Business Information Systems Workshops: BIS 2020 International Workshops, Colorado Springs, CO, USA, June 8–10, 2020, Revised Selected Papers 23. Springer; 2020. pp. 324–35.
  • 26. Wang Z, Theodorou B, Fu T, Xiao C, Sun J. PyTrial: A Comprehensive Platform for Artificial Intelligence for Drug Development. 2023. Available from: https://pytrial.readthedocs.io/en/latest/
  • 27. Patki N, Wedge R, Veeramachaneni K. The Synthetic data vault. IEEE International Conference on Data Science and Advanced Analytics (DSAA). 2016. pp. 399–410.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association