Abstract
Physiological determinants of drug dosing (PDODD) are a promising approach for precision dosing. This study investigates the alterations of PDODD in diseases and evaluates a variational autoencoder (VAE) artificial intelligence model for PDODD. The PDODD panel contained 20 biomarkers and 13 renal, hepatic, diabetes, and cardiac disease status variables. Demographic characteristics, anthropometric measurements (body weight, body surface area, waist circumference), blood (plasma volume, albumin), renal (creatinine, glomerular filtration rate, urine flow, and urine albumin to creatinine ratio), hepatic (R‐value, hepatic steatosis index, drug‐induced liver injury index), and blood cell (systemic inflammation index, red cell, lymphocyte, neutrophil, and platelet counts) biomarkers, and medical questionnaire responses from the National Health and Nutrition Examination Survey (NHANES) were included. The tabular VAE (TVAE) generative model was implemented with the Synthetic Data Vault Python library. The joint distributions of the generated data vs. test data were compared using graphical univariate, bivariate, and multidimensional projection methods and distribution proximity measures. The PDODD biomarkers related to disease progression were altered as expected in renal, hepatic, diabetes, and cardiac diseases. The continuous PDODD panel variables generated by the TVAE satisfactorily approximated the distribution in the test data. The TVAE‐generated distributions of some discrete variables deviated from the test data distribution. The age distribution of TVAE‐generated continuous variables was similar to the test data. The TVAE algorithm demonstrated potential as an AI model for continuous PDODD and could be useful for generating virtual populations for clinical trial simulations.
INTRODUCTION
The goal of precision medicine is to consider inter‐individual differences to improve treatment outcomes given that many treatment algorithms are based on the “expected response of the average patient”. 1 The National Institutes of Health and Food and Drug Administration have championed tailoring treatments based on differences in “genes, environments, and lifestyles” to achieve the precision medicine mission. 1 , 2 Reliably measuring environmental and lifestyle factors in clinical settings is challenging, and their utility for individualizing dosing is not proven.
Realizing the promise of precision medicine requires innovative methods to enable efficient selection of the optimal treatment dosing regimen for the individual patient's disease state and health condition. However, dose individualization decisions are frequently based on easily obtained demographic and anthropometric measures, for example, age, race/ethnicity, sex, body weight, and body surface area. Our group has proposed an innovative biomarker‐guided approach for precision medicine called physiological determinants of drug dosing (PDODD). 3 , 4 The PDODD approach is particularly promising and viable since point‐of‐care laboratory testing and multiplexed assays are becoming increasingly utilized in many hospital and clinical care settings.
Point‐of‐care tests are used for diagnostic decision‐making in emergency and acute care settings where time and space are at a premium. Many clinically validated biomarkers of renal, hepatic, and metabolic functions are routinely obtained during standard outpatient care for chronic diseases, and during the clinical screening, safety, and efficacy evaluation protocols of drug development trials. Proper utilization of available renal, hepatic, and metabolic function biomarkers in conjunction with innovative pharmacometric algorithms could aid acute and chronic dosing decisions and enhance patient outcomes. PDODD has the potential to yield a more viable approach to precision medicine. However, disease states that impact body habitus or the major organs of drug absorption, distribution, metabolism, and elimination frequently warrant dose individualization. These disease‐related changes cause alterations to PDODD that are not well characterized and are the focus of this research.
Generating synthetic data emerges as a practical solution in clinical trial simulations where the availability of data is limited, for example, in under‐represented groups. Generated data can be used in simulations to assess the full range of factors that could contribute to drug efficacy, effects, and disposition. Generative AI methods can estimate the high‐dimensional joint distribution of a PDODD biomarker panel containing disease status variables to provide a complete statistical description of the trends, pairwise correlations, multivariate associations, and inter‐individual variability among the constituent variables, including the dependence of the PDODD on disease status. 4 An AI approach capable of reliably generating random variate vectors drawn from the joint distribution could be used to obtain virtual patient populations for the PDODD and disease biomarkers for clinical trial simulations.
Generative adversarial networks (GAN) 5 and variational autoencoders (VAE) 6 are commonly used approaches for image generation. Both GAN and VAE use systems of deep neural networks and have the capability to learn the high‐dimensional joint distribution from training data. However, GAN and VAE use distinctly different computational strategies. GANs employ adversarial training via classification, whereas VAEs perform nonlinear dimensionality reduction. The goals of this research are to characterize the effects of disease states on PDODD, and to investigate whether VAE‐based generative AI models can simulate the joint distribution of PDODD in renal, hepatic, metabolic (diabetes), and cardiac diseases.
GENERATIVE MODELING WITH VARIATIONAL AUTOENCODERS
Study design
Data sets
Data were obtained from the 2009–2010, 2011–2012, 2013–2014, 2015–2016, and 2017–2018 cycles of the National Health and Nutrition Examination Survey (NHANES). NHANES is conducted by the National Center for Health Statistics (NCHS) and contains data from laboratory measurements, physical screenings, and survey questionnaires from a representative sample of the United States population. 7 , 8
The NHANES data retrieval utility R package was used for downloading NHANES data files. 9
Data preprocessing
Subjects 20 years and older were included. In NHANES, subjects 80 years old and over are coded as 80 years for privacy protection.
Race was recoded from the RIDRETH1 Race/Hispanic origin variable. The Mexican American and Other Hispanic participants were recoded as Hispanic; the other categories (Non‐Hispanic White, Non‐Hispanic Black, Other – including Multiracial) were retained unchanged.
Missing data were handled using listwise deletion.
Computed biomarkers
Several biomarkers in the PDODD panel were derived from primary biomarkers in NHANES using equations.
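Body surface area ($\mathrm{BSA}$, m2) was calculated from body weight ($W$, kg) and height ($H$, cm) using the Du Bois and Du Bois equation 10 :

$$\mathrm{BSA} = 0.007184 \times W^{0.425} \times H^{0.725}$$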
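Normalized waist circumference ($\mathrm{WAISTMF}$) was calculated from the waist circumference ($\mathrm{WC}$, cm) using divisors of 88 cm for women and 102 cm for men, which are the recommended treatment targets 11 :

$$\mathrm{WAISTMF} = \begin{cases} \mathrm{WC}/88 & \text{for women} \\ \mathrm{WC}/102 & \text{for men} \end{cases}$$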
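Estimated plasma volume ($\mathrm{PLASMAVOL}$, L) was calculated from hematocrit ($\mathrm{Hct}$, %) and hemoglobin ($\mathrm{Hb}$, g/dL) with the Strauss formula 12 :

$$\mathrm{PLASMAVOL} = \frac{100 - \mathrm{Hct}}{\mathrm{Hb}}$$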
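The anthropometric and blood volume calculations above can be illustrated with a minimal Python sketch. The function and argument names are illustrative and are not taken from the study code. The Du Bois coefficients are the commonly published values, and the plasma volume expression, (100 − hematocrit)/hemoglobin, is assumed here as the simplified Strauss‐derived form.

```python
def body_surface_area(weight_kg: float, height_cm: float) -> float:
    """Du Bois and Du Bois body surface area in m^2 (standard coefficients)."""
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725


def normalized_waist(waist_cm: float, sex: str) -> float:
    """Waist circumference divided by the sex-specific target (88 or 102 cm)."""
    divisors = {"female": 88.0, "male": 102.0}
    if sex not in divisors:
        raise ValueError("sex must be 'female' or 'male'")
    return waist_cm / divisors[sex]


def estimated_plasma_volume(hematocrit_pct: float, hemoglobin_g_dl: float) -> float:
    """Assumed Strauss-derived estimate: (100 - Hct) / Hb."""
    return (100.0 - hematocrit_pct) / hemoglobin_g_dl


if __name__ == "__main__":
    print(round(body_surface_area(70.0, 170.0), 2))   # BSA for a 70 kg, 170 cm adult
    print(normalized_waist(88.0, "female"))           # a waist at the target gives 1.0
    print(estimated_plasma_volume(40.0, 15.0))        # (100 - 40) / 15
```

A normalized waist value above 1 indicates a waist circumference above the sex‐specific target.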
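The estimated glomerular filtration rate ($\mathrm{eGFR}$, mL/(min 1.73 m2)) was obtained from serum creatinine measurements using the CKD‐EPI study 2021 formula. 13

$$\mathrm{eGFR} = 142 \times \min\left(\frac{S_{cr}}{\kappa}, 1\right)^{\alpha} \times \max\left(\frac{S_{cr}}{\kappa}, 1\right)^{-1.200} \times 0.9938^{\mathrm{Age}} \times \theta$$

In the equation: $S_{cr}$ is serum creatinine in mg/dL; $\mathrm{Age}$ is in years; $\kappa$ is a constant that is 0.7 for females and 0.9 for males; $\alpha$ is a constant that is −0.241 for females and −0.302 for males; $\theta$ is a constant that is 1.012 for females and 1 for males. 13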
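As an illustration, the 2021 CKD‐EPI creatinine equation can be sketched in Python. The sex‐specific constants are those listed in the text; the leading coefficient 142, the exponent −1.200, and the age base 0.9938 are taken from the published 2021 equation. The function and argument names are illustrative and are not part of the study code.

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, sex: str) -> float:
    """Estimated GFR in mL/(min 1.73 m^2) from serum creatinine (mg/dL)."""
    if sex == "female":
        kappa, alpha, sex_factor = 0.7, -0.241, 1.012
    elif sex == "male":
        kappa, alpha, sex_factor = 0.9, -0.302, 1.0
    else:
        raise ValueError("sex must be 'female' or 'male'")
    ratio = scr_mg_dl / kappa
    return (
        142.0
        * min(ratio, 1.0) ** alpha       # applies when creatinine is below kappa
        * max(ratio, 1.0) ** -1.200      # applies when creatinine is above kappa
        * 0.9938 ** age_years
        * sex_factor
    )


if __name__ == "__main__":
    print(round(egfr_ckd_epi_2021(0.7, 50, "female"), 1))
    print(round(egfr_ckd_epi_2021(1.8, 50, "male"), 1))
```

Only one of the two creatinine terms differs from 1 for a given subject, so the estimate decreases monotonically as serum creatinine rises.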
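The hepatic R‐value ($\mathrm{RVALUE}$) is a computed measure of liver function 14 obtained from serum alanine aminotransferase ($\mathrm{ALT}$) and serum alkaline phosphatase ($\mathrm{ALP}$) activity measurements in a standard complete metabolic panel (CMP).

$$\mathrm{RVALUE} = \frac{\mathrm{ALT}/\mathrm{ALT}_{ULN}}{\mathrm{ALP}/\mathrm{ALP}_{ULN}}$$

$\mathrm{ALT}_{ULN}$ and $\mathrm{ALP}_{ULN}$ are the upper limits of normal (ULN) for alanine aminotransferase and alkaline phosphatase, respectively. $\mathrm{ALT}_{ULN}$ was set to 29 IU/L for males and 22 IU/L for females 15 whereas $\mathrm{ALP}_{ULN}$ for the different racial groups was based on Gonzalez et al. 16 as described previously. 4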
The risk of drug‐induced liver injury (DILI) is an important safety consideration in drug development and utilization. We used the DILI index ($\mathrm{DILI}$), a measure of DILI risk based on the work of Diaz‐Robles et al., 17 who identified an algorithm based on aspartate aminotransferase ($\mathrm{AST}$), bilirubin ($\mathrm{TBIL}$), and the ratio of aspartate aminotransferase to alanine aminotransferase ($\mathrm{AST}/\mathrm{ALT}$) that improved on Hy's law. 18
$\mathrm{AST}_{ULN}$ and $\mathrm{TBIL}_{ULN}$ are the upper limits of normal for aspartate aminotransferase and bilirubin, respectively. Based on Sohn et al., 19 $\mathrm{AST}_{ULN}$ was set to 32 IU/L for men and 26 IU/L for women; $\mathrm{TBIL}_{ULN}$ was set to 2 mg/dL based on Perlstein et al. 20 A few measured values were zero; these were set to a small positive number (0.01), which was twofold lower than the next lowest measured value.
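Hepatic steatosis index (HSI) was calculated as follows 21 :

$$\mathrm{HSI} = 8 \times \frac{\mathrm{ALT}}{\mathrm{AST}} + \mathrm{BMI} + 2\ (\text{if female}) + 2\ (\text{if diabetes})$$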
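The hepatic R‐value and hepatic steatosis index can be sketched in Python under their commonly used definitions, which are assumed here: the R‐value is the ratio of ALT to ALP after each is divided by its upper limit of normal, and the HSI is 8 × ALT/AST + BMI, plus 2 for females and plus 2 for diabetes. The function names and the ALP upper limit in the example are hypothetical.

```python
def hepatic_r_value(alt: float, alp: float, alt_uln: float, alp_uln: float) -> float:
    """Ratio of ULN-normalized ALT to ULN-normalized ALP (all in IU/L)."""
    return (alt / alt_uln) / (alp / alp_uln)


def hepatic_steatosis_index(alt: float, ast: float, bmi: float,
                            female: bool, diabetes: bool) -> float:
    """Assumed standard HSI: 8 * ALT/AST + BMI (+2 if female, +2 if diabetes)."""
    score = 8.0 * alt / ast + bmi
    if female:
        score += 2.0
    if diabetes:
        score += 2.0
    return score


if __name__ == "__main__":
    # ALT ULN of 29 IU/L (males) is from the text; the ALP ULN of 100 IU/L
    # is a hypothetical placeholder, not a study value.
    print(hepatic_r_value(alt=58.0, alp=100.0, alt_uln=29.0, alp_uln=100.0))
    print(hepatic_steatosis_index(alt=30.0, ast=30.0, bmi=25.0,
                                  female=True, diabetes=False))
```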
Average urine flow rate ($\mathrm{URDFLOW}$, mL/min) was calculated using NHANES guidelines. 7
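The systemic immune‐inflammation index (SII) was calculated using 22 :

$$\mathrm{SII} = \frac{\text{platelet count} \times \text{neutrophil count}}{\text{lymphocyte count}}$$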
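A one‐line Python sketch of the SII follows, assuming its usual definition as platelet count × neutrophil count / lymphocyte count with all counts in 1000 cells/μL. The function name is illustrative.

```python
def systemic_immune_inflammation_index(platelets: float, neutrophils: float,
                                       lymphocytes: float) -> float:
    """SII from platelet, neutrophil, and lymphocyte counts (1000 cells/uL)."""
    return platelets * neutrophils / lymphocytes


if __name__ == "__main__":
    # Counts near the cohort means give an SII of a few hundred.
    print(systemic_immune_inflammation_index(240.0, 4.0, 2.0))
```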
Active hepatitis B virus (HBV) infection status was a binary variable that was set to 1 for anti‐HBV core antigen–antibody (anti‐HBc Ab, LBXHBV)‐positive subjects who tested positive for HBV surface antigen (HBsAg, LBDHBG) and to 2 for anti‐HBc Ab‐tested subjects not meeting this criterion. Active hepatitis C virus (HCV) infection status was a binary variable that was set to 1 for anti‐HCV screening antibody (anti‐HCV Ab)‐positive subjects who tested positive for HCV‐RNA (LBXHCR) and to 2 for anti‐HCV Ab‐tested subjects not meeting this criterion.
Renal status
Renal disease status was coded as two binary variables Kidney Disease and Dialysis. The Kidney Disease status variable was computed from the responses to NHANES variables KIQ022 (Ever told you had weak/failing kidneys?), and the Dialysis status variable from KIQ025 (Received dialysis in the past 12 months?).
Hepatic status
Liver disease status was coded as four binary variables: Active Liver Disease, Past Liver Disease, Active Hepatitis B, and Active Hepatitis C. The Past Liver Disease status variable was computed from the responses to NHANES variables MCQ160L (Ever told you had any liver condition?); the Active Liver Disease was obtained from MCQ170L (Do you still have a liver condition?). Active Hepatitis B and Active Hepatitis C variables were computed as previously described. 4 Individuals were categorized as having active hepatitis B infection‐positive status (HBV) if they were positive for anti‐HBV core antigen–antibody (anti‐HBc Ab, LBXHBV), and positive for HBV surface antigen (HBsAg, LBDHBG). Individuals were categorized as having active hepatitis C (HCV) infection‐positive status if they were positive for anti‐HCV screening antibody (anti‐HCV Ab) and positive for HCV‐RNA (LBXHCR).
Diabetes status
Diabetes disease status was coded with three binary variables: Diabetes, Prediabetes, and Insulin Use. The Diabetes variable was computed from the responses to NHANES variables DIQ010 (Doctor told you have diabetes?); Insulin Use from DIQ050 (Taking insulin now?), and Prediabetes from DIQ160 (Ever told you have prediabetes?). Borderline diabetes was categorized as prediabetes.
Cardiac status
Cardiac disease status was coded with four binary variables: Congestive Heart Failure (CHF), Coronary Heart Disease (CHD), Angina Pectoris, and Heart Attack, from the responses to NHANES variables MCQ160B (Ever told you had congestive heart failure?), MCQ160C (Ever told you had coronary heart disease?), MCQ160D (Ever told you had angina pectoris?), and MCQ160E (Ever told you had heart attack?), respectively.
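The recoding of questionnaire responses into binary disease status variables can be sketched as follows. This is a minimal illustration, not the study code: it assumes the usual NHANES response codes (1 = yes, 2 = no, 3 = borderline for DIQ010, 7 = refused, 9 = don't know), uses a 0/1 output coding chosen for convenience, and treats any other response as missing. The function names are hypothetical.

```python
from typing import Optional


def recode_yes_no(response: Optional[int]) -> Optional[int]:
    """Map an NHANES yes/no response to 1 (yes), 0 (no), or None (missing)."""
    if response == 1:
        return 1
    if response == 2:
        return 0
    return None  # refused, don't know, or not asked


def diabetes_status(diq010: Optional[int], diq160: Optional[int]) -> dict:
    """Diabetes and Prediabetes flags; borderline diabetes counts as prediabetes."""
    borderline = diq010 == 3
    return {
        "Diabetes": 0 if borderline else recode_yes_no(diq010),
        "Prediabetes": 1 if borderline else recode_yes_no(diq160),
    }


if __name__ == "__main__":
    print(recode_yes_no(1), recode_yes_no(2), recode_yes_no(9))
    print(diabetes_status(diq010=3, diq160=None))
    print(diabetes_status(diq010=2, diq160=1))
```

Records with a missing status value would then be dropped under the listwise deletion described earlier.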
Panel of physiological determinants of drug dosing
The following PDODD panel of variables was modeled: gender (RIAGENDR), race, as recoded from RIDRETH1, age at screening (years, RIDAGEYR), weight (kg, BMXWT), waist circumference normalized (WAISTMF), body surface area (m2, BSA), plasma volume (liters, PLASMAVOL), albumin (g/dL, LBXSAL), hepatic R‐value (RVALUE), drug‐induced liver injury index (DILI), systemic inflammation index (SII), hepatic steatosis index (HSI), urine flow rate (mL/min, URDFLOW), urine creatinine (mg/dL, URXUCR), urine albumin–creatinine ratio (mg/g, URDACT), glomerular filtration rate (mL/(min 1.73 m2), EGFR), red blood cell count (million cells/μL, LBXRBCSI), platelet count (1000 cells/μL, LBXPLTSI), lymphocyte count (1000 cells/μL, LBDLYMNO), and segmented neutrophil count (1000 cells/μL, LBDNENO).
Gender, race, and age were included as they are common patient characteristics easily obtained from the medical record. Weight, body surface area, waist circumference, and plasma volume were included as dosing‐relevant anthropometric measures. Albumin was included because it binds acidic and neutral drugs. Several measures of hepatic function (hepatic R‐value, DILI index, and hepatic steatosis index) and renal function (urine flow rate, urine creatinine, urine albumin–creatinine ratio, and glomerular filtration rate) were included given the importance of these organs in drug metabolism and elimination. Red cell, platelet, lymphocyte, and neutrophil counts, and the systemic inflammation index, which is derived from these counts, were included since lymphopenia, thrombocytopenia, and neutropenia are common drug side effects.
TVAE Architecture
Tabular VAE architecture
Figure 1 is a schematic of the VAE architecture. The VAE consists of two neural networks: an encoder that estimates a reduced‐dimensionality representation of the training data, and a decoder that generates data from this representation.
FIGURE 1.

Schematic representation of the variational autoencoder (VAE) method. A VAE consists of an encoder neural network and a decoder neural network. The encoder conducts nonlinear dimensionality reduction to obtain a latent space vector representation of the distributional parameters ($\mu$, the mean vector; $\Sigma$, the covariance matrix; and $\varepsilon$, the noise vector). The encoder and decoder are updated via a variational loss function that takes training data and decoder output.
Training of a VAE involves learning the optimal encoding‐decoding scheme for the data using backpropagation. Regularization is used to reduce the risk of overfitting, and the input is encoded as a distribution, which is typically a multivariate Gaussian prior distribution in the latent space.
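The encoding of each input as a distribution can be illustrated with the reparameterization step and the Kullback–Leibler regularization term of a standard VAE. The sketch below assumes a diagonal Gaussian posterior parameterized by a mean vector and a log‐variance vector, with a standard normal prior; these are the conventional choices and are not confirmed details of the TVAE implementation. All names are illustrative.

```python
import math
import random
from typing import List


def sample_latent(mu: List[float], log_var: List[float],
                  rng: random.Random) -> List[float]:
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_latent([0.0, 1.0], [0.0, 0.0], rng))
    print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # zero when posterior equals prior
    print(kl_to_standard_normal([1.0], [0.0]))
```

Writing the sample as a deterministic function of the encoder outputs and independent noise is what allows gradients to pass through the sampling step during backpropagation.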
Tabular VAE (TVAE), which adapts VAE to the tabular nature of PDODD data, was used for generative modeling. 23 We utilized TVAE, a class within the Synthetic Data Vault (SDV) Python library (https://docs.sdv.dev/sdv/) 24 for engineering the generative model.
TVAE extends conventional VAE 6 by incorporating a regularization term in the latent space to reduce overfitting. TVAE employs mode‐specific normalization techniques to tackle non‐Gaussian and multimodal distributions using a conditional generator. The Adam optimizer with a learning rate of 0.001 and the evidence lower bound (ELBO) loss function were used for training. Kullback–Leibler (KL) annealing 25 was used to increase the weight of the KL divergence term in the loss function from 0 to 1 in increments of 0.01 per epoch. Training was conducted for 1000 epochs for all experiments.
A grid search was conducted to evaluate TVAE hyperparameters over , , , and . The l2scale parameter contributes to regularizing the model to prevent overfitting, and the loss_factor parameter is a multiplier for the reconstruction error.
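The KL annealing schedule described above can be sketched in a few lines of Python. The sketch assumes that epochs are counted from 0 and that the weight is capped at 1 once it is reached; the function names are illustrative, and the loss is written as the negative ELBO to be minimized.

```python
def kl_weight(epoch: int, increment: float = 0.01) -> float:
    """Linear KL annealing: 0 at epoch 0, rising by `increment` per epoch, capped at 1."""
    return min(1.0, epoch * increment)


def annealed_loss(reconstruction_loss: float, kl_divergence: float, epoch: int) -> float:
    """Negative ELBO with an annealed weight on the KL term."""
    return reconstruction_loss + kl_weight(epoch) * kl_divergence


if __name__ == "__main__":
    for epoch in (0, 50, 100, 1000):
        print(epoch, kl_weight(epoch))
```

With an increment of 0.01, the KL term reaches full weight after 100 epochs, so the remaining epochs of a 1000‐epoch run optimize the unweighted ELBO.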
Based on the grid search, the following hyperparameter values were chosen: , , , , and .
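A hyperparameter grid search of this kind can be sketched generically in Python. The parameter names follow the SDV TVAE constructor arguments, but the grid values shown are hypothetical placeholders rather than the values evaluated in this study, and `score_fn` is a hypothetical callable standing in for training a model with a configuration and returning a quality score, for example a mean similarity measure on held‐out data.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List

# Hypothetical grid; these are NOT the values used in the study.
GRID: Dict[str, List] = {
    "embedding_dim": [64, 128],
    "l2scale": [1e-5, 1e-4],
    "loss_factor": [1, 2],
    "batch_size": [250, 500],
}


def iter_configs(grid: Dict[str, List]) -> Iterator[Dict]:
    """Yield every combination of hyperparameter values in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid: Dict[str, List], score_fn: Callable[[Dict], float]) -> Dict:
    """Return the configuration with the highest score."""
    return max(iter_configs(grid), key=score_fn)


if __name__ == "__main__":
    # Toy score used only to make the sketch runnable.
    toy_score = lambda cfg: -abs(cfg["loss_factor"] - 2) - cfg["l2scale"]
    print(grid_search(GRID, toy_score))
```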
Data analysis
The R statistical program was used 26 for data import and variable computations in a Jupyter Notebook environment.
Data transformations
The continuous biomarker data were log‐transformed and min–max scaled to the range [−1, 1].
The pooled data were randomly split into training (80%) and test (20%) data sets. Listwise exclusion was employed.
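The transformation and splitting steps can be sketched in plain Python. The target range of −1 to 1 follows the axis description in the Figure 2 caption; the order of scaling and splitting, the random seed, and the function names are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def log_minmax_scale(values: Sequence[float], lo: float = -1.0,
                     hi: float = 1.0) -> List[float]:
    """Log-transform positive values, then rescale linearly to [lo, hi]."""
    logs = [math.log(v) for v in values]
    mn, mx = min(logs), max(logs)
    if mx == mn:
        raise ValueError("cannot scale a constant variable")
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in logs]


def train_test_split(rows: Sequence, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[list, list]:
    """Randomly partition rows into training and test subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


if __name__ == "__main__":
    print(log_minmax_scale([1.0, 10.0, 100.0]))
    train, test = train_test_split(list(range(100)))
    print(len(train), len(test))
```

Because min–max scaling follows the log transform, the choice of logarithm base does not affect the scaled values. The log transform also requires strictly positive inputs, which is consistent with replacing zero measurements by a small positive number.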
Dissimilarity measures for univariate distributions
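The dissimilarities in marginal distributions of categorical and binary variables in the real ($R$) and synthetic ($S$) datasets were compared using the total variation complement ($\mathrm{TVC}$) 27 :

$$\mathrm{TVC}(R, S) = 1 - \frac{1}{2} \sum_{\omega \in \Omega} \left| R_{\omega} - S_{\omega} \right|$$

In the equation, $\omega$ represents the elements in the space of events $\Omega$, and $R_{\omega}$ and $S_{\omega}$ are the relative frequencies of $\omega$ in the real and synthetic datasets.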
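The total variation complement can be computed directly from category frequencies. The following Python sketch assumes the usual definition, one minus half the sum of absolute differences in relative frequencies over all categories; the function name is illustrative. The example uses the gender counts for the test data and the first TVAE instance reported in Table 3.

```python
from collections import Counter
from typing import Hashable, Sequence


def total_variation_complement(real: Sequence[Hashable],
                               synthetic: Sequence[Hashable]) -> float:
    """1 minus the total variation distance between two categorical samples."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    distance = 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


if __name__ == "__main__":
    # Gender counts from Table 3: test data vs. the first TVAE instance.
    test_gender = ["F"] * 1844 + ["M"] * 1652
    tvae_gender = ["F"] * 1811 + ["M"] * 1685
    print(round(total_variation_complement(test_gender, tvae_gender), 3))
```

For a binary variable the measure reduces to one minus the absolute difference in the positive‐class proportion, so a rare class that is badly under‐represented can still yield a value close to 1.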
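Likewise, the dissimilarities in distributions of continuous variables in the real ($R$) and synthetic ($S$) datasets were compared using the Kolmogorov–Smirnov complement ($\mathrm{KSC}$), 28 which is defined in terms of the Kolmogorov–Smirnov statistic ($D_{KS}$) as follows:

$$\mathrm{KSC}(R, S) = 1 - D_{KS}, \qquad D_{KS} = \sup_{x} \left| F_{R}(x) - F_{S}(x) \right|$$

where $F_{R}$ and $F_{S}$ are the empirical cumulative distribution functions of the variable in the real and synthetic datasets.
The values of both complement measures lie in the range [0, 1]; a value near 0 represents high dissimilarity and a value near 1 represents high similarity between the two data sets.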
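The Kolmogorov–Smirnov complement can be sketched with the standard library alone. The sketch assumes the two‐sample Kolmogorov–Smirnov statistic, that is, the largest absolute difference between the two empirical cumulative distribution functions, which is attained at one of the observed values. The function name is illustrative.

```python
from bisect import bisect_right
from typing import Sequence


def ks_complement(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """1 minus the two-sample Kolmogorov-Smirnov statistic."""
    r, s = sorted(real), sorted(synthetic)
    # Evaluate both empirical CDFs at every observed value.
    statistic = max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in r + s
    )
    return 1.0 - statistic


if __name__ == "__main__":
    print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))   # identical samples
    print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))   # partially overlapping samples
    print(ks_complement([1, 2, 3], [4, 5, 6]))         # disjoint samples
```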
Dissimilarity measures for multivariate distributions
The nonparametric maximum mean discrepancy (MMD) test was used to determine whether the test and TVAE‐generated samples were drawn from different multivariate distributions. The MMD test was conducted using the kernlab R package 29 with automatic sigma estimation for the Gaussian radial basis function kernel.
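To illustrate the quantity underlying the MMD test, the following Python sketch computes the biased estimate of the squared MMD between two samples with a Gaussian radial basis function kernel. It is not a reimplementation of the kernlab test: it omits the Rademacher bound and the automatic sigma estimation, and the kernel width `gamma` and the function names are illustrative assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    sample_a = [(0.0, 0.0), (1.0, 0.0)]
    sample_b = [(5.0, 5.0), (6.0, 5.0)]
    print(mmd_squared(sample_a, sample_a))  # zero for identical samples
    print(mmd_squared(sample_a, sample_b))  # large for well-separated samples
```

Values near zero indicate that the two samples are difficult to distinguish in the kernel feature space. The double loop is quadratic in sample size, so a vectorized implementation would be preferable for data sets of the size used here.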
Visualization of bivariate distributions
The differences in Pearson correlation values for continuous variable pairs in the test and TVAE‐generated data were compared using correlation heatmaps. The age dependence of the PDODD panel's continuous variables was assessed by examining the dispersion of the test and TVAE‐generated data and the proximity of the loess fit lines.
Visualization of multivariate distributions
The t‐distributed stochastic neighbor embedding (t‐SNE), uniform manifold approximation and projection (UMAP), and principal components analysis (PCA) projections for multivariate visualization were obtained with the Rtsne and umap packages and the prcomp function in R. 30 , 31 , 32
The results were visualized using graphing routines in base R or the ggplot2 R package. 33
EVALUATING THE PERFORMANCE OF VARIATIONAL AUTOENCODERS
Characteristics of study subjects
The total sample size was n = 17,480 (51.9% female) with a mean age (SD) of 49.2 (17.3) years. Table 1 summarizes the demographic characteristics, biomarker levels, and frequencies of the renal, hepatic, diabetes, and cardiac disease status variables in the NHANES dataset.
TABLE 1.
Summary statistics of the demographic characteristics and drug disposition biomarkers.
| Variable | Count (Percent) |
|---|---|
| Gender | |
| Female | 9064 (51.9) |
| Male | 8416 (48.1) |
| Race/Ethnicity | – |
| Non‐Hispanic Black | 3758 (21.5) |
| Mexican American and Other Hispanic | 4288 (24.5) |
| Other Race and Multiracial | 2812 (16.1) |
| Non‐Hispanic White | 6622 (37.9) |

| Disease status | Negative, count (%) | Positive, count (%) |
|---|---|---|
| Dialysis | 17,457 (99.9) | 23 (0.132) |
| Kidney disease | 16,925 (96.8) | 555 (3.18) |
| Active liver disease | 17,053 (97.6) | 427 (2.44) |
| Past liver disease | 17,200 (98.4) | 280 (1.60) |
| Hepatitis C | 17,291 (98.9) | 189 (1.08) |
| Hepatitis B | 17,367 (99.3) | 113 (0.646) |
| Diabetes | 15,194 (86.9) | 2286 (13.1) |
| Prediabetes | 13,455 (77.0) | 4025 (23.0) |
| Insulin use | 16,845 (96.4) | 635 (3.63) |
| Heart attack | 16,819 (96.2) | 661 (3.78) |
| Congestive heart failure | 16,973 (97.1) | 507 (2.90) |
| Coronary heart disease | 16,821 (96.2) | 659 (3.77) |
| Angina pectoris | 17,087 (97.8) | 393 (2.25) |

| Variable | Description | Mean (SD) | Median (IQR) | Min–Max |
|---|---|---|---|---|
| RIDAGEYR | Age, years | 49.2 (17.3) | 49.0 (34.0–63.0) | 20.0–80.0 |
| BMXWT | Body weight, kg | 81.7 (21.5) | 78.6 (66.5–93.2) | 29.1–216 |
| WAISTMF | Normalized waist circumference | 1.06 (0.186) | 1.03 (0.924–1.17) | 0.544–1.95 |
| BSA | Body surface area, m2 | 1.89 (0.255) | 1.88 (1.71–2.06) | 1.12–3.10 |
| PLASMAVOL | Plasma volume, L | 4.28 (0.846) | 4.18 (3.71–4.70) | 2.07–13.0 |
| LBXSAL | Albumin, g/dL | 4.23 (0.349) | 4.20 (4.00–4.50) | 2.10–5.60 |
| RVALUE | Hepatic R‐value | 1.61 (1.18) | 1.35 (0.988–1.88) | 0.0674–43.3 |
| DILI | Drug‐induced liver injury index, ×1000 | 1.74 (5.51) | 1.32 (0.846–2.00) | 0.021–651 |
| SII | Systemic inflammation index | 511 (327) | 442 (313–625) | 1.53–8464 |
| HSI | Hepatic steatosis index | 38.3 (8.06) | 37.2 (32.5–42.9) | 18.3–90.0 |
| URDFLOW | Urine flow, mL/min | 1.12 (1.43) | 0.806 (0.500–1.36) | 0.006–76.7 |
| URXUCR | Urine creatinine, mg/dL | 124 (81.8) | 109 (62.0–168) | 3.54–800 |
| URDACT | Urine albumin to creatinine ratio, mg/g | 44.0 (341) | 7.27 (4.72–13.8) | 0.210–21,152 |
| EGFR | Glomerular filtration rate, mL/(min 1.73 m2) | 95.1 (22.2) | 97.8 (81.2–112) | 4.90–164 |
| LBXRBCSI | Red blood cell count, 10⁶ cells/μL | 4.67 (0.497) | 4.66 (4.34–4.99) | 1.67–8.30 |
| LBXPLTSI | Platelet count, 10³ cells/μL | 238 (61.4) | 232 (197–273) | 8.00–818 |
| LBDLYMNO | Lymphocyte count, 10³ cells/μL | 2.20 (2.99) | 2.10 (1.70–2.60) | 0.300–359 |
| LBDNENO | Segmented neutrophil count, 10³ cells/μL | 4.23 (1.72) | 4.00 (3.10–5.10) | 0.100–35.2 |
There were no missing demographic data (age, sex, race/ethnicity). The percentage of missing data in the categorical clinical status variables ranged from 0.15% for diabetes to 5.55% for liver disease. The percentage of missing data for continuous variables ranged from 5.53% for body weight to 13.3% for average urine flow. The missing data were handled using listwise deletion.
Validation of renal, hepatic, diabetes, and cardiac disease biomarkers
We first assessed whether the renal (kidney disease, and dialysis), hepatic (current liver disease, active hepatitis B, active hepatitis C), diabetes (prediabetes, diabetes, and insulin use) and cardiac (coronary heart disease/angina pectoris (CHD/AP), congestive heart failure (CHF), and heart attack) disease index status variables obtained from the NHANES medical questionnaire showed the expected associations with known diagnostic and laboratory biomarkers for renal, hepatic, diabetes and cardiac diseases.
The results in Figure S1 show that the groups with kidney disease and dialysis had progressive deterioration in estimated glomerular filtration rate (EGFR), urine flow (URDFLOW), creatinine clearance (CRCL), urine albumin to creatinine ratio (URDACT), red blood cell count (LBXRBCSI), and plasma volume (PLASMAVOL). Decreases in red cell count (LBXRBCSI) occur in kidney diseases because of the reduced capacity of diseased kidneys to produce erythropoietin. Figure S2 summarizes the range of hepatic enzymes obtained in a complete metabolic panel and several indices of hepatocellular injury (RVALUE), fatty liver disease (e.g., fatty liver index or FLI, hepatic steatosis index or HSI), and drug‐induced liver injury (DILI). The liver disease groups showed characteristic changes in these biomarkers relative to the group without liver disease. Fasting insulin, fasting glucose, HOMA‐insulin resistance index (HOMA‐IR), and glycohemoglobin were progressively worse in the prediabetes, diabetes, and diabetes with insulin groups relative to controls. Decreases in the HOMA‐beta cell index and increases in the urinary albumin to creatinine ratio were more pronounced in the diabetes and diabetes with insulin groups (Figure S3). A wide range of biomarkers associated with heart disease risk are summarized in Figure S4.
These results indicate that the disease status groups have the key biomarker patterns and are representative of renal, hepatic, diabetes, and heart disease patient groups in the general population.
TVAE‐generated univariate PDODD biomarker distributions
Table 2 compares the median (IQR) of continuous variables in the PDODD panel that were generated by TVAE to the test data. Five independent instances of TVAE‐generated data are shown. The median and IQR of the TVAE data satisfactorily approximate the median and IQR of the test data for all the biomarkers. The Kolmogorov–Smirnov complement (KSC) values for 16 of the 18 biomarkers were greater than 0.90; serum albumin (LBXSAL) and urine albumin–creatinine ratio (URDACT) had modestly lower values of 0.848 and 0.899, respectively.
TABLE 2.
Summary of the continuous variables produced by TVAE compared with test data.
| Variable | Test data | TVAE‐1 | TVAE‐2 | TVAE‐3 | TVAE‐4 | TVAE‐5 | TVAE KSC |
|---|---|---|---|---|---|---|---|
| | Median (IQR) | Median (IQR) | Median (IQR) | Median (IQR) | Median (IQR) | Median (IQR) | Mean (Min–Max) |
| RIDAGEYR | 50.0 (35.0–63.0) | 49.7 (34.2–62.1) | 50.4 (34.6–62.4) | 49.9 (34.6–62.6) | 49.5 (34.7–62.1) | 49.9 (34.5–62.5) | 0.911 (0.908–0.913) |
| BMXWT | 78.7 (66.3–93.5) | 77.5 (67.1–91.0) | 78.4 (67.6–91.9) | 78.1 (67.2–92.1) | 77.6 (67.1–91.1) | 78.4 (66.9–92.3) | 0.968 (0.961–0.979) |
| WAISTMF | 1.04 (0.925–1.17) | 1.03 (0.919–1.16) | 1.04 (0.922–1.18) | 1.03 (0.917–1.17) | 1.03 (0.919–1.17) | 1.04 (0.917–1.17) | 0.969 (0.965–0.975) |
| BSA | 1.87 (1.71–2.05) | 1.86 (1.71–2.03) | 1.87 (1.72–2.03) | 1.87 (1.72–2.03) | 1.86 (1.71–2.01) | 1.87 (1.71–2.04) | 0.958 (0.947–0.965) |
| PLASMAVOL | 4.18 (3.72–4.71) | 4.20 (3.71–4.73) | 4.21 (3.73–4.77) | 4.16 (3.70–4.71) | 4.19 (3.72–4.74) | 4.24 (3.74–4.78) | 0.975 (0.958–0.982) |
| LBXSAL | 4.20 (4.00–4.50) | 4.30 (4.00–4.50) | 4.30 (4.00–4.49) | 4.30 (4.00–4.51) | 4.30 (4.00–4.50) | 4.30 (4.00–4.50) | 0.848 (0.838–0.865) |
| RVALUE | 1.35 (1.00–1.88) | 1.33 (1.01–1.85) | 1.31 (0.989–1.82) | 1.33 (0.991–1.85) | 1.32 (1.00–1.84) | 1.31 (0.98–1.81) | 0.969 (0.962–0.976) |
| DILI × 1000 | 1.32 (0.849–2.03) | 1.32 (0.895–1.82) | 1.30 (0.885–1.79) | 1.33 (0.881–1.81) | 1.34 (0.886–1.84) | 1.35 (0.897–1.83) | 0.923 (0.921–0.926) |
| SII | 443 (315–626) | 445 (328–629) | 448 (326–627) | 451 (329–625) | 448 (328–627) | 445 (327–620) | 0.936 (0.932–0.940) |
| HSI | 37.3 (32.6–42.9) | 37.3 (32.8–42.6) | 37.6 (33.0–43.0) | 37.3 (32.8–43.0) | 37.3 (32.8–42.7) | 37.5 (32.8–43.0) | 0.959 (0.955–0.963) |
| URDFLOW | 0.821 (0.515–1.36) | 0.812 (0.526–1.34) | 0.799 (0.517–1.34) | 0.806 (0.526–1.34) | 0.807 (0.511–1.32) | 0.798 (0.513–1.34) | 0.963 (0.956–0.970) |
| URXUCR | 104 (60.0–164) | 111 (65.5–165) | 113 (65.5–169) | 115 (65.6–170) | 113 (64.1–169) | 110 (64.6–169) | 0.957 (0.951–0.965) |
| URDACT | 7.23 (4.68–13.7) | 6.03 (4.40–10.5) | 6.19 (4.46–10.7) | 6.20 (4.46–10.3) | 6.28 (4.51–11.1) | 6.23 (4.50–10.7) | 0.899 (0.884–0.911) |
| EGFR | 97.8 (80.9–111) | 100 (82.9–113) | 100 (83.2–112) | 99.3 (82.9–113) | 100 (84.2–113) | 100 (82.9–112) | 0.944 (0.932–0.955) |
| LBXRBCSI | 4.65 (4.34–4.98) | 4.62 (4.30–4.99) | 4.63 (4.28–4.97) | 4.65 (4.31–5.01) | 4.62 (4.29–4.98) | 4.62 (4.28–4.96) | 0.959 (0.948–0.977) |
| LBXPLTSI | 233 (198–272) | 231 (198–278) | 231 (198–280) | 231 (197–280) | 233 (197–281) | 232 (199–280) | 0.953 (0.949–0.960) |
| LBDLYMNO | 2.10 (1.70–2.60) | 2.07 (1.73–2.50) | 2.06 (1.73–2.52) | 2.06 (1.74–2.51) | 2.07 (1.72–2.53) | 2.05 (1.73–2.52) | 0.921 (0.918–0.925) |
| LBDNENO | 4.00 (3.10–5.10) | 3.95 (3.08–5.09) | 3.98 (3.08–5.05) | 3.99 (3.10–5.10) | 3.97 (3.10–5.08) | 3.95 (3.06–5.04) | 0.969 (0.967–0.974) |
Abbreviations: BMXWT, Weight (kg); BSA, Body surface area (m2); DILI, Drug‐induced liver injury index; EGFR, Glomerular filtration rate mL/(min 1.73 m2); HSI, Hepatic steatosis index; KSC, Kolmogorov–Smirnov complement; LBDLYMNO, Lymphocyte count, (1000 cells/μL); LBDNENO, Neutrophil count (1000 cells/μL); LBXPLTSI, Platelet count (1000 cells/μL); LBXRBCSI, Red blood cell count (million cells/μL); LBXSAL, Albumin, serum (g/dL); PLASMAVOL, Plasma volume; RIDAGEYR, Age in years at screening; RVALUE, Hepatic R‐value; SII, Systemic inflammation index; URDACT, Albumin–creatinine ratio (mg/g); URDFLOW, Urine flow rate average (mL/min); URXUCR, Urine creatinine, mg/dL; WAISTMF, Waist circumference, normalized.
To obtain visual comparisons of distribution shape, we used probability density histograms of TVAE‐generated data (teal bars) compared with test data (salmon bars), as shown in Figure 2. The darker regions of the histogram bars correspond to the regions of overlap. The extensive overlap indicates a satisfactory approximation of the univariate distributions.
FIGURE 2.

Probability density histograms of generated data from the TVAE (teal bars) compared with test data (salmon bars). The darker regions of the histogram bars correspond to the regions of overlap. The log‐transformed and min–max scaled values of the continuous biomarkers were plotted and are age (RIDAGEYR, a), body weight (BMXWT, b), waist circumference normalized (WAISTMF, c), body surface area (BSA, d), plasma volume (PLASMAVOL, e), serum albumin (LBXSAL, f), hepatic R‐value (RVALUE, g), DILI value (DILI, h), systemic inflammation index (SII, i), hepatic steatosis index (HSI, j), average urine flow (URDFLOW, k), urine creatinine (URXUCR, l), urine albumin to creatinine ratio (URDACT, m), glomerular filtration rate (EGFR, n), red cell number (LBXRBCSI, o), platelet number (LBXPLTSI, p), lymphocyte count (LBDLYMNO, q), and neutrophil count (LBDNENO, r). The x‐axes on all graphs are biomarker levels that are log‐transformed and scaled to lie between −1 and 1.
Table 3 compares the frequency distributions of categorical variables from five independent instances of TVAE to the test data. The total variation complement (TVC) values were high for all the categorical variables. However, the TVAE approximations across the variables were mixed: some variables, for example, sex, race/ethnicity, and diabetes, were well approximated, but other variables showed deviations from the frequencies in the test data. Unsurprisingly, the TVAE approximations to variables with low minor class frequency, for example, dialysis, were poor.
TABLE 3.
Summary of the categorical data produced by TVAE compared with test data.
| Variable | Status | Test data | TVAE1 | TVAE2 | TVAE3 | TVAE4 | TVAE5 | TVAE TVC |
|---|---|---|---|---|---|---|---|---|
| | | Count (%) | Count (%) | Count (%) | Count (%) | Count (%) | Count (%) | Mean (Min–Max) |
| Gender | Female | 1844 (52.7) | 1811 (51.8) | 1809 (51.7) | 1786 (51.1) | 1813 (51.9) | 1815 (51.9) | 0.989 (0.983–0.992) |
| | Male | 1652 (47.3) | 1685 (48.2) | 1687 (48.3) | 1710 (48.9) | 1683 (48.1) | 1681 (48.1) | |
| Race/Ethnicity | Non‐Hispanic Black | 735 (21.0) | 763 (21.8) | 839 (24.0) | 761 (21.8) | 779 (22.3) | 798 (22.8) | 0.971 (0.959–0.977) |
| | Mexican American & Other Hispanic | 853 (24.4) | 899 (25.7) | 874 (25.0) | 884 (25.3) | 885 (25.3) | 866 (24.8) | |
| | Other Race and Multiracial | 568 (16.3) | 472 (13.5) | 423 (12.1) | 488 (13.9) | 454 (13.0) | 489 (14) | |
| | Non‐Hispanic White | 1340 (38.3) | 1362 (39.0) | 1360 (38.9) | 1363 (39.0) | 1378 (39.4) | 1343 (38.4) | |
| Dialysis | No | 3492 (99.9) | 3495 (99.9) | 3494 (99.9) | 3495 (99.9) | 3494 (99.9) | 3495 (99.9) | 0.999 (0.999–0.999) |
| | Yes | 4 (0.114) | 1 (0.029) | 2 (0.057) | 1 (0.029) | 2 (0.057) | 1 (0.029) | |
| Kidney disease | No | 3394 (97.1) | 3462 (99.0) | 3459 (98.9) | 3464 (99.1) | 3460 (98.9) | 3456 (98.9) | 0.981 (0.980–0.982) |
| | Yes | 102 (2.92) | 34 (0.973) | 37 (1.06) | 32 (0.915) | 36 (1.03) | 40 (1.14) | |
| Active liver disease | No | 3428 (98.1) | 3484 (99.7) | 3479 (99.5) | 3473 (99.3) | 3473 (99.3) | 3480 (99.5) | 0.986 (0.984–0.987) |
| | Yes | 68 (1.95) | 12 (0.343) | 17 (0.486) | 23 (0.658) | 23 (0.658) | 16 (0.458) | |
| Past liver disease | No | 3448 (98.6) | 3491 (99.8) | 3492 (99.9) | 3489 (99.8) | 3494 (99.9) | 3490 (99.8) | 0.988 (0.987–0.988) |
| | Yes | 48 (1.37) | 5 (0.143) | 4 (0.114) | 7 (0.2) | 2 (0.057) | 6 (0.172) | |
| Hepatitis C | No | 3465 (99.1) | 3478 (99.5) | 3492 (99.9) | 3484 (99.7) | 3487 (99.7) | 3486 (99.7) | 0.994 (0.992–0.996) |
| | Yes | 31 (0.887) | 18 (0.515) | 4 (0.114) | 12 (0.343) | 9 (0.257) | 10 (0.286) | |
| Hepatitis B | No | 3479 (99.5) | 3493 (99.9) | 3492 (99.9) | 3493 (99.9) | 3494 (99.9) | 3495 (99.9) | 0.996 (0.995–0.996) |
| | Yes | 17 (0.486) | 3 (0.086) | 4 (0.114) | 3 (0.086) | 2 (0.057) | 1 (0.029) | |
| Diabetes | No | 3030 (86.7) | 3008 (86.0) | 2975 (85.1) | 2986 (85.4) | 2943 (84.2) | 2974 (85.1) | 0.985 (0.975–0.994) |
| | Yes | 466 (13.3) | 488 (14.0) | 521 (14.9) | 510 (14.6) | 553 (15.8) | 522 (14.9) | |
| Prediabetes | No | 2674 (76.5) | 2537 (72.6) | 2484 (71.1) | 2509 (71.8) | 2458 (70.3) | 2521 (72.1) | 0.951 (0.938–0.961) |
| | Yes | 822 (23.5) | 959 (27.4) | 1012 (28.9) | 987 (28.2) | 1038 (29.7) | 975 (27.9) | |
| Insulin use | No | 3372 (96.5) | 3399 (97.2) | 3367 (96.3) | 3380 (96.7) | 3342 (95.6) | 3376 (96.6) | 0.996 (0.991–0.999) |
| | Yes | 124 (3.55) | 97 (2.78) | 129 (3.69) | 116 (3.32) | 154 (4.41) | 120 (3.43) | |
| Heart attack | No | 3353 (95.9) | 3398 (97.2) | 3399 (97.2) | 3408 (97.5) | 3385 (96.8) | 3391 (97.0) | 0.988 (0.984–0.991) |
| | Yes | 143 (4.09) | 98 (2.80) | 97 (2.78) | 88 (2.52) | 111 (3.18) | 105 (3.00) | |
| Congestive heart failure | No | 3386 (96.9) | 3445 (98.5) | 3455 (98.8) | 3454 (98.8) | 3450 (98.7) | 3445 (98.5) | 0.982 (0.98–0.983) |
| | Yes | 110 (3.15) | 51 (1.46) | 41 (1.17) | 42 (1.20) | 46 (1.32) | 51 (1.46) | |
| Coronary heart disease | No | 3356 (96) | 3399 (97.2) | 3386 (96.9) | 3384 (96.8) | 3369 (96.4) | 3363 (96.2) | 0.993 (0.988–0.998) |
| | Yes | 140 (4.01) | 97 (2.78) | 110 (3.15) | 112 (3.20) | 127 (3.63) | 133 (3.80) | |
| Angina pectoris | No | 3407 (97.5) | 3464 (99.1) | 3453 (98.8) | 3457 (98.9) | 3449 (98.7) | 3454 (98.8) | 0.986 (0.984–0.988) |
| | Yes | 89 (2.55) | 32 (0.915) | 43 (1.23) | 39 (1.12) | 47 (1.34) | 42 (1.20) | |
Abbreviation: TVC, Total variation complement.
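To make the TVC column easier to interpret, the total variation complement for a categorical variable can be computed as one minus half the summed absolute differences between the category frequencies of two datasets. The sketch below is illustrative only. The helper name `tv_complement` is hypothetical and is not the SDMetrics API, and the example assumes that the first two count columns of the kidney disease row hold the test data and one TVAE‐generated sample.

```python
def tv_complement(real_counts, synthetic_counts):
    """Return 1 minus the total variation distance between two categorical distributions.

    Both arguments map a category label to its count.
    (Hypothetical helper for illustration, not the SDMetrics API.)
    """
    real_total = sum(real_counts.values())
    synthetic_total = sum(synthetic_counts.values())
    categories = set(real_counts) | set(synthetic_counts)
    tv_distance = 0.5 * sum(
        abs(real_counts.get(c, 0) / real_total - synthetic_counts.get(c, 0) / synthetic_total)
        for c in categories
    )
    return 1.0 - tv_distance


# Kidney disease counts from the first two count columns of the table
# (assumed here to be the test data and one TVAE-generated sample).
test_counts = {"No": 3394, "Yes": 102}
generated_counts = {"No": 3462, "Yes": 34}
tvc = tv_complement(test_counts, generated_counts)
print(round(tvc, 3))
```

A value close to 1 means the category frequencies nearly coincide. Because the majority class dominates the frequencies, a high TVC can coexist with a large relative error in a rare category.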
TVAE‐generated high‐dimensional PDODD biomarker panel joint distributions
We did not obtain evidence for statistical differences between the TVAE‐generated and test dataset distributions in the MMD test. The first‐order MMD statistic value of 0.0665 indicates that the means of the TVAE‐generated data and test data are close, whereas the low value of the third‐order MMD statistic of 0.00407 indicates that the third‐order interactions of the two distributions are also similar. The low Rademacher bound value of 0.0924 is indicative of good generalizability.
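The idea behind an MMD comparison can be illustrated with a small sketch. It computes the biased estimate of the squared MMD with a Gaussian kernel on simulated data. This is not necessarily the statistic, kernel, or bandwidth behind the values reported above; the function names, the bandwidth `sigma`, and the simulated samples are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mean_kernel(first, second, sigma):
    """Average kernel value over all pairs drawn from two samples."""
    total = sum(gaussian_kernel(x, y, sigma) for x in first for y in second)
    return total / (len(first) * len(second))


def mmd_squared(sample_x, sample_y, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (
        mean_kernel(sample_x, sample_x, sigma)
        + mean_kernel(sample_y, sample_y, sigma)
        - 2.0 * mean_kernel(sample_x, sample_y, sigma)
    )


# Simulated two-dimensional samples (illustration only, not study data).
rng = random.Random(0)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
similar = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
shifted = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(100)]

mmd_similar = mmd_squared(reference, similar)
mmd_shifted = mmd_squared(reference, shifted)
print(mmd_similar < mmd_shifted)
```

Samples drawn from the same distribution give a value near zero, whereas a shift in the mean inflates the estimate, which is the behavior a two‐sample test built on this quantity relies on.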
The high‐dimensional PDODD joint distribution is challenging to visualize. To graphically assess whether the joint distribution of TVAE‐generated data vectors satisfactorily approximated the test data joint distribution, we used three projection methods: t‐distributed stochastic neighbor embedding (t‐SNE, Figure 3a), uniform manifold approximation and projection (UMAP, Figure 3b), and principal component analysis (PCA, Figure 3c) to obtain two‐dimensional projections of the data that enable visualization of the joint distributions. In all three visualizations, the distribution of the TVAE‐generated data overlapped extensively with the distribution of the test data vectors. This indicates that the TVAE provides a satisfactory approximation of the joint distribution of the PDODD panel.
FIGURE 3.

The t‐distributed stochastic neighbor embedding (t‐SNE, a), uniform manifold approximation and projection (UMAP, b), and principal component analysis (PCA, c) two‐dimensional projections of the data. The test data results are shown in salmon circles and the TVAE‐generated results are in teal circles. The x‐axes (t‐SNE X and UMAP X) and y‐axes (t‐SNE Y and UMAP Y) correspond to the t‐SNE and UMAP projections into two dimensions of the 18‐dimensional input of biomarker levels, which were log‐transformed and scaled to lie between −1 and 1. PC 1 and PC 2 on the x‐axis and y‐axis of c correspond to the first and second principal components, respectively.
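The preprocessing described in the caption, a log transformation followed by scaling to the interval from −1 to 1, can be sketched as follows. This is a minimal illustration that assumes a natural logarithm and min–max scaling applied separately to each biomarker; the function name and the example values are hypothetical and are not taken from the study.

```python
import math


def log_scale_to_unit_range(values):
    """Log-transform positive values, then min-max scale them to [-1, 1]."""
    logged = [math.log(v) for v in values]
    low, high = min(logged), max(logged)
    if high == low:
        # A constant column carries no spread; map it to the midpoint.
        return [0.0 for _ in logged]
    return [2.0 * (x - low) / (high - low) - 1.0 for x in logged]


# Hypothetical biomarker values (for illustration only).
example_values = [45.0, 60.0, 90.0, 120.0]
scaled = log_scale_to_unit_range(example_values)
print([round(s, 3) for s in scaled])
```

Each scaled biomarker would then form one coordinate of the vectors passed to a projection method, so that no single variable dominates the distances because of its units.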
TVAE‐generated bivariate PDODD biomarker distributions
Bivariate Pearson correlation coefficients between the pairs of continuous variables in the test data (Figure 4a) were compared with those in TVAE‐generated data (Figure 4b) using correlation heatmaps. A visual comparison shows that the patterns of bivariate correlations present in the test data are also present in the TVAE‐generated data. This is confirmed by the difference heatmap in Figure 4c.
FIGURE 4.

Bivariate Pearson correlations between the continuous variables in the test data (a), TVAE‐generated data (b), and the difference in correlation coefficients (c). The color scale is shown. AGE, Age; ALB, Serum albumin; BSA, Body surface area; DILI, Drug‐induced liver injury; EGFR, Estimated glomerular filtration rate; HSI, Hepatic steatosis index; LYM, Lymphocyte count; NEU, Neutrophil count; PLT, Platelet count; PVOL, Plasma volume; RBC, Red blood cell count; SII, Systemic inflammation index; UCR, Urine creatinine; URDACT, Urine albumin–creatinine ratio; URINE, Urine flow rate; WAIST, Normalized waist circumference; WT, Body weight.
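The comparison summarized in the three heatmaps amounts to computing two correlation matrices and subtracting them element by element. The sketch below shows that calculation on toy data. The helper names are hypothetical, the numbers are invented, and the sign convention (generated minus test) is an assumption because the caption does not state the direction of the difference.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def correlation_difference(test_columns, generated_columns):
    """Element-wise difference (generated minus test) of the two correlation matrices."""
    test_corr = correlation_matrix(test_columns)
    generated_corr = correlation_matrix(generated_columns)
    return {pair: generated_corr[pair] - test_corr[pair] for pair in test_corr}


# Invented toy data; the variable names reuse abbreviations from the caption.
test_data = {"AGE": [30, 45, 60, 75], "EGFR": [110, 95, 80, 60], "WT": [70, 82, 78, 74]}
generated_data = {"AGE": [32, 44, 61, 73], "EGFR": [108, 97, 78, 63], "WT": [72, 80, 79, 73]}
difference = correlation_difference(test_data, generated_data)
print(round(difference[("AGE", "EGFR")], 3))
```

Differences near zero across all pairs indicate that the generated data preserve the pairwise linear dependence structure of the test data.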
In the next step of assessing bivariate marginals of the PDODD joint distribution, we compared the age dependence of the biomarkers in the TVAE‐generated data (teal points) to the test data (salmon points). These results are summarized in Figure S5. The solid lines represent the loess fits to the test data (salmon line) and the TVAE‐generated data (teal line). There was extensive overlap of the TVAE data points with the test data points. The corresponding loess lines also overlapped, indicating that the age dependencies were satisfactorily approximated.
TVAE‐generated conditional PDODD biomarker distributions in disease states
The box plots in Figure 5 show the disease dependence of natural logarithm‐transformed values of three key PDODD biomarkers: estimated glomerular filtration rate (Figure 5a–h), urine albumin–creatinine ratio (Figure 5i–p), and plasma volume (Figure 5q–x) in the test data (salmon boxes) and the TVAE‐generated data (teal boxes). There was extensive concordance between the conditional distributions of the three PDODD biomarkers across the disease states; for example, EGFR was lower in kidney disease, and the albumin–creatinine ratio was increased in both diabetes and kidney disease. However, a subset of box plots, for example, albumin–creatinine ratio in active liver disease and EGFR in coronary heart disease, showed evidence of deviations. Overall, we characterize the approximation of conditional distributions as mixed.
FIGURE 5.

The box plots show the disease dependence of natural logarithm‐transformed values of estimated glomerular filtration rate (a–h), urine albumin–creatinine ratio (i–p), and plasma volume (q–x) in the test data (salmon boxes) and the TVAE‐generated data (teal boxes). The central line in the box represents the median, the top and bottom edges of the box represent the 75th and 25th percentiles, the error bars represent the 1.5‐fold inter‐quartile range, and the filled circles represent outliers (beyond the 1.5‐fold inter‐quartile range). The notches are the comparison interval around the median values. CHF, Congestive heart failure; CHD, Coronary heart disease.
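The box plot elements listed in the caption can be computed directly from the data. The sketch below is illustrative: it assumes that the whiskers extend to the most extreme observations lying within 1.5 times the inter‐quartile range of the box edges, it uses the quartile convention of Python's `statistics.quantiles` with `method="inclusive"` (plotting software may use a slightly different convention), it omits the notches, and the input values are invented.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles, whisker ends, and outliers for one box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Invented log-transformed biomarker values with one extreme observation.
log_values = [4.1, 4.3, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 2.0]
summary = box_plot_summary(log_values)
print(summary["median"], summary["outliers"])
```

Comparing these summaries between the test and generated data within each disease group is the numerical counterpart of the visual comparison in the figure.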
DISCUSSION
We evaluated TVAE for generative modeling of a panel of PDODD in renal, hepatic, diabetic, and cardiac disease states. Evaluating PDODD data in disease states is an important step in utilizing PDODD to individualize therapy, and generative AI methods for PDODD could facilitate pharmacometrics studies. We found that the TVAE performed satisfactorily for all the continuous biomarkers in the PDODD panel; however, the performance on the categorical and binary variables was mixed.
The PDODD approach utilizes physiological biomarker data directly relevant to dosing decisions for an individual patient and it can be implemented at the bedside in emergency, hospital, and chronic disease healthcare settings. It has significant advantages for precision medicine over the “genes, environments, and lifestyles” strategy championed by the National Institutes of Health based on exemplars from cancer typing and immunotherapy in oncology, pharmacogenetics, and rare diseases. 34 Obtaining and efficiently integrating information on genes, environments, and lifestyles may be infeasible in many dosing scenarios, for example, children, nonresponsive patients, and people with impairments and disabilities. Our approach also differs substantively from the Food and Drug Administration precision medicine strategy, which is also based on “genes, environments, and lifestyles” but emphasizes next‐generation sequencing, analytical validation, and bioinformatics tools (e.g., precisionFDA). 2
In this work, we included several PDODD measures that were not included in our earlier work, and systematically expanded the range of disease states beyond active hepatitis B and hepatitis C. 4 Importantly, we deprecated the body composition measures of total body fat and lean body weight, which require dual energy X‐ray absorptiometry scans that are not routinely available, in favor of plasma volume, a measure of central compartment volume of distribution, and normalized waist circumference, a measure of adiposity, which are easier to obtain in the clinical setting. We also included the drug‐induced liver injury (DILI) index, a measure of the DILI risk based on the Robles‐Diaz et al. 17 algorithm, and the hepatic steatosis index as additional measures of hepatic function to complement the R‐value. Urine creatinine was added to the renal markers (e.g., estimated glomerular filtration rate, urine flow, and urine albumin to creatinine ratio). The renal, hepatic, diabetes, and cardiac disease status groups included controls with no disease and groups with different disease severities. The renal, hepatic, and diabetes disease status groups showed the expected changes in the corresponding diagnostic biomarkers.
VAE enables dimensionality reduction in the latent space. Principal component analysis (PCA) is the archetype of dimensionality reduction techniques. The principal components (PC) are linear combinations of variables that are mutually orthogonal, that is, uncorrelated with each other, and explain variance in an ordered manner: each PC explains at least as much variance as every subsequent PC. PCA is often used to obtain two‐dimensional projections for visualizing high‐dimensional data because the first PC explains the most variance and the second PC explains the largest share of the remaining variance. Indeed, we used PCA alongside t‐SNE and UMAP to visually assess the joint distribution of the generated data. Because a VAE has neural networks in the encoder and decoder, its dimensionality reduction process can deploy nonlinear functions. The distribution produced by the VAE encoder is typically modeled as a multivariate Gaussian distribution whose parameters are estimated by maximizing the variational lower bound on the log‐likelihood, similar to variational Bayesian methods. 35 In the image generation setting, VAE tends to produce blurry outputs, which is problematic. We did not view the “blurriness” as a disqualifying limitation because many drug dosing determinant biomarkers are log‐normally distributed.
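For reference, the variational lower bound mentioned above is commonly written in the following standard form (see, e.g., reference 6). The notation is an assumption introduced here and does not appear in the original text: x is a data vector, z is the latent vector, q_φ(z | x) is the encoder distribution, p_θ(x | z) is the decoder distribution, and p(z) is the prior on the latent space.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The first term rewards accurate reconstruction of the data vector, and the second term keeps the encoder distribution close to the prior.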
However, we found that the TVAE was not able to recapitulate categorical variable distributions whose minor group frequencies were ≤10%. Similar findings were reported by Kiran et al. in a data science setting. 36 The poor performance of TVAE for the minority classes in imbalanced categorical variables (e.g., dialysis) often arises because misclassifying the minority class has a smaller relative impact on the loss function than misclassifying the majority class. Data vectors containing low‐frequency minority classes occur less frequently in training batches, and this can affect the learning of the joint distribution of the minority group. The learning of the joint distribution of the minority class can be improved by oversampling the minority class.
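Random oversampling, one simple form of the remedy suggested above, can be sketched as follows. This is an illustration of the general technique and was not part of the analysis reported here; the function name, the `target_fraction` parameter, and the toy rows are hypothetical.

```python
import random


def oversample_minority(rows, label_key, minority_label, target_fraction, seed=0):
    """Randomly duplicate minority-class rows until they reach target_fraction of the data."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [row for row in rows if row[label_key] == minority_label]
    if not minority:
        raise ValueError("no minority-class rows to resample")
    majority_count = len(rows) - len(minority)
    resampled = list(rows)
    minority_count = len(minority)
    # Add copies until minority_count / total >= target_fraction.
    while minority_count < target_fraction * (minority_count + majority_count):
        resampled.append(dict(rng.choice(minority)))
        minority_count += 1
    rng.shuffle(resampled)
    return resampled


# Invented training rows with a rare "Yes" class (1% of rows).
training_rows = [{"dialysis": "No", "egfr": 90.0} for _ in range(990)]
training_rows += [{"dialysis": "Yes", "egfr": 8.0} for _ in range(10)]
balanced = oversample_minority(training_rows, "dialysis", "Yes", target_fraction=0.2)
minority_share = sum(r["dialysis"] == "Yes" for r in balanced) / len(balanced)
print(len(balanced), round(minority_share, 3))
```

Because oversampling changes the class frequencies seen during training, the prevalence in the generated data would need to be checked against, and possibly re‐weighted to, the prevalence in the original data.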
The TVAE was trained with complete observations without missing values. Imputation was not done to avoid introducing artifacts in the joint distribution. While TVAE has been proposed as a promising method for imputing missing tabular data, 37 we did not evaluate the robustness of our TVAE method in the presence of missing data.
We evaluated a range of PDODD‐relevant disease state variables (e.g., kidney, heart, and liver diseases) that were available in the NHANES data. However, we were not able to assess the efficacy and safety end points that are often of interest in clinical trials because the NHANES data were obtained from a population‐based observational study. We anticipate that efficacy and safety end points can be incorporated as additional variables in generative methods such as TVAE and GAN.
In conclusion, our results indicate that TVAE provides a good generative modeling approach for continuous PDODD variables. Innovative refinements may be necessary to improve the modeling of categorical and conditional distributions. Further research to rigorously compare TVAE to GAN, adversarial autoencoders, and other emerging generative methods for pharmacometrics problems is needed.
FUNDING INFORMATION
Funding for the Ramanathan laboratory from grant MS190096 from the Department of Defense Congressionally Directed Medical Research Programs, USAMRDC, Multiple Sclerosis Research Program is gratefully acknowledged. The funder had no role in the design of the study or the data analysis.
CONFLICT OF INTEREST STATEMENT
The authors declared no competing interests for this work.
Supporting information
Data S1:
ACKNOWLEDGMENTS
The authors thank the patients who participated in this study.
Titar RR, Ramanathan M. Variational autoencoders for generative modeling of drug dosing determinants in renal, hepatic, metabolic, and cardiac disease states. Clin Transl Sci. 2024;17:e13872. doi: 10.1111/cts.13872
REFERENCES
- 1. National Institutes of Health. The promise of precision medicine. Accessed October 8, 2023. https://www.nih.gov/about-nih/what-we-do/nih-turning-discovery-into-health/promise-precision-medicine#:~:text=This%20one%2Dsize%2Dfits%2D,genes%2C%20environments%2C%20and%20lifestyles 2023.
- 2. Food and Drug Administration. Precision medicine. Accessed October 10, 2023. https://www.fda.gov/medical-devices/in-vitro-diagnostics/precision-medicine 2018.
- 3. Nair R, Mohan DD, Frank S, Setlur S, Govindaraju V, Ramanathan M. Generative adversarial networks for modelling clinical biomarker profiles with race/ethnicity. Br J Clin Pharmacol. 2023;89:1588‐1600.
- 4. Nair R, Mohan DD, Setlur S, Govindaraju V, Ramanathan M. Generative models for age, race/ethnicity, and disease state dependence of physiological determinants of drug dosing. J Pharmacokinet Pharmacodyn. 2023;50:111‐122.
- 5. Goodfellow IJ, Pouget‐Abadie J, Mirza M, Xu B, Warde‐Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial networks. arXiv. 2014;arXiv:1406.2661 [stat.ML].
- 6. Kingma DP, Welling M. An introduction to variational autoencoders. arXiv. 2019;arXiv:1906.02691.
- 7. National Health and Nutrition Examination Survey. National Health and Nutrition Examination Survey: NHANES 2015–2016 Overview. National Center for Health Statistics, Centers for Disease Control; 2015.
- 8. National Health and Nutrition Examination Survey. About the National Health and Nutrition Examination Survey. National Center for Health Statistics; 2017.
- 9. Endres CJ. nhanesA: NHANES Data Retrieval. 2023.
- 10. Dubois D, Dubois EF. A formula to estimate the approximate surface area if height and weight be known. Arch Intern Med. 1916;17:863‐871.
- 11. Ross R, Neeland IJ, Yamashita S, et al. Waist circumference as a vital sign in clinical practice: a consensus statement from the IAS and ICCR working group on visceral obesity. Nat Rev Endocrinol. 2020;16:177‐189.
- 12. Strauss MB, Davis RK, Rosenbaum JD, Rossmeisl EC. Water diuresis produced during recumbency by the intravenous infusion of isotonic saline solution. J Clin Invest. 1951;30:862‐868.
- 13. Inker LA, Eneanya ND, Coresh J, et al. New creatinine‐ and cystatin C‐based equations to estimate GFR without race. N Engl J Med. 2021;385:1737‐1749.
- 14. Chalasani NP, Hayashi PH, Bonkovsky HL, Navarro VJ, Lee WM, Fontana RJ. ACG clinical guideline: the diagnosis and management of idiosyncratic drug‐induced liver injury. Am J Gastroenterol. 2014;109:950‐966; quiz 67.
- 15. Ruhl CE, Everhart JE. Upper limits of normal for alanine aminotransferase activity in the United States population. Hepatology. 2012;55:447‐454.
- 16. Gonzalez H, et al. Normal alkaline phosphatase levels are dependent on race/ethnicity: National Health and Nutrition Examination Survey data. BMJ Open Gastroenterol. 2020;7:e000502.
- 17. Robles‐Diaz M, Lucena MI, Kaplowitz N, et al. Use of Hy's law and a new composite algorithm to predict acute liver failure in patients with drug‐induced liver injury. Gastroenterology. 2014;147:109‐118.e5.
- 18. Zimmerman HJ. Drug‐induced liver disease. Drugs. 1978;16:25‐45.
- 19. Sohn W, Jun DW, Kwak MJ, et al. Upper limit of normal serum alanine and aspartate aminotransferase levels in Korea. J Gastroenterol Hepatol. 2013;28:522‐529.
- 20. Perlstein TS, Pande RL, Creager MA, Weuve J, Beckman JA. Serum total bilirubin level, prevalent stroke, and stroke outcomes: NHANES 1999‐2004. Am J Med. 2008;121:781‐788.e1.
- 21. Lee JH, Kim D, Kim HJ, et al. Hepatic steatosis index: a simple screening tool reflecting nonalcoholic fatty liver disease. Dig Liver Dis. 2010;42:503‐508.
- 22. Song Y, Guo W, Li Z, Guo D, Li Z, Li Y. Systemic immune‐inflammation index is associated with hepatic steatosis: evidence from NHANES 2015‐2018. Front Immunol. 2022;13:1058779.
- 23. Xu L, Skoularidou M, Cuesta‐Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché‐Buc F, Fox E, Garnett R, eds. 33rd conference on neural information processing systems (NeurIPS 2019). Neural Information Processing Systems Foundation.
- 24. The Synthetic Data Vault. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 17–19 October 2016.
- 25. Fu H, Li C, Liu X, Gao J, Celikyilmaz A, Carin L. Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In: 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 240‐250.
- 26. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2017.
- 27. SDMetrics DataCebo. TVComplement. Accessed April 24, 2024. https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/tvcomplement 2023.
- 28. SDMetrics DataCebo. KSComplement. Accessed April 24, 2024. https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/kscomplement 2023.
- 29. Karatzoglou A, Smola A, Hornik K, Zeileis A. Kernlab – an S4 package for kernel methods in R. J Stat Softw. 2004;11:1‐20.
- 30. Krijthe JH. Rtsne: T‐distributed stochastic neighbor embedding using Barnes‐Hut implementation. 2015. https://cran.r-project.org
- 31. van der Maaten LJP, Hinton GE. Visualizing high‐dimensional data using t‐SNE. J Mach Learn Res. 2008;9:2579‐2605.
- 32. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv. 2018;arXiv:1802.03426 [stat.ML].
- 33. Wickham H. ggplot2: Elegant graphics for data analysis. Use R!. 2nd ed. Springer International Publishing; 2016.
- 34. Anonymous. The promise of precision medicine. Accessed October 8, 2023. https://www.nih.gov/about-nih/what-we-do/nih-turning-discovery-into-health/promise-precision-medicine#:~:text=This%20one%2Dsize%2Dfits%2D,genes%2C%20environments%2C%20and%20lifestyles 2023.
- 35. Ma ZY, Lai YP, Kleijn WB, Song YZ, Wang L, Guo J. Variational Bayesian learning for Dirichlet process mixture of inverted Dirichlet distributions in non‐Gaussian image feature modeling. IEEE T Neur Net Lear. 2019;30:449‐463.
- 36. Kiran A, Kumar SS. A comparative analysis of GAN and VAE based synthetic data generators for high dimensional, imbalanced tabular data. 2023 2nd International Conference for Innovation in Technology (INOCON). IEEE:1‐6.
- 37. Zheng S, Charoenphakdee N. Diffusion models for missing value imputation in tabular data. arXiv. 2022;arXiv:2210.17128.
