Author manuscript; available in PMC: 2020 Oct 21.
Published in final edited form as: Reprod Toxicol. 2020 May 16;95:148–158. doi: 10.1016/j.reprotox.2020.05.004

Machine learning on drug-specific data to predict small molecule teratogenicity

Anup P Challa 1,3,4,5,*, Andrew L Beam 2,3, Min Shen 4, Tyler Peryea 4, Robert R Lavieri 1, Ethan S Lippmann 5, David M Aronoff 6,7,8
PMCID: PMC7577422  NIHMSID: NIHMS1637386  PMID: 32428651

Abstract

Pregnant women are an especially vulnerable population, given the sensitivity of a developing fetus to chemical exposures. However, prescribing behavior for the gravid patient is guided by limited human data and conflicting cases of adverse outcomes, owing to the exclusion of pregnant populations from randomized, controlled trials. These factors increase risk for adverse drug outcomes and reduce quality of care for pregnant populations. Herein, we propose the application of artificial intelligence to systematically predict the teratogenicity of a prescriptible small molecule from information inherent to the drug. Using unsupervised and supervised machine learning, our model probes all small molecules with known structure and teratogenicity data published in research-amenable formats to identify patterns among structural, meta-structural, and in vitro bioactivity data for each drug and its teratogenicity score. With this workflow, we discovered three chemical functionalities that predispose a drug towards increased teratogenicity and two moieties with potentially protective effects. Our models predict three clinically-relevant classes of teratogenicity with AUC = 0.8 and nearly double the predictive accuracy of a blind control for the same task, suggesting successful modeling. We also present extensive barriers to translational research that restrict data-driven studies in pregnancy and therapeutically “orphan” pregnant populations. Collectively, this work represents a first-in-kind platform for the application of computing to study and predict teratogenicity.

Keywords: teratogenicity, drug development, drug exposure, machine learning, informatics, chemical structure, high-throughput screening, translational medicine

Introduction

1.1. Risky prescriptive behavior in pregnancy

Teratogenicity is the most serious manifestation of iatrogenic fetal toxicity: teratogens lead to fetal malformation and are implicated in lifelong physical and/or mental disabilities1. Nonetheless, clinical trial results of drug exposure during pregnancy are often conflicting2–4, and teratogenicity scoring for small molecules is unsystematic and performed outside the clinical environment5–7. The consequences of this subjectivity are seen in the high rate of unintended maternal exposure to a teratogenic agent8, reminiscent of the “thalidomide disaster” of the early 1960s9,10. Following this disaster, randomized, controlled trials (RCTs) were modified to exclude pregnant populations, fearing unintended teratogenicity from exposure to unsystematically profiled drugs10. This change continues to “orphan” pregnant women, as many diseases in women’s health lack safe and effective drug choices for treatment8,11,12.

In the wake of the “thalidomide disaster,” the United States Food and Drug Administration (FDA) developed a five-point scale for ranking the teratogenicity of a compound7–9,11. This scale is presented in Table 1 (Appendix).

A hallmark of the binning within this scale is the absence of definitive human data: at present, teratogenicity scores are established pre-clinically by pharmacologists, who evaluate biomarkers of fetal toxicity in animal models5,6. This approach is inherently limited, as common in vivo models are not sufficiently representative of human physiology13, and human subjects are not included in the teratogenicity scoring process for ethical reasons11,14,15. Indeed, the limited human data available for teratology scoring are often derived retrospectively from high-profile cases of fetal malformation resulting from drug exposure9,16,17. While new FDA standards for scoring teratogenicity acknowledge these limitations by providing fewer, more holistic toxicity scores, these standards still suffer from the absence of robust human data and are not yet integrated in clinical decision-making tools18.

Collectively, the factors above create significant uncertainty at the point of care (POC), as providers are guided by contradictory, incomplete, and non-human-derived information in their choice of prescriptions for pregnant women. This dilemma is of special consequence to expectant mothers with chronic morbidities that predate their pregnancies11.

1.2. Target rationale for teratogenesis

Fetal exposure to a teratogen in utero strongly associates with cognitive and/or physical disabilities, resulting from dysregulation of key developmental processes such as neurulation, purine and pyrimidine synthesis, and lipid anabolism2,19.

Broadly, teratogens may be categorized by their mechanism of action (MOA) as either “on-target” or “off-target”20–22. “On-target” teratogenicity implies the generation of adverse phenotypes from bioactive agents impacting well-defined protein targets that are critically regulated in development. In contrast, “off-target” teratogenicity implies mutagenicity, resulting from DNA damage such as alkylation and thymine dimerization. “Off-target” teratogenicity involves repeated reactions between a teratogen and newly-synthesized nucleic acid residues, often resulting from reactive oxygen species (ROS) generated during drug metabolism20.

Thus, teratology is known to converge on few principal MOA classes19,23, which are outlined in Table 2.

1.3. Machine learning in maternal-fetal medicine

The inherent contradiction between the limited target rationale for teratogenesis and the extent of uncertainty that guides prescribing behavior for gravid populations speaks to the need for more rigorous predictions of small molecule teratogenicity. Furthermore, computational modeling on healthcare data is the most accurate method of predicting drug safety in pregnant women, given that phase I trials are unethical for expectant populations and animal models are inherently limited for studying human health12,13,24.

Classification algorithms are optimized to identify patterns between associated data sets (such as binding affinity and phenotype data for a cytotoxic target)25–28, suggesting that machine learning (ML) classifiers may play a pivotal role in systematically establishing relationships between maternal drug history and adverse fetal outcomes29–31. While these models are not intended as a replacement for existing physician knowledge of responsible prescriptive practice32, ML classifiers offer an attractive opportunity to discover relationships within existing biomedical data that could result in meaningful POC conclusions.

There have been few previous studies leveraging this brand of artificial intelligence (AI) for predicting iatrogenic fetal toxicity. Of these select investigations, a majority have focused solely on population-level, patient-derived data to discover adverse outcomes from maternal medication history and neonatal disease information17,30,31,33,34. In 2017, Boland et al. reported on a successful ML algorithm for parsing electronic health record (EHR) data to develop data-driven definitions of adverse drug outcomes associated with class C teratogenicity; the authors focused their modeling on congenital disease and fetal death phenotypes35. Studies with similarly limited scope that analyzed insurance claims data are also available30,33,36. Recognizing the additional predictive power of chemical data for teratogenicity, Baker et al. published an ML model for the identification of compounds implicated in cleft palate formation, built from existing toxicology high-throughput screening (HTS) bioassay data and from information on chemicals implicated in the cleft palate phenotype, as identified by systematic literature review. This allowed the authors to identify biomarkers with high positive predictive value for cleft palate and further elucidate chemical exposure-adverse outcome clusters17.

In this study, we report on a previously-unattempted, unbiased (phenotype-agnostic and target-agnostic) approach to predicting teratogenicity by identifying chemical and biochemical factors that predispose a chemical to increased teratogenic risk. Given significant limitations in established teratogenicity scoring criteria, we propose a novel application of ML to develop a teratogenicity quantitative structure-activity relationship (QSAR)37. By leveraging drug structure, meta-structural elements like molecular energetics, and real-world bioactivity data, we attempt to predict the teratogenic risk of drugs potentially prescriptible in pregnancy.

Materials and Methods

Our teratogenicity QSAR accesses chemical and bioassay data to predict a teratogenicity score for compounds that are prescriptible in pregnancy and to identify patterns within drug-specific information that predispose a drug towards an increased risk of fetal toxicity.

Broadly, we leverage three layers of drug data to accomplish these tasks:

  1. The inherent structure of each drug, as encoded by several classes of chemical fingerprints38 that capture upwards of 1,024 structural features of each molecule

  2. Meta-structural features for each drug, including druglikeness, predicted molecular energetics, and mutagenicity, as calculated from the Molecular Operating Environment (MOE)39, an industrial-grade chemical computing software, with mutagenicity data drawn from predictive models of the Ames test40

  3. Antagonist-mode toxicology assay data from Tripod, a public-facing collection of HTS data from the Toxicology in the 21st Century (Tox21) initiative of the National Institutes of Health41,42, on all targets listed in Table 2 with simultaneous coverage in Tox21

We employed version 3.5.3 of the R integrated development environment (https://www.r-project.org/)43 for parsing all chemical and bioassay data and implementing and tuning unsupervised and supervised ML models for associating these data.

We are committed to open-source science. All source data, code, and output files relevant to the development of our model are available through the following GitHub repository: https://github.com/apchalla/teratogenicity-qsar.

2.1. Mining structure and teratogenicity data

DrugBank 5.1.0 (https://www.drugbank.ca/) is a publicly available drug encyclopedia developed by the University of Alberta. It contains comprehensive entries of more than two hundred (200) data fields for 9,099 small molecules of known structure that have passed phase I of an existing RCT. Each DrugBank entry contains structured information on compound structure, MOA, existing formulations, and drug marketing history, among other clinically-relevant datasets44. DrugBank is a self-described cheminformatics resource45; therefore, the pharmacopeia provides a highly pliable application programming interface (API), which allows for easy data mining and extraction. Given the comprehensiveness of the DrugBank database, as well as its amenability for data-driven analyses, we extracted the structures of all 9,099 DrugBank entries as three-dimensional spatial data files (3D-SDFs).

To obtain relevant, FDA-compliant teratology data, we interrogated SafeFetus (https://www.safefetus.com/)46, a registry for expectant mothers hosting the largest publicly available repository of structured, FDA-aligned teratogenicity scores. Accordingly, we extracted teratogenicity scores for all 652 eligible drugs from SafeFetus.

Teratology data are not routinely published, and many large pharmacovigilance databanks like FDA’s DailyMed (https://dailymed.nlm.nih.gov/dailymed/)47 do not present all teratology information in structured fields, as required for computing on this information, and have inflexible APIs for data extraction.

2.2. Layer 1: Leveraging drug structure for predicting teratogenicity

From DrugBank 5.1.0, all 9,099 small molecule structure files were mined in SDF format. To ensure that DrugBank structure files were not corrupted in the extraction process, the SDF set was imported into the LigPrep graphical user interface of Schrödinger 2018-2 (https://www.schrodinger.com/)48, a suite of chemical computing software that enables predictive modeling in structure-guided pharmacological studies. After validating by visual inspection that all DrugBank files were chemically valid, we imported the SDF set to R. Then, using the cheminformatics toolkits ChemmineR (CRAN: ChemmineR)49 and Rcdk (CRAN: Rcdk)50, we converted the SDF set to twelve (12) classes of chemical fingerprints, encodings of chemical structure as thousand-dimensional matrices that record the presence or absence of distinctive chemical motifs, including topological torsions, R/S stereochemistry, common functional groups, Brønsted-Lowry acidity/basicity, general acid/base catalysts, and other salient chemomarkers. This fingerprinting process is only valid for organic small molecules; therefore, all inorganic agents were automatically parsed from our drug set by the ChemmineR and Rcdk fingerprinting algorithms38. Thus, fingerprinting allowed us to access comprehensive, structured information on nearly nine thousand (9,000) small molecules and one-hot encode this information.
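As a rough illustration of the presence/absence encoding described above, the following minimal sketch folds a set of structural feature labels into a fixed-length binary vector. It is not the ChemmineR or Rcdk algorithm: the feature labels, the `fold_features` helper, and the choice of SHA-256 for hashing are assumptions made only for illustration.

```python
import hashlib

N_BITS = 1024  # length of the bit vector; matches the 1,024-bit fingerprints in the text


def fold_features(features, n_bits=N_BITS):
    """Fold feature labels into a binary presence/absence vector (toy example)."""
    bits = [0] * n_bits
    for feature in features:
        # A stable hash keeps each feature's bit position reproducible across runs.
        digest = hashlib.sha256(feature.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits


# Hypothetical feature labels for one molecule. Real fingerprinting derives
# such features from the atom environments in the structure file.
example = fold_features(["azetidinone", "dihydrothiazine", "carboxylic_acid"])
```

Two molecules that share a feature set the same bit, which is what lets downstream models compare structures position by position.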

As noted above, we obtained FDA-compliant teratogenicity data from SafeFetus, the largest publicly available source of structured FDA teratogenicity scores with an API. Integrating the data sets for teratogenicity and drug structure in R, we obtained N = 611 drugs with information on both structure and teratogenicity.
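The integration step amounts to keeping only the drugs that appear in both sources. A minimal sketch, assuming both data sets are keyed by a shared drug identifier (the `merge_on_drug` helper and the miniature inputs are hypothetical):

```python
def merge_on_drug(structures, scores):
    """Keep only drugs present in both data sets (an inner join on the drug key)."""
    shared = sorted(set(structures) & set(scores))
    return {name: (structures[name], scores[name]) for name in shared}


# Hypothetical miniature inputs: fingerprints keyed by drug, and FDA scores keyed by drug.
structures = {"drug_a": [1, 0, 1], "drug_b": [0, 0, 1], "drug_c": [1, 1, 0]}
scores = {"drug_a": "B", "drug_c": "X", "drug_d": "C"}

merged = merge_on_drug(structures, scores)
```

Drugs with a structure but no score (or the reverse) drop out, which is how a join of this kind can shrink 9,099 structures and 652 scores to a smaller overlapping set.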

We then developed multiple label classification strategies for teratogenicity scores, based on the nature of FDA teratology scores and a bibliostatistic search. While we acknowledge that the conceptual relevance and quality of labels are of utmost importance for classification tasks, we also argue that the definition of optimal labels a priori can be difficult. Therefore, in this manuscript, we detail the procedure we employed to define teratogenicity score embeddings, starting with the application of published rubrics and moving towards literature searches and necessary trial-and-error approaches to define and optimize a more precise set of scores.

Therefore, one set of teratogenicity scores we employed for all 611 drugs was aligned according to native FDA schema. Through consultation with practicing clinicians to discuss the heuristics they employ in prescriptive practice, we redefined a second set of scores as a three-pronged scale of bins: “Clinically Acceptable Risk” (scores A/B), “Moderate Risk” (score C), and “Clinically Unacceptable Risk” (scores D/X). We then defined a third scale by a systematic literature search of the Embase medical library system (https://www.elsevier.com/solutions/embase-biomedical-research)51 and a Cochrane review (https://www.cochrane.org/evidence)52 for the keyword “teratogenic.” We queried the ~16,000 articles that resulted from this search using simple random sampling, such that we assigned an identifier to each article and sampled O(10^1) articles by their identifiers. For articles within our random sample and which referenced specific drugs, we observed that the keyword “teratogenic” was associated with a mention within the article of FDA scores of C, D, or X. Therefore, we defined a binary scale of scores as “Non-Teratogenic” (scores A/B) and “Teratogenic” (scores C/D/X) classes.
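The three label schemes above can be written down as simple lookups from the native FDA letter. The sketch below is illustrative only; the bin names come from the text, while the dictionary names and the `relabel` helper are assumptions.

```python
# Three-bin, clinically oriented scale described in the text.
THREE_BIN = {
    "A": "Clinically Acceptable Risk",
    "B": "Clinically Acceptable Risk",
    "C": "Moderate Risk",
    "D": "Clinically Unacceptable Risk",
    "X": "Clinically Unacceptable Risk",
}

# Binary scale derived from the literature sample described in the text.
BINARY = {
    "A": "Non-Teratogenic",
    "B": "Non-Teratogenic",
    "C": "Teratogenic",
    "D": "Teratogenic",
    "X": "Teratogenic",
}


def relabel(fda_scores, scheme):
    """Map native FDA letters onto a coarser label scheme."""
    return [scheme[score.strip().upper()] for score in fda_scores]


three_bin_labels = relabel(["A", "c", "X"], THREE_BIN)
binary_labels = relabel(["A", "c", "X"], BINARY)
```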

We emphasize that this ad hoc literature review was a non-rigorous—but necessary—step that allowed us to develop a starting point from which we could study and discuss potential tuning of the definition of our labels, per their contextual relevance and the model performance that we observed with these embeddings. We discuss these issues throughout the remainder of our manuscript.

2.2.1. Unsupervised modeling

First, to discover clustering relationships between teratogenicity and drug structure, the Barnes-Hut implementation of the t-Distributed Stochastic Neighbor Embedding (t-SNE) procedure53 was enacted on all combinations of fingerprint and teratogenicity score data sets, including a combined, non-redundant, and feature-prioritized set of all chemical fingerprints. t-SNE is a dimensionality reduction procedure, which can plot all dimensions of drug structure against all dimensions of teratogenicity for all drugs included in our data sets. The presence of tight clusters in a t-SNE plot indicates dependency between the plotted variables54,55.

Of the t-SNE combinations we attempted on our structure and teratology encodings, the t-SNE plot generated with 1,024-dimensional Morgan fingerprints56 and a binary classification of teratological risk showed the strongest clustering relationships. Clusters were identified by visual inspection, with each point within a cluster representing a drug. Hence, we mapped points within each discrete t-SNE cluster in reverse, from t-SNE space to the associated DrugBank entry. Noting that all points within each cluster were consistent with a salient chemical functionality within the component drug structures, and that all cluster component drugs belonged to the same class, we considered our identification of meaningful clusters to be successful. Performing systematic literature review on each drug class identified as strongly associated with the presence or absence of elevated teratogenic risk, we noted that select relationships between drug class and teratogenicity score identified by our model were verified in clinical decision-making tools like UpToDate (https://www.uptodate.com/contents/search)57 and Medscape (https://www.medscape.com/)58. However, several structure-teratology relationships identified by t-SNE appeared contentious in relevant literature: sufficient human data are not available to accurately classify the class of drugs distinguished by the t-SNE-identified chemical functionality as teratogenic or safe. We present a deeper discussion of the contribution of our t-SNE findings to these debates in the “Results and Discussion” section of this publication. The most consistent t-SNE plot and the functionalities it identified as significantly associated with the presence or absence of teratological risk are also shown in Figures 3 and 4 in the “Results and Discussion” section.

Figure 3:

t-SNE, when enacted on a 1,024-bit representation of the Morgan class of chemical fingerprints and a binary classification of teratogenicity (“YES” (class C/D/X), “NO” (class A/B)), reveals small clusters that indicate potential structure-teratogenicity relationships. This plot was generated using the R package Rtsne (CRAN: Rtsne)102.

Figure 4:

We discovered relationships between teratogenic risk (“YES”, “NO”) and the presence of distinct chemical functionalities from consistent structure-teratogenicity points within each discrete t-SNE cluster.

Noting that multiple structure-teratogenicity relationships resulting from our t-SNE analysis were validated in the literature, we considered our unsupervised ML model to be a successful proof-of-concept experiment.
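The consistency check described in this subsection (whether the drugs in a cluster share one teratogenicity label) can be summarized numerically. The sketch below assumes cluster assignments are already available, for example from the visual inspection described above; the `cluster_purity` helper and the toy inputs are hypothetical and do not reproduce the study's clusters.

```python
from collections import Counter


def cluster_purity(cluster_ids, labels):
    """Return, per cluster, the majority label and the fraction of points carrying it."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, []).append(label)
    summary = {}
    for cid, members in by_cluster.items():
        top_label, count = Counter(members).most_common(1)[0]
        summary[cid] = (top_label, count / len(members))
    return summary


# Toy data: cluster 0 is fully consistent, cluster 1 contains one dissenting drug.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
labels = ["NO", "NO", "NO", "YES", "YES", "NO", "YES"]
purity = cluster_purity(cluster_ids, labels)
```

A purity of 1.0 corresponds to the situation reported in the text, where every drug in a cluster carries the same binary label.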

2.2.2. Supervised modeling

Given that t-SNE successfully and consistently identified moieties that might predispose a drug towards an increased risk of teratogenicity, we decided to enable a supervised ML model that can prospectively predict a drug’s teratogenicity score from structural information. Using the R package Caret (CRAN: Caret)59, we developed three (3) models with inherent five (5)-fold cross validation (CV), such that we obtained test set accuracy on running each model. These models included Random Forest60, Extreme Gradient Boosting61, and Gradient Boosting Machine (GBM)62. Testing these models with five-pronged, FDA-adherent teratogenicity scores, we found that GBM yielded the highest predictive accuracy. Therefore, we re-trained our GBM model with the trivariate, clinically-oriented teratology scale described above and obtained higher accuracy for this model than for the GBM trained on five-dimensional labels. For all models, we optimized hyperparameters using a large grid search within Caret.
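For readers unfamiliar with k-fold cross validation, the sketch below shows one way to partition sample indices into five label-balanced folds and to iterate over train/test splits. It is a generic illustration and not a reproduction of how Caret resamples; the function names, the seed, and the toy label counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, spreading each label evenly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous label stopped.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy three-class label set (counts are illustrative, not the study's).
labels = ["A/B"] * 10 + ["C"] * 20 + ["D/X"] * 10
folds = stratified_folds(labels, k=5)
```

Each sample is held out exactly once, so the accuracy averaged over the five held-out folds estimates performance on data the model has not seen.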

2.3. Layer 2: Curating meta-structural information for exploratory analysis

After deriving a successful model for predicting teratological risk from drug structure, we sought to increase the predictive accuracy of our GBM by supplementing our features with information on “meta-structure”63. These factors included the following variables, which were calculated for all 611 sampled drugs within MOE (https://www.chemcomp.com/Products.htm)39, a suite of industry-grade chemical computing software for computer-aided molecular design. Each of the following meta-structural sets was encoded by chemically-significant cutoffs when available (e.g., druglikeness benchmarks from Lipinski’s Rule of Five (RO5)64) or by cutoffs determined from ROC analysis of extracted data:

  • Druglikeness: the adherence of each molecule to Lipinski’s Rule of Five restrictions on the number of hydrogen-bond acceptors, hydrogen-bond donors, octanol-water partition effects, total polar surface area, molecular weight, and number of rotatable bonds for an attractive drug candidate64

  • Energy of the Highest Occupied Molecular Orbital (HOMO): a quantum chemistry metric of the tendency of a molecule to donate an electron, as a proxy for drug stability and tendency to generate mutagenic free radicals65

  • Energy of the Lowest Unoccupied Molecular Orbital (LUMO): a quantum chemistry metric of the tendency of a molecule to accept an electron, as a proxy for drug stability and tendency to generate mutagenic free radicals65

  • Mutagenicity score, as calculated from in-built predictive models of the Ames test40

  • pKa and most basic pKa
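As an example of encoding a descriptor set by a chemically significant cutoff, the sketch below counts violations of the four classic Rule of Five criteria. The text above also lists total polar surface area and rotatable bonds, but gives no cutoffs for them, so they are omitted here. The function names, the one-violation tolerance, and the descriptor values are assumptions for illustration.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Count violations of the four classic Rule of Five criteria."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        log_p > 5,          # octanol-water partition coefficient (logP) above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def is_druglike(mol_weight, log_p, h_donors, h_acceptors, max_violations=1):
    """Binary druglikeness flag; tolerating one violation is a common convention."""
    return lipinski_violations(mol_weight, log_p, h_donors, h_acceptors) <= max_violations


# Hypothetical descriptor values for two molecules.
small = is_druglike(mol_weight=350.4, log_p=2.1, h_donors=2, h_acceptors=5)
large = is_druglike(mol_weight=720.0, log_p=6.3, h_donors=6, h_acceptors=12)
```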

2.3.1. Unsupervised modeling

Combining MOE calculations for the above variables and all structural data sets, we performed feature selection within Caret to remove redundancy and highly-correlated features within the integrated descriptor set. Then, we re-executed t-SNE on binary teratogenicity scores, with the hope of identifying new clustering relationships between physicochemical features and teratogenicity.
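Removing highly correlated features can be sketched as a greedy filter on pairwise Pearson correlation. This is a simplified stand-in and not Caret's own procedure; the 0.9 cutoff, the helper names, and the toy columns are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (sd_x * sd_y)


def drop_correlated(columns, cutoff=0.9):
    """Greedily keep a column only if |r| with every already-kept column is <= cutoff."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Toy descriptor set: bit_2 duplicates bit_1, and bit_3 is its exact complement.
columns = {
    "bit_1": [1, 0, 1, 0, 1],
    "bit_2": [1, 0, 1, 0, 1],
    "homo": [-9.1, -8.7, -9.4, -8.9, -9.0],
    "bit_3": [0, 1, 0, 1, 0],
}
kept = drop_correlated(columns)
```

The result depends on column order, since the first member of a correlated group is the one retained.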

2.3.2. Supervised modeling

GBM with five-fold CV was re-executed with a three-pronged set of teratogenicity scores and feature-prioritized structural and meta-structural information. Hyperparameters were optimized by large grid search within Caret59.

2.4. Layer 3: Repurposing Tox21 HTS Data on Teratogenic Targets

Given that teratogenicity has well-identified target rationale, we decided to leverage existing, real-world bioassay information for all targets implicated in teratogenesis (as described in Table 2) and previously screened through the Toxicology in the 21st Century Initiative (Tox21) of the National Institutes of Health (https://ncats.nih.gov/tox21)41. Tox21 leverages HTS of millions of bioactive compounds, including most common pharmaceuticals, in thousand-well plate, cell-based assays. While this HTS platform is not teratogenicity-specific, it does contain information on targets implicated in teratogenesis66.

Scoping all information available on Tripod, the public-facing data browser of Tox21 (https://tripod.nih.gov/tox21)42, we extracted antagonist-mode RAR and HDAC data for supplementation of our model. RAR data were derived from murine embryo fibroblast cells (C3H10T1/2, American Type Culture Collection, Manassas, Va., USA), and HDAC data were obtained from human colorectal carcinoma cells (HCT-116, American Type Culture Collection, Manassas, Va., USA). Assay protocols are available from the Tripod website specified above.

Data available from Tripod include bioactivity for a given target (encoded as “inactive,” “active”), curve class, IC50, efficacy, and Hill coefficient. Of these variables, we studied curve class, IC50, and efficacy as proxies of binding affinity of each sampled compound for RAR and HDAC. Therefore, all compounds with available structure, teratogenicity score, and RAR/HDAC HTS coverage (N = 128) were probed by t-SNE and GBM. Data were one-hot encoded using standard bioactivity cutoffs for drug development (i.e., curve class ≠ 4, IC50 ≤ 20 µM, efficacy ≤ −50%)67–69.
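The cutoff-based encoding above can be sketched as follows. The three thresholds are those stated in the text; the function name, the output keys, and the treatment of missing values as "not active" are assumptions.

```python
def encode_assay(curve_class, ic50_um, efficacy_pct):
    """Binarize one antagonist-mode assay readout using the cutoffs given in the text."""
    return {
        # Curve class 4 is treated as inactive; any other class passes the cutoff.
        "curve_class_active": int(curve_class != 4),
        # IC50 at or below 20 micromolar counts as potent (missing value -> 0, an assumption).
        "potent": int(ic50_um is not None and ic50_um <= 20.0),
        # Efficacy at or below -50% counts as efficacious inhibition (missing value -> 0).
        "efficacious": int(efficacy_pct is not None and efficacy_pct <= -50.0),
    }


# Hypothetical readouts for three compounds.
strong = encode_assay(curve_class=-1.1, ic50_um=3.2, efficacy_pct=-85.0)
weak = encode_assay(curve_class=-2.2, ic50_um=45.0, efficacy_pct=-30.0)
inactive = encode_assay(curve_class=4, ic50_um=None, efficacy_pct=None)
```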

2.4.1. Unsupervised modeling

Combining MOE calculations for the above assay data and all structural and meta-structural data sets, we performed feature selection within Caret to remove redundancy and highly-correlated features within the integrated descriptor set. Then, we re-executed t-SNE on binary teratogenicity scores, with the hope of identifying new clustering relationships between assay data and teratogenicity.

2.4.2. Supervised modeling

GBM with five-fold CV was re-executed with a three-pronged set of teratogenicity scores and feature-prioritized structural, meta-structural, and biochemical assay information. Hyperparameters were optimized by large grid search within Caret.

2.5. ROC statistics

In evaluating the results of our supervised and unsupervised models, we took special note of the imbalanced nature of our teratogenicity score data set. This is a problem inherent to the subjective nature of teratogenicity scoring by the FDA, as drugs with unclear safety profiles often receive a label of class C24. In accordance with this practice, we observed that 310 of our 611 sampled drugs (51%) were labelled C. The remainder of our label set was distributed as follows, which is—at large—representative of the FDA’s classification behavior: A = 14/611 drugs (2%), B = 157/611 drugs (26%), D = 91/611 drugs (15%), X = 39/611 drugs (6%)24,47.
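The class shares quoted above follow directly from the reported counts, and the same counts give the sizes of the three clinically oriented bins. A short check (variable names are illustrative):

```python
# Label counts as reported in the text for the 611 sampled drugs.
counts = {"A": 14, "B": 157, "C": 310, "D": 91, "X": 39}

total = sum(counts.values())

# Percentage share of each FDA class, rounded to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

# Sizes of the three bins used for the trivariate scale.
three_bin = {
    "A/B": counts["A"] + counts["B"],
    "C": counts["C"],
    "D/X": counts["D"] + counts["X"],
}
```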

Therefore, we decided to perform ROC statistics to evaluate the strength of our set of features to predict a drug’s FDA teratogenicity score, since ROC statistics are more resilient to class imbalance than GBM accuracy. ROC statistics for structure-based predictions of teratogenicity (AUC = 0.8) suggested that chemical structure has strong predictive power for a drug’s teratological risk. We describe these results in depth in the following section.
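To make the AUC statistic concrete, the sketch below computes a binary AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative, and then averages one-vs-rest AUCs across classes. This simple averaging is an assumption chosen for illustration and may differ from the multiclass method used by the pROC package; all names and toy inputs are hypothetical.

```python
def binary_auc(scores, positives):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, labels, classes):
    """Unweighted mean of per-class AUCs, each class scored against all others."""
    aucs = []
    for col, cls in enumerate(classes):
        scores = [row[col] for row in prob_rows]
        aucs.append(binary_auc(scores, [label == cls for label in labels]))
    return sum(aucs) / len(aucs)


# Toy binary case: three of the four positive/negative pairs are ranked correctly.
toy_auc = binary_auc([0.9, 0.8, 0.4, 0.3], [True, False, True, False])

# Toy three-class case with perfectly separated predicted probabilities.
classes = ["A/B", "C", "D/X"]
prob_rows = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
macro_auc = one_vs_rest_auc(prob_rows, ["A/B", "C", "D/X"], classes)
```

Because the statistic depends only on how positives rank relative to negatives, it is less sensitive to the share of each class than raw accuracy is.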

Figure 1 contains a summary of all data sources and modes of ML analysis we considered in the creation of our model. For each feature layer, we ensured that the associated data sources were non-trivial to our model by plotting a feature importance spectrum. In querying these feature importance data, we found that Caret considered all features as having non-zero importance at each implementation of a GBM.

Figure 1:

To develop our QSAR model, we synthesized data from several pharmacological and clinical databases. However, as we describe in “Materials and Methods,” the quantity and quality of available data differed significantly across the ontologies, with teratogenicity data from SafeFetus as the most limiting. To reflect this heterogeneity, data sources and conclusions that we consider substantial are colored green, while yellow-colored data elements are intermediate in their contribution of meaningful information. Sources in red are most limiting.

Results and Discussion

In this manuscript, we present a first-in-kind application of ML to identify structural, meta-structural, and bioassay performance factors that predispose a drug towards increased teratogenic risk. We developed a model to prospectively score a drug’s teratogenicity from these drug-specific factors. Because our workflow is anchored in computing, our methods apply algorithmic rigor to studying teratogenicity, a contrast to many non-systematic studies which have historically dominated this space.

3.1. Summary of key results

3.1.1. Unsupervised learning outcomes

We found that drug structure is a good predictor of teratogenicity, as multiclass ROC analysis between 1,024-dimensional Morgan fingerprints and a three-pronged teratogenicity metric gave AUC = 0.78 (Figure 2). This result validates our hypothesis that a “form-fits-function” argument is valid for predicting teratogenicity from homology between drug structure and pharmacophore biochemistry among targets implicated in teratogenesis.

Figure 2:

ROC analysis suggests that 1,024-bit Morgan fingerprints have good predictive accuracy for teratogenicity (AUC = 0.78). This plot was generated using the R package pROC (CRAN: pROC)101.

From t-SNE analysis between drug structure and a binary encoding of teratogenicity (Figure 3), we discovered clusters of teratogenic risk and the absence thereof, which are partially validated within existing clinical literature (Figure 4). Though t-SNE contains noise across most of the dimensionally reduced structure-teratogenicity landscape, the clusters we identified by visual inspection were consistent in teratogenic risk. A reason for the limited tightness of the observed clustering behavior may involve dimensionality mismatch between structure and teratogenicity data sets, given that we plotted 1,024 structural motifs against only two (2) teratogenicity scores. However, since generating ~10^3 independent teratogenicity scores and reducing chemical structure to ~10^1 categories are both unfeasible (this would remove the clinical and chemical significance of the respective data sets), we cannot address this probable cause of loose clustering by adjusting the form of the data we seek to associate. Despite these issues, our t-SNE step was a successful proof-of-concept experiment, as we discovered functionalities that are known to be highly fetal toxic and those that are known to be safe through this procedure.

Beyond these validated associations, we also discovered new structure-teratogenicity relationships that might have application in clarifying cases of suspect toxicity risk in the clinical literature. Indeed, our analysis reveals five motifs that are distinctive among cohorts of molecules identified as “teratogenic” and “non-teratogenic.” Both moieties in the “NO” cluster are components of cephalosporins, which include a group of broad-range antibiotics known to be safe for pregnant mothers (class B)70–73. Two distinctive functionalities distinguish cephalosporins from other classes of drugs: the presence of an azetidinone group and a dihydrothiazine ring74. Therefore, as these features distinctively establish cephalosporin identity, which is non-teratogenic, it is reasonable to assert that the azetidinone functionality and dihydrothiazine ring are non-teratogenic chemomarkers in this case. We recognize that the burden of evidence is significant to claim that these motifs demonstrate protective effects. Instead, we suggest that our results warrant more involved analysis of these potentially protective moieties.

In contrast, similar analysis of “YES” clusters reveals three teratogenic chemomarkers, including corticosteroids, fluoroquinolones, and acetylproline derivatives. While fluoroquinolones are documented teratogens75–78, there is contention on the toxicity of steroid derivatives79–81, as well as prolinated compounds82–84. Our model adds to this discussion by arguing that the safety of steroid derivatives should be more deeply interrogated for potentially teratogenic outcomes.

We reasonably assume that the “YES” functionalities in Figure 4 are the source of teratogenicity within molecules that contain them, given that these moieties are distinctive. This conclusion requires MOA validation; however, as with fluoroquinolones, available phenotypic data appear to support our conclusions on functional group toxicity.

Drawing on these mappings also allows us to evaluate new trends in drug development; namely, we can extrapolate functional group mappings towards drug development targets in the anti-hypercholesterolemic space. Pregnant women with high cholesterol are not advised to take statins, as these drugs are antagonists of HMG-CoA reductase, restricting fatty acid synthesis in a developing fetus (Table 1)19,85–87. Statins contain a fluorobenzene motif, which our model predicts to be the core teratogenic functionality within these drugs. To date, only one small-molecule anti-hypercholesterolemic drug, ezetimibe (Zetia), does not belong to the statin class of drugs88. Instead, ezetimibe contains a central azetidinone group and has been associated with reduced teratogenicity across the expectant population, as compared to statins (statins are class D agents; ezetimibe carries a class C score)89. Given that we identify azetidinone-containing drugs to carry potential protective effects, this observation supports the results from our model and speaks to the potential applicability of structure-teratogenicity relationship modeling similar to that in this paper to inform downstream, data-driven inquiries into drug safety for expectant populations. We emphasize that expansion of this study and downstream mechanistic studies are required to fully substantiate our observations.

3.1.2. Supervised learning outcomes

Our GBM predicts three classes of teratogenicity with 64.7% accuracy (SD = 3.0%) when trained on 1,024-dimensional Morgan fingerprints. Thus, our model achieves nearly double the predictive accuracy of a blind, probabilistic control for the same three-class predictive task; QSAR accuracy enrichment is nearly 32% over these baseline predictions. Model penalization to correct for an imbalance of teratogenicity scores did not increase predictive accuracy. Because no other structure-activity relationships, meta-structure-activity relationships, or structure-assay-activity relationships have been published in this space, we present our model as a first attempt at applying drug-inherent information towards predicting teratogenicity.
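
The comparison against a blind control can be checked with a short calculation. The sketch below assumes that the control guesses uniformly among the three classes, giving an expected accuracy of one in three; this is an assumption for illustration, and the control used in the study may have been constructed differently. The names `uniform_baseline` and `enrichment` are hypothetical helpers, not part of the study's code.

```python
# Reported GBM accuracy from the text. The baseline below is an ASSUMPTION:
# a blind control that guesses uniformly among the three classes.
N_CLASSES = 3
REPORTED_ACCURACY = 0.647


def uniform_baseline(n_classes: int) -> float:
    """Expected accuracy of guessing uniformly at random among n_classes."""
    return 1.0 / n_classes


def enrichment(accuracy: float, baseline: float) -> tuple[float, float]:
    """Return (fold improvement, gain in percentage points) over the baseline."""
    return accuracy / baseline, (accuracy - baseline) * 100.0


baseline = uniform_baseline(N_CLASSES)
fold, gain_points = enrichment(REPORTED_ACCURACY, baseline)
print(f"baseline={baseline:.3f}, fold={fold:.2f}x, gain={gain_points:.1f} points")
```

Under this assumed baseline, the fold improvement is just under two and the gain is a little over 31 percentage points, which is broadly in line with the figures quoted above.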

3.2. Ontological limitations and barriers to data-driven studies in pregnancy

While the results above appear promising, the data that we queried in this investigation present significant ontological challenges. These problems drastically reduce the sample size of all drug-specific teratology probes and present significant barriers to translational science, as we explain below.

In this study, we encountered problems with procuring teratogenicity information, given that teratology reference data are infrequently published and updated in the relevant clinical literature. Furthermore, existing clinical decision-making tools like UpToDate and Medscape do not have APIs and contain contradictory teratology information that is not available in the structured formats required for systematic, retrospective data analysis and ML modeling. FDA resources containing teratology data are also not published in structured formats amenable to computational research, despite the availability of an API for FDA pharmacopeias like DailyMed. For this investigation, the consequence of this limitation in available teratogenicity data was a significant reduction in the drug sample size available to t-SNE and GBM. Though we used arguably one of the most powerful chemical computing software programs currently available (i.e., MOE), we encountered sparsity in meta-structural predictions within our limited subset of drugs with available structure and teratogenicity information. This restricted the power of our meta-structural t-SNE and GBM probes, resulting in no test power for a feature-selected meta-structural and structural feature set.

Despite the gravity of the inherent uncertainty within available teratogenicity scoring criteria and limited target rationale for teratogenesis, there exist no teratology-specific HTS platforms. Though large toxicology HTS programs like Tox21 have screened targets that overlap with those in Table 2, this intersection remains small: only two (2) targets have coverage through Tox21. Therefore, though real-world bioactivity information is inherently powerful, we were able to access data on only two (2) relevant targets, and for only 128 drugs with available structure, teratogenicity data, and assay information. Only sixty-four (64) drugs had information available for both RAR and HDAC, available structure data, and a known teratology score. Hence, a major reason why the addition of Tox21 HTS data did not improve predictive accuracy or t-SNE clustering over a purely structural model was limited sample size. This issue remains intractable, given the inherently limited data resources currently available and little action on the part of data providers to address these quality issues.
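
The way each additional data requirement shrinks the usable sample can be illustrated with set intersections. The sketch below uses entirely hypothetical drug identifiers and coverage sets (none of them are data from this study), and `complete_cases` is an illustrative helper name.

```python
# Hypothetical coverage sets; the identifiers are placeholders, not study data.
has_structure = {"drug_a", "drug_b", "drug_c", "drug_d", "drug_e"}
has_teratogenicity_score = {"drug_a", "drug_b", "drug_c", "drug_d"}
has_rar_assay = {"drug_a", "drug_b", "drug_c"}
has_hdac_assay = {"drug_b", "drug_c", "drug_e"}


def complete_cases(*coverage_sets: set) -> set:
    """Return the drugs that appear in every one of the given coverage sets."""
    return set.intersection(*coverage_sets)


# Drugs usable by a purely structural model: structure plus a known score.
base = complete_cases(has_structure, has_teratogenicity_score)

# Drugs that additionally have data for at least one of the two assays.
one_assay = base & (has_rar_assay | has_hdac_assay)

# Drugs that additionally have data for both assays.
both_assays = complete_cases(base, has_rar_assay, has_hdac_assay)

print(len(base), len(one_assay), len(both_assays))
```

Each added requirement can only keep or reduce the number of eligible drugs, which is the pattern described above for the 128-drug and 64-drug subsets.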

Finally, we note that we designed this study to remain as translational and open-source as possible, though we encountered significant barriers to model development from the lack of published teratology and HTS data, as well as the lack of granularity and contextual relevance within available teratology scoring protocols like those of the FDA. All data that we employed in our ML models were publicly available, either from dedicated open-source databases or from public disclosures of multi-institutional research initiatives. These databanks are well-referenced in the relevant cheminformatics literature, as they provide high-quality information on the structure, pharmacology, and teratology of small molecules, per what is currently published. To review the clinical applicability of the drugs that we studied, we applied standard-of-care clinical decision support tools like UpToDate and Medscape, which contain peer-reviewed and data-driven documentation for the guidance they present. Within these tools, users may access the component publications that underlie the clinical decision support presented. Furthermore, these tools and their component data are available at no individual charge to most investigators who belong to an institution with an associated patient care facility, as these suites benefit from high-frequency use by staff at most medical centers57.

Conclusions

Current standards of evaluating small molecule teratogenicity are inherently unsystematic and driven by a lack of human data. This informs irresponsible prescribing behavior at the POC, reducing the quality of care for pregnant women and their developing fetuses. However, given the rigor of rules-based ML classification algorithms and limited “on-target” rationale for teratogenesis, there is potential to systematically predict a compound’s risk for fetal toxicity by leveraging AI on drug-specific information, such as drug structure, meta-structure, and existing real-world bioassay data, as a proxy for binding affinity to teratogenic targets.

In our study, we assert that drug structure is a good predictor of teratogenicity, using ROC analysis, unsupervised ML (t-SNE), and a supervised GBM to discover relationships between chemical functionalities within drugs prescriptible in pregnancy and existing teratogenicity information. This allowed us to identify moieties that appear to predispose a drug towards an increased chance of teratogenicity, based on existing use cases that are salient in relevant clinical and drug development literature. We also identify significant barriers to translational research in this space as rationale for the limited utility of existing meta-structural and toxicology HTS platforms for teratogenicity prediction tasks. The importance of these ontological considerations cannot be overstated in considering future research to improve the quality of data-driven maternal-fetal medicine.

Our team of investigators has formed a first-in-kind research collaboration of engineers, informaticians, and clinicians dedicated to the development of computational tools to predict adverse drug outcomes in pregnancy from existing healthcare data on pregnant populations and in vitro drug exposure models that are more representative of pregnant human physiology than the in vivo animal platforms currently employed in this space. This group—called Modeling Adverse Drug Reactions in Embryos (MADRE)13,90,91—proposes refinement of the teratogenicity QSAR reported in this manuscript by harnessing a more continuous spectrum of relevant phenotype information (Figure 5). Given that data quality and availability issues with teratogenicity scores restricted the scope of this study, we propose a medication history-wide association study (MedWAS) that can leverage billing-encoded, population-level EHR data as a label set. The benefit of MedWAS over QSAR is increased flexibility: associative study model architecture would not necessitate classification of adverse outcomes into rigid bins, as the QSAR requires37. Therefore, MedWAS would not be restricted by the limited availability of FDA-encoded teratogenicity data, giving a larger sample size of drugs eligible for analysis and a more continuous spectrum of phenotype information through which to quantify teratogenicity. In turn, this allows for easier validation of associative outcomes in silico and in vitro, as compared to similar hits from QSAR. Indeed, drugs identified as teratogenic through MedWAS may be referred to our QSAR model for validation, and vice versa. We have begun work on this MedWAS and look forward to further exploring its intersections with our teratogenicity QSAR.
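
To make the contrast with rigid classification bins concrete, an association study can report a continuous effect size for each drug and outcome pair. The sketch below computes an odds ratio with a Wald 95% confidence interval from a 2×2 exposure-by-outcome table. The text does not specify which statistic MedWAS would use, so the choice of an odds ratio, the function name `odds_ratio_ci`, and the counts are all assumptions made purely for illustration.

```python
import math


def odds_ratio_ci(a: float, b: float, c: float, d: float, z: float = 1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed pregnancies with the outcome
    b: exposed pregnancies without the outcome
    c: unexposed pregnancies with the outcome
    d: unexposed pregnancies without the outcome
    """
    # Haldane-Anscombe correction: add 0.5 to every cell if any cell is zero.
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    estimate = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Hypothetical counts, not drawn from any real EHR data.
estimate, lower, upper = odds_ratio_ci(12, 188, 30, 1770)
print(f"OR={estimate:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 would flag the drug for follow-up, for example by referral to the QSAR model; a real analysis would also need to adjust for confounders and multiple testing, which this sketch omits.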

Figure 5:

Our team—dubbed Modeling Adverse Drug Reactions in Embryos (MADRE)—leverages a broad knowledge base across the basic, applied, and clinical sciences to develop predictive models of adverse drug outcomes in pregnancy. We leverage the strengths of all sites within our network to optimize both the quantity and quality of data and analytical expertise that are essential to our QSAR and MedWAS models.

Acknowledgements

We thank Asher Schachter, MD, Senior Vice President, Clinical, and Head of Pharmaceutical Sciences at CAMP4 Therapeutics, for sharing teratogenicity data that he extracted from SafeFetus. We also thank Jeffery Goldstein, MD, PhD, Assistant Professor of Pathology at Northwestern University, for providing clinical consultation on our model and reviewing this manuscript.

Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number U54HG007963–05 and the National Center for Advancing Translational Sciences of the National Institutes of Health under Clinical and Translational Science Award Number U54TR02243–02. The content is solely the responsibility of the authors and does not represent the official views of the National Institutes of Health.

Abbreviations

RCT

randomized controlled trial

FDA

United States Food and Drug Administration

POC

point of care

MOA

mechanism of action

ROS

reactive oxygen species

ML

machine learning

AI

artificial intelligence

EHR

electronic health record

HTS

high-throughput screening

QSAR

quantitative structure-activity relationship

MOE

Molecular Operating Environment

Tox21

Toxicology in the 21st Century Initiative

API

application programming interface

3D-SDF

three-dimensional spatial data file

t-SNE

t-Distributed Stochastic Neighbor Embedding

GBM

gradient boosting machine

CV

cross-validation

RO5

Lipinski’s Rule of Five

HOMO

highest occupied molecular orbital

LUMO

lowest unoccupied molecular orbital

MADRE

Modeling Adverse Drug Reactions in Embryos

MedWAS

medication history-wide association study

Appendix

Table 1.

Teratogenicity scoring criteria established by the FDA are driven by a lack of human data, making them dangerously imprecise for application at the bedside7,13.

Classification | Attributes
A | Generally acceptable. Controlled studies in pregnant women show no evidence of fetal risk.
B | May be acceptable. Either animal studies show no risk but human studies not available or animal studies showed minor risks and human studies done and showed no risk.
C | Use with caution if benefits outweigh risks. Animal studies show risk and human studies not available or neither animal nor human studies done.
D | Use only in life-threatening emergencies when no safer drug available. Positive evidence of human fetal risk.
X | Do not use in pregnancy. Risks involved outweigh potential benefits. Safer alternatives exist.
N/A | Information not available

Table 2.

Teratogenesis converges on a limited subset of targets19,44–53.

Target Class | Mechanism of Action
dihydrofolate reductase (DHFR) | Inhibition of DHFR, both competitively and through antagonism of its folate cofactor, reduces the rate of purine and pyrimidine synthesis and DNA methylation reactions in a developing fetus. This leads to congenital malformations and neural tube defects.
retinoic acid receptor (RAR, RXR) | The nuclear ligand-inducible receptors RAR and RXR are mobile; they act as transcription factors for heavily-conserved developmental genes, including the Hox gene. Inhibition of these receptors leads to malformation of the neural crest.
androgen receptor (AR), estrogen receptor (ER) | Synthetic estradiols and androgens disrupt natural endocrine homeostasis within the developing fetus, resulting in errors of sexual differentiation.
prostaglandin H synthase (PHS), lipoxygenase (LPO) | PHS and LPO activation results in increased rates of prototeratogen oxidation, generating ROS with the potential to attack fetal DNA and generate mutagenicity.
angiotensin II (ATII) and angiotensin converting enzyme (ACE) | ACE and ATII receptor inhibitors reduce perfusion to developing fetal tissues, which especially affects peripheral structures such as the distal limbs. These agents also decrease the tone of fetal vasculature, leading to cardiovascular morbidity.
hydroxymethylglutaryl-coenzyme A (HMG-CoA) reductase | HMG-CoA reductase inhibitors downregulate the conversion of HMG-CoA to mevalonic acid, an essential step in cholesterol synthesis. In the developing fetus, cholesterol is an essential progenitor of lipid regulators of the SHH gene, which affects fetal patterning and morphogenesis. Therefore, HMG-CoA inhibition is associated with severe fetal malformation and lipid deficiencies.
histone deacetylase (HDAC) | HDAC proteins are essential in regulating gene expression by promoting chromatin unwinding. Therefore, HDAC inhibitors lead to a wide spectrum of morbidities (e.g., axial skeletal malformations) and may be fetal lethal.
cyclooxygenase-1 (COX-1) | COX-1 inhibition is associated with cardiac, midline, and diaphragm defects, as the release of prostaglandins required for healthy morphogenesis is reduced by interference within the COX-1 signaling pathway.
N-methyl-D-aspartate receptor (NMDAR) | NMDAR inhibition is associated with gross structural defects within the brain, resulting from dysregulation of neuronal migration, synapse formation, and synapse elimination in the developing fetus.
5-hydroxytryptamine (5-HT) receptor, 5-HT transporter | 5-HT is a neurotransmitter critical to craniofacial morphogenesis in development. Agents activating or inhibiting 5-HT, or promoting 5-HT reuptake, disrupt a critical 5-HT concentration, resulting in craniofacial malformations and other structural defects in the fetus.
γ-aminobutyric acid (GABA) receptor | GABA is a key inhibitory neurotransmitter that guides healthy testicular, ovarian, pancreatic, enteric, and palatal morphogenesis at a critical concentration. Enhancers of the GABA receptor are significantly associated with malformation of these tissues and are therefore implicated in morbidities such as cleft palate and atresia of the gastrointestinal tract.
carbonic anhydrase | Carbonic anhydrase hydrates carbon dioxide to promote pH homeostasis through the carbonic acid-bicarbonate buffering system. Inhibitors of this target are therefore implicated in pH disruption during development, resulting in metabolic diseases and limb malformation from large-scale misfolding of key proteins at non-physiological pH.

Footnotes

Conflicts of Interest

We declare no competing interests relevant to the execution or outcomes of this study.

References
