EPA Author Manuscripts
Author manuscript; available in PMC: 2024 Dec 1.
Published in final edited form as: Regul Toxicol Pharmacol. 2023 Oct 21;145:105500. doi: 10.1016/j.yrtph.2023.105500

Development and application of a systematic and quantitative weighting framework to evaluate the quality and relevance of relative potency estimates for dioxin-like compounds (DLCs) for human health risk assessment

Daniele Wikoff a,*, Caroline Ring b,1, Michael DeVito c, Nigel Walker d, Linda Birnbaum e,f, Laurie Haws b
PMCID: PMC10941990  NIHMSID: NIHMS1967059  PMID: 37866700

Abstract

The toxic equivalency factors (TEFs) approach for dioxin-like chemicals (DLCs) is currently based on a qualitative assessment of a heterogeneous data set of relative potency estimates (REPs) spanning several orders of magnitude with highly variable study quality and relevance. An effort was undertaken to develop a weighting framework to systematically evaluate and quantitatively integrate the quality and relevance of REPs for development of more robust TEFs. Six study characteristics were identified as most important in characterizing the quality and relevance of an individual REP for human health risk assessment: study type, study model, pharmacokinetics, REP derivation method, REP derivation quality, and endpoint. Subsequently, a computational approach for quantitatively integrating the weighting framework parameters was developed and applied to the REP2004 database. This was accomplished using a machine learning approach that infers a weighted TEF distribution for each congener. The resulting database, weighted for quality and relevance, provides REP distributions from >600 data sets (including in vivo and in vitro studies, a range of endpoints, etc.). This weighted database provides a flexible platform for systematically and objectively characterizing TEFs for use in risk assessment, as well as information to characterize uncertainty and variability. Collectively, this information supports risk managers in decision making.

Keywords: Dioxins, PCBs, Relative potency, REPs, TEFs, Risk assessment, Quality, Bayesian, Integration

1. Introduction

Potential human health risks associated with exposures to “dioxin-like compounds” (DLCs)—which are mixtures of a specific subset of polychlorinated dibenzo-p-dioxins (PCDDs), dibenzofurans (PCDFs), and biphenyls (PCBs)—are typically assessed using a toxic equivalency factor (TEF) approach. The TEF approach is a “relative potency” scheme: the potency of each individual congener is expressed relative to the potency of an index chemical, typically 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), generally recognized as the most toxic and well-studied member in this class of DLCs (Haws et al., 2006a; Van den Berg et al., 1998, 2006). The concentration of each congener is then multiplied by its corresponding TEF to yield a toxic equivalent (TEQ), and the toxicity of a mixture of congeners is determined by summing the TEQs for each congener to yield a total toxic equivalent, or total TEQ, effectively the equivalent amount of TCDD. The TEF approach is used by authoritative bodies worldwide to assess exposure and risk from DLCs.
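The TEQ arithmetic described above can be sketched in a few lines. This is an illustrative calculation only: the TEF values for PCB 77 and PeCDD match values cited later in this paper (TCDD, as the index chemical, has a TEF of 1 by definition), but the congener concentrations are hypothetical.

```python
# Sketch of the TEQ calculation described above.
# TEFs for TCDD, PeCDD, and PCB 77 are as cited in the text;
# the congener concentrations are hypothetical illustrative numbers.
tefs = {"TCDD": 1.0, "PeCDD": 1.0, "PCB77": 0.0001}
concentrations = {"TCDD": 0.5, "PeCDD": 1.2, "PCB77": 300.0}  # e.g., pg/g

# Each congener's TEQ is its concentration times its TEF ...
teqs = {c: concentrations[c] * tefs[c] for c in tefs}
# ... and the mixture's total TEQ is the sum over congeners.
total_teq = sum(teqs.values())
print(total_teq)  # 0.5 + 1.2 + 0.03 = 1.73
```

In practice the concentrations would be measured congener levels in the mixture of interest.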

To be included in the TEF approach, congeners must meet the following criteria defined by the World Health Organization (WHO): (1) they are structurally similar to TCDD; (2) they have been shown to bind to the aryl hydrocarbon receptor (AhR); (3) they induce a spectrum of AhR-mediated biochemical and toxic responses similar to TCDD; and (4) they are persistent in the environment and bioaccumulate in tissues (Haws et al., 2006a; Van den Berg et al., 1998, 2006). The focus of this current assessment is on the chlorinated congeners that were the subject of the last comprehensive assessment by the WHO (Van den Berg et al., 2006) and includes 29 congeners: 7 PCDDs, 10 PCDFs, 4 non-ortho PCBs, and 8 mono-ortho PCBs (see supplemental material for a list of the individual congeners included in the TEF methodology).

The current WHO TEF values (Van den Berg et al., 2006) represent point estimates assigned by a 2005 WHO expert panel based on application of scientific judgment to a database of relative potency (REP) estimates derived from a multitude of studies conducted over the past several decades. The REP database used by the WHO expert panel was developed in 2004 and is referred to as the REP2004 database (Haws et al., 2006a). Notably, the distributions of REP values underlying each congener often span several orders of magnitude, and the percentile at which each TEF falls within its distribution varies considerably. For example, PCB 77 has 49 REP estimates that range from 2E-6 to 0.48; the TEF value of 0.0001 lies at the 25%ile of the distribution (Haws et al., 2006a). In contrast, the distribution for 1,2,3,7,8-pentachlorodibenzo-p-dioxin (PeCDD) comprises 45 REPs, ranging from 0.044 to 1.5, and the TEF value of 1 is at the 94%ile of the distribution. Recognizing the importance of the underlying REP distributions, the WHO TEF panel endeavored to use the 75%ile of the unweighted distributions, based on in vivo data, as a guide to determine the need for re-evaluation of a TEF.
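The percentile comparisons above can be reproduced with a simple empirical percentile-rank calculation; the REP values below are hypothetical stand-ins, not the actual REP2004 data.

```python
# Empirical percentile rank of a TEF within its REP distribution
# (illustrative REP values only; not the actual REP2004 data).
def percentile_rank(reps, tef):
    """Percentage of REP estimates at or below the TEF value."""
    return 100.0 * sum(1 for r in reps if r <= tef) / len(reps)

reps = [2e-6, 5e-5, 1e-4, 3e-4, 1e-3, 0.01, 0.1, 0.48]
print(percentile_rank(reps, 1e-4))  # 37.5: 3 of 8 REPs are <= 0.0001
```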

During the 2005 review, the WHO panel also considered using the distribution of REP2004 values for each congener quantitatively to derive TEF values and estimates of uncertainty/variability. For example, the mean or median of REP2004 values for a congener could have been assigned as the TEF; upper and lower percentiles of REP2004 values could have served as metrics of REP2004 variability. However, the panel ultimately concluded that not all REP2004 values are created equal. Studies in the REP2004 database are extremely heterogeneous, including many different study types (in vivo, in vitro), designs, durations (e.g., chronic, acute), endpoints evaluated (e.g., ranging from enzyme induction to cleft palate), species (e.g., rats, mice, humans), etc. Given these study attributes, the WHO panel expressed greater confidence in some REP2004 values than others, and felt that these REP2004 values should be weighted more heavily than others. For this reason, the panel ultimately concluded that REP2004 distributions could not be used quantitatively (i.e., by setting TEFs equal to mean, median, or a percentile of the REP2004 distribution for each congener) to derive TEFs until an approach could be developed to weight REP2004 values based on the quality of the underlying studies (Van den Berg et al., 2006).

Therefore, the objective of the current study was to develop a framework to systematically evaluate and quantitatively integrate the quality and relevance of REPs for DLCs. To do so, the first step was to develop a consensus-based weighting framework based on study characteristics that are believed most important when evaluating REP quality and relevance for purposes of human health risk assessment. The second step was to develop a computational approach for quantitatively integrating the weighting framework and apply it to the REP2004 database. The combined result is a database of weighted REPs that could be used to facilitate data-driven selection of TEFs. Of note, these efforts began following the 2005 WHO expert panel meeting and collectively represent an early application of a highly tailored systematic review process which was developed before systematic review methodology was formally considered for use in toxicology and risk assessment, as it is currently.

2. Materials and methods

2.1. Step 1: development of a consensus-based weighting framework to assess REP quality and relevance

To achieve the goal of developing a consensus-based weighting framework, a panel of scientists having substantial expertise with DLCs was convened and included several participants from the WHO 2005 Panel: Drs. Linda Birnbaum, Michael DeVito, Laurie Haws, Nigel Walker, and Daniele Wikoff (hereafter referred to as the Study Panel). Expertise and input were sought throughout development of the framework from other members of the WHO panel, as well as from individuals who did not participate. Thus, “consensus” herein refers to agreement of this expert panel.

From initiation of work in ~2006, the study panel was in agreement that the approach must be: (1) objective and transparent; (2) relatively simple to understand and founded on established scientific and statistical principles; (3) able to be readily applied to the existing REP database; and (4) able to be readily applied to new REPs going forward, such that the TEFs used in human health risk assessment would always reflect the current state of the science in terms of data availability. Since the time of development of an initial REP weighting framework (Haws et al., 2006b), evidence-based systematic review methods have become increasingly popular as a tool to increase transparency and to incorporate assessment of study quality into the evaluation of animal toxicology and epidemiology data for the purpose of understanding potential human health risks (e.g., Stephens et al., 2016; NTP OHAT, 2019). Therefore, the original weighting framework was also assessed relative to these systematic review methods and refined to reflect concepts where applicable. The development of the weighting framework, and subsequent application of the framework to the REP2004 database, were conducted in a stepwise fashion.

The first step involved selecting the specific study elements to include in the weighting framework (discussed in the results). The focus was on those study elements that most influenced REP quality and relevance for human health risk assessment purposes, while keeping the weighting scheme as simple as possible (thereby increasing utility and feasibility).

The original qualitative criteria used by the WHO expert panel in deriving the WHO1998 TEFs served as the starting point. These criteria included study type, study duration, and endpoint, and were applied in the following manner by the WHO expert panel: REPs from in vivo studies were given more weight than those from in vitro studies, which in turn were given more weight than QSAR (quantitative structure-activity relationship)-based REPs; REPs from chronic studies were given more weight than those from subchronic studies, followed by subacute and then acute studies; and REPs that represented toxic endpoints were given more weight than those based on biochemical endpoints (Van den Berg et al., 1998). The scientific basis of each of these original criteria was evaluated thoroughly. Additional criteria believed to be potentially important contributors to REP quality and relevance were identified and evaluated by the Study Panel; these included (but were not limited to) route of administration; chemical purity; exposure duration; tissue type (in vivo only); delay between final dose and measurement of effect; specific endpoints (e.g., tumor promotion, CYP1A2 induction); species and/or strain (or cell type); number of dose levels; attainment of a maximal response; method of REP derivation; vehicle; animal age and sex; number of animals per treatment group; use of controls; and the reference compound used. Following iterative assessment by subject matter experts, and consideration of how completely attributes were reported, a refined subset of these study attributes was ultimately chosen by the Study Panel for use in the weighting framework. The review and selection of attributes are discussed in the results.

2.2. Step 2: quantitative integration of the weighting framework via Bayesian-inference meta-analysis approach

Substantial efforts were expended to consider a variety of integration methods beyond those originally considered (e.g., Scott et al., 2006). These ranged from simple matrices and algorithms to more complex approaches. The more complex approach described here represents the final agreed-upon method. This approach was selected because it facilitated a more objective integration technique that did not rely on the Study Panel to assign discrete weights or scales. It also allowed for the use of machine learning and Bayesian techniques, thus providing more information for REP characterizations (including variability and uncertainty) and aligning the approach with current risk assessment approaches (NAS, 2014).

The quantitative integration of REP estimates with weighting data was accomplished using a Bayesian-inference meta-analysis approach (Fig. 1). The result is the development of a weighted uncertainty distribution of TEFs for each congener (vs. separate, individually weighted REP values). The meta-analysis approach assumes that each congener has a single unknown underlying TEF value (representing its “true” potency relative to TCDD). In a perfect world, all studies would measure REP exactly equal to this unknown TEF. However, the world not being perfect, each REP study measures that TEF value plus or minus some unknown amount of error. A standard approach for estimating an effect measured by multiple studies, with varying amounts of error, is a meta-analysis. The simplest meta-analysis computes a weighted average of effect estimates reported from individual studies, where the weights are the inverse reported variances (measurement errors). In the REP database, the effect reported by each study is the measured REP (box “D1” in Fig. 1), but the studies do not report associated REP measurement errors that can be used as weights (e.g., standard deviation or confidence intervals on reported REP values). For a REP meta-analysis, weights must be estimated by other means. Intuitively, the Study Panel’s recommendation to weight higher-quality REP studies more heavily when estimating TEFs implies that higher-quality studies are expected to measure TEF with less error, and lower-quality studies to measure TEF with more error. Therefore, expected measurement error can be estimated as a function of study quality. One part of our model (box “M1” in Fig. 1) estimates a relationship between study quality category and expected measurement error variance.
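The standard inverse-variance meta-analysis mentioned above can be sketched as follows; the effect estimates and standard errors are hypothetical, since (as noted) REP studies do not actually report such errors.

```python
# Classic inverse-variance meta-analysis, as described above:
# each study's effect estimate is weighted by 1 / (its error variance).
# Effect estimates and standard errors below are hypothetical.
effects = [0.8, 1.1, 0.95]   # effect estimate reported by each study
std_errs = [0.1, 0.3, 0.05]  # reported measurement error (standard error)

weights = [1.0 / se**2 for se in std_errs]  # inverse variances
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
print(round(pooled, 4))  # 0.9239, dominated by the most precise study
```

The REP meta-analysis differs in that these weights are not reported and must instead be estimated from study quality, as described below.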

Fig. 1.

Flowchart diagram of the approach to quantitative REP weighting. Boxes D1-D3 represent data elements; boxes M1-M2 represent model elements. Study attributes (D2) from Step 1 include: study type, study model, pharmacokinetics (PK), REP derivation quality, REP derivation method, and endpoint. Quality categories (D3) range from 1 (highest quality) to 5.5 (lowest quality), using half-steps.

Study quality categories for use in the Bayesian meta-analysis were based on overall categorical characterizations of REP quality assigned by the Study Panel. For each REP in the REP2004 database, the Study Panel considered study attributes from the weighting framework (box “D2” in Fig. 1) and assigned a “weight of evidence” categorical quality designation (box “D3” in Fig. 1), using half-steps from 1 (highest quality) to 5.5 (lowest quality). The weight of evidence categorical assignments were based on expert judgment when considering the totality of attributes important to the reliability and relevance of a REP estimate (used to train the model, described further below). All QSAR-based REP values were assigned category 5.5, whereas the in vivo and in vitro-based REPs were assigned categories between 1 and 5. No REPs were assigned a category of 1.5; therefore, this category was not used in further analysis. (During this process, the Study Panel noted that three REP values for PCB 126 in the REP2004 database were duplicate measures of the same endpoint in the same study, calculated using different modeling approaches. The results of the study initially reported in conference proceedings by Walker et al. [2000] were later re-analyzed using a more sophisticated method and published in the peer-reviewed literature [Toyoshiba et al., 2004]. Therefore, the three REP values for PCB 126 initially reported by Walker et al. [2000] were removed from the REP2004 database, and only the values from the later publication [Toyoshiba et al., 2004] were retained.)

Although the category labels appear numeric, they were treated purely as ordered labels (i.e., as though they had been “best, better, good, acceptable ...” rather than “1, 2, 2.5, 3 ...”). The numeric values themselves were not used quantitatively; for example, a category-3 REP was not weighted 3 times less than a category-1 REP. Instead, quantitative weights for each category were estimated as part of a Bayesian-inference process, described below. That is, these categorizations were not themselves used as weighting criteria, scales, or scores; rather, they were used to inform or train the model to quantitatively integrate the weighting parameters in the Bayesian-inference model.

It was recognized that the Study Panel–assigned quality categories themselves (box “D3” in Fig. 1) are associated with some uncertainty. To quantify this uncertainty, a second part of our model (box “M2” in Fig. 1) estimates the probability that the Study Panel might have assigned each study to each of the nine quality categories, by training a machine-learning model to reproduce the Study Panel’s implicit expert judgment about the relationship between study attributes and quality category. The machine-learning model used the method of random forests (RF) (Breiman, 2001), implemented with the R package “randomForest” (Liaw and Wiener, 2002). RF consists of a large ensemble of classification trees (here, 10,000 trees), each trained on a randomly-chosen (approximately 2/3) subset of the training data to avoid overfitting. The RF model was trained on the Study Panel–assigned quality categories, using the study attributes as features (predictor variables). Each tree in the RF ensemble produced its own model-predicted category for each study. The fraction of trees that predict study i in category k can be interpreted as the probability that study i falls into category k. (To illustrate how a classification tree works, the relationship between study attributes and study quality categories was also analyzed using a single classification tree, using the R package “rpart” [Therneau and Atkinson, 2018]. However, the results of the single classification tree were not used as part of the Bayesian inference approach, because the ensemble results of the RF supersede any individual tree.)
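The “fraction of trees” probability interpretation can be illustrated with a deliberately simplified toy ensemble in Python (the paper's actual analysis used the R randomForest package; the single-attribute “trees” below are hypothetical stand-ins for classification trees trained on bootstrap samples of the panel's assignments):

```python
import random

# Toy illustration (not the R randomForest implementation used in the
# paper): an "ensemble" of randomly perturbed rule-based classifiers
# votes on a study's quality category, and the fraction of votes for
# each category is read as the probability of that category.
random.seed(0)

def make_tree():
    # Each "tree" applies a randomly jittered cutoff to a single
    # hypothetical attribute score, mimicking trees trained on
    # different random subsets of the training data.
    cutoff = 0.5 + random.uniform(-0.2, 0.2)
    return lambda score: 2 if score < cutoff else 3

trees = [make_tree() for _ in range(10_000)]

study_score = 0.55  # hypothetical attribute score for one study
votes = [tree(study_score) for tree in trees]
# Fraction of trees voting for each category = estimated probability;
# roughly 0.375 for category 2 and 0.625 for category 3 in expectation.
probs = {k: votes.count(k) / len(votes) for k in (2, 3)}
print(probs)
```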

2.3. Statistical model of TEFs from quality-weighted REP distributions

To estimate the unknown TEFs and quality-dependent expected measurement errors, a statistical model was formulated. Conceptually, the expected measurement error for a REP can be estimated as a function of study quality category (box “M1” in Fig. 1). However, quality categories are uncertain, expressed as a probability that a given REP is in each category (box “M2” in Fig. 1). To fully characterize the error in a particular REP (the difference between the log-transformed measured REP and the log TEF), one needs to know both the probability that the REP is in each category and the expected error associated with each category. To get the overall probability of observing a given error for a particular REP, first calculate the probability of observing that error if the REP is in each quality category (defined by a zero-mean normal distribution whose standard deviation is the expected error for that category); then multiply by the probability that the REP is in that category; and sum across categories (Equation (1)).

$$\log(\text{measured REP}_{i,j}) - \log(\text{TEF}_j) \sim \sum_{k} p(\text{quality}_i = k) \times \text{Normal}(\mu = 0,\ \sigma = f(k)) \qquad \text{Equation 1}$$

In Equation (1):

  • $\text{TEF}_j$ represents the unknown TEF for congener j

  • $\text{measured REP}_{i,j}$ represents the measurement of the REP for congener j made by study i

  • $\text{quality}_i$ represents the quality category for study i

  • $p(\text{quality}_i = k)$ represents the probability that study i has quality category k, where k is one of the nine quality categories (1, 2, 2.5, ..., 5, 5.5)

  • $\text{Normal}(\mu = 0, \sigma = f(k))$ represents a normal distribution with zero mean and standard deviation $\sigma$, related to study quality category by a function $f$, $\sigma = f(k)$

The relationship between study quality and expected error SD was assumed to be the same for all congeners, and the expected error SD was assumed to increase as study quality was judged to decrease. To describe this relationship mathematically, f was formulated as a piecewise-constant function (Equation (2)).

$$f(k) = \begin{cases} \sigma_{1}, & k = 1 \\ \sigma_{2}, & k = 2 \\ \sigma_{2.5}, & k = 2.5 \\ \quad\vdots \\ \sigma_{5.5}, & k = 5.5 \end{cases} \qquad 0 < \sigma_{1} < \sigma_{2} < \sigma_{2.5} < \cdots < \sigma_{5.5} \qquad \text{Equation 2}$$

The inverse squares of the quality-related expected error SDs (inverse variances) can be interpreted as REP quality weights (analogous to a standard meta-analysis). For any individual REP, the error variance (meta-analysis weight) is a weighted average of the expected variances for each quality category, weighted by the probability that the REP is in each quality category.
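A minimal sketch of this probability-weighted effective variance, with hypothetical category SDs and category probabilities:

```python
# Sketch of the per-REP effective error variance described above:
# a weighted average of the expected variances for each quality
# category, weighted by the probability the REP is in that category.
# Category SDs and probabilities are hypothetical illustrative values.
category_sd = {1: 0.2, 2: 0.5, 3: 1.0}    # sigma_k, increasing with k
category_prob = {1: 0.1, 2: 0.7, 3: 0.2}  # p(quality_i = k) from the RF

effective_var = sum(p * category_sd[k] ** 2
                    for k, p in category_prob.items())
meta_weight = 1.0 / effective_var  # inverse-variance weight for this REP
print(round(effective_var, 4), round(meta_weight, 4))  # 0.379 2.6385
```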

In Equations (1) and (2), the congener TEFs $\text{TEF}_j$ and the expected error SDs $\sigma_1, \ldots, \sigma_{5.5}$ are unknown and must be estimated from the data. The Bayesian-inference approach estimates the uncertainty distributions for these parameters given the data (in Bayesian jargon, the “posterior distribution”), where the data consist of the measured REPs and the probabilities for the quality category of each REP estimated by the RF model. The approach requires a starting description of the uncertainty in the log TEFs and the expected error SDs before considering the data; in Bayesian jargon, this is called the “prior.” The prior for the log TEFs was uniform, reflecting an a priori assumption that no TEF value was more likely than any other (e.g., there was no a priori assumption that TEFs should be similar to their WHO (2005) values). The prior for each expected error SD was a half-Cauchy distribution with scale parameter equal to the range of the log measured REPs across all congeners (approximately seven orders of magnitude). This prior reflected two a priori assumptions: (1) that error SDs must be positive; and (2) that the expected error SDs are probably not very much larger than the scale of the measured REPs themselves (Gelman and Hill, 2006). In Bayesian jargon, these are “weakly informative” priors that do not make strong assumptions about the TEFs or expected error SDs before considering the data. As a result, the final estimated distributions are driven primarily by the available data, rather than by the prior assumptions.

The Bayesian inference approach was implemented using the open-source Stan modeling language, accessed from R (R Core Team, 2018) using the package “rstan” (Stan Development Team, 2018). Using the mathematical model formulations in Equations (1) and (2), and the results of RF, Stan used a Markov Chain Monte Carlo algorithm to draw samples from the posterior distributions for TEFs and quality-dependent expected error SDs. After the algorithm’s adaptation period (during which 5000 samples were drawn and discarded), 2000 samples were drawn in each of four independent Markov chains, for a total of 8000 samples from each distribution.
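The sampling step can be illustrated with a toy random-walk Metropolis sampler for a single congener's log TEF, assuming known per-REP error SDs and a flat prior (the actual analysis used Stan's MCMC on the full model in Equations (1) and (2); all numbers below are hypothetical):

```python
import math
import random

# Toy random-walk Metropolis sketch of the sampling step. This is only
# an illustration: one congener's log TEF is sampled given measured
# log REPs and known, hypothetical per-REP error SDs, under a flat
# prior on log TEF (so the posterior is proportional to the likelihood).
random.seed(1)
log_reps = [-0.2, 0.1, -0.1, 0.05]  # hypothetical measured log REPs
sds = [0.3, 0.6, 0.3, 0.5]          # hypothetical expected error SDs

def log_lik(log_tef):
    # Normal measurement-error likelihood: log REP ~ Normal(log TEF, sd)
    return sum(-0.5 * ((r - log_tef) / s) ** 2 - math.log(s)
               for r, s in zip(log_reps, sds))

samples = []
cur = 0.0
cur_ll = log_lik(cur)
for step in range(20_000):
    prop = cur + random.gauss(0.0, 0.2)       # random-walk proposal
    prop_ll = log_lik(prop)
    if math.log(random.random()) < prop_ll - cur_ll:
        cur, cur_ll = prop, prop_ll           # accept the proposal
    if step >= 5_000:                         # discard warm-up draws
        samples.append(cur)

posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 3))
```

The posterior mean lands near the precision-weighted mean of the log REPs, which is the behavior the weighted meta-analysis relies on.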

To evaluate the effect of weighting REPs based on study quality, an “unweighted” Bayesian hierarchical model was also implemented that did not include study quality. This model was formulated similarly to the “weighted” model (Equation (1)), but the measurement errors were assumed to obey a zero-mean normal distribution with a constant standard deviation $\sigma_{\epsilon}$. Again, the error variance represents an inverse weight for each REP study; however, in the unweighted model, the weight is the same for every study. Stan was used to sample the probability distributions of the log TEF for each congener, and the constant error SD, in this unweighted model. The distributions of the log TEF for each congener were compared between weighted and unweighted models to assess the effect of using the weighting framework.

3. Results

3.1. Consensus-based weighting framework

The study elements believed to influence REP quality and relevance for human health risk assessment were identified; all such study elements were then discussed in detail, and the Study Panel ultimately selected those believed to be the most important. The final study elements recommended for inclusion in the consensus-based weighting framework include study type, study model, pharmacokinetics (PK), REP derivation quality, REP derivation method, and endpoint (Fig. 2), each of which is discussed below.

Fig. 2.

Consensus-based weighting framework and corresponding hierarchical scoring categories.

3.1.1. Study type

As described above, study type was originally considered as a criterion, albeit qualitatively, by the WHO study panel in 1997 (Van den Berg et al., 1998). The Study Panel agreed that study type was an important criterion with respect to relevance for human health risk assessment and should be included in the consensus-based weighting framework. Additionally, because of the recent publication of REPs based on in vitro studies using human cell lines (e.g., Sutter et al., 2010), the panel determined that it was important to include this as an additional distinguishing characteristic in the framework. After much deliberation, the panel concluded that REPs from in vivo studies should continue to be given more weight than those from in vitro studies involving primary human cell lines; these, in turn, should be given more weight than REPs from in vitro studies involving immortalized human cell lines, followed by those from in vitro studies involving non-human mammalian cell lines, and finally QSAR-based REPs.

3.1.2. Endpoint

As noted above, the type of endpoint underlying each REP was one of the qualitative criteria considered by the 1997 WHO expert panel (Van den Berg et al., 1998). In considering the inclusion of this study element in the consensus-based weighting framework, the Study Panel discussed several options, ranging from the addition of a “cancer” category to the omission of this study element from the consensus-based weighting framework. The panel recognized that the actual endpoint, in theory, should not greatly affect the relative potency of the response; however, in practice, endpoints used as the basis for the REP that are further down the pathway from AhR activation have less consistent REP values on both intra- and inter-species bases (see Haws et al., 2006a for further discussion). This difference may be due to some DLCs having multiple modes of action. For example, TCDD and PCB 118 both induce EROD activity to the same magnitude in mice (DeVito et al., 2000), yet in rats, TCDD produces only a 50% decrease in serum thyroxine, whereas PCB 118 decreases serum thyroxine by greater than 90% (Crofton et al., 2005). The difference in these dose-response relationships arises because, at higher concentrations, PCB 118 has CAR and PXR activities that are also involved in the decreases in thyroid hormones. Thus, more complex biology may be affected by off-target activities of the weaker dioxin-like chemicals.

Additional discussions on the endpoint parameter included weighting REPs based on commonly used risk assessment categories (cancer and noncancer endpoints). Some Study Panel members suggested that this “impact-based” approach be utilized, and that the endpoints be evaluated separately and broken down further than toxic or biochemical. Further differentiation of endpoints based on “dynamics” was also considered: effects or endpoints that occurred within 24 h would have been classified as “high,” days to weeks as “medium,” and weeks to years as “low.” Ultimately, the panel concluded that endpoint should be retained in its simplest and most scientifically defensible form by categorizing all endpoints as either Ah-mediated toxic or Ah-mediated biochemical. Based on these discussions, the panel concluded that REPs based on Ah-mediated toxic responses would be given more weight than those based on Ah-mediated biochemical responses. Both would be given more weight than QSAR-based REP values.

The panel reviewed all endpoints in the REP2004 database (approximately 150 different endpoints) and classified each one as either toxic or biochemical (Table 1). (Note that only those endpoints that were, in fact, Ah-mediated were retained in creating the REP2004 database [Haws et al., 2006a].) The individual REP classifications are shown in the supplementary materials section of this paper.

Table 1.

Categories of REP endpoints relative to “Endpoint” and “Study Model” parameters.

| REP Endpoint (as listed in REP database) | Weighting Parameter: Endpoint | Weighting Parameter: Study Model |
| --- | --- | --- |
| IN VIVO ENDPOINTS | | |
| # days in estrus | Toxic | Organismal |
| # of ova | Toxic | Multicellular/organ level |
| 4-OH-AA | Biochemical | Unicellular |
| A4H | Biochemical | Unicellular |
| ACOH | Biochemical | Unicellular |
| Adrenal Cortex Atrophy 2 years POLY-3 | Toxic | Multicellular/organ level |
| AHH | Biochemical | Unicellular |
| Benzo(a)pyrene hydroxylase | Biochemical | Unicellular |
| biphenyl-4-hydroxylase | Biochemical | Unicellular |
| body weight | Toxic | Organismal |
| body weight gain | Toxic | Organismal |
| body weight loss | Toxic | Organismal |
| cholangiocarcinoma | Toxic | Multicellular/organ level |
| CKE - cystic keratinizing epithelioma | Toxic | Multicellular/organ level |
| Cleft palate | Toxic | Multicellular/organ level |
| DRL reinforcement rate | Toxic | Organismal |
| ECOD | Biochemical | Unicellular |
| EROD | Biochemical | Unicellular |
| foci | Toxic | Multicellular/organ level |
| FR response rate | Toxic | Organismal |
| GGT+-foci | Toxic | Multicellular/organ level |
| Heart Cardiomyopathy 2 years POLY-3 | Toxic | Multicellular/organ level |
| hepatocellular adenoma | Toxic | Multicellular/organ level |
| IgM | Toxic | Organismal |
| IgM units, anti-TNP | Toxic | Organismal |
| kidney changes | Toxic | Multicellular/organ level |
| kidney damage | Toxic | Multicellular/organ level |
| LD50 | Toxic | Organismal |
| Liver Cholangiofibrosis 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver Eosinophilic Focus 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver Fatty Change Diffuse 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver Fatty Change Diffuse 53 weeks interim | Toxic | Multicellular/organ level |
| Liver Hyperplasia 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver Pigmentation 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver Pigmentation 31 weeks interim | Toxic | Multicellular/organ level |
| Liver Pigmentation 53 weeks interim | Toxic | Multicellular/organ level |
| Liver Toxic Hepatopathy 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver Toxic Hepatopathy 53 weeks interim | Toxic | Multicellular/organ level |
| Liver: Bile Duct Cyst 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver: Bile Duct Hyperplasia 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver: Bile Duct Hyperplasia 53 weeks interim | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Hypertrophy 14 weeks interim | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Hypertrophy 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Hypertrophy 31 weeks interim | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Hypertrophy 53 weeks interim | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Multinucleated 2 years POLY-3 | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Multinucleated 31 weeks interim | Toxic | Multicellular/organ level |
| Liver: Hepatocyte Multinucleated 53 weeks interim | Toxic | Multicellular/organ level |
| Liver: Oval Cell Hyperplasia 2 years POLY-3 | Toxic | Multicellular/organ level |
| Lung Metaplasia 2 years POLY-3 | Toxic | Multicellular/organ level |
| Lung: Alveolar Epithelium Metaplasia Bronchiolar 2 years POLY-3 | Toxic | Multicellular/organ level |
| ova/rat | Toxic | Multicellular/organ level |
| ovarian weight gain | Toxic | Multicellular/organ level |
| Pancreas Inflammation Chronic 2 years POLY-3 | Toxic | Multicellular/organ level |
| Pancreas: Acinus Atrophy 2 years POLY-3 | Toxic | Multicellular/organ level |
| Pancreas: Acinus Vacuolization Cytoplasmic 2 years POLY-3 | Toxic | Multicellular/organ level |
| Pancreas: Artery Inflammation Chronic 2 years POLY-3 | Toxic | Multicellular/organ level |
| PFC | Toxic | Organismal |
| PFC/10^6 viable cells | Toxic | Organismal |
| PFC/viable cells | Toxic | Organismal |
| Porphyrin | Biochemical | Multicellular/organ level |
| Promotion index | Toxic | Multicellular/organ level |
| Rel liver weight | Toxic | Multicellular/organ level |
| Rel thymus weight | Toxic | Organismal |
| Retinol conc. | Biochemical | Unicellular |
| s-ALAT | Toxic | Multicellular/organ level |
| s-ASAT | Toxic | Multicellular/organ level |
| SCC - gingival squamous cell carcinoma | Toxic | Multicellular/organ level |
| Serum E2 | Biochemical | Organismal |
| Serum FSH | Biochemical | Organismal |
| Serum LH | Biochemical | Organismal |
| Serum P4 | Biochemical | Organismal |
| SRBC ELISA | Toxic | Organismal |
| T4 conc. | Biochemical | Organismal |
| TT4 | Biochemical | Organismal |
| UDPGT | Biochemical | Unicellular |
| UGT1A1 | Biochemical | Unicellular |
| Vitamin A | Biochemical | Unicellular |
| volume fraction | Toxic | Multicellular/organ level |
| IN VITRO ENDPOINTS | | |
| % S-Phase | Toxic | Unicellular |
| 2-MeOE1/2 formation | Biochemical | Unicellular |
| 2-MeOE2 formation | Biochemical | Unicellular |
| 5-alpha reductase activity | Biochemical | Unicellular |
| Ah Receptor binding | Biochemical | S |
| AHH | Biochemical | Unicellular |
| antiestrogenicity (as anti-proliferation) | Toxic | Multicellular/organ level |
| aromatase (CYP19) | Biochemical | Unicellular |
| Binding affinity | Biochemical | S |
| CAFLUX | Biochemical | Unicellular |
| CALUX | Biochemical | Unicellular |
| Cell Number | Toxic | Multicellular/organ level |
| cell proliferation | Toxic | Multicellular/organ level |
| Cell size | Toxic | Unicellular |
| Cellular oxidative stress DR | Biochemical | Unicellular |
| Competitive binding | Biochemical | S |
| CYP1A1 mRNA | Biochemical | Unicellular |
| CYP1A1 protein | Biochemical | Unicellular |
| DR-CALUX assay | Biochemical | Unicellular |
| EROD | Biochemical | Unicellular |
| flat cell effect | Toxic | Multicellular/organ level |
| intercellular communication | Biochemical | Multicellular/organ level |
| luciferase expression | Biochemical | Unicellular |
| PFC response | Toxic | Unicellular |
| PSA secretion | Biochemical | Unicellular |
| Receptor binding | Biochemical | Unicellular |
| Viable thymocytes | Toxic | Unicellular |

3.1.3. Study model

Although study model was not one of the criteria in the original qualitative framework used by the WHO expert panel in 1997 (Van den Berg et al., 1998), it was added to ensure that REPs representing more complex biological responses were given higher weight. Additionally, this parameter considers pharmacodynamic issues not specifically addressed in the pharmacokinetic component (discussed below). This was particularly important given that REPs from both in vivo and in vitro studies would be combined in the weighting exercise. Therefore, all REP values were evaluated and classified into one of four categories based on the toxicity endpoint: organismal, multicellular/organ level, unicellular, or QSAR (in order of decreasing biological complexity), similar to the organizational construct of an adverse outcome pathway.

Organismal responses are the most complex and reflect a series of kinetic and dynamic changes required to generate the given change in the endpoint as a result of exposure. For example, changes in body weight, cleft palate, and relative thymus weight are complex responses and were thus classified as organismal. Multicellular/organ-level responses are less complex but still require a multicellular system in order to observe changes. Examples include changes in relative liver weight, cell number, cell proliferation, and lung metaplasia. Unicellular responses are those that can be measured in a single cell, such as changes in EROD, CYP1A2 protein levels, and CYP19 (aromatase) levels. QSAR studies were considered to have the lowest level of biological complexity and were thus given the least weight for the study model parameter.

Based on these deliberations, the Study Panel concluded that REPs based on organism-level responses would be given more weight than those based on multicellular/organ-level responses, which in turn, would be given more weight than REPs based on unicellular responses, which in turn, would be given more weight than QSAR-based REPs. Table 1 shows the assignment of each endpoint with respect to study model.
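The four-tier ordering described above can be captured as a simple ordinal ranking. The following sketch is illustrative only (the category labels and numeric ranks are not values from the paper, just one way to encode the Study Panel's ordering):

```python
# Ordinal ranking of study models by biological complexity, following the
# Study Panel's ordering: organismal > multicellular/organ level >
# unicellular > QSAR. A higher rank means the REP receives more weight.
STUDY_MODEL_RANK = {
    "organismal": 4,
    "multicellular/organ level": 3,
    "unicellular": 2,
    "QSAR": 1,
}

def compare_study_models(model_a, model_b):
    """Return the study model that receives more weight (None on a tie)."""
    rank_a = STUDY_MODEL_RANK[model_a]
    rank_b = STUDY_MODEL_RANK[model_b]
    if rank_a == rank_b:
        return None
    return model_a if rank_a > rank_b else model_b
```

For example, `compare_study_models("organismal", "unicellular")` returns `"organismal"`, mirroring the weighting hierarchy in Table 1.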

3.1.4. Pharmacokinetics (PK)

As noted above, in the qualitative framework used by the 1997 WHO expert panel (Van den Berg et al., 1998), REPs from chronic studies were given more weight than those from subchronic studies, which in turn were given more weight than REPs from subacute studies, which in turn, were given more weight than those from acute studies. This specific criterion was discussed at length by the Study Panel, and it was decided that the primary purpose of this criterion was to address potential differences in time to reach pharmacokinetic (pseudo) steady state. Therefore, this criterion was refined, focusing instead on whether differences in pharmacokinetics (PK) between TCDD (the reference compound) and the test congener were accounted for in the study design. Specific topics considered by the Study Panel included assessment of PK differences, achievement of steady state for test and reference congeners, and assessment of the length of time associated with the measurement of the endpoint relative to the time of dosing. As part of these discussions, the Study Panel considered the relative kinetics among congeners, and specifically, elimination rates and differences in absorption. It was determined that, for congeners that had similar properties to TCDD or PCB126, study duration was not a critical factor. However, if kinetic properties were not similar (e.g., absorption or elimination rates varied greatly such that achieving steady state was required for direct comparison of relative potency), study duration and achievement of steady state were critical factors that would clearly affect estimates of relative potency.

In cases where congeners were believed to exhibit PK properties similar to those of TCDD, it was concluded that study design would not dramatically affect the REP, and thus, those congeners were assumed to satisfy the PK criterion. Only two congeners were determined to have TCDD-like pharmacokinetic properties: 2,3,4,7,8-pentachlorodibenzofuran (2,3,4,7,8-PeCDF) and PCB126 (DeVito et al., 1998, 2000). If the pharmacokinetic properties of the test congener were not known, or not sufficiently known or documented in the literature, the default was to treat the given congener as if it were different from TCDD. Additionally, it was concluded that all congeners would be close to achieving steady state following subchronic or chronic exposures (generally 13 weeks or greater) and, as a result, all such studies were assumed to satisfy the PK criterion. If the exposure duration was less than subchronic and the test congener was anything other than PCB126 or 2,3,4,7,8-PeCDF, it was assumed that the PK criterion was not satisfied.

An example of how study duration was applied is as follows. If a given study estimated the potency of PCB126 and OCDD relative to TCDD using an acute dosing regimen, the REP value for PCB126 would satisfy the PK criterion, but the REP value for OCDD would not. This is because previous pharmacokinetic studies have shown that the pharmacokinetic parameters for OCDD differ from those for TCDD, and therefore, the internal dose metrics for the two compounds would not be comparable at the target tissue following an acute exposure. However, if these same compounds were administered chronically, it would be assumed that the compounds achieved steady state in the body, and as such, differences in absorption, distribution, metabolism, and excretion would have been accounted for; therefore, the REP values for both PCB126 and OCDD would be assumed to satisfy the PK criterion.

Finally, because QSAR inherently does not allow for the assessment of pharmacokinetics, QSAR-based REPs were automatically assigned “no” for this parameter.
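The decision logic in this section can be sketched as a small rule function. This is an illustrative sketch, not code from the study; the congener labels, the function name, and the 13-week subchronic cutoff simply follow the text above:

```python
# Congeners treated as having TCDD-like pharmacokinetics, per the text.
TCDD_LIKE_CONGENERS = {"2,3,4,7,8-PeCDF", "PCB126"}

# Subchronic/chronic exposures (>= 13 weeks) are assumed near steady state.
SUBCHRONIC_WEEKS = 13

def pk_criterion_satisfied(congener, exposure_weeks, is_qsar=False):
    """Apply the Study Panel's PK criterion to a single REP.

    QSAR-based REPs automatically fail; TCDD-like congeners always pass;
    otherwise the exposure must be subchronic or longer.
    """
    if is_qsar:
        return False  # QSAR does not allow assessment of pharmacokinetics
    if congener in TCDD_LIKE_CONGENERS:
        return True  # study duration is not a critical factor
    return exposure_weeks >= SUBCHRONIC_WEEKS
```

Under this rule, the worked example above follows directly: an acute study of PCB126 satisfies the criterion, while the same acute study of OCDD does not.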

3.1.5. REP derivation quality

In the course of deliberations regarding study elements that influence the quality and relevance of a REP for human health risk assessment, the Study Panel concluded that the quality of the underlying dose-response data (“REP derivation quality”) and the specific method used to derive the REP value (“REP derivation method,” discussed in detail in the next section) were also important contributors to REP quality and, therefore, should be included in the consensus-based weighting framework. Many factors that could potentially affect REP derivation quality, such as vehicle, dose levels, age and number of animals, sex, strain, diet, achievement of a maximal response level, and chemical purity, were considered for inclusion as components of this specific weighting factor. However, several of these factors were excluded from the final version, primarily because they were not reported for many of the studies in the REP2004 database; such information was missing from the database because it was not reported by the original study authors (Haws et al., 2006a). Additional factors considered for in vitro studies, such as cell type and passage number, were also excluded because, at the time of development, weighting the cell types utilized in studies within the database relative to others could not be readily justified. However, the Study Panel recognized that future iterations may need to do so, given the increased utilization of in vitro methods.

Ultimately, the panel concluded that the underlying dose-response data were of high quality if all three of the following criteria were met: (1) there was a sufficient number of dose levels; (2) there was a sufficient number of animals (in vivo) or replicates (in vitro); and (3) a maximum response was achieved by the reference and test chemical, or achieving an observed maximum response was unnecessary because of the REP derivation method used. Each of these three criteria was then reviewed in detail by the panel to determine, in the context of dose-response modeling, what constituted a sufficient number of dose levels, what constituted a sufficient number of animals or replicates, and when achievement of a maximal response was essential.

Sufficient number of dose levels:

The Study Panel concluded that a study must include a minimum of three dose levels plus a control to be considered sufficient for developing a relative potency estimate. Quantitatively understanding the relative potency of two chemicals requires sufficient information on the dose-response relationship for the endpoint of concern. In the expert opinion of the Study Panel, studies with fewer than three dose levels produced relative potency estimates with enough error that they were quantitatively uninformative.

Sufficient number of animals or replicates (“N”):

The panel concluded that whether a study included a sufficient number of animals (in vivo) or replicates (in vitro) depended on both the study type and endpoint. For this reason, the panel reviewed categories of endpoints in the REP2004 database to determine the sufficient N for each (Table 2). Statistical power calculations are based on the variance in the controls and the statistical method used in the analysis; given the heterogeneity of the endpoints and statistical methods in the REP2004 database, the Study Panel used expert judgment to determine the sufficient N for each study type and endpoint.

Table 2.

Categories for sufficient number (n) of animals or replicates.

Endpoint/Parameter Sufficient “n” Required

in vivo (endpoint evaluated)

Tumor Promotion 10
Cancer 20
Histopathology 5
Organ Weight 5
Endocrine-Based Endpoints 5
Immunotoxicity 6
Developmental Toxicity 6
Other Categories (default) 3


in vitro (replicates/group) a

intra-study replicates 2
inter-study replicates 2
a Intra- or inter-study replicates satisfy this requirement.

Achievement of a maximal response:

The original REP2004 database included information regarding whether a maximal response was achieved for each REP value (i.e., “max/near max response attained – yes or no” [Haws et al., 2006a]). The panel noted that, in some cases, because of the REP derivation method employed, it was not necessary to achieve a maximal response (USEPA, 2012).

These three study elements were then considered collectively to determine REP derivation quality. Specifically, the Study Panel concluded that a REP based on a study that met all three criteria would be given more weight than one based on a study that met two of the three criteria, which in turn would be given more weight than a REP based on a study that met only one of the three criteria, which in turn would be given more weight than a REP based on a study that met none of the three criteria, which in turn would be given more weight than a QSAR-based REP.
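The tiered weighting just described amounts to ranking each REP by how many of the three criteria its underlying study met, with QSAR below all of them. A minimal sketch (the function name and tier values are illustrative, not from the paper):

```python
def derivation_quality_tier(enough_doses, enough_n, max_response_ok, is_qsar=False):
    """Rank REP derivation quality per the Study Panel's ordering.

    QSAR-based REPs get the lowest tier (0); otherwise the tier rises
    with the number of criteria met: 0 criteria -> tier 1, all 3 -> tier 4.
    A higher tier means the REP receives more weight.
    """
    if is_qsar:
        return 0
    criteria_met = sum([enough_doses, enough_n, max_response_ok])
    return 1 + criteria_met
```

For example, a study meeting all three criteria outranks one meeting two, which outranks one meeting none, which outranks a QSAR-based REP, exactly the ordering stated above.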

3.1.6. REP derivation method

As indicated by Haws et al. (2006a), a wide variety of methods were relied upon to calculate the REP values in the REP2004 database, ranging from statistically based non-linear dose-response modeling to linear graphical techniques and LOEL/NOEL (lowest- and no-observed-effect level) ratios. These methods represent a wide range of sophistication and accuracy with respect to estimating relative potency. Importantly, some REP values in the database were derived not by study authors, but rather by database authors. The method used to calculate a REP estimate was largely dependent on the data available (i.e., more sophisticated modeling techniques, such as non-linear regression models, were utilized if ample data were available). The Study Panel also noted that some studies represented in the database were not actually designed to evaluate relative potency, thus further limiting the potential accuracy of the REP estimate. Therefore, the Study Panel agreed that accounting for the method used to derive REP values was important when quantitatively weighting REPs for use in human health risk assessment.

For this factor, REP calculation methods were grouped into one of four possible categories (high, medium, low, QSAR) based on their perceived ability to accurately estimate a REP (Table 3). As an example, those REP derivation methods that involved using statistical models and included evaluation of the parallelism of dose response (a fundamental assumption in the use of TEFs) were categorized as “high.” In cases where statistical models were employed but parallelism was not assessed, the REP derivation method was categorized as “medium.” Other types of derivation methods classified as “medium” included ED50 or EC50 ratios, promotion index ratios, development of dose response graphs, etc. Those that involved crude estimates, NOEL/LOEL ratios, or response ratios were categorized as “low.” Only QSAR studies were assigned to the QSAR category; other such models would be included in this fourth category should they become available in the future.

Table 3.

Categories for REP derivation method.

High
Modeled REP based on parallelism

Medium

Modeled REP with no evaluation of parallelism
REP calculated by linear interpolation
Partial least squares analysis
Linear extrapolation + average response ratios
Dose-response graphs
ED50, EC50 or LD50 ratios
Curve fitting
Graphical comparison of dose response curves
Promotion index ratio

Low

Dose estimates, crude estimates

QSAR

(or other model)

3.2. Study panel–assigned quality categories for use in machine learning

The descriptive visualization in Fig. 3 characterizes the Study Panel–assigned quality categories based on overall consideration of the weighting criteria. Several features of the database are apparent from this visualization.

  1. The number of available REPs varies substantially among congeners, ranging from only 2 REP values (1,2,3,4,6,7,8-HpCDF, 1,2,3,4,7,8,9-HpCDF, and 1,2,3,7,8,9-HxCDF) to 112 REP values (PCB126).

  2. The quality categories of available REPs vary substantially both within and among congeners.

  3. Measured REP values for a given congener may span several orders of magnitude, varying by one-thousand-fold or more.

Fig. 3.

Fig. 3.

Overall characterization of study quality within the REP database. Each point is one measured REP from the REP database, plotted by congener. Point colors and vertical positions both represent the Study Panel–assigned subjective quality category corresponding to the REP (note that such assignments were used to train the machine learning model but are not the final weights applied in the weighted database): highest-quality REPs are light blue, located along the bottom row, and lowest-quality REPs are light purple, located along the top row. For comparison, WHO TEFs are also plotted: 1998 TEFs (crosses), 2005 TEFs (circles).

3.3. Integration via Bayesian inference framework

The weighted TEF distributions for each congener are summarized in Table 4 and Fig. 4. In Fig. 4, they are compared to the unweighted TEF distributions and presented with the original measured REPs and their quality categories. Additionally, the percentiles of the unweighted and weighted distributions that correspond to the WHO TEFs (both 2005 and 1998) are presented in Table 5.

Table 4.

Summary statistics for the weighted REP distributions for each congener: mean and percentiles (2.5th, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 97.5th). Values are on the natural scale in REP units, not the log scale.

Congener WHO 2005 TEF WHO 1998 TEF Mean 2.5% 5% 10% 25% 50% 75% 90% 95% 97.5%

1234678HpCDD 0.01 0.01 1.11E-02 4.69E-03 5.46E-03 6.38E-03 8.28E-03 1.10E-02 1.47E-02 1.93E-02 2.28E-02 2.62E-02
1234678HpCDF 0.01 0.01 7.44E-02 1.69E-03 3.33E-03 6.59E-03 2.11E-02 7.33E-02 2.69E-01 8.52E-01 1.67 2.98
1234789HpCDF 0.01 0.01 2.85E-02 8.04E-04 1.39E-03 2.86E-03 8.75E-03 2.90E-02 9.47E-02 2.80E-01 5.95E-01 1.00
123478HxCDD 0.1 0.1 6.89E-02 3.13E-02 3.61E-02 4.17E-02 5.28E-02 6.84E-02 8.91E-02 1.15E-01 1.37E-01 1.60E-01
123478HxCDF 0.1 0.1 8.67E-02 2.83E-02 3.42E-02 4.18E-02 5.76E-02 8.48E-02 1.27E-01 1.90E-01 2.38E-01 2.84E-01
123678HxCDD 0.1 0.1 5.84E-02 6.16E-03 8.92E-03 1.35E-02 2.69E-02 5.76E-02 1.26E-01 2.58E-01 3.88E-01 5.80E-01
123678HxCDF 0.1 0.1 4.93E-02 1.87E-02 2.20E-02 2.69E-02 3.64E-02 4.99E-02 6.78E-02 8.93E-02 1.05E-01 1.20E-01
123789HxCDD 0.1 0.1 3.38E-02 5.15E-03 7.05E-03 9.98E-03 1.79E-02 3.39E-02 6.47E-02 1.14E-01 1.61E-01 2.14E-01
123789HxCDF 0.1 0.1 1.44E-01 4.48E-03 7.57E-03 1.51E-02 4.47E-02 1.45E-01 4.65E-01 1.39 2.74 4.81
12378PeCDD 1 1 3.55E-01 2.29E-01 2.46E-01 2.68E-01 3.05E-01 3.56E-01 4.12E-01 4.69E-01 5.04E-01 5.44E-01
12378PeCDF 0.03 0.05 3.07E-02 1.56E-02 1.72E-02 1.94E-02 2.39E-02 3.03E-02 3.91E-02 4.95E-02 5.70E-02 6.37E-02
234678HxCDF 0.1 0.1 8.10E-02 1.99E-02 2.47E-02 3.18E-02 5.02E-02 8.05E-02 1.31E-01 2.06E-01 2.67E-01 3.41E-01
23478PeCDF 0.3 0.5 2.08E-01 1.72E-01 1.78E-01 1.84E-01 1.95E-01 2.08E-01 2.22E-01 2.35E-01 2.44E-01 2.51E-01
OCDD 3E-04 1E-04 5.58E-04 8.39E-05 1.14E-04 1.64E-04 3.01E-04 5.56E-04 1.05E-03 1.88E-03 2.65E-03 3.63E-03
OCDF 3E-04 1E-04 1.45E-04 4.28E-05 5.17E-05 6.52E-05 9.17E-05 1.42E-04 2.20E-04 3.44E-04 4.56E-04 5.86E-04
PCB105 3E-05 1E-04 5.18E-05 2.25E-05 2.59E-05 3.03E-05 3.94E-05 5.21E-05 6.80E-05 8.84E-05 1.03E-04 1.18E-04
PCB114 3E-05 5E-04 4.55E-04 8.71E-05 1.18E-04 1.62E-04 2.62E-04 4.51E-04 7.80E-04 1.29E-03 1.78E-03 2.36E-03
PCB118 3E-05 1E-04 3.26E-05 1.33E-05 1.53E-05 1.81E-05 2.43E-05 3.28E-05 4.40E-05 5.77E-05 6.76E-05 7.85E-05
PCB123 3E-05 1E-04 4.64E-05 7.83E-06 1.05E-05 1.43E-05 2.48E-05 4.64E-05 8.56E-05 1.51E-04 2.11E-04 2.82E-04
PCB126 0.1 0.1 9.51E-02 7.84E-02 8.13E-02 8.44E-02 8.95E-02 9.54E-02 1.01E-01 1.07E-01 1.11E-01 1.14E-01
PCB156 3E-05 5E-04 1.70E-04 6.81E-05 7.84E-05 9.32E-05 1.24E-04 1.71E-04 2.34E-04 3.05E-04 3.58E-04 4.09E-04
PCB157 3E-05 5E-04 3.28E-04 6.94E-05 8.76E-05 1.17E-04 1.89E-04 3.27E-04 5.63E-04 9.29E-04 1.25E-03 1.57E-03
PCB167 3E-05 1E-05 1.63E-05 1.77E-06 2.59E-06 3.92E-06 7.72E-06 1.63E-05 3.49E-05 6.79E-05 1.01E-04 1.52E-04
PCB169 0.03 0.01 9.16E-03 3.90E-03 4.54E-03 5.35E-03 6.91E-03 9.20E-03 1.22E-02 1.56E-02 1.82E-02 2.10E-02
PCB189 3E-05 1E-04 3.12E-05 4.50E-06 6.38E-06 9.26E-06 1.64E-05 3.17E-05 6E-05 1.04E-04 1.52E-04 2.12E-04
PCB77 1E-04 1E-04 6.40E-04 2.98E-04 3.38E-04 3.90E-04 4.96E-04 6.45E-04 8.36E-04 1.04E-03 1.17E-03 1.31E-03
PCB81 3E-04 1E-04 4.34E-03 1.08E-03 1.34E-03 1.74E-03 2.69E-03 4.32E-03 7.01E-03 1.10E-02 1.44E-02 1.78E-02
TCDF 0.1 0.1 5.34E-02 2.41E-02 2.74E-02 3.16E-02 4.08E-02 5.31E-02 6.99E-02 9.02E-02 1.05E-01 1.18E-01

Fig. 4.

Fig. 4.

Violin plots summarizing the inferred distributions of the TEFs for each congener, on a log scale. The width of the “violin” corresponds to the probability at a given REP. Red curves indicate weighted TEF distributions; blue curves indicate unweighted TEF distributions. The measured REPs for each congener are also plotted as gray points, whose vertical position indicates their most-probable RF-model-predicted quality category, as labeled at the left of each panel (ranging from 1 at the bottom to 5.5 at the top). For comparison, points are plotted to indicate the WHO TEFs for each congener, from the 1998 scheme (crosses) and the 2005 scheme (open circles). All plots are shown with the same range of REPs/TEFs to facilitate comparing TEF distributions across congeners.

Table 5.

Percentiles of the unweighted and weighted REP distributions at which the WHO 2005 and 1998 TEFs fall.

Congener WHO 2005 TEF WHO 1998 TEF Unweighted model Weighted model


WHO 2005 TEF % ile WHO 1998 TEF % ile WHO 2005 TEF % ile WHO 1998 TEF % ile

1234678HpCDD 0.01 0.01 30% 30% 41% 41%
1234678HpCDF 0.01 0.01 7% 7% 14% 14%
1234789HpCDF 0.01 0.01 21% 21% 28% 28%
123478HxCDD 0.1 0.1 71% 71% 83% 83%
123478HxCDF 0.1 0.1 38% 38% 60% 60%
123678HxCDD 0.1 0.1 74% 74% 68% 68%
123678HxCDF 0.1 0.1 94% 94% 93% 93%
123789HxCDD 0.1 0.1 91% 91% 87% 87%
123789HxCDF 0.1 0.1 39% 39% 42% 42%
12378PeCDD 1 1 100% 100% 100% 100%
12378PeCDF 0.03 0.05 23% 73% 49% 91%
234678HxCDF 0.1 0.1 51% 51% 62% 62%
23478PeCDF 0.3 0.5 87% 100% 100% 100%
OCDD 3E-04 1E-04 17% 1% 25% 3%
OCDF 3E-04 1E-04 64% 9% 86% 30%
PCB105 3E-05 1E-04 1% 76% 10% 94%
PCB114 3E-05 5E-04 0% 52% 0% 55%
PCB118 3E-05 1E-04 8% 95% 42% 99%
PCB123 3E-05 1E-04 27% 83% 31% 79%
PCB126 0.1 0.1 94% 94% 69% 69%
PCB156 3E-05 5E-04 0% 96% 0% 99%
PCB157 3E-05 5E-04 0.025% 80% 0.125% 70%
PCB167 3E-05 1E-05 75% 30% 71% 34%
PCB169 0.03 0.01 100% 66% 100% 58%
PCB189 3E-05 1E-04 61% 95% 47% 89%
PCB77 1E-04 1E-04 0% 0% 0% 0%
PCB81 3E-04 1E-04 0% 0% 0.025% 0%
TCDF 0.1 0.1 91% 91% 94% 94%

Comparing weighted and unweighted distributions affords a few key observations. First, congeners with very few measured REPs (e.g., 1,2,3,4,6,7,8-HpCDF, 1,2,3,4,7,8,9-HpCDF, and 1,2,3,7,8,9-HxCDF) have noticeably wider TEF distributions than congeners with many measured REPs, reflecting the fact that TEF estimates are less certain when there are fewer data to inform them. For these congeners, the weighted distribution is substantially wider than the unweighted distribution, because these congeners not only have very few data but also have lower-quality data (REPs of quality categories 5 and 5.5).

Second, for congeners with the highest-quality REPs (e.g., 2,3,4,7,8-PeCDF and PCB126, which both have many measured REPs of quality category 1), the weighted model yields a substantially narrower distribution than the unweighted model, reflecting the greater importance of the narrower range of high-quality REPs in defining the weighted TEF distribution, and the lesser importance of the wider range of lower-quality REPs.

Third, for congeners where higher-quality measured REPs tend to occupy a different range from lower-quality measured REPs (e.g., PCB156, PCB118, and PCB105, where the category-2 REPs all lie at the lowest end of the range of measured REPs, orders of magnitude below the rest), the weighted model yields a TEF distribution that is shifted toward the higher-quality REPs, compared to the unweighted model. Notably, however, the shift is relatively small compared to the difference in measured REPs between higher-quality and lower-quality studies. For example, for PCB156, the center of the weighted distribution is shifted only about two-fold lower than the center of the unweighted distribution, even though the median REP of the category-2 studies is more than 100 times lower than the median REP of the other studies.

Why should category-1 REPs affect the weighted TEF distributions so strongly, but category-2 REPs affect them so weakly? The result can be explained by converting the expected error SDs into inverse-variance weights. Category-1 REPs are weighted much more heavily than other categories: given the expected error SD for category 1 (mean σ1 = 0.23) and for category 2 (mean σ2 = 0.73), category-1 REPs are weighted (0.73²/0.23²) ≈ 10 times more heavily than category-2 REPs, and about 22.5 times more heavily than category-5 REPs (mean σ5 = 1.09). However, category-2 REPs are weighted only slightly more heavily than REPs in lower quality categories: only about 1.1 times more heavily than category-3 REPs (mean σ3 = 0.75), and only about 2.2 times more heavily than category-5 REPs.
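These ratios follow directly from treating each category's mean error SD as a sampling SD and weighting by inverse variance (w = 1/σ²). A quick check, using the mean σ values quoted above (note the category-2 versus category-5 ratio works out to about 2.2 on these σs):

```python
# Mean expected error SDs by quality category, as quoted in the text.
mean_sigma = {1: 0.23, 2: 0.73, 3: 0.75, 5: 1.09}

# Inverse-variance weight for each category: w_k = 1 / sigma_k^2.
weights = {k: 1.0 / s**2 for k, s in mean_sigma.items()}

def weight_ratio(cat_a, cat_b):
    """How many times more heavily category a is weighted than category b."""
    return weights[cat_a] / weights[cat_b]

print(round(weight_ratio(1, 2), 1))  # 10.1
print(round(weight_ratio(1, 5), 1))  # 22.5
print(round(weight_ratio(2, 3), 1))  # 1.1
print(round(weight_ratio(2, 5), 1))  # 2.2
```

Because the weight is quadratic in 1/σ, the gap between category 1 and everything else is far larger than the gaps among the lower categories, which is exactly the asymmetry discussed above.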

A standard classification tree (Fig. 5) trained on the REP2004 database provides an illustrative example of how each tree in the random forest predicted quality categories for each study. The terminal nodes (colored in Fig. 5) are each labeled with the tree-predicted category for all the studies in that node, which may or may not correspond with the panel-assigned category for each study in that node. Note that this example classification tree did not predict categories 2.5, 3.5, or 4.5 for any studies. Each tree in the random forest may have made different decisions and/or predicted different categories, since each RF tree was trained on a different randomly chosen subset of the REP database.

Fig. 5.

Fig. 5.

An illustrative example of a classification decision tree modeling the relationship between study attributes and study quality categories. Boxes indicate the variable to split at each decision point (node); branches are labeled with the value(s) of the above node variable on that branch. Terminal nodes (colored circles at bottom) are labeled with the most-common category in that node (i.e., the model-predicted category for all studies in that node), then with the number of studies in that node that were panel-assigned to each category (e.g., in the terminal node labeled “2”, 57 of the studies were panel-assigned category 2, and one study was panel-assigned category 4). The random forest consists of 10,000 classification trees like this one, each trained on a different randomly chosen subset of the data and therefore each making slightly different decisions.

The probability of study i being placed in category k was calculated as the fraction of the 10,000 RF trees that placed study i in category k (see Supplemental Materials). The higher the probability in a category, the more certain the RF model was about placing that study in that category. The RF model was very certain about placing studies in categories 1, 5, and 5.5. However, no study was classified with over 50% probability in categories 2.5, 3.5, or 4.5, suggesting that these categorizations were very uncertain. Since relatively few studies were originally classified into categories 2.5, 3.5, or 4.5 (16, 20, and 2, respectively), the RF model likely did not have enough information to reliably distinguish these categories.
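The vote-fraction calculation is simple: each tree casts one vote for a category, and the probability of category k is the share of trees voting for it. A stdlib sketch (variable names and the toy vote counts are illustrative):

```python
from collections import Counter

def category_probabilities(tree_votes):
    """Fraction of trees placing a study in each category.

    tree_votes: one predicted category per tree (e.g., 10,000 entries
    for a 10,000-tree random forest).
    """
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {category: count / n_trees for category, count in counts.items()}

# Toy example: 9,000 of 10,000 trees vote category 1; 1,000 vote category 2.
votes = [1] * 9000 + [2] * 1000
probs = category_probabilities(votes)
print(probs)  # {1: 0.9, 2: 0.1}
```

A study is considered confidently classified when one category's fraction is high (e.g., above 50%); diffuse vote fractions across several categories signal the uncertainty noted above for categories 2.5, 3.5, and 4.5.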

4. Discussion

To move toward using TEF distributions rather than single point-estimate TEFs, experts have indicated the need for development of a consensus-based framework to weight REPs based on study quality and relevance for human health risk assessment (Haws et al., 2006a; Van den Berg et al., 2006). The objective of this study was to identify those study characteristics believed most important when evaluating REP quality and relevance, to develop an approach for quantitatively weighting each REP, and to apply that approach to the current REP2004 database. Six main-study characteristics (study type, study model, pharmacokinetics, REP derivation method, REP derivation quality, and endpoint) were identified as most important and form the backbone of the framework. A Bayesian-inference approach was applied to infer underlying probability distributions, calibrated using the set of measured REPs for each congener with their panel-assigned study attributes. Because these distributions account for study quality and relevance, the result is a weighted REP database for use in risk assessment.

The weighting framework proposed herein is “fit for purpose” for REPs for DLCs, because no established study quality tool or method was considered sufficient by itself for these purposes. Through collaborative iterations, the Study Panel recognized a need for granularity specific to REPs for DLCs. Various tools or approaches, such as standard Klimisch scoring (Klimisch et al., 1997), the ToxRTool (Schneider et al., 2009), or SciRAP (Beronius et al., 2018), would not have adequately characterized study parameters that have been recognized for decades by experts as being important to the quality and reliability of REPs (Haws et al., 2006a; Van den Berg et al., 2006). Although there is overlap with aspects of existing tools/approaches that focus on reporting and methodological quality, the REP weighting framework aims to combine multiple aspects of study validity: internal, external, and construct validity (Henderson et al., 2013). The REP study quality elements related to methodological quality have some similarities to the domain-based structure of a risk-of-bias assessment for internal validity (NTP OHAT, 2019), although other framework elements are more aligned with characterizing construct and external validity; that is, determining whether the REP was developed from a study that was both well-designed to characterize a REP value (construct validity) and generalizable for use in human health risk assessment (external validity).

It is notable that the proposed REP weighting framework both uses a combined-validity approach for REPs and quantitatively integrates heterogeneous data, addressing two well-recognized current challenges in the field of systematic review in toxicology (EFSA, 2018; NAS, 2018; Wikoff et al., 2019). This work began long before the emergence of systematic review in toxicology (Stephens et al., 2016), though it was effectively an early, unnamed application of the method. This was demonstrated initially via the approach and reporting of the REP2004 database by Haws et al. (2006a) and the subsequent evidence-to-decision methods employed by the WHO TEF panel (Van den Berg et al., 2006). Specifically, the Haws et al. (2006a) publication detailed the methods for identification and selection of studies and for extraction (and calculation) of REP values, and synthesized the evidence in a fashion similar to an evidence map or scoping review. The WHO TEF panel then used the information from the REP evidence map to systematically determine which TEFs to update. The missing element for fully employing what are now considered evidence-based methods was critical appraisal of the REPs. This element was well-recognized by the WHO panel in 2005 as something that would be required to quantitatively estimate REPs directly from underlying data (thus eliminating subjectivity in the selection of TEFs). The REP weighting framework proposed herein provides a “fit for purpose” solution to this element. The Study Panel also recognizes the evolving nature of TEFs and the developing field of systematic review; therefore, as additional critical appraisal tools become available (particularly for in vitro studies), they can be considered in the REP weighting framework.

The quantitative integration approach proposed herein is also notable in the context of developing best practices in evidence-based methods: it combines a large volume (>600 data sets) of extremely heterogeneous data (multiple species, study designs, etc.) while accounting for quality and relevance as those factors relate to the common metric, relative potency. Conceptually, the Bayesian inference approach applied to develop a weighted REP database is analogous to a meta-analysis, a technique for combining measurements of the same outcome from multiple studies with varying amounts of uncertainty. The intuition behind a meta-analysis is that there is some unknown “true” effect, which each study has measured with some amount of error; the size of that error can be estimated from the sampling variance of each study. The application of the REP weighting framework demonstrates the utility of a Bayesian hierarchical model for incorporating all the available data while still accounting for differences in the quality and relevance of those data. The Bayesian approach borrows information across congeners to estimate an overall relationship between study quality and relevance and the expected error in each measured REP; it then uses that estimate of quality-related error to infer a distribution that characterizes variability and uncertainty in an underlying “true” REP for each congener. Thus, these machine-learning approaches collectively align the integration method with probabilistic analyses that account for uncertainty (a practical need in risk assessment for DLCs specifically recognized by the USEPA [2010]) and with the overall direction of accounting for uncertainty in risk assessments.
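The full hierarchical model was implemented in R and Stan and is beyond a short illustration, but the underlying intuition (precision-weighting heterogeneous measurements by their quality) can be sketched in a few lines. All numbers below, the fixed `base_var`, and the linear quality-to-precision mapping are illustrative assumptions only; the published model infers the quality–error relationship from the data rather than fixing it:

```python
import numpy as np

# Hypothetical REP measurements (log10 scale) for one congener, each
# with a quality/relevance weight in (0, 1]. Values are illustrative,
# not taken from the REP2004 database.
log_reps = np.array([-1.2, -0.8, -1.0, -1.5])
quality = np.array([0.9, 0.4, 0.8, 0.2])

# Illustrative assumption: measurement error shrinks as quality grows,
# so each study's precision is proportional to its quality score.
base_var = 0.25  # assumed error variance for a hypothetical "perfect" study
precision = quality / base_var

# Quality-weighted pooled estimate and standard error, analogous to a
# fixed-effect meta-analysis: high-quality REPs dominate the pooled
# value, but low-quality REPs still contribute some information.
pooled = np.sum(precision * log_reps) / np.sum(precision)
pooled_se = np.sqrt(1.0 / np.sum(precision))
```

In the published framework this weighting is not a fixed formula of this kind: the Bayesian hierarchical model borrows strength across congeners to estimate how quality relates to error, and it yields full posterior distributions rather than a single pooled point estimate.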

Several of the a priori objectives involved relative simplicity, foundations in established scientific principles, and ready application to the existing and future REP databases—which the proposed framework achieves. The weighting parameters assess multiple aspects of study validity in an objective, simple, and reproducible manner. One drawback, however, is that the Bayesian approach, while powerful, is not particularly accessible to non-statisticians. That is, the proposed method does not result in an explicit analytical equation with point-estimate parameters that could be used to update REP distributions as new studies are added to the REP database. Rather, it results in non-parametric probability distributions that can be characterized only by using Markov chain Monte Carlo sampling. While the mean and percentiles of the REP distributions for each congener have been provided, the inference process will need to be re-run as new studies are added to the database, or if new study attributes are identified that affect study quality. Although the inference process requires very little computational time, future efforts to make this process and application more accessible to the broader field may be worthwhile but were beyond the scope of this work. Furthermore, applying this approach requires knowledge of Bayesian hierarchical modeling and random-forest modeling, as well as both R and Stan software and coding languages.
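Because the inference yields posterior draws rather than a closed-form equation, the reported means and percentiles are simply summaries of those draws. A minimal sketch, using simulated normal draws on the log10 scale as a stand-in for actual Stan output (the location, scale, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MCMC draws of one congener's "true" log10 REP; real
# draws would come from the fitted Stan model, not a normal simulator.
draws = rng.normal(loc=-1.0, scale=0.3, size=10_000)

# Per-congener summaries: back-transform selected percentiles from
# log10 space to the REP scale, and keep the mean on the log10 scale.
p25, p50, p75 = 10 ** np.percentile(draws, [25, 50, 75])
mean_log_rep = draws.mean()
```

Any new summary (a different percentile, a tail probability) requires only the stored draws, but adding a study to the database requires regenerating the draws themselves, which is why the inference must be re-run.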

The availability of weighted REP distributions will advance the field of probabilistic risk assessment for this group of compounds. Multiple investigators have suggested that risk estimates associated with exposure to DLCs be based on the distribution of REP values for each congener, to allow better characterization of uncertainty and variability (EPA SAB, 2001; Haws et al., 2006a). The WHO2005 panel indicated that, inherent to selection of the mathematical scale for TEFs, each TEF carries a degree of uncertainty assumed to be at least ± half a log (i.e., TEF values vary in uncertainty by at least one order of magnitude), and USEPA guidance on TEFs makes the same “at least ± half a log” assumption (EPA, 2010). However, the current results indicate that this assumption is inconsistent with the underlying REP data. The uncertainty can be indexed by the width of the weighted distribution, and it varies dramatically across congeners because of factors such as the number of studies of each chemical, the quality of those studies, and the characteristics of the chemical itself. In fact, the “half a log” estimate of uncertainty is consistent only with the most data-rich congeners supported by the highest-quality studies. Thus, applying a constant factor to describe the uncertainty in the current TEFs is inconsistent with the data and is likely to contribute to inaccurate results in a risk assessment. The use of weighted REP distributions in a probabilistic fashion, combined with characterization of uncertainty, provides science-based solutions to these long-standing issues.
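The contrast with a constant “± half a log” assumption can be made concrete. In the sketch below, two hypothetical congeners are represented by simulated log10 REP samples (a narrow posterior for a data-rich congener and a wide one for a data-poor congener); the locations and widths are illustrative assumptions, not results from the weighted database:

```python
import numpy as np

rng = np.random.default_rng(1)
rich = rng.normal(-1.0, 0.2, size=5_000)  # assumed data-rich: narrow posterior
poor = rng.normal(-2.0, 0.8, size=5_000)  # assumed data-poor: wide posterior

def log10_halfwidth(samples):
    """Half the width of the central 90% interval, in log10 units."""
    lo, hi = np.percentile(samples, [5, 95])
    return (hi - lo) / 2.0

# A constant +/- 0.5 log10 band overstates the data-rich congener's
# uncertainty and understates the data-poor congener's.
half_log = 0.5
rich_hw = log10_halfwidth(rich)
poor_hw = log10_halfwidth(poor)
```

Indexing uncertainty by the width of each congener's weighted distribution, as above, replaces the one-size-fits-all half-log band with a congener-specific measure.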
The weighted REP distributions are not only useful in probabilistic analyses but also reflect a substantial improvement in the development of more robust point-estimate TEFs based on weighted distributions of REPs for each congener that reflect consideration of study quality and relevance. In addition, the weighted distributions of REPs allow regulatory agencies to select REPs at a desired percentile consistent with risk management goals.

5. Conclusions

In conclusion, the REP weighting framework allows greater emphasis to be placed on REP values of higher quality and relevance. The resulting distributions improve characterization of the variability and uncertainty inherent in health risk estimates for these dioxin-like compounds (as recommended by EPA [2010]), facilitate use in probabilistic analyses, provide a basis for development of more robust deterministic REPs, and give regulators more flexibility in applying TEFs that match their risk management goals. This information provides risk managers with the information necessary for decision making.

Supplementary Material

Supplement1

Acknowledgements

We extend our thanks to Drs. William Farland, Martin van den Berg, Michael Denison, Richard Peterson, and Annika Hanberg for their insights regarding the weighting framework.

Funding

Partial funding for ToxStrategies authors was initially provided by Tierra Solutions, Inc., and subsequently by Glenn Springs Holdings, Inc., entities involved with TCDD-related litigation. No external funding was provided for the remaining efforts for ToxStrategies authors; the work was carried out as part of the normal course of employment (no personal fees were received). Drs. DeVito, Walker, and Birnbaum did not receive any funding from these sources.

During the conceptualization, implementation, and drafting of this manuscript, Drs. DeVito, Walker, and Birnbaum were supported, in part, by the Intramural Research Program of the NIH: Drs. DeVito and Walker by the National Institute of Environmental Health Sciences (NIEHS/NIH) and Dr. Birnbaum by the National Cancer Institute (NCI/NIH). Dr. Birnbaum is currently a defense expert in dioxin-related litigation. The contents of this paper reflect the opinions and views of the authors. The mention of trade names and commercial products does not constitute endorsement or use recommendations.

Footnotes

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Daniele Wikoff, Caroline Ring, and Laurie Haws report financial support provided by Tierra Solutions. Linda Birnbaum reports a relationship as a defense expert in dioxin-related litigation that includes paid expert testimony. DW serves as an Associate Editor at Regulatory Toxicology and Pharmacology. MD served as an ad-hoc expert at the 2022 World Health Organization consultation on Toxic Equivalency Factors. LH and DW served as contractors to the European Food Safety Authority to support the WHO expert consultation.

CRediT authorship contribution statement

Daniele Wikoff: Conceptualization, Methodology, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization, Project administration. Caroline Ring: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Michael DeVito: Conceptualization, Methodology, Investigation, Data curation, Writing – original draft, Writing – review & editing. Nigel Walker: Conceptualization, Methodology, Investigation, Data curation, Writing – review & editing. Linda Birnbaum: Conceptualization, Investigation, Data curation, Writing – review & editing. Laurie Haws: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.yrtph.2023.105500.

Data availability

A companion manuscript is being submitted with additional information regarding machine learning.

References

  1. Beronius A, Molander L, Zilliacus J, Rudén C, Hanberg A, 2018. Testing and refining the Science in Risk Assessment and Policy (SciRAP) web-based platform for evaluating the reliability and relevance of in vivo toxicity studies. J. Appl. Toxicol. 38 (12), 1460–1470. 10.1002/jat.3648.
  2. Breiman L, 2001. Random forests. Mach. Learn. 45 (1), 5–32. 10.1023/A:1010933404324.
  3. Crofton KM, Craft ES, Hedge JM, Gennings C, Simmons JE, Carchman RA, et al., 2005. Thyroid-hormone–disrupting chemicals: evidence for dose-dependent additivity or synergism. Environ. Health Perspect. 113 (11), 1549–1554. 10.1289/ehp.8195.
  4. DeVito MJ, Ross DG, Dupuy AE Jr., Ferrario J, McDaniel D, Birnbaum LS, 1998. Dose–response relationships for disposition and hepatic sequestration of polyhalogenated dibenzo-p-dioxins, dibenzofurans, and biphenyls following subchronic treatment in mice. Toxicol. Sci. 46 (2), 223–234. 10.1006/toxs.1998.2530.
  5. DeVito MJ, Ménache MG, Diliberto JJ, Ross DG, Birnbaum LS, 2000. Dose–response relationships for induction of CYP1A1 and CYP1A2 enzyme activity in liver, lung, and skin in female mice following subchronic exposure to polychlorinated biphenyls. Toxicol. Appl. Pharmacol. 167 (3), 157–172. 10.1006/taap.2000.9010.
  6. European Food Safety Authority, 2018. EFSA Scientific Colloquium 23 – Joint European Food Safety Authority and Evidence-Based Toxicology Collaboration Colloquium: Evidence integration in risk assessment: the science of combining apples and oranges, 25–26 October 2017, Lisbon, Portugal. EFSA Supporting Publications 15 (3), 1396E. 10.2903/sp.efsa.2018.EN-1396.
  7. Gelman A, Hill J, 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
  8. Haws LC, Su SH, Harris M, DeVito MJ, Walker NJ, Farland WH, et al., 2006a. Development of a refined database of mammalian relative potency estimates for dioxin-like compounds. Toxicol. Sci. 89 (1), 4–30. 10.1093/toxsci/kfi294.
  9. Haws LC, DeVito MJ, Birnbaum LS, Walker NJ, Scott PK, Unice KM, et al., 2006b. An alternative method for establishing TEFs for dioxin-like compounds. Part 2. Development of an approach to quantitatively weight the underlying potency data. Organohalogen Compd. 68, 2523.
  10. Henderson VC, Kimmelman J, Fergusson D, Grimshaw JM, Hackam DG, 2013. Threats to validity in the design and conduct of preclinical efficacy studies: a systematic review of guidelines for in vivo animal experiments. PLoS Med. 10 (7), e1001489. 10.1371/journal.pmed.1001489.
  11. Klimisch HJ, Andreae M, Tillmann U, 1997. A systematic approach for evaluating the quality of experimental toxicological and ecotoxicological data. Regul. Toxicol. Pharmacol. 25 (1), 1–5. 10.1006/rtph.1996.1076.
  12. Liaw A, Wiener M, 2002. Classification and regression by randomForest. R News 2 (3), 18–22.
  13. National Academies of Sciences. National Research Council, 2014. Review of EPA's Integrated Risk Information System (IRIS) Process. The National Academies Press, Washington, DC. 10.17226/18764.
  14. National Academies of Sciences, Engineering, and Medicine, 2018. Progress toward Transforming the Integrated Risk Information System (IRIS) Program: A 2018 Evaluation. The National Academies Press, Washington, DC. 10.17226/25086.
  15. National Toxicology Program (NTP), Office of Health Assessment and Translation (OHAT), 2019. Handbook for conducting a literature-based health assessment using the OHAT approach for systematic review and evidence integration. https://ntp.niehs.nih.gov/ntp/ohat/pubs/handbookdraftmarch2019.pdf.
  16. R Core Team, 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org.
  17. Schneider K, Schwarz M, Burkholder I, Kopp-Schneider A, Edler L, Kinsner-Ovaskainen A, et al., 2009. "ToxRTool," a new tool to assess the reliability of toxicological data. Toxicol. Lett. 189 (2), 138–144. 10.1016/j.toxlet.2009.05.013.
  18. Scott PK, Haws LC, Staskal DF, Birnbaum LS, Walker NJ, DeVito MJ, et al., 2006. An alternative method for establishing TEFs for dioxin-like compounds. Part 1. Evaluation of decision analysis methods for use in weighting relative potency data. Organohalogen Compd. 68, 2519.
  19. Stan Development Team, 2018. RStan: the R Interface to Stan. R package version 2.17.3. http://mc-stan.org.
  20. Stephens ML, Betts K, Beck NB, Cogliano V, Dickersin K, Fitzpatrick S, et al., 2016. The emergence of systematic review in toxicology. Toxicol. Sci. 152 (1), 10–16. 10.1093/toxsci/kfw059.
  21. Sutter CH, Bodreddigari S, Sutter TR, Carlson EA, Silkworth JB, 2010. Analysis of the CYP1A1 mRNA dose-response in human keratinocytes indicates that relative potencies of dioxins, furans, and PCBs are species and congener specific. Toxicol. Sci. 118 (2), 704–715. 10.1093/toxsci/kfq262.
  22. Therneau T, Atkinson B, 2018. rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.
  23. Toyoshiba H, Walker NJ, Bailer AJ, Portier CJ, 2004. Evaluation of toxic equivalency factors for induction of cytochromes P450 CYP1A1 and CYP1A2 enzyme activity by dioxin-like compounds. Toxicol. Appl. Pharmacol. 194 (2), 156–168. 10.1016/j.taap.2003.09.015.
  24. United States Environmental Protection Agency, 2010. Recommended Toxicity Equivalence Factors (TEFs) for Human Health Risk Assessments of 2,3,7,8-Tetrachlorodibenzo-p-Dioxin and Dioxin-like Compounds. Risk Assessment Forum, U.S. Environmental Protection Agency, Washington, DC. EPA/100/R-10/005. December 2010. http://www.epa.gov/osa/raf/hhtefguidance/.
  25. United States Environmental Protection Agency, 2001. Appendix C – Characterizing variability and uncertainty in the concentration term. In: Risk Assessment Guidance for Superfund (RAGS) Volume III – Part A: Process for Conducting Probabilistic Risk Assessment.
  26. United States Environmental Protection Agency, Risk Assessment Forum, 2012. Benchmark Dose Technical Guidance. U.S. Environmental Protection Agency, Office of the Science Advisor, Risk Assessment Forum. EPA/100/R-12/001.
  27. Van den Berg M, Birnbaum L, Bosveld AT, Brunström B, Cook P, Feeley M, et al., 1998. Toxic equivalency factors (TEFs) for PCBs, PCDDs, PCDFs for humans and wildlife. Environ. Health Perspect. 106 (12), 775–792. 10.1289/ehp.98106775.
  28. Van den Berg M, Birnbaum LS, Denison M, De Vito M, Farland W, Feeley M, et al., 2006. The 2005 World Health Organization reevaluation of human and mammalian toxic equivalency factors for dioxins and dioxin-like compounds. Toxicol. Sci. 93 (2), 223–241. 10.1093/toxsci/kfl055.
  29. Walker MK, Heid SE, Smith SM, Swanson HI, 2000. Molecular characterization and developmental expression of the aryl hydrocarbon receptor from the chick embryo. Comp. Biochem. Physiol. C Pharmacol. Toxicol. Endocrinol. 126 (3), 305–319. 10.1016/s0742-8413(00)00119-5.
  30. Wikoff DS, Rager JE, Chappell GA, Fitch S, Haws L, Borghoff SJ, 2019. A framework for systematic evaluation and quantitative integration of mechanistic data in assessments of potential human carcinogens. Toxicol. Sci. 167 (2), 322–335. 10.1093/toxsci/kfy279.
