Preferential Mixture-of-Experts: Interpretable Models that Rely on Human Expertise As Much As Possible

Melanie F Pradier; Javier Zazo; Sonali Parbhoo; Roy H Perlis; Maurizio Zazzi; Finale Doshi-Velez

. 2021 May 17;2021:525–534.

PMCID: PMC8378634 PMID: 34457168

Abstract

We propose Preferential MoE, a novel human-ML mixture-of-experts model that augments human expertise in decision making with a data-based classifier only when necessary for predictive performance. Our model exhibits an interpretable gating function that provides information on when human rules should be followed or avoided. Training maximizes the gating function's reliance on human-based rules while minimizing classification errors. We formulate this as a coupled multi-objective problem with convex subproblems, develop approximate algorithms, and study their performance and convergence. Finally, we demonstrate the utility of Preferential MoE on two clinical applications for the treatment of Human Immunodeficiency Virus (HIV) and management of Major Depressive Disorder (MDD).

1. Introduction

In the last few years, there has been a growth in the use of machine learning (ML) methods for decision-making in complex domains such as loan approvals, medical diagnosis and criminal justice. In particular, ML currently plays a key role in the healthcare sector for several tasks such as developing medical procedures [1, 2], handling patient data and records [3] and treating chronic diseases [4]. However, these algorithms typically require large amounts of data to make reasonable predictions. Additionally in the health sector, variability in practice between clinicians, patient heterogeneity, different disease prevalences, and confidentiality issues all result in final training cohorts being relatively small. Moreover, a clinician is often faced with rare events or outlier cases, where classic ML approaches suffer from insufficient training samples. In each of these scenarios, it is crucial to be able to incorporate clinical experience and domain knowledge.

Specifically, in practice, clinicians often rely on relatively simple human-based rules that reflect reasonable approaches to handle a situation. These rules can be seen as an additional source of knowledge that can be leveraged when building ML systems for clinical decision-support. For instance, clinicians treating patients with HIV tend to adhere to a list of guidelines for administering first and second-line therapies specified by several organizations [5, 6]; other well-known guidelines exist for prescribing antidepressants to address Major Depressive Disorder (MDD) [7]. Often, these rules provide benefits that are not easily formalized into a machine learning objective, for example, in terms of safety [8], or tolerability [9] (e.g., not giving excitatory drugs to a patient that has insomnia). Thus, one might prefer an ML system that agrees with these human-based rules as much as possible.

Several ML methods have been proposed that combine human expertise with training data to perform a prediction task [10, 11]. Some of these methods, such as [12], explicitly focus on modeling the interaction between an automated ML model and an external decision-maker; the decision-maker determines whether to reject a particular decision made by the model based on the model's confidence and the expertise of the decision-maker. An extension to this procedure in [10] describes when to defer decisions to a downstream decision-maker based solely on samples of the expert's decisions. In contrast to these approaches, we propose an ML system that complements human expertise only when needed, that is, it gives preference to human-based rules as much as possible, subject to explicit performance constraints in the optimization problem.

In this work, we develop a novel mixture-of-experts (MoE) approach, called Preferential MoE, that explicitly incorporates human expertise in learning to provide predictions that align with human-based rules as frequently as possible without losing performance. The MoE framework allows for an intuitive way to combine ML with clinical expertise. Importantly, Preferential MoE provides a means of enforcing preference for the human decision rules, as well as an interpretable gating function that allows us to understand when data-driven or clinical expertise should be used. Specifically, we identify when a human decision rule should be followed, and when it makes more sense to provide an alternative data-driven prediction. Overall, by explicitly incorporating and optimizing for human expertise in our predictions, we obtain models that align better with human knowledge, making them easier to inspect, audit and trust.

2. Related Work

Human-ML decision making systems.

There is a long history of approaches to incorporate human expertise in the architecture of ML systems. In particular, [13] and [14] propose methods that map rules to elements of a neural network. [15] incorporates human-based knowledge gates into Recurrent Neural Networks for question answering or text matching. Closer in spirit, [16] constrained an ML model to be more credible by relying as much as possible on input predictors that are intuitive for human experts. All these approaches include human expertise as input or intermediate features, whereas we assume that the expert information is available in the form of output decision rules, on which we want to rely as much as possible. Recently, [17] learns an ML system complementary to humans by modeling the residual of humans in the context of time series. Here we focus on classification, and additionally provide an interpretable explanation about when to rely on human-based rules. Finally, [18] proposes a knowledge distillation approach, where human decisions are used as a teacher, and a student network is trained to mimic the human decisions while performing well on test data. Rather than implicitly treating human expertise as additional ground-truth labels (a teacher), our approach is able to ignore human rules when they are found to be unreliable.

Mixture of Experts.

In the ML community, mixture-of-experts (MoE) models [19, 20] are frequently used to leverage different types of expertise in decision-making. The model works by explicitly learning a partition of the input space such that different regions of the domain may be assigned to different specialized sub-models or experts. MoEs have also been applied to several healthcare domains such as HIV [21, 22]. The proposed approach, Preferential MoE, is different in three regards. First, we explicitly incorporate human knowledge in the form of therapy standards and guidelines for medical decision-making. Second, our framework expresses an explicit preference for a specific expert (human-based), and trains an ML-based expert to complement the human expert. Third, we learn a gating function, which makes the model easy to interpret and gives us information on when human-based rules should be followed.

Learning to defer approaches.

[12, 10] propose MoE classification models to be used as triage tools, where only the most critical decisions are deferred to a medical expert, whilst relying on data-driven approaches the majority of the time. Specifically, these classifiers are trained based solely on the samples of an expert's decisions. Other approaches for integrating human expertise in decision-making such as [4, 23] train a standard classifier on the data and subsequently obtain uncertainty estimates based on this classifier and the human expert. The decision is ultimately deferred to the expert with the lowest uncertainty. Unlike triage methods, we view human expertise as complementary to data-driven approaches and explicitly leverage these sources of knowledge to inform better predictions. That is, we optimize to rely on human expertise as much as possible, except for those regions for which human-based rules are inadequate. Our training samples consist of generic (potentially partial) rules that have been specified a priori.

3. Methodology

In this section, we present Preferential MoE. The proposed approach fulfills two desiderata. First, Preferential MoE relies on the human rules as much as possible while preserving predictive performance. When the human-based rules are damaging w.r.t the prediction task, the proposed approach is able to overrule them (that is, we recover the same solution as the unconstrained standard MoE formulation). Second, the gating function is interpretable, providing information on when each human guideline is applicable.

An overview of how Preferential MoE operates is illustrated in Figure 1. Colors in columns b)-d) represent predictive decision boundaries. In the proposed example, the human-based rule predicts red everywhere in the input space (sketch b). The third column (sketch c) shows the final predictions (colors) for each region of the input space. Each prediction either comes from the human decision rule, or from a data-based ML classifier. We learn a gating function (highlighted in purple) to select which classifier to rely on, as well as a complementary ML classifier to make predictions in regions outside of the purple region. In this diagram, both the standard MoE and the Preferential MoE exhibit the same predictive performance; however, the Preferential MoE relies on humans much more often.

More formally, let $D = \{(x_{n}, y_{n})\}_{n = 1}^{N}$ be a dataset of observations where $x_{n} \in ℝ^{d}$ are the covariates and $y_{n} \in \{1, \dots, K\}$ is a categorical outcome for a specific prediction task. Let the guideline function $g : ℝ^{d} \to \{1, 0, -1\}$ be an aggregated function encoding all available human decision rules, whose input is the covariates and whose output is either an outcome category or a flag (–1) indicating that the rule is not applicable (human does not know). This human guideline function g is fixed a priori by domain knowledge or well-established medical practice. If we do not have access to an explicit human function g, but only to samples of past human decisions, we can pre-train a classifier to mimic those human decisions beforehand, and use that classifier as our human-based rules function g.

Given dataset D, there might exist several functions that exhibit similar predictive performance, but are qualitatively different. We want to use expert knowledge (via human-based rules) to guide the optimization such that we are able to find models that have high predictive performance and agree with the human-based rules as much as possible. In order to accomplish that, we will include both objectives in the proposed optimization.

Modeling.

Our goal is to make predictions that prioritize human-based rules when the data supports (or does not contradict) such knowledge, and learn to defer to another trainable ML expert when the human rules run counter to empirical evidence. For that, we propose a new classification model based on a mixture of experts formulation. Let $f_{θ} : ℝ^{d} \to Δ^{K}$ be a trainable ML expert parameterized by θ, where $Δ^{K}$ denotes the (K – 1)-simplex (outcome vectors of $f_{θ}$ should sum to one). Our approach combines the predictions of the ML expert $f_{θ}$ and the human expert g via the gating function $ρ_{w} : ℝ^{d} \to \{0, 1\}$ parametrized by $w$; $ρ_{w}$ is another classifier that selects which expert to rely on given the covariates x. The prediction model of Preferential MoE is formalized as

$$\hat{y}_{θ, w}(x) = \begin{cases} (1 - ρ_{w}(x)) \, f_{θ}(x) + ρ_{w}(x) \, g(x) & \text{if } g(x) \neq -1 \\ f_{θ}(x) & \text{if } g(x) = -1 \end{cases}$$

(1)

where ${\hat{y}}_{θ, w} (x) \in Δ^{K}$ , and the likelihood function is given by

$$y \mid x \sim \mathrm{Categorical}(\hat{y}_{θ, w}(x))$$

(2)

Note that the ML expert f_θ might make predictions and specialize in input regions where the human rule g is not applicable or is inaccurate. In summary, y|x assumes that every data point x can be discriminated by f_θ or g, and $ρ_{w}$ makes a deterministic decision on which expert to rely on. When learning $ρ_{w}$ , we will prioritize human-based rules g during inference. Notice that (1) produces a non-convex prediction model which may be difficult to optimize.
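To make Equation (1) concrete, the following is a minimal sketch of the prediction rule in plain Python. It rests on assumptions that are not part of the original formulation: class labels are indexed from 0 to K - 1, the human label g(x) is one-hot encoded so that it lives on the same simplex as the output of the ML expert, and the names `preferential_moe_predict`, `one_hot`, `toy_f` and `toy_g` are purely illustrative.

```python
from typing import Callable, List, Sequence

NOT_APPLICABLE = -1  # flag returned by g when no human rule applies


def one_hot(label: int, num_classes: int) -> List[float]:
    """Encode a class label in {0, ..., num_classes - 1} as a point on the simplex."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def preferential_moe_predict(
    x: Sequence[float],
    f_theta: Callable[[Sequence[float]], Sequence[float]],
    g: Callable[[Sequence[float]], int],
    rho_w: Callable[[Sequence[float]], float],
    num_classes: int,
) -> List[float]:
    """Combine the ML expert and the human rule as in Equation (1)."""
    ml_probs = list(f_theta(x))
    label = g(x)
    if label == NOT_APPLICABLE:
        # No human rule applies: rely on the ML expert alone.
        return ml_probs
    rho = rho_w(x)  # 1 means "follow the human rule", 0 means "follow the ML expert"
    human_probs = one_hot(label, num_classes)  # assumption: one-hot encoding of g(x)
    return [(1.0 - rho) * p + rho * h for p, h in zip(ml_probs, human_probs)]


# Toy usage with hypothetical experts on a one-dimensional input.
toy_f = lambda x: [0.3, 0.7]
toy_g = lambda x: 0 if x[0] > 0 else NOT_APPLICABLE
followed = preferential_moe_predict([1.0], toy_f, toy_g, lambda x: 1.0, 2)
abstained = preferential_moe_predict([-1.0], toy_f, toy_g, lambda x: 1.0, 2)
```

In this toy setting, `followed` is the one-hot human prediction because the gate is fully open, while `abstained` falls back to the ML expert because the rule is flagged as not applicable. A gate value strictly between 0 and 1 corresponds to the continuous relaxation used during training.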

The gating function selects when a decision should rely on a human-based rule or on a trained expert. By making $ρ_{w}$ an interpretable function, e.g., a linear classifier or a decision tree, the model learns which features are important for human-based decision making, and identifies the regions of other expert classifiers. We note that, even if the gating function $ρ_{w}$ is chosen to be interpretable, our approach does not provide theoretical guarantees on identifying all the regions suitable for human decision rules. More generally, $ρ_{w}$ can also be a non-interpretable function, e.g., a neural network. In such a case, the gating function still identifies regions appropriate for human-based decisions, although the parameters w are no longer interpretable. Overall, our framework allows model constructions that balance flexibility and interpretability to suit different applications.

Problem formalization. Our formulation as an optimization problem needs to reflect the following criteria: (i) we want to minimize the predictive error, (ii) we want to follow human-based rules as frequently as possible without hurting performance.

We optimize for predictive performance by minimizing the cross-entropy $L_{θ, w}^{γ} (D)$ with respect to the predictions from Equation (1); this corresponds to a standard maximum log-likelihood estimator for the probabilistic model y|x with an additional regularizer. For example, if the outcome is binary we write

$$L_{θ, w}^{γ}(D) = -\sum_{n = 1}^{N} \left[ y_{n} \ln(\hat{y}_{θ, w}(x_{n})) + (1 - y_{n}) \ln(1 - \hat{y}_{θ, w}(x_{n})) \right] + γ \| w \|_{1},$$

(3)

where $γ \geq 0$ is a regularization weight that controls the trade-off between predictive performance and sparsity of w. A sparse w can help identify important features for the gating function $ρ_{w} (x)$ .
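As an illustration of the regularized loss in Equation (3), the sketch below evaluates it for a binary outcome. It assumes the usual sign convention in which the cross-entropy is the negative log-likelihood, so that smaller values are better and the L1 penalty is added on top. The function name `regularized_cross_entropy` and the clipping constant `eps` are illustrative choices, not part of the original method.

```python
import math
from typing import Sequence


def regularized_cross_entropy(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    w: Sequence[float],
    gamma: float,
    eps: float = 1e-12,
) -> float:
    """Binary cross-entropy of the MoE predictions plus an L1 penalty on the gate weights."""
    nll = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite (assumption)
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    l1_penalty = gamma * sum(abs(w_j) for w_j in w)
    return nll + l1_penalty


# Two uninformative predictions and a gate with weights (1, -2):
# the loss is 2 * ln(2) from the data term plus 0.1 * 3 from the penalty.
example_loss = regularized_cross_entropy([1, 0], [0.5, 0.5], [1.0, -2.0], gamma=0.1)
```

Increasing `gamma` raises the cost of every non-zero gate weight, which is what drives the sparse predictor lists reported for the gating function.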

We bound the cross-entropy loss $L_{θ, w}^{γ}(D)$ with a prefixed optimized value for performance guarantees. Denote by $L_{θ^{*}, w^{*}}^{γ}(D)$ an attainable loss where $θ^{*}$ and $w^{*}$ are solutions of minimizing $L_{θ, w}^{γ}(D)$ for the stated MoE in Equation (2). Consider a margin $ε \geq 0$ measuring an acceptable performance decrease, and consider the constraint:

$$L_{θ, w}^{γ}(D) \leq (1 + ε) \, L_{θ^{*}, w^{*}}^{γ}(D).$$

(4)

Equation (4) guarantees that the loss does not increase beyond the specified margin, so that predictive performance is maintained. We introduce sets $Θ \subset ℝ^{q}$ and $W \subset ℝ^{p}$ such that $θ \in Θ$ and $w \in W$. Variables θ and w do not need to have the same dimensions and can be constructed with different classifier models.

We present next the problem formulation for Preferential MoE:

$$G : \quad \begin{cases} \text{(player 1)} & \min_{w \in W} \; -\sum_{n = 1}^{N} \ln(ρ_{w}(x_{n})) \quad \text{s.t.} \quad L_{θ, w}^{γ}(D) \leq (1 + ε) \, L_{θ^{*}, w^{*}}^{γ}(D) \\ \text{(player 2)} & \min_{θ \in Θ} \; L_{θ, w}^{γ}(D) \end{cases}$$

(5)

We refer to (5) as game G. Using game theory terminology, there are 2 players and each player optimizes their own objective, variables and constraints, while taking into account the other player's decisions. Notice that G explicitly models our discussed goals: player 1, which is optimizing the gating function, maximizes the number of human-based decisions; player 2, which optimizes the classifier f_θ , minimizes prediction error according to the loss function (3). Note that the negative logarithm is a monotone transformation that helps obtain a convex objective for player 1. Player 1 also imposes the performance constraint and limits the classification loss.

4. Inference Algorithms

Inference for determining θ and w from G proceeds in two steps:

Unconstrained optimization: we train a standard MoE model from Equation (2) by minimizing the performance loss $L_{θ, w}^{γ}(D)$ described in Equation (3). This step yields a performance reference value $L_{θ^{*}, w^{*}}^{γ}(D)$, which we aim to maintain up to a certain margin ε. We use the optimal parameters $θ^{*}$ and $w^{*}$ from the unconstrained problem as a warm initialization for the next step.
Constrained optimization: we solve game G, initializing from the previous solution.

We discuss two algorithms for solving G. The first proposal combines both objectives and uses a log-barrier method to approximate a solution. The second proposal takes gradient steps that minimize each objective alternately and projects onto the feasible region. Both methods have convergence guarantees.

Log-Barrier Method. We want to approximate a solution of G by simplifying its formulation. We move player 1's constraint to the objective using a log-barrier penalty used in interior point methods [24, Chapter 11], and we get

$$\min_{θ \in Θ, \, w \in W} \; -t \sum_{n = 1}^{N} \ln(ρ_{w}(x_{n})) - \ln\left( (1 + ε) \, L_{θ^{*}, w^{*}}^{γ}(D) - L_{θ, w}^{γ}(D) \right).$$

(6)

The first term of Equation (6) corresponds to player 1's objective, and the second term to the log-barrier function $\hat{I}(u) = -(1/t) \ln(-u)$ transforming its constraint, which also aligns with player 2's objective.

The argument of the logarithm in the barrier term can become negative and cause numerical instability, so care needs to be taken with step sizes and correct initialization (warm-start). Parameter t is a hyperparameter that weights the satisfiability of the constraint, and the approximation improves as t grows. Note that this approximated form encourages the difference $L_{θ^{*}, w^{*}}^{γ}(D) - L_{θ, w}^{γ}(D)$ to become large, even when the constraint is already satisfied. This has the desirable effect of continuously minimizing $L_{θ, w}^{γ}(D)$. Finally, because of the non-convex nature of the problem, gradient descent methods only guarantee convergence to stationary solutions.
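The objective in Equation (6) can be sketched as a scalar function of the current gate outputs and the current loss. This is only an evaluation of the objective under stated assumptions, not a training loop: the function name `log_barrier_objective` is hypothetical, gate values are clipped away from zero to keep the logarithm finite, and an infeasible point is mapped to infinity, which is one common convention for barrier functions.

```python
import math
from typing import Sequence


def log_barrier_objective(
    rho_values: Sequence[float],
    loss: float,
    reference_loss: float,
    epsilon: float,
    t: float,
    tiny: float = 1e-12,
) -> float:
    """Evaluate -t * sum(ln rho) - ln((1 + epsilon) * L_ref - L) for the current iterate."""
    slack = (1.0 + epsilon) * reference_loss - loss
    if slack <= 0.0:
        # The performance constraint is violated: the barrier is infinite here.
        return math.inf
    coverage_term = -sum(math.log(max(r, tiny)) for r in rho_values)
    return t * coverage_term - math.log(slack)


# With the gate fully open the coverage term vanishes and only the barrier remains.
barrier_only = log_barrier_objective([1.0, 1.0], loss=1.0, reference_loss=1.0, epsilon=0.1, t=5.0)
# A loss beyond the (1 + epsilon) budget is rejected outright.
infeasible = log_barrier_objective([1.0, 1.0], loss=2.0, reference_loss=1.0, epsilon=0.1, t=5.0)
```

Returning infinity for infeasible iterates makes the role of the warm start explicit: an optimizer that begins at the unconstrained solution starts with strictly positive slack whenever ε > 0, and sufficiently small steps keep it there.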

Projected Gradient Method. In game G, player 2's decisions affect player 1's constraint, and player 1's decisions affect player 2's objective. A simple algorithm would be to alternate solving subproblems and repeat until convergence. Such schemes are only guaranteed to converge under very stringent conditions of monotonicity of the game. Monotonicity is a desirable property of multivariate mappings, informally stating that a small change in the input guarantees a bounded change in the output, therefore permitting dynamics of control towards stable solutions. We refer the reader to [25] for definitions, properties and algorithms for solving monotone games.

We present Algorithm 1 for solving G. The algorithm makes a gradient update on each objective, and projects the result onto the feasibility region. We denote estimates at iteration k by $θ^{k}$ and $w^{k}$. The feasibility region is denoted by $K_{ε}(θ^{k + 1})$, and is formally introduced in the appendix. The operator $\Pi_{K_{ε}(θ^{k + 1})}$ denotes projection of w onto the set $K_{ε}(θ^{k + 1})$. The projection solves the optimization problem $\Pi_{K_{ε}(θ^{k + 1})}(z) = \arg\min_{w \in K_{ε}(θ^{k + 1})} \frac{1}{2} \| w - z \|^{2}$, whose solution can be efficiently computed via a bisection search, described in Algorithm 2. The optimization inside the while loop in Algorithm 2 can be solved via L-BFGS [26].
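The role of the feasibility step can be illustrated with a deliberately simplified stand-in. The sketch below is not Algorithm 2 and does not compute the exact Euclidean projection: it assumes a known feasible point (for instance the warm-start solution) and bisects along the segment between that point and the infeasible iterate z until it finds the point closest to z that still satisfies the loss budget. The names `pull_back_to_feasible`, `loss_fn` and `loss_budget` are hypothetical, and the quadratic loss in the usage example is a toy choice.

```python
from typing import Callable, List, Sequence


def pull_back_to_feasible(
    z: Sequence[float],
    w_feasible: Sequence[float],
    loss_fn: Callable[[Sequence[float]], float],
    loss_budget: float,
    tol: float = 1e-6,
) -> List[float]:
    """Bisect on the segment [w_feasible, z] for a point with loss_fn(w) <= loss_budget."""
    if loss_fn(w_feasible) > loss_budget:
        raise ValueError("w_feasible must satisfy the loss budget")
    if loss_fn(z) <= loss_budget:
        return list(z)  # z is already feasible, nothing to do

    def point(alpha: float) -> List[float]:
        return [a + alpha * (b - a) for a, b in zip(w_feasible, z)]

    lo, hi = 0.0, 1.0  # invariant: point(lo) is feasible, point(hi) is not
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if loss_fn(point(mid)) <= loss_budget:
            lo = mid
        else:
            hi = mid
    return point(lo)


# Toy usage: the feasible set is the unit ball, and the iterate z lies outside it.
squared_norm = lambda w: sum(w_j * w_j for w_j in w)
w_back = pull_back_to_feasible([2.0, 0.0], [0.0, 0.0], squared_norm, loss_budget=1.0)
```

Because the loop maintains a feasible lower endpoint throughout, the returned point always respects the budget, even when the feasible set is not convex. The exact projection described in the text would additionally minimize the distance to z over the whole set rather than along a single segment.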

Algorithm 1: Projected Gradient Descent


Algorithm 2: $\Pi_{K_{ε}(θ^{k + 1})}$ (bisection search)


5. Results

We compare the performance of Preferential MoE against several baselines on two medical tasks: the treatment of Human Immunodeficiency Virus (HIV) and the pharmacological management of Major Depressive Disorder (MDD). Our baselines include using predictions a) based on a human expert alone; b) a logistic regression ML expert alone; c) a standard mixture-of-experts model (standard MoE); d) the learn-to-defer model in [12]; and e) a learn-to-defer model from [10]. For the standard MoE and Preferential MoE, we train models either assuming discrete $ρ(x)$ values to begin with, or assuming continuous $ρ(x)$ values and then discretizing at the end, exploring all operating points for the threshold of the gating function. Here we report the latter, which seems to work better in practice.

Hyperparameter selection. For both prediction tasks, we explore different learning rates for both the unconstrained and constrained optimization steps, in the range $\{10^{-4}, 10^{-3}, 0.01, 0.1\}$. We also explore a range of regularization parameters $γ \in \{0.0, 0.001, 0.01, 0.05, 0.1, 1.0\}$ for the gating function, and select those that maximize predictive performance in a validation set. For the psychiatry dataset, we additionally regularize the ML classifier with an L1 penalty to avoid overfitting due to the high dimensionality of the input space. We fix the margin ε = 0.1, and the trade-off parameter t = 5.0 for the log-barrier penalty in Equation (6). Our results were stable to perturbations of these parameters. Intuitively, t can be matched to existing interior-point algorithms and is quite robust with appropriate gradient step sizes. The margin ε affects the model's accuracy, but even though there is no direct mapping from its value to a desired performance level, its impact was similar in the range $ε \in [10^{-2}, 2 \times 10^{-1}]$. Setting ε too small can keep the model from moving away from the initialization point, so that its solution stays similar to that of the standard MoE.

Evaluation metrics. To evaluate Preferential MoE and the baselines, we measure performance as the area under the ROC curve (AUC), as well as predictive accuracy (percentage of correct predictions) over a fine grid of threshold values, both for the gating function and for the final predictions. All thresholds are chosen by cross-validation, so that they are selected in a data-driven manner with respect to the most adequate metric for each downstream task. We report coverage as a measure of how frequently (in percentage) each model relies on the human-based guideline function g. More specifically, we define soft-coverage as $100.0 \times E[ρ(x)]$ and hard-coverage(v) as $100.0 \times E[1[ρ(x) \geq v]]$ for a given gating function threshold v.
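The two coverage definitions translate directly into empirical averages over a set of gate outputs. The following sketch uses hypothetical function names (`soft_coverage`, `hard_coverage`) and a made-up list of gate values purely for illustration.

```python
from typing import Sequence


def soft_coverage(rho_values: Sequence[float]) -> float:
    """Empirical 100 * E[rho(x)]: average reliance on the human rules, in percent."""
    return 100.0 * sum(rho_values) / len(rho_values)


def hard_coverage(rho_values: Sequence[float], threshold: float) -> float:
    """Empirical 100 * E[1[rho(x) >= threshold]]: share of inputs routed to the human rules."""
    routed = sum(1 for r in rho_values if r >= threshold)
    return 100.0 * routed / len(rho_values)


# Hypothetical gate outputs on four test points.
gate_outputs = [0.2, 0.8, 0.5, 0.9]
soft = soft_coverage(gate_outputs)
hard = hard_coverage(gate_outputs, threshold=0.5)
```

Soft coverage summarizes the continuous gate in a single number, whereas hard coverage depends on the chosen threshold and is therefore reported as a curve when the threshold is swept.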

5.1. Human Immunodeficiency Virus (HIV) Therapy Outcome Prediction

HIV currently affects more than 36 million people worldwide. The life-long use of combinations of antiretrovirals has largely helped combat the virus in most parts of the world and has transformed the virus from a life-threatening condition to a chronic illness. However, administering therapies is tricky as patients frequently suffer from drug resistance, viral relapses or spikes, as well as adherence issues and several other side-effects from use of antiretrovirals.

We identified individuals between 18 and 72 years of age from the EuResist database, which comprises genotype, phenotype and clinical information of over 65 000 individuals in response to antiretroviral therapy administered between the years 1983 and 2018. We focus on a subset of 36 780 of these patients who received at least 3 prior treatments and base our predictions on the genotype, phenotype, clinical and demographic information of these individuals. The curated dataset contains a total of 384 such features. Our goal is to predict short-term therapy success where viral suppression is maintained for at least 40 days after a therapy is administered.

Table 1 shows predictive performance and coverage results for the proposed approach and competing baselines. Compared to other approaches, Preferential MoE exhibits the highest soft coverage while either retaining or improving predictive performance. Figure 3 compares the accuracy relative to hard thresholding of the coverage for each of the MoE models. In the HIV setting, both variants of the Preferential MoE outperform the standard MoE approach at various coverage values. At 60% coverage, all methods perform similarly in terms of accuracy.

Table 1: Performance vs Coverage (HIV): Preferential MoE relies much more often on human expertise while preserving predictive performance. Predictive performance is measured by the area under the ROC curve (AUC); reliance on human decision rules is measured by soft coverage.

Model	AUC (mean)	AUC (CI)	Soft coverage % (mean)	Soft coverage % (CI)
ML only	0.64	[0.63-0.65]	0.00	[0.00-0.00]
Learn-to-defer [12]	0.71	[0.68-0.72]	54.07	[48.18-55.63]
Consistent Learn-to-defer [10]	0.66	[0.62-0.69]	56.81	[50.02-57.62]
Standard MoE (unconstrained)	0.69	[0.69-0.70]	52.87	[51.19-54.55]
Preferential MoE (log barrier)	0.74	[0.72-0.76]	62.06	[60.8-63.32]
Preferential MoE (projected gradient)	0.74	[0.73-0.75]	63.18	[61.7-64.66]


Figure 3: **Accuracy-coverage trade-off.** Preferential MoE (trained by the log barrier or projected gradient method) for HIV either relies more on human rules for the same predictive accuracy, or gets higher accuracy for the same coverage with human rules.

Importantly, Preferential MoE allows us to incorporate human expertise into the prediction task and, through the gating function, provides insight into when it makes sense to follow the rules. Table 2 provides a sparse list of predictors and corresponding weights averaged over 10 random seeds for the gating function. These predictors are associated with regions where it makes sense to follow human intuition. While Standard MoE identifies blood count data, certain mutations and a patient's risk group as meaningful factors, Preferential MoE identifies a substantially different set of predictors. Notably, many of the predictors identified in the latter correspond to cases where patients have additional conditions, such as lipodystrophy or side effects to medication, where it is preferable to rely on human judgement to determine how to treat these individuals. Figure 2 compares the gating function values ρ(x) in the test set for HIV. Unsurprisingly, Preferential MoE shows a higher preference for relying on human rules.

Table 2: Interpretation of gating function (HIV). Sparse list of predictors describing the regions where human decision rules are followed. We report weight parameters averaged across 10 different random seeds, for regularization γ=0.1. (left) Standard MoE (predictors after step 1 in training); (right) Preferential MoE (predictors after step 2 in training). Highlighted in red/green are those predictors that disappear/pop-up after step 2 in training.

Standard MoE: weight w	Standard MoE: covariate description	Preferential MoE: weight w	Preferential MoE: covariate description
+0.1612 ± 0.014	CD8+ cell count (cells/ml)	+0.0359 ± 0.022	CD4+ cell count (cells/ml)
-0.1161 ± 0.002	Reverse Transcriptase Mutation 67N	-0.0236 ± 0.027	Baseline Viral Load
-0.0310 ± 0.025	Protease Mutation 20M	+0.0151 ± 0.030	High Adherence
0.0280 ± 0.001	Blood count; complete (CBC)	+0.0150 ± 0.001	Number of Prior Treatment Lines
-0.0195 ± 0.005	Co-infection of Hepatitis C	+0.076 ± 0.007	Pregnancy
-0.0156 ± 0.001	Stavudine	-0.0055 ± 0.016	Reverse Transcriptase Mutation 184V
-0.0124 ± 0.011	Reverse Transcriptase Mutation 215YF	-0.0035 ± 0.002	Race black
+0.0121 ± 0.020	Nevirapine	-0.0026 ± 0.001	Lamivudine
-0.0068 ± 0.031	Risk group MSM	+0.0025 ± 0.003	Anaemia
-0.0055 ± 0.005	Age	+0.0012 ± 0.007	Lipodystrophy


Figure 2: **ρ(x) values in the test set for HIV.** Preferential MoE pushes up the values for the gating functions, favoring human decision rules more frequently in the input space. Each box plot corresponds to a different random seed (we report 3 different initializations per method).

5.2. Prediction of Antipsychotic for Major Depressive Disorder (MDD)

Antidepressant prescription for MDD often involves trial and error. Roughly 2/3 of individuals diagnosed with MDD do not achieve remission with their initial treatment, and 1/4 of patients are expected to drop out against clinical advice before finishing their treatment [27, 28]. The list of potential side-effects translates into tolerability and safety concerns that need to be taken into account while prescribing antidepressants. Here we focus on predicting prescription of antipsychotics, a class of medication primarily used to manage psychosis, but often used as an adjunctive treatment in the pharmacological management of MDD. The guideline function g for this prediction task is as follows: if the patient has anxiety or insomnia, promote antipsychotic (predict positive label); if the patient is overweight, avoid antipsychotic (predict negative label).
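The guideline described above can be written down as a small rule function. The sketch below is illustrative only: the dictionary keys (`anxiety`, `insomnia`, `overweight`) are hypothetical field names, and the precedence between the two rules when both apply is an assumption, since the text does not specify how such conflicts are resolved. Patients matching neither rule receive the not-applicable flag –1.

```python
from typing import Mapping

NOT_APPLICABLE = -1  # the rule does not cover this patient


def mdd_guideline(patient: Mapping[str, bool]) -> int:
    """Return 1 (promote antipsychotic), 0 (avoid antipsychotic) or -1 (no rule applies)."""
    if patient.get("anxiety", False) or patient.get("insomnia", False):
        # Assumption: the anxiety/insomnia rule is checked first and wins on conflict.
        return 1
    if patient.get("overweight", False):
        return 0
    return NOT_APPLICABLE


# Hypothetical patients described by boolean indicators.
promote = mdd_guideline({"anxiety": True})
avoid = mdd_guideline({"overweight": True})
unknown = mdd_guideline({"hypertension": True})
```

A function of this form can be plugged in as g in Equation (1); the cases where it returns –1 are exactly those in which the model must rely on the ML expert.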

We identified individuals aged 18-80 years drawn from the outpatient clinical networks of two academic medical centers in New England, Massachusetts General Hospital and Brigham and Women's Hospital. These patients had received at least one electronically-prescribed antidepressant between March 2008 and December 2017 with a diagnosis of MDD or depressive disorder at the nearest visit to that prescription. The goal is to predict prescription of antipsychotic based on demographic information (gender, race) as well as diagnostic and procedure codes. Race and gender were self-identified features and were included as a proxy for socio-economic variables. The curated dataset consists of 3,865 individuals and 1,680 features.

Table 3 shows predictive performance and soft coverage results for the proposed approach and competing baselines, averaged across 5 random initializations. We encountered issues training the Learn-to-defer approaches on this data (probably due to its high dimensionality), so we only include the other baselines. Preferential MoE exhibits the highest soft coverage (reliance on human rules) while maintaining (or even slightly improving) predictive performance.

Table 3: Performance vs Coverage (Psychiatry): Preferential MoE relies much more often on human expertise while preserving predictive performance. Predictive performance is measured by the area under the ROC curve (AUC); reliance on human decision rules is measured by soft coverage.

Model	AUC (mean)	AUC (CI)	Soft coverage % (mean)	Soft coverage % (CI)
ML only	0.70	[0.69-0.71]	0.00	[0.00-0.00]
Standard MoE (unconstrained)	0.71	[0.70-0.71]	31.41	[28.48-34.56]
Preferential MoE (log barrier)	0.72	[0.71-0.73]	48.34	[46.24-51.74]
Preferential MoE (projected gradient)	0.72	[0.71-0.72]	45.06	[42.85-46.70]


Preferential MoE gives us additional information on when to follow such human rules by inspecting the gating function. Table 4 presents the sparse list of predictors for the gating function, associated with regions where human decision rules are followed. By regularizing the gating function classifier with an L1 penalty, we get a concise list of predictors to describe those regions. The list on the left corresponds to Standard MoE (unconstrained optimization), and the list on the right corresponds to Preferential MoE (constrained optimization maximizing reliance on humans). In both lists, most predictors corresponding to general patient care (examination, hospital care, etc.) are negatively correlated: this can be interpreted as higher reliance on humans in the absence of patient care related codes. In the case of Preferential MoE, additional covariates coding for cardiovascular risk factors (highlighted in green) are positively correlated with reliance on human rules. Such information can be used to explore refinements of the human-based rules.

Table 4: Interpretation of gating function. Sparse list of predictors describing the regions where human decision rules are followed. We report weights averaged across 10 different random seeds, for a regularization parameter γ=0.1. (left) Standard MoE (predictors after unconstrained step 1 in training); (right) Preferential MoE (predictors after step 2 in training). Highlighted in red/green are those predictors that disappear/pop-up after step 2 in training.

Standard MoE: weight w	Covariate description	Preferential MoE: weight w	Covariate description
-0.0303 ± 0.0143	Subsequent hospital care	-0.0202 ± 0.0043	Subsequent hospital care
-0.0242 ± 0.0152	MDD, recurrent episode	-0.0115 ± 0.0075	MDD, recurrent episode
-0.0235 ± 0.0126	Psychiatric examination	-0.0110 ± 0.0107	Psychiatric examination
-0.0208 ± 0.0071	Depressive disorder	0.0103 ± 0.0041	Office or outpatient visit
-0.0153 ± 0.0106	Anxiety state	-0.0070 ± 0.0112	Depressive disorder
-0.0117 ± 0.0089	Office or outpatient visit	0.0068 ± 0.0022	General medical examination
-0.0083 ± 0.0033	Radiologic examination	0.0061 ± 0.0022	Type II diabetes
-0.0073 ± 0.0062	Trazodone	0.0037 ± 0.0016	Hypertension
-0.0068 ± 0.0049	Emergency department visit	-0.0035 ± 0.0046	Anxiety state
-0.0031 ± 0.0062	race white	-0.0034 ± 0.0083	Trazodone
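To illustrate why an L1 penalty produces a short predictor list like the one above, the following sketch fits a logistic gate by proximal gradient descent (soft-thresholding after each gradient step). This is not the authors' training procedure, which solves a coupled problem with a log-barrier method or projected gradient; the toy data, the function names, and the step size are assumptions made purely for illustration. Only the value γ = 0.1 is borrowed from the table caption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    """Shrink value toward zero; anything within [-amount, amount] becomes exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(rows, labels, gamma, lr=0.5, steps=2000):
    """Minimize mean logistic loss + gamma * ||w||_1 (bias left unpenalized)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        residuals = [
            sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            for x, y in zip(rows, labels)
        ]
        grad_w = [sum(r * x[j] for r, x in zip(residuals, rows)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        # Gradient step on the smooth loss, then the L1 proximal (shrinkage) step.
        w = [soft_threshold(wj - lr * gj, lr * gamma) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def make_toy_data():
    """Hypothetical data: feature 0 is strongly predictive, feature 1 only weakly."""
    # Each cell: (x0, x1, number of positive labels, number of negative labels).
    cells = [(1, 1, 9, 1), (1, -1, 8, 2), (-1, 1, 2, 8), (-1, -1, 1, 9)]
    rows, labels = [], []
    for x0, x1, n_pos, n_neg in cells:
        rows += [(x0, x1)] * (n_pos + n_neg)
        labels += [1] * n_pos + [0] * n_neg
    return rows, labels


rows, labels = make_toy_data()
w_sparse, _ = fit_l1_logistic(rows, labels, gamma=0.1)
w_dense, _ = fit_l1_logistic(rows, labels, gamma=0.0)
print("with L1 penalty:   ", w_sparse)
print("without L1 penalty:", w_dense)
```

With the penalty, the weakly informative feature is dropped from the gate entirely, while the unpenalized fit keeps a small nonzero weight for it. The same mechanism is what keeps the list of gating predictors concise.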


Figure 4 compares the histograms of the gating function values $\rho(x)$ on the test set. As expected, Preferential MoE pushes those values up, reflecting a preference for relying on human rules when possible. Although these values are continuous, we can discretize them using a threshold $v$ calibrated on the validation set. Each threshold $v$ yields a different trade-off between accuracy and coverage. Figure 5 shows the trade-off between accuracy and hard coverage reachable by these models. As a reference point, the human decision rules alone have an accuracy of 49.87% for this prediction task. The curves are averaged over 10 different random seeds; each curve is obtained by varying the thresholds for the gating function and the final decision. Overall, Preferential MoE reaches better trade-offs: either better accuracy for a given hard coverage, or more hard coverage for a given accuracy level.

Figure 5: Accuracy-coverage trade-off. Preferential MoE either relies more on human rules for the same predictive accuracy, or gets higher accuracy for the same coverage with human rules.
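The thresholding described above can be sketched as follows. The code assumes that each test point comes with a gate value $\rho(x)$, a binary human-rule prediction, and an already binarized ML prediction; all names and the toy numbers are hypothetical. It sweeps only the gate threshold $v$, whereas the curves in the paper also vary the threshold of the final decision.

```python
def tradeoff_curve(rhos, human_preds, ml_preds, labels, thresholds):
    """Return (v, hard_coverage, accuracy) for each gate threshold v.

    A point follows the human rule when rho(x) >= v and the ML classifier
    otherwise. Hard coverage is the fraction of points sent to the human rule.
    """
    n = len(labels)
    curve = []
    for v in thresholds:
        use_human = [rho >= v for rho in rhos]
        preds = [h if u else m for u, h, m in zip(use_human, human_preds, ml_preds)]
        hard_coverage = sum(use_human) / n
        accuracy = sum(p == y for p, y in zip(preds, labels)) / n
        curve.append((v, hard_coverage, accuracy))
    return curve


# Toy example (made-up numbers, not data from the paper).
rhos = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
human_preds = [1, 0, 1, 1, 0, 0]
ml_preds = [1, 0, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1, 1]

for v, coverage, accuracy in tradeoff_curve(
    rhos, human_preds, ml_preds, labels, thresholds=[0.0, 0.5, 1.0]
):
    print(f"v={v:.1f}  hard coverage={coverage:.2f}  accuracy={accuracy:.2f}")
```

Lowering $v$ sends more patients to the human rules and raises hard coverage; whether accuracy rises or falls depends on how well the rules do in the regions the gate selects, which is the trade-off the figure summarizes.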

6. Conclusion

We presented Preferential MoE, a mixture of experts that learns a general ML classifier and combines it with a human expert, prioritizing the human-based rules. We proposed a game formulation with two objectives, which we solve with either a log-barrier method or alternating projected gradient descent. We evaluated both approaches on the prediction of HIV therapy success and on the prescription of antipsychotics for MDD. Both algorithms preserve predictive performance while increasing the coverage of human-based decisions relative to the baselines, under both soft and hard assignments of the gating function. Future work will explore other MoE formulations that balance performance and global optimality.

Footnotes

Equal contribution.

The appendix is available at: https://arxiv.org/abs/2101.05360


References

[1]. Hamid K., et al. “Machine learning with abstention for automated liver disease diagnosis”. 2017 International Conference on Frontiers of Information Technology (FIT). IEEE; 2017. pp. 356–361.
[2]. Esteva A., et al. “Dermatologist-level classification of skin cancer with deep neural networks”. Nature. 2017;542.7639:115–118. doi: 10.1038/nature21056.
[3]. Pianykh O. S., et al. “Improving healthcare operations management with machine learning”. Nature Machine Intelligence. 2020;2.5:266–273.
[4]. Raghu M., et al. “The algorithmic automation problem: Prediction, triage, and human effort”. 2019.
[5]. World Health Organization. Tackling HIV drug resistance: trends, guidelines and global action. Tech. rep. 2017.
[6]. OARAC. “Guidelines for the Use of Antiretroviral Agents in Adults and Adolescents with HIV”. Panel on Antiretroviral Guidelines for Adults and Adolescents. 2017.
[7]. Lage I., et al. “Do clinicians follow heuristics in prescribing antidepressants?”. Submitted. 2020.
[8]. Stone M., et al. “Risk of suicidality in clinical trials of antidepressants in adults: analysis of proprietary data submitted to US Food and Drug Administration”. BMJ (Clinical research ed.). 2009;339. doi: 10.1136/bmj.b2880.
[9]. Blumenthal S. R., et al. “An electronic health records study of long-term weight gain following antidepressant use”. JAMA Psychiatry. 2014 Aug;71.8. ISSN: 2168-6238. doi: 10.1001/jamapsychiatry.2014.414.
[10]. Mozannar H., Sontag D. “Consistent Estimators for Learning to Defer to an Expert”. 2020.
[11]. Gennatas E. D., et al. “Expert-augmented machine learning”. Proceedings of the National Academy of Sciences. 2020;117.9:4571–4577. doi: 10.1073/pnas.1906831117.
[12]. Madras D., Pitassi T., Zemel R. “Predict responsibly: improving fairness and accuracy by learning to defer”. Advances in Neural Information Processing Systems. 2018:6147–6157.
[13]. Towell G. G., Shavlik J. W. “Knowledge-based artificial neural networks”. Artificial Intelligence. 1994.
[14]. Tran S. N., d’Avila Garcez A. S. “Deep Logic Networks: Inserting and Extracting Knowledge From Deep Belief Networks”. IEEE Transactions on Neural Networks and Learning Systems. 2018 Feb;29.2. doi: 10.1109/TNNLS.2016.2603784.
[15]. Wu Y., et al. “Knowledge enhanced hybrid neural network for text matching”. AAAI Conference. 2018.
[16]. Wang J., et al. “Learning credible Models”. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018 Jul. arXiv:1711.03190.
[17]. Chattha M. A., et al. “KINN: Incorporating Expert Knowledge in Neural Networks”. 2019.
[18]. Hu Z., et al. “Harnessing deep neural networks with logic rules”. arXiv:1603.06318. 2016.
[19]. Jacobs R. A., et al. “Adaptive mixtures of local experts”. Neural Computation. 1991;3.1:79–87. doi: 10.1162/neco.1991.3.1.79.
[20]. Jordan M. I., Jacobs R. A. “Hierarchical mixtures of experts and the EM algorithm”. Neural Computation. 1994;6.2:181–214.
[21]. Parbhoo S., et al. “Combining kernel and model based learning for hiv therapy selection”. AMIA Summits on Translational Science Proceedings. 2017;2017:239.
[22]. Parbhoo S., et al. “Improving counterfactual reasoning with kernelised dynamic mixing models”. PloS one. 2018;13.11:e0205839. doi: 10.1371/journal.pone.0205839.
[23]. Wilder B., Horvitz E., Kamar E. “Learning to Complement Humans”. arXiv:2005.00582. 2020.
[24]. Boyd S., Vandenberghe L. Convex optimization. Cambridge University Press; 2004 Mar.
[25]. Scutari G., et al. “Monotone games for cognitive radio systems”. Springer; 2012.
[26]. Liu D. C., Nocedal J. “On the limited memory BFGS method for large scale optimization”. Mathematical Programming. 1989;45.1-3:503–528.
[27]. Hughes M. C., et al. “Assessment of a Prediction Model for Antidepressant Treatment Stability Using Supervised Topic Models”. JAMA Network Open. 2020 May;3. doi: 10.1001/jamanetworkopen.2020.5308.
[28]. Pradier M. F., et al. “Predicting treatment dropout after antidepressant initiation”. Translational Psychiatry. 2020.

