Author manuscript; available in PMC: 2020 Jul 8. Published in final edited form as: ACM BCB. 2017 Aug;2017:526–535. doi: 10.1145/3107411.3107447

Infer Cause of Death for Population Health Using Convolutional Neural Network

Hang Wu 1, May D Wang 1,*
PMCID: PMC7341948  NIHMSID: NIHMS1595655  PMID: 32642743

Abstract

In biomedical data analysis, inferring the cause of death is a challenging and important task: it serves public health reporting and can improve patients’ quality of care by identifying the more severe conditions. Causal inference, however, is notoriously difficult. Traditional causal inference relies mainly on data collected from experiments of specific design, which are expensive to run and limited to a particular disease cohort, making the approach less generalizable.

In this paper, we adopt a novel data-driven perspective to analyze and improve the death reporting process, assisting physicians in identifying the single underlying cause of death. To achieve this, we build a state-of-the-art deep learning model, a convolutional neural network (CNN), and achieve around 75% accuracy in predicting the single underlying cause of death from a list of relevant medical conditions. We also provide interpretations of the black-box neural network, so that death reporting physicians can apply the model with a better understanding of how it works.

Introduction

Physicians and medical examiners face the challenge of inferring causes of death in their death reporting routine. When a death occurs in a hospital, for example, a physician files a death certificate with the state agency, summarizing the demographics, a sequence of up to 20 medical conditions relevant to the death, coded using the ICD-10 standard, and the single underlying cause of death the physician considers most probable. These mortality data are finally aggregated and recorded by the National Vital Statistics System of the National Center for Health Statistics (NCHS).

With more and more such death certificates available, it is natural to ask: can we use these large-scale observational datasets to build a causal inference model that identifies the cause of death from the observations?

Causal inference, which aims to uncover the mechanism behind observations or to predict the effect of an intervention on a system, is an important task in biomedical data analysis. Identifying the causes of diseases and deaths facilitates public health reporting and improves individuals’ quality of care by guiding the design of treatments.

Such causal inference tasks are notoriously difficult: for each patient we observe only one of all the potential outcomes, so determining a causal effect requires comparing the observed outcome against the unobserved ones under some approximation scheme. Several biomedical studies have examined the causal structures in patients with one type of disease, designing randomized or non-randomized experiments to analyze causal effects. These studies provide great insight into the diseases they examine, but they have limitations: collecting experimental data is expensive and time-consuming; moreover, in a general hospital setting, physicians and nurses may face a combination of complicated medical conditions and must prioritize treatment and resource allocation, which requires understanding the effects of combinations of conditions, so studies of a single disease may be insufficient.

In this study, we take a novel approach: we analyze death certificate data and build a predictive model to assist physicians in identifying the single underlying cause of death.

Predicting the single cause of death from an input sequence of conditions poses several challenges: 1) The input sequence of conditions is highly unstructured, so a traditional one-hot representation of each condition leads to very high-dimensional inputs, and subsequent feature extraction would require large computational resources. 2) Since we are essentially selecting one medical condition out of a list, and each death certificate differs, we need a model that adaptively processes inputs of different lengths and predicts accordingly.

To address these challenges, inspired by the recent success of deep learning, we propose to apply deep learning, specifically convolutional neural networks, to build our causal inference model. Deep learning has shown strong performance in processing raw data in various formats and learning task-specific representations, and deep learning architectures can be configured to adapt to inputs of different shapes (Group 2017).

Our model will facilitate further causality modeling in several aspects: 1) Although the current model is built on NCHS mortality data, it transfers directly to more specific electronic health record (EHR) data from hospital admissions. By parsing an EHR into a sequence of observed conditions, we can identify the most likely cause of death, helping physicians attend to the most important conditions and thus improving the quality of care. 2) Another advantage of deep learning models lies in the distributed representations of the input medical conditions; such representations are universal and can be applied to other data analysis tasks to better understand the relationships between conditions.

The overall model structure is presented in Fig. 1.

Figure 1: Overall Architecture.

In this paper, we analyze a causal inference task originating from the death reporting process. In this process, for each death case, the physician selects a sequence of medical conditions, coded in the ICD-10 standard, that are most relevant to the death, and identifies one condition as the underlying cause of death. To predict the cause of death from this sequence, we build a convolutional neural network and provide interpretations of the model’s predictions for physicians.

Our contribution in this paper is mainly two-fold:

  • We designed a convolutional neural network (CNN) model to identify the underlying cause of death from a list of relevant medical conditions using death certificate data, achieving about 75% accuracy, a significant performance improvement over conventional methods.

  • We illustrated how to interpret the black-box deep learning model, so that physicians can understand the prediction and choose whether to adopt it.

In the rest of the paper, we first review related work on cause of death, causal inference, and deep learning models. We present our model in Section 3, with experiment results in Section 4. We then show how we can interpret the black-box algorithms, followed by discussions and conclusions.

Related Work

Causes of Death

Understanding causes of death has been a major challenge in biomedical research: discerning new risk factors for death can help physicians understand the mechanisms of death and disease, so as to improve patients’ quality of care.

Previous research focuses mainly on discovering the cause of death for patients with specific disease types, or on discovering new factors that can be identified as causes of death. The Emerging Risk Factors Collaboration studied the risk factors associated with diabetes (Collaboration and others 2011), and Khorana et al. studied cancer patients receiving chemotherapy (Khorana et al. 2007). Other studies cover causes of neonatal deaths (Lawn et al. 2005), Alzheimer’s deaths (Mölsä, Marttila, and Rinne 1986), and multiple sclerosis (Sadovnick et al. 1991).

Back in 1986, Israel et al. (Israel, Rosenberg, and Curtin 1986) first pointed out the analytical potential of multiple cause-of-death data, and with the growing popularity of statistical data analysis methods, the dataset gained traction in the healthcare research community. Redelings et al. (M. D. Redelings, Wise, and Sorvillo 2007) analyzed associations between cause-of-death conditions, and Jiang et al. (Jiang, Wu, and Wang 2017) analyzed the evolution of causes with a topic model approach. Some researchers also look into subgroups of patients associated with certain diseases, for example, McCoy et al. on asthma (McCoy et al. 2005) and Melamed et al. on sepsis (Melamed and Sorvillo 2009). Yet these studies mostly take an association analysis approach, and few have focused on the problem of causal identification.

Causal Inference

Causal inference plays a vital role in analyzing biomedical observational studies, as it can help determine the treatment effect of certain drugs or procedures (Rubin 2007; Kleinberg and Hripcsak 2011). Rubin (1974) was the first to analyze the causal inference problem in such experiments, mainly discussing the difference between randomized and non-randomized experiments. Such experiments aim to discover the effect of a binary treatment, and “randomized/non-randomized” refers to whether the assignment of treatment depends on the patients’ conditions. In a randomized experiment, since the treatment is assigned randomly, it is straightforward to analyze the treatment effect by comparing the treated population to the untreated population. However, most biomedical studies are non-randomized observational studies, so it is crucial to identify the causal structure, or to design special algorithms that account for confounding effects.

Identifying causal structure mainly builds on the causal graph framework proposed by Pearl (2009), where variables/features are represented as nodes and edges indicate causal relationships. Under this framework, when the causal structure is specified beforehand, often by domain experts, a structural equation model (SEM) (Muthén 1984; Bollen and Long 1993) can analyze the causal effects of intervening on certain variables.

Oftentimes, such structure is unknown or incomplete, and a series of works has been conducted to learn it. Constraint-based algorithms, such as the PC algorithm (Spirtes, Glymour, and Scheines 2000), learn the graph by exploiting conditional independence relationships between variables. Score-based methods (Chickering 2002) design an evaluation metric to score each candidate causal structure and find the one with the highest score.

On the other hand, to analyze treatment effects from observational studies, we can adopt the potential outcomes framework (Rubin 2005). Two popular approaches are matching (Stuart 2010; Morgan and Winship 2014; Rubin 2006) and propensity scores. In matching, we find pairs of instances that receive opposite treatments while being most similar in all other features; the difference in their outcomes after treatment then estimates the treatment effect. Propensity scores work by reweighing instances to convert an observational study into a pseudo-randomized experiment, which can then be analyzed as a randomized one (Rosenbaum and Rubin 1983; Agostino 1998).

Deep Learning and CNN

The past decade has witnessed the success of deep learning, enabled by effective training algorithms (Hinton, Osindero, and Teh 2006; Kingma and Ba 2014), high-performance computing hardware including GPUs, and large-scale labeled datasets (Deng et al. 2009). Deep learning has shown great capabilities in image classification (Krizhevsky, Sutskever, and Hinton 2012), image segmentation (J. Long, Shelhamer, and Darrell 2015), text analysis (Sutskever, Martens, and Hinton 2011), and reinforcement learning (Mnih et al. 2013).

Recently, biomedical researchers have been applying deep learning to biomedical data analysis (Holzinger and Jurisica 2014). Liu et al. analyzed brain imaging with deep learning for early diagnosis of Alzheimer’s disease; Esteva et al. used deep learning to classify a skin cancer dataset of about 130,000 images, achieving dermatologist-level accuracy (Esteva et al. 2017); Suo et al. (Suo et al. 2016) applied deep belief nets to derive risk factors from electronic health records.

CNN was originally proposed for image classification (LeCun, Bengio, and others 1995), initially for digit recognition. It revived as the standard practice for image classification around 2012, with the success of AlexNet on ImageNet. CNNs then showed great success in other image processing tasks, such as image segmentation (J. Long, Shelhamer, and Darrell 2015) and human action recognition (Ji et al. 2013). CNNs were also applied to sentence classification (Kim 2014), inspiring several follow-ups (Dos Santos and Gatti 2014; Hu et al. 2014). Another line of work uses recurrent neural networks (Lai et al. 2015), which have shown better capabilities in text generation (Sutskever, Martens, and Hinton 2011) and in converting text to other modalities (Mao et al. 2014).

Model

Problem Formulation

In the death reporting scenario, a physician first selects several conditions, coded as ICD-10 codes,1 as those most relevant to the death. The conditions are recorded in the order they were observed. Among these conditions, one is identified as the cause of death, based on the physician’s expertise and understanding of the case.

In this paper, we are interested in automating the latter part of this process: identifying the one cause of death from the list of conditions. Mathematically, suppose we have a vocabulary of ICD-10 conditions $V$, with total size $|V|$. We are given a dataset $\{x_i, y_i\}$, $i = 1, \ldots, N$, where $x_i = [c_{i,1}, \ldots, c_{i,k_i}]$ is the sequence of $k_i$ relevant conditions recorded, and $y_i$ is the identified cause of death. We are interested in learning a classifier $f(x_i) = y_i \in V$, essentially a multi-class classification problem whose input is a list of items also drawn from the vocabulary $V$.

The sequential and discrete nature of the ICD-10 conditions in our case presents a strong analogy with natural language. We can regard each ICD-10 condition as a word, and physicians use a sequence of ICD-10 conditions to describe a death case, much as a human sentence describes a concept. So our problem can be considered a sentence classification problem. However, we should note a profound difference between our problem and sentence classification: traditional sentence classification generally deals with binary classification (e.g., positive vs. negative sentiment), or a few classes indicating the topic of the sentence (e.g., religious, movie, news). In our cause of death identification, there are thousands of ICD-10 codes as possible class labels, which means the final classification probability could be sparsely distributed over the vocabulary, posing a computational challenge.
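
To make the formulation concrete, the following minimal Python sketch (with a tiny hypothetical vocabulary and an illustrative helper, not our actual pipeline) shows how one certificate maps to the classifier’s input and target:

# Hypothetical encoding of one death certificate for f(x_i) = y_i in V.
# The vocabulary and ICD-10 codes below are illustrative only.
vocab = {"I50": 0, "J44": 1, "I25": 2, "I10": 3}   # a tiny stand-in for V

def encode(codes):
    # Map a sequence of ICD-10 codes to integer indices over V.
    return [vocab[c] for c in codes]

x_i = encode(["I50", "J44", "I25"])                # k_i = 3 relevant conditions
y_i = vocab["I25"]                                 # identified cause of death, in V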

In light of the recent success of deep learning in text classification, we adapt the convolutional neural network (CNN) framework to our application, with necessary modifications.

Convolutional neural network (CNN) for sentence classification

Applying CNNs to sentence classification was proposed by Kim (2014), and has since been extensively studied and applied in the literature. Here we first review the basics of the CNN framework, and then introduce our modified model.

For a sequence $x = [c_1, \ldots, c_k]$, where each condition $c_j \in \mathbb{R}^{|V|}$ is a one-hot encoding vector, we first apply a word embedding to obtain a distributed representation (Mikolov et al. 2013) in a lower-dimensional space $\mathbb{R}^D$. Equivalently, we learn a weight matrix $W \in \mathbb{R}^{|V| \times D}$, so that each condition $c_j$ is embedded as the corresponding row of the matrix via the matrix multiplication $v_j = c_j W$.

After the embedding, we concatenate all the embedding vectors to obtain the initial representation of the sequence:

$$v = v_1 \oplus v_2 \oplus \cdots \oplus v_k.$$

The convolution operator is applied to a segment of the sequence, determined by the window size. For a window of size $H$, the convolution uses a filter $m \in \mathbb{R}^{H \times D}$, and the result of convolving it with a segment $v_{i:i+H-1} = [v_i, v_{i+1}, \ldots, v_{i+H-1}]$ is

$$z_i = f(m * v_{i:i+H-1} + b_0),$$

where $*$ is the convolution operator, $b_0$ a scalar bias term, and $f$ a nonlinear transformation, such as tanh or ReLU. The rationale for the convolution operation is that conditions occurring close together in the sequence should share some characteristics and be correlated with each other, just as neighboring pixels are in the 2D image case.

We can further apply a pooling operation to the resulting feature map $[z_1, \ldots, z_k]$, taking the maximum or average over all the $z_i$. The intuition is that for each filter we keep the most salient feature, and this operation naturally handles variable-length sequences.

In practice, we can apply several convolution filters to obtain several corresponding features. These intermediate features can be passed again through convolution filters and nonlinear transformations, and stacking these layers composes a deep neural network, which some describe as a parallel CNN.

In the final layer, for the hidden feature vector $u \in \mathbb{R}^{D_1}$, we apply a fully connected layer to obtain the final output $y = g(u^T Q + b_1)$. The parameter $Q \in \mathbb{R}^{D_1 \times |V|}$ maps the hidden feature to a distribution over the whole vocabulary $V$, and the condition with maximum probability is predicted as the cause of death.
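
As a minimal PyTorch sketch of the architecture just described (the hyper-parameters follow Section 4, e.g., D = 128 and kernel sizes 3/5/7, but the implementation details are our illustrative assumptions, not the exact code used):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CauseOfDeathCNN(nn.Module):
    # A Kim (2014)-style CNN over condition sequences; a sketch, not the exact code.
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per window size H, each holding filters m in R^{H x D}.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(0.5)
        # Fully connected layer Q mapping hidden features u to logits over V.
        self.fc = nn.Linear(num_filters * len(kernel_sizes), vocab_size)

    def forward(self, x):                       # x: (batch, seq_len) condition indices
        v = self.embedding(x).transpose(1, 2)   # (batch, D, seq_len)
        # Convolution + nonlinearity + max-over-time pooling for each window size.
        pooled = [F.relu(conv(v)).max(dim=2).values for conv in self.convs]
        u = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(u)                       # logits over the vocabulary V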

An illustration of the model architecture is shown in Fig. 2.

Figure 2: Model architecture. The convolutional neural network architecture used in this work follows Kim (2014): a 1D convolution layer and max pooling, followed by a fully connected layer producing logit outputs.

Proposed method: convolutional neural network with dynamic computation graph

A challenge of the vanilla CNN structure above lies in the final parameter matrix $Q$: computing a soft-max distribution over the whole vocabulary of codes results in sparse entries and makes inference more difficult.

To overcome this challenge, we note that instead of predicting over the whole vocabulary, we only need to select one condition out of the input sequence, whose length is significantly smaller than the total vocabulary. Moreover, each sequence has a potentially different length, which requires dynamically building a neural network for each input.

The new “define by run” paradigm of deep learning frameworks enables us to build such dynamic neural networks. In brief, instead of specifying a static network structure before any input is fed, we take a particular training sample, a sequence in our case, and define the network structure from that input to its output. The same holds for the testing phase, where we dynamically construct a network for each test sample, potentially of varying shape. Although this practice seems more tedious computationally, it supports dynamically sized data, such as our variable-length sequences of medical conditions, and also reduces the complexity of the computation graph implementation.
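
A minimal sketch of the resulting prediction rule, assuming (as described above) that the candidate causes are restricted to the conditions appearing in the input sequence; the indices here are hypothetical:

import torch

# Score only the conditions present in this certificate's input sequence,
# instead of a soft-max over all |V| codes. A sketch, not the authors' code.
logits = torch.randn(1610)               # model scores over the full vocabulary
x = torch.tensor([17, 342, 5, 1100])     # hypothetical input condition indices
pred = x[logits[x].argmax()]             # predicted cause is one of the inputs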

Regularization

Deep neural networks tend to overfit the data; thus, in our implementation, we imposed the following three regularization techniques.

Batch Normalization (BN)

Batch Normalization (BN) (Ioffe and Szegedy 2015) builds on the intuition that during training, as the network parameters change, the distribution of inputs to the nonlinear activation functions shifts accordingly. Mathematically, for each dimension $x_k$ of the $D$-dimensional input feature $x = [x_1, \ldots, x_D]$, the normalized feature is

$$\hat{x}_k = \frac{x_k - \mathrm{E}[x_k]}{\sqrt{\mathrm{Var}[x_k]}},$$

where the expectation and variance are computed over each mini-batch.

Reducing this covariate shift accelerates convergence during training, and sometimes acts as a regularizer.
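
A small numeric sketch of the normalization above (omitting the small epsilon and the learned scale/shift that nn.BatchNorm1d adds in practice; the shapes are illustrative):

import torch

xb = torch.randn(64, 300)          # a mini-batch of 64 feature vectors
# Normalize each dimension with the mini-batch mean and (biased) variance.
x_hat = (xb - xb.mean(dim=0)) / xb.var(dim=0, unbiased=False).sqrt()
# Each dimension now has approximately zero mean and unit variance in the batch.
print(x_hat.mean(dim=0).abs().max(), x_hat.std(dim=0).mean())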

Dropout (DO)

Dropout (Srivastava et al. 2014) is one of the most successful regularization techniques. In brief, the weights of a proportion of the hidden neurons in the network layers are set to 0 during training. For example, in the final layer of our network, the output is computed as $y = g(u^T Q + b_1)$. Instead of applying $Q$ directly, we randomly generate a mask matrix $M_Q$, where each entry is a Bernoulli variable with a specified probability $p \in (0, 1)$, often set to 0.5 as suggested in the literature (Srivastava et al. 2014; Kim 2014).

The output is then

$$\hat{y} = g(u^T (Q \odot M_Q) + b_1),$$

where $\odot$ denotes element-wise multiplication.

At test time, we scale the learned weight matrix by $p$ as $\hat{Q} = pQ$, and use it without dropout to predict unseen sequences.
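
A minimal sketch of this masking and test-time rescaling (here p plays the role of the keep probability, matching the formulation above; the shapes are illustrative):

import torch

p = 0.5                                        # Bernoulli probability, as above
Q = torch.randn(300, 1610)                     # hidden-to-vocabulary weight matrix
M_Q = torch.bernoulli(torch.full_like(Q, p))   # mask M_Q, resampled each step
Q_train = Q * M_Q                              # training: element-wise masked weights
Q_test = p * Q                                 # testing: scale by p, no mask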

Early Stopping (ES)

Early stopping is also widely used as a regularization technique for almost any classifier trained with sub-gradient methods. It is especially popular in training deep neural networks, as it saves a great amount of computation time while largely preserving test performance.

We first partition off a small proportion of the training set as a development set, and train classifiers on the remainder. Once the performance on the development set becomes worse than the performance on the training remainder, and the gap exceeds a threshold set beforehand, we conclude that training has likely begun to overfit and terminate the training process.
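
A minimal sketch of this procedure, with train_step and dev_loss_fn as assumed callables (and patience as an illustrative stopping criterion), not our actual training code:

def early_stopping_loop(train_step, dev_loss_fn, max_steps=21000, patience=3):
    # Stop once the development loss has failed to improve `patience` times.
    best_dev, bad_rounds = float("inf"), 0
    for step in range(max_steps):
        train_step()                      # one mini-batch update
        dev_loss = dev_loss_fn()          # evaluate on the held-out development set
        if dev_loss < best_dev:
            best_dev, bad_rounds = dev_loss, 0
        else:
            bad_rounds += 1               # development loss stopped improving
            if bad_rounds >= patience:
                break                     # likely overfitting: terminate early
    return best_dev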

Discussion: CNN Vs. Bag of Words

Traditional sentence processing mainly uses the bag-of-n-grams method to represent a sentence (Wolf, Poggio, and Sinha 2006), where each dimension is the term frequency times inverse document frequency (tf-idf) of an n-gram in the sentence. With these features extracted, off-the-shelf classification methods, such as naive Bayes or support vector machines, can be applied to classify the sentences.

Despite its simplicity, this approach has several disadvantages compared to CNNs: 1) the n-gram features tend to be extremely sparse, and even infeasible to compute when n is large; 2) the one-hot representation of n-grams ignores the parts shared between n-grams, as well as the distributed representations obtained by embedding words; 3) training such models requires first loading all data into RAM to compute the tf-idf matrix, so mini-batch training cannot be used, making the model less scalable.

Experiments & Results

Dataset overview

For the experiments, we use the death certificates filed in the United States in 2014, approximately 2 million records of death cases (NCHS 2017). After preprocessing, removing identical records, and filtering out records with fewer than 3 conditions, we obtain 1,499,128 records.

The ICD-10 codes, in the format A123.4, follow a hierarchical structure, where the digits before the dot can be regarded as a coarse, high-level classification of the condition. To save computation, we use this coarse version; as a result, we obtain a vocabulary of 1,610 input conditions and a total of 1,180 possible classes as causes of death.

Configurations for CNN

For our method, we experiment with both statically and dynamically constructed CNNs. The statically constructed one is referred to as CNN-static. For the dynamic version, we can either train the network with a static structure and, at test time, select only the conditions present in the input, or use a dynamic structure in both the training and testing phases. These two versions are referred to as CNN-dyn-eval and CNN-dynamic, respectively.

The network structure we used is specified as follows:

Input -> Embedding Layer -> Convolution & Pooling -> Batch Normalization -> Dropout -> Fully Connected Layer -> Output 

The embedding dimension is set to 128, and the three kernel sizes for the convolution layers are 3, 5, and 7. The dropout probability is set to 0.5 as suggested in (Kim 2014), and the maximum norm of the parameters is set to 3.0. Our model is built with PyTorch (Group 2017) and adapted from open source implementations.2

Baseline methods

For the baselines, we implemented two types of baseline algorithms: traditional BoW classification, and shallow classifiers built on embeddings.

For the bag-of-words feature extraction, we first construct a count matrix X, where X[i,j] denotes the count of word j in document i. The tf-idf transform, short for “term frequency times inverse document frequency,” is then applied to X to obtain the final feature matrix. We used bag-of-words instead of the more expressive n-grams because the latter would require far more computational power than we had available. We then apply naive Bayes (NB), support vector machine (SVM), and logistic regression (LR) to the feature matrix to obtain predictions.
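
A sketch of this baseline using the Scikit-Learn pipeline cited below; treating each condition sequence as a space-separated “sentence,” and the tiny example inputs, are our illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["I50 J44 I25", "I10 J96 I64"]     # hypothetical condition sequences
labels = ["I25", "I64"]                   # hypothetical underlying causes of death

# Counts -> tf-idf feature matrix X; each ICD-10 code is one token.
X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)
clf = MultinomialNB().fit(X, labels)      # NB on the tf-idf features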

For shallow classifiers, we use the architectures shown below

Input -> Embedding Layer -> Vector Averaging -> Fully Connected Layer -> Output

After embedding all the medical conditions of a sequence, we average them as the vector representation of the sequence, then use a fully connected layer to obtain the final output. We use cross entropy loss (equivalent to a logistic regression model) and multi-margin loss (equivalent to a support vector machine model), and also implemented the three variants with/without dynamic graphs in PyTorch.
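
A minimal PyTorch sketch of this shallow architecture (illustrative, not the exact code):

import torch
import torch.nn as nn

class ShallowAvg(nn.Module):
    # Embedding averaging followed by a fully connected layer, as described above.
    def __init__(self, vocab_size, embed_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):                     # x: (batch, seq_len)
        v = self.embedding(x).mean(dim=1)     # average the condition vectors
        return self.fc(v)                     # logits over the vocabulary

# nn.CrossEntropyLoss() yields the logistic-regression variant,
# nn.MultiMarginLoss() the support-vector-machine variant.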

Experiment Settings

We randomly partition the data into training, development, and test sets with ratio 7.9 : 0.1 : 1. Hyper-parameters were selected based on development-set performance and then evaluated on the test set. We also report model performance with and without early stopping on the development set.

Naive Bayes has no hyper-parameter to tune; we tune the regularization parameter for SVM and the number of trees for RF. For CNN, we mainly tune the kernel sizes. Results are reported as averages over 3 runs of each experiment.

BoW classifiers are run on a CPU with 60 GB RAM, using the Scikit-Learn implementation of this pipeline (Pedregosa et al. 2011). The CNN and shallow learners are trained on an NVIDIA K80 GPU. We use a mini-batch size of 64 and set the maximum number of epochs (iterations over the whole training data) to 2. We use Adam, with its adaptive learning rate, as the sub-gradient optimization method, and test training both with and without early stopping. Training such a network averaged about 5 hours under our configuration, while testing a single case with a trained model takes a few seconds.
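
A minimal sketch of the training loop under these settings; train_loader is an assumed DataLoader of (sequence, cause) batches, and CauseOfDeathCNN refers to the sketch in the Model section:

import torch

model = CauseOfDeathCNN(vocab_size=1610)
optimizer = torch.optim.Adam(model.parameters())   # Adam with adaptive learning rate
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):                    # maximum number of epochs
    for x, y in train_loader:             # mini-batches of 64
        optimizer.zero_grad()
        loss = criterion(model(x), y)     # cross entropy over cause-of-death classes
        loss.backward()
        optimizer.step()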

Evaluation Metrics

To evaluate the performance of the algorithms, we adopt common evaluation metrics for multi-class classification.

Accuracy (ACC)

Accuracy (ACC) measures the percentage of sequences that are predicted correctly.

Cross Entropy Loss

For a classifier with logits as output, the classification cross entropy loss is defined as

$$\mathrm{Loss}(\mathrm{logit}, \mathrm{class}) = -\mathrm{logit}[\mathrm{class}] + \log\left(\sum_{j=1}^{C} \exp(\mathrm{logit}[j])\right),$$

where class is the true class of the sample, and logit is a vector containing the logits for all $C$ classes.
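
As a quick numeric check of this definition against PyTorch’s built-in loss (the values are illustrative):

import torch
import torch.nn.functional as F

logit = torch.tensor([2.0, 0.5, -1.0])
cls = 0
# -logit[class] + log-sum-exp over all classes, per the formula above.
manual = -logit[cls] + torch.logsumexp(logit, dim=0)
builtin = F.cross_entropy(logit.unsqueeze(0), torch.tensor([cls]))
assert torch.allclose(manual, builtin)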

F1

To account for the potential imbalance between false negatives and false positives, the F1 measure computes the harmonic mean of precision and recall. F1 lies in [0, 1], and higher values indicate better predictive power. For multi-class classification, we compute an F1 measure for each class and use the average of these F1s as the final metric.

Cohen’s kappa

Cohen’s kappa is a statistic that measures the inter-rater agreement between two classification outputs, defined as

$$\kappa = 1 - \frac{1 - p_o}{1 - p_e},$$

where $p_o$ is the accuracy defined above, and $p_e$ is the probability of agreement by chance. Perfect agreement gives $\kappa = 1$, and $\kappa \le 0$ indicates no agreement beyond chance.
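
All four metrics can be computed with Scikit-Learn; a sketch with hypothetical labels (per-class F1 averaged as described above):

from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

y_true = ["I25", "I64", "C34", "I25"]     # hypothetical true causes of death
y_pred = ["I25", "I25", "C34", "I64"]     # hypothetical predictions

acc = accuracy_score(y_true, y_pred)              # p_o in the kappa formula
f1 = f1_score(y_true, y_pred, average="macro")    # per-class F1, then averaged
kappa = cohen_kappa_score(y_true, y_pred)         # agreement beyond chance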

Results & Discussion

From Table 1, we can see that the bag-of-words classifiers fail to capture the causal structure in our task, giving poor classification performance. Examining the output shows that the classifier simply predicts the class with the highest frequency, resulting in identical output across all runs. This most likely stems from the feature extraction process, where bag-of-words considers only word frequencies.

Between the shallow classifiers and the CNN, the CNN obtains the highest classification accuracy and, in the dynamic-evaluation configuration, the lowest loss, beating all other models with the lowest variance.

Models with a dynamic network structure in both training and testing phases show varied performance: for CNN and LR it slightly improves classification performance, while for SVM no benefit is observed.

Early stopping as a regularizer decreases the number of batches from about 21,000 to about 9,000 in all three models, saving almost half of the running time. However, most models with early stopping perform worse than those without, indicating that simply comparing development-set and training accuracy may not be a sufficient criterion for detecting overfitting, but may instead capture oscillation around a local minimum.

While some algorithms achieve decent performance in terms of accuracy and loss, all algorithms perform poorly when evaluated with Cohen’s kappa and micro-F1, mainly because we are dealing with an extremely large number of classes. Take micro-F1 as an example: it averages the F1 value over all classes. Some classes have only one or two samples, and if the algorithm predicts these few samples wrong, its precision for that class is zero, hence a zero F1 score, significantly decreasing the final average. In the future, it may be of interest to design algorithms that achieve high F1 in this setting, as classification with an extremely large number of classes is itself an interesting research question.

Parameter Analysis

Deep learning models have a seemingly large number of hyper-parameters to tune; in our case, these include the convolution kernel sizes, the maximum norm, the dropout rate, and the embedding dimension. Here we briefly show the effect of varying these parameters on the final prediction accuracy.

The base model is the standard static CNN; we vary these four parameters and plot the prediction accuracy on the test set, with standard deviations as error bars.

From the figure (Fig. 3), we can see that although there are several parameters to set, given a good model architecture, their specific values do not much influence the final outcome. The exception is the embedding dimension for medical conditions: as it increases, the test performance slightly increases.

Analyzing Embeddings of Medical Conditions

One side product of our model is the distributed representations of medical conditions, which we present visually in Fig. 4, using t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten 2014) after a dimension reduction to a three-dimensional space with kernel PCA.

Figure 4: Embedding Visualization. After dimension reduction with kernel PCA and t-SNE visualization, we plot the embeddings of medical conditions. The points are colored by K-means clustering (with 5 clusters) of the original embedding matrix of size (1610, 128). The embedding vectors are scattered in the 2-dimensional plane and cluster nicely into five groups.
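
A sketch of this visualization pipeline (kernel PCA, then t-SNE, with K-means coloring); the random matrix stands in for our learned embeddings, and the specific parameters are illustrative:

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

E = np.random.randn(1610, 128)                 # stand-in for the learned embeddings
E3 = KernelPCA(n_components=3, kernel="rbf").fit_transform(E)   # reduce to 3-D
E2 = TSNE(n_components=2).fit_transform(E3)    # t-SNE down to the 2-D plot
colors = KMeans(n_clusters=5).fit_predict(E)   # cluster labels on original embeddings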

To understand the embedding better, we also show some of the clustering results, with the top conditions in each cluster, in Fig. 5. We conjecture that the embedding of medical conditions reflects two characteristics, the likelihood of each condition causing death and the physiological relationships between conditions, so we may not see a very clear pattern in the clustering results. For example, cluster 5 consists mainly of circulatory conditions, while cluster 2 contains a few more severe diseases.

Figure 5: Clustering of Embedding. We perform K-means clustering with K = 8 on the embeddings and plot, for each cluster, the top conditions closest to its centroid.

Interpretation of Cause Identification

There is currently no gold standard for explaining the cause of death, and physicians fill in death certificates largely from their own perspective. The nationwide collection of death certificates is thus a good resource from which to distill knowledge about causes of death by training such a supervised learning model. To help physicians truly understand why our model predicts a particular cause for each sequence, we need to provide proper interpretations of the black-box model, i.e., to relate the input sequence to the finally selected condition.

LIME-Local Interpretable Model-Agnostic Explanations

In this section, we briefly introduce the Local Interpretable Model-Agnostic Explanations (LIME) model (Ribeiro, Singh, and Guestrin 2016), which can explain any type of black-box prediction algorithm. To understand which parts of the input contribute to the final prediction, the model perturbs the input around its neighbors and fits a sparse linear model to the classifier’s predictions on these perturbed instances. The weights of the linear model then indicate how important each corresponding part is to the final prediction.
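
A sketch of explaining one prediction with the lime package, treating the condition sequence as text; class_names and predict_proba are assumed wrappers around our CNN, not part of the original implementation:

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=class_names)   # ICD-10 codes as classes
exp = explainer.explain_instance(
    "I50 J44 I25 T82 Y83 I73 I10 J96 I64 K21 F17",       # the case-study sequence
    predict_proba,           # assumed function: list of texts -> class probabilities
    num_features=5,          # top contributing input conditions per explanation
    labels=(0, 1, 2))        # explain the top three likely causes
print(exp.as_list(label=0))  # (condition, weight) pairs for the first cause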

Case study

Here we showcase how the model explains an unseen instance. We synthesize a patient history; the resulting sequence is

‘I50,J44,I25,T82,Y83,I73,I10,J96,I64,K21,F17’

The explanation model outputs the most likely cause of death, I25, along with why certain conditions are more likely to have caused the death, explained in terms of the input conditions. We show this explanation in Fig. 6.

Figure 6: Interpretation. We apply LIME to our CNN and obtain the model’s interpretation for the top three likely causes of death. For each likely cause, we show the contribution of each input medical condition to the final condition, where the absolute value indicates the scale of the contribution and the sign indicates a positive or negative contribution.

Because of the constraints of the data, we can currently only pinpoint these coarse ICD-10 conditions, which admittedly is still limited.

If we had a more complete dataset, containing patients’ full histories as well as the identified causes of death, we could train a deep learning model that predicts the cause of death from the entire medical history. With such a model and our explainer, we could understand death cases in much greater detail. When a new patient is admitted to hospital, the model could identify which condition is the most likely cause of death and which symptoms contribute to this causal sequence.

Another interesting application of such an interpretation model is to provide physicians with several predictive models and their interpretations, and ask them to choose the one that best matches human knowledge. Running such tests would help us choose an accurate model that is also more interpretable.

Conclusion & Future Work

In this paper, we showed how a modern deep learning architecture, the CNN, can be adapted to identify the cause of death. The model shows significant improvement over the traditional baselines and can handle even larger-scale datasets than traditional methods. We also provide human-understandable interpretations of the model, so that death reporting physicians can apply it with a better understanding of its predictions.

The current work is limited by the dataset itself, and there are several ways it can be extended. First, we may deploy the model in a general EHR setting, identifying the most probable potential causes of death so as to alert physicians and nurses to attend to the most critical conditions. Second, there are medical ontologies specified by domain experts that record the viable causal relations between medical conditions; we can integrate the guidance and constraints from these ontologies into our models, reaching a model derived both from data and from human knowledge. Moreover, it will be interesting to see how other deep learning architectures perform on this task and on other causal inference problems.

Figure 3: Parameter Analysis. We varied several key parameters of the CNN model; overall, no clear pattern of influence emerges from our experiments. The Y-axis is the accuracy on the test set with error bars, and the X-axis is the parameter analyzed.

Table 1:

Classification Results

Classifier Name Test Loss Test Accuracy Test Micro F1 Test Cohen Kappa
CNN-static 0.799±0.009 75.481±0.345 8.4e-06±3.8e-08 8.3e-06±3.8e-08
CNN-static-es 0.902±0.017 73.681±0.346 8.2e-06±3.8e-08 8.1e-06±4.0e-08
CNN-dyn 3.996±0.861 68.261±0.362 7.6e-06±4.0e-08 7.5e-06±3.9e-08
CNN-dyn-es 3.765±0.400 51.946±1.130 5.8e-06±1.3e-07 5.6e-06±1.2e-07
CNN-dyn-eval 0.738±0.007 75.787±0.179 8.4e-06±2.0e-08 8.3e-06±1.8e-08
CNN-dyn-eval-es 0.826±0.044 73.184±1.254 8.1e-06±1.4e-07 8.0e-06±1.4e-07

LR 1.011±0.007 66.762±0.298 7.4e-06±3.3e-08 7.3e-06±2.9e-08
LR-es 1.007±0.024 66.706±0.681 7.4e-06±7.6e-08 7.3e-06±7.6e-08
LR-dyn 0.852±0.014 66.946±0.431 7.4e-06±4.8e-08 7.3e-06±4.9e-08
LR-dyn-es 0.886±0.015 66.326±0.720 7.4e-06±8.0e-08 7.2e-06±8.0e-08
LR-dyn-eval 0.842±0.012 67.295±0.514 7.5e-06±5.7e-08 7.4e-06±6.0e-08
LR-dyn-eval-es 0.881±0.006 66.633±0.297 7.4e-06±3.3e-08 7.3e-06±3.1e-08

SVM 1.285±0.056 64.597±1.042 7.2e-06±1.2e-07 7.0e-06±1.2e-07
SVM-es 2.039±1.116 62.812±2.312 7.0e-06±2.6e-07 6.8e-06±2.6e-07
SVM-dyn 14.588±1.857 44.984±1.771 5.0e-06±2.0e-07 4.8e-06±2.3e-07
SVM-dyn-es 13.007±2.572 45.068±2.111 5.0e-06±2.3e-07 4.9e-06±2.3e-07
SVM-dyn-eval 13.800±1.670 47.377±1.891 5.3e-06±2.1e-07 5.1e-06±2.4e-07
SVM-dyn-eval-es 11.253±2.384 48.081±2.345 5.3e-06±2.6e-07 5.2e-06±2.8e-07

LR-BoW 6.486±0.000 8.3±0.000 8.3e-02±0.0e+00 0.0e+00±0.0e+00
NB-BoW 4.864±0.000 8.3±0.000 8.3e-02±0.0e+00 0.0e+00±0.0e+00

Acknowledgement

This work was supported in part by grants from the National Center for Advancing Translational Sciences of the National Institutes of Health (NIH) under Award UL1TR000454, National Science Foundation Award NSF1651360, the US Department of Health and Human Services (HHS) Centers for Disease Control and Prevention (CDC) HHSD2002015F62550B, and Microsoft Research and Hewlett Packard. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. The authors thank Paula Braun and Mark Braunstein for their valuable insights on the project, as well as Ying Sha and Janani Venugopalan for their helpful comments.

Footnotes

1

Starting from 1999, NCHS switched from the ICD-9 to the ICD-10 coding system.

References

  • 1. Agostino, Ralph B. 1998. “Tutorial in Biostatistics: Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non-Randomized Control Group.” Stat Med 17 (19): 2265–81.
  • 2. Bollen, Kenneth A, and J. Scott Long. 1993. Testing Structural Equation Models. Vol. 154. Sage.
  • 3. Chickering, David Maxwell. 2002. “Optimal Structure Identification with Greedy Search.” Journal of Machine Learning Research 3 (Nov): 507–54.
  • 4. Emerging Risk Factors Collaboration, and others. 2011. “Diabetes Mellitus, Fasting Glucose, and Risk of Cause-Specific Death.” N Engl J Med 2011 (364): 829–41.
  • 5. Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 248–55. IEEE.
  • 6. Dos Santos, Cícero Nogueira, and Maira Gatti. 2014. “Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts.” In COLING, 69–78.
  • 7. Esteva, Andre, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. 2017. “Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks.” Nature 542 (7639): 115–18.
  • 8. PyTorch Group. 2017. “PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration.” http://www.pytorch.org.
  • 9. Hinton, Geoffrey E, Simon Osindero, and Yee-Whye Teh. 2006. “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation 18 (7): 1527–54.
  • 10. Holzinger, Andreas, and Igor Jurisica. 2014. “Knowledge Discovery and Data Mining in Biomedical Informatics: The Future Is in Integrative, Interactive Machine Learning Solutions.” In Interactive Knowledge Discovery and Data Mining in Biomedical Informatics, 1–18. Springer.
  • 11. Hu, Baotian, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. “Convolutional Neural Network Architectures for Matching Natural Language Sentences.” In Advances in Neural Information Processing Systems, 2042–50.
  • 12. Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv preprint arXiv:1502.03167.
  • 13. Israel, Robert A, Harry M Rosenberg, and Lester R Curtin. 1986. “Analytical Potential for Multiple Cause-of-Death Data.” American Journal of Epidemiology 124 (2): 161–81.
  • 14. Ji, Shuiwang, Wei Xu, Ming Yang, and Kai Yu. 2013. “3D Convolutional Neural Networks for Human Action Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1): 221–31.
  • 15. Jiang, Hanyu, Hang Wu, and May D Wang. 2017. “A Topic Model View on Causes of Death in the United States, 1999 to 2014.” In Biomedical and Health Informatics (BHI), 2017 IEEE-EMBS International Conference on. IEEE.
  • 16. Khorana, AA, CW Francis, E Culakova, NM Kuderer, and GH Lyman. 2007. “Thromboembolism Is a Leading Cause of Death in Cancer Patients Receiving Outpatient Chemotherapy.” Journal of Thrombosis and Haemostasis 5 (3): 632–34.
  • 17. Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” arXiv preprint arXiv:1408.5882.
  • 18. Kingma, Diederik, and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980.
  • 19. Kleinberg, Samantha, and George Hripcsak. 2011. “A Review of Causal Inference for Biomedical Informatics.” Journal of Biomedical Informatics 44 (6): 1102–12.
  • 20. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–1105.
  • 21. Lai, Siwei, Liheng Xu, Kang Liu, and Jun Zhao. 2015. “Recurrent Convolutional Neural Networks for Text Classification.” In AAAI, 333:2267–73.
  • 22. Lawn, Joy E, Simon Cousens, Jelka Zupan, Lancet Neonatal Survival Steering Team, and others. 2005. “4 Million Neonatal Deaths: When? Where? Why?” The Lancet 365 (9462): 891–900.
  • 23. LeCun, Yann, Yoshua Bengio, and others. 1995. “Convolutional Networks for Images, Speech, and Time Series.” The Handbook of Brain Theory and Neural Networks 3361 (10): 1995.
  • 24. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–40.
  • 25. Mao, Junhua, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. “Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN).” arXiv preprint arXiv:1412.6632.
  • 26. McCoy, Lucie, Matthew Redelings, Frank Sorvillo, and Paul Simon. 2005. “A Multiple Cause-of-Death Analysis of Asthma Mortality in the United States, 1990–2001.” Journal of Asthma 42 (9): 757–63.
  • 27. Melamed, Alexander, and Frank J Sorvillo. 2009. “The Burden of Sepsis-Associated Mortality in the United States from 1999 to 2005: An Analysis of Multiple-Cause-of-Death Data.” Critical Care 13 (1): R28.
  • 28. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems, 3111–9.
  • 29. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. “Playing Atari with Deep Reinforcement Learning.” arXiv preprint arXiv:1312.5602.
  • 30. Morgan, Stephen L, and Christopher Winship. 2014. Counterfactuals and Causal Inference. Cambridge University Press.
  • 31. Mölsä, Pekka K, RJ Marttila, and UK Rinne. 1986. “Survival and Cause of Death in Alzheimer’s Disease and Multi-Infarct Dementia.” Acta Neurologica Scandinavica 74 (2): 103–7.
  • 32. Muthén, Bengt. 1984. “A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators.” Psychometrika 49 (1): 115–32.
  • 33. NCHS. 2017. “Mortality Data, Vital Statistics NCHS’ Multiple Cause of Death Data, 1959 to 2015.” http://www.nber.org/data/vital-statistics-mortality-data-multiple-cause-of-death.html.
  • 34. Pearl, Judea. 2009. Causality. Cambridge University Press.
  • 35. Pedregosa, F, G Varoquaux, A Gramfort, V Michel, B Thirion, O Grisel, M Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
  • 36. Redelings, Matthew D, Matthew Wise, and Frank Sorvillo. 2007. “Using Multiple Cause-of-Death Data to Investigate Associations and Causality Between Conditions Listed on the Death Certificate.” American Journal of Epidemiology 166 (1): 104–8.
  • 37. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.
  • 38. Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
  • 39. Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688.
  • 40. ———. 2005. “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions.” Journal of the American Statistical Association 100 (469): 322–31.
  • 41. ———. 2006. Matched Sampling for Causal Effects. Cambridge University Press.
  • 42. ———. 2007. “The Design Versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials.” Statistics in Medicine 26 (1): 20–36.
  • 43. Sadovnick, AD, K Eisen, GC Ebers, and DW Paty. 1991. “Cause of Death in Patients Attending Multiple Sclerosis Clinics.” Neurology 41 (8): 1193–3.
  • 44. Spirtes, Peter, Clark N Glymour, and Richard Scheines. 2000. Causation, Prediction, and Search. MIT Press.
  • 45. Srivastava, Nitish, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (1): 1929–58.
  • 46. Stuart, Elizabeth A. 2010. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science 25 (1): 1.
  • 47. Suo, Qiuling, Hongfei Xue, Jing Gao, and Aidong Zhang. 2016. “Risk Factor Analysis Based on Deep Learning Models.” In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 394–403. ACM.
  • 48. Sutskever, Ilya, James Martens, and Geoffrey E Hinton. 2011. “Generating Text with Recurrent Neural Networks.” In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 1017–24.
  • 49. Van der Maaten, Laurens. 2014. “t-Distributed Stochastic Neighbor Embedding (t-SNE).”
  • 50. Wolf, Florian, Tomaso Poggio, and Pawan Sinha. 2006. “Human Document Classification Using Bags of Words.”
