AMIA Summits on Translational Science Proceedings. 2019 May 6;2019:127–135.

Using Self Organizing Maps to Compare Sepsis Patients from the Neonatal and Adult Intensive Care Unit

Benjamin Goddard 1, Jonathan Chang 1, Indra Neil Sarkar 1
PMCID: PMC6568064  PMID: 31258964

Abstract

Neonatal sepsis, a blood infection occurring in infants younger than 90 days old, represents a significant source of mortality and morbidity among infants.1 Mortality rates increase with postnatal age and can be as high as 52% (36% in newborns aged 8–14 days and 52% in those aged 15–28 days).2 While sepsis in adults has a generally accepted definition, the definition for clinical diagnosis in infants is less well defined. Using the Medical Information Mart for Intensive Care database (MIMIC-III), patient diagnoses and microbiology results records were processed with an artificial neural network trained using unsupervised learning known as a self-organizing map (SOM). The results of this feasibility study suggest a low degree of overlap between the presentation of sepsis in neonate and adult intensive care unit populations. As a consequence, it supports the need for dedicated research in neonatal sepsis, which may manifest differently than adult sepsis.

Introduction

Globally, neonatal infections account for approximately 29% of neonatal mortality.4 There are a host of symptoms indicative of clinically diagnosed neonatal sepsis that are documented in patient records in the ICU, including fever or hypothermia, hyper- or hypoglycemia, apnea or tachypnea, frequent oxygen desaturation with an increased requirement for ventilator support, bradycardia and/or cyanosis, feeding intolerance, abdominal distension, seizures, decreased motor activity, skin mottling, and hypotension.5 However, while sepsis in adults has a generally accepted definition,3 few studies have characterized how its definition differs in neonates.

The MIMIC-III dataset is an openly available dataset developed by MIT that represents de-identified health data associated with ~40,000 critical care patients of all ages from 2001 to 2012. A variety of studies have already utilized this dataset in conjunction with machine learning algorithms or other predictive models to gain insight into patient conditions and outcomes.6,7 Self Organizing Maps (SOMs) are an artificial neural network technique used to accomplish unsupervised clustering. They are particularly well suited to high dimensional data, which they reduce in dimensionality by mapping each observation onto a discrete, low dimensional lattice of nodes. As an unsupervised clustering technique, SOMs enable analysis of complex data without a priori knowledge.

The goal of this study was to explore whether the presentation of sepsis between infant and adult populations is indeed different by investigating comorbidities and bacterial infections recorded for admissions with a diagnosis for sepsis through the use of SOMs. The MIMIC-III dataset contains 230 infant and 5,176 adult admissions diagnosed with sepsis, as denoted by International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) codes recorded at the time of discharge. Diagnosed septic patients are associated with 2,079 possible other ICD code and microbiology event features. This ratio of observations to features is not favorable for many neural networks and clustering methods. Additionally, interpretation of clusters associated with data of these dimensions is difficult. This study therefore aimed to demonstrate that SOMs are well suited to this setting, offering both dimensionality reduction and resistance to over-fitting.9

Methods

From the MIMIC-III dataset, diagnoses_icd and microbiologyevents records were retrieved for admissions that had any of the following ICD-9-CM codes recorded in the chart (which may have been added at any time during the ICU stay) at the time of discharge: 771.81, 995.91, or 995.92, which code for Septicemia [Sepsis] of newborn, Sepsis, or Severe Sepsis, respectively. No distinction was made between primary and secondary diagnosis codes; all codes were treated equally.

ICD-9-CM codes that appeared fewer than three times in the retrieved dataset were omitted. Patient age was calculated at the time of admission, and each admission was considered independent of other admissions for the same patient. The resultant dataset contained 230 admissions for patients younger than one year (“neonate”) and 5,176 admissions for patients older than 18 years (“adult”).
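The rare-code filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `drop_rare_codes`, the `(admission id, code)` record layout, and the toy records are hypothetical, and the sketch counts every recorded occurrence of a code, which may differ from how the authors counted.

```python
from collections import Counter
from typing import List, Tuple


def drop_rare_codes(
    records: List[Tuple[str, str]], min_count: int = 3
) -> List[Tuple[str, str]]:
    """Keep only (admission, code) pairs whose code occurs at least min_count times."""
    counts = Counter(code for _, code in records)
    return [(adm, code) for adm, code in records if counts[code] >= min_count]


# Hypothetical toy records: (admission id, ICD-9-CM code)
toy_records = [
    ("a1", "584.9"), ("a2", "584.9"), ("a3", "584.9"),
    ("a1", "038.9"), ("a2", "038.9"),
    ("a3", "774.6"),
]
kept = drop_rare_codes(toy_records)
print(sorted({code for _, code in kept}))  # ['584.9']
```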

From these data, a binary matrix was created with each row representing an admission. All possible ICD-9-CM codes associated with the study admission group were given columns, with the exception of those used for the initial filtering (771.81, 995.91, 995.92). Furthermore, ICD-9-CM codes 765.20-765.29, which specify the number of weeks of completed gestation, were removed in an effort to get more meaningful clustering. Besides the codes listed above, if an admission had an associated ICD-9-CM code, the corresponding column in the binary matrix was set to 1. Likewise, for microbiology events data, each organism associated with the dataset was given a corresponding column in the binary matrix, which was set to 1 if it was documented with the admission.

The resultant matrix was separated into three different matrices for each of the two age groups for a total of six matrices. These matrices consisted of the following content for each age group:

  1. ICD-9-CM diagnosis only (1,847 features, neonate: n=230, adult: n=5,176)

  2. Microbiology events only (239 features, neonate: n=94, adult: n=4,094)

  3. Combined ICD-9-CM and microbiology events (2,079 features, neonate: n=230, adult: n=5,176)

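For example, for the ICD-9-CM diagnosis only matrix, given n ICD-9-CM codes and m admissions, an input matrix, $D \in \{0,1\}^{m \times n}$, was created:

$$
D =
\begin{array}{c|ccccc}
 & \mathrm{ICD}_1 & \mathrm{ICD}_2 & \cdots & \mathrm{ICD}_{n-1} & \mathrm{ICD}_n \\
\hline
\mathrm{Adm}_1 & 1 & 0 & \cdots & 1 & 1 \\
\mathrm{Adm}_2 & 0 & 0 & \cdots & 0 & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\mathrm{Adm}_{m-1} & 0 & 0 & \cdots & 0 & 1 \\
\mathrm{Adm}_m & 0 & 0 & \cdots & 0 & 1
\end{array}
$$

where

$$
d_{ij} =
\begin{cases}
1, & \text{if } \mathrm{Adm}_i \text{ has } \mathrm{ICD}_j \\
0, & \text{if } \mathrm{Adm}_i \text{ does not have } \mathrm{ICD}_j
\end{cases}
\quad \text{for } d_{ij} \in D,\ i \in \{1, 2, \ldots, m\},\ \text{and } j \in \{1, 2, \ldots, n\}
$$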
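As a rough illustration of the matrix construction described above, the following Python sketch builds a binary admission-by-feature matrix from `(admission id, code)` pairs. The function name `build_binary_matrix`, the record layout, and the toy records are assumptions made for this example; the sketch keeps an all-zero row for an admission whose only codes were excluded, which is one possible choice rather than the documented behavior.

```python
from typing import Iterable, List, Tuple


def build_binary_matrix(
    records: List[Tuple[str, str]],
    excluded: Iterable[str] = (),
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (admissions, features, D) with D[i][j] = 1 if admission i has feature j."""
    excluded_set = set(excluded)
    admissions = sorted({adm for adm, _ in records})
    features = sorted({code for _, code in records if code not in excluded_set})
    row = {adm: i for i, adm in enumerate(admissions)}
    col = {code: j for j, code in enumerate(features)}
    matrix = [[0] * len(features) for _ in admissions]
    for adm, code in records:
        if code in col:  # codes used for the initial filtering get no column
            matrix[row[adm]][col[code]] = 1
    return admissions, features, matrix


# Hypothetical toy records: (admission id, ICD-9-CM code)
toy_records = [
    ("adm1", "995.91"), ("adm1", "038.9"), ("adm1", "584.9"),
    ("adm2", "995.92"), ("adm2", "584.9"),
    ("adm3", "771.81"), ("adm3", "774.6"),
]
sepsis_codes = {"771.81", "995.91", "995.92"}
admissions, features, D = build_binary_matrix(toy_records, excluded=sepsis_codes)
print(features)  # ['038.9', '584.9', '774.6']
for adm, bits in zip(admissions, D):
    print(adm, bits)
```

The same function could be applied to organism names from microbiology events to obtain the second matrix, and to the union of both record sets to obtain the combined one.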

For the SOM algorithm to be applied, a metric to evaluate the distance between two admission vectors was required. Euclidean distance is not an applicable distance metric since binary data is being analyzed. For this application, Hamming distance was utilized for distance calculations. Let $\mathbf{1}\{x_1 \neq x_2\}$ be the indicator function that equals 1 when $x_1 \neq x_2$ and 0 otherwise. Given n ICD codes and two distinct admissions:

$$
\mathrm{Adm}^{(1)} = \{d_1^{(1)}, d_2^{(1)}, \ldots, d_n^{(1)}\}, \qquad
\mathrm{Adm}^{(2)} = \{d_1^{(2)}, d_2^{(2)}, \ldots, d_n^{(2)}\}
$$

the Hamming distance is defined as:

$$
\text{Hamming Distance} = \sum_{j=1}^{n} \mathbf{1}\{d_j^{(1)} \neq d_j^{(2)}\}
$$
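A minimal Python sketch of the Hamming distance in its usual sense, the number of positions at which two equal-length binary vectors differ, is given below. The function name and the two toy admission vectors are illustrative only.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical admission vectors over five features
adm_1 = [1, 0, 1, 1, 0]
adm_2 = [0, 0, 1, 0, 0]
print(hamming_distance(adm_1, adm_2))  # 2
```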

With our input matrix, the SOM algorithm can be summarized as follows:

  • W(υ) is the weight vector for node υ

  • αs is the learning rate at iteration s

  • θs,υ is the neighborhood function at iteration s for node υ

  • D(t) is the sample t

  • λ is the maximum number of iterations

The update function for every iteration is:

$$
W_{s+1}(\upsilon) = W_s(\upsilon) + \alpha_s \theta_{s,\upsilon} \left( D(t) - W_s(\upsilon) \right)
$$

From this, the following algorithmic procedure was applied:

  1. Initialize the lattice of nodes, using randomly chosen samples as the initial weights, $W_0(\upsilon)$

  2. For every sample, D(t), find the node that has the shortest Hamming distance (the best matching unit, υbmu), and update the weights by the update function

  3. Repeat above step for the maximum number of iterations λ
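The first two steps of the procedure above (initialization and the search for the best matching unit) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the names `init_lattice` and `find_bmu`, the dictionary-based lattice, and the tie-breaking rule (first node in iteration order) are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def init_lattice(
    samples: List[List[int]], rows: int, cols: int, rng: random.Random
) -> Dict[Node, List[int]]:
    """Step 1: each node starts with a copy of a randomly chosen sample."""
    return {(r, c): list(rng.choice(samples)) for r in range(rows) for c in range(cols)}


def find_bmu(weights: Dict[Node, List[int]], sample: Sequence[int]) -> Node:
    """Step 2 (search part): node whose weights are closest to the sample."""
    return min(weights, key=lambda node: hamming(weights[node], sample))


# Hypothetical 2 x 2 lattice with hand-picked weights
toy_weights = {
    (0, 0): [0, 0, 0, 0],
    (0, 1): [1, 1, 0, 0],
    (1, 0): [1, 1, 1, 0],
    (1, 1): [1, 1, 1, 1],
}
print(find_bmu(toy_weights, [1, 1, 1, 0]))  # (1, 0)
```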

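As defined by Kohonen,10:

$$
\alpha_s = \alpha_0 \exp\left(-\frac{s}{\lambda}\right), \qquad
\theta_{s,\upsilon} = f\left(\theta_0 \exp\left(-\frac{s}{\lambda}\right)\right)
$$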

where α0 and θ0 are the initial learning rate and neighborhood radius respectively and f the neighborhood function. For this implementation, the neighborhood function had a Gaussian decay:

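$$
\theta_{s,\upsilon} = \exp\left(-\frac{\mathrm{Hamming}(\upsilon, \upsilon_{bmu})^2}{2\left(\theta_0 \exp\left(-\frac{s}{\lambda}\right)\right)^2}\right)
$$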

Note that υbmu is the best matching node for Dbmu. The effect of the neighborhood function is that the weight of υ was updated less the further υ was from the best matching node, υbmu. Finally, similarly to Lebbah et al.,11 values of α0 = 1 and θ0 = 1 were chosen. Notice that as the number of iterations approaches the maximum, λ, both the learning rate and the neighborhood function decrease in value, and thus updates occur in smaller increments. Figure 1 illustrates how the radius of the neighborhood function, and thus the number of nodes updated, decreases over time. Also illustrated is how the strength of those updates decreases as the distance from υbmu increases (areas in the darkest shades of red experience the largest updates).
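The decay behavior described above can be checked numerically with a short Python sketch. The function names are hypothetical, the defaults α0 = 1 and θ0 = 1 follow the values stated in the text, and the distance between a node and the best matching node is passed in as a plain number because its exact computation is left open here.

```python
import math


def learning_rate(s: int, max_iter: int, alpha0: float = 1.0) -> float:
    """Exponentially decaying learning rate: alpha0 * exp(-s / lambda)."""
    return alpha0 * math.exp(-s / max_iter)


def radius(s: int, max_iter: int, theta0: float = 1.0) -> float:
    """Exponentially decaying neighborhood radius: theta0 * exp(-s / lambda)."""
    return theta0 * math.exp(-s / max_iter)


def neighborhood(dist: float, s: int, max_iter: int, theta0: float = 1.0) -> float:
    """Gaussian neighborhood value for a node at distance dist from the BMU."""
    sigma = radius(s, max_iter, theta0)
    return math.exp(-(dist ** 2) / (2.0 * sigma ** 2))


max_iter = 5000
for s in (0, 2500, 4999):
    print(s, learning_rate(s, max_iter), neighborhood(1, s, max_iter))
```

At distance 0 the neighborhood value is always 1, so the best matching node itself is updated with the full learning rate, while more distant nodes receive progressively weaker updates as the radius shrinks.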

Figure 1:


Illustration showing the decreasing radius of neighborhood function over time and the decreasing update strength (decreasing shades of red) in relation to increasing distance from υbmu.

For this implementation, the Hamming distance was used for the update function as described in Chen et al.12 and Lebbah et al.11 Since the learning rate and the neighborhood function have real-valued outputs, a threshold τ had to be defined. If a given feature had an update value less than τ, its weight was set to 0 (“null”); otherwise it was set to 1. More formally, given $W_s(\upsilon) = \{w_{1,s}(\upsilon), w_{2,s}(\upsilon), \ldots, w_{n,s}(\upsilon)\}$, threshold $\tau$, and sample $D(t)$ at iteration $s$,

$$
w_{j,s+1}(\upsilon) =
\begin{cases}
1, & \tau \le w_{j,s}(\upsilon) + \alpha_s \theta_{s,\upsilon} \left( d_j - w_{j,s}(\upsilon) \right) \\
0, & \text{otherwise}
\end{cases}
\qquad j \in \{1, 2, \ldots, n\},\ d_j \in D(t)
$$

A 20 x 20 node lattice was used for all SOMs. Each SOM was trained for 5,000 iterations. For all SOMs, a threshold (τ) value of 0.6 was chosen.
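Putting the pieces together, a thresholded binary SOM could be trained roughly as in the Python sketch below. It is a sketch under several assumptions rather than the authors' code: one iteration is taken to be a full pass over the samples, the distance between a node and the best matching node is taken to be the Hamming distance between their weight vectors, and the names `update_weights` and `train_binary_som` are hypothetical. The toy run uses a 2 × 2 lattice and 5 iterations instead of the 20 × 20 lattice and 5,000 iterations reported above.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)


def update_weights(
    w: Sequence[int], d: Sequence[int], alpha: float, theta: float, tau: float = 0.6
) -> List[int]:
    """Move each weight toward the sample, then binarize against the threshold tau."""
    return [1 if wj + alpha * theta * (dj - wj) >= tau else 0 for wj, dj in zip(w, d)]


def train_binary_som(
    data: List[List[int]],
    rows: int,
    cols: int,
    max_iter: int,
    tau: float = 0.6,
    alpha0: float = 1.0,
    theta0: float = 1.0,
    seed: int = 0,
) -> Dict[Node, List[int]]:
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Initial weights are copies of randomly chosen samples.
    weights = {node: list(rng.choice(data)) for node in nodes}
    for s in range(max_iter):
        alpha = alpha0 * math.exp(-s / max_iter)
        sigma = theta0 * math.exp(-s / max_iter)
        for sample in data:  # assumption: one iteration = one pass over all samples
            bmu = min(nodes, key=lambda n: hamming(weights[n], sample))
            bmu_weights = list(weights[bmu])  # freeze before any node is updated
            for node in nodes:
                # Assumption: node-to-BMU distance measured between weight vectors.
                dist = hamming(weights[node], bmu_weights)
                theta = math.exp(-(dist ** 2) / (2.0 * sigma ** 2))
                weights[node] = update_weights(weights[node], sample, alpha, theta, tau)
    return weights


# Hypothetical toy data with two repeated patterns
toy_data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
som = train_binary_som(toy_data, rows=2, cols=2, max_iter=5)
for node, w in sorted(som.items()):
    print(node, w)
```

With α = θ = 1 a weight is simply replaced by the sample value, whereas a small product αθ leaves a weight of 1 above the threshold and a weight of 0 below it, which is why late updates change little.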

Results

Each age group was clustered independently. Interactive heat maps can be found online: http://bit.ly/amiaSummit2019-goddard. Selected SOM heatmap results of the microbiology events are shown in Figure 2. Darker colors in the heat maps below are associated with the most frequently exhibited clusters for a given SOM.

Figure 2:


Microbiology events SOM heatmap output for neonate and adult groups.

The top five clusters from the SOM output are shown in Tables 3 through 6 below. “Cluster Features” refers to the single feature (ICD-9-CM code or microbiology event) or combination of features that composed each cluster. Each distinct feature is separated by an “&.” For example, the cluster “Staph aureus Coag + & Yeast” indicates that both Staph aureus Coag + and Yeast are associated with each other frequently enough to represent a significant cluster and could exist as comorbidities. Additionally, for each Cluster Feature, the number of samples associated with that cluster and the percentage of admissions that this number represents are also shown.
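One plausible way to derive such a cluster summary from a trained binary SOM is sketched below in Python. This is an assumption about the post-processing, which the text does not spell out: each admission is assigned to its best matching node, the node's label is formed from the features whose weights equal 1 (or “Null” when none do), and admissions are counted per label. The function names, weights, and samples are all hypothetical toy values, not results from this study.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_label(weight: Sequence[int], features: Sequence[str]) -> str:
    """Join the names of all features set to 1; an all-zero node is labeled 'Null'."""
    active = [name for name, bit in zip(features, weight) if bit == 1]
    return " & ".join(active) if active else "Null"


def top_clusters(
    data: List[List[int]],
    weights: Dict[Node, List[int]],
    features: Sequence[str],
    k: int = 5,
) -> List[Tuple[str, int, float]]:
    """Return (label, number of samples, percent of admissions) for the k largest clusters."""
    counts: Counter = Counter()
    for sample in data:
        bmu = min(weights, key=lambda node: hamming(weights[node], sample))
        counts[cluster_label(weights[bmu], features)] += 1
    return [(label, n, 100.0 * n / len(data)) for label, n in counts.most_common(k)]


# Hypothetical toy inputs
features = ["Staph aureus Coag +", "Yeast", "Escherichia coli"]
toy_weights = {(0, 0): [1, 1, 0], (0, 1): [0, 0, 1], (1, 0): [0, 0, 0]}
toy_data = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
for label, n, pct in top_clusters(toy_data, toy_weights, features):
    print(f"{label}: n={n} ({pct:.1f}%)")
```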

Table 3.

ICD-9-CM diagnosis SOM output top five clusters for the neonate age group.


Table 6.

Combined ICD-9-CM diagnosis and Microbiology Events SOM output: top five clusters for the adult age group.


Among all results, the only areas where the neonate and adult age groups yielded overlapping clusters were when microbiology events were processed through the SOM on their own. Clusters that existed in both age groups are shown in Table 7 below (sorted by the number of samples from the adult age groups).
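The comparison described above amounts to intersecting the cluster labels of the two age groups. A minimal Python sketch follows; the function name, the dictionary layout (label to sample count), and the placeholder labels and counts are hypothetical and do not reproduce any values from this study.

```python
from typing import Dict, List, Tuple


def overlapping_clusters(
    neonate: Dict[str, int], adult: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (label, neonate count, adult count) for shared labels, largest adult count first."""
    shared = set(neonate) & set(adult)
    rows = [(label, neonate[label], adult[label]) for label in shared]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Placeholder cluster counts (not study results)
neonate_clusters = {"organism A": 12, "organism B": 4, "organism C": 7}
adult_clusters = {"organism A": 150, "organism C": 300, "organism D": 90}
for label, n_neo, n_adult in overlapping_clusters(neonate_clusters, adult_clusters):
    print(label, n_neo, n_adult)
```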

Table 7.

Overlapping microbiology event clusters between the neonate and adult age groups, their number of samples from SOM algorithm and % of total admissions.


Discussion

This study explored the potential of an unsupervised clustering technique, SOMs, to compare the manifestation of sepsis in two distinct populations: neonates and adults. The results suggest a meaningful clustering that shows promise for the use of SOMs to characterize populations based on available clinical data. Within the scope of this study, a distinct profile emerges that distinguishes neonatal and adult sepsis. This finding supports general clinical guidance in both the identification and treatment of these different populations. Furthermore, it justifies future research that focuses on either neonatal or adult sepsis alone, without concern for overlapping characteristics that would potentially be incorporated into clinical decision support.

Typical application of SOM clustering is on high dimension, continuous data. Often the dimensions are still such that each component plane of the SOM can be laid out such that spatial relevance of clusters can be visually associated with other component planes as shown with the data in Figure 3, which is taken from a study that used SOMs to visualize “combined associations between metabolic markers, diabetic kidney disease, retinopathy, hypertension, obesity, and mortality.”13 The study behind the visualization in Figure 3 used around 12 continuous features. Each feature then had a separate component plane in the resultant SOM. When placed side-by-side the spatial location and value of clusters can be compared more easily.

Figure 3:


Component plane representation of the SOM trained using patient survey responses. Modified from Makinen, et al.13

While the resultant output would be the same if the underlying data presented in the same way, it is important to view and evaluate the heatmaps in Figure 2 independently. The reduction in dimensionality and nature of SOMs will cluster based on the underlying topology, but if a given cluster is expressed more strongly in one dataset versus the other, the final cluster size and the location on the lattice may change. As a result, direct comparison of spatial location between heatmaps may be of limited value. Despite this, the visualization provided by the heatmaps shown in Figure 2 (as well as online at: http://bit.ly/amiaSummit2019-goddard) provides a means to explore the density of clusters. With the data used in this study, which are binary and significantly higher in dimensionality (~2,100 features), the resultant component planes are not suitable for visualization. Instead, significant component planes are combined and returned as meaningful clusters of comorbidities.

The main purpose of displaying the heatmaps is to show how strongly each cluster presents in the data. In the case of these data, the unsupervised algorithm is useful in finding comorbidities associated with sepsis for the two age groups that were evaluated. It is important to notice that despite the SOM algorithm’s ability to cope with the high dimensional and sparsely populated data, some of the most dominant clusters do drop to “Null.” This result is indicative of no single feature or grouping of features being expressed strongly enough to exceed the threshold of 0.6 chosen for this study. As a result of no feature exceeding the necessary threshold, no feature is recorded as having had a positive occurrence. Experiments were done with lower thresholds; however, significant “Null” clusters still existed and rates of error increased significantly. The ICD-9-CM and Combined matrices suffered the most from this scenario.

The Microbiology event matrix, where the dimensionality is more favorable relative to the number of admission records available, showed stronger presentations of clusters. Being able to review the bacteria that cluster together is particularly interesting. Future work will include weighting of results based on known common microbiology results expected for populations (e.g., neonates are known to have high rates of Staphylococcus, Coagulase Negative, which may also be caused by contamination) compared to those that are specific to patients who have sepsis. For example, it may be useful to use a SOM to examine if there are clusters of patients within neonates who were diagnosed with sepsis versus those who were not. Nonetheless, the results of this study suggest that the profile of features in neonates is different from that of the adult population. This finding may be intuitive, but it validates the potential of using an unsupervised clustering approach, which inherently does not start with a priori knowledge.

While the absence of overlapping clusters in ICD-9-CM data is interesting, there is room for improvement in the study design that might make the results more meaningful. For example, some diagnosis codes are neonate-specific. As a result, overlap between age groups on neonate-specific codes should not occur by design. If it did, this would represent an interesting finding pointing to either: (a) a flaw in the algorithm; or (b) a systemic coding error. Future work that uses the methodology outlined in this study would be enhanced by either removing additional age-specific diagnoses or, ideally, finding appropriate mappings to normalize diagnoses between age groups. The number of weeks of gestation, while important as a predictive factor for neonatal sepsis, is so strongly expressed in this dataset that its inclusion obscures any other clustering besides those specific codes. An alternative study design that could be particularly interesting would be to apply a SOM using discrete and continuous variables derived from ICD-9-CM groupings, such as those that could be created from codes 765.20-765.29, which code for specific weeks of gestation completed. Furthermore, from the MIMIC-III dataset, other continuous clinical metrics can be calculated, such as average heart rate, SpO2, respiration rate, etc.

Combined with the significant clusters identified by this study, new SOMs could be trained to further visualize comorbidities and relationships between specific microbiology event clusters and other patient features to assist in enhanced phenotyping for adult or infant sepsis. Finally, this study only considered structured data elements that could be imputed into a binary matrix. Future work would include incorporation of additional clinical data that may originate from unstructured (e.g., narrative) clinical data. The added dimensions that would result from the inclusion of information imputed from narrative clinical data would further underscore the value of the SOM demonstrated in this study.

Understanding the differences (and similarities) between neonatal and adult sepsis may support the longer term goal of this work to develop tools for earlier detection of neonatal sepsis. The promising results of using SOMs in this study demonstrate the potential to identify clusters that can then be used for supporting the identification of factors that can be incorporated into subsequent decision support systems. Future work will also include studying whether there are identifiable subtypes of neonatal sepsis that can also be used to support clinical decisions at earlier points in care.

Conclusion

Leveraging data in the MIMIC-III dataset and SOM clustering method, we were able to demonstrate a low degree of overlap between the presentation of sepsis in the two patient age groups evaluated in this study: neonate and adult, suggesting that sepsis does indeed manifest differently between infant and adult ICU populations. There remain limitations in the data analysis that will require further study to get the most out of the ICD-9-CM diagnosis code information.

Table 1.

Microbiology events SOM top five clusters for neonate age group.


Table 2.

Microbiology events SOM top five clusters for the adult age group.


Table 4.

ICD-9-CM diagnosis SOM output top five clusters for the adult age group.


Table 5.

Combined ICD-9-CM diagnosis and Microbiology Events SOM output: top five clusters for the neonate age group.


Acknowledgements

This study was funded in part by grant U54GM115467 from the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • 1. Cortese F, Scicchitano P, Gesualdo M, Filaninno A, Giorgi ED, Schettini F. Early and Late Infections in Newborns: Where Do We Stand? A Review. Pediatrics & Neonatology. 2016;57:265–73. doi: 10.1016/j.pedneo.2015.09.007.
  • 2. Stoll BJ, Hansen N, Fanaroff AA, Wright LL, Carlo WA, Ehrenkranz RA. Late-Onset Sepsis in Very Low Birth Weight Neonates: The Experience of the NICHD Neonatal Research Network. Pediatrics. 2002;110:285–91. doi: 10.1542/peds.110.2.285.
  • 3. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016;315:801. doi: 10.1001/jama.2016.0287.
  • 4. Kamath BD. Neonatal sepsis. Bermans Pediatric Decision Making. 2011; pp. 300–4. doi: 10.1016/b978-0-323-05405-8.00080-2.
  • 5. Fell DB, Hawken S, Wong CA, Wilson LA, Murphy MSQ, Chakraborty P. Using newborn screening analytes to identify cases of neonatal sepsis. Scientific Reports. 2017;7. doi: 10.1038/s41598-017-18371-1.
  • 6. Wu I-H, Tsai M-H, Lai M-Y, Hsu L-F, Chiang M-C, Lien R. Incidence, clinical features, and implications on outcomes of neonatal late-onset sepsis with concurrent infectious focus. BMC Infectious Diseases. 2017; p. 17. doi: 10.1186/s12879-017-2574-7.
  • 7. Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ. Mortality prediction in the ICU: can we do better? Results from the Super ICU Learner Algorithm (SICULA) project, a population-based study. The Lancet Respiratory Medicine. 2015;3(1):42–52. doi: 10.1016/S2213-2600(14)70239-5.
  • 8. Ghassemi MM, Richter SE, Eche IM, Chen TW, Danziger J, Celi LA. A data-driven approach to optimized medication dosing: a focus on heparin. Intensive Care Medicine. 2014;40:1332–9. doi: 10.1007/s00134-014-3406-5.
  • 9. Russo T, Scardi M, Cataudella S. Applications of Self-Organizing Maps for Ecomorphological Investigations through Early Ontogeny of Fish. PLoS ONE. 2014;9. doi: 10.1371/journal.pone.0086646.
  • 10. Kohonen T. Self-organized formation of topologically correct feature maps. Biological Cybernetics. 1982;43:59–69. doi: 10.1007/bf00337288.
  • 11. Lebbah M, Badran F, Thiria S. Topological Map for Binary Data. ESANN 2000; Bruges; 2000 Apr 26–28; pp. 267–272.
  • 12. Chen N, Marques NC. An Extension of Self-organizing Maps to Categorical Data. Progress in Artificial Intelligence, Lecture Notes in Computer Science. 2005; pp. 304–13. doi: 10.1007/11595014_31.
  • 13. Makinen V-P, Forsblom C, Thorn LM, Waden J, Gordin D, Heikkila O. Metabolic Phenotypes, Vascular Complications, and Premature Deaths in a Population of 4,197 Patients With Type 1 Diabetes. Diabetes. 2008;57:2480–7. doi: 10.2337/db08-0332.

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association
