Author manuscript; available in PMC: 2020 Jul 30.
Published in final edited form as: Conf Proc IEEE Eng Med Biol Soc. 2019 Jul;2019:4466–4469. doi: 10.1109/EMBC.2019.8856929

A Cloud-based Framework for Implementing Portable Machine Learning Pipelines for Neural Data Analysis

Charles A Ellis 1, Ping Gu 2, Mohammad S E Sendi 3, Daniel Huddleston 4, Ashish Sharma 5, Babak Mahmoudi 6,7
PMCID: PMC7390749  NIHMSID: NIHMS1612376  PMID: 31946857

Abstract

Cloud-based computing has created new avenues for innovative research. In recent years, numerous cloud-based data analysis projects within the biomedical domain have been implemented. As this field is likely to grow, there is a need for a unified platform for the development and testing of advanced analytic and modeling tools, one that enables those tools to be easily reused for the analysis of biomedical data by a broad set of users with diverse technical skills. A cloud-based platform of this nature could greatly assist future research endeavors. In this paper, we take the first step towards building such a platform. We define an approach by which containerized analytic pipelines can be distributed for use on cloud-based or on-premise computing platforms. We demonstrate our approach by implementing a portable biomarker identification pipeline using a logistic regression model with elastic net regularization (LR-ENR) and running it on Google Cloud. We used this pipeline for the diagnosis of Parkinson’s disease based on a combination of clinical, demographic, and MRI-based features and for the identification of the most predictive biomarkers.

I. Introduction

With the advent and proliferation of cost-effective, cloud-based computing, new avenues for disseminating analytic algorithms and running those algorithms independent of the local hardware resources have been created. To describe a few of the many potential avenues, cloud-based computing can (1) enable near real-time analysis of data from devices with otherwise prohibitively computationally expensive algorithms, (2) provide a useful platform for large-scale data analytics, and (3) enable multisite research projects to be conducted. Furthermore, cloud-based computing has the potential to promote innovative research within the biomedical field.

Many cloud-based healthcare projects have been conducted in recent years. Cloud-based approaches have been applied to seizure detection, real-time epilepsy management, real-time ECG monitoring and analysis, and respiratory rate measurement [1]–[5]. While many of these applications run in real time, their authors would likely have found it useful to first test the algorithms retrospectively on stored data to assess their utility.

As cloud-based computing is likely to continue growing, there is a need for a unified platform for the development and testing of advanced analytic and modeling tools that enables those tools to be easily reused for the analysis of biomedical data by a broad set of users with diverse technical skills. Our long-term goal is to develop such a platform. With this flexible platform, data would not need to be downloaded: it could remain on the cloud, and data analysis algorithms could be structured into rapidly customizable pipelines and run on a cloud-based or on-premise platform.

In this paper, we take a first step towards creating that data analysis platform. We define an approach by which containerized analytic pipelines can be distributed for use on cloud-based or on-premise platforms. We demonstrate our approach by implementing a portable biomarker identification pipeline using a MATLAB logistic regression model with elastic net regularization (LR-ENR) and running it on Google Cloud. While we focus on MATLAB algorithms in this paper, the same general framework can be used to implement Python algorithms, and most of the tools we employ to interface with Google Cloud can also be used to interface with AWS and on-premise computing installations. When our approach is applied to a single algorithm, the final product can be disseminated via the cloud to other users as a standalone data analysis tool; when it is applied to multiple algorithms, the resulting containers can serve as building blocks for highly customizable, rapidly prototyped data analysis pipelines.

We employ our portable biomarker identification pipeline for the diagnosis of Parkinson’s disease (PD) based on a combination of clinical, demographic, and MRI-based features from patients with PD and healthy controls (HC) and for the identification of the biomarkers that contribute most to effective diagnosis. Though our dataset is relatively small, it suffices to demonstrate the effectiveness, rather than the scalability, of our approach.

While the results of our analysis are not the focus of our paper, they are of scientific and clinical relevance and demonstrate that our approach can be used to generate meaningful results. Parkinson’s disease is the second most common neurodegenerative disease, occurring in 2–3% of the population over 65 years of age [6]. Symptoms include bradykinesia, cognitive impairment, sleep dysfunction, depression, hyposmia, and autonomic dysfunction [6]. Diagnosis is often made via clinical symptoms.

II. METHODS

A. System Architecture

As shown in Fig. 1, our approach can be separated into 3 stages. In stage 1, we write a MATLAB script, convert the MATLAB script to an executable, and create a Docker image that will later be made into a container. In stage 2, we create a WDL script and JSON file to organize the Docker images and data into a workflow, and we send the workflow to Cromwell, which interfaces with Google Cloud. In stage 3, Google Cloud, or another cloud-based platform, implements the workflow and outputs results. Note that the data is stored on the cloud prior to use.

Figure 1.

System Diagram. Our approach is carried out in three main stages. 1) On their desktop, the user writes the algorithms they wish to use, using the MATLAB environment to transform the algorithms into executables and Docker to generate Docker images that will ultimately be used to create containers. 2) To interface with the Cloud Platform, the user creates a WDL script and JSON file that organize the Docker images and data into a workflow and sends them to Cromwell. 3) The Cloud Platform implements the workflow that it receives from Cromwell.

Compiling a MATLAB Executable:

A MATLAB executable is a standalone application that runs a MATLAB script and contains any user-created functions that the script may require. To compile a MATLAB executable, three main components are required: (1) MATLAB, (2) the MATLAB Compiler Software Development Kit (SDK), and (3) MATLAB Runtime. We used the MATLAB Compiler SDK, an add-on to MATLAB, to generate the MATLAB executable. To run the executable on a system without the basic MATLAB package, we used MATLAB Runtime, a publicly available set of libraries that can be downloaded from the MathWorks website and installed on local devices as well as on containers in on-premise computing installations and cloud-based platforms. After compiling the MATLAB executable, we stored it in a Google Bucket.

The MATLAB Compiler SDK is used to create MATLAB executables. Another add-on, MATLAB Compiler, can also compile MATLAB executables, but it is limited to basic executables that are restricted to the operating system on which they were compiled; for example, an executable compiled on Linux CentOS will not work on Linux Ubuntu, Mac, or Windows. The MATLAB Compiler SDK, in contrast, enables executables compiled on multiple operating systems to be combined into one Java JAR file, which can then be run on any of the operating systems on which the initial executables were compiled. Because this project did not require multiple platforms and used only CentOS 7, we generated a plain MATLAB executable rather than a Java JAR file.

Once MATLAB executables have been compiled, they can be run on desktops, on-premise computing installations, or cloud-based platforms like Google Cloud or Amazon AWS. However, for MATLAB executables to run on a system, they require access to the MATLAB libraries housing any built-in functions that the executables call. There are two ways to meet this requirement: (1) the executable can call the functions it needs from a local MATLAB installation, or (2) the executable can call them from MATLAB Runtime. The first option is very restrictive: because MATLAB is proprietary software, licensing restricts the basic MATLAB package to deployment on local devices like laptops and desktops. As such, if the executable is distributed to users without access to a local MATLAB installation, those users will not be able to use it. The second option is much better. MATLAB Runtime is a set of shared libraries that can be freely downloaded from the MathWorks website and used to run MATLAB executables and applications. Unfortunately, while MATLAB Runtime includes most MATLAB functions that may be required, some functions are not included in the package.

Creating a Docker Container:

A Docker container is an entity that runs on a host machine yet remains isolated from it. It borrows storage space and memory from the host machine and can run applications along with the bins and libraries needed to run them. A Docker container is similar to a virtual machine (VM) in many respects; however, it differs from a VM in that it utilizes the host computer’s operating system, while a VM has its own built-in operating system.

To create a Docker container, the Docker daemon must first be installed on the local host computer. After installation, we wrote a Dockerfile, used it to build a Docker image, and used the image to construct the Docker container.

A Dockerfile is a text file that forms a schematic for the Docker image and the environment that will exist within the Docker container. We incorporated MATLAB Runtime into our system through the Dockerfile. After writing the Dockerfile, we ran a build command from the host computer’s command line that read the Dockerfile and built a Docker image. A Docker image can be thought of as an application that produces a Docker container when it is run. After building the Docker image, we pushed it to Docker Hub, an online repository for Docker images. Once the image was on Docker Hub, it could be accessed for use on the cloud.
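As a hedged sketch of the kind of Dockerfile described above: the base image matches the CentOS 7 compile target, and MATLAB Runtime is installed silently into the image. The package list, installer URL placeholder, and executable name are illustrative assumptions, not the authors' actual configuration.

```dockerfile
# Base image matching the OS on which the MATLAB executable was compiled.
FROM centos:7

# Libraries the MATLAB Runtime installer typically needs (illustrative list).
RUN yum install -y unzip libXt && yum clean all

# Download and silently install MATLAB Runtime. The installer URL depends on
# the Runtime release matching the compiler version; see the MathWorks website.
RUN curl -L -o /tmp/mcr.zip "<MATLAB Runtime installer URL>" \
    && unzip -q /tmp/mcr.zip -d /tmp/mcr \
    && /tmp/mcr/install -mode silent -agreeToLicense yes \
    && rm -rf /tmp/mcr /tmp/mcr.zip

# Copy in the compiled MATLAB executable (name is an assumption).
COPY run_lr_enr /usr/local/bin/run_lr_enr
ENTRYPOINT ["/usr/local/bin/run_lr_enr"]
```

Building with `docker build` and pushing the resulting image to Docker Hub then makes it addressable from a cloud workflow.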

Running the Docker Image on Google Cloud:

Running the Docker image in a Google Cloud pipeline was a multistep process. We first installed the Google SDK on our desktop and used it to configure our Google project to work with Swagger UI. After configuring the Google Cloud project, we wrote a Workflow Description Language (WDL) script. The WDL script detailed the general workflow to be carried out on Google Cloud. It defined variables for the name and location of all the components needed for the workflow: the input data, the Docker image, the MATLAB executable, and the output. However, with the exception of the Docker image name and location, the WDL script only declared these variables; the actual names and locations were not hardcoded into the script. We then generated a JSON file from the WDL script, in which we defined values for each of the name and location variables declared in the WDL script. New names and locations can easily be substituted for those originally defined in the JSON file, allowing other data or executables to be included in the workflow.
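The WDL/JSON split described above can be sketched as follows. The workflow, task, and variable names here are illustrative assumptions rather than the authors' actual scripts; note that only the Docker image is fixed in the WDL, while data and executable locations come from the JSON inputs.

```wdl
workflow biomarker_pipeline {
  File input_data        # MAT file in a Google Bucket; value set in the JSON
  File matlab_executable # compiled LR-ENR executable; value set in the JSON

  call run_lr_enr { input: data = input_data, exe = matlab_executable }
}

task run_lr_enr {
  File data
  File exe

  command {
    chmod +x ${exe}
    ${exe} ${data}
  }
  output {
    File results = "results.mat"
  }
  runtime {
    # Only the Docker image name is hardcoded in the WDL script itself.
    docker: "example-user/lr-enr-pipeline:latest"
  }
}
```

```json
{
  "biomarker_pipeline.input_data": "gs://example-bucket/pd_dataset.mat",
  "biomarker_pipeline.matlab_executable": "gs://example-bucket/run_lr_enr"
}
```

Swapping other data or executables into the workflow then only requires editing the JSON values, not the WDL.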

Upon completing the JSON file, we uploaded the JSON file and WDL script to Google Cloud using Swagger UI, a web interface to the representational state transfer (RESTful) application programming interface (API) exposed by Cromwell, a workflow management system that communicates with Google Cloud. Cromwell can be called directly; however, Swagger UI provides an easy-to-use interface to web services like Google Cloud and AWS, as well as to on-premise computing installations, and greatly simplifies the communication process. Once the workflow is running on Google Cloud, Swagger UI can be used to cancel runs and query a run’s progress. After the process has finished running on Google Cloud, the results are output to a predefined Google Bucket. The results can also be downloaded from a link sent by Swagger UI to the user’s email address.

B. System Testing

To test our framework, we implemented LR-ENR on a dataset composed of 38 PD and 25 HC subjects with 12 clinical, demographic, and MRI-based features. We classified between PD and HC groups and identified the importance of each of the features.

Machine Learning Algorithm:

Fig. 2 shows the structure of our biomarker identification pipeline. We input the labels and features into the classifier. We used logistic regression with elastic net regularization and 5-fold cross-validation to classify between subject groups. After classifying the subjects, we calculated the receiver operating characteristic (ROC) and the area under the curve (AUC). Additionally, we generated a confusion matrix and calculated the sensitivity, specificity, and accuracy of the classification. A feature selection method was embedded in the classifier. We used the ENR coefficients output for each feature to determine feature importance. Those features with higher mean ENR coefficients were considered to have greater importance than those features with lower mean ENR coefficients.
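The pipeline above is implemented in MATLAB; as a hedged, self-contained analogue of the same idea (elastic-net-regularized logistic regression with 5-fold cross-validation, AUC, and coefficient-based feature importance), a Python/scikit-learn sketch on synthetic stand-in data might look like this. All names, hyperparameters, and the data itself are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small two-class dataset with 12 features.
X, y = make_classification(n_samples=63, n_features=12, n_informative=5,
                           random_state=0)

aucs, fold_coefs = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    scaler = StandardScaler().fit(X[train])
    # Logistic regression with elastic net regularization
    # (the saga solver supports the l1_ratio mixing parameter).
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, C=1.0, max_iter=5000)
    clf.fit(scaler.transform(X[train]), y[train])
    prob = clf.predict_proba(scaler.transform(X[test]))[:, 1]
    aucs.append(roc_auc_score(y[test], prob))
    fold_coefs.append(np.abs(clf.coef_.ravel()))  # per-fold feature weights

mean_auc = float(np.mean(aucs))
# Features with higher mean |coefficient| across folds rank as more important.
importance = np.mean(fold_coefs, axis=0)
print(f"mean AUC: {mean_auc:.3f}")
```

The per-fold coefficient magnitudes play the role of the embedded feature selection: averaging them across folds gives the importance ranking, and their spread across folds indicates how stable each feature's contribution is.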

Figure 2.

Biomarker identification pipeline using machine learning classification and feature learning

Experimental Data:

Our dataset was composed of 12 clinical, demographic, and MRI-based features. It included 63 subjects: 38 patients with PD and 25 healthy controls. We collected the MRI features using a Siemens Prisma-Fit 3T scanner, and the MRI-based features were derived from an automated imaging process [7]. The features included neuromelanin-sensitive MRI (NM-MRI), R2* (iron-sensitive MRI), and magnetization transfer contrast (MTC) of the substantia nigra pars compacta (SNc) and locus coeruleus (LC). We used neuromelanin-sensitive and iron-sensitive MRI because PD is often characterized by the loss of neuromelanin and the deposition of iron in the SNc and by the degeneration of the LC [8], [9]. We used several clinical questionnaires as features: the Movement Disorder Society Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) Parts I and II, the REM Sleep Behavior Disorder Questionnaire (RBDQ), and the Non-Motor Symptoms Questionnaire (NMSQ). Our demographic features included age and sex.

We performed some light preprocessing on the data. We stored the data in a MATLAB MAT file containing matrices for the subject labels, features, and feature names. Structuring the input data in this manner made inputting it relatively simple. After structuring the data, we stored it in a private Google Bucket for later use.
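To illustrate a MAT-file layout of the kind described above, a small Python/SciPy sketch follows. The matrix names (`labels`, `features`, `feature_names`) and the label coding are assumptions for illustration, not the authors' actual variable names.

```python
import io

import numpy as np
from scipy.io import loadmat, savemat

# Write a MAT file with one matrix per component, as described above.
buf = io.BytesIO()
savemat(buf, {
    "labels": np.array([[1], [0], [1]]),          # assumed coding: 1 = PD, 0 = HC
    "features": np.zeros((3, 12)),                # one row per subject, 12 features
    "feature_names": np.array(["age", "sex", "SNcVol"], dtype=object),
})

# The analysis executable can then load everything it needs from a single file.
buf.seek(0)
data = loadmat(buf)
n_subjects, n_features = data["features"].shape
print(n_subjects, n_features)
```

Bundling labels, features, and names in one file keeps the executable's input interface to a single path, which is convenient when that path is supplied by a workflow input rather than typed by hand.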

III. RESULTS

Fig. 3a shows the ROC curve for our classifier; the curve lies well above the diagonal for most of its length. The classification sensitivity, specificity, accuracy, and AUC were 0.7612 ± 0.0904, 0.9750 ± 0.0559, 0.8577 ± 0.0638, and 0.9593 ± 0.0516, respectively. Fig. 3b shows a boxplot of the relative importance of each feature to the classification. Note that the coefficient value for each feature in each of the 5 folds is included in the plot as a red circle. When reading the plot, we considered features with higher mean coefficient values to have greater importance. The five most important features were (1) mds_updrs_2_tot, (2) SNcR2str, (3) SNcVol, (4) rbd_2, and (5) sex. Aside from the top three or four features, most of the other features had comparable importance. The top four features showed very little variance in importance between folds, whereas many of the other features displayed greater variance.

Figure 3.

Classifying between normal and PD subjects and feature selection. (a) ROC Curve for PD and HC Classification. Note that the figure shows the mean and standard deviation of the ROC curve. (b) Box Plot of the Importance of Each Feature. Note that SNcR2str and LV_R2str are the R2* measured in the SNc and LV, respectively. SNcVol and LC_Vol are the volume of the SNc and LC, respectively. SNcMTC and LV_MTC are the MTC values in the SNc and LV, respectively. The features mds_updrs_1_tot and mds_updrs_2_tot are the MDS-UPDRS Part 1 and Part 2 total scores. The feature rbd_2 is the answer to question two on the RBDQ, and the feature nmsq_tot is the NMSQ total score.

IV. CONCLUSION

In this paper, we presented an approach by which data analysis pipelines can be implemented on a cloud-based platform. We demonstrated our approach by implementing a MATLAB-based portable biomarker identification pipeline on Google Cloud. In the future, we intend to expand upon this approach by applying it to other programming languages, like Python, and to other cloud-based and on-premise platforms, and by using it to form complex, multi-language pipelines. The approach we presented for implementing pipelines on Google Cloud is the first step towards creating a unified platform for the development and testing of advanced analytic and modeling tools that enables those tools to be easily reused for the analysis of biomedical data by a broad set of users with diverse technical skills. When completed, our platform could provide a useful tool for biomedical data analysis and spur greater innovation within the field of biomedical research.

Contributor Information

Charles A. Ellis, Wallace H. Coulter Department of Biomedical Engineering at Georgia Institute of Technology and Emory University, Atlanta, Georgia, 30332.

Ping Gu, Department of Biomedical Informatics at Emory University, Atlanta, Georgia 30332.

Mohammad S. E. Sendi, Wallace H. Coulter Department of Biomedical Engineering at Georgia Institute of Technology and Emory University, Atlanta, Georgia, 30332.

Daniel Huddleston, Department of Neurology, Emory University School of Medicine, Atlanta, GA, 30322, United States.

Ashish Sharma, Department of Biomedical Informatics at Emory University, Atlanta, Georgia 30332.

Babak Mahmoudi, Wallace H. Coulter Department of Biomedical Engineering at Georgia Institute of Technology and Emory University, Atlanta, Georgia, 30332; Department of Biomedical Informatics at Emory University, Atlanta, Georgia 30332.

REFERENCES

  • [1] Baldassano S et al., “Cloud computing for seizure detection in implanted neural devices,” J. Neural Eng., vol. 16, no. 2, 2018.
  • [2] Kremen V et al., “Integrating brain implants with local and distributed computing devices: A next generation epilepsy management system,” IEEE J. Transl. Eng. Health Med., 2018.
  • [3] Sendi MSE, Heydarzadeh M, and Mahmoudi B, “A Spark-based analytic pipeline for seizure detection in EEG big data streams,” 40th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2018.
  • [4] Xia H, Asif I, and Zhao X, “Cloud-ECG for real time ECG monitoring and analysis,” Comput. Methods Programs Biomed., vol. 110, no. 3, pp. 253–259, 2013.
  • [5] Fekr AR, Janidarmian M, Radecka K, and Zilic Z, “A medical cloud-based platform for respiration rate measurement and hierarchical classification of breath disorders,” Sensors, vol. 14, no. 6, pp. 11204–11224, 2014.
  • [6] Poewe W et al., “Parkinson disease,” Nat. Rev. Dis. Primers, vol. 3, 2017.
  • [7] Langley J, Huddleston DE, Liu CJ, and Hu X, “Reproducibility of locus coeruleus and substantia nigra imaging with neuromelanin sensitive MRI,” Magn. Reson. Mater. Phys. Biol. Med., 2017.
  • [8] Ohtsuka C et al., “Changes in substantia nigra and locus coeruleus in patients with early-stage Parkinson’s disease using neuromelanin-sensitive MR imaging,” Neurosci. Lett., vol. 541, pp. 93–98, 2013.
  • [9] Niu T et al., “Quantifying iron deposition within the substantia nigra of Parkinson’s disease by quantitative susceptibility mapping,” J. Neurol. Sci., vol. 386, pp. 46–52, 2018.
