Nanopore sequencing data analysis using Microsoft Azure cloud computing service

Linh Truong; Felipe Ayora; Lloyd D’Orsogna; Patricia Martinez; Dianne De Santis

doi:10.1371/journal.pone.0278609

PLoS One. 2022 Dec 2;17(12):e0278609. doi: 10.1371/journal.pone.0278609


Editor: Mingming Liu

PMCID: PMC9718390 PMID: 36459531

Abstract

Genetic information provides insights into the exome, genome, epigenetics and structural organisation of the organism. Given the enormous amount of genetic information, scientists are able to perform mammoth tasks to improve the standard of health care such as determining genetic influences on outcome of allogeneic transplantation. Cloud based computing has increasingly become a key choice for many scientists, engineers and institutions as it offers on-demand network access and users can conveniently rent rather than buy all required computing resources. With the positive advancements of cloud computing and nanopore sequencing data output, we were motivated to develop an automated and scalable analysis pipeline utilizing cloud infrastructure in Microsoft Azure to accelerate HLA genotyping service and improve the efficiency of the workflow at lower cost. In this study, we describe (i) the selection process for suitable virtual machine sizes for computing resources to balance between the best performance versus cost effectiveness; (ii) the building of Docker containers to include all tools in the cloud computational environment; (iii) the comparison of HLA genotype concordance between the in-house manual method and the automated cloud-based pipeline to assess data accuracy. In conclusion, the Microsoft Azure cloud based data analysis pipeline was shown to meet all the key imperatives for performance, cost, usability, simplicity and accuracy. Importantly, the pipeline allows for the on-going maintenance and testing of version changes before implementation. This pipeline is suitable for the data analysis from MinION sequencing platform and could be adopted for other data analysis application processes.

Introduction

Advances in genomic analysis have contributed to several scientific breakthroughs over the last decade. Genetic information provides insights into the exome, genome, epigenetics and structural organisation of the organism. Given the enormous amount of information, scientists are able to perform mammoth tasks to improve the standard of health care, such as determining genetic influences on the outcome of allogeneic transplantation, identifying disease-causing genes, predicting candidate proteins for vaccine development at unprecedented speed, as seen in the development of the messenger RNA (mRNA) vaccine against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), tracking community spread of new viral outbreaks using ‘genomic fingerprinting’, and many more [1].

Nanopore sequencing from Oxford Nanopore Technologies (ONT) became commercially available in 2016 and has since ushered in a new era of long-read single-molecule technology. The ultra-long reads from ONT assist the visualization of genetic information in real time without the need for a reference sequence, as currently required by short-read technologies. Nanopore sequencing also requires a lower capital footprint for instrumentation and reagents; however, it creates new computational problems. Converting raw sequencing data to scientific results requires computational power, coordinated automation and storage capacity [2].

A large amount of compute resources is required to deploy the ONT data analysis pipeline, which consists of basecalling (a process that converts raw electrical signal to nucleotide sequence), demultiplexing of the samples in the data set, and filtering based on predetermined quality thresholds. Additionally, the series of specialised tools used during the analysis often requires access to GPU (Graphics Processing Unit) and CPU (Central Processing Unit) resources, as well as different operating systems, tools, applications and prerequisites. This complex combination of hardware and software resources forces laboratory technicians to follow a time-consuming manual process that includes switching between different operating systems and being on stand-by for the duration of long analysis steps before starting the next one. Therefore, coordinated automation is desirable to increase the throughput without sacrificing the valuable time of our scientists and technicians.

The computational resources needed to analyse nanopore sequencing data in a timely manner and to enable long-term storage have also outgrown the infrastructure owned by a single laboratory such as the Department of Clinical Immunology (DCI) located within PathWest Laboratory Medicine Western Australia (PathWest). Specifically, the raw data of a single ONT run could be as large as 250 Gigabytes (the result of an R10.3 flow cell in a 16-hour sequencing run) and the processed data could be up to 20 Gigabytes in our experience. That means the ONT workflow would produce approximately 35 Terabytes annually in our centre, while our current testing policy requires permanent storage of patients’ clinical data, both raw and processed. Therefore, the cost of purchasing physical storage hardware to keep up with the ONT output would become a burden on the centre’s operating budget.

The Department of Clinical Immunology is the sole state provider for HLA genotyping, providing HLA gene characterization on patients awaiting bone marrow transplantation as well as potential unrelated donors recruited to the Australian Bone Marrow Donor Registry, which is then linked to worldwide registries. When a patient needs a stem cell transplant, their HLA genotype is compared with all potential donors on the worldwide registries. Having more enlisted local donors with high-resolution HLA typing increases the chance of finding the best-matched local donor for patients. This means that local patients can be transplanted faster and at lower cost compared with using an overseas donor. Additionally, high-resolution HLA typing of donors at the point of recruitment provides more information about the donor’s immunogenetic make-up to the clinician and transplant team, therefore eliminating potential unknown mismatches during the donor selection process, and allows patients to proceed to transplant quickly which can directly influence the patient’s survival outcome. Historically, all HLA type data from the MinION platform were processed manually using physical computing resources available on-site at Fiona Stanley Hospital. In order to increase the throughput and decrease the processing time, it was desirable to seek an affordable and stream-lined method for genetic analysis.

Cloud based computing has increasingly become a key choice for many scientists, engineers and institutions as it offers on-demand network access and users can conveniently rent rather than buy all required computing resources. Cloud providers refer to major commercial services such as Amazon Web Services (AWS), Google Cloud Platform or Microsoft Azure. All cloud providers offer elasticity, convenience and scalability depending on specific demand of individual workflow. Furthermore, they ensure the security and safety to store encrypted data for long-term while maintaining the utmost confidentiality of genetic information [3]. With the positive advancements of cloud computing and ONT data output, we were motivated to develop an automated and scalable analysis pipeline utilizing cloud infrastructure in Microsoft Azure to accelerate HLA genotyping service and improve the efficiency of the workflow at lower cost. Microsoft Azure was chosen for this study as PathWest has an active subscription with Microsoft as part of the IT arrangement for the health network. Therefore, as a default we had access to Microsoft services such as Microsoft 365 and Azure platform. In this study, we describe (i) the selection process of suitable virtual machine sizes for computing resources to balance between the best performance versus cost effectiveness; (ii) the building of Docker containers to include all tools in the cloud computational environment; (iii) the comparison of HLA genotype concordance between the in-house manual method and the automated cloud-based pipeline to assess data accuracy.

Methods and results

Workflow overview

The raw data from the Oxford Nanopore Technologies (ONT) MinION platform were processed using multiple online bioinformatics tools. Changes in the voltage and raw signal data were acquired by MinKNOW software (ONT) as the sequencing run progressed, then converted and stored in FAST5 file format for downstream processing. Basecalling of raw data was performed using Guppy v4.0.14, a data processing toolkit provided by ONT, whose basecaller is based on a recurrent neural network algorithm that converts raw nanopore signals into nucleotide sequences and writes the results in FASTQ file format.

All FASTQ files were then de-multiplexed by the index-sorting tool in Guppy v4.0.14, which assigned reads according to the ligated molecular barcode into separate folders. The results were multiple files, each containing reads from the same individual. The sequences corresponding to the molecular barcodes were also trimmed by the Guppy Basecaller at the completion of the de-multiplexing process. The individual FASTQ reads were then further filtered by size (minimum length of 2 kbases) and quality (minimum Q-score of 7) using the NanoFilt software tool [4] (https://github.com/wdecoster/nanofilt). The smallest HLA amplicon within this study’s amplicon pool was 3 kbases in size; therefore, any read shorter than 2 kbases was filtered out to eliminate unbound primers, primer dimers or any potential non-specific products [5]. The Phred quality score, or Q-score, is the most common metric used to assess the accuracy of sequencing technology. In earlier publications of ONT datasets, Q7 was considered the benchmark quality threshold for ONT sequencing, i.e. any read with a Q-score less than Q7 was placed in the “fail” bin and any read with a Q-score equal to or higher than Q7 was sorted into the “pass” bin. Therefore, a minimum read length of 2 kbases and a minimum Q-score of Q7 were used as the filtering parameters.
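The length and quality thresholds described above can be illustrated with a minimal Python sketch. This is not NanoFilt's implementation; it assumes simple four-line FASTQ records with Phred+33 quality encoding, and it assumes that the per-read Q-score is derived from the mean error probability rather than the arithmetic mean of the per-base scores. All function names are illustrative.

```python
import math
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 2000  # 2 kbases, as stated in the text
MIN_QSCORE = 7.0   # Q7, as stated in the text

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Per-read Q-score from the mean error probability (assumed convention)."""
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from simple four-line FASTQ text; blank lines are ignored."""
    cleaned = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        # cleaned[i + 2] is the '+' separator line and is not needed
        yield cleaned[i], cleaned[i + 1], cleaned[i + 3]


def passes(record: Record, min_length: int = MIN_LENGTH,
           min_q: float = MIN_QSCORE) -> bool:
    """True if the read meets both the length and the quality threshold."""
    _, seq, qual = record
    return len(seq) >= min_length and mean_qscore(qual) >= min_q


def filter_reads(lines: Iterable[str]) -> List[Record]:
    """Return only the records that pass both thresholds."""
    return [rec for rec in parse_fastq(lines) if passes(rec)]
```

Averaging error probabilities penalises reads with a few very poor bases more strongly than averaging the Q-scores directly, which is why the two conventions can disagree near the Q7 cut-off.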

Finally, the quality of the sequencing run was monitored using MinIONQC [6] (https://github.com/roblanf/minion_qc). MinIONQC produced a sequencing summary outlining the data yield over time, total data output, read quality histogram, read length histogram and Q-score obtained over time. The overview of each analysis step is shown in Fig 1. This workflow was then built into a pipeline of applications, all included within a single Docker container so that they could be deployed quickly and scaled to on-demand cloud compute resources, and so that software versions and updates could be easily managed. The protocol described in this peer-reviewed article is published on protocols.io (dx.doi.org/10.17504/protocols.io.x54v9dj7pg3e/v1) and is included for printing purposes as S1 File.

Building the pipeline in cloud computing

A set of predetermined criteria was applied during the building and deployment of the Azure cloud-based pipeline:
(1) The pipeline would give preference to Platform as a Service (PaaS) and Software/Solution as a Service (SaaS) technologies over Infrastructure as a Service (IaaS) technologies, in order to benefit the most from the cloud platform capabilities.
(2) Sequencing runs can be processed in parallel, without having to queue them or change the pipeline.
(3) The pipeline would match performance to cost, to achieve a balance that produces results in line with cost and analysis time expectations.
(4) The pipeline would utilise the Loome platform (https://www.loomesoftware.com) for job orchestration, data movement and logging. To run jobs, Loome would deploy the pipeline using Docker containers running on GPU-based resources on Azure, and the resources must be automatically deleted when jobs have completed. When running, the automated genetic analysis pipeline should be accessible for monitoring, troubleshooting and alerting with detailed task execution history.
(5) The speed to obtain the dataset using the Azure cloud-based analysis pipeline must be faster than the manual on-premises analysis pipeline.
(6) The processing time must be less than 3 days and ideally no more than 12 hours to ensure that the overall assay turnaround time remains equivalent to the current assay turnaround of 5 days.
(7) The upload of raw FAST5 files into the Azure cloud storage must be automated and progressive as they are generated by the MinION device. The download of analysed FASTQ result files back to Immunology (Fiona Stanley Hospital site) must also be automated and progressive.
(8) The result datasets should be comparable between the Azure cloud analysis pipeline and the manual on-premises analysis pipeline. The HLA genotype results should be concordant between the two workflows.
(9) The nanopore sequencing data and analysed results cannot leave Australian data centres.

Overview of architecture of the Azure cloud-workflow

The architecture of the analysis pipeline on Azure is shown in Fig 2.

(1) The FAST5 sequencing files from the MinKNOW acquisition software were exported to a default folder on the stand-alone computer progressively as they were generated.
(2) The input files were automatically uploaded by Loome from the stand-alone computer into a container in a blob storage account, deployed within the PathWest Azure subscription. The files were uploaded using Transport Layer Security (TLS) and were encrypted at rest using 256-bit AES encryption.
(3) The Loome agent running in the PathWest Azure subscription detected the presence of a new input dataset and triggered a new job to deploy the necessary resources and to start the processing steps, using the Azure Batch service.
(4) Azure Batch automatically deployed a GPU-enabled Virtual Machine (VM) for basecalling, de-multiplexing, quality trimming and QC overview.
(5) As part of the job submission process, Loome specified to Azure Batch which Docker container to use for each task in the job, and that container was pulled from the private Azure Container Registry and instantiated in each of the VMs managed by Azure Batch.
(6) Once each VM was running, it copied the input data onto its local disk for faster processing, ran the analyses, and then copied the results back into blob storage so that the VM could be deleted when processing had completed. Loome, in coordination with Azure Batch, orchestrated these steps.
(7) The results were stored in a blob storage account within the PathWest Azure subscription, ready to be downloaded. Loome detected the successful completion of all tasks in the job and sent an email to notify that the analysis had completed.
(8) The analysed files in FASTQ format were then automatically downloaded from the blob storage account using Transport Layer Security (TLS) onto the stand-alone computer, where downstream HLA analysis would be performed.
(9) If troubleshooting was required, a user with administrative permission in Loome could log on using their existing enterprise account credentials to review the detailed logs for each job and task.
(10) Access to all Azure resources and to Loome was secured by Azure Active Directory, applying role-based access controls to existing Active Directory user accounts in the PathWest directory, and enforcing multi-factor authentication.
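The stage-in, analyse, stage-out pattern in step (6) can be sketched in Python as follows. This is only an illustration under stated assumptions and not the Loome or Azure Batch implementation: the local paths, the storage URLs and the analysis command are hypothetical placeholders, and AzCopy (which is installed in the container) is assumed to be the data movement tool, invoked as `azcopy copy <source> <destination> --recursive`.

```python
import subprocess
from typing import Callable, List, Sequence


def plan_vm_task(input_url: str, output_url: str,
                 analysis_cmd: Sequence[str],
                 local_in: str = "/mnt/work/input",     # hypothetical path
                 local_out: str = "/mnt/work/output"    # hypothetical path
                 ) -> List[List[str]]:
    """Return the ordered commands for one task on a VM."""
    return [
        # 1. Copy the input dataset from blob storage onto the local disk
        ["azcopy", "copy", input_url, local_in, "--recursive"],
        # 2. Run the analysis against the local copy
        list(analysis_cmd),
        # 3. Copy the results back to blob storage so the VM can be deleted
        ["azcopy", "copy", local_out, output_url, "--recursive"],
    ]


def run_plan(plan: List[List[str]],
             runner: Callable = subprocess.run) -> None:
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in plan:
        runner(cmd, check=True)
```

Keeping the plan as data and passing the runner as a parameter makes the sequence easy to inspect or to exercise with a stand-in runner before any cloud resources are involved.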

Identifying optimal virtual machine

An analysis of different types of compute resources was performed to determine the best performance versus cost efficiency. This analysis also helped decide how to best containerise the applications, to match the resources required by them. A representative data subset was used for testing and the cost calculation analyses. A range of virtual machines (VMs) that were available in the Azure datacentres in Australia was evaluated. The available VMs for GPU applications included NC24 v3, NC6 v3, NV24 and NV48 v3. The VMs for CPU applications included D15v2, L32, G5, DS5 v2, F16s, DS32 v3, F72s v2, F32s v2, HC44, HB120 and HB60. The results from testing the pipeline applications, end to end, are shown in Tables 1 and 2.

Table 1. GPU VMs that were tested in the validation.

VM size	GPUs	Cost (per hr)	Cost per GPU (per hr)	Finding	Runtime	Run cost
NC24 v3	4	$ 23.2531	$ 5.8133	Too costly	Not run	n/a
NC6 v3	1	$ 5.8133	$ 5.8133	Best fit	0:15:14	$1.476
NV24	4	$ 8.7295	$ 2.1824	Incompatible GPU type	Not run	n/a
NV48 v3	4	$ 8.7295	$ 2.1824	Incompatible GPU type	Not run	n/a


(Best fit: identified to provide the best performance vs. cost results; Too costly: prices per GPU higher than comparable VM sizes; Incompatible GPU type: running Guppy analyses efficiently requires a GPU that supports NVIDIA CUDA drivers and complies with its licensing requirements. Only NC-series VMs currently provide this on Azure.)

Table 2. CPU VMs that were tested in the validation.

VM size	vCPUs	Cost (per hr)	Cost per CPU core (per hr)	Finding	Runtime	Run cost
D15v2	20	$ 2.7391	$ 0.1370	Low CPU count	Not run	n/a
L32	32	$ 4.1080	$ 0.1284	Best fit	3:57:06	$16.23
G5	32	$ 14.0156	$ 0.4380	Too costly	Not run	n/a
DS5 v2	16	$ 1.8467	$ 0.1154	Low CPU count	Not run	n/a
F16s	16	$ 1.4307	$ 0.0894	Long runtime	5:03:10	$7.23
DS32 v3	32	$ 2.7460	$ 0.0858	Hyperthreading	Not run	n/a
F72s v2	72	$ 5.4865	$ 0.0762	Hyperthreading	3:37:01	$19.84
F32s v2	32	$ 2.4384	$ 0.0762	Hyperthreading	Not run	n/a
HC44	44	$ 2.8270	$ 0.0643	Not in Australia	Not run	n/a
HB120	120	$ 6.4256	$ 0.0535	Not in Australia	Not run	n/a
HB60	60	$ 2.0348	$ 0.0339	Not in Australia	Not run	n/a


(Low CPU count: during testing, Porechop was found to utilise up to 30 CPU cores to speed up processing. VMs with fewer than 30 CPU cores were determined to have too low a CPU count to achieve the faster results; Best fit: identified to provide the best performance vs. cost results; Too costly: prices per CPU core higher than comparable VM sizes; Long runtime: these VMs did not meet the requirement for faster results; Hyperthreading: hyperthreading shares physical CPU cores amongst running processes, which could cause a considerable decrease in performance compared to non-hyperthreaded VMs when running CPU-intensive applications; Not in Australia: one of the key requirements was to keep the sequencing data within Australian datacentres.)
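The per-unit and run-cost columns in Tables 1 and 2 are consistent with simple arithmetic on the hourly price, which the following Python sketch reproduces. The function names are illustrative, and the assumption that run cost equals hourly price multiplied by runtime is inferred from the table values rather than stated by the authors.

```python
def hours(hms: str) -> float:
    """Convert an H:MM:SS runtime string into fractional hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


def cost_per_unit(rate_per_hour: float, units: int) -> float:
    """Hourly price divided by the number of GPUs or CPU cores."""
    return rate_per_hour / units


def run_cost(rate_per_hour: float, runtime: str) -> float:
    """Assumed relationship: hourly price multiplied by the runtime."""
    return rate_per_hour * hours(runtime)


# (hourly price in $, runtime) for the VMs that were actually run
TESTED = {
    "NC6 v3": (5.8133, "0:15:14"),
    "L32": (4.1080, "3:57:06"),
    "F16s": (1.4307, "5:03:10"),
    "F72s v2": (5.4865, "3:37:01"),
}

for name, (rate, runtime) in TESTED.items():
    print(f"{name}: ${run_cost(rate, runtime):.2f} for the test subset")
```

Because the runtimes come from a subset of a full run, these figures indicate relative cost between VM sizes rather than the cost of a complete analysis.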

The data used for the cost calculation analyses constituted only a representative subset of a complete MinION sequencing data set (e.g. 50 Gigabytes of 150 Gigabytes of data were used for testing). For that reason, the runtime results shown in Tables 1 and 2 were considerably lower than the runtime of a complete analysis. An extrapolation of 2x to 3x the total runtime of the cost calculation analyses was used as an approximation to estimate the runtime of a complete analysis. The key imperatives in selecting suitable GPU and CPU VMs included availability within Australian datacentres, compatibility with the application, a non-hyperthreading configuration and, most importantly, the balance between cost and performance. For the GPU evaluation, both NC24 v3 and NC6 v3 VMs were suitable. The NC6 v3 consisted of one GPU while the NC24 v3 offered four; hence the NC24 v3 would cost approximately 4 times as much as the NC6 v3 per hour of rent, although it could potentially outperform the NC6 v3 by up to 4-fold. The on-premises analysis workflow involved only one GPU; therefore, in terms of cost-benefit and of computing power comparable to the manual process, the NC6 v3 appeared superior to the NC24 v3 (Table 1). NC6 v3 was selected as the final candidate for the GPU VM evaluation.

In the CPU VM evaluation, more options without hyperthreading were available within the Australian Azure datacentres than for GPU VMs. Among the five available CPU VMs (D15v2, L32, G5, DS5 v2 and F16s), L32 and G5 offered CPU power equivalent to the onsite computing unit, specifically 32 CPUs. It is important to note that the G5 VM would cost approximately 3.4 times as much as the L32 VM (Table 2); therefore the G5 VM was not included in the testing and the L32 VM was selected as the final candidate. In conclusion, the optimal GPU-enabled VM was NC6 v3 with 1 NVIDIA Tesla V100 GPU, 6 Intel Xeon E5-2690 v4 (Broadwell) CPUs, 112 GB of RAM and a P10 disk. The most suitable CPU-enabled VM was L32 with 32 Intel Xeon E5 v3 CPUs, 256 GB of RAM and a P10 disk.

Cost estimation for genetic analysis using Azure cloud-pipeline

After the best VM sizes were identified, it was possible to estimate the total cost per sample, as shown in Table 3 below. It is important to note that the CPU-enabled VM was superseded by the implementation of the barcoding tool in Guppy and is no longer included in the pipeline or in the cost estimation below. The analysis cost per sample was estimated at $0.25 for a run of 48 samples, which was substantially lower than the $5 per sample for manual on-premises analysis. The manual cost calculation was based on 3 hours of labour by a laboratory scientist to complete the execution.

Table 3. Estimation of running cost per sample.

Compute	Time (hr)	Cost (hr)	Note
GPU applications	2.12	$ 5.6405	NC6 v3 (1x V100 GPU, 112 GB RAM, 1x P10 disk)
Cost per run		$ 11.96	Includes VM start-up and deletion totalling 12 mins
Cost per sample		$ 0.25	48 samples per run
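The figures in Table 3 can be reproduced with a short Python calculation. This is a sketch of the arithmetic only; the constant and function names are illustrative, and the manual figure of $5 per sample is taken from the text as given.

```python
VM_HOURS = 2.12           # GPU VM time per run, including start-up and deletion
RATE_PER_HOUR = 5.6405    # hourly price of the NC6 v3 VM ($)
SAMPLES_PER_RUN = 48
MANUAL_COST_PER_SAMPLE = 5.0  # from the text ($)


def cost_per_run(vm_hours: float, rate: float) -> float:
    """Compute cost of one sequencing run on the cloud pipeline."""
    return vm_hours * rate


def cost_per_sample(run_cost: float, samples: int) -> float:
    """Spread the run cost evenly over the samples in the run."""
    return run_cost / samples


run = cost_per_run(VM_HOURS, RATE_PER_HOUR)
sample = cost_per_sample(run, SAMPLES_PER_RUN)
ratio = MANUAL_COST_PER_SAMPLE / sample

print(f"Cost per run:    ${run:.2f}")
print(f"Cost per sample: ${sample:.2f}")
print(f"Manual cost is about {ratio:.0f} times the cloud cost per sample")
```

The per-sample cost scales inversely with the number of samples multiplexed in a run, so runs with fewer than 48 samples would cost proportionally more per sample.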


The initial capital outlay to purchase physical computing infrastructure for the manual process was approximately $3500 for a workstation with an Intel® Core™ i7-7700K CPU @ 4.20 GHz, 32 GB RAM, a 64-bit operating system and an NVIDIA GTX 1080 Ti GPU. However, this computer was for communal usage and not designated solely for third-generation sequencing (TGS) data analysis; therefore the onsite infrastructure and ongoing maintenance costs were not incorporated into the data analysis cost calculation. Similarly, PathWest holds an active subscription to the Microsoft Azure cloud service that is accessible to all departments within the organization. As the computing resource in the Azure cloud was charged for the length of usage required to perform the task for the current tenant, the licence or subscription fee for Azure was not included in this study. Overall, this validation was based only on the labour cost to perform the analysis manually versus the cost to perform the analysis by cloud computing, for a side-by-side cost comparison.

Data analysis processing time evaluation

A comparison of analysis time between the manual analysis pipeline and the automated Azure cloud-based analysis pipeline was performed on MinION sequencing run 15_06_20_M1 with 16 hours-worth of data output (250 Gigabytes of data). The breakdown of running time is shown in Table 4.

Table 4. Comparison of the running time required to complete the analysis for a MinION run between the automatic pipeline in the Azure cloud and the manual process onsite.

ANALYSIS TIME THROUGH AUTOMATIC CPU AND GPU PIPELINE (AZURE CLOUD)
Run ID	Timeline	Date	Start	End	Duration
15_06_20_M1	Guppy Basecalling (v3.6.1)	16-06-2020	9:00	10:32	01:32
	Porechop	16-06-2020	10:32	17:57	07:25
	NanoFilt, MinIONQC	16-06-2020	17:57	18:38	00:41
				Total	09:38
ANALYSIS TIME THROUGH AUTOMATIC GPU-ONLY PIPELINE (AZURE CLOUD)
Run ID	Timeline	Date	Start	End	Duration
15_06_20_M1	Guppy Basecalling (v4.0.14)	05-11-2021	10:37	11:56	01:19
	Guppy Barcoding (v4.0.14)	05-11-2021	11:56	12:18	00:22
	NanoFilt, MinIONQC	05-11-2021	12:18	12:32	00:14
				Total	01:55
ANALYSIS TIME THROUGH MANUAL PROCESS (ON PREMISES)
Run ID	Timeline	Date	Start	End	Duration
15_06_20_M1	Guppy Basecalling (v3.6.1)	17-06-2020	(17–06) 18:06	(18–06) 02:24	08:18
	Porechop	18-06-2020	08:30	20:51	12:21
	NanoFilt, MinIONQC	19-06-2020	08:30	09:15	00:45
				Total	21:24


Initially, when using a combination of CPU and GPU VMs, with Porechop instead of Guppy for barcoding and demultiplexing, the full analysis of a representative MinION dataset using the Azure cloud infrastructure completed in 9 hours and 38 minutes, compared to 21 hours and 24 minutes using the manual analysis pipeline on the DCI computer workstations. That represents a time-to-answer speed-up of 2.22x, or a reduction of analysis run time to approximately 45% of the manual process. It is also important to note that even though the total running time of the onsite analysis was ~21 hours, the full analysis took place over three days, as each process was triggered manually and sequentially by an operator. If one step finished outside of the 8-hour routine working day, the next step was delayed until the next business day. Through automation and the flexibility of cloud architecture, it was possible to obtain the fully analysed FASTQ files from a complete MinION run within one working day without operator intervention.

After consolidation of the complete pipeline on GPU VMs by incorporating the barcoding and demultiplexing tool in Guppy, which can take advantage of GPU resources, the full analysis of the same representative MinION data set on Azure completed in 1 hour and 55 minutes. That represents a time-to-answer speed-up of 11.17x, or a reduction of analysis run time to approximately 9% of the manual process.
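The totals in Table 4 and the speed-up figures quoted above follow directly from the step durations, as the Python sketch below shows. The dictionary keys and function names are illustrative; the durations are copied from Table 4.

```python
from typing import Dict, List

# Step durations (HH:MM) from Table 4
PIPELINES: Dict[str, List[str]] = {
    "azure_cpu_gpu": ["01:32", "07:25", "00:41"],
    "azure_gpu_only": ["01:19", "00:22", "00:14"],
    "manual": ["08:18", "12:21", "00:45"],
}


def to_minutes(hhmm: str) -> int:
    """Convert an HH:MM duration into whole minutes."""
    h, m = (int(part) for part in hhmm.split(":"))
    return h * 60 + m


def total_minutes(steps: List[str]) -> int:
    """Sum the durations of all steps in a pipeline."""
    return sum(to_minutes(step) for step in steps)


def speedup(baseline_min: int, candidate_min: int) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_min / candidate_min


manual = total_minutes(PIPELINES["manual"])
for name in ("azure_cpu_gpu", "azure_gpu_only"):
    cloud = total_minutes(PIPELINES[name])
    print(f"{name}: {cloud} min, speed-up {speedup(manual, cloud):.2f}x, "
          f"{100 * cloud / manual:.0f}% of manual run time")
```

These figures compare active processing time only; the manual workflow also incurred idle time between steps, which is why it spanned three calendar days.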

Genetic data output comparison

The study data set consisted of 48 representative samples from a well-characterized DNA panel selected for HLA antigens commonly found in Western Australian population. All genomic DNA was extracted from the peripheral white blood cells or B-cell transformed cell lines by QIAsymphony DNA Midi Kit (Qiagen, Germany) according to the vendor’s protocol. The concentration and purity of extracted DNA were assessed by the optical density (OD) 260/280 ratio of 1.6–2.0. Furthermore, all samples had historical HLA high-resolution results available, obtained using the Ion Torrent platform [5].

The FASTQ files obtained from the manual and Azure analysis pipelines were examined. The number of reads in each FASTQ file, taken from the Porechop and NanoFilt output summaries, was used for data yield comparison. The data output from the two analysis pipelines was closely comparable, with a p-value of 0.94 as shown in Fig 3. Therefore, the difference between the two data sets was not statistically significant.

Fig 3 — The blue colour bar represents the data obtained from Azure cloud analysis and the red colour bar depicts the data from manual analysis onsite.

The demultiplexed output for each sample and the proportion of unclassified reads in the run were also comparable between the two analysis pipelines (Fig 4). As expected, the outputs from the two analysis pipelines had a p-value of 1 and were not statistically different. Ideally, the number of reads from the Azure and manual pipelines should be identical in each sample; however, minor differences in the output were acceptable only if the HLA genotype calls were concordant between the two pipelines. For example, sample 22 contained 27635 reads and 27097 reads using Azure and manual analysis, respectively. The gap between the two pipelines was 538 reads; however, both workflows obtained more than the minimum threshold of 3000 reads per sample required for analysis. Most importantly, the HLA types were 100% concordant in sample 22.
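The acceptance rule described above (enough reads from both pipelines, and identical HLA genotype calls) can be expressed as a small Python check. This is a sketch only: the class and function names are illustrative, and the allele labels in the usage example are placeholders rather than the actual genotype of sample 22.

```python
from dataclasses import dataclass
from typing import FrozenSet

MIN_READS_PER_SAMPLE = 3000  # minimum threshold quoted in the text


@dataclass(frozen=True)
class SampleResult:
    reads: int                 # reads assigned to the sample after filtering
    genotype: FrozenSet[str]   # HLA alleles called for the sample


def read_gap(azure: SampleResult, manual: SampleResult) -> int:
    """Absolute difference in read count between the two pipelines."""
    return abs(azure.reads - manual.reads)


def acceptable(azure: SampleResult, manual: SampleResult,
               min_reads: int = MIN_READS_PER_SAMPLE) -> bool:
    """A read-count gap is tolerated only if both outputs have enough
    reads and the genotype calls are concordant."""
    enough = azure.reads >= min_reads and manual.reads >= min_reads
    return enough and azure.genotype == manual.genotype


# Read counts for sample 22 from the text; allele labels are placeholders
calls = frozenset({"allele-1", "allele-2"})
azure_22 = SampleResult(reads=27635, genotype=calls)
manual_22 = SampleResult(reads=27097, genotype=calls)
print(read_gap(azure_22, manual_22), acceptable(azure_22, manual_22))
```

Treating concordance as a hard requirement means that any genotype disagreement fails the check regardless of how similar the read counts are.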

HLA genotyping analysis using GenDx NGSengine (Utrecht, The Netherlands) software was performed on each sample in two data sets to determine data accuracy. There were very minor differences in the number of reads utilized by the HLA allele assignment software NGSengine to assign an HLA genotype between the data generated for each data set for the same sample. However, read depth for each gene analysed met the acceptance criteria of > 20 reads and there were no HLA genotype discrepancies in the data generated by each of the two pipelines. For instance, in sample #38, the number of reads matching to a HLA reference was 4112 reads using Azure analysis and 4139 reads using the manual onsite analysis as shown in the Fig 5A and 5B. The genotyping results were consistent and the presence of a novel polymorphism in an HLA-C*03 allele was identified in the data generated using the two analysis pipelines.

Fig 5 — Analysis snapshot from NGSengine of sample #38 obtained from the automatic pipeline in Azure cloud platform (A panel) and the manual analysis workflow onsite (B panel).

Overall, the data quantity and quality were comparable between the Azure cloud and manual on-premises analysis pipelines. There were no differences in the HLA genotype results obtained from the two data analysis pipelines.

Docker container overview

Docker containers were chosen for the nanopore sequencing pipeline since the Azure platform provides enhanced support for running containers at the right scale on specialised hardware, such as hosts with GPUs and high-performance computing (HPC) clusters, both of which were required for this genomics pipeline. The Dockerfile for the GPU applications is as follows:

# Base image from NVIDIA so that the GPU drivers and libraries are available to Guppy
FROM nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04

ARG GENERAL_DEPENDENCIES="wget apt-transport-https software-properties-common"
ARG GUPPY_DEPENDENCIES="lsb-release"
ARG NANOFILT_DEPENDENCIES="python3-pip python3-pkg-resources dos2unix"
ARG MINIONQC_DEPENDENCIES="r-base"

# Set non-interactive mode to override any user inputs requested by package installations
ENV DEBIAN_FRONTEND=noninteractive

# Install dependencies
RUN apt-get update && \
    apt-get install --yes $GUPPY_DEPENDENCIES $MINIONQC_DEPENDENCIES $NANOFILT_DEPENDENCIES $GENERAL_DEPENDENCIES && \
    apt-get clean

# Install AzCopy (data movement tool)
RUN mkdir -p /home/azcopy && cd /home/azcopy && \
    wget -O azcopy.tar.gz https://aka.ms/downloadazcopy-v10-linux && \
    tar -xf azcopy.tar.gz --strip-components=1
ENV PATH=/home/azcopy:$PATH

# Install NanoFilt
RUN pip3 install --upgrade pip
RUN pip3 install nanofilt

# Install MinIONQC
# Set the CRAN mirror as Perth (7th in the index = Curtin University’s mirror) and then install dependency packages
RUN R -e "chooseCRANmirror(graphics = FALSE, ind = 6);install.packages(c('rlang', 'data.table', 'futile.logger', 'ggplot2', 'optparse', 'plyr', 'readr', 'reshape2', 'scales', 'viridis', 'yaml'))" && \
    export R_LIBS="/usr/local/lib/R/site-library"

# Finally, download the R script for MinIONQC directly
RUN wget -O MinIONQC.R https://raw.githubusercontent.com/roblanf/minion_qc/master/MinIONQC.R

# Install Guppy with dependencies
# ********************************************************************************************************************************
# ***NOTE: The next line can be modified to create a container with a specific version of Guppy (e.g., a new one for testing) ***
# ********************************************************************************************************************************
ARG GUPPY_VERSION=4.0.14

# Requires setting DEBIAN_FRONTEND as non-interactive or keyboard-layout installation stalls the installation of Guppy
ARG DEBIAN_FRONTEND=noninteractive
ARG DEPENDENCY_PACKAGES="libhdf5-100 libnorm1 libpgm-5.2-0 libsodium23 libzmq5 libhdf5-cpp-100 libboost-thread1.65.1 libboost-atomic1.65.1 libboost-chrono1.65.1 libboost-date-time1.65.1 libboost-filesystem1.65.1 libboost-iostreams1.65.1 libboost-program-options1.65.1 libboost-regex1.65.1 libboost-system1.65.1 libboost-log1.65.1"

RUN wget https://mirror.oxfordnanoportal.com/software/analysis/ont_guppy_${GUPPY_VERSION}-1~bionic_amd64.deb && \
    apt-get install --yes $DEPENDENCY_PACKAGES && \
    dpkg -i *.deb

# Install dateutils for date and time calculations
RUN apt-get install -y dateutils

# Install PowerShell Core
RUN wget -q https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb && \
    dpkg -i packages-microsoft-prod.deb && \
    apt-get update && \
    add-apt-repository universe && \
    apt-get install -y powershell

# Define Command or Entry Point
CMD ["/bin/bash"]

(The value of GUPPY_VERSION, highlighted in the original listing, can be modified to build the container with a different version of Guppy.)
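As a minimal sketch of how the GUPPY_VERSION build argument might be overridden without editing the Dockerfile, the commands below use Docker's `--build-arg` option. The image name `minion-analysis` and the `RUN_BUILD` switch are hypothetical and are not taken from the study. The script prints the build command by default and only runs it when `RUN_BUILD=1`.

```shell
#!/bin/sh
# Sketch: build the image with a chosen Guppy version by overriding the
# GUPPY_VERSION build argument declared with ARG in the Dockerfile.
# The image name and the RUN_BUILD switch are hypothetical placeholders.
set -eu

# Default matches the value in the Dockerfile; export GUPPY_VERSION to change it.
GUPPY_VERSION="${GUPPY_VERSION:-4.0.14}"
IMAGE_TAG="minion-analysis:guppy-${GUPPY_VERSION}"

if [ "${RUN_BUILD:-0}" = "1" ]; then
  # Real build: run from the directory that contains the Dockerfile.
  docker build --build-arg "GUPPY_VERSION=${GUPPY_VERSION}" -t "${IMAGE_TAG}" .
else
  # Dry run: show the command that would be executed.
  echo "docker build --build-arg GUPPY_VERSION=${GUPPY_VERSION} -t ${IMAGE_TAG} ."
fi
```

Tagging each image with its Guppy version keeps a test build separate from the production container, so a new version can be evaluated before it is put into use.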

Ethical approval

Written consent for genetic analysis (specifically HLA genotyping) was obtained for all samples at the point of collection. For patients under 18 years old, written consent was sought from parents or guardians. Ethical approval for storage and biobanking of PBMC and DNA samples at Fiona Stanley Hospital was granted (RGS 0552) by the Human Research Ethics Committee (PathWest, Department of Health). The main purpose of this study was to assess the analysis of HLA genetic information using cloud computing resources. No additional genetic information outside of the HLA would be unveiled from the genomes of the study panel. All samples selected for this study had been de-identified; therefore, the genomic data of these samples can be shared or published without jeopardising personal privacy and confidentiality. Additionally, other private information such as name, address, clinical history or treatment was not collected for this study. Lastly, only human samples were included in this research and no animals were involved.

Discussion

Cloud computing has been widely used in processing clinical trial data, such as cancer treatment trials [7], and in genetic studies [8]. To our knowledge, this study was the first in the world to leverage cloud computing to analyse TGS raw data for clinical HLA genotyping. The main goal of this study was to develop an automatic data analysis pipeline that streamlined the data flow from the MinION sequencing device to cloud computing and then back to the hospital network for downstream genomic analysis. This pipeline leveraged the scalability and flexibility of cloud computing resources to produce the end result of demultiplexed, filtered FASTQ files 11.17 times faster than the manual on-premises data analysis pipeline. Previously, data analysis of MinION sequencing data took over 3 working days; with the implementation of the Azure cloud-based data analysis pipeline and GPU-enabled VMs, this was reduced to under 2 hours.

In the latest cloud pipeline, the total analysis cost was estimated at $11.96 per run or $0.25 per sample, compared with $250 per run or $5 per sample if performed manually. Furthermore, the Azure cloud-based data analysis pipeline produced data quantity and quality comparable to that of the manual on-premises data analysis pipeline. There were no differences in the HLA genotype results obtained from the two data analysis pipelines. Lastly, all data were stored in the data centre within Australia with utmost security. All samples were de-identified to adhere to strict patient confidentiality requirements.
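The cost figures quoted above can be put side by side with a little arithmetic. The sketch below uses only the four dollar amounts stated in the text; the variable names are illustrative, and `awk` is used simply because the shell has no built-in floating-point arithmetic.

```shell
#!/bin/sh
# Worked arithmetic on the cost figures quoted in the Discussion.
# Only the four dollar amounts come from the text; variable names are illustrative.
set -eu
export LC_ALL=C   # force "." as the decimal separator in awk output

CLOUD_RUN=11.96;   MANUAL_RUN=250    # cost per run
CLOUD_SAMPLE=0.25; MANUAL_SAMPLE=5   # cost per sample

# How many times cheaper the cloud pipeline is, per run and per sample.
RUN_RATIO=$(awk -v m="$MANUAL_RUN" -v c="$CLOUD_RUN" 'BEGIN { printf "%.1f", m / c }')
SAMPLE_RATIO=$(awk -v m="$MANUAL_SAMPLE" -v c="$CLOUD_SAMPLE" 'BEGIN { printf "%.0f", m / c }')

# Absolute saving per run.
RUN_SAVING=$(awk -v m="$MANUAL_RUN" -v c="$CLOUD_RUN" 'BEGIN { printf "%.2f", m - c }')

echo "Per run:    ${RUN_RATIO}x cheaper, saving \$${RUN_SAVING}"   # 20.9x, $238.04
echo "Per sample: ${SAMPLE_RATIO}x cheaper"                        # 20x
```

Both ratios point to roughly a 20-fold reduction in analysis cost, whether the comparison is made per run or per sample.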

This study showed that the optimal GPU-enabled virtual machine (VM) was NC6 v3 with 1 GPU, 112 GB RAM and a P10 disk. All GPU applications were stored in Docker containers and managed by Loome and the Azure Batch service. The Loome agent communicated with Azure Batch and triggered the deployment of the necessary computing resources. The Loome agent also orchestrated the movement of data from Blob Storage to the VM and back to Blob Storage, ready for transfer to the hospital stand-alone computer. Using Docker for containerisation makes it possible to update the applications involved in the workflow, such as Guppy, which has important updates approximately every 6 months.
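The text does not show the transfer commands themselves. Since AzCopy is installed in the container image, one plausible form of the Blob Storage to VM round trip is sketched below as a dry run. Every storage account, container and path name is a hypothetical placeholder, and the SAS token is assumed to be supplied by the orchestration layer; none of these values come from the study.

```shell
#!/bin/sh
# Sketch only: stage raw data from Blob Storage onto the VM, then return results.
# All account, container and path names are hypothetical placeholders, and
# SAS_TOKEN is assumed to be injected by whatever orchestrates the job.
set -eu

STORAGE_URL="https://<storage-account>.blob.core.windows.net"   # placeholder
RUN_ID="${RUN_ID:-example-run}"                                  # hypothetical run label
SAS_TOKEN="${SAS_TOKEN:-<sas-token>}"                            # placeholder

# azcopy copy <source> <destination> --recursive copies a whole directory tree.
DOWNLOAD_CMD="azcopy copy '${STORAGE_URL}/raw/${RUN_ID}?${SAS_TOKEN}' '/data/input' --recursive"
UPLOAD_CMD="azcopy copy '/data/output' '${STORAGE_URL}/results/${RUN_ID}?${SAS_TOKEN}' --recursive"

# Dry run: print the two transfers in the order they would happen.
echo "$DOWNLOAD_CMD"
echo "(basecalling, demultiplexing and filtering would run here)"
echo "$UPLOAD_CMD"
```

Keeping the download and upload as two separate steps mirrors the description above: the VM only holds data for the duration of the analysis, and Blob Storage remains the hand-off point to the hospital computer.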

It is important to note that the testing and optimization of this pipeline commenced in 2020 and was completed in 2021. The current cloud infrastructure and costing of computing resources remain as described in the Methods and Results sections. The analysis workflow has been shown to be resilient over time: this pipeline, with the specific virtual machine (NC6 v3 with 1 NVIDIA Tesla V100 GPU, 6 Intel Xeon E5-2690 v4 (Broadwell) CPUs, 112 GB of RAM and a P10 disk), is able to analyse data from the latest version of the MinION flow cell (R10.4) and sequencing chemistry (Q20) at the same cost and with the same computing set-up. Both reagents were released in early 2022 (data not shown in this study). Furthermore, this pipeline is currently used for TGS raw data analysis for HLA typing in deceased donor organ workup, which is time-sensitive and requires a rapid turnaround time.

A major limitation of this study is the dependency on the availability of programmer and IT support for building the Docker containers and orchestrating the Loome agent, container registry, VM and Blob Storage in the Azure platform. One medical scientist from PathWest with entry-level skills in data science and one programmer from BizData/Microsoft worked closely to build and optimize this cloud-based pipeline over the course of six months. The prolonged testing time was not due to the complexity of the pipeline or a lack of manpower; the major hurdle was obtaining clearance to create a connection bridging the cloud server and the health network while maintaining compliance with the cyber security and patient confidentiality policies of Fiona Stanley Hospital and the Western Australian Department of Health.

In addition, IT support or a bioinformatician is required to test application updates in the aforementioned Docker container prior to implementation in the production environment. The Guppy base-caller algorithm is upgraded periodically to improve the accuracy of the technology; therefore, it is important to keep up to date with the latest version of the application to obtain the highest quality of sequencing data. The code to build and maintain the Docker image is open-access and made available in this study to assist laboratories with limited access to a programmer. However, to replicate this pipeline in a different laboratory, one would still require IT support and a subscription to the Microsoft Azure platform or another cloud service provider of choice.

A common concern among medical scientists and clinicians about the use of cloud servers is the potential breach of patient confidentiality and of sensitive medical data such as HLA genotype results. The construction and deployment of this analysis pipeline in the Azure server underwent the highest level of scrutiny and approval by the PathWest IT manager as well as the network engineer manager at Fiona Stanley Hospital. The senior authors of this study provided a Deployment Architecture Design Document to the site manager outlining the requirements for network configuration, system configuration and adherence to the data security policy.

Specifically, the raw data sent to the Azure server do not contain patient demographic data and consist only of electrical signals. The processed nucleotide sequence data sent back to the hospital site are linked to a laboratory reference number. To identify patient data, one would require access to the PathWest Laboratory Information System, which is password-protected, to link the laboratory number to a patient's demographic data. Furthermore, the end-user would also require access to an allele assignment program such as GenDx NGSengine to translate nucleotide sequences into HLA types. Therefore, there are several checkpoints and multiple layers of security to ensure the utmost protection of patients' clinical data.

Overall, the Microsoft Azure cloud-based data analysis pipeline was shown to meet all the key imperatives for performance, cost, usability, simplicity and accuracy. Importantly, the pipeline allows for the on-going maintenance and testing of version changes before implementation. This pipeline is suitable for the data analysis from MinION sequencing platforms and could be adopted for other data analysis application processes.

Supporting information

S1 File

(PDF)


Acknowledgments

We would like to thank Mr Johnny Gorea for his valuable input in the conceptualization of this study. We are also grateful to all the colleagues at the Department of Clinical Immunology, PathWest for their technical assistance and troubleshooting.

Data Availability

Yes - sample data are fully available without restriction. All FASTQ files from this study are available from the Dryad database (https://doi.org/10.5061/dryad.x0k6djhp4).

Funding Statement

This study was funded by Innovation Grant (Microsoft Australia). The authors (LT, FA, DDS) were granted $35,000 AUD for the development of the automatic pipeline in Microsoft Azure server. The sponsor (Microsoft Australia) played no role in the study design, data analysis or preparation of the manuscript.

References

1. Ohta T, Tanjo T & Ogasawara O. Accumulating computational resource usage of genomic data analysis workflow to optimize cloud computing instance selection. GigaScience. 2019(8):1–11 Oxford Nanopore Technologies Ltd. GitHub repository, https://github.com/nanoporetech/bonito. (2019) doi: 10.1093/gigascience/giz052
2. Kono N, Arakawa K. Nanopore sequencing: Review of potential applications in functional genomics. Develop Growth Differ. 2019(61):316–326. doi: 10.1111/dgd.12608
3. Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nature Reviews: Genetics. 2018(19):208–219. doi: 10.1038/nrg.2017.113
4. Wick R, Volkening J & Loman N. GitHub repository, https://github.com/rrwick/Porechop. (2017)
5. Truong L, Matern B, D’Orsogna L, Martinez P, Tilanus MGJ, De Santis D. A novel multiplexed 11 locus HLA full gene amplification assay using next generation sequencing. HLA. 2020 Feb;95(2):104–116. doi: 10.1111/tan.13729. Epub 2019 Oct 24.
6. Lanfear R, Schalamun M, Kainer D, et al. MinIONQC: fast and simple quality control for MinION sequencing data. Bioinformatics. 2018. doi: 10.1093/bioinformatics/bty654
7. Ronquillo JG, Lester WT. Practical aspects of implementing and applying health care cloud computing services and informatics to cancer clinical trial data. JCO Clinical Cancer Informatics. 2021;5(5):826.
8. Hung CL, Lin CY. Open reading frame phylogenetic analysis on the cloud. International Journal of Genomics. 2013;2013:614923. doi: 10.1155/2013/614923; PMCID: PMC3647537.

PLoS One. doi: 10.1371/journal.pone.0278609.r001

Decision Letter 0

Mingming Liu

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

5 Sep 2022

PONE-D-22-07900

Nanopore sequencing data analysis using Microsoft Azure cloud computing service

PLOS ONE

Dear Dr. Truong,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 20 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Mingming Liu

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1.Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. We note you have not yet provided a protocols.io PDF version of your protocol and/or a protocols.io DOI. When you submit your revision, please provide a PDF version of your protocol as generated by protocols.io (the file will have the protocols.io logo in the upper right corner of the first page) as a Supporting Information file. The filename should be S1_file.pdf, and you should enter “S1 File” into the Description field. Any additional protocols should be numbered S2, S3, and so on. Please also follow the instructions for Supporting Information captions [https://journals.plos.org/plosone/s/supporting-information#loc-captions]. The title in the caption should read: “Step-by-step protocol, also available on protocols.io.”

Please assign your protocol a protocols.io DOI, if you have not already done so, and include the following line in the Materials and Methods section of your manuscript: “The protocol described in this peer-reviewed article is published on protocols.io (https://dx.doi.org/10.17504/protocols.io.[...]) and is included for printing purposes as S1 File.” You should also supply the DOI in the Protocols.io DOI field of the submission form when you submit your revision.

If you have not yet uploaded your protocol to protocols.io, you are invited to use the platform’s protocol entry service [https://www.protocols.io/we-enter-protocols] for doing so, at no charge. Through this service, the team at protocols.io will enter your protocol for you and format it in a way that takes advantage of the platform’s features. When submitting your protocol to the protocol entry service please include the customer code PLOS2022 in the Note field and indicate that your protocol is associated with a PLOS ONE Lab Protocol Submission. You should also include the title and manuscript number of your PLOS ONE submission.

3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

4. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

5. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This work had been supported through the Innovation Fund by Microsoft Australia.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This study was funded by Innovation Grant (Microsoft Australia). The authors (LT, FA, DDS) were granted $35,000 AUD for the development of the automatic pipeline in Microsoft Azure server. The sponsor (Microsoft Australia) played no role in the study design, data analysis or preparation of the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

6. Thank you for stating the following in your Competing Interests section:

“No authors have competing interests”

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

7. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

8. Please amend your manuscript to include your abstract after the title page.

Additional Editor Comments:

This paper presents an automatic data processing pipeline for nanopore sequence analysis using the Microsoft Azure cloud computing service. The claim is that the automatic pipeline in this specific cloud environment can be used for HLA genotype analysis. There is some merit in the proposed work but unfortunately the presentation lacks clarity and detail. In addition to the reviewers' comments, please include clarifications to the following concerns/questions.

1. Please consider including more references in your article. For instance, a related work section could be added to review some existing work in the literature.
2. Please discuss the limitations of your approach and any other features/settings that could be incorporated to further improve your pipeline.
3. Please present more details of the dataset used in your experiments.

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

Descriptions of methods and reagents contained in the step-by-step protocol should be reported in sufficient detail for another researcher to reproduce all experiments and analyses. The protocol should describe the appropriate controls, sample sizes and replication needed to ensure that the data are robust and reproducible.

Reviewer #1: Partly

Reviewer #2: Partly

**********

3. Does the protocol describe a validated method?

The manuscript must demonstrate that the protocol achieves its intended purpose: either by containing appropriate validation data, or referencing at least one original research article in which the protocol was used to generate data.

Reviewer #1: Yes

Reviewer #2: No

**********

4. If the manuscript contains new data, have the authors made this data fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the article presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please highlight any specific errors that need correcting in the box below.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This study was undertaken to address the problem of processing and analyzing data from the nanopore sequencing of HLA genes in a short period of time and with possibly lower cost. The study proposes a cloud-based solution that is streamlined, increases throughput, and decreases the processing time and cost per sample genotyped.

The HLA data generated appear to be the same whether the investigators use the manual pipeline on physical computing resources or the cloud-based pipeline. This is positive.

Other comments:

1. Is it appropriate to compare the cost per sample of using the cloud-based pipeline, which is the cost of using all the various computing components, with the cost of a technician for the manual process? There is no information regarding the costs of buying and maintaining the computing infrastructure for the manual process that they describe, which is not insignificant, especially if an institution doesn’t have a high-performance computing cluster for use by the lab interested in running the ONT pipeline. This infrastructure cost should be combined with the cost of the technician’s time to account for all resources needed.

2. Based on data in Table 4, the underlying work and testing for building this pipeline occurred 1 – 2 years ago. Has the system you configured and built in Azure using the specific GPUs and CPUs described here, held up over time? Meaning, if you were to analyze an ONT run today, would you still be utilizing this hardware set up? And, if you are utilizing the same compute times, how has the cost changed in that time frame?

3. One thing that is not addressed is the time and effort it took to build this automated system. So let’s say your lab has a competent bioinformatician or IT support person who is currently running a similar pipeline “in-house” as you were doing, how much extra effort was spent to build out this cloud-based resource? Is this something that someone can do in a matter of a few days, weeks, months? What kind of expertise is required?

4. For labs that do not have a dedicated bioinformatician/IT resources to put this pipeline into place, is this resource that you’ve built available for others to spin up and use?

5. Did the authors meet any challenges from their respective institutions when it came to utilizing cloud-based resources for clinical data? In our experience, many institutions are not trusting of these platforms for non-research-based activities, including clinical HLA genotyping. If you did, can you comment on the driving forces behind accepting a cloud-based pipeline instead of your in-house pipeline? It is not uncommon for a reduction in cost to not overcome the security factors of keeping the data on-site.

The manuscript needs significant editing. Two examples:

1. Page 16: Final sentence of the Discussion section should read “clinical history or treatment was not collected for the scope of this study.”, not scoop of this study.

2. Page 16: End of the 1st paragraph of Discussion should be “utmost security” not “upmost security”

Reviewer #2: In this protocol, the authors propose a pipeline to analyze nanopore sequencing data automatically, using the cloud computing service from Microsoft Azure to accelerate the HLA genotyping service. However, there are some comments the authors may consider to improve the quality of the protocol.

Introduction section:

1. “The computational power that is needed to analyze nanopore sequencing data in a timely-manner and to enable long-term storage has also outgrown …”

This is a bit confusing. Please add one or two sentences to explain how computational power enables long-term storage.

2. As the authors mentioned, there are cloud services provided by Google, Amazon and Microsoft and “All cloud providers offer elasticity, convenience and scalability depending on specific demand of individual workflow.”

Why was Microsoft Azure chosen for the protocol?

Methods and Results section:

1. “The individual FASTQ reads were then further filtered by size, a minimum length of 2 kb, and quality, minimum Q-score of 7 ….”

Are there any references indicating that the threshold of minimum length and minimum Q-score are reasonable in the HLA genotyping? These thresholds are quite critical in data preprocessing.

Identifying optimal virtual machine section:

1. In Table 1, the comparison between different GPU VMs, only NC6 v3 was tested and the other types of GPU VM were not used. Why is NC6 v3 the best option here? For instance, what if the runtime of NC24 v3 is extremely fast (even if it is expensive)?

2. In Table 1, the “finding” column gives the decision on whether the VM was selected. However, which metrics did you consider most important in VM selection? For instance, a calculation based on cost and runtime would help to reach the final decision.

3. The problems in Table 2 are similar to those in Table 1. Why was the L32 selected when its runtime and cost are not optimal compared with other types of VM?

Please give one or two sentences that clearly summarize the considerations on which the VM selection was based.

Genetic data output comparison section:

1. What is the unit for y-axis in figure 3?

2. The x-axis is not legible.

3. Ideally, the number of reads detected in each sample (y-axis) should be identical between Azure cloud analysis and manual analysis. However, from BC22 to BC26, the gap between the two analysis methods seems a bit large. Will it cause any consequences in the following HLA genotyping?

4. In Figure 4, ensure that the “none” labels have the same font size.

5. The pie charts in Figure 4 look similar. Are there any metrics to evaluate the similarity between them (rather than just stating that the two pie charts are comparable)?

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Dimitri Monos Ph.D

Reviewer #2: Yes: Hongde Wu

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 2;17(12):e0278609. doi: 10.1371/journal.pone.0278609.r002

Author response to Decision Letter 0

20 Oct 2022

To address the academic editor’s comments:

1. The manuscript has been reformatted as per the provided PLOS ONE style template.

2. The step-by-step protocol was uploaded to protocols.io and the PDF was attached as S1_file.pdf. In addition, the statement referring to protocols.io was added to the Methods section of the manuscript.

3. The author-generated code and scripts were included in the manuscript with the intention of sharing them without restrictions upon publication of this work. Additional code and commands were also included in the step-by-step protocol on protocols.io.

4. The grant information in the “Funding Information” and “Financial Disclosure” sections was updated to be consistent.

5. The funding-related text was removed from the Acknowledgments section in the manuscript. Please enter the following statement in the Funding Statement section: “This study was funded by Innovation Grant (Microsoft Australia). The authors (LT, FA, DDS) were granted $35,000 AUD for the development of the automatic pipeline in Microsoft Azure server. The sponsor (Microsoft Australia) played no role in the study design, data analysis or preparation of the manuscript.”

6. The statement for the Competing Interests section was updated per PLOS ONE’s guidelines and reads as follows: “The authors have declared that no competing interests exist.”

7. The dataset is currently published in the Dryad repository (https://doi.org/10.5061/dryad.x0k6djhp4) and is fully accessible to reviewers.

8. The abstract was included in the manuscript after the title page.

To address additional comments from the editor:

1. More references to cloud computing applications in clinical services were added to our discussion section (last paragraph on page 12 of the manuscript).

2. The limitations of this study were also acknowledged in the discussion section (first paragraph on page 14 of the manuscript).

3. The demographics and details of the data set in this study were included in the first paragraph on page 10 of the manuscript.

To address reviewer #1’s comments:

1. The rationale behind cost estimation exercise was elucidated on page 8 right after Table 3.

2. The gap between validation run and implementation was explained in the discussion section, page 14 of the manuscript.

3. The lack of detail on the effort and timeline to build the pipeline was addressed in the discussion section on page 14.

4. The dependency on dedicated IT support was listed as a major limitation of the study.

5. The common concern with the usage of cloud server regarding security and patient’s sensitive information was addressed in the discussion on page 15.

6. The two spelling errors have been corrected. The manuscript underwent critical review by all authors prior to resubmission.

To address reviewer #2’s comments:

1. The remark on computational power and long-term storage was elaborated in the introduction section on page 2.

2. The rationale for choosing the Azure server was included in the introduction on page 3 of the manuscript.

3. The reasoning for the pre-determined quality threshold was elaborated in the methods and results section on page 4.

4. The findings from VM testing for GPU and CPU in Table 1 and Table 2, respectively, were explained in more detail on page 5 and page 6.

5. The y-axis of Figure 3 was labelled to represent the number of reads detected per sample after the demultiplexing process.

6. The x-axis of Figure 3 was re-formatted in a larger font to be legible.

7. The gap in read output between the two pipelines and its potential consequences were explained on page 10 of the manuscript.

8. Figure 4 was reformatted for consistency between the two pie charts.

9. A P-value was included to demonstrate the statistical significance of the difference between the two datasets on page 10.

PLoS One. doi: 10.1371/journal.pone.0278609.r003

Decision Letter 1

Mingming Liu

21 Nov 2022

Nanopore sequencing data analysis using Microsoft Azure cloud computing service

PONE-D-22-07900R1

Dear Dr. Truong,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Mingming Liu

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

To answer this question, please click the link to protocols.io in the Materials and Methods section of the manuscript (if a link has been provided) or consult the step-by-step protocol in the Supporting Information files.

The step-by-step protocol should contain sufficient detail for another researcher to be able to reproduce all experiments and analyses.

Reviewer #1: Yes

Reviewer #2: Partly

**********

3. Does the protocol describe a validated method?

Reviewer #1: Yes

Reviewer #2: No

**********

4. If the manuscript contains new data, have the authors made this data fully available?

Reviewer #1: Yes

Reviewer #2: N/A

**********

5. Is the article presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have adequately responded to the reviewers' comments. However, the manuscript needs careful editing and reviewing as it still has a few spelling errors.

Reviewer #2: The authors have addressed all the issues mentioned in the comments. However, the text in Fig 3 and Fig 4 is still not legible. I suggest the authors fix these issues in the final version of the paper.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Hongde Wu

**********

PLoS One. doi: 10.1371/journal.pone.0278609.r004

Acceptance letter

Mingming Liu

25 Nov 2022

PONE-D-22-07900R1

Nanopore sequencing data analysis using Microsoft Azure cloud computing service

Dear Dr. Truong:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Mingming Liu

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File (PDF, 104.5 KB)

Data Availability Statement

Yes - sample data are fully available without restriction. All FASTQ files from this study are available from the Dryad database (https://doi.org/10.5061/dryad.x0k6djhp4).

