PLOS Digital Health. 2025 Apr 21;4(4):e0000825. doi: 10.1371/journal.pdig.0000825

Addressing data management and analysis challenges in viral genomics: The Swiss HIV cohort study viral next generation sequencing database

Marius Zeeb 1,2,*, Paul Frischknecht 1, Suraj Balakrishna 1,2, Lisa Jörimann 1,2, Jasmin Tschumi 1,2, Levente Zsichla 3,4, Sandra E Chaudron 1,2, Bashkim Jaha 1, Kathrin Neumann 1, Christine Leemann 1, Michael Huber 2, Karoline Leuzinger 5, Huldrych F Günthard 1,2, Karin J Metzner 1,2, Roger D Kouyos 1,2; The Zurich HIV Primary Infection Cohort Study, and the Swiss HIV Cohort Study 1
Editor: Miguel Ángel Armengol de la Hoz6
PMCID: PMC12011223  PMID: 40257980

Abstract

Numerous HIV related outcomes can be determined on the viral genome, for example, resistance associated mutations, population transmission dynamics, viral heritability traits, or time since infection. Viral sequences of people with HIV (PWH) are therefore essential for therapeutic and research purposes. While in the first three decades of the HIV pandemic viral genomes were mainly sequenced using Sanger sequencing, the last decade has seen a shift towards next-generation sequencing (NGS) as the preferred method. NGS can achieve near full length genome sequence coverage and simultaneously, it accurately encapsulates the within-host diversity by characterizing HIV subpopulations. NGS opens new avenues for HIV research, but it also presents challenges concerning data management and analysis. We therefore set up the Swiss HIV Cohort Study Viral NGS Database (SHCND) to address key issues in the handling of NGS data including high loads of raw- and processed NGS data, data storage solutions, downstream application of sophisticated bioinformatic tools, high-performance computing resources, and reproducibility. The database is nested within the Swiss HIV Cohort Study (SHCS) and the Zurich Primary HIV Infection Cohort Study (ZPHI), which together enrolled 21,876 PWH since 1988 and include a biobank dating back to the early nineties. Since its initiation in 2018, the SHCND accumulated NGS sequences (plasma and proviral origin) of 5,178 unique PWH. We here describe the design, set-up, and use of this NGS database. Overall, the SHCND has contributed to several research projects on HIV pathogenesis, treatment, drug resistance, and molecular epidemiology, and has thereby become a central part of HIV-genomics research in Switzerland.

Author summary

Medical data is becoming increasingly complex, which significantly enhances research and clinical decision making. However, this growing complexity also makes it more difficult to handle such data in a structured manner while adhering to good research practice and data management guidelines. In this context, we present the Swiss HIV Cohort Study Viral NGS Database (SHCND), a dedicated database for storing and processing next-generation sequencing (NGS) data of HIV genomes.

The SHCND centralizes all NGS data generated in the framework of the Swiss HIV Cohort Study (SHCS) and provides direct integration of bioinformatic pipelines for their processing. The SHCND streamlined the use of NGS data across researchers and was fundamental for a range of published research projects. Although developed to handle HIV NGS data, its flexible design makes it universally adaptable to any kind of data, for example, proteomics or imaging data. This work details the key design choices and functionalities of the SHCND aiming to serve as a practical guide for others seeking to establish databases for medical research data.

Introduction

Viral genomic data plays a crucial role in HIV-1 medicine, research, epidemiology, and public health [1–25]. The global HIV-1 pandemic affects diverse populations, with differences in healthcare access and HIV-1 related health outcomes [26–29]. In this context, the high diversity of HIV-1 is of significant importance, notably for determining drug resistance mutations (DRM) against anti-retroviral therapy (ART), viral transmission networks between people with HIV (PWH), viral pathogenesis, comorbidities linked to HIV-1, and immune responses, e.g., antibodies against HIV or autoimmune diseases [10,30–33].

Until recently, Sanger sequencing was used to sequence at least the pol region (encoding proteins relevant for viral replication) to determine the presence of DRMs [34–37]. However, over the last decade, next-generation sequencing (NGS), with technologies such as Illumina and Nanopore, has increasingly replaced Sanger sequencing in both research and diagnostics. NGS allows a much higher throughput and easier sequencing of the entire viral genome [1,38]. NGS has several advantages over Sanger sequencing, especially the ability to detect (resistance) mutations present at low frequencies, i.e., when the mutation is present in only a few percent of viral particles [39–41], and to determine within-patient diversity of HIV-1 genotypes as a marker of HIV-1 time since infection, transmission (if viral strains of two PWH are of high similarity), and super-infection (presence of two unique HIV-1 strains from independent transmissions) [31,42–44].

While NGS provides detailed information on the whole HIV-1 genome and offers benefits such as high throughput and declining costs, additional challenges arise as it generates large data files and requires elaborate bioinformatic pipelines for interpretation [45–47]. Additionally, the lack of standardization for NGS data processing makes it challenging to reproduce data reliably [48]: various NGS bioinformatic processing pipelines from different developers exist for applications such as genome assembly or DRM detection, each with different versions, input parameters [46], and often stochastic algorithms. Many tools are designed to produce the same output in principle, such as a genome alignment, but each choice of tool, version, and parameters can subtly influence the actual outcome, leading to results that differ between tools.

Considering these obstacles, detailed documentation of all processing steps is crucial for the reproducibility and comparability of results in case some steps need to be redone or altered. This needs special consideration, all the more so because software complexity will only increase further, with a trend towards specialization on narrow tasks. Therefore, the research protocol documentation for each sample should contain all corresponding analyses in detail, including the version and specific running parameters of the tool used. If this is fulfilled, output data storage is not even required, as results can be reproduced using the documentation. One potential exception is non-deterministic analyses (processes which have different outcomes despite the same input due to randomness in the algorithms), for example, the subsampling of NGS reads. While modern software mostly allows specifying a so-called random “seed” to reproduce the same (pseudo-)random results, some software might be hard-coded to determine a “seed” based on the current time or hardware random number generators. Finally, if external (unversioned) databases or web services outside of the project's control are utilized, their results need to be stored as they might not be reproducible later. Consequently, data storage and computational solutions are required that satisfy the conditions for sensitive human health data and can integrate raw data, processed data, and executable copies of the bioinformatic pipelines used.
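The role of the random seed can be illustrated with a minimal Python sketch. It assumes the reads are held in a plain list and uses only the standard library; the function name `subsample_reads`, the placeholder read identifiers, and the seed values are illustrative and not part of the SHCND.

```python
import random

def subsample_reads(reads, n, seed):
    """Return n reads drawn without replacement, reproducibly for a given seed."""
    # A private generator avoids any dependence on global state or the clock.
    rng = random.Random(seed)
    return rng.sample(reads, n)

reads = [f"read_{i}" for i in range(1000)]  # placeholder read identifiers

first = subsample_reads(reads, 10, seed=42)
second = subsample_reads(reads, 10, seed=42)

# Same input, same seed: the "random" subsample is identical on every run.
print(first == second)
```

Recording the seed alongside the tool version and parameters is what makes such a step reproducible from documentation alone.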

A modern, future-proof NGS database must meet all these demands. We therefore developed the Swiss HIV Cohort Study Viral NGS Database (SHCND) in the framework of the Swiss HIV Cohort Study (SHCS) and Zurich Primary HIV Infection Cohort Study (ZPHI) [49,50] to ensure the reproducible use of viral NGS data. This work aims to describe the design, set-up, maintenance, and use of the SHCND, from both a developer and a user/researcher perspective.

Methods

Aim

The SHCND aims to provide a centralized storage and compute-orchestration solution over the whole digital life cycle of an HIV-1 NGS record, i.e., from raw NGS read data to all relevant derived processing outputs. For the purpose of reproducibility, all processed outputs are linked to the exact version of the originating pipeline with all its dependencies. Redundant processing steps, like NGS sequence assembly, are streamlined and executed only once, rather than separately for each analysis. The assembly is then accessible to all researchers, saving both computational resources and time. Currently, the SHCND mostly handles NGS data from Illumina. However, the SHCND has a modular and extensible design, allowing for the convenient incorporation of new bioinformatic pipelines and data, particularly when they are available in a containerized form.

FAIR guiding principles

The SHCND follows the FAIR (Findability, Accessibility, Interoperability, and Reusability) Guiding Principles for scientific data management [51]. In brief, it makes data Findable and Accessible, by assigning globally unique and persistent (immutable) identifiers (F1, A1) and by use of standardized HTTPS communication protocols to handle access and authentication (A1.1, A1.2). It uses standard JSON format for rich and extensible metadata capture (F2, F3, A2), which is transformed to an indexed and searchable graph database representation (F4). The system is Interoperable with freely available analysis and processing solutions, specifically targeted at NGS data processing (I1, I3). The database is Reusable with management of data provenance and reproducibility of results, and we argue that reuse of our data and metadata is limited only by medical data privacy (R1, R1.1, R1.2). We also make the case that the infrastructure and design of the system are reusable for other similar efforts and give an outlook to some of the many possible extensions, uses and existing reuses of the system.

Sampling & NGS workflow

In the basic workflow from sampling to digital HIV-1 sequence described in Fig 1, a blood sample (plasma for viral RNA, PBMC for proviral DNA) is collected from PWH and then processed either for storage in the SHCS/ZPHI biobank or for direct analysis. The SHCND includes data from samples requested for various research projects from centers all over Switzerland, which are shipped for sequencing primarily to the SHCS Laboratory in Zurich and in some cases internationally, for example, to the Sanger Institute in Oxford [8]. NGS of near full-length HIV-1 sequences is performed either from plasma RNA or from proviral DNA. The preparation, amplification, and sequencing protocols are described elsewhere in detail for plasma RNA [42,52] and proviral DNA [53,54].

Fig 1. From sample to genome.

Description of the steps from blood sampling of people with HIV, Next Generation Sequencing and its data handling, to the bio-informatics tools for computational HIV genome analysis.

SHCND workflow

The steps in the workflow described in Fig 1 were significantly centralized and streamlined with the introduction of the SHCND – prior to its development, similar steps were carried out by each researcher individually with NGS raw reads files stored on a networked file share. The NGS records (NGS raw reads in the form of fastq files) are uploaded to the SHCND together with the respective metadata (e.g., sample date, primer information), excluding sensitive patient data. Each uploaded NGS record is assigned a 36-character alphanumeric randomly generated universally unique identifier (UUID version 4), such as “5bfc99f6-8432-4afc-be32-3f9d2dfa4871” (Fig 2A).

Fig 2. Illustration of the database storage.

(A) NGS storage and (B) Bioinformatic result storage implementation in SHCND. Basically, a graph of UUID labeled nodes is formed. The graph structure (edges) itself is also materialized in UUID-named files containing JSON (not shown). Furthermore, the exact processing tool used to generate each result container is included in the metadata of each result container.

The use of UUID version 4 identifiers [55] ensures globally unique, unequivocal, context-free, and immutable identification. In contrast to, e.g., sequential integer IDs or human-readable names, such identifiers can be auto-generated in a distributed fashion and are free from any privacy or ordering concerns. UUIDs are in widespread use in technical and non-technical domains for physical and digital artifact identification (e.g., [56–58]). Their special notation makes them particularly easy to recognize and parse as identifiers, and we use this extensively to allow users to copy-paste lists of such UUIDs that can be separated by any characters or by none at all.
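Why the notation is easy to parse can be shown with a short Python sketch. It is not the SHCND implementation; the function name `extract_uuids` and the pasted example text are assumptions made for illustration. The fixed 8-4-4-4-12 hexadecimal layout lets a regular expression recover identifiers even when nothing separates them.

```python
import re
import uuid

# Canonical 8-4-4-4-12 hexadecimal notation; the fixed length is what allows
# identifiers to be split even when no separator is present.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def extract_uuids(text):
    """Return all UUIDs found in free text, lower-cased, in order of appearance."""
    return [match.lower() for match in UUID_RE.findall(text)]

# Mixed separators, mixed case, and two identifiers pasted back to back.
pasted = (
    "5bfc99f6-8432-4afc-be32-3f9d2dfa4871;ED7ED2C3-EA61-4E33-86DA-49C3EEA347D8\n"
    "6b620cf3-b123-4ef8-a63b-d389c264c0ea479e8058-532a-49bc-8e9e-ca542f132561"
)
found = extract_uuids(pasted)

# A fresh random (version 4) identifier, generated without any coordination.
new_id = str(uuid.uuid4())
```

Here `found` contains the four identifiers in the order they were pasted, regardless of the separators used.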

All generated data files on the database receive their own UUID as well and are linked to the corresponding NGS record and processing tool through additional metadata files (Fig 2B). From the SHCND, any number of NGS records can be chosen by their UUID and submitted for parallel processing with a selected well-defined version of a bioinformatic tool. Once the user confirms the selection of NGS records, tool, and relevant parameter settings (appropriate for the NGS record, e.g., NGS platform and amplification method), the data files are transferred to an external high-performance-computing cluster, where they are processed. The results are transferred back to the SHCND where they are stored and linked with the sample by their UUIDs. For the retrieval of results, the user must specify again the UUIDs of the NGS records and the relevant tool to initiate a download. More sophisticated means of querying for results obtained with specific tool versions and parameters are possible, as the SHCND persistently stores the entire metadata of each parametrized run of a processing tool.

SHCND IT architecture

On the highest level, the SHCND consists of two types of subsystems: firstly, the singular database core, which is responsible for data storage, and secondly, a set of client nodes, which interact with the core to request and handle computations. Client nodes interact with the database in three operating modes: performing queries (read), uploading data (write), and performing computations (read-write). These client nodes are typically either computers of researchers or virtual machines running in a self-service private cloud operated by the University of Zurich scientific IT services (an OpenStack instance) [59]. Client nodes controlled by human operators can perform manual (bulk) interactions through the SHCND web UI. In particular they can request NGS processing computations and query and download data. For data uploads, data querying, and bulk downloads, a Hypertext Transfer Protocol Secure (HTTPS) application programming interface (API) is used in dedicated client applications or scripts. The database core hosts the web UI and API.

The automatically operated virtual machine clients handle any outstanding requested computations. Initially, outstanding computations were part of the database system state and required custom implementation of coordination strategies to ensure they would run non-redundantly in parallel. However, we have since migrated the assignment of computational jobs to worker nodes to a private instance of Jenkins [60], a widely used open source “automation server” tool. In brief, Jenkins stores a list of pending tasks and schedules them to run on a set of worker nodes, known as agents. It evenly distributes a specified maximum number of parallel jobs across the worker nodes. We chose Jenkins for its web UI, which provides real-time logs of running or past processing tools and facilitates easy observation of errors in individual jobs. We are also considering using a SLURM-managed compute cluster as an additional or alternative job submission system implementation [61].

Bioinformatic pipelines

The bioinformatics tools or pipelines (detailed below) are executed as Docker containers [62] on Jenkins agents (Fig 3). We maintain a git repository with a Dockerfile for each tool. Each tool is built once for a given version with a fixed build time and fixed dependencies (potentially obtained from external sources), resulting in an immutable Docker container image. This image is used for all processing jobs requiring the respective tool with the specified version. In a nutshell, Docker container images can be understood as virtual machine images or a complete (Linux) file system tree. Processes running inside a Docker container are isolated from the host machine except for selective access to the file system and network interface. Another tool for the same job that we consider using is Singularity [63]. Before tool execution, we copy the necessary input files, such as fastq files, from the SHCND into the container and rename them to adhere to the expected file names of the respective tool. After successful processing, we collect result files from a predetermined folder before deleting the container on the agent. The result files are uploaded to the SHCND and incorporated into the file system as a set of result files and new metadata files, which link the NGS record to the new processing result (Figs 2/3).
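The staging and collection steps around a containerized tool can be sketched as follows. This is a simplified illustration under stated assumptions, not the SHCND code: it uses a bind-mounted working directory instead of copying files into the container, it only builds the `docker run` command without executing it, and the image name `example/smaltalign:1.0.0`, the `/data` mount point, and all file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def stage_inputs(inputs, workdir):
    """Copy each UUID-named source file to the file name the tool expects."""
    in_dir = Path(workdir) / "input"
    in_dir.mkdir(parents=True, exist_ok=True)
    for expected_name, source in inputs.items():
        shutil.copyfile(source, in_dir / expected_name)
    return in_dir

def docker_command(image, workdir):
    """Build (but do not run) a docker invocation for one processing job."""
    return [
        "docker", "run", "--rm",                    # remove the container afterwards
        "-v", f"{Path(workdir).resolve()}:/data",   # assumed mount point
        image,                                      # immutable, version-tagged image
    ]

def collect_results(workdir):
    """List result files found in the predetermined output folder."""
    out_dir = Path(workdir) / "output"
    return sorted(p.name for p in out_dir.glob("*") if p.is_file())

with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "6b620cf3-b123-4ef8-a63b-d389c264c0ea"
    blob.write_text("@read1\nACGT\n+\nIIII\n")      # toy fastq content
    job_dir = Path(tmp) / "job"
    staged = stage_inputs({"sample_R1.fastq": blob}, job_dir)
    staged_names = sorted(p.name for p in staged.iterdir())
    command = docker_command("example/smaltalign:1.0.0", job_dir)
    # Stand-in for the tool writing its output; no container is started here.
    (job_dir / "output").mkdir()
    (job_dir / "output" / "consensus.fasta").write_text(">sample\nACGT\n")
    results = collect_results(job_dir)
```

Pinning the image to an explicit version tag is what ties every result file back to one exact build of the tool.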

Fig 3. Illustration of the bioinformatic pipeline processing.

From initial job submission to storage of processed results. In the case of “global” processing tools, multiple samples’ fastq and output files from previous processing tool runs can be retrieved as inputs to the Processing step (not shown).

While the git repositories containing the Dockerfiles and the Docker container images built from them are not directly stored as immutable data files in the SHCND (unlike the input and output data files of all computations), they are nevertheless an integral part of the reproducibility strategy of the system. Every version of every tool and data file is persistent, immutable and remains accessible. Any “modifications” to data and computation tools are represented as new versions in an append-only version history across the system.

In addition to running the tools directly on the raw NGS record files of an individual sample, they can also be applied to any combination of previous processing tool output files associated with that individual sample. Furthermore, both raw NGS files and output files from a whole set of samples can be processed together with what we call “global” processing tools (as opposed to the usual local or per-sample processing tools used on a single sample at a time). This capability allows for the use of tools like BLAST or the construction of phylogenetic trees on any subset of samples from the database, as discussed below.

File management

The database core stores and manages all data as immutable flat binary files with an assigned UUID. It also coordinates all client node interactions, serving both the web UI and HTTPS API. An initial priority was the ability to store a combination of large data files (unstructured from the SHCND point of view), semi-interpreted metadata, and evolvable structures of links between these data. This requirement made traditional SQL/relational table-based database structures unsuitable, as they would require significant structural adjustments during the evolution of the system. For reasons of reproducibility, all changes to the data had to be fully versioned and in principle undoable. Furthermore, the original design for job distribution and other features required that database status changes, e.g., pending job requests, be communicated to connected clients to inform their next processing decisions or to update their redundant/denormalized data replicas or projections of database state. We wanted a design where meta- and linkage data could be converted in real time into a structured, queryable format (a “projected” data replica) as soon as it became available in the SHCND. During prototyping, metadata was transformed and submitted to a (read-only) PostgreSQL database for querying needs as soon as it became available. Currently, we maintain a non-rigidly structured replica in the form of a neo4j graph database [64]. Importantly, such ephemeral projected read models or read replica representations of the data in the database can be evolved extremely rapidly and flexibly to fit any querying needs without the need to physically restructure the historic versioned master information preserved in the database core.

For these reasons, the central SHCND storage was conceptually designed as a single, append-only, fully sequential list or “log” of UUID-named immutable binary files, also known as binary large objects (blobs). This is sometimes called an event-driven, or more precisely an Event Sourcing, software architecture [65]. Any permanent information added to the SHCND is appended to this log/list of “events” as a blob. The only atomic operation required of the system is the assignment of successive sequence numbers to each new blob. The event of adding a blob is immediately communicated in order to all connected clients via a WebSocket connection [66]. In this way, the blob log takes on the additional role of a shared message bus or event queue.
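The append-only log can be pictured with a small in-memory Python sketch. It is an illustration only: the class name `BlobLog`, its methods, and the use of plain callbacks in place of WebSocket clients are assumptions, and a real server would additionally need durable storage and a lock around sequence number assignment.

```python
import uuid

class BlobLog:
    """In-memory stand-in for an append-only log of UUID-named immutable blobs."""

    def __init__(self, first_sequence=0):
        self._entries = []        # (sequence, uuid, content); entries are never changed
        self._next = first_sequence
        self._subscribers = []    # callbacks standing in for connected clients

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def append(self, content):
        """Store a new blob, assign the next sequence number, notify subscribers."""
        blob_id = str(uuid.uuid4())
        entry = (self._next, blob_id, bytes(content))
        self._next += 1           # the single operation that must be atomic
        self._entries.append(entry)
        for notify in self._subscribers:
            notify(entry)         # every client sees every blob, in order
        return blob_id

    def read(self, blob_id):
        for _, stored_id, content in self._entries:
            if stored_id == blob_id:
                return content
        raise KeyError(blob_id)

log = BlobLog(first_sequence=1000)
seen = []
log.subscribe(lambda entry: seen.append(entry[0]))

record_id = log.append(b"{}")                    # new, empty NGS record
fastq_id = log.append(b"@read1\nACGT\n+\nIIII\n")  # toy fastq content
```

Because the class offers no update or delete operation, every later state of the system has to be expressed as further appended blobs.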

A blob is added, for example, to represent the initial metadata state of a new NGS record as an empty JSON object. The server assigns the next sequence number and a UUID (e.g., “5bfc99f6-8432-4afc-be32-3f9d2dfa4871”) as the identifier for this new blob. Conceptually, we also use this same UUID as what we call a “base UUID” to refer to the entire version history of this NGS record, and in particular to refer to its latest revision at any point in time. To illustrate how revisions are handled, let us continue the example: an additional blob can be added to mark any sample changes, for example, metadata corrections or file attachments (the case of file attachment is illustrated in Fig 2A and Table 1). The new blob is assigned a new random UUID (e.g., “ed7ed2c3-ea61-4e33-86da-49c3eea347d8”) and contains an updated JSON representation of the NGS record file and its metadata. A third blob, whose content is marked with a special Update-UUID-Prefix-Constant, then communicates that “ed7ed2c3-ea61-4e33-86da-49c3eea347d8” is an updated version of “5bfc99f6-8432-4afc-be32-3f9d2dfa4871”. A dedicated client process monitors these update messages and maintains queriable derived databases of the “latest revision” and “version history” of all blobs. These derived databases thus are other examples of projected read models of the database. We note that most blobs containing data files are never updated in this fashion.

Table 1. Simplified illustration of how the NGS record with base UUID 5bfc99f6-8432-4afc-be32-3f9d2dfa4871 might have been created in the database core’s blob log as a series of UUID-named files whose content is interpreted to yield the SHCND subgraph shown in Fig 2A. The sequence number arbitrarily starts at 1000 and there are no other blobs created between the initial creation of the NGS record and the uploading and attachment of its fastq files.

| Sequence # | Blob UUID | Content | Meaning of content |
|---|---|---|---|
| 1000 | 5bfc99f6-8432-4afc-be32-3f9d2dfa4871 | {} | Empty JSON object. Base revision of an NGS record. |
| 1001 | 6b620cf3-b123-4ef8-a63b-d389c264c0ea | @M02081:266:000000000-BCBNY:1:1101:17805:1811 1:N:0:3\nGGTTCTATAAAACTCTGAGGGCCGAGCAAG…\n+\nBBA… | R1 fastq file content. |
| 1002 | 479e8058-532a-49bc-8e9e-ca542f132561 | @M02081:266:000000000-BCBNY:1:1101:17805:1811 2:N:0:3… | R2 fastq file content. |
| 1003 | ed7ed2c3-ea61-4e33-86da-49c3eea347d8 | {"R1": "6b620cf3-b123-4ef8-a63b-d389c264c0ea", "R2": "479e8058-532a-49bc-8e9e-ca542f132561"} | New revision content for the NGS record, now linking it to the two fastq files. |
| 1004 | b8ec2f65-e195-44c4-a28e-9b551056decd | <UPDATE-UUID-PREFIX-CONSTANT>,5bfc99f6-8432-4afc-be32-3f9d2dfa4871,ed7ed2c3-ea61-4e33-86da-49c3eea347d8 | Tells the system that “ed7ed2c3-ea61-4e33-86da-49c3eea347d8” is an updated version of “5bfc99f6-8432-4afc-be32-3f9d2dfa4871”. |
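
How a client process might derive the “latest revision” and “version history” read models from the rows of Table 1 can be sketched in Python. This is a hedged illustration: the actual value of the update prefix constant is not given in the text, so the placeholder string is used literally; the fastq contents are abbreviated stand-ins; the function names are hypothetical; and the sketch assumes that every update blob names the base UUID as its first identifier.

```python
import json

# Placeholder: the real constant is not stated in the text.
UPDATE_PREFIX = "<UPDATE-UUID-PREFIX-CONSTANT>"

# (sequence number, blob UUID, content) following Table 1; fastq content abbreviated.
BLOB_LOG = [
    (1000, "5bfc99f6-8432-4afc-be32-3f9d2dfa4871", "{}"),
    (1001, "6b620cf3-b123-4ef8-a63b-d389c264c0ea", "R1 fastq reads (abbreviated)"),
    (1002, "479e8058-532a-49bc-8e9e-ca542f132561", "R2 fastq reads (abbreviated)"),
    (1003, "ed7ed2c3-ea61-4e33-86da-49c3eea347d8",
     '{"R1": "6b620cf3-b123-4ef8-a63b-d389c264c0ea", '
     '"R2": "479e8058-532a-49bc-8e9e-ca542f132561"}'),
    (1004, "b8ec2f65-e195-44c4-a28e-9b551056decd",
     UPDATE_PREFIX + ",5bfc99f6-8432-4afc-be32-3f9d2dfa4871,"
     "ed7ed2c3-ea61-4e33-86da-49c3eea347d8"),
]

def project_revisions(blob_log):
    """Replay the log in sequence order and build a version history per base UUID."""
    history = {}
    for _, _, content in sorted(blob_log):
        if content.startswith(UPDATE_PREFIX + ","):
            _, base_id, new_id = content.split(",")
            history.setdefault(base_id, [base_id]).append(new_id)
    return history

def latest_revision(history, base_id):
    """A blob that was never updated is its own latest revision."""
    return history.get(base_id, [base_id])[-1]

history = project_revisions(BLOB_LOG)
base = "5bfc99f6-8432-4afc-be32-3f9d2dfa4871"
latest = latest_revision(history, base)

contents = {blob_id: content for _, blob_id, content in BLOB_LOG}
record = json.loads(contents[latest])  # current metadata of the NGS record
```

Since the projection is computed purely by replaying the log, it can be discarded and rebuilt in a different shape at any time without touching the stored blobs.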

Originally, the blob-append mechanism was also used to implement a remote-procedure call (RPC) mechanism to facilitate command/query-response protocols between clients. A first specially tagged blob would communicate a service request to another client process implementing the requested functionality. The responding client would observe the request, process it, then post a suitably tagged response to the blob log. However, this approach had a drawback: read-only operations, such as querying the latest blob revisions, unnecessarily extended the blob log with ephemeral, computable information. Therefore, read-only queries are now implemented directly as ephemeral HTTP API calls. Only persistent state changes (commands) are durably stored in the log. The special revision marker or “update” blob mentioned above can be seen as an example of such a durably stored command, though the client process handling it doesn’t post any response to the log in reaction to it.

The monolithic core of the system is written in isomorphic TypeScript (migrated from JavaScript), running in both Node.js (server-side) and web browser (client-side) JavaScript engines [67,68]. Other technology choices include Docker for containerization, Jenkins for job orchestration, and MongoDB for fast blob metadata querying [60,62,69]. All server-side system components run in Docker containers based on Ubuntu Linux, while the client-side web UI works in any modern browser and consists of basic HTML layouts using jQuery for selected interactivity [70]. As for deployment, apart from the web UI, which can instantly be invoked from any authorized web browser, dedicated scripts and client programs are downloaded manually to client computers where bulk downloads or uploads of data are to be performed. The database core and Jenkins agent virtual machines are set up manually. GitLab CI (continuous integration) scripts are used to deploy any enhancements to the database core virtual machine automatically on each git commit. There is no obstacle in principle to further automating the few manual steps, should the need arise.

SHCND scientific pipelines

Currently implemented in the production environment of the SHCND and predominantly used are the following bioinformatic pipelines:

  1. SmaltAlign

SmaltAlign performs consensus genome assembly using a reference sequence and an NGS record file (a fastq file). In brief, the pipeline performs an initial de novo assembly-assisted alignment against a chosen HIV-1 reference (e.g., HXB2, GenBank accession number K03455 [71]) followed by iterative alignments against the newly generated consensus. The alignment is output as a BAM file. The retrieved consensus sequences exhibit high quality, comparable to other commonly used assembly pipelines like shiver or V-pipe [72–74]. Detailed descriptions of the design and parameter settings can be found on https://github.com/medvir/SmaltAlign.

  2. MinVar

MinVar uses the NGS record file (a fastq file) to detect drug resistance mutations (DRMs). In brief, MinVar samples sequencing reads to determine the HIV-1 subtype, followed by generating a consensus sequence from which it ascertains mutations, and compares those to known DRMs maintained by the Stanford drug resistance database [75]. This pipeline is also used in diagnostic settings and was validated accordingly [30]. Detailed descriptions can be found on https://ozagordi.github.io/MinVar/.

  3. Hypermutation read filter

Hypermutation read filter uses the BAM file, i.e., mapped reads aligned to HIV-1 HXB2, generated by SmaltAlign to determine if reads are hypermutated according to Hypermut 2.0 [76]. The output is a fastq file with hypermutated reads filtered out, which can then be used as clean input for subsequent runs of SmaltAlign and MinVar instead of the unfiltered NGS raw reads file of the sample.

Results

Swiss HIV cohort study viral NGS database sequence collection

The SHCND contains 8,015 NGS records from 5,178 (24%) of the 21,876 PWH ever enrolled in the SHCS or ZPHI (Table 2). While the SHCND covers all major demographic groups, transmission modes, and HIV-1 subtypes, several demographic groups among the SHCS participants are currently overrepresented because NGS was performed in the framework of specific projects [8,43,53,54,77–80]. PWH with any sequence (proviral DNA or plasma RNA) in the SHCS are more often white (76.7% NGS vs 64.6% no NGS), leading to an underrepresentation of ethnic minorities (3.4% NGS vs 18.9% no NGS); more often have HIV-1 subtype B (64.4% NGS vs 47.1% no NGS), leading to an underrepresentation of rare subtypes (16.6% NGS vs 38.0% no NGS); and were enrolled into the SHCS more recently (median calendar year 2005 NGS vs 1997 no NGS). Additionally, they are more likely to have acquired HIV via homosexual contact (45.2% NGS vs 37.5% no NGS) and less likely through intravenous drug use (10.5% NGS vs 17.1% no NGS) (Table 2).

Table 2. Characteristics of people with HIV stratified by HIV NGS availability and cohort.

| Characteristic | Overall | SHCS¹, no NGS | SHCS¹, NGS | ZPHI², no NGS | ZPHI², NGS |
|---|---|---|---|---|---|
| N (people) | 21,876 | 16,606 | 4,759 | 92 | 419 |
| Female sex (%) | 5,891 (26.9) | 4,583 (27.6) | 1,280 (26.9) | 8 (8.7) | 20 (4.8) |
| Year of enrollment (median [IQR]) | 2000 [1992, 2010] | 1997 [1991, 2009] | 2005 [1998, 2010] | 2016 [2008, 2020] | 2009 [2006, 2014] |
| Year of birth (median [IQR]) | 1964 [1958, 1972] | 1963 [1957, 1971] | 1966 [1959, 1974] | 1978 [1967, 1986] | 1974 [1967, 1981] |
| Likely HIV transmission (%) | | | | | |
| MSM³ | 8,767 (40.1) | 6,224 (37.5) | 2,151 (45.2) | 66 (71.7) | 326 (77.8) |
| HET⁴ | 7,152 (32.7) | 5,404 (32.5) | 1,654 (34.8) | 19 (20.7) | 75 (17.9) |
| I.V. drug use | 3,341 (15.3) | 2,834 (17.1) | 500 (10.5) | 0 (0.0) | 7 (1.7) |
| I.V. drug use/sexual | 1,522 (7.0) | 1,288 (7.8) | 229 (4.8) | 2 (2.2) | 3 (0.7) |
| Perinatal transmission | 138 (0.6) | 100 (0.6) | 38 (0.8) | 0 (0.0) | 0 (0.0) |
| Other | 956 (4.4) | 756 (4.6) | 187 (3.9) | 5 (5.4) | 8 (1.9) |
| Ethnicity (%) | | | | | |
| White | 14,818 (67.7) | 10,732 (64.6) | 3,650 (76.7) | 78 (84.8) | 358 (85.4) |
| Black | 2,414 (11.0) | 1,773 (10.7) | 620 (13.0) | 3 (3.3) | 18 (4.3) |
| Hispano-American | 682 (3.1) | 507 (3.1) | 143 (3.0) | 1 (1.1) | 31 (7.4) |
| Asian | 643 (2.9) | 448 (2.7) | 185 (3.9) | 4 (4.3) | 6 (1.4) |
| Other | 3,319 (15.2) | 3,146 (18.9) | 161 (3.4) | 6 (6.5) | 6 (1.4) |
| HIV subtype (%) | | | | | |
| B | 11,210 (51.2) | 7,815 (47.1) | 3,067 (64.4) | 27 (29.3) | 301 (71.8) |
| 02_AG | 717 (3.3) | 526 (3.2) | 182 (3.8) | 0 (0.0) | 9 (2.1) |
| C | 693 (3.2) | 512 (3.1) | 169 (3.6) | 1 (1.1) | 11 (2.6) |
| 01_AE | 652 (3.0) | 417 (2.5) | 195 (4.1) | 9 (9.8) | 31 (7.4) |
| A | 548 (2.5) | 496 (3.0) | 43 (0.9) | 4 (4.3) | 5 (1.2) |
| A1 | 349 (1.6) | 113 (0.7) | 208 (4.4) | 1 (1.1) | 27 (6.4) |
| G | 266 (1.2) | 194 (1.2) | 62 (1.3) | 3 (3.3) | 7 (1.7) |
| F | 140 (0.6) | 125 (0.8) | 9 (0.2) | 4 (4.3) | 2 (0.5) |
| D | 138 (0.6) | 101 (0.6) | 33 (0.7) | 0 (0.0) | 4 (1.0) |
| Other | 7,163 (32.7) | 6,307 (38.0) | 791 (16.6) | 43 (46.7) | 22 (5.3) |
| Plasma NGS sequence (%) | 2,161 (9.9) | 0 (0.0) | 1,755 (36.9) | 0 (0.0) | 406 (96.9) |
| Proviral NGS sequence (%) | 3,184 (14.6) | 0 (0.0) | 3,064 (64.4) | 0 (0.0) | 120 (28.6) |

¹ Swiss HIV Cohort Study; ² Zurich Primary HIV Infection Cohort Study; ³ Men Who Have Sex with Men; ⁴ Heterosexual.

Gene sequence availability (Fig 4A) from plasma RNA samples ranges from 3,567 vif sequences up to 3,678 gag sequences. For sequences derived from proviral DNA, it ranges from 3,264 env sequences up to 3,806 nef sequences. 3,287/3,795 (87%) sequences from plasma RNA cover all nine genes, compared to 2,829/4,220 (67%) sequences from proviral DNA. The earliest sample was already collected in 1988, whereas NGS started only in 2013 (Fig 4B).
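The reported proportions can be rechecked from the counts given in the text. The short Python sketch below uses only numbers stated above; the variable and function names are illustrative.

```python
# Counts as stated in the text.
plasma_complete, plasma_total = 3287, 3795      # plasma RNA sequences covering all nine genes
proviral_complete, proviral_total = 2829, 4220  # proviral DNA sequences covering all nine genes

def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)

plasma_pct = percent(plasma_complete, plasma_total)        # 87
proviral_pct = percent(proviral_complete, proviral_total)  # 67
total_records = plasma_total + proviral_total              # 8015 NGS records
coverage_pct = percent(5178, 21876)                        # 24 (% of PWH with NGS)
```

The two sample sources add up to the 8,015 NGS records reported for the database as a whole.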

Fig 4. Sequence availability and timing.

(A) Sequences available per gene and for the whole genome in the Swiss HIV Cohort Study Viral NGS Database (SHCND), stratified by sample source. A viral gene sequence is counted if it covers at least 40% of the respective gene length in the HIV-1 HXB2 reference sequence (GenBank accession number K03455). (B) Cumulative sample counts by sampling year and by year when NGS was performed, stratified by HIV-1 plasma RNA and HIV-1 proviral DNA.

Several original research articles harnessing the SHCND have already been published in international peer-reviewed journals (Table 3). A precursor database was used to infer the impact of the viral genome on the size and heritability of the HIV-1 reservoir and the prevalence of HIV-1 proviral drug resistance [54,78,83]. Several projects then used the SHCND to study antiretroviral resistance. A comparison between Sanger sequencing and NGS for DRM detection showed comparable performance, but NGS detected more low-frequency DRMs [39]. In a randomized controlled trial on dolutegravir monotherapy in early treated PWH, the absence of proviral evolution and DRMs was confirmed [53,84]. The SHCND was also used to study HIV-1 superinfection [43] and to confirm the value of proviral diversity as a proxy for time since HIV infection [81]. A viral genome-wide association study (GWAS) and heritability analysis determined associations with neurocognitive complaints [82]. Multiple other projects are in progress, in particular attempts to infer the impact of the HIV genome on neurocognitive outcomes, immune responses, and low-level viremia, to better identify hypermutations, and to perform proviral drug resistance testing.

Table 3. Published original research which made use of the SHCND.

| First author | Year | Journal | Title | Conclusion | Sequences used | Sample origin |
|---|---|---|---|---|---|---|
| Chaudron et al. [43] | 2022 | The Journal of Infectious Diseases | A Systematic Molecular Epidemiology Screen Reveals Numerous Human Immunodeficiency Virus (HIV) Type 1 Superinfections in the Swiss HIV Cohort Study | Prevalence of superinfections: 1%–7% | 128 | Plasma RNA |
| Balakrishna et al. [39] | 2023 | Journal of Antimicrobial Chemotherapy | Frequency matters: comparison of drug resistance mutation detection by Sanger and next-generation sequencing in HIV-1 | High concordance of Sanger sequencing and NGS for detection of HIV drug resistance mutations | 594 | Plasma RNA |
| Jörimann et al. [53] | 2023 | The Journal of Infectious Diseases | Absence of Proviral Human Immunodeficiency Virus (HIV) Type 1 Evolution in Early-Treated Individuals With HIV Switching to Dolutegravir Monotherapy During 48 Weeks | No HIV proviral evolution under dolutegravir monotherapy | 210 | Proviral DNA |
| Zeeb et al. [81] | 2024 | The Journal of Infectious Diseases | Genetic diversity from proviral DNA as a proxy for time since HIV-1 infection | HIV diversity of proviral DNA is a proxy for time since infection | 247 | Proviral DNA |
| Zeeb et al. [82] | 2024 | Brain Communications | Self-reported neurocognitive complaints in the Swiss HIV Cohort Study: a viral genome-wide association study | Neurocognitive complaints in people with HIV are a trait heritable through the HIV genome | 2,334 | Plasma RNA and proviral DNA |

Discussion

Established in early 2018, the SHCND has streamlined and standardized viral genomic analyses in the Swiss HIV Cohort Study and the Zurich Primary HIV Infection Cohort Study. Over its lifetime, the SHCND has amassed 8,015 HIV-1 sequences from 5,178 PWH, and it continues to grow.

The SHCND actively incorporates new bioinformatic tools based on the requests and requirements of researchers. Currently, most sequences in the database were obtained by Illumina MiSeq, and the pipelines in use are designed for Illumina data. Should it become necessary to process sequences from other NGS technologies, the set-up of the database allows other pipelines to be implemented quickly. Moreover, tools are maintained and improved to increase computational efficiency and usability. Because tools that depend on the output of other tools must be executed in sequence, a current focus is the implementation of a piping functionality that allows the automatic execution of complex workflows. Recently, the SHCND has also gained the capability to run tools that analyze data from multiple NGS records in a single analysis, such as building phylogenetic trees, performing descriptive analyses on the whole sample archive of the SHCND, or running BLAST searches. This type of processing can easily be split into a parallel preparation step that runs on individual samples and a final combination step.
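
The requirement that dependent tools run in sequence amounts to ordering a dependency graph. The following Python sketch shows one generic way to derive such an order with the standard-library `graphlib` module (Python 3.9 or later). The tool names and the `runner` callback are hypothetical, and the sketch does not describe how the SHCND piping functionality is actually built.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical tools; each maps to the set of tools whose output it needs.
DEPENDENCIES: Dict[str, Set[str]] = {
    "quality_control": set(),
    "assembly": {"quality_control"},
    "resistance_calling": {"assembly"},
    "diversity": {"assembly"},
    "report": {"resistance_calling", "diversity"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every tool comes after its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_valid_workflow(dependencies: Dict[str, Set[str]]) -> bool:
    """True if the dependencies contain no cycle and can therefore be ordered."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


def run_workflow(
    dependencies: Dict[str, Set[str]], runner: Callable[[str], None]
) -> List[str]:
    """Call `runner` once per tool in a valid order and return that order."""
    order = execution_order(dependencies)
    for tool in order:
        runner(tool)
    return order
```

Tools with no mutual dependency, here `resistance_calling` and `diversity`, may appear in either order and could in principle run in parallel.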

We believe that the trend towards containerization of applications helps to avoid redundant efforts in software set-up, installation, and dependency management. We therefore encourage providers of data analysis tools to publish a containerized version, for example as a Docker or Singularity image, alongside their tool. A containerized tool runs in a highly reproducible manner and requires only a Linux kernel, regardless of the state and configuration of the host operating system and of the programs installed outside the container.
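
As an illustration of why a containerized tool is easy to call reproducibly, the Python sketch below assembles a `docker run` command line without executing it. The image name, the mount points `/data/in` and `/data/out`, and the tool arguments are hypothetical; only standard `docker run` options (`--rm`, `--network`, `--volume`) are used. It is a sketch under these assumptions, not the way the SHCND invokes its tools.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def docker_command(
    image: str,
    input_dir: PurePosixPath,
    output_dir: PurePosixPath,
    tool_args: List[str],
) -> List[str]:
    """Build an argument list for running a tool inside a Docker container.

    The input directory is mounted read-only and the container gets no network
    access, so the run depends only on the image and the mounted files.
    """
    for path in (input_dir, output_dir):
        # Docker treats a non-absolute host path as a named volume.
        if not path.is_absolute():
            raise ValueError(f"host path must be absolute: {path}")
    return [
        "docker", "run", "--rm",
        "--network", "none",
        "--volume", f"{input_dir}:/data/in:ro",
        "--volume", f"{output_dir}:/data/out",
        image,
        *tool_args,
    ]


if __name__ == "__main__":
    command = docker_command(
        "example.org/hypothetical-tool:1.0.0",  # hypothetical, version-pinned image
        PurePosixPath("/srv/ngs/sample-in"),
        PurePosixPath("/srv/ngs/sample-out"),
        ["--input", "/data/in", "--output", "/data/out"],
    )
    print(shlex.join(command))  # pass `command` to subprocess.run to execute it
```

Pinning the image to an explicit version tag, rather than a moving tag, is what ties a result to one specific software environment.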

We continuously evaluate the needs and costs of the system in order to steward resources for updates. For instance, hardware requirements such as data storage and computational resources can fluctuate. Moreover, the software components underlying the SHCND, i.e., data management solutions, programming languages and libraries, high-performance computing clusters, and workflow managers, have their own lifecycles and development that need to be handled. Crucially, such changes must keep the documentation history intact and are ideally made without users noticing. This flexibility and customizability of the system is crucial for adapting to technological advancements. Moreover, because the core architecture is very general and not specifically tailored to NGS data and bioinformatic workflows, it allows for future expansion to other data types. We therefore believe that our database solution would be beneficial in many other research settings where complex data is handled.

In the future, the SHCND may also incorporate genomes of other viruses or pathogens, for example, if coinfections such as hepatitis C virus (HCV) or Mpox become more frequent. There is also the potential to extend the SHCND with human genome SNP data or antibody sequence data. Our database already includes a few HCV genomes, and the flexible database design would support seamless expansion. Furthermore, a separate instance of the database is currently being set up for bacterial genomic data with correspondingly tailored analysis pipelines, with the goals of conducting real-time molecular epidemiological analyses to optimize infection control and of facilitating genomic comparisons to better understand the mechanisms of resistance spread and to support targeted interventions.

As human diseases are caused by complex interactions between host factors, environmental exposures, and microorganisms, research increasingly incorporates multiple data types, for example in multi-omics or One Health approaches, which require highly specialized computational tools. We are particularly interested in metagenomic approaches, which would allow us to study the whole human microbiome including all infectious pathogens. We are currently evaluating approaches in this direction, e.g., Kraken 2 [85]. Moreover, to investigate HIV transmission clusters, we plan to integrate Phyloscanner [44]. Finally, while the current database has been established as a tool for research projects, it could also be readily adapted for clinical, diagnostic, and surveillance purposes in the future.

In conclusion, the SHCND has considerably improved scientific work with HIV NGS records in Switzerland. It has reduced redundant computational processes and increased the reproducibility and accessibility of bioinformatic analyses. We hope that the requirements, design, set-up, and use of a viral NGS database outlined here serve as inspiration and support for others facing similar challenges.

Data Availability

Outputs generated by the SHCND can be accessed in the framework of a collaboration with the SHCS and/or ZPHI, as the individual-level datasets generated or analyzed during the current study do not fulfill the requirements for open data access: 1) The SHCS informed consent states that sharing data outside the SHCS network is only permitted for specific studies on HIV infection and its complications, and to researchers who have signed an agreement detailing the use of the data and biological samples; and 2) the data is too dense and comprehensive to preserve patient privacy in people with HIV. For collaborations, a proposal should be sent to the respective SHCS address (www.shcs.ch/contact). The provision of data will be considered by the Scientific Board of the SHCS and the study team. The full statement, from which the present statement is derived, can be found online: http://www.shcs.ch/294-open-data-statement-shcs.

Funding Statement

This work was funded within the framework of the Swiss HIV Cohort Study (SHCS), supported by the Swiss National Science Foundation (grant numbers 33CS30-201369 and 33FI-0 229621) (https://www.snf.ch/en) and by the Swiss HIV Cohort Study research foundation (https://shcsfoundation.ch/). The SHCS data are gathered by the five Swiss University Hospitals, two Cantonal Hospitals, 15 affiliated hospitals and 36 private physicians (listed in http://www.shcs.ch/180-health-care-providers). Furthermore, this work was supported by the Swiss National Science Foundation (grant number 179571 to H. F. G.); the Yvonne-Jacob Foundation (to H. F. G.) (https://stiftungen.stiftungschweiz.ch/organisation/stiftung-yvonne-jacob); the SHCS project 915 (to K. J. M.); and the University of Zurich Clinical Research Priority Program Viral Infectious Diseases, Zurich Primary HIV Infection Cohort Study (to H. F. G.) (https://www.viralinfectiousdiseases.uzh.ch/en.html). R. D. K. was supported by the Swiss National Science Foundation (grant numbers 324730_207957 and BSSGI0_155851). L. Z. was supported by the National Research, Development and Innovation Office in Hungary (RRF-2.3.1-21-2022-00006) (https://nkfih.gov.hu/about-the-office) as a part of the National Laboratory for Health Security. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Chabria S, Gupta S, Kozal M. Deep sequencing of HIV: clinical and research applications. Annual Review of Genomics and Human Genetics. 2014;15:295–325. https://pubmed.ncbi.nlm.nih.gov/24821496/
  • 2. Bonsall D, Golubchik T, de Cesare M, Limbada M, Kosloff B, MacIntyre-Cockett G, et al. A comprehensive genomics solution for HIV surveillance and clinical monitoring in low-income settings. Journal of Clinical Microbiology. 2020;58(10):382–402. https://pmc/articles/PMC7512176/
  • 3. McLaren PJ, Fellay J. HIV-1 and human genetic variation. Nat Rev Genet. 2021;22(10):645–57. doi: 10.1038/s41576-021-00378-0
  • 4. Gabrielaite M, Bennedbæk M, Zucco A, Ekenberg C, Murray D, Kan V. Human immunotypes impose selection on viral genotypes through viral epitope specificity. J Infect Dis. 2021;224(12):2053–63.
  • 5. Wong JK, Ignacio CC, Torriani F, Havlir D, Fitch NJ, Richman DD. In vivo compartmentalization of human immunodeficiency virus: evidence from the examination of pol sequences from autopsy tissues. J Virol. 1997;71(3):2059–71. doi: 10.1128/JVI.71.3.2059-2071.1997
  • 6. Kellam P, Boucher CA, Larder BA. Fifth mutation in human immunodeficiency virus type 1 reverse transcriptase contributes to the development of high-level resistance to zidovudine. Proc Natl Acad Sci U S A. 1992;89(5):1934–8. doi: 10.1073/pnas.89.5.1934
  • 7. Nájera I, Holguín A, Quiñones-Mateu ME, Muñoz-Fernández MA, Nájera R, López-Galíndez C, et al. Pol gene quasispecies of human immunodeficiency virus: mutations associated with drug resistance in virus from patients undergoing no drug therapy. J Virol. 1995;69(1):23–31. doi: 10.1128/JVI.69.1.23-31.1995
  • 8. Blanquart F, Wymant C, Cornelissen M, Gall A, Bakker M, Bezemer D, et al. Viral genetic variation accounts for a third of variability in HIV-1 set-point viral load in Europe. PLoS Biol. 2017;15(6):e2001855. doi: 10.1371/journal.pbio.2001855
  • 9. Bartha I, Carlson J, Brumme C, McLaren P, Brumme Z, John M, et al. A genome-to-genome analysis of associations between human genetic variation, HIV-1 sequence diversity, and viral control. Elife. 2013;2:e01123.
  • 10. Wymant C, Bezemer D, Blanquart F, Ferretti L, Gall A, Hall M. A highly virulent variant of HIV-1 circulating in the Netherlands. Science. 2022;375(6580):540–5.
  • 11. Kuiken C, Korber B, Shafer R. HIV sequence databases. AIDS Reviews. 2003;5(1):52.
  • 12. Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, et al. Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog. 2012;8(3):e1002529. doi: 10.1371/journal.ppat.1002529
  • 13. Swenson L, Mo T, Dong W, Zhong X, Woods C, Jensen M, et al. Deep sequencing to infer HIV-1 co-receptor usage: Application to three clinical trials of maraviroc in treatment-experienced patients. Journal of Infectious Diseases. 2011;203(2):237–45.
  • 14. Simmonds P, Zhang LQ, McOmish F, Balfe P, Ludlam CA, Brown AJ. Discontinuous sequence change of human immunodeficiency virus (HIV) type 1 env sequences in plasma viral and lymphocyte-associated proviral populations in vivo: implications for models of HIV pathogenesis. J Virol. 1991;65(11):6266–76. doi: 10.1128/JVI.65.11.6266-6276.1991
  • 15. Wu X, Zhou T, Zhu J, Zhang B, Georgiev I, Wang C, et al. Focused evolution of HIV-1 neutralizing antibodies revealed by structures and deep sequencing. Science. 2011;333(6049):1593–602. doi: 10.1126/science.1207532
  • 16. Fischer W, Ganusov VV, Giorgi EE, Hraber PT, Keele BF, Leitner T, et al. Transmission of single HIV-1 genomes and dynamics of early immune escape revealed by ultra-deep sequencing. PLoS One. 2010;5(8):e12303. doi: 10.1371/journal.pone.0012303
  • 17. Bbosa N, Kaleebu P, Ssemwanga D. HIV subtype diversity worldwide. Current Opinion in HIV and AIDS. 2019;14(3):153–60. https://journals.lww.com/co-hivandaids/fulltext/2019/05000/hiv_subtype_diversity_worldwide.3.aspx
  • 18. Zhu T, Korber BT, Nahmias AJ, Hooper E, Sharp PM, Ho DD. An African HIV-1 sequence from 1959 and implications for the origin of the epidemic. Nature. 1998;391(6667):594–7. doi: 10.1038/35400
  • 19. Faria NR, Rambaut A, Suchard MA, Baele G, Bedford T, Ward MJ. The early spread and epidemic ignition of HIV-1 in human populations. Science. 2014;346(6205):56–61. https://pubmed.ncbi.nlm.nih.gov/25278604/
  • 20. Wolinsky SM, Wike CM, Korber BT, Hutto C, Parks WP, Rosenblum LL, et al. Selective transmission of human immunodeficiency virus type-1 variants from mothers to infants. Science. 1992;255(5048):1134–7. doi: 10.1126/science.1546316
  • 21. Wolinsky SM, Korber BT, Neumann AU, Daniels M, Kunstman KJ, Whetsell AJ, et al. Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science. 1996;272(5261):537–42. doi: 10.1126/science.272.5261.537
  • 22. Wain-Hobson S, Sonigo P, Danos O, Cole S, Alizon M. Nucleotide sequence of the AIDS virus, LAV. Cell. 1985;40(1):9–17. https://pubmed.ncbi.nlm.nih.gov/2981635/
  • 23. Shankarappa R, Margolick J, Gange S, Rodrigo A, Upchurch D, Farzadegan H. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol. 1999;73(12):10489–502. https://pubmed.ncbi.nlm.nih.gov/10559367/
  • 24. Coffin JM. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science. 1995;267(5197):483–9. doi: 10.1126/science.7824947
  • 25. Bonhoeffer S, Chappey C, Parkin N, Whitcomb J, Petropoulos C. Evidence for positive epistasis in HIV-1. Science. 2004;306(5701):1547–50. https://pubmed.ncbi.nlm.nih.gov/15567861/
  • 26. Lerner A, Eisinger R, Fauci A. Comorbidities in persons with HIV: The lingering challenge. JAMA. 2020;323(1):19–20. https://jamanetwork.com/journals/jama/fullarticle/2757599
  • 27. Trickey A, Sabin CA, Burkholder G, Crane H, d’Arminio Monforte A, Egger M, et al. Life expectancy after 2015 of adults with HIV on long-term antiretroviral therapy in Europe and North America: a collaborative analysis of cohort studies. Lancet HIV. 2023;10(5):e295–307. doi: 10.1016/S2352-3018(23)00028-0
  • 28. UNAIDS. Global HIV & AIDS statistics — Fact sheet | UNAIDS [Internet]. 2023. [cited 2024 Apr 18]. Available from: https://www.unaids.org/en/resources/fact-sheet
  • 29. Nightingale S, Ances B, Cinque P, Dravid A, Dreyer AJ, Gisslén M, et al. Cognitive impairment in people living with HIV: consensus recommendations for a new approach. Nat Rev Neurol. 2023;19(7):424–33. doi: 10.1038/s41582-023-00813-2
  • 30. Huber M, Metzner K, Geissberger F, Shah C, Leemann C, Klimkait T. MinVar: A rapid and versatile tool for HIV-1 drug resistance genotyping by deep sequencing. J Virol Methods. 2017;240:7–13.
  • 31. Labarile M, Loosli T, Zeeb M, Kusejko K, Huber M, Hirsch H, et al. Quantifying and predicting ongoing human immunodeficiency virus type 1 transmission dynamics in Switzerland using a distance-based clustering approach. J Infect Dis. 2023;227(4):554–64. https://pubmed.ncbi.nlm.nih.gov/36433831/
  • 32. Kouyos RD, Rusert P, Kadelka C, Huber M, Marzel A, Ebner H, et al. Tracing HIV-1 strains that imprint broadly neutralizing antibody responses. Nature. 2018;561(7723):406–10. doi: 10.1038/s41586-018-0517-0
  • 33. INSIGHT START Study Group, Lundgren JD, Babiker AG, Gordin F, Emery S, Grund B, et al. Initiation of Antiretroviral Therapy in Early Asymptomatic HIV Infection. N Engl J Med. 2015;373(9):795–807. doi: 10.1056/NEJMoa1506816
  • 34. Sanger F, Coulson A, Barrell B, Smith A, Roe B. Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J Mol Biol. 1980;143(2):161–78. doi: 10.1016/S0022-2836(80)90087-0
  • 35. Estrada-Rivadeneyra D. Sanger sequencing. FEBS J. 2017;284(24):4174. doi: 10.1111/febs.14319
  • 36. Dewey FE, Pan S, Wheeler MT, Quake SR, Ashley EA. DNA sequencing: clinical applications of new DNA sequencing technologies. Circulation. 2012;125(7):931–44. doi: 10.1161/CIRCULATIONAHA.110.972828
  • 37. Larder BA, Darby G, Richman DD. HIV with reduced sensitivity to zidovudine (AZT) isolated during prolonged therapy. Science. 1989;243(4899):1731–4. doi: 10.1126/science.2467383
  • 38. Metzner KJ. Technologies for HIV-1 drug resistance testing: inventory and needs. Curr Opin HIV AIDS. 2022;17(4):222–8. doi: 10.1097/COH.0000000000000737
  • 39. Balakrishna S, Loosli T, Zaheri M, Frischknecht P, Huber M, Kusejko K. Frequency matters: comparison of drug resistance mutation detection by Sanger and next-generation sequencing in HIV-1. Journal of Antimicrobial Chemotherapy. 2023;78(3):656–64. https://pubmed.ncbi.nlm.nih.gov/36738248/
  • 40. Döring M, Büch J, Friedrich G, Pironti A, Kalaghatgi P, Knops E. geno2pheno[ngs-freq]: a genotypic interpretation system for identifying viral drug resistance using next-generation sequencing data. Nucleic Acids Research. 2018;46(W1):W271–7. https://pubmed.ncbi.nlm.nih.gov/29718426/
  • 41. Cozzi-Lepri A, Noguera-Julian M, Di Giallonardo F, Schuurman R, Däumer M, Aitken S, et al. Low-frequency drug-resistant HIV-1 and risk of virological failure to first-line NNRTI-based ART: a multicohort European case-control study using centralized ultrasensitive 454 pyrosequencing. J Antimicrob Chemother. 2015;70(3):930–40. doi: 10.1093/jac/dku426
  • 42. Carlisle LA, Turk T, Kusejko K, Metzner KJ, Leemann C, Schenkel CD, et al. Viral Diversity Based on Next-Generation Sequencing of HIV-1 Provides Precise Estimates of Infection Recency and Time Since Infection. J Infect Dis. 2019;220(2):254–65. doi: 10.1093/infdis/jiz094
  • 43. Chaudron SE, Leemann C, Kusejko K, Nguyen H, Tschumi N, Marzel A. A systematic molecular epidemiology screen reveals numerous human immunodeficiency virus (HIV) type 1 superinfections in the Swiss HIV cohort study. Journal of Infectious Diseases. 2022;226(7):1256–66. https://pubmed.ncbi.nlm.nih.gov/35485458/
  • 44. Wymant C, Hall M, Ratmann O, Bonsall D, Golubchik T, de Cesare M, et al. PHYLOSCANNER: Inferring transmission from within- and between-host pathogen genetic diversity. Molecular Biology and Evolution. 2018;35(3):719–33.
  • 45. Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high performance computing. Drug Discov Today. 2017;22(4):712–7. doi: 10.1016/j.drudis.2017.01.014
  • 46. Pereira R, Oliveira J, Sousa M. Bioinformatics and Computational Tools for Next-Generation Sequencing Analysis in Clinical Genetics. J Clin Med [Internet]. 2020. Jan 1 [cited 2024 Apr 29];9(1). Available from: https://pmc/articles/PMC7019349/
  • 47. van Nimwegen KJM, van Soest RA, Veltman JA, Nelen MR, van der Wilt GJ, Vissers LELM, et al. Is the $1000 Genome as Near as We Think? A Cost Analysis of Next-Generation Sequencing. Clin Chem. 2016;62(11):1458–64. doi: 10.1373/clinchem.2016.258632
  • 48. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.
  • 49. Scherrer AU, Traytel A, Braun DL, Calmy A, Battegay M, Cavassini M, et al. Cohort Profile Update: The Swiss HIV Cohort Study (SHCS). Int J Epidemiol. 2022;51(1):33–34j. doi: 10.1093/ije/dyab141
  • 50. Freind M, Tallón de Lara C, Kouyos R, Wimmersberger D, Kuster H, Aceto L. Cohort profile: The Zurich primary HIV infection study. Microorganisms. 2024;12(2).
  • 51. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18
  • 52. Giallonardo FD, Töpfer A, Rey M, Prabhakaran S, Duport Y, Leemann C, et al. Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations. Nucleic Acids Res. 2014;42(14):e115. doi: 10.1093/nar/gku537
  • 53. Jörimann L, Tschumi J, Zeeb M, Leemann C, Schenkel C, Neumann K. Absence of proviral human immunodeficiency virus (HIV) type 1 evolution in early-treated individuals with HIV switching to dolutegravir monotherapy during 48 weeks. J Infect Dis. 2023;228(7):907–18.
  • 54. Jaha B, Schenkel C, Jörimann L, Huber M, Zaheri M, Neumann K. Prevalence of HIV-1 drug resistance mutations in proviral DNA in the Swiss HIV Cohort Study, a retrospective study from 1995 to 2018. Journal of Antimicrobial Chemotherapy. 2023;78(9):2323–34. https://pubmed.ncbi.nlm.nih.gov/37545164/
  • 55. Davis KR, Peabody B, Leach P. Universally Unique IDentifiers (UUIDs) [Internet]. RFC Editor (Request for Comments); 2024. Available from: https://www.rfc-editor.org/info/rfc9562
  • 56. Triebel D, Reichert W, Bosert S, Feulner M, Okach D, Slimani A, et al. A generic workflow for effective sampling of environmental vouchers with UUID assignment and image processing. Database. 2018;2018.
  • 57. Briney B. AntiRef: reference clusters of human antibody sequences. Bioinform Adv. 2023;3(1):vbad109. doi: 10.1093/bioadv/vbad109
  • 58. Zarkogianni K, Dervakos E, Filandrianos G, Ganitidis T, Gkatzou V, Sakagianni A, et al. The smarty4covid dataset and knowledge base as a framework for interpretable physiological audio data analysis. Sci Data. 2023;10(1):770. doi: 10.1038/s41597-023-02646-6
  • 59. OpenInfra Foundation. Open Source Cloud Computing Infrastructure - OpenStack [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://www.openstack.org/
  • 60. Jenkins J. Jenkins [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://www.jenkins.io/
  • 61. Yoo A, Jette M, Grondona M. SLURM: Simple Linux Utility for Resource Management. Proceedings of the 2003 Conference on Resource Management. 2003. p. 44–60.
  • 62. Docker. Docker: Accelerated Container Application Development [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://www.docker.com/#build
  • 63. Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017;12(5):e0177459. doi: 10.1371/journal.pone.0177459
  • 64. Neo4j. Neo4j Graph Database & Analytics | Graph Database Management System [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://neo4j.com/
  • 65. Overeem M, Spoor M, Jansen S, Brinkkemper S. An empirical characterization of event sourced systems and their schema evolution — Lessons from industry. Journal of Systems and Software. 2021;178:110970. doi: 10.1016/j.jss.2021.110970
  • 66. Fette I, Melnikov A. The WebSocket Protocol. 2011 Dec. [cited 2024 Apr 29]; Available from: https://www.rfc-editor.org/info/rfc6455
  • 67. Microsoft. TypeScript: JavaScript With Syntax For Types. [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://www.typescriptlang.org/
  • 68. Ecma International. ECMA-262 - Ecma International [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://ecma-international.org/publications-and-standards/standards/ecma-262/
  • 69. Mongo DB. MongoDB: The Developer Data Platform | MongoDB [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://www.mongodb.com/
  • 70. OpenJS Foundation. jQuery [Internet]. 2024. [cited 2024 Apr 29]. Available from: https://jquery.com/
  • 71. GenBank accession number K03455. Human immunodeficiency virus type 1 (HXB2), complete genome; HIV1/HTLV-III/LAV reference genome [Internet]. [cited 2024 May 21]. Available from: https://www.ncbi.nlm.nih.gov/nucleotide/K03455
  • 72. Wymant C, Blanquart F, Golubchik T, Gall A, Bakker M, Bezemer D, et al. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver. Virus Evol. 2018;4(1):vey007. doi: 10.1093/ve/vey007
  • 73. Wagner DD, Marine RL, Ramos E, Ng TFF, Castro CJ, Okomo-Adhiambo M, et al. VPipe: an Automated Bioinformatics Platform for Assembly and Management of Viral Next-Generation Sequencing Data. Microbiol Spectr. 2022;10(2):e0256421. doi: 10.1128/spectrum.02564-21
  • 74. Zsichla L, Zeeb M, Fazekas D, Áy É, Müller D, Metzner K. Comparative evaluation of open-source bioinformatics pipelines for full-length viral genome assembly. Viruses. 2024;16(12):1824. https://pubmed.ncbi.nlm.nih.gov/39772134/
  • 75. Shafer R. Rationale and uses of a public HIV drug-resistance database. Journal of Infectious Diseases. 2006;194(Suppl 1). https://pubmed.ncbi.nlm.nih.gov/16921473/
  • 76. Rose PP, Korber BT. Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation. Bioinformatics. 2000;16(4):400–1. doi: 10.1093/bioinformatics/16.4.400
  • 77. Seifert D, Joos B, Braun DL, Oberle CS, Schenkel CD, Kuster H, et al. Detecting Selection in the HIV-1 Genome during Sexual Transmission Events. Viruses. 2022;14(2):406. doi: 10.3390/v14020406
  • 78. Wan C, Bachmann N, Mitov V, Blanquart F, Céspedes SP, Turk T, et al. Heritability of the HIV-1 reservoir size and decay under long-term suppressive ART. Nat Commun. 2020;11(1):5542. doi: 10.1038/s41467-020-19198-7
  • 79. Oberle CS, Joos B, Rusert P, Campbell NK, Beauparlant D, Kuster H, et al. Tracing HIV-1 transmission: envelope traits of HIV-1 transmitter and recipient pairs. Retrovirology. 2016;13(1):62. doi: 10.1186/s12977-016-0299-0
  • 80. Rindler A, Kusejko K, Kuster H, Neumann K, Leemann C, Zeeb M. The interplay between replication capacity of HIV-1 and surrogate markers of disease. J Infect Dis. 2022;226(6):1057–68.
  • 81. Zeeb M, Frischknecht P, Huber M, Schenkel C, Neumann K, Leemann C, et al. Genetic diversity from proviral DNA as a proxy for time since HIV-1 infection. J Infect Dis. 2024. https://pubmed.ncbi.nlm.nih.gov/38507572/
  • 82. Zeeb M, Pasin C, Cavassini M, Bieler-Aeschlimann M, Frischknecht P, Kusejko K, et al. Self-reported neurocognitive complaints in the Swiss HIV Cohort Study: a viral genome-wide association study. Brain Commun. 2024;6(4):fcae188. doi: 10.1093/braincomms/fcae188
  • 83. Bachmann N, von Siebenthal C, Vongrad V, Turk T, Neumann K, Beerenwinkel N. Determinants of HIV-1 reservoir size and long-term dynamics during suppressive ART. Nature Communications. 2019;10(1):1–11. https://www.nature.com/articles/s41467-019-10884-9
  • 84. West E, Zeeb M, Grube C, Kuster H, Wanner K, Scheier T, et al. Sustained Viral Suppression With Dolutegravir Monotherapy Over 192 Weeks in Patients Starting Combination Antiretroviral Therapy During Primary Human Immunodeficiency Virus Infection (EARLY-SIMPLIFIED): A Randomized, Controlled, Multi-site, Noninferiority Trial. Clin Infect Dis. 2023;77(7):1012–20. doi: 10.1093/cid/ciad366
  • 85. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):257. doi: 10.1186/s13059-019-1891-0
PLOS Digit Health. doi: 10.1371/journal.pdig.0000825.r002

Decision Letter 0

Miguel Ángel Armengol de la Hoz

3 Dec 2024

PDIG-D-24-00350

Addressing data management and analysis challenges in viral genomics: The Swiss HIV Cohort Study Viral Next Generation Sequencing database

PLOS Digital Health

Dear Dr. Zeeb,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days (Feb 01 2025 11:59PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to any formatting updates and technical items listed in the 'Journal Requirements' section below.
* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Miguel Ángel Armengol de la Hoz, Ph.D.
Section Editor
PLOS Digital Health

Leo Anthony Celi
Editor-in-Chief
PLOS Digital Health
orcid.org/0000-0001-6712-6626

Journal Requirements:

1. Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

**Please only choose the relevant sentences from below**

1. Please clarify all sources of funding (financial or material support) for your study. List the grants (with grant number) or organizations (with url) that supported your study, including funding received from your institution. 

2. State the initials, alongside each funding source, of each author to receive each grant.

3. State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

4. If any authors received a salary from any of your funders, please state which authors and which funders.

2. Please send a completed 'Competing Interests' statement, including any COIs declared by your co-authors. If you have no competing interests to declare, please state "The authors have declared that no competing interests exist". Otherwise please declare all competing interests beginning with the statement "I have read the journal's policy and the authors of this manuscript have the following competing interests:"

3. Please note that your Data Availability Statement is currently missing the repository name and the DOI/accession number of each dataset OR a direct link to access each database. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline. We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.

4. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex.

5. Please provide an Author Summary. This should appear in your manuscript between the Abstract (if applicable) and the Introduction, and should be 150–200 words long. The aim should be to make your findings accessible to a wide audience that includes both scientists and non-scientists. Sample summaries can be found on our website under Submission Guidelines: 

https://journals.plos.org/digitalhealth/s/submission-guidelines#loc-parts-of-a-submission

Additional Editor Comments (if provided):

Reviewers' Comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: I don't know

Reviewer #3: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript by Zeeb et al. offers an in-depth exploration of how HIV NGS data are processed and managed within the Swiss HIV Cohort Study. The authors have previously contributed several notable studies using NGS data, as highlighted in Table 2.

While certain sections provide an excess of detail unlikely to be of broad interest, other critical topics receive insufficient attention. Furthermore, the specific aims of the manuscript are not clearly articulated.

1. Data analysis and amplification methodology: The approach to data analysis and management depends significantly on the method by which the virus is amplified. Historically, sequencing of the HIV pol gene has relied on sets of well-conserved primers. However, achieving full-genome sequencing, especially of highly variable regions, presents additional challenges. The manuscript does not discuss how the analysis pipeline accounts for the amplification method used to generate whole-genome data. This omission raises doubts about the assertion in the abstract that "NGS can easily achieve near-whole-genome sequence coverage." In fact, several studies reporting full-genome sequencing have employed different strategies, underscoring the complexity involved.

2. NGS platform variability: There is no discussion of how the choice of NGS platform influences the analysis. Pipelines optimised for Illumina, PacBio, or ONT platforms can differ substantially, and this is not addressed in the manuscript.

3. Use of unique molecular identifiers (UMIs): While some research groups employ UMIs to reduce error and bias, it seems the described pipeline is not designed to accommodate sequences tagged with UMIs. This limitation warrants mention. Without UMIs, the claim that NGS "accurately encapsulates within-host diversity by characterising HIV subpopulations" is likely overstated.

4. Clinical versus research applications: It is unclear whether the pipeline described is employed for clinical drug resistance testing, and if so, how it differs from its use in research. Clarifying this distinction would enhance the manuscript.

5. Quality control challenges: Quality control is only briefly touched upon, with a passing reference to the Hypermut programme. This is particularly relevant for sequences derived from proviral DNA, where maintaining sequence integrity poses significant challenges.

In summary, while the manuscript offers valuable insights, it would benefit from a clearer focus on the practical relevance of the described methods, with additional attention to the amplification methods, NGS platforms, and quality control measures essential for both clinical and research contexts.

Reviewer #2: The study presents the innovative solution of the SHCND in a sequential, easy-to-follow fashion. I find the methods detailed and well structured. The results present a promising solution to address challenges in HIV genome data management, especially in standardizing the steps for handling the data, processing them, and retrieving them for research or follow-up purposes. The solution presented deserves publication to set an example for wider global implementation and further modification to fit the diversity of the features of HIV infection, and might even be applicable to other common infectious diseases worldwide.

In this context, I would suggest adding the prospects of widely sharing the achieved goals of the solution and the opportunities of collaboration to widely develop and implement the database solution.

One modification I would suggest concerns the presentation of the references in the introduction, methods, and results sections. The citations within the text are distracting, and I would suggest citing by placing the number of the cited reference in the text, especially when citing multiple references.

The study also mentioned that some sample categories were over-represented without giving precise details about which categories are over-represented and at the expense of which under-represented categories.

The references that presented the preparation, amplification, and sequencing protocols were cited. However, the matching and comparison of these protocols in this study and in the cited references were not detailed.

Finally, the language used could be made easier for the lay reader, which is something that I always advocate when reviewing any publication, because I believe it will benefit not only the reader but also the published study by giving it wider outreach.

Reviewer #3: Please see comments/suggestions

Introduction

1. The revised version should use PLOS Digital Health's citation style.

2. Paragraph 2 should be broken down into 2-3 paragraphs. The general public may have a difficult time digesting this.

3. Paragraph 2 shows some obstacles in setting up HIV-related databases but does not provide the specific databases on where these obstacles come from. Are there existing databases (whether HIV-related or not) that you can cite to justify why your database is superior?

4. I feel that part of the intro should include the extent by which the development and deployment of the NGS database adheres to FAIR Guidelines: https://pmc.ncbi.nlm.nih.gov/articles/PMC4792175/

Methods

5. Most of the arrowheads in Figure 2 are so small that it is difficult to determine their direction. Moreover, just use the full term for BioInf unless you plan to place a legends section.

6. Please provide a citation for the advantages of using UUIDv4

7. Figure 1 does not show this process: "All generated downstream data files receive their own UUID as well and are linked to the corresponding NGS record and processing tool through additional metadata files." I also feel that Figure 1 needs to be expanded to provide a complete description of the workflow involved in using the database.

8. The manuscript will benefit from the addition of additional figures that visualize the contents of each methods sub-section. For now, only the workflow has a visual (which in itself seems to be lacking too).

9. Is the data availability statement found online?

Discussion

10. The results do not reflect the statement in paragraph 1 of the discussion section. Can you provide evidence that its use makes the workflow efficient? While the technical details are presented in the methods section, using the term efficient is far-fetched since no before-and-after information was presented.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes:  Yasser Abdullah

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLOS Digit Health. doi: 10.1371/journal.pdig.0000825.r004

Decision Letter 1

Miguel Ángel Armengol de la Hoz

16 Mar 2025

Addressing data management and analysis challenges in viral genomics: The Swiss HIV Cohort Study Viral Next Generation Sequencing database

PDIG-D-24-00350R1

Dear Mr. Zeeb,

We are pleased to inform you that your manuscript 'Addressing data management and analysis challenges in viral genomics: The Swiss HIV Cohort Study Viral Next Generation Sequencing database' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Miguel Ángel Armengol de la Hoz, Ph.D.

Section Editor

PLOS Digital Health

***********************************************************

Additional Editor Comments (if provided):

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

Reviewer #3: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Thanks for addressing all the review suggestions

Reviewer #3: I am satisfied with the revision. Thank you.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes:  Yasser Abdullah

Reviewer #3: No

**********

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: point_by_point_reply_revision.docx

    pdig.0000825.s002.docx (51KB, docx)

    Data Availability Statement

    The outputs generated by the SHCND can be accessed in the framework of a collaboration with the SHCS and/or ZPHI, as the individual-level datasets generated or analyzed during the current study do not fulfill the requirements for open data access: 1) the SHCS informed consent states that sharing data outside the SHCS network is only permitted for specific studies on HIV infection and its complications, and to researchers who have signed an agreement detailing the use of the data and biological samples; and 2) the data are too dense and comprehensive to preserve patient privacy in people with HIV. For collaborations, a proposal should be sent to the respective SHCS address (www.shcs.ch/contact). The provision of data will be considered by the Scientific Board of the SHCS and the study team. The full statement, from which the present statement is derived, can be found online: http://www.shcs.ch/294-open-data-statement-shcs.


    Articles from PLOS Digital Health are provided here courtesy of PLOS
