PLOS ONE. 2019 Dec 11;14(12):e0225848. doi: 10.1371/journal.pone.0225848

Backward compatibility of whole genome sequencing data with MLVA typing using a new MLVAtype shiny application for Vibrio cholerae

Jérôme Ambroise 1,*, Léonid M Irenge 1, Jean-François Durant 1, Bertrand Bearzatto 1, Godfrey Bwire 2, O Colin Stine 3, Jean-Luc Gala 1
Editor: Axel Cloeckaert 4
PMCID: PMC6905556  PMID: 31825986

Abstract

Background

Multiple-Locus Variable Number of Tandem Repeats (VNTR) Analysis (MLVA) is widely used by laboratory-based surveillance networks for subtyping pathogens causing foodborne and water-borne disease outbreaks. However, Whole Genome Sequencing (WGS) has recently emerged as a new and more powerful reference method for pathogen subtyping, which makes it necessary to have a data conversion method that lets users compare MLVA profiles obtained by either approach. The MLVAtype shiny application was designed to extract MLVA profiles of Vibrio cholerae isolates from WGS data while ensuring backward compatibility with traditional MLVA typing methods.

Methods

To test and validate the MLVAType algorithm, WGS-derived MLVA profiles of nineteen Vibrio cholerae isolates from Democratic Republic of the Congo (n = 9) and Uganda (n = 10) were compared to MLVA profiles generated by an in silico PCR approach and Sanger sequencing, the latter being used as the reference method.

Results

Results obtained by Sanger sequencing and MLVAtype were fully concordant. However, the latter were affected by censored estimations, whose percentage decreased as the k-mer size used during genome assembly increased. With a k-mer of 127, fewer than 15% of the V. cholerae VNTR estimates were censored. Censored estimations could only be avoided entirely by using a longer k-mer size (i.e. 175), which is not available in the SPAdes v.3.13.0 software.

Conclusion

As NGS read lengths and quality tend to increase over time, larger k-mer sizes can be expected to become usable in the near future. Using the MLVAtype application with a longer k-mer size will then efficiently retrieve MLVA profiles from WGS data while avoiding censored estimation.

Introduction

Rapid molecular typing of pathogens associated with human and animal diseases has proven instrumental in the surveillance and control of infectious diseases [1, 2]. Pulsed-field gel electrophoresis (PFGE), which was long considered the gold standard for molecular typing of pathogens associated with outbreaks, has been superseded by Multi-Locus Sequence Typing (MLST) or Multi-Locus Variable Number of Tandem Repeats (VNTR) Analysis (MLVA), and more recently by Whole Genome Sequencing (WGS) [3].

However, unlike MLVA, WGS analysis requires specific expertise in bioinformatics and is not yet affordable in all developing countries, where the burden of highly pathogenic diseases would make it most useful. MLVA and WGS subtyping are therefore likely to coexist in the years to come, which calls for a methodology enabling end-users (i.e. researchers, clinicians, microbiologists and epidemiologists) to compare their respective results.

Accordingly, in silico methods which extract low-throughput typing results (e.g., MLST or MLVA) from WGS data should be developed to enable users to compare subtyping results irrespective of the methodology and time of data acquisition. Both parameters are important when WGS data need to be compared with data generated before the WGS era.

Whereas the number of tandem repeats at different VNTR loci may theoretically be retrieved from WGS data, as is currently done when extracting MLST from WGS, this was not considered feasible in practice for MLVA because genome assemblies derived from Next Generation Sequencing (NGS) short reads lack accuracy [4]. The limited backward compatibility of WGS with MLVA is indeed known to result from the failure to correctly assemble the repetitive regions assessed by MLVA [5]. However, it is worth noting that an in silico PCR approach to type MLVA from WGS data was recently developed and evaluated for Brucella [6] (https://github.com/dpchris/MLVA) and Salmonella species (https://github.com/Papos92/MISTReSS).

When analyzing NGS data, the first step generally consists of assembling reads into longer contiguous sequences (contigs), which can then be interrogated using BLAST or other search algorithms. Producing high-quality assemblies with a bacterial genome assembler such as SPAdes [7] requires quality filtering and the optimization of several parameters, including the k-mer size.

In the current paper, we describe a new tool (named MLVAtype) which enables users to extract MLVA profiles of Vibrio cholerae isolates from WGS data. We tuned the k-mer parameter used during genome assembly in order to assess its impact on the performance and limitations of MLVAtype. As a proof of concept, this new tool was applied on draft genomes of isolates associated with cholera outbreaks in two bordering countries, i.e. the Democratic Republic of the Congo (DRC) and Uganda. Results were compared to MLVA profiles generated by an in silico PCR approach and Sanger sequencing, the latter being used as the reference method.

Materials and methods

Sample description

Nine V. cholerae isolates were selected from a collection of isolates characterized in a recent study conducted between 2014 and 2017 in the DRC [8]. In addition, ten V. cholerae isolates collected between 2014 and 2016 in Uganda by G. Bwire and colleagues were selected based on their published data [9].

Sanger-derived versus GeneScan-derived MLVA typing

Sanger-derived MLVA typing was performed by sequencing amplicons on both strands on the ABI 3130 GA, using the BigDye Terminator v1.1 cycle sequencing kit (Applied Biosystems, USA). Motif repeats were counted manually and translated into MLVA profiles. For Ugandan isolates, the fluorescently labeled amplified products were separated using a 3730xl Automatic Sequencer with the size determined from internal lane standards (LIZ600) by the GeneScan program (Applied Biosystems, Foster City, CA).

When MLVA profiles were generated according to the method proposed by Kendall et al. [10], the formula had to be modified to better fit the sequence length of the motif and the position of the primers (Table 1). Of note, the original calculation formula was used for the VCA0283 locus, but with a modified reverse primer (AGCCTCCTCAGAAGTTGAG instead of the original reverse primer [GTACATTCACAATTTGCTCACC]). Using the original reverse primer would require adapting the formula.

Table 1. Formula used in the current study to compute the number of tandem repeats from the amplicon size.

*: Modifications introduced in the new formula.

Locus | Motif | Formula used by Kendall et al. | New formula
VC0147 | aacaga | (X-150)/6 | (X-150)/6
VC0437 | gacccta | (X-245)/6 | (X-252)/7*
VC1650 | ataatccag | (X-307)/9 | (X-308)/9*
VCA0171 | gctgtt | (X-270)/6 | (X-268)/6*
VCA0283 | ccagaa | (X-95)/6 | (X-95)/6
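
The new formulas in Table 1 can be applied programmatically. The following is a minimal sketch, assuming that X denotes the amplicon size in nucleotides; the function name and the example amplicon sizes are hypothetical and are not taken from the study data.

```python
# Locus -> (offset, motif length), copied from the "New formula" column of Table 1.
NEW_FORMULA = {
    "VC0147": (150, 6),
    "VC0437": (252, 7),
    "VC1650": (308, 9),
    "VCA0171": (268, 6),
    "VCA0283": (95, 6),
}


def repeats_from_amplicon(locus, amplicon_size):
    """Return (X - offset) / motif_length for the given locus.

    The result is returned as a float; rounding measured fragment sizes
    to the nearest whole repeat is left to the caller (an assumption,
    the text does not describe a rounding rule).
    """
    offset, motif_length = NEW_FORMULA[locus]
    return (amplicon_size - offset) / motif_length


# Hypothetical amplicon sizes, for illustration only.
print(repeats_from_amplicon("VC0147", 204))  # (204 - 150) / 6
print(repeats_from_amplicon("VC0437", 301))  # (301 - 252) / 7
```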

WGS-derived MLVA typing

WGS and short reads assembly

Whole genome assemblies of selected V. cholerae isolates, 9 from DRC and 10 from Uganda, were generated from paired-end 300 and 150 nt long reads, respectively. In brief, genomic DNA from DRC isolates was simultaneously fragmented and tagged with sequencing adapters in a single step using Nextera transposome (Nextera XT DNA Library Preparation Kit, Illumina, San Diego, CA, USA). Tagged DNA was then amplified with a 12-cycle polymerase chain reaction (PCR), cleaned up with AMPure beads, and subsequently sequenced on a MiSeq paired-end 2 x 300 nt run (MiSeq reagent kit V3, 600 cycles). Raw genomic data were submitted to the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena), and are available under study accession number ERP114722.

For Ugandan V. cholerae isolates, data were retrieved from the previous publication [9] and had been obtained as follows: libraries for Illumina sequencing were prepared from DNA fragmented with Covaris E210 (Covaris, Woburn, MA) using the KAPA High Throughput Library Preparation Kit (Millipore-Sigma, St. Louis MO). The libraries were enriched and barcoded in ten cycles of PCR amplification with primers containing an index sequence. Subsequently, the libraries were sequenced using a 150 nt paired-end run on an Illumina HiSeq2500 (Illumina, San Diego, CA). Raw genomic data were submitted to Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) under study accession number PRJNA439310.

WGS data were assembled into contigs using SPAdes v.3.13.0 [7] with k-mer sizes of 55, 77, 99, 127, and 175. Testing a k-mer value of 175, which is larger than the maximum value accepted by the SPAdes v.3.13.0 software, required editing the software source code. It should be kept in mind that this longer k-mer size can only be used with reads of sufficient length. Accordingly, a k-mer size of 175 was used only with the 300 nt reads from DRC isolates, and not with the 150 nt reads from Ugandan isolates. Alternatively, WGS data were also assembled using SPAdes without specifying the k-mer size. For each sample, k-mer size and locus, the number of tandem repeats was extracted from the assembled draft genome using the MLVAtype algorithm.
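
The exact SPAdes command lines are not given in the text. As a rough sketch, one assembly per k-mer size could be launched with the standard `spades.py` options `-1`, `-2`, `-k` and `-o`; the helper name, the file names and the output directory naming below are hypothetical, and only the k-mer sizes accepted by an unmodified SPAdes are listed by default.

```python
def spades_commands(reads_1, reads_2, out_prefix, kmer_sizes=(55, 77, 99, 127)):
    """Build one spades.py argument list per k-mer size (nothing is executed).

    A k-mer size of 175 is deliberately left out of the defaults: as stated
    in the text, it requires a modified SPAdes source code.
    """
    commands = []
    for k in kmer_sizes:
        if k % 2 == 0:
            raise ValueError("SPAdes expects odd k-mer sizes")
        commands.append([
            "spades.py",
            "-1", reads_1,                # forward reads
            "-2", reads_2,                # reverse reads
            "-k", str(k),                 # single k-mer size for this assembly
            "-o", f"{out_prefix}_k{k}",   # hypothetical output directory name
        ])
    return commands


# Hypothetical file names, for illustration only.
for cmd in spades_commands("isolate_R1.fastq.gz", "isolate_R2.fastq.gz", "isolate"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where SPAdes is installed.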

MLVAtype algorithm

The MLVAtype algorithm processes each locus separately. It requires several inputs, including a draft genome, the k-mer size (i.e. the k-mer parameter) used during genome assembly, and the nucleotide sequence of the motif. The algorithm returns the number of tandem repeats through the following steps. First, small contigs (< 1000 nt) are removed from the draft genome. Second, the vcountPattern function from the Biostrings R package is used to count the occurrences of a single (j = 1) motif within the draft genome. The same computation is then performed iteratively with an increasing number (j = 2, 3, ..., k) of tandem repeats, until the k tandem repeats occur only once (here k denotes a repeat count, not the k-mer size). Finally, the maximum value of j (i.e. k) is compared to the maximum number of tandem repeats (MNTR) that can be included in a k-mer of the specified size (Fig 1 and Table 2). If k ≥ MNTR, the estimation of the number of tandem repeats is set to MNTR and considered as right-censored (i.e. ≥MNTR). If k < MNTR, the estimation of the number of tandem repeats is set to k. The same process is applied to each locus in order to extract the complete MLVA profile from the genome assembly.
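
The steps above can be sketched in Python as follows. This is an illustrative re-implementation, not the authors' R code: the function names are hypothetical, searching both strands is an assumption, and the MNTR is computed as the integer part of the k-mer size divided by the motif length, which is inferred from Table 2.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, overlapping matches included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def estimate_repeats(contigs, motif, kmer_size, min_contig_len=1000):
    """Return (estimate, right_censored) for one locus."""
    motif = motif.upper()
    # Step 1: discard small contigs.
    kept = [c.upper() for c in contigs if len(c) >= min_contig_len]
    # Assumption inferred from Table 2: MNTR = floor(k-mer size / motif length).
    mntr = kmer_size // len(motif)
    j = 0
    while True:
        # Steps 2-3: count j + 1 tandem copies on both strands (assumption).
        pattern = motif * (j + 1)
        hits = sum(count_overlapping(c, p)
                   for c in kept for p in {pattern, revcomp(pattern)})
        if hits == 0:      # no longer array exists; keep the previous j
            break
        j += 1
        if hits == 1:      # the array of j repeats is unique: stop here
            break
    # Step 4: compare with MNTR and flag right-censored estimates.
    if j >= mntr:
        return mntr, True
    return j, False


# Toy draft genome: one long contig with 10 repeats, one short contig to be ignored.
flank = "ACGT" * 150
toy_contigs = [flank + "AACAGA" * 10 + flank, "AACAGA" * 30]
print(estimate_repeats(toy_contigs, "aacaga", kmer_size=127))
print(estimate_repeats(toy_contigs, "aacaga", kmer_size=55))
```

In the toy example, the second call is right-censored because ten repeats exceed what a 55-mer can hold for a 6 nt motif.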

Fig 1. Maximum number of tandem repeats (MNTR).

Example of the maximum number of a 6 nt motif included in a k-mer of 33 nt.

Table 2. Maximum number of tandem repeats in a k-mer.
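Locus | Motif length (nt) | k-mer = 33 | k-mer = 55 | k-mer = 77 | k-mer = 99 | k-mer = 127 | k-mer = 175
VC0147 | 6 | 5 | 9 | 12 | 16 | 21 | 29
VC0437 | 7 | 4 | 7 | 11 | 14 | 18 | 25
VC1650 | 9 | 3 | 6 | 8 | 11 | 14 | 19
VCA0171 | 6 | 5 | 9 | 12 | 16 | 21 | 29
VCA0283 | 6 | 5 | 9 | 12 | 16 | 21 | 29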
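
The values in Table 2 are consistent with the MNTR being the integer part of the k-mer size divided by the motif length. This rule is inferred from the table rather than stated explicitly, and the short sketch below (with hypothetical names) simply checks it against the tabulated values.

```python
# Motif lengths (nt) per locus, as listed in Table 2.
MOTIF_LENGTHS = {"VC0147": 6, "VC0437": 7, "VC1650": 9, "VCA0171": 6, "VCA0283": 6}
KMER_SIZES = (33, 55, 77, 99, 127, 175)


def mntr(kmer_size, motif_length):
    """Maximum number of complete motif copies that fit in one k-mer."""
    return kmer_size // motif_length


mntr_table = {
    locus: [mntr(k, length) for k in KMER_SIZES]
    for locus, length in MOTIF_LENGTHS.items()
}

for locus, row in mntr_table.items():
    print(locus, row)
```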

MLVAtype shiny application

The MLVAtype algorithm was implemented in an R shiny application which is freely available at https://ucl-irec-ctma.shinyapps.io/NGS-MLVA-TYPING/. This application enables the user to upload a list of draft genomes, the nucleotide sequences of the motifs, and the value of the k-mer which was used to build the assembly, including a k-mer size selectable after modification of the SPAdes v.3.13.0 source code (Fig 2). The application provides a table with the number of tandem repeats that was found for each locus in the corresponding genomes.

Fig 2. Screenshot of the MLVAtype shiny application.

Nine V. cholerae genomes from DRC were uploaded and processed with this application.

In silico PCR MLVA typing

In addition to the MLVAtype and Sanger sequencing methods, an in silico PCR method was assessed. The amplicon sizes were determined from the draft genomes using the vmatchPattern function of the Biostrings R package and subsequently used to derive the number of tandem repeats.
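
The in silico PCR idea can be illustrated with a small Python sketch, given here as an assumption-laden stand-in for the Biostrings-based implementation. The function names, the forward primer and the toy contigs are hypothetical; only the reverse primer is the modified VCA0283 primer quoted earlier in the text. An amplicon size is returned only when both primer sites lie in the same contig, otherwise the result is undetermined.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def amplicon_size(contigs, fwd_primer, rev_primer):
    """Return the amplicon size in nt, or None when undetermined.

    Exact primer matches only; each contig is scanned in both orientations.
    """
    fwd = fwd_primer.upper()
    rev_site = revcomp(rev_primer)  # reverse primer as seen on the forward strand
    for contig in contigs:
        for strand in (contig.upper(), revcomp(contig)):
            start = strand.find(fwd)
            if start == -1:
                continue
            end = strand.find(rev_site, start + len(fwd))
            if end != -1:
                return end + len(rev_site) - start
    return None  # primers absent or located in different contigs


FWD = "TTGACCGGATTACGCA"      # hypothetical forward primer
REV = "AGCCTCCTCAGAAGTTGAG"   # modified VCA0283 reverse primer from the text
MOTIF = "CCAGAA"
FLANK = "ACGT" * 10

# Toy case 1: the whole locus (5 repeats) sits in a single contig.
whole = [FLANK + FWD + MOTIF * 5 + revcomp(REV) + FLANK]
# Toy case 2: the assembly breaks inside the repeat array.
split = [FLANK + FWD + MOTIF * 3, MOTIF * 2 + revcomp(REV) + FLANK]

print(amplicon_size(whole, FWD, REV))
print(amplicon_size(split, FWD, REV))
```

The second toy case mirrors the situation described for small k-mer sizes, where the two primer sites end up in different contigs.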

Results

GeneScan-derived and Sanger-derived MLVA typing

GeneScan- and Sanger-derived MLVA profiles from DRC and Uganda are reported in Table 3. Considering the high quality of the Sanger sequences, the numbers of tandem repeats extracted from these sequences were considered the gold standard in the current study. Only one mismatch was observed between GeneScan-derived and Sanger-derived MLVA typing.

Table 3. MLVA profiles consisting of the number of tandem repeats at each locus (VC0147, VC0437, VC1650, VCA0171, and VCA0283) extracted from GeneScan data and from Sanger sequences.

The single mismatch (isolate UG054, locus VC0147) is marked with an asterisk.

Country | Isolate | Sanger-derived MLVA profile | GeneScan-derived MLVA profile
Uganda | UG010 | (9;3;7;21;26) | (9,3,7,21,26)
Uganda | UG020 | (9;3;7;21;27) | (9,3,7,21,27)
Uganda | UG026 | (9;3;7;21;28) | (9,3,7,21,28)
Uganda | UG040 | (10;7;7;9;17) | (10,7,7,9,17)
Uganda | UG042 | (8;7;7;10;21) | (8,7,7,10,21)
Uganda | UG046 | (8;7;7;11;21) | (8,7,7,11,21)
Uganda | UG054 | (8;7;7;10;21) | (5*,7,7,10,21)
Uganda | UG060 | (10;7;7;9;17) | (10,7,7,9,17)
Uganda | UG071 | (10;7;7;8;18) | (10,7,7,8,18)
Uganda | UG086 | (10;7;7;9;18) | (10,7,7,9,18)

WGS-derived MLVA typing

WGS-derived MLVA profiles obtained using the MLVAtype algorithm from draft genomes assembled using SPAdes and each value of the k-mer parameter are reported for DRC and Ugandan isolates (Table 4). All estimates of number of tandem repeats appeared to be perfectly concordant with Sanger-derived typing values. However, the k-mer parameter directly impacts the number of right-censored (i.e., ≥) estimations. When a k-mer of 127 was used to assemble reads from DRC and Ugandan isolates, only 5 and 9 right-censored estimations of tandem repeat numbers were produced, respectively. Interestingly, WGS-derived MLVA profiles extracted from 175-mers DRC assemblies were perfectly concordant with Sanger-derived typing results with no censored estimation (Fig 3).

Table 4. WGS-derived MLVA profiles extracted from DRC and Ugandan genomes assembled with various values of the k-mer parameter and compared to the gold-standard (i.e. Sanger-derived MLVA profile).

*: Maximum Number of Tandem Repeats (MNTR) in the corresponding k-mer. n/a: not applicable. Read lengths obtained with DRC and Ugandan isolates were 300 and 150 nt, respectively.

Country | Isolate | WGS-derived, k-mer = 55 (9,7,6,9,9)* | WGS-derived, k-mer = 77 (12,11,8,12,12)* | WGS-derived, k-mer = 99 (16,14,11,16,16)* | WGS-derived, k-mer = 127 (21,18,14,21,21)* | WGS-derived, k-mer = 175 (29,25,19,29,29)* | Sanger-derived MLVA profile
DRC | CTMA-1402 | (≥9;≥7;≥6;≥9;≥9) | (9;7;7;10;≥12) | (9;7;7;10;≥16) | (9;7;7;10;16) | (9;7;7;10;16) | (9,7,7,10,16)
DRC | CTMA-1421 | (≥9;≥7;≥6;≥9;≥9) | (9;7;7;11;≥12) | (9;7;7;11;≥16) | (9;7;7;11;17) | (9;7;7;11;17) | (9,7,7,11,17)
DRC | CTMA-1424 | (≥9;≥7;≥6;≥9;≥9) | (10;7;7;11;≥12) | (10;7;7;11;≥16) | (10;7;7;11;16) | (10;7;7;11;16) | (10,7,7,11,16)
DRC | CTMA-1426 | (≥9;≥7;≥6;≥9;≥9) | (10;7;6;≥12;≥12) | (10;7;6;≥16;≥16) | (10;7;6;≥21;≥21) | (10;7;6;24;21) | (10,7,6,24,21)
DRC | CTMA-1427 | (≥9;≥7;≥6;≥9;≥9) | (10;7;6;≥12;≥12) | (10;7;6;≥16;≥16) | (10;7;6;≥21;≥21) | (10;7;6;24;21) | (10,7,6,24,21)
DRC | CTMA-1432 | (≥9;≥7;≥6;≥9;≥9) | (10;7;6;≥12;≥12) | (10;7;6;≥16;≥16) | (10;7;6;16;20) | (10;7;6;16;20) | (10,7,6,16,20)
DRC | CTMA-1435 | (≥9;≥7;≥6;≥9;≥9) | (10;7;6;≥12;≥12) | (10;7;6;≥16;≥16) | (10;7;6;≥21;18) | (10;7;6;23;18) | (10,7,6,23,18)
DRC | CTMA-1461 | (≥9;≥7;≥6;≥9;≥9) | (11;7;7;≥12;≥12) | (11;7;7;13;≥16) | (11;7;7;13;16) | (11;7;7;13;16) | (11,7,7,13,16)
DRC | CTMA-1473 | (≥9;≥7;≥6;≥9;≥9) | (10;7;7;≥12;≥12) | (10;7;7;12;≥16) | (10;7;7;12;16) | (10;7;7;12;16) | (10,7,7,12,16)
Uganda | UG010 | (≥9;3;≥6;≥9;≥9) | (9;3;7;≥12;≥12) | (9;3;7;≥16;≥16) | (9;3;7;≥21;≥21) | n/a | (9;3;7;21;26)
Uganda | UG020 | (≥9;3;≥6;≥9;≥9) | (9;3;7;≥12;≥12) | (9;3;7;≥16;≥16) | (9;3;7;≥21;≥21) | n/a | (9;3;7;21;27)
Uganda | UG026 | (≥9;3;≥6;≥9;≥9) | (9;3;7;≥12;≥12) | (9;3;7;≥16;≥16) | (9;3;7;≥21;≥21) | n/a | (9;3;7;21;28)
Uganda | UG040 | (≥9;≥7;≥6;≥9;≥9) | (10;7;7;9;≥12) | (10;7;7;9;≥16) | (10;7;7;9;17) | n/a | (10;7;7;9;17)
Uganda | UG042 | (8;≥7;≥6;≥9;≥9) | (8;7;7;10;≥12) | (8;7;7;10;≥16) | (8;7;7;10;≥21) | n/a | (8;7;7;10;21)
Uganda | UG046 | (8;≥7;≥6;≥9;≥9) | (8;7;7;11;≥12) | (8;7;7;11;≥16) | (8;7;7;11;≥21) | n/a | (8;7;7;11;21)
Uganda | UG054 | (8;≥7;≥6;≥9;≥9) | (8;7;7;10;≥12) | (8;7;7;10;≥16) | (8;7;7;10;≥21) | n/a | (8;7;7;10;21)
Uganda | UG060 | (≥9;≥7;≥6;≥9;≥9) | (10;7;7;9;≥12) | (10;7;7;9;≥16) | (10;7;7;9;17) | n/a | (10;7;7;9;17)
Uganda | UG071 | (≥9;≥7;≥6;8;≥9) | (10;7;7;8;≥12) | (10;7;7;8;≥16) | (10;7;7;8;18) | n/a | (10;7;7;8;18)
Uganda | UG086 | (≥9;≥7;≥6;≥9;≥9) | (10;7;7;9;≥12) | (10;7;7;9;≥16) | (10;7;7;9;18) | n/a | (10;7;7;9;18)
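
The numbers of right-censored estimations quoted in the text for a k-mer of 127 can be recomputed from the corresponding column of Table 4. The sketch below is only a bookkeeping aid with hypothetical names; the profile strings are copied from the table.

```python
# k-mer = 127 column of Table 4.
DRC_K127 = [
    "(9;7;7;10;16)", "(9;7;7;11;17)", "(10;7;7;11;16)",
    "(10;7;6;≥21;≥21)", "(10;7;6;≥21;≥21)", "(10;7;6;16;20)",
    "(10;7;6;≥21;18)", "(11;7;7;13;16)", "(10;7;7;12;16)",
]
UGANDA_K127 = [
    "(9;3;7;≥21;≥21)", "(9;3;7;≥21;≥21)", "(9;3;7;≥21;≥21)",
    "(10;7;7;9;17)", "(8;7;7;10;≥21)", "(8;7;7;11;≥21)",
    "(8;7;7;10;≥21)", "(10;7;7;9;17)", "(10;7;7;8;18)", "(10;7;7;9;18)",
]


def censored_counts(profiles):
    """Return (number of right-censored estimates, total number of estimates)."""
    censored = total = 0
    for profile in profiles:
        values = profile.strip("()").split(";")
        total += len(values)
        censored += sum(value.startswith("≥") for value in values)
    return censored, total


drc = censored_counts(DRC_K127)
uganda = censored_counts(UGANDA_K127)
overall = (drc[0] + uganda[0]) / (drc[1] + uganda[1])
print(drc, uganda, f"{overall:.1%}")
```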

Fig 3. Percentage of correct, incorrect, and censored estimation of the number of tandem repeats.

Each percentage is produced by either GeneScan (GS) or WGS-based approaches; Sanger sequencing results were used as the reference values. *: SPAdes v.3.13.0. **: After modification of SPAdes v.3.13.0 source code.

Moreover, the last step of the MLVAtype algorithm (i.e., the comparison of the number of tandem repetitions [k] to the MNTR value) was essential. When this step was omitted, several discordances occurred between Sanger- and WGS-derived typing values, especially when using a small k-mer for read assembly (Table 5). When the k-mer size was not specified during genome assembly with SPAdes, resulting MLVA profiles were similar to those obtained with a k-mer of 127 for DRC isolates (300 nt long reads) and a k-mer of 77 for Ugandan isolates (150 nt long reads) (Table 5). When MLVA profiles were derived from draft genomes using the in silico PCR approach (Table 6), discordances were observed with Sanger-derived typing results. When using a small k-mer size during genome assembly, forward and reverse primer sequences were often located in different contigs, resulting in an undetermined (U) number of tandem repeats.

Table 5. WGS-derived MLVA profiles extracted with MLVAtype without taking into account the MNTR value.

Mismatches between WGS- and Sanger-derived values are indicated in bold. n/a: not applicable. Read lengths obtained with DRC and Ugandan isolates were 300 and 150 nt, respectively.

Country | Isolate | WGS-derived, k-mer = 55 | WGS-derived, k-mer = 77 | WGS-derived, k-mer = 99 | WGS-derived, k-mer = 127 | WGS-derived, k-mer = 175 | WGS-derived, k-mer unspecified | Sanger-derived MLVA profile
DRC | CTMA-1402 | (9;7;6;9;15) | (9;7;7;10;13) | (9;7;7;10;16) | (9;7;7;10;16) | (9;7;7;10;16) | (9;7;7;10;16) | (9,7,7,10,16)
DRC | CTMA-1421 | (9;7;6;9;10) | (9;7;7;11;14) | (9;7;7;11;17) | (9;7;7;11;17) | (9;7;7;11;17) | (9;7;7;11;17) | (9,7,7,11,17)
DRC | CTMA-1424 | (9;7;6;9;9) | (10;7;7;11;13) | (10;7;7;11;16) | (10;7;7;11;16) | (10;7;7;11;16) | (10;7;7;11;16) | (10,7,7,11,16)
DRC | CTMA-1426 | (9;7;6;10;13) | (10;7;6;13;13) | (10;7;6;17;17) | (10;7;6;21;21) | (10;7;6;24;21) | (10;7;6;21;21) | (10,7,6,24,21)
DRC | CTMA-1427 | (9;7;6;10;13) | (10;7;6;13;14) | (10;7;6;17;17) | (10;7;6;21;21) | (10;7;6;24;21) | (10;7;6;21;21) | (10,7,6,24,21)
DRC | CTMA-1432 | (9;7;6;9;9) | (10;7;6;13;13) | (10;7;6;16;17) | (10;7;6;16;20) | (10;7;6;16;20) | (10;7;6;16;20) | (10,7,6,16,20)
DRC | CTMA-1435 | (9;7;6;10;10) | (10;7;6;13;13) | (10;7;6;16;17) | (10;7;6;21;18) | (10;7;6;23;18) | (10;7;6;21;18) | (10,7,6,23,18)
DRC | CTMA-1461 | (9;7;6;10;10) | (11;7;7;12;14) | (11;7;7;13;16) | (11;7;7;13;16) | (11;7;7;13;16) | (11;7;7;13;16) | (11,7,7,13,16)
DRC | CTMA-1473 | (9;7;6;10;15) | (10;7;7;12;13) | (10;7;7;12;16) | (10;7;7;12;16) | (10;7;7;12;16) | (10;7;7;12;16) | (10,7,7,12,16)
Uganda | UG010 | (9;3;6;10;9) | (9;3;7;13;13) | (9;3;7;16;16) | (9;3;7;21;21) | n/a | (9;3;7;13;13) | (9;3;7;21;26)
Uganda | UG020 | (9;3;6;10;9) | (9;3;7;13;13) | (9;3;7;16;16) | (9;3;7;21;21) | n/a | (9;3;7;13;13) | (9;3;7;21;27)
Uganda | UG026 | (9;3;6;10;9) | (9;3;7;13;13) | (9;3;7;16;16) | (9;3;7;21;21) | n/a | (9;3;7;13;13) | (9;3;7;21;28)
Uganda | UG040 | (9;7;6;9;9) | (10;7;7;9;13) | (10;7;7;9;16) | (10;7;7;9;17) | n/a | (10;7;7;9;13) | (10;7;7;9;17)
Uganda | UG042 | (8;7;6;9;9) | (8;7;7;10;13) | (8;7;7;10;16) | (8;7;7;10;21) | n/a | (8;7;7;10;13) | (8;7;7;10;21)
Uganda | UG046 | (8;7;6;9;9) | (8;7;7;11;13) | (8;7;7;11;16) | (8;7;7;11;21) | n/a | (8;7;7;11;13) | (8;7;7;11;21)
Uganda | UG054 | (8;7;6;9;9) | (8;7;7;10;13) | (8;7;7;10;16) | (8;7;7;10;21) | n/a | (8;7;7;10;13) | (8;7;7;10;21)
Uganda | UG060 | (9;7;6;9;9) | (10;7;7;9;13) | (10;7;7;9;16) | (10;7;7;9;17) | n/a | (10;7;7;9;13) | (10;7;7;9;17)
Uganda | UG071 | (9;7;6;8;20) | (10;7;7;8;13) | (10;7;7;8;16) | (10;7;7;8;18) | n/a | (10;7;7;8;13) | (10;7;7;8;18)
Uganda | UG086 | (9;7;6;9;20) | (10;7;7;9;13) | (10;7;7;9;16) | (10;7;7;9;18) | n/a | (10;7;7;9;13) | (10;7;7;9;18)

Table 6. WGS-derived MLVA profiles extracted using an in silico PCR approach.

Mismatches between WGS- and Sanger-derived values are indicated in bold. U: undetermined. n/a: not applicable. Read lengths obtained with DRC and Ugandan isolates were 300 and 150 nt, respectively.

Country | Isolate | WGS-derived, k-mer = 55 | WGS-derived, k-mer = 77 | WGS-derived, k-mer = 99 | WGS-derived, k-mer = 127 | WGS-derived, k-mer = 175 | WGS-derived, k-mer unspecified | Sanger-derived MLVA profile
DRC | CTMA-1402 | (9;7;6;9;U) | (9;7;7;10;U) | (9;7;7;10;16) | (9;7;7;10;16) | (9;7;7;10;16) | (9;7;7;10;16) | (9,7,7,10,16)
DRC | CTMA-1421 | (9;7;6;9;U) | (9;7;7;11;U) | (9;7;7;11;17) | (9;7;7;11;17) | (9;7;7;11;17) | (9;7;7;11;17) | (9,7,7,11,17)
DRC | CTMA-1424 | (9;7;6;9;U) | (10;7;7;11;13) | (10;7;7;11;16) | (10;7;7;11;16) | (10;7;7;11;16) | (10;7;7;11;16) | (10,7,7,11,16)
DRC | CTMA-1426 | (9;7;6;U;U) | (10;7;6;U;U) | (10;7;6;U;17) | (10;7;6;21;21) | (10;7;6;24;21) | (10;7;6;21;21) | (10,7,6,24,21)
DRC | CTMA-1427 | (9;7;6;U;U) | (10;7;6;U;U) | (10;7;6;U;17) | (10;7;6;21;21) | (10;7;6;24;21) | (10;7;6;21;21) | (10,7,6,24,21)
DRC | CTMA-1432 | (9;7;6;U;U) | (10;7;6;U;13) | (10;7;6;16;17) | (10;7;6;16;20) | (10;7;6;16;20) | (10;7;6;16;20) | (10,7,6,16,20)
DRC | CTMA-1435 | (9;7;6;U;U) | (10;7;6;U;13) | (10;7;6;U;17) | (10;7;6;21;18) | (10;7;6;23;18) | (10;7;6;21;18) | (10,7,6,23,18)
DRC | CTMA-1461 | (9;7;6;U;U) | (11;7;7;12;U) | (11;7;7;13;16) | (11;7;7;13;16) | (11;7;7;13;16) | (11;7;7;13;16) | (11,7,7,13,16)
DRC | CTMA-1473 | (9;7;6;U;U) | (10;7;7;12;U) | (10;7;7;12;16) | (10;7;7;12;16) | (10;7;7;12;16) | (10;7;7;12;16) | (10,7,7,12,16)
Uganda | UG010 | (9;3;6;U;U) | (9;3;7;U;U) | (9;3;7;U;U) | (9;3;7;21;U) | n/a | (9;3;7;13;13) | (9;3;7;21;26)
Uganda | UG020 | (9;3;6;U;U) | (9;3;7;U;U) | (9;3;7;16;U) | (9;3;7;21;U) | n/a | (9;3;7;13;13) | (9;3;7;21;27)
Uganda | UG026 | (9;3;6;U;U) | (9;3;7;U;U) | (9;3;7;16;U) | (9;3;7;21;U) | n/a | (9;3;7;13;13) | (9;3;7;21;28)
Uganda | UG040 | (9;7;6;9;U) | (10;7;7;9;13) | (10;7;7;9;17) | (10;7;7;9;17) | n/a | (10;7;7;9;13) | (10;7;7;9;17)
Uganda | UG042 | (8;7;6;9;U) | (8;7;7;10;U) | (8;7;7;10;17) | (8;7;7;10;21) | n/a | (8;7;7;10;13) | (8;7;7;10;21)
Uganda | UG046 | (8;7;6;9;U) | (8;7;7;11;U) | (8;7;7;11;17) | (8;7;7;11;21) | n/a | (8;7;7;11;13) | (8;7;7;11;21)
Uganda | UG054 | (8;7;6;9;U) | (8;7;7;10;U) | (8;7;7;10;17) | (8;7;7;10;21) | n/a | (8;7;7;10;13) | (8;7;7;10;21)
Uganda | UG060 | (9;7;6;9;U) | (10;7;7;9;U) | (10;7;7;9;17) | (10;7;7;9;17) | n/a | (10;7;7;9;13) | (10;7;7;9;17)
Uganda | UG071 | (9;7;6;8;U) | (10;7;7;8;13) | (10;7;7;8;17) | (10;7;7;8;18) | n/a | (10;7;7;8;13) | (10;7;7;8;18)
Uganda | UG086 | (9;7;6;9;U) | (10;7;7;9;13) | (10;7;7;9;17) | (10;7;7;9;18) | n/a | (10;7;7;9;13) | (10;7;7;9;18)

It is worth noting that a longer k-mer size also improved the quality of genome assemblies as illustrated by QUAST [11] with lower numbers of contigs and larger N50 values (Fig 4).

Fig 4. Quality metrics of genome assemblies.

Number of contigs and N50 metrics, as reported by QUAST for genome assemblies from DRC and Ugandan (UGD) isolates and various selectable default and modified k-mer sizes. *: SPAdes v.3.13.0. **: After modification of the SPAdes v.3.13.0 source code.
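
For readers unfamiliar with the N50 metric, the following sketch shows the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. It is a plain illustration with a hypothetical function name and made-up contig lengths, not a reproduction of QUAST, which may apply additional filters.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("at least one contig is required")


# Made-up contig lengths: total 290, half 145, reached with the second contig.
print(n50([80, 70, 50, 40, 30, 20]))
```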

Comparison of analytical reagent costs

Reagent costs for MLVA typing (5 loci) of V. cholerae isolates were compared using three different methods (Table 7). The lowest cost was associated with the GeneScan-based method. It is worth noting that Sanger sequencing, whose cost is provided in Table 7, was used here only as a reference method and is not intended for routine MLVA typing.

Table 7. Cost of MLVA targeting five loci.

*: Optimal cost if the five loci are amplified in a single PCR (theoretical cost analysis). **: Reported cost of a typing method based on a duplex and a triplex PCR followed by two fragment analysis runs. ***: Cost of commercial sequencing per sample in 2019. ****: Reagent and consumable cost of Illumina sequencing per sample [12].

Cost (USD) | Sanger-based (5 loci) | GeneScan-based (5 loci), optimal cost* | GeneScan-based (5 loci), published protocol** | WGS-based
PCR | 40 (5 x 8) | 8 | 16 (2 x 8) | -
Analytical cost | 35 (5 x 7) | 3.4 | 6.8 (2 x 3.4) | 99–150
Total | 75 | 11.4 | 22.8 | 99***–150****

Discussion

The rationale behind this study is the dramatic increase in the rate and amount of sequencing and the continuous decrease in sequencing costs. The efficiency of microbial identification and subtyping allowed by WGS characterization now makes it a reference method with a clear potential to replace traditional typing methods. However, this type of analysis introduces new challenges, i.e. data storage, computing power, bioinformatics expertise and the return of results in real time, all potentially associated with high costs. WGS analysis is not yet affordable in all institutions or networks in charge of pathogen surveillance.

Accordingly, we present here a new application enabling users to extract V. cholerae MLVA profiles from WGS data while preserving their backward compatibility with conventional MLVA typing. To validate this application, MLVA profiles generated by MLVAtype were compared to Sanger-derived MLVA profiles on nineteen V. cholerae isolates (9 from DRC and 10 from Uganda), and the respective costs were assessed. In addition, Sanger-derived MLVA profiles of the 10 isolates from Uganda were compared to GeneScan-derived MLVA profiles.

Only one mismatch was found between the GeneScan-derived profiles and the Sanger reference method; it could not be explained, but such discrepancies do not seem unusual [13]. It should be noted, however, that either the primer or the formula proposed by Kendall et al. [10] to determine V. cholerae VNTR repeat numbers was modified in order to decrease the number of mismatches (Table 1), especially those observed with VC0437 and VCA0283.

Interestingly, there was no mismatch between WGS- and Sanger-derived MLVA profiles. However, WGS-derived profiles were affected by censored estimations, whose proportion varied with the k-mer size used during genome assembly with SPAdes: the larger the k-mer size, the better the accuracy of WGS-derived MLVA profiles (Fig 3). It is therefore recommended to use the largest possible k-mer size to assemble the reads into contigs before determining the number of tandem repeats, bearing in mind that this maximum is bounded by the WGS read length, as illustrated here with WGS data from V. cholerae. A k-mer size of 175 could not be applied to reads shorter than 175 nt, as for the Ugandan isolates, but it produced no censored data with the 300 nt reads from DRC isolates.

As WGS read lengths and quality increase, k-mer sizes can be expected to increase as well, which will facilitate in silico MLVA typing with the MLVAtype application whenever comparison with an MLVA database is needed. As previously reported, the length of the repeat array should not exceed 174 nucleotides for V. cholerae, corresponding to 29 repetitions of a 6 nt motif [14]. Accordingly, the longest k-mer size (i.e. 175) generated correct MLVA profiles with no censored data.

Importantly, the MLVAtype algorithm was developed to extract MLVA profiles of V. cholerae isolates, which are characterized by (i) a perfect repeat array, (ii) locus-specific VNTRs (i.e. the repeat unit does not occur at other loci), and (iii) the absence of indels in the flanking region. Whereas the in silico PCR approach was found inefficient for MLVA typing of V. cholerae (Table 6), MLVAtype could not be used for species such as Brucella or Salmonella. This makes MLVAtype a complementary tool to the existing in silico PCR approach when MLVA profiles need to be retrieved from WGS data.

Conclusion

In conclusion, the MLVAtype shiny application reliably extracted MLVA profiles of V. cholerae isolates from WGS data. Considering the wide in silico exploitation of WGS data, our next aim will be to combine the extracted information on both VNTRs and Single Nucleotide Variants (SNVs) and to calculate a single genetic relatedness index. This should further extend our understanding of the genetic relatedness of V. cholerae isolates while giving us better insight into how VNTRs evolve over time.

Acknowledgments

The authors are grateful to the following institutions for the exchange of information and expertise provided during the course of this work: Uganda Ministry of Health, DR Congo Ministry of Health, Makerere University, Johns Hopkins Bloomberg School of Public Health, Maryland University of Medicine, and Defense Laboratory Department (DLD), Belgium. The authors would especially like to thank the following persons for their wonderful contribution and guidance: Professor David A. Sack, Professor Christopher G. Orach, Mr. Atek Kagirita, Mr. Mathieu Almeida, Dr. Amanda K. Debes, M/s Shan Li, Mr. JB Voeglein, and Dr Prudence Mitangala.

Data Availability

All NGS data are available from the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) under study accession number ERP114722, and from the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) under study accession number PRJNA439310.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Pérez-Losada M, Cabezas P, Castro-Nallar E, Crandall KA. Pathogen typing in the genomics era: MLST and the future of molecular epidemiology. Infection, Genetics and Evolution. 2013;16:38–53. doi: 10.1016/j.meegid.2013.01.009
2. Inouye M, Dashnow H, Raven LA, Schultz MB, Pope BJ, Tomita T, et al. SRST2: rapid genomic surveillance for public health and hospital microbiology labs. Genome medicine. 2014;6(11):90. doi: 10.1186/s13073-014-0090-6
3. Deng X, Shariat N, Driebe EM, Roe CC, Tolar B, Trees E, et al. Comparative analysis of subtyping methods against a whole-genome-sequencing standard for Salmonella enterica serotype Enteritidis. Journal of clinical microbiology. 2015;53(1):212–218. doi: 10.1128/JCM.02332-14
4. Deng X, Den Bakker HC, Hendriksen RS. Applied Genomics of Foodborne Pathogens. Springer; 2017.
5. Nadon C, Van Walle I, Gerner-Smidt P, Campos J, Chinen I, Concepcion-Acevedo J, et al. PulseNet International: Vision for the implementation of whole genome sequencing (WGS) for global food-borne disease surveillance. Eurosurveillance. 2017;22(23). doi: 10.2807/1560-7917.ES.2017.22.23.30544
6. Vergnaud G, Hauck Y, Christiany D, Daoud B, Pourcel C, Jacques I, et al. Genotypic Expansion within the Population Structure of Classical Brucella Species Revealed by MLVA16 Typing of 1404 Brucella Isolates from Different Animal and Geographic Origins, 1974-2006. Frontiers in microbiology. 2018;9:1545. doi: 10.3389/fmicb.2018.01545
7. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology. 2012;19(5):455–477. doi: 10.1089/cmb.2012.0021
8. Irenge L, Ambroise J, Mitangala P, Bearzatto B, Senga R, Durant JF, et al. Genomic analysis of pathogenic strains of Vibrio cholerae from eastern Democratic Republic of Congo (2014-2017). PLOS Neglected Tropical Diseases. Submitted in 2019.
9. Bwire G, Sack DA, Almeida M, Li S, Voeglein JB, Debes AK, et al. Molecular characterization of Vibrio cholerae responsible for cholera epidemics in Uganda by PCR, MLVA and WGS. PLoS neglected tropical diseases. 2018;12(6):e0006492. doi: 10.1371/journal.pntd.0006492
10. Kendall EA, Chowdhury F, Begum Y, Khan AI, Li S, Thierer JH, et al. Relatedness of Vibrio cholerae O1/O139 isolates from patients and their household contacts, determined by multilocus variable-number tandem-repeat analysis. Journal of bacteriology. 2010;192(17):4367–4376. doi: 10.1128/JB.00698-10
11. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–1075. doi: 10.1093/bioinformatics/btt086
12. Goldstein S, Beka L, Graf J, Klassen JL. Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing. BMC genomics. 2019;20(1):23. doi: 10.1186/s12864-018-5381-7
13. Hyytia-Trees E, Lafon P, Vauterin P, Ribot EM. Multilaboratory validation study of standardized multiple-locus variable-number tandem repeat analysis protocol for Shiga toxin–producing Escherichia coli O157: a novel approach to normalize fragment size data between capillary electrophoresis platforms. Foodborne pathogens and disease. 2010;7(2):129–136. doi: 10.1089/fpd.2009.0371
14. Ghosh R, Nair GB, Tang L, Morris JG, Sharma NC, Ballal M, et al. Epidemiological study of Vibrio cholerae using variable number of tandem repeats. FEMS microbiology letters. 2008;288(2):196–201. doi: 10.1111/j.1574-6968.2008.01352.x

Decision Letter 0

Axel Cloeckaert

12 Aug 2019

PONE-D-19-20834

Backward compatibility of whole genome sequencing data with MLVA typing using a new MLVAtype shiny application: the example of Vibrio cholerae

PLOS ONE

Dear Dr Ambroise,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please consider the comments of the reviewer to improve the manuscript.

We would appreciate receiving your revised manuscript by Sep 26 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Axel Cloeckaert

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

3. In your Methods section, please clarify whether the isolates were obtained from a collection, a company, or from another third party source

4. We note that you have included the phrase “data not shown” in your manuscript. Unfortunately, this does not meet our data sharing requirements. PLOS does not permit references to inaccessible data. We require that authors provide all relevant data within the paper, Supporting Information files, or in an acceptable, public repository. Please add a citation to support this phrase or upload the data that corresponds with these findings to a stable repository (such as Figshare or Dryad) and provide and URLs, DOIs, or accession numbers that may be used to access these data. Or, if the data are not a core part of the research being presented in your study, we ask that you remove the phrase that refers to these data.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Ambroise et al. describe a software tool (MLVAtype) that deduces MLVA genotypes from assembled Whole Genome Sequence (WGS) data. The input is a draft assembly, the size of the k-mer used for the assembly and the sequence of the repeated motif. MLVAtype searches for the longest run of the repeated motif, then compares the length of this run with the k-mer size and considers it a valid VNTR allele only if it is shorter than the k-mer value used for the assembly.
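The procedure summarized above can be illustrated with a minimal Python sketch. It is not the MLVAtype source code (the application itself is a shiny app); the function names, the made-up motif and the toy contig are all assumptions introduced here, and only perfect repeats on the given strand are considered.

```python
import re
from typing import Iterable, Tuple


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Largest number of back-to-back, perfect copies of `motif` in `sequence`."""
    # The lookahead lets a run start at every position, so a long run is not
    # hidden by a shorter, overlapping one that begins earlier.
    pattern = re.compile(f"(?=((?:{re.escape(motif.upper())})+))")
    best = 0
    for match in pattern.finditer(sequence.upper()):
        best = max(best, len(match.group(1)) // len(motif))
    return best


def call_vntr(contigs: Iterable[str], motif: str, kmer: int) -> Tuple[int, bool]:
    """Return (copy number, censored flag) for one motif across all contigs."""
    copies = max((longest_tandem_run(c, motif) for c in contigs), default=0)
    # Rule taken from the summary above: the array is trusted only if it is
    # shorter than the assembly k-mer; otherwise it is flagged as censored.
    censored = copies * len(motif) >= kmer
    return copies, censored


# Hypothetical contig carrying 7 copies of an arbitrary 6-nt motif.
contig = "GGTT" + "ACGTCA" * 7 + "CCGT"
print(call_vntr([contig], "ACGTCA", kmer=127))  # 42-nt array, shorter than k
print(call_vntr([contig], "ACGTCA", kmer=40))   # 42-nt array, not shorter than k
```

Because the search is driven by the motif alone, two loci that share a repeat unit cannot be told apart by such a sketch, which is the limitation raised in the first comment below.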

To evaluate MLVAtype, the authors used 19 V. cholerae isolates and an MLVA assay comprising five loci with a 6 bp (3 loci), 7 bp or 9 bp repeat unit. They sequenced the PCR amplification products by Sanger sequencing to constitute the reference data. They then compared in vitro and WGS-derived MLVA typing data with the reference data.

Nine isolates were typed in vitro using the Agilent 2100 Bioanalyzer whereas ten isolates were typed in vitro using the higher resolution capillary electrophoresis system GeneScan. 300 bp WGS reads were produced for the nine isolates versus 150 bp reads for the ten isolates. The 300 bp reads allow a full and correct MLVA genotype to be deduced, whereas the 150 bp reads provide partial data. 90% of in vitro MLVA profiles are incorrect.

There are a number of issues associated with the software design, in vitro typing and report organization.

-in some species in which MLVA is used, some VNTRs occur in families, i.e. different VNTR loci share the same repeat unit. How will MLVAtype behave in such a case? It seems that all loci sharing an identical repeat motif will be (incorrectly) assigned the largest allele size.

-conversely, tandem repeats are often not perfect. How will MLVAtype behave in such a case?

-what is the rationale for not taking into account contigs smaller than 2000 bp, especially given that in the present situation (Vibrio cholerae VNTRs), the VNTR loci are shorter than 200 bp?

-Page 2, third paragraph: the authors do not mention other published approaches for in silico MLVA typing. For instance, Vergnaud et al. (Frontiers in Microbiology 2018) previously evaluated the possibility of deducing MLVA from WGS data for Brucella using an in silico PCR approach.

-Page 6, Table 4: please also provide the initial estimates before applying the “right-censorship”. The authors seem to implicitly assume that SPAdes will never correctly reconstruct tandem repeat arrays longer than the k-mer size, but do not provide the data to show that.

-by default, SPAdes will explore multiple k-mer values (rather than a specified value). It would seem useful to provide in Table 4 the results obtained in these conditions.

-if indeed SPAdes will not reconstruct tandem repeats longer than k-mer size, then why try to assemble reads when the read length is longer than available k-mer sizes (instead of recovering the reads of interest using tools such as BBDuk)?

-the in vitro MLVA assay does not seem to be working correctly yet, because the Bioanalyzer does not have sufficient amplicon sizing resolution and/or because the allele calling (conversion from size estimate to repeat copy number) is not optimized, as indeed suggested by the authors in the first paragraph of page 8. The error rates reported in Table 3 for the Bioanalyzer (above 50%!) and for GeneScan (7 errors in 10 strains for the second locus) are not acceptable. What is the interest of backward compatibility of WGS with in vitro MLVA typing if the in vitro data are so bad, i.e. if MLVA does not seem applicable here? I believe that the Bioanalyzer's (in)capacity to discriminate VNTR alleles with repeat units smaller than 8 bp has already been discussed in the literature, see for instance De Santis et al., BMC Microbiology 2011. Regarding the GeneScan errors, these are probably due to incorrect allele calling, resulting from slightly inexact size measurement by the capillary equipment (literature available, see for instance Hyytia-Trees et al., Foodborne Pathog Dis 2010). Once correctly set up, there should be no errors, at least when using the GeneScan.

Table 5: include the cost estimate for the GeneScan method (where the assay can be run in a single multiplex PCR).

The estimated cost for PCR (10 € per PCR for reagent costs) seems a bit high.

Please clarify the indicated WGS cost: does this cover the making of the sequencing library? What read length?

More generally, Table 5 and the associated paragraph are poorly informative, it would be useful to try to estimate the overall cost, based upon commercial services prices.

Page 7, paragraph on “Theoretical feasibility …” is not informative as is. Should either be developed, by being more specific, or deleted.

Page 8, second paragraph, “the larger, the k-mer size, the better the accuracy of WGS-derived MLVA profiles.” The authors need to show the data (see previous remark on Table 4). Also please explain the reason for limiting the k-mer size to 175.

Page 8, last paragraph before conclusion: the theoretical evaluation has not been done appropriately. The postulated 6 nt motif is not applicable to any of the MLVA assays commonly used in the given list of genera/species. Rather the design of the MLVAtype software indicates that it will have a very limited range. Indeed the MLVAtype web page at https://ucl-irec-ctma.shinyapps.io/NGS-MLVA-TYPING/ appears tailored for V. cholerae (and perhaps MLVA assays with up to five VNTR loci, and short and perfect repeat arrays). The article might be more convincing if focused on V. cholerae and its 10,000 publicly available sequence read archives.

-details,

Page 1, last paragraph: Mycobacterium, Streptococcus etc. are not species but genera. Please be more specific.

Page 2, first paragraph “Mounting evidence …” obviously and by definition Whole Genome Sequencing can only be better than the previous methods, no need to refer to “Mounting evidence”. The issue is cost, as detailed in the second paragraph.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS One. 2019 Dec 11;14(12):e0225848. doi: 10.1371/journal.pone.0225848.r002

Author response to Decision Letter 0


23 Sep 2019

We thank the Editor and the Reviewer for their careful reading of our manuscript. In revising our paper, we carefully followed the Editor’s directions and replied point by point to the Reviewer’s questions.

We are confident that the answers provided and the corresponding modifications in the revised version will now meet the Editor and the Reviewer’s expectations.

We thank you in advance for your editorial work.

The authors

Question:

-in some species in which MLVA is used, some VNTRs occur in families, i.e. different VNTR loci share the same repeat unit. How will MLVAtype behave in such a case? It seems that all loci sharing an identical repeat motif will be (incorrectly) assigned the largest allele size.

Answer:

All loci sharing an identical repeat motif will indeed be incorrectly assigned the largest allele size. For such species, an in silico PCR approach would be more appropriate. Following your recommendation (see your later comment), we decided to focus the paper on V. cholerae, where each repeat unit is found in only one locus.

Question:

-conversely, tandem repeats are often not perfect. How will MLVAtype behave in such a case?

Answer:

For such cases, an in silico PCR approach would indeed be more appropriate. Following your recommendation (see your later comment), we have now restricted the focus of our application to V. cholerae, where tandem repeats are perfect. It is worth noting that the problem of right-censoring applies only to a perfect repetition of the same motif. This also explains why right-censored data were not observed in the study of Vergnaud et al. (Frontiers in Microbiology 2018).

Question:

-what is the rationale for not taking into account contigs smaller than 2000 bp, especially given that in the present situation (Vibrio cholerae VNTRs), the VNTR loci are shorter than 200 bp?

Answer:

The 2000 bp threshold was changed to 1000 bp to be in line with the literature. Small contigs are excluded because they are often non-informative [1, 2, 3, 4] and associated with low coverage.
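A contig-length filter of this kind could look like the following Python sketch. It is only an illustration: the helper names are hypothetical, the parser handles plain multi-line FASTA only, and nothing here is taken from the MLVAtype code.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {contig name: sequence}."""
    parts: Dict[str, list] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first word of the header as the contig name.
            name = header.split()[0] if header else f"contig_{len(parts) + 1}"
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def filter_contigs(contigs: Dict[str, str], min_length: int = 1000) -> Dict[str, str]:
    """Keep only contigs of at least `min_length` bp (1000 bp in the answer above)."""
    return {n: s for n, s in contigs.items() if len(s) >= min_length}


# Toy assembly: one 1200 bp contig and one 999 bp contig.
toy = [">a first contig", "A" * 600, "A" * 600, ">b", "C" * 999]
contigs = read_fasta(toy)
print(sorted(filter_contigs(contigs)))        # contigs kept at the 1000 bp threshold
print(sorted(filter_contigs(contigs, 999)))   # contigs kept at a 999 bp threshold
```

With real data the same functions would be applied to the lines of the assembly FASTA file produced by the assembler.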

Question:

-Page 2, third paragraph: the authors do not mention other published approaches for in silico MLVA typing. For instance, Vergnaud et al. (Frontiers in Microbiology 2018) previously evaluated the possibility of deducing MLVA from WGS data for Brucella using an in silico PCR approach.

Answer:

We fully agree with the reviewer and thank him for his suggestion. This reference is now included in the reference list of the amended version of the paper.

Question:

-Page 6, Table 4: please also provide the initial estimates before applying the “right-censorship”. The authors seem to implicitly assume that SPAdes will never correctly reconstruct tandem repeat arrays longer than the k-mer size, but do not provide the data to show that.

Answer:

We agree with the comment. In the amended version of the paper, this table is now provided as ‘supplementary file 1’ and commented in the “Results” section.

Question:

-by default, SPAdes will explore multiple k-mer values (rather than a specified value). It would seem useful to provide in Table 4 the results obtained in these conditions.

Answer:

We agree with the comment. In the amended version of the paper, these results are now provided in the ‘supplementary file 1’ and commented in the “Results” section.

Question:

-if indeed SPAdes will not reconstruct tandem repeats longer than k-mer size, then why try to assemble reads when the read length is longer than available k-mer sizes (instead of recovering the reads of interest using tools such as BBDuk)?

Answer:

One advantage of the current application is that you can run the MLVA typing on a shiny application very quickly and easily. This process (including data transfer and analyses) would be much longer if the MLVA profiles were directly typed from the reads. Therefore, deriving MLVA profiles directly from the reads was not tested in our study. As the current results showed no discrepancy between WGS data (after assembly) and Sanger sequencing data (gold standard), we focused the paper on deriving MLVA profiles from the assembly.

Question:

-the in vitro MLVA assay does not seem to be working correctly yet, because the Bioanalyzer does not have sufficient amplicon sizing resolution and/or because the allele calling (conversion from size estimate to repeat copy number) is not optimized, as indeed suggested by the authors in the first paragraph of page 8. The error rates reported in Table 3 for the Bioanalyzer (above 50%!) and for GeneScan (7 errors in 10 strains for the second locus) are not acceptable. What is the interest of backward compatibility of WGS with in vitro MLVA typing if the in vitro data are so bad, i.e. if MLVA does not seem applicable here? I believe that the Bioanalyzer's (in)capacity to discriminate VNTR alleles with repeat units smaller than 8 bp has already been discussed in the literature, see for instance De Santis et al., BMC Microbiology 2011. Regarding the GeneScan errors, these are probably due to incorrect allele calling, resulting from slightly inexact size measurement by the capillary equipment (literature available, see for instance Hyytia-Trees et al., Foodborne Pathog Dis 2010). Once correctly set up, there should be no errors, at least when using the GeneScan.

Answer:

The objective of the paper was not to optimize Bioanalyzer-derived MLVA typing (although this can of course be done, as previously reported by Lista et al. in the literature). Accordingly, and to avoid any ambiguity about the imperfect Bioanalyzer results presented in the first version of the paper, this part was removed in the amended version.

Likewise, and as discussed above, the purpose was to compare NGS results with existing published results. We did not initially question the reliability of these published GeneScan results. However, following the reviewer’s comment, we raised this question with our foreign collaborators, who decided to review the initial GeneScan-based MLVA profiles. A senior technician ran them again, and all mismatches but one, which remains unexplained, were corrected.

These errors in our first submission and the reviewer’s comment therefore underpin the value of the work of Hyytia-Trees et al., which concluded that “proper training and experience is necessary to collect accurate information when using the GeneScan methodology”.

Although still not 100%, the concordance between Sanger- and GeneScan-derived MLVA profiles is substantially improved in the amended version of the manuscript (Table 3, Figure 3).
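The allele-calling step discussed in this exchange (converting a measured fragment size into a repeat copy number) can be illustrated with a short Python sketch. The flank length and fragment sizes below are hypothetical values chosen for the example; they are not taken from Table 1, from Kendall et al. or from the GeneScan runs.

```python
def copies_from_fragment(size_bp: float, flank_bp: int, unit_bp: int) -> int:
    """Convert a measured amplicon size into a repeat copy number.

    flank_bp is the part of the amplicon outside the repeat array
    (primers plus flanking sequence); unit_bp is the repeat unit length.
    """
    return round((size_bp - flank_bp) / unit_bp)


FLANK_BP = 120  # hypothetical non-repeat portion of the amplicon
UNIT_BP = 6     # repeat unit length of the example locus

# True fragment: 9 copies, i.e. 120 + 9 * 6 = 174 bp.
for measured in (174.0, 172.0, 170.5):
    print(measured, copies_from_fragment(measured, FLANK_BP, UNIT_BP))
```

In this toy case a 2 bp sizing error still yields 9 copies, whereas a 3.5 bp error yields 8. With a 6 bp unit, any measurement that drifts by more than half a unit is rounded to the neighbouring allele, which is why size normalization matters for short repeat units.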

Question:

Table 5: include the cost estimate for the GeneScan method (where the assay can be run in a single multiplex PCR).

The estimated cost for PCR (10 € per PCR for reagent costs) seems a bit high.

Please clarify the indicated WGS cost: does this cover the making of the sequencing library? What read length?

More generally, Table 5 and the associated paragraph are poorly informative, it would be useful to try to estimate the overall cost, based upon commercial services prices.

Answer:

The cost of the Bioanalyzer method was replaced by the cost of the GeneScan method.

In addition, PCR costs were updated and a new reference was added to clarify the cost of WGS analysis.

Question:

Page 7, paragraph on “Theoretical feasibility …” is not informative as is. Should either be developed, by being more specific, or deleted.

Answer:

This paragraph was deleted, accordingly.

Question:

Page 8, second paragraph, “the larger, the k-mer size, the better the accuracy of WGS-derived MLVA profiles.” The authors need to show the data (see previous remark on Table 4). Also please explain the reason for limiting the k-mer size to 175.

Answer:

Page 8, second paragraph is based on Figure 3. This is specified in the amended version of the paper.

The reason for limiting the k-mer to 175 is justified as follows:

“As previously reported, the length of repeat motifs should not exceed 174 nucleotides for V. cholerae, corresponding to 29 repetitions of a 6 nt motif [13]. Accordingly, the longer k-mer size (i.e. 175) proved to generate a correct MLVA profile with no censored data.”
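The arithmetic behind this threshold can be written out as follows, assuming the rule described in the reviewer's summary (a repeat array is accepted only when it is shorter than the assembly k-mer). The second line, for k = 127, is a derived illustration under that same assumption and not a result reported in this response.

```latex
\begin{aligned}
k = 175:&\quad L_{\max} = 29 \times 6\ \text{nt} = 174\ \text{nt} < 175
  && \text{(no array reaches the k-mer size)}\\
k = 127:&\quad 21 \times 6 = 126 < 127 \le 132 = 22 \times 6
  && \text{(arrays of 22 or more 6-nt units would be censored)}
\end{aligned}
```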

Question:

Page 8, last paragraph before conclusion: the theoretical evaluation has not been done appropriately. The postulated 6 nt motif is not applicable to any of the MLVA assays commonly used in the given list of genera/species. Rather the design of the MLVAtype software indicates that it will have a very limited range. Indeed the MLVAtype web page at https://ucl-irec-ctma.shinyapps.io/NGS-MLVA-TYPING/ appears tailored for V. cholerae (and perhaps MLVA assays with up to five VNTR loci, and short and perfect repeat arrays). The article might be more convincing if focused on V. cholerae and its 10,000 publicly available sequence read archives.

Answer:

We fully agree with the reviewer’s suggestion. The updated version of the paper now focuses only on the V. cholerae MLVA application.

Question:

-details,

Page 1, last paragraph: Mycobacterium, Streptococcus etc. are not species but genera. Please be more specific.

Answer:

Considering that the new version of the paper focuses on V. cholerae, and that the paragraph related to the theoretical feasibility has been removed in the new version of the paper, this sentence was removed.

Question:

Page 2, first paragraph “Mounting evidence …” obviously and by definition Whole Genome Sequencing can only be better than the previous methods, no need to refer to “Mounting evidence”. The issue is cost, as detailed in the second paragraph.

Answer:

This sentence has been removed, as required.

References:

1: Gurevich, Alexey, et al. "QUAST: quality assessment tool for genome assemblies." Bioinformatics 29.8 (2013): 1072-1075.

2: Bultman, Katherine M., et al. "Draft Genome Sequences of Type VI Secretion System-Encoding Vibrio fischeri Strains FQ-A001 and ES401." Microbiology resource announcements 8.20 (2019): e00385-19.

3: Rozanov, Aleksey S., et al. "Metagenome-Assembled Genome Sequence of Phormidium sp. Strain SL48-SHIP, Isolated from the Microbial Mat of Salt Lake Number 48 (Novosibirsk Region, Russia)." Microbiology resource announcements 8.31 (2019): e00651-19.

4: Parks, Dylan, et al. "Genome Sequence of Bacillus subtilis natto VK161, a Novel Strain That Produces Vitamin K2." Microbiology resource announcements 8.35 (2019): e00444-19.

Attachment

Submitted filename: Response_to_reviewer.doc

Decision Letter 1

Axel Cloeckaert

8 Oct 2019

PONE-D-19-20834R1

Backward compatibility of whole genome sequencing data with MLVA typing using a new MLVAtype shiny application for Vibrio cholerae

PLOS ONE

Dear Dr Ambroise,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please consider the comments of the reviewer to improve the manuscript.

We would appreciate receiving your revised manuscript by Nov 22 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

We look forward to receiving your revised manuscript.

Kind regards,

Axel Cloeckaert

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have significantly clarified and improved their report. A few points are written in a misleading way and can easily be improved.

(unfortunately for the reviewer, lines are still not numbered)

Page 2, “However, it is worth noting that an in silico PCR approach to type MLVA in Brucella from WGS data was recently developed”

Would be more exact to indicate:

“However, it is worth noting that an in silico PCR approach to type MLVA from WGS data was recently developed and evaluated for Brucella”

Along this line of benchmarking, may be worth mentioning https://github.com/Papos92/MISTReSS

Page 3 second line “GeneScan determination was retrieved for Ugandan isolates from published data [9]”

This sentence is misleading since it suggests that the chromatograms were published as part of the Bwire et al. 2018 publication. However, what I understand is that the authors have, in the course of the present investigation, realized that the MLVA allele calling published in the 2018 report was incorrect, and have reanalyzed the data. This is a different thing.

Page 3 second paragraph “method proposed by Kendall et al. [10], the formula had to be modified to better fit the sequence length of the motif and the position of the primers (Table 1). It is of note that, the original calculation formula was used for the VC0283 motif but with a modified reverse primer”

Not clear to me. Do the authors just mean that the use of the modified VC0283 reverse primer has no impact on the PCR product size? May be worth recalling the sequence of the previous primer, as in “(AGCCTCCTCAGAAGTTGAG instead of the previous XXXXX)”

“Locus” would be less ambiguous than “motif” (which refers to the repeat unit).

Page 3, page 4 and elsewhere: two designations are used for loci VC0171 (alias VCA0171) and VC0283 (alias VCA0283), please harmonize

Page 4: “returns the number of tandem repeats” and “increasing number (j=2, 3 .., k) of tandem repeats”:

may be clearer to replace “tandem repeats” by “tandemly repeated units”

Page 5 cost analysis paragraph, “Reagents costs for MLVA typing (5 motifs) of V. cholerae isolates were compared using three different methods (Table 5).”

The authors need to expand a little bit. Discuss the relative costs. Recall that Sanger sequencing was used only to produce a reference dataset, but not as a suggestion for routine MLVA typing (as shown, would make no sense in terms of cost).

Table 5

Instead of “Motifs”, the authors probably mean “Loci”?

Cost estimates:

Sanger sequencing: are these commercial costs? Eight dollars for one PCR in terms of reagents and consumables seems a lot.

GeneScan-based typing: with only five loci, all five loci should be run in one multiplex PCR. Most laboratories running MLVA on GeneScan-type equipment and significant numbers of strains will multiplex the PCRs. Then the fair reagent cost estimate should be down to one PCR and one run per sample, i.e. 11.4 USD.

WGS cost: indicating current best commercial prices (for sequencing 1 strain alone versus as part of a batch of 96) might be useful

Page 5, “However, it should be noted that a modification of either the primer” When running MLVA on a GeneScan type of machine, the “formula” is useless. The allele calling software will call each allele based on its associated observed size range. This remark suggests that the authors are exporting raw size estimates from the GeneScan and then converting them to alleles using the “formula”. This is not the most recommended way to proceed, see available literature.
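To make the distinction in the comment above concrete, the following is a minimal sketch contrasting the two ways of turning an observed fragment size into a repeat number. It is an illustration under stated assumptions, not the authors' procedure or any vendor's allele calling software: the function names, the flanking length (100 bp), the repeat unit length (7 bp) and the size ranges are all hypothetical values chosen for the example.

```python
from typing import Dict, Optional, Tuple


def call_by_formula(fragment_size: float, flank_length: int, unit_length: int) -> int:
    """Linear formula: subtract the flanking length, divide by the unit length, round."""
    return round((fragment_size - flank_length) / unit_length)


def call_by_bins(
    fragment_size: float, bins: Dict[int, Tuple[float, float]]
) -> Optional[int]:
    """Bin-based calling: return the allele whose observed size range contains the
    fragment size, or None when the size falls outside every known range."""
    for allele, (low, high) in bins.items():
        if low <= fragment_size <= high:
            return allele
    return None


# Hypothetical observed size ranges (in bp) for two alleles of one locus.
example_bins = {8: (154.0, 157.5), 9: (161.0, 164.5)}

# A size inside a known range gives the same call with both approaches.
in_range = (call_by_formula(155.2, 100, 7), call_by_bins(155.2, example_bins))

# A size between two ranges is silently rounded by the formula,
# whereas bin-based calling flags it as uncalled (None).
between = (call_by_formula(159.0, 100, 7), call_by_bins(159.0, example_bins))
```

The point of the comparison is that bin-based calling leaves an ambiguous size uncalled, while the formula always returns a number.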

Page 5, “was applied in order to decrease the number of mismatches, especially those observed with VC0437 and VCA0283.”

Quote Table 1 (and check locus names, see a previous remark)

Page 7: “hence solving the well-recognized issue of backward compatibility with traditional MLVA typing methods.”

The software is not solving anything! There has never been an issue of backward compatibility when the sequencing reads are longer than the tandem repeat arrays. Recall that when tandem repeats contain internal variations, software for sequence assembly may be able to correctly reconstruct tandem repeats longer than the sequencing reads. Mention alternative approaches (the in silico PCR methods, including in terms of benchmarking https://github.com/Papos92/MISTReSS) and explain why the present approach is believed to be more appropriate at least for V. cholerae MLVA.

I would suggest merging Table 1 and S1 Table, as S1 Table clearly illustrates the impact of the k-mer size.

Discussion: the authors need to comment on the pros and cons of the approach they use here versus the more commonly used in silico PCR approach. In particular, they need to indicate that the approach used here works in the Vibrio cholerae context because the VNTRs are unique (so the repeat unit sequence is locus-specific), the tandem repeat arrays are perfect, and there are no indels in the flanking sequences. This is an uncommon situation. They might indicate why they think this approach may be of interest when applicable. Do they think it is because it does not require the flanking sequences to have been assembled?
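The conditions listed above (a locus-specific repeat unit and perfect tandem repeat arrays) are what make a direct motif search possible without flanking sequences. The following is a minimal sketch of such a search, offered as an illustration only and not as the MLVAtype implementation. The function names, the example repeat unit and the example contigs are hypothetical, and the rule used to flag a count as censored (the array lies within one unit length of a contig end) is an assumption made for this example.

```python
import re
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_repeat_units(contigs: List[str], unit: str) -> Tuple[int, bool]:
    """Return (largest number of perfect tandemly repeated units, censored flag).

    Both strands of every contig are searched. The flag is True when the longest
    array lies within one unit length of a contig end, in which case the count
    should be read as a lower bound (assumption made for this sketch).
    """
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    best_count, best_censored = 0, False
    for contig in contigs:
        for strand in (contig, reverse_complement(contig)):
            for match in pattern.finditer(strand):
                count = (match.end() - match.start()) // len(unit)
                censored = (
                    match.start() < len(unit)
                    or len(strand) - match.end() < len(unit)
                )
                if count > best_count:
                    best_count, best_censored = count, censored
    return best_count, best_censored


# Hypothetical repeat unit and contigs, for illustration only.
unit = "AACAGA"
complete = "GGCTTAGC" + unit * 5 + "TTGACCGT"   # array fully inside the contig
truncated = unit * 3 + "TTGACCGT"               # array starts at the contig edge

full_call = count_repeat_units([complete], unit)
edge_call = count_repeat_units([truncated], unit)
```

Because the search relies only on the repeat unit, it stops being usable when the unit occurs at more than one locus or when the array contains imperfect copies. In those cases an in silico PCR approach anchored on the flanking sequences is the natural alternative.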

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 2

Axel Cloeckaert

14 Nov 2019

Backward compatibility of whole genome sequencing data with MLVA typing using a new MLVAtype shiny application for Vibrio cholerae

PONE-D-19-20834R2

Dear Dr. Ambroise,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Axel Cloeckaert

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Axel Cloeckaert

2 Dec 2019

PONE-D-19-20834R2

Backward compatibility of whole genome sequencing data with MLVA typing using a new MLVAtype shiny application for Vibrio cholerae

Dear Dr. Ambroise:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Axel Cloeckaert

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response_to_reviewer.doc

    Data Availability Statement

    All NGS data are available from the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) under study accession number ERP114722, and from the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) under study accession number PRJNA439310.


    Articles from PLoS ONE are provided here courtesy of PLOS
