PLOS Computational Biology
. 2020 Jul 31;16(7):e1008104. doi: 10.1371/journal.pcbi.1008104

Regional sequence expansion or collapse in heterozygous genome assemblies

Kathryn C Asalone 1,#, Kara M Ryan 1,#, Maryam Yamadi 1,#, Annastelle L Cohen 1, William G Farmer 1, Deborah J George 1, Claudia Joppert 1, Kaitlyn Kim 1, Madeeha Froze Mughal 1, Rana Said 1, Metin Toksoz-Exley 2, Evgeny Bisk 3, John R Bracht 1,*
Editor: Christos A Ouzounis4
PMCID: PMC7423139  PMID: 32735589

Abstract

High levels of heterozygosity present a unique genome assembly challenge and can adversely impact downstream analyses, yet such heterozygosity is common in sequencing datasets obtained from non-model organisms. Here we show that by re-assembling a heterozygous dataset with variant parameters and different assembly algorithms, we are able to generate assemblies whose protein annotations are statistically enriched for specific gene ontology categories. While total assembly length was not significantly affected by the assembly methodologies tested, the assemblies generated varied widely in fragmentation level, and we show local assembly collapse or expansion underlying the enrichment or depletion of specific protein functional groups. We show that these statistically significant deviations in gene ontology groups can occur in seemingly high-quality assemblies, and result from difficult-to-detect local sequence expansions or contractions. Given the unpredictable interplay between assembly algorithm, parameter settings, and the heterozygosity of the biological sequence data, we highlight the need for better measures of assembly quality than N50 value, including methods for assessing local expansion and collapse.

Author summary

In the genomic era, genomes must be reconstructed from fragments using computational methods, or assemblers. How do we know that a new genome assembly is correct? This is important because errors in assembly can lead to downstream problems in gene predictions and these inaccurate results can contaminate databases, affecting later comparative studies. A particular challenge occurs when a diploid organism inherits two highly divergent genome copies from its parents. While it is widely appreciated that this type of data is difficult for assemblers to handle properly, here we show that the process is prone to more errors than previously appreciated. Specifically, we document examples of regional expansion and collapse, affecting downstream gene prediction accuracy, but without changing the overall genome assembly size or other metrics of accuracy. Our results suggest that assembly evaluation methods should be altered to identify whether regional expansions and collapses are present in the genome assembly.

Introduction

De novo genome assembly is particularly challenging given a lack of ‘gold standard’ to determine whether the results are correct [1]. As the acquisition of genomic data rises, it becomes increasingly vital to assess the quality of these computer-generated predictions [2]. Benchmarking Universal Single Copy Orthologs, BUSCO [2], Recognition of Errors in Assemblies using Paired Reads, REAPR [3], and N50 value [1,4] are some examples of measures used to evaluate genomic data. These methods individually capture limited aspects of assembly quality and may not identify poorly assembled genomes, particularly where effects are more subtle. This is important because these mis-assemblies can lead to a proliferation of incorrect conclusions throughout the literature [5].

There are many high quality assemblers from which to choose [6–13]; however, not all assemblers are alike or suitable for specific datasets. It has been documented that GC content, polyploidy, genome size, proportion of repeats, and heterozygosity can affect assemblers in different ways [14–18]. Standardization of assembly pipelines is generally tenuous and ongoing [19], and trial-and-error remains the standard method for optimizing genome assembly. Here we show that subtle errors can be present within seemingly high-quality assemblies derived from a heterozygous dataset. We also show that this phenomenon can create statistically robust over- or under-representation of specific functional groups by PANTHER (gene ontology) analysis.

The goal of most assemblers is to collapse allelic differences into a single haploid output to obtain a consensus sequence. However, moderate to high levels of heterozygosity (1% or above) can make this challenging, as the allelic differences begin to resemble paralogy within the genome, confusing assembly algorithms. Even with high quality sequencing and coverage, high heterozygosity can result in poor quality fragmented assemblies [17,20]. Although most early assembly algorithms were designed for haploid organisms or inbred lines [21–24], the application of next-generation sequencing to non-model organisms is increasingly vital, and these datasets tend to contain higher levels of heterozygosity. However, the impact of heterozygosity on the genome assembly process remains poorly characterized. Here we show that for moderate to high heterozygosity, subtle local errors can occur even using algorithms optimized to process heterozygous datasets. We conclude that additional measures of assembly quality are needed, and we show that tracking heterozygosity within specific regions may flag potential assembly errors that would otherwise distort downstream annotation efforts.

In this study we analyze multiple assemblies and annotations generated from a single, well-controlled dataset derived from the subterrestrial nematode Halicephalobus mephisto [25] which has an overall 1.15% heterozygosity [26]. For this work we utilized different parameters across two de novo assemblers, SOAPdenovo2 [11] and Platanus [10], to generate alternate genome assemblies displaying distinctive error profiles which we describe below.

Results

A total of 11 assemblies were generated from raw published Illumina data [26] using SOAPdenovo2 [11] and Platanus [10]. These assemblies were compared with the reference genome, which was assembled by Platanus with PacBio long reads used for super-scaffolding [26]. As might be expected given that Platanus is designed for heterozygous datasets, most Platanus assemblies have a higher percent completeness than the SOAPdenovo2 assemblies (Fig 1). The exception was the Platanus assembly with step-size 1, which was more similar to the SOAPdenovo2 assemblies (Fig 1).

Fig 1. Assessment of genome completeness. Assembly and parameters are reported in the first column followed by N50 (nt), heterozygosity (as defined in Methods), genome size (Mb), and the BUSCO results.


The proportion complete and single copy is represented in light blue, complete and duplicated is represented in dark blue, fragmented is represented in yellow, and missing is represented by red.

We found that read-based SNP heterozygosity is a valuable measure of assembly quality. The reference genome has 1.15% overall heterozygosity (Fig 1), consistent with GenomeScope [27], which reported a heterozygosity of 1.04% from the raw Illumina reads. However, the heterozygosity of the more highly fragmented assemblies was far lower; these assemblies also recovered greater total sequence length than the reference (Fig 1). Both SOAPdenovo2 and Platanus deploy de Bruijn graphs, in which heterozygous regions form haploid partial paths (or bubbles) that both duplicate existing sequences and lead to higher rates of fragmentation as they are resolved into separate contigs in the final assembly. Therefore, these assemblies contain an excess of short fragments that partly recover individual haplotypes, leading to longer total sequence assemblies owing to regional allelic duplications. The expansion likely does not simply double the sequence (representing two alleles): for N nearby SNPs there are up to 2^N paths (creating bubbles on bubbles in the graph), and the expansion may be extreme as these paths yield contigs. We saw some evidence of this in the highly fragmented SOAPdenovo2 and Platanus step-size 1 assemblies, which at a 200 bp length cutoff averaged 50% larger than the reference (S1 Table). However, when we raised the length cutoff to 1 kb, the sequence length excess of the fragmented assemblies was reversed (S2 Table), and they yielded a sequence length deficit relative to the reference. Collectively, these results demonstrate that most of the excess sequence length of the highly fragmented assemblies is encoded in fragments of 200–1000 bp (S1 and S2 Tables).

Another consequence of raising the sequence length cutoff was a slight increase in heterozygosity: as some allelic copies were removed, reads were forced to map onto their allelic counterparts, yielding higher SNP frequencies (S1 and S2 Tables). Nevertheless, even with a 1000 bp cutoff, the heterozygosity of the more-fragmented assemblies remained lower than that of the reference or the less-fragmented assemblies (S2 Table).

As expected, BUSCO completeness also correlates inversely with fragmentation (Fig 1), since shorter contigs and scaffolds are more difficult to annotate. Consistent with expansion of allelic copies, the Platanus step-size 1 assembly also reported an increase in complete and duplicated gene copies, while the SOAPdenovo2 assemblies did not (Fig 1). This likely reflects the higher fragmentation of the latter assemblies, causing BUSCO protein matches to fall into the ‘fragmented’ or ‘missing’ categories instead of complete duplicates.

Our data also show a sharp transition in Platanus performance between step-size 1 and step-size 2. The Platanus step-size 1 assembly has an N50 of 1.8 kb (Fig 1), while the reference assembly [26] was generated by first assembling the Illumina data with Platanus at step-size 2, yielding an intermediate assembly with an N50 of 102.8 kb that was further scaffolded with PacBio reads to an N50 of 313 kb. The reference also yielded uniform scaffold-level coverage [26] and consistent heterozygosity across genomic regions that proved problematic in the more fragmented assemblies, as we discuss below. Therefore, Platanus step-size 2 and above work well for this dataset (Fig 1).

To assess relative sequence differences we used LAST [28] to directly compare the assemblies. While LAST dot plots showed a strong 1:1 alignment between the reference assembly and itself (Fig 2A), alignments between the reference and the SOAPdenovo2 assemblies revealed significant losses of sequence (Fig 2B–2D). Thus, although overall assembly length increases for these assemblies (Fig 1 and S1 Table), unique sequences are still missing entirely, consistent with their low BUSCO scores (Fig 1). The average percentage of the reference missing by LAST from the SOAPdenovo2 assemblies was 10% (Fig 2 and S1 Table). However, the SOAPdenovo2 assemblies averaged 50% larger in total assembly length than the reference, suggesting that regional expansions more than compensate, constituting an average 39.5% extra sequence for these assemblies (Fig 2 and S1 Table). (Since LAST allows small sequence matches [28], and in our hands the minimum alignment length was 24 bp, it is not eliminating short sequences from the alignments.) Platanus step-size 1 is a good representative of the highly fragmented assemblies: it yielded the second longest overall assembly, yet it is missing around 5.6% of reference sequence by LAST (S1 Table). Therefore, Platanus step-size 1 encodes 43% expanded sequence (S1 Table), consistent with its elevated duplicated gene content by BUSCO (Fig 1). Leaving out Platanus step-size 1, the other Platanus assemblies averaged only 5.7% expansion (S1 Table). These results reveal the exquisite sensitivity of regional expansion to the parameter settings of a given assembly algorithm.
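Percent-missing figures of this kind can be derived from per-alignment reference coordinates (as reported by LAST) by merging overlapping aligned intervals. The following is a sketch under that assumption, not the published pipeline:

```python
def merged_coverage(intervals):
    """Total bases covered by a set of (start, end) half-open intervals,
    merging overlaps so shared bases are counted once."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def percent_missing(ref_len, aligned_intervals):
    """Percent of a reference sequence with no alignment from the test assembly."""
    covered = merged_coverage(aligned_intervals)
    return 100.0 * (ref_len - covered) / ref_len
```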

Fig 2. Oxford Grid of pairwise alignments generated using LAST, with the x-axis representing the reference Platanus+PacBio assembly contigs and the y-axis representing the various assembly scaffolds filtered to 200 bp.


Forward and reverse alignments are denoted with a red and blue dot, respectively. Percent missing is denoted with m and percent duplicated/expanded is denoted as d. A. reference assembly vs. itself. B. SOAPdenovo2 k-mer 23. C. SOAPdenovo2 k-mer 47. D. SOAPdenovo2 k-mer 63. E. Platanus step-size 1. F. Platanus step-size 3. G. Platanus step-size 5. H. Platanus step-size 7. I. Platanus step-size 10. J. Platanus step-size 15. K. Platanus step-size 20. L. Platanus step-size 30.

To investigate these findings further, we performed gene prediction on both the 200 bp-cutoff and 1 kb-cutoff assemblies with Maker2 [29], an annotation pipeline that returns two classes of protein: those supported by protein or transcript evidence and those lacking such evidentiary support. An unanticipated corollary of excessive fragmentation was a ballooning of proteins lacking evidentiary support, presumably a consequence of many short sequences that cannot be matched with confidence to supporting protein evidence (Fig 3A). Consistent with our finding of sequence expansion in these assemblies, the number of these proteins exceeds 35,000 for all fragmented assemblies (Platanus step-size 1 and all three SOAPdenovo2 assemblies) (Fig 3A), suggesting more than just allelic expansion, given that the non-evidence protein counts of the better assemblies range from 5,000 to 10,000 (S3 Table). Because we ran Maker2 configured to return a single best prediction for each gene (isoform predictions were turned off, see Methods), these extra predictions are not driven by alternative isoforms. When we set the length cutoff to 1000 bp, the excess non-evidence predictions were largely lost (Fig 3B), along with some of the evidence-supported predictions, consistent with loss of some important sequences owing to extreme fragmentation; indeed, the assembly N50s were below the 1000 bp cutoff for SOAPdenovo2 k-mer 47 and 63 (S1 Table). Thus, a ballooning of non-evidence-supported protein predictions may be an indicator of poor assembly quality in heterozygous datasets, along with low N50 and lowered genome-wide heterozygosity (Figs 1 and 3). We also found that proteins with evidence were significantly longer than proteins without evidence for all assemblies at both the 200 and 1000 bp cutoffs (S3 and S4 Tables).

Fig 3. Evaluation of Maker2 protein predictions.


(A) Predicted proteins generated from assemblies with 200 bp size cutoff. (B) Predicted proteins generated from assemblies with 1000 bp size cutoff. For both A and B we have indicated the two types of protein predictions from Maker2: homology evidence-supported (light green) and those without evidentiary support (dark green). (C) Fragmentation and duplication from OrthoMCL-generated blast matches of each assembly (200bp cutoff; all proteins combined) relative to reference proteins. For each assembly, we were able to quantify 1:1 matches (blue), cases of fragmentation (non-overlapping matches to a single reference protein, green), and duplication (multiple overlapping matches to a single reference protein, red). To screen out paralogous protein families, we ignored all cases where the reference assembly provided a non-self match (see Methods); thus the number of protein matches is substantially lower than the total number of predicted proteins for each assembly.

To further characterize these predictions, we grouped the 200 bp size cutoff genes (combining both evidence-based and non-evidence-based) with OrthoMCL, which uses reciprocal BLAST to assign proteins to high-confidence groups [30]. By evaluating the OrthoMCL-generated BLAST output, we were able to quantify the relative contributions of fragmentation (extra non-overlapping matches to a reference protein) versus duplication (overlapping matches to a reference protein) in the protein predictions of each assembly. To avoid conflating duplication with paralogy, we filtered out cases where multiple reference proteins map to each other (see Methods). While the less fragmented assemblies (Platanus step-size 3–30) exhibit negligible amounts of either fragmentation or duplication, we observed substantial amounts of both phenomena in the more fragmented cases (Fig 3C). We note that in every case duplication exceeds fragmentation, so fragmentation does not fully explain the increased protein predictions observed from the more fragmented assemblies (Fig 3C). Specifically, across the fragmented assemblies (Platanus step-size 1 and the SOAPdenovo2 assemblies), fragmentation averaged 3,513 cases while duplication averaged 4,764. We note that our counting of duplications is very conservative: for two overlapping OrthoMCL BLAST matches we count one as correctly matching and one as duplicated; thus three overlapping matches would be measured as 2 duplicates and 1 correct match, and so on. Since 1:1 match counts are generally similar in magnitude to duplicate counts for these assemblies, our data suggest a pervasive pattern of duplication which we hypothesize represents expansion of allelic copies (Fig 3C). Taken together, these data demonstrate that surprisingly high levels of duplication, not just fragmentation, are present in the SOAPdenovo2 and Platanus step-size 1 assemblies, which is not apparent from their low N50 values (Fig 1). These data also reinforce the importance of not relying on a single metric such as N50 in assessing assembly quality, since N50 captures only relative fragmentation and misses expansion and duplication errors.
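The distinction between fragmentation (non-overlapping matches to a reference protein) and duplication (overlapping matches), with the conservative counting described above, can be sketched as follows. The coordinate convention and function names are illustrative, not taken from the paper's scripts:

```python
def classify_matches(matches):
    """Classify BLAST matches, given as (start, end) spans on one reference
    protein, into 1:1, duplicated (overlapping), and fragmented
    (extra non-overlapping) categories. Conservative counting: within a
    cluster of overlapping matches, one is scored correct and the rest
    are scored as duplicates."""
    if not matches:
        return {"one_to_one": 0, "duplicated": 0, "fragmented": 0}
    matches = sorted(matches)
    # Group matches into clusters of mutually overlapping spans
    clusters, cur = [], [matches[0]]
    for start, end in matches[1:]:
        if start < max(e for _, e in cur):  # overlaps current cluster
            cur.append((start, end))
        else:
            clusters.append(cur)
            cur = [(start, end)]
    clusters.append(cur)
    duplicated = sum(len(c) - 1 for c in clusters)
    fragmented = len(clusters) - 1  # extra non-overlapping pieces
    one_to_one = 1 if len(clusters) == 1 and len(clusters[0]) == 1 else 0
    return {"one_to_one": one_to_one,
            "duplicated": duplicated,
            "fragmented": fragmented}
```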

We analyzed OrthoMCL grouping patterns, hypothesizing that shorter proteins should group less efficiently because short alignments to orthologous proteins do not reach high statistical confidence. Consistent with this, we found that the seven Platanus assemblies with N50 > 2 kb have a higher proportion of proteins grouped by OrthoMCL analysis than the SOAPdenovo2 assemblies (Fig 4). There was no statistical difference (p = 0.08485), by Wilcoxon rank sum test, between the average proportion grouped in Platanus and SOAPdenovo2 (Fig 4). However, Platanus step-size 1 is an outlier by two standard deviations (p < 0.05) and, when it is removed, the difference between the two algorithms is significant (p = 0.01667).
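The rank-sum comparison can be illustrated with a self-contained permutation analogue of the Wilcoxon rank-sum test (scipy.stats.ranksums offers an exact implementation; the sketch below avoids external dependencies and handles ties only crudely):

```python
import random

def rank_sum_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the rank sum of group `a`,
    a nonparametric analogue of the Wilcoxon rank-sum test."""
    rng = random.Random(seed)
    pooled = a + b
    # Assign 1-based ranks to the pooled data (ties take their sorted
    # position here, which is adequate for a sketch without tie correction)
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    observed = sum(ranks[:len(a)])              # rank sum of group a
    expected = len(a) * (len(pooled) + 1) / 2.0  # mean under the null
    extreme = 0
    idx = list(range(len(pooled)))
    for _ in range(n_perm):
        rng.shuffle(idx)
        stat = sum(ranks[i] for i in idx[:len(a)])
        if abs(stat - expected) >= abs(observed - expected):
            extreme += 1
    return extreme / n_perm
```

Applied to the per-assembly proportions grouped (Fig 4), such a test compares the two assembler families without distributional assumptions.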

Fig 4. Comparison of proportion of proteins grouped by OrthoMCL of each assembly.


Blue dots represent Platanus assemblies, red represents the reference, and yellow dots represent SOAPdenovo2 assemblies. Numbers above the dots represent the step-size or k-mer size used.

Given that these assemblies are all derived from the same raw dataset, we would predict that all proteins in a paralogous gene family should map with a direct 1:1 correspondence to their iso-ortholog (isolog) from the reference assembly. Here we use the term isolog to refer to the identical gene recovered from two or more different assemblies, in contrast to paralogs, which are distinct copies within a genome, or orthologs, which are matching proteins from different organisms. To visualize the behavior of the assemblies in one such case, we extracted the P-glycoprotein (pgp) related proteins, consisting of 167 total sequences, from the OrthoMCL data and constructed a phylogenetic tree (Fig 5) of this multigene family. (The OrthoMCL groups can be quite large; for example, the largest OrthoMCL group, Hsp70, consisted of 510 proteins, so we used the pgp group to obtain an interpretable tree.) This tree of 167 proteins clustered into 13 distinct isolog subgroups (I–XIII in Fig 5), reflecting the individual paralogs in the genome as captured in each assembly. There should be 12 isologs in each clade, with uniform 1:1 matching between each assembly and the reference. In contrast to this expectation, all isologs are represented in only six clades: clades VI–IX, XI, and XIII. Complex patterns of loss and duplication are visible across the tree; for example, one clade does not contain any SOAPdenovo2 assemblies (clade III), and three do not contain the reference (clades II, III, and XII), likely representing divergent alleles collapsed in the reference assembly. Conversely, one clade (clade VI) contains two reference proteins and 26 total members, suggesting a very recent duplication; as expected, most Platanus assemblies contribute two members to this cluster, but Platanus step-size 1 contributes four members, SOAPdenovo2 k-mer 63 contributes three, and SOAPdenovo2 k-mer 23 contributes only one (a loss of one), explaining the deviation from the predicted 24 members.

Fig 5. Maximum likelihood tree of P-glycoprotein (pgp) related protein group from OrthoMCL.


Black, grey, and white circles on nodes represent bootstrap values of 100%, ≥80% and <100%, and ≥50% and <80%, respectively. Blue circles represent Platanus assemblies, red circles represent the reference assembly, and yellow circles represent SOAPdenovo2 assemblies. The number next to each circle depicts the step-size or k-mer size for each assembly. Roman numerals identify isolog clusters. Scale bar represents substitutions per site. Asterisks indicate two matches that represent the single case of fragmented protein prediction from the SOAPdenovo2 k-mer 63 assembly in this tree.

Highlighting pervasive expansion by duplication, clade V has duplicated genes for SOAPdenovo2 k-mer 47 and 63 and Platanus step-size 1 (for 14 total genes in the clade). Based upon our previous finding that the genomic fragmentation of these assemblies leads to significant levels of fragmentation in protein predictions (Fig 3C), we checked whether the apparent expansion of pgp in several clades was due to fragmentation or actual duplication. We found only one case of fragmentation (in clade XIII, SOAPdenovo2 k-mer 63, indicated by an asterisk after the k-mer size), while all the remaining expansions in our tree (33 total) were due to actual duplication; the total count of correct matches across the tree is 122 (Fig 5).

Hypothesizing that the expansions are driven at least partly by heterozygosity, we examined the heterozygosity of pgp-3 regions across the assemblies, and found that it was reduced in SOAPdenovo2 and Platanus step-size 1 assemblies (Fig 6). This is consistent with assembly of alleles as separate contigs, lowering the apparent heterozygosity in read-mapping.

Fig 6. Heterozygosity of P-glycoprotein (pgp) related proteins recovered from different assemblies.


We predict that expansion of genomic regions should also ramify into increased protein Gene Ontology (GO) representation from those expanded regions after gene annotation. To test this, we performed the PANTHER statistical overrepresentation test on the combined protein predictions from each assembly, identifying 237 enriched or depleted functional categories while controlling the false-discovery rate at 0.01 (S5 Table). For all PANTHER analyses we compared the 1000 bp assemblies to the reference sequence, for two reasons. First, the 200 bp size cutoff permits a ballooning of non-evidence-supported sequences from fragmented assemblies, leading to a high rate of non-meaningful, noisy enrichments as discussed earlier (Fig 3A and 3B); indeed, we found that comparing the 200 bp size cutoff data with the reference by PANTHER generates very high numbers of enrichments for all assemblies. Second, the process of creating the reference makes the 1000 bp size cutoff the more appropriate comparison: for the construction of the reference, assembly was performed with Platanus at step-size 2, then a size cutoff of 500 bp was applied prior to PacBio scaffolding with PBJelly; after scaffolding, aberrant chimeras were identified and fragmented with REAPR [3] and a size threshold of 1000 bp applied to the final assembly [26]. Therefore, for all analyses below we used the 1000 bp size cutoff, which yielded enrichments or depletions in a few assemblies, but not all of them (Fig 7A).
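False-discovery-rate control of this kind follows the Benjamini-Hochberg procedure; a minimal sketch is shown below (PANTHER performs this internally; the function here is only illustrative):

```python
def benjamini_hochberg(pvalues, fdr=0.01):
    """Return a parallel list of booleans: True where the hypothesis is
    rejected under Benjamini-Hochberg control at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest k such that p_(k) <= (k / m) * fdr ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * fdr:
            k_max = rank
    # ... then reject the k_max smallest p-values
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected
```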

Fig 7. Combined PANTHER, coverage, length, and heterozygosity analysis.


A. PANTHER Enrichment or depletion of specific functional categories. B. Regional read-mapping coverage relative to reference. C. Regional lengths relative to reference. D. Regional heterozygosity relative to reference.

PANTHER identified specific GO categories enriched or depleted in some assemblies relative to the reference, but we wanted to understand in more detail why these assembly errors occurred. Therefore, we created a custom pipeline (with scripts available in our GitHub repository) for comparing regions of the reference with each ‘test’ assembly in terms of length expansion, coverage, and heterozygosity (Fig 8). As might be expected, heterozygosity and coverage generally behave similarly to each other, with sequence length being anti-correlated (Figs 7B, 7C, 7D and 9). As a control, we examined heterozygosity and coverage of the reference assembly for these regions, which were consistent with the total assembly (Fig 10). Regional expansion or contraction was observed (as expected) across the highly fragmented SOAPdenovo2 and Platanus step-size 1 assemblies, but also in Platanus step-size 3 (N50 = 76 kb) and Platanus step-size 7 (N50 = 71 kb) (S2 Table), suggesting that even assemblies appearing to be high-quality may contain hidden errors.
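The logic of Fig 9 (regional expansion raises assembled length while depressing coverage and heterozygosity; collapse does the reverse) can be expressed as a simple heuristic classifier. The threshold below is illustrative, not a value from our pipeline:

```python
def regional_ratios(test, ref):
    """Test/reference ratios for one genomic region; `test` and `ref`
    are dicts with 'length', 'coverage', and 'het' entries."""
    return {key: test[key] / ref[key] for key in ("length", "coverage", "het")}

def classify_region(ratios, tol=0.25):
    """Heuristic call of regional expansion or collapse from test/reference
    ratios. Expansion: longer region with lower coverage and heterozygosity;
    collapse: the reverse. `tol` is an illustrative threshold."""
    up = lambda r: r > 1 + tol
    down = lambda r: r < 1 - tol
    if up(ratios["length"]) and down(ratios["coverage"]) and down(ratios["het"]):
        return "expansion"
    if down(ratios["length"]) and up(ratios["coverage"]) and up(ratios["het"]):
        return "collapse"
    return "consistent"
```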

Fig 8. Schematic of computational methods used in this paper.


Assembly ‘X’ represents one of the 11 assemblies generated during the course of this study, and compared to the reference.

Fig 9. Conceptual summary, showing regional expansion (left), regional collapse (middle), and effects on heterozygosity, depth of coverage, and regional length, relative to expected (right).


Fig 10. Reference genome raw coverage and heterozygosity for specific PANTHER categories.


A. Read coverage. B. Heterozygosity.

Discussion

Our data demonstrate a complex interaction between heterozygosity, genome assembler, and length-thresholding effects, with some problems becoming evident only after extensive comparison to a high-quality reference sequence. For example, from the 200 bp size cutoff assemblies, LAST showed an average of 10% of reference sequence missing across the SOAPdenovo2 assemblies, yet these assemblies averaged 50% larger than the reference in total assembly size. This suggests that regional expansions account for a 60% excess of genomic sequence over the reference for these assemblies (S1 Table). Stated another way, an average of 40% of the SOAPdenovo2 assemblies consists of expanded sequence (S1 Table). This may be an underestimate, given that some regions have undergone sequence collapse (discussed below), which is also compensated by regional expansion. For the multigene pgp family we showed lower heterozygosity for the SOAPdenovo2 assemblies and one Platanus assembly (Fig 6). We interpret the lower heterozygosity in the SOAPdenovo2 assemblies as evidence that these regions are not properly resolved and are likely expanded regionally, consistent with the duplicate genes observed throughout the phylogenetic tree in isolog clusters (Fig 5).

Confirming this, our PANTHER analysis of specific GO categories yielded highly significant enrichment or depletion of 237 specific categories even after controlling the false-discovery rate at 0.01 (S5 Table and Fig 7A). These discrepancies can be at least partly explained by a complex interplay between regional heterozygosity and assembly parameters. While the reference genome does not display unusual heterozygosity or coverage in these regions (Fig 10), we documented four categories in which the assemblies diverge from the reference genome in terms of coverage, heterozygosity, and length assembled (Fig 7B, 7C and 7D). We would predict that if an assembler maximally “spreads out” the variation within a dataset into distinct contigs, assembled length would go up, while coverage and heterozygosity would go down as the reads find their perfect match. In many cases this is precisely what we see: the assemblies shown for Oxidoreductase and Dehydrogenase behave in this way (Fig 7B, 7C and 7D) and are examples of ‘regional expansion’ (Fig 9). Somewhat surprisingly, this regional expansion appears to be far greater than one would expect from separation of alleles, which should lead to a doubling of sequence length; in most cases we saw well over 3-fold expansion of length, and in one extreme case 7-fold (Fig 7C). Even Platanus, algorithmically optimized for heterozygous genome assembly, was prone to this artifact under specific parameter settings (Fig 7B and 7C). While Platanus step-size 1 performs particularly poorly with our dataset, step-size 3 and 7 both showed artifacts in our PANTHER analysis (Fig 7, see Oxidoreductase, Dehydrogenase, and Response to Heat) while yielding reasonable N50 values (step-size 3, N50 = 74 kb; step-size 7, N50 = 70 kb). Therefore, our data highlight a potentially worrisome problem for genome assembly algorithms confronted with moderately to highly heterozygous datasets.

The Amino Acid Transport category appears to violate the expectation that heterozygosity will behave similarly to coverage: it is increased, not decreased, in two of the three SOAPdenovo2 assemblies where coverage was decreased (Fig 7D). Hypothesizing that this might reflect collapsed repetitive elements located intronically within these genes, we ran RepeatMasker over the corresponding extracted genomic regions from the reference, SOAPdenovo2 23, 47, and 63, along with Platanus 20 (control). We found that while the reference assembly encodes a highly repetitive component (34.6%), the repetitive content of SOAPdenovo2 23, 47, and 63 was dramatically reduced (4.6%, 9.2%, and 9.2%, respectively); Platanus 20 (control) was 30.3% repetitive. Thus, while the Amino Acid Transport coding regions were expanded in length (Fig 7C), leading to PANTHER enrichment (Fig 7A), these genomic regions encode repeats which are collapsed, leading to higher heterozygosity (Fig 7D). Rather than reflecting a simple expansion or contraction (Fig 9), Amino Acid Transport-related genomic regions therefore reflect a combination of expansion and collapse. The reasons for this anomaly remain to be investigated in future work, especially given that the repetitive elements in these regions are unclassified by RepeatMasker. It is worth noting that the expansion of sequence encoding Amino Acid Transport-related genes and the collapse of repetitive elements should lead to compensatory changes in coverage and heterozygosity (i.e., increased lengths should decrease the apparent heterozygosity, while collapsed repeats should increase the coverage), but overall deviations from the reference are nevertheless detectable (Fig 7). Indeed, the extreme length extension (7-fold, Fig 7C) of the k-mer 23 assembly may explain its apparently low heterozygosity, offsetting the effect of its highly collapsed repeats (Fig 7D). These data suggest that, taken together, coverage and heterozygosity offer better information on genome assembly quality than coverage alone.

The extreme enrichment of heterozygosity in the category ‘response to heat’ for the SOAPdenovo2 23 assembly is particularly striking. While this would suggest collapse of the genes in this category relative to the reference genome, the expected decrease in sequence length was not observed (Fig 7C). However, to construct Fig 7C we required a BLASTn match of 98 percent identity or better between sequences, using blast_analysis.py (Fig 8). By relaxing this requirement to 80 percent identity we found a 3.57-fold contraction (43,083 bp from SOAPdenovo2 23 corresponding to 153,958 bp in the reference genome), which agrees with the 3.49-fold enrichment in heterozygosity (Fig 7D). (Read-mapping was performed with BWA-MEM, which does not invoke a percent-identity threshold.) Platanus step-size 7 represents a curious case: it is also depleted for the ‘response to heat’ category, but the increase in heterozygosity was only minor and coverage did not increase, suggesting these regions simply did not assemble well and were likely lost from the assembly when we filtered out contigs smaller than 1 kb, leaving the corresponding reads without a suitable target in the mapping step.

Conclusions

We show in this work that genome assemblies are extremely dependent on assembly parameters, particularly for data of moderate to high heterozygosity. Many of the deviations we uncovered are localized expansions or contractions that may not dramatically alter the overall assembly length (Fig 1), because expansions and contractions may offset each other. In other cases, however, extensive expansions led to a dramatic increase in assembly length (Fig 1), strongly impacting protein predictions (Fig 3A) and Gene Ontology enrichment for functional categories (Fig 7). Our work uncovers a poorly characterized category of misassembly that distorts genomic representation and can propagate into gene ontology or other downstream analyses, as we demonstrate here. Somewhat paradoxically, then, we show that greater fragmentation can correlate with duplication of specific sequences within the assembly. It remains unclear whether the duplication drives the fragmentation, or the reverse, but we speculate that relatively high heterozygosity forces assemblers to resolve unusually complex de Bruijn graphs, potentially causing both fragmentation and duplication errors.

The assembly variations we document here are not easily detected, particularly when assembling a genome for the first time. Of particular concern, we were able to find PANTHER-enriched functional categories caused by this phenomenon that were statistically significant at a false-discovery rate of 0.01 (Fig 7). The errors we reveal are not limited to one genome assembler, as we observed them with both SOAPdenovo2 and Platanus under specific k-mer or step-size settings, some of which yield reasonable N50 values. We also show that a dramatic excess of short, non-evidence-supported gene predictions may indicate assemblies that have failed to resolve heterozygous regions properly. We suggest that tracking heterozygosity along with coverage across the genome is likely to uncover assembly errors more accurately than coverage alone, particularly for highly heterozygous datasets.

Methods

K-mer analysis and error correction

K-mer analysis was conducted on raw Illumina DNA sequence reads from H. mephisto [26] with SOAPec v. 2.01 [31,32], using a k-mer size of 23 and a maximum read length of 215. We then corrected the reads with SOAPec v. 2.01, using a k-mer size of 23, a quality shift of 33, and -o set to 3 to obtain fastq output.

Assembly

Two de novo assemblers were used to test assembly quality under different parameters. Platanus v. 1.2.4 [10] was run with a starting k-mer of 21 and the maximum difference for bubble crush (u parameter) set to 1. The step sizes of k-mer extension tested were 1, 3, 5, 7, 10, 15, 20, and 30. These assemblies were scaffolded and gap-closed using Platanus. The second assembler, SOAPdenovo2 v. 2.04 [11], was run with the optional settings -R (resolve repeats by reads) and -F (fill gaps in the scaffold), and with merge similar sequence strength set to 3. The k-mers tested were 23, 47, and 63. Two separate size cut-offs of 200 bp and 1000 bp were applied prior to downstream analysis in all Platanus and SOAPdenovo2 assemblies. The reference sequence used throughout was generated by Platanus v. 1.2.4 with k-mer 21, u = 1, and step-size 2, and had been scaffolded, gap-closed, and then super-scaffolded with 30 lanes of Pacific Biosciences (PacBio) data [26].

Assembly quality and completeness assessment

RepeatMasker v. 4.0.8 [33] was used to determine the percentage of repetitive sequence, and a Python script (Python v2.7), getN50.py, was used to determine the N50, longest contig length, and total genome length of each assembly. These metrics were used for subsequent multivariate analysis. To assess completeness based on universal single-copy orthologs, we used BUSCO v3 [2,34] to compare each assembly to a published Nematoda dataset accessed through the BUSCO database (https://busco.ezlab.org/).
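
The N50 statistic reported by getN50.py follows the standard definition: the contig length such that contigs of that length or longer cover at least half of the total assembly. As an illustration only (this is not the published script), the calculation can be sketched in Python:

```python
def n50(contig_lengths):
    """Return N50: the length L such that contigs of length >= L
    together span at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # stop at the first contig whose cumulative span reaches 50%
        if running * 2 >= total:
            return length
    return 0
```

For example, for contigs of 100, 80, 50, 30, and 20 bp (280 bp total), the cumulative sum first reaches half the total at the 80 bp contig, so N50 = 80.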

Annotation

To obtain RNAseq evidence for annotation, Trinity v. 2.4.0 [35] was run along with Trimmomatic [36] on the RNAseq data from H. mephisto [26]. To annotate the H. mephisto genome, Maker v. 2.31.8 [29] utilized the RNAseq evidence, protein evidence from Caenorhabditis elegans [37], the RepeatMasker library, and gene predictions through SNAP and Augustus [38]. The alt_splice option in the maker_ctl file was set to 0 to ensure that unique genes were identified, not splice isoforms. Annotation was done for each of the different assemblies.

LAST

LAST v. 979 [28] was used to generate pairwise alignments comparing the sequence in each assembly filtered to 200 bp. Briefly, a database was created from the published H. mephisto assembly [26] using the lastdb command with the -cR01 option to soft-mask repeats. The last-train command, with parameters --revsym --matsym --gapsym -E0.05 -C2, was then used to train the aligner [39]. To generate pairwise alignments, the lastal command was used with parameters -m50 -E0.05 -C2 [40], with the last-split command to find split alignments and last-postmask to remove low-quality alignments. Last-dotplot was used with --sort1 = 3 --sort2 = 2 --strands1 = 0 --strands2 = 1 to visualize a 1:1 line and orient the assemblies. These parameters allowed visualization of the steepness/expansions within the aligned regions. Percent missing was calculated using a Python script, blast_analysis.py.

OrthoMCL

OrthoMCL v. 2.0.9 [30], the relational database MySQL v. 5.6 [41], and the clustering algorithm MCL [42] were run on a high-performance computer following the suggestions found on Biostars [43]. A Python script was used to replace the protein identifiers with a counter, starting at one. Unique identifiers were added to the beginning of each protein ID by running orthomclAdjustFasta before running orthomclFilterFasta. An all-vs-all BLASTp [44] of the good proteins was run using an e-value of 1e-5 and tabular output (-outfmt 6), and a Python script, rm_blast_redundancy.py, was used to remove duplicate hits. OrthoMCL was run with default settings from step eight through the rest of the pipeline [30]. RStudio v. 1.1.463 [45] was used to analyze and visualize the difference in proportion of grouped proteins. Geneious [46] was used to perform a MUSCLE v. 3.8.425 alignment [47] and generate a maximum likelihood phylogenetic tree using PhyML v. 3.0 [48] with 100 bootstraps.

Analysis of fragmentation vs. duplication proteome-wide

For the analysis of fragmentation vs. duplication, we evaluated the all-vs-all BLAST output (in -outfmt 6) from the 200 bp size-cutoff assemblies, with both evidence-based and non-evidence-based proteins combined. From this file, only rows using the reference as query were extracted with the Python script getRef.py. This file was then analyzed with the Python script parseOrthoMCL.py to extract the positions of matches to the reference proteins. In our analysis, non-overlapping matches from the same assembly were counted as fragmentation events, while N overlapping matches were counted as N-1 duplications and 1 correct match. Thus, two overlapping fragments counted as 1 duplication event and 1 correct match; three overlapping fragments counted as 2 duplications plus 1 correct match. To avoid conflating paralogy with duplication, we excluded all reference proteins that had a non-self BLAST match within the reference assembly (a paralog).
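
The counting rule above can be sketched as follows. This is an illustrative re-implementation of the logic, not the published parseOrthoMCL.py; the grouping of mutually overlapping matches into clusters is our reading of the rule, with non-overlapping clusters beyond the first counted as fragmentation events:

```python
def overlaps(a, b):
    """True if two (start, end) intervals on a reference protein overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def classify_matches(intervals):
    """Given the reference-protein coordinates matched by one assembly's
    proteins, return (fragments, duplications, correct).
    N mutually overlapping matches count as N-1 duplications plus 1
    correct match; additional non-overlapping clusters are fragments."""
    groups = []
    for iv in sorted(intervals):
        for g in groups:
            if any(overlaps(iv, other) for other in g):
                g.append(iv)
                break
        else:
            groups.append([iv])  # starts a new non-overlapping cluster
    duplications = sum(len(g) - 1 for g in groups)
    fragments = max(len(groups) - 1, 0)
    correct = 1 if groups else 0
    return fragments, duplications, correct
```

For example, two overlapping matches plus one separate match yield 1 fragmentation event, 1 duplication, and 1 correct match.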

PANTHER

A comprehensive analysis of protein representation among assemblies was completed using the Protein ANalysis THrough Evolutionary Relationships (PANTHER) system, v. 14.0 [49]. The sequences were scored against the PANTHER HMM library for analysis of gene function. This was done using the generic mapping protocol, referencing scripts and data with default program option B for best hit [50]. Each gene was assigned a unique PANTHER ID, and this output was then imported into the PANTHER database (www.pantherdb.org), along with the Platanus and PacBio reference assembly for comparison. PANTHER IDs of each assembly were then organized into five functional categories: pathways, molecular function, biological process, cellular component, and protein class. The PANTHER outputs were analyzed using the statistical overrepresentation test on the PANTHER database, with settings customized for the collection of raw p-values, which were then corrected using the Benjamini-Hochberg procedure to a false discovery rate of 0.01.

Heterozygosity analysis

Raw reads were mapped onto the Platanus and SOAPdenovo2 assemblies using BWA-MEM v. 0.7.12 [51] under default settings. Duplicates were removed using the markdup function in Samtools v. 1.9 [52]. Variants were called in BCFtools v. 1.9 [53] using the mpileup and call functions, with -v and -m set to output only variants and to use multiallelic calling. The overall heterozygosity for each genome was calculated with the custom Python script get_total_heterozygosity.py. Regional heterozygosity was measured on PANTHER-identified genes; their coordinates were extracted by BLASTn v. 2.2.30+ alignment of transcripts to the genome with tabular output (-outfmt 6). Using the custom Python script extract_regions.py we extracted intervals (individual exons) with ≥ 92% identity to the genome; for each gene the minimum and maximum values were used to define genomic intervals for the transcripts, and heterozygosity was reported for this region from the vcf file with a custom Python script, get_regional_heterozygosity.py. We excluded genes mapping to over 10 kb of genomic sequence as potential BLASTn mapping errors.
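
The interval-extraction step can be sketched as below. This is an illustrative re-implementation, not the published extract_regions.py; it assumes the default BLAST -outfmt 6 column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore):

```python
def gene_intervals(blast_outfmt6_rows, min_identity=92.0, max_span=10_000):
    """From tabular BLASTn rows, derive one genomic interval per
    (transcript, contig) pair: the min/max of all exon-level hits at
    >= min_identity, discarding spans over max_span as likely
    mapping errors."""
    coords = {}  # (transcript, contig) -> list of subject positions
    for row in blast_outfmt6_rows:
        fields = row.rstrip("\n").split("\t")
        qseqid, sseqid, pident = fields[0], fields[1], float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        if pident >= min_identity:
            coords.setdefault((qseqid, sseqid), []).extend([sstart, send])
    intervals = {}
    for key, positions in coords.items():
        lo, hi = min(positions), max(positions)
        if hi - lo + 1 <= max_span:  # drop suspiciously long mappings
            intervals[key] = (lo, hi)
    return intervals
```

Two exon hits for one transcript at 100–200 and 300–400 on the same contig, for instance, collapse to the single interval (100, 400).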

To compare our calculated overall heterozygosities with an established program's estimate, we used GenomeScope [27], which calculates overall heterozygosity from raw Illumina reads. To do this, the raw reads were run through Jellyfish v. 2.3.0 [54] under default settings to produce a histogram of k-mer frequencies, which was then input to GenomeScope, also run with default settings.

To compare the heterozygosity of the various assemblies to the reference sequence heterozygosity at the same region, we first extracted each assembly's nucleotide sequence for each PANTHER term evaluated, based on the coordinates identified above using BLASTn. These coordinates were read, along with the vcf file, by get_regional_heterozygosity.py (provided with supplemental data) to extract only SNP data from the specific genomic sequences and to calculate their heterozygosity by dividing the number of SNPs in the region by the length of the region. These nucleotide sequences were matched to the corresponding reference sequences using BLASTn with tabular output (-outfmt 6). The heterozygosity of the reference genome was then evaluated for these regions using the custom Python script reference_get_regional_heterozygosity.py, which only considered regions with ≥ 98% identity to prevent inappropriate comparison of paralogous regions. The heterozygosity of the region in question was compared to the heterozygosity of the reference region to obtain the values shown in Fig 7D (see Fig 8 for a schematic).
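
The core regional calculation (SNPs in the region divided by region length) can be sketched as follows. This is an illustrative re-implementation, not the published get_regional_heterozygosity.py; it assumes a plain-text VCF and counts only single-nucleotide variants:

```python
def regional_heterozygosity(vcf_path, contig, start, end):
    """Count SNP records within [start, end] on `contig` in a VCF file
    and divide by the region length to get per-basepair heterozygosity."""
    snps = 0
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):  # skip header lines
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            # count only single-nucleotide variants (exclude indels)
            if (chrom == contig and start <= int(pos) <= end
                    and len(ref) == 1
                    and all(len(a) == 1 for a in alt.split(","))):
                snps += 1
    return snps / (end - start + 1)
```

A region of 200 bp containing one SNP, for example, has a heterozygosity of 0.005; the ratio of this value to the matched reference-region value gives the comparison plotted in Fig 7D.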

To examine coverage, we used the Samtools depth function to obtain a text file of per-basepair depth from the same .bam file used for heterozygosity. By parsing this file along with the coordinates for each set of genes associated with a PANTHER term, we recorded the coverage for those regions using a custom Python script, parse_genes2.py. As with heterozygosity, we used BLASTn against the reference genome to identify corresponding regions, requiring 98% identity of the matching region, and extracted the coverage of the reference assembly with reference_get_regional_depth2.py. The coverage of those reference regions was used as the denominator to normalize the coverage of the region in question (Fig 8).
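
The coverage normalization can be sketched as below. This is an illustrative re-implementation rather than the published parse_genes2.py/reference_get_regional_depth2.py, assuming the three-column output of `samtools depth` (contig, position, depth):

```python
def mean_depth(depth_rows, contig, start, end):
    """Mean per-basepair depth over [start, end] on `contig`, from
    `samtools depth` rows of the form 'contig<TAB>position<TAB>depth'.
    Positions absent from the output contribute zero coverage."""
    total = 0
    for row in depth_rows:
        chrom, pos, depth = row.rstrip("\n").split("\t")
        if chrom == contig and start <= int(pos) <= end:
            total += int(depth)
    return total / (end - start + 1)

def normalized_coverage(assembly_depth, reference_depth):
    """Coverage of the query region relative to its matched reference
    region; values above 1 suggest regional collapse, below 1 expansion."""
    return assembly_depth / reference_depth
```

A region whose mean depth is twice that of the matched reference region, for instance, yields a normalized coverage of 2.0.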

To examine length we performed a similar calculation: extraction of the genomic regions gave the numerator, and BLASTn to the reference at 98% identity of the matching region was used to extract the corresponding reference length for the denominator (Fig 8). However, to avoid counting the same reference sequence multiple times in case of expansion in the assembly in question (the query), we counted each reference region uniquely, collapsing overlapping or redundant High-Scoring Pairs (HSPs) based on the coordinates they map onto. Thus, we only counted unique assembly sequences mapped onto unique reference sequences. Note that the dynamics of the assembly versus reference might still lead to expansion or contraction of either the query or reference, but those differences result from actual changes to the assembly, not from multiple counts of BLASTn outputs.
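
Collapsing overlapping or redundant HSPs reduces to standard interval merging; a minimal sketch (illustrative only, not the published scripts):

```python
def unique_reference_length(hsps):
    """Total reference bases covered by a set of HSP coordinates
    (start, end), counting bases covered by overlapping or redundant
    HSPs only once."""
    merged = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1] + 1:
            # overlaps (or abuts) the previous interval: extend it
            # rather than double-counting the shared bases
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(end - start + 1 for start, end in merged)
```

For example, HSPs at 1–100 and 50–150 merge into a single 150 bp interval, so together with an HSP at 200–250 the unique reference length is 201 bp, not 302 bp.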

Supporting information

S1 Table. Summary of sequences missing from each assembly at 200 bp cut off by comparison with reference assembly and estimates of percentage regionally expanded sequence.

Also includes N50 (nt), total assembly lengths, heterozygosity levels, and missing by LAST.

(XLSX)

S2 Table. Summary of sequences missing from each assembly at 1000 bp cut off by comparison with reference assembly and estimates of percentage regionally expanded sequence.

Also includes N50 (nt), total assembly lengths, heterozygosity levels, missing by LAST, and RepeatMasker estimation of repetitive content.

(XLSX)

S3 Table. Summary of protein analysis from each assembly at 200 bp cut off.

Listed are evidence versus non-evidence based proteins by number of proteins, mean, median, and standard deviation of protein length.

(XLSX)

S4 Table. Summary of protein analysis from each assembly at 1000 bp cut off.

Listed are evidence versus non-evidence based proteins by number of proteins, mean, median, and standard deviation of protein length.

(XLSX)

S5 Table. PANTHER results of significantly enriched or depleted proteins and their corresponding assembly method (all results shown are significant to Benjamini Hochberg corrected FDR< 0.01).

(XLSX)

Acknowledgments

This manuscript originated in a Computational Genomics course taught in spring 2019 at American University, so the authors wish to acknowledge the University's support for the class, including computing resources and personnel enabling a flipped-classroom instructional implementation. Computing resources used for this work were provided by the American University High Performance Computing System, which is funded in part by the National Science Foundation (BCS-1039497). The authors wish to thank Dr. David Gerard for statistical guidance.

Data Availability

All Python scripts are available from GitHub (https://github.com/brachtlab/Regional_heterozygosity). The raw Illumina DNA and PacBio DNA data are available on the Sequence Read Archive (SRA) at accession PRJNA528747. The assemblies are available on Zenodo at https://zenodo.org/record/3738267#.Xp4Ok9NKgq9.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Baker M. De novo genome assembly: what every biologist should know. Nat Methods. 2012;9: 333–337. doi:10.1038/nmeth.1935
2. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31: 3210–3212. doi:10.1093/bioinformatics/btv351
3. Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14: R47. doi:10.1186/gb-2013-14-5-r47
4. Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013;14: 157–167. doi:10.1038/nrg3367
5. Yandell M, Ence D. A beginner's guide to eukaryotic genome annotation. Nat Rev Genet. 2012;13: 329–342. doi:10.1038/nrg3174
6. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19: 455–477. doi:10.1089/cmb.2012.0021
7. Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17: 1519–1533. doi:10.1089/cmb.2009.0238
8. Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol Biol. 2013;8: 22. doi:10.1186/1748-7188-8-22
9. Haider B, Ahn T-H, Bushnell B, Chai J, Copeland A, Pan C. Omega: an overlap-graph de novo assembler for metagenomics. Bioinformatics. 2014;30: 2717–2722. doi:10.1093/bioinformatics/btu395
10. Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 2014;24: 1384–1395. doi:10.1101/gr.170720.113
11. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1: 18. doi:10.1186/2047-217X-1-18
12. Ribeiro FJ, Przybylski D, Yin S, Sharpe T, Gnerre S, Abouelleil A, et al. Finished bacterial genomes from shotgun sequence data. Genome Res. 2012;22: 2270–2277. doi:10.1101/gr.141515.112
13. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19: 1117–1123. doi:10.1101/gr.089532.108
14. Chaisson MJP, Wilson RK, Eichler EE. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet. 2015;16: 627–640. doi:10.1038/nrg3933
15. Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, et al. Ten steps to get started in Genome Assembly and Annotation. F1000Res. 2018;7. doi:10.12688/f1000research.13598.1
16. Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008;9: R55. doi:10.1186/gb-2008-9-3-r55
17. Pryszcz LP, Gabaldón T. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res. 2016;44: e113. doi:10.1093/nar/gkw294
18. Chen Y-C, Liu T, Yu C-H, Chiang T-Y, Hwang C-C. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS One. 2013;8: e62856. doi:10.1371/journal.pone.0062856
19. Endrullat C, Glökler J, Franke P, Frohme M. Standardization and quality management in next-generation sequencing. Appl Transl Genom. 2016;10: 2–9. doi:10.1016/j.atg.2016.06.001
20. Tigano A, Sackton TB, Friesen VL. Assembly and RNA-free annotation of highly heterozygous genomes: The case of the thick-billed murre (Uria lomvia). Molecular Ecology Resources. 2018: 79–90. doi:10.1111/1755-0998.12712
21. Hutchison CA 3rd. DNA sequencing: bench to bedside and beyond. Nucleic Acids Res. 2007;35: 6227–6237. doi:10.1093/nar/gkm688
22. Morozova O, Marra MA. Applications of next-generation sequencing technologies in functional genomics. Genomics. 2008: 255–264. doi:10.1016/j.ygeno.2008.07.001
23. Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet. 2011;52: 413–435. doi:10.1007/s13353-011-0057-x
24. Schuster SC. Next-generation sequencing transforms today's biology. Nat Methods. 2008;5: 16–18. doi:10.1038/nmeth1156
25. Borgonie G, García-Moyano A, Litthauer D, Bert W, Bester A, van Heerden E, et al. Nematoda from the terrestrial deep subsurface of South Africa. Nature. 2011;474: 79–82. doi:10.1038/nature09974
26. Weinstein DJ, Allen SE, Lau MCY, Erasmus M, Asalone KC, Walters-Conte K, et al. The genome of a subterrestrial nematode reveals adaptations to heat. Nat Commun. 2019;10: 5268. doi:10.1038/s41467-019-13245-8
27. Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33: 2202–2204. doi:10.1093/bioinformatics/btx153
28. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21: 487–493. doi:10.1101/gr.113985.110
29. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18: 188–196. doi:10.1101/gr.6743907
30. Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13: 2178–2189. doi:10.1101/gr.1224503
31. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20: 265–272. doi:10.1101/gr.097261.109
32. Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform. 2013;14: 56–66. doi:10.1093/bib/bbs015
33. Smit AFA, Hubley R. RepeatModeler Open-1.0. 2008. Available from http://www.repeatmasker.org
34. Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol Biol Evol. 2018;35: 543–548. doi:10.1093/molbev/msx319
35. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8: 1494–1512. doi:10.1038/nprot.2013.084
36. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30: 2114–2120. doi:10.1093/bioinformatics/btu170
37. The C. elegans Sequencing Consortium. Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology. Science. 1998;282: 2012–2018. doi:10.1126/science.282.5396.2012
38. Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004;32: W309–12. doi:10.1093/nar/gkh379
39. Hamada M, Ono Y, Asai K, Frith MC. Training alignment parameters for arbitrary sequencers with LAST-TRAIN. Bioinformatics. 2017;33: 926–928. doi:10.1093/bioinformatics/btw742
40. Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics. 2010;11: 80. doi:10.1186/1471-2105-11-80
41. Widenius M, Axmark D, MySQL AB, Arno K. MySQL Reference Manual: Documentation from the Source. O'Reilly Media, Inc; 2002.
42. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30: 1575–1584. doi:10.1093/nar/30.7.1575
43. Biostars: orthomcl with local mysql server on linux server, complete install. [cited 21 Jan 2020]. Available: https://www.biostars.org/p/120773/
44. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215: 403–410. doi:10.1016/S0022-2836(05)80360-2
45. RStudio Team. RStudio: integrated development for R. Boston, MA: RStudio, Inc; 2015.
46. Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28: 1647–1649. doi:10.1093/bioinformatics/bts199
47. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32: 1792–1797. doi:10.1093/nar/gkh340
48. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52: 696–704. doi:10.1080/10635150390235520
49. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13: 2129–2141. doi:10.1101/gr.772403
50. Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2013;41: D377–86. doi:10.1093/nar/gks1118
51. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25: 1754–1760. doi:10.1093/bioinformatics/btp324
52. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25: 2078–2079. doi:10.1093/bioinformatics/btp352
53. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27: 2987–2993. doi:10.1093/bioinformatics/btr509
54. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27: 764–770. doi:10.1093/bioinformatics/btr011
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008104.r001

Decision Letter 0

Christos A Ouzounis, William Stafford Noble

8 Jan 2020

Dear Dr Bracht,

Thank you very much for submitting your manuscript 'Regional sequence expansion or collapse in heterozygous genome assemblies' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible to show clearly where changes have been made to their manuscript e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Christos A. Ouzounis

Associate Editor

PLOS Computational Biology

William Noble

Deputy Editor

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: OVERVIEW

The development of efficient and inexpensive next-generation genome sequencing has enabled an explosion of new genome sequences for 'non-model' organisms. Such organisms are either not much studied in laboratories as a matter of past custom, or not practically feasible for such study, but their genomes are of biological importance nevertheless. By their nature, such organisms are often outbred with substantial genomic diversity. Even for small sample populations used to extract the genomic DNA for sequencing and assembly, levels of genetic heterogeneity can reach the 'hyperdiverse' level (e.g., 7% variation in non-selected nucleotides of the nematode Caenorhabditis brenneri), and datasets almost always contain substantial amounts of unresolved allelism.

Asalone et al. have recently characterized the genome for a non-model but biologically interesting subterranean nematode, Halicephalobus mephisto. In so doing, they have identified a potential artifact of genomic analysis that to my knowledge has not been previously described: depending on fine details of the genome assembly programs and parameters used, different regions of the genome encoding gene families with biologically interesting functions can assemble in two different ways. They can either assemble so that two or more alleles are compressed in silico into a single sequence, or instead be assembled so that two or more alleles are artifactually linked in a tandem sequence array. Given heterozygosity throughout a genome, such variable compression or tandem expansion can have a visible effect on what genes are predicted for a genome, with expansions or compressions of a gene family having downstream effects on its biological function being scored as over- or under-represented among the protein-coding genes of that genome.

The authors compare assemblies from Platanus versus SOAPdenovo2 versus their best reference assembly (generated by Platanus with PacBio long-read superscaffolding). They observe a general tendency for genome assembly regions with lower polymorphism (assayed by raw Illumina reads mapped to a given assembly) to correlate with greater length. They observe striking differences between assemblies in how heterozygous regions are represented, despite those assemblies having considerably more similar total lengths. The authors conclude that, in assessing quality of a genome assembly, it is not sufficient to look at the size-weighted median of its scaffold or contig sizes (i.e., its N50 score); it is also advisable to assess its degree of sequence coverage and heterozygosity, with caution being exercised for regions of the assembly showing abnormally high or abnormally low heterozygosity (and, in parallel, abnormally low or abnormally high coverage).
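For readers unfamiliar with the metric the reviewer parenthetically defines, N50 (the size-weighted median of scaffold or contig sizes) takes only a few lines to compute. A minimal sketch, with an illustrative length list rather than data from the manuscript:

```python
def n50(lengths):
    """N50: the length L such that scaffolds/contigs of length >= L
    together account for at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Total is 30; the 10 + 8 prefix is the first to reach half of it.
print(n50([10, 8, 5, 4, 3]))  # 8
```

Note that discarding short scaffolds before computing N50 inflates the value, since the smallest lengths are removed from the tally, which is why the reviewer cautions that the reported N50s were presumably computed after the <1 kb filter.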

One general result: given heterozygous sequence data, Platanus seriously outperforms SOAPdenovo2. The numbers in Supplemental Table 1 make that quite clear. Although the authors do not provide results for other mainstream short-read assemblers comparable to SOAPdenovo2 (e.g., ABySS 2), their results make it advisable that researchers assembling short reads from a heterozygous organism use a heterozygosity-aware program such as Platanus.

Although the general points of the paper are well-taken, I have some specific questions and caveats about it, along with some suggested revisions.

SPECIFIC QUESTIONS AND CAVEATS

An inconspicuous-looking point of the Methods may be driving nontrivial amounts of the differences between how different genome assemblies are scored for completeness: the authors have imposed a minimum scaffold/contig size of 1,000 nt for all of their competing assemblies. This is likely to be harmless for those assemblies with high N50 values, but may be leading to substantial losses of sequence information for those four assemblies with low N50 values (Platanus step-size 1, N50 = 2.8 kb; SOAPdenovo2 k-mer 23, N50 = 2.7 kb; SOAPdenovo2 k-mer 47, N50 = 1.9 kb; and SOAPdenovo2 k-mer 63, N50 = 1.9 kb) -- particularly when one considers that the N50 values given in Fig. 1 and Supp. Tab. 1 were computed for these assemblies *after* scaffolds/contigs of <1 kb had been discarded. If the authors had performed their analyses on assemblies that had had a less stringent minimum size filter (such as 200 nt), how much would the downstream results change? This question clearly has to affect BUSCO scores (Figure 1), but could conceivably also affect evidence-based annotation of genes (Figure 3) and homology of genes to other genes (Figure 4), since assemblies with low N50 values are likely to have fragmented or partial gene predictions.

At crucial points in their Methods -- specifically, when they compute heterozygosity levels for an entire genome assembly, or for particular genes within that assembly -- the authors invoke nameless "custom python scripts". Given the central importance of this computation to their work, this is entirely unacceptable. Each Python script used in the work must be given a name in the Methods and must be explicitly available through either github or some equivalently useful public software repository. Note: I am aware that the authors have written "All python scripts are available from the github database (repository: "Name TBD").", but that is not enough!

The authors cite results based on 11 alternative (non-reference) genome assemblies for H. mephisto. It would be preferable if these genome assemblies were themselves publicly available in some data repository. One data repository that works quite well for permanent archiving of such data is the Open Science Foundation (https://osf.io). Other options are Figshare (https://figshare.com) and Zenodo (https://zenodo.org).

The authors have devised their own tools for making either genome-wide or regional estimates of nucleotide heterozygosity. This is ingenious and potentially valuable to other researchers. However, there already exists a published open-source program for estimating overall heterozygosity of a given organism, directly from that organism's raw Illumina sequence read set: GenomeScope (https://github.com/schatzlab/genomescope.git and https://academic.oup.com/bioinformatics/article/33/14/2202/3089939). I think it would be highly desirable for the authors to compute heterozygosities for H. mephisto from their raw Illumina sequence reads using either GenomeScope or an equivalent k-mer analysis tool, and then for them to compare the heterozygosity score generated with one of these tools versus their own results.

SUGGESTED REVISIONS

The authors had no page numbers in their manuscript. Next time, please have them! Page numbers in manuscripts help readers (even though the readers in this case will be a small number of editors and reviewers). In this case, for clarity while reviewing, I am providing page numbers using my own count (with the title and abstract being on page 1).

Page 4 --

"(Borgonie et al., 2011)": although cited in the text, this was not included in the References on pp. 18-22. I assume that the authors meant Borgonie et al. (2011), Nature 474, 79-82, PubMed 21637257. Please add this reference to the References; more importantly, please proofread the entire manuscript to ensure that there are no other missing references cited anywhere.

Page 5 --

Legend for Figure 1: "N50, heterozygosity, and the BUSCO results." I would prefer something like "N50 (in nt), heterozygosity (as defined in Methods), and the BUSCO results." As it stands, the reader is left to guess what the measurement unit for N50 is, and to wonder where the heterozygosity comes from. It will be good for readers to understand that the authors are using their own methods of computing heterozygosity rather than using previously published methods.

Page 6 --

"We found that N50 is highly correlated with evidence-supported genes predicted..." What are the mean and median sizes of protein-coding sequences for these genes, and how do they vary with respect to assembly N50? It is a long-known problem in genome analysis that assemblies with low N50 values result in gene predictions that are fragmentary or partial; fragmentary or partial gene predictions, in turn, may lower the rate at which genes are scored as evidence-supported. (The same caveat also applies to Figures 3 and 4, which are cited at this point in the text.)

"However, we found that..." To avoid awkwardly starting two sentences in a row with "However", I suggest that this instance of "However" be replaced with something like "Nevertheless".

Page 7 --

"we extracted the second-largest group of proteins": why was the *second*-largest group chosen? Why not the first, or the third? The answer could go here or in Methods.

Page 10 --

The authors write: "We would predict that if an assembler maximally 'spreads out' the variation within a dataset into distinct contigs, coverage and length assembled would go up, while heterozygosity would go down as the reads are able to find their perfect match."

Unless I have misunderstood the argument of this paper badly, this is not quite correct, and they should have instead written: "We would predict that if an assembler maximally 'spreads out' the variation within a dataset into distinct contigs, length assembled would go up, while coverage and heterozygosity would go down as the reads are able to find their perfect match."

Page 11 --

"smaller than 1kb" should be "smaller than 1 kb" (i.e., do not fuse a number and its measurement unit).

Page 12 --

"These assembly variations are not easily detected particularly when assembling a genome for the first time" should read "These assembly variations are not easily detected, particularly when assembling a genome for the first time".

Pages 12 and 13 --

"sequences lower than 1000bp were removed prior to subsequent analysis", and "Sequences smaller than 1000bp were removed from these assemblies prior to downstream analysis". First, replace '1000bp' with '1000 bp'. Second, this filtering step can have strong and differential effects on genome assembly analysis. Consider the assembly N50s listed in both Figure 1 and Supplemental Table 1. For the reference genome (N50 = 313 kb), the effect of discarding scaffolds or contigs of under 1 kb will be slight -- almost all of the assembly will be over that threshold anyway. However, for four of the most fragmented genome assemblies (Platanus step-size 1, N50 = 2.8 kb; SOAPdenovo2 k-mer 23, N50 = 2.7 kb; SOAPdenovo2 k-mer 47, N50 = 1.9 kb; and SOAPdenovo2 k-mer 63, N50 = 1.9 kb), filtering out sequences of <1 kb is likely to be substantially depleting genomic contents -- particularly since these low N50s were presumably computed *after* sequences of <1 kb had been filtered out.

Given that the authors observe profound drops in their %BUSCO scores for these very same four assemblies (Figure 1 and Supp. Table 1), it is difficult not to suspect that they might have observed significantly better %BUSCO scores if they had adopted a somewhat smaller minimum scaffold/contig size (say, 200 nt instead of 1,000 nt). That, in turn, raises the question of how many *other* results in this paper would be significantly changed if the minimum size had been so lowered.
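The re-filtering experiment the reviewer proposes is straightforward to run. A minimal sketch of a FASTA minimum-length filter follows (the function name and file handling are illustrative only; command-line tools such as seqkit offer the same operation):

```python
def filter_fasta(in_path, out_path, min_len=200):
    """Write only sequences of at least min_len bp to out_path.
    Minimal FASTA parser; assumes headers start with '>'."""
    def records(handle):
        header, seq = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

    with open(in_path) as fin, open(out_path, "w") as fout:
        for header, seq in records(fin):
            if len(seq) >= min_len:
                fout.write(f"{header}\n{seq}\n")
```

Re-running the downstream analyses on assemblies filtered at 200 nt versus 1,000 nt would directly test how much the size cutoff, rather than the assembler, drives the reported differences.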

Page 14 --

"H. Mephisto" should read "H. mephisto".

Page 15 --

"SamTools" should probably be written "Samtools" (following how it is written on the author's main software page -- see http://www.htslib.org).

"BCFTools" should be written "BCFtools" (again, following http://www.htslib.org).

"variants were called using the mpileup and call function" should read 'functions', not 'function'; also, from exactly which software suite were these functions taken? The way the sentence is written, it is not clear whether they are from Samtools or BCFtools.

"10kb" should be "10 kb".

Page 16 --

"(Note that the dynamics..." starts with a parenthesis ['('], but does not close with one [')'].

Figure 1 --

Please revise the header "N50" to "N50 (nt)", so that the reader knows what size the N50s are in.

Please *add* a column for total genome assembly sizes (i.e., total genome assembly lengths). I know that these data are in Supplemental Table 1, but I think they would be significantly useful in Figure 1, which is what most readers will see. The genome assemblies should be rounded to 0.1 Mb, and the header should be something like "Genome size (Mb)".

Figures 3 and 4 --

For genes predicted in the various H. mephisto assemblies, these two figures show quite different rates of evidence-association (as scored by MAKER; Figure 3) and homology to other genes (as scored by OrthoMCL; Figure 4). The authors note that different assemblies can have similar numbers of predicted genes, but quite different values for evidence-association or homology. However, they do not show whether these genes vary in the mean or median length of their protein-coding sequences; yet it is quite likely that the four genome assemblies with lowest N50 values (under 3.0 kb) will have significant numbers of truncated or partial gene predictions, which may well affect both assays. I would like to see the authors address this point in some reasonable way.

Figure 4 --

This figure shows different assemblies as "Platanus", "Platanus and PacBio", or "SOAPdenovo2". However, I would prefer to have individual labels next to each glyph, specifying exactly which assembly is associated with each data point in the figure (for instance, *which* Platanus assembly gave rise to the unpromising data point with only ~7,250 predicted proteins and ~0.93 proportion grouped?).

Also, the x-axis lists "proteins". However, not all gene prediction methods give exactly one predicted protein isoform per gene; my guess is that there is such a relationship, in this instance, but my guess could be wrong. The authors should make it clear in the legend for this figure that there is (or, is not) a one-to-one relationship between proteins in this figure and genes predicted in the various assemblies.

Supplemental Table 2 --

Here, it would be good to add a column for the value "Observed/Expected" (i.e., the ratio of the existing "Observed Hits" and "Expected Hits" columns.) Adding such a column would allow readers to sort the Excel spreadsheet by this ratio, and thus get a clear view of which particular PANTHER functions are either most overrepresented or most underrepresented by the various genome assemblies. (They can already use the 'sort' function in Excel to reorder the PANTHER functions by ascending "Raw P-value" scores, and thus get a clear view of which over- or under-represented functions are most statistically significant.)
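The suggested ratio is trivially computed outside Excel as well. A minimal sketch in Python (only the "Observed Hits" and "Expected Hits" column names come from the table; the row values are invented for illustration):

```python
# Hypothetical rows mirroring the table's columns; values are made up.
rows = [
    {"PANTHER function": "kinase",      "Observed Hits": 30, "Expected Hits": 10.0},
    {"PANTHER function": "protease",    "Observed Hits": 5,  "Expected Hits": 20.0},
    {"PANTHER function": "transporter", "Observed Hits": 12, "Expected Hits": 12.0},
]
for row in rows:
    row["Observed/Expected"] = row["Observed Hits"] / row["Expected Hits"]

# Sorting by the ratio surfaces the most over-represented functions first.
rows.sort(key=lambda r: r["Observed/Expected"], reverse=True)
for row in rows:
    print(row["PANTHER function"], round(row["Observed/Expected"], 2))
```

Ratios above 1 mark over-represented functions; ratios below 1 mark under-represented ones, complementing the p-value ordering the reviewer describes.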

Reviewer #2: In this manuscript, Asalone et al. examine the effects of assembler choice and parameter values on genome assembly of diploid genomes with high levels of heterozygosity. Specifically, they examine assemblies generated for a nematode species, Halicephalobus mephisto, using two different assemblers (Platanus and SOAPdenovo2) with various parameter settings. Assemblies are compared with a reference assembly generated using additional PacBio data and the Platanus assembler. Assemblies are evaluated with BUSCO, alignments with the reference genome, numbers of predicted protein-coding genes, and enrichment/depletion analysis of protein function groups with respect to the reference genome protein set. The overall conclusion of this work is that assemblies can vary significantly in erroneously expanded or contracted regions even if other measures of assembly quality are consistently good.

The topic of assembly accuracy in the presence of high heterozygosity is an important one and thus this is a welcome contribution. While the overall conclusions of the paper are supported by the experiments, I found the experiments and methods to be confusing and perhaps overly complicated.

Specific comments:

1. Nowhere in the manuscript is there a description of the underlying data that were assembled. After some digging through the references, I'm assuming it was the Illumina data described in Weinstein et al. 2019, but this needs to be clear and explicit in this paper. There is also mention of "RNA from H. mephisto", by which I'm assuming the authors mean RNA-seq data, but there is no description of these data anywhere.

2. It seems troubling that one of the assemblers evaluated was the same one used to generate the "reference" assembly. And as I understand it, PacBio reads were only used for scaffolding this reference assembly, and not for constructing the original contigs, and thus erroneous expansions or contractions made by Platanus on the Illumina data are not necessarily corrected by the PacBio data in this reference assembly. This issue needs clarification and discussion in this manuscript. In particular, an assembly that appears to have an enrichment or depletion of a certain protein functional group relative to the reference is not necessarily less accurate, because the reference may (perhaps equally likely) be in error with respect to this group.

3. The evaluation of expansion/contraction via enrichment/depletion of functional groups seems more indirect and complicated than necessary. Why not simply align the genomes (gene sets) pairwise to the reference and quantify how many genes/regions are expanded/contracted with respect to the reference? One would expect only expansion/contraction of highly-similar sequences, not of broad functional categories of proteins.

4. There is no logic given for why an assembly with a high (or low?) proportion of grouped proteins by OrthoMCL would be better/worse than another assembly.

5. Please provide a definition for an "evidence-supported gene"

6. I have never heard of an isolog or iso-ortholog. Perhaps simply one-to-one ortholog can be used instead.

7. Please describe early on how heterozygosity is defined/measured in these genome assemblies.

8. Fig 2 - the dot plots are not very informative. They would be greatly improved if assembly contigs were ordered and oriented according to the reference.

9. Fig 5 - there are so few points here - just show the points instead of a box plot.

10. The Borgonie et al. 2011 reference seems to be missing.

11. Benchmarking *University* Single Copy Orthologs => universal

12. The GitHub link to the software/scripts used is not provided.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008104.r003

Decision Letter 1

Christos A Ouzounis, William Stafford Noble

25 May 2020

Dear Dr. Bracht,

Thank you very much for submitting your manuscript "Regional sequence expansion or collapse in heterozygous genome assemblies" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Christos A. Ouzounis

Associate Editor

PLOS Computational Biology

William Noble

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I am fully satisfied with the response of the authors to my previous comments, and have no further questions or suggested revisions.

Reviewer #2: With this revision, the authors have satisfactorily addressed the majority of my previous comments. However, I continue to be of the opinion that a number of the analyses in this work are rather indirect and difficult to interpret.

1. In particular, the PANTHER analysis of enrichment/depletion of protein functional categories and the OrthoMCL grouping analysis are hard to interpret with regard to the quality of the assemblies. Consider one protein-coding gene in the reference genome and its assembly with one of the alternative assemblers or assembler parameters. There are many ways in which this gene might be assembled, but consider two simple erroneous cases: (1) the gene has two copies in the assembly and (2) the gene is fragmented into two non-overlapping pieces. In both cases, assuming protein-coding components can be detected in all contigs, there is an effective doubling of the gene, but only the former is truly an "expansion" of the gene in the assembly. It does not seem that either the PANTHER or OrthoMCL analyses can distinguish between these possibilities and thus the interpretation of their results is difficult. The OrthoMCL analysis is particularly hard to understand because an assembly that erroneously produced two copies of every gene would result in 100% grouping (because the two copies of each gene would fall into the same group), whereas an assembly that fragments each gene into many non-overlapping pieces that cannot be confidently aligned would have a much lower % grouping. This seems to be a roundabout way of assessing fragmentation but says little about expansion/contraction, which is the focus of the manuscript.

2. The LAST analysis (alignment of each assembly to the reference) and associated Figure 2 is a much more direct and easier-to-interpret method of understanding expansion/contraction in an assembly compared to a reference. I recommend that the authors expand on this analysis. Briefly, LAST can be used to identify the *single best place* in the reference to align each component of an assembly. I believe the authors are already using LAST for this purpose. Then, for each position in the reference genome, one can count how many positions in the assembly are aligned to it. The distribution of these counts is highly informative: the positions with zero alignments are "missing" (perhaps due to contraction) and positions with more than one alignment are duplicated/expanded in the assembly. This should be simple to implement and more directly assesses expansion/contraction/missing-ness than much of the rest of the analyses.
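The counting procedure the reviewer outlines can be sketched directly. A minimal illustration, assuming alignments have already been reduced to one best placement per assembly segment (e.g., with last-split) and expressed as half-open reference intervals; all names and coordinates below are invented:

```python
from collections import Counter

def reference_coverage(ref_lengths, alignments):
    """Count, for each reference position, how many assembly segments
    align there. ref_lengths: {name: length}. alignments: iterable of
    (ref_name, start, end) half-open intervals, one best placement per
    assembly segment. Returns {ref_name: Counter{depth: n_positions}}.
    (For real genomes, a numpy array per contig would be more memory-
    and time-efficient than these plain lists.)"""
    depth = {name: [0] * length for name, length in ref_lengths.items()}
    for name, start, end in alignments:
        for pos in range(start, end):
            depth[name][pos] += 1
    return {name: Counter(cov) for name, cov in depth.items()}

hist = reference_coverage(
    {"chr1": 10},
    [("chr1", 0, 6), ("chr1", 4, 8)],
)
# Positions 0-3 have depth 1, 4-5 depth 2, 6-7 depth 1, 8-9 depth 0.
print(hist["chr1"])  # Counter({1: 6, 2: 2, 0: 2})
```

In the resulting histogram, mass at depth 0 flags reference sequence missing from (or collapsed in) the assembly, while mass at depths greater than 1 flags duplicated/expanded sequence, which is exactly the distinction the reviewer asks for.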

3. Related to point 2 above, Figure 2 is quite important and could be improved. With a few tweaks, it can visually display expansions ("steeper" diagonals) and contractions/missing-ness ("less steep" diagonals). Suggested improvements are:

a. Clarify whether this is for the 200 bp or 1000 bp filtered assemblies. I would suggest using the 200 bp assemblies so that one can still see if an assembly is relatively "complete" even if highly fragmented.

b. Keep the x-axis constant across all plots. It currently seems to be changing slightly between plots, which is misleading. All contigs in the reference should be plotted such that contigs that are missing in the assembly can be seen.

c. Include all contigs in the assembly on the y-axis, regardless of whether they have an alignment to the reference. That way one can visually see (1) how large the assembly is and (2) the fraction of the assembly that doesn't align anywhere in the reference.

d. Make sure the scales are the same on both x and y axes. I believe this may already be the case, which is great. This is important for interpreting the "steepness" of the diagonals.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008104.r005

Decision Letter 2

Christos A Ouzounis, William Stafford Noble

29 Jun 2020

Dear Dr. Bracht,

We are pleased to inform you that your manuscript 'Regional sequence expansion or collapse in heterozygous genome assemblies' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Christos A. Ouzounis

Associate Editor

PLOS Computational Biology

William Noble

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I have reviewed the latest version, and am satisfied with it.

While reading it, I observed two minor possible corrections of the text.

1. In the Abstract, "downstream analyses, yet is common" might better read "downstream analyses, yet are common" (since 'are' is plural, it agrees with the preceding plural noun "High levels").

2. On page 5, "from the raw Illuminia reads" should read "from the raw Illumina reads" (i.e., "Illuminia" is a typo).

Reviewer #2: The authors have sufficiently addressed my previous comments. My only suggestion is to replace (or swap) Figure 2 with Supplementary Table 1, as it is hard to interpret the Oxford Grids when the set of reference contigs displayed changes from plot to plot (I disagree that the x-axis is not changing). If Fig 2 is retained as is, I would suggest adding some text to the caption to guide the reader in its interpretation (e.g., steepness of diagonals).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008104.r006

Acceptance letter

Christos A Ouzounis, William Stafford Noble

23 Jul 2020

PCOMPBIOL-D-19-01915R2

Regional sequence expansion or collapse in heterozygous genome assemblies

Dear Dr Bracht,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Sarah Hammond

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Summary of sequences missing from each assembly at 200 bp cutoff by comparison with reference assembly and estimates of percentage regionally expanded sequence.

    Also includes N50 (nt), total assembly lengths, heterozygosity levels, and missing by LAST.

    (XLSX)

    S2 Table. Summary of sequences missing from each assembly at 1000 bp cutoff by comparison with reference assembly and estimates of percentage regionally expanded sequence.

    Also includes N50 (nt), total assembly lengths, heterozygosity levels, missing by LAST, and RepeatMasker estimation of repetitive content.

    (XLSX)

    S3 Table. Summary of protein analysis from each assembly at 200 bp cutoff.

    Listed are evidence-based versus non-evidence-based proteins, with the number of proteins and the mean, median, and standard deviation of protein length.

    (XLSX)

    S4 Table. Summary of protein analysis from each assembly at 1000 bp cutoff.

    Listed are evidence-based versus non-evidence-based proteins, with the number of proteins and the mean, median, and standard deviation of protein length.

    (XLSX)

    S5 Table. PANTHER results of significantly enriched or depleted proteins and their corresponding assembly method (all results shown are significant at a Benjamini-Hochberg corrected FDR < 0.01).

    (XLSX)

    Attachment

    Submitted filename: Response to Reviews of PLOS Computational Biology.pdf

    Attachment

    Submitted filename: Response to Reviews v3 (1).pdf

    Data Availability Statement

    All Python scripts are available from the GitHub repository (https://github.com/brachtlab/Regional_heterozygosity). The raw Illumina DNA and PacBio DNA data are available on the Sequence Read Archive (SRA) at accession PRJNA528747. The assemblies are available on Zenodo at https://zenodo.org/record/3738267#.Xp4Ok9NKgq9.

