New Phytologist Supporting Information

The genome of a nonphotosynthetic diatom provides insights into the metabolic shift to heterotrophy and constraints on the loss of photosynthesis

Anastasiia Onyshchenko, Wade R. Roberts, Elizabeth C. Ruck, Jeffrey A. Lewis, and Andrew J. Alverson

Article acceptance date: 3 August 2021

Supporting Information Methods S1

Assembly and filtering of the Nitzschia sp. Nitz4 nuclear genome

A total of 15.4 GB of 100-bp paired-end DNA reads were recovered and used to assemble the nuclear, plastid, and mitochondrial genomes of Nitzschia. We used FastQC version 0.11.5 (Andrews, 2010) to check read quality and then used ACE (Sheikhizadeh & de Ridder, 2015) to correct predicted sequencing errors. We subsequently trimmed and filtered the reads with Trimmomatic version 0.32 (Bolger et al., 2014) with settings “ILLUMINACLIP:<TruSeq_adapters.fa>:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:30 TOPHRED64.” A detailed summary of genome assembly and filtering is available in Supporting Information Methods S1.

We used SPAdes version 3.12.0 (Bankevich et al., 2012) with default parameter settings and k-mer sizes of 21, 33, and 45 to assemble the genome, then estimated read coverage across the genome by aligning all reads to the scaffolds with BWA-MEM (Li & Durbin, 2009). We used Blobtools (Laetsch & Blaxter, 2017) to identify and remove putative contaminants from the assembly. Blobtools uses a combination of GC percentage, taxonomic assignment, and read coverage to identify potential contaminants. The taxonomic assignment of each scaffold was based on a DIAMOND BLASTX search against the UniProt Reference Proteomes database (release 2019_06) with settings “--max-target-seqs 1 --sensitive --evalue 1e-25” (Buchfink et al., 2015). We used Blobtools to identify and remove scaffolds that did not belong to the Nitzschia Nitz4 nuclear genome (i.e., contaminants) based on the following criteria: taxonomic assignment to bacteria, archaea, or viruses; low GC percentage and high read coverage indicative of organellar scaffolds; scaffold length < 500 bp and without matches to the UniProt database. See Methods S1 for Blobtools-based graphical representations of the assembly at different stages of filtering. After removing these scaffolds, all strictly paired reads that mapped to the genome assembly were exported and reassembled with SPAdes as described above. We used the Kmer Analysis Toolkit (KAT; Mapleson et al., 2017) to identify and remove scaffolds from this assembly whose length was > 50% aberrant k-mers, which was possibly indicative of a source other than the Nitzschia nuclear genome. A total of 401 scaffolds (totaling 480 kb) met this criterion and of these, 340 had no hits to the UniProt database and were removed. A total of 55 of the flagged scaffolds had hits to eukaryotic sequences, so we used BLASTX to search each of these against the National Center for Biotechnology Information’s (NCBI) nonredundant (nr) database of diatom proteins. We removed scaffolds that met the following criteria: scaffold length < 1000 bp, no protein domain information, and significant hits to hypothetical or predicted proteins only. This resulted in the removal of an additional 31 scaffolds (totaling 55 kb) from the assembly.
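Two of the quantities used in this filtering, scaffold length and GC fraction, are simple to compute directly from a FASTA file. The following Python sketch illustrates the idea on a small in-memory example. The function names and the example sequences are hypothetical, and excluding ambiguous bases (such as N) from the GC denominator is an assumption made here for illustration, not a description of how Blobtools computes GC.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (assumption: N is ignored)."""
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


# Hypothetical two-scaffold example
example = ">scaf1 example\nGGCC\nAATT\n>scaf2\nATATNN\n"
for name, seq in parse_fasta(example).items():
    print(name, len(seq), gc_fraction(seq))
```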

We used Rascaf (Song et al., 2016) and SSPACE (Boetzer et al., 2011) to improve contiguity of the genome assembly (Methods S1). Rascaf uses alignments of paired RNA-seq reads to identify new contig connections, using an exon block graph to represent each gene and the underlying contig relationship to determine the likely contig path. RNA-seq read alignments were generated with Minimap2 (Li, 2018) and input to Rascaf for scaffolding (Methods S1). The resulting scaffolds were further scaffolded using SSPACE with the filtered read set. SSPACE aligns the read pairs and uses the orientation and position of each pair to connect contigs and place them in the correct order. Gaps produced from scaffolding were filled with two rounds of GapCloser (Luo et al., 2012) using both the RNA-seq reads and filtered DNA reads (Methods S1). Each stage of the genome assembly was evaluated with QUAST version 5.0.0 (Gurevich et al., 2013) and BUSCO version 4.0.6 (Simão et al., 2015). BUSCO was run in genome mode separately against both the eukaryota_odb10 and stramenopiles_odb10 datasets.
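Contiguity statistics such as N50, one of the metrics QUAST reports, summarize how scaffolding changes an assembly. As a minimal illustration of the metric only (this is not QUAST's implementation, and the scaffold lengths below are hypothetical), N50 can be computed from a list of sequence lengths as follows.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths (bp) before and after joining two scaffolds
before = [400, 300, 200, 100]
after = [700, 200, 100]
print(n50(before), n50(after))  # prints "300 700"
```

Joining the two largest scaffolds leaves the total length unchanged but raises N50, which is the kind of change expected after a successful scaffolding step.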

RNA extraction, sequencing, read processing, and assembly of RNA-seq reads followed Parks and Nakov et al. (2018). DNA and RNA sequencing reads are available through NCBI BioProject PRJNA412514. The Nitzschia Nitz4 genome is available through NCBI accession WXVQ00000000. The assembled Nitzschia Nitz4 and Nitzschia Nitz2144 transcriptomes are available from NCBI accessions GIQR00000000 and GIQQ00000000, respectively. A genome browser and gene annotations for Nitzschia Nitz4 are available through the Comparative Genomics (CoGe) web platform (https://genomevolution.org/coge/) under genome ID 60130.

Genome assembly procedure

Software used:

SPAdes (version 3.12.0)
Diamond
BWA
Samtools
KAT (K-mer Analysis Toolkit)
Minimap2
Rascaf
SSPACE
GapCloser (part of SOAPdenovo2)

1. Assemble reads with SPAdes

spades.py -1 Nitz4_trimmed_R1.fq.gz -2 Nitz4_trimmed_R2.fq.gz -k auto -t 16 -o spades --only-assemble

Output files:

contigs.fasta
scaffolds.fasta

2. Filter assembly with Blobtools

2a. Search scaffolds against UniProt Reference Proteomes using BLASTX

diamond blastx --query scaffolds.fasta --max-target-seqs 1 --sensitive --threads 12 --db uniprot_ref_proteomes.dmnd --evalue 1e-25 --outfmt 6 --out Nitz4_spades.vs.uniprot_ref.mts1.1e25.blastx.out

2b. Map reads against the scaffolds

bwa index scaffolds.fasta

bwa mem -t 12 scaffolds.fasta Nitz4_trimmed_R1.fq.gz Nitz4_trimmed_R2.fq.gz | samtools sort -@ 12 -o Nitz4_mapped.sorted.bam -

samtools index -@ 12 Nitz4_mapped.sorted.bam

2c. Create Blobtools dataset

blobtools taxify -f Nitz4_spades.vs.uniprot_ref.mts1.1e25.blastx.out -m uniprot_ref_proteomes.taxids -s 0 -t 2

blobtools create -i scaffolds.fasta -b Nitz4_mapped.sorted.bam -t Nitz4_spades.vs.uniprot_ref.mts1.1e25.blastx.taxified.out -o Nitz4

blobtools view -i Nitz4.blobDB.json -r superkingdom

blobtools plot -i Nitz4.blobDB.json -r superkingdom

blobtools plot -i Nitz4.blobDB.json -r phylum

Blob plot of initial genome assembly (Superkingdom resolution)

Figure 1. Blobology plot of initial genome assembly using all sequencing reads. BLAST hits show contig identity at the superkingdom level.

Blob plot of genome assembly (Phylum resolution)

Figure 2. Blobology plot of initial genome assembly using all sequencing reads. BLAST hits show contig identity at the phylum level.

2d. Remove the following scaffolds:

Scaffolds assigned to bacteria, archaea, or viruses
Scaffolds with no hits that are shorter than 500 bp in length
Scaffolds with GC content < 35%, indicative of organellar DNA

Commands:

awk '($6 == "Bacteria") || ($6 == "Archaea") || ($6 == "Viruses") || ($6 == "no-hit" && $2 < 500) || ($3 < 0.35) {print $1}' Nitz4.blobDB.table.txt > remove_these.txt

blobtools seqfilter -i scaffolds.fasta -l remove_these.txt --invert

grep -Fvwf remove_these.txt Nitz4_spades.vs.uniprot_ref.mts1.1e25.blastx.taxified.out > Nitz4_spades.vs.uniprot_ref.mts1.1e25.blastx.taxified.filtered.out
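For readers who prefer to inspect the selection rule outside of awk, the following Python sketch applies the same three criteria to rows of the Blobtools table. It assumes the column positions implied by the awk command above (column 1 = scaffold name, column 2 = length, column 3 = GC fraction, column 6 = superkingdom assignment). The example rows are hypothetical. Unlike the awk one-liner, this sketch explicitly skips header lines beginning with "#".

```python
def scaffolds_to_remove(table_lines):
    """Return names of scaffolds flagged by the step 2d criteria."""
    flagged = []
    for line in table_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()  # whitespace-delimited, as in awk's default
        name = fields[0]
        length = int(fields[1])
        gc = float(fields[2])
        taxon = fields[5]
        is_prokaryote_or_virus = taxon in {"Bacteria", "Archaea", "Viruses"}
        is_short_no_hit = taxon == "no-hit" and length < 500
        is_low_gc = gc < 0.35
        if is_prokaryote_or_virus or is_short_no_hit or is_low_gc:
            flagged.append(name)
    return flagged


# Hypothetical rows: name, length, GC, N count, coverage, superkingdom
rows = [
    "# hypothetical header line",
    "scaf1\t12000\t0.47\t0\t85.2\tEukaryota",
    "scaf2\t3000\t0.61\t0\t12.0\tBacteria",
    "scaf3\t420\t0.45\t0\t30.1\tno-hit",
    "scaf4\t9000\t0.30\t0\t900.0\tno-hit",
    "scaf5\t800\t0.48\t0\t40.0\tno-hit",
]
print(scaffolds_to_remove(rows))  # prints "['scaf2', 'scaf3', 'scaf4']"
```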

2e. Plot the filtered dataset

blobtools create -i scaffolds.filtered.fna -b Nitz4_mapped.sorted.bam -t Nitz4_spades.vs.uniprot_ref.mts1.1e25.blastx.taxified.filtered.out -o Nitz4_filt1

blobtools view -i Nitz4_filt1.blobDB.json -r superkingdom

blobtools plot -i Nitz4_filt1.blobDB.json -r superkingdom

blobtools plot -i Nitz4_filt1.blobDB.json -r phylum

Blob plot of genome assembly after one round of scaffold filtering (Superkingdom resolution)

Figure 3. Blobology plot of initial genome assembly filtered according to the criteria in step 2d (“Remove the following scaffolds”) above. BLAST hits show contig identity at the superkingdom level.

Blob plot of genome assembly after one round of scaffold filtering (Phylum resolution)

Figure 4. Blobology plot of initial genome assembly filtered according to the criteria in step 2d (“Remove the following scaffolds”) above. BLAST hits show contig identity at the phylum level.

2f. Export the filtered reads

blobtools bamfilter -b Nitz4_mapped.sorted.bam -e remove_these.txt -o Nitz4_spades_filt1 -n

Output files:

Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.1.fa
Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.2.fa

3. Reassemble the filtered read set with SPAdes

spades.py -1 Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.1.fa -2 Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.2.fa -k auto -t 16 -o spades_filt1 --only-assemble

Output files:

contigs.fasta
scaffolds.fasta

4. Repeat Blobtools analysis (steps 2a to 2f above)

Blob plot of genome assembly after filtering and reassembly (Superkingdom resolution)

Figure 5. Blobology plot of genome assembly after one round of scaffold and read filtering and reassembly. BLAST hits show contig identity at the superkingdom level.

Blob plot of genome assembly after filtering and reassembly (Phylum resolution)

Figure 6. Blobology plot of genome assembly after one round of scaffold and read filtering and reassembly. BLAST hits show contig identity at the phylum level.

5. Use KAT for k-mer-based identification of contaminant scaffolds

kat gcp -o kat-gcp -t 8 Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.1.fa Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.2.fa

python3 density.py -x 200 kat-gcp.mx

Output files:

kat-gcp.mx
kat-gcp-density.png

5a. Extract k-mers

kat filter kmer --low_count=25 --high_count=80 Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.1.fa Nitz4_spades_filt1.Nitz4_mapped.sorted.bam.2.fa

Output file: kat.filter.kmer-in.jf27

5b. Get scaffolds associated with the filtered k-mers

kat filter seq --threshold=0.5 --seq=scaffolds.fasta kat.filter.kmer-in.jf27

Output file: kat.filter.kmer.0.5.in.fasta
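The logic of steps 5a and 5b can be illustrated with a small Python sketch: given a set of selected k-mers, compute the fraction of each scaffold's k-mers that fall in that set and report scaffolds that exceed a threshold. This is a conceptual sketch only. The function names, the toy k-mer size of 3, and the example sequences are hypothetical, and KAT's own matching rules (for example, its handling of reverse complements and whether the threshold is inclusive) are not reproduced here.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def selected_fraction(seq, selected_kmers, k):
    """Fraction of a sequence's k-mers that are present in selected_kmers."""
    all_kmers = kmers(seq, k)
    if not all_kmers:
        return 0.0
    hits = sum(1 for kmer in all_kmers if kmer in selected_kmers)
    return hits / len(all_kmers)


def select_scaffolds(scaffolds, selected_kmers, k, threshold=0.5):
    """Names of scaffolds with more than `threshold` of their k-mers selected."""
    return [
        name
        for name, seq in scaffolds.items()
        if selected_fraction(seq, selected_kmers, k) > threshold
    ]


# Hypothetical toy data with k = 3
selected = {"AAA", "AAC"}
scaffolds = {"s1": "AAAAC", "s2": "AAAGGT"}
print(select_scaffolds(scaffolds, selected, k=3))  # prints "['s1']"
```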

Use Blobtools output to manually remove scaffolds with no hits to the UniProt Reference Proteomes dataset

Use BLASTX to manually search the remaining scaffolds against NCBI’s nr database of diatom proteins. Remove scaffolds that meet all the following criteria:

Scaffold length < 1000 bp
No protein domains detected
All hits are to hypothetical or predicted proteins

6. Scaffold the filtered assembly with Rascaf

6a. Map RNA-seq reads against scaffolds with Minimap2

minimap2 -ax sr scaffolds.filtered.fasta nitz4_trimmed_filtered_RNASEQ_R1.fq.gz nitz4_trimmed_filtered_RNASEQ_R2.fq.gz | samtools sort -@ 12 -o scaffolds.filtered.mapped.bam -

samtools index scaffolds.filtered.mapped.bam

6b. Use sorted and indexed BAM file as input to Rascaf

rascaf -b scaffolds.filtered.mapped.bam -f scaffolds.filtered.fasta

rascaf-join -r rascaf.out

Output file: rascaf_scaffold.fa

7. Scaffold again using SSPACE with the DNA reads

perl SSPACE_Standard_v3.0.pl -l libraries.txt -s rascaf_scaffold.fa -b nitz4_sspace

Output file: nitz4_sspace.final.scaffolds.fasta

8. Extend contigs and fill gaps with GapCloser

8a. Use RNA-seq reads and DNA reads for gap closure

GapCloser -a nitz4_sspace.final.scaffolds.fasta -b config -o nitz4_sspace.final.scaffolds.filled.fasta -t 16

8b. Perform a second round of gap closure

GapCloser -a nitz4_sspace.final.scaffolds.filled.fasta -b config -o nitz4_sspace.final.scaffolds.filled2.fasta -t 16

9. Make Blobtools plots for the final assembly (steps 2a to 2e above)

Blob plot of final genome assembly (Superkingdom resolution)

Figure 7. Blobology plot of final genome assembly. BLAST hits show contig identity at the superkingdom level.

Blob plot of final genome assembly (Phylum resolution)

Figure 8. Blobology plot of final genome assembly. BLAST hits show contig identity at the phylum level.

Final assembly file

nitz4_sspace.final.scaffolds.filled2.fasta

Genome annotation

We used the Maker pipeline version 2.31.8 (Cantarel et al., 2008) to identify protein-coding genes in the nuclear genome. We used the assembled Nitzschia Nitz4 transcriptome (Maker’s expressed sequence tag [EST] evidence) and the predicted proteome of Fragilariopsis cylindrus (GCA_001750085.1) (Maker’s protein homology evidence) to inform the gene predictions. RepeatModeler version 2.0 (Flynn et al., 2019) was used for de novo identification and compilation of transposable element families found in the Nitzschia Nitz4 genome. We used BLASTX with settings “-num_descriptions 1 -num_alignments 1 -evalue 1e-10” to search the resulting repeat library against the UniProt Reference Proteomes. These results were given to ProtExcluder version 1.2 (Campbell et al., 2014) to exclude repeats with significant protein hits. The final custom repeat library was used as input to the Maker annotation pipeline described below.

We used Augustus version 3.2.2 for de novo gene prediction with settings “max_dna_len = 200,000” and “min_contig = 300.” We trained Augustus with the annotated gene set for Phaeodactylum tricornutum (version 2), which we filtered as follows: (1) remove “hypothetical” and “predicted” proteins, (2) remove all but one splice variant of a gene, (3) remove genes with no introns, and (4) remove genes that overlap with neighboring gene models or the 1000 bp flanking regions of adjacent genes because these flanking regions were included in the training set along with the enclosed gene. When possible, we annotated untranslated regions (UTRs) by subtracting 5’ and 3’ CDS coordinates from the corresponding 5’ and 3’ end coordinates of the associated mRNA sequence. The final filtered training set included a total of 726 genes. We separately generated a set of gene models for UTR prediction using the filtering criteria described above except that (1) we retained intronless genes based on the assumption that intron presence or absence was less relevant for training the UTR annotation parameters, and (2) we retained only those UTR annotations from the P. tricornutum genome that were ≥ 40 bp in length and extended ≥ 25 bp beyond both the 5’ and 3’ ends of the CDS. Our final UTR training set included 531 genes.
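The UTR derivation described above, in which the CDS boundaries are subtracted from the mRNA boundaries, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: coordinates are 1-based and inclusive as in GFF3, only the outermost CDS coordinates are considered (any introns within UTRs are ignored), and the function names and example coordinates are hypothetical.

```python
def utr_intervals(mrna_start, mrna_end, cds_start, cds_end, strand):
    """Return (five_prime_utr, three_prime_utr) as (start, end) tuples,
    or None where the mRNA does not extend beyond the CDS."""
    left = (mrna_start, cds_start - 1) if cds_start > mrna_start else None
    right = (cds_end + 1, mrna_end) if mrna_end > cds_end else None
    # On the minus strand the 5' UTR lies at the higher genomic coordinates
    return (left, right) if strand == "+" else (right, left)


def interval_length(interval):
    """Length in bp of a 1-based inclusive interval (0 if None)."""
    return 0 if interval is None else interval[1] - interval[0] + 1


# Hypothetical gene: mRNA spans 100-2000, CDS spans 150-1900
five, three = utr_intervals(100, 2000, 150, 1900, "+")
print(five, interval_length(five))    # prints "(100, 149) 50"
print(three, interval_length(three))  # prints "(1901, 2000) 100"
```

A minimum-length requirement such as the 40-bp cutoff mentioned above could then be applied with a test like `interval_length(five) >= 40`.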

We trained and optimized the Augustus gene prediction parameters (with no UTRs) on the first set of 726 genes and optimized UTR prediction parameters with the second set of 531 genes. We then used these parameters to perform six successive gene predictions within Maker. We assessed each Maker annotation for completeness using BUSCO version 4.0.6 with the ‘eukaryota_odb10’ database (Simão et al., 2015). For the five subsequent runs, we followed recommendations of the Maker developers and used a different ab initio gene predictor, SNAP (Korf, 2004), trained with Maker-generated gene models from the previous run. Although the five SNAP-based Maker runs discovered more genes, the number of complete BUSCOs did not increase and the percentage of genes with Maker annotation edit distance (AED) scores less than 0.5 decreased slightly from 0.98 to 0.97 (Supporting Information Methods S2). We therefore used the gene predictions from the second Maker round (the first SNAP-based round) as the final annotation (Methods S2). In our search for genes related to carbon metabolism, we found several genes that were not in the final set of Maker-based annotations. The coding regions of these genes were manually annotated.

Iterative annotation procedure for identifying protein-coding genes in the genome of Nitzschia sp. Nitz4.
Annotation round	Ab initio algorithm (training round^a)	Number of gene models	Complete BUSCOs (out of 303^b)	Complete BUSCOs (out of 215^c)	Proportion of gene models with AED < 0.5
1	Maker	8609	253 (83.5%)	139 (64.7%)	0.976
2^d	SNAP (1)	9340	254 (83.8%)	140 (65.1%)	0.980
3	SNAP (2)	9387	254 (83.8%)	140 (65.1%)	0.970
4	SNAP (3)	9394	254 (83.8%)	140 (65.1%)	0.970
5	SNAP (4)	9396	254 (83.8%)	140 (65.1%)	0.970
6	SNAP (5)	9395	254 (83.8%)	140 (65.1%)	0.970

^a for annotation rounds 2–6, gene predictions used gene models from the previous round as a training set

^b BUSCO protein mode with eukaryota_odb9 database

^c BUSCO protein mode with protist_ensembl database

^d final annotation
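The last column of the table can be recomputed from a Maker GFF3 file, in which each mRNA feature carries an `_AED` attribute. The Python sketch below shows one way to do this. The function names and the example GFF3 lines are hypothetical, and the sketch assumes that only mRNA features are counted.

```python
import re

# Match the _AED attribute but not _eAED
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_values(gff_lines):
    """Collect AED scores from the attributes column of mRNA features."""
    values = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        match = AED_PATTERN.search(fields[8])
        if match:
            values.append(float(match.group(1)))
    return values


def proportion_below(values, cutoff=0.5):
    """Proportion of values strictly below the cutoff."""
    if not values:
        return 0.0
    return sum(v < cutoff for v in values) / len(values)


# Hypothetical GFF3 lines
gff = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.12;_eAED=0.30",
    "scaf1\tmaker\tmRNA\t1500\t2500\t.\t-\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.48;_eAED=0.48",
    "scaf2\tmaker\tmRNA\t10\t700\t.\t+\t.\tID=g3-mRNA-1;Parent=g3;_AED=0.50;_eAED=0.50",
    "scaf2\tmaker\tmRNA\t900\t1800\t.\t+\t.\tID=g4-mRNA-1;Parent=g4;_AED=0.91;_eAED=0.91",
]
print(proportion_below(aed_values(gff)))  # prints "0.5"
```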

We made functional annotations of the predicted genes in the Nitzschia Nitz4 genome by searching the set of predicted proteins against the SwissProt (release 2019_09) and UniProt Reference Proteomes databases (release 2019_09) using NCBI BLASTP version 2.4.0+ (Camacho et al., 2009) with an E-value cutoff of 1e-6. We identified protein domains using InterProScan version 5.36-75.0 (Jones et al., 2014) against the Pfam (release 32.0) (El-Gebali et al., 2019) and PANTHER (release 14.1) (Thomas et al., 2003) databases. We identified proteins with transmembrane helices and signal peptides using TMHMM version 2.0c (Krogh et al., 2001) and SignalP version 4.1 (Petersen et al., 2011). Finally, we used tRNAscan-SE version 2.0.5 (Chan & Lowe, 2019) for tRNA annotation, RNAmmer version 1.2 (Lagesen et al., 2007) for rRNA annotation, and Infernal version 1.1.2 (Nawrocki & Eddy, 2013) against the Rfam database (release 14.1) to predict non-coding RNAs (Kalvari et al., 2021).

Estimation of heterozygosity

We mapped the trimmed paired-end reads to the genome assembly using BWA-MEM version 0.7.17-r1188 (Li & Durbin, 2009) and used Picard Tools version 2.17.10 (https://broadinstitute.github.io/picard) to add read group information and mark duplicate read pairs. The resulting BAM file was used as input to the Genome Analysis Toolkit (GATK) version 3.5.0 to call variants (SNPs and indels) using the HaplotypeCaller tool (DePristo et al., 2011; Poplin et al., 2018). SNPs and indels were separately extracted from the resulting VCF file and then filtered to remove low-quality variants. SNPs were filtered out using the following criteria: MQ < 40.0, SOR > 3.0, QD < 2.0, FS > 60.0, MQRankSum < -12.5, ReadPosRankSum < -8.0, ReadPosRankSum > 8.0. Indels were filtered out using the following criteria: MQ < 40.0, SOR > 10.0, QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0, ReadPosRankSum > 20.0. The filtered variants were combined and used as “known” variants for base recalibration. We then performed a second round of variant calling using HaplotypeCaller on the recalibrated BAM file. We extracted and filtered SNPs and indels as described above before additionally filtering both variant types by approximate read depth (DP), removing those with depth below the 5th percentile (DP < 125) and above the 95th percentile (DP > 1507). The final filtered SNPs and indels were combined, evaluated using GATK’s VariantEval tool, and annotated using SnpEff version 4.3t (Cingolani et al., 2012) with the Nitzschia Nitz4 gene models.
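The depth-based filter described above can be illustrated with a short Python sketch that derives the 5th and 95th percentile cutoffs from the DP values and retains variants within those bounds. The nearest-rank percentile definition, the function names, and the example data are assumptions made for illustration. The method used to obtain the cutoffs of 125 and 1507 reported above is not specified here.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100 using integer arithmetic
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def filter_by_depth(variants, low_pct=5, high_pct=95):
    """Keep (variant_id, depth) pairs whose depth lies within the percentile bounds."""
    depths = [dp for _, dp in variants]
    low = nearest_rank_percentile(depths, low_pct)
    high = nearest_rank_percentile(depths, high_pct)
    kept = [(vid, dp) for vid, dp in variants if low <= dp <= high]
    return kept, low, high


# Hypothetical variants with depths 1 through 100
variants = [("var%d" % i, i) for i in range(1, 101)]
kept, low, high = filter_by_depth(variants)
print(low, high, len(kept))  # prints "5 95 91"
```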

References

Andrews S. 2010. FastQC: A quality control tool for high throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed August 2018)

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455–477.

Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. 2011. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27: 578–579.

Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120.

Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: Architecture and applications. BMC Bioinformatics 10: 421.

Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ, et al. 2014. MAKER-P: A tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiology 164: 513–524.

Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. 2008. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Research 18: 188–196.

Chan PP, Lowe TM. 2019. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences. Methods in Molecular Biology: 1–14.

Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. 2012. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6: 80–92.

DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43: 491–498.

El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, et al. 2019. The Pfam protein families database in 2019. Nucleic Acids Research 47: D427–D432.

Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2019. RepeatModeler2: Automated genomic discovery of transposable element families. bioRxiv: 856591.

Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 29: 1072–1075.

Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, et al. 2014. InterProScan 5: Genome-scale protein function classification. Bioinformatics 30: 1236–1240.

Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, Griffiths-Jones S, Toffano-Nioche C, Gautheret D, Weinberg Z, et al. 2021. Rfam 14: Expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Research 49: D192–D200.

Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5: 59.

Krogh A, Larsson B, von Heijne G, Sonnhammer EL. 2001. Transmembrane helices predicted using TMHMM: Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology 305: 567–580.

Laetsch DR, Blaxter ML. 2017. BlobTools: Interrogation of genome assemblies. F1000Research 6: 1287.

Lagesen K, Hallin P, Rødland EA, Staerfeldt H-H, Rognes T, Ussery DW. 2007. RNAmmer: Consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Research 35: 3100–3108.

Li H. 2018. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34: 3094–3100.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760.

Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al. 2012. SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler. GigaScience 1: 18.

Mapleson D, Garcia Accinelli G, Kettleborough G, Wright J, Clavijo BJ. 2017. KAT: A K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33: 574–576.

Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29: 2933–2935.

Parks MB, Nakov T, Ruck EC, Wickett NJ, Alverson AJ. 2018. Phylogenomics reveals an extensive history of genome duplication in diatoms (Bacillariophyta). American Journal of Botany 105: 330–347.

Petersen TN, Brunak S, von Heijne G, Nielsen H. 2011. SignalP 4.0: Discriminating signal peptides from transmembrane regions. Nature Methods 8: 785–786.

Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, et al. 2018. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv: 201178.

Sheikhizadeh S, de Ridder D. 2015. ACE: Accurate correction of errors using K-mer tries. Bioinformatics 31: 3216–3218.

Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212.

Song L, Shankar DS, Florea L. 2016. Rascaf: Improving genome assembly with RNA sequencing data. The Plant Genome 9: 1–12.

Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. 2003. PANTHER: A library of protein families and subfamilies indexed by function. Genome Research 13: 2129–2141.