Version Changes
Revised. Amendments from Version 1
In response to the reviewers’ suggestions, the major changes that we have made to the manuscript are the inclusion of two figures. The first describes the breeding design used the generate the hemiclonal individuals that were sequenced, and thus hopefully makes this unique and powerful method comprehensible to a new audience. The second figure is a flow diagram describing how the data from the high-throughput sequencing was analysed, culminating in the generation of quality-filtered genotypes and deposition of the data in publicly-accessible databases. We hope that this figure, summarising the methods that are particular to our study, makes the analysis of the high-throughout sequencing data clearer. One of the reviewers also suggested that we make publicly-accessible, a gVCF file for the project. We have deposited this data file, alongside code and run-logs, in the dedicated research data repository, Zenodo, https://doi.org/10.5281/zenodo.198880. We have also made corrections in the manuscript, as per the reviewer’s suggestion, to the description of sequence-read mapping methods, and to definitions relating to the quality detected of nucleotide variation (‘Fisher strand bias’). Overall this adds accuracy and clarity to our reporting of the methods used. As a minor note, the structural variant data has been fully released by NCBI dbVar and so we have updated the URL in the Data Availability section accordingly.
Abstract
As part of a study into the molecular genetics of sexually dimorphic complex traits, we used high-throughput sequencing to obtain data on genomic variation in an outbred laboratory-adapted fruit fly ( Drosophila melanogaster) population. We successfully resequenced the whole genome of 220 hemiclonal females that were heterozygous for the same Berkeley reference line genome (BDGP6/dm6), and a unique haplotype from the outbred base population (LH M). The use of a static and known genetic background enabled us to obtain sequences from whole-genome phased haplotypes. We used a BWA-Picard-GATK pipeline for mapping sequence reads to the dm6 reference genome assembly, at a median depth-of coverage of 31X, and have made the resulting data publicly-available in the NCBI Short Read Archive (Accession number SRP058502). We used Haplotype Caller to discover and genotype 1,726,931 small genomic variants (SNPs and indels, <200bp). Additionally we detected and genotyped 167 large structural variants (1-100Kb in size) using GenomeStrip/2.0. Sequence and genotype data are publicly-available at the corresponding NCBI databases: Short Read Archive, dbSNP and dbVar (BioProject PRJNA282591). We have also released the unfiltered genotype data, and the code and logs for data processing and summary statistics ( https://zenodo.org/communities/sussex_drosophila_sequencing/).
Keywords: drosophila, sequencing, data, dimorphic traits
Introduction
As part of a study on the molecular genetics of sexually dimorphic complex traits, we used hemiclonal analysis in conjunction with high-throughput sequencing 1 to characterise molecular genetic variation across the genome, from an outbred laboratory-adapted population of Drosophila melanogaster, LH M 2, 3. The hemiclone experimental design allows the repeated phenotyping of multiple individuals, each with the same unrecombined haplotype on a different random genetic background. This method has been used to investigate standing genetic variation and intersexual genetic correlations for quantitative traits 2 and gene expression 4, but it has not yet been used to obtain genomic data.
The 220 hemiclone females that were sequenced in the present study have a maternal haplotype, from the dm6 reference assembly strain (BDGP6+ISO1 mito/dm6, Bloomington Drosophila Stock Center no. 2057) 5, 6, and have a different paternal genome each, sampled using cytogenetic cloning from the LH M base population (See Figure 1). All non-reference genotypes in the sequenced LH M hemiclones were expected to be heterozygous and in-phase, except in rare instances where the in-house dm6 reference strain also had the same non-reference allele.
Previous studies indicate that the limits for DNA quantity in next-generation sequencing are 50–500ng 7. We sequenced individual D. melanogaster, rather than pools of clones, because more biological information can be obtained, and because modern transposon-based library preparation allows accurate sequencing at low concentrations of DNA. D. melanogaster is a small insect (~1 μg) although this problem is off-set by the reduced proportion of repetitive intergenic sequence, and small genome size relative to other insects (170Mb verses ~500Mb) 7.
We mapped reads to the D. melanogaster dm6 reference assembly using a BWA-Picard-GATK pipeline, and called nucleotide variants using both HaplotypeCaller, and Genomestrip, the latter of which detects copy-number variation up to 1Mb in length. A graphic representation of the data analysis pipeline is provided in Figure 2. We have made the mapped sequencing data, and genotype data publicly-available on NCBI, and additionally have made the meta-data, analysis code and logs publicly-available on Zenodo. This is the first report of a study which uses methods for detecting both SNPs, indels and structural variants (deletions and duplications >1Kb in length), genome-wide in next-generation sequencing data, and the first report of whole genome resequencing in hemiclonal individuals.
Materials and methods
Study samples
The base population (LH M) was originally established from a set of 400 inseminated females, trapped by Larry Harshman in a citrus orchard near Escalon, California in 1991 3. It was initially kept at a large size (more than 1,800 reproducing adults) in the lab of William Rice (University College Santa Barbara, USA). In 1995 (approximately 100 generations since establishment) the rearing protocol was changed to include non-overlapping generations, and a moderate rearing density with 16 adult pairs per vial (56 vials in total) during 2 days of adult competition, and 150–200 larvae during the larval competition stage 3. In 2005, a copy of LH M population sample was transferred to Uppsala University, Sweden (approximately 370 generations since establishment), and in 2012, to the University of Sussex (UK), when the current set of 223 haplotypes were sampled. At the point of sampling we estimate that the population had undergone 545 generations under laboratory conditions, 445 of which had been using the same rearing protocol.
Hemiclonal lines were established by mating groups of five clone-generator females (C(1)DX, y, f ; T(2;3) rdgC st in ri p P bw D) with 230 individual males sampled from the LH M base population (see 2). A single male from each cross was then mated again to a group of five clone-generator females in order to amplify the number of individuals harbouring the sampled haplotype. Seven lines failed to become established at this point. The remaining 223 lines were maintained in groups of up to sixteen stock hemiclonal males in two vials that were transferred to fresh vials each week. Stock hemiclonal males were replenished every six weeks by mating with groups of clone-generator females. A stock of reference genome flies (Bloomington Drosophila Stock Center no. 2057) was established and maintained initially using five rounds of sib-sib matings before expansion. 223 virgin reference genome females were then collected and mated to a single male from each of 223 hemiclonal lines. Female offspring from this cross therefore have one copy of the reference genome and one copy of the hemiclonal haplotype. Groups of these hemiclonal females were collected as virgins, placed in 99% ethanol and stored at -20°C prior to DNA extraction.
DNA extraction
One virgin female per hemiclonal line, was homogenised with a microtube pestle, followed by 30-minute mild-shaking incubation in proteinase K. DNA was purified using the DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA), according to manufacturer’s instructions. Volumes were scaled-down according to mass of input material. Barrier pipette tips were used throughout, in order to minimise cross-contamination of DNA. Template assessment using the Qubit BR assay (Thermo Fischer, NY, USA) indicated double-stranded DNA, 10.4Kb in length at concentrations of 2–4 ng/ μl (total quantity 50–100ng).
Whole-genome resequencing
Sequencing was performed under contract by Exeter Sequencing service, University of Exeter, UK. The sonication protocol for shearing of the DNA was optimised for low concentrations to generate fragments 200–500bp in length. Libraries were prepared and indexed using the Nextera Library Prep Kit (Illumina, San Diego, USA). All samples were sequenced on a HiSeq 2500 (Illumina), with five individuals per lane. We also sequenced DNA from two individuals from the in-house reference line (Bloomington Drosophila Stock Centre no. 2057). One was prepared as the hemiclones, using the Illumina Nextera library (sample RGil), and the other using an older, Illumina Nextflex method (sample RGfi). The median number of read pairs across all samples was 29.23×10 6 (IQR 14.07×10 6). Quality metrics for the sequencing data were generated with FastQC v0.10.0 by Exeter Biosciences, and used to determine whether results were suitable for further analyses. For twelve samples with less than 8×10 6 reads, sequencing was repeated successfully (H006, H041, H061, H084, H086, H087, H092, H098, H105), with a further three samples omitted entirely (H015, H016, H136), leaving 220 hemiclonal samples in total. As shown in Figure 3A and 3B, the read quality score and quality-per-base for the samples taken forward for genotyping in this study were well within acceptable standards, and similar across all samples.
Read mapping
Raw data ( fastq files) were stored and processed in the Linux Sun Grid Engine in the High-Performance Computing facility, University of Sussex. Adaptor sequences (Illumina Nextera N501-H508 and N701-N712), poor quality reads (Phred score <7) and short reads were removed using Fastq-mcf (ea-utils v.1.1.2). Settings were: log-adapter minimum-length-match: 2.2, occurrence threshold before adapter clipping: 0.25, maximum adapter difference: 10%, minimum remaining length: 19, skew percentage-less-than causing cycle removal: 2, bad reads causing cycle removal: 20%, quality threshold causing base removal: 10, window-size for quality trimming: 1, number of reads to use for sub-sampling: 3×10 5.
Cleaned sequence reads were mapped to the D. melanogaster genome assembly, release 6.0 (Assembly Accession GCA_000001215.4 6) using Burrows-Wheeler Aligner mem (version 0.7.7-r441) 8, with a mapping quality score threshold of 20. Remaining reads were remapped using Stampy v1.0.24, which is slower but more precisely maps reads which are divergent from the reference genome assembly 9. This method was used previously for the Drosophila Genome Nexus 10. Removal of duplicate reads, indexing and sorting was performed with Picard-Tools v1.77. Re-mapping of sequence reads around insertion-deletion polymorphisms was performed using Genome Analysis Tool-Kit (GATK) v3.2.2, as a recommended standard practice 11.
The median depth of coverage across all samples used for genotyping was 31X (IQR 14, see Figure 3C). As shown in Figure 3D, the mean nucleotide mis-match rate to the dm6 reference assembly for the LH M hemiclones was 3.27×10 -3 per PCR cycle (IQR 0.2×10 -3), contrasting with the two reference line samples for which the mis-match rate was 0.89-1.10×10 -3 per cycle. We observed spikes of nucleotide mismatches in some PCR cycles for some samples, which are likely to be errors rather than true sequence variation.
Small-variant detection methods
Single-nucleotide polymorphisms (SNPs) and insertion/deletions (indels) <200bp in length, were detected and genotyped relative to the BDGP+ISO1/dm6 assembly, on chromosomes 2,3,4,X, and mitochondrial genome using Haplotyper Caller (GATK v3.4-0) 12. Individual bam files were genotyped, omitting reads with a mapping quality under 20, stand call and emit confidence thresholds of 31, then combined and genotyped again. 143,726,002 bases of genomic sequence were analysed from which 1,996,556 variant loci were identified consisting of 1,581,341 SNPs, 196,582 deletions, and 218,633 insertions. Functional annotation was added using SNPeff v4.1 13.
We used hard-filtering to remove variants generated by error, because the alternative ’variant recalibration’ requires prior information on variant positions from a similar population or parents. Quality filtering thresholds were decided following inspection of the various sequencing metrics associated with each variant locus, and by software developers’ recommendations 12. The filtering thresholds were: Quality-by-depth >2, strand bias <50 (Phred-scaled p-value from Fisher’s Exact test), mapping quality >58, mapping quality rank sum >-7.0, read position rank sum >-5.0, combined read depth <15000, and call rate >90%. This filtering removed 167,319 variants (8.3%), leaving 1,829,237. Summary values for the variant quality metrics are shown in Table 1. Distributions of quality metrics for Haplotype Caller variants are shown in Supplementary Figure 2. The density of sequence variants, measured as the median for windows of 10Kb in length across the genome, was 75 per for biallelic SNPs, 1 for multi-allelic SNPs, 6 for bi-allelic indels, and 3 for multi-allelic indels (see Figure 4A). Mean separation between variants of any type or allele frequency was 78bp. As shown in Figure 4B the allele frequency distribution for bi-allelic SNPs and indels was similar, and broadly within expectations for an out-bred diploid population sample. The two in-house reference line individuals had 515 homozygous and 3171 heterozygous mutations from the reference assembly. The median genotype counts for the 220 LH M hemiclone individuals, were 585 homozygous, 728,214 heterozygous and 4963 no-call (IQR 400, 36707 and 7876). Genotype counts for each individual are shown in Figure 4C.
Table 1. Haplotype Caller variant quality metrics and genotype frequencies.
Variant type
N |
SNPs (biallelic)
1,411,395 |
SNPs (multi)
43,798 |
Indels (biallelic)
138,687 |
Indels (multi)
65,660 |
---|---|---|---|---|
Total depth
Event length Strand bias* Mapping quality Map qual rank sum Quality-by-depth Quality |
6440 (1725)
0 (0) 1.12 (2.25) 62.12 (6.18) 0.25 (1.04) 16.65 (3.51) 34968 (62236) |
6316 (2100)
0 (0) 1.34 (3.14) 64.94 (8.57) 0.9 (2.37) 17(3.81) 57028 (67558) |
6134 (1836)
2 (5) 1.76 (3.88) 71.17 (12.77) 3.14 (3.21) 18.52 (6.21) 25842 (59889) |
5973 (2081)
1 (8) 1.77 (4.45) 69.58 (11.36) 2.68 (2.91) 16.96 (6.39) 40479 (63590) |
Genotype counts
Reference Heterozygous Homozygous non-ref. No call |
151 (120) 70 (118) 0 (0) 0 (1) |
102(122) 117(121) 0(0) 1(4) |
166(114) 54(114) 0(0) 0(2) |
122(123) 95(122) 0(0) 2(5) |
Values show the total number of variants, median (and IQR) for each metric. Data generated from vcf file using GATK VariantsToTable, on the quality-filtered data. *Strand bias refers to Phred-scaled p-value from Fisher’s Exact Test. Code and data used to generate this table located at https://doi.org/10.5281/zenodo.159282.
For data submission to NCBI dbSNP, we were obliged to exclude 44,644 indels that were multi-allelic or greater than 50bp in length, and a further 57,662 SNPs and indels situated within deletions. Variants greater than 50bp in length were submitted to the NCBI structural variant database dbVar. The genotype data submitted to dbSNP consists of 1,726,931 quality-filtered, functionally-annotated variant records (1,423,039 SNPs and 303,892 short, biallelic insertion and deletion variants) corresponding to 383,378,682 individual genotype calls.
Structural-variant detection methods
Large genomic variants – deletions and duplications, between 1Kb and 100Kb in length – were detected and genotyped using GenomeStrip v2.0 14. One of the reference strain individuals (sample RGfi) was omitted from the this analysis because a different sequencing library preparation method was used from the other samples (see above). We included the following settings (according to developers’ guidelines): Sex-chromosome and k-mer masking when estimating sequencing depth, computation of GC-profiles and read counts, and reduced insert size distributions. Large variant discovery and genotyping was performed only on chromosomes 2, 3, 4 and X, omitting the mitochondrial genome and unmapped scaffolds.
We used the Genomestrip CNV Discovery pipeline with the settings: minimum refined length 500, tiling window size 1000, tiling window overlap 500, maximum reference gap length 1000, boundary precision 100, and genotyped the results with the GenerateHaploidGenotypes R script (genotype likelihood threshold 0.001, R version 3.0.2). Following visualisation of the genotype results and comparison with the bam sequence alignment files using the Integrated Genomics Viewer (IGV) v2.3.72 15, we excluded telomeric and centromeric regions where the sequencing coverage was fragmented, and six regions of multi-allelic gains of copy-number with dispersed breakpoints, previously reported to undergo mosaic in vivo amplification prior to oviposition 16 (see Supplementary Table 1 for genomic positions, and Supplementary Figure 3 for visualisation of in vivo amplification in a sequence alignment file). We excluded 6 samples (H082, H083, H090, H097, H098, H153) for which 80–90% of the genome was reported by Genomestrip to contain structural variation, which we regarded as error. Most these samples were grouped by the order in which they were processed for DNA extraction and sequencing, so this may have been caused partly by a batch-effect leading to differences in read pair separation, depth-of-coverage, and response to normal fluctuations in GC-content. Following removal of these samples, there were 2897 CNVs (1687 deletions, 877 duplications, and 333 of the ’mixed’ type), ranging in size from 1000bp to 217,707bp. We observed eight regions, for which Genomestrip identified multiple adjacent CNVs in single individuals, but which are likely single CNVs, 100Kb to 1.3Mb in length ( Supplementary Table 2).
Quality-filtering for structural variants detected by Genomestrip analysis of whole-genome resequencing data are not thoroughly established. We visually inspected, in the bam read alignment files using the Integrated Genomics Viewer 15, reported structural variants which were most likely to be artefacts. Specifically these were variants with: i) Extreme values for quality-score, GC-content or cluster separation, ii) Any homozygous non-reference genotypes (not expected with our breeding design), iii) Type ’mixed’. Following this, we used the following criteria for quality filtering: Quality score >15, cluster separation <17, GC-fraction >0.33, no mixed types (deletions and duplications only), homozygous non-reference genotype count >0, and heterozygous genotype count <200. Summaries of the quality metrics for quality-filtered data are shown in Table 2 and Supplementary Figure 2. We applied an upper limit to the cluster separation to remove groups of outliers in the upper end of the distribution, although this may have excluded many true, low-frequency variants. However, data on rare variants are not directly useful for our further investigations.
Table 2. Quality metrics for Genomestrip CNVs.
Metric
N |
Deletions
78 |
Duplications
89 |
---|---|---|
GC-fraction
Cluster separation Quality Heterozygote count (max 213)* Length (kb) |
0.39 (0.07)
8.84 (3.70) 103.93 (505.71) 22.00 (42.50) 2.20 (3.54) |
0.42 (0.06)
9.78 (3.17) 490.95 (1128.32) 42.00 (53.00) 3.40 (2.35) |
Values show the total number of variants, median (and IQR) for each metric. Data generated from vcf file using GATK VariantsToTable, on the quality-filtered data. *No CNVs in the quality-filtered samples had a ‘no-call’ or homozygous non-reference genotype.
After filtering, 167 CNVs remained (78 deletions and 89 duplications, size range 1Kb-26.6Kb). The positions and genotypes of these CNVs for each individual are shown in Figure 5. The genotype data for quality-filtered CNVs were combined with the data from 2252 indels >50bp from the Haplotype Caller pipeline, and a total of 2419 variants were uploaded to the public database on structural variation, NCBI dbVar. Although we have used methods for detecting SNPs, indels and CNVs, variants between 200bp and 1Kb are not reported by either HaplotypeCaller or Genomestrip. Additionally, sequence inversions are not detected by these methods and the upper limits to CNV detection using Genomestrip, based on the parameters and results of this study, are 100Kb-1Mb.
Dataset validation
Initial validation of our methods can be seen by lack of variants in the two reference line individuals compared with the LH M hemiclones (3,686 verses a median of 728,799 per sample). For a more thorough test of the genotyping and hemiclone method reproducibility, we sequenced an additional hemiclone individual from three of the LH M lines, and mapped the reads to the reference genome assembly as before. For HaplotypeCaller, we generated gVCF files for each sample, and then performed genotyping and quality-filtering as described above, except that the original three samples were replaced with the replication test samples. Similarly, for Genomestrip, we performed structural variant discovery and genotyping on all of the same samples as before, replacing three original samples with the replication test samples. We then used the GATK Genotype Concordance function to generate counts of genotype differences between the three pairs of samples. Overall results are presented in Table 3. Genotype reproducibility for quality-filtered biallelic SNPs was 98.5–99.5%, going down to 89.1–93.2% for filtered multi-allelic indels. Reproducibility of structural variant genotype calls was 95.6–100.0%, although we noted that for one individual (H119) filtering actually reduced the reproducibility rate from 99.7% to 95.6%. Full code, logs and numerical results can be found at http://doi.org/10.5281/zenodo.160539.
Table 3. Genotype reproducibility rates(%)*.
Variant type | Sample ID | Unfiltered | Filtered |
---|---|---|---|
HaplotypeCaller/3.4
Bi-allelic SNP Multi-allelic SNP |
H119 H137 H151 H119 H137 H151 |
98.9 97.7 97.8 95.0 92.3 92.1 |
99.5 98.5 98.3 96.6 94.0 93.6 |
Bi-allelic indel
Multi-allelic indel |
H119
H137 H151 H119 H137 H151 |
98.1
96.3 96.0 91.9 88.0 87.9 |
98.6
96.8 96.4 93.2 89.3 89.1 |
Genomestrip/2.0
Deletion |
H119 H137 H151 |
99.7 100.0 100.0 |
95.6 100.0 100.0 |
Duplication | H119
H137 H151 |
99.7
99.9 99.6 |
100.0
100.0 100.0 |
*Presented values are the overall genotype concordance, as generated using GATK/3.4 Genotype Concordance function. Code, logs and output data are available at http://doi.org/10.5281/zenodo.160539.
Although these results indicate that our genotype accuracy is very good, there are several caveats to consider. In the quality-filtered small-variant data, seven samples (H034, H035, H040, H038, H039, H188, H174) had prominently higher genotype drop-out rates than the others (of 2–7%), as well as a higher proportion of homozygous non-reference genotypes (2–4%; See Figure 4C). Additionally two samples had prominently more heterozygous variants (H072:885,551 and H093:955,148 verses the other LH M hemiclones: mean 710,934).
Although the genotype replication rate for the structural variants was also very high, we cannot exclude the possibility that, due to incomplete masking of hard-to-sequence regions of the reference assembly, variants which are artefacts reported in the original genotype data, may also be present in the replication genotype data.
Data availability
The data referenced by this article are under copyright with the following copyright statement: Copyright: © 2016 Gilks WP et al.
All publicly-available records are for 220 LH M hemiclone individuals and 2 in-house reference line individuals, with the exception of the large-variant data for which one in-house reference line sample and six LH M hemiclones were omitted. The NCBI BioProject identifier is PRJNA282591. Code, logs and quality control data for each dataset, and for generating the figures and tables in this manuscript are publicly-available at Zenodo, https://zenodo.org/, ’Sussex Drosophila Sequencing’ community. Use of the files uploaded to Zenodo is under Creative Commons 4.0 license.
Sequencing data
Raw fastq sequence reads, and bam alignment files for D. melanogaster are publicly-available at the NCBI Sequence Read Archive, accession number SRP058502 ( https://www.ncbi.nlm.nih.gov/sra/?term=SRP058502). The code for read-mapping, alongside the run logs and quality-control data are available at https://doi.org/10.5281/zenodo.159251. Additionally the sequence alignment files for the corresponding Wolbachia have accession number SRP091004 ( https://www.ncbi.nlm.nih.gov/sra/?term=SRP091004), with further supporting files at https://doi.org/10.5281/zenodo.159784 16.
Small-variant data
Records of quality-filtered sequence variants identified by GATK HaplotypeCaller in the LH M hemiclones, and in the in-house reference line, are available from the NCBI dbSNP, https://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewBatch.cgi?sbid=1062461, handle: MORROW_EBE_SUSSEX. In compliance with NCBI dbSNP criteria, variants >50bp in length, multi-allelic indels, and variants with a null alternate allele are excluded. More extensive genotype data (unfiltered, quality-filtered, and formatted for NCBI dbSNP) are available at https://doi.org/10.5281/zenodo.159272 17. Also included is the code used for variant discovery and genotyping, quality-filtering and formatting, alongside run logs and quality-control data. Further filtering of this dataset may be necessary to remove localised areas of artefact SNPs in single samples. We have also released a gvcf genotypes file which contains an ’all-sites’ record of the sample genotypes, available for download at https://doi.org/10.5281/zenodo.198880.
Structural-variant data
Records of quality-filtered variants detected by GenomeStrip, and variants >50bp detected by Haplotype Caller are publicly-available at NCBI dbVar, accession number nstd134, https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd134/. Unfiltered and filtered genotype data, code for CNV discovery and genotyping using Genomestrip/2.0, run logs, and summary data are publicly-available at https://doi.org/10.5281/zenodo.159472 17.
Genotype reproducibility testing
Run code and logs for performing the genotyping using Haplotype Caller and Genomestrip when three samples are replaced by hemiclones from the same line, code for comparing the genotype calls between pairs of hemiclones, and results tables are located at https://doi.org/10.5281/zenodo.160539.
Data for manuscript tables and figures
Input data, code and logs for generating the figures and tables used in this manuscript are located at https://doi.org/10.5281/zenodo.159282. The LATEXcode for generating the flow-diagram for Figure 2 is available from https://doi.org/10.5281/zenodo.168582. Code and logs for the generation of the input data is provided in the data releases pertaining to each process.
Acknowledgements
Sequencing was performed under contract by Exeter University, DNA Sequencing Service (UK), who also provided analysis advice, http://www.exeter.ac.uk/business/facilities/sequencing/. Crucial computational support was provided by Jeremy Maris at the Centre for High-Performance Computing, University of Sussex, http://www.sussex.ac.uk/its/services/research/highperformance. Bob Handsaker (Harvard Medical School, USA) provided analysis advice for use of Genomestrip for structural variant detection.
Funding Statement
Funding was provided to EM by a Royal Society University Research Fellowship, the Swedish Research Council (No. 2011-3701), and by the European Research Council (No. 280632).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; referees: 2 approved]
Supplementary material
Supplementary information for: "Whole genome resequencing of a laboratory-adapted Drosophila melanogaster population".
References
- 1. Bentley DR, Balasubramanian S, Swerdlow HP, et al. : Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–59. 10.1038/nature07517 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Abbott JK, Morrow EH: Obtaining snapshots of genetic variation using hemiclonal analysis. Trends Ecol Evol. 2011;26(7):359–368. 10.1016/j.tree.2011.03.011 [DOI] [PubMed] [Google Scholar]
- 3. Rice WR, Linder JE, Friberg U, et al. : Inter-locus antagonistic coevolution as an engine of speciation: assessment with hemiclonal analysis. Proc Natl Acad Sci U S A. 2005;102(Suppl 1):6527–6534. 10.1073/pnas.0501889102 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Innocenti P, Morrow EH: The sexually antagonistic genes of Drosophila melanogaster. PLoS Biol. 2010;8(3):e1000335. 10.1371/journal.pbio.1000335 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Adams MD, Celniker SE, Holt RA, et al. : The genome sequence of Drosophila melanogaster. Science. 2000;287(5461):2185–2195. 10.1126/science.287.5461.2185 [DOI] [PubMed] [Google Scholar]
- 6. Hoskins RA, Carlson JW, Wan KH, et al. : The Release 6 reference sequence of the Drosophila melanogaster genome. Genome Res. 2015;25(3):445–458. 10.1101/gr.185579.114 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Richards S, Murali SC: Best Practices in Insect Genome Sequencing: What Works and What Doesn’t. Curr Opin Insect Sci. 2015;7:1–7. 10.1016/j.cois.2015.02.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Li H, Handsaker B, Wysoker A, et al. : The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. 10.1093/bioinformatics/btp352 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Lunter G, Goodson M: Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011;21(6):936–939. 10.1101/gr.111120.110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Lack JB, Cardeno CM, Crepeau MW, et al. : The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics. 2015;199(4):1229–1241. 10.1534/genetics.115.174664 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. DePristo MA, Banks E, Poplin R, et al. : A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–498. 10.1038/ng.806 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Van der Auwera GA, Carneiro MO, Hartl C, et al. : From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;11(1110):11.10.1–11.10.33. 10.1002/0471250953.bi1110s43 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Cingolani P, Platts A, Wang le L, et al. : A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w 1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80–92. 10.4161/fly.19695 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Handsaker RE, Van Doren V, Berman JR, et al. : Large multiallelic copy number variations in humans. Nat Genet. 2015;47(3):296–303. 10.1038/ng.3200 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Robinson JT, Thorvaldsdóttir H, Winckler W, et al. : Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–26. 10.1038/nbt.1754 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Spradling AC, Mahowald AP: Amplification of genes for chorion proteins during oogenesis in Drosophila melanogaster. Proc Natl Acad Sci U S A. 1980;77(2):1096–1100. 10.1073/pnas.77.2.1096 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Gilks W: Read-mapping for next-generation sequencing data (Wolbachia) [Data set]. Zenodo. 2016. Data Source
- 18. Gilks W: SNP and indel discovery and genotyping in next-generation sequencing data [Data set]. Zenodo. 2016. Data Source