Some important points to consider |
• Availability of appropriate computational resources |
• Collaboration with sequencing facility and bioinformatics groups |
• Plan for amount and type of sequencing data needed |
• Does funding allow for sufficient sequence coverage? If not, consider alternative approaches rather than producing a poor, low-coverage assembly |
• Familiarization with data handling pipelines and file formats (see below) |
• High-quality DNA sample (with individual metadata) |
• Plan for analyses and publication |
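The coverage question in the list above can be made concrete with the usual back-of-the-envelope relation: expected mean coverage = (number of reads × read length) / genome size. The following is a minimal Python sketch of that arithmetic. The function names and the example figures (a 1.2 Gb genome, 150 bp reads, a 60x target) are hypothetical and chosen only for illustration.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 1.2 Gb genome, 150 bp reads, 60x target coverage.
genome_size = 1_200_000_000
n = reads_needed(60, 150, genome_size)
print(n)                                       # 480000000
print(expected_coverage(n, 150, genome_size))  # 60.0
```

This gives only a mean value. Usable coverage is typically lower after quality filtering and duplicate removal, so the estimate is best treated as a lower bound on the sequencing effort to budget for.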
Some useful resources |
Internet forums for discussions related to genome sequencing |
• http://seqanswers.com/ |
• http://www.biostars.org/ |
• http://www.biosupport.se/ |
Entry points to genome sequencing, assembly and exemplary downstream analyses |
• Library preparation and sequencing: Mardis (2008, 2013) |
• Quality filtering/preprocessing: Patel and Jain (2012), Zhou and Rokas (2014), Smeds and Künstner (2011) |
• Genome assembly: Nagarajan and Pop (2013), Pop (2009), Flicek and Birney (2009) |
• Assembly evaluation: Earl et al. (2011), Bradnam et al. (2013), Bao et al. (2011) |
• Genome annotation: Yandell and Ence (2012) |
• Mapping: Li and Durbin (2009), Trapnell and Salzberg (2009), Bao et al. (2011) |
• Data handling: Li et al. (2009), Quinlan and Hall (2010) |
• Variant calling: Nielsen et al. (2011), DePristo et al. (2011), Van der Auwera et al. (2013) |
• Haplotype-based approaches: Browning and Browning (2011), Tewhey et al. (2011), Lawson et al. (2012) |
• Population genomic summary statistics: Nielsen et al. (2012b), Danecek et al. (2011) |
Web resources |
• Galaxy (http://galaxyproject.org/) |
• Amazon cloud (http://aws.amazon.com/ec2/) |
• Windows Azure (http://www.windowsazure.com/) |
• Magellan: Cloud Computing for Science (http://www.alcf.anl.gov/magellan) |
• Web Apollo (http://genomearchitect.org/) |
• NCBI BioProject (http://www.ncbi.nlm.nih.gov/bioproject/) |
• Genomes OnLine Database (http://genomesonline.org/cgi-bin/GOLD/index.cgi) |
• ENSEMBL genome database (http://www.ensembl.org/index.html) |
• UCSC Genome Browser (http://genomebrowser.wustl.edu/) |
• FastQC toolkit for quality control during data preprocessing (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) |
Genome size databases |
• Plants: http://data.kew.org/cvalues/ |
• Animals: http://www.genomesize.com/ |
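Genome size databases commonly report C-values in picograms (pg) of DNA, whereas sequencing plans need a size in base pairs. A minimal Python sketch of the conversion follows, assuming the widely used approximation of 1 pg ≈ 978 Mb. The function name and the example C-value are hypothetical.

```python
# Assumed conversion factor: 1 pg of DNA is approximately 978 Mb (0.978e9 bp).
BP_PER_PG = 0.978e9


def c_value_to_bp(c_value_pg: float) -> int:
    """Convert a haploid C-value in picograms to an approximate genome size in bp."""
    return round(c_value_pg * BP_PER_PG)


# Hypothetical example: a species with a reported C-value of 1.25 pg.
print(c_value_to_bp(1.25))  # 1222500000
```

The result is an estimate. Reported C-values for the same species can differ between measurement methods, so it is worth checking several database entries where available.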
Common file formats |
• FASTA: Nucleotide sequence (file extension .fas or .fa) |
• FASTQ: Nucleotide sequence including quality scores |
• SAM: Sequence alignment |
• BAM: Binary version of SAM |
• GFF3: Annotation |
• GTF: Annotation |
• BED: Annotation |
• VCF: Variant calling |
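To illustrate how sequence and quality scores sit together in one of the formats listed above, the following Python sketch reads simple four-line FASTQ records and decodes the quality string into Phred scores. It assumes Phred+33 encoding and records that are not wrapped across multiple lines. The function name `read_fastq` and the example read are hypothetical. For real projects, an established parser is the safer choice.

```python
from io import StringIO
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line also stops the reader)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        yield header[1:], sequence, [ord(c) - 33 for c in quality]


# Hypothetical single-read example held in memory instead of a file.
example = StringIO("@read1\nACGT\n+\nII5!\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)  # read1 ACGT [40, 40, 20, 0]
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the scores of 40, 20 and 0 in the example correspond to error probabilities of 0.0001, 0.01 and 1.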