Copyright (c) 2013, The Developers
All rights reserved.

This directory contains the SnowyOwl gene prediction package. Once the
package is downloaded and decompressed in any location where you have
write privileges, make sure that the programs used by SnowyOwl are
available and edit CONFIG.template to suit your system.

===========
DIRECTORIES
===========

Projects: By default, SnowyOwl creates a directory under Projects for
storing all intermediate and final results files for each genome data
set. Alternatively, the -o option can be used to specify a different
root directory for project files.

bin: Contains scripts and programs used by SnowyOwl.

========
HARDWARE
========

SnowyOwl is designed to run on multi-processor workstations or servers.
At least 3 processors are required, 12 are recommended, and more will
shorten run time. SnowyOwl is not designed for clusters. 24 GB of RAM is
adequate for fungal genomes. SnowyOwl will use temporary disk space
approximately equal to the size of the input files, and leave about
200 MB of output files.

SnowyOwl will optionally use TimeLogic boards and DeCypher software for
accelerated BLAST searching.

========
SOFTWARE
========

The following program packages, with the indicated or newer versions,
are required to run SnowyOwl; all these programs should be accessible
through your system PATH variable.
- UNIX, with both bash and tcsh shells
- Perl 5
- Python 2.7, with modules Biopython 1.59, pysam 0.6, paramiko 1.7.7.1,
  doit 0.21, PyGTK 2.20
- Augustus 2.5.5 (http://bioinf.uni-greifswald.de/augustus/binaries/)
- GeneMark-ES 2.3e (http://exon.gatech.edu/license_download.cgi)
- NCBI Blast+ 2.2.25 (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
- Exonerate (http://www.ebi.ac.uk/~guy/exonerate/)
- Blat (http://hgdownload.cse.ucsc.edu/admin/exe/)
- samtools (https://sourceforge.net/projects/samtools/files/samtools/)
- tabix (https://sourceforge.net/projects/samtools/files/tabix/)
- Cd-hit (http://weizhong-lab.ucsd.edu/cd-hit/download.php)

SnowyOwl uses 1 or 2 protein databases for Blast searching. During
development we have used the Uniprot/Swissprot database for blastx
searching and the NCBI Refseq Fungi database for blastp searching.

==================
CONFIGURATION FILE
==================

The values set in the file "CONFIG" are used as defaults. Any of these
can be left empty and set on the command line using the same name.
Values given on the command line override values in CONFIG. If a
required value (indicated by "required" on the comment line in CONFIG)
is set neither in CONFIG nor on the command line, SnowyOwl will complain
and exit.

When installing SnowyOwl, edit CONFIG.template to suit your system and
your preferences, and save as CONFIG. In place of the default CONFIG
file, you can specify a personalized configuration file with the -c
option on the command line.

SnowyOwl saves a CONFIG file with the values for each project in the
project root directory. Besides providing a record, this file allows you
to restart a project run without re-entering any parameters.

=================
RUNNING SNOWY OWL
=================

An example data set containing all the input files needed to predict the
genes on chr_5_1 of Aspergillus niger with SnowyOwl is available for
download at http://sourceforge.net/projects/snowyowl/files/.
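
The precedence rules described under CONFIGURATION FILE (CONFIG supplies
defaults, the command line overrides them, and a missing required value
stops the run) can be illustrated with a minimal Python sketch. This is
not SnowyOwl's own code: the function name, the choice of which
parameters count as required, and the example values are assumptions
made for illustration only.

```python
import sys

# Hypothetical set of required parameters; the real ones are whichever
# entries CONFIG marks as "required" on their comment lines.
REQUIRED = ("ProjectName", "Genome")


def resolve_parameters(config_values, cli_values, required=REQUIRED):
    """Merge CONFIG defaults with command-line values; the command line wins."""
    merged = dict(config_values)
    for name, value in cli_values.items():
        # Options not given on the command line (None) keep the CONFIG value.
        if value is not None:
            merged[name] = value
    # A required value that is empty or absent in both places is an error.
    missing = [name for name in required if not merged.get(name)]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))
    return merged


if __name__ == "__main__":
    # Illustrative values only.
    config = {"ProjectName": "", "Genome": "genome.fasta", "label": "run1"}
    cli = {"ProjectName": "A_niger_chr5", "label": "run2", "Reads": None}
    try:
        params = resolve_parameters(config, cli)
    except ValueError as err:
        sys.exit(str(err))
    for name in sorted(params):
        print(name, "=", params[name])
```

In this sketch ProjectName is left empty in CONFIG and supplied on the
command line, label is overridden, and Genome keeps its CONFIG value.
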

To run SnowyOwl, you must specify the project parameters through the
CONFIG file or on the command line. You only need to enter values for
parameters that differ from the CONFIG file values.

A GUI is available to help you enter the project parameters. If you
enter

    /SnowyOwl --gui

a dialog will appear pre-populated with the values in the default CONFIG
file. Make appropriate changes and press the OK button at the bottom of
the dialog, and SnowyOwl will carry out a sanity check and then start
its gene prediction run.

If you already have a CONFIG file containing all the parameter values
for your project, or you want to restart a project run using the project
CONFIG file generated by SnowyOwl, enter

    /SnowyOwl -c

Of course you can combine the two approaches if you want to change a few
values in a custom configuration:

    /SnowyOwl --gui -c

To avoid using the GUI (e.g. you are running SnowyOwl from a script) you
can enter all parameters on the command line. The value of every
parameter in the CONFIG file can be altered by prefixing the option name
with '--' and following with a space and the new value. For the
parameters changed most often there are short tags:

    ProjectName  : -p
    ProjectDir   : -o
    Genome       : -g
    MaskedGenome : -n
    Reads        : -r
    MappedReads  : -m
    Transcripts  : -t
    config_file  : -c
    label        : -l

During the SnowyOwl run, the starting and finishing time for each step,
and any fatal error messages, are output to /logs/SnowyOwl.log. Detailed
progress and error output from the programs run by SnowyOwl are saved in
individual logs in the logs directory; consult these logs to
troubleshoot any problems that arise.

When the run finishes, the high-quality gene models predicted by
SnowyOwl can be found in /accepted.gff3. More results are available in
the /Predictions directory, and a summary of all the models generated is
in /logs/Prediction.log.

SnowyOwl keeps all its intermediate files, and will use them when a run
is restarted.
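
When SnowyOwl is driven from a script, the command-line conventions
described above (a short tag for the most common parameters, '--' plus
the CONFIG name for all others) can be assembled programmatically. The
following is a minimal Python sketch under stated assumptions: the
helper name build_command, the executable path, and the example file
names are hypothetical, and only the short tags are taken from this
README.

```python
import shlex

# Short tags listed in this README; any other parameter is passed as
# "--" followed by its CONFIG name.
SHORT_TAGS = {
    "ProjectName": "-p",
    "ProjectDir": "-o",
    "Genome": "-g",
    "MaskedGenome": "-n",
    "Reads": "-r",
    "MappedReads": "-m",
    "Transcripts": "-t",
    "config_file": "-c",
    "label": "-l",
}


def build_command(executable, parameters):
    """Return an argument list: the executable, then option/value pairs."""
    args = [executable]
    for name, value in parameters.items():
        flag = SHORT_TAGS.get(name, "--" + name)
        args.extend([flag, str(value)])
    return args


if __name__ == "__main__":
    # Placeholder path and file names, for illustration only.
    command = build_command(
        "path/to/SnowyOwl",
        {
            "ProjectName": "A_niger_chr5",
            "Genome": "genome.fasta",
            "Reads": "reads.fastq",
        },
    )
    # Print a shell-safe rendering of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can be handed to subprocess.run() as is, which avoids
quoting problems with file names that contain spaces.
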
Once you are satisfied with the results, you can delete any of the
intermediate files to free up disk space; you will want to keep at least
accepted.gff3 and CONFIG.

===========
INPUT FILES
===========

Genome sequence, in FASTA format.

[Optional] Masked genome sequence, in FASTA format. Positions where no
gene predictions are wanted, such as repetitive sequence or ribosomal
DNA, can be masked with N.

RNA-Seq reads, in FASTA or FASTQ format.

A directory containing classified_juncs.gz, a tabix-indexed list of
splice junction positions, and tuque.coverage.wig.gz, a tabix-indexed
file of read coverage depth profiles in bedGraph format. These files are
generated when tuqueSplice [http://sourceforge.net/projects/tuque/] is
used to map RNA-Seq reads or can be created from a .BAM read mapping
file with the script BAM_to_juncs_and_coverage.sh (see below).

A file of likely transcript sequences, assembled from RNA-Seq reads, in
FASTA format.

=================
AUXILIARY SCRIPTS
=================

    SnowyOwl/bin/scripts/BAM_to_juncs_and_coverage.sh reads.bam genome.fasta

can be used to generate the needed classified.juncs.gz and
tuque.coverage.wig.gz files from a set of mapped RNA-Seq reads in BAM
format. The program 'bedtools' (available from
http://code.google.com/p/bedtools/) must be on the system PATH.

    SnowyOwl/bin/scripts/combine_new_and_old_predictions.sh old.models.gff3 new.models.gff3 start_num genome.fasta

can be used to conservatively merge new predictions with existing
predictions, preserving the names of any existing models that are the
same as new models. It produces a non-redundant combined set of old and
new models with name combined.gff3, and a list of the old models that
have been replaced along with their replacements.

    SnowyOwl/bin/scripts/get_accepted_representatives.sh models.gff3 accepted.gff3

can be used to filter imperfect models from a set of scored gene models.
It will output files named accepted.gff3 and imperfect.gff3 and a list
of the frequencies of various flaws in the input models.

=====
HELP!
=====

Questions on the package can be posted on the discussion forum at
http://sourceforge.net/p/snowyowl/discussion/.