#################################################################################
#Title: Stepwise Threshold Clustering
#Author: William E. Stutz
#Last Updated: May 10, 2014
#contact: william.stutz@colorado.edu

## READ THE FOLLOWING BEFORE STARTING!!!

# This R script runs the clustering algorithm (STC) described in: Stutz WE, Bolnick DI (2014) "Stepwise Threshold Clustering: a new method for genotyping MHC loci using next-generation sequencing technology".

#### "What type of data (i.e. platform) can I use with STC?" ####
# STC was originally created to analyze reads produced by 454 sequencing. As such, it assumes that the number of reads generated per sample is on the order of 100s to 1000s. Data produced on platforms capable of generating millions of reads (tens of thousands or more reads per sample, i.e. Illumina) may or may not work well with STC as currently implemented. Please see the manuscript listed above for suggestions on applying STC to Illumina-produced data. Additionally, STC was designed to analyze samples with a maximum of 12 different allelic sequences per sample. If your samples are expected to have substantially more alleles (> 25), STC should still work; however, run times may be quite long and the input parameters may need considerable adjustment.

#### "How many samples can I run with STC?" ####
# STC can be run on any number of samples. In fact, it can be run on single samples, which is a good way to optimize the adjustable parameters if the sequences for those samples are known a priori. There is one major "for" loop in the script which iteratively runs through phases 2 and 3 of STC for a given list of samples, which can be supplied by the user.

#### "What format does my input data need to be in?" ####
# STC does not take raw input from a 454 run. Different researchers use different barcoding schemes, sequence different numbers of samples in a single run depending on their research interests, and often parse their 454 data and divide the raw reads among samples and runs in different ways. STC takes as its input a data table (reads.csv) where each row corresponds to a single read. The columns that must be included in reads.csv are:
# seq_number: a unique numerical ID given to each row (i.e. 1 through the total number of reads)
# sampleID: the user-given ID to which each read has been assigned (e.g. based on barcoding)
# seqID: the user-given identifier for each unique sequence in a 454 run (or across multiple runs). These IDs must correspond to the IDs given in the fasta file containing all unique sequences (sequences.fasta, see below).
# Other columns may be present in reads.csv depending on how it was produced; these will be ignored by STC. A toy example of the expected layout appears below.

#### "Are there any other input files that must be input into STC?" ####
# Yes, there is one other file that will be needed. A file labeled "sequences.fasta" must be included in the working directory. This is a fasta file containing all the unique nucleotide sequences found in a given run, along with unique, corresponding IDs for those sequences supplied by the user. These IDs can be generated de novo for each run or can be recycled across runs for nucleotide sequences appearing in multiple runs.
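#--------------------------------------------------------------------------
# Toy illustration (hypothetical IDs and values) of the two required input
# files. reads.csv, one row per read:
#
#   seq_number,sampleID,seqID
#   1,fish_001,hap_0042
#   2,fish_001,hap_0007
#   3,fish_002,hap_0042
#
# sequences.fasta, one entry per unique sequence, named by its seqID:
#
#   >hap_0042
#   ACGT...
#   >hap_0007
#   ACGT...
#
# A quick sanity check (uncomment to run) that every seqID in reads.csv has
# a matching entry in sequences.fasta; this assumes sequences.fasta is read
# in as a named list, as done later in this script:
# reads_chk <- read.csv("reads.csv")
# alls_chk <- ape::read.dna("sequences.fasta", format = "fasta", as.character = TRUE)
# stopifnot(all(unique(reads_chk$seqID) %in% names(alls_chk)))
#--------------------------------------------------------------------------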
#### "What if I can't/don't know how to generate these input files on my own?" ####
# That's totally fine. Please see the data dryad repository accompanying the manuscript listed above (datadryad.org). In that repository you will find a full workflow, including shell scripts, perl scripts, R scripts, and example data files, for running the entire STC process. This workflow assumes only that one can pull the raw multi-fasta file of all reads from the .sff file produced during 454 sequencing (this is typically done by the sequencing facility using proprietary Roche software).

#### "What are good values for the input parameters listed below?" ####
# The values supplied in the script were optimized for the data set analyzed in the accompanying manuscript (see above). Most of them are decent starting values which can be tweaked. However, there are a few which will vary quite a bit depending on the study:
# amp_length: will need to be set individually for every study. It corresponds to the length of the amplified fragment (not including primers) expected to be produced during PCR.
# min_reads: optimized for samples with a maximum of 12 alleles. If more alleles are expected, this number will need to increase to ensure enough coverage. Please see the manuscript for suggestions on how to set this parameter.
# size_thresh: the size criterion threshold (theta) will depend on the maximum number of alleles expected in any given sample and on the number of loci expected to be amplified. Please see the manuscript for suggested values.

#### "What if I have duplicate samples in the same run?" ####
# That's fine. Just give duplicate samples different sampleIDs in the reads.csv file.

#### "How are sequences aligned?" ####
# We have written STC to use clustal to align sequences. Please make sure that clustal is installed on your machine and that it can be called using the clustal function in the ape package (i.e. using exec = "clustalw2"). A short sketch of such a call appears below, just before the description of output files.

#### "How are potentially chimeric alleles identified?" ####
# We have implemented a function within R to check whether potential true alleles are chimeras (see Phase 4). If you have your own preferred method for identifying chimeric alleles, by all means use it.

#### "Where are all the functions used in this script?" ####
# All of the functions are included in the accompanying stc_functions.R script. These functions are the nitty-gritty of STC. We have separated them out from the main script so that the individual steps and overall structure of STC can be easily discerned without users getting bogged down in endless lines of code. If any particular step is not working, by all means consult the functions script and find the troublesome function in question.
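#--------------------------------------------------------------------------
# A minimal sketch (illustrative names, not part of STC itself) of aligning
# sequences with clustal through ape, which is presumably what the
# align.seqs() helper in stc_functions.R wraps. Assumes clustalw2 is
# installed and on your PATH (uncomment to try):
# library(ape)
# sketch_seqs <- read.dna("example.fasta", format = "fasta")  # hypothetical file
# sketch_aln <- clustal(sketch_seqs, exec = "clustalw2")      # returns an aligned DNAbin object
#--------------------------------------------------------------------------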
#### "What do I get at the end?" ####
# If the script runs successfully through all phases of STC you will end up with a lot of files, of which the following are important:
# all_TAs_phase4.csv: a table of all the true alleles found for each sample
# alleles_summary.csv: summary of all true alleles identified during genotyping
# samples_summary.csv: summary of all samples genotyped (and also samples skipped)
# TA.fasta: fasta file of the sequences of all true alleles
# TAs.aln: alignment of all true allele sequences
# TA_dist.csv: distance matrix of all true allele sequences
# sample specific files (where #### denotes the sample ID):
# ####.fasta: fasta file of all sequences found for that sample
# ####.aln: alignment of sequences for that sample
# ####_dist.csv: distance matrix of all sequences found for that sample
# ####_pairs.csv: list of combined pairs of sequences
# ####_sequence_summary.csv: summary of cluster results by sequence
# ####_clusters.csv: table of sequences organized into final clusters
# ####_smalls.csv: table of sequences in small clusters after phase 3
# ####_TAs.csv: table of true alleles
# ####_true.fasta: fasta file of true alleles

#### "Is there anything else I should know?" ####
# We would urge anyone who wishes to try STC on their data set to CAREFULLY read over the manuscript before starting. We would also recommend starting with just one sample and working through phases 2 and 3 of STC step by step (i.e. don't use the for loop which cycles through multiple samples).

#########################################################################
##################### SET PARAMETERS FOR STC ############################
#########################################################################

#assumed correct length for amplicon !!THIS VALUE WILL BE DIFFERENT FOR EVERY ORGANISM!!
amp_length <- 213
#minimum number of reads for genotyping !!THIS VALUE WILL BE DIFFERENT DEPENDING ON THE MAXIMUM NUMBER OF ALLELES EXPECTED PER SAMPLE!!
min_reads <- 80
#maximum number of reads before sub-sampling
max_reads <- 1000
#maximum similarity before clustering stops (%)
max_sim <- 97
#how many reps per round of clustering?
reps <- 100
#size criterion threshold (theta) !!THIS VALUE WILL BE DIFFERENT DEPENDING ON THE MAXIMUM NUMBER OF ALLELES EXPECTED PER SAMPLE!!
size_thresh <- 1/22
#dominance criterion threshold (sigma)
dom_thresh <- 4/1
#common allele threshold (epsilon)
small_cutoff <- 3
#create sample specific sequence tables (TRUE/FALSE)?
create_sequence_tables <- TRUE

######################################################################
# Things to do before STC can start
######################################################################

#libraries needed
library(ape)
library(seqinr)
library(plyr)

#source functions (PLEASE ENSURE THIS FILE IS LOCATED IN THE WORKING DIRECTORY)
source(file = "stc_functions.R")

#data files needed, see top of script for descriptions (BE SURE THESE ARE ALSO LOCATED IN THE WORKING DIRECTORY)
reads <- read.csv("reads.csv") #table of raw reads
alls <- read.dna("sequences.fasta", format = "fasta", as.character = TRUE) #list of all sequences
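#--------------------------------------------------------------------------
# Optional (a sketch; the sample name is hypothetical): when first testing
# STC on a single sample, as recommended above, the reads table can be
# subset here so that only that sample is genotyped:
# reads <- reads[reads$sampleID %in% c("fish_001"), ]
#--------------------------------------------------------------------------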
#create list of all samples (this just generates a list of samples based on the reads.csv file; the list can also be truncated by the user to include just the sampleIDs to be genotyped)
samples <- data.frame(sampleID = unique(reads$sampleID))
write(as.vector(samples$sampleID), "samples", ncolumns = 1)

#####################################################################
################### PHASE 1: SEQUENCE PREPARATION ###################
#####################################################################

#########################################################################
# Create empty data tables
#########################################################################

#create empty tables to store information about all true alleles, small clusters, and all clusters produced during genotyping
all_TAs <- data.frame()
all_smalls <- data.frame()
all_clusters <- data.frame()
#empty list of samples that are skipped due to low read numbers
samples_skipped <- c()
#empty list of samples that are sub-sampled due to large read numbers
samples_big <- c()
#empty tables for creating summary information about alleles and about individual samples
alleles_summary <- data.frame()
samples_summary <- data.frame()

######################################################################
# Start cycling through individual samples (Phases 2 and 3)
######################################################################

for (sample in samples[,1]){

#so you know which sample is currently active
print(sample)

#create a unique filename path for this sample
seqpath <- paste("", sample, sep = "/")

###############################################
#determine whether the sample should be genotyped and/or sub-sampled
###############################################

#extract the reads for this sample
sample_reads <- na.omit(reads[reads$sampleID == sample,])
sample_reads$sampleID <- factor(sample_reads$sampleID)
sample_reads$seqID <- factor(sample_reads$seqID)

#how many reads?
read_numb <- nrow(sample_reads)

#skip the sample if there are too few reads
if (read_numb < min_reads){
  info <- data.frame(sampleID = sample, read_numb)
  samples_skipped <- rbind(samples_skipped, info)
  write.csv(samples_skipped, "samples_skipped.csv", row.names = FALSE)
  #add data to summary table
  samp_summ <- data.frame(sampleID = sample,
    read_number = read_numb,
    read_number_used = 0,
    good_alleles = NA,
    small_clusters = NA,
    amb_clusters = NA)
  samples_summary <- rbind(samples_summary, samp_summ)
  next()
}

#if there are too many reads, sub-sample down to max_reads
if (read_numb > max_reads){
  info <- cbind(sample, read_numb)
  samples_big <- rbind(samples_big, info)
  write.csv(samples_big, "samples_big.csv", row.names = FALSE)
  #sub-sample reads down to max_reads without replacement
  sample_reads <- sample_reads[sample(nrow(sample_reads), max_reads, replace = FALSE),]
  #re-factor the sample sequence IDs
  sample_reads$seqID <- factor(sample_reads$seqID)
}

#create entry for sample summary table (to be added to later)
samp_summ <- data.frame(sampleID = sample,
  read_number = read_numb,
  read_number_used = nrow(sample_reads))

#####################################################################
################### PHASE 2: SEQUENCE COMBINATION ###################
#####################################################################

###############################################
# create alignment of sequences and distance matrix
###############################################

#create table of the number of reads assigned to each sequence
seq_table <- table(sample_reads$seqID)
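#illustration only (hypothetical IDs and counts): seq_table maps each seqID
#to its read count within the current sample, e.g.
#  hap_0007 hap_0042 hap_0113
#        41       95        3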
#which sequences are present in the given sample?
seqs <- na.omit(names(table(sample_reads$seqID)))

#extract sequence data for those sequences
sample_seqs <- alls[seqs]
write.dna(sample_seqs, file = paste(sample, ".fasta", sep = ""), format = "fasta")

#align sequences using clustal
aln_seqs <- align.seqs(sample_seqs)

#create distance matrix
dist <- dist.dna(aln_seqs, model = "JC69", as.matrix = TRUE)
#convert distance matrix to percent similarity (e.g. a JC69 distance of 0.03 becomes 100*(1 - 0.03) = 97% similarity)
dist <- 100*(1-dist)
#calculate the minimum similarity in the matrix
min_dist <- min(dist)
#write distance matrix to file
write.table(dist, file = paste(sample, "_dist.csv", sep = ""), sep = ",")

###############################################
#create list of paired sequences to combine
###############################################

#re-load the alignment in character format
aln_seqs <- read.dna(file = paste(sample, ".aln", sep = ""), format = "interleaved", as.character = TRUE)

#order rownames from lowest hapID to highest (for efficiency)
aln_seqs <- aln_seqs[order(as.numeric(rownames(aln_seqs))),]

#make list of all possible pairs of sequences
temp_seqs <- data.frame(ID = seqs, numb = seq(1:length(seqs)))
all_pairs <- make.pairs(temp_seqs, seq_table)

#create list of pairs to combine
pairs_use <- c()
for(pair in c(1:nrow(all_pairs))){
  #how do the two sequences differ?
  diffs <- pair.diffs(pair)
  #add the pair if it falls into type 1, 2, or 3
  pairs_use <- rbind(pairs_use, type.pair(diffs))
}

#adjust pairs for combination
pairs_use <- adjust.pairs(pairs_use)

#write pairs to file
write.csv(pairs_use, file = paste(sample, "_pairs.csv", sep = ""))

##############################################
#combine pairs
##############################################

#keep a pre-combination copy of the sequence table
old_seq_table <- seq_table

#combine pairs in the sample reads table
sample_reads <- ddply(sample_reads, .(seq_number), combine.pairs)

#convert sequence IDs to factor
sample_reads$seqID <- factor(sample_reads$seqID)

#create new sequence table after combinations
seq_table <- table(sample_reads$seqID)

#####################################################################
################### PHASE 3: CLUSTERING #############################
#####################################################################

#####################################################################
#initiate data tables
#####################################################################

#new table for good clusters
good_clusters <- c()
#new table for small clusters
small_clusters <- data.frame()
#new table for true alleles
true_alleles <- data.frame()
#create table of sequences to start with
start_seqs <<- data.frame(cluster = NA,
  sequence = names(seq_table),
  nreads = as.vector(seq_table))
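#illustration only (hypothetical values): start_seqs now holds one row per
#unique post-combination sequence, with cluster membership still unassigned:
#  cluster sequence nreads
#       NA hap_0042     96
#       NA hap_0007     42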
#how many unique sequences to start?
seq_remain <- nrow(start_seqs)

#######################################################################
# run the clustering replicates
#######################################################################

#set first cutoff
cutoff <- round(min_dist)

while(cutoff <= max_sim & nrow(start_seqs) > 0){

  #display current cutoff value
  print(cutoff)

  #set existing clusters to have no value
  start_seqs$cluster <- NA

  #new table for ambiguous clusters
  amb_clusters <- data.frame()

  #perform the clustering replicates
  cluster_replicates <- repeat.cluster(cutoff, start_seqs, reps)

  #parse replicates and assign cluster names
  final_clusters <- parse.results(start_seqs, cluster_replicates)

  #pull good clusters
  good_clusters <- rbind(good_clusters, ddply(final_clusters, .(cluster), pull.goods))
  #pull small clusters
  small_clusters <- rbind(small_clusters, ddply(final_clusters, .(cluster), pull.smalls))
  #pull ambiguous clusters
  amb_clusters <- rbind(amb_clusters, ddply(final_clusters, .(cluster), pull.ambs))

  #start the next round with the ambiguous clusters
  start_seqs <- amb_clusters

  #if no new clusters were formed, jump ahead by more than one cutoff step
  if(nrow(amb_clusters) == nrow(cluster_replicates)){
    if(cutoff < 80){
      #cutoff below 80: jump by 5
      cutoff <- cutoff + 5
    } else {
      if(cutoff < 90){
        #cutoff between 80 and 90: jump by 3
        cutoff <- cutoff + 3
      } else {
        #cutoff 90 or above: step by 1
        cutoff <- cutoff + 1
      }
    }
  } else {
    cutoff <- cutoff + 1
  }
}

#create table of all clusters for the sample
sample_clusters <- rbind(good_clusters, small_clusters, amb_clusters)
sample_clusters <- data.frame(sample_clusters, sampleID = sample)

############################################################################
# divide ambiguous clusters among the top two alleles
############################################################################

#which clusters are ambiguous?
amb_cluster_names <- names(table(amb_clusters$cluster))

#parse ambiguous clusters
for(name in amb_cluster_names){
  #pull the cluster
  cl <- amb_clusters[amb_clusters$cluster == name,]
  #calculate summary stats for the ambiguous cluster
  summary <- amb.summary(cl)
  #should the cluster be re-classified as small?
  small_clusters <- rbind(small_clusters, check.smalls(summary, cl))
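  #(hedged note) amb.summary() presumably evaluates the size (theta) and
  #dominance (sigma) criteria set in the parameters above; e.g. with
  #dom_thresh = 4/1, a cluster whose most-read sequence has at least four
  #times the reads of the runner-up would satisfy the dominance criterion.
  #Consult stc_functions.R and the manuscript for the exact definitions.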
  #should the cluster be re-classified as one good cluster?
  good_clusters <- rbind(good_clusters, check.good(summary, cl))
}

#create table of true alleles from the existing good clusters
true_alleles <- ddply(good_clusters, .(cluster), collapse.goods)

########################################################
#create summary tables for sequences from each sample
########################################################

if(create_sequence_tables == TRUE){
  #create table
  sequence_summary <- create.sequence.summary(old_seq_table)
  #account for sequences that were combined during phase 2
  sequence_summary <- ddply(sequence_summary, .(sequence), combined.seqs)
  #write to file
  sequence_summary <- sequence_summary[order(sequence_summary$cluster),]
  write.csv(sequence_summary, paste(sample, "_sequence_summary.csv", sep = ""))
}

##########################################################################
#clean up
##########################################################################

#add sampleID to small clusters and good clusters
if(nrow(small_clusters) > 0){
  small_clusters <- data.frame(small_clusters, sampleID = sample)
}
good_clusters <- data.frame(good_clusters, sampleID = sample)

#remove "cluster", but keep "allele", in the true allele table
true_alleles <- true_alleles[,-1]

#create fasta file of true alleles for the sample
true_seqs <- alls[names(table(factor(true_alleles$allele)))]

##########################################################################
#add sample results to the global data tables
##########################################################################

#append true alleles to the global true allele table
all_TAs <- rbind(all_TAs, true_alleles)
#append small clusters to the global small cluster table
all_smalls <- rbind(all_smalls, small_clusters)
#append all clusters to the global clusters table
all_clusters <- rbind(all_clusters, sample_clusters)

#complete the sample summary
samp_summ <- data.frame(samp_summ,
  good_alleles = nrow(true_alleles),
  small_clusters = length(table(small_clusters$cluster)),
  amb_clusters = length(table(amb_clusters$cluster)))
samples_summary <- rbind(samples_summary, samp_summ)

############################################################################
# write data to file
############################################################################

#write the sample's true allele table
write.csv(true_alleles, file = paste(sample, "_TAs.csv", sep = ""))
#write the sample's cluster table
write.csv(sample_clusters, file = paste(sample, "_clusters.csv", sep = ""))
#write the sample's small cluster table
write.csv(small_clusters, file = paste(sample, "_smalls.csv", sep = ""))
#write fasta file of the sample's true allele sequences
write.dna(true_seqs, file = paste(sample, "_true.fasta", sep = ""), format = "fasta")
#write current sample summary
write.csv(samples_summary, "samples_summary.csv", row.names = FALSE)
#update global cluster files
write.csv(all_TAs, "all_TAs_phase3.csv", row.names = FALSE)
write.csv(all_smalls, "all_smalls.csv", row.names = FALSE)
write.csv(all_clusters, "all_clusters.csv", row.names = FALSE)
}

#####################################################################
################ PHASE 4: SMALL CLUSTER CROSS CHECKING ##############
#####################################################################

#reload true allele and small cluster tables
all_TAs <- read.csv("all_TAs_phase3.csv")
all_smalls <- read.csv("all_smalls.csv")
samples_summary <- read.csv("samples_summary.csv")

#add a "good" status to indicate alleles that did not drop out
all_TAs$status <- "good"

#treat small clusters as small alleles
names(all_smalls)[1] <- "allele"
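#worked illustration (hypothetical counts) of the cross-check below: with
#small_cutoff (epsilon) = 3, an allele already confirmed as a true allele in
#at least 3 samples counts as "common"; a small cluster whose sequence
#matches a common allele is then promoted to a true allele for its sample
#by cross.check().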
#create table of current true alleles
alleles <- table(all_TAs$allele)

#which true alleles will be used for cross-checking?
alleles.cc <- names(alleles[alleles >= small_cutoff])

#add small clusters as true alleles if they match common alleles
d_ply(all_smalls, .(sampleID, allele), cross.check)

#resort by sampleID
all_TAs <- all_TAs[order(all_TAs$sampleID), ]

#create preliminary list of TAs, fasta file, and alignment
final.list(all_TAs, alls)

#####################################################################
########### PHASE 4: IDENTIFY AND REMOVE CHIMERIC ALLELES ###########
#####################################################################

#re-load the nucleic acid alignment
aln_TAs <- read.dna("TAs.aln", format = "interleaved", as.character = TRUE)

#######################
#parameters to set
#######################

#maximum number of sites at which a daughter sequence can match NEITHER putative parent and still be considered a possible daughter sequence
no_match <- 2
#lower bound for the proportion of informative sites (i.e. sites that differ between the possible parents) that must be consistent with a recombinant daughter sequence for a pair to be considered possible parents
proportion <- 0.98

##############################################
#find all the possible recombinants in each sample
##############################################

possible_recombinants <- ddply(all_TAs, .(sampleID), find.recombs)

#write possibles to file
write.csv(possible_recombinants, "possible_recombinants.csv", row.names = FALSE)

###############################################
#determine whether any of the possible recombinants are likely chimeras
###############################################

#load the list of possible recombinants
recomb_dat <- read.csv("possible_recombinants.csv")

#in how many samples was each allele found overall?
allele_totals <- table(all_TAs$allele)

#cycle through recombinant sequences
possible_chimeras <- ddply(recomb_dat, .(seqID), find.chimeras)

#sort by the percentage match
possible_chimeras <- possible_chimeras[order(possible_chimeras$proportion_match, decreasing = TRUE),]

#write results to file
write.csv(possible_chimeras, "chimera_results.csv", row.names = FALSE)

###############################################
# remove likely chimeras from the final results
###############################################

#load list of possible chimeras
possible_chimeras <- read.csv("chimera_results.csv")

#count as chimeras only those whose potential parent sequences were found in all samples where the putative chimera occurred
chimeras <- possible_chimeras[possible_chimeras$proportion_match == 1,]

#if there are no chimeras
if(nrow(chimeras) == 0){
  chimeras <- data.frame(allele = NA)
}

#indicate which TAs are chimeras
all_TAs <- ddply(all_TAs, .(allele), indicate.chimeras)

#resort by sampleID
all_TAs <- all_TAs[order(all_TAs$sampleID), ]
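#illustration only (hypothetical sequences): a chimeric daughter shares its
#5' end with one parent and its 3' end with the other, e.g.
#  parent A: ACGTACGTAC
#  parent B: TTTTACGGGG
#  chimera : ACGTACGGGG  (A's first half joined to B's second half)
#Here proportion_match = 1 means both putative parents co-occurred in every
#sample in which the putative chimera was found.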
#update sequence summary tables to account for chimeras and dropped alleles
if(create_sequence_tables == TRUE){
  #which samples were skipped?
  skip <- read.csv("samples_skipped.csv", sep = ",")
  #cycle through samples and update the sequence summaries
  ddply(samples, .(sampleID), update.sequence.summary)
}

#####################################################################
########### POST STC CLEAN UP AND DATA PROCESSING ###################
#####################################################################

########################################################
# create a summary table of alleles
########################################################

alleles_summary <- ddply(all_TAs, .(allele), create.allele.summary)

#sort from most common to least common
alleles_summary <- alleles_summary[order(alleles_summary$samples_found, decreasing = TRUE),]

#write summary to file
write.csv(alleles_summary, "alleles_summary.csv", row.names = FALSE)

#indicate which TAs are not the correct length as given in the parameters
all_TAs <- ddply(all_TAs, .(allele), indicate.lengths)

######################################################################################
# update sample summary table
######################################################################################

#load existing table
samples_summary <- read.csv("samples_summary.csv")

#add chimera, length, and true allele columns
samples_summary <- merge(samples_summary,
  ddply(all_TAs, .(sampleID), new.TAs),
  by = c("sampleID"),
  all.x = TRUE)

#output updated table
write.csv(samples_summary, "samples_summary.csv", row.names = FALSE)

######################################################################################
# Create list of final true alleles along with a fasta file and alignment of just true alleles
######################################################################################

#write new list of TAs to file
write.csv(all_TAs, "all_TAs_phase4.csv", row.names = FALSE)

#re-load fasta file of reads from the run
alls <- read.dna("sequences.fasta", format = "fasta", as.character = TRUE)

#create final list of TAs, fasta file, and alignment
final.list(all_TAs, alls)
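#--------------------------------------------------------------------------
# Optional post-run check (a sketch, assuming indicate.chimeras() records
# drop-outs in the "status" column added above): count the non-chimeric
# true alleles recovered per sample (uncomment to run).
# final_TAs <- read.csv("all_TAs_phase4.csv")
# table(final_TAs$sampleID[final_TAs$status == "good"])
#--------------------------------------------------------------------------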