#################################################################################
#Title: Stepwise Threshold Clustering
#Author: William E. Stutz
#Last Updated: May 10, 2014
#contact: william.stutz@colorado.edu

## READ THE FOLLOWING BEFORE STARTING!!!

# This R script runs the clustering algorithm (STC) described in: Stutz WE, Bolnick DI (2014) "Stepwise Threshold Clustering: a new method for genotyping MHC loci using next-generation sequencing technology".

#### "What type of data (i.e. platform) can I use with STC?" ####
# STC was originally created to analyze reads produced by 454 sequencing. As such, it assumes that the number of reads generated per sample is on the order of 100s to 1000s. Data produced on platforms capable of generating millions of reads (tens of thousands or more reads per sample, i.e. Illumina) may or may not work well with STC as currently implemented. Please see the manuscript listed above for suggestions on applying STC to Illumina-produced data. Additionally, STC was designed to analyze samples with a maximum of 12 different allelic sequences per sample. If your samples are expected to have substantially more alleles (> 25), STC should still work; however, run times may be quite long and the input parameters may need considerable adjustment.

#### "How many samples can I run with STC?" ####
# STC can be run on any number of samples. In fact, it can be run on single samples, which is a good way to optimize the adjustable parameters if the sequences for those samples are known a priori. There is one major "for" loop in the script which iteratively runs through phases 2 and 3 of STC for a given list of samples, which can be supplied by the user.

#### "What format does my input data need to be in?" ####
# STC does not take raw input from a 454 run. Different researchers use different barcoding schemes, sequence different numbers of samples in a single run depending on their research interests, and often parse their 454 data and divide the raw reads among samples and runs in different ways. STC takes as its input a data table (reads.csv) where each row corresponds to a single read. The columns that must be included in reads.csv are:
# seq_number: a unique numerical ID given to each row (i.e. 1 through the total number of reads)
# sampleID: the user-given ID to which each read has been assigned (e.g. based on barcoding)
# seqID: the user-given identifier for each unique sequence in a 454 run (or across multiple runs). These IDs must correspond to the IDs given in the fasta file containing all unique sequences (sequences.fasta, see below).
# Other columns may be present in reads.csv depending on how it was produced; these will be ignored by STC. A toy example of the expected layout appears below.

#### "Are there any other input files that must be input into STC?" ####
# Yes, there is one other file that will be needed. A file labeled "sequences.fasta" must be included in the working directory. This is a fasta file containing all the unique nucleotide sequences found in a given run, along with unique, corresponding IDs for those sequences supplied by the user. These IDs can be generated de novo for each run or can be recycled across runs for nucleotide sequences appearing in multiple runs.
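#--------------------------------------------------------------------------
# Toy illustration (hypothetical IDs and values) of the two required input
# files. reads.csv, one row per read:
#
#   seq_number,sampleID,seqID
#   1,fish_001,hap_0042
#   2,fish_001,hap_0007
#   3,fish_002,hap_0042
#
# sequences.fasta, one entry per unique sequence, named by its seqID:
#
#   >hap_0042
#   ACGT...
#   >hap_0007
#   ACGT...
#
# A quick sanity check (uncomment to run) that every seqID in reads.csv has
# a matching entry in sequences.fasta; this assumes sequences.fasta is read
# in as a named list, as done later in this script:
# reads_chk <- read.csv("reads.csv")
# alls_chk <- ape::read.dna("sequences.fasta", format = "fasta", as.character = TRUE)
# stopifnot(all(unique(reads_chk$seqID) %in% names(alls_chk)))
#--------------------------------------------------------------------------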
#### "What if I can't/don't know how to generate these input files on my own?" ####
# That's totally fine. Please see the data dryad repository accompanying the manuscript listed above (datadryad.org). In that repository you will find a full workflow, including shell scripts, perl scripts, R scripts, and example data files, for running the entire STC process. This workflow assumes only that one can pull the raw multi-fasta file of all reads from the .sff file produced during 454 sequencing (this is typically done by the sequencing facility using proprietary Roche software).

#### "What are good values for the input parameters listed below?" ####
# The values supplied in the script were optimized for the data set analyzed in the accompanying manuscript (see above). Most of them are decent starting values which can be tweaked. However, there are a few which will vary quite a bit depending on the study:
# amp_length: will need to be set individually for every study. It corresponds to the length of the amplified fragment (not including primers) expected to be produced during PCR.
# min_reads: optimized for samples with a maximum of 12 alleles. If more alleles are expected, this number will need to increase to ensure enough coverage. Please see the manuscript for suggestions on how to set this parameter.
# size_thresh: the size criterion threshold (theta) will depend on the maximum number of alleles expected in any given sample and on the number of loci expected to be amplified. Please see the manuscript for suggested values.

#### "What if I have duplicate samples in the same run?" ####
# That's fine. Just give duplicate samples different sampleIDs in the reads.csv file.

#### "How are sequences aligned?" ####
# We have written STC to use clustal to align sequences. Please make sure that clustal is installed on your machine and that it can be called using the clustal function in the ape package (i.e. using exec = "clustalw2"). A short sketch of such a call appears below, just before the description of output files.

#### "How are potentially chimeric alleles identified?" ####
# We have implemented a function within R to check whether potential true alleles are chimeras (see Phase 4). If you have your own preferred method for identifying chimeric alleles, by all means use it.

#### "Where are all the functions used in this script?" ####
# All of the functions are included in the accompanying stc_functions.R script. These functions are the nitty-gritty of STC. We have separated them out from the main script so that the individual steps and overall structure of STC can be easily discerned without users getting bogged down in endless lines of code. If any particular step is not working, by all means consult the functions script and find the troublesome function in question.
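#--------------------------------------------------------------------------
# A minimal sketch (illustrative names, not part of STC itself) of aligning
# sequences with clustal through ape, which is presumably what the
# align.seqs() helper in stc_functions.R wraps. Assumes clustalw2 is
# installed and on your PATH (uncomment to try):
# library(ape)
# sketch_seqs <- read.dna("example.fasta", format = "fasta")  # hypothetical file
# sketch_aln <- clustal(sketch_seqs, exec = "clustalw2")      # returns an aligned DNAbin object
#--------------------------------------------------------------------------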
#### "What do I get at the end?" ####
# If the script runs successfully through all phases of STC you will end up with a lot of files, of which the following are important:
# all_TAs_phase4.csv: a table of all the true alleles found for each sample
# alleles_summary.csv: summary of all true alleles identified during genotyping
# samples_summary.csv: summary of all samples genotyped (and also samples skipped)
# TA.fasta: fasta file of the sequences of all true alleles
# TAs.aln: alignment of all true allele sequences
# TA_dist.csv: distance matrix of all true allele sequences
# sample specific files (where #### denotes the sample ID):
# ####.fasta: fasta file of all sequences found for that sample
# ####.aln: alignment of sequences for that sample
# ####_dist.csv: distance matrix of all sequences found for that sample
# ####_pairs.csv: list of combined pairs of sequences
# ####_sequence_summary.csv: summary of cluster results by sequence
# ####_clusters.csv: table of sequences organized into final clusters
# ####_smalls.csv: table of sequences in small clusters after phase 3
# ####_TAs.csv: table of true alleles
# ####_true.fasta: fasta file of true alleles

#### "Is there anything else I should know?" ####
# We would urge anyone who wishes to try STC on their data set to CAREFULLY read over the manuscript before starting. We would also recommend starting with just one sample and working through phases 2 and 3 of STC step by step (i.e. don't use the for loop which cycles through multiple samples).

#########################################################################
##################### SET PARAMETERS FOR STC ############################
#########################################################################

#assumed correct length for amplicon !!THIS VALUE WILL BE DIFFERENT FOR EVERY ORGANISM!!
amp_length <- 213
#minimum number of reads for genotyping !!THIS VALUE WILL BE DIFFERENT DEPENDING ON THE MAXIMUM NUMBER OF ALLELES EXPECTED PER SAMPLE!!
min_reads <- 80
#maximum number of reads before sub-sampling
max_reads <- 1000
#maximum similarity before clustering stops (%)
max_sim <- 97
#how many reps per round of clustering?
reps <- 100
#size criterion threshold (theta) !!THIS VALUE WILL BE DIFFERENT DEPENDING ON THE MAXIMUM NUMBER OF ALLELES EXPECTED PER SAMPLE!!
size_thresh <- 1/22
#dominance criterion threshold (sigma)
dom_thresh <- 4/1
#common allele threshold (epsilon)
small_cutoff <- 3
#create sample specific sequence tables (TRUE/FALSE)?
create_sequence_tables <- TRUE

######################################################################
# Things to do before STC can start
######################################################################

#libraries needed
library(ape)
library(seqinr)
library(plyr)

#source functions (PLEASE ENSURE THIS FILE IS LOCATED IN THE WORKING DIRECTORY)
source(file = "stc_functions.R")

#data files needed, see top of script for descriptions (BE SURE THESE ARE ALSO LOCATED IN THE WORKING DIRECTORY)
reads <- read.csv("reads.csv") #table of raw reads
alls <- read.dna("sequences.fasta", format = "fasta", as.character = TRUE) #list of all sequences
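#--------------------------------------------------------------------------
# Optional (a sketch; the sample name is hypothetical): when first testing
# STC on a single sample, as recommended above, the reads table can be
# subset here so that only that sample is genotyped:
# reads <- reads[reads$sampleID %in% c("fish_001"), ]
#--------------------------------------------------------------------------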
#create list of all samples (this just generates a list of samples based on the reads.csv file; the list can also be truncated by the user to include just the sampleIDs to be genotyped)
samples <- data.frame(sampleID = unique(reads$sampleID))
write(as.vector(samples$sampleID), "samples", ncolumns = 1)

#####################################################################
################### PHASE 1: SEQUENCE PREPARATION ###################
#####################################################################

#########################################################################
# Create empty data tables
#########################################################################

#create empty tables to store information about all true alleles, small clusters, and all clusters produced during genotyping
all_TAs <- data.frame()
all_smalls <- data.frame()
all_clusters <- data.frame()
#empty list of samples that are skipped due to low read numbers
samples_skipped <- c()
#empty list of samples that are sub-sampled due to large read numbers
samples_big <- c()
#empty tables for creating summary information about alleles and about individual samples
alleles_summary <- data.frame()
samples_summary <- data.frame()

######################################################################
# Start cycling through individual samples (Phases 2 and 3)
######################################################################

for (sample in samples[,1]){

#so you know which sample is currently active
print(sample)

#create a unique filename path for this sample
seqpath <- paste("", sample, sep = "/")

###############################################
#determine whether the sample should be genotyped and/or sub-sampled
###############################################

#extract the reads for this sample
sample_reads <- na.omit(reads[reads$sampleID == sample,])
sample_reads$sampleID <- factor(sample_reads$sampleID)
sample_reads$seqID <- factor(sample_reads$seqID)

#how many reads?
read_numb <- nrow(sample_reads)

#skip the sample if there are too few reads
if (read_numb < min_reads){
  info <- data.frame(sampleID = sample, read_numb)
  samples_skipped <- rbind(samples_skipped, info)
  write.csv(samples_skipped, "samples_skipped.csv", row.names = FALSE)
  #add data to summary table
  samp_summ <- data.frame(sampleID = sample,
    read_number = read_numb,
    read_number_used = 0,
    good_alleles = NA,
    small_clusters = NA,
    amb_clusters = NA)
  samples_summary <- rbind(samples_summary, samp_summ)
  next()
}

#if there are too many reads, sub-sample down to max_reads
if (read_numb > max_reads){
  info <- cbind(sample, read_numb)
  samples_big <- rbind(samples_big, info)
  write.csv(samples_big, "samples_big.csv", row.names = FALSE)
  #sub-sample reads down to max_reads without replacement
  sample_reads <- sample_reads[sample(nrow(sample_reads), max_reads, replace = FALSE),]
  #re-factor the sample sequence IDs
  sample_reads$seqID <- factor(sample_reads$seqID)
}

#create entry for sample summary table (to be added to later)
samp_summ <- data.frame(sampleID = sample,
  read_number = read_numb,
  read_number_used = nrow(sample_reads))

#####################################################################
################### PHASE 2: SEQUENCE COMBINATION ###################
#####################################################################

###############################################
# create alignment of sequences and distance matrix
###############################################

#create table of the number of reads assigned to each sequence
seq_table <- table(sample_reads$seqID)
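#illustration only (hypothetical IDs and counts): seq_table maps each seqID
#to its read count within the current sample, e.g.
#  hap_0007 hap_0042 hap_0113
#        41       95        3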
#which sequences are present in the given sample?
seqs <- na.omit(names(table(sample_reads$seqID)))

#extract sequence data for those sequences
sample_seqs <- alls[seqs]
write.dna(sample_seqs, file = paste(sample, ".fasta", sep = ""), format = "fasta")

#align sequences using clustal
aln_seqs <- align.seqs(sample_seqs)

#create distance matrix
dist <- dist.dna(aln_seqs, model = "JC69", as.matrix = TRUE)
#convert distance matrix to percent similarity (e.g. a JC69 distance of 0.03 becomes 100*(1 - 0.03) = 97% similarity)
dist <- 100*(1-dist)
#calculate the minimum similarity in the matrix
min_dist <- min(dist)
#write distance matrix to file
write.table(dist, file = paste(sample, "_dist.csv", sep = ""), sep = ",")

###############################################
#create list of paired sequences to combine
###############################################

#re-load the alignment in character format
aln_seqs <- read.dna(file = paste(sample, ".aln", sep = ""), format = "interleaved", as.character = TRUE)

#order rownames from lowest hapID to highest (for efficiency)
aln_seqs <- aln_seqs[order(as.numeric(rownames(aln_seqs))),]

#make list of all possible pairs of sequences
temp_seqs <- data.frame(ID = seqs, numb = seq(1:length(seqs)))
all_pairs <- make.pairs(temp_seqs, seq_table)

#create list of pairs to combine
pairs_use <- c()
for(pair in c(1:nrow(all_pairs))){
  #how do the two sequences differ?
  diffs <- pair.diffs(pair)
  #add the pair if it falls into type 1, 2, or 3
  pairs_use <- rbind(pairs_use, type.pair(diffs))
}

#adjust pairs for combination
pairs_use <- adjust.pairs(pairs_use)

#write pairs to file
write.csv(pairs_use, file = paste(sample, "_pairs.csv", sep = ""))

##############################################
#combine pairs
##############################################

#keep a pre-combination copy of the sequence table
old_seq_table <- seq_table

#combine pairs in the sample reads table
sample_reads <- ddply(sample_reads, .(seq_number), combine.pairs)

#convert sequence IDs to factor
sample_reads$seqID <- factor(sample_reads$seqID)

#create new sequence table after combinations
seq_table <- table(sample_reads$seqID)

#####################################################################
################### PHASE 3: CLUSTERING #############################
#####################################################################

#####################################################################
#initiate data tables
#####################################################################

#new table for good clusters
good_clusters <- c()
#new table for small clusters
small_clusters <- data.frame()
#new table for true alleles
true_alleles <- data.frame()
#create table of sequences to start with
start_seqs <<- data.frame(cluster = NA,
  sequence = names(seq_table),
  nreads = as.vector(seq_table))
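#illustration only (hypothetical values): start_seqs now holds one row per
#unique post-combination sequence, with cluster membership still unassigned:
#  cluster sequence nreads
#       NA hap_0042     96
#       NA hap_0007     42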
#how many unique sequences to start?
seq_remain <- nrow(start_seqs)

#######################################################################
# run the clustering replicates
#######################################################################

#set first cutoff
cutoff <- round(min_dist)

while(cutoff <= max_sim & nrow(start_seqs) > 0){

  #display current cutoff value
  print(cutoff)

  #set existing clusters to have no value
  start_seqs$cluster <- NA

  #new table for ambiguous clusters
  amb_clusters <- data.frame()

  #perform the clustering replicates
  cluster_replicates <- repeat.cluster(cutoff, start_seqs, reps)

  #parse replicates and assign cluster names
  final_clusters <- parse.results(start_seqs, cluster_replicates)

  #pull good clusters
  good_clusters <- rbind(good_clusters, ddply(final_clusters, .(cluster), pull.goods))
  #pull small clusters
  small_clusters <- rbind(small_clusters, ddply(final_clusters, .(cluster), pull.smalls))
  #pull ambiguous clusters
  amb_clusters <- rbind(amb_clusters, ddply(final_clusters, .(cluster), pull.ambs))

  #start the next round with the ambiguous clusters
  start_seqs <- amb_clusters

  #if no new clusters were formed, jump ahead by more than one cutoff step
  if(nrow(amb_clusters) == nrow(cluster_replicates)){
    if(cutoff < 80){
      #cutoff below 80: jump by 5
      cutoff <- cutoff + 5
    } else {
      if(cutoff < 90){
        #cutoff between 80 and 90: jump by 3
        cutoff <- cutoff + 3
      } else {
        #cutoff 90 or above: step by 1
        cutoff <- cutoff + 1
      }
    }
  } else {
    cutoff <- cutoff + 1
  }
}

#create table of all clusters for the sample
sample_clusters <- rbind(good_clusters, small_clusters, amb_clusters)
sample_clusters <- data.frame(sample_clusters, sampleID = sample)

############################################################################
# divide ambiguous clusters among the top two alleles
############################################################################

#which clusters are ambiguous?
amb_cluster_names <- names(table(amb_clusters$cluster))

#parse ambiguous clusters
for(name in amb_cluster_names){
  #pull the cluster
  cl <- amb_clusters[amb_clusters$cluster == name,]
  #calculate summary stats for the ambiguous cluster
  summary <- amb.summary(cl)
  #should the cluster be re-classified as small?
  small_clusters <- rbind(small_clusters, check.smalls(summary, cl))
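  #(hedged note) amb.summary() presumably evaluates the size (theta) and
  #dominance (sigma) criteria set in the parameters above; e.g. with
  #dom_thresh = 4/1, a cluster whose most-read sequence has at least four
  #times the reads of the runner-up would satisfy the dominance criterion.
  #Consult stc_functions.R and the manuscript for the exact definitions.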
  #should the cluster be re-classified as one good cluster?
  good_clusters <- rbind(good_clusters, check.good(summary, cl))
}

#create table of true alleles from the existing good clusters
true_alleles <- ddply(good_clusters, .(cluster), collapse.goods)

########################################################
#create summary tables for sequences from each sample
########################################################

if(create_sequence_tables == TRUE){
  #create table
  sequence_summary <- create.sequence.summary(old_seq_table)
  #account for sequences that were combined during phase 2
  sequence_summary <- ddply(sequence_summary, .(sequence), combined.seqs)
  #write to file
  sequence_summary <- sequence_summary[order(sequence_summary$cluster),]
  write.csv(sequence_summary, paste(sample, "_sequence_summary.csv", sep = ""))
}

##########################################################################
#clean up
##########################################################################

#add sampleID to small clusters and good clusters
if(nrow(small_clusters) > 0){
  small_clusters <- data.frame(small_clusters, sampleID = sample)
}
good_clusters <- data.frame(good_clusters, sampleID = sample)

#remove "cluster", but keep "allele", in the true allele table
true_alleles <- true_alleles[,-1]

#create fasta file of true alleles for the sample
true_seqs <- alls[names(table(factor(true_alleles$allele)))]

##########################################################################
#add sample results to the global data tables
##########################################################################

#append true alleles to the global true allele table
all_TAs <- rbind(all_TAs, true_alleles)
#append small clusters to the global small cluster table
all_smalls <- rbind(all_smalls, small_clusters)
#append all clusters to the global clusters table
all_clusters <- rbind(all_clusters, sample_clusters)

#complete the sample summary
samp_summ <- data.frame(samp_summ,
  good_alleles = nrow(true_alleles),
  small_clusters = length(table(small_clusters$cluster)),
  amb_clusters = length(table(amb_clusters$cluster)))
samples_summary <- rbind(samples_summary, samp_summ)

############################################################################
# write data to file
############################################################################

#write the sample's true allele table
write.csv(true_alleles, file = paste(sample, "_TAs.csv", sep = ""))
#write the sample's cluster table
write.csv(sample_clusters, file = paste(sample, "_clusters.csv", sep = ""))
#write the sample's small cluster table
write.csv(small_clusters, file = paste(sample, "_smalls.csv", sep = ""))
#write fasta file of the sample's true allele sequences
write.dna(true_seqs, file = paste(sample, "_true.fasta", sep = ""), format = "fasta")
#write current sample summary
write.csv(samples_summary, "samples_summary.csv", row.names = FALSE)
#update global cluster files
write.csv(all_TAs, "all_TAs_phase3.csv", row.names = FALSE)
write.csv(all_smalls, "all_smalls.csv", row.names = FALSE)
write.csv(all_clusters, "all_clusters.csv", row.names = FALSE)
}

#####################################################################
################ PHASE 4: SMALL CLUSTER CROSS CHECKING ##############
#####################################################################

#reload true allele and small cluster tables
all_TAs <- read.csv("all_TAs_phase3.csv")
all_smalls <- read.csv("all_smalls.csv")
samples_summary <- read.csv("samples_summary.csv")

#add a "good" status to indicate alleles that did not drop out
all_TAs$status <- "good"

#treat small clusters as small alleles
names(all_smalls)[1] <- "allele"
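#worked illustration (hypothetical counts) of the cross-check below: with
#small_cutoff (epsilon) = 3, an allele already confirmed as a true allele in
#at least 3 samples counts as "common"; a small cluster whose sequence
#matches a common allele is then promoted to a true allele for its sample
#by cross.check().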
#create table of current true alleles
alleles <- table(all_TAs$allele)

#which true alleles will be used for cross-checking?
alleles.cc <- names(alleles[alleles >= small_cutoff])

#add small clusters as true alleles if they match common alleles
d_ply(all_smalls, .(sampleID, allele), cross.check)

#resort by sampleID
all_TAs <- all_TAs[order(all_TAs$sampleID), ]

#create preliminary list of TAs, fasta file, and alignment
final.list(all_TAs, alls)

#####################################################################
########### PHASE 4: IDENTIFY AND REMOVE CHIMERIC ALLELES ###########
#####################################################################

#re-load the nucleic acid alignment
aln_TAs <- read.dna("TAs.aln", format = "interleaved", as.character = TRUE)

#######################
#parameters to set
#######################

#maximum number of sites at which a daughter sequence can match NEITHER putative parent and still be considered a possible daughter sequence
no_match <- 2
#lower bound for the proportion of informative sites (i.e. sites that differ between the possible parents) that must be consistent with a recombinant daughter sequence for a pair to be considered possible parents
proportion <- 0.98

##############################################
#find all the possible recombinants in each sample
##############################################

possible_recombinants <- ddply(all_TAs, .(sampleID), find.recombs)

#write possibles to file
write.csv(possible_recombinants, "possible_recombinants.csv", row.names = FALSE)

###############################################
#determine whether any of the possible recombinants are likely chimeras
###############################################

#load the list of possible recombinants
recomb_dat <- read.csv("possible_recombinants.csv")

#in how many samples was each allele found overall?
allele_totals <- table(all_TAs$allele)

#cycle through recombinant sequences
possible_chimeras <- ddply(recomb_dat, .(seqID), find.chimeras)

#sort by the percentage match
possible_chimeras <- possible_chimeras[order(possible_chimeras$proportion_match, decreasing = TRUE),]

#write results to file
write.csv(possible_chimeras, "chimera_results.csv", row.names = FALSE)

###############################################
# remove likely chimeras from the final results
###############################################

#load list of possible chimeras
possible_chimeras <- read.csv("chimera_results.csv")

#count as chimeras only those whose potential parent sequences were found in all samples where the putative chimera occurred
chimeras <- possible_chimeras[possible_chimeras$proportion_match == 1,]

#if there are no chimeras
if(nrow(chimeras) == 0){
  chimeras <- data.frame(allele = NA)
}

#indicate which TAs are chimeras
all_TAs <- ddply(all_TAs, .(allele), indicate.chimeras)

#resort by sampleID
all_TAs <- all_TAs[order(all_TAs$sampleID), ]
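#illustration only (hypothetical sequences): a chimeric daughter shares its
#5' end with one parent and its 3' end with the other, e.g.
#  parent A: ACGTACGTAC
#  parent B: TTTTACGGGG
#  chimera : ACGTACGGGG  (A's first half joined to B's second half)
#Here proportion_match = 1 means both putative parents co-occurred in every
#sample in which the putative chimera was found.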
#update sequence summary tables to account for chimeras and dropped alleles
if(create_sequence_tables == TRUE){
  #which samples were skipped?
  skip <- read.csv("samples_skipped.csv", sep = ",")
  #cycle through samples and update the sequence summaries
  ddply(samples, .(sampleID), update.sequence.summary)
}

#####################################################################
########### POST STC CLEAN UP AND DATA PROCESSING ###################
#####################################################################

########################################################
# create a summary table of alleles
########################################################

alleles_summary <- ddply(all_TAs, .(allele), create.allele.summary)

#sort from most common to least common
alleles_summary <- alleles_summary[order(alleles_summary$samples_found, decreasing = TRUE),]

#write summary to file
write.csv(alleles_summary, "alleles_summary.csv", row.names = FALSE)

#indicate which TAs are not the correct length as given in the parameters
all_TAs <- ddply(all_TAs, .(allele), indicate.lengths)

######################################################################################
# update sample summary table
######################################################################################

#load existing table
samples_summary <- read.csv("samples_summary.csv")

#add chimera, length, and true allele columns
samples_summary <- merge(samples_summary,
  ddply(all_TAs, .(sampleID), new.TAs),
  by = c("sampleID"),
  all.x = TRUE)

#output updated table
write.csv(samples_summary, "samples_summary.csv", row.names = FALSE)

######################################################################################
# Create list of final true alleles along with a fasta file and alignment of just true alleles
######################################################################################

#write new list of TAs to file
write.csv(all_TAs, "all_TAs_phase4.csv", row.names = FALSE)

#re-load fasta file of reads from the run
alls <- read.dna("sequences.fasta", format = "fasta", as.character = TRUE)

#create final list of TAs, fasta file, and alignment
final.list(all_TAs, alls)
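#--------------------------------------------------------------------------
# Optional post-run check (a sketch, assuming indicate.chimeras() records
# drop-outs in the "status" column added above): count the non-chimeric
# true alleles recovered per sample (uncomment to run).
# final_TAs <- read.csv("all_TAs_phase4.csv")
# table(final_TAs$sampleID[final_TAs$status == "good"])
#--------------------------------------------------------------------------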