TREATMENT OF CLIPSeq and RNASeq READS


## quality analysis

Performed in R; see « Solidreads_quality.R ».


## Remove adapter from the 3' end with cutadapt
#options
-c colorspace reads
-a remove the 3' adapter of sequence CGCCTTGGCCGTACAGCA
--minimum-length=30 keep only reads of at least 30 nt after trimming

#command
« cutadapt -c -a CGCCTTGGCCGTACAGCA --minimum-length=30 <inputSeq.csfasta> > <inputSeq_cutadapt.csfasta> »


## filtering reads
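All reads with an uncalled base (".") in the first 29 positions were removed from the analysis: only reads made of the primer base T followed by at least 29 called colors (0-3) are kept. The header line of each kept read is retrieved with --before-context=1, and the "--" group separators printed by grep are then discarded.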
#command
« grep -P --before-context=1 'T[0123]{29,49}' <inputSeq_cutadapt.csfasta> | grep -v '\-\-' > inputSeq_cutadapt_onlyfull.csfasta »
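
As a quick check of the filter above, the following sketch applies the same rule to a tiny made-up csfasta file. It assumes two-line records (header, then read) and uses awk rather than grep -P so that it runs without PCRE support; the file names and the three reads are hypothetical.

```shell
tmp=$(mktemp -d)

# Toy csfasta (hypothetical reads): "T" + color calls, "." = uncalled.
cat > "$tmp/toy.csfasta" <<'EOF'
>read_ok
T012301230123012301230123012301
>read_uncalled_early
T012301230.23012301230123012301
>read_uncalled_late
T01230123012301230123012301230.
EOF

# Keep a record only if the read starts with T followed by >= 29 called colors.
awk '
  /^>/ { header = $0; next }                    # remember the header line
  match($0, /^T[0-3]+/) && RLENGTH - 1 >= 29 {  # RLENGTH counts the leading T too
      print header
      print $0
  }
' "$tmp/toy.csfasta" > "$tmp/toy_onlyfull.csfasta"

kept=$(grep -c '^>' "$tmp/toy_onlyfull.csfasta")
echo "kept $kept of 3 reads"
```

Here read_uncalled_late is kept because its only uncalled position comes after 29 called colors, which is also what the grep pattern allows.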


## mapping to the human genome hg19 with SHRiMP 2.2.3 ("SHRiMP2: Sensitive yet Practical Short Read Mapping", Bioinformatics (2011), doi:10.1093/bioinformatics/btr046)
# building the indexed genome  hg19.fa (hg19, GRCh37)
#command
« project-db.py --shrimp-mode cs hg19.fa »
The resulting colorspace projection is hg19-cs

# mapping on the hg19-cs genome
#options
-L hg19-cs			Load the projection of the human genome
-N 8				use 8 threads
--max-alignments 1	if more than one mapping for a read passes all filters, drop them all
-h 80%			drop an alignment if its score is below 80% of the maximum possible match score
-E				Select SAM output format

# command
« gmapper-cs -L hg19-cs inputSeq_cutadapt_onlyfull.csfasta -N 8 --max-alignments 1 -h 80% -E  > inputSeq_cutadapt_onlyfull_hg19_uniq.sam »

## libraries mapped with the command above
#lib A
lib_A1=solid0331_20090818_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
lib_A2=NICKEL_20091023_FC1_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
lib_A3=NICKEL_20091023_FC2_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
#lib B
lib_B1=solid0331_20090818_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
lib_B2=NICKEL_20091023_FC1_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
lib_B3=NICKEL_20091023_FC2_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
#lib RNASEQ
lib_C=NICKEL_20100706_FC2_AIA_COLS1_F3_cutadapt_onlyfull.csfasta
##
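
The mapping command can be repeated over the files listed above with a simple loop. The sketch below is a dry run that only prints the commands; it assumes the output name is obtained by replacing ".csfasta" with "_hg19_uniq.sam", as in the command template.

```shell
# Library files, copied from the list above.
libs="solid0331_20090818_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
NICKEL_20091023_FC1_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
NICKEL_20091023_FC2_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
solid0331_20090818_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
NICKEL_20091023_FC1_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
NICKEL_20091023_FC2_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
NICKEL_20100706_FC2_AIA_COLS1_F3_cutadapt_onlyfull.csfasta"

# Dry run: build one gmapper-cs command line per library, without executing it.
cmds=$(
  for f in $libs; do
    out="${f%.csfasta}_hg19_uniq.sam"   # assumed naming, following the template
    echo "gmapper-cs -L hg19-cs $f -N 8 --max-alignments 1 -h 80% -E > $out"
  done
)
echo "$cmds"
```

Piping the printed lines to a shell (or removing the echo) would run the mappings one after another.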


## converting sam to bam
samtools view -bSh -o inputSeq_cutadapt_onlyfull_hg19_uniq.bam  inputSeq_cutadapt_onlyfull_hg19_uniq.sam

## sorting bam files
samtools sort inputSeq_cutadapt_onlyfull_hg19_uniq.bam inputSeq_cutadapt_onlyfull_hg19_uniq_sort

## merge sorted bam files for each CLIP-seq library (example for lib_A)
# -r attaches an RG tag, inferred from the input file name, to each alignment
samtools merge -r CLIPCELF1_uniq_libA.bam lib_A1_sort.bam lib_A2_sort.bam lib_A3_sort.bam

## the mapped data are now in 3 files
CLIPCELF1_uniq_libA.bam
CLIPCELF1_uniq_libB.bam
RNASEQ_uniq_libC.bam
##

## remove identical/duplicated reads
# reads are considered identical if FLAG, reference sequence, start position and CIGAR are identical. Only the first such read encountered is kept.
#command

« samtools view CLIPCELF1_uniq_libA.bam | gawk '! a[$2,$3,$4,$6]++' > CLIPCELF1_uniq_libA_RMDUP.sam »

# the header is added back
« samtools view -H CLIPCELF1_uniq_libA.bam | cat - CLIPCELF1_uniq_libA_RMDUP.sam > CLIPCELF1_uniq_libA_RMDUP_h.sam »

# bam file is recreated
« samtools view -bSh -o CLIPCELF1_uniq_libA_RMDUP_h.bam  CLIPCELF1_uniq_libA_RMDUP_h.sam »

# create index for IGV
« samtools index  CLIPCELF1_uniq_libA_RMDUP_h.bam »
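
The duplicate rule used above can be illustrated on a few made-up alignments. This is a minimal sketch with plain awk on a hypothetical toy SAM file (no header, tab-separated fields); it is not run on the real data.

```shell
tmp=$(mktemp -d)

# Four hypothetical alignments: r1 and r2 share FLAG, RNAME, POS and CIGAR;
# r3 differs by strand (FLAG 16) and r4 by start position.
{
  printf 'r1\t0\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'
  printf 'r2\t0\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'
  printf 'r3\t16\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'
  printf 'r4\t0\tchr1\t101\t255\t35M\t*\t0\t0\t*\t*\n'
} > "$tmp/toy.sam"

# Print a line only the first time its (FLAG, RNAME, POS, CIGAR) key is seen.
awk -F'\t' '!seen[$2,$3,$4,$6]++' "$tmp/toy.sam" > "$tmp/toy_rmdup.sam"

kept_names=$(cut -f1 "$tmp/toy_rmdup.sam" | paste -sd, -)
echo "kept: $kept_names"
```

Only r2 is dropped: reads on opposite strands at the same position are not treated as duplicates, because FLAG is part of the key.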


# Reads whose CIGAR field starts with an insertion are removed before running FindPeaks, because they trigger FindPeaks processing errors.
# They correspond to 582 (libA), 178 (libB) and 3962 (libC) reads, respectively.
« samtools view -h CLIPCELF1_uniq_libA_RMDUP_h.bam| gawk ' $6 !~ /^[1-3]{0,1}[0-9]I/ ' | samtools view -bS - > CLIPCELF1_uniq_libA_noinsert.bam »
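
To show which records the CIGAR filter above removes, here is a small sketch on a hypothetical toy SAM file. It uses plain awk with the pattern written as `[1-3]?[0-9]I`, which is equivalent to `[1-3]{0,1}[0-9]I` but does not need interval-expression support.

```shell
tmp=$(mktemp -d)

# One header line and four hypothetical alignments with different CIGAR strings.
{
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'
  printf 'r2\t0\tchr1\t200\t255\t2I33M\t*\t0\t0\t*\t*\n'
  printf 'r3\t0\tchr1\t300\t255\t12I23M\t*\t0\t0\t*\t*\n'
  printf 'r4\t0\tchr1\t400\t255\t20M1I14M\t*\t0\t0\t*\t*\n'
} > "$tmp/toy.sam"

# Drop records whose CIGAR (field 6) starts with an insertion of 1 to 39 bases.
# Header lines have no field 6, so they never match and are kept.
awk -F'\t' '$6 !~ /^[1-3]?[0-9]I/' "$tmp/toy.sam" > "$tmp/toy_noinsert.sam"

kept_names=$(cut -f1 "$tmp/toy_noinsert.sam" | paste -sd, -)
echo "kept: $kept_names"
```

r2 and r3 are removed, while r4 is kept because its insertion is internal to the alignment rather than at the start of the CIGAR.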


# Adding read groups in the header for the pooled libraries (libA,libB)
#for libA
« echo -e  '@RG\tID:NICKEL_20091023_FC1_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libA\n@RG\tID:NICKEL_20091023_FC2_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libA\n@RG\tID:solid0331_20090818_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libA' > rg
samtools view -H CLIPCELF1_uniq_libA_noinsert.bam | cat - rg  > header_libA
samtools reheader header_libA CLIPCELF1_uniq_libA_noinsert.bam > libA_noinsert_norg.bam »
 
#for libB
« echo -e  '@RG\tID:solid0331_20090818_AIA0BOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libB\n@RG\tID:NICKEL_20091023_FC1_AIA0BOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libB\n@RG\tID:NICKEL_20091023_FC2_AIA0BOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libB' > rg
samtools view -H CLIPCELF1_uniq_libB_noinsert.bam | cat - rg  > header_libB
samtools reheader header_libB CLIPCELF1_uniq_libB_noinsert.bam > libB_noinsert_norg.bam »
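
The long echo strings above can also be generated from the list of run names. The sketch below uses a hypothetical helper, make_rg (not part of samtools), that prints one @RG line per run with the same PL, LB and SM fields as above.

```shell
# make_rg is a hypothetical helper. Usage: make_rg SAMPLE ID [ID...]
# It prints one tab-separated @RG header line per ID.
make_rg() {
  sample=$1
  shift
  for id in "$@"; do
    printf '@RG\tID:%s\tPL:Solid\tLB:library\tSM:%s\n' "$id" "$sample"
  done
}

# Read-group lines for libA; the IDs are the sorted BAM basenames used above.
rg_libA=$(make_rg libA \
  NICKEL_20091023_FC1_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort \
  NICKEL_20091023_FC2_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort \
  solid0331_20090818_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort)
printf '%s\n' "$rg_libA"
```

Redirecting the printed lines to the file rg gives the same content as the echo command for libA.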

## FindPeaks 4.0 treatment of the files
libA_noinsert_norg.bam
libB_noinsert_norg.bam
RNAseq_noinsert.bam

# options
-aligner sam 	format of the aligner output
-dist_type 0 80 	size distribution of the original RNA
-alpha 0.05 	FDR
-compare 		file to use as the reference
-minimum 3 		minimum number of reads per cluster
-window_size 400 
-control_type 0 	linear regression
-bedgraph		output type
 
FPRUN=bin/fp4/FindPeaks.jar
« java -Xmx2G -jar $FPRUN -name libA -input $INPUTFILE -aligner sam -output $OUTPUTDIR -dist_type 0 80 -alpha 0.05 -compare $COMPAREFILE -minimum 3 -window_size 400 -control_type 0 -bedgraph »


# post-processing of the FindPeaks output was performed in R to create a BED file of clusters, annotate the clusters and analyse the cluster annotation.
Gencode Release 19 (GRCh37.p13)
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz

Script « FindPeaksAnalysis.R » available as supplementary data