TREATMENT OF CLIPSeq AND RNASeq READS

## Quality analysis
Performed in R, see « Solidreads_quality.R »

## Remove the adapter from the 3' end with cutadapt
#options
-c                      colorspace reads
-a CGCCTTGGCCGTACAGCA   remove the 3' end adapter of sequence CGCCTTGGCCGTACAGCA
--minimum-length=30     keep only reads of at least 30 nt
#command
« cutadapt -c -a CGCCTTGGCCGTACAGCA --minimum-length=30 >  »

## Filtering reads
All reads with uncalled bases in the first 29 nucleotides were removed from the analysis.
The pattern keeps only reads made of the primer base T followed by 29 to 49 called colors (0-3); the second grep removes the "--" group separators that grep prints between matches.
#command
« grep -P --before-context=1 'T[0123]{29,49}' | grep -v '\-\-' > inputSeq_cutadapt_onlyfull.csfasta »

## Mapping to the human genome hg19 with SHRiMP 2.2.3
"SHRiMP2: Sensitive yet Practical Short Read Mapping" (Bioinformatics (2011) doi:10.1093/bioinformatics/btr046)
# building the indexed genome hg19.fa (hg19, GRCh37)
#command
« project-db.py --shrimp-mode cs hg19.fa »
The colorspace projection is hg19-cs
# mapping on the hg19-cs genome
#options
-L hg19-cs              load the projection of the human genome
-N 8                    use 8 threads
--max-alignments 1      if more than one mapping for a read passes all filters, drop them all
-h 80%                  drop an alignment if its mapping score is below 80% of the maximum possible match score
-E                      select SAM output format
#command
« gmapper-cs -L hg19-cs inputSeq_cutadapt_onlyfull.csfasta -N 8 --max-alignments 1 -h 80% -E > inputSeq_cutadapt_onlyfull_hg19_uniq.sam »

## Libraries mapped
#lib A
lib_A1=solid0331_20090818_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
lib_A2=NICKEL_20091023_FC1_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
lib_A3=NICKEL_20091023_FC2_AIA0AOLS_F3_cutadapt_onlyfull.csfasta
#lib B
lib_B1=solid0331_20090818_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
lib_B2=NICKEL_20091023_FC1_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
lib_B3=NICKEL_20091023_FC2_AIA0BOLS_F3_cutadapt_onlyfull.csfasta
#lib RNASEQ
lib_C=NICKEL_20100706_FC2_AIA_COLS1_F3_cutadapt_onlyfull.csfasta

## Converting sam to bam
#command
« samtools view -bSh -o inputSeq_cutadapt_onlyfull_hg19_uniq.bam inputSeq_cutadapt_onlyfull_hg19_uniq.sam »

## Sorting bam files
#command
« samtools sort inputSeq_cutadapt_onlyfull_hg19_uniq.bam inputSeq_cutadapt_onlyfull_hg19_uniq_sort »

## Merge the sorted bam files of each CLIPSeq library (example for lib_A)
#command
« samtools merge -r CLIPCELF1_uniq_libA.bam lib_A1_sort.bam lib_A2_sort.bam lib_A3_sort.bam »

## The mapped data are now in 3 files
CLIPCELF1_uniq_libA.bam
CLIPCELF1_uniq_libB.bam
RNASEQ_uniq_libC.bam

## Remove identical/duplicated reads
# Reads are considered identical if Flag, Reference sequence, Start position and CIGAR are identical. One read is kept.
#command
« samtools view CLIPCELF1_uniq_libA.bam | gawk '! a[$2,$3,$4,$6]++' > CLIPCELF1_uniq_libA_RMDUP.sam »
# the header is added back
« samtools view -H CLIPCELF1_uniq_libA.bam | cat - CLIPCELF1_uniq_libA_RMDUP.sam > CLIPCELF1_uniq_libA_RMDUP_h.sam »
# the bam file is recreated
« samtools view -bSh -o CLIPCELF1_uniq_libA_RMDUP_h.bam CLIPCELF1_uniq_libA_RMDUP_h.sam »
# create the index for IGV
« samtools index CLIPCELF1_uniq_libA_RMDUP_h.bam »

## Remove reads with a leading insertion
# Reads whose CIGAR field starts with an insertion are removed before running FindPeaks because they trigger FindPeaks processing errors.
# They correspond to 582 (libA), 178 (libB) and 3962 (libC) reads, respectively.
#command
« samtools view -h CLIPCELF1_uniq_libA_RMDUP_h.bam | gawk ' $6 !~ /^[1-3]{0,1}[0-9]I/ ' | samtools view -bS - > CLIPCELF1_uniq_libA_noinsert.bam »

## Adding read groups to the header for the pooled libraries (libA, libB)
#for libA
« echo -e '@RG\tID:NICKEL_20091023_FC1_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libA\n@RG\tID:NICKEL_20091023_FC2_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libA\n@RG\tID:solid0331_20090818_AIA0AOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libA' > rg
samtools view -H CLIPCELF1_uniq_libA_noinsert.bam | cat - rg > header_libA
samtools reheader header_libA CLIPCELF1_uniq_libA_noinsert.bam > libA_noinsert_norg.bam »
#for libB
« echo -e '@RG\tID:solid0331_20090818_AIA0BOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libB\n@RG\tID:NICKEL_20091023_FC1_AIA0BOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libB\n@RG\tID:NICKEL_20091023_FC2_AIA0BOLS_F3_cutadapt_onlyfull_hg19_uniq_sort\tPL:Solid\tLB:library\tSM:libB' > rg
samtools view -H CLIPCELF1_uniq_libB_noinsert.bam | cat - rg > header_libB
samtools reheader header_libB CLIPCELF1_uniq_libB_noinsert.bam > libB_noinsert_norg.bam »

FINDPEAKS 4.0 TREATMENT OF FILES:
├── libA_noinsert_norg.bam
├── libB_noinsert_norg.bam
└── RNAseq_noinsert.bam

#options
-aligner sam            format for the aligner used
-dist_type 0 80         size distribution of the original RNA
-alpha 0.05             FDR
-compare                file to use as the reference
-minimum 3              minimum number of reads per cluster
-window_size 400
-control_type 0         linear regression
-bedgraph               output type
#command
FPRUN=bin/fp4/FindPeaks.jar
« java -Xmx2G -jar $FPRUN -name libA -input $INPUTFILE -aligner sam -output $OUTPUTDIR -dist_type 0 80 -alpha 0.05 -compare $COMPAREFILE -minimum 3 -window_size 400 -control_type 0 -bedgraph »

## Post-processing of the FindPeaks output
# Performed in R to create a bed file of clusters, annotate the clusters and analyse the cluster annotation.
Annotation: Gencode Release 19 (GRCh37.p13)
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz
Script « FindPeaksAnalysis.R » available as supplementary data
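
## Toy check of the duplicate and leading-insertion filters
# The two read filters used before FindPeaks (one read kept per Flag, Reference sequence, Start position and CIGAR; reads whose CIGAR starts with an insertion dropped) can be tried on a few made-up SAM records without samtools. This is only a sketch under assumptions: the records, read names and the function names toy_sam, dedup_reads and drop_leading_insertions are hypothetical, plain awk is used in place of gawk, and the insertion pattern is written as [1-3]?[0-9]I, which should match the same strings as the [1-3]{0,1}[0-9]I pattern used in the pipeline.

```shell
#!/bin/sh
# Toy SAM records (tab-separated; names and coordinates are made up).
# Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
toy_sam() {
    printf 'r1\t0\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'
    printf 'r2\t0\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'      # same key as r1
    printf 'r3\t16\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n'     # other strand
    printf 'r4\t0\tchr1\t100\t255\t2I33M\t*\t0\t0\t*\t*\n'    # leading insertion
    printf 'r5\t0\tchr2\t500\t255\t30M1I4M\t*\t0\t0\t*\t*\n'  # internal insertion
}

# Keep the first read seen for each (FLAG, RNAME, POS, CIGAR) combination.
dedup_reads() {
    awk -F '\t' '!seen[$2,$3,$4,$6]++'
}

# Drop reads whose CIGAR starts with an insertion of up to 39 bases.
# Header lines pass through because their sixth field, if present,
# does not look like a CIGAR string.
drop_leading_insertions() {
    awk -F '\t' '$6 !~ /^[1-3]?[0-9]I/'
}

# Print the names of the reads that survive both filters.
toy_sam | dedup_reads | drop_leading_insertions | cut -f 1
```

# On these records the pipeline prints r1, r3 and r5: r2 duplicates r1, r4 starts with an insertion, and r5 is kept because its insertion is internal.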