Author manuscript; available in PMC: 2024 Jan 3.
Published in final edited form as: Cold Spring Harb Protoc. 2023 Jan 3;2023(1):pdb.prot107748. doi: 10.1101/pdb.prot107748

Transposable Element Differential Expression Analysis of RNA-Seq Data in Nothobranchius furzeri

Bryan Teefy 1, Matthew Malone 1,2, Berenice Benayoun 1,3,4,5,6,*
PMCID: PMC9812909  NIHMSID: NIHMS1855460  PMID: 36223994

Abstract

Transposable elements (TEs) comprise large fractions of eukaryotic genomes, but their repetitive nature and high copy number make bioinformatic analyses more complex. Here, we report three robust pipelines to analyze TE expression from RNA-seq data in a non-model organism, the African turquoise killifish Nothobranchius furzeri. Our protocol can be run with either a genomic or transcriptomic reference depending on available computational resources, with options both for limited memory usage and for more computationally intensive analyses. Our protocol leverages both standard software for classical RNA-seq analysis pipelines and software specialized for TEs. This protocol uses input RNA-seq data from Illumina reads in either single-end or paired-end layout. Here, we show how to start from input RNA-seq data from aging killifish tissues, using a publicly available dataset from which we take single-end and paired-end reads, then trim adapters, align and count trimmed reads, and perform differential expression analyses for TEs.

MATERIALS

Equipment

Standalone Software

cutadapt (3.3, dependency for TrimGalore required for the software to run even without an explicit call)

FastQC

fastx_toolkit (0.0.13, fastx_trimmer)

kallisto (0.46.2)

R (3.5.1)

RepeatMasker (4.1.2-p1)

STAR (2.7.0e)

Subread (2.0.2; for the featureCounts function)

trim_galore (0.6.7, TrimGalore)

R Packages

DESeq2 (1.30.1)

pheatmap (1.0.12)

RColorBrewer (1.1-2)

Supplied Files

The supplied files can be downloaded from a dedicated Figshare repository (https://doi.org/10.6084/m9.figshare.19394723).

Annotation Files

Genome GTF (GCA_014300015.1_MPIA_NFZ_2.0_gene_and_RepeatMasker_FishTEDB_annotations.gtf.gz)

Soft-masked genome reference (GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna.masked.gz)

Transcriptome with TEs (GCA_014300015.1_MPIA_NFZ_2.0_rna_from_genomic_WithFishTEDBv4.fasta.gz)

The N. furzeri genome and transcriptome used to generate these files are publicly available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA599375/ (Willemsen et al. 2020) (genome, GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna.gz; GTF, GCA_014300015.1_MPIA_NFZ_2.0_genomic.gtf.gz; transcriptome, GCA_014300015.1_MPIA_NFZ_2.0_rna_from_genomic.fna.gz).

The N. furzeri TE sequences can be downloaded as a fasta file from http://www.fishtedb.org/project/species-detail?species=Nothobranchius+furzeri (Shao et al. 2018a). Download by pressing the “Download” icon.

It is possible to use other repeat libraries with this protocol, such as teleost-specific subsets of DFAM (Hubley et al. 2016) or Repbase (Bao et al. 2015). Because of sequence divergence across species, an N. furzeri-specific library (e.g., the N. furzeri FishTEDB library used here) is likely to provide more sensitivity. Alternatively, if another custom-made TE library is desired, de novo TE-annotation programs such as RepeatModeler2 (Flynn et al. 2020) can also be used to generate such genome-specific TE references. Although, for concision and clarity in this protocol, we will use the N. furzeri-specific FishTEDB library (Shao et al. 2018b), the following protocol remains largely unchanged if using an alternate library. When using an alternate library, the code should be amended before beginning the protocol as follows: (i) sequences from the desired TE library should be added to the transcriptome (kallisto pipeline) or used to mask the reference genome (STAR pipeline) fasta file instead of the FishTEDB sequences, and (ii) a custom TE GTF should be generated to reflect the modified TE library for TE quantification by featureCounts (STAR pipeline).

Scripts

The supplied scripts can be downloaded from the Benayoun laboratory GitHub (https://github.com/BenayounLaboratory/Killifish_Transposon_Quantification).

Bash Scripts

Fastq_to_Counts_Arguments_v1.1.sh (Supplement)

Fastq_to_Counts_v1.1.sh (Supplement)

Perl Scripts

collapse_perl.pl

R Scripts

Complete_DESeq2_featureCounts.R (Supplement)

Complete_DESeq2_Kallisto.R (Supplement)

METHOD

As outlined in Figure 1A, this pipeline will take compressed raw RNA-seq fastq files as input, trim sequences for quality and perform differential expression analysis with a focus on TEs either with a low-memory-demand option or (if access to a high-memory machine is available) with a more comprehensive but memory-intensive option.

Figure 1. Pipeline overview and example quality-control results on an example African turquoise killifish RNA-seq dataset.

(A) Flowchart diagramming the progress of RNA-seq data through the pipeline. (B) Quality scores from an untrimmed (left) and hard-trimmed (right; removing the first six bases) fastq file from Cencioni et al. (2019). Note the reduced quality at the 5′ end of the untrimmed library compared to that of the trimmed library. (C) Random hexamer bias can be seen in the first six bases of the untrimmed library, while the trimmed library shows an even nucleotide distribution throughout. (D) Analysis of the rate of capture of TE counts at various multimapping allowance options in the STAR mapping step. For ease of representation, the data are shown as a fraction of the maximum detected reads. All nine samples from Cencioni et al. (2019) were processed with allowed multimapping rates from 1 to 5000 as indicated. For both single-end and paired-end reads, no further gains in detected TE counts were observed with more than 100 allowed multimappers.

Preprocessing

Trimming Reads

Prior to removing adapters, reads should be hard trimmed to remove unwanted low-quality sequence stretches at the 3′ end of reads and known recurrent sequence biases at the 5′ end of reads. fastx_trimmer (http://hannonlab.cshl.edu/fastx_toolkit/index.html) is a memory-saving program to achieve this goal. TrimGalore (https://github.com/FelixKrueger/TrimGalore) (Felix Krueger 2021) is then used to remove remaining adapter sequences.

  1. Prior to adapter trimming, check the read quality using FastQC (Andrews 2010).

We recommend loading a few of the fastq.gz read files into the FastQC graphical user interface (GUI) to determine any specificities of the data under study. FastQC will help identify regions of low quality or sequence bias to be hard trimmed in Step 2 by fastx_trimmer (Fig. 1B,C). For example, if the base quality Phred score drops off significantly at the end of reads, it is recommended to hard trim reads to remove the region with consistent trailing poor-quality bases (Phred < 20). Importantly, regions of irregular nucleotide distributions at the 5′ end of reads (indicative of “random” hexamer bias in the reverse transcription step of library preparation) should be removed.

  2. Hard trim reads using fastx_trimmer.

The input read file is designated <fastq.gz>. In this example and all others, files are designated as <file>, and options are designated as [option].

gzcat <fastq.gz> | fastx_trimmer -f [Integer] -l [Integer] -z -i - -Q33 -o [name_of_outfile]
Options Description
gzcat (macOS-based systems) Uncompress file for processing
zcat (Linux-based systems) Alternative for gzcat on Linux systems
-f [integer] Indicates first base to keep; default is 1.
-l [integer] Indicates last base to keep; default is the entire read.
-Q33 Phred +33 offset (standard for most recent Illumina sequencing pipelines). [Omit if using data with Phred +64 offset.]
-z Compress files before piping to output
-o Designates output file; default is STDOUT.
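
As a rough illustration of what the -f and -l options do, the following sketch reproduces the hard trim in awk on a single toy FASTQ record. It is not fastx_trimmer itself: the function name hard_trim and the toy read are assumptions made for this example, and the input must be uncompressed, four-line-per-record FASTQ.

```shell
# Hypothetical helper (not part of fastx_toolkit): keep bases FIRST..LAST
# of every sequence and quality line in a 4-line-per-record FASTQ stream.
hard_trim() {
  awk -v f="$1" -v l="$2" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, f, l - f + 1); next }  # sequence and quality lines
    { print }                                                            # header and "+" lines
  '
}

# Toy 12-base read: keep bases 7 to 10, i.e. drop the first six bases
# (hexamer bias) and the last two bases (trailing low quality).
printf '@read1\nACGTACGTACGT\n+\nIIIIIIHHHHGG\n' | hard_trim 7 10
```

For this toy record the sequence line becomes GTAC and the quality line HHHH, which corresponds to calling fastx_trimmer with -f 7 -l 10.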
  3. Remove adapters, trim low-quality bases, and keep pairing information for paired-end reads using TrimGalore (Felix Krueger 2021).

  • Paired-end data

trim_galore -q [Integer] --length [Integer] --stringency [Integer] --paired <paired_hardtrimmed_1.fq.gz> <paired_hardtrimmed_2.fq.gz> -o [output_directory]
  • Single-end data

trim_galore -q [Integer] --length [Integer] --stringency [Integer] <single_hardtrimmed.fq.gz> -o [output_directory]
Options Description
--paired Treat files as paired, discarding both reads if one read of the pair is removed due to trimming.
-q [integer] Trim low-quality ends from reads, keeping bases with quality at least [integer]. Default is 20.
--length [integer] Discard reads that would become shorter than [integer] after adapter and quality trimming. Default is 20 bp.
--stringency [integer] Minimum base pair overlap with adapter sequence required to trim a sequence. Default is 1 bp.
-o [output_directory] Indicate an output directory if trimmed files are to be written to a directory other than the one housing the input files.

Proceed to Step 4 or 6 depending on memory available.

Limited Memory Usage

Generating a Transcriptome Reference using kallisto

kallisto (Bray et al. 2016) is used to generate read counts from a transcriptome reference. kallisto requires a transcriptome reference with TEs added. In the supplied files, the N. furzeri transcriptome used in this protocol is GCA_014300015.1_MPIA_NFZ_2.0_rna_from_genomic.fna. The TE sequences used in this protocol are from FishTEDB (Shao et al. 2018a). TE fasta sequences were added to the transcriptome fasta file to generate a transcriptome with TEs included. The resultant transcriptome reference is called GCA_014300015.1_MPIA_NFZ_2.0_rna_from_genomic_WithFishTEDBv4.fasta in the supplied files on Figshare.

  4. Generate a kallisto index.
    i. Following the -i option, enter the desired name of the index file.
    ii. Enter the name of the transcriptome with TE sequences included.

The default indexing k-mer size used here is 31 bp (can be overridden with the option --kmer-size=[Integer]).

kallisto index -i <transcriptome_index.idx> <transcriptome.fasta>

Calculating Pseudocounts using kallisto

  5. Run kallisto to obtain genic and TE counts.

kallisto will output counts in specific folders for each sample. The counts can be found in the file "abundance.tsv" in the kallisto result folder.

  • Paired-end data

kallisto quant -i <transcriptome_index.idx> <trimmed_paired_end_fastq.gz file 1> <trimmed_paired_end_fastq.gz file 2> -o [name_of_output_directory] --plaintext
  • Single-end data

kallisto quant -i <transcriptome_index.idx> --single -l [Integer] -s [Integer] <trimmed_single_end_fastq.gz file> -o [name_of_output_directory] --plaintext
Options Description
-i Index file
--plaintext Plain text output format
-o [output directory] Directory to write output to
--single Single-end mode
-l [integer] Estimated average fragment length
-s [integer] Estimated standard deviation of fragment length
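
With many samples, the quant call is usually wrapped in a loop. The sketch below only prints the paired-end commands (a dry run) so that they can be inspected before anything is executed. The function name, the sample names, the kallisto_<sample> output directories, and the assumption that trimmed reads are named <sample>_val_1.fq.gz and <sample>_val_2.fq.gz (the TrimGalore paired-end suffixes) are illustrative choices, not part of the supplied scripts.

```shell
# Print (do not run) one kallisto quant command per paired-end sample.
# Arguments: index file, directory holding trimmed reads, then sample names.
print_kallisto_cmds() {
  index="$1"; reads_dir="$2"; shift 2
  for sample in "$@"; do
    echo "kallisto quant -i $index -o kallisto_$sample --plaintext" \
         "$reads_dir/${sample}_val_1.fq.gz $reads_dir/${sample}_val_2.fq.gz"
  done
}

# Hypothetical sample names, for illustration only
print_kallisto_cmds transcriptome_index.idx trimmed young_1 old_1
```

Once the printed commands look right, the same output can be piped to sh to run them.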

Proceed to Step 12.i.

High Memory Usage

Genome Reference Generation

When mapping to a reference genome, genes and TEs must be annotated in an associated GTF file. To properly annotate N. furzeri TEs, TE sequences were taken from FishTEDB (Shao et al. 2018a) and used as a custom library to soft mask the genome using RepeatMasker (Smit 2013-2015) and to determine the coordinates of TEs in the genome reference. The genome reference is from Willemsen et al. (2020) and is in the supplied files (GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna).

  6. Run RepeatMasker on the genome fasta file to soft mask repeat sequences. Ensure that RepeatMasker outputs a “soft-masked” genome sequence, in which repeat sequences are converted to lower-case letters at their native positions within the chromosomes (the -xsmall option). Configure RepeatMasker to use "RMBlast" as its default search engine.

RepeatMasker <Genome.fa> -lib <FishTEDB_TE_sequences.fa> -xsmall

In the supplied files on Figshare, the soft-masked fasta file is provided under the name “GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna.masked” and the file containing the information about the positions of identified TE sequences in the genome reference is provided under the name “GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna.out”.

  7. Generate a GTF file that includes all the TE and gene positions in the reference genome using the RepeatMasker output file “GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna.out” and the provided script “GTF_Generation_v1.1.R”.

This script parses the RepeatMasker “.out” file to create GTF entries tracking all identified TE positions in the genome reference. TE positions are then combined with the reference genome genic GTF file “GCA_014300015.1_MPIA_NFZ_2.0_genomic.gtf” to obtain one annotation GTF combining genic and TE entries.

In the supplied files on Figshare, this resultant file is “GCA_014300015.1_MPIA_NFZ_2.0_gene_and_RepeatMasker_FishTEDB_annotations.gtf”.
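
To make the conversion concrete, here is a minimal awk sketch of the kind of transformation involved. It is not the supplied GTF_Generation_v1.1.R script: the function name, the choice of "exon" as the feature type, the use of the repeat name as both gene_id and transcript_id, and the toy repeat names and coordinates are all assumptions made for illustration.

```shell
# Hypothetical sketch: turn RepeatMasker .out records into GTF lines.
# .out columns used: 1 = SW score, 5 = query sequence, 6 = begin, 7 = end,
# 9 = strand ("+" or "C" for complement), 10 = matching repeat name.
rm_out_to_gtf() {
  awk 'BEGIN { OFS = "\t" }
    $1 ~ /^[0-9]+$/ {                       # skip the header lines
      strand = ($9 == "C") ? "-" : "+"
      attr = "gene_id \"" $10 "\"; transcript_id \"" $10 "\";"
      print $5, "RepeatMasker", "exon", $6, $7, $1, strand, ".", attr
    }' "$@"
}

# Toy input mimicking the .out layout (values are made up)
rm_out_to_gtf <<'EOF'
   SW   perc perc perc  query     position in query    matching     repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat       class/family  begin end (left)  ID

 1234   10.0  0.5  0.0  chr1      1001  1500  (5000) + NotFur_toy1  DNA/hAT        1    500  (0)    1
  987   12.1  1.0  0.2  chr1      3000  3400  (3100) C NotFur_toy2  LTR/Gypsy     (20)  401   1     2
EOF
```

Each record becomes one tab-separated GTF line, with the complement strand "C" rewritten as "-". The resulting TE entries would still need to be concatenated with the genic GTF, as described in Step 7.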

STAR Mapping

STAR (Dobin and Gingeras 2015) is used to map trimmed RNA-seq reads to a reference genome with high accuracy. STAR requires a large amount of computer memory; therefore, at least 32 GB of RAM are recommended for Steps 8 and 9.

  8. Index the reference genome.

Here, be sure to use the version in which TE sequences were soft-masked (“GCA_014300015.1_MPIA_NFZ_2.0_genomic.fna.masked”).

STAR --runThreadN [integer] --runMode genomeGenerate --genomeDir [directory_to_contain_indexed_genome] --genomeFastaFiles <Genome_with_TEs.fa> --sjdbGTFfile <GTF_with_TEs.gtf>
Options Description
--runThreadN [integer] Number of threads to run
--runMode genomeGenerate Run STAR to generate a genome index
--genomeDir Path to genome index directory
--genomeFastaFiles Path to genome fasta file
--sjdbGTFfile Path to GTF
  9. Run STAR mapping.

Paired-end data necessitate two input files following “--readFilesIn”, while single-end data require only one. Paired-end data resulting from TrimGalore will have the suffixes *_val_1.fq.gz and *_val_2.fq.gz. The multimapper limit should be set sufficiently high to allow multimapping TE reads to map. For this purpose, although it is more time consuming than using only uniquely mapped reads, we recommend setting the upper limit on multimappers to 100, consistent with recommendations for TE-specific counting tools in the field (e.g., TEtranscripts; Jin et al. 2015). In our experience, a multimapping limit of 100 already saturates detectable TE counts for both paired-end and single-end data without a dramatic impact on run time and memory usage, but this parameter can be optimized (Figure 1D).

  • Paired-end data

STAR --genomeDir [path_to_indexed_genome_reference_directory] --readFilesIn <trimmed_paired_end_fastq.gz file 1> <trimmed_paired_end_fastq.gz file 2> --readFilesCommand gzcat --runThreadN [desired_number_of_threads] --outFilterMultimapNmax 100 --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignEndsProtrude 10 ConcordantPair --outSAMtype BAM SortedByCoordinate --outFileNamePrefix [desired_prefix_for_output]
  • Single-end data

STAR --genomeDir [path_to_indexed_genome_reference_directory] --readFilesIn <trimmed_single_end_fastq.gz file> --readFilesCommand gzcat --runThreadN [desired_number_of_threads] --outFilterMultimapNmax 100 --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignEndsProtrude 10 ConcordantPair --outSAMtype BAM SortedByCoordinate --outFileNamePrefix [desired_prefix_for_output]
Options Description
--genomeDir Path to genome index directory
--readFilesIn Path to trimmed fastq.gz files
--readFilesCommand zcat or --readFilesCommand gzcat Use if files are compressed and indicate how to decompress [gzcat for macOS, zcat for Linux].
--runThreadN [integer] Number of threads to run
--outFilterMultimapNmax [integer] Maximum number of loci that the read is allowed to map to. We recommend 100 for this protocol.
--outFilterIntronMotifs RemoveNoncanonicalUnannotated Filter out alignments that contain non-canonical unannotated junctions when using the annotated splice junction database. The annotated non-canonical junctions will be retained.
--alignEndsProtrude [integer] ConcordantPair Allow alignment ends to protrude by up to [integer] bases and report alignments with non-zero protrusion as concordant pairs.
--outSAMtype BAM SortedByCoordinate Output sorted BAM files.
--outFileNamePrefix Prefix for output file names
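
One way to check whether the chosen multimapper limit is adequate is to inspect the mapping summary that STAR writes for each run (Log.final.out, preceded by the chosen output prefix). The sketch below pulls out the three relevant percentages. The function name and the numbers in the toy log are invented for this example, and the field labels are assumed to match those written by the STAR version in use.

```shell
# Hypothetical helper: report unique, multimapped, and "too many loci"
# percentages from one or more STAR Log.final.out files (or from stdin).
star_mapping_summary() {
  awk -F'|' '
    /Uniquely mapped reads %|% of reads mapped to multiple loci|% of reads mapped to too many loci/ {
      label = $1; value = $2
      gsub(/^[ \t]+|[ \t]+$/, "", label)    # strip padding around the label
      gsub(/^[ \t]+|[ \t]+$/, "", value)    # strip padding around the value
      print label ": " value
    }' "$@"
}

# Toy excerpt of a Log.final.out file (numbers are made up)
star_mapping_summary <<'EOF'
                          Number of input reads |  1000000
                        Uniquely mapped reads % |  85.00%
             % of reads mapped to multiple loci |  10.00%
             % of reads mapped to too many loci |  1.50%
EOF
```

Reads reported as mapped to too many loci are those exceeding --outFilterMultimapNmax, so a large value on that line suggests that a higher limit may recover additional TE reads.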

Counting Genome-Mapped Reads

  10. Count reads using featureCounts either with multimappers counted fractionally or with only unique mapping reads counted.

When counting genome-mapped reads, we use featureCounts, a program from the Subread suite (Liao et al. 2014), that allows read counting based on an input GTF file of annotations.

featureCounts with Multimappers Counted Fractionally

TEs are repetitive and/or present in multiple copies throughout a genome. Therefore TE RNA-seq reads often map to multiple genomic loci. To account for both uniquely mapped reads and ambiguously mapped reads, TE reads can be counted fractionally, receiving 1/N counts, where N is the number of genomic loci to which the read mapped.

  i. Generate a count matrix using featureCounts.

Here, multimapping reads are used and are apportioned fractional counts. Each BAM output file from STAR mapping must be listed individually.

Paired-end data

featureCounts -O -p -M --fraction --primary -a <GTF_with_TEs.gtf> -o <featureCounts_Count_Matrix.txt> <bam_file1.bam> <bam_file2.bam> … <bam_fileN.bam>

Single-end data

featureCounts -O -M --fraction --primary -a <GTF_with_TEs.gtf> -o <featureCounts_Count_Matrix.txt> <bam_file1.bam> <bam_file2.bam> … <bam_fileN.bam>
Options Description
-O Assigns reads to all their overlapping meta-features.
-p Fragments are counted instead of reads. Only applicable for paired-end reads.
-M Count multimapping reads.
--fraction Assign multimappers fractional (1/N) counts.
--primary Count primary alignments only. Identified using bit 0x100 in the SAM/BAM FLAG field.
-a Name of the GTF annotation file
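
The 1/N rule can be illustrated with a toy calculation. The awk sketch below is not featureCounts; it only mimics the arithmetic on a made-up table of alignments with three columns (read name, feature the alignment overlaps, and the number of loci N to which the read maps). The function name and the feature names are hypothetical.

```shell
# Hypothetical sketch of fractional counting: every alignment adds 1/N
# to its feature, where N is the number of loci the read maps to.
fractional_counts() {
  awk '{ count[$2] += 1 / $3 }
       END { for (f in count) printf "%s\t%.2f\n", f, count[f] }' "$@" | sort
}

# read1 maps uniquely; read2 maps to two loci (one in each TE);
# read3 maps to four loci, only one of which overlaps an annotated TE.
fractional_counts <<'EOF'
read1 TE_A 1
read2 TE_A 2
read2 TE_B 2
read3 TE_B 4
EOF
```

In this toy case TE_A receives 1 + 1/2 = 1.50 counts and TE_B receives 1/2 + 1/4 = 0.75 counts, which is why the resulting matrix contains non-integer values that must be rounded before DESeq2.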

featureCounts with Only Unique Mapping Reads Counted

Again, to avoid the issue of overcounting TE reads due to multimapping, TE reads can alternatively be quantified using only reads mapping to unique genomic loci. This strategy avoids overcounting from multimapping TE reads by simply avoiding multimapping, although it may lead in some cases to underestimates of TE transcription.

  ii. Generate a count matrix using featureCounts.

Here only uniquely aligned reads are used. BAM output files from STAR mapping must be added individually.

Paired-end data

featureCounts -O -p --primary -a <GTF_with_TEs.gtf> -o <featureCounts_Count_Matrix.txt> <bam_file1.bam> <bam_file2.bam> … <bam_fileN.bam>

Single-end data

featureCounts -O --primary -a <GTF_with_TEs.gtf> -o <featureCounts_Count_Matrix.txt> <bam_file1.bam> <bam_file2.bam> … <bam_fileN.bam>
Options Description
-O Assigns reads to all their overlapping meta-features.
-p Fragments are counted instead of reads. Only applicable for paired-end reads.
--primary Count primary alignments only. Identified using bit 0x100 in the SAM/BAM FLAG field.
-a Name of the GTF annotation file

Proceed to Step 12.v.

Automated Script (Optional)

  11. (Alternative to Steps 2–5 and 8–10) Run the script “Fastq_to_Counts_v1.1.sh”.

We supply on GitHub an automated bash script, “Fastq_to_Counts_v1.1.sh”, that takes its arguments from “Fastq_to_Counts_Arguments_v1.1.sh”, which is filled in by the user and saved as “Fastq_to_Counts_Filled_Arguments.sh”. Fastq_to_Counts_v1.1.sh is an optional automated alternative to Steps 2–5 and 8–10, as it provides both low and high memory-usage options. This program will take either paired-end or single-end data, trim reads and remove adapters, and then either (1) generate a count matrix using kallisto or (2) map reads to a genomic reference using STAR and generate a count matrix using featureCounts. If using the STAR option, users should allocate at least 32 GB of RAM to run the script.

Differential Gene Expression Analysis in R

  12. Perform differential gene expression analysis in R using (pseudo)counts from kallisto or featureCounts.

For each R-based analysis, counts will be imported into R, and the output will be (1) a gene and TE differential expression file, (2) a TE-specific file, and (3) a heatmap of differential TE expression.

Data import and formatting for kallisto option

This is the limited memory-usage option.

  i. Prepare kallisto counts for analysis in R.

kallisto writes one count file, "abundance.tsv", per sample, each in that sample's own output directory.

  ii. Run the Perl script “collapse_perl.pl”, provided in the companion GitHub repository, to concatenate kallisto counts into a single count matrix.

The output will be “kallisto_count_matrix.txt”.

collapse_perl.pl <kallisto_count_matrix.txt> <…/abundance.tsv 1> <…/abundance.tsv 2> … <…/abundance.tsv N>
  iii. For each instance of running differential expression analysis in R, load the following packages:

#Load Packages
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
  iv. Import and reformat the kallisto count matrix in R.

Since kallisto generates fractional pseudocounts, a rounded count matrix should be generated for input to DESeq2, which requires integer input.

#Import Kallisto Count Matrix
Count_Matrix <- read.table("kallisto_count_matrix.txt", header = T, comment.char = "")

#Remove the length and effective length columns, keeping the ID
#column and all count columns (column 4 through the final column)
Count_Matrix <- Count_Matrix[, c(1, 4:ncol(Count_Matrix))]

#Prerequisite: round the counts to nearest integer.
Count_Matrix[,-1] <- round(Count_Matrix[,-1])

Proceed to Step 13.

Data import and formatting for featureCounts options

This is the high memory-usage option.

  v. For each instance of running differential expression analysis in R, load the following packages:

# Load Packages
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
  vi. Import and reformat the featureCounts count matrix in R.

When including multimapping reads with featureCounts, fractional read counts are generated, so fractional counts should be rounded before input to DESeq2, which requires integer input.

# Import featureCounts Count Matrix
Count_Matrix <- read.table("featureCounts_Count_Matrix.txt", header = T)

# Remove Chr, Start, End, Strand, and Length columns, keeping the ID
# column and all count columns (column 7 through the final column)
Count_Matrix <- Count_Matrix[, c(1, 7:ncol(Count_Matrix))]

# Only if using multimapping reads
# Unnecessary when using featureCounts with unique counts only.
Count_Matrix[,-1] <- round(Count_Matrix[,-1])

Proceed to Step 13.

Differential expression analysis

  13. Run the following R code after data have been prepared for differential expression analysis from either kallisto or featureCounts.

# Format matrix, make IDs (first column) into rownames for DESeq2
rownames(Count_Matrix) <- Count_Matrix[, 1]

# Keep only features with nonzero counts in at least six samples
my.good <- which(apply(Count_Matrix[,-1]>0, 1, sum) >= 6)
my.filtered.matrix <- Count_Matrix[my.good,-1]
  14. Create the design matrix based on condition(s) of interest to be used by DESeq2.

In this example, we enter age, which is a continuous numerical variable.

# Set the variable according to which differential expression will be assessed, e.g., age.
# Replace the placeholders with one value per sample, in the column order of the count matrix.
my.Var1 <- c(rep(<Condition1>, <Number_of_Occurrences>),
             rep(<Condition2>, <Number_of_Occurrences>))

# Design matrix
dataDesign = data.frame(row.names = colnames(my.filtered.matrix),
                        condition = my.Var1)
  15. Run DESeq2 normalization and the differential gene expression test.

# Get matrix using the condition(s) as a modeling covariate
dds <- DESeqDataSetFromMatrix(countData = my.filtered.matrix,
                              colData = dataDesign,
                              design = ~ condition)

# Run DESeq2 normalizations and export results
dds.deseq <- DESeq(dds)

# Model differential gene expression as a function of ‘condition’
res.condition <- results(dds.deseq, name= "condition")
  16. Write analysis results to text files.

# Write results file with both genes and TEs included
res.condition.df <- as.data.frame(res.condition)
res.condition.df$ID <- rownames(res.condition.df)
write.table(res.condition.df, file = "TEs_Genes_Diff_Exp.txt", quote = F, row.names = F, sep = '\t')

# Write results file with only TEs included
# When using FishTEDB, all TE names contain the string “NotFur”
res.condition.df.TE <- subset(res.condition.df, grepl("NotFur", res.condition.df$ID))
write.table(res.condition.df.TE, file = "TEs_Only_Diff_Exp.txt", quote = F, row.names = F, sep = '\t')
  17. Obtain variance-stabilizing transformation normalized gene counts for heatmap-plotting purposes.

# Normalized expression value
norm.cts <- getVarianceStabilizedData(dds.deseq)
  18. Retain only TE sequence data and apply a false discovery rate threshold of 0.05.

# Prepare TE Heatmap

# Get the heatmap of conditional changes at FDR5; exclude NA
res.condition <- res.condition[!is.na(res.condition$padj),]

# Keep significantly differentially expressed transcripts
transcripts.condition <- rownames(res.condition)[res.condition$padj < 0.05]

transcript_list <- as.data.frame(transcripts.condition)

# Exclude any genes; retain only TEs
TE_list <- subset(transcript_list, grepl("NotFur", transcript_list$transcripts.condition))
TE_chars <- as.character(TE_list$transcripts.condition)
  19. Plot differentially expressed TEs in a heatmap.

# Create heatmap
my.color.palette <-
colorRampPalette(rev(c("#CC3333","#FF9999","#FFCCCC","white","#CCCCFF","#9999FF","#333399")))(50)

pheatmap(norm.cts[TE_chars,],
         cluster_cols = F,
         cluster_rows = T,
         color = my.color.palette,
         show_rownames = F, scale="row",
         main = "Title", cellwidth = 15, border_color = NA)

Representative heatmaps obtained with this protocol using data from Cencioni et al. (2019) are shown in Figure 2, and the results are summarized in Table 1, illustrating the three possible options in both single-end and paired-end mode (six analytical modes).

Figure 2. Analysis of differentially expressed TEs (false discovery rate (FDR) < 5%) generated from each analysis method.

TEs tend to be derepressed with age in somatic tissues, and data from Cencioni (2019) suggests that TEs are upregulated in muscle samples from aged animals. Results using single-end data (A-D) are highlighted in yellow, while paired-end data (E-H) are highlighted in beige. Each method detects a distinct number of differentially expressed TEs (see Table 1). kallisto calls fewer differentially expressed TEs than either STAR/featureCounts option. The degree of shared calls between the three proposed methods for single-end (D) or paired-end (H) processing is reported in Venn diagrams. Note that, although both STAR/featureCounts options share a high number of common differential TE calls, the overlap is more limited with the kallisto option.

Table 1. Number of differentially expressed TEs following pipeline analysis using either kallisto or featureCounts.

Data from Cencioni (2019) comparing transcription from muscle samples taken from male N. furzeri strain MZM-04/10 specimens. Raw fastq data are available at the Sequence Read Archive (SRP216703). The deposited data is paired end. For single-end processing, only read 1 files were used to simulate a single-end experiment.

Dataset                              kallisto   featureCounts (unique)   featureCounts (multimapper fractional)
Cencioni (2019) (single-end data)    326        584                      672
Cencioni (2019) (paired-end data)    311        606                      663

Scripts for Complete Analysis in R (Optional)

  20. (Alternative to Steps 12–19) Run the scripts "Complete_DESeq2_Kallisto.R" and "Complete_DESeq2_featureCounts.R".

Once count matrices are generated from kallisto or featureCounts (and concatenated using the collapse_perl.pl script for kallisto data), the scripts "Complete_DESeq2_Kallisto.R" and "Complete_DESeq2_featureCounts.R" can be used for the complete R-based analysis outlined in Steps 12–19 using kallisto and featureCounts data, respectively. Each script concatenates Steps 12–19, tailored to data prepared with kallisto or featureCounts, into a single R script that can be run instead of executing those steps one at a time.

DISCUSSION

Generating counts for TE expression remains a challenging problem because TE-derived reads tend to map to multiple loci. Here, we compared the results of TE differential expression on paired-end and single-end datasets using two counting methods, kallisto and STAR/featureCounts, considering both uniquely mapping reads and multimapping reads. When access to high-memory machines is not feasible, kallisto is the only option, as it can easily run on a laptop. However, when a high-memory machine or a computing cluster is available, users can run STAR/featureCounts. In that case, we recommend using STAR/featureCounts with multimapping reads fractionally apportioned, as it seems to provide the best sensitivity for our TE differential expression analysis.

Importantly, the protocol that we present will only provide information for differential TE expression at the subclass/family level. Since all available genome assemblies for N. furzeri are still incomplete (Reichwald et al. 2015; Valenzano et al. 2015; Willemsen et al. 2020), it is likely that locus-level information cannot be obtained reliably at this point. However, when near-complete genome assemblies for the African turquoise killifish become available, this protocol could be adapted to provide locus-level differential expression information by using tools tailored for such approaches (e.g., SQuIRE) (Yang et al. 2019).

ACKNOWLEDGMENTS

Work in our laboratory was supported by the NIA T32 AG052374 Postdoctoral Training Grant to B.T., NIA R21 AG063739, NIGMS R35 GM142395, a pilot grant from the Navigage Foundation, a Hanson-Thorell Family award and a Kathleen Gilmore Biology of Aging research award to B.A.B.

We acknowledge the Center for Advanced Research Computing at the University of Southern California for providing computing resources that have contributed to our work on transposable element differential expression analysis reported here (https://carc.usc.edu).

REFERENCES

  1. Andrews S. 2010. FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed August 20, 2021.
  2. Bao W, Kojima KK, Kohany O. 2015. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6: 11.
  3. Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34(5): 525–527.
  4. Cencioni C, Heid J, Krepelova A, Rasa SMM, Kuenne C, Guenther S, Baumgart M, Cellerino A, Neri F, Spallotta F et al. 2019. Aging Triggers H3K27 Trimethylation Hoarding in the Chromatin of Nothobranchius furzeri Skeletal Muscle. Cells 8(10).
  5. Dobin A, Gingeras TR. 2015. Mapping RNA-seq Reads with STAR. Curr Protoc Bioinformatics 51: 11.14.1–11.14.19.
  6. Felix Krueger FJ, Ewels Phil, Afyounian Ebrahim, & Schuster-Boeckler Benjamin. 2021. FelixKrueger/TrimGalore: v0.6.7 - DOI via Zenodo (0.6.7). Zenodo. doi: 10.5281/zenodo.5127899. Accessed August 20, 2021.
  7. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A 117(17): 9451–9457.
  8. Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. 2016. The Dfam database of repetitive DNA families. Nucleic Acids Res 44(D1): D81–89.
  9. Jin Y, Tam OH, Paniagua E, Hammell M. 2015. TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 31(22): 3593–3599.
  10. Liao Y, Smyth GK, Shi W. 2014. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30(7): 923–930.
  11. Reichwald K, Petzold A, Koch P, Downie BR, Hartmann N, Pietsch S, Baumgart M, Chalopin D, Felder M, Bens M et al. 2015. Insights into Sex Chromosome Evolution and Aging from the Genome of a Short-Lived Fish. Cell 163(6): 1527–1538.
  12. Shao F, Wang J, Xu H, Peng Z. 2018a. FishTEDB: a collective database of transposable elements identified in the complete genomes of fish. Database (Oxford) 2018.
  13. Shao F, Wang J, Xu H, Peng Z. 2018b. FishTEDB: a collective database of transposable elements identified in the complete genomes of fish. Database (Oxford) 2018.
  14. Smit A, Hubley R, Green P. 2013-2015. RepeatMasker Open-4.0. http://www.repeatmasker.org. Accessed August 20, 2021.
  15. Valenzano DR, Benayoun BA, Singh PP, Zhang E, Etter PD, Hu CK, Clement-Ziza M, Willemsen D, Cui R, Harel I et al. 2015. The African Turquoise Killifish Genome Provides Insights into Evolution and Genetic Architecture of Lifespan. Cell 163(6): 1539–1554.
  16. Willemsen D, Cui R, Reichard M, Valenzano DR. 2020. Intra-species differences in population size shape life history and genome evolution. Elife 9.
  17. Yang WR, Ardeljan D, Pacyna CN, Payer LM, Burns KH. 2019. SQuIRE reveals locus-specific regulation of interspersed repeat expression. Nucleic Acids Res 47(5): e27.
