PLOS One. 2022 Sep 22;17(9):e0274591. doi: 10.1371/journal.pone.0274591

On taming the effect of transcript level intra-condition count variation during differential expression analysis: A story of dogs, foxes and wolves

Diana Lobo 1,2,3,*, Raquel Linheiro 1, Raquel Godinho 1,2,3, John Patrick Archer 1,2,*
Editor: Katherine James 4
PMCID: PMC9498955  PMID: 36136981

Abstract

The evolution of RNA-seq technologies has yielded datasets of scientific value that are often generated as condition associated biological replicates within expression studies. With expanding data archives, the opportunity arises to augment replicate numbers when conditions of interest overlap. Despite correction procedures for estimating transcript abundance, a source of ambiguity is transcript level intra-condition count variation, as indicated by disjointed results between analysis tools. We present TVscript, a tool that removes reference-based transcripts associated with intra-condition count variation above specified thresholds, and we explore the effects of such variation on differential expression analysis. Initially, iterative differential expression analysis was performed on simulated counts, where levels of intra-condition variation and sets of over represented transcripts were explicitly specified. Then counts derived from inter- and intra-study data representing brain samples of dogs, wolves and foxes (wolves vs. dogs and aggressive vs. tame foxes) were used. For simulations, the sensitivity in detecting differentially expressed transcripts increased after removing hyper-variable transcripts, although at levels of intra-condition variation above 5% detection became unreliable. For real data, prior to applying TVscript, ≈20% of the transcripts identified as being differentially expressed were associated with high levels of intra-condition variation, an over representation relative to the reference set. When transcripts harbouring such variation were removed pre-analysis, a discordance of 26 to 40% in the lists of differentially expressed transcripts was observed when compared to those obtained using the non-filtered reference. The removal of transcripts possessing intra-condition variation values within (and above) the 97th and 95th percentiles, for wolves vs. dogs and aggressive vs. tame foxes, maximized the sensitivity in detecting differentially expressed transcripts as a result of alterations within gene-wise dispersion estimates. Through analysis of our real data, support is provided for seven genes with potential involvement in selection for tameness. TVscript is available at: https://sourceforge.net/projects/tvscript/.

Introduction

Developments in RNA-seq technology have revolutionized transcriptomic studies by allowing for a rapid high-resolution view of transcript expression [1]. In a typical RNA-seq experiment, transcript expression profiles are estimated for each sample using a metric based upon the number of sequenced reads associated with each transcript within a reference set [2–8]. Condition dependent expression profiles can then be used in order to identify which transcripts are differentially expressed [9, 10]. A challenge arises due to sources of variation within expression profiles that are independent of, or partially overlapping with, the condition of interest [2, 11–13]. The inclusion of biological replicates reduces the effect of such noise [14, 15], and it has been demonstrated that sufficient replicate numbers outweigh sequencing depth in terms of increasing the accuracy within differential expression experiments [14, 16]. In studies not involving highly controlled isolated environments, RNA-seq data from the rapidly growing repertoire of published works can be incorporated [17–19], if data from a condition matching that being studied are available. This effectively increases the number of replicates, although variability can be amplified [20, 21].

Differential expression tools compute a statistical significance for each transcript, based upon the abundance estimates within a condition, that reflects the possibility of that transcript being differentially expressed [9, 10, 22]. To reduce the effect of intra-condition variation across biological replicates on the estimation of abundance, several methods have been proposed including ALDEx2 [23], EDASeq [24] and PEER [25]. In addition to these, and more generally applied, are the abundance estimation techniques implemented within established differential expression tools such as DESeq2 [9] and edgeR [10]. However, when methods are compared, relative to the final sets of transcripts identified as being differentially expressed, variable results are observed [15, 23, 26–29]. This is an indication that the problem of intra-condition variation relative to the detection of differentially expressed transcripts using RNA-seq data has not been completely resolved. Furthermore, there is no consensus on the best approach to use [30].

Here we explore the effects that individual transcripts associated with high levels of intra-condition count variation have on the end results of differential expression analysis using DESeq2 [9], a tool that is well established and that has consistently demonstrated reliability in identifying differentially expressed transcripts [27, 30, 31]. Our aim is to investigate whether the removal of transcripts harbouring the highest levels of intra-condition variation from the reference set used during differential expression analysis can produce sets of differentially expressed transcripts that display an increased level of confidence. This could be achieved through either: (a) the direct removal of transcripts previously identified as being differentially expressed, but whose expression patterns are ambiguous, or (b) the indirect addition, or removal, of transcripts to, or from, those previously identified as being differentially expressed as a consequence of alterations in p-adj values. Such alterations are associated with shifts in the distribution of intra-condition variation, following the removal of transcripts harbouring the highest levels of such variation. A by-product of this is the explicit quantification of the level of intra-condition abundance variation present within the final lists of differentially expressed transcripts.

To aid this exploration we present TVscript, a tool for the identification of transcripts above user-specified levels of intra-condition normalized count variation, the latter being strongly associated with transcript abundance estimation [4–8]. As input, TVscript requires one file per condition-associated replicate that contains the per transcript read counts obtained following the mapping of reads from the replicate to a common reference set. As output, TVscript produces a set of corresponding count files that exclude transcripts harbouring normalized intra-condition count variation higher than that associated with a user specified percentile. These updated count files can be subsequently used within the differential expression tool of choice, in our case DESeq2 [9]. Through multiple iterations of differential expression analysis following filtering at varying thresholds, and comparisons back to differential expression analysis performed on non-filtered inputs, the effects of transcripts associated with high intra-condition variation on the quantity and consistency of differentially expressed transcripts identified can be explored.

Using TVscript we first explore the effects of intra-condition per-transcript read count variation through iterative differential expression analysis experiments involving highly controlled simulated count datasets derived from the available dog reference transcriptome [32], where the exact level of background intra-condition count variation could be specified, as well as a subset of transcripts to be over represented across replicates (of the second condition used within each iteration). Next, we explored the effects of intra-condition per-transcript read count variation within two distinct case studies, involving count data obtained following the mapping of intra- and inter-study RNA-seq datasets. Within these case studies, differential expression patterns arising from data derived from brain samples of dogs and wolves (inter-study scenario involving frontal cortex, cerebral cortex, prefrontal cortex and frontal lobe) [32–35], as well as tame and aggressive foxes (intra-study scenario involving prefrontal cortex) [36], generated in the scope of domestication experiments, are compared at varying thresholds of intra-condition normalized read count variation exclusion.

In addition to exploring the general effects of transcripts harbouring high levels of intra-condition count variation on the outcome of differential expression analysis, we also had an interest in understanding whether or not there were genes commonly up or down regulated within the brain of both forms of domestic canids (dogs and tame foxes), but simultaneously not so within their “wild/aggressive” counterparts (wolves and wild foxes). Such genes are candidates for being associated with tameness. Domestic dogs present marked behavioural differences from wolves, their wild ancestors, due to the evolution of unique social cognitive capabilities [35, 37, 38]. Tame red foxes resulted from deliberate selection against fear and aggression over several generations of cross-breeding [39] and they present several behavioural and phenotypic traits that resemble those found in dogs [36, 40, 41].

TVscript is open source; the code, a quick start guide and test data are available (under the GNU General Public License) through the SourceForge project page https://sourceforge.net/projects/tvscript/.

Materials & methods

RNA-seq datasets

To explore the effects of intra-condition count variation on the detection of differentially expressed transcripts using real data we used both intra- and inter-study datasets. At an intra-study level, we combined publicly available RNA-seq data from brain tissue (prefrontal cortex) of 12 tame and 12 aggressive red foxes (S1 Table) generated within the same study [36]. For the inter-study case, we combined multiple publicly available RNA-seq datasets from several dogs and six wolves [32–35], also derived from brain tissue (frontal cortex, cerebral cortex, prefrontal cortex and frontal lobe) (S1 Table). In relation to the latter, dogs 1 to 6 and wolves 1 to 6 were derived from Albert et al. (2012), dog 6 from Roy et al. (2013), dog 7 (two replicates) from Fushan et al. (2015) and dogs 8 and 9 (three replicates each) from Hoeppner et al. (2014), as described within the table. All the samples were downloaded from the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI), covering a wide range of ages, both sexes as well as multiple replicates and sequencing strategies (S1 Table). We selected these specific case studies for two reasons. Firstly, we were interested in evaluating the effects of intra-condition count variation both at intra- and inter-study levels, and the domestic dog, being a model organism, has an available high-quality reference transcriptome as well as several high-quality RNA-seq datasets generated across different studies. Secondly, we sought to perform a brief exploratory inter-study scan to investigate if the domestication of both dogs and foxes has resulted in the co-expression of a set of common brain genes, relative to their “wild/aggressive” type, since behavioural modifications are considered to have been the first target in domestication [42].

Reads from all samples were mapped to the dog reference transcriptome [32], which contained 26,107 annotated transcripts (Ensembl CanFam3.1, release 92) [43], using Bowtie2 v.2.3.4.1 [44]. We did not use a splice-site aware mapper, such as Tophat2 [45] or HISAT2 [46], since introns were not expected to be present within the reference transcriptome. Reads were not mapped to the dog genome as our aim was to explore the effects of intra-condition variation on a predefined set of reference transcripts, and not to infer novel transcripts from these previously published datasets. We did not use the fox reference transcriptome (available on Ensembl) for mapping the fox datasets within our overall analysis as we were interested in directly comparing differentially expressed transcripts, identified using the fox, dog and wolf datasets, from the same underlying reference set. We did however map all of the fox datasets to the fox reference transcriptome in order to confirm that the proportion of reads mapped was similar to that when using the dog reference transcriptome. The pileup.sh script from the BBmap package [47] was used to obtain per transcript abundance estimates (measured by the number of reads mapped to the corresponding transcript) in each sample. Read counts from technical replicates of “Dog_8” and “Dog_9” were averaged and merged into one file (S1 Table), while read counts from the two biological replicates of “Dog_7” were treated separately. Finally, to confirm that the final read count numbers were reliable relative to the dog reference transcriptome, all read datasets from foxes, dogs and wolves were re-mapped using the pseudo-mapper kallisto v0.46.1, which implements a rapid and accurate k-mer search based strategy for estimating transcript abundance counts [48]. For each dataset an r2 correlation value was calculated describing the linear correlation between the per-transcript counts obtained following Bowtie2 mapping and the corresponding abundance counts obtained using kallisto.
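The r2 comparison between two sets of per-transcript counts can be illustrated with a minimal Python sketch. It assumes r2 is the squared Pearson correlation coefficient of two equal-length count vectors; the function name `r_squared` and the toy values are hypothetical and are not taken from the study's scripts.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length count vectors.

    Assumes both vectors have non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy per-transcript counts from two hypothetical mapping approaches.
mapper_counts = [120, 45, 300, 8, 77]
pseudo_counts = [115, 50, 310, 10, 70]
print(round(r_squared(mapper_counts, pseudo_counts), 4))
```

Values close to 1 indicate that the two approaches rank and scale transcript abundances in a near-linear relationship.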

Software

TVscript requires as input: (1) multiple files containing per transcript read counts (one per sample), (2) a file containing the lengths of the transcripts that the reads were mapped to, (3) a percentile threshold value for intra-condition variation and (4) a configuration file that indicates the locations of all files as well as the condition allocations of the count files in (1). An example configuration file along with further details is available from the SourceForge project page. The steps that TVscript implements to identify transcripts associated with high levels of disjointed read counts are (see Fig 1 for a workflow): (i) each input dataset, containing count values from a particular sample, is allocated to either condition A or B, as indicated within the configuration file; (ii) counts are normalized by dividing them by the length of the corresponding reference transcript and by the sum of all counts for that sample; (iii) for each reference transcript (t), the absolute pairwise differences between normalized read counts across all samples within condition A are calculated; (iv) the corresponding variances are calculated; (v) steps (iii) and (iv) are repeated for condition B; (vi) variance scores from each condition are placed in ascending order and associated with corresponding percentiles; (vii) reference transcripts are removed (or filtered) if their variance score is above that associated with the user specified percentile threshold; (viii) raw read counts associated with the remaining transcripts are outputted into separate files that correspond to each input dataset. To avoid overwriting the original count files, the names of updated count files are specified within the configuration file. These updated count files can be used as input for differential expression analysis software such as DESeq2.
Note: in relation to step (vi), the final list of variance scores obtained is a representation of those derived from both conditions, and during step (vii) it is the variance score that is associated with the user-defined percentile that is used. This means that for separate comparisons (e.g. wolves vs. dogs and aggressive vs. tame foxes) at a given filtering threshold the number of filtered transcripts may not be identical (despite the same reference set being used), as the underlying variance score distribution calculated for each can be different.
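Steps (ii) to (vii) can be sketched in Python as follows. This is an illustrative reading of the description above, not a port of the Java source. Three details are assumptions: the variance is taken as the population variance of the absolute pairwise differences, the percentile uses the nearest-rank method, and a transcript is removed if its score exceeds the pooled threshold in either condition. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance


def normalize(counts, lengths):
    """Step (ii): divide each count by transcript length and by the sample total."""
    total = sum(counts.values())
    return {t: c / lengths[t] / total for t, c in counts.items()}


def variance_scores(samples, lengths):
    """Steps (iii)-(iv): per transcript, the variance of the absolute pairwise
    differences between normalized counts of all samples in one condition."""
    normed = [normalize(s, lengths) for s in samples]
    scores = {}
    for t in lengths:
        diffs = [abs(a[t] - b[t]) for a, b in combinations(normed, 2)]
        scores[t] = pvariance(diffs)  # assumption: population variance
    return scores


def percentile_value(values, pct):
    """Nearest-rank percentile (assumption; TVscript may rank differently)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_transcripts(cond_a, cond_b, lengths, pct):
    """Steps (v)-(vii): pool scores from both conditions, derive the threshold,
    and keep transcripts that do not exceed it in either condition."""
    score_a = variance_scores(cond_a, lengths)
    score_b = variance_scores(cond_b, lengths)
    pooled = list(score_a.values()) + list(score_b.values())
    threshold = percentile_value(pooled, pct)
    return {t for t in lengths
            if score_a[t] <= threshold and score_b[t] <= threshold}
```

For step (viii), the raw counts of each input sample would then be written out for the retained transcript IDs only, leaving the original files untouched.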

Fig 1. Diagrammatic overview of how reference based transcripts are removed by TVscript.


Steps (i) to (viii) indicate actions taken. Reference transcripts (green circles) are shown for diagrammatic purposes in order to highlight how read counts (grey circles) across replicates are treated for each transcript independently. Read counts are grouped into individual files (grey rectangles) in accordance with replicate. These files are grouped into one of two conditions (blue and brown boxes). Remaining keys are indicated at the bottom.

TVscript is implemented in the Java programming language and runs on all operating systems with Java Runtime Environment v.8.0 or higher installed. It is open source and available under the GNU General Public License v3.0. Source code, usage instructions and sample data can be found on the SourceForge project page: https://sourceforge.net/projects/tvscript/. Although TVscript is implemented in Java, the steps involved can be readily implemented within any language (e.g. R or Python), using the detailed description provided above as well as the Java source code that is fully available. There are no dependent packages where code is unavailable. At the time of development we chose Java mainly due to its platform independence, which can be an advantage when setting up analysis pipelines involving many different tools. That said, we are aware that many differential expression analysis tools are R based and future demand may warrant a supported R version.

Controlled intra-condition variation within simulated data

We tested TVscript through a series of iterative differential expression analysis experiments involving highly controlled simulated count data. During each iteration the counts associated with each transcript, within each replicate dataset created, represented simulated transcript expression from the dog reference transcriptome and were obtained using CSReadGen [49]. Using the latter, the level of random background count variation away from that required for a normalized even coverage across all transcripts could be specified, as well as a subset of transcripts to be over represented within replicates of a condition. Background count variation refers to varying count levels associated with individual transcripts that are not maintained across replicates of a condition, thus effectively reflecting intra-condition noise, whilst specifying a subset of transcripts to be over represented across replicates of a condition reflects identifiable over expressed transcripts. A similar experiment to those described here, but in relation to the effects of chimerism on the results of differential expression, has been described in Linheiro and Archer (2021) [50].

As a preliminary, and to demonstrate the general reliability of DESeq2 in the absence of random intra-condition count variation, count data were simulated for 22,580 transcripts ranging in length from 300 to 5000 bp from within the dog reference transcriptome. For a single iteration, ten replicate count datasets, each representing three million read pairs (≈20X coverage), were generated in accordance with two conditions, A and B (five in each), where within condition B one hundred transcripts were selected for count over representation by a factor of two across replicates. For all other transcripts, counts were generated to represent an even distribution of reads (length 150 bp, insert size 300 bp). These count files were used as input to perform differential expression analysis between conditions A and B in DESeq2 v.1.32.0, considering transcripts with p-adj < 0.05 (corrected by the Benjamini and Hochberg method) to be differentially expressed. The number of transcripts that were detected as being over-expressed was recorded. This was repeated one hundred times in two different ways: (i) the transcripts initially flagged for count over representation were kept constant throughout and (ii) during each iteration a new set of random transcripts for over representation was selected. A brief overview of the R-script we used for differential expression analysis within individual iterations is available on the Zenodo repository [51].
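To make the p-adj threshold concrete, the Benjamini and Hochberg adjustment itself can be sketched in a few lines of Python. This covers only the correction step; DESeq2 additionally applies independent filtering by default before adjusting, which the sketch does not reproduce. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
# Transcripts (by index) that would pass a p-adj < 0.05 cut-off.
significant = [i for i, p in enumerate(padj) if p < 0.05]
```

Because each adjusted value depends on the total number of tests m, removing transcripts before analysis changes the adjusted values of those that remain, which is one route by which filtering can add or remove transcripts from the differentially expressed list.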

Next, a similar experiment was performed in which the level of random count variation introduced into the generated count datasets ranged from 1% to 10% in steps of one. Introduced variation was not coupled between replicates and thus reflected intra-condition variation. At each level of variation, one hundred iterations of the following steps were performed. (i) Ten replicates were generated and allocated into two conditions A and B (five in each), where within B one hundred selected transcripts had counts over represented. (ii) At the percent level of variation associated with the iteration, that percent of transcripts from each of the ten replicates were randomly selected for count over representation. (iii) DESeq2 was used in a similar manner to before on the count files within conditions A and B to obtain a list of over expressed transcripts. (iv) TVscript was run using a 95th percentile variance threshold to generate ten corresponding modified count files also separated into two conditions (A’ and B’). (v) DESeq2 was again used on these to obtain a list of over-expressed transcripts. (vi) The lists of over-expressed transcripts obtained in (iii) and (v) were cross compared. Once again this was repeated in two different ways: (i) the one hundred transcripts initially flagged for over representation were kept constant throughout all levels of variation and for each of the associated iterations and (ii) during each level of variation and for each iteration a new set of one hundred transcripts was randomly selected.
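The structure of one simulated iteration can be illustrated with a toy Python stand-in. It is not CSReadGen and does not simulate reads: it starts from a flat count per transcript, and it assumes that randomly selected "noise" transcripts are over represented by the same factor of two as the condition B signal set, which the text above does not specify. The function `simulate_replicate`, the reduced transcript number and the base count are all hypothetical.

```python
import random


def simulate_replicate(transcripts, base_count, over_represented, fold,
                       noise_fraction, rng):
    """Return {transcript: count} for one replicate.

    over_represented: transcripts multiplied by `fold` (condition signal).
    noise_fraction: fraction of transcripts drawn independently for this
    replicate and also multiplied by `fold` (intra-condition noise).
    """
    counts = {t: base_count for t in transcripts}
    for t in over_represented:
        counts[t] *= fold
    n_noisy = round(noise_fraction * len(transcripts))
    # Drawn afresh per replicate, so noise is not coupled between replicates.
    for t in rng.sample(sorted(transcripts), n_noisy):
        counts[t] *= fold
    return counts


rng = random.Random(1)
transcripts = [f"t{i}" for i in range(1000)]   # 22,580 in the study
signal = set(rng.sample(transcripts, 100))     # over represented in B only
condition_a = [simulate_replicate(transcripts, 100, set(), 2, 0.05, rng)
               for _ in range(5)]
condition_b = [simulate_replicate(transcripts, 100, signal, 2, 0.05, rng)
               for _ in range(5)]
```

In this toy version a noise draw can coincide with a signal transcript, which then receives the fold change twice; whether the original simulations allow this is not stated.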

Exploring the removal of transcripts associated with high intra-condition variation within real data

For each case study (wolves vs. dogs and aggressive vs. tame fox) we ran TVscript using the count datasets described under the section “RNA-seq datasets” and by applying variance filtering thresholds corresponding to variance values associated with the 70th up to the 90th percentiles (in steps of five), and to the 91st up to the 99th (in steps of one). Steps of one were used in the latter to allow for transcripts associated with the highest levels of intra-condition variation to be explored in more detail. For each threshold value, only transcripts with variance below that value were maintained. During each run, we recorded the number and IDs of all transcripts that were removed so that they could be cross-compared. Following each run, we used the updated count files produced to perform differential expression analysis using DESeq2 v.1.22.2. Transcripts with p-adj < 0.05 (corrected by the Benjamini and Hochberg method) were considered to be differentially expressed. DESeq2 identifies differentially expressed transcripts by estimating gene-wise dispersions and applying shrinking methods to model counts and thus effectively normalize for individual outliers [31]. Distributions of gene-wise dispersions following normalization are conveniently accessible and provide a good metric to visualize the effects of removing transcripts associated with differing filter levels. Differential expression analysis using the original non-filtered datasets was also performed. For the aggressive vs. tame fox case study, batch effects were not considered as all data came from the same study, tissue and sequencing run; additionally, no further information about sample preparation was available. For the wolves vs. dogs case study we tested for effects based on tissue, primarily for quality control of the final transcripts we drew biological-related conclusions about, and compared results obtained to those in the absence of batch information. In our analysis we used differential expression results based solely on the latter. Firstly, effects associated with tissue at an inter-study level are unpredictable as there are many factors involved, such as precision of dissection, time of dissection, time to dissect, state of individual tissue samples as well as the individual who prepared the sample, and other than publication or information mentioned for the fox case study, no further information on batches was obtainable. Secondly, although DESeq2 provides an internalized method for accommodating batch effects that we applied (~batch + condition), the results obtained at an intra-study level, with well defined batches, between alternative methods of testing are variable [52]. Lastly, we were primarily exploring the effects of removing hyper variable transcripts on the mechanics of detecting differentially expressed transcripts, and our simulations and case studies were a means to an end in achieving this. As long as input counts for a given filtering threshold within a given case study or an iteration were consistent with those of the initial input data, the effects of removing hyper variable transcripts could be observed, independent of other factors affecting the data prior to analysis. To visualize the overall effects of covariates broadly affecting the relationships between datasets within each case study, we performed a principal component analysis (PCA) using the plotPCA function from DESeq2 with non-filtered normalized count data.

To evaluate TVscript we used three metrics that when combined quantify the overall impact of intra-condition variation on downstream differential expression analysis. The metrics were: i) number of ambiguous positives within transcripts identified as being differentially expressed in the non-filtered datasets; ii) distributions of dispersion estimates and outliers in differential expression analysis for non-filtered and all filtered datasets; and iii) discordance in the list of differentially expressed transcripts between non-filtered and filtered datasets (selected percentiles: 97th, 95th, 90th).

(i) Ambiguous positives

We identified transcripts appearing as differentially expressed when using the non-filtered datasets as input to DESeq2 that were associated with the highest 10% of intra-condition variation levels (above the 90th percentile threshold value). These we designated as ambiguous positives. Small numbers of these, relative to the overall number of identified differentially expressed transcripts, would indicate that TVscript is having little direct effect on lists of identified differentially expressed genes.

(ii) Distributions of dispersion estimates

For the non-filtered and all filtered input datasets (70th up to the 90th percentiles in steps of five and the 91st up to the 99th in steps of one) we calculated correlation coefficients (r2) using a linear regression analysis in R [53], between dispersion estimates and the mean of normalized counts, both calculated by DESeq2 during differential expression analysis. Dispersion is inversely related to the mean, as lower mean counts are affected by variation to a higher degree. If a stronger correlation is seen for the filtered input datasets, then this would suggest that the distribution used to model differential expression could be more reliable in relation to identifying differentially expressed transcripts. In addition to this, we retrieved the number of outliers detected by DESeq2, expecting a decrease after each filtering step. Outliers are recognized by DESeq2 as the points with extremely high dispersion values that cannot be shrunk towards the fitted curve. This was performed independently for both case studies.

(iii) Discordance between lists of differentially expressed transcripts across applied filter levels

We calculated the proportion of discordance between lists of differentially expressed transcripts produced when using non-filtered and filtered datasets at the 97th, 95th and 90th percentile threshold values. Two types of observed discordances relative to the non-filtered list were considered: (a) transcripts that were lost directly due to filtering or indirectly due to p-adj values no longer being significant, and (b) transcripts that were added due to alterations in p-adj values. Quantifying the nature of these discordances provides insight into the general consistency of genes identified as being differentially expressed across varying filter thresholds. To visualize the overlap between the non-filtered and filtered lists we used the VennDiagram v.1.6.20 package in R.
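The two types of discordance can be expressed with set operations, as in the Python sketch below. The denominator used for the proportion (the union of both lists) is an assumption, since the exact formula is not given here; the function name `discordance` and the transcript IDs are hypothetical.

```python
def discordance(unfiltered_de, filtered_de, removed):
    """Compare DE transcript sets from non-filtered and filtered inputs.

    unfiltered_de: DE transcripts from the non-filtered datasets.
    filtered_de:   DE transcripts from the filtered datasets.
    removed:       transcripts removed from the reference by filtering.
    """
    # (a) lost directly: DE before, but removed by the filter itself.
    lost_direct = unfiltered_de & removed
    # (a) lost indirectly: still in the reference, no longer significant.
    lost_indirect = unfiltered_de - filtered_de - removed
    # (b) added: significant only after filtering.
    added = filtered_de - unfiltered_de
    union = unfiltered_de | filtered_de
    n_discordant = len(lost_direct) + len(lost_indirect) + len(added)
    # Assumption: discordance is expressed relative to the union of lists.
    proportion = n_discordant / len(union) if union else 0.0
    return {
        "lost_direct": lost_direct,
        "lost_indirect": lost_indirect,
        "added": added,
        "proportion": proportion,
    }


result = discordance(
    unfiltered_de={"tA", "tB", "tC", "tD"},
    filtered_de={"tC", "tD", "tE"},
    removed={"tA"},
)
```

In this toy case one transcript is lost directly, one indirectly and one is added, giving a discordance of 3 out of 5.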

Gene annotation and gene family analysis

For each case study, differentially expressed transcripts obtained using the non-filtered and filtered datasets were matched to the corresponding gene ID. This was done with the R package biomaRt [54] using the Ensembl Gene database (version 94). To begin to identify gene families that displayed similar regulation in both dogs and tame foxes, i.e. relative to the “wild/aggressive” type, we grouped up and down regulated genes into gene families. Genes within these families were then classified according to whether they were unique to dogs or tame foxes or shared between the two. Within each case study, a gene family was only considered if all the associated genes agreed in relation to their direction of expression (up or down regulation).
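The direction-agreement rule and the unique or shared classification can be sketched in Python as follows. The data structures, function names and gene or family labels are hypothetical, and the sketch makes two simplifications that are assumptions: it works at the level of whole families rather than individual genes, and it treats a family as shared only when both case studies report the same direction.

```python
def consistent_families(gene_family, gene_direction):
    """Return {family: direction} for families whose DE genes all agree.

    gene_family:    gene ID -> family name.
    gene_direction: DE gene ID -> "up" or "down" within one case study.
    """
    directions = {}
    for gene, direction in gene_direction.items():
        family = gene_family.get(gene)
        if family is not None:
            directions.setdefault(family, set()).add(direction)
    # Keep only families with a single, unanimous direction.
    return {f: next(iter(d)) for f, d in directions.items() if len(d) == 1}


def classify_families(dog_families, fox_families):
    """Split families into shared (same direction), dog-only and fox-only."""
    common = dog_families.keys() & fox_families.keys()
    shared = {f for f in common if dog_families[f] == fox_families[f]}
    return {
        "shared": shared,
        "dog_only": set(dog_families) - set(fox_families),
        "fox_only": set(fox_families) - set(dog_families),
    }


families = {"g1": "F1", "g2": "F1", "g3": "F2", "g4": "F2", "g5": "F3"}
dog = consistent_families(families, {"g1": "up", "g2": "up",
                                     "g3": "up", "g4": "down"})
fox = consistent_families(families, {"g1": "up", "g5": "down"})
groups = classify_families(dog, fox)
```

Here family F2 is dropped for dogs because its two genes disagree in direction, F1 is shared, and F3 is specific to foxes.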

Results

Mapping success of RNA-seq data

Mapping of the 44 datasets corresponding to dogs and wolves against the dog reference transcriptome revealed an average success of 60% and 58% respectively, in terms of the number of mapped reads (S1 Fig). Similar values between wolves and dogs were expected, given their recent divergence of ~23,000 years ago [55]. Comparable proportions of reads failing to map (~40%) have been previously reported for dog brain samples [33] and are most likely associated with i) novel genes; ii) regions that are not translated despite being transcribed; iii) contamination with genomic DNA; and iv) uncharacterized chimeras and other artefacts within reference sets resulting from library preparation during sequencing [56] and various assembly errors [57]. When a different mapping approach was used for each RNA-seq dataset (i.e. kallisto) transcript abundance counts remained consistent with those obtained following Bowtie2 mapping, as indicated by high r2 correlation values (S2 Fig). R2 values ranged between 0.8546 and 0.9944. All per-transcript mapped read counts, obtained following each mapping approach, have been made available on the Zenodo repository [58]. For the fox datasets, an average of 50% of reads mapped to the dog reference transcriptome using Bowtie2 (S1 Fig); confirmed by the kallisto estimated abundance counts (S2 Fig). This lower percentage of mapped reads, relative to the dog and wolf datasets, could be expected due to an increased genetic divergence from dogs (~10 mya) [59] together with the other aforementioned factors. However, when these fox datasets were mapped to the Ensembl available fox reference transcriptome using both Bowtie2 an improvement in the overall mapped read counts was not observed (S3 Fig).

Controlled intra-condition variation within simulated data

When not faced with increasing levels of random intra-condition count variation DESeq2 performed exceptionally well. For 86 of the one hundred iterations performed DESeq2 recovered all transcripts that were selected for count over representation (S4a Fig). Of the other 14 iterations the lowest number recovered was 72. Similarly, for the one hundred iterations where the random transcripts selected for count over representation were re-selected during each, in 87 cases all over represented transcripts were identified as being over expressed, whilst in the remaining 13 the minimum number identified was 72 (S4b Fig). However, as levels of introduced intra-condition variation increased, the number of transcripts identified by DESeq2 fell, both before and after filtering the input counts with TVscript (Fig 2). At all levels of intra-condition variation, the post-filtered data had an increase in the number of differentially expressed transcripts identified. It should be emphasized that the reduction in the number of transcripts identified as being over expressed is not a negative reflection on the performance of DESeq2, but instead it is a consequence of purposefully increasing the level of randomness within the count data. The same pattern is true when the one hundred selected transcripts for count over representation are re-selected within each iteration (S5 Fig).

Fig 2. Over expressed transcripts pre- and post-filtering using simulated data.

The number of transcripts identified by DESeq2 as being over expressed both prior to (light gray) and post (dark gray) filtering of count datasets within each of the one hundred iterations performed at each level of introduced random intra-condition count variation. Each iteration involved initially simulating ten count datasets divided into conditions A and B following which DESeq2 was run to attempt to identify the one hundred transcripts selected for over representation as described in the methods. Following this the ten simulated datasets were filtered using TVscript with a 95th percentile threshold to generate corresponding filtered datasets (divided into corresponding conditions A’ and B’) on which DESeq2 was re-run.

For iterations associated with each increment in random intra-condition variation, the numbers of transcripts commonly identified as being over expressed both prior to and post filtering are presented in S2 Table (over represented transcripts kept constant across iterations) and S3 Table (over represented transcripts re-selected for each iteration). The proportions that these numbers make up relative to the maximum number of transcripts identified as being differentially expressed, pre- or post-filtering, are presented in S4 Table (constant) and S5 Table (re-selected). In all cases, below a 5% level of random variation these proportions are high (constant, 1 to 4% averages: 0.96, 0.98, 0.94 and 0.81; re-selected, 1 to 4% averages: 0.96, 0.97, 0.93 and 0.82), indicating that transcripts identified pre-filtering are still found post-filtering, alongside the additional transcripts identified after filtering with TVscript. Consequently, the additional transcripts identified as being over expressed post-filtering do not come at the expense of transcripts detected pre-filtering. Above the 5% level of intra-condition variation, the ability to identify the one hundred transcripts selected for over representation within condition B diminishes within iterations (Fig 2; S5 Fig). This provides a tentative estimate of the level at which random intra-condition count variation becomes inhibitory within differential expression analysis studies.
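The proportion reported in S4 and S5 Tables, as described in their legends (the number of commonly identified transcripts divided by the larger of the pre- and post-filtering totals), can be written as a short Python function. The function name and the toy transcript identifiers are illustrative assumptions.

```python
def common_ratio(pre_filter_hits, post_filter_hits):
    """Shared over expressed transcripts divided by the larger of the two hit lists."""
    pre, post = set(pre_filter_hits), set(post_filter_hits)
    largest = max(len(pre), len(post))
    if largest == 0:
        return 0.0
    return len(pre & post) / largest

# Toy example: all three pre-filter hits are retained and one more is gained.
ratio = common_ratio({"t1", "t2", "t3"}, {"t1", "t2", "t3", "t4"})
print(ratio)  # 0.75
```

A value close to 1 means the two lists are nearly identical. Values fall when transcripts are lost after filtering, and also when many are gained, because the denominator is the larger list.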

Exploring the removal of transcripts associated with high intra-condition variation within real data

No significant difference existed between the overall distributions of the per-transcript intra-condition variation values for wolf and dog samples (Wilcoxon-test, p-value < 0.198, Fig 3a). The PCA based on the entire set of normalized non-filtered counts, revealed that the wolf samples were more aggregated than dog samples (Fig 3c). For aggressive and tame fox samples, we observed a significant difference (Wilcoxon-test, p-value < 2.2e-16, Fig 3b) between the distributions of the per-transcript intra-condition variation values, most likely resulting from an increased intra-condition variability within tame fox samples. In particular, we found five samples that were differentiated from the remaining seven in the PCA (Fig 3d), with 80% variance being explained by this clustering in PC1.

Fig 3. Characterization of intra-condition variation.

Percentile range of intra-condition variation scores (x-axis) observed prior to filtering, across both case studies, a) wolves (orange) and dogs (red); b) tame (dark blue) and aggressive (light blue) foxes. PCA plots based on normalized non-filtered count data of the individual datasets comparing c) wolf and dog, and d) tame and aggressive fox. In the latter only individual samples that were positioned within a distant cluster are labelled with the sample ID.

Prior to differential expression analysis, for each case study (wolves vs. dogs and aggressive vs. tame foxes), TVscript was used to remove transcripts in accordance with a series of intra-condition variance thresholds (Fig 4a and 4b; S6 Table). Initially, for wolves and dogs 184 transcripts (out of the 26,107) associated with high intra-condition variation (99th percentile and above) were removed, while for the aggressive and tame fox samples, 235 transcripts were removed. The number of transcripts removed was higher for the fox samples than for those of wolf and dog, reflecting the higher intra-condition variability present. Combined across the top ten levels of intra-condition variation, 12% (n = 3134) and 14.89% (n = 3888) of the reference transcripts were removed in wolf/dog datasets and aggressive/tame fox datasets respectively (S6 Table).
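A percentile-based filter of this kind can be sketched in Python as shown below. This section does not define how the intra-condition variation score is computed or how scores from the two conditions are combined, so the coefficient of variation, the use of the larger of the two per-condition scores, and the nearest-rank percentile are all assumptions made for illustration. None of the names below are taken from TVscript.

```python
import math
import statistics

def variation_score(counts):
    """ASSUMED score: coefficient of variation (stdev / mean) within one condition."""
    mean = statistics.fmean(counts)
    return statistics.pstdev(counts) / mean if mean > 0 else 0.0

def percentile_cutoff(scores, percentile):
    """Nearest-rank percentile of a list of scores."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def filter_transcripts(cond_a, cond_b, percentile):
    """cond_a and cond_b map transcript id -> list of replicate counts.
    Returns the ids kept after removing those scoring at or above the cutoff."""
    # ASSUMED combination rule: the larger of the two per-condition scores.
    combined = {t: max(variation_score(cond_a[t]), variation_score(cond_b[t]))
                for t in cond_a}
    cutoff = percentile_cutoff(list(combined.values()), percentile)
    return {t for t, score in combined.items() if score < cutoff}

# Ten toy transcripts whose variation in condition A grows with the index.
cond_a = {f"t{i}": [100, 100 + i] for i in range(10)}
cond_b = {f"t{i}": [100, 100] for i in range(10)}
kept = filter_transcripts(cond_a, cond_b, percentile=99)
```

With ten toy transcripts, the 99th percentile threshold removes only the single most variable transcript (`t9`). Lowering the threshold removes progressively more, mirroring the stepwise thresholds explored in the text.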

Fig 4. Removal of transcripts above specified levels of intra-condition variation.

Percentile range of combined intra-condition variation scores (x-axis) present in each case study, a) wolves and dogs; b) tame and aggressive foxes. The number of transcripts removed in the top five percentiles (from the 95th to the 99th) are presented in each panel.

Differential expression analysis

Using non-filtered datasets as input to DESeq2, 430 differentially expressed transcripts were identified between wolves and dogs (Fig 5a; S7 Table). Of those, 259 were up regulated, while 171 were down regulated in dogs. Between aggressive and tame foxes, 651 differentially expressed transcripts were identified (Fig 5a; S8 Table), of which, 532 and 119 were up and down regulated, respectively, in tame foxes. Post filtering, within the first ten steps of size one from the 99th down to the 90th percentiles, the number of differentially expressed transcripts identified peaks at the 97th (n = 430; up = 255, down = 175) and the 95th percentiles (n = 730; up = 607, down = 123) in dogs and tame foxes (Fig 5a), respectively. This indicates that for these data the removal of the 3% (n = 854) and 5% (n = 1940) of transcripts associated with the highest levels of intra-condition variation maximized the detection of differentially expressed transcripts.

Fig 5. Effects of removing transcripts above specified levels of intra-condition variation on differential expression analysis.

a) Number of differentially expressed transcripts (DETs) identified using non-filtered (NF) and filtered datasets, based on the top 10 percentiles (99th to the 90th), for both case studies. Up and down regulated transcripts are represented by red and blue dots respectively. Gray arrows identify the selected thresholds for which the results of the corresponding differential expression analysis were used for the identification of candidate transcripts associated with tameness within each case study. b) Number of transcripts identified as differentially expressed within the non-filtered datasets that were associated with the highest levels of intra-condition variation (99th to 90th) within both case studies, wolves and dogs (orange dots), and tame and aggressive foxes (blue dots).

Evaluation metrics

(i) Ambiguous positives

Of the transcripts that appeared as being differentially expressed when using non-filtered datasets as input to DESeq2, 17.44% (n = 75) and 21.51% (n = 140) were associated with high intra-condition variation (above the 90th percentile threshold) within the wolves vs. dogs and aggressive vs. tame foxes case studies, respectively (Fig 5b). This was higher than the relative proportion of such transcripts within the reference set in general, where 12.08% and 14.89% of transcripts possessed intra-condition variation above the 90th percentile for the wolf/dog and the aggressive/tame fox datasets, respectively (Fig 4a and 4b; S6 Table). These were transcripts that we considered as ambiguous positives and the average across both case studies was 19.45%. The number was higher within the fox datasets, where elevated variability among samples was observed, suggesting that differences within intra-condition read counts could have influenced the final outcome of identified differentially expressed transcripts.
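The ambiguous positive percentages follow directly from the reported counts, as the short Python check below illustrates. The function name and the integer identifiers standing in for transcripts are hypothetical; only the counts (75 of 430 and 140 of 651) come from the text.

```python
def ambiguous_positive_rate(de_transcripts, high_variation_transcripts):
    """Percentage of differentially expressed transcripts that are also high-variation."""
    de = set(de_transcripts)
    if not de:
        return 0.0
    return 100 * len(de & set(high_variation_transcripts)) / len(de)

# Toy ids reproducing the reported counts: 75 of 430 (wolves vs. dogs)
# and 140 of 651 (aggressive vs. tame foxes).
wolf_dog_rate = ambiguous_positive_rate(range(430), range(75))
fox_rate = ambiguous_positive_rate(range(651), range(140))
print(round(wolf_dog_rate, 2), round(fox_rate, 2))  # 17.44 21.51
```

Both values exceed the corresponding reference-set proportions quoted in the text (12.08% and 14.89%), which is the over representation the paragraph describes.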

(ii) Distributions of dispersion estimates

Within both case studies the regression analysis indicates that removing transcripts associated with high levels of intra-condition variation improved correlation coefficients in relation to those from the non-filtered datasets (Fig 6a and 6b; S9 Table). Associated with the elevated levels of variation observed within the fox datasets, there was a better fit within the wolves vs. dogs comparison (r2 > 0.7) than that of the aggressive vs. tame fox one (r2 > 0.5). For the latter, there was visible elevation in the number of dispersed points around the line of best fit. With the removal of transcripts associated with the highest levels of intra-condition variation a reduction in the number of outliers within both case studies was also observed (Fig 6a and 6b; S9 Table).
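One way to compute such a fit is sketched below in Python: an ordinary least-squares regression of dispersion on mean normalized count after log10 transformation of both, returning the slope, intercept and r2. Whether the reported r2 values were obtained on log-transformed values in exactly this way is an assumption, and the function name and toy data are illustrative.

```python
import math

def log_log_r2(mean_counts, dispersions):
    """Least-squares fit of log10(dispersion) on log10(mean count).
    Returns (slope, intercept, r2). All inputs must be positive."""
    xs = [math.log10(m) for m in mean_counts]
    ys = [math.log10(d) for d in dispersions]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2

# Toy data in which dispersion is exactly 1 / mean, so the log-log fit is perfect.
means = [10, 100, 1000, 10000]
disps = [1 / m for m in means]
slope, intercept, r2 = log_log_r2(means, disps)
```

Transcripts whose dispersion lies far from the trend lower r2, so removing hyper-variable transcripts would be expected to raise it, as reported for both case studies.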

Fig 6. Distribution of dispersion estimates.

Plots of final dispersion estimates for both case studies, a) wolves and dogs; b) tame and aggressive foxes, calculated using DESeq2 for the non-filtered (NF; orange) and 10% filtered datasets (90th; blue). Each black dot represents a single transcript, and red dots represent outliers. The number of outliers and correlation index (r2) are displayed in the top right corner of each panel. Both x and y-axis are transformed into a logarithm scale. The line in each graph corresponds to the regression analysis between the mean of normalized counts and dispersion estimates.

(iii) Discordance lists of differentially expressed transcripts between applied filter levels

Within the wolves vs. dogs case study, of the 430 differentially expressed transcripts identified when using non-filtered data as input to DESeq2, 346 were maintained when using input data filtered at the 97th, 95th, and 90th percentile threshold values (Fig 7 and S7 Table). A further 26 transcripts were added as differentially expressed following filtering. For this case study, the overall discordance between the differentially expressed transcripts identified using filtered and non-filtered input data was 25.58% (Fig 7, inset table). For the second case study, aggressive vs. tame foxes, 504 out of the 651 differentially expressed transcripts identified using the non-filtered inputs were maintained when using filtered input data at the 97th, 95th, and 90th percentile threshold values, with up to 114 being added following filtering (Fig 7 and S8 Table). This time the overall level of discordance was 40.09% (Fig 7, inset table). Importantly, in both case studies the added transcripts were consistently maintained across the three filter levels. This reflects the general tendency observed within iterative testing using simulated data, where the differentially expressed transcripts identified using lower filter levels are maintained at higher levels of filtering in addition to any newly identified transcripts (S4 and S5 Tables).
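The discordance values can be reproduced from the counts given above if discordance is taken to be the number of transcripts lost plus the number added, divided by the number identified with non-filtered data. This definition is inferred from the reported figures, not stated explicitly in this section, and the Python sketch below uses hypothetical integer identifiers in place of transcripts.

```python
def discordance(non_filtered, filtered):
    """Percentage of (lost + added) transcripts relative to the non-filtered list."""
    nf, f = set(non_filtered), set(filtered)
    lost = nf - f    # differentially expressed only before filtering
    added = f - nf   # differentially expressed only after filtering
    return 100 * (len(lost) + len(added)) / len(nf)

# Wolves vs. dogs: 430 non-filtered, 346 maintained, 26 added.
wolf_dog = discordance(range(430), set(range(346)) | set(range(1000, 1026)))
# Aggressive vs. tame foxes: 651 non-filtered, 504 maintained, 114 added.
fox = discordance(range(651), set(range(504)) | set(range(2000, 2114)))
print(round(wolf_dog, 2), round(fox, 2))  # 25.58 40.09
```

Under this reading, 84 lost and 26 added transcripts give 110/430 for wolves vs. dogs, and 147 lost and 114 added give 261/651 for the foxes.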

Fig 7. Differentially expressed transcripts overlapping between non-filtered and filtered datasets.

Venn diagrams representing the number of overlapping differentially expressed transcripts found following differential expression analysis using non-filtered datasets and filtered datasets (97th, 95th and 90th percentiles), within each case study. The inset table provides information about the number of differentially expressed transcripts lost/added following each filter step in relation to the non-filtered dataset, as well as the percentage of total discordance.

Candidate genes and gene families

By performing annotation using the filtered datasets where the number of differentially expressed transcripts was maximized (3% and 5%, in dogs and tame foxes, respectively), we found 21 gene families in common among the up-regulated genes in dogs and tame foxes. These 21 gene families contained 50 genes (Table 1), of which 19 were exclusive to dogs while 24 were exclusive to tame foxes. The remaining seven genes (RGR, CHRNA5, SQLE, ARHGAP25, ITGA7, MYO7A and TRIB2), were found to be commonly up regulated in both dogs and tame foxes. When batch effects based on tissue were considered RGR, CHRNA5, MYO7A and TRIB2 were maintained as being commonly up regulated (S6 Fig). Note: although in relation to the latter SQLE, ARHGAP25 and ITGA7 were lost, batch effects based on tissue across multiple studies where little other batch information is obtainable could be considered as a very conservative exclusion.

Table 1. Shared genes and gene families (Up regulation).

List of the gene families, and shared genes, that were commonly up regulated in dogs and tame foxes. The number, and name, of the genes within each gene family are provided, with the corresponding log2fold-change values in brackets for each species. Within each family, single genes were characterized as shared between dogs and tame foxes (bold), or as exclusive to each of the two groups. When more than one transcript for a specific gene was present, all the log2FC values are reported.

Gene Family | Group | N of genes | Gene name and log2FC value
Retinal G protein-coupled receptor | Shared | 1 | RGR (2.10 in dogs, 0.78 in tame foxes)
Cholinergic receptor nicotinic alpha | Shared | 1 | CHRNA5 (1.1 in dogs, 0.4 in tame foxes)
Squalene epoxidase | Shared | 1 | SQLE (0.54 in dogs, 0.31 in tame foxes)
Rho GTPase activating protein | Shared | 1 | ARHGAP25 (0.86 in dogs, 0.72 in tame foxes)
Rho GTPase activating protein | Tame fox | 2 | ARHGAP4 (0.64); ARHGAP30 (0.57)
Integrin alpha subunits | Dog | 3 | ITGA6 (1.25, 1.24); ITGA8 (1.14, 0.90); ITGAX (0.97)
Integrin alpha subunits | Tame fox | 1 | ITGAL (0.73)
Integrin alpha subunits | Shared | 1 | ITGA7 (0.76 in dogs, 0.46 and 0.49 in tame foxes)
Myosin | Dog | 1 | MYO3A (1.12)
Myosin | Tame fox | 3 | MYOZ1 (1.53); MYO1F (0.93); MYO1C (0.47)
Myosin | Shared | 1 | MYO7A (0.82 in dogs, 0.41 in tame foxes)
Tribbles pseudokinase | Tame fox | 2 | TRIB1 (0.94); TRIB3 (0.78)
Tribbles pseudokinase | Shared | 1 | TRIB2 (0.61 in dogs, 0.2 in tame foxes)
EF hand calcium binding | Dog | 1 | EFCAB1 (2.59)
EF hand calcium binding | Tame fox | 1 | EFCAB2 (0.46)
Transcription factor | Dog | 1 | TCF23 (2.04)
Transcription factor | Tame fox | 1 | TCF19 (0.63)
Adhesion G protein-coupled receptors | Dog | 1 | ADGRG6 (1.45)
Adhesion G protein-coupled receptors | Tame fox | 1 | ADGRG1 (0.57)
Patatin Like Phospholipase Domain | Dog | 1 | PNPLA4 (1.41)
Patatin Like Phospholipase Domain | Tame fox | 1 | PNPLA7 (0.59)
SRY-box | Dog | 1 | SOX6 (1.26)
SRY-box | Tame fox | 2 | SOX17 (0.84); SOX10 (0.66)
Hyaluronan and proteoglycan link protein | Dog | 1 | HAPLN1 (1.15)
Hyaluronan and proteoglycan link protein | Tame fox | 1 | HAPLN3 (0.70)
Serine/threonine kinase | Dog | 2 | STK17A (1.15, 1.14); STK32A (1.10)
Serine/threonine kinase | Tame fox | 1 | STK40 (0.57)
Potassium channels | Dog | 1 | KCTD16 (0.98)
Potassium channels | Tame fox | 1 | KCTD15 (0.72)
Podocalyxin like | Dog | 1 | PODXL (0.95, 0.84)
Podocalyxin like | Tame fox | 1 | PODXL2 (0.70, 0.69, 0.67)
ATP binding cassette subfamily B | Dog | 1 | ABCB1 (0.93)
ATP binding cassette subfamily B | Tame fox | 1 | ABCB9 (0.52)
Zinc finger DHHC-type | Dog | 1 | ZDHHC15 (0.75)
Zinc finger DHHC-type | Tame fox | 1 | ZDHHC1 (0.70)
Sushi domain | Dog | 1 | SUSD1 (0.68)
Sushi domain | Tame fox | 2 | SUSD3 (0.79); SUSD6 (0.47)
TBC1 domain family | Dog | 1 | TBC1D5 (0.54)
TBC1 domain family | Tame fox | 1 | TBC1D7 (0.27)
Mitogen-activated protein kinase kinase kinases | Dog | 1 | MAP3K5 (0.51)
Mitogen-activated protein kinase kinase kinases | Tame fox | 1 | MAP3K11 (0.76)

In addition, we also found three gene families, containing four genes, simultaneously down regulated in both groups (Table 2). Two of these genes (STMND1 and OASL) were shared between dogs and tame foxes, while the other two were unique to each group. When batch effects based on tissue were taken into account, STMND1 and OASL were maintained as being commonly down regulated (S6 Fig). The same analysis performed using the non-filtered datasets revealed similar results (S10 Table), although the RGR gene family which included a shared gene between dogs and tame foxes, was lost. This gene was not differentially expressed in the non-filtered fox dataset, representing an example of genes added as differentially expressed after filtering.

Table 2. Shared genes and gene families (Down regulation).

List of the gene families, and shared genes, that were commonly down regulated in dogs and tame foxes. The number, and name, of the genes within each gene family are provided, with the corresponding log2fold-change values in brackets for each species. Within each family, single genes were characterized as shared between dogs and tame foxes, or as exclusive to each of the two groups. When more than one transcript for a specific gene was present, all the log2FC values are reported.

Gene Family | Group | N of genes | Gene name and log2FC value
Stathmin domain | Shared | 1 | STMND1 (-1.18 in dogs, -0.53 in tame foxes)
Oligoadenylate synthetase like | Shared | 1 | OASL (-0.41 in dogs, -0.52 in tame foxes)
Heat shock protein family B | Dog | 1 | HSPB8 (-0.70)
Heat shock protein family B | Tame fox | 1 | HSPB11 (-0.32)

Discussion

Studies involving RNA-seq data often rely on the identification of one, few or many differentially expressed transcripts in order to draw conclusions about biological pathways or about general transcriptome function and evolution. The explicit quantification of intra-condition count variation associated with such transcripts is important for maintaining the context of ambiguity that may exist following differential expression analysis. This is especially true given the growing ability to base highly informative studies around archived transcriptomics datasets at an inter-study level. Here, we developed a method that quantifies intra-condition variation for each individual transcript within the reference set and that can be used to explore the effects of identifying and removing reference-based transcripts harbouring such variation above specified thresholds. By initially applying the method to extensive highly controlled simulated datasets harbouring pre-defined levels of intra-condition count variation we demonstrate the high effectiveness of DESeq2 in identifying differentially expressed transcripts, but also that it can be advantageous to reduce intra-condition variation within the count datasets in relation to identifying additional differentially expressed transcripts that could have been overlooked without such filtering (Fig 2 and S5 Fig, S4 and S5 Tables). By using highly controlled simulated datasets for initial testing, we also provide a tentative estimate on the limit of random intra-condition count variation above which the ability to reliably detect differentially expressed transcripts is diminished (Fig 2 and S5 Fig).

Our real data case studies showed that, on average, nearly 20% of the transcripts identified as being differentially expressed prior to filtering contained levels of intra-condition variation equal to or above the 90th percentile value of the total distribution. This was higher than the relative proportion of such transcripts within the reference set and indicates that transcripts associated with higher intra-condition variation have a tendency to be identified as differentially expressed. When transcripts possess large amounts of such variation, some ambiguity in their identification as being differentially expressed is inevitable, since reliable expression patterns for at least one of the two conditions being compared have not been fully established, even if statistical correction is applied. This likely partially explains the level of discordance between the various differential expression tools available [15, 23, 26–29], for which no consensus on the best approach to apply exists [30]. More importantly, when such transcripts are used for drawing biological conclusions, the context of this uncertainty must be maintained.

We then explored the effects of removing transcripts associated with intra-condition variation, at varying threshold levels, on the gene-wise dispersion estimates, used by DESeq2. Within both case studies, as such transcripts were increasingly removed from input datasets prior to differential expression analysis, the correlation between the mean of normalized counts and dispersion estimates increased, and the number of outliers identified decreased (Fig 6a and 6b; S9 Table; S7 Fig). This, along with discordances between the lists of differentially expressed transcripts identified prior to and post filtering, suggests that transcripts were not simply removed because of physical exclusion from the input data, but that they were also removed, and added, as a result of the effects of removing intra-condition variation from the general gene-wise dispersions applied. The high rates of discordance we found, reaching 40% within the aggressive vs. tame fox case study (Fig 7 and S8 Table), reveal how dependent the identification of differentially expressed transcripts is on the accuracy of gene-wise dispersion estimates used; these in turn being affected by transcripts associated with high intra-condition count variation.

High intra-condition count variation at an inter, and to a lesser extent intra, study level can arise from a range of sources including i) biological differences between samples such as age, sex, diet, and health; ii) in silico error involving assembly tools producing poorly understood chimeras within the reference transcriptome [50, 60, 61]; iii) ambiguities in read mapping to such references [62]; iv) normalization of count data derived from such mapped reads [63]; and v) in vitro error during library preparation protocols [64, 65]. Although we used DESeq2 within our study, the results of our exploration of the effects of intra-condition variation on the detection of differentially expressed transcripts likely apply to other software used for differential expression analysis that relies on per-transcript count information across replicates for the estimation of transcript abundance and dispersion, for example, edgeR [10], BBSeq [66], DSS [67], baySeq [68] and ShrinkBayes [69].

Following the removal of the 3% and 5% of transcripts associated with the highest levels of variation between wolves and dogs, and aggressive and tame foxes, respectively, we observed an increase in the number of differentially expressed transcripts. This pattern is similar to what we observed within our extensive iterative differential expression analysis experiments on simulated data where the levels of intra-condition variation, as well as sets of count over represented transcripts, were explicitly controlled. Thus, this result suggests that for our case studies the removal of variation at these levels optimized the detection of differentially expressed transcripts whilst maintaining consistency. Using these 3% and 5% cut-offs, amongst the 50 over expressed genes identified, across the 21 shared gene families, seven genes were shared between dogs and tame foxes (Table 1). Of these seven genes, three main functions related to brain development, neurotransmission, and immune response were identified. These functions have been repeatedly associated with behavior selection during domestication by different approaches, such as QTL analysis [40, 70, 71], whole-genome sequencing [72–74], and RNA data from both microarrays and RNA-seq [36, 37, 75–77].

Up until recently, almost no gene overlap had been observed between gene expression profiles involving pairs of domesticated and wild animals [35]. However, a recently published paper performing population genomic and brain transcriptional comparisons in seven domesticated bird and mammal species has revealed a strong convergent pattern in genes implicated in neurotransmission and neuroplasticity [42]. These functions are compatible with those found in our analysis. The shared gene ITGA7 belongs to a gene family that is known to play an essential role in the control of neuronal connectivity [78] and the inflammatory response [79]. Other genes from this family, for example ITGA8, have been previously observed to be over expressed in tame foxes [76], and here we also observed its over expression in dogs, providing further evidence of the family’s role in tameness. Similar functions are associated with the shared genes CHRNA5 [80, 81] and TRIB2 [82] from the cholinergic and tribbles families, respectively. Additionally, we found a shared gene involved in sensing local environmental stimuli, MYO7A, whose mutation results in loss of hearing and vision [83]. Amongst the three gene families identified as under expressed (Table 2), we found the shared gene STMND1, whose deficiency in the amygdala of mice was connected to a deficiency in innate and learned fear [84], a behavior that speculatively could also have an important role in domestication. Although we are aware that this overlap analysis between genes that show the same direction of expression in both dogs and tame foxes is not a formal test for gene convergence, we identified genes involved in several functions previously validated in the scope of domestication.

In this work, we have presented TVscript, a new tool that identifies and removes transcripts associated with high levels of intra-condition variation from RNA-seq count data prior to differential expression analysis. By applying TVscript to simulated data, as well as to real data derived from brain samples of dogs, wolves, and tame and aggressive foxes, we demonstrate that as hyper-variable transcripts are removed the ability to detect differentially expressed transcripts increases in a robust and repeatable manner. Furthermore, we show that above a certain level of random intra-condition count variation, the identification of differentially expressed transcripts is no longer viable. We propose that studies using RNA-seq data at an inter, or intra, study level should determine whether or not transcripts identified as being differentially expressed using non-filtered reference sets are still identified once filtering based on intra-condition count variation has been performed, regardless of the differential expression software used (or the method of obtaining initial counts). Discussion of such transcripts can then be presented relative to the context of such filtering, thus taking a step forward in reducing the ambiguity surrounding intra-condition count variation. Such context is likely to be dataset specific, as indicated by differences between our case studies, because the extent of intra-condition count variation will differ between datasets and will rarely be known prior to analysis. The latter is further highlighted by the consistent patterns observed during the iterative simulations that we performed where levels of intra-condition variation were pre-specified. Finally, by comparing the genes that were differentially expressed in the brain of dogs and tame foxes, we provided further tentative support for candidate genes involved with several functions long known for being involved with domestication.
These genes, and functions, have potential for being involved with selection for tameness, which appears to have played a crucial role in canine domestication. We use the word tentative to describe our support, as the primary aim of this study was to investigate the effects of intra-condition count variation on the detection of differentially expressed transcripts, and the identification of genes involved within an evolutionary process, such as domestication, should be supported by datasets specifically generated for that purpose, and confirmed relative to the different reference transcriptomes involved. The quality of such transcriptomes in turn, in relation to chimeras, missing transcripts and partial redundancies, must also be carefully explored.

Supporting information

S1 Fig. Alignment rates obtained using Bowtie2.

Mapping success rates (%) resulting from the alignment of the 44 samples used in this study to the complete dog transcriptome. For each sample, the percentage of aligned reads is presented by the blue bars, while the percentage of reads failing to map is represented in red (the number of raw reads is available in S1 Table).

(TIF)

S2 Fig. Correlation between per-transcript counts obtained following Bowtie2 mapping and count estimates obtained using kallisto.

R2 values describing the linear correlation between each count dataset produced from the mapped datasets presented in S1 Fig and corresponding count estimates produced when pseudo-mapping the same RNA-Seq data to the complete dog transcriptome using kallisto.

(TIF)

S3 Fig. Re-mapping fox data to the fox reference transcriptome.

Read mapping rates achieved when mapping the fox RNA-Seq datasets to the fox reference transcriptome.

(TIF)

S4 Fig. Transcripts identified by DESeq2 as being over expressed in the absence of randomly introduced intra-condition variation.

Across one hundred iterations the dots represent the number of transcripts identified as being over expressed between condition A and B. Each condition contained five replicates. (A) The one hundred transcripts selected for read over representation within replicates of condition B were maintained as constant and (B) the one hundred transcripts selected for read over representation within replicates of condition B were re-selected during each iteration. During each iteration the ten count datasets that were simulated each reflected even transcript coverage of 3 million read pairs, with the exception of the one hundred transcripts selected for over representation in condition B, whose count values were increased by a factor of two.

(TIF)

S5 Fig. Over expressed transcripts pre- and post-filtering (transcripts selected for count over representation were re-selected during each iteration).

The number of transcripts identified by DESeq2 as being over expressed both prior to (light gray) and post (dark gray) filtering within each of the one hundred iterations performed at each level of introduced random intra-condition count variation. Each iteration involved initially simulating ten count datasets divided into conditions A and B following which DESeq2 was run to attempt to identify the one hundred transcripts selected for over representation as described in the methods. Following this the ten simulated datasets were filtered using TVscript with a 95th percentile threshold in order to generate corresponding filtered datasets (divided into corresponding conditions A’ and B’) on which DESeq2 was re-run.

(TIF)

S6 Fig. Confirmation of shared genes within differential expression analysis taking tissue effects into account.

The upper dark grey circle contains the nine genes identified as being either commonly over, or under, expressed simultaneously within dogs and tame foxes using the 95th and 97th percentile filter levels whilst only accounting for condition (wolves vs. dogs and aggressive vs. tame fox). Six of these genes (RGR, CHRNA5, MYO7A, TRIB2, STMND1 and OASL) are present when DESeq2 is run whilst also accounting for differences in tissue (light grey left oval). SQLE, ARHGAP25 and ITGA7 are observed only within the differentially expressed transcript list that is based solely on condition (dark grey right oval).

(TIF)

S7 Fig. Distribution of dispersion estimates.

Plots of dispersion estimates in relation to the mean of normalized counts for both case studies, wolves and dogs (left panels), and tame and aggressive foxes (right panels). Estimates were calculated using DESeq2 for the non-filtered (NF) and all filtered datasets (99th, 95th and 90th are shown as an example). Gray dots represent the gene-wise maximum likelihood estimates (MLE), the red curve shows the fit to the MLEs, and blue dots identify the final maximum a posteriori (MAP) estimates of dispersion. Red dots represent the outliers detected by DESeq2. Both x and y-axis are transformed into a logarithm scale.

(TIF)

S1 Table. Dataset description.

Full details of all datasets, including the location of the relative tissue, age, and sex of each individual, replicate information and sequencing details (FC–frontal cortex; CC–cerebral cortex; PFC–prefrontal cortex; FL–frontal lobe; NS–not specified; F–female; M–male; AD–adult; ya–years old; PE–paired-end; SE–single end).

(DOCX)

S2 Table. Common over expressed transcripts pre- and post-filtering (when transcripts selected for count over representation are kept constant).

The number of transcripts from the dog reference set that are commonly identified by DESeq2 as being over expressed within condition B both prior to and post filtering for each of the one hundred iterations performed at each level of introduced random intra-condition count variation. Each iteration involved simulating ten count datasets divided into conditions A and B following which DESeq2 was run to attempt to identify the one hundred transcripts selected for over representation as described in the methods section. Filtering involved running TVscript with a 95th percentile threshold on the non-filtered datasets to generate corresponding filtered datasets (divided into corresponding conditions A’ and B’) following which DESeq2 was re-run and the results compared back to those obtained for the non-filtered data.

(DOCX)
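The filtering step described for S2 Table can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that intra-condition variation is scored as the coefficient of variation of a transcript's counts within a condition, and that a transcript is removed when its largest per-condition score exceeds the chosen percentile. The metric actually used by TVscript is defined in the methods section and may differ; the function name `filter_hypervariable` and its arguments are hypothetical.

```python
import numpy as np

def filter_hypervariable(counts_a, counts_b, percentile=95.0):
    """Return a boolean mask marking the transcripts to keep.

    counts_a, counts_b: arrays of shape (n_transcripts, n_replicates) holding
    the counts for conditions A and B. Variation is scored as the coefficient
    of variation, which is an assumption and not necessarily the TVscript metric.
    """
    def cv(counts):
        counts = np.asarray(counts, dtype=float)
        mean = counts.mean(axis=1)
        sd = counts.std(axis=1, ddof=1)
        # Transcripts with a zero mean receive a score of 0.
        return np.divide(sd, mean, out=np.zeros_like(mean), where=mean > 0)

    # Score each transcript by its largest intra-condition variation.
    score = np.maximum(cv(counts_a), cv(counts_b))
    threshold = np.percentile(score, percentile)
    return score <= threshold
```

Applying the returned mask to both count matrices before running the differential expression tool would give the filtered conditions A’ and B’ under these assumptions.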

S3 Table. Common over expressed transcripts pre- and post-filtering (when transcripts selected for count over representation are re-selected during each iteration).

Same as S2 Table but where the one hundred transcripts selected for over representation within condition B are re-selected during each iteration.

(DOCX)

S4 Table. Ratio between the common number of over expressed transcripts pre- and post-filtering and the maximum number detected when transcripts selected for count over representation are kept constant.

Numbers in S2 Table were divided by the maximum number of over expressed transcripts detected within each corresponding iteration, i.e. the larger of the numbers detected using the corresponding non-filtered and filtered datasets.

(DOCX)
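As a worked illustration of the ratio reported in S4 and S5 Tables, the sketch below divides the number of transcripts detected in both runs by the larger of the two detection counts. This reflects one reading of the caption; the function name `overlap_ratio` and the transcript identifiers in the usage note are hypothetical.

```python
def overlap_ratio(nonfiltered_hits, filtered_hits):
    """Shared over expressed transcripts divided by the larger hit count."""
    nonfiltered_hits = set(nonfiltered_hits)
    filtered_hits = set(filtered_hits)
    largest = max(len(nonfiltered_hits), len(filtered_hits))
    if largest == 0:
        # Nothing was detected in either run, so the ratio is undefined.
        return float("nan")
    return len(nonfiltered_hits & filtered_hits) / largest
```

For example, if the non-filtered run detects four transcripts (`t1` to `t4`) and the filtered run detects `t2`, `t3` and `t5`, two transcripts are shared and the ratio is 2 / 4 = 0.5.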

S5 Table. Ratio between the common number of over expressed transcripts pre- and post-filtering and the maximum number detected when transcripts selected for count over representation are re-selected during each iteration.

Numbers in S3 Table were divided by the maximum number of over expressed transcripts detected within each corresponding iteration, i.e. the larger of the numbers detected using the corresponding non-filtered and filtered datasets.

(DOCX)

S6 Table. Removal of intra-condition variation.

Number of transcripts kept and removed from the reference in each case study, wolves and dogs, and aggressive and tame foxes, across the filtering levels used (from the 99th to the 70th percentile). The first ten percentiles were explored in greater detail in steps of one, while the remaining levels were explored in steps of five.

(DOCX)

S7 Table. Differentially expressed transcripts in wolf vs. dog.

Complete list of differentially expressed transcripts in dogs when compared to wolves, identified using non-filtered datasets, and those that were removed (red) within the highest 10% of intra-condition variation, as well as those added (green) as differentially expressed across selected filtered datasets (97th, 95th, and 90th percentiles). The corresponding annotated gene ID, log2FC values and p-values are provided.

(DOCX)

S8 Table. Differentially expressed transcripts in aggressive vs. tame fox.

Complete list of differentially expressed transcripts in tame foxes when compared to aggressive foxes, identified using non-filtered datasets, and those that were removed (red) within the highest 10% of intra-condition variation, as well as those added (green) as differentially expressed across selected filtered datasets (97th, 95th, and 90th percentiles). The corresponding annotated gene ID, log2FC values and p-values are provided.

(DOCX)

S9 Table. Correlation and outliers.

Correlation values (r²) and the root mean square error (RMSE) from the regression analysis between the final dispersion estimates and the mean of normalized counts for both case studies, wolves and dogs, and aggressive and tame foxes. The number of outliers identified by DESeq2 is also presented. Values are shown for the non-filtered (NF) and all the filtered datasets used in differential expression analysis.

(DOCX)
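The two summary statistics in S9 Table can be sketched as follows. The caption does not state the regression model, so this example assumes a simple linear fit on log10-transformed values (matching the logarithmic axes of S7 Fig); the function name `fit_quality` is hypothetical, and the dispersion estimates would in practice be exported from DESeq2.

```python
import numpy as np

def fit_quality(mean_counts, dispersions):
    """Return (r2, rmse) for a linear fit of log10 dispersion on log10 mean count."""
    x = np.log10(np.asarray(mean_counts, dtype=float))
    y = np.log10(np.asarray(dispersions, dtype=float))
    # Least-squares straight line through the log-transformed points.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_total = np.sum((y - y.mean()) ** 2)
    r2 = float(1.0 - np.sum(residuals ** 2) / ss_total)
    return r2, rmse
```

Under this sketch, a value of r² closer to 1 and a lower RMSE would indicate final dispersion estimates that follow the fitted trend more tightly.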

S10 Table. Shared genes and gene families between non-filtered datasets.

List of the gene families, and shared genes, that were commonly regulated in dogs and tame foxes, using the non-filtered datasets. The number, and name, of the genes within each gene family are provided, with the corresponding log2 fold-change values in brackets for each species. Within each family, single genes were characterized as shared between dogs and tame foxes, or as exclusive to each of the two groups. When more than one transcript for a specific gene was present, all the log2FC values are reported.

(DOCX)

Data Availability

All data are publicly available on NCBI (https://www.ncbi.nlm.nih.gov) under the project accession numbers: PRJEB3197 (runs: ERR266355, ERR266386, ERR266395, ERR266403, ERR266382, ERR266407, ERR266371, ERR266359, ERR266374, ERR266366 and ERR266400), PRJEB4668 (run: ERR351173), PRJNA185055 (runs: SRR636937 and SRR636938), PRJNA78827 (runs: SRR388737, SRR388740, SRR388766, SRR543733, SRR536881 and SRR536883) and PRJNA307604 (runs: SRR3084300, SRR3084299, SRR3084298, SRR3084297, SRR3084296, SRR3084295, SRR3084294, SRR3084293, SRR3084292, SRR3084291, SRR3084290, SRR3084289, SRR3084312, SRR3084311, SRR3084310, SRR3084309, SRR3084308, SRR3084307, SRR3084306, SRR3084305, SRR3084304, SRR3084303, SRR3084302 and SRR3084301). Further details are available in S1 Table, including: Species, Publication detail (Study), Sample IDs, Project accession and Run accession.

Funding Statement

This work was funded by the project NORTE-01-0246-FEDER-000063, supported by Norte Portugal Regional Operational Programme (NORTE2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), and by research funding from the projects under the references PTDC/BIA-EVF/29115/2017, PTDC/BIA-EVF/2460/2014 and POCI-01-0145-FEDER-029115, co-funded by Operational Competitiveness and Internationalization Program, Portugal 2020 and the European Union via the European Regional Development Fund (ERDF), and by National Funds through FCT. DL and RG were supported by FCT (PD/BD/132403/2017 to DL, contract under DL57/2016 to RG) and JA was supported by funds through FCT under the references POCI-01-0145-FEDER-029115 and PTDC/BIA-EVL/29115/2017. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. FCT and NORTE2020 URLs: https://www.fct.pt/ and https://www.norte2020.pt.

Decision Letter 0

Katherine James

23 May 2022

PONE-D-21-32424

On taming the effect of transcript level intra-condition count variation during differential expression analysis: a story of dogs, foxes and wolves

PLOS ONE

Dear Dr. Archer,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. You will see that the reviewers have raised some concerns regarding the methodology, specifically the choice of data for evaluation and the possibility of batch effects in these data, which will need to be addressed. In addition, several aspects of the methodology require further clarification.

Please note that PLOS publication criteria only require a study to be rigorous, robust and described in sufficient detail for replication. Therefore, while I agree with reviewer 1 that an R package may improve user uptake of your tool, this is not required for acceptance of your manuscript, since both reviewers have confirmed you have made usable code available.

Please submit your revised manuscript by Jul 04 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Katherine James, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

3. Please upload a new copy of Figure S1 and S2 as the detail is not clear. Please follow the link for more information: https://blogs.plos.org/plos/2019/06/looking-good-tips-for-creating-your-plos-figures-graphics/


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Authors outline that filtering expression data based on intra-group variation is recommended for maximising the number of identified DE genes. However the goal of DE analysis is not to maximise the number of DEGs but to identify the truly correct DEGs, those that are likely to be replicated if the experiment were conducted again or confirmed with another technique. In addition, genes that have high variability may actually be true DE genes, and there is no valid reason to discard them. To build a better justification for such filtering, a more comprehensive analysis is required to show that accuracy in DE classification is improved. Analysis of RNA-seq datasets with large numbers of replicates would be useful (eg: PMID: 27022035).

I downloaded the package and tried the example. It seemed to work fine using the directions in the manual.

The genes which are hypervariable in expression, are these markers of different brain regions? I ask because dissection and sampling can be a major source of variation.

P9: Regarding the way the variance is calculated, is it calculated for each sample group separately and then the average of the two groups is used, or is this done in a different way?

Typically, in order to avoid violating FDR correction assumptions, it is not allowed to filter any genes after the sample labels have been revealed as this equates to cherry picking, a form of p-hacking. In microarray analysis it is customary to discard probes with low overall variance but is acceptable as this procedure does not peek at the sample labels before filtering (eg: PMID: 19133141). Some analysts filter lowly expressed RNA-seq data using a threshold of 1 TPM or an average of 10 reads per sample on average which is also fine.

DESeq2 and other differential expression tools are written in R so it makes sense that this tool would also be written in R. Exporting the R data objects as TSV, running tvscript and then reading the data back into R is clumsy and may lead to poor uptake of this tool. I’d recommend a bioconductor package, which has the added benefit of being able to generate charts so that the user can better understand the intra-condition variability, like how edgeR generates a BCV chart (https://rdrr.io/bioc/edgeR/man/plotBCV.html). Another informative diagnostic chart could be PCA plots of (1) all transcripts, (2) hypervariable discarded transcripts, and (3) retained transcripts.

Bowtie is not recommended for transcriptome mapping. As there are reads that can map equally well to multiple transcripts which get discarded in such approaches, it is preferable to use Kallisto or Salmon, which deal accurately with multi-mapped reads. This may explain the reason behind the low mapping rate of wolf, dog and fox reads.

Does this approach work equally well for gene-based analysis using counts generated using STAR or featureCounts?

This does not sound right. “The number of transcripts removed was higher for the fox samples than for those of wolf and dog, reflecting the higher intra-condition variability present.” If percentiles of transcripts are being discarded, shouldn’t the proportion of detected transcripts discarded be the same for both studies? It is not explained clearly.

The figures should be explained in sequence. Eg: Figure 4C and the minitable in Fig 4 should be explained in the text before Fig 5a.

Reviewer #2: Lobo et al present an evaluation of their software TVscript, which evaluates intra-condition variability in the counts that have been mapped to a transcriptome in an RNA-Seq experiment and removes the transcripts associated with the highest level of this variability, up to a user specified percentile threshold. They test the software by applying it to two pairs of datasets from wild and tame animals, wolves vs dogs and aggressive vs tame foxes. The greatest fraction of differentially expressed transcripts (DETs) is obtained by removing 3 to 5% of transcripts, and the authors describe some interesting features of the gene families of the corresponding differentially expressed genes, including common changes upon taming.

The approach to RNA-Seq analysis is a potentially interesting one, representing another approach similar in some ways to the “orthogonal filtering” of low-expressed transcripts that is commonly used to increase the power in the analysis of RNA-Seq experiments.

Unfortunately there are a number of aspects to the methodology that make it hard to recommend publication in the present form.

1 Most importantly, I don’t think the data sets analysed are appropriate for the main intention of the paper. It is hard to tell whether the alterations made to the transcriptome improve the results rather than inducing false positives. The data sets used for testing come from multiple batches, and two organisms, one of which is appreciably divergent from the transcriptome to which it is aligned. In short, there are too many other uncontrolled factors in the analysis done to tell whether the results are reliable. In testing TVscript, it would be better to use an approach like that taken in Rapaport et al 2013 (https://doi.org/10.1186/gb-2013-14-9-r95), which uses data sets where batch effects are better controlled, including one (GEO GSE 49712) where External RNA Controls Consortium (ERCC) spike-ins were used to produce known true positive DEGs.

2 More detail is necessary about how the differential expression was performed; figure 2c seems to show that the dogs do separate by batch (1-5; 6; 7; 8-9) and one would normally use a design formula that took account of the different sources of the data, something like ~ batch + tameness (although the brain tissue and instrument used are also similar for dogs 6-9, so one could also try ~ tissue + tameness). The authors should state whether they used a formula like this, and justify why not if they did not (Ideally, the R script used for differential expression analysis could be made available).

I would also remark that, since the authors emphasise that data comes from different sources, it was not immediately clear, until I looked at supplementary table 1, that the wolves and dogs 1-5 all come from one study, and similarly all the foxes, aggressive and tame, come from one study. This should be brought out more in the text, as otherwise the reader is made to wonder how any difference between wild and tame will be detectable that is not confounded with batch effects.

3 It is unclear to me why the authors used the C. familiaris transcriptome for their work on the fox as well as the dog/wolf, when the fox and wolf lineages diverged 10 myr ago. A genome and transcriptome are available for the fox (https://www.ensembl.org/Vulpes_vulpes/Info/Index?db=core), and even though it is of lower quality than the dog, a higher mapping rate might have been expected. I appreciate that it makes the assessment of TVscript, and to some extent the comparison of dog and fox DET results, more straightforward (though information on orthology is also available). The low mapping rate and slightly strange clustering of the points in the PCA plot fig 2d are indications there may be some problems with the fox data that might in part come from the choice of transcriptome, and this casts some doubt on the DET results for me.

My recommendation would be to split the work into two papers, one comparing the wild and tame animals, which to me was the most interesting part of the manuscript, and one assessing TVscript. It appears to me from comparing Tables 1 and 2 in the paper (filtered transcriptome) with supplementary table 6 (unfiltered transcriptome), that the filtering did not make a very big difference here. Hence the first paper could use the results from the more standard methods of supp. Table 6 and the interesting overlaps between the changes on domestication in the two pairs of animals would still be largely maintained. The second paper would really need to use different test data sets, as suggested above, to establish whether TVscript is genuinely increasing sensitivity without introducing type I errors.

I would like to thank the authors for preparing the manuscript carefully, providing detailed results and supplementary material, and providing access to the code of TVScript along with links to other useful material on sourceforge.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: review_PONE-D-21-32424.docx

PLoS One. 2022 Sep 22;17(9):e0274591. doi: 10.1371/journal.pone.0274591.r002

Author response to Decision Letter 0


14 Jul 2022

Cover letter to editor and the responses to reviewers are pasted below as requested. These have also been uploaded as separated documents.

Cover letter (also uploaded):

Dear Dr. James,

Ref: Revision of manuscript titled “On taming the effect of transcript level intra-condition count variation during differential expression analysis: a story of dogs, foxes and wolves”.

We were very pleased to receive the feedback from both reviewers, and from you, in relation to our manuscript. We found the comments valuable in clarifying the content and presentation of our work. We have now responded to each of the comments in turn and have uploaded our modified manuscript, and requested files, as follows:

1. The clean manuscript file titled “manuscript.docx”. This is the version to which the line numbers within the response to reviewers document are relevant.

2. The corresponding file with track changes turned on. This is titled “revised_manuscript_with_track_changes.docx”.

3. The reviewer response file titled “response_to_reviewers.docx”.

4. This cover letter titled “cover_letter_2.docx”.

As requested, we have invested extensive effort into evaluating our approach. In brief, prior to presenting our results in relation to real data, we now perform a series of differential expression analysis experiments involving highly controlled simulated datasets within which the exact level of background intra-condition count variation could be specified, along with a subset of transcripts to be over-expressed across the replicates of the second condition. This testing framework was used to detect the known over-expressed transcripts relative to incrementing levels of background intra-condition count variation. For each level of variation introduced, one hundred iterations of differential expression analysis were performed. Details of this are found in the first response to each of the two reviewers. During testing our approach was shown to have a consistent positive effect on the detection of differentially expressed transcripts. We have added an author, Raquel Linheiro, who helped to perform the simulated differential expression experiments and who also tested for batch effects within our dog/wolf data. I hope that the addition of an author at this point is acceptable.

In relation to batch effects, we have tested for this, based on tissue as reviewer #2 suggested, and provide the result in the new figure S6. We have also clearly explained within the manuscript why we chose to use comparisons based on condition alone for our dog vs. wolf data (reviewer 2: comment 2). For fox data all samples came from the same batch (and tissue), and for simulations batch effects were not relevant due to the nature in how the datasets were generated.

We have discussed our reasoning for implementing our tool in Java within the manuscript, and appreciate the editor's choice in being lenient in relation to this given we have made source code available. We have highlighted within the manuscript that the approach could be readily implemented in R and, if future user demand is there, we are happy to create a supported R version.

In relation to our data availability statement, all data used is publicly available on NCBI and is described in detail within table S1, but we have now added the required statement that includes all run accession numbers during the resubmission process. Additionally, all count data from the raw data that we use has been made publicly available on the Zenodo platform. All old figures have been checked and re-uploaded along with the new ones.

Once again we found that all suggestions made by the reviewers were very helpful in improving our manuscript, for example, remapping to the fox transcriptome as well as remapping all of our data using kallisto to validate our counts obtained using Bowtie2, and we now hope that they, as well as you, find that our manuscript is now ready for presentation to the PLOS ONE readership. We await (tentatively) a positive response.

Finally in relation to our financial statement, we would like it to read as below, but there is no place on the re-submission platform to input this. The statement associated with the first submission is present on the final PDF build, but we need it replaced by this one:

“This work was funded by the project NORTE-01-0246-FEDER-000063, supported by Norte Portugal Regional Operational Programme (NORTE2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), and by research funding from the projects under the references PTDC/BIA-EVF/29115/2017, PTDC/BIA-EVF/2460/2014 and POCI-01-0145-FEDER-029115 co-funded by Operational Competitiveness and Internationalization Program, Portugal 2020 and the European Union via the European Regional Development Fund (ERDF) and by National Funds through FCT. DL, RG were supported by FCT (PD/BD/132403/2017 to DL, contract under DL57/2016 to RG) and JA was supported by Funds through FCT under the references POCI-01-0145-FEDER-029115 and PTDC/BIA-EVL/29115/2017. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. FCT and NORTH2020 url’s: https://www.fct.pt/ and https://www.norte2020.pt.”

As before we thank you for your consideration.

Best Regards,

John Archer,

Principal Researcher (Bioinformatics)

CIBIO-InBIO, Universidade do Porto, Campus de Vairão, Rua Padre Armando Quintas. 4485-661 Vairão, Email: john.archer@cibio.up.pt

Response to reviewers (also uploaded):

Reviewer #1:

1. Authors outline that filtering expression data based on intra-group variation is recommended for maximising the number of identified DE genes. However the goal of DE analysis is not to maximise the number of DEGs but to identify the truly correct DEGs, those that are likely to be replicated if the experiment were conducted again or confirmed with another technique. In addition, genes that have high variability may actually be true DE genes, and there is no valid reason to discard them. To build a better justification for such filtering, a more comprehensive analysis is required to show that accuracy in DE classification is improved. Analysis of RNA-seq datasets with large numbers of replicates would be useful (eg: PMID: 27022035).

We thank the reviewer for providing this detailed review of our manuscript and we have responded to each of the points raised below. We feel that each point has greatly improved our study as well as its presentation. In relation to the initial summarization, we agree with what the reviewer has said, mainly that the goal of DE analysis is not to blindly maximize the number of DEGs but to identify those that are truly correct. We now clarify this in the manuscript (line 176 of the introduction) and we have added an extensive analysis to demonstrate that we are improving the identification of differentially expressed transcripts and not just maximizing the numbers in the lists produced.

In relation to the latter (which also overlaps with the first comment made by reviewer #2) we extensively explore the effects of intra-condition per-transcript read count variation on the identification of differentially expressed transcripts through a series of differential expression analysis experiments involving highly controlled simulated count datasets derived from the available dog reference transcriptome. These are mentioned within the introduction (line 130) and described in detail within the materials & methods section under the new section titled “Controlled intra-condition variation within simulated data” (starting at: line 247). Using simulated count datasets allowed the exact level of background intra-condition count variation to be specified, as well as a subset of transcripts to be over-expressed across the replicates of the second condition (for each experiment performed). Within our simulated replicates, background count variation refers to varying count levels associated with individual transcripts that are not maintained across replicates (of a given condition), thus effectively reflecting intra-condition noise. On the other hand, specifying a subset of transcripts to be over-represented across replicates of a condition reflects identifiable over-expressed transcripts.

Following a brief introductory set of differential expression experiments on simulated data where levels of intra-condition variation are kept at zero (line 265), a set of simulations was performed in which the level of background variation went from 1% to 10% in steps of one. At each increment one hundred iterations of the following were performed (line 282): (i) Ten replicate count datasets were generated, representing reads evenly spread across 22,580 dog transcriptome transcripts (after normalization by length) ranging in length from 300 to 5000 bp, and allocated into two conditions A and B (five in each). Within B one hundred selected transcripts had counts over-represented. (ii) At the percent level of variation associated with the iteration, that percent of transcripts from each of the ten replicates were randomly selected for count over-representation. (iii) DESeq2 was used on the count files within conditions A and B to obtain a list of over-expressed transcripts. (iv) TVscript was run using a 95th percentile variance threshold to generate ten corresponding modified count files separated into two conditions (A’ and B’). (v) DESeq2 was again used on these to obtain a list of over-expressed transcripts. (vi) The lists of over-expressed transcripts obtained in (iii) and (v) were cross compared. This was repeated in two different ways: (a) the one hundred transcripts initially flagged for over-representation were kept constant throughout all levels of variation and for each of the associated iterations and (b) during each level of variation and for each iteration a new set of one hundred transcripts was randomly selected.
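
The simulation scheme in steps (i) and (ii) can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the function name `simulate_replicates`, the flat baseline count, the fold factor and the seed are hypothetical choices made for illustration, since the response does not state the magnitude of over-representation. Length normalization and the DESeq2 and TVscript steps (iii) to (vi) are omitted.

```python
import random


def simulate_replicates(n_transcripts=22580, n_per_condition=5, n_true=100,
                        noise_percent=3, base_count=100, fold=4, seed=0):
    """Return ({"A": [...], "B": [...]}, true_de) for one simulated experiment.

    base_count, fold and seed are illustrative assumptions, not values
    taken from the manuscript.
    """
    rng = random.Random(seed)
    # Transcripts that are genuinely over-represented in condition B.
    true_de = set(rng.sample(range(n_transcripts), n_true))
    # Number of transcripts per replicate that receive background noise.
    n_noisy = n_transcripts * noise_percent // 100

    replicates = {"A": [], "B": []}
    for condition in ("A", "B"):
        for _ in range(n_per_condition):
            # Step (i): counts spread evenly across all transcripts.
            counts = [base_count] * n_transcripts
            if condition == "B":
                for t in true_de:
                    counts[t] *= fold
            # Step (ii): a fresh random subset is inflated in each replicate,
            # so the inflation is not maintained across replicates (noise).
            for t in rng.sample(range(n_transcripts), n_noisy):
                counts[t] *= fold
            replicates[condition].append(counts)
    return replicates, true_de
```

Drawing a new noisy subset for every replicate is what makes the inflation intra-condition variation rather than a condition effect, whereas the `true_de` set is shared by all replicates of B.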

The results are presented within two new paragraphs (starting at: line 414) and within the newly added figures (Fig. 2 and Fig. S5), as well as Tables S2, S3, S4 and S5. In brief, the results in the figures indicate that within each iteration of steps (i) to (vi) slightly more of the one hundred transcripts selected for read over-representation (within condition B of each iteration) are identified post filtering. Tables S2 and S3 provide the counts of commonly identified differentially expressed transcripts (before and after filtering) for each of the one hundred iterations, at each level of intra-condition variation introduced. However, even if the numbers are high they must be taken in the context of the maximum identified from either pre- or post-filtered data. This is why tables S4 and S5 were provided as they present the agreeability relative to these maximums, and from levels of randomly introduced intra-condition count variation going from 1 to 4% the agreeability is high (Table S4 - 1 to 4% averages: 0.96, 0.98, 0.94 and 0.81; Table S5 - 1 to 4% averages: 0.96, 0.97, 0.93 and 0.82) (line 447). Finally, within figure 2 and figure S5, at and above the 5% level of intra-condition variation the ability to successfully identify the one hundred transcripts selected for over-representation within condition B greatly diminishes within iterations. This could provide a tentative estimate of the level at which random intra-condition count variation becomes inhibitory within differential expression analysis studies (line 452).
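
One way to read "agreeability relative to these maximums" is as the number of transcripts common to both lists divided by the size of the larger list. The short Python sketch below assumes that reading; the function name `agreeability` and the convention used for two empty lists are illustrative assumptions and are not taken from the manuscript or its tables.

```python
def agreeability(pre_filter, post_filter):
    """Shared transcripts divided by the size of the larger list.

    Assumed definition for illustration only.
    """
    pre, post = set(pre_filter), set(post_filter)
    largest = max(len(pre), len(post))
    if largest == 0:
        # Assumed convention: two empty lists agree completely.
        return 1.0
    return len(pre & post) / largest
```

For example, lists of four and five transcripts that share three identifiers give 3 / 5 = 0.6 under this definition.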

We thank (both) reviewers for providing dataset suggestions with larger numbers of replicates than we initially used in our analysis and pointing out this area that could be improved. We felt that using simulated data for this newly included analysis allowed us to explicitly control all relevant factors involved and allowed us to recreate the many replicates under the varying degrees of intra-condition variation described. We feel that this comment has contributed greatly to the clarity of our exploration in relation to the effects of intra-condition read count variation on the detection of differentially expressed transcripts.

2. I downloaded the package and tried the example. It seemed to work fine using the directions in the manual.

We thank the reviewer for taking the time to test our tool. As an aside, future work is planned on developing a graphical user interface and increasing in-software visualization and interaction with the input count datasets. We have also contemplated allowing the user to input raw read datasets and perform tasks such as estimating counts internally using our own implemented pseudo-mapper or those such as kallisto (mentioned by this reviewer later). We have not altered the software since the previous review but we do explicitly state that it is open source at the end of the introduction (line 155).

3. The genes which are hypervariable in expression, are these markers of different brain regions? I ask because dissection and sampling can be a major source of variation.

It is correct that across different brain regions one would expect differing levels of expression and this is the reason why we limited the compartments included to frontal cortex, cerebral cortex, prefrontal cortex and frontal lobe for dogs and wolves and to just prefrontal cortex for aggressive and tame foxes. In addition to this information being provided in Table S1, we have clarified the compartments that we use within the Materials & Methods section (lines 162 and 166). We also now explicitly mention the possible effects of differing compartments and batch effects relative to the wolf and dog data (starting at: line 320); also in relation to a comment made by the second reviewer. Finally, as the reviewer points out, dissection and sampling can be a major cause of intra-condition variation that will subsequently have an impact on the detection of DEGs. This is true even for biological replicates at an intra-study level and is the reason why we opted for simulated data during testing, where we had explicit control in relation to specifying which transcripts were to be over-represented as well as the levels of background count variation, in relation to the first comment by this reviewer.

4. P9: Regarding the way the variance is calculated, is it calculated for each sample group separately and then the average of the two groups is used, or is this done in a different way?

In the Materials & Methods paragraph, titled Software, it is mentioned that (≈line 210): “ … (iii) for each reference transcript (t), the absolute pairwise differences between normalized read counts across all samples within condition A are calculated; (iv) the corresponding variances are calculated; (v) steps (iii) and (iv) are repeated for condition B; (vi) variance scores from each condition are placed in ascending order and associated with corresponding percentiles; …” Thus, variance scores are calculated for each group separately, following which all scores from each group are placed in a sorted list. It is this final sorted list that is used for calculating thresholds. This way scores from both groups are represented within the final distribution used to calculate percentiles on and there is no averaging involved. We have clarified this within this paragraph (line 226).
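
The quoted steps (iii) to (vi) can be sketched in a few lines of Python. This is an illustration written from the description above, not the TVscript Java source: the function names, the use of the population variance, the nearest-rank percentile rule, and the choice to drop a transcript when its score in either condition reaches the threshold are all assumptions.

```python
import math
from itertools import combinations
from statistics import pvariance


def variance_score(counts):
    """Variance of the absolute pairwise differences between replicate counts."""
    diffs = [abs(a - b) for a, b in combinations(counts, 2)]
    return pvariance(diffs)


def filter_transcripts(cond_a, cond_b, percentile=95.0):
    """cond_a / cond_b map transcript id -> normalized counts per replicate.

    Returns (kept transcript ids, variance threshold).
    """
    # Steps (iii)-(v): score every transcript separately within each condition.
    scores_a = {t: variance_score(c) for t, c in cond_a.items()}
    scores_b = {t: variance_score(c) for t, c in cond_b.items()}
    # Step (vi): pool the scores of both conditions in one ascending list.
    pooled = sorted(list(scores_a.values()) + list(scores_b.values()))
    # Nearest-rank percentile (assumed rule) gives the threshold score.
    rank = max(1, math.ceil(percentile / 100 * len(pooled)))
    threshold = pooled[rank - 1]
    # Assumption: a transcript is removed if either score is at or above it.
    kept = [t for t in cond_a
            if scores_a[t] < threshold and scores_b[t] < threshold]
    return kept, threshold
```

Because both conditions contribute to the single pooled list, the threshold reflects the joint distribution of scores and no averaging between groups takes place, which matches the clarification given above.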

5. Typically, in order to avoid violating FDR correction assumptions, it is not allowed to filter any genes after the sample labels have been revealed as this equates to cherry picking, a form of p-hacking. In microarray analysis it is customary to discard probes with low overall variance but is acceptable as this procedure does not peek at the sample labels before filtering (eg: PMID: 19133141). Some analysts filter lowly expressed RNA-seq data using a threshold of 1 TPM or an average of 10 reads per sample on average which is also fine.

We are exploring the effects of intra-condition variation at a per transcript level, simultaneously across all transcripts, on the general ability to detect those that are differentially expressed (line 103). We agree that our initial datasets were not sufficient to confirm this, as we had no control over the actual levels of per-transcript intra-condition count variation that were present, nor did we have a list of specified transcripts that were guaranteed to be over-expressed. Additionally, we had no way to verify the consistency of the results across hundreds of repetitions. This would be largely true for any “sequenced” datasets used. We feel that with the addition of the simulated study, where parameters were explored in a highly defined framework, we are approaching a point where our exploration is more interesting to a wider readership, as we could specify the actual level of variation present within the replicates (in conditions A and B) of each differential expression experiment performed (iterations) involving both TVscript filtered and non-filtered counts and verify results across hundreds of iterations.

We do not feel that we are cherry picking transcripts, but instead using a robust variation threshold framework based on percentiles to explore the effects of intra-condition variation. We initially selected the 95th and 97th percentiles for such cut-offs as these maximized the number of differentially expressed transcripts, but admittedly we had not previously demonstrated the robustness of the approach through a simulated study (now included as described above), nor that the additional transcripts were on top of those identified prior to filtering as a result of the reduction of noise within the data. We hope now that with the addition of the extensive simulations involving predefined over-expressed transcripts the reviewer is convinced that there is no cherry picking involved and that the increase in the over-expressed transcripts is because of the optimization of the threshold for removing noisy transcripts.

Finally, we fully admit that this is an exploratory approach and that the optimized threshold will vary depending on the datasets involved, thus potentially limiting general applicability without a complete prior exploration of the datasets involved in a similar manner to that presented here. We have now explicitly stated this within the concluding paragraph of the discussion (line 723). We feel that exploring and highlighting the effects of intra-condition count variation is still a highly relevant, and interesting, study to make widely available as its context is often muted within differential expression analysis studies, even though the transcript lists produced are at times a major part of the end result. Disagreeability between approaches used to identify differentially expressed transcripts is further evidence of the effects of such variation (line 95).

6. DESeq2 and other differential expression tools are written in R so it makes sense that this tool would also be written in R. Exporting the R data objects as TSV, running tvscript and then reading the data back into R is clumsy and may lead to poor uptake of this tool. I’d recommend a bioconductor package, which has the added benefit of being able to generate charts so that the user can better understand the intra-condition variability, like how edgeR generates a BCV chart (https://rdrr.io/bioc/edgeR/man/plotBCV.html).

Yes, many differential expression analysis tools are written in R, but prior to using them an array of many non-R based tools are required, for example mapping or abundance estimators (such as Bowtie2 or kallisto). For this reason we thought that having TVscript in a platform independent language such as Java would be acceptable. Additionally, we wanted the potential to be able to develop an optional graphical user interface that could sit on top of our tool and help visualize samples, count data and output files in a rich transcriptome analysis environment that would be difficult to provide in a package such as R. That said, our method is not overly complex and, being clearly described within the manuscript in a stepwise fashion, it could be readily implemented within R. Eventually, if our approach is accepted as an interesting way to explore such data and if there is enough demand for an R implementation, we would be pleased to provide one, but for now we hope that the reviewer can accept our Java implementation. In relation to this comment we have added the following to the manuscript (line 242).

“Although TVscript is implemented in Java, the steps involved can be readily implemented within any language (e.g. R or Python), using the detailed description provided above as well as the Java source code that is fully available. There are no dependent packages where code is unavailable. At the time of development we chose Java mainly due to its platform independence, which can be an advantage when setting up analysis pipelines involving many different tools. That said, we are aware that many differential expression analysis tools are R based and future demand may warrant a supported R version.”

7. Bowtie is not recommended for transcriptome mapping. As there are reads that can map equally well to multiple transcripts which get discarded in such approaches, it is preferable to use Kallisto or Salmon, which deal accurately with multi-mapped reads. This may explain the reason behind the low mapping rate of wolf, dog and fox reads.

In relation to the placement of reads that map to identical regions of transcripts, we are not under the impression that kallisto deals with these more reliably, as kallisto is effectively estimating abundance counts for each transcript using kmer summarizations in order to rapidly speed up the obtaining of read counts used for downstream analysis (as read locational placements within transcripts are not found). Kallisto in effect somewhat avoids the problem by not placing individual reads. For example, within a given transcript if there are ambiguous regions present for full mapping, then as long as a pseudo-mapping tool is not looking to specifically place a read the count can be incremented for each read located within that transcript - regardless of the actual location. Kallisto is a very nice piece of software, performs exceptionally well in terms of obtaining transcript count abundances, and in relation to this comment we have mapped all of our data again using kallisto to the dog transcriptome reference set used within our analysis (line 196).

For each dataset mapped we have summarized the correlations between the Bowtie2 counts that we utilize and the kallisto estimated abundances. In all cases these correlations are exceptionally high and we have discussed this within the results section of the manuscript (line 401) and included a new supplementary figure (Fig. S2). All read counts and abundances by Bowtie2 and kallisto have now been made available on the Zenodo repository (line 405) (https://zenodo.org/record/6778429). High correlations were achieved as Bowtie2 also performs well at mapping RNA-seq data to reference transcriptomes and, even within transcripts where there can be ambiguity in mapping minority numbers of reads, the overall count associated with the transcript obtained using the BBMap package is still accurate.
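As an illustration of the kind of comparison described above, the following is a minimal sketch of a per-dataset correlation between two sets of per-transcript values. It is not the procedure used for Fig. S2: the choice of Pearson correlation, the `log1p` transform, the function name `pearson` and the example values are all assumptions made here for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-transcript values for one dataset (not taken from the study).
bowtie2_counts = [120, 0, 35, 980, 14, 402]
kallisto_estimates = [118.4, 0.0, 37.9, 1003.2, 11.5, 395.0]

# log1p dampens the influence of a few very highly expressed transcripts
# (an assumption of this sketch, not a step stated in the response).
log_bowtie2 = [math.log1p(v) for v in bowtie2_counts]
log_kallisto = [math.log1p(v) for v in kallisto_estimates]

r = pearson(log_bowtie2, log_kallisto)
print(f"Pearson r on log1p values: {r:.3f}")
```

A rank-based coefficient could be substituted if the relationship between the two tools' outputs is expected to be monotonic rather than linear.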

Tangentially, the dog, fox and wolf datasets that we use in our study are mapped to the available dog reference transcriptome, where there are no introns present to disrupt the mapping process, and this is why we did not use a splice-aware mapper such as Tophat2 or HISAT2. We now say this (line 183).

8. Does this approach work equally well for gene-based analysis using counts generated using STAR or featureCounts?

Such an exploration can be performed as long as count data (or estimated count data) is available that can be allocated into two different conditions. We mention this within the concluding paragraph (line 720). The important thing is to make sure the counts are reliable, and this is why the reviewer's previous kallisto comment was relevant, and why we remapped all our data using that tool to confirm that counts obtained following the Bowtie2 mapping were of high quality.

9. Another informative diagnostic chart could be PCA plots of (1) all transcripts, (2) hypervariable discarded transcripts, and (3) retained transcripts.

The PCA plots that we present in (now) figure 3 are intended to show the general relatedness between datasets, not to pick up subtle internal signals, be they read count variation within a few transcripts or more complex evolutionary relationships. Within each of the 44 datasets, given that there are 26,107 transcripts involved that harbour varying count values, from 0 to high, as well as varying levels of intra-condition count variation (across the “conditions” that we were interested in), we felt that although the PCA plots were appropriate for demonstrating the very general evolutionary-based relationship between datasets, they may not have the resolution, at this per-dataset level, to pick up the subtle shift in signal caused by removing the relatively few transcripts with the highest levels of intra-condition noise. At the reviewer's request we have generated the corresponding PCA plots (Fig. R1 embedded in uploaded reviewer response file), minus the transcripts harbouring high levels of intra-condition noise, but visually these do not change much from the original PCAs included within the paper. We have included these plots here, but have not done so within the manuscript. We have clarified within the methods that we are using PCA plots to summarize the more generalized relationships between datasets (line 336).

<image is in the response_to_reviewers.docx uploaded file>

Figure R1. PCA plots of datasets following the removal of intra-condition variation. PCA plots based on normalized non-filtered count data of the individual datasets comparing wolf and dog (top), and tame and aggressive fox (bottom) samples after the removal of transcripts harboring levels of intra-condition variation above the 95th and 97th percentile thresholds respectively. In the latter, only individual samples that were positioned within a distant cluster are labelled with the sample ID.

10. This does not sound right. “The number of transcripts removed was higher for the fox samples than for those of wolf and dog, reflecting the higher intra-condition variability present.” If percentiles of transcripts are being discarded, shouldn’t the proportion of detected transcripts discarded be the same for both studies? It is not explained clearly.

We have clarified this in the manuscript (line 224), and the reviewer was correct in saying that this was ambiguous. We use the variation value associated with a specific percentile as our cut-off. This does not necessarily mean that the number of transcripts removed is the same each time. For example, within one comparison the variation value at the 95th percentile could be one number, but within a different comparison the variation value at the same threshold could be different, as the overall distribution is dependent on the input datasets. This is actually one of the problems in explaining our exploration of the data that the now added simulations greatly help with, as we can specify the level of background intra-condition variation explicitly, and in doing so the numbers removed within each iteration at the same level would be expected to be more similar (aside from some stochastic change based on the random nature of the introduced variation).
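To make the distinction between a percentile and the variation value at that percentile concrete, the following sketch applies a value-based cut-off to two small, made-up comparisons. It is only an illustration under stated assumptions: the coefficient of variation as the variation measure, the nearest-rank percentile rule, the "at or above" removal rule, and all names and numbers are hypothetical and are not taken from TVscript. In this toy case the cut-off values differ, and the numbers removed differ because several transcripts share the cut-off value.

```python
import math
import statistics


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.fmean(counts)
    if mean == 0:
        return 0.0
    return statistics.pstdev(counts) / mean


def percentile_value(values, percentile):
    """Value at the given percentile using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_by_percentile(counts_by_transcript, percentile):
    """Return (cut-off value, transcripts whose variation is at or above it)."""
    variation = {t: coefficient_of_variation(c)
                 for t, c in counts_by_transcript.items()}
    cutoff = percentile_value(list(variation.values()), percentile)
    removed = sorted(t for t, v in variation.items() if v >= cutoff)
    return cutoff, removed


# Two hypothetical comparisons: five transcripts, three replicates of one condition.
comparison_a = {"t1": [10, 10, 10], "t2": [10, 12, 11], "t3": [5, 5, 6],
                "t4": [20, 1, 40], "t5": [3, 30, 9]}
comparison_b = {"t1": [10, 10, 10], "t2": [10, 11, 12], "t3": [0, 0, 9],
                "t4": [0, 0, 9], "t5": [0, 0, 9]}

cutoff_a, removed_a = filter_by_percentile(comparison_a, 80)
cutoff_b, removed_b = filter_by_percentile(comparison_b, 80)
print(f"A: cut-off {cutoff_a:.3f}, removed {removed_a}")
print(f"B: cut-off {cutoff_b:.3f}, removed {removed_b}")
```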

11. The figures should be explained in sequence. Eg: Figure 4C and the minitable in Fig 4 should be explained in the text before Fig 5a.

We have now split the original figure 5 into two separate figures (Fig. 5 and Fig. 7), and referred to them in the appropriate order (including that of the embedded table), as indicated by the reviewer. Note that the original figure and table numbers have been altered as a result of the addition of new figures and tables where required, but the order is now referred to correctly for all.

Reviewer #2:

Lobo et al present an evaluation of their software TVscript, which evaluates intra-condition variability in the counts that have been mapped to a transcriptome in an RNA-Seq experiment and removes the transcripts associated with the highest level of this variability, up to a user specified percentile threshold. They test the software by applying it to two pairs of datasets from wild and tame animals, wolves vs dogs and aggressive vs tame foxes. The greatest fraction of differentially expressed transcripts (DETs) is obtained by removing 3 to 5% of transcripts, and the authors describe some interesting features of the gene families of the corresponding differentially expressed genes, including common changes upon taming. The approach to RNA-Seq analysis is a potentially interesting one, representing another approach similar in some ways to the “orthogonal filtering” of low-expressed transcripts that is commonly used to increase the power in the analysis of RNA-Seq experiments. Unfortunately there are a number of aspects to the methodology that make it hard to recommend publication in the present form.

We thank this reviewer for taking the time to review our manuscript and we have invested significant time and effort in responding to the individual comments provided. We hope that after reviewing the changes made, the reviewer will now agree that our manuscript has been sufficiently improved. The first comment overlapped with what reviewer #1 pointed out, so there is some repetition in our response.

1 Most importantly, I don’t think the data sets analysed are appropriate for the main intention of the paper. It is hard to tell whether the alterations made to the transcriptome improve the results rather than inducing false positives. The data sets used for testing come from multiple batches, and two organisms, one of which is appreciably divergent from the transcriptome to which it is aligned. In short, there are too many other uncontrolled factors in the analysis done to tell whether the results are reliable. In testing TVscript, it would be better to use an approach like that taken in Rapaport et al 2013 (https://doi.org/10.1186/gb-2013-14-9-r95), which uses data sets where batch effects are better controlled, including one (GEO GSE 49712) where External RNA Controls Consortium (ERCC) spike-ins were used to produce known true positive DEGs.

We now extensively explore the effects of intra-condition per-transcript read count variation on the identification of differentially expressed transcripts through a series of differential expression analysis experiments involving highly controlled simulated count datasets derived from the available dog reference transcriptome. These are mentioned within the introduction (line 130) and described in detail within the materials & methods section under the new section titled “Controlled intra-condition variation within simulated data” (starting at: line 247). Using simulated count datasets allowed the exact level of background intra-condition count variation to be specified, as well as the specification of a subset of transcripts to be over-expressed across replicates within the second condition used (for each experiment performed). Within our simulated replicates, background count variation refers to varying count levels associated with individual transcripts that are not maintained across replicates (of a given condition), thus effectively reflecting intra-condition noise. On the other hand, specifying a subset of transcripts to be over-represented across replicates of a condition reflects identifiable over-expressed transcripts.

Following a brief introductory set of differential expression experiments on simulated data where levels of intra-condition variation are kept at zero (line 265), a set of simulations was performed in which the levels of background variation explored went from 1% to 10% in steps of one. At each increment one hundred iterations of the following were performed (line 282): (i) Ten replicate count datasets were generated, representing reads evenly spread across 22,580 dog transcriptome transcripts (after normalization by length) ranging in length from 300 to 5000 bp, and allocated into two conditions A and B (five in each). Within B, one hundred selected transcripts had counts over-represented. (ii) At the percent level of variation associated with the iteration, that percent of transcripts from each of the ten replicates were randomly selected for count over-representation. (iii) DESeq2 was used on the count files within conditions A and B to obtain a list of over-expressed transcripts. (iv) TVscript was run using a 95th percentile variance threshold to generate ten corresponding modified count files separated into two conditions (A’ and B’). (v) DESeq2 was again used on these to obtain a list of over-expressed transcripts. (vi) The lists of over-expressed transcripts obtained in (iii) and (v) were cross-compared. This was repeated in two different ways: (a) the one hundred transcripts initially flagged for over-representation were kept constant throughout all levels of variation and for each of the associated iterations, and (b) during each level of variation and for each iteration a new set of one hundred transcripts was randomly selected.
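Steps (i) and (ii) above can be sketched roughly as follows. This is a simplified illustration, not the simulation code used for the manuscript: the function name `simulate_replicates`, the base count of 100, the four-fold over-representation and the seeds are assumptions introduced here, and the DESeq2 and TVscript steps (iii) to (vi) are not reproduced.

```python
import random


def simulate_replicates(n_transcripts, n_per_condition, signal_ids,
                        noise_percent, base_count=100, fold=4, seed=0):
    """Simulate evenly spread counts for conditions A and B.

    signal_ids: transcripts over-represented in every replicate of B.
    noise_percent: percent of transcripts over-represented at random,
    drawn independently for each replicate (background variation).
    base_count and fold are illustrative assumptions of this sketch.
    """
    rng = random.Random(seed)
    n_noisy = round(n_transcripts * noise_percent / 100)
    conditions = {"A": [], "B": []}
    for label in ("A", "B"):
        for _ in range(n_per_condition):
            counts = [base_count] * n_transcripts
            if label == "B":
                # Signal shared by all replicates of B.
                for t in signal_ids:
                    counts[t] *= fold
            # Background variation: a fresh random subset per replicate.
            # A signal transcript may also be drawn here and be raised twice.
            for t in rng.sample(range(n_transcripts), n_noisy):
                counts[t] *= fold
            conditions[label].append(counts)
    return conditions


# One iteration at the 3% level, using the transcript and replicate
# numbers given in the response; everything else is hypothetical.
picker = random.Random(42)
signal = picker.sample(range(22580), 100)
sim = simulate_replicates(22580, 5, signal, noise_percent=3)
noisy_in_first_a = sum(1 for c in sim["A"][0] if c != 100)
print(noisy_in_first_a)
```

Keeping `signal` fixed across iterations corresponds to the first of the two variants described, while drawing a new `signal` for each iteration corresponds to the second.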

The results are presented within two new paragraphs (starting at: line 414) and within the newly added figures (Fig. 2 and Fig. S5), as well as Tables S2, S3, S4 and S5. In brief, the results in the figures indicate that within each iteration of steps (i) to (vi) slightly more of the one hundred transcripts selected for read over-representation (within condition B of each iteration) are identified post-filtering. Tables S2 and S3 provide the counts of commonly identified differentially expressed transcripts (before and after filtering) for each of the one hundred iterations, at each level of intra-condition variation introduced. However, even if the numbers are high, they must be taken in the context of the maximum identified from either pre- or post-filtered data. This is why Tables S4 and S5 were provided, as they present the agreeability relative to these maximums, and for levels of randomly introduced intra-condition count variation going from 1 to 4% the agreeability is high (Table S4 - 1 to 4% averages: 0.96, 0.98, 0.94 and 0.81; Table S5 - 1 to 4% averages: 0.96, 0.97, 0.93 and 0.82) (line 447). Finally, within figure 2 and figure S5, at and above the 5% level of intra-condition variation the ability to successfully identify the one hundred transcripts selected for over-representation within condition B greatly diminishes within iterations. This could be a tentative estimate of the level at which random intra-condition count variation becomes inhibitory within differential expression analysis studies (line 452).
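One plausible reading of "agreeability relative to these maximums" is the number of transcripts common to the pre- and post-filtering lists divided by the size of the larger list. The sketch below implements that reading; the exact formula behind Tables S4 and S5 is not given in this response, so the function `agreeability` and the example identifiers are assumptions for illustration only.

```python
def agreeability(pre_filter_dets, post_filter_dets):
    """Shared transcripts divided by the size of the larger of the two lists."""
    pre, post = set(pre_filter_dets), set(post_filter_dets)
    largest = max(len(pre), len(post))
    if largest == 0:
        # Nothing identified in either list; reported as 0.0 in this sketch.
        return 0.0
    return len(pre & post) / largest


# Hypothetical transcript identifiers from one iteration.
before = ["T1", "T2", "T3", "T4"]
after = ["T2", "T3", "T4", "T5", "T6"]
score = agreeability(before, after)
print(f"agreeability: {score:.2f}")
```

Dividing by the larger list means that a high count of shared transcripts still gives a low score when one of the two analyses reports many transcripts that the other does not.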

We thank (both) reviewers for providing dataset suggestions with larger numbers of replicates than we initially used in our analysis and pointing out this area that could be improved. We felt that using simulated data for this newly included analysis allowed us to explicitly control all relevant factors involved and allowed us to recreate the many replicates under the varying degrees of intra-condition variation described. We feel that this comment has contributed greatly to the clarity of our exploration in relation to the effects of intra-condition read count variation on the detection of differentially expressed transcripts.

2 More detail is necessary about how the differential expression was performed; figure 2c seems to show that the dogs do separate by batch (1-5; 6; 7; 8-9) and one would normally use a design formula that took account of the different sources of the data, something like ~ batch + tameness.

(although the brain tissue and instrument used are also similar for dogs 6-9, so one could also try ~ tissue + tameness). The authors should state whether they used a formula like this, and justify why not if they did not (Ideally, the R script used for differential expression analysis could be made available).

We have now added the following text to the manuscript in relation to this comment (line 318):

“For the aggressive vs. tame fox case study batch effects were not considered as all data came from the same study, tissue and sequencing run; additionally, no further information about sample preparation was available. For the wolves vs. dogs case study we tested for effects based on tissue, primarily for quality control of the final transcripts we drew biological-related conclusions about, and compared results obtained to those in the absence of batch information. In our analysis we used differential expression results based solely on the latter, as firstly, effects associated with tissue at an inter-study level are unpredictable as there are many factors involved, such as precision of dissection, time of dissection, time to dissect, state of individual tissue samples as well as the individual who prepared the sample, and other than publication or information mentioned for the fox case study, no further information on batches was obtainable. Secondly, although DESeq2 provides an internalized method for accommodating batch effects that we applied (~batch + condition), the results obtained at an intra-study level, with well defined batches, between alternative methods of testing are variable [52]. Lastly, we were primarily exploring the effects of removing hyper-variable transcripts on the mechanics of detecting differentially expressed transcripts and our simulations and case studies were a means to an end in achieving this. As long as input counts for a given filtering threshold within a given case study or an iteration were consistent with those of the initial input data, the effects of removing hyper-variable transcripts could be observed, independent of other factors affecting the data prior to analysis.”

We have also now included the new supplementary figure S6 that confirms the presence of six of the shared genes between dogs and tame foxes once batch effects based on tissue have been taken into account. This figure is referred to within the results section of the manuscript (lines 580 and 602).

For the newly added simulated testing framework (described for comment 1), we had strict control of all parameters within the simulated data for each of the differential expression analysis experiments that were performed. For example, we could select the level of random background count variation allowed within the otherwise evenly distributed count values, as well as select the transcripts to have counts over-represented across multiple replicates of a given condition. Although such a scenario may be somewhat artificial relative to in vivo data, it allowed us to explicitly look at the effects before and after filtering within each iteration without having to deal with batch or other sources of variation.

I would also remark that, since the authors emphasise that data comes from different sources, it was not immediately clear, until I looked at supplementary table 1, that the wolves and dogs 1-5 all come from one study, and similarly all the foxes, aggressive and tame, come from one study. This should be brought out more in the text, as otherwise the reader is made to wonder how any difference between wild and tame will be detectable that is not confounded with batch effects.

We have now clarified this within the methods section (line 166).

3 It is unclear to me why the authors used the C. familiaris transcriptome for their work on the fox as well as the dog/wolf, when the fox and wolf lineages diverged 10 myr ago. A genome and transcriptome are available for the fox (https://www.ensembl.org/Vulpes_vulpes/Info/Index?db=core), and even though it is of lower quality than the dog, a higher mapping rate might have been expected. I appreciate that it makes the assessment of TVscript, and to some extent the comparison of dog and fox DET results, more straightforward (though information on orthology is also available). The low mapping rate and slightly strange clustering of the points in the PCA plot fig 2d are indications there may be some problems with the fox data that might in part come from the choice of transcriptome, and this casts some doubt on the DET results for me.

In relation to the lower fox mapping numbers, within the materials & methods we now map all of the fox data to the available fox transcriptome (line 191). The mapping numbers achieved (new figure S3) are similar to those obtained when these data were mapped to the dog transcriptome (line 410). Within the discussion we explicitly state that:

“High intra-condition count variation at an inter, and to a lesser extent intra, study level can arise from a range of sources including i) biological differences between samples such as age, sex, diet, and health; ii) in silico error involving assembly tools producing poorly understood chimeras within the reference transcriptome [50,60,61]; iii) ambiguities in read mapping to such references [62]; iv) normalization of count data derived from such mapped reads [63]; and v) in vitro error during library preparation protocols [64,65].”

The main reason why we selected the dog transcriptome for our analysis was that we wanted to use a common reference set between the dogs vs. wolves and aggressive vs. tame foxes comparisons, and we believed that the dog reference transcriptome, being more commonly worked on, is more refined and possibly less vulnerable to internal ambiguities such as chimeras, partial redundancy and missing transcripts. Given that our primary goal was to explore the effects of per-transcript intra-condition count variation on the detection of differentially expressed transcripts, we felt that the dog reference transcriptome was a good place to start. We realise that if our main approach was to solely (or primarily) study differential expression alterations during domestication, information on orthology is available; but this was more of a side product for us to make our exploration of intra-condition count variation more interesting. It is likely because of this balance between testing and biological application that the reviewer suggests two different papers in comment 4; and we do understand the point being made.

More generally, in terms of the mapping approach used, we discuss this in relation to comment 7 made by reviewer #1, where we confirm numbers using the alternative kallisto-based approach and provide all counts obtained following mapping by Bowtie2 and kallisto. In all cases the correlations between these are exceptionally high and we have discussed this within the results section of the manuscript (line 401) and included a new supplementary figure (Fig. S2). All read counts and abundances by Bowtie2 and kallisto have now been made available on the Zenodo repository (line 402) (https://zenodo.org/record/6778429).

Generally the reviewer comment “… are indications there may be some problems with the fox data that might in part come from the choice of transcriptome, and this casts some doubt on the DET results for me …” is true, but this is also true for differential expression analysis studies where there is not an established high quality, closely related, reference available, or when results are dependent on de novo assembled contigs where there can be many ambiguities, or indeed when single cell sequencing has not been performed. There should usually be some level of doubt. This is one of the reasons why we emphasise that this study is an exploration of the effects of intra-condition count variation on the detection of differentially expressed transcripts (largely independent of the reference set used), and why the identification of the genes commonly over- or under-expressed within dogs and tame foxes provides “tentative” support for previously identified genes, but not definitive proof. At the end of our concluding paragraph we have added the following sentences to highlight this (line 730):

“We use the word tentative to describe our support as the primary aim of this study was to investigate the effects of intra-condition count variation on the detection of differentially expressed transcripts, and the identification of genes involved within an evolutionary process, such as domestication, should be supported by datasets specifically generated for that purpose, and confirmed relative to the different reference transcriptomes involved. The quality of such transcriptomes in turn, in relation to chimeras, missing transcripts and partial redundancies, must also be carefully explored.”

My recommendation would be to split the work into two papers, one comparing the wild and tame animals, which to me was the most interesting part of the manuscript, and one assessing TVscript. It appears to me from comparing Tables 1 and 2 in the paper (filtered transcriptome) with supplementary table 6 (unfiltered transcriptome), that the filtering did not make a very big difference here. Hence the first paper could use the results from the more standard methods of supp. Table 6 and the interesting overlaps between the changes on domestication in the two pairs of animals would still be largely maintained. The second paper really would need to use different test data sets, as suggested above, to establish whether TVscript is genuinely increasing sensitivity without introducing type I errors.

The extensive testing on simulated data indicates that the (small) difference TVscript makes is consistent across iterations involving varying levels of introduced intra-condition variation. It is this point that we are trying to highlight, i.e. that the identification of differentially expressed transcripts should be presented within the context of intra-condition count variation, and the alterations observed as a result of removing hypervariable transcripts can be important, independent of the normalization approach used by the utilized differential expression tool. Often it is one or a few such transcripts that paper conclusions are based on. In relation to this we have added the following text to the concluding paragraph of the discussion (line 718):

“We propose that studies using RNA-seq data at an inter, or intra, study level should determine whether or not transcripts identified as being differentially expressed, using pre-filtered reference sets, are still identified once filtering based on intra-condition count variation has been performed; regardless of the differential expression software used (or the method of obtaining initial counts). Discussion of such transcripts can then be taken into the context of the ambiguity observed. Such context is likely going to be dataset specific, as indicated by differences between our case studies, as the extent of intra-condition count variation will differ between datasets and will rarely be known prior to analysis.”

More generally, this study has been ongoing for some time and the reviewer has highlighted something that we have struggled with since the start. Initially, the TVscript manuscript was going to be an application note, focussed mainly on testing; then it became more about a literature review/domestication study using published data, and at one point we were attempting to generate our own dog and wolf transcriptomic data to include, but this got set aside for various reasons. The eventual manuscript that emerged was (we feel) an interesting hybrid, but more focused on the software. As it stands we have come up with a relevant piece of work, especially following responses to both reviewers' comments, that encompasses a reasonable compromise that should be of implicit interest within the field of transcriptomics. We feel that the manuscript adds relevant information to an important topic that should be considered more widely when discussing identified differentially expressed transcripts.

I would like to thank the authors for preparing the manuscript carefully, providing detailed results and supplementary material, and providing access to the code of TVScript along with links to other useful material on sourceforge.

Once again we thank the reviewer for their time and effort. We are glad that they liked the presentation of our work, and more generally, our background resources. We plan to produce more. The comments provided here have added greatly to our manuscript.

Attachment

Submitted filename: response_to_reviewers.docx

Decision Letter 1

Katherine James

10 Aug 2022

PONE-D-21-32424R1

On taming the effect of transcript level intra-condition count variation during differential expression analysis: a story of dogs, foxes and wolves

PLOS ONE

Dear Dr. Archer,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Both reviewers are overall very happy with your revised version. However, they both have some minor comments that require final clarification. I don't anticipate these points will take too much time to address and look forward to reading the final version.

Please submit your revised manuscript by Sep 24 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Katherine James, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I commend the authors for the comprehensive amendments and explanations. I think the article is in great shape. The provision of scripts and data on zenodo is appreciated. Please consider the following points as optional suggestions.

1. This new simulation is a welcome addition that supports the perceived need for a tool like TVscript. In figure 2, I would recommend putting the legend for light and dark grey boxes on the plot itself.

2. OK

3. OK

4. OK

5. OK, but the passage on line 723 should be written in a clearer, more straightforward way.

6. OK.

7. OK, but "high r2 correlation values" should be qualified with a specific range (eg: 85-98%), so that the reader can understand what "high" means.

8. OK

9. OK.

10. OK

Typo: "we also had an interested in understanding whether"

Reviewer #2: I am grateful to Lobo et al for their efforts in addressing my criticisms of the first version of their manuscript.

1 Appropriateness of data sets used to assessing TVscript: I think that by using extensive simulated data the assessment of the behaviour of TVscript is much improved.

2 more detail has been given on how the DE analysis was performed as requested. I am not totally convinced by the lengthy discussion of unknown effects (though there is no reason to remove it); clearly there are always unknown factors, but that does not affect the apparent effect of batch in the PCA plots. But in any case, the important test, that including batch in the statistical model for DE in the dogs, has been done and the results (fig S6) seem to confirm that it does not have a very large effect

3. I thank the authors for checking the mapping rate of the fox data to the fox transcriptome (and, incidentally, by mapping with kallisto as well). Even though it turned out not to affect the mapping rate very much, I feel this was an important check to perform.

4. I appreciate that factors outside the authors’ control can affect the way that a project ends up being carried out and written up. Although my suggestion was to divide the manuscript, I do not insist on it, I am content with the manuscript’s current form.

The criticisms that I made in my first review, at least, are allayed. However the first reviewer raised other serious points particularly point 5 about filtering where the sample group information is used (see Bourgon, Gentleman, and Huber 2010). I did not spot this in my own review and it is for the first reviewer to assess whether the additional simulations have addressed this satisfactorily. I did notice that in the discussion of the matter in the DESeq2 vignette (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#independent-filtering-and-multiple-testing) a histogram of the p-value of the filtered genes is provided, showing that it is approximately uniform. I would tentatively suggest (again, I defer to the first reviewer here) that, done for transcripts rather than genes, a p-value histogram could provide an empirical way of demonstrating that the filtering is independent of the test statistic under the null hypothesis, if required.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Sep 22;17(9):e0274591. doi: 10.1371/journal.pone.0274591.r004

Author response to Decision Letter 1


15 Aug 2022

Reviewer #1

“I commend the authors for the comprehensive amendments and explanations. I think the article is in great shape. The provision of scripts and data on zenodo is appreciated. Please consider the following points as optional suggestions.”

We thank the reviewer for these minor revisions and we have adjusted the manuscript accordingly. The time and effort that the reviewer has spent on this, and on the previous, round of review has been invaluable for our manuscript.

“1. This new simulation is a welcome addition that supports the perceived need for a tool like TVscript. In figure 2, I would recommend putting the legend for light and dark grey boxes on the plot itself.”

We have now added a key to figure 2 explaining the light and dark grey boxes. This has also been added to the similar figure S5.

“2. OK

3. OK

4. OK

5. OK, but the passage on line 723 should be written in a clearer, more straightforward way.”

We have reworded the three lines following line 723.

“6. OK.

7. OK, but "high r2 correlation values" should be qualified with a specific range (eg: 85-98%), so that the reader can understand what "high" means.”

We have now added the range to the text (on line 405) by saying: “R2 values ranged between 0.8546 and 0.9944.”

“8. OK

9. OK.

10. OK

Typo: "we also had an interested in understanding whether"”

Corrected

Reviewer #2

“I am grateful to Lobo et al for their efforts in addressing my criticisms of the first version of their manuscript.”

We are very grateful for the comments provided by this reviewer which, in conjunction with those provided by the first reviewer, have greatly clarified and improved our work.

“1 Appropriateness of data sets used to assessing TVscript: I think that by using extensive simulated data the assessment of the behaviour of TVscript is much improved.”

Agreed. The simulations allowed us to test our approach within a highly controlled framework prior to application on the real data.

“2 more detail has been given on how the DE analysis was performed as requested. I am not totally convinced by the lengthy discussion of unknown effects (though there is no reason to remove it); clearly there are always unknown factors, but that does not affect the apparent effect of batch in the PCA plots. But in any case, the important test, that including batch in the statistical model for DE in the dogs, has been done and the results (fig S6) seem to confirm that it does not have a very large effect”

Testing for batch effects was an important addition that we had not initially included. We feel that its inclusion, along with the results based on simulations (where batch effects are less relevant), has greatly improved the completeness of our results.

“3. I thank the authors for checking the mapping rate of the fox data to the fox transcriptome (and, incidentally, by mapping with kallisto as well). Even though it turned out not to affect the mapping rate very much, I feel this was an important check to perform.”

This was an important check that we should have provided in the first version in order to put readers' minds at rest regarding the question raised by the reviewer. We feel that it is an interesting point and we are pleased to have made this addition at the reviewer’s suggestion.

“4. I appreciate that factors outside the authors’ control can affect the way that a project ends up being carried out and written up. Although my suggestion was to divide the manuscript, I do not insist on it, I am content with the manuscript’s current form. The criticisms that I made in my first review, at least, are allayed.”

Great, we completely understood why the reviewer was thinking about two separate papers here, and we will still aim to produce a future paper with our own generated RNA-seq datasets.

“However the first reviewer raised other serious points particularly point 5 about filtering where the sample group information is used (see Bourgon, Gentleman, and Huber 2010). I did not spot this in my own review and it is for the first reviewer to assess whether the additional simulations have addressed this satisfactorily. I did notice that in the discussion of the matter in the DESeq2 vignette (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#independent-filtering-and-multiple-testing) a histogram of the p-value of the filtered genes is provided, showing that it is approximately uniform. I would tentatively suggest (again, I defer to the first reviewer here) that, done for transcripts rather than genes, a p-value histogram could provide an empirical way of demonstrating that the filtering is independent of the test statistic under the null hypothesis, if required.”

Reviewer 1 was happy with our response to their point 5 and did not require any further clarification. Given that reviewer 2 has stated that “… I defer to the first reviewer here …” we feel that point 5 has been sufficiently discussed within the manuscript for the scope of our study.

Attachment

Submitted filename: response_to_reviewers.docx

Decision Letter 2

Katherine James

1 Sep 2022

On taming the effect of transcript level intra-condition count variation during differential expression analysis: a story of dogs, foxes and wolves

PONE-D-21-32424R2

Dear Dr. Archer,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Katherine James, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Katherine James

12 Sep 2022

PONE-D-21-32424R2

On taming the effect of transcript level intra-condition count variation during differential expression analysis: a story of dogs, foxes and wolves

Dear Dr. Archer:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Katherine James

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Alignment rates obtained using Bowtie2.

    Mapping success rates (%) resulting from the alignment of the 44 samples used in this study to the complete dog transcriptome. For each sample, the percentage of aligned reads is represented by the blue bars, while the percentage of reads failing to map is represented in red (the number of raw reads is available in S1 Table).

    (TIF)

    S2 Fig. Correlation between per-transcript counts obtained following Bowtie2 mapping and count estimates obtained using kallisto.

    R2 values describing the linear correlation between each count dataset produced from the mapped datasets presented in S1 Fig and the corresponding count estimates produced when pseudo-mapping the same RNA-Seq data to the complete dog transcriptome using kallisto.

    (TIF)

    S3 Fig. Re-mapping fox data to the fox reference transcriptome.

    Read mapping rates achieved when mapping the fox RNA-Seq datasets to the fox reference transcriptome.

    (TIF)

    S4 Fig. Transcripts identified by DESeq2 as being over expressed in the absence of randomly introduced intra-condition variation.

    Across one hundred iterations, the dots represent the number of transcripts identified as being over expressed between condition A and B. Each condition contained five replicates. (A) The one hundred transcripts selected for read over representation within replicates of condition B were kept constant and (B) the one hundred transcripts selected for read over representation within replicates of condition B were re-selected during each iteration. During each iteration, the ten simulated count datasets each reflected even transcript coverage of 3 million read pairs, with the exception of the one hundred transcripts selected for over representation in condition B, whose count values were increased by a factor of two.

    (TIF)
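The simulation design described in the S4 Fig caption can be sketched as follows. This is a minimal illustration and not the authors' simulation code: the function name `simulate_counts`, the transcript identifiers, and the modelling of "even coverage" as an identical base count per transcript (ignoring transcript length) are all assumptions made for the example.

```python
import random

def simulate_counts(transcript_ids, n_selected=100, reps_per_condition=5,
                    total_read_pairs=3_000_000, fold=2, seed=0):
    """Return (counts, selected).

    counts maps a replicate name (A1..A5, B1..B5) to {transcript: count};
    selected is the set of transcripts over represented in condition B.
    """
    transcript_ids = list(transcript_ids)
    rng = random.Random(seed)
    selected = set(rng.sample(transcript_ids, n_selected))
    # Assumption: even coverage means the same base count for every transcript.
    base = total_read_pairs // len(transcript_ids)
    counts = {}
    for condition in ("A", "B"):
        for rep in range(1, reps_per_condition + 1):
            counts[f"{condition}{rep}"] = {
                # Only the selected transcripts in condition B are scaled up.
                t: base * fold if (condition == "B" and t in selected) else base
                for t in transcript_ids
            }
    return counts, selected

# Hypothetical reference of 1,000 transcripts.
ids = [f"tx{i}" for i in range(1000)]
counts, selected = simulate_counts(ids)
```

Passing a different `seed` on each iteration would correspond to re-selecting the over represented transcripts (panel B), while reusing one seed keeps them constant (panel A).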

    S5 Fig. Over expressed transcripts pre- and post-filtering (transcripts selected for count over representation were re-selected during each iteration).

    The number of transcripts identified by DESeq2 as being over expressed both prior to (light gray) and post (dark gray) filtering within each of the one hundred iterations performed at each level of introduced random intra-condition count variation. Each iteration involved initially simulating ten count datasets divided into conditions A and B, following which DESeq2 was run to attempt to identify the one hundred transcripts selected for over representation as described in the methods. Following this, the ten simulated datasets were filtered using TVscript with a 95th percentile threshold in order to generate corresponding filtered datasets (divided into corresponding conditions A’ and B’) on which DESeq2 was re-run.

    (TIF)
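The idea of filtering at a percentile threshold of intra-condition variation can be illustrated with a short sketch. It is not TVscript's implementation: the use of the coefficient of variation as the variation measure, the nearest-rank percentile, the choice to score each transcript by its largest within-condition value, and all names below are assumptions made for the example.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean if mean > 0 else 0.0

def filter_hypervariable(counts_by_condition, percentile=95):
    """Return the set of transcripts whose variation does not exceed the cutoff.

    counts_by_condition: {condition: {transcript: [replicate counts]}}
    """
    transcripts = list(next(iter(counts_by_condition.values())))
    # Assumption: a transcript is scored by its largest within-condition variation.
    score = {
        t: max(coefficient_of_variation(cond[t])
               for cond in counts_by_condition.values())
        for t in transcripts
    }
    ranked = sorted(score.values())
    # Nearest-rank percentile, computed with integer arithmetic.
    rank = max(1, (percentile * len(ranked) + 99) // 100)
    cutoff = ranked[rank - 1]
    # Transcripts strictly above the cutoff are removed.
    return {t for t in transcripts if score[t] <= cutoff}

# Hypothetical data: variation in condition A grows with the transcript index.
example = {
    "A": {f"tx{i}": [100 - i, 100, 100 + i] for i in range(20)},
    "B": {f"tx{i}": [100, 100, 100] for i in range(20)},
}
kept = filter_hypervariable(example, percentile=95)
```

In this toy dataset only the most variable transcript (`tx19`) lies above the 95th percentile cutoff, so it is the only one dropped before differential expression analysis would be re-run.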

    S6 Fig. Confirmation of shared genes within differential expression analysis taking tissue effects into account.

    The upper dark grey circle contains the nine genes identified as being commonly over or under expressed simultaneously within dogs and tame foxes using filter levels at the 95th and 97th percentiles whilst only accounting for condition (wolves vs. dogs and aggressive vs. tame fox). Six of these genes (RGR, CHRNA5, MYO7A, TRIB2, STMND1 and OASL) are present when DESeq2 is run whilst also accounting for differences in tissue (light grey left oval). SQLE, ARHGAP25 and ITGA7 are observed only within the differentially expressed transcript list that is based solely on condition (dark grey right oval).

    (TIF)

    S7 Fig. Distribution of dispersion estimates.

    Plots of dispersion estimates in relation to the mean of normalized counts for both case studies, wolves and dogs (left panels), and tame and aggressive foxes (right panels). Estimates were calculated using DESeq2 for the non-filtered (NF) and all filtered datasets (99th, 95th and 90th are shown as an example). Gray dots represent the gene-wise maximum likelihood estimates (MLE), the red curve shows the fit to the MLEs, and blue dots identify the final maximum a posteriori (MAP) estimates of dispersion. Red dots represent the outliers detected by DESeq2. Both the x- and y-axes are shown on a logarithmic scale.

    (TIF)

    S1 Table. Dataset description.

    Full details of all datasets, including the location of the relative tissue, age, and sex of each individual, replicate information and sequencing details (FC–frontal cortex; CC–cerebral cortex; PFC–prefrontal cortex; FL–frontal lobe; NS–not specified; F–female; M–male; AD–adult; ya–years old; PE–paired-end; SE–single end).

    (DOCX)

    S2 Table. Common over expressed transcripts pre- and post-filtering (when transcripts selected for count over representation are kept constant).

    The number of transcripts from the dog reference set that are commonly identified by DESeq2 as being over expressed within condition B both prior to and post filtering for each of the one hundred iterations performed at each level of introduced random intra-condition count variation. Each iteration involved simulating ten count datasets divided into conditions A and B, following which DESeq2 was run to attempt to identify the one hundred transcripts selected for over representation as described in the methods section. Filtering involved running TVscript with a 95th percentile threshold on the non-filtered datasets to generate corresponding filtered datasets (divided into corresponding conditions A’ and B’), following which DESeq2 was re-run and the results compared back to those obtained for the non-filtered data.

    (DOCX)

    S3 Table. Common over expressed transcripts pre- and post-filtering (when transcripts selected for count over representation are re-selected during each iteration).

    Same as S2 Table but where the one hundred transcripts selected for over representation within condition B are re-selected during each iteration.

    (DOCX)

    S4 Table. Ratio between the common number of over expressed transcripts pre- and post-filtering and the maximum number detected when transcripts selected for count over representation are kept constant.

    Numbers in S2 Table were divided by the maximum number of over expressed transcripts detected within each corresponding iteration, i.e. the maximum number detected using corresponding non-filtered and filtered datasets.

    (DOCX)
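The ratio described for S4 Table (and, for re-selected transcripts, S5 Table) can be written as a small helper. This is a sketch of the arithmetic stated in the caption only; the function name `overlap_ratio` and the example transcript labels are hypothetical.

```python
def overlap_ratio(de_nonfiltered, de_filtered):
    """Common over expressed transcripts divided by the larger of the two list sizes."""
    nonfiltered = set(de_nonfiltered)
    filtered = set(de_filtered)
    common = len(nonfiltered & filtered)
    largest = max(len(nonfiltered), len(filtered))
    # An iteration with no detections at all is reported as 0.0 (an assumption).
    return common / largest if largest else 0.0

# Hypothetical iteration: 4 transcripts detected pre-filtering, 5 post-filtering, 3 shared.
ratio = overlap_ratio({"a", "b", "c", "d"}, {"b", "c", "d", "e", "f"})
```

A ratio of 1 means both runs detected exactly the same transcripts, and lower values indicate greater discordance between the non-filtered and filtered results.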

    S5 Table. Ratio between the common number of over expressed transcripts pre- and post-filtering and the maximum number detected when transcripts selected for count over representation are re-selected during each iteration.

    Numbers in S3 Table were divided by the maximum number of over expressed transcripts detected within each corresponding iteration, i.e. the maximum number detected using corresponding non-filtered and filtered datasets.

    (DOCX)

    S6 Table. Removal of intra-condition variation.

    Number of transcripts kept and removed from the reference in each case study, wolves and dogs, and aggressive and tame foxes, across the filter levels used (from the 99th to the 70th percentile). The first ten percentiles were explored in greater detail in steps of one, while the remaining percentiles were explored in steps of five.

    (DOCX)

    S7 Table. Differentially expressed transcripts in wolf vs. dog.

    Complete list of differentially expressed transcripts in dogs when compared to wolves, identified using non-filtered datasets, and those that were removed (red) within the highest 10% of intra-condition variation, as well as those added (green) as differentially expressed across selected filtered datasets (97th, 95th, and 90th percentiles). The corresponding annotated gene ID, log2FC values and p-values are provided.

    (DOCX)

    S8 Table. Differentially expressed transcripts in aggressive vs. tame fox.

    Complete list of differentially expressed transcripts in tame foxes when compared to aggressive foxes, identified using non-filtered datasets, and those that were removed (red) within the highest 10% of intra-condition variation, as well as those added (green) as differentially expressed across selected filtered datasets (97th, 95th, and 90th percentiles). The corresponding annotated gene ID, log2FC values and p-values are provided.

    (DOCX)

    S9 Table. Correlation and outliers.

    Correlation values (r2) and the root mean square error (RMSE) from the regression analysis between the final dispersion estimates and the mean of normalized counts for both case studies, wolves and dogs, and aggressive and tame foxes. The number of outliers identified by DESeq2 is also presented. Values are shown for the non-filtered (NF) and all the filtered datasets used in differential expression analysis.

    (DOCX)

    S10 Table. Shared genes and gene families between non-filtered datasets.

    List of the gene families, and shared genes, that were commonly regulated in dogs and tame foxes, using the non-filtered datasets. The number, and name, of the genes within each gene family are provided, with the corresponding log2 fold-change values in brackets for each species. Within each family, single genes were characterized as shared between dogs and tame foxes, or as exclusive to each of the two groups. When more than one transcript for a specific gene was present, all the log2FC values are reported.

    (DOCX)

    Attachment

    Submitted filename: review_PONE-D-21-32424.docx

    Attachment

    Submitted filename: response_to_reviewers.docx

    Attachment

    Submitted filename: response_to_reviewers.docx

    Data Availability Statement

    All data is publicly available on NCBI (https://www.ncbi.nlm.nih.gov) under the project accession numbers: PRJEB3197 (runs: ERR266355, ERR266386, ERR266395, ERR266403, ERR266382, ERR266407, ERR266371, ERR266359, ERR266374, ERR266366 and ERR266400), PRJEB4668 (run: ERR351173), PRJNA185055 (runs: SRR636937 and SRR636938), PRJNA78827 (runs: SRR388737, SRR388740, SRR388766, SRR543733, SRR536881 and SRR536883) and PRJNA307604 (runs: SRR3084300, SRR3084299, SRR3084298, SRR3084297, SRR3084296, SRR3084295, SRR3084294, SRR3084293, SRR3084292, SRR3084291, SRR3084290, SRR3084289, SRR3084312, SRR3084311, SRR3084310, SRR3084309, SRR3084308, SRR3084307, SRR3084306, SRR3084305, SRR3084304, SRR3084303, SRR3084302 and SRR3084301). Further details are available in S1 Table including: Species, Publication detail (Study), Sample IDs, Project accession and Run accession.


    Articles from PLoS ONE are provided here courtesy of PLOS