Physiological Genomics. 2013 Feb 19;45(8):301–311. doi: 10.1152/physiolgenomics.00128.2012

Characterization of the rat developmental liver transcriptome

Richard H Chapple 1, Polyana C Tizioto 1,2, Kevin D Wells 1, Scott A Givan 3, JaeWoo Kim 1, Stephanie D McKay 1,4, Robert D Schnabel 1, Jeremy F Taylor 1
PMCID: PMC3633428  PMID: 23429212

Abstract

Gene regulation and transcriptome studies have been enabled by the development of RNA-Seq applications for high-throughput sequencing platforms. Next generation sequencing is remarkably efficient and avoids many issues inherent in hybridization-based microarray methodologies including the exon-specific dependence of probe design. Biologically relevant transcripts including messenger and regulatory RNAs may now be quantified and annotated regardless of whether they have previously been observed. We used RNA-Seq to investigate global patterns of gene expression in early developing rat liver. Liver samples from timed-pregnant Lewis rats were collected at six fetal and neonatal stages [embryonic day (E)14, E16, E18, E20, postnatal day (P)1, P7], transcripts were sequenced using an Illumina HiSeq 2000, and data analysis was performed with the Tuxedo software suite. Genes and isoforms differing in abundance were queried for enrichment within functionally related gene groups using the Functional Annotation Tool of the DAVID Bioinformatics Database. While hematopoietic gene expression is initiated by E14, hepatocyte maturation is a gradual process involving clusters of genes responsible for response to nutrients and enzymes responsible for glycolysis and fatty acid catabolism. Following birth, a large cluster of differentially abundant genes was enriched for mitochondrial gene expression and cholesterol synthesis indicating that by 1 wk of age, the liver is engaged in lipid sensing and bile production. Clustering results for differentially abundant genes and isoforms were similar with the greatest difference for the E14/E16 comparison. Finally, a bioinformatic approach was used to annotate 1,307 novel liver transcripts assembled from sequences that aligned to intergenic regions of the rat genome.

Keywords: RNA-Seq, rat, liver, transcriptome, development


The liver, the largest internal organ, is organized as a structurally diverse group of specialized tissues that are collectively involved in a vast number of essential biochemical functions that can broadly be characterized as metabolic, exocrine, and endocrine in nature. The liver contributes to functions including, but not limited to, detoxification, protein anabolism, nutrient metabolism, and bile production. The two major cell types in the liver are hepatocytes, which comprise ∼70–80% of the total liver mass and are responsible for metabolic functions (19), and cholangiocytes (biliary epithelial cells), which are a specialized group of epithelial cells that comprise the bile ducts. The liver is distinctly organized into hexagonal lobules comprising plates of hepatocytes that are divided by sinusoidal blood vessels that drain to the central efferent vein. Each corner of the liver lobule contains the portal vein, hepatic artery, and bile duct, and this group of vessels is commonly known as the portal triad (11). The hepatic artery supplies oxygenated blood, and the portal vein transports venous blood from other endoderm-derived organs. The precise cellular architecture of the liver is crucial to its function, and the blueprint for its architectural framework is established during embryonic and postnatal development (18).

Hepatogenesis, the process of liver formation, manifests during development as a set of congruous tissue intercommunications, many of which exhibit evidence of evolutionary conservation (17, 19). The events that mark certain stages of liver development are primarily governed by transcriptional regulation. While much is known about the key involved factors, it is becoming increasingly apparent that the repertoire of transcripts present in a cell is immensely vast (10, 16). Next-generation sequencing has facilitated the investigation of entire mammalian transcriptomes (the comprehensive set of transcripts within a biological sample) at an unprecedented resolution (7, 8). The major goals of transcriptome research are to accurately classify and quantify the diverse repertoire of transcripts within tissues exposed to a biological treatment at a developmental time-point or disease state and elucidate the functional structure and regulation of genes (15). To examine further the extent of transcriptional control of the events that dictate hepatogenesis, we performed a time-course exploration of the behavior of the rat liver transcriptome. Our results indicate that there remain a number of transcripts that are incompletely annotated in this model species and that patterns of gene-specific expression change substantially during development.

MATERIALS AND METHODS

Animals and sampling.

Procedures for animal handling and tissue sampling were conducted in compliance with protocol #6640, which was approved by the University of Missouri Animal Care and Use Committee. Timed-pregnant inbred Lewis rats (LEW/Crl), purchased from Charles River Laboratories International (Wilmington, MA) were housed at the University of Missouri-Columbia Unit B Facility. Females were killed at specific gestational stages [embryonic days (E)14, 16, 18, and 20]. Pregnant reproductive tracts were removed and placed in ice cold, buffered Dulbecco's modified Eagle's medium (Sigma Aldrich, St. Louis, MO). The buffering solution (pH 7.35) consisted of 37 mM PIPES, 150 mM Tris, and 113 mM BES and was added at 20% of the final volume of the dissection media. Whole embryonic liver samples were carefully removed from each fetus under a microdissection microscope and immediately flash-frozen in liquid nitrogen. Whole postnatal liver samples were harvested from each pup at days 1 or 7 postparturition on a sterile working surface and flash-frozen.

RNA isolation.

Two biological replicates (fetus or pup) were chosen for total RNA extraction from members of the same litter at each of the six developmental time-points (n = 12). One milliliter of TRIzol reagent (Invitrogen, Carlsbad, CA) was added to each frozen liver sample, which was immediately homogenized. Samples were centrifuged at 12,000 g for 10 min at 4°C, and we added 200 μl of chloroform after transferring the aqueous layer to a fresh tube. After another centrifugation at 12,000 g for 10 min at 4°C, RNA from the aqueous layer was precipitated first with 500 μl of isopropanol and then washed with 1 ml of 75% ethanol. The pellet was dissolved in 100 μl DEPC water and placed at 4°C overnight. The following day, the RNA sample was DNase-treated to remove potential contamination from genomic DNA. RNA purity and concentration were established with a NanoDrop 1000 v1.3.2 (Thermo Scientific, Wilmington, DE). The absence of RNA degradation was first assessed by electrophoresis of 1 μg of RNA on a 1.0% agarose gel. Finally, the quality of each RNA sample was evaluated using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA) with an RNA Nano Chip.

RNA sequencing.

Preparation of the mRNA sample for RNA-Seq analysis was performed using the TruSeq RNA Sample Preparation Kit (Illumina, San Diego, CA). Briefly, sample prep began with 10 μg of total RNA from each liver sample. Polyadenylated RNA was purified from the sample with oligo dT magnetic beads, and the poly(A) RNA was fragmented with divalent cations under elevated temperature. First-strand cDNA synthesis produced single-stranded DNA copies from the fragmented RNA by reverse transcription. After second-strand cDNA synthesis, the double-stranded DNA underwent end repair, and the 3′ ends were adenylated. Finally, universal adapters were ligated to the cDNA fragments, and PCR was performed to produce the final sequencing library. These template molecules were used for cluster generation and sequencing on the Illumina HiSeq 2000 (Illumina, San Diego, CA) instrument. One sample per lane was used to generate 100 bp single end reads.

Processing of sequence reads.

The default Illumina analysis pipeline filters reads for quality based upon “chastity” scores, which identify adjacent clusters that were so close that the imaging software could not independently assign base incorporation signals. The chastity score for each base call is the ratio of the highest of the four possible base signal intensities to the sum of the highest two base signal intensities. Reads with two or more chastity scores <0.6 within the first 25 sequenced bases were removed from the analysis.
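The chastity filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Illumina pipeline itself: the function names, the representation of each sequencing cycle as a tuple of four base intensities, and the treatment of a cycle with no signal are all assumptions made for the example.

```python
def chastity(intensities):
    """Ratio of the brightest base signal to the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    # Assumption: a cycle with no signal at all is scored as failing.
    return top / total if total > 0 else 0.0


def passes_chastity_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Keep a read unless two or more of its first `window` cycles
    have a chastity score below `threshold`."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


# Hypothetical intensities for one well-resolved and one mixed cluster.
clean = (100.0, 50.0, 10.0, 5.0)   # 100 / 150, about 0.67
mixed = (60.0, 50.0, 1.0, 1.0)     # 60 / 110, about 0.55

read_ok = [clean] * 100
read_bad = [mixed, mixed] + [clean] * 98
print(passes_chastity_filter(read_ok), passes_chastity_filter(read_bad))
```

Only the first 25 cycles are inspected, so low-chastity calls later in a read do not cause it to be discarded.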

We made no attempt to remove duplicated sequences produced as PCR artifacts. In the absence of paired-end sequence information, we were unable to determine if a duplicated read was actually a PCR artifact or a valid sequence corresponding to an independently sampled fragment.

Finally, the issue of adapter sequence contamination was addressed with a custom Perl script that trimmed 3′ adapter sequence contamination by exact string matching to a user-supplied adapter sequence. We supplied the first 12 base pairs of the universal Illumina adapter sequence and added one A to the 5′ adapter-end to account for the adenylation of the fragment prior to adapter ligation. The script first searched each read for a perfect match to the complete adapter sequence and, if found, retained the sequence 5′ to the predicted adapter contamination. If a match was not found, the 3′ end of the read was iteratively searched for motifs in which one 3′ base was sequentially removed from the supplied adapter sequence. If the motif was found, the 5′ upstream sequence was retained. If the motif was not found, the read was not trimmed. This approach unnecessarily trims a few bases from the 3′ end of reads that match the first few bases of the adapter (e.g., all reads ending in A will be trimmed); however, this strategy supplies data that are devoid of adapter fragments. The script does not remove reads from the FASTQ file, which is particularly useful in the case of paired-end data, because the concordance of pairs of reads is unaltered in both files.
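The trimming logic described above can be expressed compactly. The sketch below is a Python re-statement of the three steps (full match, iteratively shortened 3′ match, no trimming) and is not the Perl script used in the study. The adapter string is an illustrative assumption, since the sequence itself is not listed here, and the function name is hypothetical.

```python
# Assumption: an illustrative 12-base adapter prefix with one leading A
# to mimic the adenylation step; not taken from the source.
EXAMPLE_ADAPTER = "A" + "GATCGGAAGAGC"


def trim_adapter(read, adapter):
    """Return the read with 3' adapter sequence removed by exact matching."""
    # Step 1: perfect match to the complete adapter anywhere in the read;
    # keep only the sequence 5' of the match.
    hit = read.find(adapter)
    if hit != -1:
        return read[:hit]
    # Step 2: drop one 3' base of the adapter at a time and test whether
    # the read ends with the remaining adapter prefix.
    for k in range(len(adapter) - 1, 0, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # Step 3: no match, so the read is returned unchanged.
    return read


print(trim_adapter("ACGTACGT" + EXAMPLE_ADAPTER + "TTTT", EXAMPLE_ADAPTER))
print(trim_adapter("ACGTACGTAGAT", EXAMPLE_ADAPTER))
print(trim_adapter("CCCCA", EXAMPLE_ADAPTER))
```

The last call shows the over-trimming noted in the text: a read that merely ends in A loses that base. In a real FASTQ workflow the quality string would need to be shortened to the same length, and a fully trimmed read would be kept as an empty record so that read order is preserved.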

Read alignment.

Using Tophat v1.3.2 (12), we aligned reads to the Rattus norvegicus reference genome (rn4.fa). Along with the most current genome build, a preindexed reference that is used by Bowtie in its implementation of the Burrows-Wheeler transform was also downloaded from the UCSC database. Tophat first used Bowtie v0.12.7.0 (5) for an initial alignment to the genome, allowing for two mismatches and 20 nonunique genomic alignments. Reads that failed to align to the genome in the first iteration were set aside, and a database of putative exons and potential splice junctions was generated from the aligned reads. The reads that failed to align to the genome were then aligned to exons within this database, and a seed-and-extend method was used to map reads that may span an intron junction.

Transcript assembly and quantification.

Cufflinks v2.0.2 (13) was used to process the Tophat alignments. Cufflinks treated regions of the genome that were covered by sequence reads as potential exons and used the mapped junctions to assemble transcripts. Additionally, the abundance of assembled transcripts was estimated and reported as fragments per kilobase of exon per million fragments mapped (FPKM), and confidence intervals were estimated for each FPKM. This metric allows the comparison of each gene's transcript abundance across treatments by normalizing abundance in each treatment for the library's sequencing depth. The supplied reference annotation was used by Cufflinks to guide the reference annotation-based transcript assembly. The output included all reference transcripts as well as any novel assembled genes and isoforms. Consequently, novel discoveries including new splice sites, transcriptional start sites, and functional regions of the genome may be identified in the analysis. Transcript visualization was performed by uploading each sample's “transcripts.gtf” file as a custom track in the UCSC Genome Browser.
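For orientation, the arithmetic behind the FPKM unit can be written as FPKM = 10^9 × C / (N × L), where C is the number of fragments assigned to a transcript, N is the total number of mapped fragments in the library, and L is the transcript length in base pairs. The Python sketch below illustrates only this definition; Cufflinks derives its estimates from a statistical model of fragment assignment, and the function name and example counts here are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return 1e9 * fragments / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library of 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))

# Doubling the sequencing depth (and the fragment count with it)
# leaves the value unchanged, which is the point of the normalization.
print(fpkm(1000, 2000, 20_000_000))
```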

Testing for differential expression.

The assembled Cufflinks transcripts were separately processed with both the Cuffmerge and Cuffcompare analytical strategies. The minimum set of transcripts identified by Cufflinks as best describing the aligned reads in each sample was first processed with Cuffcompare to classify the transcripts based upon existing annotation. Cuffcompare used the output from Cufflinks (transcripts.gtf) and an annotation file (rn4.gtf) to match transcripts to existing gene models and label each transcript as known or novel using letter codes to describe the details of each match (e.g., complete match to a known gene, partial overlap with a known gene, etc.). Cuffmerge was also used to merge the assemblies produced by Cufflinks for all samples. Cuffmerge also runs Cuffcompare to annotate the merged assembly and filters a number of transfrags that are probably artifacts. The annotation file available from UCSC (rn4.gtf) was provided to enable the merging of assembled contigs into novel and known isoforms and to maximize the overall quality of the assembly. Finally, Cuffdiff was used to re-estimate the abundance of transcripts annotated by Cuffmerge using the original Tophat alignment files. The resulting abundance estimates were used to test for differential transcript expression. Assembled transcripts produced by Cuffmerge were first grouped by gene and tested for differential expression (DE). Transcripts were then grouped by transcriptional start site to test for DE of primary transcripts. For each primary transcript, DE of each alternately spliced isoform was also tested. Genes that were found to be DE between any of the developmental time-points were used to construct a multidimensional scaling (MDS) plot to investigate the relationships between samples and the within and between developmental time-point expression variances. Only comparisons between adjacent developmental time-points are reported (e.g., E14 vs. E16, E16 vs. E18).
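The restriction to adjacent developmental time-points yields five comparisons from the six stages. A minimal Python sketch of how such a comparison list could be built is shown below; the variable and function names are illustrative and are not part of the Tuxedo suite.

```python
STAGES = ["E14", "E16", "E18", "E20", "P1", "P7"]


def adjacent_comparisons(stages):
    """Pair each stage with the one that immediately follows it."""
    return list(zip(stages, stages[1:]))


comparisons = adjacent_comparisons(STAGES)
for earlier, later in comparisons:
    print(f"{earlier}/{later}")
```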

Biological insight with the database for annotation, visualization, and integrated discovery.

The database for annotation, visualization, and integrated discovery (DAVID) v6.7 (1) was used to interpret the DE data. DAVID is a group of web-based tools that identify enriched biological themes and gene ontology (GO) terms, group functionally related genes, and cluster annotation terms for large gene lists. The functional annotation tool was used to query >40 annotation databases to determine the most relevant GO terms within each list of genes predicted to be differentially expressed between adjacent developmental time-points. The functional annotation clustering algorithm was used to generate a clustered, nonredundant report of related annotation terms, and groups of annotation clusters with EASE scores <0.1 (2) were retained. Finally, DAVID pathway mapping allows a gene list to be superimposed on static pathway maps such as BioCarta and KEGG pathways. Gene lists created from DE, alternate splicing, and alternative promoter usage analyses were each individually submitted to the DAVID functional annotation tool to explore enriched GO terms and pathways represented in the lists.
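As background for the EASE threshold, the DAVID documentation describes the EASE score as a conservative variant of the one-sided Fisher exact test in which one gene is removed from the count of list genes annotated with the term before the P value is computed. The sketch below implements that idea with the Python standard library only; it is an illustration under that assumption and not DAVID's code, and the counts in the example are hypothetical.

```python
from math import comb


def upper_tail(k, list_size, pop_hits, pop_size):
    """P(X >= k) when list_size genes are drawn without replacement from a
    background of pop_size genes, pop_hits of which carry the term."""
    k = max(k, 0)
    upper = min(list_size, pop_hits)
    favourable = sum(
        comb(pop_hits, i) * comb(pop_size - pop_hits, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(pop_size, list_size)


def ease_score(list_hits, list_size, pop_hits, pop_size):
    """Fisher-type enrichment P value after removing one hit from the list."""
    return upper_tail(list_hits - 1, list_size, pop_hits, pop_size)


# Hypothetical counts: 3 of 300 list genes carry a term that annotates
# 40 of 30,000 background genes.
fisher_p = upper_tail(3, 300, 40, 30000)
ease_p = ease_score(3, 300, 40, 30000)
print(f"Fisher exact P = {fisher_p:.4f}, EASE score = {ease_p:.4f}")
```

Because one hit is removed, the EASE score is never smaller than the corresponding Fisher exact P value, which penalizes terms supported by very few genes.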

Investigation of unannotated transcripts.

The Cuffcompare output included a group of transcripts built from reads that represent contiguous sequence when aligned to the genome, but that fell within intergenic or unannotated regions (labeled with the class code “u”). All “u”-labeled transcripts were parsed from the cuffcmp.tracking output file, and single exon transcripts were removed to eliminate potential Cufflinks assembly artifacts. Multiple-exon transcripts were then grouped by locus ID (a unique internal software identifier for a region of the genome), and loci represented by >1 transcript were processed by the gffread utility in Cufflinks to extract the sequence of each transcript in FASTA format. The multiFASTA file for each locus was submitted to CAP3 (3) using an assembly overlap length threshold of 60 bp to assemble contigs and remove redundancy within the transcript sequences. The CAP3 output included all assembled sequences (in the “.contig” file) and all sequences that were not used in the assembly (in the “.singlets” file). To annotate these sequences, the assembled contigs and singlets for each locus were queried against both the Mus musculus and Homo sapiens RefSeq RNA databases by a discontiguous megablast search. The top hit for each locus was retained if it displayed ≥80% homology across ≥70% of the query length.
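The retention rule for BLAST hits can be illustrated with a short Python sketch. The record layout, the choice of bit score to define the top hit, and the subject names are assumptions made for the example; only the two thresholds (≥80% across ≥70% of the query) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str         # hypothetical subject identifier
    pct_identity: float  # percent identity of the alignment
    q_start: int         # 1-based start of the alignment on the query
    q_end: int           # 1-based end of the alignment on the query
    bit_score: float


def query_coverage(hit, query_len):
    """Fraction of the query covered by the alignment."""
    return (abs(hit.q_end - hit.q_start) + 1) / query_len


def retained_top_hit(hits, query_len, min_identity=80.0, min_coverage=0.70):
    """Return the top hit if it passes both thresholds, otherwise None."""
    if not hits:
        return None
    # Assumption: the top hit is the one with the highest bit score.
    top = max(hits, key=lambda h: h.bit_score)
    if top.pct_identity >= min_identity and query_coverage(top, query_len) >= min_coverage:
        return top
    return None


hits = [
    Hit("subject_1", 85.0, 1, 800, 900.0),
    Hit("subject_2", 95.0, 1, 300, 400.0),
]
print(retained_top_hit(hits, query_len=1000))
```

Only the top hit is tested, so a locus whose best-scoring alignment is too short or too divergent remains unannotated even if a lower-ranked hit would have passed.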

RESULTS

Sequencing throughput and read processing.

We generated 12 lanes of 100 bp reads using the Illumina HiSeq 2000 instrument, and each lane represented one biological sample. After removing reads that failed chastity filtering, we obtained a total of 962,605,900 sequence reads with an average of 80,217,158 reads per lane. A total of 464,154,845 reads were trimmed for potential Illumina adapter sequence contamination, which represents 48.2% of the total number of sequence reads, but only 5.6% of the sequenced bases were trimmed (Table 1). The distribution of the number of bases trimmed per read in Fig. 1 demonstrates that the majority of reads had ≤3 bases trimmed. Additionally, in a small number of cases, the entire read was trimmed. We attribute this phenomenon to the case where two adapters ligate together without the presence of an interstitial cDNA sequence.

Table 1.

Sequencing throughput and trimming results for 12 lanes of 100 bp reads generated on the Illumina HiSeq 2000 instrument

| Sample (Replicate) | Total Reads | Trimmed Reads | Untrimmed Bases | Trimmed Bases |
|---|---|---|---|---|
| E14 (1) | 84,615,957 | 42,125,701 | 7,955,574,412 | 558,665,388 |
| E14 (2) | 80,135,005 | 39,673,206 | 7,550,171,359 | 467,856,741 |
| E16 (1) | 82,831,584 | 43,308,751 | 7,710,639,242 | 576,274,658 |
| E16 (2) | 82,433,865 | 42,290,955 | 7,677,583,647 | 574,950,153 |
| E18 (1) | 80,129,498 | 38,812,231 | 7,584,161,692 | 434,186,008 |
| E18 (2) | 73,971,530 | 35,681,311 | 7,008,710,743 | 392,022,657 |
| E20 (1) | 81,749,978 | 40,000,362 | 7,745,580,318 | 448,217,282 |
| E20 (2) | 82,163,230 | 38,813,244 | 7,842,461,324 | 377,343,076 |
| P1 (1) | 85,470,847 | 40,334,092 | 8,194,160,287 | 362,903,113 |
| P1 (2) | 83,593,243 | 37,758,595 | 8,037,276,785 | 329,410,115 |
| P7 (1) | 69,471,853 | 30,383,637 | 6,682,256,159 | 273,446,941 |
| P7 (2) | 76,039,310 | 34,972,760 | 7,287,101,551 | 323,507,549 |
| Total | 962,605,900 | 464,154,845 | 91,275,677,519 | 5,118,783,681 |

E, embryonic day; P, postnatal day.

Fig. 1.

Distribution of the number of adapter sequence trimmed bases for 12 samples each sequenced 1 × 100 bp on 1 lane of an Illumina HiSeq 2000.

Alignment and transcript assembly.

Trimmed reads were aligned to the R. norvegicus genome with Tophat v1.3.2. Without any filtering for base quality scores, 771,961,521 (80.2%) reads were aligned to the genome with at most two mismatches, and individual samples ranged from 54,054,236 to 70,368,354 aligned reads. Cufflinks v2.0.2 assembled 196,200 transcripts from all 771,961,521 aligned reads. From the Cuffcompare output, 8.2% of these transcripts were identified as being a complete match to a known R. norvegicus annotation. However, following the use of Cuffmerge to merge contigs representing the same transcript, 19,916 (22.1%) of the 90,130 merged transcripts were found to be a complete match to a known R. norvegicus annotation (Table 2).

Table 2.

Number of detected transcripts by annotation classification reported by Cuffcompare and Cuffmerge

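| Class Code | Cuffcompare Transcripts, n | Cuffcompare % | Cuffmerge Transcripts, n | Cuffmerge % |
|---|---|---|---|---|
| = | 16,141 | 8.23 | 19,916 | 22.10 |
| i | 32,192 | 16.40 | 77 | 0.09 |
| j | 59,107 | 30.13 | 34,877 | 38.70 |
| e | 5,730 | 2.92 | 0 | 0.00 |
| o | 1,296 | 0.66 | 876 | 0.97 |
| p | 3,621 | 1.85 | 3 | 0.00 |
| s | 168 | 0.09 | 148 | 0.16 |
| x | 2,750 | 1.40 | 2,312 | 2.57 |
| r | 13,188 | 6.72 | 2 | 0.00 |
| . | 16,118 | 8.22 | 0 | 0.00 |
| u | 45,889 | 23.39 | 31,919 | 35.41 |
| Total | 196,200 | 100 | 90,130 | 100 |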

=, Complete match of intron chain; i, a transfrag falling entirely within a reference intron; j, potentially novel isoform - at least 1 splice junction is shared with a reference transcript; e, single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment; o, generic exonic overlap with a reference transcript; p, possible polymerase run-on fragment (within 2 Kb of a reference transcript); s, an intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors); x, exonic overlap with reference on the opposite strand; r, repeat; “.”, multiple classifications; u, unknown intergenic transcript.

Quality control.

The Cufflinks and Cuffdiff output was examined for fidelity with several quality control methods. First, the variation between the replicates and different time-points was assessed with an MDS plot. In the MDS plot, the biological replicates clustered closely, indicating that there was little variation among the replicates and the first dimension clearly separated the fetal and neonatal developmental stages (Fig. 2). This result suggests that the use of two replicates is sufficient to identify the genes with the greatest DE between adjacent pairs of developmental time-points. The dynamic range of the FPKM values was also evaluated by using CummeRbund to create a boxplot of log10 transformed FPKM values for each developmental time-point (Fig. 3). The overall range and quartile distribution were consistent among developmental time-points, indicating that the data were reproducible and of high quality. The median FPKM values among developmental time-points were similar and slightly less than 1, indicating that very high levels of sequence coverage allowed the identification of genes with very low levels of expression.

Fig. 2.

Multidimensional scaling plot of samples based on genes found to be differentially expressed (DE) between any developmental-time period. Sample identification is Developmental Time Period_Replicate. E, embryonic day; P, postnatal day.

Fig. 3.

Dynamic range of fragments per kilobase of exon per million fragments mapped (FPKM) values represented as log10 transformed FPKM values for each gene calculated for each developmental time-period.

Volcano plots describing the relationships between statistical significance for tests of DE and relative transcript abundance for sequential time points are in Fig. 4. The E14/E16 plot shows that the majority of DE genes were downregulated between E14 and E16, while the total number of DE genes was lower at the E16/E18 transition than for all other developmental time-point comparisons, and Fig. 4 shows that the majority of DE genes were upregulated between E16 and E18.

Fig. 4.

Volcano plots showing the relationship between statistical significance of each test for DE and relative transcript abundance for 2 sequential developmental time-period comparisons. Significant DE genes are colored red.

Finally, we attempted to independently validate our data by comparison to previously published patterns of gene expression throughout liver development. Expression plots for three transcription factors known to be key regulators of liver development closely match previously published qPCR data (4) (Fig. 5). Similarly, expression profiles for glucose-6-phosphatase and tyrosine aminotransferase behave as expected since these genes are known to be silent, or detectable in scant amounts, before birth but are induced following parturition (9, 14) (Fig. 6). A recent microarray analysis of genes DE between fetal and adult rat livers found that gene transcripts involved in cell cycle progression such as cyclins are more abundant in fetal relative to adult livers, and we observed the expression of these genes to decrease with increasing developmental time (Fig. 7) (6). We also found a similar number of genes to be DE (∼1,000 transcripts) between fetal and postnatal livers, indicating that there are a large number of genes and pathways that influence liver development (6). Consequently, we considered our data and analyses to provide an accurate reflection of the biological processes occurring in perinatal liver development.

Fig. 5.

Independent validation of transcription factor expression patterns. Plots illustrate the developmental expression patterns of C/EBPα, GATA6, and HNF-4α detected in the RNA-Seq data that exhibit great similarity to the plots of Kyrmizi et al. (4). The tracking ID identifies the major isoform for each gene (XLOC).

Fig. 6.

Independent validation of metabolic gene expression patterns. The expression plots over the developmental time-course recapitulate the known expression profiles of glucose-6-phosphatase (G6pc) and tyrosine aminotransferase (Tat) (9, 14).

Fig. 7.

Independent validation of cell-cycle progression gene expression patterns. The expression plots over the developmental time-course recapitulate the known expression profiles of cyclin-dependent kinase 1 (Cdk1) and cyclin A2 (Ccna2) (6).

Highly expressed genes.

The Cufflinks --max-bundle-frags option identifies genes with large numbers of aligned reads (10^6 by default) and assigns them an HIDATA status, which causes the data for these genes to be omitted from the expression analysis to prevent their inflating the denominator in the FPKM calculations. Several genes were tagged with the HIDATA label across most developmental time-points, including albumin, alpha-fetoprotein, and hemoglobin (Table 3). By E20, transferrin (a blood plasma glycoprotein that regulates free iron levels in blood) was identified as highly expressed, which may correspond to the high levels of hemoglobin expression at the same stage of development. Starting at E20 and continuing into development, a group of cystatin and other protease inhibitors including alpha2-HS-glycoprotein, kininogen, kininogen 1-like 1, and serpina1 were all highly upregulated. Also of interest was apolipoprotein B, responsible for carrying cholesterol to tissues, which was highly expressed only at postnatal day (P)1.

Table 3.

The most highly expressed genes identified for each developmental time-point

Developmental Time Point
Gene E14 E16 E18 E20 P1 P7
Alb X X X X X X
Hbb-a2 X X X X X X
Afp X X X X X
Hbb X X X X
Hbb-b1 X X X X
H19 X X X
Tf X X X
Ahsg X X X
Serpina1 X X
Kng1 X X
Kng1l1 X X
Hp X
Apob X

DE.

FPKMs for each transcript were estimated with Cufflinks and Cuffdiff and were summed across all transcripts associated with each gene to produce the abundance metric for testing DE at the gene level. P values were estimated for each gene and isoform (Supplementary Tables S1 and S2, respectively) and corrected for multiple testing (q value) by the Benjamini-Hochberg correction. For all significant tests (q ≤ 0.01) the sign of the log2(fold change) was used to partition the DE genes into up- and downregulated groups (Table 4). The total number of DE genes ranged from 306 to 1,104, and the total number of DE isoforms ranged from 313 to 758. The E16/E18 comparison revealed fewer DE genes and isoforms than at any other time-point comparison. Conversely, the largest number of DE genes and isoforms occurred between E20 and P1, which is consistent with the hormonal, stimulatory, and dietary changes that occur at birth. From E16 onward, there was a strong tendency toward the global downregulation of genes and isoforms with development.
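The multiple-testing correction and the partition by direction can be sketched as follows. Cuffdiff performs the correction internally, so this Python example only illustrates the Benjamini-Hochberg step-up procedure and the q ≤ 0.01 sign rule; the function names, gene labels, and values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values (q values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that each q value
    # is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def partition_by_direction(results, q_cutoff=0.01):
    """Split significant tests into up- and downregulated groups.

    `results` is a list of (gene, log2_fold_change, q_value) tuples."""
    up = [g for g, lfc, q in results if q <= q_cutoff and lfc > 0]
    down = [g for g, lfc, q in results if q <= q_cutoff and lfc < 0]
    return up, down


pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pvals))

# Hypothetical gene-level results for one comparison.
results = [("geneA", 2.1, 0.001), ("geneB", -1.4, 0.008), ("geneC", 3.0, 0.2)]
print(partition_by_direction(results))
```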

Table 4.

Numbers of up- and downregulated DE genes and isoforms between adjacent developmental time-points

| Comparison | Genes Upregulated | Genes Downregulated | Isoforms Upregulated | Isoforms Downregulated |
|---|---|---|---|---|
| E14/E16 | 73 (15.2)* | 407 (84.8)* | 205 (41.3) | 291 (58.7) |
| E16/E18 | 274 (89.5) | 32 (10.5) | 303 (96.8) | 10 (3.2) |
| E18/E20 | 395 (62.9) | 233 (37.1) | 258 (73.7) | 92 (26.3) |
| E20/P1 | 615 (55.7) | 489 (44.3) | 456 (60.2) | 302 (39.8) |
| P1/P7 | 132 (27.8) | 343 (72.2) | 134 (36.7) | 231 (63.3) |
| Total unique | 1,222 | 1,333 | 1,162 | 1,004 |

*Percentage of differentially expressed (DE) up- and downregulated genes and isoforms within each developmental-period comparison is shown in parentheses.

DAVID for DE genes.

We used DAVID v6.7 to gain biological insight from our gene lists, identify enriched GO terms, and find functionally related gene groups. Significantly up- and downregulated gene lists from each comparison were submitted to the functional annotation tool and analyzed with the functional clustering and pathway options (Supplementary Table S3). This analysis was also performed for the DE isoforms (Supplementary Table S4). From the functional clustering analysis, we selected the top five clusters for each processed gene list based on enrichment score, where a higher enrichment score indicates closer agreement among terms. The top five clusters for genes upregulated between E14 and E16 are porphyrin biosynthesis and nitrogen compound biosynthetic and defense response process. Genes downregulated between E14 and E16 clustered into muscle organ development, cell morphogenesis, embryonic morphogenesis, cell fraction, and blood vessel morphogenesis. These clusters indicate that, between E14 and E16, the rat fetal liver is transitioning from a mass of undifferentiated cells into a hematopoietic organ. Many of the pathways required at the early differentiation and expansion stages of liver development, such as the Wnt, Axon guidance, Hedgehog, and MAPK signaling pathways, were identified in the significant KEGG pathways for downregulated genes. This observation indicates that the signals required for the differentiation and expansion of liver cells were established prior to E16.

The genes found to be DE in the E16/E18 comparison produced a large cluster of related terms associated with the innate and adaptive immune responses. This suggests that E18 is the developmental time-point at which nonparenchymal lymphoid cells are induced. Beginning at E16 and continuing to P1, several common clusters were frequently detected including response to organic substrate, carboxylic acid catabolism, response to nutrient levels, and fatty acid metabolic processes. This demonstrates that hepatocyte maturation is a gradual process during which the expression of metabolic genes in the liver progresses with development. Another trend was found for the genes that were consistently downregulated between E16 and P7, which were enriched for DNA metabolic process, nuclear lumen, cell cycle, and RNA processing. Finally, the genes that increased in expression between P1 and P7 clearly indicate that by 1 wk after birth, the role of the liver is to metabolize lipids and produce bile (Fig. 8). Moreover, the mitochondria are clearly engaged in these processes, as shown by the increased expression of many members of the cytochrome P450 family. Additionally, the top five clusters in this group provide evidence for the upregulation of mitochondrial genes involved in oxidation-reduction, steroid metabolic process, cofactor binding, and electron carrier activity.

Fig. 8.

The fatty acid metabolism pathway was determined to be enriched for genes upregulated between E20 and P1. Red stars indicate the genes in the pathway that were found in the list of upregulated DE genes.

The main clusters from the functional clustering analysis for each processed isoform list were similar to those for the corresponding DE gene list for almost all comparisons; however, the top five clusters for genes upregulated between E14 and E16 were the most discrepant, revealing clusters for ATP binding, alternative splicing, protein dimerization activity, cytoplasmic membrane-bounded vesicle, and RNA export from nucleus.

Investigation of unknown transcripts.

To investigate the nature and origin of the unknown transcripts assembled by Cufflinks, all transcripts labeled with the class code “u” were parsed from the cuffcmp.tracking output and grouped by locus. In total, 2,488 loci with multiple transcript observations were detected. Redundancy at each locus was removed using CAP3, and 3,154 consensus contigs were assembled. Due to the stringent overlap requirement used in CAP3, only 86.7% of the loci had at least 1 assembled contig. Loci for which no contig was assembled may contain overlapping genes or transcripts for which the extent of overlap was lower than specified in the CAP3 analysis.

We subjected both the assembled contigs and the remaining singlets (Supplementary Data S5) for each locus to a discontiguous megablast alignment to both the mouse and human RefSeq RNA databases. The mouse or human annotation for the top hit was assigned to the contig or singlet provided it was ≥80% identical across ≥70% of its length. Several loci aligned to both mouse and human orthologs; however, many loci could only be annotated by hits to one but not both databases. Using this strategy, we identified annotation for 581 of the 2,488 (23.4%) loci from the mouse database alone, 26 (1.0%) from the human database alone, and 700 (28.1%) from both databases; 1,181 (47.5%) could not be identified. Gene names were retrieved for the BLAST hit accession numbers for the annotated loci and were queried against the UCSC rat annotation to determine if Cufflinks was improperly annotating transcripts. None of the returned gene names were found in the rat annotation. The complete annotation for all of the sequences is provided in Supplementary Table S6.
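The assignment rule and the resulting tallies can be expressed compactly. The sketch below is illustrative only: the function names and argument layout are hypothetical, and coverage is assumed to be the aligned length divided by the query length. The counts are those reported in the text.

```python
def passes_threshold(pct_identity, aligned_len, query_len,
                     min_identity=80.0, min_coverage=0.70):
    """Return True if a top hit meets the identity and coverage cutoffs."""
    if query_len <= 0:
        return False
    coverage = aligned_len / query_len  # assumed definition of coverage
    return pct_identity >= min_identity and coverage >= min_coverage


def classify_locus(mouse_ok, human_ok):
    """Place a locus in one of the four categories used in the text."""
    if mouse_ok and human_ok:
        return "both"
    if mouse_ok:
        return "mouse only"
    if human_ok:
        return "human only"
    return "unidentified"


# Counts reported in the text.
counts = {"mouse only": 581, "human only": 26, "both": 700, "unidentified": 1181}

total_loci = sum(counts.values())                      # 2488
annotated_loci = total_loci - counts["unidentified"]   # 1307
```

The three annotated categories sum to the 1,307 loci reported as newly annotated, and all four categories sum to the 2,488 loci examined.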

The unannotated Cufflinks assemblies generated by Cuffcompare (Table 2) were further assessed by an additional set of alignments. Since the nature of artifacts may be reflected in the size of the transcripts, the largest 20 transcripts (9–17 Kb), the smallest 20 transcripts (0.3–0.4 Kb), and 20 modal transcripts (∼2 Kb) were queried against the expressed sequence tag (EST) trace archives for rat, mouse, and human and the genome assemblies of rat, mouse, and human (as opposed to the mRNA references) by discontiguous megablast. Ninety-one percent of the transcripts were interrupted by genomic sequence, and 73% had EST support. Sixty-two percent were found in the mouse and human genomes with similar implied structure, suggesting that they represent conserved transcripts that are not annotated in any of these three species. An additional 33% were found in the mouse genome but not in the human genome. A single transcript was found in the human assembly but was absent from the mouse assembly. Of the remaining two transcripts, one was supported by EST data in the mouse and the other appears to be specific to the rat. The largest transcripts included chimeric artifacts that resulted from overlapping genes. The smaller transcripts included fragments of first or last exons that extended beyond the region annotated in the rat assembly. A similar effort was attempted for the unknown intergenic transcripts generated by Cuffmerge. However, these sequences included an abundance of intronic and flanking sequence indicating the incorporation of artifactual sequence into the merged contigs, and consequently the sequences could not be evaluated by the same strategy.

DISCUSSION

To investigate the processes underlying liver development, we sequenced the transcriptomes of late gestation and perinatal livers in R. norvegicus. Six developmental time-points were chosen (E14, E16, E18, E20, P1, and P7), and two biological replicates were sequenced at each time-point (n = 12). Of the 962,605,900 chastity-filtered and adapter-trimmed sequence reads, 80.2% were aligned to the R. norvegicus genome with at most two mismatches. The reads that failed to align comprised low-quality reads, sequences for which Lewis rats are diverged from the reference R. norvegicus, reads containing adapter sequences in which sequence errors occurred, and reads that may represent contaminants introduced during library preparation or sequencing.

We used Cufflinks to assemble transcripts, Cuffcompare and Cuffmerge to characterize the assembled transcripts, and Cuffdiff to estimate transcript abundances and test for DE between adjacent developmental time-points. The use of Cuffmerge increases the number of transcripts that completely match the supplied annotation file; however, assembly errors accumulated when the contigs were merged. The transcripts classified as unknown by Cuffmerge did not generally appear to represent processed mRNAs since most contain a mixture of intron and exon sequences, some of which are annotated. We speculate that the depth to which we sequenced each sample was sufficient that we captured artifacts of transcription run-on and incomplete splicing of mRNAs, which generated these artifactual transcripts since the proportion of each transcript that represented exon sequence was generally low. We also found predicted transcripts with sequences that are conserved in several species but that do not contain splice junctions and do not appear to be expressed in these species. Consequently, Cuffmerge overestimates the read counts associated with unannotated transcripts; of the 31,919 transcripts classified as unknown by Cuffmerge, 20,018 regions of annotated genes were represented. Considering these results, the Cuffmerge output is not appropriate for the characterization of unannotated transcripts, and we used the Cuffcompare output for this purpose.

We determined that the most abundantly expressed genes during late fetal and perinatal liver development include albumin, alpha-fetoprotein, and a group of hemoglobin genes. Highly expressed genes identified later in development included a group of protease inhibitors and pre-mRNA splicing genes. Lists of genes that were DE between adjacent developmental time-points were generated according to whether the genes increased or decreased in transcript abundance and both lists for each time-point comparison were submitted to DAVID functional annotation and analyzed for enriched GO terms. Our analysis of the top five clusters produced by the functional clustering tool indicates that hematopoietic gene expression had initiated by E14. We repeatedly found clusters of genes responsible for response to nutrient levels and enzymes responsible for glycolysis and fatty acid catabolism in the fetal samples, which indicate that hepatocyte maturation is a gradual process that starts before birth and continues into the perinatal period. Following birth, a large cluster of DE genes was enriched for mitochondrial gene expression and cholesterol synthesis, indicating that by 1 wk of age, the liver is engaged in lipid sensing and bile production.
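The construction of the two gene lists for each time-point comparison can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the study: the record fields (`gene`, `log2_fold_change`, `q_value`), the significance cutoff `alpha`, and the example records are all hypothetical.

```python
def split_by_direction(records, alpha=0.05):
    """Split significant genes into increased and decreased lists.

    Each record is a dict with hypothetical keys 'gene',
    'log2_fold_change' (later vs. earlier time-point), and 'q_value'.
    The cutoff alpha is an assumed value for illustration.
    """
    increased, decreased = [], []
    for rec in records:
        if rec["q_value"] > alpha:
            continue  # not significant at the assumed cutoff
        if rec["log2_fold_change"] > 0:
            increased.append(rec["gene"])
        elif rec["log2_fold_change"] < 0:
            decreased.append(rec["gene"])
    return increased, decreased


# Hypothetical records for one adjacent time-point comparison.
example_records = [
    {"gene": "geneA", "log2_fold_change": 2.1, "q_value": 0.001},
    {"gene": "geneB", "log2_fold_change": -1.4, "q_value": 0.010},
    {"gene": "geneC", "log2_fold_change": 3.0, "q_value": 0.400},
    {"gene": "geneD", "log2_fold_change": 0.0, "q_value": 0.001},
]

up_genes, down_genes = split_by_direction(example_records)
```

Keeping the increased and decreased lists separate means each can be tested for enrichment on its own, so that terms associated with rising and falling expression are not mixed in a single cluster.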

Finally, we took a comparative approach to elucidate the identities of transcripts with unknown annotations and identified 1,307 transcribed regions of the rat genome that were not annotated. These regions represent loci at which Cufflinks assembled transcripts that aligned to the rat reference genome but did not match any UCSC annotated sequences. Our annotation of these transcripts serves to improve the existing reference rat transcriptome for future RNA-Seq analyses. A further 1,181 loci were not identified in either the mouse or human transcriptome and remain unannotated. Some of these loci likely represent homologs with low sequence identity that consequently failed to meet the 70% query length alignment criterion. However, many loci likely represent transcribed rat loci that are not annotated in mouse or human.

High-throughput sequencing offers many advantages over hybridization-based gene expression arrays. The nature and extent of information captured in a typical RNA-Seq experiment are diverse and vast, and not all aspects of the data captured in this study were evaluated. For example, future work will focus on the characterization of the reads that failed to align to the rat reference assembly and on the complete annotation of the unknown transcripts. Nevertheless, these data provide insight into the global transcriptional landscape of late gestation and perinatal liver development and extend the reference transcriptome for an important model species.

GRANTS

This work was supported by the University of Missouri and grant 13321 from the Missouri Life Science Research Board. J. F. Taylor was supported by National Research Initiative Competitive Grants no. 2011-68004-30367 and 2011-68004-30214 from the USDA National Institute of Food and Agriculture.

DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

AUTHOR CONTRIBUTIONS

Author contributions: R.H.C. conception and design of research; R.H.C. performed experiments; R.H.C., P.C.T., K.D.W., S.A.G., J.W.K., S.D.M., and R.D.S. analyzed data; R.H.C., P.C.T., K.D.W., S.A.G., J.W.K., S.D.M., R.D.S., and J.F.T. interpreted results of experiments; R.H.C. and P.C.T. prepared figures; R.H.C., P.C.T., K.D.W., and J.F.T. drafted manuscript; R.H.C. and J.F.T. edited and revised manuscript; R.H.C. and J.F.T. approved final version of manuscript.

Supplementary Material

Table S1
tableS1.txt (4.2MB, txt)
Table S2
tables2.txt (10MB, txt)
Table S3
tableS3.xlsx (899.8KB, xlsx)
Table S4
tableS4.xlsx (691.1KB, xlsx)
Table S5
tableS5.txt (11.5MB, txt)
Table S6
tableS6.xlsx (71.1KB, xlsx)

ACKNOWLEDGMENTS

The short-read sequence data from this article are being submitted to the National Center for Biotechnology Information short read archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession number SRA654049. Data can also be obtained from the corresponding author.

Footnotes

1

The online version of this article contains supplemental material.

REFERENCES

  • 1. Dennis G, Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4: P3, 2003.
  • 2. Hosack DA, Dennis G, Jr, Sherman BT, Lane HC, Lempicki RA. Identifying biological themes within lists of genes with EASE. Genome Biol 4: R70, 2003.
  • 3. Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res 9: 868–877, 1999.
  • 4. Kyrmizi I, Hatzis P, Katrakili N, Tronche F, Gonzalez FJ, Talianidis I. Plasticity and expanding complexity of the hepatic transcription factor network during liver development. Genes Dev 20: 2293–2305, 2006.
  • 5. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25, 2009.
  • 6. Li C, Yu S, Zhong X, Wu J, Li X. Transcriptome comparison between fetal and adult mouse livers: implications for circadian clock mechanisms. PLoS One 7: e31292, 2012.
  • 7. Metzker M. Sequencing technologies - the next generation. Nat Rev Genet 11: 31–46, 2010.
  • 8. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth 5: 621–628, 2008.
  • 9. Perry ST, Rothrock R, Isham KR, Lee KL, Kenney FT. Development of tyrosine aminotransferase in perinatal rat liver: changes in functional messenger RNA and the role of inducing hormones. J Cell Biochem 21: 47–61, 1983.
  • 10. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321: 956–960, 2008.
  • 11. Tanaka M, Itoh T, Tanimizu N, Miyajima A. Liver stem/progenitor cells: their characteristics and regulatory mechanisms. J Biochem 149: 231–239, 2011.
  • 12. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105–1111, 2009.
  • 13. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28: 511–515, 2010.
  • 14. Wang ND, Finegold MJ, Bradley A, Ou CN, Abdelsayed SV, Wilde MD, Taylor LR, Wilson DR, Darlington GJ. Impaired energy homeostasis in C/EBP alpha knockout mice. Science 269: 1108–1112, 1995.
  • 15. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57–63, 2009.
  • 16. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453: 1239–1243, 2008.
  • 17. Zaret KS. Genetic programming of liver and pancreas progenitors: lessons for stem-cell differentiation. Nat Rev Genet 9: 329–340, 2008.
  • 18. Zhao R, Duncan SA. Embryonic development of the liver. Hepatology 41: 956–967, 2005.
  • 19. Zorn AM. Liver development. StemBook. http://www.stembook.org/node/512, 2008.

Articles from Physiological Genomics are provided here courtesy of American Physiological Society
