Cell Reports Methods. 2026 Jan 14;6(1):101276. doi: 10.1016/j.crmeth.2025.101276

Epigenomic, transcriptomic, and proteomic characterization of breast cancer cell line reference samples

Chirag Nepal 1,2, Wanqiu Chen 1,2, Zhong Chen 1,2, John A Wrobel 3, Ling Xie 3, Wenjing Liao 1, Chunlin Xiao 4, Andrew Farmer 5, Malcolm Moos Jr 6, Wendell Jones 7, Xian Chen 3,, Charles Wang 1,2,8,∗∗
PMCID: PMC12853167  PMID: 41539304

Summary

Next-generation sequencing requires accuracy, reproducibility, and standardized reference materials. The Sequencing Quality Control (SEQC-2) multicenter studies on paired breast cancer and B cell lines generated extensive genomic datasets, but integrated epigenomic and proteomic references remain limited. Here, we performed Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), Methyl-seq, RNA sequencing (RNA-seq), and proteomic profiling to establish comprehensive multi-omics reference materials. We identified >7,700 protein groups, with 95% of genes encoding a single peptide isoform. Protein expression from CpG island (CGI)-overlapping transcripts was higher than non-CGI transcripts in both cell lines. Certain SNVs were incorporated into mutated peptides. Chromatin accessibility was regulated by CG density: CG-rich regions showed lower methylation, greater accessibility, and higher gene/protein expression, whereas CG-poor regions exhibited higher methylation, reduced accessibility, and cell line-specific expression patterns. These datasets provide well-defined genomic, epigenomic, transcriptomic, and proteomic characterizations that can serve as benchmarks for validating omics assays and bioinformatics methods, offering a valuable community resource.

Keywords: epigenomics, transcriptomics, proteomics

Highlights

  • Multi-omics reference materials created for paired breast cancer and B cell lines

  • Integration of ATAC-seq, Methyl-seq, RNA-seq, and proteomics datasets

  • CG-rich regions show higher accessibility, lower methylation, and higher expression

  • Dataset provides benchmarks for assay validation and bioinformatics pipelines

Motivation

Accurate and reproducible multi-omics data are essential for understanding cellular mechanisms and benchmarking analytical tools. While genomic reference standards are well established, resources that integrate epigenomic, transcriptomic, and proteomic layers remain scarce. Such benchmarks are crucial to assess cross-laboratory reproducibility, validate new technologies, and interpret genetic and epigenetic variation in disease contexts. To address this gap, we generated a comprehensive reference dataset from paired cancer and normal cell lines, providing an integrated framework for standardizing multi-omics analyses in both research and clinical applications.


Nepal et al. establish comprehensive genomic, epigenomic, transcriptomic, and proteomic reference data from paired breast cancer and B cell lines. These datasets provide benchmarks for multi-omics assay validation and cross-platform reproducibility.

Introduction

Our previous studies, carried out by the Food and Drug Administration (FDA) Sequencing Quality Control (SEQC-2) consortium, generated a set of reference variant datasets from paired cell lines HCC1395 and HCC1395BL and provided best practice guidelines for data analysis.1,2,3,4,5,6,7 These two cell lines were from a 43-year-old, white female patient with ductal carcinoma. HCC1395 is an epithelial cancer cell line derived from the mammary gland, while HCC1395BL was prepared from a normal B lymphoblast isolated from the peripheral blood of the same patient.8 These cell lines have been used extensively in breast cancer research to study gene expression pathways and for drug discovery.9 The SEQC-2 consortium also compared analyses of variant calling methods on whole-genome and whole-exome datasets,1,2,3 genomic instability, and somatic structural variants (SVs)4 and showed that using personalized genome reference data significantly improved the detection accuracy of somatic SNVs and SVs.5 Complementary single-cell RNA sequencing (scRNA-seq) analyses across multiple centers provided benchmarks for gene expression profiling and technology reproducibility.6 The reference resources and methodology established in these studies will facilitate precision oncology analysis and personalized cancer therapy as well as other areas of personalized medicine.

While prior studies of HCC1395 and HCC1395BL focused on genomic alterations, little information is available on their epigenomic and proteomic landscapes, which restricts their utility as reference materials. Unlike genomic alterations, epigenetic alterations are frequently reversible chemical modifications that regulate DNA accessibility and gene transcription. DNA methylation,10 chromatin accessibility,11 and histone modifications12 are key measures of epigenetic states. The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) measures DNA accessibility by identifying open chromatin regions13 indicative of active transcription. Methods such as Methyl-seq, Reduced Representation Bisulfite Sequencing (RRBS), and Whole-Genome Bisulfite Sequencing (WGBS) measure DNA methylation levels14 associated with repressive marks. Proteins, including phosphoproteins,15 can be identified and quantified by mass spectrometry (MS)-based methods. These analyses allow detection of altered protein expression levels or functions through somatic mutations and/or post-translational modifications. Before gene transcription, promoters undergo remodeling of epigenetic marks (e.g., removal of repressive DNA methylation) and open chromatin for transcription factors (TFs) to bind around transcription start sites (TSSs). Thus, it is important to determine the extent of epigenetic remodeling, its influence on gene expression, and downstream protein function. Furthermore, we intend to determine the degree of variation in gene expression across cell types explained by epigenetic modifications. To achieve a comprehensive view of the epigenomic landscape, integrated measurement of chromatin accessibility and DNA methylation is required within the genomic context.

Here, we present a reference map of the epigenomic, transcriptomic, and proteomic landscapes of the two cell lines that were previously studied with whole-genome sequencing and scRNA-seq by the SEQC-2. We performed ATAC-seq, Illumina TruSeq Capture Epic Methyl-seq, RNA sequencing (RNA-seq), and MS-based proteomic analyses to generate a comprehensive catalog of open chromatin regions, methylation status, and transcriptomic and proteomic quantifications. By integrating these molecular layers, we sought to understand inter-relationships across the genome, epigenome, transcriptome, and proteome. Open chromatin regions had low DNA methylation, while closed chromatin regions had high methylation, reflecting a consistent relationship between the two epigenetic modifications. This relationship largely coincides with genomic CG density, where CG-rich regions had more accessible chromatin, low methylation levels, and higher gene and protein expression levels. The CG-poor regions had largely repressive epigenetic marks (less open chromatin and higher DNA methylation), often resulting in cell line- or tissue-specific expression. Our work provides cell-line reference materials, which are well characterized across epigenomic, transcriptomic, and proteomic levels, and highlights the relationship of epigenetic modification with genomic CG context as well as related protein expression.

Results

Construction of reference epigenomic, transcriptomic, and proteomic landscapes for HCC1395 and HCC1395BL cell lines

The HCC1395 and HCC1395BL cell lines have been analyzed extensively using whole-genome and whole-exome sequencing,1,2,3 personal genome assembly analysis,5 scRNA-seq for benchmarking bioinformatics algorithms,6,7 and structural variation analysis.4 However, these cell lines lack epigenetic and proteomic landscapes; thus we sought to fill this gap by providing a high-quality reference dataset. To measure the epigenetic landscape, we performed two assays, ATAC-seq (to measure chromatin accessibility) and TruSeq Capture Epic Methyl-seq (to measure DNA methylation levels). We used Illumina-based sequencing and performed ATAC-seq on three replicates of HCC1395 (average: 33.14 million (M) paired-end (PE) raw reads) and HCC1395BL (average: 50.56M PE reads) (Figure 1). Similarly, we performed three Methyl-seq replicates on HCC1395 (average: 33.14M PE reads) and HCC1395BL (average: 50.56M PE reads) (Table S1). To understand the association between epigenomics (chromatin accessibility and DNA methylation) and gene expression, we performed three replicates of bulk RNA-seq on HCC1395 (average: 62.6M PE reads) and HCC1395BL (average: 64.04M PE reads). We also integrated the scRNA-seq data of these two cell lines from our previous study.6 In addition, we performed MS on three replicates each of HCC1395 and HCC1395BL, resulting in the identification and quantification of thousands of expressed proteins. Each data type was systematically analyzed through standard pipelines (Figure 1), such as MACS,16 Bismark,17 Kallisto,18 Seurat,19 and Perseus20 (STAR Methods). After annotation of ATAC-seq peaks and (un)methylated regions, we sought to interrogate the influence of epigenetic patterns on gene and protein expression as well as understand how the genomic DNA (CG rich or CG poor) of promoters influenced epigenetic markers and gene expression.

Figure 1.

Multi-omics experimental design and bioinformatics workflow for profiling epigenetic, transcriptional, and proteomic features of HCC1395 and HCC1395BL cell lines

HCC1395 and HCC1395BL represent a breast cancer cell line and “normal” (i.e., immortalized) B lymphocyte cell line, respectively. ATAC-seq, Methyl-seq, RNA-seq, and proteomic analyses were performed on three replicates of both cell lines. Each data type was analyzed using standard analytical pipelines. ATAC-seq and DNA methylation were analyzed to understand the correlation between the two layers of epigenetics in relation to genomic CG density. Gene expression was correlated with epigenetic marks and protein expression.

See also Table S1.

Annotation of HCC1395 and HCC1395BL ATAC peaks revealed cell line-specific open chromatin

To identify open chromatin regions, we mapped ATAC-seq data to the human genome (hg38) with Bowtie221 and used MACS216 to identify peaks across all samples. Visualization of the ATAC peaks revealed that some peaks were detected across all replicates in both cell lines (peaks highlighted in black rectangular box), while others were detected in a cell line-specific manner (peaks highlighted in red and blue rectangular boxes) (Figure 2A). We detected approximately 200,000 ATAC peaks on each replicate (Figure 2B). Correlation of peaks between replicates showed high reproducibility in HCC1395 (R = 0.914) and HCC1395BL (R = 0.917) (Figure S1A). From the initial pool of peaks, we retained only those peaks that were detected in all three replicates, resulting in 134,320 consensus peaks in HCC1395 and 139,179 consensus peaks in HCC1395BL (Figure 2B). Approximately one-third (48,963) of the consensus (or concordant) peaks were common between the two cell lines (Figure 2C), indicating that the majority of the open chromatin regions were specific to each cell line. This finding is consistent with previous observations of a low overlap of ATAC peaks across various tumor types.22 About 18% of the peaks overlapped promoter regions, while most (>50%) of the peaks were in gene bodies (intragenic regions) (Figure 2D) (Tables S2 and S3). Only about 12% of ATAC peaks overlapped with CpG islands ([CGIs] regions with high CG density) (Figure 2E).
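The consensus-peak rule described above (a peak is retained only if it is detected in all three replicates) and the comparison of consensus peaks between cell lines can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the interval representation, the function names, the choice of the first replicate as the reference, and the overlap criterion (any overlap of at least 1 bp on the same chromosome) are all assumptions.

```python
from typing import List, Tuple

# A peak is (chromosome, start, end) with a half-open interval [start, end).
Peak = Tuple[str, int, int]

def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks share at least 1 bp on the same chromosome (assumed criterion)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def consensus_peaks(replicates: List[List[Peak]]) -> List[Peak]:
    """Keep peaks of the first replicate that overlap a peak in every other replicate."""
    first, others = replicates[0], replicates[1:]
    return [p for p in first
            if all(any(overlaps(p, q) for q in rep) for rep in others)]

def shared_fraction(peaks_a: List[Peak], peaks_b: List[Peak]) -> float:
    """Fraction of peaks in peaks_a that overlap any peak in peaks_b."""
    if not peaks_a:
        return 0.0
    shared = sum(1 for p in peaks_a if any(overlaps(p, q) for q in peaks_b))
    return shared / len(peaks_a)

# Toy coordinates for illustration only (not data from the study).
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 150, 250), ("chr2", 120, 180)]
rep3 = [("chr1", 190, 300), ("chr2", 50, 110), ("chr1", 500, 510)]
consensus = consensus_peaks([rep1, rep2, rep3])
```

The nested scan is quadratic in the number of peaks, which is fine for a toy example; with roughly 200,000 peaks per replicate, a sorted sweep or an interval tree would be the practical choice.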

Figure 2.

Open chromatin landscape of HCC1395 and HCC1395BL cell lines reveals distinct patterns of ATAC-seq accessibility

(A) A UCSC browser view showing 3 replicates of ATAC-seq coverage of HCC1395 (red) and HCC1395BL (blue) cell lines. ATAC peak regions are shown as red and blue rectangular bars.

(B) Bar plot shows the number of ATAC peaks from three replicates. Peaks identified in all three replicates are defined as consensus peaks.

(C) Venn diagram shows the overlap of ATAC peaks from the HCC1395 and HCC1395BL cell lines.

(D) Distribution of ATAC peaks with respect to promoter, intragenic, and intergenic regions.

(E) Distribution of ATAC peaks with respect to CGI.

(F) Mean counts of ATAC peaks overlapping promoter, intragenic, and intergenic regions. Peaks from each region are further classified based on overlap with CGIs. p values were calculated using t tests by comparing CGI and non-CGI groups across promoter, intragenic, and intergenic regions.

(G) Boxplots show the distribution of ATAC peak width across promoter, intragenic, and intergenic regions, which were classified into two groups based on overlap with CGIs.

See also Tables S2 and S3.

Thus, most open chromatin regions in both cell lines were in intragenic regions and had low CG density. CGI ATAC peaks had significantly higher mean read counts (reflecting higher peak intensities) than non-CGI ATAC peaks in both cell lines, irrespective of their location in promoter, intragenic, or intergenic regions (Figure 2F). ATAC peaks overlapping promoter CGIs had significantly higher peak intensity than intergenic and intragenic CGI ATAC peaks. Similarly, among the non-CGI ATAC peaks, those overlapping promoters had significantly higher intensity than those in both inter- and intragenic regions (Figure 2F). We also observed that CGI ATAC peaks were significantly broader than non-CGI peaks (Figures 2G and S1A), reflecting wider open chromatin regions. Thus, despite there being only a minimal overlap in ATAC peaks between the two cell lines, the genomic context of these peaks in terms of signal intensity and width showed similar CG-dependent variation, reflecting the influence of genomic CG density on the open chromatin landscape in a cell type-specific manner.

Genomic CG density and chromatin context influence DNA methylation landscapes

To annotate the DNA methylation landscape of these two cell lines, we used Bismark17 to analyze Methyl-seq data and identify methylation levels of CGs (mCG). We retained only those CGs whose methylation levels were measured in all three replicates (STAR Methods) for each cell line, resulting in 1,314,941 mCGs in HCC1395 and 1,415,665 mCGs in HCC1395BL (Figure S2A; Tables S4 and S5). Around 50% of the mCG sites detected overlapped between the two cell lines (Figure S2A), indicating that a substantial fraction of CpG sites were methylated in a cell type-specific manner. Methylation levels of CpG sites (beta value; STAR Methods) revealed a bimodal distribution in both cell lines (Figure 3A), reflecting that most CpGs are either methylated (beta value ≈ 1) or unmethylated (beta value ≈ 0), consistent with previous reports.23 Gene body profiles showed low methylation at TSSs, reflecting that the promoter regions were generally unmethylated. Methylation levels progressively increased away from TSSs and peaked toward transcription end sites (TESs). The methylation levels at promoters overlapping CGIs were much lower than at non-CGI promoters (Figure 3B). Relatively higher methylation of non-CGI promoters suggests partial CpG methylation, whereas most CpGs were unmethylated in CGI promoters, linking promoter methylation to genomic CG density.
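The filtering and summary steps in this paragraph can be illustrated with a short sketch. The exact beta-value definition used in the study is given in STAR Methods and is not reproduced here; the sketch assumes the common definition, beta = methylated reads / (methylated + unmethylated reads). The data layout, function names, and the `min_cov` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]     # (chromosome, CpG position)
Counts = Tuple[int, int]   # (methylated reads, unmethylated reads)

def beta(meth: int, unmeth: int) -> float:
    """Assumed beta value: fraction of reads supporting methylation (needs coverage > 0)."""
    return meth / (meth + unmeth)

def mean_beta_shared(replicates: List[Dict[Site, Counts]],
                     min_cov: int = 1) -> Dict[Site, float]:
    """Mean beta per CpG, keeping only CpGs measured in every replicate.

    min_cov must be at least 1 so that beta() never divides by zero.
    """
    shared = set(replicates[0])
    for rep in replicates[1:]:
        shared &= set(rep)
    result = {}
    for site in shared:
        counts = [rep[site] for rep in replicates]
        if all(m + u >= min_cov for m, u in counts):
            result[site] = sum(beta(m, u) for m, u in counts) / len(counts)
    return result

def histogram(values: List[float], n_bins: int = 10) -> List[int]:
    """Count beta values in n_bins equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    bins = [0] * n_bins
    for v in values:
        bins[min(int(v * n_bins), n_bins - 1)] += 1
    return bins

# Toy read counts for illustration only (not data from the study).
reps = [
    {("chr1", 10): (9, 1), ("chr1", 20): (0, 10), ("chr1", 30): (5, 5)},
    {("chr1", 10): (10, 0), ("chr1", 20): (1, 9)},
    {("chr1", 10): (8, 2), ("chr1", 20): (0, 5), ("chr1", 30): (1, 1)},
]
shared_betas = mean_beta_shared(reps)
```

In this toy input the CpG at position 30 is dropped because it is missing from the second replicate, mirroring the requirement that a site be measured in all three replicates. A bimodal distribution appears in `histogram` as counts concentrated in the first and last bins.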

Figure 3.

DNA methylation profiling reveals global and locus-specific differences between HCC1395 and HCC1395BL cell lines across CGIs and gene bodies

(A) Distribution of beta value of CG methylation. Beta values ranged between 0 and 1 and were divided into 10 bins.

(B) Mean methylation levels of all expressed genes. Gene length is scaled between TSSs and TESs.

(C) A UCSC browser view showing the gene SKI and CGIs overlapping promoter and intragenic regions. Mean methylation levels of HCC1395 and HCC1395BL are represented as beta values in the range of 0–1. CGIs overlapping promoters had low methylation levels, while four intragenic CGIs had high methylation levels.

(D) Mean methylation levels of promoter CGIs (left) and intragenic CGIs (right) and their 2-kb flanking regions. The y axis represents the mean methylation level (beta value).

(E) ATAC signals in promoter CGIs (left) and intragenic CGIs (right) and their 2-kb flanking regions. The y axis represents the mean ATAC signal measured in RPKM.

See also Tables S4 and S5.

In addition to promoter CGIs, thousands of genes contained intragenic CGIs, exemplified by SKI (Figure 3C). Promoter CGIs had low methylation, while intragenic CGIs were highly methylated (Figure 3C). Across all active promoter CGIs (N = 8,697), methylation levels remained low across the entire CGI (average CGI length: 2 kb) but gradually increased away from CGIs (Figure 3D). In contrast, intragenic CGIs (N = 3,357) had higher methylation levels than the flanking gene body24 (Figure 3D). Opposite methylation levels between promoter and intragenic CGIs, despite their similar CG density, suggest that additional factors might influence methylation levels. To this end, we assessed the correlation of chromatin accessibility with methylation levels. We observed intense ATAC signals, reflecting open chromatin, near promoter CGIs. In contrast, intragenic CGIs had low-intensity ATAC signals, reflecting a closed chromatin state (Figure 3E). This indicated that the DNA methylation levels of genomic regions were associated with genomic CG density and its chromatin context.

To understand the association between open chromatin and DNA methylation, we classified ATAC peaks into two groups based on overlap with CGIs. In both groups, ATAC peaks were further classified into four bins based on their CG density. Open chromatin regions with higher CG density had higher ATAC signals in both cell lines (Figures 4A and 4B, left). We observed that ATAC signals in open chromatin regions decreased with decreasing CG density in both CGI and non-CGI groups (Figures 4A and 4B, right). However, overall ATAC signals were higher in CGI regions. This suggests that chromatin accessibility (as measured by ATAC signals) is largely influenced by genomic CG density. On the other hand, non-CGI ATAC peaks had higher methylation levels compared with CGI ATAC peaks (Figures 4C and 4D). While both CGI and non-CGI ATAC peaks represented accessible chromatin regions, relatively high methylation levels in non-CGI peaks revealed some remaining methylated CpGs despite being in an open chromatin region. To illustrate these genome-wide patterns at the individual gene level, we visualized the EGFR promoter locus, highlighting the CGI, ATAC-seq peak region, and coverage tracks for both ATAC-seq and DNA methylation beta values (Figure 4E). In the HCC1395 cell line, the EGFR promoter exhibited strong ATAC-seq signals accompanied by low methylation levels across most CpGs within the CGI. In contrast, the HCC1395BL cell line showed markedly reduced ATAC-seq signals, with multiple CpGs in the CGI being highly methylated. Notably, the ATAC-seq peak region was slightly shorter than the annotated CGI, and we observed elevated methylation in the portion of the CGI lacking strong ATAC-seq coverage. This pattern underscores the inverse relationship between chromatin accessibility and DNA methylation at the EGFR locus, in agreement with our genome-wide findings. This supports the concept that the association between chromatin accessibility and DNA methylation is influenced by the genomic CG density.
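The grouping of ATAC peaks into four bins by CG density can be sketched as below. The text here does not define how CG density was computed, so the sketch assumes a simple definition (number of CG dinucleotides divided by sequence length) and equal-sized bins ordered by decreasing density; the function names and toy sequences are hypothetical.

```python
from typing import Dict, List

def cg_density(seq: str) -> float:
    """Assumed CG density: CG dinucleotides per base of the peak sequence."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    return seq.count("CG") / len(seq)

def bin_by_cg_density(peaks: Dict[str, str], n_bins: int = 4) -> List[List[str]]:
    """Rank peaks by decreasing CG density and split them into n_bins near-equal bins.

    Bin 0 holds the most CG-dense peaks; the last bin holds the least CG-dense.
    """
    ranked = sorted(peaks, key=lambda name: cg_density(peaks[name]), reverse=True)
    bins: List[List[str]] = [[] for _ in range(n_bins)]
    for i, name in enumerate(ranked):
        bins[i * n_bins // len(ranked)].append(name)
    return bins

# Toy sequences for illustration only (not data from the study).
toy_peaks = {
    "peak1": "CGCGCGCG",  # 4 CG in 8 bases
    "peak2": "ATATATAT",  # 0 CG
    "peak3": "ACGTACGT",  # 2 CG
    "peak4": "ACGTTTTT",  # 1 CG
}
density_bins = bin_by_cg_density(toy_peaks)
```

Applying the same binning separately to CGI-overlapping and non-CGI peaks, and then averaging ATAC signal or beta values within each bin, gives the kind of per-bin comparison described in the paragraph above.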

Figure 4.

Association of CG density, open chromatin, and CG methylation uncovers coordinated epigenetic regulation

(A) ATAC peaks overlapping CGIs (left) and non-overlapping CGIs (right) in the HCC1395 cell line. ATAC peaks are grouped into 4 bins based on decreasing CG density.

(B) Same as in (A), but for the HCC1395BL cell line.

(C and D) CG methylation levels of ATAC peaks overlapping CGIs (left) and non-overlapping CGIs (right) in HCC1395 (C) and HCC1395BL (D). ATAC peaks are grouped into 4 bins based on decreasing CG density. The y axis indicates methylation level (beta values), which ranged between 0 and 1.

(E) A UCSC genome browser view of the EGFR promoter region. CGI and ATAC peaks are shown as horizontal bars. ATAC-seq coverage and DNA methylation beta values across three replicates of HCC1395 and HCC1395BL are shown as coverage tracks. The HCC1395 cell line had high ATAC-seq signals in the ATAC peak region and low methylation levels. The HCC1395BL cell line had low ATAC-seq signals in the ATAC peak region and high methylation levels.

See also Tables S2, S3, S4, and S5.

Transcriptomic landscape and its relationship with chromatin accessibility and DNA methylation

We next quantified expression levels of reference and alternative transcripts across the two cell lines and examined their association with promoter epigenetic profiles. We used Kallisto18 to quantify the expression levels of all Ensembl transcripts and included only those transcripts that had a minimum of 0.5 transcripts per million (TPM) in all three replicates and a minimum of 1 TPM in at least one replicate. In total, 18,368 and 19,197 transcripts were detected in HCC1395 and HCC1395BL (Tables S6 and S7), respectively, representing 10,400 and 10,199 genes (Figure 5A). About 30% of genes had multiple transcripts, which were classified as either reference or alternative transcripts (STAR Methods). Overlapping promoters with CGIs showed that reference transcript promoters in both cell lines had significantly higher overlap with CGIs (Figure 5A), reflecting an enrichment of non-CGI promoters among alternative transcripts. Reference transcripts also had higher expression than alternatives in both CGI and non-CGI groups (Figure 5B). Among reference transcripts, CGI promoters had higher expression than non-CGI promoters, indicating that genes with promoter CGIs were more highly expressed. Overall, lower expression levels of alternative transcripts can partially be explained by their low GC content.
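The expression filter stated above (at least 0.5 TPM in all three replicates and at least 1 TPM in at least one replicate) translates directly into a small predicate. The sketch below is illustrative only; the function names, the table layout, and the transcript identifiers are hypothetical.

```python
from typing import Dict, List

def passes_expression_filter(tpms: List[float],
                             min_all: float = 0.5,
                             min_any: float = 1.0) -> bool:
    """True if every replicate has >= min_all TPM and at least one has >= min_any TPM."""
    return all(t >= min_all for t in tpms) and any(t >= min_any for t in tpms)

def filter_transcripts(tpm_table: Dict[str, List[float]]) -> List[str]:
    """Return the transcript IDs that pass the filter, in input order."""
    return [tx for tx, tpms in tpm_table.items() if passes_expression_filter(tpms)]

# Hypothetical transcript IDs with TPM values for three replicates each.
example = {
    "tx_a": [0.6, 0.7, 1.2],  # passes both conditions
    "tx_b": [0.6, 0.7, 0.9],  # never reaches 1 TPM
    "tx_c": [0.4, 2.0, 3.0],  # one replicate is below 0.5 TPM
}
kept = filter_transcripts(example)
```

The two conditions are complementary: the first removes transcripts that drop out in any replicate, and the second removes transcripts that are consistently detected but only at very low abundance.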

Figure 5.

Transcriptomic analysis of the HCC1395 and HCC1395BL cell lines reveals differences in gene expression and RNA editing

(A) Number of reference and alternative transcripts expressed in HCC1395 (left) and HCC1395BL (right). Alternative promoters were depleted for CGIs.

(B) Violin plots show mean expression levels across three replicates for HCC1395 (left) and HCC1395BL (right). p values were computed using two-sided t tests.

(C) The coverage of ATAC-seq reads around gene TSSs in the HCC1395 cell line. Genes were classified into four bins based on expression levels separately for CGI promoters and non-CGI promoters.

(D) Same as in (C), for the HCC1395BL cell line.

(E and F) UMAP clusters of HCC1395 and HCC1395BL.

(G and H) Average expression levels of CGI promoter genes and non-CGI promoter genes in single cells, excluding cells with zero counts for a given gene. The mean expression level of CGI promoter genes was higher than that of non-CGI promoter genes.

(I and J) Frequency of A-to-I edits across three replicates in the HCC1395 and HCC1395BL cell lines.

See also Tables S6, S7, and S8.

To further understand the association of gene expression with ATAC signals at promoters, genes were classified into four bins by expression. Genes with higher expression had higher ATAC signals (Figures 5C and 5D) at both CGI and non-CGI promoters, while non-CGI promoters had lower ATAC signals overall, such that even the highly expressed bin from non-CGI promoters had low levels of ATAC signals. We next compared the expression levels of CGI and non-CGI promoters across different bins and observed that the highly expressed bin from non-CGI promoters had higher expression levels than the bin with lowest expression of CGI promoters (Figures S3A and S3B), while it had lower ATAC signals (Figures 5C and 5D). This suggests that promoter CG density influences chromatin accessibility and supports the notion that gene expression was tightly correlated with CG content. To understand how DNA methylation facilitated alternative transcript expression despite high DNA methylation levels across gene bodies, we compared DNA methylation levels between TSSs of reference and alternative transcripts and observed decreased methylation at alternative TSSs (Figure S3C), implying that DNA demethylation promotes alternative transcript promoter activity.

We next asked whether the lower expression of non-CGI promoters in bulk RNA-seq is associated with lower accessibility, resulting in expression in fewer cells, or is an inherent property of genes. We analyzed HCC1395 and HCC1395BL scRNA-seq data6 and counted the frequency at which CGI and non-CGI promoters were expressed in individual single cells across UMAP clusters (Figures 5E and 5F). In both cell lines, CGI promoter genes were expressed in more cells (Figures S3D and S3E), consistent with high chromatin accessibility. As non-CGI promoter genes were expressed in fewer cells, their average expression in bulk RNA-seq was expected to be lower, consistent with our observation (Figure 5B). Next, we asked how the expression levels of non-CGI promoter genes compare to CGI promoter genes when considering only the cells in which these genes are expressed. We observed that the expression levels of CGI promoter genes were significantly higher than those of non-CGI promoter genes, even when analyzed at the single-cell level (Figures 5G and 5H).
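The reasoning in this paragraph separates two quantities that bulk RNA-seq conflates: the fraction of cells in which a gene is detected and its expression level in those cells. A small sketch under simplifying assumptions (per-cell values for one gene given as a non-empty list, with "expressed" taken to mean a value above zero; function names and numbers are hypothetical) makes the decomposition explicit.

```python
from typing import List

def fraction_expressing(values: List[float]) -> float:
    """Fraction of cells with a non-zero value for the gene (assumes a non-empty list)."""
    return sum(1 for v in values if v > 0) / len(values)

def mean_in_expressing_cells(values: List[float]) -> float:
    """Mean expression over cells where the gene is detected; 0.0 if detected nowhere."""
    nonzero = [v for v in values if v > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

def mean_over_all_cells(values: List[float]) -> float:
    """Bulk-like average over every cell, zeros included (assumes a non-empty list)."""
    return sum(values) / len(values)

# Toy per-cell values for two hypothetical genes (not data from the study).
cgi_gene = [3.0, 5.0, 4.0, 4.0]      # detected in every cell
non_cgi_gene = [0.0, 0.0, 4.0, 2.0]  # detected in half of the cells

bulk_like = mean_over_all_cells(non_cgi_gene)
decomposed = fraction_expressing(non_cgi_gene) * mean_in_expressing_cells(non_cgi_gene)
```

Because the bulk-like mean equals the detection fraction multiplied by the mean in expressing cells, a lower bulk value can come from either factor. Restricting the comparison to expressing cells, as done for Figures 5G and 5H, isolates the second factor.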

Post-transcriptional modifications such as RNA editing can further shape transcript diversity and abundance. Thus, we assessed the global landscape of adenosine-to-inosine (A-to-I) RNA editing. We used REDItools25 and quantified the number of A-to-I sites detected across replicates of HCC1395 and HCC1395BL (STAR Methods). On average, HCC1395 cells exhibited ∼30 A-to-I sites per replicate, whereas two replicates of HCC1395BL had no detectable edits and the third replicate had only two A-to-I edits (Figures 5I and 5J; Table S8). This elevated A-to-I-editing burden in tumor cells may reflect dysregulated RNA-editing machinery, potentially mediated by altered ADAR enzyme activity. Differences in A-to-I-editing frequency could contribute to isoform diversity, RNA stability, or coding potential, adding another regulatory layer to the promoter-level epigenetic effects described above. Together, these findings highlight that gene expression in these cell lines is influenced by genomic CG density and post-transcriptional modifications, with each layer contributing to the regulation and diversity of the transcriptome.

Non-CGI-accessible chromatin revealed tissue-specific regulation across TCGA cancers

Given that the overlap of ATAC peak regions between the two cell lines was low, we sought to understand how the overlap of peaks across different tumor types depended on CG content. We asked whether the correlation of open chromatin with CpG density, observed in the two reference cell lines, was conserved across 23 cancer types from The Cancer Genome Atlas (TCGA).22 The ATAC peaks overlapping CGIs were mostly (>80%) accessible across different tumor tissues (Figure 6A), implying that CGI ATAC peaks are ubiquitously open across different tumor tissues. On the other hand, only 30%–50% of the non-CGI ATAC peaks were accessible across other tumor tissues, suggesting that non-CGI ATAC peaks reflected tumor tissue/cell-specific accessible regions. This pattern was consistent in both HCC1395 (Figure 6A) and HCC1395BL (Figure 6B), indicating that despite the low overlap between the two cell lines, CGI and non-CGI ATAC peaks exhibited ubiquitous and cell type-restricted accessibility, respectively. The CGI promoters have larger open chromatin regions/peaks that are devoid of DNA methylation, associated with relatively high gene expression. On the other hand, non-CGI promoters have narrow open chromatin regions and contain some levels of methylated CGs within ATAC peaks, which could explain the relatively lower, tissue-specific gene expression (Figure 6C). The CG density gradient in these epigenetic regulations provides a plausible way to fine-tune gene expression levels in a tissue-specific vs. ubiquitous manner.

Figure 6.

Projection of HCC1395 and HCC1395BL ATAC peaks across 23 tumor tissues from TCGA reveals ubiquitous and tissue-specific accessible regions

(A) Overlap of HCC1395 cell line ATAC peaks with the ATAC peaks from 23 different tumor tissues across TCGA. Most CGI-containing ATAC peaks found in HCC1395 were also detected in other tissues (left); for ATAC peaks without CGIs, this was not the case (right). x axis indicates the percentage of overlap of HCC1395 ATAC peaks with ATAC peaks from different tumor tissues.

(B) Same as in (A), but for the HCC1395BL cell line.

(C) Schematic representation to show the relationship between genomic CG density and epigenetic features (ATAC peaks and DNA methylation) and gene expression.

Proteomic maps of cell lines and their correlations with the transcriptomes

We next sought to map the protein expression landscape of these cell lines, detect alternatively spliced protein isoforms, and understand the effect of cell line-specific SNVs on protein quantification. We used MS and obtained a high coverage of tandem MS (MS/MS) spectra (called peptide-spectrum matches) that mapped to multiple regions of peptides as shown for the gene PFN2 (Figure 7A). Each individual MS/MS spectrum containing peptides mapping to a particular protein was quantified using MaxQuant.26 These spectra were then analyzed to quantify mean protein expression levels, including different protein isoforms. For example, the PFN2 gene encodes the Profilin-2 protein and has two annotated isoform entries (P35080-1 and P35080-2) in UniProt. Both isoforms are 140 amino acid residues in length and differ in amino acid residues 109–140 (Figure S4A). The two isoforms have a different third exon, as supported by MS results, where 1 peptide mapped to exon 1 (Figure S4B) and 4 peptides mapped to exon 2 (Figure S4C), both common to the two isoforms. The third exon had 2 peptides unique to P35080-1 and 3 peptides unique to P35080-2 (Figures S4D and S4E). Both isoforms also had a unique peptide (VLVFVMGK for P35080-1 and ALVIVMGK for P35080-2) that spanned the splice junction between exons 2 and 3 (Figures 7A, S4D, and S4E), supporting the alternative splicing of these isoforms at the protein level.

Figure 7.

Proteomic profiling links alternative splicing to isoform-specific protein expression and integrates protein abundance with genomic and epigenomic features

(A) A UCSC genome browser view (reverse strand) showing peptides identified by MS/MS spectra that mapped to two UniProt isoforms of PFN2 (P35080-1 and P35080-2). These isoforms share exons 1 and 2 but have different third exons resulting from alternative splicing. Peptides common to both isoforms are colored magenta. Peptides unique to P35080-1 and P35080-2 are colored blue and red, respectively. Peptides overlapping in each transcript are summed up to quantify protein expression levels.

(B) Bar plots show individual protein expression levels of each spliced isoform across three replicates for both HCC1395 and HCC1395BL cell lines.

(C) Number of protein groups detected per gene. The majority (98.5%) of genes have only one detected protein group.

(D and E) Correlation of gene expression and protein expression levels. Gene expression and protein levels are positively correlated in HCC1395 (R = 0.477) and HCC1395BL (R = 0.474).

(F and G) Average protein expression levels of genes grouped based on overlap with CGI in their promoters. Genes overlapping CGI have significantly higher protein expression levels in HCC1395 (F) and HCC1395BL (G). p values were calculated using a t test.

See also Tables S9 and S10.

In total, we identified 7,733 protein groups, and the majority (7,349; 95%) of the genes had a single protein group (Figure 7C). We compared the protein and RNA gene expression levels based on their promoter overlap with CGIs and observed a positive correlation (R = 0.477 in HCC1395 and R = 0.474 in HCC1395BL) (Figures 7D and 7E; Table S9). The protein expression levels of transcripts overlapping CGIs were significantly higher than those of non-CGI transcripts in both cell lines (Figures 7F and 7G). This revealed that CGI promoter genes had higher gene and protein expression levels than non-CGI promoter genes. Differential expression analysis of protein levels between the two cell lines revealed that HCC1395BL had 614 significantly upregulated proteins (red circles) and 372 significantly downregulated proteins (blue circles) (Figure S4F) at the defined threshold (p value <0.05 and absolute log2 ratio >1).

To understand the effects of somatic mutations on protein expression, we compiled a custom protein FASTA database incorporating somatic SNVs unique to HCC1395 compared with HCC1395BL as identified by the SEQC-2 consortium1 (see STAR Methods). The somatic mutations were divided into 2 sets based on the confidence level of the somatic call in the VCF file: a truth set containing high-confidence somatic calls and a non-truth set containing the remaining somatic mutations. We identified a total of 6 variant peptides indicating expression at the protein level (Table S10). Three of the somatic mutations were from the truth set in the following genes: OSTC, SEC22B, and PRDX5. In addition, 3 somatic mutations from the non-truth set were observed in DDX3X, FLNA, and TUBB8B. Five of the 6 variant peptides had higher expression in the breast cancer samples compared with the B lymphocyte samples (including all 3 from the truth set). In addition, four of these amino acid substitutions detected at the protein level were also found in COSMIC from breast ductal carcinoma tissue samples, indicating possible biological relevance: F to L in OSTC, F to L in PRDX5, R to T in DDX3X, and D to H in FLNA. To relate the peptides containing amino acid substitutions at the protein level to a specific SNV, we located the specific SNV corresponding to the amino acid substitutions identified by MS/MS. The mapping of the six variant peptides was visualized using the genome browser; these included three from the truth set (Figures S5A–S5C) and three from the non-truth set (Figures S5D–S5F). Collectively, we have provided protein expression levels of these two cell lines and association of expression level with genomic CG density and finally showed how often SNVs were incorporated into mutated peptides.

Discussion

As a follow-up to our previous studies,1,2,3,4,5,6,7 here we provided a comprehensive catalog and detailed characterizations at the epigenomic, transcriptomic, and proteomic levels of two cell line reference samples. We interrogated thousands of open chromatin regions, the DNA methylation status of millions of CG and CH sites, and thousands of genes based on both their transcript and protein expression levels. In addition to these processed datasets, we also provided a comprehensive integrative analysis across different molecular layers of the omics data from the same cell lines. Together with our previous multi-center studies1,2,3,4,5,6,7 of these two cell lines, we provided well-characterized genomic materials, reference datasets, and reference bioinformatics methods, which will be a valuable resource not only to the cancer research community but also to the entire genomics research community. From a benchmarking perspective, these cell lines have been annotated with somatic alterations1,2,3 and structural variations,5 which provides an opportunity to measure how personalized somatic alterations affect epigenetic peak calling. On the other hand, the breast cancer research community can mine the data to extract biological functions associated with cancer. While we aligned all data to the standard human reference genome (UCSC hg38), we acknowledge that mapping to personalized genome assemblies or pan-genomes could improve accuracy in capturing individual-specific and population-level genomic variation. However, for these approaches to be broadly applicable, additional tools are needed to generate high-quality genome assemblies and gene annotations on the fly. Developing these tools will be an important future direction for multi-omics analyses. Structural variations can reshape the epigenomic landscape by adding or removing regulatory elements.27

While methylated DNA represents the repressive state, accessible chromatin marks the opening of chromatin of the genomic DNA for active transcription, suggesting that the inverse relationship is biologically relevant. However, we showed that this relationship is dependent on three factors: (1) transcriptional state (active or inactive), (2) genomic sequence CG density (CG rich or CG poor), and (3) genomic regions (promoter, gene body, and intergenic regions). Epigenetic marks of actively transcribed promoters are expected to differ from those of inactive/untranscribed promoters. Thus, when transcriptional states are controlled, genomic CG density largely influences the epigenetic landscapes, where CG-rich regions have higher accessibility and lower methylation than CG-poor regions (lower accessibility and higher methylation). CG-rich regions not only had higher chromatin accessibility and lower methylation levels in a single cell type but also showed similar patterns across other cell types. On the other hand, CG-poor regions had restricted chromatin accessibility, which varied across cell types (Figures 6A and 6B), indicating tissue-specific activity. Based on the chromatin accessibility and DNA methylation of these two cell lines, we propose a distinct mode of regulation in CGI vs. non-CGI promoters. Because the inverse correlation between chromatin accessibility and DNA methylation depends on context (genomic CG density, transcriptional state, and genomic region), we suggest that future studies consider this context when analyzing and interpreting epigenetics data.

Non-CGI promoters were associated with lower gene and protein expression levels and cell type-restricted open chromatin (Figures 6A and 6B). We speculate that the promoter genomic CG density influences epigenetic patterns, which in turn influence gene and protein expression. Non-CGI promoters with lower CG density exhibited narrow open chromatin regions, which may influence their gene expression patterns in two ways. First, because non-CGI promoters have narrow open chromatin regions, they will have fewer available TF binding sites. A lower frequency of TF binding sites may explain the lower expression levels we observed. On the other hand, this lower frequency also suggests that these genes might be activated only in certain specific contexts, which would explain why they might have cell type-specific expression levels. The overall low expression levels of non-CGI promoters may therefore be explained by CG density, which could in turn regulate epigenetic turnover and, ultimately, the expression patterns observed in single-cell analyses. Second, many CpGs within the open chromatin regions of non-CGI promoters were still methylated (Figures 4C and 4D), which makes some TF sites inaccessible. As DNA methylation is generally repressive, partially methylated CpGs in promoter regions may explain overall lower expression levels. Thus, a lower frequency of TF binding sites and partially methylated CpGs might explain the lower expression levels and tissue-restricted expression of non-CGI promoters.

Like the epigenomic and transcriptomic profiles, the proteomic expression landscape was also influenced by CG density, where CGI promoters also had higher protein expression levels. In addition to quantification of peptides, we also analyzed what fraction of cell line genomic variants was expressed. We identified only 6 SNVs at the protein level, thus confirming that some of the mutations were indeed translated into altered peptides. Low detection of SNVs at the protein level suggested two potential scenarios. First, some of the proteins containing amino acid substitutions might result in unstable proteins that are degraded and remain undetected by MS. Second, low sequence coverage of certain proteins by MS might prevent a peptide containing the variant amino acid substitution from being detected. Of interest, our ability to detect mutations from the non-truth set at the protein level may serve as a method to validate these less-confident SNVs, which should also aid in other efforts to benchmark variant and reference calls.28

In summary, our study has provided a comprehensive reference epigenomic, transcriptomic, and proteomic map of the two cell lines studied by SEQC-2. We further demonstrated an integrative approach to analyze epigenomic, transcriptomic, and proteomic data to extract meaningful biological observations. Finally, we anticipate that the molecular layers of omics data provided, along with the detailed characterizations and integrative analyses of these two cell lines, will serve as a valuable reference for future studies.

Limitations of the study

The study is focused on cell line models, which may not fully capture the complexity of primary tissues or clinical samples. While the incorporation of mutated SNVs into peptides highlights the relevance of integrating mutations with proteomics, the specific mutated sites vary across patients. Thus, our study demonstrates methodological relevance, but biological relevance should be analyzed using patient-derived data to ensure clinical applicability. As the original donor was female, the current study does not include male-derived samples.

Resource availability

Lead contact

Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Charles Wang (chwang@llu.edu).

Materials availability

Information for all reagents, antibodies, and cell lines can be found in the STAR Methods. Resources and reagents are available from the lead contact upon completion of appropriate material transfer agreements.

Data and code availability

  • The ATAC-seq and methyl-seq data are available at Gene Expression Omnibus (GEO) with the accession code GSE268608. The mass spectrometry proteomics data are available at ProteomeXchange with the accession code PXD052353. The scRNA-seq data are available at NCBI BioProject with the accession code PRJNA504037. All processed data are available at https://doi.org/10.6084/m9.figshare.30643529.v1.

  • The original code has been deposited at Zenodo and is publicly available at https://doi.org/10.5281/zenodo.17781895 as of the date of publication.

  • Any additional information required to reanalyze the data reported in this work is available from the lead contact upon request.

Acknowledgments

The work was funded in part by the National Institutes of Health (NIH) grants S10OD019960 (C.W.), U01DA058278 (C.W.), and R01GM133107 (X.C.), the Ardmore Institute of Health grant 2150141 (C.W.), and Dr. Charles A. Sims’ gift to LLU Center for Genomics. The work of Chunlin Xiao was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health. The authors would like to thank Ms. Diana Ho and Ms. Adriana Lopez of the LLU Center for Genomics for their administrative support, particularly in coordinating the Zoom conference calls for the project. The authors would like to thank ATCC, and particularly Liz Kerrigan, for providing the two cell lines, i.e., HCC1395 and HCC1395BL, for our study.

Author contributions

C.W. conceived and designed the overall study and provided funding. X.C. designed and funded the proteomics study. C.W. managed the project. C.N. drafted the manuscript and conducted all the bioinformatics data analyses. W.C., Z.C., and W.L. carried out genomics experiments and helped with the data analysis. L.X. carried out proteomics experiments, and J.A.W. performed the proteomics data analysis. C.X., W.J., M.M.J., A.F., and C.W. helped edit the manuscript. All authors reviewed the manuscript. C.W. revised and finalized the manuscript.

Declaration of interests

A.F. is an employee of Takara Bio USA, Inc., and W.J. is an employee of IQVIA Laboratories Genomics. The views presented in this article do not necessarily reflect the current or future opinion or policy of the US Food and Drug Administration. Any mention of commercial products is for clarification and not intended as an endorsement.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Chemicals, peptides, and recombinant proteins

Deoxyribonuclease I (DNase I) Worthington Biochemical Corporation LS002007
MgCl2 solution Sigma-Aldrich 68475
NaCl solution Sigma-Aldrich S5150
Nonidet P40 Substitute Sigma-Aldrich 11332473001
Digitonin Detergent Solution Thermo Fisher Scientific PRG9441
Dimethylformamide Sigma-Aldrich D4551

Critical commercial assays

miRNeasy kit QIAGEN 217004
NuGEN Ovation universal RNA-seq kit NuGEN 0364-A01
TruSeq methyl capture EPIC Illumina FC-151-1003
KAPA Library Quantification kit Roche KK4854
Peptide assay Thermo Fisher Scientific 23275

Deposited data

Raw ATAC-seq and Methyl-seq data This paper GSE268608
Mass spectrometry proteomics This paper PXD052353
scRNA-seq Multi-center scRNA-seq study PRJNA504037
TCGA ATAC-seq TCGA https://portal.gdc.cancer.gov/
Processed data This paper https://doi.org/10.6084/m9.figshare.30643529.v1

Experimental models: Cell lines

HCC1395 American Type Culture Collection CRL-2324
HCC1395BL American Type Culture Collection CRL-2325

Software and algorithms

trim-galore Krueger et al. https://github.com/FelixKrueger/TrimGalore
Bowtie2 Langmead and Salzberg21 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
MACS Zhang et al.16 http://samtools.sourceforge.net
Bismark Krueger et al.17 https://www.bioinformatics.babraham.ac.uk/projects/bismark/
Kallisto Bray et al.18 https://github.com/pachterlab/kallisto
STAR (v2.7.10b) Dobin et al.29 https://github.com/alexdobin/STAR
REDItools Picardi and Pesole25 https://github.com/BioinfoUNIBA/REDItools
StringTie Pertea et al.30 https://ccb.jhu.edu/software/stringtie/
Seurat Hao et al.19 https://satijalab.org/seurat/
Bedtools Quinlan et al.31 https://bedtools.readthedocs.io/en/latest/
Perseus Tyanova and Cox20 https://maxquant.org/perseus/
Source code This paper https://doi.org/10.5281/zenodo.17781895

Experimental model and study participant details

Cell lines

The HCC1395 (tumor) and HCC1395BL (matched normal B-lymphocyte) cell lines used in this study were obtained from the American Type Culture Collection (ATCC: CRL-2324 and CRL-2325, respectively). These lines were originally derived from a single consenting donor, a 43-year-old female patient diagnosed with ductal carcinoma, under protocols compliant with applicable ethical standards at the time of collection. HCC1395 cells were cultured in RPMI-1640 medium (ATCC 30–2001) supplemented with 10% fetal bovine serum (FBS). HCC1395BL cells were cultured in IMDM medium (ATCC 30–2005) supplemented with 20% FBS. Single-cell suspensions were generated by dissociating adherent cells (HCC1395) with Accutase (Innovative Cell Technologies, AT104) or by harvesting suspension cells (HCC1395BL). We passed all cells through a 30-micron MACS SmartStrainer (Miltenyi Biotec, 130-098-458) to remove cell aggregates. According to ATCC and associated documentation, these materials are provided for research use only, with no new human subjects recruited for this study, and all specimens are fully de-identified. Therefore, no institutional review board (IRB) approval was required for the use of these commercially available, anonymized cell line materials.

Method details

ATAC-seq library preparation

ATAC-seq was performed as previously described.13 Briefly, approximately 50,000 cells per sample were centrifuged at 500 × g for 5 min in a pre-chilled (4°C) fixed-angle centrifuge. After centrifugation, cell pellets were resuspended in 50 μL of cold lysis buffer (10 mM Tris-Cl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, and 0.1% IGEPAL CA-630) by pipetting up and down three times. This cell lysis reaction was incubated on ice for 3 min. After lysis, 1 mL of ATAC-seq Resuspension Buffer (RSB) containing 0.1% Tween 20 (without NP40 or digitonin) was added, and the tubes were inverted to mix. Nuclei were then centrifuged for 10 min at 500 × g in a pre-chilled (4°C) fixed-angle centrifuge. All supernatants were aspirated, and the pelleted nuclei were resuspended in 50 μL of transposition mixture. Reactions were incubated at 37°C for 30 min in a Thermomixer at 1000 RPM. Immediately following transposition, transposed DNA was purified using a Qiagen MinElute PCR purification kit and amplified for 5 cycles using NEBNext 2x MasterMix. We used 5 μL (10%) of the pre-amplified mixture in a qPCR reaction to determine the number of additional cycles needed. The remaining PCR reaction was run to the cycle number determined by qPCR. Finally, the amplified library was purified using a Qiagen MinElute PCR Purification kit.

Bulk RNA-seq library preparation

We isolated total RNA in bulk from HCC1395 and HCC1395BL cells using the miRNeasy Mini kit (QIAGEN, 217004) and built sequencing libraries using the NuGEN Ovation universal RNA-seq kit. Briefly, 100 ng of total RNA was reverse transcribed and then made into double-stranded cDNA (ds-cDNA) by the addition of a DNA polymerase. The ds-cDNA was fragmented to ∼200 bp using the Covaris S220 and then underwent end repair to blunt the ends, followed by barcoded adapter ligation. The remainder of the library preparation followed the manufacturer’s protocol.

TruSeq Methyl capture EPIC library preparation and validation

TruSeq Methyl capture libraries were prepared manually following the manufacturer’s protocol (document 1000000001643 v01, Illumina). Briefly, cell line DNA samples were quantified using the Qubit high sensitivity double-stranded DNA assay (Invitrogen). 500 ng of DNA was normalized to 52 μL of resuspension buffer (Illumina) and was sheared by sonication with a Covaris S220 to a target size of 160 bp. Illumina Sample Purification Beads (SPB, Illumina reference number, 15037172; manufactured by Beckman Coulter) were used at a DNA to beads ratio of 1:1.6 for cleanup and size selection. Insert fragment sizes were verified by a TapeStation 2200 using D1000 Screentape (Agilent Technologies). Double size selection was performed using SPB beads following the end repair. After adenylating the 3-prime end, index adapters were ligated immediately, and the ligated libraries were purified using SPB beads. Probe hybridization was carried out at 58°C for 2 h after an initial denaturing step at 95°C for 10 min. The hybridized probes were then captured using Streptavidin Magnetic Beads (Dynabeads, Invitrogen). After elution, the hybridized probes were subjected to second hybridization, second capture, second elution, and bisulfite conversion with Lightning Conversion Reagent (Zymo Research). After bisulfite conversion, the library was amplified using the Kapa HiFi Uracil kit (Kapa Biosystems Pty, Cape Town, South Africa) under the following conditions: 12 cycles of 98°C for 20 s, 60°C for 30 s, 72°C for 30 s; and 72°C for 5 min. The amplified library was purified using SPB beads.

Library quality control and sequencing

All the libraries were quantified with a TapeStation 2200 (Agilent Technologies) and Qubit 3.0 (Life Technologies). Sequencing was performed on either the NextSeq 550 or HiSeq 4000 platform using SBS reagents.

Proteomics sample preparation

Cell pellets were resuspended in 8 M Urea, 50 mM Tris-HCl pH 8.0, reduced with dithiothreitol (5 mM final) for 30 min at room temperature, and alkylated with iodoacetamide (15 mM final) for 45 min in the dark at room temperature. Samples were diluted 4-fold with 25 mM Tris-HCl pH 8.0, 1 mM CaCl2 and digested with trypsin at 1:100 (w/w, trypsin: protein) overnight at room temperature. Three biological samples were used for each condition, resulting in a total of 9 samples. Peptides were cleaned using homemade C18 stage tips, and the concentration was determined (Peptide assay, Thermo Scientific 23275). 50 μg of each sample was then used for labeling with isobaric stable tandem mass tags (TMT11-126, 127N, 127C, 128N, 128C, 129N, 129C, 130N and 131, Thermo Scientific) following the manufacturer’s instructions. The mixture of labeled peptides was fractionated using strong cation exchange beads (Source 15S, GE Healthcare). After the sample was applied, we eluted the beads sequentially with a buffer containing 25% acetonitrile, 0.05% formic acid, and 50, 125, 200, or 400 mM ammonium bicarbonate, respectively. Each SCX fraction was further separated into 8 fractions using a C18 stage tip with a buffer of 10 mM trimethylammonium bicarbonate (TMAB), pH 8.5 containing 5 to 50% acetonitrile. A couple of fractions with low peptide content were combined, resulting in a total of 28 fractions.

Mass spectrometry and proteomics

Dried peptides were dissolved in 0.1% formic acid, 2% acetonitrile in water. 0.5 μg of peptides from each fraction were analyzed on a Q-Exactive HF-X coupled with an Easy nanoLC 1200 (Thermo Fisher Scientific, San Jose, CA). Peptides were loaded on to a nanoEase MZ HSS T3 Column (100 Å, 1.8 μm, 75 μm × 250 mm, Waters). Analytical separation of all peptides was achieved with 100-min gradient. A linear gradient of 5–10% buffer B over 5 min, 10%–31% buffer B over 70 min, 31%–75% buffer B over 15 min was executed at a 300 nL/min flow rate followed by a ramp to 100%B in 1 min and 9-min wash with 100%B, where buffer A was aqueous 0.1% formic acid, and buffer B was 80% acetonitrile and 0.1% formic acid. MS experiments were also carried out in a data-dependent mode with full MS at a resolution of 120,000 followed by high energy collision-activated dissociation-MS/MS of the top 20 most intense ions with a resolution of 45,000 at m/z 200. High energy collision-activated dissociation-MS/MS was used to dissociate peptides at a normalized collision energy of 32 eV in the presence of nitrogen bath gas atoms. Dynamic exclusion was 45 s.

Quantification and statistical analysis

Mapping and quantification of ATAC-seq, RNA-seq and methyl-seq and single cell RNA-seq data

The ATAC-seq reads were mapped to the human genome (hg38) using Bowtie2.21 We used MACS to call the peaks individually in all samples. To identify consensus peaks, ATAC peaks had to be present in all three replicates in each cell line. For coverage plots, we generated bigwig files using deepTools.32
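
The consensus rule above (a peak must be present in all three replicates) can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the text does not state how "present" was decided, so the sketch assumes a peak from the first replicate is kept when it overlaps at least one peak in every other replicate. The `Peak` tuple layout, the function names, and the toy coordinates are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical peak representation: (chrom, start, end), half-open like BED.
Peak = Tuple[str, int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """True when two peaks share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def consensus_peaks(replicates: List[List[Peak]]) -> List[Peak]:
    """Keep peaks of the first replicate that overlap a peak in every other replicate.

    Assumption: 'present in all three replicates' is interpreted as any overlap;
    the source does not specify a minimum overlap fraction.
    """
    reference, others = replicates[0], replicates[1:]
    return [
        peak
        for peak in reference
        if all(any(overlaps(peak, q) for q in rep) for rep in others)
    ]


# Toy peak calls for three replicates of one cell line.
rep1 = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
rep2 = [("chr1", 250, 400), ("chr2", 100, 200)]
rep3 = [("chr1", 90, 120), ("chr1", 1100, 1300), ("chr2", 140, 160)]

consensus = consensus_peaks([rep1, rep2, rep3])
print(consensus)  # [('chr1', 100, 300), ('chr2', 50, 150)]
```

The quadratic scan is adequate for a toy example; for genome-scale peak sets an interval-tree or sorted-sweep approach would be the more practical choice.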

The RNA-seq data were analyzed using Kallisto.18 We quantified the GENCODE transcripts (hg38) in TPM using Kallisto. To identify genes/transcripts that were expressed reliably in each cell line, we defined a threshold of 0.5 TPM in all three replicates, while at least one of the replicates was required to have a minimum of 0.5 TPM.

The methyl-seq reads were trimmed using trim-galore with the command “trim_galore --non_directional --rrbs”. Trimmed reads were mapped to the genome (hg38) using bismark17 with the following parameters: “bismark --phred33 --bam --non_directional --unmapped”. Methylation calls were extracted using the following parameters: “bismark_methylation_extractor bamfile --ignore 2 --ignore_3prime 2 --comprehensive --bedGraph”. To define the consensus call, every methylation site had to be detected in all three replicates and have a minimum coverage of 3 reads. Methylation levels for each CG were defined as the ratio of the number of detected methylated CpGs to the number of detected methylated and unmethylated CpGs, which ranged between 0 and 1 (defined as the beta value). For visualization of DNA methylation levels across the genome, we built a bigwig coverage track using the mean methylation value from three replicates of a given cell line.
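
As a rough illustration of the consensus call and beta value defined above, the following Python sketch computes a per-site mean beta value across replicates. It rests on two assumptions that the text does not spell out: the minimum coverage of 3 reads is applied to each replicate separately, and the per-cell-line value is the unweighted mean of the replicate beta values. The dictionary layout and all names are hypothetical and do not correspond to Bismark output files.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]    # hypothetical key: (chromosome, position)
Counts = Tuple[int, int]  # (methylated calls, unmethylated calls)

MIN_COVERAGE = 3  # minimum coverage stated in the text


def beta(counts: Counts) -> float:
    """Beta value: methylated / (methylated + unmethylated), between 0 and 1."""
    meth, unmeth = counts
    return meth / (meth + unmeth)


def consensus_beta(replicates: List[Dict[Site, Counts]]) -> Dict[Site, float]:
    """Mean beta per site, restricted to sites covered in every replicate.

    Assumption: the coverage threshold is applied per replicate.
    """
    shared = set(replicates[0])
    for rep in replicates[1:]:
        shared &= set(rep)
    result = {}
    for site in sorted(shared):
        per_rep = [rep[site] for rep in replicates]
        if all(m + u >= MIN_COVERAGE for m, u in per_rep):
            result[site] = sum(beta(c) for c in per_rep) / len(per_rep)
    return result


# Toy counts for three replicates.
rep1 = {("chr1", 100): (3, 1), ("chr1", 200): (1, 1), ("chr1", 300): (0, 5)}
rep2 = {("chr1", 100): (2, 2), ("chr1", 200): (4, 0), ("chr1", 300): (1, 4)}
rep3 = {("chr1", 100): (4, 0), ("chr1", 300): (0, 3)}

betas = consensus_beta([rep1, rep2, rep3])
for site, value in betas.items():
    print(site, round(value, 3))
```

In this toy input, the site at position 200 is dropped because it is missing from the third replicate, while the other two sites pass both filters.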

The single cell RNA-seq data of HCC1395 and HCC1395BL were downloaded.6 Single cell RNA data were analyzed using the standard Seurat package.19 For comparing the expression level of genes with CGI and non-CGI promoters, we only compared the expression across a cell if the gene was expressed in that cell.

All statistical analysis and plotting were conducted in R version 4.3.1. Adjusted p-values <0.05 were considered statistically significant.

Classification of CpG islands and observed/expected CG ratio

We downloaded annotated CpG islands (CGIs) from the UCSC database.33 We intersected CGIs with promoter regions of gene transcripts and assigned overlapping CGIs as promoter CGIs. The remaining CGIs were classified as intragenic CGIs if they overlapped gene bodies or otherwise as intergenic CGIs. The ratio of observed/expected (O/E) CG dinucleotides for a given genomic window was calculated using the following formula: O/E CG = (CG count × genomic window length)/(C count × G count). We used the “bedtools nuc” function from Bedtools31 to measure these values and computed the O/E CG ratio for each genomic window.
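
The O/E CG ratio can be illustrated with a short Python function that works directly on a sequence string. This is a stand-in for explanation only: it does not reproduce the Bedtools-based workflow described above, and the function name and the convention of returning 0.0 for windows lacking C or G are assumptions made here.

```python
def observed_expected_cg(seq: str) -> float:
    """O/E CG = (CG count * window length) / (C count * G count).

    Returns 0.0 when the window contains no C or no G (an assumed convention
    to avoid division by zero).
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    cg_count = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0
    return cg_count * len(seq) / (c_count * g_count)


# Toy windows: a CG-dense sequence and an AT-only sequence.
print(observed_expected_cg("CGCGCGCG"))  # 2.0
print(observed_expected_cg("AATTAATT"))  # 0.0
print(round(observed_expected_cg("ACGTCGCG"), 3))  # 2.667
```

Very short windows inflate the ratio, as the toy values show, which is one reason the calculation is usually applied to windows of a fixed, reasonably large size.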

Visualization of metaplots across genes and CGIs

All metaplots across genes and CGIs were plotted using deepTools2.32 From mapped BAM files, we first generated coverage as Bigwig tracks, which were normalized as RPKM using default parameters from deepTools. For the genes and CGI metaplots, the variable lengths of genes and CGIs were scaled between start and end.

Raw proteomics data processing and analysis

Peptide identification and quantification with tandem mass tags (TMT) reporter ions was performed using MaxQuant26 software version 1.6.0.16 (Max Planck Institute, Germany). Protein database searches were performed against the UniProt human protein sequence database (UP000005640). The false discovery rate (FDR) for both peptide-spectrum match (PSM) and protein assignment was set at 1%. Search parameters included up to two missed cleavages at Lys/Arg on the sequence, with oxidation of methionine and protein N-terminal acetylation as dynamic modifications. Carbamidomethylation of cysteine residues was considered as a static modification. Peptide identifications were reported after filtering out reverse and contaminant entries and were assigned to their leading razor protein. Data processing and statistical analysis were performed in Perseus34 (Version 1.6.0.7). Protein quantitation was performed on biological replicates, and a two-sample t test with a p value threshold of 0.05 was used to identify statistically significant differences in protein abundance.

Proteogenomic mapping of peptides to genomic coordinates

To map peptides identified in the peptide spectrum to their genomic location, we constructed a custom FASTA protein database. To the header for each protein sequence entry, we added the genomic mapping information that contained genomic coordinates for the protein sequence, including the start and end, and CDS (coding sequences) start coordinates (relative to start of genome start position for the protein) and CDS lengths. This header information formatting was based on the proteogenomic data integration tool QUILTS.35 We downloaded protein-coding transcript translation sequences (gencode.v32.pc_translations.fa) and the GFF3 comprehensive gene annotation (gencode.v32.annotation.gff3) from GENCODE36 for GENCODE human Release 32 (GRCh38.p13). The genomic mapping information for each protein entry in the FASTA database was obtained from entries in the gene annotation. Knowing the position of the peptide in the protein sequence, and the mapping information of the protein sequence relative to the genome, allowed each peptide in the MaxQuant search to be mapped to the genome. A BED file, containing all the peptides with their mapping information, enabled display of the peptides in UCSC genome browser.
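
The coordinate arithmetic behind this mapping can be sketched as follows. This is a simplified illustration and not the QUILTS header format or the authors' code: it assumes 0-based, half-open CDS blocks that together encode the protein from its first codon, ignores incomplete codon phases and the stop codon, and uses hypothetical names throughout.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # 0-based, half-open genomic interval


def peptide_to_genome(cds_blocks: List[Block], strand: str,
                      pep_start_aa: int, pep_len_aa: int) -> List[Block]:
    """Map a peptide (0-based start within the protein) to genomic blocks.

    A peptide spanning a splice junction yields more than one block, which is
    what the sub-blocks of a BED record would describe.
    """
    # Walk the CDS blocks in the direction of translation.
    ordered = sorted(cds_blocks, reverse=(strand == "-"))
    nt_start = 3 * pep_start_aa
    nt_end = nt_start + 3 * pep_len_aa
    mapped = []
    offset = 0  # coding nucleotides consumed by earlier blocks
    for start, end in ordered:
        length = end - start
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + length)
        if lo < hi:
            if strand == "+":
                mapped.append((start + (lo - offset), start + (hi - offset)))
            else:
                mapped.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    return sorted(mapped)


# Toy transcript with two coding exons (30 nt + 60 nt = 30 codons).
cds = [(100, 130), (200, 260)]
print(peptide_to_genome(cds, "+", 8, 4))   # [(124, 130), (200, 206)]
print(peptide_to_genome(cds, "-", 0, 4))   # [(248, 260)]
print(peptide_to_genome(cds, "-", 18, 4))  # [(124, 130), (200, 206)]
```

In the first and third calls the peptide crosses the exon boundary, so the result contains two blocks, one on each side of the intron.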

To map peptides to somatic missense amino acid substitutions, we downloaded the HCC1395 somatic SNV call set (SNV.MSDUKT.superSet.v1.1.vcf.gz) from https://ftp-trace.ncbi.nlm.nih.gov/seqc/ftp/Somatic_Mutation_WG/release/v1.1/ and used it to produce a list of somatic mutations with which to screen our proteomic data. We divided the Variant Call Format (VCF) file into two files: (1) a truth set of somatic mutations and (2) a non-truth set of somatic mutations. The truth set contains VCF entries with a PASS label denoting high-confidence somatic calls as defined by SEQC-2, while the non-truth set contains the remaining VCF entries. We ran the Ensembl Variant Effect Predictor37 (VEP) tool on these 2 VCF files and retained only missense variants in protein-coding regions. We then made separate FASTA databases for the truth and non-truth sets of missense variants. These FASTA databases were constructed using gene annotations as was done for the reference FASTA database, except that the variant data were used to modify the protein sequence to include missense amino acid changes. The position of the variation was maintained in the FASTA header following the format used in QUILTS.
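
To make the idea of a variant protein entry concrete, the sketch below applies a single missense substitution to a protein sequence and reports the tryptic peptide that would carry it. Everything here is illustrative: the sequence is a made-up toy string, the digestion rule (cleave after K or R, no missed cleavages, no proline exception) is a simplification chosen for this example, and the function names are hypothetical rather than part of any tool named in the text.

```python
from typing import List, Tuple


def apply_missense(protein: str, pos: int, ref: str, alt: str) -> str:
    """Return the protein with residue `pos` (1-based) changed from ref to alt."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]


def tryptic_peptides(protein: str) -> List[Tuple[int, str]]:
    """Simplified digest: cleave after every K or R; returns (1-based start, peptide)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append((start + 1, protein[start:i + 1]))
            start = i + 1
    if start < len(protein):
        peptides.append((start + 1, protein[start:]))
    return peptides


def variant_peptide(protein: str, pos: int, ref: str, alt: str) -> str:
    """Tryptic peptide of the mutated protein that contains the substituted residue."""
    mutated = apply_missense(protein, pos, ref, alt)
    for start, peptide in tryptic_peptides(mutated):
        if start <= pos < start + len(peptide):
            return peptide
    raise ValueError("position outside protein")


toy_protein = "MAGKSFLEVRTPQK"  # made-up sequence, not a real protein
print(tryptic_peptides(toy_protein))
print(variant_peptide(toy_protein, 6, "F", "L"))  # SLLEVR
```

A substitution that creates or removes K or R would change the cleavage pattern itself, which the digest of the mutated sequence handles automatically.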

We conducted a MaxQuant search using the reference FASTA protein database and the 2 SNV protein databases containing the somatic missense amino acid substitutions. The MaxQuant output file (peptides.txt) lists each peptide identified as a peptide spectrum match by the search and information related to its associated FASTA header(s) for the protein sequence(s) in which it can be found. With this it was possible to identify both peptides from the reference as well as variant peptides containing missense substitutions. We determined the position of the peptide in the protein sequence and used the mapping information contained in the FASTA header to produce a header track line in browser extensible data (BED) files that maps the peptide to the genome. Some peptides mapped to more than one genomic location. Each genomic location (chromosome name, start coordinate, stop coordinate, strand, and any sub-blocks for peptides spanning introns) serves as an address for a peptide. A BED file was produced for the mapping location of the peptides, which can be visualized in a genome browser.

A-to-I RNA editing detection

Raw RNA-seq reads from HCC1395 and HCC1395BL cell lines were aligned to the human genome (hg38) using STAR (v2.7.10b).29 Only uniquely mapped reads were retained for downstream analysis. A-to-I editing events were identified using REDItools25 with default parameters. Known SNPs from the cell lines were filtered out using high-confidence mutations.1,5 Output from REDItools was further filtered to retain only A→G mismatches on the sense strand and T→C mismatches on the antisense strand, both of which were considered indicative of A-to-I editing. Events with editing frequency ≥10% were retained. The final set of high-confidence A-to-I RNA edits per replicate was summarized, and the total number of events was plotted as a grouped bar plot.
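
The post-filtering logic can be sketched in Python as below. The record layout is hypothetical and is not the REDItools output format; the sketch also assumes that "sense strand" corresponds to a "+" strand annotation (A→G relative to the reference) and "antisense" to "-" (T→C relative to the reference), and that known variants have already been flagged per site.

```python
from typing import Dict, List

MIN_EDIT_FREQ = 0.10  # editing frequency threshold stated in the text


def is_a_to_i(site: Dict) -> bool:
    """Keep A>G on '+' or T>C on '-', at >=10% frequency, excluding known variants."""
    expected = ("A", "G") if site["strand"] == "+" else ("T", "C")
    return (
        (site["ref"], site["alt"]) == expected
        and site["frequency"] >= MIN_EDIT_FREQ
        and not site["known_snv"]
    )


def filter_editing_sites(sites: List[Dict]) -> List[Dict]:
    return [s for s in sites if is_a_to_i(s)]


# Toy candidate sites (hypothetical fields and values).
candidates = [
    {"chrom": "chr1", "pos": 1000, "strand": "+", "ref": "A", "alt": "G", "frequency": 0.25, "known_snv": False},
    {"chrom": "chr1", "pos": 2000, "strand": "-", "ref": "T", "alt": "C", "frequency": 0.12, "known_snv": False},
    {"chrom": "chr1", "pos": 3000, "strand": "+", "ref": "A", "alt": "G", "frequency": 0.05, "known_snv": False},
    {"chrom": "chr2", "pos": 500, "strand": "+", "ref": "C", "alt": "T", "frequency": 0.40, "known_snv": False},
    {"chrom": "chr2", "pos": 900, "strand": "-", "ref": "T", "alt": "C", "frequency": 0.50, "known_snv": True},
    {"chrom": "chr3", "pos": 700, "strand": "+", "ref": "T", "alt": "C", "frequency": 0.30, "known_snv": False},
]

edits = filter_editing_sites(candidates)
print([(s["chrom"], s["pos"]) for s in edits])  # [('chr1', 1000), ('chr1', 2000)]
```

Counting the retained sites per replicate would then give the values summarized in a grouped bar plot.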

Published: January 14, 2026

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2025.101276.

Contributor Information

Xian Chen, Email: xian_chen@med.unc.edu.

Charles Wang, Email: chwang@llu.edu.

Supplemental information

Document S1. Figures S1–S5
mmc1.pdf (1.2MB, pdf)
Table S1. Sequencing read depth and mapping statistics for ATAC-seq, RRBS, and RNA-seq datasets, related to Figure 1
mmc2.xlsx (10.2KB, xlsx)
Table S2. Genomic coordinates and peak intensities of chromatin-accessible regions in the HCC1395 cell line, related to Figures 2 and 4
mmc3.xlsx (7.7MB, xlsx)
Table S3. Genomic coordinates and peak intensities of chromatin-accessible regions in the HCC1395BL cell line, related to Figures 2 and 4
mmc4.xlsx (7.7MB, xlsx)
Table S4. Genomic coordinates of methylated CpG regions and corresponding beta values across three HCC1395 replicates, related to Figures 3 and 4
mmc5.zip (11.5MB, zip)
Table S5. Genomic coordinates of methylated CpG regions and corresponding beta values across three HCC1395BL replicates, related to Figures 3 and 4
mmc6.zip (12.2MB, zip)
Table S6. Transcript-level expression values across three HCC1395 replicates, including annotation of representative vs. alternative transcripts, related to Figure 5
mmc7.xlsx (1.7MB, xlsx)
Table S7. Transcript-level expression values across three HCC1395BL replicates, including annotation of representative vs. alternative transcripts, related to Figure 5
mmc8.xlsx (1.7MB, xlsx)
Table S8. Genomic locations of A-to-I RNA-editing sites in each replicate of HCC1395 and HCC1395BL cell lines, related to Figure 5
mmc9.xlsx (18KB, xlsx)
Table S9. Quantification of detected peptides across three replicates of HCC1395 and HCC1395BL cell lines, related to Figure 7
mmc10.xlsx (1.1MB, xlsx)
Table S10. Mutated peptides identified in HCC1395 and HCC1395BL cell lines, related to Figure 7
mmc11.xlsx (10.2KB, xlsx)
Document S2. Article plus supplemental information
mmc12.pdf (4.3MB, pdf)

References

  • 1. Fang L.T., Zhu B., Zhao Y., Chen W., Yang Z., Kerrigan L., Langenbach K., de Mars M., Lu C., Idler K., et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat. Biotechnol. 2021;39:1151–1160. doi: 10.1038/s41587-021-00993-6.
  • 2. Zhao Y., Fang L.T., Shen T.W., Choudhari S., Talsania K., Chen X., Shetty J., Kriga Y., Tran B., Zhu B., et al. Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study. Sci. Data. 2021;8:296. doi: 10.1038/s41597-021-01077-5.
  • 3. Xiao W., Ren L., Chen Z., Fang L.T., Zhao Y., Lack J., Guan M., Zhu B., Jaeger E., Kerrigan L., et al. Toward best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nat. Biotechnol. 2021;39:1141–1150. doi: 10.1038/s41587-021-00994-5.
  • 4. Talsania K., Shen T.W., Chen X., Jaeger E., Li Z., Chen Z., Chen W., Tran B., Kusko R., Wang L., et al. Structural variant analysis of a cancer reference cell line sample using multiple sequencing technologies. Genome Biol. 2022;23:255. doi: 10.1186/s13059-022-02816-6.
  • 5. Xiao C., Chen Z., Chen W., Padilla C., Colgan M., Wu W., Fang L.T., Liu T., Yang Y., Schneider V., et al. Personalized genome assembly for accurate cancer somatic mutation discovery using tumor-normal paired reference samples. Genome Biol. 2022;23:237. doi: 10.1186/s13059-022-02803-x.
  • 6. Chen W., Zhao Y., Chen X., Yang Z., Xu X., Bi Y., Chen V., Li J., Choi H., Ernest B., et al. A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. Nat. Biotechnol. 2021;39:1103–1114. doi: 10.1038/s41587-020-00748-9.
  • 7. Chen X., Yang Z., Chen W., Zhao Y., Farmer A., Tran B., Furtak V., Moos M., Jr., Xiao W., Wang C. A multi-center cross-platform single-cell RNA sequencing reference dataset. Sci. Data. 2021;8:39. doi: 10.1038/s41597-021-00809-x.
  • 8. Gazdar A.F., Kurvari V., Virmani A., Gollahon L., Sakaguchi M., Westerfield M., Kodagoda D., Stasny V., Cunningham H.T., Wistuba I.I., et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Int. J. Cancer. 1998;78:766–774. doi: 10.1002/(sici)1097-0215(19981209)78:6<766::aid-ijc15>3.0.co;2-l.
  • 9. Kao J., Salari K., Bocanegra M., Choi Y.L., Girard L., Gandhi J., Kwei K.A., Hernandez-Boussard T., Wang P., Gazdar A.F., et al. Molecular profiling of breast cancer cell lines defines relevant tumor models and provides a resource for cancer gene discovery. PLoS One. 2009;4 doi: 10.1371/journal.pone.0006146.
  • 10. Mattei A.L., Bailly N., Meissner A. DNA methylation: a historical perspective. Trends Genet. 2022;38:676–707. doi: 10.1016/j.tig.2022.03.010.
  • 11. Yan F., Powell D.R., Curtis D.J., Wong N.C. From reads to insight: a hitchhiker's guide to ATAC-seq data analysis. Genome Biol. 2020;21:22. doi: 10.1186/s13059-020-1929-3.
  • 12. Greer E.L., Shi Y. Histone methylation: a dynamic mark in health, disease and inheritance. Nat. Rev. Genet. 2012;13:343–357. doi: 10.1038/nrg3173.
  • 13. Buenrostro J.D., Giresi P.G., Zaba L.C., Chang H.Y., Greenleaf W.J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods. 2013;10:1213–1218. doi: 10.1038/nmeth.2688.
  • 14. Gu H., Smith Z.D., Bock C., Boyle P., Gnirke A., Meissner A. Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nat. Protoc. 2011;6:468–481. doi: 10.1038/nprot.2010.190.
  • 15. Mertins P., Mani D.R., Ruggles K.V., Gillette M.A., Clauser K.R., Wang P., Wang X., Qiao J.W., Cao S., Petralia F., et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature. 2016;534:55–62. doi: 10.1038/nature18003.
  • 16. Zhang Y., Liu T., Meyer C.A., Eeckhoute J., Johnson D.S., Bernstein B.E., Nusbaum C., Myers R.M., Brown M., Li W., Liu X.S. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9 doi: 10.1186/gb-2008-9-9-r137.
  • 17. Krueger F., Andrews S.R. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27:1571–1572. doi: 10.1093/bioinformatics/btr167.
  • 18. Bray N.L., Pimentel H., Melsted P., Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016;34:525–527. doi: 10.1038/nbt.3519.
  • 19. Hao Y., Stuart T., Kowalski M.H., Choudhary S., Hoffman P., Hartman A., Srivastava A., Molla G., Madad S., Fernandez-Granda C., Satija R. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 2024;42:293–304. doi: 10.1038/s41587-023-01767-y.
  • 20. Tyanova S., Cox J. Perseus: A Bioinformatics Platform for Integrative Analysis of Proteomics Data in Cancer Research. Methods Mol. Biol. 2018;1711:133–148. doi: 10.1007/978-1-4939-7493-1_7.
  • 21. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
  • 22. Corces M.R., Granja J.M., Shams S., Louie B.H., Seoane J.A., Zhou W., Silva T.C., Groeneveld C., Wong C.K., Cho S.W., et al. The chromatin accessibility landscape of primary human cancers. Science. 2018;362 doi: 10.1126/science.aav1898.
  • 23. Lokk K., Modhukur V., Rajashekar B., Märtens K., Mägi R., Kolde R., Koltšina M., Nilsson T.K., Vilo J., Salumets A., Tõnisson N. DNA methylome profiling of human tissues identifies global and tissue-specific methylation patterns. Genome Biol. 2014;15 doi: 10.1186/gb-2014-15-4-r54.
  • 24. Nepal C., Andersen J.B. Alternative promoters in CpG depleted regions are prevalently associated with epigenetic misregulation of liver cancer transcriptomes. Nat. Commun. 2023;14:2712. doi: 10.1038/s41467-023-38272-4.
  • 25. Picardi E., Pesole G. REDItools: high-throughput RNA editing detection made easy. Bioinformatics. 2013;29:1813–1814. doi: 10.1093/bioinformatics/btt287.
  • 26. Cox J., Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008;26:1367–1372. doi: 10.1038/nbt.1511.
  • 27. Zhuo X., Du A.Y., Pehrsson E.C., Li D., Wang T. Epigenomic differences in the human and chimpanzee genomes are associated with structural variation. Genome Res. 2021;31:279–290. doi: 10.1101/gr.263491.120.
  • 28. Zook J.M., McDaniel J., Olson N.D., Wagner J., Parikh H., Heaton H., Irvine S.A., Trigg L., Truty R., McLean C.Y., et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 2019;37:561–566. doi: 10.1038/s41587-019-0074-6.
  • 29. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
  • 30. Pertea M., Pertea G.M., Antonescu C.M., Chang T.C., Mendell J.T., Salzberg S.L. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.
  • 31. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
  • 32. Ramirez F., Ryan D.P., Gruning B., Bhardwaj V., Kilpert F., Richter A.S., Heyne S., Dundar F., Manke T. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44:W160–W165. doi: 10.1093/nar/gkw257.
  • 33. Nassar L.R., Barber G.P., Benet-Pagès A., Casper J., Clawson H., Diekhans M., Fischer C., Gonzalez J.N., Hinrichs A.S., Lee B.T., et al. The UCSC Genome Browser database: 2023 update. Nucleic Acids Res. 2023;51:D1188–D1195. doi: 10.1093/nar/gkac1072.
  • 34. Tyanova S., Temu T., Sinitcyn P., Carlson A., Hein M.Y., Geiger T., Mann M., Cox J. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods. 2016;13:731–740. doi: 10.1038/nmeth.3901.
  • 35. Ruggles K.V., Tang Z., Wang X., Grover H., Askenazi M., Teubl J., Cao S., McLellan M.D., Clauser K.R., Tabb D.L., et al. An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer. Mol. Cell. Proteomics. 2016;15:1060–1071. doi: 10.1074/mcp.M115.056226.
  • 36. Frankish A., Diekhans M., Jungreis I., Lagarde J., Loveland J.E., Mudge J.M., Sisu C., Wright J.C., Armstrong J., Barnes I., et al. Gencode 2021. Nucleic Acids Res. 2021;49:D916–D923. doi: 10.1093/nar/gkaa1087.
  • 37. McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R.S., Thormann A., Flicek P., Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.

Data Availability Statement

  • The ATAC-seq and methyl-seq data are available at Gene Expression Omnibus (GEO) with the accession code GSE268608. The mass spectrometry proteomics data are available at ProteomeXchange with the accession code PXD052353. The scRNA-seq data are available at NCBI BioProject with the accession code PRJNA504037. All processed data are available at https://doi.org/10.6084/m9.figshare.30643529.v1.

  • The original code has been deposited at Zenodo and is publicly available at https://doi.org/10.5281/zenodo.17781895 as of the date of publication.

  • Any additional information required to reanalyze the data reported in this work is available from the lead contact upon request.


Articles from Cell Reports Methods are provided here courtesy of Elsevier
