The ENCODE Portal as an Epigenomics Resource

Jennifer Jou; Idan Gabdank; Yunhai Luo; Khine Lin; Paul Sud; Zachary Myers; Jason A Hilton; Meenakshi S Kagda; Bonita Lam; Emma O'Neill; Philip Adenekan; Keenan Graham; Ulugbek K Baymuradov; Stuart R Miyasato; J Seth Strattan; Otto Jolanki; Jin‐Wook Lee; Casey Litton; Forrest Y Tanaka; Benjamin C Hitz; J Michael Cherry

doi:10.1002/cpbi.89

2019 Nov 21;68(1):e89. doi: 10.1002/cpbi.89

PMCID: PMC7307447 NIHMSID: NIHMS1058769 PMID: 31751002

Abstract

The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, the NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to these data and the relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human‐ and machine‐readable form and enables the user to search for specific data either using a web browser or programmatically via the REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors.

Basic Protocol: Query the portal

Support Protocol 1: Batch downloading

Support Protocol 2: Using the cart to download files

Support Protocol 3: Visualize data

Alternate Protocol: Query building and programmatic access

Keywords: database, ENCODE, epigenetics, human genome, regulatory elements

INTRODUCTION

The Encyclopedia of DNA Elements (ENCODE; https://www.encodeproject.org/) web portal hosts genomic data generated by the ENCODE Consortium, the Genomics of Gene Regulation project, the NIH Roadmap Epigenomics Consortium (Bujold et al., 2016; Kundaje et al., 2015), and the Model organism ENCODE (modENCODE) and modERN projects (Gerstein et al., 2010; Roy et al., 2010; Stamatoyannopoulos et al., 2012). The ENCODE project has the goal of identifying all functional elements in the human and mouse genomes. The ENCODE portal (Davis et al., 2018; Hitz et al., 2017; Hong et al., 2016; Sloan et al., 2016) serves as the canonical source of ENCODE data, and is actively maintained by the ENCODE Data Coordination Center (DCC) to update the relevant experimental data and metadata and provide visualization and analysis tools for the scientific community. For this reason, researchers seeking to use ENCODE data should always use the ENCODE portal to ensure that they get the most up‐to‐date analysis results for their experiments, as well as metadata about data provenance and experimental methods. However, to facilitate data accessibility and findability, ENCODE data are also deposited to other repositories such as GEO (Barrett et al., 2013) and EpiRR (https://www.ebi.ac.uk/vg/epirr).

The protocols covered in this article will allow the scientific community to rapidly find and utilize the open‐access data resources provided by the ENCODE Consortium. The ENCODE portal provides various methods allowing users to search for and download data using a web browser. The Basic Protocol demonstrates how to navigate the ENCODE portal website to retrieve experiment records relevant to one's interests, as well as interpret the presented metadata.

Three support protocols describe additional features available on the portal: Support Protocol 1 describes how to download multiple files from a search result page containing one or more datasets; Support Protocol 2 shows how to save any experiment to a cart, from which files can be downloaded; and Support Protocol 3 reviews methods for rapid visualization of signal and peak tracks, using either the embedded Valis genome browser (https://valis.bio/) directly on the portal website or external resources such as the UCSC Genome Browser (Kent et al., 2002).

The Alternate Protocol presents a method, independent of the portal website, producing the same search results as the Basic Protocol by directly interacting with the REST API, allowing programmatic access to the database.
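As a minimal sketch of what programmatic access can look like, the snippet below prepares an HTTP request that asks the portal for JSON rather than a web page. The helper names (`build_request`, `fetch_json`) are hypothetical, and the `Accept: application/json` header is an assumption about how the REST API selects its response format; it is not stated in this article. Only the Python standard library is used, and nothing is sent over the network until `fetch_json` is called.

```python
import json
import urllib.request

PORTAL = "https://www.encodeproject.org"  # portal base URL given in this article


def build_request(path: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for a portal path.

    Assumption: the portal answers with JSON when the client sends
    an "Accept: application/json" header.
    """
    url = PORTAL + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch_json(path: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(build_request(path), timeout=timeout) as response:
        return json.load(response)


# Building the request is free of side effects and safe to inspect offline.
request = build_request("/experiments/ENCSR670JDQ/")
print(request.full_url)

# Hypothetical usage, left commented out because it needs network access:
# record = fetch_json("/experiments/ENCSR670JDQ/")
```

Keeping request construction separate from the network call makes it easy to check the exact URL and headers before querying the portal.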

QUERY THE PORTAL

Both the Matrix and the Search pages provide an intuitive method to perform metadata‐based searches for ENCODE datasets in the form of facet filters, which are categorized lists of commonly used experimental metadata.

Necessary Resources

Hardware

Computer with internet access

Software

Up‐to‐date web browser (Chrome, Microsoft Edge, Firefox, Safari)

Use the matrix to navigate to a search page

1
Navigate to the ENCODE portal home page at https://www.encodeproject.org. A clickable widget labeled “Help” automatically loads in the lower right corner of the browser window; it contains links to interactive tutorials, frequently asked questions, and other documentation on topics not addressed in this article. Users are encouraged to explore the widget for additional help with the portal.
2
In the toolbar along the top of the page, click “Data.” This opens a drop‐down menu with multiple options (Fig. 1).

The menu options in the toolbar provide access to key resources on the portal, including the Experiment matrix page and a link back to the home page. The Data drop‐down menu also has links to the Search and Summary pages, which present the same data as the Experiment matrix page but in different layouts. The Encyclopedia menu contains information about and links to integrative‐level annotations generated using ENCODE data. The Materials & Methods menu links to information about experimental components used, data‐processing methods, and portal‐organization methodology. The Help menu contains links to information about ENCODE, portal usage, past ENCODE workshops, and contact information. A Cart menu also appears here once experiments have been added to the cart; the cart is discussed in Support Protocol 2.

Figure 1. The ENCODE home page. This image shows the Data drop‐down menu in the toolbar opened. The first item in the menu is a link to the Experiment Matrix page.

3
In the drop‐down menu, click “Experiment Matrix” to navigate to the Experiment matrix page (Fig. 2).
1. The Experiment matrix page lists biosample types (the biological material used, such as a cell culture or tissue sample) along the y axis and assays along the x axis. Each cell indicates the number of experiments for that combination of assay and biosample type.
2. Only a subsection of the matrix is visible upon loading the page. To view more, click the arrows along the left side of the biosample category headers to expand the categories and reveal an extended list of available biosample types. With the mouse cursor hovering on top of the matrix, scroll horizontally to view more available assays.

Figure 2. The Experiment Matrix page displays available ENCODE data in a matrix with biosample types and assays as the axes. Each cell of the matrix is clickable and leads to a list of experiments matching the given combination of biosample type and assay.

4
Click on the cell for transcription factor ChIP‐seq (abbreviated as TF ChIP‐seq) experiments on K562. As of September 2019, the ENCODE portal had 530 experiments in this group. This link leads to a search page with a list of the 530 experiments, shown in Figure 3.

By default, only experiments with the status “released” are included in a search. Explanations of the meaning of different experimental statuses are available in Table 1.

Figure 3. The Experiment search page displays ENCODE data as a list of search results. Each experiment is shown with a brief summary of the biological material and assay name, and a link to its individual experiment summary page with more metadata details (see Fig. 5). On the left is the facet sidebar, which can be used to modify and refine the search results. The “Add all items to cart” button is a Cart function, explained further in Support Protocol 2.

Table 1.

Dataset Statuses^a

Status	Meaning
Released	Publicly available datasets are marked with the “released” status. Datasets become publicly available after automatic and manual review to make sure they meet the standards and do not have data or metadata issues and inconsistencies. This status is selected by default when visiting all search views, including the Matrix page (refer to Basic Protocol, step 2).
Revoked	An error was found with the experiment after it became publicly available, so the status was changed to “revoked” to indicate that caution should be exercised before using the data. Some examples of errors are: (1) the data was not compliant with the ENCODE quality requirements; (2) an issue was discovered with experimental elements (antibody, biosample, etc.).
Archived	The dataset was superseded by another dataset that has higher quality, was collected and/or processed with newer technology, etc. The ENCODE DCC encourages use of the superseding experiment instead of the archived one.

^aAdditional information is available at https://www.encodeproject.org/help/getting‐started/status‐terms/.
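The statuses in Table 1 can also be handled in a script once records have been retrieved. The sketch below groups records by status and attaches the guidance from Table 1. The record layout (dictionaries with `accession` and `status` keys), the lowercase status strings, and the placeholder accessions are assumptions made for illustration, modeled on the `status=released` parameter that appears in portal search URLs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Guidance paraphrased from Table 1. The lowercase spellings of "revoked" and
# "archived" are assumptions; only "released" appears in the article's URLs.
STATUS_GUIDANCE = {
    "released": "publicly available after automatic and manual review",
    "revoked": "error found after release; use with caution",
    "archived": "superseded; prefer the superseding dataset",
}


def group_by_status(records: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Map each status to the accessions of the records that carry it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        groups[record["status"]].append(record["accession"])
    return dict(groups)


# Hypothetical records, for illustration only (placeholder accessions).
records = [
    {"accession": "EXAMPLE001", "status": "released"},
    {"accession": "EXAMPLE002", "status": "archived"},
    {"accession": "EXAMPLE003", "status": "released"},
]

for status, accessions in group_by_status(records).items():
    note = STATUS_GUIDANCE.get(status, "status not covered by Table 1")
    print(f"{status}: {accessions} ({note})")
```

Grouping before analysis makes it harder to mix archived or revoked datasets into a result set by accident.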

Filter search results using facets

5
The sidebar on the left side of the search page is populated with facets, which allow users to filter search results by different properties. Locate the “Assay title” facet and observe that the facet term “TF ChIP‐seq” is highlighted in blue, indicating that the search results have been filtered for experiments with an assay title of TF ChIP‐seq.
1. Clicking a cell in the “TF ChIP‐seq” column on the Experiment matrix page (see step 4) automatically selects the “TF ChIP‐seq” facet term.
2. In general, selecting a facet term applies that term as a filter, and automatically updates the displayed search results.
6
Scroll further down on the page and locate the “Biosample term name” facet. Observe that “K562” is already selected, indicating that the search results have been filtered for experiments with a biosample term name of K562.
1. Clicking a cell in the “K562” row on the Experiment matrix page (see step 4) automatically selects the “K562” facet term.
2. Selections can be made in more than one facet at a time. When such a selection is made, the combined filters possess an AND relationship. For example, the selection of TF ChIP‐seq in the “Assay title” facet and K562 in the “Biosample term name” facet returns only experiments that are TF ChIP‐seq assays AND use K562 cells as the biosample.
7
In the type‐ahead search box below the “Biosample term name” label, type DND‐41. The list of terms below the search box will be dynamically filtered.

Because there are many biosample types to choose from, a type‐ahead search is available for this facet to help speed up the search process. This also applies to other facets with many terms, such as “Target of assay.”
8
Click “DND‐41” in the “Biosample term name” facet.
1. More than one facet term can be selected in a single facet simultaneously. Multiple selections in one facet represent an OR relationship between the selected facet terms. In this example, the selection of “K562” and “DND‐41” terms means the returned experiments may be on K562 or DND‐41 cells.
2. The URL for the current search is https://www.encodeproject.org/search/?type=Experiment&status=released&assay_title=TF+ChIP-seq&biosample_ontology.term_name=K562&biosample_ontology.term_name=DND-41.
9
Click “DND‐41” in the facet term list a second time to remove the filter for DND‐41 biosamples.
1. Users can also click the “DND‐41” link at the top of the facet after the words “Selected filters” to remove the filter. Both methods have the same result.
2. The URL for the current search is https://www.encodeproject.org/search/?type=Experiment&status=released&assay_title=TF+ChIP-seq&biosample_ontology.term_name=K562.
10
Locate the “Target category” facet. Scroll through the list of terms and click on “cohesin.”
1. Targets have been categorized based on their Gene Ontology annotations in accordance with the methods described at https://www.encodeproject.org/target‐categorization/.
2. The URL for the current search is https://www.encodeproject.org/search/?type=Experiment&status=released&assay_title=TF+ChIP-seq&biosample_ontology.term_name=K562&target.investigated_as=cohesin.
11
Locate the “Target of assay” facet. Hover the cursor over the “RAD21” facet term so that a red icon appears to the right. Click on the red icon to exclude experiments targeting RAD21.
1. The “Target of assay” most commonly applies to assays that utilize immunoprecipitation in their protocol, such as ChIP‐seq. In this context, the “target” refers to the DNA‐binding molecule targeted by the antibody for precipitation. However, the target property is sometimes also used for knockdown experiments or assays on genetically modified biosamples, such as shRNA RNA‐seq, in which case it refers to the gene target of the knockdown or modification.
2. The URL for the current search is https://www.encodeproject.org/search/?type=Experiment&status=released&assay_title=TF+ChIP-seq&biosample_ontology.term_name=K562&target.investigated_as=cohesin&target.label%21=RAD21.
12
Scroll to the top of the page. At the top of the facet sidebar below the words “Experiment search,” click “Clear filters” to remove all selected filters.
13
Click the back button of the browser to undo the previous action. This will return to the previous query state for TF ChIP‐seq on K562 targeting cohesin‐related targets except for RAD21. The facet terms which should be selected at this stage are shown in Figure 4. As of September 2019, this search returned three experiments.

Figure 4. A truncated view of the facets with the items that should be selected after step 13 of the Basic Protocol.

14
Review the list of search results, which displays the summaries of the experiments satisfying the selected filters.
1. Each result is labeled with a short title based on the assay performed and biosample used. Because this example search has filtered for both a specific assay and a biosample type, all the results are titled with “TF ChIP‐seq of K562.”
2. Below the short title there is a more descriptive biosample summary, followed by the target if applicable, the lab that performed the experiment, and the project the experiment belongs to, such as ENCODE or Roadmap.
3. Additional details about each experiment are located on the right side of each summary:
   1. Object type: In this case, all results are Experiment objects. However, there are many object types modeled in the ENCODE database, which represent different experimental components. The data model is briefly discussed in the Commentary section of this article.
   2. The accession: Each experiment on the ENCODE portal is given a unique and persistent identifier known as an accession, which can be used for citing ENCODE datasets. An example of an accession is ENCSR670JDQ. Users can directly access the record page for any accessioned object by appending its accession to https://www.encodeproject.org/.
   3. Status of the object: Brief explanations of the different statuses are shown in Table 1, and further information is available at https://www.encodeproject.org/help/getting‐started/status‐terms/. Other object types aside from Experiments also have statuses.
   4. Audit flags: The ENCODE DCC uses automated checks known as “audits” to flag objects in the database for potential data or metadata issues, some of which are outlined in Table 2. Once an object is flagged, the DCC can work with production labs to address the issues and ensure that all data and metadata meet quality standards. Audits are divided into severity levels, represented as red, orange, or yellow icons, which alert users to potential concerns to be aware of if they use the data in their own research. If an experiment is flagged, the icon(s) appear below the Status and can be clicked to show further details about the audits.
   5. Cart button: This button allows users to add experiments to their cart. Further cart‐specific information is detailed in Support Protocol 2.
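The facet logic used in steps 5 to 13 maps directly onto the query string of the search URL: different parameters combine with AND, a repeated parameter expresses OR, and an excluded term is written with `!=` (which appears percent-encoded as `%21=`). The following sketch rebuilds the URLs shown in steps 8 and 11 under that reading. The function name `facet_search_url` is hypothetical; the parameter names are copied from the URLs in this protocol.

```python
from typing import Mapping, Optional, Sequence
from urllib.parse import urlencode

SEARCH_URL = "https://www.encodeproject.org/search/"


def facet_search_url(
    include: Mapping[str, Sequence[str]],
    exclude: Optional[Mapping[str, Sequence[str]]] = None,
) -> str:
    """Translate facet selections into a search URL.

    - Different keys are combined with AND.
    - Several values under one key are combined with OR (the key is repeated).
    - Excluded values are written as "key!=value", as in the RAD21 example.
    """
    pairs = [(field, term) for field, terms in include.items() for term in terms]
    for field, terms in (exclude or {}).items():
        pairs.extend((field + "!", term) for term in terms)
    return SEARCH_URL + "?" + urlencode(pairs)


base = {
    "type": ["Experiment"],
    "status": ["released"],
    "assay_title": ["TF ChIP-seq"],
}

# Step 8: K562 OR DND-41 within the "Biosample term name" facet.
step8 = facet_search_url({**base, "biosample_ontology.term_name": ["K562", "DND-41"]})

# Step 11: K562 AND cohesin-category targets, excluding RAD21.
step11 = facet_search_url(
    {
        **base,
        "biosample_ontology.term_name": ["K562"],
        "target.investigated_as": ["cohesin"],
    },
    exclude={"target.label": ["RAD21"]},
)

print(step8)
print(step11)
```

Because the URL fully encodes the query state, a search can be saved, shared, or regenerated later simply by storing the facet selections.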

Table 2.

Audit Flag Categories^a

Category	Description
Read coverage	Read depth or coverage issues for libraries. These standards were agreed upon by ENCODE production labs and are outlined in full on the data standards pages (https://www.encodeproject.org/data‐standards/).
Replication	Issues with replicate concordance or other replicate inconsistencies
Library complexity	Bottlenecking or library complexity issues, as outlined in the ENCODE Histone ChIP‐seq (https://www.encodeproject.org/chip‐seq/histone/) and Transcription Factor ChIP‐seq standards (https://www.encodeproject.org/chip‐seq/transcription_factor/)
Enrichment	Low SPOT scores for DNase‐seq experiments as outlined in the ENCODE DNase‐seq standards (https://www.encodeproject.org/data‐standards/dnase‐seq/)
Uniform pipeline requirements	Various pipeline issues, such as unexpected inconsistencies in read length, insufficient read length, and unknown platforms or other missing information
Antibody	Mismatches between antibody and target metadata or missing characterizations for antibodies
Metadata	Missing required metadata
Dataset consistency	Inconsistencies between different experiments grouped together in a series

^aAdditional information is available at https://www.encodeproject.org/data‐standards/audits/.
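When many experiments are screened at once, audit flags can be tallied in a script instead of being inspected one icon at a time. The sketch below counts flags per category within each severity level and reports the most severe level present. It assumes that a record carries an `audit` mapping from a severity name to a list of flags, each with a `category` field, and that the severity names are `ERROR`, `NOT_COMPLIANT`, and `WARNING`; none of these names are given in this article, which states only that severities appear as red, orange, or yellow icons. The example record is hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping, Optional

# Assumed severity names, ordered from most to least severe.
SEVERITY_ORDER = ("ERROR", "NOT_COMPLIANT", "WARNING")


def summarize_audits(record: Mapping) -> Dict[str, Counter]:
    """Count audit flags per category within each severity level present."""
    audits = record.get("audit", {})
    return {
        level: Counter(flag["category"] for flag in audits[level])
        for level in SEVERITY_ORDER
        if audits.get(level)
    }


def worst_severity(record: Mapping) -> Optional[str]:
    """Return the most severe level present, or None if the record is unflagged."""
    summary = summarize_audits(record)
    return next((level for level in SEVERITY_ORDER if level in summary), None)


# Hypothetical record, for illustration only (not real portal data).
# Category labels are borrowed from Table 2.
example = {
    "accession": "EXAMPLE001",
    "audit": {
        "WARNING": [
            {"category": "Read coverage"},
            {"category": "Read coverage"},
            {"category": "Library complexity"},
        ]
    },
}

print(worst_severity(example))
print(dict(summarize_audits(example)["WARNING"]))
```

A summary like this can serve as a first-pass filter, for example by setting aside experiments whose worst severity exceeds what a given analysis can tolerate.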

Visit a single experiment page

15
Click the short title corresponding to experiment ENCSR670JDQ to go to its experiment summary page, depicted in Figure 5. This experiment is also accessible at https://www.encodeproject.org/experiments/ENCSR670JDQ/.

This page displays a more complete picture of the metadata, which is not shown in the summaries on the search result page, including links to related experimental components, data files, and documents.

Figure 5. An experiment summary page. Below the page title and audits, the page is organized into distinct sections containing the following information: (A) Summary section: key info including but not limited to the assay performed, biosample used, assay target if applicable, platform, and controls. (B) Attribution section: information about the lab that performed the experiment and when the experiment was released. (C) Replicates section: table of experimental replicates with links to biosamples, antibodies, libraries, and genetic modifications when applicable. (D) Files section: information about the raw and processed data files generated from this experiment and subsequent analysis, provenance of data files as reflected in file association graph, and visualization of experiment‐specific genome tracks when applicable. (E) Documents section: links to additional protocol documents describing the experimental methods.

16
Audit flags are also displayed on the experiment summary pages. Click the audit button, as labeled in Figure 5, to display a list of audits. For this experiment, the button appears as an unlabeled yellow icon with a number; the color indicates the severity and the number gives the count of audit flags. Click the plus symbol to the left of each audit to expand it and display the audit flag details.
17
Scroll down the page to the Summary and Attribution sections.

These sections contain general information about the experiment, such as the biosample, assay, and target, a link to a control experiment, and information about the lab that performed the experiment.
18
Scroll down the page to the Replicates section.

This section contains information about the replicate(s) of the experiment and links to biosamples, genetic modifications, and antibodies used when applicable.
19
Scroll down the page to the Files section.
1. The Files section is divided into three tabs: Genome browser, Association graph, and File details. By default, the Files section displays the Association graph, which shows data provenance and derivation of downstream processed files. Use the slider above the graph to zoom in or out of the graph.
2. The File details tab is discussed in step 22, and the Genome browser tab is described in more detail in Support Protocol 3.
20
Each node in the Association graph can be clicked to view more information about the node. Click on a yellow node, which represents a file, to view a pop‐up containing the file's unique accession as well as other metadata such as the file size, output type, and submission date.

Some file nodes may contain smaller, green circles representing quality control metrics associated with that file. These green circles are clickable and, like the file nodes, open a pop‐up with the quality metric (QC metric) values, plots, and other information if applicable. Click “Close” in the lower right‐hand corner of the pop‐ups when finished viewing the information.
21
Click on a blue node, which represents a step in the computational analysis pipeline, to view information about the software used, the inputs and outputs, and the general purpose of the relevant pipeline analysis step.
22
Click the “File details” tab to view a list of all files linked to this experiment.

The files are presented in a table containing information about the file type, reference genome assembly for mapping, and submission date, with one file per row. A small download icon next to each file accession allows users to download a single file at a time.
23
Scroll further down to view the Documents section. Experiments may also have attached documents describing the experimental and/or computational protocols. Click the link for a particular document to download it.
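The File details table and the Association graph describe the same file metadata from two angles: a flat list that can be filtered by file type or assembly, and a provenance graph linking each processed file to the files it was derived from. The sketch below reproduces both views on a small in-memory list. The field names (`file_format`, `assembly`, `derived_from`), the placeholder accessions, and the use of bare accessions as provenance links are assumptions made for illustration, not values taken from the portal.

```python
from typing import List, Mapping, Optional, Sequence

# Hypothetical file records, for illustration only.
FILES = [
    {"accession": "FILE_A", "file_format": "fastq", "assembly": None, "derived_from": []},
    {"accession": "FILE_B", "file_format": "bam", "assembly": "GRCh38", "derived_from": ["FILE_A"]},
    {"accession": "FILE_C", "file_format": "bigWig", "assembly": "GRCh38", "derived_from": ["FILE_B"]},
    {"accession": "FILE_D", "file_format": "bed", "assembly": "GRCh38", "derived_from": ["FILE_B"]},
]


def select_files(
    files: Sequence[Mapping],
    file_format: Optional[str] = None,
    assembly: Optional[str] = None,
) -> List[str]:
    """Return accessions of files matching the given format and/or assembly."""
    return [
        f["accession"]
        for f in files
        if (file_format is None or f["file_format"] == file_format)
        and (assembly is None or f["assembly"] == assembly)
    ]


def upstream_files(accession: str, files: Sequence[Mapping]) -> List[str]:
    """List every file the given file was derived from, nearest first."""
    by_accession = {f["accession"]: f for f in files}
    found: List[str] = []
    queue = list(by_accession[accession]["derived_from"])
    while queue:  # breadth-first walk up the provenance graph
        parent = queue.pop(0)
        if parent not in found:
            found.append(parent)
            queue.extend(by_accession.get(parent, {}).get("derived_from", []))
    return found


print(select_files(FILES, assembly="GRCh38"))
print(upstream_files("FILE_D", FILES))
```

Walking the provenance links upward answers the same question as tracing edges in the Association graph: which raw and intermediate files a given result depends on.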

Support Protocol 1. BATCH DOWNLOADING

After identifying experiments of interest, users can download data associated with these experiments. Batch downloading provides a quick and simple way to download multiple files.