Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2013 Jan 9;29(4):509–510. doi: 10.1093/bioinformatics/btt003

HitWalker: variant prioritization for personalized functional cancer genomics

Daniel Bottomly 1,2,*, Beth Wilmot 1,2, Jeffrey W Tyner 1,3, Christopher A Eide 1,4,5, Marc M Loriaux 1,6, Brian J Druker 1,4,5, Shannon K McWeeney 1,2
PMCID: PMC3570211  PMID: 23303510

Abstract

Summary: Determining the functional relevance of identified sequence variants in cancer is a prerequisite to ultimately matching specific therapies with individual patients. This level of mechanistic understanding requires integration of genomic information with complementary functional analyses to identify oncogenic targets and relies on the development of computational frameworks to aid in the prioritization and visualization of these diverse data types. In response to this, we have developed HitWalker, which prioritizes patient variants relative to their weighted proximity to functional assay results in a protein–protein interaction network. It is highly extensible, allowing incorporation of diverse data types to refine prioritization. In addition to a ranked list of variants, we have also devised a simple shortest path-based approach of visualizing the results in an intuitive manner to provide biological interpretation.

Availability and implementation: The program, documentation and example data are available as an R package from www.biodevlab.org/HitWalker.html.

Contact: bottomly@ohsu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The advent of next-generation sequencing technology such as that available from Illumina provides an unprecedented ability to interrogate individual genomes (Metzker, 2010). Accompanying technologies such as the ability to multiplex samples, as well as efficient sequence capture technologies (Ng et al., 2009) further enable targeted regions of the genome to be re-sequenced at a reasonable cost per sample. In this manner, specific genes or whole exomes can be interrogated to identify variants potentially having a deleterious impact on protein coding regions. Efficacy of this approach for clinical research has been shown through the discovery of variants underlying simple Mendelian disorders, as well as more complicated variants/mutations involved in cancer and potentially complex traits (Kiezun et al., 2012). Because many variants are produced for a given sample, mechanistic and population genetic assumptions are often used to reduce the set of variants (Ng et al., 2009). For instance, limiting attention to non-synonymous single-nucleotide variations, as well as using the use of variant databases, such as dbSNP (Sherry et al., 2001) and the 1000 genomes project (The 1000 Genomes Project Consortium, 2010), enables researchers to focus on low-frequency variants that are potentially damaging. However, even after these filters, it is relatively infrequent that a researcher is left with a manageable set of variants for biological validation.

For cancer cells derived from a given patient, functional assays can be performed that allow researchers to determine the relevance of genes to cell viability, such as through the use of targeted siRNA screens (Tyner et al., 2009). These gene targets can be scored using a binary or quantitative encoding that indicates outlier status relative to other samples. Similarly, screens that measure the sensitivity of a patient’s cells to a panel of small molecules can be scored relative to genes using the known gene targets of each compound (Tyner et al., 2013). These functional assays by themselves provide some information to researchers on the identity of the signalling pathways that are required for cancer growth and survival. However, they do not necessarily indicate the mutated gene(s) leading to dysregulation of these signalling pathways, as the true causative variant(s) could be found in any gene with capacity to regulate the pathway.

To prioritize variants detected in our cohort of leukaemia patients relative to our functional assay results, we have devised the R and SQL framework HitWalker using a protein–protein interaction (PPI) network as a backbone. In addition to a ranked list of variants, HitWalker also allows for simple visualization of the relevant subnetwork, as well as overlaying relevant meta-data.

2 DESCRIPTION OF SOFTWARE

2.1 Variant prioritization

Variants are related to genes implicated from the functional assays through a user-defined PPI network, such as STRING (Szklarczyk et al., 2011), which provides known and hypothetical links between proteins that are weighted by a confidence score. Variants and the functional assay results are mapped to the proteins of the network using pre-supplied meta-data. For example, variants can be mapped to transcripts, which in turn can be mapped to proteins (as well as gene symbols). This hierarchy is managed through the specification of ‘core’ or ‘summary’ IDs to be used in the internal functions.

By default, prioritization is performed using a random walk with restarts algorithm (RWR) (Köhler et al., 2008; Tong et al., 2008). In HitWalker, RWR provides a measure of weighted proximity between a set of proteins associated with functional assay hits and a set of proteins containing variants. Variants are prioritized based on the resulting RWR association score attributed to the protein. See the Supplementary Material for more information. We note that any number of approaches related to our RWR implementation could be easily applied as well, including corrections for degree (Erten et al., 2011).

2.2 Visualization

The relevant subnetworks can be visualized by approximating the RWR algorithm using the unweighted shortest paths between the top queries and seeds. Shortest paths connecting the functional hit and variant genes are also displayed for additional context. For the shortest path calculations, the path with the highest overall path score is chosen if multiple equivalent shortest paths exist. The user can define functions mapping meta-data in the database to attributes of the network. For instance, genes can be represented using different shapes and colours. An example of such a figure is shown in Figure 1 and Supplementary Figure S1. In a patient with acute myeloid leukaemia, an S451F amino acid change in the FLT3 gene was the top ranking variant, which has been previously characterized and catalogued in COSMIC (Forbes et al., 2011). Non-synonymous single-nucleotide variants in ZAK and PRKCE were the next highest-ranked variants, and all three genes are candidates based on the observed drug response.

Fig. 1.

Fig. 1.

Modified visualization output from HitWalker displaying the top three assay hits (EPHA4, JAK3 and FRK) and variants (FLT3, ZAK and PRKCE) for an acute myeloid leukaemia patient. Note that other hits are pulled out and annotated, as they are on the shortest path. Gene names are provided for each node. For nodes containing variants (blue), frequency information is reported in terms of the patient cohort counts (F), as well as the RWR rank (R). Red and green nodes indicate siRNA and gene target hits, respectively. Dotted borders indicate absence of capture probes for a given gene. Dashed borders indicate functional assay targets whose inhibition did not significantly alter cell viability. Confidence scores for the interactions between the two genes are reported near the lines connecting two given genes

3 DISCUSSION

Our software provides a flexible framework for prioritizing variants relative to functional assay outcomes and a PPI network or other association graph using an RWR algorithm. In addition, we include visualization of key subnetwork members to aid in biological interpretation. Prioritized variants can then be followed up using Sanger sequencing, as well as additional experiments to categorize their phenotypic effect. Software developers can easily modify queries and R functions for their in-house databases and functional assays. Although MySQL is our database of choice, any other R/DBI-supported database would work with minor modifications to the R code.

Supplementary Material

Supplementary Data

ACKNOWLEDGEMENTS

The authors acknowledge helpful discussions with Cristina Tognon and the other members of the Leukemia and Lymphoma Society Specialized Center of Research at the Knight Cancer Institute.

Funding: Leukemia & Lymphoma Society (award 7005-11); NIH/NCI (5P30CA069533-13, 4R00CA151457-03); NIH/NCATS (5UL1RR024140) V Foundation for Cancer Research; William Lawrence & Blanche Hughes Foundation, HHMI.

Conflict of Interest: none declared.

REFERENCES

  1. Erten S, et al. DADA: Degree-Aware Algorithms for Network-Based Disease Gene Prioritization. BioData Mining. 2011;4:19. doi: 10.1186/1756-0381-4-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Forbes SA, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011;39:D945–D950. doi: 10.1093/nar/gkq929. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Kiezun A, et al. Exome sequencing and the genetic basis of complex traits. Nat. Genet. 2012;44:623–630. doi: 10.1038/ng.2303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Köhler S, et al. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 2008;82:949–958. doi: 10.1016/j.ajhg.2008.02.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Metzker ML. Sequencing technologies: the next generation. Nat. Rev. Genet. 2010;11:31–46. doi: 10.1038/nrg2626. [DOI] [PubMed] [Google Scholar]
  6. Ng SB, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Sherry ST, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–311. doi: 10.1093/nar/29.1.308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Szklarczyk D, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39:D561–D568. doi: 10.1093/nar/gkq973. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Tong H, et al. Random walk with restart: fast solutions and applications. Knowl. Inf. Syst. 2008;14:327–346. [Google Scholar]
  10. Tyner JW, et al. RNAi screen for rapid therapeutic target identification in leukemia patients. Proc. Natl Acad. Sci. USA. 2009;106:8695–8700. doi: 10.1073/pnas.0903233106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Tyner JW, et al. Kinase pathway dependence in primary human leukemias determined by rapid inhibitor screening. Cancer Res. 2013;73:285–296. doi: 10.1158/0008-5472.CAN-12-1906. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data
supp_btt003_FigureS1.png (182.7KB, png)

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES