Bioinformatics. 2016 Nov 12;33(4):547–548. doi: 10.1093/bioinformatics/btw652

Isomorphic semantic mapping of variant call format (VCF2RDF)

Emanuel Diego S Penha 1, Egiebade Iriabho 1, Alex Dussaq 1, Diana Magalhães de Oliveira 2, Jonas S Almeida 3
Editor: Inanc Birol
PMCID: PMC6041975  PMID: 27797761

Abstract

Summary

The move of computational genomics workflows to Cloud Computing platforms is associated with a new level of integration and interoperability that challenges existing data representation formats. The Variant Call Format (VCF) is in a particularly sensitive position in that regard, with both clinical and consumer-facing analysis tools relying on this self-contained description of genomic variation in Next Generation Sequencing (NGS) results. In this report we identify an isomorphic map between VCF and the Resource Description Framework (RDF). RDF is advanced by the World Wide Web Consortium (W3C) to enable representations of linked data that are both distributed and discoverable. The resulting ability to decompose VCF reports of genomic variation without loss of context addresses the need to modularize and govern NGS pipelines for Precision Medicine. Specifically, it provides the flexibility (i.e. the indexing) needed to support the wide variety of clinical scenarios and patient-facing governance where only part of the VCF data is relevant.

Availability and Implementation

Software libraries that claim to be both domain-facing and consumer-facing have to pass the test of portability across the variety of devices that those consumers in fact adopt. That is, ideally the implementation should itself take place within the space defined by web technologies. Consequently, the isomorphic mapping function was implemented in JavaScript and was tested in a variety of environments and devices, client and server side alike. These range from web browsers in mobile phones to the most popular microservice platform, Node.js. The code is publicly available at https://github.com/ibl/VCFr, with a live deployment at: http://ibl.github.io/VCFr/.

1 Introduction

The Variant Call Format (VCF) was created by the 1000 Genomes Project as a generic format to store polymorphisms. Within 2 years of its initial development in 2011 (Danecek et al., 2011) it had become the format of choice for other next-generation sequencing projects. The basic NGS pipeline goes from FASTQ (reads) to BAM (alignment to reference) and then to VCF (genomic variants).

Accordingly, the VCF format evolved to communicate the results of this pipeline, reflecting years of modeling and representation of genomic data. Specifically, the expressiveness of VCF, which underpins its success as a format, reflects the need to accommodate dependencies between genomic annotation and genomic variation with sufficient granularity while allowing those to change over time.

The need for data models that accommodate change is not unique to next-generation sequencing. A similar outcome can be found, for example, in the Distributed Annotation System format (DAS) (Jenkinson et al., 2008). In line with those developments, with the ubiquity of web computing, and with the current drive towards personalized medicine, the work reported here maps the VCF format onto W3C’s Resource Description Framework (RDF) (Klyne and Carroll, 2004), a standard developed to capture the universal dyadic predication behind all metadata. It should be noted that converters that do not use web technologies already exist, notably https://github.com/dbcls/bh15/wiki/VCF-to-RDF-Mapping and https://github.com/JervenBolleman/sparql-vcf.

RDF is now in effective use in a wide range of reference Big Data resources, from the data platform of the European Bioinformatics Institute (Jupp et al., 2014) to the Centers for Medicare and Medicaid Services (CMS) (Linked Data Goes With DERI, 2012) in the USA, and the data backends of cloud computing infrastructure for cancer genomics funded by NIH/NCI (http://www.cancergenomicscloud.org). As the size and complexity of VCF files grow to reflect the widening context of their use, not having a way to map VCF into the integrative RDF framework impairs portability and interoperability between domains of application. This limitation is particularly clear when one considers that, because of this complexity, a given analysis often needs only a portion of the VCF file, yet the whole file must be downloaded first. RDF's native indexed query interface, SPARQL (https://www.w3.org/TR/rdf-sparql-query), provides a solution to this problem. In other words, RDF removes the distinction between a file format and a distributed, queryable (federated), web-enabled database.

On the other hand, the opposite argument also holds: forcing the native use of RDF would add an unnecessary requirement to the methodological context where VCF is working just fine. We propose to solve this problem by identifying a web-enabled isomorphic map between VCF and RDF. The validating Web implementation (using only its ‘assembler language’, JavaScript) is delivered here with no restrictions on its use and extension.

2 Methods

The parser was built in JavaScript (ECMAScript 5), plus the Promise feature of ECMAScript 6, using the following libraries: papaparse.js (http://papaparse.com/) and jsonld.js (https://github.com/digitalbazaar/jsonld.js).

Virtuoso was used as a triple store, on top of nginx (https://www.nginx.com/) and Node.js (https://nodejs.org/).

A second server running AllegroGraph (http://allegrograph.com/) was used for testing federated querying: VCF files were first parsed to JSON-LD by the accompanying isomorphic map, and then uploaded as N-Quads.
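The first step of that process, turning VCF text into a JSON structure, can be illustrated with a minimal sketch. It is not the accompanying parser: it uses neither papaparse.js nor jsonld.js, it stops at plain JSON rather than JSON-LD, and the function names `parseVcf` and `parseInfo` and the sample lines are assumptions made for illustration only.

```javascript
// Minimal, dependency-free sketch (hypothetical helper names):
// split VCF text into meta lines, column names, and one object per data row.
function parseInfo(info) {
  const out = {};
  if (!info || info === ".") return out; // "." means no INFO entries
  for (const item of info.split(";")) {
    const eq = item.indexOf("=");
    if (eq === -1) out[item] = true; // a flag such as DB carries no value
    else out[item.slice(0, eq)] = item.slice(eq + 1);
  }
  return out;
}

function parseVcf(text) {
  const meta = [];
  let columns = [];
  const rows = [];
  for (const line of text.split(/\r?\n/)) {
    if (line === "") continue;
    if (line.startsWith("##")) {
      meta.push(line); // meta-information line, kept verbatim
    } else if (line.startsWith("#")) {
      columns = line.slice(1).split("\t"); // "#CHROM POS ..." column header
    } else {
      const cells = line.split("\t");
      const row = {};
      columns.forEach((name, i) => { row[name] = cells[i]; });
      row.INFO = parseInfo(row.INFO);
      rows.push(row);
    }
  }
  return { meta, columns, rows };
}

// Illustrative input: two meta lines, the column header, one variant row.
const sampleVcf = [
  "##fileformat=VCFv4.2",
  '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB",
].join("\n");

console.log(JSON.stringify(parseVcf(sampleVcf).rows[0].INFO));
// prints {"NS":"3","DP":"14","DB":true}
```

Keeping INFO flags as Booleans mirrors the way a flag such as DB is described for the body of the file; all other values are left as strings in this sketch.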

3 Results

The accompanying isomorphic map creates URIs for all resources we found in the VCF specification, and it was tested with multiple versions, including v4.2 (Danecek et al., 2011). This allows other models to point to an identifier that already exists, and to pass a reference that is easily linkable, identifiable, stable and unique (i.e. dereferenceable).

The URI is built from the VCF specification version and a class name, as http://vcf2rdf.org/app/[VCF-spec-version]/[class-name]. For example, if a new specification version changes what the class named DB represents, that class would need a new URI. Given our naming scheme, the dereferenceable URL would be: http://vcf2rdf.org/app/v4_3/INFO_ID_DB/.
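The naming scheme can be sketched as a few small functions. This is an illustration under stated assumptions, not the code of the accompanying parser: the helper names (`parseHeaderLine`, `versionSegment`, `className`, `classUri`) are hypothetical, the dot-to-underscore conversion is inferred from the v4_2 and v4_3 URIs quoted in the text, and the trailing slash is omitted to follow the bracketed pattern.

```javascript
// Base of the URI pattern quoted in the text.
const BASE = "http://vcf2rdf.org/app";

// Parse a structured header line such as ##INFO=<ID=DB,...> (hypothetical helper).
function parseHeaderLine(line) {
  const match = /^##([^=]+)=<(.*)>$/.exec(line.trim());
  if (!match) return null; // not a structured "##KEY=<...>" line
  const fields = {};
  // Each field is key=value; a value may be double-quoted and contain commas.
  const fieldPattern = /([^=,]+)=("(?:[^"\\]|\\.)*"|[^,]*)/g;
  let m;
  while ((m = fieldPattern.exec(match[2])) !== null) {
    fields[m[1]] = m[2].replace(/^"|"$/g, ""); // strip surrounding quotes
  }
  return { key: match[1], fields };
}

// Assumption: "VCFv4.3" becomes the path segment "v4_3".
function versionSegment(fileformat) {
  const m = /v\d+(?:\.\d+)*/.exec(fileformat);
  return m ? m[0].replace(/\./g, "_") : null;
}

// Assumption: the class name joins the header key and its ID, e.g. INFO_ID_DB.
function className(header) {
  return `${header.key}_ID_${header.fields.ID}`;
}

function classUri(version, name) {
  return `${BASE}/${version}/${name}`;
}

const header = parseHeaderLine(
  '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
);
console.log(classUri(versionSegment("VCFv4.3"), className(header)));
// prints http://vcf2rdf.org/app/v4_3/INFO_ID_DB
```

Because the version is part of the path, the same header line read from a v4.2 file would yield a different URI, which is the versioning behaviour the naming scheme is meant to provide.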

We chose to represent each VCF file with two major classes of information, the header and the body (Fig. 1). A series of new classes and subclasses then allows the header to describe the content of the file, while line identifiers link header classes to genomic information. This solution minimizes data expansion (5-fold versus 20-fold when using UUIDs). The approach implemented generates triples that capture sparse relationships, as illustrated by this example:

Fig. 1. Basic mapping of a line of the header to triples. (A) Mapping of metadata in the header for class INFO and subclass INFO_ID_DB. (B) Use of INFO_ID_DB in the body of a VCF file. For each line n there is an INFO_ID_DB Boolean stating whether that chrom:pos is present in the dbSNP database (Sherry et al., 2001).

_:b0 <http://vcf2rdf.org/app/v4_2/body> _:b1 .

_:b0 <http://vcf2rdf.org/app/v4_2/head> _:b6179 .

_:b1 <http://vcf2rdf.org/app/v4_2/row_1000> _:b6 .

_:b1 <http://vcf2rdf.org/app/v4_2/row_1001> _:b10 . (…)
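The shape of these triples can be reproduced with a short sketch: one blank node stands for the file, it points to a body node and a head node, and the body node points to one blank node per row through a row-specific predicate. The function names are hypothetical, and the blank node labels come from a simple counter, so they differ from the labels in the excerpt above.

```javascript
// Namespace taken from the example triples (VCF v4.2).
const NS = "http://vcf2rdf.org/app/v4_2/";

// Returns a function that hands out fresh blank node labels: _:b0, _:b1, ...
function makeBlankNodeFactory() {
  let n = 0;
  return () => `_:b${n++}`;
}

// Hypothetical helper: emit the file -> body/head -> row skeleton as triple lines.
function fileSkeletonTriples(rowNumbers) {
  const blank = makeBlankNodeFactory();
  const file = blank();
  const body = blank();
  const head = blank();
  const lines = [
    `${file} <${NS}body> ${body} .`,
    `${file} <${NS}head> ${head} .`,
  ];
  for (const n of rowNumbers) {
    // Each row gets its own predicate (row_<n>) and its own blank node.
    lines.push(`${body} <${NS}row_${n}> ${blank()} .`);
  }
  return lines;
}

console.log(fileSkeletonTriples([1000, 1001]).join("\n"));
// prints:
// _:b0 <http://vcf2rdf.org/app/v4_2/body> _:b1 .
// _:b0 <http://vcf2rdf.org/app/v4_2/head> _:b2 .
// _:b1 <http://vcf2rdf.org/app/v4_2/row_1000> _:b3 .
// _:b1 <http://vcf2rdf.org/app/v4_2/row_1001> _:b4 .
```

Using counter-based blank nodes and row-numbered predicates keeps each identifier short, in the spirit of the line identifiers described above as an alternative to UUIDs.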

As contextualized in the Introduction, achieving this goal entirely within the implementation domain defined by Web Technologies is the distinctive achievement of the work reported here.

4 Conclusion

We built a VCF parser that acts as an isomorphic mapping function to (evolvable) linked data entirely within 3rd generation Web Technologies. That is, in line with the emerging ubiquitous Web Computing Space, the accompanying parser has no dependencies other than a web browser (i.e. a native JavaScript interpreter). The resulting JSON-LD file can then be uploaded into a triple store, a NoSQL database like MongoDB or the browser’s IndexedDB (www.w3.org/TR/IndexedDB), and be used in a standalone Web Application. At a more abstract level, this ability to map into RDF and JSON-LD provides the missing functional bridge onto constraint satisfaction engines, such as SPARQL, which enable the extraction of only the data that matches use contexts that cannot easily be predicted, as is the case for Personalized Medicine applications.

Funding

This work was supported in part by 1U24CA180924-01A1 from the NCI, R01LM011119-01 and R01LM009239 from the NLM.

Conflict of Interest: none declared.

References

  1. Danecek P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
  2. Jenkinson A.M. et al. (2008) Integrating biological data – the Distributed Annotation System. BMC Bioinformatics, 9, S3.
  3. Jupp S. et al. (2014) The EBI RDF platform: linked open data for the life sciences. Bioinformatics, 30, 1338–1339.
  4. Klyne G., Carroll J.J. (2004) Resource Description Framework (RDF): concepts and abstract syntax. W3C Recomm., 10, 1–20.
  5. Linked Data Goes With DERI (2012) Data.gov.
  6. Sherry S.T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
