Abstract
With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services.
Keywords: NoSQL database, copy number variation, drug-target interaction, data integration
Introduction
With the proliferation of high-throughput technologies that provide genome-wide overviews of the molecular landscape within cells, researchers can quickly profile the whole genome using multiple modalities. For example, The Cancer Genome Atlas (TCGA) is a comprehensive effort to accelerate our understanding of cancer by profiling hundreds of samples of different types of cancer using many technologies, including gene expression microarrays, methylation arrays, microRNA arrays, array CGH, SNP chips, and large scale genomic sequencing [1]. The bioinformatics challenge of integrating these disparate sources of information remains difficult. The bioinformatics community continues to develop extensive resources to annotate and mine biological and genomic features from these high-throughput analyses. In this context, database management systems (DBMS) are required to store the source data and the analysis results, and to build tools that annotate and mine the data.
The underlying DBMS for most current bioinformatics utilities is based on a relational model [2]. Relational database management systems (RDBMS) provide excellent data integrity at the cost of fragmenting data into multiple tables. Scalability and performance can become issues, especially during data retrieval, although they rarely come into play unless the database expands across multiple server nodes. Non-relational or NoSQL (not only SQL) database systems, which do not require a specific design schema, are used in such cases to sustain high throughput. Although NoSQL systems may not provide strict data consistency like RDBMS, they are highly scalable [3]. They are increasingly used for giant cloud computing services by commercial establishments like Google and Amazon [4]. Their close integration with web servers and standard protocols can, under the right circumstances, facilitate rapid development of interactive database applications; the absence of a rigid design schema also makes it easier to change designs while continuing to use and access the existing data. Some bioinformatics systems have already been developed using non-relational database systems like HBase [5] and Persevere [6].
CouchDB is an open-source non-relational database system being developed by the Apache Software Foundation (http://couchdb.apache.org). Here, we empirically demonstrate the utility of CouchDB as a database management system for bioinformatics by building several web-based database utilities, namely (a) geneSmash, (b) drugBase, and (c) HapMap-CN. The geneSmash database collates data from various bioinformatics resources and provides automated gene-centric annotations. The drugBase system is a CouchDB database of drug-target interactions with a web interface powered by geneSmash. The HapMap-CN database provides an interface to query the copy number variations identified in the HapMap [7; 8] SNP chip datasets, using either gene information or genomic location. In addition to the web sites, all three systems can be accessed programmatically via web services for high-throughput analyses.
Results
Figure 1A provides an overview of the CouchDB architecture. From the perspective of client applications, a CouchDB database looks like a web server. Clients make HTTP calls and get responses formatted as HTML or JSON documents. Internally, the database contains data documents, design documents, and pre-computed query results that can be accessed using URLs. Figure 1B illustrates a typical development environment for CouchDB databases and applications. Although CouchDB allows views (or queries) to be defined in various languages, JavaScript is the most common choice. The source code for views is contained in the design document, and can therefore be extracted from the database for closer study.
Figure 1.
(A) Architecture of a CouchDB database. To all clients (including web browsers or programming systems like the R statistical software environment), a CouchDB database appears to be a web server. Clients communicate with the server by making HTTP calls and receiving either JSON or HTML responses. The database is stored on disk as a collection of JSON documents, each of which has a unique identifier (_id) that can be used to construct a unique URL. Special design documents (_design) define the queries (or _views) that can be made on the database; the responses for each _view are precomputed and stored on disk in the database. Web applications can provide a rich environment for users to interact with the data by using standard AJAX protocols as implemented, for example, in the jQuery JavaScript library. (B) The CouchDB development environment. For genomics applications, the JSON documents in a typical CouchDB database will be constructed by reformatting source data files using scripts written in perl, python, or another convenient scripting language that can make HTTP PUT calls to insert the documents. The database views and web interfaces are stored in a design document that is developed and maintained using CouchApp (http://couchapp.org). All source code can be stored in a version control system such as subversion (http://subversion.apache.org). (C) Using geneSmash as a CouchDB web service. All CouchDB databases can function as web services that help build web sites. The geneSmash database includes a CouchApp web site that allows users to query gene information starting with a wide variety of gene identifiers. Both the drugBase and HapMap-CN web sites rely on geneSmash as a web service. The drugBase web site hides this implementation from the user, who simply requests a list of drugs that target a gene of interest.
Behind the scenes, drugBase first queries geneSmash (Step 1) to translate the user’s preferred gene identifier into the Entrez gene id, which is then used as a key (Step 2) to retrieve the drug information. By contrast, the HapMap-CN web site explicitly separates the operations into two queries. HapMap-CN allows the user to query copy number information using any chromosomal coordinates (Q2). As an optional first step (Q1), the user can supply any gene identifier to get the chromosomal location from geneSmash.
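The loading step described in panel (B) can be sketched in Python. This is a minimal illustration and not the loader used for the databases described here: the server address, the database name `exampledb`, and the document fields are hypothetical, and only the convention of storing a document with an HTTP PUT to `/<db>/<_id>` is taken from the CouchDB HTTP API.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def make_put_request(server, db, doc):
    """Build (but do not send) an HTTP PUT that stores one JSON document.

    CouchDB stores a document at /<db>/<_id>; the request body is the JSON text.
    """
    doc_id = quote(str(doc["_id"]), safe="")  # escape characters that are special in URLs
    url = f"{server.rstrip('/')}/{db}/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


# Hypothetical record reformatted from a source data file.
record = {"_id": "7157", "symbol": "TP53", "chromosome": "17"}
request = make_put_request("http://localhost:5984", "exampledb", record)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a running server; building the request separately keeps the sketch usable without one.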
geneSmash
As noted above, large-scale collaborative projects like TCGA are generating genome-wide data from multiple platforms. A fundamental need is to integrate these data, which always requires matching probes across platforms, either by the genes they target or by their genomic coordinates. Moreover, it would be extremely useful if the genomic coordinates were directly available to the programming language or scripts used to analyze the data. A number of attempts to address this problem have already been implemented. We start by reviewing the important tools previously developed in this domain.
BioMart provides general tools for serving biological data on the Internet. Their solution is based on a federated architecture, where each data supplier is responsible for exposing their own relational database through the BioMart interface [9]. Programmatic access to BioMart requires the user to learn a new application programming interface; usage in this form is limited to a small number of programming languages that have implemented the API. Alternatively, access to BioMart can occur through HTTP calls using the Mart Query Language (MQL), which is inspired by the Structured Query Language (SQL) used to query relational databases [10]. MQL supports extremely general queries, but requires the user to understand the XML format of queries as well as the table structure of individual marts. BioMart does support more complex queries than geneSmash, but exploiting that power requires learning MQL as a new vocabulary. For researchers who want to integrate disparate kinds of data and annotations, the BioMart approach puts the burden of integration on the client, with a difficult learning curve.
GeneCards is a comprehensive compilation of annotative information about human genes [11]. The gene-centric content is integrated from over 80 digital sources, resulting in extensive genomic, proteomic, transcriptomic, disease, and functional data. The integrated, relational database is available as plain text, XML files, and MySQL dumps. GeneCards provides a sophisticated search engine and two tools for working with sets of genes: GeneALaCart and GeneDecks [11; 12]. GeneALaCart accepts a list of gene identifiers along with the user’s selected data fields and produces a file of annotations for the gene set. GeneDecks allows users to search for functional paralogs and to find annotations shared by a group of genes. Overall, GeneCards restricts the types of queries users can conduct and makes it difficult for users to integrate their own data with the GeneCards database.
BioDAS (Distributed Annotation System) is a communication protocol used to exchange annotations. It is based on the concept that annotations should not be provided by centralized databases, but should be spread over multiple sites [13]. BioDAS is a client-server system where a single client can integrate annotation data from multiple distant web sites and display it to the user in a single view. The communication between client and servers is defined by the DAS XML specification. The system has its advantages: data providers maintain control over data, data is free from problems associated with release cycles, and data duplication is avoided. BioDAS also has some disadvantages: a lack of enforced semantics limits applications to visual displays, and data sources must be DAS-enabled [14].
geneSmash Application Details
The geneSmash database integrates data from the NCBI Entrez Gene database [15], the UCSC genome site [16], miRBase [17; 18; 19; 20], and multiple microarray manufacturer web sites [21; 22; 23]. The NCBI Entrez Gene identifier is used as the primary ID; any record without this identifier is discarded while populating the geneSmash database. At present, geneSmash contains data only on the Homo sapiens genome. However, nothing about the design requires this restriction. Supplementary Table S1 explains the structure of the fields in a typical geneSmash document. The source of each field is recorded, along with an indication of which fields can be queried.
Users can query geneSmash with a variety of gene identifiers through their corresponding views. As described in Materials and Methods, the names and the source code for the views are contained in the design document, available at http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic. As of this writing, gene annotations can be queried by Entrez Gene identifier, HUGO symbol (by_symbol), Ensembl identifier (by_ensembl), UniGene cluster (by_unigene), cytoband location (by_cytoband), probe (by_probe and get_probes), and genomic coordinates (by_location and gene_location). Probe annotation information is available for common Affymetrix, Illumina, and Agilent gene expression microarray platforms. Genomic coordinates are available for three different builds of the human genome (NCBI v.35 through v.37, or UCSC hg17 through hg19). Locations of microRNAs are included for four releases of miRBase (from v.13 through v.16).
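As an illustration of how such a view might be addressed over HTTP, the following Python sketch builds a view URL from the design document address given above. The `/_view/<name>` path and the JSON encoding of the `key` parameter follow the generic CouchDB API; that `by_symbol` takes a plain symbol string as its key is an assumption, since the key format is defined in the design document rather than in this text.

```python
import json
from urllib.parse import urlencode

GENESMASH_DESIGN = "http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic"


def view_url(design_url, view, **params):
    """Return the URL for a named view; CouchDB expects JSON-encoded parameter values."""
    encoded = {name: json.dumps(value) for name, value in params.items()}
    url = f"{design_url}/_view/{view}"
    return f"{url}?{urlencode(encoded)}" if encoded else url


# Assumed key format: the HUGO symbol as a plain string.
print(view_url(GENESMASH_DESIGN, "by_symbol", key="TP53"))
```

JSON-encoding every parameter means that string keys are sent with their surrounding quotes, which is what distinguishes the string "7157" from the number 7157 on the server side.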
Because geneSmash provides a web service, and not simply a web site displaying the underlying database, it can serve as a gene-centric platform on which other applications are constructed. We have already developed several such applications including drugBase and HapMap-CN. Figure 1C explains how the web site interfaces to drugBase or HapMap-CN use geneSmash as a web service to produce a better user experience.
drugBase
When conducting high-throughput studies of disease, biologists and clinicians frequently find themselves confronted by long lists of “interesting” genes; the canonical example is a set of differentially expressed genes between cohorts who respond differently to a certain treatment. When this happens, a frequent goal is to identify other drugs that might benefit the patients who respond poorly to the initial treatment or who have an adverse reaction to a drug or a negative drug-drug interaction. Exploratory molecular analysis of the mechanisms of action of drugs on biomolecules was initiated in the final decades of the 20th century. In the past, the drug interaction information summarized from these scientific studies was limited to journals and commercial catalogs. An online public repository of existing drugs and their target information was first provided by the Therapeutic Target Database (TTD) in 2002 [24; 25]. Over the following decade, many other resources like DrugBank [26; 27], PDTD [28], STITCH [29; 30] and SuperTarget [31] were developed to store information about drug-target interactions.
The web portals supported by the existing drug-target databases provide graphical web interfaces to manually access and search the data. However, for a comprehensive high-throughput analysis one usually needs to download the entire database, since the existing interfaces do not support batch queries of the drug-target interaction data. It would be useful to have a tool that supports batch queries using a list of genes to retrieve drug-target information that could be integrated into analyses of high-throughput genomic data.
drugBase Application Details
The drugBase database contains drug-target interactions, which can be batch-queried by gene target. The primary source of information for drugBase is Matador, a manually curated version of the SuperTarget database of drug-target interactions [31]. Users can query drugBase with a gene identifier to extract all the drugs targeting the protein associated with that specific gene. Sample code using perl or R to access drugBase for batch queries is provided in the online documentation. Supplementary Table S2 explains the structure of the fields in a typical drugBase document.
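A batch query of this kind can be sketched in Python as follows. This is not the sample code from the online documentation; it is a hedged illustration that relies on the generic CouchDB convention of POSTing a JSON body of the form `{"keys": [...]}` to a view. The view address is inferred from the design document URL and the by_EntrezGene view name, whether the keys are stored as strings or numbers is an assumption, and the example response values are placeholders rather than real drug-target data.

```python
import json
from urllib.request import Request

# Inferred address: design document URL plus the by_EntrezGene view name.
DRUGBASE_VIEW = ("http://app1.bioinformatics.mdanderson.org/drugbase/"
                 "_design/basic/_view/by_EntrezGene")


def batch_request(view_url, keys):
    """Build a POST whose JSON body lists every key to look up in one round trip."""
    body = json.dumps({"keys": list(keys)}).encode("utf-8")
    return Request(view_url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


def group_rows_by_key(response):
    """Collect the 'value' of each returned row under the key that matched it."""
    grouped = {}
    for row in response.get("rows", []):
        grouped.setdefault(row["key"], []).append(row["value"])
    return grouped


# Placeholder response in the standard CouchDB view layout; the values are made up.
example = {"rows": [{"id": "a", "key": "1956", "value": "drug A"},
                    {"id": "b", "key": "1956", "value": "drug B"},
                    {"id": "c", "key": "2064", "value": "drug C"}]}
print(group_rows_by_key(example))
```

Grouping by key turns the flat list of rows into one list of drugs per gene, which is usually the shape needed when annotating a gene list.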
The database can be manually queried with a variety of gene identifiers from the website. The website uses HTML and JavaScript based on the CouchDB JSON API. Behind the scenes, the web interface uses geneSmash to make the application more user-friendly (Figure 1C). The geneSmash web service converts any input gene identifier (such as the official HUGO gene symbol or alias, or a microarray manufacturer’s probe number) into the corresponding Entrez Gene identifier. The Entrez Gene identifier is then forwarded to drugBase to retrieve the relevant drug-target interactions. The design information and source code can be accessed from the design document of drugBase, which is publicly available at http://app1.bioinformatics.mdanderson.org/drugbase/_design/basic.
HapMap-CN
Analyzing a set of SNP chip data to identify copy number variations usually results in a set of genomic segments (per sample and per chromosome) with a copy number call. However, a common request from researchers about these data is to retrieve the copy number calls (for all samples) corresponding to the location of a gene of interest. In this context, we used CouchDB to build an application that can provide user-friendly access to copy-number data analyses of a dataset associated with the HapMap project [7; 8]. SNP chip data for 225 HapMap samples from three data sets (GSE17205, GSE17206, and GSE17207) from different ethnic groups were downloaded from the Gene Expression Omnibus [32]. We processed these datasets to estimate copy number changes for each sample; see Materials and Methods. Although the HapMap samples were obtained from healthy volunteers, our analysis identified several individuals with relatively large chromosomal gains or deletions. These alterations included one individual (NA19193) with trisomy 12, one (NA19208) with trisomy 9, one (NA12004) with a deletion on chromosome 18, and one (NA12717) with a gain on 5p. We also identified several large regions exhibiting loss of heterozygosity (LOH). LOH events occurred for sample NA12874 on chromosome 1q, for NA18573 on 5q, and for NA18855 on 2p. Two individuals had multiple large regions of LOH: NA18987 has LOH on 14q, 18, and elsewhere; and NA19012 has LOH on 2q, 5q, and 11q. Graphical depictions of the results of the copy number analysis can be browsed at http://bioinformatics.mdanderson.org/HapMap.
HapMap-CN Application Details
HapMap-CN provides a tool to query the segmented results of the copy number analysis. Supplementary Table S3 explains the structure of the fields in a typical HapMap-CN document. Users can query the summarized data with a genomic location. The genomic location of a specific gene can be acquired in the interface by gene symbol, Entrez Gene identifier, or Ensembl gene identifier. The mapping of the gene symbol or identifier to the gene location is performed using the geneSmash web service (Figure 1C). The chromosomal location is then used to query the HapMap-CN database in order to identify all segments that overlap that location. Results are returned in an HTML table that can be sorted by sample name, copy number, or the start or end of the segment in the individual sample.
Our copy number analysis of the HapMap also identified numerous regions where the copy number varies from individual-to-individual; the frequency with which these changes occurred often differed between ethnic groups. One example of a highly variable region occurs on chromosome 22 in the vicinity of the PRAME gene, whose loss has previously been noticed in both chronic lymphocytic leukemia [33] and mantle cell lymphoma [34]. Using the genomic location of PRAME to query the 225 individuals in the HapMap-CN database, we found 3 individuals with homozygous deletions, 15 with heterozygous deletions, and 9 with gains in this region.
Database Size
Table 1 provides an overview of the size of these three databases. Each database contains only one kind of document, whose structure is explained in Supplementary Tables S1–3. More complicated designs can contain many different kinds of documents; the JavaScript code to define a view typically relies on either the existence or the value of a particular field to determine the document type and to decide if that document is an appropriate response to a query. At present, the drugBase (by_EntrezGene) and HapMap-CN (by_location) databases each contain only one pre-defined view. The geneSmash database, by contrast, has defined 11 views; since the values are precomputed for every view, geneSmash occupies much more disk space. At present, all three databases are synchronized quarterly with updates from their external sources.
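The way a view selects documents by type can be illustrated with a small simulation. CouchDB views are normally written in JavaScript and run inside the server; the Python sketch below only mimics the idea of a map function that inspects a field before emitting a key-value pair. The `type` field and the example documents are hypothetical, and Python's sort order stands in for CouchDB's own key collation.

```python
def map_by_symbol(doc):
    """Emit (key, value) pairs only for documents that look like gene documents."""
    if doc.get("type") == "gene" and "symbol" in doc:
        yield doc["symbol"], doc["_id"]


def build_index(docs, map_function):
    """Apply the map function to every document and sort the emitted pairs by key."""
    pairs = [pair for doc in docs for pair in map_function(doc)]
    return sorted(pairs)


# Hypothetical mixed collection: two gene documents and one segment document.
docs = [
    {"_id": "g1", "type": "gene", "symbol": "TP53"},
    {"_id": "s1", "type": "segment", "chromosome": "17"},
    {"_id": "g2", "type": "gene", "symbol": "EGFR"},
]
print(build_index(docs, map_by_symbol))
```

Because every view stores its own sorted list of emitted pairs, each additional view adds to the disk footprint, which is consistent with the larger view size reported for geneSmash.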
Table 1.
Size of CouchDB databases.
Database | Number of Documents | Document Represents | Database Size | View Size | Total Disk Space |
---|---|---|---|---|---|
geneSmash | 50215 | Gene | 218.6 MB | 2936.1 MB | 3154.7 MB |
drugBase | 16883 | Drug-target interaction | 29.6 MB | 25.2 MB | 54.8 MB |
HapMap-CN | 61164 | Segment | 370.5 MB | 47.9 MB | 418.4 MB |
Discussion
CouchDB is a non-relational, document-oriented, schema-free database with a built-in web server. Database queries are processed through HTTP requests, which are handled by the RESTful JSON API. This feature provides universal accessibility to any modern programming language without any customized API. CouchDB also provides native support for incremental database replication, which makes it a very convenient tool to maintain mirror copies of data. The reliance of CouchDB on standard protocols (HTTP and JSON) that are fully integrated into web browsers promotes rapid, incremental development of database applications. We have illustrated the power of CouchDB by presenting three specific bioinformatics applications: geneSmash, drugBase and HapMap-CN.
CouchDB makes it easy to replicate databases. We have taken advantage of replication during database development and maintenance. The development version of the database runs on the developer’s machine; because CouchDB is fairly lightweight, this can be a fairly modest laptop. The development version of the database is then replicated to a test environment that matches the intended production environment. After testing is completed, the database is replicated to the production server. For small databases like the ones described here, it is feasible to replicate the entire database onto the computers used by statistical and bioinformatics analysts, giving them immediate local access to the data.
We must point out that the HapMap-CN application illustrates one potential challenge when using CouchDB for genomic data. CouchDB stores all documents in a B-tree structure that mathematically represents a complete (linear) ordering of the stored documents. Because it uses this structure, the CouchDB API is optimized for queries that return a consecutive range of documents (i.e., those between startkey and endkey; see Materials and Methods). In the HapMap-CN database, each document represents a genomic segment. The fundamental key that describes a segment consists of a chromosomal starting and ending location. Queries into the HapMap-CN database also take the form of segments, and the goal is to retrieve all segments in the database that overlap the query segment. One can show mathematically that it is impossible to impose a complete ordering on the set of all sub-segments that will ensure that the response to every query is a consecutive range of segments (or documents). Because the DBMS is currently not able to provide such functionality, HapMap-CN solves the problem by embedding logic in the front-end JavaScript application to restrict the results of a query to the desired segments. The open source CouchDB development community has faced similar challenges when trying to handle geographic data, which is also intrinsically two-dimensional. They responded by developing a modified server, GeoCouch [35], which uses a different underlying data store optimized for spatial queries. If CouchDB is adopted by enough genomics researchers, a similar alteration could be made to the open source code to develop a server that is more highly optimized for segment-based queries.
Relational (RDBMS) databases and NoSQL databases have their own strengths and weaknesses, and are best suited for different kinds of applications. RDBMS represents a mature technology that has been optimized over several decades. RDBMS systems have extensive support for transactions, which they use to ensure data consistency and data integrity. With properly normalized tables, there is a unique source for each piece of data stored in the database. RDBMS also have a flexible and powerful query language (SQL) that supports ad hoc queries. By contrast, CouchDB is much less mature, with limited support for transactions. The same piece of information may be repeated across many documents, which presents potential difficulties with maintenance if one of those pieces of information must be updated. Finally, CouchDB does not support ad hoc queries; all views into the database must be predefined in the design documents. If any of these features are critical to an application, then an RDBMS should be preferred over CouchDB.
There are, however, many potential database applications in genomics that do not require transactions, normalized tables, or ad hoc queries; the three applications we present here provide examples. In general, the tradeoff for normalization is a lack of flexibility: it is hard to change things when a piece of data comes along that does not fit into the schema. Our earliest version of geneSmash simply recapitulated the human gene data from Entrez Gene. While developing HapMap-CN, we recognized that it would be convenient for users to search for genes in a genomic region of interest. We then added the genomic mapping data from the UCSC Genome Browser into geneSmash, along with some new views. We later decided that it would be useful when analyzing microarray data to be able to query genes by the probe identifiers defined by different microarray manufacturers. So, we also added these data to the existing gene documents and defined additional views. In both cases, none of the existing views had to change; and none of our editing code broke.
We believe that the tight integration of CouchDB with web standards provides two advantages. First, client-side web applications talk directly to CouchDB without the need for a server-side middle layer, significantly reducing development time. These applications rely on Asynchronous JavaScript and XML (AJAX) methods. AJAX plays a central role in much current web development, and, as a result, there are extensive open source libraries that make it easy to develop web sites that allow users to interact directly with the data. We use the implementation of AJAX provided by the jQuery JavaScript library (http://jquery.com/). The sortable table in the HapMap-CN application is implemented using a jQuery plugin called TableSorter (http://tablesorter.com/docs/). We are currently developing additional applications that rely on a jQuery plugin called flot (http://code.google.com/p/flot/) to create interactive graphics in a web browser using data stored in CouchDB.
The second advantage arising from CouchDB’s use of web standards is that every CouchDB application provides a web service with a RESTful interface, and not just a web site. Now, it is undoubtedly true that there are many experienced database programmers and database administrators who know how to use SQL to get data out of RDBMS databases and convert it into the format needed to perform analyses. Many bioinformatics and statistical analysts, however, do not know how to do this. In our experience, we have found that they can easily and quickly learn to use the HTTP/JSON API of CouchDB to get data into the systems that they use to perform their analyses. For example, once they know the URL for a relevant query, they can get the results of that query into the R statistical programming environment in only three lines of code:
library(RJSONIO)
tempstring <- paste(readLines(URL), collapse="")
results <- fromJSON(tempstring)
The “results” object that this code produces is a native R object that statistical analysts can examine and then use in their analyses directly.
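The same pattern carries over to other languages. As a hedged illustration, a Python equivalent needs only the standard library; the `opener` parameter is an addition made here so that the function can be exercised without a network connection, and it is not part of any CouchDB interface.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Fetch a URL and parse the JSON response into native Python objects."""
    with opener(url) as response:
        return json.load(response)
```

Calling `fetch_json(URL)` with the URL of a relevant query returns nested dictionaries and lists that can be examined and used directly, much like the R object above.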
One potential advantage that is often promoted for NoSQL databases is “horizontal scalability”. The performance of RDBMS databases scales well as long as all of the data can be stored on a single server. Performance can be improved by using multiple servers, provided each one has access to a complete copy of the data. However, if the data grows large enough that it needs to be split over multiple server nodes, RDBMS performance can degrade quickly; relational table joins across servers are complicated and expensive. The NoSQL (and CouchDB) solution is to replace table joins with queries that use the map/reduce paradigm, which is an inherently parallel procedure. Major web presences like Google, Amazon, Facebook, and Twitter use NoSQL databases to provide some of their services. The three databases described in this manuscript are far too small to benefit from this ability. We speculate, however, that the large datasets being generated by next generation sequencing technologies could potentially benefit from the horizontal scalability of NoSQL databases.
Materials and methods
CouchDB
CouchDB is a document-oriented, schema-free database with a built-in web server. Database queries are processed through HTTP requests, which are handled by a RESTful application programming interface (API) [36]. All interaction with the database uses only RESTful verbs (GET, PUT, POST, DELETE) that are defined by the HTTP standard. Database responses are returned in JavaScript Object Notation (JSON). It is possible to define specialized queries in a CouchDB application that return alternative formats in addition to JSON; the geneSmash, drugBase, and HapMap-CN web sites use this facility to return HTML-formatted responses for display in a web page. The CouchDB API uses only protocols that are defined by open web standards. Since the systems we developed use the same protocols, their web services are immediately accessible from any modern programming language without having to learn a new specialized query language.
The possible queries that can be made to these web services are self-documenting; this feature is a direct consequence of the design of CouchDB. The developer of a CouchDB database application defines the named views (the CouchDB term for queries) that can be applied to the data. Each named view produces key-value pairs from appropriate documents. The generic CouchDB API defines parameters key, startkey, and endkey that allow the user to restrict the set of documents returned when querying a view. Moreover, each named view must be defined as part of a design document that is stored in the database. Every CouchDB document has a unique identifier; design documents are characterized by an identifier that starts with the reserved word _design. Thus, you can obtain a list of all design documents in a Couch database called “db” by sending an HTTP GET request of the form
GET /db/_all_docs?startkey="_design"&endkey="_design0".
You can then send another GET request for each design document and read the resulting JSON value to obtain a complete list of the views that it defines.
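The two steps can be sketched in Python. The helper names are hypothetical, and the sketch assumes the standard CouchDB layouts: an `_all_docs` response with a `rows` list whose entries carry an `id`, and a design document whose `views` field maps each view name to its definition. The example design document is made up for illustration.

```python
import json
from urllib.parse import urlencode


def design_docs_url(db_url):
    """URL that lists every document whose _id falls in the _design range."""
    query = urlencode({"startkey": json.dumps("_design"),
                       "endkey": json.dumps("_design0")})
    return f"{db_url.rstrip('/')}/_all_docs?{query}"


def design_doc_ids(all_docs_response):
    """Pull the document identifiers out of an _all_docs response."""
    return [row["id"] for row in all_docs_response.get("rows", [])]


def view_names(design_doc):
    """Return the sorted names of the views defined in one design document."""
    return sorted(design_doc.get("views", {}))


# Made-up design document in the standard layout.
example_design = {"_id": "_design/basic",
                  "views": {"by_symbol": {"map": "function(doc) { ... }"},
                            "by_ensembl": {"map": "function(doc) { ... }"}}}
print(design_docs_url("http://localhost:5984/db"))
print(view_names(example_design))
```

The quotation marks around the key values are percent-encoded in the generated URL, which is equivalent to the literal form shown in the GET request above.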
CouchDB relies on a map/reduce paradigm that allows it to pre-compute the responses to all views defined in all design documents. Technically, it computes all responses when the first query is made after a view is defined or modified. When new data documents are inserted, it only computes the responses for the new or updated documents. The initial indexing step can be quite time-consuming, depending on the number of documents in the database and the number and complexity of the views that are defined in the design document. After indexing has been completed, responses are extremely fast. For example, retrieving a complete list of the 50,000 symbols known to geneSmash takes about 10 seconds; retrieving the complete gene documents for all 50,000 symbols takes about 30 seconds. How quickly you can use this information in a program, however, depends on the quality of the implementation of the programming library that converts from the JSON format of the response into the native format for that language. The (highly optimized) RJSONIO package [37] for the R statistical software environment converts all 50,000 gene documents into an R object in less than 30 seconds.
CouchDB provides native support for database replication. The replication function allows any database to be copied from one CouchDB server to another. Replication works on the granularity of individual documents and is incremental. That is, only documents that have been inserted or modified since the previous replication are replicated on a later request. As a consequence, creation and maintenance of mirror copies of databases as CouchDB instances is easy, reliable, and robust.
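A replication request can be sketched as follows. CouchDB triggers replication through an HTTP POST to the server's `/_replicate` endpoint with a JSON body naming the source and target databases; the host names and database name below are hypothetical, and the sketch only builds the request without sending it.

```python
import json
from urllib.request import Request


def replication_request(server, source, target, create_target=False):
    """Build a POST to /_replicate that copies 'source' into 'target'."""
    body = {"source": source, "target": target}
    if create_target:
        body["create_target"] = True  # ask the server to create a missing target database
    return Request(f"{server.rstrip('/')}/_replicate",
                   data=json.dumps(body).encode("utf-8"), method="POST",
                   headers={"Content-Type": "application/json"})


# Hypothetical hosts: push a local development copy to a test server.
request = replication_request("http://localhost:5984", "exampledb",
                              "http://test.example.org:5984/exampledb",
                              create_target=True)
print(request.full_url, json.loads(request.data))
```

Because replication is incremental, repeating the same request later transfers only the documents that changed since the previous run.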
geneSmash
geneSmash integrates data from existing biological annotation data repositories. The basic gene and cross-database identifier information was obtained from the NCBI Entrez Gene database [15]. The gene locus information for multiple human genome builds was acquired from the UCSC genome browser web site [16]. miRNA information was obtained from multiple versions of miRBase [17; 18; 19; 20]. Probe annotation information for a repertoire of gene expression arrays was obtained from the corresponding manufacturers’ web sites [21; 22; 23]. The geneSmash web site can be accessed at http://app1.bioinformatics.mdanderson.org/genesmash/_design/basic/index.html or the web service can be queried directly at http://app1.bioinformatics.mdanderson.org/genesmash/ using the full CouchDB API.
drugBase
drugBase is a database of drug-target interactions, which can be batch-queried by gene target. The primary source of information for drugBase is Matador, a manually curated version of the SuperTarget database of drug-target interactions [31]. Potential drug-target relationships in SuperTarget were extracted from the literature using PubMed, MEDLINE, and Medical Subject Heading (MeSH) terms [38]. Drug-target interactions with supporting PubMed literature in the existing databases DrugBank [26; 39], TTD [24; 25], KEGG [40], PDB [41; 42] and SuperLigands [43] were also present in SuperTarget. The protein identifiers used in Matador were defined based on an older version (v7.1) of the STRING protein interaction database [44]. We mapped these identifiers onto the corresponding Entrez Gene identifiers. In instances where they could not be mapped directly, we used HUGO gene symbols and NCBI RefSeq identifiers. Because drugBase is designed as a gene-centered drug-target interaction database, drug-target interactions that Matador defines solely by MeSH terms, without any specific gene or protein, were not included in drugBase. The web site can be accessed at http://app1.bioinformatics.mdanderson.org/drugbase/_design/basic/index.html or the web service can be queried directly at http://app1.bioinformatics.mdanderson.org/drugbase/ using the full CouchDB API.
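One way to issue a batch query through the generic CouchDB API is to POST a JSON body containing a "keys" array to a view. The sketch below assumes a hypothetical view named by_gene in a design document named basic, and it uses a mock response rather than actual drugBase output; the real view names should be read from the design documents of the database.

```python
import json
from urllib.request import Request


def batch_view_request(base_url, db, design, view, keys):
    """Build (but do not send) a POST that asks a view for many keys at once."""
    unique_keys = list(dict.fromkeys(keys))  # drop duplicates, keep input order
    return Request(
        url=f"{base_url}/{db}/_design/{design}/_view/{view}",
        data=json.dumps({"keys": unique_keys}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_by_key(response):
    """Group the value of each returned view row under its key."""
    grouped = {}
    for row in response.get("rows", []):
        grouped.setdefault(row["key"], []).append(row["value"])
    return grouped


# Hypothetical view name; the request is only constructed here.
req = batch_view_request(
    "http://localhost:5984", "drugbase", "basic", "by_gene", ["EGFR", "ABL1", "EGFR"]
)

# Mock response in the standard CouchDB view format (not real drugBase output).
mock_response = {
    "total_rows": 3,
    "offset": 0,
    "rows": [
        {"id": "doc-1", "key": "EGFR", "value": "gefitinib"},
        {"id": "doc-2", "key": "EGFR", "value": "erlotinib"},
        {"id": "doc-3", "key": "ABL1", "value": "imatinib"},
    ],
}

print(req.full_url)
print(req.data.decode("utf-8"))
print(rows_by_key(mock_response))
```

Sending all gene identifiers in one request avoids one HTTP round trip per gene, which matters when annotating a long gene list.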
HapMap-CN database
The HapMap-CN database provides a tool to query the results of a copy number analysis of SNP chip data of 225 HapMap samples. The raw data were downloaded as three data sets (GSE17205, GSE17206, and GSE17207) from the Gene Expression Omnibus [32]. These datasets from the HapMap project were acquired using Illumina 610K Quad v1 chips containing measurements on 225 normal samples from different ethnic groups. We processed the raw data in Illumina GenomeStudio to compute genotype calls, log R ratios (LRR), and B allele frequencies (BAF) for each SNP in each sample. We applied the circular binary segmentation (CBS) algorithm to LRR and BAF values for each chromosome of every sample to derive copy-number segments [45]. We used the CBS implementation from the R package DNAcopy to record the segmentation results [46; 47]. Loss of heterozygosity (LOH) was also estimated for every chromosome in all the samples. The algorithm computes the odds ratio of LOH vs. no LOH by considering the probability of seeing (almost) all consecutive homozygous SNPs in a string of 40 consecutive informative SNPs. The segment data from multiple algorithms were then summarized for each sample along each segment. Complete Sweave source code detailing the copy number analysis of the HapMap samples is available online at http://bioinformatics.mdanderson.org/HapMap/docs/index.html. The web site can be accessed at http://app1.bioinformatics.mdanderson.org/hapmap/_design/basic/index.html or the web service can be queried directly at http://app1.bioinformatics.mdanderson.org/hapmap/ using the full CouchDB API.
Supplementary Material
Highlights.
We present the first three applications of CouchDB to bioinformatics.
geneSmash is a new web service integrating gene annotations and genomic location.
drugBase supports gene-based batch queries to get drug-target interactions.
HapMap-CN supports gene-based queries of copy number analyses.
Applications use standard internet protocols, usable in all programming languages.
Acknowledgments
This work was supported in part by grants R01 CA123252, P30 CA016672, and P50 CA070907 from the National Cancer Institute of the National Institutes of Health.
References
- 1. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–8. doi: 10.1038/nature07385.
- 2. Codd EF. A relational model of data for large shared data banks. Commun ACM. 1970;13:377–387.
- 3. Strauch C. NoSQL Databases, Lecture: Selected Topics on Software-Technology Ultra-Large Scale Sites. Stuttgart Media University; Stuttgart: 2011. p. 149.
- 4. Leavitt N. Will NoSQL Databases Live Up to Their Promise? Computer. 2010;43:12–14.
- 5. O’Connor BD, Merriman B, Nelson SF. SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics. 2010;11(Suppl 12):S2. doi: 10.1186/1471-2105-11-S12-S2.
- 6. Russ TA, Ramakrishnan C, Hovy EH, Bota M, Burns GA. Knowledge engineering tools for reasoning with scientific observations and interpretations: a neural connectivity use case. BMC Bioinformatics. 2011;12:351. doi: 10.1186/1471-2105-12-351.
- 7. Thorisson GA, Smith AV, Krishnan L, Stein LD. The International HapMap Project Web site. Genome Res. 2005;15:1592–3. doi: 10.1101/gr.4413105.
- 8. The International HapMap Project. Nature. 2003;426:789–96. doi: 10.1038/nature02168.
- 9. Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A. BioMart--biological queries made easy. BMC Genomics. 2009;10:22. doi: 10.1186/1471-2164-10-22.
- 10. Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A. BioMart Central Portal--unified access to biological data. Nucleic Acids Res. 2009;37:W23–7. doi: 10.1093/nar/gkp265.
- 11. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, Sirota-Madi A, Olender T, Golan Y, Stelzer G, Harel A, Lancet D. GeneCards Version 3: the human gene integrator. Database (Oxford). 2010:baq020. doi: 10.1093/database/baq020.
- 12. Stelzer G, Inger A, Olender T, Iny-Stein T, Dalah I, Harel A, Safran M, Lancet D. GeneDecks: paralog hunting and gene-set distillation with GeneCards annotation. OMICS. 2009;13:477–87. doi: 10.1089/omi.2009.0069.
- 13. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. The distributed annotation system. BMC Bioinformatics. 2001;2:7. doi: 10.1186/1471-2105-2-7.
- 14. Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, Kahari A, Kulesha E, Macias JR, Reeves GA, Prlic A. Integrating biological data--the Distributed Annotation System. BMC Bioinformatics. 2008;9(Suppl 8):S3. doi: 10.1186/1471-2105-9-S8-S3.
- 15. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2011;39:D52–7. doi: 10.1093/nar/gkq1237.
- 16. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011;39:D876–82. doi: 10.1093/nar/gkq963.
- 17. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004;32:D109–11. doi: 10.1093/nar/gkh023.
- 18. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006;34:D140–4. doi: 10.1093/nar/gkj112.
- 19. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008;36:D154–8. doi: 10.1093/nar/gkm952.
- 20. Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011;39:D152–7. doi: 10.1093/nar/gkq1027.
- 21. Affymetrix. NetAffx™ Analysis Center; 2011. http://www.affymetrix.com/analysis/index.affx.
- 22. Illumina. Annotation Files - Whole-genome Arrays; 2011. http://www.switchtoi.com/annotationfiles.ilmn.
- 23. Agilent. eArray web portal; 2011. https://earray.chem.agilent.com/earray/
- 24. Chen X, Ji ZL, Chen YZ. TTD: Therapeutic Target Database. Nucleic Acids Res. 2002;30:412–5. doi: 10.1093/nar/30.1.412.
- 25. Zhu F, Han B, Kumar P, Liu X, Ma X, Wei X, Huang L, Guo Y, Han L, Zheng C, Chen Y. Update of TTD: Therapeutic Target Database. Nucleic Acids Res. 2010;38:D787–91. doi: 10.1093/nar/gkp1014.
- 26. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006;34:D668–72. doi: 10.1093/nar/gkj067.
- 27. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS. DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res. 2011;39:D1035–41. doi: 10.1093/nar/gkq1126.
- 28. Gao Z, Li H, Zhang H, Liu X, Kang L, Luo X, Zhu W, Chen K, Wang X, Jiang H. PDTD: a web-accessible protein database for drug target identification. BMC Bioinformatics. 2008;9:104. doi: 10.1186/1471-2105-9-104.
- 29. Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen LJ, Beyer A, Bork P. STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res. 2010;38:D552–6. doi: 10.1093/nar/gkp937.
- 30. Kuhn M, von Mering C, Campillos M, Jensen LJ, Bork P. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 2008;36:D684–8. doi: 10.1093/nar/gkm795.
- 31. Gunther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales EG, Gewiess A, Jensen LJ, Schneider R, Skoblo R, Russell RB, Bourne PE, Bork P, Preissner R. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 2008;36:D919–22. doi: 10.1093/nar/gkm862.
- 32. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–10. doi: 10.1093/nar/30.1.207.
- 33. Gunn SR, Bolla AR, Barron LL, Gorre ME, Mohammed MS, Bahler DW, Mellink CH, van Oers MH, Keating MJ, Ferrajoli A, Coombes KR, Abruzzo LV, Robetorye RS. Array CGH analysis of chronic lymphocytic leukemia reveals frequent cryptic monoallelic and biallelic deletions of chromosome 22q11 that include the PRAME gene. Leuk Res. 2009;33:1276–81. doi: 10.1016/j.leukres.2008.10.010.
- 34. Bea S, Salaverria I, Armengol L, Pinyol M, Fernandez V, Hartmann EM, Jares P, Amador V, Hernandez L, Navarro A, Ott G, Rosenwald A, Estivill X, Campo E. Uniparental disomies, homozygous deletions, amplifications, and target genes in mantle cell lymphoma revealed by integrative high-resolution whole-genome profiling. Blood. 2009;113:3059–69. doi: 10.1182/blood-2008-07-170183.
- 35. GeoCouch: a spatial extension for Apache CouchDB and Couchbase. GitHub. https://github.com/couchbase/geocouch/
- 36. Lennon J. Beginning CouchDB. Apress; 2010.
- 37. Lang DT. RJSONIO: Serialize R objects to JSON, JavaScript Object Notation; 2012. http://www.omegahat.org/RJSONIO/
- 38. Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc. 2000;88:265–6.
- 39. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36:D901–6. doi: 10.1093/nar/gkm958.
- 40. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–7. doi: 10.1093/nar/gkj102.
- 41. Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z, Green RK, Flippen-Anderson JL, Westbrook J, Berman HM, Bourne PE. The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res. 2005;33:D233–7. doi: 10.1093/nar/gki057.
- 42. Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, Prlic A, Quesada M, Quinn GB, Westbrook JD, Young J, Yukich B, Zardecki C, Berman HM, Bourne PE. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 2011;39:D392–401. doi: 10.1093/nar/gkq1021.
- 43. Michalsky E, Dunkel M, Goede A, Preissner R. SuperLigands - a database of ligand structures derived from the Protein Data Bank. BMC Bioinformatics. 2005;6:122. doi: 10.1186/1471-2105-6-122.
- 44. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P. STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005;33:D433–7. doi: 10.1093/nar/gki005.
- 45. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.
- 46. Seshan VE, Olshen A. DNAcopy: DNA copy number data analysis. Bioconductor.
- 47. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–63. doi: 10.1093/bioinformatics/btl646.