Abstract
With the increasing application of various genomic technologies in biomedical research, there is a need to integrate these data to correlate candidate genes/regions that are identified by different genomic platforms. Although there are tools that can analyze data from individual platforms, essential software for integration of genomic data is still lacking. Here, we present a novel Java-based program called CGI (Cytogenetics-Genomics Integrator) that matches the BAC clones from array-based comparative genomic hybridization (aCGH) to genes from RNA expression profiling datasets. The matching is computed via a fast, backend MySQL database containing UCSC Genome Browser annotations. This program also provides an easy-to-use graphical user interface for visualizing and summarizing the correlation of DNA copy number changes and RNA expression patterns from a set of experiments. In addition, CGI uses a Java applet to display the copy number values of a specific BAC clone in aCGH experiments side by side with the expression levels of genes that are mapped back to that BAC clone from the microarray experiments. The CGI program is built on top of extensible, reusable graphic components specifically designed for biologists. It is cross-platform compatible and the source code is freely available under the General Public License.
Keywords: aCGH, expression profiling, visualization, correlation, and data integration
Introduction
With the advent of genomic technologies, DNA and RNA-based microarrays are becoming more accessible to biomedical researchers. One of the common DNA platforms is array-based comparative genomic hybridization (aCGH), which can identify DNA copy number aberrations in the genome (Pinkel, 1998; Man, 2004). Many software tools have been developed to analyze aCGH data (Jong, 2004; Margolin, 2005; Chen, 2005; Cheung, 2005; Price, 2005; Kim, 2005) and expression microarray data (Sykacek, 2005; Shamir, 2005; Saraiya, 2005; Li, 2001; Vaquerizas, 2005; Bumm, 2002; Saeed, 2003); however, no tool is currently available for the biologist to integrate these two types of data. One of the main challenges is that once the significant BAC clones or genes are identified, it is very difficult to correlate the DNA copy number and RNA expression results, because the significant genes may not lie within the corresponding BAC clones even though they are located in the same chromosomal region. Therefore, a more precise method of matching is needed in order to properly correlate these two types of data.
A typical way to perform the matching is to manually search the UCSC Genome Browser (http://genome.ucsc.edu/) to make sure the significant genes lie within the significant BAC clones. However, this type of manual search is laborious and error-prone when the numbers of BAC clones and genes are large. Thus, it is important to develop a user-friendly and flexible tool that can match, correlate, and display the aCGH and expression profiling data. Since it is common to identify hundreds to thousands of significant genes by either expression profiling or aCGH experiments, our program can further assist researchers in selecting genes that are found to be significant by both types of experiments, or genes that may not be identified by using either type of technique alone.
To address this issue, we developed a Java-based, stand-alone program that uses a MySQL database (http://www.mysql.com) as a backend to store the BAC clone and gene information downloaded from the UCSC database. This information is used to match the user-provided BAC clones in aCGH experiments and genes in expression profiling experiments. The correlation coefficients and p-values of the matched BAC clone-gene pairs are then computed and displayed in various formats for data visualization and comparison.
Software Designs
The CGI software is based on an object-oriented framework designed to search for features/genes in RNA expression-profiling experiments that map back to the corresponding BAC clones in aCGH experiments. The program combines bioinformatic data matching from databases with simple correlation analysis. The software is organized into three functional modules (Data, Annotation, and Correlation). The Data module contains the DNA copy number and RNA expression data and links them with the Annotation module by interacting with the MySQL database, which holds several types of genomic information, including chromosomal localization, Unigene ID, and gene annotation data. Information in the database is used to match the BAC clones and the genes provided by the users. The Correlation module calculates the Pearson correlation coefficients and p-values between the DNA copy numbers and expression values of matched BAC clone-gene pairs in different experiments. It also displays the DNA copy numbers of a specific BAC clone in different aCGH experiments and the associated gene expression values in microarray experiments for easy data visualization and comparison.
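The calculation performed by the Correlation module can be illustrated with a minimal sketch. The source does not describe how CGI derives its p-values, so the code below assumes the usual t-test for a Pearson coefficient with n - 2 degrees of freedom and a two-sided p-value; the class and method names (`CorrelationSketch`, `pearson`, `tStatistic`, `twoSidedP`) are hypothetical and are not taken from the CGI source code.

```java
/** Illustrative only: Pearson r and an assumed two-sided t-test p-value. */
final class CorrelationSketch {

    /** Pearson correlation coefficient of two equal-length series. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 3) {
            throw new IllegalArgumentException("Need two series of equal length >= 3");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // Clamp to [-1, 1] to absorb rounding error; a constant series yields NaN.
        return Math.max(-1.0, Math.min(1.0, sxy / Math.sqrt(sxx * syy)));
    }

    /** t statistic for testing r against zero, with n - 2 degrees of freedom. */
    static double tStatistic(double r, int n) {
        return r * Math.sqrt((n - 2) / (1.0 - r * r));
    }

    /**
     * Two-sided p-value of Student's t distribution with df degrees of freedom.
     * Substituting x = sqrt(df) * tan(theta) turns the tail area into a ratio
     * of bounded integrals of cos(theta)^(df - 1), evaluated by Simpson's rule.
     */
    static double twoSidedP(double t, int df) {
        double theta0 = Math.atan(Math.abs(t) / Math.sqrt(df));
        double tail = integrateCosPower(theta0, Math.PI / 2, df - 1);
        double half = integrateCosPower(0.0, Math.PI / 2, df - 1);
        return tail / half;
    }

    /** Composite Simpson's rule for cos(theta)^power on [a, b]. */
    private static double integrateCosPower(double a, double b, int power) {
        int steps = 2000; // must be even for Simpson's rule
        double h = (b - a) / steps;
        double sum = Math.pow(Math.cos(a), power) + Math.pow(Math.cos(b), power);
        for (int i = 1; i < steps; i++) {
            sum += (i % 2 == 0 ? 2.0 : 4.0) * Math.pow(Math.cos(a + i * h), power);
        }
        return sum * h / 3.0;
    }
}
```

For a matched BAC clone-gene pair, `pearson` would be called with the clone's log ratios and the gene's expression values across the same cases, and `twoSidedP(tStatistic(r, n), n - 2)` would give the corresponding p-value under the stated assumption.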
Data Importing
A simple graphical user interface (GUI) prompts users to enter the user name, password, database name, and the locations of the aCGH and RNA expression-profiling files (Fig. 1). The aCGH file contains FISH-mapped BAC Clone IDs, cytobands, and normalized log ratios representing DNA copy numbers from aCGH experiments. The RNA expression-profiling file contains Unigene IDs, gene symbols, and log ratios (dual-channel arrays) or log intensities (Affymetrix or oligo-based arrays) of gene expression in a set of experiments involving the same cases as the aCGH experiments.
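To make the aCGH input concrete, the following sketch parses one row of such a file. The source lists the fields but not the physical layout, so a tab-delimited line of the form clone ID, cytoband, then one log ratio per experiment is an assumption, as are the class and method names. The `parseAll` method also shows one possible form of the check that every row carries the same number of experiments.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of a hypothetical tab-delimited aCGH input file. */
final class AcghRow {
    final String cloneId;
    final String cytoband;
    final double[] logRatios; // one normalized log ratio per experiment

    AcghRow(String cloneId, String cytoband, double[] logRatios) {
        this.cloneId = cloneId;
        this.cytoband = cytoband;
        this.logRatios = logRatios;
    }

    /** Parses "cloneId TAB cytoband TAB ratio1 TAB ratio2 ..." (assumed layout). */
    static AcghRow parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                "Expected clone ID, cytoband and at least one log ratio: " + line);
        }
        double[] ratios = new double[fields.length - 2];
        for (int i = 0; i < ratios.length; i++) {
            ratios[i] = Double.parseDouble(fields[i + 2].trim());
        }
        return new AcghRow(fields[0].trim(), fields[1].trim(), ratios);
    }

    /** Parses all lines and rejects rows with an unexpected experiment count. */
    static List<AcghRow> parseAll(List<String> lines, int expectedExperiments) {
        List<AcghRow> rows = new ArrayList<>();
        for (String line : lines) {
            AcghRow row = parse(line);
            if (row.logRatios.length != expectedExperiments) {
                throw new IllegalArgumentException(
                    "Clone " + row.cloneId + " has " + row.logRatios.length
                        + " experiments, expected " + expectedExperiments);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

A call such as `AcghRow.parse("RP5-874C20\t6p21.1\t0.8\t1.1\t-0.2")` would yield a row for that clone with three log ratios; the numeric values here are invented for illustration. An analogous class could hold the Unigene ID, gene symbol, and expression values of the RNA expression-profiling file.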
Data Querying and Mining
The program queries the data in two steps. First, the BAC Clone IDs in an aCGH input file are used to query the MySQL database, which stores the fishClone and uniGene_2 tables downloaded from the UCSC database (http://genome.ucsc.edu/cgi-bin/hgTables). The two tables are first downloaded by the user and imported into the MySQL database as described in the installation manual (see supplemental information). The Unigene IDs of the genes that reside in each BAC clone in the aCGH input file are retrieved by SQL commands based on chromosome number and physical location. Second, these Unigene IDs are matched against the features/genes provided in an RNA expression-profiling input file to identify the matched BAC clone-gene pairs. The DNA copy numbers and gene expression values of the matched BAC clone-gene pairs are then extracted from the input files, and their Pearson correlation coefficients and p-values are computed by an internal correlation function. Finally, the correlation coefficients and p-values of the BAC clone-gene pairs are tabulated together with the BAC Clone IDs, cytobands, Unigene IDs, and gene symbols provided by the input files. If there are multiple genes within a BAC clone, the program replicates the DNA copy number data of that BAC clone and correlates it with the expression data of each gene that is mapped to the BAC clone.
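The positional matching step can be sketched in memory as follows. CGI performs this step with SQL commands against the MySQL tables; the code below only illustrates the same kind of predicate (same chromosome, intersecting coordinates) in plain Java. Treating any overlap as a match and using half-open coordinates are assumptions, since the source does not say whether a gene must lie entirely within the clone, and the class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A named genomic interval with half-open coordinates [start, end). */
final class Interval {
    final String id;
    final String chrom;
    final long start;
    final long end;

    Interval(String id, String chrom, long start, long end) {
        this.id = id;
        this.chrom = chrom;
        this.start = start;
        this.end = end;
    }

    /** True when both intervals are on the same chromosome and intersect. */
    boolean overlaps(Interval other) {
        return chrom.equals(other.chrom) && start < other.end && other.start < end;
    }
}

final class CloneGeneMatcher {
    /** Maps each clone ID to the IDs of the genes whose intervals overlap it. */
    static Map<String, List<String>> match(List<Interval> clones, List<Interval> genes) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (Interval clone : clones) {
            List<String> hits = new ArrayList<>();
            for (Interval gene : genes) {
                if (clone.overlaps(gene)) {
                    hits.add(gene.id);
                }
            }
            matches.put(clone.id, hits);
        }
        return matches;
    }
}
```

A clone that maps to several genes produces several clone-gene pairs, which corresponds to the replication of the clone's copy number data described above. A stricter containment rule would replace the test in `overlaps` with a check that the gene's start and end both fall inside the clone.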
Data Visualization
CGI uses a correlation table to display a global overview of the matched BAC clones and features/genes: BAC Clone IDs, cytobands, the corresponding Unigene IDs and gene symbols, and the Pearson correlation coefficients and p-values. The table view is flexible, and the data in the table can be sorted dynamically in ascending or descending order based on the correlation coefficient, BAC Clone ID, cytoband location, Unigene ID, etc. (Fig. 2). Users can also reorder the columns to display different views according to their preference. Besides the table view, users can visualize in detail the DNA copy number of a specific BAC clone and the expression of its associated genes by entering the BAC Clone ID into the text box provided in the GUI (Fig. 1). The CGI program displays two graphic windows if the input BAC Clone ID matches one or more Clone IDs in the correlation table. One window displays three line graphs representing the DNA copy number changes of the queried BAC clone in aCGH experiments and the expression values of its associated genes in RNA expression-profiling experiments (Fig. 3). The second window displays the DNA copy number data and RNA expression data as separate bar graphs, which makes the individual experiments easier to inspect when the number of matched genes is high (Fig. 4). This function provides a graphical visualization of the correlation between a BAC clone and its matching genes/features.
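The sorting behavior of the correlation table can be illustrated without any GUI code. The sketch below is not CGI's table implementation, which the source does not describe; it only shows, under hypothetical class and field names, how rows might be ordered by correlation coefficient or by BAC Clone ID in either direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One row of a hypothetical correlation table. */
final class CorrelationRow {
    final String cloneId;
    final String unigeneId;
    final double r;      // Pearson correlation coefficient
    final double pValue;

    CorrelationRow(String cloneId, String unigeneId, double r, double pValue) {
        this.cloneId = cloneId;
        this.unigeneId = unigeneId;
        this.r = r;
        this.pValue = pValue;
    }
}

final class CorrelationTableSketch {
    /** Orders rows by correlation coefficient. */
    static Comparator<CorrelationRow> byCoefficient(boolean ascending) {
        Comparator<CorrelationRow> order = Comparator.comparingDouble(row -> row.r);
        return ascending ? order : order.reversed();
    }

    /** Orders rows alphabetically by BAC Clone ID. */
    static Comparator<CorrelationRow> byCloneId(boolean ascending) {
        Comparator<CorrelationRow> order = Comparator.comparing(row -> row.cloneId);
        return ascending ? order : order.reversed();
    }

    /** Returns a sorted copy so that the original row order is preserved. */
    static List<CorrelationRow> sorted(List<CorrelationRow> rows,
                                       Comparator<CorrelationRow> order) {
        List<CorrelationRow> copy = new ArrayList<>(rows);
        copy.sort(order);
        return copy;
    }
}
```

Sorting with `byCoefficient(false)` places the most strongly positively correlated clone-gene pairs at the top, which is the view a user would typically want when screening for candidate genes.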
Application
To test this program, we analyzed a previously published dataset that contains data from both aCGH arrays (Man, 2004) and cDNA microarrays (Man, 2005) of a set of pediatric osteosarcoma patients. We found several genes with RNA expression correlating with the DNA copy numbers in the corresponding BAC clones (r > 0.5, n = 15, p < 0.05, Fig. 2). One of the highly correlated genes (ZNF187) is mapped back to the BAC clone RP5-874C20, which lies in one of the most frequently amplified regions (6p21.1) in osteosarcoma (Man, 2004). ZNF187, or SRE-ZBP, is induced by serum response and may regulate the oncogene c-fos by binding to its serum response element (Attar, 1992). We validated the matching and correlation results of CGI by manually searching the UCSC Genome Browser to confirm the match between BAC Clone ID and Unigene ID, and by recalculating the correlation coefficients using an independent method.
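As a rough check of how a coefficient near 0.5 relates to significance with 15 cases, the standard t-test for a Pearson coefficient can be worked through. This assumes a t-test with n - 2 degrees of freedom; the source does not state which test CGI applies or whether the reported p-values are one-sided or two-sided.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.5\,\sqrt{13}}{\sqrt{0.75}} \approx 2.08,
\qquad \mathrm{df} = n - 2 = 13
```

With 13 degrees of freedom, the 5% critical value of t is about 1.77 for a one-sided test and about 2.16 for a two-sided test, the latter corresponding to r of roughly 0.51. Under this assumption, a coefficient of exactly 0.5 is significant at the 0.05 level one-sided and borderline two-sided, while coefficients somewhat above 0.5 satisfy p < 0.05 in either case.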
Discussion
We have developed the CGI program, which provides a simple yet powerful tool for matching, correlating, and visualizing aCGH and gene expression-profiling results simultaneously in multiple experiments. This tool is useful because it correlates the results from DNA profiling with those from RNA expression-profiling experiments in order to identify genes that are important at both the DNA and RNA levels. Genes that are significantly altered in both sets of experiments are more likely to be biologically significant and therefore warrant further investigation. The program also removes the need for manual matching between BAC clones on the aCGH arrays and the features on gene expression arrays using public databases. For data analysis, it provides a visualization tool and correlation calculations with an interactive and flexible interface. We have also implemented error detection routines to handle the database connection; for example, the user must enter a user name, password, and database name for a secure connection. The number of experiments in the input files is also checked to ensure comparability of the data. The software was implemented in an object-oriented language, Java, to ensure portability across different operating systems. It is a stand-alone program, designed for users to install and run on their own local machine. Therefore, unlike web-based analytical tools, it does not require any server support and is not affected by Internet traffic, server-side problems, or downtime. Instead of using flat-file data storage, the CGI program provides fast data access and transfer through a MySQL database, which is freely available via the web site (http://www.MySQL.com).
Unlike some analytical tools, such as BioConductor (www.bioconductor.org), which use a command-line interface, the CGI program provides an intuitive graphical interface that allows bench biologists to perform the analysis without any prior computational background. A detailed description of how to install the databases and the program is provided in the supplementary information. The software framework that we employ supports the development of more sophisticated visualization and analytical functions in the future through its open API for Java-based plug-ins. The program is coded in reusable Java object classes, which promotes the rapid development of future program extensions.
Two similar efforts to compare aCGH and expression-profiling data have been published recently. Kingsley et al. developed a web-based system, Magellan, which explores the quantitative relationship between aCGH and mRNA expression data (Kingsley, 2006). Magellan computes the relationship between aCGH and expression data based on common annotation values between the two sets of experiments. Shankar et al. developed a program mainly to visualize aCGH and expression data (Shankar, 2006). In contrast to these two programs, CGI is a stand-alone program, which does not require an Internet connection and is not affected by server-side problems. In addition to visualizing the data, the main strength of CGI is that it provides an easy-to-use interface for fast matching and correlation of these two types of genomic data using a relational database. Once the candidate genes are identified, they can be subjected to additional analyses using other existing analytical tools. Since the software is developed in the object-oriented language Java, it can interact with other programs currently available for aCGH and microarray analysis, such as the BioConductor packages. It is straightforward to include other computational algorithms to extend the analytical capability of the program. The modular design of the program also adds flexibility and extensibility for the development of more functions and plug-ins in the future. In summary, we have developed an easy-to-use program, CGI, to map, correlate, and visualize aCGH and expression profiling data.
Acknowledgements
We would like to thank Jaya Visvanthan and Jianhe Shen for the preparation of the aCGH and microarray data used in this study. We also thank Alexander Yu, Weichun Hsu, and Richard Lowry for their help in programming, and Carolyn Pena for her assistance in manuscript preparation. The study is supported by grants from NIH CA81465, the Robert J. Kleberg, Jr. and Helen C. Kleberg Foundation, the Gillson Longenbaugh Foundation, and the Cancer Fighters in Houston (CCL), as well as the Sarcoma Foundation of America and Fleming and Davenport Award (TKM).
References
- Attar RM, Gilman MZ. Expression cloning of a novel zinc finger protein that binds to the c-fos serum response element. Mol Cell Biol. 1992;12:2432–43. doi: 10.1128/mcb.12.5.2432.
- Bumm K, Cheng M. CGO: utilizing and integrating gene expression microarray data in clinical research and data management. Bioinformatics. 2002;18:327–8. doi: 10.1093/bioinformatics/18.2.327.
- Chen W, Erdogan F, Ropers HH, et al. CGHPRO – a comprehensive data analysis tool for array CGH. BMC Bioinformatics. 2005;6:85. doi: 10.1186/1471-2105-6-85.
- Cheung SW, Shaw CA, Yu W, et al. Development and validation of a CGH microarray for clinical cytogenetic diagnosis. Genet Med. 2005;7:422–32. doi: 10.1097/01.gim.0000170992.63691.32.
- Jong K, Marchiori E, Meijer G, et al. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics. 2004;20:3636–7. doi: 10.1093/bioinformatics/bth355.
- Kim SY, Nam SW, Lee SH, et al. ArrayCGH: a web application for analysis and visualization of array-CGH data. Bioinformatics. 2005;21:2554–5. doi: 10.1093/bioinformatics/bti357.
- Kingsley CB, Kuo WL, Polikoff D, et al. Magellan: A Web Based System for the Integrated Analysis of Heterogeneous Biological Data and Annotations; Application to DNA Copy Number and Expression Data in Ovarian Cancer. Cancer Informatics. 2006;1:10–21.
- Li C, Wong WH. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc Natl Acad Sci. 2001;98:31–36. doi: 10.1073/pnas.011404098.
- Man TK, Lu XY, Kim J, et al. Genome-wide array comparative genomic hybridization analysis reveals distinct amplifications in osteosarcoma. BMC Cancer. 2004;4:45. doi: 10.1186/1471-2407-4-45.
- Man TK, Chintagumpala M, Visvanathan J, et al. Expression profiles of osteosarcoma that can predict response to chemotherapy. Cancer Research. 2005;65:8142–50. doi: 10.1158/0008-5472.CAN-05-0985.
- Margolin AA, Greshock J, Naylor TL, et al. CGHAnalyzer: a stand-alone software package for cancer genome analysis using array-based DNA copy number data. Bioinformatics. 2005;21:3308–11. doi: 10.1093/bioinformatics/bti500.
- Pinkel D, Segraves R, Sudar D, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genet. 1998;20:207–11. doi: 10.1038/2524.
- Price TS, Regan R, Mott R, et al. SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res. 2005;33:3455–64. doi: 10.1093/nar/gki643.
- Saeed AI, Sharov V, White J, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003;34:374–8. doi: 10.2144/03342mt01.
- Saraiya P, North C, Duca K. An insight-based methodology for evaluating bioinformatics visualizations. IEEE Trans Vis Comput Graph. 2005;11:443–56. doi: 10.1109/TVCG.2005.53.
- Shamir R, Maron-Katz A, Tanay A, et al. EXPANDER – an integrative program suite for microarray data analysis. BMC Bioinformatics. 2005;6:232. doi: 10.1186/1471-2105-6-232.
- Shankar G, Rossi MR, McQuaid DE, et al. aCGHViewer: A Generic Visualization Tool For aCGH data. Cancer Informatics. 2006;2:36–43.
- Sykacek P, Furlong RA, Micklem G. A friendly statistics package for microarray analysis. Bioinformatics. 2005;21:4069–70. doi: 10.1093/bioinformatics/bti663.
- Vaquerizas JM, Conde L, Yankilevich P, et al. GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data. Nucleic Acids Res. 2005;33:W616–W620. doi: 10.1093/nar/gki500.