Abstract
Many biomedical research databases contain time-oriented data resulting from longitudinal, time-series and time-dependent study designs, knowledge of which is not handled explicitly by most data-analytic methods. To make use of such knowledge about research data, we have developed an ontology-driven temporal mining method, called ChronoMiner. Most mining algorithms require that data be supplied in a single table. ChronoMiner, in contrast, can search for interesting temporal patterns among multiple input tables and at different levels of hierarchical representation. In this paper, we present the application of our method to the discovery of temporal associations between newly arising mutations in the HIV genome and past drug regimens. We discuss the various components of ChronoMiner, including its user interface, and provide results of a study indicating the efficiency and potential value of ChronoMiner on an existing HIV drug resistance data repository.
Introduction
Numerous statistical and data mining algorithms have been developed to facilitate data analysis in biomedical research areas ranging from functional genomics to population-based health. These existing methods largely do not support the discovery of biologically or clinically relevant patterns among time-stamped data—such as longitudinal clinical outcomes or time-series gene expression results. The amount, diversity, and complexity of such data in scientific databases are rapidly increasing; understanding their temporal aspects is fundamental to the validation of causal or dynamic phenomena. Investigators further lack established means to represent and use the domain-specific temporal knowledge associated either with the collection of such research data or with the derived results generated from existing data-analytic methods.
To address these challenges, we are developing a knowledge-based framework for temporal pattern discovery in biomedical research databases. This approach can assist investigators in abstracting, mining, validating, and maintaining knowledge about time-oriented research data. In this paper, we present our work on a novel ontology-driven temporal pattern mining method, which we call ChronoMiner. We show how time-oriented data in existing research databases can be mapped as input into our method, how we use the hierarchical structure of a mining ontology to search for temporal patterns, and how we allow users to specify interval relationships that are of interest. We demonstrate the applicability, efficiency and potential value of ChronoMiner on an existing biomedical research data repository through the instantiation of the mining ontology and its mapping to the database.
Domain Challenge
In our work, we have focused on a particular research challenge in biomedical genomics: the discovery of associations between gene mutations, drug regimens, and therapy outcomes. Genetic mutations can result from the evolutionary pressure of pharmacologic agents and lead to treatment inefficacy in infectious diseases [1]. In HIV research, for example, a mutation on the viral genome may be associated retrospectively with past administration of a specific drug or prospectively with the occurrence of a poor clinical outcome with one or more drugs. Establishing such temporal associations may lead scientists to understand how particular mutations in the genome reduce drug efficacy, and can help healthcare providers to design treatment strategies given particular genotype-test results.
To study drug resistance in the context of clinical care, researchers at Stanford University have developed a research database system—called the Stanford HIV Drug Resistance Database (HIVdb) [2]. This research database contains time-stamped data on drug regimens, HIV reverse transcriptase (RT) and protease sequences, and HIV viral load collected at local clinics. The schema of the HIVdb is based on a linkage of sequence changes in HIV RT and protease enzymes to antiretroviral drug histories of the individual from whom the isolate (genotype test result) was obtained; drug susceptibility data on sequenced isolates when available; and primary outcome measurements (such as plasma viral load).
Figure 1 provides an integrated view of data for a subject in HIVdb. In our prior work [3], we identified sequential mutation changes in patients changing antiretroviral drugs (i.e., pairs of preceding drug predictors and mutation changes) among these data by applying association rule analysis [4]. We have found that this well-studied standard approach can meaningfully associate infrequent occurrences of protease gene mutations with previously administered protease inhibitor (PI) drugs. Association rule analysis, like most other mining methods, uses only one input table, which can represent one temporal pattern of interest (such as data on mutation changes after drug administration). This mining algorithm does not support hierarchical knowledge contained within the data, such as finding a temporal association between a mutation and the administration of any PI drug or of a particular type of PI drug.
Figure 1.
Longitudinal view of clinical data from a single subject in the Stanford HIV Drug Resistance database. The time graph indicates the complexity of inferring the relationship of clinical response—viral load (top line) and CD4 count (bottom line)—to mutations in HIV gene sequence (listed in bottom boxes) and treatment history (noted by drug name abbreviations). Dates have been modified to protect confidential health information.
Method
We have designed the ChronoMiner method to find multiple types of temporal associations dynamically and at multiple levels of information. Our general mining approach uses a hierarchical view of data in a time-oriented database, which is encoded within a mining ontology, and searches for temporal relationships based on interval comparisons among these data. The ChronoMiner program has four components—data preprocessing, hierarchical representation, mining algorithm, and user interface—which we describe in this section.
Data preprocessing:
ChronoMiner uses a hierarchical representation of multiple relations (tables) as input, and assumes that the representation is in the form of the valid-time database model [5]. The valid-time database model adds a third dimension to all records in a time-oriented database. Instant-based data, or events (such as viral-load measurements), are in tables that have a unique key, value and date as columns. Interval-based data, or states (such as antiretroviral drug administrations), are in tables that have a unique key, value, start date and stop date as columns. To create this standardized temporal view of a database, we used Java and JDBC to map existing data in its native schema into the target schema. Standard database optimization techniques (such as data normalization) and data cleanup are also performed at this step.
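As a minimal sketch of the two record shapes described above, the following Python fragment models an event and a state. The class names `Event` and `State`, the field names, and the sample rows are illustrative assumptions rather than the HIVdb schema or the Java and JDBC code used for the actual mapping.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Instant-based record: unique key, value, and a single date."""
    key: int
    value: str
    when: date


@dataclass(frozen=True)
class State:
    """Interval-based record: unique key, value, start date, and stop date."""
    key: int
    value: str
    start: date
    stop: date

    def __post_init__(self):
        # Reject intervals that end before they begin.
        if self.stop < self.start:
            raise ValueError("stop date precedes start date")


# Hypothetical rows; the values and dates are illustrative only.
viral_load = Event(key=1, value="VL=52000", when=date(2001, 3, 14))
idv_course = State(key=2, value="IDV", start=date(2000, 1, 10), stop=date(2001, 2, 1))
```

Keeping events and states as separate shapes mirrors the separate event and state tables in the text, and the validation step stands in for part of the data cleanup mentioned there.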
Hierarchical representation:
The traditional model of a relational database is a set of relations, which cannot directly encode hierarchical properties among data, such as the is-a relationship. To address this issue, we represented both these relational and hierarchical properties of data explicitly within a mining ontology. Each domain class in the ontology, called a Node, corresponds to a relation in the database. The subclass hierarchy under Node encodes an “is-a” relationship among domain classes. For example, the antiretroviral drug class IDV is modeled as a subclass of the PI class.
Each class has properties that contain string values mapping to the column names of the database table that stores the instances for that class. These properties are inherited from a DataMapperNode class, which has subclasses of EventDataMapperNode and StateDataMapperNode to distinguish, respectively, between relations that store events and those that contain states. The use of an ontology to encode an explicit representation of data allows reuse of the mining method with different database schemas and domains, since such encoded knowledge can be easily modified. The mining ontology serves as a bridge between the database and the mining algorithm, and guides the hierarchical search of the latter across multiple tables within the former.
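The mapping role of the mining ontology can be pictured with a small sketch. The Python structure below is a hypothetical stand-in for the ontology classes: the `Node` fields, the table names, and the column names are assumptions made for illustration, and NFV is included only as a second example of a PI subclass.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """One domain class in the mining ontology, mapped to one relation."""
    name: str
    kind: str                  # "event" or "state"
    table: str                 # hypothetical table or view name
    columns: Dict[str, str]    # logical role -> column name in that table
    parent: Optional["Node"] = None

    def is_a(self, other: "Node") -> bool:
        """True if this class is `other` or one of its descendants."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


def subclasses(nodes: List[Node], root: Node) -> List[Node]:
    """All classes strictly below `root` in the is-a hierarchy."""
    return [n for n in nodes if n is not root and n.is_a(root)]


# Assumed column mapping for state tables (names are illustrative).
state_cols = {"key": "id", "value": "drug", "start": "start_date", "stop": "stop_date"}
pi = Node("PI", "state", "pi_view", state_cols)
idv = Node("IDV", "state", "idv_view", state_cols, parent=pi)
nfv = Node("NFV", "state", "nfv_view", state_cols, parent=pi)
```

Because the column names are held as plain strings on each class, pointing the same hierarchy at a different schema would only require changing these mappings, which is the kind of reuse the text describes.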
Mining algorithm:
Our data mining approach undertakes association rule analysis between two input domain classes and their subclasses in the mining ontology. Standard association rule mining looks for frequently occurring associations between input values that meet the minimal criteria of user-defined interestingness, such as confidence (the probability of one value occurring given another) and support (the probability of two values occurring together). The ChronoMiner algorithm extends this standard approach by also examining the occurrence of different temporal relationships between the time stamps of those values. For example, it accounts not only for how often a given HIV mutation appears with a particular antiretroviral drug administered to a patient who has that mutation but also for how frequently that mutation appeared before, after or during the drug’s administration.
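To make the temporal extension of support and confidence concrete, the sketch below counts, per subject, how often a mutation appears before, during, or after the administration of a drug. The per-subject counting, the day-number encoding of dates, the sample data, and the function names are assumptions chosen for illustration; the exact definitions used by ChronoMiner may differ.

```python
from collections import Counter


def position(event_day, start_day, stop_day):
    """Where an event falls relative to an interval."""
    if event_day < start_day:
        return "before"
    if event_day > stop_day:
        return "after"
    return "during"


def temporal_rule_stats(subjects, drug, mutation):
    """Return {relation: (support, confidence)} for the rule drug -> mutation.

    `subjects` maps a subject id to a pair (states, events), where states is
    a list of (drug, start_day, stop_day) and events is a list of
    (mutation, day). Each subject is counted at most once per relation.
    """
    n_total = len(subjects)
    n_drug = 0
    hits = Counter()
    for states, events in subjects.values():
        courses = [(s, e) for d, s, e in states if d == drug]
        if not courses:
            continue
        n_drug += 1
        seen = set()
        for m, day in events:
            if m != mutation:
                continue
            for start, stop in courses:
                seen.add(position(day, start, stop))
        hits.update(seen)
    # support: share of all subjects; confidence: share of subjects on the drug
    return {rel: (n / n_total, n / n_drug) for rel, n in hits.items()}


# Hypothetical subjects: (drug courses, mutation events), days as integers.
subjects = {
    "s1": ([("IDV", 0, 100)], [("PM_POSITION_54", 150)]),  # mutation after IDV
    "s2": ([("IDV", 10, 60)], [("PM_POSITION_54", 40)]),   # mutation during IDV
    "s3": ([("IDV", 0, 30)], []),                          # IDV, no mutation
    "s4": ([("NFV", 0, 30)], [("PM_POSITION_54", 50)]),    # a different drug
}
stats = temporal_rule_stats(subjects, "IDV", "PM_POSITION_54")
```

In this toy data set one of four subjects shows the mutation after IDV, so the "after" rule has a support of 0.25, and its confidence is one third because three subjects received IDV.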
Using the mining ontology, the search for temporal associations involves partial or complete traversal of the hierarchical structure starting from each input class, proceeding through top-down induction as described in the pseudo code presented in Figure 2.
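One way such a top-down traversal could proceed is sketched below. This is not the pseudo code of Figure 2; it assumes that descent below a class pair stops once the pair falls under the minimum support, and `children`, `support_of`, and the toy support values are hypothetical.

```python
def top_down_pairs(children, root_a, root_b, support_of, min_support):
    """Return (a, b, support) for class pairs that meet min_support.

    `children` maps a class name to the names of its direct subclasses.
    Descent below a pair stops when the pair falls under min_support, on the
    assumption that a subclass cannot have more support than its parent.
    """
    results = []
    seen = set()
    stack = [(root_a, root_b)]
    while stack:
        a, b = stack.pop()
        if (a, b) in seen:
            continue
        seen.add((a, b))
        s = support_of(a, b)
        if s < min_support:
            continue  # prune: do not expand subclasses of this pair
        results.append((a, b, s))
        # Refine one side at a time, moving down either hierarchy.
        for child in children.get(a, []):
            stack.append((child, b))
        for child in children.get(b, []):
            stack.append((a, child))
    return results


# Toy hierarchy and made-up support values, for illustration only.
children = {"PI": ["IDV", "NFV"]}
toy_support = {
    ("PI", "PM_POSITION_54"): 0.30,
    ("IDV", "PM_POSITION_54"): 0.20,
    ("NFV", "PM_POSITION_54"): 0.05,
}
found = top_down_pairs(
    children, "PI", "PM_POSITION_54",
    lambda a, b: toy_support.get((a, b), 0.0),
    min_support=0.10,
)
```

With these toy values the search keeps the PI and IDV pairs and drops the NFV pair, which illustrates how a traversal can be partial rather than complete.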
Figure 2.
Pseudo code describing our hierarchical, interval-based temporal mining algorithm.
The algorithm undertakes temporal comparisons between the time stamps of instances at each level of one input class hierarchy and those of instances at each level of the other input class hierarchy. The temporal comparisons are based on Allen’s interval logic [6], which defines the 13 possible temporal relationships that can occur between two intervals. To support comparison with events, we model each event as a zero-length interval with identical start and stop dates. The algorithm returns temporal associations that meet the user’s criteria, which are specified in the interface.
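The interval comparison can be illustrated with a short classifier for Allen's 13 relations. This is a sketch, not ChronoMiner's implementation: the relation labels and the tie-breaking order for zero-length intervals (an event that coincides with the start of a state is reported as `starts`) are assumptions.

```python
ALLEN_RELATIONS = (
    "before", "after", "meets", "met-by", "overlaps", "overlapped-by",
    "starts", "started-by", "during", "contains", "finishes", "finished-by",
    "equals",
)


def as_interval(day):
    """Model an event as a zero-length interval."""
    return (day, day)


def allen_relation(a, b):
    """Name the Allen relation of interval a with respect to interval b.

    Intervals are (start, stop) pairs of comparable values with start <= stop.
    """
    a1, a2 = a
    b1, b2 = b
    if a1 > a2 or b1 > b2:
        raise ValueError("interval stop precedes start")
    if a1 == b1 and a2 == b2:
        return "equals"
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Only partial overlap remains at this point.
    return "overlaps" if a1 < b1 else "overlapped-by"


# A mutation observed on day 40 of a drug course spanning days 10 to 60.
relation = allen_relation(as_interval(40), (10, 60))  # "during"
```

A user restriction on which comparisons to mine can then be expressed as a simple membership test of the returned label against the chosen subset of `ALLEN_RELATIONS`.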
User interface:
We have created a mining query interface for ChronoMiner that permits users to select two classes from a given mining ontology file and the percent levels of confidence and support for finding temporal associations. The user can also restrict which of the 13 possible interval comparisons between the two classes are to be mined, which is useful if particular temporal relationships do not have domain relevance (such as mutations appearing before a drug’s administration). Figure 3 shows the user interface specifying mining between the PI class, which corresponds to the administration of any protease inhibitor drug, and the PM_POSITION_54 class, which corresponds to the occurrence of a mutation at position 54 on the protease gene.
Figure 3.
Query interface for ChronoMiner. This interface allows users to specify the mining of pairwise temporal comparisons between two domain classes of interest.
Experiment
We evaluated ChronoMiner by applying it to HIVdb for the discovery of temporal associations between newly arising mutations in the HIV genome and sequential drug regimens.
Data source:
HIVdb is a continuously growing research data repository. The snapshot that we used in mid-March 2007 for the verification of ChronoMiner contained approximately 18,000 patient subjects, 23,000 time points, 42,000 HIV gene isolates, 52,000 sequences, 850,000 mutations and 32,000 phenotype results. A gene sequence represents the specific order in which the structural components of DNA are arranged for a particular gene. Here, time-series data represent these sequences collected at various time points. In preprocessing these data, we created event and state data views for every abstract level of the data as defined in the domain ontology (Figure 4). This preprocessing step can allow modeling of certain domain-specific knowledge in the transformed data. For example, we instantiated the type of mutation at each protein position based on prior work [7].
Figure 4.
Mining ontology for HIV database in Protégé OWL. This ontology provides an “is-a” hierarchy of domain classes and the properties of a corresponding relation in the database.
Mining ontology:
We used Protégé OWL, which is the most widely used tool for modeling knowledge based on the Semantic Web ontology language standard OWL, to encode the mining ontology. For the HIV drug resistance research domain, we created an ontology that specifies hierarchical “is-a” relationships among drug, mutation and viral load information. Drug is modeled as state data, whereas mutation and viral load are represented as event data. Metadata about the database location and schema is specified as a data property of an individual at each level.
Query and results display:
A user specifies classes and temporal patterns using ChronoMiner’s query interface (shown in Figure 3). The matching temporal patterns are displayed along with their confidence values in a tabular format. By default, the data are sorted on the confidence value of the first input class. However, the user can change the sort order. Figure 5 shows the result of the mining query entered in Figure 3. These results correspond to the findings of our previous research using standard association rule mining.
Figure 5.
Result interface for ChronoMiner. This interface shows results of the query specified in Figure 3.
Comparative efficiency:
We compared the performance of hierarchical mining with that of traditional association rule mining. The transformation from traditional relational to hierarchical input improved the performance of finding temporal associations based on the query of Figure 3 by almost twofold for each drug type, as shown in Figure 6. The temporal rules discovered by ChronoMiner confirmed the previously known associations. We also found some novel mutations that need to be verified by domain experts.
Figure 6.
ChronoMiner performance for finding temporal patterns between various PI drugs and mutation at position 54 on the protease gene.
Discussion
A range of methods [8,9] have been previously developed to mine time-oriented data (such as time-series or episodic data). These methods can only find temporal patterns in a single input table and not among multiple tables. They are thus unable to discover interval-based relationships among sets of time-oriented data, such as those contained in HIVdb.
Unlike prior temporal mining methods, ChronoMiner also accounts for hierarchical relationships among data in searching for temporal patterns of interest. Multi-dimensional association rule mining [10] also uses a hierarchical approach to find (non-temporal) patterns of interest among input relations modeled along the dimensions of a data cube structure. Our ontology-driven approach allows us to encode this knowledge in a standard, reusable, and sharable format.
Although the ChronoMiner method supports reusability, the mapping from a database to the ontological hierarchical representation remains manual. We used a local script for converting data in HIVdb to our input tables. We are currently adapting Synchronus [11] to allow automated mapping of data in an existing database schema to the ChronoMiner representation. As a next step, we are also developing a richer query language for temporal mining. Currently, our method supports any combination of Allen’s interval-based patterns between two classes. We are extending the language grammar to support more than two classes and to allow constraints (such as AND, OR, and NOT) on multiple comparisons.
With these extensions, ChronoMiner will provide investigators with fully automated facilities to mine complex domain-specific temporal patterns from time-oriented biomedical research data. In this paper, we have shown that an end-to-end application of our novel mining approach can offer users of an established research database potential value and efficiency in the task of temporal knowledge discovery.
Acknowledgments
This work was supported in part by a Universitywide AIDS Research Program Award and a PhRMA Foundation Starter Grant Award in Informatics. We thank Dr. Robert Shafer and Soo-Yon Rhee for their assistance in our evaluation.
References
1. Kantor R, Shafer RW, Follansbee S, et al. Evolution of resistance to drugs in HIV-1-infected patients failing antiretroviral therapy. AIDS. 2004;18:1503–1511. doi: 10.1097/01.aids.0000131358.29586.6b.
2. Rhee SY, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Research. 2003;31:298–303. doi: 10.1093/nar/gkg100.
3. Lin RS, Rhee SY, Shafer RW, Das AK. A combined data mining approach for infrequent events: Analyzing HIV mutation changes based on treatment history. Proceedings of the LSS Computational Systems Bioinformatics Conference; Stanford, CA. 2006. pp. 385–388.
4. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. Proceedings of ACM SIGMOD Conference on Management of Data; Washington, DC. 1993. pp. 207–216.
5. Snodgrass RT. The TSQL2 temporal query language. Kluwer Academic Publishers; Boston: 1995.
6. Allen JF. Maintaining knowledge about temporal intervals. Comm ACM. 1983;26(11):832–843.
7. Rhee SY, Fessel WJ, Zolopa AR, et al. HIV-1 protease and reverse-transcriptase mutations: Correlations with antiretroviral therapy in subtype B isolates and implications for drug-resistance surveillance. Journal of Infectious Diseases. 2005;192:456–465. doi: 10.1086/431601.
8. Ramirez JCG, Cook DJ, Peterson LL, Peterson DM. Temporal pattern discovery in course-of-disease data. IEEE Engineering in Medicine and Biology. 2000;19:63–71. doi: 10.1109/51.853483.
9. Winarko E, Roddick JF. Discovering richer temporal association rules from interval-based data. Proceedings of 7th International Conference on Data Warehousing and Knowledge Discovery; Copenhagen, Denmark. 2005. pp. 315–325.
10. Kamber M, Han J, Chiang JY. Metarule-guided mining of multidimensional association rules using data cubes. Proceedings of International Conference on Knowledge Discovery and Data Mining; Newport Beach, California. 1997. pp. 207–210.
11. Das AK, Musen MA. Synchronus: a reusable software module for temporal integration. Proceedings of the AMIA Annual Symposium; San Antonio, Texas. 2002. pp. 195–199.






