Abstract
Clinical study data management systems (CSDMSs) have many similarities to clinical patient record systems (CPRSs) in their focus on recording clinical parameters. Requirements for ad hoc query interfaces for both systems would therefore appear to be highly similar. However, a clinical study is concerned primarily with collective responses of groups of subjects to standardized therapeutic interventions for the same underlying clinical condition. The parameters that are recorded in CSDMSs tend to be more diverse than those required for patient management in non-research settings, because of the greater emphasis on questionnaires for which responses to each question are recorded separately. The differences between CSDMSs and CPRSs are reflected in the metadata that support the respective systems' operation, and need to be reflected in the query interfaces. The authors describe major revisions of their previously described CSDMS ad hoc query interface to meet CSDMS needs more fully, as well as its porting to a Web-based platform.
Clinical study data management systems (CSDMSs) have many similarities to clinical patient record systems (CPRSs). Both systems model much of their data using an entity-attribute-value (EAV) representation, which is discussed shortly. It is highly desirable to provide ad hoc query capability in a CSDMS/CPRS. Existing systems allow patient-centric query—retrieval of data from a single patient. However, the task of attribute-centric query, identifying sets of patients based on a Boolean combination of parameters, is technically more challenging, even if it is less important for real-time patient management.
In a previous paper describing an attribute-centric query tool,1 we discussed reasons that generic query tools for attribute-centric query of EAV-based systems are preferable to tedious, error-prone hand coding. Such tools need to be built from scratch, however, because commercial query tools are not currently designed for EAV data. This paper describes major extensions of our previous work, with an architectural overhaul based on lessons learned in the interim. We describe a Web-based query tool for TrialDB (formerly ACT/DB2), a CSDMS implemented at Yale and in use at the Yale Cancer Center and Yale General and Children's Clinical Research Centers, the Vanderbilt University Cancer Center, and the National Cancer Institute–supported Cancer Genetics Network.
Background
Existing Approaches for Query of Patient Data
Strategies for ad hoc querying of data from a CPRS/CSDMS fall within two extremes. At one extreme, queries are issued directly against the source data, or a copy with the same structure residing on a CPU different from the transactional system. At the other, data subsets undergo major restructuring and import into specialized DBMSs optimized for query, such as multidimensional database engines.3 An intermediate approach uses limited data transformation on a read-only copy of the data. For example, additional indexes may be created, or the data may be partitioned across multiple devices or CPUs to take advantage of hardware parallelism.4 In the context of CPRSs, source-data query has been described in Nigrin and Kohane5 and in Wilcox and Hripcsak6; restructuring/query in Murphy et al.,7 Prather et al.,8 and Wang et al.9; and intermediate approaches in Nigrin and Kohane10 and in Stoffel et al.11
Restructuring/query yields major gains in query performance, often by a couple of orders of magnitude. It is more or less mandated when very fast response is desired for queries performed repeatedly and regularly against critical data subsets. In the CPRS/ CSDMS scenario, however, the drawback is that the number of subsets can get quite large: Much planning is needed to determine which subsets will provide the most payoff after reorganization. Such reorganization involves, among other things, the pre-computation of aggregates. On the other hand, source-data query, while relatively slow, is convenient for one-time queries to which one would like to have answers reasonably quickly (if not necessarily within a few seconds) without having to extract and transform a large body of data first. Furthermore, if the volume of the data is relatively modest, e.g., a few gigabytes, queries might run quickly enough after minimal or no transformation that major data restructuring may not become a high priority.
An Overview of EAV Design
Entity-attribute-value systems used in clinical medicine include the well-known HELP system,12,13 now commercialized as the 3M Clinical Data Repository (CDR),14 and the Columbia-Presbyterian CDR.15,16 In EAV design for clinical data, every fact is conceptually stored in a table with three sets of columns:
The Entity—patient ID plus time stamps recording event occurrence.
The Attribute, also referred to as a Parameter or Question, e.g., “serum Potassium”—typically the Attribute ID rather than its name is recorded; the latter is stored, along with other information such as its data type, in a separate Attribute Definitions table.
The Value of the attribute, e.g., “4.5”
One row stores a single fact. In a conventional table that has one column per attribute, by contrast, one row stores a set of facts. Entity-attribute-value design is appropriate when the number of parameters that potentially apply to an “entity,” e.g., the patient, is vastly greater than the number of parameters that actually apply to an individual entity. Clinical study data management systems, for example, must track thousands of parameters (clinical/laboratory/questionnaire items) across numerous medical specialties, although the vast majority of subjects will be enrolled in only a single study that tracks a much smaller set of parameters. The EAV schema has the advantage of remaining stable when the number of parameters increases continually as knowledge expands, as in medicine.
In some systems, including TrialDB, instead of a single EAV table, we have separate EAV tables for each data type of the value column (string, integer, real, date, etc.). The row for potassium data would be stored in the EAV table for real values. Entity-attribute-value design represents a generalization of row modeling, which is used intuitively by many database designers who may not know of the EAV approach per se. For example, in a supermarket database that must manage thousands of products and brands, many of which have a transitory existence, product names are not hard-coded as names of columns in tables. Instead, product descriptions are stored in a Products table, and purchases/sales of individual items are recorded in other tables as separate rows, with a product ID referencing this table. In the CPRS, laboratory test results are managed similarly.
Entity-attribute-value design poses significant software challenges because end users as well as analytic programs take a conventional view of the data. Therefore, the user interface to a well-designed EAV system must create the illusion of all conventional data storage. The output of queries, or data extracts destined for analysis, must similarly be formatted conventionally.
Most “EAV” systems actually store some of their data, e.g., patient demographics, conventionally. This is because EAV makes sense only for attributes that are sparse, i.e., that are recorded for only a minority of patients. It would be misapplied to standard demographic attributes, which are recorded for every patient. The existence of heterogeneity complicates the task of query significantly, for several reasons. For example, the SQL (structured query language) statements required to operate on conventional data are different from those needed to operate on EAV data to return equivalent results. Furthermore, data may be reorganized, with the representation of attributes being switched from conventional to EAV and vice versa, causing existing “stored procedures” that hard-code the access to these attributes to break and require rewriting.
The Importance of Metadata in EAV Systems
In EAV systems, the user-interface illusion of conventional data organization is implemented through metadata. While the most restricted definition of metadata treats it as synonymous with “data dictionary,” we use this term in the broader sense of any data that are used by generic code within the application or system to provide functionality, as opposed to attempting to hard-code such functionality in application logic. All DBMSs use built-in metadata—such as table, column and index definitions—or various purposes, such as semantic checking and optimization of SQL queries. However, the metadata of an EAV system, which are designer-supplied, serve additional purposes. For example, knowing where an attribute's data are stored, based on the data type, helps generate the correct SQL when data are viewed or changed.
Typically, EAV metadata also record how information is presented to the user. Thus, for example, an attribute can have a caption that briefly describes its purpose, and may also be associated with a set of choices, constraining input to these choices. Such a set forms the basis of a “combo box” in the user interface. Furthermore, attributes can be grouped into higher-order objects (forms). If we also record the order in which attributes are to be presented, the collective metadata can be used to automatically generate a data-entry/browsing form. TrialDB, our existing CSDMS, has this capability. TrialDB's Web-form-generation infrastructure and metadata-editing facilities have been described previously.17,18 Its previous query interface will now be described briefly.
Using Metadata to Support Query: Précis of Previous Work with TrialDB
Given that metadata are critical for EAV systems, it is natural to explore their use for generating SQL queries based on interaction with a graphical user interface (GUI). The advantage of using metadata to drive the query GUI is that, although the program code needs to be more elaborate, portability of the GUI code across different systems is greatly enhanced, and the extent of hard-coding of individual user-interface components is minimized.
A prerequisite for portability is that a common metadata schema be used to describe the details of each system. Furthermore, by dynamically altering the behavior of individual elements appropriately, the query interface code that consults metadata can maximize the likelihood of queries being semantically correct as they are being composed, rather than reporting errors after the query has been composed and submitted.
For example, when the user chooses an attribute, he or she often wishes to specify a selection criterion based on that attribute, or a statistical aggregate of that attribute. Based on the data type of the attribute, only the appropriate relational and aggregate operators should be made available in pull-down lists. Thus, for an attribute of type “string,” the only permissible aggregate operators are COUNT, MAXIMUM, and MINIMUM; the operators SUM, MEDIAN, AVERAGE, and STANDARD_DEVIATION do not apply. Similarly, if an attribute is based on a choice list, then the contents of that list should be presented appropriately.
Our previous work described a user interface that could apply to either a CPRS or a CSDMS. We briefly summarize our former approach.
Attributes are searched and selected by specifying keywords or part of the attribute name/caption. Once an attribute is selected, the user may optionally specify that an aggregate value for that attribute be returned, with one aggregate value per patient, using an aggregate operator such as that mentioned above.
Restrictions based on relational operators, and values are optionally applied to individual attributes (e.g., “Hemoglobin<10”). Restrictions may be based on statistical aggregates (e.g., “max(Hemoglobin)<10”) as well as limited by a range of dates.
Individual search criteria can be combined in Boolean fashion.
Queries can be optionally constrained to operate on a user-specified subset of the clinical studies stored in TrialDB. The generated SQL uses such a constraint to restrict search and limit the volume of data stored in intermediate result sets.
The output of a single query is one or more tables. All aggregated parameters are placed in one table (with one row per patient), whereas non-aggregated parameters each appear in their own table. The reason for segregating parameters in this way is that they may have been recorded at different time points and with varying frequencies. Therefore, the number of values per patient will generally be unequal across parameters, especially when one is querying across studies with completely different designs. In a cross-study situation, combining raw data values for multiple parameters into a single table is often a meaningless operation.
Experience, however, indicated that our previous interface was attempting to solve the wrong problem. This issue is discussed below.
Need for a CSDMS Query Interface to Differ Significantly from a CPRS Query Interface
Although they are similar, CSDMSs and CPRSs also have significant differences reflecting their different purposes. For example, a CSDMS simulates an unlimited number of “virtual” databases, one for each study stored in the system, using a single physical database with a fixed number of tables. In a general-purpose CPRS, on the other hand, the concept of a “study” does not exist. Furthermore, most CSDMS users are not even authorized to access more than one study; cross-study query capability, while useful, is secondary in importance. Some major differences in a CSDMS that affect a query interface are highlighted below.
-
The significantly different nature of many attributes, and grouping of attributes. The vast majority of attributes stored in a CPRS are either laboratory tests, such as serum sodium, or clinical findings, such as the presence of icterus. Such attributes are meaningful in themselves, and it is natural for a query interface to allow search of attributes by keyword or name. In CSDMSs, however, attributes segregate into higher-order groups to a much greater degree. For example, a significant proportion of the attributes, especially in domains such as psychiatry, are derived from questionnaires such as Depression Rating Scales, where the attributes or questions on the questionnaire are often closely interrelated, and one question may allude to a question asked earlier in the questionnaire. This has two consequences.
- First, idiosyncratic attribute definitions are numerous. For example, in the SF36 form, a standard quality-of-life survey, one attribute is defined as, “Has the patient needed to cut down on the amount of time spent in work or other activities?” Searching such definitions by key phrase can be suboptimal. Instead, such attributes are often known by the questionnaire or form to which they belong, and it is more natural to let the user “drill down” to the attribute from the form definition, instead of mandating a search by keyword.
- Very often in the CSDMS scenario, a set of attributes, rather than a single attribute, conveys meaningful information. Thus, to record an adverse reaction to medication, one must record the nature of the reaction, its severity (e.g., on a 1-to-5 scale), whether it responded to treatment, whether therapy needed to be stopped, and so on. “Severity” is meaningless in isolation (severity of what?). Therefore, the user very often needs to select several related attributes. It is significantly more ergonomic to let users do this via the form that groups these attributes together.
- Second, search of attributes by keyword is still necessary: Users cannot be expected to remember the form where a particular attribute resides. Once an attribute has been located, however, it is desirable to allow the user to identify the form where it lies (“drilling up”) and then choose related attributes by drilling down again.
In brief, naturally occurring higher-order groupings of attributes in clinical studies must be maximally used in a CSDMS query interface for optimal ergonomics. (To some extent, this query interface requirement also exists in traditional patient-centric query of CPRSs, where users often wish to select all complete blood count or electrolyte results for a particular patient.)
-
Emphasis on groups of patients and study periods rather than absolute dates, and regularity of data collection. In clinical studies, individual response to therapy is less important than how subjects react as a group. Therefore, a clinical study is typically divided into study periods by critical time points, during which a predetermined set of evaluations, e.g., laboratory tests or questionnaires, is performed for each subject or patient. These time points are determined by the study protocol and are typically labeled as “Baseline,” “3-month follow-up,” and so on. At a given instant, various subjects will typically be within different study periods, e.g., one subject may be finishing the study while another has just begun. Absolute date or times associated with clinical events are therefore less useful, when the course of therapeutic response or adverse effects is being studied, than are the study periods associated with the events. The query GUI must therefore display and allow search of values in terms of study period labels.
For reasons such as expense and logistics, all clinical parameters relevant to a particular study are not evaluated during all periods. However, in particular studies, certain evaluations will always be done together. For example, a standard questionnaire will always be accompanied by a detailed physical examination. Therefore, for a given subject, the number of data points for certain seemingly unrelated parameters will be identical. In other words, the design of a particular study imposes a higher-order grouping of attributes above that due to individual questionnaires or laboratory test panels.
Grouping of attributes directly affects output of data. In a CPRS, it is generally not safe to combine raw data values for disparate parameters into the same output table. The consequent creation of multiple output tables complicates the task of visual inspection and data analysis. For a CSDMS, however, especially in a single-study, the query engine can and should take advantage of grouping of attributes to drastically reduce the number of output tables when numerous attributes have been selected.
Study access privileges and site-based restrictions. In a CPRS, there is generally an implicit assumption that anyone who has permission to perform ad hoc query has permission to read all the available data, i.e., “read-only super-user” status. In a CSDMS, this assumption is not necessarily true. Certain users have access only to certain studies, and the very existence of other studies should be hidden. Nonetheless, such users can also benefit from ad hoc query facilities, as long as the query GUI code can tap into user-privilege metadata to restrict the data on which they are permitted to operate. Furthermore, for certain collaborative studies across multiple centers, certain users are site-restricted; that is, they are permitted to access only data on patients in their own center within a single study, whereas other “study-level super-users” (analysts) may not be so restricted. The query interface should be aware of such restrictions.
Awareness of the above issues came to us gradually; in many cases after the feature set of TrialDB itself had been expanded to deal with the varying needs of different studies. In any case, we eventually realized that our assumptions about the requirements of a CSDMS query interface had been simplistic and that the interface needed to be drastically overhauled. Concurrently, we also took the opportunity to rewrite and generalize our original architecture to make it more metadata-driven and to optimize it for better performance and ease of maintenance.
Design Objectives
The Query Interface
The first objective was to re-engineer the query interface of TrialDB to facilitate single-study query, as discussed above, by using groupings of attributes and study periods and by conforming to access restrictions.
Queries, whose results are displayed as a series of HTML tables corresponding to temporary database tables, essentially constitute un-cosmeticized reports. Just as “power users” within a group of users create reports for others to run on a regular basis, ad hoc queries should also be capable of being saved, annotated with respect to what they do, and run by others who might not necessarily care to devise these queries themselves. Such queries should be aware of site-restriction, in that the same query should return different sets of patients when run by users with different privileges, without needing to be redefined.
An extension of the ability to save queries is decoupling of the SQL-generation code from the query GUI. There are occasions when regularly scheduled (“canned”) reports of clinical data are needed for computer-naive users. It is desirable to format these reports aesthetically and add explanatory text. Outputting tables is then only the first step in the process: One must operate on the tables just created to generate a formatted report. This is done most directly by using a commercial report generator, which typically operates on saved report definitions or templates; an individual template incorporates, among other things, a knowledge of the tables that it has to query. (We use a program called ActiveReports to allow Web-based display of dynamically generated reports in a variety of formats, including PDF.)
For the report-generation process to be streamlined, e.g., be available with a single button-click, it is desirable to allow the SQL-generation code to be called directly without having to go through the ad hoc query interface. The proper way to do this is to ensure that the query GUI and SQL-generation code do not communicate directly, but only through the intermediary of a query specification that the former creates and latter interprets or executes. Our XML-based query specification is described later.
Web-based Architecture
The second objective was to port the application from a two-tier client-server architecture to a Web-based one.
Our previously described work used a two-tier architecture, with a Microsoft Access client operating against a high-end relational database. The advantages of Web deployment are well known—increased accessibility to a large number of users without per-seat licensing costs, and simplified software maintenance and distribution. Secure communication of confidential patient data is also simpler to implement for Web-based applications than for traditional client-server applications: The application developer does not have to do anything special other than specifying that the URL for query will be based on https rather than http.
In the present case, another advantage applies—the greatly increased ability to customize the user interface programmatically based on attribute metadata. For example, for attributes that are based on a set of choices, one must present the choice set against the attribute's caption as a pull-down list when this attribute is chosen for a search criterion. When MS Access user interface components such as sub-forms are used to display attributes selected by the user, the rows in the component are not individually programmable: Changing the contents of the pull-down list for one row programmatically changes it inappropriately for all rows, causing significant confusion for the user. On the Web, where an HTML table is used to simulate a sub-form, the program code, while much more elaborate, allows fine-grained control over individual cells.
Integration with TrialDB Web Interface
The third objective was to provide full integration with the existing TrialDB Web interface, to provide the freedom to switch to patient-centric query on demand.
When a set of patients matching particular search criteria is identified, a user often wants to go to records for individual patients and browse data for other parameters; in other words, the user needs to perform a patient-centric query. Such browsing is often done by instinct, based on what the query has returned. For example, the user may want to look at other data associated with what appears to be an outlier. It is not reasonable for the query interface to force the user to specify all that he or she wishes to inspect a priori. The user must therefore be able to switch between the two modes of attribute-centric and patient-centric query as seamlessly as possible. This is quite simple to achieve with Web technology: The ID of an individual patient in the output of an attribute-centric query can hyperlink to the list of forms that represent that patient's total data. This list can be readily filtered and sorted by form category, study period, and so on.
System Description
Schematic of Architecture
The overall system design, based on a standard three-tier Web architecture, is shown in Figure 1▶; the legend accompanying this figure summarizes the role of each component in the system. The user's interaction is through a Web browser. We mandate the use of Microsoft Internet Explorer 5.x or greater, for reasons discussed later. The Web server runs on a Microsoft Windows 2000 machine with Internet Information Server and is implemented using Active Server Pages. The database server runs on a separate CPU and uses Oracle; however, we are working toward back-end database independence.
Figure 1 .
The overall architecture of the Ad Hoc Query Interface for TrialDB. In this “three-tier” architecture, the application resides on a Web server, which mediates between the Web browser and the database. An XML-based query specification is generated by the browser through the user's interaction with the metadata and search/identification of attributes of interest. This specification is then checked for correctness, and converted to a plan of action that generates SQL to create several temporary tables in the database, which can be viewed by the user. SQL generation is done with the help of metadata and “code templates,” which are parameterizable macros. Queries can be saved for later reuse by anyone with query privileges for that study.
The database schema consists of several sub-components.
Saved query definitions may be retrieved for later reuse and modification. A query definition is stored internally in relationalized form in three tables. However, it also has an XML counterpart that is used as the basis for communication between the Web browser and the Web Server. At the browser end, the query specification is composed through the user's actions via scripting code embedded in the Web page and executing within the browser. At the server, the specification is parsed, validated for correctness, and converted into an action plan. The use of XML is discussed later.
Data tables are queried and the results stored in temporary output tables, which are displayed to the user after query execution and can also be downloaded. Output tables are named using a convention with the user login name as a prefix, followed by a program-generated default. The user can modify the generated names prior to query execution. Output table names are checked for uniqueness to prevent accidental overwriting of data. The collective set of output tables created by a given user simulates a “user workspace.”
The metadata tables are browsed by the user as well as consulted by the SQL generator, which is discussed later. The user selects attributes of interest and optionally constrains them by value, aggregate operator, or occurrence within particular study periods. The metadata are presented to the user so as to hide the distinction between attributes represented in EAV form and attributes represented via conventional tables, e.g., demographics and scheduled therapies. (In our previous work, we have emphasized the importance of deliberately blurring this distinction, which is due to a system designer's decision and generally irrelevant from the user's perspective.) The metadata schema is discussed later.
Code templates are stored in the form of boilerplate code containing multiple SQL statements, with placeholders that are macro-substituted at runtime with actual parameters based on individual attributes and modifiers specified by the user, e.g., relational operators or statistical aggregates. Templates operate in the manner of the sprintf() function, which is available in C, C++, Java, and Perl. They may also contain placeholders for other templates, yielding a subroutine-like capability. After macro-substitution, the result is a series of SQL statements that is executed by the database to return results, perform cleanup of tables containing intermediate results, and so forth. Examples of their use are described in our earlier paper.1
The query engine is accessed from a button in TrialDB's existing Web front-end. Web browser and Web server communicate after password-based logon via https with 128-bit encryption to prevent electronic eavesdropping. The interface implements a metadata-based privileging mechanism: All users with access to a study can run a previously saved query for that study, including one created by another user, but only a limited number are allowed to create new queries or perform cross-study queries.
Most components of the Web server application have been briefly alluded to above and are illustrated in Figure 1▶. The application can be divided into two parts—a component that interacts with the user, and another that generates SQL to extract the data. The SQL generator does not access the database schema directly, but only through the metadata intermediary. That is, access to particular data tables is not hard-coded: The names of these tables and their fields are obtained by query of metadata tables.
User Interface
The interface shown in Figure 2▶ illustrates the single-study situation. The user can drill down from the forms, a list of which are shown at the top left, and drill down to the attributes or questions contained in them. Search by keyword is also provided: This is currently the only search option for cross-study query. In a cross-study query, the list of forms does not appear.
Figure 2 .
The User Interface for a single-study query. Here the query requests baseline albumin and bilirubin, initial chemotherapy and dose, and (optionally) prior surgery information for patients who might have received any one of several concomitant medications (omeprazole, theophylline, coumadin, etc.) before onset of therapy; patients with impaired liver function, or low serum albumin, are predisposed to interactions between the drugs listed and specific cancer chemotherapeutic agents. The compound Boolean criterion (1 and 6) means that the selected patient must meet the criteria of having a bilirubin recorded in the pretreatment phase (row 1) AND having one of the medications on the list in the same phase (row 6).
The user can drill down to an individual attribute via the forms within the study (and groups within the form) as well as by keyword search; the latter is accessed through the “Search” button above the list of forms. Selected attributes are displayed in a grid. Selection criteria may be applied by study-period restriction (start phase, end phase) or by relational/value operator, or both. (Although the user interface shows “time criteria” and “Boolean time criteria,” these are under development and their operation has been disabled; they are, therefore, not discussed in this paper.)
Creating a complete query specification often requires more screen space than can be fully displayed on most monitors, so the interface is divided into several “frames.” The topmost frame is a “toolbar” that, among other things, resizes other frames dynamically to minimize repetitive scrolling. The second frame, “Questions,” allows the drill-down selection of attributes, as previously described, as well as search by keyword or part of a phrase. The third frame, “Selected Questions,” lets the user apply restriction criteria for the selected attributes. The fourth frame, “Time Criteria,” still under development, allows composition of time-based criteria. The fifth frame, “User Information,” allows the user to combine individual criteria in Boolean fashion, save the query for reuse with a specific name and description, and run the query. The last frame, “Show Data,”which is hidden except during data display, shows the results.
The user may specify an arbitrary number of attributes or questions. Rows (criteria) are consecutively numbered as new questions are selected. Rows may also be deleted if a question was chosen accidentally, but deleted row numbers are not recycled; this way, other parts of the query specification—such as Boolean criteria that use criterion numbers, e.g., “1 and (3 or 4)”—are not affected. The filter on an attribute is specified in terms of a relational operator, a value (or comma-separated list of values), and optionally by statistical aggregate operator and study periods, as discussed earlier.
The output of the query in Figure 2▶ is shown in Figure 3▶. More than 50 individual SQL statements needed to be generated for the complete operation in this case; the query was executed against the transactional system and took 32 seconds to run. The output also includes the appropriate metadata descriptions—the interpretation of numeric codes representing study periods and choice sets. These are not illustrated for reasons of space.
Figure 3 .
The results returned by the query in Figure 2▶. Only a single patient ID (1780979) matches the search criteria: This patient has received coumadin. (The ID is an anonymous one that is only meaningful within the study.) Notice that attributes are segregated by the use of co-occurrence metadata: Chemotherapy agent and dose are shown together, as are bilirubin and albumin. This patient has not had any prior surgery (the top table is empty), but that does not prevent this patient from being selected, since this field has been designated as optional in specification of the query.
Also present in the output, but not shown here for space reasons, is a “key” for all coded data, such as the interpretation of all phase serial numbers for this study (here, 1=pretreatment, 2=cycle 1) as well as phrases corresponding to any choice set codes (such as gradations of adverse event severity). The patient ID is a hyperlink: Clicking on it takes the user to the TrialDB form that displays a list of available forms for that patient within this study.
The SQL statements fall into several functional categories—those that get attribute data from the tables where they reside and create temporary tables, those that combine temporary tables to consolidate the output, those that delete temporary tables, and those that display metadata. For debugging purposes, we allow logging of individual SQL statements, which can be inspected later for correctness.
A query does not always have to return patient data. Sometimes, we merely need to identify a set of patients matching the Boolean criteria, for whom a bulk data extract can be performed in a separate step. The extract typically involves complete data for one or more forms but may encompass all available data in the study for these patients.
Metadata Framework
The metadata schema is described in detail in Appendix 1* and is briefly summarized here. Different components of the metadata serve different purposes:
-
To enable search, and correct manipulation, of individual attributes. The requirements of metadata needed to enable searching of individual attributes in patient data, and to generate the correct SQL, have been described previously by us.1 In brief, every attribute must have a name, caption, description/set of associated keywords, a data type, and the name of the table/view that holds the data. For example, in TrialDB, the attribute named “Hemoglobin_Blood” has the caption “Hemoglobin, Blood,” keywords “Hemoglobin” and “Blood,” the data type “Real,” and the view “EAV_Real.”
In addition, we must record whether this view represents row-modeled or conventional storage. For row-modeled attributes, we must record the names of the attribute fields and the value fields. We must also record the number of time stamps associated with a data item for that attribute. Most attributes in medicine have a single time stamp recording when the event occurred, but a few, such as adverse events, have two time stamps recording onset and cessation of the event.
To manage user-specific privileges and data access restrictions. A User Privileges table records, for a given study, whether a user has ad hoc query creation or modification privileges as well as the ID of the site, if any, to which the user is restricted for the study. The latter acts as a filter for the SQL queries that are generated to access the data.
To record study periods. Study Periods have designer-defined labels and are chronologically offset from each other in time units, e.g., days, weeks, months, that are defined for the study.
The metadata used for query is actually multipurpose. For example, it is used to generate Web-based forms for data entry for both EAV-based data and conventional tables. In addition, the metadata for conventional attributes is used to document the entire system, including self-documenting of the tables containing metadata themselves. However, a Boolean flag indicates which tables contain patient data, as opposed to which contain metadata; only the former are accessible through the ad hoc query interface.
A major difference from our previously described work is that, for portability to other systems, the metadata and code-generation framework has been generalized. Thus, for example, although TrialDB uses the string “Patient_ID” as a column name in all applicable tables and views, other systems may not do so or may not even be consistent in using the same string across all tables. Similarly, when data on individual parameters are retrieved, they are combined (“joined”) on the basis of both patient ID and the time stamps describing the clinical event for which the parameters were recorded. Again, whereas TrialDB names the time-stamp fields consistently across tables, other systems may not do so. The metadata must therefore record the physical names of the patient ID and time-stamp fields for each patient-data-related attribute.
Query Specification: Use of XML
As stated previously, the query is generated by the user's interactions on the Web-browser client. The specification of a query can become quite complex. Some parts of the specification apply to individual criteria, whereas others apply to the query as a whole. Examples of the former are relational operators, statistical operators, values, and study period restrictions; examples of the latter are the query name and description and Boolean combinations of attributes. There is a one-to-many (hierarchic) relationship between the latter and the former.
XML, which can represent hierarchic data, supports a more natural and easily understood representation of the query while enabling the easy parsing, or decomposition, of a query specification into its component parts, which are then passed to the SQL generator described in the next subsection. (XML had not yet been devised when we described our original query interface.) An advantage of XML in our particular case is that the query specification has been a moving target, having been revised several times to strike a balance between conveying enough information to the SQL generator and remaining stable to change. Code that operates on XML is, in general, significantly more stable and robust than code that operates on other data formats, such as tab-delimited text. The query specification is described in Appendix 2.† This has been described using the newer W3C XML Schema notation,19as opposed to the older DTD notation, because W3C schemas allow a significantly greater degree of validation of an XML data stream.
Code Generation
The basic approach to code generation has been described in our earlier work1 and is now summarized. Conceptually, sets of patient IDs matching individual attribute-based criteria are identified. Compound Boolean criteria are evaluated using a “stack machine,”20,21 where “AND,” “OR,” and “AND NOT” operators are evaluated by performing intersection, union, and difference of the individual sets, respectively, yielding a final set of IDs for patients matching all criteria. The data requested for these patients is then fetched. In the present work, numerous optimizations are used in generation of the SQL that retrieves data and combines parameters into output tables. These are described in Appendix 3.†
Present Status
Currently, the Yale Cancer Center Clinical Trials Office and the Department of Psychiatry are intensively evaluating the query interface, which is in initial production mode. Both these groups have numerous studies in TrialDB. Four users are testing the system for usability and performance. About 20 queries were used in the initial development of the system, for four studies with between 50 and 200 patients currently enrolled in each. All queries except two were single-study queries that needed to be answered on a periodic basis. The two cross-study queries were devised strictly for the purpose of testing; currently, there is not sufficient similarity or overlap between the existing studies in TrialDB for such queries to be clinically “interesting.” Our early testing indicates the need for significant additional work to improve performance.
Execution time for queries currently runs from 30 sec to 2 min, but these figures apply only to queries that are executed in isolation. Query execution is serialized: Once a query begins to execute, it locks the Web server's resources until it is finished, so that another query must wait for the previous query to finish. Our current system design has assumed low concurrency, because only a few privileged and trained administrators are authorized to perform ad hoc query.
Furthermore, all result tables created by the query are saved unless explicitly overwritten by re-running the query; therefore, the user can minimize the browser window and perform other tasks while waiting for the query to complete. Although the assumption of low concurrency is currently valid, however, it might not necessarily be true in future. In any case, 2 min is still a long wait. Moreover, we can foresee the eventual creation of queries that are clinically justifiable but turn out to perform “pathologically.”
We have begun testing performance on a new database server with 2 gigabytes of RAM, which is connected directly as a “slave” CPU to a test Web server that has a dual network card. (In the existing production setup, half a mile separates the Web and the database servers, so that the speed of the University network is an important limiting factor.) In a preliminary test without any modifications to the schema, database contents, or code, the 2-min query ran in 5 sec. This increase in speed is probably due as much to caching effects, where temporary tables created during the course of the query are stored in memory instead of being written out to disk, as it is to the better network bandwidth, and we cannot say whether these encouraging figures will hold up when the machine is stressed with greater loads.
Discussion
Generic vs. Customized User Interfaces for Query
When building a query interface, one must decide how much to tailor the interface to reflect the organization, or schema, of that particular system. The more generic the interface, the more it is potentially portable to other systems, but portability can often come at the cost of ergonomics, and therefore the choice of design is often a judgment call on the developer's part.
Examples of highly tailored systems are described in two excellent papers by Nigrin and Kohane,5,10 which are concerned with query of a large repository of data from a pediatric population. The second paper describes a significant scaling-up and consequent redesign of the approach described in the first. The user interface that is described is typical of what is required for CPRSs. The examples that we cite in the following text are taken directly from these papers.
Each data table—e.g., Caregivers, Demographics, Labs, Diagnoses, and Medications—is accessible through a separate pull-down list that brings up the columns or attributes in that table. String attributes, such as laboratory test descriptions, can be searched on the basis of partial phrases. (This is similar to keyword-based attribute search in TrialDB.) For an individual table, the generated SQL derives directly from the table's definition. Thus, to identify patients for whom a laboratory parameter exceeds a certain value, the query criterion is of the form “lab_abbrev= HBA1C AND test_value>12,” whereas a search based on demographics leads to a criterion like “zip_code=10001.” Notice that laboratory data are row-modeled, whereas demographics are column-modeled: “zip_code” is the name of a column that stores a specific type of geographic information. These model differences are exposed to the user.
On the other hand, in EAV-based databases such as TrialDB, individual tables that segregate the data into functional categories, as in the above example, do not for the most part exist. The higher-order grouping of attributes simulates “virtual tables.” Such groups are numerous, however, and the groups that are actually applicable to a given study depend on the study domain, e.g., psychiatry vs. cardiology, and the sub-domain, e.g., depression vs. panic disorder.
Furthermore, because we do not wish to make EAV a religion, we permit schemas to have columnmodeled attributes (e.g., demographics) where the database designer thinks this is appropriate. Therefore, in the context of a CSDMS, our decision to make the query interface more generic, and drive it off metadata, is virtually forced, as is the decision to hide the row-modeled vs. column-modeled nature of individual attributes for most purposes. (The absence of a metadata-driven interface in the Nigrin and Kohane work is not a drawback; the problem they have addressed, which does not need that much metadata, is quite different from ours.)
As stated earlier, our previous interface erred in being too generic. The present incarnation of the TrialDB interface is based on the realization that the differences between CPRSs and CSDMSs must be reflected in the structure of the accompanying metadata and, therefore, in the user interface that drives off this metadata.
The code underlying our query interface does not, by any means, represent a software-engineering breakthrough. Although existing commercial CSDMSs (e.g., those provided by CB Technologies,22 ClinTrials Inc.,23 and Oracle Corp.24) also use EAV designs, however, they do not currently provide ad hoc query of EAV data, and they are not aware of the differences between single-study and multiple-study scenarios. Their approach appears to be restricted to documenting their database schemas so as to allow users with programming skills to extract data to a conventional, analyzable format. This can be a laborious task, and at least one vendor provides data extraction as a negotiable, fee-based service. As we have seen, however, given the appropriate metadata, the algorithm for data extraction is not particularly complicated.
Web-based Development: Choice of Browser
To provide an interface that responds rapidly to user actions, much of our code directly modifies the contents of HTML pages in the Web browser without a round trip to the server. The programming model for doing so (dynamic HTML) varies very greatly between Netscape and MS Internet Explorer (IE), almost to the point of mutual incompatibility. We have chosen to commit to IE, and our interface uses IE-specific features that simplify programming greatly. In IE, the HTML tables shown in Figure 2▶ are “first-class” objects that allow programmatic addition and deletion of rows without round trips. Second, IE's ShowModalDialog() function allows one Web page to call another as a subroutine; the latter returns a string that can be parsed and used to directly populate the contents of one or more controls (e.g., pull-down lists) in the calling page. This function essentially allows the creation of a responsive, two-tier-style interface while using a three-tier architecture. Finally, IE provides rich programmatic support for the XML Document Object Model (DOM), whereas Netscape has only recently begun to do so.
Future Work
The ad hoc query interface, like TrialDB itself, is a perpetual work-in-progress and will be tuned continually to be responsive to its users' needs. Some immediately foreseeable enhancements are mentioned here.
Because TrialDB is given freely to investigators on request, we will need to address the major differences between SQL as implemented in Oracle, which TrialDB uses at Yale, and Microsoft SQL Server, which many prefer because of lower cost and ease of administration. These differences affect the implementation of various operators as well as the handling of dates. Thus, for example, computing medians in MS SQL Server allows the straightforward use of the TOP N PERCENT clause (where N=50). This is unavailable in Oracle. Also, the syntax of “outer joins,” which are used extensively during data extractions, is completely incompatible between the two relational database management systems. To handle these differences, we have begun creating two sets of vendor-specific library routines, either one of which may be called by the SQL generator.
One major planned enhancement is to add the ability to query attributes containing “long” text. Currently, the proportion of such data within TrialDB is quite modest, but we can foresee that a significant part of our architecture is portable to CPRSs, which contain sizable volumes of narrative text, such as discharge summaries and surgical notes. Both MS SQL Server and Oracle support full-text indexing and robust query of narrative text, including “relevance ranking”25 of the returned results. The SQL Server programming interface is significantly more functional and easier to use than the Oracle one, but both engines lag behind dedicated text-indexing engines such as dtSearch, which can be called from any programming interface that supports Microsoft's Common Object Model. We are accumulating experience with dtSearch, and are beginning to benchmark its performance in comparison with that of SQL Server.
We are actively working on adding temporal query capability within the interface. We have currently implemented several temporal operators, but Boolean combinations of temporal criteria are not yet operational. This component, when functional, will also be metadata-driven.
One problem with viewing the results of cross-study queries is that, because different studies have different designs in terms of study periods or the set of attributes that are viewed together in forms, it is not a simple matter to use co-occurrence information to group attributes together in the final output. Visual interpretation of results is therefore more arduous. It may be possible to devise data structures to store higher-order groupings of attributes across sets of “similar” studies, as well as expend some effort in defining what makes two studies sufficiently “similar” to justify use of this adjective. This is not an easy problem, and whether the extra effort is justifiable in terms of increased utility to analyst-users is hard to say at the present time: the added functionality must be used frequently enough to make the effort worthwhile.
As stated earlier, a weakness of the present system lies in its performance characteristics. The time required for existing queries is currently considered (barely) acceptable, but this is partly because of low concurrency. In general, performing queries that return voluminous data against a transactional system is not a good idea, especially during times of high concurrent access. As both transactional and query loads increase, the data must be replicated on a separate “read-only” query server, and opportunities arise for optimizing the query execution environment. We will need to conduct serious research into how queries on EAV systems can be optimized by suitable schema alterations, and we will need to actively explore the trade-offs between incremental schema redesign and gain in query efficiency. One strategy, which falls into the category of “intermediate approaches” in our overview of CPRS query approaches, is to pre-join certain tables through a mechanism called “indexed” or “materialized” views.26 This feature, now part of recent versions of high-end relational database management systems, allows potentially significant gains in performance with minimal schema modification: Materialization of a view definition can often be done as a background task.
The client-server version of our query system can generate X-Y and bar graphs in a Microsoft Excel window from the result tables. The algorithms that prepare the data for graphing have been previously described.27 This capability has been temporarily sacrificed in the move to the Web, and our oncology users need to have it reintroduced.
Although it is not possible to anticipate every possible need when generating graphic output, it is important to allow graphic images to be pasted into other programs, in which individual objects in the image can be edited, to prepare it for publication. The W3C graphics standard that is gaining momentum is the XML-based Scalable Vector Graphics (SVG) language. An SVG viewer plug-in for the major Web browsers is freely downloadable from Adobe's Web site, where an excellent SVG tutorial is also provided. Several graphics programs, notably Adobe Illustrator and ChartFX, now output as well as import SVG. We have previous described the generation of SVG through program code in the creation of Web-accessible pedigree diagrams.28 For the present problem, however, it is best to use the higher-level facilities of a programmable graphics package.
Acknowledgments
The authors thank Roy Money, PhD, of the Connecticut Mental Health Center, and David Tuck, MD, of YCMI and the Yale Cancer Center, who helped define additional functionality.
This work was supported by grant R01 LM06843-02 from the National Library of Medicine and grant U01 CA78266-04 from the National Cancer Institute to Dr. Nadkarni. Dr. Deshpande is supported through the National Library of Medicine Medical Informatics Fellows Program.
Footnotes
Available online, with this article, at http://www.jamia.org.
Available online, with this article, at http://www.jamia.org.
The source code and documentation for TrialDB are distributed freely on request.
References
- 1.Nadkarni P, Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J Am Med Inform Assoc. 1998;5(6):511–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Nadkarni PM, Brandt C, Frawley S, et al. Managing attribute-value clinical trials data using the ACT/DB client-server database system. J Am Med Inform Assoc. 1998;5(2):139–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. New York: Wiley, 2000.
- 4.Inmon W, Rudin K, Buss C, Sousa R. Data Warehouse Performance. New York: Wiley, 1998.
- 5.Nigrin D, Kohane I. Data mining by clinicians. Proc AMIA Annu Fall Symp. 1998:957–61. [PMC free article] [PubMed]
- 6.Wilcox A, Hripcsak G. CPMC Query Builder. 1998. Available at: http://www.cpmc.columbia.edu/arden/qbr/. Accessed Jun 9, 2002.
- 7.Murphy S, Barnett G, Chueh H. Visual query tool for finding patient cohorts from a clinical data warehouse of the partners HealthCare system. Proc AMIA Annu Symp. 2000:1174. [PMC free article] [PubMed]
- 8.Prather JC, Lobach DF, Goodwin LK, Hales JW, Hage ML, Hammond WE. Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Annu Fall Symp. 1997:101–5. [PMC free article] [PubMed]
- 9.Wang P, Pryor TA, Narus S, Hardman R, Deavila M. The Web-enabled IHC enterprise data warehouse for clinical process improvement and outcomes measurement. Proc AMIA Annu Fall Symp. 1997:1028.
- 10.Nigrin D, Kohane I. Scaling a data retrieval and mining application to the enterprise-wide level. Proc AMIA Annu Symp. 1999:901–5. [PMC free article] [PubMed]
- 11.Stoffel K, Davis J, Rottman G, Saltz J, Dick J, Merz W, et al. A graphical tool for ad hoc query generation. Proc AMIA Annu Symp. 1998:503–7. [PMC free article] [PubMed]
- 12.Huff SM, Berthelsen CL, Pryor TA, Dudley AS. Evaluation of an SQL model of the HELP patient database. Proc Annu Symp Comput Appl Med Care. 1991:386–90. [PMC free article] [PubMed]
- 13.Huff SM, Haug DJ, Stevens LE, Dupont CC, Pryor TA. HELP the next generation: a new client-server architecture.Proc Annu Symp Comput Appl Med Care. 1994:271–5. [PMC free article] [PubMed]
- 14.3M Health Information Systems. 3M Clinical Data Repository. Murray, Utah: 3M Corp., 1998.
- 15.Friedman C, Hripcsak G, Johnson S, Cimino J, Clayton P. A generalized relational schema for an integrated clinical patient database. Proc Annu Symp Comput Appl Med Care. 1990: 335–9.
- 16.Johnson S, Cimino J, Friedman C, Hripcsak G, Clayton P. Using metadata to integrate medical knowledge in a clinical information system. Proc Annu Symp Comput Appl Med Care. 1990:340–4.
- 17.Nadkarni PM, Brandt CA, Marenco L. WebEAV: automatic metadata-driven generation of Web interfaces to entity-attribute-value databases. J Am Med Inform Assoc. 2000;7(7): 343–56. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Brandt C, Nadkarni P, Marenco L, et al. Reengineering a database for clinical trials management: lessons for system architects. Control Clin Trials. 2000;21(5):440–61. [DOI] [PubMed] [Google Scholar]
- 19.Fallside DC (ed). XML Schema Part 0: Primer. W3C Recommendation. May 2, 2001. Available at: http://www.w3.org/TR/xmlschema-0/. Accessed Oct 12, 2001.
- 20.Aho AV, Sethi R, Ullman JD. Syntax-directed translation. In: Compilers: Principles, Techniques, Tools. Reading, Mass.: Addison-Wesley, 1988:33–40.
- 21.Levine JR, Mason T, Brown D. lex & yacc. Sebastopol, Calif.: O'Reilly & Associates, 1992.
- 22.MetaTrial. CB Technologies Web site. 2001. Available at: http://www.metatrial.com. Accessed Aug 25, 2001.
- 23.ClinTrials [brief product description]. ClinTrials Inc. Web site. 2001. Available at: http://www.clintrialsresearch.com. Accessed Dec 14, 2001.
- 24.Oracle Clinical Version 3.0: User's Guide. Redwood Shores, Calif.: Oracle Corporation, 1996.
- 25.Salton G, Wu H, Yu CT. Measurement of term importance in automatic indexing. J Am Soc Inf Sci. 1981;32(3):175–86. [Google Scholar]
- 26.Greenspun P. SQL for Web Nerds. 2001. Accessible via http://www.arsdigita.com/books/sql/index.html. Accessed Sep 28, 2001.
- 27.Nadkarni P. Chronomerge: an application for the merging and display of multiple time-stamped data streams. Comput Biomed Res. 1998;31(6):451–64. [DOI] [PubMed] [Google Scholar]
- 28.Fernando S, Nadkarni P, Brandt C. Generation of pedigree diagrams for Web display using scalable vector graphics from a clinical trials database. Proc AMIA Annu Symp. 2001:174–8. [PMC free article] [PubMed]



