Author manuscript; available in PMC: 2015 Jun 1.
Published in final edited form as: J Biomed Inform. 2014 Mar 26;0:275–281. doi: 10.1016/j.jbi.2014.03.010

Using Large Clinical Corpora for Query Expansion in Text-based Cohort Identification

Dongqing Zhu a, Stephen Wu b, Ben Carterette a, Hongfang Liu b
PMCID: PMC4058413  NIHMSID: NIHMS579605  PMID: 24680983

Abstract

In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common sense approach of “use all available data” is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.

Keywords: Cohort Identification, Information retrieval, Query expansion, Clinical text, Electronic Medical Records

1. Introduction

Electronic medical records (EMRs) have the potential to streamline the processes involved in clinical and translational research. In the past, identification of cohorts for clinical trials or retrospective studies has been a costly endeavor, requiring hours of trained expertise to accomplish manual chart reviews. In the EMR age, this problem of cohort identification may be cast as one of information retrieval (IR). Here, clinical text (e.g., a discharge summary) must be searched alongside structured data (e.g., lab results) to find a pool of patients that fit some criteria, such as symptoms present, family history, or demographics (i.e., the query). But it is challenging for a clinician or epidemiological researcher to formulate an optimal query based on their desired criteria. This is in part because of the inherent diversity of language: ‘cold’ could be a temperature or a disease (polysemy), ‘dyspnea’ could be expressed in a medical record as ‘shortness of breath’ (synonymy), and ‘ibuprofen’ could be expressed as ‘pain reliever’ (hyponymy).

One effective general-domain IR approach for these problems is to expand queries to include other terms that might be relevant or implied. In the mixture of relevance models approach to query expansion [1], multiple large external text corpora have been used to select what terms might be helpful to add to a query. When searching for patient cohorts in the clinical domain, general-domain collections have been shown to select reasonable terms and improve retrieval performance [2]. What sort of improvement, if any, should be expected if clinical-domain collections are used for this query expansion?

In this work, we analyze the effects of including a large, unlabeled corpus of clinical notes in a statistical IR system for cohort identification. In particular, we evaluate the helpfulness of a corpus of Mayo Clinic clinical notes for the Text REtrieval Conference (TREC) task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction with other widely available collections. As our results will show, the large clinical corpus is the single most useful collection for query expansion. It is interesting to note, however, that optimal results in the mixture of relevance models would require selective application of this query expansion.

1.1. TREC Medical Records Cohort Retrieval Task

The TREC Medical Records Cohort Retrieval Task was to retrieve relevant patient visits from a target text collection of patient records [3, 4]. The University of Pittsburgh NLP Repository supplied de-identified medical reports as the target collection for the TREC 2011 and 2012 Medical Records Tracks. A patient visit to the hospital usually generates multiple medical reports, so 100,866 Pittsburgh medical reports corresponded to 17,198 patient visits. This is an approximation of finding actual patients for a cohort (a patient could have multiple hospital visits), which was impossible due to the record de-identification process.

Each medical report is an XML file with a fixed set of fields as shown in Figure 1. We mainly used ICD-9 codes for admit and discharge diagnoses, and the “report text” field which contained the full text of clinical narratives. Medical reports could be mapped to patient visits via a report-to-visit mapping table provided with the Pittsburgh NLP Repository.

Figure 1. Sample medical report from the Pittsburgh NLP Repository used in the 2011 and 2012 TREC Medical Records tracks.

Eighty-one queries (or “topics,” in TREC terminology) were developed by TREC assessors, reflecting the types of queries that might be used to identify cohorts for comparative effectiveness research [3]. These queries were designed to require information from the free-text fields, i.e., the topics are not answerable solely from the diagnostic codes. A topic usually specifies the patient’s condition, disease, treatment, and so on. For example:

Topic 107 Patients with ductal carcinoma in situ (DCIS)

Topic 185 Patients who develop thrombocytopenia in pregnancy

For each topic, TREC assessors gave a relevance judgment (highly relevant, relevant, or not relevant) to subsets of the 17,198 visits; these subsets were chosen based on pooled results from TREC Medical Records Track participants. Our evaluations used these 81 relevance-judged topics, with nine-fold cross validation to examine the benefit of query expansion.

1.2. Road Map

The remainder of this paper is organized as follows: Section 2 highlights related work. Section 3.1 introduces the characteristics of the collections used for query expansion, Section 3.2 describes preprocessing, and Section 3.3 describes the retrieval models. Section 4 details the evaluation metrics and experimental setup. Section 5 then presents results and discusses the utility of the Mayo corpus under various settings. After further discussion in Section 6, we summarize our findings in Section 7.

2. Related Work

While cohort identification is a widely known and addressed problem in medical informatics [5, 6, 7, 8, 9], few approaches have used traditional text-based IR approaches. A notable exception is the EMERSE (Electronic Medical Record Search Engine) system [10]. As one of the earliest successful non-commercial EMR search engines, EMERSE supports free-text queries and has been used by medical professionals in a few hospitals, health centers, and clinics since its initial introduction in 2005 [10, 11]. Though EMERSE has not achieved widespread adoption and there is little published discussion of its search algorithms, a few interesting studies have been conducted with it. Seyfried et al. [11] compared EMERSE-facilitated chart reviews with manual reviews and concluded that using a well-designed EMR search engine for retrieving information from free-text EMRs can provide significant time savings while preserving reliability. Yang et al. [12] analyzed a query log of the EMERSE system recorded over the course of 4 years. They found that users have difficulty expressing their information needs as well-formulated EMR search queries. Another interesting finding is that the coverage of EMR query terms by a meta-dictionary (containing all terms in the Unified Medical Language System (UMLS) Metathesaurus [13], an English dictionary, and a medical dictionary) is much lower than the usual 85–90% coverage of Web queries by English dictionaries. Thus, they suggest looking beyond medical ontologies to enhance medical information retrieval. Apart from these few attempts, methods emerging from information retrieval research have not been well explored, largely because the sensitivity of patient data prevents its use by academic researchers.

Fortunately, the Text REtrieval Conference (TREC) initiated a Medical Records track in 2011, making a set of real medical records and human judgments of relevance to search queries available to the research community. TREC is an annual evaluation workshop/competition with the goal of providing a common experimental setting for researchers who want to work on particular search tasks. Each year, there are up to seven “tracks,” each devoted to a different search task. Organizers provide documents and information needs to participants, ensuring that all participants are using the same data and working towards the same task. TREC organizers also oversee the collection of human relevance judgments, which are instrumental in understanding the effectiveness of a search system.

Most participants in TREC’s Medical Records track tried using medical-specific knowledge to enhance retrieval, but only a few of them achieved positive results. King et al. [14] identified and indexed terms from the medical reports that appeared in the UMLS; they also expanded the original queries with related terms from the UMLS and several commercial medical reference encyclopedias. Goodwin et al. [15] leveraged information from SNOMED-CT [16], the UMLS, and a subset of the PubMed Central database for query expansion. Zhu et al. [2, 17] also used several medical-related external resources for query expansion. These three teams all obtained large improvements over their baselines, which used no medical-specific knowledge.

Wu et al. [18, 19] represented queries and reports in a concept space with expansion to related concepts, but found only modest improvements. Several others [20, 21, 22, 23, 24] expanded query terms more traditionally, using UMLS synonyms, MeSH terms, RxNorm, Google search, healthcare websites, and DrugBank; however, they all obtained very little or no improvement over their baselines. More recently, Qi et al. [25] transformed plain-text medical reports and queries into concepts based on the UMLS Metathesaurus and experimented with various retrieval models, such as the vector space model, the query likelihood model, and learning-to-rank models. Their approaches achieved reasonably good results.

All of the above groups used medical resources in similar ways, but produced varying results. This paper differs in that it aims to characterize the effectiveness of query expansion based on in-domain resources. The techniques and resources used here have several advantages over those previously used in the TREC Medical Records track. The 39 million-document subset of Mayo Clinic clinical notes is the most similar to the target collection, and it is the largest of its kind. Moreover, our query expansion technique (Section 3.3.2) is highly scalable in that it allows us to easily compare and combine this and other external resources in a uniform way.

3. Materials and Methods

3.1. Auxiliary Collections for Query Expansion

This work mainly performs an analysis based on a clinical text collection: a 39-million-document subset of Mayo Clinic clinical notes from 1/1/2001 to 12/31/2010, retrieved from the Mayo Clinic Life Sciences System (MCLSS). This includes data from a comprehensive snapshot of Mayo Clinic’s service areas, excluding only microbiology, radiology, ophthalmology, and surgical reports. Additionally, each possible note type at Mayo was represented: Clinical Note, Hospital Summary, Post-procedure Note, Procedure Note, Progress Note, Tertiary Trauma, and Transfer Note. This corpus has been characterized for its clinical information content (namely, medical concepts [26] and terms [27]) and compared to other corpora, such as the 2011 MEDLINE/PubMed Baseline Repository and the 2010 i2b2 NLP challenge dataset [28].

In addition to the medical records from Mayo Clinic, we leverage information in several other large, widely-available collections: the TREC 2007 Genomics Track dataset [29], the TREC 2009 ClueWeb09 Category B dataset2 (excluding Wikipedia pages), and the Pittsburgh NLP Repository itself (the target collection, as indicated by * in Table 1). Table 1 provides statistics about these datasets.

Table 1.

Collection Statistics

Collection    # documents    Vocabulary size    Avg. doc length
PittNLP*      100,866        10^5               423
Genomics      162,259        10^7               6,595
ClueWeb09     44,262,894     10^7               756
MayoClinic    39,449,222     10^6               346

The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009, and it is used by several tracks of the TREC conference. TREC Category B contains the first 50 million English pages. The ClueWeb09 Cat-B dataset is comparable in size to the Mayo Clinic dataset in terms of the number of documents; however, it is less similar in content to the target collection (i.e., the Pittsburgh NLP Repository) and is noisier than the Mayo Clinic dataset.

The TREC 2007 Genomics Track dataset consists of full-text HTML documents from 49 journals4 published electronically via Highwire Press. The Genomics dataset is much smaller than the ClueWeb09 Cat-B dataset; however, its knowledge domain overlaps more with the clinical domain than does the general web domain from which the ClueWeb dataset is drawn.

Note that while there are other in-domain resources, such as the UMLS Metathesaurus, treating these more structured resources as expansion collections was outside the scope of this work. Mixtures of relevance models (Section 3.3.2) apply directly only to free-text collections. Also, the evaluations concerning collection size and quality (Section 5) cannot be easily extended to cover the case of a structured resource.

3.2. Preprocessing

As a pre-processing step, we merged the reports from a single visit into a visit document, thereby converting the 100,866 medical reports into 17,198 visit documents. This allowed us to combine evidence from scattered reports for each unique visit, and to consider the patient-visit retrieval problem as a document retrieval problem.

Additionally, we expanded the ICD-9 codes from the admit and discharge diagnoses. Though these codes are typically used for billing purposes, they give a high-level summary of the content of the medical records, and their associated descriptions can provide terms that are likely to be helpful to our retrieval system. Thus, we expanded ICD-9 codes in medical reports with their corresponding descriptions; for instance, we substituted code “428.1” with “left heart failure”. While these codes could be searched directly as in a typical database search, this mapping to text allowed us to consider the information in the codes alongside the rest of the text documents, without modifying the text-based search model described in Section 3.3.
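As a rough illustration of this substitution step, the following sketch (in Python) replaces ICD-9 codes found in a diagnosis field with their textual descriptions. The code-to-description table and the code-matching pattern here are simplified, hypothetical stand-ins for the full ICD-9-CM mapping used in practice.

    import re

    # Hypothetical excerpt of an ICD-9 code -> description table; a real pipeline
    # would load the complete ICD-9-CM reference.
    ICD9_DESCRIPTIONS = {
        "428.1": "left heart failure",  # example substitution from the text
    }

    # Approximate pattern for ICD-9 codes (e.g., 428.1); simplified, not exhaustive.
    ICD9_PATTERN = re.compile(r"\b[EV]?\d{3}(?:\.\d{1,2})?\b")

    def expand_icd9(text):
        """Replace ICD-9 codes with their descriptions so that the coded
        diagnoses become ordinary searchable text."""
        def replace(match):
            code = match.group(0)
            return ICD9_DESCRIPTIONS.get(code, code)  # leave unknown codes untouched
        return ICD9_PATTERN.sub(replace, text)

    # expand_icd9("admit diagnosis: 428.1") -> "admit diagnosis: left heart failure"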

We also used the Porter stemmer [30] to normalize for morphological differences in the indexed clinical text. The same stemmer was later run on the queries. No stopwords were removed from the index. We did not expand terms in the collections according to resource-based synonyms (e.g., from the UMLS), since synonyms were modeled through query expansion.

Finally, it is well known that negation phrases in narrative clinical reports are frequently used to assert the absence of certain conditions or symptoms [31], such as “no evidence of.” Negations may cause retrieval false positives if they are not dealt with. Thus, we use ConText [32] to eliminate all negated portions of sentences from the medical records before indexing. For instance, we would delete the whole phrase “ruled out for an acute coronary syndrome” from an original report. This simple way of dealing with negations in clinical narratives has been shown to be quite effective [17]. By removing negations, we eliminate retrieval false positives that could otherwise lead to an underestimation of the true performance of a good expansion collection.

After this preprocessing, we built an Indri index for the merged documents. Indri is an open-source system for indexing and retrieving full-text documents. It supports basic keyword queries, but also has a complex query language that offers much greater expressive power.

3.3. Retrieval Models

3.3.1. Baseline Query Likelihood (QL) Model

We start with a basic “bag-of-words” probabilistic model: the query likelihood language model. This model scores documents for queries as a function of the probability that query terms would be sampled (independently) from a bag containing all the words in that document. Formally, the scoring function for a document D on query Q is a sum of the logarithms of smoothed probabilities:

$$\mathrm{score}(D,Q) = \log P(Q \mid D) = \sum_{i=1}^{n} \log \frac{tf_{q_i,D} + \mu\,\frac{tf_{q_i,C}}{|C|}}{|D| + \mu}, \qquad (1)$$

where $q_i$ is the $i$th query term, $|D|$ and $|C|$ are the document and collection lengths in words respectively, $tf_{q_i,D}$ and $tf_{q_i,C}$ are the document and collection term frequencies of $q_i$ respectively, and $\mu$ is the Dirichlet smoothing parameter. The Indri retrieval system supports this model by default.
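The following minimal sketch (in Python) computes the score in Eq. 1 for a single document. It assumes in-memory term statistics and uses an arbitrary default value for μ (the actual value is tuned during training, per Table 2); it illustrates the scoring function and is not the Indri implementation used in our experiments.

    import math
    from collections import Counter

    def ql_score(query_terms, doc_terms, coll_tf, coll_len, mu=2500.0):
        """Dirichlet-smoothed query likelihood, log P(Q|D), as in Eq. 1.

        query_terms : list of (stemmed) query tokens
        doc_terms   : list of tokens in the candidate visit document
        coll_tf     : dict mapping term -> collection term frequency
        coll_len    : total number of term instances in the collection, |C|
        mu          : Dirichlet smoothing parameter
        """
        doc_tf = Counter(doc_terms)
        doc_len = len(doc_terms)
        score = 0.0
        for q in query_terms:
            p_coll = coll_tf.get(q, 0) / coll_len              # collection language model
            smoothed = (doc_tf[q] + mu * p_coll) / (doc_len + mu)
            if smoothed > 0.0:                                 # skip terms unseen everywhere
                score += math.log(smoothed)
        return score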

3.3.2. Mixture of Relevance Models (MRM) for Query Expansion

The above model is a strong baseline, but the only information it uses is the terms in the query and the terms in the document. It can be improved by expanding the query with additional “related” terms. These related terms can be derived from a relevance model (involving pseudo-relevance feedback in the target collection) or mixtures of relevance models (involving feedback in multiple collections) [1]. We describe MRM from a query expansion perspective to emphasize how external collections, like the clinical corpus under consideration, are actually used. For each query Q, we perform the following four steps:

  1. Pseudo-relevance feedback. Retrieve and rank documents $D_C$ from each expansion collection $C$ according to the QL model, considering all $n$ terms of the query, and return the document score:
    $$\mathrm{score}(D_C,Q) = \log P(Q \mid D_C) = \sum_{i=1}^{n} \log \frac{tf_{q_i,D_C} + \mu\,\frac{tf_{q_i,C}}{|C|}}{|D_C| + \mu} \qquad (2)$$
  2. Find candidate expansion terms. For each collection-specific ranked list, consider all terms $e$ that are present in the top $k$ documents, and calculate $P(e \mid \hat{\theta}_{Q,C})$ as below. Keep only the top $m$ terms from each corpus.
    $$P(e \mid \hat{\theta}_{Q,C}) = \sum_{j=1}^{k} \exp\!\left( \frac{tf_{e,D_j}}{|D_j|} + \log\frac{|C|}{df_{e,C}} + \mathrm{score}(D_{j,C},Q) \right) \qquad (3)$$
    Here, $\hat{\theta}_{Q,C}$ represents the scored results of pseudo-relevance feedback, i.e., the estimated model $\hat{\theta}$ of collection $C$ for the query $Q$. $|D_j|$ and $|C|$ are the document and collection lengths in words respectively, and $df_{e,C}$ is the number of documents containing term $e$ in collection $C$.
  3. Determine query and expansion term weights. Calculate $P(w \mid \hat{\theta}_Q)$ as below; $w$ ranges over both $q$ (original query terms) and $e$ (retained expansion terms from each collection), and $\hat{\theta}_Q$ models scores for query $Q$ across all collections.
    $$P(w \mid \hat{\theta}_Q) = \lambda_Q \frac{tf_{w,Q}}{|Q|} + (1 - \lambda_Q) \sum_{C} \lambda_C \, P(w \mid \hat{\theta}_{Q,C}) \qquad (4)$$
    This linearly interpolates the mixture of relevance models (second term, where the $\lambda_C$ are collection weights and $\sum_C \lambda_C = 1$) with the maximum-likelihood query estimate (first term). Note that the count of $w$ in the query, $tf_{w,Q}$, will be 0 for expansion terms since they do not appear in the query.
  4. Aggregate across terms for final score. The final score for ranking document $D$ is a log-transformed version of $P(Q \mid D)$, obtained by weighting each term’s log document probability by its weight from Step 3: $\mathrm{score}(D,Q) = \sum_{w} P(w \mid \hat{\theta}_Q)\,\log P(w \mid D)$.
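A compact sketch of Steps 2 and 3 is given below (in Python). It assumes the top-k feedback documents for each expansion collection have already been retrieved and scored with the QL model (Step 1); normalizing the retained expansion-term weights to sum to one is an implementation convenience here, not part of Eq. 3.

    import math
    from collections import Counter, defaultdict

    def expansion_term_weights(feedback_docs, coll_len, df, m=20):
        """Step 2 (cf. Eq. 3): score candidate expansion terms from one collection.

        feedback_docs : list of (doc_tokens, ql_score) pairs for the top-k documents
        coll_len      : |C|, total term instances in the collection
        df            : dict mapping term -> document frequency in the collection
        m             : number of expansion terms to keep
        """
        weights = defaultdict(float)
        for tokens, doc_score in feedback_docs:
            tf = Counter(tokens)
            dlen = len(tokens)
            for term, count in tf.items():
                weights[term] += math.exp(count / dlen
                                          + math.log(coll_len / df.get(term, 1))
                                          + doc_score)
        top = sorted(weights.items(), key=lambda kv: -kv[1])[:m]
        total = sum(w for _, w in top)
        return {t: w / total for t, w in top}

    def mixture_query_model(query_terms, per_collection_weights, lambda_q=0.7):
        """Step 3 (cf. Eq. 4): interpolate the original query with each
        collection's expansion terms, using equal collection weights."""
        q_tf = Counter(query_terms)
        q_len = len(query_terms)
        n_coll = max(len(per_collection_weights), 1)
        model = defaultdict(float)
        for w, c in q_tf.items():
            model[w] += lambda_q * c / q_len
        for coll_weights in per_collection_weights:
            for w, p in coll_weights.items():
                model[w] += (1.0 - lambda_q) * p / n_coll
        return dict(model)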

3.3.3. MRM in practice

In practice, the final score is computed by creating a query in the Indri query language, with the weights from Step 3 combined via “#weight” operators and each expansion term carrying the probability p computed in Step 2. For example, an expanded query based on two expansion collections, with the λ values specified as (0.7, 0.2, 0.1), wraps the original query “female breast cancer mastectomies admission” in a top-level “#weight” expression (weight 0.7), followed by two nested “#weight” phrases of weighted expansion terms (weights 0.2 and 0.1), one per expansion collection.
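To make the structure of such a query concrete, the sketch below assembles an Indri “#weight” query string from the original query terms and per-collection expansion-term weights. The expansion terms and probabilities in the usage comment are hypothetical placeholders, not the actual expansions produced by our system.

    def build_indri_query(original_terms, expansion_models, lambdas=(0.7, 0.2, 0.1)):
        """Wrap the original query and one weighted clause per expansion
        collection in a top-level Indri #weight expression."""
        parts = ["{:.2f} #combine( {} )".format(lambdas[0], " ".join(original_terms))]
        for lam, model in zip(lambdas[1:], expansion_models):
            inner = " ".join("{:.4f} {}".format(p, term) for term, p in model.items())
            parts.append("{:.2f} #weight( {} )".format(lam, inner))
        return "#weight( {} )".format(" ".join(parts))

    # Hypothetical example:
    # build_indri_query(
    #     ["female", "breast", "cancer", "mastectomies", "admission"],
    #     [{"carcinoma": 0.31, "lumpectomy": 0.12},   # terms from collection 1
    #      {"dcis": 0.28, "ductal": 0.17}])           # terms from collection 2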

4. Evaluation

MAP was the primary evaluation measure (for training and testing) in this work. MAP (mean average precision) provides a single-figure measure of quality across recall levels [33]. If $\{d_1,\ldots,d_j\}$ is the set of relevant documents for an information need $q \in Q$, then MAP is defined as:

$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{\sum_{d \in \{d_1,\ldots,d_j\}} \mathrm{Prec}(\mathrm{rank}(d))}{|\{d_1,\ldots,d_j\}|}, \qquad (5)$$

where $\mathrm{Prec}(k)$ is the proportion of relevant documents among the top $k$ retrieved.
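Eq. 5 corresponds to the standard computation sketched below (in Python); relevant visits that are never retrieved simply contribute zero precision. The helpers assume the rankings and relevance judgments are already loaded into memory.

    def average_precision(ranked_ids, relevant_ids):
        """Average precision for one query: mean of Prec(rank(d)) over relevant d."""
        relevant = set(relevant_ids)
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(run, qrels):
        """MAP over a query set (Eq. 5).
        run   : dict query_id -> ranked list of visit ids
        qrels : dict query_id -> collection of relevant visit ids"""
        return sum(average_precision(run[q], qrels[q]) for q in qrels) / len(qrels)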

While the official metrics for the 2011 and 2012 TREC Medical Records Tracks were bpref [34] and infAP [35]8, respectively, we observed that MAP correlates well with the official measures in terms of how it ranks systems. We trained our systems on MAP, as it is the measure most commonly used in IR to improve retrieval performance, and we found that training for high MAP also improved performance on the other metrics, while training on the official metrics did not.

We used the Porter stemmer and a simple standard medical stoplist [36] for stemming and stopping words in queries during retrieval. Then we conducted 9-fold cross-validation and used the top 1000 retrieved visits9 for each query to evaluate our system under different settings. In each iteration, we trained our system on 72 queries to obtain the best parameter setting for MAP, sweeping over the parameter space according to Table 2 below, and then generated a ranking for each of the remaining 9 queries with the trained system. When complete, we had full rankings for all 81 topics as a test set, and we evaluated the system based on MAP over all 81 topics.

Table 2.

Parameter space for training.

Parameter From To Step Size
Dirichlet smoothing parameter μ 1000 20000 5000
# of feedback documents k 20 60 20
# of expansion terms m 10 30 10

Note that the baseline system using Eq. 1 has only one free parameter, μ, to train. In this work, we fixed the query weight λQ at 0.7 and used equal weights for the λC. This is because we needed to test various system settings with multiple parameters, and including the λ values in training would be computationally expensive when two or more expansion collections are used in the mixture of relevance models. In fact, expansion collection weighting is itself an interesting research problem, and we plan to explore it in future work.
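The sketch below (in Python) outlines the 9-fold cross-validation and parameter sweep described above. The functions run_system and evaluate_map are assumed helpers standing in for Indri retrieval and MAP scoring; the grid follows Table 2, with λQ fixed at 0.7 and equal λC weights handled inside run_system.

    from itertools import product

    # Parameter grid from Table 2.
    MU_VALUES = range(1000, 20001, 5000)   # Dirichlet smoothing parameter
    K_VALUES = range(20, 61, 20)           # feedback documents per collection
    M_VALUES = range(10, 31, 10)           # expansion terms per collection

    def cross_validate(topics, run_system, evaluate_map, n_folds=9):
        """Tune (mu, k, m) on the training folds by MAP, then rank the
        held-out topics with the selected setting.

        run_system(params, topic_ids) -> dict topic_id -> ranked visit list (assumed)
        evaluate_map(rankings)        -> MAP over those topics (assumed)
        """
        fold_size = len(topics) // n_folds
        held_out_rankings = {}
        for f in range(n_folds):
            test = topics[f * fold_size:(f + 1) * fold_size]
            train = [t for t in topics if t not in test]
            best = max(product(MU_VALUES, K_VALUES, M_VALUES),
                       key=lambda params: evaluate_map(run_system(params, train)))
            held_out_rankings.update(run_system(best, test))
        return held_out_rankings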

To assess the statistical significance of differences in the performance of two systems, we perform a one-tailed paired t-test on the difference in MAP.
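With per-topic average precision values for the two systems in hand, this test can be computed with SciPy as sketched below (the alternative= keyword requires SciPy 1.6 or later).

    from scipy import stats

    def compare_systems(ap_system_a, ap_system_b):
        """One-tailed paired t-test on per-topic average precision, testing
        whether system A improves over system B."""
        t_stat, p_value = stats.ttest_rel(ap_system_a, ap_system_b,
                                          alternative="greater")
        return t_stat, p_value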

5. Results and Analysis

In this section, we show and discuss the results of including the Mayo Clinic corpus under various settings.

5.1. Clinical corpus vs. other single collections

Table 3 shows the retrieval performance when a single collection was used to produce query expansions. The best single-collection MAP score is obtained using the Mayo Clinic corpus. This is particularly interesting because it outperforms the target collection (PittNLP) itself, though the difference is not statistically significant. The Mayo Clinic data does significantly (at the p < .10 level) outperform ClueWeb, showing that the domain of a similarly sized corpus matters.

Table 3.

MAP scores for single expansion collections, and the significance of their pairwise differences (p-values)

Collection   p vs. PittNLP   p vs. Genomics   p vs. ClueWeb   p vs. Baseline   MAP
Mayo         0.225           0.125            0.077           8.39×10^-7       0.391
PittNLP      –               0.363            0.354           2.50×10^-4       0.388
Genomics     –               –                0.443           1.12×10^-5       0.387
ClueWeb      –               –                –               1.57×10^-6       0.386
Baseline     –               –                –               –                0.373

In these single-expansion-collection tests, the similarity of the collection to the target appears to be a suitable measure of quality. Similar corpora tend to reduce noise and so improve precision, while dissimilar corpora attempt to increase recall with novel terms but contribute noise, thus hurting precision.

5.2. Performance by clinical collection size and by query difficulty

To test the impact of collection size on query expansion effectiveness, we created multiple expansion collections of different sizes in an incremental way from the original Mayo Clinic corpus. In particular, we built the smallest sub-collection C0 by randomly sampling a set of clinical notes from the Mayo Clinic corpus, then built the next sub-collection C1 by adding more randomly selected clinical notes, then built C2 by adding more notes to C1, and so on. Thus Cj is a superset of Ci for i < j. We built an index for each sub-collection and used it for query expansion. The number of terms in each sub-collection is shown on the x-axis of Figure 2, and the accumulated AP gain, i.e., the sum of the per-query average precision improvements, is shown on the y-axis. We divided the queries into three classes based on their performance without any query expansion: hard (MAP < .33), medium (.33 < MAP < .67), and easy (MAP > .67). It is clear that “more is better” does not hold here. There is a peak at about 2.5 billion terms, beyond which the Mayo clinical notes no longer contribute positive query expansions and are more likely contributing noise instead. This is an interesting result, because it counters the common wisdom that more data will always solve the problem. In the case of query expansion, it is helpful to have the right amount of in-domain information.
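A minimal sketch of this incremental sampling: shuffle the note identifiers once and take growing prefixes, so that each sub-collection is a superset of the previous one. The fractions below are illustrative (Table 6 later refers to 10%, 30%, and 80% subsets); the actual sub-collection sizes in Figure 2 are reported by number of term instances.

    import random

    def nested_subcollections(note_ids, fractions=(0.1, 0.3, 0.5, 0.8, 1.0), seed=13):
        """Build nested sub-collections C0, C1, ... by shuffling once and
        taking progressively larger prefixes of the note list."""
        rng = random.Random(seed)
        shuffled = list(note_ids)
        rng.shuffle(shuffled)
        return [shuffled[:int(f * len(shuffled))] for f in fractions]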

Figure 2. Performance curve of incorporating different-sized clinical collections as relevance models for query expansion.

Figure 2 also shows that the beneficial effects of query expansion are less pronounced for easy queries. Because the query difficulty categorizations were made without query expansion, easy queries can already be retrieved well without the help of query expansion; medium and hard queries benefit more.

5.3. Clinical corpus among multiple expansion collections

In this experiment, we compared the performance of combinations of the expansion collections listed in Table 1. This corresponds to using Eq. 4 for computing the relevance scores. Since we had 4 expansion collections in total, there are 11 different combinations of two or more expansion collections.

First, we note that the 4-corpus MRM-11 run achieves the highest MAP score of any tested combination. Thus, while Figure 2 showed that bigger was not necessarily better for the size of the clinical corpus, adding data from multiple corpora was helpful to performance.

Furthermore, Table 4 suggests that the diversity of expansion collections should be considered when dealing with more than one expansion collection. With multiple expansion collections, the effect of noisy terms from a single collection can be mitigated by terms from other collections, as long as the collections are diverse. Thus, diversity (appropriate dissimilarity in domains) counterbalances similarity, providing additional recall with suppressed loss of precision.

Table 4.

Using multiple expansion collections (PittNLP, ClueWeb09, TREC Genomics, Mayo Clinic) for mixture of relevance models (MRM) query expansion; an “X” marks the expansion collections used in each run.

MAP    Expansion Collection
       PittNLP   ClueWeb   Genomics   Mayo

Two collections:
MRM-1   0.4011   X X
MRM-2   0.4031   X X
MRM-3   0.3996   X X
MRM-4   0.3979   X X
MRM-5   0.3987   X X
MRM-6   0.3943   X X

Three collections:
MRM-7   0.4144   X X X
MRM-8   0.4089   X X X
MRM-9   0.4116   X X X
MRM-10  0.4061   X X X

Four collections:
MRM-11  0.4223   X X X X

Thus, while an in-house clinical collection (here, the Mayo corpus) is the single most beneficial resource (according to Table 3), it is not necessarily the best resource when multiple expansion collections are used. However, because an in-house collection may be the most readily available resource within an institution (requiring no subscription or data use agreement), the next subsection explores whether it is worth the effort to produce such an in-house collection.

5.4. Adding a clinical corpus to an existing setup

We ran query expansion using multiple collections and computed the relevance scores according to Eq. 4. Using the 11 runs calculated in Table 4, we considered the significance of adding the Mayo corpus given that one or more of the other expansion collections were already present. This is a realistic setting when implementing an IR-based cohort identification system on a local EMR. The significance of adding the Mayo clinical corpus is very clear: regardless of which collections are already used in the mixture of relevance models, results are improved by adding the corpus. This implies that any locally implemented IR-based cohort identification system can significantly improve its performance by utilizing a large unlabeled corpus from within its institution.

6. Discussion

6.1. Analysis of performance factors

The quality, size, and diversity of the expansion collections are three important factors that impact performance gain.

First, larger expansion collections tend to have better coverage of query-related expansion terms. However, an expansion collection can also introduce more noise if it is too large. Table 6 shows the top weighted expansion terms (word stems) for the query “hearing loss”. The first three columns are terms derived from Mayo sub-collections of different sizes. As we can see, M30 produces a much better set of expansion terms than M10. However, the set derived from M80 is apparently contaminated by noise.

Table 6.

Comparison of top 15 expansion terms for query “hearing loss”.

M10 (10% Mayo) M30 (30% Mayo) M80 (80% Mayo) ClueWeb Genomics

ear ear sensorineur heare hear
sensorineur sensorineur inherit shakespeare deaf
aid aid gene herbert hhie
gene audiogram connexin campion cochlear
audiogram hi ear nniina ear
inherit nois autosom jokinen sensorineur
genet right genet alphabeticall loss
tinnitu cochlear recess cawdrey ttss
hi tinnitu aid tiiaa nois
caus left ag ierde syndrom
connexin bilater slope george paget
nois sudden matern hiele audiometr
baud ha mutat babel fechter
carrier db famili renee cochlea
mitochondria hz patern har auditori

The quality of an expansion collection is estimated by the overlap between the two domains, i.e., the content similarity of the expansion collection to the target collection. Expansion collections containing content similar to the target collection tend to share a similar underlying language model (i.e., vocabulary and term distributions) and thus can yield a better “relevance model”.

Moreover, a diversified set of expansion collections works better than a specialized set of collections. This is because expansion collections from different domains contribute differently to retrieval performance on different queries. If one collection in a diversified set fails to improve retrieval, the others might still help (as shown in Table 6), which is not the case if we use a set of similar collections.

6.2. Comparison with related results

Table 5 illustrates the significant benefit of adding the clinical corpus to any MRM-based query expansion configuration. These results are similar to those reported by Zhu and Carterette [17], who also reported MAP scores on individual expansion collections in the 0.37–0.39 range. However, the results are not directly comparable, because we used cross-validation across queries to obtain a more stable estimate rather than testing on the 47 topics of the 2012 TREC Medical Records track. This means there was more training data available to tune parameters for each query, but the evaluation metrics were more susceptible to the idiosyncrasies of particular queries.

Table 5.

Change in performance (Δ MAP) and significance (p-values < .05), upon adding the clinical corpus to any existing configuration.

MRM Model Δ MAP p-value
PittNLP adding Mayo 0.0117 2.66×10−05
ClueWeb adding Mayo 0.0124 0.000513297
Genomics adding Mayo 0.0075 0.003126875
PittNLP + ClueWeb adding Mayo 0.0078 0.004912243
PittNLP+Genomics adding Mayo 0.0085 0.005416947
ClueWeb+Genomics adding Mayo 0.0082 0.015875188
PittNLP + ClueWeb + Genomics adding Mayo 0.0162 0.023945989

The best results we have reported (MAP=0.4223) are from the full model in Table 4. Again, the results cannot be directly compared to Zhu and Carterette’s best scores because of the difference between our cross-validation setup and their train/test evaluation setup. Additionally, Zhu and Carterette [17] modeled query term dependencies using Markov random fields. Modeling these dependencies resulted in their highest-performing system (MAP=0.413), which was the top-ranked automatic system in TREC 2012. Using only expansion collections to improve upon the query likelihood model, as in our tests, still performed well (MAP=0.398). Thus, there is reason to believe that our reported retrieval performance could be further improved by modeling term dependencies with Markov random fields or a similar model.

7. Conclusion and Future Work

We have shown that, on the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that the more inherently difficult a query is, the more it is helped by this type of query expansion. This implies the need for query-adaptive collection selection, which is a direction of active research. Finally, we have shown that more data is not necessarily better, implying that there is value in collection curation. This is a possible direction for future work.

Acknowledgements

This work was supported in part by the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data Cooperative Agreement from the HHS Office of the National Coordinator, Washington, DC. DHHS 90TR000201.

Footnotes


2

http://lemurproject.org/clueweb09.php/. The ClueWeb09 dataset was created to support research on IR and related human language technologies. It contains about 1 billion web pages and is used by several tracks of the TREC conference. TREC Category B contains the first 50 million English pages, including about 6 million Wikipedia pages.

4

The full list of journals can be found at http://ir.ohsu.edu/genomics/2007data.html

8

bpref and infAP measures are both used to approximate average precision when the relevance judgments are incomplete, and infAP is found to be more robust than bpref [35].

9

The Medical Records track guidelines require that each retrieval set contain no more than 1000 visits.

References

[1] Diaz F, Metzler D. Improving the estimation of relevance models using large external corpora. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; New York, NY, USA: ACM; 2006. p. 154–161.
[2] Zhu D, Carterette B. Using multiple external collections for query expansion. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[3] Voorhees EM, Tong RM. Overview of the TREC 2011 medical records track. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[4] Hersh W, Voorhees EM. Overview of the TREC 2012 medical records track. In: Proceedings of The 21st Text REtrieval Conference (TREC); 2012.
[5] Murphy SN, Barnett GO, Chueh HC. Visual query tool for finding patient cohorts from a clinical data warehouse of the Partners HealthCare system. In: Proceedings of the AMIA Symposium; American Medical Informatics Association; 2000. p. 1174.
[6] Deshmukh V, Meystre S, Mitchell J. Evaluating the Informatics for Integrating Biology and the Bedside system for clinical research. BMC Medical Research Methodology. 2009;9(1):70. doi: 10.1186/1471-2288-9-70.
[7] D’Avolio LW, Farwell WR, Fiore LD. Comparative effectiveness research and medical informatics. The American Journal of Medicine. 2010;123(12):e32–e37. doi: 10.1016/j.amjmed.2010.10.006.
[8] McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, Li R, Masys DR, Ritchie MD, Roden DM, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics. 2011;4(1):13. doi: 10.1186/1755-8794-4-13.
[9] Kandula S, Zeng-Treitler Q, Chen L, Salomon WL, Bray BE. A bootstrapping algorithm to improve cohort identification using structured data. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2011.10.013.
[10] Hanauer DA. EMERSE: the Electronic Medical Record Search Engine. AMIA Annual Symposium Proceedings. 2006;331(7531):941.
[11] Seyfried L, Hanauer D, Nease D. Enhanced identification of eligibility for depression research using an electronic medical record search engine. International Journal of Medical Informatics. 2009;78(12):e13–e18. doi: 10.1016/j.ijmedinf.2009.05.002.
[12] Yang L, Mei Q, Zheng K, Hanauer D. Query log analysis of an electronic health record search engine. In: AMIA Annual Symposium Proceedings; 2011. p. 915–924.
[13] Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(suppl 1):D267–D270. doi: 10.1093/nar/gkh061.
[14] King B, Wang L, Provalov I, Zhou J. Cengage Learning at TREC 2011 medical track. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[15] Goodwin T, Rink B, Roberts K, Harabagiu SM. Cohort Shepherd: discovering cohort traits from hospital visits. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[16] Cornet R, de Keizer N. Forty years of SNOMED: a literature review. BMC Medical Informatics and Decision Making. 2008;8(Suppl 1):S2. doi: 10.1186/1472-6947-8-S1-S2.
[17] Zhu D, Carterette B. Exploring evidence aggregation methods and external expansion sources for medical record search. In: Proceedings of The 21st Text REtrieval Conference (TREC); 2013.
[18] Wu S, Wagholikar K, Sohn S, Kaggal V, Liu H. Empirical ontologies for cohort identification. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[19] Wu ST, Zhu D, Hersh W, Liu H. Clinical information retrieval with split-layer language models. In: Proceedings of the ACM SIGIR Workshop on Health Search and Discovery (HSD); 2013.
[20] Demner-Fushman D, Abhyankar S, Jimeno-Yepes A, Loane R, Rance B, Lang F, Ide N, Apostolova E, Aronson AR. A knowledge-based approach to medical records retrieval. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[21] Daoud M, Kasperowicz D, Miao J, Huang J. York University at TREC 2011: Medical Records track. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[22] Wu H, Fang H. An exploration of new ranking strategies for medical record tracks. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[23] Schuemie M. DutchHatTrick: semantic query modeling, ConText, section detection, and match score maximization. In: Proceedings of The 20th Text REtrieval Conference (TREC); 2011.
[24] Bedrick S, Edinger T, Cohen A, Hersh W. Identifying patients for clinical studies from electronic health records: TREC 2012 Medical Records track at OHSU. In: Proceedings of The 21st Text REtrieval Conference (TREC); 2013.
[25] Qi Y, Laquerre P-F. Retrieving medical records with SennaMed: NEC Labs America at TREC 2012 Medical Records track. In: Proceedings of The 21st Text REtrieval Conference (TREC); 2013.
[26] Wu S, Liu H. Semantic characteristics of NLP-extracted concepts in clinical notes vs. biomedical literature. In: Proceedings of AMIA 2011; 2011.
[27] Wu S, Liu H, Li D, Tao C, Musen M, Chute C, Shah N. UMLS term occurrences in clinical notes: a large-scale corpus analysis. In: Proceedings of the AMIA Joint Summit on Clinical Research Informatics; 2012.
[28] Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association. 2011;18(5):552–556. doi: 10.1136/amiajnl-2011-000203.
[29] Hersh WR, Cohen AM, Ruslen L, Roberts PM. TREC 2007 Genomics track overview. In: Proceedings of the Text REtrieval Conference (TREC); 2007.
[30] Porter MF. An algorithm for suffix stripping. Program: Electronic Library and Information Systems. 1980;14(3):130–137.
[31] Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports. In: Proceedings of the AMIA Symposium; 2001. p. 105–109.
[32] Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics. 2009;42(5):839–851. doi: 10.1016/j.jbi.2009.05.002.
[33] Croft B, Metzler D, Strohman T. Search Engines: Information Retrieval in Practice. 1st ed. Addison Wesley; 2009.
[34] Buckley C, Voorhees EM. Retrieval evaluation with incomplete information. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM; 2004. p. 25–32.
[35] Yilmaz E, Aslam JA. Estimating average precision with incomplete and imperfect judgments. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management; ACM; 2006. p. 102–111.
[36] Hersh W. Information Retrieval: A Health and Biomedical Perspective. 3rd ed. Health Informatics. Springer; 2009.
