Overview of the TREC 2017 Precision Medicine Track

Kirk Roberts; Dina Demner-Fushman; Ellen M Voorhees; William R Hersh; Steven Bedrick; Alexander J Lazar; Shubham Pant

. Author manuscript; available in PMC: 2020 Aug 6.

Published in final edited form as: Text Retr Conf. 2017 Nov;26:https://trec.nist.gov/pubs/trec26/papers/Overview-PM.pdf.

Overview of the TREC 2017 Precision Medicine Track

Kirk Roberts ¹, Dina Demner-Fushman ², Ellen M Voorhees ³, William R Hersh ⁴, Steven Bedrick ⁵, Alexander J Lazar ⁶, Shubham Pant ⁷

PMCID: PMC7410346 NIHMSID: NIHMS949699 PMID: 32776021

1 Introduction

For many complex diseases, there is no “one size fits all” solutions for patients with a particular diagnosis. The proper treatment for a patient depends upon genetic, environmental, and lifestyle choices. The ability to personalize treatment in a scientifically rigorous manner based on these factors is the hallmark of the emerging “precision medicine” paradigm. Nowhere is the potential impact of precision medicine more closely felt than in cancer, where lifesaving treatments for particular patients could prove ineffective or even deadly for other patients based entirely upon the particular genetic mutations in the patient’s tumor(s). Significant effort, therefore, has been devoted to deepening the scientific research surrounding precision medicine. This includes a Precision Medicine Initiative (Collins and Varmus, 2015) launched by former President Barack Obama in 2015, now known as the All of Us Research Program.

A fundamental difficulty with putting the findings of precision medicine into practice is that–by its very nature–precision medicine creates a huge space of treatment options (Frey et al., 2016). These can easily overwhelm clinicians attempting to stay up-to-date with the latest findings, and can easily inhibit a clinician’s attempts to determine the best possible treatment for a particular patient. However, the ability to quickly locate relevant evidence is the hallmark of information retrieval (IR). Further, for three consecutive years the TREC Clinical Decision Support (CDS) track has sought to evaluate IR systems that provide medical evidence to the point-of-care. It was natural, then, to specialize the CDS track to the needs of precision medicine so IR systems can focus on this important issue.

The 2017 Precision Medicine track focused on a single field, oncology, for a specific use case, genetic mutations of cancer. As described above, main idea behind precision medicine is to use detailed patient information (largely genetic information in most current research) to identify the most effective treatments. Improving patient care in precision oncology then requires both (a) a mechanism to locate the latest research relevant to a patient, and (b) a fallback mechanism to locate the most relevant clinical trials when the latest techniques prove ineffective for a patient. In the first part, the track continues the previous Clinical Decision Support track (with a more focused use case), while in the second part expands the task to cover a new type of data (clinical trial descriptions).

The remainder of this overview is organized as follows: Section 2 describes the Clinical Decision Support tracks, including their motivation and data, and how this led to the Precision Medicine track; Section 3 describes the structure of the topics and the process of creating them; Section 4 outlines the retrieval tasks; Section 5 describes the evaluation method; finally, Section 6 details the results of the participant systems.

2 Background

The TREC Clinical Decision Support track (2014–2016) sought to evaluate systems that provided evidence-based information (in the form of full-text literature articles) to clinicians for a specific patient (represented as a case description or admission note). This included information on diagnosing, treating, and testing patients. No attempt was made to limit topics by medical speciality (e.g., cardiology, pediatrics), which in some respects made it difficult to define precise use cases and have a uniform definition of relevance. Despite this, the track was extremely successful in attracting a large and diverse group of participants (ranging from 26 to 36 participating participants in each year). The track was also heavily inspired by the TREC Genomics (Hersh and Voorhees, 2009) and Medical Records (Voorhees and Hersh, 2012) tracks, in addition to the medical case-based retrieval track of ImageCLEF (Seco de Herrera et al., 2013), all of which are no longer active. All of these tracks have demonstrated significant interest in the problem of medical ad hoc retrieval.

To address the needs of a specific, high-profile, and clinically valuable use case, the Clinical Decision Support track was transitioned to the Precision Medicine track. While the Clinical Decision Support track utilized full-text articles from PubMed Central (PMC), the Precision Medicine track utilized shorter MED-LINE abstracts. This is mainly due to PMC being a poor resource for precision medicine: a low proportion of precision medicine-related articles are deposited in PMC. Further, those articles that are deposited are often subject to a 6–24 month embargo evaluation, a significant length of time in a fast-moving field such as precision medicine. Additionally, clinical trials were added as a separate corpus, consistent with the importance of this resource in precision oncology.

3 Topics

The 2017 Precision Medicine track provided 30 topics created by experienced precision oncologists at the University of Texas MD Anderson Cancer Center and the Oregon Health & Science University (OHSU) Knight Cancer Institute. Due to the difficulty in obtaining actual patient data, the topics were synthetically created, though often inspired by actual patients, with modification.¹

The topics contain four key elements in a semi-structured format to reduce the need to perform natural language processing to identify the key elements. The four key elements are: (1) disease (e.g., type of cancer), (2) genetic variants (primarily the genetic variants in the tumors themselves as opposed to the patient’s DNA), (3) demographic information (e.g., age, sex), and (4) other factors (which could impact certain treatment options). Four topics from the track are shown in Table 1. The first two topics are additionally shown in their corresponding XML format (i.e., what was provided to the participants) in Table 2.

Table 1.

Example topics from the 2017 track.

Disease: Liposarcoma

Variant: CDK4 Amplification

Demographic: 38-year-old male

Other: GERD

Disease: Colon Cancer

Variant: KRAS (G13D), BRAF (V600E)

Demographic: 52-year-old male

Other: Type II Diabetes, Hypertension

Disease: Cervical Cancer

Variant: STK11

Demographic: 26-year-old female

Other: None

Disease: Cholangiocarcinoma

Variant: IDH1 (R132H)

Demographic: 64-year-old male

Other: Neuropathy

Open in a new tab

Table 2.

XML format for the first two topics from Table 1.

graphic file with name nihms949699f4.jpg

Open in a new tab

4 Tasks

In the Clinical Decision Support track, three types of topics were utilized: diagnosis, treatment, and test. For the Precision Medicine track, only treatment topics were used. However, different types of data may be of interest, namely literature article and clinical trials. In more detail, the two types of results are:

Literature Articles. Because precision medicine is a fast-moving field, keeping up-to-date with the latest literature can be challenging due to both the volume and velocity of scientific advances. Therefore, when treating patients, it would be helpful to present the most relevant scientific articles for an individual patient. The primary literature corpus is therefore a snapshot of MEDLINE abstracts (i.e., what is searchable through the PubMed interface). Relevant literature articles can guide precision oncologists to the best-known treatment options for the patient’s condition. Specifically, this corpus is composed of approximately 26,759,399 MEDLINE abstracts and is supplemented with two additional sets of abstracts: (i) 37,007 abstracts from recent proceedings of the American Society of Clinical Oncology (ASCO), and (ii) 33,018 abstracts from recent proceedings of the American Association for Cancer Research (AACR). These additional datasets were added to increase the set of potentially relevant treatment information. Notably, the latest research is often presented at conferences such as ASCO and AACR prior to submission to journals (thus these proceedings may represent a more up-to-date snapshot of scientific knowledge than MEDLINE).
Clinical Trials. In many oncology patients, no approved treatment is available (or, commonly, none of the available treatments have proven effective). The common recourse in this case is to determine if any potential treatments are undergoing evaluation in a clinical trial. Therefore, in such situations, it would be helpful to automatically identify the most relevant clinical trials for an individual patient. Precision oncology trials typically use a certain treatment (e.g., a form of chemotherapy or radiation) for a certain disease with a specific genetic variant (or set of variants). Such trials can have complex inclusion and/or exclusion criteria that are challenging to match with automated systems (Weng et al., 2011). The corpus is derived from ClinicalTrials.gov, a repository of past, present, and future clinical trials in the U.S. and abroad. A total of 241,006 clinical trial descriptions compose the corpus provided to participants. Note that for the purposes of this track, the state of the trial (e.g., recruiting, active, completed) and geographic location constraints are not considered.

5 Evaluation

The evaluation followed standard TREC evaluation procedures for ad hoc retrieval tasks. Participants submitted (in trec_eval format) a maximum of five automatic or manual runs per task, each consisting of a ranked list of up to 1,000 literature article IDs and 1,000 ClinicalTrials.gov Identifiers per topic. That is, up to 10 total runs: a maximum of 5 literature runs and 5 clinical trial runs per topic.

The highest ranked articles and trials for each topic were pooled and judged by physician graduate students at OHSU and postdoctoral fellows at the National Library of Medicine (NLM), just as in the Medical Records and Clinical Decision Support tracks.

In the previous years of the TREC Clinical Decision Support Track, relevance assessors judged results on a simple scale: “definitely relevant”, “partially relevant”, and “not relevant”. Due to the particular challenges involved in precision medicine, however, this is not necessarily appropriate. Not only is precision medicine a highly specialized field (and thus difficult to get true experts to act as assessors), but the notion of relevance is far more flexible and case-specific. As such, the assessment process was two-tiered: first a manual assessment was made by the human assessors based on several categories for each result (referred to here as Result Assessment), then a relevance score was assigned to the result based on its categorization (referred to here as Relevance Assessment).

5.1 Result Assessment

Result assessment can be viewed as a set of multi-class annotations. Judging an individual result, whether an article or trial, proceeds in a cascaded manner with two steps: an initial pass ensures the article/trial is broadly relevant to precision medicine, after which the assessor categorizes the article/trial according to the four fields above.

See Figure 1 for a flow chart style overview of this process. The first step is designed to save assessor time by filtering out unrelated articles/trials, since the second step can be more time-consuming (possibly requiring a more detailed reading of the article/trial). The assessors were free to quickly skim the article/trial in order to make the initial decision. Then, if the article/trial is relevant to precision medicine (by the standard outlined below), a more detailed reading may be necessary in order to accurately assess all fields.

Step 1 is to determine whether the article/trial is related to precision medicine. There are three options:

Human PM: The article/trial (1) relates to humans, (2) involves some form of cancer, (3) focuses on treatment, prevention, or prognosis of cancer, and (4) relates in some way to at least one of the genes in the topic.
Animal PM: Identical to Human PM requirements (2)–(4), except for animal research.
Not PM: Everything else. This includes “basic science” that focuses on understanding underlying genomic principles (e.g., pathways), but provides no evidence for treatment.

Step 2 is to determine the appropriate categorization for each of the four fields:

Disease:
- Exact: The form of cancer in the article/trial is identical to the one in the topic.
- More General: The form of cancer in the article/trial is more general than the one in the topic (e.g., blood cancer vs. leukemia).
- More Specific: The form of cancer in the article/trial is more specific than the one in the topic (e.g., squamous cell lung carcinoma vs. lung cancer).
- Not Disease: The article/trial is not about a disease, or is about a different disease (or type of cancer) than the one in the topic.
Gene [for each particular gene in the topic]
- Exact: The article/trial focuses on the exact gene and variant as the one in the topic. If the topic does not contain a specific variant, then this holds as long as the gene is included. By “focus” this means the gene/variant needs to be part of the scientific experiment of the article/trial, as opposed to discussing related work.
- Missing Gene: The article/trial does not focus the particular gene in the topic. If the gene is referenced but not part of the study, then it is considered missing.
- Missing Variant: The article/trial focuses on the particular gene in the topic, but not the particular variant in the topic. If no variant is provided in the topic, this category should not be assigned.
- Different Variant: The article/trial focuses on the particular gene in the topic, but on a different variant than the one in the topic.
Demographic
- Matches: The article/trial demographic population matches the one in the topic.
- Excludes: The article/trial demographic population specifically excludes the one in the topic.
- Not Discussed: The article/trial does not discuss a particular demographic population.
Other
- Matches: The article/trial population matches the one in the topic. If the other field is “None” this category should also be assigned.
- Excludes: The article/trial population specifically excludes the one in the topic.
- Not Discussed: The article/trial does not discuss a population relating to the provided factors.

5.2 Relevance Assessment

Relevance assessment is defined here as the process of mapping the multi-class result assessments described above onto a single numeric relevance scale. This allows for the computation of evaluation metrics (e.g., P@10, infNDCG) as well as the tuning of IR systems to improve their search ranking. As already demonstrated by the need for result assessment above, for the Precision Medicine track the notion of relevance assessment becomes more complex than previous tracks.

One of the factors that makes precision medicine a difficult domain for IR is that different patient cases require different types of flexibility on the above categories. For some patients, the exact type of cancer is not relevant. Other times, the patient’s demographics or other factors might weigh more heavily. Most notably, the very concept of precision medicine acknowledges the uniqueness of the patient, and so it is to be expected that no perfect match is found. Not only do the topics provided to the participants not contain the necessary information to decide what factors are more/less relevant (e.g., the patient’s previous treatments), in many ways it isn’t realistic to assign the IR system this responsibility. Precision medicine requires a significant amount of oversight by clinicians, including the ability to consider multiple treatment options. So it might ultimately make the most sense to allow the relevance assessment to be, at least in part, designed by the clinician to allow the IR system to adjust its rankings to suit. Given the constraints of an IR shared task, however, it is necessary to define a relevance assessment process. As such, a fairly broad notion of relevance based on the above categories was used:

Definitely Relevant: The result should: be either Human PM or Animal PM; have a Disease assignment of Exact or More Specific; have at least one Gene is Exact; have both Demographic and Other assignments are either Exact or Not Discussed.
Partially Relevant: Largely the same as Definitely Relevant, but with the exception that Disease can also be More General and Gene can also be Missing Variant or Different Variant.
Not Relevant: Neither of the above.

The primary evaluation metrics for the literature articles are precision at rank 10 (P@10), inferred normalized discounted cumulative gain (infNDCG), and R-precision (R-prec). For infNDCG, Definitely Relevant has a score of 2, Partially Relevant is 1, and Not Relevant is 0. The primary evaluation metrics for clincial trials is P@5, P@10, and P@15.

6 Results

In total, there were 22,642 judgments for the literature articles and 13,441 judgments for the clinical trials. Table 3 shows basic statistics of the results and relevance assessments. Table 4 shows the number of Definitely Relevant, Partially Relevant, and Not Relevant judgments for each topic. Since each result was judged only once, no inter-rater agreement is available for the judgments. However, the PM assessment (Human, Animal, or Not PM) is independent of the topic, and thus some agreement calculation can be made when the same article/trial is judged for different topics. A Kappa agreement would be difficult to calculate and of limited utility due to the number of assessors (20 assessors, who judged between 1 and 6 topics) and inconsisent rates of duplicate judging (most duplicate judging involved just 2 topics, but two clinical trials were judged in 18 topics). Basic agreement numbers can be calculated, which work out to 84.5% agreement for literature articles and 85.8% agreement for clinical trials. These are by no means desirable agreement numbers (the baseline is effectively 50%), which underscores both the difficulty of assessment as well as the vague description of ‘precision medicine’. More analysis is certainly required as to why such disagreements arise and how to improve similar types of judgments on future tasks.

Table 3.

Descriptive statistics (per-topic) of manual judgments (both results assessment and relevance assessment) for both literature articles and clinical trials.

		Literature Articles					Clinical Trials
Type	Class	Total	Mean	Median	Min	Max	Total	Mean	Median	Min	Max
PM	Human PM	8,738	291	277	65	627	3,959	132	119	18	428
	Animal PM	536	18	17	2	71	2	0	0	0	1
	Not PM	13,368	446	435	92	881	9,480	316	314	80	565

Disease	Exact	4,149	138	120	9	506	1,093	36	26	0	139
	More Specific	1,273	42	20	0	358	723	24	6	0	249
	More General	938	31	18	0	139	679	23	17	0	92
	Not Disease	2,914	97	85	0	275	1,466	49	41	0	179

1st Gene	Exact	4,421	147	154	10	331	1,486	50	37	0	230
	Missing Variant	1,419	47	3	0	464	452	15	5	0	108
	Different Variant	560	19	10	0	110	243	8	2	0	91
	Missing Gene	2,874	96	60	0	378	1,780	59	35	0	391

2nd Gene	Exact	540	18	0	0	287	119	4	0	0	55
	Missing Variant	230	8	0	0	218	127	4	0	0	121
	Different Variant	91	3	0	0	83	17	1	0	0	14
	Missing Gene	964	32	0	0	264	579	19	0	0	170

3rd Gene	Exact	104	3	0	0	104	47	2	0	0	47
	Missing Variant	6	0	0	0	6	0	0	0	0	0
	Different Variant	0	0	0	0	0	2	0	0	0	2
	Missing Gene	136	5	0	0	136	124	4	0	0	124

Demographics	Matches	658	22	10	0	91	3,221	107	105	0	402
	Not Discussed	7,736	258	226	37	578	376	13	1	0	197
	Excludes	880	29	17	0	141	364	12	3	0	57

Other	Matches	789	26	2	0	220	1,114	37	3	0	208
	Not Discussed	8,320	277	283	0	646	2,736	91	81	0	425
	Excludes	165	6	1	0	45	111	4	0	0	63

Relevance	Definitely Relevant	2,022	67	44	1	221	436	15	5	0	98
	Partially Relevant	1,853	62	27	1	476	735	25	15	0	120
	Not Relevant	18,767	626	624	259	1,209	12,270	409	418	165	606

Open in a new tab

Note: only 6 topics had a 2nd Gene and only 1 had a 3rd Gene, but means are still provided across 30 topics.

Table 4.

Counts of Definitely Relevant (DR), Partially Relevant (PR), and Not Relevant (NR) results for each topic.

	literature articles			clinical trials
Topic	DR	PR	NR	DR	PR	NR
1	48	14	377	5	12	313
2	156	205	330	17	120	323
3	45	6	535	5	19	221
4	221	46	625	31	26	519
5	69	21	677	35	1	375

6	24	104	486	0	27	345
7	159	187	623	98	107	333
8	119	1	563	54	7	418
9	14	476	259	2	60	165
10	3	94	614	0	0	422

11	39	3	656	4	15	489
12	165	55	448	25	14	382
13	1	24	935	2	32	369
14	1	30	568	0	7	310
15	4	6	738	3	1	506

16	116	26	622	2	3	508
17	111	5	620	10	23	472
18	43	151	639	3	35	543
19	13	22	802	7	16	506
20	21	28	398	0	5	263

21	84	120	409	11	56	289
22	131	11	818	48	54	508
23	178	17	609	21	9	519
24	22	42	761	4	14	525
25	15	39	741	1	39	409

26	14	6	1029	1	4	606
27	72	4	771	11	17	433
28	9	46	735	0	2	332
29	19	23	665	2	6	418
30	106	41	714	34	4	449

Open in a new tab

There were a total of 32 participants in the track. For the literature articles, 29 participants submitted 125 runs (122 automatic, 3 manual). For the clinical trials, 31 participants submitted 133 runs (131 automatic, 2 manual). See Table 5 for a list of the participants and numbers of runs. Table 6 shows the top 10 runs (top run per participant) for each metric on each corpus. Figures 2 and 3 show box-and-whisker plots for the top 10 runs. Finally, Tables 7 and 8 show the per-topic aggregate results.

Table 5.

Participating teams and submitted runs. Numbers in parentheses indicate manual runs.

		# Runs
Team ID	Affiliation	Articles	Trials
BiTeM	BiTeM Group	5	5
cbnu	Chonbuk National University	3	3
CSIROmed	Commonwealth Science and Industry Research Org.	5	5
DA_IICT	Dhirubhai Ambani Inst. of Info. and Comm. Tech.	-	4

DUTIRL	Information Retrieval Laboratory of Dalian Univ. of Tech.	1	1
ECNUica	East China Normal University	5	5
ETH	ETH Zurich	5	5
FDUDMIIP	School of Computer Science, Fudan University	5	4

GravityWave	GravityWave Technologies	1 (1)	1 (1)
HokieGo	Virginia Tech	2	2
ielab-CSIRO-QUT	CSIRO and Queensland University of Technology	5	-
imi_mug	Medical University of Graz	5	5

iris	University of Pittsburgh	5	5
kaist-kse	KAIST Knowledge Service Engineering	3	3
KISTI	Korea Institute of Science and Technology Information	5	5
MayoNLPTeam	Mayo Clinic	5	5

NaCTeM	University of Manchester	5 (1)	5 (1)
NOVASearch	Universidade NOVA Lisboa	3	5
POZNAN_SEMMED	Poznan University of Technology	3	5
prna-mit-suny	Philips Research North America / MIT / SUNY	5	5

SDSFU	School of Data Science, Fudan University	5	5
teckro	teckro	-	5
TREC_UB	University at Buffalo	-	2
UCAS	University of Chinese Academy of Sciences	5	5

udel	University of Delaware	5	5
udel_fang	Infolab at University of Delaware	5	5
UD_GU_BioTM	University of Delaware / Georgetown University	5	5
UKNLP	University of Kentucky	5 (1)	4

UMich_MedIER	University of Michigan	4	4
UNTIIA	University of North Texas	5	5
UTDHLTRI	University of Texas at Dallas	5	5
UWMSOIS	University of Wisconsin-Milwaukee	5	5

Total		125 (3)	133 (2)

Open in a new tab

Table 6.

Top overall systems (best run per participant).

Literature Articles			Clinical Trials
infNDCG			P @ 5
Team	Run	Score	Team	Run	Score

UTDHLTRI	UTDHLTFF	0.4647	UD_GU_BioTM	UD_GU_CT_3	0.5448
BiTeM	SIBTMlit4	0.4175	prna-mit-suny	pms_run5_tri	0.4552
imi_mug	mugpubboost	0.4158	udel	udelT2GeMeSH	0.4552
UD_GU_BioTM	UD_GU_SA_5	0.4135	UTDHLTRI	UTDHLTAFT	0.4483
prna-mit-suny	pms_run5_abs	0.4070	NaCTeM	Broadc	0.4483
UKNLP	UKY_CJT	0.3897	NOVASearch	NOVAtr2	0.4414
udel_fang	UDInfoPMSA2	0.3897	UCAS	UCASSEM2	0.4345
NaCTeM	Broad	0.3800	teckro	teckro1	0.4276
iris	mRun3MRF	0.3758	CSIROmed	cCSIROmedAll	0.4183
FDUDMIIP	medline3	0.3555	KISTI	KISTI02CT	0.4000

R-prec			P @ 10
Team	Run	Score	Team	Run	Score

UTDHLTRI	UTDHLTFF	0.2993	UD_GU_BioTM	UD_GU_CT_3	0.4448
imi_mug	mugpubbase	0.2772	UTDHLTRI	UTDHLTFFT	0.4172
BiTeM	SIBTMlit4	0.2687	teckro	teckro1	0.4000
prna-mit-suny	pms_run5_abs	0.2622	NOVASearch	NOVAtr2	0.3966
UKNLP	UKY_AGG	0.2518	udel	udelT2Comb	0.3793
udel_fang	UDInfoPMSA2	0.2503	UCAS	UCASSEM2	0.3724
UD_GU_BioTM	UD_GU_SA_5	0.2477	NaCTeM	Broadc	0.3724
iris	mRun3MRF	0.2374	KISTI	KISTI02CT	0.3690
NaCTeM	Broad	0.2287	POZNAN_SEMMED	LGDraw	0.3690
UCAS	UCASBASEa	0.2282	BiTeM	SIBTct3	0.3586

P @ 10			P @ 15
Team	Run	Score	Team	Run	Score

UD_GU_BioTM	UD_GU_SA_5	0.6400	UD_GU_BioTM	UD_GU_CT_3	0.3885
UTDHLTRI	UTDHLTAF	0.6300	UTDHLTRI	UTDHLTAFT	0.3816
imi_mug	mugpubboost	0.6267	teckro	teckro1	0.3632
BiTeM	SIBTMlit4	0.5500	UCAS	UCASSEM2	0.3471
prna-mit-suny	pms_run5_abs	0.5300	NOVASearch	NOVAtr2	0.3448
UNTIIA	UNTIIALQ	0.5233	POZNAN_SEMMED	LGDraw	0.3356
iris	mRun3MRF	0.5133	KISTI	KISTI02CT	0.3356
udel_fang	UDInfoPMSA2	0.5067	udel	udelT2Comb	0.3333
UKNLP	UKY_AGG	0.4933	UWMSOIS	UWMSOIS0	0.3172
NaCTeM	Broad	0.4667	prna-mit-suny	pms run5 tri	0.3172

Open in a new tab

Top-performing runs (showing only best run per participant) on literature articles.

Top-performing runs (showing only best run per participant) on clinical trials.

Table 7.

Per-topic statistics (over 125 runs) for 29 topics on literature articles.

	infNDCG			P @ 10			R-prec
Topic	Best	Median	Worst	Best	Median	Worst	Best	Median	Worst
1	0.6068	0.4604	0.0000	1.0000	0.6000	0.0000	0.5000	0.3387	0.0000
2	0.8794	0.5978	0.0000	1.0000	0.9000	0.0000	0.4404	0.2742	0.0000
3	0.5130	0.2789	0.0000	0.9000	0.3000	0.0000	0.4118	0.1961	0.0000
4	0.8268	0.4085	0.0000	1.0000	0.7000	0.0000	0.4157	0.2097	0.0000
5	0.3011	0.1607	0.0000	0.5000	0.2000	0.0000	0.2111	0.1222	0.0000
6	0.5652	0.4208	0.0000	0.9000	0.6000	0.0000	0.4219	0.3047	0.0000
7	0.7042	0.3364	0.0000	1.0000	0.5000	0.0000	0.3786	0.1705	0.0000
8	0.5613	0.2662	0.0000	0.9000	0.3000	0.0000	0.3417	0.2333	0.0000
9	0.8506	0.6355	0.0360	1.0000	0.8000	0.1000	0.4612	0.3490	0.0020
10	0.3937	0.1660	0.0000	0.6000	0.2000	0.0000	0.2680	0.1340	0.0000
11	0.7937	0.2129	0.0000	0.9000	0.2000	0.0000	0.5952	0.1429	0.0000
12	0.7985	0.5116	0.0000	1.0000	0.8000	0.0000	0.4591	0.2409	0.0000
13	0.3238	0.0588	0.0000	0.5000	0.1000	0.0000	0.2400	0.0400	0.0000
14	0.6704	0.0300	0.0000	0.8000	0.0000	0.0000	0.5484	0.0000	0.0000
15	0.5073	0.1314	0.0000	0.3000	0.1000	0.0000	0.3000	0.1000	0.0000
16	0.6899	0.4070	0.0000	1.0000	0.6000	0.0000	0.4648	0.2606	0.0000
17	0.5334	0.2586	0.0000	1.0000	0.3000	0.0000	0.3707	0.2328	0.0000
18	0.5024	0.3072	0.0000	1.0000	0.5000	0.0000	0.2990	0.1546	0.0000
19	0.4655	0.1916	0.0000	0.9000	0.2000	0.0000	0.3429	0.0857	0.0000
20	0.4810	0.1715	0.0000	0.7000	0.2000	0.0000	0.3469	0.1224	0.0000
21	0.5998	0.3809	0.0000	0.9000	0.5000	0.0000	0.3922	0.2990	0.0049
22	0.6420	0.1840	0.0000	0.9000	0.4000	0.0000	0.4155	0.1127	0.0000
23	0.9132	0.5070	0.0000	1.0000	0.6000	0.0000	0.5744	0.3128	0.0000
24	0.5551	0.2497	0.0000	1.0000	0.4000	0.0000	0.4375	0.1719	0.0000
25	0.5310	0.2583	0.0000	0.7000	0.3000	0.0000	0.3704	0.1667	0.0000
26	0.6342	0.0900	0.0000	0.8000	0.1000	0.0000	0.5000	0.0500	0.0000
27	0.3064	0.0966	0.0000	0.8000	0.1000	0.0000	0.2105	0.0921	0.0000
28	0.4787	0.1223	0.0000	0.9000	0.2000	0.0000	0.4000	0.0727	0.0000
29	0.4628	0.2118	0.0000	1.0000	0.3000	0.0000	0.4048	0.1429	0.0000
30	0.4755	0.1857	0.0000	0.9000	0.2000	0.0000	0.3265	0.1497	0.0000

Open in a new tab

Table 8.

Per-topic statistics (over 133 runs) for 28 topics on clinical trials. (Topic 10 had no relevant trials.)

	P @ 5			P @ 10			Median
Topic	Best	Median	Worst	Best	Median	Worst	Best	Median	Worst
1	1.0000	0.8000	0.0000	0.9000	0.4000	0.0000	0.6667	0.2667	0.0000
2	1.0000	0.6000	0.0000	0.9000	0.6000	0.0000	0.9333	0.6000	0.0000
3	1.0000	0.6000	0.0000	1.0000	0.6000	0.0000	0.8667	0.4667	0.0000
4	1.0000	0.4000	0.0000	0.9000	0.4000	0.0000	0.8000	0.3333	0.0000
5	0.8000	0.2000	0.0000	0.5000	0.2000	0.0000	0.5333	0.2000	0.0000
6	0.8000	0.6000	0.0000	0.8000	0.5000	0.0000	0.6667	0.4000	0.0000
7	1.0000	0.6000	0.0000	1.0000	0.6000	0.0000	1.0000	0.6667	0.0000
8	0.8000	0.2000	0.0000	0.7000	0.2000	0.0000	0.6667	0.2667	0.0000
9	1.0000	0.6000	0.0000	1.0000	0.7000	0.0000	1.0000	0.6667	0.0000
11	0.8000	0.4000	0.0000	0.7000	0.2000	0.0000	0.6667	0.2000	0.0000
12	0.8000	0.2000	0.0000	0.7000	0.2000	0.0000	0.6000	0.2000	0.0000
13	0.6000	0.0000	0.0000	0.6000	0.0000	0.0000	0.6667	0.0000	0.0000
14	1.0000	0.4000	0.0000	0.6000	0.3000	0.0000	0.4000	0.2000	0.0000
15	0.4000	0.0000	0.0000	0.2000	0.0000	0.0000	0.1333	0.0000	0.0000
16	0.4000	0.2000	0.0000	0.3000	0.1000	0.0000	0.2667	0.0667	0.0000
17	0.6000	0.2000	0.0000	0.6000	0.2000	0.0000	0.5333	0.2000	0.0000
18	0.6000	0.0000	0.0000	0.4000	0.1000	0.0000	0.3333	0.0667	0.0000
19	0.8000	0.2000	0.0000	0.5000	0.1000	0.0000	0.4000	0.1333	0.0000
20	0.4000	0.0000	0.0000	0.3000	0.0000	0.0000	0.2667	0.0667	0.0000
21	1.0000	0.2000	0.0000	1.0000	0.3000	0.0000	0.9333	0.2667	0.0000
22	1.0000	0.4000	0.0000	1.0000	0.3000	0.0000	0.9333	0.2000	0.0000
23	0.8000	0.2000	0.0000	0.7000	0.1000	0.0000	0.5333	0.1333	0.0000
24	1.0000	0.6000	0.0000	1.0000	0.5000	0.0000	0.8667	0.3333	0.0000
25	1.0000	0.4000	0.0000	1.0000	0.3000	0.0000	0.8000	0.2667	0.0000
26	0.2000	0.0000	0.0000	0.3000	0.0000	0.0000	0.2000	0.0000	0.0000
27	0.8000	0.0000	0.0000	0.6000	0.1000	0.0000	0.4667	0.1333	0.0000
28	0.0000	0.0000	0.0000	0.1000	0.0000	0.0000	0.0667	0.0000	0.0000
29	0.8000	0.2000	0.0000	0.6000	0.1000	0.0000	0.4000	0.0667	0.0000
30	1.0000	0.2000	0.0000	0.7000	0.2000	0.0000	0.5333	0.1333	0.0000

Open in a new tab

7 Conclusion

This was the first year of the Precision Medicine track. The goal of the track is to inform the creation of information retrieval systems to support clinicians working in precision medicine (specifically oncologists in this track) in making better treatment decisions for individual patients. Participants were provided with synthetic patient data consisting of a type of cancer, one or more genetic variants, patient demographics, and other potentially relevant patient factors. Given this, participants were challenged with retrieving

Acknowledgments

The organizers would like to thank Kate Fultz Hollis for managing the assessment process. KR is supported by the National Institutes of Health (NIH) grant 2R00LM012104-02. DDF is supported by the Intramural Research Program of the U.S. National Library of Medicine, NIH. Finally, the organizers are grateful to the National Institute of Standards and Technology (NIST) for funding the assessment process.

Footnotes

¹

Note that while clinical data is frequently de-identified for research purposes without the need for patient permission, genomic data is fundamentally difficult to de-identify. So to be safe, synthetic data was used.

Contributor Information

Kirk Roberts, School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX.

Dina Demner-Fushman, Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD.

Ellen M. Voorhees, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD

William R. Hersh, Department of Medical Informatics & Clinical Epidemiology, Oregon Health and Science University, Portland, OR

Steven Bedrick, Department of Medical Informatics & Clinical Epidemiology, Oregon Health and Science University, Portland, OR.

Alexander J. Lazar, Departments of Pathology & Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX

Shubham Pant, Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, Houston, TX.

References

Collins FS, Varmus H. A New Initiative on Precision Medicine. New England Journal of Medicine. 2015;372:793–795. doi: 10.1056/NEJMp1500523. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frey LJ, Bernstam EV, Denny JC. Precision medicine informatics. Journal of the American Medical Informatics Association. 2016;23:668–670. doi: 10.1093/jamia/ocw053. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hersh W, Voorhees E. TREC genomics special issue overview. Information Retrieval. 2009;12:1–15. [Google Scholar]
Seco de Herrera AG, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H. Overview of the ImageCLEF 2013 medical tasks. CLEF 2013 Working Notes 2013 [Google Scholar]
Voorhees EM, Hersh W. Overview of the TREC 2012 Medical Records Track; Proceedings of the Twenty-First Text REtrieval Conference.2012. [Google Scholar]
Weng C, Wu X, Luo Z, Boland MR, Theodoratos D, Johnson SB. EliXR: an approach to eligibility criteria extraction and representation. Journal of the American Medical Informatics Association. 2011;18(Suppl 1):i116–i124. doi: 10.1136/amiajnl-2011-000321. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R1] Collins FS, Varmus H. A New Initiative on Precision Medicine. New England Journal of Medicine. 2015;372:793–795. doi: 10.1056/NEJMp1500523. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Frey LJ, Bernstam EV, Denny JC. Precision medicine informatics. Journal of the American Medical Informatics Association. 2016;23:668–670. doi: 10.1093/jamia/ocw053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Hersh W, Voorhees E. TREC genomics special issue overview. Information Retrieval. 2009;12:1–15. [Google Scholar]

[R4] Seco de Herrera AG, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H. Overview of the ImageCLEF 2013 medical tasks. CLEF 2013 Working Notes 2013 [Google Scholar]

[R5] Voorhees EM, Hersh W. Overview of the TREC 2012 Medical Records Track; Proceedings of the Twenty-First Text REtrieval Conference.2012. [Google Scholar]

[R6] Weng C, Wu X, Luo Z, Boland MR, Theodoratos D, Johnson SB. EliXR: an approach to eligibility criteria extraction and representation. Journal of the American Medical Informatics Association. 2011;18(Suppl 1):i116–i124. doi: 10.1136/amiajnl-2011-000321. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Overview of the TREC 2017 Precision Medicine Track

Kirk Roberts

Dina Demner-Fushman

Ellen M Voorhees

William R Hersh

Steven Bedrick

Alexander J Lazar

Shubham Pant

1 Introduction

2 Background