Chia, a large annotated corpus of clinical trial eligibility criteria

Fabrício Kury; Alex Butler; Chi Yuan; Li-heng Fu; Yingcheng Sun; Hao Liu; Ida Sim; Simona Carini; Chunhua Weng

doi:10.1038/s41597-020-00620-0

. 2020 Aug 27;7:281. doi: 10.1038/s41597-020-00620-0

Chia, a large annotated corpus of clinical trial eligibility criteria

Fabrício Kury ^1,^#, Alex Butler ^1,^#, Chi Yuan ^1,^#, Li-heng Fu ¹, Yingcheng Sun ¹, Hao Liu ^1,², Ida Sim ³, Simona Carini ³, Chunhua Weng ^1,^✉

PMCID: PMC7452886 PMID: 32855408

Abstract

We present Chia, a novel, large annotated corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered in ClinicalTrials.gov. This dataset includes 12,409 annotated eligibility criteria, represented by 41,487 distinctive entities of 15 entity types and 25,017 relationships of 12 relationship types. Each criterion is represented as a directed acyclic graph, which can be easily transformed into Boolean logic to form a database query. Chia can serve as a shared benchmark to develop and test future machine learning, rule-based, or hybrid methods for information extraction from free-text clinical trial eligibility criteria.

Subject terms: Clinical trials

Measurement(s)	Clinical Trial Eligibility Criteria • Analytical Procedure Accuracy
Technology Type(s)	digital curation • computational modeling technique
Sample Characteristic - Organism	Homo sapiens

Open in a new tab

Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.12765602

Background & Summary

Clinical trial eligibility criteria specify rules for screening clinical trial participants and play a central role in clinical research in that they are interpreted, implemented, and adapted by multiple stakeholders at various phases in the clinical research life cycle¹. After being defined by investigators, eligibility criteria are used and interpreted by clinical research coordinators for screening and recruitment. Then, they are used by query analysts and research volunteers for patient screening. Later, they are summarized in meta-analyses for developing clinical practice guidelines and, eventually, interpreted by physicians to screen patients for evidence-based care. Hence, eligibility criteria affect recruitment, results dissemination, and evidence synthesis.

Despite their importance, recent studies highlight the often negative impact these criteria have on the generalizability of a given trial’s findings in the real world^2,3. When eligibility criteria lack population representativeness, the enrolled participants cannot unbiasedly represent those who will be treated according to the results from that study⁴. Given that eligibility criteria are written in free text, it is laborious to answer this representativeness question at scale⁵. A related challenge is to assess the comparability of trial populations, especially for multi-site studies: e.g.,, given two clinical trials investigating the same scientific question, can we tell if they are studying comparable cohorts? The manual labor required from domain experts for such appraisal is prohibitive. Another challenge is patient recruitment, or finding eligible patients for a clinical trial, which remains the leading cause of early trial termination^6,7. Unsuccessful recruitment wastes financial investment and research opportunities, on top of missed opportunities, inconvenience, or frustration of patients when the clinical trial is terminated early or cancelled.

Computable representations of eligibility criteria promise to overcome the above challenges and to improve study feasibility and recruitment success⁸. The Biomedical Informatics research community has produced various knowledge representations for clinical trial eligibility criteria⁹, though nearly all of them predate the current state-of-the-art in machine learning, and some even predate contemporary electronic health records⁹. Early efforts to create annotated datasets in eligibility criteria have used a variety of methods including ad-hoc annotation¹⁰, manual annotation of standardized biomedical concepts¹¹, as well as leveraging biomedical knowledge resources such as UMLS for automatic semantic pattern extraction¹². The annotations in these datasets do not capture sufficient information to form the logical statements of a database query, and few annotated datasets are publicly available. Ross et al. published a dataset with 1,000 eligibility criteria and analyzed their semantic complexity, but the data were not amenable for machine learning¹³. 79 eligibility criteria were annotated by Weng et al. with semantic tags and relations, but these are too few to serve as a sufficiently large training resource¹². The most robustly annotated and the only publicly available corpus to date was produced by Kang et al.¹⁴, who annotated eligibility criteria from 230 clinical trials, though all on Alzheimer’s Disease. Hence the corpus lacks generalizability to other diseases. These and other works have focused on bridging the gap between eligibility criteria and logical queries (Table 1), but the percentage of criteria that could be fully represented using these annotation models and used in database queries (here referred to as criteria coverage) is variable, ranging from 18% to 87%^5,14–18.

Table 1.

Annotated eligibility criteria with citations, methods of annotation, coverage, availability and size.

Citation	Annotation Method	Coverage	Availability	Criteria Count
Chondrogiannis et al., 2017⁵	Manual	87%	Online View Only	2,000
Tu et al., 2011 (ERGO)¹⁵	Manual	62%	Methods Only	1,000
Zhang et al., 2018¹⁶	Manual	85%	None	1,043
Milian et al., 2015¹⁷	Automated	18%	Methods Only	1,773
Lonsdale et al., 2006¹⁸	Automated	34%	Methods Only	1,545
Kang et al., 2017 (EliIE)¹⁴	Automated	71%	Available Upon Request	3,619
*Chia Annotation Model*	*Manual*	*86.8%*	*Publicly available*	*12,409*

Open in a new tab

A shared, sufficiently large dataset is much needed to boost machine learning natural language processing of eligibility criteria text. In this study we present Chia, a large annotated corpus of clinical research eligibility criteria from 1,000 diverse clinical studies. The annotations specify (a) the boundaries and semantic categories of named entities and (b) the Boolean operators needed to form the database query logic. As the first public large annotated corpus for clinical trial eligibility criteria, Chia can serve as a shared benchmark to develop and test future machine learning, rule-based, or hybrid methods for information extraction from free-text clinical trial eligibility criteria.

Methods

Chia’s Annotation Model (CAM)

The annotation model was developed by two annotators (FK and LHF), both with medical training, and one machine learning researcher (CY), following an iterative process. The entity categories are aligned with the domain names defined by the Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model (CDM), which is widely used in the medical research community for health data standardization¹⁹. Our annotation model is described in full in the Appendix. A brief description is provided below with a focus on its three main components: Entities, Relationships, and the resulting Annotation Graph.

Entities

Entities are concepts (e.g., hypertension, platelet count) and fall into one of the three types: Domain, Field, and Construct. Domain entities are the essential components of eligibility criteria, while Field entities and Construct entities are optional, depending on the semantic categories of Domain entities. Domain entities represent eight domains, i.e., observation, condition, person, device, drug, visit, procedure, and measurement. A Field entity represents a property of an applicable domain entity. For example, rule “Hemoglobin <8 g/dL” includes a Domain entity “Hemoglobin” and a Field entity “ <8 g/dL.” A Construct entity defines modifiers such as negation and repetition. For example, the Negation entity inverts the Boolean logic of the Domain entity being modified: no history of heart disease.

Relationships

Relationships express general Boolean algebra operators (AND or OR) between pairs of entities, as well as type-specific relationships (e.g.,, has_value, has_temporal etc.). This dataset represents 12 relationships: AND, OR, SUBSUMES, HAS_NEGATION (target argument is negation), HAS_MULTIPLIER (target argument is multiplier), HAS_QUALIFIER (target argument is qualifier), HAS_VALUE (target argument is value), HAS_TEMPORAL (target argument is temporal), HAS_INDEX (target argument is reference_point), HAS_MOOD (target argument is mood), HAS_CONTEXT (target argument is observation and not included in above relationships), and HAS_SCOPE (target argument is scope). The Boolean operator NOT, as mentioned in the previous section, is an entity instead of a relationship. When considering general Boolean relationships, both entities are independent, and the truth value of each term (i.e., is an entity “true”?) are resolved before applying Boolean logic to the relationship. For example, in “Patients with diabetes or hypertension,” “diabetes” and “hypertension” are independent Domain entities linked by an OR relationship between them.

Annotation graph

The entities and relationships in each trial’s eligibility criteria can be represented as two Annotation Graphs (AGs) (one for inclusion criteria and one for exclusion criteria) to computationally represent the query logic, that is, how all entities and relationships should be used to construct an executable query. The entities form the nodes and the relationships form the edges in the graph, and each entity evaluates to true if matching data can be found to satisfy it. Parsing of a graph runs from root to leaf nodes, forming a single Boolean expression. If that expression evaluates to true when fed the data from one patient, it means such patient is eligible (if inclusion criteria) or is ineligible (if exclusion criteria). An example annotation graph can be seen in Fig. 1.

Fig. 1 — Sample eligibility criterion with associated visual annotation (a), annotation graph (b), and pseudo-SQL query for relational patient database (c).

Scope objects

As eligibility criteria are used to identify patients eligible for a given trial, complex logic is often employed to ensure clinical judgments can be made with a high degree of accuracy. In some criteria, this complex logic can be considered ‘nested’ as a single entity is explained in greater detail and is provided with additional parameters whereby the entity could evaluate to true. An example is “patient has hypertension (systolic blood pressure >135 mmHg or diastolic blood pressure >85 mmHg).” Here, the central entity is hypertension, which can be evaluated using the specified systolic and diastolic pressure ranges. We can treat both blood pressure measurements connected by an OR relationship as a single logical statement by labeling it a Scope object. Thus, this criterion can be rewritten to “patient has hypertension, or [systolic blood pressure > 135 mmHg or diastolic blood pressure > 85 mmHg].” The hypertension entity is thus linked to this Scope object whereby satisfying either item results in the statement resolving to a single value of True.

The annotation processes

Sampling of trials and loading into the annotation tool

We searched ClinicalTrials.gov on August 2, 2018 for actively recruiting, interventional (clinical trial), phase 4 studies, and obtained 2,913 trials, from which a random sample of 1,000 clinical trials was drawn. We focused on current studies as opposed to historic ones, assuming reporting quality is generally better in more current trials²⁰, and prioritized phase 4 since they are more likely to be replicated via pure observational data analyses²¹. From each trial, a script downloaded and extracted every eligibility criterion (roughly defined as one line of free text), exported plain text files, and loaded them into the brat annotation tool (http://brat.nlplab.org/). That script is in R language and is available at https://github.com/WengLab-InformaticsResearch/CHIA.

Annotation of the eligibility criteria from the selected trials

The creation of Chia was performed by medical professionals (FK and LHF). Each annotator received a separate set of criteria loaded in brat²² and hand-created the entities and relationships as expressed above. In case of doubt for some concept, the annotator searched terminology at http://athena.ohdsi.org, which provides searching of concepts in the OMOP CDM. For the first 200 trials, both annotators regularly discussed adaptations to the annotation model based on their experience and re-annotated criteria as needed according to the changes being made. Once a satisfactory model was attained based on the consensus of both annotators, modifications were suspended and the task of annotation proceeded until the completion of the 1,000 trials. The final Chia dataset contains the summed and collectively revised work of the two annotators.

Post-processing the annotations

Minor post-production was performed to transform the ANN files produced by brat into a single long table in CSV format containing the entire dataset. That table also contains a number of variables that can be programmatically inferred from the annotations, e.g.,, which entities are roots in their annotation graphs. Additional post-processing was performed to generate two distinct datasets: one titled With Scopes and the other Without Scopes differing only in their utilization of Scope entity within the annotation model. Greater discussion of the reasoning behind the two distinct datasets is included in the Appendix, and all code used to generate these two models is available at https://github.com/WengLab-InformaticsResearch/CHIA.

To identify the target diseases of the 1,000 annotated trials, additional dataset enrichment was accomplished by leveraging the Aggregate Analysis of ClinicaTrials.gov (AACT) database²³. This publicly available relational database contains all information about every study registered in ClinicalTrials.gov and is provided by the Clinical Trials Transformation Initiative (CTTI). The list of 1,000 unique NCT IDs included in our dataset was extracted and matched with their corresponding target conditions using the conditions table in the AACT database.

Parsing the annotation graph

A distinguishing feature of our dataset is its capacity to support the parsing of the entities and their relationships into a Boolean expression containing the logic of the database query that replicates the eligibility criteria of each clinical trial. A sample annotation alongside its associated annotation graph and pseudo-query is provided in Fig. 1.

Data Records

The free-text of selected eligibility criteria, brat configuration files, and the annotated data files are all available on figshare at 10.6084/m9.figshare.11855817²⁴. There are two folders of annotation files titled With Scope and Without Scope, describing the inclusion or exclusion of Scope entities (additional information in Appendix).

Free-text (.txt) Files

Extracted free-text eligibility criteria from the 1,000 selected trials. Each text file adheres to the following naming format: [NCT Number][Inclusion/Exclusion Status].txt. Each row contains a single eligibility criterion.

Annotation (.ann) Files

Non-post-processed annotations in brat’s native ANN output format. Each annotation file adheres to the following naming format: [NCT Number]_ [Inclusion/ExclusionStatus].ann. As per the ANN format, each line corresponds to a single entity or relationship, except for OR relationships as explained below. For entities, the fields are as follows: item ID (e.g., T1), entity type (e.g., Condition), string start index (e.g., 28), string end index (e.g., 55), text (e.g., ‘metastatic carcinoid tumors’). For relationships, the fields are as follows: item ID (e.g., R1), relationship type (e.g., has_value), root argument (e.g., T3), target argument (e.g., T2). For OR relationships, the second field lists the relationship type and subsequent columns list all items connected by this OR relationship. In order to visualize the annotations, one needs brat. Simply open, in brat, Chia’s entire unzipped folder. Instructions for downloading and installing brat are available at http://brat.nlplab.org/.

Configuration (.conf) Files

These are the brat configuration files used to produce annotations following the definitions of CAM. They are used to load the.ann files into brat for viewing or editing them.

Technical Validation

Inter-annotator agreement

To evaluate the inter-annotator agreement of CHIA, we randomly selected 50 trials out of the 1,000 trials, constituting 5% of the whole dataset with 604 inclusion criteria and 1,034 exclusion criteria. The same 50 trials were provided to the two annotators (FK and AB) to annotate independently using the Brat annotation tool. To facilitate our agreement evaluation, the obtained raw annotations were converted into two formats: the CONLL-2003 BIO format²⁵ and the token-level format with annotated label on each token. For example, in the CONLL-2003 BIO format, Diabetes mellitus is annotated as “diabetes B-Condition mellitus I-Condition”. In the token-level format, Diabetes mellitus is annotated as “diabetes Condition mellitus Condition”. An agreement was reached if both annotators annotated the whole span of Diabetes mellitus as Condition. These two formats enable us to evaluate both phrase-level agreement and token-level agreement rates. We calculated the Cohen’s kappa scores and Precision, Recall, and F1 as the measure of inter-annotator agreement. At the phrase-level, the Cohen’s kappa score is 0.8043 with Precision 81.39%, Recall 80.30% and F1 80.84. At the token-level, the Cohen’s kappa score is 0.8489 with Precision 81.67%, Recall 86.68% and F1 84.10.

Dataset exploration and validation

Descriptive statistics were generated to report counts of entities, relationships, and combinations of both. A coverage statistic was used to compare this dataset with previous efforts. In an effort to assess the accuracy and immediate utility of individual annotations, the raw unprocessed annotated entities were mapped to standardized medical concepts. To do this we utilized an open-source concept mapping tool called Usagi (https://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:usagi)²⁶ to map free-text strings to medical concepts in the OMOP CDM and to provide a ‘mapping accuracy score’ based on string similarity that is automatically generated by the Usagi tool. All Domain entities were converted to lowercase and then input into Usagi v1.2.7 with the following restrictions: (1) ‘Filter standard concepts’ was unselected to allow for mapping to Standard and Classification Concepts, (2) ‘Filter by Domain’ was selected and the selected options included only the labeled Domain (e.g., Condition).

Descriptive statistics

Chia contains 65,886 annotations for 12,409 inclusion and exclusion eligibility criteria from 1,000 trials targeting 1,130 unique diseases, with the most common one being Coronary Artery Disease (24 trials). 196 of the trials included healthy volunteers. 1,606 of the annotated eligibility criteria were determined as not being useful for patient-focused database queries for reasons highlighted in Table 2. Of those, 462 contained multiple entity annotations (e.g.,, “1. Signed informed consent” was annotated with parsing_error, non-query-able, and informed_consent) so the sum of annotations in Table 2 is greater than 1,606. 10,768 eligibility criteria had evaluable annotations, accounting for 86.8% of all the eligibility criteria (Table 1). When comparing it to the dataset prepared by Kang et al. in 2017¹⁴ (called EliIE), Chia is larger in size in terms of number of annotations, number of entity and relationship types, and criteria coverage. A top-level comparison of these datasets is shown in Table 3.

Table 2.

Total count and percentage of unevaluable criteria using unevaluable entity annotations.

Entity Type	Count (%; n = 1,606)	Example
Non-query-able	557 (34.7%)	In clinical judgement of study doctor, participant should not participate
Post-eligibility	425 (26.5%)	Unable to adhere to follow up schedule and treatment
Informed_consent	223 (13.8%)	Inability to give informed consent
Pregnancy_considerations	172 (10.7%)	Are not willing to use a reliable method of barrier contraception during the study
Parsing_Error	135 (8.4%)	3. Medications:
Non-representable	120 (7.4%)	Subjects who are incarcerated or wards of the state
Competing_trial	86 (5.4%)	Participation in other drug clinical trial within the last 4 weeks
Context_Error	61 (3.8%)	Hematologic laboratory values as outlined in the protocol
Subjective_judgement	43 (2.7%)	Viable tumor resection confirmed by two highly qualified surgical doctors
Not_a_criteria	33 (2.1%)	Screening tool: physical assessment (EKG), medical history
Undefined_semantics	21 (1.3%)	Presence of clinical contraindications for treatment with MTX
Intoxication_considerations	5 (0.3%)	Active alcohol or drug use or dependence which may interfere with adherence to study requirements

Open in a new tab

Table 3.

Comparison of EliIE and Chia Annotated Datasets.

Statistic	EliIE	Chia
Disease Domain	Alzheimer’s	Representative of all diseases
No. of Trials	230	1,000
No. of Criteria	3,619	12,409
No. of Annotations	15,596	65,886
No. of Entity Types	8	15
No. of Relationship Types	3	12
Criteria Coverage	71%	85.9%

Open in a new tab

Of the 41,699 entity annotations, 63.5% fall into the Domain category, 18.4% in the Field category, 17.5% in the Construct category and 0.4% were concepts unable to be annotated or deemed ‘unqueryable’ – additional information on these categories is included in the Appendix. Condition is the most common entity and OR the most common relationship. The most common relationship triplet is Measurement-has_value-Value; the second most common is Condition-has_qualifier-Qualifier. The most common relationship types are listed in Table 4 and the most common relationship triplets are listed in Table 5. OR relationships were not included in calculating the most common triplets as they do not always follow the root-relationship-target structure. 29.9% of OR relationships linked more than 2 entities and maximally linked 25 entities. All type-specific relationships contained the respective target entity type (e.g., has_value to Value entity) except for has_temporal (86.7% Temporal, 13.3% Observation) and has_mood (97.7% Mood, 2.3% Observation), though these different target types reflect the flexibility of the Observation Domain. For example, “history of” is considered an Observation despite its role in the has_temporal relationship. Concept mappings to the OMOP CDM with a score greater than or equal to 0.7 were considered strong matches. In total, 65.9% of raw annotated entities within the 8 Domain entity categories were considered strong mappings to the OMOP CDM. The percentage of strong mappings in each Domain category is shown in Table 6. Finally, the 15 most frequent tokens within the most common entity types are listed in Table 7 (fuzzy string-matching was used to generate token-level information).

Table 4.

Most common relationship entities including overall count and percentage of all relationships.

Relationship	Count	Percent (n = 25,017)
OR	4,939	19.8%
has_value	3,806	15.2%
AND	3,679	14.7%
has_qualifier	3,535	14.1%
has_temporal	3,336	13.3%

Open in a new tab

Table 5.

Most common relationship triplets (excluding OR relationships) including overall count and percentage of all relationship triplets.

Root Type	Relationship	Target Type	Count	Percent (n = 20,078)
Measurement	Has_value	Value	2799	13.94%
Condition	Has_qualifier	Qualifier	2445	12.18%
Condition	Has_temporal	Temporal	1323	6.59%
Temporal	Has_index	Reference_point	889	4.43%
Procedure	Has_temporal	Temporal	857	4.27%
Person	Has_value	Value	752	3.75%
Condition	AND	Drug	645	3.21%
Condition	Subsumes	Condition	624	3.11%
Drug	Has_temporal	Temporal	532	2.65%
Condition	AND	Procedure	514	2.56%
Procedure	Has_qualifier	Qualifier	465	2.32%
Condition	AND	Condition	459	2.29%
Condition	AND	Measurement	408	2.03%
Condition	Has_negation	Negation	380	1.89%
Procedure	AND	Condition	315	1.57%

Open in a new tab

Table 6.

Mapping accuracy to OMOP CDM via Usagi per Entity Category.

Entity Category	Percent of Entities with Confidence Score ≥ 0.70
Condition	74.9%
Procedure	66.5%
Drug	64.8%
Device	62.1%
Person	61.8%
Measurement	55.2%
Observation	39.8%
Visit	31.3%

Open in a new tab

Table 7.

Most common annotated entities by Domain.

Condition		Qualifier		Drug		Procedure
Concept	Count	Concept	Count	Concept	Count	Concept	Count
pregnancy	442	severe	326	systemic corticosteroids	81	treatment	174
allergy	269	significant	117	medication	72	surgery	99
contraindications	197	active	114	anticoagulants	55	chemotherapy	81
infection	129	other	112	prednisone	49	radiation therapy	62
malignancy	104	uncontrolled	106	antibiotics	48	general anesthesia	58
hypertension	92	clinically significant	83	study medications	45	physical examination	42
lactation	90	chronic	57	antidepressants	40	cardiac surgery	41
heart failure	89	serious	55	aspirin	39	contraception	39
stroke	88	symptomatic	54	opioids	39	intubation	38
diabetes	82	moderate	47	vaccine	36	transplant	36
lactating	82	acute	43	statin	32	implantation	35
myocardial infarction	81	elective	40	warfarin	27	liver transplant	35
cardiovascular disease	64	untreated	39	insulin	27	dialysis	34
liver disease	63	stable	38	rifampin	27	hysterectomy	33
Measurement		Person		Observation		Device
Concept	Count	Concept	Count	Concept	Count	Concept	Count
serum creatinine	77	age	577	breastfeeding	68	pacemakers	18
body mass index	65	female	355	life expectancy	64	intrauterine device	12
blood pressure	64	male	355	informed consent	29	prosthetic valve	12
weight	59	older	67	family history	18	prosthetic material	11
hemoglobin	57	adult	54	english speaking	16	prosthetic mesh	11
bilirubin	55	years	47	smoking	15	contraceptive implant	10
systolic blood pressure	52	children	32	childbearing potential	13	drug-eluting stent	9
diastolic blood pressure	52	patients	16	alcohol abuse	9	metal implants	9
pregnancy test	48	prisoners	13	evidence	8	device	8
platelet count	45	smokers	7	nursing	7	cochlear implants	8
creatinine clearance	44	infants	6	contraception	7	condom	7
ast [aspartate aminotransferase]	43	newborns	5	lactating	6	joint prosthesis	7
hba1c [hemoglobin a1c]	41	donor	5	last vaccination intervals	6	aneurysm clips	6
alt [alanine aminotransferase]	41	liver transplant recipients	5	suspected	6	metal in the body	6
asa [american society of anesthesiologists]	40	adolescents	5	sexually active	6	bare-metal stent	5

Open in a new tab

The Chia dataset contains a total of 4,161 annotated Scope objects with 1,009 having an incoming subsumes or multi relationship, serving a body of original and useful knowledge for electronic phenotyping (discussed further in Use Case 2 below). A few examples of Scope objects are presented in Table 8. The average number of entities contained within these Scope objects is 3.51 with a maximum of 82. Additionally, 2,318 subsumes or multi relationships are available throughout the dataset with that number increasing to 2,521 in the dataset without Scope objects (post-Scope decomposition).

Table 8.

Examples of Scope objects in Chia (contained on Scope object).

Trial Number	Inc/Exc	Line	Sample Criterion
NCT02781610	Exclusion	5	…worsening lower respiratory symptoms (e.g.,, pulmonary clean out, distal intestinal obstruction syndrome (DIOS), sinusitis)
NCT02596555	Exclusion	13	…strong inhibitors of P-glycoprotein like ketoconazole, cyclosporin, itraconazole or dronedarone
NCT00650312	Inclusion	4	…judged normal and healthy during a pre-study medical evaluation (physical examination, laboratory evaluation, 12-lead ECG, hepatitis B and hepatitis C tests, HIV test, and urine drug screen including amphetamine, barbiturates, benzodiazepine, cannabinoid, cocaine, opiates, phencyclidine, and methadone)
NCT01373684	Exclusion	13	…immunodeficiency syndromes (e.g., HIV positivity, auto-immune diseases, organ transplants other than cornea and hair transplant)
NCT02531971	Inclusion	2	…including tobacco products (e.g., cigarettes, cigars, chewing tobacco, gum, patch or electronic cigarettes)

Open in a new tab

Usage Notes

Use case analyses

To demonstrate the utility of Chia, we propose two motivating use cases for this annotated corpus that can be explored in future research efforts.

Use Case 1: Machine learning model training for information extraction from eligibility criteria

Machine learning technologies can assist in parsing eligibility criteria. In previous research, Conditional Random Fields (CRFs)²⁷, Convolutional Neural Networks (CNNs)²⁸, Support Vector Machines (SVMs)¹⁴, hierarchical clustering²⁹, distant supervision³⁰, and other machine learning approaches have been used to extract entities and relationships from the free-text eligibility criteria to obtain structured representations. These extraction methods typically result in satisfactory accuracy but relatively low recall, such as the 94% accuracy and 18% recall described by Milian et al. in 2015¹⁷. There are a few possible explanations for this low recall. The first is that the training corpus lacks diversity in the types of eligibility criteria, as described in more detail in the Background section of this article. The second explanation is that most of the criteria (~85%) are very complex with various semantic patterns, including 35% of criteria containing more than one type of semantic pattern^13,31. Finally, incomplete sentence structure and word ambiguity can also lead to extraction failures as described by Yuan et al. in 2019³².

A larger dataset of eligibility criteria with samples from a broad range of clinical trials is necessary to train a more accurate and robust extraction model. With 41,699 entities and 25,017 relationships annotated, Chia provides ample training data for machine learning research for identifying not only the boundaries and classes of named entities within medical text, but also the dependencies between these concepts. Further, when comparing to previous annotation efforts outlined in the Background section, the richness of the Chia model becomes clear as highlighted in Fig. 2. This direct criterion-to-criterion comparison allows for better appreciation of the increased level of connectivity between concepts (section A), simplicity in structure and format (section B), and improved granularity (section C) provided by Chia.

Fig. 2 — Comparisons of Chia annotation model to previous annotation efforts using identical sample eligibility criteria text. (a) EliIE annotation model proposed by Kang *et al*., (b) hepatitis C trials outlined by Zhang *et al*., (c) ERGO annotation model proposed by Tu *et al*.

Chia’s non-flat annotation scheme is also noteworthy. Most corpuses for Named Entity Recognition (NER) training have adopted the flat annotation, disallowing discontinuous, nested, or overlapping entities³³, whereas Chia represents them and supports the use case Electronic Phenotyping described below. Discontinuous and overlapping entities are required to capture coordination ellipsis, such as “presence of pulmonary, hepatic or hematological cancer.”, which is one type of ellipsis used in coordinating structure to achieve textual concision by omitting repeated words³⁴. Coordination ellipses are more frequently used in medical language than in the general domain. More granular results are required for downstream tasks such as free-text based phenotyping. Existing annotated corpus containing overlapping entities are derived from biomedical literature, including GENIA and NCBI Disease corpora³⁵. GENIA corpus is focused primarily on biological entities such as DNA, RNA, and protein³⁶, and NCBI Disease is focused on disease mentions. To the best of our knowledge, Chia is the first clinical corpus of considerable size that is annotated in a non-flat fashion and supports annotation nesting and joining.

Use Case 2: Electronic phenotyping

Electronic phenotyping plays an essential role in disease knowledge discovery, application, and clinical research^2,37. Electronic phenotyping supports cross-sectional and epidemiological studies, association studies, clinical risk factor analyses, and cohort identification². In some cases, the phenotype definition is fairly simple (e.g., diagnosis of rheumatoid arthritis), but it can become more nuanced and complex (e.g., moderate or severe COPD exacerbation [requiring corticosteroids or increased dosage of corticosteroids and/or antibiotics or hospitalization]). Manual knowledge engineering to establish the linkage between the concepts in a phenotype is neither scalable nor efficient. Hierarchical relationships between annotated entities are explicitly defined in Chia via Scope objects, subsumes relationships and multi relationships, which provide reusable phenotyping knowledge. These annotations often indicate one concept (or group of concepts) that can be substituted for another because their meanings are inherently interconnected – oftentimes because the latter are clarifications or specifications of the former. For example, Systolic Blood Pressure > 130 and Diastolic Blood Pressure >85 can be used to define the condition Hypertension. There are 2,197 cases of subsumes in Chia, and Table 9 gives a few examples. Further, as eligibility criteria themselves serve to define a patient cohort, they can be considered to be small electronic phenotypes.

Table 9.

Examples of subsumes relationships in Chia (parent entity and subsumed entity).

Trial Number	Inc/Exc	Line	Sample Criterion
NCT00050349	Inclusion	2	…no major impairment of renal or hepatic function, as defined by the following laboratory parameters: total bilirubin < 1.5 X ULN; AST, ALT < 2.5X ULN ( < 5 X ULN if liver metastases are present)
NCT00094861	Exclusion	7	Presence or history of dysphagia or conditions predisposing to dysphagia (e.g.,, uncontrolled gastroesophageal reflux disease [GERD], dyspepsia, etc.)
NCT00182520	Inclusion	2	…open label trial of one the following SRI’s…and demonstrating a non or partial responses to SRI treatment (CGI-I of 3 or 4, Y-BOCS reduction of < 35%)
NCT00343668	Exclusion	10	…significant neurologic or psychiatric disorders including dementia or seizures

Open in a new tab

Supplementary information

Appendix^{(43.6KB, docx)}

Acknowledgements

This project was supported by The National Library of Medicine grant R01LM009886-10 (“Bridging the semantic gap between research eligibility criteria and clinical data”) and the National Center for Advancing Translational Science (NCATS) grant UL1TR001873.

Author contributions

Fabrício Kury – Primary dataset annotator, drafted original manuscript. Li-heng Fu – Secondary dataset annotator, aided in revision of manuscript. Alex Butler – Assisted with dataset annotation, aided in revision of manuscript. Chi Yuan – Machine learning developer, produced results presented in Technical Validation section. Yingcheng Sun – aided in revision of manuscript. Hao Liu – Programmer, developed code used for Scope decomposition. Ida Sim – Helped develop and refine the details of CAM, aided in reviewing the results. Simona Carini – Helped develop and refine the details of CAM, aided in reviewing the results. Chunhua Weng – Research supervisor, principal investigator, conceptualization and research design of the study, guarantor for the annotated datasets, review, editing, and revision of the manuscript.

Code availability

All code used to generate and process the datasets described above is freely available at https://github.com/WengLab-InformaticsResearch/CHIA in the file titled chia.R. It was written in R version 3.3.3.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Fabrício Kury, Alex Butler, Chi Yuan.

Supplementary information

is available for this paper at 10.1038/s41597-020-00620-0.

References

1.Weng C. Optimizing Clinical Research Participant Selection with Informatics. Trends in pharmacological sciences. 2015;36:706–709. doi: 10.1016/j.tips.2015.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Banda JM, Seneviratne M, Hernandez-Boussard T, Shah NH. Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models. Annual review of biomedical data science. 2018;1:53–68. doi: 10.1146/annurev-biodatasci-080917-013315. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Sen A, et al. Correlating eligibility criteria generalizability and adverse events using Big Data for patients and clinical trials. Ann N Y Acad Sci. 2017;1387:34–43. doi: 10.1111/nyas.13195. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Murthy VH, Krumholz HM, Gross CP. Participation in cancer clinical trials: race-, sex-, and age-based disparities. Jama. 2004;291:2720–2726. doi: 10.1001/jama.291.22.2720. [DOI] [PubMed] [Google Scholar]
5.Chondrogiannis E, et al. A novel semantic representation for eligibility criteria in clinical trials. Journal of biomedical informatics. 2017;69:10–23. doi: 10.1016/j.jbi.2017.03.013. [DOI] [PubMed] [Google Scholar]
6.Williams RJ, Tse T, DiPiazza K, Zarin DA. Terminated Trials in the ClinicalTrials.gov Results Database: Evaluation of Availability of Primary Outcome Data and Reasons for Termination. PloS one. 2015;10:e0127242. doi: 10.1371/journal.pone.0127242. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Richesson RL, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. J Am Med Inform Assoc. 2013;20:e226–e231. doi: 10.1136/amiajnl-2013-001926. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Weng C. Optimizing Clinical Research Participant Selection with Informatics. Trends Pharmacol Sci. 2015;36:706–709. doi: 10.1016/j.tips.2015.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Weng C, Tu SW, Sim I, Richesson R. Formal representation of eligibility criteria: a literature review. Journal of biomedical informatics. 2010;43:451–467. doi: 10.1016/j.jbi.2009.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Patel, P., Davey, D., Panchal, V. & Pathak, P. Annotation of a Large Clinical Entity Corpus. (2018).
11.Mohan, S. & Li, D. MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts. (2019).
12.Weng C, et al. EliXR: an approach to eligibility criteria extraction and representation. Journal of the American Medical Informatics Association. 2011;18:i116–i124. doi: 10.1136/amiajnl-2011-000321. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Ross, J., Tu, S., Carini, S. & Sim, I. Analysis of eligibility criteria complexity in clinical trials. Summit Transl Bioinform, 46–50 (2010). [PMC free article] [PubMed]
14.Kang T, et al. EliIE: An open-source information extraction system for clinical trial eligibility criteria. J Am Med Inform Assoc. 2017;24:1062–1071. doi: 10.1093/jamia/ocx019. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Tu SW, et al. A practical method for transforming free-text eligibility criteria into computable criteria. Journal of biomedical informatics. 2011;44:239–250. doi: 10.1016/j.jbi.2010.09.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Zhang, H. et al. Computable Eligibility Criteria through Ontology-driven Data Access: A Case Study of Hepatitis C Virus Trials. AMIA Annu Symp Proc, 1601–1610 (2018). [PMC free article] [PubMed]
17.Milian K, et al. Enhancing reuse of structured eligibility criteria and supporting their relaxation. Journal of biomedical informatics. 2015;56:205–219. doi: 10.1016/j.jbi.2015.05.005. [DOI] [PubMed] [Google Scholar]
18.Lonsdale, D., Tustison, C., Parker, C. & Embley, D. Formulating Queries for Assessing Clinical Trial Eligibility. (2006).
19.Reich, C., Ryan, P. B., Belenkaya, R., Natarajan, K. & Blacketer, C. OHDSI Common Data Model v6.0 Specifications, https://github.com/OHDSI/CommonDataModel/wiki (2019).
20.Zarin DA, Fain KM, Dobbins HD, Tse T, Williams RJ. 10-Year Update on Study Results Submitted to ClinicalTrials.gov. New England Journal of Medicine. 2019;381:1966–1974. doi: 10.1056/NEJMsr1907644. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Suvarna V. Phase IV of Drug Development. Perspect Clin Res. 2010;1:57–60. [PMC free article] [PubMed] [Google Scholar]
22.Stenetorp, P. et al. brat: a Web-based Tool for NLP-Assisted Text Annotation. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 102–107 (2012).
23.Clinical Trials Transformation Initiative. Aggregate Analysis of ClinicalTrials.gov, https://aact.ctti-clinicaltrials.org/ (2016).
24.Kury FSP, 2020. Chia Annotated Datasets. figshare. [DOI]
25.Sang, E. F. & De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050 (2003).
26.Observational Health Data Sciences and Informatics. Usagi, https://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:usagi (2018).
27.Luo, Z., Johnson, S. B., Lai, A. M. & Weng, C. Extracting temporal constraints from clinical research eligibility criteria using conditional random fields. AMIA Annu Symp Proc, 843–852 (2011). [PMC free article] [PubMed]
28.Chuan, C.-H. Classifying Eligibility Criteria in Clinical Trials Using Active Deep Learning. (2018).
29.Luo, Z., Johnson, S. B. & Weng, C. Semi-Automatically Inducing Semantic Classes of Clinical Research Eligibility Criteria Using UMLS and Hierarchical Clustering. AMIA Annu Symp Proc, 487–491 (2010). [PMC free article] [PubMed]
30.Sun, Y. & Loparo, K. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 954–955.
31.Sun, Y. & Loparo, K. Knowledge-guided Text Structuring in Clinical Trials. (2019).
32.Yuan C, et al. Criteria2Query: a natural language interface to clinical databases for cohort definition. J Am Med Inform Assoc. 2019;26:294–305. doi: 10.1093/jamia/ocy178. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Alex, B., Haddow, B. & Grover, C. Recognising nested named entities in biomedical text. (Association for Computational Linguistics, 2007).
34.Yuan, C. et al. A Graph-Based Method for Reconstructing Entities from Coordination Ellipsis in Medical Text. Journal of the American Medical Informatics Association (2020). [DOI] [PMC free article] [PubMed]
35.Doğan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics. 2014;47:1–10. doi: 10.1016/j.jbi.2013.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Kim JD, Ohta T, Tateisi Y, Tsujii J. GENIA corpus–semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(Suppl 1):i180–182. doi: 10.1093/bioinformatics/btg1023. [DOI] [PubMed] [Google Scholar]
37.Banda, J. M., Halpern, Y., Sontag, D. & Shah, N. H. Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network. AMIA Jt Summits Transl Sci Proc, 48–57 (2017). [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

Kury FSP, 2020. Chia Annotated Datasets. figshare. [DOI]

Supplementary Materials

Appendix^{(43.6KB, docx)}

Data Availability Statement

[CR1] 1.Weng C. Optimizing Clinical Research Participant Selection with Informatics. Trends in pharmacological sciences. 2015;36:706–709. doi: 10.1016/j.tips.2015.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Banda JM, Seneviratne M, Hernandez-Boussard T, Shah NH. Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models. Annual review of biomedical data science. 2018;1:53–68. doi: 10.1146/annurev-biodatasci-080917-013315. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Sen A, et al. Correlating eligibility criteria generalizability and adverse events using Big Data for patients and clinical trials. Ann N Y Acad Sci. 2017;1387:34–43. doi: 10.1111/nyas.13195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Murthy VH, Krumholz HM, Gross CP. Participation in cancer clinical trials: race-, sex-, and age-based disparities. Jama. 2004;291:2720–2726. doi: 10.1001/jama.291.22.2720. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Chondrogiannis E, et al. A novel semantic representation for eligibility criteria in clinical trials. Journal of biomedical informatics. 2017;69:10–23. doi: 10.1016/j.jbi.2017.03.013. [DOI] [PubMed] [Google Scholar]

[CR6] 6.Williams RJ, Tse T, DiPiazza K, Zarin DA. Terminated Trials in the ClinicalTrials.gov Results Database: Evaluation of Availability of Primary Outcome Data and Reasons for Termination. PloS one. 2015;10:e0127242. doi: 10.1371/journal.pone.0127242. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Richesson RL, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. J Am Med Inform Assoc. 2013;20:e226–e231. doi: 10.1136/amiajnl-2013-001926. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Weng C. Optimizing Clinical Research Participant Selection with Informatics. Trends Pharmacol Sci. 2015;36:706–709. doi: 10.1016/j.tips.2015.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Weng C, Tu SW, Sim I, Richesson R. Formal representation of eligibility criteria: a literature review. Journal of biomedical informatics. 2010;43:451–467. doi: 10.1016/j.jbi.2009.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Patel, P., Davey, D., Panchal, V. & Pathak, P. Annotation of a Large Clinical Entity Corpus. (2018).

[CR11] 11.Mohan, S. & Li, D. MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts. (2019).

[CR12] 12.Weng C, et al. EliXR: an approach to eligibility criteria extraction and representation. Journal of the American Medical Informatics Association. 2011;18:i116–i124. doi: 10.1136/amiajnl-2011-000321. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Ross, J., Tu, S., Carini, S. & Sim, I. Analysis of eligibility criteria complexity in clinical trials. Summit Transl Bioinform, 46–50 (2010). [PMC free article] [PubMed]

[CR14] 14.Kang T, et al. EliIE: An open-source information extraction system for clinical trial eligibility criteria. J Am Med Inform Assoc. 2017;24:1062–1071. doi: 10.1093/jamia/ocx019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Tu SW, et al. A practical method for transforming free-text eligibility criteria into computable criteria. Journal of biomedical informatics. 2011;44:239–250. doi: 10.1016/j.jbi.2010.09.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Zhang, H. et al. Computable Eligibility Criteria through Ontology-driven Data Access: A Case Study of Hepatitis C Virus Trials. AMIA Annu Symp Proc, 1601–1610 (2018). [PMC free article] [PubMed]

[CR17] 17.Milian K, et al. Enhancing reuse of structured eligibility criteria and supporting their relaxation. Journal of biomedical informatics. 2015;56:205–219. doi: 10.1016/j.jbi.2015.05.005. [DOI] [PubMed] [Google Scholar]

[CR18] 18.Lonsdale, D., Tustison, C., Parker, C. & Embley, D. Formulating Queries for Assessing Clinical Trial Eligibility. (2006).

[CR19] 19.Reich, C., Ryan, P. B., Belenkaya, R., Natarajan, K. & Blacketer, C. OHDSI Common Data Model v6.0 Specifications, https://github.com/OHDSI/CommonDataModel/wiki (2019).

[CR20] 20.Zarin DA, Fain KM, Dobbins HD, Tse T, Williams RJ. 10-Year Update on Study Results Submitted to ClinicalTrials.gov. New England Journal of Medicine. 2019;381:1966–1974. doi: 10.1056/NEJMsr1907644. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.Suvarna V. Phase IV of Drug Development. Perspect Clin Res. 2010;1:57–60. [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Stenetorp, P. et al. brat: a Web-based Tool for NLP-Assisted Text Annotation. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 102–107 (2012).

[CR23] 23.Clinical Trials Transformation Initiative. Aggregate Analysis of ClinicalTrials.gov, https://aact.ctti-clinicaltrials.org/ (2016).

[CR24] 24.Kury FSP, 2020. Chia Annotated Datasets. figshare. [DOI]

[CR25] 25.Sang, E. F. & De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050 (2003).

[CR26] 26.Observational Health Data Sciences and Informatics. Usagi, https://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:usagi (2018).

[CR27] 27.Luo, Z., Johnson, S. B., Lai, A. M. & Weng, C. Extracting temporal constraints from clinical research eligibility criteria using conditional random fields. AMIA Annu Symp Proc, 843–852 (2011). [PMC free article] [PubMed]

[CR28] 28.Chuan, C.-H. Classifying Eligibility Criteria in Clinical Trials Using Active Deep Learning. (2018).

[CR29] 29.Luo, Z., Johnson, S. B. & Weng, C. Semi-Automatically Inducing Semantic Classes of Clinical Research Eligibility Criteria Using UMLS and Hierarchical Clustering. AMIA Annu Symp Proc, 487–491 (2010). [PMC free article] [PubMed]

[CR30] 30.Sun, Y. & Loparo, K. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 954–955.

[CR31] 31.Sun, Y. & Loparo, K. Knowledge-guided Text Structuring in Clinical Trials. (2019).

[CR32] 32.Yuan C, et al. Criteria2Query: a natural language interface to clinical databases for cohort definition. J Am Med Inform Assoc. 2019;26:294–305. doi: 10.1093/jamia/ocy178. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR33] 33.Alex, B., Haddow, B. & Grover, C. Recognising nested named entities in biomedical text. (Association for Computational Linguistics, 2007).

[CR34] 34.Yuan, C. et al. A Graph-Based Method for Reconstructing Entities from Coordination Ellipsis in Medical Text. Journal of the American Medical Informatics Association (2020). [DOI] [PMC free article] [PubMed]

[CR35] 35.Doğan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics. 2014;47:1–10. doi: 10.1016/j.jbi.2013.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Kim JD, Ohta T, Tateisi Y, Tsujii J. GENIA corpus–semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(Suppl 1):i180–182. doi: 10.1093/bioinformatics/btg1023. [DOI] [PubMed] [Google Scholar]

[CR37] 37.Banda, J. M., Halpern, Y., Sontag, D. & Shah, N. H. Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network. AMIA Jt Summits Transl Sci Proc, 48–57 (2017). [PMC free article] [PubMed]

PERMALINK

Chia, a large annotated corpus of clinical trial eligibility criteria

Fabrício Kury

Alex Butler

Chi Yuan

Li-heng Fu

Yingcheng Sun

Hao Liu

Ida Sim

Simona Carini

Chunhua Weng

Abstract

Background & Summary

Table 1.

Methods

Chia’s Annotation Model (CAM)

Entities

Relationships

Annotation graph

Fig. 1.

Scope objects

The annotation processes

Sampling of trials and loading into the annotation tool

Annotation of the eligibility criteria from the selected trials

Post-processing the annotations

Parsing the annotation graph

Data Records

Free-text (.txt) Files

Annotation (.ann) Files

Configuration (.conf) Files

Technical Validation

Inter-annotator agreement

Dataset exploration and validation

Descriptive statistics

Table 2.

Table 3.

Table 4.

Table 5.

Table 6.

Table 7.

Table 8.

Usage Notes

Use case analyses

Use Case 1: Machine learning model training for information extraction from eligibility criteria

Fig. 2.

Use Case 2: Electronic phenotyping

Table 9.

Supplementary information

Acknowledgements

Author contributions

Code availability

Competing interests

Footnotes

Supplementary information

References

Associated Data

Data Citations

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases