Abstract
When healthcare providers review the results of a clinical trial study to understand its applicability to their practice, they typically analyze how well the characteristics of the study cohort correspond to those of the patients they see. We have previously created a study cohort ontology to standardize this information and make it accessible for knowledge-based decision support. The extraction of this information from research publications is challenging, however, given the wide variance in reporting cohort characteristics in a tabular representation. To address this issue, we have developed an ontology-enabled knowledge extraction pipeline for automatically constructing knowledge graphs from the cohort characteristics found in PDF-formatted research papers. We evaluated our approach using a training and test set of 41 research publications and found an overall accuracy of 83.3% in correctly assembling the knowledge graphs. Our research provides a promising approach for extracting knowledge more broadly from tabular information in research publications.
Introduction
Clinical trials and other controlled research studies are the gold standard for establishing the efficacy and safety of new interventions. The results of these studies are frequently included as recommendations in clinical guidelines that healthcare providers use to inform practice decisions. There have been ongoing concerns, however, about whether clinical trials and other controlled studies enroll participants who are representative of the broader clinical population, or whether they include biases and population gaps1. Providers may thus seek to review the underlying studies for a particular guideline recommendation to examine whether the characteristics of the study cohorts match, in terms of demographic or clinical factors, those of the patient population they see.
We have previously built the Study Cohort Ontology (SCO) to standardize this information and make it accessible within knowledge-based decision support systems informed by guideline recommendations and clinical trial results. The SCO models a comprehensive set of characteristics about study cohorts and encodes this knowledge as Resource Description Framework (RDF) Knowledge Graphs (KGs)2. The SCO ontology reuses concepts from standard, well-used biomedical ontologies to support the modeling of study cohort table components. In addition to extending decision support systems with literature-derived knowledge, the SCO enables cohort similarity applications and population analysis scenarios. However, creating these study cohort KGs by hand is a time-consuming process and will not scale to the massive volume of research studies that have been published. In our current work, we seek to populate the SCO KG by automatically extracting the study cohort information from cohort tables in research publications. The accurate extraction of this information is challenging, however, given the wide variety of ways researchers report cohort characteristics in tabular representations.
Approach
Study cohort tables exhibit a wide variance in representation, style, and content, which poses problems for automatic knowledge extraction. In (Figure 1), we annotate five examples of this variance, depicted via numbering in the figure. For instance, these tables may contain both continuous characteristics (1) and categorical characteristics (2), with statistical measures recorded in the row header. In the case of (4), measures are not explicitly stated at all; instead, context clues such as the plus-or-minus symbol “±” indicate the measure. In addition to varying formats for measures, individual cells of a table may have dependencies on multiple row and column headers and sub-headers that must be parsed in order to correctly interpret the cell value. This situation can be seen in (3), which includes the indented rows labeled “Median” and “Interquartile range.” The subject characteristic described by these rows cannot be determined from these row headers alone, but only if the row sub-header “Weight (kg)” is associated with them. A similar situation is seen in (5), where the row headers “mmol/mol” and “%” describe only units, and measures and other relevant information are included in multiple row sub-headers. While a single table tends to be internally consistent in how it formats data, nearly every table uses a format different from all the rest.
Figure 1:
A constructed example of a study cohort table assembled to depict variations in reporting styles. We highlight the nested row structure typical of study cohort tables, and show some of the several different reporting styles (see annotations 1, 2, 3, 4, 5) we have observed across different research publications.
To address these challenges in extracting study cohort data, we have developed a general and scalable knowledge extraction pipeline that uses an ontology-enabled approach to handle the wide variety of formats of study cohort tables. The pipeline accepts a PDF format of a research publication as input, extracts and determines relationships within the study cohort information therein, and produces a KG modeling this information as the output. To do this, the pipeline uses a rule-based method, leveraging the rules and relationships between concepts defined by the ontology. We have evaluated this pipeline using publications cited in the 2019 American Diabetes Association (ADA) guidelines3, and we show that we can achieve high overall accuracy for correctly assembled knowledge graphs.
Related Work
There has been research into approaches for extracting information from clinical trials, including approaches that incorporate semantic methods. Milosevic et al. demonstrated a method to extract specific types of information (e.g., number of patients or gender distribution) specified by a user, using a combination of semantic tagging and syntactic rules defined for each data type. Their methods were based on decomposing a table into its component cells and cellular structures, and achieved F-scores ranging from 82% to 92% depending on data type4,5. Dhuliawala et al. found success in extracting information from medical schedule-of-activity tables using a variety of semantic, structural, and NLP approaches, based on the observation that tables are structured according to recurring patterns6. MetaMap includes a feature for matching UMLS terms to text found in tables, but does not match these terms to values within the table7. These prior works demonstrate the need for this kind of table extraction and the feasibility of a rule-based approach that takes advantage of recurring patterns. However, there is still work to be done in creating a robust system that can extract a wide variety of cohort characteristics represented using different formats, statistical measures, and terminology. We also depart from these prior works by building a KG rather than extracting data to a less expressive (but still machine-readable) format. By representing the study cohort in KGs, we closely capture the original associations of the table and allow systems to view the entities described in tables as they are presented.
Additional research has been conducted on extracting information from other domains using ontology-enabled methods. Embley et al. demonstrated that an ontology modeling a specific domain can be used to create rules for a rule-based method of knowledge extraction in that domain, achieving 90% recall and 98% precision in tests8. Their system is designed to operate only on unstructured text, however, so their methods are not directly applicable to study cohort tables. Although our methods use some of the same principles that underlie this work, we do not use the ontology to generate separate rules for extraction. Instead, we use our heuristics to convert a table to a graph-like tree table structure, and then use common graph algorithms such as depth-first search to match the content of the table to the original ontology. With this approach, the ontology is used both during the extraction process and as the schema for the KG output by the pipeline. Related work by Embley does operate on tables, but aims to synthesize new ontologies from the relationships described by a table rather than create a KG of the table’s contents9.
Methods
Our extraction pipeline consists of four steps, as shown in (Figure 2). In Step 1, the study cohort table from the raw PDF of the research publication is converted to a machine-readable format. In Step 2, row sub-headers are identified and the data structure is reorganized into a tree table structure. In Step 3, individual KG components are identified from the text of the table, and the data structure is annotated with these components. In Step 4, the relationships between components are determined based on the table structure, and the KG is created.
Figure 2:

The knowledge extraction pipeline consists of four steps: 1) PDF-JSON conversion, 2) Creation of the tree table, 3) Annotating tokens with KG elements, and 4) Assembling a KG from annotated KG elements.
Step 1: PDF Conversion
In this step, a PDF of the research publication is supplied to the Corpus Conversion Service (CCS) tool10. The tool identifies text and tabular elements in the study and extracts them. Individual tables within the PDF are identified by a human as containing clinical trial data that should be extracted, and they are fed into the service. After we trained the CCS on some sample study cohort tables, the conversion service was able to use machine learning techniques to convert the PDFs into JSON representations of the tables. These converted tables included not only the raw text within the table but also row and column information for each cell, as well as the font, style, and bounding boxes2 of the segments of extracted text. In the next step of the pipeline, we leverage this JSON representation of the cohort tables to convert them to a data structure the rest of the pipeline can interpret.
Step 2: Tree Table Construction
Although the converted tables obtained through the CCS tool are in a machine-readable form, this form represents the tables as “flat” tables, i.e., tables where rows are not arranged in a hierarchy. However, the vast majority of study cohort tables we observed were analogous to “tree tables.” As described in Tidwell et al., these tables “put hierarchical data in columns, like a table, but use an indented outline structure in the first column,” so that they exhibit many layers of nesting12. An example of a tree table is shown in (Figure 3).
Figure 3:
The tree table structure, in which some rows are nested under other rows (row sub-headers). Data in this table was originally gathered from a publication by Patel et al11.
To aid in the construction of the KG in later steps of the pipeline, we convert the JSON format produced by the CCS tool into a tree table format using a heuristic-based approach. The heuristic first relies on indentation information, if present, and if not uses the font style information of row headers. We leverage the row bounding boxes, captured within the CCS tool, to measure the pixel difference from one row to the next, which allows us to determine the indentation levels of each row. Additionally, some tables do not use indentation to indicate row subheadings, but instead use font style. To tackle this issue, we have designed an additional heuristic to be used if the initial indentation heuristic does not identify row subheadings, whereupon any row with entirely bold text is treated as a row subheading in the same manner.
Using these heuristics, we reorganize the tabular data structure into a nested structure, wherein a row may have several sub-rows where each sub-row is indented more than its parent row. This nested tree table structure aids in identifying all the associations between rows, as sub-rows typically require information from row subheading(s) to be correctly interpreted in later steps of the process. Given this tree table structure, we are in a position to construct the KG by classifying table cell values in Step 3 and leveraging the tree table structure to combine these values in Step 4.
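As an illustration of this step, the indentation heuristic can be sketched in Python. The row representation and pixel tolerance below are simplifying assumptions (not our actual data structures), and the bold-text fallback is omitted:

```python
from dataclasses import dataclass, field

@dataclass
class Row:
    text: str
    x0: float                 # left edge of the row's bounding box, in pixels
    children: list = field(default_factory=list)

def nest_rows(rows, tol=2.0):
    """Nest a flat list of rows by indentation: a row indented further
    right than the nearest open parent becomes that parent's sub-row."""
    roots, stack = [], []     # stack holds the currently open parent rows
    for row in rows:
        # Close any parent that is not strictly to the left of this row
        while stack and row.x0 <= stack[-1].x0 + tol:
            stack.pop()
        (stack[-1].children if stack else roots).append(row)
        stack.append(row)
    return roots
```

Applied to rows resembling Figure 3, “Median” and “Interquartile range” (indented further right) become children of “Weight (kg)”.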
Step 3: Node Classification
In this step, KG elements are identified from the cells of the extracted table. By KG elements, we mean nodes in the KG that will be associated with other nodes in Step 4 of the pipeline. To do this, we first tokenize the text of each cell using a regular expression-based tokenization scheme, creating a separate token for each alphanumeric word, number (possibly including a decimal point and/or negative sign), and punctuation symbol. Tokenization allows us to consider individual “tokens” (segments of text) in a cell, which can then be annotated with KG elements. During tokenization, we sometimes combine separate words into a single multi-word token, if the series of words and/or punctuation appears on a predefined list of common multi-word concepts (e.g., “Std. Dev.” is treated as one token). We created this list of keywords (allowing for variations) based on the concepts we intend to annotate, which originate from the underlying ontology.
After tokens are created for a given cell, each token is passed to a series of classifiers, which attempt to identify the token and create a KG element representing that token. Currently, we use two classifiers. The first of these is the Value classifier, which identifies numerical tokens, parses their value, and assigns them to an RDF literal with the parsed value. An RDF literal is a node in a KG that contains “values such as strings, numbers, and dates.”3 The second, the Concept Classifier, is loaded with a mapping of keywords to SCO concepts, and if the supplied token matches one of these keywords, it is annotated with the corresponding SCO concept.
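A minimal sketch of tokenization and classification in Python; the multi-word list, keyword mapping, and concept names are small illustrative subsets, not the full SCO mapping:

```python
import re

# Illustrative subsets of the predefined multi-word list and the
# keyword-to-SCO-concept mapping (not the full vocabulary)
MULTIWORD = {"std. dev."}
CONCEPTS = {"mean": "sco:Mean", "std. dev.": "sco:StandardDeviation",
            "age": "sco:Age"}

# One token per number (optional sign/decimal), word, or punctuation mark
TOKEN_RE = re.compile(r"-?\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z0-9]")

def tokenize(text):
    tokens = TOKEN_RE.findall(text)
    out, i = [], 0
    while i < len(tokens):
        # Greedily merge the longest run of tokens that forms a known
        # multi-word concept, e.g. "Std", ".", "Dev", "." -> "std. dev."
        for j in range(len(tokens), i, -1):
            joined = " ".join(tokens[i:j]).lower().replace(" .", ".")
            if joined in MULTIWORD:
                out.append(joined)
                i = j
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

def classify(token):
    """Value classifier: numeric tokens become RDF literals.
    Concept classifier: known keywords become SCO concepts."""
    try:
        return ("literal", float(token))
    except ValueError:
        concept = CONCEPTS.get(token.lower())
        return ("concept", concept) if concept else ("unknown", token)
```

Tokens that match neither classifier are left unannotated here; in the pipeline they become the incomplete elements discussed below.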
A depiction of tokens and some examples of the KG element annotations are shown in (Figure 4). At this stage, tokens are converted to KG elements: standalone tokens are complete, while tokens with dependencies, or that require more information, are marked as incomplete. Some of the KG elements (shown in blue in (Figure 4)) are completed, meaning they represent a literal value which can be represented in the KG as-is. For example, a number in a table cell would be a completed node. Other KG elements are incomplete, because they must be associated with other nodes before they can be represented in the KG. In the next step, we utilize the tree table structure to associate nodes and construct the full KG.
Figure 4:
In step 3 of the process, cell text is tokenized and assigned KG elements. Some of the annotations are shown at the top of the figure. Data from Patel et al11.
Step 4: Graph Assembly
In the final step of our extraction pipeline, the different KG elements are combined with one another to build the complete KG. We use an ontology-enabled approach, in which relations are created between KG elements based on the structure defined by the KG’s underlying ontology, which in our case is given by SCO.
In an RDF KG, we typically represent content as subject-predicate-object triples,4 which capture associations between nodes. One such triple pattern in the cohort characteristics is a standard deviation node, which must be associated with the value it measures in order to be considered complete. We also observed other recurring triple patterns in cohort characteristic tables: columns depict study arms, rows represent characteristics recorded on subjects belonging to these study arms, and cells report the statistics for these characteristics. We leveraged these patterns by representing them as templates that are used to assemble nodes in a KG.
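For instance, the standard-deviation pattern above can be written as a small set of triples; the ex:/sco: names here are illustrative shorthand rather than the exact SCO vocabulary:

```python
# An "Age" characteristic on a study arm, measured by a standard
# deviation node that is completed by associating it with its value
triples = [
    ("ex:Arm1",  "sco:hasCharacteristic", "ex:Age"),
    ("ex:Age",   "sco:hasMeasure",        "ex:AgeSD"),
    ("ex:AgeSD", "rdf:type",              "sco:StandardDeviation"),
    ("ex:AgeSD", "sco:hasValue",          8.2),   # RDF literal
]
```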
These templates for interpreting the table’s rows and columns are informed by concepts from the underlying SCO ontology. An example of a row template is shown in (Figure 5). In SCO, these templates are defined based on the organization of concepts in cohort characteristics tables.
Figure 5:
In Step 4 of the process, 1: the measure template forms associations between a measure and a value, 2: the row template forms associations between a characteristic and measures, 3: the column template forms associations between a study arm and characteristics, and 4: the final structure is shown. We show a snippet of a study cohort table, marked via dotted lines with the tabular locations of each template. Data from Patel et al11.
Column Template: We use the “study arm” class as a column template. This class composition specifies that a study arm is associated with any number of subject characteristics and has a study population size.
Row Template: We use the “subject characteristic” class as a row template. This class composition associates a subject characteristic to its corresponding measure(s). Additionally, we use two templates for rows as rows commonly contain either continuous characteristics (e.g., Age) or categorical characteristics (e.g., Race). In the KG created manually by Chari et al. using the SCO ontology, continuous characteristics are represented as attributes of study arms, whereas categorical characteristics are represented as collections2. We leverage these representation styles in our templates in order to ensure that the KG is correctly representing the semantics of the knowledge contained within the cohort table.
Measure Template: We use the “Statistical Measure” class for measures (e.g. mean, std. dev.). This class composition needs to be associated with a value (an RDF literal).
We use these templates to form associations between the nodes created in Step 3. To ensure that the associations are created correctly, we traverse the tree table structure in a recursive depth-first search, such that measures in a row header are associated with values from cells in that row. In our case, the depth-first search starts at the column headers and moves down to each row beneath that header. The movement of the recursive algorithm is dictated by the ordering of our templates: it starts with the column template, then moves on to a row template, and then to a measure template. Once a row with values is reached, it applies the measure template to associate measures in the row header with values in the row’s cells (see Figure 5). If there are any sub-rows, the algorithm moves to those; otherwise, it percolates the recently associated measure and value back up to the row template, which associates the measure with its corresponding subject characteristic. The algorithm likewise percolates this subject characteristic up to any enclosing row headers; if there are none, the subject characteristic is percolated up to the column header, where the column template associates the characteristic with a study arm. The algorithm then continues to the next row, repeating this process until all rows in the table have been associated with the study arms in the KG. The KG created as a result of this process can be viewed in part 3 of (Figure 5).
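A self-contained sketch of this traversal; the row encoding, the measure list, and the predicate names are simplified assumptions, not our exact implementation:

```python
# Illustrative subset of measure keywords
MEASURES = {"mean": "sco:Mean", "median": "sco:Median",
            "interquartile range": "sco:InterquartileRange"}

def assemble(arm, rows, triples, parent_char=None):
    """Recursive depth-first assembly. A row whose header names a
    measure percolates its value up to the enclosing characteristic
    (measure + row templates); any other row introduces a new
    characteristic on the study arm (column template)."""
    for label, value, children in rows:   # each row: (header, cell value, sub-rows)
        measure = MEASURES.get(label.lower())
        if measure:
            triples.append((parent_char, measure, value))
            char = parent_char
        else:
            char = "ex:" + label.split("(")[0].strip().replace(" ", "")
            triples.append((arm, "sco:hasCharacteristic", char))
            if value is not None:
                triples.append((char, "sco:hasValue", value))
        assemble(arm, children, triples, parent_char=char)
```

On a tree in which “Median” and “Interquartile range” are nested under “Weight (kg)”, the traversal attaches both measures to the Weight characteristic and the characteristic to the study arm.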
The KG produced by the pipeline is in the Terse RDF Triple Language (commonly called Turtle)13. The Turtle output consists of nodes linked to other nodes via the subject-predicate-object paradigm, and can be queried via SPARQL (SPARQL Protocol and RDF Query Language) queries to learn information about the cohort these KGs represent. An example of a query is shown in (Figure 6), where the user queries the KG for information about study arms in a clinical trial, and filters the results to only include study arms with a mean age greater than 60 years. Although the KGs can be used as-is, they are intended to be used as a knowledge base for future physician-facing applications (examples, such as cohort similarity visualizations and population analysis scenarios, are provided in Chari et al2.).
Figure 6:
The output knowledge graph (KG) can be queried via SPARQL. Shown: a query to display study arms of a study with an average age greater than 60.
Evaluation and Results
In designing the SCO ontology2, we selected at random 18 research publications cited in the pharmaceutical interventions and hypertensive comorbidities chapters of the ADA Standards of Medical Care guidelines 201814. We chose these publications because we plan to populate SCO with the study cohort data of clinical trials supporting specific clinical practice guidelines in diabetes care, for future work in meta-cohort analysis on studies supporting these guidelines. We reused this set of publications as the training set for designing the knowledge extraction pipeline, in that the pipeline features were designed around these publications and gradually iterated on over time. Our initial evaluation demonstrated that our pipeline achieved a 93.5% accuracy5 on the training set. To evaluate the correctness of the pipeline, we used as a testing set a new set of 23 publications, containing 27 tables in total, also cited in the pharmaceutical interventions and hypertensive comorbidities chapters of the 2019 ADA guidelines3. These publications were not included in the old set or examined prior to evaluation, so that the design of the pipeline was not influenced by them. For each publication in the test set, we compared the KG produced by the knowledge extraction pipeline against a manually verified ground truth KG. The ground truth was generated by a tool we developed that allowed us to annotate the output of the KG and find inaccuracies. We did not evaluate the performance of the external CCS tool in extracting table text and metadata from PDFs, and three tables originally slated for inclusion in the test set were excluded because they could not be parsed correctly. This left a total of 3744 data items within the testing set, with an average of 138 items per table and 163 per publication.
The publications were sourced from nine different journals, most frequently the New England Journal of Medicine (with 12 publications total) or the Lancet (with 4). All other journals provided one publication.
We evaluated the performance of the extraction pipeline for each publication in the test set on a per-data-item basis. One data item is defined as one value in a table cell, which is in turn associated with a row and a column, and there may be multiple data items per cell. For example, if the subject characteristic “age” is reported by a mean and standard deviation measure, there are two data items per cell for the “age” characteristic. For each data item, we use three metrics that evaluate whether the item has been correctly extracted, and report the accuracy for a metric by dividing the sum total of correct data items over the total number of all data items for a given publication. Additionally, we report the average accuracy across all publications in the test set.
The first metric considers a value correctly extracted if it is correctly parsed and included in the KG. 99.8% of data items were correctly parsed.
The second metric considers a data item correct if it has been represented in the KG with the correct statistical measure. 85.6% of data items were assigned the correct statistical measure.
The third metric considers a data item correct if it has been grouped with the correct cohort characteristic and study arm. 97.9% of data items were grouped with the correct cohort characteristic and study arm.
By inspecting the output of our pipeline, we determine if a data item is wholly correct based on whether it meets all three metrics. Based on the output of this determination, we compute an overall accuracy for the knowledge extraction pipeline. The overall accuracy for the test set is 83.3%.
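This per-data-item scoring can be expressed compactly: each data item carries one boolean flag per metric, and an item is wholly correct only if all three hold. The tuple encoding below is an illustrative assumption:

```python
def accuracies(items):
    """items: one (value_ok, measure_ok, grouping_ok) tuple per data
    item. Returns the three per-metric accuracies and the overall
    accuracy, where an item counts only if it meets all three metrics."""
    n = len(items)
    per_metric = [sum(flags[i] for flags in items) / n for i in range(3)]
    overall = sum(all(flags) for flags in items) / n
    return per_metric, overall
```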
Overall, the knowledge extraction pipeline was able to handle a wide variety of publications and table reporting styles, and most of the KGs were extracted without issue. This finding is demonstrated by a highly right-skewed distribution, as seen in (Figure 7). Sixteen of the twenty-three tables (70%) achieved an overall accuracy of 90% or above, two tables (9%) achieved an accuracy between 75% and 90%, and five tables (22%) failed to achieve an overall accuracy greater than 60%. Of the five tables that achieved an accuracy less than 60%, two did not explicitly specify the measures used to report the characteristics in the table itself; rather, the measures were only included in a footnote below the table, or were stated in the body text of the publication. The other three tables with an accuracy below 60% had measures reported in column sub-headers, including one table, combining study cohorts with frequencies of events, from which the pipeline failed to capture any measures. Our pipeline did not capture measures in column sub-headers because, currently, sub-headers are only identified based on either indentation level or text style, although we are exploring the use of machine learning techniques in the tree table heuristic that would allow for increased flexibility in identifying sub-headers. On tables with measures included in rows or row sub-headers, we report a 98.5% accuracy, despite many other variations found within these tables. Although the pipeline was trained on a different set of tables, it was able to adapt to a new set of tables that included many previously unseen formats.
Figure 7:
Publications arranged by overall accuracy. Publications shown in black experienced no significant errors, publications in blue experienced errors related to column headers, and publications shown in red experienced errors related to relevant information only being included in footnotes or body text.
Discussion
By representing study population information as KGs supported by common ontologies, the web of data could grow to encompass study cohort information. Although study cohort data in research publications is readable by humans, it is otherwise nearly inaccessible to machines that could reuse and mine it for further analysis and linking. Beyond decision support, software that consumes such KGs can help health professionals locate a relevant study that has not surfaced when treating a complicated patient, or support clinical researchers who rely on workflows that aggregate data from various streams. In pursuit of these goals, we have introduced a novel approach to automatically extract cohort tables into a Semantic Web-consistent framework.
Despite the variations in format and reporting styles of study cohort tables, the knowledge extraction pipeline was successful in extracting study cohort information from these tables and organizing this data in an RDF KG. We observed that the pipeline performed well even on tables with formats and styles not previously seen during the creation of the pipeline, achieving an overall accuracy of 83.3%. 35% of tables in the testing set were published previously, and 48% of tables contained at least one novel format or configuration of tabular features.
Rule-based algorithms have previously been demonstrated to be effective in the extraction of specific, pre-selected cohort characteristics, as shown by the framework presented by Milosevic et al. in 20195. In results comparable to ours, Milosevic used rule-based methods to extract measures and values from study cohort tables for the “age” characteristic with an F-measure of 82.8, and extracted the population size of study arms with an F-measure of 83.9. However, our pipeline extracts these characteristics and more from the entire study cohort table at one time, and does not require specific rules per characteristic.
When the pipeline experienced errors, they tended to occur in assigning a statistical measure to a value. This is demonstrated by the 85.6% accuracy in assigning the correct measure, compared to the 97.9% accuracy in assigning the measure to the correct characteristic and study arm. Statistical measures tend to have much greater variability in representation than other aspects of the table, and the multistage nature of the pipeline tends to mitigate other errors. For example, because Step 2 of the pipeline organizes tabular data into a tree table structure, Step 4 can utilize this structure to ensure values in a row are assigned to the correct characteristic. The errors with statistical measures mostly trace to two specific problems that could be fixed with adjustments to individual stages of the pipeline. The most common error occurs when statistical measures are reported in column sub-headers (as opposed to row sub-headers), and is fixable with tweaks to Step 4. The second-most common error occurs when statistical measures are reported in footnotes instead of the table proper, and we are exploring the option of parsing footnotes in addition to the table when it is initially converted from PDF form in Step 1. When these two errors do not occur, the rest of the information in the KG tends to be assembled without issue. However, there may be additional errors that were not observed due to the relatively limited size of the testing set. We plan to perform additional evaluation on larger sets of publications from additional medical domains as we continue to develop the pipeline.
The output of the pipeline is a KG that has been assembled based on the relationships between concepts in the original study cohort table. The KG format is a standardized, machine-readable data structure that is able to represent a variety of entities and relationships found in data, and represents an accessible resource that contributes to medical knowledge as a whole. For example, KGs can be semantically integrated into other health services, as described in Shi et al15. In particular, we see value in using a study cohort KG to compare a patient’s personal demographic characteristics and laboratory results against the study cohort of a clinical trial, to determine how well a patient matches the study population. Although there are other approaches to automatically generating KGs in the medical domain, such as the work by Goodwin et al. in generating KGs based on electronic medical records16, there is not yet an approach to generating KGs directly from tables in a research publication.
At present, the pipeline uses only a small list of keyword mappings, and only maps statistical measures to their concepts in SCO. We currently map statistical measures covering central tendency (e.g., mean, median), dispersion (e.g., standard deviation, interquartile range), and totals (e.g., population size). Terms such as demographic characteristics and medical interventions are included in the output KGs as labels, but are not directly mapped to an ontology concept. We are exploring the possibility of incorporating either UMLS MetaMap7 or the NCBO BioAnnotator17 into the pipeline, to augment or replace the keyword mappings and map all demographic and medical terms to a concept in a medical ontology. According to our preliminary results, the NCBO BioAnnotator has good support for Open Biological and Biomedical Ontology (OBO) Foundry ontologies and is able to map terms to most SCO concepts. In the future, we plan to evaluate the results of augmenting the pipeline with these tools. The latest version of the pipeline is accessible via our GitHub repository6.
Conclusion
Currently, our pipeline is able to take a research publication in PDF form and produce a KG that assembles the tabular components as per the relationships in the provided ontology. Although this KG only matches terms to specific concepts when keyword mappings are provided, unidentified terms are included as metadata associated with placeholder concepts. Prior work4 shows that keyword-matching can be used to identify tabular data from clinical literature, but requires specific rules for each extracted data type (e.g. characteristics such as BMI or age). Our pipeline avoids this issue of manually designing or training these rules by incorporating relationships between data types that are already described by the underlying ontology into the extraction pipeline. Our initial results show that we are able to mitigate the variance in the format of study cohort tables, as we have been able to identify the statistical measures of subject characteristics and build a KG encapsulating associations between these measures and other components of the table. Overall, the KGs we are creating show the validity of an ontology-enabled approach to extracting study cohort data from tables and are a step in the automatic recovery of clinical trial data for analysis purposes.
Acknowledgments
This work is partially supported by IBM Research AI through the AI Horizons Network. We thank our colleagues Ching-Hua Chen from IBM Research and Rebecca Cowan from RPI, who greatly assisted with the research and document preparation.
Footnotes
We refer to steps of our extraction pipeline in italics.
Bounding box: the pixel coordinates of the smallest-sized box region enclosing a segment of text.
RDF literal: https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
RDF triples: https://www.w3.org/TR/rdf11-concepts/#section-triples
We highlight all accuracies in boldface.
See https://github.com/tetherless-world/study-cohort-extraction-pipeline for more information.
References
- 1. Graham R, et al. Clinical Practice Guidelines We Can Trust. Washington, D.C., USA: National Academies Press (US); 2011. Trustworthy clinical practice guidelines: challenges and potential; pp. 53–75.
- 2. Chari S, Qi M, Agu NN, Seneviratne O, McCusker JP, Bennett KP, et al. Making Study Populations Visible through Knowledge Graphs. Int. Semantic Web Conf. Auckland, New Zealand: ISWC; 2019. pp. 53–68.
- 3. American Diabetes Association, et al. Standards of Medical Care in Diabetes—2019. Diabetes Care. 2019;42(Supplement 1).
- 4. Milosevic N, Gregson C, Hernandez R, Nenadic G. Extracting patient data from tables in clinical literature: case study on extraction of BMI, weight and number of patients. HEALTHINF. 2016. pp. 223–228.
- 5. Milosevic N, Gregson C, Hernandez R, Nenadic G. A framework for information extraction from tables in biomedical literature. Int J on Document Analysis and Recognition (IJDAR). 2019;02.
- 6. Dhuliawala M, Fay N, Gruen D, Das A. What Happens When?: Interpreting Schedule of Activity Tables in Clinical Trial Documents. Proc. of the ACM Conf. on Bioinformatics, Computational Biology, and Health Informatics. ACM BCB; 2018. pp. 301–306.
- 7. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J of the Amer Medical Informatics Assoc. 2010;17(3):229–236. doi: 10.1136/jamia.2009.002733.
- 8. Embley DW, Campbell DM, Smith RD, Liddle SW. Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents. Proc. of the Seventh Int. Conf. on Information and Knowledge Management. CIKM '98. New York, NY, USA: Assoc. for Computing Machinery; 1998. pp. 52–59.
- 9. Tijerino YA, Embley DW, Lonsdale DW, Ding Y, Nagy G. Towards Ontology Generation from Tables. World Wide Web. 2005;8(3):261–285. Available from: https://doi.org/10.1007/s11280-005-0360-8.
- 10. Staar PW, Dolfi M, Auer C, Bekas C. Corpus conversion service: A machine learning platform to ingest documents at scale. Proc. of the 24th ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining. ACM; 2018. pp. 774–782.
- 11. Patel A, ADVANCE Collaborative Group, et al. Effects of a Fixed Combination of Perindopril and Indapamide on Macrovascular and Microvascular Outcomes in Patients with Type 2 Diabetes Mellitus (the ADVANCE Trial): a Randomised Controlled Trial. The Lancet. 2007;370(9590):829–840. doi: 10.1016/S0140-6736(07)61303-8.
- 12. Tidwell J. Designing Interfaces: Patterns for Effective Interaction Design. O'Reilly Media. 2020. Available from: http://www.designinginterfaces.com/firstedition/index.php?page=Tree-Table.
- 13. Beckett D, Berners-Lee T, Prud'hommeaux E, Carothers G. RDF 1.1 Turtle. World Wide Web Consortium; 2014.
- 14. Riddle MC. Standards of Medical Care in Diabetes. Diabetes Care. 2018;41(1). doi: 10.2337/dc18-su09. Available from: https://diabetesed.net/wp-content/uploads/2017/12/2018-ADA-Standards-of-Care.pdf.
- 15. Shi L, Li S, Yang X, Qi J, Pan G, Zhou B. Semantic Health Knowledge Graph: Semantic Integration of Heterogeneous Medical Knowledge and Services. BioMed Research International. 2017;2017. doi: 10.1155/2017/2858423.
- 16. Goodwin T, Harabagiu SM. Automatic Generation of a Qualified Medical Knowledge Graph and Its Usage for Retrieving Patient Cohorts from Electronic Medical Records. 2013 IEEE Seventh Int. Conf. on Semantic Computing. 2013. pp. 363–370.
- 17. Jonquet C, Shah N, Youn C, Callendar C, Storey MA, Musen M. NCBO Annotator: Semantic Annotation of Biomedical Data. Int. Semantic Web Conf., Poster and Demo Session. 2009;vol. 110:1–3.