Knowledge Graph-Enabled Cancer Data Analytics

S M Shamimul Hasan; Donna Rivera; Xiao-Cheng Wu; Eric B Durbin; J Blair Christian; Georgia Tourassi

doi:10.1109/JBHI.2020.2990797

Author manuscript; available in PMC: 2021 Jul 30.

Published in final edited form as: IEEE J Biomed Health Inform. 2020 May 4;24(7):1952–1967. doi: 10.1109/JBHI.2020.2990797

PMCID: PMC8324069 NIHMSID: NIHMS1608706 PMID: 32386166

Abstract

Cancer registries collect unstructured and structured cancer data for surveillance purposes, which provide important insights regarding cancer characteristics, treatments, and outcomes. Cancer registry data typically (1) categorize each reportable cancer case or tumor at the time of diagnosis, (2) contain demographic information about the patient such as age, gender, and location at time of diagnosis, (3) include planned and completed primary treatment information, and (4) may contain survival outcomes. As structured data are extracted from various unstructured sources, such as pathology reports, radiology reports, and medical records, and stored for reporting and other needs, the associated information representing a reportable cancer is constantly expanding and evolving. While some popular analytic approaches including SEER*Stat and SAS exist, we provide a knowledge graph approach to organizing cancer registry data. Our approach offers unique advantages for timely data analysis and for the presentation and visualization of valuable information. This knowledge graph approach semantically enriches the data and easily enables linking with third-party data, which can help explain variation in cancer incidence patterns, disparities, and outcomes. We developed a prototype knowledge graph based on the Louisiana Tumor Registry dataset. We present the advantages of the knowledge graph approach by examining: i) scenario-specific queries, ii) links with openly available external datasets, iii) schema evolution for iterative analysis, and iv) data visualization. Our results demonstrate that this graph-based solution can perform complex queries, improve query run-time performance by up to 76%, and more easily conduct iterative analyses to enhance researchers’ understanding of cancer registry data.

Keywords: knowledge graph, cancer registry, treatment

I. Introduction

WORLDWIDE, almost 1 out of 6 deaths are due to cancer, a rate which is rising and estimated to increase by approximately 70% in the next 20 years [1]. To strategically conquer this challenge, we must improve our cancer research data infrastructure to support existing clinical and control programs and adapt to changing research needs. In the United States and other developed nations, cancer registries systematically collect data on diagnosed cancer cases such as tumor characteristics, first course of treatment, patient demographics, and outcomes [2], [3]. The information is abstracted and coded according to national standards and submitted to a central cancer registry, such as the National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) program in the United States. SEER collects a broad set of clinical data from population-based US cancer registries which cover 34.6% of the population. SEER also reports aggregated (non-identifiable) cancer statistics and provides a public-use case-level dataset to support both population-based cancer research and cancer control programs [2], [4].

A. Major Cancer Data Organization Challenges

Both the centralized SEER datasets and individual cancer registry datasets from which they are extracted have potential research advantages, especially when linking to vast third-party datasets, augmenting the original clinical data with behavioral and environmental information capable of explaining the variation in cancer incidence and outcomes. However, we are facing numerous data-related challenges for the secondary use of cancer registry data. Some challenges include:

Heterogeneous data: An enormous volume of third-party data is available on the web, spread across various locations and stored in various formats. For example, neighborhood concentrated disadvantage index (CDI) is available in comma-separated values (CSV) format [5], Wikipedia data is available in RDF format [6], and climate data is available in JavaScript Object Notation (JSON) format [7], to name a few. Customized software development is needed to analyze these datasets, and each new data format requires a substantial code change to that software.
Schema change: Presently, the core cancer registry datasets are housed in relational database management systems (RDBMS), which have a rigid schema structure. Newer versions of the datasets sometimes come with a new schema. For example, in this paper, we used the Louisiana Tumor Registry’s (LTR) Cancer/Tumor/Case (CTC) dataset. The current version of the dataset contains a column that provides the patient’s death location (U.S. state), whereas an earlier version does not. Schema change is not easy to handle in an RDBMS: even the smallest change in the database requires substantial graphical user interface (GUI)-level code changes.
Not linked: Linking a large volume of heterogeneous data sources is quite challenging because heterogeneous data sources do not follow any common data-storing standard.
Difficult to execute complex queries: Data linking is crucial because it is not possible to answer complex cancer questions based on data from a single source. The complex queries (e.g., tree query, recursive query, transitive query) require joining, which is a costly operation in the RDBMS.

B. What Is a Knowledge Graph and Why Do We Need It?

A knowledge graph denotes a graph of entities with one or more properties that maintain defined semantic relationships to other entities. In a knowledge graph, the entities are represented as nodes, and the relationships are represented as edges [8], [9]. Many academic projects (e.g., YAGO, DBPedia) and companies (e.g., Google, Microsoft, Facebook) are creating and taking advantage of knowledge graph technology [10], [11]. This paper introduces a semantic web-based [12] knowledge graph approach to storing cancer registry data for analytic and research purposes. In the following, we briefly discuss the reasons for our choice.

Standard data model: A major component of semantic web technology is the Resource Description Framework (RDF). We employed the RDF data model for our knowledge graph creation. The RDF data model is a standard of the World Wide Web Consortium (W3C) [13]. RDF is developed on top of web infrastructure, can be viewed mathematically as a directed labeled graph, and is known for representing a graph of things. Furthermore, RDF data is stored in a triple format of subject, predicate, and object. It is easy to map different data models like relational, tree, key-value store, and graph to RDF triples. The implication is that RDF is effectively a generic data model [9], [14].
Dynamic data growth and flexible schema: Given that it is a graph-based data model, the RDF’s structure is considered more flexible than the RDBMS. This is what makes the RDF a better option for dynamic data integration. Furthermore, the RDF does not need code changes or software redesigns to handle schema evolution [9].
Easy to incorporate semantics and links: A large collection of ontologies [15] and vocabularies [16] are available, which model the concepts of a domain. The RDF supports easy ontology and vocabulary integration compared to the RDBMS [17]. It is easy to dynamically link multiple data sources in the RDF, which is tedious in the RDBMS [9], [18].
Easy to execute complex queries: Complex cancer queries follow graph patterns like tree, long path traversal, and transitivity. The above types of queries require numerous joins, which is an expensive operation in the RDBMS’s Structured Query Language (SQL). However, joining is easy in SPARQL Protocol and RDF Query Language (SPARQL) because SPARQL is particularly designed to query graph structures [9].
Reasoning: An RDF-based knowledge graph supports dynamically finding new facts that are not explicitly stated in the knowledge graph (through description logic) [19], [20]. It is hard to develop a dynamic reasoner on top of an RDBMS system.
Affordable: Openly available linked data browsers, semantic search, link discovery, and many visualization tools can be used for RDF data analysis. Furthermore, hardly any programming effort is needed when using these tools. For instance, rather than browsing with a tailored graphical user interface (GUI), which requires customized software development, researchers can use an existing faceted browser for quick data exploration with ontologies or vocabularies. The GUI will automatically reflect any change in the ontology or data. Many of these tools are freely available on the web. These tools have constraints, but they offer a quick data analysis platform for free [9].
Query expressive power: Although SQL is older than SPARQL, Renzo Angles and Claudio Gutierrez proved that the expressive power of SPARQL is equivalent to that of relational algebra, which is the theoretical foundation for RDBMS and SQL [21]. Hence, along with the other benefits mentioned in this section, the expressive power of SPARQL makes the knowledge graph a better choice for the type of research problem we studied in this paper.
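
The triple model and graph-pattern querying described in the list above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the identifiers (`ex:patient1`, `ex:hasTumor`, `ex:partOf`) are hypothetical, and the `match` and `reachable` helpers only imitate, in plain Python, what a SPARQL triple pattern and a transitive path query would express.

```python
# Each fact is a (subject, predicate, object) triple.
# Hypothetical toy graph; none of these identifiers come from the registry data.
GRAPH = {
    ("ex:patient1", "ex:hasTumor", "ex:tumor1"),
    ("ex:tumor1", "ex:primarySite", "ex:C50.4"),
    ("ex:C50.4", "ex:partOf", "ex:C50"),
    ("ex:C50", "ex:partOf", "ex:BreastNeoplasm"),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None is a wildcard, like a query variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def reachable(graph, start, predicate):
    """Follow `predicate` edges transitively from `start` (one or more steps)."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for _, _, nxt in match(graph, s=node, p=predicate):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

In this representation, recording a new kind of fact only requires adding another triple with a new predicate; no table definition has to change, which is the flexibility the list above attributes to RDF.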

Our aim is to develop a knowledge graph-based scientific digital library platform for advanced cancer data analytics. The main contributions of our work are as follows:

We propose a scientific digital library framework for the secondary use of cancer registry data using a knowledge graph structure. As a proof of concept, we develop a knowledge graph based on the LTR’s CTC dataset abstracted according to the NAACCR data standards [3].
We develop and report the results of scenario-specific hierarchical queries to understand the population-level utilization of cancer treatment sequences.
We execute complex queries over multiple datasets by linking the cancer registry knowledge graph to external datasets to provide an integrated knowledge base for researchers.
We demonstrate iterative schema evolution using anomaly detection as an example case.
We present a knowledge graph visualization that shows the usefulness of visual pattern recognition.

A preliminary version of this work has been reported [22].

II. Related Work

People suffer from more than 100 types of cancer, including pancreatic, breast, and prostate cancer. Pourshams et al. found that worldwide pancreatic cancer cases, deaths, and disability-adjusted life-years (DALYs) more than doubled from 1990 to 2017 [23]. Atieh Vafajoo, Reza Salarian, and Navid Rabiee proposed a suspension microbead array-based method for early breast cancer diagnosis [24]. People of any age can have cancer. In [25], Force et al. discussed how the global burden of childhood cancer disproportionately affects populations with the fewest resources. Although the results presented in [23]–[25] are interesting, the authors did not propose a graph-based dynamic system to support advanced cancer analytics. There have been several attempts to formalize the secondary use of healthcare data for research and other uses [26]–[28], and a number of interesting findings regarding the secondary use of electronic health records [29]. Early work in this area showed the application of a Web Ontology Language (OWL)-based representation of Electronic Health Records (EHR) data [30] but did not create cancer patient-level RDF graphs. Richesson et al. discussed the SHARPn framework, which enables data normalization, secure transport, and common phenotyping facilities on various EHR data [31], but did not use graph-based queries. A relatively small number of studies have explored a semantic web-based knowledge graph approach for the secondary use of cancer registry datasets. Esteban-Gil, Fernandez-Breis, and Boeker proposed a semantic web-based platform for cancer registries for data analysis and visualization [32]. However, they used simulated cancer registry data, not actual cancer patient data.

III. Proposed Cancer Scientific Digital Library Framework

Cancer researchers need a system to satisfy the information needs of users (societies), provide information services (scenarios), organize information in usable ways (structure), present information in useful ways (spaces), and communicate information with users (streams). A scientific digital library can fulfill the above-mentioned important requirements through the 5S (Streams, Structures, Spaces, Scenarios, and Societies) digital library framework [33]. We propose herein a scientific digital library framework for cancer research. Fig. 1 presents a high-level overview of the framework. Data, processing, knowledge graph, service, and user are five layers of the framework. In the following, we briefly discuss each layer.

Fig. 1. High-level overview of the cancer scientific digital library framework.

A. Data

First, the data layer shows that we are handling a wide range of heterogeneous datasets to develop a comprehensive knowledge graph. The datasets to be stored include patient-level cancer tumor data, socioeconomic data, pathology reports, geographical data, climate data, behavioral data, web data, and many more. The above-mentioned datasets are stored in numerous formats (e.g., CSV file, Extensible Markup Language (XML) file, graph file, JSON file).

In the above, we present some dataset category examples. The knowledge graph that is employed to perform this paper’s experimentation contains CTC data, neighborhood CDI data, and Rural-Urban continuum codes data.

B. Processing

Second, in the processing layer, we perform various cleaning steps on the datasets to make sure that our linked dataset is clean, correct, and useful. We have several types of datasets, and different datasets need different types of cleaning. Hence, we have prepared multiple scripts for data cleaning. In the following, we describe some of the checks we have performed through our cleaning scripts.

Data type check: We checked whether the column values were stored in the correct data types (e.g., integer, string, float) or not [34]. For example, in the CTC file, “Patient ID Number” should use integers. We checked the data types of the CTC, neighborhood CDI, and Rural-Urban Continuum Codes data.
Range check: This check tests whether the values are within the permissible range. For example, the “Marital Status” column in CTC data should contain values within the range of 1 to 9. Another example is that the “Percent Black” column in the neighborhood CDI file should not contain any value greater than 100. If we observe a discrepancy in the data, then we store that information (see Section V – “Application 3: Easy Schema Evolution for Iterative Analysis” for more details).
Cross-field consistency check: This check validates a column against the values of other columns [34]. For example, in the case of the CTC dataset, the “Date of Chemotherapy” cannot be earlier than the “Date of Diagnosis.” We conducted a cross-field validation check.
Mandatory field check: Some columns in the input data file cannot be empty [34]; examples are “Patient ID Number” and “Tumor Record Number” in the CTC dataset. We performed the mandatory field check.
Uniqueness check: One or multiple field values should be unique in the dataset. Hence, we employed a uniqueness check, for example on the Federal Information Processing Standards (FIPS) code in the neighborhood CDI and Rural-Urban Continuum Codes datasets.
Format check: We checked the format of the data values. For example, the CTC file contains many dates (e.g., “Date of Surgery”, “Date of Chemotherapy”, “Date of Radiation Therapy”). Date values should contain 8 digits and be stored in year-month-day (YYYYMMDD) format.

We checked the date formats and found that month and day information is missing for some of the dates. We added zeros for the missing month and day information so that all dates consist of 8 digits. Moreover, null values are represented in various ways across the datasets (e.g., space, NA, NR). We set all of them to the database null value (see Section IV – “Approach” for more details).
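
Several of the checks above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' cleaning scripts. It assumes that partial dates arrive as `YYYY` or `YYYYMM` strings (the text only says that month and day may be missing), and all function names are hypothetical.

```python
NULL_TOKENS = {"", "NA", "NR"}  # null spellings named in the text (blank, NA, NR)


def normalize_null(value):
    """Map the various null spellings to None; otherwise return the stripped value."""
    if value is None or value.strip() in NULL_TOKENS:
        return None
    return value.strip()


def pad_date(value):
    """Zero-pad a partial date (assumed YYYY or YYYYMM) to 8 digits (YYYYMMDD)."""
    value = normalize_null(value)
    if value is None:
        return None
    if not value.isdecimal() or len(value) not in (4, 6, 8):
        raise ValueError(f"unexpected date value: {value!r}")
    return value.ljust(8, "0")


def in_range(value, low, high):
    """Range check for an integer-coded column; nulls are left to the mandatory check."""
    value = normalize_null(value)
    return value is None or (value.isdecimal() and low <= int(value) <= high)


def known_prefix(date8):
    """Drop zero-filled parts: '20100300' -> '201003', '20100000' -> '2010'."""
    if date8[4:6] == "00":
        return date8[:4]
    if date8[6:8] == "00":
        return date8[:6]
    return date8


def treatment_not_before_diagnosis(treatment_date, diagnosis_date):
    """Cross-field check, compared only at the precision both dates actually have."""
    t, d = pad_date(treatment_date), pad_date(diagnosis_date)
    if t is None or d is None:
        return True  # nothing to compare
    n = min(len(known_prefix(t)), len(known_prefix(d)))
    return t[:n] >= d[:n]
```

For example, `in_range("5", 1, 9)` accepts a “Marital Status” code, while `treatment_not_before_diagnosis("20100101", "20100315")` flags a chemotherapy date that precedes the diagnosis date. Comparing only the known prefix is a design choice of this sketch, so that a zero-padded date is not flagged merely because its month or day is unknown.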

After cleaning, we incorporate ontologies or vocabularies to organize the datasets. Next, by using our graph engine (D2RQ [35]) we create a knowledge graph for different datasets. Finally, we link various datasets in the knowledge graph.

C. Knowledge Graph

Third, the knowledge graph layer is the core of our cancer scientific digital library and serves as its knowledge base. The “Knowledge Graph” layer of the figure gives a pictorial view of our graph, which has two layers. The first is the data organization layer, which mainly represents various classes (or entities). The second represents class instances. The figure shows an example with patient and neighborhood disadvantage datasets. We stored our knowledge graph in a graph repository (Virtuoso [36]) and provide a SPARQL endpoint on top of that repository.
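
As a rough illustration of how a client might address such an endpoint, the sketch below builds a query URL following the SPARQL 1.1 Protocol convention of passing the query text in a `query` parameter of a GET request. The endpoint address and the query itself are assumptions made for illustration; the paper does not publish its endpoint or vocabulary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a locally hosted endpoint; not an address given in the paper.
ENDPOINT = "http://localhost:8890/sparql"

# Hypothetical query: count the instances of each class in the graph.
QUERY = """
SELECT ?class (COUNT(?s) AS ?n)
WHERE { ?s a ?class }
GROUP BY ?class
"""


def build_request_url(endpoint, query):
    """Build a 'query via GET' URL; the query text is URL-encoded into `query`."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The sketch stops short of sending the request, so it runs without a live triplestore. A client would issue an HTTP GET on `url` and typically ask for a result format through the `Accept` header.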

D. Service

Fourth, the “Service” layer shows that different types of services (or apps), such as treatment sequence analysis, hypothesis generation and testing, data discrepancy analysis, and visualization, can be developed on top of our knowledge graph. We discuss services further in Section V (Applications).

E. User

Finally, the user layer shows that numerous stakeholders (e.g., Researcher, Cancer Registrar, Healthcare Provider, Policymaker) can derive benefits from our scientific digital library.

Our cancer scientific digital library follows the 5S theory. Various types of input and output data are the streams of our digital library. A knowledge graph organizes our data, thus serving as the structure of the digital library. The indexing and information retrieval algorithms of our graph repository serve as the space. The various services we provide (e.g., treatment sequence analysis, visualization) fulfill the 5S theory’s scenario requirement. Finally, we connect various societies (e.g., Researcher, Cancer Registrar) through our digital library.

IV. Knowledge Graph Creation

A. Cancer Registry Data Overview

We used Louisiana Tumor Registry (LTR) data from cancer patients who were Louisiana residents at the time of diagnosis. Each record of our data extract corresponds to a unique cancer, sometimes referred to as a CTC as defined by the NAACCR data standard [37]; each patient may occur in the database more than once if they have more than one primary tumor. While the primary data in the database consists of tumor information at time of diagnosis and first course of treatment, it also contains demographic information, vital status, and date of last contact. Our data is a CSV file containing 240 columns, 374,682 unique tumor records, and is 207 MB in size, containing data for diagnoses from 2000–2016 [3].

B. Approach

Loading the CTC dataset into a relational database: We leverage existing tools to convert the original CSV data into our graph database via an RDBMS. The first step is to load the CSV file into the RDBMS. A schema, which describes the column names, column datatypes, and primary and foreign key information, is required to load the data into a relational database. We used CTC as the table name, created database column names using the column headers found in the CTC extract, and inherited the column datatypes from the raw data. In the CTC CSV file, null values are represented in numerous ways (e.g., space, NA, NR), and all are set to the database’s null value. We used PostgreSQL 9.5.9 as the relational database.
Creating the relational to knowledge graph mapping file: We used the RDF data model to represent our knowledge graph. Numerous RDBMS to RDF conversion tools were available, and the D2RQ tool [35] was selected to create RDF conversion mapping files from the RDBMS. The D2RQ tool maps the relational database table name to the RDF class name, and table attribute names to the RDF property names. We created a CTC mapping file from the PostgreSQL database [38] by using the D2RQ generate-mapping service with D2RQ 0.8.1. The mapping file size, number of triples in the mapping file, and mapping generation time are shown in column 2 of Table I (Louisiana Mapping File).
Knowledge graph generation: We applied the D2RQ mapping file to the PostgreSQL CTC table to generate the materialized CTC knowledge graph. We employed the D2RQ dump-rdf service for RDF graph creation. We provide the knowledge graph size, number of triples, and graph generation time in column 3 of Table I (Louisiana Knowledge Graph).
Loading the knowledge graph into a triplestore: Next, we loaded our knowledge graph into a triplestore. The triplestore provides a SPARQL endpoint that supports graph-based query execution facilities. We used Virtuoso Open-Source Edition 7.2.4 [36] as our triplestore using a virtual machine running CentOS 7.4 with a 2.0 GHz Intel Xeon E7 4850 CPU, 128 GB of memory and 1 TB of local disk storage.
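
The mapping idea in the steps above (table name to class, column name to property) can be sketched conceptually as follows. This is not D2RQ and does not reproduce its output format: the base namespace, the `table_column` property naming, and the example row are hypothetical, and every value is emitted as a plain literal for simplicity.

```python
BASE = "http://example.org/ctc/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table, key_column, row):
    """Map one relational row to N-Triples lines.

    table name -> class, column name -> property, key value -> subject URI.
    Null (None) cells produce no triple.
    """
    subject = f"<{BASE}{table}/{row[key_column]}>"
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}{table}> ."]
    for column, value in row.items():
        if value is None:
            continue
        # Escape backslashes and double quotes inside the literal.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{BASE}{table}_{column}> "{literal}" .')
    return lines


# Made-up example row; not real registry data.
example = {"tumor_record_number": 1, "marital_status": 2, "date_of_surgery": None}
triples = row_to_triples("CTC", "tumor_record_number", example)
```

Each non-null cell becomes one triple and each row contributes one additional type triple, so a wide table yields many triples per record.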

TABLE I.

The size, number of triples, and creation time of the Louisiana mapping file and knowledge graph.

Variables	Louisiana Mapping File	Louisiana Knowledge Graph
Size (KB)	56	16,000,000
Number of Triples	1,467	90,673,527
Creation Time (minutes)	<1	∼75

Query ID	Scenario Path	Outcome
		C50.0		C50.1		C50.2		C50.3		C50.4		C50.5		C50.6		C50.8		C50.9
		RC	DE	RC	DE	RC	DE	RC	DE	RC	DE	RC	DE	RC	DE	RC	DE	RC	DE
Q1.1	S	42	*	303	55	521	92	271	60	1,583	290	324	55	32	*	1,123	191	1,189	244
Q1.2	C			*	*	22	19	*	*	64	59	*	*	*	*	56	49	45	131
Q1.3	R	*		*	*	*	*	*	*	*	*	*	*			*	*	*	18
Q1.4	H						*	*		*	*	*	*	*			*	*	17

Variables	Louisiana CDI Mapping File	Louisiana CDI Knowledge Graph
Size (KB)	4	864
Number of Triples	54	6,767
Creation Time (minutes)	<1	<1

Total TNBC Cases (Female) - Our Result: 1,299
Total TNBC Cases (Female) - Mentioned in [5]: 1,216
AGE, YEARS (%)
Age	Count (Our Result)	Percentage (Our Result)	Percentage (Mentioned in [5])
<30	15	1.2	Not Available
30–39	95	7.3	7.3
40–49	233	17.9	18.6
50–59	376	29	29.3
60–69	303	23.3	23.5
70+	277	21.3	21.3
RACE (%)
Race	Count (Our Result)	Percentage (Our Result)	Percentage (Mentioned in [5])
White	683	52.6	51.8
Black	616	47.4	47.5
SEER SUMMARY STAGE 2000 (%)
Derived SEER Summary Stage 2000	Count (Our Result)	Percentage (Our Result)	Percentage (Mentioned in [5])
In situ	26	2	Not Available
Localized	737	56.7	57.8
Regional	424	32.6	33.3
Distant	107	8.2	8.4
Unknown	5	0.4	0.5

AGE
DESCRIPTIVE STATISTICS
Sample	Sample Size	Mean
Our Result	5	19.76
Result Mentioned in [5]	5	20
ESTIMATION FOR DIFFERENCE
95% CI for Difference	(−12.0287, 11.5487)
TEST
t-value	df	p-value
−0.046947	7.9994	0.9637
RACE
DESCRIPTIVE STATISTICS
Sample	Sample Size	Mean
Our Result	2	50
Result Mentioned in [5]	2	49.65
ESTIMATION FOR DIFFERENCE
95% CI for Difference	(−14.66804, 15.36804)
TEST
t-value	df	p-value
0.10374	1.9319	0.9271
SEER SUMMARY STAGE 2000
DESCRIPTIVE STATISTICS
Sample	Sample Size	Mean
Our Result	4	24.475
Result Mentioned in [5]	4	25
ESTIMATION FOR DIFFERENCE
95% CI for Difference	(−45.03209, 43.98209)
TEST
t-value	df	p-value
−0.028866	5.998	0.9779
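
The non-integer degrees of freedom in the table above suggest a Welch (unequal-variance) two-sample t-test. That is an inference from the reported values, not something stated in this section. Under that assumption, the following sketch recomputes the AGE block from the five age-band percentages that both sources report.

```python
from math import sqrt
from statistics import mean, variance

# Age-band percentages (30-39 through 70+) copied from the comparison table;
# the <30 band is left out because [5] does not report it.
ours = [7.3, 17.9, 29.0, 23.3, 21.3]
reference = [7.3, 18.6, 29.3, 23.5, 21.3]


def welch_t(a, b):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # sample variance (n - 1 denominator) over n
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t, df = welch_t(ours, reference)
```

With these inputs the sample means are 19.76 and 20, and `t` and `df` agree with the reported −0.046947 and 7.9994 to the printed precision. The p-value needs the t-distribution CDF, which the Python standard library does not provide, so it is not recomputed here.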

Variables	Kentucky Mapping File	Kentucky Knowledge Graph
Size (KB)	56	8,100,000
Number of Triples	1,420	48,409,945
Creation Time (minutes)	<1	∼52

Total triple negative breast cancer cases (female): 1,096
AGE, YEARS (%)
Age	Count	Percentage
<30	10	0.9
30–39	59	5.4
40–49	200	18.3
50–59	293	26.7
60–69	301	27.5
70+	233	21.3
RACE (%)
Race	Count	Percentage
White	961	87.7
Black	135	12.3
SEER SUMMARY STAGE 2000 (%)
Derived SEER Summary Stage 2000	Count	Percentage
In situ	29	2.6
Localized	665	60.7
Regional	333	30.4
Distant	68	6.2
Unknown	1	0.1

Abbreviation	Full Form
5S	Streams, Structures, Spaces, Scenarios, and Societies
AA	African American
ACS	American Community Survey
API	Application Programming Interface
CDI	Concentrated Disadvantage Index
CSV	Comma-Separated Values
CTC	Cancer/Tumor/Case
DALY	Disability Adjusted Life Year
DE	Deceased
DF	Degrees of Freedom
EA	European American
EHR	Electronic Health Record
FIPS	Federal Information Processing Standards
GUI	Graphical User Interface
JSON	JavaScript Object Notation
KB	Kilobyte
KCR	Kentucky Cancer Registry
LTR	Louisiana Tumor Registry
NA	Not Available
NCI	National Cancer Institute
NR	Not Reporting
OWL	Web Ontology Language
PHI	Protected Health Information
RC	Right Censored
RDBMS	Relational Database Management System
RDF	Resource Description Framework
SEER	Surveillance, Epidemiology, and End Results
SPARQL	SPARQL Protocol and RDF Query Language
SQL	Structured Query Language
TNBC	Triple Negative Breast Cancer
W3C	World Wide Web Consortium
XML	Extensible Markup Language

Query ID	Scenario Path	SQL Time (Millisecond)	SPARQL Time (Millisecond)	Improvement (Percentage)
Q2.1	S-C	342	82	76.0
Q2.2	S-R	344	75	78.2
Q2.3	S-H	334	83	75.1
Q2.4	C-S	328	84	74.4
Q2.5	C-R	311	75	75.9
Q2.6	C-H	310	79	74.5
Q2.7	R-S	331	78	76.4
Q2.8	R-C	309	74	76.1
Q2.9	R-H	314	76	75.8
Q2.10	H-S	324	75	76.9
Q2.11	H-C	315	77	75.6
Q2.12	H-R	311	78	74.9
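
The “Improvement (Percentage)” column is consistent with the relative reduction in run time, (SQL time − SPARQL time) / SQL time × 100, rounded to one decimal place. This formula is inferred from the tabulated values rather than stated in this section; the short sketch below checks it against a few rows of the table above.

```python
def improvement_pct(sql_ms, sparql_ms):
    """Relative run-time reduction of SPARQL versus SQL, in percent (one decimal)."""
    return round((sql_ms - sparql_ms) / sql_ms * 100, 1)


# (query id, SQL ms, SPARQL ms, reported improvement) copied from the table above.
rows = [
    ("Q2.1", 342, 82, 76.0),
    ("Q2.2", 344, 75, 78.2),
    ("Q2.3", 334, 83, 75.1),
    ("Q2.4", 328, 84, 74.4),
]

# True when the recomputed value equals the reported one for every listed row.
all_match = all(improvement_pct(sql, sparql) == reported
                for _, sql, sparql, reported in rows)
```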

Query ID	Scenario Path	SQL Time (Millisecond)	SPARQL Time (Millisecond)	Improvement (Percentage)
Q3.1	S-C-R	356	79	77.8
Q3.2	S-C-H	342	76	77.8
Q3.3	S-R-C	351	77	78.1
Q3.4	S-R-H	360	73	79.7
Q3.5	S-H-C	338	75	77.8
Q3.6	S-H-R	354	78	78.0
Q3.7	C-S-R	344	75	78.2
Q3.8	C-S-H	337	75	77.7
Q3.9	C-R-S	343	74	78.4
Q3.10	C-R-H	308	89	71.1
Q3.11	C-H-S	335	74	77.9
Q3.12	C-H-R	307	76	75.2
Q3.13	R-S-C	343	78	77.3
Q3.14	R-S-H	341	76	77.7
Q3.15	R-C-S	334	75	77.5
Q3.16	R-C-H	308	105	65.9
Q3.17	R-H-S	337	94	72.1
Q3.18	R-H-C	312	84	73.1
Q3.19	H-S-C	331	90	72.8
Q3.20	H-S-R	337	73	78.3
Q3.21	H-C-S	329	76	76.9
Q3.22	H-C-R	306	72	76.5
Q3.23	H-R-S	331	74	77.6
Q3.24	H-R-C	307	88	71.3

Query ID	Scenario Path	SQL Time (Millisecond)	SPARQL Time (Millisecond)	Improvement (Percentage)
Q4.1	S-C-R-H	345	75	78.3
Q4.2	S-C-H-R	344	82	76.2
Q4.3	S-R-C-H	339	75	77.9
Q4.4	S-R-H-C	343	76	77.8
Q4.5	S-H-C-R	357	77	78.4
Q4.6	S-H-R-C	342	89	74.0
Q4.7	C-S-R-H	328	76	76.8
Q4.8	C-S-H-R	327	79	75.8
Q4.9	C-R-S-H	333	95	71.5
Q4.10	C-R-H-S	330	86	73.9
Q4.11	C-H-S-R	324	77	76.2
Q4.12	C-H-R-S	332	79	76.2
Q4.13	R-S-C-H	330	75	77.3
Q4.14	R-S-H-C	326	74	77.3
Q4.15	R-C-S-H	325	75	76.9
Q4.16	R-C-H-S	324	102	68.5
Q4.17	R-H-S-C	325	75	76.9
Q4.18	R-H-C-S	328	81	75.3
Q4.19	H-S-C-R	326	75	77.0
Q4.20	H-S-R-C	322	73	77.3
Q4.21	H-C-S-R	322	78	75.8
Q4.22	H-C-R-S	324	92	71.6
Q4.23	H-R-S-C	324	75	76.9
Q4.24	H-R-C-S	322	80	75.2

	Louisiana		Kentucky
Age	EA Women	AA Women	EA Women	AA Women
30–34	89,554	51,535	122,215	11,886
35–39	88,387	44,813	125,870	11,272
40–44	93,277	47,579	130,795	10,973
45–49	108,226	52,710	148,031	12,323
50–54	110,552	52,786	147,354	12,442
55–59	100,534	45,944	135,413	10,318
60–64	88,181	34,711	120,089	7,513
65–69	68,860	24,203	91,424	5,137
70–74	53,553	18,523	70,765	3,970
75+	116,048	33,449	147,045	8,040
