Scientific Data. 2019 Oct 23;6:227. doi: 10.1038/s41597-019-0206-3

A database for using machine learning and data mining techniques for coronary artery disease diagnosis

R Alizadehsani 1, M Roshanzamir 2, M Abdar 3, A Beykikhoshk 4, A Khosravi 1, M Panahiazar 5, A Koohestani 1, F Khozeimeh 6, S Nahavandi 1, N Sarrafzadegan 7,8
PMCID: PMC6811630  PMID: 31645559

Abstract

We present the coronary artery disease (CAD) database, a comprehensive resource comprising 126 papers and 68 datasets relevant to CAD diagnosis, extracted from the scientific literature published between 1992 and 2018. These data were collected to help advance research on CAD-related machine learning and data mining algorithms and, ultimately, to advance clinical diagnosis and early treatment. To aid users, we have also built a web application that presents the database through various reports.

Subject terms: Cardiovascular diseases, Data integration, Data mining


Measurement(s) coronary artery disease
Technology Type(s) digital curation
Factor Type(s) year • disease
Sample Characteristic - Organism Homo sapiens

Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.9825680

Background & Summary

According to the World Health Organization (WHO; http://www.who.int/news-room/fact), cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs comprise diseases of the heart and blood vessels, such as coronary heart disease (CHD), cerebrovascular disease, and rheumatic heart disease (RHD), among others. According to the latest WHO report, available at http://www.who.int/news-room/fact and http://www.who.int/cardiovascular_diseases/en/, more than 17.7 million people are estimated to have died from CVDs in 2015, accounting for 31% of all deaths globally. The report also estimates that approximately 7.4 million people died of CHD, which is also called coronary artery disease (CAD)1–3. In other words, CVDs, and CAD in particular, are among the deadliest diseases in both developed and developing countries, and addressing them is vital.

Although the CAD mortality rate is high, the chance of survival is higher if the diagnosis is made early enough. Therefore, scientists have devised predictive models to identify high-risk patients. Recently, machine learning (ML) and data mining (DM) approaches have become more popular for constructing models for the early diagnosis not only of CAD4–11 but also of other fatal diseases12–17 such as cancer18–24. These techniques reveal hidden structures in large amounts of medical data that help to achieve a quicker diagnosis25. Indeed, this is a semi-automated approach for finding patterns in data26.

Although there are some datasets for various diseases27–40, there are no comprehensive, publicly available benchmarks that summarize the research and conclusions on CAD diagnosis. As a result, the studies in this field are not well organized. One solution to this problem is to create a database of all studies that collects their related information. Using this database, researchers can explore the latest work in the field and stay informed about newly proposed methods and the results achieved. Therefore, this research attempts to provide a comprehensive dataset of the related work at the intersection of ML/DM and heart disease detection as a bridge for future research. The impact of CAD on our daily lives and the popularity of ML/DM motivated us to create such a database. To the best of our knowledge, this is the first database that covers most relevant datasets as well as the related outcomes obtained by ML algorithms. It is also a key resource for recognizing other modifications of ML/DM techniques that are relevant to CAD progression and development.

This database provides comprehensive and fundamental information on the early detection of CAD in order to illuminate the patterns and processes that are used in ML/DM approaches. For instance, Alizadehsani et al.3 used an SVM to distinguish patients with CAD from healthy individuals. Their model had an accuracy of 95% and revealed that, apart from typical chest pain, regional wall motion abnormality and ejection fraction (echocardiographic features), age, and hypertension have the highest importance in CAD diagnosis. Therefore, ML/DM techniques can be useful to biologists, computer scientists, healthcare researchers, and physicians who are experts in the CAD area.

The advantages of using ML/DM methods for CAD diagnosis can be summarized as follows3:

  • It may result in early detection that leads to a decrease in mortality rate.

  • ML/DM can provide an a priori probability of disease, which can be used to selectively target patients for angiography. This saves cost and time for the other patients and spares them the side effects of angiography.

  • Using ML/DM can extract hidden patterns in the collected data. This may lead to finding new methods for early detection in many diseases like CAD.

Although ML/DM techniques have many advantages, they are not perfect methods. The following factors limit their abilities in some directions.

  • According to the no-free-lunch theorem41, different ML/DM algorithms are suited to different problems. One algorithm may work well on a specific dataset but perform poorly on others. So, selecting a suitable algorithm for a specific dataset is a big challenge in bioinformatics. Consequently, selecting good feature selection or classification algorithms is also a big challenge in this field.

  • ML/DM algorithms commonly need massive datasets to be trained. These datasets must be inclusive, unbiased, and of high quality. Collecting such datasets also takes time42.

  • ML/DM algorithms need time to be trained and tested enough to be able to generate results with high confidence. These algorithms need a lot of resources and equipment43.

  • ML/DM algorithms face the verification problem. It is difficult to prove that the predictions they make are correct in all scenarios43.

  • The correct interpretation of the generated results by ML/DM algorithms is another challenge that we are faced with42.

  • Another disadvantage of ML/DM algorithms is their high susceptibility to error. If they are trained with biased or incorrect data, they produce imprecise outputs. This may lead to a chain of errors that misleads treatment. Once these errors are noticed, it takes some time to diagnose their source and even more time to correct them44.

The benefits of using our collected dataset can be listed as follows:

  • Researchers can quickly and easily access the results of much state-of-the-art research in this field: for example, the important features for CAD in each country, comparisons of performance across studies, the features used in each study, and other useful information that can be extracted from this dataset. Consequently, researchers can identify under-studied areas and avoid duplicating existing work.

  • This dataset facilitates literature review, letting researchers work more quickly and with higher quality. Referees can also use it to check the novelty of newly proposed methods and to access the important properties of articles quickly. Using this dataset, the leading researchers, the better-performing algorithms, and the journals with the most published work in this field can be found easily. It can also be extended to the diagnosis of other diseases, especially more common ones such as diabetes, cancers, and hepatitis.

However, this dataset has some disadvantages. Currently, it is updated manually: finding new papers, extracting their properties, and adding them to the dataset are all done by hand.

Meanwhile, one of the most important weaknesses of the research we investigated is the size of the datasets used. Almost none of the researchers use big datasets, because collecting many records takes a lot of time and money. Results with extremely high confidence require more than one million records. Projects like Electronic Health Records (EHR)45 can help to achieve this goal. In an EHR project, information about patients is stored in a digital format and can be shared across health care centers to ease the treatment process. An EHR includes almost all necessary patient records, such as medical history, medications, procedures, vitals, allergies, and laboratory test results. This mechanism has improved the quality of care, and increasing the number of samples in EHRs should further improve treatment methods. It can also reduce the risk of data replication, as there is only one medical file for each patient, which is usually kept fully up to date. As all the information about a patient is saved in a searchable digital file, EHRs are more effective for extracting information for treatment methods. Population-based methods can also be applied more easily with widespread adoption of EHRs.

State-of-the-art methods like deep learning can also benefit from EHRs, because deep learning needs large amounts of data and these data are collected in EHRs. Deep learning46 is a family of machine learning algorithms based on artificial neural networks and inspired by the distributed processing of biological systems. It has shown significant ability in solving machine learning problems and is now used in various fields such as machine vision, image processing, and bioinformatics.

Deep Survival Analysis47 is a hierarchical generative approach to survival analysis using EHR data. It handles the characteristics of EHR data and enables accurate risk scores for an event of interest. Traditional survival analysis48,49 suffers from some weaknesses; for example, the high dimensionality and sparsity of EHR data make traditional models difficult to use. Deep Survival Analysis differs from traditional survival methods in that all observations are modeled jointly, conditioned on a rich latent structure.

As a clinical implication of this research, physicians can use this dataset to select more effective features for CAD diagnosis according to the region in which they work. This can increase the accuracy of their diagnosis and support the early treatment of patients. As mentioned above, it can also reduce the use of angiography for suspected patients and so avoid the side effects of an unnecessary procedure. More importantly, our system can work as a recommendation engine that helps clinicians decide on a specific treatment for a specific group of patients with similar characteristics, as a step toward personalized medicine.

Methods

We designed and implemented, for the first time, a complete dataset of the research in the CAD diagnosis field. This is an important field in which many researchers work, so access to a complete resource of its research can help researchers improve their work more quickly and precisely. The dataset also includes utilities for extracting information from the saved data, which are accessible at www.cadataset.com. Using this dataset, new information can be extracted; for example, a physician can find which features are more important in CAD diagnosis in different regions.

This study concentrates on recent papers from 1992 to 2018 that are related to CAD diagnosis using ML/DM techniques. For the sake of completeness, we used Google Scholar to find the most related articles. The database includes information such as authors, publisher, title of the paper, country (of publication or where the research was conducted), methods that are applied, evaluation metrics, type of diseases, features that are used, journal/conference, and the most important features used in their analysis (e.g., Alizadehsani et al., Elsevier, a data mining approach for diagnosis of coronary artery disease, Iran, [SVM, Naïve Bayes], [Accuracy, Sensitivity, Specificity], CAD, 55 features, computer methods and programs in biomedicine, 36 features).

As many CAD-related articles are published every year, we built the dataset such that it can be easily updated. Using this database, for each paper published in the field, one can determine in which countries the data were collected and which features have been reported to be important. Features not considered in each study are also identified, allowing researchers to examine them in the future. The database reveals which authors have the most influence in the field, so that others can draw on their experience, and which journals and publishers have published the most articles in the field, to help researchers decide where to publish. All of the algorithms used to date and the accuracy they achieved are identified, so that researchers can choose algorithms that have not yet been used and compare them with previous results. The most-cited articles are identified so that researchers can build on their ideas. The datasets and feature categories that achieved the highest accuracy are identified to help researchers with feature selection, as are the feature selection algorithms that researchers have used. Finally, the articles that achieved the highest accuracy are identified to help researchers decide which features and methods give better results.

As future work, several issues remain in the mechanisms used for collecting and managing our dataset. They are summarized as follows:

  • There are no published data for most countries in Europe, Africa, Australia, and South America. This lack of information is important as regional and racial differences may affect the way CAD is detected and treated. Thus, we recommend collecting CAD data and constructing databases from various continents and countries.

  • Most of the investigated datasets have a limited number of features. This severely limits the final results since the number of both samples and features can affect the performance of ML techniques. Hence, we will construct CAD databases with more features.

  • The median sample size of the CAD datasets investigated in this field is less than 500. The larger the number of samples, the more statistically reliable the results. To ensure reliability and trustworthiness, a model should be developed and tested using at least one million samples50.

  • Another problem with previous studies is the way the data were collected. Since the datasets differ in terms of the number of samples and features, it is not easy to compare ML techniques in terms of performance. In other words, the results obtained in various studies are comparable only if the data are the same. This dataset can help researchers to select features that make their research comparable with others.

  • As mentioned above, our dataset is currently updated manually. The tool that is now used to manipulate the dataset needs to be improved so that it can update the dataset automatically.

Database structure

The database designed in this research includes 15 tables, shown in Tables 1–15. These tables include the following information:

  • the field name,

  • whether it is a primary key (P.K.) or a part of it,

  • if it is a foreign key (F. K.), and if yes, to which table it refers,

  • and a brief description of that field.

Table 1.

The fields of “Journals/Conferences” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| Journal/ConferenceID | | | The ID defined for each journal or conference |
| Journal/ConferenceType | | | This field indicates if this record is a journal or conference |
| Journal/ConferenceName | | | The name of the journal or conference |
| Publisher | | | The publisher of journal or conference |

Table 15.

The fields of “Papers_ImportantFeatures” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| PaperID | | Papers (PaperID) | The ID defined for each paper |
| DiseaseID | | Diseases (DiseaseID) | The ID defined for each disease |
| DatasetID | | Datasets (DatasetID) | The ID defined for each dataset |
| FeatureID | | Features (FeatureID) | The ID defined for each feature |
| FeatureRank | | | The reported rank of the feature |

Based on these tables, we prepared various reports to extract important information from them. The description of the most important reports is shown in Table 16. As there are many reports extracted from this database, the description of other reports is available from our web application help section.

Table 1 lists the journals and conferences in which the investigated papers were published. In Table 2, the authors of the investigated papers are listed. Commonly in each paper, there are one or more datasets to which the proposed algorithms were applied. These datasets are listed in Table 3. Currently, we investigated only four heart diseases that are listed in Table 4. They are CAD and stenosis of LAD, LCX, and RCA. This list, however, is extendable to other diseases in the future. The features investigated in the articles are listed in Table 5, and the list of methods used for diagnosis is shown in Table 6. In most of the papers, the researchers selected a subset of features in the investigated datasets. The feature selection algorithms are listed in Table 7. Table 8 is dedicated to the characteristics of research papers but not the review papers in the field. The characteristics of review papers are shown in Table 9. Table 10 shows which method was applied on a specific disease in a dataset in a specific paper. The results of applying this method are also reported. Table 11 shows the features of each dataset. Table 12 indicates the authors of each research paper, while Table 13 indicates the authors of review papers. Since we separated the tables of research papers and review papers, we did the same for their authors as well. The research papers and review papers have different fields to report on. Therefore, we used different tables to save their details. To specify the feature selection algorithm that was used in each paper, Table 14 is designed. Finally, in Table 15, the rank that was assigned to each selected feature was reported.
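To illustrate how the entity tables and the linking tables described above relate, the following is a minimal sketch of three of them (Authors, Papers, and Papers_Authors) in an in-memory SQLite database. The column types, the composite primary key on Papers_Authors, and the sample rows are assumptions made for illustration; the released MySQL and SQL Server scripts define the actual schema.

```python
import sqlite3

# Hypothetical subset of the schema. Field names follow Tables 2, 8 and 12;
# the column types and the composite primary key are assumptions.
SCHEMA = """
CREATE TABLE Authors (
    AuthorID   INTEGER PRIMARY KEY,
    AuthorName TEXT NOT NULL
);
CREATE TABLE Papers (
    PaperID       INTEGER PRIMARY KEY,
    PaperName     TEXT NOT NULL,
    FirstAuthorID INTEGER REFERENCES Authors (AuthorID),
    Year          INTEGER
);
CREATE TABLE Papers_Authors (
    PaperID  INTEGER NOT NULL REFERENCES Papers (PaperID),
    AuthorID INTEGER NOT NULL REFERENCES Authors (AuthorID),
    PRIMARY KEY (PaperID, AuthorID)
);
"""


def build_demo_db():
    """Create an in-memory database with foreign-key enforcement enabled."""
    connection = sqlite3.connect(":memory:")
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)
    return connection


def insert_is_rejected(connection, sql):
    """Return True if the statement violates a database constraint."""
    try:
        connection.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


conn = build_demo_db()
# Invented sample rows, not taken from the database.
conn.execute("INSERT INTO Authors VALUES (1, 'Example Author')")
conn.execute("INSERT INTO Papers VALUES (1, 'Example paper', 1, 2018)")
conn.execute("INSERT INTO Papers_Authors VALUES (1, 1)")

# A link to a non-existent author violates the foreign key and is refused.
orphan_rejected = insert_is_rejected(
    conn, "INSERT INTO Papers_Authors VALUES (1, 99)"
)
print(orphan_rejected)
```

Keeping authors in their own table and linking them to papers through Papers_Authors stores each author name once, which is in line with the minimum-redundancy goal of the design.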

Table 16.

Some of the most important reports and their descriptions.

| Report name | Report description |
|---|---|
| Frequency of winning machine learning methods | This report shows how popular/successful a machine learning technique is. For each machine learning technique, it reports on the total number of papers that have used it in their analysis, and how many times it outperformed the other techniques. |
| Important features reported for a specific disease in a specific country | This table gives a detailed list of what features are collected in each country for each disease. Moreover, one can see the number of papers that have emphasized the importance of each feature, as well as the mean of ranks given to that feature as a proxy of feature importance in that country. |
| Comparison of machine learning methods in each dataset | Usually, each paper reports the results of applying multiple machine learning methods on a dataset. This table allows us to compare how these algorithms perform on a dataset. It reports the paper title, the dataset that it has used, and the difference between the accuracy of two selected methods. |
| Important features in each disease | This report lists the set of features that are reported to be important for each heart disease. First, a disease needs to be selected from the drop-down menu. The resulting table will show the feature name, the number of papers that reported the feature to be important for the selected disease, and the mean of the reported ranks for that feature. The smaller the rank, the more important the feature is. |
| Papers vs. Specific feature category | Feature categories represent the set of features that are obtained from the same resources. For example, ECG category represents the set of features that are obtained from electrocardiography. For each feature category, this table reports the paper titles and the accuracy they have achieved. |
| Number of papers using a specific algorithm by year | For a given machine learning method, this table reports the number of papers using that method per year, the title of publication with the best performance and the highest accuracy achieved. |
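As an illustration of how a report such as "Important features in each disease" could be derived from the tables, the sketch below counts the papers that report each feature for one disease and averages the reported ranks. It is not the web application's actual query: the SQL is an assumption based on the report description and the field names in Tables 5 and 15, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Features (FeatureID INTEGER PRIMARY KEY, FeatureName TEXT);
CREATE TABLE Papers_ImportantFeatures (
    PaperID INTEGER, DiseaseID INTEGER, DatasetID INTEGER,
    FeatureID INTEGER, FeatureRank INTEGER
);
-- Invented sample rows, for illustration only.
INSERT INTO Features VALUES (1, 'Feature A'), (2, 'Feature B');
INSERT INTO Papers_ImportantFeatures VALUES
    (1, 1, 1, 1, 1),
    (2, 1, 1, 1, 3),
    (2, 1, 1, 2, 1),
    (3, 2, 1, 2, 5);
""")


def important_features(connection, disease_id):
    """Return (feature name, number of papers, mean rank) for one disease.

    Rows are ordered by mean rank, smallest first, because a smaller rank
    means a more important feature.
    """
    query = """
        SELECT f.FeatureName,
               COUNT(DISTINCT pif.PaperID) AS PaperCount,
               AVG(pif.FeatureRank)        AS MeanRank
        FROM Papers_ImportantFeatures AS pif
        JOIN Features AS f ON f.FeatureID = pif.FeatureID
        WHERE pif.DiseaseID = ?
        GROUP BY f.FeatureID, f.FeatureName
        ORDER BY MeanRank ASC, PaperCount DESC
    """
    return connection.execute(query, (disease_id,)).fetchall()


report = important_features(conn, disease_id=1)
for feature_name, paper_count, mean_rank in report:
    print(feature_name, paper_count, mean_rank)
```

The country-specific variant of the report would presumably add a join to the Datasets table and a filter on its Country field.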

Table 2.

The fields of “Authors” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table | Description |
|---|---|---|---|
| AuthorID | | | The ID defined for each author |
| AuthorName | | | The name of the author |

Table 3.

The fields of “Dataset” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| DatasetID | | | The ID defined for each dataset |
| DatasetName | | | The name of the dataset |
| DatasetSampleSize | | | Number of records in each dataset |
| Country | | | The name of the country where the dataset was collected |

Table 4.

The fields of “Disease” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| DiseaseID | | | The ID defined for each disease |
| DiseaseName | | | The name of the disease (for now they are only CAD or stenosis of LAD, LCX or RCA) |

Table 5.

The fields of “Features” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table | Description |
|---|---|---|---|
| FeatureID | | | The ID defined for each feature |
| FeatureName | | | The name of each feature |
| Abbreviation | | | Abbreviation of each feature (if it exists and is used commonly) |
| FeatureCategory | | | The category that the feature belongs to |

Table 6.

The fields of “Methods” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table | Description |
|---|---|---|---|
| MethodID | | | The ID defined for each method |
| MethodName | | | The name of the machine learning method that was used in each study |
| MethodCategory | | | The category that this method belongs to |

Table 7.

The fields of “Feature Selection Algorithms” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| FeatureSelectionID | | | The ID defined for the feature selection algorithm |
| FeatureSelectionName | | | The name of the feature selection algorithm |

Table 8.

The fields of “Papers” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Definition |
|---|---|---|---|
| PaperID | | | The ID defined for each paper |
| PaperName | | | Title of paper |
| FirstAuthorID | | Authors (AuthorID) | The ID of the first author |
| Year | | | The year this research has been published |
| Journal/ConferenceID | | JournalsConferences (Journal/ConferenceID) | The ID of the journal or conference that this research has been published in |
| Train-Test Separation Method | | | Which method is used for the train and test separation method |
| ShortDescriptionAboutMainMethod | | | A short description of the main method |
| ConclusionsReportedByAuthors | | | A short description of the conclusion |
| NumberOfCitation | | | Number of citations of this research |

Table 9.

The fields of “Review Articles” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| PaperID | | | The ID defined for each paper |
| PaperName | | | Title of paper |
| FirstAuthorID | | Authors (AuthorID) | The ID of the first author |
| Year | | | The year this research has been published |
| Journal/ConferenceID | | JournalsConferences (Journal/ConferenceID) | The ID of the journal or conference that this research has been published in |
| InvestigatedResearchFrom | | | The year the investigation begins |
| InvestigatedResearchTo | | | The year the investigation ends |
| Number of investigated papers | | | Number of papers investigated in each review paper |
| Number of citations | | | Number of citations of each review paper |
| NotableConclusion | | | Notable conclusion of each research |

Table 10.

The fields of “Papers_Datasets_Diseases_Methods” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| PaperID | | Papers (PaperID) | The ID defined for each paper |
| DatasetID | | Datasets (DatasetID) | The ID defined for each dataset |
| DiseaseID | | Disease (DiseaseID) | The ID defined for each disease |
| MethodID | | Methods (MethodID) | The ID defined for each method |
| IsMainMethod | | | Whether this method is the main method (the method with the highest performance) of the research |
| Accuracy% | | | The reported accuracy |
| Sensitivity(Recall)% | | | The reported sensitivity |
| Specificity% | | | The reported specificity |
| F-Measure% | | | The reported F-Measure |
| AUC | | | The reported AUC |
| Precision% | | | The reported precision |
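For readers less familiar with the evaluation metrics stored in this table, the sketch below computes them from the four cells of a binary confusion matrix using the standard textbook definitions. The function name and the example counts are illustrative assumptions; individual papers may define or average these metrics differently, and AUC is omitted because it requires per-sample scores rather than counts.

```python
def percentage(numerator, denominator):
    """Return 100 * numerator / denominator, or None if undefined."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics, expressed as percentages.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives.
    """
    return {
        "Accuracy%": percentage(tp + tn, tp + fp + tn + fn),
        "Sensitivity(Recall)%": percentage(tp, tp + fn),
        "Specificity%": percentage(tn, tn + fp),
        "Precision%": percentage(tp, tp + fp),
        # F-measure is the harmonic mean of precision and recall,
        # which simplifies to 2TP / (2TP + FP + FN).
        "F-Measure%": percentage(2 * tp, 2 * tp + fp + fn),
    }


# Invented counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(name, round(value, 2))
```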

Table 11.

The fields of “Datasets_Features” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| DatasetID | | Datasets (DatasetID) | The ID defined for each dataset |
| FeatureID | | Features (FeatureID) | The ID defined for each feature |

Table 12.

The fields of “Papers_Authors” table, their properties and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| PaperID | | Papers (PaperID) | The ID defined for each paper |
| AuthorID | | Authors (AuthorID) | The ID defined for each author |

Table 13.

The fields of “Papers_Authors (review articles)” table, their properties, and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| PaperID | | | The ID defined for each paper |
| AuthorID | | | The ID defined for each author |

Table 14.

The fields of “Papers_FeatureSelectionAlgorithms” table, their properties, and descriptions.

| Field name | P. K. | F. K. from Table (Field) | Description |
|---|---|---|---|
| PaperID | | Papers (PaperID) | The ID defined for each paper |
| FeatureSelectionID | | FeatureSelectionAlgorithms (FeatureSelectionID) | The ID defined for each feature selection algorithm |

Data Records

We chose to index papers on CAD detection using machine learning and data mining approaches that were published between 1992 and 2018. These criteria resulted in 126 papers (see Fig. 1a), to which 490 authors contributed. These papers studied 68 different datasets with more than 360 distinct features, collected in 18 countries in Asia, Europe, and America (see Fig. 1b). In these papers, 140 different machine learning or data mining techniques were applied to diagnose CAD. We extracted the data from these articles. They are available at www.cadataset.com and within figshare51 at 10.6084/m9.figshare.8092514.

Fig. 1.


Source of our dataset and its distribution worldwide. (a) The number of sources included in the database by year of publication. (b) The distribution of the datasets across countries. Most datasets were collected in the USA, followed by India, China, Turkey, and Iran.

These data were saved in the tables shown in Fig. 2. This figure shows the tables, their fields, the primary keys, and the corresponding relationships between tables. In these tables, we saved the research and their properties. We tried to design the dataset with minimum redundancy in the saved data using the methods proposed by Silberschatz et al.52. For more information about the tables and their fields, please refer to Tables 1–15.

Fig. 2.


Structure of the database (The relationships between tables). The key icons in the tables show the primary keys of those tables, and the key icons in the relationships between tables show the source tables of foreign keys in the tables.

This database is provided in three formats: comma-separated values (.csv), a MySQL script file (.sql), and a SQL Server script file (.sql). To use the CSV file, open it in Excel or import it into a database. To use the SQL Server file, create an empty database named “cadataset” in SQL Server Management Studio and then run the script file. To use the MySQL file, create an empty database named “cadataset” in MySQL and then run the script in that database. Users can download the data and SQL code from our website at www.cadataset.com. The whole database underlying the CAD DATASET website was also uploaded to figshare51.
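As one possible way to import the CSV format into a database, the following sketch loads CSV text into an SQLite table using only the Python standard library. The layout assumed here (one file per table, with a header row of field names) is an assumption and is not confirmed by this article, so it should be checked against the downloaded files; the sample text and the helper name `load_csv_table` are hypothetical.

```python
import csv
import io
import sqlite3


def load_csv_table(connection, table_name, columns, csv_file):
    """Load CSV rows into a table and return the number of rows inserted.

    Assumes the first CSV row is a header containing the given column
    names. Empty cells are stored as NULL.
    """
    reader = csv.DictReader(csv_file)
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name}" ({column_list})'
    )
    rows = [tuple(record[name] or None for name in columns) for record in reader]
    connection.executemany(
        f'INSERT INTO "{table_name}" ({column_list}) VALUES ({placeholders})',
        rows,
    )
    return len(rows)


# Invented CSV text standing in for a downloaded file; with a real file,
# pass open(path, newline="", encoding="utf-8") instead of io.StringIO.
sample_csv = io.StringIO("AuthorID,AuthorName\n1,Example Author\n2,\n")

conn = sqlite3.connect(":memory:")
loaded = load_csv_table(conn, "Authors", ["AuthorID", "AuthorName"], sample_csv)
stored = conn.execute('SELECT AuthorID, AuthorName FROM "Authors"').fetchall()
print(loaded, stored)
```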

Missing values were included as empty cells if they were not foreign keys. If they were foreign keys, we defined a new record for them in the source table. For example, as DatasetID in the table Papers_Datasets_Diseases_Methods is a foreign key, this field cannot be left null when the dataset used in a paper is not mentioned. Therefore, we defined a record in the table Datasets as “not mentioned” and used its ID in the tables in which DatasetID was a foreign key.
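The handling of missing foreign-key values described above can be sketched as follows, using an in-memory SQLite database. The table and field names follow Tables 3 and 10, but the column types, the reduced column set, and the helper `dataset_id_for` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Datasets (
    DatasetID   INTEGER PRIMARY KEY,
    DatasetName TEXT NOT NULL UNIQUE
);
CREATE TABLE Papers_Datasets_Diseases_Methods (
    PaperID   INTEGER NOT NULL,
    DatasetID INTEGER NOT NULL REFERENCES Datasets (DatasetID),
    Accuracy  REAL  -- not a foreign key, so a missing value stays NULL
);
""")

NOT_MENTIONED = "not mentioned"


def dataset_id_for(connection, dataset_name):
    """Return the ID for a dataset name, creating the record if needed.

    A missing name maps to the shared "not mentioned" record, so the
    foreign key never has to be NULL.
    """
    label = dataset_name if dataset_name else NOT_MENTIONED
    row = connection.execute(
        "SELECT DatasetID FROM Datasets WHERE DatasetName = ?", (label,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = connection.execute(
        "INSERT INTO Datasets (DatasetName) VALUES (?)", (label,)
    )
    return cursor.lastrowid


# Two invented papers that do not name their dataset share one record.
first_id = dataset_id_for(conn, None)
second_id = dataset_id_for(conn, "")
conn.execute(
    "INSERT INTO Papers_Datasets_Diseases_Methods VALUES (?, ?, ?)",
    (1, first_id, None),
)
print(first_id == second_id)
```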

Technical Validation

The retrieved articles (254 in total) were reviewed by seven authors. To collect the dataset, we used the following keywords in Google Scholar:

(LAD OR LCX OR RCA OR CAD OR “Coronary artery disease” OR “Atherosclerotic heart disease” OR “Atherosclerotic vascular disease” OR “Coronary heart disease”) AND (disease OR failure OR diagnosis OR prognosis OR treatment) AND (“machine learning” OR “data mining” OR “machine intelligent” OR classification OR clustering).

The validity of our research was assessed according to two factors. First, the relevance of an article to the topic had to be confirmed by at least five of the seven reviewing authors of the current paper, and an article published in an unreliable journal or conference was rejected. As each data value in the database has a primary source, users can evaluate the validity of the information in the database. Finally, 125 papers were selected as our main articles and saved in our database. Because our extracted results aggregate the results of these 125 articles, an error in any single study is unlikely to influence the overall results. Second, the extracted results were confirmed by six experienced cardiologists. They validated all of the final results of this research: they carefully examined the full text, the extracted plots, and the tables and figures, and confirmed the results according to their knowledge and experience.
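The first validation factor amounts to a simple inclusion rule, sketched below. How the votes were actually recorded and how venue reliability was judged are not described in this article, so the `Candidate` structure and its fields are hypothetical.

```python
from dataclasses import dataclass

REVIEWERS = 7   # authors who reviewed the retrieved articles
MIN_VOTES = 5   # minimum number of reviewers confirming relevance


@dataclass
class Candidate:
    """A retrieved article awaiting an inclusion decision (hypothetical)."""
    title: str
    relevance_votes: int   # reviewers who judged the article relevant
    reliable_venue: bool   # whether the journal or conference is reliable


def is_included(candidate):
    """Apply the rule: reliable venue and at least five of seven votes."""
    if not 0 <= candidate.relevance_votes <= REVIEWERS:
        raise ValueError("relevance_votes must be between 0 and 7")
    return candidate.reliable_venue and candidate.relevance_votes >= MIN_VOTES


# Invented candidates, for illustration only.
candidates = [
    Candidate("Paper X", relevance_votes=6, reliable_venue=True),
    Candidate("Paper Y", relevance_votes=4, reliable_venue=True),
    Candidate("Paper Z", relevance_votes=7, reliable_venue=False),
]
selected = [c.title for c in candidates if is_included(c)]
print(selected)
```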

Usage Notes

Web application

To use this database, we designed a web application that allows the project administrator to add, remove or edit the records of tables. As shown in Fig. 3, the users can access the facilities according to their permissions. As shown in Fig. 4, the administrator can manipulate tables and reports. However, other users can only see the report results. The facilities prepared for the administrator and other users are shown in Figs 5 and 6, respectively. Moreover, a video clip was created to explain the usage of this tool. It can be viewed on http://cadataset.com/help.

Fig. 3.


The front page of our web application in different modes. (a) The front page of our web application before login. There are four options. The first is a link to the home page. The second shows the list of reports that all users can see. The third is contact information, and the fourth is used to log in to or out of the system. (b) The front page after login; two more options appear. The first shows the list of tables, and the second shows the email address of the logged-in user. (c) The login page. Currently, only the administrator can log in to add, edit and remove data and reports. Other users do not need to log in.

Fig. 4.


List of tables and reports. (a) A screenshot from the list of tables. Guests cannot see the list of tables. (b) A screenshot from the list of reports. (c) A screenshot from the list of reports for the administrator. Please note there is another option in the list that the administrator can use to manage the reports.

Fig. 5.

The facility prepared for an administrator to add, edit, and remove the data. The first option is used to add a new record to the table. The second option is used to determine the number of rows shown on a page. The third option can be used to export the table to the CSV file format. The fourth option can be used to edit or delete the specific record, and finally, the fifth option can be used to search the table.

Fig. 6.

Reports. (a) Shows the output of a report as a table. (b) Shows output as a chart. The first and the second options in this form determine the horizontal and vertical axes, respectively. (c) In some reports, filters must be applied to data. For example, in this report, we need to specify the disease and country in which we are interested to see the most important features reported.

The database was last updated on 1/10/2018. Some values may change after this date, for example the number of citations of an article, the number of articles published by a publisher, and the number of articles published by a researcher. For this reason, the database is designed so that it can be updated easily; once it is updated, all the reports are updated automatically. The database is also designed to check the consistency of new information, such as an author's name or a journal name, with the information already stored.
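How such a check on author and journal names might work is not described in detail, so the following is only a minimal sketch under stated assumptions. It normalizes names (case, accents, punctuation) and looks for a close match among names already stored, using Python's standard `difflib` and `unicodedata` modules. The function names and the 0.85 similarity cutoff are hypothetical choices, not taken from the web application.

```python
import difflib
import unicodedata
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Lower-case a name, strip accents and periods, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().replace(".", " ").split())


def find_existing(new_name: str, existing: Iterable[str],
                  cutoff: float = 0.85) -> Optional[str]:
    """Return the stored name that closely matches new_name, or None.

    The cutoff is an assumed threshold; a real system would tune it.
    """
    by_norm = {normalize(n): n for n in existing}
    matches = difflib.get_close_matches(
        normalize(new_name), list(by_norm), n=1, cutoff=cutoff
    )
    return by_norm[matches[0]] if matches else None


# Journal names taken from this article's reference list.
journals = ["Knowledge-Based Systems", "Expert Systems with Applications"]
match = find_existing("Knowledge-based systems", journals)
no_match = find_existing("Nature", journals)
```

A match suggests that the new entry should reuse the stored record instead of creating a near-duplicate; `None` suggests a genuinely new name.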

Author contributions

R.A., M.P. and M.R. performed the experiment. R.A., M.R., M.A., A.K., A.B., F.K., S.N., A.K., and N.S. analysed the data and wrote the manuscript. R.A., M.R., and M.P. revised the article.

Code Availability

The data extracted from the investigated articles were stored in an SQL database. See the database structure section for details of the database tables. A web application was developed as an interface to this dataset and is used to extract statistics from the saved data. The server-side application was developed using Microsoft SQL Server Express 2016 and ASP.NET MVC 5. The user interface uses the Bootstrap framework in addition to customized JavaScript libraries for plots, tables, and menus. The SQL database and the web application source code are available at www.cadataset.com and within figshare (ref. 51). While only the administrator of the website can modify previous records and add new data to the system, its reports are available to all users. Users can request that new reports be added. The reports can be exported to CSV format, and the results can also be plotted and summarized in a simple figure. Our website is compatible with all popular modern web browsers (tested on Mozilla Firefox ver. 63, Microsoft Internet Explorer ver. 11, and Google Chrome ver. 70).
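Because the reports can be exported to CSV, they can be analysed further outside the web application. The sketch below, which uses only the Python standard library, counts how often each value occurs in one column of an exported report. The column names (`Year`, `Algorithm`) and the sample rows are invented placeholders, since the actual layout depends on the report that is exported.

```python
import csv
import io
from collections import Counter

# Placeholder content standing in for an exported report; the column names
# and values are assumptions, not real data from the database.
SAMPLE_EXPORT = """Year,Algorithm
2016,method_a
2017,method_b
2017,method_a
"""


def count_by_column(csv_text: str, column: str) -> Counter:
    """Count the occurrences of each value in one column of a CSV report."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in report")
    return Counter(row[column] for row in reader)


counts = count_by_column(SAMPLE_EXPORT, "Algorithm")
```

For a real export, the file contents could be read with `open(path, newline="", encoding="utf-8").read()` and passed to the same function.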

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Alizadehsani R, et al. A data mining approach for diagnosis of coronary artery disease. Computer Methods and Programs in Biomedicine. 2013;111:52–61. doi: 10.1016/j.cmpb.2013.03.004.
  • 2. Alizadehsani R, et al. Coronary artery disease detection using computational intelligence methods. Knowledge-Based Systems. 2016;109:187–197. doi: 10.1016/j.knosys.2016.07.004.
  • 3. Alizadehsani R, et al. Non-invasive detection of coronary artery disease in high-risk patients based on the stenosis prediction of separate coronary arteries. Computer Methods and Programs in Biomedicine. 2018;162:119–127. doi: 10.1016/j.cmpb.2018.05.009.
  • 4. Pławiak P. Novel methodology of cardiac health recognition based on ECG signals and evolutionary-neural system. Expert Systems with Applications. 2018;92:334–349. doi: 10.1016/j.eswa.2017.09.022.
  • 5. Acharya UR, et al. Automated characterization of coronary artery disease, myocardial infarction, and congestive heart failure using contourlet and shearlet transforms of electrocardiogram signal. Knowledge-Based Systems. 2017;132:156–166. doi: 10.1016/j.knosys.2017.06.026.
  • 6. Acharya UR, et al. Automated detection of coronary artery disease using different durations of ECG segments with convolutional neural network. Knowledge-Based Systems. 2017;132:62–71. doi: 10.1016/j.knosys.2017.06.003.
  • 7. Stuckey TD, et al. Cardiac Phase Space Tomography: A novel method of assessing coronary artery disease utilizing machine learning. PLoS One. 2018;13:e0198603. doi: 10.1371/journal.pone.0198603.
  • 8. Kampouraki A, Manis G, Nikou C. Heartbeat Time Series Classification With Support Vector Machines. IEEE Transactions on Information Technology in Biomedicine. 2009;13:512–518. doi: 10.1109/TITB.2008.2003323.
  • 9. Green M, et al. Comparison between neural networks and multiple logistic regression to predict acute coronary syndrome in the emergency room. Artificial Intelligence in Medicine. 2006;38:305–318. doi: 10.1016/j.artmed.2006.07.006.
  • 10. Lahsasna A, Ainon RN, Zainuddin R, Bulgiba A. Design of a Fuzzy-based Decision Support System for Coronary Heart Disease Diagnosis. Journal of Medical Systems. 2012;36:3293–3306. doi: 10.1007/s10916-012-9821-7.
  • 11. Uğuz H. A Biomedical System Based on Artificial Neural Network and Principal Component Analysis for Diagnosis of the Heart Valve Diseases. Journal of Medical Systems. 2012;36:61–72. doi: 10.1007/s10916-010-9446-7.
  • 12. Chuang C-L. Case-based reasoning support for liver disease diagnosis. Artificial Intelligence in Medicine. 2011;53:15–23. doi: 10.1016/j.artmed.2011.06.002.
  • 13. Sartakhti JS, Zangooei MH, Mozafari K. Hepatitis disease diagnosis using a novel hybrid method based on support vector machine and simulated annealing (SVM-SA). Computer Methods and Programs in Biomedicine. 2012;108:570–579. doi: 10.1016/j.cmpb.2011.08.003.
  • 14. Chen H-L, Liu D-Y, Yang B, Liu J, Wang G. A new hybrid method based on local fisher discriminant analysis and support vector machines for hepatitis disease diagnosis. Expert Systems with Applications. 2011;38:11796–11803. doi: 10.1016/j.eswa.2011.03.066.
  • 15. Kaya Y, Uyar M. A hybrid decision support system based on rough set and extreme learning machine for diagnosis of hepatitis disease. Applied Soft Computing. 2013;13:3429–3438. doi: 10.1016/j.asoc.2013.03.008.
  • 16. Santhanam T, Padmavathi MS. Application of K-Means and Genetic Algorithms for Dimension Reduction by Integrating SVM for Diabetes Diagnosis. Procedia Computer Science. 2015;47:76–83. doi: 10.1016/j.procs.2015.03.185.
  • 17. Kandhasamy JP, Balamurali S. Performance Analysis of Classifier Models to Predict Diabetes Mellitus. Procedia Computer Science. 2015;47:45–51. doi: 10.1016/j.procs.2015.03.182.
  • 18. Furey TS, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16:906–914. doi: 10.1093/bioinformatics/16.10.906.
  • 19. Polat K, Güneş S. Breast cancer diagnosis using least square support vector machine. Digital Signal Processing. 2007;17:694–701. doi: 10.1016/j.dsp.2006.10.008.
  • 20. Wolberg WH, Street WN, Mangasarian OL. Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer Letters. 1994;77:163–171. doi: 10.1016/0304-3835(94)90099-X.
  • 21. Cho, S.-B. & Won, H.-H. Machine learning in DNA microarray analysis for cancer classification. Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics 19, 189–198 (2003).
  • 22. Wang Y, et al. Gene selection from microarray data for cancer classification—a machine learning approach. Computational Biology and Chemistry. 2005;29:37–46. doi: 10.1016/j.compbiolchem.2004.11.001.
  • 23. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal. 2015;13:8–17. doi: 10.1016/j.csbj.2014.11.005.
  • 24. Cruz JA, Wishart DS. Applications of Machine Learning in Cancer Prediction and Prognosis. Cancer informatics. 2006;2:59–78. doi: 10.1177/117693510600200030.
  • 25. Abdar M, Zomorodi-Moghadam M, Das R, Ting IH. Performance analysis of classification algorithms on early detection of liver disease. Expert Systems with Applications. 2017;67:239–251. doi: 10.1016/j.eswa.2016.08.065.
  • 26. Han, J., Pei, J. & Kamber, M. Data mining: concepts and techniques. (Elsevier, 2011).
  • 27. Matern WM, Bader JS, Karakousis PC. Genome analysis of Mycobacterium avium subspecies hominissuis strain 109. Scientific Data. 2018;5:180277. doi: 10.1038/sdata.2018.277.
  • 28. Santoro SW, Jakob S. Gene expression profiling of the olfactory tissues of sex-separated and sex-combined female and male mice. Scientific Data. 2018;5:180260. doi: 10.1038/sdata.2018.260.
  • 29. Pereira IT, et al. Polysome profiling followed by RNA-seq of cardiac differentiation stages in hESCs. Scientific Data. 2018;5:180287. doi: 10.1038/sdata.2018.287.
  • 30. Fedorov A, et al. An annotated test-retest collection of prostate multiparametric MRI. Scientific Data. 2018;5:180281. doi: 10.1038/sdata.2018.281.
  • 31. Gadkari M, et al. Transcript- and protein-level analyses of the response of human eosinophils to glucocorticoids. Scientific Data. 2018;5:180275. doi: 10.1038/sdata.2018.275.
  • 32. Marconi M, Sesma A, Rodríguez-Romero JL, González MLR, Wilkinson MD. Genome-wide polyadenylation site mapping datasets in the rice blast fungus Magnaporthe oryzae. Scientific Data. 2018;5:180271. doi: 10.1038/sdata.2018.271.
  • 33. Grossberg AJ, et al. Author Correction: Imaging and clinical data archive for head and neck squamous cell carcinoma patients treated with radiotherapy. Scientific Data. 2018;5:1. doi: 10.1038/s41597-018-0002-5.
  • 34. Caufield JH, et al. A reference set of curated biomedical data and metadata from clinical case reports. Scientific Data. 2018;5:180258. doi: 10.1038/sdata.2018.258.
  • 35. Du Z, et al. Combined RNA-seq and RAT-seq mapping of long noncoding RNAs in pluripotent reprogramming. Scientific Data. 2018;5:180255. doi: 10.1038/sdata.2018.255.
  • 36. Barupal DK, et al. Generation and quality control of lipidomics data for the alzheimer’s disease neuroimaging initiative cohort. Scientific Data. 2018;5:180263. doi: 10.1038/sdata.2018.263.
  • 37. Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data. 2018;5:180251. doi: 10.1038/sdata.2018.251.
  • 38. Phua YL, Clugston A, Chen KH, Kostka D, Ho J. Small non-coding RNA expression in mouse nephrogenic mesenchymal progenitors. Scientific Data. 2018;5:180218. doi: 10.1038/sdata.2018.218.
  • 39. Salomon MP, et al. Brain metastasis DNA methylomes, a novel resource for the identification of biological and clinical features. Scientific Data. 2018;5:180245. doi: 10.1038/sdata.2018.245.
  • 40. Jones L, et al. EEG, behavioural and physiological recordings following a painful procedure in human neonates. Scientific Data. 2018;5:180248. doi: 10.1038/sdata.2018.248.
  • 41. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE transactions on evolutionary computation. 1997;1:67–82. doi: 10.1109/4235.585893.
  • 42. Kotsiantis SB, Zaharakis ID, Pintelas PE. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review. 2006;26:159–190. doi: 10.1007/s10462-007-9052-3.
  • 43. Chen L-D, Sakaguchi T, Frolick MN. Data Mining Methods, Applications, and Tools. Information Systems Management. 2000;17:65–70. doi: 10.1201/1078/43190.17.1.20000101/31216.9.
  • 44. Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of Clinical Epidemiology. 1996;49:1225–1231. doi: 10.1016/S0895-4356(96)00002-9.
  • 45. Blumenthal D, Tavenner M. The “Meaningful Use” Regulation for Electronic Health Records. New England Journal of Medicine. 2010;363:501–504. doi: 10.1056/NEJMp1006114.
  • 46. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539.
  • 47. Ranganath R, Perotte A, Elhadad N, Blei D. Deep Survival Analysis. Proceedings of the 1st Machine Learning for Healthcare Conference. 2016;56:101–114.
  • 48. Hagar Y, et al. Survival analysis with electronic health record data: Experiments with chronic kidney disease. Statistical Analysis and Data Mining. 2014;7:385–403. doi: 10.1002/sam.11236.
  • 49. Perotte A, Ranganath R, Hirsch JS, Blei D, Elhadad N. Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. Journal of the American Medical Informatics Association. 2015;22:872–880. doi: 10.1093/jamia/ocv024.
  • 50. Esteva A, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–118. doi: 10.1038/nature21056.
  • 51. Alizadehsani R. 2019. Cadataset Dataset. figshare.
  • 52. Silberschatz, A., Korth, H. F. & Sudarshan, S. Database system concepts, 3rd Edition. McGraw-Hill. 4, 7–27 (1997).

Articles from Scientific Data are provided here courtesy of Nature Publishing Group
