Author manuscript; available in PMC: 2025 Aug 13.
Published in final edited form as: Surgery. 2018 Jul 27;164(4):640–642. doi: 10.1016/j.surg.2018.06.022

Big Data: More Than Big Datasets

Adrienne N Cobb 1,2,#, Andrew J Benjamin 3,#, Erich S Huang 4,5,6, Paul C Kuo 7
PMCID: PMC12345320  NIHMSID: NIHMS1501917  PMID: 30061040

Abstract

“Big Data” has become a popularized term over the past decade and is often used to refer to datasets that are too large and/or complex to be analyzed by traditional means. While it has been utilized for some time in business and engineering, the concept of Big Data is relatively new to medicine. Although the reception from the medical community has been mixed, the widespread utilization of electronic health records (EHR) in the United States, the creation of large clinical datasets and national registries that capture information on numerous vectors affecting health care delivery and patient outcomes, and the sequencing of the human genome all present opportunities to leverage this big data. This review was inspired by a lively panel discussion on big data that took place at the 75th Central Surgical Association Annual Meeting. The authors aim to describe big data, the methodologies used to analyze it, and its practical clinical application.

Keywords: big data, machine learning, predictive modeling, health care, surgical analytics

What is big data?

“Big data” has become a popularized term over the past decade and is often used to refer to datasets that are too large and/or complex to be analyzed by traditional means. Often, it has been taught that big data is any dataset that cannot be readily stored or analyzed using spreadsheet programs such as Excel. Opinions about the usefulness and promise of big data vary widely. Some believe it is the future by which novel insights can be made about rare disease processes, while others believe it simply adds additional noise. Regardless of whether someone feels big data is the future or a nuisance, it is here to stay. The widespread utilization of electronic health records (EHR) in the United States, the creation of large clinical datasets and national registries that capture information on numerous vectors affecting health care delivery and patient outcomes, and the sequencing of the human genome are all opportunities to leverage this big data to our advantage and that of our patients. In fact, we now live in an era in which data repositories are being generated at an ever-increasing pace. It has been estimated that more data have been created in the past two years than in the entire history of the human race.

Although collecting such large amounts of data can at times sacrifice data quality, comprehensive data collection has the potential benefit of mitigating bias compared with using small, high-quality samples of data (fewer human assumptions in the algorithm also decrease bias). The sheer amount of data requires analytic techniques that can handle not only the volume of information but also the potential interactions among variables, interactions that would amplify bias if handled with traditional statistical methods.

Because vast amounts of data are now collected on patients and their surgical outcomes, techniques such as regression and multivariate analysis often fall short in leveraging the advantages that such large datasets potentially offer. Statistical methods were developed in the context of detecting and summarizing relationships in small data sets that are purposefully constructed and structured. These methods are theory-driven, deductive in nature, and take a confirmatory approach. Alternatively, newer data-science methodologies are used to discover new patterns and new knowledge in data sets that are realistic, opportunistic, and often messy. These methods are data-driven, inductive in nature, and take an exploratory approach. When statistical approaches are used inappropriately with “big data”, they may find a “signal” in any sufficiently large data set, even if it is just noise. Additionally, machine learning algorithms often make more accurate predictions when used with large datasets.1

One of the more promising data-science tools available to researchers for making accurate predictions from data is machine learning. Machine learning is a subfield of artificial intelligence focused on constructing algorithms that can learn from and make predictions on data. Although the term “machine learning” was first coined in 1959 by Arthur Samuel, it was not until recently that advances in computing power and accessibility allowed for widespread utilization of machine learning algorithms, as “big datasets” became more readily available. Machine learning is often thought of as an algorithm that learns to perform a task or make a decision automatically from data, as opposed to being explicitly programmed. In reality, however, machine learning and statistics exist along a continuum from fully human-guided data analysis to fully machine-guided data analysis.2 As fewer human assumptions are placed into an algorithm, that algorithm moves further toward the machine-guided end of the spectrum.

Supervised vs. Unsupervised Learning

Machine learning algorithms can “learn” in two fundamentally different ways: supervised and unsupervised. Supervised machine learning algorithms are trained using examples of a known output or target. The goal is to create a model capable of predicting the desired target from a novel data set. Supervised machine learning is often done in the context of classification or regression. Example algorithms include logistic regression, support vector machines, artificial neural networks, and random forests. The goal is to create a model that will take input data and produce correct output data (as determined from the training data). Alternatively, unsupervised machine learning is used with unlabeled data to find naturally occurring patterns or groupings within the data. Interpreting the results of unsupervised machine learning algorithms is inherently more difficult, and often the utility of the findings is determined by performance in subsequent supervised learning tasks.3
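As a concrete illustration, the short sketch below contrasts the two learning modes on synthetic data using the scikit-learn library; the dataset, algorithm choices, and parameters are illustrative assumptions rather than a recommended clinical workflow.

```python
# Minimal sketch: supervised vs. unsupervised learning on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic "patients": 10 features, binary outcome (e.g., complication yes/no)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: the algorithm learns from labeled examples and predicts the target
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels are provided; the algorithm looks for natural groupings
clusters = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print("cluster sizes:", [(clusters == k).sum() for k in (0, 1)])
```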

Another major advantage of machine learning algorithms is the ability of the models to “evolve” over time. As the model is used, it produces feedback data, which, in combination with newly collected data, allow the model to continue refining itself. As long as a sufficient stream of data is available, the predictive capacity of the model will continue to improve and can even adapt to changes in the underlying phenomenon being measured.
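A minimal sketch of this idea, assuming scikit-learn's incremental-learning interface and a simulated data stream (both are illustrative assumptions, not a production pipeline):

```python
# Minimal sketch: a model that keeps refining itself as new batches of data arrive.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])
rng = np.random.default_rng(0)

for batch in range(20):                                   # each loop mimics newly collected data
    X_new = rng.normal(size=(100, 5))
    y_new = (X_new[:, 0] + 0.05 * batch > 0).astype(int)  # the underlying phenomenon slowly drifts
    model.partial_fit(X_new, y_new, classes=classes)      # the model updates without retraining from scratch
```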

Machine learning has also fundamentally changed the types of raw data that can be analyzed. For example, consider a medical image such as a computed tomography (CT) examination. Previously, we might have used the radiologist's interpretation of the scan or the size of a lesion as data points; with advances in computational power, however, algorithms such as convolutional neural networks can analyze an image on a pixel-by-pixel basis. Analysis of these pixels allows the algorithm to do things such as identify lung nodules or predict the presence or development of Alzheimer's disease.4 Given that IBM researchers estimate that medical images now account for 90% of all imaging data, the promise of machine learning to analyze the raw data contained within medical images will surely lead to promising advances in the future.

Examples of Machine Learning

Due to access to such large data sources and advances in computing power over the last decade, advanced machine learning algorithms have become more practical and useful as tools for analysis and prediction. This advance is key, because traditional statistical analyses are often overwhelmed not only by the sheer volume of data but also by their inability to deal with non-linear data. We will discuss three algorithms commonly used in machine learning to deal with big data: support vector machines, random forest (RF) models, and convolutional neural networks.

Support vector machines (SVMs) are a supervised learning method that can be used for both classification and regression. Existing data train the algorithm, which then classifies new or test data. SVMs perform classification through the development of a multi-dimensional hyperplane that partitions variables into groups. Both linear and non-linear data can be used to train the algorithm. There are four main tuning parameters in SVM. The first, the “kernel”, defines whether the line of separation is linear or curved (e.g., circular), depending on the amount of transformation needed. The “regularization parameter” tells the SVM optimization how much you want to avoid misclassifying each training sample. The “gamma parameter” determines how far the influence of a single training example reaches, with low values meaning ‘close’ and high values meaning ‘far’. Last, but most important, is the “margin”, which describes how far each respective class is from the line of separation.5 The goal of SVMs is to create a maximum-margin hyperplane that lies in a transformed input space and splits the example classes while maximizing the distance to the nearest cleanly split examples.6 SVMs are useful in real-life, practical classification problems, such as text categorization and facial recognition,7 both of which have potential applications in health care. With the advent of the EHR, there is an abundance of unstructured data in the form of progress notes, discharge summaries, and other written communications that could be potentially useful in improving health care quality. SVMs have tremendous potential to help people better organize electronic resources. The same algorithms utilized for face recognition can be applied to evaluating imaging modalities such as MRI.7
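The sketch below maps these tuning parameters onto scikit-learn's SVC interface; the synthetic data and parameter values are illustrative assumptions only, not recommended settings.

```python
# Minimal sketch: an SVM classifier with the tuning parameters described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf",    # "kernel": linear vs. non-linear (here, radial basis) separation
        C=1.0,           # regularization: how strongly misclassified training samples are penalized
        gamma="scale"),  # gamma: how far the influence of a single training example reaches
)
svm.fit(X_train, y_train)
print("held-out accuracy:", svm.score(X_test, y_test))
```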

A decision tree is a model that splits data variables at discrete cut-points, which are then often shown graphically as “branches of a tree”. Traditional decision trees often have sub-par predictive ability and are prone to overfitting; however, there are modified decision tree models, such as random forest (RF) models, which provide markedly improved predictive accuracy. RF models are a bagged tree model, in which multiple trees are combined so that the final model is a collection of many trees. In addition, only random samples of predictor variables are considered at each split of the tree. These features allow RF models to automatically investigate interactions and non-linear effects of predictors. This approach stands in stark contrast to traditional models such as logistic regression, in which such effects must be prespecified. One of the most common criticisms of machine learning is that the algorithms are “black boxes”, which often leads to suspicion in the field of medicine. An advantage of RF models, however, is their ability to determine feature importance, as well as their easy-to-visualize outputs with discrete branch points and cutoffs for several variables. Additionally, when several models are compared, machine learning algorithms often outperform traditional methods such as logistic regression, yielding a better C-statistic and a clear ranking of variable importance.
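As a simple illustration of bagging, random feature selection at each split, and variable-importance output, here is a sketch using scikit-learn; the predictor names are hypothetical and the data are synthetic.

```python
# Minimal sketch: a random forest with variable importance, helping open the "black box".
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=2)
feature_names = ["age", "bmi", "asa_class", "op_time", "creatinine", "albumin"]  # hypothetical predictors

rf = RandomForestClassifier(
    n_estimators=500,     # the forest is a collection ("bag") of many trees
    max_features="sqrt",  # only a random subset of predictors is considered at each split
    random_state=2,
).fit(X, y)

# Rank predictors by their contribution to the model
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```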

Convolutional neural networks (CNNs) are a deep learning algorithm used most commonly with image data. A CNN consists of a series of “nodes” inspired by the structure of the human visual cortex. In general, an advantage of CNNs is that they require minimal preprocessing, imparting an independence from human input that can be a substantial advantage. An interesting medical application of this algorithm is the management of pulmonary nodules in screening for lung cancer. Ciompi et al used CNNs to better manage the large amounts of CT data now being produced with the advent of screening for lung cancer in heavy smokers. Using multi-scale, multi-dimensional convolutional neural networks, they were able to process raw CT data without any additional information, such as nodule size or segmentation. The network then learned a 3D representation of each nodule by analyzing an arbitrary number of 2D views. They went on to show that the CNN achieved performance at classifying nodule type that surpassed classic machine learning models and was within the inter-observer variability among four experienced human observers. Because the algorithm automatically classifies by type all of the nodules that are relevant for continued workup, this approach has the potential to increase the efficiency of diagnosis and treatment of pulmonary nodules.8 Classification of lung nodules is only one of many promising uses of CNNs in clinical medicine. For example, models have been developed that diagnose breast cancer metastases from digital pathology slides,9 diagnose diabetic retinopathy,10 and identify malignant skin lesions.11 As data continue to be collected, the capability of these algorithms will continue to evolve, further increasing their accuracy and usefulness to physicians.
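A minimal sketch of a small convolutional network for nodule-type classification, written with the Keras API; the input size, number of classes, and architecture are assumptions for illustration and are far simpler than the multi-scale, multi-view network of Ciompi et al.

```python
# Minimal sketch: a toy CNN that classifies small CT image patches into nodule types.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),              # a single-channel CT patch around a nodule (assumed size)
    layers.Conv2D(16, 3, activation="relu"),      # convolutions learn local pixel patterns
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="softmax"),        # e.g., six hypothetical nodule categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```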

What is next?

We have discussed the meaning of big data and some of its applications in health care, but where do we go from here? What will contribute to advances in the use of machine learning? Dr. Erich S. Huang, MD, PhD, of Duke University's Forge center for health data science, discussed with us some of the potential for learning health in an era of data science. He described the DREAM Breast Cancer Prognostic challenge by Sage Bionetworks, in which the BCC provided a community of data analysts with “a common platform for data access and blinded evaluation of model accuracy in predicting breast cancer survival on the basis of data from gene expression, copy number, and clinical covariates”.12 Because molecular biomarkers have shown promise for clinical decision-making in breast cancer, these molecular markers can be utilized to distinguish biologically relevant groupings beyond clinical measures and have the potential to inform treatment strategies. The goal of the challenge was to see if any of the data analyst teams could produce a model to predict overall survival that outperformed the current models for breast cancer prognosis, using predefined performance criteria, real-time feedback, transparent sharing of source code, and a blinded final validation set. The authors state that this study was not designed for direct clinical deployment of a suite of complex biomarkers, but rather to lay the groundwork for future challenges designed to tackle clinically actionable questions. Each team was given a training set on which they built their models, which were subsequently tested and validated on a separate, hold-out dataset. This process was repeated over several rounds as more than 1,400 models were produced by the community of data analysts. They found that the best-performing model significantly outperformed available best-in-class methodologies. Even so, the improvement of the best-performing model (CI 0.76) was moderate with respect to the score achieved by aggregating standard clinical information (CI 0.72). This finding demonstrates that the models themselves are not enough. Advances in health care will come in data processing, which will allow machine learning algorithms to make better predictions. Dr. Huang pointed out that the most difficult portion of learning health data science is not the execution of the models, but the feature engineering, “data munging”, and preparation of the data prior to modeling. Additionally, the strongest determinant of model performance was not the model itself, but the size of the data being utilized. There are several levels at which data can be analyzed, from a single hospital to national data, or from a single exome to the entire genome. The tools are there, but we must learn how to conduct this research properly to maximize its benefits in a clinical setting.
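To make the evaluation metric concrete, the sketch below shows how a concordance index might be computed; the lifelines library and the toy survival data are illustrative assumptions, not the challenge's actual scoring code.

```python
# Minimal sketch: scoring a prognostic model with a concordance index (CI).
from lifelines.utils import concordance_index

observed_months = [12, 30, 45, 60, 72]       # follow-up time for each patient (toy data)
event_observed  = [1, 1, 0, 1, 0]            # 1 = death observed, 0 = censored
predicted_risk  = [0.9, 0.7, 0.4, 0.5, 0.2]  # model output; higher = predicted worse prognosis

# concordance_index expects higher scores to mean longer predicted survival,
# so risk scores are negated before scoring.
ci = concordance_index(observed_months, [-r for r in predicted_risk], event_observed)
print(f"concordance index: {ci:.2f}")        # 1.0 = perfect ranking, 0.5 = random
```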

Big data and machine learning promise to fundamentally change how medicine is practiced. Machine learning algorithms utilizing big data have already proven to be highly effective clinical tools when implemented correctly, but widespread implementation of machine learning algorithms will require that physicians understand the key differences between conventional approaches using “small data” and the different approaches needed to use “big data”. As physicians become more comfortable with big data approaches, their willingness and desire to collect and process large amounts of unbiased data will lead to the advances necessary to improve patient outcomes, streamline physician workflow, and uncover novel associations that may go unnoticed with smaller, more biased datasets.

Acknowledgements

We thank the Central Surgical Association for the opportunity to present on this novel topic and for supporting innovation in the surgical community.

Funding: This work was not funded.

Footnotes

Summary Sentences

This report provides an introduction to big data in health care and its potential applications. The importance of this report is that the use of machine learning provides a means to leverage big data into potential improvements in outcome for patients.

Disclosures

Erich S. Huang has the following disclosures:

1. Founder, kēlaHealth (Startup)

2. Founder, Stratus Medicine (Startup)

3. Founder, MedBlue Data (Startup)

The remaining authors have no disclosures.

Meeting Information

The information contained in this review was presented as a panel discussion during the Central Surgical Association Annual Meeting held on March 15–17, 2018 in Columbus, Ohio.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. Churpek MM, Yuen TC, Winslow C, Meltzer DO, Kattan MW, Edelson DP. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med. 2016;44(2):368–74.

2. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317–8.

3. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–30.

4. Morra JH, Tu Z, Apostolova LG, Green AE, Toga AW, Thompson PM. Comparison of AdaBoost and support vector machines for detecting Alzheimer's disease through automated hippocampal segmentation. IEEE Trans Med Imaging. 2010;29(1):30–43.

5. Patel S. Machine Learning 101: Support Vector Machine. https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theorv-f0812effc72. Accessed May 29, 2018.

6. Shmilovici A. Support vector machines. In: Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer; 2009. p. 231–47.

7. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. Support vector machines. IEEE Intelligent Systems and their Applications. 1998;13(4):18–28.

8. Ciompi F, Chung K, van Riel SJ, Setio AA, Gerke PK, Jacobs C, et al. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Sci Rep. 2017;7:46479.

9. Liu Y, Gadepalli K, Norouzi M, Dahl GE, Kohlberger T, Boyko A, et al. Detecting cancer metastases on gigapixel pathology images. arXiv [cs.CV]. 2017. http://arxiv.org/abs/1703.02442.

10. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.

11. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

12. Margolin AA, Bilal E, Huang E, Norman TC, Ottestad L, Mecham BH, et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Sci Transl Med. 2013;5(181):181re1.
