Abstract
Big data coupled with precision medicine has the potential to significantly improve our understanding and treatment of complex disorders, such as cancer, diabetes, depression, etc. However, an essential problem is that data are stuck in silos, and it is difficult to precisely identify which data would be relevant and useful for any particular type of analysis. While the process to acquire and access biomedical data requires significant effort, in many cases the data may not provide much insight into the problem at hand. Therefore, there is a need to be able to measure the utility/relevance of additional datasets for a particular biomedical research task without direct access to the data. Towards this, in this paper, we develop a privacy-preserving approach to create synthetic data that can provide a first-order approximation of utility. We evaluate the proposed approach with several biomedical datasets in the context of regression and classification tasks and discuss how it can be incorporated into existing data management systems such as REDCap.
Introduction
Today, we have the unprecedented opportunity to gather genomic, transcriptomic, clinical, behavioral, and social data, in ways relevant to health. Analysis of this data can enable new pathways to discovery and can improve the understanding, prevention and treatment of complex disorders such as cancer, diabetes, depression, etc., which are significantly on the rise. Indeed, the Precision Medicine Initiative crucially depends on data-driven science and research.
However, unfettered use of data poses significant concerns regarding patient privacy and the abuse of access to sensitive data. Indeed, the digitization of health information, without appropriate controls, magnifies the risk to privacy, due to the ease of retrieval, analysis, and linkage. Privacy and confidentiality are critical to healthcare. Improving privacy protection encourages people and organizations to share data and realize hidden insights. For example, orphan diseases can be treated more effectively when more observations from different regions in the world are shared and aggregated. Similarly, personalized medicine can be targeted to individuals more accurately if more patients similar to the person of interest are observed and analyzed.
Preserving privacy is a non-trivial task because any protection scheme essentially involves a tradeoff with data utility. Simple strategies may lead to private but uninformative data disclosure, and vice versa. This is especially challenging if the data required for a study resides at several different institutions. Note that in the early stages of any study, the researcher needs to explore the data to understand its utility for the study purpose. Often, the researcher needs to iteratively access the data across the different sources. Even though data exploration may not require fine-grained access to all of the data, researchers have to spend an inordinate amount of time trying to get access to relevant data. In many cases, they may not even know whether the data would actually help their particular analysis. For example, consider a researcher at Rutgers Biomedical Health Sciences (RBHS), Newark, carrying out a survival analysis of pancreatic cancer patients using the clinical trials data collected in the local REDCap instance. Since the Newark population is overwhelmingly African-American, who typically have higher incidence rates and suffer from more aggressive forms of cancer, the researcher would like to carry out a comparative analysis of survival after adjuvant radiotherapy and/or chemotherapy with respect to the general population. To do this, the researcher requires access to linked cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) Program1 to get information about the cancer site, stage, grade, etc., as well as access to the corresponding Medicare/Medicaid data from the Centers for Medicare & Medicaid Services (CMS) to get information about the specific therapy.
If the researcher would like to study the health costs associated with patients having other comorbidities such as diabetes and contrast them with patients who do not, the researcher would also need access to data from the Healthcare Cost and Utilization Project (HCUP) provided by AHRQ2, and/or to clinical data such as DMITRI1 from the National Center for integrating Data for Analysis, Anonymization, and Sharing (iDASH)3. The researcher would like to confirm which of these data sources would be relevant to his/her study, and find out which specific data files from these sources should be retrieved. Note that, at this stage, it is sufficient for the researcher to get this confirmation from an approximate analysis without extensive data integration for each different source, and to then retrieve only the relevant data.
Our goal is to precisely enable such exploratory analysis by providing an assessment of the data with respect to utility while preserving its privacy. Towards this, we develop a technique to generate sample datasets that preserve the structure and semantics of the original data, but not exact values, thus preserving its privacy. We evaluate the proposed approach with several different biomedical datasets in the context of classification and regression and demonstrate its effectiveness. While our approach for generating synthetic data does not directly provide a measure quantifying the utility of the dataset for a specific type of analysis (e.g., regression), our evaluation does show that the area under the curve (AUC) / root mean squared error (RMSE) of models built from synthetic data are comparable to those of models built from real data. Thus, a researcher can get a first-order approximation of the overall usefulness of the real data. For example, suppose that a researcher is interested in exploring the relationship between vitamin D levels and cancer diagnosis, and is interested in looking at three different datasets which have been collected at institutions in different geographic regions of the country. Using the corresponding synthetic data, the researcher might be able to get an estimate of where such a relationship might exist, whether the strength of this relationship is more significant in a particular region of the country, or whether the combined data shows the same relationship or not.
An overview of random decision trees
The proposed data generation approach is based on Random Decision Trees (RDTs), developed by Fan et al.4 The Random Decision Trees algorithm builds m iso-depth (i.e., equal-depth) random decision trees. As opposed to typical decision tree learning, the trees built in RDTs are completely random in structure - i.e., the structure of a random tree is constructed completely independently of the training data. However, the statistics recorded for each node are computed based on the training data. Thus, the RDT training phase consists of building the trees (BuildTreeStructure) and populating the nodes with training instance statistics (UpdateStatistics). It is assumed that the number of attributes is known from the training dataset. The depth of each tree is decided based on a heuristic: Fan et al.4 show that when the depth of the tree is equal to half of the total number of features present in the data, the most diversity is achieved, preserving the advantage of random modeling.
The process for generating a tree is as follows. First, we start with the list of features (attributes) from the dataset. We then grow a tree by randomly choosing one of the features at each node, without using any training data. The tree stops growing once the height limit is reached. Then, the training data are used to update the statistics of each node. Note that only the leaf nodes need to record the number of examples of the different classes that are classified through the nodes in the tree. The training data are scanned exactly once to update the statistics in multiple random trees. Based on both statistical analysis and conducted experiments, 10 to at most 30 trees are sufficient for most applications4–6. When used for classification, a new instance is evaluated with respect to all of the random trees constructed and the average class probability distribution is output. In our case, we do not need the classification phase: after constructing the RDT model, we can use it for synthetic data generation as discussed in the following section.
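To make the two training steps concrete, the following is a minimal Python sketch of the process described above. The function names mirror the BuildTreeStructure and UpdateStatistics procedures, but the dictionary-based data layout and the attribute domains are our own illustrative assumptions, not the authors' implementation.

```python
import random

def build_tree_structure(domains, depth, rng):
    # Structure is chosen completely at random: pick a split attribute
    # without looking at any training data (BuildTreeStructure step).
    if depth == 0 or not domains:
        return {"leaf": True, "count": 0}
    attr = rng.choice(sorted(domains))
    rest = {a: vals for a, vals in domains.items() if a != attr}
    return {"leaf": False, "count": 0, "attr": attr,
            "children": {v: build_tree_structure(rest, depth - 1, rng)
                         for v in domains[attr]}}

def update_statistics(tree, instance):
    # Route one training instance down the tree, incrementing the
    # visit count of every node on its path (UpdateStatistics step).
    node = tree
    while True:
        node["count"] += 1
        if node["leaf"]:
            return
        node = node["children"][instance[node["attr"]]]

# Tiny example: two nominal attributes, a tree of depth 1, and a
# single scan of three training instances.
domains = {"Stage": ["1-2", "3-4"], "Diabetic": ["Yes", "No"]}
tree = build_tree_structure(domains, 1, random.Random(42))
data = [{"Stage": "1-2", "Diabetic": "Yes"},
        {"Stage": "3-4", "Diabetic": "No"},
        {"Stage": "3-4", "Diabetic": "No"}]
for inst in data:
    update_statistics(tree, inst)
```

Note that the structure is fixed before any instance is seen; only the counts depend on the data, which is what allows the single-scan training described above.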
An important question is why we choose RDTs as the way to encode the underlying training data. There are several reasons for this. First, an important property of RDTs is that the same code can be used for multiple data mining tasks: classification, regression, ranking, and multi-label classification4–6. As shown previously, the random decision tree is an efficient implementation of the Bayes Optimal Classifier (BOC)4 and an effective non-parametric density estimator6, and can be explained via high-order statistics such as moments7.
While the use of Random Decision Trees may seem counterintuitive, there are many benefits in terms of performance and accuracy gained by using this method over traditional algorithms. The use of multiple random decision trees in various learning tasks offers many benefits over other traditional classification/tree-building techniques, because its structure and progression lend themselves to modification for distributed/parallel tasks8. At the same time, RDT outperforms other models in terms of computational speed, due to the inherent properties of the random partitioning used in tree construction. Indeed, one of the key advantages of RDT is the efficiency gained by the way the model is trained, as well as its minimal memory requirement9. Finally, RDTs are also extremely well suited from the privacy perspective since:
Randomness in structure rather than simple perturbation of input/output is more effective — perturbing the input or output from a database to achieve privacy works, but the utility of the information garnered from data mining can be diminished if the perturbations are not carefully controlled, or conversely, information can be leaked if the information is not perturbed enough. Instead, we can exploit the design properties of RDT to generate trees that are random in structure, providing us with a similar end effect as perturbation without the associated pitfalls. A random structure provides security against leveraging a priori information to discover the entire classification model or instances.
Purely cryptographic approaches are often too slow to be practical and become computationally expensive as the size of the dataset and the communication between the different parties increase. RDTs provide a convenient escape from this paradigm thanks to their structural properties, and furthermore can be computed in a distributed fashion, thus preserving privacy even if the data are distributed.
An additional security advantage of RDTs is that they are amenable to meeting the privacy requirements of the differential privacy model10. As shown by Jagannathan et al.11, the node statistics can be viewed as queries over the training data. Therefore, standard techniques can be used to return differentially private results without significant loss of accuracy.
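As a concrete illustration, a node count can be released via the standard Laplace mechanism. This is a generic sketch of the building block that Jagannathan et al.11 rely on, not their exact calibration: in particular, the sensitivity of 1 per counting query and the absence of a per-tree budget split are our simplifying assumptions.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_node_count(true_count, epsilon, rng, sensitivity=1.0):
    # Each node statistic is a counting query over the training data:
    # adding Laplace noise with scale sensitivity/epsilon makes the
    # released count epsilon-differentially private.
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Releasing these noisy counts instead of the exact ones would let the conditional node probabilities used later in the paper be computed from differentially private statistics.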
Proposed Approach
The proposed approach follows a two-step process for the generation of synthetic data, as shown in Figure 1. In the first step, we generate multiple parameterized RDTs using the original dataset. In the second step, we perform random walks over the RDTs to recreate instances.
Figure 1:

Proposed approach for synthetic data generation
The procedure for generating the RDTs is given in steps 1-13 of Algorithm 1. The input to Algorithm 1 is the original dataset D and the number of RDTs to be generated, k. The dataset D may include both nominal and numeric attributes. For discretization of numeric attributes, the algorithm takes two discretization parameters as input: i) a function f that gives the number of split points for each numeric attribute (for example, the numeric attribute Age with range [0,100] can be split into four discrete values by setting f(Age) = 3); and ii) a function g that gives the discretization split points for each numeric attribute (for example, g(Age) = (30, 50, 70) discretizes Age into the 4 non-uniform age ranges [0, 30], [31, 50], [51, 70], and [71, 100]).
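Applying the split points given by g reduces to a binary search. The snippet below is a sketch of this discretization step using the Age example above; the inclusive upper bounds match the ranges [0, 30], [31, 50], [51, 70], [71, 100].

```python
from bisect import bisect_left

def discretize(value, split_points):
    # Returns the bin index: values <= split_points[0] fall in bin 0,
    # values in (split_points[i-1], split_points[i]] fall in bin i,
    # and values above the last split point fall in the final bin.
    return bisect_left(split_points, value)

age_splits = (30, 50, 70)  # g(Age) from the example above
```

For instance, an age of 50 maps to bin 1 (the range [31, 50]) while 51 maps to bin 2.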
After discretization, we build the structure of the required k RDTs by following the standard BuildTreeStructure procedure of RDTs. Next, we use the given dataset D to compute the conditional probability of visiting any node in the RDT. Note that this is the conditional probability of visiting a node given that we are currently visiting that node's parent, and is given by:
pr(n) = (# instances in D reaching node n) / (# instances in D reaching parent(n))    (1)
For the root of the tree, the probability is 1, since we start from the root. We illustrate the process of creating the RDT through the following example. Consider, as mentioned before, a cancer study. The dataset records the cancer site, stage, grade, and other features along with the survival months. Table 1 shows a small dataset sample with 5 instances. Figure 2 shows one possible RDT that can be learned from the data in Table 1. Since this tree serves as a partial summarization of the data, multiple such RDTs are created. In this RDT, as mentioned before, the conditional probability of reaching a node is computed as stated in Equation 1. Thus, for example, the probability of reaching the node Diabetic? is 2/5 since only 2 out of the 5 instances have stage 1 or 2. Similarly, the probability of reaching the Cancer Site node in the left subtree is 1/2 since only 1 of the 2 instances with stage 1 or 2 cancer is diabetic.
Table 1:
Example of Cancer Dataset
| # | Cancer Site | Stage | Grade | Age | Diabetic | Total Charges | Survival in Months |
|---|---|---|---|---|---|---|---|
| 1 | Colon | 1 | 2 | 62 | Yes | $70,000 | 90 |
| 2 | Pancreas | 3 | 3 | 46 | Yes | $45,000 | 5 |
| 3 | Colon | 4 | 3 | 53 | No | $47,000 | 39 |
| 4 | Pancreas | 2 | 3 | 47 | No | $52,000 | 9 |
| 5 | Colon | 3 | 3 | 72 | No | $58,000 | 46 |
Figure 2:

Example RDT
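The conditional probabilities stated above can be checked directly against the sample in Table 1. The split attributes below mirror the example RDT of Figure 2; the restriction to the (Stage, Diabetic) columns is ours, made only to keep the check short.

```python
from fractions import Fraction

# The 5 instances of Table 1, restricted to (Stage, Diabetic).
data = [(1, "Yes"), (3, "Yes"), (4, "No"), (2, "No"), (3, "No")]

# The root splits on Stage; the left branch covers stage 1 or 2,
# so Equation 1 gives the probability of reaching the Diabetic? node.
reach_left = [d for d in data if d[0] in (1, 2)]
pr_left = Fraction(len(reach_left), len(data))       # 2/5

# The left subtree then splits on Diabetic == "Yes", giving the
# probability of reaching the Cancer Site node in the left subtree.
reach_yes = [d for d in reach_left if d[1] == "Yes"]
pr_yes = Fraction(len(reach_yes), len(reach_left))   # 1/2
```

This reproduces the 2/5 and 1/2 values derived in the text.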
Algorithm 1 Synthetic data generation using RDTs
Require: Original Dataset D
Require: Number of random decision trees to be generated, k
Require: Degree of discretization for each numeric attribute, f : A → ℤ+
Require: Discretization split points for each numeric attribute, g : A → ℝ^f(A)
Require: Number of synthetic instances to be generated, ns
Ensure: Synthetic Dataset D' such that |D'| = ns, and schema(D') = schema(D)
1: D' ← ∅
2: for each numeric attribute A ∈ D do
3:  splits ← g(A)
4:  for all instances i ∈ D do
5:   Discretize iA using splits
6:  end for
7: end for
8: Generate k random decision trees RDT1, …, RDTk
9: for each tree RDTj do
10:  for each node n ∈ RDTj do
11:   pr(n) ← (# instances in D reaching node n) / (# instances in D reaching the parent node of n) {Record the conditional probability of reaching that node}
12:  end for
13: end for
14: for i = 1 … ns do
15:  inst ← ∅
16:  while inst is not completely generated do
17:   Randomly choose a random decision tree RDTj
18:   {Do a random walk over RDTj to generate the instance values}
19:   currnode ← root(RDTj)
20:   while currnode is not a leaf node in RDTj do
21:    Choose a child node c to random walk to using the node probabilities computed above
22:    if inst does not have a value for the current node attribute then
23:     Generate the attribute value for inst based on the child node c that is chosen
24:    end if
25:    currnode ← c
26:   end while
27:  end while {Add the completed inst to D'}
28: end for
Synthetic data is generated instance by instance by performing random walks over the RDTs. For each instance, we randomly choose an RDT and traverse it using the node probabilities. The attribute values of the instance are assigned based on the path taken during the random walk. After reaching a leaf node, the next RDT is randomly chosen for assignment of the remaining attribute values. We continue this process until all the attribute values of the instance are assigned or all k trees have been traversed. Traversal of multiple RDTs for any instance may result in different branches for the same attribute; we keep the attribute value that was assigned first and ignore subsequent values. If an instance has missing attribute values after traversal of all k RDTs, we discard all the assigned values for that instance and restart the random walk for that instance. However, this happens very rarely in practice. The specific procedure for creating synthetic data is listed in steps 14-28 of Algorithm 1.
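A minimal random walk over a single RDT might look as follows. The node layout (a `pr` field per child, recording Equation 1's conditional probability) is our assumed representation, and the handling of attributes left unassigned after a walk (move on to the next tree) follows the text above rather than any published implementation.

```python
import random

def random_walk(tree, inst, rng):
    # Walk from the root to a leaf, choosing each child with its
    # recorded conditional probability, and fill in any attribute
    # value that inst does not already have (first assignment wins).
    node = tree
    while not node["leaf"]:
        values = list(node["children"])
        weights = [node["children"][v]["pr"] for v in values]
        choice = rng.choices(values, weights=weights)[0]
        inst.setdefault(node["attr"], choice)
        node = node["children"][choice]
    return inst

# A depth-2 tree mirroring the upper levels of Figure 2.
tree = {"leaf": False, "attr": "Stage", "children": {
    "1-2": {"pr": 0.4, "leaf": False, "attr": "Diabetic", "children": {
        "Yes": {"pr": 0.5, "leaf": True},
        "No":  {"pr": 0.5, "leaf": True}}},
    "3-4": {"pr": 0.6, "leaf": True}}}

inst = random_walk(tree, {}, random.Random(0))
```

If the walk ends with attributes still missing (here, Diabetic when the walk goes down the 3-4 branch), the generator would repeat the call with another randomly chosen RDT, as described above.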
As depicted in Figure 1, the proposed synthetic data generation approach can generate different data views corresponding to the privacy levels of different classes of users. For example, physicians may need to access data at the finest granularity without any suppression. Therefore, the data view for physicians is generated without any generalization of nominal attribute values and with splitting of numeric attributes at the finest granularity level. However, for student researchers, access may need to be provided at a coarser granularity. Therefore, the data view for student researchers is generated with generalization/splitting of attributes at a coarser granularity level.
Integration with REDCap
We are currently working on integrating the proposed synthetic data generation approach into REDCap12, a data collection and sharing infrastructure that is widely used in the medical community for building and managing online surveys and databases. The overall architecture of the extended REDCap system for privacy-preserving analysis and sharing of data is depicted in Figure 3. The system will enable different types of users to access external data from different sources through the REDCap application. Access to data will be provided based on user authorizations defined in the access control policy of the data owner. The system allows data owners to specify authorizations for different classes of users in their access control policies. Accordingly, different views of synthetic data are generated, as shown in Figure 3. The system also includes a distributed data access module that enables retrieval and integration of data from multiple sources in response to a user query or exploratory data analysis task. For exploratory data analysis, the system allows a user to integrate data from multiple data sources and assess its utility.
Figure 3:

System architecture for the RDT integration with REDCap
Note that within our architecture, the data owner is still responsible for generating the synthetic dataset. This is no different from the current practice in multi-level secure databases, where different views of the data are generated for different users. The idea is that the data owner creates multiple versions of the synthetic data based on his/her access control policy and the privacy requirements of the data to be shared. The synthetic data can be generated offline and the appropriate version made available to a researcher based on the authorization of the researcher requesting access. This access will be mediated by the data access module. Furthermore, the synthetic data looks just like the real data (i.e., the records have precisely the same structure). Therefore, it can be processed the same way as the real data, and the researcher can run any standard query in REDCap pertaining to that dataset, though the exact answer to the query may vary from the answer computed over the original data.
Since the synthetic data is only an approximation of the real data, the models built from the synthetic data and the real data are likely to be different to some extent. It would be useful for a researcher to know the degree of discrepancy in results based on use of either real or synthetic data. This can be easily incorporated into the REDCap architecture, wherein the utility assessment module can measure the difference between the results of the posed query on both the real and synthetic data and provide a measure of the difference.
Experimental Evaluation
We now evaluate the efficacy of the RDT framework in creating synthetic data that can give an approximation of the utility of the original data. For this, we use three different real datasets obtained from the UCI Machine Learning Repository: 1) the Breast Cancer Wisconsin (Original) Data Set13; 2) the Parkinsons Telemonitoring Data Set14; and 3) the Diabetes 130-US hospitals for years 1999-2008 Data Set15. We processed the data to remove the instances with missing values. We also removed attributes where the overwhelming majority of instances have the same single value, and discretized all of the numeric attributes into 10 discrete ranges using equal-width discretization. Table 2 gives the characteristics of each dataset after preprocessing. The Breast Cancer dataset and the Diabetes dataset were used for classification (since they have a single categorical variable to be predicted). On the other hand, the Parkinsons Telemonitoring dataset has two independent numeric response variables; therefore, linear regression was performed independently with each response variable. Note that the original RDT paper recommends that the depth of each RDT be set to n/2, where n is the number of attributes, and that at least 10 RDTs be generated. Following this, for the classification tasks, we created 20 RDTs, whereas for the regression task, 15 RDTs were generated. However, unlike the datasets considered in the original RDT paper, the biomedical datasets considered here have several nominal attributes with many values. For example, each of the three diagnosis attributes in the Diabetes dataset has over 700 distinct values, which leads to a significant fan-out. Note that the depth and the fan-out of the tree together determine the size of the tree and thus the memory requirements of the RDT.
For Diabetes, since the fan-out of the three diagnosis attributes is extremely high, this tremendously increases the memory requirements, and therefore, we had to limit the depth of the RDT. As such, based on the characteristics of the different datasets, different depths were chosen as noted in Table 2.
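The equal-width discretization used in preprocessing reduces to computing nine interior split points from each numeric attribute's observed range; these split points then play the role of g(A) in Algorithm 1. The small helper below is our sketch of this step, not the authors' code.

```python
def equal_width_splits(values, bins=10):
    # Interior split points dividing [min, max] into `bins`
    # equal-width ranges (bins - 1 split points in total).
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]
```

For example, an attribute observed over [0, 100] yields split points at 10, 20, …, 90.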
Table 2:
Data Set Characteristics
| Name | # instances | # Attributes | RDT Depth | # Response Variables | Task |
|---|---|---|---|---|---|
| Breast Cancer | 683 | 10 | 5 (⌈10/2⌉) | 1 | Classification |
| Parkinson’s Telemonitoring | 3178 | 20 | 5 (⌈20/4⌉) | 2 | Regression |
| Diabetes | 98042 | 37 | 6 (⌈37/7⌉) | 1 | Classification |
The experiments were run on a 24-core Intel(R) Xeon(R) E5-2670 v3 server (2.30GHz, 264 GB RAM). However, only one of the cores was used (i.e., no parallelization) and the memory heap size was restricted to 48GB. We now discuss the results obtained. For each dataset, 10-fold cross-validation was carried out: the data was split into 10 folds, with 9 folds used for training and the remaining fold used for testing. When generating the synthetic data, we built the RDTs from the folds of data kept for training, and then generated synthetic data from the RDTs. This synthetic data was used for training the model, and the model built from the synthetic data was tested over the fold of real data held out earlier for testing. We then compare the accuracy of the model built from the real training data to the accuracy of the model built from the synthetic data. Since we are performing 10-fold cross-validation, the accuracy compared is the average over all of the iterations. For classification, we compare the overall performance of the classifiers using the AUC (area under the curve) metric, which is robust to class imbalance. For regression, we compare the root mean squared error (RMSE) obtained. One point to note is that once the RDTs are generated, any amount of synthetic data can be regenerated. Therefore, we tested with generating the same number of instances as in the original training set, ten times more instances, and 100,000 instances (note that for the Diabetes dataset, the training set size was itself close to 100,000, so we did not oversample this dataset). Table 3 gives the results obtained for classification with the Naïve Bayes classifier, while Table 4 gives the results obtained for linear regression. Note that we also tried other classifiers, such as decision trees, and the results obtained were similar.
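The fold handling in the protocol above can be sketched with a simple striped splitter. This is our illustration of the 10-fold setup (train on 9 folds, either directly or via RDT-generated synthetic data; test on the held-out real fold), not the authors' exact fold assignment.

```python
def kfold_indices(n, k=10):
    # Yield (train, test) index lists for k-fold cross-validation.
    # The train indices are used either directly, or to fit the RDTs
    # from which a synthetic training set is generated; the held-out
    # test fold always consists of real instances.
    idx = list(range(n))
    for f in range(k):
        test = idx[f::k]                      # every k-th instance, offset f
        train = [i for i in idx if i % k != f]
        yield train, test
```

Each instance appears in exactly one test fold, so averaging a metric (AUC or RMSE) over the k iterations gives the cross-validated estimate reported in Tables 3 and 4.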
The results show that for classification, the model generated from synthetic data achieves almost the same accuracy as the model generated from the original data in terms of the AUC. Indeed, the AUC of Breast Cancer with synthetic data is 0.9937 even with no oversampling, while the AUC with the original data is only marginally higher at 0.9945. Similarly, the reduction in AUC for Diabetes is also nominal. For Parkinsons, the RMSE with the original data for motor_UPDRS is 6.51, whereas we get an RMSE of 7.05 with the synthetic data with no oversampling. Similarly, the RMSE for total_UPDRS is 8.14 for the original data, whereas it is 8.93 for the synthetic data with no oversampling. Since lower is better for RMSE, it can be seen that the performance of linear regression is slightly worse with the synthetic data but is still very comparable. Furthermore, in all cases, increasing the degree of oversampling tends to improve the results. Though the accuracy for Diabetes is low with synthetic data, it is also quite low for the original data, since this dataset is quite difficult in terms of classification. The critical point is that, from the perspective of exploratory analysis, the models generated from synthetic data do give a similar view of the data as the models generated from the original data. For example, there was significant overlap in the variables identified as significant in the regression model built from the training data and that built from the synthetic data, though the p-values varied.
Table 3:
Experimental Results for Classification
| Data Set | # Classes | AUC with Original Data | AUC with Synthetic Data (no oversampling) | AUC with Synthetic Data (10-fold oversampling) | AUC with 100,000 synthetic instances |
|---|---|---|---|---|---|
| Breast Cancer | 2 | 0.9945 | 0.9937 | 0.9933 | 0.9938 |
| Diabetes | 3 | 0.6534 | 0.6134 | - | - |
Table 4:
Experimental Results for Regression
| Data Set | Response Variable | RMSE with Original Data | RMSE with Synthetic Data (no oversampling) | RMSE with Synthetic Data (10-fold oversampling) | RMSE with 100,000 synthetic instances |
|---|---|---|---|---|---|
| Parkinsons | motor UPDRS | 6.51 | 7.05 | 7.04 | 7.03 |
| Parkinsons | total UPDRS | 8.14 | 8.93 | 8.86 | 8.87 |
One point worth noting is that the synthetic data generation process is extremely efficient. Once the RDTs are built, it takes very little time to generate new synthetic instances; for example, generating 100,000 instances only took a few minutes. The process for building the RDTs is memory intensive. However, if the memory required is capped by limiting the depth of the tree, then this process is also very efficient: in our case, it only took a few seconds to generate a single RDT. In general, the time taken to build an RDT increases linearly with the size of the training data.
Related Work
There is a large body of prior work on privacy-preserving distributed analytics16. For example, specific analytics tasks such as classification17,18, clustering19,20, association analysis21,22, and outlier detection23 can be carried out in a distributed manner. Privacy-preserving analytics has also been applied in the biomedical domain, for example in Grid LOgistic REgression (GLORE)24,25, the distributed Cox proportional hazards model26, and the Distributed Privacy Preserving Support Vector Machine (DPP-SVM)27, which allow accurate model construction while respecting individual institutions’ data privacy policies, as well as in medical data integration28. All of the above techniques are designed for specific data analytics tasks. The proposed synthetic data generation approach enables exploratory analysis of the data, and can be used for any of the above data analytics tasks. There is also some existing work on anonymizing data or generating synthetic data. For example, techniques for k-anonymity29–31 will give a k-anonymized version of the dataset; however, this can lose significant data utility, and selecting an appropriate value for k is difficult. Aggarwal and Yu32 present an alternative condensation approach that clusters the data and regenerates synthetic data for each cluster; however, this does not keep track of combinations of attributes and may not fully capture all of the various combinations in the existing data.
Conclusion
Exploratory analysis is crucial to scientific innovation and often requires iterative exploration of the data. When the data are distributed across different organizational boundaries, this may lead to significant privacy concerns. Our aim in this paper is to enable exploratory analysis for the biomedical domain that is both privacy-preserving and scalable. Towards this, we develop a synthetic data generation technique using the framework of random decision trees to preserve the structure and semantics of the original data, while protecting privacy. We evaluate the proposed approach with several real biomedical datasets and typical tasks such as classification and linear regression. Our evaluation shows that the proposed approach is effective and can provide a sense of the utility of the data while protecting privacy.
Note that since our goal was simply to provide a first-order approximation of utility, we did not use standard HIPAA de-identification methods at this point. Since RDTs are amenable to meeting the privacy requirements of the differential privacy model, in the future we plan to compare and contrast the synthetic data generation approach with Safe Harbor/statistically de-identified data in more detail. We also plan to explore how other analytics tasks can be carried out and provide more accurate estimates of utility for specific tasks, and measure the uniqueness of a specific dataset for a particular task.
Acknowledgments
Research reported in this publication was supported by the National Institutes of Health under awards U54HL108460, R01GM118574, R01HG008802, R01GM114612, R21LM012060, R01GM118609, U01EB023685, by the National Science Foundation under awards CNS-1422501 and CNS-1564034 and by the Patient-Centered Outcomes Research Institute (PCORI) under Contract CDRN-1306-04819. The work of Shafiq is supported by Pakistan’s Higher Education Commission’s NRPU grant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research. We would also like to acknowledge the work done by Daniyal Jangda and Zain Sattar of LUMS in preparing and processing the datasets.
References
- 1.Ries LAG, Melbert D, Krapcho M, Stinchcomb DG, Howlader N, Homer MJ, et al. Bethesda, MD: National Cancer Institute; 2008. SEER cancer statistics review, 1975-2005; pp. 1975–2005. [Google Scholar]
- 2.NIS HNIS. Healthcare cost and utilization project (HCUP) 2011.
- 3. Ohno-Machado L, Bafna V, Boxwala AA, Chapman BE, Chapman WW, Chaudhuri K, et al. iDASH: integrating data for analysis, anonymization, and sharing. J Am Med Inform Assoc. 2012 Mar;19(2):196–201. doi: 10.1136/amiajnl-2011-000538.
- 4. Fan W, Wang H, Yu PS, Ma S. Is random model better? On its accuracy and efficiency. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM '03); Washington, DC, USA: IEEE Computer Society; 2003. p. 51. Available from: http://portal.acm.org/citation.cfm?id=951949.952144.
- 5. Fan W, McCloskey J, Yu PS. A general framework for accurate and fast regression by data summarization in random decision trees. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06); New York, NY, USA: ACM; 2006. pp. 136–146. Available from: http://doi.acm.org/10.1145/1150402.1150421.
- 6. Zhang X, Yuan Q, Zhao S, Fan W, Zheng W, Wang Z. Multi-label classification without the multi-label cost. In: SDM. SIAM; 2010. pp. 778–789.
- 7. Dhurandhar A, Dobra A. Probabilistic characterization of random decision trees. Journal of Machine Learning Research. 2008;9:2321–2348.
- 8. Vaidya J, Shafiq B, Fan W, Mehmood D, Lorenzi D. A random decision tree framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure Computing. 2014;11(5):399–411.
- 9. Zhang K, Fan W. Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowl Inf Syst. 2008;14(3):299–326.
- 10. Dwork C. Differential privacy. In: 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006); Venice, Italy; 2006. pp. 1–12.
- 11. Jagannathan G, Pillaipakkamnatt K, Wright RN. A practical differentially private random decision tree classifier. In: Proceedings of the 2009 IEEE International Conference on Data Mining Workshops (ICDMW '09); Washington, DC, USA: IEEE Computer Society; 2009. pp. 114–121. Available from: http://dx.doi.org/10.1109/ICDMW.2009.93.
- 12. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap): a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics. 2009;42(2):377–381. doi: 10.1016/j.jbi.2008.08.010.
- 13. Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences. 1990;87(23):9193–9196. doi: 10.1073/pnas.87.23.9193.
- 14. Tsanas A, Little MA, McSharry PE, Ramig LO. Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering. 2010;57(4):884–893. doi: 10.1109/TBME.2009.2036000.
- 15. Strack B, DeShazo JP, Gennings C, Olmo JL, Ventura S, Cios KJ, et al. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International. 2014;2014. doi: 10.1155/2014/781670.
- 16. Vaidya J, Clifton C, Zhu M. Privacy-Preserving Data Mining. Vol. 19 of Advances in Information Security. 1st ed. Springer-Verlag; 2005. Available from: http://www.springeronline.com/sgw/cda/frontpage/0,11855,4-40356-72-52496494-0,00.html.
- 17. Vaidya J, Clifton C, Kantarcioglu M, Patterson AS. Privacy-preserving decision trees over vertically partitioned data. ACM Trans Knowl Discov Data. 2008;2(3):1–27.
- 18. Vaidya J, Kantarcioglu M, Clifton C. Privacy preserving naive Bayes classification. International Journal on Very Large Data Bases. 2008 Jul;17(4):879–898.
- 19. Vaidya J, Clifton C. Privacy-preserving k-means clustering over vertically partitioned data. In: The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Washington, DC: ACM; 2003. pp. 206–215. Available from: http://doi.acm.org/10.1145/956750.956776.
- 20. Lin X, Clifton C, Zhu M. Privacy preserving clustering with distributed EM mixture modeling. Knowledge and Information Systems. 2005 Jul;8(1):68–81.
- 21. Vaidya J, Clifton C. Privacy preserving association rule mining in vertically partitioned data. In: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Edmonton, Alberta, Canada: ACM; 2002. pp. 639–644. Available from: http://doi.acm.org/10.1145/775047.775142.
- 22. Vaidya J, Clifton C. Secure set intersection cardinality with application to association rule mining. Journal of Computer Security. 2005 Nov;13(4):593–622.
- 23. Vaidya J, Clifton C. Privacy-preserving outlier detection. In: Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM '04); Los Alamitos, CA: IEEE Computer Society Press; 2004. pp. 233–240.
- 24. Jiang W, Li P, Wang S, Wu Y, Xue M, Ohno-Machado L, et al. WebGLORE: a web service for Grid LOgistic REgression. Bioinformatics. 2013 Dec;29(24):3238–3240. doi: 10.1093/bioinformatics/btt559. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24072732.
- 25. Jiang X, Wu Y, Marsolo K, Ohno-Machado L. Development of a web service for analysis in a distributed network. eGEMs (Generating Evidence & Methods to improve patient outcomes). 2014 Dec;2(1). doi: 10.13063/2327-9214.1053. Available from: http://repository.academyhealth.org/egems/vol2/iss1/22.
- 26. Lu CL, Wang S, Ji Z, Wu Y, Xiong L, Jiang X, et al. WebDISCO: a web service for DIStributed COx model learning without patient-level data sharing. In: Translational Bioinformatics Conference (TBC); 2014. doi: 10.1093/jamia/ocv083.
- 27. Que J, Jiang X, Ohno-Machado L. A collaborative framework for distributed privacy-preserving support vector machine learning. In: AMIA Annual Symposium Proceedings. Vol. 2012. American Medical Informatics Association; 2012. p. 1350.
- 28. He X, Vaidya J, Shafiq B, Adam N, White T. Privacy preserving integration of health care data. International Journal of Computational Models and Algorithms in Medicine. 2010;1(2):22–36.
- 29. Samarati P. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering. 2001 Nov-Dec;13(6):1010–1027.
- 30. Sweeney L. k-Anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems. 2002;10(5):557–570.
- 31. LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: efficient full-domain k-anonymity. In: SIGMOD Conference; 2005. pp. 49–60.
- 32. Aggarwal CC, Yu PS. A condensation approach to privacy preserving data mining. Lecture Notes in Computer Science. 2004;2992:183–199.
