Journal of the American Medical Informatics Association (JAMIA). 2019 Nov 21;27(3):366–375. doi: 10.1093/jamia/ocz195

Protecting patient privacy in survival analyses

Luca Bonomi,1 Xiaoqian Jiang,2 Lucila Ohno-Machado1,3
PMCID: PMC7025359  PMID: 31750926

Abstract

Objective

Survival analysis is the cornerstone of many healthcare applications in which the “survival” probability (eg, time free from a certain disease, time to death) of a group of patients is computed to guide clinical decisions. It is widely used in biomedical research. However, frequent sharing of exact survival curves may reveal information about individual patients, as an adversary may infer the presence of a person of interest as a participant of a study or of a particular group. Therefore, it is imperative to develop methods to protect patient privacy in survival analysis.

Materials and Methods

We develop a framework based on the formal model of differential privacy, which provides provable privacy protection against a knowledgeable adversary. We show the performance of privacy-protecting solutions for the widely used Kaplan-Meier nonparametric survival model.

Results

We empirically evaluated the usefulness of our privacy-protecting framework and the reduced privacy risk for a popular epidemiology dataset and a synthetic dataset. Results show that our methods significantly reduce the privacy risk when compared with their nonprivate counterparts, while retaining the utility of the survival curves.

Discussion

The proposed framework demonstrates the feasibility of conducting privacy-protecting survival analyses. We discuss future research directions to further enhance the usefulness of our proposed solutions in biomedical research applications.

Conclusion

The results suggest that our proposed privacy-protection methods provide strong privacy protections while preserving the usefulness of survival analyses.

Keywords: data privacy, survival analysis, data sharing, Kaplan-Meier, actuarial

INTRODUCTION

Survival analysis aims at computing the “survival” probability (ie, how long it takes for an event to happen) for a group of observations that contain information about individuals, including time to event. In medical research, the primary interest of survival analysis is in the computation and comparison of survival probabilities across patient groups (eg, standard of care vs intervention), in which survival may refer, for example, to the time free from the onset of a certain disease, time free from recurrence, or time to death. Survival analysis provides important insights on, among other things, the effectiveness of treatments, the identification of risk factors, biomarker utility, and hypothesis testing.1–10 Survival curves aggregate information from groups of interest and are easy to generate, interpret, compare, and publish online. Although aggregate data can be protected by different approaches, such as rounding,11,12 binning,13 and perturbation,14 survival analysis models have special characteristics that warrant the development of customized methods. Before describing our proposed solutions, we briefly review how survival curves are derived and what their vulnerabilities are from a privacy perspective.

Survival analysis methods and privacy

Methods for survival analysis can be divided into 3 main categories: parametric, semiparametric, and nonparametric models. Parametric models rely on known probability distributions (eg, the Weibull distribution) to learn a statistical model. These models are less frequently used than semi- or nonparametric methods, as their parametric assumptions rarely hold in practice. Even though the released curves exhibit a natural “smoothing,” studies have shown that the parameters of the model may reveal sensitive information.15 Semiparametric methods are extremely popular for multivariate analyses and can be used to identify important risk factors for the event of interest. As an example, the Cox proportional hazards model16 only assumes a proportional relationship between the baseline hazard and the hazard attributed to a specific group (ie, it does not assume that survival follows a known distribution, as is the case with parametric models). Nonparametric models are frequently used to describe the survival probability over time without requiring assumptions on the underlying data distribution. Among those models, the Kaplan-Meier (KM) product-limit estimator appears frequently in the biomedical literature. As an example, a search for PubMed articles using the term Kaplan-Meier retrieves more than 8000 articles per year from 2013 to 2018; a search for actuarial returns about 500 articles per year. In this article, we focus on the KM estimator and present results for the actuarial model in the Supplementary Appendix. The KM method generates a survival curve in which each event appears as a corresponding drop in the probability of survival. For example, Foldvary et al4 used the KM method to analyze seizure outcomes for patients who underwent temporal lobectomy for epilepsy. In contrast, in the actuarial method,17,18 the survival probability is computed over prespecified periods of time (eg, 1 week, 1 month). For example, Balsam et al19 used actuarial curves to describe long-term survival after valve surgery in an elderly population.

It is surprising that relatively little attention has been given so far to the protection of individual privacy in survival analysis. Survival analyses generate aggregated results that are unlikely to directly reveal identifying information (eg, name, SSN).20 However, a knowledgeable adversary who observes survival analysis results over time may be able to determine whether a targeted individual participated in the study, and even whether the individual belongs to a particular subgroup in the study, thus learning sensitive phenotypes. Several previous privacy studies have shown that sharing aggregated results may lead to this privacy risk.15,21,22 For example, small counts (eg, <11) may reveal identifiable information about patients and their demographics.11,23 As survival analyses rely on statistical primitives (eg, counts of time to events), they share similar privacy risks. In fact, each patient is responsible for a drop, or step, in the survival curve. Therefore, the released curves may reveal, in combination with personal or public knowledge, sensitive information about a single patient. For example, an adversary who (1) has knowledge of the time to events of individuals in various groups at a certain time (eg, from previously released survival curves for different groups) and (2) knows that a person of interest joined the study may infer the presence of such an individual in a specific group (eg, patients in the hepatitis B subgroup) as the released curves are updated. Specifically, an adversary can construct a survival curve based on their auxiliary knowledge and can infer whether the person of interest is in a group by comparing such a curve with the one released for that group, as illustrated with the curves s1’ and s2’ in Figure 1 (left panel). The differences between the exact curves and those obtained by the adversary disclose the participation of the person of interest in a group (ie, the patient with time to event at time unit 61 contributed to the curve s2’, thus the individual of interest was in group 2). This scenario is realistic for dashboards of “aggregate” results, where tools for data exploration (eg, web interfaces and application programming interfaces) may enable users to obtain frequent fine-grained releases. It is certainly not limited to survival analysis, applying also to counts, histograms, proportions (when accompanied by information on the total number of participants), and other seemingly harmless “aggregate” data.
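
To make this difference attack concrete, the following Python sketch simulates it on hypothetical toy data (the function km_curve, the group compositions, and poi_event are illustrative names and values, not taken from the study):

```python
import numpy as np

def km_curve(event_times, horizon):
    """Kaplan-Meier curve over time units 1..horizon (no censoring,
    for simplicity of this toy example)."""
    at_risk, s, curve = len(event_times), 1.0, []
    for t in range(1, horizon + 1):
        d = sum(1 for e in event_times if e == t)  # events at time t
        if at_risk > 0:
            s *= 1.0 - d / at_risk
        at_risk -= d
        curve.append(s)
    return np.array(curve)

# Adversary's background knowledge: the event times behind previously
# released curves for each group (hypothetical toy data).
group1 = [3, 20, 35, 50]
group2 = [5, 12, 30, 44, 59, 70]
poi_event = 61  # known event time of the person of interest

# Newly released curves; here the person actually joined group 2.
released1 = km_curve(group1, 80)
released2 = km_curve(group2 + [poi_event], 80)

# Difference attack: hypothesize membership in each group and compare
# the reconstructed curve against the released one.
for name, group, released in [("group 1", group1, released1),
                              ("group 2", group2, released2)]:
    candidate = km_curve(group + [poi_event], 80)
    if np.allclose(candidate, released):
        print("person of interest contributed to", name)  # -> group 2
```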

Figure 1.


Survival curves obtained using the Kaplan-Meier method. (Left panel) An adversary observes 2 exact curves s1’ (group 1) (eg, consisting of patients without hepatitis B) and s2’ (group 2) (patients with hepatitis B) and compares them with the curves constructed with knowledge of s1 and s2 (eg, previously released curves). The adversary knows that the person of interest had an event at time 61 and thus can learn from the change in s2’ that this individual contributed to group 2. This is an example of a difference attack. (Right panel) When the curves are generated using differential privacy (s1’-dp and s2’-dp), their difference does not reveal individual time-to-event information. The data on this plot were obtained from a publicly available repository (http://lib.stat.cmu.edu/datasets/veteran). Here, we only report on the first 80 time units (days) to highlight the difference between the survival curves.

It is imperative to develop privacy solutions that protect the presence of individuals in released survival curves. In this work, we consider the formal and provable notion of differential privacy,24 in which the released statistics are perturbed with carefully calibrated random noise. Specifically, differential privacy ensures that the output statistics are “roughly the same” regardless of the presence or absence of any individual, thus providing plausible deniability. In fact, the differences between the differentially private survival curves s1’-dp and s2’-dp and those obtained with the adversarial knowledge in Figure 1 (right panel) do not reveal information about the presence of any individual in either group, as opposed to the original curves (left panel).

Objective

Current research in survival analysis includes the development of accurate prediction models, under the assumption that sharing aggregate survival data does not compromise privacy. For example, deep neural networks have been recently used to learn the relationship between a patient’s covariates and the survival distribution predictions.25–28 Another example by Lu et al29 describes a decentralized method for learning a distributed Cox proportional hazards model without sharing individual patient-level data. Those solutions disclose exact results that may enable privacy attacks by untrusted users.15,22,30

Several approaches have been proposed for privacy-protecting survival analyses.20,31–33 However, they do not provide provable privacy guarantees. O’Keefe et al20 discussed privacy techniques based on data suppression (eg, removal of censored events), smoothing, and data perturbation. Yu et al32 proposed a method based on affine projections for the Cox model. Similarly, Fung et al33 developed a privacy solution using random linear kernel approaches. Despite promising results, these solutions do not provide provable privacy protection and may be vulnerable in the presence of an adversary who has auxiliary information (eg, knowledge of time-to-event data [hospitalization, death, etc] or of previously published survival curves).

We developed a privacy framework, based on the notion of differential privacy, that provides formal and provable privacy protection against a knowledgeable adversary who aims at determining the presence of an individual of interest in a particular group. Intuitively, our framework transforms the data before the release, similarly to previous methods based on generalization (eg, smoothing) and truncation (eg, censoring aggregate counts below a threshold).20,23 In our case, privacy is protected with the injection of calibrated noise. We show how this framework can be used to release differentially private survival analyses for the KM estimator (see the Supplementary Appendix for the actuarial method). Furthermore, we define an empirical privacy risk that measures how well an informed adversary can reconstruct the time to event of an individual who participated in the study. Our evaluations show that an adversary can reconstruct the time to event with a small error from the observed nonprivate survival curves, thus indicating high privacy risk (eg, potential reidentification by linking the exact time intervals with external data). Our proposed methods significantly reduce privacy risks while retaining the usefulness of the survival curves. We must emphasize that an ideal privacy protection mechanism should not rely on specific assumptions about the adversary’s background knowledge, as violations of those assumptions may invalidate the protection. Thanks to differential privacy, our methods do not require such assumptions and thus provide protection regardless of how much information the adversary has.

MATERIALS AND METHODS

Nonparametric survival models

Nonparametric survival models estimate the survival probability of a group of individuals by analyzing the temporal distribution of the recorded events during the study. Typically, each individual has a single temporal event, which may represent the development of a symptom, disease, or death. Some of these events may be only partially known (eg, the subject drops out of the study, or there is no follow-up)17,34 and are therefore denoted as censored events. We assume a study of N individuals over a period of T time units (eg, days, months). Furthermore, u_i denotes the number of uncensored patients (with a known recorded event [eg, death]) at time t_i, c_i denotes the number of censored patients at time t_i, and r_i represents the number of patients remaining before t_i (excluding any individual censored previously). Table 1 summarizes the nonparametric models considered in this article. Additional details are reported in the Supplementary Appendix.

Table 1.

Nonparametric models for survival analysis considered

Actuarial model (see Supplementary Appendix):

  • Time to events are grouped into intervals of fixed length l.

  • Survival is computed on the set of intervals {I_1, I_2, ..., I_T}, each of length l.

  • Censored patients are assumed to withdraw from the study at random during the interval.

  • Survival function at each interval I_i:

    s_i = \prod_{j=1}^{i} \left( 1 - \frac{u_j}{r_j - c_j/2} \right)

Kaplan-Meier model:

  • The survival function is computed at each time unit.

  • Survival function at time t:

    s(t) = \prod_{t_i \le t} \left( 1 - \frac{u_i}{r_i} \right)
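
As a concrete illustration of the table's actuarial formula, the sketch below implements the life-table computation on toy data (function names and data are illustrative; a KM counterpart computed from the event-stream representation appears in the next subsection):

```python
def actuarial_survival(times, censored, interval_len, n_intervals):
    """Life-table (actuarial) estimate over fixed intervals I_1..I_T:
    s_i = prod_{j<=i} (1 - u_j / (r_j - c_j / 2)),
    where censored patients are assumed to withdraw uniformly at random
    within the interval (hence the c_j / 2 correction)."""
    s, curve = 1.0, []
    at_risk = len(times)  # r_j: patients still under observation
    for i in range(n_intervals):
        lo, hi = i * interval_len, (i + 1) * interval_len
        u = sum(1 for t, c in zip(times, censored) if lo < t <= hi and not c)
        c = sum(1 for t, cc in zip(times, censored) if lo < t <= hi and cc)
        effective = at_risk - c / 2.0
        if effective > 0:
            s *= 1.0 - u / effective
        at_risk -= u + c
        curve.append(s)
    return curve

# Toy example: events at 2, 4, 4, 5*, 6, 8* (* = censored), intervals of l = 5.
print(actuarial_survival([2, 4, 4, 5, 6, 8],
                         [False, False, False, True, False, True],
                         interval_len=5, n_intervals=2))
```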

Differential privacy

Differential privacy24 enables the release of statistical information about a group of participants while providing strong and provable privacy protection. Specifically, differential privacy ensures that the probability distribution of the released statistics is “roughly the same” regardless of the presence or absence of any individual, thus providing plausible deniability. Differential privacy has been successfully applied in a variety of settings,14,35 such as data publication (eg, 1-time data release),36–40 iterative query answering,41–43 continual data release (eg, results are published over time),44–50 and in combination with various machine learning models.30,51–53 Among those works, we are inspired by the differentially private model proposed for continual data release,46–49 as survival analyses estimate the survival function at time t using the time to events up to t. In our setting, we consider an event stream S = (e_1, e_2, ..., e_T), where each event e_i = (c_i, u_i, t_i) records the numbers of censored (c_i) and uncensored (u_i) events observed at time t_i, and the events are in chronological order (ie, t_i < t_{i+1}). For example, consider a study over a period of T = 10 units of time (eg, months) comprising a total of N = 6 individuals with time to events of 2, 4, 4, 5*, 6, 8*, where a time marked with * corresponds to censoring (ie, a participant was lost to follow-up). Under our notation, we have the event stream S = ((0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5), (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10)), where (0, 0, 3) indicates that no events were observed at time 3.
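
A minimal sketch of this encoding, together with the KM computation over the resulting stream (function names are illustrative), reproduces the running example above:

```python
from collections import Counter

def to_event_stream(times, censored, horizon):
    """Encode per-patient time-to-event data as the stream
    S = (e_1, ..., e_T), where e_i = (c_i, u_i, t_i) gives the numbers
    of censored (c_i) and uncensored (u_i) events at time t_i."""
    cens = Counter(t for t, c in zip(times, censored) if c)
    unc = Counter(t for t, c in zip(times, censored) if not c)
    return [(cens[t], unc[t], t) for t in range(1, horizon + 1)]

def km_from_stream(stream, n_patients):
    """Kaplan-Meier curve s(t) = prod_{t_i <= t} (1 - u_i / r_i)
    computed directly from the event stream."""
    at_risk, s, curve = n_patients, 1.0, []
    for c_i, u_i, t_i in stream:
        if at_risk > 0:
            s *= 1.0 - u_i / at_risk
        at_risk -= u_i + c_i  # censored patients leave the risk set too
        curve.append(s)
    return curve

# Running example: time to events 2, 4, 4, 5*, 6, 8* (* = censored), T = 10.
stream = to_event_stream([2, 4, 4, 5, 6, 8],
                         [False, False, False, True, False, True], 10)
print(stream)  # [(0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5), ...]
print(km_from_stream(stream, n_patients=6))
```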

We assume a trusted data curator who wishes to release an estimate of the survival probability s(t) at each time stamp 1 ≤ t ≤ T using the information in the stream of events observed up to time t, namely the prefix stream S_t = (e_1, e_2, ..., e_t).

Neighboring streams of time to events

Two streams of time to events S_t and S'_t are neighboring streams if there exists at most 1 t_i ∈ [1, ..., t] such that |c_i − c'_i| + |u_i − u'_i| ≤ 1 (ie, they differ by at most 1 event).

Using this notion, we present the definition of differential privacy considered in our work as follows.

Differential privacy

Let M(·) be a randomized algorithm that takes a stream S as input, and let 𝒪 be the set of all possible outputs of M(·). Then, we say that M(·) satisfies ɛ-differential privacy if, for all sets O ⊆ 𝒪, all neighboring streams S_t and S'_t, and all t, it holds that:

\Pr[M(S_t) \in O] \le e^{\varepsilon} \times \Pr[M(S'_t) \in O]

Intuitively, the notion of differential privacy ensures that neighboring streams are indistinguishable to an adversary who observes the output of the mechanism M(·) at any time. In differentially private models, ɛ denotes the privacy parameter (also known as the privacy budget). Lower values indicate higher indistinguishability, thus providing stronger privacy. Determining the right value for ɛ is a challenging problem, as specific values depend on the application (ie, risk tolerance).54 Typically, ɛ assumes values in the range [1/1000, 10].

As an example, consider ɛ = 1: the probability of a stream S_t being mapped to a particular output is no greater than e ≈ 2.7 times the probability that any of its neighboring streams is mapped to the same output. Perfect privacy can be achieved with ɛ = 0 (ie, neighboring streams are equally likely to produce the same output); however, this obviously leads to no utility in the released curve, as the mechanism has to completely ignore each individual record in the input. The guarantee of indistinguishability between neighboring streams protects the presence of the individual in the released statistics because, in survival analysis, an individual can contribute at most once to the stream.

Typically, differential privacy is achieved via output perturbation, in which the released statistics are perturbed with calibrated random noise to hide the presence of individuals (details are reported in the Supplementary Appendix). Intuitively, the noise perturbation “generalizes” the aggregated time to events, similarly to traditional ad hoc techniques in which the released aggregated counts are obtained by binning and thresholding (eg, reporting counts as “less than 10”).
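For illustration, here is a minimal sketch of output perturbation with the Laplace mechanism on a single event count (since each patient contributes at most 1 event, the count has sensitivity 1; this generic helper is not the paper's exact calibration):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_count, epsilon, sensitivity=1.0):
    """Release a count under epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon. In survival data,
    each patient contributes at most one event, so sensitivity is 1."""
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a count of 42 uncensored events released at epsilon = 1.
print(laplace_mechanism(42, epsilon=1.0))
```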

Our framework for privacy-protecting survival analyses

Publishing survival values s(t) may pose significant privacy challenges, as the event for an individual at time t' will affect the survival curve at time t' (ie, a step) as well as all subsequent values. Therefore, an adversary who observes these changes may gain knowledge about the individual associated with such an event. To mitigate these risks, traditional differential privacy methods perturb each released survival value s(t). However, these methods may lead to overly perturbed results when the study spans a long period of time. To this end, we propose a framework that compresses the stream of events into partitions in which the survival probabilities can be accurately computed over time using an input perturbation strategy. Overall, our framework (Figure 2) comprises 3 main steps: (1) data partitioning, (2) survival curve computation, and (3) postprocessing.

Figure 2.


Overview of our proposed framework to release differentially private survival curves in 3 main steps. First, in data partitioning, the stream of time to events in input is partitioned into segments while satisfying ɛ1-differential privacy. Second, in survival curve computation, the aggregated time to events over the stream of partitions are computed to satisfy ɛ2-differential privacy using the binary tree mechanism. These noisy counts are used in the estimate of the survival probability over time using an input perturbation mechanism. Third, in postprocessing, the values of the estimated curve are bounded in the interval [0, 1] and monotonicity is enforced.

In the data partitioning step, the time to events are grouped into partitions generated in a differentially private manner. The idea is to compress the stream, so that the privacy cost for computing the survival curve can be reduced while retaining the distribution of the events. In the survival computation step, we estimate the number of censored and uncensored events over time using a binary tree decomposition. This step reduces the perturbation noise in the estimation of the events, which are then used to compute the survival probability. Specifically, we use an input perturbation approach in which privacy is achieved by perturbing the counts of the events rather than the output of the survival function, thus improving the utility compared with standard output perturbation techniques. Because the noise perturbation may disrupt the shape of the survival curve, we perform a postprocessing step, in which we enforce consistency in the released curve (ie, monotonically decreasing survival probabilities). For brevity, in the following we describe the instantiation of our framework for the KM method. The private solution for the actuarial method follows the same steps, except for the fact that partitioning is performed over fixed intervals (see Supplementary Appendix).

Data partitioning

Our partitioning strategy takes as input the stream of events S_t and produces as output a stream of partitions, in which multiple events are grouped. We compress the stream into partitions of variable length with the goal of retaining the distribution of the events. Our method processes 1 event at a time and keeps an active partition, which is sealed once more than Θ time to events have been observed. Intuitively, this approach produces a coarser representation of the stream, in which each event is grouped with at least Θ−1 others, by varying the interval of time used to publish survival for a group of events. In this process, we perturb the count of the events in the stream and the threshold Θ with calibrated noise. As a result, both the events and the sizes of the partitions are protected, providing an additional level of protection compared with other privacy methods that rely on fixed binning (eg, rounding to the nearest 10). The privacy budget ɛ1 dedicated to this step is divided equally between the threshold and event count perturbations. As any neighboring streams may differ by at most 1 segment, these perturbations ensure that the partitions returned by the algorithm satisfy ɛ1-differential privacy.55
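
The sketch below gives a simplified, sparse-vector-style rendering of this partitioning step over the event stream introduced earlier; the exact noise calibration of the private-partitions mechanism55 differs in detail, so this should be read as illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def private_partitions(stream, theta, eps1):
    """Greedy stream partitioning: seal the active partition once a noisy
    running event count exceeds a noisy threshold. eps1 is split equally
    between threshold and count perturbation, mirroring the framework's
    budget allocation (simplified sketch, not the exact calibration)."""
    eps_t, eps_c = eps1 / 2.0, eps1 / 2.0
    noisy_theta = theta + rng.laplace(scale=1.0 / eps_t)
    partitions, current, count = [], [], 0
    for (c, u, t) in stream:
        current.append((c, u, t))
        count += c + u
        if count + rng.laplace(scale=1.0 / eps_c) > noisy_theta:
            partitions.append(current)
            current, count = [], 0
            noisy_theta = theta + rng.laplace(scale=1.0 / eps_t)  # re-draw
    if current:
        partitions.append(current)
    return partitions
```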

Survival curve computation

In this step, we determine the survival probability at time t using an input perturbation strategy. The idea is to estimate the number of uncensored and censored events in the partitions in a differentially private manner and then use those values to compute the survival curve up to t. One could estimate these events by perturbing the counts over the partitions processed so far. However, this simple process leads to high perturbation noise, as the magnitude of the noise grows linearly with the number of partitions. To this end, we use a binary tree counting approach with privacy parameter ɛ2, where leaves represent the original partitions and internal nodes denote partitions obtained by merging the partitions of their children. In Figure 2, for example, the internal node associated with the count C14 comprises the events over the partitions P1, P2, P3, and P4. This binary mechanism is very effective in reducing the overall impact of perturbation noise.46,47 With this mechanism, the differentially private numbers of uncensored (û_i) and censored (ĉ_i) events in the stream can be estimated with a perturbation noise that grows only logarithmically with the number of partitions in the stream. To compute the privacy-protecting survival curve for the KM method, denoted ŝ_KM(i), we rewrite the KM survival curve formulation as follows:

\hat{s}_{KM}(i) = \hat{s}_{KM}(i-1) \times \frac{N - \hat{u}_i - \hat{c}_{i-1}}{N - \hat{u}_{i-1} - \hat{c}_{i-1}}

where û_i and ĉ_i represent the total numbers of uncensored and censored events up to the time of partition i, respectively. At the end of this step, we obtain a step function representing the survival probability of the patients over time that remains constant within each partition.
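
The following sketch outlines the binary (dyadic) tree counting mechanism and the rewritten KM recurrence above; the per-level budget split and the equal division of ɛ2 between the trees for uncensored and censored counts are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class BinaryTreeCounter:
    """Binary (dyadic) tree mechanism for private prefix sums: each node
    covering a dyadic block of partitions stores one noisy count, and any
    prefix sum is assembled from at most log2(n) nodes, so the error grows
    only logarithmically with the number of partitions."""

    def __init__(self, values, eps):
        self.n = len(values)
        self.levels = int(np.ceil(np.log2(max(self.n, 2)))) + 1
        scale = self.levels / eps  # budget eps split across tree levels
        self.noisy = {}
        for level in range(self.levels):
            width = 2 ** level
            for k in range(-(-self.n // width)):  # ceil(n / width) blocks
                block = values[k * width:(k + 1) * width]
                self.noisy[(level, k)] = sum(block) + rng.laplace(scale=scale)

    def prefix_sum(self, t):
        """Noisy sum of values[0..t-1] from O(log n) dyadic blocks."""
        total, pos = 0.0, 0
        for level in reversed(range(self.levels)):
            width = 2 ** level
            if pos + width <= t:
                total += self.noisy[(level, pos // width)]
                pos += width
        return total

def private_km(partitions, n_patients, eps2):
    """Differentially private KM curve over a stream of partitions, using
    the rewritten recurrence
      s_hat(i) = s_hat(i-1) * (N - u_hat_i - c_hat_{i-1})
                            / (N - u_hat_{i-1} - c_hat_{i-1})."""
    u_tree = BinaryTreeCounter([sum(u for c, u, t in p) for p in partitions],
                               eps2 / 2)
    c_tree = BinaryTreeCounter([sum(c for c, u, t in p) for p in partitions],
                               eps2 / 2)
    s, curve = 1.0, []
    for i in range(1, len(partitions) + 1):
        u_i = u_tree.prefix_sum(i)
        u_prev, c_prev = u_tree.prefix_sum(i - 1), c_tree.prefix_sum(i - 1)
        denom = n_patients - u_prev - c_prev
        if denom > 0:
            s *= (n_patients - u_i - c_prev) / denom
        curve.append(s)
    return curve
```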

Data postprocessing

A survival curve satisfies the following properties: (1) it assumes values in the range [0, 1], and (2) it monotonically decreases with time (ie, ŝ(t) ≥ ŝ(t+1) for 1 ≤ t < T). While our solution ensures that the released curve satisfies differential privacy, the noise perturbation may violate properties 1 and 2. To this end, we propose a postprocessing step in which we compute the survival curve ŝ*(t) that satisfies these properties and best resembles ŝ(t). Similarly to previous work,56,57 we solve this optimization problem with isotonic regression methods (details in the Supplementary Appendix). An illustrative example of our postprocessing step is reported in Figure 3.
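
Below is a self-contained sketch of this projection using the pool-adjacent-violators algorithm, a standard isotonic regression solver (the paper's exact optimization is detailed in the Supplementary Appendix; names are illustrative):

```python
import numpy as np

def monotone_decreasing_projection(s_noisy):
    """Postprocess a noisy survival curve: clip to [0, 1] and find the
    closest (least-squares) monotonically non-increasing curve via the
    pool-adjacent-violators algorithm. Postprocessing cannot weaken the
    differential privacy guarantee."""
    # Fit a non-increasing curve by running PAVA for a non-decreasing
    # fit on the negated values.
    y = -np.clip(np.asarray(s_noisy, dtype=float), 0.0, 1.0)
    means, weights = [], []  # blocks of (mean, weight), merged on violation
    for v in y:
        means.append(v)
        weights.append(1.0)
        while len(means) > 1 and means[-2] > means[-1]:
            w = weights[-2] + weights[-1]
            m = (means[-2] * weights[-2] + means[-1] * weights[-1]) / w
            means[-2:] = [m]
            weights[-2:] = [w]
    fitted = np.repeat(means, np.array(weights, dtype=int))
    return np.clip(-fitted, 0.0, 1.0)

# Example: a noisy, non-monotonic private curve.
print(monotone_decreasing_projection([1.02, 0.91, 0.93, 0.60, 0.65, 0.55]))
# -> [1.0, 0.92, 0.92, 0.625, 0.625, 0.55]
```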

Figure 3.


Illustrative example of the postprocessing step in our proposed framework. The differentially private noisy curve (s) may not be monotonic due to the injection of random noise. Such a curve is postprocessed to generate a monotonically decreasing curve (s*) that approximates s.

Overall, our approach achieves ɛ-differential privacy (with ɛ1 = ɛ2 = ɛ/2), as the differential privacy guarantees of phase 1 and phase 2 compose sequentially.35 Furthermore, our framework is highly scalable, as all the steps can be performed efficiently.

Evaluation metrics

We conducted empirical evaluations of our proposed privacy-protecting framework to assess the usefulness of the released survival curves and the reduction in privacy risk when the privacy-protecting model is compared with a nonprivate counterpart.

Utility metric

To assess the usefulness of the released survival curves, we compared the differentially private (hereafter, “private”) curve with the exact survival curve in terms of mean absolute error (MAE). The MAE measures the similarity between 2 curves by averaging the absolute differences, formally:

\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| s(t) - \hat{s}(t) \right|

where s(t) and ŝ(t) denote the nonprivate and private survival curves, respectively. As the survival curves are based on probabilities, the MAE assumes values in the range [0, 1]. Smaller values of MAE indicate stronger similarity between the 2 curves, hence higher utility. In addition, we compare the curves using the Kolmogorov-Smirnov (KS) test to estimate the statistical difference between the differentially private and nonprivate curves.58
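
For concreteness, a minimal sketch of these utility metrics follows; the KS P value, which requires the modified procedure for censored data,58 is omitted here:

```python
import numpy as np

def mae(s_exact, s_private):
    """Mean absolute error between the exact and private survival curves."""
    s_exact, s_private = np.asarray(s_exact), np.asarray(s_private)
    return float(np.mean(np.abs(s_exact - s_private)))

def ks_statistic(s_a, s_b):
    """KS-type statistic between two survival curves: the maximum absolute
    difference over time (P values follow ref 58 and are omitted here)."""
    return float(np.max(np.abs(np.asarray(s_a) - np.asarray(s_b))))

# Example with two short curves.
print(mae([1.0, 0.8, 0.5], [0.95, 0.75, 0.55]))           # 0.05
print(ks_statistic([1.0, 0.8, 0.5], [0.95, 0.75, 0.55]))  # 0.05
```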

Privacy metrics

Differential privacy ensures that the adversary’s probability of inferring information about an individual from the survival curve is only a factor of at most e^ɛ (approximately 1 + ɛ for small ɛ) larger than the probability of inferring that information if such an individual were not included in the study. Thus, the privacy parameter ɛ provides us with a theoretical bound on the privacy risk of disclosing the participation of each individual. In addition, we consider an empirical privacy risk (defined in the Supplementary Appendix), named inference error, which measures the error in reconstructing the time to event (in time units) by a knowledgeable adversary who observes the released curve. Higher values of inference error indicate lower privacy risk. For our differentially private solutions, we conduct our evaluations of the inference error by varying the privacy parameter ɛ.

RESULTS

We used the real-world Surveillance Epidemiology and End Results (SEER)59 dataset to evaluate the effectiveness of our proposed approaches. Specifically, we generated a stream of time to events by sampling N ∈ {1000, 10000, 100000} patients from 707157 breast cancer patients with a first diagnosis from 1973 to 2015, using a time unit of 1 month. Our results are reported with a confidence level of 95% over 100 runs of our algorithms for the KM survival analyses. Evaluations for the actuarial method, additional experiments on the SEER dataset, and experiments on a synthetically generated dataset are reported in the Supplementary Appendix. In the figures, we denote the standard KM method and our proposed differentially private version by KM and DP-KM, respectively.

KM survival curve

Figure 4 reports the inference error for the nonprivate and private KM approaches with different data sizes, which quantifies the ability of an adversary to infer the exact time to event of an individual of interest from the released curves. The adversary’s inference error for the standard KM approach grows with the size of the dataset (from ±2 time units at N=1000 to ±80 time units at N=100000) but remains very low. Intuitively, the contribution of an individual to the survival probability decreases as N increases (an individual is hidden in a larger crowd).

Figure 4.


Inference error for the Kaplan-Meier (KM) survival curves obtained with the nonprivate (KM) and differentially private (DP-KM) methods vs the privacy parameter (ε), for (A) N = 1000, (B) N = 10000, and (C) N = 100000 sampled patients.

With our DP-KM method, the inference error for the adversary is significantly higher across all the sizes of the data (ie, “better” privacy). Consider N=1000, for example: an adversary can reconstruct the time to event of a targeted individual up to ±2 time units from the KM survival curve. In contrast, with our DP-KM solution, the error is at least ±250 time units (Figure 4A). In other words, with our DP-KM solution, an adversary cannot infer the time to event for an individual of interest with precision, as the confidence interval for the inferred time to event spans over 250 time units (as opposed to 2 time units with the nonprivate method). Overall, our solution provides consistently stronger privacy protection across all the dataset sizes when compared with the standard KM method. Furthermore, the inference error in our differentially private method is robust against variations of the privacy parameter (ɛ).

Figure 5 reports the MAE for the differentially private curves. Both the privacy parameter and the size of the dataset impact the utility. For larger values of the privacy parameter ɛ (weaker privacy), the magnitude of the perturbation noise decreases, thus leading to more accurate results. Similarly, for larger datasets, the impact of the perturbation noise is smaller, thus leading to higher utility (ie, lower MAE). For example, with ɛ ≥ 1, our method achieves MAE ≤ 0.1 for N=10000 and MAE ≤ 0.03 for N=100000. In conclusion, our DP-KM solution produces survival curves that retain the usefulness of the nonprivate curves while providing strong privacy protection.

Figure 5.


Mean absolute error (MAE) of the differentially private Kaplan-Meier (DP-KM) curve vs the privacy parameter (ε), for N=1000, 10000, and 100000.

Survival analysis

We performed nonparametric survival analysis using data from the SEER database. We considered patients diagnosed with breast cancer after 2005, from which we randomly selected groups of 2500 patients representing different races: white, black, and other. We obtained the survival curves using the nonprivate KM approach (ie, KM) and its privacy-protecting counterpart (ie, DP-KM). In Figure 6, we observe that our DP-KM solution generates survival curves that closely resemble those obtained with the nonprivate method.

Figure 6.


Survival curves for breast cancer patients in the Surveillance Epidemiology and End Results dataset for different groups. We sampled 2500 patients for each group (ie, black, white, and other) diagnosed since 2005. Curves obtained with (A) the nonprivate KM method and (B) the differentially private method (DP-KM).

We compared the private curves with their nonprivate counterparts using the KS test, and the results are reported in Table 2 for the KM method and in Table 3 for the DP-KM method. We adopted the KS test rather than the log-rank test, as the former can be performed on the survival curves that are outputs of our differentially private methods. Overall, the differentially private curves obtained with our methods are not statistically different from the exact curves (P  > .05) and the differences between groups continue to be statistically significant.

Table 2.

Kolmogorov-Smirnov test results for the Kaplan-Meier method

         White       Black                  Other
White    0.0 (1.0)   0.37 (1.38 × 10−8)     0.21 (4.32 × 10−6)
Black                0.0 (1.0)              0.48 (2.39 × 10−14)
Other                                       0.0 (1.0)

Values are the Kolmogorov-Smirnov statistic (P value).

Table 3.

Kolmogorov-Smirnov test results for the DP-KM method

           DPWhite               DPBlack                DPOther
DPWhite    0.0 (1.0)             0.36 (2.95 × 10−8)a    0.23 (1.08 × 10−4)a
DPBlack                          0.0 (1.0)              0.45 (1.15 × 10−12)a
DPOther                                                 0.0 (1.0)
White      0.10 (.52)a           0.34 (1.29 × 10−9)     0.21 (4.34 × 10−3)
Black      0.38 (5.28 × 10−9)    0.14 (.16)a            0.48 (6.45 × 10−14)
Other      0.28 (5.47 × 10−5)    0.49 (8.70 × 10−15)    0.13 (.21)a

Values are the Kolmogorov-Smirnov statistic (P value); test results were obtained on the curves produced by the differentially private Kaplan-Meier method.

DP: differentially private.

a Differentially private curves are not statistically different from the original ones (P > .05), and they preserve the separation between groups (P < .05).

DISCUSSION

We presented a differentially private framework that can be used to release survival curves while protecting patient privacy. We demonstrated that our method significantly reduces the risk of a privacy breach when compared with its nonprivate counterpart, while retaining the utility of the survival curves. We discuss several future research directions.

Distributed survival analysis

Current research initiatives often rely on collaborative efforts, such as the clinical data research network pSCANNER60 and equivalent multicenter consortia. While our proposed methods are designed for a centralized setting (ie, a trusted aggregator), they could be adapted to the distributed setting. Inspired by previous work,61 we can consider a protocol in which each institution perturbs its local stream of time to events, while a central unit (not necessarily trusted) aggregates and partitions the received streams.

Relaxing privacy

Achieving high utility under differential privacy is very challenging in applications that require continual data releases. Recent works have proposed extensions of the differential privacy model, in which privacy is relaxed over time.48,49 Extending our privacy solutions to satisfy those privacy relaxations would help improve the utility of the released survival curves.

Solutions for other survival models

In this work, we presented a preliminary study on privacy-protecting survival analyses based on the KM method (the actuarial method is shown in the Supplementary Appendix). However, there are many other types of survival models, including those based on Cox proportional hazards,16 accelerated failure time,62 recurrent time-to-event data,63 and competing risk64 methods. Building on our results, we plan to develop new privacy methods for enabling other popular privacy-protecting survival analyses in the future.

CONCLUSION

Publication of survival curves is frequent in the biomedical literature and is becoming more common on websites. In this work, we studied the privacy risk of conducting survival analyses and proposed a differentially private framework for the KM product-limit estimator. The differentially private curves generated by our framework prevent an adversary from inferring the time to event of a particular target individual with small error (eg, the inferred time spans 250 time units) while retaining the usefulness of the original nonprivate curves.

FUNDING

This work was supported by National Heart, Lung, and Blood Institute grant R01HL136835, National Institute of General Medical Sciences grant R01GM118609, and National Human Genome Research Institute grant K99HG010493.

AUTHOR CONTRIBUTIONS

LB developed the methods, contributed the majority of the writing, and conducted the experiments. XJ provided helpful comments on both methods and presentation. LO-M provided the motivation for this work, detailed edits, and critical suggestions.

CONFLICT OF INTEREST STATEMENT

None declared.

Supplementary Material

ocz195_Supplementary_Data

REFERENCES

1. Ohno-Machado L. Modeling medical prognosis: survival analysis techniques. J Biomed Inform 2001; 34 (6): 428–39.
2. Cortese G, Scheike TH, Martinussen T. Flexible survival regression modelling. Stat Methods Med Res 2010; 19 (1): 5–28.
3. Schwartzbaum JA, Hulka BS, Fowler JW, Kaufman DG, Hoberman D. The influence of exogenous estrogen use on survival after diagnosis of endometrial cancer. Am J Epidemiol 1987; 126 (5): 851–60.
4. Foldvary N, Nashold B, Mascha E. Seizure outcome after temporal lobectomy for temporal lobe epilepsy: a Kaplan-Meier survival analysis. Neurology 2000; 54 (3): 630.
5. Galon J, Costes A, Sanchez-Cabo F, et al. Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science 2006; 313 (5795): 1960–4.
6. Le Voyer TE, Sigurdson ER, Hanlon AL, et al. Colon cancer survival is associated with increasing number of lymph nodes analyzed: a secondary survey of intergroup trial INT-0089. J Clin Oncol 2003; 21 (15): 2912–9.
7. Lee ET, Go OT. Survival analysis in public health research. Annu Rev Public Health 1997; 18 (1): 105–34.
8. Wagner M, Redaelli C, Lietz M, Seiler CA, Friess H, Büchler MW. Curative resection is the single most important factor determining outcome in patients with pancreatic adenocarcinoma. Br J Surg 2004; 91 (5): 586–94.
9. Strober M, Freeman R, Morrell W. The long-term course of severe anorexia nervosa in adolescents: survival analysis of recovery, relapse, and outcome predictors over 10–15 years in a prospective study. Int J Eat Disord 1997; 22 (4): 339–60.
10. Erbes R, Schaberg T, Loddenkemper R. Lung function tests in patients with idiopathic pulmonary fibrosis: are they helpful for predicting outcome? Chest 1997; 111 (1): 51–7.
11. Murphy SN, Chueh HC. A security architecture for query tools used to access large biomedical databases. Proc AMIA Symp 2002; 2002: 552–6.
12. Bacharach M. Matrix rounding problems. Manage Sci 1966; 12 (9): 732–42.
13. Lin Z, Hewett M, Altman RB. Using binning to maintain confidentiality of medical data. Proc AMIA Symp 2002; 2002: 454–8.
14. Dwork C. Differential privacy: a survey of results. In: Agrawal M, Du D, Duan Z, Li A, eds. Theory and Applications of Models of Computation (Lecture Notes in Computer Science, volume 4978). New York, NY: Springer; 2008: 1–19.
15. Fredrikson M, Jha S, Ristenpart T. Model inversion attacks that exploit confidence information and basic countermeasures. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. New York, NY: ACM; 2015: 1322–33.
16. Cox DR. Regression models and life-tables. J R Stat Soc Ser B 1972; 34 (2): 187–220.
17. Cutler SJ, Ederer F. Maximum utilization of the life table method in analyzing survival. J Chronic Dis 1958; 8 (6): 699–712.
18. Berkson J, Gage RP. Calculation of survival rates for cancer. Proc Staff Meet Mayo Clinic 1950; 25 (11): 270–86.
19. Balsam LB, Grossi EA, Greenhouse DG, et al. Reoperative valve surgery in the elderly: predictors of risk and long-term survival. Ann Thorac Surg 2010; 90 (4): 1195–201.
20. O’Keefe CM, Sparks RS, McAullay D, Loong B. Confidentialising survival analysis output in a remote data access system. J Priv Confid 2012; 4 (1): 127–54.
21. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008; 4 (8): e1000167.
22. Shokri R, Stronati M, Song C, Shmatikov V. Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy. Piscataway, NJ: IEEE; 2017: 3–18.
23. Klann JG, Joss M, Shirali R, et al. The Ad-Hoc uncertainty principle of patient privacy. AMIA Summits Transl Sci Proc 2018; 2017: 132–8.
24. Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, eds. TCC 2006: Theory of Cryptography Conference. New York, NY: Springer; 2006: 265–84.
25. Faraggi D, Simon R. A neural network model for survival data. Stat Med 1995; 14 (1): 73–82.
26. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. Deep survival: a deep Cox proportional hazards network. stat 2016; 1050: 2.
27. Luck M, Sylvain T, Cardinal H, Lodi A, Bengio Y. Deep learning for patient-specific kidney graft survival analysis. arXiv 2017 May 29 [E-pub ahead of print].
28. Lee C, Zame WR, Yoon J, van der Schaar M. DeepHit: a deep learning approach to survival analysis with competing risks. In: Thirty-Second AAAI Conference on Artificial Intelligence; 2018.
29. Lu C-L, Wang S, Ji Z, et al. WebDISCO: a Web service for DIStributed COx model learning without patient-level data sharing. J Am Med Inform Assoc 2015; 22 (6): 1212–9.
30. Chaudhuri K, Monteleoni C. Privacy-preserving logistic regression. In: Koller D, Schuurmans D, eds. Advances in Neural Information Processing Systems 21 (NIPS 2008). San Diego, CA: Neural Information Processing Systems Foundation; 2008.
31. Chen T, Zhong S. Privacy-preserving models for comparing survival curves using the logrank test. Comput Methods Programs Biomed 2011; 104 (2): 249–53.
32. Yu S, Fung G, Rosales R, et al. Privacy-preserving Cox regression for survival analysis. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM; 2008: 1034–42.
33. Fung G, Yu S, Dehing-Oberije C, et al. Privacy-preserving predictive models for lung cancer survival analysis. Pract Priv-Preserving Data Min 2008; 40.
34. Pagano M, Gauvreau K. Principles of Biostatistics. New York, NY: Chapman and Hall/CRC; 2018.
35. Dwork C, Roth A. The algorithmic foundations of differential privacy. FnT Theor Comput Sci 2013; 9 (3–4): 211–407.
36. Xiao X, Wang G, Gehrke J. Differential privacy via wavelet transforms. IEEE Trans Knowl Data Eng 2011; 23 (8): 1200–14. doi: 10.1109/TKDE.2010.247.
37. Bonomi L, Xiong L. A two-phase algorithm for mining sequential patterns with differential privacy. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. New York, NY: ACM; 2013: 269–78.
38. Li N, Qardaji W, Su D, Cao J. PrivBasis: frequent itemset mining with differential privacy. Proc VLDB Endow 2012; 5 (11): 1340–51.
39. Bhaskar R, Laxman S, Smith A, Thakurta A. Discovering frequent patterns in sensitive data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’10). New York, NY: ACM Press; 2010: 503–12.
40. Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). New York, NY: ACM; 2007: 273–82.
41. Cormode G, Procopiuc C, Srivastava D, Shen E, Yu T. Differentially private spatial decompositions. In: 2012 IEEE 28th International Conference on Data Engineering. Piscataway, NJ: IEEE; 2012: 20–31.
42. Li C, Hay M, Rastogi V, Miklau G, McGregor A. Optimizing linear counting queries under differential privacy. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). New York, NY: ACM; 2010: 123–34.
43. Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. Proc VLDB Endow 2012; 5 (6): 514–25.
44. Fan L, Bonomi L, Xiong L, Sunderam VS. Monitoring web browsing behavior with differential privacy. In: Chung C-W, Broder AZ, Shim K, Suel T, eds. 23rd International World Wide Web Conference (WWW ’14), Seoul, Republic of Korea, April 7–11, 2014. New York, NY: ACM; 2014: 177–88.
45. Fan L, Xiong L. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE Trans Knowl Data Eng 2014; 26 (9): 2094–106.
46. Dwork C, Naor M, Pitassi T, Rothblum GN. Differential privacy under continual observation. In: Proceedings of the Forty-Second ACM Symposium on Theory of Computing. New York, NY: ACM; 2010: 715–24.
47. Chan T-H, Shi E, Song D. Private and continual release of statistics. ACM Trans Inf Syst Secur 2011; 14 (3): 1.
48. Kellaris G, Papadopoulos S, Xiao X, Papadias D. Differentially private event sequences over infinite streams. Proc VLDB Endow 2014; 7 (12): 1155–66.
49. Bolot J, Fawaz N, Muthukrishnan S, Nikolov A, Taft N. Private decayed predicate sums on streams. In: Proceedings of the 16th International Conference on Database Theory. New York, NY: ACM; 2013: 284–95.
50. Bonomi L, Xiong L. On differentially private longest increasing subsequence computation in data stream. Trans Data Priv 2016; 9 (1): 73–100.
51. Chaudhuri K, Monteleoni C, Sarwate A. Differentially private empirical risk minimization. J Mach Learn Res 2011; 12: 1069–109.
52. Ji Z, Jiang X, Wang S, Xiong L, Ohno-Machado L. Differentially private distributed logistic regression using private and public data. BMC Med Genomics 2014; 7 (Suppl 1): S14.
53. Abadi M, Chu A, Goodfellow I, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. New York, NY: ACM; 2016: 308–18.
54. Nissim K, Steinke T, Wood A, et al. Differential privacy: a primer for a non-technical audience. In: 10th Annual Privacy Law Scholars Conference; June 1–2, 2017; Berkeley, California.
55. Dwork C, Naor M, Reingold O, Rothblum GN. Pure differential privacy for rectangle queries via private partitions. In: International Conference on the Theory and Application of Cryptology and Information Security. New York, NY: Springer; 2015: 735–51.
56. Hay M, Rastogi V, Miklau G, Suciu D. Boosting the accuracy of differentially private histograms through consistency. Proc VLDB Endow 2010; 3 (1–2): 1021–32.
57. Barlow RE, Brunk HD. The isotonic regression problem and its dual. J Am Stat Assoc 1972; 67 (337): 140–7.
58. Fleming TR, O'Fallon JR, O'Brien PC, Harrington DP. Modified Kolmogorov-Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 1980; 36 (4): 607–25.
59. Noone AM, Howlader N, Krapcho M. SEER Cancer Statistics Review, 1975–2015. Bethesda, MD: National Cancer Institute.
60. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered scalable national network for effectiveness research. J Am Med Inform Assoc 2014; 21 (4): 621–6. doi: 10.1136/amiajnl-2014-002751.
61. Chan T-H, Li M, Shi E, Xu W. Differentially private continual monitoring of heavy hitters from distributed streams. In: International Symposium on Privacy Enhancing Technologies Symposium. New York, NY: Springer; 2012: 140–59.
62. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med 1992; 11 (14–15): 1871–9.
63. Amorim L, Cai J. Modelling recurrent events: a tutorial for analysis in epidemiology. Int J Epidemiol 2015; 44 (1): 324–33.
64. Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol 2009; 170 (2): 244–56.
