Table 1. Main characteristics of the proposed infrastructure for privacy-protected federated data analyses with big data.
Feature | Capabilities / Advantages | Limitations / Disadvantages |
---|---|---|
Resources | Any data source or computation resource that can be accessed from R is made available in DataSHIELD. These include databases of any kind (SQL, NoSQL, distributed Big Data systems), most of the file formats, domain specific applications accessible through web services or remote commands etc. Applies to any scientific domain. | Resource URL design can be complex when resource options need to be specified. R is the required entry point, which complexifies the use of analysis algorithms in Python for instance. |
Scalability | Scales with number of studies. The resources can interact with Apache Spark or Hadoop | Some advanced statistical techniques may not scale well with number of records. The processing power can sometimes suffer from latency. There is a need for investigation share computation over multiple processors. |
R programming language | R is an open-source and heterogenous programming language. Interpreters for available for many operating systems. A wide community support its users. | Skills for other functional and statistical programming languages can be transferred to learn R. However not all of the analysts are familiar with other programming languages, which can raise some barriers of adoption. |
Disclosure prevention | Several complementary features that provide privacy preservation are implemented in the DataSHIELD architecture. Just one of these is that data custodians/owners (not analysts) have complete control over a series of optional filters that dictate the potential disclosivity of the analytic output. As one example, this includes the minimum count acceptable in a non-empty cell of a contingency table. When rare observations are critical to analysis, data custodians can actively choose to relax the filters to enable meaningful analysis to be undertaken. One of the implications of the combined effect of several of the privacy preservation features is that analysts can never see, copy or abstract the individual-level data on the primary data servers. | Data protection is shared between existing computer systems and DataSHIELD. An existing computer system that has some poorly implemented secured network access and authentication may threaten the quality of disclosure prevention. In other words, as well as the active privacy preservation built into DataSHIELD software, it is also crucial that the hardware on which everything runs must satisfy good practice for conventional privacy protection. |
Graphical User Interface (GUI) | The use of R studio and the specialised data warehouse software have a GUI. Analysis can be integrated with GUI packages, such as R Shiny. | An appropriate use of R Scripts, R notebooks and vignettes is helpful. DataSHIELD users then need to learn how to use these approaches before using DataSHIELD. The use of R Studio can make the learning curve less steep. |
Applications to other biomedical areas than genomics | Any other specific data infrastructures available in public repositories such as images (OpenNeuro), transcriptomic or epigenomic data (GEO) can be accessed and analyzed in a distributed and privacy preserving way. Applications outside biomedical, health and social science are also entirely possible. | There are no obvious scientific or academic domains to which DataSHIELD could not in principle be applied. Particularly, if the aim is to facilitate the quantitative analysis of individual level data which are sensitive either because of ethico-legal or information-governance restrictions, or because of their intrinsic intellectual or commercial value. |
Cost | As everything is based on open-source freeware, the costs are minimized for any organisations who wishes to adopt DataSHIELD. | While DataSHIELD development is substantively funded through grants, there is nevertheless a need to seek support from the user community (particularly large-scale users) to contribute to the core DataSHIELD provision: user training and support; on-line support materials; bug snagging and error correction; continuous testing of evolving code; preparing and smoothly undertaking new releases. |
Time | From the user perspective, DataSHIELD is a time-effective tool as does not require iterative in-person communication between the analyst and the data holders. This is most evident in comparing the time commitment required for a standard consortium-based meta-analysis and that required by a centrally controlled study-level meta-analysis via DataSHIELD. The former requires the analysis centre to ask each study to undertake a series of specified analysis and return results (typically several rounds of analysis and return). In contrast the latter is controlled in real time as if one was working directly with the raw data. This can speed things up by orders of magnitude. | Like any specialised open-source software, DataSHIELD analysts, developers and installer need some time to adapt to the concepts of federated systems and disclosure limitations. |
Server sharing | DataSHIELD supports multi-tenancy, to permit multiple users to share a DataSHIELD server. It also permits multiple servers to be deployed if needed. | DataSHIELD requires some specialised data warehouse software such as Opal. At the moment there is no white-paper that formally defines standards for further development in this area. |
Documentation | DataSHIELD has a wiki that provides beginners training to any new DataSHIELD analysts and developers. Some more advanced tutorials are available on multimedia contents through YouTube. Documentation for the Opal specialised data warehouse software is available online. The online book of this paper will be continuously being updated with more extensions (e.g. transcriptomics, epigenomics, imaging data, longitudinal data analyses, …) |
DataSHIELD documentation needs to be more supportive to the DataSHIELD developers. An online forum is available for community engagement and support. No advanced statistical techniques are explicitly taught, as it is assumed that any analyst planning to use DataSHIELD would already understand the theory and practice underpinning the analysis that is to be undertaken. However, in practice the DataSHIELD team knows that this is not always true and so some analytic theory is increasingly being presented when it is useful. |
Deployment | Software solution packages are available for different hosting systems (includes a container-based option) and the resource concept can adapt itself to the existing hosting infrastructure. It is advised that DataSHIELD should only be deployed in a setting in which all hardware and middleware systems satisfy conventional best practice for data management and privacy protection. Similarly, it is assumed that DataSHIELD will not be used if information governance or other restrictions already proscribe the particular analysis proposed. |
It is important that there is at least a minimum baseline level of system administration knowledge on the part of the data owner and proper dimensioning of the hardware (especially when targeting multi-user, computation intensive usage). Nobody should be making sensitive data available for analysis via any mechanism–including DataSHIELD–if they do not have a proper understanding of their data systems or of the governance framework under which analysis is to be enacted. |