PLOS Computational Biology
2021 Mar 30;17(3):e1008880. doi: 10.1371/journal.pcbi.1008880

Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD

Yannick Marcon 1,*, Tom Bishop 2, Demetris Avraam 3, Xavier Escriba-Montagut 4,5, Patricia Ryser-Welch 3, Stuart Wheater 6, Paul Burton 3, Juan R González 4,5,7,8,*
Editor: Dina Schneidman-Duhovny
PMCID: PMC8034722  PMID: 33784300

Abstract

Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers’ ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture (“resources”) for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).

Author summary

Data sharing enhances understanding of research results beyond what is possible from any single study. Data pooling across multiple studies increases statistical power and allows exploration of between-study heterogeneity. But considerations related to ethico-legal and intellectual/commercial value regularly prevent or impede physical data sharing. DataSHIELD is designed to circumvent this problem. However, despite the growing confidence users have been placing in DataSHIELD to perform privacy-protected analyses of data in cohort consortia, there are real challenges to federated analytics. They include the wide range of data formats and big data sources used, for example, in ‘omics-based research. This article describes the development and implementation of the new “resources” architecture in DataSHIELD that overcomes these limitations. We illustrate its value with real world examples related to genomics and geographical data. We also demonstrate how genomic data sharing initiatives such as GA4GH and EGA can benefit directly from our development. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects.


This is a PLOS Computational Biology Software paper.

Introduction

Big Data brings new opportunities to biomedicine and challenges to data scientists. These challenges require new computational and statistical paradigms to deal with important principles of data management and data sharing. The new paradigm should consider: ensuring appropriate levels of security and privacy [1]; the rigorous application of the stringent regulations required by governance frameworks such as GDPR in Europe (https://gdpr-info.eu/) and similar regulatory mechanisms across North America and elsewhere; and a considered choice between central data warehousing and the distributed (federated) analysis of data that remain with their custodian [2,3].

Historically there has tended to be a focus on warehousing because of the technical challenges of federation [4], and funder requirements to physically share public data [5]. This requires data custodians to physically transfer data to a central location to make them accessible to analytic users. However, the potential benefits of remote and federated approaches to analysing data are now widely recognised [2,3]. Most fundamentally, the physical data then remain under the control of their custodian with limited access. This can offer major benefits in terms of: making it easier to meet ethics and governance requirements; enhanced flexibility to refine and rerun analyses quickly without waiting for an analyst at each institution to follow an updated analysis plan; allowing datasets to be updated quickly without needing to be resent to a central location [2,6].

Anticipating the future growth of federated analysis, the DataSHIELD (www.datashield.ac.uk) and OBiBa (www.obiba.org) projects are now 10 years into the joint development of an open-source analytics platform that enables and simplifies flexible but efficient federated analysis [2,7,8]. DataSHIELD is linked to an Opal database designed for data management, harmonization and dissemination [9]. In addition, it actively constrains the risk of information disclosure: i.e. the risk that a data analyst is able, accidentally or deliberately, to infer individual-level data [2,10]. The DataSHIELD platform has a growing user community and a central role in the analytic strategies of several large research consortia focussed primarily on the federated analysis of large cohort studies, particularly in Europe and Canada. These include BioSHaRE [8], EUCAN-Connect [11], LifeCycle [12], ATHLETE [13] and InterConnect [14]. This article describes a radical extension to the DataSHIELD/Opal platform: the “Resources” architecture. This allows DataSHIELD to be used in a range of new settings, which include the analysis of high-volume data such as omics, geospatial or neuroimaging data, among others.

Despite the growing confidence users have been placing in DataSHIELD to perform privacy-protected analyses, there have, to date, been several serious limitations: (1) Difficulty in applying DataSHIELD federated analytics across the wide range of data formats, and data sources used in ‘omics-based research. (2) Challenges to the efficient porting of high-volume distributed data into the analytic (R) environments on the remote data servers; single large tranches can overwhelm the handling capacity of the system and regular block-by-block refreshments of the analytic data can be impractically slow. (3) To date DataSHIELD has primarily been applied to research settings (typically large cohort studies) where the emphasis has been on the provision of robust disclosure control for analysing sensitive data. But in big data analyses the emphasis is more typically on fast, efficient analysis and data governance often requires a relatively basic level of disclosure control. For example, many consortium-based ‘omics projects would like data to remain with their usual generators/custodians (i.e. a federated approach avoiding the physical transfer of data to users) and for analysts to be unable to see, copy, capture or otherwise infer, those individual-level data. Crucially, if limitations (1) and (2) could be circumvented it would then be straightforward to ensure that this basic level of disclosure control is embedded into all new functions as they are developed and implemented. This would greatly accelerate the development of new functionality, making it realistic to consider rapid implementation of Bioconductor [15], Neuroconductor [16] or R packages designed for big data analyses into DataSHIELD. In one stroke, these objectives have all been realised with the development and implementation of the new “Resources” architecture in DataSHIELD/Opal.

This article describes this new facility, illustrates its value with real world examples and considers the exciting implications it has for the future development of the DataSHIELD platform and for the wider adoption of federated approaches to big data analysis. To help researchers use this framework, we present an online book (https://isglobal-brge.github.io/resource_bookdown/) covering installation, sources of help, specialized topics pertaining to specific aspects of privacy-protecting analysis and complete workflows analysing various examples from biomedical, omics and geospatial settings. The packages developed are available through CRAN or Github repositories under open source licenses (GPL3 or MIT).

Design and implementation

The resources architecture

When analysing data, it is normal to deal with a very wide variety of data formats, data storage systems and programmatic interfaces. The purpose of the work we describe is not to define a new data format or storage system. Instead we have aimed to describe how data can be accessed, in a formal but generic way, to simplify the integration of various data or computation resources in a statistical analysis program. By actively embracing the variety of data formats and computation systems we seek to guarantee that the right tool can always be used for the type and the volume of data that are being considered.

We define a “resource” to be a description of how to access either: (1) data stored and formatted in a particular way or (2) a computation service. Therefore, the descriptors for the resource will contain the following elements: (1) the location of the data or of the computation services, (2) the data format (if this information cannot be inferred from the location property) or the format of the function call to the computation service, (3) the access credentials (if some apply).

Once a resource has been formally defined, it becomes possible to build a programmatic connection object that will make use of the data or computation services described. This resource description is not bound to a specific programmatic language (the URL property is a web standard, other properties are simple strings) and does not enforce the use of a specific software application for building, storing and interpreting a resource object. Section 7.8 in our online book describes some examples of resources available in a demonstration Opal repository.

The data format refers to the intrinsic structure of the data. A very common family of data formats is the tabular format which is made of rows (entities, records, observations etc.) and columns (variables, fields, vectors etc.). Examples of tabular formats are the delimiter-separated values formats (CSV, TSV etc.), the spreadsheet data formats (Microsoft Excel, LibreOffice Calc, Google Sheets etc.), some proprietary statistical software data formats (SPSS, SAS, Stata etc.), the database tables that can be stored in structured database management systems that are row-oriented (MySQL, MariaDB, PostgreSQL, Oracle, SQLite etc.) or column-oriented (Apache Cassandra, Apache Parquet, MariaDB ColumnStore, BigTable etc.), or in semi-structured database management systems such as the document-oriented databases (MongoDB, Redis, CouchDB, Elasticsearch etc.). When the data model is highly structured or particularly complex (data types and objects relationships), a domain-specific data format is sometimes designed to handle the complexity. This then enables statistical analysis and data retrieval to be executed as efficiently as possible. Examples of domain-specific data formats are regularly encountered in the genomic or geospatial fields of research. A data format can also include additional features such as data compression, encoding or encryption. Each data format requires an appropriate reader software library or application to extract the information or perform data aggregation or filtering operations.

Data storage can simply be realised via a file that can be accessed directly from the host’s file system or downloaded from a remote location. More advanced data storage systems can involve software applications that expose an interface to query, extract or analyse the data. These applications can make use of a standard programming interface (e.g. SQL) or expose specific web services (e.g. based on the HTTP communication protocol) or provide a software library (in different programming languages) to access the data. These different ways of accessing the data are not mutually exclusive. In some cases when the (individual-level) micro-data cannot be extracted, computation services returning aggregated/summary statistics may be provided. The data storage system can also apply security rules, requiring authentication and proper authorisations to access or analyse the data.

The resource location description makes use of the web standard “Uniform Resource Identifier (URI): Generic Syntax” [17]. More specifically, the Uniform Resource Locator (URL) specification is what we need for defining the location of the data or computation resource: the term Uniform allows the resource to be described in the same way, independently of its type, location and usage context; the use of the term Resource does not limit the scope of what might be a “resource”, e.g. a document, a service, a collection of resources, or even abstract concepts (operations, relationships, etc.); the term Locator both identifies the resource and provides a means of locating it by describing its access mechanism (e.g. the network location). The URL syntax is composed of several parts: (1) a scheme, that describes how to access the resource, e.g. the communication protocols “https” (secured HTTP communication), “ssh” (secured shell, for issuing commands on a remote server), or “s3” (for accessing Amazon Web Service S3 file store services), (2) an authority (optional), e.g. a server name address, (3) a path that identifies/locates the resource in a hierarchical way and that can be altered by query parameters.
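For illustration, the following hypothetical URLs show how the scheme, authority and path parts combine to locate different kinds of resources (all hosts, paths and bucket names are placeholders):

```
https://data.example.org/study1/phenotypes.csv        # a file served over HTTPS
s3://my-bucket/studies/study1/genotypes.vcf.gz        # a file in an AWS S3 store
postgres://db.example.org:5432/warehouse              # a PostgreSQL database
ssh://compute.example.org:22/home/analyst?exec=plink  # a compute server allowing the "plink" command
```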

The resource’s data format might be inferred from the path component of the URL, for example by using the file name suffix. However, it is not always possible to identify the data format, because the path may make sense only to the data storage system, for example when a file store designates a document using an obfuscated string identifier or when a text-based data format is compressed as a zip archive. The format property can provide this information. Although the authority part of the URL can contain user information (such as the username and password), using this capability is discouraged for security reasons. The resource’s credentials property is used instead; it is composed of an identifier sub-property and a secret sub-property, which can hold a username/password, an access token, a key pair (private and public keys), or any other encoded credentials string. The advantage of separating the credentials property from the resource location property is that a user with limited permissions can have access to the resource’s location information while the credentials are kept secret.

The resourcer R package

The resourcer package is an R implementation of the data and computation resource description and connection. It reuses many existing R packages for reading various data formats and connecting to external data storage or computation servers. The resourcer package’s role is to interpret a resource description object in order to build the appropriate resource connection object. Because the scope of resources is very wide, the resourcer package provides a framework for dynamically extending its interpretation capabilities to new types of resources. Below, we describe the key issues in dealing with resources within the R environment. Further details and examples are available in Section 7.8 of our online book.

Resources R Implementation

The resource class is a simple R structure that holds the properties of the resource described in the previous section: URL, format, identity and secret. To simplify the designation of a resource, an additional name attribute is defined. This identifier is optional and is not necessarily unique. The ResourceClient is a key R6 class, which wraps a resource object and defines operations that can be performed on it. The resourcer package has built-in support for the following use cases:

  • Data file resource whose location is defined by the resource’s URL and that can be downloaded in a temporary folder to be read. The file locations that are supported by default are: (1) the local file system (obviously with a no-op download), (2) an HTTP(S) connection, optionally providing basic authentication based on the resource’s credentials, (3) the MongoDB GridFS file store, (4) Opal’s file store and (5) a remote SSH server. For the reading part, the resourcer package uses some tidyverse R packages (such as haven for the SAS, SPSS and Stata data formats, readr for delimited data formats and readxl for Excel data formats) or can load an R object based on its class name specified in the resource’s data format property.

  • SQL database resource which has a connector based on R’s interface for databases (DBI). The resource’s URL indicates which database connector is to be used. The SQL databases supported by default are MySQL/MariaDB and PostgreSQL, and some “big data” databases exposing a SQL interface such as PrestoDB and Apache Spark. The resourcer package can be extended to new DBI-compatible databases.

  • NoSQL database resource which can be read using connectors from the nodbi package. Only the MongoDB database is supported for now.

  • Command-based computation resource. The resourcer package can handle commands to be executed in the local shell and on a remote server through a secure shell (SSH) connection.
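As a minimal sketch of the first of these use cases, the following code describes a remote CSV file as a resource and reads it through the corresponding client. The URL is a placeholder, while newResource() and newResourceClient() are the resourcer constructors:

```r
library(resourcer)

# Describe a tidy (CSV) data file resource served over HTTPS
res <- newResource(
  name   = "study1-pheno",
  url    = "https://data.example.org/study1/phenotypes.csv",  # placeholder URL
  format = "csv"
)

# Resolve the description into a client and coerce the data to a data.frame
client <- newResourceClient(res)
pheno  <- client$asDataFrame()
client$close()
```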

Interacting with R Resources

As the ResourceClient is simply a connector to a resource, its utility is enhanced by a range of data conversion functions that are defined by default:

  • R data.frame, which is the most common representation of tabular data in R. A data frame, as defined in base R, is an object stored in memory, which may not be suitable for large to big datasets.

  • dplyr tbl, which is another representation of tabular data provided by the dplyr package that integrates nicely with the R interface for databases: filtering, mutation and aggregation operations can be delegated to the underlying SQL database, reducing the R memory and computation footprint. Useful functions are also provided to perform join operations on relational datasets.
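As a sketch of this dplyr integration, assuming client wraps a SQL database resource whose table has age, sex and bmi columns (all hypothetical names), aggregation can be delegated to the database:

```r
library(dplyr)

# asTbl() exposes the underlying table as a dplyr tbl; filter/group/summarise
# are translated to SQL and executed by the database itself
tbl <- client$asTbl()
by_sex <- tbl %>%
  filter(age >= 18) %>%
  group_by(sex) %>%
  summarise(mean_bmi = mean(bmi, na.rm = TRUE)) %>%
  collect()   # only the aggregated rows are materialised in R memory
```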

When the resource is an R object, the R data file ResourceClient offers the ability to get the internal raw data object. Complex data structures, optimized for a specific research domain, can then be accessed with the most appropriate tools.

When the resource is a computation service provider, the interaction with the resource client will consist of issuing commands/requests with parameters and getting the result from it either as a response object or as a file to be downloaded.

The purpose of the resourcer package is definitely not to substitute itself for the underlying libraries; rather, it offers a general approach that facilitates access to data and service resources in the way that is most specific to each resource.

Extending R Resources

Thanks to its modular and dynamic architecture, the resourcer package can easily be extended to:

  • new file locations (see for instance the s3.resourcer R package, which connects to Amazon Web Services S3 file stores),

  • new file readers (see Section 8 in our online book for how VCF files are read in the dsOmics R package),

  • new DBI-compatible databases (for instance, using the bigrquery R package, it would be easy to implement access to a resource stored in a Google BigQuery database),

  • new domain specific applications which would expose data extraction and/or analysis services. The only requirement is that an R connection API exists for the considered resource application.
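As a sketch of this extension mechanism, a new resolver subclasses the resourcer ResourceResolver class and registers itself. The BigQuery recognition logic below is only a stub, and BigQueryResourceClient is a hypothetical client class assumed to be defined elsewhere:

```r
library(resourcer)

# A hypothetical resolver recognising resources with a "bigquery://" URL scheme
BigQueryResourceResolver <- R6::R6Class("BigQueryResourceResolver",
  inherit = ResourceResolver,
  public = list(
    isFor = function(x) {
      super$isFor(x) && startsWith(x$url, "bigquery://")
    },
    newClient = function(x) {
      # BigQueryResourceClient (not shown) would wrap a bigrquery connection
      if (self$isFor(x)) BigQueryResourceClient$new(x) else NULL
    }
  )
)

# Make the resolver discoverable by resourcer (and hence by Opal/DataSHIELD)
registerResourceResolver(BigQueryResourceResolver$new())
```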

Resources with DataSHIELD/Opal

The DataSHIELD infrastructure is a software solution that allows the simultaneous co-analysis of multiple datasets stored on different servers, without the need to physically pool data or disclose sensitive information. DataSHIELD uses Opal servers to perform such analyses.

At a high level DataSHIELD is set up as a client-server model, each server housing the data for its corresponding study. A request is made from the client to run specific functions on the data held in the remote servers and that is where the required analyses are actually performed. Non-sensitive and pre-approved summary statistics are returned from each study server to the client where they can then be combined for an overall analysis. An overview of what a multi-site DataSHIELD architecture would look like is illustrated in Fig 1.

Fig 1. A schematic diagram of a multi-site DataSHIELD infrastructure.

It includes one central analysis node (the client) and three data nodes (the servers).

The current limitation of this infrastructure is that the data assignment operation consists of extracting the dataset from the data repository (the primary storage can be either a SQL database or a MongoDB database) and pushing it into the R server’s memory as a data.frame object. This data assignment process consumes time and physical memory, and applies only to datasets in tabular format. Although acceptable for small datasets (usually less than 10M data points), this infrastructure cannot be used for big data systems, complex data structures and existing computation facilities. In order to overcome these limitations, data access can instead be delegated to the R servers using the concept of resources. The DataSHIELD middleware (Opal) is responsible for assigning a resource description as a ResourceClient object in the R server (as soon as the relevant permissions have been granted). The connection to data or computation resources is then readily usable by the DataSHIELD user. In terms of data access security, the resource credentials are not visible from the DataSHIELD client node, just like the individual-level data.

After converting a resource to a data.frame object, existing DataSHIELD analysis R packages can be used in a backward compatible way. New DataSHIELD R packages, such as dsOmics and dsGeo, described in this paper, can make use of the power of resources to apply DataSHIELD’s data privacy preserving paradigm to large datasets, complex data structures and external computation facilities.
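A minimal sketch of this workflow from the analyst’s side is shown below. The server URL matches the demo site described later, the credentials and resource name are placeholders, and as.resource.data.frame() is the dsBase server-side coercion function:

```r
library(DSI)
library(DSOpal)

# Declare one Opal server and the resource to be used in the analysis
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
               user = "dsuser", password = "dspassword",      # placeholders
               resource = "RSRC.study1_pheno")                # placeholder
logindata <- builder$build()

# Log in; with assign = TRUE the resource is assigned server-side as "res"
conns <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "res")

# Coerce the ResourceClient into a data.frame usable by existing DataSHIELD packages
DSI::datashield.assign.expr(conns, symbol = "D",
                            expr = quote(as.resource.data.frame(res)))
```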

The integration of the resources concept in Opal consists of (1) dynamically discovering the different types of resources that can be handled by the associated R server, (2) providing an appropriate user-friendly graphical interface (GUI) for managing the resource descriptions and access permissions, and (3) assigning resource objects in the R server on user demand. The latest version of Opal (v 3.0) implements these capabilities, making resources accessible to the DataSHIELD framework.

Results

Table 1 describes the main features of our proposed framework, which are mainly driven by DataSHIELD’s capabilities. The table also shows the main advantages and disadvantages of using the resources in federated data analyses. Different features, including scalability, disclosure prevention, deployment and applications to biomedical areas other than genomics, are discussed.

Table 1. Main characteristics of the proposed infrastructure for privacy-protected federated data analyses with big data.

  • Resources. Capabilities/Advantages: Any data source or computation resource that can be accessed from R is made available in DataSHIELD. These include databases of any kind (SQL, NoSQL, distributed big data systems), most file formats, and domain-specific applications accessible through web services or remote commands. Applies to any scientific domain. Limitations/Disadvantages: Resource URL design can be complex when resource options need to be specified. R is the required entry point, which complicates the use of analysis algorithms written in Python, for instance.

  • Scalability. Capabilities/Advantages: Scales with the number of studies. The resources can interact with Apache Spark or Hadoop. Limitations/Disadvantages: Some advanced statistical techniques may not scale well with the number of records. Processing can sometimes suffer from latency. Sharing computation over multiple processors needs further investigation.

  • R programming language. Capabilities/Advantages: R is an open-source programming language with interpreters available for many operating systems, and a wide community supports its users. Skills in other functional and statistical programming languages can be transferred to learn R. Limitations/Disadvantages: Not all analysts are familiar with such programming languages, which can raise some barriers to adoption.

  • Disclosure prevention. Capabilities/Advantages: Several complementary privacy-preservation features are implemented in the DataSHIELD architecture. Just one of these is that data custodians/owners (not analysts) have complete control over a series of optional filters that dictate the potential disclosivity of the analytic output; as one example, this includes the minimum count acceptable in a non-empty cell of a contingency table. When rare observations are critical to an analysis, data custodians can actively choose to relax the filters to enable meaningful analysis to be undertaken. One implication of the combined effect of several of these features is that analysts can never see, copy or abstract the individual-level data on the primary data servers. Limitations/Disadvantages: Data protection is shared between existing computer systems and DataSHIELD. An existing computer system with poorly implemented secure network access and authentication may threaten the quality of disclosure prevention. In other words, as well as the active privacy preservation built into the DataSHIELD software, the hardware on which everything runs must satisfy good practice for conventional privacy protection.

  • Graphical User Interface (GUI). Capabilities/Advantages: RStudio and the specialised data warehouse software provide GUIs, and analyses can be integrated with GUI packages such as R Shiny. Limitations/Disadvantages: An appropriate use of R scripts, R notebooks and vignettes is helpful, so DataSHIELD users need to learn these approaches before using DataSHIELD. The use of RStudio can make the learning curve less steep.

  • Applications to biomedical areas other than genomics. Capabilities/Advantages: Other specific data infrastructures available in public repositories, such as images (OpenNeuro) or transcriptomic and epigenomic data (GEO), can be accessed and analyzed in a distributed and privacy-preserving way. Applications outside biomedical, health and social science are also entirely possible: there is no obvious scientific or academic domain to which DataSHIELD could not in principle be applied, particularly where the aim is to facilitate the quantitative analysis of individual-level data that are sensitive either because of ethico-legal or information-governance restrictions, or because of their intrinsic intellectual or commercial value.

  • Cost. Capabilities/Advantages: As everything is based on open-source software, costs are minimized for any organisation that wishes to adopt DataSHIELD. Limitations/Disadvantages: While DataSHIELD development is substantively funded through grants, there is nevertheless a need for the user community (particularly large-scale users) to contribute to the core DataSHIELD provision: user training and support; online support materials; bug snagging and error correction; continuous testing of evolving code; and preparing and smoothly undertaking new releases.

  • Time. Capabilities/Advantages: From the user’s perspective, DataSHIELD is a time-effective tool, as it does not require iterative in-person communication between the analyst and the data holders. This is most evident when comparing the time commitment required for a standard consortium-based meta-analysis with that required by a centrally controlled study-level meta-analysis via DataSHIELD. The former requires the analysis centre to ask each study to undertake a series of specified analyses and return results (typically several rounds of analysis and return); in contrast, the latter is controlled in real time as if one were working directly with the raw data. This can speed things up by orders of magnitude. Limitations/Disadvantages: Like any specialised open-source software, DataSHIELD analysts, developers and installers need some time to adapt to the concepts of federated systems and disclosure limitations.

  • Server sharing. Capabilities/Advantages: DataSHIELD supports multi-tenancy, permitting multiple users to share a DataSHIELD server. It also permits multiple servers to be deployed if needed. Limitations/Disadvantages: DataSHIELD requires specialised data warehouse software such as Opal. At the moment there is no white paper that formally defines standards for further development in this area.

  • Documentation. Capabilities/Advantages: DataSHIELD has a wiki that provides beginners’ training for new DataSHIELD analysts and developers. Some more advanced tutorials are available as multimedia content through YouTube. Documentation for the Opal specialised data warehouse software is available online. The online book accompanying this paper will be continuously updated with more extensions (e.g. transcriptomics, epigenomics, imaging data, longitudinal data analyses, …). Limitations/Disadvantages: DataSHIELD documentation needs to be more supportive of DataSHIELD developers; an online forum is available for community engagement and support. No advanced statistical techniques are explicitly taught, as it is assumed that any analyst planning to use DataSHIELD already understands the theory and practice underpinning the analysis to be undertaken. In practice, however, the DataSHIELD team knows that this is not always true, and so some analytic theory is increasingly being presented where it is useful.

  • Deployment. Capabilities/Advantages: Software solution packages are available for different hosting systems (including a container-based option) and the resource concept can adapt itself to the existing hosting infrastructure. Limitations/Disadvantages: DataSHIELD should only be deployed in settings in which all hardware and middleware systems satisfy conventional best practice for data management and privacy protection; similarly, it is assumed that DataSHIELD will not be used if information governance or other restrictions already proscribe the particular analysis proposed. There must be at least a minimum baseline level of system administration knowledge on the part of the data owner, and proper dimensioning of the hardware (especially when targeting multi-user, computation-intensive usage). Nobody should make sensitive data available for analysis via any mechanism, including DataSHIELD, without a proper understanding of their data systems and of the governance framework under which analysis is to be enacted.

Available resources extensions

We have extended the resources available in the resourcer package to different settings. These extensions, as well as the current resources that can be accessed through the Opal servers, are described in Table 2. So far, we can get data from different locations (Amazon Web Services, HL7 FHIR or Dremio), read other types of files that are specific to genomic studies (BAM, VCF and PLINK) and access data from other infrastructures such as GA4GH, a federated ecosystem for sharing genomic and clinical data [18], and EGA, a permanent archive that promotes distribution and sharing of genetic and phenotype data consented for specific approved uses [19].

Table 2. Available resources at the resourcer R package and extensions for genomic data.

Type | Resource | Reference | R package | Use
File reader | R data | https://cran.r-project.org/ | resourcer | Any
File reader | Tidy data (.csv, .tsv, .txt, …) | https://www.tidyverse.org/ | resourcer | Any
File location | S3 compatible file store | https://min.io/, https://aws.amazon.com/ | s3.resourcer | Any
Database | SQL | https://www.mysql.com/, https://mariadb.org/, https://www.postgresql.org/, https://prestodb.io/ | resourcer | Any
Database and big data analytics | SQL | https://spark.apache.org/ | resourcer | Any
Database | NoSQL | https://www.mongodb.com/ | resourcer | Any
Computation service | SSH | https://en.wikipedia.org/wiki/Secure_Shell | resourcer | Any
Domain specific | HL7 FHIR | http://hl7.org/fhir/ | fhir.resourcer | Patient data
Database | SQL | https://www.dremio.com/ | odbc.resourcer | Any
File reader | VCF | https://en.wikipedia.org/wiki/Variant_Call_Format | dsOmics | Genomic
File reader | GDS | https://bioconductor.org/packages/release/bioc/html/gdsfmt.html | dsOmics | Genomic
File reader | BAM | http://samtools.github.io/hts-specs/SAMv1.pdf | dsOmics | Genomic
File reader | Bioconductor infrastructures (ExpressionSet, RangedSummarizedExperiment, MultiAssayExperiment, …) | http://bioconductor.org/ | dsOmics | Genomic
Computation service | PLINK | http://zzz.bwh.harvard.edu/plink/ | dsOmics | Genomic
Domain specific | GA4GH | https://www.ga4gh.org/ | dsOmics | Genomic and clinical
Domain specific | EGA | https://ega-archive.org/ | dsOmics | Genomic and clinical

Real data analyses

We illustrate how to perform privacy-protecting big data analyses using our proposed infrastructure. We have set up an Opal demo site (see Chapter 4 in our bookdown) to illustrate how to perform some basic analyses using DataSHIELD, as well as how to deal with different resources for genomic and geographical data. These data are publicly available and can be accessed through DataSHIELD or using the URL available in the Opal site. The genomic example describes how to perform genome-wide association studies (GWAS). The geographic examples describe how to analyse information about journeys undertaken by specified individuals, the environment through which they travel and whether the journeys have an impact on the individuals’ health. Chapter 4 in our online book provides users with workflows and case studies for downstream analyses and visualizations.

Genomic data analysis

Bioconductor provides core data structures and methods that enable genome-scale analysis of high-throughput data in the context of the rich statistical programming environment offered by the R project [15]. We have created two packages to perform privacy-protecting federated genomic data analysis with DataSHIELD and Bioconductor. The dsOmics package contains the functions that are used on the server side, where the actual analysis is implemented, and which specify the privacy-protecting summary statistics that will be sent back to the client, while the dsOmicsClient package has the functions that are used on the client side to control the commands that are sent to the server side and to combine the received outcomes for pooled analysis applications.

Genomic data can be stored in different formats; Variant Call Format (VCF) and PLINK files [20] are commonly used in genetic epidemiology studies. In order to deal with this type of data, we have extended the resources available in the resourcer package to VCF files. We use the Genomic Data Structure (GDS) format, which is designed for large-scale management of genome-wide variants and allows VCF files to be handled efficiently within the R environment. This extension requires the creation of specific ResourceClient and ResourceResolver classes and is available in the dsOmics package (see Section 8 in the supplementary book). Briefly, the client class uses the snpgdsVCF2GDS function implemented in SNPRelate to coerce the VCF file to a GDS object [21]. The GDS object is then loaded into R as an object of class GdsGenotypeReader from the GWASTools package [22], which facilitates downstream analyses such as quality control of SNP and individual data, population stratification and association analyses using the GENESIS Bioconductor package [23].
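As a brief illustration of this conversion step outside DataSHIELD (file paths are placeholders), the server-side logic is essentially:

```r
library(SNPRelate)
library(GWASTools)

# Convert a VCF file into the GDS format for efficient genotype management
snpgdsVCF2GDS("study1.vcf.gz", "study1.gds", method = "biallelic.only")

# Open the GDS file as a GdsGenotypeReader for downstream GWASTools/GENESIS analyses
geno <- GdsGenotypeReader("study1.gds")
nscan(geno)   # number of samples
close(geno)
```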

Fig 2 describes how GWAS may be performed using DataSHIELD client-side functions (i.e. the dsOmicsClient package). Basically, data (genomic and phenotypes/covariates) can be stored in different sites (EGA, GA4GH, HTTPS, SSH, AWS S3, local, …) and are managed with Opal through the resourcer package and its extensions implemented in dsOmics. The association analyses involved in GWAS are based on fitting a generalized linear model (GLM) for each SNP, which can be performed using a base DataSHIELD function (ds.glm). The difference between standard data analysis and that done by DataSHIELD is that the analysis is performed at the location of the data (e.g. the data nodes): no data are transferred from that location, only non-disclosive summary statistics. The set of analytical operations that can be requested at the location of the data has been carefully constructed to prevent any attempt at direct or inferential disclosure of any individual-level information. The R parser also blocks any form of arguments that are not allowed in DataSHIELD. For analytical operations that could potentially yield disclosive results, for example if a small amount of data is being analysed, the operation will check whether the results match the data protection policies of the location’s data governance rules before returning any results to the client (e.g. the analysis node). If any of the protection rules are violated, the client does not receive any results but gets study-side messages with information about potential disclosure issues [2].
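For example, assuming the connections conns from the sketch above and a server-side data.frame D holding one SNP column plus covariates (all variable names here are hypothetical), a single-SNP association model is fitted with the base DataSHIELD function:

```r
library(dsBaseClient)

# Logistic model of disease status on one SNP, adjusted for sex and age;
# only non-disclosive aggregate statistics are returned to the client
mod <- ds.glm(formula = "diabetes ~ rs1234 + sex + age",
              data = "D", family = "binomial", datasources = conns)
mod$coefficients
```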

Fig 2. Scheme of the DataSHIELD implementation of genomic data analysis.

The dsOmics package contains functions to perform non-disclosive data analyses of resources encoding genomic data that are managed within Opal using the resourcer package. Genomic data normally comprise two pieces of information, one corresponding to variants (e.g. SNPs) and another to phenotypic data (grouping variable, outcome, covariates, …). Both can be stored in different resources: BAM/VCF/PLINK for SNPs and text/CSV files for phenotypes and covariates. This package should be installed in the Opal server along with its dependencies. The dsOmicsClient package must be available on the client side and contains functions that allow interaction between the analysis computer and the servers.

It should be noted that repeatedly calling the ds.glm function can be very time-consuming when analysing thousands of SNPs, since it requires multiple iterations of calls over the network between the client and the server. In order to overcome this problem, we also implemented a federated meta-analysis approach that basically runs an independent GWAS at each server and then meta-analyses the results. GDS data at each server are analyzed using the GWASTools and GENESIS Bioconductor packages, which allow GWAS to be performed very quickly. Once the study-specific estimates and standard errors generated by the analyses undertaken on each server have been returned to the client, they can be combined using whatever meta-analysis approaches, and whatever R meta-analysis packages, the user may choose. This methodology has some limitations when data are not properly harmonized (e.g. genotyping on different platforms, different VCF versions, …). In order to overcome this problem, data format validation can also be performed by the analyst using DataSHIELD functions. In genomics, this can be achieved by first doing imputation and then solving issues concerning genomic strand and file format [24].
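A minimal sketch of the final pooling step, using the metafor package on hypothetical study-specific estimates for a single SNP returned by the servers:

```r
library(metafor)

# Hypothetical per-study effect estimates and standard errors for one SNP
beta <- c(study1 = 0.12, study2 = 0.08, study3 = 0.15)
se   <- c(study1 = 0.05, study2 = 0.04, study3 = 0.07)

# Fixed-effect meta-analysis of the federated GWAS results
res <- rma(yi = beta, sei = se, method = "FE")
summary(res)
```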

GWAS can also be performed using programs that are executed via shell commands. This is the case for PLINK, one of the state-of-the-art programs for running GWAS. Resources also allow the use of a secure shell (SSH) service to run programs on a remote server, accessible through SSH, that contains the data and analysis tools; R is then just used for launching the analyses and aggregating the results. This feature allows us to create functions to analyze data using specific shell programs. Section 9 in our online book describes how the PLINK program can be used to perform GWAS. In this case, the resource describes that access is given via SSH, the credentials required to connect, and the commands that can be run (one of which is plink).
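An illustrative sketch of such a resource is given below. The host, path and credentials are placeholders, and we assume the SSH resource client exposes an exec() method for running the whitelisted commands:

```r
library(resourcer)

# A computation resource: a server reachable over SSH where only the commands
# listed in the "exec" query parameter may be run
res <- newResource(
  name     = "plink-server",
  url      = "ssh://compute.example.org:22/home/analyst/data?exec=plink,ls",
  identity = "analyst",   # placeholder credentials
  secret   = "s3cr3t"
)

client <- newResourceClient(res)
# Launch a PLINK association analysis remotely and collect its output
out <- client$exec("plink", c("--bfile", "study1", "--assoc", "--out", "results"))
client$close()
```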

We would like to emphasize that with DataSHIELD, analysis is performed at the location of the data. The data provider has full control over what information is transferred from their location to the location of the analyst by setting filters for a number of disclosure traps. This means that the results returned to the analyst can be carefully created to be non-disclosive, and match the policies of the data provider’s data governance rules.

Geographic Information System (GIS) and spatial analysis

The R packages rgdal, rgeos and sp provide core data structures and methods that enable analysis of geospatial data in the context of the rich statistical programming environment offered by the R project. We have created two packages to perform privacy-protecting federated GIS data analysis with DataSHIELD and these packages. The dsGeo package contains the functions used on the server side to ensure privacy-protecting analyses, while dsGeoClient has functions that command the data analyses from the client side and enable integration across studies.

The resourcer package allows large geospatial datasets to be handled. These include data derived from standard storage systems, such as relational databases, making use of existing sp data structures such as SpatialPoints and SpatialLinesDataFrame, among others. These types of data are the core of geospatial analysis in R, allowing users to work with geometries and their descriptive attributes. For example, we might want to know that a GPS trace corresponds to someone who is 45 years old, or that a region defined by a polygon has a particular air pollution level. As described in the methods section, resources can be extended to any type of data that can be managed within R. Here we describe how to extend the resources to the case of analysing Global Positioning System (GPS) traces and other geolocation data, combined with phenotypic data.

Section 13 in our online book illustrates how to perform a realistic analysis of GPS traces and geolocation data. In this example, building on the work of Burgoine et al. [25], we consider daily commutes captured as GPS traces by 810 individuals in the eastern suburbs of London. We also have data on the location of 6100 fast food or takeaway outlets in the same area. Further data are available relating to each individual’s Body Mass Index (BMI), age, sex, total household income, highest educational qualification and smoking status. These data therefore allow us to test the association between exposure to takeaway food on a commute and an individual’s BMI. We illustrate how the tools available in the dsGeo package allow this question to be addressed.

We created three resources in Opal that contain the GPS journey data (SpatialLinesDataFrame), the locations of the food outlets (SpatialPointsDataFrame) and the phenotypic data (data.frame). These data can be manipulated to give a measure of exposure to the food outlets for each individual. Our package allows us to transform the point data denoting food outlets into buffer regions surrounding each point. The idea is that if an individual’s GPS trace falls within this buffer, then we can say that the individual is ‘exposed’ to that food outlet. Thus, we find the intersection of each individual’s trace with each buffered region to obtain a vector of the intersecting buffers for each individual. This is further processed into a count of ‘exposures’ per individual. Finally, we can run GLMs using the DataSHIELD base function ds.glm() to test the association between BMI and food outlet exposure, correcting for potential confounding factors such as income.
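A toy, non-federated sketch of this buffer-and-intersect exposure computation with sp and rgeos (coordinates and buffer width are illustrative only):

```r
library(sp)
library(rgeos)

# Three food outlets and one commute trace along the line y = 1
outlets <- SpatialPoints(cbind(x = c(1, 3, 5), y = c(1, 2, 1)))
buffers <- gBuffer(outlets, width = 0.5, byid = TRUE)
trace   <- SpatialLines(list(Lines(list(Line(cbind(0:6, rep(1, 7)))), ID = "p1")))

# Count how many outlet buffers the trace passes through (here, 2)
exposure <- sum(gIntersects(trace, buffers, byid = TRUE))
```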

Availability

We release the resourcer R package under the open-source LGPL (≥ 2.1) license. The source code can be downloaded from CRAN and our public repository https://github.com/obiba/resourcer. The server-side (e.g. Opal) packages dsOmics and dsGeo are R packages under the open-source MIT and GPL-3 licenses, respectively. They can be installed from GitHub: https://github.com/isglobal-brge/dsOmics and https://github.com/tombisho/dsGeo. The client-side R packages dsOmicsClient and dsGeoClient are also publicly available at the GitHub repositories https://github.com/isglobal-brge/dsOmicsClient and https://github.com/tombisho/dsGeoClient. The “resources” extensions for GA4GH and EGA are available in the dsOmics package. R code describing how to perform the genomic and geographical data analyses using our proposed infrastructure is available in our bookdown https://isglobal-brge.github.io/resource_bookdown. Data used in this bookdown are publicly available and can be accessed through DataSHIELD or using the URL available in the Opal site https://opal-demo.obiba.org (user: administrator, password: password, see https://isglobal-brge.github.io/resource_bookdown/opal.html#opal-demo-site). Users may report bugs or other issues via the “Issues” tab at the GitHub repositories or using the online forum at https://datashield.discourse.group/.

Future perspectives

DataSHIELD

The development of the new “resources” component of the OBiBa middleware that underpins the DataSHIELD platform has profound implications for the future of the overall DataSHIELD project. It has greatly relaxed constraints on the source, format and volume of the data that can be ingested into a DataSHIELD session, while continuing to provide full flexibility in terms of the capacity to enact active disclosure control at a level appropriately tailored to the context of the particular analytic problem at hand. This has already allowed us to embark on exploring extensions of functionality to encompass some of the most widely used classes of contemporary research data, for example, omic data and images. In tandem with recent advances we have made in learning how to simplify the extension of functionality in any field, provided that a relevant R package already exists, DataSHIELD is now on the cusp of being able to provide a generic, easily usable and simply extendable approach to truly federated analysis across many science and technology domains. This will have a myriad of applications in academic research, commercial settings and health and social care systems. The ability to finely tune disclosure controls in a manner that can only be modified by the data custodian will make this approach particularly attractive for anybody wishing to work with data that are sensitive. This not only includes human (or other) data subject to the appropriately stringent requirements of contemporary law, ethics and broader frameworks for data governance, but also encompasses data that are sensitive for reasons of intellectual property investment or commercial value. However, if disclosure controls are set to be very permissive or off, the convenience, flexibility and potential extendibility of the system could ultimately make this an approach of choice for any federated analysis, even if the data are not especially sensitive.

In light of the development of resources, it is envisaged that in the future there will be at least three flavours of DataSHIELD that will vary in their convenience of use, flexibility and ease of extension of functionality: (1) All disclosure checks set to off. This will permit maximum flexibility for analysing non-sensitive federated data and is easily extendable by adding new functionality; (2) Disclosure checks on but minimal, e.g. preventing users from seeing or copying the individual-level data but avoiding restrictions on the analyses themselves. This flavour is likely to be most useful when data from a large research platform cannot physically be shared but the analysis required is based on data objects that are fundamentally privacy-protecting, such as omics data or images based on internal scanning; (3) Full disclosure checks; this will be equivalent to the default situation that applies now.

Now that it is possible for DataSHIELD to work much more readily with large/big datasets, we anticipate that DataSHIELD and the resourcer R package will offer functions based on the tidyverse as well as base R. For example, this will include the option of using the dplyr R package for operating on tabular datasets (see the Interacting with R Resources section) and on tibbles as well as standard R data.frame objects. A resources integration improvement might therefore be to use the dplyr API to delegate as much data filtering and mutating as possible to the underlying data storage system (e.g. databases exposing a SQL query interface). This extension is already being explored.

Parallel computing

Working effectively with large data may also require programming practices that match the available computer hardware infrastructure, both processing and storage. R is efficient when operating on vectors or arrays, so a pattern used by high-performing and scalable algorithms is to split the data into manageable chunks and to iterate over them. Chunks can be evaluated in parallel to gain speed. There are several R packages that can be used to this end (parallel, foreach, …), as well as the BiocParallel Bioconductor package, which facilitates parallel evaluation across different computing environments while freeing users from having to configure the technicalities. DataSHIELD analysis is parallelized by design (i.e. each server works independently of the others). Therefore, for a server instance, the best approach is to use data structures and analysis tools that perform computations efficiently. This is one of the strongest points of the resources approach, as demonstrated by the integration of dplyr and Bioconductor packages, as well as by packages such as sparklyr [26].
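As a small sketch of this chunked pattern with BiocParallel (the per-chunk computation is a stub):

```r
library(BiocParallel)

# Split 10,000 SNP indices into 4 chunks and process them in parallel
chunks <- split(seq_len(10000), cut(seq_len(10000), 4))
res <- bplapply(chunks, function(idx) {
  length(idx)  # stub: replace with per-chunk association tests
}, BPPARAM = MulticoreParam(workers = 4))
```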

Omic and geographical data

We have provided some of the functionalities offered by Bioconductor and R packages in the DataSHIELD context that allow genomic and geographical data to be analysed. These packages are extensive and more work is needed to repackage a more complete range of operations in a privacy-protecting way. For instance, dsOmics can easily be extended to other omic data analyses, such as differential gene expression or methylation analyses, using the same strategy as the one used for genomic data. Visualisation can add great value and will be covered as it is implemented in a privacy-protecting way. The dsBaseClient and dsBase packages already contain functions for privacy-protecting plots such as heatmaps and scatter plots. The privacy of data is protected in these cases by effectively blurring the data or by removing outlying points. These techniques could be adapted, for instance, to allow geospatial data to be visualised in a privacy-protecting way.

Data cataloging

The next step of the resources integration in Opal is to make their metadata findable by researchers: exposing the data dictionaries, annotated with taxonomy or ontological terms, would benefit the research community when looking for datasets for a research question. The OBiBa software application suite provides both Opal, the data repository (or data integration system, using the resources), and Mica [9], the data web portal application. Mica operates by extracting the dataset dictionaries from Opal to build a searchable data catalogue with basic summary statistics, and by allowing the submission of data access requests. Resources registered in Opal should be made visible from Mica as well.

Other applications

We would like to highlight that there are dozens of disciplines other than genomics and geospatial analysis that could also benefit from our infrastructure. For instance, extending the resources to other settings such as neuroimaging, by using libraries from Neuroconductor (a project similar to Bioconductor for computational imaging), would be an important advance in that field, since data confidentiality may also be an issue there. It is also worth noting that one of the main advantages of using resources is that we do not need to move data from their original repositories, which can be a serious problem when dealing with neuroimaging data [27]. Another area that can readily benefit from our new framework is artificial intelligence. Big data and machine learning have innovatively applied many advanced statistical methodologies, such as deep learning, which is driving the creation of new and innovative clinical diagnostic applications, among others [28]. The current trend is to include machine learning algorithms within cloud capacities for different biomedical problems [29–31]. Our framework can interface with Apache Spark, a fast and general engine for big data processing [32], through the sparklyr R package, which will allow the use of different machine learning algorithms for big data.

Acknowledgments

We would like to thank the software developers and scientific staff who have worked on Opal. We also acknowledge support provided by Maelstrom Research partners and invaluable feedback we have received from Opal users over the years.

Data Availability

Data are available as a figshare project in this link: https://figshare.com/projects/Orchestrating_privacy-protected_big_data_analyses_of_data_from_different_resources_with_R_and_DataSHIELD/96377. The R code is available at https://isglobal-brge.github.io/resource_bookdown/

Funding Statement

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No 874583 (ATHLETE) and No 824989 (EUCAN-Connect); the Ministerio de Ciencia, Innovación y Universidades (MICIU), Agencia Estatal de Investigación (AEI) and Fondo Europeo de Desarrollo Regional, UE (RTI2018-100789-B-I00), also through the “Centro de Excelencia Severo Ochoa 2019-2023” Program (CEX2018-000806-S); and the Catalan Government through the CERCA Program. This article is part of the project VEIS: 001-P-001647, co-financed by the European Regional Development Fund of the European Union in the framework of the Operational Program FEDER of Catalonia 2014-2020 with the support of the Secretaria d'Universitats i Recerca del Departament d'Empresa i Coneixement de la Generalitat de Catalunya. This work also forms part of Newcastle University’s methods program in Health Data Science addressing the securing of sensitive data, with support from the Department of Health and Social Care under the Connected Health Cities (North East North Cumbria) project and from joint Wellcome Trust/Medical Research Council grant 108439/A/15/Z. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bonomi L, Huang Y, Ohno-Machado L. Privacy challenges and research opportunities for genomic data sharing. Nature Genetics. 2020. pp. 646–654. doi:10.1038/s41588-020-0651-0
  • 2. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43: 1929–1944. doi:10.1093/ije/dyu188
  • 3. Google AI Blog: Federated Analytics: Collaborative Data Science without Data Collection. [cited 7 Jul 2020]. Available: https://ai.googleblog.com/2020/05/federated-analytics-collaborative-data.html
  • 4. Is Federated Analysis the Way Forward for Genomics? | Trusted Digital Research Environment. [cited 6 Jul 2020]. Available: https://www.aridhia.com/blog/is-federated-analysis-the-way-forward-for-genomics/
  • 5. Walport M, Brest P. Sharing research data to improve public health. The Lancet. 2011. pp. 537–539. doi:10.1016/S0140-6736(10)62234-9
  • 6. Burton PR, Murtagh MJ, Boyd A, Williams JB, Dove ES, Wallace SE, et al. Data Safe Havens in health research and healthcare. Bioinformatics. 2015;31: 3241–3248. doi:10.1093/bioinformatics/btv279
  • 7. Wolfson M, Wallace SE, Masca N, Rowe G, Sheehan NA, Ferretti V, et al. DataSHIELD: resolving a conflict in contemporary bioscience—performing a pooled analysis of individual-level data without sharing the data. Int J Epidemiol. 2010;39: 1372–1382. doi:10.1093/ije/dyq111
  • 8. Doiron D, Burton P, Marcon Y, Gaye A, Wolffenbuttel BHR, Perola M, et al. Data harmonization and federated analysis of population-based studies: The BioSHaRE project. Emerg Themes Epidemiol. 2013;10: 12. doi:10.1186/1742-7622-10-12
  • 9. Doiron D, Marcon Y, Fortier I, Burton P, Ferretti V. Software Application Profile: Opal and Mica: open-source software solutions for epidemiological data management, harmonization and dissemination. Int J Epidemiol. 2017;46: 1372–1378. doi:10.1093/ije/dyx180
  • 10. Wilson RC, Butters OW, Avraam D, Baker J, Tedds JA, Turner A, et al. DataSHIELD–New Directions and Dimensions. Data Sci J. 2017;16. doi:10.5334/dsj-2017-015
  • 11. EUCAN-Connect. [cited 26 Mar 2020]. Available: https://www.eucanconnect.eu/
  • 12. Home—LifeCycle. [cited 26 Mar 2020]. Available: https://lifecycle-project.eu/
  • 13. Advancing Tools for Human Early Lifecourse Exposome Research and Translation | ATHLETE Project | H2020 | CORDIS | European Commission. [cited 26 Mar 2020]. Available: https://cordis.europa.eu/project/id/874583/es
  • 14. InterConnect Project—Homepage. [cited 9 Jun 2020]. Available: https://interconnectproject.eu/
  • 15. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12: 115–121. doi:10.1038/nmeth.3252
  • 16. Muschelli J, Gherman A, Fortin J-P, Avants B, Whitcher B, Clayden JD, et al. Neuroconductor: an R platform for medical imaging analysis. Biostatistics. 2019;20: 218–239. doi:10.1093/biostatistics/kxx068
  • 17. Masinter L, Berners-Lee T, Fielding RT. Uniform Resource Identifier (URI): Generic Syntax. IETF RFC 3986; 2005.
  • 18. A federated ecosystem for sharing genomic, clinical data. Science. 2016. pp. 1278–1280. doi:10.1126/science.aaf6162
  • 19. Lappalainen I, Almeida-King J, Kumanduri V, Senf A, Spalding JD, Ur-Rehman S, et al. The European Genome-phenome Archive of human data consented for biomedical research. Nature Genetics. 2015. pp. 692–695. doi:10.1038/ng.3312
  • 20. PLINK: Whole genome data analysis toolset. [cited 15 May 2020]. Available: http://zzz.bwh.harvard.edu/plink/
  • 21. Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28: 3326–3328. doi:10.1093/bioinformatics/bts606
  • 22. Gogarten SM, Bhangale T, Conomos MP, Laurie CA, McHugh CP, Painter I, et al. GWASTools: an R/Bioconductor package for quality control and analysis of genome-wide association studies. Bioinformatics. 2012;28: 3329–3331. doi:10.1093/bioinformatics/bts610
  • 23. Gogarten SM, Sofer T, Chen H, Yu C, Brody JA, Thornton TA, et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics. 2019;35: 5346–5348. doi:10.1093/bioinformatics/btz567
  • 24. Deelen P, Bonder MJ, Van Der Velde KJ, Westra HJ, Winder E, Hendriksen D, et al. Genotype harmonizer: Automatic strand alignment and format conversion for genotype data integration. BMC Res Notes. 2014;7: 901. doi:10.1186/1756-0500-7-901
  • 23.Gogarten SM, Sofer T, Chen H, Yu C, Brody JA, Thornton TA, et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics. 2019;35: 5346–5348. 10.1093/bioinformatics/btz567 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Deelen P, Bonder MJ, Van Der Velde KJ, Westra HJ, Winder E, Hendriksen D, et al. Genotype harmonizer: Automatic strand alignment and format conversion for genotype data integration. BMC Res Notes. 2014;7: 901. 10.1186/1756-0500-7-901 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Burgoine T, Forouhi NG, Griffin SJ, Wareham NJ, Monsivais P. Associations between exposure to takeaway food outlets, takeaway food consumption, and body weight in Cambridgeshire, UK: Population based, cross sectional study. BMJ. 2014;348. 10.1136/bmj.g1464 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.sparklyr. [cited 15 May 2020]. Available: https://spark.rstudio.com/.
  • 27.Smith SM, Nichols TE. Statistical Challenges in “Big Data” Human Neuroimaging. Neuron. Cell Press; 2018. pp. 263–268. 10.1016/j.neuron.2017.12.018 [DOI] [PubMed] [Google Scholar]
  • 28.Xu J, Xue K, Zhang K. Current status and future trends of clinical diagnoses via image-based deep learning Theranostics. Ivyspring International Publisher; 2019. pp. 7556–7565. 10.7150/thno.38065 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Abedini M, Von Cavallar S, Chakravorty R, Davis M, Garnavi R. A Cloud-Based Infrastructure for Feedback-Driven Training and Image Recognition. Studies in Health Technology and Informatics. IOS Press; 2015. pp. 691–695. 10.3233/978-1-61499-564-7-691 [DOI] [PubMed] [Google Scholar]
  • 30.Ping P, Hermjakob H, Polson JS, Benos PV., Wang W. Biomedical informatics on the cloud: A treasure hunt for advancing cardiovascular medicine. Circ Res. 2018;122: 1290–1301. 10.1161/CIRCRESAHA.117.310967 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.De Sousa C, Fatoyinbo L, Neigh C, Boucka F, Angoue V, Larsen T. Cloud-computing and machine learning in support of country-level land cover and ecosystem extent mapping in Liberia and Gabon. PLoS One. 2020;15. 10.1371/journal.pone.0227438 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on apache spark. Gigascience. 2018;7: giy098. 10.1093/gigascience/giy098 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008880.r001

Decision Letter 0

Dina Schneidman-Duhovny

2 Oct 2020

Dear Dr Gonzalez,

Thank you very much for submitting your manuscript "Orchestrating non-disclosive big data analyses of data from different resources with R and DataSHIELD" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Specifically, please simplify the main manuscript and make it clearer based on the reviewers' suggestions.

I didn't find a link to the code repository. Please provide it in the revised version.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Dina Schneidman

Software Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Reproducibility report has been uploaded as an attachment.

Reviewer #2: Abstract: This abstract is too wordy and uses long sentences.

Example:

Therefore, big data analyses should ensure appropriate levels of security and privacy, be rigorous with the application of the data confidentiality regulations and address the choice between central data warehousing and the distributed (federated) analysis of data that remain ‘in-house’.

Why do we need "therefore"? What is "appropriate"? Why should analyses "address the choice"?

I can understand the basic objective in two sentences: Integrative analysis of multiple large datasets is a common objective in research [here you might focus on computational biology and use that term instead]. When data components of an integrative analysis are managed in distinct and independent systems, steps must be taken to ensure that confidentiality requirements are satisfied at all stages of the analysis.

With one more sentence you could describe key principles and tools to be described in the paper, and the abstract is done. For example, you could mention that DataSHIELD and Obiba are a decade old and that your work enhances these. Note that this reader has never heard of either of these projects, and that there is a cybersecurity company in Arizona USA called Datashield. This could be confusing, and you may want to clarify that Obiba is an open source project in the domain of bioinformatics, and that DataSHIELD is a collection of R packages usable by users of Opal.

Text: The first paragraph repeats much of the abstract; this should be avoided. The phrase "the Opal data warehouse" is used without explanation. I have never heard of this.

After browsing around obiba and looking at the three figures, I have a sense of what this project is about. The "book" https://isglobal-brge.github.io/resource_bookdown/ is nicely done and is a good supplemental resource for the paper. The figures of the paper give a vague sense of the collections of components in scope, and show "firewalls". But there is no clear depiction of how "non-disclosure" is achieved or guaranteed.

The paper suffers from the lack of tables that would help readers to understand capabilities and limitations in comparison to other frameworks. The paper also fails to mention GA4GH (Global Alliance for Genomics and Health), that aims to develop standards in this domain. See ga4gh.org, where all the main toolkits would seem to intersect with items described in the paper.

In summary, the paper reads like a mix between advertising and user manual, and mentions but really does not illuminate connections to computational biology practices. The tools appear well-motivated and well-documented, and may well deserve broad adoption. But this paper as written does not provide a strong argument for engaging with an inevitably complex environment. The basic principles can be articulated simply, but the proof that they are implemented in the system described here, and a demonstration that this implementation should be used by anyone who shares its objectives, will take more work and a more self-critical stance.

Reviewer #3: We were delighted to read about the DataSHIELD resources feature and think it is a great innovation that deserves broad attention from the scientific community. DataSHIELD changes analysis practice by bringing the analysis to the data, instead of first centralizing the data, and does so in a way that is practical and usable for researchers via the R statistical language. Thus, it provides a solution to the increasingly difficult challenge of enabling pooled analysis of sensitive data from multiple centers and even countries, a challenge that has become harder because of uncertainties around the GDPR and similar regulations. However, DataSHIELD was limited to the methods implemented by the DataSHIELD team, who, while doing a great job, cannot be expected to address all analysis needs. The resources approach enables statisticians/bioinformaticians to create bespoke analysis pipelines, distribute them to centers automatically, and provide access controls without much local analyst work (in contrast to sending around a cookbook and then requiring a local statistician to execute it by hand).

Below we have comments and suggestions:

===

Overall we have the following concerns:

1.

In places, the manuscript uses many more words than necessary. This prose sometimes makes it hard to follow, so we would urge the authors to shorten and simplify the text where possible. Also carefully check for typos, in particular in the nice online book.

2.

One of the core features should be that data providers can precisely control access. However, this is not clearly described; based on our tests we believe it is actually quite simple. We think you should make this explicit from the start, i.e., in the abstract; it also deserves a short paragraph describing how it works in practice.

3.

While the authors give nice examples of analyses (actually more than needed), one thing we found missing is how the authors believe complex analysis protocols will be distributed from a central analysis site to local data providers. In the case of sensitive data, we expect data providers to want to limit access to only one resource, i.e. a particular analysis procedure. The question then is: how will such procedures be distributed from a central analysis site (i.e. the lead of a consortium analysis) to all connected sites (i.e. data providers) without requiring much local expertise? We believe that, for example for distributed meta-analysis, such a process should be seamless (otherwise local expertise is still needed).

4.

Finally, in our own experience the hosting environments of the participating data providers can be quite heterogeneous. You might want to discuss how you expect to deal with a resource-based analysis in, for example, a network of 20 data providers, and how you would expect the analysis to accommodate these differences (without having to address them all via the central analysis lead).

===

Some text was hard to follow, and you might want to rephrase it:

5.

“Such analysis not only demands scalable methods and appropriate mathematical models but must also maintain consistency with fundamental principles of sound data management and data sharing that are common across all areas of health, social and bioscience. Therefore, big data analyses should ensure appropriate levels of security and privacy, be rigorous with the application of the data confidentiality regulations and address the choice between central data warehousing and the distributed (federated) analysis of data that remain ‘in-house’.”

6.

“This requires data generators to physically transfer data [propose you add: to a central analysis server] to make them accessible to analytic users.”

7.

I got lost in the Methods section. It would greatly help if you give a short introduction, naming the main elements and how they fit together.

8.

“We define “resource” this data storage or computation access description”. I think you want to be more explicit, because apparently a ‘resource’ is a URL that can denote a file handle, a data access protocol, or the execution of some script.
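[Editorial note: to make this notion concrete, here is a minimal sketch of a resource description, following the newResource()/as.data.frame() pattern documented for the resourcer package; the name, URL and credentials below are hypothetical, not taken from the paper.]

    library(resourcer)

    # A resource is only a description: a URL (file handle, data access
    # protocol or computation endpoint), optional credentials, and a
    # format hint. The URL and secret below are hypothetical.
    res <- newResource(
      name     = "study1_pheno",
      url      = "https://data.example.org/pheno.csv",
      identity = "analyst",
      secret   = "s3cr3t",
      format   = "csv"
    )

    # resourcer matches the URL scheme and format to a resolver and a
    # client; tabular resources can then be coerced to a data.frame.
    df <- as.data.frame(res)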

===

Furthermore, we have minor suggestions that can be ignored but may be of use to the authors:

You state on page 1 that data have to be physically transferred to the warehouse. There are several tools that federate analysis to overcome this issue, so you might want to rephrase. For example: https://www.dremio.com/.

Resource credentials are fixed and managed by Opal, which implies there is no refresh token or similar mechanism that expires over time. Policy decision and enforcement are now located in the DataSHIELD engine, which means that the resource owner can no longer decide whether someone has access to the resource. Audit logging on the resource side is hard to do this way.

If the computational resource needs to do a lot, you need to program that either in the ResourceExtension or as predefined commands.

When you program an extension, you are dependent on the developer's choices regarding the interface he/she exposes; and when you pass it as parameters in the resource, you need to parse them in the resourcer package.

For example, how do you prevent malicious SQL injection? We would expect that you would not open up such resources to external input but would only expose a pre-packaged analysis.
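[Editorial note: on the injection point specifically, the standard mitigation for SQL-backed resources is parameterized queries, where user input is bound by the database driver instead of pasted into the statement. A minimal sketch with DBI and an in-memory SQLite backend; the table and values are hypothetical.]

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "pheno", data.frame(id = 1:3, bmi = c(22.1, 27.4, 31.0)))

    # Unsafe: paste0("... WHERE id = ", user_input) allows injection.
    # Safe: the driver binds and escapes the parameter.
    user_input <- 2
    row <- dbGetQuery(con, "SELECT * FROM pheno WHERE id = ?",
                      params = list(user_input))

    dbDisconnect(con)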

You are still bound by the limitations of R in terms of memory and CPU usage when you want to correlate data in R against the data available in the resource. You need either to push the data you want to correlate into the resource or to extract the data from the resource into the R environment.
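[Editorial note: for database-backed resources, a common pattern is to keep the heavy computation inside the resource and pull back only aggregates, e.g. via dbplyr's lazy tables. A sketch under that assumption; table and column names are hypothetical.]

    library(DBI)
    library(dplyr)
    library(dbplyr)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "pheno", data.frame(group = c("a", "a", "b"),
                                          bmi   = c(22.1, 27.4, 31.0)))

    # tbl() is lazy: the pipeline below is translated to SQL and runs
    # inside the resource; only the aggregated rows enter R's memory.
    means <- tbl(con, "pheno") %>%
      group_by(group) %>%
      summarise(mean_bmi = mean(bmi, na.rm = TRUE)) %>%
      collect()

    dbDisconnect(con)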

What is going to happen when you want to finish the analysis at a later moment? Or, if the analysis takes days, how do you retrieve the result?

When the analysis takes a lot of resources in terms of memory and CPU, how do you limit this per DataSHIELD user? Who is responsible for this, DataSHIELD or the resource owner?

The resource owner needs to implement a way to be easily accessible to the resourcer package. Especially when you want to run more complex jobs, it usually takes a complex interface to work with.

In the shell and ssh resources, the Opal administrator manages the actions that may be performed by the resource handler (in general, the researcher). When Opal is hosted on different infrastructure than the resource and is managed by someone who is not in charge of the resource, commands can be freely added to the resource's list of permitted commands. This possibly allows unintended commands to be run on the resource.
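[Editorial note: for reference, in the ssh resource the permitted commands are whitelisted in the resource URL itself via the exec query parameter, so whoever defines the resource controls what the handler may run. A sketch following the pattern shown in the online book; host, port, path and credentials here are hypothetical.]

    library(resourcer)

    # Only the commands listed in "exec" may be invoked through the
    # client; anything else is refused. All values are hypothetical.
    res <- newResource(
      name     = "plink_server",
      url      = "ssh://shell.example.org:2222/home/user/data?exec=ls,plink1.9",
      identity = "user",
      secret   = "s3cr3t"
    )

    client <- newResourceClient(res)
    client$exec("ls")  # allowed: "ls" is on the whitelist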

On pages 7 and 12 you state that disclosure control on big data is often more relaxed. In practice we do not encounter this relaxation. We are very keen on analysing data outside our facility but are bound by the current contracts regarding data transfer or access agreements, which take a lot of time to arrange.

How can we make sure, when more open data are offered in this way, that it will be legally feasible to access the data without signing a data transfer agreement or a lengthy access agreement? In other words, is it practically feasible to do analysis this way given the juridical implications?

Maybe fewer dependencies in the resourcer package? When the interface of a dependency changes, the package needs to change as well.

Think of a way to delegate authorisation and authentication back to the resource owner. We would propose changing the way credentials are passed in the resource.

Is it possible to restrict the usage of certain resources, for example, to prohibit the use of particular resource types?

It would be nice to have some way of enforcing that the metadata tied to the data are correct. This is often the problem with longitudinal data: the metadata over all columns should be correct. It is likely that this also holds for big data structures that are analysed in a pooled manner. How do you handle multiple versions of VCF, for example?
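[Editorial note: a lightweight illustration of what such enforcement could look like on the R side; the dictionary format and checks below are hypothetical and not a resourcer feature.]

    # Hypothetical data dictionary: one row per expected column.
    dict <- data.frame(
      name = c("id", "sex", "bmi"),
      type = c("integer", "character", "numeric")
    )

    # Fail fast if the data do not match the dictionary
    # (assumes R >= 4.0, where strings stay character by default).
    check_metadata <- function(df, dict) {
      missing <- setdiff(dict$name, names(df))
      if (length(missing) > 0)
        stop("missing columns: ", paste(missing, collapse = ", "))
      actual <- vapply(df[dict$name], function(x) class(x)[1], character(1))
      bad <- dict$name[actual != dict$type]
      if (length(bad) > 0)
        stop("type mismatch in: ", paste(bad, collapse = ", "))
      invisible(TRUE)
    }

    check_metadata(data.frame(id = 1:2, sex = c("f", "m"), bmi = c(21.5, 24.2)),
                   dict)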

Section 16.3 of the book is not finished.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: No: It seems possible that the data related to Figure 3, or described in "Differential Gene Expression Analysis", are all available, but the paper does not provide explicit links, and it is not clear whether any reader would need to register or be authorized to interact with such data.

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Anand K. Rampadarath

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: Reproducible_report_PCOMPBIOL_D_20_01312.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008880.r003

Decision Letter 1

Dina Schneidman-Duhovny

17 Mar 2021

Dear Dr Gonzalez,

We are pleased to inform you that your manuscript 'Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Dina Schneidman

Software Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The Reproducibility report has been submitted as an attachment.

Reviewer #2: none

Reviewer #3: Thank you for this work and the precisely described responses and changes made.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Anand K. Rampadarath

Reviewer #2: No

Reviewer #3: No

Attachment

Submitted filename: Reproducible_report_PCOMPBIOL_D_20_01312R1 (1).pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008880.r004

Acceptance letter

Dina Schneidman-Duhovny

25 Mar 2021

PCOMPBIOL-D-20-01312R1

Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD

Dear Dr Gonzalez,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
