PLOS One. 2022 Jan 21;17(1):e0262609. doi: 10.1371/journal.pone.0262609

A data flow process for confidential data and its application in a health research project

Samantha S R Crossfield 1,*, Kieran Zucker 2, Paul Baxter 3, Penny Wright 2, Jon Fistein 1, Alex F Markham 1,2, Mark Birkin 1, Adam W Glaser 1,2, Geoff Hall 1,2
Editor: Pandi Vijayakumar 4
PMCID: PMC8782367  PMID: 35061834

Abstract

Background

The use of linked healthcare data in research has the potential to make major contributions to knowledge generation and service improvement. However, using healthcare data for secondary purposes raises legal and ethical concerns relating to confidentiality, privacy and data protection rights. Using a linkage and anonymisation approach that processes data lawfully and in line with ethical best practice to create an anonymous (non-personal) dataset can address these concerns, yet there is no set approach for defining all of the steps involved in such data flow end-to-end. We aimed to define such an approach with clear steps for dataset creation, and to describe its utilisation in a case study linking healthcare data.

Methods

We developed a data flow protocol that generates pseudonymous datasets that can be reversibly linked, or irreversibly linked to form an anonymous research dataset. It was designed and implemented by the Comprehensive Patient Records (CPR) study in Leeds, UK.

Results

We defined a clear approach that received ethico-legal approval for use in creating an anonymous research dataset. Our approach used individual-level linkage through a mechanism that is not computer-intensive and was rendered irreversible to both data providers and processors. We successfully applied it in the CPR study to hospital and general practice and community electronic health record data from two providers, along with patient reported outcomes, for 365,193 patients. The resultant anonymous research dataset is available via DATA-CAN, the Health Data Research Hub for Cancer in the UK.

Conclusions

Through ethical, legal and academic review, we believe that we contribute a defined approach that represents a framework that exceeds current minimum standards for effective pseudonymisation and anonymisation. This paper describes our methods and provides supporting information to facilitate the use of this approach in research.

Introduction

There are significant barriers to the linkage and use of routinely captured healthcare data in research to generate novel insights. Because data are captured in discrete, ‘siloed’ systems, research often requires data linkage. However, linkage brings issues, such as how to handle varied data structures and maintain data security while enabling appropriate access [1]. Ethical and legal requirements may necessitate complex processes in data handling, which may be burdensome [2, 3]. In the European Union, the General Data Protection Regulation (GDPR) sets conditions for the lawful processing of personal and “special category” data that relates to an identified / identifiable individual [4]. In the UK, the Data Protection Act 2018 and common law duty of confidentiality apply [5, 6]. Disclosing confidential information collected during healthcare delivery for secondary use in research requires a legal gateway such as consent, public interest, a legal obligation or approval by the UK Secretary of State for Health [7]. Unprecedented rates of data sharing in response to the current pandemic are yielding important insights into COVID-19; however, it is uncertain how this might change the landscape for all research [8, 9]. Further, there remains public concern, and weaknesses in public engagement, to overcome [10–12].

While secure data sharing approaches exist for the use of identifiable confidential data by approved users [13–17], secondary use in research often requires data that is, for the above ethico-legal reasons, “anonymous in such a manner that the data subject is not or no longer identifiable” beyond ‘reasonable effort’ [4, 7, 18, 19]. A clear and practical process is needed to link datasets and create a research dataset that is no longer considered confidential or personal. Data linkage and anonymisation procedures can be used to create an irreversibly linked, non-personal, dataset from two or more identifiable sources. Data are linked and the relationship between individuals and data is removed so that it is beyond reasonable effort to identify individuals from the data [20]. Anonymisation approaches include the removal of direct identifiers and information that is rare (either by itself or in combination with other information) [21]. Data is reduced in granularity to reduce the risk of sharing identifiable or identifying information, which can be checked using disclosure control tools such as QAMyData [22], though there may be a trade-off with utility.

Linkage and anonymisation procedures can be applied by data providers (‘at-source’) or “trusted third parties”, although each has potential practical and security limitations. At-source linkage generally occurs where the source holds multiple datasets and the approval necessary to perform such linkage. It prevents identifying data from crossing organisational boundaries; however, this also prevents authorised linkage between datasets held by multiple sources, which is often required in research. At-source research platforms have developed recently to facilitate rapid COVID-19 research, but the long-term funding and sustainability of such approaches is uncertain [23]. Alternatively, a third party may receive identifiable data to link and anonymise. In the UK, the National Data Guardian recommends that a “safe-haven” performs this role, acting as an “honest broker” [24, 25]. The NHS Digital linkage service [26] is suitable for health data that is identifiable or carries a risk of re-identification through reasonable means [25]. In response to the COVID-19 pandemic, key UK funders have worked to increase research scale and infrastructure, enabling access to National Core Data assets through Trusted Research Environments (TREs), to enable COVID-19 research [27]. However, the transfer of personal data to a third party for linkage and anonymisation carries the risk of disclosure. It may also be legally restricted by the purposes for which the data was collected, as a legal reason (such as public interest during the COVID-19 pandemic) is required for third party processing of personal data. Further, nationally funded third parties may focus efforts on datasets commonly required for core research, which would not address all research requirements.

Pseudonymisation approaches that de-couple data from identifying information while maintaining a link between these that is accessible only by authorised parties [28, 29] offer a means to enable authorised data linkage. In pseudonymisation, directly identifying data that would otherwise be used in linkage can be replaced with a “digest” (linkable pseudonym) [30]. Recognised hashing algorithms such as SHA-2 and SHA-3 can be implemented in tools such as Python or OpenPseudonymiser to generate linkage digests from data strings, while affine cipher-based encryption may be suitable for image data [31–34]. These digests are anonymous in the context of the recipient. However, the data provider can link reversible pseudonyms back to an individual, so they remain classed as personal data [4, 29]. An approach whereby the digest is re-pseudonymised by the recipient [31] prevents backward-engineering or re-identification of the digests by any single party. However, the digests retain their status as personal data. Multiparty cryptographic protocols, such as garbled circuit approaches, can aim to prevent data linkage reversibility, but are computationally expensive [35, 36].
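
As a minimal illustration of how a salted digest can be generated with a SHA-2 algorithm in Python, the sketch below hashes a salt joined to a normalised identifier. The function name `make_digest`, the salt value, the identifier and the way the salt and identifier are combined are all assumptions made for illustration; this is not the construction used internally by OpenPseudonymiser.

```python
import hashlib


def make_digest(identifier: str, salt: str) -> str:
    """Return a hex SHA-256 digest of the salt joined to a normalised identifier."""
    # Normalise formatting first: digests only match when inputs share a format.
    normalised = identifier.replace(" ", "").upper()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()


# Made-up values for illustration only (not a real salt or patient identifier).
project_salt = "example-project-salt"
digest = make_digest("999 000 0001", project_salt)
print(digest)
```

The same identifier and salt always yield the same digest, which is what makes the digest linkable, while a different salt yields an unrelated digest.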

We aimed to describe an approach using at-source pseudonymisation with third party data linkage and anonymisation, and to apply this as a case study to health data from two sources to produce an anonymous person-level research dataset. We aimed to present the practical steps involved, by describing the protocol and an exemplar, in order to provide a framework for other projects using personal data to generate either a pseudonymous or an anonymous dataset. This may assist the usability of such data in addressing important research in industry and society.

Materials and methods

A data flow protocol was developed and implemented as part of the Macmillan Cancer Support funded Comprehensive Patient Records (CPR) study, conducted at the University of Leeds in collaboration with the Leeds Teaching Hospitals NHS Trust (LTHT) and The Phoenix Partnership Leeds UK (TPP), a clinical systems company. Here we describe the method of its development, including external validation of its ethical and legal compliance, and implementation. The data flow protocol and its implementation are described in the Results.

Protocol development

The data flow protocol was developed with input from patients, informaticians, ethicists, legal experts and colleagues from Public Health England and NHS Digital. The patient viewpoint was incorporated throughout, with two named investigators being patients who attended two-monthly meetings. The protocol and study design were presented to local research advisory groups (Cancer Research UK Leeds Centre’s Public and Patient Involvement in Research Group and the Research Advisory Group of the University of Leeds Patient Reported Outcomes Group [37]), which viewed the data handling policies positively. Legal advice was sought in order to validate that the protocol met the necessary ethical and legal requirements. The University of Leeds IT department and legal team undertook a Privacy Impact Assessment and sought external independent legal counsel from the Queen’s Counsel, concluding that the protocol was robust and appropriate.

Separation of duties

A supporting separation of duties protocol was defined to establish physically separate stages of data handling (data provision; the stages of data linkage, preparation and review of derived datasets; data analysis) between teams and to prevent research access to data prior to the completion of anonymisation. This ensured that: each data provider conducted data extraction and the creation of matching pseudonymous digests; a third party provided separate teams for linkage and digest destruction, anonymisation and data preparation / minimisation, and data review and transfer; and the research team conducted analyses on the resultant anonymised research dataset. This prevented unauthorised handling of identifiable information or data with reasonable potential for re-identification.

Data handling organisations in the case study

GP practice and community, and hospital data were sourced from TPP and LTHT for the CPR dataset. TPP provides SystmOne, a clinical system used by 7,000+ health and social care organisations in the UK to maintain ~50 million health records. TPP deliver ResearchOne [38] as a programme with national ethics approval (11/NE/0184) to facilitate research data provision. Organisations using SystmOne (e.g. GP practice and community) can opt in to providing non-identifiable data for research purposes to ResearchOne, which currently contains >8 million patient records. Organisations can also record patient opt-outs, to exclude patient data from the ResearchOne database. LTHT is one of the largest teaching hospitals in Europe and contains the Leeds Cancer Centre. The main LTHT health record system, PPM+, has previously been described and contains detailed diagnostic and management data on ~250,000 cancer patients [39]. LTHT have internally linked these clinical records to other LTHT informatics systems, including those containing financial data. LTHT also records research opt-outs, to exclude patient data from LTHT research data extracts.

In the UK, a national data opt-out service enables patients to exclude their confidential patient information from secondary use such as research. Both TPP and LTHT excluded data from patients that had opted out nationally.

Discussion with colleagues from NHS Digital determined that the Data Analytics Team (DAT) at the Leeds Institute for Data Analytics (LIDA) at the University of Leeds were an appropriate third party. The DAT provided a third-party linkage, anonymisation and data transformation service and a secure data platform for NHS Data Security and Protection Toolkit (DSPT)-compliant data handling [40]. The service was developed to provide secure information handling processes, technical infrastructure and application development for the Medical Research Council Medical Bioinformatics Centre and the Economic and Social Research Council Consumer Research Data Centre [40, 41].

Protocol review

The protocol underwent a series of reviews in order to ensure that the data transfer could be conducted independently of a safe haven. In England and Wales, the NHS Health Research Authority (HRA) Research Ethics Committee (REC) reviews whether research is ethical and the HRA Confidentiality Advisory Group (CAG) reviews any access to confidential data without consent. The data flow protocol received ethical approval from the HRA REC (IRAS project ID 188345 REC reference 16/NE/0155) and was approved by the data providers and the third party. Informed patient consent was not sought as the data were analysed anonymously. We submitted the protocol for review by CAG, explaining our view that the confidential data processing by the data providers was in accordance with their processing agreements, that the data were anonymous in context for the third party, and that the research team received anonymous data. The CAG determined that approval under the National Health Service Act Section 251 was not required given these factors, following assurance from the data providers on the at-source anonymisation procedures applied [42]. The third party sought review of the proposal by the Queen’s Counsel. This confirmed the opinion that the data received by the third party and used by its researchers could be characterised as ‘anonymous’ and not ‘personal’ data. The received digests were deemed anonymous in context, and following the deletion of the salt used in their re-pseudonymisation, the final digests were deemed anonymous: although person-level, they would not contribute to identification [20].

Data flow in the case study

Implementation of the protocol began with the data providers and ended with research output dissemination. Each data provider selected the records that were eligible for linkage from PPM+ or ResearchOne, using the project’s cohort selection criteria (S1 Table), through their own computerised processes in accordance with the LTHT Fair Processing Notice and ResearchOne Database Protocol, respectively [43, 44]. The research team provided the data providers with standard operating procedures to ensure a standardised approach. The Caldicott Guardian of one provider produced an encrypted project-specific “salt” file via the OpenPseudonymiser website [31] and shared this with the analyst team in both organisations. The salt was a code created using OpenPseudonymiser from a project-specific alphanumeric string. Digests match only where they are created using the same identifiers (in the same format) and salt, which prevents unauthorised linkage. The analysts combined the project salt with the agreed patient identifiable inputs and used OpenPseudonymiser to produce linkable pseudonymous digests using the secure hash algorithm (SHA-256) [31]. These were encrypted and provided to the DAT using a secure file transfer protocol (SFTP). A designated analyst produced a list of matches and returned these to the providers before locally destroying the received and matched files. The providers produced data for this linked cohort, using the project’s data items list. The data providers additionally applied their standard anonymisation processes prior to data transfer, such as replacing postcodes with sector-level postcode or the corresponding Index of Multiple Deprivation score [45].
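
The matching step described above can be sketched as a set intersection over the digests received from each provider. This is a simplified illustration, not the DAT's actual tooling: the helper `psd1`, the salt string and the identifiers are hypothetical, and the salt is simply concatenated with the identifier before hashing.

```python
import hashlib


def psd1(identifier: str, salt1: str) -> str:
    """Hypothetical helper: salted SHA-256 digest of an identifier."""
    return hashlib.sha256((salt1 + identifier).encode("utf-8")).hexdigest()


SALT1 = "shared-project-salt"  # made-up value shared by both providers

# Each provider hashes its own eligible records (made-up identifiers).
hospital_digests = {psd1(n, SALT1) for n in ["9990000001", "9990000002", "9990000003"]}
gp_digests = {psd1(n, SALT1) for n in ["9990000002", "9990000003", "9990000004"]}

# The linkage party sees only digests and reports those present in both sets.
matches = hospital_digests & gp_digests
print(len(matches))

# A digest made with a different salt cannot be matched against the project digests.
rogue = psd1("9990000002", "some-other-salt")
print(rogue in gp_digests)
```

Because the linkage party holds neither the salt nor the identifiers in this sketch, it can report which digests match without learning who they belong to.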

The extracts were encrypted and transferred via SFTP to the DAT, containing digests that were anonymous in the context of the DAT. A DAT analyst re-pseudonymised the digests using a DAT project-specific salt which was then destroyed (rendering the digests anonymous in all contexts). A DAT analyst then linked the extracts and performed other transformations to produce the anonymous research dataset. The transformations aimed to address the risk of re-identification inherent in data linkage by undertaking data minimisation steps, such as replacing the patient’s month and year of birth with an age-band and converting dated diagnostic data into a binary indicator of prevalence within specific time-frames. This helped to avoid the risk of the linked dataset containing “digital fingerprints” [46]. The processing applied to some common data types is listed in Table 1. Controlled research data access was provided to researchers over a University network under a user agreement detailing their data processing obligations. This included sanctioning against any attempts at re-identification. Access in a project-specific, firewalled environment in compliance with the NHS DSPT [47] was contingent upon providing evidence of completing annual training in advanced information security. An agreement was defined to ensure that the DAT handled all data ingress and egress, providing an independent check against the ethical and legal requirements.
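
The re-pseudonymisation and linkage performed by the third party might look like the following sketch, which replaces each received digest with a second digest derived from a randomly generated salt that is never stored or returned, then joins the extracts on the new digest. The function names, field names (`psd1`, `psd2`) and example rows are assumptions for illustration. In practice, destroying the salt also means securely deleting any persisted copy, which this in-memory sketch does not need to do.

```python
import hashlib
import secrets


def repseudonymise(extracts):
    """Replace each row's 'psd1' with a 'psd2' derived from a throwaway salt."""
    salt2 = secrets.token_hex(32)  # generated in memory; never written out or returned

    def to_psd2(psd1: str) -> str:
        return hashlib.sha256((salt2 + psd1).encode("utf-8")).hexdigest()

    result = []
    for extract in extracts:
        rows = []
        for row in extract:
            new_row = {k: v for k, v in row.items() if k != "psd1"}
            new_row["psd2"] = to_psd2(row["psd1"])
            rows.append(new_row)
        result.append(rows)
    return result  # salt2 goes out of scope here and is not recoverable


def link(left, right):
    """Inner-join two extracts on the shared 'psd2' field."""
    right_by_digest = {row["psd2"]: row for row in right}
    return [
        {**row, **right_by_digest[row["psd2"]]}
        for row in left
        if row["psd2"] in right_by_digest
    ]


# Made-up extracts; 'aaa', 'bbb', 'ccc' stand in for provider-generated digests.
hospital = [{"psd1": "aaa", "cancer": True}, {"psd1": "bbb", "cancer": False}]
gp = [{"psd1": "bbb", "diabetes": True}, {"psd1": "ccc", "diabetes": False}]

hospital2, gp2 = repseudonymise([hospital, gp])
research_dataset = link(hospital2, gp2)
print(research_dataset)
```

Both extracts must be re-pseudonymised with the same second salt before it is discarded, otherwise the new digests would no longer match each other.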

Table 1. Common data types and the data processing steps applied in the Comprehensive Patient Records project.

Data Type | Data Processing Steps
Patient Name | Excluded at source
NHS Number | An input variable in pseudonymous digest creation (performed by the data providers)
Date of Birth | Transformed into age at first cancer diagnosis / matched index date, in age-bands (<1 years, 1–4 years and 5-year bands thereafter until 80–100)
Date of Death | A Boolean indicator of survival status as known to the data source was provided
Postcode | Reduced to postcode sector only e.g. LS1 5, or mapped to Index of Multiple Deprivation score [45]
Diagnostic Codes | Aggregated to a binary yes/no for prevalence of disease or disease groups in time periods pre- and post- cancer diagnosis or matched index date
Prescribing Data | Mapped to annual cost per patient; aggregated to a Boolean for presence or absence of diabetes drug classes
Sex | No processing applied
Ethnicity | No processing applied (coded using national code-lists)
Date last seen by primary care team | No processing applied
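
Two of the minimisation steps in Table 1 can be sketched in a few lines. The functions below are illustrative assumptions rather than the project's code: `age_band` follows the bands stated in the table and, as an assumption, rejects ages outside 0 to 100, while `postcode_sector` assumes a well-formed UK postcode whose last three characters are the inward code.

```python
def age_band(age_years: float) -> str:
    """Map an age in years to the bands described in Table 1."""
    if age_years < 0 or age_years > 100:
        # Assumption: ages outside the stated bands are treated as errors.
        raise ValueError("age outside the 0-100 range covered by the bands")
    if age_years < 1:
        return "<1"
    if age_years < 5:
        return "1-4"
    if age_years >= 80:
        return "80-100"
    lower = int(age_years // 5) * 5
    return f"{lower}-{lower + 4}"


def postcode_sector(postcode: str) -> str:
    """Reduce a UK postcode to its sector (outward code plus first inward digit)."""
    compact = postcode.replace(" ", "").upper()
    if len(compact) < 5:
        raise ValueError("postcode too short to contain a sector")
    outward, inward = compact[:-3], compact[-3:]
    return f"{outward} {inward[0]}"


print(age_band(37))                # 35-39
print(postcode_sector("ls1 5zz"))  # LS1 5
```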

Results

Data flow protocol

Informed by patients, informaticians, ethicists, legal experts, clinicians and colleagues from Public Health England and NHS Digital, we developed a data flow protocol resulting in an irreversibly linked anonymous dataset (S1 File). The protocol utilised OpenPseudonymiser, an open source application for converting an input field into digests using the Secure Hash Algorithm 2 [31, 48]. The data providers agreed on the direct identifiers and a salt to be used to create linkable digests. The inputs were formatted using an agreed approach. Following the protocol, the data providers transferred these digests to a third party, which advised on the matching digests; each data provider then produced a dataset, with pseudonymous digests, for the matched records. Upon receiving these datasets, the third party re-pseudonymised the digests using a unique salt that was then destroyed to render the digests anonymous, and linked the datasets using the new digests. The third party performed further transformations to ensure the anonymisation of the merged dataset. All data transfers were in encrypted format. These steps are detailed in Table 2 and in a template version of the data flow protocol, which defines the data flow and technical procedure, and provides a supporting glossary and worked examples of the pseudonymisation approach (S1 File).

Table 2. Summary of actions defined in the data flow protocol and the parties involved in each step.

Step | Action Description | Organisation
1 | Create and share a hashed project-specific salt (SALT1) | Data providers
2 | Determine records eligible for linkage | Data providers
3 | Use agreed fields and SALT1 to generate project-specific digests (PSD1s) for these records | Data providers
4 | Transfer PSD1s to the linkage party | Data providers
5 | Compile a list of matching PSD1s, return to data providers | Third party
6 | Delete any locally-held PSD1s | Third party
7 | Creation of at-source anonymised datasets; transfer to the linkage party | Data providers
8 | Creation of a project-specific salt (SALT2) and replacement of PSD1s with a second digest (PSD2) | Third party
9 | Linkage of datasets to produce the research dataset (RD) | Third party
10 | Authorised research access to the RD is granted, in a trusted research environment | Third party
11 | Analysis of the RD; research output generation | Research team
12 | Outputs screened for risk of re-identification and reviewed against ethical and governance requirements prior to authorised release | Third party

Case study

The data flow protocol was utilised by the CPR study to link anonymised routinely collected health data from hospital, GP practice and community electronic health records in England (Fig 1). Fig 1 depicts the parties involved in the case study, the key actions that they performed, and the transfer of, and access to, data. A separation of duties procedure allocated the data handling functions in the data flow protocol to specific parties to ensure separation of the pseudonymisation and anonymisation steps and to define the digests handled by each party. Two organisations undertook the data provision: LTHT and TPP, as described in the Methods. The data providers shared a project-specific salt file and followed a standard procedure to produce patient digests. LTHT identified 102,763 patients with a cancer diagnosis, defined either via clinical review (“gold standard”) or through the recorded cancer status, ICD-10 code, and definitive disease phase, and 287,564 matched non-cancer patients (Fig 2). TPP provided digests for the ResearchOne database (approximately 8 million). The DAT in LIDA, independent of the research team, identified 140,462 matching digests. The data providers returned anonymised data for this cohort, which were linked within LIDA. The digests were then re-pseudonymised and the salt used was destroyed in order to irreversibly break the link back to the data provider datasets. Further aggregation and transformations were also applied to minimise the risk of re-identification following data linkage and to produce the research dataset in the format desired by this research project.

Fig 1. Diagram of the parties involved in the case study and the actions performed during the process of data flow, linkage and access, as defined using the protocol for linkage and anonymisation of data.

Fig 2. Flow chart of the patient cohort selection process for the case study, defining the number of patient records in each stage of selection and linkage.

Discussion

We believe that this data flow process meets the urgent need for a lawful, ethical and practical framework for personal data handling in research. It incorporates best practice mechanisms and addresses ethical and legal requirements to enable the flow of person-level data for research. This data flow process offers one approach to support data dissemination at scale for secondary uses, by producing an at-source anonymised dataset that is irreversibly linked using accepted cryptographic standards. It can also be easily adapted for use in projects that have approval to utilise reversible pseudonymisation. The protocol is made available to inform other projects, so that it may facilitate access to datasets and data-based research. Further, its application in the case study has made available an anonymous research dataset that may be invaluable in future research.

Ethical data handling is particularly crucial for personal healthcare data and so we have exemplified the process with a case study linking and anonymising such data. The successful implementation by the CPR study demonstrates the efficacy of the process and how to apply it practically to real-world issues. The CPR dataset has been incorporated as an exemplar research dataset by DATA-CAN, the Health Data Research Hub for Cancer in the UK [49]. The data flow protocol has already been adopted locally by projects including the Yorkshire Specialist Register of Cancer in Children and Young People [50]. The approach may be adaptable for international use. Such linkage projects may help to unlock patient benefit from the wealth of information currently available within routinely collected data.

Process review

Internal and external review of the process demonstrated that the protocol ensures appropriate handling of all ethical considerations and stakeholder perspectives. It met or exceeded the current UK legal requirements and advice on data handling, including those defined in The Information Governance Review and the NHS DSPT standards [25, 47]. The perspectives of the data subject were embedded through patient and public involvement and engagement in all stages of the process, including protocol development, review and implementation. We engaged data providers to ensure that the protocol provides the necessary practical guidance to produce linkable datasets. This brought transparency to the process, which in turn enabled reproducible reporting of the study population and informs the interpretation of research [51]. A privacy impact assessment, now known as a data protection impact assessment (DPIA), was performed by the University of Leeds IT Assurance and legal team, given the involvement (at-source) of personal data. This independently assessed and informed the process before deeming it appropriate [52]. Future projects adopting a similar protocol may anticipate being assessed as ‘low risk’ during their DPIA given the view expressed in the CPR DPIA and by the Queen’s Counsel, that the data was not personal in the context of the third party and researchers. We undertook significant review so that the process might be adapted to future projects as one way to adhere to the current standards and legal requirements, and address the ethical and privacy concerns that arise in data use and linkage generally.

Cross-accreditation of data linkage and anonymisation processes

In the UK, the Information Governance Review [25] describes “accredited data safe havens” as the appropriate locations for the linkage and anonymisation of personal confidential data or that which could potentially identify individuals and is linked for limited disclosure/access. TREs and health data hubs host researcher access to core datasets, however, research projects such as the CPR often require a bespoke data flow. Unlike the clear researcher accreditation requirements that are listed for TREs [27], the criteria for an organisation to become accredited as a safe haven for such data processing are unclear. This is problematic as this is required for an organisation to be appropriate to undertake linkage or anonymisation processes and to meet bespoke research needs. A formal list of requirements for accreditation for linkage and / or anonymisation is required. Further, the former should extend to all data linkage scenarios to ensure comprehensive compliance with ethical requirements.

We propose an approach of cross-accreditation between organisations that adopt our data flow process or conduct similar data handling. We propose a system whereby institutions independently evaluate each other’s processes and infrastructure, nationally and internationally. This would foster expert practice and enable institutions in academia and industry to become ‘cross-accredited data safe havens’. Through increased transparency, this would reduce the uncertainty and variation in current practice. It could be initiated using the open and reproducible criteria that we propose in Table 3, and would evolve through practice. Datasets developed securely by such safe havens could subsequently feed into TREs or health data research hubs to facilitate wider access by accredited researchers.

Table 3. Recommended minimum criteria for incorporation into a checklist for data safe haven cross-accreditation.

Recommended Minimum Criteria
Protocols and work instructions that incorporate any relevant ethical and legal measures
Demonstrate compliance with the NHS DSPT standards when handling health data (UK health data-specific)
Robust data access control
User training in advanced information security
An information security management system with an assigned data protection officer and information governance management group oversight
Procedures for data classification, measuring identifiability, anonymisation, risk assessment and privacy impact assessment
Demonstrable adherence to procedures for appropriate auditing of controls
Where data being linked are confidential or could potentially identify individuals and are linked for purposes of limited disclosure/access: demonstrate reasonable data stewardship (which in the UK is defined in The 2013 Information Governance Review [25]) and provide an environment, controls or sanctioned agreements for limited disclosure/access
Where provision is also provided for analysis using identifiable data: remote access and controlled data ingress and egress; third-party review of outputs using data non-disclosure principles prior to dissemination; availability of ‘safe room’ with restricted access as appropriate

Strengths and limitations

The protocol utilises the concepts of at-source pseudonymisation and third party anonymisation of the linkage digests to enable non-reversible data linkage. This approach brings advantages without compromising either matching accuracy or computer processing time, as would be the case with cryptographic approaches such as garbled circuits to complete linkage at scale [35]. It reduces the risk of information disclosure by external attack, as the linked research dataset does not link back to digests held by the data providers. Two salts are used to produce the digests and no party holds both, as would be required for digest re-identification. Even in the case of malicious attack, it is not reasonable to anticipate digest re-identification in comparison with some honest broker approaches [53]. Anonymisation using a third party and a separation of duties procedure prevents data or digest identification through collusion between the data providers and research team, providing protection above pseudonymisation approaches. The digest anonymisation approach used strengthens the linkage process, although it must be coupled with processes that mitigate the risk of identification through the rest of the dataset (i.e. data that is shared but not used in the linkage process). Such processes should be selected to ensure that data are anonymous beyond reasonable effort, while maintaining wherever possible the dataset utility in relation to addressing the research question [19]. A party, e.g. the data provider or third party, must apply these processes under the appropriate consent and legal frameworks. The protocol acknowledges that data linkage may necessitate further assessment of risk and post-linkage anonymisation or aggregation steps, to produce a research dataset that is richer than unlinked data while being practically protected from internal disclosure of information. 
As part of a governance framework, including data access limitations (including prevention of concurrent access to auxiliary information), processing agreements and pre-dissemination disclosure review, the approach in the data flow protocol can assist in preventing unauthorised attempts at re-identification [30, 54, 55].

Our case study demonstrated successful implementation of the protocol. The case study employed a third party that acted independently of the research team but was within the same organisation (University of Leeds), which minimised the data disclosure risks inherent in otherwise involving further organisations in data handling. The third party contributed to secure data flow by verifying the involved parties and reviewing the approvals in place for data handling [17]. Alternative approaches to party verification could be considered, alongside or instead of this being performed by a third party [56]. In the case study, the cancer status of patients as defined using hospital data was not shared with the GP and community data provider owing to the inclusion of non-cancer patients. As almost all patients in the UK are registered with a GP, we believe that linkage did not provide the hospital data provider with additional information about the data subjects either. Studies of especially sensitive or contentious topics may similarly consider including a non-disease / event cohort to prevent data providers from learning a sensitive status about the data subjects. However, for studies that would not analyse data on such a cohort, this brings additional data flow, which should be minimised by immediate third party deletion of any data received in the data extracts for this surplus cohort.

The involvement of specific data, data providers or recipients may necessitate adaptations to the protocol. For example, multiple providers may not share matching fields for linkage, in which case one or more providers may act as a ‘bridge’ by providing multiple digests, each to be matched with those from a different provider. Other schemes could be considered for reaching agreement on the salt used by data providers in pseudonymisation [56]. Controls may need to be applied to the environment in which the dataset is held, for example where data access agreements do not mitigate the risk of linkage with auxiliary information, through which ‘jigsaw re-identification’ may occur [57]. This may come at the cost of flexibility in the environment or researcher autonomy. The transferability of our data flow approach may be limited where there are additional legal requirements, industrial standards and organisational agreements applicable to different data types (e.g. human tissue), institutes and countries. Amendments may be necessary in transferring the described approach to other contexts or jurisdictions, which may require additional ethical and legal scrutiny and stakeholder involvement. However, the protocol template and steps described should aid in considering the relevant issues and developing a tailored protocol. The protocol can even be adapted if other contexts necessitate using other approaches to pseudonymisation or anonymisation [58, 59]. For example, if a project had approval to use pseudonymised data in order to relay study results to the participants, the protocol could be adapted by retaining the third party salt. Using the template, with adaptations, will efficiently aid in appropriately considering the wider information security and governance framework, including secure data transfer and access restrictions.
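The ‘bridge’ adaptation mentioned above can be sketched as follows. This is an illustrative example only: the field names (`nhs`, `local_id`), the use of one salt per shared field, the HMAC-SHA-256 digest and all data values are assumptions, and the subsequent third party re-pseudonymisation step is omitted for brevity.

```python
import hashlib
import hmac
import secrets


def make_digest(salt: bytes, value: str) -> str:
    """Keyed SHA-256 digest of a value (an assumed digest function)."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical salts, one per matching field.
nhs_salt = secrets.token_bytes(32)
local_salt = secrets.token_bytes(32)

# Provider A holds only "nhs", provider C holds only "local_id",
# and the bridge provider B holds both fields.
provider_a = [{"nhs": "N1", "a_value": 10}]
provider_c = [{"local_id": "L9", "c_value": 30}]
provider_b = [{"nhs": "N1", "local_id": "L9", "b_value": 20}]

a_extract = {
    make_digest(nhs_salt, r["nhs"]): {"a_value": r["a_value"]} for r in provider_a
}
c_extract = {
    make_digest(local_salt, r["local_id"]): {"c_value": r["c_value"]}
    for r in provider_c
}
# The bridge supplies two digests per record, one for each other provider.
b_extract = [
    {
        "nhs_digest": make_digest(nhs_salt, r["nhs"]),
        "local_digest": make_digest(local_salt, r["local_id"]),
        "b_value": r["b_value"],
    }
    for r in provider_b
]

# Linkage: each bridge record pulls in the matching rows from A and C.
linked = []
for row in b_extract:
    record = {"b_value": row["b_value"]}
    record.update(a_extract.get(row["nhs_digest"], {}))
    record.update(c_extract.get(row["local_digest"], {}))
    linked.append(record)
```

Here providers A and C never need a common field, because every match passes through the two digests supplied by provider B.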

In the case study, the third party did not perform a quantitative assessment of the residual risk of re-identification in the final dataset. However, the final list of variables was constrained and was reviewed including by the HRA, CAG, Queen’s Counsel and University of Leeds legal team and IT Assurance. Projects using different datasets will require consideration of re-identification risk following data linkage. However, we list the data processing applied in the case study to common data types (Table 1), which should aid in determining any appropriate steps. Tools such as sdcMicro could aid the third party in checking for any disclosure risk brought through the linking of variables [60]. Such work may bring resource cost and delay research access. Researchers should be informed of all data processing steps in order to be able to consider any impact on the statistical properties of the dataset that may bias analyses [51].
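As a rough illustration of the kind of post-linkage check discussed above, the sketch below counts how many records share each combination of potentially identifying variables and flags combinations that fall below a threshold. It is a simplified, k-anonymity-style count written in Python, not the sdcMicro interface (sdcMicro is an R package); the function name `small_groups`, the variable names, the data and the threshold of 3 are all assumptions.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k):
    """Return combinations of quasi-identifier values shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up linked rows, for illustration only.
linked_rows = [
    {"age_band": "60-69", "sex": "F", "area": "A"},
    {"age_band": "60-69", "sex": "F", "area": "A"},
    {"age_band": "60-69", "sex": "F", "area": "A"},
    {"age_band": "80-89", "sex": "M", "area": "B"},
]

# Combinations held by fewer than 3 people may need aggregation or suppression.
risky = small_groups(linked_rows, ["age_band", "sex", "area"], k=3)
```

A count of this kind only addresses uniqueness on the chosen variables; it does not replace a full assessment of re-identification risk.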

In our approach, anonymous digests were created by third party re-pseudonymisation of the digests, with the salt used being promptly destroyed. The third party could have alternatively replaced each digest with a random ID, but given the difficulties in true randomisation this random ID would still then require pseudonymising with a salt that is destroyed, so such an approach would have introduced extra steps into the process. While in this study the digests were deemed anonymous in context of the third party (e.g. by the Queen’s Counsel) prior to re-pseudonymisation and salt destruction, it is hoped that in time there will be further clarification, in national and international contexts, of the legal stance on such data. This is particularly necessary given the unprecedented rise in efforts to facilitate data access in response to COVID-19 [23, 61].
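For comparison, the random ID alternative mentioned above might look like the following sketch. The function name `replace_with_random_ids` and the data are hypothetical, and the sketch assumes that the operating system's random source (via Python's `secrets` module) is acceptable, which is exactly the point on which the text above expresses caution.

```python
import secrets


def replace_with_random_ids(extracts):
    """Replace each digest with a random ID, consistently across extracts.

    A lookup table is needed while processing so that the same digest
    receives the same ID in every extract; it is discarded on return.
    """
    lookup = {}
    relabelled = []
    for extract in extracts:
        out = {}
        for digest, row in extract.items():
            if digest not in lookup:
                lookup[digest] = secrets.token_hex(16)
            out[lookup[digest]] = row
        relabelled.append(out)
    return relabelled


# Made-up digests and values, for illustration only.
extract_a = {"d1": {"x": 1}, "d2": {"x": 2}}
extract_b = {"d1": {"y": 3}}
out_a, out_b = replace_with_random_ids([extract_a, extract_b])
```

The lookup table plays the role that the salt plays in re-pseudonymisation: while it exists the replacement is reversible, so it would have to be destroyed with the same care.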

Conclusions

We contribute a data flow protocol for linkage and anonymisation and a description of its successful utilisation in a case study to produce a dataset available for use in research. The presented risk-mitigating approach to data flow offers a practical solution to commonly encountered issues of data handling. It employs factors to remove the link to named individuals from person-level data. It incorporates best practice and provides adaptable steps for handling data in accordance with the current ethical and legal framework in the UK and European Union, but has potential global application. This flexible and adaptable protocol may facilitate the move toward widespread utilisation of the growing volumes of data available. We also proposed a cross-accreditation approach and recommended criteria that may support organisations in adopting appropriate data handling practices and offering public reassurance.

Supporting information

S1 Table. Inclusion criteria for patient selection for the Comprehensive Patient Records project.

(DOCX)

S1 File. Data flow protocol template.

(DOCX)

Acknowledgments

This work uses data provided by patients and collected by the NHS as part of their care and support. We acknowledge the guidance and input from the patient perspective from the CPR lay co-investigators, Barbara Woroncow and David Wilkinson. We acknowledge the views received from the Cancer Research UK Leeds Centre’s Public and Patient Involvement in Research Group and the Research Advisory Group of the University of Leeds Patient Reported Outcomes Group. We acknowledge the data handling expertise of Adam Keeley from the Data Analytics Team at the Leeds Institute for Data Analytics. We acknowledge legal counsel from Adrian Slater (University of Leeds Legal Advisor) and Robin Hopkins QC.

Data Availability

We provide a template version of the data flow protocol in the Supporting Information. The CPR data flow protocol and project information are available on the LIDA website [https://lida.leeds.ac.uk/comprehensive-patient-records-2/cpr_approvals/]. The CPR dataset can be accessed through submitting a request to the health data research hub DATA-CAN, who provide secure research access, given the potentially sensitive nature of the health data [https://web.www.healthdatagateway.org/dataset/ce4582a8-0985-46c6-b95f-29a5de862d4a].

Funding Statement

SSRC, KZ, PB, PW, JF, AWG and GH were supported by Macmillan Cancer Support (106451) [https://www.macmillan.org.uk/]. SSRC and AFM were supported by the Medical Research Council Leeds Medical Bioinformatics Centre (MR/L01629X) [https://gtr.ukri.org/projects?ref=MR%2FL01629X%2F1]. Mark Birkin was supported by the Economic and Social Research Council (ESRC) Consumer Research Data Centre (ES/L011891/1) [https://www.cdrc.ac.uk/]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Decision Letter 0

Pandi Vijayakumar

8 Oct 2021

PONE-D-21-29276

A data flow process for confidential data and its application in a health research project

PLOS ONE

Dear Dr. Crossfield,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 22 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Pandi Vijayakumar, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

"We would like to acknowledge the funding and support from Macmillan Cancer Support (106451), the Economic and Social Research Council (ES/L011891/1) and Medical Research Council (MR/L01629X)."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

"SSRC, KZ, PB, PW, JF, AWG and GH were supported by Macmillan Cancer Support (106451) [https://www.macmillan.org.uk/]. SSRC and AFM were supported by the Medical Research Council Leeds Medical Bioinformatics Centre (MR/L01629X) [https://gtr.ukri.org/projects?ref=MR%2FL01629X%2F1]. Mark Birkin was supported by the Economic and Social Research Council (ESRC) Consumer Research Data Centre (ES/L011891/1) [https://www.cdrc.ac.uk/]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. 

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well. 

Additional Editor Comments:

Based on the comments of the reviewers, I recommend major revision.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Linking the research datasets of patients without divulging the sensitive medical information of patients and hospitals is very difficult. This research work, in this line, has proposed a dataflow protocol for anonymous linked datasets for research purposes. It is accomplished by Comprehensive Patient Records study in Leeds, United Kingdom with data from two providers which contains the details of 365,193 patients.

The protocol design has been verified by research advisory groups.

The authors have received the legal advice for this protocol that it maintains ethical as well as legal requirements.

This protocol will be very useful during the Covid-19 situations to analyze and identify any useful patterns.

The use of OpenPseudonymiser for creating secure digests and linking the digests while preventing backward engineering seems to be very useful.

Reviewer #2: Authors presented a technological impact of trust in Social Government during COVID-19 crisis. Though, paper covers important topic, technical contribution of this paper is limited. Also, following comments need to be addressed in the revised manuscript:

-Highlight your contribution clearly in Abstract section.

-Mention the quantitative findings of this research in the abstract section.

-Fig. 1 is not explained properly.

-Also, in some figures, captions are not proper.

-Introduction section needs to be re-written to improve its quality and readability.

-Put some more light on concept of the paper and its usability to the industry and society.

-Paper needs to polish and provide a detailed explication of theoretical aspects such as conditions and theorems, and practical issues like algorithms, rules and possible applications.

-Following are some of relevant and recent references which can be referred in the revised manuscript:

Provably Secure Data Sharing Approach for Personal Health Records in Cloud Storage Using Session Password, Data Access Key, and Circular Interpolation,

Capability based outsourced data access control with assured file deletion and efficient revocation with trust factor in cloud computing,

IoT-based big data secure management in the fog over a 6G wireless network,

Efficient escrow-free CP-ABE with constant size ciphertext and secret key for big data storage in cloud,

A novel CNN based security guaranteed image watermarking generation scenario for smart city applications.

-In the conclusion section, I suggest, to highlighting the new discovery, new inventions and new aspects of the research.

-Improve overall flow of this paper for better understanding.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Jan 21;17(1):e0262609. doi: 10.1371/journal.pone.0262609.r002

Author response to Decision Letter 0


11 Nov 2021

Thank you for the feedback and opportunity to address the comments raised. We have added our response to all points raised by the reviewers in line below, and in the attached 'Response to Reviewers' document. Line references correspond to the attached ‘Revised Manuscript with Track Changes’.

Reviewer #1: Linking the research datasets of patients without divulging the sensitive medical information of patients and hospitals is very difficult. This research work, in this line, has proposed a dataflow protocol for anonymous linked datasets for research purposes. It is accomplished by Comprehensive Patient Records study in Leeds, United Kingdom with data from two providers which contains the details of 365,193 patients.

The protocol design has been verified by research advisory groups.

The authors have received the legal advice for this protocol that it maintains ethical as well as legal requirements.

This protocol will be very useful during the Covid-19 situations to analyze and identify any useful patterns.

The use of OpenPseudonymiser for creating secure digests and linking the digests while preventing backward engineering seems to be very useful.

Many thanks for your review and feedback.

Reviewer #2: Authors presented a technological impact of trust in Social Government during COVID-19 crisis. Though, paper covers important topic, technical contribution of this paper is limited. Also, following comments need to be addressed in the revised manuscript:

-Highlight your contribution clearly in Abstract section.

We believe that this manuscript contributes a linkage and anonymisation approach with clear steps for creating a research dataset in line with ethico-legal requirements. We also produced a research dataset that has important value for cancer research and describe how it has been made accessible.

We have now clarified this by highlighting the issue, study aim and contribution in the abstract (lines 28-30, 34-36 and 52-53).

-Mention the quantitative findings of this research in the abstract section.

The main findings of the research were that a) a data flow protocol was defined for linking and anonymising data with ethico-legal approval; and b) the data flow protocol was successfully applied in a case study in which data regarding 365,193 patients were linked and anonymised from two healthcare providers. The invaluable cancer dataset that was created is now available for use in research.

We have now highlighted this more clearly in the abstract (lines 43-49).

-Fig. 1 is not explained properly.

Thank you for the opportunity to provide further clarity. The data flow protocol that is defined in the manuscript was utilised in a case study. Figure 1 depicts the parties involved in data handling in this case study: the data transfers, the linkage and anonymisation actions performed, and the research access granted to the final (linked and anonymised) research dataset.

We have now added a description of the information in Figure 1 (lines 289-290) and further clarified the caption for Fig 1 (lines 308-309).

-Also, in some figures, captions are not proper.

We have now amended the captions for both Fig 1 and Fig 2, by adding further descriptive detail (lines 307-312).

-Introduction section needs to be re-written to improve its quality and readability.

The Introduction section has been refined throughout, to clarify the issues that the manuscript addresses and to improve readability (lines 65-67, 76-80, 86-87, 93-99, 106-114, 119-123, 132-136).

-Put some more light on concept of the paper and its usability to the industry and society.

As described above, we have now defined the aim and contribution more clearly in the Abstract, and also in the Conclusions section as described in response to a further comment below. We have also elaborated on the concept of the paper and its contribution to industry and society in the Introduction (lines 132-136) and Discussion (lines 316-322).

-Paper needs to polish and provide a detailed explication of theoretical aspects such as conditions and theorems, and practical issues like algorithms, rules and possible applications.

We have now refined the flow of this paper through general changes throughout. We have also now explicitly referenced the technical detail of the data flow and worked examples of pseudonymisation that are provided in the Supporting Information (lines 279-281). The glossary of terms in the Supporting Information, which provides further explication, is also now referenced. Further, we have now highlighted the possible application of the described approach in addressing practical issues in enabling the secondary use of data for research (e.g. Abstract, Conclusions, and lines 316-322). We have made explicit that the protocol is shared and access to the CPR dataset is made available, both in order to assist in future research projects (lines 268, 321-324).

-Following are some of relevant and recent references which can be referred in the revised manuscript:

Provably Secure Data Sharing Approach for Personal Health Records in Cloud Storage Using Session Password, Data Access Key, and Circular Interpolation,

Capability based outsourced data access control with assured file deletion and efficient revocation with trust factor in cloud computing,

IoT-based big data secure management in the fog over a 6G wireless network,

Efficient escrow-free CP-ABE with constant size ciphertext and secret key for big data storage in cloud,

A novel CNN based security guaranteed image watermarking generation scenario for smart city applications.

Thank you for providing references to these manuscripts, which have now been cited as relevant (lines 74-75).

-In the conclusion section, I suggest, to highlighting the new discovery, new inventions and new aspects of the research.

The manuscript contributes a) a data flow protocol for linkage and anonymisation, b) description of a case study that successfully utilised the data flow protocol, in the healthcare domain, and information on how to request access to the resultant dataset; and c) a proposed approach and recommended criteria for cross-accreditation between organisations that handle data.

The Conclusions section has now been amended to highlight these three contributions (lines 472-481).

-Improve overall flow of this paper for better understanding.

We have now improved the flow of this paper to aid understanding, through general changes applied throughout the manuscript.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Pandi Vijayakumar

24 Nov 2021

PONE-D-21-29276R1

A data flow process for confidential data and its application in a health research project

PLOS ONE

Dear Dr. Crossfield,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 08 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Pandi Vijayakumar, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments :

Based on the reviewers comments, I recommend the paper for minor revision.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The proposed work is about the utilization of valuable patient health data. Patient confidentiality is clearly maintained by anonymizing the data and linking them together for research and other application purposes. The authors claim that the work has potential applications not only in England and the European Union, to which the handled datasets pertain, but also on a global scale.

But I suggest the authors clarify in what way this research work is related to security. Apart from this, several worthwhile contributions, such as

1. "An efficient anonymous authentication and confidentiality preservation schemes for secure communications in wireless body area networks"

2. "A new SmartSMS protocol for secure SMS communication in m-health environment"

3. "An efficient anonymous authentication and key agreement scheme with privacy-preserving for smart cities"

should be added to the review section of this article to orient it towards confidentiality and security concerns.

As a whole, this research work is a novel proposal by the authors and if the suggestions are incorporated, the paper can be considered for possible publication in your esteemed journal.

Reviewer #2: A data flow process for confidential data and its application in a health research project is presented in this paper. Paper is revised well. It can be accepted now.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Jan 21;17(1):e0262609. doi: 10.1371/journal.pone.0262609.r004

Author response to Decision Letter 1


9 Dec 2021

Thank you for the feedback and the opportunity to address the comments raised. We have added our responses to all points raised by the reviewers inline below, and as described in the attached 'Response to Reviewers'. Line references correspond to the attached ‘Revised Manuscript with Track Changes’.

Reviewer #1: The proposed work is about the utilization of valuable patient health data. Patient confidentiality is clearly maintained by anonymizing the data and linking them together for research and other application purposes. The authors claim that the work has potential applications not only in England and the European Union, to which the handled datasets pertain, but also on a global scale.

But I suggest the authors clarify in what way this research work is related to security.

As described in lines 432-435, use of the protocol encourages consideration of the wider information security and governance framework, including the adoption of mechanisms for secure data transfer and restricting data access as appropriate. We acknowledge that the research has not directly developed any security feature; rather, it relates to security by guiding protocol users through the steps and security measures to be considered during linkage and anonymisation exercises.

We have therefore now clarified that the protocol relates to the linkage and anonymisation of data, rather than security directly (lines 299, 305).

Apart from this, several worthwhile contributions, such as

1. "An efficient anonymous authentication and confidentiality preservation schemes for secure communications in wireless body area networks"

2. "A new SmartSMS protocol for secure SMS communication in m-health environment"

3. "An efficient anonymous authentication and key agreement scheme with privacy-preserving for smart cities"

should be added to the review section of this article to orient it towards confidentiality and security concerns.

Thank you for these references. We have now discussed and cited these as appropriate in the manuscript (lines 73, 112-113, 406-408, 423-424).

As a whole, this research work is a novel proposal by the authors and if the suggestions are incorporated, the paper can be considered for possible publication in your esteemed journal.

Many thanks for your review and guidance.

Reviewer #2: A data flow process for confidential data and its application in a health research project is presented in this paper. Paper is revised well. It can be accepted now.

Many thanks for your review and feedback.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Pandi Vijayakumar

31 Dec 2021

A data flow process for confidential data and its application in a health research project

PONE-D-21-29276R2

Dear Dr. Crossfield,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Pandi Vijayakumar, Ph.D

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Based on the comments of the reviewers, I strongly accept this paper for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: A data flow process for confidential data and its application in a health research project is presented in this paper. It can be accepted now.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Pandi Vijayakumar

10 Jan 2022

PONE-D-21-29276R2

A data flow process for confidential data and its application in a health research project

Dear Dr. Crossfield:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Pandi Vijayakumar

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Inclusion criteria for patient selection for the Comprehensive Patient Records project.

    (DOCX)

    S1 File. Data flow protocol template.

    (DOCX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    We provide a template version of the data flow protocol in the Supporting Information. The CPR data flow protocol and project information are available on the LIDA website [https://lida.leeds.ac.uk/comprehensive-patient-records-2/cpr_approvals/]. The CPR dataset can be accessed through submitting a request to the health data research hub DATA-CAN, who provide secure research access, given the potentially sensitive nature of the health data [https://web.www.healthdatagateway.org/dataset/ce4582a8-0985-46c6-b95f-29a5de862d4a].


    Articles from PLoS ONE are provided here courtesy of PLOS
