Data Release Guidance

Attribute              Value
Date created           2023-03-24
Date modified
Date approved
Version                0.1
Associated documents

Introduction

This document specifies the framework for releasing anonymised data collated in the Collaborative Healthcare Innovation through Mathematics, EngineeRing and AI (CHIMERA) dataset. These data include extracts from the electronic health record and measurements of physiological processes, often derived from advanced monitoring devices. They have been assembled to enable clinical and health science research teams to interrogate these rich data streams with the aim of improving the care of future patients.

While supporting this academic programme, we need to protect the identity of patients from whom these data have been derived. These data are stored securely in the Data Safe Haven at University College London (UCL DSH) which is certified to the ISO27001 information security standard and conforms to the National Health Service (NHS) Information Governance Toolkit.

We have followed the principles of patient confidentiality from the NHS Code of Practice in preparing this document.

To protect
look after the patient’s information. Data handling and storage are discussed elsewhere; the scope of this document is limited to the anonymisation steps for data release.
To inform
ensure that patients are aware of how their information is used. We will make this document, and the methods we use for anonymisation, publicly available. We have already engaged, and will continue to engage, with patients and their representatives to ensure that the processes of using these data are transparent.
To provide choice
allow patients to decide whether their information can be disclosed or used in particular ways. We will provide easily accessible opt-out mechanisms for patients who do not wish to have their data released.
To improve
always look for better ways to protect, inform, and provide choice. We will review this document annually and, through external audit, re-identification challenges, and public scrutiny, continually improve these processes.

We envision two scenarios in which data may be released from the DSH.

  • Raw data for analysis
  • Summarised data for research publication

The first of these carries the greater risk with respect to information security. The second refers to the release of data summaries, tables, and figures where individual records are not exposed. The same standards of security will apply to both scenarios, but the inherent aggregation of data in the latter means that most data will already have been pre-processed to meet these standards.

Situations where data will be released

Data will be released from the UCL Data Safe Haven (UCL DSH) using its auditable exporting system. This assigns precise roles and privileges to suitably authorised individuals to make a release, and maintains a record of all releases. The purpose of such releases would include:

  • Moving the data to a high performance computing infrastructure that is not available in the UCL DSH
  • Sharing the data with named individuals for educational or research purposes as already described (e.g. training, public and professional engagement events)

These data releases will be from an already anonymised database. However, the definition of anonymity depends on the context. Data outside the UCL DSH carry a higher risk of re-identification since external data sources might be used for this purpose. We will therefore hold such releases to a higher standard, following the guidance in the Information Commissioner’s Office Anonymisation Code of Practice, and adopt the ‘four-eyes’ principle (or two-person rule), such that two independent members of the study team review all releases to ensure that we are meeting this standard.
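
As a minimal, hypothetical sketch of how the ‘four-eyes’ rule might be enforced by release tooling (the reviewer identities and allow-list below are illustrative assumptions, not part of the UCL DSH export system), a release could be blocked unless two distinct authorised reviewers have signed off:

```python
# Minimal sketch of a 'four-eyes' check before a data release.
# The reviewer identities and allow-list below are hypothetical.
AUTHORISED_REVIEWERS = {
    "reviewer_a@ucl.ac.uk",
    "reviewer_b@ucl.ac.uk",
    "reviewer_c@ucl.ac.uk",
}


def four_eyes_satisfied(approvals: set) -> bool:
    """True only if at least two distinct authorised reviewers have approved."""
    return len(approvals & AUTHORISED_REVIEWERS) >= 2


print(four_eyes_satisfied({"reviewer_a@ucl.ac.uk"}))                          # False
print(four_eyes_satisfied({"reviewer_a@ucl.ac.uk", "reviewer_b@ucl.ac.uk"}))  # True
```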

Principles

Definition of personal data

We are following the guidance provided by the Information Commissioner’s Office (ICO) in ‘Anonymisation: managing data protection risk code of practice’ (2012). The legal basis for this guidance comes from the Data Protection Act (DPA) 1998 and Recital 26 of the European Data Protection Directive (95/46/EC), which in turn is based on the following principle:

information or a combination of information, that does not relate to and identify an individual, is not personal data

Importantly, the guidance from the ICO states that there is:

clear legal authority for the view that where an organisation converts personal data into an anonymised form and discloses it, this will not amount to a disclosure of personal data

Definition of likelihood of re-identification

The DPA does not require that it be impossible to re-identify an individual from disclosed data; rather, it defines personal data as data for which the risk of re-identification is “likely”. We are expected to take three factors into account:

  • the likelihood of identification being attempted
  • the likelihood of identification being successful
  • the quality of the data after the anonymisation has taken place.

Medical data present a likely target for re-identification, the more so where they include information on VIPs (e.g. public figures, politicians, celebrities). Although we can minimise this risk by removing the records of VIPs from released data, a risk remains for others.

We have therefore concentrated on making attempts at re-identification unlikely to succeed. This has to be balanced against the utility of the data after anonymisation has been performed, which, in turn, requires measures of both the disclosure risk and the information content.[^1]

Measuring disclosure risk

This is a complicated topic, and the notes we provide here are intended to illustrate the problems faced. We recommend that expert guidance be sought where further detail is required.

We will endeavour to share our anonymisation procedures, and welcome public review and comment. An initial draft is available here.

We treat these data as ‘safeguarded’: that is, we acknowledge that a persistent risk of re-identification remains, and so access to the data is controlled. For open data releases, further anonymisation might include consideration of the following additional issues.

These principles are applied to all patient identifiable information, as defined by the NHS Code of Practice on Confidentiality:

  • patient’s name, address, full postcode, date of birth;
  • pictures, photographs, videos, audio-tapes or other images of patients;
  • NHS number and local patient identifiable codes;
  • anything else that may be used to identify a patient directly or indirectly. For example, rare diseases, drug treatments or statistical analyses which have very small numbers within a small population may allow individuals to be identified.

Anonymisation methodology

We have adapted guidance for National Statistical Offices (e.g. the UK’s Office for National Statistics) produced by the International Household Survey Network in its ‘Introduction to Statistical Disclosure Control’. There are two steps in the anonymisation of data for release from the UCL DSH: a default anonymisation applied to all releases, and further anonymisation for open data releases.

Default Anonymisation

  1. Removal of direct identifiers: All unique identifiers, including NHS number, hospital number and names, will be removed from the data before transfer.
  2. Removal of implicit identifiers: This includes free text, imaging and genomics.
  3. Anonymise age by rounding to
    • weeks if <1 year
    • months if <18 years
    • years if <99
    • 100 otherwise
  4. Exclude small cells from categorical measures, defined as categories holding 10 or fewer individuals
  5. Date and time metadata converted to relative offsets: The date and time at which an observation was recorded carry a re-identification risk, as they narrow down the number of observations that might relate to any particular individual. We will convert all of these to date and time differences from critical care admission before data release. For example, a heart rate measurement will be recorded as occurring 24 hours after ICU admission, but it will not be possible to know the day, month or year of that measurement. Where a researcher intends to study a phenomenon that depends on these characteristics (e.g. the ’weekend’ effect), the minimum date and time information necessary for the analysis will be released. A minimal sketch of steps 3, 4 and 5 follows this list.
  6. Remove high risk individuals and specific opt-outs:
    • The participating hospitals will inevitably care for well known public figures from time to time, and to avoid attempts at their re-identification we will prospectively remove their health records from any data release.
    • Specific patients may also have notified the Data Controller through the Local Investigator that they do not wish their data to be shared. These records will also be removed at source.
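
The sketch below illustrates how steps 3, 4 and 5 of the default anonymisation might be implemented. It is a minimal example using pandas; the column names (age_years, unit, event_time, icu_admission) and the structure of the extract are assumptions made for illustration, not the CHIMERA schema.

```python
# Minimal sketch of steps 3, 4 and 5 of the default anonymisation, assuming a
# pandas DataFrame with hypothetical columns: age_years (float), unit
# (categorical), and event_time / icu_admission (datetimes).
import pandas as pd

SMALL_CELL_THRESHOLD = 10  # categories holding 10 or fewer individuals are suppressed


def round_age(age_years: float) -> str:
    """Round age to weeks (<1 year), months (<18 years), years (<99), else 100."""
    if age_years < 1:
        return f"{round(age_years * 52)} weeks"
    if age_years < 18:
        return f"{round(age_years * 12)} months"
    if age_years < 99:
        return f"{round(age_years)} years"
    return "100 years"


def default_anonymise(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Step 3: anonymise age by rounding.
    out["age"] = out["age_years"].apply(round_age)
    out = out.drop(columns=["age_years"])

    # Step 4: suppress small cells in categorical measures.
    counts = out["unit"].value_counts()
    small_cells = counts[counts <= SMALL_CELL_THRESHOLD].index
    out.loc[out["unit"].isin(small_cells), "unit"] = None

    # Step 5: convert absolute timestamps to offsets from critical care admission.
    out["hours_from_admission"] = (
        (out["event_time"] - out["icu_admission"]).dt.total_seconds() / 3600
    )
    out = out.drop(columns=["event_time", "icu_admission"])

    return out
```

Applied to an extract before export, a transformation of this kind leaves only rounded ages, suppressed small categories and times expressed relative to admission.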

Open data anonymisation

This will follow the principles outlined above but will necessarily be adapted for each data set, and potentially for each release. The data controller or their nominated deputy and the SAG data release approver must both authorise the release of the data. Data can only be released to .nhs.net, .ac.uk or approved organisational email addresses.
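
As an illustration only, the recipient check could be automated along the following lines; the approved-organisation placeholder and the helper function below are hypothetical assumptions rather than existing tooling.

```python
# Hypothetical check that a recipient address belongs to an approved domain.
APPROVED_SUFFIXES = ("nhs.net", "ac.uk")
APPROVED_ORGANISATIONS = {"example-approved.org"}  # illustrative placeholder only


def recipient_allowed(email: str) -> bool:
    """True if the domain is nhs.net, a UK academic domain, or an approved organisation."""
    domain = email.lower().rsplit("@", 1)[-1]
    if domain in APPROVED_ORGANISATIONS:
        return True
    return any(domain == s or domain.endswith("." + s) for s in APPROVED_SUFFIXES)


print(recipient_allowed("researcher@ucl.ac.uk"))  # True
print(recipient_allowed("clinician@nhs.net"))     # True
print(recipient_allowed("someone@example.com"))   # False
```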

Governance

Central data management

  • Data controller: Professor Rebecca Shipley
  • Data custodian: Professor Graham Hart

The data controller reports to the CHIMERA Advisory Board, and may nominate others to manage the data releases such that they meet the standards described here. These individuals must be made known to the CHIMERA Advisory Board so that a direct line of responsibility is maintained.

In addition, we undertake to:

  • review this policy annually
  • submit the policy and procedures to an annual internal information security audit
  • submit the policy and procedures to an annual external information security audit
  • keep all other approvals up-to-date (e.g. Research Ethics, Confidential Advisory Group approvals etc.)

Data user’s responsibilities

Data users will be expected to sign an end user licence agreement that is modelled on that used by the UK Data Service. An example of the CHIMERA End User Agreement can be viewed here.

Audit trail

An audit trail of the data release will be created containing the following information.

  1. Name of the person transferring the data
  2. Date and time of data processing
  3. Unique reference to the source data
  4. Code reference of anonymisation package (git commit ID)
  5. Code reference of the configuration file for the anonymisation
  6. Personal details of the data user
  7. Personal details of the data controller (or designated nominee) and SAG data release approver
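
For illustration, one way of capturing such a record as structured data is sketched below; the field names, types and example values are assumptions, not a prescribed CHIMERA schema.

```python
# Illustrative structure for a single audit-trail entry; the field names and
# example values are assumptions, not a prescribed CHIMERA schema.
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ReleaseAuditRecord:
    transferred_by: str      # 1. person transferring the data
    processed_at: datetime   # 2. date and time of data processing
    source_reference: str    # 3. unique reference to the source data
    anonymiser_commit: str   # 4. git commit ID of the anonymisation package
    config_reference: str    # 5. reference to the anonymisation configuration file
    data_user: str           # 6. personal details of the data user
    approvers: tuple         # 7. data controller (or nominee) and SAG approver


record = ReleaseAuditRecord(
    transferred_by="A. Analyst",
    processed_at=datetime(2023, 3, 24, 10, 30),
    source_reference="chimera-extract-001",
    anonymiser_commit="0a1b2c3",
    config_reference="configs/release-001.yaml",
    data_user="Dr B. Researcher",
    approvers=("Data controller nominee", "SAG data release approver"),
)
print(record)
```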

Absence of guidance

Fields and data that are not specifically mentioned are assumed to be non-disclosable. This is to prevent the accidental release of sensitive information as the database is updated. In other words, the algorithm for generating a data release will take a ‘rule-in’ approach whereby a field must be both specified for release and covered by the anonymisation steps above before it is included in any export; anything not listed is excluded by default.
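
A minimal sketch of this rule-in behaviour is shown below, assuming a hypothetical release configuration that names each exportable field and its anonymisation treatment; any field absent from the configuration is dropped.

```python
# Sketch of a 'rule-in' release: only fields explicitly named in the release
# configuration, each paired with an anonymisation treatment, are ever exported.
# The field names and treatments here are hypothetical.
RELEASE_CONFIG = {
    "age_years": "round_age",
    "heart_rate": "passthrough",
    "event_time": "offset_from_admission",
}


def select_fields(record: dict) -> dict:
    """Keep only fields named in RELEASE_CONFIG; everything else is excluded by default."""
    return {field: value for field, value in record.items() if field in RELEASE_CONFIG}


row = {"age_years": 67, "heart_rate": 82, "nhs_number": "999 999 9999"}
print(select_fields(row))  # {'age_years': 67, 'heart_rate': 82} -- nhs_number is dropped
```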