Data submission SOP

Author

Steve Harris

You are expected to prepare a ‘Data set card’ for each submission. Please review the example here. We recommend reviewing the approach to data sharing popularised by Hugging Face.

Your submission might be a directory containing the following, or (preferably) a Git repository structured in the same way.

top-level-folder/
  |- README.md
  |- train.csv
  |- test.csv
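
As an illustration of how a flat table could be written out in this layout, here is a minimal sketch using only the Python standard library. The function name `write_split`, the 20% test fraction, and the fixed seed are assumptions for the example, not part of the SOP. A row-level random split is also an assumption: if several rows belong to the same patient, you would probably want to split by patient instead.

```python
import csv
import random
from pathlib import Path


def write_split(rows, header, out_dir, test_fraction=0.2, seed=0):
    """Shuffle rows and write train.csv and test.csv into out_dir.

    Hypothetical helper: the name, the default fraction, and the seed
    are illustrative choices, not requirements of the SOP.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = int(len(shuffled) * test_fraction)
    splits = {"test.csv": shuffled[:n_test], "train.csv": shuffled[n_test:]}

    for filename, split_rows in splits.items():
        with open(out_dir / filename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)      # same header in both files
            writer.writerows(split_rows)

    # Report how many data rows ended up in each file.
    return {name: len(split_rows) for name, split_rows in splits.items()}
```

Calling `write_split(rows, ["id", "value"], "top-level-folder")` would create both CSV files next to the README.md that you write by hand.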

The flat-file layout will obviously need to work differently where your data is relational. In this example, we simply provide the data as a SQLite file, and have done the train/test split.

top-level-folder/
  |- README.md
  |- data.db
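
One way to produce such a file is with Python's built-in `sqlite3` module. The sketch below is illustrative only: the `patients` table, its columns, and the idea of recording the split in a `split` column are assumptions, since the SOP does not prescribe a schema or say how the split is stored.

```python
import sqlite3
from pathlib import Path


def build_database(db_path, patients):
    """Write (patient_id, age, split) rows to a SQLite file.

    Hypothetical schema: the table name, the columns, and the 'split'
    column are assumptions made for this example.
    """
    db_path = Path(db_path)
    db_path.parent.mkdir(parents=True, exist_ok=True)

    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS patients ("
                "patient_id INTEGER PRIMARY KEY, "
                "age INTEGER, "
                "split TEXT CHECK (split IN ('train', 'test')))"
            )
            con.executemany(
                "INSERT INTO patients (patient_id, age, split) VALUES (?, ?, ?)",
                patients,
            )
        # Return the number of rows per split as a quick sanity check.
        return con.execute(
            "SELECT split, COUNT(*) FROM patients GROUP BY split ORDER BY split"
        ).fetchall()
    finally:
        con.close()
```

Related tables would be added to the same file in the same way, so that a single `data.db` carries the whole relational data set.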

The README.md file contains a YAML header.

pretty_name: "Pretty Name of the Dataset"
tags:
- tag1
- tag2
license: "any valid license identifier"
task_categories:
- task1
- task2
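
If you generate the README.md programmatically, the header can be rendered with a few lines of Python. This is a sketch under stated assumptions: the function name `render_readme` is hypothetical, and the `---` delimiter lines follow the Hugging Face dataset card convention rather than anything stated in this SOP. Strings are quoted with `json.dumps`, which produces valid YAML double-quoted scalars.

```python
import json


def render_readme(pretty_name, tags, license_id, task_categories, body=""):
    """Return README.md text: a YAML header followed by the card body.

    Hypothetical helper. The '---' delimiters are an assumption based on
    the Hugging Face dataset card convention.
    """
    def yaml_list(key, items):
        # One "- item" line per entry, each quoted as a YAML string.
        return [f"{key}:"] + [f"- {json.dumps(item)}" for item in items]

    lines = ["---", f"pretty_name: {json.dumps(pretty_name)}"]
    lines += yaml_list("tags", tags)
    lines.append(f"license: {json.dumps(license_id)}")
    lines += yaml_list("task_categories", task_categories)
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body
```

The returned text can be written straight to `README.md`, with the free-text description of the data passed in as `body`.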

The following information is mandatory.

Metadata	Notes
License	We recommend using the CHIMERA data license, which is based on the PhysioNet Contributor Review Health Data License. For alternatives, see the figshare help article “What is the most appropriate license for my data?”.
Identifiability	Valid choices are `safeguarded` and `open`. Safeguarded data is made available to users in the UCL Data Safe Haven (UCL-DSH). Open data means that bespoke anonymised data extracts may be released to the end user.
Four Eyes	All health data must have, at some point, been identifiable. We have an anonymisation process to minimise the risk of a data leak, and the final stage of that process is a review of the data product by two members of the CHIMERA team (“Four Eyes”).
Corresponding Author	An email address for the data author.
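
A submission could be checked against the mandatory fields before it is sent. The sketch below assumes the metadata has already been loaded into a Python dictionary. The key names (`license`, `identifiability`, `four_eyes`, `corresponding_author`) are hypothetical, as the SOP does not specify how these fields are spelled in the header; only the two valid identifiability choices come from the table above. The email check is deliberately crude.

```python
# Hypothetical key names: the SOP lists the fields but not their spelling.
MANDATORY_KEYS = ("license", "identifiability", "four_eyes", "corresponding_author")
VALID_IDENTIFIABILITY = {"safeguarded", "open"}  # choices listed in the table


def check_mandatory(metadata):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    for key in MANDATORY_KEYS:
        if not metadata.get(key):
            problems.append(f"missing: {key}")

    identifiability = metadata.get("identifiability")
    if identifiability and identifiability not in VALID_IDENTIFIABILITY:
        problems.append(f"invalid identifiability: {identifiability!r}")

    author = metadata.get("corresponding_author")
    if author and "@" not in author:
        # Crude check only; it does not prove the address is deliverable.
        problems.append(f"corresponding_author is not an email: {author!r}")

    return problems
```

Returning a list of problems, rather than raising on the first one, lets the submitter fix everything in a single pass.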

The following information is optional.

Metadata	Notes
Data Publication	A digital object identifier for any publication that describes the data
Analytic Publication(s)	A list of digital object identifiers for publications that have used these data
Size category	e.g. `n<1k`, `100k<n<1M`
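
The size category can be derived from the row count. In the sketch below, only `n<1k` and `100k<n<1M` come from the examples above. The intermediate buckets, the `n>1M` label for larger data sets, and the treatment of exact boundaries (a count equal to a limit falls into the higher bucket) are assumptions that extend the same power-of-ten pattern.

```python
def size_category(n_rows):
    """Map a row count to a size category label.

    Only 'n<1k' and '100k<n<1M' appear in the SOP; the other labels and
    the boundary handling are assumptions for illustration.
    """
    bounds = [(1_000, "1k"), (10_000, "10k"), (100_000, "100k"), (1_000_000, "1M")]
    lower = None
    for limit, label in bounds:
        if n_rows < limit:
            # First bucket has no lower bound; later ones are ranges.
            return f"n<{label}" if lower is None else f"{lower}<n<{label}"
        lower = label
    return f"n>{lower}"  # assumed label for anything at or above 1M rows
```

For example, `size_category(500)` gives `n<1k` and `size_category(250_000)` gives `100k<n<1M`.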