Data submission SOP
You are expected to prepare a ‘Data set card’ for each submission. Please review the example here. We recommend reviewing the approach to data sharing popularised by Hugging Face.
Your submission might be a directory containing the following, or (preferably) a Git repository structured in the same way.
top-level-folder/
|- README.md
|- train.csv
|- test.csv
This layout will need to differ where your data is relational. In the example below, the data is provided as a single SQLite file, with the train/test split already made.
top-level-folder/
|- README.md
|- data.db
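One way to record a train/test split inside a single SQLite file is to tag each row with its split. The sketch below is illustrative only: the table name `observations`, its columns, the `split` column, and the 80/20 ratio are all assumptions, not a CHIMERA requirement.

```python
import random
import sqlite3


def write_split_db(rows, db_path="data.db", test_fraction=0.2, seed=0):
    """Write (id, value) rows to a SQLite file, tagging each as train or test.

    The schema here is hypothetical; adapt it to your own tables.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS observations")
            conn.execute(
                "CREATE TABLE observations ("
                " id INTEGER PRIMARY KEY,"
                " value REAL,"
                " split TEXT NOT NULL CHECK (split IN ('train', 'test')))"
            )
            conn.executemany(
                "INSERT INTO observations (id, value, split) VALUES (?, ?, ?)",
                [
                    (row_id, value, "test" if i < n_test else "train")
                    for i, (row_id, value) in enumerate(shuffled)
                ],
            )
    finally:
        conn.close()
```

A user can then select one partition with a query such as `SELECT * FROM observations WHERE split = 'train'`.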
The README.md file contains a YAML header.
pretty_name: "Pretty Name of the Dataset"
tags:
- tag1
- tag2
license: "any valid license identifier"
task_categories:
- task1
- task2
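A header like the one above can be read back programmatically before submission. The sketch below uses only the standard library and handles just the flat subset shown here (quoted scalars and simple lists). It assumes the header is enclosed between two `---` lines, as is conventional for Hugging Face dataset cards. A full YAML parser would be the more robust choice for anything more complex. The function name `parse_front_matter` is hypothetical.

```python
def parse_front_matter(text):
    """Parse a flat YAML header (scalars and simple lists) from README text.

    Assumes the header sits between two '---' lines at the top of the file.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README does not start with a '---' delimiter")

    meta = {}
    current_key = None  # key whose list items are currently being collected
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing delimiter reached
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.startswith("- "):
            if current_key is None:
                raise ValueError(f"list item without a key: {line!r}")
            meta[current_key].append(stripped[2:].strip().strip("\"'"))
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            raise ValueError(f"cannot parse line: {line!r}")
        key = key.strip()
        value = value.strip().strip("\"'")
        if value:
            meta[key] = value
            current_key = None
        else:
            meta[key] = []  # an empty value starts a list
            current_key = key
    raise ValueError("header is not closed by a second '---' line")
```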
The following information is mandatory.
Metadata | Notes |
---|---|
License | We recommend using the CHIMERA data license, which is based on the PhysioNet Contributor Review Health Data License. For alternatives, see the figshare help article "What is the most appropriate license for my data?". |
Identifiability | Valid choices: safeguarded (data is made available to users in the UCL Data Safe Haven, UCL-DSH) or open (bespoke anonymised data extracts may be released to the end user). |
Four Eyes | All health data must have, at some point, been identifiable. We have an anonymisation process to minimise the risk of a data leak, and the final stage of that process is a review of the data product by two members of the CHIMERA team (“Four Eyes”). |
Corresponding Author | An email address for the data author. |
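
The mandatory fields above could be checked automatically before a submission is accepted. The sketch below is one possible check, not an official CHIMERA validator. The key names (`license`, `identifiability`, `four_eyes`, `corresponding_author`) are hypothetical, since this document does not define the exact YAML keys, and the email pattern is deliberately loose.

```python
import re

# Hypothetical key names; the SOP does not specify the exact YAML keys.
REQUIRED_KEYS = ("license", "identifiability", "four_eyes", "corresponding_author")
IDENTIFIABILITY_CHOICES = {"safeguarded", "open"}
# Loose sanity check only: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_mandatory_metadata(meta):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if not meta.get(key):  # absent, empty, or False all count as missing
            problems.append(f"missing or empty mandatory field: {key}")

    identifiability = meta.get("identifiability")
    if identifiability and identifiability not in IDENTIFIABILITY_CHOICES:
        problems.append(
            f"identifiability must be one of {sorted(IDENTIFIABILITY_CHOICES)}, "
            f"got {identifiability!r}"
        )

    email = meta.get("corresponding_author")
    if email and not EMAIL_PATTERN.match(email):
        problems.append(f"corresponding_author is not an email address: {email!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a submitter fix everything in a single pass.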
The following information is optional.
Metadata | Notes |
---|---|
Data Publication | A digital object identifier for any publication that describes the data |
Analytic Publication(s) | A list of digital object identifiers that have used these data |
Size category | e.g. n<1k, 100k<n<1M |
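
The size category can be derived from the row count rather than typed by hand. The sketch below assumes decade-wide buckets extrapolated from the two examples in the table. The intermediate labels, the top bucket `n>10M`, and the handling of exact boundaries (a count of exactly 1,000 falls into the next bucket up) are assumptions.

```python
# (exclusive upper bound, label); labels other than "n<1k" and "100k<n<1M"
# are extrapolated from the examples and are assumptions.
SIZE_BUCKETS = [
    (1_000, "n<1k"),
    (10_000, "1k<n<10k"),
    (100_000, "10k<n<100k"),
    (1_000_000, "100k<n<1M"),
    (10_000_000, "1M<n<10M"),
]


def size_category(n_rows):
    """Map a row count to a size-category label."""
    if n_rows < 0:
        raise ValueError("row count cannot be negative")
    for upper_bound, label in SIZE_BUCKETS:
        if n_rows < upper_bound:
            return label
    return "n>10M"  # assumed catch-all for anything larger
```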