Data submission SOP
You are expected to prepare a ‘Data set card’ for each submission. Please review the example here. We recommend reviewing the approach to data sharing popularised by Hugging Face.
Your submission might be a directory containing the following, or (preferably) a Git repository structured in the same way.
top-level-folder/
|- README.md
|- train.csv
|- test.csv

This obviously will need to work differently where your data is relational. In this example, we simply provide the data as a SQLite file, and have done the train/test split.
top-level-folder/
|- README.md
|- data.db

The README.md file contains a YAML header.
pretty_name: "Pretty Name of the Dataset"
tags:
- tag1
- tag2
license: "any valid license identifier"
task_categories:
- task1
- task2

The following information is mandatory.
| Metadata | Notes |
|---|---|
| License | We recommend using the CHIMERA data license, which is based on the PhysioNet Contributor Review Health Data License. For alternatives, see the figshare help article "What is the most appropriate license for my data?". |
| Identifiability | Valid choices are safeguarded and open. Safeguarded data: data is made available to users in the UCL Data Safe Haven (UCL-DSH). Open data: bespoke anonymised data extracts may be released to the end user. |
| Four Eyes | All health data must have, at some point, been identifiable. We have an anonymisation process to minimise the risk of a data leak, and the final stage of that process is a review of the data product by two members of the CHIMERA team (“Four Eyes”). |
| Corresponding Author | An email address for the data author. |
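
The mandatory fields above could be checked before submission with something like the following minimal Python sketch. It assumes the metadata has already been read from the README.md header into a dictionary. The key names (`license`, `identifiability`, `four_eyes`, `corresponding_author`) and the idea of recording Four Eyes as a list of two reviewer names are illustrative assumptions, not something this SOP prescribes.

```python
import re

# Hypothetical key names: the SOP names the fields but not their YAML keys.
MANDATORY_KEYS = ("license", "identifiability", "four_eyes", "corresponding_author")
# The two valid choices listed in the table above.
IDENTIFIABILITY_CHOICES = {"safeguarded", "open"}
# Deliberately loose pattern: something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_mandatory_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Every mandatory field must be present and non-empty.
    for key in MANDATORY_KEYS:
        if not metadata.get(key):
            problems.append(f"missing mandatory field: {key}")

    identifiability = metadata.get("identifiability")
    if identifiability and identifiability not in IDENTIFIABILITY_CHOICES:
        problems.append(
            f"identifiability must be one of {sorted(IDENTIFIABILITY_CHOICES)}"
        )

    # Assumption: Four Eyes is recorded as the names of the two reviewers.
    reviewers = metadata.get("four_eyes")
    if reviewers and (not isinstance(reviewers, list) or len(set(reviewers)) < 2):
        problems.append("four_eyes must list two different reviewers")

    email = metadata.get("corresponding_author")
    if email and not EMAIL_PATTERN.match(str(email)):
        problems.append("corresponding_author must be an email address")

    return problems
```

Calling `check_mandatory_metadata` on a complete record returns an empty list; otherwise each entry describes one field to fix.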
The following information is optional.
| Metadata | Notes |
|---|---|
| Data Publication | A digital object identifier for any publication that describes the data |
| Analytic Publication(s) | A list of digital object identifiers that have used these data |
| Size category | e.g. n<1k, 100k<n<1M |
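
As an illustration of the size category, the sketch below maps a row count to a label in the style of the examples above. Only `n<1k` and `100k<n<1M` appear in this SOP; the intermediate labels, the `n>1M` catch-all, and the choice to place a count that sits exactly on a boundary in the higher bucket are assumptions made for the example.

```python
def size_category(n_rows: int) -> str:
    """Map a row count to a size-category label (bucket labels are partly assumed)."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")

    # (exclusive upper bound, label); only the first and last labels in this
    # list come from the SOP, the two in between follow the same pattern.
    buckets = [
        (1_000, "n<1k"),
        (10_000, "1k<n<10k"),
        (100_000, "10k<n<100k"),
        (1_000_000, "100k<n<1M"),
    ]
    for upper_bound, label in buckets:
        if n_rows < upper_bound:
            return label

    # Assumption: everything at or above one million shares a single label.
    return "n>1M"
```

For example, `size_category(250_000)` gives `100k<n<1M`.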