ME PhD students - Part 1
Slides are available at https://doi.org/10.5281/zenodo.7298877
datasteward-ME@tudelft.nl
Support with Research Data Management Policy requirements:
Attend training in Research Data Management ✔️
Data management plan (DMP)
Data/code archiving requirement
Workflow efficiency!
Early organisation and proper documentation of project and data
Prevent data loss… you never know
Part 1
Part 2
Part 3
This illustration is created by Scriberia with The Turing Way community.
Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807
Document describing what will happen with…
…during the project, including
It also determines what happens to code/data after the project
…. data in the form of facts, observations, images, computer program results, recordings, measurements or experiences on which an argument, theory, test or hypothesis, or other research output is based. It relates to data generated, collected, or used, during research projects, and in some cases may include the research output itself. Data may be numerical, descriptive, visual or tactile. It may be raw, cleaned or processed, and may be held in any format or media. Research data, in many disciplines, may by necessity include the software, algorithm, model and/or parameters, used to arrive at the research outcome, in addition to the raw data that the software, algorithm or model is applied to.
DMPonline is an online platform for creating a DMP
You can log in with your NetID to the TU Delft DMPonline
Here, you can:
For more instructions, see here.
Storage options at TU Delft
https://tudelft.topdesk.net/ > ICT services > IT for Researchers
These are accessible from the TU Delft network (e.g. via Windows File Explorer)
Location | Storage | Access | Suitable for confidential data? |
---|---|---|---|
Personal Drive (H:) | 8 GB | Just you | Yes (but not research data) |
Staff Group Data (M:) | 50 GB | Department | No |
Project Data (U:) | 5+ TB | Managed by drive owner (project PI) | Yes |
Need more computing power?
Location | Storage | Access | Suitable for confidential data? |
---|---|---|---|
SURFDrive | 1 TB | Just you (can share files/folders) | Yes |
Microsoft OneDrive | 1 TB | Just you (can share files/folders) | Yes |
Recommended for project data
TU Delft ICT Network Drive
Should be managed by project leader
Pros:
Other cloud solutions
TU Delft ICT Network drives
Cloud drives
❌ Not secure, not appropriate for sensitive/personal data
❌ Not compliant with 3-2-1 backup
❌ Account deleted shortly after researcher leaves
3-2-1 backup rule-of-thumb
3 copies of the data (1 primary, 2 backups)
2 different storage media (e.g. external hard drive and laptop)
1 copy stored offsite (different geographical location)
Two-way (synchronisation)
One-way (backup)
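A one-way backup can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for proper backup software: the function name `one_way_backup` and both folder arguments are hypothetical. Files flow only from the source to the backup, and files that exist only in the backup are never deleted, which is what separates a backup from a two-way synchronisation.

```python
import filecmp
import shutil
from pathlib import Path


def one_way_backup(source: Path, backup: Path) -> list[Path]:
    """Copy new or changed files from source to backup; never touch source.

    Files that exist only in the backup are left alone, so deleting a file
    in the source by accident does not delete its backup copy.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        dst_file = backup / src_file.relative_to(source)
        # Copy when the backup copy is missing or its content differs.
        if not dst_file.exists() or not filecmp.cmp(src_file, dst_file, shallow=False):
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_file, dst_file)  # copy2 also keeps timestamps
            copied.append(dst_file)
    return copied
```

Calling `one_way_backup(Path("project"), Path("/mnt/backup/project"))` (both paths are placeholders) returns the list of files that were copied on that run. The backup folder is assumed to sit outside the source folder.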
Look familiar?
Can you walk away from your project for days, weeks, months,
and come back and know what everything is?
Spend some time thinking about how you will organise yourself.
Your future self will thank you…
Projects should be contained within folders in a meaningful place
📁 project_name
    📄 README
    📁 data
        📄 raw-data_exp01.csv
        📄 raw-data_exp02.csv
    📁 analysis
        📄 analysis-script.R
    📁 reports
        📄 results-of-analysis.Rmd
    📁 publication
        📄 manuscript_v1.docx
~/Documents/Project_name
OneDrive/Project_name
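A project skeleton like the one above can be created with a short script, so that every new project starts from the same layout. This is a minimal sketch: the function name `create_project`, the `README.md` file name and the placeholder README text are assumptions made for illustration, while the subfolder names follow the example layout.

```python
from pathlib import Path

# Subfolder names taken from the example layout above.
SUBFOLDERS = ["data", "analysis", "reports", "publication"]


def create_project(root: Path, name: str) -> Path:
    """Create the project folder, its subfolders and a starter README."""
    project = root / name
    for sub in SUBFOLDERS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(
            f"# {name}\n\nDescribe the project, its data and its folders here.\n"
        )
    return project
```

Running it twice is harmless: existing folders and an existing README are left untouched.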
There are many pre-existing templates out there that can help you
https://github.com/djnavarro/newproject/
https://github.com/bvreede/good-enough-project
.
├── .gitignore
├── CITATION.md
├── LICENSE.md
├── README.md
├── requirements.txt
├── bin <- Compiled and external code, ignored by git (PG)
│ └── external <- Any external source code, ignored by git (RO)
├── config <- Configuration files (HW)
├── data <- All project data, ignored by git
│ ├── processed <- The final, canonical data sets for modeling. (PG)
│ ├── raw <- The original, immutable data dump. (RO)
│ └── temp <- Intermediate data that has been transformed. (PG)
├── docs <- Documentation notebook for users (HW)
│ ├── manuscript <- Manuscript source, e.g., LaTeX, Markdown, etc. (HW)
│ └── reports <- Other project reports and notebooks (e.g. Jupyter, .Rmd) (HW)
├── results
│ ├── figures <- Figures for the manuscript or reports (PG)
│ └── output <- Other output for the manuscript or reports (PG)
└── src <- Source code for this project (HW)
https://github.com/paleobiotechnology/analysis-project-structure
README.md
conda_environment.yml
.gitignore
01-documentation/
├──document_1.txt
└──document_2.tsv
02-scripts
├──ANA-script.sh
├──ANA-notebook.Rmd
├──QUAL-script.sh
└──QUAL-notebook.Rmd
03-data/
├──raw_data
├──published_data
├──reference_genomes
└──databases/
   └──<database_1>/
04-analysis/
├──analysis_1/
│ ├──sub-step
│ └──sub-step
└──analysis_2/
   ├──sub-step
   └──sub-step
05-results/
├──ANA-final_file.tsv
├──ANA-final_file.Rdata
├──QUAL-tool_output.csv
└──QUAL-tool_output.Rdata
06-reports/
├──ANA/
│ ├──final_rmarkdown_figures/
│ ├──final_rmarkdown.Rmd
│ └──final_rmarkdown.html
└──QUAL/
   ├──final_rmarkdown_figures/
   ├──final_rmarkdown.Rmd
   └──final_rmarkdown.html
07-publication/
├──figures
├──supplementary_figures/
├──supplementary_files/
├──sequencingdata_upload/
└──final_paper.Rmd
Just be consistent and transparent!
I know, I know, could there BE a more boring topic…
It is pretty essential, though. Follow these rules and it’ll be right 👍:
Use only _ and - as separators (no spaces or special characters)
Separate words with - and chunks with _
Good examples:
analysis01_descriptive-statistics.R
analysis02_preregistered-analysis.py
2009-01-01_original-analysis.R
Bad examples:
essay "romeo and juliet" draft01(1).docx
1-April-2012 supervisor comments on final draft.docx
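The good and bad examples above can be told apart automatically. The sketch below is one possible reading of the naming rules, not an official check: it accepts names made of letters and digits, joined by single `-` or `_` characters, followed by a file extension. The pattern and the function name `is_good_filename` are assumptions chosen for illustration.

```python
import re

# Letters/digits, joined by single "-" or "_", then a dot and an extension.
GOOD_NAME = re.compile(r"[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*\.[A-Za-z0-9]+")


def is_good_filename(name: str) -> bool:
    """Return True if the name has no spaces or special characters."""
    return GOOD_NAME.fullmatch(name) is not None


examples = [
    "analysis01_descriptive-statistics.R",
    "2009-01-01_original-analysis.R",
    'essay "romeo and juliet" draft01(1).docx',
    "1-April-2012 supervisor comments on final draft.docx",
]
for example in examples:
    print("OK " if is_good_filename(example) else "BAD", example)
```

The check only covers characters and separators. Whether a name is meaningful and sorts well (for instance dates written as YYYY-MM-DD) still needs a human eye.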
By using a version control system (VCS - git is most widely used), you can:
They are NOT a certified repository for long-term storage
They do NOT assign DOIs
They CAN be connected with certified repositories
Snapshots of the repo will be taken and assigned a DOI
What to document
Ask yourself: can someone with access to your project folder reproduce your findings exactly?
That someone may be your future self!
In practice
Conventionally kept in physical notebooks in the lab or the PI's office
This has some limitations
Several beneficial functionalities
TU Delft has software licenses for RSpace and eLABJournal
Illustrated by Connie Clare
File(s) (.txt, .md, .pdf) that are stored at the root of your project or data directory
Contain:
Data about data
FASTQ files - text-based format used in life sciences (bioinformatics in particular) that stores nucleotide sequence information
TIFF files - the .tiff format often contains additional information about images and how they were recorded
FITS files - file standard widely used in astronomy to store images and tables. FITS files contain headers with metadata about the data
Examples from https://www.tudelft.nl/en/library/research-data-management/r/manage/collect-and-document
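To make "data about data" concrete, the sketch below reads FASTQ records and separates the header line (identifier and optional description) from the sequence itself. It assumes the common four-line record layout (header starting with `@`, sequence, separator starting with `+`, quality string). The names `FastqRecord` and `read_fastq` and the sample record are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str   # metadata: read identifier from the header line
    description: str  # metadata: free text after the identifier, if any
    sequence: str     # data: the nucleotide sequence
    quality: str      # per-base quality string


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file with four lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # separator line starting with "+"
        quality = handle.readline().rstrip()
        if not header.startswith("@"):
            raise ValueError(f"Not a FASTQ header: {header!r}")
        identifier, _, description = header[1:].partition(" ")
        yield FastqRecord(identifier, description, sequence, quality)


# Hypothetical record, for illustration only.
sample = io.StringIO("@read1 sample=A\nGATTACA\n+\nIIIIIII\n")
for record in read_fastq(sample):
    print(record.identifier, record.description, len(record.sequence))
```

For real analyses an established library is the safer choice. The point here is only that the file carries metadata alongside the data.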
Raw data… DO NOT TOUCH
Make a copy of the raw data to perform calculations and analysis
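One way to enforce this on your own machine is to copy each raw file into a working folder and then remove write permission from the original. The sketch below assumes a POSIX-style file system (permission bits behave differently on Windows and on some network drives). The function name `protect_and_copy` and the folder arguments are hypothetical.

```python
import os
import shutil
import stat
from pathlib import Path

READ_ONLY = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH


def protect_and_copy(raw_file: Path, work_dir: Path) -> Path:
    """Copy a raw file into work_dir, then make the original read-only."""
    work_dir.mkdir(parents=True, exist_ok=True)
    working_copy = work_dir / raw_file.name
    if not working_copy.exists():  # do not overwrite work in progress
        shutil.copy2(raw_file, working_copy)
    # The raw file loses write permission; the working copy stays editable.
    os.chmod(raw_file, READ_ONLY)
    os.chmod(working_copy, READ_ONLY | stat.S_IWUSR)
    return working_copy
```

All calculations and cleaning steps then happen on the returned working copy, while the raw file stays as it was collected.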
One row per case
One column per variable
One cell per observation
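As an illustration of these three rules, the sketch below reshapes a small "wide" table, where the columns `day1` and `day2` are really values of a variable, into a layout with one row per case. The table contents, the column names and the function name `wide_to_long` are invented for this example.

```python
import csv
import io

# Hypothetical wide table: the measurement day is hidden in the column names.
WIDE = "participant,day1,day2\nP01,5.1,5.4\nP02,4.8,5.0\n"


def wide_to_long(rows, id_column, variable_name, value_name):
    """Turn each non-id column into its own row (one row per case)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


rows = list(csv.DictReader(io.StringIO(WIDE)))
for tidy_row in wide_to_long(rows, "participant", "day", "score"):
    print(tidy_row)
```

After reshaping, `participant`, `day` and `score` are each a single column, and every row holds exactly one measurement.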
Variable naming
Combine your analysis code and
output in a single document!
For Python users! 🐍
If you just can’t choose!
See courses and workshops here
Project organisation - give it some thought
Data storage
What is considered personal data?
“Personal Data” (GDPR, Article 4): any information relating to an identified or identifiable natural person
A person is identifiable by reference to an identifier such as a name, an identification number, location data, an online identifier, or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person
Direct identifiers: information that relates specifically to an individual such as the individual’s residence.
Indirect identifiers: information that can be combined with other information to identify specific individuals, e.g. a combination of gender, birth date, geographic indicator and other descriptors. Other examples include place of birth, race, religion, weight, activities, employment information, medical information, education information, and financial information.
Anyone working with human participants will have to submit an application to the Human Research Ethics Committee (HREC).
For the application you will need to:
Collect only what you need (and what you informed participants you would collect)
Access to personal data is restricted to only those who need to process them
Data should be stored in a secure location (e.g. Project Drive)
Informed consent forms should be securely stored
Pseudonymisation: assign a unique participant number to each participant, recorded on the corresponding informed consent form or in a separate key document. Use the participant number (not the name) during data collection and analysis. This is not anonymisation, since each participant number can still be traced back to the corresponding participant.
Anonymisation: full anonymisation is often difficult to achieve. It might still be possible to identify a specific individual by combining indirect identifiers. It is easier to achieve through data aggregation.
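The pseudonymisation step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `pseudonymise` and `replace_names`, the `P` plus six digits identifier format and the `name` field are all hypothetical. The key that links names to participant numbers must be stored separately from the research data, in a secure location.

```python
import secrets


def pseudonymise(names, existing_key=None):
    """Return a key mapping each name to a random, unique participant ID."""
    key = dict(existing_key or {})
    used = set(key.values())
    for name in names:
        if name in key:
            continue  # keep IDs stable for participants already in the key
        while True:
            candidate = f"P{secrets.randbelow(10**6):06d}"
            if candidate not in used:
                break
        key[name] = candidate
        used.add(candidate)
    return key


def replace_names(records, key, name_field="name"):
    """Return copies of the records with the name replaced by the ID."""
    cleaned_records = []
    for record in records:
        cleaned = {k: v for k, v in record.items() if k != name_field}
        cleaned["participant_id"] = key[record[name_field]]
        cleaned_records.append(cleaned)
    return cleaned_records
```

Random identifiers are used instead of sequential ones so that the number itself reveals nothing about recruitment order. The data remain pseudonymised, not anonymised, for as long as the key exists.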
As a PhD student you are responsible for:
Ensuring that all data and code underlying completed PhD theses are appropriately documented and accessible for at least 10 years from the end of the research project, in accordance with the FAIR principles (Findable, Accessible, Interoperable and Reusable), unless there are valid reasons which make research data unsuitable for sharing.
Minimal requirement:
Encouraged:
Data are available upon request to corresponding author.
Findable - persistent identifier (e.g. DOI) and detailed metadata
Accessible - long-term accessibility of data (or just metadata if restricted)
Interoperable - non-proprietary file formats
Reusable - proper documentation and clear license
Image: https://book.fosteropenscience.eu/
Collective benefit - inclusive development and equitable outcomes
Authority to control - Rights, interests, and governance
Responsibility - respect, reciprocity, and trust
Ethics - minimising harm and maximising benefit
https://www.gida-global.org/care
As open as possible; as closed as necessary.
Are the data suitable for sharing?
JoKalliauer; foter, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons
As a general rule, TU Delft owns all research data generated by employees at TU Delft.
But the funder of the project (either public or commercial) might impose ownership conditions.
Check whether existing relevant documents, such as a grant or consortium agreement, specify:
Bazuine, Merlijn. (2021). TU Delft Guidelines on Research Software: Licensing, Registration and Commercialisation. Zenodo. https://doi.org/10.5281/zenodo.4629635
Can co-exist (e.g. RStudio, NextCloud, ownCloud, Linux distros)