«How can we consolidate data and describe it in a standardized way?»
Scientific data management has some unique challenges, but it also offers lessons for other sectors. We focused on Data Storage and Operations, a knowledge area in the DMBOK. The topic is often viewed as basic and is rarely in focus, yet it is a fundamental part of data operations.
I talked to Nicolai Jørgensen at NMBU - Norwegian University of Life Sciences. Nicolai has a really diverse background. His journey in data started in 1983! In his free time, Nicolai spends time on photography and AI text-to-image generation.
Here are my key takeaways:
Scientific Data Management
- To describe data in a unified way, we need standards, like Dublin Core or Darwin Core for scientific data.
- Data is an embedded part of Science and Research - you can’t have those without data.
- You need to make sure you collect the right data, the right amount of data, valid data, and so on.
- You need to optimize your amount of time, energy and expenses when collecting and validating data.
- You need to standardize the way you collect data, to ensure that it can be verified.
- There needs to be an audit trail (lineage) between the data you have collected and the result presented in a publication.
- Data needs to be freely available for research and for testing hypotheses.
- Data needs to be findable, accessible and interoperable, but also reusable.
- ML algorithms can help extract and detect changes in scientific data that is internationally available.
- Describing data is key to tap into knowledge - for that you need metadata.
- In times of AI and ML, Metadata is still the key to uncover data.
- The development of AI models is a race - maybe we need to pause and get a better picture of cause and effect, and most of all risk.
- How can we standardize the infrastructure for research projects?
  - Minimize or get rid of volatile data storage and infrastructure
  - Standardize data storage solutions
  - Secure what needs to be secured
  - Split out sensitive or classified data and store it separately (e.g. personal data)
  - Train your end users and educate data stewards
  - Have good guidelines for researchers on how to store, use and manipulate data.
- There is a direct correlation between disk space use and sustainability.
- "Storage is cheap" is a correct saying if you look at storage in isolation - but in the bigger picture the cost is just moved elsewhere.
- Just adding more storage doesn’t solve your problems; it might even increase them.
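As an illustration of the point above about describing data in a unified way, here is a minimal sketch that checks a dataset description against the 15 elements of the Dublin Core Metadata Element Set. The `REQUIRED` policy, the `check_record` helper and the example record are assumptions made up for this sketch; they are not something discussed in the episode.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical project policy: elements every dataset must carry.
REQUIRED = ("title", "creator", "date", "identifier", "rights")


def check_record(record):
    """Return (unknown_keys, missing_required) for a metadata record."""
    # Keys that are not part of the Dublin Core element set.
    unknown = sorted(k for k in record if k not in DUBLIN_CORE_ELEMENTS)
    # Required elements that are absent or empty.
    missing = sorted(k for k in REQUIRED if not record.get(k))
    return unknown, missing


# Invented example dataset description.
record = {
    "title": "Soil temperature measurements, field plot A",
    "creator": "Example Researcher",
    "date": "2023-06-01",
    "format": "text/csv",
    "owner": "Example Lab",  # not a Dublin Core element
}

unknown, missing = check_record(record)
print("Unknown elements:", unknown)
print("Missing required:", missing)
```

A check like this could run when a researcher registers a dataset, so that gaps in the description are caught while the person who knows the data is still available.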
Long-term Preservation & Integrity
- To preserve data for the long term you need to:
  - Encapsulate data at a certain level
  - Standardize the way you describe the data
  - Upload the data package to a common governed platform
  - Find out whether there is a government body that can take responsibility for preserving your data for the time necessary
- Ensure that metadata is machine-readable
- Formats like XML provide the possibility to read the data by both machines and humans
- Research integrity: conducting research in a way that allows others to have trust and confidence in the methods used and the findings that result.
- Ensure lineage and audit trails for your scientific data.
- Fake data and data fabrication are serious issues in research - understanding and maintaining data integrity at the highest possible level is not getting easier, but it is increasingly important.
- Changes to data (change logs, change data capture, etc.) can be studied as well; you can build models to explore scenarios around data changes.
- You can fetch data from other sources to enrich the quality of your data.
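The points above about encapsulating data, machine-readable metadata and audit trails can be sketched in a few lines. This is a rough illustration under stated assumptions, not a preservation standard: the `package` and `checksum` elements and the function names are hypothetical, and only the `dc:` elements come from Dublin Core. The idea is to store a SHA-256 checksum of the data inside an XML metadata record, so that anyone can later verify that the data behind a publication has not changed.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw data."""
    return hashlib.sha256(data).hexdigest()


def build_metadata(data: bytes, title: str, creator: str) -> str:
    """Build an XML metadata record that includes a checksum of the data."""
    root = ET.Element("package")  # hypothetical wrapper element
    ET.SubElement(root, f"{{{DC_NS}}}title").text = title
    ET.SubElement(root, f"{{{DC_NS}}}creator").text = creator
    checksum = ET.SubElement(root, "checksum", algorithm="sha256")
    checksum.text = sha256_of(data)
    return ET.tostring(root, encoding="unicode")


def verify(data: bytes, metadata_xml: str) -> bool:
    """Check that the data still matches the checksum in its metadata."""
    root = ET.fromstring(metadata_xml)
    return root.findtext("checksum") == sha256_of(data)


# Invented example data and description.
data = b"plot,temp_c\nA,12.5\n"
xml_doc = build_metadata(data, "Soil temperature, field plot A", "Example Researcher")

print(xml_doc)
print(verify(data, xml_doc))                # unchanged data
print(verify(data + b"A,99.9\n", xml_doc))  # data altered after packaging
```

The XML record stays readable for both machines and humans, and the checksum gives a simple, verifiable link between the preserved data and what was originally packaged.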