Readable workflows need simple data [version 2; referees: 3 approved with reservations, 1 not approved]
- Source :
- F1000Research. 3:110
- Publication Year :
- 2014
- Publisher :
- London, UK: F1000 Research Limited, 2014.
Abstract
- Sharing scientific analyses via workflows has the potential to improve the reproducibility of research results, as workflows allow complex tasks to be split into smaller pieces and give visual access to the flow of data between the components of an analysis. This is particularly useful for trans-disciplinary research fields such as biodiversity and ecosystem functioning (BEF), where complex syntheses integrate data over large temporal, spatial and taxonomic scales. However, depending on the data used and the complexity of the analysis, scientific workflows can grow very complex, which makes them hard to understand and reuse. Here we argue that enabling simplicity from the beginning of the data life cycle, by adhering to good practices of data management, can significantly reduce the overall complexity of scientific workflows. It can simplify the processes of data inclusion, cleaning, merging and imputation. To illustrate our points we chose a typical analysis in BEF research, the aggregation of carbon pools in a forest ecosystem. We propose indicators to measure the complexity of workflow components including the data sources. We illustrate that the complexity decreases exponentially during the course of the analysis and that simple text-based measures can help to identify bottlenecks in a workflow. Taken together, we argue that focusing on the simplification of data sources and workflow components will improve and accelerate data and workflow reuse and improve the reproducibility of data-driven sciences.
Details
- ISSN :
- 20461402
- Volume :
- 3
- Database :
- F1000Research
- Journal :
- F1000Research
- Notes :
- Revised. Amendments from Version 1: With our revision we sharpened the focus of our paper. Our main focus is neither on the specific results of the presented use case, nor on the metrics we provide. We are writing an opinion paper (the article has now been reclassified as such), and both the use case and the metrics are illustrations of our opinion. Here we want to make a strong case for the simplicity of data and workflow components. Although it is not our intention to use the case study as proof, our paper is accompanied by many statistical analyses and plots. This may lead the reader to believe that we want to present a research article. However, we think that our plots are very useful for other data managers and scientists in illustrating why it is worthwhile to invest energy into simplifying datasets. This is especially the case for files from the long tail of big data, which are handcrafted and relatively small data sets resulting from fieldwork and not from automated sensors. To be able to illustrate the problem of merging these files, which is our day-to-day work as hybrids of data managers and researchers, we chose our case study, as it is representative of our work and the work of our fellow data managers we spoke to. We also think that it is highly useful to illustrate our difficulties in data re-use. In reworking our text in response to the reviewers' questions, we scaled down the method descriptions and put a stronger focus on the opinion parts of the paper. For that we reworked the text in many passages and also added a new section to the discussion that better discusses the complexity measures.
- Publication Type :
- Academic Journal
- Accession number :
- edsfor.10.12688.f1000research.3940.2
- Document Type :
- opinion-article
- Full Text :
- https://doi.org/10.12688/f1000research.3940.2