Experimenting with reproducibility: a case study of robustness in bioinformatics
- Author
- Jean-Baptiste Poline, Guillaume Dumas, Yang-Min Kim
- Affiliations
- Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur [Paris] (IP)-Centre National de la Recherche Scientifique (CNRS); Génétique humaine et fonctions cognitives - Human Genetics and Cognitive Functions (GHFC (UMR_3571 / U-Pasteur_1)), Institut Pasteur [Paris] (IP)-Université Paris Diderot - Paris 7 (UPD7)-Centre National de la Recherche Scientifique (CNRS); Lawrence Berkeley National Laboratory [Berkeley] (LBNL); McGill University = Université McGill [Montréal, Canada]
- Funding
- This work was supported by the following: Institut Pasteur (http://dx.doi.org/10.13039/501100003762), H2020 Societal Challenges (http://dx.doi.org/10.13039/100010676), Centre National de la Recherche Scientifique (http://dx.doi.org/10.13039/501100004794), Université Paris Diderot (http://dx.doi.org/10.13039/501100005736), Conny-Maeva Charitable Foundation, Cognacq-Jay Foundation, Orange (http://dx.doi.org/10.13039/501100003951), Fondation pour la Recherche Médicale (http://dx.doi.org/10.13039/501100002915), GenMed Labex, and BioPsy Labex. J.-B.P. was partially funded by NIH-NIBIB P41 EB019936 (ReproNim), NIH-NIMH R01 MH083320 (CANDIShare), and NIH 5U24 DA039832 (NIF), as well as the Canada First Research Excellence Fund, awarded to McGill University for the Healthy Brains for Healthy Lives initiative.
- Acknowledgments
- We thank Thomas Rolland and Freddy Cliquet for sharing their technical advice and comments. Y-M.K. and G.D. thank Thomas Bourgeron for his continuous support on this project.
- Grants
- ANR-10-LABX-0013, GENMED, Medical Genomics (2010); ANR-11-IDEX-0004, SUPER, Sorbonne Universités à Paris pour l'Enseignement et la Recherche (2011); ANR: PREFI-10-LABX-13/10-LABX-0013, GENMED, Medical Genomics (2010); ANR-11-IDEX-0004-02/11-LABX-0035, BIOPSY, Laboratoire de Psychiatrie Biologique (2011)
- Subjects
0301 basic medicine, Computer science, Health Informatics, Review, robustness, Bioinformatics, 03 medical and health sciences, 0302 clinical medicine, Documentation, Software, Robustness (computer science), Neoplasms, Computer cluster, Humans, cancer, MATLAB, reproducibility, License, Reusability, standard consensus dataset, network based stratification, Computational Biology, Reproducibility of Results, Python (programming language), Computer Science Applications, 030104 developmental biology, Data Interpretation, Statistical, Programming Languages, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], Algorithms, 030217 neurology & neurosurgery
- Abstract
Reproducibility has been shown to be limited in many scientific fields. This question is a fundamental tenet of scientific activity, but the related issues of reusability of scientific data are poorly documented. Here, we present a case study of our difficulties in reproducing a published bioinformatics method even though code and data were available. First, we tried to re-run the analysis with the code and data provided by the authors. Second, we reimplemented the whole method in a Python package to avoid dependency on a MATLAB license and ease the execution of the code on a high-performance computing cluster. Third, we assessed reusability of our reimplementation and the quality of our documentation, testing how easy it would be to start from our implementation to reproduce the results. In a second section, we propose solutions from this case study and other observations to improve reproducibility and research efficiency at the individual and collective levels. While finalizing our code, we created case-specific documentation and tutorials for the associated Python package StratiPy. Readers are invited to experiment with our reproducibility case study by generating the two confusion matrices (see more in section "Robustness: from MATLAB to Python, language and organization"). Here, we propose two options: a step-by-step process to follow in a Jupyter/IPython notebook or a Docker container ready to be built and run.
- Published
- 2018