Better Data Management for Better Data Science
One year ago, I wrote a blog post about the exciting role big data promises to play in revolutionizing natural resource management. Indeed, our research program has rapidly evolved to take advantage of the huge amounts of information made available by new technologies and clever data platforms. While this transition has enabled us to ask questions that weren’t even imaginable just a few short years ago, our increasing reliance on externally sourced data, expanded collaborative networks, and high-powered computing infrastructure to handle increasingly complex analyses has not been without growing pains. The need for a thoughtful and standardized data infrastructure was something we could no longer ignore.
So this is a story of our journey from opaque project structures to standardized, transparent, and version-controlled workflows. It’s a story of learning from others, leveraging our team’s extensive data science and project management expertise, and working to build a streamlined and thoughtful data infrastructure. Ultimately, it’s a story that can best be told through the lessons we’ve learned along the way.
Lesson 1. Don’t reinvent the wheel. Fortunately for us, Santa Barbara is no stranger to open data science and innovative project management. Our friends and neighbors at the National Center for Ecological Analysis and Synthesis (NCEAS), and their Ocean Health Index team in particular, are global leaders in open environmental data science. Senior Fellow and Data Scientist Dr. Julia Stewart Lowndes has been at the forefront of NCEAS’s pioneering open science efforts, recently founding Openscapes to train others in open science practices to enable “better science in less time.” Julie became our data streamlining guru: she quickly identified that we ought to consider crafting a data management standard operating procedure (SOP), offered her team’s SOP as an example, and provided resources and guidance to set us on our way.
Lesson 2. Collaborative science is the best science. To effectively create a new vision for our internal data infrastructure, we knew we needed to draw on the wide range of expertise across our team. A self-assembled group of data managers, project researchers, and project managers began to meet regularly to map out an aspirational data and project management infrastructure. We reflected upon the collaborative software platforms, open data science tools, and project management tools that worked best for our team. We recorded best practices for file structure, naming, data storage, and metadata, and thought through script standardization and version control. We assembled how-to guides for using the high-performance computing resources that are increasingly important for our memory-intensive analyses. Importantly, we connected the dots between our workstreams to realize our vision of a fully integrated data management system, which we collaboratively documented in the emLab Standard Operating Procedure. And then we turned it loose, inviting the entire emLab community to participate in a one-day Data Hackathon to document and migrate our data inputs, outputs, and repositories to our now centralized and streamlined data system.
emLabers hard at work during our one-day Data Hackathon
Lesson 3. Iterate and evolve. Our SOP is a living document that will continue to evolve as we do. There is an ebb and flow to the chatter on our “data-streamlining” Slack channel, but we are collectively committed to continually improving our data practices and our SOP as needs arise. We’ve learned that a commitment to better data management takes time and effort, but the payoff is immediate. Our SOP is the centerpiece of our onboarding process, and our improved data infrastructure has already been leveraged to synthesize vast and disparate elements of global fisheries data into a user-friendly, interactive online tool. As a team committed to conducting research for real-world impact, our approach is inherently outward-looking, forward-thinking, and often based on externally generated sources of information. Our journey to re-envision our data infrastructure required us to reflect instead on our internal processes, to tear them apart and collectively rebuild something better: a more thoughtful and transparent system that affirms our commitment to open and reproducible data science.