The case for reproducibility

Explanation of reproducibility and resources to start practicing it

Date Posted October 6, 2016 Last Updated August 25, 2025

Author Lindsay R. Carr


Science is hard. Why make it harder?

Scientists and researchers spend a lot of time on data preparation and analysis, and some of these analyses are quite computationally intensive. The time required to conduct an analysis grows with the complexity of the calculations, the amount of data, and the number of datasets to be analyzed. Many researchers use spreadsheets for their analysis workflows and apply the same tasks to each dataset manually. Much of the time in this type of workflow can be spent copying and pasting equations from one spreadsheet or column to another.

What if I told you there was a better way?

By focusing on reproducibility from the start of a project, you can quickly re-run analyses, easily share your methods with colleagues and collaborators, apply the same methods to multiple datasets with little effort, and reduce errors.

What does “reproducibility” really mean?

The term “reproducibility” refers to how easily your work can be recreated by others and by your future self. You should be able to send one or two files and a few instructions for completing your analysis. There shouldn’t be a laundry list of items to change, directories to create, or old versions of software to install.

The best way to accomplish reproducibility is to start scripting your analyses. Scripting is the practice of writing code to a file in order to perform a certain task or calculation. Rather than producing “one-off” analyses, script your work so you can reference the exact method in the future, re-run the same method on other data, and easily share your processes with colleagues, collaborators, and the public. Any scripting language will do; however, USGS Water is using R.

Tips and tricks to having reproducible workflows

script your work!
comment the steps in your code
use relative (not absolute) filepaths
limit the number of applications/programs being used when possible
keep up to date with software versions
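
To illustrate the tips on commenting and relative filepaths, here is a minimal sketch in Python (any scripting language works, and the same idea carries over to R). The `data` folder and the file name `site_01.csv` are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# Hypothetical project layout, assumed for illustration:
#   project/
#     data/site_01.csv

def data_path(project_dir, file_name):
    """Build the path to a data file relative to the project folder."""
    # Joining onto the project folder avoids hard-coding a location that
    # only exists on one machine (for example, a path under one user's Desktop).
    return Path(project_dir) / "data" / file_name

# A relative path resolves against wherever the project folder lives.
relative = data_path(".", "site_01.csv")
print(relative.is_absolute())  # False
```

Because the path is built relative to the project folder, a collaborator can copy the whole folder anywhere and run the script without editing it.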

What can you do?

If you don’t know where to start, try learning some basic R. There are many resources: tryR from Code School, swirl, and the USGS Introduction to R Course.

Next, try to script just one piece of your analysis. Pick a set of tasks that need to be applied to several similar datasets or need to be run repeatedly. Write a script that automates one iteration through those tasks. Then reduce your analysis time and mistakes by applying that same script to all of the datasets or runs. Better still, move your code into a loop so that the script automates the repetition, too.
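
As a minimal sketch of that progression, the Python example below writes one function that handles a single dataset and then loops it over several. The site names and values are made up for illustration; in practice each dataset would likely be read from a file.

```python
from statistics import mean

def summarize(values):
    """One iteration of the analysis: summary statistics for a single dataset."""
    return {"n": len(values), "mean": mean(values), "max": max(values)}

# Made-up example datasets (hypothetical sites and values).
datasets = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [10.0, 20.0],
}

# The loop automates the repetition: every dataset gets the same method.
results = {}
for name, values in datasets.items():
    results[name] = summarize(values)

for name, summary in results.items():
    print(name, summary)
```

Adding another dataset then means adding one entry to `datasets`, not repeating the steps by hand.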

Reproducibility can go beyond your local files. Maybe your plots and tables are scripted, but you’re still having to copy and paste into slides or a manuscript. R Markdown can automate the process of inserting figures and tables into PDFs, Word documents, and slides.

Shuffling files between contributors and peer reviewers is time-consuming and can get confusing quickly. Version control is a way to avoid this mess: it tracks every deletion, every addition, and every contributor that interacts with your code. It is especially useful when there are multiple contributors because you never have to email files around at varying stages of completion. In fact, this post was created using version control. Our group uses Git and GitHub as version control tools, but they are not the only choice.

And finally, encourage colleagues and collaborators to strive for reproducible science!

In conclusion…

Watch this video on the “horrors” of non-reproducible workflows by Ignasi Bartomeus and Francisco Rodríguez-Sánchez.


