Formatting guidance for USGS Samples Data Tables
Tips and example R scripts for converting long-formatted data into wide tabular format
Recently, changes were made to the delivery format of USGS samples data (learn more here). In this blog, we describe the impact on users and show an example of how to use R to convert WQX-formatted water quality results data to a tabular, or “wide”, view.
Background
USGS modernized the way we deliver data from discrete field sampling events, moving from a USGS-specific format to the internationally recognized Water Quality Exchange (WQX) standard. As part of this transition, users have had to update their workflows to access the new sample data outputs. We’ve heard from many users that they would like the ability to create and view tables similar to those previously available as an output from the NWIS ‘qw’ webpages. This blog post offers some suggestions for modifying data in the WQX format to a different tabular view.
What is a long or wide data table?
WQX data follow tidy principles, which are often easier to work with in scripting languages. With tidy data:
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
Converting WQX data into a different tabular output often means turning a “long” data table into a “wide” data table. Data coming from USGS endpoints in the WQX standard are in a long format. The long format allows for delivery of the result-specific metadata that scientists often seek and is commonly favored in data science. However, data in the long format look different from historic National Water Information System (NWIS) ‘qw’ tables. Most commonly, NWIS ‘qw’ tables were organized by sampling activities occurring on the same date and time, which can create very wide tables if a particular sampling activity produced many results (a common occurrence for water quality samples). Such a table often ends up being much wider than a user can view on a single screen!
Both long and wide data formats are acceptable methods for delivering data, and each has its own advantages. Long data tables are generally preferred when performing calculations and for aggregating data for statistical analyses. Wide data tables are generally good for quick review or interpretation: for example, a user can view changes in a set of measured properties over time.
Let’s illustrate using R, with an example snippet of a “long” table (Note: this example is an excerpt from the dataRetrieval vignette called “Pivot Data”, available here):
library(tibble)

long <- tibble(site = c(rep("A", 4), rep("B", 4)),
               day = rep(c(1, 1, 2, 2), 2),
               param = rep(c("Atrazine", "Metolachlor"), 4),
               value = c(0.1, 0.25, 0.13, 0.38,
                         0.14, 0.38, 0.15, 0.40))
#> site day param value
#> A 1 Atrazine 0.10
#> A 1 Metolachlor 0.25
#> A 2 Atrazine 0.13
#> A 2 Metolachlor 0.38
#> B 1 Atrazine 0.14
#> B 1 Metolachlor 0.38
#> B 2 Atrazine 0.15
#> B 2 Metolachlor 0.40
As the table shows, there is one measurement per row. A single column, param, lists each measured property (i.e., parameter), and measurements from the same location and sampling event are spread over multiple rows.
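This one-measurement-per-row layout is also what makes the long format convenient for calculations. As a minimal sketch (using dplyr and the long table defined above), a grouped summary takes just a few lines:

library(dplyr)

# mean value of each parameter at each site, computed directly on the long table
long |>
  group_by(site, param) |>
  summarize(mean_value = mean(value), .groups = "drop")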
Here is a snippet of the exact same data in “wide” table format:
wide <- tidyr::pivot_wider(long,
                           names_from = param,
                           values_from = value)
#> site day Atrazine Metolachlor
#> A 1 0.10 0.25
#> A 2 0.13 0.38
#> B 1 0.14 0.38
#> B 2 0.15 0.40
The “wide” table has one row for all measurements for a particular location and sampling event, with each unique parameter having its own column. For samples with a lot of measurements, as is common for water quality samples, this can lead to a really wide table!
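The transformation also works in reverse: tidyr’s pivot_longer() function takes a wide table back to the long format. A short sketch using the wide table from above:

# stack every column except the identifiers back into param/value pairs
tidyr::pivot_longer(wide,
                    cols = -c(site, day),
                    names_to = "param",
                    values_to = "value")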
Example: A wide view of detection limit data
Here’s an example in R using the tidyverse ecosystem. Let’s say we want to view a table of all detection limit data for two phosphorus measured properties from two locations. Note that the results can represent distinct sampling events, so we need to keep track of metadata including sampling date and time, as well as activity type.
We can use the dataRetrieval package to access USGS discrete water quality data. Here we are accessing data from the USGS Samples web service, but the same approach applies to any long-formatted data set, such as the legacy Water Quality Portal or the new WQX3 web services.
library(dataRetrieval)
library(tidyverse)

sites <- c("USGS-04027000", "USGS-04063700")
characteristic_names <- c("Phosphorus as phosphorus, water, filtered",
                          "Orthophosphate as phosphorus, water, filtered")

nutrient_data <- read_USGS_samples(monitoringLocationIdentifier = sites,
                                   characteristicUserSupplied = characteristic_names,
                                   dataProfile = "basicphyschem")
The data come back with dozens of columns (note that dataRetrieval creates some new columns for us, importantly one called Activity_StartDateTime that concatenates Activity_StartDate and Activity_StartTime). Let’s make this a little easier to handle by pulling out the columns we decide are important for our analysis. We’ll rename the columns so they are easier to read on the screen; however, renaming is not necessary. Your own data may need many more columns!
nutrient_data_slim <- nutrient_data |>
  # pick columns of interest and rename
  select(date = Activity_StartDateTime,
         site = Location_Identifier,
         param = USGSpcode,
         activity_type = Activity_TypeCode,
         cond = Result_ResultDetectionCondition,
         val = Result_Measure,
         det_va = DetectionLimit_MeasureA,
         units = Result_MeasureUnit,
         det_units = DetectionLimit_MeasureUnitA) |>
  # update flagged result values
  mutate(val = case_when(!is.na(cond) ~ det_va,
                         .default = val),
         units = case_when(!is.na(cond) ~ det_units,
                           .default = units),
         cond = case_when(!is.na(cond) ~ "<",
                          .default = "")) |>
  # keep routine records
  filter(!is.na(date),
         activity_type == "Sample - Routine, regular") |>
  # remove columns no longer needed
  select(-det_units, -det_va, -activity_type)
#> date site param cond val units
#> 1971-10-06 17:25:00 USGS-04063700 00666 0.03 mg/L
#> 1977-10-18 16:00:00 USGS-04027000 00666 0.01 mg/L
#> 1977-11-08 22:30:00 USGS-04027000 00666 0.01 mg/L
#> 1977-12-06 20:00:00 USGS-04027000 00666 0.02 mg/L
#> 1978-01-04 20:30:00 USGS-04027000 00666 0.02 mg/L
#> 1979-11-21 14:15:00 USGS-04063700 00671 0.00 mg/L
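As an aside: the filter above keeps only routine samples. If you are unsure which activity types appear in your own data, a quick count before filtering can help (a small sketch; the type codes returned will depend on your sites):

# tally the activity types present in the raw download
nutrient_data |>
  count(Activity_TypeCode, sort = TRUE)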
For this analysis, each parameter needs individual detection conditions, values, units, and detection limits. Using the tidyr package, we can run the pivot_wider function and get a table with all the required columns.
library(tidyr)

nutrients_wide <- nutrient_data_slim |>
  pivot_wider(names_from = c(param, units),
              values_from = c(cond, val))
#> date                site          cond_00666_mg/L cond_00671_mg/L val_00666_mg/L val_00671_mg/L
#> 1971-10-06 17:25:00 USGS-04063700 ""              NA              0.03           NA
#> 1977-10-18 16:00:00 USGS-04027000 ""              NA              0.01           NA
#> 1977-11-08 22:30:00 USGS-04027000 ""              NA              0.01           NA
#> 1977-12-06 20:00:00 USGS-04027000 ""              NA              0.02           NA
#> 1978-01-04 20:30:00 USGS-04027000 ""              NA              0.02           NA
#> 1979-11-21 14:15:00 USGS-04063700 NA              ""              NA             0.00
Many options exist to customize the appearance of these tabular outputs, including sorting by parameter.
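For example, here is one way (a sketch, using the column names produced by the pivot above) to group each parameter’s condition and value columns together and order rows by date:

# identifier columns stay in front
id_cols <- c("date", "site")
data_cols <- setdiff(names(nutrients_wide), id_cols)

# order the data columns by the parameter code embedded in each name,
# so the cond_ and val_ columns for a parameter sit side by side
param_order <- order(sub("^(cond|val)_", "", data_cols))

nutrients_wide |>
  select(all_of(id_cols), all_of(data_cols[param_order])) |>
  arrange(date)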
Example: Which columns should I use to widen my dataset?
The previous example focuses on detection limit data; however, a user could modify this approach to widen by other variables based on their specific needs and use cases. One decision users will need to make is how they want the table to be organized, including which variables to widen by and what metadata to retain in the final table.
Legacy NWIS ‘qw’ tables were commonly organized by the date and time of the sampling event; a very similar concept is the sampling activity. Users can organize data by sampling activity, creating a much wider table while dropping supporting metadata. To make tables look more similar to legacy tables, users can also concatenate metadata fields.
In the next example, as part of our data quality checks we want to review all discrete sample data at one site for Water Year 2024. We applied the following processing steps before widening:
- Retain only routine samples by applying a filter on the field Activity_TypeCode
- Concatenate column names to include parameter information, specifically USGSpcode and CharacteristicUserSupplied (a.k.a. Observed Property)
- Concatenate Result_Measure and Result_MeasureUnits
- Find and replace non-detect measurements with the combined values in Result_ResultDetectionCondition and DetectionLimit_MeasureA
For the widened data table, let’s separate results by status value, e.g. Provisional versus Accepted, so we can see which data still require final review.
# create site list
sites <- "USGS-04027000"

# access data
nutrient_data <- read_USGS_samples(monitoringLocationIdentifier = sites,
                                   activityStartDateLower = "2023-10-01",
                                   activityStartDateUpper = "2024-09-30",
                                   dataProfile = "basicphyschem")
# pre-process
nutrient_data_slim <- nutrient_data |>
  select(date = Activity_StartDateTime,
         site = Location_Identifier,
         activity = Activity_ActivityIdentifier,
         param = USGSpcode,
         activity_type = Activity_TypeCode,
         cond = Result_ResultDetectionCondition,
         val = Result_Measure,
         units = Result_MeasureUnit,
         det_va = DetectionLimit_MeasureA,
         det_units = DetectionLimit_MeasureUnitA,
         ObsProp = Result_CharacteristicUserSupplied,
         status = Result_MeasureStatusIdentifier) |>
  # substitute detection limits for non-detects
  mutate(val = case_when(!is.na(cond) ~ det_va,
                         .default = val),
         units = case_when(!is.na(cond) ~ det_units,
                           .default = units),
         cond = case_when(!is.na(cond) ~ "<",
                          .default = ""),
         # concatenate fields for header names
         headers = paste0(param, "_", ObsProp)) |>
  # concatenate the detection condition, value, and units into one field
  mutate(val = paste(cond, val, units)) |>
  # keep routine records
  filter(!is.na(date),
         activity_type == "Sample - Routine, regular") |>
  select(-det_va, -det_units, -cond, -units, -activity_type)
# now widen!
nutrients_wide <- nutrient_data_slim |>
  pivot_wider(id_cols = c(site, date, activity, status),
              names_from = headers,
              values_from = val)

Screen capture of a wide-formatted data table resulting from the R code above and downloaded in .csv format. For brevity, only a subset of columns is shown: the actual data set has 13 parameters.
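To produce a download like the one shown above, the wide table can be written out with readr (loaded with the tidyverse). And because we kept the status column, records awaiting final review are easy to pull out; note that the status label "Provisional" below is an assumed example value and may differ in your data:

# write the wide table to a .csv file
write_csv(nutrients_wide, "nutrients_wide.csv")

# flag rows that still need final review
# ("Provisional" is an assumed example status value)
nutrients_wide |>
  filter(status == "Provisional")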
One important point to note in these examples, and really whenever you are transforming a metadata-rich data set from long to wide format, is that the resulting tables will have lost some metadata. That is why it is important for users to pre-process their data by performing quality assurance checks and filtering out unwanted data and metadata fields before converting to a wide-formatted table.
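One practical pre-processing check along these lines: pivot_wider() will warn and return list-columns if the identifier columns do not uniquely identify each result. A quick count before widening (a minimal sketch using the column names from the example above) catches duplicates early:

# combinations appearing more than once would break a clean pivot
nutrient_data_slim |>
  count(site, date, activity, status, headers) |>
  filter(n > 1)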
Summary
This article provides just a couple of examples to help modify long WQX data into a wide data format using R. Tools also exist in Python’s pandas package, and Microsoft Excel offers interactive tools for widening data through its Pivot Table feature. We provide a (very limited!) set of resources below as a starting point and encourage users to explore beyond the basics provided here.
Helpful resources
- Useful video describing long and wide data formats
- Python pandas pivot function documentation
- Python pandas pivot function example (statology.org)
- Microsoft Excel instructions on creating a pivot table
- Video demonstrating conversion between long and wide format with Excel’s Power Query
You can also contact us with questions or comments at wqx@epa.gov.