Formatting guidance for USGS Samples Data Tables

Tips and example R scripts for converting long-formatted data into wide tabular format

Date Posted: May 6, 2025 | Last Updated: May 7, 2025
Authors: Lee Stanish, Candice Hopkins
Reading Time: 8 minutes

Recently, changes were made to the delivery format of USGS samples data (learn more here). In this blog, we describe the impact on users and show an example of how to use R to convert WQX-formatted water quality results data to a tabular, or “wide,” view.

Background

USGS modernized the way it delivers data from discrete field sampling events, moving from a USGS-specific format to the internationally recognized Water Quality Exchange (WQX) standard. As part of this transition, users have had to update their workflows to access the new sample data outputs. We’ve heard from many users that they would like the ability to create and view tables similar to those previously available as an output from the NWIS ‘qw’ webpages. This blog post offers some suggestions for modifying data in the WQX format to a different tabular view.

What is a long or wide data table?

WQX data follow tidy principles, which are often easier to work with in scripting languages. With tidy data:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

Converting WQX data into a different tabular output often means turning a “long” data table into a “wide” one. Data coming from USGS endpoints in the WQX standard are in a long format. The long format allows delivery of the result-specific metadata that scientists often seek and is commonly favored in data science. However, data in the long format look different from historic National Water Information System (NWIS) ‘qw’ tables. Most commonly, NWIS ‘qw’ tables were organized by sampling activities occurring on the same date and time, which can create very wide tables if a particular sampling activity produced many results (a common occurrence for water quality samples). Such a table often ends up being much wider than a user can view on a single screen!

Both long and wide data formats are acceptable methods for delivering data, and each has its own advantages. Long data tables are generally preferred when performing calculations and for aggregating data for statistical analyses. Wide data tables are generally good for quick review or interpretation: for example, a user can view changes in a set of measured properties over time.

Let’s illustrate using R, with an example snippet of a “long” table (note: this example is an excerpt from the dataRetrieval vignette called “Pivot Data”, available here):

library(tibble)

long <- tibble(site = c(rep("A", 4), rep("B", 4)),
               day = c(rep(c(1, 1, 2, 2), 2)),
               param = c(rep(c("Atrazine", "Metolachlor"), 4)),
               value = c(0.1, 0.25, 0.13, 0.38,
                         0.14, 0.38, 0.15, 0.40))
#>  site    day   param         value
#>  A       1     Atrazine      0.10
#>  A       1     Metolachlor   0.25
#>  A       2     Atrazine      0.13
#>  A       2     Metolachlor   0.38
#>  B       1     Atrazine      0.14
#>  B       1     Metolachlor   0.38
#>  B       2     Atrazine      0.15
#>  B       2     Metolachlor   0.40

As the table shows, there is one measurement per row. A single column, param, lists each measured property (i.e., parameter), and measurements from the same location and sampling event are spread over multiple rows.
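To see one advantage of the long layout, here is a small illustration (not part of the original vignette excerpt) of a grouped calculation on the long table above; in long format, a per-parameter summary takes only a group-and-summarize step:

# A minimal sketch: because every value sits in a single column, a grouped
# calculation such as the mean value per parameter is straightforward.
long |>
  dplyr::group_by(param) |>
  dplyr::summarize(mean_value = mean(value))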

Here is a snippet of the exact same data in “wide” table format:

wide <- tidyr::pivot_wider(long, 
                           names_from = param,
                           values_from = value)

#>  site    day   Atrazine   Metolachlor
#>  A       1     0.10        0.25
#>  A       2     0.13        0.38
#>  B       1     0.14        0.38
#>  B       2     0.15        0.40

The “wide” table has one row for all measurements for a particular location and sampling event, with each unique parameter having its own column. For samples with a lot of measurements, as is common for water quality samples, this can lead to a really wide table!

Example: A wide view of detection limit data

Here’s an example in R using the tidyverse ecosystem. Let’s say we want to view a table of all detection limit data for two phosphorus measured properties at two locations. Note that the results can represent distinct sampling events, so we need to keep track of metadata including sampling date and time, as well as activity type.

We can use the package dataRetrieval to access USGS discrete water quality data. Here we are accessing data from the USGS Samples web service, but the same approach applies to any long-formatted data set, such as the legacy Water Quality Portal or the new WQX3 web services.

library(dataRetrieval)
library(tidyverse)

sites <- c("USGS-04027000", "USGS-04063700")
characteristic_names <- c("Phosphorus as phosphorus, water, filtered",
                          "Orthophosphate as phosphorus, water, filtered" )

nutrient_data <- read_USGS_samples(monitoringLocationIdentifier = sites,
                                   characteristicUserSupplied = characteristic_names,
                                   dataProfile = "basicphyschem")

The data come back with dozens of columns. (Note that dataRetrieval creates some new columns for us; importantly, one called Activity_StartDateTime concatenates Activity_StartDate and Activity_StartTime.) Let’s make this a little easier to handle by pulling out the columns we decide are important for our analysis. We’ll also rename the columns so they are easier to read on the screen, although renaming is not necessary. Your own data may need many more columns!

nutrient_data_slim <- nutrient_data |> 
  # pick columns of interest and re-name
  select(date = Activity_StartDateTime,
         site = Location_Identifier,
         param = USGSpcode,
         activity_type = Activity_TypeCode,
         cond = Result_ResultDetectionCondition,
         val = Result_Measure,
         det_va = DetectionLimit_MeasureA,
         units = Result_MeasureUnit,
         det_units = DetectionLimit_MeasureUnitA) |> 
  # update flagged result values
  mutate(val = case_when(!is.na(cond) ~ det_va,
                         .default = val),
         units = case_when(!is.na(cond) ~ det_units,
                           .default = units),
         cond = case_when(!is.na(cond) ~ "<",
                          .default = "")) |> 
  # keep routine records
  filter(!is.na(date), 
         activity_type == "Sample - Routine, regular") |> 
  # remove more columns
  select(-det_units, -det_va, -activity_type)

#>  date                  site            param   cond    val   units
#>  1971-10-06 17:25:00   USGS-04063700   00666           0.03  mg/L
#>  1977-10-18 16:00:00   USGS-04027000   00666           0.01  mg/L
#>  1977-11-08 22:30:00   USGS-04027000   00666           0.01  mg/L
#>  1977-12-06 20:00:00   USGS-04027000   00666           0.02  mg/L
#>  1978-01-04 20:30:00   USGS-04027000   00666           0.02  mg/L
#>  1979-11-21 14:15:00   USGS-04063700   00671           0.00  mg/L

For this analysis, each parameter needs individual detection conditions, values, units, and detection limits. Using the tidyr package, we can run the pivot_wider function and get a table with all the required columns.

library(tidyr)

nutrients_wide <- nutrient_data_slim |> 
  pivot_wider(names_from = c(param, units),
              values_from = c(cond, val))
#>  date                  site            cond_00666_mg/L   cond_00671_mg/L   val_00666_mg/L   val_00671_mg/L
#>  1971-10-06 17:25:00   USGS-04063700                     NA                0.03             NA
#>  1977-10-18 16:00:00   USGS-04027000                     NA                0.01             NA
#>  1977-11-08 22:30:00   USGS-04027000                     NA                0.01             NA
#>  1977-12-06 20:00:00   USGS-04027000                     NA                0.02             NA
#>  1978-01-04 20:30:00   USGS-04027000                     NA                0.02             NA
#>  1979-11-21 14:15:00   USGS-04063700   NA                                  NA               0

Many options exist to customize the appearance of these tabular outputs, including sorting by parameter.
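For example, one way to keep each parameter’s detection flag next to its value is the names_vary argument of pivot_wider() (available in tidyr 1.2.0 and later). The snippet below is a minimal sketch that reuses nutrient_data_slim from above:

# "slowest" varies the names_from values slowest, so the cond/val pair for each
# parameter ends up adjacent (cond_00666..., val_00666..., cond_00671..., ...).
nutrients_wide_by_param <- nutrient_data_slim |> 
  pivot_wider(names_from = c(param, units),
              values_from = c(cond, val),
              names_vary = "slowest")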

Example: Which columns should I use to widen my dataset?

The previous example focuses on detection limit data; however, a user could modify it to widen by other variables based on their specific needs and use cases. One decision users will need to make is how they want the table to be organized, including which variables to widen by and what metadata to retain in the final table.

Legacy NWIS ‘qw’ tables were commonly organized by the date and time of the sampling event; a very similar concept to this is the sampling activity. Users can manipulate data to be organized by sampling activity, thereby creating a much wider table while dropping supporting metadata. To make tables look more similar to legacy tables, users can also concatenate metadata fields.

In the next example, as part of our data quality checks we want to review all discrete sample data at one site for Water Year 2024. We apply the following processing steps before widening:

  • Retain only routine samples by applying a filter on the field Activity_TypeCode
  • Concatenate column names to include parameter information, specifically USGSpcode and CharacteristicUserSupplied (a.k.a. Observed Property)
  • Concatenate Result_Measure and Result_MeasureUnit
  • Find and replace non-detect measurements with the combined values in Result_ResultDetectionCondition and DetectionLimit_MeasureA.

For the widened data table, let’s separate results by status value, e.g. Provisional versus Accepted, so we can see which data still require final review.

# create site list
sites <- "USGS-04027000"
# access data
nutrient_data <- read_USGS_samples(monitoringLocationIdentifier = sites,
                                   activityStartDateLower = "2023-10-01",
                                   activityStartDateUpper = "2024-09-30",
                                   dataProfile = "basicphyschem")
# pre-process
nutrient_data_slim <- nutrient_data |> 
  select(date = Activity_StartDateTime,
         site = Location_Identifier,
         activity = Activity_ActivityIdentifier,
         param = USGSpcode,
         activity_type = Activity_TypeCode,
         cond = Result_ResultDetectionCondition,
         val = Result_Measure,
         units = Result_MeasureUnit,
         det_va = DetectionLimit_MeasureA,
         det_units = DetectionLimit_MeasureUnitA,
         ObsProp = Result_CharacteristicUserSupplied,
         status = Result_MeasureStatusIdentifier) |> 
  mutate(val = case_when(!is.na(cond) ~ det_va,
                         .default = val),
         units = case_when(!is.na(cond) ~ det_units,
                           .default = units),
         cond = case_when(!is.na(cond) ~ "<",
                          .default = ""),
         # concatenate fields for header names
         headers = paste0(param, "_", ObsProp)) |> 
  # combine detection condition, value, and units into a single display value
  mutate(val = paste(cond, val, units)) |>
  filter(!is.na(date), 
         activity_type == "Sample - Routine, regular") |> 
  select(-det_va, -det_units, -cond, -units, -activity_type)
#  now widen!
nutrients_wide <- nutrient_data_slim |>
  pivot_wider(id_cols = c(site,
                          date,
                          activity,
                          status),
              names_from = headers,
              values_from = c(val))
Screen capture of a wide-formatted data table resulting from the R code above and downloaded in .csv format. For brevity, only a subset of columns is shown; the actual data set has 13 parameters.

One important point to note in these examples, and really whenever you are transforming a metadata-rich data set from long to wide format, is that the resulting tables will have lost some metadata. That is why it is important for users to pre-process their data by performing quality assurance checks and filtering out unwanted data and metadata fields before converting to a wide-formatted table.
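As one concrete pre-processing check, the sketch below (an addition to the original examples, reusing the nutrient_data_slim object from the code above) looks for duplicate identifier/parameter combinations before widening; pivot_wider() collapses such duplicates into list-columns and issues a warning, so it helps to find and resolve them first:

# Count rows per identifier/parameter combination; anything with n > 1 would be
# collapsed into a list-column by pivot_wider() and deserves a closer look
# (for example, replicate samples that should be filtered or summarized first).
nutrient_data_slim |> 
  count(site, date, activity, status, headers) |> 
  filter(n > 1)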

Summary

This article provides just a couple of examples of how to modify long WQX data into a wide data format using R. Tools also exist in Python’s pandas package, and Microsoft Excel has interactive tools for widening data using its Pivot Table feature. We provide a (very limited!) set of resources below as a starting point and encourage users to explore beyond the basics provided here.

Helpful resources

  • Useful video describing long and wide data formats
  • Python pandas pivot function documentation
  • Python pandas pivot function example (statology.org)
  • Microsoft Excel instructions on creating a pivot table
  • Video demonstrating conversion between long and wide format with Excel’s Power Query

You can also contact us with questions or comments at wqx@epa.gov.
