Skip to main content

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

Reproducible Data Science in R: Say the quiet part out loud with assertion tests

Add assertion tests to your functions to ensure assumptions are met and errors are helpful.

Date Posted September 2, 2025 Last Updated September 2, 2025
Author Anthony Martinez
Reading Time 12 minutes Share
Illustration of a personified, cartoon water droplet wearing a yellow construction hat and working to build pipelines on a conveyer belt. Text: Reproducible Data Science in R: Say the quiet part out loud with assertion tests. A W.D.F.N. Blog Series.

Overview

This blog post is part of the Reprodicuble data science in R series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.

Where we're going: In this post, we’re going to look at an important aspect of writing scientific R code: assertion testing. Assertions are the defensive checks we put in place so our functions can fail fast, fail loud, and fail helpfully when something goes wrong. We start out with simple checks and move to increasingly more concise code and more expressive messages.

Audience: This article is for novice or intermediate R users interested in developing the efficiency and readability of their code. It assumes that you know how to write a function in R reasonably well - if not, see the first two blog posts in our series: (1) Writing functions that work for you and (2) Writing better functions.

Why assert?

It’s important to realize that when writing functions, we make a lot of assumptions, especially about the information our functions require via input arguments. Maybe we assume the input data frame contains a numeric column called q_cfs. Or the method argument is one of only three character string choices. Or maybe the spatial_data input accepts only polygons, not points or lines. If our assumptions aren’t met, something may eventually break and an error may get thrown, possibly after time-consuming code has already run. This can waste resources and produces unhelpful error messages. Even worse, the function may complete but silently return a wrong or unexpected result.

We can prevent many of these if we thoroughly document our function . However, even in a perfectly documented function (not that one exists), it is still possible for (1) a user to use the function without reading the documentation or (2) the documentation to not address some unanticipated use. In that case, we would still like the function to handle these situations gracefully.

In multi-person, multi-run pipelines, failing early with a clear message is far better than letting silent errors cascade downstream. The bonus is a social one: good assertions with clear messages help collaborators understand what your function expects and how to fix inputs without digging through your code. The tidyverse error-message style exists precisely for that: a short problem statement, then crisp details. We can ensure that our functions fail fast, fail loud, and fail helpfully by using assertion tests. Assertions live inside your functions; they check the inputs and assumptions every time your function is called.

Our working example: Get the HUC boundary for a given point

To demonstrate assertion tests, we will define a function that will retrieve a Hydrologic Unit boundary polygon that intersects a given point. We will provide a spatial point and the type of Hydrologic unit (like "hu02" for a 2-digit hydrologic region or "hu10" for a ten-digit watershed) and it will build a query URL to then find and return a hydrologic unit polygon from the https://geoconnex.us database.

For the code in this post to run, you’ll need the following packages. Install any you don’t have if you want to follow along on your own.

library(sf)
library(dplyr)
library(curl)
library(chk)
library(rlang)
library(cli)

Version 0.0 — Naive function (no assertions yet)

#' Retrieve HUC polygon by point
#'
#' @param in_point sf point object; location to get HUC polygon for
#' @param huc_type character; the hydrologic unit type to retrieve,
#'   corresponding to the number of digits in the HUC identifier. One of `hu02`,
#'   `hu04`, `hu06`, `hu08`, or `hu10`.
#' @param reproject_output logical; if TRUE, reprojects the output polygon to
#'   the CRS of `in_point`. If FALSE (default), outputs the polygon in its
#'   native CRS, WGS 84 latitude/longitude (EPSG:4326).
#'
#' @returns sf polygon
#'
get_huc_polygon <- function(in_point, huc_type, reproject_output = FALSE) {
  # Get latitude and longitude
  # Transform to lat long and extract lat long coordinates
  sf_lat_lon <- sf::st_transform(in_point, "EPSG:4326") |> 
    sf::st_coordinates()
  site_lon <- sf_lat_lon[1, "X"]
  site_lat <- sf_lat_lon[1, "Y"]
  
  # Build URL to query https://geoconnex.us
  base_url <- "https://reference.geoconnex.us"
  collection <- sprintf("/collections/%s/items?f=json&filter=", huc_type)
  query <- sprintf("INTERSECTS(geometry, POINT(%f %f))", site_lon, site_lat) |>
    URLencode()
  url <- paste0(base_url, collection, query)
  
  out_sf <- sf::st_read(url, quiet = TRUE)
  
  if(reproject_output) {
    out_sf <- sf::st_transform(out_sf, sf::st_crs(in_point))
  }
  
  return(out_sf)
}
# Define our site
hemlock_butte <- sf::st_sfc(
  sf::st_point(c(-115.629, 46.473)),
  crs = "EPSG:4326"
)

# Get the HUC8 that our site is within
huc_08 <- get_huc_polygon(
  in_point = hemlock_butte,
  huc_type = "hu08",
  reproject_output = FALSE
)

# Look at the output data
dplyr::glimpse(huc_08)
#> Rows: 1
#> Columns: 8
#> $ id       <chr> "17060307"
#> $ fid      <int> 1513
#> $ uri      <chr> "https://geoconnex.us/ref/hu08/17060307"
#> $ name     <chr> "Upper North Fork Clearwater"
#> $ gnis_url <chr> ""
#> $ gnis_id  <chr> NA
#> $ loaddate <dttm> 2013-01-18 12:55:15
#> $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-115.0728 4...

Our function is well-documented but lacks assertion checks. We can hope users read the documentation to use the function correctly, but we can add some extra confidence for ourselves (the function authors) and the users by incorporating assertions. The first step in writing assertions is to identify the assumptions built into our function. This can be one of the trickiest aspects of assertion testing. Writing good documentation is a good first step in understanding these assumption: you can see them baked right into the documentation:

  • in_point is an sf point object (implied to be a length of 1)
  • huc_type is a character object
  • huc_type has a length of 1
  • huc_type is one of “hu02”, “hu04”, “hu06”, “hu08”, or “hu10”
  • reproject_output is a logical (implied to be a length of 1)

Beyond those assumptions there is one more that I could think of:

  • The R session has access to the internet

Version 1.0 — Base R stopifnot()

stopifnot() is the base R workhorse. It is dependency-free and gets you quick wins. stopifnot() checks that expressions are TRUE and fails if they’re not. By default the messages are terse, but you can give them names to customize the output. I’ll combine all of the checks into a single stopifnot() call and it will throw an error when the first FALSE is reached.

get_huc_polygon <- function(in_point, huc_type, reproject_output = FALSE) {
  # Check assertions
  stopifnot(
    "in_point must be an sf point" = inherits(in_point, "sf"),
    "in_point must be an sf point" = sf::st_geometry_type(in_point) == "POINT",
    "in_point must only have one feature" = nrow(in_point) == 1,
    "huc_type must be a character" = is.character(huc_type),
    "huc_type must have a length of 1" = length(huc_type) == 1,
    "huc_type must be one of hu02, hu04, hu06, hu08, or hu10." = huc_type %in%
      c("hu02", "hu04", "hu06", "hu08", "hu10"),
    "reproject_output must be TRUE or FALSE" = is.logical(reproject_output),
    "reproject_output must have a length of 1" = length(reproject_output) == 1,
    "Internet access must be available" = curl::has_internet()
  )
  
  # ... TRUNCATING FOR BREVITY ...
}

When we use some bogus info, it looks like this:

get_huc_polygon(in_point = hemlock_butte, huc_type = 2)
#> Error in get_huc_polygon(in_point = hemlock_butte, huc_type = 2) : 
#>   huc_type must be a character

These checks are pretty good! They cover all of the assumptions we identified and there are nice, human-readable (albeit brief) error messages. All our assumptions are tested early in our function, before any other code is run, maximizing efficiency. If there are multiple FALSE statements within stopifnot() it may be tedious for a user to fix their issues one-by-one and retry again and again. If that’s a concern, one of the other options could be particularly useful.

Version 2.0 — Cleaner, higher-level checks with chk

stopifnot() works well, but we can make error checking even cleaner and more compact if we outsource some of the work to the chk package. The purpose of chk is to check function arguments and it does a good job at it. It can make our code easier to read and more concise, and it builds our error messages for us.

get_huc_polygon <- function(in_point, huc_type, reproject_output = FALSE) {
  # Check assertions
  chk::chk_s3_class(in_point, "sf") # Is this an SF object?
  stopifnot("in_point must be an sf point" = sf::st_geometry_type(in_point) == "POINT") # An sf point?
  chk::check_data(in_point, nrow = 1)
  chk::chk_string(huc_type)
  chk::chk_subset(huc_type, c("hu02", "hu04", "hu06", "hu08", "hu10"))
  chk::chk_logical(reproject_output)
  chk::chk_scalar(reproject_output)
  stopifnot("Internet access must be available" = curl::has_internet())
  
  # ... TRUNCATING FOR BREVITY ...
}

This package has functions that check for scalar (that is, length-1) versions of vector types: we can use chk::chk_string(huc_type) to check that huc_type is both a character and has a length of one in a single check. chk does not cover every possible case, so for some situations, such as testing geometry type or internet access, I still fall back on stopifnot() to supply my own messages. When I supply some data that I know is going to cause problems, here is what I get:

get_huc_polygon(in_point = hemlock_butte, huc_type = "hu01")
#> Error in `get_huc_polygon()`:
#> ! `huc_type` must match 'hu02', 'hu04', 'hu06', 'hu08' or 'hu10', not 'hu01'.

get_huc_polygon(in_point = rbind(hemlock_butte, hemlock_butte), huc_type = "hu02")
#> Error in `check_dim()`:
#> ! `nrow(in_point)` must be equal to 1.

get_huc_polygon(in_point = hemlock_butte, huc_type = "hu02", reproject_output = 1)
#> Error in `get_huc_polygon()`:
#> ! `reproject_output` must be logical.

Version 3.0 — Enumerating options

For huc_type we have a predefined list of available options that the argument is allowed to take. Options are safer, more explicit, and more self-documenting when enumerated as a small character vector, as described in Enumerate possible options | Tidy design principles . To use this pattern, we define possible options in the function definition default. We then use rlang::arg_match() to validate these options and generate informative errors for free.

get_huc_polygon <- function(
    in_point,
    huc_type  = c("hu02", "hu04", "hu06", "hu08", "hu10"),
    reproject_output = FALSE
) {
  # Check assertions
  huc_type <- rlang::arg_match(huc_type)
  chk::chk_s3_class(in_point, "sf")
  stopifnot("in_point must be an sf point" = sf::st_geometry_type(in_point) == "POINT")
  chk::check_data(in_point, nrow = 1)
  chk::chk_logical(reproject_output)
  chk::chk_scalar(reproject_output)
  stopifnot("Internet access must be available" = curl::has_internet())
  
  # ... TRUNCATING FOR BREVITY ...
}

This consolidates our assertions for huc_type, validating that it is (1) a character, (2) has a length of 1, and (3) is one of the possible options. Aside from the assertion checking, we have listed all possible options in the function definition, making it clearer to the user which options are available. If the user does not pass anything to huc_type, it will default to the first item in the list (hu02). Because this technique requires that a default value is set, care needs to be taken that (1) a default argument makes sense and (2) the best choice for a default value is the first listed option. For more information on setting default values, read Reproducible Data Science in R: Writing better functions .

Version 4.0 — Expressive errors with cli

V1V3 are all huge improvements to our functions. In fact, they are probably sufficient for most of the code individuals write. However, there is one more level that is important to share, especially for cases where you expect your function to be used far outside of your control or influence. It can be very helpful to provide the user with additional context when they receive an error. Perhaps this function will be used inside another function and it would be helpful to know which arguments were passed into your function. Here we can use the cli package to build very expressive, context-rich error messages with structure, color, and consistent formatting. It also aligns with the tidyverse style guide for errors , which is a helpful guide.

In this example, we’ll make sure that reproject_output is either TRUE or FALSE and nothing else. This will then alert the user to the precise issue. Note that building error messages in this fashion can involve writing a lot of code, so to keep things brief, I’ll only show the relevant assertion check and leave out the other assertions.

get_huc_polygon <- function(
    in_point,
    huc_type = c("hu02", "hu04", "hu06", "hu08", "hu10"),
    reproject_output = FALSE
) {
  # Check assertions
  if(!any(isTRUE(reproject_output), isFALSE(reproject_output))) {
    cli::cli_abort(c(
      "{.arg reproject_output} must be either TRUE or FALSE.",
      x = "You passed {.code {reproject_output}}."
    ))
  }
  
  # ... TRUNCATING FOR BREVITY ...
}

get_huc_polygon(
  in_point = hemlock_butte,
  huc_type = "hu02",
  reproject_output = NA
)
#> Error in `get_huc_polygon()`:
#> ! `reproject_output` must be either TRUE or FALSE.
#> ✖ You passed `NA`.

We can build up much more complex examples to programatically build helpful messages. Here we will combine our checks for in_point into a single, multi-faceted error message.

get_huc_polygon <- function(
  in_point,
  huc_type = c("hu02", "hu04", "hu06", "hu08", "hu10"),
  reproject_output = FALSE
) {
  # Check assertions
  if (
    !all(
      inherits(in_point, "sf"),
      # This catches errors, in case in_point
      tryCatch(sf::st_geometry_type(in_point), error = \(e) FALSE) == "POINT",
      nrow(in_point) == 1
    )
  ) {
    if (inherits(in_point, "sf")) {
      msgs <- character() # Create empty character vector
      geom_type <- sf::st_geometry_type(in_point) # Get geometry type
      nrow <- nrow(in_point) # Get number of rows (i.e., features)
      if (!any(geom_type == "POINT")) {
        msgs <- c(
          msgs,
          x = "{.arg in_point} has a geometry type of {geom_type}."
        )
      }
      if (nrow != 1) {
        msgs <- c(msgs, x = "{.arg in_point} has {nrow} feature{?s}.")
      }
    } else {
      msgs <- c(x = "{.arg in_point} is class {.cls {class(in_point)}}.")
    }

    cli::cli_abort(c("{.arg in_point} must be a single sf point.", msgs))
  }

  # ... TRUNCATING FOR BREVITY ...
}

It now handles a couple of different scenarios and precisely informs the user of the issue.

get_huc_polygon(in_point = "something's wrong here", huc_type = "hu02")
#> Error in `get_huc_polygon()`:
#> ! `in_point` must be a single sf point.
#> ✖ `in_point` is class <character>.

get_huc_polygon(
  in_point = rbind(hemlock_butte, hemlock_butte),
  huc_type = "hu02"
)
#> Error in `get_huc_polygon()`:
#> ! `in_point` must be a single sf point.
#> ✖ `in_point` has 2 features.

get_huc_polygon(
  in_point = sf::read_sf("https://geoconnex.us/ref/hu08/02050302"),
  huc_type = "hu02"
)
#> Error in `get_huc_polygon()`:
#> ! `in_point` must be a single sf point.
#> ✖ `in_point` has a geometry type of MULTIPOLYGON.

Compared to chk or stopifnot(), the cli approach adds structure by giving a clear problem statement, bullet-pointed details, and formatting like {.arg} and {.cls}. These features make errors easier to skim and understand, especially for collaborators who didn’t write the function. The cost is the amount of code it takes to write all of this.

Other packages and next steps

We’ve shown a progression of approaches with increasing complexity: stopifnot()assertthat → option validation with arg_match()cli messages. There are other assertion frameworks you might explore as your needs grow:

  • assertthat : broad set of checks with descriptive failure messages
  • assertr : tidyverse-style assertions for data frames
  • validate : rule-based data validation
  • checkmate : fast, vectorized, comprehensive checks

Each of these has its own trade-offs in verbosity, dependencies, and style. The style of the message matters just as much as the check itself. Regardless of the tool that you decide to use, consider reading 9 Error messages | Tidyverse style guide for broad advice on how to word and structure the messages themselves. If, later, you need programmatic recovery or specialized handling of errors (such as when building your own package), consider implementing custom conditions, where you assign special classes to different error messages so they can be handled more robustly. 8.5 Custom conditions | Advanced R is a great chapter on the subject.

Closing thoughts

Assertions tests can be tedious and annoying to write, but they offer both hospitality and security. They invite your collaborators to use your functions without fear. Adding a few assertions to your next function will make your code sturdier, your collaborators happier, and your science more reproducible. A few lines of intentional checking can save hours of debugging and make your function’s expectations legible to colleagues. Start with the basics, then let your checks graduate to more expressive patterns as the function matures.

Share:

Related Posts

  • Reproducible Data Science in R: Iterate, don't duplicate

    July 18, 2024

    Illustration of a personified, cartoon water droplet wearing a yellow construction hat and working to build pipelines on a conveyer belt. Text: Iterate, don't duplicate. Reproducible Data Science with R. Easy steps to improve the reusability of your code! A W.D.F.N. Blog Series.

    Overview

    This blog post is part of a series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.

  • Reproducible Data Science in R: Writing better functions

    June 17, 2024

    A smiling raindrop in a hardhat works on a pipe assembly line beneath the blog series title, Reproducible Data Science with R and to the left of the blog post title, Writing better functions.

    Overview

    This blog post is part of a series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.

  • Formatting guidance for USGS Samples Data Tables

    May 6, 2025

    Recently, changes were made to the delivery format of USGS samples data (learn more here ). In this blog, we describe the impact to users and show an example of how to use R to convert WQX-formatted water quality results data to a tabular, or “wide” view.

  • Reproducible Data Science in R: Flexible functions using tidy evaluation

    December 17, 2024

    A smiling raindrop in a hardhat works on a pipe assembly line beneath the blog series title, Reproducible Data Science with R and to the left of the blog post title, Flexible functions using tidy evaluation.

    Overview

    This blog post is part of a series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.

  • The Hydro Network-Linked Data Index

    November 2, 2020

    Introduction

    updated 11-2-2020 after updates described here .

    updated 9-20-2024 when the NLDI moved from labs.waterdata.usgs.gov to api.water.usgs.gov/nldi/

    The Hydro Network-Linked Data Index (NLDI) is a system that can index data to NHDPlus V2 catchments and offers a search service to discover indexed information. Data linked to the NLDI includes active NWIS stream gages , water quality portal sites , and outlets of HUC12 watersheds . The NLDI is a core product of the Internet of Water and is being developed as an open source project. .