-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ability to split / aggregate characteristics based on hydrofab refactor workflow. #303
Comments
haha we are working towards this too - we are testing out the stream cats data (see here: s3://nextgen-hydrofabric/streamcats/) with a zonal-esque approach that used |
Definitely interested in this idea - was already thinking about how to do this for estimating Mike W's upstream accumulated basin characteristics, but at a split bottom catchment. Have a potential need/use-case here |
For PUMP-ML workflows we've aggregated characteristics to the NHM reference fabric. We haven't dealt with split catchments, but I'll start by describing our workflow for aggregation using just the tabular data. library(tidyverse)
library(nhdplusTools)
# First, define a function to grab characteristics of interest. So far we've been downloading the data ourselves,
# but I'll use nhdplusTools helpers in the future now that I know about these capabilities to interface with NLDI.
#' @param varname character vector of desired variables retrieved from /link{discover_nldi_characteristics}
#' @param ids numeric vector of identifiers from the specified reference fabric
#' @param reference_fabric (not used) will be used to allow future specification of alternate reference fabrics
#' @importFrom dplyr bind_rows filter select
#' @importFrom tidyselect everything
get_catchment_characteristics <- function(varname, ids, reference_fabric = "nhdplusv2"){
nldi_vars <- bind_rows(
lapply(ids, function(x){
nldi_ft <- list(featureSource = "comid", featureID = x)
nldi_vars_ft <- bind_rows(get_nldi_characteristics(nldi_ft, type = "all"))
nldi_vars_ft_subset <- filter(nldi_vars_ft, characteristic_id %in% varname)
nldi_vars_ft_out <- select(mutate(nldi_vars_ft_subset,
comid = as.numeric(x),
characteristic_value = as.numeric(characteristic_value)),
comid, everything())
return(nldi_vars_ft_out)
})
)
return(nldi_vars)
}
# Next, define a function to aggregate characteristics assuming we have a crosswalk of ids between the two fabrics
# and preferences from the user for which summary statistics to use when aggregating.
#' @param vars data frame containing two columns: "characteristic_id", which is a character vector of desired
#' variables retrieved from /link{discover_nldi_characteristics}; and "aggregation_statistic", which is a character
#' vector indicating the summary statistic that should be applied to each variable. Acceptable values for
#' "aggregation_statistic" include "sum", "length_weighted_mean", "area_weighted_mean", "min", and "max".
#' @param ids_xwalk data frame containing two columns: "id", which is a numeric vector of identifiers at the
#' desired scale and "comid" which is a numeric vector of NHDPlusv2 identifiers.
#' @importFrom sf st_drop_geometry
#' @importFrom dplyr select left_join group_by summarize ungroup across
#' @importFrom tidyr pivot_wider
#' @importFrom tidyselect any_of
aggregate_catchment_characteristics <- function(vars, ids_xwalk){
# download nldi characteristics for the requested comids
nhd_vars <- get_catchment_characteristics(varname = vars$characteristic_id,
ids = ids_xwalk$comid)
# download nhdplus subset so that we have areasqkm and lengthkm
# cols (there's probably a slicker way to do this)
nhd_lines <- sf::st_drop_geometry(get_nhdplus(comid = c(ids_xwalk$comid), realization = "flowline"))
nhd_subset <- select(nhd_lines, comid, areasqkm, lengthkm)
# format characteristics data frame
nhd_subset_w_vars <- left_join(x = left_join(x = nhd_subset, y = nhd_vars, by = "comid"),
y = ids_xwalk, by = "comid")
nhd_subset_w_vars_wide <- pivot_wider(nhd_subset_w_vars, !percent_nodata,
names_from = characteristic_id,
values_from = characteristic_value)
# assign columns based on the desired summary operation
cols_sum <- vars$characteristic_id[vars$aggregation_statistic == "sum"]
cols_area_wtd_mean <- vars$characteristic_id[vars$aggregation_statistic == "area_weighted_mean"]
cols_length_wtd_mean <- vars$characteristic_id[vars$aggregation_statistic == "length_weighted_mean"]
cols_min <- vars$characteristic_id[vars$aggregation_statistic == "min"]
cols_max <- vars$characteristic_id[vars$aggregation_statistic == "max"]
# aggregate NHDPlusv2 attributes to desired catchments
nhd_vars_aggregated <- nhd_subset_w_vars_wide |>
group_by(id) |>
summarize(
areasqkm_sum = sum(areasqkm),
lengthkm_sum = sum(lengthkm),
across(any_of(cols_area_wtd_mean), weighted.mean, w = areasqkm, na.rm = TRUE, .names = "{col}_area_wtd"),
across(any_of(cols_length_wtd_mean), weighted.mean, w = lengthkm, na.rm = TRUE, .names = "{col}_length_wtd"),
across(any_of(cols_sum), sum, na.rm = TRUE, .names = "{col}_sum"),
across(any_of(cols_min), min, na.rm = TRUE, .names = "{col}_min"),
across(any_of(cols_max), max, na.rm = TRUE, .names = "{col}_max")
) |>
ungroup()
return(nhd_vars_aggregated)
}
# Apply the functions above to an example.
# define the desired characteristics and their corresponding summary statistics
vars <- data.frame(characteristic_id = c("CAT_EWT","CAT_WILDFIRE_2011"),
aggregation_statistic = c("area_weighted_mean", "min"))
# we use a crosswalk that tells us how the ids from the desired fabric (e.g. NHM fabric)
# correspond to NHDPlusv2 ids. Include crosswalk for one segment as an example:
catchment_xwalk <- data.frame(id = rep(1637, 8),
comid = c(4146594, 4146596, 4147370, 4147376,
4147378, 4147380, 4147382, 4147396))
> aggregate_catchment_characteristics(vars, catchment_xwalk)
# A tibble: 1 x 5
id areasqkm_sum lengthkm_sum CAT_EWT_area_wtd CAT_WILDFIRE_2011_min
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1637 47.3 23.1 -23.3 0
>
Is this roughly what you had in mind? The use case to split catchments is less obvious to me right now given the tabular data, I'll check out the zonal stats approach Mike linked to above. |
This is exactly what I had in mind. The catchment splitting is going to require a little bit more in the main pipeline and we'll have to bring in a lookup table, but what you've got here is most of it and gives me a good sense for how this should work. |
I've got a script (below) running for crosswalking nhdv2 catchment level variables to nhgf refactor. It aggregates and area-weights split catchments when it needs to, sum or mean. I tested with basin area and the numbers add up, even with nhgf divides that include split catchments. Dave B graciously provided me with a refactor nhgf that includes a split_divides layer so we can get those split catchment areas. That was the missing piece. So that's the good news, the bad news is that it is ridiculously slow. I'm doing everything in an admittedly tedious way just to get the concepts working. Since I'm not a whiz at "doing things faster"...perhaps someone can have a look and see a way to optimize this? Like...a lot? It's so slow. And maybe test another variable or two?
|
Here's a link to the refactor gpkg that includes the split_divide layer: (Dave I assume it's okay to share that) |
OK, I took a look at the refactored hydrofabric and @ajsekell's code and came up with the following solution to rescale (i.e., split or aggregate) catchment characteristics: # The code below calls a function that grabs the catchment characteristics. I haven't defined
# the function here, but it is included in my previous comment within this GitHub issue.
# Define a function to rescale the catchment attributes using the tabular data.
# This function is a combination of Andrew Sekellick's workflow to accommodate
# split catchments and a function I previously included in this GitHub issue
# to aggregate NHDPlusv2 attributes to new fabrics.
#' @param vars data frame containing two columns: "characteristic_id" is a character
#' vector of desired variables to retrieve from /link{discover_nldi_characteristics};
#' "summary_statistic", is a character vector indicating which summary statistic
#' should be applied to rescale each characteristic. Accepted values include
#' "sum," "length_weighted_mean," "area_weighted_mean," "min," and "max."
#' @param lookup_table data frame containing at least three columns: "ID" is a
#' numeric vector of identifiers at the desired scale; "comid" is a numeric vector
#' of NHDPlusv2 identifiers; "member_comid" contains the formatted NHDPlusv2 COMIDs,
#' for example, in the case of split catchments. If catchments have not been split,
#' the columns "comid" and "member_comid" should be identical.
#' @param refactored_areas data frame containing columns "FEATUREID" and "areasqkm."
#' Used to retrieve adjusted catchment areas in the case of split catchments.
#'
#' @importFrom dplyr left_join rename_with mutate across group_by summarize ungroup distinct
#' @importFrom tidyr pivot_wider
#' @importFrom tidyselect starts_with any_of
#'
rescale_nldi_characteristics <- function(vars, lookup_table, refactored_areas){
# check that the inputs match what we are expecting
if(!all(c("characteristic_id", "summary_statistic") %in% names(vars))){
stop("Check that vars contains columns 'characteristic_id' and 'summary_statistic'")
}
summary_stat_acceptable <- vars$summary_statistic %in% c("min", "max", "sum",
"length_weighted_mean",
"area_weighted_mean")
if(!all(summary_stat_acceptable)){
stop("Check that all entries in vars$summary_statistic match accepted values")
}
if(!all(c("comid", "member_comid", "ID") %in% names(lookup_table))){
stop("Check that lookup_table contains columns 'comid,' 'member_comid,' and 'ID'")
}
# omit any duplicated rows in the lookup table
lookup_table <- distinct(lookup_table)
# download nldi characteristics for the requested comids and pivot to wide format
nhd_vars <- get_catchment_characteristics(varname = vars$characteristic_id,
ids = unique(lookup_table$comid))
nhd_vars_wide <- pivot_wider(nhd_vars, !percent_nodata,
names_from = characteristic_id,
values_from = characteristic_value)
# get comid catchment areas, adjusting area for catchments that have been "split"
nhd_subset <- get_catchment_areas(unique(lookup_table$member_comid), refactored_areas)
# combine the nldi characteristics with the catchment identifier and basin area
cat_vars_unscaled <- lookup_table |>
left_join(y = nhd_subset, by = c("member_comid","comid")) |>
left_join(y = nhd_vars_wide, by = "comid")
# rescale the nldi characteristics if needed (i.e., for split catchments)
if(!all(cat_vars_unscaled$comid == cat_vars_unscaled$member_comid)){
var_names <- unique(nhd_vars$characteristic_id)
cat_vars_scaled <- mutate(cat_vars_unscaled, across(any_of(var_names), ~.x*split_area_prop))
} else {
cat_vars_scaled <- cat_vars_unscaled
}
# assign columns based on the desired summary operation
cols_sum <- vars$characteristic_id[vars$summary_statistic == "sum"]
cols_area_wtd_mean <- vars$characteristic_id[vars$summary_statistic == "area_weighted_mean"]
cols_length_wtd_mean <- vars$characteristic_id[vars$aggregation_statistic == "length_weighted_mean"]
cols_min <- vars$characteristic_id[vars$summary_statistic == "min"]
cols_max <- vars$characteristic_id[vars$summary_statistic == "max"]
# rescale NHDPlusv2 attributes to desired catchments
# !note that there are currently no adjustments made for the length of split flowlines
nhd_vars_rescaled <- cat_vars_scaled |>
group_by(ID) |>
summarize(
areasqkm_sum = sum(split_basin_areasqkm),
lengthkm_sum = sum(lengthkm),
across(any_of(cols_area_wtd_mean), weighted.mean, w = split_basin_areasqkm, na.rm = TRUE, .names = "{col}_area_wtd"),
across(any_of(cols_length_wtd_mean), weighted.mean, w = lengthkm, na.rm = TRUE, .names = "{col}_length_wtd"),
across(any_of(cols_sum), sum, na.rm = TRUE, .names = "{col}_sum"),
across(any_of(cols_min), min, na.rm = TRUE, .names = "{col}_min"),
across(any_of(cols_max), max, na.rm = TRUE, .names = "{col}_max")
) |>
ungroup()
return(nhd_vars_rescaled)
}
# Define the function `get_catchment_areas` that's called in `rescale_nldi_characteristics`
#' @title Get basin area and flowline length for NHDPlusV2 COMIDs
#'
#' @description
#' Get catchment area and flowline length for NHDPlusv2 COMID(s) of interest.
#' If any COMID represents a split catchment, the split catchment areas are
#' returned, along with a column that represents the proportion of the original
#' NHDPlusv2 catchment area that is covered by the split catchment.
#'
#' @param comids integer vector or character vector containing NHDPlusV2 identifiers
#' @param refactored_areas data frame containing columns "FEATUREID" and
#' "areasqkm." Used to retrieve adjusted catchment areas in the case of split
#' catchments.
#'
#' @importFrom sf st_drop_geometry
#' @importFrom dplyr mutate select right_join left_join filter rename bind_rows
#'
get_catchment_areas <- function(comids, refactored_areas){
# format comids (omit suffix "." in the case of split catchments)
comids_fmt <- data.frame(member_comid = comids) |>
mutate(comid = gsub("\\..*", "", member_comid))
# fetch basin area for all comids
nhd_lines <- sf::st_drop_geometry(get_nhdplus(comid = c(unique(comids_fmt$comid))))
nhd_subset <- nhd_lines |>
mutate(comid = as.character(comid)) |>
select(comid, areasqkm, lengthkm) |>
right_join(x = comids_fmt, by = "comid")
# handle "split" catchments (if applicable)
if(all(comids_fmt$member_comid == comids_fmt$comid)){
nhd_subset_out <- nhd_subset |>
mutate(comid = as.integer(comid))
} else {
# create data frame containing the "split" catchments
comids_to_split <- filter(nhd_subset, grepl(".", member_comid, fixed = TRUE))
# get the basin area for the split catchments
split_areas <- refactored_areas |>
filter(FEATUREID %in% comids_to_split$member_comid) |>
rename(split_basin_areasqkm = areasqkm)
split_areas_fmt <- left_join(x = comids_to_split, y = split_areas, by = c("member_comid" = "FEATUREID"))
# create data frame containing the unsplit catchments
comids_no_split <- filter(nhd_subset, !grepl(".", member_comid, fixed = TRUE))
unsplit_areas_fmt <- mutate(comids_no_split, split_basin_areasqkm = areasqkm)
# bind split and unsplit catchments back together
nhd_subset_out <- bind_rows(unsplit_areas_fmt, split_areas_fmt) |>
mutate(comid = as.integer(comid),
split_area_prop = split_basin_areasqkm/areasqkm)
}
return(nhd_subset_out)
}
# Apply the functions above to an example.
# First, define the attributes of interest and how they should be
# summarized when aggregated or split.
vars <- data.frame(characteristic_id = c("CAT_EWT","CAT_WILDFIRE_2011", "CAT_BASIN_AREA"),
summary_statistic = c("area_weighted_mean", "min","sum"))
# Next, get the lookup table from the refactored hydrofabric that
# contains information about how NHDPlusv2 catchments are split.
file_name <- "refactor_02.gpkg"
file_path <- file.path(tempdir(), file_name)
gpkg_file <- sbtools::item_file_download("61fbfdced34e622189cb1b0a",
destinations = file_path,
names = file_name,
overwrite_file = TRUE)
lookup_table <- sf::st_read(gpkg_file, layer= "lookup_table") |>
rename(comid = NHDPlusV2_COMID, member_comid = member_COMID, ID = reconciled_ID) |>
# subset lookup table for this example
filter(ID %in% c(10012268, 10012979, 10024047, 10024048, 10024049, 10024050))
lookup_table
> lookup_table
comid member_comid ID LevelPathID
1 4146596 4146596 10012268 2132667
2 4147382 4147382 10012268 2132667
3 4147378 4147378 10012979 2130154
4 4147370 4147370 10012979 2130154
5 4147396 4147396.1 10024047 2123206
6 4147396 4147396.2 10024048 2123206
7 4147380 4147380.1 10024049 2123206
8 4147380 4147380.2 10024050 2123206
# We will also need the dataset "split_divides" that gives the basin
# area for the split catchments.
split_divides <- sf::st_read(gpkg_file, layer = "split_divides")
# rescale the characteristics to the refactored hydrofabric
rescale_nldi_characteristics(vars, lookup_table, refactored_areas = split_divides)
># A tibble: 6 x 6
ID areasqkm_sum lengthkm_sum CAT_EWT_area_wtd CAT_BASIN_AREA_sum CAT_WILDFIRE_2011_min
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10012268 12.9 6.30 -25.5 12.9 0
2 10012979 4.78 3.31 -11.9 4.77 0
3 10024047 6.51 4.31 -24.6 6.52 0
4 10024048 1.44 4.31 -5.44 1.44 0
5 10024049 5.84 5.10 -10.8 5.84 0
6 10024050 7.56 5.10 -14.0 7.56 0
> I downloaded the refactored hydrofabric from ScienceBase and tested this function on the same subset of 100 ID's that @ajsekell used, and got the ~same values for the summed Re. run time, this still isn't super quick and rescaling 100 ID's takes ~3 minutes. We may be able to speed that up, but some initial profiling suggests most of the time is in downloading the NLDI characteristics. |
This is great! I'm almost wrapped up with the hydroloom refactor and will turn attention to this as soon as I can. |
Sounds good, thanks @dblodgett-usgs! |
Awesome! That's almost a minute and a half faster than my script for 100 IDs. I'll try to run yours with something like temperature to test a weighted average instead of sum but won't get to it until Monday at the earliest. Sort of on topic, for aggregating up to a POI hydrofabric, we'd need to aggregate to the refactor fabric first (with this code) and then aggregate again to the POI fabric right? Don't think we could easily go directly from NHD to a POI fabric... |
Cool -- progress. I refactored your example a bit and think I have a good base implementation now @lekoenig Note that I pulled out a stand alone function that runs the actual statistics and have all the data munging separate. I'll try and find some time to work this into nhdplusTools in the next few days, but this function layout should do the trick if you all want to try scaling this up. library(tidyverse)
library(nhdplusTools)
get_charachteristics_metadata <- function() {
u <- "https://prod-is-usgs-sb-prod-publish.s3.amazonaws.com/5669a79ee4b08895842a1d47/metadata_table.tsv"
f <- file.path(nhdplusTools_data_dir(), "metadata_table.tsv")
if(!file.exists(f)) resp <- httr::RETRY("GET", u, httr::write_disk(f))
read.delim(f, sep = "\t")
}
#' @param varname character vector of desired variables
#' @param ids numeric vector of identifiers from the specified reference fabric
#' @param reference_fabric (not used) will be used to allow future specification of alternate reference fabrics
#' @importFrom dplyr bind_rows filter select
#' @importFrom tidyselect everything
get_catchment_characteristics <- function(varname, ids, reference_fabric = "nhdplusv2"){
metadata <- get_charachteristics_metadata()
bind_rows(
lapply(varname, function(x) {
i <- metadata[metadata$ID == x,]
id <- gsub("https://www.sciencebase.gov/catalog/item/", "", i$datasetURL)
bucket <- arrow::s3_bucket("s3://prod-is-usgs-sb-prod-publish", anonymous = TRUE, region = "us-west-2")
end <- ifelse(grepl("CAT", x, ), "_cat.parquet",
ifelse(grep("TOT", x), "_tot_parquet",
"_acc.parquet"))
ds <- arrow::open_dataset(paste0("s3://anonymous@prod-is-usgs-sb-prod-publish/",
id, "/", id, end, "?region=us-west-2"))
sub <- filter(ds, COMID %in% ids)
att <- collect(sub)
att$characteristic_id <- x
att$percent_nodata <- 0
select(att, all_of(c(characteristic_id = "characteristic_id", comid = "COMID",
characteristic_value = x, percent_nodata = "percent_nodata")))
})
)
}
rescale_characteristics <- function(vars, cat_vars_scaled) {
# assign columns based on the desired summary operation
cols_sum <- vars$characteristic_id[vars$summary_statistic == "sum"]
cols_area_wtd_mean <- vars$characteristic_id[vars$summary_statistic == "area_weighted_mean"]
cols_length_wtd_mean <- vars$characteristic_id[vars$aggregation_statistic == "length_weighted_mean"]
cols_min <- vars$characteristic_id[vars$summary_statistic == "min"]
cols_max <- vars$characteristic_id[vars$summary_statistic == "max"]
# rescale NHDPlusv2 attributes to desired catchments
# !note that there are currently no adjustments made for the length of split flowlines
nhd_vars_rescaled <- cat_vars_scaled |>
group_by(ID) |>
summarize(
areasqkm_sum = sum(split_basin_areasqkm),
lengthkm_sum = sum(lengthkm),
across(any_of(cols_area_wtd_mean), weighted.mean, w = split_basin_areasqkm, na.rm = TRUE, .names = "{col}_area_wtd"),
across(any_of(cols_length_wtd_mean), weighted.mean, w = lengthkm, na.rm = TRUE, .names = "{col}_length_wtd"),
across(any_of(cols_sum), sum, na.rm = TRUE, .names = "{col}_sum"),
across(any_of(cols_min), min, na.rm = TRUE, .names = "{col}_min"),
across(any_of(cols_max), max, na.rm = TRUE, .names = "{col}_max")
) |>
ungroup()
return(nhd_vars_rescaled)
}
# The code below calls a function that grabs the catchment characteristics. I haven't defined
# the function here, but it is included in my previous comment within this GitHub issue.
# Define a function to rescale the catchment attributes using the tabular data.
# This function is a combination of Andrew Sekellick's workflow to accommodate
# split catchments and a function I previously included in this GitHub issue
# to aggregate NHDPlusv2 attributes to new fabrics.
#' @param vars data frame containing two columns: "characteristic_id" is a character
#' vector of desired variables to retrieve from /link{discover_nldi_characteristics};
#' "summary_statistic", is a character vector indicating which summary statistic
#' should be applied to rescale each characteristic. Accepted values include
#' "sum," "length_weighted_mean," "area_weighted_mean," "min," and "max."
#' @param lookup_table data frame containing at least three columns: "ID" is a
#' numeric vector of identifiers at the desired scale; "comid" is a numeric vector
#' of NHDPlusv2 identifiers; "member_comid" contains the formatted NHDPlusv2 COMIDs,
#' for example, in the case of split catchments. If catchments have not been split,
#' the columns "comid" and "member_comid" should be identical.
#' @param refactored_areas data frame containing columns "FEATUREID" and "areasqkm."
#' Used to retrieve adjusted catchment areas in the case of split catchments.
#'
#' @importFrom dplyr left_join rename_with mutate across group_by summarize ungroup distinct
#' @importFrom tidyr pivot_wider
#' @importFrom tidyselect starts_with any_of
#'
rescale_nldi_characteristics <- function(vars, lookup_table, refactored_areas){
# check that the inputs match what we are expecting
if(!all(c("characteristic_id", "summary_statistic") %in% names(vars))){
stop("Check that vars contains columns 'characteristic_id' and 'summary_statistic'")
}
summary_stat_acceptable <- vars$summary_statistic %in% c("min", "max", "sum",
"length_weighted_mean",
"area_weighted_mean")
if(!all(summary_stat_acceptable)){
stop("Check that all entries in vars$summary_statistic match accepted values")
}
if(!all(c("comid", "member_comid", "ID") %in% names(lookup_table))){
stop("Check that lookup_table contains columns 'comid,' 'member_comid,' and 'ID'")
}
# omit any duplicated rows in the lookup table
lookup_table <- distinct(lookup_table)
# download nldi characteristics for the requested comids and pivot to wide format
nhd_vars <- get_catchment_characteristics(varname = vars$characteristic_id,
ids = unique(lookup_table$comid))
nhd_vars_wide <- pivot_wider(nhd_vars, !percent_nodata,
names_from = characteristic_id,
values_from = characteristic_value)
# get comid catchment areas, adjusting area for catchments that have been "split"
nhd_subset <- get_catchment_areas(unique(lookup_table$member_comid), refactored_areas)
# combine the nldi characteristics with the catchment identifier and basin area
cat_vars_unscaled <- lookup_table |>
left_join(y = nhd_subset, by = c("member_comid","comid")) |>
left_join(y = nhd_vars_wide, by = "comid")
# rescale the nldi characteristics if needed (i.e., for split catchments)
if(!all(cat_vars_unscaled$comid == cat_vars_unscaled$member_comid)){
var_names <- unique(nhd_vars$characteristic_id)
cat_vars_scaled <- mutate(cat_vars_unscaled, across(any_of(var_names), ~.x*split_area_prop))
} else {
cat_vars_scaled <- cat_vars_unscaled
}
return(rescale_characteristics(vars, cat_vars_scaled))
}
# Define the function `get_catchment_areas` that's called in `rescale_nldi_characteristics`
#' @title Get basin area and flowline length for NHDPlusV2 COMIDs
#'
#' @description
#' Get catchment area and flowline length for NHDPlusv2 COMID(s) of interest.
#' If any COMID represents a split catchment, the split catchment areas are
#' returned, along with a column that represents the proportion of the original
#' NHDPlusv2 catchment area that is covered by the split catchment.
#'
#' @param comids integer vector or character vector containing NHDPlusV2 identifiers
#' @param refactored_areas data frame containing columns "FEATUREID" and
#' "areasqkm." Used to retrieve adjusted catchment areas in the case of split
#' catchments.
#'
#' @importFrom sf st_drop_geometry
#' @importFrom dplyr mutate select right_join left_join filter rename bind_rows
#'
get_catchment_areas <- function(comids, refactored_areas){
# format comids (omit suffix "." in the case of split catchments)
comids_fmt <- data.frame(member_comid = comids) |>
mutate(comid = as.integer(member_comid))
# fetch basin area for all comids
nhd_subset <- nhdplusTools::get_vaa(atts = c("comid", "areasqkm", "lengthkm")) |>
select(comid, areasqkm, lengthkm) |>
right_join(x = comids_fmt, by = "comid")
# handle "split" catchments (if applicable)
if(all(comids_fmt$member_comid == as.character(comids_fmt$comid))) {
nhd_subset_out <- nhd_subset |>
mutate(comid = as.integer(comid))
} else {
# create data frame containing the "split" catchments
comids_to_split <- filter(nhd_subset, grepl(".", member_comid, fixed = TRUE))
# get the basin area for the split catchments
split_areas <- refactored_areas |>
filter(FEATUREID %in% comids_to_split$member_comid) |>
rename(split_basin_areasqkm = areasqkm)
split_areas_fmt <- left_join(x = comids_to_split,
y = split_areas,
by = c("member_comid" = "FEATUREID"))
# create data frame containing the unsplit catchments
comids_no_split <- filter(nhd_subset, !grepl(".", member_comid, fixed = TRUE))
unsplit_areas_fmt <- mutate(comids_no_split, split_basin_areasqkm = areasqkm)
# bind split and unsplit catchments back together
nhd_subset_out <- bind_rows(unsplit_areas_fmt, split_areas_fmt) |>
mutate(comid = as.integer(comid),
split_area_prop = split_basin_areasqkm/areasqkm)
}
return(nhd_subset_out)
}
# Apply the functions above to an example.
# First, define the attributes of interest and how they should be
# summarized when aggregated or split.
vars <- data.frame(characteristic_id = c("CAT_EWT","CAT_WILDFIRE_2011", "CAT_BASIN_AREA"),
summary_statistic = c("area_weighted_mean", "min","sum"))
# Next, get the lookup table from the refactored hydrofabric that
# contains information about how NHDPlusv2 catchments are split.
file_name <- "refactor_02.gpkg"
file_path <- file.path(tempdir(), file_name)
gpkg_file <- sbtools::item_file_download("61fbfdced34e622189cb1b0a",
destinations = file_path,
names = file_name,
overwrite_file = TRUE)
lookup_table <- sf::st_read(gpkg_file, layer= "lookup_table") |>
dplyr::rename(comid = NHDPlusV2_COMID, member_comid = member_COMID, ID = reconciled_ID) |>
# subset lookup table for this example
dplyr::filter(ID %in% c(10012268, 10012979, 10024047, 10024048, 10024049, 10024050))
lookup_table
# > lookup_table
# comid member_comid ID LevelPathID
# 1 4146596 4146596 10012268 2132667
# 2 4147382 4147382 10012268 2132667
# 3 4147378 4147378 10012979 2130154
# 4 4147370 4147370 10012979 2130154
# 5 4147396 4147396.1 10024047 2123206
# 6 4147396 4147396.2 10024048 2123206
# 7 4147380 4147380.1 10024049 2123206
# 8 4147380 4147380.2 10024050 2123206
# We will also need the dataset "split_divides" that gives the basin
# area for the split catchments.
split_divides <- sf::st_read(gpkg_file, layer = "split_divides")
# rescale the characteristics to the refactored hydrofabric
rescale_nldi_characteristics(vars, lookup_table, refactored_areas = split_divides)
# A tibble: 6 x 6
# ID areasqkm_sum lengthkm_sum CAT_EWT_area_wtd CAT_BASIN_AREA_sum CAT_WILDFIRE_2011_min
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 10012268 12.9 6.30 -25.5 12.9 0
# 2 10012979 4.78 3.31 -11.9 4.77 0
# 3 10024047 6.51 4.31 -24.6 6.52 0
# 4 10024048 1.44 4.31 -5.44 1.44 0
# 5 10024049 5.84 5.10 -10.8 5.84 0
# 6 10024050 7.56 5.10 -14.0 7.56 0
# > |
I like these edits, @dblodgett-usgs, thanks! I haven't used the refactored hydrofabrics much yet - do you have thoughts on @ajsekell's question about the POI fabric? I think that if you had a similar lookup table and a table with the refactored areas you could go from NHD to a POI fabric with this code. Does that sound right? |
I think so, yes. My next step is to build some tests around this based on real data from a refactored fabric to make sure I'm really following what's going on. |
Sounds good, @dblodgett-usgs. We may start trying to scale this up using the functions in your comment above, and then replace with the |
I would suggest including a comment like that in the documentation. I'd like to recalculate something like NLCD directly from the rasters to a POI fabric and see if there's a significant difference between that and a NHD level compilation to NHGF POI script. |
If you install from the hydroloom branch,
you'll see I got two new catchment access functions incorporated. I am working on incorporating the rest of the functions above over the next couple days. See #304 (comment) for how those functions currently work. |
This is now complete in #324 -- @lekoenig can you kick the tires? remotes::install_github("doi-usgs/nhdplusTools@hydroloom")
??rescale_catchment_characteristics You can see what I added here: 04b195f Todo on this is to
I'll deal with the warnings but would welcome help on the other two. |
To test the rescaling functions I've tried to request characteristics that differ from the ones I was using to develop the code above. I've run into an edge case with # rescale NHDPlusv2 attributes by aggregating comid values to new "id" values
vars <- data.frame(characteristic_id = c('CAT_IMPV11',"CAT_SANDAVE", "CAT_BASIN_AREA"),
summary_statistic = c("area_weighted_mean", "min","sum"))
lookup_table <- data.frame(id = rep(10012268, 2),
comid = c(4146596, 4147382),
member_comid = c(4146596, 4147382))
rescale_catchment_characteristics(vars, lookup_table)
Error in `summarize()`:
i In argument: `across(...)`.
i In group 1: `id = 10012268`.
Caused by error in `across()`:
! Can't compute column `CAT_IMPV11_area_wtd`.
Caused by error in `x * w`:
! non-numeric argument to binary operator
Run `rlang::last_error()` to see where the error occurred.
Warning message:
Values from `characteristic_value` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(comid, characteristic_id) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
> The dplyr warnings are pretty helpful, actually. There are multiple rows with the same get_catchment_characteristics(varname = vars$characteristic_id,
ids = unique(lookup_table$comid))
characteristic_id comid characteristic_value percent_nodata
1: CAT_IMPV11 4146596 0.00 0
2: CAT_IMPV11 4147382 0.00 0
3: CAT_IMPV11 4146596 0.00 0
4: CAT_IMPV11 4147382 0.00 0
5: CAT_SANDAVE 4146596 24.65 0
6: CAT_SANDAVE 4147382 24.64 0
7: CAT_BASIN_AREA 4146596 0.96 0
8: CAT_BASIN_AREA 4147382 11.98 0
> Actually, I'm not sure about that. I'm testing river segments in the DRB and it looks like sometimes two different values get returned for vars <- data.frame(characteristic_id = "CAT_IMPV11", summary_statistic = "area_weighted_mean")
lookup_table <- data.frame(id = 1435, comid = 1748535, member_comid = 1748535)
get_catchment_characteristics(varname = vars$characteristic_id, ids = unique(lookup_table$comid))
characteristic_id comid characteristic_value percent_nodata
1: CAT_IMPV11 1748535 0.00 0
2: CAT_IMPV11 1748535 0.25 0 |
I tracked this down the a bad assumption. The CAT_IMPV11 data has duplicates for these comids. I looks like you found the same thing (now that I read your issue more closely). @mewieczo can you look at what's up with https://www.sciencebase.gov/catalog/item/570577fee4b0d4e2b7571d7b I'm seeing duplicate rows in the cat parquet file. Is that also in the csv or is something wrong with the parquet conversion? Why are we seeing different values for the same COMID? In the mean time, if you update from the hydroloom branch, this should work. remotes::install_github("doi-usgs/nhdplusTools@hydroloom") library(nhdplusTools)
vars <- data.frame(characteristic_id = c('CAT_IMPV11',"CAT_SANDAVE", "CAT_BASIN_AREA"),
summary_statistic = c("area_weighted_mean", "min","sum"))
lookup_table <- data.frame(id = rep(10012268, 2),
comid = c(4146596, 4147382),
member_comid = c(4146596, 4147382))
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 × 6
#> id areasqkm_sum lengthkm_sum CAT_IMPV11_area_wtd CAT_BASIN_ARE…¹ CAT_S…²
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10012268 12.9 6.30 0 12.9 24.6
#> # … with abbreviated variable names ¹CAT_BASIN_AREA_sum, ²CAT_SANDAVE_min Created on 2023-02-14 with reprex v2.0.2 |
Ah, OK - thanks for looking into that I've also found a few other potential edge cases that we could consider:
library(nhdplusTools)
lookup_table <- data.frame(id = rep(10012268, 2), comid = c(4146596, 4147382), member_comid = c(4146596, 4147382))
vars <- data.frame(characteristic_id = "CAT_sinuosity", summary_statistic = "length_weighted_mean")
rescale_catchment_characteristics(vars, lookup_table)
#> Error in UseMethod("pivot_wider") :
#> no applicable method for 'pivot_wider' applied to an object of class "NULL"
#> In addition: Warning message:
#> In get_catchment_characteristics(varname = vars$characteristic_id, :
#> Issue getting characteristics. Error was:
#> Problem while evaluating `all_of(...)`.Issue getting characteristics. Error was:
...
library(nhdplusTools)
vars <- data.frame(characteristic_id = "CAT_STRM_DENS", summary_statistic = "area_weighted_mean")
lookup_table <- data.frame(id = 1721, comid = 4188275, member_comid = 4188275)
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 x 4
#> id areasqkm_sum lengthkm_sum CAT_STRM_DENS_area_wtd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1721 0 0.048 NaN Created on 2023-02-15 by the reprex package (v2.0.1)
library(nhdplusTools)
vars <- data.frame(characteristic_id = "CAT_EWT", summary_statistic = "area_weighted_mean")
lookup_table <- data.frame(id = 4203,
comid = c(8074226,8075830,932040118,932040186,932040187),
member_comid = c(8074226,8075830,932040118,932040186,932040187))
get_catchment_characteristics(varname = unique(vars$characteristic_id), ids = lookup_table$comid)
#> characteristic_id comid characteristic_value percent_nodata
#> 1: CAT_EWT 8074226 0 0
#> 2: CAT_EWT 8075830 0 0
#> 3: CAT_EWT 932040118 0 0
#> 4: CAT_EWT 932040186 -9999 0
#> 5: CAT_EWT 932040187 -9999 0
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 x 4
#> id areasqkm_sum lengthkm_sum CAT_EWT_area_wtd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4203 22.8 10.3 -5057. Created on 2023-02-15 by the reprex package (v2.0.1)
library(nhdplusTools)
library(dplyr)
vars <- data.frame(characteristic_id = c("CAT_ELEV_MIN"),
summary_statistic = c("min"))
lookup_table <- data.frame(id = rep(10012268, 2), comid = c(4146596, 4147382), member_comid = c(4146596, 4147382))
# works ok for aggregation only
get_catchment_characteristics(varname = unique(vars$characteristic_id), ids = unique(lookup_table$comid))
#> characteristic_id comid characteristic_value percent_nodata
#> 1: CAT_ELEV_MIN 4146596 796.09 0
#> 2: CAT_ELEV_MIN 4147382 530.76 0
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 x 4
#> id areasqkm_sum lengthkm_sum CAT_ELEV_MIN_min
#> <dbl> <dbl> <dbl> <dbl>
#> 1 10012268 12.9 6.30 531.
# but for split catchments?
d <- readRDS(list.files(pattern = "rescale_data.rds", recursive = TRUE, full.names = TRUE))
# smallest value for cat_elev_min between comids belonging to id 10024047 or 10024048 is 487.49
get_catchment_characteristics(varname = unique(vars$characteristic_id), ids = unique(d$lookup_table$comid)) |>
left_join(d$lookup_table, by = "comid", multiple = "all")
#> characteristic_id comid characteristic_value percent_nodata member_comid id LevelPathID
#> 1: CAT_ELEV_MIN 4146596 796.09 0 4146596 10012268 2132667
#> 2: CAT_ELEV_MIN 4147370 616.66 0 4147370 10012979 2130154
#> 3: CAT_ELEV_MIN 4147378 574.88 0 4147378 10012979 2130154
#> 4: CAT_ELEV_MIN 4147380 530.24 0 4147380.1 10024049 2123206
#> 5: CAT_ELEV_MIN 4147380 530.24 0 4147380.2 10024050 2123206
#> 6: CAT_ELEV_MIN 4147382 530.76 0 4147382 10012268 2132667
#> 7: CAT_ELEV_MIN 4147396 487.49 0 4147396.1 10024047 2123206
#> 8: CAT_ELEV_MIN 4147396 487.49 0 4147396.2 10024048 2123206
# yet rescaled value of cat_elev_min for id 10024048 is 88.5
rescale_catchment_characteristics(vars, d$lookup_table, refactored_areas = d$split_divides)
#> # A tibble: 1 x 5
#> id areasqkm_sum lengthkm_sum CAT_ELEV_MIN_min
#> <dbl> <dbl> <dbl> <dbl>
#> 1 10012268 12.9 6.30 531.
#> 2 10012979 4.78 3.31 575.
#> 3 10024047 6.51 4.31 400.
#> 4 10024048 1.44 4.31 88.5
#> 5 10024049 5.84 5.10 231.
#> 6 10024050 7.56 5.10 299. Finally, it occurred to me that I've ignored the |
Thanks for the doco @lekoenig this is getting solid -- I'll keep this issue in my todo queue and look at your comments. If you think you know how to work in the percent_nodata, that would be awesome. I've kind of been overlooking it as well. |
CAT_sinuosity is in the "tot" file for some reason? The metadata does have the information needed for that and that case is now covered. I found a case where there is no "percent_nodata" (for ACC_STRM_DENS). I will just fill in 0 in that case. For a divide by 0, I think NaN is the best choice. I now do The issue you highlight in (4) just looks like a bug? I'm not totally following what's going on though. Do you want to dig further and get that fixed up or should I @lekoenig ? |
Reinstall from hydroloom to get the little fixes above.
|
Great, thanks for doing that. If The issue in (4) above is really more of a conceptual question than a bug. For split catchments, should we be rescaling all characteristics? In that example, if a user requests the "min" statistic for the attribute # rescale the nldi characteristics if needed (i.e., for split catchments)
if(!all(lookup_table$comid == lookup_table$member_comid)){
lookup_table <- mutate(lookup_table, across(any_of(var_names), ~.x*.data$split_area_prop))
} from # rescale the nldi characteristics if needed (i.e., for split catchments)
if(!all(lookup_table$comid == lookup_table$member_comid)){
lookup_table <- mutate(lookup_table, across(any_of(c(cols_area_wtd_mean, cols_sum)), ~.x*.data$split_area_prop))
} Does that make sense? |
I've suggested some changes in #331 to retain and rescale the |
@lekoenig -- based on what we have now, do you think this could be closed? |
I was just looking at this last night to see if we could close this loop. I think the only thing remaining is to resolve (4) described in #303 (comment). I'd appreciate if you or @ajsekell could take a look at the example above as a check on my logic, but either way I'll open a PR today with my proposed change. |
I think I'm following 4 in #303...yeah that's a problem. You're right the rescaling by proportion of area isn't necessarily needed for all variables (like min elevation, or maybe some of the climate variables too). Your suggestion I think makes sense and I also think some of these details will be important in the help/documentation. |
Following up on this, after the changes merged in #337 I think this issue can be closed. In the next couple months we will be working with this code and scaling up these workflows to generate rescaled attributes for CONUS. I'll open a new issue if we find any bugs or have other questions. |
Great. As I have time, I am working on getting hydroloom documented and ready for CRAN. Once it's out, nhdplusTools will start to migrate to that back end on the main branch. Until then, we'll just have to install from the hydroloom branch. |
See https://doi-usgs.github.io/nhdplusTools/news/index.html This is now on the main branch and on its way to CRAN |
Will document the idea later on, but we need to create a function to take attributes from one scale of catchments and split and / or aggregate them to another set of catchments.
The text was updated successfully, but these errors were encountered: