how to use observation_count accurately to collect sample data #23
Thanks for making this package. I am trying to collect hourly data for PM, which I realize is intensive. I can get the data for smaller states but not for California, for example. In the reprex below, I first pull the annual summaries and see that there are 868,623 observations for California in 2017. However, I consistently fail to get a year's worth of data using aqs_sampledata_by_state. If I split the request into two six-month increments, it works, but I get fewer than 868,623 observations, so I'm obviously interpreting something wrong. Can you help me better understand how to use the observation count to predict whether a request for sample data will succeed or fail? Thanks!
library(RAQSAPI)
#> Use the function
#> RAQSAPI::aqs_credentials(username, key)
#> before using other RAQSAPI functions
#> See ?RAQSAPI::aqs_credentials for more information
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
aqs_email <- "wheeler.william@epa.gov"
aqs_key <- "***********"
aqs_credentials(username = aqs_email, key = aqs_key)
aqs_return_CA_pm_summary <- aqs_annualsummary_by_state(parameter = "88101",
    bdate = lubridate::mdy(paste0("01-01-2017")),
    edate = lubridate::mdy(paste0("12-31-2017")),
    stateFIPS = "06",
    return_header = FALSE)
aqs_return_CA_pm_summary |>
  filter(sample_duration_code == "1") |>
  summarise(total_obs = sum(observation_count))
#> total_obs
#> 1 868623
aqs_return_CA_pm_1 <- aqs_sampledata_by_state(parameter = "88101",
    bdate = lubridate::mdy(paste0("01-01-2017")),
    edate = lubridate::mdy(paste0("12-31-2017")),
    stateFIPS = "06",
    duration = "1",
    return_header = FALSE)
#> Waiting 4s for throttling delay ■■■■■■■■
#> Waiting 4s for throttling delay ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
#> Waiting 4s for throttling delay ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
#> Error in `purrr::pmap()`:
#> ℹ In index: 1.
#> Caused by error in `req_perform()`:
#> ! Failed to perform HTTP request.
#> Caused by error in `curl::curl_fetch_memory()`:
#> ! OpenSSL SSL_read: Connection reset by peer, errno 104
aqs_return_CA_pm_2 <- aqs_sampledata_by_state(parameter = "88101",
    bdate = lubridate::mdy(paste0("01-01-2017")),
    edate = lubridate::mdy(paste0("06-30-2017")),
    stateFIPS = "06",
    duration = "1",
    return_header = FALSE)
aqs_return_CA_pm_3 <- aqs_sampledata_by_state(parameter = "88101",
    bdate = lubridate::mdy(paste0("07-01-2017")),
    edate = lubridate::mdy(paste0("12-31-2017")),
    stateFIPS = "06",
    duration = "1",
    return_header = FALSE)
aqs_return_CA_pm_summary |>
  filter(sample_duration_code == "1") |>
  summarise(total_obs = sum(observation_count))
#> total_obs
#> 1 868623
nrow(aqs_return_CA_pm_2) + nrow(aqs_return_CA_pm_3)
#> [1] 657388
Created on 2024-09-11 with reprex v2.1.0
Replies: 1 comment
Hello, thanks for using RAQSAPI.

To address your first issue, trying to pull in all of California’s PM2.5 data for one year: unfortunately, my R package is simply a convenience wrapper around the EPA’s AQS DataMart API. It makes it easier for R users to retrieve data from the AQS DataMart API, and the AQS DataMart API has a limit. Although RAQSAPI has a simple throttling mechanism built in, it is the DataMart API server itself that imposes limits on the amount of data a user can retrieve within a short timeframe. California is a very large state with many ambient air monitors, so attempting to pull in a year’s worth of data from all PM2.5 monitors at once might put a lot of stress on the DataMart API. I’m glad that you’ve decided to split your requests into half-year chunks to reduce the strain on the server. Your first error was probably due to the server timing out after such a large request.

On your second issue, that the annual summary has an observation count higher than the sum of rows from the two half-year sample data API calls: that is to be expected, since the sampledata API call does not return data that has been excluded for various reasons, for example data that has been flagged during an exceptional event. So there may be times when those two numbers are not equal.

Thanks for your questions; feel free to email me if you have any others.
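If half-year requests are still too large for a given state, the same splitting idea can be taken further. The following is only a sketch in base R: `month_windows()` and `fetch_by_window()` are hypothetical helpers, not RAQSAPI functions, and the choice of calendar-month windows is an assumption, not a documented DataMart limit.

```r
# Hypothetical helper (not part of RAQSAPI): split a date range into
# calendar-month windows so that each request covers less data.
month_windows <- function(bdate, edate) {
  # First day of each month touched by the range.
  starts <- seq(as.Date(format(bdate, "%Y-%m-01")), edate, by = "month")
  # Each window ends the day before the next one starts; the last ends at edate.
  ends <- c(starts[-1] - 1, edate)
  # The first window begins at the requested start date, not the 1st.
  starts[1] <- bdate
  data.frame(bdate = starts, edate = ends)
}

# Hypothetical wrapper: call `fetch(bdate, edate)` once per window and
# stack the results. `fetch` is any function returning a data frame.
fetch_by_window <- function(windows, fetch) {
  pieces <- lapply(seq_len(nrow(windows)), function(i) {
    fetch(windows$bdate[i], windows$edate[i])
  })
  do.call(rbind, pieces)
}

windows <- month_windows(as.Date("2017-01-01"), as.Date("2017-12-31"))

# Usage with the call from the question (needs aqs_credentials() first):
# ca_pm <- fetch_by_window(windows, function(b, e) {
#   RAQSAPI::aqs_sampledata_by_state(parameter = "88101",
#                                    bdate = b, edate = e,
#                                    stateFIPS = "06", duration = "1",
#                                    return_header = FALSE)
# })
```

Comparing `nrow(ca_pm)` with the summed `observation_count` should then show the same kind of gap described above, since the sample data call leaves out excluded observations.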