diff --git a/404.html b/404.html index e0cbedd0..f6e2fdce 100644 --- a/404.html +++ b/404.html @@ -65,13 +65,16 @@ 2. Creation and coercion
vignettes/cb3tsibblesf.Rmd
+ cb3tsibblesf.Rmd
Analysts often have their own preferred spatial or temporal data
+structure, which they may wish to keep for spatio-temporal analysis. For
+example, the tbl_ts
class from the tsibble package (Wang, Cook, and Hyndman 2020) is commonly used
+in time series forecasting and similarly, the sf class (Pebesma 2018) is often used in spatial data
+science. In cubble, analysts can combine these two structures together
+by allowing the spatial component to also be an sf object and the
+temporal component to also be a tsibble object.
The key
and index
arguments in a cubble
+object correspond to their tsibble counterparts and can be safely
+omitted if the temporal component is a tsibble object,
+i.e. meteo_ts
in the example below. The tsibble class from
+the input will be carried over to the cubble object:
+ts_nested <- make_cubble(
+ spatial = stations, temporal = meteo_ts, coords = c(long, lat))
+(ts_long <- face_temporal(ts_nested))
+#> # cubble: key: id [3], index: date, long form, [tsibble]
+#> # temporal: 2020-01-01 -- 2020-01-10 [1D], no gaps
+#> # spatial: long [dbl], lat [dbl], elev [dbl], name [chr], wmo_id [dbl]
+#> id date prcp tmax tmin
+#> <chr> <date> <dbl> <dbl> <dbl>
+#> 1 ASN00086038 2020-01-01 0 26.8 11
+#> 2 ASN00086038 2020-01-02 0 26.3 12.2
+#> 3 ASN00086038 2020-01-03 0 34.5 12.7
+#> 4 ASN00086038 2020-01-04 0 29.3 18.8
+#> 5 ASN00086038 2020-01-05 18 16.1 12.5
+#> 6 ASN00086038 2020-01-06 104 17.5 11.1
+#> 7 ASN00086038 2020-01-07 14 20.7 12.1
+#> 8 ASN00086038 2020-01-08 0 26.4 16.4
+#> 9 ASN00086038 2020-01-09 0 33.1 17.4
+#> 10 ASN00086038 2020-01-10 0 34 19.6
+#> # ℹ 20 more rows
+class(ts_long)
+#> [1] "temporal_cubble_df" "cubble_df" "tbl_ts"
+#> [4] "tbl_df" "tbl" "data.frame"
The long cubble shows [tsibble]
in the header to
+indicate that the object also has the tbl_ts
class. Methods
+that apply to the tbl_ts
class can also be applied to the
+temporal cubble objects, for example, checking whether the data contain
+temporal gaps:
+ts_long %>% has_gaps()
+#> # A tibble: 3 × 2
+#> id .gaps
+#> <chr> <lgl>
+#> 1 ASN00086038 FALSE
+#> 2 ASN00086077 FALSE
+#> 3 ASN00086282 FALSE
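To make the idea of a temporal gap concrete, here is a minimal base R sketch of the check for a regular daily series. The helper `has_daily_gap()` is hypothetical and is not how tsibble implements `has_gaps()`; it only illustrates what "no gaps" in the header means for daily data.

```r
# Hypothetical helper, not tsibble's implementation: a regular daily series
# has a gap when two consecutive dates are not exactly one day apart.
has_daily_gap <- function(dates) {
  any(as.numeric(diff(sort(dates))) != 1)
}

# Same date range as the example above: 2020-01-01 to 2020-01-10
full <- seq(as.Date("2020-01-01"), as.Date("2020-01-10"), by = "day")
has_daily_gap(full)
#> [1] FALSE
has_daily_gap(full[-5])  # 2020-01-05 removed
#> [1] TRUE
```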
An existing cubble object can promote its temporal component to a
+tsibble object by applying make_temporal_tsibble()
. The
+promoted cubble object (ts_long2
) will be the same as the
+one created with a tsibble component initially
+(ts_long
):
+ts_long2 <- make_cubble(
+ stations, meteo,
+ key = id, index = date, coords = c(long, lat)) %>%
+ face_temporal() %>%
+ make_temporal_tsibble()
+identical(ts_long2, ts_long)
+#> [1] TRUE
Similarly, an sf object can be supplied as the spatial component to
+create a cubble object, with the coords
argument being
+omitted. This makes it possible to represent fixed areas with
+polygons or multipolygons, and the coords
argument will be
+calculated as the centroids of the (multi)polygons. The
+[sf]
printed in the cubble header indicates that the spatial
+component is also an sf object:
+(sf_nested <- make_cubble(
+ spatial = stations_sf, temporal = meteo,
+ key = id, index = date))
+#> # cubble: key: id [3], index: date, nested form, [sf]
+#> # spatial: [144.8321, -37.98, 145.0964, -37.6655], WGS 84
+#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
+#> id elev name wmo_id long lat geometry ts
+#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <POINT [°]> <list>
+#> 1 ASN00086038 78.4 essen… 95866 145. -37.7 (144.9066 -37.7276) <tibble>
+#> 2 ASN00086077 12.1 moora… 94870 145. -38.0 (145.0964 -37.98) <tibble>
+#> 3 ASN00086282 113. melbo… 94866 145. -37.7 (144.8321 -37.6655) <tibble>
+class(sf_nested)
+#> [1] "spatial_cubble_df" "cubble_df" "sf"
+#> [4] "tbl_df" "tbl" "data.frame"
The following code shows how to perform coordinate transformation
+with st_transform
on a cubble object:
+sf_nested %>% sf::st_transform(crs = "EPSG:3857")
+#> Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
+#> that
+#> # cubble: key: id [3], index: date, nested form, [sf]
+#> # spatial: [16122635.6225205, -4576600.8687746, 16152057.3639371,
+#> # -4532279.35567565], WGS 84
+#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
+#> id elev name wmo_id long lat geometry ts
+#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <POINT [°]> <list>
+#> 1 ASN00086038 78.4 essen… 95866 145. -37.7 (16130929 -4541016) <tibble>
+#> 2 ASN00086077 12.1 moora… 94870 145. -38.0 (16152057 -4576601) <tibble>
+#> 3 ASN00086282 113. melbo… 94866 145. -37.7 (16122636 -4532279) <tibble>
The counterpart to promote the spatial component in an existing
+cubble to be an sf object is make_spatial_sf()
:
+sf_nested <- make_cubble(
+ stations, meteo,
+ key = id, index = date, coords = c(long, lat)) %>%
+ make_spatial_sf()
+#> CRS missing: using OGC:CRS84 (WGS84) as default
+all.equal(sf_nested, sf_nested)
+#> [1] TRUE
One common type of task with spatio-temporal data is to match nearby -sites. For example, we may want to verify the location of an old list of -stations with current stations, or we may want to match the data from -different data sources. Some of these matches only concern the spatial -dimension, while others require temporal agreement.
-This vignette introduces how to spatially and spatio-temporally match -sites with the cubble structure with two examples. The first example -pairs traditional weather stations with nearby automated stations in New -South Wales, Australia. This exercise only concerns the matching based -on spherical distance between stations. The next example pairs the river -level recorded by the river gauges with the precipitation recorded by -the nearby weather station in Victoria, Australia.
-Bureau of Meteorology collects water -data from river gauges and this includes variables: electrical -conductivity, turbidity, water course discharge, water course level, and -water temperature. In particular, water level will interactive with -precipitation from the climate data since rainfall will raise the water -level in the river. Here is the location of available weather station -and water gauges in Victoria:
- -In cubble, match_sites()
houses
-match_spatial()
and match_temporal()
. For a
-spatial-only matching, you can use
-match_sites(temporal_matching = FALSE)
or simply
-match_spatial()
.
Any matching requires two datasets in the cubble and we call them
-major
and minor
. Major and minor dataset
-differs from how distance is calculated. Spatial matching calculates the
-spherical distance using the Vincenty formula and this distance is
-calculated from each site in the major
dataset is
-to every site in the minor
dataset.
-res_sp <- match_spatial(climate_vic, river,
- spatial_n_group = 10, return_cubble = TRUE)
-(res_sp <- res_sp[-c(5, 8)] %>% bind_rows())
-#> # cubble: key: id [16], index: date, nested form, [sf]
-#> # spatial: [144.5203, -38.144913, 148.4667, -36.128657], WGS 84
-#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
-#> id long lat elev name wmo_id ts type geometry
-#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list> <chr> <POINT [°]>
-#> 1 ASN0… 145. -37.0 290 rede… 94859 <tibble> clim… (144.5203 -37.0194)
-#> 2 4062… 145. -37.0 NA CAMP… NA <tibble> rive… (144.5403 -37.01512)
-#> 3 ASN0… 148. -37.7 62.7 orbo… 95918 <tibble> clim… (148.4667 -37.6922)
-#> 4 2222… 148. -37.7 NA SNOW… NA <tibble> rive… (148.451 -37.70739)
-#> 5 ASN0… 147. -38.1 4.6 east… 94907 <tibble> clim… (147.1322 -38.1156)
-#> 6 2260… 147. -38.1 NA LA T… NA <tibble> rive… (147.1278 -38.14491)
-#> 7 ASN0… 145. -36.2 96 echu… 94861 <tibble> clim… (144.7642 -36.1647)
-#> 8 4067… 145. -36.1 NA DEAK… NA <tibble> rive… (144.7693 -36.12866)
-#> 9 ASN0… 146. -36.8 502 stra… 95843 <tibble> clim… (145.7308 -36.8472)
-#> 10 4052… 146. -36.9 NA SEVE… NA <tibble> rive… (145.6828 -36.88701)
-#> 11 ASN0… 145. -37.7 78.4 esse… 95866 <tibble> clim… (144.9066 -37.7276)
-#> 12 2302… 145. -37.7 NA MARI… NA <tibble> rive… (144.8365 -37.72771)
-#> 13 ASN0… 148. -37.9 49.4 bair… 94912 <tibble> clim… (147.5669 -37.8817)
-#> 14 2242… 148. -37.8 NA MITC… NA <tibble> rive… (147.5722 -37.815)
-#> 15 ASN0… 145. -36.3 105 kyab… 95833 <tibble> clim… (145.0638 -36.335)
-#> 16 4067… 145. -36.3 NA MOSQ… NA <tibble> rive… (144.9809 -36.32871)
-#> # ℹ 2 more variables: group <int>, dist [m]
Once the distance is calculated, three arguments are available to -refine the matching results:
-spatial_n_keep
: Number of match each major site
-receive
spatial_dist_max
: maximum distance allowed for a pair
-of matching
spatial_single_match
: Whether each minor site can only
-be matched to one major site
The order that these three arguments applied will slightly affect the
-results in cubble
. spatial_n_keep
, default to
-1, is first applied to keep n
site(s) for each major site,
-spaital_dist_max
, default to 10, is then applied to filter
-out the pairs with distance larger than this maximum distance.
-spatial_single_match
is lastly applied to resolve the
-scenario where site a
(minor) is the closest match for both
-site A
and B
(major) with distance 5km and
-8km. If spatial_single_match = TRUE
, a
will
-only be matched to the major site with the smaller distance, that is,
-site A
here.
Here we provide more details on how temporal matching works in
-cubble
. Suppose two locations have been matched spatially
-and temporal matching will be conducted on variable A
and
-a
in the plot below: .
We first find the n
peaks in each series (3 peaks here).
-A variable needs to be specified in temporal_independent
-for construct an interval. Here we pick variable A
and
-construct an interval with a default length of 5. The peaks in variable
-a
are then tested against whether they fall into the any of
-the intervals constructed from A
. In this illustration,
-there are 2 matches for these two variable The available tuning
-parameter in temporal matches are:
temporal_n_highest
: the number of peak used - 3 in the
-example above
temporal_window
: the length of the interval - 5 in the
-example above
temporal_min_match
: the minimum number of matched peak
-for a valid matched pair. To return all the pairs of the match, set this
-parameter to 0.
In the river level and precipitation example,
-Water_course_level
in river
will be matched to
-prcp
in climate
. This can be specified in
-temporal_by
, an analogue to the by
syntax in
-join
. The goal in this example is to see if precipitation
-will be reflected by the water level in the river and this puts
-precipitation prcp
, as the independent. Given there is one
-year worth of data, the number of peak (temporal_n_highest
)
-to consider is slightly raised from a default 20 to 30 and
-temporal_min_match
is raised accordingly.
-res_tm <- res_sp %>%
- match_temporal(
- data_id = type,
- match_id = group,
- return_cubble = TRUE,
- temporal_by = c("prcp" = "Water_course_level"))
-res_tm[[1]]
-#> # cubble: key: id [2], index: date, nested form, [sf]
-#> # spatial: [144.5203, -37.0194, 144.540295, -37.015122], WGS 84
-#> # temporal: date [date], matched [dbl]
-#> id long lat elev name wmo_id type geometry group
-#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <POINT [°]> <int>
-#> 1 ASN00088… 145. -37.0 290 rede… 94859 clim… (144.5203 -37.0194) 1
-#> 2 406213 145. -37.0 NA CAMP… NA rive… (144.5403 -37.01512) 1
-#> # ℹ 3 more variables: dist [m], ts <list>, match_res <dbl>
The output from temporal matching is also a cubble with
-n_match
for the number of matched temporal peaks (on top of
-the dist
and group
from spatial
-matching):
We can look at the matched pair on the map:
-
-vic_map <- ozmaps::abs_ste |>
- filter(NAME == "Victoria")
-
-ggplot() +
- geom_sf(data = vic_map, fill = "grey95", color = "white") +
- geom_point(data = dplyr::bind_rows(river, climate_vic),
- aes(x = long, y = lat, color = type), alpha = 0.2, fill = 0.2) +
- geom_point(data = res_tm %>% as_tibble() ,
- aes(x = long, y = lat, color = type)) +
- ggrepel::geom_label_repel(
- data = res_tm |> filter(type == "river") %>% as_tibble(),
- aes(x = long, y = lat, label = group)) +
- scale_color_brewer(palette = "Dark2") +
- theme_void() +
- ggplot2::theme(legend.position = "bottom",
- legend.text = element_text(size = 15),
- legend.title = element_text(size = 15)) +
- ggplot2::labs(x = "Longitude", y = "Latitude") +
- guides(color = guide_legend(override.aes = list(size=5)))
or to look at the series:
-
-res_tm_long <- res_tm %>%
- face_temporal() %>%
- unfold(group, type) %>%
- rename(prcp = matched) %>%
- group_by(group, type) %>%
- mutate(prcp = (prcp - min(prcp, na.rm = TRUE))/
- (max(prcp, na.rm = TRUE) - min(prcp, na.rm = TRUE)))
-
-res_tm_long %>%
- ggplot(aes(x = date, y = prcp, group = type,color = type)) +
- geom_line() +
- facet_wrap(vars(group)) +
- scale_color_brewer(palette = "Dark2", guide = "none") +
- theme_bw() +
- labs(x= "date") +
- scale_x_date(date_labels = "%b") +
- labs(x = "Week", y = "Precipitation/ water level")
There are four pairs of matches - all locates in the middle Victoria -and we can observe concurrent increase of precipitation and water level -(precipitation and water level have been standardised between 0 and 1 to -be displayed on the same scale).
-One common type of task with spatio-temporal data is to match nearby +sites. For example, we may want to verify the location of an old list of +stations with current stations, or we may want to match the data from +different data sources. This vignette introduces the spatial and +temporal matching in cubble with an example on matching river level data +with precipitation in Victoria, Australia.
+In cubble, data can be matched spatially or temporally with
+match_spatial()
and match_temporal()
. The
+function match_spatial()
calculates the spatial distance
+between observations in two cubbles. Different distance measures are available
+for projected and unprojected coordinate reference systems. Analysts can
+limit the number of matched groups in the output with argument
+spatial_n_group
(by default 4 groups) and the number of
+matches per group with argument spatial_n_each
(by default
+1, that is, one-to-one matching). The syntax to use
+match_spatial()
is
match_spatial(<cubble_obj1>, <cubble_obj2>, ...)
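As a rough illustration of what spatial matching on unprojected coordinates involves, the base R sketch below pairs each site in one table with its nearest site in another using the haversine great-circle distance. This is a swapped-in technique for illustration only: `haversine_m()` and `nearest_pairs()` are hypothetical helpers, and the distance calculation inside `match_spatial()` may differ. The coordinates are copied from the printed output in this vignette.

```r
# Great-circle (haversine) distance in metres between points given in degrees.
# Illustrative only: cubble's own distance calculation may differ.
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: for each site in `major`, find the closest site in
# `minor`, then keep the `n_group` closest pairs overall.
nearest_pairs <- function(major, minor, n_group = 4) {
  pairs <- lapply(seq_len(nrow(major)), function(i) {
    d <- haversine_m(major$long[i], major$lat[i], minor$long, minor$lat)
    j <- which.min(d)
    data.frame(from = major$id[i], to = minor$id[j], dist = d[j])
  })
  out <- do.call(rbind, pairs)
  out <- out[order(out$dist), , drop = FALSE]
  out <- head(out, n_group)
  out$group <- seq_len(nrow(out))
  rownames(out) <- NULL
  out
}

# Two weather stations and two river gauges from the Victoria example
weather <- data.frame(
  id = c("ASN00088051", "ASN00086038"),
  long = c(144.5203, 144.9066),
  lat = c(-37.0194, -37.7276)
)
gauges <- data.frame(
  id = c("406213", "230200"),
  long = c(144.5403, 144.8365),
  lat = c(-37.01512, -37.72771)
)
toy_pairs <- nearest_pairs(weather, gauges)
```

On this toy input the two pairs should come out roughly 1.8 km and 6.2 km apart, which is in line with the distances `match_spatial()` reports for the same stations.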
+The function match_temporal()
calculates the time series
+similarity between spatially matched groups. Two identifiers need to be
+specified: the variable that separates each matched group
+(match_id
) and the variable that separates the two data sources
+(data_id
). The argument temporal_by
uses the
+by
syntax from dplyr *_join
to specify the
+temporal matching variable.
The similarity score between two time series in the spatially matched
+group is calculated by a matching function, which analysts can
+customise. The matching function should take two time series in a list
+and output a single numerical score, which allows for interfacing with
+existing implementations of time series distance calculations. By default,
+cubble implements a simple peak matching algorithm
+(match_peak
) that counts the number of peaks in two time
+series that fall within a specified temporal window. The syntax to use
+match_temporal()
is
match_temporal(
+ <cubble_obj_from_match_spatial>,
+ data_id = , match_id = ,
+ temporal_by = c("..." = "...")
+)
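The contract for a matching function described above (a list of two series in, one numeric score out) can be sketched in base R as follows. `count_peak_matches()` is a hypothetical stand-in that counts peaks in the second series falling within a window after a peak in the first; it assumes both series share the same regular time index, and cubble's own `match_peak` may differ in its details and defaults.

```r
# Hypothetical stand-in for a matching function: takes a list of two numeric
# series observed on the same time index and returns one numeric score.
count_peak_matches <- function(series, n_highest = 3, window = 2) {
  x <- series[[1]]
  y <- series[[2]]
  # positions of the n largest values in each series
  peaks_x <- order(x, decreasing = TRUE)[seq_len(n_highest)]
  peaks_y <- order(y, decreasing = TRUE)[seq_len(n_highest)]
  # a peak in y counts when it falls within `window` steps after a peak in x
  lag <- outer(peaks_y, peaks_x, "-")
  sum(apply(lag >= 0 & lag <= window, 1, any))
}

rain  <- c(0, 9, 0, 0, 0, 7, 0, 0, 0, 0, 8, 0)  # peaks at positions 2, 6, 11
level <- c(1, 1, 6, 1, 1, 1, 5, 1, 1, 4, 1, 1)  # peaks at positions 3, 7, 10
count_peak_matches(list(rain, level))
#> [1] 2
```

The peaks in `level` at positions 3 and 7 follow rain peaks by one step and are counted; the peak at position 10 comes before the rain peak at 11 and is not.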
+Bureau of Meteorology collects water +data from river gauges and this includes variables: electrical +conductivity, turbidity, water course discharge, water course level, and +water temperature. In particular, water level will interact with +precipitation from the climate data since rainfall will raise the water +level in the river. Here is the location of available weather stations +and water gauges in Victoria:
+ +Both climate_vic
and river
are cubble
+objects and we can get a summary of the 10 closest pairs:
+(res_sp <- match_spatial(climate_vic, river, spatial_n_group = 10))
+#> # A tibble: 10 × 4
+#> from to dist group
+#> <chr> <chr> [m] <int>
+#> 1 ASN00088051 406213 1838. 1
+#> 2 ASN00084145 222201 2185. 2
+#> 3 ASN00085072 226027 3282. 3
+#> 4 ASN00080015 406704 4034. 4
+#> 5 ASN00085298 226027 4207. 5
+#> 6 ASN00082042 405234 6153. 6
+#> 7 ASN00086038 230200 6167. 7
+#> 8 ASN00086282 230200 6928. 8
+#> 9 ASN00085279 224217 7431. 9
+#> 10 ASN00080091 406756 7460. 10
The result can also be returned as cubble objects with argument
+return_cubble = TRUE
. The output is a list where each
+element is a paired cubble object and you may consider combining all the
+results into a single cubble with bind_rows()
. Care needs
+to be taken when a site is close to two stations since, by
+construction, cubble requires unique rows in the nested form. From the
+summary table above, river station 226027
is matched to
+more than one weather station: ASN00085072
(group 3) and
+ASN00085298
(group 5). (Similarly, river station
+230200
is matched in groups 7 and 8). One can either
+deselect one pair before binding the results, or take the list and work
+with the purrr::map
syntax:
+res_sp <- match_spatial(climate_vic, river, spatial_n_group = 10, return_cubble = TRUE)
+str(res_sp, max.level = 0)
+#> List of 10
+res_sp[[1]]
+#> # cubble: key: id [2], index: date, nested form, [sf]
+#> # spatial: [144.5203, -37.0194, 144.540295, -37.015122], WGS 84
+#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
+#> id long lat elev name wmo_id ts type geometry
+#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list> <chr> <POINT [°]>
+#> 1 ASN00… 145. -37.0 290 rede… 94859 <tibble> clim… (144.5203 -37.0194)
+#> 2 406213 145. -37.0 NA CAMP… NA <tibble> river (144.5403 -37.01512)
+#> # ℹ 2 more variables: group <int>, dist [m]
+(res_sp <- res_sp[-c(5, 8)] %>% bind_rows())
+#> # cubble: key: id [16], index: date, nested form, [sf]
+#> # spatial: [144.5203, -38.144913, 148.4667, -36.128657], WGS 84
+#> # temporal: date [date], prcp [dbl], tmax [dbl], tmin [dbl]
+#> id long lat elev name wmo_id ts type geometry
+#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <list> <chr> <POINT [°]>
+#> 1 ASN0… 145. -37.0 290 rede… 94859 <tibble> clim… (144.5203 -37.0194)
+#> 2 4062… 145. -37.0 NA CAMP… NA <tibble> river (144.5403 -37.01512)
+#> 3 ASN0… 148. -37.7 62.7 orbo… 95918 <tibble> clim… (148.4667 -37.6922)
+#> 4 2222… 148. -37.7 NA SNOW… NA <tibble> river (148.451 -37.70739)
+#> 5 ASN0… 147. -38.1 4.6 east… 94907 <tibble> clim… (147.1322 -38.1156)
+#> 6 2260… 147. -38.1 NA LA T… NA <tibble> river (147.1278 -38.14491)
+#> 7 ASN0… 145. -36.2 96 echu… 94861 <tibble> clim… (144.7642 -36.1647)
+#> 8 4067… 145. -36.1 NA DEAK… NA <tibble> river (144.7693 -36.12866)
+#> 9 ASN0… 146. -36.8 502 stra… 95843 <tibble> clim… (145.7308 -36.8472)
+#> 10 4052… 146. -36.9 NA SEVE… NA <tibble> river (145.6828 -36.88701)
+#> 11 ASN0… 145. -37.7 78.4 esse… 95866 <tibble> clim… (144.9066 -37.7276)
+#> 12 2302… 145. -37.7 NA MARI… NA <tibble> river (144.8365 -37.72771)
+#> 13 ASN0… 148. -37.9 49.4 bair… 94912 <tibble> clim… (147.5669 -37.8817)
+#> 14 2242… 148. -37.8 NA MITC… NA <tibble> river (147.5722 -37.815)
+#> 15 ASN0… 145. -36.3 105 kyab… 95833 <tibble> clim… (145.0638 -36.335)
+#> 16 4067… 145. -36.3 NA MOSQ… NA <tibble> river (144.9809 -36.32871)
+#> # ℹ 2 more variables: group <int>, dist [m]
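The groups to drop before binding do not have to be read off by eye. A small base R sketch, using the summary table printed earlier (values copied by hand into a plain data frame named `sp_pairs`, which is an illustrative name), finds the groups in which a river station repeats:

```r
# Summary of the 10 closest pairs, copied from the printed table above
sp_pairs <- data.frame(
  from = c("ASN00088051", "ASN00084145", "ASN00085072", "ASN00080015",
           "ASN00085298", "ASN00082042", "ASN00086038", "ASN00086282",
           "ASN00085279", "ASN00080091"),
  to = c("406213", "222201", "226027", "406704", "226027",
         "405234", "230200", "230200", "224217", "406756"),
  dist = c(1838, 2185, 3282, 4034, 4207, 6153, 6167, 6928, 7431, 7460),
  group = 1:10
)

# Rows are sorted by distance, so duplicated() flags the farther repeat
drop_groups <- sp_pairs$group[duplicated(sp_pairs$to)]
drop_groups
#> [1] 5 8
```

These are the two list elements removed with `res_sp[-c(5, 8)]` before calling `bind_rows()`.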
For temporal matching, the variable water level
+(Water_course_level
) from the river data will be matched to
+precipitation (prcp
) in the weather station data. The
+variable identifying each matched group is group
and the
+variable identifying the two datasets is type
:
+(res_tm <- res_sp %>%
+ match_temporal(
+ data_id = type, match_id = group,
+ temporal_by = c("prcp" = "Water_course_level")))
+#> # A tibble: 8 × 2
+#> group match_res
+#> <int> <dbl>
+#> 1 1 30
+#> 2 2 5
+#> 3 3 14
+#> 4 4 20
+#> 5 6 23
+#> 6 7 26
+#> 7 9 21
+#> 8 10 14
Similarly, the cubble output can be returned with the argument
+return_cubble = TRUE
. Here we select the four pairs with
+the highest number of matching peaks:
+res_tm <- res_sp %>%
+ match_temporal(
+ data_id = type, match_id = group,
+ temporal_by = c("prcp" = "Water_course_level"),
+ return_cubble = TRUE)
+(res_tm <- res_tm %>% bind_rows() %>% filter(group %in% c(1, 7, 6, 9)))
+#> # cubble: key: id [8], index: date, nested form, [sf]
+#> # spatial: [144.5203, -37.8817, 147.572223, -36.8472], WGS 84
+#> # temporal: date [date], matched [dbl]
+#> id long lat elev name wmo_id type geometry group
+#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <POINT [°]> <int>
+#> 1 ASN00088… 145. -37.0 290 rede… 94859 clim… (144.5203 -37.0194) 1
+#> 2 406213 145. -37.0 NA CAMP… NA river (144.5403 -37.01512) 1
+#> 3 ASN00082… 146. -36.8 502 stra… 95843 clim… (145.7308 -36.8472) 6
+#> 4 405234 146. -36.9 NA SEVE… NA river (145.6828 -36.88701) 6
+#> 5 ASN00086… 145. -37.7 78.4 esse… 95866 clim… (144.9066 -37.7276) 7
+#> 6 230200 145. -37.7 NA MARI… NA river (144.8365 -37.72771) 7
+#> 7 ASN00085… 148. -37.9 49.4 bair… 94912 clim… (147.5669 -37.8817) 9
+#> 8 224217 148. -37.8 NA MITC… NA river (147.5722 -37.815) 9
+#> # ℹ 3 more variables: dist [m], ts <list>, match_res <dbl>
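The four groups used in the filter can be derived from the summary scores instead of being typed in. A minimal base R sketch, with the `match_res` values copied from the summary table above into an illustrative data frame named `scores`:

```r
# Peak-match counts copied from the match_temporal() summary above
scores <- data.frame(
  group = c(1, 2, 3, 4, 6, 7, 9, 10),
  match_res = c(30, 5, 14, 20, 23, 26, 21, 14)
)

# Sort by score, highest first, and keep the top four groups
top4 <- head(scores[order(scores$match_res, decreasing = TRUE), ], 4)
top4$group
#> [1] 1 7 6 9
```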
And then we can visualise them in space or across time:
+ +