I am an avid runner and cyclist. For the past couple of years, I have recorded almost all my activities on some kind of GPS device.
I record my runs with a Garmin device and my bike rides with a Wahoo device, and I synchronize both accounts with Strava. I figured that it would be nice to access my data directly from my Strava account.
In the following text, I will describe the process of getting Strava data into R, processing the data, and then creating a visualization of activity routes.
You will need the following packages:
library(tarchetypes)
library(conflicted)
library(tidyverse)
library(lubridate)
library(jsonlite)
library(targets)
library(httpuv)
library(duckdb)
library(httr2)
library(arrow)
library(pins)
library(glue)
library(fs)
conflict_prefer("filter", "dplyr")
The whole data pipeline is implemented with the help of the targets package. You can learn more about the package and its functionality in the targets documentation.
In order to reproduce the analysis, perform the following steps:
- Clone the repository: https://github.com/duju211/pin_strava
- Install the packages listed in the libraries.R file
- Run the target pipeline by executing the targets::tar_make() command
- Follow the instructions printed in the console
We will go through the most important targets in detail.
The Strava API requires an ‘OAuth dance’, described below.
To get access to your Strava data from R, you must first create a Strava API application. The steps are documented on the Strava Developer site. While creating the app, you’ll have to give it a name. In my case, I named it r_api.
After you have created your personal API application, you can find your Client ID and Client Secret in the Strava API settings. Save the Client ID as STRAVA_KEY and the Client Secret as STRAVA_SECRET in your R environment.
You can edit your R environment by running `usethis::edit_r_environ()`, saving the keys, and then restarting R.
STRAVA_KEY=<Client ID>
STRAVA_SECRET=<Client Secret>
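After the restart, it can help to confirm that R actually sees both variables before the pipeline tries to authenticate. The following is a minimal sketch in base R; the helper name check_strava_env is hypothetical and not part of the repository.

```r
# Hypothetical helper (not part of the original pipeline): stop with a clear
# message if one of the expected environment variables is missing or empty.
check_strava_env <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  absent <- vars[values == ""]
  if (length(absent) > 0) {
    stop(
      "Missing environment variable(s): ", paste(absent, collapse = ", "),
      ". Add them to .Renviron and restart R.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Calling check_strava_env() at the top of a script returns invisibly when both keys are set and fails early otherwise.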
The information in my_sig, the result of the OAuth flow, can now be used to access Strava data. Set the cue_mode of the target to ‘always’ (in targets this corresponds to tar_cue(mode = "always")) so that the following API calls are always executed with an up-to-date authorization token.
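The code for the OAuth step is not shown in this text, so here is one way it could look with httr2. This is a sketch under assumptions, not the repository's implementation: the function names strava_client and strava_token, the target name access_token, and the requested scope are all hypothetical, and the flow assumes that the app's authorization callback domain is set to localhost.

```r
library(httr2)
library(targets)

# Assumption: client built from the environment variables described above.
# Strava expects the client credentials in the request body.
strava_client <- function() {
  oauth_client(
    id = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET"),
    token_url = "https://www.strava.com/oauth/token",
    auth = "body",
    name = "r_api"
  )
}

# Opens a browser window for the authorization step and returns a token
# object; the bearer token itself is in its `access_token` element.
strava_token <- function(client = strava_client()) {
  oauth_flow_auth_code(
    client = client,
    auth_url = "https://www.strava.com/oauth/authorize",
    scope = "activity:read_all"
  )
}

# Hypothetical target definition: the cue forces the target to rerun on
# every tar_make(), so downstream API calls get a fresh token.
strava_targets <- list(
  tar_target(
    access_token,
    strava_token()$access_token,
    cue = tar_cue(mode = "always")
  )
)
```

Defining the target does not run the flow; the browser window only opens when the pipeline builds the target.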
Download information about the currently authenticated user. When preprocessing the data, the columns shoes, clubs and bikes need special attention, because they can contain multiple entries and can be interpreted as list columns.
active_user <- function(json_active_user, user_list_cols) {
  # Keep only entries that are not NULL and not one of the list-like fields
  json_active_user[
    map_lgl(json_active_user, negate(is.null))
    & map_lgl(names(json_active_user), ~ !.x %in% user_list_cols)] |>
    as_tibble()
}
In the end there is a data frame with one row for the currently authenticated user:
## # A tibble: 1 × 26
## id resource_state firstname lastname city state country sex premium
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 26845822 3 "Julian " During Baling… Bade… Germany M FALSE
## # … with 17 more variables: summit <lgl>, created_at <chr>, updated_at <chr>,
## # badge_type_id <int>, weight <dbl>, profile_medium <chr>, profile <chr>,
## # blocked <lgl>, can_follow <lgl>, follower_count <int>, friend_count <int>,
## # mutual_friend_count <int>, athlete_type <int>, date_preference <chr>,
## # measurement_preference <chr>, is_winback_via_upload <lgl>,
## # is_winback_via_view <lgl>
Load a data frame that gives an overview of all your activities. Because the total number of activities is unknown, use a while loop that requests one page after another and stops as soon as a page contains no more activities.
read_all_activities <- function(access_token, active_user_id) {
  act_vec <- vector(mode = "list")
  # Dummy non-empty data frame so that the loop body runs at least once
  df_act <- tibble(init = "init")
  i <- 1L
  # Request page after page until the API returns an empty page
  while (nrow(df_act) != 0) {
    req <- request("https://www.strava.com/api/v3/athlete/activities") |>
      req_auth_bearer_token(token = access_token) |>
      req_url_query(page = i)
    resp <- req_perform(req)
    resp_check_status(resp)
    df_act <- resp |>
      resp_body_json(simplifyVector = TRUE) |>
      as_tibble()
    if (nrow(df_act) != 0) {
      act_vec[[i]] <- df_act
    }
    i <- i + 1L
  }
  # Combine all pages and parse the start date
  act_vec |>
    bind_rows() |>
    mutate(
      start_date = ymd_hms(start_date),
      active_user_id = active_user_id)
}
The resulting data frame consists of one row per activity:
## # A tibble: 701 × 57
## resource_state athlete$id name distance moving_time elapsed_time
## <int> <int> <chr> <dbl> <int> <int>
## 1 2 26845822 "Volleyball 🏐 " 10199 1953 76946
## 2 2 26845822 "Cycling Club" 18284. 3860 5146
## 3 2 26845822 "Volleyball 🏐 " 10475. 2141 15766
## 4 2 26845822 "Abendradfahrt" 6041. 1147 4513
## 5 2 26845822 "TSG SSV" 56324. 9668 23624
## 6 2 26845822 "Volleyball 🏐 " 4246. 775 795
## 7 2 26845822 "Ballon d‘Alsace" 22693. 4740 8341
## 8 2 26845822 "Super Planche d… 49032. 10076 24068
## 9 2 26845822 "Volleyball 🏐 " 8503. 1408 15990
## 10 2 26845822 "Planche Prep 2" 32375. 5042 5242
## # … with 691 more rows, and 52 more variables: athlete$resource_state <int>,
## # total_elevation_gain <dbl>, type <chr>, sport_type <chr>,
## # workout_type <int>, id <dbl>, start_date <dttm>, start_date_local <chr>,
## # timezone <chr>, utc_offset <dbl>, location_city <lgl>,
## # location_state <lgl>, location_country <chr>, achievement_count <int>,
## # kudos_count <int>, comment_count <int>, athlete_count <int>,
## # photo_count <int>, map <df[,3]>, trainer <lgl>, commute <lgl>, …
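The stopping rule of read_all_activities can be tried out without touching the API. The following sketch in base R uses a hypothetical stand-in, fake_fetch_page, that serves five made-up activity IDs in pages of two; neither function is part of the original pipeline.

```r
# Hypothetical stand-in for the API: returns page `i` of a fixed data frame,
# or a zero-row data frame once the pages are exhausted.
fake_fetch_page <- function(i, per_page = 2L) {
  all_rows <- data.frame(id = as.character(1:5), stringsAsFactors = FALSE)
  start <- (i - 1L) * per_page + 1L
  if (start > nrow(all_rows)) {
    return(all_rows[0, , drop = FALSE])
  }
  end <- min(start + per_page - 1L, nrow(all_rows))
  all_rows[start:end, , drop = FALSE]
}

# Same stopping rule as in read_all_activities(): keep requesting pages
# until one comes back empty, then bind everything together.
read_all_pages <- function(fetch_page) {
  pages <- list()
  i <- 1L
  repeat {
    page <- fetch_page(i)
    if (nrow(page) == 0) break
    pages[[i]] <- page
    i <- i + 1L
  }
  do.call(rbind, pages)
}

df_all <- read_all_pages(fake_fetch_page)
nrow(df_all)
```

With five rows and two rows per page, the loop requests four pages: three with data and one empty page that ends it.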
Make sure that all ID columns have a character format and improve the column names.
pre_process_act <- function(df_act_raw, active_user_id, meas_board) {
  df_act_raw |>
    mutate(across(contains("id"), as.character))
}
Extract ids of all activities. Exclude activities which were recorded manually, because they don’t include additional data:
rel_act_ids <- function(df_act) {
  df_act %>%
    filter(!manual) %>%
    pull(id)
}
A ‘stream’ is a nested list (JSON format) with all available measurements of the corresponding activity.
To get the available variables and turn the result into a data frame, define a helper function read_activity_stream. This function takes an ID of an activity and an authentication token, which you created earlier.
Preprocess and unnest the data in this function. The column latlng needs special attention, because it contains latitude and longitude information. Separate the two measurements before unnesting all list columns.
read_activity_stream <- function(id, access_token) {
  # Build the request for the requested stream types of one activity
  req <- request("https://www.strava.com/api/v3/activities") |>
    req_auth_bearer_token(token = access_token) |>
    req_url_query(keys = str_glue(
      "distance,time,latlng,altitude,velocity_smooth,heartrate,cadence,",
      "watts,temp,moving,grade_smooth")) |>
    req_url_path_append(id) |>
    req_url_path_append("streams")
  resp <- req_perform(req)
  resp_check_status(resp)
  # One row per stream type; spread the types into columns
  df_stream_raw <- resp |>
    resp_body_json(simplifyVector = TRUE) |>
    as_tibble() |>
    mutate(id = id) %>%
    pivot_wider(names_from = type, values_from = data)
  # latlng holds a two-column matrix: split it into lat and lng
  if ("latlng" %in% colnames(df_stream_raw)) {
    df_stream <- df_stream_raw %>%
      mutate(
        lat = map(
          .x = latlng, .f = ~ .x[, 1]),
        lng = map(
          .x = latlng, .f = ~ .x[, 2])) %>%
      select(-latlng)
  } else {
    df_stream <- df_stream_raw
  }
  # Expand all list columns so that every measurement gets its own row
  df_stream %>%
    unnest(where(is_list)) %>%
    mutate(id = id)
}
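The latlng handling is easier to follow on a toy example. The sketch below builds a made-up data frame with the same shape as the pivoted stream (one row, list columns, latlng as a two-column matrix) and applies the same split-then-unnest steps. The values are invented for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-in for one pivoted stream: every measurement is a list column
# with a single element; `latlng` is a matrix with columns (lat, lng).
df_stream_raw <- tibble(
  id = "1",
  time = list(c(0L, 1L, 2L)),
  latlng = list(matrix(
    c(48.30, 48.31, 48.32, 8.85, 8.86, 8.87),
    ncol = 2
  ))
)

# Split the matrix into two list columns, then expand all list columns
# so that each measurement ends up in its own row.
df_stream <- df_stream_raw |>
  mutate(
    lat = map(latlng, \(m) m[, 1]),
    lng = map(latlng, \(m) m[, 2])
  ) |>
  select(-latlng) |>
  unnest(where(is.list))

df_stream
```

The result has three rows and the columns id, time, lat and lng. Splitting first matters because unnest cannot expand a matrix and two vectors side by side in a meaningful way.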
Do this for every ID and save each resulting data frame as a Feather file. This makes it possible to query the data efficiently later on.
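The saving step is not shown in this text. The package list and the board_name argument used later suggest that the pipeline stores the files through a pins board; the sketch below skips that and writes the files directly with arrow. The helper name save_stream and the directory layout are assumptions; the file name df_<id>.arrow is chosen to match the pattern that meas_paths searches for.

```r
library(arrow)

# Hypothetical helper: write the stream of one activity to
# `<dir>/df_<id>.arrow` and return the path of the written file.
save_stream <- function(df_stream, id, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0("df_", id, ".arrow"))
  write_feather(df_stream, path)
  path
}
```

Writing one file per activity means that an update only has to download and write the streams of activities that do not have a file yet.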
Visualize the final data by displaying the geospatial information in the data. Join all the activities into one data frame. To do this, get the paths to all the measurement files:
meas_paths <- function(board_name) {
  dir_ls(
    board_name, type = "file", regexp = "df_\\d.*\\.arrow$", recurse = TRUE)
}
Load them all into DuckDB and select the relevant columns:
meas_all <- function(paths_meas) {
  # Explicit schema so that every file is read with the same column types
  act_col_types <- schema(
    moving = boolean(), velocity_smooth = double(),
    grade_smooth = double(), distance = double(),
    altitude = double(), heartrate = int32(), time = int32(),
    lat = double(), lng = double(), cadence = int32(),
    watts = int32(), id = string())
  # The files are Feather (Arrow IPC) files, so read them with format = "arrow"
  open_dataset(paths_meas, format = "arrow", schema = act_col_types) %>%
    to_duckdb() %>%
    select(id, lat, lng) %>%
    filter(!is.na(lat) & !is.na(lng)) %>%
    collect()
}
## # A tibble: 2,396,364 × 3
## id lat lng
## <chr> <dbl> <dbl>
## 1 7536987223 48.3 8.85
## 2 7536987223 48.3 8.85
## 3 7536987223 48.3 8.85
## 4 7536987223 48.3 8.85
## 5 7536987223 48.3 8.85
## 6 7536987223 48.3 8.85
## 7 7536987223 48.3 8.85
## 8 7536987223 48.3 8.85
## 9 7536987223 48.3 8.85
## 10 7536987223 48.3 8.85
## # … with 2,396,354 more rows
In the final plot every facet is one activity. Keep the rest of the plot as minimal as possible.
vis_meas <- function(df_meas_pro) {
  df_meas_pro %>%
    ggplot(aes(x = lng, y = lat)) +
    geom_path() +
    facet_wrap(~ id, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = "bottom",
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank())
}
And there it is: All your Strava data in a few tidy data frames and a nice-looking plot. Future updates to the data shouldn’t take too long, because only measurements from new activities will be downloaded. With all your Strava data up to date, there are a lot of appealing possibilities for further data analyses of your fitness data.