-
Notifications
You must be signed in to change notification settings - Fork 11
/
scrape_strava.Rmd
195 lines (142 loc) · 5.7 KB
/
scrape_strava.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: "Strava Data"
description: |
Article on how to effectively scrape and store Strava data using the `targets`
package
author:
- name: Julian During
date: "`r Sys.Date()`"
output:
html_document:
code_folding: hide
theme: cosmo
repository_url: https://github.com/duju211/pin_strava
creative_commons: CC BY
base_url: https://www.datannery.com/posts/strava-data/
---
I am an avid runner and cyclist. For the past couple of years,
I have recorded almost all my activities on some kind of GPS device.
I record my runs with a Garmin device and my bike rides with a Wahoo device,
and I synchronize both accounts on Strava.
I figured that it would be nice to directly access my data from my Strava
account.
In the following text, I will describe the progress to get Strava data into R,
process the data, and then create a visualization of activity routes.
```{r setup, include=FALSE}
source("libraries.R")
Sys.setenv("RSTUDIO_VERSION" = '1.4.1725')
```
You will need the following packages:
```{r, code=read_lines("libraries.R"), eval=FALSE}
```
# Data {.tabset}
The whole data pipeline is implemented with the help of the `targets` package.
You can learn more about the package and its functionalities
[here](https://docs.ropensci.org/targets/).
In order to reproduce the analysis, perform the following steps:
* Clone the repository:
[https://github.com/duju211/pin_strava](https://github.com/duju211/pin_strava)
* Install the packages listed in the `libraries.R` file
* Run the target pipeline by executing `targets::tar_make()` command
* Follow the instructions printed in the console
## Target Plan
We will go through the most important targets in detail.
## OAuth Dance from R
The Strava API requires an ‘OAuth dance’, described below.
### Create an OAuth Strava app
To get access to your Strava data from R, you must first create a Strava API.
The steps are documented on the
[Strava Developer site](https://developers.strava.com/docs/getting-started/).
While creating the app, you’ll have to give it a name. In my case,
I named it `r_api`.
After you have created your personal API, you can find your Client ID and
Client Secret variables in the
[Strava API settings](https://www.strava.com/settings/api).
Save the Client ID as STRAVA_KEY and the Client Secret as STRAVA_SECRET
in your R environment.
<aside>
You can edit your R environment by running `usethis::edit_r_environ()`,
saving the keys, and then restarting R.
</aside>
```
STRAVA_KEY=<Client ID>
STRAVA_SECRET=<Client Secret>
```
The information in `my_sig` can now be used to access Strava data.
Set the `cue_mode` of the target to ‘always’ so that the following API calls
are always executed with an up-to-date authorization token.
## Current authenticated user
Download information about the currently authenticated user.
When preprocessing the data, the columns
`r glue_collapse(tar_read(user_list_cols), sep = ", ", last = " and ")` need
special attention, because they can contain multiple entries and can be
interpreted as list columns.
```{r, code=read_lines("R/active_user.R"), eval=FALSE}
```
In the end there is a data frame with one row for the currently authenticated
user:
```{r, echo=FALSE}
tar_read(df_active_user)
```
## Activities
Load a data frame that gives an overview of all the activities from the data.
Because the total number of activities is unknown, use a while loop.
It will break the execution of the loop if there are no more activities to read.
```{r, code=read_lines("R/read_all_activities.R"), eval=FALSE}
```
The resulting data frame consists of one row per activity:
```{r, echo=FALSE}
tar_read(df_act_raw)
```
Make sure that all ID columns have a character format and improve the column
names.
```{r, code=read_lines("R/pre_process_act.R"), eval=FALSE}
```
Extract ids of all activities. Exclude activities which were recorded manually,
because they don't include additional data:
```{r, file="R/rel_act_ids.R"}
```
## Measurements
A ‘stream’ is a nested list (JSON format) with all available measurements of
the corresponding activity.
To get the available variables and turn the result into a data frame,
define a helper function `read_activity_stream`.
This function takes an ID of an activity and an authentication token,
which you created earlier.
Preprocess and unnest the data in this function.
The column `latlng` needs special attention,
because it contains latitude and longitude information.
Separate the two measurements before unnesting all list columns.
```{r, code=read_lines("R/read_activity_stream.R"), eval=FALSE}
```
Do this for every id and save the resulting data frames as `feather` file.
By doing so we can later effectively query the data.
<aside>
The name of the board is determined by the currently logged in user and will
have a different name, if you run the pipeline.
</aside>
# Visualisation
Visualize the final data by displaying the geospatial information in the data.
Join all the activities into one data frame. To do this, get the paths to all
the measurement files:
```{r, file="R/meas_paths.R"}
```
Insert them all into a duckdb and select relevant columns:
```{r meas_all, file="R/meas_all.R"}
```
```{r, echo=FALSE}
tar_read(df_meas_all)
```
In the final plot every facet is one activity.
Keep the rest of the plot as minimal as possible.
```{r, code=read_lines("R/vis_meas.R"), eval=FALSE}
```
```{r gg_strava, echo=FALSE, fig.height=11, fig.width=11}
tar_read(gg_meas)
```
And there it is:
All your Strava data in a few tidy data frames and a nice-looking plot.
Future updates to the data shouldn’t take too long, because only measurements
from new activities will be downloaded. With all your Strava data up to date,
there are a lot of appealing possibilities for further data analyses of your
fitness data.