The following code is for downloading audio features of Spotify’s “Top Hits of” playlists, specifically from the years 1970 to 2019. For example the playlist’s title for the year 1990 it would be “Top Hits of 1990”.
It uses the wrapper library Spotifyr to make http requests to the Spotify Web API.
The file test.R
is an R script for downloading just one playlist, in
order to have an idea of the data shape.
The Rmarkdown file spotifyHits1970-2019.Rmd
contains the necessary
code for getting the audio features of the Top Hits playlists from
1970-2019. Then it exports the data to the file
Spotify_TopHits_of_1970_2019.csv
.
The folder data
contains the following data:
Spotify_TopHits_of_1970_2019.csv
: audio features for the playlists Top Hits of, from 1970 to 2019, retrieved from Spotify Web API.test_hits_1970.csv
: audio features for the playlist Tip Hits of 1970, it’s the file used for exploring the data provided by the Spotify API.playlist_features.rds
: An RDS object for storing all the columns given by the Spotify Web API response, when making the requests for the playlists of 1970-2019.
This packages are required to run the code
library(spotifyr)
library(dplyr)
library(readr)
library(DataExplorer)
The Spotifyr’s function get_spotify_access_token()
gets the Access
Token with the Spotify Client ID and the Spotify Client Secret.
access.token <- get_spotify_access_token(
client_id = Sys.getenv("SPOTIFY_CLIENT_ID"),
client_secret = Sys.getenv("SPOTIFY_CLIENT_SECRET")
)
The playlists’ names are like “Top Hits of YYYY”, where YYYY refer to the year. For example for the year 2000 it would be “Top Hits of 2000”.
The years are in the range [1970, 2019], let’s make a vector that stores the names.
playlists.names <- paste("Top","Hits", "of", as.character(1970:2019))
In order to get the ID for each playlist, we can use the
search_spotify()
function to retrieve info about the playlists
search.response <- lapply(1:length(playlists.names), function(i) {
search_spotify(q = playlists.names[i], type = "playlist",
authorization = access.token,
limit = 1)
}
)
Let’s extract the IDs from the list response and then verify that the playlists are the ones created by Spotify
# get the IDs for the elements in the list using lapply
search.response.ID <- lapply(1:length(search.response), function(i){
# the i-th playlist tibble in the list
aux.playlist.info <- search.response[[i]]
# making sure it's the playlist created by Spotify
aux.playlist.info <- aux.playlist.info %>%
filter(name == playlists.names[i] & owner.display_name == "Spotify")
# get the actual playlist id from the 4th element of the i-th tibble
aux.playlist.info[[4]]
}
)
# convert the IDs to character
playlists.IDs <- as.character(unlist(search.response.ID))
head(playlists.IDs, 10)
## [1] "37i9dQZF1DWXQyLTHGuTIz" "37i9dQZF1DX43B4ApmA3Ee" "37i9dQZF1DXaQBa5hAMckp"
## [4] "37i9dQZF1DX2ExTChOnD3g" "37i9dQZF1DWVg6L7Yq13eC" "37i9dQZF1DX3TYyWu8Zk7P"
## [7] "37i9dQZF1DX6rhG68uMHxl" "37i9dQZF1DX26cozX10stk" "37i9dQZF1DX0fr2A59qlzT"
## [10] "37i9dQZF1DWZLO9LcfSmxX"
Now, let’s use lapply()
again to get the audio features tibbles in a
list
features.response <- lapply(1:length(playlists.IDs), function(i) {
get_playlist_audio_features("Spotify", playlists.IDs[i],
authorization = access.token
)
}
)
Concatenating the tibbles with bind_rows
playlists.features <- bind_rows(features.response)
The total instances in the dataset
nrow(playlists.features)
## [1] 5030
The numbers of songs in each playlist
# group by playlist and count
playlists.counts <- playlists.features %>% group_by(playlist_name) %>% summarise(n = n())
print(playlists.counts, n = nrow(playlists.counts))
## # A tibble: 51 × 2
## playlist_name n
## <chr> <int>
## 1 Top Hits of 1970 100
## 2 Top Hits of 1971 100
## 3 Top Hits of 1972 100
## 4 Top Hits of 1973 100
## 5 Top Hits of 1974 100
## 6 Top Hits of 1975 100
## 7 Top Hits of 1976 100
## 8 Top Hits of 1977 100
## 9 Top Hits of 1978 100
## 10 Top Hits of 1979 100
## 11 Top Hits of 1980 100
## 12 Top Hits of 1981 100
## 13 Top Hits of 1982 100
## 14 Top Hits of 1983 100
## 15 Top Hits of 1984 100
## 16 Top Hits of 1985 100
## 17 Top Hits of 1986 100
## 18 Top Hits of 1987 100
## 19 Top Hits of 1988 100
## 20 Top Hits of 1989 100
## 21 Top Hits of 1990 100
## 22 Top Hits of 1991 100
## 23 Top Hits of 1992 100
## 24 Top Hits of 1993 100
## 25 Top Hits of 1994 100
## 26 Top Hits of 1995 100
## 27 Top Hits of 1996 100
## 28 Top Hits of 1997 99
## 29 Top Hits of 1998 100
## 30 Top Hits of 1999 100
## 31 Top Hits of 2000 100
## 32 Top Hits of 2001 100
## 33 Top Hits of 2002 100
## 34 Top Hits of 2003 100
## 35 Top Hits of 2004 100
## 36 Top Hits of 2005 100
## 37 Top Hits of 2006 100
## 38 Top Hits of 2007 100
## 39 Top Hits of 2008 100
## 40 Top Hits of 2009 100
## 41 Top Hits of 2010 100
## 42 Top Hits of 2011 100
## 43 Top Hits of 2012 100
## 44 Top Hits of 2013 100
## 45 Top Hits of 2014 100
## 46 Top Hits of 2015 100
## 47 Top Hits of 2016 130
## 48 Top Hits of 2017 100
## 49 Top Hits of 2018 100
## 50 Top Hits of 2019 100
## 51 <NA> 1
We can see that there’s one missing value in the playlist_name
column.
Let’s locate it.
index.na <- which(is.na(playlists.features$playlist_name))
print(paste("The row of the missing playlist:",index.na))
## [1] "The row of the missing playlist: 2762"
print(playlists.features[(index.na-2):(index.na+2), c(2,6:16)])
## # A tibble: 5 × 12
## playlist_name danceability energy key loudness mode speechiness
## <chr> <dbl> <dbl> <int> <dbl> <int> <dbl>
## 1 Top Hits of 1997 0.662 0.878 1 -6.51 0 0.101
## 2 Top Hits of 1997 0.636 0.555 4 -8.12 0 0.0264
## 3 <NA> NA NA NA NA NA NA
## 4 Top Hits of 1997 0.698 0.985 7 -4.92 0 0.0426
## 5 Top Hits of 1997 0.745 0.972 6 -5.60 1 0.0316
## # … with 5 more variables: acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>
The missing row belongs to the 1997 playlist, but since all the audio features are missing let’s remove it.
playlists.features <- playlists.features[-index.na,]
paste("Number of rows left:",nrow(playlists.features))
## [1] "Number of rows left: 5029"
paste("missing values in playlist_name column:" ,
sum(is.na(playlists.features$playlist_name)))
## [1] "missing values in playlist_name column: 0"
Making a backup for the tibble in an .rda
file, which can be later
imported into an R session with readRDS()
function.
rds.output <- "./data/playlist_features.rds"
if (file.exists(rds.output)) {
print(paste("The file already exists:", rds.output))
} else {
saveRDS(playlists.features, file = rds.output)
}
## [1] "The file already exists: ./data/playlist_features.rds"
Now, there may be more missing values in other columns, but before checking that, let’s select the columns that contain the audio features fisrt.
It’s important to mention that not all the columns are for audio features, some are for IDs or metadata like links to the playlist’s cover images that could be useful for a dashboard.
The columns that will be extracted are
columns.features <- c("playlist_id", "playlist_name", "playlist_img",
"playlist_owner_name", "playlist_owner_id",
"danceability", "energy", "key", "loudness",
"mode", "speechiness", "acousticness", "instrumentalness",
"liveness", "valence", "tempo", "track.id",
"time_signature", "added_at", "track.artists",
"track.disc_number", "track.duration_ms", "track.explicit",
"track.name", "track.popularity",
"track.preview_url", "track.type",
"track.album.release_date",
"track.album.release_date_precision",
"track.album.total_tracks",
"track.album.type", "track.album.uri"
)
Selecting the 32 columns listed above and storing the resulting tibble
in topHits1970_2019
playlists.features.subset <- playlists.features[ ,columns.features]
dim(playlists.features.subset)
## [1] 5029 32
Now, using glimpse()
to see the data types in the
playlists.features.subset
tibble
glimpse(playlists.features.subset)
## Rows: 5,029
## Columns: 32
## $ playlist_id <chr> "37i9dQZF1DWXQyLTHGuTIz", "37i9dQZF…
## $ playlist_name <chr> "Top Hits of 1970", "Top Hits of 19…
## $ playlist_img <chr> "https://i.scdn.co/image/ab67706f00…
## $ playlist_owner_name <chr> "Spotify", "Spotify", "Spotify", "S…
## $ playlist_owner_id <chr> "spotify", "spotify", "spotify", "s…
## $ danceability <dbl> 0.443, 0.755, 0.474, 0.598, 0.611, …
## $ energy <dbl> 0.403, 0.876, 0.473, 0.797, 0.470, …
## $ key <int> 0, 0, 2, 7, 4, 8, 11, 8, 5, 5, 1, 8…
## $ loudness <dbl> -8.339, -8.867, -11.454, -6.793, -9…
## $ mode <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ speechiness <dbl> 0.0322, 0.0362, 0.0601, 0.0332, 0.0…
## $ acousticness <dbl> 0.631000, 0.357000, 0.545000, 0.042…
## $ instrumentalness <dbl> 0.00e+00, 5.17e-06, 1.25e-06, 4.07e…
## $ liveness <dbl> 0.1110, 0.2200, 0.0356, 0.0717, 0.5…
## $ valence <dbl> 0.410, 0.954, 0.561, 0.622, 0.970, …
## $ tempo <dbl> 143.462, 102.762, 77.583, 123.566, …
## $ track.id <chr> "7iN1s7xHE4ifF5povM6A48", "6QhXQOpy…
## $ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ added_at <chr> "2021-01-26T11:30:03Z", "2021-01-26…
## $ track.artists <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.disc_number <int> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ track.duration_ms <int> 243026, 174826, 199266, 147493, 134…
## $ track.explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name <chr> "Let It Be - Remastered 2009", "Cec…
## $ track.popularity <int> 78, 74, 36, 70, 70, 77, 46, 67, 76,…
## $ track.preview_url <chr> NA, "https://p.scdn.co/mp3-preview/…
## $ track.type <chr> "track", "track", "track", "track",…
## $ track.album.release_date <chr> "1970-05-08", "1970-01-26", "2014-0…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks <int> 12, 11, 87, 14, 12, 12, 120, 12, 12…
## $ track.album.type <chr> "album", "album", "album", "album",…
## $ track.album.uri <chr> "spotify:album:0jTGHV5xqHPvEcwL8f6Y…
The track.artists
column is a list of dataframes, where the artists’
names are other information are stored. Let’s extract those names and
add them to a new column for simplicity in later analysis.
topHits1970_2019 <- playlists.features.subset %>%
mutate(artists.name = lapply(1:nrow(playlists.features.subset), function(i){
playlists.features.subset$track.artists[[i]][[3]]
})
) %>%
mutate(artists.name = as.character(artists.name))
Let’s look for missing values one last time before exporting the tibble.
plot_missing(topHits1970_2019)
The result is that only the column track.preview_url
has missing
values, the column contains links which open a popup with a 30 sec
preview of the track. As long as the audio features have no missing
values then there’s no problem with this data.
Finally, let’s save the data in a .csv
file.
file.output <- "./data/Spotify_TopHits_of_1970_2019.csv"
if (file.exists(file.output)) {
print(paste("The file already exists:", file.output))
} else {
write_csv(topHits1970_2019, file = file.output)
}
## [1] "The file already exists: ./data/Spotify_TopHits_of_1970_2019.csv"
So, the file saved has the columns
glimpse(topHits1970_2019)
## Rows: 5,029
## Columns: 33
## $ playlist_id <chr> "37i9dQZF1DWXQyLTHGuTIz", "37i9dQZF…
## $ playlist_name <chr> "Top Hits of 1970", "Top Hits of 19…
## $ playlist_img <chr> "https://i.scdn.co/image/ab67706f00…
## $ playlist_owner_name <chr> "Spotify", "Spotify", "Spotify", "S…
## $ playlist_owner_id <chr> "spotify", "spotify", "spotify", "s…
## $ danceability <dbl> 0.443, 0.755, 0.474, 0.598, 0.611, …
## $ energy <dbl> 0.403, 0.876, 0.473, 0.797, 0.470, …
## $ key <int> 0, 0, 2, 7, 4, 8, 11, 8, 5, 5, 1, 8…
## $ loudness <dbl> -8.339, -8.867, -11.454, -6.793, -9…
## $ mode <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ speechiness <dbl> 0.0322, 0.0362, 0.0601, 0.0332, 0.0…
## $ acousticness <dbl> 0.631000, 0.357000, 0.545000, 0.042…
## $ instrumentalness <dbl> 0.00e+00, 5.17e-06, 1.25e-06, 4.07e…
## $ liveness <dbl> 0.1110, 0.2200, 0.0356, 0.0717, 0.5…
## $ valence <dbl> 0.410, 0.954, 0.561, 0.622, 0.970, …
## $ tempo <dbl> 143.462, 102.762, 77.583, 123.566, …
## $ track.id <chr> "7iN1s7xHE4ifF5povM6A48", "6QhXQOpy…
## $ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ added_at <chr> "2021-01-26T11:30:03Z", "2021-01-26…
## $ track.artists <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.disc_number <int> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ track.duration_ms <int> 243026, 174826, 199266, 147493, 134…
## $ track.explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name <chr> "Let It Be - Remastered 2009", "Cec…
## $ track.popularity <int> 78, 74, 36, 70, 70, 77, 46, 67, 76,…
## $ track.preview_url <chr> NA, "https://p.scdn.co/mp3-preview/…
## $ track.type <chr> "track", "track", "track", "track",…
## $ track.album.release_date <chr> "1970-05-08", "1970-01-26", "2014-0…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks <int> 12, 11, 87, 14, 12, 12, 120, 12, 12…
## $ track.album.type <chr> "album", "album", "album", "album",…
## $ track.album.uri <chr> "spotify:album:0jTGHV5xqHPvEcwL8f6Y…
## $ artists.name <chr> "The Beatles", "Simon & Garfunkel",…
Note: The column artists.name
contains strings like
c("artist 1", "artist 2", "artist 3")
when there are multiple artists.
When there’s only one artist it’s just a string like artist 1
.