---
title: "DATA 607 TidyVerse Vignette Create Assignment"
author: "Kevin Kirby"
date: "`r Sys.Date()`"
output:
html_document: default
pdf_document: default
subtitle: Text Analysis and Animation with tidytext and gganimate
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
In this vignette, I'll show how to combine `tidytext` for text mining with `gganimate` for animated visuals to create dynamic representations of text data. I'll be using a data set I have that includes track metadata and artist biographies from [Beatport.com](https://www.beatport.com/). After ingesting and pre-processing the text with `tidytext`, I'll run a sentiment analysis and use `gganimate` to bring the results to life as a GIF and an MP4 video.
The question I want to answer in this exercise is: has the sentiment of artist biographies associated with different music genres changed over time?
## Install and Load Required Packages
These are the required libraries; please install any that you do not already have.
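If any are missing, something like the following will install them. A minimal sketch (run once, not evaluated when knitting), with the package vector simply mirroring the `library()` calls below:
```{r install-missing, eval=FALSE}
# install only the packages that are not already present
pkgs <- c("arrow", "tidytext", "tidyverse", "textdata", "ggplot2",
          "gganimate", "dplyr", "av", "gifski", "magick")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```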
```{r create-install}
# parquet file reader
library(arrow)
# tidyverse data wrangling
library(tidytext)
library(tidyverse)
library(textdata)
# visualization and animation
library(ggplot2)
library(gganimate)
library(dplyr)
# these three let you render and view MP4s and GIFs
library(av)
library(gifski)
library(magick)
```
## Loading the data
I uploaded a `parquet` cache to my GCP instance and set the link to public download. Parquet is a compressed, columnar file format that serves here as a cached representation of the full data file; it reduced a 2 GB file to about 0.5 GB, making it easier to share.
```{r gcp-load}
# public download link for the parquet cache hosted on GCP
gcp_bp_tidy_url <- "https://storage.googleapis.com/data_science_masters_files/2024_fall/data_607_data_management/tidyverse_create_extend/bp_text_tidy.parquet"
# download to a temp file in binary mode, then read with arrow
bp_tidy_temp <- tempfile(fileext = ".parquet")
download.file(gcp_bp_tidy_url, bp_tidy_temp, mode = "wb")
bp_tidy_df <- read_parquet(bp_tidy_temp)
```
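For reference, a cache like this can be produced with `arrow` as well. A sketch, where `bp_text_df` is a hypothetical placeholder for the full in-memory data frame:
```{r write-parquet, eval=FALSE}
# write a compressed columnar cache of the full data set
# (bp_text_df is a hypothetical stand-in for the 2 GB source)
write_parquet(bp_text_df, "bp_text_tidy.parquet")
```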
## Tidying the data
I want to make sure the date column is actually a `Date`, drop rows with missing or empty biographies, and then drop anything released before 2018. This lets me focus on data from more recent years.
```{r tidy-bio}
bp_tidy_df <- bp_tidy_df %>%
  # coerce release_date to Date and drop rows with empty biographies
  mutate(release_date = as.Date(release_date)) %>%
  filter(!is.na(beatport_bio) & beatport_bio != "") %>%
  # keep only releases from 2018 onward (this also drops NA dates)
  filter(release_date >= as.Date("2018-01-01"))
```
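A quick check that the filter behaved as expected; both values returned should fall on or after 2018-01-01:
```{r date-check, eval=FALSE}
# earliest and latest remaining release dates
range(bp_tidy_df$release_date, na.rm = TRUE)
```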
## Sentiment analysis
To perform a sentiment analysis, the text values need to be turned into tokens. The tokenization function `unnest_tokens()` comes from `tidytext` and breaks each token out into its own row. I've used the default "words" setting, where each word becomes a token; other options include characters, sentences, or lines, as in the sketch below.
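For instance, a sentence-level tokenization of the same column would look like this (a sketch, not evaluated here):
```{r token-sentences, eval=FALSE}
# one row per sentence instead of one row per word
bp_tidy_df %>%
  unnest_tokens(sentence, input = beatport_bio, token = "sentences")
```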
The sentiment assessment itself is done with the aid of the [National Research Council of Canada's (NRC) Word-Emotion Association Lexicon](https://www.saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm). It assigns a value of 0 or 1 for each of the sentiments and emotions below, based on whether the word is or is not associated with it.
* sentiments: negative, positive
* emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
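In `tidytext`, the lexicon arrives as a plain two-column tibble of word/category pairs, which is what makes the join below straightforward:
```{r nrc-peek, eval=FALSE}
# a peek at the lexicon: columns are `word` and `sentiment`
head(get_sentiments("nrc"))
```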
To tie the tokens to the NRC lexicon, you can do a "many-to-many" join on the word field, since each word can appear multiple times and be linked to several NRC categories. The `count()` function then groups the data by release year, genre, and sentiment, and `spread()` pivots the counts into one column per category.
```{r preprocess}
# break each biography into one word per row
bp_bio_tokens <- bp_tidy_df %>%
  unnest_tokens(word, input = beatport_bio)
nrc_lex <- get_sentiments("nrc")
# join tokens to the lexicon, count by year/genre/category, pivot wide
bp_bio_nrc <- bp_bio_tokens %>%
  inner_join(nrc_lex, by = "word", relationship = "many-to-many") %>%
  count(release_year = year(release_date), genre_name, sentiment) %>%
  spread(sentiment, n, fill = 0)
```
Next, I normalized the data and calculated two key metrics: the Emotional Polarity Index (EPI) and the Sentiment Score.
* EPI: a more holistic view, computed by summing the positive emotions (anticipation, joy, surprise, trust) and subtracting the negative ones (disgust, fear, sadness). This gives a broader perspective on emotional tone than a simple positive/negative split.
* Sentiment Score: a more traditional metric, calculated by subtracting negative from positive, offering a straightforward comparison of positive versus negative sentiment.

Before computing either metric, I normalized each category by dividing its count by the total across all emotions and multiplying by 100. This converts the counts into percentages, making it easier to see how each sentiment contributes relative to the overall emotional expression across genres and years. (Note that anger, while present in the NRC lexicon, is not part of this total or of the EPI.)
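Putting the two steps together, the metrics reduce to the following, where $n_e$ is the number of words tagged with category $e$ in a given year/genre cell and $E = n_{\text{anticipation}} + n_{\text{joy}} + n_{\text{surprise}} + n_{\text{trust}} + n_{\text{disgust}} + n_{\text{fear}} + n_{\text{sadness}}$:

$$\mathrm{EPI} = 100 \cdot \frac{(n_{\text{anticipation}} + n_{\text{joy}} + n_{\text{surprise}} + n_{\text{trust}}) - (n_{\text{disgust}} + n_{\text{fear}} + n_{\text{sadness}})}{E}$$

$$\text{sentiment score} = 100 \cdot \frac{n_{\text{positive}} - n_{\text{negative}}}{E}$$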
```{r data-norms}
bp_bio_nrc <- bp_bio_nrc %>%
  # denominator for normalization (anger is not part of this sum)
  mutate(all_emotions = anticipation + joy + surprise + trust + disgust + fear + sadness) %>%
  # convert each category's count to a percentage of all emotions
  mutate(across(c(anticipation, joy, surprise, trust,
                  disgust, fear, sadness, positive, negative),
                ~ .x / all_emotions * 100))
bp_bio_nrc <- bp_bio_nrc %>%
  # positive emotions minus negative emotions, in percentage points
  mutate(EPI = (anticipation + joy + surprise + trust) - (disgust + fear + sadness)) %>%
  mutate(sentiment_score = positive - negative)
```
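As a quick sanity check, the seven emotion percentages should now sum to 100 in every row where `all_emotions` is non-zero:
```{r norm-check, eval=FALSE}
# TRUE if every row's emotion shares sum to (approximately) 100
bp_bio_nrc %>%
  mutate(emotion_total = anticipation + joy + surprise + trust +
           disgust + fear + sadness) %>%
  summarise(all_near_100 = all(abs(emotion_total - 100) < 1e-6, na.rm = TRUE))
```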
## Using gganimate to bring the data to life
With normalized sentiment metrics in hand, we can bring them to life with animated visualizations. `gganimate` renders traditional `ggplot2` charts in more compelling ways, such as a GIF or an MP4 video, with the goal of drawing the eye toward anything that stands out.
To demonstrate, I selected a subset of genres that are my personal favorites:
```{r genre-filter}
bp_bio_nrc_lf <- bp_bio_nrc %>%
  select(release_year, genre_name, EPI, sentiment_score) %>%
  # pivot to long form so the two measures can be faceted side by side
  gather(key = "measure", value = "value", EPI, sentiment_score) %>%
  filter(genre_name %in% c("Melodic House & Techno",
                           "Afro House",
                           "Organic House / Downtempo",
                           "Progressive House",
                           "Techno (Raw / Deep / Hypnotic)"))
```
I'm using `facet_grid` to create side-by-side charts that show the EPI and Sentiment Score trends by genre and year. The `ggplot` chart is saved as a variable and animated with `transition_reveal()`, which lets the lines and points gradually appear as the years progress. This makes it easier to spot how each genre's sentiment and EPI evolve over time. The final step generates both a GIF and an MP4 video, using `gifski_renderer()` for the GIF and `av_renderer()` for the video.
```{r music-viz}
bp_bio_nrc_viz <- ggplot(bp_bio_nrc_lf, aes(x = release_year, y = value, color = genre_name, group = genre_name)) +
  geom_line(linewidth = 1.2) +
  # scale point size by magnitude so large swings stand out
  geom_point(aes(size = abs(value))) +
  scale_color_viridis_d(option = "C") +
  labs(title = "EPI & Sentiment Score By Genre and Year",
       y = "Value",
       x = "Year") +
  facet_grid(~ measure, scales = "free_y") +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12)) +
  guides(size = "none") +
  # reveal lines and points progressively along the year axis
  transition_reveal(release_year) +
  ease_aes('linear')
if (interactive()) {
  # 600 frames at 60 fps gives a 10-second animation; animate() accepts
  # at most two of nframes, fps, and duration, so duration is left implied
  animate(
    bp_bio_nrc_viz,
    width = 1024,
    height = 768,
    nframes = 600,
    fps = 60,
    renderer = gifski_renderer(file = "bp_bio_nrc_viz.gif")
  )
  animate(
    bp_bio_nrc_viz,
    width = 1024,
    height = 768,
    nframes = 600,
    fps = 60,
    renderer = av_renderer(file = "bp_bio_nrc_viz.mp4")
  )
}
```
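Because the `animate()` calls are wrapped in `if (interactive())`, the knitted document will not render the animation on its own. One way to embed the saved GIF afterward, assuming the file exists at this relative path:
```{r embed-gif, eval=FALSE}
# display a previously rendered GIF in the knitted HTML
knitr::include_graphics("bp_bio_nrc_viz.gif")
```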
## Conclusion
Data visualizations should be used to aid data understanding, and animating these sentiment trends makes changes over time easier to spot at a glance. Data is truth, and truth must win out.