In my last post I showed how to scrape Joni Mitchell lyrics from her website, combined them with Spotify audio data, and explored the result.
This post continues that work, with a focus on sentiment analysis. In my previous post we used the Spotify “valence” measure to count songs by how positive or negative they sound. Here we will start with the lyrics alone. At the end we will combine each song’s lyric-based sentiment with its valence to identify Joni’s most positive and negative songs based on both lyrics and music.
Let’s get started! As usual, the first thing we do is load some packages and set a theme. I’m also going to bring in the datasets I created in the last post so I don’t have to recreate them here.
# Load packages using the pacman package.
pacman::p_load(
tidyverse,
tidytext,
tidymodels,
SnowballC,
wordcloud,
reshape2,
here
)
# Create a ggplot2 theme
theme_alex <- function() {
font <- "Arial"
theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(
color = "#cbcbcb"
),
panel.grid.major.x = element_blank(),
panel.background = element_blank(),
strip.background = element_rect(
fill = "white"
),
strip.text = element_text(
hjust = 0,
color = "#460069",
size = 12
),
axis.ticks = element_blank(),
plot.title = element_text(
family = font,
size = 20,
face = "bold",
color = "#460069"
),
plot.subtitle = element_text(
family = font,
size = 14,
color = "#6a1c91",
hjust = 0.5
),
plot.caption = element_text(
family = font,
size = 9,
hjust = 1,
color = "#460069"
),
axis.title = element_text(
family = font,
size = 10,
color = "#460069"
),
axis.text = element_text(
family = font,
size = 9,
color = "#460069"
),
axis.text.x = element_text(
margin = margin(5, b = 10)
),
legend.text.align = 0,
legend.background = element_blank(),
legend.title = element_blank(),
legend.key = element_blank(),
legend.text = element_text(
family = font,
size = 18,
color = "#4B636E"
)
)
}
# Bring in data sets saved from last post.
joni_lyrics_dates <- readRDS(url("https://github.com/farach/data/blob/master/joni_lyric_dates.RDS?raw=true", "rb"))
joni_spotify <- readRDS(url("https://github.com/farach/data/blob/master/joni_spotify.rds?raw=true", "rb"))
Word proportion
A good place to start is to get a general sense of what the most used words are in each album. Before I can get answers I need to get questions. Getting word proportions should help there.
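Before running this on the full lyrics, here is a minimal sketch of what a word proportion is, using base R. The albums "A" and "B" and their tokens are invented for illustration and are not taken from the Joni data.

```r
# Invented toy tokens: one row per (album, word) occurrence.
toy_tokens <- data.frame(
  album = c("A", "A", "A", "A", "B", "B"),
  word  = c("love", "love", "river", "sky", "love", "cold")
)

# Count each word within each album (the numerator) ...
toy_counts <- table(toy_tokens$album, toy_tokens$word)

# ... and divide each row by the album's total word count (the denominator).
toy_prop <- prop.table(toy_counts, margin = 1)

# Words an album never uses show up with a proportion of 0.
print(round(toy_prop, 2))
```

The real pipeline reaches the same shape with `count()`, `group_by()` and a `spread()`/`gather()` round trip that fills in the zeros.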
joni_word_prop <-
# Let's start with our data set from my last post that includes Spotify audio
# features
joni_lyrics_dates %>%
# We only want the lyrics that Joni wrote. Also remove live albums and songs
# that are mostly interludes (off her Mingus album)
filter(
!album_name %in% c(
"Shadows and Light", "Travelogue", "Both Sides Now",
"Shine [Standard Jewel - Parts Order Only]"
),
!song_name %in% c(
"Coin In The Pocket (Rap)", "Funeral (Rap)",
"Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
"Lucky (Rap)"
),
song_author == "by Joni Mitchell"
) %>%
# Turn everything into a character variable
mutate_all(., ~ as.character(.)) %>%
# Tokenize scraped lyrics
unnest_tokens(word,
lyrics_scraped,
token = "words",
strip_punct = FALSE
) %>%
# Let's get rid of all the instances where the word is just punctuation.
filter(!word %in% c(
",", "-", ".", "!", "?", '"', ")", "(", "*", "'", "’",
";", ":", "[", "]", "oh"
)) %>%
# Remove stop words
anti_join(get_stopwords(), by = "word") %>%
# Format variables
mutate(
word = str_extract(word, "[a-z']+"),
album_name = paste0(album_name, " (", str_sub(album_release_date, 1, 4), ")")
) %>%
# Count each word within each album (numerator)
count(album_name, word) %>%
# Group by album so sum(n) below is the album's total word count (denominator)
group_by(album_name) %>%
mutate(proportion = round(n / sum(n), 4)) %>%
# Drop n column
select(-n) %>%
# Pivot data
spread(album_name, proportion) %>%
# Turn NA's into 0's
mutate_if(is.double, ~ if_else(is.na(.), 0, .)) %>%
# Pivot back to have full list of words used in all Joni songs and proportion
# of each word in each album.
gather(album_name, proportion, -word)
joni_word_prop %>%
# Sort largest to smallest
arrange(desc(proportion)) %>%
# Round proportion
mutate(proportion = scales::percent(proportion, accuracy = 0.01)) %>%
# Get top 20
head(20) %>%
# Display table
kableExtra::kable(col.names = c("Word", "Album", "Proportion"), align = "c", row.names = FALSE)
Word | Album | Proportion |
---|---|---|
love | Wild Things Run Fast (1982) | 6.09% |
lead | Taming the Tiger (1998) | 5.83% |
come | Night Ride Home (1991) | 4.90% |
dancin | Chalk Mark In A Rain Storm (1988) | 4.09% |
man | Wild Things Run Fast (1982) | 3.78% |
shine | Shine (2007) | 3.32% |
fiction | Dog Eat Dog (1985) | 3.09% |
cold | Night Ride Home (1991) | 2.89% |
balloon | Taming the Tiger (1998) | 2.56% |
ladders | Chalk Mark In A Rain Storm (1988) | 2.46% |
dreamland | Don Juan’s Reckless Daughter (1977) | 2.37% |
tiger | Taming the Tiger (1998) | 2.35% |
good | Dog Eat Dog (1985) | 2.21% |
just | Night Ride Home (1991) | 2.14% |
want | Blue (1971) | 2.13% |
like | For the Roses (1972) | 2.12% |
dog | Dog Eat Dog (1985) | 2.06% |
get | Chalk Mark In A Rain Storm (1988) | 2.05% |
love | Chalk Mark In A Rain Storm (1988) | 2.05% |
joy | Night Ride Home (1991) | 2.01% |
The first thing I notice is that most of the highest proportions come from albums in the second half of her career (mid-80’s through the 00’s). Did she repeat words more often in the second half of her career?
I want a better view of the top words in each album separately to get closer to answering that question, and to see how Joni’s use of repeated lyrics changed over time.
# Off the bat I know I'm going to want to use a facet_wrap or grid to view the
# top words by album. I also want to see how things develop over time so I need
# to organize the albums by year.
joni_facet_reorder <- joni_lyrics_dates %>%
select(album_name, album_release_date) %>%
transmute(
album_name = paste0(album_name, " (", str_sub(album_release_date, 1, 4), ")"),
year = as.numeric(str_sub(album_release_date, 1, 4))
) %>%
distinct() %>%
arrange(year) %>%
pull(album_name)
joni_word_prop$album_name <- factor(joni_word_prop$album_name,
levels = joni_facet_reorder
)
# With album name sorting by year over with I can plot.
joni_word_prop %>%
filter(
proportion != 0,
album_name != "<NA>",
album_name != "NA (NA)",
album_name != "Both Sides Now (2000)",
) %>%
# Group by album
group_by(album_name) %>%
# Get the top 5 words with the highest proportion
top_n(5, proportion) %>%
# Ungroup
ungroup() %>%
# Sort by proportion
arrange(album_name, desc(proportion)) %>%
# Group by album again
group_by(album_name) %>%
# Keep just the top 5 words. top_n() keeps ties, so it can return more than
# 5 rows per album when proportions are tied
filter(row_number() <= 5) %>%
# Ungroup
ungroup() %>%
# Reorder top 5 words by proportion
mutate(word = reorder_within(as.factor(word), proportion, album_name)) %>%
# Begin to plot
ggplot(aes(word, proportion)) +
# The next 2 geoms make a lollipop graph which will make it easier to see
# differences than using a bar plot
geom_segment(aes(xend = word, yend = 0), linetype = "dashed") +
geom_point(color = "#460069") +
# Facet by album which we reordered above.
facet_wrap(~album_name,
scales = "free_y",
# Adding this labeller option which will create a new line in the
# facet labels if the length is longer than 17
labeller = label_wrap_gen(width = 17)
) +
# Apply my theme
theme_alex() +
# The lollipop graph has lines going horizontally and so does theme_alex. I
# want to flip that so the grid lines are up and down. This will make it easier
# to see.
theme(
panel.grid.major.x = element_line(
color = "#cbcbcb"
),
panel.grid.major.y = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
# Coord flip
coord_flip() +
# Need to add this so that the reordered words stay in the order we want them
scale_x_reordered() +
# Change the y axis to percent
scale_y_continuous(labels = scales::percent) +
# Add labels
labs(
x = NULL,
y = NULL,
title = "Top 5 words in Joni Mitchell albums",
caption = "source: JoniMitchell.com \nSpotify"
)
A couple of things pop out to me here. During the first half of her career she used the word “like” a lot, which makes me think she was leaning on analogies. In the second half of her career she uses fewer analogies (“like” is no longer one of the most used words).
“Dreamland” appears as the most used word in 2 albums. It is an unexpected word to see making up the biggest proportion of all words.
Lastly, we can now see that she does in fact repeat words more often in the second half of her career. This makes sense, I guess: she is using fewer analogies and just stating what things are, repeating a phrase over and over instead of finding different ways to say it.
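One rough way to put a number on that repetition is to add up the share of an album taken by its five most used words. The sketch below does that in base R. The two albums and all proportions are invented, and "top-5 share" is my own ad hoc measure rather than an established statistic.

```r
# Invented proportions for two hypothetical albums (not real results).
toy_prop <- data.frame(
  album = rep(c("Early album", "Late album"), each = 6),
  proportion = c(
    0.020, 0.018, 0.015, 0.012, 0.010, 0.009,
    0.060, 0.045, 0.030, 0.020, 0.015, 0.010
  )
)

# For each album, add up the five largest word proportions.
top5_share <- tapply(
  toy_prop$proportion, toy_prop$album,
  function(p) sum(sort(p, decreasing = TRUE)[1:5])
)

print(top5_share)
```

A higher top-5 share means a handful of words account for more of the album, which is what the later albums in the plot suggest.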
Sentiment analysis
Now to the part I’ve been looking forward to: sentiment analysis. In my last post I created a table that gave us counts of Joni songs according to their valence score. Valence scores range from 0 to 1, and Joni’s music spanned that range. My initial thought is that this makes sense, considering how personal Joni’s lyrics are and her skill at capturing the full range of human emotion.
In the following section we will attempt the same thing using Joni’s lyrics alone. At the end I will combine the results of the sentiment analysis with the valence score to find Joni’s most positive and negative songs using both lyrics and valence.
joni_spotify_token <-
# This first part is the same as I've done before in the code above.
joni_lyrics_dates %>%
# Turn everything into a character variable
mutate_all(., ~ as.character(.)) %>%
# Only keep songs written by Joni and drop the same albums and interludes as before
filter(
song_author == "by Joni Mitchell",
!album_name %in% c(
"Shadows and Light", "Travelogue", "Both Sides Now",
"Shine [Standard Jewel - Parts Order Only]"
),
!song_name %in% c(
"Coin In The Pocket (Rap)", "Funeral (Rap)",
"Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
"Lucky (Rap)"
)
) %>%
# Unnest tokens
unnest_tokens(word,
lyrics_scraped,
token = "words",
strip_punct = FALSE
) %>%
# Remove words that are just punctuation
filter(!word %in% c(
",", "-", ".", "!", "?", '"', ")", "(", "*", "'", "’",
";", ":", "[", "]", "oh"
)) %>%
# Remove stop words
anti_join(get_stopwords(source = "snowball"), by = "word") %>%
# This is new - I'll describe below
mutate(
stem = wordStem(word),
year = as.numeric(str_sub(album_release_date, 1, 4)),
album_name = paste0(album_name, " (", str_sub(album_release_date, 1, 4), ")")
)
head(joni_spotify_token %>% select(word, stem))
## word stem
## 1 instrumental instrument
## 2 sparkle sparkl
## 3 ocean ocean
## 4 eagle eagl
## 5 top top
## 6 tree tree
Stemming is the process of reducing words to their base or root form. For example, above we see “sparkle” reduced to “sparkl”. We will join on these stems so that words are not needlessly dropped just because their exact form has no match in the sentiment dictionary.
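To make the benefit of stemming concrete, here is a minimal sketch using `wordStem()` from SnowballC. The one-word `toy_lexicon` is an invented stand-in for a sentiment dictionary, not an excerpt from Bing.

```r
library(SnowballC)

# Three surface forms of the same word.
words <- c("sparkle", "sparkles", "sparkling")

# Hypothetical one-entry lexicon, for illustration only.
toy_lexicon <- c("sparkle")

# Matching on the raw word only finds the exact form.
raw_match <- words %in% toy_lexicon

# Matching on stems finds all three forms.
stems <- wordStem(words)
stem_match <- stems %in% wordStem(toy_lexicon)

print(data.frame(word = words, stem = stems, raw_match, stem_match))
```

The trade-off is that stems are cruder than words, so two unrelated words can occasionally collapse to the same stem.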
# Get Bing sentiments. The tidytext package also gives access to other
# sentiment lexicons.
bing <- get_sentiments("bing") %>%
mutate(stem = wordStem(word)) %>%
select(stem, sentiment) %>%
distinct()
joni_spotify_token %>%
# Join with Bing dictionary
inner_join(bing, by = "stem") %>%
# Get counts
group_by(album_name, sentiment, year) %>%
tally() %>%
ungroup() %>%
# Get proportions
group_by(album_name, year) %>%
mutate(n_prop = n / sum(n)) %>%
ungroup() %>%
# Set up sentiment analysis so that it can all be plotted on the same plot
mutate(sentiment_n = ifelse(sentiment == "negative", -n_prop, n_prop)) %>%
# Remove rows with a missing album name (pasted together as "NA (NA)" above)
filter(album_name != "NA (NA)") %>%
# Reorder albums by year
mutate(album_name = fct_reorder(album_name, year)) %>%
# Plot
ggplot(aes(album_name, sentiment_n, fill = sentiment)) +
# Geom bar here for this one with some light transparency
geom_bar(stat = "identity", alpha = 0.75) +
# Flip so albums are on the y axis
coord_flip() +
# Add labels
labs(
x = "",
y = "",
title = "Joni Mitchell album sentiment by stem"
) +
# Add base theme
theme_alex() +
# Make theme adjustments
theme(
legend.position = "bottom",
legend.text = element_text(size = 10, color = "#460069"),
plot.title = element_text(hjust = 0.5)
) +
# Select colors
scale_fill_manual(values = c("#BF406C", "#F1E678")) +
# Turn x axis into percents
scale_y_continuous(labels = scales::percent)
It looks like Joni’s albums tend to have a similar distribution of positive and negative words. On average, positive words make up around 55% of the sentiment-bearing words on a Joni album and negative words about 44%. We see 2 large exceptions though: Ladies of the Canyon (1970) and Turbulent Indigo (1994).
I was a little surprised to see Ladies of the Canyon (1970) come out so positive, but not surprised to see Turbulent Indigo as the most negative. I mean, just look at the album cover for that one:
![fig: Joni Mitchell - Turbulent Indigo (1994)](https://upload.wikimedia.org/wikipedia/en/3/37/Joni_Turbulent.jpg)
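The signed proportions behind the album chart above can be sketched in a few lines of base R. The albums "A" and "B" and their counts are invented; only the arithmetic mirrors the pipeline.

```r
# Invented counts of positive and negative words for two hypothetical albums.
toy_counts <- data.frame(
  album = c("A", "A", "B", "B"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(40, 60, 70, 30)
)

# Share of each sentiment within its album.
album_total <- ave(toy_counts$n, toy_counts$album, FUN = sum)
toy_counts$n_prop <- toy_counts$n / album_total

# Flip the sign of the negative share so its bar points the other way.
toy_counts$sentiment_n <- ifelse(
  toy_counts$sentiment == "negative",
  -toy_counts$n_prop,
  toy_counts$n_prop
)

print(toy_counts)
```

Because the shares are computed within each album, a long album and a short album end up on the same scale.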
Sentiment analysis across Joni’s musical career
Let’s continue doing sentiment analysis but switch gears a bit and not group things by album. I am going to create a version of the bar chart above and I’m also going to include a word cloud.
# Same process as before but grouping and displaying things a little differently
joni_spotify_token %>%
# Join with Bing dictionary to get sentiments
inner_join(bing, by = "stem") %>%
# Get counts
group_by(stem, sentiment) %>%
tally() %>%
ungroup() %>%
# Get top 10
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
# Reorder factors
mutate(stem = fct_reorder(stem, n)) %>%
# Plot
ggplot(aes(stem, n, fill = sentiment)) +
# Make a bar chart
geom_col(show.legend = TRUE, alpha = 0.75) +
# Facet by sentiment
facet_wrap(~sentiment, scales = "free") +
# Add labels
labs(
y = "Contribution to sentiment",
x = NULL,
title = "Most common positive and negative words across \nJoni Mitchell's career"
) +
# Flip plot so words are on the y axis
coord_flip() +
# Add alex theme and customize a bit
theme_alex() +
theme(
legend.position = "bottom",
legend.text = element_text(size = 10, color = "#460069"),
plot.title = element_text(hjust = 0.5)
) +
# Pick new colors
scale_fill_manual(values = c("#BF406C", "#F1E678"))
joni_spotify_token %>%
# Join with bing dictionaries
inner_join(bing, by = "stem") %>%
# Get counts
count(word, sentiment, sort = TRUE) %>%
# Create wordcloud
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
title.size = 3,
title.colors = c("#BF406C", "#F1E678"),
title.bg.colors = c("#FFFFFF", "#FFFFFF"),
colors = c("#BF406C", "#F1E678"),
max.words = 200
)
What pops out to me is how “like” is one of the top words. But as discussed above, Joni was likely using it to draw analogies rather than to express liking. This is one of the issues with single-word sentiment analysis: one loses the context the words are used in. For that reason the next section takes a different approach, using a sentiment analysis algorithm that looks beyond unigrams to the sentiment of the sentence as a whole.
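The problem is easy to reproduce with a word-by-word lookup. In this base R sketch, `toy_lexicon` is a tiny hand-made stand-in for a sentiment dictionary and both example lines are invented, not Joni lyrics.

```r
# Tiny hand-made lexicon, for illustration only.
toy_lexicon <- c(like = "positive", love = "positive", cold = "negative")

# Two invented lines: "like" expresses liking in the first and a comparison
# in the second.
toy_lines <- c("I like this song", "The wind cuts like a knife")

# Look up every token of a line in the lexicon, ignoring context.
tag_line <- function(line) {
  tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
  hits <- tokens[tokens %in% names(toy_lexicon)]
  unname(toy_lexicon[hits])
}

toy_tags <- lapply(toy_lines, tag_line)
print(toy_tags)
```

Both lines come back tagged positive, even though the second one is not a happy image at all.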
Bring in genius R package
Before we begin I want to introduce another package: genius. We can use this package to tap into genius.com’s lyric database, which is useful because the lyrics are already formatted by line and it lets us reproduce this analysis for other artists. Otherwise I would have to go scraping artist lyric sites like I did for Joni and her site.
To get the genius package working you have to do some initial setup which I will not cover here. You can find those instructions in the link. Once that initial setup is complete we can get to work.
library(genius)
# Create list of albums we want lyrics to
artist_album <- joni_spotify %>%
filter(
!album_name %in% c(
"Shadows and Light", "Travelogue", "Both Sides Now",
"Shine [Standard Jewel - Parts Order Only]"
),
!track_name %in% c(
"Coin In The Pocket (Rap)", "Funeral (Rap)",
"Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
"Lucky (Rap)"
)
) %>%
transmute(
album = str_trim(album_name, "both")
) %>%
distinct() %>%
pull()
# Create empty data frame where result of a loop will drop data in
joni_genius_df <- data.frame(
album_name = as.character(),
track_n = as.integer(),
artist = as.character(),
track_title = as.character(),
line = as.integer(),
lyric = as.character(),
element = as.character(),
element_artist = as.character()
)
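# possible_album() is not defined earlier in this post. Assumption: it wraps
# genius_album() with purrr::possibly() so that one failed album returns NULL
# instead of stopping the loop.
possible_album <- possibly(genius_album, otherwise = NULL)
# Loop through each album in the list and get lyrics.
for (i in artist_album) {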
df <- possible_album("joni mitchell", i, info = "all")
joni_genius_df <- bind_rows(joni_genius_df, df)
}
# I ran into some trouble getting 1 album in particular so I'll do that one
# separately
djrd <- genius_album("joni mitchell",
"Don Juan’s Reckless Daughter",
info = "all"
)
# Put it all together.
joni_genius_df <- bind_rows(joni_genius_df, djrd)
glimpse(joni_genius_df)
## Rows: 8,216
## Columns: 8
## $ album_name <chr> "Shine", "Shine", "Shine", "Shine", "Shine", "Shine"...
## $ track_n <int> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ artist <chr> "Joni Mitchell", "Joni Mitchell", "Joni Mitchell", "...
## $ track_title <chr> "One Week Last Summer", "This Place", "This Place", ...
## $ line <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ lyric <chr> NA, "Sparkle on the ocean", "Eagle at the top of a t...
## $ element <chr> "Instrumental", "Verse 1", "Verse 1", "Verse 1", "Ve...
## $ element_artist <chr> "Joni Mitchell", "Joni Mitchell", "Joni Mitchell", "...
Here we see each lyric is separated by line. This is exactly what we were looking for in order to determine the sentiment of a word in relation to how it is used in a sentence.
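As a small illustration of why sentence-level scoring helps, the sketch below scores two invented sentences with sentimentr. A word-by-word lookup would treat “happy” the same way in both, whereas `sentiment()` takes the negation into account.

```r
library(sentimentr)

# Two invented sentences that differ only by a negation.
toy_lyrics <- c("I am happy.", "I am not happy.")

# Split into sentences, then score each sentence.
toy_scores <- sentiment(get_sentences(toy_lyrics))

# One row per sentence; the sentiment column holds the polarity score.
print(toy_scores)
```

The first sentence should score above zero and the second below zero, which is the behaviour we want for lyrics where the meaning depends on the surrounding words.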
# Load sentimentr and magrittr
pacman::p_load(sentimentr, magrittr)
# Get sentiment by sentence
joni_sentence_df <- joni_genius_df %>%
# Remove albums and songs we don't want
filter(
!album_name %in% c(
"Shadows and Light", "Travelogue", "Both Sides Now",
"Shine [Standard Jewel - Parts Order Only]"
),
!track_title %in% c(
"Coin In The Pocket (Rap)", "Funeral (Rap)",
"Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
"Lucky (Rap)"
)
) %>%
# Get sentiment by sentence
mutate(joni_sentences = get_sentences(lyric)) %$%
sentiment_by(joni_sentences, list(album_name, track_title))
head(joni_sentence_df)
## album_name track_title word_count sd ave_sentiment
## 1: Blue A Case of You 279 0.1539164 0.03920228
## 2: Blue All I Want 313 0.2722716 0.13833782
## 3: Blue Blue 120 0.2180206 0.09941187
## 4: Blue California 331 0.2489202 -0.05003675
## 5: Blue Carey 331 0.2304985 0.09450243
## 6: Blue Little Green 198 0.2693917 0.05061697
We have our initial results, and we can put together a plot similar to the ones above, this time showing sentiment by song.
library(patchwork)
p1 <- joni_sentence_df %>%
filter(
!album_name %in% c(
"Shadows and Light", "Travelogue", "Both Sides Now",
"Shine [Standard Jewel - Parts Order Only]"
),
!track_title %in% c(
"Coin In The Pocket (Rap)", "Funeral (Rap)",
"Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
"Lucky (Rap)"
)
) %>%
mutate(
track_title = as.factor(track_title),
track_title = fct_reorder(track_title, ave_sentiment)
) %>%
top_n(30, ave_sentiment) %>%
ggplot(aes(track_title, ave_sentiment)) +
geom_col(alpha = 0.75, fill = "#F1E678") +
coord_flip() +
theme_alex() +
labs(
x = NULL,
y = NULL,
title = "Most positive"
) +
theme(
plot.title = element_text(size = 10)
)
p2 <- joni_sentence_df %>%
filter(
!album_name %in% c(
"Shadows and Light", "Travelogue", "Both Sides Now",
"Shine [Standard Jewel - Parts Order Only]"
),
!track_title %in% c(
"Coin In The Pocket (Rap)", "Funeral (Rap)",
"Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
"Lucky (Rap)"
)
) %>%
mutate(
track_title = as.factor(track_title),
track_title = fct_reorder(track_title, ave_sentiment)
) %>%
top_n(-30, ave_sentiment) %>%
ggplot(aes(track_title, ave_sentiment)) +
geom_col(alpha = 0.75, fill = "#BF406C") +
coord_flip() +
theme_alex() +
labs(
x = NULL,
y = NULL,
title = "Most negative"
) +
theme(
plot.title = element_text(size = 10)
)
p1 + p2 +
plot_annotation(
title = "Joni Mitchell text polarity sentiment at the sentence level",
subtitle = "Top 30 most positive and negative songs",
caption = "source: Spotify, \nGenius",
theme = theme_alex()
) &
theme(
plot.title = element_text(hjust = 0.5)
)
And here we are, Joni Mitchell’s most positive and negative songs.
The last bit of work is to join this dataset with the Spotify data so that we can compare the results we see here with the audio-based valence score.
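The code that creates `joni_val_sent` and `joni_model` is not shown here. A minimal sketch of what that step might look like, assuming a join on album and song title followed by a simple linear model, is below. All values and the `toy_` names are invented for illustration. With the real data, `joni_sentence_df` and `joni_spotify` would take their place, and song titles may need cleaning before they match across Genius and Spotify.

```r
# Invented values standing in for the sentence-level sentiment results ...
toy_sentiment <- data.frame(
  album_name = rep("Album X", 4),
  track_title = c("Song A", "Song B", "Song C", "Song D"),
  ave_sentiment = c(-0.10, 0.00, 0.10, 0.20)
)

# ... and for the Spotify audio features.
toy_spotify <- data.frame(
  album_name = rep("Album X", 4),
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  valence = c(0.20, 0.35, 0.40, 0.65)
)

# Join the two tables on album and song title (the column names differ).
toy_val_sent <- merge(
  toy_sentiment, toy_spotify,
  by.x = c("album_name", "track_title"),
  by.y = c("album_name", "track_name")
)

# Regress audio valence on average lyric sentiment.
toy_model <- lm(valence ~ ave_sentiment, data = toy_val_sent)
print(coef(toy_model))
```

A positive slope on `ave_sentiment` would mean that songs with more positive lyrics also tend to sound more positive.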
# get_regression_points() and get_regression_table() come from moderndive
library(moderndive)
joni_plot1 <- joni_val_sent %>%
distinct() %>%
ggplot(aes(ave_sentiment, valence)) +
geom_point(color = "#460069") +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "#BF406C") +
theme_alex() +
theme(
legend.position = "bottom"
) +
labs(
x = "Average song sentiment",
y = "Valence score",
title = "Relationship between valence and average song sentiment in \nJoni Mitchell songs"
)
joni_plot2 <- get_regression_points(joni_model) %>%
ggplot(aes(residual)) +
geom_histogram(
binwidth = 0.05, color = "white", alpha = 0.75,
fill = "#BF406C"
) +
theme_alex() +
labs(
y = NULL,
x = "Residual",
title = "Normality of residuals"
)
joni_plot1 / joni_plot2 /
get_regression_table(joni_model) %>%
as.data.frame() %>%
gridExtra::tableGrob(
rows = NULL
)
And there we have it, the final plot for this post. We do indeed see a relationship between the positivity of Joni’s lyrics and the positivity of the song audio.
And to end this post on a positive note why not point you to the most positive song Joni ever wrote:
As well as these playlists I put together on Spotify with Joni’s happiest and saddest songs based on valence and average song sentiment.