
Sentiment analysis on Joni Mitchell lyrics

In my last post I showed how to scrape Joni Mitchell lyrics off her website, combined them with Spotify audio data, and explored the result.

This post continues that work, with a focus on sentiment analysis. In my previous post we used the Spotify “valence” measure to count songs by their positivity and negativity. Here we will look at just the lyrics at first. At the end we will combine each song’s lyric-based sentiment with its valence to identify Joni’s most positive and negative songs based on both lyrics and music.

Let’s get started! As usual the first thing we do is load some packages and set a theme. In addition to that I’m going to bring in datasets I created in the last post so I don’t have to recreate them here.

# Load packages using the pacman package.
pacman::p_load(
  tidyverse,
  tidytext,
  tidymodels,
  SnowballC,
  wordcloud,
  reshape2,
  here
)

# Create a ggplot2 theme
theme_alex <- function() {
  font <- "Arial"
  # Start from theme_minimal() and layer the overrides on top
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_line(
      color = "#cbcbcb"
    ),
    panel.grid.major.x = element_blank(),
    panel.background = element_blank(),
    strip.background = element_rect(
      fill = "white"
    ),
    strip.text = element_text(
      hjust = 0,
      color = "#460069",
      size = 12
    ),
    axis.ticks = element_blank(),
    plot.title = element_text(
      family = font,
      size = 20,
      face = "bold",
      color = "#460069"
    ),
    plot.subtitle = element_text(
      family = font,
      size = 14,
      color = "#6a1c91",
      hjust = 0.5
    ),
    plot.caption = element_text(
      family = font,
      size = 9,
      hjust = 1,
      color = "#460069"
    ),
    axis.title = element_text(
      family = font,
      size = 10,
      color = "#460069"
    ),
    axis.text = element_text(
      family = font,
      size = 9,
      color = "#460069"
    ),
    axis.text.x = element_text(
      margin = margin(5, b = 10)
    ),
    legend.text.align = 0,
    legend.background = element_blank(),
    legend.title = element_blank(),
    legend.key = element_blank(),
    legend.text = element_text(
      family = font,
      size = 18,
      color = "#4B636E"
    )
  )
}

# Bring in data sets saved from last post.
joni_lyrics_dates <- readRDS(url("https://github.com/farach/data/blob/master/joni_lyric_dates.RDS?raw=true", "rb"))
joni_spotify <- readRDS(url("https://github.com/farach/data/blob/master/joni_spotify.rds?raw=true", "rb"))

Word proportion

A good place to start is to get a general sense of what the most used words are in each album. Before I can get answers I need to get questions. Getting word proportions should help there.
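Before running this on the real lyrics, here is a minimal sketch of what I mean by a word proportion, using made-up data. The `toy_words` table and its contents are assumptions for illustration only: the numerator is how often a word appears on an album and the denominator is the total number of words on that album.

```r
library(dplyr)

# Toy data, made up for illustration: one row per word occurrence
toy_words <- tibble(
  album = c("A", "A", "A", "A", "B", "B"),
  word  = c("love", "love", "river", "blue", "love", "car")
)

toy_prop <- toy_words %>%
  # Numerator: number of times each word appears on each album
  count(album, word) %>%
  # Denominator: total number of words on that album
  group_by(album) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

toy_prop
```

Within each album the proportions add up to 1, which is what makes albums of different lengths comparable.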

joni_word_prop <-
  # Let's start with our data set from my last post that includes Spotify audio
  # features
  joni_lyrics_dates %>%
  # We only want the lyrics that Joni wrote. Also remove live albums and songs
  # that are mostly interludes (off her Mingus album).
  filter(
    !album_name %in% c(
      "Shadows and Light", "Travelogue", "Both Sides Now",
      "Shine [Standard Jewel - Parts Order Only]"
    ),
    !song_name %in% c(
      "Coin In The Pocket (Rap)", "Funeral (Rap)",
      "Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
      "Lucky (Rap)"
    ),
    song_author == "by Joni Mitchell"
  ) %>%
  # Turn everything into a character variable
  mutate_all(., ~ as.character(.)) %>%
  # Tokenize scraped lyrics
  unnest_tokens(word,
    lyrics_scraped,
    token = "words",
    strip_punct = FALSE
  ) %>%
  # Let's get rid of all the instances where the word is just punctuation.
  filter(!word %in% c(
    ",", "-", ".", "!", "?", '"', ")", "(", "*", "'", "’",
    ";", ":", "[", "]", "oh"
  )) %>%
  # Remove stop words
  anti_join(get_stopwords(), by = "word") %>%
  # Format variables
  mutate(
    word = str_extract(word, "[a-z']+"),
    album_name = paste0(album_name, " (", str_sub(album_release_date, 1, 4), ")")
  ) %>%
  # Get numerator: count of each word within each album
  count(album_name, word) %>%
  # Get denominator: total words in the album, then compute the proportion
  group_by(album_name) %>%
  mutate(proportion = round(n / sum(n), 4)) %>%
  # Drop n column
  select(-n) %>%
  # Pivot data
  spread(album_name, proportion) %>%
  # Turn NA's into 0's
  mutate_if(is.double, ~ if_else(is.na(.), 0, .)) %>%
  # Pivot back to have full list of words used in all Joni songs and proportion
  # of each word in each album.
  gather(album_name, proportion, 2:19)

joni_word_prop %>%
  # Sort largest to smallest
  arrange(desc(proportion)) %>%
  # Round proportion
  mutate(proportion = scales::percent(proportion, accuracy = 0.01)) %>%
  # Get top 20
  head(20) %>%
  # Display table
  kableExtra::kable(col.names = c("Word", "Album", "Proportion"), align = "c", row.names = FALSE)
| Word | Album | Proportion |
| --- | --- | --- |
| love | Wild Things Run Fast (1982) | 6.09% |
| lead | Taming the Tiger (1998) | 5.83% |
| come | Night Ride Home (1991) | 4.90% |
| dancin | Chalk Mark In A Rain Storm (1988) | 4.09% |
| man | Wild Things Run Fast (1982) | 3.78% |
| shine | Shine (2007) | 3.32% |
| fiction | Dog Eat Dog (1985) | 3.09% |
| cold | Night Ride Home (1991) | 2.89% |
| balloon | Taming the Tiger (1998) | 2.56% |
| ladders | Chalk Mark In A Rain Storm (1988) | 2.46% |
| dreamland | Don Juan’s Reckless Daughter (1977) | 2.37% |
| tiger | Taming the Tiger (1998) | 2.35% |
| good | Dog Eat Dog (1985) | 2.21% |
| just | Night Ride Home (1991) | 2.14% |
| want | Blue (1971) | 2.13% |
| like | For the Roses (1972) | 2.12% |
| dog | Dog Eat Dog (1985) | 2.06% |
| get | Chalk Mark In A Rain Storm (1988) | 2.05% |
| love | Chalk Mark In A Rain Storm (1988) | 2.05% |
| joy | Night Ride Home (1991) | 2.01% |

The first thing I notice is that most of the highest proportions come from albums in the second half of her musical career (mid-80s through the 00s). Did Joni tend to repeat words more in the second half of her career?

I want a better view of the top words in each album separately to get a little closer to answering that question, and to see how Joni’s use of repeated words has changed over time.

# Off the bat I know I'm going to want to use a facet_wrap or grid to view the
# top words by album. I also want to see how things develop over time so I need
# to organize the albums by year.
joni_facet_reorder <- joni_lyrics_dates %>%
  select(album_name, album_release_date) %>%
  transmute(
    album_name = paste0(album_name, " (", str_sub(album_release_date, 1, 4), ")"),
    year = as.numeric(str_sub(album_release_date, 1, 4))
  ) %>%
  distinct() %>%
  arrange(year) %>%
  pull(album_name)

joni_word_prop$album_name <- factor(joni_word_prop$album_name,
  levels = joni_facet_reorder
)

# With album name sorting by year over with I can plot.
joni_word_prop %>%
  filter(
    proportion != 0,
    album_name != "<NA>",
    album_name != "NA (NA)",
    album_name != "Both Sides Now (2000)",
  ) %>%
  # Group by album
  group_by(album_name) %>%
  # Get the top 5 words with the highest proportion
  top_n(5, proportion) %>%
  # Ungroup
  ungroup() %>%
  # Sort by proportion
  arrange(album_name, desc(proportion)) %>%
  # Group by album again
  group_by(album_name) %>%
  # Keep just the top 5 words. top_n() keeps ties, so it can return more than
  # 5 rows when several words share a proportion that is in the top 5
  filter(row_number() <= 5) %>%
  # Ungroup
  ungroup() %>%
  # Reorder top 5 words by proportion
  mutate(word = reorder_within(as.factor(word), proportion, album_name)) %>%
  # Begin to plot
  ggplot(aes(word, proportion)) +
  # The next 2 geoms make a lollipop graph which will make it easier to see
  # differences than using a bar plot
  geom_segment(aes(xend = word, yend = 0), linetype = "dashed") +
  geom_point(color = "#460069") +
  # Facet by album which we reordered above.
  facet_wrap(~album_name,
    scales = "free_y",
    # Adding this labeller option which will wrap the facet labels onto a new
    # line once they are longer than 17 characters
    labeller = label_wrap_gen(width = 17)
  ) +
  # Apply my theme
  theme_alex() +
  # The lollipop graph has lines going horizontally and so does theme_alex. I
  # want to flip that so the grid lines are up and down. This will make it easier
  # to see.
  theme(
    panel.grid.major.x = element_line(
      color = "#cbcbcb"
    ),
    panel.grid.major.y = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  # Coord flip
  coord_flip() +
  # Need to add this so that the reordered words stay in the order we want them
  scale_x_reordered() +
  # Change the y axis to percent
  scale_y_continuous(labels = scales::percent) +
  # Add labels
  labs(
    x = NULL,
    y = NULL,
    title = "Top 5 words in Joni Mitchell albums",
    caption = "source: JoniMitchell.com \nSpotify"
  )
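The plotting pipeline above calls `top_n()` and then filters on `row_number()` because `top_n()` keeps ties. Here is a small sketch of that behaviour on made-up data, next to `slice_max(with_ties = FALSE)`, which is one alternative that returns exactly five rows per group. The `toy` table is an assumption for illustration, not part of the Joni data.

```r
library(dplyr)

# Toy data, made up for illustration: five words tie for third place
toy <- tibble(
  album = "A",
  word = c("a", "b", "c", "d", "e", "f", "g"),
  proportion = c(0.30, 0.20, 0.10, 0.10, 0.10, 0.10, 0.10)
)

# top_n() keeps every row whose rank is within the top 5, ties included
with_ties <- toy %>%
  group_by(album) %>%
  top_n(5, proportion) %>%
  ungroup()

# slice_max() with with_ties = FALSE returns exactly 5 rows per group
exactly_five <- toy %>%
  group_by(album) %>%
  slice_max(proportion, n = 5, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)
nrow(exactly_five)
```

Either approach gives five words per facet. Which tied words survive is arbitrary in both cases, so the fifth lollipop on some albums should be read with that in mind.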

A couple of things pop out to me here. During the first half of her career she used the word “like” a lot, which makes me think she was using a lot of analogies. In the second half of her career she uses fewer analogies (“like” is no longer one of the most used words).

“Dreamland” appears as the most used word in 2 albums. It is an unexpected word to see making up the biggest proportion of all words.

Lastly, we can now see that she does in fact repeat words more often in the second half of her career. This makes sense I guess: she is using fewer analogies and just stating what things are, repeating it over and over instead of figuring out different ways to say it.
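One rough way to put a number on repetition is a type-token ratio: the number of distinct words divided by the total number of words, per album. This is only a sketch on made-up data, not something I ran on the Joni lyrics, and the `toy_tokens` table and album labels are assumptions. The ratio also tends to fall as texts get longer, so albums of very different lengths are not strictly comparable.

```r
library(dplyr)

# Toy data, made up for illustration: one row per word occurrence
toy_tokens <- tibble(
  album = c(rep("Early album", 6), rep("Late album", 6)),
  word = c(
    "river", "skate", "blue", "tree", "road", "wine",
    "shine", "shine", "shine", "love", "love", "shine"
  )
)

toy_ttr <- toy_tokens %>%
  group_by(album) %>%
  summarise(
    tokens = n(),                     # total words
    types = n_distinct(word),         # distinct words
    type_token_ratio = types / tokens,
    .groups = "drop"
  )

toy_ttr
```

A lower ratio means more repetition. In this toy example the “Late album” repeats two words across six tokens, so its ratio is much lower than the “Early album”, where every word is different.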

Sentiment analysis

Now to the part I’ve been looking forward to: sentiment analysis. In my last post I created a table that gave us counts of Joni songs according to their valence score. Valence scores range from 0 to 1 and Joni’s music spanned that range. My initial thought is that this makes sense considering Joni’s personal lyrics and her skill as an artist at capturing the full range of human emotion.

In the following section we will attempt to do the same using Joni’s lyrics alone. At the end I will combine the results of the sentiment analysis with the valence score to see which of Joni’s songs are the most positive and negative using both lyrics and valence.

joni_spotify_token <-
  # This first part is the same as I've done before in the code above.
  joni_lyrics_dates %>%
  # Turn everything into a character variable
  mutate_all(., ~ as.character(.)) %>%
  # Only keep songs written by Joni and remove Both Sides Now album
  filter(
    song_author == "by Joni Mitchell",
    !album_name %in% c(
      "Shadows and Light", "Travelogue", "Both Sides Now",
      "Shine [Standard Jewel - Parts Order Only]"
    ),
    !song_name %in% c(
      "Coin In The Pocket (Rap)", "Funeral (Rap)",
      "Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
      "Lucky (Rap)"
    )
  ) %>%
  # Unnest tokens
  unnest_tokens(word,
    lyrics_scraped,
    token = "words",
    strip_punct = FALSE
  ) %>%
  # Remove words that are just punctuation (and the filler word "oh")
  filter(!word %in% c(
    ",", "-", ".", "!", "?", '"', ")", "(", "*", "'", "’",
    ";", ":", "[", "]", "oh"
  )) %>%
  # Remove stop words
  anti_join(get_stopwords(source = "snowball"), by = "word") %>%
  # This is new - I'll describe below
  mutate(
    stem = wordStem(word),
    year = as.numeric(str_sub(album_release_date, 1, 4)),
    album_name = paste0(album_name, " (", str_sub(album_release_date, 1, 4), ")")
  )

head(joni_spotify_token %>% select(word, stem))
##           word       stem
## 1 instrumental instrument
## 2      sparkle     sparkl
## 3        ocean      ocean
## 4        eagle       eagl
## 5          top        top
## 6         tree       tree

Stemming is the process of reducing words to their base or root form. For example, above we see “sparkle” reduced to “sparkl”. We will use these stemmed words to prevent the unnecessary exclusion of words that don’t have an exact match in the sentiment dictionaries.
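Here is a small sketch of what `wordStem()` does on its own, using the same words that show up in the output above. The point of stemming both the lyrics and the dictionary is that different inflections of a word can meet at a shared stem. The `variants` vector is my own illustration and is not taken from the lyrics.

```r
library(SnowballC)

# Words taken from the head() output above
words <- c("instrumental", "sparkle", "ocean", "eagle")
stems <- wordStem(words)
data.frame(word = words, stem = stems)

# Inflected forms that the stemmer is meant to map to a shared stem
# (illustrative words, not from the lyrics)
variants <- c("sparkle", "sparkled", "sparkles")
unique(wordStem(variants))
```

Stems are not always real words (“sparkl”, “eagl”), which is fine here because they are only used as join keys.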

# Get Bing sentiments. The tidytext package also provides access to other
# sentiment dictionaries.
bing <- get_sentiments("bing") %>%
  mutate(stem = wordStem(word)) %>%
  select(stem, sentiment) %>%
  distinct()

joni_spotify_token %>%
  # Join with Bing dictionary
  inner_join(bing) %>%
  # Get counts
  group_by(album_name, sentiment, year) %>%
  tally() %>%
  ungroup() %>%
  # Get proportions
  group_by(album_name, year) %>%
  mutate(n_prop = n / sum(n)) %>%
  ungroup() %>%
  # Set up sentiment analysis so that it can all be plotted on the same plot
  mutate(sentiment_n = ifelse(sentiment == "negative", -n_prop, n_prop)) %>%
  # Remove NULL album names (paste0() turns a missing name and year into "NA (NA)")
  filter(album_name != "NA (NA)") %>%
  # Reorder albums by year
  mutate(album_name = fct_reorder(album_name, year)) %>%
  # Plot
  ggplot(aes(album_name, sentiment_n, fill = sentiment)) +
  # Geom bar here for this one with some light transparency
  geom_bar(stat = "identity", alpha = 0.75) +
  # Flip so albums are on the y axis
  coord_flip() +
  # Add labels
  labs(
    x = "",
    y = "",
    title = "Joni Mitchell album sentiment by stem"
  ) +
  # Add base theme
  theme_alex() +
  # Make theme adjustments
  theme(
    legend.position = "bottom",
    legend.text = element_text(size = 10, color = "#460069"),
    plot.title = element_text(hjust = 0.5)
  ) +
  # Select colors
  scale_fill_manual(values = c("#BF406C", "#F1E678")) +
  # Turn x axis into percents
  scale_y_continuous(labels = scales::percent)

It looks like Joni’s albums tend to have a similar distribution of positive and negative words. On average, positive words make up around 55% of the sentiment-matched words on a Joni album and negative words about 44%. We see 2 large exceptions though: Ladies of the Canyon (1970) and Turbulent Indigo (1994).

I was a little surprised to see Ladies of the Canyon (1970) as being so positive but was not surprised to see Turbulent Indigo as the most negative. I mean, just look at the album cover for that one:

Joni Mitchell - Turbulent Indigo (1994)

Sentiment analysis across Joni’s musical career

Let’s continue doing sentiment analysis but switch gears a bit and not group things by album. I am going to create a version of the bar chart above and I’m also going to include a word cloud.

# Same process as before but grouping and displaying things a little differently
joni_spotify_token %>%
  # Join with Bing dictionary to get sentiments
  inner_join(bing) %>%
  # Get counts
  group_by(stem, sentiment) %>%
  tally() %>%
  ungroup() %>%
  # Get top 10
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  # Reorder factors
  mutate(stem = fct_reorder(stem, n)) %>%
  # Plot
  ggplot(aes(stem, n, fill = sentiment)) +
  # Make a bar chart
  geom_col(show.legend = TRUE, alpha = 0.75) +
  # Facet by sentiment
  facet_wrap(~sentiment, scales = "free") +
  # Add labels
  labs(
    y = "Contribution to sentiment",
    x = NULL,
    title = "Most common positive and negative words across \nJoni Mitchell's career"
  ) +
  # Flip plot so words are on the y axis
  coord_flip() +
  # Add alex theme and customize a bit
  theme_alex() +
  theme(
    legend.position = "bottom",
    legend.text = element_text(size = 10, color = "#460069"),
    plot.title = element_text(hjust = 0.5)
  ) +
  # Pick new colors
  scale_fill_manual(values = c("#BF406C", "#F1E678"))

joni_spotify_token %>%
  # Join with bing dictionaries
  inner_join(bing) %>%
  # Get counts
  count(word, sentiment, sort = TRUE) %>%
  # Create wordcloud
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    title.size = 3,
    title.colors = c("#BF406C", "#F1E678"),
    title.bg.colors = c("#FFFFFF", "#FFFFFF"),
    colors = c("#BF406C", "#F1E678"),
    max.words = 200
  )

What pops out to me is how “like” is one of the top words. But as discussed above, Joni was likely using it to describe things through analogies rather than to say she liked something. This is one of the issues with using single words in sentiment analysis: one loses the context that these words are used in. For that reason the next section takes a different approach, using a sentiment analysis algorithm that looks beyond unigrams to understand the sentiment of the sentence as a whole.
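To make the context problem concrete, here is a minimal sketch using the sentimentr package, which scores whole sentences and accounts for valence shifters such as negation. The two example lines are my own and are not Joni lyrics.

```r
library(sentimentr)

# Two made-up lines that share the same sentiment word
example_lines <- c("I am happy", "I am not happy")

# Score each line as a sentence; negators such as "not" flip the polarity
scores <- sentiment(get_sentences(example_lines))

data.frame(line = example_lines, sentiment = scores$sentiment)
```

A single-word dictionary lookup would see “happy” in both lines and score them the same. The sentence-level score moves in opposite directions for the two lines.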

Bring in genius R package

Before we begin I want to introduce another package: genius. We can use this package to tap into genius.com’s lyric database. This is useful for two reasons: the lyrics are already formatted by line, and it lets us reproduce this analysis for other artists. Otherwise I would have to go web scraping artist lyric sites like I did for Joni and her site.

To get the genius package working you have to do some initial setup which I will not cover here. You can find those instructions in the link. Once that initial setup is complete we can get to work.

library(genius)

# Create list of albums we want lyrics to
artist_album <- joni_spotify %>%
  filter(
    !album_name %in% c(
      "Shadows and Light", "Travelogue", "Both Sides Now",
      "Shine [Standard Jewel - Parts Order Only]"
    ),
    !track_name %in% c(
      "Coin In The Pocket (Rap)", "Funeral (Rap)",
      "Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
      "Lucky (Rap)"
    )
  ) %>%
  transmute(
    album = str_trim(album_name, "both")
  ) %>%
  distinct() %>%
  pull()

# Create an empty data frame to collect the results of the loop below
joni_genius_df <- data.frame(
  album_name = as.character(),
  track_n = as.integer(),
  artist = as.character(),
  track_title = as.character(),
  line = as.integer(),
  lyric = as.character(),
  element = as.character(),
  element_artist = as.character()
)

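# possible_album() is not defined in this post. Assumption: it is a safe wrapper
# around genius_album() that returns NULL instead of stopping when an album fails.
possible_album <- purrr::possibly(genius_album, otherwise = NULL)

# Loop through each album in the list and get lyrics.
for (i in artist_album) {
  df <- possible_album("joni mitchell", i, info = "all")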

  joni_genius_df <- bind_rows(joni_genius_df, df)
}

# I ran into some trouble getting 1 album in particular so I'll do that one
# separately
djrd <- genius_album("joni mitchell",
  "Don Juan’s Reckless Daughter",
  info = "all"
)

# Put it all together.
joni_genius_df <- bind_rows(joni_genius_df, djrd)

glimpse(joni_genius_df)
## Rows: 8,216
## Columns: 8
## $ album_name     <chr> "Shine", "Shine", "Shine", "Shine", "Shine", "Shine"...
## $ track_n        <int> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ artist         <chr> "Joni Mitchell", "Joni Mitchell", "Joni Mitchell", "...
## $ track_title    <chr> "One Week Last Summer", "This Place", "This Place", ...
## $ line           <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ lyric          <chr> NA, "Sparkle on the ocean", "Eagle at the top of a t...
## $ element        <chr> "Instrumental", "Verse 1", "Verse 1", "Verse 1", "Ve...
## $ element_artist <chr> "Joni Mitchell", "Joni Mitchell", "Joni Mitchell", "...

Here we see each lyric is separated by line. This is exactly what we were looking for in order to determine the sentiment of a word in relation to how it is used in a sentence.

# Load sentimentr and magrittr
pacman::p_load(sentimentr, magrittr)

# Get sentiment by sentence
joni_sentence_df <- joni_genius_df %>%
  # Remove albums and songs we don't want
  filter(
    !album_name %in% c(
      "Shadows and Light", "Travelogue", "Both Sides Now",
      "Shine [Standard Jewel - Parts Order Only]"
    ),
    !track_title %in% c(
      "Coin In The Pocket (Rap)", "Funeral (Rap)",
      "Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
      "Lucky (Rap)"
    )
  ) %>%
  # Get sentiment by sentence
  mutate(joni_sentences = get_sentences(lyric)) %$%
  sentiment_by(joni_sentences, list(album_name, track_title))

head(joni_sentence_df)
##    album_name   track_title word_count        sd ave_sentiment
## 1:       Blue A Case of You        279 0.1539164    0.03920228
## 2:       Blue    All I Want        313 0.2722716    0.13833782
## 3:       Blue          Blue        120 0.2180206    0.09941187
## 4:       Blue    California        331 0.2489202   -0.05003675
## 5:       Blue         Carey        331 0.2304985    0.09450243
## 6:       Blue  Little Green        198 0.2693917    0.05061697

We have our initial results and we can put a plot together similar to the ones above that shows sentiment by album.

library(patchwork)

p1 <- joni_sentence_df %>%
  filter(
    !album_name %in% c(
      "Shadows and Light", "Travelogue", "Both Sides Now",
      "Shine [Standard Jewel - Parts Order Only]"
    ),
    !track_title %in% c(
      "Coin In The Pocket (Rap)", "Funeral (Rap)",
      "Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
      "Lucky (Rap)"
    )
  ) %>%
  mutate(
    track_title = as.factor(track_title),
    track_title = fct_reorder(track_title, ave_sentiment)
  ) %>%
  top_n(30, ave_sentiment) %>%
  ggplot(aes(track_title, ave_sentiment)) +
  geom_col(alpha = 0.75, fill = "#F1E678") +
  coord_flip() +
  theme_alex() +
  labs(
    x = NULL,
    y = NULL,
    title = "Most positive"
  ) +
  theme(
    plot.title = element_text(size = 10)
  )

p2 <- joni_sentence_df %>%
  filter(
    !album_name %in% c(
      "Shadows and Light", "Travelogue", "Both Sides Now",
      "Shine [Standard Jewel - Parts Order Only]"
    ),
    !track_title %in% c(
      "Coin In The Pocket (Rap)", "Funeral (Rap)",
      "Happy Birthday 1975 (Rap)", "I's A Muggin' (Rap)",
      "Lucky (Rap)"
    )
  ) %>%
  mutate(
    track_title = as.factor(track_title),
    track_title = fct_reorder(track_title, ave_sentiment)
  ) %>%
  top_n(-30, ave_sentiment) %>%
  ggplot(aes(track_title, ave_sentiment)) +
  geom_col(alpha = 0.75, fill = "#BF406C") +
  coord_flip() +
  theme_alex() +
  labs(
    x = NULL,
    y = NULL,
    title = "Most negative"
  ) +
  theme(
    plot.title = element_text(size = 10)
  )

p1 + p2 +
  plot_annotation(
    title = "Joni Mitchell text polarity sentiment at the sentence level",
    subtitle = "Top 30 most positive and negative songs",
    caption = "source: Spotify, \nGenius",
    theme = theme_alex()
  ) &
  theme(
    plot.title = element_text(hjust = 0.5)
  )

And here we are, Joni Mitchell’s most positive and negative songs.

The last bit of work to do is to join this dataset with the Spotify data so that we can compare the results we see here with the audio-based valence score.
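The code that builds `joni_val_sent` and `joni_model` is not shown in this post, so here is a minimal sketch of what that step could look like. Everything in it is an assumption: the helper `join_sentiment_valence()` is hypothetical, the toy tables and their valence numbers are made up, and I am assuming the Spotify data has `track_name` and `valence` columns and that the model is a simple linear regression of valence on average sentiment (which matches the axes of the plot below).

```r
library(dplyr)

# Hypothetical helper: match songs on a normalised title and keep valence.
# Assumes `sentence_df` has track_title and `spotify_df` has track_name, valence.
join_sentiment_valence <- function(sentence_df, spotify_df) {
  sentence_df %>%
    as_tibble() %>%
    mutate(track_key = tolower(trimws(track_title))) %>%
    inner_join(
      spotify_df %>%
        mutate(track_key = tolower(trimws(track_name))) %>%
        select(track_key, valence),
      by = "track_key"
    )
}

# Toy inputs, made up for illustration (not real songs or real scores)
toy_sentence <- tibble(
  album_name = "Album X",
  track_title = c("Song A", "Song B", "Song C", "Song D"),
  ave_sentiment = c(0.04, 0.14, 0.10, -0.05)
)
toy_spotify <- tibble(
  track_name = c("song a", "Song B ", "Song C", "Song D"),
  valence = c(0.30, 0.60, 0.20, 0.50)
)

toy_val_sent <- join_sentiment_valence(toy_sentence, toy_spotify)

# Simple linear model: valence explained by average lyric sentiment
toy_model <- lm(valence ~ ave_sentiment, data = toy_val_sent)
coef(toy_model)
```

With the real data the call would be something like `join_sentiment_valence(joni_sentence_df, joni_spotify)`. Matching on title alone can pair up the wrong rows when the same song appears on more than one album, so adding the album name to the join key is worth considering.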

joni_plot1 <- joni_val_sent %>%
  distinct() %>%
  ggplot(aes(ave_sentiment, valence)) +
  geom_point(color = "#460069") +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "#BF406C") +
  theme_alex() +
  theme(
    legend.position = "bottom"
  ) +
  labs(
    x = "Average song sentiment",
    y = "Valence score",
    title = "Relationship between valence and average song sentiment in \nJoni Mitchell song"
  )

# get_regression_points() comes from the moderndive package
joni_plot2 <- moderndive::get_regression_points(joni_model) %>%
  ggplot(aes(residual)) +
  geom_histogram(
    binwidth = 0.05, color = "white", alpha = 0.75,
    fill = "#BF406C"
  ) +
  theme_alex() +
  labs(
    y = NULL,
    x = "Residual",
    title = "Normality of residuals"
  )

joni_plot1 / joni_plot2 /
  moderndive::get_regression_table(joni_model) %>%
    as.data.frame() %>%
    gridExtra::tableGrob(
      rows = NULL
    )

And there we have it, the final plot for this post. We do indeed see a relationship between the positivity of Joni’s lyrics and the positivity of the song audio.

And to end this post on a positive note why not point you to the most positive song Joni ever wrote:

As well as these playlists I put together on Spotify with Joni’s happiest and saddest songs based on valence and average song sentiment.