Hello everyone, welcome to the first post of my new site. On here I’ll be sharing code I’ve written and analysis I’ve done on topics of interest as well as music I’ve written. In this post I’ll be thinking through Joni Mitchell’s lyrics and music.

Let’s get into it. First, we’ll load some packages and create a theme to use in the visualizations.

# Load packages using the pacman package.
pacman::p_load(
  rvest,
  tidyverse,
  xml2,
  tidytext,
  tidymodels,
  tokenizers,
  here,
  spotifyr,
  lubridate,
  patchwork,
  genius
)

# Create a ggplot2 theme
theme_alex <- function() {
  font <- "Arial"
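  # Only the theme() call below is returned: a partial theme that is layered
  # on top of the plot's existing theme (so `& theme_alex()` keeps earlier tweaks)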
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_line(
      color = "#cbcbcb"
    ),
    panel.grid.major.x = element_blank(),
    panel.background = element_blank(),
    strip.background = element_rect(
      fill = "white"
    ),
    strip.text = element_text(
      hjust = 0,
      color = "#4B636E",
      size = 12
    ),
    axis.ticks = element_blank(),
    plot.title = element_text(
      family = font,
      size = 20,
      face = "bold",
      hjust = 0.5,
      vjust = 2,
      color = "#A100FF"
    ),
    plot.subtitle = element_text(
      family = font,
      size = 14,
      color = "#A100FF",
      hjust = 0.5
    ),
    plot.caption = element_text(
      family = font,
      size = 9,
      hjust = 1,
      color = "#4B636E"
    ),
    axis.title = element_text(
      family = font,
      size = 10,
      color = "#4B636E"
    ),
    axis.text = element_text(
      family = font,
      size = 9,
      color = "#4B636E"
    ),
    axis.text.x = element_text(
      margin = margin(5, b = 10)
    ),
    legend.text.align = 0,
    legend.background = element_blank(),
    legend.title = element_blank(),
    legend.key = element_blank(),
    legend.text = element_text(
      family = font,
      size = 18,
      color = "#4B636E"
    )
  )
}

This analysis of Joni Mitchell lyrics will begin with some web scraping to get lyrics from Joni’s own website. Then we will use the spotifyr package to pull in song audio data. Once we have the lyrics and the audio features we will have everything we need to create the desired plot.

The end goal here is to produce a plot that shows the words-per-second and the total words used in each song she has written. I want to see how the structure of Joni’s lyrics has changed over time using these metrics.

Web scraping Joni’s site to get song lyrics

First thing to do is to rummage through Joni’s site a bit to figure out where her lyrics are and how they are presented.

Looks like her lyrics live on pages whose URLs start with https://jonimitchell.com/music/ followed by song.cfm?id= and a number for each one of her songs. We can scrape the song list page and produce a list of song lyric locations.

joni_url <- "https://jonimitchell.com/music/songlist.cfm"

joni_song_list <- read_html(joni_url) %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  as.data.frame() %>%
  rename(pages = 1) %>%
  filter(str_sub(pages, 1, 4) == "song")

head(joni_song_list$pages)

## [1] song.cfm?id=107 song.cfm?id=118 song.cfm?id=129 song.cfm?id=36 
## [5] song.cfm?id=140 song.cfm?id=306
## 263 Levels: / /biography/ /chronology/ /contact.cfm /credits.cfm ... song.cfm?id=99
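The filtering and URL-building steps can be tried offline. This is a minimal base R sketch; the `hrefs` vector is made up to resemble the links on the song list page and is not scraped output.

```r
# Hypothetical hrefs standing in for the links scraped from the song list page
hrefs <- c("/", "/biography/", "song.cfm?id=107", "song.cfm?id=36", "/contact.cfm")

# Keep only the links that point at a song page
song_pages <- hrefs[startsWith(hrefs, "song")]

# Prefix each song page with the music section of the site
base_url <- "https://jonimitchell.com/music/"
song_urls <- paste0(base_url, song_pages)

print(song_urls)
```

Only the two `song.cfm` links survive the filter, and each one becomes a full URL that can be handed to a scraper.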

Now we can create a function that picks up the song lyrics from a single link and then apply it to each one of these links.

songlinks <- paste0("https://jonimitchell.com/music/", joni_song_list$pages)

# Initialize a data frame to store the results
joni_lyrics <- data.frame(
  song_name = character(),
  song_author = character(),
  lyrics_scraped = character()
)

# Function to scrape lyrics
get_lyrics <- function(x) {
  download.file(x, destfile = "scrapedpage.html", quiet = TRUE)

  # Parse the downloaded page once and reuse it for every field
  page <- read_html("scrapedpage.html")

  # Get lyrics
  lyrics_scraped <- page %>%
    html_nodes("div.songlyrics p") %>%
    html_text()

  # Get song name
  song_name <- page %>%
    html_nodes("h2") %>%
    html_text()

  # Get song author
  song_author <- page %>%
    html_nodes("p.author") %>%
    html_text()

  song_name <- as.character(song_name[1])
  song_author <- as.character(song_author[1])
  lyrics_scraped <- as.character(lyrics_scraped[1])

  # Combine data into single dataframe
  df <- data.frame(
    song_name,
    song_author,
    lyrics_scraped
  )

  # Append df to joni_lyrics
  joni_lyrics <- bind_rows(joni_lyrics, df)

  return(joni_lyrics)
}

# Apply get_lyrics functions to all the pages in the songlinks
joni_lyrics <- lapply(songlinks, get_lyrics)

# Unlist into a data frame
joni_lyrics <- do.call(rbind.data.frame, joni_lyrics)

# Save results
saveRDS(joni_lyrics, here::here("joni_lyrics.RDS"))

# Display results
head(as_tibble(joni_lyrics))

## # A tibble: 6 x 3
##   song_name     song_author                  lyrics_scraped                     
##   <fct>         <fct>                        <fct>                              
## 1 All I Want    by Joni Mitchell             "I am on a lonely road and I am tr~
## 2 Amelia        by Joni Mitchell             "I was driving across the burning ~
## 3 Answer Me, M~ by Gerhard Winkler and Fred~ "Answer me\r\r\nOh my love\r\r\nJu~
## 4 The Arrangem~ by Joni Mitchell             "You could have been more\r\r\nTha~
## 5 At Last       by Mack Gordon and Harry Wa~ "At last\r\r\nMy love has come alo~
## 6 Bad Dreams    by Joni Mitchell             "The cats are in the flower bed\r\~
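The pattern used above (one small data frame per page, then stacking them) can be sketched without touching the network. Everything here is illustrative: `fake_pages` and `parse_page()` are hypothetical stand-ins for the downloaded pages and the scraping function.

```r
# Made-up "pages"; a real run would parse HTML instead
fake_pages <- list(
  list(title = "Song A", author = "by Joni Mitchell", lyrics = "la la la"),
  list(title = "Song B", author = "by Someone Else", lyrics = "do re mi")
)

# Hypothetical parser: turns one page into a one-row data frame
parse_page <- function(page) {
  data.frame(
    song_name = page$title,
    song_author = page$author,
    lyrics_scraped = page$lyrics,
    stringsAsFactors = FALSE
  )
}

# Apply the parser to every page, then stack the rows into one data frame
rows <- lapply(fake_pages, parse_page)
lyrics_df <- do.call(rbind, rows)

print(lyrics_df)
```

Because every call returns the same three columns, `do.call(rbind, rows)` can stack the results without any extra bookkeeping.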

Tokenize Joni lyrics

Now that we have the lyrics we need to tokenize them. There are a lot of useful and official ways to describe tokenizing. The way I like to think about it in this particular exercise is as making the lyric data tidy. We can think of lyrics_scraped as an array of words that we want to unlist and then pivot into a single column. That’s what we will do here.

joni_token <- as_tibble(joni_lyrics) %>%
  mutate_all(., ~ as.character(.)) %>%
  # we only want the lyrics that Joni wrote
  filter(song_author == "by Joni Mitchell") %>%
  unnest_tokens(word,
    lyrics_scraped,
    token = "words",
    strip_punct = FALSE
  ) %>%
  # Let's get rid of all the instances where the word is just punctuation.
  filter(!word %in% c(
    ",", "-", ".", "!", "?", '"', ")", "(", "*", "'", "’",
    ";", ":", "[", "]"
  ))

head(joni_token)

## # A tibble: 6 x 3
##   song_name  song_author      word  
##   <chr>      <chr>            <chr> 
## 1 All I Want by Joni Mitchell i     
## 2 All I Want by Joni Mitchell am    
## 3 All I Want by Joni Mitchell on    
## 4 All I Want by Joni Mitchell a     
## 5 All I Want by Joni Mitchell lonely
## 6 All I Want by Joni Mitchell road
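To make the "unlist and pivot" idea concrete, here is a rough base R sketch on two made-up lyric strings. It splits on anything that is not a letter or an apostrophe, which is an assumption chosen for simplicity and not the rule set that unnest_tokens() actually applies.

```r
# Made-up lyrics, one row per song
lyrics <- data.frame(
  song_name = c("Song A", "Song B"),
  lyrics_scraped = c("I am on a road", "Oh my love,\r\nanswer me"),
  stringsAsFactors = FALSE
)

# Lower-case each lyric and split it into a vector of words
words_by_song <- strsplit(tolower(lyrics$lyrics_scraped), "[^a-z']+")

# Repeat each song name once per word so there is one row per word
tokens <- data.frame(
  song_name = rep(lyrics$song_name, lengths(words_by_song)),
  word = unlist(words_by_song),
  stringsAsFactors = FALSE
)

# Drop any empty strings left over from the split
tokens <- tokens[tokens$word != "", ]

print(tokens)
```

Each song started as a single row and ends up as one row per word, which is the tidy shape the rest of the analysis relies on.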

One thing to note here is that we did not get rid of any words. Normally in natural language processing pipelines we get rid of less important words and stem the remaining words. Here we want to keep all the words because the end goal is to understand words-per-song - not important-words-per-song or words-per-song excluding common words like “me”, “our”, etc.

joni_token %>%
  group_by(song_name) %>%
  tally(name = "word_count") %>%
  ungroup() %>%
  mutate(
    song_name = as.factor(song_name),
    song_name = fct_reorder(song_name, word_count)
  ) %>%
  top_n(n = 30, wt = word_count) %>%
  ggplot(aes(song_name, word_count, fill = song_name)) +
  geom_col() +
  theme_alex() +
  coord_flip() +
  labs(
    x = "",
    y = "",
    title = "Joni Mitchell's top 30 songs\nwith the most lyrics",
    caption = "Source: JoniMitchell.com"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 8)
  )

This looks good, but we need to dig further and answer more questions. For example, “Paprika Plains” has the most lyrics of any Joni Mitchell song, but what if the song is 20 minutes long? It would be even more useful to understand the words-per-second for each song. For this we’ll need to get into the audio features available via the spotifyr package.
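The arithmetic behind words-per-second is simple enough to sketch with made-up numbers. The word counts and durations below are hypothetical; the only thing taken from Spotify is that track length is reported in milliseconds.

```r
# Made-up word counts and durations, chosen only to illustrate the arithmetic
songs <- data.frame(
  song = c("long song", "short song"),
  word_count = c(600, 300),
  duration_ms = c(1200000, 200000) # track length in milliseconds
)

# Convert milliseconds to seconds, then divide words by seconds
songs$duration_s <- songs$duration_ms / 1000
songs$words_per_sec <- round(songs$word_count / songs$duration_s, 2)

print(songs[, c("song", "word_count", "words_per_sec")])
```

In this toy example the 20-minute song has twice as many words but only a third of the pace: 0.5 words per second against 1.5.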

Using the spotifyr package to get more song information

I won’t be describing the Spotify setup you will need in order to get the spotifyr package, or any other package that accesses the Spotify API, to work. That information is available on the Genius package page here.

After the initial setup is complete we can begin to get to work. Below we can take a look at the Joni audio data available to use.

joni_spotify <- get_artist_audio_features("joni mitchell")

names(joni_spotify)

##  [1] "artist_name"                  "artist_id"                   
##  [3] "album_id"                     "album_type"                  
##  [5] "album_images"                 "album_release_date"          
##  [7] "album_release_year"           "album_release_date_precision"
##  [9] "danceability"                 "energy"                      
## [11] "key"                          "loudness"                    
## [13] "mode"                         "speechiness"                 
## [15] "acousticness"                 "instrumentalness"            
## [17] "liveness"                     "valence"                     
## [19] "tempo"                        "track_id"                    
## [21] "analysis_url"                 "time_signature"              
## [23] "artists"                      "available_markets"           
## [25] "disc_number"                  "duration_ms"                 
## [27] "explicit"                     "track_href"                  
## [29] "is_local"                     "track_name"                  
## [31] "track_preview_url"            "track_number"                
## [33] "type"                         "track_uri"                   
## [35] "external_urls.spotify"        "album_name"                  
## [37] "key_name"                     "mode_name"                   
## [39] "key_mode"

That’s a lot and it’s awesome. Some of these make sense - “artist_name”, “danceability”, “loudness”. But what in the world is “valence” or “mode”? Joni only has 1 mode: genius. For the definition of all these variables check out Spotify’s official definitions here.

Let’s explore a little by taking a look at a few plots. First thing we can do is use boxplots to group and visualize songs by albums. We’ll also use that “album_release_year” to order the albums oldest to newest and concatenate the album name to the album release year.

Album energy

joni_spotify %>%
  mutate(
    album_name = paste(album_name, album_release_year),
    album_name = fct_reorder(album_name, album_release_year, .desc = TRUE)
  ) %>%
  ggplot(aes(album_name, energy)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    x = "",
    y = "",
    title = "Joni Mitchell album energy"
  ) +
  theme_alex()

Off the bat this tells us a lot of useful information. We can see that on her first few albums, from ’68 to ’72, Joni stuck to writing less intense music. But then something happened.

In 1974 she took a turn and her albums started to have a greater variety of low- to high-energy songs. The range of intensity in her albums kept evolving, with her 1980 album “Shadows and Light” being the most energetically varied album of her career. After that album the range of energy in her albums started to return to her original style.
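The "range of energy" read off the boxplots can also be put into numbers. This is a small base R sketch on made-up values for two hypothetical albums, not on the Spotify data itself.

```r
# Made-up energy values for two hypothetical albums
energy <- data.frame(
  album = c("Album A", "Album A", "Album A", "Album B", "Album B", "Album B"),
  energy = c(0.20, 0.25, 0.30, 0.10, 0.50, 0.90)
)

# Spread between the least and most energetic track on each album
energy_range <- tapply(energy$energy, energy$album, function(x) max(x) - min(x))

print(energy_range)
```

A larger value means the album mixes quiet and intense tracks, which is the kind of spread the later albums show in the plot.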

Favorite musical keys to write in

What else? Let’s take a look at the “key_mode” variable to understand Joni Mitchell’s favorite key to write in.

joni_spotify %>%
  count(key_mode, sort = TRUE) %>%
  top_n(10) %>%
  kableExtra::kable(
    row.names = NA,
    col.names = c("Key", "Song Count"),
    caption = "Joni Mitchell favorite music keys to write in",
    align = "c"
  )

Table 1: Joni Mitchell favorite music keys to write in
Key	Song Count
D major	53
C major	31
G major	27
F major	18
C# major	14
G# major	14
A# major	11
B major	11
A major	10
A minor	10
E major	10
F# major	10

I don’t know if you play guitar, but seeing this makes complete sense. Joni is a guitar player known for her revolutionary use of alternate guitar tunings. Guitar players have favorite keys to compose in due to the ease of chord shapes along the guitar neck. It’s more difficult to play the chords in the key of C# major than the chords in the key of C major, especially at the top of the neck.

Valence

What else? Let’s take a look at that mysterious “valence” variable. Valence ranges from 0 to 1 and “describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)”.

What we’ll do here is create some buckets so we can get a general 30,000 foot view of Joni Mitchell song valence.
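As a quick illustration of the bucketing idea, here is a base R sketch on a few made-up valence scores. It rounds each score to one decimal, scales it to a whole number from 0 to 10, and counts how many songs land in each bucket.

```r
# Made-up valence scores between 0 and 1
valence <- c(0.04, 0.12, 0.18, 0.47, 0.52, 0.53, 0.88, 0.96)

# Round to one decimal and scale so the buckets run from 0 to 10
bucket <- round(valence, 1) * 10

# Count the songs in each bucket, keeping empty buckets in the table
bucket_counts <- table(factor(bucket, levels = 0:10))

print(bucket_counts)
```

The counts always add up to the number of songs, which is a handy sanity check on any bucketed table.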

joni_spotify %>%
  arrange(-valence) %>%
  mutate(`Valence Bucket` = round(valence, 1) * 10) %>%
  group_by(`Valence Bucket`) %>%
  # Count the songs in each bucket; naming the bucket column as a weight
  # would sum the bucket values instead of counting rows
  tally(name = "Song Count") %>%
  kableExtra::kable(
    align = "c",
    caption = "Joni Mitchell Song Valence"
  )

Table 2: Joni Mitchell Song Valence
Valence Bucket	Song Count
0	n/a
1	30
2	58
3	35
4	30
5	28
6	21
7	18
8	22
9	10
10	4

This is really interesting as well. Joni’s writing is extremely personal. In her music you can hear the full range of her life experiences and we can see that in the data here.

Quick detour while you are here reading this. Let me take a moment to recommend the book Reckless Daughter: A Portrait of Joni Mitchell by David Yaffe. It is excellent, and writing this post after reading the book is just a treat.

Words-per-second vs. total word count per song/album

At this point we have all the data we need to put together a plot that maps each song’s words-per-second and compares it to the total words in each song. I am going to group the results by album using a boxplot as I did before. One of the reasons to do this is that albums are generally put together with a theme, goal, or general direction in mind. It’s rare for an album, especially one by a visionary artist like Joni, to be a group of completely disparate songs.

That doesn’t mean that a song by song analysis should not be considered though. Looking at individual songs can show us the absolute range spanning across Joni’s entire career. We did a little bit of this above when looking at Joni’s favorite musical keys to write in and looking at the overall valence of all her songs. For now we’ll stick to grouping songs by album.

To begin we need to clean the data up a little bit.

# Clean track names and filter out a duplicate album
joni_spotify_2 <- joni_spotify %>%
  mutate(
    track_name = str_remove_all(track_name, " - Live"),
    track_name = str_remove_all(track_name, " \\(.*\\)"),
    track_name = str_trim(track_name, "both")
  ) %>%
  filter(album_name != "Shine [Standard Jewel - Parts Order Only]")

# Clean up some more and join to lyrics
joni_lyrics_dates <- joni_spotify_2 %>%
  transmute(
    album_release_date,
    song_name = tolower(track_name),
    album_name,
    album_release_date_precision,
    album_release_date = ifelse(
      album_release_date_precision == "year",
      paste0(album_release_date, "-01-01"),
      album_release_date
    ),
    album_release_date = as.Date(album_release_date, format = "%Y-%m-%d")
  ) %>%
  full_join(
    (joni_lyrics %>%
      mutate(song_name = tolower(song_name))),
    by = "song_name"
  )

# Tokenize the joined lyrics, which now carry album names and release dates.
joni_lyrics_dates <- joni_lyrics_dates %>%
  mutate(lyrics_scraped = as.character(lyrics_scraped)) %>%
  unnest_tokens(word,
    lyrics_scraped,
    token = "words",
    strip_punct = FALSE
  ) %>%
  filter(!word %in% c(
    ",", "-", ".", "!", "?", '"', ")", "(", "*", "'", "’",
    ";", ":", "[", "]"
  ))
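The name clean-up and join are easier to follow on a tiny example. This base R sketch uses made-up Spotify-style track names, durations, and word counts; `gsub()` and `merge()` stand in for the stringr and dplyr calls used above.

```r
# Made-up track names in the style Spotify returns them
track_name <- c("Big Yellow Taxi - Live", "Free Man in Paris (Remastered)", " Amelia ")

# Strip the live marker and any parenthetical, then trim and lower-case
clean <- gsub(" - Live", "", track_name, fixed = TRUE)
clean <- gsub(" \\(.*\\)", "", clean)
clean <- tolower(trimws(clean))

# Hypothetical audio data and lyric word counts keyed by the cleaned name
spotify <- data.frame(
  song_name = clean,
  duration_ms = c(180000, 200000, 360000),
  stringsAsFactors = FALSE
)
lyrics <- data.frame(
  song_name = tolower(c("Amelia", "Big Yellow Taxi")),
  n_words = c(250, 150),
  stringsAsFactors = FALSE
)

# Full join: keep every row from both sides, filling gaps with NA
joined <- merge(spotify, lyrics, by = "song_name", all = TRUE)

print(joined)
```

The track without matching lyrics stays in the result with `NA` for its word count, which mirrors what a full join does with songs that exist on only one side.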

Now we are ready to plot. I’ll create the plots separately and join them using the patchwork package. You can find this package (here).

# Plot 1
joni_sl <- joni_lyrics_dates %>%
  group_by(song_name, album_release_date, song_author) %>%
  tally() %>%
  arrange(desc(n)) %>%
  top_n(n = 1, n) %>%
  ungroup() %>%
  mutate(
    song_year = lubridate::year(album_release_date),
    song_name = as.factor(song_name)
  ) %>%
  filter(
    song_author == "by Joni Mitchell",
    !is.na(song_year)
  ) %>%
  ggplot(aes(
    as.factor(as.character(song_year)),
    n
  )) +
  geom_boxplot(outlier.alpha = 0) +
  geom_jitter(
    alpha = 0.75,
    aes(color = as.factor(as.character(song_year)))
  ) +
  theme_alex() +
  theme(
    axis.text.x = element_text(angle = 45),
    axis.text.y = element_blank(),
    plot.title = element_text(size = 15, face = "bold", margin = margin(10, 0, 10, 0)),
    legend.position = "none"
  ) +
  labs(
    x = "",
    y = "Word count"
  ) +
  coord_flip()

# Plot 2
# Count words once per song. joni_lyrics_dates repeats a song's words for every
# Spotify release it matched, so the per-song tokens in joni_token are used here.
jm_word_count <- joni_token %>%
  mutate(track_name = tolower(str_trim(song_name, "both"))) %>%
  count(track_name, sort = TRUE)

joni_wps <- joni_spotify_2 %>%
  # Drop this album while album_name still holds the bare title
  filter(album_name != "Both Sides Now") %>%
  mutate(
    track_name = tolower(str_trim(track_name, "both"))
  ) %>%
  full_join(jm_word_count, by = "track_name") %>%
  mutate(
    album_name = paste(album_name, album_release_year),
    album_name = fct_reorder(
      as.factor(album_name),
      album_release_year
    ),
    word_count = n,
    duration_s = round(as.numeric(duration_ms) * 0.001, 2),
    words_per_sec = round(word_count / duration_s, 2)
  ) %>%
  # Rows missing a word count or a duration are NA here and are dropped as well
  filter(words_per_sec != 0) %>%
  ggplot(aes(words_per_sec, as.factor(album_name))) +
  geom_boxplot(outlier.alpha = 0, alpha = 0.25) +
  geom_jitter(alpha = 0.75, aes(color = as.factor(album_name))) +
  # facet_wrap(~ album_name, scales = "free_y") +
  theme_alex() +
  labs(
    x = "Words-per-second",
    y = "",
    color = ""
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45),
    plot.title = element_text(size = 15, face = "bold", margin = margin(10, 0, 10, 0))
  )

joni_wps + joni_sl + plot_annotation(
  title = "Joni Mitchell song analysis",
  subtitle = "Words-per-second and words-per-song",
  caption = "Data source: Jonimitchell.com, \n Spotify"
) &
  theme_alex()

There is the final plot! As with the energy plot above, we can see how the 1974 album “Court and Spark” was the creative turning point for Joni. Not only did she have a greater range of intensity on this album and afterwards, she also played around a lot more with her lyrics: she had a wider range of words-per-second and total word count. One thing I know from reading a lot about Joni’s career is that “Court and Spark” was her best-selling record. Her decision to let loose with her creativity, employing a wider variety of emotions in her music, paid off. Personally, this also happens to be my favorite era of Joni’s career. Some of my favorite songs, like “Amelia” and “Help Me”, came from this era.

In upcoming posts I plan on digging even deeper into Joni’s music and career. I hope to be able to share other ways to look at this data and apply some machine learning to better understand the difference between her songs and albums.

Taking a look at Joni Mitchell's lyrics