This appendix was inspired by Michael Mullarkey’s tutorial, which you can follow to make beautiful dot plots out of your own Spotify data. This tutorial doesn’t require you to use Spotify, just to create a developer account so you can access their data API with spotifyr (https://www.rcharlie.com/spotifyr/).
library(usethis)   # to set system environment variables
library(spotifyr)  # to access Spotify
library(tidyverse) # for data wrangling
The package spotifyr has instructions for setting up a developer account with Spotify and setting up an “app” so you can get authorisation codes.
Once you’ve set up the app, you can copy the client ID and secret to your R environment file. The easiest way to do this is with edit_r_environ() from usethis. Setting scope to “user” makes this available to any R project on your computer, while setting it to “project” makes it only available to this project.
usethis::edit_r_environ(scope = "user")
Add the following text to your environment file (don’t delete anything already there), replacing the zeros with your personal ID and secret. Save and close the file and restart R.
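The entries should look something like this (the zeros here are just placeholders for your own values):

SPOTIFY_CLIENT_ID="0000000000000000000000000000000"
SPOTIFY_CLIENT_SECRET="0000000000000000000000000000000"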
Double check that it worked by typing the following into the console. Don’t put it in your script unless you mean to share this confidential info. You should see your values, not empty strings, if it worked.
# run in the console, don't save in a script
Sys.getenv("SPOTIFY_CLIENT_ID")
Sys.getenv("SPOTIFY_CLIENT_SECRET")
Now you’re ready to get data from Spotify. There are several types of data that you can download.
J.1 By Artist
Choose your favourite artist and download their discography. Set include_groups to one or more of “album”, “single”, “appears_on”, and “compilation”.
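The later code in this section assumes Lady Gaga’s albums, stored in an object called gaga, which you could download with something like this:

gaga <- get_artist_audio_features(
  artist = "lady gaga",
  include_groups = "album"
)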
Let’s explore the data you get back. Use glimpse() to see what columns are available and what type of data they have. It looks like there is a row for each of this artist’s tracks.
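For example:

glimpse(gaga)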
Let’s answer a few simple questions first.
J.1.1 Tracks per Album
How many tracks are on each album? Some tracks have more than one entry in the table, so first select just the album_name and track_name columns and use distinct() to get rid of duplicates. Then count() the tracks per album. We’re using DT::datatable() to make the table interactive. Try sorting the table by number of tracks.
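One way to build this table, assuming the discography is in gaga:

gaga %>%
  select(album_name, track_name) %>%
  distinct() %>%
  count(album_name, name = "n_tracks") %>%
  DT::datatable()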
Next, let’s look at the tempo of each track. The tracks with a tempo of 0 look like they’re all dialogue, so we should omit these. Let’s also check how variable the tempo is for multiple instances of the same track. A quick way to do this is to group by album and track, then check the standard deviation of the tempo. If it’s 0, all of the values are identical; the bigger it is, the more the values vary. If you have a lot of data with a normal distribution (like a bell curve), then about 68% of the data are within one SD of the mean, and about 95% are within 2 SDs.
If we filter to tracks with SD greater than 0 (so any variation at all), we see that most tracks have a little variation. However, if we filter to tracks with an SD greater than 1, we see a few songs with slightly different tempo, and a few with wildly different tempo.
gaga %>%
  # omit tracks with "Dialogue" in the name
  filter(!str_detect(track_name, "Dialogue")) %>%
  # check for varying tempos for same track
  group_by(album_name, track_name) %>%
  filter(sd(tempo) > 1) %>%
  ungroup() %>%
  select(album_name, track_name, tempo) %>%
  arrange(album_name, track_name)
You can deal with these in any way you choose. Filter out some versions of the songs or listen to them to see which value you agree with and change the others. Here, we’ll deal with it by averaging the values for each track. This will also remove the tiny differences in the majority of duplicate tracks. Now we’re ready to plot.
gaga %>%
  filter(tempo > 0) %>%
  group_by(album_name, track_name) %>%
  summarise(tempo = round(mean(tempo)),
            .groups = "drop") %>%
  ungroup() %>%
  ggplot(aes(x = tempo, fill = ..x..)) +
  geom_histogram(binwidth = 4, show.legend = FALSE) +
  scale_fill_gradient(low = "#521F64", high = "#E8889C") +
  labs(x = "Beats per minute",
       y = "Number of tracks",
       title = "Tempo of Lady Gaga Tracks")
Note
Can you see how we made the gradient fill for the histograms? Since the x-value of each bar depends on the binwidth, you have to use the code ..x.. in the mapping (not tempo) to make the fill correspond to each bar’s value.
This looks OK, but maybe we want a more striking plot. Let’s make a custom plot style and assign it to gaga_style in case we want to use it again. Then add it to the output of the shortcut function last_plot(), which avoids having to retype the code for the last plot we created.
# define style
gaga_style <- theme(
  plot.background = element_rect(fill = "black"),
  text = element_text(color = "white", size = 11),
  panel.background = element_rect(fill = "black"),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.y = element_line(colour = "white", size = 0.2),
  panel.grid.minor.y = element_line(colour = "white", size = 0.2),
  axis.text = element_text(color = "white"),
  plot.title = element_text(hjust = 0.5)
)

## add it to the last plot created
last_plot() + gaga_style
J.2 By Playlist
You need to know the “uri” of a public playlist to access data on it. You can get this by copying the link to the playlist and selecting the 22 characters between “https://open.spotify.com/playlist/” and “?si=…”. Let’s have a look at the Eurovision 2021 playlist.
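A sketch of the download; the URI below is just a placeholder for the real 22-character ID, and euro2021 is the object name the later examples assume:

playlist_uri <- "0000000000000000000000"   # replace with the playlist's real uri
euro2021 <- get_playlist_audio_features(
  username = "spotify",                    # assumed owner of the playlist
  playlist_uris = playlist_uri
)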
Use glimpse() and count() to explore the structure of this table.
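For example, glimpse the whole table, then count any categorical column that looks interesting (key_mode is an assumption about the column names here):

glimpse(euro2021)
count(euro2021, key_mode, sort = TRUE)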
J.2.1 Track ratings
Each track has several ratings: danceability, energy, speechiness, acousticness, instrumentalness, liveness, and valence. I’m not sure how these are determined (almost certainly by an algorithm). Let’s select the track names and these columns to have a look.
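A sketch, assuming the playlist data are in euro2021 and the track names are in track.name:

euro2021 %>%
  select(track.name, danceability, energy, speechiness, 
         acousticness, instrumentalness, liveness, valence)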
What was the general mood of Eurovision songs in 2021? Let’s use plots to assess. First, we need to get the data into long format to make it easier to plot multiple attributes.
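One way to do this with pivot_longer(), continuing from the selection above:

euro_long <- euro2021 %>%
  select(track.name, danceability, energy, speechiness, 
         acousticness, instrumentalness, liveness, valence) %>%
  pivot_longer(cols = danceability:valence,
               names_to = "attribute",
               values_to = "value")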
Plotted on a single scale, the attributes are hard to compare, but you can solve this by putting each attribute into its own facet and letting the y-axis differ between facets by setting scales = "free_y", as in the sketch below. It then becomes easier to see that Eurovision songs tend to have pretty high danceability and energy.
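A sketch of the faceted plot, using the long-format euro_long table from above:

ggplot(euro_long, aes(x = value, fill = attribute)) +
  geom_histogram(bins = 20, show.legend = FALSE) +
  facet_wrap(~attribute, scales = "free_y") +
  labs(x = "Attribute value", y = "Number of tracks")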
Some of the columns in this table contain more tables. For example, each entry in the track.artist column contains a table with columns href, id, name, type, uri, and external_urls.spotify. Use unnest() to extract these tables. If there is more than one artist for a track, this will expand the table. For example, the track “Adrenalina” has two rows now, one for Senhit and one for Flo Rida.
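A sketch, using the nested column named in the text (check glimpse() on your own table for the exact name, as it may differ by spotifyr version):

euro2021 %>%
  unnest(track.artist) %>%
  select(track.name, artist_name = name)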
If you’re a Eurovision nerd (like Emily), try downloading playlists from several previous years and visualise trends. See if you can find lists of the scores for each year and join the data to see what attributes are most related to points.
J.3 By Genre
Select the first 20 artists in the genre “eurovision”. So that people don’t spam the Spotify API, you are limited to 50 artists per request.
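A sketch of the basic call:

euro_genre <- get_genre_artists(
  genre = "eurovision",
  limit = 20,
  offset = 0
)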
There is a programmatic way to make several calls to a function that limits you. You usually want to set this up so that you are waiting a few seconds or minutes between calls so that you don’t get locked out (depending on how strict the API is). Use map_df() to automatically join the results into one big table.
# create a slow version of get_genre_artists 
# that delays 2 seconds after running
slow_get_genre_artists <- slowly(get_genre_artists, 
                                 rate = rate_delay(2))

# set 4 offsets from 0 to 150 by 50
offsets <- seq(0, 150, 50)

# run the slow function once for each offset
euro_genre200 <- map_df(.x = offsets, 
                        .f = ~slow_get_genre_artists("eurovision", 
                                                     limit = 50, 
                                                     offset = .x))
J.4 By Track
You can get even more info about a specific track if you know its Spotify ID. You can get this from the artist, album, or playlist tables above.
# get the ID for Born This Way from the original album
btw_id <- gaga %>%
  filter(track_name == "Born This Way", 
         album_name == "Born This Way") %>%
  pull(track_id)
J.4.1 Features
Features are a list of summary attributes of the track. These are also included in the previous tables, so this function isn’t very useful unless you are getting track IDs directly.
btw_features <- get_track_audio_features(btw_id)
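Looking at the structure with str() shows the 18 feature columns:

str(btw_features)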
tibble [1 × 18] (S3: tbl_df/tbl/data.frame)
$ danceability : num 0.586
$ energy : num 0.827
$ key : int 11
$ loudness : num -5.1
$ mode : int 1
$ speechiness : num 0.15
$ acousticness : num 0.0037
$ instrumentalness: int 0
$ liveness : num 0.328
$ valence : num 0.508
$ tempo : num 124
$ type : chr "audio_features"
$ id : chr "6aDi4gOE2Cfc6ecynvP81R"
$ uri : chr "spotify:track:6aDi4gOE2Cfc6ecynvP81R"
$ track_href : chr "https://api.spotify.com/v1/tracks/6aDi4gOE2Cfc6ecynvP81R"
$ analysis_url : chr "https://api.spotify.com/v1/audio-analysis/6aDi4gOE2Cfc6ecynvP81R"
$ duration_ms : int 260253
$ time_signature : int 4
J.4.2 Analysis
The analysis gives you seven different tables of details about the track. Use the names() function to see their names and look at each object to see what information it contains.
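First, download the analysis; the btw_analysis object used below comes from a call like this, using the track ID from above:

btw_analysis <- get_track_audio_analysis(btw_id)
names(btw_analysis)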
meta gives you a list of some info about the analysis.
track gives you a list of attributes, including duration, loudness, end_of_fade_in, start_of_fade_out, and time_signature. Some of this info was available in the previous tables.
bars, beats, and tatums are tables with the start, duration and confidence for each bar, beat, or tatum of music (whatever a “tatum” is).
sections is a table with the start, duration, loudness, tempo, key, mode, and time signature for each section of music, along with confidence measures for each.
segments is a table with information about the loudness, pitch, and timbre of short segments of analysis, which tend to be around 0.2 (seconds?) long.
You can use this data to map a song.
ggplot(btw_analysis$segments, 
       aes(x = start, y = loudness_start, color = loudness_start)) +
  geom_point(show.legend = FALSE) +
  scale_colour_gradient(low = "red", high = "purple") +
  scale_x_continuous(breaks = seq(0, 300, 30)) +
  labs(x = "Seconds",
       y = "Loudness",
       title = "Loudness Map for 'Born This Way'") + 
  gaga_style