H Twitter Hashtags
In this appendix, we will create a table of the top five hashtags used in conjunction with #NationalComingOutDay, the total number of tweets in each hashtag, the total number of likes, and the top tweet for each hashtag.
library(tidyverse) # data wrangling functions
library(rtweet) # for searching tweets
library(glue) # for pasting strings
library(kableExtra) # for nice tables
The example below uses data you can download from National Coming Out Day 2021, but we encourage you to try a hashtag that interests you.
# load tweets
tweets <- search_tweets(q = "#NationalComingOutDay",
n = 30000,
include_rts = FALSE)
# save them to a file so you can skip this step in the future
saveRDS(tweets, file = "data/ncod_tweets.rds")
# load tweets from the file
tweets <- readRDS("data/ncod_tweets.rds")
H.1 Select relevant data
The function select()
is useful for just keeping the variables (columns) you need to work with, which can make working with very large datasets easier. The arguments to select()
are simply the names of the variables and the resulting table will present them in the order you specify.
H.2 Unnest columns
Look at the dataset using View(tweets_with_hashtags)
or clicking on it in the Environment tab. You'll notice that the variable hashtags
has multiple values in each cell (i.e., when users used more than one hashtag in a single tweet). In order to work with this information, we need to separate each hashtag so that each row of data represents a single hashtag. We can do this using the function unnest()
and adding a pipeline of code.
tweets_with_hashtags <- tweets %>%
select(hashtags, text, favorite_count, media_url) %>%
unnest(cols = hashtags)
Look at tweets_with_hashtags
to see how it is different from the table tweets
. WHy does it have more rows?
H.3 Top 5 hashtags
To get the top 5 hashtags we need to know how tweets used each one. This code uses pipes to build up the analysis. When you encounter multi-pipe code, it can be very useful to run each line of the pipeline to see how it builds up and to check the output at each step. This code:
- Starts with the object
tweets_with_hashtags
and then; - Counts the number of tweets for each hashtag using
count()
and then; - Filters out any blank cells using
!is.na()
(you can read this as "keep any row value where it is not true (!
) thathashtags
is missing") and then; - Returns the top five values using
slice_max()
and orders them by then
column.
top5_hashtags <- tweets_with_hashtags %>%
count(hashtags) %>%
filter(!is.na(hashtags)) %>% # get rid of the blank value
slice_max(order_by = n, n = 5)
top5_hashtags
hashtags | n |
---|---|
NationalComingOutDay | 27308 |
nationalcomingoutday | 1343 |
LGBTQ | 829 |
IndigenousPeoplesDay | 811 |
ComingOutDay | 613 |
Two of the hashtags are the same, but with different case. We can fix this by adding in an extra line of code that uses mutate()
to overwrite the variable hashtag
with the same data but transformed to lower case using tolower()
. Since we're going to use the table tweets_with_hashtags
a few more times, let's change that table first rather than having to fix this every time we use the table.
tweets_with_hashtags <- tweets_with_hashtags %>%
mutate(hashtags = tolower(hashtags))
top5_hashtags <- tweets_with_hashtags %>%
count(hashtags) %>%
filter(!is.na(hashtags)) %>% # get rid of the blank value
slice_max(order_by = n, n = 5)
top5_hashtags
hashtags | n |
---|---|
nationalcomingoutday | 28698 |
lgbtq | 1036 |
indigenouspeoplesday | 837 |
comingoutday | 676 |
loveislove | 396 |
H.4 Top tweet per hashtag
Next, get the top tweet for each hashtag using filter()
. Use group_by()
before you filter to select the most-liked tweet in each hashtag, rather than the one with most likes overall. As you're getting used to writing and running this kind of multi-step code, it can be very useful to take out individual lines and see how it changes the output to strengthen your understanding of what each step is doing.
top_tweet_per_hashtag <- tweets_with_hashtags %>%
group_by(hashtags) %>%
filter(favorite_count == max(favorite_count)) %>%
sample_n(size = 1) %>%
ungroup()
The function slice_max()
accomplishes the same thing as the filter()
and sample_n()
functions above. Look at the help for this function and see if you can figure out how to use it.
H.5 Total likes per hashtag
Get the total number of likes per hashtag by grouping and summarising with sum()
.
H.6 Put it together
We can put everything together using left_join()
(see ChapterΒ @ref(left_join)). This will keep everything from the first table specified and then add on the relevant data from the second table specified. In this case, we add on the data in top_tweet_per_hashtag
and like_per_hashtag
but only for the tweets included in top5_hashtags
H.7 Twitter data idiosyncrasies
Before we can finish up though, there's a couple of extra steps we need to add in to account for some of the idiosyncrasies of Twitter data.
First, the @
symbol is used by R Markdown for referencing (see Chapter Β 2.6.3). It's likely that some of the tweets will contain this symbol, so we can use mutate to find any instances of @
and escape them using backslashes. Backslashes create a literal version of characters that have a special meaning in R, so adding them means it will print the @
symbol without trying to create a reference. Of course \
also has a special meaning in R, which means we also need to backslash the backslash. Isn't programming fun? We can use the same code to tidy up any ampersands (&), which sometimes display as "&".
Second, if there are multiple images associated with a single tweet, media_url
will be a list, so we use unlist()
. This might not be necessary for a different set of tweets; use glimpse()
to check the data types.
Finally, we use select()
to tidy up the table and just keep the columns we need.
top5 <- top5_hashtags %>%
left_join(top_tweet_per_hashtag, by = "hashtags") %>%
left_join(likes_per_hashtag, by = "hashtags") %>%
# replace @ with \@ so @ doesn't trigger referencing
mutate(text = gsub("@", "\\\\@", text),
text = gsub("&", "&", text)) %>%
# media_url can be a list if there is more than one image
mutate(image = unlist(media_url)) %>%
# put the columns you want to display in order
select(hashtags, n, total_likes, text, image)
top5
hashtags | n | total_likes | text | image |
---|---|---|---|---|
nationalcomingoutday | 28698 | 851510 | itβs #nationalcomingoutday π hereβs a pic of how I came out back in 2003 xx https://t.co/spBmHhF6p4 | http://pbs.twimg.com/media/FBayvGYXsAY-5hZ.jpg |
lgbtq | 1036 | 13691 | It takes bravery to life an authentic life. While I was not able to come out as a member of the #LGBTQ community on my own terms, if youβre ready & can safely do so, then I support you! And if youβre not quite there yet, I support you exactly where you are. #NationalComingOutDay https://t.co/dr61oyhR3L | http://pbs.twimg.com/media/FBcJiNUXIA4LTmr.jpg |
indigenouspeoplesday | 837 | 14073 | To all my Two Spirit brothers and sisters β I see you β I celebrate you. #NationalComingOutDay #IndigenousPeoplesDay https://t.co/KsZ5F3gBKO | http://pbs.twimg.com/media/FBdQIf7UYAQjEAW.jpg |
comingoutday | 676 | 6977 |
KβIβm out. Bi ππΌ #ComingOutDay #NationalComingOutDay |
NA |
loveislove | 396 | 4033 |
HAPPY NATIONAL COMING OUT DAY!! π³οΈβπβ€οΈπ§‘πππππ³οΈβπ @msmadig #OutAndProud #Queer #loveislove #NationalComingOutDay https://t.co/DVfKJsCqNQ |
http://pbs.twimg.com/ext_tw_video_thumb/1447698152463626242/pu/img/pZor72nSNDPn8KiP.jpg |
H.8 Make it prettier
Whilst this table now has all the information we want, it isn't great aesthetically. The kableExtra
package has functions that will improve the presentation of tables. We're going to show you two examples of how you could format this table.
The first is (relatively) simple and stays within the R programming language using functionality from kableExtra
. The main aesthetic feature of the table is the incorporation of the pride flag colours for each row. Each row is set to a different colour of the pride flag and the text is set to be black and bold to improve the contrast. We've also removed the image
column, as it just contains a URL.
# the hex codes of the pride flag colours, obtained from https://www.schemecolor.com/lgbt-flag-colors.php
# the last two characters (80) make the colours semi-transparent.
# omitting them or setting to FF make them 100% opaque
pride_colours <- c("#FF001880",
"#FFA52C80",
"#FFFF4180",
"#00801880",
"#0000F980",
"#86007D80")
top5 %>%
select(-image) %>%
kable(col.names = c("Hashtags", "No. tweets", "Likes", "Tweet"),
caption = "Stats and the top tweet for the top five hashtags.",
) %>%
kable_paper() %>%
row_spec(row = 0:5, bold = T, color = "black") %>%
row_spec(row = 0, font_size = 18,
background = pride_colours[1]) %>%
row_spec(row = 1, background = pride_colours[2])%>%
row_spec(row = 2, background = pride_colours[3])%>%
row_spec(row = 3, background = pride_colours[4])%>%
row_spec(row = 4, background = pride_colours[5])%>%
row_spec(row = 5, background = pride_colours[6])
Hashtags | No. tweets | Likes | Tweet |
---|---|---|---|
nationalcomingoutday | 28698 | 851510 | itβs #nationalcomingoutday π hereβs a pic of how I came out back in 2003 xx https://t.co/spBmHhF6p4 |
lgbtq | 1036 | 13691 | It takes bravery to life an authentic life. While I was not able to come out as a member of the #LGBTQ community on my own terms, if youβre ready & can safely do so, then I support you! And if youβre not quite there yet, I support you exactly where you are. #NationalComingOutDay https://t.co/dr61oyhR3L |
indigenouspeoplesday | 837 | 14073 | To all my Two Spirit brothers and sisters β I see you β I celebrate you. #NationalComingOutDay #IndigenousPeoplesDay https://t.co/KsZ5F3gBKO |
comingoutday | 676 | 6977 |
KβIβm out. Bi ππΌ #ComingOutDay #NationalComingOutDay |
loveislove | 396 | 4033 |
HAPPY NATIONAL COMING OUT DAY!! π³οΈβπβ€οΈπ§‘πππππ³οΈβπ @msmadig #OutAndProud #Queer #loveislove #NationalComingOutDay https://t.co/DVfKJsCqNQ |
H.9 Customise with HTML
An alternative approach incorporates HTML and also uses the package glue
to combine information from multiple columns.
First, we use mutate()
to create a new column col1
that combines the first three columns into a single column and adds some formatting to make the hashtag bold (<strong>
) and insert line breaks (<br>
). We'll also change the image column to display the image using html if there is an image.
If you're not familiar with HTML, don't worry if you don't understand the below code; the point is to show you the full extent of the flexibility available.
top5 %>%
mutate(col1 = glue("<strong>#{hashtags}</strong>
<br>
tweets: {n}
<br>
likes: {total_likes}"),
img = ifelse(!is.na(image),
glue("<img src='{image}' width='200px' />"),
"")) %>%
select(col1, text, img) %>%
kable(
escape = FALSE, # allows HTML in the table
col.names = c("Hashtag", "Top Tweet", ""),
caption = "Stats and the top tweet for the top five hashtags.") %>%
column_spec(1:2, extra_css = "vertical-align: top;") %>%
row_spec(0, extra_css = "vertical-align: bottom;") %>%
kable_paper()
Hashtag | Top Tweet | |
---|---|---|
#nationalcomingoutday
tweets: 28698 likes: 851510 |
itβs #nationalcomingoutday π hereβs a pic of how I came out back in 2003 xx https://t.co/spBmHhF6p4 | |
#lgbtq
tweets: 1036 likes: 13691 |
It takes bravery to life an authentic life. While I was not able to come out as a member of the #LGBTQ community on my own terms, if youβre ready & can safely do so, then I support you! And if youβre not quite there yet, I support you exactly where you are. #NationalComingOutDay https://t.co/dr61oyhR3L | |
#indigenouspeoplesday
tweets: 837 likes: 14073 |
To all my Two Spirit brothers and sisters β I see you β I celebrate you. #NationalComingOutDay #IndigenousPeoplesDay https://t.co/KsZ5F3gBKO | |
#comingoutday
tweets: 676 likes: 6977 |
KβIβm out. Bi ππΌ #ComingOutDay #NationalComingOutDay |
|
#loveislove
tweets: 396 likes: 4033 |
HAPPY NATIONAL COMING OUT DAY!! π³οΈβπβ€οΈπ§‘πππππ³οΈβπ @msmadig #OutAndProud #Queer #loveislove #NationalComingOutDay https://t.co/DVfKJsCqNQ |