7  Scatterplots, boxplots, and violin-boxplots

Back in Chapter 3, we introduced you to data visualisation in R/RStudio using the package ggplot2. You developed foundational skills in using the layering system and customising your plots, but we only covered visualising one variable at a time in a histogram or barplot. This gets you a long way, but you typically want to visualise the relationship or difference between variables.

In this chapter, we develop your data visualisation skills to cover scatterplots, boxplots, and violin-boxplots. This will give you the skills to visualise your data in the next few chapters on inferential statistics. This is the final data visualisation specific chapter in the book, but by learning these core concepts, you will be able to find out how to create additional types of data visualisation independently.

Chapter Intended Learning Outcomes (ILOs)

By the end of this chapter, you will be able to:

7.1 Chapter preparation

7.1.1 Introduction to the data set

For this chapter, we are using open data from Zhang et al. (2014). The abstract of their article is:

Although documenting everyday activities may seem trivial, four studies reveal that creating records of the present generates unexpected benefits by allowing future rediscoveries. In Study 1, we used a time-capsule paradigm to show that individuals underestimate the extent to which rediscovering experiences from the past will be curiosity provoking and interesting in the future. In Studies 2 and 3, we found that people are particularly likely to underestimate the pleasure of rediscovering ordinary, mundane experiences, as opposed to extraordinary experiences. Finally, Study 4 demonstrates that underestimating the pleasure of rediscovery leads to time-inconsistent choices: Individuals forgo opportunities to document the present but then prefer rediscovering those moments in the future to engaging in an alternative fun activity. Underestimating the value of rediscovery is linked to people’s erroneous faith in their memory of everyday events. By documenting the present, people provide themselves with the opportunity to rediscover mundane moments that may otherwise have been forgotten.

In summary, they were interested in whether people could predict how interested they would be in rediscovering past experiences. They call it a “time capsule” effect, where people store photos or messages to remind themselves of past events in the future.

At the start of the study (time 1), participants in a romantic relationship wrote about two kinds of experiences. An “extraordinary” experience with their partner on Valentine’s day and an “ordinary” experience one week before. They were then asked how enjoyable, interesting, and meaningful they predict they will find these recollections in three months time (time 2). Three months later, Zhang et al. randomised participants into one of two groups. In the “extraordinary” group, they reread the extraordinary recollection. In the “ordinary” group, they reread the ordinary recollection. All the participants completed measures on how enjoyable, interesting, and meaningful they found the experience, but this time what they actually felt, rather than what they predict they will feel.

They predicted participants in the ordinary group would underestimate their future feelings (i.e., there would be a bigger difference between time 1 and time 2 measures) compared to participants in the extraordinary group. In this chapter, we focus on a composite measure which took the mean of items on interest, meaningfulness, and enjoyment.

7.1.2 Organising your files and project for the chapter

Before we can get started, you need to organise your files and project for the chapter, so your working directory is in order.

  1. In your folder for research methods and the book ResearchMethods1_2/Quant_Fundamentals, create a new folder called Chapter_07_dataviz. Within Chapter_07_dataviz, create two new folders called data and figures.

  2. Create an R Project for Chapter_07_dataviz as an existing directory for your chapter folder. This should now be your working directory.

  3. Create a new R Markdown document and give it a sensible title describing the chapter, such as 07 Scatterplots Boxplots Violins. Delete everything below line 10 so you have a blank file to work with and save the file in your Chapter_07_dataviz folder.

  4. We are working with a new data set, so please save the following data file: Zhang_2014.csv. Right click the link and select “save link as”, or clicking the link will save the files to your Downloads. Make sure that you save the file as “.csv”. Save or copy the file to your data/ folder within Chapter_07_dataviz.

You are now ready to start working on the chapter!

7.1.3 Activity 1 - Read and wrangle the data

As the first activity, try and test yourself by completing the following task list to practice your data wrangling skills. Create an object called zhang_data to be consistent with the tasks below. If you want to focus on data visualisation, then you can just type the code in the solution.

Try this

To wrangle the data, complete the following tasks:

  1. Load the tidyverse package.

  2. Read the data file data/Zhang_2014.csv.

  3. Select the following columns:

    • Gender

    • Age

    • Condition

    • T1_Predicted_Interest_Composite renamed to time1_interest

    • T2_Actual_Interest_Composite renamed to time2_interest.

  4. There is currently no identifier, so create a new variable called participant_ID. Hint: try participant_ID = row_number().

  5. Recode two variables to be easier to understand and visualise:

    • Gender: 1 = “Male”, 2 = “Female”.

    • Condition: 1 = “Ordinary”, 2 = “Extraordinary”.

Your data should now be in wide format and ready to create a scatterplot.

You should have the following in a code chunk:

# Load the tidyverse package below
library(tidyverse)

# Load the data file
# This should be the Zhang_2014.csv file 
zhang_data <- read_csv("data/Zhang_2014.csv")

# Wrangle the data for plotting. 
# select and rename key variables
# mutate to add participant ID and recode
zhang_data <- zhang_data %>%
  select(Gender, 
         Age, 
         Condition, 
         time1_interest = T1_Predicted_Interest_Composite, 
         time2_interest = T2_Actual_Interest_Composite) %>%
  mutate(participant_ID = row_number(),
         Condition = case_match(Condition, 
                            1 ~ "Ordinary", 
                            2 ~ "Extraordinary"),
         Gender = case_match(Gender,
                             1 ~ "Male",
                             2 ~ "Female")) 

7.1.4 Activity 2 - Explore the data

Try this

After the wrangling steps, try and explore zhang_data to see what variables you are working with. For example, opening the data object as a tab to scroll around, explore with glimpse(), or try plotting some of the individual variables to see what they look like using visualisation skills from Chapter 3.

In zhang_data, we have the following variables:

Variable Type Description
Gender character Participant gender: Male (1) or Female (2)
Age double Participant age in years.
Condition character Condition participant was randomly allocated into: Ordinary (1) or Extraordinary (2).
time1_interest double How interested they predict they will find the recollection on a 1 (not at all) to 7 (extremely) scale. This measure is the mean of enjoyment, interest, and meaningfulness.
time2_interest double How interested they actually found the recollection on a 1 (not at all) to 7 (extremely) scale. This measure is the mean of enjoyment, interest, and meaningfulness.
participant_ID integer Our new participant ID as an integer from 1 to 130.

We will use this data set to demonstrate different ways of visualising continuous variables, either combining multiple continuous variables in a scatterplot or splitting continuous variables into categories in a boxplot or violin-boxplot.

7.2 Scatterplots

The first visualisation is a scatterplot to show the relationship between two continuous variables. One variable goes on the x-axis and the other variables goes on the y-axis. Each dot then represents the intersection of those two variables per observation/participant. You will use these plots often when reporting a correlation or regression.

7.2.1 Activity 3 - Creating a basic scatterplot

Let us start by making a scatterplot of Age and time1_interest to see if there is any relationship between the two. We need to specify both the x- and y-axis variables, but the only difference to what we created in Chapter 3 is using a new layer geom_point.

zhang_data %>% 
  ggplot(aes(x = time1_interest, y = Age)) +
       geom_point()

7.2.2 Activity 4 - Editing axis labels

This plot is great for some exploratory data analysis, but it looks a little untidy to put into a report. We can use the scale_x_continuous and scale_y_continuous layers to control the tick marks, as well as the axis name.

zhang_data %>%  
  ggplot(aes(x = time1_interest,y = Age)) +
  geom_point() +
  scale_x_continuous(name = "Time 1 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  scale_y_continuous(name = "Age",
                     limits = c(15, 45), # change limits to 15 to 45
                     breaks = seq(from = 15, # sequence from 15
                                  to = 45, # to 45 
                                  by = 5)) # in steps of 5

To break down these new arguments/functions in the layers:

  • breaks set the tick marks on the plot. We demonstrate two ways of setting this. On the x-axis, we just manually set values for 1 to 7. On the y-axis, we use a second function to set the breaks.

  • seq() creates a sequence of numbers and can save a lot of time when you need to add lots of values. We set three arguments, from for the starting point, to for the end point, and by for the steps the sequence goes up in.

  • limits controls the start and end point of the graph scale. In the original graph, we can see there are points below 20 and above 40, so we might want to increase the limits of the graph to include a wider range.

Error mode

When controlling the limits of the graph, sometimes you want to decrease the limits range to zoom in on an element of the data. If you decrease the range which cuts off some data points, you must be very careful as it actually cuts off data which you would receive a warning about:

zhang_data %>%  
  ggplot(aes(x = time1_interest,y = Age)) +
  geom_point() +
  scale_y_continuous(name = "Age",
                     limits = c(30, 40)) # in steps of 5
Warning: Removed 124 rows containing missing values or values outside the scale range
(`geom_point()`).

You must be very careful when truncating axes, but if you do need to do it, there is a different function layer to use:

zhang_data %>%  
  ggplot(aes(x = time1_interest,y = Age)) +
  geom_point() +
  coord_cartesian(ylim = c(30, 40))

7.2.3 Activity 5 - Adding a regression line

It is often useful to add a regression line or line of best fit to a scatterplot. You can add a regression line with the geom_smooth() layer and by default will also provide a 95% confidence interval ribbon. You can specify what type of line you want to draw, most often you will need method = "lm" for a linear model or a straight line.

zhang_data %>%  
  ggplot(aes(x = time1_interest,y = Age)) +
  geom_point() +
  scale_x_continuous(name = "Time 1 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  scale_y_continuous(name = "Age",
                     limits = c(15, 45), # change limits to 15 to 45
                     breaks = seq(from = 15, # sequence from 15
                                  to = 45, # to 45 
                                  by = 5)) +  # in steps of 5
  geom_smooth(method = "lm")

With the regression line, we can see there is very little relationship between age and interest score at time 1.

Important

Remember, you can save your plots using the function ggsave(). You can use the function after creating the last plot, or saving your plot as an object and using the plot argument. You have a Figures/ directory for the chapter, so try and save the plots you make to remind yourself later.

Try this

So far, we made a scatterplot of age against interest at time 1. Now, create a scatterplot on your own using the two interest rating variables: time1_interest and time2_interest.

After you made the scatterplot, it looks like there is a relationship between interest ratings at time 1 and time 2.

You should have the following in a code chunk:

zhang_data %>%  
  ggplot(aes(x = time1_interest, y = time2_interest)) +
  geom_point() +
  scale_x_continuous(name = "Time 1 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  scale_y_continuous(name = "Time 2 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  geom_smooth(method = "lm")

7.2.4 Activity 6 - Creating a grouped scatterplot

Before we move on, we can add a third variable to show how the relationship might differ for different groups within our data. We can do this by adding the colour argument to aes() and setting it as whatever variable we would like to distinguish between. In this case, we will see how the relationship between age and interest at time 1 differs for the male and female participants. There are a few participants with missing gender, so we will first filter them out.

zhang_data %>%
  drop_na(Gender) %>% 
  ggplot(aes(x = time1_interest, y = Age, colour = Gender)) +
  geom_point() +
  scale_x_continuous(name = "Mean interest score (1-7)",
                     breaks = c(1:7)) + 
  scale_y_continuous(name = "Age") +
  geom_smooth(method = "lm")

Grouped scatterplot
Try this

For your independent scatterplot of the two interest rating variables: time1_interest and time2_interest, add a colour argument using the Condition variable. This will show the relationship between time 1 and time 2 interest separately for participants in the ordinary and extraordinary groups.

You should have the following in a code chunk:

zhang_data %>%  
  ggplot(aes(x = time1_interest, y = time2_interest, colour = Condition)) +
  geom_point() +
  scale_x_continuous(name = "Time 1 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  scale_y_continuous(name = "Time 2 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  geom_smooth(method = "lm")

7.3 Boxplots

The next visualisation is the boxplot which presents a range of summary statistics for your outcome, which you can split between different groups on the x-axis, or add further variables to divide by. For the boxplot element, you get five summary statistics: the median centre line, the first and third quartile as the box (essentially, the interquartile range), and 1.5 times the first and third quartiles as the whiskers extending from the box. If there are any values beyond the whiskers, you see the individual data points and this is one definition of an outlier (more on that in Chapter 11)

7.3.1 Activity 7 - Creating a basic boxplot

Before we create the boxplot, we need a final data wrangling step. At the moment, we have time1_interest and time2_interest in wide format, but to plot together, we need to express it as a single variable. For that, we must restructure the data. This is why we spent so much time on data wrangling, as you might need to quickly restructure your data to plot certain elements.

Try this

To wrangle the data, gather the variables time1_interest and time2_interest. Create a new object called zhang_data_long and use the names Time and Interest for your column names to be consistent with the demonstrations below.

You should have the following in a code chunk:

# gather the data to convert to long format
zhang_data_long <- zhang_data %>% 
  pivot_longer(cols = time1_interest:time2_interest,
               names_to = "Time",
               values_to = "Interest")

If you only want to visualise one continuous variable, we need one variable on the y-axis and a new function layer geom_boxplot().

zhang_data_long %>% 
  ggplot(aes(y = Interest)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

Typically, you want to compare the outcome between one or more categories, so we can add a categorical variable like gender to the x-axis, removing the missing values first.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

7.3.2 Activity 8 - Adding colour to variables

It is not as important when you only have one variable on the x-axis, but one useful feature is adding colour to distinguish between categories. You can control this by adding a variable to the fill argument within aes().

By default, we get a legend which is redundant when we only have different colours on the x-axis, so we can turn it off by adding guides(fill = FALSE) as a layer.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  guides(fill = FALSE) # remove the legend

Error mode

You might have noticed we have now used two different arguments to control the colour. In scatterplots, we used colour. In boxplots, we used fill. It is one of those concepts that takes time to recognise which you need, depending on the type of geom you are using. Roughly, colour is when you want to control the outline or symbol, like the points. Whereas fill is when you want the inside of a geom coloured. You can see the difference here by controlling fill first:

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

Then colour:

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, colour = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

7.3.3 Activity 9 - Controlling colours

ggplot2 has a default colour scheme which is fine for quick plots, but it is useful to control the colour scheme. You can do this manually by editing scale_fill_discrete() and choosing colours through the type argument (you can do this through character names or choosing a HEX code: https://r-charts.com/colors/).

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_discrete(type = c("blue", "pink"))

Alternatively (and what we recommend), you can use scale_fill_viridis_d(). This function does exactly the same thing but it uses a colour-blind friendly palette (which also prints in black and white). There are 5 different options for colours and you can see them by changing option to A, B, C, D or E. We like option E with alpha = 0.6 (to control transparency and soften the tone) but play around with the options to see what you prefer.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6) + 
  guides(fill = "none")

Boxplots with friendly colours
Try this

For your independent boxplot, use zhang_data_long to visualise Interest as your continuous variable and Condition for different categories. This will show the difference in interest rating between those in the ordinary and extraordinary groups.

Comparing the ordinary and extraordinary groups, it looks like .

You should have the following in a code chunk:

zhang_data_long %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6) + 
  guides(fill = "none")

7.3.4 Activity 10 - Ordering categories

When we plot variables like Gender on the x-axis, R has an internal order it sets unless you create a factor. The default is alphabetical or numerical. In previous plots, it displayed Female then Male, as F comes before M.

Controlling the order of categories is an important design choice to communicate your message, and the most direct way is controlling the factor order before plotting. Here, we add mutate() in a pipe and manually set the factor levels, just be careful as it is case sensitive to the values in your data.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  mutate(Gender = factor(Gender, 
                         levels = c("Male", "Female"))) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6) + 
  guides(fill = "none")

7.3.5 Activity 11- Boxplots for multiple factors

When you only have one independent variable, using the fill argument to change the colour can be a little redundant as the colours do not add any additional information. It makes more sense to use colour to represent a second variable.

For this example, we will use Condition and Time as variables. fill() now specifies a second independent variable, rather than repeating the variable on the x-axis as in the previous plot, so we do not want to deactivate the legend.

zhang_data_long %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Time)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6)

As a final point here, the fill values on the legend are not the most professional looking. Like reordering factors, the easiest way of addressing this is editing the underlying data before piping to ggplot2.

zhang_data_long %>% 
  mutate(Time = case_match(Time,
                           "time1_interest" ~ "Time 1",
                           "time2_interest" ~ "Time 2")) %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Time)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6)

7.4 Violin-boxplots

Boxplots are great for your own exploratory data analysis but you do not often see them reported in isolation. They visualise summary statistics, but you do not get much sense of the underlying distribution of values. When you want to communicate continuous outcomes, researchers in psychology are using violin-boxplots more often. This combines both elements: a violin plot to show the distribution of the data, and a boxplot to add summary statistics. This is where ggplot2 comes into it’s own as we can add and customise several layers.

7.4.1 Activity 12 - Creating a basic violin plot

Violin plots get their name as they look something like a violin when the data are roughly normally distributed. They show density, so the fatter the violin element, the more data points there are for that value. Compared to the boxplot, the only difference is changing the layer to geom_violin().

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender)) +
  geom_violin() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

The distribution of values is great, but sometimes it might be useful to also add the underlying data points. These are all important design choices as it can be useful when you have smaller amounts of data, but overwhelming when you have thousands of data points. So, keep in mind what you want to communicate. Here, we use the layer geom_jitter() to jitter the points slightly, so they are not all in a vertical line and we get a better sense of the density.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender)) +
  geom_violin() + 
  geom_jitter(height = 0, # do not jitter height
              width = .1) + # jitter width of points
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

Important

It is important to remember that R is very literal. ggplot2 works on a system of layers. It will add new geoms on top of existing ones and it will not stop to think whether this is a good idea. Try running the code above but put geom_jitter() first and then add geom_violin(). The order of your layers matters.

7.4.2 Activity 13 - Creating a violin-boxplot

Instead of adding the data points in a layer, we can add a boxplot to create the violin-boxplot. This way, we get distribution information from the violin layer and summary statistics from the boxplot layer.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender)) +
  geom_violin() + 
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

On it’s own, this does not look great. We can edit the settings to reduce the width of the boxplots, add a colour scheme, and add transparency to the violin layer to make it easier to see the boxplot.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_violin(alpha = 0.5) + 
  geom_boxplot(width = 0.2) + 
  scale_fill_viridis_d(option = "E") + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  guides(fill = "none")

The boxplot uses the median for the centre line, but in your report you might be presenting means per category which will be slightly different. One further variation is removing the centre median line, and replacing it with the mean and 95% confidence interval (more on that in the lectures and Chapter 8). This way, you get three layers: the violin plot for the density, the boxplot for distribution summary statistics, and the mean and 95% confidence interval.

This code uses two calls to stat_summary() which is a layer to add summary statistics. The first layer draws a point to represent the mean, and the second draws an errorbar that represents the 95% confidence interval around the mean.

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_violin(alpha = 0.5) + 
  geom_boxplot(width = 0.2, 
               fatten = NULL) + 
  stat_summary(fun = "mean", 
               geom = "point") +
  stat_summary(fun.data = "mean_cl_boot", # confidence interval
               geom = "errorbar", 
               width = 0.1) +
  scale_fill_viridis_d(option = "E") + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  guides(fill = "none")

Warning

When you run the line stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", width = .1) for the first time, you might be prompted to install the R package Hmisc. If you are on your own computer, follow the instructions in the Console to install the package. If you are on a university computer, this should already be installed.

Try this

For your independent violin-boxplot, use zhang_data_long to visualise Interest as your continuous variable and Condition for different categories on the x-axis. Try and create the plot to look like this, so you might need to play around with different themes:

You should have the following in a code chunk:

zhang_data_long %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
  geom_violin(alpha = 0.5) + 
  geom_boxplot(width = 0.2, 
               fatten = NULL) + 
  stat_summary(fun = "mean", 
               geom = "point") +
  stat_summary(fun.data = "mean_cl_boot", 
               geom = "errorbar", 
               width = .1) +
  scale_fill_viridis_d(option = "E") + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  theme_minimal() + 
  guides(fill = "none")

7.4.3 Activity 14 - Adding additional variables

Like boxplots, we can add a second grouping variable to fill instead of just using it for colour.

zhang_data_long %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Time)) +
  geom_violin(alpha = 0.5) + 
  geom_boxplot(width = 0.2, 
               fatten = NULL) + 
  stat_summary(fun = "mean", 
               geom = "point") +
  stat_summary(fun.data = "mean_cl_boot", 
               geom = "errorbar", 
               width = .1) +
  scale_fill_viridis_d(option = "E") + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  theme_minimal()

However, unless you are trying to recreate a Kandinsky painting in ggplot2, that does not look quite right. This is because we have multiple layers that each plot separate groups in different ways. To make it all fall into line, we need to add a constant value to offset the elements. We start off by defining a position dodge value as an object. This way, we can use the object name later, and we only need to edit it in one place if we wanted to change the value.

# specify as an object, so we only change it in one place
dodge_value <- 0.9

zhang_data_long %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Time)) +
  geom_violin(alpha = 0.5) + 
  geom_boxplot(width = 0.2, 
               fatten = NULL,
               position = position_dodge(dodge_value)) + 
  stat_summary(fun = "mean", 
               geom = "point",
               position = position_dodge(dodge_value)) +
  stat_summary(fun.data = "mean_cl_boot", 
               geom = "errorbar", 
               width = .1,
               position = position_dodge(dodge_value)) +
  scale_fill_viridis_d(option = "E") + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

This looks much better! Remember, if you want to change the legend labels, the easiest way is recoding the data before piping to ggplot2.

Finally, we might want to add a third variable to group the data by. There is a facet function that produces different plots for each level of a grouping variable which can be very useful when you have more than two factors. The following code shows interest ratings for all three variables we have worked with: Condition, Time, and Gender.

# specify as an object, so we only change it in one place
dodge_value <- 0.9

zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Time)) +
  geom_violin(alpha = 0.5) + 
  geom_boxplot(width = 0.2, 
               fatten = NULL,
               position = position_dodge(dodge_value)) + 
  stat_summary(fun = "mean", 
               geom = "point",
               position = position_dodge(dodge_value)) +
  stat_summary(fun.data = "mean_cl_boot", 
               geom = "errorbar", 
               width = .1,
               position = position_dodge(dodge_value)) +
  facet_wrap(~ Gender) + # facet by Gender
  scale_fill_viridis_d(option = "E") + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

Facets work in the same way as adding a variable to fill. It is not easy to change the labels within ggplot2, you are better off editing the values in your data first.

7.5 Test yourself

To end the chapter, we have some knowledge check questions to test your understanding of the concepts we covered in the chapter. We then have some error mode tasks to see if you can find the solution to some common errors in the concepts we covered in this chapter.

7.5.1 Knowledge check

Question 1. You want to plot several summary statistics including the median for your outcome, which ggplot2 layer could you use?

Question 2. You want to create a scatterplot to show the correlation between two continuous variables, which ggplot2 layer could you use?

Question 3. You want to show the density of values in your outcome, which ggplot2 layer could you use?

Question 4. To separate a scatterplot into different groups, you could specify a grouping variable using the fill argument to change the colour of the points?

This was a sneaky one, but relates to the error mode warning within the chapter. There are two ways to add a grouping variable for separate colours: colour and fill. In this scenario, colour would change the colour of the points, whereas fill would only change the colour of the regression line and its 95% confidence interval ribbon. Sometimes you need to play around with the settings to produce the effects you want.

Question 5. The order of layers is important in ggplot2. Which order of layers would show individual data points on top of a boxplot?

In addition to the layer order, we also added an error mode feature to recognise when you need to use the pipe %>% vs the +.

  • data %>% ggplot() + geom_boxplot() + geom_jitter() was the correct answer as we add data point after the boxplot.

  • data + ggplot() + geom_boxplot() + geom_jitter() had the right order, but we used + instead of the pipe between the data and the initial ggplot() function.

  • data + ggplot() + geom_jitter() + geom_boxplot() and data %>% ggplot() + geom_jitter() + geom_boxplot() both had the wrong layer order as the boxplot would overlay the points.

7.5.2 Error mode

The following questions are designed to introduce you to making and fixing errors. For this topic, we focus on the new types of data visualisation. Remember to keep a note of what kind of error messages you receive and how you fixed them, so you have a bank of solutions when you tackle errors independently.

Create and save a new R Markdown file for these activities. Delete the example code, so your file is blank from line 10. Create a new code chunk to load tidyverse and wrangle the data files:

# Load the tidyverse package below
library(tidyverse)

# Load the data file
# This should be the Zhang_2014.csv file 
zhang_data <- read_csv("data/Zhang_2014.csv")

# Wrangle the data for plotting. 
# select and rename key variables
# mutate to add participant ID and recode
zhang_data <- zhang_data %>%
  select(Gender, 
         Age, 
         Condition, 
         time1_interest = T1_Predicted_Interest_Composite, 
         time2_interest = T2_Actual_Interest_Composite) %>%
  mutate(participant_ID = row_number(),
         Condition = case_match(Condition, 
                            1 ~ "Ordinary", 
                            2 ~ "Extraordinary"),
         Gender = case_match(Gender,
                             1 ~ "Male",
                             2 ~ "Female")) 

# gather the data to convert to long format
zhang_data_long <- zhang_data %>% 
  pivot_longer(cols = time1_interest:time2_interest,
               names_to = "Time",
               values_to = "Interest")

Below, we have several variations of a code chunk error or misspecification. Copy and paste them into your R Markdown file below the code chunk to load tidyverse and the data files. Once you have copied the activities, click knit and look at the error message you receive. See if you can fix the error and get it working before checking the answer.

Question 6. Copy the following code chunk into your R Markdown file and press knit. This code… works, but it does not look quite right? Why are the tick marks not displaying properly?

```{r}
zhang_data %>%  
  ggplot(aes(x = time1_interest, y = time2_interest)) +
  geom_point() +
  theme_classic() + 
  scale_x_discrete(name = "Time 1 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  scale_y_discrete(name = "Time 2 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  geom_smooth(method = "lm")
```

In this example, we used the wrong function for continuous variables. We used scale_x_discrete and scale_y_discrete, instead of scale_x_continuous and scale_y_continuous. We must honour the variable type when we customise the plot, so think about what type of variable is on each axis and which function lets you edit it.

zhang_data %>%  
  ggplot(aes(x = time1_interest, y = time2_interest)) +
  geom_point() +
  theme_classic() + 
  scale_x_continuous(name = "Time 1 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  scale_y_continuous(name = "Time 2 interest score (1-7)", 
                     breaks = c(1:7)) + # tick marks from 1 to 7
  geom_smooth(method = "lm")

Question 7. Copy the following code chunk into your R Markdown file and press knit. You should receive an error like Error in "fortify()":! "data" must be a <data.frame>, or an object coercible by "fortify()" which is a little cryptic.

```{r}
zhang_data_long + 
  ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6) + 
  guides(fill = FALSE)
```

Once we start using a mixture of tidyverse functions, it is important to remember which uses a pipe %>% between layers, and which uses +. Here, we tried using the + between the data object and the initial ggplot() layer. We need a pipe here or it thinks you are trying to set the data argument using aes().

zhang_data_long %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
  geom_boxplot() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_fill_viridis_d(option = "E", 
                       alpha = 0.6) + 
  guides(fill = FALSE)

Question 8. Copy the following code chunk into your R Markdown file and press knit. We want to change the order of the categories to present males then female. This code…works, but is it doing what we think it is doing?

```{r}
zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender)) +
  geom_violin() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7)) + 
  scale_x_discrete(labels = c("Male", "Female"))
```

We have introduced this error several times, but we see it so often it is worth reinforcing. When we change the labels, this is really just to tidy things up. The underlying data does not change, we are just trying to communicate it clearer. If we want to change the order of categories, we must change the underlying order of the data as a factor or R will default to alphabetical/numerical. So, we mutate Gender as a factorm, then pipe to ggplot2.

zhang_data_long %>% 
  mutate(Gender = factor(Gender,
                         levels = (c("Male", "Female")))) %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender)) +
  geom_violin() + 
  scale_y_continuous(name = "Interest score (1-7)", 
                     breaks = c(1:7))

7.6 Words from this Chapter

Below you will find a list of words that were used in this chapter that might be new to you in case it helps to have somewhere to refer back to what they mean. The links in this table take you to the entry for the words in the PsyTeachR Glossary. Note that the Glossary is written by numerous members of the team and as such may use slightly different terminology from that shown in the chapter.

term definition
boxplot Visualising a continuous variable by five summary statistics: the median centre line, the first and third quartile, and 1.5 times the first and third quartiles.
scatterplot Plotting two variables on the x- and y-axis to show the correlation/relationship between the variables.
violin-boxplots A combination of a violin plot to show the density of data points and a boxplot to show summary statistics of distribution.

7.7 End of chapter

Well done, you have completed the second chapter dedicated to data visualisation! This is a key area for psychology research and helping to communicate your findings to your audience. Data visualisation also comes with a lot of responsibility. There are lots of design choices to make and help communicate your findings as effectively and transparently as possible. We could dedicate a whole book to data visualisation possibilities in R and ggplot2, so we have added a range of further reading sources in the Additional Resources appendix.

In the next chapter, we start on inferential statistics introducing you to the concept of regression by focusing on one continuous predictor variable.