Chapter 3 Data Visualisation

Take the quiz to see if you need to review this chapter.

3.1 Learning Objectives

3.1.1 Basic

  1. Understand what types of graphs are best for different types of data
    • 1 discrete
    • 1 continuous
    • 2 discrete
    • 2 continuous
    • 1 discrete, 1 continuous
    • 3 continuous
  2. Create common types of graphs with ggplot2
  3. Set custom labels and colours
  4. Represent factorial designs with different colours or facets
  5. Save plots as an image file

3.1.2 Intermediate

  1. Superimpose different types of graphs
  2. Add lines to graphs
  3. Deal with overlapping data
  4. Create less common types of graphs

3.1.3 Advanced

  1. Arrange plots in a grid using cowplot
  2. Adjust axes (e.g., flip coordinates, set axis limits)
  3. Change the theme
  4. Create interactive graphs with plotly

3.3 Setup

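# libraries needed for these graphs
library(tidyverse) # ggplot2, dplyr, tidyr and tibble
library(cowplot)   # plot_grid()
library(plotly)    # ggplotly()
set.seed(30250) # makes sure random numbers are reproducible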

3.4 Common Variable Combinations

Continuous variables are properties you can measure, like height. Discrete (or categorical) variables are things you can count, like the number of pets you have. Categorical variables can be nominal, where the categories don't really have an order, like cats, dogs and ferrets (even though ferrets are obviously best). They can also be ordinal, where there is a clear order, but the distance between the categories isn't something you could exactly equate, like points on a Likert rating scale.
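In R, categorical variables are usually stored as factors. The sketch below is illustrative only (the vectors `pet_type` and `rating` are made up for this example and are not part of the chapter's data): a nominal variable becomes a plain factor, while an ordinal variable becomes an ordered factor, which allows comparisons between levels but not arithmetic on them.

```r
# nominal: categories with no inherent order
pet_type <- factor(c("cat", "dog", "ferret", "dog"))
is.ordered(pet_type)    # FALSE

# ordinal: categories with a clear order (e.g., a Likert-style rating)
rating <- factor(
  c("agree", "disagree", "neutral", "agree"),
  levels = c("disagree", "neutral", "agree"),
  ordered = TRUE
)
is.ordered(rating)      # TRUE
rating[1] > rating[2]   # comparisons respect the level order
```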

Different types of visualisations are good for different types of variables.

Before you read ahead, come up with an example of each type of variable combination and sketch the types of graphs that would best display these data.

  • 1 discrete
  • 1 continuous
  • 2 discrete
  • 2 continuous
  • 1 discrete, 1 continuous
  • 3 continuous

3.4.1 Data

The code below creates some data frames with different types of data. We'll learn how to simulate data like this in the Probability & Simulation chapter, but for now just run the code chunk below.

  • pets has a column with pet type
  • pet_happy has happiness and age for 500 dog owners and 500 cat owners
  • x_vs_y has two correlated continuous variables (x and y)
  • overlap has two correlated ordinal variables and 1000 observations so there is a lot of overlap
  • overplot has two correlated continuous variables and 10000 observations

First, think about what kinds of graphs are best for representing these different types of data.

pets <- tibble(
  pet = sample(
    c("dog", "cat", "ferret", "bird", "fish"), 
    size = 100, # assumed number of pets; the value was lost from this listing
    replace = TRUE, # needed to sample more items than there are categories
    prob = c(0.45, 0.40, 0.05, 0.05, 0.05)
  )
)

pet_happy <- tibble(
  pet = rep(c("dog", "cat"), each = 500),
  happiness = c(rnorm(500, 55, 10), rnorm(500, 45, 10)),
  age = rpois(1000, 3) + 20
)

x_vs_y <- tibble(
  x = rnorm(100),
  y = x + rnorm(100, 0, 0.5)
)

overlap <- tibble(
  x = rbinom(1000, 10, 0.5),
  y = x + rbinom(1000, 20, 0.5)
)

overplot <- tibble(
  x = rnorm(10000),
  y = x + rnorm(10000, 0, 0.5)
)

3.5 Basic Plots

3.5.1 Bar plot

Bar plots are good for categorical data where you want to represent the count.

ggplot(pets, aes(pet)) +
  geom_bar()

Figure 3.1: Bar plot

3.5.2 Density plot

Density plots are good for one continuous variable, but only if you have a fairly large number of observations.

ggplot(pet_happy, aes(happiness)) +
  geom_density()

Figure 3.2: Density plot

You can represent subsets of a variable by assigning the category variable to the argument group, fill, or color.

ggplot(pet_happy, aes(happiness, fill = pet)) +
  geom_density(alpha = 0.5)

Figure 3.3: Grouped density plot

Try changing the alpha argument to figure out what it does.

3.5.3 Frequency Polygons

If you don't want smoothed distributions, try geom_freqpoly().

ggplot(pet_happy, aes(happiness, color = pet)) +
  geom_freqpoly(binwidth = 1)

Figure 3.4: Frequency polygon plot

Try changing the binwidth argument to 5 and 0.1. How do you figure out the right value?
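There is no single right answer, but one common starting point is the Freedman-Diaconis rule, which sets the bin width to 2 * IQR / n^(1/3). The sketch below illustrates it on simulated data (the helper name `fd_binwidth` and the vector `happiness` are made up for this example); treat the result as a first guess to adjust by eye.

```r
library(ggplot2)

# Freedman-Diaconis rule of thumb: 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1/3)
}

set.seed(1)
happiness <- rnorm(500, mean = 50, sd = 10) # simulated scores, for illustration

bw <- fd_binwidth(happiness)

ggplot(data.frame(happiness), aes(happiness)) +
  geom_freqpoly(binwidth = bw)
```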

3.5.4 Histogram

Histograms are also good for one continuous variable, and work well if you don't have many observations. Set the binwidth to control how wide each bar is.

ggplot(pet_happy, aes(happiness)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black")

Figure 3.5: Histogram

Histograms in ggplot look pretty bad unless you set the fill and color.

If you show grouped histograms, you also probably want to change the default position argument.

ggplot(pet_happy, aes(happiness, fill=pet)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "dodge")

Figure 3.6: Grouped Histogram

Try changing the position argument to "identity", "fill", "dodge", or "stack".

3.5.5 Column plot

Column plots are the worst way to represent grouped continuous data, but also one of the most common.

To make column plots with error bars, you first need to calculate the means, error bar upper limits (ymax) and error bar lower limits (ymin) for each category. You'll learn more about how to use the code below in the next two lessons.

# calculate mean and SD for each pet
avg_pet_happy <- pet_happy %>%
  group_by(pet) %>%
  summarise(
    mean = mean(happiness),
    sd = sd(happiness)
  )

ggplot(avg_pet_happy, aes(pet, mean, fill=pet)) +
  geom_col(alpha = 0.5) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.25) +
  geom_hline(yintercept = 40)

Figure 3.7: Column plot

What do you think geom_hline() does?

3.5.6 Boxplot

Boxplots are great for representing the distribution of grouped continuous variables. They fix most of the problems with using barplots for continuous data.

ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_boxplot(alpha = 0.5)

Figure 3.8: Box plot
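To see what the box summarises, you can compute the same quantities directly. By default, geom_boxplot() draws the box from the first to the third quartile with a line at the median, and the whiskers extend to the most extreme points within 1.5 times the interquartile range of the box. The sketch below uses simulated scores (the data frame `scores` is made up for illustration).

```r
library(dplyr)

set.seed(1)
scores <- tibble(
  pet = rep(c("dog", "cat"), each = 100),
  happiness = c(rnorm(100, 55, 10), rnorm(100, 45, 10))
)

# the quantities that define each box
box_stats <- scores %>%
  group_by(pet) %>%
  summarise(
    q1 = quantile(happiness, 0.25),
    median = median(happiness),
    q3 = quantile(happiness, 0.75),
    iqr = IQR(happiness)
  )

box_stats
```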

3.5.7 Violin plot

Violin plots are like sideways, mirrored density plots. They give even more information than a boxplot about distribution and are especially useful when you have non-normal distributions.

ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_violin(
    trim = FALSE,
    draw_quantiles = c(0.25, 0.5, 0.75), 
    alpha = 0.5
  )

Figure 3.9: Violin plot

Try changing the numbers in the draw_quantiles argument.

3.5.8 Scatter plot

Scatter plots are a good way to represent the relationship between two continuous variables.

ggplot(x_vs_y, aes(x, y)) +
  geom_point()

Figure 3.10: Scatter plot using geom_point()

3.5.9 Line graph

You often want to represent the relationship as a single line.

ggplot(x_vs_y, aes(x, y)) +
  geom_smooth(method="lm")

Figure 3.11: Line plot using geom_smooth()

3.6 Customisation

3.6.1 Labels

You can set custom titles and axis labels in a few different ways.

ggplot(x_vs_y, aes(x, y)) +
  geom_smooth(method="lm") +
  ggtitle("My Plot Title") +
  xlab("The X Variable") +
  ylab("The Y Variable")

Figure 3.12: Set custom labels with ggtitle(), xlab() and ylab()

ggplot(x_vs_y, aes(x, y)) +
  geom_smooth(method="lm") +
  labs(title = "My Plot Title",
       x = "The X Variable",
       y = "The Y Variable")

Figure 3.13: Set custom labels with labs()

3.6.2 Colours

You can set custom values for colour and fill using functions like scale_colour_manual() and scale_fill_manual(). The Colours chapter in Cookbook for R has many more ways to customise colour.

ggplot(pet_happy, aes(pet, happiness, colour = pet, fill = pet)) +
  geom_violin() +
  scale_color_manual(values = c("darkgreen", "dodgerblue")) +
  scale_fill_manual(values = c("#CCFFCC", "#BBDDFF"))

Figure 3.14: Set custom colour
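One detail worth knowing: an unnamed values vector is matched to the categories in order (alphabetical for a character column, so "cat" comes before "dog"). A named vector makes the mapping explicit. This is a minimal sketch using a small simulated data frame, `demo`, made up for illustration rather than taken from the chapter's data.

```r
library(ggplot2)

set.seed(1)
demo <- data.frame(
  pet = rep(c("dog", "cat"), each = 50),
  happiness = c(rnorm(50, 55, 10), rnorm(50, 45, 10))
)

# names tie each colour to a category, regardless of level order
pet_colours <- c(dog = "dodgerblue", cat = "darkgreen")
pet_fills   <- c(dog = "#BBDDFF", cat = "#CCFFCC")

p <- ggplot(demo, aes(pet, happiness, colour = pet, fill = pet)) +
  geom_violin() +
  scale_colour_manual(values = pet_colours) +
  scale_fill_manual(values = pet_fills)

p
```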

3.6.3 Save as File

You can save a ggplot using ggsave(). It saves the last ggplot you made, by default, but you can specify which plot you want to save if you assigned that plot to a variable.

You can set the width and height of your plot. The default units are inches, but you can change the units argument to "in", "cm", or "mm".

box <- ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_boxplot(alpha = 0.5)

violin <- ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_violin(alpha = 0.5)

ggsave("demog_violin_plot.png", width = 5, height = 7)

ggsave("demog_box_plot.jpg", plot = box, width = 5, height = 7)
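As an illustration of the units argument, the sketch below saves a plot at 12 by 8 cm. The plot itself, the dimensions and the tempfile() path are arbitrary choices for the example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# write to a temporary file so the example does not clutter your project
out_file <- tempfile(fileext = ".png")

ggsave(out_file, plot = p, width = 12, height = 8, units = "cm")

file.exists(out_file)
```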

3.7 Combination Plots

3.7.1 Violin-box plot

To demonstrate the use of facet_grid() for factorial designs, we create a new column called agegroup to split the data into participants younger than the median age and everyone else (labelled "Older"). New factors will display in alphabetical order, so we can use the factor() function to set the levels in the order we want.

pet_happy %>%
  mutate(agegroup = ifelse(age<median(age), "Younger", "Older"),
         agegroup = factor(agegroup, levels = c("Younger", "Older"))) %>%
  ggplot(aes(pet, happiness, fill=pet)) +
    geom_violin(trim = FALSE, alpha=0.5, show.legend = FALSE) +
    geom_boxplot(width = 0.25, fill="white") +
    facet_grid(.~agegroup) +
    scale_fill_manual(values = c("orange", "green"))

Figure 3.15: Violin-box plot

Set the show.legend argument to FALSE to hide the legend. We do this here because the x-axis already labels the pet types.
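The effect of setting factor levels can be checked without plotting. In this sketch the vector `agegroup` is a small made-up example.

```r
agegroup <- c("Younger", "Older", "Older", "Younger")

# default: levels are sorted alphabetically, so "Older" comes first
levels(factor(agegroup))

# explicit levels: facets, axes and legends follow this order instead
agegroup_f <- factor(agegroup, levels = c("Younger", "Older"))
levels(agegroup_f)
```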

3.7.2 Violin-point-range plot

You can use stat_summary() to superimpose a point-range plot showing the mean ± 1 SD. You'll learn how to write your own functions in the lesson on Iteration and Functions.

ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_violin(
    trim = FALSE,
    alpha = 0.5
  ) +
  # fun, fun.max and fun.min replace fun.y, fun.ymax and fun.ymin in ggplot2 3.3.0+
  stat_summary(
    fun = mean,
    fun.max = function(x) {mean(x) + sd(x)},
    fun.min = function(x) {mean(x) - sd(x)},
    geom = "pointrange"
  )

Figure 3.16: Point-range plot using stat_summary()
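An alternative, sketched here on the assumption that you prefer a single named helper, is to pass stat_summary() one function through its fun.data argument. That function must return a data frame with the columns y, ymin and ymax. The helper name `mean_sd` and the simulated data frame `demo` are made up for this example.

```r
library(ggplot2)

# returns the mean and mean ± 1 SD in the format stat_summary() expects
mean_sd <- function(x) {
  data.frame(
    y = mean(x),
    ymin = mean(x) - sd(x),
    ymax = mean(x) + sd(x)
  )
}

set.seed(1)
demo <- data.frame(
  pet = rep(c("dog", "cat"), each = 50),
  happiness = c(rnorm(50, 55, 10), rnorm(50, 45, 10))
)

p <- ggplot(demo, aes(pet, happiness, fill = pet)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  stat_summary(fun.data = mean_sd, geom = "pointrange")

p
```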

3.7.3 Violin-jitter plot

If you don't have a lot of data points, it's good to represent them individually. You can use geom_jitter() to do this.

pet_happy %>%
  sample_n(50) %>%  # choose 50 random observations from the dataset
  ggplot(aes(pet, happiness, fill=pet)) +
  geom_violin(
    trim = FALSE,
    draw_quantiles = c(0.25, 0.5, 0.75), 
    alpha = 0.5
  ) + 
  geom_jitter(
    width = 0.15, # points spread out over 15% of available width
    height = 0, # do not move position on the y-axis
    alpha = 0.5, 
    size = 3
  )

Figure 3.17: Violin-jitter plot

3.7.4 Scatter-line graph

If your graph isn't too complicated, it's good to also show the individual data points behind the line.

ggplot(x_vs_y, aes(x, y)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method="lm")

Figure 3.18: Scatter-line plot

3.7.5 Grid of plots

You can use the cowplot package to easily make grids of different graphs. First, you have to assign each plot a name. Then you list all the plots as the first arguments of plot_grid() and provide a list of labels.

my_hist <- ggplot(pet_happy, aes(happiness, fill=pet)) +
  geom_histogram(
    binwidth = 1, 
    alpha = 0.5, 
    position = "dodge", 
    show.legend = FALSE
  )

my_violin <- ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_violin(
    trim = FALSE,
    draw_quantiles = c(0.5), 
    alpha = 0.5, 
    show.legend = FALSE
  )

my_box <- ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_boxplot(alpha=0.5, show.legend = FALSE)

my_density <- ggplot(pet_happy, aes(happiness, fill=pet)) +
  geom_density(alpha=0.5, show.legend = FALSE)

my_bar <- pet_happy %>%
  group_by(pet) %>%
  summarise(
    mean = mean(happiness),
    sd = sd(happiness)
  ) %>%
  ggplot(aes(pet, mean, fill=pet)) +
    geom_bar(stat="identity", alpha = 0.5, 
             color = "black", show.legend = FALSE) +
    geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.25)

# four plots to match the four labels; which four to show is an assumption
plot_grid(
  my_violin, my_box, my_density, my_bar,
  labels = c("A", "B", "C", "D")
)

Figure 3.19: Grid of plots using cowplot

3.8 Overlapping Discrete Data

3.8.1 Reducing Opacity

You can deal with overlapping data points (very common if you're using Likert scales) by reducing the opacity of the points. You need to use trial and error to adjust these so they look right.

ggplot(overlap, aes(x, y)) +
  geom_point(size = 5, alpha = .05)

Figure 3.20: Deal with overlapping data using transparency

3.8.2 Proportional Dot Plots

Or you can set the size of the dot proportional to the number of overlapping observations using geom_count().

overlap %>%
  ggplot(aes(x, y)) +
  geom_count(color = "#663399")

Figure 3.21: Deal with overlapping data using geom_count()

Alternatively, you can transform your data to create a count column and use the count to set the dot colour.

overlap %>%
  group_by(x, y) %>%
  summarise(count = n()) %>%
  ggplot(aes(x, y, color=count)) +
  geom_point(size = 5) +
  scale_colour_viridis_c() # viridis palette for the continuous count

Figure 3.22: Deal with overlapping data using dot colour
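The count() function from dplyr is a shorthand for the group_by() plus summarise(n()) pattern. This is a minimal sketch with a small simulated data frame (`ratings` is made up for illustration).

```r
library(dplyr)

set.seed(1)
ratings <- tibble(
  x = rbinom(200, 10, 0.5),
  y = x + rbinom(200, 20, 0.5)
)

# one row per unique (x, y) pair, with the number of observations in `count`
rating_counts <- ratings %>%
  count(x, y, name = "count")

head(rating_counts)
```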

The viridis package changes the colour themes to be easier to read by people with colourblindness and to print better in greyscale. Viridis is built into ggplot2 since v3.0.0. It uses scale_colour_viridis_c() and scale_fill_viridis_c() for continuous variables and scale_colour_viridis_d() and scale_fill_viridis_d() for discrete variables.
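For a discrete variable, the corresponding call might look like the sketch below, which uses the built-in iris dataset rather than this chapter's data.

```r
library(ggplot2)

p <- ggplot(iris, aes(Species, Petal.Width, fill = Species)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_viridis_d() # _d for discrete; use _c for continuous variables

p
```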

3.9 Overlapping Continuous Data

Even if the variables are continuous, overplotting might obscure any relationships if you have lots of data.

overplot %>%
  ggplot(aes(x, y)) + 
  geom_point()

Figure 3.23: Overplotted data

3.9.1 2D Density Plot

Use geom_density2d() to create a contour map.

overplot %>%
  ggplot(aes(x, y)) + 
  geom_density2d()

Figure 3.24: Contour map with geom_density2d()

You can use stat_density_2d(aes(fill = ..level..), geom = "polygon") to create a heatmap-style density plot.

overplot %>%
  ggplot(aes(x, y)) + 
  stat_density_2d(aes(fill = ..level..), geom = "polygon")

Figure 3.25: Heatmap-density plot

3.9.2 2D Histogram

Use geom_bin2d() to create a rectangular heatmap of bin counts. Set the binwidth to the x and y dimensions to capture in each box.

overplot %>%
  ggplot(aes(x, y)) + 
  geom_bin2d(binwidth = c(1,1))

Figure 3.26: Heatmap of bin counts

3.9.3 Hexagonal Heatmap

Use geom_hex() to create a hexagonal heatmap of bin counts (this requires the hexbin package to be installed). Adjust the binwidth, xlim(), ylim() and/or the figure dimensions to make the hexagons more or less stretched.

overplot %>%
  ggplot(aes(x, y)) + 
  geom_hex(binwidth = c(0.25, 0.25))

Figure 3.27: Hexagonal heatmap of bin counts

3.9.4 Correlation Heatmap

I've included the code for creating a correlation matrix from a table of variables, but you don't need to understand how this is done yet. We'll cover mutate() and gather() functions in the dplyr and tidyr lessons.

# generate two sets of correlated variables (a and b)
heatmap <- tibble(
    a1 = rnorm(100),
    b1 = rnorm(100)
  ) %>% 
  mutate(
    a2 = a1 + rnorm(100),
    a3 = a1 + rnorm(100),
    a4 = a1 + rnorm(100),
    b2 = b1 + rnorm(100),
    b3 = b1 + rnorm(100),
    b4 = b1 + rnorm(100)
  ) %>%
  cor() %>% # create the correlation matrix
  as.data.frame() %>% # make it a data frame
  rownames_to_column(var = "V1") %>% # set rownames as V1
  gather("V2", "r", a1:b4) # wide to long (V2)
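The gather() function still works but has been superseded in tidyr. Assuming tidyr 1.0.0 or later, the same wide-to-long step could be sketched with pivot_longer(). The three-column data frame `vars` is a made-up, smaller stand-in for the table above.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(1)
vars <- tibble(a1 = rnorm(100)) %>%
  mutate(
    a2 = a1 + rnorm(100),
    b1 = rnorm(100)
  )

cor_long <- vars %>%
  cor() %>%                          # create the correlation matrix
  as.data.frame() %>%                # make it a data frame
  rownames_to_column(var = "V1") %>% # set rownames as V1
  pivot_longer(-V1, names_to = "V2", values_to = "r") # wide to long

cor_long
```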

Once you have a correlation matrix in the correct (long) format, it's easy to make a heatmap using geom_tile().

ggplot(heatmap, aes(V1, V2, fill=r)) +
  geom_tile()

Figure 3.28: Heatmap using geom_tile()

When you save a plot with ggsave(), the file type is set from the filename suffix, or by specifying the argument device, which can take the following values: "eps", "ps", "tex", "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf".

3.10 Interactive Plots

You can use the plotly package to make interactive graphs. Just assign your ggplot to a variable and use the function ggplotly().

demog_plot <- ggplot(pet_happy, aes(pet, happiness, fill=pet)) +
  geom_point(position = position_jitter(width= 0.2, height = 0), size = 2)

ggplotly(demog_plot)

Figure 3.29: Interactive graph using plotly

Hover over the data points above and click on the legend items.

3.11 Quiz

  1. Generate a plot like this from the built-in dataset iris. Make sure to include the custom axis labels.

    ggplot(iris, aes(Species, Petal.Width, fill = Species)) +
      geom_boxplot(show.legend = FALSE) +
      xlab("Flower Species") +
      ylab("Petal Width (in cm)")
    # there are many ways to do things, the code below is also correct
    ggplot(iris) +
      geom_boxplot(aes(Species, Petal.Width, fill = Species), show.legend = FALSE) +
      labs(x = "Flower Species",
           y = "Petal Width (in cm)")
  2. You have just created a plot using the following code. How do you save it?

    ggplot(cars, aes(speed, dist)) + 
      geom_point() +
      geom_smooth(method = lm)
  3. Debug the following code.

    ggplot(iris) +
      geom_point(aes(Petal.Width, Petal.Length, colour = Species)) +
      geom_smooth(method = lm)
    # one possible fix: set the aesthetics in ggplot() so that geom_smooth() inherits them
    ggplot(iris, aes(Petal.Width, Petal.Length, colour = Species)) +
      geom_point() +
      geom_smooth(method = lm)

  4. Generate a plot like this from the built-in dataset ChickWeight.

    ggplot(ChickWeight, aes(weight, Time)) +
      geom_hex(binwidth = c(10, 1))
  5. Generate a plot like this from the built-in dataset iris.

    pw <- ggplot(iris, aes(Petal.Width, color = Species)) +
      geom_density() +
      xlab("Petal Width (in cm)")
    pl <- ggplot(iris, aes(Petal.Length, color = Species)) +
      geom_density() +
      xlab("Petal Length (in cm)")
    pw_pl <- ggplot(iris, aes(Petal.Width, Petal.Length, color = Species)) +
      geom_point() +
      geom_smooth(method = lm) +
      xlab("Petal Width (in cm)") +
      ylab("Petal Length (in cm)")
    plot_grid(
      pw, pl, pw_pl, 
      labels = c("A", "B", "C"),
      nrow = 3
    )

3.12 Exercises

Download the exercises. See the plots to see what your plots should look like (this doesn't contain the answer code). See the answers only after you've attempted all the questions.