# Chapter 3 Data Visualisation

Take the quiz to see if you need to review this chapter.

## 3.1 Learning Objectives

### 3.1.1 Basic

- Understand what types of graphs are best for different types of data
  - 1 discrete
  - 1 continuous
  - 2 discrete
  - 2 continuous
  - 1 discrete, 1 continuous
  - 3 continuous

- Create common types of graphs with ggplot2
- Set custom labels and colours
- Represent factorial designs with different colours or facets
- Save plots as an image file

### 3.1.2 Intermediate

- Superimpose different types of graphs
- Add lines to graphs
- Deal with overlapping data
- Create less common types of graphs

## 3.2 Resources

- Look at Data from Data Visualization for Social Science
- Chapter 3: Data Visualisation of *R for Data Science*
- Chapter 28: Graphics for communication of *R for Data Science*
- Graphs in *Cookbook for R*
- ggplot2 cheat sheet
- ggplot2 documentation
- The R Graph Gallery (this is really useful)
- Top 50 ggplot2 Visualizations
- R Graphics Cookbook by Winston Chang
- ggplot extensions
- plotly for creating interactive graphs

Stub for this lesson

## Setup
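
The code chunks in this chapter call functions from several packages: `tibble()`, `%>%` and `ggplot()` from the tidyverse, `plot_grid()` from cowplot, and `ggplotly()` from plotly. A minimal setup sketch, assuming these packages are already installed; the seed value is an arbitrary choice made for this example, not one taken from the lesson:

```r
# Packages used by the examples in this chapter
library(tidyverse) # ggplot2, dplyr, tibble and tidyr
library(cowplot)   # plot_grid()
library(plotly)    # ggplotly()

# Fix the random seed so the simulated data are reproducible
# (the value itself is arbitrary)
set.seed(8675309)
```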

## 3.3 Common Variable Combinations

**Continuous** variables are properties you can measure, like height. **Discrete** (or categorical) variables are things you can count, like the number of pets you have. Categorical variables can be **nominal**, where the categories don’t really have an order, like cats, dogs and ferrets (even though ferrets are obviously best). They can also be **ordinal**, where there is a clear order, but the distance between the categories isn’t something you could exactly equate, like points on a Likert rating scale.

Different types of visualisations are good for different types of variables.

Before you read ahead, come up with an example of each type of variable combination and sketch the types of graphs that would best display these data.

- 1 discrete
- 1 continuous
- 2 discrete
- 2 continuous
- 1 discrete, 1 continuous
- 3 continuous

### 3.3.1 Data

The code below creates some data frames with different types of data. We’ll learn how to simulate data like this in the Probability & Simulation chapter, but for now just run the code chunk below.

- `pets` has a column with pet type
- `pet_happy` has `happiness` and `age` for 500 dog owners and 500 cat owners
- `x_vs_y` has two correlated continuous variables (`x` and `y`)
- `overlap` has two correlated ordinal variables and 1000 observations so there is a lot of overlap
- `overplot` has two correlated continuous variables and 10000 observations

First, think about what kinds of graphs are best for representing these different types of data.

```
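library(tidyverse) # provides tibble(), and later ggplot() and %>%

# 100 pets, sampled with unequal probabilities for each type
pets <- tibble(
  pet = sample(
    c("dog", "cat", "ferret", "bird", "fish"),
    100,
    TRUE,
    c(0.45, 0.40, 0.05, 0.05, 0.05)
  )
)

# happiness and age for 500 dog owners and 500 cat owners
pet_happy <- tibble(
  pet = rep(c("dog", "cat"), each = 500),
  happiness = c(rnorm(500, 55, 10), rnorm(500, 45, 10)),
  age = rpois(1000, 3) + 20
)

# two correlated continuous variables
x_vs_y <- tibble(
  x = rnorm(100),
  y = x + rnorm(100, 0, 0.5)
)

# two correlated ordinal variables, so many points overlap
overlap <- tibble(
  x = rbinom(1000, 10, 0.5),
  y = x + rbinom(1000, 20, 0.5)
)

# two correlated continuous variables with many observations
overplot <- tibble(
  x = rnorm(10000),
  y = x + rnorm(10000, 0, 0.5)
)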
```

## 3.4 Basic Plots

### 3.4.1 Bar plot

Bar plots are good for categorical data where you want to represent the count.
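
As a minimal sketch of a bar plot of counts: the block below rebuilds a small stand-in for the `pets` data so that it runs on its own (the seed and the name `bar_plot` are choices made for this example), then lets `geom_bar()` count the rows in each category.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `pets` data: one row per pet
set.seed(1)
pets <- tibble(
  pet = sample(
    c("dog", "cat", "ferret", "bird", "fish"),
    100,
    TRUE,
    c(0.45, 0.40, 0.05, 0.05, 0.05)
  )
)

# geom_bar() counts the rows in each category, so only x is mapped
bar_plot <- ggplot(pets, aes(pet)) +
  geom_bar()

bar_plot
```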

### 3.4.2 Density plot

Density plots are good for one continuous variable, but only if you have a fairly large number of observations.

You can represent subsets of a variable by assigning the category variable to the argument `group`, `fill`, or `color`.

Try changing the `alpha` argument to figure out what it does.
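
A minimal sketch of a grouped density plot, using a stand-in for `pet_happy` with the same `pet` and `happiness` columns (the seed and the name `density_plot` are assumptions made for this example). The `alpha` value is the one to experiment with.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `pet_happy` data
set.seed(1)
pet_happy <- tibble(
  pet = rep(c("dog", "cat"), each = 500),
  happiness = c(rnorm(500, 55, 10), rnorm(500, 45, 10))
)

# Mapping pet to fill draws one density curve per pet type
density_plot <- ggplot(pet_happy, aes(happiness, fill = pet)) +
  geom_density(alpha = 0.5)

density_plot
```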

### 3.4.3 Frequency Polygons

If you don’t want smoothed distributions, try `geom_freqpoly()`.

Try changing the `binwidth` argument to 5 and 0.1. How do you figure out the right value?
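
The sketch below shows one possible starting point for that exercise, again on a stand-in for `pet_happy`. The `binwidth` of 1 is only an assumed default to vary from, and `freq_plot` is a name chosen here.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `pet_happy` data
set.seed(1)
pet_happy <- tibble(
  pet = rep(c("dog", "cat"), each = 500),
  happiness = c(rnorm(500, 55, 10), rnorm(500, 45, 10))
)

# One unsmoothed line per pet type; binwidth is in units of happiness
freq_plot <- ggplot(pet_happy, aes(happiness, colour = pet)) +
  geom_freqpoly(binwidth = 1)

freq_plot
```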

### 3.4.4 Histogram

Histograms are also good for one continuous variable, and work well if you don’t have many observations. Set the `binwidth` to control how wide each bar is.

Histograms in ggplot look pretty bad unless you set the `fill` and `color`.
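
One way to set those two arguments is sketched below on a stand-in for `pet_happy`. The particular colours and the name `hist_plot` are arbitrary choices for illustration.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `pet_happy` data
set.seed(1)
pet_happy <- tibble(
  pet = rep(c("dog", "cat"), each = 500),
  happiness = c(rnorm(500, 55, 10), rnorm(500, 45, 10))
)

# fill sets the inside of each bar, color sets its outline
hist_plot <- ggplot(pet_happy, aes(happiness)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black")

hist_plot
```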

If you show grouped histograms, you also probably want to change the default `position` argument.

```
ggplot(pet_happy, aes(happiness, fill = pet)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "dodge")
```

Try changing the `position` argument to “identity”, “fill”, “dodge”, or “stack”.

### 3.4.5 Column plot

Column plots are the worst way to represent grouped continuous data, but also one of the most common.

To make column plots with error bars, you first need to calculate the means, error bar upper limits (`ymax`) and error bar lower limits (`ymin`) for each category. You’ll learn more about how to use the code below in the next two lessons.

```
# calculate mean and SD for each pet
avg_pet_happy <- pet_happy %>%
  group_by(pet) %>%
  summarise(
    mean = mean(happiness),
    sd = sd(happiness)
  )

ggplot(avg_pet_happy, aes(pet, mean, fill = pet)) +
  geom_col(alpha = 0.5) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.25) +
  geom_hline(yintercept = 40)
```

What do you think `geom_hline()` does?

### 3.4.6 Boxplot

Boxplots are great for representing the distribution of grouped continuous variables. They fix most of the problems with using barplots for continuous data.
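
A minimal boxplot sketch on a stand-in for `pet_happy` (seed and the name `box_plot` are assumptions for this example):

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `pet_happy` data
set.seed(1)
pet_happy <- tibble(
  pet = rep(c("dog", "cat"), each = 500),
  happiness = c(rnorm(500, 55, 10), rnorm(500, 45, 10))
)

# Categories on the x-axis, the continuous variable on the y-axis
box_plot <- ggplot(pet_happy, aes(pet, happiness, fill = pet)) +
  geom_boxplot(alpha = 0.5)

box_plot
```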

### 3.4.7 Violin plot

Violin plots are like sideways, mirrored density plots. They give even more information than a boxplot about distribution and are especially useful when you have non-normal distributions.

```
ggplot(pet_happy, aes(pet, happiness, fill = pet)) +
  geom_violin(
    trim = FALSE,
    draw_quantiles = c(0.25, 0.5, 0.75),
    alpha = 0.5
  )
```

Try changing the numbers in the `draw_quantiles` argument.

### 3.4.8 Scatter plot

Scatter plots are a good way to represent the relationship between two continuous variables.
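
For example, a basic scatter plot might look like the sketch below, which rebuilds a stand-in for `x_vs_y` so that it runs on its own (the seed and the name `scatter_plot` are choices made here):

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `x_vs_y` data
set.seed(1)
x_vs_y <- tibble(
  x = rnorm(100),
  y = x + rnorm(100, 0, 0.5)
)

# One point per observation
scatter_plot <- ggplot(x_vs_y, aes(x, y)) +
  geom_point()

scatter_plot
```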

### 3.4.9 Line graph

You often want to represent the relationship as a single line.
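
One common way to draw such a line is `geom_smooth()` with a linear model, as the customisation examples later in this chapter do. A minimal sketch on a stand-in for `x_vs_y` (seed and the name `line_plot` are assumptions):

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `x_vs_y` data
set.seed(1)
x_vs_y <- tibble(
  x = rnorm(100),
  y = x + rnorm(100, 0, 0.5)
)

# method = "lm" fits a straight line; the shaded band is its confidence interval
line_plot <- ggplot(x_vs_y, aes(x, y)) +
  geom_smooth(method = "lm")

line_plot
```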

## 3.5 Customisation

### 3.5.1 Labels

You can set custom titles and axis labels in a few different ways.

```
ggplot(x_vs_y, aes(x, y)) +
  geom_smooth(method = "lm") +
  ggtitle("My Plot Title") +
  xlab("The X Variable") +
  ylab("The Y Variable")
```

```
ggplot(x_vs_y, aes(x, y)) +
  geom_smooth(method = "lm") +
  labs(title = "My Plot Title",
       x = "The X Variable",
       y = "The Y Variable")
```

### 3.5.2 Colours

You can set custom values for colour and fill using functions like `scale_colour_manual()` and `scale_fill_manual()`. The Colours chapter in Cookbook for R has many more ways to customise colour.

```
ggplot(pet_happy, aes(pet, happiness, colour = pet, fill = pet)) +
  geom_violin() +
  scale_color_manual(values = c("darkgreen", "dodgerblue")) +
  scale_fill_manual(values = c("#CCFFCC", "#BBDDFF"))
```

### 3.5.3 Save as File

You can save a ggplot using `ggsave()`. It saves the last ggplot you made, by default, but you can specify which plot you want to save if you assigned that plot to a variable.

You can set the `width` and `height` of your plot. The default units are inches, but you can change the `units` argument to “in”, “cm”, or “mm”.
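
A minimal sketch of saving a named plot with explicit dimensions. The file name `scatter.png`, the temporary directory and the sizes are hypothetical choices for this example; in practice you would pick your own path.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `x_vs_y` data
set.seed(1)
x_vs_y <- tibble(
  x = rnorm(100),
  y = x + rnorm(100, 0, 0.5)
)

# Assign the plot to a variable so it can be saved explicitly
scatter <- ggplot(x_vs_y, aes(x, y)) +
  geom_point()

# Hypothetical output path; a temporary directory keeps the example tidy
out_file <- file.path(tempdir(), "scatter.png")

# The .png suffix selects the file type; width and height are in cm here
ggsave(out_file, plot = scatter, width = 8, height = 6, units = "cm")
```

Leaving out the `plot` argument would save the most recently displayed ggplot instead.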

## 3.6 Combination Plots

### 3.6.1 Violinbox plot

To demonstrate the use of `facet_grid()` for factorial designs, we create a new column called `agegroup` to split the data into participants older than the median age or younger than the median age. New factors will display in alphabetical order, so we can use the `factor()` function to set the levels in the order we want.

```
pet_happy %>%
  mutate(agegroup = ifelse(age < median(age), "Younger", "Older"),
         agegroup = factor(agegroup, levels = c("Younger", "Older"))) %>%
  ggplot(aes(pet, happiness, fill = pet)) +
  geom_violin(trim = FALSE, alpha = 0.5, show.legend = FALSE) +
  geom_boxplot(width = 0.25, fill = "white") +
  facet_grid(. ~ agegroup) +
  scale_fill_manual(values = c("orange", "green"))
```

Set the `show.legend` argument to `FALSE` to hide the legend. We do this here because the x-axis already labels the pet types.

### 3.6.2 Violin-point-range plot

You can use `stat_summary()` to superimpose a point-range plot showing the mean ± 1 SD. You’ll learn how to write your own functions in the lesson on Iteration and Functions.

```
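ggplot(pet_happy, aes(pet, happiness, fill = pet)) +
  geom_violin(
    trim = FALSE,
    alpha = 0.5
  ) +
  # fun, fun.max and fun.min replace fun.y, fun.ymax and fun.ymin,
  # which have been deprecated since ggplot2 3.3.0
  stat_summary(
    fun = mean,
    fun.max = function(x) {mean(x) + sd(x)},
    fun.min = function(x) {mean(x) - sd(x)},
    geom = "pointrange"
  )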
```

### 3.6.3 Violin-jitter plot

If you don’t have a lot of data points, it’s good to represent them individually. You can use `geom_jitter()` to do this.

```
pet_happy %>%
  sample_n(50) %>% # choose 50 random observations from the dataset
  ggplot(aes(pet, happiness, fill = pet)) +
  geom_violin(
    trim = FALSE,
    draw_quantiles = c(0.25, 0.5, 0.75),
    alpha = 0.5
  ) +
  geom_jitter(
    width = 0.15, # jitter points up to 0.15 units left or right
    height = 0, # do not move position on the y-axis
    alpha = 0.5,
    size = 3
  )
```

### 3.6.4 Scatter-line graph

If your graph isn’t too complicated, it’s good to also show the individual data points behind the line.
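
A minimal sketch of this combination on a stand-in for `x_vs_y` (seed and the name `scatter_line` are assumptions). The points are added first so that the fitted line is drawn on top of them.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `x_vs_y` data
set.seed(1)
x_vs_y <- tibble(
  x = rnorm(100),
  y = x + rnorm(100, 0, 0.5)
)

# Layers are drawn in order: points first, then the line over them
scatter_line <- ggplot(x_vs_y, aes(x, y)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm")

scatter_line
```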

### 3.6.5 Grid of plots

You can use the `cowplot` package to easily make grids of different graphs. First, you have to assign each plot a name. Then you list all the plots as the first arguments of `plot_grid()` and provide a list of labels.

```
my_hist <- ggplot(pet_happy, aes(happiness, fill = pet)) +
  geom_histogram(
    binwidth = 1,
    alpha = 0.5,
    position = "dodge",
    show.legend = FALSE
  )

my_violin <- ggplot(pet_happy, aes(pet, happiness, fill = pet)) +
  geom_violin(
    trim = FALSE,
    draw_quantiles = c(0.5),
    alpha = 0.5,
    show.legend = FALSE
  )

my_box <- ggplot(pet_happy, aes(pet, happiness, fill = pet)) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE)

my_density <- ggplot(pet_happy, aes(happiness, fill = pet)) +
  geom_density(alpha = 0.5, show.legend = FALSE)

my_bar <- pet_happy %>%
  group_by(pet) %>%
  summarise(
    mean = mean(happiness),
    sd = sd(happiness)
  ) %>%
  ggplot(aes(pet, mean, fill = pet)) +
  geom_bar(stat = "identity", alpha = 0.5,
           color = "black", show.legend = FALSE) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.25)

plot_grid(
  my_violin,
  my_box,
  my_density,
  my_bar,
  labels = c("A", "B", "C", "D")
)
```

## 3.7 Overlapping Discrete Data

### 3.7.1 Reducing Opacity

You can deal with overlapping data points (very common if you’re using Likert scales) by reducing the opacity of the points. You need to use trial and error to adjust the opacity so the plot looks right.
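
A minimal sketch of this approach on a stand-in for `overlap`. The `size` and `alpha` values are only assumed starting points for the trial and error, and `opacity_plot` is a name chosen here.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `overlap` data
set.seed(1)
overlap <- tibble(
  x = rbinom(1000, 10, 0.5),
  y = x + rbinom(1000, 20, 0.5)
)

# Nearly transparent points: darker spots mean more overlapping observations
opacity_plot <- ggplot(overlap, aes(x, y)) +
  geom_point(size = 5, alpha = 0.05)

opacity_plot
```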

### 3.7.2 Proportional Dot Plots

Or you can set the size of the dot proportional to the number of overlapping observations using `geom_count()`.
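
A minimal `geom_count()` sketch on a stand-in for `overlap` (seed and the name `count_plot` are assumptions for this example):

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `overlap` data
set.seed(1)
overlap <- tibble(
  x = rbinom(1000, 10, 0.5),
  y = x + rbinom(1000, 20, 0.5)
)

# geom_count() counts the observations at each (x, y) and maps that to size
count_plot <- ggplot(overlap, aes(x, y)) +
  geom_count()

count_plot
```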

Alternatively, you can transform your data to create a count column and use the count to set the dot colour.

```
overlap %>%
  group_by(x, y) %>%
  summarise(count = n()) %>%
  ggplot(aes(x, y, color = count)) +
  geom_point(size = 5) +
  scale_color_viridis_c()
```

The viridis package changes the colour themes to be easier to read by people with colourblindness and to print better in greyscale. Viridis is built into `ggplot2` since v3.0.0. It uses `scale_colour_viridis_c()` and `scale_fill_viridis_c()` for continuous variables and `scale_colour_viridis_d()` and `scale_fill_viridis_d()` for discrete variables.

## 3.8 Overlapping Continuous Data

Even if the variables are continuous, overplotting might obscure any relationships if you have lots of data.

### 3.8.1 2D Density Plot

Use `geom_density2d()` to create a contour map.
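
A minimal contour-map sketch on a stand-in for `overplot` (seed and the name `contour_plot` are choices made for this example):

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `overplot` data
set.seed(1)
overplot <- tibble(
  x = rnorm(10000),
  y = x + rnorm(10000, 0, 0.5)
)

# Each contour line joins points of equal estimated density
contour_plot <- ggplot(overplot, aes(x, y)) +
  geom_density2d()

contour_plot
```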

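You can use `stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")` to create a heatmap-style density plot. (`after_stat(level)` is the current spelling of the older `..level..` notation, which is deprecated in recent versions of ggplot2.)

```
overplot %>%
  ggplot(aes(x, y)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  scale_fill_viridis_c()
```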

### 3.8.2 2D Histogram

Use `geom_bin2d()` to create a rectangular heatmap of bin counts. Set the `binwidth` to the x and y dimensions to capture in each box.
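
A minimal sketch on a stand-in for `overplot`. The 1 x 1 bin size is an assumed starting value, and `bin2d_plot` is a name chosen here.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `overplot` data
set.seed(1)
overplot <- tibble(
  x = rnorm(10000),
  y = x + rnorm(10000, 0, 0.5)
)

# binwidth = c(width on x, height on y), in data units
bin2d_plot <- ggplot(overplot, aes(x, y)) +
  geom_bin2d(binwidth = c(1, 1))

bin2d_plot
```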

### 3.8.3 Hexagonal Heatmap

Use `geom_hex()` to create a hexagonal heatmap of bin counts (this geom needs the hexbin package to be installed). Adjust the `binwidth`, `xlim()`, `ylim()` and/or the figure dimensions to make the hexagons more or less stretched.
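
A minimal sketch on a stand-in for `overplot`. The bin size is an assumed starting value and `hex_plot` is a name chosen here. Because the hexagonal binning is done by the hexbin package, the plot is only drawn when that package is available.

```r
library(ggplot2)
library(tibble)

# Stand-in for the chapter's `overplot` data
set.seed(1)
overplot <- tibble(
  x = rnorm(10000),
  y = x + rnorm(10000, 0, 0.5)
)

# binwidth = c(width on x, height on y) of each hexagon, in data units
hex_plot <- ggplot(overplot, aes(x, y)) +
  geom_hex(binwidth = c(0.25, 0.25))

# Only draw the plot if the hexbin package is installed
if (requireNamespace("hexbin", quietly = TRUE)) print(hex_plot)
```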

### 3.8.4 Correlation Heatmap

I’ve included the code for creating a correlation matrix from a table of variables, but you don’t need to understand how this is done yet. We’ll cover `mutate()` and `gather()` functions in the dplyr and tidyr lessons.

```
# generate two sets of correlated variables (a and b)
heatmap <- tibble(
  a1 = rnorm(100),
  b1 = rnorm(100)
) %>%
  mutate(
    a2 = a1 + rnorm(100),
    a3 = a1 + rnorm(100),
    a4 = a1 + rnorm(100),
    b2 = b1 + rnorm(100),
    b3 = b1 + rnorm(100),
    b4 = b1 + rnorm(100)
  ) %>%
  cor() %>% # create the correlation matrix
  as.data.frame() %>% # make it a data frame
  rownames_to_column(var = "V1") %>% # set rownames as V1
  gather("V2", "r", a1:b4) # wide to long (V2)
```

Once you have a correlation matrix in the correct (long) format, it’s easy to make a heatmap using `geom_tile()`.
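
A minimal sketch of such a heatmap. To stay self-contained it builds a small long-format correlation table (`r_long`, a hypothetical stand-in with the same `V1`, `V2` and `r` columns) using base R rather than the chapter's pipeline.

```r
library(ggplot2)

# Three correlated variables standing in for the chapter's a1..b4 columns
set.seed(1)
a1 <- rnorm(100)
vars <- data.frame(a1 = a1, a2 = a1 + rnorm(100), a3 = a1 + rnorm(100))

# Correlation matrix reshaped to long format: one row per pair (V1, V2)
r_long <- as.data.frame(as.table(cor(vars)))
names(r_long) <- c("V1", "V2", "r")

# One tile per pair of variables, coloured by the correlation
tile_plot <- ggplot(r_long, aes(V1, V2, fill = r)) +
  geom_tile() +
  scale_fill_viridis_c()

tile_plot
```

With the `heatmap` object created above, the same plotting call should work with `heatmap` in place of `r_long`.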

When you save a plot with `ggsave()`, the file type is set from the filename suffix, or by specifying the argument `device`, which can take the following values: “eps”, “ps”, “tex”, “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf”.

## 3.9 Interactive Plots

You can use the `plotly` package to make interactive graphs. Just assign your ggplot to a variable and use the function `ggplotly()`.

```
demog_plot <- ggplot(pet_happy, aes(pet, happiness, fill = pet)) +
  geom_point(position = position_jitter(width = 0.2, height = 0), size = 2)

ggplotly(demog_plot)
```

Hover over the data points above and click on the legend items.

## 3.10 Quiz

Generate a plot like this from the built-in dataset `iris`. Make sure to include the custom axis labels.

```
ggplot(iris, aes(Species, Petal.Width, fill = Species)) +
  geom_boxplot(show.legend = FALSE) +
  xlab("Flower Species") +
  ylab("Petal Width (in cm)")

# there are many ways to do things, the code below is also correct
ggplot(iris) +
  geom_boxplot(aes(Species, Petal.Width, fill = Species), show.legend = FALSE) +
  labs(x = "Flower Species",
       y = "Petal Width (in cm)")
```

You have just created a plot using the following code. How do you save it?

Debug the following code.

Generate a plot like this from the built-in dataset `ChickWeight`.

Generate a plot like this from the built-in dataset `iris`.

```
pw <- ggplot(iris, aes(Petal.Width, color = Species)) +
  geom_density() +
  xlab("Petal Width (in cm)")

pl <- ggplot(iris, aes(Petal.Length, color = Species)) +
  geom_density() +
  xlab("Petal Length (in cm)") +
  coord_flip()

pw_pl <- ggplot(iris, aes(Petal.Width, Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = lm) +
  xlab("Petal Width (in cm)") +
  ylab("Petal Length (in cm)")

cowplot::plot_grid(
  pw, pl, pw_pl,
  labels = c("A", "B", "C"),
  nrow = 3
)
```