3 Transforming Data
3.1 Data formats
To visualise the experimental reaction time and accuracy data using ggplot2
, we first need to reshape the data from wide format to long format. This step can cause friction with novice users of R. Traditionally, psychologists have been taught data skills using wide-format data. Wide-format data typically has one row of data for each participant, with separate columns for each score or variable. For repeated-measures variables, the dependent variable is split across different columns. For between-groups variables, a separate column is added to encode the group to which a participant or observation belongs.
The simulated lexical decision data is currently in wide format (see Table 3.1), where each participant's aggregated 4 reaction time and accuracy for each level of the within-subject variable is split across multiple columns for the repeated factor of conditon (words versus non-words).
id | age | language | rt_word | rt_nonword | acc_word | acc_nonword |
---|---|---|---|---|---|---|
S001 | 22 | monolingual | 379.46 | 516.82 | 99 | 90 |
S002 | 33 | monolingual | 312.45 | 435.04 | 94 | 82 |
S003 | 23 | monolingual | 404.94 | 458.50 | 96 | 87 |
S004 | 28 | monolingual | 298.37 | 335.89 | 92 | 76 |
S005 | 26 | monolingual | 316.42 | 401.32 | 91 | 83 |
S006 | 29 | monolingual | 357.17 | 367.34 | 96 | 78 |
Wide format is popular because it is intuitive to read and easy to enter data into as all the data for one participant is contained within a single row. However, for the purposes of analysis, and particularly for analysis using R, this format is unsuitable. Whilst it is intuitive to read by a human, the same is not true for a computer. Wide-format data concatenates multiple pieces of information in a single column, for example in Table 3.1, rt_word
contains information related to both a DV and one level of an IV. In comparison, long-format data separates the DV from the IVs so that each column represents only one variable. The less intuitive part is that long-format data has multiple rows for each participant (one row for each observation) and a column that encodes the level of the IV (word
or nonword
). Wickham (2014) provides a comprehensive overview of the benefits of a similar format known as tidy data, which is a standard way of mapping a dataset to its structure. For the purposes of this tutorial there are two important rules: each column should be a variable and each row should be an observation.
Moving from using wide-format to long-format datasets can require a conceptual shift on the part of the researcher and one that usually only comes with practice and repeated exposure5. It may be helpful to make a note that “row = participant” (wide format) and “row = observation” (long format) until you get used to moving between the formats. For our example dataset, adhering to these rules for reshaping the data would produce Table 3.2. Rather than different observations of the same dependent variable being split across columns, there is now a single column for the DV reaction time, and a single column for the DV accuracy. Each participant now has multiple rows of data, one for each observation (i.e., for each participant there will be as many rows as there are levels of the within-subject IV). Although there is some repetition of age and language group, each row is unique when looking at the combination of measures.
id | age | language | condition | rt | acc |
---|---|---|---|---|---|
S001 | 22 | monolingual | word | 379.46 | 99 |
S001 | 22 | monolingual | nonword | 516.82 | 90 |
S002 | 33 | monolingual | word | 312.45 | 94 |
S002 | 33 | monolingual | nonword | 435.04 | 82 |
S003 | 23 | monolingual | word | 404.94 | 96 |
S003 | 23 | monolingual | nonword | 458.50 | 87 |
The benefits and flexibility of this format will hopefully become apparent as we progress through the tutorial, however, a useful rule of thumb when working with data in R for visualisation is that anything that shares an axis should probably be in the same column. For example, a simple boxplot showing reaction time by condition would display the variable condition
on the x-axis with bars representing both the word
and nonword
data, and rt
on the y-axis. Therefore, all the data relating to condition
should be in one column, and all the data relating to rt
should be in a separate single column, rather than being split like in wide-format data.
3.2 Wide to long format
We have chosen a 2 x 2 design with two DVs, as we anticipate that this is a design many researchers will be familiar with and may also have existing datasets with a similar structure. However, it is worth normalising that trial-and-error is part of the process of learning how to apply these functions to new datasets and structures. Data visualisation can be a useful way to scaffold learning these data transformations because they can provide a concrete visual check as to whether you have done what you intended to do with your data.
3.2.1 Step 1: pivot_longer()
The first step is to use the function pivot_longer()
to transform the data to long-form. We have purposefully used a more complex dataset with two DVs for this tutorial to aid researchers applying our code to their own datasets. Because of this, we will break down the steps involved to help show how the code works.
This first code ignores that the dataset has two DVs, a problem we will fix in step 2. The pivot functions can be easier to show than tell - you may find it a useful exercise to run the below code and compare the newly created object long
(Table 3.3) with the original dat
Table 3.1 before reading on.
long <- pivot_longer(data = dat,
cols = rt_word:acc_nonword,
names_to = "dv_condition",
values_to = "dv")
As with the other tidyverse functions, the first argument specifies the dataset to use as the base, in this case
dat
. This argument name is often dropped in examples.cols
specifies all the columns you want to transform. The easiest way to visualise this is to think about which columns would be the same in the new long-form dataset and which will change. If you refer back to Table 3.1, you can see thatid
,age
, andlanguage
all remain, while the columns that contain the measurements of the DVs change. The colon notationfirst_column:last_column
is used to select all variables from the first column specified to the last In our code,cols
specifies that the columns we want to transform arert_word
toacc_nonword
.names_to
specifies the name of the new column that will be created. This column will contain the names of the selected existing columns.Finally,
values_to
names the new column that will contain the values in the selected columns. In this case we'll call itdv
.
At this point you may find it helpful to go back and compare dat
and long
again to see how each argument matches up with the output of the table.
id | age | language | dv_condition | dv |
---|---|---|---|---|
S001 | 22 | monolingual | rt_word | 379.46 |
S001 | 22 | monolingual | rt_nonword | 516.82 |
S001 | 22 | monolingual | acc_word | 99.00 |
S001 | 22 | monolingual | acc_nonword | 90.00 |
S002 | 33 | monolingual | rt_word | 312.45 |
S002 | 33 | monolingual | rt_nonword | 435.04 |
3.2.2 Step 2: pivot_longer()
adjusted
The problem with the above long-format data-set is that dv_condition
combines two variables - it has information about the type of DV and the condition of the IV. To account for this, we include a new argument names_sep
and adjust name_to
to specify the creation of two new columns. Note that we are pivoting the same wide-format dataset dat
as we did in step 1.
long2 <- pivot_longer(data = dat,
cols = rt_word:acc_nonword,
names_sep = "_",
names_to = c("dv_type", "condition"),
values_to = "dv")
names_sep
specifies how to split up the variable name in cases where it has multiple components. This is when taking care to name your variables consistently and meaningfully pays off. Because the word to the left of the separator (_
) is always the DV type and the word to the right is always the condition of the within-subject IV, it is easy to automatically split the columns.Note that when specifying more than one column name, they must be combined using
c()
and be enclosed in their own quotation marks.
id | age | language | dv_type | condition | dv |
---|---|---|---|---|---|
S001 | 22 | monolingual | rt | word | 379.46 |
S001 | 22 | monolingual | rt | nonword | 516.82 |
S001 | 22 | monolingual | acc | word | 99.00 |
S001 | 22 | monolingual | acc | nonword | 90.00 |
S002 | 33 | monolingual | rt | word | 312.45 |
S002 | 33 | monolingual | rt | nonword | 435.04 |
3.2.3 Step 3: pivot_wider()
Although we have now split the columns so that there are separate variables for the DV type and level of condition, because the two DVs are different types of data, there is an additional bit of wrangling required to get the data in the right format for plotting.
In the current long-format dataset, the column dv
contains both reaction time and accuracy measures. Keeping in mind the rule of thumb that anything that shares an axis should probably be in the same column, this creates a problem because we cannot plot two different units of measurement on the same axis. To fix this we need to use the function pivot_wider()
. Again, we would encourage you at this point to compare long2
and dat_long
with the below code to try and map the connections before reading on.
dat_long <- pivot_wider(long2,
names_from = "dv_type",
values_from = "dv")
The first argument is again the dataset you wish to work from, in this case
long2
. We have removed the argument namedata
in this example.names_from
is the reverse ofnames_to
frompivot_longer()
. It will take the values from the variable specified and use these as the new column names. In this case, the values ofrt
andacc
that are currently in thedv_type
column will become the new column names.values_from
is the reverse ofvalues_to
frompivot_longer()
. It specifies the column that contains the values to fill the new columns with. In this case, the new columnsrt
andacc
will be filled with the values that were indv
.
Again, it can be helpful to compare each dataset with the code to see how it aligns. This final long-form data should look like Table 3.2.
If you are working with a dataset with only one DV, note that only step 1 of this process would be necessary. Also, be careful not to calculate demographic descriptive statistics from this long-form dataset. Because the process of transformation has introduced some repetition for these variables, the wide-format dataset where one row equals one participant should be used for demographic information. Finally, the three step process noted above is broken down for teaching purposes, in reality, one would likely do this in a single pipeline of code, for example:
dat_long <- pivot_longer(data = dat,
cols = rt_word:acc_nonword,
names_sep = "_",
names_to = c("dv_type", "condition"),
values_to = "dv") %>%
pivot_wider(names_from = "dv_type",
values_from = "dv")
3.3 Histogram 2
Now that we have the experimental data in the right form, we can begin to create some useful visualizations. First, to demonstrate how code recipes can be reused and adapted, we will create histograms of reaction time and accuracy. The below code uses the same template as before but changes the dataset (dat_long
), the bin-widths of the histograms, the x
variable to display (rt
/acc
), and the name of the x-axis.
ggplot(dat_long, aes(x = rt)) +
geom_histogram(binwidth = 10, fill = "white", colour = "black") +
scale_x_continuous(name = "Reaction time (ms)")
ggplot(dat_long, aes(x = acc)) +
geom_histogram(binwidth = 1, fill = "white", colour = "black") +
scale_x_continuous(name = "Accuracy (0-100)")
3.4 Density plots
The layer system makes it easy to create new types of plots by adapting existing recipes. For example, rather than creating a histogram, we can create a smoothed density plot by calling geom_density()
rather than geom_histogram()
. The rest of the code remains identical.
ggplot(dat_long, aes(x = rt)) +
geom_density()+
scale_x_continuous(name = "Reaction time (ms)")
3.4.1 Grouped density plots
Density plots are most useful for comparing the distributions of different groups of data. Because the dataset is now in long format, with each variable contained within a single column, we can map condition
to the plot.
- In addition to mapping
rt
to the x-axis, we specify thefill
aesthetic to fill the visualisation so that each level of thecondition
variable is represented by a different colour. - Because the density plots are overlapping, we set
alpha = 0.75
to make the geoms 75% transparent. - As with the x and y-axis scale functions, we can edit the names and labels of our fill aesthetic by adding on another
scale_*
layer (scale_fill_discrete()
). - Note that the
fill
here is set inside theaes()
function, which tells ggplot to set the fill differently for each value in thecondition
column. You cannot specify which colour here (e.g.,fill="red"
), like you could when you setfill
inside thegeom_*()
function before.
ggplot(dat_long, aes(x = rt, fill = condition)) +
geom_density(alpha = 0.75)+
scale_x_continuous(name = "Reaction time (ms)") +
scale_fill_discrete(name = "Condition",
labels = c("Non-Word", "Word"))
Please note that the code and figure for this plot has been corrected from the published paper due to the labels "Word" and "Non-word" being incorrectly reversed. This is of course mortifying as authors, although it does provide a useful teachable moment that R will do what you tell it to do, no more, no less, regardless of whether what you tell it to do is wrong.
3.5 Scatterplots
Scatterplots are created by calling geom_point()
and require both an x
and y
variable to be specified in the mapping.
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point()
A line of best fit can be added with an additional layer that calls the function geom_smooth()
. The default is to draw a LOESS or curved regression line. However, a linear line of best fit can be specified using method = "lm"
. By default, geom_smooth()
will also draw a confidence envelope around the regression line; this can be removed by adding se = FALSE
to geom_smooth()
. A common error is to try and use geom_line()
to draw the line of best fit, which whilst a sensible guess, will not work (try it).
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point() +
geom_smooth(method = "lm")
3.5.1 Grouped scatterplots
Similar to the density plot, the scatterplot can also be easily adjusted to display grouped data. For geom_point()
, the grouping variable is mapped to colour
rather than fill
and the relevant scale_*
function is added.
ggplot(dat_long, aes(x = rt, y = age, colour = condition)) +
geom_point() +
geom_smooth(method = "lm") +
scale_colour_discrete(name = "Condition",
labels = c("Non-Word", "Word"))
Please note that the code and figure for this plot has been corrected from the published paper due to the labels "Word" and "Non-word" being incorrectly reversed. This is of course mortifying as authors, although it does provide a useful teachable moment that R will do what you tell it to do, no more, no less, regardless of whether what you tell it to do is wrong.
3.6 Long to wide format
Following the rule that anything that shares an axis should probably be in the same column means that we will frequently need our data in long-form when using ggplot2
, However, there are some cases when wide format is necessary. For example, we may wish to visualise the relationship between reaction time in the word and non-word conditions. This requires that the corresponding word and non-word values for each participant be in the same row. The easiest way to achieve this in our case would simply be to use the original wide-format data as the input:
ggplot(dat, aes(x = rt_word, y = rt_nonword, colour = language)) +
geom_point() +
geom_smooth(method = "lm")
However, there may also be cases when you do not have an original wide-format version and you can use the pivot_wider()
function to transform from long to wide.
dat_wide <- dat_long %>%
pivot_wider(id_cols = "id",
names_from = "condition",
values_from = c(rt,acc))
id | rt_word | rt_nonword | acc_word | acc_nonword |
---|---|---|---|---|
S001 | 379.4585 | 516.8176 | 99 | 90 |
S002 | 312.4513 | 435.0404 | 94 | 82 |
S003 | 404.9407 | 458.5022 | 96 | 87 |
S004 | 298.3734 | 335.8933 | 92 | 76 |
S005 | 316.4250 | 401.3214 | 91 | 83 |
S006 | 357.1710 | 367.3355 | 96 | 78 |
3.7 Customisation 2
3.7.1 Accessible colour schemes
One of the drawbacks of using ggplot2
for visualisation is that the default colour scheme is not accessible (or visually appealing). The red and green default palette is difficult for colour-blind people to differentiate, and also does not display well in greyscale. You can specify exact custom colours for your plots, but one easy option is to use a custom colour palette. These take the same arguments as their default scale
sister functions for updating axis names and labels, but display plots in contrasting colours that can be read by colour-blind people and that also print well in grey scale. For categorical colours, the "Set2", "Dark2" and "Paired" palettes from the brewer
scale functions are colourblind-safe (but are hard to distinhuish in greyscale). For continuous colours, such as when colour is representing the magnitude of a correlation in a tile plot, the viridis
scale functions provide a number of different colourblind and greyscale-safe options.
ggplot(dat_long, aes(x = rt, y = age, colour = condition)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_brewer(palette = "Dark2",
name = "Condition",
labels = c("Non-word", "Word"))
Please note that the code and figure for this plot has been corrected from the published paper due to the labels "Word" and "Non-word" being incorrectly reversed. This is of course mortifying as authors, although it does provide a useful teachable moment that R will do what you tell it to do, no more, no less, regardless of whether what you tell it to do is wrong.
3.7.2 Specifying axis breaks
with seq()
Previously, when we have edited the breaks
on the axis labels, we have done so manually, typing out all the values we want to display on the axis. For example, the below code edits the y-axis so that age
is displayed in increments of 5.
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point() +
scale_y_continuous(breaks = c(20,25,30,35,40,45,50,55,60))
However, this is somewhat inefficient. Instead, we can use the function seq()
(short for sequence) to specify the first and last value and the increments by
which the breaks should display between these two values.
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point() +
scale_y_continuous(breaks = seq(20,60, by = 5))
3.8 Activities 2
Before you move on try the following:
- Use
fill
to created grouped histograms that display the distributions forrt
for eachlanguage
group separately and also edit the fill axis labels. Try addingposition = "dodge"
togeom_histogram()
to see what happens.
# fill and axis changes
ggplot(dat_long, aes(x = rt, fill = language)) +
geom_histogram(binwidth = 10) +
scale_x_continuous(name = "Reaction time (ms)") +
scale_fill_discrete(name = "Group",
labels = c("Monolingual", "Bilingual"))
# add in dodge
ggplot(dat_long, aes(x = rt, fill = language)) +
geom_histogram(binwidth = 10, position = "dodge") +
scale_x_continuous(name = "Reaction time (ms)") +
scale_fill_discrete(name = "Group",
labels = c("Monolingual", "Bilingual"))
- Use
scale_*
functions to edit the name of the x and y-axis on the scatterplot
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(name = "Reaction time") +
scale_y_continuous(name = "Age")
- Use
se = FALSE
to remove the confidence envelope from the scatterplots
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(name = "Reaction time") +
scale_y_continuous(name = "Age")
- Remove
method = "lm"
fromgeom_smooth()
to produce a curved fit line.
ggplot(dat_long, aes(x = rt, y = age)) +
geom_point() +
geom_smooth() +
scale_x_continuous(name = "Reaction time") +
scale_y_continuous(name = "Age")
- Replace the default fill on the grouped density plot with a colour-blind friendly version.
ggplot(dat_long, aes(x = rt, fill = condition)) +
geom_density(alpha = 0.75)+
scale_x_continuous(name = "Reaction time (ms)") +
scale_fill_brewer(palette = "Set2", # or "Dark2" or "Paired"
name = "Condition",
labels = c("Non-word", "Word"))