Lab 2 Data-Wrangling: A Key Skill
2.1 Overview
One of the key skills in any researcher's toolbox is the ability to work with data. When you run an experiment you get lots of data in various files. For instance, it is not uncommon for experimental software to create a new file for every participant you run and for each participant's file to contain numerous columns and rows of data, only some of which are important. Being able to wrangle that data, manipulate it into different layouts, extract the parts you need, and summarise it, is one of the most important skills we will help you learn in the coming weeks.
The next few labs are aimed at refreshing and consolidating your skills in working with data. This lab focuses on organising data using the tidyverse package. Over the course of the activities, you will recap the main functions and how to use them, and we will use a number of real datasets to give you a wide range of exposure to what Psychology is about, and to reiterate that the same skills apply across different datasets. The skills don't change, just the data!
There are some questions to answer as you go along to help build your skills: use the example code as a guide and check your answer against the solutions at the end of the chapter. Finally, remember to be pro-active in your learning, work together as a community, and if you get stuck, ask on the forums, google what you are trying to do (e.g. "mutate table tidyverse"), and use the cheatsheets or the Grassroots PsyTeacher book. The key cheatsheet for this activity is the Data Transformation Cheatsheet with dplyr.
In this lab you will recap on:
- Data-Wrangling with the Wickham Six one-table verbs.
- Additional useful functions such as count, pivot_longer (to gather data) and joins.
- Piping and making your code efficient.
Note: This preclass is a bit of a read but it is important that you have all this information in the one place so you can quickly refer back to it. Also, you did a very similar task in the Grassroots book so it is about recapping more than learning afresh. But take your time to try to understand the information and be sure to ask any questions you have. "Try first, then ask" is our only rule!
Remember to open up your Portfolio that you created in Lab 1 so you can add useful information to it as you work through the tasks! Also, summarising the information we give in this preclass, in your own words, is a great way to learn! You don't have to read all of the cheatsheets and books we point you to, but they might help from time to time to explain parts further.
For instance, do you remember how to get help on an R function in RStudio? In your Console window, you can call the help function (e.g. ?mutate) to view the reference page for each function. This example shows how to get help on the mutate() function within dplyr, which we will use in later labs.
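For example, either of the following lines, run in the Console, will open that help page:

?mutate
help("mutate")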
2.2 PreClass Activity
Revisiting Tabular Data
From working with data previously you know that nearly all data in research methods is stored in two-dimensional tables, either called data-frames, tables or tibbles. There are other ways of storing data that you will discover in time but mainly we will be using tibbles (if you would like more info, type vignette("tibble") in the Console window). A tibble is really just a table of data with columns and rows of information, and within the tibble you can get different types of data, i.e. double, integer, and character.
Type of Data | Description |
---|---|
Double | Numbers including decimals (e.g. 3.14) |
Integer | Numbers without decimals (e.g. 3) |
Character | Tends to contain letters or be words, but can be numbers (e.g. "AB12", "Data Rocks!") |
- Note: Double and Integer can both be referred to as Numeric data, and you will see this word from time to time. For clarity, we will use Double as a term for any number with a decimal (e.g. 3.14) and Integer as a term for any whole number (no decimal, e.g. 3).
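If you are ever unsure which type a value is, a quick check in the Console will tell you. A minimal example:

typeof(3.14)   # "double"
typeof(3L)     # "integer" - the L tells R to store the number as an integer
typeof("AB12") # "character"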
Quickfire Questions
What type of data would these most likely be:
Male =
7.15 =
137 =
There are lots of different types of data, as well as different levels of measurement, and it can get very confusing. It's important to try to remember which is which because you can only do certain types of analyses on certain types of data and certain types of measurements. For instance, you can't take the average of characters, just like you can't take the average of categorical data. Likewise, you can do any maths on double and integer data, just like you can on interval and ratio data. Integer data is funny in that sometimes it is ordinal and sometimes it is interval; sometimes you should take the median and sometimes you should take the mean.
The main point is to always know what type of data you are using and to think about what you can and cannot do with them.
2.2.1 Revisiting the Wickham Six
The main way we teach data-wrangling skills is by using the Wickham Six one-table verbs. These are part of the tidyverse package which we introduced to you in the first PsyTeachR book, and more specifically come from the dplyr package that is contained within the tidyverse. These six verbs are often referred to as the Wickham Six "one-table" dplyr verbs as they perform actions on a single table of data.
We will look at some of the basics again here but try to look back at the exercises in the Grassroots book to see how we used these verbs (functions) previously. The Wickham Six are:
Function | Description |
---|---|
select() | Include or exclude certain variables (columns) |
filter() | Include or exclude certain observations/data (rows) |
mutate() | Creates new variables (columns) |
arrange() | Changes the order of observations (rows) |
group_by() | Organises the observations (rows) into groups |
summarise() | Create summary variables for groups of observations (rows) |
You will use the Wickham Six very frequently for wrangling your data so this would definitely be something you should be making notes about - not just the names, but how they work and any particular nuances that you spot. Perhaps recreate the above table and add your own examples.
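If it helps your notes, here is a minimal sketch of each verb in action on a small made-up tibble (the column names and values here are hypothetical, not from any dataset used in this lab):

library(tidyverse)

demo_data <- tibble(id        = c(1, 1, 2, 2),
                    condition = c("a", "b", "a", "b"),
                    rt        = c(300, 320, 280, 310))

select(demo_data, id, rt)              # keep only the id and rt columns
filter(demo_data, rt > 300)            # keep only rows where rt is above 300
mutate(demo_data, rt_sec = rt / 1000)  # add a new column with rt in seconds
arrange(demo_data, desc(rt))           # sort the rows by rt, largest first
summarise(group_by(demo_data, condition), mean_rt = mean(rt))  # mean rt per group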
2.2.2 Learning to Wrangle: Is there a Chastity Belt on Perception
Today we are going to be using data from this recent paper: Is there a Chastity Belt on Perception. You can read the full paper if you like (it is a nice demonstration of action, perception and cognition working together), but we will summarise the paper for you. The main research question asks: does your ability to perform an action influence your perception? For instance, does your ability to hit a tennis ball influence how fast you perceive the ball to be moving? Or, to phrase it another way, do expert tennis players perceive the tennis ball moving more slowly than novice tennis players?
This experiment did not use tennis players, however; instead it used the Pong task: "a computerised game in which participants aim to block moving balls with various sizes of paddles", a bit like a very classic retro arcade game. Participants tend to estimate the balls as moving faster when they have to block them with a smaller paddle as opposed to when they have a bigger paddle. You can read the paper to get more details if you wish, but hopefully that gives you enough of an idea to help you understand the wrangling we will do on the data. We have cleaned up the data a little to start with. Let's begin!
Download the data as a zip file from this link and save it somewhere you have access. In the lab, use your M: drive.
Set your working directory to the same folder as the data: Session >> Set Working Directory >> Choose Directory
Open a new script and copy and paste the two lines below. Here we are:
- loading the tidyverse library into our session, and then
- loading in the data through the read_csv() function and storing it in the tibble called pong_data.

library("tidyverse")
pong_data <- read_csv("PongBlueRedBack 1-16 Codebook.csv")
DO NOT install packages in the Boyd Orr labs; they are already there and just need to be called in through library().
However, if you are using your own computer and you haven't previously installed the tidyverse package, you will have to install it first, e.g. install.packages("tidyverse").
If you have already installed tidyverse but it was a long time ago, it might be worth running some updates on the packages, as you may have an old version that works differently. The easiest way to do this is through RStudio using the menu at the top: Tools >> Check for Package Updates. You can update packages individually or just run all updates. It tends to be better to just update all packages, as many of the packages are linked, unless you specifically don't want to update a certain package.
The three most common mistakes we see are:
- Make sure you have spelt the data file name exactly as it is shown. Spaces and everything. Do not change the name of the csv file; fix your code instead. The reason is that if you have a different name for your file than someone else, then your code is not reproducible. We would say avoid using spaces in filenames you create, but as this one was created by another researcher and already has them, we will leave it as is and work with it.
- Remember when loading data we use read_csv(), which has an underscore, whereas the data file itself will have a dot in its name, filename.csv.
- Check that the datafile is actually in the folder you have set as your working directory.
- Let's have a look at pong_data and see how it is organised. Type View(pong_data) or glimpse(pong_data) in your Console window. Capital V and little g.
In the dataset you will see that each row (observation) represents one trial per participant and that there were 288 trials for each of the 16 participants. The columns (variables) we have in the dataset are as follows:
Variable | Type | Description |
---|---|---|
Participant | integer | participant number |
JudgedSpeed | integer | speed judgement (1 = fast, 0 = slow) |
PaddleLength | integer | paddle length (pixels) |
BallSpeed | integer | ball speed (2 pixels/4ms) |
TrialNumber | integer | trial number |
BackgroundColor | character | background display colour |
HitOrMiss | integer | hit ball = 1, missed ball = 0 |
BlockNumber | integer | block number (out of 12 blocks) |
We will use this data to master our skills of the Wickham Six verbs, taking each verb in turn and looking at it briefly. You should develop your skills by setting yourself new challenges based on the ones we set. There are 6 verbs to work through and then after that we will briefly recap on two other functions before finishing with a quick look at pipes. Try everything out and let us know anything you can't quite get.
2.2.3 The select() Function - to keep only specific columns
The select() function lets us pick out the variables within a dataset that we want to work with. For example, say in pong_data we wanted to keep only the columns Participant, JudgedSpeed, PaddleLength, BallSpeed, TrialNumber, and HitOrMiss, but we don't need BackgroundColor or BlockNumber. We can do this in two ways:
- We can tell the function what variables we want to include

select(pong_data, Participant, JudgedSpeed, PaddleLength, BallSpeed, TrialNumber, HitOrMiss)

- Or we can do it the opposite way, excluding columns through the -ColumnName approach (i.e. minus the ColumnName)

select(pong_data, -BackgroundColor, -BlockNumber)

In this latter example, -BackgroundColor means 'not BackgroundColor', so here you are saying all columns except BackgroundColor and BlockNumber. The minus sign is the crucial part!
Task 1: Using the select() function
- Either by inclusion or exclusion, select only the columns Participant, PaddleLength, TrialNumber, BackgroundColor and HitOrMiss from pong_data.

Did you know select() can also be used to reorder columns?
- Use select() to keep only the columns Participant, JudgedSpeed, BallSpeed, TrialNumber, and HitOrMiss, but have them display in alphabetical order, left to right.

- Have you remembered to include the dataset pong_data? Pay attention to upper/lower case letters and spelling!
- Think about how you first entered the column names as they appeared. What happens if you change the order in which you enter the column names?
2.2.4 The arrange() Function - to sort and arrange rows
The arrange() function sorts the rows in the tibble according to the column you tell it to sort by.
- In this example we show how to arrange the data by one column, e.g. by BallSpeed

arrange(pong_data, BallSpeed)

- Or by multiple columns, e.g. by BallSpeed (fastest first) and BackgroundColor

arrange(pong_data, desc(BallSpeed), BackgroundColor)

- What does desc() do?
  - desc() is how to sort by largest to smallest, i.e. descending order.
- Compare the output of the two lines above on the BallSpeed column.
- Does desc() also work for BackgroundColor?

Task 2: Arranging Data with the arrange() function
- Arrange the data in pong_data by two variables: HitOrMiss (putting hits - 1 - first), and JudgedSpeed (fast judgement - 1 - first).
2.2.5 The filter() Function - to keep only parts of the data
The filter() function lets us parse out a subset of the data, meaning we keep only parts of the data.
- For example, we might want to keep only the trials with a red BackgroundColor

filter(pong_data, BackgroundColor == "red")

- or only keep BallSpeed above 4 pixels

filter(pong_data, BallSpeed > 4)

- or only keep trials that match both the red BackgroundColor and BallSpeed above 4 pixels. Any trial that does not have a red background color, or that has a ball speed of 4 or below, will be removed.

filter(pong_data, BackgroundColor == "red", BallSpeed > 4)

- This last example can also be written as follows. Two arguments or requirements separated by a comma are equivalent to an & (ampersand, meaning "and").

filter(pong_data, BackgroundColor == "red" & BallSpeed > 4)

- Or say you want to keep specific Participant IDs, e.g. just the data from Participants 1, 3, 10, 14 and 16. We would write it as follows.
  - The %in% is called group membership and means keep each of these Participants.
  - The c() creates a little container of items called a vector.

filter(pong_data, Participant %in% c("1", "3", "10", "14", "16"))

- And finally, say you wanted to keep all Participants except Participant 7. Say the experiment didn't work for them and you want to remove them. You would write:
  - You can read != (exclamation mark followed by equals) as 'does not equal'. So Participant != "7" means keep all Participants where the values in the Participant column are not 7.
  - The exclamation mark can sometimes be used to negate the function that follows it.

filter(pong_data, Participant != "7")
Task 3: Using the filter() Function
- Use filter() to extract all Participants that had a fast speed judgement, for speeds 2, 4, 5, and 7, but missed the ball. Store this remaining data in a variable called pong_fast_miss.

There are three parts to this filter so it is best to think about them individually and then combine them.
- Filter all fast speed judgements (JudgedSpeed)
- Filter for the speeds 2, 4, 5 and 7 (BallSpeed)
- Filter for all misses (HitOrMiss)

You could do this in three filters where each one uses the output of the preceding one, or remember that filter functions can take more than one argument - see the example above. Also, because JudgedSpeed and HitOrMiss are Integer columns, you will need == (double equals) instead of just = (single equals).
And not Or - Mistakes with filter()
The filter function is very useful but if used wrongly can give you very misleading findings. This is why it is very important to always check your data after you perform an action. Let's say you are working in comparative psychology and have run a study looking at how cats, dogs and horses perceive emotion. Let's say the data is all stored in the tibble animal_data and there is a column called animals that tells you what type of animal your participant was. Something like this:
Participant ID | animals | Perceived Emotion Accuracy (%) |
---|---|---|
1 | dog | 80 |
2 | dog | 90 |
3 | cat | 10 |
4 | dog | 85 |
5 | horse | 100 |
6 | cat | 6 |
Ok, so imagine you wanted all the data from just cats
filter(animal_data, animals == "cat")
Exactly! But what if you wanted cats and dogs?
filter(animal_data, animals == "cat", animals == "dog")
Right? Wrong! This actually says "give me everything that is a cat and a dog". But nothing is a cat and a dog, that would be weird - like a dat or a cog! What you actually want is everything that is either a cat or a dog, which is stated as:
filter(animal_data, animals == "cat" | animals == "dog")
The vertical line | is the symbol for or, just as & is the symbol for and.
TOP TIP: Always pay attention to what you want and most importantly to what your code produces.
2.2.6 The mutate() Function - for adding new columns
The mutate() function lets us create a new variable in our dataset. For example, let's add a new column to pong_data in which the background color is represented by numbers, where red will be represented as 1, and blue will be represented as 2.

pong_data <- mutate(pong_data,
                    BackgroundColorNumeric = recode(BackgroundColor,
                                                    "red" = 1,
                                                    "blue" = 2))

The code here is a bit complicated but:
- BackgroundColorNumeric is the name of the new column you are adding to the tibble,
- BackgroundColor is the name of the original column in the tibble and the one to take information from,
- and 1 and 2 are the new codings of red and blue respectively.

The mutate() function is also handy for making calculations on or across columns in your data. For example, say you realise you made a mistake in your experiment where your participant numbers should be 1 higher for every participant, i.e. Participant 1 should actually be numbered as Participant 2, etc. You would do something like:

mutate(pong_data, Participant = Participant + 1)

Note here that the "new column" has the same name as the old column, i.e. Participant. In the resulting table, the Participant column will have the new values, which will differ from the values in the original pong_data table. While it may seem like you have overwritten these values, in reality you have created a copy of the table with altered values, but you have not lost anything: the original values are still there in pong_data because you didn't store (assign) this action to pong_data. You didn't save the change, basically.
In general, however, it is good practice not to overwrite pong_data with a new version of pong_data, but to store the altered table in a new tibble, e.g. pong_data_mutated, like this:

pong_data_mutated <- mutate(pong_data, Participant = Participant + 1)
Task 4: Mutating Variables with mutate()
- You realise another mistake in that all your trial numbers are wrong. The first trial (trial number 1) was a practice so should be excluded. And your experiment actually started on trial 2. Tidy this up by:
- Creating a new tibble called pong_data_filt and in it storing the data from pong_data after filtering out all trials with the number 1 (TrialNumber column).
- Now use the mutate() function to renumber all the remaining trial numbers, in pong_data_filt, starting them at one again instead of two. Store this output in a new tibble called pong_data2.

Step 1:
- filter(TrialNumber does not equal 1) - exclamation mark, equals
- Remember to store this output in a tibble called pong_data_filt

Step 2:
- mutate(TrialNumber = TrialNumber minus 1)
2.2.7 The group_by() Function - to group parts of the data together
The group_by() function groups the rows in a dataset according to a category you specify, e.g. in the animals example above, grouping all cat data together, all dog data together, and all horse data together.
Looking at the data within pong_data2, say you eventually wanted to create means, etc., for the different background color conditions. You would start by grouping trials by BackgroundColor, splitting the data into red background data and blue background data, as such:

group_by(pong_data2, BackgroundColor)

- Or you can add numerous grouping variables depending on how you want to split up the data. Here we group by hit or miss (in the HitOrMiss column) and background color (red or blue). This gives four groups: Hit Red, Miss Red, Hit Blue, Miss Blue.

group_by(pong_data2, HitOrMiss, BackgroundColor)

Note: Nothing actually appears to change in the data, unlike with the other functions, but a big operation has taken place. Look at the output in your console when you run group_by(pong_data2, BackgroundColor). At the top of the output, notice that the second line tells us the grouping criteria and how many groups now exist: see the line Groups: BackgroundColor [2]. We grouped by BackgroundColor and there are [2] groups - one for red and one for blue.
Task 5: Grouping Data with group_by()
- Group the data by BlockNumber and by BackgroundColor, in that order, and then enter the number of groups (i.e. a number) you get as a result:

It is the same procedure as this but with different column names:

group_by(pong_data2, HitOrMiss, BackgroundColor)

The number of groups you get should be the product (i.e. the multiplication) of the number of background colors (red and blue) and the number of blocks (12).
group_by() is incredibly useful as, once the data is organised into groups, you can then apply other functions (filter, arrange, mutate, etc.) to the groups within your data that you are interested in, instead of to the entire dataset. For instance, a common second step after group_by() might be to summarise() the data...
2.2.8 The summarise() Function - to do some calculations on the data
The summarise() function lets you calculate some descriptive statistics on your data. For example, say you want to count the number of hits there were for different paddle lengths, or the number of hits there were when the background color was red or blue.
- First we group the data accordingly, storing it in pong_data2_group

pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)

- And then we summarise it, storing the answer in total_hits

pong_data2_hits <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))

- And then for fun we can filter just the red, small paddle hits from the summarised data.

pong_data2_hits_red_small <- filter(pong_data2_hits, BackgroundColor == "red", PaddleLength == 50)
## `summarise()` has grouped output by 'BackgroundColor'. You can override using the `.groups` argument.
Which would leave us with:
BackgroundColor | PaddleLength | total_hits |
---|---|---|
red | 50 | 516 |
- The name of the column within pong_data2_hits_red_small that has the summarised data is total_hits; this is what you called it when creating pong_data2_hits. You could have called it anything you wanted, but always try to use something sensible. Make sure to call your variables something you (and anyone looking at your code) will understand and recognise later (i.e. not variable1, variable2, variable3, etc.), and avoid spaces (use_underscores_never_spaces).

summarise() has a range of internal functions that make life really easy, e.g. mean(), median(), n(), sum(), max(), min(), etc. Some common ones are shown below but see the dplyr cheatsheets for more examples.
function | example | purpose |
---|---|---|
n() | N = n() | number of values |
mean() | m = mean(X) | mean of values in stated column - e.g. column X |
median() | mdn = median(X) | median of values in stated column - e.g. column X |
sum() | sm = sum(X) | sum of values in stated column - e.g. column X |
max() | mx = max(X) | max of values in stated column - e.g. column X |
min() | mn = min(X) | min of values in stated column - e.g. column X |
sd() | stdev = sd(X) | standard deviation of values in stated column - e.g. column X |
Task 6: Summarising Data with summarise()
- Use the lines of code above to calculate the mean number of hits made with the small paddle (50) and the red color background. Enter that value in this box to two decimal places - (e.g. 0.12):
A quick point on the ungroup() function
The ungroup() function undoes the action of the group_by() function.
After grouping data together using the group_by() function and then performing a task on it, e.g. filter() or summarise(), it can be very good practice to ungroup the data before performing another function. Forgetting to ungroup the dataset won't always affect further processing, but sometimes it can really mess up other things.
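As a minimal sketch, using the pong_data2 tibble from the earlier tasks, ungrouping after a grouped summary would look something like this:

pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_data2_hits  <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))
pong_data2_hits  <- ungroup(pong_data2_hits)   # remove the grouping before any further wrangling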
Again just a good reminder to always check the data you are getting out of a function a) makes sense and b) is what you expect.
Quickfire Questions
Which of the Wickham Six would I use to sort columns from smallest to largest:
Which of the Wickham Six would I use to calculate the mean of a column:
Which of the Wickham Six would I use to remove certain observations - e.g. remove all males:
2.2.9 Two Other Useful Functions
The Wickham Six verbs let you do a lot of things with data, however there are thousands of other functions at your disposal. If you want to do something with your data that you are not sure how to do using these functions, do a Google search for an alternative function - chances are someone else has had the same problem and there is a help guide. For example, two other functions to note are the bind_rows() function and the count() function. We will briefly show you these here.
Combining tibbles with bind_rows()
The bind_rows() function is useful if you want to combine two tibbles that have the same column structure into one larger tibble - i.e. they have exactly the same columns and you want to combine them by attaching one to the bottom of the other. For example:
- Say you had one tibble with data for ball speeds 1 and 2

slow_ball <- filter(pong_data2, BallSpeed < 3)

- And another tibble with data for ball speeds 6 and 7

fast_ball <- filter(pong_data2, BallSpeed >= 6)

- And you wanted to combine those tibbles together into one big tibble containing these extreme ball speeds.

extreme_balls <- bind_rows(slow_ball, fast_ball)

We are going to use bind_rows() throughout the labs so do keep it in mind!
Quick counts with the count() function
Finally, the count() function is a shortcut that can sometimes be used to count up the number of rows you have for groups in your data, without having to use the group_by() and summarise() functions. It is basically a tally: it doesn't sum up values, it just counts how many observations you have.
For example, in Task 6 we combined group_by() and summarise() to calculate how many hits there were based on background color and paddle length. Alternatively we could have done:

count(pong_data2, BackgroundColor, PaddleLength, HitOrMiss)

The results are similar, just that in the count() version we get all the information, including misses, because we are simply counting rows. In the summarise() method we only got hits because that was what we summed. So two different methods give similar answers - coding can be individualised and still get to the same result!
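In other words, count() is shorthand for grouping and then counting rows with n(). A rough sketch of the equivalence, using pong_data2:

counted_hits <- count(pong_data2, BackgroundColor, PaddleLength, HitOrMiss)
long_way     <- summarise(group_by(pong_data2, BackgroundColor, PaddleLength, HitOrMiss),
                          n = n())   # same counts, just written out the long way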
Again, we will use count() at various times in the labs, so it is handy to make a note of it.
2.2.10 Last but not least - Pipes (%>%) to make your code efficient
By now you'll have noticed that tidyverse functions generally take the following grammatical structure (called syntax): function_name(dataset, arg1, arg2, ..., argN), where the dataset is the entire tibble of data you are using, and each argument (arg) is some operation on a particular column or variable, or the column name you want to work with. For example:

filter(pong_data2, PaddleLength == "50", BallSpeed > 4)
group_by(pong_data2, BallSpeed, Participant)

Both of these examples follow the structure of function_name(dataset, arg1, arg2, ....)
In the first example, we are filtering (the function) the whole pong_data2 dataset by a particular paddle length and then by particular speeds (the arguments). In the second, we are grouping by BallSpeed and then by Participant. Note that the order of arguments matters: the function performs argument1 then argument2, etc., so changing the order of arguments may give a different output. The order you work in is important, and this sequence of steps is called your pipeline. For example, here is a pipeline we used above to find how many hits there were with the small paddle length and the red background.
- First we group the data accordingly, storing it in pong_data2_group
- And then we summarise it, storing the answer in total_hits
- And filter just the red, small paddle hits

pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_data2_hits <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))
pong_data2_hits_red_small <- filter(pong_data2_hits, BackgroundColor == "red", PaddleLength == 50)
Pipelines allow us to quickly and reproducibly perform an action that would take much longer manually. However, we can make our code even more efficient, using less code, by stringing our sequence of functions together using 'pipes', written as %>%. Changing the above code into one pipeline using pipes would give us:

pong_data_hits_red_small <- pong_data2 %>%
  group_by(BackgroundColor, PaddleLength) %>%
  summarise(total_hits = sum(HitOrMiss)) %>%
  filter(BackgroundColor == "red", PaddleLength == 50)
Both these chunks show exactly the same procedure, but adding pipes can make code easier to read and follow once you understand piping.
Code without a pipe would look like
function_name(dataset, arg1, arg2,...,argN)
but a pipe version would look like
dataset %>% function_name(arg1, arg2,...,argN)
The premise is that you can pipe (%>%) between functions when the input of a function is the output of the previous function. Alternatively, you can use a pipe to put the data into the first function, as shown directly above.
You can think of the pipe (%>%) as saying 'and then' or 'goes into', e.g. the data goes into this function and then into this function and then into this function. We will expand on this in the lab where you can ask more questions, but try comparing the two chunks of code above and see if you can match them up.
One last point on pipes is that they can be written in a single line of code, but it's much easier to see what the pipe is doing if each function takes its own line. Every time you add a function to the pipeline, remember to add a %>% first, and note that when using separate lines for each function, the %>% must appear at the end of the line and not at the start of the next line. Compare the two examples below. The first won't work but the second will, because the second puts the pipes at the end of the line where they need to be!
- Example 1: won't work because the pipes (%>%) are in the wrong place.

data_arrange <- pong_data2
  %>% filter(PaddleLength == "50")
  %>% arrange(BallSpeed)

- Example 2: will work because the pipes (%>%) are in the correct place.

data_arrange <- pong_data2 %>%
  filter(PaddleLength == "50") %>%
  arrange(BallSpeed)
Where piping becomes most useful is when we string a series of functions together, rather than using them as separate steps and having to save the data each time under a new tibble name and getting ourselves all confused. In the non-piped version we have to create a new tibble each time, for example data, data_filtered, data_arranged, data_grouped and data_summarised, just to get to the final one we actually want, which was data_summarised. This creates a lot of tibbles in our environment and can make everything unclear and eventually slow down our computer. The piped version, however, uses only one tibble name, saving space in the environment, and is clear and easy to read. With pipes, we skip unnecessary steps and avoid cluttering our environment.
Quickfire Questions
- What does this line of code say?

data %>% filter() %>% group_by() %>% summarise()
Job Done - Activity Complete!
We have now recapped a number of functions and verbs that you will need as the semester goes on. You will use them in upcoming labs and chapters so be sure to go over these and try them out to make yourself more comfortable with them. Remember to also start looking back at the Grassroots book and remembering some of the work you did there. If you have any questions please post them on the forums. Finally, don't forget to add any useful information to your Portfolio before you leave it too long and forget. Happy Wrangling!
2.3 InClass Activity
Data Wrangling
In the PreClass activity we looked at a series of functions known as the Wickham Six one-table verbs: filter, arrange, group_by, select, mutate and summarise. Now we will focus on working with data across two or more tables using functions you will have come across in the Grassroots book. The two main functions we will add to the Wickham Six today are pivot_longer() and inner_join().
- pivot_longer() allows us to transform a table from wide format to long format (more on this below).
- inner_join() allows us to combine two tables together based on common columns.
A function is a tool that takes an input, performs some action, and gives an output. They are nothing more than that. If you think about it, your toaster is a function: it takes bread as an input; it performs the action of heating it up (nicely sometimes; on both sides would be a luxury); and it gives an output, the toast. A good thing about the Wickham Six functions is that they are nicely named as verbs that describe what they do - mutate() mutates (adds on a column), arrange() arranges (sorts) rows, summarise() summarises, etc.
In terms of remembering all the functions, the truth is you don't have to know them all or remember them all. However, through practice and repetition, you will quickly learn to remember which ones are which and what package they come from. Sort of like where to find your spoons in your kitchen - you don't look in the fridge, and then the washing machine, and then the drawer. Nope, you learnt, by repetition, to look in the drawer first time. It's the same with functions. Keep in mind that research methods is like a language in that the more you use it and work with it the more it makes sense.
A Note on Tidy Data
In the style of programming we teach, the most efficient format/layout of data is what is known as Tidy Data
, and any data in this format is easily processed through the tidyverse
package. You can read more about this type of data layout in this paper: Tidy Data (Wickham, 2014). It is a surprisingly good read.
However, the data you work with will not always be formatted in the most efficient way possible. If that happens then our first step is to put it into Tidy Data
format. There are two fundamental principles defining Tidy Data
:
- Each variable must have its own column.
- Each observation must have its own row.
Tidy Data (Wickham, 2014) adds the following principle:
- Each type of observation unit forms a table.
And Grolemund and Wickham (2017) restate this third principle as:
- Each value must have its own cell (i.e. no grouping two variables together, e.g. time/date in one cell).
A cell is where any specific row and column meet; a single data point in a tibble is a cell, for example. The Grolemund and Wickham (2017) book is a very useful read and it is free; browsing the chapter on Tidy Data will help you visualise how you want to arrange data. Try to keep the principles in mind whilst doing so.
We use Tidy Data
because it is really efficient and works well with the tidyverse
. However, people used to use data structured in long
format or wide
format.
Long format is where each row is a single observation, typically a single trial in an experiment or a response to a single item on a questionnaire. When you have multiple trials per participant, you will have multiple rows for the same participant. To identify participants, you would need a variable with some kind of participant id, which can be as simple as a distinct integer value for each participant. In addition to the participant identifier, you would have any measurements taken during each observation (e.g., response time) and what experimental condition the observation was taken under.
In wide format data, each row corresponds to a single participant, with multiple observations for that participant spread across columns. So for instance, with survey data, you would have a separate column for each survey question.
Tidy data is a mix of both of these approaches, and most functions in the tidyverse assume the tidy format, so typically the first thing you need to do when you get data, particularly wide-format data, is to reshape it through wrangling. Which is why we teach these really important skills.
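As a minimal sketch of what that reshaping looks like (using a tiny made-up tibble, not the lab data), pivot_longer() takes a wide table and stacks the question columns into rows:

library(tidyverse)

wide_data <- tibble(Id = c(1, 2),
                    Q1 = c(3, 4),
                    Q2 = c(5, 2))

long_data <- pivot_longer(wide_data,
                          cols = Q1:Q2,
                          names_to = "Question",
                          values_to = "Response")
# long_data now has one row per participant per question - the tidy layout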
Today's Lab - Analysing the Autism Spectrum Quotient (AQ)
To continue building your data wrangling skills we will recap on skills from Level 1 by tidying up data from the Autism Spectrum Quotient (AQ) questionnaire. If you have completed the Grassroots book then you may be familiar with the AQ10; a non-diagnostic short form of the AQ with only 10 questions per participant. It is a discrete scale and the higher a participant scores on the AQ10 the more autistic-like traits they are said to display. Anyone scoring 7 or above is recommended for further diagnosis. You can see an example of the AQ10 through this link: AQ10 Example. Remember you can revisit the Grassroots book at any time but we will recap here for you.
Today we have 66 participants and your goal in this lab is to find an AQ score for each of them through your data-wrangling skills.
We have four data files to work with:
- responses.csv containing the AQ survey responses to each of the 10 questions for our 66 participants
- qformats.csv containing information on how a question should be coded, i.e. forward or reverse coded
- scoring.csv containing information on how many points a specific response should get, depending on whether it is forward or reverse coded
- pinfo.csv containing participant information such as Age, Sex and, importantly, ID number.

Click here to download the files as a zip file. Now unzip the files into a folder you have access to (e.g. your M: drive on campus). We will use zip folders a lot so if this is something you struggle with please ask.
csv
stands for 'comma separated values' and is a very basic format for storing data in a plain text file. It really just stores numbers and text separated by commas and nothing else. The great thing about being this basic is that it can be read by many different systems and is non-proprietary, i.e., you don't need to purchase commercial software to open it.
Now set your working directory to the folder where you saved the .csv files. Do this through the dropdown menus at the top toolbar: Session >> Set Working Directory >> Choose Directory, and then find your folder with your .csv files. Remember what we previously said about folder structure - use a drive you have access to (e.g. the M: drive in the labs, but never call your folders R).
Today we will work in an RScript instead of .Rmd but if you want to turn the lab into an R Markdown report or to add elements to your Portfolio then please feel free.
Group Discussion Point
Now would be a good time to make sure that you are all using RStudio effectively and know what each window does.
- TRUE or FALSE, the Console is best for practice and the Script Window is for saving:
- TRUE or FALSE, the Environment holds all the data and objects you have loaded in and created:
- TRUE or FALSE, clicking the name of a table in the Environment window will open it in the Script window:
The answers to all of these are TRUE.
- The Script window is where you should write code and comments that you are going to save and send to people. The Console is where you should practice stuff - nothing is saved there; it is like a sandbox that just gets wiped away.
- Any data you load in or create is held in the Environment (or Global Environment) window with the variable name that you gave it.
- By clicking the name of a table in the Environment window it will open up in the Script window and you can look at it to make sure it is what you expect. This only works for tables, not for other types of data. You will learn the difference as we go along!
2.3.1 Task 1: Open a Script
- Start a new RScript and save it in the same folder as your .csv files, calling the RScript something informative like Lab2_AQ_DataWrangling.R.
- Make sure your environment is completely empty so we don't mix up one analysis with another. You can clear the environment by running the following code line in the console or by clicking the little brush on your Environment window.
rm(list = ls())
Remember that when using a script you can write notes to yourself to remind you what a line of code does. Just put a hashtag at the start of the line and R will ignore this line. This is where you have to be clear on using a Script versus an R Markdown file: in a Script, # means the line is ignored; in Markdown, # sets the line as a header!
To run any line on a script, the simplest way is to click anywhere on that line and either press Run on the top of the script window or press CTRL+Enter on the keyboard (or mac equivalent).
2.3.2 Task 2: Bring in Your Library
- Add a line to your code that brings the
tidyverse
package into your working environment and run it.
Combine the function library()
and the package tidyverse
and remember that the solutions are at the end of the chapter.
On our lab machines in Psychology all the necessary packages will already be on the machines; they just need to be called into the library. If however you are using your own machine you will have to install the packages first.
Do not install packages on the Psychology machines! Why?
- They are already installed, and trying to install the same package again can cause the package to stop working on our machines.
- It is a bit like using apps on your phone: installing is putting the app onto your phone, library is just opening the app. If you've already downloaded the app (package) then you just need to open it (library()) to use it!
2.3.3 Task 3: Load in the Data
Now we have to load in the .csv datafiles using the read_csv() function and save them as tibbles in our environment. For example, to load in the data in responses.csv and save it as the tibble responses we would type:

responses <- read_csv("responses.csv")

- Add the following lines of code to your script and complete them to load in all four .csv datafiles. Use the above code as an example and name each tibble the same as its original filename (minus the .csv part), again as above, e.g. responses.csv gets saved as responses. Remember to run the lines so that the data is loaded in and stored in your environment.

responses <- read_csv() # survey responses
qformats <-             # question formats
scoring <-              # scoring info
pinfo <-                # participant information
As you work with data and functions you will find there are functions with similar names but that give different results. One of these is the read
function for csv
. Make sure to always use read_csv()
as your function to load in csv
files. Nothing else. It is part of the readr
package automatically brought in with tidyverse
.
There is a very similarly named function called read.csv()
. DO NOT use this function in these labs. We will always expect you to use read_csv()
. Although very similar in name they do not work the same way and create differences in your data.
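If you want to see the difference for yourself, a quick optional check in the Console shows that read_csv() gives you a tibble while read.csv() gives you a plain base R data.frame:

tidy_version <- read_csv("responses.csv")
base_version <- read.csv("responses.csv")
class(tidy_version)  # includes "tbl_df" - a tibble
class(base_version)  # "data.frame" only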
2.3.4 Task 4: Review Your Data.
Group Discussion Point
Now that we have the data loaded in it is always best to have a look at the data to get an idea of its layout. We showed you one way before, by clicking on the name in the environment, but you can also use the glimpse()
or View()
functions in your Console window. Put the name of the data between the brackets to see how it is arranged. Don't add these to your script though - they are just one-offs for testing.
As a small group, have a look at the data in responses to see if you think it is Tidy or not, and answer the following question: the data in responses is in ______ format.
The responses tibble is far from being tidy; each row represents multiple observations from the same participant, i.e. each row shows responses to multiple questions - wide format. Remember we want the data in tidy format as described above.
Eh, remind me, what's a tibble?
A tibble is simply a dataframe - or a table of data with columns and rows - that is really handy for working with when using the tidyverse
package. When we say tibble, you can think of a dataframe with rows and columns of information and numbers stored in them - like responses
, it is a tibble. For more info, see here: Tibbles
2.3.5 Task 5: Gathering Data with pivot_longer().
We now have all the data we need loaded in, but in order to make it easier for us to get the AQ score for each participant, we need to change the layout of the responses
tibble using the pivot_longer()
function.
- Copy the below code line to your script and run it.
rlong <- pivot_longer(responses,
                      cols = Q1:Q10,
                      names_to = "Question",
                      values_to = "Response")
- The first argument given to the pivot_longer() function is the tibble which holds the data we want to wrangle, responses.
  - Remember we could have written this with a pipe as well, e.g. rlong <- responses %>% pivot_longer(...)
- The second argument is the names of the specific columns in the original tibble that we want to gather together, Q1:Q10, meaning all columns between Q1 and Q10.
  - You do not actually need to write cols = but it makes things clearer.
  - "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q10, the above code would only gather those two columns. As it is, in our tibble the order is Q1, Q2, Q3, ... Q10, and therefore the code gathers all the columns between Q1 and Q10.
  - Column names are not put in quotes because they already exist in the tibble responses.
- The third and fourth arguments are the names of the new columns we are creating:
  - the first will store the question numbers, Question, i.e. put the question names (names_to = ...) in a column called "Question".
  - the second will store the values/responses, Response, i.e. put the values/responses to the questions (values_to = ...) in a column called "Response".
  - These new column names are put in quotes because they do not already exist in the tibble. This is not always the case but it is the case for this function.
  - Note that these names could have been anything, but by using these names the code makes more sense.
  - Lastly, you do need to write names_to = ... and values_to = ..., otherwise the columns won't be created correctly.
In case you are wondering, if we wanted to go back the way and ungather the data we just gathered, we would use the pivot_wider()
function: e.g. rwide <- rlong %>% pivot_wider(names_from = Question, values_from = Response)
. But we do not want to do that here so let's not add this to the code.
Quickfire Questions
Let's see if you understand pivot_longer()
. Say I wanted to gather the first three columns of responses
(Q1, Q2, Q3), put the question numbers in a column called Jam
, the responses in a column called Strawberry
, and store everything in a tibble called sandwich
. Fill in the box with what you would write:
sandwich <- pivot_longer(responses, cols = Q1:Q3, names_to = "Jam", values_to = "Strawberry")
pivot_longer()
wants the data first, then the columns to gather, then the name of the new column to store the gathered column names in and finally the name of the new column to store the values in.
2.3.6 Task 6: Combining Data.
So now our responses
data is in tidy format, we are closer to getting an AQ score for each person. However, we still need to add some information to:
- show if the question is reverse or forward scored - found in
qformats
- show the number of points to give a specific response - found in
scoring
.
This is a typical analysis situation where different information is in different tables and you need to join them altogether. Both these pieces of information are contained in qformats
and scoring
respectively, but we want to join it to the data in rlong
to create one informative tidy table with all the info.
We can do this sort of join through the function inner_join()
; a function to combine information in two tibbles using a column (or columns) common to both tibbles.
- Copy the below line into your code and run it. This piece of code combines rows in the tibble
rlong
with rows in the tibbleqformats
, based on the common column "Question".
rlong2 <- inner_join(rlong, qformats, "Question")
- Now have a look in
rlong2
. We have matched each question with its scoring format,forward
orreverse
.
A lot of questionnaires have some questions that are Forward scored and some questions that are Reverse scored. What does this mean? Imagine a situation where your options in replying to a question are: 1 - extremely agree, 2 - agree, 3 - neutral, 4 - disagree, 5 - extremely disagree. In a forward-scoring question you would get 1 point for extremely agree, 2 for agree, 3 for neutral, etc. In a reverse scoring question you would get 5 for extremely agree, 4 for agree, 3 for neutral, etc.
The reasoning behind this shift is that sometimes agreeing or disagreeing might be more favourable depending on how the question is worded. Secondly, sometimes these questions are used just to catch people out - imagine if you had two similar questions where one has the reverse meaning of the other. In this scenario, people should respond opposites. If they respond the same then they might not be paying attention.
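Purely as an illustration of the idea (this is not the scoring used in this lab, which comes from the scoring.csv table): on a hypothetical 1-5 agreement scale you could flip reverse-scored items with mutate(), so that a response of 5 becomes 1 point, 4 becomes 2, and so on:

demo_scoring <- tibble(QFormat  = c("forward", "reverse"),
                       Response = c(4, 4)) %>%
  mutate(Points = if_else(QFormat == "reverse", 6 - Response, Response))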
Now we need to combine the information in our table, rlong2
, with the scoring
table so we know how many points to attribute each question based on the answer the participant gave, and whether the question was forward or reverse coded. Again, we use the inner_join()
function, but this time the common columns found in rlong2
and scoring
are QFormat
and Response
. To combine by two columns you just write them in sequence as shown below. Note: when there is more than one common column between two tibbles you are joining, you should combine by ALL the columns to avoid repeat columns in the new tibble. If you forget to do this, your new tibble will have names such as column_name.x
and column_name.y
. This will cause confusion so avoid it by combining by all common columns.
- Copy the below line into your code and run it. This code combines rows in rlong2 and scoring based on the common columns, QFormat and Response.
rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))
2.3.7 Task 7: Calculating the AQ Scores.
We have now created rscores
which has information on how each participant responded to each question and how each question should be coded and scored, all within the one tibble. All we need now is to sum the scores for each participant to get their AQ score.
- Based on your PreClass learning, copy the below line into your code and complete it to obtain individual
aq_scores
for each participant.
- Save your script and run it all again from the start to make sure it works!
aq_scores <- rscores %>%
  group_by() %>%        # how will you group individual participants?
  summarise(AQ = sum()) # which column will you sum to obtain AQ scores?
- Each participant could be grouped by their Id.
- If we summed up the value for each Score we might get a full AQ Score for each participant.
Yep, well spotted. Pipes are your friend. Think of them as saying 'and then' or 'goes into'. So in the example above we take rscores
and then group it by something and then summarise it into AQ scores based on...
In most cases, the pipe serves the purpose of putting the input into the function or taking the output of one function and treating it as the input of another function.
In the example above the first pipe takes rscores
as the input for group_by
, and the second pipe takes the output of group_by
and puts it as the input to summarise
. See how you can almost read it as a chain of actions or steps.
Quickfire Questions
The whole purpose of this lab was to calculate AQ scores for individual participants. As a small group, try to answer the following questions. Try to do it using code where possible to help you based on your knowledge from the preclass and inclass activity. Remember the cheatsheets as well. Look for the dplyr
one!
From the options, choose the correct citation for the AQ 10 question questionnaire:
Complete the sentence, the higher the AQ score...
Type in the AQ score (just the number) of Participant ID No. 87:
Type how many participants had an AQ score of 3 (again just the number):
The cut-off for the AQ10 is usually said to be around 6 meaning that anyone with a score of more than 6 should be referred for diagnostic assessment. Type in how many participants we should refer from our sample:
- From the link above you can see that an appropriate citation for the AQ10 would be Allison, Auyeung and Baron-Cohen (2012).
- As mentioned, the higher the score on the AQ10, the more autistic-like traits a participant is said to show.
- You could do this with code, filter(aq_scores, Id == 87), which would give you a 1x2 tibble showing the ID number and score. If you just wanted the score you could use pull(), which we haven't shown you yet but works as follows: filter(aq_scores, Id == 87) %>% pull(AQ). The answer is an AQ score of 2.
- Same as above but changing the argument of the filter: filter(aq_scores, AQ == 3) %>% count(). The answer is 13. Remember you could do this by counting by hand, but the code makes it reproducible every time.
- filter(aq_scores, AQ > 6) %>% count() or filter(aq_scores, AQ >= 7) %>% count(). The answer is 6.
2.3.8 Task 8: One Last Thing on Pipes
You now have a complete code to load in your data, convert it to Tidy, combine the tables and calculate an AQ score for each participant. But, if you look at it, some of your code could be more efficient by using pipes.
Go back through your code and rewrite it using pipes %>%
so that it is as efficient as possible.
At any point where the first argument of your function is the name of a variable created before that line, there is a good chance you could have used a pipe! Here are all the bits of this code that could be piped together into one chain:
rlong <- pivot_longer(responses, cols = Q1:Q10, names_to = "Question", values_to = "Response")
rlong2 <- inner_join(rlong, qformats, "Question")
rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))
aq_scores <- rscores %>% group_by(Id) %>% summarise(AQ = sum(Score))
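For reference, one possible fully piped version of that chain is sketched below (it assumes the same tibbles and column names as the code above):

aq_scores <- responses %>%
  pivot_longer(cols = Q1:Q10, names_to = "Question", values_to = "Response") %>%
  inner_join(qformats, "Question") %>%
  inner_join(scoring, c("QFormat", "Response")) %>%
  group_by(Id) %>%
  summarise(AQ = sum(Score))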
Job Done - Activity Complete!
You have now recapped on a number of handy one-table and two-table verbs. These are great to know as for example, in the above Activity, it actually only took a handful of reproducible steps to get from messy data to tidy data; could you imagine doing this by hand in Excel through cutting and pasting? Not to mention the mistakes you could make!
You should now be ready to complete the Homework Assignment for this lab. The assignment for this Lab is FORMATIVE and is NOT to be submitted and will NOT count towards the overall grade for this module. However you are strongly encouraged to do the assignment as it will continue to boost your data-wrangling skills which you will need in future assignments. If you have any questions, please post them on the forums. Finally, don't forget to add any useful information to your Portfolio before you leave it too long and forget.
Happy wrangling! Excellent work! You are a DataWrangling expert! Now go try the assignment!
2.4 Test Yourself
This is a formative assignment meaning that it is purely for you to test your own knowledge, skill development, and learning, and does not count towards an overall grade. However, you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. You will be instructed by the Course Lead on Moodle as to when you should attempt this assignment. Please check the information and schedule on the Level 2 Moodle page.
One thing to keep in mind in all assignments is that spelling is really important when learning about being reproducible. Remember that Mycolumn is a different column to mycolumn, and when being reproducible only one is correct. Likewise, Mydata and mydata are two different tibbles, and read_csv() and read.csv() are two different functions. Check the instructions and check your work!
Lab 2: Formative Data Wrangling Assignment
In order to complete this assignment you will need to download the data .csv files, as well as the assignment .Rmd file, which you need to edit, titled GUID_Level2_Semester1_Lab2.Rmd
. These can be downloaded within a zip file from the below link. Once downloaded and unzipped, you should create a new folder that you will use as your working directory; put the data files and the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the Assignment .zip file from here.
Now open the assignment .Rmd file within RStudio. You will see there is a code chunk for each task. As you did in previous assignments, follow the instructions on what to edit in each code chunk. This will often involve entering either code or a value, based on what we have covered up to this point.
In this chapter we recapped on data-wrangling using the Wickham 6 verbs, looked at additional functions such as pivot_longer()
and inner_join()
, and at piping chains of code for efficiency using %>%
. You will need these skills to complete the following assignment, so please make sure you have carried out the PreClass and InClass activities before attempting this formative assignment. Remember to follow the instructions and, if you get stuck at any point, post questions on the available forums. Also, two useful online resources are:
- Hadley Wickham's R for Data Science book @ http://r4ds.had.co.nz
- RStudio's dplyr cheatsheet @ Rstudio.com
As well as the PsyTeachR Grassroots book and remembering some of the work you did there.
2.4.1 Today's Topic - The Ageing Brain
A key topic in current psychological research, and one which forms a main focus for some of the research in our School, is that of human ageing. In this research we use brain imaging techniques to understand how changes in brain function and structure relate to changes in perception and behaviour. A typical 'ageing' experiment will compare a measure (or a number of measures) such as performance on a cognitive or perceptual task between younger and older adults (i.e. a between-subjects design experiment).
However, in order to make sure we are studying 'healthy' ageing, we first have to 'screen' our older participants for symptoms of age-related dementia (Alzheimer's Disease), where cognitive function can be significantly impaired. We do this using a range of cognitive tests. Some studies will also test participants' sensory acuity (ability to perceive something) as a function of age (particularly eyesight and hearing).
The data you have downloaded for this lab is example screening data taken from research investigating how the ageing brain processes different types of sounds. The tests used in this study are detailed below. Please note that the links are there to provide you with further information and examples of the tests once you have completed the assignment if you so wish; you do not have to read them to complete the assignment.
Montreal Cognitive Assessment (MoCA) : a test specifically devised as a stand-alone screening tool for mild cognitive impairment. Assesses visuospatial skills, memory, language, attention, orientation, and abstraction skills. Example here
Working Memory Digit Span Test (D-SPAN): measures the capacity of participants' short-term (working) memory.
D2 Test of Attention: measures participants' selective and sustained concentration and visual scanning speed.
Better Hearing Institute Quick Hearing Check: a self-report questionnaire which measures participants' subjective experience of their own hearing abilities. Main Webpage for Group here
The Data Files
You have just downloaded the three .csv
files containing all the data you need. Below is a list of the .csv file names and a description of the variables each contains:
- `p_screen.csv` contains participants' demographic information including:
  - `ID`: Participant ID number - for confidentiality (no names or other identifying info)
  - `AGE`: in years
  - `SEX`: M for male, F for female
  - `HANDEDNESS`: L for left-handed, R for right-handed
  - `EDUCATION`: in years
  - `MUSICAL`: whether they have any musical abilities/experience (YES or NO)
  - `FLANG`: whether they speak any foreign languages (YES or NO)
  - `MOCA`: Montreal Cognitive Assessment score
  - `D-SPAN`: Working Memory Digit Span test score
  - `D2`: D2 Test of Attention score
- `QHC_responses.csv` contains participants' responses to each question on the Better Hearing Institute Quick Hearing Check (QHC) questionnaire.
  - Column 1 represents participants' `ID` (matching up to that in `p_screen.csv`).
  - Each column thereafter represents one of the 15 questions from the questionnaire.
  - Each row represents a participant and their response to each question.
- `QHC_scoring.csv` contains the scoring key for each question of the QHC, with the columns:
  - `RESPONSE`: the types of responses participants could give (STRONGLY DISAGREE, SLIGHTLY DISAGREE, NEUTRAL, SLIGHTLY AGREE, STRONGLY AGREE)
  - `SCORE`: the points awarded for each response type (from 0 to 4). A score for each participant can be calculated by converting their categorical responses to values and summing the values.
Before starting, let's check:

- The `.csv` files are saved into a folder on your computer and you have manually set this folder as your working directory.
- The `.Rmd` file is saved in the same folder as the `.csv` files. For assessments we ask that you save it with the format `GUID_Level2_Semester1_Lab2.Rmd` where `GUID` is replaced with your GUID. Though this is a formative assessment, it may be good practice to do the same here.
2.4.2 Load in the data
- You will see a code chunk called `libraries`, similar to the one below, at the top of your .Rmd assignment file. It is all set up to load in the data for you and to call `tidyverse` to the `library()`. Run this code chunk now to bring in the data and tidyverse. You can do this in the console, in a script, or even through the code chunk by clicking the small green play symbol in the top right of the code chunk.
library("tidyverse")
screening <- read_csv("p_screen.csv")
responses <- read_csv("QHC_responses.csv")
scoring <- read_csv("QHC_scoring.csv")
View the data
- It is always a good idea to familiarise yourself with the layout of the data that you have just loaded in. You can do this using `glimpse()` or `View()` in the Console window, but you must never put these functions in your assignment file.
The Tasks:
Now that we have the data loaded, tidyverse attached, and have viewed our data, you should try to complete the following 9 tasks. You may want to practice them first to get the correct code and format, and to make sure they work. You can do this in the console or a script, but remember, once you have the correct code, edit the necessary parts of the assignment `.Rmd` file to produce a reproducible Rmd file. This is what you will do from now on for all other assessment files, so practicing this now will really help. In short, go through the tasks, replace only the `NULL` with what the question asks for, and then make sure that the file knits at the end so that you have fully reproducible code.
2.4.3 Task 1 - Oldest Participant
Replace the NULL
in the T1
code chunk with the Participant ID of the oldest participant. Store this single value in oldest_participant
(e.g. `oldest_participant <- 999`).
- hint: look at your data, who is oldest?
oldest_participant <- NULL
2.4.4 Task 2 - Arranging D-SPAN
Replace the NULL
in the T2
code chunk with code that arranges participants' D-SPAN performance from highest to lowest using the appropriate one-table dplyr (i.e., Wickham) verb. Store the output in cogtest_sort
. (e.g. cogtest_sort <- verb(data, argument)
)
- hint: arrange your screening data
cogtest_sort <- NULL
2.4.5 Task 3 - Foreign Language Speakers
Replace the NULL
in each of the two lines of code chunk T3
, so that descriptives
has a column called n
that shows the number of participants that speak a foreign language and number of participants that do not speak a foreign language, and another column called median_age
that shows the median age for those two groups. If you have done this correctly, descriptives
should have 3 columns and 2 rows of data, not including the header row.
- hint: First need to group_by() foreign language
- hint: Second need to summarise(). You will need the
n()
function. Pay attention to specific column names given.
screen_groups <- NULL
descriptives <- NULL
2.4.6 Task 4 - Creating Percentage MOCA scores
Replace the NULL
in the T4
code chunk with code using one of the dplyr verbs to add a new column called MOCA_Perc
to the dataframe `screening`. This new column should contain the `MOCA` scores converted to percentages. The maximum achievable score on MOCA is 30 and percentages are calculated as `(participant score / max score) * 100`; for example, a raw MOCA score of 27 would become (27 / 30) * 100 = 90. Store this output in `screening`.
- hint: mutate() something using MOCA and the percentage formula
screening <- NULL
2.4.7 Task 5 - Remove the MOCA column
Now that we have our MoCA score expressed as a percentage MOCA_Perc
we no longer need the raw scores held in MOCA
. Replace the NULL
in the T5
code chunk using a one-table dplyr verb to keep all the columns of screening
, in the same order, but without the MOCA
column. Store this output in screening
.
- hint: select your columns
screening <- NULL
Halfway There!
The remaining tasks focus on merging two tables.
You suspect that the older adults with musical experience might report more finely-tuned hearing abilities than those without musical experience. You therefore decide to check whether this trend exists in your data. You measured participants' self-reported hearing abilities using the Better Hearing Institute Quick Hearing Check Questionnaire. In this questionnaire, participants rated the extent to which they agree or disagree with a list of statements (e.g. 'I have a problem hearing over the telephone') using a 5-point Likert scale (Strongly Disagree, Slightly Disagree, Neutral, Slightly Agree, Strongly Agree).
Each participant's response to each question is contained in the responses
dataframe in your environment. Each response type is worth a certain number of points (e.g. Strongly Disagree = 0, Strongly Agree = 4) and the scoring key is contained in the scoring
dataframe. A score for each participant is calculated by totalling up the number of points across all the questions to derive an overall score. The lower the overall score, the better the participants' self-reported hearing ability.
In order to score the questionnaire we first need to perform a couple of steps.
2.4.8 Task 6 - Gather the Responses together
Replace the NULL
in the T6
code chunk with code that gathers the responses to all the questions of the QHC from wide format to tidy/long format. Put the names in Question
and the values in RESPONSE
. Store this output in responses_long
.
- hint: pivot_longer()
- hint: names to "Question"
- hint: values to "RESPONSE"
responses_long <- NULL
2.4.9 Task 7 - Joining the data
Now we need to join the number of points for each response in scoring
to the participants' responses in responses_long
.
Replace the NULL
in the T7
code chunk using inner_join()
to combine responses_long
and scoring
into a new variable called responses_points
.
- hint: join them by the column common to both scoring and responses_long
responses_points <- NULL
2.4.10 Task 8 - Working the Pipes
Below we have given you five lines of code that take the data in its current long format and then create a QHC score for each participant (`group_by()`...`summarise()`). It then joins the screening information (`inner_join()`) before calculating a mean QHC score for the two groups of participants - those that play musical instruments and those that don't. This final step is stored as a tibble called `musical_means`
.
participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))
Use the above five lines of code to replace the NULL
in the T8
code chunk with a functioning code pipeline using pipes. Put each function on a new line one under the other. This pipeline should result in the mean QHC values of musical and non-musical people being stored in the tibble musical_means
. This final tibble will consist of two rows by two columns (i.e. four cells in total).
- hint: in pipes, the output of the previous function is the input of the subsequent function.
- hint: function1(...) %>% function2(...)
musical_means <- NULL
2.4.11 Task 9 - Difference in Musical Means
Finally, replace the NULL
in the T9
code chunk with the value of how much higher the QHC score of people who play music is compared to people who don't play music. This should be a single numeric value, to two decimal places, e.g. 2.93
- hint: look in musical means and enter the difference between the two means.
QHC_diff <- NULL
Job Done - Activity Complete!
Well done, you are finished! Now you should go check your answers against the solutions at the end of this chapter. You are looking to check that the answers you have submitted are exactly the same as the ones in the solution - for example, remember that Mycolumn
is different to mycolumn
and only one is correct. If you have any questions, please post them on the forums.
2.5 Solutions to Questions
Below you will find the solutions to the questions for the Activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.
2.5.1 PreClass Activities
2.5.1.1 PreClass Task 1
- Using
select()
to include only stated columns:
select(pong_data, Participant, PaddleLength, TrialNumber, BackgroundColor, HitOrMiss)
- Using
select()
to exclude certain columns:
select(pong_data, -JudgedSpeed, -BallSpeed, -BlockNumber)
- Using
select()
to change the order of columns:
select(pong_data, BallSpeed, HitOrMiss, JudgedSpeed, Participant, TrialNumber)
2.5.1.3 PreClass Task 3
filter(pong_data,
       JudgedSpeed == 1,
       BallSpeed %in% c("2", "4", "5", "7"),
       HitOrMiss == 0)
2.5.1.4 PreClass Task 4
- first step is created with
filter()
- the second step is created with
mutate()
pong_data_filt <- filter(pong_data, TrialNumber >= 2)
pong_data2 <- mutate(pong_data_filt, TrialNumber = TrialNumber - 1)
2.5.1.6 PreClass Task 6
pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_data2_hits <- summarise(pong_data2_group, mean_hits = mean(HitOrMiss))
## `summarise()` has grouped output by 'BackgroundColor'. You can override using the `.groups` argument.
- You should find that the mean number of hits made with the small paddle (50) and the red color background is 0.450655
2.5.2 InClass Activities
2.5.2.1 InClass Task 2
library(tidyverse)
or
library("tidyverse")
- Note, there is no difference between `library(tidyverse)` and `library("tidyverse")`; both will work.
2.5.2.2 InClass Task 3
responses <- read_csv("responses.csv")
qformats <- read_csv("qformats.csv")
scoring <- read_csv("scoring.csv")
pinfo <- read_csv("pinfo.csv")
- Note that there is a difference between `responses <- read_csv("responses.csv")` and `responses <- read_csv(responses.csv)`. You will need quotes around the .csv filename as shown in the code chunk above (e.g. `responses <- read_csv("responses.csv")`), or the code won't work.
2.5.2.3 InClass Task 7
aq_scores <- rscores %>%
  group_by(Id) %>%            # group by the ID number in column Id
  summarise(AQ = sum(Score))  # sum column Score to obtain AQ scores.
2.5.2.4 InClass Task 8
aq_scores2 <- responses %>%
  pivot_longer(cols = Q1:Q10,
               names_to = "Question",
               values_to = "Response") %>%
  inner_join(qformats, "Question") %>%
  inner_join(scoring, c("QFormat", "Response")) %>%
  group_by(Id) %>%
  summarise(AQ = sum(Score))
2.5.3 Test Yourself Activities
2.5.3.1 Assignment Task 1 - Oldest Participant
- Whether you coded this answer or just read from the data, the Participant with ID Number 3 is the oldest.
oldest_participant <- 3
This could also be answered with code. We haven't quite shown you how yet but it would look like this:
oldest_participant_code <- arrange(screening, desc(AGE)) %>%
  slice(1) %>%
  pull(ID)
2.5.3.2 Assignment Task 2 - Arranging D-SPAN
arrange()
is the main function here.
- You also needed to use
desc()
to sort from high to low
cogtest_sort <- arrange(screening, desc(DSPAN))
2.5.3.3 Assignment Task 3 - Foreign Language Speakers
- First group the screening data by FLANG using
group_by()
- Next, summarise, paying attention to use the variable names as instructed
- `n()` is a function that we use within `summarise()` to count how many observations we have. It works like `count()` but you can't use `count()` within `summarise()`.
- `median()` is the function that we use within `summarise()` to calculate the median. Much like we do with `sum()` or `mean()` or `sd()`, etc.
screen_groups <- group_by(screening, FLANG)
descriptives <- summarise(screen_groups,
                          n = n(),
                          median_age = median(AGE))
2.5.3.4 Assignment Task 4 - Creating Percentage MOCA scores
mutate()
is the function to add a new column to data.
- Here we are mutating/adding on a column called
MOCA_Perc
which shows a participant's MOCA score divided by 30 and multiplied by 100.
screening <- mutate(screening, MOCA_Perc = (MOCA / 30) * 100)
2.5.3.5 Assignment Task 5 - Remove the MOCA column
select()
is the key function to keep and remove certain columns.
- Two options here; both would give the same dataframe.
- The first option shows how to deselect a column and keep everything else.
- The second option shows how to select all the columns you want.
- Remember that order is very important and you should select the columns in the order you want.
Option 1:
screening <- select(screening, -MOCA)
Option 2:
screening <- select(screening, ID, AGE, SEX, HANDEDNESS, EDUCATION, MUSICAL, FLANG, DSPAN, D2, MOCA_Perc)
2.5.3.6 Assignment Task 6 - Gather the Responses together
pivot_longer()
is the function to use here.
- People take a while to understand this function but spend some time looking at the example below and it will start to make some sense.
- The first argument is the data. In this case
responses
.
- The second argument is the name of the columns you want to gather. Here we are gathering all columns between the Q1 column and the Q15 column. Remember that the colon (:) says "from ... to ..."
- You do not actually need to write
cols =
but it makes things clearer.
- "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q15, the above code would only gather those two columns. As it is, in our tibble, the order is Q1, Q2, Q3, ... Q15, and therefore the code gathers all the columns between Q1 and Q15.
- Column names are not put in quotes because they exist already in the tibble
responses
.
- You do not actually need to write
- The third and fourth arguments are the names of the new columns we are creating;
- the first will store the question numbers,
Question
. I.e. put the question names (names_to = ...
) in a column called "Question". - the second will store the values/responses,
Response
. I.e. put the values/responses to the questions (values_to = ...
) in a column called "Response". - These new column names are put in quotes because they do not already exist in the tibble. This is not always the case but is the case for this function.
- Note that these names could have been anything but by using these names the code makes more sense.
- Lastly, you do need to write
names_to = ...
andvalues_to = ...
otherwise the columns won't be created correctly.
- the first will store the question numbers,
responses_long <- pivot_longer(responses,
                               cols = Q1:Q15,
                               names_to = "Question",
                               values_to = "RESPONSE")
2.5.3.7 Assignment Task 7 - Joining the data
inner_join()
will combine all common information in two sets of data by a common column or columns.
- Here we are joining the data in `responses_long` with the data in `scoring` by the common column `RESPONSE`.
- Keep in mind that
inner_join()
keeps only the rows that have data in both datasets. It will remove rows that only have data in one dataset.
- When joining two datasets, join by ALL common columns when there is more than one column in common.
responses_points <- inner_join(responses_long, scoring, "RESPONSE")
2.5.3.8 Assignment Task 8 - Working the Pipes
- This is the code we started with.
participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))
- Below is how to transcribe the above series of functions into a pipeline.
- Remember, when using pipes, the output of the previous function is the input of the subsequent function
musical_means <- group_by(responses_points, ID) %>%
  summarise(Total_QHC = sum(SCORE)) %>%
  inner_join(screening, "ID") %>%
  group_by(MUSICAL) %>%
  summarise(mean_score = mean(Total_QHC))
2.5.3.9 Assignment Task 9 - Difference in Musical Means
- People who play music have a QHC score that is 1.53 units higher than people who don't play music.
- You can do this by looking in
musical_means
, reading the values, and doing some quick maths.
- A second option is through code. Code is always better as it can reduce error and is reproducible!
# Option 1
QHC_diff <- 1.53
# Option 2
# You will soon learn the functions to do this by code but here is how you could do it.
QHC_diff_code <- pivot_wider(musical_means,
                             names_from = "MUSICAL",
                             values_from = "mean_score") %>%
  mutate(diff = YES - NO) %>%
  pull(diff) %>%
  round(2)
Chapter Complete!
2.6 Additional Material
Below is some additional material that might help your wrangling.
More on read_csv()
In the preclass activity we used the following code to load in our data:
pong_data <- read_csv("PongBlueRedBack 1-16 Codebook.csv")
This is a totally acceptable approach and it is the one we will use 99% of the time. One thing to note is that the `read_csv()` function by default loads any column of numbers as type double, meaning that it can take a decimal, as shown by the `<dbl>` next to each numeric column in the `glimpse()` output below.
- Note: we can use `glimpse()` from `dplyr` to check our data types.
glimpse(pong_data)
## Rows: 4,608
## Columns: 8
## $ Participant <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ JudgedSpeed <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, …
## $ PaddleLength <dbl> 50, 250, 50, 250, 250, 50, 250, 50, 250, 50, 50, 250, …
## $ BallSpeed <dbl> 5, 3, 4, 3, 7, 5, 6, 2, 4, 4, 7, 7, 3, 6, 5, 7, 2, 5, …
## $ TrialNumber <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ BackgroundColor <chr> "red", "blue", "red", "red", "blue", "blue", "red", "r…
## $ HitOrMiss <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, …
## $ BlockNumber <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
However you might not always want this default and might want to stipulate the data-type to load the data in as. Particularly if you already know the data-type you should have. Let's look at this first and then talk about it.
pong_data3 <- read_csv("PongBlueRedBack 1-16 Codebook.csv",
                       col_types = "iiiiicii")
If we look at this again with glimpse(), we now see that the numerical data are integers:
glimpse(pong_data3)
## Rows: 4,608
## Columns: 8
## $ Participant <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ JudgedSpeed <int> 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, …
## $ PaddleLength <int> 50, 250, 50, 250, 250, 50, 250, 50, 250, 50, 50, 250, …
## $ BallSpeed <int> 5, 3, 4, 3, 7, 5, 6, 2, 4, 4, 7, 7, 3, 6, 5, 7, 2, 5, …
## $ TrialNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ BackgroundColor <chr> "red", "blue", "red", "red", "blue", "blue", "red", "r…
## $ HitOrMiss <int> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, …
## $ BlockNumber <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
So what is the difference? Note that in the new code we have specified the col_types
argument as "iiiiicii"
. The col_types
argument allows you to control the data type for each variable. If you don't specify this argument, the default is for read_csv()
to guess, and when it sees numbers in a column, it will default to treating it as type double
.
What does the "iiiiicii"
string do? Well, we know that there are 8 different columns in the csv file, and we have 8 characters in the string "iiiiicii"
; each letter in this string tells read_csv()
the data type for each of these columns. The string has five "i"s followed by one "c" followed by two more "i"s, which tells read_csv()
to treat the first five columns as type integer
(i), the sixth column as type character
(c), and the last two columns as type integer
. (If we wanted a column to be read in as a double, we would use "d"
.)
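If the compact string feels cryptic, readr (the part of the tidyverse that provides read_csv()) also accepts a longer, named specification built with cols() and the col_*() helpers. Below is a minimal sketch of what we believe is the equivalent call; pong_data4 is just an illustrative name:

# Same specification as col_types = "iiiiicii", written out long-hand
pong_data4 <- read_csv("PongBlueRedBack 1-16 Codebook.csv",
                       col_types = cols(Participant     = col_integer(),
                                        JudgedSpeed     = col_integer(),
                                        PaddleLength    = col_integer(),
                                        BallSpeed       = col_integer(),
                                        TrialNumber     = col_integer(),
                                        BackgroundColor = col_character(),
                                        HitOrMiss       = col_integer(),
                                        BlockNumber     = col_integer()))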
Whilst this is a very useful approach if you are already familiar with the type and structure of the data you are working with, it can cause issues if you are not: you need to know exactly how many columns there are, what order they are in, and what type each one is, so it can get tricky. For this series of lab activities, we will just stick to using the basic read_csv() defaults and not state column types.
More on Code Layout
One issue we see a lot is people not being able to debug their code quickly (i.e. find issues) because of the way the code is laid out. Pipes (%>%) help with that, but so does taking a new line for different parts of your code. After a comma (,) or a pipe (%>%), you can take a new line to continue your code and make it easier to read and follow. For example, both of the following will work, but the second is easier to read.
musical_means <- group_by(responses_points, ID) %>% summarise(Total_QHC = sum(SCORE)) %>% inner_join(screening, "ID") %>% group_by(MUSICAL) %>% summarise(mean_score = mean(Total_QHC))
musical_means <- group_by(responses_points, ID) %>%
  summarise(Total_QHC = sum(SCORE)) %>%
  inner_join(screening, "ID") %>%
  group_by(MUSICAL) %>%
  summarise(mean_score = mean(Total_QHC))
And you could even expand this second option further to make it clearer on the group_by()
and inner_join()
what are the different inputs:
musical_means <- group_by(responses_points,
                          ID) %>%
  summarise(Total_QHC = sum(SCORE)) %>%
  inner_join(screening,
             "ID") %>%
  group_by(MUSICAL) %>%
  summarise(mean_score = mean(Total_QHC))
Remember, Tidy Code and Tidy Data Make Wrangling Fun! OK, that is not catchy, but true!
More on gathering data - pivot_longer()
and gather()
With time, some functions change a bit, others change entirely, and others get forgotten about. `gather()` is one of these functions: many people just couldn't ever get it right in their head, and as such `pivot_longer()` was created. The first version of this book was written using the `gather()` function and we have tried to update the book to replace it with `pivot_longer()`, but we will make mistakes and no doubt `gather()`
will appear somewhere, or you may get some old code and see the gather()
function and wonder what it does. As such, we have put below a version of the code from the inclass using the gather()
function instead of the pivot_longer()
function, just in case it helps.
- `gather()` allows us to transform a table from wide format to long format. It has now been replaced by `pivot_longer()`.
- `spread()` allows us to transform a table from long format to wide format. It has now been replaced by `pivot_wider()`.
An example with gather()
In the example in class, Task 5, we have all the data we need loaded in, but in order to make it easier for us to get the AQ score for each participant, we need to change the layout of the responses
tibble. In class we use the pivot_longer()
function but you could have done it with the gather()
function as follows
rlong <- gather(responses, Question, Response, Q1:Q10)
- The first argument given to the
gather()
function is the dataset which holds the data we want to wrangle,responses
. - The second and third arguments are the names we want to give the columns we are creating;
- the first will store the question numbers,
Question
- the second will store the responses,
Response
. - Note that these names could have been anything but by using these names the code makes more sense.
- the first will store the question numbers,
- Finally, the fourth argument is the names of specific columns in the original tibble that we want to gather together - columns Q1 to Q10. The colon (:) says "all columns from this to that" - all columns from Q1 to Q10.
Notes:
"Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q10, the above code would only gather those two columns. As it is, in our tibble, the order, is Q1, Q2, Q3, ... Q10, and therefore the code gathers all the columns between Q1 and Q10.
If you had just wanted columns Q1 and Q10, then you could do
c(Q1, Q10)
.pivot_longer()
andgather()
do not function exactly the same. They tend to sort data differently. Make sure you are using the function you need to do. You should always always always check the output of your functions to make sure they are doing what you expect.In case you are wondering, if we wanted to go back the way and ungather the data we just gathered, we would use the
spread()
function: e.g. `rwide <- spread(rlong, Question, Response)`
. But we do not want to do that here so let's not add this to the code.
Quickfire Questions
Let's see if you understand gather()
. Say I wanted to gather the first three columns from a tibble called responses
where the columns are called Q1, Q2, Q3 (in that order), put the question numbers in a column called Jam
, the responses in a column called Strawberry
, and store everything in a tibble called sandwich
. Fill in the box with what you would write:
sandwich <- gather(responses, Jam, Strawberry, Q1:Q3)
gather
wants the data first, then the name of the new column to store the gathered column names, then the name of the new column to store the data, and then finally which columns to gather.
As mentioned above we hope to replace every instance of gather()
and spread()
in this book with pivot_longer()
and pivot_wider()
, but if we make mistakes the above might help.
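For reference, here is a minimal sketch of the pivot_longer() and pivot_wider() equivalents of the gather() and spread() calls above; rlong2 and rwide are just illustrative names:

# pivot_longer() equivalent of: rlong <- gather(responses, Question, Response, Q1:Q10)
rlong2 <- pivot_longer(responses,
                       cols = Q1:Q10,
                       names_to = "Question",
                       values_to = "Response")

# pivot_wider() equivalent of: rwide <- spread(rlong, Question, Response)
rwide <- pivot_wider(rlong2,
                     names_from = "Question",
                     values_from = "Response")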
More on binding and joining
In this lab we looked at two methods of combining datasets. We used bind_rows()
and we used inner_join()
. They do quite different tasks but we thought a quick summary might help.
Let's say we have the following two tables that show Perceived Emotion Accuracy in five different cats:
ID | Animal | Perceived_Emotion_Accuracy |
---|---|---|
1 | Cats | 11 |
2 | Cats | 8 |
3 | Cats | 8 |
4 | Cats | 10 |
5 | Cats | 9 |
And in five different dogs:
ID | Animal | Perceived_Emotion_Accuracy |
---|---|---|
1 | Dogs | 82 |
2 | Dogs | 78 |
3 | Dogs | 83 |
4 | Dogs | 82 |
5 | Dogs | 84 |
Because these two tables have no overlapping information about participants but have a similar tibble structure (i.e. number and names of columns) then the way we can combine the tibbles is by putting one tibble below the other using bind_rows()
all_animals <- bind_rows(cats, dogs)
Which would give us:
ID | Animal | Perceived_Emotion_Accuracy |
---|---|---|
1 | Cats | 11 |
2 | Cats | 8 |
3 | Cats | 8 |
4 | Cats | 10 |
5 | Cats | 9 |
1 | Dogs | 82 |
2 | Dogs | 78 |
3 | Dogs | 83 |
4 | Dogs | 82 |
5 | Dogs | 84 |
And say now we want to tack on the age information from each participant. Because all_animals
and ages
have overlapping information that we can use to link the information, we can put the new information at the side of the tibble as a new column using inner_join()
.
all_animals_ages <- inner_join(all_animals, ages, c("ID", "Animal"))
Which would give us:
ID | Animal | Perceived_Emotion_Accuracy | Ages |
---|---|---|---|
1 | Cats | 11 | 5 |
2 | Cats | 8 | 4 |
3 | Cats | 8 | 3 |
4 | Cats | 10 | 5 |
5 | Cats | 9 | 5 |
1 | Dogs | 82 | 5 |
2 | Dogs | 78 | 5 |
3 | Dogs | 83 | 3 |
4 | Dogs | 82 | 6 |
5 | Dogs | 84 | 4 |
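If you would like to try these two functions out for yourself, below is a minimal sketch that rebuilds the example data with tibble(); the values are simply copied from the tables above and the object names match the code shown:

library(tidyverse)

# Rebuild the example tibbles shown in the tables above
cats <- tibble(ID = 1:5,
               Animal = "Cats",
               Perceived_Emotion_Accuracy = c(11, 8, 8, 10, 9))

dogs <- tibble(ID = 1:5,
               Animal = "Dogs",
               Perceived_Emotion_Accuracy = c(82, 78, 83, 82, 84))

ages <- tibble(ID = rep(1:5, times = 2),
               Animal = rep(c("Cats", "Dogs"), each = 5),
               Ages = c(5, 4, 3, 5, 5, 5, 5, 3, 6, 4))

# Stack the cats and dogs tibbles on top of each other...
all_animals <- bind_rows(cats, dogs)

# ...then add the age information by the columns common to both tibbles
all_animals_ages <- inner_join(all_animals, ages, c("ID", "Animal"))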
There is actually bind_cols()
as well as a host of other join functions, e.g. full_join()
, anti_join()
. You will pick up more of these functions as we continue but we hope this helps clarify these two functions a little.
A quick table on some useful join functions:
function | Example | Final tibble has... |
---|---|---|
full_join() | full_join(x,y, by = "ID") | information for all observations found in both original tibbles |
inner_join() | inner_join(x,y, by = "ID") | information for only observations found in both original tibbles |
anti_join() | anti_join(x,y, by = "ID") | information for observations found in tibble x but not tibble y |
anti_join(y,x, by = "ID") | information for observations found in tibble y but not tibble x |
End of Additional Material!