Chapter 6 Data Wrangling 3
6.1 Data wrangling recap
In the last chapter, we looked at using the one-table Wickham verbs `filter()`, `arrange()`, `group_by()`, `select()`, `mutate()` and `summarise()`. Now we will focus on working with data across two or more tables. The two main verbs we will practice adding to the Wickham six in this chapter are `pivot_longer()` and `inner_join()`, and these will help you process your data for your quantitative project.

- `pivot_longer()` allows us to transform a table from wide format to tidy format (more on this below).
- `inner_join()` allows us to combine two tables based on common columns.
A function is a tool that takes an input, performs some action, and gives an output. They are nothing more than that. If you think about it, your toaster is a function: it takes bread as an input; it performs the action of heating it up; and it gives an output, the toast. A good thing about the Wickham six functions is that they are nicely named as verbs to describe what they do - `mutate()` mutates (adds on a column), `arrange()` arranges (sorts the rows), `summarise()` summarises, etc.
In terms of remembering all the functions, the truth is you don't have to know them all. However, through practice and repetition, you will quickly learn to remember which ones are which and what package they come from. Sort of like where to find your spoons in your kitchen - you don't look in the fridge, and then the washing machine, and then the drawer. Nope, you learnt, by repetition, to look in the drawer first time. It's the same with functions. Keep in mind that research methods is like a language in that the more you use it and work with it the more it makes sense.
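To see this input-action-output idea in code, here is a minimal sketch (the `toast()` function is just a made-up illustration, not part of any package):

```r
# A function takes an input, performs an action, and returns an output.
# sum() takes a vector of numbers (input), adds them up (action),
# and returns the total (output):
sum(c(3, 1, 4)) # returns 8

# You can even write your own "toaster" (a made-up function for illustration):
toast <- function(bread) {
  paste("toasted", bread) # the action: combine the words
}
toast("sourdough") # returns "toasted sourdough"
```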
6.2 Tidy data
We will use a type of data organisation known as tidy data. Any data in this format is easily processed through the tidyverse
package. However, the data you work with will not always be formatted this way. If that happens then your first step is to put it into tidy data format. There are three fundamental rules defining Tidy Data:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
If you've worked with any kind of data before, particularly if you've used Excel, it's very likely that you will have used wide format data. In wide format, each participant's data is all in one row with multiple columns for different data points. This means that the data set tends to be very wide and you will have as many rows as you have participants.
This layout can be easy to read, however, it makes programming quite difficult. Whilst Tidy Data can be conceptually more difficult to understand at first, it means you can manipulate your data in whatever way you want very easily. There is more information about tidy data available here.
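To make the difference concrete, here is a small sketch using a hypothetical two-participant dataset (the column names and values are made up for illustration):

```r
library(tidyverse)

# A hypothetical wide-format table: one row per participant,
# one column per question (made-up names and values)
wide <- tibble(Id = c("P1", "P2"),
               Q1 = c(1, 2),
               Q2 = c(4, 3))

# The same data in tidy (long) format: one row per observation
long <- pivot_longer(wide,
                     cols = Q1:Q2,
                     names_to = "Question",
                     values_to = "Response")
long # 4 rows: one per participant-question pair
```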
6.3 Analysing the Autism Spectrum Quotient (AQ)
To continue building your data wrangling skills in this chapter you will tidy data from the Autism Spectrum Quotient (AQ) questionnaire. The AQ10 is a non-diagnostic short form of the AQ with only 10 questions per participant. It is a discrete scale and the higher a participant scores on the AQ10 the more autistic-like traits they are said to display. Anyone scoring 7 or above is recommended for further diagnosis. You can see an example of the AQ10 through this link: AQ10 Example.
There are 66 participants and your goal in this chapter is to find an AQ score for each of them through your data-wrangling skills.
There are four data files to work with that you should download into your chapter folder:
- responses.csv containing the AQ survey responses to each of the 10 questions for the 66 participants
- qformats.csv containing information on how a question should be coded - i.e. forward or reverse coding
- scoring.csv containing information on how many points a specific response should get; depending on whether it is forward or reverse coded
- pinfo.csv containing participant information such as Age, Sex and, importantly, `ID` number.
`csv` stands for 'comma-separated values' and is a very basic way of storing and transferring data. It really just stores numbers and text and nothing else. The great thing about being this basic is that it can be read by many different machines and does not need an expensive license to open it.
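For illustration, this is roughly what a small .csv file looks like if you open it in a plain text editor (hypothetical column names and values, not the actual contents of the files above): the first line holds the column names, and each following line is one row, with commas separating the values.

```
Id,Age,Sex
1,25,Female
2,31,Male
```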
6.4 Activity 1: Set-up Data Wrangling 3
Do the following. If you need help, consult Programming Basics and Loading Data.
- Open R Studio and set the working directory to your chapter folder. Ensure the environment is clear.
- Open a new R Markdown document and save it in your working directory. Call the file "Data wrangling 3".
- Download the four .csv files above and save them in your chapter folder. Make sure that you do not change the file names at all.
- If you're on the server, avoid a number of issues by restarting the session - click `Session` - `Restart R`.
- Delete the default R Markdown welcome text and insert a new code chunk that loads the package `tidyverse` using the `library()` function.
6.5 Activity 2: Load in the data
Now you need to load in the `.csv` data files using the `read_csv()` function and save them as variables in the environment. For example, to load in the `responses` file we would type:

```r
responses <- read_csv("responses.csv")
```
- Add the following lines of code to your Markdown and complete them to load in all four `.csv` data files. Use the above code as an example and name each variable the same as its original file name (minus the .csv part), again as above, e.g. `responses.csv` gets saved as `responses`. Remember to run the lines so that the data is loaded in and stored in your environment.
```r
responses <- read_csv() # survey responses
qformats <-             # question formats
scoring <-              # scoring info
pinfo <-                # participant information
```
6.6 Activity 3: Look at your data
Now that we have the data loaded in, it is always best to have a look at it to get an idea of its layout. We showed you ways of doing this before, but you can also use the `glimpse()` or `View()` functions in your Console window, putting the name of the data between the brackets, to see how it is arranged. Don't add these to your script though; they are just one-offs for testing.
- Have a look at the data in `responses` to see if you think it is Tidy or not and answer the following question: The data in `responses` is in ______ format.
The `responses` tibble is far from being tidy; each row represents multiple observations from the same participant, i.e. each row shows responses to multiple questions, and there are the same number of rows as there are participants (66) - wide format. Remember we want the data in tidy format as described above.
6.7 Activity 4: Tidying data
We now have all the data we need loaded in, but in order to make it easier for us to get the AQ score for each participant, we need to change the layout of the `responses` tibble to Tidy Data.

The first step is to use the function `pivot_longer()` to transform the data. The pivot functions can be easier to show than tell - you may find it a useful exercise to run the below code and compare the newly created object `rlong` with the original `responses` before reading on.
```r
rlong <- pivot_longer(data = responses,
                      cols = Q1:Q10,
                      names_to = "Question",
                      values_to = "Response")
```
- As with the other tidyverse functions, the first argument specifies the dataset to use as the base, in this case `responses`. This argument name is often dropped in examples.
- `cols` specifies all the columns you want to transform. The easiest way to visualise this is to think about which columns would stay the same in the new long-form dataset and which would change. In this case, we only have a single column, `Id`, that will remain constant; we will transform all the other columns, which contain participants' responses to each question. The colon notation `first_column:last_column` is used to select all variables from the first column specified to the second. In our code, `cols` specifies that the columns we want to transform are `Q1` to `Q10`.
- `names_to` specifies the name of the new column that will hold the names of the transformed columns, in this case `Question`.
- Finally, `values_to` names the new column that will contain the measurements, in this case we'll call it `Response`.

At this point you may find it helpful to go back and compare `rlong` and `responses` again to see how each argument matches up with the output of the table.
6.8 Activity 5: Combining data
Now the responses
data is in tidy format, you are closer to being able to calculate an AQ score for each person. However, you still need some extra information:
- Is the question reverse or forward scored (i.e., is strongly agree a positive or negative response)? This information is found in
qformats
- How many points are given to give a specific response? This information is found in
scoring
.
This is a typical analysis situation where different information is stored in different tables and you need to join them all together. Both these pieces of information are contained in `qformats` and `scoring` respectively, but we want to join them to `responses` to create one informative tidy table with all the information we need. We can do this through the function `inner_join()`: a function that combines the information in two tibbles using a column common to both tibbles.
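If it helps to see the mechanics first, here is `inner_join()` at work on two tiny hypothetical tibbles (made-up values, purely for illustration):

```r
library(tidyverse)

# Two tiny hypothetical tibbles that share a "Question" column
answers <- tibble(Id       = c("P1", "P1"),
                  Question = c("Q1", "Q2"),
                  Response = c(1, 4))
formats <- tibble(Question = c("Q1", "Q2"),
                  QFormat  = c("Forward", "Reverse"))

# inner_join() keeps rows whose Question value appears in both tables
# and adds the QFormat column alongside each response
inner_join(answers, formats, by = "Question")
```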
- Replace the `NULL` values in the below code with the necessary variable names to join `rlong` and `qformats` by `Question`. If you need extra help, revisit Loading Data - you used the same function then! You can also check the solutions for the answer (but make sure you try yourself first).
```r
rlong2 <- inner_join(x = NULL, y = NULL, by = "NULL")
```
- Now view `rlong2`. You have matched each question with its scoring format, `forward` or `reverse`.
A lot of questionnaires have some questions that are Forward scored and some questions that are Reverse scored. What does this mean? Imagine a situation where your options in replying to a question are: 1 - extremely agree, 2 - agree, 3 - neutral, 4 - disagree, 5 - extremely disagree. In a forward-scoring question you would get 1 point for extremely agree, 2 for agree, 3 for neutral, etc. In a reverse scoring question you would get 5 for extremely agree, 4 for agree, 3 for neutral, etc.
The reasoning behind this shift is that sometimes agreeing or disagreeing might be more favourable depending on how the question is worded. Secondly, sometimes these questions are used just to catch people out - imagine if you had two similar questions where one has the reverse meaning of the other. In this scenario, people should give opposite responses. If they respond the same then they might not be paying attention.
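As a sketch of the general idea, on a hypothetical 5-point scale a reverse-scored item can be recoded by subtracting the response from 6 (this is just to illustrate reverse scoring in general; the actual points for the AQ10 are given in `scoring.csv`):

```r
# Hypothetical responses on a 5-point scale
# (1 = extremely agree ... 5 = extremely disagree)
response <- c(1, 2, 3, 4, 5)

forward <- response     # forward-scored: points equal the response, 1 to 5
reverse <- 6 - response # reverse-scored: 5, 4, 3, 2, 1
```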
6.9 Activity 6: Combining more data
Now you need to combine the information in our new table, `rlong2`, with the `scoring` table so you know how many points to attribute to each question based on the answer the participant gave, and whether the question was forward or reverse coded. Again, you can use the `inner_join()` function, but this time the common columns found in `rlong2` and `scoring` are `QFormat` and `Response`. To combine by two columns you just write them in sequence as shown below.
- You can only ever join two tables with `inner_join()` at once, so if you have multiple tables like we do, you need to do multiple calls to `inner_join()`.
- When there is more than one common column between two tibbles you are joining, it is best to combine by all the columns to avoid repeated column names in the new tibble. If you run a join and it produces columns named `variable.x` and `variable.y`, it means that there was another column that was the same in both datasets that you didn't add to `by`.
- You have to be very careful to use `c()` if you have multiple variables for `by`. This is a common error and results in the `variable.x`/`variable.y` issue above.
- Type the below line into the Activity 6 code chunk, run it, and then view the new object.
```r
# combine rows in rlong2 and scoring based on QFormat and Response
rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))
```
6.10 Activity 7: Calculating the AQ scores
You have now created rscores
which has information on how each participant responded to each question and how each question should be coded and scored, all within the one tibble. All you need now is to sum the scores for each participant to get their AQ score.
- Based on your knowledge from the last chapter, type the below lines into your code and replace the `NULL`s to obtain individual `aq_scores` for each participant.
- Save your Markdown and knit it to make sure all your code works.
```r
aq_scores <- rscores %>%
  group_by(NULL) %>% # how will you group individual participants?
  summarise(AQ = sum(NULL)) # which column will you sum to obtain AQ scores?
```
Each participant could be grouped by their `Id`. If we summed up the value in `Score` for each participant, we would get a full AQ score for each of them.
6.11 Activity 8: One last thing on pipes
You now have complete code to load in your data, convert it to Tidy, combine the tables and calculate an AQ score for each participant. But, if you look at it, some of your code could be more efficient by using pipes.
Go back through your code and try to rewrite it using pipes %>%
so that it is as efficient as possible.
At any point where the first argument of your function is the name of a variable created before that line, there is a good chance you could have used a pipe! Here are all the bits of this code that could be piped together into one chain:
`rlong <- pivot_longer(responses, cols = Q1:Q10, names_to = "Question", values_to = "Response")`
`rlong2 <- inner_join(rlong, qformats, "Question")`
`rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))`
`aq_scores <- rscores %>% group_by(Id) %>% summarise(AQ = sum(Score))`
You have now recapped one-table and two-table verbs. These are great to know: in the above, it actually only took a handful of reproducible steps to get from messy data to tidy data. Could you imagine doing this by hand in Excel through cutting and pasting? Not to mention the mistakes you could make!
If you have any questions, please post them on Teams.
6.12 Activity solutions
Below you will find the solutions to the above questions. Only look at them after giving the questions a good try and trying to find help on Google or Teams about any issues.
6.12.1 Activity 2
```r
responses <- read_csv("responses.csv")
qformats <- read_csv("qformats.csv")
scoring <- read_csv("scoring.csv")
pinfo <- read_csv("pinfo.csv")
```
6.12.2 Activity 5
```r
rlong2 <- inner_join(x = rlong, y = qformats, by = "Question")
```
6.12.3 Activity 7
```r
aq_scores <- rscores %>%
  group_by(Id) %>% # group by the ID number in column Id
  summarise(AQ = sum(Score)) # sum column Score to obtain AQ scores
```
6.12.4 Activity 8
```r
aq_scores2 <- responses %>% # take the data in `responses` and then
  pivot_longer(cols = Q1:Q10,
               names_to = "Question",
               values_to = "Response") %>% # take columns Q1 to Q10, put the column names in Question and the scores in Response and then
  inner_join(qformats, "Question") %>% # join with `qformats` and match the data by the column `Question` and then
  inner_join(scoring, c("QFormat", "Response")) %>% # join with `scoring` and match the data by the columns `QFormat` and `Response` and then
  group_by(Id) %>% # group by participant ID and then
  summarise(AQ = sum(Score)) # calculate the total AQ score
```
6.13 Test yourself
Complete the sentence, the higher the AQ score...
Type in the AQ score (just the number) of Participant ID No. 87:
Type how many participants had an AQ score of 3 (again just the number):
The cut-off for the AQ10 is usually said to be around 6 meaning that anyone with a score of more than 6 should be referred for diagnostic assessment. Type in how many participants we should refer from our sample:
As mentioned, the higher the score on the AQ10 the more autistic-like traits a participant is said to show.
- You could do this by code with `filter(aq_scores, Id == 87)`, which would give you a 1x2 tibble showing the ID number and score. If you just wanted the score you could use `pull()`, which we haven't shown you yet: `filter(aq_scores, Id == 87) %>% pull(AQ)`. The answer is an AQ score of 2.
- Same as above but changing the argument of the filter: `filter(aq_scores, AQ == 3) %>% count()`. The answer is 13. Remember you could do this by counting by eye, but you might make mistakes; the code makes it reproducible and accurate every time.
- `filter(aq_scores, AQ > 6) %>% count()` or `filter(aq_scores, AQ >= 7) %>% count()`. The answer is 6.