The purpose of this exercise is to test your ability to run and interpret a correlation for yourself, as shown in the chapter, but at the same time get you to think about all the skills you learnt previously. Most of the hard work is in the data wrangling; running the actual analysis, much like with t-tests, is very straightforward. Remember to refer back to previous chapters when stuck.
As in the previous exercises there are a number of code chunks already set up. Some of the code chunks may require entering just a number whilst other tasks may require editing or entering code. Follow the instructions of each task and pay close attention to what is asked. Do not change the names of any variables or data frames given to you, and do not change the rules or names of any of the code chunks as this may impact your grade. If you are unsure which are the names, refer back to the activities. In this exercise the names are T2101A to T2111. Do not change these names. Nearly all the tasks will involve entering either a number or code relating to what we have previously covered in all the chapters to this point. Look back to see what you previously did.
There are 11 tasks in total to attempt and answer. Should you require any assistance or help, keep in mind the lab sessions, student/office hours, and the TEAMS channel called Data Skills and R.
The Dawtry Sutton and Sibley 2015 Study 1a.csv file is saved into a folder on your computer and you have manually set this folder as your working directory. Remember: do not set your working directory using code in the script. Instead, manually set it using the Session >> Set Working Directory options. Also, do not at any time rename the .csv data file. It should be named exactly as above for the duration of this activity. This is to ensure that your code will be reproducible.
The .Rmd file is saved in the same folder as the data file. We ask that you save the .Rmd file with the format GUID_L2_Ch9_PracticeSkills.Rmd where GUID is replaced with your GUID.
Remember that if at any point you want to explore your data to become familiar with the variable names, noting any capital letters and full-stops, you can use View() or glimpse(). Type these functions only in the console window and not in the .Rmd file.
Today you will need the broom and tidyverse packages. Load in these packages by putting the appropriate code into the T2101A code chunk below.
Hint: library()
## To do: Bring in add on packages here
library(broom)
library(tidyverse)
Using read_csv(), replace the NULL in the T2101B code chunk to read in the data file with the exact filename you have been given, Dawtry Sutton and Sibley 2015 Study 1a.csv, and store it as a tibble in dat1.
dat1 <- read_csv("Dawtry Sutton and Sibley 2015 Study 1a.csv")
Have a look at the dataset in the viewer. There are a lot of columns which you can explore later at your own leisure. For now, today, we are only interested in the ones relevant to our analysis looking at age, fairness and satisfaction of the system, and redistribution of wealth.
Replace the NULL in the T2102 code chunk below with code to select only the following columns, in the exact order shown, and store them as a tibble in dat2: age, redist1, redist2, redist3, redist4, fairness, satisfaction.
dat2 should have 7 columns with 305 rows. Columns will be in the order stated.
dat2 <- select(dat1, age, redist1, redist2, redist3, redist4, fairness, satisfaction)
We have two scales relating to fairness and satisfaction but when we run the correlation later we will need one column that captures both measures. We want to create a new variable in our data that is a composite measure of fairness and satisfaction - this we will call Sat_and_Far.
Replace the NULL in the T2103 code chunk with code that will mutate a new column called Sat_and_Far onto the data in dat2 (watching exact spelling and capitalisation) where values in that column represent the average of the values in the satisfaction and fairness columns for each participant. For example, if a person scores 3 on satisfaction and 1 on fairness, they would have a Sat_and_Far score of 2. Store the output as a tibble in dat3.
Note: Be exact on the spelling of the new column name and in order of columns! Check and double check!!
Check your work: If you have completed this task successfully then dat3 should have 8 columns with 305 rows, with the same order of columns as stated in Task 2 but with the new column added on the far right of the tibble. Check that your averages make sense!
dat3 <- mutate(dat2, Sat_and_Far = (satisfaction + fairness)/2)
dat3
## # A tibble: 305 × 8
## age redist1 redist2 redist3 redist4 fairness satisfaction Sat_and_Far
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40 6 6 3 1 1 1 1
## 2 59 2 3 2 4 5 2 3.5
## 3 41 5 5 4 5 5 5 5
## 4 59 1 3 3 4 7 7 7
## 5 35 4 4 5 5 4 5 4.5
## 6 34 6 6 5 6 1 4 2.5
## 7 36 5 5 2 5 3 3 3
## 8 39 3 4 3 4 5 4 4.5
## 9 40 5 5 4 5 5 3 4
## 10 31 4 5 4 5 4 5 4.5
## # ℹ 295 more rows
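As a quick check that the averages make sense, the sketch below recomputes Sat_and_Far by hand for three rows copied from the printed dat3 output above (rows 1, 2 and 6). The object name toy is hypothetical, and base R is used only to keep the arithmetic visible; the task itself still asks for mutate().

```r
# Three rows copied from the printed dat3 output (rows 1, 2 and 6);
# `toy` is a hypothetical name used only for this check
toy <- data.frame(fairness     = c(1, 5, 1),
                  satisfaction = c(1, 2, 4))

# Average of the two items for each participant
toy$Sat_and_Far <- (toy$satisfaction + toy$fairness) / 2

toy$Sat_and_Far   # 1.0 3.5 2.5, matching the Sat_and_Far column printed above
```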
Similarly, we now want to create one measure related to support for redistribution that is made up of the average of the relevant scales: redist1, redist2, redist3 and redist4. However, the redist3 scale is negatively scored - meaning that a 1 on that scale would score as a 6 on other scales. We need to recode this variable!
Replace the NULL in the T2104 code chunk with code that will mutate a column called redist3_rcd onto the data in dat3 where the values of redist3 have been recoded in the following manner: 1 becomes 6, 2 becomes 5, 3 becomes 4, 4 becomes 3, 5 becomes 2, and 6 becomes 1.
Store the output as a tibble in dat4.
Hint 2: When recoding numerical values, the LHS always needs quotes, e.g. "1" = 6, but the RHS has no quotes so that it stays a numeric value!
Check your work: If you have completed this task successfully then dat4 should have 9 columns with 305 rows, with the same order of columns as stated in Task 3 but with the new column added on the far right of the tibble. Check that redist3_rcd now has the recoded values based on what was in redist3 and that these are values not characters.
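One way to sanity-check a reverse-scored column is with a little arithmetic: on a 1 to 6 scale, reverse scoring (so that 1 becomes 6 and 6 becomes 1) is the same as subtracting each response from 7. The sketch below uses hypothetical toy values, not the study data, and is only an alternative check rather than the recode() approach the task asks for.

```r
# Hypothetical responses covering every point of a 1-6 scale
# (illustrative values, not taken from the study data)
redist3_toy <- c(1, 2, 3, 4, 5, 6)

# Reverse-score arithmetically: (scale minimum + scale maximum) - response
redist3_toy_rcd <- (1 + 6) - redist3_toy

redist3_toy_rcd              # 6 5 4 3 2 1
is.numeric(redist3_toy_rcd)  # TRUE: the recoded values are numbers, not text
```

If your recoded column in the real data disagrees with 7 minus the original column, one of the recode pairs is probably mistyped.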
dat4 <- mutate(dat3, redist3_rcd = recode(redist3,
                                          "1" = 6,
                                          "2" = 5,
                                          "3" = 4,
                                          "4" = 3,
                                          "5" = 2,
                                          "6" = 1))
Likewise the column redist4 is also negatively scored. We will have to repeat the steps of Task 4 but this time we will do so for the redist4 variable.
Replace the NULL in the T2105 code chunk with code that will mutate a column called redist4_rcd onto the data in dat4 where the values of redist4 have been recoded in the same manner as in Task 4. Store the output as a tibble in dat5.
dat5 should have 10 columns with 305 rows, with the same order of columns as stated in Task 4 but with the new column added on the far right of the tibble. Check that redist4_rcd now has the recoded values based on what was in redist4.
dat5 <- mutate(dat4, redist4_rcd = recode(redist4,
                                          "1" = 6,
                                          "2" = 5,
                                          "3" = 4,
                                          "4" = 3,
                                          "5" = 2,
                                          "6" = 1))
Now we want to create a single variable within our data that is a composite measure of the four correctly coded redistribution variables (redist1, redist2, redist3_rcd, and redist4_rcd) - we will call this measure Sup4R which is short for Support for Redistribution.
Replace the NULL in the T2106 code chunk with code that will mutate a new column called Sup4R onto the data in dat5 (watching exact spelling and capitalisation) where the values within that column represent the average of the values in the four redistribution columns named above in this Task. Store the output in dat6.
Note: Double check the spelling and capitalisation of the new columns.
Check your work: If you have completed this task successfully then dat6 should have 11 columns with 305 rows, with the same order of columns as stated in Task 5 but with the new column added on the far right of the tibble. The new column, Sup4R, will be the column on the far right of the tibble and contain the average of the four redistribution columns. Check that your averages make sense!
dat6 <- mutate(dat5, Sup4R = (redist1 + redist2 + redist3_rcd + redist4_rcd)/4)
dat6
## # A tibble: 305 × 11
## age redist1 redist2 redist3 redist4 fairness satisfaction Sat_and_Far
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40 6 6 3 1 1 1 1
## 2 59 2 3 2 4 5 2 3.5
## 3 41 5 5 4 5 5 5 5
## 4 59 1 3 3 4 7 7 7
## 5 35 4 4 5 5 4 5 4.5
## 6 34 6 6 5 6 1 4 2.5
## 7 36 5 5 2 5 3 3 3
## 8 39 3 4 3 4 5 4 4.5
## 9 40 5 5 4 5 5 3 4
## 10 31 4 5 4 5 4 5 4.5
## # ℹ 295 more rows
## # ℹ 3 more variables: redist3_rcd <dbl>, redist4_rcd <dbl>, Sup4R <dbl>
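Because the printed output truncates the last three columns, it can help to work one or two rows through by hand. The sketch below takes the redist values for rows 1 and 2 from the output above, reverse-scores redist3 and redist4 as 7 minus the response (equivalent on a 1 to 6 scale to the recode() mapping used in Tasks 4 and 5), and averages the four items. The name toy is hypothetical and the code is only a check, not a replacement for the task code.

```r
# redist values for rows 1 and 2 of the printed output above;
# `toy` is a hypothetical name used only for this check
toy <- data.frame(redist1 = c(6, 2),
                  redist2 = c(6, 3),
                  redist3 = c(3, 2),
                  redist4 = c(1, 4))

# Reverse-score the two negatively scored items on the 1-6 scale
toy$redist3_rcd <- 7 - toy$redist3
toy$redist4_rcd <- 7 - toy$redist4

# Average of the four correctly coded items
toy$Sup4R <- (toy$redist1 + toy$redist2 + toy$redist3_rcd + toy$redist4_rcd) / 4

toy$Sup4R   # 5.50 3.25
```

If your own dat6 was built as described, its first two Sup4R values should agree with these.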
Great! We have the columns we now need but the data is starting to get untidy again with lots of columns we no longer want. Let’s get rid of some of the columns in dat6 and keep only the necessary columns.
Replace the NULL in the T2107 code chunk with code to select and keep only the following columns. Do so in the exact order they are named here: Sup4R, Sat_and_Far, age. Store the output as a tibble in dat7.
dat7 <- select(dat6, Sup4R, Sat_and_Far, age)
Almost there but first, for our write-up, we will need a note of some descriptive statistics such as the number of participants, as well as their mean age and the standard deviation of their ages.
Replace the NULL in the T2108 code chunk with code to reproduce the table shown to you, paying particular attention to column names, capitalisation or not, and column order. Do not worry about the spacing between columns, just names and column order.
We have hidden the values but your table when knitted will produce the values. Do not round the values!
Store the output as a tibble in desc.
Hint 1: Some participants have not stated their age so their age is stored as NA and this will need to be considered when working out the mean and sd. However, the number of participants should include everyone regardless of whether they stated their age.
Hint 2: Check and double check the spelling of the columns.
Hint 3: When naming columns, do not put the name in quotes.
desc <- summarise(dat7, Npps = n(),
                  Mage = mean(age, na.rm = TRUE),
                  SDage = sd(age, na.rm = TRUE))
desc
## # A tibble: 1 × 3
## Npps Mage SDage
## <int> <dbl> <dbl>
## 1 305 37.4 12.0
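The reason Hint 1 matters can be seen on a tiny example. The sketch below uses a hypothetical vector of four ages, one of them missing (these are illustrative values, not the study data), and base R functions so that each piece can be run on its own in the console. In summarise(), n() counts rows in the same way that length() counts elements here: missing ages are still counted.

```r
# Hypothetical ages for four participants, one of whom did not state their age
ages_toy <- c(40, 59, NA, 41)

# Counting participants: the NA still counts as a participant
n_toy <- length(ages_toy)                    # 4

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(ages_toy)               # NA

# With na.rm = TRUE, the NA is dropped before calculating
mean_toy <- mean(ages_toy, na.rm = TRUE)     # (40 + 59 + 41) / 3
sd_toy   <- sd(ages_toy, na.rm = TRUE)
```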
And just before we run the correlation, you will know that one of the key checks of a correlation, in regards to both the assumptions (checking it is linear or not) and descriptives, is the scatterplot.
Insert code into the T2109 code chunk below to exactly replicate the figure shown to you.
Pay particular attention to labels, axes dimensions, shape, color and background.
Note: all the dots are exactly the same color.
Note: the line of best fit is added with the following line of code:
geom_smooth(method = "lm", se = FALSE)
Note: the figure must appear when your code is knitted and not be stored in a tibble.
Finally the figure should be created using the data and variable names stipulated above.
Hint 1: scale_y_continuous(breaks = c(minvalue, maxvalue), limits = c(minvalue, maxvalue))
Hint 2: scale_x_continuous as above
Hint 3: Shape is less than 10. We have not edited the size of the data points at all.
# to do: exactly replicate the figure shown
ggplot(dat7, aes(x = Sat_and_Far, y = Sup4R)) +
  geom_point(color = "red", shape = 3) +
  labs(x = "Fairness and Satisfaction", y = "Support for Redisribution") +
  scale_x_continuous(breaks = c(1:9), limits = c(1,9)) +
  scale_y_continuous(breaks = c(1:6), limits = c(1,6)) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
As we said above, wrangling is usually the hardest part of data analysis and the correlation itself is pretty straightforward. Time to run the correlation. For this analysis we will work under the premise that all assumptions were met for a Pearson’s correlation.
Replace the NULL in the T2110 code chunk with code to perform a two-sided Pearson’s correlation between the variables Sat_and_Far and Sup4R in dat7. Store the output as a tibble in mods, i.e. a table, as in previous assignments, not an object.
Hint: Remember tidy().
mods <- cor.test(dat7$Sat_and_Far,
                 dat7$Sup4R,
                 method = "pearson",
                 alternative = "two.sided") %>% tidy()
mods
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 -0.700 -17.1 2.82e-46 303 -0.753 -0.638 Pearson'… two.sided
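When reading this output, note that the parameter column holds the degrees of freedom, which for a Pearson correlation is the number of complete pairs minus 2. That is why the table shows 303 even though there are 305 participants (305 - 2 = 303). The sketch below illustrates this on hypothetical toy vectors (x_toy and y_toy are made-up values, not the study data), using the untidied cor.test() object so that it runs without any add-on packages.

```r
# Hypothetical paired scores for six participants (illustrative values only)
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- c(6, 5, 5, 3, 2, 1)

# Two-sided Pearson correlation, as in the task
res_toy <- cor.test(x_toy, y_toy,
                    method = "pearson",
                    alternative = "two.sided")

# Degrees of freedom = number of pairs - 2
df_toy <- unname(res_toy$parameter)
df_toy == length(x_toy) - 2     # TRUE

# The sign of the estimate gives the direction of the relationship
r_toy <- unname(res_toy$estimate)
r_toy < 0                       # TRUE: as x_toy goes up, y_toy goes down
```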
Read through the four statements below. One of them is a coherent and accurate summary of the above analysis of the relationship between people’s support for redistribution of wealth and their level of perceived satisfaction with and fairness of the current system.
305 participants (mean age = 37.4 years, sd = 12.04 years) were measured on their views regarding distribution of wealth. A Pearson’s product-moment correlation was run comparing the composite measures of Fairness and Satisfaction (Sat_and_Far) against Support for Redistribution (Sup4R) and found a strong significant positive correlation between the two variables, r(303) = .7, p < .001. As such, the analysis would suggest that as people’s perceived fairness of the system increases their support for the redistribution of wealth decreases.
305 participants (mean age = 37.4 years, sd = 12.04 years) were measured on their views regarding distribution of wealth. A Pearson’s product-moment correlation was run comparing the composite measures of Fairness and Satisfaction (Sat_and_Far) against Support for Redistribution (Sup4R) and found a strong significant negative correlation between the two variables, r(303) = -.7, p < .001. As such, the analysis would suggest that as people’s perceived fairness of the system increases their support for the redistribution of wealth decreases.
303 participants (mean age = 37.4 years, sd = 12.04 years) were measured on their views regarding distribution of wealth. A Pearson’s product-moment correlation was run comparing the composite measures of Fairness and Satisfaction (Sat_and_Far) against Support for Redistribution (Sup4R) and found a strong significant negative correlation between the two variables, r(303) = -.7, p > .001. As such, the analysis would suggest that as people’s perceived fairness of the system increases their support for the redistribution of wealth decreases.
303 participants (mean age = 37.4 years, sd = 12.04 years) were measured on their views regarding distribution of wealth. A Pearson’s product-moment correlation was run comparing the composite measures of Fairness and Satisfaction (Sat_and_Far) against Support for Redistribution (Sup4R) and found a strong significant positive correlation between the two variables, r(303) = .7, p < .001. As such, the analysis would suggest that as people’s perceived fairness of the system increases their support for the redistribution of wealth increases.
In the T2111 code chunk below there are four lines of code that have all been commented out using the # at the start of the line. Only one of the lines states the correct answer. Remove the # from the start of the line that states the correct answer so when knitted answer_t11 stores only that single value. For instance, if you think the answer is option 1 then you would remove the # from the start of # answer_t11 <- 1 to make the line read as answer_t11 <- 1. Change only one line of code to store a single value in answer_t11.
Remove the # from only the line of code that you think is the correct answer, in order to make that line of code run.
When you remove the # the line will change from being “commented out” (shown as green code in default settings) to looking like normal code again.
Remove the # from only one line of code.
answer_t11 <- 2
Well done, you are finished.