Chapter 9 Correlations

As Miller and Haden (2013) state at the start of Chapter 11, correlations are used to detect and quantify relationships among numerical variables. In short, you measure two variables and the correlation analysis tells you whether or not they are related in some manner - positively or negatively; one increases as the other increases; one decreases as the other increases; etc..

To actually carry out a correlation is very simple and we will show you that today in a little while: you just need the cor.test() function. The harder part of correlations is really wrangling the data and interpreting what the results mean. You are going to run a few correlations in this chapter to give you good practice at running and interpreting the relationships between two variables.

Note: When dealing with correlations you should always refer to relationships and not predictions. In a correlation, X does not predict Y, that is regression which you will cover in later in this course. In a correlation, all we can say is whether X and Y are related.

9.1 Correlation Activity 1: Set-up

In this chapter we will use the examples in Miller and Haden (2013), Chapter 11, looking at the relationship between four variables: reading ability, intelligence (IQ), the number of minutes per week spent reading at home (Home); and the number of minutes per week spent watching TV at home (TV). You can see in this situation that it would be unethical to manipulate these variables so measuring them as they exist in the environment is most appropriate; hence the use of correlations.

Open R Studio and set the working directory to your Correlation folder. Ensure the environment is clear.
Open a new R Markdown document and save it in your working directory. Call the file “Correlation”.
Download MillerHadenData.csv and save it in your Correlation folder and/or upload to to the R server. Make sure that you do not change the file name at all.
If you’re on the server, avoid a number of issues by restarting the session - click Session - Restart R
If you are working on your own computer, install the packages Hmisc, car, and broom. Remember do not install packages on university computers, they are already installed.
Delete the default R Markdown welcome text and insert a new code chunk that loads the packages broom, car, lsr, Hmisc, and tidyverse (in that order) using the library() function and loads the data into an object named mh using read_csv()

9.2 Correlation Activity 2: Look at your data

Look at your data, you can do this by clicking on the object in the environment, or using summary(mh) or head(mh).

As in Miller and Haden, we have 5 columns:

The participant (Participant),
Reading Ability score (Abil),
Intelligence score (IQ),
Number of minutes spent reading at home per week (Home),
And number of minutes spent watching TV per week (TV).

For the chapter we will focus on Reading Ability and IQ but for further practice you can look at other relationships in your free time.

A probable hypothesis could be that as Reading Ability increases so does Intelligence (think of the issue with causality and direction). Phrasing the hypothesis more formally, we hypothesise that the reading ability of school children, as measured through a standardized test, and intelligence, again measured through a standardized test, are positively correlated.

9.3 Activity 3: Assumptions

First, however, we must check some assumptions of the correlation tests. The main assumptions we need to check are:

Is the data interval, ratio, or ordinal?
Is there a data point for each participant on both variables?
Is the data normally distributed in both variables?
Does the relationship between variables appear linear?
Does the spread have homoscedasticity?

We will look at these in turn.

9.3.1 Assumption 1: Level of Measurement

If we want to run a Pearson correlation then we need interval or ratio data; Spearman correlations can run with ordinal, interval or ratio data. What type of data do we have?

The type of data in this analysis is most probably as the data is and there is unlikely to be a true zero

Are the variables continuous?
Is the difference between 1 and 2 on the scale equal to the difference between 2 and 3?

9.3.2 Assumption 2: Pairs of Data

All correlations must have a data point for each participant in the two variables being correlated. This should make sense as to why - you can’t correlate against an empty cell! So now go check that you have a data point in both columns for each participant.

It looks like that everyone has data in all the columns but let’s test our skills a little whilst we are here. Answer the following questions:

How is missing data represented in a tibble?
Which code would leave you with just the participants who were missing Reading Ability data in mh:
Which code would leave you with just the participants who were not missing Reading Ability data in mh:

filter(dat, is.na(variable)) versus filter(dat, !is.na(variable))

9.3.3 Assumption 3-5: Normality, linearity, homoscedasticity

The remaining assumptions are all best checked through visualisations. You can use histograms and QQ-plots to check that the data (Abil and IQ) are both normally distributed, and you can use a scatterplot of IQ as a function of Abil to check whether the relationship is linear, with homoscedasticity, and without outliers. There are various options and tests for assessing these assumptions but today we will just use visual checks.

Run the below code to create a histogram for Abil.

ggplot(data = mh, aes(x = Abil)) +
  geom_histogram()

Figure 9.1: Histogram of Abil

This code should look very similar to the code you used to create a bar plot in Intro to Data Viz. We have specified that we want to display Abil on the x-axis and that the shape we want to produce is a histogram, hence geom_histogram(). Just like geom_bar(), you do not need to specify the y-axis because if it’s a histogram, it’s always a count.

Write and run the code to produce another histogram for the variable IQ.

The QQ-plots require us to use the package car rather than ggplot2. You can make QQ-plots in ggplot2 but they aren’t as useful, however, the code is still very simple.

Run the below code to create a QQ-plot for Abil.

qqPlot(x = mh$Abil)

Figure 9.2: QQ-plot for Abil

## [1] 15  4

This code looks a little different to code you’ve used up until this point as it comes from Base R. It uses the notation object$variable so our x variable could be read as "use the variable Abil from the object mh.

The QQ-plot includes a confidence envelope (the blue dotted lines). The simple version is that if your data points fall within these dotted lines then you can assume normality. The ggplot2 version of QQ-plots make it more difficult to add on this confidence envelope, which is why we’re using a different package. qqPlot() will also print the IDs of the data points that are potentially problematic. In this case, the 4th and 15th data point in Abil are flagged to review.

Write and run the code to create a QQ-plot for IQ.

In order to assess linearity and homoscedasticity, we can create a scatterplot using ggplot2.

9.4 Correlation Scatterplot

The ggplot2 code is very similar to what you have already encountered with the bar chart and violin-boxplot. The first line of data sets up the base of the plot and we specify that we wish to display Abil on the x-axis, IQ on the y-axis, and use the dataset mh.

The first geom, geom_point() adds in the data points, the second geom, geom_smooth adds in the line of best fit. The shaded area around the line is a confidence interval.

ggplot(data = mh, aes(x = Abil, y = IQ)) +
  geom_point()+
  geom_smooth(method = lm) # if you don't want the shaded CI, add se = FALSE to this

## `geom_smooth()` using formula 'y ~ x'

Figure 7.5: Scatterplot of scores

Remember that ggplot2 works on layers and that you customise each layer. Edit the above code to add in layer of scale_x_continuous() that changes the label Abil to Reading Ability.

Based on the above visualisations:

Is the assumption of normality met for both variables?
Is the assumption of linearity met for both variables?
Is the assumption of homoscedasticity met for both variables?

When assessing assumptions through the use of visualisations your decision will always be a judgement call. In this dataset, we only have data from 25 participants therefore it is very unlikely we would ever observe perfect normality and linearity in this dataset. It is likely that a researcher would assume that this data is approximately normal, that there is no evidence of a non-linear relationship, and that the spread of data points around the line is relatively even.

Many students become fixated with needing a ‘perfect’ dataset that follows an exactly normal distribution. This is unlikely to ever happen with real data - learn to trust your instincts!

Look at the scatterplot and think back to the lecture, how would you describe this correlation in terms of direction and strength?

9.5 Correlation Activity 4: Descriptive statistics

Many researchers (and indeed members of the School of Psychology!) disagree as to whether you need to report descriptive statistics such as the mean and SD for a correlation. The argument against reporting them is that the scatterplot is actually the descriptive of the correlation that you would use to describe the potential relationship in regards to your hypothesis.

The counter argument is that providing descriptive statistics can still be informative about the spread of data for each variable, for example, in the current example it would make it easier to understand whether the participants as a whole compare to the population IQ score.

There’s no fixed answer to this question but the person writing this book takes the second view that you should always report descriptive statistics so that’s what we’re going to do.

Calculate the mean score and standard deviation for Abil and IQ using summarise()
Name the output of the calculations Abil_mean, Abil_SD, IQ_mean, and IQ_SD. Make sure to use these exact spellings otherwise later activities won’t work.
Store the output of this in an object called descriptives and then view the object. It should look something like this:

Abil_mean	Abil_SD	IQ_mean	IQ_SD
55.12	6.08	100.04	9.04

9.6 Correlation Activity 5: Run the correlation

Finally we will run the correlation using the cor.test() function. Remember that for help on any function you can type ?cor.test in the console window. The cor.test() function requires:

the column name of Variable 1
the column name of Variable 2
the type of correlation you want to run: e.g. pearson, spearman
the type of NHST tail you want to run: e.g. "less","greater"1, "two.sided"

For example, if your data is stored in dat and you want to do a two-sided pearson correlation of the variables (columns) X and Y, then you would do:

cor.test(dat$X, dat$Y, method = "pearson", alternative = "two.sided")

Based on your answers to the assumption tests, spend a couple of minutes deciding with your group which correlation method to use (e.g. pearson or spearman) and the type of NHST tail to set (e.g. two.sided or one.sided).
Run the correlation between IQ and Ability and save it in an object called results.
View the output by typing results in the console

As you can see from the environment, the output from the correlation has saved as a list. This can make it a little more difficult to work with so we’re going to use a function from the broom package that we loaded to make the table a bit tidier. The following code is going to overwrite the object results with the tidy version.

Run the below code and then view the object by clicking on results in the environment.

results <- results %>% # take the object results and then
  tidy()              # tidy it up

9.7 Correlation Activity 6: Interpreting the correlation

You should now have a tibble called results that gives you the output of the correlation between Reading Ability and IQ for the school children measured in Miller and Haden (2013) Chapter 11. All that is left to do now, is interpret the output.

Look at resultsand then with your group, answer the following questions:

What is the value of Pearson’s r to 2 decimal places?
The direction of the relationship between Ability and IQ is:
The strength of the relationship between Ability and IQ is:
Based on $\alpha = .05$ the relationship between Ability and IQ is:
The hypothesis was that the reading ability of school children, as measured through a standardized test, and intelligence, again through a standardized test, are positively correlated. Based on the results we can say that the hypothesis:

The test statistic, in this case the r value, is usually labelled as the estimate.
If Y increases as X increases then the relationship is positive. If Y increases as X decreases then the relationship is negative. If there is no change in Y as X changes then there is no relationship
Depending on the field most correlation values greater than .5 would be strong; .3 to .5 as medium, and .1 to .3 as small.
The field standard says less than .05 is significant and our p-value is less than .05.
Hypotheses can only be supported or not supported, never proven. In this case, our results matched our hypothesis therefore it is supported.

9.8 Correlation Activity 7: Write-up

Copy and paste the below exactly into white space in your R Markdown document and then knit the file.

The mean IQ score was `r round(pluck(descriptives$IQ_mean),2)` (`r round(pluck(descriptives$IQ_SD),2)`) and the mean reading ability score was `r round(pluck(descriptives$Abil_mean),2)` (`r round(pluck(descriptives$Abil_SD),2)`). A Pearson\`s correlation found a significant, medium positive correlation between the two variables (r (`r results$parameter`) = `r round(results$estimate, 2)`, *p* = `r round(results$p.value, 3)`).

It will magically transform into:

The mean IQ score was 100.04(9.04) and the mean reading ability score was 55.12(6.08). A Pearson`s correlation found a significant, medium positive correlation between the two variables (r (23) = 0.45, p = 0.024)

9.9 Correlation Activity 8: Scatterplot matrix

Above we ran one correlation and if we wanted to do a different correlation then we would have to edit the cor.test() line and run it again. However, when you have lots of variables in a dataset, to get a quick overview of patterns, one thing you might want to do is run all the correlations at the same time or create a matrix of scatterplots at the one time. You can do this with functions from the Hmisc and lsr packages. We will use the Miller and Haden data here again which you should still have in a tibble called mh.

Run the following line. The pairs() function from the Hmisc library creates a matrix of scatterplots which you can then use to view all the relationships at the one time.

pairs(mh)

Figure 9.3: Scatterplot matrix

Notice something wrong? pairs() will create scatterplots for all variables in your data (as will correlate() below). This means that it has correlated the Participant ID number as well, which is totally meaningless.

Overwrite mh and use select() to get rid of the Participant column then run pairs(mh) again.

9.10 Correlation Activity 9: Multiple correlations

To perform multiple correlations in one go, we will use the correlate() function from the lsr package. If you look at the help documentation for correlate() you will see that it takes the following form:

correlate(x, y=NULL, test=FALSE, corr.method="pearson", p.adjust.method="holm")

You can use correlate() similar to cor.test() and specify a specific variable for both x and y to perform a single correlation. However, you can also provide a data frame that has multiple variables as x and it will run all possible correlations between the variables.

test controls whether or not p-values will be computed. The default setting for this is FALSE. You will almost always want to change this to TRUE.
corr.method() controls which correlation is computed, the default is pearson but if you needed to run the non-parametric version you could change this to spearman.
p.adjust.method is the reason we are using the lsr package. In the lectures we discussed the problem of multiple comparisons - the idea that if you run lots and lots of tests you’re likely to produce a significant p-value just by chance. This argument applies a correction to the p-value that adjusts for the number of correlations you have performed. There are several different methods which you can look up in the help documentation, the default setting is a Bonferroni-Holm correction.
Because you’re running multiple correlations and some may be positive and some may be negative, there is no option to specify a one or two-tailed test.

There’s one last thing we need to do before we run the correlation. lsr is an older package and doesn’t like working with tibbles, so we need to convert our tibble to a data frame, an older type of object.

Run the below code. It will overwrite the tibble mh with a data frame.

mh <- as.data.frame(mh)

OK, finally let’s run the correlations.

Run the below code to calculate then view the correlation results

corr_results <- correlate(x = mh, # our data
                          test = TRUE, # compute p-values
                          corr.method = "pearson", # run a pearson test 
                          p.adjust.method = "holm") # use the holm correction
corr_results

corr_results is a list with 5 components and you can view each of these separately just like you did with chi-square, for example, trying running the below code to view just the correlation values:

corr_results$correlation

Is the correlation between Abil and Home positive or negative?
This means that as Abil scores increase, Home scores will
What is the strongest positive correlation?
What is the strongest negative correlation?
Is the correlation between Abil and IQ significant?
Is the correlation between Abil and Home significant?
How would you describe the strength of the correlation between Home and TV?
Think back to the lecture. Why are we not calculating an effect size?

Negative correlations are denoted by a negative r value.
1. Positive correlations = as one score goes up so does the other, negative correlations = as one score goes up the other goes down.
  3 & 4. Remember that correlations take values from -1 - 1 and that the nearer to one in either direction the stronger the correlation (i.e., an r value of 0 would demonstrate a lack of any relationship.
  5 & 6. The traditional cut-off for significance is .05. Anything below .05 is considered significant. Be careful to pay attention to decimal places.
2. Cohen’s guidelines recommend weak = 1. - .3, medium = .3 - .5, strong > .5.
3. Because r is an effect size.

9.11 Correlation Finished!

Well done! You can now add correlations to the list of things you can do in R. If you have any questions, please post them on Teams.