Analysis - 10 Simple regression

Intended Learning Outcomes

By the end of this chapter you should be able to:

Compute a Simple linear regression.
Read and interpret the output.
Report the results.

Individual Walkthrough

10.1 Activity 1: Setup & download the data

This week, we will be working with a new dataset. Follow the steps below to set up your project:

Create a new project and name it something meaningful (e.g., “2B_chapter10”, or “10_regression”). See Section 1.2 if you need some guidance.
Create a new .Rmd file and save it to your project folder. See Section 1.3 if you need help.
Delete everything after the setup code chunk (e.g., line 12 and below)
Download the new dataset here: data_ch10.zip. The zip folder includes the data file and the codebook.
Extract the data file from the zip folder and place it in your project folder. If you need help, see Section 1.4.

Citation

Alter, U., Dang, C., Kunicki, Z. J., & Counsell, A. (2024). The VSSL scale: A brief instructor tool for assessing students’ perceived value of software to learning statistics. Teaching Statistics, 46(3), 152-163. https://doi.org/10.1111/test.12374

Abstract

The biggest difference in statistical training from previous decades is the increased use of software. However, little research examines how software impacts learning statistics. Assessing the value of software to statistical learning demands appropriate, valid, and reliable measures. The present study expands the arsenal of tools by reporting on the psychometric properties of the Value of Software to Statistical Learning (VSSL) scale in an undergraduate student sample. We propose a brief measure with strong psychometric support to assess students’ perceived value of software in an educational setting. We provide data from a course using SPSS, given its wide use and popularity in the social sciences. However, the VSSL is adaptable to any statistical software, and we provide instructions for customizing it to suit alternative packages. Recommendations for administering, scoring, and interpreting the VSSL are provided to aid statistics instructors and education researchers understand how software influences students’ statistical learning.

The data is available on OSF: https://osf.io/bk7vw/

Changes made to the dataset

We converted the Excel file into a CSV file.
We aggregated the main scales (columns 50-57) by reverse-scoring all reverse-coded items (as specified in the codebook), and then computed an average score for each scale.
However, responses to the individual questionnaire items (columns 1-49) remain as raw data and have not been reverse-coded. If you’d like to practice your data-wrangling skills, feel free to do so!
We tidied the columns RaceEthE, GradesE, and MajorE, but we’ve left Gender and Student Status for you to clean.

10.2 Activity 2: Load in the library, read in the data, and familiarise yourself with the data

Today, we will be using the packages tidyverse, performance, and pwr, along with the dataset Alter_2024_data.csv.

???

data_alter <- ???

Solution

library(tidyverse)
library(performance)
library(pwr)
data_alter <- read_csv("Alter_2024_data.csv")

As always, take a moment to familiarise yourself with the data before starting your analysis.

Your Turn

Quick check: Have you familiarised yourself with the data? Answer the following questions:

How many observations?
How many variables?
How many columns are col_character or chr data type?
How many columns are col_double or dbl data type?
From the dropdown menu, select the variable type and the data type for each of the following variables.

Column	Variable type	Data type
`AgeE`
`GenderE`
`GPAE`
`StuStaE`
`MA1E`
`Mean_MA`

Potential Research Question & Hypthesis

Let’s assume that individuals who are confident in quantitative reasoning are more likely to feel comfortable using statistical software.

Potential research question: “Does quantitative self-efficacy predict attitudes toward SPSS, as measured by the VSSL Scale?”
Null Hypothesis (H₀): “Quantitative self-efficacy does not significantly predict attitudes toward SPSS (VSSL scores).”
Alternative Hypothesis (H₁): “Quantitative self-efficacy significantly predicts attitudes toward SPSS (VSSL scores), such that higher quantitative self-efficacy is associated with more positive attitudes toward SPSS.”

Thus, in this hypothetical scenario, the regression model would be:

Independent Variable (IV)/ Predictor: Quantitative self-efficacy
Dependent Variable (DV)/ Outcome: Attitudes toward SPSS (measured by the VSSL Scale)

10.3 Activity 3: Preparing the dataframe

The dataframe is already tidy for most variables, except for Gender and Student Status. This means we’re all set for this week.

However, working with a smaller subset of the data will be beneficial. Select the relevant variables of interest and store the results in a new data object called data_alter_reduced.

Hints

Of course, you have identified the variables Quantitative self-efficacy (Mean_QSE) and the VSSL Scale: Value of Software to Learning Statistics (Mean_SPSS).

But did you remember that we also need the participant ID (StudentIDE)?

Solution

data_alter_reduced <- data_alter %>% 
  select(StudentIDE, Mean_QSE, Mean_SPSS)

Extra Activity - recoding Gender and Student Status

Challenge yourself: Easy (recoding Gender)

Recode column GenderE so the values read “Female”, “Male”, and “Non-Binary” instead of 1, 2, and 3. Use the codebook to guide you. We have done this quite a few times now, but if you need a hint, recoding values was covered in Section 3.3.

Might as well turn Gender into a factor while we are at it.

Solution for Gender

data_alter <- data_alter %>% 
  mutate(Gender = case_match(GenderE,
                             1 ~ "Female",
                             2 ~ "Male",
                             3 ~ "Non-Binary"),
         Gender = factor(Gender))

Challenge yourself: A bit harder (recoding Student Status)

However, recoding Student Status is a bit more tricky. There appear to be numbers and different ways of spelling.

To capture all of them, we will need to know the distinct categories fist. We have used the distinct() function briefly in previous chapters. Time to use it once more.

Solution to identify distinct categories

data_alter %>% 
  distinct(StuStaE)

StuStaE
4
3
Junior
Senior
senior
Sophomore
2
NA
Freshman
post-bac

Now that we have identified the distinct values of column StuStaE, we can use a conditional statement to recode those distinct categories to obtain the categorical labels in the codebook.

Do you remember how conditional recoding is done? If not, get some inspiration from Section 3.4.

Solution for Student Status: Option 1

You could only recode the values that are “incorrect”, i.e., converting the numbers into labels and grouping together everyone who falls into the category “Senior or Higher”.

As Student_Status is an ordinal variable, we should turn it into a factor to order the categories accordingly.

data_alter <- data_alter %>% 
  mutate(Student_Status = case_when(
    StuStaE == "2" ~ "Sophomore",
    StuStaE == "3" ~ "Junior",
    StuStaE %in% c("4", "Senior", "senior", "post-bac") ~ "Senior or Higher",
    .default = StuStaE
  )) %>% 
  mutate(Student_Status = factor(Student_Status,
                                 levels = c("Freshman", "Sophomore", "Junior", "Senior or Higher")))

Solution for Student Status: Option 2

You could also recode all the labels whether they are already correctly spelled or not, if that seems more intuitive to you. You could leave out the .default argument this way. Yet, you’d still have to group everyone who falls into the category “Senior or Higher”. Perhaps, a bit more writing overall, but the same outcome.

I will label the newly recoded columns differently, so you’ll see that both ways get you the same outcome.

data_alter <- data_alter %>% 
  mutate(StuSta_tidy = case_when(
           StuStaE %in% c("1", "Freshman") ~ "Freshman",
           StuStaE %in% c("2", "Sophomore") ~ "Sophomore",
           StuStaE %in% c("3", "Junior") ~ "Junior",
           StuStaE %in% c("4", "Senior", "senior", "post-bac") ~ "Senior or Higher"),
         StuSta_tidy = factor(StuSta_tidy,
                              levels = c("Freshman", "Sophomore", "Junior", "Senior or Higher")))

10.4 Activity 4: Compute descriptives

Now, we can calculate the means and standard deviations for our predictor and outcome variables.

Solution

descriptives <- data_alter_reduced %>%
  summarise(mean_QSE = mean(Mean_QSE, na.rm = TRUE),
            sd_QSE = sd(Mean_QSE, na.rm = TRUE),
            mean_SPSS = mean(Mean_SPSS, na.rm = TRUE),
            sd_SPSS = sd(Mean_SPSS, na.rm = TRUE))

10.5 Activity 5: Create an appropriate plot

Now, let’s recreate the following appropriate plot using data_alter_reduced:

`geom_smooth()` using formula = 'y ~ x'

Solution

ggplot(data_alter_reduced, aes(x = Mean_QSE, y = Mean_SPSS)) +
  geom_point() +
  geom_smooth(method = "lm")

10.6 Activity 6: Check assumptions

Assumption 1: Levels of measurement

The outcome variable must be interval/ratio (continuous)
The predictor variable must be either interval/ratio (continuous) or categorical (with two levels)

In our case, both the predictor variable (Quantitative self-efficacy) and the outcome variable (Attitudes toward SPSS as measured by the VSSL Scale) are continuous. The assumption holds.

Assumption 2: Independence of cases/observations

We must assume that each observation is independent of the others, meaning that each score comes from a different participant. Like assumption 1, this assumption is determined by the study design and it holds.

Assumption 3: Non-zero variance

Simply put, we need to have some spread in the data. The variance must be greater than zero.

We can check this by looking at the scatterplot: both the x-axis (predictor) and y-axis (outcome) should show variability. Similarly, we computed standard deviations for both variables in Activity 4, and both of them had values different to 0. So this assumption holds as well.

Assumption 4: Linearity

Since we will need the linear model for the next few assumptions and later for inferential statistics, we should first save it to the Global Environment as an object called mod.

The linear model function lm() requires the following arguments:

The outcome variable, followed by a tilde (~), followed by the the predictor variable
The dataset

mod <- lm(Mean_SPSS~Mean_QSE, data = data_alter_reduced)

Now we can continue with the assumption check of linearity. The relationship between the two variables must be linear.

We can assess linearity visually, just as we did in the correlation chapter, by using a Residuals vs Fitted plot. We can use the plot() function on our linear model (mod) with the which argument set to 1.

plot(mod, which = 1)

Verdict: The red line is roughly horizontal and flat, indicating that the relationship between the predictor and outcome is linear. This assumption holds.

Assumption 5: Normality

Residuals for both variables should follow a bivariate normal distribution (i.e., both together normally distributed).

We can visually check this assumption using a Q-Q plot. To generate it, we use the plot() function on our linear model mod, setting the which argument to 2. Again, this is the same approach we used in the correlation chapter to check for normality.

plot(mod, which = 2)

Verdict: Doesn’t this look beautiful? ✨ The points closely follow the diagonal line, indicating that the assumption of normality holds.

Assumption 6: Homoscedasticity

Homoscedasticity assumes that the spread of residuals is consistent across all levels of the predictor variable, meaning that the data is evenly distributed around the line of best fit, with no visible pattern. We can assess this assumption visually using a Scale-Location plot.

To generate it, we use the plot() function on our linear model mod, setting the which argument to 3:

plot(mod, which = 3)

Verdict: The variance of the residual is constant along the values of predictor variable as indicated by this roughly flat and horizontal red line. Hence, we consider this assumption met.

Checking Assumptions Differently: Using `check_model()` from the `performance` package

Assumptions 4-6 can also be tested using the check_model() function from the performance package. This function provides the same diagnostic checks as plot(), but with a cleaner presentation and useful reminders of what to look for.

check_model(mod)

One key difference is that you get a posterior predictive check, which compares observed values to the model’s predictions. Additionally, the Q-Q plot for normality of residuals is displayed different. Instead of a diagonal reference line, residuals are plotted as deviations from a horizontal line. This visualisation can sometimes exaggerate small deviations, so focus on identifying clear overall patterns rather than minor fluctuations.

Tip

The diagnostic plots in the performance package are generated using ggplot2, so you can save them using ggsave() if needed.

But again; all assumptions are met. Let’s move on to inferential statistics!

10.7 Activity 7: Compute the regression, confidence intervals, and effect size

The lm() function from Base R is the main function to estimate a Linear Model (hence the function name lm). And if you’re experiencing a hint of déjà vu, you’re onto something! We have actually already used this function to store our linear model for assumption checks (remember the object called mod???).

Just to recap, the lm() function uses formula syntax with the following arguments:

lm(OutcomeVariable ~ PredictorVariable, data)

For our example, the variables are:

IV (Predictor): Quantitative self-efficacy, and
DV (Outcome): Attitudes toward SPSS (measured by the VSSL Scale)

Thus, our regression model is:

mod <- lm(Mean_SPSS~Mean_QSE, data = data_alter_reduced)

As you can see in the Global Environment, mod stores a list of 12 elements, containing quite a bit of information about the regression model.

To view the results of our simple linear regression, use the summary() function on the mod object:

summary(mod)


Call:
lm(formula = Mean_SPSS ~ Mean_QSE, data = data_alter_reduced)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94132 -0.48539 -0.01985  0.51461  1.60274 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.60403    0.23404  11.126  < 2e-16 ***
Mean_QSE     0.26441    0.06722   3.933  0.00012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7023 on 179 degrees of freedom
Multiple R-squared:  0.07956,   Adjusted R-squared:  0.07442 
F-statistic: 15.47 on 1 and 179 DF,  p-value: 0.0001196

Does it matter if the slope is positive or negative?

When working with a continuous predictor, the sign of the slope is important to consider:

A positive slope indicates that as the predictor increases, the outcome variable also tends to increase.
A negative slope means that as the predictor increases, the outcome variable tends to decrease.

Understanding the direction of the relationship is crucial for correctly interpreting the regression coefficient.

Sooo. We already have several key values, but we still need confidence intervals and effect sizes to fully report our results.

R has a built-in function called confint() to calculate confidence intervals using our linear model object mod.

confint(mod)

                2.5 %    97.5 %
(Intercept) 2.1421954 3.0658579
Mean_QSE    0.1317614 0.3970575

The effect size for a simple linear regression is Cohen’s \(f^2\). We can calculate it manually. As you may recall from the lectures, the formula for Cohen’s \(f^2\) is:

\[ f^2 = \frac{R^2_{Adjusted}}{1-R^2_{Adjusted}} \]

In our case, we can extract \(R^2_{Adjusted}\) from the summary(mod) output.

r_sq_adj <- summary(mod)$adj.r.squared

And then, we just use r_sq_adj in the formula above:

f_2 <- r_sq_adj/(1-r_sq_adj)

f_2

[1] 0.08039889

Now, we have all the necessary values to write up the results!

10.8 Activity 8: Sensitivity power analysis

As usual, we can calculate the smallest effect size that our study was able to detect, given our design and sample size.

To do this, we use the pwr.f2.test() function from the pwr package. This function requires the following arguments:

u = Numerator degrees of freedom. This the number of coefficients you have in your model (minus the intercept)
v = Denominator degrees of freedom. This is calculated as \(v=n-u-1\), where \(n\) is the number of participants.
f2 = The effect size. Here we are solving the effect size, so this parameter is left out
sig.level = The significance level of your study
power = The power level of your study

pwr.f2.test(u = 1, v = 179, sig.level = 0.05, power = 0.8)


     Multiple regression power calculation 

              u = 1
              v = 179
             f2 = 0.04381313
      sig.level = 0.05
          power = 0.8

With the design and sample size we have, we could have detected an effect as small as \(f^2 = 0.044\). Since the observed effect size from our inferential statistics was larger (\(f^2 = 0.080\)), this means our test was sufficiently powered to detect the observed effect.

10.9 Activity 9: The write-up

A simple linear regression was performed with Attitudes toward SPSS \((M = 3.50, SD = 0.73)\) as the outcome variable and Quantitative self-efficacy \((M = 3.39, SD = 0.78)\) as the predictor variable.

The results of the regression indicated that the model significantly predicted Attitudes toward SPSS \((F(1, 179) = 15.47 p < .001, R^2_{Adjusted} = .074, f^2 = 0.080)\), accounting for 7.4% of the variance. The effect of Quantitative self-efficacy was statistically significant, positive, and of small magnitude \((b = 0.26, 95\% CI = [.13, .40], p <.001)\).

Pair-coding

Task 1: Open the R project for the lab

Task 2: Create a new `.Rmd` file

… and name it something useful. If you need help, have a look at Section 1.3.

Task 3: Load in the library and read in the data

The data should already be in your project folder. If you want a fresh copy, you can download the data again here: data_pair_coding.

We are using the packages tidyverse and performance today. If you have already worked through this chapter, you will have all the packages installed. If you have yet to complete Chapter 10, you will need to install the package performance (see Section 1.5.1 for guidance if needed).

We also need to read in dog_data_clean_wide.csv. Again, I’ve named my data object dog_data_wide to shorten the name but feel free to use whatever object name sounds intuitive to you.

Task 4: Tidy data & Selecting variables of interest

Let’s try to answer the question whether pre-intervention social connectedness (SCS_pre) predicts post-intervention loneliness (Loneliness_post)?

Not much tidying to do today.

Step 1: Select the variables of interest. Store them in a new data object called dog_reg.

Step 2: Check for missing values and remove participants with missing in either variable.

Hints

Step 1: Variables of interest are pre-intervention social connectedness (SCS_pre), post-intervention loneliness (Loneliness_post), and of course the participant ID (RID).

Step 2: The function drop_na() is your friend.

Solution

dog_reg <- dog_data_wide %>%
  # Step 1
  select(RID, Loneliness_post, SCS_pre) %>% 
  # Step 2
  drop_na()

Task 5: Visualise the relationship

I’ve used the following code to create a scatterplot to explore the relationship between social connectedness (pre-test) and loneliness (post-test). Can you check I did it correctly?

ggplot(dog_reg, aes(x = Loneliness_post, y = SCS_pre)) +
  geom_point() +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Did I do it right?

Solution

The scatterplot is incorrect. Since we are predicting loneliness from social connectedness, the axes should be reversed.

In a correlation, the order of x and y does not matter, but in a regression, the predictor variable must be on the x-axis, and the outcome variable must be on the y-axis.

Here is the corrected scatterplot:

ggplot(dog_reg, aes(x = SCS_pre, y = Loneliness_post)) +
  geom_point() +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Task 6: Model creating & Assumption checks

Let’s store our linear model as mod and then use the check_model() function from the performance package to check assumptions.

Remember, the structure of the linear model is:

lm(Outcome~Predictor, data)

Once the model is stored as mod, we can check its assumptions using check_model(mod).

Solution

## creating the linear model
mod <- lm(Loneliness_post~SCS_pre, data = dog_reg)

## checking model assumptions
check_model(mod)

Assumptions 1-3 hold due to the study design, but let’s take a closer look at the following output:

Linearity: The relationship appears to be .
Normality: The residuals are .
Homoscedasticity: There is .

I don’t understand - can you explain more?

Linearity: The reference line is mostly flat and horizontal. There’s a slight curve at the end, but when checking the scatterplot, it’s not a major concern.
Normality: Looking pretty good. The dots fall onto the line, indicating the residuals are approximately normally distributed. A Q-Q plot would show the same outcome, but with the reference line diagonal instead of horizontal.
Homoscedasticity (or Homogeneity of Variance): Yeah, the reference line could be a bit more horizontal, but there’s no clear funnel shape. The points appear to have a fairly random pattern, so we can consider the assumption met.

Task 7: Computing a Simple Regression & interpret the output

To compute the simple regression, we need to use the summary() on our linear model mod.

summary(mod)

How do you interpret the output?

The estimate of the y-intercept for the model, rounded to two decimal places, is
The relationship is
The model indicated that
How much the variance is explained by the model (rounded to two decimal places)? %.

Test your knowledge

Question 1

What is the main purpose of a simple regression analysis?

To determine whether there is an association between two variables, without assuming causality.To predict or explain changes in an outcome variable based on changes in a predictor variable.To compare means between two groups and determine if they are significantly different.To assess whether multiple independent variables jointly predict an outcome variable.

Explain this answer

Simple regression is used to model the relationship between one predictor and one outcome variable and to determine how well the predictor explains variations in the outcome. While correlation measures association and t-tests compare means, regression specifically focuses on prediction or explanation. Multiple regression involves more than one predictor.

Question 2

A sports scientist wants to examine whether an athlete’s reaction time (in milliseconds) can predict their sprint speed (in seconds) in a 100m race.

Which of the following correctly specifies the simple regression model in R?

lm(Sprint_Speed ~ Reaction_Time + Training_Hours, data = data)lm(Sprint_Speed ~ Reaction_Time, data = data)lm(Reaction_Time ~ Sprint_Speed + Age, data = data)lm(Reaction_Time ~ Sprint_Speed, data = data)

Explain this answer

In a simple regression model, the outcome variable (dependent variable, DV) should be placed first, followed by the tilde (~), and then the predictor variable (independent variable, IV). That’s why the correct solution is Sprint_Speed ~ Reaction_Time.

In one of the options, the variables were reversed, treating sprint speed as the predictor and reaction time as the outcome variable.

The other 2 options incorrectly include multiple predictors, which would make this a multiple regression model.

Question 3

In a simple regression model where reaction time (milliseconds) predicts sprint speed (seconds), the regression output shows:

\[ Sprint Speed = 10.2 + 0.08 × Reaction Time \]

How should the slope be interpreted?

A 1-millisecond increase in reaction time is associated with a 10.2-second increase in sprint speed.A 1-millisecond increase in reaction time is associated with a 10.2-second decrease in sprint speed.A 1-millisecond increase in reaction time is associated with a 0.08-second decrease in sprint speed.A 1-millisecond increase in reaction time is associated with a 0.08-second increase in sprint speed.

Explain this answer

The slope (0.08), not the intercept (10.2), tells us the expected change in the outcome variable (sprint speed) for a one-unit increase in the predictor (reaction time). Since the coefficient is positive, it indicates that slower reaction times (higher values) are associated with slower sprint speeds (higher times).

Question 4

A simple regression analysis finds an \(R^2_{Adjusted} = 0.62\) when predicting sprint speed from reaction time. What does this mean?

62% of the variance in reaction time is explained by sprint speed.The predictor fully explains the outcome.62% of the variance in sprint speed is explained by reaction time.The model does not fit well because the adjusted \(R^2\) is too high.

Explain this answer

Adjusted \(R^2\) represents the proportion of variance in the dependent variable (sprint speed) that is explained by the independent variable (reaction time). A value of 0.62 means that 62% of the differences in sprint speed can be accounted for by reaction time, indicating a strong predictive relationship.

The other options are incorrect:

One of the options reverses the outcome variable and the predictor. Since the DV is sprint speed, we cannot say that 62% of reaction time is explained by sprint speed.
A high \(R^2\) adjusted does not mean the predictor fully explains the outcome. Some variability (\(1 - R^2_{Adjusted} = 38\%\)) remains unexplained, meaning other factors beyond reaction time contribute to sprint speed. A perfect model would have (\(R^2 = 1.00\)), which is almost never the case in real data.
A high \(R^2\) adjusted does not mean the model is bad. In fact, 0.62 is quite strong, meaning the model fits the data well. A high can sometimes be concerning if it’s unexpectedly large (e.g., above 0.90 in social sciences), but here it may simply indicate a strong relationship.

Intended Learning Outcomes

Individual Walkthrough

10.1 Activity 1: Setup & download the data

10.2 Activity 2: Load in the library, read in the data, and familiarise yourself with the data

Potential Research Question & Hypthesis

10.3 Activity 3: Preparing the dataframe

10.4 Activity 4: Compute descriptives

10.5 Activity 5: Create an appropriate plot

10.6 Activity 6: Check assumptions

Assumption 1: Levels of measurement

Assumption 2: Independence of cases/observations

Assumption 3: Non-zero variance

Assumption 4: Linearity

Assumption 5: Normality

Assumption 6: Homoscedasticity

Checking Assumptions Differently: Using check_model() from the performance package

10.7 Activity 7: Compute the regression, confidence intervals, and effect size

10.8 Activity 8: Sensitivity power analysis

10.9 Activity 9: The write-up

Pair-coding

Task 1: Open the R project for the lab

Task 2: Create a new .Rmd file

Task 3: Load in the library and read in the data

Task 4: Tidy data & Selecting variables of interest

Task 5: Visualise the relationship

Task 6: Model creating & Assumption checks

Task 7: Computing a Simple Regression & interpret the output

Test your knowledge

Question 1

Question 2

Question 3

Question 4

Checking Assumptions Differently: Using `check_model()` from the `performance` package

Task 2: Create a new `.Rmd` file