Lab 11 Introduction to GLM: One-factor ANOVA

11.1 Overview

A key way that we attempt to learn from data is to build a statistical model that captures relationships among variables. You are actually already familiar with this approach, even if it hasn't been phrased as such - this is what t-tests, correlations, and so on, do. In this chapter we will formalise this approach and introduce you to the General Linear Model (GLM), which you will read about in the Miller and Haden (2013) textbook (Chapters 1-3) as part of the PreClass.

The GLM is a very common model in statistics in Psychology, and it encapsulates a range of analytical techniques that you are already familiar with and will become even more familiar with throughout this book, as we will spend several of the coming chapters looking at the GLM and reading about it. The GLM covers all the t-tests and correlations you have looked at, and the ANOVA and regression we are going to come on to. Basically, the General Linear Model (GLM) is the foundation of a lot of the statistical tests we use. Over the next few chapters, and building towards future years of study, we will introduce the GLM by working with the model "by hand" on a simulated dataset, as this is one of the best ways to learn about linear models.

You will also notice a slight change in the assignments for the next few chapters in that you are required to do a little more computation than before. Keep in mind though that all the skills you need have either already been covered or will be shown to you first. The previous chapters have been aimed at developing your general practical data skills, and now we want to develop your understanding of the analysis and data you are working with.

As such, the goals of this chapter are:

  • to recap and practice entering data into a tibble (tidyverse data frame - as introduced in Chapter 5);
  • to learn how to estimate model parameters from a dataset;
  • to learn how to derive/generate a decomposition matrix that expresses each observation/participant as a linear sum of model components and error.

These terms will become more familiar to you over the following chapters and from reading Miller and Haden, but remember to make notes for yourself to help you solidify your learning and, as always, ask as many questions as you like!

11.2 PreClass Activity

The PreClass Activity for this chapter is reading. It is quite a bit of reading but don't worry if you don't understand it all the first time round. The best way to deal with this is to read through the information as prep, get the gist, then use it to support the activities, and then re-read to consolidate knowledge for the assessment. And ask questions. This will seem difficult at first but by working with the ideas and examples, you can understand the concepts introduced here.

11.2.1 Read

Chapters

  • As preparation for the inclass activities, please read Chapters 1 to 3 of Miller and Haden (2013). The inclass activities will be working up to and including the concept of Sums of Squares which is around Section 3.4 of Miller and Haden (2013), so at least read up to and including that section.

When reading Miller and Haden it will help to remember that:

  • Factor is another name for Variable
  • Level is another name for Condition
  • E.g. a between-subjects experiment with one independent variable (sleep quality) and two conditions (Good vs Poor) can be said to have one factor with two levels.

The key terms we want you to start becoming familiar with from these chapters, even if you don't yet fully understand them, are:

  • ANOVA - short for Analysis of Variance. A statistical test that compares the variance between conditions to the variance within conditions, for designs with two or more conditions and/or two or more factors.
  • estimation equations - formulas (or equations) that we use to determine our best guess (estimates) at parameters (values) of a population of interest from our sample parameters (values)
  • decomposition matrices - a table (matrix) that breaks down information into individual components. In this instance, the decomposition matrix breaks down how each individual observation/participant is fitted into the GLM based on the estimation equations.
  • sums of squares - an estimate of the total spread of the data around a parameter (such as the mean). We have seen sums of squares before in lectures, as the top part of the variance equation for example:

\[\sum(x - \bar{x})^2\]
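As a purely illustrative example (the numbers here are made up just to show the arithmetic): for the three scores 2, 4 and 6 the mean is 4, so the sum of squares around the mean is:

\[(2-4)^2 + (4-4)^2 + (6-4)^2 = 4 + 0 + 4 = 8\]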

The activities we will look at next will help you understand these terms better. In short, over the next two chapters, we will show you how to take data from a simulated experiment and apply the estimation equations to create a decomposition matrix summarising this data. This decomposition matrix will show how each participant and condition is fitted to the General Linear Model. Then, from that matrix, we will calculate the relevant sums of squares for each condition, which will in turn be used in our ANOVA calculation to determine if there is a significant difference between conditions. It is going to be a lot more fun than it sounds!

If you want to get ahead, you should try the next activities. The first part reiterates and expands on the Miller and Haden information.

Job Done - Activity Complete!

11.3 InClass Activity

We are going to start with a step-by-step example of building a decomposition matrix for an ANOVA (Analysis of Variance) and then ask you to perform the steps yourself on a different dataset. If you feel comfortable with the examples in Chapter 3 of Miller and Haden, feel free to skim the worked example and move on to the exercises below. You can find further examples and step-by-step walkthroughs at the end of Miller and Haden (2013), Chapter 3.

One-factor ANOVA: Worked example

An ANOVA is a method for analysing data where you have two or more conditions (levels) of an independent variable (factor), and/or more than one independent variable (factor). Thinking back, a t-test is where you normally have two conditions (levels) of one independent variable (factor), right? Well an ANOVA is just an extension of that. For example, in the classic Professor Priming experiments you may have read about, instead of just comparing IQ scores for professors vs. hooligans (t-test), you can compare IQ scores for professors vs. hooligans vs. politicians (ANOVA). In fact, an experiment that has one factor with two levels can be analysed with a t-test or an ANOVA, as they are both based on the General Linear Model (GLM).

So let's assume that you have data from a one-factor design with three levels (i.e. one independent variable with three conditions (Grp1, Grp2, Grp3)). To make this example concrete, let's pretend you are studying how consuming food before an exam affects student performance. You randomly assign 12 participants to three separate conditions, four participants per group (we chose a small number of participants to simplify the computations; obviously if you were going to do this study in real life, you'd need far more than 12 participants to make it worthwhile). The three conditions are as follows:

  1. no food, glass of water only (Control)
  2. all-you-can-eat buffet (Buffet);
  3. side salad (Salad).

Your dependent variable is operationalised as the number of questions answered correctly on a difficult exam (100 points possible). The exam is administered right after consuming the meal (or drinking water, for the control group). And just in case you aren't sure, this would be a one-factor design because there is a single factor, which we might call "pre-exam consumption". And this factor has three different levels: water, buffet, and salad. It is a between-subjects design because there are different people in each group (4 per group). In textbooks you might see this referred to as a one-way between-subjects ANOVA. Yes, stats does have numerous names for the same thing - that is why making notes is really important. Depending on their training, someone might use a different word from you but mean the same thing. You can bridge that gap by knowing the alternative names.

And finally, for our analysis, we want to test whether there is any difference in exam performance across the levels of the factor. We won't complete the analysis today but we will look at setting up our model which we would then take on further to see if there is a difference between groups.

Here's what the exam performance looks like for each of the three groups, with each value representing an individual participant's score:

  • Control: 37, 80, 64, 51
  • Buffet: 33, 47, 55, 41
  • Salad: 59, 23, 50, 60

Quickfire Questions

To make sure you understand the design of this study try to answer the following questions. All the answers are in the information above:

  1. Factor is another name for a ______ of the experiment
  2. Level is another name for a ______ of the experiment
  3. In this experiment we have ______
  4. Because each group contains different participants then this is a ______
  5. The fourth participant in the Control condition scored ______ on the exam

Estimating model components

Great so we understand our experiment! Now, and based on your reading, the General Linear Model (GLM) that we will fit to our data is:

\[Y_{ij} = \mu + A_i + S(A)_{ij}\]

Where:

  • \(Y_{ij}\) is the observed value for observation \(j\) of group \(i\) - i.e. a given participant's score (\(j\)) in a given group (\(i\));
  • \(\mu\) (pronounced "mu") is the population grand mean (estimated by the sample grand mean) - grand just means overall;
  • \(A_i\) is the deviation of the population mean of group \(i\) from the population grand mean - i.e. how different a group's mean is from the overall population mean;
  • Now before the last part of the formula, you need to know that the sum of \(\mu + A_i\) is known as the fitted value or the typical value or the predicted value of a participant in a condition and is written as: \(\hat{Y}_{ij}\). So:

\[\hat{Y}_{ij} = \mu + A_i\]

  • The "party hat" that \({Y}_{ij}\) is wearing in this part, i.e. \(\hat{Y}_{ij}\), is there to remind us that it is not the actual score of your participant (in a given condition), but an estimate of that value. So, when we are working in predicted values (values we haven't actually collected, just predicted) then we stick the "party hat" on the symbol. If we are working in real values (that we have collected) then no party hat!
  • Finally, back to the GLM equation, \(S(A)_{ij}\) is the error or residual, defined as the observed value (\(Y_{ij}\)) minus the model prediction (\(\hat{Y}_{ij}\)), or the actual value minus the predicted value. This can be thought of as how different an individual observation/participant is from their condition mean. So:

\[S(A)_{ij} = Y_{ij} - \hat{Y}_{ij}\]

Now, an analysis such as this can be hard to understand just from the words, but it makes much more sense when you start to run the numbers - which we will now do. We begin by applying the estimation equations. Our estimate of the population grand mean, \(\mu\), will be based on the grand mean of the sample. We will call this \(\hat{\mu}\) (notice, once again, the "party hat").

The estimation equations for our model (seen in Table 3.3 on page 18 of Miller and Haden (2013)) are:

  • \(\hat{\mu} = Y_{..}\)

  • \(\hat{A}_i = Y_{i.} - \hat{\mu}\)

  • \(\widehat{S(A)}_{ij} = Y_{ij} - (\hat{\mu} + \hat{A}_i) = Y_{ij} - \hat{\mu} - \hat{A}_i\)

where

  • \(Y_{..}\) is the mean of all 12 observations in the sample a.k.a. the overall mean or the baseline

  • \(Y_{i.}\) is the mean of the 4 observations in group \(i\).

  • \(\hat{A}_i\) is the respective group mean minus the sample grand mean - i.e. the group effect, which feeds the between-groups variability

  • \(\widehat{S(A)}_{ij}\) is the individual error term for a given participant, or how much they deviate from the group contribution.
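To make these estimates concrete before we build the full matrix, here is a minimal sketch (separate from the exercise you will do below) that applies the first two estimation equations to the example scores; the column names i and Yij simply mirror the notation above:

library(tidyverse)

# The 12 observed scores, with i indexing the group
# (1 = Control, 2 = Buffet, 3 = Salad)
scores <- tibble(i   = rep(1:3, each = 4),
                 Yij = c(37, 80, 64, 51,   # Control
                         33, 47, 55, 41,   # Buffet
                         59, 23, 50, 60))  # Salad

# mu-hat: the grand mean of all 12 observations
mu_hat <- mean(scores$Yij)   # 50

# A-hat_i: each group mean minus the grand mean
scores %>%
  group_by(i) %>%
  summarise(group_mean = mean(Yij)) %>%
  mutate(Ai_hat = group_mean - mu_hat)   # 8, -6, -2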

Applying these estimation equations to the data above yields the following decomposition matrix:

Table 11.1: Decomposition Matrix of our data
ID i j Yij mu Ai err
1 1 1 37 50 8 -21
2 1 2 80 50 8 22
3 1 3 64 50 8 6
4 1 4 51 50 8 -7
5 2 1 33 50 -6 -11
6 2 2 47 50 -6 3
7 2 3 55 50 -6 11
8 2 4 41 50 -6 -3
9 3 1 59 50 -2 11
10 3 2 23 50 -2 -25
11 3 3 50 50 -2 2
12 3 4 60 50 -2 12


In the above table:

  • the column ID can be used to locate individual rows,
  • the column mu represents the value of \(\hat{\mu}\),
  • the column Ai represents the value of \(\hat{A}_i\),
  • and the column err represents the value of \(\widehat{S(A)}_{ij}\).

Spend a few moments understanding how this table expresses each of the 12 observed values in our example (the \(Y_{ij}\)s) in terms of the linear model: \(Y_{ij} = \mu + A_i + S(A)_{ij}\).

For example, if the Control group is group \(i\) = 1, then for the first participant, \(j\) = 1, you would get:

  • \(Y_{ij} = \mu + A_i + S(A)_{ij}\)
  • \(Y_{ij} = mu + Ai + err\)
  • \(37 = 50 + 8 + -21\)

Meaning that the overall mean of the whole sample is 50. The unique contribution of the control group is \(+8\), so the predicted value for a member of the control group would be 58. However, the first participant deviates from that prediction by \(-21\), as their actual score is 37. We will start to understand how the differences (the variance) between conditions and within conditions lead to our analysis later, but hopefully you are beginning to understand some of the above.

Quickfire Questions

To make sure you understand the above equations before going on to calculate your own, answer the following questions about the above table which expresses the GLM decomposition matrix, and then check your answers.

  1. In which column of the table are the 12 observed values - i.e. the 12 original scores from the participants?

The column named Yij


  2. What is the estimated grand mean of this sample? (hint: \(\hat{\mu}\))

The estimated grand mean of the sample is 50


  3. Which rows of the table contain the data and model estimates for the Buffet group if they are group 2?

The rows where i is equal to 2; in other words, rows 5-8


  4. What is the value of \(\hat{A}_1\) and in what rows does it appear?

The value would be 8

This value appears in column Ai, rows 1-4, where i equals 1. This can be thought of as the difference between that given group and the sample mean, applied to all participants within that group. In other words, this would be the value of the typical participant in that group - as opposed to an individual participant in that group. Note that, conceptually, this gives you the effect of your manipulation (pre-exam consumption), and the different values of i allow you to do so for each of the different conditions you look at.


  5. What is the value of \(\widehat{S(A)}_{32}\)? (hint: this can be read as where i = 3 and j = 2)

The value would be -25

This value appears in column err, where i equals 3 and j equals 2. This can be thought of as how much that participant deviates from the grand mean plus their group effect, meaning that we take into consideration that each subject is unique.


  6. What is the model's prediction for a 'typical' participant in the Salad group (group 3)? (hint: \(\hat{Y}_{ij}\) = mu + Ai)

The prediction would be \(\hat{Y}_{ij} = \hat{\mu} + \hat{A}_3\) = 50 + -2 = 48

  • A 'typical' participant is one where the residual is 0.
  • The model prediction is also known as the "fitted value" for this group.


  7. Where in the table are the differences found between this 'typical' participant prediction and the observed values in the Salad group?

These are the called the "residuals" (\(\widehat{S(A)}_{ij}\)) and are found in the err column of the table in rows 9-12. As above they are the difference between that specific individual participant and the typical participant for that group.

11.3.1 Recreate decomposition matrix from the raw data

So we have shown you where all the parts of the table come from and how to calculate them. Now, for this part, your task is to reproduce the decomposition matrix tibble shown above, reproduced here:

Table 11.2: Decomposition Matrix of our data
ID i j Yij mu Ai err
1 1 1 37 50 8 -21
2 1 2 80 50 8 22
3 1 3 64 50 8 6
4 1 4 51 50 8 -7
5 2 1 33 50 -6 -11
6 2 2 47 50 -6 3
7 2 3 55 50 -6 11
8 2 4 41 50 -6 -3
9 3 1 59 50 -2 11
10 3 2 23 50 -2 -25
11 3 3 50 50 -2 2
12 3 4 60 50 -2 12


You will do this by typing the observed values into a tibble, and then writing code to add columns with estimates of the individual components. At the end, your table should look exactly like the one above. You already know how to do all the data-wrangling elements, so today really try to focus on understanding what the values mean. And remember that for each step there is a solution at the end of the chapter.

11.3.2 Step 1: Create the basic tibble

  • Create a tibble named dmx (short for decomposition matrix). It will eventually contain all of the columns in the one above, but for now, just create the columns i, j, and Yij as they appear above.
  • You already know how to create a tibble (don't forget to load the tidyverse package first). In case you need to refresh your memory, see page 2 of this cheatsheet on data input or refer back to the preclass activities of Chapter 5.
  • You should just type in the values for Yij but try to use the rep() function for columns i (the group) and j (the participant).
  • You will need some wrangling functions so don't forget to load in tidyverse
  • Create a tibble as dmx <- tibble(i = NA, j = NA, Yij = NA)
  • When using the rep() function remember that you can use each or times as calls in rep: rep(1:3, each = 4)
  • When typing in numbers the c() function will allow you to put in numbers such as Column = c(37, 80, 64, 51)

11.3.3 Step 2: Estimate the Grand Mean \(\hat{\mu}\)

Great, we have our group numbers \(i\), our participant numbers \(j\), and our participant scores \(Yij\). Now we need to start expanding our dmx. The first thing we need is the grand mean, \(\hat{\mu}\).

  • In a new tibble called dmx2, add a column to the dmx tibble, called mu representing \(\hat{\mu}\). Remember that you can add a column to a table using mutate().
  • \(\hat{\mu}\) is the mean() of all participants.
  • When calculating mu keep in mind that each value of mu should be the grand mean of the sample; the mean of all participants regardless of group.

dmx2 <- dmx %>% mutate(mu = ???)

11.3.4 Step 3: Entering the estimates \(\hat{A}_1\), \(\hat{A}_2\), \(\hat{A}_3\)

Good! Now we need to add on a column showing the typical effect of being a member of a particular group; the Ai column.

  • In a new tibble called dmx3, add a column to the tibble dmx2 called Ai, with the three estimates for \(\hat{A}_i\). Store the resulting tibble in dmx3.
  • \(\hat{A}_i\) is the difference between the grand mean \(\hat{\mu}\) and the mean of individual groups. This means that you will need to group people by the group they belong to.
  • Add the ungroup() function to the end of your pipeline as you won't need this grouping after.
  • To calculate the column \(\hat{A}_i\) would be something like:

dmx3 <- dmx2 %>% group_by(something) %>% mutate(Ai = something - something) %>% ungroup()

11.3.5 Step 4: Calculate Residuals \(\widehat{S(A)}_{ij}\)

Well done, you're almost there! We just need to add on the final column, err, which contains the residuals - in other words, the difference between the score of a typical participant in that group (mu + Ai) and a given individual participant's score (Yij).

  • In a new tibble called dmx4, add a column to the dmx3 tibble called err that contains the residuals - the difference between the observed (Yij) and fitted (typical) values.

For calculating err you would use:

\(\widehat{S(A)}_{ij} = Y_{ij} - \hat{Y}_{ij}\)

where:

\(\hat{Y}_{ij} = \hat{\mu} + \hat{A}_i\)
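Once you have added err, a quick sanity check (not part of the required steps, and shown here only as a sketch assuming your completed table is called dmx4 with the columns above) is to confirm that the model components sum back to every observed score:

# Every row should satisfy Yij = mu + Ai + err
dmx4 %>%
  mutate(check = mu + Ai + err == Yij) %>%
  summarise(all_rows_recovered = all(check))   # should be TRUE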

11.3.6 Step 5: Sums of squares

Great! Last step for today. Once you have your final decomposition matrix, dmx4, you can start calculating the sums of squares. Sums of squares are used in calculations for performing tests on model components, which we will learn more about soon, but you can practice them for now as shown in Section 3.4 of Miller and Haden (2013). The steps are as follows:

  • Step 1 - Square all the individual values in the columns Yij, mu, Ai, and err in dmx4.
  • Step 2 - Now sum up the squared values for each of these columns.
  • Step 3 - Save these in a variable called sstbl.

The simplest way to square a column, for example called x, would be:

mutate(squared_x = x^2)

The ^2 means take x to the power of 2. So typing 3^2 in the console will give you 9 (try it if you're unsure). It also works for columns! Give that a go. If you have done it correctly you should see the below table:

dmx4 %>%
  mutate(Yij2 = Yij^2,
         mu2 = mu^2,
         Ai2 = Ai^2,
         err2 = err^2) %>%
  select(Yij2, mu2, Ai2, err2) %>%
  summarise(ss_Y = sum(Yij2),
            ss_mu = sum(mu2),
            ss_Ai = sum(Ai2),
            ss_err = sum(err2)) %>%
  knitr::kable(align = "c", caption ="Sums of Squares for this analysis")
Table 11.3: Sums of Squares for this analysis
ss_Y ss_mu ss_Ai ss_err
32580 30000 416 2164

Where:

  • ss_Y represents the Sums of Squares of Yij. This is referred to as the Sums of Squares total or \(SS_{total}\).
  • ss_mu represents the Sums of Squares of mu. This is referred to as the Sums of Squares of the grand mean or \(SS_{\mu}\) and sometimes called the intercept.
  • ss_Ai represents the Sums of Squares of Ai. This is referred to as the Sums of Squares of A or \(SS_{A}\) and sometimes called \(SS_{between}\).
  • ss_err represents the Sums of Squares of err. This is referred to as the Sums of Squares error or \(SS_{error}\) and sometimes called \(SS_{within}\).
  • And the following statement is true:

\[SS_{total} = SS_{\mu} + SS_{A} + SS_{error}\]
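You can confirm this identity with the values from Table 11.3:

\[32580 = 30000 + 416 + 2164\]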

  • Mutate on the squared values:

dmx4 %>% mutate(Yij2 = Yij^2, ...)

  • And sum up using:

summarise(ss_Y = sum(Yij2), ss_mu = ...)


Job Done - Activity Complete!

Well done! How did you do? If the values in your dmx4 do not match the table above then you need to go back and look at where it has gone wrong. Alternatively you should look at the solutions at the end of the chapter.

To recap, what we are doing here is setting up the ANOVA to compare the three groups to see if there is a significant difference between each group. Doing it this way allows us to get an understanding of where the numbers come from, and it highlights that the analysis is about comparing variance within groups and variance between groups. We will cover this in the next chapter but in short, if the variance between groups \(SS_{A}\) is larger than the variance within groups \(SS_{error}\) then it is likely that your experimental manipulation has had an effect. Conversely, if the variance within groups is larger than the variance between groups then there is likely to be no effect of your experimental manipulation.

One last thing:

Before ending this section, if you have any questions, please post them on the available forums or speak to a member of the team. Finally, don't forget to add any useful information to your Portfolio before you leave it too long and forget. Remember the more you work with knowledge and skills the easier they become.

11.4 Test Yourself

This is a formative assignment meaning that it is purely for you to test your own knowledge, skill development, and learning, and does not count towards an overall grade. However, you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. You will be instructed by the Course Lead on Moodle as to when you should attempt this assignment. Please check the information and schedule on the Level 2 Moodle page.

Lab 11: Introduction to GLM: One-factor ANOVA

In order to complete this assignment you first have to download the assignment .Rmd file, titled GUID_Level2_Semester2_Lab2.Rmd, which you will edit. This can be downloaded within a zip file from the below link. Once downloaded and unzipped, you should create a new folder that you will use as your working directory; put the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the Assignment .zip file from here or on Moodle.

Single Answer and Multiple Choice Questions

For this assignment you will answer a series of short single-answer and multiple-choice questions, followed by a calculation of a decomposition matrix in the final task. In order to complete this formative assignment you will need to have completed the inclass activity and have read Miller and Haden Chapter 3.

Before starting let's check:

  1. The .Rmd file is saved in a folder and that you have set your working directory to that folder. For assessments we ask that you save it with the format GUID_Level2_Semester2_Lab2.Rmd where GUID is replaced with your GUID. Though this is a formative assessment, it may be good practice to do the same here.

Let's Begin!

11.4.1 Question 1

Consider the following description of a study.

You are investigating whether there is seasonal variation in students' bodyweight. In other words, is there any evidence that bodyweight differs across the four seasons (Winter, Spring, Summer, and Fall - #AllYouGotToDoIsCall)?

Which of the models shown below would be the general linear model corresponding to this study?

  1. \(Y_{ij} = \mu + A_{i} + S(A)_{ij}\)
  2. \(Y_{ijkm} = \mu + A_{i} + B_{j} + C_{k} + D_{m} + S_{ijkm}\)
  3. \(Y_{ij} = \beta_0 + \beta_1 X_1 + e_{ij}\)
  4. \(Y_{ijkm} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + e_{ijkm}\)

Replace the NULL in the Q1 code chunk with the statement number that corresponds to the correct answer (e.g. 1, 2, 3 or 4).

mcq1 <- NULL

For the next few questions consider the decomposition matrix for a one-factor design with three groups, shown below.

\(i\) \(j\) \(Y_{ij}\) \(\hat{\mu}\) \(\hat{A}_{i}\) \(\widehat{S(A)}_{ij}\)
1 1 4 6 -1 -1
1 2 6 6 -1 1
2 1 4 6 0 -2
2 2 8 6 0 2
3 1 2 6 1 -5
3 2 12 6 1 5

11.4.2 Question 2

According to the above decomposition matrix, the population grand mean is estimated to be:

  1. 0
  2. 6
  3. 36
  4. can't answer; not observed

Replace the NULL in the Q2 code chunk with the statement number that corresponds to the correct answer (e.g. 1, 2, 3 or 4).

mcq2 <- NULL 

11.4.3 Question 3

According to the above decomposition matrix, the value of \(\hat{A}_3\) is:

  1. 6
  2. 0
  3. 1
  4. can't answer; not observed

Replace the NULL in the Q3 code chunk with the statement number that corresponds to the correct answer (e.g. 1, 2, 3 or 4).

mcq3 <- NULL

11.4.4 Question 4

According to the above decomposition matrix, the predicted value for a participant in group 1 is what?

Hint: this is the "fitted" or "typical" value for that group (\(\hat{Y}_{ij}\)), as opposed to the actual value (\(Y_{ij}\))

Replace the NULL in the Q4 code chunk with the actual value of the correct answer (e.g a number).

Q4 <- NULL  # replace NULL with your answer (a number)

11.4.5 Question 5

Which observation or observations has/have the largest residual(s)?

  1. \(Y_{21}\)
  2. \(Y_{21}\) and \(Y_{22}\)
  3. \(Y_{31}\)
  4. \(Y_{31}\) and \(Y_{32}\)

Replace the NULL in the Q5 code chunk with the statement number that corresponds to the correct answer (e.g. 1, 2, 3 or 4).

Q5 <- NULL

11.4.6 Question 6

From your reading of Miller and Haden Chapter 3, and from the inclass activity Section 5, based on the above decomposition matrix, what would \(SS_{total}\) be for this model?

Replace the NULL in the Q6 code chunk with the actual value of the correct answer (e.g a number).

Q6 <- NULL  # replace NULL with your answer (a number)

11.4.7 Question 7

From your reading of Miller and Haden Chapter 3, and from the inclass activity Section 5, based on the above decomposition matrix, what would \(SS_{error}\) be for this model?

Replace the NULL in the Q7 code chunk with the actual value of the correct answer (e.g a number).

Q7 <- NULL  # replace NULL with your answer (a number)

11.4.8 Question 8

From reading Miller and Haden Chapter 3, and from the inclass activity Section 5, a study with a one-factor design with GLM \(Y_{ij} = \mu + A_{i} + S(A)_{ij}\) is found to have the following SS:

  • \(SS_{total} = 280\),
  • \(SS_{\mu} = 40\),
  • and \(SS_{error} = 60\).

Given those values, what is the value of \(SS_{A}\)?

hint: \(SS_{total}\) = \(SS_{\mu}\) + \(SS_{A}\) + \(SS_{error}\)

Replace the NULL in the Q8 code chunk with the actual value of the correct answer (e.g a number).

Q8 <- NULL # replace NULL with your answer (a number)

11.4.9 Question 9: Create your own decomposition matrix

Finally, this last task tests your ability to set up a decomposition matrix as shown inclass. The code chunk below creates the basic table structure you will need to complete this task. Run the code and have a look at the table, but DO NOT CHANGE IT!

## run this block, have a look at the structure of dsetup,
## but don't change anything

library("tidyverse")

dsetup <- tibble(i = rep(1:4, each = 3),
                 j = rep(1:3, times = 4),
                 Yij = NA,
                 mu = NA,
                 Ai = NA,
                 err = NA)

In the code chunk below, flesh out the values in dsetup to create a decomposition matrix for the data shown below (a one-factor design with four levels), but with the actual numeric values replacing the NA values.

  • Group 1: 84, 86, 61
  • Group 2: 83, 71, 95
  • Group 3: 56, 95, 92
  • Group 4: 68, 76, 93

IMPORTANT!

  • Make sure the final table with your result has the name dmx. Check spelling and capitalization.
  • The values should be computed based on the Yij values such that if the Yij values were to change then your code would still produce the correct decomposition matrix.
  • DO NOT change the column names or the column ordering, and make sure the table has the right number of rows and columns: it should be 12 rows by 6 columns.
  • Make sure your code runs without error in a fresh R session, and pay attention to any messages generated by the code chunk named dmx_warning, which validates your response (see the end of this chapter for what the correct and incorrect messages look like).
# TODO: DO STUFF WITH dsetup

# you can change or remove the line below,
# but make sure your final table is called dmx
dmx <- NULL 

Job Done - Activity Complete!

Well done, you are finished! Now you should go and check your answers against the solutions, which can be found at the end of this chapter. You are looking to check that the resulting output from the answers you have submitted is exactly the same as the output in the solution - for example, remember that a single value is not the same as a coded answer. Where there are alternative answers it means that you could have submitted any one of the options, as they should all return the same answer. If you have any questions please post them on the available forums or speak to a member of the team.

On to the next chapter!

11.5 Solutions to Questions

Below you will find the solutions to the questions for the Activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.

11.5.1 InClass Activities

11.5.1.1 InClass Step 1

  • The basic tibble would be created as follows.
  • When it comes to \(Y_{ij}\), simply typing in the values in order was what was needed.
library("tidyverse")

dmx <- tibble(i = rep(1:3, each = 4), 
              j = rep(1:4, times = 3),
              Yij = c(37, 80, 64, 51,
                    33, 47, 55, 41,
                    59, 23, 50, 60))

Return to Task

11.5.1.2 InClass Step 2

  • The Grand Mean can be added as follows:
dmx2 <- dmx %>%
  mutate(mu = mean(Yij))
  • And would appear as:
Table 11.4: Decomposition Matrix with Grand Mean added
i j Yij mu
1 1 37 50
1 2 80 50
1 3 64 50
1 4 51 50
2 1 33 50
2 2 47 50
2 3 55 50
2 4 41 50
3 1 59 50
3 2 23 50
3 3 50 50
3 4 60 50

Return to Task

11.5.1.3 InClass Step 3

  • The estimates \(\hat{A}_1\), \(\hat{A}_2\), \(\hat{A}_3\), or in other words the unique contribution of each group, are calculated as follows.
  • The key point is grouping by i so that each group is accounted for individually.
dmx3 <- dmx2 %>%
  group_by(i) %>%
  mutate(Ai = mean(Yij) - mu) %>%
  ungroup()
  • And would appear as:
Table 11.5: Decomposition Matrix with Group Estimates added
i j Yij mu Ai
1 1 37 50 8
1 2 80 50 8
1 3 64 50 8
1 4 51 50 8
2 1 33 50 -6
2 2 47 50 -6
2 3 55 50 -6
2 4 41 50 -6
3 1 59 50 -2
3 2 23 50 -2
3 3 50 50 -2
3 4 60 50 -2

Return to Task

11.5.1.4 InClass Step 4

  • The residuals are calculated as follows:
dmx4 <- dmx3 %>%
  mutate(err = Yij - (mu + Ai))
  • And would appear as:
Table 11.6: Decomposition Matrix with Residuals added
i j Yij mu Ai err
1 1 37 50 8 -21
1 2 80 50 8 22
1 3 64 50 8 6
1 4 51 50 8 -7
2 1 33 50 -6 -11
2 2 47 50 -6 3
2 3 55 50 -6 11
2 4 41 50 -6 -3
3 1 59 50 -2 11
3 2 23 50 -2 -25
3 3 50 50 -2 2
3 4 60 50 -2 12

Return to Task

11.5.1.5 InClass Step 5 (version 1)

  • mutate() on the squared values column
  • select() only those columns
  • summarise(sum) those columns
sstbl <- dmx4 %>%
  mutate(Yij2 = Yij^2,
         mu2 = mu^2,
         Ai2 = Ai^2,
         err2 = err^2) %>%
  select(Yij2, mu2, Ai2, err2) %>%
  summarise(ss_Y = sum(Yij2),
            ss_mu = sum(mu2),
            ss_Ai = sum(Ai2),
            ss_err = sum(err2))
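If you want, you can also confirm the identity \(SS_{total} = SS_{\mu} + SS_{A} + SS_{error}\) directly from sstbl; a small extra check, not part of the original solution:

sstbl %>%
  mutate(check = ss_Y == ss_mu + ss_Ai + ss_err)   # check should be TRUE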

11.5.1.6 InClass Step 5 (version 2)

  • There is an alternative way to do the above in a supercool, superquick, two lines of code using dplyr's "scoping" technique. Have a look at ?dplyr::scoped and ?dplyr::summarise_all.

  • Don't worry if you don't understand this yet, as it is pretty advanced, but as you can see it gives the same values as we created in class (although the column names will differ slightly).

sstbl <- dmx4 %>%
  select(Yij:err) %>%
  summarise_all(list(name = ~ sum(.^2)))
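If you are using a more recent version of dplyr (1.0 or later), the scoped verbs have been superseded by across(); the sketch below shows the same calculation with that approach (note that, unless you supply the .names argument, the output keeps the original column names rather than names like ss_Y):

sstbl <- dmx4 %>%
  select(Yij:err) %>%
  summarise(across(everything(), ~ sum(.x^2)))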

Return to Task

11.5.2 Test Yourself Activities

11.5.2.1 Assignment Question 1

  • The correct model for this scenario would be:
  1. \(Y_{ij} = \mu + A_{i} + S(A)_{ij}\)
  • As such the correct answer is:
mcq1 <- 1

Return to Task

11.5.2.2 Assignment Question 2

The population grand mean for the shown decomposition matrix is \(\hat{\mu}\) = 6

  • As such the correct answer is:
mcq2 <- 2

Return to Task

11.5.2.3 Assignment Question 3

The value for the shown decomposition matrix is \(\hat{A}_3\) = 1

  • As such the correct answer is:
mcq3 <- 3

Return to Task

11.5.2.4 Assignment Question 4

The "fitted" or "typical" value for a participant in Group 1 would be \(\hat{Y}_{ij}\) = \(\mu\) + \(A_i\)

  • As such the correct answer is:
Q4 <- 5

Return to Task

11.5.2.5 Assignment Question 5

The participants/observations with the largest residuals are \(Y_{31}\) and \(Y_{32}\)

  • As such the correct answer is:
Q5 <- 4

Return to Task

11.5.2.6 Assignment Question 6

The \(SS_{total}\) for this model would be calculated as:

Q6 <- 4^2 + 6^2 + 4^2 + 8^2 + 2^2 + 12^2
  • As such giving a \(SS_{total}\) of 280

Return to Task

11.5.2.7 Assignment Question 7

The \(SS_{error}\) for this model would be calculated as:

Q7 <- (-1)^2 + 1^2 + (-2)^2 + 2^2 + (-5)^2 + 5^2
  • As such giving a \(SS_{error}\) of 60

Return to Task

11.5.2.8 Assignment Question 8

From reading Miller and Haden Chapter 3, and from the inclass activity Section 5, a study with a one-factor design with GLM \(Y_{ij} = \mu + A_{i} + S(A)_{ij}\) is found to have the following SS:

  • \(SS_{total} = 280\),
  • \(SS_{\mu} = 40\),
  • \(SS_{error} = 60\).
  • \(SS_{total} = SS_{\mu} + SS_{A} + SS_{error}\)

And so:

  • \(SS_{A} = SS_{total} - (SS_{\mu} + SS_{error})\)

Or, in other words:

  • \(SS_{A} = SS_{total} - SS_{\mu} - SS_{error}\)

As such, given the above values and formula, the value of \(SS_{A}\) would be \(SS_{A}\) = 280 - 40 - 60 = 180

  • As such the correct answer is:
Q8 <- 180

Return to Task

11.5.2.9 Assignment Question 9

Entering the following values:

  • Group 1: 84, 86, 61
  • Group 2: 83, 71, 95
  • Group 3: 56, 95, 92
  • Group 4: 68, 76, 93

dmx can be created as shown:

dsetup <- tibble(i = rep(1:4, each = 3),
                 j = rep(1:3, times = 4),
                 Yij = NA,
                 mu = NA,
                 Ai = NA,
                 err = NA)

dmx <- dsetup %>%
  mutate(Yij = c(84, 86, 61,
                 83, 71, 95,
                 56, 95, 92,
                 68, 76, 93),
         mu = mean(Yij)) %>%
  group_by(i) %>%
  mutate(Ai = mean(Yij) - mu) %>%
  ungroup() %>%
  mutate(err = Yij - (mu + Ai))

Producing the following output:

Table 11.7: Decomposition Matrix of Ch11 Assignment Task 9
i j Yij mu Ai err
1 1 84 80 -3 7
1 2 86 80 -3 9
1 3 61 80 -3 -16
2 1 83 80 3 0
2 2 71 80 3 -12
2 3 95 80 3 12
3 1 56 80 1 -25
3 2 95 80 1 14
3 3 92 80 1 11
4 1 68 80 -1 -11
4 2 76 80 -1 -3
4 3 93 80 -1 14

If you have set-up dmx correctly then you will see the below message at the bottom of your knitted html file.

The tibble 'dmx' has been defined properly meaning that the column names (including capitalization), column data types, and tibble structure are as expected. However this does not guarantee that the values are correct and you should check these against the solution.

# Don't change anything in this code chunk; it is just here to help you check
# that you've defined a table called `dmx` correctly. Pay attention to the
# message that appears when you run this chunk: if you see the message saying
# that 'dmx' has been defined properly, then the structure of `dmx` is correct
# (although it doesn't mean that the values are correct!)

if (identical(names(dmx), names(dsetup)) && 
    identical(dim(dmx), dim(dsetup))) {
  warning("The tibble 'dmx' has been defined properly meaning that the column names (including capitalization), column data types, and tibble structure are as expected. However this does not guarantee that the values are correct and you should check these against the solution.")
}

If however you see the below, then you need to look again at the structure of dmx

You have not yet defined the tibble 'dmx' properly, and in future assignments this may affect your overall grade. Check column names (including capitalization), column data types, and table structure.

Return to Task

Chapter Complete!