G Permutation tests
This section has been adapted from the Level 2 class on permutation tests written by Dr. Phil McAleer. The original can be viewed here.
G.0.1 Overview
In this week's lab you will perform your first hypothesis test using a procedure known as a permutation test. We will help you learn how to do this through building and running data simulation procedures. In order to complete this lab you will require the following skills which we will teach you today:
- Skill 1: Generating random numbers with `base::rnorm()`
- Skill 2: Permuting values with `base::sample()`
- Skill 3: Creating a "tibble" (a type of data table) using `tibble::tibble()`
- Skill 4: Computing and extracting a difference in group means using `dplyr::pull()` and `purrr::pluck()`
- Skill 5: Creating your own custom functions using `base::function()`
- Skill 6: Repeating operations using `base::replicate()`
To many, a lot of statistics must seem a bit like blind faith as it deals with estimating quantities we haven't observed (or can't observe), e.g. the mean of a whole population. As such we have to know if we can trust our procedures for making estimations and inferences because we rarely get a chance to compare the estimated values to the true values to see if they match up. One way to test a procedure, and in turn learn about statistics, is through data simulation. In simulations we create a population and then draw samples and run tests on the data, i.e. on this known population. By running lots of simulations we can test our procedures and make sure they are acting as we expect them to. This approach is known as a Monte Carlo simulation, named after the city famous for the many games of chance that are played there.
You can go read up on the Monte Carlo approach if you like. It can however get quite in-depth, as a brief glance at the Wikipedia entry on it highlights. The main thing to keep in mind is that the method involves creating a population and continually taking samples from that population in order to make an inference. This is what we will show you in the lab. Data simulation and "creating" your own datasets, to see how tests work, is a great way to understand statistics. When doing this lab, keep in mind how easy it really is to find a significant result if even randomly created data can give a significant result. This may help dispel any notion that there is something inherently important about a significant result, in itself.
We will now take each skill in turn. Be sure to try them all out. It looks like a lot of reading, but it is mainly just showing you the output of the functions so you can see you are doing it correctly. The key thing is to try them yourselves and don't be scared to change things to see what might happen if you do it slightly differently. We will also ask a couple of questions along the way to make sure you understand the skills.
G.0.2 Skill 1: Generating Random Numbers
The `base::rnorm()` function generates values from a normal distribution and takes the following arguments:

- `n`: the number of observations to generate
- `mean`: the mean of the distribution (default 0)
- `sd`: the standard deviation of the distribution (default 1)
To generate 10 or even 50 random numbers from a standard normal distribution (M = 0, SD = 1), you would use `rnorm(10)` or `rnorm(50)` respectively.

- Type `rnorm(50)` into your console and see what happens. Use the below example for `rnorm(10)` to help you.
- Try increasing `n` to 1000.
```r
rnorm(10)
## [1] -0.3318226 0.5383048 0.8232601 0.8645694 -0.5955911 -0.4459756
## [7] -2.1898532 1.5251591 0.8880835 -0.6752691
```
Quickfire Questions
If you enter `rnorm(50)` again you will get different numbers. Why?
If you want to change the mean or sd, you would need to pass additional arguments to the function as shown below.
```r
rnorm(n = 10, mean = 1, sd = .5)
```
- Try changing the mean and sd values a couple of times and see what happens. You get different numbers again that will be around the mean you set! Set a mean of 10, then a mean of 100, to test this.
Finally, for this Skill, you can concatenate (i.e. link) numbers together into a single vector using the `c()` function from base R. For instance, say you wanted to create a vector with two sets of 50 random numbers from two separate samples: one set of 50 with a mean of 75 and the other with a mean of 90, you would use:

```r
random_numbers <- c(rnorm(50, 75),
                    rnorm(50, 90))
```
Quickfire Questions
In the above example code, what is the standard deviation of the two samples you have created?
What is the default sd of the function?
Both populations would have an sd of 1, because that is the default, although you could easily change that. Try it out!
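For instance, here is a quick sketch to try in the console (this example is not part of the lab): the same mean with two different `sd` values.

```r
rnorm(n = 5, mean = 75, sd = 2)   # five values close to 75
rnorm(n = 5, mean = 75, sd = 20)  # five values centred on 75, but far more spread out
```

Because the output is random, your numbers will differ each time you run it.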
It is always good to check that your new vector has the right number of data points in it - i.e. the total of the two samples; a sanity check if you will. The new vector `random_numbers` should have 100 elements. You could verify this using the `length()` function:

```r
length(random_numbers)
## [1] 100
```
G.0.3 Skill 2: Permuting Values
Another thing that is useful to be able to do is to generate permutations of values.
A permutation is just a different ordering of the same values. For example, the numbers 1, 2, 3 can be permuted into the following 6 sequences:
- 1, 2, 3
- 1, 3, 2
- 2, 1, 3
- 2, 3, 1
- 3, 1, 2
- 3, 2, 1
The more values you have, the more permutations of the order you have. The number of permutations can be calculated by, for example, 3 × 2 × 1, where 3 is the number of values you have. Or through code: `factorial(3) = 6`. This assumes that each value is used once in the sequence and that each value never changes, i.e. 1234 cannot suddenly become 1235.
We can create random permutations of a vector using the `sample()` function. Let's use one of R's built-in vectors: `letters`.

- Type `letters` into the console, as below, and press RETURN/ENTER. You will see it contains all the lowercase letters of the English alphabet. Now, I bet you are wondering what `LETTERS` does, right?
```r
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
```
We can combine `base::sample()` with `letters` to put the letters into a random order:
- Run the below line. Run it again. And again. What do you notice? And why is our output different from yours? (The answer is below)
```r
sample(letters)
## [1] "s" "e" "n" "q" "i" "y" "r" "f" "h" "k" "x" "v" "u" "j" "t" "p" "w" "o" "c"
## [20] "a" "z" "d" "b" "g" "l" "m"
```
Quickfire Questions
If `month.name` contains the names of the twelve months of the year, how many possible permutations are there of `sample(month.name)`?
Each time you run `sample(letters)` it will give you another random permutation of the sequence. That is what `sample()` does - creates a random permutation of the values you give it. Try repeating this command many times in the console. Because there are so many possible sequences, it is very unlikely that you will ever see the same sequence twice!
An interesting thing about `sample()` is that `sample(c(1,2,3,4))` is the same as `sample(4)`. And to recap, there would be 24 different permutations based on `factorial(4)`, meaning that each time you type `sample(4)` you are getting one of those 24 different orders. So what would `factorial(12)` be?
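To see that equivalence for yourself, here is a quick sketch to run in the console (not part of the lab):

```r
factorial(3)           # 6 - the six orderings listed earlier
sample(4)              # one of the 24 possible orderings of 1, 2, 3, 4
sample(c(1, 2, 3, 4))  # exactly the same call, written out longhand
```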
Top Tip: Remember that you can scroll up through your command history in the console using the up arrow on your keyboard; this way, you don't ever have to retype a command you've already entered.
G.0.4 Skill 3: Creating Tibbles
Tables are important because most of the data we want to analyze comes in a table, i.e. tabular form. There are different ways to get tabular data into R for analysis. One common way is to load existing data in from a data file (for example, using `readr::read_csv()` which you have seen before). But other times you might want to just type in data directly. You can do this using the `tibble::tibble()` function. Being able to create a tibble is a useful data analysis skill because sometimes you will want to create some data on the fly just to try certain code or functions.
G.0.5 Entering Data into a Tibble
The `tibble()` function takes named arguments - this means that the name you give each argument within the tibble function, e.g. `Y = rnorm(10)`, will be the name of the column that appears in the table, i.e. `Y`. It's best to see how it works through an example.
```r
tibble(Y = rnorm(10))
## # A tibble: 10 x 1
##         Y
##     <dbl>
##  1  1.17
##  2  2.21
##  3 -1.11
##  4 -0.787
##  5 -0.469
##  6 -1.39
##  7  0.631
##  8  1.04
##  9  2.12
## 10 -0.844
```
The above command creates a new table with one column named `Y`, and the values in that column are the result of a call to `rnorm(10)`: 10 randomly sampled values from a standard normal distribution (mean = 0, sd = 1) - see Skill 1.
If however we wanted to sample from two different populations for `Y`, we could combine two calls to `rnorm()` within the `c()` function. Again, this was in Skill 1; here we are now just storing the result in a tibble. See below:
```r
tibble(Y = c(rnorm(5, mean = -10),
             rnorm(5, mean = 20)))
## # A tibble: 10 x 1
##         Y
##     <dbl>
##  1  -9.47
##  2 -10.5
##  3 -10.3
##  4 -10.1
##  5  -9.17
##  6  19.9
##  7  18.6
##  8  19.4
##  9  19.8
## 10  19.5
```
Now we have sampled a total of 10 observations - the first 5 come from a group with a mean of -10, and the second 5 come from a group with a mean of 20. Try changing the values in the above example to get an idea of how this works. Maybe even add a third group!
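As one possible sketch of that suggestion (the means here are just made up for illustration), a third group of 5 centred on 0 could be added like this:

```r
library(tibble)

# three calls to rnorm(), one per group, concatenated with c()
tibble(Y = c(rnorm(5, mean = -10),
             rnorm(5, mean = 20),
             rnorm(5, mean = 0)))
```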
But, of course, it would be good to know which population each data point refers to, and so we should add some group names. We can do this with some additional trickery using the `rep()` function.
G.0.6 Repeating Values to Save Typing
Before finalising our table let's learn a little about the base R function, `rep()`. This is most useful for automatically repeating values in order to save typing. For instance, if we wanted 20 letter "A"s in a row, we would type:
```r
rep("A", 20)
## [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
## [20] "A"
```
The first argument to `rep()` is the vector containing the information you want repeated, "A", and the second argument, `times`, is the number of times to repeat it; in this case 20.
If you wanted to add more information, e.g. if the first argument has more than one element, say "A" and "B", it will repeat the entire vector that number of times; A B, A B, A B, ... . Note that we enclose "A" and "B" in the `c()` function so that it is seen as a single argument.
```r
rep(c("A", "B"), 20)
## [1] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A"
## [20] "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
## [39] "A" "B"
```
But sometimes we want a specific number of As followed by a specific number of Bs; A A A B B B. If the `times` argument has the same number of elements as the vector given for the first argument, it will repeat each element of the first vector as many times as given by the corresponding element in the `times` vector. In other words, for example, `times = c(2, 4)` for the vector `c("A", "B")` will give you 2 As followed by 4 Bs.
```r
rep(c("A", "B"), c(2, 4))
## [1] "A" "A" "B" "B" "B" "B"
```
The best way to learn about this function is to play around with it in the console and see what happens. From the dropdown menus, the correct output of the following functions would be:

- `rep(c("A", "B", "C"), c(2, 3, 1))`
- `rep(1:5, 5:1)`
G.0.7 Bringing it Together in a Tibble
Now we know `rep()`, we can complete our table of simulated data by combining what we've learned about generating random numbers and repeating values. We want our table to look like this:
```
## # A tibble: 10 x 2
##    group      Y
##    <chr>  <dbl>
##  1 A     -10.5
##  2 A     -10.7
##  3 A     -12.2
##  4 A      -9.49
##  5 A     -10.1
##  6 B      20.4
##  7 B      20.1
##  8 B      20.4
##  9 B      17.9
## 10 B      19.3
```
You now know how to create this table. Have a look at the code below and make sure you understand it. We have one column called `group` where we create As and Bs through `rep()`, and one column called `Y`, our data, all in our `tibble()`:
```r
tibble(group = rep(c("A", "B"), c(5, 5)),
       Y = c(rnorm(5, mean = -10), rnorm(5, mean = 20)))
```
Be sure to play around with the code chunk to get used to it. Try adding a third group or even a third column. Perhaps you want to give every participant a random age with a mean of 18 and an sd of 1; or even a participant number.
Try `row_number()` to create participant numbers.
Don't forget, if you wanted to store your tibble, you would just assign it to a name, such as `my_data`:

```r
my_data <- tibble(ID = row_number(1:10),
                  group = rep(c("A", "B"), c(5, 5)),
                  Y = c(rnorm(5, mean = -10), rnorm(5, mean = 20)),
                  Age = rnorm(10, 18, 1))
```
Skill 3 out of 6 Complete!
G.0.8 Skill 4: Computing Differences in Group Means
You have already learned how to calculate group means using `group_by()` and `summarise()`. For example, you might want to calculate sample means for a randomly generated dataset like so:
```r
my_data <- tibble(group = rep(c("A", "B"), c(20, 20)),
                  Y = c(rnorm(20, 20, 5), rnorm(20, -20, 5)))

my_data_means <- my_data %>%
  group_by(group) %>%
  summarise(m = mean(Y))

my_data_means
## # A tibble: 2 x 2
##   group     m
##   <chr> <dbl>
## 1 A      20.6
## 2 B     -20.8
```
Sometimes what we want though is to calculate the differences between means rather than just the means; so we'd like to subtract the second group mean -20.8 from the first group mean of 20.6, to get a single value, the difference: 41.4.
We can do this using the `dplyr::pull()` and `purrr::pluck()` functions. `pull()` will extract a single column from a dataframe and turn it into a vector. `pluck()` then allows you to pull out an element (i.e. a value or values) from within that vector.
```r
vec <- my_data_means %>%
  pull(m)

vec
## [1] 20.64371 -20.79290
```
We have now created `vec`, which is a vector containing only the group means; the rest of the information in the table has been discarded. Now that we have `vec`, we can calculate the mean difference as below, where `vec` is our vector of the two means and `[1]` and `[2]` refer to the two means:
```r
vec[1] - vec[2]
## [1] 41.43661
```
But `pluck()` is also useful, and can be written like so:
```r
pluck(vec, 1) - pluck(vec, 2)
## [1] 41.43661
```
It can also be incorporated into a pipeline, as below, where we still `pull()` the means column, `m`, and then `pluck()` each value in turn and subtract them from each other.
```r
## whole thing in a pipeline
my_data_means %>% pull(m) %>% pluck(1) -
  my_data_means %>% pull(m) %>% pluck(2)
## [1] 41.43661
```
However, there is an alternative way to extract the difference between means which may make more intuitive sense. You already know how to calculate a difference between values in the same row of a table using `dplyr::mutate()`, e.g. `mutate(new_column = column1 - column2)`. So if you can get the observations in `my_data_means` into the same row, different columns, you could then use `mutate()` to calculate the difference. Previously you learned `gather()` to bring columns together. Well, the opposite of gather is the `tidyr::spread()` function, which splits columns apart - as below.
```r
my_data_means %>%
  spread(group, m)
## # A tibble: 1 x 2
##       A     B
##   <dbl> <dbl>
## 1  20.6 -20.8
```
The spread function (`?spread`) splits the data in column `m` by the information, i.e. labels, in column `group` and puts the data into separate columns. A call to `spread()` followed by a `mutate()` can be used to calculate the difference in means - see below:
```r
my_data_means %>%
  spread(group, m) %>%
  mutate(diff = A - B)
## # A tibble: 1 x 3
##       A     B  diff
##   <dbl> <dbl> <dbl>
## 1  20.6 -20.8  41.4
```
- What is the name of the column containing the differences between the means of A and B?
Finally, if you then wanted to just get `diff` and throw away everything else in the table:

```r
my_data_means %>%
  spread(group, m) %>%
  mutate(diff = A - B) %>%
  pull(diff)
## [1] 41.43661
```
Keep in mind that a very useful technique for establishing what you want to do to a dataframe is to verbalise what you need, or to write it down in words, or to say it out loud. Take this last code chunk. What we wanted to do was to `spread()` the data in `m` into the groups A and B. Then we wanted to `mutate()` a new column that is the difference, `diff`, of A minus B. And finally we wanted to `pull()` out the value in `diff`.
Often step 1 of writing code or understanding code is knowing what it is you want to do in the first place. After that you just need the correct functions. Fortunately for us, a lot of the `tidyverse` names its functions based on what they specifically do!
G.0.9 Skill 5: Creating Your Own Functions
In Skills 1 to 4, we have looked at creating and sampling data, storing it in a tibble, and extracting information from that tibble. Now say we wanted to do this over and over again. For instance, we might want to generate 100 random datasets just like the one in Skill 4. It would be a pain to have to type out the `tibble()` function 100 times, or even to copy and paste it 100 times. We'd likely make an error somewhere and it would be hard to read. To help us, we can create a custom function that performs the action we want; in our case, creating a tibble of random data.
Remember, a function is just a procedure that takes an input and gives you the same output each time - like a toaster! A function has the following format:
```r
name_of_function <- function(arg1, arg2, arg3) {
  ## body of function goes between these curly brackets;
  ## i.e. what the function does for you.
  ## Note that the last value calculated will be returned
  ## if you call the function.
}
```
First you define your own function name (e.g. `name_of_function`) and define the names of the arguments it will take (`arg1`, `arg2`, ...) - an argument is the information that you feed into your function, e.g. data. Finally, you state the calculations or actions of the function in the body of the function (the portion that appears between the curly braces).
One of the most basic possible functions is one that takes no arguments and just prints a message. Here is an example:
```r
hello <- function() {
  print("Hello World!")
}
```
So this function is called `hello`. It can be run by typing `hello()` in your console and it will give the output of `Hello World!` every single time you run it; it has no other actions or information. Test this in the console now by typing:
```r
hello()
## [1] "Hello World!"
```
Awesome right? Ok, so not very exciting. Let's make it better by adding an argument, `name`, and have it say Hello to `name`.
```r
hello <- function(name = "World!") {
  paste("Hello", name)
}
```
This new function is again called `hello()` and replaces the one you previously created. This time however you are supplying what is called a default argument, `name = "World!"`, but it still has the same action as the previous function of putting "Hello" and "World!" together. So if you run it you get "Hello World!". Try it yourself!
```r
hello()
## [1] "Hello World!"
```
The difference this time however is that because you have added an argument to the input, you can change the information you give the argument and therefore change the output of the function. More flexible. More exciting.
Quickfire Questions
Test your understanding by answering these questions:
- Typing `hello("Phil")` in the console with this new function will give:
- Typing the argument as `"is it me you are looking for"` will give:
- What argument would you type to get "Hello Dolly!" as the output:
Most of the time however we want to create a function that computes a value or constructs a table. For instance, let's create a function that returns randomly generated data from two samples, as we learned in the previous skills - see below. All we are doing is taking the tibble we created in Skill 4 and putting it in the body (between the curly brackets) of the function.
```r
gen_data <- function() {
  tibble(group = rep(c("A", "B"), c(20, 20)),
         Y = c(rnorm(20, 20, 5), rnorm(20, -20, 5)))
}
```
This function is called `gen_data()` and when we run it we get a randomly generated table of two groups, each with 20 people, one with M = 20, SD = 5, the other with M = -20, SD = 5. Try running this `gen_data()` function in the console a few times; remember that as the data is random, the numbers will be different each time you run it.
But say we want to modify the function to allow us to get a table with smaller or larger numbers of observations per group. We can add an argument `n` and modify the code as follows. Create this function and run it a couple of times through `gen_data()`. The way to think about this is that every place that `n` appears in the body of the function (between the curly brackets) it will have the value of whatever you gave it in the arguments, i.e. in this case, 20.
```r
gen_data <- function(n = 20) {
  tibble(group = rep(c("A", "B"), c(n, n)),
         Y = c(rnorm(n, 20, 5), rnorm(n, -20, 5)))
}
```
- How many total participants would there be if you ran `gen_data(2)`?
- What would you type to get 100 participants per group?
Challenge Question:
Keeping in mind that functions can take numerous arguments, and that each group in your function has its own mean, can you modify the function `gen_data` to allow the user to change the means for the two calls to `rnorm`? Have a try before revealing the solution below.
```r
gen_data <- function(n = 20, m1 = 20, m2 = -20) {
  tibble(group = rep(c("A", "B"), c(n, n)),
         Y = c(rnorm(n, m1, 5), rnorm(n, m2, 5)))
}

# m1 is the mean of group A,
# m2 is the mean of group B.
# The function would be called by: gen_data(20, 20, -20)
# giving 20 participants in each group,
# the first group having a mean of 20,
# the second group having a mean of -20.
# Likewise, a call of gen_data(4, 10, 5)
# would give two groups of 4,
# the first having a mean of 10,
# the second having a mean of 5.
```
Here are two important things to understand about functions.

- Functions obey lexical scoping. What does this mean? It's like what they say about Las Vegas: what happens in the function, stays in the function. Any variables created inside of a function will be discarded after the function executes and will not be accessible to the outside calling process. So if you have a line, say a variable `my_var <- 17`, inside of a function, and try to print `my_var` from outside of the function, you will get an error: `object 'my_var' not found`. Although the function can 'read' variables from the environment that are not passed to it through an argument, it cannot change them. So you can only write a function to return a value, not change a value.
- Functions return the last value that was computed. You can compute many things inside of a function but only the last thing that was computed will be returned as part of the calling process. If you want to return `my_var`, which you computed earlier but not as the final computation, you can do so explicitly using `return(my_var)` at the end of the function (before the closing curly bracket).
G.0.10 Skill 6: Replicating Operations
The last skill you will need for the upcoming lab is knowing how to repeat an action (or expression) multiple times. You saw this in Lab 4, so we will only briefly recap here. We use the base function `replicate()`. For instance, say you wanted to calculate the mean of `rnorm(100)` ten times, you could write it like this:
```r
## bad way
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
rnorm(100) %>% mean()
```
But it's far easier and more readable to wrap the expression in the `replicate()` function, where the first argument is the number of times you want to repeat the expression stated as the second argument, i.e. `replicate(times, expression)`. Below we replicate the mean of 100 randomly generated numbers from the normal distribution, and we do this 10 times:
```r
replicate(10, rnorm(100) %>% mean())
```
Also you'll probably want to store the results in a variable, for example, `ten_samples`:

```r
ten_samples <- replicate(10, rnorm(100) %>% mean())
ten_samples
## [1] 0.0468488529 0.0002258006 0.0641965548 -0.2356203533 -0.0070404504
## [6] -0.1745433809 -0.0033140013 -0.0479244861 0.0399383341 -0.0159443433
```
Each element (value) of the vector within `ten_samples` is the result of a single call to `rnorm(100) %>% mean()`.
- Assuming that your `hello()` function from Skill 5 still exists, and you give it the argument `name = "Goodbye"`, what would happen in the console if you wrote `replicate(1000, hello("Goodbye"))`?
- Try it and see if it works!
```r
# the function would be:
hello <- function(name = "World!"){
  paste("Hello", name)
}

# and would be called by:
replicate(1000, hello("Goodbye"))
```
G.0.11 Finished!
To recap, we have shown you the following six skills:
- Skill 1: Generating random numbers with `base::rnorm()`
- Skill 2: Permuting values with `base::sample()`
- Skill 3: Creating a "tibble" (a type of data table) using `tibble::tibble()`
- Skill 4: Computing and extracting a difference in group means using `dplyr::pull()` and `purrr::pluck()`
- Skill 5: Creating your own custom functions using `base::function()`
- Skill 6: Repeating operations using `base::replicate()`
You will need these skills in the coming lab to help you perform a real permutation test. Through these skills and the permutation test you will learn about null hypothesis significance testing.
G.0.12 Permutation Tests of Hypotheses
A common statistical question when comparing two groups might be, "Is there a real difference between the group means?" From this we can establish two contrasting hypotheses:
- The null hypothesis which states that the group means are equivalent and is written as: \(H_0: \mu_1 = \mu_2\)
- where \(\mu_1\) is the population mean of group 1
- and \(\mu_2\) is the population mean of group 2
- Or the alternative hypothesis, which states that the group means are not equivalent and is written as: \(H_1: \mu_1 \ne \mu_2\).
Using the techniques you read about earlier and in previous labs, today you will learn how to test the null hypothesis of no difference between two independent groups. We will first do this using a permutation test before looking at other tests in later labs.
A permutation test is a basic inferential procedure that involves reshuffling the group labels or values to create new possible outcomes of the data you collected, in order to see how your original mean difference compares to all possible outcomes. The test can in fact be applied in many situations; this is just one, and it provides a good starting place for understanding hypothesis testing. The steps for the exercise below, and really the logic of a permutation test for two independent groups, are:
- Calculate the real difference \(D_{orig}\) between the means of two groups (e.g. Mean of A minus Mean of B).
- Randomly shuffle the group labels (i.e. which group each participant belonged to - A or B) and re-calculate the difference, \(D'\).
- Repeat step 2 \(N_{r}\) times, where \(N_r\) is a large number (typically greater than 1000), storing each \(D_i'\) value to form a null hypothesis distribution.
- Locate the difference you observed in step 1 (the real difference) on the null hypothesis distribution of all possible differences.
- Decide whether the original difference is sufficiently extreme to reject the null hypothesis of no difference (\(H_0\)).
This logic works because if the null hypothesis is true (there is no difference between the groups) then the labeling of the observations/participants into groups is arbitrary, and we can rearrange the labels in order to estimate the likelihood of our original difference under the \(H_0\). In other words if you know the original value of the difference between two groups (or the true difference) falls in the middle of your permuted distribution then there is no significant difference between the two groups. If however the original difference falls in the tail of the permuted distribution then there might be a significant difference depending on how far into the tail it falls.
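To make that logic concrete, here is a minimal sketch on made-up data. Treat it only as a preview: the lab below builds the same machinery up properly, step by step, and all names and numbers here are invented.

```r
library(dplyr)

set.seed(123)
toy <- tibble::tibble(group = rep(c("A", "B"), c(10, 10)),
                      Y     = rnorm(20))

# Step 1: the real difference, mean of A minus mean of B
means  <- tapply(toy$Y, toy$group, mean)
d_orig <- means["A"] - means["B"]

# Steps 2 and 3: shuffle the group labels and recompute, many times
d_null <- replicate(1000, {
  shuffled <- mutate(toy, group = sample(group))
  m <- tapply(shuffled$Y, shuffled$group, mean)
  m["A"] - m["B"]
})

# Steps 4 and 5: where does the real difference fall in the null distribution?
mean(abs(d_null) >= abs(d_orig))  # proportion of permuted differences at least as extreme
```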
Let's get started!
G.0.13 Step 1: Load in Add-on Packages and Data
1.1. Open a new script and call the `tidyverse` into your library.
1.2. Now type the statement `set.seed(1011)` at the top of your script after your library call and run it. (This 'seeds' the random number generator so that you will get the same results as everyone else. The number 1011 is somewhat arbitrary, but if everyone uses it then we all get the same outcome. Different seeds give different outcomes.)
1.3. Download the data file here and read the data in perm_data.csv into a variable called `dat`.
1.4. Let's give every participant a participant number as well by adding a new column to `dat`. Something like this would work: `mutate(subj_id = row_number())`
- Something to do with `library()`
- `set.seed(1011)`
- Something to do with `read_csv()`
- pipe `%>%` on the `mutate()` line shown
You will see that in the example here, to put a row number for each of the participants, we do not have to state the number of participants we have. In the Preclass, however, we did. What is the difference? Well, in the Preclass we were making a tibble and trying to create a column in that tibble using `row_number()`. If you want to do that you have to state the number of rows, e.g. `1:20`. However, in this example in the lab today the tibble already exists; we are just adding to it. If that is the case then you can just mutate on a column of row numbers without stating the number of participants. In summary:
- When creating the tibble, state the number of participants in `row_number()`, e.g. `row_number(1:20)`.
- If the tibble already exists, just mutate on `row_number()`. No need for specific numbers.
Have a look at the resulting tibble, `dat`.

- The column `Y` is your dependent variable (DV).
- The column `group` is your independent variable (IV).
- The column `subj_id` is the participant number.
G.0.14 Step 2: Calculate the Original Mean Difference - \(D_{orig}\)
We now need to write a pipeline of five functions that calculates the mean difference between the groups in `dat`, Group A minus Group B. Broken down into steps this would be:
2.1.1. Use a pipe of two `dplyr` one-table verbs (e.g. Lab 2) to create a tibble where each row contains the mean of one of the groups. Name the column storing the means as `m`.

2.1.2. Continue the pipe to `spread()` your data from long to wide format, based on the columns `group` and `m`.

2.1.3. Now add a pipe that creates a new column in this wide dataset called `diff` which is the value of group A's mean minus group B's mean.

2.1.4. Pull out the value in `diff` (the mean of group A minus the mean of group B) to finish the pipe.
```r
dat %>%
  group_by(?) %>%
  summarise(m = ?) %>%
  spread(group, m) %>%
  mutate(diff = ? - ?) %>%
  pull(?)
```
- Check that your value for `d_orig` is correct, without using the solution, by typing your `d_orig` value to two decimal places in the box. Include the sign, e.g. -1.23. The box will go green if you are correct.
The above steps have created a pipeline of five functions to get one value. Nice! We now need to turn this into a function because we are going to be permuting the data set (specifically the grouping labels) and re-calculating the difference many, many times.
2.2. Wrap your pipeline in a function called `calc_diff`, but swap `dat` for `x`. This function will take a single argument named `x`, where `x` is the tibble that you want to calculate group means from. As in the previous step, the function will return a single value which is the difference between the group means. The start will look like this below:
```r
calc_diff <- function(x){
  x %>% .....
}
```

```r
calc_diff <- function(x) {
  x %>%
    group_by(group) %>%
    the_rest_of_your_pipe...
}
```
2.3. Now call your new function with `x = dat` as the argument and store the result in a new variable called `d_orig`. Make sure that your function returns the same value as you got above and that your function returns a single value rather than a tibble. You can test this: `is.tibble(d_orig)` should give you `FALSE` and `is.numeric(d_orig)` should give you `TRUE`.
```r
d_orig <- function_name(x = data_name)
# or
d_orig <- function_name(data_name)

# Then type the following in the Console and look at the answer:
is.tibble(d_orig)
# TRUE (is a tibble) or FALSE (is not a tibble)
is.numeric(d_orig)
# TRUE (is numeric) or FALSE (is not numeric; it is a character or integer instead)
```
So we now have the original difference between the groups stored in d_orig
. Next we need to create a distribution of all possible differences to see where our original difference lies in this distribution. But first we need to shuffle the group
letters (A or B) in our dataset and find the difference...a few hundred times!
G.0.15 Step 3: Permute the Group Labels
3.1. Create a new function called permute() that takes as input a dataset x and returns the same dataset transformed such that the group labels (the values in the column group) are shuffled; a skeleton is started below for you. This will require using the sample() function within a mutate(). You have used mutate() twice already today and you saw how to sample() letters in the PreClass.
permute <- function(x){
  x %>% .....
}
It might be easier to think of these steps in reverse.
- Start with a mutate() function that rewrites the column group every time you run it, e.g. dat %>% mutate(variable = sample(variable))
- Now put that into your permute() function, making the necessary adjustments to the code so it starts x %>% ... . Again, x should be in the function and not dat.
3.2. Try out your new permute() function by calling it on dat (i.e. x = dat) a few times. You should see the group labels in the group column changing randomly. The most common mistake is that people mutate a new column by misspelling group. You want to overwrite/change the information in the group column, not make a new one, so be careful with the spelling.
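The shuffle-in-place idea on its own looks like this on an invented tibble (the column names here are placeholders, not the lab's data):

```r
library(dplyr)

toy <- tibble(label = c("A", "A", "B", "B"), score = 1:4)

# sample() with no other arguments returns a random permutation,
# so mutate() overwrites label with a shuffled copy of itself
toy %>% mutate(label = sample(label))
```

Note that the same column name appears on both sides of the = in mutate(); that is what makes it an overwrite rather than a new column.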
Now would be an excellent time to spend five minutes as a group recapping what you are doing.
- You have the original difference between groups.
- You have a function that calculates and stores this difference.
- You have a function that reshuffles the labels of the group.
Do you understand why? If not, go back to the principles of the permutation test at the start of the lab then read on...
G.0.16 Step 4: Create the Null-Hypothesis Distribution (NHD) for the Difference
Now that we have the original difference and our two functions, one to shuffle group labels and one to calculate the difference between two groups, we need to actually create the distribution of possible differences and see where the original difference lies in it.
4.1.1. Write a single pipeline that takes dat as the input, permutes the group labels with a call to your function permute(), and then calculates the difference in means between these new groups with a call to your function calc_diff().
4.1.2. Run this line manually a few times and watch the resulting value change as the labels get permuted.
Think about verbalising your pipelines. In a single pipeline:
- I want to permute the data into two new groups.
- Then I want to calculate the difference between these two new groups.
The functions you have created do these steps. You just have to put them in order and pipe the data through them.
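In the abstract, chaining two of your own single-argument functions in a pipe looks like this (toy functions, nothing to do with the lab data):

```r
library(dplyr)  # for the %>% pipe

add_one <- function(x) x + 1
double  <- function(x) x * 2

# the output of each step becomes the input of the next
5 %>% add_one() %>% double()  # 12
```

Your pipeline has exactly this shape, just with your own two functions and dat as the starting input.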
4.2. Now take your pipeline of functions and repeat it 1000 times using the replicate()
function. Store the output in a variable called nhd
. nhd
will contain 1000 values where each value is the mean difference of each of the 1000 random permutations of the data. (Warning: This will probably take a while to run, perhaps 10 seconds.)
# replace expression with the pipeline you created in 4.1
nhd <- replicate(times, expression)
You now have 1000 possible values of the difference between the permuted groups A and B - your permuted distribution.
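If replicate() is new to you, this sketch shows how it re-runs an expression and collects the results, with a fresh random draw on every run (a toy expression, unrelated to the lab data):

```r
# repeat the expression 5 times; each run draws new random numbers,
# so the 5 collected means all differ
replicate(5, mean(rnorm(10)))
```

In the lab, the expression being repeated is your permute-then-calculate pipeline, so each of the 1000 values is the difference for one random relabelling of the groups.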
4.3. Let's visualise this distribution through a frequency histogram of the values in nhd. This shows us the likelihood of various mean differences under \(H_0\). One thing to note however is that nhd is not a tibble and ggplot needs it to be a tibble. You need to convert it. You might start by doing something like:
ggplot(data = tibble(x = NULL), aes(x)) + NULL
Remember that ggplot
works as: ggplot(data, aes(x)) + geom...
. Here you need to convert nhd
into a tibble and put that in as your data. Look at the example above and keep in mind that, in this case, the first NULL could be replaced with the data in nhd
.
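One way the conversion can look, using an invented stand-in vector instead of your nhd:

```r
library(ggplot2)
library(tibble)

vals <- rnorm(1000)  # stand-in for your nhd vector

# wrapping the vector in tibble(x = ...) gives ggplot a data frame
ggplot(data = tibble(x = vals), aes(x)) +
  geom_histogram()
```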
- Looking at the histogram, visually locate where your original value would sit on this distribution. Would it be extreme, in the tail, or does it look rather common, in the middle?
Before moving on stop to think about what this means - that the difference between the two original groups is rather uncommon in this permuted distribution, i.e. is in the tails! Again, if unsure, go back to the principles of NHST or discuss it with your tutor!
G.0.17 Step 5: Compare the Observed Mean Difference to the NHD
If the null hypothesis is false, and there is a real difference between the groups, then the difference in means we observed for the original data (d_orig
) should be somewhere in either tail of the null-hypothesis distribution we just estimated; it should be an "extreme" value. How can we test this beyond a visual inspection?
First we have to decide on a false positive (Type I error) rate which is the rate at which we will falsely reject \(H_0\) when it is true. This rate is referred to by the Greek letter \(\alpha\) ("alpha"). Let's just use the conventional level used in Psychology: \(\alpha = .05\).
So the question we must ask is, if the null hypothesis was true, what would be the probability of getting a difference in means as extreme as the one we observed in the original data? We will label this probability p
.
Take a few moments to see if you can figure out how you might compute p
from the data before we show you how. We will then show you the process in the next few, final, steps.
5.1. Replace the NULLS in the code below to create a logical vector which states TRUE for all values of nhd
greater than or equal to d_orig
regardless of sign. Note: A logical vector is one that returns TRUE when the expression is true and FALSE when the expression is false.
lvec <- abs(NULL) >= abs(NULL)
In the code above, the function abs()
says to ignore the sign and use the absolute value. For instance, if d_orig = -7
, then abs(d_orig) = 7
. Why do we do this here? Can you think why you want to know how extreme your value is in this distribution regardless of whether the value is positive or negative?
The answer relates to whether you are testing in one or two tails of your distribution: the positive side, the negative side, or both. You will have heard in your lectures of one and two-tailed tests. Most people would say to run two-tailed tests. This means looking at both the negative and positive tails of the distribution to see if our original value is extreme, and the simplest way to do this is to ignore the sign of the values and treat both sides equally. If you wanted to test only one tail, say that your value is extreme on the negative side, then you would not use abs() and would instead set the expression to find only values less than or equal to your original value. To test only on the positive side of the distribution, find only values greater than or equal to the original. But for now we will mostly look at two-tailed tests.
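The two-tailed logic can be seen on toy numbers (an invented five-value "distribution" and observed value, not the lab's data):

```r
nulls <- c(-3, -1, 0, 2, 4)  # toy stand-in for nhd
obs   <- -2.5                # toy stand-in for d_orig

# values extreme on EITHER side count, because we compare magnitudes
abs(nulls) >= abs(obs)
# TRUE FALSE FALSE FALSE TRUE
```

Here -3 and 4 both count as "at least as extreme" as -2.5, even though one is negative and one is positive.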
5.2. Replace the NULL in the code below to sum()
the lvec
vector to get the total number of values equal to or greater than our original difference, d_orig
. Fortunately R is fine with summing TRUEs and FALSEs so you do not have to convert the data at all.
n_exceeding_orig <- NULL
5.3. Replace the NULL in the code below to calculate the probability of finding a value of d_orig in our nhd distribution by dividing n_exceeding_orig (the number of values greater than or equal to your original value) by the length() of your whole distribution nhd. Note: the length of nhd is the same as the number of replications we ran; using code for this, rather than typing the number, reduces the chance of human error.
p <- NULL
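Putting steps 5.1 to 5.3 together on toy numbers (an invented ten-value null distribution, not your nhd):

```r
nulls <- c(-3, -2, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3)  # toy nhd
obs   <- 2                                          # toy d_orig

lvec_toy <- abs(nulls) >= abs(obs)  # TRUE where |value| >= |obs|
sum(lvec_toy)                  # 4 -- TRUE counts as 1 when summed
sum(lvec_toy) / length(nulls)  # 0.4 -- the toy p value
```

Four of the ten toy values are at least as extreme as the observed one, giving p = .4; in the lab you are doing the same division, just with 1000 permuted values.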
5.4. Finally, complete the sentence below, determining whether the original value was extreme or not with respect to the distribution. Use inline coding, shown in Lab 1, to replace the XXXs. For example, the inline code chunk r length(nhd) (written with backticks) would present as 1000.
" The difference between Group A and Group B (M = XXX
) was found to have a probability of p = XXX
. This means that the original mean difference was ...... and the null hypothesis is ....."
G.0.18 Finished!
Well done on completing this lab. Let's recap before finishing. We had two groups, A and B, that we had tested in an experiment. We calculated the mean difference between A and B and wanted to know if this was a significant difference. To test this, we created a distribution of all possible differences between A and B using the premise of permutation tests, and then found the probability of our original value in that permuted distribution. The more extreme a value is in a distribution, the more likely the difference is to be significant. And that is exactly what we found: a p value below our \(\alpha\) of .05. Next time we will look at using functions and inferential tests to perform this analysis, but by understanding the above you now know how the probability of a result is determined.