8 APES - Alpha, Power, Effect Sizes, Sample Size

8.1 Overview

Up until now we have mainly spent time on data-wrangling, understanding probability, visualising our data, and more recently, running inferential tests, i.e. t-tests. In the lectures, however, you have also started to learn about additional aspects of inferential testing and trying to reduce certain types of error in your analyses. It is this balance of minimising error in our inferential statistics that we will focus on in this chapter.

The first thing to remember is that there are two types of hypotheses in Null Hypothesis Significance Testing (NHST), and what you are trying to establish is whether your data give you enough evidence to reject the null hypothesis. Those two hypotheses are:

The null hypothesis which states that the compared values are equivalent and, when referring to means, is written as: \(H_0: \mu_1 = \mu_2\)
And the alternative hypothesis which states that the compared values are not equivalent and, when referring to means, is written as: \(H_1: \mu_1 \ne \mu_2\).

Now, each decision about a hypothesis is prone to some degree of error and, as you will learn, the two main types of error that we worry about in Psychology are:

Type I error - or False Positives, is the error of rejecting the null hypothesis when it should not be rejected; the rate of this error is called alpha or \(\alpha\). In other words, you conclude that there is a real "effect" when in fact there is no effect. The field standard rate of acceptable false positives is \(\alpha = .05\) meaning that, in theory, when there is no real effect, 1 in 20 studies will still return a significant result.
Type II error - or False Negatives, is the error of retaining the null hypothesis when it is false; the rate of this error is called beta or \(\beta\). In other words, you conclude that there was no real "effect" when in fact there was one. The field standard rate of acceptable false negatives is \(\beta = .2\) meaning that, in theory, when there is a real effect, 1 in 5 studies will fail to detect it.
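To make the idea of a false positive rate concrete, here is a minimal simulation sketch. It is only an illustration: the seed, the number of simulated studies, and the group size are arbitrary choices of ours. Both groups are drawn from the same population, so any significant result is, by construction, a Type I error.

```r
# Simulate many two-group studies in which the null hypothesis is TRUE
# (both groups come from the same population) and count how often a
# t-test is nonetheless significant at alpha = .05.
set.seed(8)          # arbitrary seed so the sketch is reproducible
alpha  <- .05
n_sims <- 2000       # number of simulated studies (illustrative)
n      <- 30         # participants per group (illustrative)

p_values <- replicate(n_sims, {
  group1 <- rnorm(n, mean = 0, sd = 1)
  group2 <- rnorm(n, mean = 0, sd = 1)   # same mean, so there is no real effect
  t.test(group1, group2)$p.value
})

# Proportion of simulated studies that wrongly reject the null hypothesis
false_positive_rate <- mean(p_values < alpha)
false_positive_rate
```

In the long run the proportion should sit close to .05, which is what setting \(\alpha = .05\) means.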

Adding to the ideas of hypotheses and errors, we are going to look at the idea of power which you will learn is the long-run probability of correctly rejecting the null hypothesis for a fixed effect size and fixed sample size; i.e. correctly concluding there is an effect when there is a real effect to detect. Power is calculated as \(power = 1-\beta\) and is directly related to the False Negative rate. If the field standard of False Negatives is \(\beta = .2\) then the field standard of power should be \(power = 1 - .2 = .8\), for a given effect size and sample size (though some papers, including Registered Reports, are often required to have \(power \geq .9\)). As such, \(power = .8\) means that, in the long run, 4 in 5 studies should find an effect if there is one to detect, assuming that your study maintains these rates of error and power.
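The relationship \(power = 1-\beta\) can also be sketched by simulation. The values below (a true effect of d = .4 and 100 participants per group) are assumptions chosen purely for illustration. Because a real effect exists in every simulated study, each non-significant result is a false negative.

```r
# Simulate many two-group studies in which there IS a real effect,
# then estimate power as the proportion of significant results.
set.seed(17)         # arbitrary seed so the sketch is reproducible
alpha  <- .05
n_sims <- 2000       # number of simulated studies (illustrative)
n      <- 100        # participants per group (illustrative)
d      <- .4         # assumed true effect size in SD units (illustrative)

p_values <- replicate(n_sims, {
  control   <- rnorm(n, mean = 0, sd = 1)
  treatment <- rnorm(n, mean = d, sd = 1)   # shifted by d standard deviations
  t.test(treatment, control)$p.value
})

power_est <- mean(p_values < alpha)   # proportion of studies detecting the effect
beta_est  <- 1 - power_est            # proportion of studies missing the effect

power_est
beta_est
```

With these settings the estimated power should land near .8 and the estimated \(\beta\) near .2, though the exact values will wobble from seed to seed.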

Unfortunately, however, psychological research has been criticised for neglecting power and \(\beta\) when planning studies resulting in what are called "underpowered" or "low powered" studies - meaning that your error rates are higher than you think they are, your power is lower than you think it is, and the study is unreliable. Note that as \(\beta\) increases (the false negative rate increases), power decreases; power and false positive rates are also related, though less directly. In fact, low powered studies, combined with undisclosed analytical flexibility and publication bias, are thought to be a key issue in the replication crisis within the field. As such there may be a large number of studies where the null hypothesis has been rejected when it should not have been, and unpublished studies that have not been written up because they did not find an effect when they should have. In turn, when that is the case, the field becomes noisy and you are unsure which studies will replicate. It is issues like this that led us to redevelop our courses and why we really want you to understand power as much as possible.

So this chapter is all about power, error rates, effect sizes, and sample sizes. We will learn:

the relationship between power, alpha, effect sizes and sample sizes
how to calculate certain effect sizes
how to determine appropriate sample sizes in given scenarios
and how to interpret power analyses.

8.2 Introduction to Power

We have written and selected the material in this chapter to give you a better understanding of power and how it interacts with effect size, sample size, and alpha. We have also suggested some optional material that you can look at and play with to get a rounder view.

8.2.1 Blog post

Read the following blog on Power: This blog is a fictional conversation between a professor and a student on the importance of power. Grab a coffee and have a read. Don't worry about reading all the additional papers unless you want to; just the blog is fine to get an understanding. What you are trying to understand from this blog is the relationship between sample size and effect sizes, and whether a result from a study is likely to replicate or not based on the power of the original study.

The Power Dialogues by PIGEE at the University of Illinois.

Using power to design your study

To reiterate, power is defined as the probability of correctly rejecting the null hypothesis for a fixed effect size and fixed sample size. As such, power is a key decision when you design your study, under the premise that the higher the power of your planned study, the better.

Two relationships are important to understand:

for a given sample size and \(\alpha\), the power of your study is higher if the effect you are looking for is assumed to be a large effect as opposed to a small effect; large effects are easier to detect.
and, for a given effect size and \(\alpha\), the power of your study is higher when you increase your sample size.
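These two relationships can be checked numerically. The sketch below uses pwr.t.test() from the pwr package, which is introduced properly later in this chapter; the helper name power_for() and the particular sample sizes and effect sizes are our own illustrative choices.

```r
library(pwr)

# Hypothetical helper: power of a two-sided, two-sample t-test
# for n participants per group and effect size d.
power_for <- function(n, d, alpha = .05) {
  pwr.t.test(n = n, d = d, sig.level = alpha,
             type = "two.sample", alternative = "two.sided")$power
}

# Relationship 1: fixed n per group and alpha, increasing effect size
effect_sizes <- c(.2, .5, .8)
power_by_d <- sapply(effect_sizes, function(d) power_for(n = 50, d = d))

# Relationship 2: fixed effect size and alpha, increasing n per group
sample_sizes <- c(20, 50, 100)
power_by_n <- sapply(sample_sizes, function(n) power_for(n = n, d = .5))

round(power_by_d, 2)   # power rises as the effect gets larger
round(power_by_n, 2)   # power rises as the sample gets larger
```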

From these relationships we see that, because you have little control over the size of the effect you are trying to detect (it lives in the real world which you don't control), you can instead increase the power of your study by increasing the size of your sample (and also reducing sources of noise and measurement error in your study). As such, when planning a study, any good researcher will consider the following four key elements - the APES:

alpha (the false positive rate - Type 1 error): most commonly thought of as the significance level; usually set at \(\alpha = .05\)
power: the probability of rejecting the null hypothesis for a given effect size and sample size, with \(power = .8\) usually cited as the minimum power you should aim for based on the false negative rate being set at \(\beta = .2\);
effect size: size of the association or difference you are trying to detect;
sample size: the number of observations (usually, participants, but sometimes also stimuli) in your study.

Note: Because power depends on several variables, it is useful to think of power as a function with varying value rather than as a single fixed quantity.

Now the cool thing about the APES is that if you know any three of these elements, then you can calculate the fourth. In reality, the two most common approaches when designing a study would be:

to determine the appropriate sample size required to reject your null hypothesis, with high probability, for the effect size that you are interested in. That is, you decide your \(\alpha\), \(power\), and effect size, and from that you calculate for the sample size required in your study. Generally, the smaller the assumed effect size, the more participants you will need, assuming power and alpha are held constant.
to determine the smallest effect size you can reliably detect given your sample size. That is, you know everything except the effect size. For example, say you are using an open dataset and you know they have run 100 participants, you can't add any more participants, and you want to know what is the minimum effect size you could detect from this dataset if you set \(power\) and \(\alpha\) at the field standards.
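As a rough sketch of the second approach, the code below asks what effect size could be detected with 100 participants at the field standards. The open-dataset example above does not state a design, so we assume a one-sample, two-sided test purely for illustration. It again uses pwr.t.test() from the pwr package, which is covered in detail later in the chapter.

```r
library(pwr)

# Everything is known except the effect size, so d is the argument left out.
# Assumed for illustration: a one-sample design with a two-sided hypothesis.
d_detectable <- pwr.t.test(n = 100,
                           power = .8,
                           sig.level = .05,
                           type = "one.sample",
                           alternative = "two.sided")$d

round(d_detectable, 2)   # smallest d detectable with 80% power
```

Effects smaller than this value could still exist in the population, but a study of this size would miss them more than 1 time in 5.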

Hopefully that gives you an idea of how we use power to determine sample sizes for studies - and that the sample size should not just be pulled out of thin air. Both of these approaches described above are called a priori power analyses as you are stating the power level you want before (a priori means before) the study.

However, you may now be thinking, if everything is connected, then can we use the effect size from our study and the sample size to determine the power of the study after we have run it? No! Well, you can but it would be wrong to do so. This is actually called Observed or Post-Hoc power and most papers would discourage you from calculating it on the grounds that the effect size you are using is not the true effect size of the population you are interested in; it is just the effect size of your sample. As such any indication of power from this analysis is misleading. Avoid doing this. You can read more about why, here, in your own time if you like: Lakens (2014) Observed Power, and what to do if your editor asks for post-hoc power analyses. In short, stick to using only a priori power analyses approaches and use them to determine your required sample size or achievable reliable effect size.

8.2.2 Video about power and sample size

You should now also watch this short but nonetheless highly informative video by Daniel Lakens on Power and Sample Size. It will help consolidate the above points. And his shirt is amazing!

Power Analysis and Sample Size Decisions by Daniel Lakens

8.2.3 Useful links

Finally, there are a number of great webpages and blogs that will help you understand the concepts in this chapter. Here are some that we think might be good for you to look at. You don't have to look at all of these to understand this chapter, but do come back to them as they will really help you as you progress in becoming a responsible researcher. We are deliberately giving you a number of options here as for everyone there is that one analogy that will work best for you and that one paper that will make everything click into place. That example will be different from person to person so having a variety of explanations will help.

A YouTube video by Dan Quintana (University of Oslo) showing how to use the pwr package to calculate power in t-tests, correlations, and one-way ANOVAs https://www.youtube.com/watch?v=ZIjOG8LTTh8
A shiny app created by Lisa DeBruine (University of Glasgow) on guessing the effect size between two conditions http://shiny.psy.gla.ac.uk/guess/
A blog by Daniel Lakens (Eindhoven University of Technology) on determining the smallest effect size you are interested in. This is often referred to as the Smallest Effect Size of Interest (SESOI) http://daniellakens.blogspot.com/2017/05/how-power-analysis-implicitly-reveals.html
An interactive webpage by Kristoffer Magnusson (Karolinska Institutet, Stockholm) on interpreting Cohen's d effect size https://rpsychologist.com/d3/cohend/
A shiny app by Hause Lin (University of Toronto) showing the conversion of one effect size into another http://escal.site/
A Frontiers in Psychology paper by Daniel Lakens on calculating various effect sizes for t-tests and ANOVAs https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00863/full
A blog by Daniel Lakens on what Type I and Type II errors are acceptable. In short, justify everything http://daniellakens.blogspot.com/2019/05/justifying-your-alpha-by-minimizing-or.html
Chapter 9 of Ian Walker's "Research Methods and statistics" which is available to read online through the University Library. This is a short chapter all about hypothesis testing and power, really bringing everything from the last few chapters together.
And don't forget Chapter 3 of "The 7 Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice" is very good to read on the topic of power and unreliable research. The book is available in the University Library or can be bought at all reputable bookshops and online repositories.
A really nice paper by Marjan Bakker and colleagues on whether people put power analyses in their ethics proposals. It has a nice introduction on power and the results show how many different ways researchers actually calculate sample sizes https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0236079
A paper by Marcus Munafo and colleagues that we have mentioned many times but it might help to read again now as the concepts and ideas will be more familiar now. https://www.nature.com/articles/s41562-016-0021
A paper by Schafer and Schwarz (2019) aimed at helping people make meaningful interpretation of effect sizes in Psychology. It also explores differences of commonly found effect sizes in sub-disciplines of Psychology. https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00813/full
Finally, a paper by Brysbaert (2019) that shows just how many participants you would need in a variety of common designs in Psychology studies. You will be shocked by the difference between the number of participants needed compared to the number of participants used in published studies https://www.journalofcognition.org/article/10.5334/joc.72/

Hopefully this has given you a good basis to understanding power, sample sizes, alpha, and effect sizes. These are difficult concepts to grasp and it will take a lot of time thinking about them and interacting with them before they really start to sink in. Hopefully however, if nothing else, the least you come away with is the idea that the number of participants you should run in a study is not an arbitrary decision but is in fact a relationship between the effect size you want to test for and the level of error (Type I or Type II) you are willing to accept.

8.3 Practical APES Calculations

Hopefully you now have a decent understanding at least of the four APES that need to be considered when designing a study: \(\alpha\), \(power\), effect size and sample size. We are going to look more at calculating and understanding these elements. You don't have to fully understand everything about power to complete this chapter - believe us when we say many seasoned researchers struggle with parts - you just need to get the general gist that there is always a level of acceptable error in hypothesis testing and we are trying to minimise that for a given effect size (i.e., the magnitude of the difference, relationship, association).

So let's jump into this a bit now and start running some analyses to help further our understanding of alpha, power, effect sizes and sample size! To help our understanding we will focus on t-tests for this chapter which you will know well from previous chapters.

Effect Sizes - Cohen's \(d\)

There are a number of different effect sizes that you can choose to calculate, but a common one for t-tests is Cohen's d: the standardised difference between two means (in units of SD) and is written as d = effect-size-value. The key point is that Cohen's d is a standardised difference, meaning that it can be used to compare against other studies regardless of how the measurement was made. Take for example height differences in men and women which is estimated at about 5 inches (12.7 cm). That in itself is an effect size, but it is an unstandardised effect size in that for every sample that you test, that difference is dependent on the measurement tools, the measurement scale, and the errors contained within them. As such using a standardised effect size allows you to make comparisons across studies regardless of measurement error. In standardised terms, the height difference above is considered a medium effect size (d = .5) which Cohen (1988, as cited by Schafer & Schwarz (2019)) defined as representing "an effect likely to be visible to the naked eye of a careful observer". Cohen (1988) in fact stated three sizes of Cohen's d that people could use as a guide:

Effect size	Cohen's d value
small	.2 to .5
medium	.5 to .8
large	> .8
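If you want to label effect sizes in code, the guide above can be sketched as a small helper. Note that classify_d() and the "below small" label are our own inventions, and the table leaves the boundaries ambiguous (for example, .5 appears in two bands), so we have assumed that each band includes its lower boundary.

```r
# Hypothetical helper: label a Cohen's d using the guide bands in the table.
# Assumption: each band includes its lower boundary, e.g. d = .5 is "medium".
classify_d <- function(d) {
  cut(abs(d),                           # the sign of d does not change its size
      breaks = c(0, .2, .5, .8, Inf),
      labels = c("below small", "small", "medium", "large"),
      right  = FALSE)
}

as.character(classify_d(c(.1, .35, .65, 1.2)))
# "below small" "small" "medium" "large"
```

Remember these labels are only a guide; what counts as a meaningful effect depends on the research area.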

You may wish to read this paper later about different effect sizes in psychology - Schafer and Schwarz (2019) The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases.

One thing to note is that the formula for Cohen's d is slightly different depending on the type of t-test used. And even within a type of t-test the formula can sometimes change depending on who you read. For this chapter, let's go with the following formulas:

One-sample t-test & within-subjects (paired-sample) t-test:

\[d = \frac{t}{\sqrt{N}}\]

Between-subjects (two-sample) t-test:

\[d = \frac{2t}{\sqrt{df}}\]
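The two formulas translate directly into R. The sketch below wraps each in a small function; the function names are ours and the t-values are made up (they are deliberately not the values from the tasks that follow).

```r
# Hypothetical helpers implementing the two formulas above.
d_one_sample <- function(t, n)  t / sqrt(n)        # one-sample and paired designs
d_two_sample <- function(t, df) 2 * t / sqrt(df)   # between-subjects design

# Made-up one-sample result: t(15) = 2.5, so N = df + 1 = 16
d_one <- d_one_sample(t = 2.5, n = 15 + 1)
d_one   # 2.5 / 4 = 0.625

# Made-up two-sample result: t(36) = 3
d_two <- d_two_sample(t = 3, df = 36)
d_two   # 6 / 6 = 1
```

Notice that the one-sample formula needs N, not df, so you have to convert the reported degrees of freedom first.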

Let's now try using these formulas in order to calculate the effect sizes for given scenarios; we will work up to calculating power later in the chapter.

8.3.1 Task 1: Effect size from a one-sample t-test

You run a one-sample t-test and discover a significant effect, t(25) = 3.24, p = .003. Calculate d and determine whether the effect size is small, medium or large.

Use the appropriate formula from above for the one-sample t-tests.
You have been given a t-value and df (degrees of freedom), you still need to determine n before you calculate d.
According to Cohen (1988), the effect size is small (.2 to .5), medium (.5 to .8) or large (> .8).

Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

Enter, in digits, how many people were run in this study:
Which of these codes is the appropriate calculation of d in this instance:
Enter the correct value of d for this analysis rounded to 2 decimal places:
According to Cohen (1988), the effect size for this t-test would be considered:

8.3.2 Task 2: Effect size from a two-sample t-test

You run a two-sample t-test and discover a significant effect, t(30) = 2.9, p = .007. Calculate d and determine whether the effect size is small, medium or large.

Use the appropriate formula above for two-sample t-tests.
Remember that df = (N-1) + (N-1) for a two-sample t-test, where N is the number of participants in each group.
According to Cohen (1988), the effect size is small (.2 to .5), medium (.5 to .8) or large (> .8).

Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

Enter, in digits, how many people were run in this study:
Which of these codes is the appropriate calculation of d in this instance:
Enter the correct value of d for this analysis rounded to 2 decimal places:
According to Cohen (1988), the effect size for this t-test would be considered:

8.3.3 Task 3: Effect Size from a matched-sample t-test

You run a paired-sample t-test between an ASD sample and a non-ASD sample and discover a significant effect t(39) = 2.1, p < .05. How many people are there in each group? Calculate d and determine whether the effect size is small, medium or large.

You need the df value to determine N.
A matched-pairs design is treated like a paired-sample t-test but with two separate groups.

Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

Enter, in digits, how many people were in each group in this study. Note, not the total number of participants:
Which of these codes is the appropriate calculation of d in this instance:
Enter the correct value of d for this analysis rounded to 2 decimal places:
According to Cohen (1988), the effect size for this t-test would be considered:

df in a paired-samples and in a matched-pairs t-test is calculated as df = N - 1.
Conversely, to find N, the number of matched pairs: N = df + 1 so N = 39 + 1 = 40.
Given that this is a matched-pairs t-test, by design there has to be an equal number of participants in each group. Therefore 40 participants in each group.

8.3.4 Task 4: t-value and effect size for a between-subjects experiment

You run a between-subjects design study and the descriptives tell you: Group 1, M = 10, SD = 1.3, n = 30; Group 2, M = 11, SD = 1.7, n = 30. Calculate t and d for this between-subjects experiment.

Before you can calculate d (using the appropriate formula for a between-subjects experiment), you need to first calculate t using the formula:

t = (Mean1 - Mean2)/sqrt((var1/n1) + (var2/n2))

var stands for variance in the above formula. Variance is not the same as the standard deviation, right? Variance is measured in squared units. So for this equation, if you require variance to calculate t and you have the standard deviation, then you need to remember that \(var = SD^2\) (otherwise written as \(var = SD \times SD\)).
Now you have your t-value, but for calculating d you also need degrees of freedom. Think about how you would calculate df for a between-subjects experiment, taking n for both Group 1 and Group 2 into account.
Remember that convention is that people report the t as a positive. As such, convention also dictates that d is reported as a positive value.
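The steps in these hints can be sketched in R as follows. The descriptives here are made up for illustration and are deliberately not the ones from the task, so you will still need to run your own numbers.

```r
# Made-up descriptives for two independent groups (not the task values)
m1 <- 20; sd1 <- 2; n1 <- 25
m2 <- 22; sd2 <- 2; n2 <- 25

# Step 1: convert standard deviations to variances (var = SD^2)
var1 <- sd1^2
var2 <- sd2^2

# Step 2: t = (Mean1 - Mean2) / sqrt((var1/n1) + (var2/n2))
t_value <- (m1 - m2) / sqrt(var1 / n1 + var2 / n2)
t_value <- abs(t_value)        # by convention, report t as a positive value

# Step 3: degrees of freedom for a between-subjects design
df <- (n1 - 1) + (n2 - 1)

# Step 4: d = 2t / sqrt(df)
d <- 2 * t_value / sqrt(df)

round(t_value, 2)   # 3.54
round(d, 2)         # 1.02
```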

Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

Enter the correct t-value for this test, rounded to two decimal places:
Which of these codes is the appropriate calculation of d in this instance:
Based on the t-value above, enter the correct value of d for this analysis rounded to 2 decimal places:
According to Cohen (1988), the effect size for this t-test would be described as:

Now that you are comfortable with calculating effect sizes, we will look at using them to establish the sample size for a required power. One thing you will realise as we progress is that the true effect size in a population is something we do not know, but we need to justify one for our design. A clever approach is laid out by Daniel Lakens in the blog from the previous section on the Smallest Effect Size of Interest (SESOI) - you set the smallest effect that you would be interested in! This can be determined through theoretical analysis, through previous studies, through pilot studies, or through rules of thumb like Cohen (1988). However, also keep in mind that the lower the effect size, the larger the sample size you will need. Everything is a trade-off!

Power Calculations

We are going to use the function pwr.t.test() to run our calculations from the pwr library. This is a really useful library of functions for various tests, but we will just use it for t-tests right now.

Remember that for more information on the function pwr.t.test(), simply do ?pwr.t.test in the console. Or you can have a look at these webpages to get an idea (or bad ideas if you spot where they erroneously calculate post-hoc power!):

A quick-R summary of the pwr package - https://www.statmethods.net/stats/power.html
the pwr package vignette - https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html

From these you will see that pwr.t.test() takes a series of inputs:

n: observations/participants, per group for the independent samples version, or the number of subjects or matched pairs for the paired and one-sample designs.
d: the effect size of interest
sig.level or \(\alpha\)
power or \(1-\beta\)
type: the type of t-test; e.g. "two.sample", "one.sample", "paired"
alternative: the type of hypothesis; "two.sided" for a two-sided test, or "less" or "greater" for a one-sided test

And it works on a leave one out principle. You give it all the info you have and it returns the element you are missing. So, for example, say you needed to know how many people per group you would need to detect an effect size as low as d = .4 with power = .8, alpha = .05 in a two.sample (between-subjects) t-test on a two.sided hypothesis test. You would do:

library(pwr)  # provides pwr.t.test()

pwr.t.test(d = .4,
           power = .8,
           sig.level = .05,
           alternative = "two.sided",
           type = "two.sample")

Which will show you the following output, which, if you look at it, tells you that you need 99.0803248 people per condition.

## 
##      Two-sample t test power calculation 
## 
##               n = 99.08032
##               d = 0.4
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

But you only get whole people and we like to be conservative on our estimates so we would actually run 100 per condition. That is a lot of people!!!
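To see how the alternative argument feeds into the calculation, here is a brief sketch comparing the two-sided example above with a one-sided version of the same test. We use alternative = "greater", which assumes the effect was predicted to be in the positive direction before the data were collected; that directional prediction is our assumption for illustration only.

```r
library(pwr)

# Two-sided hypothesis, as in the example above
n_two_sided <- pwr.t.test(d = .4, power = .8, sig.level = .05,
                          alternative = "two.sided",
                          type = "two.sample")$n

# One-sided hypothesis: the effect is predicted to be positive
n_one_sided <- pwr.t.test(d = .4, power = .8, sig.level = .05,
                          alternative = "greater",
                          type = "two.sample")$n

ceiling(n_two_sided)   # participants per group, two-sided
ceiling(n_one_sided)   # participants per group, one-sided
```

The one-sided test needs fewer people per group, but it is only appropriate when the direction of the effect is justified in advance.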

One problem though is that the output of the pwr.t.test() is an object and not that easy to work with in terms of getting values out from it to be reproducible. However, the function purrr::pluck() allows us to pluck values from objects. And the code would look like this:

library(tidyverse)  # provides %>% and pluck()

n_test <- pwr.t.test(d = .4, 
                     power = .8,
                     sig.level = .05,
                     alternative = "two.sided",
                     type = "two.sample") %>%
  pluck("n")

So when we call n_test we get the same answer as above, but it is saved as a single value and easier to work with:

n_test

## [1] 99.08032

And we could use the ceiling() function to round up to whole people:

n_test %>% ceiling()

## [1] 100

Note: ceiling() is better to use than round() when dealing with people as it always rounds up. For example, ceiling(1.1) gives you 2. round() on the other hand is useful for rounding an effect size, for example, to two decimal places - e.g. d = round(.4356, 2) would give you d = 0.44
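A quick sketch of the difference, using the sample size from the example above typed in by hand:

```r
n_raw <- 99.08032        # value returned by the power analysis above

n_up      <- ceiling(n_raw)   # always rounds up to the next whole person
n_nearest <- round(n_raw)     # rounds to the nearest whole number

n_up        # 100
n_nearest   # 99

d_rounded <- round(.4356, 2)  # rounding an effect size to two decimal places
d_rounded   # 0.44
```

Recruiting 99 per group would leave the study very slightly under the power you planned for, which is why we round up for people.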

We will use this approach pwr.t.test() %>% pluck() and pwr.t.test() %>% pluck() %>% ceiling() throughout the rest of this chapter to get used to it. Before you start with this next task, you will need to make sure you have loaded in the pwr package and the tidyverse.

8.3.5 Task 5: Sample size for standard power one-sample t-test

Assuming the smallest effect size of interest is a Cohen's d of d = .23, what would be the minimum number of participants you would need in a one-sample t-test, assuming \(power = .8\), \(\alpha = .05\), on a two-sided hypothesis?

Using a pipeline, store the answer as a single value called sample_size (e.g., think pluck()) and round up to the nearest whole participant.

Use the list of inputs above as a kind of checklist to clearly determine which inputs are known or unknown. This can help you enter the appropriate values to your code.
The structure of the pwr.t.test() would be very similar to the one shown above except two.sample would become one.sample
You will also need to use pluck("n") to help you obtain the sample size and %>% ceiling() to round up to the nearest whole participant.

Quickfire Questions

Answer the following question to check your answers. The solutions are at the end of the chapter to check against:

Enter the minimum number of participants you would need in this one-sample t-test:

8.3.6 Task 6: Effect size from a high power between-subjects t-test

Assuming you run a between-subjects t-test with 50 participants per group and want a power of .9, what would be the minimum effect size you can reliably detect? Assume the field standard \(\alpha = .05\) and alternative hypothesis settings ("two-tailed"). Using a pipeline, store the answer as a single value called cohens and round to two decimal places.

Again, use the list of inputs above as a kind of checklist to clearly determine which inputs are known or unknown. This can help you enter the values to your code.
You will also need to use pluck() to obtain Cohen’s d, and round() so the value is rounded to two decimal places.
Don’t forget the quotes when using pluck(). i.e. pluck("value") and not pluck(value)

Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

Based on the information given, what will you set type as in the function?
Based on the output, enter the minimum effect size you can reliably detect in this test, rounded to two decimal places:
According to Cohen (1988), the effect size for this t-test is
Say you run the study and find that the effect size determined is d = .50. Given what you know about power, select the statement that is most accurate:

8.3.7 Task 7: Power of Published Research

Thus far we have used hypothetical situations - now go look at the paper on the Open Stats lab website called Does Music Convey Social Information to Infants?. You can download the pdf and look at it, but here we will determine the power of the significant t-tests reported in Experiment 1 under the Results section on page 489. There is a one-sample t-test and a paired-samples t-test to consider, summarised below. Assume testing was at power = .8, alpha = .05. Based on your calculations are either of the stated effects underpowered?

one-sample: t(31) = 2.96, p = .006
paired t-test: t(31) = 2.42, p = .022

A one-sample t-test and a paired t-test use the same formula for Cohen’s d.
To calculate n: n = df + 1.
Calculate the achievable Cohen’s d for the studies and then calculate the established Cohen’s d for the studies.

Thinking Cap Point

Based on what you have found out, think about the following questions and discuss them in your groups:

Which of the t-tests do you believe to be potentially underpowered?
Why do you think this may be?

Additional information about this discussion can be found in the solutions at the end of this chapter.

One caveat to Tasks 6 and 7: We have to keep in mind that here we are looking at single studies using one sample from a potentially huge number of samples within a population. As such, the effect size found in any one sample will vary around the true effect size in the population, and a single study cannot tell us how close it landed. What that means is we have to be a little bit cautious when making claims about a study. Ultimately the higher the power the better as you can detect smaller effect sizes!
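A small simulation can show how much the effect size from a single sample moves around. This is only a sketch under assumptions we have made up: a true population effect of d = .5, a one-sample design, and 40 participants per study.

```r
# Draw many samples from a population with a known true effect size
# and calculate the observed Cohen's d in each sample.
set.seed(316)     # arbitrary seed so the sketch is reproducible
true_d <- .5      # assumed true population effect size (illustrative)
n      <- 40      # participants per simulated study (illustrative)
n_sims <- 2000    # number of simulated studies (illustrative)

observed_d <- replicate(n_sims, {
  scores  <- rnorm(n, mean = true_d, sd = 1)   # SD of 1, so the mean equals d
  t_value <- unname(t.test(scores, mu = 0)$statistic)
  t_value / sqrt(n)                            # one-sample formula: d = t / sqrt(N)
})

round(mean(observed_d), 2)                      # centre of the sample effect sizes
round(quantile(observed_d, c(.025, .975)), 2)   # middle 95% of sample effect sizes
```

Even though every simulated study samples the same population, the observed d values spread quite widely around the true value, so one sample's effect size should be treated as an estimate rather than the truth.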

Concluding remarks

So hopefully you are now starting to see the interaction between alpha, power, effect sizes, and sample size. We should always want high-powered studies and depending on the size of the effect we are interested in (small to large), and our \(\alpha\) level, this will determine the number of observations we need to make sure our study is well powered. Points to note:

Lowering the \(\alpha\) level (e.g., .05 to .01) will reduce the power.
Lowering the effect size (e.g., .8 to .2) will reduce the power.
Increasing power (e.g., .8 to .9) will require more participants.
It is also possible to increase power for a fixed sample size by reducing sources of noise in the study.

A high-powered study looking to detect a small effect size at a low alpha may require a large number of participants!
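Two of the points above can be checked quickly with pwr.t.test(). The numbers used here (d = .4 and 100 participants per group) are illustrative assumptions, not recommendations.

```r
library(pwr)

# Point 1: lowering alpha reduces power (same n per group and effect size)
power_alpha_05 <- pwr.t.test(n = 100, d = .4, sig.level = .05,
                             type = "two.sample",
                             alternative = "two.sided")$power
power_alpha_01 <- pwr.t.test(n = 100, d = .4, sig.level = .01,
                             type = "two.sample",
                             alternative = "two.sided")$power

# Point 3: asking for more power requires more participants per group
n_power_80 <- pwr.t.test(d = .4, power = .8, sig.level = .05,
                         type = "two.sample",
                         alternative = "two.sided")$n
n_power_90 <- pwr.t.test(d = .4, power = .9, sig.level = .05,
                         type = "two.sample",
                         alternative = "two.sided")$n

round(c(power_alpha_05, power_alpha_01), 2)
ceiling(c(n_power_80, n_power_90))
```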

8.4 Practice Your Skills

Lab 8: APES Assignment

In order to complete this assignment you first have to download the assignment .Rmd file which you need to edit for this assignment: titled GUID_PracticeSkills_Ch8.Rmd. This can be downloaded within a zip file from the below link. Once downloaded and unzipped you should create a new folder that you will use as your working directory; put the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the assignment zip file from here.

NOTE: In nearly all of the problems below, you will need to replace NULL with a value or a pipeline of code that computes a value. Please pay special attention as to what the question is asking for as the output, e.g. value or a tibble; when asked for a value as an output, make sure it is a single value and not a value stored in a tibble. Finally, when altering code inside the code blocks, please do not re-order or rename the code blocks (T1, T2, ... etc.).

It's also recommended that you "knit" a report to be able to see what you've accomplished and spot potential errors. A great thing to do is close the whole programme, restart it, and then knit your code. This will test whether you have remembered to include essential elements, such as libraries, in your code.

APES: Alpha, Power, Effect Size, and Sample Size

In the chapter we have been looking at the interplay between the four components of Alpha, Power, Effect Size, and Sample Size. This is a very important part of experimental design to understand as it will help you understand which studies are worth paying attention to and it will help you design your own studies in the coming years so that you know just how many people to run and what to make of the effect that you find.

Before starting let's check:

The .Rmd file is saved in your working directory. For assessments we ask that you save it with the format GUID_PracticeSkills_Ch8.Rmd where GUID is replaced with your GUID. Though this is a formative assessment, it may be good practice to do the same here.

Libraries

You will need to use the pwr, tidyverse and broom libraries in this assignment, so load them in the library code chunk below.
Hint: library(package)

Basic Calculations

8.4.1 Task 1

You set up a study so that it has a power value of \(power = .87\). To two decimal places, what is the Type II error rate of your study?

Replace the NULL in the T1 code chunk below with either a single value, or with mathematical notation, so that error_rate returns the actual value of the Type II error rate for this study. By mathematical notation we mean that you should use the appropriate formula but insert the actual values.

error_rate <- NULL

8.4.2 Task 2

You run a two-sample t-test and discover a significant effect, t(32) = 3.26, p < .05. Using the appropriate formula, given in the chapter, calculate the effect size of this t-test.

Replace the NULL in the T2 code chunk below with mathematical notation so that effect1 returns the value of the effect size. Do not round the value.

effect1 <- NULL

8.4.3 Task 3

You run a paired-sample t-test and discover a significant effect, t(43) = 2.24, p < .05. Using the appropriate formula, given in the chapter, calculate the effect size of this t-test.

Replace the NULL in the T3 code chunk below with mathematical notation so that effect2 returns the value of the effect size. Do not round the value.

effect2 <- NULL

Using the Power function

8.4.4 Task 4

Replace the NULL in the T4 code chunk below with a pipeline combining pwr.t.test(), pluck() and ceiling() to determine how many participants are needed to sufficiently power a paired-samples t-test at \(power = .9\) with \(d = .5\). Assume a two-sided hypothesis with \(\alpha = .05\). Round the answer up to the nearest whole participant and store this value in participants.
Hint: Remember the quotes on the pluck

participants <- NULL

8.4.5 Task 5

Using a pipeline similar to Task 4, what is the minimum effect size that a one-sample t-test study (two-tailed hypothesis) could reliably detect given the following details: \(\beta = .16, \alpha = 0.01, n = 30\). Round to two decimal places and replace the NULL in the T5 code chunk below to store this value in effect3.
Hint: Remember you are going to round() and not ceiling()

effect3 <- NULL

8.4.6 Task 6

Study 1

You run a between-subjects study and establish the following descriptives: Group 1 (M = 5.1, SD = 1.34, N = 32); Group 2 (M = 4.4, SD = 1.27, N = 32). Replace the NULL in the T6 code chunk below with the following formula, substituting in the appropriate values, to calculate the t-value of this test. Calculate as Group1 minus Group2. Store the t-value in tval. Do not round tval and do not include the t = part of the formula.

\[ t = \frac {{\bar{x_{1}}} - \bar{x_{2}}}{ \sqrt {\frac {{s_{1}}^2}{n_{1}} + \frac {{s_{2}}^2}{n_{2}}}}\]

tval <- NULL

8.4.7 Task 7

Using the tval calculated in Task 6, calculate the effect size of this study and store it as d1 in the T7 code chunk below, replacing the NULL with the appropriate formula and values. Do not round d1.
Hint: Think between-subjects

d1 <- NULL

8.4.8 Task 8

Assuming \(power = .8\) and \(\alpha = .05\) on a two-tailed hypothesis, and based on the d1 value in Task 7 and the smallest achievable effect size of this study, which of the statements below is correct?

1. The smallest effect size that this study can determine is d = .71. The detected effect size, d1, is larger than this and as such this study is potentially suitably powered.
2. The smallest effect size that this study can determine is d = .17. The detected effect size, d1, is larger than this and as such this study is potentially suitably powered.
3. The smallest effect size that this study can determine is d = .17. The detected effect size, d1, is smaller than this and as such this study is potentially suitably powered.
4. The smallest effect size that this study can determine is d = .71. The detected effect size, d1, is smaller than this and as such this study is potentially not suitably powered.

Replace the NULL in the T8 code chunk below with the number of the statement that is a true summary of this study. It may help you to calculate and store the smallest achievable effect size of this study in poss_d.

Hint: use poss_d to calculate the smallest possible effect size of this study to help you answer this question.

poss_d <- NULL

answer_T8 <- NULL

8.4.9 Task 9

Study 2

Below is a paragraph from the results of Experiment 4 from Schroeder, J., & Epley, N. (2015). The sound of intellect: Speech reveals a thoughtful mind, increasing a job candidate's appeal. Psychological Science, 26, 877-891. We saw this paper previously but you can find out more details at Open Stats Lab (https://sites.trinity.edu/osl/data-sets-and-activities/t-test-activities).

Recruiters believed that the job candidates had greater intellect - were more competent, thoughtful, and intelligent - when they listened to pitches (M = 5.63, SD = 1.61, n = 21) than when they read pitches (M = 3.65, SD = 1.91, n = 18), t(37) = 3.53, p < .01, 95% CI of the difference = [0.85, 3.13], d1 = 1.16. The recruiters also formed more positive impressions of the candidates - rated them as more likeable and had a more positive and less negative impression of them - when they listened to pitches (M = 5.97, SD = 1.92) than when they read pitches (M = 4.07, SD = 2.23), t(37) = 2.85, p < .01, 95% CI of the difference = [0.55, 3.24], d2 = 0.94. Finally, they also reported being more likely to hire the candidates when they listened to pitches (M = 4.71, SD = 2.26) than when they read the same pitches (M = 2.89, SD = 2.06), t(37) = 2.62, p < .01, 95% CI of the difference = [0.41, 3.24], d3 = 0.86.

Using the pwr.t.test() function, what is the minimum effect size that this paper could have reliably detected? Test at \(power = .8\) for a two-sided hypothesis. Use the \(\alpha\) stated in the paragraph and the smallest n stated; store the value as effect4 in the T9 code chunk below. Replace the NULL with your pipeline and round the effect size to three decimal places.

effect4 <- NULL

8.4.10 Task 10

Given the value of effect4 calculated in Task 9, the alpha stated in the paragraph, and the smaller n of the two groups, which of these statements is true?

1. This study has enough power to reliably detect effects at the size of d3 and larger.
2. This study has enough power to reliably detect effects at the size of only d1.
3. This study has enough power to reliably detect effects at the size of d2 and larger, but not d3.
4. This study does not have enough power to reliably detect effect sizes at d1 or lower.

Replace the NULL in the T10 code chunk below with the number of the statement that is TRUE, storing the single value in answer_t10.

answer_t10 <- NULL

8.4.11 Task 11

Last, but not least:

Read the following statements.

a. In general, increasing sample size will increase the power of a study.
b. In general, smaller effect sizes require fewer participants to detect at \(power = .8\).
c. In general, lowering alpha (from .05 to .01) will decrease the power of a study.

Now look at the four summary statements below about the validity of statements a, b and c.

1. Statements a, b and c are all TRUE.
2. Statements a and c are both TRUE.
3. Statements b and c are both TRUE.
4. None of the statements are TRUE.

Replace the NULL in the T11 code chunk below with the number of the statement that is correct, storing the single value in answer_t11.

answer_t11 <- NULL

8.4.12 The `pwr` package

An alternative solution to Task 9 would be to use the pwr.t2n.test() function from the pwr package (Champely, 2020). This would allow you to enter the n of both groups as there is an n1 and an n2 argument. Were you to use this, entering n1 = 18, n2 = 21, alpha = .01, the d drops just a little, changing the interpretation of Task 10. Feel free to try this analysis and see if you can figure out what would be the alternative answer to Task 10.
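A minimal sketch of that alternative is shown here, assuming the pwr and tidyverse packages are installed. The object name effect4_t2n is hypothetical and is not part of the assignment file. The output is deliberately not stated so that you can run it and compare it with effect4 yourself.

```r
library(pwr)       # provides pwr.t2n.test()
library(tidyverse) # provides %>% and pluck()

# Same alpha and power as Task 9, but using both group sizes
# rather than only the smaller one
effect4_t2n <- pwr.t2n.test(n1 = 18,
                            n2 = 21,
                            sig.level = .01,
                            power = .8,
                            alternative = "two.sided") %>%
  pluck("d") %>%
  round(3)

effect4_t2n
```

Note that pwr.t2n.test() has no type argument, as it only covers the two-sample case with unequal group sizes.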

Job Done - Activity Complete!

Well done, you are finished! Now you should go check your answers against the solutions, which can be found at the end of this chapter. You are looking to check that the output from the answers that you have submitted is exactly the same as the output in the solution - for example, remember that a single value is not the same as a coded answer. Where there are alternative answers, it means that you could have submitted any one of the options as they should all return the same answer.

8.5 Solutions to Questions

Below you will find the solutions to the questions for the activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.

8.5.1 Practical APES Calculations

8.5.1.1 Task 1

d <- 3.24 / sqrt(25 +1)

Giving an effect size of d = 0.64 and as such a medium to large effect size according to Cohen (1988)

Return to Task

8.5.1.2 Task 2

d <- (2*2.9) / sqrt(30)

Giving an effect size of d = 1.06 and as such a large effect size according to Cohen (1988)

Return to Task

8.5.1.3 Task 3

N = 39 + 1

d <- 2.1 / sqrt(N)

Giving an N = 40 and an effect size of d = 0.33. This would be considered a small effect size according to Cohen (1988)

Return to Task

8.5.1.4 Task 4

t = (10 - 11)/sqrt((1.3^2/30) + (1.7^2/30))

d = (2*t)/sqrt((30-1) + (30-1))

The code returns a t-value of t = -2.56 and an effect size of d = -0.67, because Group 1 has the smaller mean. These are reported as t = 2.56 and d = 0.67.
Remember that the convention is that people tend to report the t and d as positive values.

Return to Task

8.5.1.5 Task 5

library(pwr)       # provides pwr.t.test()
library(tidyverse) # provides %>% and pluck()

sample_size <- pwr.t.test(d = .23,
                          power = .8, 
                          sig.level = .05, 
                          alternative = "two.sided", 
                          type = "one.sample") %>%
  pluck("n") %>% 
  ceiling()

Giving a sample size of n = 151

Return to Task

8.5.1.6 Task 6

cohens <- pwr.t.test(n = 50,
                    power = .9, 
                    sig.level = .05, 
                    alternative = "two.sided", 
                    type = "two.sample") %>% 
  pluck("d") %>% 
  round(2)

Giving a Cohen's d effect size of d = 0.65

Return to Task

8.5.1.7 Task 7

Example 1

ach_d_exp1 <- pwr.t.test(power = .8, 
                         n = 32, 
                         type = "one.sample", 
                         alternative = "two.sided", 
                         sig.level = .05) %>% 
  pluck("d") %>% 
  round(2) 

exp1_d <- 2.96/sqrt(31+1)

Giving an achievable effect size of 0.51 and they found an effect size of 0.52.

This study seems ok as the authors could achieve an effect size as low as .51 and found an effect size of .52.

Example 2

ach_d_exp2 <- pwr.t.test(power = .8, 
                         n = 32, 
                         type = "paired", 
                         alternative = "two.sided", 
                         sig.level = .05) %>% 
  pluck("d") %>% 
  round(2) 

exp2_d <- 2.42/sqrt(31+1)

Giving an achievable effect size of 0.51 and they found an effect size of 0.43.

This effect might not be reliable given that the effect size found was much lower than the achievable effect size. The issue here is that the researchers established their sample size based on a previous effect size and not on the minimum effect size that they would find important. If an effect size as small as .4 was important then they should have powered all studies to that level and run the appropriate n of approximately 52 babies (see below). The flipside, of course, is that obtaining 52 babies isn't easy, which is why some people consider the Many Labs approach a good way ahead.

ONE CAVEAT to the above is that before making the assumption that this study is therefore flawed, we have to keep in mind that this is one study using one sample from a potentially huge number of samples within a population. As such there will be a degree of variance in the true effect size within the population regardless of the effect size of one given sample. What that means is we have to be a little bit cautious when making claims about a study. Ultimately the higher the power the better.

Below you could calculate the actual sample size required to achieve a power of .8:

sample_size <- pwr.t.test(power = .8,
                          d = .4,
                          type = "paired", 
                          alternative = "two.sided", 
                          sig.level = .05) %>%
  pluck("n") %>% 
  ceiling()

Suggesting a sample size of n = 52 would be appropriate.

Return to Task

8.5.2 Practice Your Skills

Libraries

library(pwr)
library(tidyverse)

8.5.2.1 Task 1

error_rate <- 1 - .87

The Type II error rate of your study would be \(\beta\) = 0.13.

Return to Task

8.5.2.2 Task 2

effect1 <- (2*3.26)/sqrt(32)

The effect size would be d = 1.1525841

Return to Task

8.5.2.3 Task 3

effect2 <- 2.24/sqrt(43+1)

The effect size would be d = 0.3376927

Return to Task

8.5.2.4 Task 4

participants <- pwr.t.test(power = .9,
                           d = .5,
                           sig.level = 0.05,
                           type = "paired",
                           alternative = "two.sided") %>% 
  pluck("n") %>% 
  ceiling()

Given the detailed scenario, the appropriate number of participants would be n = 44

Return to Task

8.5.2.5 Task 5

effect3 <- power.t.test(power = 1-.16,
                      n = 30,
                      sig.level = 0.01,
                      type = "one.sample",
                      alternative = "two.sided") %>% 
  broom::tidy() %>% 
  pull(delta) %>% 
  round(2)

Given the detailed scenario, we would be able to detect an effect size of d = 0.69
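The solution above uses base R's power.t.test() with broom::tidy(), where delta can be read as d because sd is left at its default of 1. Since the task asks for a pipeline similar to Task 4, the same question can also be sketched with pwr.t.test() and pluck(). The object name effect3_pwr is hypothetical and is used only to avoid overwriting effect3.

```r
library(pwr)       # provides pwr.t.test()
library(tidyverse) # provides %>% and pluck()

# Leave d unspecified so that pwr.t.test() solves for it
effect3_pwr <- pwr.t.test(n = 30,
                          power = 1 - .16,
                          sig.level = .01,
                          type = "one.sample",
                          alternative = "two.sided") %>%
  pluck("d") %>%
  round(2)

effect3_pwr
```

The two functions treat the two-sided case slightly differently, so compare effect3_pwr with effect3 rather than assuming they are identical before rounding.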

Return to Task

8.5.2.6 Task 6

tval <- (5.1 - 4.4) / sqrt((1.34^2/32) + (1.27^2/32))

Given the stated means and standard deviations, the t-value for this study would be t = 2.1448226

Return to Task

8.5.2.7 Task 7

d1 <- (2*tval)/sqrt((32-1)+(32-1))

Given the t-value in Task 6, the effect size of this study would be d = 0.5447855.
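If you expect to repeat the calculations in Tasks 6 and 7 for other studies, one option is to wrap the two formulas in a small function. The sketch below is base R only. The function name summary_t_and_d and its argument names are hypothetical, not something provided by a package.

```r
# Hypothetical helper: t-value and Cohen's d from group summary statistics,
# using the same between-subjects formulas as Tasks 6 and 7
summary_t_and_d <- function(m1, s1, n1, m2, s2, n2) {
  t_val <- (m1 - m2) / sqrt((s1^2 / n1) + (s2^2 / n2))
  d_val <- (2 * t_val) / sqrt((n1 - 1) + (n2 - 1))
  c(t = t_val, d = d_val)
}

# Study 1: Group 1 (M = 5.1, SD = 1.34, N = 32); Group 2 (M = 4.4, SD = 1.27, N = 32)
study1 <- summary_t_and_d(5.1, 1.34, 32, 4.4, 1.27, 32)
study1
```

The function returns a named vector, so study1[["t"]] and study1[["d"]] give the same single values as tval and d1 above.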

Return to Task

8.5.2.8 Task 8

poss_d <- pwr.t.test(power = .8,
                     n = 32,
                     sig.level = 0.05,
                     type = "two.sample",
                     alternative = "two.sided") %>% 
  pluck("d") %>% 
  round(2)

answer_T8 <- 4

The smallest effect size that this study can determine is d = 0.71. The detected effect size, d1, is smaller than this (d1 = 0.5447855) and as such this study is potentially not suitably powered.
Given that outcome, the 4th statement is the most suitable answer - answer_T8 = 4.

Return to Task

8.5.2.9 Task 9

effect4 <- pwr.t.test(power = .8,
                      n = 18,
                      sig.level = .01,
                      alternative = "two.sided",
                      type = "two.sample") %>% 
  pluck("d") %>% 
  round(3)

The smallest stated n is n = 18 and the stated \(\alpha\) is \(\alpha\) = .01
Given these details, the minimum effect size that this paper could have reliably detected was d = 1.198

Return to Task

8.5.2.10 Task 10

answer_t10 <- 4

This study does not have enough power to reliably detect effect sizes at d1 or lower, and as such answer_t10 = 4.
However, it is worth keeping in mind that we are only looking at one study here which drew one sample from a population of samples. This means that there is always uncertainty about the true effect size of a difference or association - taking a different sample may have given a different effect size. As such, the comparison we are making here is not entirely valid and we should see it more as a reminder that we should always think of power as more in the planning of studies rather than in the search for criticism.
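The point about sampling uncertainty can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population effect size (true_d = 1), the normal distributions, the seed, and the number of simulated samples are all hypothetical choices, and only the group sizes come from the quoted paragraph.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

true_d   <- 1   # hypothetical population effect size (an assumption)
n_listen <- 21  # group sizes from the quoted paragraph
n_read   <- 18

# Draw 1000 samples from the same population and compute Cohen's d for each
observed_d <- replicate(1000, {
  listen <- rnorm(n_listen, mean = true_d, sd = 1)
  read   <- rnorm(n_read, mean = 0, sd = 1)
  pooled_sd <- sqrt(((n_listen - 1) * var(listen) + (n_read - 1) * var(read)) /
                      (n_listen + n_read - 2))
  (mean(listen) - mean(read)) / pooled_sd
})

# Middle 95% of the effect sizes seen across the simulated samples
quantile(observed_d, probs = c(.025, .975))
```

Even though every simulated sample comes from a population with the same true effect, the observed effect sizes spread out around it, which is why a single observed d should be compared with the achievable effect size cautiously.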

Return to Task

8.5.2.11 Task 11

answer_t11 <- 2

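In general, increasing sample size will increase the power of a study, whereas lowering alpha (from .05 to .01) will decrease the power of a study. Statement b is false because smaller effect sizes require more participants, not fewer, to detect at the same power. As such, only statements a and c are TRUE, so answer_t11 = 2.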
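If you would like to check the three statements for yourself, the sketch below uses base R's power.t.test(), which defaults to a two-sample, two-sided test with sd = 1, so delta can be read as d. The specific sample sizes and effect sizes are hypothetical and chosen only for illustration.

```r
# Illustrative sketch only: n and delta values are hypothetical.

# (a) Larger n, same effect size and alpha
power_n20 <- power.t.test(n = 20, delta = .5, sig.level = .05)$power
power_n50 <- power.t.test(n = 50, delta = .5, sig.level = .05)$power
power_n50 > power_n20   # power goes up with n

# (b) Smaller effect size, same power and alpha
n_for_d2 <- power.t.test(delta = .2, power = .8, sig.level = .05)$n
n_for_d8 <- power.t.test(delta = .8, power = .8, sig.level = .05)$n
n_for_d2 > n_for_d8     # smaller effects need more participants

# (c) Lower alpha, same n and effect size
power_a05 <- power.t.test(n = 50, delta = .5, sig.level = .05)$power
power_a01 <- power.t.test(n = 50, delta = .5, sig.level = .01)$power
power_a01 < power_a05   # power goes down when alpha is lowered
```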

Return to Task

Chapter Complete!
