Lab 8 APES - Alpha, Power, Effect Sizes, Sample Size

8.1 Overview

Up until now we have mainly spent time on data-wrangling, understanding probability, visualising our data, and more recently, running inferential tests, i.e. t-tests. In the lectures, however, you have also started to learn about additional aspects of inferential testing and trying to reduce certain types of error in your analyses. It is this balance of minimising error in our inferential statistics that we will focus on today.

The first thing to remember is that there are two types of hypotheses in Null Hypothesis Significance Testing (NHST), and what you are trying to decide is whether or not to reject the null hypothesis. Those two hypotheses are:

  • The null hypothesis which states that the compared values are equivalent and, when referring to means, is written as: \(H_0: \mu_1 = \mu_2\)
  • And the alternative hypothesis which states that the compared values are not equivalent and, when referring to means, is written as: \(H_1: \mu_1 \ne \mu_2\).

Now, each decision about a hypothesis is prone to some degree of error and, as you will learn, the two main types of error that we worry about in Psychology are:

  • Type I error - or False Positives, is the error of rejecting the null hypothesis when it should not be rejected (otherwise called alpha or \(\alpha\)). In other words, you conclude that there is a real "effect" when in fact there is no effect. The field standard rate of acceptable false positives is \(\alpha = .05\) meaning that in theory 1 in 20 studies may be a false positive.
  • Type II error - or False Negatives, is the error of retaining the null hypothesis when it is false (otherwise called beta or \(\beta\)). In other words, you conclude that there was no real "effect" when in fact there was one. The field standard rate of acceptable false negatives is \(\beta = .2\) meaning that in theory 1 in 5 studies may be a false negative.

Adding to the ideas of hypotheses and errors, we are going to look at the idea of power which you will learn is the long-run probability of correctly rejecting the null hypothesis for a fixed effect size and fixed sample size; i.e. correctly concluding there is an effect when there is a real effect to detect. Power is calculated as \(power = 1-\beta\) and is directly related to the False Negative rate. If the field standard of False Negatives is \(\beta = .2\) then the field standard of power should be \(power = 1 - .2 = .8\), for a given effect size and sample size (though some papers, including Registered Reports, are often required to have a power of \(power \ge .9\)). As such, \(power = .8\) means that, in the long run, 4 in 5 studies should find an effect if there is one to detect, assuming that your study maintains these rates of error and power.
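To make the "long-run" idea concrete, here is a minimal simulation sketch in base R. It assumes normally distributed scores with SD = 1 in both groups, and the function name sim_reject_rate, the seed, and the chosen values of d and n are all made up for illustration. It runs many simulated two-group studies and records how often the t-test comes out significant.

```r
set.seed(8)  # arbitrary seed so the sketch is reproducible

# Proportion of significant two-sample t-tests across many simulated studies.
# d is the true standardised difference between the two population means.
sim_reject_rate <- function(d, n, alpha = .05, reps = 2000) {
  p_values <- replicate(reps, t.test(rnorm(n, mean = 0), rnorm(n, mean = d))$p.value)
  mean(p_values < alpha)
}

# No real effect: every significant result is a false positive (Type I error)
false_positive_rate <- sim_reject_rate(d = 0, n = 30)

# A real effect of d = .8: the proportion of significant results estimates power
power_estimate <- sim_reject_rate(d = .8, n = 30)

false_positive_rate
power_estimate
```

With no true effect, the rejection rate should sit close to \(\alpha = .05\). With a true effect, the rejection rate is an estimate of power, and 1 minus that estimate is an estimate of \(\beta\).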

Unfortunately, however, psychological research has been criticised for neglecting power and \(\beta\) when planning studies resulting in what are called "underpowered" or "low powered" studies - meaning that your error rates are higher than you think they are, your power is lower than you think it is, and the study is unreliable. Note that as \(\beta\) increases (the false negative rate increases), power decreases; power and false positive rates are also related, though less directly. In fact, low powered studies, combined with undisclosed analytical flexibility and publication bias, are thought to be a key issue in the replication crisis within the field. As such there may be a large number of studies where the null hypothesis has been rejected when it should not have been, and unpublished studies that have not been written up because they did not find an effect when they should have. In turn, when that is the case, the field becomes noisy and you are unsure which studies will replicate. It is issues like this that led us to redevelop our courses and why we really want you to understand power as much as possible.

So this chapter is all about power, error rates, effect sizes, and sample sizes. We will learn:

  • the relationship between power, alpha, effect sizes and sample sizes
  • how to calculate certain effect sizes
  • how to determine appropriate sample sizes in given scenarios
  • and how to interpret power analyses.

8.2 PreClass Activity

As in the last chapter, the Preclass activities involve reading and watching. We have written and selected this material to help give you a better understanding of power and how it interacts with effect size, sample size, and alpha. We have also suggested some optional material that you can look at and play with to get a rounder view.

8.2.1 Reading

Read the following blog on Power and then the section we have written on Power and Design.

This blog is a fictional conversation between a professor and a student on the importance of power. Grab a coffee and have a read. Don't worry about reading all the additional papers unless you want to; just the blog is fine to get an understanding. What you are trying to understand from this blog is the relationship between sample size and effect sizes, and whether a result from a study is likely to replicate or not based on the power of the original study.

Blog:

Using power to design your study

To reiterate, power is defined as the probability of correctly rejecting the null hypothesis for a fixed effect size and fixed sample size. As such, power is a key decision when you design your study, under the premise that the higher the power of your planned study, the better.

Two relationships you will learn in this chapter are that:

  • for a given sample size and \(\alpha\), the power of your study is higher if the effect you are looking for is assumed to be a large effect as opposed to a small effect; large effects are easier to detect.
  • and, for a given effect size and \(\alpha\), the power of your study is higher when you increase your sample size.
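These two relationships can be checked numerically. The sketch below is only an illustration: it uses base R's power.t.test() (from the stats package, not the pwr function used later in this chapter) so that it runs without extra packages, and the effect sizes and sample sizes are arbitrary choices for a two-sample, two-sided test.

```r
# Base R's power.t.test() is used so that no extra packages are needed.
# With sd = 1, delta is the difference in SD units, i.e. Cohen's d.

# 1. Fixed n (50 per group) and alpha (.05): power across effect sizes
effect_sizes <- c(.2, .5, .8)
power_by_d <- sapply(effect_sizes, function(d) {
  power.t.test(n = 50, delta = d, sd = 1, sig.level = .05)$power
})

# 2. Fixed d (.5) and alpha (.05): power across per-group sample sizes
sample_sizes <- c(20, 50, 100)
power_by_n <- sapply(sample_sizes, function(n) {
  power.t.test(n = n, delta = .5, sd = 1, sig.level = .05)$power
})

round(power_by_d, 2)  # rises as the effect size gets larger
round(power_by_n, 2)  # rises as the sample size gets larger
```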

From these relationships we see that, because you have little control over the size of the effect you are trying to detect (it lives in the real world which you don't control), you can instead increase the power of your study by increasing the size of your sample (and also reducing sources of noise and measurement error in your study). As such, when planning a study, any good researcher will consider the following four key elements - the APES:

  • alpha - (the false positive rate - Type 1 error) most commonly thought of as the significance level; usually set at \(\alpha = .05\)
  • power - the probability of rejecting the null hypothesis for a given effect size and sample size, with \(power = .8\) usually cited as the minimum power you should aim for based on the false negative rate being set at \(\beta = .2\);
  • effect size - size of the association or difference you are trying to detect;
  • sample size - the number of observations (usually, participants, but sometimes also stimuli) in your study.
  • Note: because power depends on several variables, it is useful to think of power as a function with varying value rather than as a single fixed quantity.

Now the cool thing about the APES is that if you know any three of these elements then you can calculate the fourth. In reality, the two most common approaches when designing a study would be:

  1. to determine the appropriate sample size required to reject your null hypothesis, with high probability, for the effect size that you are interested in. That is, you decide your \(\alpha\), \(power\), and effect size, and from that you calculate for the sample size required in your study. Generally, the smaller the assumed effect size, the more participants you will need, assuming power and alpha are held constant.
  2. to determine the smallest effect size you can reliably detect given your sample size. That is, you know everything except the effect size. For example, say you are using an open dataset and you know they have run 100 participants, you can't add any more participants, and you want to know what is the minimum effect size you could detect from this dataset if you set \(power\) and \(\alpha\) at the field standards.

Hopefully that gives you an idea of how we use power to determine sample sizes for studies - and that the sample size should not just be pulled out of thin air. Both of these approaches described above are called a priori power analyses as you are stating the power level you want before (a priori means before) the study. However, you may now be thinking: if everything is connected, can we use the effect size from our study and the sample size to determine the power of the study after we have run it? No! Well, you can but it would be wrong to do so. This is actually called Observed or Post-Hoc power and most papers would discourage you from calculating it on the grounds that the effect size you are using is not the true effect size of the population you are interested in; it is just the effect size of your sample. As such any indication of power from this analysis is misleading. Avoid doing this. You can read more about why, here, in your own time if you like: Lakens (2014) Observed Power, and what to do if your editor asks for post-hoc power analyses. In short, stick to using only a priori power analyses approaches and use them to determine your required sample size or achievable reliable effect size.

8.2.2 Watch

You should now also watch this short but nonetheless highly informative video by Daniel Lakens on Power and Sample Size. It will help consolidate the above points. And his shirt is amazing!

Video:

Remember to make notes about power, effect sizes, and sample sizes, as processing the concepts into your own words will really help you to understand them better.

8.2.3 Optional

Finally, there are a number of great webpages and blogs that will help you understand the concepts in this chapter. Here are some that we think might be good for you to look at. You don't have to look at all of these to understand this chapter but do come back to them as they will really help you as you progress in becoming a responsible researcher. We are deliberately giving you a number of options here as for everyone there is that one analogy that will work best for you and that one paper that will make everything click into place. That example will be different from person to person so having a variety of explanations will help.

Job Done - Activity Complete!

Hopefully this has given you a good basis to understanding power, sample sizes, alpha, and effect sizes. These are difficult concepts to grasp and it will take a lot of time thinking about them and interacting with them before they really start to sink in. Hopefully however, if nothing else, the least you come away with is the idea that the number of participants you should run in a study is not an arbitrary decision but is in fact a relationship between the effect size you want to test for and the level of error (Type I or Type II) you are willing to accept.

As always, the best way to understand something is to put it into your own words so don't forget to go back and add any informative points to your Portfolio. Post any questions on the available forums for discussion or ask a member of staff if you are unsure.

8.3 InClass Activity

Hopefully you now have a decent understanding at least of the four APES that need to be considered when designing a study: \(\alpha\), \(power\), effect size and sample size. We are going to look more at calculating and understanding these elements today. You don't have to fully understand everything about power to complete this chapter - believe us when we say many seasoned researchers struggle with parts - you just need to get the general gist that there is always a level of acceptable error in hypothesis testing and we are trying to minimise that for a given effect size (i.e. the magnitude of the difference, relationship, association).

So let's jump into this a bit now and start running some analyses to help further our understanding of alpha, power, effect sizes and sample size! To help our understanding we will focus on t-tests for this chapter which you will know well from previous chapters.

Effect Sizes - Cohen's \(d\)

There are a number of different "effect sizes" that you can choose to calculate but a common one for t-tests is Cohen's d: the standardised difference between two means (in units of SD) and is written as d = effect-size-value. The key point is that Cohen's d is a standardised difference, meaning that it can be used to compare against other studies regardless of how the measurement was made. Take for example height differences in men and women which is estimated at about 5 inches (12.7 cm). That in itself is an effect size, but it is an unstandardised effect size in that for every sample that you test, that difference is dependent on the measurement tools, the measurement scale, and the errors contained within them (Note: ask Helena about the time she photocopied some rulers). As such using a standardised effect size allows you to make comparisons across studies regardless of measurement error. In standardised terms, the height difference above is considered a medium effect size (d = .5) which Cohen (1988, as cited by Schafer and Schwarz (2019)) defined as representing "an effect likely to be visible to the naked eye of a careful observer". Cohen (1988) in fact stated three sizes of Cohen's d that people could use as a guide:


Effect size    Cohen's d value
small          .2 to .5
medium         .5 to .8
large          > .8
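If you want to apply these bands in code, one possible sketch is shown below. The function name describe_d is hypothetical, and two choices in it are assumptions rather than anything Cohen (1988) specified: a value that falls exactly on a boundary is placed in the higher band (so d = .5 counts as medium), and anything below .2 is labelled "below small".

```r
# Hypothetical helper: label a Cohen's d using the bands in the table above.
# Assumptions: boundary values go to the higher band, and values under .2
# are labelled "below small".
describe_d <- function(d) {
  d <- abs(d)  # the sign only shows the direction of the difference
  bands <- cut(d,
               breaks = c(0, .2, .5, .8, Inf),
               labels = c("below small", "small", "medium", "large"),
               right = FALSE)  # intervals are closed on the left
  as.character(bands)
}

describe_d(c(.3, .5, 1.2))  # "small" "medium" "large"
```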


You may wish to read this paper later about different effect sizes in psychology - Schafer and Schwarz (2019) The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases.

One thing to note is that the formula for Cohen's d is slightly different depending on the type of t-test used. And even within a type of t-test the formula can sometimes change depending on who you read. For today, and this chapter, let's go with the following formulas:

  • One-sample t-test & within-subjects (paired-sample) t-test:

\[d = \frac{t}{\sqrt{N}}\]

  • Between-subjects (Independent) t-test:

\[d = \frac{2t}{\sqrt{df}}\]
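As a quick illustration, the two formulas can be written as small R functions. The function names and the t-values below are made up for this sketch and are not taken from any task in this chapter.

```r
# Cohen's d from a t-value, following the two formulas above.
# Function names are hypothetical, chosen for this sketch.

# One-sample and within-subjects (paired-sample) t-test: d = t / sqrt(N)
d_one_sample <- function(t_value, N) t_value / sqrt(N)

# Between-subjects (independent) t-test: d = 2t / sqrt(df)
d_between <- function(t_value, df) 2 * t_value / sqrt(df)

# Made-up results for illustration only
d_one_sample(t_value = 2.5, N = 16)  # t(15) = 2.5, so N = df + 1 = 16; d = 0.625
d_between(t_value = 3, df = 36)      # t(36) = 3; d = 1
```

Writing the formula out like this, rather than typing in a final number, keeps the calculation visible and easy to check.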

Let's now try using these formulas in order to calculate the effect sizes for given scenarios; we will work up to calculating power later in the chapter.

8.3.1 Task 1: Effect size from a one-sample t-test

  • You run a one-sample t-test and discover a significant effect, t(25) = 3.24, p = .003. Calculate d and determine whether the effect size is small, medium or large.
  • Use the appropriate formula from above for the one-sample t-tests.
  • You have been given a t-value and df (degrees of freedom), you still need to determine n before you calculate d.
  • According to Cohen (1988), the effect size is small (.2 to .5), medium (.5 to .8) or large (> .8).


Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

  • Enter, in digits, how many people were run in this study:
  • Which of these codes is the appropriate calculation of d in this instance:
  • Enter the correct value of d for this analysis rounded to 2 decimal places:
  • According to Cohen (1988), the effect size for this t-test would be considered:

8.3.2 Task 2: Effect size from between-subjects t-test

  • You run a between-subjects t-test and discover a significant effect, t(30) = 2.9, p = .007. Calculate d and determine whether the effect size is small, medium or large.
  • Use the appropriate formula above for between-subjects t-tests.
  • Remember that df = (n1 - 1) + (n2 - 1) for a between-subjects t-test, where n1 and n2 are the number of participants in each group.
  • According to Cohen (1988), the effect size is small (.2 to .5), medium (.5 to .8) or large (> .8).


Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

  • Enter, in digits, how many people were run in this study:
  • Which of these codes is the appropriate calculation of d in this instance:
  • Enter the correct value of d for this analysis rounded to 2 decimal places:
  • According to Cohen (1988), the effect size for this t-test would be considered:

8.3.3 Task 3: Effect Size from matched-pairs t-test

  • You run a matched-pairs t-test between an ASD sample and a non-ASD sample and discover a significant effect t(39) = 2.1, p < .05. How many people are there in each group? Calculate d and determine whether the effect size is small, medium or large.
  • You need the df value to determine N.
  • A matched pairs is treated like a paired-sample t-test but with two separate groups.


Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

  • Enter, in digits, how many people were in each group in this study. Note, not the total number of participants:
  • Which of these codes is the appropriate calculation of d in this instance:
  • Enter the correct value of d for this analysis rounded to 2 decimal places:
  • According to Cohen (1988), the effect size for this t-test would be considered:
  • df in a paired-samples and in a matched-pairs t-test is calculated as df = N - 1.
  • Conversely, to find the total number of participants: N = df + 1 so N = 39 + 1 = 40.
  • Given that this is a matched-pairs t-test, by design there has to be an equal number of participants in each group. Therefore 40 participants in each group.

8.3.4 Task 4: t-value and effect size for a between-subjects Experiment

  • You run a between-subjects design study and the descriptives tell you: Group 1, M = 10, SD = 1.3, n = 30; Group 2, M = 11, SD = 1.7, n = 30. Calculate t and d for this between-subjects experiment.
  • Before you can calculate d (using the appropriate formula for a between-subjects experiment), you need to first calculate t using the formula:

\[t = \frac{Mean_1 - Mean_2}{\sqrt{\frac{var_1}{n_1} + \frac{var_2}{n_2}}}\]

  • var stands for variance in the above formula. Variance is not the same as the standard deviation, right? Variance is measured in squared units. So for this equation, if you require variance to calculate t and you have the standard deviation, then you need to remember that \(var = SD^2\) (otherwise written as \(var = SD \times SD\)).
  • Now you have your t-value, but for calculating d you also need degrees of freedom. Think about how you would calculate df for a between-subjects experiment, taking n for both Group 1 and Group 2 into account.
  • Remember that convention is that people report the t as a positive. As such, convention also dictates that d is reported as a positive value.
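The sketch below walks through the same steps on a different, made-up set of descriptives (Group A: M = 20, SD = 4, n = 25; Group B: M = 23, SD = 3, n = 25), so that you can see the order of operations without being handed the answer to this task. All variable names are illustrative.

```r
# Made-up descriptives, not the values from this task
mean_a <- 20; sd_a <- 4; n_a <- 25
mean_b <- 23; sd_b <- 3; n_b <- 25

# Step 1: turn each SD into a variance (var = SD^2)
var_a <- sd_a^2
var_b <- sd_b^2

# Step 2: t = difference in means / sqrt(var1/n1 + var2/n2)
t_value <- (mean_a - mean_b) / sqrt((var_a / n_a) + (var_b / n_b))
t_value <- abs(t_value)  # by convention t is reported as a positive value

# Step 3: degrees of freedom for a between-subjects design
df <- (n_a - 1) + (n_b - 1)

# Step 4: between-subjects formula, d = 2t / sqrt(df)
d <- 2 * t_value / sqrt(df)

round(t_value, 2)  # 3
round(d, 2)        # 0.87
```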


Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

  • Enter the correct t-value for this test, rounded to two decimal places:
  • Which of these codes is the appropriate calculation of d in this instance:
  • Based on the above t-value, enter the correct value of d for this analysis rounded to 2 decimal places:
  • According to Cohen (1988), the effect size for this t-test would be described as:

Excellent! Now that you are comfortable with calculating effect sizes, we will look at using them to establish the sample size for a required power. One thing you will realise as we progress is that the true effect size in a population is something we do not know, but we need to justify one for our design. A clever approach is laid out by Daniel Lakens in the blog from the PreClass on the Smallest Effect Size of Interest (SESOI) - you set the smallest effect that you would be interested in! This can be determined through theoretical analysis, through previous studies, through pilot studies, or through rules of thumb like Cohen (1988). However, also keep in mind that the lower the effect size, the larger the sample size you will need. Everything is a trade-off.

Power Calculations

Today we are going to use the function pwr.t.test() to run our calculations from the pwr library. This is a really useful library of functions for various tests but we will just use it for t-tests right now. If you are using the Boyd Orr machines the pwr package is already installed and you will just need to call it like all other packages, e.g. library(pwr). Do not attempt to install it yourself on the Boyd Orr machines. If you are using your own laptop then feel free to install it.

Remember that for more information on the function pwr.t.test(), simply do ?pwr.t.test in the console. Or you can have a look at these webpages to get an idea (or bad ideas if you spot where they erroneously calculate post-hoc power!):

From these you will see that pwr.t.test() takes a series of inputs:

  • n - observations/participants, per group for the independent samples version, or the number of subjects or matched pairs for the paired and one-sample designs.
  • d - the effect size of interest
  • sig.level or \(\alpha\)
  • power or \(1-\beta\)
  • type - the type of t-test; e.g. "two.sample", "one.sample", "paired"
  • alternative - the type of hypothesis; "two.sided" for a two-tailed test, or "less" or "greater" for a one-tailed test

And it works on a leave one out principle. You give it all the info you have and it returns the element you are missing. So, for example, say you needed to know how many people per group you would need to detect an effect size as low as d = .4 with power = .8, alpha = .05 in a two.sample (between-subjects) t-test on a two.sided hypothesis test. You would do:

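library(pwr)  # provides pwr.t.test()

# n is left out, so pwr.t.test() solves for the sample size per group
pwr.t.test(d = .4,
           power = .8,
           sig.level = .05,
           alternative = "two.sided",
           type = "two.sample")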

Which will show you the following output, which, if you look at it, tells you that you need 99.0803248 people per condition.

## 
##      Two-sample t test power calculation 
## 
##               n = 99.08032
##               d = 0.4
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

But you only get whole people and we like to be conservative on our estimates so we would actually run 100 per condition. That is a lot of people!!!
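The leave-one-out principle works in the other direction too. As a small sketch, assuming the pwr package is installed, you can supply the rounded-up n of 100 and leave out power, and pwr.t.test() will solve for power instead. The object name power_check is just for illustration.

```r
library(pwr)  # assumes the pwr package is installed

# power is left out this time, so pwr.t.test() solves for power
power_check <- pwr.t.test(n = 100,
                          d = .4,
                          sig.level = .05,
                          alternative = "two.sided",
                          type = "two.sample")

# Slightly above .8, because 100 per group is rounded up from 99.08
power_check$power
```

Note that this is still an a priori calculation: the effect size is the one chosen at the planning stage, not one observed in collected data.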

One problem though is that the output of the pwr.t.test() is an object, and it is not that easy to get values out of it in a reproducible way. However, we have already seen a function in Chapter 5 that we can use to pluck values from objects - purrr::pluck(). And the code would look like this:

library(tidyverse)  # provides %>% and pluck()

n_test <- pwr.t.test(d = .4,
                     power = .8,
                     sig.level = .05,
                     alternative = "two.sided",
                     type = "two.sample") %>%
  pluck("n")  # keep only the per-group sample size

So when we call n_test we get the same answer as above, but it is saved as a single value and easier to work with:

n_test
## [1] 99.08032

And we could use the ceiling() function to round up to whole people:

n_test %>% ceiling()
## [1] 100

Note: ceiling() is better to use than round() when dealing with people as it always rounds up. For example, ceiling(1.1) gives you 2. round() on the other hand is useful for rounding an effect size, for example, to two decimal places - e.g. d = round(.4356, 2) would give you d = 0.44
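The difference between the two functions is easy to check in the console. This short sketch just reuses the sample size from the example above; the object name n_required is made up for illustration.

```r
n_required <- 99.08032  # the per-group n from the example above

ceiling(n_required)  # 100 - always rounds up, so you never recruit too few
round(n_required)    # 99  - rounds to the nearest whole number, which falls short

round(.4356, 2)      # 0.44 - fine for reporting an effect size
```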

We will use this approach pwr.t.test() %>% pluck() and pwr.t.test() %>% pluck() %>% ceiling() throughout the rest of this chapter to get used to it. But before you start with this next task, you will need to make sure you have loaded in the tidyverse.

8.3.5 Task 5: Sample size for standard power one-sample t-test

  • Assuming the smallest effect size of interest is a Cohen's d of d = .23, what would be the minimum number of participants you would need in a one-sample t-test, assuming \(power = .8\), \(\alpha = .05\), on a two-sided hypothesis?

Using a pipeline, store the answer as a single value called sample_size (e.g. think pluck()) and round up to the nearest whole participant.

  • Use the list of inputs above as a kind of checklist to clearly determine which inputs are known or unknown. This can help you enter the appropriate values to your code.
  • The structure of the pwr.t.test() would be very similar to the one shown above except two.sample would become one.sample
  • You will also need to use pluck("n") to help you obtain the sample size and %>% ceiling() to round up to the nearest whole participant.


Quickfire Questions

Answer the following question to check your answers. The solutions are at the end of the chapter to check against:

  • Enter the minimum number of participants you would need in this one-sample t-test:

8.3.6 Task 6: Effect size from a high power between-subjects t-test

  • Assuming you run a between-subjects t-test with 50 participants per group and want a power of .9, what would be the minimum effect size you can reliably detect? Assume the field standard \(\alpha = .05\) and alternative hypothesis settings ("two-tailed"). Using a pipeline, store the answer as a single value called cohens and round to two decimal places.
  • Again, use the list of inputs above as a kind of checklist to clearly determine which inputs are known or unknown. This can help you enter the values to your code.
  • You will also need to use pluck() to obtain Cohen's d, and round() so the value is rounded to two decimal places.
  • Don't forget the quotes when using pluck(). i.e. pluck("value") and not pluck(value)


Quickfire Questions

Answer the following questions to check your answers. The solutions are at the end of the chapter:

  • Based on the information given, what will you set type as in the function?
  • Based on the output, enter the minimum effect size you can reliably detect in this test, rounded to two decimal places:
  • According to Cohen (1988), the effect size for this t-test is
  • Say you run the study and find that the effect size determined is d = .50. Given what you know about power, select the statement that is most accurate:

8.3.7 Task 7: Power of Published Research

Thus far we have used hypothetical situations - now go look at the paper on the Open Stats lab website called Does Music Convey Social Information to Infants?. You can download the pdf and look at it, but here we will determine the power of the significant t-tests reported in Experiment 1 under the Results section on p. 489. There is a one-sample t-test and a paired-samples t-test to consider, summarised below. Assume testing was at power = .8, alpha = .05. Based on your calculations, is either of the stated effects underpowered?

  1. one-sample: t(31) = 2.96, p = .006
  2. paired t-test: t(31) = 2.42, p = .022
  • A one-sample t-test and a paired t-test use the same formula for Cohen's d.
  • To calculate n: n = df + 1.
  • Calculate the achievable Cohen's d for the studies and then calculate the established Cohen's d for the studies.


Group Discussion Point

Based on what you have found out, think about the following questions and discuss them in your groups:

  • Which of the t-tests do you believe to be potentially underpowered?
  • Why do you think this may be?

Additional information about this discussion can be found in the solutions at the end of this chapter.

One caveat to Tasks 6 and 7: We have to keep in mind that here we are looking at single studies using one sample from a potentially huge number of samples within a population. As such there will be a degree of variance in the true effect size within the population regardless of the effect size of one given sample. What that means is we have to be a little bit cautious when making claims about a study. Ultimately the higher the power the better as you can detect smaller effect sizes!

Job Done - Activity Complete!

Great! So hopefully you are now starting to see the interaction between alpha, power, effect sizes, and sample size. We should always want high powered studies and depending on the size of the effect we are interested in (small to large), and our \(\alpha\) level, this will determine the number of observations we need to make sure our study is well powered. Points to note:

  • Lowering the \(\alpha\) level (e.g. .05 to .01) will reduce the power.
  • Lowering the effect size (e.g. .8 to .2) will reduce the power.
  • Increasing power (.8 to .9) will require more participants.
  • It is also possible to increase power for a fixed sample size by reducing sources of noise in the study.
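Two of these points can be checked with a short sketch, assuming the pwr package is installed. The design values (d = .5, 40 per group) and the object names are arbitrary choices for illustration.

```r
library(pwr)  # assumes the pwr package is installed

# Same design (d = .5, 40 per group), two different alpha levels
power_alpha_05 <- pwr.t.test(n = 40, d = .5, sig.level = .05,
                             type = "two.sample")$power
power_alpha_01 <- pwr.t.test(n = 40, d = .5, sig.level = .01,
                             type = "two.sample")$power

# Same effect size and alpha, two different target power levels
n_power_80 <- ceiling(pwr.t.test(d = .5, power = .8, sig.level = .05,
                                 type = "two.sample")$n)
n_power_90 <- ceiling(pwr.t.test(d = .5, power = .9, sig.level = .05,
                                 type = "two.sample")$n)

c(power_alpha_05, power_alpha_01)  # power is lower at the stricter alpha
c(n_power_80, n_power_90)          # more participants needed for higher power
```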

A high-powered study looking to detect a small effect size at a low alpha may require a large number of participants!

Another point probably to consider for the future: what about studies with multiple observations per participant? How do you calculate power for this? This is a very common situation.

You should now be ready to complete the Homework Assignment for this lab. The assignment for this Lab is FORMATIVE and is NOT to be submitted and will NOT count towards the overall grade for this module. However, you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. If you have any questions, please post them on the forums!

8.4 Test Yourself

This is a formative assignment meaning that it is purely for you to test your own knowledge, skill development, and learning, and does not count towards an overall grade. However, you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. You will be instructed by the Course Lead on Moodle as to when you should attempt this assignment. Please check the information and schedule on the Level 2 Moodle page.

Lab 8: APES Assignment

In order to complete this assignment you first have to download the assignment .Rmd file which you need to edit for this assignment: titled GUID_Level2_Semester1_Lab8.Rmd. This can be downloaded within a zip file from the below link. Once downloaded and unzipped you should create a new folder that you will use as your working directory; put the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the assignment zip file from here.

NOTE: in nearly all of the problems below, you will need to replace NULL with a value or a pipeline of code that computes a value. Please pay special attention as to what the question is asking for as the output, e.g. value or a tibble; when asked for a value as an output, make sure it is a single value and not a value stored in a tibble. Finally, when altering code inside the code blocks, please do not re-order or rename the code blocks (T1, T2, ... etc.). If you do, this may impact your grade!

It's also recommended that you "Knit" a report to be able to see what you've accomplished and spot potential errors. A great thing to do is close the whole programme, restart it, and then knit your code. This will test whether you have remembered to include essential elements, such as libraries, in your code.

APES: Alpha, Power, Effect Size, and Sample Size

In the lab we have been looking at the interplay between the four components of Alpha, Power, Effect Size, and Sample Size. This is a very important part of experimental design to understand as it will help you understand which studies are worth paying attention to and it will help you design your own studies in the coming years so that you know just how many people to run and what to make of the effect that you find. If you have not yet done so, we highly recommend reading the blog suggested as PreClass reading material and carrying out the activities in the inclass activity. These will help you with both the practicalities and the interpretation of the following assignment.

Remember that this assignment is formative but the knowledge gained from the practical activities in this lab will be super important to your future-self!

Before starting let's check:

  1. The .Rmd file is saved in your working directory. For assessments we ask that you save it with the format GUID_Level2_Semester1_Lab8.Rmd where GUID is replaced with your GUID. Though this is a formative assessment, it may be good practice to do the same here.

Libraries

  • You will need to use the pwr, tidyverse and broom libraries in this assignment, so load them in the library code chunk below.

  • Hint: library(package)

Basic Calculations

8.4.1 Assignment Task 1

  • You set up a study so that it has a power value of \(power = .87\). To two decimal places, what is the Type II error rate of your study?

Replace the NULL in the T1 code chunk below with either a single value, or with mathematical notation, so that error_rate returns the actual value of the Type II error rate for this study. By mathematical notation we mean you to use the appropriate formula but insert the actual values.

error_rate <- NULL

8.4.2 Assignment Task 2

  • You run an independent t-test and discover a significant effect, t(32) = 3.26, p < .05. Using the appropriate formula, given in the inclass activity, calculate the effect size of this t-test.

Replace the NULL in the T2 code chunk below with mathematical notation so that effect1 returns the value of the effect size. Do not round the value.

effect1 <- NULL

8.4.3 Assignment Task 3

  • You run a dependent t-test and discover a significant effect, t(43) = 2.24, p < .05. Using the appropriate formula, given in the inclass activity, calculate the effect size of this t-test.

Replace the NULL in the T3 code chunk below with mathematical notation so that effect2 returns the value of the effect size. Do not round the value.

effect2 <- NULL

Using the Power function

8.4.4 Assignment Task 4

  • Replace the NULL in the T4 code chunk below with a pipeline combining pwr.t.test(), pluck() and ceiling() to determine how many participants are needed to sufficiently power a paired-samples t-test at \(power = .9\) with \(d = .5\). Assume a two-sided hypothesis with \(\alpha = .05\). Round the answer up to the nearest whole participant using ceiling() and store this value in participants.

  • Hint: Remember the quotes on the pluck

participants <- NULL 

8.4.5 Assignment Task 5

  • Using a pipeline similar to Task 4, what is the minimum effect size that a one-sample t-test study (two-tailed hypothesis) could reliably detect given the following details: \(\beta = .16, \alpha = 0.01, n = 30\)? Round to two decimal places and replace the NULL in the T5 code chunk below to store this value in effect3.

  • Hint: Remember you are going to round() and not ceiling()

effect3 <- NULL

8.4.6 Assignment Task 6

Study 1

  • You run a between-subjects study and establish the following descriptives: Group 1 (M = 5.1, SD = 1.34, N = 32); Group 2 (M = 4.4, SD = 1.27, N = 32). Replace the NULL in the T6 code chunk below with the following formula, substituting in the appropriate values, to calculate the t-value of this test. Calculate as Group1 minus Group2. Store the t-value in tval. Do not round tval and do not include the t = part of the formula.

\[ t = \frac {{\bar{x_{1}}} - \bar{x_{2}}}{ \sqrt {\frac {{s_{1}}^2}{n_{1}} + \frac {{s_{2}}^2}{n_{2}}}}\]

tval <- NULL

8.4.7 Assignment Task 7

  • Using the tval calculated in Task 6, calculate the effect size of this study and store it as d1 in the T7 code chunk below, replacing the NULL with the appropriate formula and values. Do not round d1.

  • Hint: Think between-subjects

d1 <- NULL

8.4.8 Assignment Task 8

Assuming \(power = .8\), \(\alpha =.05\) on a two-tailed hypothesis, based on the d1 value in Task 7 and the smallest achievable effect size of this study, which of the below statements is correct.

  1. The smallest effect size that this study can determine is d = .71. The detected effect size, d1, is larger than this and as such this study is potentially suitably powered
  2. The smallest effect size that this study can determine is d = .17. The detected effect size, d1, is larger than this and as such this study is potentially suitably powered
  3. The smallest effect size that this study can determine is d = .17. The detected effect size, d1, is smaller than this and as such this study is potentially suitably powered
  4. The smallest effect size that this study can determine is d = .71. The detected effect size, d1, is smaller than this and as such this study is potentially not suitably powered

Replace the NULL in the T8 code chunk below with the number of the statement that is a true summary of this study. It may help you to calculate and store the smallest achievable effect size of this study in poss_d.

  • Hint: use poss_d to calculate the smallest possible effect size of this study to help you answer this question.
poss_d <- NULL

answer_T8 <- NULL

8.4.9 Assignment Task 9

Study 2

Below is a paragraph from the results of Experiment 4 from Schroeder, J., & Epley, N. (2015). The sound of intellect: Speech reveals a thoughtful mind, increasing a job candidate's appeal. Psychological Science, 26, 877-891. We saw this paper in Lab 5 but you can find out more details at Open Stats Lab (https://sites.trinity.edu/osl/data-sets-and-activities/t-test-activities).

Recruiters believed that the job candidates had greater intellect - were more competent, thoughtful, and intelligent - when they listened to pitches (M = 5.63, SD = 1.61, n = 21) than when they read pitches (M = 3.65, SD = 1.91, n = 18), t(37) = 3.53, p < .01, 95% CI of the difference = [0.85, 3.13], d1 = 1.16. The recruiters also formed more positive impressions of the candidates - rated them as more likeable and had a more positive and less negative impression of them - when they listened to pitches (M = 5.97, SD = 1.92) than when they read pitches (M = 4.07, SD = 2.23), t(37) = 2.85, p < .01, 95% CI of the difference = [0.55, 3.24], d2 = 0.94. Finally, they also reported being more likely to hire the candidates when they listened to pitches (M = 4.71, SD = 2.26) than when they read the same pitches (M = 2.89, SD = 2.06), t(37) = 2.62, p < .01, 95% CI of the difference = [0.41, 3.24], d3 = 0.86.

Using the pwr.t.test() function, what is the minimum effect size that this paper could have reliably detected? Test at \(power = .8\) for a two-sided hypothesis. Use the \(\alpha\) stated in the paragraph and the smallest n stated; store the value as effect4 in the T9 code chunk below. Replace the NULL with your pipeline and round the effect size to two decimal places.

effect4 <- NULL

8.4.10 Assignment Task 10

Given the value of effect4 calculated in Task 9, and the stated alpha in the paragraph and the smallest n of the two groups, which of these statements is true.

  1. This study has enough power to reliably detect effects at the size of d3 and larger.
  2. This study has enough power to reliably detect effects at the size of only d1.
  3. This study has enough power to reliably detect effects at the size of d2 and larger, but not d3.
  4. This study does not have enough power to reliably detect effect sizes at d1 or lower.

Replace the NULL in the T10 code chunk below with the number of the statement that is TRUE, storing the single value in answer_t10.

answer_t10 <- NULL

8.4.11 Assignment Task 11

Last but not least:

Read the following statements.

  a. In general, increasing sample size will increase the power of a study.
  b. In general, smaller effect sizes require fewer participants to detect at \(power = .8\).
  c. In general, lowering alpha (from .05 to .01) will decrease the power of a study.

Now look at the below four summary statements of the validity of the statements a, b and c.

  1. Statements a, b and c are all TRUE.
  2. Statements a and c are both TRUE.
  3. Statements b and c are both TRUE.
  4. None of the statements are TRUE.

Replace the NULL in the T11 code chunk below with the number of the statement that is correct, storing the single value in answer_t11.

answer_t11 <- NULL

8.4.12 The pwr package

An alternative solution to Task 9 would be to use the pwr.t2n.test() function from the pwr package (Champely 2020). This would allow you to enter the n of both groups as there is an n1 and an n2 argument. Were you to use this, entering n1 = 18, n2 = 21, alpha = .01, the d drops just a little, changing the interpretation of Task 10. Feel free to try this analysis and see if you can figure out what would be the alternative answer to Task 10.
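
If you want to try the unequal-groups analysis described above, the call might look something like the sketch below. It assumes the pwr package is installed, and the object name d_unequal is just an illustrative choice. The output is deliberately not shown so that you can compare it with effect4 yourself.

```r
library(pwr)  # assumes the pwr package is installed

# Minimum detectable effect size when the two groups differ in size
# (d_unequal is a hypothetical name chosen for this sketch)
d_unequal <- pwr.t2n.test(n1 = 18,
                          n2 = 21,
                          sig.level = .01,
                          power = .8,
                          alternative = "two.sided")$d

round(d_unequal, 2)
```

Here `$d` is used in place of pluck("d") so that the sketch only needs the pwr package; both approaches extract the same element.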

Job Done - Activity Complete!

Well done, you are finished! Now you should go check your answers against the solution file which can be found at the end of this chapter. You are looking to check that the resulting output from the answers that you have submitted is exactly the same as the output in the solution - for example, remember that a single value is not the same as a coded answer. Where there are alternative answers, it means that you could have submitted any one of the options as they should all return the same answer. If you have any questions please post them on the forums.

8.5 Solutions to Questions

Below you will find the solutions to the questions for the Activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.

8.5.1 InClass Activities

8.5.1.1 InClass Task 1

d <- 3.24 / sqrt(25 +1)
  • Giving an effect size of d = 0.64 and as such a medium to large effect size according to Cohen (1988)

8.5.1.2 InClass Task 2

d <- (2*2.9) / sqrt(30)
  • Giving an effect size of d = 1.06 and as such a large effect size according to Cohen (1988)

8.5.1.3 InClass Task 3

N = 39 + 1

d <- 2.1 / sqrt(N)
  • Giving an N = 40 and an effect size of d = 0.33. This would be considered a small effect size according to Cohen (1988)

8.5.1.4 InClass Task 4

t = (10 - 11)/sqrt((1.3^2/30) + (1.7^2/30))

d = (2*t)/sqrt((30-1) + (30-1))
  • Giving a t-value of t = 2.56 and an effect size of d = 0.67.
  • The code returns negative values because the first mean is the smaller of the two, but remember that convention is that people tend to report the t and d as positive values.
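
The formulas used in InClass Tasks 1 to 4 can be wrapped into small helper functions, as in the sketch below. The function names (d_within, d_between, t_from_summary) are hypothetical and only for illustration; the values passed in are the ones from the solutions above.

```r
# Illustrative helpers; the function names are hypothetical, not from the lab.

# One-sample or paired t-test: d = t / sqrt(N), where N = df + 1
d_within <- function(t, df) {
  t / sqrt(df + 1)
}

# Between-subjects t-test: d = 2t / sqrt(df), where df = (n1 - 1) + (n2 - 1)
d_between <- function(t, df) {
  (2 * t) / sqrt(df)
}

# t-value from group means, standard deviations and group sizes
t_from_summary <- function(m1, m2, s1, s2, n1, n2) {
  (m1 - m2) / sqrt((s1^2 / n1) + (s2^2 / n2))
}

d_task1 <- d_within(3.24, df = 25)                    # InClass Task 1 values
d_task2 <- d_between(2.9, df = 30)                    # InClass Task 2 values
t_task4 <- t_from_summary(10, 11, 1.3, 1.7, 30, 30)   # InClass Task 4 values
d_task4 <- d_between(t_task4, df = (30 - 1) + (30 - 1))

# abs() follows the convention of reporting t and d as positive values
round(abs(c(d_task1, d_task2, t_task4, d_task4)), 2)
# → 0.64 1.06 2.56 0.67
```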

8.5.1.5 InClass Task 5

sample_size <- pwr.t.test(d = .23,
                          power = .8, 
                          sig.level = .05, 
                          alternative = "two.sided", 
                          type = "one.sample") %>%
  pluck("n") %>% 
  ceiling()
  • Giving a sample size of n = 151

8.5.1.6 InClass Task 6

cohens <- pwr.t.test(n = 50,
                    power = .9, 
                    sig.level = .05, 
                    alternative = "two.sided", 
                    type = "two.sample") %>% 
  pluck("d") %>% 
  round(2)
  • Giving a Cohen's d effect size of d = 0.65

8.5.1.7 InClass Task 7

Example 1

ach_d_exp1 <- pwr.t.test(power = .8, 
                         n = 32, 
                         type = "one.sample", 
                         alternative = "two.sided", 
                         sig.level = .05) %>% 
  pluck("d") %>% 
  round(2) 

exp1_d <- 2.96/sqrt(31+1) 
  • Giving an achievable effect size of 0.51 and they found an effect size of 0.52.

This study seems ok as the authors could achieve an effect size as low as .51 and found an effect size at .52

Example 2

ach_d_exp2 <- pwr.t.test(power = .8, 
                         n = 32, 
                         type = "paired", 
                         alternative = "two.sided", 
                         sig.level = .05) %>% 
  pluck("d") %>% 
  round(2) 

exp2_d <- 2.42/sqrt(31+1) 
  • Giving an achievable effect size of 0.51 and they found an effect size of 0.43.

This effect might not be reliable given that the effect size found was much lower than the achievable effect size. The issue here is that the researchers established their sample size based on a previous effect size and not on the minimum effect size that they would find important. If an effect size as small as .4 was important then they should have powered all studies to that level and run the appropriate n of ~52 babies (see below). The flipside, of course, is that obtaining 52 babies isn't easy, which is why some people consider the Many Labs approach a good way ahead.

ONE CAVEAT to the above is that before making the assumption that this study is therefore flawed, we have to keep in mind that this is one study using one sample from a potentially huge number of samples within a population. As such there will be a degree of variance in the true effect size within the population regardless of the effect size of one given sample. What that means is we have to be a little bit cautious when making claims about a study. Ultimately the higher the power the better.
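
To get a feel for how much the effect size can vary from sample to sample, here is a minimal simulation sketch. It assumes a population with a true effect size of d = .5 and studies of n = 32 (both values are illustrative assumptions, as is the seed), and computes a one-sample Cohen's d for each of 1000 simulated samples.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

true_d <- 0.5   # assumed population effect size (illustrative)
n      <- 32    # sample size of each simulated study (illustrative)

# Draw 1000 samples and compute the one-sample Cohen's d for each
sim_d <- replicate(1000, {
  scores <- rnorm(n, mean = true_d, sd = 1)
  mean(scores) / sd(scores)
})

# Spread of the observed effect sizes around the true value
summary(sim_d)
quantile(sim_d, probs = c(.025, .975))
```

Even though every simulated study samples the same population, the observed effect sizes cover a fairly wide range, which is why a single study's effect size should be treated with some caution.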

Below you could calculate the actual sample size required to achieve a power of .8:

sample_size <- pwr.t.test(power = .8,
                          d = .4,
                          type = "paired", 
                          alternative = "two.sided", 
                          sig.level = .05) %>%
  pluck("n") %>% 
  ceiling()
  • Suggesting a sample size of n = 52 would be appropriate.

8.5.2 Test Yourself Activities

Libraries

library(broom)      # provides tidy(), used in the Task 5 solution
library(pwr)
library(tidyverse)

8.5.2.1 Assignment Task 1

error_rate <- 1 - .87
  • The Type II error rate of your study would be \(\beta\) = 0.13.

8.5.2.2 Assignment Task 2

effect1 <- (2*3.26)/sqrt(32)
  • The effect size would be d = 1.1525841

8.5.2.3 Assignment Task 3

effect2 <- 2.24/sqrt(43+1)
  • The effect size would be d = 0.3376927

8.5.2.4 Assignment Task 4

participants <- pwr.t.test(power = .9,
                           d = .5,
                           sig.level = 0.05,
                           type = "paired",
                           alternative = "two.sided") %>% 
  pluck("n") %>% 
  ceiling()
  • Given the detailed scenario, the appropriate number of participants would be n = 44

8.5.2.5 Assignment Task 5

effect3 <- power.t.test(power = 1-.16,
                      n = 30,
                      sig.level = 0.01,
                      type = "one.sample",
                      alternative = "two.sided") %>% 
  tidy() %>% 
  pull(delta) %>% 
  round(2)
  • Given the detailed scenario, we would be able to detect an effect size of d = 0.69

8.5.2.6 Assignment Task 6

tval <- (5.1 - 4.4) / sqrt((1.34^2/32) + (1.27^2/32))
  • Given the stated means and standard deviations, the t-value for this study would be t = 2.1448226

8.5.2.7 Assignment Task 7

d1 <- (2*tval)/sqrt((32-1)+(32-1))
  • Given the t-value in Task 6, the effect size of this study would be d = 0.5447855.

8.5.2.8 Assignment Task 8

poss_d <- pwr.t.test(power = .8,
                     n = 32,
                     sig.level = 0.05,
                     type = "two.sample",
                     alternative = "two.sided") %>% 
  pluck("d") %>% 
  round(2)

answer_T8 <- 4
  • The smallest effect size that this study can determine is d = 0.71. The detected effect size, d1, is smaller than this (d1 = 0.5447855) and as such this study is not suitably powered.
  • Given that outcome, the 4th statement is the most suitable answer - answer_T8 = 4.

8.5.2.9 Assignment Task 9

effect4 <- pwr.t.test(power = .8,
                      n = 18,
                      sig.level = .01,
                      alternative = "two.sided",
                      type = "two.sample") %>% 
  pluck("d") %>% 
  round(2)
  • The smallest stated n is n = 18 and the stated \(\alpha\) is \(\alpha\) = .01
  • Given these details, the minimum effect size that this paper could have reliably detected was d = 1.2

8.5.2.10 Assignment Task 10

answer_t10 <- 4
  • This study does not have enough power to detect effect sizes at d1 or lower and as such answer_t10 = 4
  • However, it is worth keeping in mind that we are only looking at one study here which drew one sample from a population of samples. This means that there is always uncertainty about the true effect size of a difference or association - taking a different sample may have given a different effect size. As such, the comparison we are making here is not entirely valid and we should see it more as a reminder that we should always think of power as more in the planning of studies rather than in the search for criticism.

8.5.2.11 Assignment Task 11

answer_t11 <- 2
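  • In general, increasing sample size will increase the power of a study, whereas lowering alpha (from .05 to .01) will decrease the power of a study. Statement b is false because smaller effect sizes require more, not fewer, participants to detect. As such, statements a and c are both TRUE and answer_t11 = 2.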
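
If you would like to check these three general statements for yourself, the sketch below uses the base R function power.t.test(). All of the values (d = .5, n = 20 or 50 per group, and so on) are illustrative assumptions rather than values from the assignment, and the object names are hypothetical.

```r
# All values below are illustrative assumptions, not taken from the assignment
# (two-sample, two-sided tests; sd = 1 so that delta equals Cohen's d)

# a) Larger samples give more power (d = .5, alpha = .05)
power_n20 <- power.t.test(n = 20, delta = .5, sd = 1, sig.level = .05)$power
power_n50 <- power.t.test(n = 50, delta = .5, sd = 1, sig.level = .05)$power

# b) Smaller effects need MORE participants per group at power = .8
n_small_d <- power.t.test(delta = .2, sd = 1, power = .8, sig.level = .05)$n
n_large_d <- power.t.test(delta = .8, sd = 1, power = .8, sig.level = .05)$n

# c) Lowering alpha reduces power (n = 50 per group, d = .5)
power_alpha01 <- power.t.test(n = 50, delta = .5, sd = 1, sig.level = .01)$power

c(power_n20 = power_n20, power_n50 = power_n50, power_alpha01 = power_alpha01)
ceiling(c(n_small_d = n_small_d, n_large_d = n_large_d))
```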

Chapter Complete!

8.6 Additional Material

Below is some additional material that might help you understand APES a bit more and some additional ideas.

A different power function - power.t.test()

First thing we wanted to mention was that you can still do this chapter if you don't have the pwr library installed. You could instead use the power.t.test() function which is a function available in base R, meaning that it is included when you install R, so you do not need to install any additional package. This is handy to know when you are using a computer that you can't install libraries on. The pwr library offers more functions and is easier to follow, but for now let's just use the base function power.t.test().

Again, remember that for more information on this function, simply do ?power.t.test in the console. On doing this you will see that power.t.test() takes a series of inputs:

  • n - observations/participants, per group for the independent samples version, or the number of subjects or matched pairs for the paired and one-sample designs.
  • delta - the difference between means
  • sd - standard deviation; note: if sd = 1 then delta = Cohen's d
  • sig.level or \(\alpha\)
  • power or \(1-\beta\)
  • type - the type of t-test; e.g. "two.sample", "one.sample", "paired"
  • alternative - the type of hypothesis; "two.sided", "one.sided"

And it works on a leave one out principle, just like in the main chapter - except there is the added inclusion of sd and delta instead of d. But other than those differences it is very similar to what we did in the main activities of the chapter. You give it all the info you have and it returns the element you are missing. So, returning to the example from the main chapter, say you needed to know how many people per group you would need to detect an effect size as low as d = .4 with power = .8, alpha = .05 in a two.sample (between-subjects) t-test on a two.sided hypothesis test. You would do:

power.t.test(delta = .4,
             sd = 1,
             power = .8,
             sig.level = .05,
             alternative = "two.sided",
             type = "two.sample")

Which gives the following output:

## 
##      Two-sample t test power calculation 
## 
##               n = 99.08057
##           delta = 0.4
##              sd = 1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

And it would tell you that you would need 99.08 people per condition which, rounded up to whole participants, matches the 100 per condition that we saw in the chapter. That really is a lot of people!

And just in case you were wondering, you can also use pluck() to pull out individual values as follows:

n_test <- power.t.test(delta = .4, 
                       sd = 1,
                       power = .8,
                       sig.level = .05,
                       alternative = "two.sided",
                       type = "two.sample") %>% 
  pluck("n")

And when you call

n_test

You again see:

## [1] 99.08057

So hopefully that shows you an alternative if there is an issue with your pwr library. We would recommend using the pwr library where possible.
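
One further point on the sd argument listed above: because Cohen's d is the difference between means divided by the standard deviation, power.t.test() should give the same answer whether you enter raw units or standardised units, as long as delta / sd is the same. The sketch below assumes a hypothetical 2-point difference on a scale with sd = 5, which is d = 2 / 5 = .4; the object names are illustrative.

```r
# Raw-unit version: an assumed 2-point difference on a scale with sd = 5
n_raw <- power.t.test(delta = 2,
                      sd = 5,
                      power = .8,
                      sig.level = .05,
                      alternative = "two.sided",
                      type = "two.sample")$n

# Standardised version: delta = .4 with sd = 1, i.e. d = .4
n_std <- power.t.test(delta = .4,
                      sd = 1,
                      power = .8,
                      sig.level = .05,
                      alternative = "two.sided",
                      type = "two.sample")$n

# Both calls should ask for the same number of participants per group
ceiling(c(n_raw = n_raw, n_std = n_std))
```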

Cohen's d to r

As we said in the chapter there are actually a lot of different effect sizes that you could calculate and Cohen's d is only one of them. An alternative is \(r\). You will see \(r\) a lot more as we progress through the book, and in particular around correlations, but it is fair to say that it is becoming more of a standard effect size for t-tests as well. One thing that people do not like about Cohen's d is that it can actually be a very large number - well above 1 - and that can be difficult to compare across studies. \(r\) on the other hand can't go above 1 and is therefore considered easier to compare across studies. The good news is that \(r\) and Cohen's d can be calculated from each other using the below formulas:

\(r = \frac{d}{\sqrt{d^2 + 4}}\)

and

\(d = \frac{2 \times r}{\sqrt{1-r^2}}\)

You can present either for a t-test in the format of:

  • t(df) = t-value, p = p-value, d = d-value

or

  • t(df) = t-value, p = p-value, r = r-value
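
The two conversion formulas can be written as a pair of small functions, as in the sketch below. The function names d_to_r and r_to_d are hypothetical, and d = .8 is just an example value.

```r
# Hypothetical helper names; the formulas are the two given above

# Cohen's d to r
d_to_r <- function(d) {
  d / sqrt(d^2 + 4)
}

# r to Cohen's d
r_to_d <- function(r) {
  (2 * r) / sqrt(1 - r^2)
}

r_example <- d_to_r(0.8)   # convert an example d of .8 to r
round(r_example, 2)
# → 0.37

# Converting back should return the original d
r_to_d(r_example)
```

Because each formula is the inverse of the other, converting d to r and back again is a quick way to check that you have typed them in correctly.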

How to choose an effect size

A really quick analogy from Ian Walker's "Research Methods and Statistics" is to say that your test is not a stats test but a telescope. Say you have a telescope that is specifically designed only for spotting animals that are the size of elephants or larger (similar to saying a Cohen's d of .8 or greater, for example - a very big effect). If your telescope can only reliably detect something down to the size of an elephant, but when you look through it you see something smaller that you think might be a mouse, you can't say that the "object" definitely is a mouse as you don't have enough power in your telescope - it is too blurry. But likewise you can't rule out that it is a mouse, as that would be something you don't know for sure. Both of these are true because your telescope was only designed to spot things the size of an elephant or larger. You only bought a telescope that was able to spot elephants because that was all you were interested in. Had you been interested in spotting mice you would have had to buy a more powerful telescope. And that is the point of Lakens' SESOI (smallest effect size of interest) - you power to the minimum effect size (minimum object size) you would be interested in. This is why it is imperative that you decide before your study what effect you are interested in - and you can base this on previous literature or theory.

Interpreting and writing up power

A few points on interpreting power to consolidate things a bit. Firstly, it is great that you are now thinking about power and effect sizes in the first place. It is important that this becomes as much second nature as thinking about the design of your study. In future years and future studies, the first question you should ask yourself when designing your study or secondary analysis is: what size are my APES - Alpha, Power, Effect Size and Sample Size? And remember that a priori power analysis is the way ahead: the power and alpha are determined in advance of the study and you use them to determine the effect size or the sample size.

Power is now stated more and more commonly in papers and you will start to notice it in the Methods or Results sections. You will see something along the lines of "Based on a power =..... and alpha =...., given we had X voices in our sample, a power analysis (pwr package citation) revealed d = ...... as the minimum effect size we could reliably determine."
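
As a rough sketch of how the numbers in such a sentence could be generated, the code below runs a power analysis with base R's power.t.test() and pastes the result into the template. The design values (32 voices, a one-sample test, power = .8, alpha = .05) and the object names are illustrative assumptions, not values from a real study.

```r
# Illustrative design values (assumptions, not from a real study)
n_voices     <- 32
alpha_level  <- .05
target_power <- .8

# Minimum effect size that this design could reliably detect
# (sd defaults to 1, so delta is on the Cohen's d scale)
min_d <- power.t.test(n = n_voices,
                      power = target_power,
                      sig.level = alpha_level,
                      type = "one.sample",
                      alternative = "two.sided")$delta

# Fill the write-up template with the values
write_up <- sprintf(
  "Based on power = %.2f and alpha = %.2f, given we had %d voices in our sample, a power analysis revealed d = %.2f as the minimum effect size we could reliably determine.",
  target_power, alpha_level, n_voices, min_d
)

write_up
```

Generating the sentence from the stored values, rather than typing the numbers by hand, means the write-up cannot drift out of step with the analysis.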

But how do you interpret a study in terms of power? Well, let's say you run a power analysis for a t-test (or for a correlation), and you set the smallest effect size of interest as d = .4 (or the equivalent r-value). If you then run your analysis and find d = .6 and the effect is significant, then your study had enough power to determine this effect. The effect that you found was bigger than the smallest effect you could reliably have found. You can have some confidence that you have a reliable effect at the given power and alpha values. However, say that instead of d = .6 you found a significant effect but with an effect size just below .4, say d = .3 - the effect size you found is smaller than the smallest effect you could reliably find. In this case you have to be cautious as you are still unclear as to whether there actually is an effect or whether you have found an effect by chance due to your study not having enough power to reliably detect an effect size that small. You can't say for sure that there is an effect or that there isn't an effect. You need to consider both stances in your write up. Remember though that you have sampled a population, so how representative that sample is of your population will also influence the validity of your power. Each sample will give a slightly different effect size.

Alternatively, and probably quite likely in many undergraduate projects due to time constraints, say you find a non-significant effect at an effect size smaller than what you predicted; say you find a non-significant effect with an effect size of d = .2 and your power analysis said you could only reliably detect an effect as small as d = .4. The issue you have here is that you can't determine solely based on this study whether a) you have a non-significant effect because you are underpowered or b) you have a non-significant effect because there is actually no effect in the first place. Again in your discussion you would need to consider both stances. What you can however say is that the effect that you were looking for is not any bigger than \(d = .4\). That is still useful information. OK, you don't know how small the effect really is, but you can rule out any effect size bigger than your original d-value. In turn this helps future researchers plan their studies better and can guide them better in knowing how many participants to run. See how useful it would be if we published null findings!

Basically, when your test finds an effect size smaller than you can detect, you don't know what it is but you know what it isn't - we aren't sure if it is a mouse but we know it is not an elephant. Instead you would use previous findings to support the object being a mouse or not, but caveat the conclusion with the suggestion that the test isn't really sensitive to finding a mouse. The same applies to a finding that has an effect size smaller than you can detect. You can use previous literature to support there not being an effect but you can't rule it out for sure. You might have actually found an effect had you had a more powerful test, just like you might have been able to determine that it was a mouse had you had a more powerful telescope.

Taking this a bit further, in some studies there really is enough power (in terms of N - say a study of 25000 participants) to find a flea on the proverbial mouse, but where nevertheless there is a non-significant finding. In this case you have the fortunate situation where you have a well-powered study and so can say with some degree of confidence that your hypothesis and design are unlikely to ever produce a replicable significant result. That is probably about as certain as you can get in our science or as close as you can get to a "fact", a very rare and precious thing. However, incredibly high powered studies, with lots of participants, tend to be able to find any difference as a significant difference. A within-subjects design with 10000 participants (\(power = .8, \alpha = .05\)) can reliably detect an incredibly small effect size of around d = .03. The question at that stage is whether that effect has any real world significance or meaning.
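
You can explore just how small a detectable effect becomes with very large samples using base R's power.t.test(), as in the sketch below. The between-subjects comparison with 10000 participants per group is an added assumption for illustration, and the object names are hypothetical.

```r
# Within-subjects (paired) design with 10000 participants
min_d_within <- power.t.test(n = 10000,
                             power = .8,
                             sig.level = .05,
                             type = "paired",
                             alternative = "two.sided")$delta

# For comparison: between-subjects design with 10000 participants PER GROUP
min_d_between <- power.t.test(n = 10000,
                              power = .8,
                              sig.level = .05,
                              type = "two.sample",
                              alternative = "two.sided")$delta

# sd defaults to 1, so both values are on the Cohen's d scale
round(c(within = min_d_within, between = min_d_between), 3)
```

Both values are far below what Cohen (1988) would call a small effect, which is why the practical meaning of a significant result in a very large study needs careful thought.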

So the take-home message here is that your discussion should always consider the result in relation to the hypothesis, integrating previous research and theory, and if there is an additional issue of power, then your discussion could also consider the result in relation to whether you can truly determine the effect and how that might be resolved (e.g. re-assessing the effect size, changing the design (within-subjects designs are more powerful), a low sample size, power set too high (e.g. .9), alpha set too low (e.g. .01)). This issue of power would probably be a small part of the generalisability/limitation section.

Note: In all of the above you can swap effect and relationship, d and r, and other analyses accordingly.

End of Additional Material!

References

Champely, Stephane. 2020. Pwr: Basic Functions for Power Analysis. https://github.com/heliosdrm/pwr.