Lab 6 NHST and One-sample t-tests

6.1 Overview

In the previous chapters we have talked a lot about probability, comparing values across groups, and the inference from a sample to a population. In effect this is the essence of a lot of quantitative research that uses Null Hypothesis Significance Testing (NHST). You collect a sample, calculate a summary statistic about that sample, and use probability to establish the likelihood of that statistic occurring given certain situations.

However, these concepts and ideas are hard to grasp at first and take some playing around with data before they start to make sense. To demonstrate these ideas further, and to start introducing commonly used tests and approaches to reproducible science, we will look at data related to sleep - a very important practice for the consolidation of learning that most of us do not get enough of! The study we will use to explore NHST further is by one of our team and makes use of a well-known task in Psychology, the Posner Paradigm: Woods et al (2009) The clock as a focus of selective attention in those with primary insomnia: An experimental study using a modified Posner paradigm.

In Chapter 6, through the activities, we will:

  1. Recap on testing a hypothesis through null hypothesis significance testing (NHST).
  2. Learn about approaches to reproducible experiments.
  3. Learn about binomial tests, as well as one-sample and independent t-tests.
  4. Learn about Posner paradigms and attention (PreClass), and the recency effect (InClass).

You don't need to read this to complete the activities, but it might help them make more sense.

The Posner paradigm (Posner, 1980), or the Posner Cueing task, is an attentional shift task, often used in a variety of fields to test spatial attention and how this is impacted by disorders or injury. It works by having participants look at a fixation cross in the center of a screen. To either side is an empty box. After a short delay, a cue (e.g. an arrow, an asterisk, or some other attention grabbing image) appears in one of the boxes (e.g. the box to the left of the fixation). This stays on screen for a few hundred milliseconds and is then replaced by a second image called the target (e.g. a different shape or image). Participants then have to respond left or right depending on which side of the fixation the target appeared. The dependent variable (DV) is the time taken to respond to the target appearing.

Key to the task is that in most trials the target will appear on the same side as the cue - the cue facilitates the target - and so participants will be quicker to respond. These are called valid trials. However, on some occasions the target will appear on the other side from the cue - e.g. the cue is on the left but the target appears on the right - and these are called invalid trials; participants will be slower to respond here as the cue misleads the participant.

From that, you should be starting to get an idea of how a Posner paradigm can help to measure attention and how it can help determine if people have issues in shifting attention (particularly from the invalid trials).

Reference:

Posner, M. (1980) Orienting of attention. Quarterly Journal of Experimental Psychology, 32(1), 3-25

6.2 PreClass Activity

ManyLabs - an approach to reproducible science

As you will learn from reading papers around the reproducibility crisis, findings from experiments tend to be more reproducible when we increase participant numbers, as this increases the power of the study's design - the likelihood of an experimental design detecting an effect, of a given size, when there is an effect to detect.

Power is a rather tricky concept in research that essentially amounts to the probability of your design being able to detect a significant difference when there is actually a significant difference to detect.

Power is an interplay between three other aspects of research design:

  • alpha - your critical p-value (normally .05);
  • the sample size (n);
  • the effect size - how big is the difference (measured in a number of ways).

If you know any three of these four elements (power, alpha, effect size, n) you can calculate the fourth. We will save further discussion of power until Chapter 8 but if you want to read ahead then this blog is highly recommended: The Power Dialogues.
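To give a flavour of how this works in practice, here is a minimal sketch using the power.t.test() function from base R. The numbers used (a medium-sized difference of 0.5 SDs, 22 participants per group, 80% power) are purely illustrative and are not taken from the chapter.

# power.t.test() solves for whichever of n, delta (effect size in SD units),
# sig.level (alpha) or power is left unspecified.

# What power does a two-sample design with 22 participants per group have to
# detect a difference of 0.5 SDs at alpha = .05?
power.t.test(n = 22, delta = 0.5, sd = 1, sig.level = .05, type = "two.sample")

# Leave out n and supply power instead, and it returns the n needed per group:
power.t.test(delta = 0.5, sd = 1, sig.level = .05, power = .8, type = "two.sample")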


However, running several hundred participants in your one study can be a significant time and financial investment. Fortunately, the idea of a "ManyLabs" project can solve this problem. In this scenario the same experiment is run in various locations, all using the same procedure, and then the data is collapsed together and analysed as one. You can see a nice example of a Many Labs project in the paper Investigating Variation in Replicability (Klein et al., 2014). See how many labs and researchers are involved? Perhaps this is a better approach than lots of researchers working individually?

You think this all sounds like a great idea, so in your quest to be a collaborative, reproducible researcher, and as a high-five to #TeamScience, you have joined a ManyLabs study replicating the findings of Woods et al. (2009). That study is the premise for today's activities, so let's start with a quick run-through of the background of the experiment.

The Background

Woods and colleagues (2009) were interested in how the attention of people with poor sleep (Primary Insomnia - PI) was more tuned towards sleep-related stimuli than the attention of people with normal sleep (NS). Woods et al., hypothesised that participants with poor sleep would be more attentive to images related to a lack of sleep (i.e. an alarm clock showing 2AM) than participants with normal sleep would be. To test their hypothesis, the authors used a modified Posner paradigm, shown in Figure 1 of the paper, where images of an alarm clock acted as the cue on both valid and invalid trials, with the symbol ( .. ) being the target.

As can be seen in Figure 3 of Woods et al., the authors found that, on valid trials, whilst Primary Insomnia participants were slightly faster in responding to the target, suggesting a slight increase in attention to the sleep-related cue compared to the Normal Sleepers, there was no significant difference between groups. In contrast, for invalid trials, where poor sleep participants were expected to be distracted by the cue, the authors did indeed find a significant difference between groups, consistent with their alternative hypothesis \(H_{1}\). Woods et al. concluded that poor sleepers (Primary Insomnia participants) were slower to respond to the target on invalid trials, compared to Normal Sleepers, due to the attention of the Primary Insomnia participants being drawn to the misleading cue (the alarm clock) on the invalid trials. This increased attention to the sleep-related cue led to an overall slower response to the target on these invalid trials.

As we said above, your lab is now part of a ManyLabs project that is looking to replicate this finding from Woods et al. (2009). As a pilot study, to test the recruitment procedures, the experimental paradigm, and the analysis pipeline, each lab gathers data from 22 Normal Sleepers. It is common to use only the control participants in a pilot (in this case the NS participants) as they are more plentiful in the population than the experimental group (in this case PI participants); this also avoids using up participants from the PI group, who may be harder to recruit in the long run.

After gathering your data, we want to check the recruitment process and whether or not you have been able to draw a sample of normal sleepers similar to the sample drawn by Woods et al. To keep things straightforward, allowing us to understand the analyses better, we will only look at valid trials today, in NS participants, but in effect you could perform this test on all groups and conditions.

Are These Participants Normal Sleepers (NS)?

Below is the data from the 22 participants you have collected in your pilot study. Their mean reaction time for valid trials (in milliseconds) is shown in the right hand column, valid_rt.

Table 6.1: Pilot Data for 22 Participants on a Sleep-Related Posner Paradigm. ID is shown in participant column and mean reaction time (ms) on valid trials is shown in valid_rt column.
participant valid_rt
1 631.2
2 800.8
3 595.4
4 502.6
5 604.5
6 516.9
7 658.0
8 502.0
9 496.7
10 600.3
11 714.6
12 623.7
13 634.5
14 724.9
15 815.7
16 456.9
17 703.4
18 647.5
19 657.9
20 613.2
21 585.4
22 674.1

If you look at Woods et al (2009) Figure 3 you will see that, on valid trials, the mean reaction time for NS participants was 590 ms with a SD = 94 ms. As above, as part of our pilot study, we want to confirm that the 22 participants we have gathered are indeed Normal Sleepers. We will use the mean and SD from Woods et al., to confirm this. Essentially we are asking if the participants in the pilot are responding in a similar fashion as NS participants in the original study.

You will know from Chapter 5 that when using Null Hypothesis Significance Testing (NHST) we are working with both a null hypothesis (\(H_{0}\)) and an alternative hypothesis (\(H_{1}\)). Thinking about this study it makes some logical sense to think about it in terms of the null hypothesis (\(\mu_1 = \mu_2\)). So we could phrase our hypothesis as, "we hypothesise that there is no significant difference in mean reaction times to valid trials on the modified Posner experiment between the participants in the pilot study and the participants in the original study by Woods et al."

There are actually a few analytical ways to test our null hypothesis. Today we will show you how to do two of these. In Tasks 1-3 we will use a binomial test and in Tasks 4-8 we will use a one-sample t-test.

The binomial test is a very simple test that converts all participants to being either above or below a cut-off point, e.g. a mean value, and then looks at the probability of finding that number of participants above that cut-off.

The one-sample t-test is similar in that it compares participants to a cut-off, but it uses more of the information: it compares the mean of the collected sample to an ideal mean, taking into account the variability (standard deviation) of the sample. By dividing the difference between the means by the standard error of the sample mean (a measure of variability), we can determine whether the sample is similar to the ideal mean or not.


6.2.1 The Binomial Test

The Binomial test is one of the most "basic tests" in null hypothesis testing in that it uses very little information. The binomial test is used when a study has two possible outcomes (success or failure) and you have an idea about what the probability of success is. This will sound familiar from the work we did in Chapter 4 and the Binomial distribution.

A binomial test tests if an observed result is different from what was expected. For example, is the number of heads in a series of coin flips different from what was expected. Or in our case for this chapter, we want to test whether our normal sleepers are giving reaction times that are the same or different from those measured by Woods et al. The following tasks will take you through the process.
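To make that logic concrete, here is a small sketch (not part of the tasks) using the coin-flip example just mentioned:

# Probability of exactly 8 heads in 10 flips of a fair coin:
dbinom(8, size = 10, prob = .5)

# Probability of 8 or more heads in 10 flips:
sum(dbinom(8:10, size = 10, prob = .5))

# The same value from the cumulative distribution:
pbinom(7, size = 10, prob = .5, lower.tail = FALSE)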

6.2.2 Task 1: Creating a Dataframe

First we need to create a tibble with our data so that we can work with it.

  • Enter the data for the 22 participants displayed above into a tibble and store it in ns_data. Have one column showing the participant number (called participant) and another column showing the mean reaction time, called valid_rt.
    • We saw how to enter data into tibbles in Chapter 5 PreClass Skill 3.
    • You could type each value out or copy and paste them from the hint below.

You can use this code structure and replace the NULL values:

ns_data <- tibble(participant = c(NULL,NULL,...), valid_rt = c(NULL,NULL,...))

The values are: 631.2, 800.8, 595.4, 502.6, 604.5, 516.9, 658.0, 502.0, 496.7, 600.3, 714.6, 623.7, 634.5, 724.9, 815.7, 456.9, 703.4, 647.5, 657.9, 613.2, 585.4, 674.1
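If you want to check your approach against one possible completion (the solution at the end of the chapter may lay things out slightly differently), a filled-in version might look like this:

library(tidyverse)

ns_data <- tibble(participant = 1:22,
                  valid_rt = c(631.2, 800.8, 595.4, 502.6, 604.5, 516.9,
                               658.0, 502.0, 496.7, 600.3, 714.6, 623.7,
                               634.5, 724.9, 815.7, 456.9, 703.4, 647.5,
                               657.9, 613.2, 585.4, 674.1))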


6.2.3 Task 2: Comparing Original and New Sample Reaction Times

Our next step is to establish how many participants from our pilot study are above the mean in the original study by Woods et al.

  • In the original study the mean reaction time for valid trials was 590 ms. Store this value in woods_mean.
  • Now write code to calculate the number of participants in the new sample (ns_data created in Task 1) that had a mean reaction time greater than the original paper's mean. Store this single value in n_participants.
    • The function nrow() may help here.
    • nrow() is similar to count() or n() but nrow() returns the number as a single value and not in a tibble.
    • Be sure whatever method you use you end up with a single value, not a tibble. You may need to use pull() or pluck()

Part 1

woods_mean <- value

Part 2

  • A few ways to achieve this. Here are a couple you could try

ns_data %>% filter(x ? y) %>% count() %>% pull(?)

or

ns_data %>% filter(x ? y) %>% summarise(n = ?) %>% pull(?)

or

ns_data %>% filter(x ? y) %>% nrow()

or

ns_data %>% filter(x ? y) %>% dim() %>% pluck(?)


Quickfire Questions

  • The number of participants that have a mean reaction time for valid trials greater than that of the original paper is:

6.2.4 Task 3: Calculating Probability

Our final step for the binomial test is to compare our value from Task 2, 16 participants, to our hypothetical cut-off. We will work under the assumption that the mean reaction time from the original paper, i.e. 590 ms, is a good estimate for the population of good sleepers (NS). If that is true then each new participant that we have tested should have a .5 chance of being above this mean reaction time (\(p = .5\) for each participant).

To phrase this another way, the expected number of participants above the cut-off would be \(.5 \times N\), where \(N\) is the number of participants, or \(.5 \times 22\) = 11 participants.

  • Calculate what would be the probability of observing at least 16 participants out of your 22 participants that had a valid_rt greater than the Woods et al (2009) mean value.
    • hint: We looked at very similar questions in Chapter 4 using dbinom() and pbinom()
    • hint: The key thing is that you are asking about obtaining X or more successes. You will need to think back about cut-offs and lower.tail.

Think back to Chapter 4 where we used the binomial distribution. This question can be phrased as: what is the probability of obtaining X or more successes out of Y trials, given the expected probability of Z?

  • How many Xs? (see question)
  • How many Ys? (see question)
  • What is the probability of being either above or below the mean/cut-off? (see question)
  • You can use a dbinom() %>% sum() for this or maybe a pbinom() - a filled-in sketch is shown below.
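As a check on your thinking, one possible completion (using the 16 participants found in Task 2) would be:

# Probability of observing 16 or more participants out of 22 above the
# cut-off, when each has a .5 chance of being above it:
sum(dbinom(16:22, size = 22, prob = .5))

# The same value using the cumulative distribution function:
pbinom(15, size = 22, prob = .5, lower.tail = FALSE)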


Quickfire Questions

  • Using the Psychology standard \(\alpha = .05\), do you think these NS participants are responding in a similar fashion as the participants in the original paper? Select the appropriate answer:
  • According to the Binomial test would you accept or reject the null hypothesis that we set at the start of this test?

The probability of obtaining 16 participants with a mean reaction time greater than the cut-off of 590 ms is p = .026. This is smaller than the field norm of p = .05. As such we can say that, using the binomial test, the new sample appears to be significantly different from the old sample as there is a significantly larger number of participants above the cut-off (M = 590ms) than would be expected if the new sample and the old sample were responding in a similar fashion. We would therefore reject our null hypothesis!


6.2.5 The One-Sample t-test

In Task 3 you ran a binomial test of the null hypothesis that there was no significant difference in mean reaction times to valid trials on the modified Posner experiment between the participants in the pilot study and the participants in the original study by Woods et al. However, the binomial test did not use all the available information in the data, because each participant was simply classified as being above or below the mean of the original paper, i.e. yes or no. Information about the magnitude of the discrepancy from the mean was discarded. That information is really interesting and important, however, and if we wanted to keep it we would need to use a one-sample \(t\)-test.

In a one-sample \(t\)-test, you test the null hypothesis \(H_0: \mu = \mu_0\) where:

  • \(H_0\) is the symbol for the null hypothesis,
  • \(\mu\) (pronounced mu - like few with an m) is the unobserved population mean,
  • and \(\mu_0\) (mu-zero) is some other mean to compare against (which could be an alternative population or sample mean or a constant).

And we will do this by calculating the test statistic \(t\) which comes from the \(t\) distribution - more on that distribution below and in the lectures. The formula to calculate the observed test statistic \(t\) for the one-sample \(t\)-test is:

\[t = \frac{\mu - \mu_0}{s / \sqrt{n}}\]

  • \(s\) is the standard deviation of the sample collected,
  • and \(n\) is the number of participants in the sample.

The eagle-eyed of you may have spotted that in Chapter 5 the null hypothesis was stated as:

\[H_0: \mu_1 = \mu_2\]

whereas in this Chapter we are stating it as:

\[H_0: \mu = \mu_0\]

What's the difference? Conceptually there is no real difference. The null hypothesis in both situations is stating that there is no difference between the means. The only difference is that in Chapter 5 we know the actual mean of the groups of interest. In Chapter 6 we only really know the mean we want to compare against - \(\mu_0\). We do not know the actual mean of the population of interest (we only know the sample we have collected) - so this is written as \(\mu\).

The main thing to recognise is that the null hypothesis \(H_0\) always states that there is no significant difference between the two means of interest - i.e. the two means are equivalent.


For the current problem:

  • \(\mu\) is the unobserved true mean of all possible participants. We don't know this value. Our best guess is the mean of the sample of 22 participants, so we will use that mean here. As such, we will substitute this value into our formula; it is written as \(\bar{X}\) (pronounced X-bar) instead of \(\mu\).
  • \(\mu_0\) is the mean to compare against. For us this is the mean of the original paper which we observed to be 590 ms.

So in other words we are testing the null hypothesis that \(H_0: \bar{X} =\) 590.

As such the formula for our one-sample \(t\)-test becomes:

\[t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}\]

Now we just need to fill in the numbers.

6.2.6 Task 4: Calculating the Mean and Standard Deviation

  • Calculate the mean and standard deviation of valid_rt for our 22 participants (i.e., for all participant data at the top of this lab).
  • Store the mean in ns_data_mean and store the standard deviation in ns_data_sd. Make sure to store them both as single values!

In the below code, replace NULL with the code that would find the mean, m, of ns_data.

ns_data_mean <- summarise(NULL) %>% pull(NULL)

Replace NULL with the code that would find the standard deviation, sd, of ns_data.

ns_data_sd <- summarise(NULL) %>% pull(NULL)
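One possible way of completing these templates (other approaches work just as well) is:

# summarise() creates a one-row tibble; pull() then extracts the single value
ns_data_mean <- ns_data %>% summarise(m = mean(valid_rt)) %>% pull(m)
ns_data_sd <- ns_data %>% summarise(sd = sd(valid_rt)) %>% pull(sd)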

6.2.7 Task 5: Calculating the Observed Test Statistic

From Task 4, you found out that \(\bar{X}\), the sample mean, was 625.464 ms, and \(s\), the sample standard deviation, was 94.307 ms. Now, keeping in mind that \(n\) is the number of observations/participants in the sample, and \(\mu_0\) is the mean from Woods et al (2009):

  • Use the one-sample t-test formula above to compute your observed test statistic. Store the answer in t_obs .
  • E.g. t_obs <- (x - y)/(s/sqrt(n))

Answering this question will help you in this task as you'll also need these numbers to substitute into the formula:

  • The mean from Woods et al (2009) was , and the number of participants in our sample is: (type in numbers) .
  • Remember the solutions at the end of the chapter if you are stuck. To check that you are correct without looking at the solutions though - the observed \(t\)-value in t_obs, to two decimal places, is

Remember BODMAS and/or PEDMAS when given more than one operation to calculate. (i.e. Brackets/Parenthesis, Orders/Exponents, Division, Multiplication, Addition, Subtraction)

t_obs <- (sample mean - woods mean) / (sample standard deviation / square root of n)
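Putting the pieces together, a sketch of the calculation (assuming woods_mean from Task 2, and ns_data_mean and ns_data_sd from Task 4) would be:

n <- nrow(ns_data)  # number of participants in the sample
t_obs <- (ns_data_mean - woods_mean) / (ns_data_sd / sqrt(n))
t_obs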

6.2.8 Task 6: Comparing the Observed Test Statistic to the t-distribution using pt()

Now you need to compare t_obs to the t-distribution to determine how likely the observation (i.e. your test statistic) is under the null hypothesis of no difference. To do this you need to use the pt() function.

  • Use the pt() function to get the \(p\)-value for a two-tailed test with \(\alpha\) level set to .05. The test has \(n - 1\) degrees of freedom, where \(n\) is the number of observations contributing to the sample mean \(\bar{X}\). Store the \(p\) value in the variable pval.
  • Do you reject the null?
  • Hint: The pt() function works similar to pbinom() and pnorm().
  • Hint: Because we want the p-value for a two-tailed test, multiply pt() by two.

Remember to get help you can enter ?pt in the console.

The pt() function works similar to pbinom() and pnorm():

  • pval <- pt(test statistic, df, lower.tail = FALSE) * 2
  • Use the absolute value of the test statistic; i.e. ignore minus signs.
  • Remember, df is equal to n-1.
  • Use lower.tail = FALSE because we want to know the probability of obtaining a value higher than the one we got (a filled-in sketch is shown after this list).
  • Reject the null at the field standard of p < .05
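Pulling those hints together, one possible completion (assuming t_obs from Task 5 and the 22 participants in ns_data) is:

# two-tailed p-value: probability of a t at least this extreme, times two
pval <- pt(abs(t_obs), df = 22 - 1, lower.tail = FALSE) * 2
pval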


6.2.9 Task 7: Comparing the Observed Test Statistic to the t-distribution using t.test()

Now that you have done this by hand, try using the t.test() function to get the same result. Take a moment to read the documentation for this function by typing ?t.test in the console window. No need to store the t-test output in a dataframe but do check that the p-value matches the pval in Task 6.

  • The structure of the t.test() function is t.test(column_of_data, mu = mean_to_compare_against)

The function requires a vector, not a table, as the first argument. You can use the pull() function to pull out the valid_rt column from the tibble ns_data with pull(ns_data, valid_rt).

You also need to include mu in the t.test(), where mu is equal to the mean you are comparing to.
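Putting those two hints together, the call might look like this (check that the p-value in the output matches your pval from Task 6):

t.test(pull(ns_data, valid_rt), mu = 590)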


Quickfire Questions

To make sure you are understanding the output of the t-test, try to answer the following questions.

  1. To three decimal places, type in the p value for the t-test in Task 7

  2. As such this one sample t-test is

  3. The outcome of the binomial test and the one sample t-test produce answer

6.2.10 Task 8: Drawing Conclusions about the new data

Given these results, what do you conclude about how similar these 22 participants are to the original participants in Woods et al (2009) and whether or not you have managed to recruit sleepers similar to that study?

  • Think about which test used more of the available information?
  • Also, how reliable is the finding if the two tests give different answers?

We have given some of our thoughts at the end of the chapter.

Job Done - Activity Complete!

That's all! There is quite a bit in this lab in terms of the theory of Null Hypothesis Significance Testing (NHST), so you might want to go back and add any informative points to your Portfolio. Post any questions on the forums.

6.3 InClass Activity

The PreClass activity looked at the situation where you had gathered one sample of data (from one group of participants) and you want to compare that one group to a known value, e.g. a standard value. An extension of this design is where you gather data from two samples. Two-sample designs are very common in Psychology, as often we want to know whether there is a difference between groups on a particular variable. We looked at a similar idea in Chapter 5, where we made this comparison through simulation. Today we will look at it through the \(t\)-test analysis introduced earlier in this chapter.

Comparing the Means of Two Samples

The first thing to note is that there are different types of two-sample designs, depending on whether the two groups are independent (e.g. different participants in different conditions) or not (e.g. the same participants in different conditions). In today's exercise we will focus on independent samples, which typically means that the observations in the two groups are unrelated - usually different people. In Chapter 7 you will examine cases where the observations in the two groups come in pairs (paired samples) - most often the same people, but this could also be a matched-pairs design.

Now that we are progressing through this book, we will start to reduce the pointers for things that you should make a note of. By now you should have a really good idea yourself about what you need to remember. That said, one of the really confusing things about research design is that there are many names for the same type of design. This is definitely something you should be writing out in your own words, to remember it better, if it is something you struggle with.

Here is a very brief summary:

  • independent and between-subjects design typically mean the same thing - different participants in different conditions.
  • within-subjects, dependent samples, paired samples, and repeated measures tend to mean the same participants in all conditions
  • matched-pairs design means different people in different conditions but you have matched participants across the conditions so that they are effectively the same person (e.g. age, IQ, Social Economic Status, etc). These designs are analysed as though they were a within-subjects design.
  • mixed design is when there is a combination of within-subjects and between-subjects designs in the one experiment. For example, say you are looking at attractiveness and dominance of male and female faces. Everyone might see both male and female faces (within) but half of the participants do ratings of attractiveness and half do ratings of dominance (between).
  • The paper we are looking at in this InClass activity technically uses a mixed design at times but we will use the between-subjects element to show you how to run independent t-tests.

Spend some time when reading articles to really figure out the design they are using.


Background

For this lab we will be revisiting the data from Schroeder and Epley (2015), which you first encountered as part of the homework for Chapter 5. You can take a look at the Psychological Science article here:

Schroeder, J. and Epley, N. (2015). The sound of intellect: Speech reveals a thoughtful mind, increasing a job candidate's appeal. Psychological Science, 26, 877--891.

The abstract from this article explains more about the different experiments conducted (we will be specifically looking at the dataset from Experiment 4, courtesy of the Open Stats Lab):

"A person's mental capacities, such as intellect, cannot be observed directly and so are instead inferred from indirect cues. We predicted that a person's intellect would be conveyed most strongly through a cue closely tied to actual thinking: his or her voice. Hypothetical employers (Experiments 1-3b) and professional recruiters (Experiment 4) watched, listened to, or read job candidates' pitches about why they should be hired. These evaluators (the employers) rated a candidate as more competent, thoughtful, and intelligent when they heard a pitch rather than read it and, as a result, had a more favorable impression of the candidate and were more interested in hiring the candidate. Adding voice to written pitches, by having trained actors (Experiment 3a) or untrained adults (Experiment 3b) read them, produced the same results. Adding visual cues to audio pitches did not alter evaluations of the candidates. For conveying one's intellect, it is important that one's voice, quite literally, be heard."

To recap on Experiment 4, 39 professional recruiters from Fortune 500 companies evaluated job pitches of M.B.A. candidates (Masters in Business Administration) from the University of Chicago Booth School of Business. The methods and results appear on pages 887--889 of the article if you want to look at them for more details. The original data, in wide format, can be found at the Open Stats Lab website for later self-directed learning. Today, however, we will be working with a modified version in "tidy" format which can be downloaded from here. If you are unsure about tidy format, refer back to the inclass activities of Chapter 2.

Today's Goal!

The overall goal today is to learn about running a \(t\)-test on between-subjects data, as well as learning about analysing real data along the way. As such, our task today is to reproduce a figure and the results from the article (p. 887-888). The two packages you will need are tidyverse, which we have used a lot, and broom, which is new to you but will become your friend. One of the main functions we use in broom is broom::tidy() - an incredibly useful function that converts the output of an inferential test in R from a combination of text and lists that are really hard to work with (technically called objects) into a tibble that you are very familiar with and can then use much more easily. Might that be worth making a note of? We will show you how to use this function today and then ask you to use it over the coming chapters.

If you are using the Boyd Orr labs, broom is already installed and just needs to be loaded with library(). If you are using your own machine, you will need to install it once if you have never installed it before.

6.3.1 Task 1: Evaluators

  1. Open a new script or R Markdown file and use code to call broom and tidyverse into your library.

Note: Order is important when calling multiple libraries - if two libraries have a function named the same thing, R will use the function from the library loaded in last. We recommend calling libraries in an order that tidyverse is called last as the functions in that library are used most often.

  2. The file called evaluators.csv contains the demographics of the 39 raters. After downloading and unzipping the data, and of course setting the working directory, load in the information from this file and store it in a tibble called evaluators.

  3. Now, use code to:

    1. calculate the overall mean and standard deviation of the age of the evaluators.
    2. count how many male and how many female evaluators were in the study.

Note: It is probably easier to do this task in separate lines of code. Remember the solutions are at the end of the Chapter. Also note: there are NAs in the data so you will need to include a call to na.rm = TRUE.

  • Remember to load the libraries you need!
  • Also make sure you've downloaded and saved the data in the folder you're working from.
  • You can use summarise() and count() or a pipeline with group_by() to complete this task.
  • When analysing the number of male and female evaluators it isn't initially clear that '1' represents males and '2' represents females.
  • We can use recode() to convert the numeric values to something more meaningful. Have a look at ?recode to see if you can work out how to use it. It will help to use mutate() to create a new variable in which to recode() the numeric values (a sketch is shown after this list).
  • This website is also incredibly useful and one to save for anytime you need to use recode(): https://debruine.github.io/posts/recode/
  • For your own analysis and future reproducible analyses, it's a good idea to make these representations clearer to others.
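Here is a sketch of one possible approach. Note that the column names age and sex are assumptions here, not taken from the text - check the actual names in your data with glimpse(evaluators) before running it:

library(broom)
library(tidyverse)

evaluators <- read_csv("evaluators.csv")

# Mean and SD of age; na.rm = TRUE is needed because of the NAs in the data
# (the column name "age" is an assumption - check your data)
evaluators %>%
  summarise(mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))

# Count male and female evaluators, recoding 1/2 into labels first
# (the column name "sex" is an assumption - check your data)
evaluators %>%
  mutate(sex_labelled = recode(sex, "1" = "male", "2" = "female")) %>%
  count(sex_labelled)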


Quickfire Questions

Fill in the below answers to check that your calculations are correct. Remember that the solutions are at the end of the chapter:

  • What was the mean age of the evaluators in the study? Type in your answer to one decimal place:
  • What was the standard deviation of the age of the evaluators in the study? Type in your answer to two decimal places:
  • How many participants were noted as being female:
  • How many participants were noted as being male:

Group Discussion Point

The paper claims that the mean age of the evaluators was 30.85 years (SD = 6.24) and that there were 9 male and 30 female evaluators. Do you agree? Why might there be differences?

This paper claimed there were 9 males; however, looking at your results you can see only 4 males, with 5 NA entries making up the rest of the participant count. It looks like the NA and male entries have been combined! That information might not be clear to a person re-analysing the data.

This is why it's important to have reproducible data analyses for others to examine. Having another pair of eyes examining your data can be very beneficial in spotting any discrepancies - this allows for critical evaluation of analyses and results and improves the quality of research being published. All the more reason to emphasize the importance of conducting replication studies! #ReproducibleScience

6.3.2 Task 2: Ratings

We are now going to calculate an overall intellect rating given by each evaluator. To break that down a bit, we are going to calculate how intellectual the evaluators (the raters) thought candidates were overall, depending on whether or not the evaluators read or listened to the candidates' resume pitches. This is calculated by averaging the ratings of competent, thoughtful and intelligent for each evaluator held within ratings.csv.

Note: we are not looking at ratings to individual candidates; we are looking at overall ratings for each evaluator. This is a bit confusing but makes sense if you stop to think about it a little. You can think about it in terms of "do raters rate differently depending on whether they read or listen to a resume pitch".

We will then combine the overall intellect rating with the overall impression ratings and overall hire ratings for each evaluator, already found in ratings.csv. In the end we will have a new tibble - let's call it ratings2 - which has the structure below:

  • eval_id shows the evaluator ID. Each evaluator has a different ID. So all the 1's are the same evaluator.
  • Category shows the scale that they were rating on - intellect, hire, impression
  • Rating shows the overall rating given by that evaluator on a given scale.
  • condition shows whether that evaluator listened to (e.g. evaluators 1, 2 and 3), or read (e.g. evaluator 4) the resume.
eval_id Category Rating condition
1 hire 6.000 listened
1 impression 7.000 listened
1 intellect 6.000 listened
2 hire 4.000 listened
2 impression 4.667 listened
2 intellect 5.667 listened
3 hire 5.000 listened
3 impression 8.333 listened
3 intellect 6.000 listened
4 hire 4.000 read
4 impression 4.667 read
4 intellect 3.333 read

The following steps describe how to create the above tibble, but you might want to have a bash yourself without reading them first. The trick when doing data analysis and data wrangling is to first think about what you want to achieve - the end goal - and then ask which functions you need to get there. You know what you want to end up with - the above table - now how do you get there?

Steps 1-3 calculate the new intellect rating. Steps 4 and 5 combine this rating to all other information.

  1. Load the data found in ratings.csv into a tibble called ratings.

  2. filter() only the relevant variables (thoughtful, competent, intelligent) into a new tibble (call it what you like - in the solutions we use iratings), and calculate a mean Rating for each evaluator.

  3. Add on a new column called Category where every entry is the word intellect. This tells us that every number in this tibble is an intellect rating.

  4. Now create a new tibble called ratings2 and filter into it just the impression and hire ratings from the original ratings tibble. Next, bind this tibble with the tibble you created in step 3 to bring together the intellect, impression, and hire ratings, in ratings2.

  5. Join ratings2 with the evaluator tibble that we created in Task 1. Keep only the necessary columns as shown above and arrange by Evaluator and Category.

Don't forget to use the hints below, or the solution at the end of the Chapter, if you are stuck. If the steps above don't help, just go back to what you want to achieve and ask yourself how you would do that, working backwards from the end goal.

  1. Make sure you've downloaded and saved the data into the folder you're working from.

  2. filter(Category %in% c()) might work and then use group_by() and summarize() to calculate a mean Rating for each evaluator.

  3. Use mutate() to create a new column.

  4. bind_rows() from Chapter 2 will help you to combine these variables from two separate tibbles.

  5. Use inner_join() with the common column in both tibbles. select() and arrange() will help you here too. A sketch pulling all five steps together is shown below.
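Here is a sketch of one possible route through the five steps. It assumes that ratings.csv contains the columns eval_id, Category and Rating, and that the evaluators tibble from Task 1 contains eval_id and condition - check your own column names before running it:

ratings <- read_csv("ratings.csv")

# Steps 2 and 3: mean intellect rating per evaluator, labelled as "intellect"
iratings <- ratings %>%
  filter(Category %in% c("competent", "thoughtful", "intelligent")) %>%
  group_by(eval_id) %>%
  summarise(Rating = mean(Rating)) %>%
  mutate(Category = "intellect")

# Steps 4 and 5: add the hire and impression ratings, then join on condition
ratings2 <- ratings %>%
  filter(Category %in% c("hire", "impression")) %>%
  bind_rows(iratings) %>%
  inner_join(evaluators, by = "eval_id") %>%
  select(eval_id, Category, Rating, condition) %>%
  arrange(eval_id, Category)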


6.3.3 Task 3: Creating a Figure

To recap, we now have ratings2, which contains an overall Rating score for each evaluator on each of the three Category scales (within: hire, impression, intellect), depending on which condition that evaluator was in (between: listened or read). Great! Now we have all the information we need to replicate Figure 7 in the article (page 888), shown here:


Figure 6.1: Figure 7 from Schroeder and Epley (2015) which you should try to replicate.

Replace the NULLs below to create a very basic version of this figure. You did something like this for the Chapter 5 assignment and again in the Chapter 3 Visualisation tasks.

group_means <- group_by(ratings2, NULL, NULL) %>%
  summarise(Rating = mean(Rating))

ggplot(group_means, aes(NULL, NULL, fill = NULL)) +
  geom_col(position = "dodge")

Group Discussion Point

Improve Your Figure: Discuss with others how you could improve this plot. What other geom_() options could you try? Are barcharts that informative or would something else be better? How would you add or change the labels of your plot? Could you change the colours in your figure?

Next, have a look at the possible solution below to see a more modern way of presenting this information. There are some new functions in this solution that you should play about with to understand what they do. Remember it is a layering system, so remove lines and see what happens. Note how in the solution the figure shows the raw data points as well as the means in each condition; this gives a better impression of the true data, as showing only the means can be misleading. You can continue your exploration of visualisations by reading this paper later when you have a chance: Weissgerber et al., 2015, Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm

Filling in the code as below will create a basic figure as shown:

group_means <- ratings2 %>%
  group_by(condition, Category) %>%
  summarise(Rating = mean(Rating))
## `summarise()` has grouped output by 'condition'. You can override using the `.groups` argument.
ggplot(group_means, aes(Category, Rating, fill = condition)) + 
  geom_col(position = "dodge")

Figure 6.2: A basic solution to Figure 7

Or alternatively, for a more modern presentation of the data:

group_means <- ratings2 %>%
  group_by(condition, Category) %>%
  summarise(Rating = mean(Rating))
## `summarise()` has grouped output by 'condition'. You can override using the `.groups` argument.
ggplot(ratings2, aes(condition, Rating, color = condition)) +
  geom_jitter(alpha = .4) +
  geom_violin(aes(fill = condition), alpha = .1) +
  facet_wrap(~Category) +
  geom_point(data = group_means, size = 2) +
  labs(x = "Category", y = "Recruiters' Evaluation of Candidates") +
  coord_cartesian(ylim = c(0, 10), expand = FALSE) +
  guides(color = "none", fill = "none") +
  theme_bw()

Figure 6.3: A possible alternative to Figure 7

6.3.4 Task 4: t-tests

Brilliant! So far we have checked the descriptives and the visualisations, and the last thing now is to check the inferential tests: the t-tests. You should still have ratings2 stored from Task 2. From this tibble, let's reproduce the t-test results from the article and at the same time show you how to run a t-test. You can refer back to the lectures for the maths behind a between-subjects t-test, but essentially it is a measure of the difference between the means relative to the variability about those means.

Here is a paragraph from the paper describing the results (p. 887):

"The pattern of evaluations by professional recruiters replicated the pattern observed in Experiments 1 through 3b (see Fig. 7). In particular, the recruiters believed that the job candidates had greater intellect---were more competent, thoughtful, and intelligent---when they listened to pitches (M = 5.63, SD = 1.61) than when they read pitches (M = 3.65, SD = 1.91), t(37) = 3.53, p < .01, 95% CI of the difference = [0.85, 3.13], d = 1.16. The recruiters also formed more positive impressions of the candidates---rated them as more likeable and had a more positive and less negative impression of them---when they listened to pitches (M = 5.97, SD = 1.92) than when they read pitches (M = 4.07, SD = 2.23), t(37) = 2.85, p < .01, 95% CI of the difference = [0.55, 3.24], d = 0.94. Finally, they also reported being more likely to hire the candidates when they listened to pitches (M = 4.71, SD = 2.26) than when they read the same pitches (M = 2.89, SD = 2.06), t(37) = 2.62, p < .01, 95% CI of the difference = [0.41, 3.24], d = 0.86."

We are going to run the t-tests for Intellect, Hire and Impression; each time comparing evaluators' overall ratings for the listened group versus the read group to see if there was a significant difference between the two conditions: i.e. did the evaluators who listened to pitches give significantly higher or lower ratings than evaluators who read the pitches?

In terms of hypotheses, we could phrase the null hypothesis for these tests as: there is no significant difference between overall ratings on the {insert trait} scale between evaluators who listened to resume pitches and evaluators who read the resume pitches (\(H_0: \mu_1 = \mu_2\)). Alternatively, we could state the alternative hypothesis as: there will be a significant difference between overall ratings on the {insert trait} scale between evaluators who listened to resume pitches and evaluators who read the resume pitches (\(H_1: \mu_1 \ne \mu_2\)).

Now would be a good time to add to your notes about the difference between a positive and a negative value as the outcome of a t-test. Remember? It just tells you which group had the bigger mean - the absolute value will be the same. Most commonly, t-tests are reported as a positive value.


To clarify, we are going to run three between-subjects t-tests in total; one for intellect ratings; one for hire ratings; one for impression ratings. We will show you how to run the t-test on intellect ratings and then ask you to do the remaining two t-tests yourself.

To run this analysis on the intellect ratings you will need the function t.test(), and you will use broom::tidy() to pull out the results from each t-test into a tibble. Below, we show you how to create the group means and then run the t-test for intellect. Run these lines and have a look at what they do.

  1. First we calculate the group means:
group_means <- ratings2 %>%
  group_by(condition, Category) %>%
  summarise(m = mean(Rating), sd = sd(Rating))
## `summarise()` has grouped output by 'condition'. You can override using the `.groups` argument.
  2. And we can call them and look at them by typing:
group_means
condition Category m sd
listened hire 4.714286 2.261479
listened impression 5.968254 1.917477
listened intellect 5.634921 1.608674
read hire 2.888889 2.054805
read impression 4.074074 2.233306
read intellect 3.648148 1.911343
  3. Now to just look at intellect ratings we need to filter them into a new tibble:
intellect <- filter(ratings2, Category == "intellect")
  4. And then we run the actual t-test and tidy it into a table.
    • t.test() requires two vectors as input
    • pull() will pull out a single column from a tibble, e.g. Rating from intellect
    • tidy() takes information from a test and turns it into a tibble. Try running the t.test with and without piping into tidy() to see what it does differently.
intellect_t <- t.test(intellect %>% 
                        filter(condition == "listened") %>% 
                        pull(Rating),
                      intellect %>% 
                        filter(condition == "read") %>% 
                        pull(Rating),
                      var.equal = TRUE) %>%
  tidy()

Now let's look at the intellect_t tibble we have created (assuming you piped into tidy()):

Table 6.2: The t-test output of those in the intellect condition.
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1.987 5.635 3.648 3.526 0.001 37 0.845 3.128 Two Sample t-test two.sided


From the resultant tibble, intellect_t, you can see that you ran a Two Sample t-test (meaning between-subjects) with a two-tailed hypothesis test ("two.sided"). The mean for the listened condition, estimate1, was 5.635, whilst the mean for the read condition, estimate2, was 3.648 - compare these to the means in group_means as a sanity check. So overall there was a difference between the two means of 1.987. The degrees of freedom for the test, parameter, was 37. The observed t-value, statistic, was 3.526, and it was significant, as the p-value, p.value, was p = 0.0011, which is lower than the field standard Type 1 error rate of \(\alpha = .05\).

As you will know from your lectures, a t-test is presented as t(df) = t-value, p = p-value. As such, this t-test would be written up as: t(37) = 3.526, p = 0.001.

Thinking about interpretation of this finding, as the effect was significant, we can reject the null hypothesis that there is no significant difference between mean ratings of those who listened to resumes and those who read the resumes, for intellect ratings. We can go further than that and say that the overall intellect ratings for those that listened to the resume was significantly higher (mean diff = 1.987) than those who read the resumes, t(37) = 3.526, p = 0.001, and as such we accept the alternative hypothesis. This would suggest that hearing people speak leads evaluators to rate the candidates as more intellectual than when you merely read the words they have written.

Now Try:

  1. Running the remaining t-tests for hire and for impression. Store them in tibbles called hire_t and impress_t respectively.

  2. Bind the rows of intellect_t, hire_t and impress_t to create a table of the three t-tests called results. It should look like this (a sketch of one possible approach is shown after the table):

Table 6.3: Output of all three t-tests
Category estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
intellect 1.987 5.635 3.648 3.526 0.001 37 0.845 3.128 Two Sample t-test two.sided
hire 1.825 4.714 2.889 2.620 0.013 37 0.414 3.237 Two Sample t-test two.sided
impression 1.894 5.968 4.074 2.851 0.007 37 0.548 3.240 Two Sample t-test two.sided
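Here is a sketch of one way to run the remaining two tests and bind the three results together, following the same pattern as intellect_t above (your own solution may well differ):

hire <- filter(ratings2, Category == "hire")
hire_t <- t.test(hire %>% filter(condition == "listened") %>% pull(Rating),
                 hire %>% filter(condition == "read") %>% pull(Rating),
                 var.equal = TRUE) %>%
  tidy()

impression <- filter(ratings2, Category == "impression")
impress_t <- t.test(impression %>% filter(condition == "listened") %>% pull(Rating),
                    impression %>% filter(condition == "read") %>% pull(Rating),
                    var.equal = TRUE) %>%
  tidy()

# bind the three tibbles, using the names to create the Category column
results <- bind_rows(intellect = intellect_t,
                     hire = hire_t,
                     impression = impress_t,
                     .id = "Category")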


Quickfire Questions

  • Check your results for hire. Enter the mean estimates and t-test results (means and t-value to 2 decimal places, p-value to 3 decimal places):

    • Mean estimate1 (listened condition) =

    • Mean estimate2 (read condition) =

    • t() = , p =

    • Looking at this result, True or False, this result is significant at \(\alpha = .05\)?

  • Check your results for impression. Enter the mean estimates and t-test results (means and t-value to 2 decimal places, p-value to 3 decimal places):

    • Mean estimate1 (listened condition) =

    • Mean estimate2 (read condition) =

    • t() = , p =

    • Looking at this result, True or False, this result is significant at \(\alpha = .05\)?

Your t-test answers should have the following structure:

t(degrees of freedom) = t-value, p = p-value,

where:

  • degrees of freedom = parameter,
  • t-value = statistic,
  • and p-value = p.value.

Remember that if a result has a p-value lower (i.e. smaller) than or equal to the alpha level then it is said to be significant.


So to recap, we looked at the data from Schroeder and Epley (2015), both the descriptives and the inferentials, we plotted a figure, and we confirmed that, as in the paper, there are significant differences in each of the three rating categories (hire, impression and intellect), with the listened condition receiving a higher rating than the read condition on each. All in, our interpretation would be that people rate you higher when they hear you speak your resume as opposed to just reading it!

Job Done - Activity Complete!

Well done for completing this inclass activity on independent samples t-tests! Now, combined with the information in the preclass activity, you are able to run a t-test on one-sample data, comparing it to a standard/criterion value, and on two samples in a between-subjects design.

Are there any useful points in this activity about t-tests or plots that you think could be useful to include in your portfolio? Make sure to include them now! For instance, we haven't really looked at the assumptions of a between-subjects t-test, just the analysis. You will have covered the assumptions in your lecture series so why not combine that information with the information you learnt here!

Also, if you have more time, you might want to visit this website that will give you a better understanding of the relationship between the \(t\) distribution and the normal distribution: gallery.shinyapps.io/tdist

You should now be ready to complete the Homework Assignment for this lab. The assignment for this Lab is FORMATIVE and is NOT to be submitted and will NOT count towards the overall grade for this module. However you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. If you have any questions, please post them on the forums.

6.4 Test Yourself

This is a formative assignment meaning that it is purely for you to test your own knowledge, skill development, and learning, and does not count towards an overall grade. However, you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. You will be instructed by the Course Lead on Moodle as to when you should attempt this assignment. Please check the information and schedule on the Level 2 Moodle page.

Lab 6: Independent samples t-test Assignment

In order to complete this assignment, you first have to download the assignment .Rmd file which you need to edit for this assignment: titled GUID_Level2_Semester1_Lab6.Rmd. This can be downloaded within a zip file from the link below. Once downloaded and unzipped you should create a new folder that you will use as your working directory; put the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the Assignment .zip file from here or on Moodle.

Background

For this assignment we will be using real data from the following paper:

Nave, G., Nadler, A., Zava, D., and Camerer, C. (2017). Single-dose testosterone administration impairs cognitive reflection in men. Psychological Science, 28, 1398--1407.

The full data for these exercises can be downloaded from the Open Science Framework repository but for this assignment we will just use the .csv file in the zipped folder: CRT_Data.csv. You may also want to read the paper, at least in part, to help fully understand this analysis if at times you are unsure. Here is the article's abstract:

In nonhumans, the sex steroid testosterone regulates reproductive behaviors such as fighting between males and mating. In humans, correlational studies have linked testosterone with aggression and disorders associated with poor impulse control, but the neuropsychological processes at work are poorly understood. Building on a dual-process framework, we propose a mechanism underlying testosterone's behavioral effects in humans: reduction in cognitive reflection. In the largest study of behavioral effects of testosterone administration to date, 243 men received either testosterone or placebo and took the Cognitive Reflection Test (CRT), which estimates the capacity to override incorrect intuitive judgments with deliberate correct responses. Testosterone administration reduced CRT scores. The effect remained after we controlled for age, mood, math skills, whether participants believed they had received the placebo or testosterone, and the effects of 14 additional hormones, and it held for each of the CRT questions in isolation. Our findings suggest a mechanism underlying testosterone's diverse effects on humans' judgments and decision making and provide novel, clear, and testable predictions.

The critical findings are presented on p. 1403 of the paper under the heading The influence of testosterone on CRT performance. Your task today is to attempt to try and reproduce some of the main results from the paper.

NOTE: Being unable to get the exact same results as the authors doesn't necessarily mean you are wrong! The authors might be wrong, or might have left out important details. Present what you find.

Before starting let's check:

  1. The .csv file is saved into a folder on your computer and you have manually set this folder as your working directory.

  2. The .Rmd file is saved in the same folder as the .csv files. For assessments we ask that you save it with the format GUID_Level2_Semester1_Lab6.Rmd where GUID is replaced with your GUID. Though this is a formative assessment, it may be good practice to do the same here.

6.4.1 Task 1A: Libraries

  • In today's assignment you will need both the tidyverse and broom packages. Enter code into the t1A code chunk below to load in both of these libraries.
## load in the tidyverse and broom packages

6.4.2 Task 1B: Loading in the data

  • Use read_csv() to replace the NULL in the t1B code chunk below to load in the data stored in the datafile CRT_Data.csv. Store the data in the variable crt. Do not change the filename of the datafile.
crt <- NULL

6.4.3 Task 2: Selecting only relevant columns

Have a look at crt. There are three variables in crt that you will need to find and extract in order to perform the t-test: the subject ID number (hint: each participant has a unique number); the independent variable (hint: each participant has the possibility of being in one of two treatments coded as 1 or 0); and the dependent variable (hint: the test specifically looks at which answers people get correct). Identify those three variables. It might help to look at the first few sentences under the heading The influence of testosterone on CRT performance and Figure 2a in the paper for further guidance on the correct variables.

  • Having identified the important three columns, replace the NULL in the t2 code chunk below to select out only those three columns from crt and store them in the tibble crt2.

Check your work: If correct, crt2 should be a tibble with 3 columns and 243 rows.

crt2 <- NULL

NOTE: For the remainder of this assignment you should use crt2 as the main source tibble and not crt.

6.4.4 Task 3: Verify the number of subjects in each group

The Participants section of the article contains the following statement:

243 men (mostly college students; for demographic details, see Table S1 in the Supplemental Material available online) were randomly administered a topical gel containing either testosterone (n = 125) or placebo (n = 118).

In the t3 code block below, replace the NULLs with lines of code to calculate:

  1. The number of men in each Treatment. This should be a tibble called cond_counts containing a column called Treatment showing the two groups and a column called n which shows the number of men in each group.

  2. The total number of men in the sample. This should be a single value, not a tibble, and should be stored in n_men.

You know the answer to both of these tasks already. Make sure that your code gives the correct answer!

cond_counts <- NULL
n_men <- NULL
  • Now replace the strings in the statements below, using inline R code, so that it reproduces the sentence from the paper exactly as it is shown above. In other words, in the statement below, anywhere it says "(your code here)", replace that string (including the quotes), with inline R code. To clarify, when looking at the .Rmd file you should see R code, but when looking at the knitted file, you should see values. Look back at Chapter 1 if you are unsure of how to use inline code.

Hint: One solution is to do something with cond_counts similar to what we did with filter() and pull() in the in-class exercises of this Chapter.

"(your code here)" men (mostly college students; for demographic details, see Table S1 in the Supplemental Material available online) were randomly administered a topical gel containing either testosterone (n = "(your code here)") or placebo (n = "(your code here)").

6.4.5 Task 4: Reproduce Figure 2a

Here is Figure 2A from the original paper:

Figure 6.4: Figure 2A from Nave, Nadler, Zava, and Camerer (2017) which you should replicate

  • Write code in the t4 code chunk to reproduce a version of Figure 2a - shown above. Before you create the plot, replace the NULL to make a tibble called crt_means with the mean and standard deviation of the number of CorrectAnswers for each group.
  • Use crt_means as the source data for the plot.

Hint: you will need to check out recode() to get the labels of treatments right. Again this webpage is highly recommended: https://debruine.github.io/posts/recode/
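If recode() is new to you, here is a minimal sketch using a made-up tibble (the column name and labels below are hypothetical and are not the assignment answer):

# hypothetical data: numeric codes that we want to show as labels
demo <- tibble(code = c(0, 1, 0, 1)) %>%
  mutate(code = recode(code, "0" = "Group A", "1" = "Group B"))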

Don't worry about including the error bars (unless you want to) or the line indicating significance in the plot. Do, however, make sure to pay attention to the treatment labels and to the y-axis scale and label. Reposition the x-axis label to below the figure. You can use colour if you like.

crt_means <- NULL

## TODO: add lines of code using ggplot

6.4.6 Task 5: Interpreting your Figure

It is always good to do a quick recap at this point to make sure you are following the analysis. Replace the NULL in the t5 code chunk below with the number of the statement that best describes the data you have calculated and plotted thus far. Store this single value in answer_t5:

  1. The Testosterone group (M = 2.10, SD = 1.02) would appear to have fewer correct answers on average than the Placebo group (M = 1.66, SD = 1.18) on the Cognitive Reflection Test suggesting that testosterone does in fact inhibit the ability to override incorrect intuitive judgements with the correct response.

  2. The Testosterone group (M = 1.66, SD = 1.18) would appear to have more correct answers on average than the Placebo group (M = 2.10, SD = 1.02) on the Cognitive Reflection Test suggesting that testosterone does in fact inhibit the ability to override incorrect intuitive judgements with the correct response.

  3. The Testosterone group (M = 1.66, SD = 1.18) would appear to have fewer correct answers on average than the Placebo group (M = 2.10, SD = 1.02) on the Cognitive Reflection Test suggesting that testosterone does in fact inhibit the ability to override incorrect intuitive judgements with the correct response.

  4. The Testosterone group (M = 2.10, SD = 1.02) would appear to have more correct answers on average than the Placebo group (M = 1.66, SD = 1.18) on the Cognitive Reflection Test suggesting that testosterone does in fact inhibit the ability to override incorrect intuitive judgements with the correct response.

answer_t5 <- NULL

6.4.7 Task 6: t-test

Now that we have calculated the descriptive statistics for our study we need to run the inferential test. In the t6 code chunk below, replace the NULL with a line of code to run the t-test, taking care to make sure that the output table has the Placebo mean under Estimate1 (group 0) and the Testosterone mean under Estimate2 (group 1). Assume variance is equal and use broom::tidy() to sweep and store the results into a tibble called t_table.
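If it helps, here is the general shape of the code with made-up vectors (this is only a sketch of the pattern, not the answer; the two vectors below are hypothetical):

# hypothetical scores for two groups - not the CRT data
group_a <- c(2, 3, 1, 2, 3)
group_b <- c(1, 0, 2, 1, 1)

example_table <- t.test(group_a, group_b, var.equal = TRUE) %>%
  broom::tidy()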

t_table <- NULL

6.4.8 Task 7: Reporting results

In the t7A code chunk below, replace the NULL with a line of code to pull out the df from t_table. This must be a single value stored in t_df.

t_df <- NULL

In the t7B code chunk below, replace the NULL with a line of code to pull out the t-value from t_table. Round it to three decimal places. This must be a single value stored in t_value.

t_value <- NULL

In the t7C code chunk below, replace the NULL with a line of code to pull out the p-value from t_table. Round it to three decimal places. This must be a single value stored in p_value.

p_value <- NULL

In the t7D code chunk below, replace the NULL with a line of code to calculate the absolute difference between the mean number of correct answers for the Testosterone group and the Placebo group. Round it to three decimal places. This must be a single value stored in t_diff.

t_diff <- NULL

If you have completed t7A to t7D accurately, then when knitted, one of the statements below will produce an accurate and coherent summary of the results. In the t7E code chunk below, replace the NULL with the number of the statement that best summarises the data in this study. Store this single value in answer_t7e.

  1. The testosterone group performed significantly better ( fewer correct answers) than the placebo group, t() = , p = .
  2. The testosterone group performed significantly worse ( fewer correct answers) than the placebo group, t() = , p = .
  3. The testosterone group performed significantly better ( more correct answers) than the placebo group, t() = , p = .
  4. The testosterone group performed significantly worse ( fewer correct answers) than the placebo group, t() = , p = .
answer_t7e <- NULL

Job Done - Activity Complete!

Well done, you are finished! Now you should go and check your answers against the solutions at the end of this chapter. You are looking to check that the resulting output from the answers you have submitted is exactly the same as the output in the solution - for example, remember that a single value is not the same as a coded answer. Where there are alternative answers, it means that you could have submitted any one of the options, as they should all return the same answer. If you have any questions please post them on the available forums or ask a member of staff.

On to the next chapter!

6.5 Solutions to Questions

Below you will find the solutions to the questions for the Activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.

6.5.1 PreClass Activities

6.5.1.1 PreClass Task 1

ns_data <- tibble(participant = 1:22,
                  valid_rt = c(631.2,800.8,595.4,502.6,604.5,
                               516.9,658.0,502.0,496.7,600.3,
                               714.6,623.7,634.5,724.9,815.7,
                               456.9,703.4,647.5,657.9,613.2,
                               585.4,674.1))

Return to Task

6.5.1.2 PreClass Task 2

woods_mean <- 590

n_participants <- ns_data %>%
  filter(valid_rt > woods_mean) %>%
  nrow()
  • Giving an n_participants value of 16

Return to Task

6.5.1.3 PreClass Task 3

  • You can use the density function:
sum(dbinom(n_participants:nrow(ns_data), nrow(ns_data), .5))
## [1] 0.0262394
  • Or, the cumulative probability function:
pbinom(n_participants - 1L, nrow(ns_data), .5, lower.tail = FALSE)
## [1] 0.0262394
  • Or, If you were to plug in the numbers directly into the code:
sum(dbinom(16:22,22, .5))
## [1] 0.0262394
  • Or, finally, remembering that with lower.tail = FALSE we need to specify the value one below our minimum number of participants, because the function gives the probability of values strictly greater than the one specified.
pbinom(15, 22, .5, lower.tail = FALSE)
## [1] 0.0262394

It is better practice to use the first two solutions, which pull the values straight from ns_data, as you run the risk of entering an error into your code if you plug in the values manually.

Return to Task

6.5.1.4 PreClass Task 4

  • For ns_data_mean use summarise() to calculate the mean and then pull() the value.
  • For ns_data_sd use summarise() to calculate the sd and then pull() the value.
# the mean
ns_data_mean <- ns_data %>%
  summarise(m = mean(valid_rt)) %>%
  pull(m)  

# the sd
ns_data_sd <- ns_data %>%
  summarise(sd = sd(valid_rt)) %>%
  pull(sd)

NOTE: You could print them out on the screen if you wanted to; "\n" is the end-of-line symbol, so the two values print on different lines.

cat("The mean reaction time was", ns_data_mean, "\n")
cat("The standard deviation was", ns_data_sd, "\n")
## The mean reaction time was 625.4636 
## The standard deviation was 94.30693

Return to Task

6.5.1.5 PreClass Task 5

t_obs <- (ns_data_mean - woods_mean) / (ns_data_sd / sqrt(nrow(ns_data)))
  • Giving a t_obs value of 1.7638067
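For reference, that line of code is just the one-sample t formula written out in R, with \(\bar{x}\) as the sample mean, \(\mu\) as the comparison mean (woods_mean), \(s\) as the sample standard deviation, and \(n\) as the number of participants:

\(t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{625.4636 - 590}{94.30693 / \sqrt{22}} \approx 1.76\)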

Return to Task

6.5.1.6 PreClass Task 6

If using values straight from ns_data, and multiplying by 2 for a two-tailed test, you would do the following:

pval <- pt(abs(t_obs), nrow(ns_data) - 1L, lower.tail = FALSE) * 2L
  • Giving a pval of 0.0923092

But you can also get the same answer by plugging the values in yourself - though this method runs the risk of error, and you are better off using the first calculation as those values come straight from ns_data:

pval2 <- pt(t_obs, 21, lower.tail = FALSE) * 2
  • Giving a pval2 of 0.0923092

Return to Task

6.5.1.7 PreClass Task 7

The t-test would be run as follows, with the output shown below:

t.test(pull(ns_data, valid_rt), mu = woods_mean)
## 
##  One Sample t-test
## 
## data:  pull(ns_data, valid_rt)
## t = 1.7638, df = 21, p-value = 0.09231
## alternative hypothesis: true mean is not equal to 590
## 95 percent confidence interval:
##  583.6503 667.2770
## sample estimates:
## mean of x 
##  625.4636

Return to Task

6.5.1.8 PreClass Task 8

According to the one-sample t-test these participants are responding in a similar manner to the participants from the original study, and as such we may be inclined to assume that the recruitment process for our pilot experiment is working well.

However, according to the binomial test the participants are responding differently from the original sample. So which test result should you take as the finding?

Keep in mind that the binomial test is very rough and simply categorises each participant as a yes or a no. The one-sample t-test uses much more of the available data and, to some degree, gives a more accurate answer. However, the fact that the two tests give quite different answers may give you reason to question whether the results are stable, and potentially you should look to gather a larger sample to get a more accurate representation of the population.

Return to Task

6.5.2 InClass Activities

6.5.2.1 InClass Task 1

library("tidyverse")
library("broom") # you'll need broom::tidy() later

evaluators <- read_csv("evaluators.csv")

evaluators %>%
  summarize(mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))

evaluators %>%
  count(sex)

# If using `recode()`, store the counts so they can be used with inline code later:
e_count <- evaluators %>%
  count(sex) %>%
  mutate(sex_names = recode(sex, "1" = "male", "2" = "female"))

e_count
  • The mean age of the evaluators was 30.9
  • The standard deviation of the age of the evaluators was 6.24
  • There were 4 males and `e_count %>% filter(sex_names == "female") %>% pull(n)` females, with 5 people not stating a sex.

Return to Task

6.5.2.2 InClass Task 2

  • load in the data
ratings <- read_csv("ratings.csv")
  1. First pull out the ratings associated with intellect
iratings <- ratings %>%
  filter(Category %in% c("competent", "thoughtful", "intelligent"))
  2. Next calculate means for each evaluator
imeans <- iratings %>%
  group_by(eval_id) %>%
  summarise(Rating = mean(Rating))
  3. Mutate on the Category variable. This way we can combine with 'impression' and 'hire' into a single table which will be very useful!
imeans2 <- imeans %>%
  mutate(Category = "intellect")

And then combine all the information into one single tibble.

ratings2 <- ratings %>%
  filter(Category %in% c("impression", "hire")) %>%
  bind_rows(imeans2) %>%
  inner_join(evaluators, "eval_id") %>%
  select(-age, -sex) %>%
  arrange(eval_id, Category)

Return to Task

6.5.2.3 InClass Task 4

  1. First we calculate the group means:
group_means <- ratings2 %>%
  group_by(condition, Category) %>%
  summarise(m = mean(Rating), sd = sd(Rating))
## `summarise()` has grouped output by 'condition'. You can override using the `.groups` argument.
  2. And we can call them and look at them by typing:
group_means
  3. Now to just look at intellect ratings we need to filter them into a new tibble:
intellect <- filter(ratings2, Category == "intellect")
  4. And then we run the actual t-test and tidy it into a table.
    • t.test() requires two vectors as input
    • pull() will pull out a single column from a tibble, e.g. Rating from intellect
    • tidy() takes information from a test and turns it into a table. Try running the t.test with and without piping into tidy() to see what it does differently.
intellect_t <- t.test(intellect %>% filter(condition == "listened") %>% pull(Rating),
                      intellect %>% filter(condition == "read") %>% pull(Rating),
                      var.equal = TRUE) %>%
  tidy()
  5. Now we repeat for HIRE and IMPRESSION
hire <- filter(ratings2, Category == "hire")
hire_t <- t.test(hire %>% filter(condition == "listened") %>% pull(Rating),
                      hire %>% filter(condition == "read") %>% pull(Rating),
                 var.equal = TRUE) %>%
  tidy()
  6. And for Impression
impress <- filter(ratings2, Category == "impression")
impress_t <- t.test(impress %>% filter(condition == "listened") %>% pull(Rating),
                    impress %>% filter(condition == "read") %>% pull(Rating),
                    var.equal = TRUE) %>%
  tidy()
  7. Before combining all into one table showing all three t-tests
results <- bind_rows("hire" = hire_t, 
                     "impression" = impress_t,
                     "intellect" = intellect_t, .id = "id")

results

Return to Task

6.5.2.4 Going Further with your coding

An alternative solution to Task 4: There is actually a quicker way to do this analysis of three t-tests which you can have a look at below if you have the time. This uses very advanced coding with some functions we won't really cover in this book. Do not worry if you can't quite follow it though; the main thing is to understand what we covered in the main chapter activities - the outcome is the same.

ratings2 %>%
  group_by(Category) %>%
  nest() %>%
  mutate(ttest = map(data, function(x) {
    t.test(Rating ~ condition, x, var.equal = TRUE) %>% 
      tidy()
  })) %>%
  select(Category, ttest) %>%
  unnest(cols = c(ttest))
Category estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
hire 1.825397 4.714286 2.888889 2.620100 0.0126745 37 0.4137694 3.237024 Two Sample t-test two.sided
impression 1.894180 5.968254 4.074074 2.850766 0.0070911 37 0.5478846 3.240475 Two Sample t-test two.sided
intellect 1.986773 5.634921 3.648148 3.525933 0.0011444 37 0.8450652 3.128480 Two Sample t-test two.sided

6.5.3 Test Yourself Activities

6.5.3.1 Assignment Task 1A: Libraries

library(broom)
library(tidyverse)

Return to Task

6.5.3.2 Assignment Task 1B: Loading in the data

  • Use read_csv() to read in the data! Either of the lines below will work, depending on where the file is saved relative to your .Rmd file - if, as instructed, CRT_Data.csv is in the same folder as your .Rmd, the second line is the one you need.
crt <- read_csv("data/06-s01/homework/CRT_Data.csv")
crt <- read_csv("CRT_Data.csv")

Return to Task

6.5.3.3 Assignment Task 2: Selecting only relevant columns

The key columns are:

  • ID
  • Treatment
  • CorrectAnswers

Creating crt2 which is a tibble with 3 columns and 243 rows.

crt2 <- select(crt, ID, Treatment, CorrectAnswers)

Return to Task

6.5.3.4 Assignment Task 3: Verify the number of subjects in each group

The Participants section of the article contains the following statement:

243 men (mostly college students; for demographic details, see Table S1 in the Supplemental Material available online) were randomly administered a topical gel containing either testosterone (n = 125) or placebo (n = 118).

In the t3 code block below, replace the NULLs with lines of code to calculate:

  1. The number of men in each Treatment. This should be a tibble/table called cond_counts containing a column called Treatment showing the two groups and a column called n which shows the number of men in each group.

  2. The total number of men in the sample. This should be a single value, not a tibble/table, and should be stored in n_men.

You know the answer to both of these tasks already. Make sure that your code gives the correct answer!

For cond_counts, you could do:

cond_counts <- crt2 %>% group_by(Treatment) %>% summarise(n = n())

Or alternatively

cond_counts <- crt2 %>% count(Treatment)

For n_men, you could do:

n_men <- crt2 %>% summarise(n = n()) %>% pull(n)

Or alternatively

n_men <- nrow(crt2)

Solution:

When formatted with inline R code as below:

`r n_men` men (mostly college students; for demographic details, see Table S1 in the Supplemental Material available online) were randomly administered a topical gel containing either testosterone (n = `r cond_counts %>% filter(Treatment == 1) %>% pull(n)`) or placebo (n = `r cond_counts %>% filter(Treatment == 0) %>% pull(n)`).

should give:

243 men (mostly college students; for demographic details, see Table S1 in the Supplemental Material available online) were randomly administered a topical gel containing either testosterone (n = 125) or placebo (n = 118).

Return to Task

6.5.3.5 Assignment Task 4: Reproduce Figure 2A

You could produce a good representation of Figure 2A with the following approach:

crt_means <- crt2 %>% 
  group_by(Treatment) %>% 
  summarise(m = mean(CorrectAnswers), sd = sd(CorrectAnswers)) %>%
  mutate(Treatment = recode(Treatment, "0" = "Placebo", "1" = "Testosterone Group"))

ggplot(crt_means, aes(Treatment, m, fill = Treatment)) + 
  geom_col() + 
  theme_classic() + 
  labs(x = "CRT", y = "Number of Correct Answers") +
  guides(fill = "none") +
  scale_fill_manual(values = c("#EEEEEE","#AAAAAA")) +
  coord_cartesian(ylim = c(1.4,2.4), expand = TRUE)

Figure 6.5: A representation of Figure 2A

Return to Task

6.5.3.6 Assignment Task 5: Interpreting your Figure

Option 3 is the correct answer given that:

The Testosterone group (M = 1.66, SD = 1.18) would appear to have fewer correct answers on average than the Placebo group (M = 2.10, SD = 1.02) on the Cognitive Reflection Test suggesting that testosterone does in fact inhibit the ability to override incorrect intuitive judgements with the correct response.

answer_t5 <- 3

Return to Task

6.5.3.7 Assignment Task 6: t-test

You need to pay attention to the order when using this first approach, making sure that the 0 group is entered first. This will put the Placebo group as Estimate1 in the output. In reality it does not change the values, but if you were to pass this code on to someone who expects Placebo to be Estimate1, you need to make sure you coded it that way.

t_table <- t.test(crt2 %>% filter(Treatment == 0) %>% pull(CorrectAnswers),
                  crt2 %>% filter(Treatment == 1) %>% pull(CorrectAnswers),
                  var.equal = TRUE) %>%
  tidy()
  • Alternatively, you could use what is known as the formula approach as shown below. Here you state the DV ~ IV and you say the name of the tibble in data = .... You just need to make sure that the columns you state as the DV and the IV are actually in the tibble!
t_table <- t.test(CorrectAnswers ~ Treatment, data = crt2, var.equal = TRUE) %>% tidy()

Return to Task

6.5.3.8 Assignment Task 7: Reporting results

  • The degrees of freedom (df) is found under parameter
t_df <- t_table$parameter
  • An alternative option for this would be as follows, using the pull() method. This would work for B to D as well
t_df <- t_table %>% pull(parameter)
  • The t-value is found under statistic
t_value <- t_table$statistic %>% round(3)
  • The p-value is found under p.value
p_value <- t_table$p.value %>% round(3)
  • The absolute difference between the two means can be calculated as follows:
t_diff <- (t_table$estimate1 - t_table$estimate2) %>% round(3) %>% abs()

If you have completed t7A to t7D accurately, then when knitted, Option 4 would read as follows:

The testosterone group performed significantly worse (0.438 fewer correct answers) than the placebo group, t(241) = 3.074, p = 0.002

and would therefore be the correct answer!

answer_t7e <- 4

Return to Task

Chapter Complete!

6.6 Additional Material

Below is some additional material that might help you understand the tests in this Chapter a bit more as well as some additional ideas.

More on t.test() - vectors vs. formula

A quick note on running the t-test in two different ways. In the lab we showed you how to run a t-test on a between-subjects design. This is the Welch's t-test version of the code from the lab:

t_table <- t.test(crt2 %>% filter(Treatment == 0) %>% pull(CorrectAnswers),
                  crt2 %>% filter(Treatment == 1) %>% pull(CorrectAnswers),
                  var.equal = FALSE) %>%
  tidy()

This is sometimes referred to as the vector approach, and what the code is doing is taking each group's data as a vector input. For example, if you were to just look at the data for the Treatment 0 group, then that line of code:

crt2 %>% filter(Treatment == 0) %>% pull(CorrectAnswers)

shows just these values:

##   [1] 2 2 3 3 3 2 1 0 2 0 2 1 3 2 2 3 1 3 2 3 2 0 2 3 3 3 3 2 2 2 3 0 3 3 3 0 0
##  [38] 3 3 1 3 1 2 3 3 1 3 3 3 3 2 3 2 2 2 2 3 3 1 3 1 0 3 1 3 1 2 3 3 0 2 2 2 3
##  [75] 1 2 3 0 3 1 1 3 3 0 1 0 3 2 2 3 1 3 3 2 3 1 3 2 3 3 3 3 3 3 2 3 3 2 2 0 3
## [112] 0 1 2 2 3 3 3

And likewise the Treatment 1 line of code just gives these values:

##   [1] 2 0 0 1 3 0 0 2 2 3 1 3 3 1 0 2 1 0 0 3 2 0 3 0 3 3 3 0 2 0 3 3 3 1 0 0 1
##  [38] 2 3 2 2 2 3 3 1 0 3 0 1 1 1 0 0 1 2 3 3 0 2 2 3 3 2 3 3 3 3 3 3 1 2 3 1 1
##  [75] 2 3 3 0 2 2 1 1 2 3 0 0 3 0 1 2 0 0 0 0 3 1 3 2 2 2 1 2 3 3 0 3 1 0 1 2 0
## [112] 2 2 2 3 2 3 3 3 0 3 0 1 2 3

And the t.test() function is just saying: this is group 1's data, this is group 2's data, and I am comparing them. However, the eagle-eyed among you will have seen in the solution an alternative bit of code that looks like this:

t_table <- t.test(CorrectAnswers ~ Treatment, data = crt2, var.equal = FALSE) %>% tidy()

This is the formula method and if you look at the structure of crt2 (the first 6 rows are shown below) you get an idea of how the formula approach works:

ID Treatment CorrectAnswers
11 1 2
12 1 0
13 1 0
14 1 1
15 1 3
16 1 0

The formula approach works as t.test(DV ~ IV, data = my_data) where:

  • DV is the name of the column with your responses (e.g. mean reaction time)
  • IV is the name of the column with your groups (in categorical form - e.g. 1 vs 0)
  • ~ means "by" or "based on" or "split up by". So test the DV based on the IV grouping
  • and my_data is the name of your tibble.

Whichever approach you use, the observed t-value (in magnitude), the p-value, and the df (parameter) should be the same! What might change is which group the t-test treats as Group 1 and which as Group 2. And, from your knowledge of the t-test formula, this will affect whether your t-value is positive or negative. For example:

t_table <- t.test(crt2 %>% filter(Treatment == 0) %>% pull(CorrectAnswers),
                  crt2 %>% filter(Treatment == 1) %>% pull(CorrectAnswers),
                  var.equal = FALSE) %>%
  tidy()

Gives the t-value of t = 3.09, whereas if you switch the order of inputting the Treatment groups as such:

t_table <- t.test(crt2 %>% filter(Treatment == 1) %>% pull(CorrectAnswers),
                  crt2 %>% filter(Treatment == 0) %>% pull(CorrectAnswers),
                  var.equal = FALSE) %>%
  tidy()

You get a t-value of t = -3.09. Same magnitude, just one is positive and the other is negative, because the order of who is Group 1 and who is Group 2 is switched. And just for comparison, the formula approach:

t_table <- t.test(CorrectAnswers ~ Treatment, data = crt2, var.equal = FALSE) %>% tidy()

Gives a t-value of t = 3.09. So again the same magnitude but you are not controlling who is Group 1 and who is Group 2.

Long story short is that both methods, if used appropriately, run the same analysis. The benefit of the vector approach is that you can dictate in your analysis who is Group 1 and who is Group 2. And that is really the only difference.
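If it helps to see why only the sign changes, recall that the numerator of the independent-samples t statistic is just the difference between the two group means:

\(t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}\)

Swapping which group is entered first swaps \(\bar{x}_1\) and \(\bar{x}_2\), which flips the sign of the numerator - and therefore of t - but changes nothing else.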

Misleading and Appropriate Barplots

Misleading Barplots

The data used in the PreClass activities allows us to show you something that you might find interesting. As we mentioned back in Chapter 3 on visualisation, the barplot is becoming less frequently used, as summarised in this blog: One simple step to improve statistical inferences. The data we have used today demonstrate the point that a simple barplot can actually be somewhat misleading about the data. Have a look at the figure below. Both bars represent the data from our 22 Normal Sleep participants. The column on the left, hide_data, is a standard representation (albeit without error bars) whereas the column on the right, show_data, demonstrates the issue. Looking at the column on the left, the assumption is that all the data is around the peak of the column. However, looking at the column on the right, we can see that this is not the case and there are participants both above and below the mean by approximately 100 ms. This misleading perception, when the data is hidden, was tested and shown to exist in participants viewing these figures by Newman and Scholl (2012) which you can read up on if you like.


Figure 6.6: How representative are barplots of the actual spread of the data!

The function we use to show the data points is geom_jitter() and it gets added to a visualisation pipeline just like the other geom_?()s that we have used; your code will look something like this:

ggplot(my_data, aes(x = my_x_variable, y = my_y_variable)) +
  geom_col() +
  geom_jitter(width = .1)
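And as a rough, concrete sketch using the ns_data tibble from the PreClass solutions (the summary tibble name below is made up, and the exact styling of Figure 6.6 is not reproduced here):

# summarise the mean valid RT so the bar has a height to plot
ns_summary <- ns_data %>%
  summarise(m = mean(valid_rt))

# a bar of the mean with the individual data points jittered on top
ggplot(ns_data, aes(x = "show_data", y = valid_rt)) +
  geom_col(data = ns_summary, aes(y = m), fill = "grey80", width = .5) +
  geom_jitter(width = .1) +
  labs(x = NULL, y = "Valid trial reaction time (ms)")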

Look out for it in the coming chapters but the main thing to keep in mind is that barplots can be misleading and displaying the individual data may be more informative - particularly when the response measured creates a spread of data.

Appropriate Barplots

That said, barplots do have their place in data analysis and research methods. The best time to use a barplot is when the value is a count - i.e. when the value has no variability in it. A great example of this is in observational research using categorical data. Think about the table we simulated for the chi-square data as part of the Additional Materials in Chapter 5. Here is that table again:

Table 6.4: Cross-Tabulation Table of Simulated Data from Chapter 5 Additional Materials
Groups No Unsure Yes Total
A 17 18 15 50
B 18 14 18 50

We have two groups with three possible responses and each participant can only give one response. What that means, for example, is that there is no variation in the number of people in Group A who said No - it is exactly 17 people. Likewise, exactly 14 people in Group B said Unsure. So here a barplot actually makes sense to use as it accurately reflects the value in that group/condition and no data is hidden. The plot for the above table might look like this:


Figure 6.7: A barplot works well with categorical data
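If you wanted to build a plot along those lines yourself, a minimal sketch might look like this (the tibble name sim_counts and the styling are made up; the counts are taken straight from the table above):

# the counts from the cross-tabulation table, in long format
sim_counts <- tibble(Groups = rep(c("A", "B"), each = 3),
                     Response = rep(c("No", "Unsure", "Yes"), times = 2),
                     n = c(17, 18, 15, 18, 14, 18))

ggplot(sim_counts, aes(x = Response, y = n, fill = Groups)) +
  geom_col(position = "dodge") +
  labs(y = "Number of participants")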

In short, the main message about any visualisation is to accurately and clearly convey the information to your reader. If there is a spread of data within a variable then try to show that more clearly with error bars and data points, perhaps even using a violin plot instead of a barplot. If however there is no spread of data within a variable then a barplot works well.

Analysing chi-squares

Finally, to round off this additional material, we thought it might be handy to show how to run a chi-square analysis. We cover the chi-square more in lectures, but adding it here might help some people. We won't really look at the hand calculations, but you will know that the formula for the chi-square is:

\(\chi_c^2 = \sum\frac{(O_i - E_i)^2}{E_i}\)

  • where \(\chi_c^2\) is the chi-square statistic - often simplified to \(\chi^2\)
  • \(O_i\) is the observed value of group \(i\) and
  • \(E_i\) is the expected value of group \(i\)

And we can use the data from above to show how to quickly run the chi-square analysis on the table. First we need to tidy up the table to get rid of the Groups and Total columns, leaving us with a tibble called chi_table that has this structure:

Table 6.5: Cross-Tabulation Table Data for analysis
No Unsure Yes
17 18 15
18 14 18
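One way you could build that tibble yourself, using the values straight from the table above (any route to the same 2 x 3 table of counts would work just as well):

# row 1 is Group A, row 2 is Group B
chi_table <- tibble(No = c(17, 18),
                    Unsure = c(18, 14),
                    Yes = c(15, 18))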

Now to analyse the data we use the chisq.test() function. There are really only two arguments you need to think about:

  • your data - for us stored in chi_table
  • and whether to set correct as TRUE or FALSE. This is Yates' Continuity Correction. The default is set as TRUE but a lot of people set it as FALSE so we will select that today.

And we run the analysis as such:

chisq_result <- chisq.test(chi_table, correct = FALSE) %>% tidy()

Giving the following result:

Table 6.6: Chi-Square Result
statistic p.value parameter method
0.8012987 0.6698849 2 Pearson's Chi-squared test

And could be written up as \(\chi^2\)(df = 2, N = 100) = 0.8, p = 0.67. And if you wanted to check the output and calculate the data by hand, or needed to do further analysis, you can see the expected values by running the analysis but not tidying it, as follows:

chisq_untidy <- chisq.test(chi_table, correct = FALSE)

chisq_untidy$expected
##        No Unsure  Yes
## [1,] 17.5     16 16.5
## [2,] 17.5     16 16.5
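To show that the formula and the function agree, plugging the observed and expected values into the chi-square formula by hand gives (to rounding) the same statistic as reported above:

\(\chi^2 = \frac{(17-17.5)^2}{17.5} + \frac{(18-16)^2}{16} + \frac{(15-16.5)^2}{16.5} + \frac{(18-17.5)^2}{17.5} + \frac{(14-16)^2}{16} + \frac{(18-16.5)^2}{16.5} \approx 0.80\)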

And if you wanted the expected values for just one of the groups - say the second group, Group B:

chisq_untidy$expected[2,]
##     No Unsure    Yes 
##   17.5   16.0   16.5

Or even just the observed values to check you entered the data correctly:

chisq_untidy$observed
##      No Unsure Yes
## [1,] 17     18  15
## [2,] 18     14  18

We will leave that there but hopefully this gives you some additional insight into how to analyse a chi-square test.

End of Additional Material!