19 R

19.1 R markdown

The R-specific version of markdown: a way to specify formatting, such as headers, paragraphs, lists, bolding, and links, as well as code blocks and inline code.

More...

An R markdown file starts with a YAML header that usually contains the title, author, and output type.

---
title: "Analysis Plan Template"
author: "School of Psychology, University of Glasgow"
output: html_document
---

The rest of the file is a mix of markdown and code chunks. Here is an example of two section titles. The first section has an r chunk for loading the packages and the second section has a list of steps.

## Packages used

```{r setup, include = FALSE}
# every time you add a new package, include it in this section
library(tidyverse)
```

## Data Processing

1. load in packages
2. load in dataset
3. Wrangle data into appropriate format for analysis and checks. You might want to reshape the data or combine different values to make a new variable. 

Resources:

19.2 R-squared

A statistic that represents the proportion of the variance for a dependent variable explained by the predictor variable(s) in a linear model.

More...

Let's simulate some data:

# simulate some data
set.seed(8675309)

n <- 100
intercept <- 10
effect <- 0.5
error_sd <- 2

dat <- tibble::tibble(
  predictor = rnorm(n, 0, 1),
  error = rnorm(n, 0, error_sd),
  dv = intercept + (effect * predictor) + error
)

Analyse the data with a linear model and use the summary() function to view the R-squared values.

model <- lm(dv ~ predictor, dat)
summary(model)
#> 
#> Call:
#> lm(formula = dv ~ predictor, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.0692 -1.4106  0.2447  1.4562  3.9263 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.8913     0.1991  49.677  < 2e-16 ***
#> predictor     0.6361     0.2150   2.958  0.00388 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.988 on 98 degrees of freedom
#> Multiple R-squared:  0.08198,    Adjusted R-squared:  0.07261 
#> F-statistic: 8.752 on 1 and 98 DF,  p-value: 0.003877

Adjusted R-squared is a modified version of R-squared adjusted for the number of predictors in the model. You can extract the R-squared and adjusted R-squared values from the model summary.

summary(model)$r.squared
#> [1] 0.08198174
summary(model)$adj.r.squared
#> [1] 0.0726142

19.3 random effect

An effect associated with an individual sampling unit, usually represented by an offset from a fixed effect.

Example: If the grand mean response time in a population is 600 milliseconds, that number represents the typical value. The mean response time for an individual subject \(s\) can be represented as an offset (deviation) from that value. For example, if subject \(s\) has a mean reaction time of 650 ms, that would imply a random effect for that subject of +50 ms.

Unlike fixed effects, we expect the underlying random effects to change from experiment to experiment as the sampling units (e.g., subjects) change.

In a mixed effects model, you can get a table of just the random effects with the code below:

model <- afex::lmer(
  rating ~ rater_age * face_age + # fixed effects
  (1 | rater_id) + (1 | face_id), # random effects
  data = faux::fr4
)

broom.mixed::tidy(model, effects = "ran_pars")
effect group term estimate
ran_pars face_id sd__(Intercept) 0.6040734
ran_pars rater_id sd__(Intercept) 0.9144817
ran_pars Residual sd__Observation 1.0434014

19.4 random factor

A factor whose levels are taken to represent a proper subset of a population of interest, typically because the observed levels are the result of sampling.

If you perform a study where the population of interest is "undergraduates at the University of Glasgow", you are very unlikely to obtain data from all approximately 20,000 undergraduate students. Instead, you would likely obtain a (hopefully random) sample of undergraduates, with each undergraduate forming a single 'level' of the 'subject' factor. By treating 'subject' as random rather than fixed in your analysis, you will obtain parameter estimates that are closer to the true population values.

Random factors are usually contrasted with fixed factors, whose levels are assumed to represent all the levels of interest in a population.

19.5 random seed

A value used to set the initial state of a random number generator.

Random seeds are used in random number generation. Each time you generate a random number, the number you get depends on the state of the underlying random number generator. If you set this state to a known value, you will get the same random numbers in the same order.

Random seeds are used to make processes that involve random values reproducible. In R, you can set a random seed using the set.seed() function. If you put set.seed() at the start of your script, you will get the same output every time.

More...
set.seed(8675309) # Lisa's favourite seed
rnorm(3)
#> [1] -0.9965824  0.7218241 -0.6172088
set.seed(1) # a different seed
rnorm(3)
#> [1] -0.6264538  0.1836433 -0.8356286
set.seed(8675309) # the first seed again
rnorm(3)
#> [1] -0.9965824  0.7218241 -0.6172088

19.6 reactivity

Changes in a Shiny app that occur in response to user input.

Reactive features of your UI are rendered in the server.

19.7 README

A text file in the base directory of a project that describes the project.

A README is a plain text file that contain information you want new users of the project to read first. This commonly contains one or more of the following: configuration or installation instructions, a list of files included with the project, a usage license, citation information, and author contact info.

19.8 relative path

The location of a file in relation to the working directory.

For example, if your working directory is /Users/me/study/ and you want to refer to a file at /Users/me/study/data/faces.csv, the relative path is data/faces.csv. Use ../ to move up one directory.

# the working directory: /Users/me/study/

# read a file inside the wd: /Users/me/study/data/faces.csv
qdat <- readr::read_csv("data/faces.csv")

# read a file outside the wd: /Users/me/other_study/data/exp.csv
xdat <- readr::read_csv("../other_study/data/exp.csv")

Make sure you always use relative paths in an R Markdown document, which automatically sets the working directory to the directory that contains the .Rmd file.

Contrast with absolute path.

19.9 render

To create a file (usually an image or PDF) or widget from source code

In the context of Shiny apps, render usually means to create the HTML to display in a reactive output from code in the server section.

In the context of R Markdown files, knit, render and compile tend to be used interchangeably.

19.10 repeated measures

A dataset has repeated measures if there are multiple measurements taken on the same variable for individual sampling units.

See also multilevel.

19.11 replicability

The extent to which the findings of a study can be repeated with new samples from the same population.

Alternatively, a scientific claim is replicable if it is supported by new data (Errington et al., 2021).

See also reproducibility.

The aim of replicating a study is test the underlying theoretical process, estimate the average effect size, and test whether you can observe the effect independent of the original researchers by recreating a study's methods as closely as possible (Brandt et al., 2014).

There is not a clear definition for a successful replication, but there are some common markers (Errington et al., 2021):

  • Is the replication effect in the same direction as the original study?
  • Is the replication effect and statistical significance in the same direction as the original study?
  • Is the effect size from the original study in the confidence interval of the replication?
  • Is the effect size from the replication study in the confidence interval of the original?
  • Is the effect size from the replication smaller or equal to the effect size from the original study?

19.12 reprex

A reproducible example that is the smallest, completely self-contained example of your problem or question.

For example, you may have a question about how to figure out how to select rows that contain the value "test" in a certain column, but it isn't working. It's clearer if you can provide a concrete example, but you don't want to have to type out the whole table you're using or all the code that got you to this point in your script.

You can include a very small table with just the basics or a smaller version of your problem. Make comments at each step about what you expect and what you actually got.

More...

Which version is easier for you to figure out the solution?

# this doesn't work
no_test_data <- data %>%
  filter(!str_detect(type, "test"))
# with a minimal example table
data <- tribble(
  ~id, ~type, ~x,
  1, "test", 12,
  2, "testosterone", 15,
  3, "estrogen", 10
)

# this should keep IDs 2 and 3, but removes ID 2
no_test_data <- data %>%
  filter(!str_detect(type, "test"))

One of the big benefits to creating a reprex is that you often solve your own problem while you're trying to break it down to explain to someone else.

If you really want to go down the rabbit hole, you can create a reproducible example using the reprex package from tidyverse.

19.13 reproducibility

The extent to which the findings of a study can be repeated in some other context

Reproducibility can be either with new samples from the same population (replicability) or with the same raw data but analyzed by different researchers or by the same researchers on a different occasion (computational or analytical reproducibility).

19.14 reproducible research

Research that documents all of the steps between raw data and results in a way that can be verified.

19.15 residual

Defined as the deviation of an observation from a model's expected value.

Mathematically, the residual is defined as the observed value minus the model's fitted value for that observation.

More...

For example, the linear model \(\hat{Y_i} = 3 + 2 X_i\) predicts a value of \(\hat{Y}_i = 7\) for \(X_i = 2\). If you happen to have observed \(Y_i = 8\) for observation \(i\), then the residual for that observation would be \(Y_i - \hat{Y}_i = 8 - 7 = 1\).

A related but slightly different notion is error.

For example, below we simulate data for two groups of 50 with means of 100 and 105 (and SDs of 10).

set.seed(8675309) # for reproducibility

group0 <- rnorm(n = 50, mean = 100, sd = 10)
group1 <- rnorm(n = 50, mean = 105, sd = 10)

df <- data.frame(
  dv = c(group0, group1),
  group = rep(0:1, each = 50)
)

model <- lm(dv ~ group, data = df)

summary(model)
#> 
#> Call:
#> lm(formula = dv ~ group, data = df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -25.9878  -7.1828  -0.1015   5.8804  20.1878 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  101.027      1.319  76.609   <2e-16 ***
#> group          3.992      1.865   2.141   0.0348 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9.325 on 98 degrees of freedom
#> Multiple R-squared:  0.04467,    Adjusted R-squared:  0.03492 
#> F-statistic: 4.582 on 1 and 98 DF,  p-value: 0.03478

model$coefficients
#> (Intercept)       group 
#>  101.026918    3.992219

The model above has an intercept of 101 and an effect of group of 4, meaning that the predicted value for group 0 is 101 and the predicted value for group 1 is 105. The difference between these predictions and the observed values is the residual error.

intercept <- model$coefficients[[1]]
effect <- model$coefficients[[2]]

df |>
  mutate(predicted = intercept + effect * group,
         residual = dv - predicted) |>
  group_by(group) |>
  slice(1:5) # show 5 from each group
dv group predicted residual
90.03418 0 101.0269 -10.992742
107.21824 0 101.0269 6.191323
93.82791 0 101.0269 -7.199007
120.29392 0 101.0269 19.266997
110.65416 0 101.0269 9.627242
106.34898 1 105.0191 1.329843
106.17582 1 105.0191 1.156684
96.74411 1 105.0191 -8.275028
83.64764 1 105.0191 -21.371500
107.14209 1 105.0191 2.122948

19.16 response variable

A variable (in a regression) whose value is assumed to be influenced by one or more predictor variables.

In an experimental context, the response variable is often referred to as the dependent variable.

19.17 right_join

A mutating join that keeps all the data from the second (right) table and joins anything that matches from the first (left) table.

Right Join

Figure 19.1: Right Join

More...
X <- tibble(
  id = 1:5,
  x = LETTERS[1:5]
)

Y <- tibble(
  id = 2:6,
  y = LETTERS[22:26]
)
Table X Table Y
id x
1 A
2 B
3 C
4 D
5 E
id y
2 V
3 W
4 X
5 Y
6 Z

If there is no matching data in the left table for a row, the values are set to NA.

# X is the right table
data <- right_join(X, Y, by = "id")
id x y
2 B V
3 C W
4 D X
5 E Y
6 NA Z

Order is important for right joins.

# Y is the right table
data <- right_join(Y, X, by = "id")
id y x
2 V B
3 W C
4 X D
5 Y E
1 NA A

See joins for other types of joins and further resources.

19.18 RStudio

An integrated development environment (IDE) that helps you process R code.

Download RStudio at https://rstudio.com/