---
title: 'Formative Exercise 09: MSc Data Skills Course'
author: "Psychology, University of Glasgow"
output: html_document
---

```{r setup, include=FALSE}
# please do not alter this code chunk
knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE,
                      error = TRUE)
library("broom")
library("tidyverse")
```

## The `iris` dataset

There is a built-in dataset called `iris` that has measurements of different parts of flowers. (See `?iris` for information about the dataset.)


### Question 1

Use ggplot2 to make a scatterplot that visualizes the relationship between sepal length (horizontal axis) and petal width (vertical axis). Watch out for overplotting.

```{r Q1}
ggplot()
```


### Question 2

Run a regression model that predicts the petal width from the sepal length, and store the model object in the variable `iris_mod`.  End the block by printing out the summary of the model.

```{r Q2}

iris_mod <- NULL

summary(iris_mod) #print out the model summary
```


### Question 3

Make a histogram of the residuals of the model using ggplot2.

```{r Q3}
residuals <- NULL

ggplot()
```


### Question 4

Write code to predict the petal width for two plants, the first with a sepal length of 5.25 cm, and the second with a sepal length of 6.3 cm. Store the vector of predictions in the variable `iris_pred`.
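
As a reminder of the general pattern, here is a minimal sketch using the built-in `cars` dataset rather than `iris`. The names `cars_mod`, `new_cars`, and `cars_pred` are made up for this illustration. The key point is that `predict()` expects `newdata` to be a data frame whose column name matches the predictor used in the model formula.

```r
# Illustration on the built-in `cars` data, not the exercise data
cars_mod <- lm(dist ~ speed, data = cars)

# newdata must have a column named exactly like the predictor (`speed`)
new_cars <- data.frame(speed = c(10, 20))

# one predicted stopping distance per row of new_cars
cars_pred <- predict(cars_mod, newdata = new_cars)

cars_pred
```

Each prediction is simply the estimated intercept plus the estimated slope times the new predictor value.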

```{r Q4}
iris_pred <- NULL

iris_pred # print the predicted values
```


## Simulating data from the linear model

### Question 5

*NOTE: You can knit this file to html to see formatted versions of the equations below (which are enclosed in `$` characters); alternatively, if you find it easier, you can hover your mouse pointer over the `$` in the code equations to see the formatted versions.* 

Write code to randomly generate 10 Y values from a simple linear regression model with an intercept of 3 and a slope of -7.  Recall the form of the linear model:

$Y_i = \beta_0 + \beta_1 X_i + e_i$

The residuals ($e_i$s) are drawn from a normal distribution with mean 0 and variance $\sigma^2 = 4$, and $X$ is the vector of integer values from 1 to 10. Store the 10 observations in the variable `Yi` below. (NOTE: the standard deviation is the *square root* of the variance, i.e. $\sigma$; `rnorm()` takes the standard deviation, not the variance, as its third argument).
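
The difference between variance and standard deviation can be checked with a quick sketch. The variance of 9 and the names `err_var` and `e` are illustrative assumptions, deliberately different from the values in this question.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

err_var <- 9  # illustrative variance, not the exercise value

# rnorm() takes the standard deviation, so pass the square root of the variance
e <- rnorm(100000, mean = 0, sd = sqrt(err_var))

var(e)  # should be close to err_var
sd(e)   # should be close to sqrt(err_var)
```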

```{r Q5}
X   <- NULL
err <- NULL
Yi  <- NULL

Yi # print the values of Yi
```

## Advanced

### Question 6

Write a function to simulate data of the form:

$Y_i = \beta_0 + \beta_1 X_i + e_i$

The function should take arguments for the number of observations to return (`n`), the intercept (`b0`), the effect (`b1`), the mean and SD of the predictor variable X (`X_mu` and `X_sd`), and the SD of the residual error (`err_sd`). Give the arguments default values so that the call `sim_lm_data(n = 10)` in the chunk below still runs. The function should return a tibble with `n` rows and the columns `id`, `X` and `Y`.
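
If default arguments are unfamiliar, the sketch below shows the idea on an unrelated, hypothetical function (`sim_scores`, with made-up arguments `mu` and `sd`). It returns a base R data frame to stay self-contained; `tibble::tibble()` could be used in the same way.

```r
# Hypothetical helper, for illustration only: every argument has a default,
# so calling it with just `n` (or with no arguments at all) still works.
sim_scores <- function(n = 5, mu = 100, sd = 15) {
  data.frame(
    id    = seq_len(n),                   # one row per simulated subject
    score = rnorm(n, mean = mu, sd = sd)  # random normal values
  )
}

set.seed(1)                 # arbitrary seed for reproducibility
small <- sim_scores(n = 3)  # mu and sd fall back to their defaults
small
```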

```{r Q6}

sim_lm_data <- function(n){
  #edit this function
}


dat6 <- sim_lm_data(n = 10) # do not edit 

knitr::kable(dat6) # print table
```

### Question 7

Use the function from Question 6 to generate a data table with 10000 subjects, an intercept of 80, an effect of X of 0.5, where X has a mean of 0 and SD of 1, and residual error SD of 2.

Analyse the data with `lm()`. Find where the model summary reports the estimates of `b0` and `b1`. What happens if you change the simulation values?

```{r Q7}

dat7 <- NULL

mod7 <- NULL

summary(mod7) # print summary

```


### Question 8

Use the function from Question 6 to calculate power by simulation for the effect of X on Y in a design with 50 subjects, an intercept of 80, an effect of X of 0.5, where X has a mean of 0 and SD of 1, residual error SD of 2, and alpha of 0.05.

Hint: use `broom::tidy()` to get the p-value for the effect of X.
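
The general recipe for power by simulation is: simulate one study, extract its p-value, repeat many times, and take the proportion of p-values below alpha. The sketch below applies that recipe to a different, simpler design (a one-sample t-test) using base R only. The function `sim_p`, its arguments, and the chosen values are all assumptions made for illustration.

```r
set.seed(8675309)  # arbitrary seed so the sketch is reproducible

# One simulated study: returns the p-value of a one-sample t-test against 0
sim_p <- function(n, mu, sd) {
  x <- rnorm(n, mean = mu, sd = sd)
  t.test(x, mu = 0)$p.value
}

# Repeat the study many times and collect the p-values
p_values <- replicate(1000, sim_p(n = 30, mu = 0.5, sd = 1))

# Power is estimated as the proportion of p-values below alpha
alpha <- 0.05
power_est <- mean(p_values < alpha)
power_est
```

More replications give a more stable estimate at the cost of a longer run time.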

```{r Q8}
power <- NULL

power # print the value
```

### Question 9

Use the same simulation approach to calculate the false positive rate, i.e., the proportion of significant results for the effect of X on Y, in a design with 50 subjects where there is no effect and alpha is 0.05.

```{r Q9}
false_pos <- NULL

false_pos # print the value
```