--- title: 'Formative Exercise 09: MSc Data Skills Course' author: "Psychology, University of Glasgow" output: html_document --- ```{r setup, include=FALSE} # please do not alter this code chunk knitr::opts_chunk$set(echo = TRUE, message = FALSE, error = TRUE) library("broom") library("tidyverse") ``` ## The `iris` dataset There is a built-in dataset called `iris` that has measurements of different parts of flowers. (See `?iris` for information about the dataset.) ### Question 1 Use ggplot2 to make a scatterplot that visualizes the relationship between sepal length (horizontal axis) and petal width (vertical axis). Watch out for overplotting. ```{r Q1} ggplot() ``` ### Question 2 Run a regression model that predicts the petal width from the sepal length, and store the model object in the variable `iris_mod`. End the block by printing out the summary of the model. ```{r Q2} iris_mod <- NULL summary(iris_mod) #print out the model summary ``` ### Question 3 Make a histogram of the residuals of the model using ggplot2. ```{r Q3} residuals <- NULL ggplot() ``` ### Question 4 Write code to predict the petal width for two plants, the first with a sepal length of 5.25cm, and the second with a sepal length of 6.3cm. Store the vector of predictions in the variable `iris_pred`. ```{r Q4} iris_pred <- NULL iris_pred # print the predicted values ``` ## Simulating data from the linear model ### Question 5 *NOTE: You can knit this file to html to see formatted versions of the equations below (which are enclosed in `$` characters); alternatively, if you find it easier, you can hover your mouse pointer over the `$` in the code equations to see the formatted versions.* Write code to randomly generate 10 Y values from a simple linear regression model with an intercept of 3 and a slope of -7. Recall the form of the linear model: $Y_i = \beta_0 + \beta_1 X_i + e_i$ The residuals ($e_i$s) are drawn from a normal distribution with mean 0 and variance $\sigma^2 = 4$, and $X$ is the vector of integer values from 1 to 10. Store the 10 observations in the variable `Yi` below. (NOTE: the standard deviation is the *square root* of the variance, i.e. $\sigma$; `rnorm()` takes the standard deviation, not the variance, as its third argument). ```{r Q5} X <- NULL err <- NULL Yi <- NULL Yi # print the values of Yi ``` ## Advanced ### Question 6 Write a function to simulate data with the form. $Y_i = \beta_0 + \beta_1 X_i + e_i$ The function should take arguments for the number of observations to return (`n`), the intercept (`b0`), the effect (`b1`), the mean and SD of the predictor variable X (`X_mu` and `X_sd`), and the SD of the residual error (`err_sd`). The function should return a tibble with `n` rows and the columns `id`, `X` and `Y`. ```{r Q6} sim_lm_data <- function(n){ #edit this function } dat6 <- sim_lm_data(n = 10) # do not edit knitr::kable(dat6) # print table ``` ### Question 7 Use the function from Question 6 to generate a data table with 10000 subjects, an intercept of 80, an effect of X of 0.5, where X has a mean of 0 and SD of 1, and residual error SD of 2. Analyse the data with `lm()`. Find where the analysis summary estimates the values of `b0` and `b1`. What happens if you change the simulation values? ```{r Q7} dat7 <- NULL mod7 <- NULL summary(mod7) # print summary ``` ### Question 8 Use the function from Question 6 to calculate power by simulation for the effect of X on Y in a design with 50 subjects, an intercept of 80, an effect of X of 0.5, where X has a mean of 0 and SD of 1, residual error SD of 2, and alpha of 0.05. Hint: use `broom::tidy()` to get the p-value for the effect of X. ```{r Q8} power <- NULL power # print the value ``` ### Question 9 Calculate power (i.e., the false positive rate) for the effect of X on Y in a design with 50 subjects where there is no effect and alpha is 0.05. ```{r Q9} false_pos <- NULL false_pos # print the value ```