I Dissertation analysis guide

In this chapter we're going to provide a guide for how to get started working with your dissertation data. It's important to note that this guide won't work for every project and every kind of data - and nothing in this chapter should supersede advice your supervisor has given you. Rather, this chapter is here to help guide you through the initial steps of working with quantitative data and to show you how what you have learned in RM2 maps on to your dissertation project.

I.0.1 Before R

Before you touch R, you need to make sure you understand your design, data, and analysis plan. This will make your work in R considerably easier. Before you start coding, make sure you have answers to the following questions:

I.0.2 Study design

  • What is the design of your study?
  • What are your variables?
    • Are they IVs, predictors, or variables to be correlated?
    • How many levels do they have?
      • Are they between or within-subject?
  • What are your DVs?
    • How will you calculate them? For example, do you need to take the mean of several questionnaire items, or of reaction times from multiple trials?
  • What are your hypotheses?
  • Do you have any demographic data included in the data files?
  • Thinking back to the concept of tidy data, how many variables will you have?

I.0.3 Data wrangling considerations

  • What does each variable in your data set represent?
    • Do you want to rename any variables to make them easier to work with?
      • Do you have a convention for naming variables? For example, I use all lower-case variables and words are always separated by underscores, e.g., group_data.
    • Are there any variables that are not necessary for the analysis? For example, consent forms.
    • Are there minimum and maximum values for any of your variables?
  • Do you have multiple data sets you need to join together?
    • Do they have a common variable? For example, does each file have the participant ID or a question number?
  • Do you need to create any new variables? For example, the mean or sum of a number of questionnaire items to give a total scale score?
  • What type of data should each variable be?
    • Are there any variables that you need to convert to factors?
  • Do you need to recode any variables? For example, 1 = male, 2 = female, or perhaps reverse coding questionnaire responses.
  • Do you have a rule for outlier exclusion/replacement?
  • Do you need to exclude any participants? For example, if they score below a certain threshold, if they are non-native speakers etc.
  • Is there any missing data for each variable?
    • What will you do about missing data?
  • Do you need to tidy your data?

I.0.4 Data analysis

  • What descriptive statistics do you need to calculate for each variable?
    • Do you need to calculate descriptives by group?
    • Are the descriptive statistics similar to the values reported in other studies that have used the same measures? If not, what is the explanation?
  • How will you visualize your data?
  • Do you need to transform your data?
  • Do you need to conduct any kind of reliability analysis?
  • What type of inferential tests are you going to conduct?
    • Where appropriate, do you need to perform one- or two-tailed hypothesis tests?
    • Do you need to do any dummy coding for regression models?
  • What assumptions do you need to test in order to perform your analyses?
    • What will you do if your data do not meet these assumptions?
  • Do you need to apply a correction for multiple comparison testing? If so, which one?
  • Do you need to calculate measures of effect size? If so, which ones?

If this looks like a lot of work - it is. It's important to remember that a lot of the problems students face with R have nothing to do with R itself. In order to wrangle and analyse your data, you first need to understand the data that you have. If you don't know what your independent and dependent variables are, or what analysis you're supposed to be running, it doesn't matter what statistical software you use - you won't be able to complete the task. Don't rush or skip any part of this preparation; doing so will only make the coding harder. If you know the answers to all of the above questions, you're ready to get started in R.

I.0.5 Exploring and cleaning your data

The following sections will not provide comprehensive instructions on how to use the example code, nor will they cover every function you may need to use. You should refer to the RM2 materials, help documentation, and online resources; however, the examples below may give you an idea of where to start.

As a first step you should explore your dataset to understand its properties and then perform some basic cleaning operations that will facilitate further analysis.

I.0.6 summary()

A useful first step is to run summary(). Check the output for:

  • Missing data
  • What type of data each variable is
  • Whether the variable names are easy to work with; for example, Participant Age is difficult to work with because it has two capital letters and a space. Renaming it as age will make your coding easier.
  • Any suspicious values; for example, if your Likert scale is 1-7 you shouldn't have a maximum score of 10, and if you have standardised IQ scores, you may want to check that a score of 200 or 20 isn't a typo.
  • If you make any changes, run summary() again

summary(data)
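
A complementary check (a minimal sketch, assuming the tidyverse is loaded and your data frame is called data) is glimpse(), which lists every variable alongside its data type:

glimpse(data) # one line per variable showing its type and the first few values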

I.0.7 Visualisations

To get an overview of the data and spot any potential issues such as outliers, you should plot histograms and boxplots to eyeball the distributions.

ggplot(data, aes(x = variable)) +
  geom_histogram() # histogram of a single continuous variable

ggplot(data, aes(x = condition, y = score)) +
  geom_boxplot() # boxplots of a continuous variable split by a grouping variable

I.0.8 Renaming variables

data <- rename(data, new_name = old_name) # the new name goes on the left, the old name on the right

I.0.9 Converting to factors

This is incredibly important. Don't skip this step otherwise things might go very wrong much further down the line.

data <- mutate(data, variable = as_factor(variable)) # convert variable to a factor
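
After converting, it can be worth checking that the factor has the levels you expect (a quick sketch using the same placeholder name variable):

levels(data$variable) # print the factor levels to check nothing unexpected has crept in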

I.0.10 Dropping irrelevant variables

data <- select(data, -consent1) # drop the consent1 variable

I.0.11 Recoding variables

data <- mutate(data, variable = recode(variable, "old_code" = "new_code")) # replace values of old_code with new_code
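
If you need to reverse-code questionnaire items, one common approach for a 1-7 scale is to subtract each response from 8 (a sketch; q3 is a hypothetical item name and the 8 depends on your scale's range):

data <- mutate(data, q3_rev = 8 - q3) # reverse-code a 1-7 item: 1 becomes 7, 7 becomes 1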

I.0.12 Exclude participants/data

data <- filter(data, variable != "value") # exclude anyone with value
data <- filter(data, variable == "value") # include only those with value
data <- filter(data, variable %in% c("value1", "value2")) # include all specified values
data <- filter(data, variable > 10) # keep data if value of variable is more than 10
data <- filter(data, variable <= 10) # keep data if value of variable is less than or equal to 10

I.0.13 Reliability

If you are using a scale, for example as part of a questionnaire study, you may need to calculate its reliability, and you should do this before you calculate the aggregated scale scores. There are several options for how to do this and you should consult your supervisor, but one option is to use alpha() from the psych package.

data %>%
  select(Q1:Q5) %>%
  psych::alpha()
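
If your scale contains reverse-coded items that you have not yet recoded, alpha() can flag and reverse items that correlate negatively with the total; treat this as a check rather than a substitute for recoding the items yourself:

data %>%
  select(Q1:Q5) %>%
  psych::alpha(check.keys = TRUE) # automatically reverses any item that correlates negatively with the total score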

I.0.14 Transforming data

The above steps should leave you with a good understanding of your data and all the variables you need for your analysis. The next step is to correct any problems with the data by replacing or transforming individual values. You may also need to create new variables, for example the total score for a questionnaire or mean reaction times or accuracy.

I.0.15 Replace missing values

data <- data %>% 
  mutate(variable = replace_na(variable, 0)) # replace NAs in variable with 0

data <- data %>% replace_na(list(gender = "unknown", # replace NAs in `gender` with "unknown"
                              score = 0,          # replace NAs in `score` with 0  
                              rt = mean(data$rt, na.rm = TRUE)))  # replace NAs in `rt` with mean of `rt`

I.0.16 Convert implausible values

In the case of implausible values (such as a score of 10 on a 7-point Likert scale), you may wish to recode these as missing, or as the mean (or some other value).

data <- data %>% 
  mutate(q1 = ifelse(q1 > 10, NA, q1)) # if the value in q1 is more than 10, replace it with NA; if not, keep the value as it is

data <- data %>%
  mutate(rt = ifelse(rt > 1000, mean(rt, na.rm = TRUE), rt)) # if the value in rt is more than 1000, replace it with the mean rt; otherwise keep the value as it is

I.0.17 Calculate z-scores

You may want to calculate z-scores in order to remove outliers. You could then use filter() on the new z-score variable you have created.

data <- data %>% mutate(z_scores = scale(scores)) # scale() centres the variable and divides by its standard deviation
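
For example, to exclude observations beyond a chosen cut-off (the 2.5 SD used here is purely illustrative; your exclusion rule should follow your analysis plan). Note that scale() returns a matrix column, so as.numeric() is used to flatten it first:

data <- data %>% filter(abs(as.numeric(z_scores)) < 2.5) # keep only rows within +/- 2.5 SD of the mean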

There are a few other types of transformations we can do to correct for problems with normality; two common options are shown below.

I.0.18 Log transformation

A popular method of transformation is to calculate the log of a variable.

data <- data %>% mutate(variable_log = log(variable))

I.0.19 Square root

Another popular method is to perform a square root transformation.

data <- data %>% mutate(variable_sqrt = sqrt(variable))

I.0.20 Calculating new variables

You may wish to calculate the sum or the mean of a number of variables. For example, if you have 9 questions and you want the sum of questions 1-5 and the mean of questions 6-9, and your data is in wide-form:

data <- data %>%
  mutate(sum_scoreq1q5 = rowSums(select(., Q1:Q5)),
         mean_scoreq6q9 = rowMeans(select(., Q6:Q9)))

If your data is in long-form you may want to use functions such as gather(), spread() and summarise(). See RM2 Lab 3 for more info on this.

I.0.21 Tidy data

It is at this point that you should tidy the dataset using functions such as gather(). Refer back to RM1 Lab 2 and RM2 Lab 3. You may also wish to have a wide-form version of your data depending upon the analyses you are conducting.
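
As a sketch (assuming wide-form data with one column per question, Q1 to Q5, plus a participant id column), gather() stacks those columns into one column of question names and one column of responses:

data_long <- gather(data, key = "question", value = "response", Q1:Q5) # one row per participant per question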

I.0.22 Summarising and visualising data

I.0.23 Descriptive statistics

At this point you may want to calculate descriptive statistics for variables of interest. Refer back to RM1 Lab 2 and Lab 3 for more information on these functions.

data %>% # produce descriptives for the total data set
  summarise(mean_score = mean(score, na.rm = TRUE),
            sd_score = sd(score, na.rm = TRUE),
            median_score = median(score))

data %>% # produce descriptives for each of the grouping variables
  group_by(gender, condition) %>%
  summarise(mean_score = mean(score, na.rm = TRUE),
            sd_score = sd(score, na.rm = TRUE),
            median_score = median(score))

You may also find the function describe() from the psych package useful (you may need to install this package but, as always, do not install packages on university computers). describe() produces a full range of descriptive statistics including skew, kurtosis, and standard error.

library(psych)
describe(data)

describeBy() produces descriptives by a grouping variable.

describeBy(data, group = "gender")

I.0.24 Data visualisation

At this point you should plot your data using a method that reflects the analysis you wish to conduct (e.g., a scatterplot for a correlation, a violin-boxplot for a t-test). For the plots and code specific to each type of analysis, please refer to the relevant chapters. You should ensure that your plots are as informative as possible and display the spread of the data using functions such as geom_violin() and geom_point(). Avoid purely aggregated plots like bar charts representing means.
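
As a sketch of a violin-boxplot (assuming a factor condition and a numeric score, as in the earlier examples):

ggplot(data, aes(x = condition, y = score)) +
  geom_violin() + # the violin shows the full distribution of scores in each condition
  geom_boxplot(width = 0.2) # the narrow boxplot overlays the median and interquartile range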

I.0.25 Inferential statistics

You are now ready to conduct your inferential analyses. For details on how to perform different tests, please refer to the relevant chapters. Ensure that you understand how each test relates to each of your hypotheses.

I.0.26 Assumption tests

Depending upon the analysis, you may be able to conduct assumption checks before the inferential tests are conducted; however, for methods such as ANOVA and regression you need to check the model residuals, so this can't be done until after the model has been fitted.
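
For example, for a regression or ANOVA-style model you can inspect the residuals once the model has been fitted. This sketch uses base R functions and the hypothetical variables score and condition:

mod <- lm(score ~ condition, data = data) # fit the model first

qqnorm(residuals(mod)) # Q-Q plot of the residuals: points should fall close to the line
qqline(residuals(mod)) # add the reference line
shapiro.test(residuals(mod)) # formal test of normality of the residuals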

For details on what assumption checks to conduct for each statistical test, please refer to the relevant chapters.