Data Skills for Reproducible Science
This course provides an overview of skills needed for reproducible research and open science using the statistical programming language R. Students will learn about data visualisation, data tidying and wrangling, archiving, iteration and functions, probability and data simulations, general linear models, and reproducible workflows. Learning is reinforced through weekly assignments that involve working with different types of data.
0.1 Course Aims
This course aims to teach students the basic principles of reproducible research and to provide practical training in data processing and analysis in the statistical programming language R.
0.2 Intended Learning Outcomes
By the end of this course students will be able to:
- Write scripts in R to organise and transform data sets using accepted best practices
- Explain the basics of probability and its role in statistical inference
- Critically analyse data and report descriptive and inferential statistics in a reproducible manner
0.3 Course Resources
Data Skills Videos: Each chapter has several short video lectures covering the main learning outcomes, collected in the course YouTube playlist. The videos are captioned, and watching with the captions on is a useful way to learn the jargon of computational reproducibility. If you cannot access YouTube, the videos are available on the course Teams and Moodle sites or by request from the instructor.
dataskills: This is a custom R package for this course. You can install it with the code below. It will download all of the packages that are used in the book, along with an offline copy of this book, the shiny apps used in the book, and the exercises.
glossary: Coding and statistics both have a lot of specialist terms. Throughout this book, jargon is linked to the glossary.
0.4 Course Outline
The overview below lists the beginner learning outcomes only. Some lessons have additional learning outcomes for intermediate or advanced students.
- Getting Started
- Working with Data
- Data Visualisation
- Tidy Data
- Data Wrangling
- Data Relations
- Iteration & Functions
- Probability & Simulation
  - Generate and plot data randomly sampled from common distributions: uniform, binomial, normal, Poisson
  - Generate related variables from a multivariate distribution
  - Define the following statistical terms: p-value, alpha, power, smallest effect size of interest (SESOI), false positive (type I error), false negative (type II error), confidence interval (CI)
  - Test sampled distributions against a null hypothesis using: exact binomial test, t-test (1-sample, independent samples, paired samples), correlation (Pearson, Kendall and Spearman)
  - Calculate power using iteration and a sampling function
- Introduction to GLM
- Reproducible Workflows
  - Create a reproducible script in R Markdown
  - Edit the YAML header to add table of contents and other options
  - Include a table
  - Include a figure
  - Use source() to include code from an external file
  - Report the output of an analysis using inline R
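To give a flavour of the Probability & Simulation outcomes listed above, here is a minimal sketch of calculating power using iteration and a sampling function. It uses base R only. The function name `sim_p_value`, the sample size, the effect size and the number of repetitions are illustrative assumptions, not taken from the course materials.

```r
# Hypothetical sampling function: simulate one two-group study and
# return the p-value of an independent-samples (Welch) t-test.
sim_p_value <- function(n = 50, effect = 0.5, sd = 1) {
  control   <- rnorm(n, mean = 0, sd = sd)       # sample from a normal distribution
  treatment <- rnorm(n, mean = effect, sd = sd)  # same SD, shifted mean
  t.test(treatment, control)$p.value
}

set.seed(8675309)  # fix the random seed so the simulation is reproducible

# Iterate: repeat the simulated study many times, then estimate power
# as the proportion of p-values that fall below alpha.
alpha    <- 0.05
p_values <- replicate(1000, sim_p_value(n = 50, effect = 0.5))
power    <- mean(p_values < alpha)
power
```

With these assumed settings the estimate should land near the analytic value you get from `power.t.test(n = 50, delta = 0.5)`, roughly 0.7. Changing `n` or `effect` and re-running shows how power responds to sample size and effect size.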
0.5 Formative Exercises
Exercises are available at the end of each lesson’s webpage. These are not marked or mandatory, but if you can work through each of these (using web resources, of course), you will easily complete the marked assessments.
Download all exercises and data files below as a ZIP archive.
0.6 I found a bug!
This book is a work in progress, so you might find errors. Please help me fix them! The best way is to open an issue on GitHub that describes the error, but you can also mention it on the class Teams forum or email Lisa.
0.7 Other Resources
- Learning Statistics with R by Navarro
- R for Data Science by Grolemund and Wickham
- R for Reproducible Scientific Analysis
- Improving your statistical inferences on Coursera
- You can access several cheatsheets in RStudio under the Help menu, or get the most recent RStudio Cheat Sheets
- Style guide for R programming
- #rstats on Twitter (highly recommended!)