Data Skills for Reproducible Research

This book provides an overview of skills needed for reproducible and open research using the statistical programming language R and tidyverse packages. It covers reproducible workflows, data visualisation, data tidying and wrangling, archiving, iteration and functions, probability and data simulations.

Authors

Lisa DeBruine, Dale Barr, Emily Nordmann, Rebecca Lai, David Pharis

Overview

While this book mainly focuses on technical data skills, reproducible and open research is the reason for learning these skills. The following papers provide a great overview of these concepts if you are not already familiar with them.

Book version

This book was created using R version 4.2.1 (2022-06-23) (Funny-Looking Kid) and RStudio version 2024.4.2.764 (Chocolate Cosmos). It was rendered with quarto version 1.5.57. Most of the content of this book will work fine in versions of R above 4.0 and earlier versions of RStudio, although there may be some small differences in the interface.
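To see how your own setup compares with the versions listed above, a quick check along these lines may help. This is a minimal sketch using base R only; the variable name `meets_minimum` is just for illustration, and the 4.0.0 threshold is taken from the paragraph above.

```r
# Print the version of R that is currently running
R.version.string

# getRversion() returns a version object that can be compared to a string
meets_minimum <- getRversion() >= "4.0.0"

# TRUE means your R is at least version 4.0.0
meets_minimum
```

If this prints `FALSE`, updating R before starting the book is probably the easiest way to avoid version-related differences.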

This is the 4th version of the book and is currently under revision. You can access previous versions at: v1, v2, and v3.

Resources

Videos: Each chapter has several short video lectures covering the main learning outcomes. The videos are captioned, and watching with captions on is a useful way to learn the jargon of computational reproducibility. If you cannot access YouTube, the videos are available by request. The videos were created in 2020, so a few aspects of the RStudio interface and the book text have changed since then.
Glossary: Coding and statistics both have a lot of specialist terms. Throughout this book, jargon is linked to the glossary, and each chapter ends with a table of the glossary terms relevant to that chapter.

How to learn data skills

[Image: meme. Top text: "Me: gonna get to the gym early today, set myself on a regimen, get gains. Also me:". Photo: man sleeping on gym equipment.]

Learning data skills is kind of like having a gym membership (HT to Phil McAleer for the analogy). You’ll be given state-of-the-art equipment and instructions for how to use it, but your data skills won’t get any stronger unless you practice.

Data skills do not require you to memorise lots of code. You will be introduced to many different functions, but the main skill to learn is how to efficiently find the information you need. This will require getting used to the structure of help files and cheat sheets, learning how to Google your problem and choose a helpful solution, and learning how to read error messages.
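As a rough illustration of finding information from inside R, the sketch below uses only base R help tools. The function `sd()` and the search phrase are arbitrary examples chosen here, not ones the book prescribes.

```r
# Open the help file for a function whose name you already know
help("sd")

# Show a function's arguments without opening the full help file
args(sd)

# Run the examples listed at the bottom of the help file
example("sd")

# Search the installed help files when you only know a keyword
help.search("standard deviation")
```

Help files share a common layout (Description, Usage, Arguments, Examples), so once you can read one, the others become much easier to skim.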

[Image: fake O'Reilly-style book cover with a line drawing of a kitten. Title: "Changing Stuff and Seeing What Happens". Top text: "How to actually learn any new programming concept".]

Learning to code involves making a lot of mistakes. These mistakes are completely essential to the process, so try not to feel too frustrated. Many of the chapter exercises will give you broken code to fix so you get experience seeing what common errors look like. As you become a more experienced coder, you might not make fewer errors, but you’ll recover from them much faster.
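For instance, a common early mistake is using an object that has not been created yet. The sketch below deliberately triggers that error and captures its message so you can see what it looks like; `undefined_object` is a made-up name that is assumed not to exist in your session, and `error_text` is just an illustrative variable.

```r
# Deliberately refer to an object that has not been created,
# and capture the error message instead of stopping the script
error_text <- tryCatch(
  mean(undefined_object),
  error = function(e) conditionMessage(e)
)

# The message names the missing object, which tells you what to fix
error_text
```

Reading the message for the name of the function or object it mentions is usually the fastest route to the line that needs fixing.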

I found a bug!

This book is a work in progress, so you might find errors. Please help me fix them! The best way is to open an issue on GitHub that describes the error, but you can also email Lisa.

[Image: tweet by Hadley Wickham (@hadleywickham), 17 April 2015: "The only way to write good code is to write tons of shitty code first. Feeling shame about bad code stops you from getting to good code".]

Other resources

RStudio Cheat Sheets
Improving Pedagogy through Registered Reports
Learning Statistics with R by Navarro
R for Data Science by Grolemund and Wickham
Improving your statistical inferences on Coursera
swirl
R for Reproducible Scientific Analysis
codeschool.com
DataCamp
Style guide for R programming