1 Getting Started

A line drawing of a person looking at a computer with a magnifying glass. The text reads 'I just installed RStudio. I'm a data scientist now.'

1.1 Learning Objectives

Understand the components of the RStudio IDE (video)
Type commands into the console (video)
Understand coding terms and function syntax (video)
Install a package (video)
Know the methods for getting help

1.2 R and RStudio

R is a programming environment for data processing and statistical analysis. We use R in Psychology at the University of Glasgow to promote reproducible research. This refers to being able to document and reproduce all of the steps between raw data and results. R allows you to write scripts that combine data files, clean data, and run analyses. There are many other ways to do this, including writing SPSS syntax files, but we find R to be a useful tool that is free, open source, and commonly used by research psychologists.

See Appendix A for more information on how to install R and associated programs.

1.2.1 The Base R Console

If you open up the application called R, you will see an "R Console" window that looks something like this.

Figure 1.1: The R Console window.

You can close R and never open it again. We'll be working entirely in RStudio in this class.

ALWAYS REMEMBER: Launch R through the RStudio IDE

Launch RStudio (RStudio.app), not R (R.app).

1.2.2 RStudio

RStudio is an Integrated Development Environment (IDE). This is a program that serves as a text editor and file manager, and provides many functions to help you read and write R code.

Figure 1.2: The RStudio IDE

RStudio is arranged with four window panes. By default, the upper left pane is the source pane, where you view and edit source code from files. The bottom left pane is usually the console pane, where you can type in commands and view output messages. The right panes have several different tabs that show you information about your code. You can change the location of panes and what tabs are shown under Preferences > Pane Layout.

1.2.3 Configure RStudio

In this class, you will be learning how to do reproducible research. This involves writing scripts that completely and transparently perform some analysis from start to finish in a way that yields the same result for different people using the same software on different computers. Transparency is a key value of science, as embodied in the "trust but verify" motto.

Fry from Futurama squinting; top text: Not sure if I have a bad memory; bottom text: Or a bad memory

When you do things reproducibly, others can understand and check your work. This benefits science, but there is a selfish reason, too: the most important person who will benefit from a reproducible script is your future self. When you return to an analysis after two weeks of vacation, you will thank your earlier self for doing things in a transparent, reproducible way, as you can easily pick up right where you left off.

There are two tweaks that you should make to your RStudio installation to maximize reproducibility. Go to Global Options... under the Tools menu (Cmd-, on a Mac), and uncheck the box that says Restore .RData into workspace at startup. If you keep things around in your workspace, things will get messy, and unexpected things will happen. You should always start with a clear workspace. This also means that you never want to save your workspace when you exit, so set Save workspace to .RData on exit to Never. The only things you want to save are your scripts.

Figure 1.3: Alter these settings for increased reproducibility.

Your settings should have:

Restore .RData into workspace at startup: unchecked
Save workspace to .RData on exit: Never

1.3 Console commands

We are first going to learn about how to interact with the console. In general, you will be developing R scripts or R Markdown files, rather than working directly in the console window. However, you can consider the console a kind of "sandbox" where you can try out lines of code and adapt them until you get them to do what you want. Then you can copy them back into the script editor.

Mostly, however, you will be typing into the script editor window (either into an R script or an R Markdown file) and then sending the commands to the console by placing the cursor on the line and holding down the Ctrl key (Cmd on a Mac) while you press Enter. The Ctrl+Enter key sequence sends the command in the script to the console.

Morpheus from The Matrix; top text: What if I told you; bottom text: Typos are accidents nd accidents happon

One simple way to learn about the R console is to use it as a calculator. Enter the lines of code below and see if your results match. Be prepared to make lots of typos (at first).

1 + 1

## [1] 2

The R console remembers a history of the commands you typed in the past. Use the up and down arrow keys on your keyboard to scroll backwards and forwards through your history. It's a lot faster than re-typing.

1 + 1 + 3

## [1] 5

You can break up mathematical expressions over multiple lines; R waits for a complete expression before processing it.

## here comes a long expression
## let's break it over multiple lines
1 + 2 + 3 + 4 + 5 + 6 +
    7 + 8 + 9 +
    10

## [1] 55

Text inside quotes is called a string.

"Good afternoon"

## [1] "Good afternoon"

You can break up text over multiple lines; R waits for a close quote before processing it. If you want to include a double quote inside this quoted string, escape it with a backslash.

africa <- "I hear the drums echoing tonight  
But she hears only whispers of some quiet conversation  
She's coming in, 12:30 flight  
The moonlit wings reflect the stars that guide me towards salvation  
I stopped an old man along the way  
Hoping to find some old forgotten words or ancient melodies  
He turned to me as if to say, \"Hurry boy, it's waiting there for you\"

- Toto"

cat(africa) # cat() prints the string

## I hear the drums echoing tonight  
## But she hears only whispers of some quiet conversation  
## She's coming in, 12:30 flight  
## The moonlit wings reflect the stars that guide me towards salvation  
## I stopped an old man along the way  
## Hoping to find some old forgotten words or ancient melodies  
## He turned to me as if to say, "Hurry boy, it's waiting there for you"
## 
## - Toto
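As a small illustration of escaping, the sketch below stores a short string with escaped double quotes and compares how print() and cat() display it. The object names greeting and same_greeting are made up for this example.

```r
# a string containing escaped double quotes
greeting <- "She said, \"Good afternoon\""

# print() shows the string as R stores it, including the escape backslashes
print(greeting)

# cat() writes the text itself, so the backslashes disappear
cat(greeting, "\n")

# wrapping the string in single quotes avoids the need to escape double quotes
same_greeting <- 'She said, "Good afternoon"'
identical(greeting, same_greeting)
## [1] TRUE
```

Both versions hold exactly the same characters; the backslash is only part of how you type the string, not part of the string.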

1.4 Coding Terms

1.4.1 Objects

Often you want to store the result of some computation for later use. You can store it in an object (also sometimes called a variable). The name of an object in R:

contains only letters, numbers, full stops, and underscores
starts with a letter or a full stop and a letter
distinguishes uppercase and lowercase letters (rickastley is not the same as RickAstley)

The following are valid and different objects:

songdata
SongData
song_data
song.data
.song.data
never_gonna_give_you_up_never_gonna_let_you_down

The following are not valid objects:

_song_data
1song
.1song
song data
song-data
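One way to check the naming rules without triggering an error is the base R function make.names(), which converts any string into a syntactically valid name. The sketch below assumes that a string which comes back unchanged was already a valid name; the vector candidates just repeats some of the examples above.

```r
candidates <- c("songdata", "song_data", ".song.data",
                "_song_data", "1song", ".1song", "song data", "song-data")

# make.names() rewrites invalid names (e.g., "song data" becomes "song.data");
# a candidate that is returned unchanged was already a valid name
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
is_valid
```

The first three candidates should come back TRUE and the remaining five FALSE, matching the two lists above.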

Use the assignment operator <- to assign the value on the right to the object named on the left.

## use the assignment operator '<-'
## R stores the number in the object
x <- 5

Now that we have set x to a value, we can do something with it:

x * 2

## [1] 10

## R evaluates the expression and stores the result in the object boring_calculation
boring_calculation <- 2 + 2

Note that it doesn't print the result back at you when it's stored. To view the result, just type the object name on a blank line.

boring_calculation

## [1] 4

Once an object is assigned a value, its value doesn't change unless you reassign the object, even if the objects you used to calculate it change. Predict what the code below does and test yourself:

this_year <- 2019
my_birth_year <- 1976
my_age <- this_year - my_birth_year
this_year <- 2020

After all the code above is run:

this_year =
my_birth_year =
my_age =
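To check your predictions, you could run the code and print each object, as in the sketch below. The last two lines are an addition for illustration: they show that my_age only picks up the new value of this_year if the calculation is run again.

```r
this_year <- 2019
my_birth_year <- 1976
my_age <- this_year - my_birth_year
this_year <- 2020

# print each object and compare it against your prediction
this_year
my_birth_year
my_age

# my_age is only updated when the calculation is re-run
my_age <- this_year - my_birth_year
my_age
```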

1.4.2 The environment

Any time you assign something to a new object, R creates a new entry in the global environment. Objects in the global environment exist until you end your session; then they disappear forever (unless you save them).

Look at the Environment tab in the upper right pane. It lists all of the objects you have created. Click the broom icon to clear all of the objects and start fresh. You can also use the following functions in the console to view all objects, remove one object, or remove all objects.

ls()            # print the objects in the global environment
rm("x")         # remove the object named x from the global environment
rm(list = ls()) # clear out the global environment
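The short sketch below shows an object appearing in and disappearing from the environment. It uses the base R function exists(), which is not covered above, to test whether an object with a given name can be found; demo_value is a made-up name.

```r
demo_value <- 5

# TRUE: an object called demo_value can be found
found_before <- exists("demo_value")
found_before

# rm() accepts the bare name as well as the quoted name
rm(demo_value)

# FALSE: the object is gone
found_after <- exists("demo_value")
found_after
```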

In the upper right corner of the Environment tab, change List to Grid. Now you can see the type, length, and size of your objects, and reorder the list by any of these attributes.

1.4.3 Whitespace

R mostly ignores whitespace: spaces, tabs, and line breaks. This means that you can use whitespace to help you organise your code.

# a and b are identical
a <- list(ctl = "Control Condition", exp1 = "Experimental Condition 1", exp2 = "Experimental Condition 2")

# but b is much easier to read
b <- list(ctl  = "Control Condition", 
          exp1 = "Experimental Condition 1", 
          exp2 = "Experimental Condition 2")
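You can confirm that layout makes no difference with identical(), which tests whether two objects are exactly the same. This shortened sketch uses two of the three conditions.

```r
# written on one line
a <- list(ctl = "Control Condition", exp1 = "Experimental Condition 1")

# written over two lines with extra spaces for alignment
b <- list(ctl  = "Control Condition",
          exp1 = "Experimental Condition 1")

identical(a, b)
## [1] TRUE
```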

When you see > at the beginning of a line, that means R is waiting for you to start a new command. However, if you see a + instead of > at the start of the line, that means R is waiting for you to finish a command you started on a previous line. If you want to cancel whatever command you started, just press the Esc key in the console window and you'll get back to the > command prompt.

# R waits until next line for evaluation
(3 + 2) *
     5

## [1] 25

It is often useful to break up long function calls across several lines.

cat("3, 6, 9, the goose drank wine",
    "The monkey chewed tobacco on the streetcar line",
    "The line broke, the monkey got choked",
    "And they all went to heaven in a little rowboat",
    sep = "  \n")

## 3, 6, 9, the goose drank wine  
## The monkey chewed tobacco on the streetcar line  
## The line broke, the monkey got choked  
## And they all went to heaven in a little rowboat

1.4.4 Function syntax

A lot of what you do in R involves calling a function and storing the results. A function is a named section of code that can be reused.

For example, sd is a function that returns the standard deviation of the vector of numbers that you provide as the input argument. Functions are set up like this:

function_name(argument1, argument2 = "value")

The arguments in parentheses can be named (e.g., argument1 = 10) or you can skip the names if you put them in the exact same order that they're defined in the function. You can check this by typing ?sd (or whatever function name you're looking up) into the console and the Help pane will show you the default order under Usage. You can skip arguments that have a default value specified.
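As a minimal sketch of positional versus named arguments, the code below calls sd() both ways on a made-up vector called scores. It also uses args(), a base R function that prints a function's arguments and defaults without opening the help page.

```r
scores <- c(1, 2, 3, 4, 5)

# positional: the first argument of sd() is x
sd(scores)
## [1] 1.581139

# named: the same call with the argument names written out
sd(x = scores, na.rm = FALSE)
## [1] 1.581139

# print the argument names and defaults
args(sd)
```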

Most functions return a value, but may also produce side effects like printing to the console.

To illustrate, the function rnorm() generates random numbers from the standard normal distribution. The help page for rnorm() (accessed by typing ?rnorm in the console) shows that it has the syntax

rnorm(n, mean = 0, sd = 1)

where n is the number of randomly generated numbers you want, mean is the mean of the distribution, and sd is the standard deviation. The default mean is 0, and the default standard deviation is 1. There is no default for n, which means you'll get an error if you don't specify it:

rnorm()

## Error in rnorm(): argument "n" is missing, with no default

If you want 10 random numbers from a normal distribution with a mean of 0 and a standard deviation of 1, you can just use the defaults.

rnorm(10)

##  [1]  0.4523096  0.7214671  0.6460756 -1.6449828  0.3308863 -0.8424760
##  [7]  1.1179621 -0.3556398  1.4468851  0.1222479

If you want 10 numbers from a normal distribution with a mean of 100:

rnorm(10, 100)

##  [1] 101.26573 101.15481  99.92083 100.37763 100.06351  99.01884  99.88644
##  [8]  98.91492 101.36445  98.06426

This would be an equivalent but more verbose way of calling the function:

rnorm(n = 10, mean = 100)

##  [1] 100.68990  99.49558  99.62226 101.65270 100.29281 101.52122  99.11455
##  [8] 100.81473 101.21078 100.47087

We don't need to name the arguments because R will recognize that we intended to fill in the first and second arguments by their position in the function call. However, if we want to change the default for an argument coming later in the list, then we need to name it. For instance, if we wanted to keep the default mean = 0 but change the standard deviation to 100, we would do it this way:

rnorm(10, sd = 100)

##  [1]  234.22639 -184.77611  115.98208  -13.78877 -142.22643   21.19932
##  [7]  -98.21786 -104.83094   24.14481   46.59624
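Because rnorm() returns different numbers on every call, it is hard to see from the output alone that these calls are equivalent. The sketch below fixes the random seed with set.seed(), which is not introduced above, so that each call draws the same numbers. The seed value is arbitrary.

```r
set.seed(8675309)   # arbitrary seed; any integer works
by_position <- rnorm(10, 100)

set.seed(8675309)   # reset the seed so the same numbers are drawn again
by_name <- rnorm(n = 10, mean = 100)

set.seed(8675309)
reordered <- rnorm(mean = 100, n = 10)  # named arguments can go in any order

identical(by_position, by_name)
## [1] TRUE
identical(by_position, reordered)
## [1] TRUE
```

Setting a seed is also useful for reproducibility more generally, since anyone running the script gets the same "random" numbers.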

Some functions give a list of options after an argument; this means the default value is the first option. The usage entry for the power.t.test() function looks like this:

power.t.test(n = NULL, delta = NULL, sd = 1, sig.level = 0.05,
             power = NULL,
             type = c("two.sample", "one.sample", "paired"),
             alternative = c("two.sided", "one.sided"),
             strict = FALSE, tol = .Machine$double.eps^0.25)

What is the default value for sd?
What is the default value for type?
Which is equivalent to power.t.test(100, 0.5)?
power.t.test(n = 100)
power.t.test(delta = 0.5, n = 100)
power.t.test()
power.t.test(100, 0.5, sig.level = 1, sd = 0.05)
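To see how a list of options behaves as a default, here is a sketch of a made-up function, describe_design(), that uses the base R function match.arg() to pick the first option when the argument is not supplied. It illustrates the pattern only and is not the code of power.t.test().

```r
# describe_design() is a hypothetical function written for this example
describe_design <- function(type = c("two.sample", "one.sample", "paired")) {
  # match.arg() picks the first option when type is not supplied
  # and throws an error if type is not one of the listed options
  type <- match.arg(type)
  paste("Design:", type)
}

describe_design()
## [1] "Design: two.sample"

describe_design("paired")
## [1] "Design: paired"
```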

1.5 Add-on packages

One of the great things about R is that it is user extensible: anyone can create a new add-on software package that extends its functionality. There are currently thousands of add-on packages that R users have created to solve many different kinds of problems, or just simply to have fun. There are packages for data visualisation, machine learning, neuroimaging, eyetracking, web scraping, and playing games such as Sudoku.

Add-on packages are not distributed with base R, but have to be downloaded and installed from an archive, in the same way that you would, for instance, download and install PokemonGo on your smartphone.

The main repository where packages reside is called CRAN, the Comprehensive R Archive Network. A package has to pass strict tests devised by the R core team to be allowed to be part of the CRAN archive. You can install from the CRAN archive through R using the install.packages() function.

There is an important distinction between installing a package and loading a package.

1.5.1 Installing a package

This is done using install.packages(). This is like installing an app on your phone: you only have to do it once and the app will remain installed until you remove it. For instance, if you want to use PokemonGo on your phone, you install it once from the App Store or Play Store, and you don't have to re-install it each time you want to use it. Once you launch the app, it will run in the background until you close it or restart your phone. Likewise, when you install a package, the package will be available (but not loaded) every time you open up R.

You may only be able to permanently install packages if you are using R on your own system; you may not be able to do this on public workstations if you lack the appropriate privileges.

Install the esquisse package on your system. This package lets you create plots interactively and copy the code needed to make them reproducibly.

# type this in the console pane
install.packages("esquisse")

If you don't already have packages like ggplot2 and shiny installed, it will also install these dependencies for you. If you don't get an error message at the end, the installation was successful.

Never install a package from inside a script. Only do this from the console pane.
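If a script depends on a package, one option is to check for it and tell the reader what to do, rather than installing it for them. The sketch below uses the base R function requireNamespace(), which returns TRUE or FALSE; the wording of the messages is made up.

```r
# TRUE if the package is installed, FALSE otherwise;
# nothing is installed, and the package is not attached to the search path
has_esquisse <- requireNamespace("esquisse", quietly = TRUE)

if (has_esquisse) {
  message("esquisse is installed")
} else {
  message("esquisse is missing: run install.packages(\"esquisse\") in the console")
}
```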

1.5.2 Loading a package

This is done using library(packagename). This is like launching an app on your phone: the functionality is only there when the app is launched and remains there until you close the app or restart. Likewise, when you run library(packagename) within a session, the functionality of the package referred to by packagename will be made available for your R session. The next time you start R, you will need to run the library() function again if you want to access its functionality.

You can load the functions in esquisse for your current R session as follows:

library(esquisse)

You might get some red text when you load a package; this is normal. It is usually warning you that this package has functions that have the same name as functions in other packages you've already loaded.

Now you can run the function esquisse::esquisser(), which runs an interactive plotting example on the built-in dataset diamonds from the ggplot2 package.

esquisse::esquisser(ggplot2::diamonds)

You can use the convention package::function() to indicate in which add-on package a function resides. For instance, if you see readr::read_csv(), that refers to the function read_csv() in the readr add-on package.
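The same convention works for the packages that come with R. As a small sketch, head() belongs to the utils package, so it can be called with or without the prefix; search(), a base R function, lists the packages currently attached to your session.

```r
# head() lives in the utils package, which R attaches by default
first_three <- head(letters, 3)

# the same function, named explicitly with its package
first_three_explicit <- utils::head(letters, 3)

identical(first_three, first_three_explicit)
## [1] TRUE

# list the packages attached in the current session
search()
```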

1.5.3 Tidyverse

tidyverse is a meta-package that loads several packages we'll be using in almost every script:

ggplot2, for data visualisation (Chapter 3)
readr, for data import (Chapter 4)
tibble, for tables (Chapter 4)
tidyr, for data tidying (Chapter 6)
dplyr, for data manipulation (Chapter 7)
purrr, for repeating things (Chapter 9)
stringr, for strings
forcats, for factors

1.5.4 Install from GitHub

Many R packages are not yet on CRAN because they are still in development. Increasingly, datasets and code for papers are available as packages you can download from GitHub. You'll need to install the devtools package to be able to install packages from GitHub. Check if you have a package installed by trying to load it (e.g., if you don't have devtools installed, library(devtools) will display an error message) or by searching for it in the Packages tab in the lower right pane. All listed packages are installed; all checked packages are currently loaded.

Figure 1.4: Check installed and loaded packages in the packages tab in the lower right pane.

# install devtools if you get
# Error in loadNamespace(name) : there is no package called ‘devtools’
# install.packages("devtools")
devtools::install_github("psyteachr/reprores-v2")

After you install the reprores package, load it using the library() function. You can then try out some of the functions below.

library(reprores)

# opens a local copy of this book in your web browser
book()

# opens a shiny app that lets you see how simulated data would look in different plot styles
app("plotdemo")

# creates and opens a file containing the exercises for this chapter
exercise(1)

How many different ways can you find to discover what functions are available in the reprores package?

reprores contains datasets that we will be using in future lessons. getdata() creates a directory called data with all of the class datasets.

# loads the disgust dataset
data("disgust")

# shows the documentation for the built-in dataset `disgust`
?disgust

# saves datasets into a "data" folder in your working directory
getdata("data")

1.6 Getting help

You will feel like you need a lot of help when you're starting to learn. This won't really go away, and it isn't supposed to. Experienced coders are also constantly looking things up; it's impossible to memorise everything. The goal is to learn enough about the structure of R that you can look things up quickly. This is why there is so much specialised jargon in coding; it's easier to google "adding vectors in R" than "adding lists of things that are the same kind of data in R".

1.6.1 Function Help

Start up help in a browser using the function help.start().

If a function is in base R or a loaded package, you can use the help("function_name") function or the ?function_name shortcut to access the help file. If the package isn't loaded, specify the package name as the second argument to the help function.

# these methods are all equivalent ways of getting help
help("rnorm")
?rnorm
help("rnorm", package="stats")

When the package isn't loaded or you aren't sure what package the function is in, use the shortcut ??function_name.

What is the first argument to the mean function?
What package is read_excel in?
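Beyond the help pages, a couple of base R functions can answer quick questions from the console. This sketch uses formals() to list a function's argument names and apropos() to search for objects by partial name; neither is covered above.

```r
# argument names of rnorm(), in the order shown under Usage on its help page
rnorm_args <- names(formals(rnorm))
rnorm_args

# names of all objects in attached packages that contain "norm"
norm_matches <- apropos("norm")
norm_matches
```

apropos() is handy when you remember part of a function name but not the whole thing.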

1.6.2 Googling

If the function help doesn't help, or you're not even sure what function you need, try Googling your question. It will take some practice to be able to use the right jargon in your search terms to get what you want. It helps to put "R", "rstats", or "tidyverse" in the search text, or the name of the relevant package, like ggplot2.

1.6.3 Vignettes

Many packages, especially tidyverse ones, have helpful websites with vignettes explaining how to use their functions. Some of the vignettes are also available inside R.

# opens a list of available vignettes
vignette(package = "ggplot2")

# opens a specific vignette in the Help pane
vignette("ggplot2-specs", package = "ggplot2")

1.7 Glossary

Each chapter ends with a glossary table defining the jargon introduced in this chapter. The links below take you to the glossary book, which you can also download for offline use.

# install the glossary package (only once)
devtools::install_github("psyteachr/glossary")

# open the glossary offline 
glossary::book()

term	definition
argument	A variable that provides input to a function.
assignment operator	The symbol <-, which assigns the value on the right to the object named on the left.
base r	The set of R functions that come with a basic installation of R, before you add external packages.
console	The pane in RStudio where you can type in commands and view output messages.
cran	The Comprehensive R Archive Network: a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.
escape	Include special characters like " inside of a string by prefacing them with a backslash.
factor	A data type where a specific set of values are stored with labels; An explanatory variable manipulated by the experimenter
function	A named section of code that can be reused.
global environment	The interactive workspace where your script runs
ide	Integrated Development Environment: a program that serves as a text editor and file manager, and provides functions to help you read and write code. RStudio is an IDE for R.
normal distribution	A symmetric distribution of data where values near the centre are most probable.
object	A word that identifies and stores the value of some data for later use.
package	A group of R functions.
panes	RStudio is arranged with four window "panes".
r markdown	The R-specific version of markdown: a way to specify formatting, such as headers, paragraphs, lists, bolding, and links, as well as code blocks and inline code.
reproducible research	Research that documents all of the steps between raw data and results in a way that can be verified.
script	A plain-text file that contains commands in a coding language, such as R.
standard deviation	A descriptive statistic that measures how spread out data are relative to the mean.
string	A piece of text inside of quotes.
variable	(coding): A word that identifies and stores the value of some data for later use; (stats): An attribute or characteristic of an observation that you can measure, count, or describe
vector	A type of data structure that collects values with the same data type, like T/F values, numbers, or strings.
whitespace	Spaces, tabs and line breaks

1.8 Further Resources

Chapter 1: Introduction in R for Data Science
RStudio IDE Cheatsheet
RStudio Cloud
