Chapter 1 Getting Started

1.1 Learning Objectives

  1. Understand the components of the RStudio IDE
  2. Type commands into the console
  3. Understand function syntax
  4. Install a package
  5. Organise a project
  6. Appropriately structure an R script or RMarkdown file
  7. Create and compile an Rmarkdown document

1.3 What is R?

R is a programming environment for data processing and statistical analysis. We use R in Psychology at the University of Glasgow to promote reproducible research. This refers to being able to document and reproduce all of the steps between raw data and results. R allows you to write scripts that combine data files, clean data, and run analyses. There are many other ways to do this, including writing SPSS syntax files, but we find R to be a useful tool that is free, open source, and commonly used by research psychologists.

See Appendix A for more information on on how to install R and associated programs.

1.3.1 The Base R Console

If you open up the application called R, you will see an “R Console” window that looks something like this.

The R Console window.

Figure 1.1: The R Console window.

You can close R and never open it again. We’ll be working entirely in RStudio in this class.

ALWAYS REMEMBER: Launch R though the RStudio IDE

Launch RStudio.app (RStudio.app), not R.app (R.app).

1.3.2 RStudio

RStudio is an Integrated Development Environment (IDE). This is a program that serves as a text editor, file manager, and provides many functions to help you read and write R code.

The RStudio IDE

Figure 1.2: The RStudio IDE

RStudio is arranged with four window panes. By default, the upper left pane is the source pane, where you view and edit source code from files. The bottom left pane is usually the console pane, where you can type in commands and view output messages. The right panes have several different tabs that show you information about your code. You can change the location of panes and what tabs are shown under Preferences > Pane Layout.

1.3.3 Configure RStudio

In this class, you will be learning how to develop reproducible scripts. This means scripts that completely and transparently perform some analysis from start to finish in a way that yields the same result for different people using the same software on different computers. Transparency is a key value of science, as embodied in the “trust but verify” motto.

When you do things reproducibly, others can understand and check your work. This benefits science, but there is a selfish reason, too: the most important person who will benefit from a reproducible script is your future self. When you return to an analysis after two weeks of vacation, you will thank your earlier self for doing things in a transparent, reproducible way, as you can easily pick up right where you left off.

There are two tweaks that you should do to your RStudio installation to maximize reproducibility. Go to the preferences/settings menu, and uncheck the box that says Restore .RData into workspace at startup. If you keep things around in your workspace, things will get messy, and unexpected things will happen. You should always start with a clear workspace. This also means that you never want to save your workspace when you exit, so set this to Never. The only thing you want to save are your scripts.

Alter these settings for increased reproducibility.

Figure 1.3: Alter these settings for increased reproducibility.

Your settings should have:

  • Restore .RData into workspace at startup:
  • Save workspace to .RData on exit:

1.4 Getting Started

1.4.1 Console commands

We are first going to learn about how to interact with the console. In general, you will be developing R scripts or R markdown files, rather than working directly in the console window. However, you can consider the console a kind of sandbox where you can try out lines of code and adapt them until you get them to do what you want. Then you can copy them back into the script editor.

Mostly, however, you will be typing into the script editor window (either into an R script or an R Markdown file) and then sending the commands to the console by placing the cursor on the line and holding down the Ctrl key while you press Enter. The Ctrl+Enter key sequence sends the command in the script to the console.

One simple way to learn about the R console is to use it as a calculator. Enter the lines of code below and see if your results match. Be prepared to make lots of typos (at first).

## [1] 2

The R console remembers a history of the commands you typed in the past. Use the up and down arrow keys on your keyboard to scroll backwards and forwards through your history. It’s a lot faster than re-typing.

## [1] 5

You can break up math expressions over multiple lines; R waits for a complete expression before processing it.

## [1] 55

Text inside quotes is called a string.

## [1] "Good afternoon"

You can break up text over multiple lines; R waits for a close quote before processing it. If you want to include a double quote inside this quoted string, escape it with a backslash.

## I hear the drums echoing tonight  
## But she hears only whispers of some quiet conversation  
## She's coming in, 12:30 flight  
## The moonlit wings reflect the stars that guide me towards salvation  
## I stopped an old man along the way  
## Hoping to find some old forgotten words or ancient melodies  
## He turned to me as if to say, "Hurry boy, it's waiting there for you"
## 
## - Toto

1.4.2 Variables

Often you want to store the result of some computation for later use. You can store it in a variable. A variable in R:

  • contains only letters, numbers, full stops, and underscores
  • starts with a letter or a full stop and a letter
  • distinguishes uppercase and lowercase letters (rickastley is not the same as RickAstley)

The following are valid and different variables:

  • songdata
  • SongData
  • song_data
  • song.data
  • .song.data
  • never_gonna_give_you_up_never_gonna_let_you_down

The following are not valid variables:

  • _song_data
  • 1song
  • .1song
  • song data
  • song-data

Use the assignment operator <- to assign the value on the right to the variable named on the left.

Now that we have set x to a value, we can do something with it:

## [1] 10

Note that it doesn’t print the result back at you when it’s stored. To view the result, just type the variable name on a blank line.

## [1] 4

Once a variable is assigned a value, its value doesn’t change unless you reassign the variable, even if the variables you used to calculate it change. Predict what the code below does and test yourself:

After all the code above is run:

  • this_year =
  • my_birth_year =
  • my_age =

1.4.3 The environment

Anytime you assign something to a new variable, R creates a new object in the global environment. Objects in the global environment exist until you end your session; then they disappear forever (unless you save them).

Look at the Environment tab in the upper right pane. It lists all of the variables you have created. Click the broom icon to clear all of the variables and start fresh. You can also use the following functions in the console to view all variables, remove one variable, or remove all variables.

In the upper right corner of the Environment tab, change List to Grid. Now you can see the type, length, and size of your variables, and reorder the list by any of these attributes.

1.4.4 Whitespace

When you see > at the beginning of a line, that means R is waiting for you to start a new command. However, if you see a + instead of > at the start of the line, that means R is waiting for you to finish a command you started on a previous line. If you want to cancel whatever command you started, just press the Esc key in the console window and you’ll get back to the > command prompt.

## [1] 25

It is often useful to break up long functions onto several lines.

## 3, 6, 9, the goose drank wine  
## The monkey chewed tobacco on the streetcar line  
## The line broke, the monkey got choked  
## And they all went to heaven in a little rowboat

1.4.5 Function syntax

A lot of what you do in R involves calling a function and storing the results. A function is a named section of code that can be reused.

For example, sd is a function that returns the standard deviation of the vector of numbers that you provide as the input argument. Functions are set up like this:

function_name(argument1, argument2 = "value").

The arguments in parentheses can be named (like, argument1 = 10) or you can skip the names if you put them in the exact same order that they’re defined in the function. You can check this by typing ?sd (or whatever function name you’re looking up) into the console and the Help pane will show you the default order under Usage. You can also skip arguments that have a default value specified.

Most functions return a value, but may also produce side effects like printing to the console.

To illustrate, the function rnorm() generates random numbers from the standard normal distribution. The help page for rnorm() (accessed by typing ?rnorm in the console) shows that it has the syntax

rnorm(n, mean = 0, sd = 1)

where n is the number of randomly generated numbers you want, mean is the mean of the distribution, and sd is the standard deviation. The default mean is 0, and the default standard deviation is 1. There is no default for n, which means you’ll get an error if you don’t specify it:

## Error in rnorm(): argument "n" is missing, with no default

If you want 10 random numbers from a distribution with mean of 0 and standard deviation, you can just use the defaults.

##  [1] -0.05511896  0.87252252  0.82307770 -0.46609333 -0.12642648
##  [6]  1.20797031  1.67671736 -1.87642043 -0.45202949 -0.60816802

If you want 10 numbers from a distribution with a mean of 100:

##  [1]  97.65696  99.35218  99.11358  98.59506  99.48569 101.05307 100.42048
##  [8] 100.81752 101.76668 100.53960

This would be an equivalent but less efficient way of calling the function:

##  [1]  99.68801 100.48899 100.38205 101.06090  99.98390  98.62639 100.30004
##  [8] 101.05193 101.47237 102.73664

We don’t need to name the arguments because R will recognize that we intended to fill in the first and second arguments by their position in the function call. However, if we want to change the default for an argument coming later in the list, then we need to name it. For instance, if we wanted to keep the default mean = 0 but change the standard deviation to 100 we would do it this way:

##  [1] -129.97940   68.12656 -143.11548   53.27806   80.91555  -10.69488
##  [7]  -11.09132 -106.00615   12.99392   64.70920

Some functions give a list of options after an argument; this means the default value is the first option. The usage entry for the power.t.test() function looks like this:

  • What is the default value for sd?
  • What is the default value for type?
  • Which is equivalent to power.t.test(100, 0.5)?

1.4.6 Getting help

Start up help in a browser using the function help.start().

If a function is in base R or a loaded package, you can use the help("function_name") function or the ?function_name shortcut to access the help file. If the package isn’t loaded, specify the package name as the second argument to the help function.

When the package isn’t loaded or you aren’t sure what package the function is in, use the shortcut ??function_name.

  • What is the first argument to the mean function?
  • What package is read_excel in?

1.5 Add-on packages

One of the great things about R is that it is user extensible: anyone can create a new add-on software package that extends its functionality. There are currently thousands of add-on packages that R users have created to solve many different kinds of problems, or just simply to have fun. There are packages for data visualisation, machine learning, neuroimaging, eyetracking, web scraping, and playing games such as Sudoku.

Add-on packages are not distributed with base R, but have to be downloaded and installed from an archive, in the same way that you would, for instance, download and install a fitness app on your smartphone.

The main repository where packages reside is called CRAN, the Comprehensive R Archive Network. A package has to pass strict tests devised by the R core team to be allowed to be part of the CRAN archive. You can install from the CRAN archive through R using the install.packages() function.

There is an important distinction between installing a package and loading a package.

1.5.1 Installing a package

This is done using install.packages(). This is like installing an app on your phone: you only have to do it once and the app will remain installed until you remove it. For instance, if you want to use PokemonGo on your phone, you install it once from the App Store or Play Store, and you don’t have to re-install it each time you want to use it. Once you launch the app, it will run in the background until you close it or restart your phone. Likewise, when you install a package, the package will be available (but not loaded) every time you open up R.

You may only be able to permanently install packages if you are using R on your own system; you may not be able to do this on public workstations if you lack the appropriate privileges.

Install the ggExtra package on your system. This package lets you create plots with marginal histograms.

If you don’t already have packages like ggplot2 and shiny installed, it will also install these dependencies for you. If you don’t get an error message at the end, the installation was successful.

1.5.2 Loading a package

This is done using library(packagename). This is like launching an app on your phone: the functionality is only there where the app is launched and remains there until you close the app or restart. Likewise, when you run library(packagename) within a session, the functionality of the package referred to by packagename will be made available for your R session. The next time you start R, you will need to run the library() function again if you want to access its functionality.

You can load the functions in ggExtra for your current R session as follows:

You might get some red text when you load a package, this is normal. It is usually warning you that this package has functions that have the same name as other packages you’ve already loaded.

You can use the convention package::function() to indicate in which add-on package a function resides. For instance, if you see readr::read_csv(), that refers to the function read_csv() in the readr add-on package.

Now can run the function ggExtra::runExample(), which runs an interactive example of marginal plots using shiny.

1.5.3 Install from GitHub

Many R packages are not yet on CRAN because they are still in development. Increasingly, datasets and code for papers are available as packages you can download from github. You’ll need to install the devtools package to be able to install packages from github. Check if you have a package installed by trying to load it (e.g., if you don’t have devtools installed, library("devtools") will display an error message) or by searching for it in the packages tab in the lower right pane. All listed packages are installed; all checked packages are currently loaded.

Check installed and loaded packages in the packages tab in the lower right pane.

Figure 1.4: Check installed and loaded packages in the packages tab in the lower right pane.

After you install the goodshirt package, load it using the library() function and display some quotes using the functions below.

## 
##  There really is an afterlife. I can't wait to have breakfast with Kant, and lunch with Michel Foucault, and then have dinner with Kant again so we can talk about what came up at breakfast! 
## 
##  ~ Chidi
##  People are like nature's apps! 
## 
##  ~ Eleanor
##  I am an expert at mediating conflict. Like when my friends Scary, Sporty, Posh, and Baby had an issue with my other friend Archbishop Desmond Tutu. 
## 
##  ~ Tahani
##  I miss being myself. Myself was the best. 
## 
##  ~ Jason

How many different ways can you find to discover what functions are available in the goodshirt package?

1.6 Organising a project

Projects in RStudio are a way to group all of the files you need for one project. Most projects include scripts, data files, and output files like the PDF version of the script or images.

Make a new directory where you will keep all of your materials for this class. If you’re using a lab computer, make sure you make this directory in your network drive so you can access it from other computers.

Choose New Project… under the File menu to create a new project called 01-intro in this directory.

1.6.1 Structure

Here is what an R script looks like. Don’t worry about the details for now.

It’s best if you follow the following structure when developing your own scripts:

  • load in any add-on packages you need to use
  • define any custom functions
  • load or simulate the data you will be working with
  • work with the data
  • save anything you need to save

Often when you are working on a script, you will realize that you need to load another add-on package. Don’t bury the call to library(package_I_need) way down in the script. Put it in the top, so the user has an overview of what packages are needed.

You can add comments to an R script by with the hash symbol (#). The R interpreter will ignore characters from the hash to the end of the line.

## [1] 3.142857

1.6.2 Reproducible reports with R Markdown

We will make reproducible reports following the principles of literate programming. The basic idea is to have the text of the report together in a single document along with the code needed to perform all analyses and generate the tables. The report is then “compiled” from the original format into some other, more portable format, such as HTML or PDF. This is different from traditional cutting and pasting approaches where, for instance, you create a graph in Microsoft Excel or a statistics program like SPSS and then paste it into Microsoft Word.

We will use R Markdown to create reproducible reports, which enables mixing of text and code. A reproducible script will contain sections of code in code blocks. A code block starts and ends with backtick symbols in a row, with some infomation about the code between curly brackets, such as {r chunk-name, echo=FALSE} (this runs the code, but does not show the text of the code block in the compiled document). The text outside of code blocks is written in markdown, which is a way to specify formatting, such as headers, paragraphs, lists, bolding, and links.

A reproducible script.

Figure 1.5: A reproducible script.

If you open up a new RMarkdown file from a template, you will see an example document with several code blocks in it. To create an HTML or PDF report from an R Markdown (Rmd) document, you compile it. Compiling a document is called knit in RStudio. There is a button that looks like a ball of yarn with needles through it that you click on to compile your file into a report.

Create a new R Markdown file from the File > New File > R Markdown… menu. Change the title and author, then click the knit button to create an html file.

1.6.3 Working Directory

Where should I put all my files? When developing an analysis, you usually want to have all of your scripts and data files in one subtree of your computer’s directory structure. Usually there is a single working directory where your data and scripts are stored.

Your script should only reference files in three locations, using the appropriate format.

Where Example
on the web https://psyteachr.github.io/msc-data-skills/data/disgust_scores.csv
in the working directory “disgust_scores.csv”
in a subdirectory “data/disgust_scores.csv”

Never set or change your working directory in a script.

If you are working with an R Markdown file, it will automatically use the same directory the .Rmd file is in as the working directory.

If you are working with R scripts, store your main script file in the top-level directory and manually set your working directory to that location. You will have to reset the working directory each time you open RStudio, unless you create a project and access the script from the project.

For instance, if on a Windows machine your data and scripts are in the directory C:\Carla's_files\thesis2\my_thesis\new_analysis, you will set your working directory in one of two ways: (1) by going to the Session pull down menu in RStudio and choosing Set Working Directory, or (2) by typing setwd("C:\Carla's_files\thesis2\my_thesis\new_analysis") in the console window.

It’s tempting to make your life simple by putting the setwd() command in your script. Don’t do this! Others will not have the same directory tree as you (and when your laptop dies and you get a new one, neither will you).

When manually setting the working directory, always do so by using the Session > Set Working Directory pull-down option or by typing setwd() in the console.

If your script needs a file in a subdirectory of new_analysis, say, data/questionnaire.csv, load it in using a relative path so that it is accessible if you move the folder new_analysis to another location or computer:

Do not load it in using an absolute path:

Also note the convention of using forward slashes, unlike the Windows-specific convention of using backward slashes. This is to make references to files platform independent.

1.7 Exercises

Download the first set of exercises and put it in the project directory you created earlier for today’s exercises. See the answers only after you’ve attempted all the questions.