2 Adding Data

Add datasets to your package and document them with a codebook. You can use our demo data or your own.

Packages required for this chapter

library(usethis)  # for setting up data sets in a package
library(readr)    # for reading in CSV data
library(glue)     # for text joining
library(devtools) # for loading your package

2.1 Process Data

We’re going to add some data from DeBruine (2004) looking at attractiveness judgements of same-sex and other-sex self-resembling faces. Follow the directions below for each data set you want to add to your package.

2.1.1 Raw Data Directory

Run the code below.

run in the console

usethis::use_data_raw("self_res_att")

✔ Setting active project to "/Users/debruine/rproj/psyteachr/intro-r-pkgs".

☐ Finish writing the data preparation script in 'data-raw/self_res_att.R'.

☐ Use `usethis::use_data()` to add prepared data to package.

The first time you run this function in a new package project, it will create a directory called data-raw and add this directory to the .Rbuildignore file so it isn’t included in your package.
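If you want to confirm that data-raw will be left out of the built package, you can test a path against the patterns in .Rbuildignore. This is a minimal sketch, not part of usethis: `is_build_ignored()` is a hypothetical helper, and it assumes that each non-empty line of .Rbuildignore is a Perl-style regular expression matched case-insensitively against file paths.

```r
# Hypothetical helper (not part of usethis): does `path` match any
# pattern listed in an .Rbuildignore file?
is_build_ignored <- function(path, ignore_file = ".Rbuildignore") {
  if (!file.exists(ignore_file)) {
    return(FALSE)
  }
  patterns <- readLines(ignore_file, warn = FALSE)
  patterns <- patterns[nzchar(patterns)]  # skip blank lines
  # each remaining line is treated as a case-insensitive Perl-style regex
  any(vapply(
    patterns,
    function(p) grepl(p, path, perl = TRUE, ignore.case = TRUE),
    logical(1)
  ))
}

# run from the package root after usethis::use_data_raw()
is_build_ignored("data-raw")
```

If this returns FALSE in your package project, open .Rbuildignore and check that it contains a line for data-raw.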

Data Privacy

Any processed data you include in an R package is viewable by anyone who downloads the package. Anything in the data-raw directory is not included in the package, but if you share your whole project publicly on GitHub, people will be able to see the raw data and processing code. You can prevent this by adding this directory to .gitignore (which we will discuss in Chapter 9), but it’s pretty easy to make errors that accidentally share everything. If you really can’t share the raw data, the safest approach is to process your data outside the package project.

usethis::use_data_raw() will also open a file in data-raw called self_res_att.R with the following contents.

## code to prepare `self_res_att` dataset goes here

usethis::use_data(self_res_att, overwrite = TRUE)

2.1.2 Add Raw Data

Download the data file called DeBruine_2004_PRSLB_att.csv from https://osf.io/2ud56/ and save it in data-raw. The code below will also do this.

run in the console

download.file(url = "https://osf.io/download/3c5s4/",
              destfile = "data-raw/DeBruine_2004_PRSLB_att.csv")
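If you re-run your setup code often, you may not want to download the file every time. The sketch below wraps download.file() in a check for an existing file. `download_if_missing()` is a hypothetical helper name, not a function from any package used in this chapter.

```r
# Hypothetical helper: download a file only if it is not already on disk.
# Returns TRUE if a download happened and FALSE if the file already existed.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))
  }
  # create the target directory (e.g. data-raw) if it does not exist yet
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  # mode = "wb" writes the file in binary mode so its bytes are left unchanged
  download.file(url = url, destfile = destfile, mode = "wb")
  invisible(TRUE)
}

# usage (requires an internet connection):
# download_if_missing(
#   url = "https://osf.io/download/3c5s4/",
#   destfile = "data-raw/DeBruine_2004_PRSLB_att.csv"
# )
```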

2.1.3 Load Data

Now, load the data in this script, using the object name self_res_att; this will be the name of the data object in your package. You will see some message text about the data.

data-raw/self_res_att.R

self_res_att <- readr::read_csv("data-raw/DeBruine_2004_PRSLB_att.csv")

Rows: 108 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): id, sex, ethgroup, birthorder
dbl (12): age, m_non, f_non, m_self, f_self, grpsize, group, mascpref, obro,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note

I use readr::read_csv() so we can fix the column data types, but you’re free to use any way you’re comfortable with to load and process the data.

2.1.4 Check Column Types

Have a quick look at the data and make sure the columns import as the right data types. You might want to make sure that categorical factors are imported as factors with their levels in a sensible order.

- Use readr::spec(self_res_att) to get the default column types.
- Copy and paste this to your script.
- Assign them to an object (here called ct).
- Edit them to set factors with levels in a sensible order.

run in the console

spec(self_res_att)

cols(
  id = col_character(),
  sex = col_character(),
  ethgroup = col_character(),
  age = col_double(),
  m_non = col_double(),
  f_non = col_double(),
  m_self = col_double(),
  f_self = col_double(),
  grpsize = col_double(),
  group = col_double(),
  mascpref = col_double(),
  obro = col_double(),
  osis = col_double(),
  ybro = col_double(),
  ysis = col_double(),
  birthorder = col_character()
)

data-raw/self_res_att.R

ct <- cols(
  id = col_character(),
  sex = col_factor(levels = c("female", "male")),
  ethgroup = col_factor(),
  age = col_double(),
  m_non = col_double(),
  f_non = col_double(),
  m_self = col_double(),
  f_self = col_double(),
  grpsize = col_double(),
  group = col_double(),
  mascpref = col_double(),
  obro = col_double(),
  osis = col_double(),
  ybro = col_double(),
  ysis = col_double(),
  birthorder = col_factor(levels = c("only", "firstborn", "middleborn", "lastborn"))
)

self_res_att <- readr::read_csv("data-raw/DeBruine_2004_PRSLB_att.csv", col_types = ct)
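After re-importing with explicit column types, it is worth checking that the factor levels are what you declared and that no values were lost. The sketch below uses a tiny made-up CSV (not the real data) in which one value, "youngest", is deliberately missing from the declared levels; readr should import it as NA and record a parsing problem. The names `toy_csv`, `toy_ct` and `toy` are placeholders for this illustration.

```r
library(readr)

# toy CSV text (made-up values, not the real data); "youngest" is
# deliberately not one of the declared birthorder levels
toy_csv <- I("id,sex,birthorder\nA,female,only\nB,male,firstborn\nC,male,youngest\n")

toy_ct <- cols(
  id = col_character(),
  sex = col_factor(levels = c("female", "male")),
  birthorder = col_factor(levels = c("only", "firstborn", "middleborn", "lastborn"))
)

toy <- read_csv(toy_csv, col_types = toy_ct)

levels(toy$birthorder)  # levels in the order you declared them
problems(toy)           # parsing problems recorded during import, if any
colSums(is.na(toy))     # number of missing values in each column
```

You can run the same three checks on self_res_att. Unexpected NAs in a factor column usually mean a level is misspelled or missing from col_factor().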

2.1.5 Add Processed Data

Once you’re satisfied that the data are in the right form, you can run this code to add the data to your package.

data-raw/self_res_att.R

usethis::use_data(self_res_att, overwrite = TRUE)

✔ Adding 'R' to Depends field in DESCRIPTION
✔ Setting LazyData to 'true' in 'DESCRIPTION'
✔ Saving 'self_res_att' to 'data/self_res_att.rda'
• Document your data (see 'https://r-pkgs.org/data.html')
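The .rda file is in the format written by base R's save(), so you can check what it contains with load(). The sketch below shows the round trip on a made-up data frame saved to a temporary file, so it doesn't touch your package; `toy` and `rda_file` are placeholder names.

```r
# toy data frame standing in for self_res_att (made-up values)
toy <- data.frame(
  id = c("A", "B"),
  sex = factor(c("female", "male"), levels = c("female", "male"))
)

# save() writes the .rda format; a temporary file keeps this out of your package
rda_file <- file.path(tempdir(), "toy.rda")
save(toy, file = rda_file)

# load() restores objects under their original names and returns those names
env <- new.env()
loaded_names <- load(rda_file, envir = env)

loaded_names
identical(env$toy, toy)  # column types and factor levels survive the round trip
```

For your package, the equivalent check would be load("data/self_res_att.rda", envir = env), run from the package root.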

2.2 Documentation

Data need to be documented to be included in a package. The documentation looks like this:

#' Title.
#'
#' Description of the dataset.
#'
#' @format A data frame with {nrow} rows and {ncol} variables:
#' \describe{
#'   \item{col1}{col1 description}
#'   \item{col2}{col2 description}
#' }
#' @source \url{data_URL}
"dataname"

2.2.1 Data Description

Create a new script called “self_res_att”. The function use_r() will save it in the right directory as R/self_res_att.R and open it in the source pane.

run in the console

usethis::use_r("self_res_att")

Caution

Make sure you don’t confuse this file with the processing script at data-raw/self_res_att.R.

Edit the title, description, format, source, and dataname from the template above (or just copy the text below).

#' Attractiveness of Self-Resembling Faces.
#'
#' Attractiveness judgements of same-sex and other-sex self-resembling faces from DeBruine (2004).
#'
#' @format A data frame with 108 rows and 16 variables:
#' \describe{
#'   \item{col1}{col1 description}
#'   \item{col2}{col2 description}
#' }
#' @source \url{https://osf.io/3c5s4/}
"self_res_att"

2.2.2 Column Descriptions

You can manually edit the column \item descriptions, but this can be pretty tedious for longer data sets. I’ll show you how to make the process a little easier.

Use the following code to get a list of the column names in the data set. The function dput() gives you the code needed to create an object.

run in the console

coldesc <- rep("", ncol(self_res_att))
names(coldesc) <- names(self_res_att)
dput(coldesc)

c(id = "", sex = "", ethgroup = "", age = "", m_non = "", f_non = "", 
m_self = "", f_self = "", grpsize = "", group = "", mascpref = "", 
obro = "", osis = "", ybro = "", ysis = "", birthorder = "")

Copy and paste the output into your script and assign this to an object called vars.

data-raw/self_res_att.R

vars <- c(
    id = "",
    sex = "",
    ethgroup = "",
    age = "",
    m_non = "",
    f_non = "",
    m_self = "",
    f_self = "",
    grpsize = "",
    group = "",
    mascpref = "",
    obro = "",
    osis = "",
    ybro = "",
    ysis = "",
    birthorder = ""
  )

Note

The line breaks will probably be ugly, but you can select the code and choose Reformat Code (cmd-shift-A) from the Code menu to fix it.
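As a quick illustration of what dput() does, the sketch below captures its printed output as text and evaluates that text to recreate the object. The names `x`, `code` and `y` are placeholders for this example only.

```r
# a small named character vector, like coldesc
x <- c(id = "", sex = "")

# capture the code that dput() prints as a single string
code <- paste(capture.output(dput(x)), collapse = "\n")
code

# evaluating that code recreates the original object
y <- eval(parse(text = code))
identical(x, y)
```

This is why pasting dput() output into a script gives you an exact copy of the object to edit.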

Next, define each of these columns.

data-raw/self_res_att.R

vars <- c(
  "id" = "participant unique ID",
  "sex" = "sex of the participant (male/female)",
  "ethgroup" = "ethnic group of the participant (east_asian/west_asian/white)",
  "age" = "age of the participant in years",
  "m_non" = "mean number of times the other group members chose that male face as more attractive",
  "f_non" = "mean number of times the other group members chose that female face as more attractive",
  "m_self" = "number of times out of a possible 6 chose their male self-res face as more attractive",
  "f_self" = "number of times out of a possible 6 chose their female self-res face as more attractive",
  "grpsize" = "size of the group",
  "group" = "unique group ID",
  "mascpref" = "masculinity preference on an unrelated face preference task",
  "obro" = "number of older brothers",
  "osis" = "number of older sisters",
  "ybro" = "number of younger brothers",
  "ysis" = "number of younger sisters",
  "birthorder" = "birth order (only/firstborn/middleborn/lastborn) as calculated from number of younger and older brothers and sisters"
)

Now we can format this for the documentation file. Because the documentation format uses curly brackets, we’re going to set the opening and closing symbols for glue to square brackets. Copy and paste the resulting lines into the \describe section of R/self_res_att.R, replacing the placeholder \item lines.

data-raw/self_res_att.R

glue::glue("#'   \\item{[colname]}{[coldesc]}",
           colname = names(vars),
           coldesc = vars,
           .open = "[",
           .close = "]")

#'   \item{id}{participant unique ID}
#'   \item{sex}{sex of the participant (male/female)}
#'   \item{ethgroup}{ethnic group of the participant (east_asian/west_asian/white)}
#'   \item{age}{age of the participant in years}
#'   \item{m_non}{mean number of times the other group members chose that male face as more attractive}
#'   \item{f_non}{mean number of times the other group members chose that female face as more attractive}
#'   \item{m_self}{number of times out of a possible 6 chose their male self-res face as more attractive}
#'   \item{f_self}{number of times out of a possible 6 chose their female self-res face as more attractive}
#'   \item{grpsize}{size of the group}
#'   \item{group}{unique group ID}
#'   \item{mascpref}{masculinity preference on an unrelated face preference task}
#'   \item{obro}{number of older brothers}
#'   \item{osis}{number of older sisters}
#'   \item{ybro}{number of younger brothers}
#'   \item{ysis}{number of younger sisters}
#'   \item{birthorder}{birth order (only/firstborn/middleborn/lastborn) as calculated from number of younger and older brothers and sisters}
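Before pasting the items, it can help to confirm that every column has a description and that no description refers to a column that doesn't exist. The sketch below does this on a made-up two-column example and then builds the same \item lines with base R's sprintf(), an alternative to glue that needs no custom delimiters. The names `toy`, `toy_vars`, `undocumented` and `unknown` are placeholders.

```r
# toy stand-ins for the data and the vars vector (made-up, two columns only)
toy <- data.frame(id = c("A", "B"), age = c(20, 21))
toy_vars <- c(
  id = "participant unique ID",
  age = "age of the participant in years"
)

# every column should be described, and every description should match a column
undocumented <- setdiff(names(toy), names(toy_vars))
unknown <- setdiff(names(toy_vars), names(toy))
stopifnot(length(undocumented) == 0, length(unknown) == 0)

# sprintf() is vectorised, so this builds one \item line per column
items <- sprintf("#'   \\item{%s}{%s}", names(toy_vars), toy_vars)
writeLines(items)
```

With the real objects, replace `toy` with self_res_att and `toy_vars` with vars.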

Note

I’ve written a function that you can use to set up a data documentation file for a data object. This could be useful if you are adding many data tables to a package.

2.2.3 Add to Package

Now save this file and add the documentation to your package with the following code:

run in the console

devtools::document()

You should see the following output.

ℹ Updating demopkg documentation
ℹ Loading demopkg
Writing NAMESPACE
Writing self_res_att.Rd

You need to run this function any time you make changes to the documentation in the script. If you just see the first two lines of the output, nothing changed. The third line will show only if you have added a new data set or function to the package, and the fourth line will show if the help documentation was updated.

2.3 Load in package

Now restart R and make sure that your environment is clear. Run the following code to load your package. You can also use a keyboard shortcut to run this function (Mac: cmd-shift-L, Windows: ctrl-shift-L).

run in the console

devtools::load_all(".")

You should be able to access the data now.

run in the console

# load as self_res_att
data("self_res_att", package = "demopkg")

# load as d1
d1 <- demopkg::self_res_att
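The two approaches differ in where the object ends up. The sketch below shows the difference using mtcars from the built-in datasets package, so it runs without demopkg; `env` is a placeholder environment used here to make the effect of data() visible.

```r
# 1. data() copies the data set into an environment under its own name
#    (by default, the environment you call it from, e.g. the global environment)
env <- new.env()
data("mtcars", package = "datasets", envir = env)
ls(env)

# 2. the :: operator returns the object, so you choose the name yourself
d1 <- datasets::mtcars

identical(env$mtcars, d1)
```

Using `::` avoids overwriting an existing object that happens to share the data set's name.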

You should also be able to see the documentation in the Help pane.

run in the console

?demopkg::self_res_att

2.4 Glossary

term	definition
categorical	Data that can only take certain values, such as types of pet.
data-type	The kind of data represented by an object.
directory	A collection or “folder” of files on a computer.
environment	A data structure that contains R objects such as variables and functions.
factor-data-type	A data type where a specific set of values are stored with labels.
level	The set of valid values for a factor.
object	A word that identifies and stores the value of some data for later use.

2.5 Further Resources

Data from Wickham (2015)

2.6 Further Practice

Add the dataset DeBruine_2004_PRSLB_avg.csv from https://osf.io/2ud56/. This is a separate sample of participants who judged the averageness of male and female self- and non-self-resembling faces, rather than their attractiveness. The columns are:
- id: participant unique ID
- sex: sex of the participant (male/female)
- group: unique group ID
- m_avg: number of times out of a possible 6 chose a male composite face as more attractive
- f_avg: number of times out of a possible 6 chose a female composite face as more attractive
- m_self: number of times out of a possible 6 chose their male self-res face as more attractive
- f_self: number of times out of a possible 6 chose their female self-res face as more attractive
- m_non: mean number of times the other group members chose that male face as more attractive
- f_non: mean number of times the other group members chose that female face as more attractive
Add datasets of your own to this package or your personal package.