2 Adding Data
Add datasets to your package and document them with a codebook. You can use our demo data or your own.
2.1 Process Data
We’re going to add some data from DeBruine (2004) looking at attractiveness judgements of same-sex and other-sex self-resembling faces. Follow the directions below for each data set you want to add to your package.
2.1.1 Raw Data Directory
Run the code below.
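The call itself is not shown here. Judging by the messages it prints, it is most likely `usethis::use_data_raw()`, run from the console inside your package project (the name `self_res_att` is taken from the output below):

```r
# Run in the console from inside your package project.
# Creates data-raw/ (if needed), adds it to .Rbuildignore,
# and opens data-raw/self_res_att.R for editing.
usethis::use_data_raw("self_res_att")
```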
✔ Setting active project to '/Users/debruine/rproj/psyteachr/intro-r-pkgs'
• Finish the data preparation script in 'data-raw/self_res_att.R'
• Use `usethis::use_data()` to add prepared data to package
The first time you run this function in a new package project, it will create a directory called data-raw and add this directory to the .Rbuildignore file so it isn’t included in your package.
Any processed data you include in an R package is viewable by anyone who downloads the package. Anything in the data-raw directory is not included in the package, but if you share your whole project publicly on GitHub, people will be able to see the raw data and processing code. You can prevent this by adding this directory to .gitignore (which we will discuss in Chapter 9), but it’s pretty easy to make errors that accidentally share everything, so the safest way is to process your data outside the package project if you really can’t share the raw data.
It will also open a file in that directory called self_res_att.R with the following contents.
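In recent versions of usethis the generated template looks roughly like this (an approximation; the exact comment text may differ between versions):

```r
## code to prepare `self_res_att` dataset goes here

usethis::use_data(self_res_att, overwrite = TRUE)
```

The last line will only work once the script above it creates an object called `self_res_att`.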
2.1.2 Add Raw Data
Download the data file called DeBruine_2004_PRSLB_att.csv from https://osf.io/2ud56/ and save it in data-raw. The code below will also do this.
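A minimal sketch of downloading the file from a script, using base R's `download.file()`. The helper name `fetch_raw()` is hypothetical, and the direct-download link is left as a placeholder because it is not given here; copy it from the file's page on OSF.

```r
# Hypothetical helper: download a raw data file into data-raw/ once
fetch_raw <- function(url, dest) {
  # create the target directory if it does not exist yet
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    # mode = "wb" avoids line-ending corruption on Windows
    utils::download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage (replace the placeholder with the file's direct-download link):
# fetch_raw("<direct download URL>", "data-raw/DeBruine_2004_PRSLB_att.csv")
```

Skipping the download when the file already exists means the processing script can be re-run without fetching the data again.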
2.1.3 Load Data
Now, load the data in this script, using the object name self_res_att; this will be the name of the data object in your package. You will see some message text about the data.
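A call along these lines, assuming the file was saved in data-raw under its original name, produces the messages that follow:

```r
# Read the raw CSV with readr's default column-type guessing
self_res_att <- readr::read_csv("data-raw/DeBruine_2004_PRSLB_att.csv")
```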
Rows: 108 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): id, sex, ethgroup, birthorder
dbl (12): age, m_non, f_non, m_self, f_self, grpsize, group, mascpref, obro,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I use readr::read_csv() so we can fix the column data types, but you’re free to use any way you’re comfortable with to load and process the data.
2.1.4 Check Column Types
Have a quick look at the data and make sure the columns import as the right data types. You might want to make sure that categorical factors are imported as factors with their levels in a sensible order.
- use readr::spec(self_res_att) to get the default column types,
- copy and paste this to your script,
- assign them to an object (here called ct), and
- edit them to set factors with levels in a sensible order
cols(
id = col_character(),
sex = col_character(),
ethgroup = col_character(),
age = col_double(),
m_non = col_double(),
f_non = col_double(),
m_self = col_double(),
f_self = col_double(),
grpsize = col_double(),
group = col_double(),
mascpref = col_double(),
obro = col_double(),
osis = col_double(),
ybro = col_double(),
ysis = col_double(),
birthorder = col_character()
)
data-raw/self_res_att.R
library(readr) # provides cols(), col_character(), col_double(), and col_factor()
ct <- cols(
id = col_character(),
sex = col_factor(levels = c("female", "male")),
ethgroup = col_factor(),
age = col_double(),
m_non = col_double(),
f_non = col_double(),
m_self = col_double(),
f_self = col_double(),
grpsize = col_double(),
group = col_double(),
mascpref = col_double(),
obro = col_double(),
osis = col_double(),
ybro = col_double(),
ysis = col_double(),
birthorder = col_factor(levels = c("only", "firstborn", "middleborn", "lastborn"))
)
self_res_att <- readr::read_csv("data-raw/DeBruine_2004_PRSLB_att.csv", col_types = ct)
2.1.5 Add Processed Data
Once you’re satisfied that the data are in the right form, you can run this code to add the data to your package.
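The messages below are what `usethis::use_data()` prints, and the call is the one the data preparation script is meant to end with:

```r
# Saves the object as data/self_res_att.rda;
# overwrite = TRUE lets you re-run the script after changing the processing
usethis::use_data(self_res_att, overwrite = TRUE)
```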
✔ Adding 'R' to Depends field in DESCRIPTION
✔ Setting LazyData to 'true' in 'DESCRIPTION'
✔ Saving 'self_res_att' to 'data/self_res_att.rda'
• Document your data (see 'https://r-pkgs.org/data.html')
2.2 Documentation
Data need to be documented to be included in a package. The documentation looks like this:
#' Title.
#'
#' Description of the dataset.
#'
#' @format A data frame with {nrow} rows and {ncol} variables:
#' \describe{
#' \item{col1}{col1 description}
#' \item{col2}{col2 description}
#' }
#' @source \url{data_URL}
"dataname"
2.2.1 Data Description
Create a new script called “self_res_att”. The function use_r() will save it in the right directory as R/self_res_att.R and open it in the source pane.
Make sure you don’t confuse this file with the processing script at data-raw/self_res_att.R.
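The script can be created from the console like this (`use_r()` lives in the usethis package):

```r
# Creates R/self_res_att.R (if it does not exist) and opens it
usethis::use_r("self_res_att")
```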
Edit the title, description, format, source, and dataname from the template above (or just copy the text below).
#' Attractiveness of Self-Resembling Faces.
#'
#' Attractiveness judgements of same-sex and other-sex self-resembling faces from DeBruine (2004).
#'
#' @format A data frame with 108 rows and 16 variables:
#' \describe{
#' \item{col1}{col1 description}
#' \item{col2}{col2 description}
#' }
#' @source \url{https://osf.io/3c5s4/}
"self_res_att"
2.2.2 Column Descriptions
You can manually edit the column \item descriptions, but this can be pretty tedious for longer data sets. I’ll show you how to make the process a little easier.
Use the following code to get a list of the column names in the data set. The function dput() gives you the code needed to create an object.
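One way to do this is sketched below. The helper name `empty_vars()` and the stand-in data frame `demo` are assumptions made so the example runs on its own; with the real data you would pass `self_res_att` instead.

```r
# Hypothetical helper: a named character vector with one empty
# description per column of `data`
empty_vars <- function(data) {
  stats::setNames(rep("", ncol(data)), names(data))
}

# Small stand-in data frame; in the tutorial use self_res_att
demo <- data.frame(id = c("a", "b"), sex = c("female", "male"))

# dput() prints the code that recreates the vector
dput(empty_vars(demo))
# prints: c(id = "", sex = "")
```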
run in the console
c(id = "", sex = "", ethgroup = "", age = "", m_non = "", f_non = "",
m_self = "", f_self = "", grpsize = "", group = "", mascpref = "",
obro = "", osis = "", ybro = "", ysis = "", birthorder = "")
Copy and paste the output into your script and assign this to an object called vars.
The line breaks will probably be ugly, but you can select the code and choose Reformat Code (cmd-shift-A) from the Code menu to fix it.
Next, define each of these columns.
data-raw/self_res_att.R
vars <- c(
"id" = "participant unique ID",
"sex" = "sex of the participant (male/female)",
"ethgroup" = "ethnic group of the participant (east_asian/west_asian/white)",
"age" = "age of the participant in years",
"m_non" = "mean number of times the other group members chose that male face as more attractive",
"f_non" = "mean number of times the other group members chose that female face as more attractive",
"m_self" = "number of times out of a possible 6 chose their male self-res face as more attractive",
"f_self" = "number of times out of a possible 6 chose their female self-res face as more attractive",
"grpsize" = "size of the group",
"group" = "unique group ID",
"mascpref" = "masculinity preference on an unrelated face preference task",
"obro" = "number of older brothers",
"osis" = "number of older sisters",
"ybro" = "number of younger brothers",
"ysis" = "number of younger sisters",
"birthorder" = "birth order (only/firstborn/middleborn/lastborn) as calculated from number of younger and older brothers and sisters"
)
Now we can format this for the documentation file. Because the \item format uses curly brackets, we’re going to set the opening and closing symbols for glue to square brackets. Copy and paste these lines into the \describe section of R/self_res_att.R, replacing the col1 and col2 placeholder items.
data-raw/self_res_att.R
#' \item{id}{participant unique ID}
#' \item{sex}{sex of the participant (male/female)}
#' \item{ethgroup}{ethnic group of the participant (east_asian/west_asian/white)}
#' \item{age}{age of the participant in years}
#' \item{m_non}{mean number of times the other group members chose that male face as more attractive}
#' \item{f_non}{mean number of times the other group members chose that female face as more attractive}
#' \item{m_self}{number of times out of a possible 6 chose their male self-res face as more attractive}
#' \item{f_self}{number of times out of a possible 6 chose their female self-res face as more attractive}
#' \item{grpsize}{size of the group}
#' \item{group}{unique group ID}
#' \item{mascpref}{masculinity preference on an unrelated face preference task}
#' \item{obro}{number of older brothers}
#' \item{osis}{number of older sisters}
#' \item{ybro}{number of younger brothers}
#' \item{ysis}{number of younger sisters}
#' \item{birthorder}{birth order (only/firstborn/middleborn/lastborn) as calculated from number of younger and older brothers and sisters}
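The \item lines above can be generated rather than typed by hand. This is a minimal sketch using `glue::glue()` with its `.open` and `.close` arguments; `vars_demo` and `items` are illustrative names, and `vars_demo` is a two-entry stand-in for the full vars vector.

```r
# Stand-in for the full `vars` vector (two entries only)
vars_demo <- c(
  "id"  = "participant unique ID",
  "sex" = "sex of the participant (male/female)"
)

# Square brackets mark the glue expressions, so the curly brackets
# that roxygen needs are treated as literal text.
# glue is vectorised: one line is produced per element of vars_demo.
items <- glue::glue(
  "#' \\item{[names(vars_demo)]}{[vars_demo]}",
  .open = "[", .close = "]"
)
items
```

Printing `items` in the console gives one roxygen line per column, ready to paste.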
I’ve written a function that you can use to set up a data documentation file for a data object. This could be useful if you are adding many data tables to a package.
2.2.3 Add to Package
Now save this file and add the documentation to your package with the following code:
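The output shown next matches what `devtools::document()` prints, so the call is presumably:

```r
# Run in the console from inside the package project;
# rebuilds the .Rd help files and NAMESPACE from the roxygen comments
devtools::document()
```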
You should see the following output.
ℹ Updating demopkg documentation
ℹ Loading demopkg
Writing NAMESPACE
Writing self_res_att.Rd
You need to run this function any time you make changes to the documentation in the script. If you just see the first two lines of the output, nothing changed. The third line will show only if you have added a new data set or function to the package, and the fourth line will show if the help documentation was updated.
2.3 Load in package
Now restart R and make sure that your environment is clear. Run the following code to load your package. You can also use a keyboard shortcut to run this function (Mac: cmd-shift-L, Windows: ctrl-shift-L).
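In RStudio that shortcut runs `devtools::load_all()`, so the equivalent console call is:

```r
# Loads the package from source, including its data,
# without installing it
devtools::load_all()
```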
You should be able to access the data now.
You should also be able to see the documentation in the Help pane.
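A quick check of both, assuming the package has been loaded and the data object is named self_res_att as above:

```r
# Show the first rows of the packaged data
head(self_res_att)

# Open the help page generated from R/self_res_att.R
?self_res_att
```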
2.4 Glossary
term | definition |
---|---|
categorical | Data that can only take certain values, such as types of pet. |
data-type | The kind of data represented by an object. |
directory | A collection or "folder" of files on a computer. |
environment | A data structure that contains R objects such as variables and functions. |
factor-data-type | A data type where a specific set of values are stored with labels. |
level | The set of valid values for a factor. |
object | A word that identifies and stores the value of some data for later use. |
2.5 Further Resources
2.6 Further Practice
- Add the dataset DeBruine_2004_PRSLB_avg.csv from https://osf.io/2ud56/. This is a separate sample of participants who judged the averageness of male and female self- and non-self-resembling faces, rather than their attractiveness.
  - id: participant unique ID
  - sex: sex of the participant (male/female)
  - group: unique group ID
  - m_avg: number of times out of a possible 6 chose a male composite face as more attractive
  - f_avg: number of times out of a possible 6 chose a female composite face as more attractive
  - m_self: number of times out of a possible 6 chose their male self-res face as more attractive
  - f_self: number of times out of a possible 6 chose their female self-res face as more attractive
  - m_non: mean number of times the other group members chose that male face as more attractive
  - f_non: mean number of times the other group members chose that female face as more attractive
- Add datasets of your own to this package or your personal package.