Appendix G — Data Types
G.1 Basic data types
Data can be numbers, words, true/false values or combinations of these. The basic data types in R are: numeric, character, and logical, as well as the special classes of factor and date/times.
G.1.1 Numeric data
All numbers are numeric data types. There are two types of numeric data, integer and double. Integers are the whole numbers, like -1, 0 and 1. Doubles are numbers that can have fractional amounts. If you just type a plain number such as 10, it is stored as a double, even if it doesn’t have a decimal point. If you want it to be an exact integer, you can use the L suffix (10L), but this distinction doesn’t make much difference in practice.
If you ever want to know the data type of something, use the typeof() function.
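As a minimal sketch, calls like the following return "double", "double" and "integer" in that order. The specific values passed to typeof() are assumptions chosen to match the surrounding prose.

```r
typeof(10)    # a plain number is stored as a double
typeof(10.0)  # writing the decimal point makes no difference
typeof(10L)   # the L suffix stores the value as an integer
```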
[1] "double"
[1] "double"
[1] "integer"
If you want to know if something is numeric (a double or an integer), you can use the function is.numeric() and it will tell you if it is numeric (TRUE) or not (FALSE).
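A short illustration, using hypothetical values, of how is.numeric() treats different inputs:

```r
is.numeric(10L)   # TRUE: integers are numeric
is.numeric(10.5)  # TRUE: doubles are numeric
is.numeric("10")  # FALSE: a number inside quotes is a character string
is.numeric(TRUE)  # FALSE: logical values are not numeric
```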
G.1.2 Character data
Characters (also called “strings”) are any text between quotation marks.
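For instance, both of the following calls return "character". The example strings are assumptions; single and double quotation marks both work in R.

```r
typeof("This is a character string")
typeof('You can use double or single quotes')
```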
[1] "character"
[1] "character"
This can include quotes, but you have to escape quotes using a backslash to signal that the quote isn’t meant to be the end of the string.
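A small sketch of escaping, with a made-up sentence stored in a hypothetical variable my_string:

```r
# \" inserts a literal double quote without ending the string
my_string <- "The instructor said, \"R is cool,\" and the class agreed."

# cat() shows the string as text, without the escape backslashes
cat(my_string)
```

Printing the object directly (typing my_string at the console) shows the backslashes, because R displays the string the way you would have to type it.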
G.1.3 Logical Data
Logical data (also sometimes called “boolean” values) is one of two values: true or false. In R, we always write them in uppercase: TRUE and FALSE.
When you compare two values with an operator, such as checking to see if 10 is greater than 5, the resulting value is logical.
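The comparison described above can be sketched like this; the variable name is_bigger is only for illustration.

```r
is_bigger <- 10 > 5  # compare two values with the > operator
is_bigger            # TRUE
typeof(is_bigger)    # "logical"

10 == 5              # FALSE: == tests whether two values are equal
```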
G.1.4 Factors
A factor is a specific type of integer that lets you specify the categories and their order. This is useful in data tables to make plots display with categories in the correct order.
Factors are a type of integer, but you can tell that they are factors by checking their class().
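A minimal sketch with a hypothetical vector of sizes, where the levels argument sets the category order:

```r
sizes <- factor(
  c("medium", "small", "large", "small"),
  levels = c("small", "medium", "large")  # the order we want for the categories
)

typeof(sizes)      # "integer"
class(sizes)       # "factor"
levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # 2 1 3 1: each value is stored as its position in the levels
```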
G.1.5 Dates and Times
Dates and times are represented by doubles with special classes. Although typeof() will tell you they are a double, you can tell that they are dates by checking their class(). Datetimes can have one or more of a few classes that start with POSIX.
date <- as.Date("2022-01-24")
datetime <- ISOdatetime(2022, 1, 24, 10, 35, 00, "GMT")
typeof(date)
typeof(datetime)
class(date)
class(datetime)
[1] "double"
[1] "double"
[1] "Date"
[1] "POSIXct" "POSIXt"
See Appendix H for how to use
G.2 Basic container types
Individual data values can be grouped together into containers. The main types of containers we’ll work with are vectors, lists, and data tables.
G.2.1 Vectors
A vector in R is a set of items (or ‘elements’) in a specific order. All of the elements in a vector must be of the same data type (numeric, character, logical). You can create a vector by enclosing the elements in the function c().
## put information into a vector using c(...)
c(1, 2, 3, 4)
c("this", "is", "cool")
1:6 # shortcut to make a vector of all integers x:y
[1] 1 2 3 4
[1] "this" "is" "cool"
[1] 1 2 3 4 5 6
Selecting values from a vector
To pick specific values out of a vector by position, use square brackets (an extract operator, or []) after the vector.
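For example, with a hypothetical vector called word:

```r
word <- c("this", "is", "cool")
word[1]  # "this" (positions in R start at 1)
word[3]  # "cool"
```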
You can select more than one value from the vector by putting a vector of numbers inside the square brackets. For example, you can select the 18th, 19th, 20th, 21st, 4th, 9th and 15th letters from the built-in vector LETTERS (which gives all the uppercase letters in the Latin alphabet).
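The selection described above can be written as:

```r
# a vector of positions inside the square brackets
LETTERS[c(18, 19, 20, 21, 4, 9, 15)]
# "R" "S" "T" "U" "D" "I" "O"
```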
You can also create ‘named’ vectors, where each element has a name. For example:
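A sketch of one way to do this; the vector vec and its values are hypothetical.

```r
# give each element a name with name = value inside c()
vec <- c(first = 77.9, second = -13.2, third = 100.1)
vec
names(vec)  # "first" "second" "third"
```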
We can then access elements by name using a character vector within the square brackets. We can put them in any order we want, and we can repeat elements:
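A sketch using a hypothetical named vector vec:

```r
vec <- c(first = 77.9, second = -13.2, third = 100.1)

# names can be in any order and can be repeated
vec[c("third", "second", "second")]
```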
Another way to access elements is by using a logical vector within the square brackets. This will pull out the elements of the vector for which the corresponding element of the logical vector is TRUE. If the logical vector is shorter than the original, its values are recycled (repeated) until they match the original’s length. You can find out how long a vector is using the length() function.
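A brief illustration, again with a hypothetical named vector vec:

```r
vec <- c(first = 77.9, second = -13.2, third = 100.1)

vec[c(TRUE, FALSE, TRUE)]  # keeps the first and third elements
vec[c(TRUE, FALSE)]        # the pattern repeats, so this also keeps first and third

length(vec)                # 3
length(LETTERS)            # 26
LETTERS[c(TRUE, FALSE)]    # every other letter: "A" "C" "E" ...
```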
Repeating Sequences
Here are some useful tricks to save typing when creating vectors.
In the command x:y the : operator gives you the sequence of numbers starting at x and going to y in increments of 1.
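The calls below are a sketch of inputs that produce sequences like those shown next; the exact end points are assumptions.

```r
1:10       # whole numbers from 1 to 10
15.3:20.5  # starts at 15.3 and stops before passing 20.5
0:-10      # counts down when the second number is smaller
```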
[1] 1 2 3 4 5 6 7 8 9 10
[1] 15.3 16.3 17.3 18.3 19.3 20.3
[1] 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
What if you want to create a sequence but with something other than integer steps? You can use the seq() function. Look at the examples below and work out what the arguments do.
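As a sketch, calls like these produce the three sequences shown next; the argument values are assumptions inferred from that output.

```r
seq(from = -1, to = 1, by = 0.2)  # step size set with by
seq(0, 100, length.out = 11)      # number of values set with length.out
seq(0, 10, 0.4)                   # unnamed arguments are from, to, by
```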
[1] -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
[1] 0 10 20 30 40 50 60 70 80 90 100
[1] 0.0 0.4 0.8 1.2 1.6 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6
[16] 6.0 6.4 6.8 7.2 7.6 8.0 8.4 8.8 9.2 9.6 10.0
What if you want to repeat a vector many times? You could either type it out (painful) or use the rep() function, which can repeat vectors in different ways.
rep(0, 10) # ten zeroes
rep(c(1L, 3L), times = 7) # alternating 1 and 3, 7 times
rep(c("A", "B", "C"), each = 2) # A to C, 2 times each
[1] 0 0 0 0 0 0 0 0 0 0
[1] 1 3 1 3 1 3 1 3 1 3 1 3 1 3
[1] "A" "A" "B" "B" "C" "C"
The rep() function is useful to create a vector of logical values (TRUE/FALSE or 1/0) to select values from another vector.
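A minimal sketch of this idea, assuming a hypothetical set of 12 subject IDs from which we want a yes, yes, no, no pattern:

```r
subject_ids <- 1:12

# TRUE TRUE FALSE FALSE, repeated until it is as long as subject_ids
keep <- rep(c(TRUE, FALSE), each = 2, length.out = length(subject_ids))

subject_ids[keep]  # 1 2 5 6 9 10
```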
G.2.2 Lists
Recall that vectors can contain data of only one type. What if you want to store a collection of data of different data types? For that purpose you would use a list. Define a list using the list() function.
data_types <- list(
double = 10.0,
integer = 10L,
character = "10",
logical = TRUE
)
str(data_types) # str() prints lists in a condensed format
List of 4
$ double : num 10
$ integer : int 10
$ character: chr "10"
$ logical : logi TRUE
You can refer to elements of a list using square brackets like a vector, but you can also use the dollar sign notation ($) if the list items have names.
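The sketch below shows each notation on the data_types list; the assignments are a reconstruction based on the object names used in the following paragraphs.

```r
data_types <- list(
  double = 10.0,
  integer = 10L,
  character = "10",
  logical = TRUE
)

bracket1 <- data_types[1]           # single brackets, by position
bracket2 <- data_types[[1]]         # double brackets, by position
name1    <- data_types["double"]    # single brackets, by name
name2    <- data_types[["double"]]  # double brackets, by name
dollar   <- data_types$double       # dollar-sign notation
```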
The single brackets (bracket1 and name1) return a list with the subset of items inside the brackets. In this case, that’s just one item, but can be more (try data_types[1:2]). The items keep their names if they have them, so the returned value is list(double = 10).
The double brackets (bracket2 and name2) return a single item as a vector. You can’t select more than one item; data_types[[1:2]] will give you a “subscript out of bounds” error.
The dollar-sign notation works the same way as double brackets. If the name has spaces or any characters other than letters, numbers, underscores, and full stops, you need to surround the name with backticks (e.g., sales$`Customer ID`).
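For illustration, here is a hypothetical sales list with one name that needs backticks; the contents are made up.

```r
# backticks are needed when creating and when accessing the name with a space
sales <- list(`Customer ID` = c(101, 102), total = c(9.99, 24.50))

sales$`Customer ID`      # dollar-sign notation with backticks
sales[["Customer ID"]]   # double brackets take the name as an ordinary string
sales$total              # no backticks needed for a simple name
```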
G.2.3 Tables
Tabular data structures allow for a collection of data of different types (characters, integers, logical, etc.) but subject to the constraint that each “column” of the table (element of the list) must have the same number of elements. The base R version of a table is called a data.frame, while the ‘tidyverse’ version is called a tibble. Tibbles are far easier to work with, so we’ll be using those. To learn more about differences between these two data structures, see vignette("tibble").
Tabular data becomes especially important when we talk about tidy data in Chapter 8. Tidy data follows a set of simple principles for structuring data.
Table info
We can get information about the table using the following functions.
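A sketch of some common ones follows. The contents of the avatar table are an assumption reconstructed from the column names used later in this section, and a base R data.frame stands in for a tibble so the example runs without extra packages.

```r
# hypothetical example table; friendly = TRUE is repeated for every row
avatar <- data.frame(
  name = c("Katara", "Toph", "Sokka"),
  bends = c("water", "earth", NA),
  friendly = TRUE
)

ncol(avatar)   # number of columns: 3
nrow(avatar)   # number of rows: 3
dim(avatar)    # rows then columns: 3 3
names(avatar)  # column names: "name" "bends" "friendly"
str(avatar)    # condensed summary of each column's type and values
```

These functions also work on tibbles; dplyr’s glimpse() gives a summary similar to str().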
Accessing rows and columns
There are various ways of accessing specific columns or rows from a table. You’ll be learning more about this in Chapter 8 and Chapter 9.
siblings <- avatar %>% slice(1, 3) # rows (by number)
bends <- avatar %>% pull(2) # column vector (by number)
friendly <- avatar %>% pull(friendly) # column vector (by name)
bends_name <- avatar %>% select(bends, name) # subset table (by name)
toph <- avatar %>% pull(name) %>% pluck(2) # single cell
The code below uses base R to produce the same subsets as the functions above. These formats are useful to know about, since you might see them in other people’s scripts.
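The following is a sketch of base R equivalents, not necessarily the exact code originally shown here. The avatar table is a hypothetical reconstruction, built as a data.frame so the example is self-contained.

```r
# hypothetical example table with the columns used above
avatar <- data.frame(
  name = c("Katara", "Toph", "Sokka"),
  bends = c("water", "earth", NA),
  friendly = TRUE
)

siblings   <- avatar[c(1, 3), ]             # rows (by number)
bends      <- avatar[[2]]                   # column vector (by number)
friendly   <- avatar$friendly               # column vector (by name)
bends_name <- avatar[, c("bends", "name")]  # subset table (by name)
toph       <- avatar$name[2]                # single cell
```

Inside the square brackets, the part before the comma selects rows and the part after it selects columns; leaving one side empty keeps all rows or all columns.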
G.3 Glossary
term | definition |
---|---|
base r | The set of R functions that come with a basic installation of R, before you add external packages. |
character | A data type representing strings of text. |
coercion | Changing the data type of values in a vector to a single compatible type. |
data type | The kind of data represented by an object. |
double | A data type representing a real decimal number. |
escape | Include special characters like " inside of a string by prefacing them with a backslash. |
extract operator | A symbol used to get values from a container object, such as [, [[, or $. |
factor | A data type where a specific set of values are stored with labels; an explanatory variable manipulated by the experimenter. |
integer | A data type representing whole numbers. |
list | A container data type that allows items with different data types to be grouped together. |
logical | A data type representing TRUE or FALSE values. |
numeric | A data type representing a real decimal number or integer. |
operator | A symbol that performs some mathematical or comparative process. |
tabular data | Data in a rectangular table format, where each row has an entry for each column. |
tidy data | A format for data that maps the meaning onto the structure. |
vector | A type of data structure that collects values with the same data type, like T/F values, numbers, or strings. |