4 C

4.1 cache

Storing information for later retrieval, usually to save time.

In R Markdown documents, you can cache time-consuming code chunks, such as running a complex mixed effects model, by adding a chunk option cache = TRUE.

R Markdown Cookbook

4.2 cache (web)

In a web browser, external files like CSS, JavaScript, and images are usually cached so that they don't have to be repeatedly downloaded.

When you are developing Shiny apps, you may find that your styles, scripts, or images don't update, even though you've changed the source files. This is usually because of the cache, and you can fix it by reloading the web browser, or restarting R and rerunning the Shiny app.

4.3 categorical

Data that can only take certain values, such as types of pet.

See discrete or nominal.

4.4 character

A data type representing strings of text.

Examples of character data are:

"I am a string of characters"
#> [1] "I am a string of characters"
paste("The answer is ", 6+6)
#> [1] "The answer is  12"
as.character(100)
#> [1] "100"

4.5 chunk

A section of code in an R Markdown file

In an Rmd file, you can include a chunk by surrounding the code as in the example below:

```{r chunk-name}

mean_age <- mean(ages) %>% round(2)
```

Also sometimes called a block.

4.6 CI

Confidence interval: A type of interval estimate used to summarise a given statistic or measurement where a proportion of intervals calculated from the sample(s) will contain the true value of the statistic.

See confidence interval

4.7 class

A way to categorise objects

A class is similar to a data type, but object can have more than one class.

More...
# character class and type
class("A")
#> [1] "character"
typeof("A")
#> [1] "character"
# numeric class and double type
class(1.1)
#> [1] "numeric"
typeof(1.1)
#> [1] "double"
# numeric class and integer type
class(1L)
#> [1] "integer"
typeof(1L)
#> [1] "integer"
# function class and closure type
class(mean)
#> [1] "standardGeneric"
#> attr(,"package")
#> [1] "methods"
typeof(mean)
#> [1] "closure"

Classes often control how an object is printed. For example, the result of t.test() is a list with class "htest", which is why printing the result gives you formatted output.

t <- t.test(rnorm(10), rnorm(10))
typeof(t)
#> [1] "list"
class(t)
#> [1] "htest"
t
#> 
#>  Welch Two Sample t-test
#> 
#> data:  rnorm(10) and rnorm(10)
#> t = -0.87885, df = 16.532, p-value = 0.3921
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -1.5093996  0.6230434
#> sample estimates:
#>   mean of x   mean of y 
#> -0.07360206  0.36957606

If you unclass() an object, it no longer gets specical printing styles.

unclass(t)
#> $statistic
#>         t 
#> -0.878845 
#> 
#> $parameter
#>       df 
#> 16.53224 
#> 
#> $p.value
#> [1] 0.3920801
#> 
#> $conf.int
#> [1] -1.5093996  0.6230434
#> attr(,"conf.level")
#> [1] 0.95
#> 
#> $estimate
#>   mean of x   mean of y 
#> -0.07360206  0.36957606 
#> 
#> $null.value
#> difference in means 
#>                   0 
#> 
#> $stderr
#> [1] 0.5042733
#> 
#> $alternative
#> [1] "two.sided"
#> 
#> $method
#> [1] "Welch Two Sample t-test"
#> 
#> $data.name
#> [1] "rnorm(10) and rnorm(10)"

4.8 coding scheme

How to represent categorical variables with numbers for use in models

Example of different coding schemes for the same data

Contrast coding in detail

Contrast coding in R by Marissa Barlaz

4.9 coercion

Changing the data type of values in a vector to a single compatible type.

If you try to combine values with different data types in a vector or matrix, they get coerced to a common type. Strings are most dominant which means, if you try to combine a string with another data type, the other data type will be coerced into a string. Doubles are dominant over integers and numeric values are dominant over logical values.

c(2L, T, F)  # integers and logicals become integers
#> [1] 2 1 0
c(1.5, T, F) # doubles and logicals become doubles
#> [1] 1.5 1.0 0.0
c(1.5, 2L)   # doubles and integers become doubles
#> [1] 1.5 2.0
c("A", 1.5, 1L, T)  # strings and anything become strings
#> [1] "A"    "1.5"  "1"    "TRUE"

4.10 comment

Comments are text that R will not run as code. You can annotate .R files or chunks in R Markdown files with comments by prefacing each line of the comment with one or more hash symbols (#).

# I'm demonstrating comments in this chunk
# This comment will be added to the document outline ----

You can comment multiple lines at once by highlighting the code you would like to change into a comment and using the keyboard shortcut Ctrl + Shift + C (Windows) or Command + Shift + C (Mac). You can also change from comment to code (un-commenting) by using the same keyboard shortcut.

Comments get added to the document outline if you put four or more dashes, equal signs, or hashes at the end. This is a great way to keep track of more complicated scripts.

4.11 commit

The action of storing a new snapshot of a project's state in the git history.

4.12 compile

To create another resource, such as an app, from source code.

Compiling usually refers to taking human-readable code and converting it into code that is more efficient for a computer to run, but that a human can't read.

In the context of R Markdown files, knit, render and compile are sometimes used interchangeably.

4.13 computational reproducibility

The extent to which the findings of a study can be repeated with the same raw data but analyzed by different researchers or by the same researchers on a different occasion.

See also reproducibility.

4.14 concatenate

To combine strings or vectors.

When referring to strings, concatenate means to add strings together using the function paste (adds a space between strings) or paste0 (doesn't add anything between strings).

subject_name <- "Lisa"
paste("Hello,", subject_name)
#> [1] "Hello, Lisa"

When referring to other types of variables, concatenate can mean to create a vector with those variables, usually using the c function. For example, you could concatenate the numbers 1, 3, 6, and 10 like this: c(1, 3, 6, 10). You can concatenate two vectors as well:

v1 <- 1:5
v2 <- 11:15
c(v1, v2)
#>  [1]  1  2  3  4  5 11 12 13 14 15

Remember, a vector can only have one data type. So if you concatenate a string vector and a numeric vector, the numbers will get turned into their string versions. If you concatenate an integer and a double vector, the integers will be converted to doubles.

strings <- c("a", "c", "e")
integers <- c(1L, 3L, 5L)
doubles <- c(1.1, 3.3, 5.5)
c(strings, integers)
#> [1] "a" "c" "e" "1" "3" "5"
c(doubles, integers)
#> [1] 1.1 3.3 5.5 1.0 3.0 5.0

4.15 confidence interval

A type of interval estimate used to summarise a given statistic or measurement where a proportion of intervals calculated from the sample(s) will contain the true value of the statistic.

For example, 95% Confidence Intervals of the mean state that if you were to run the same study on 100 samples and calculate this interval for each iteration of the study, then 95 of the intervals calculated will contain the true mean of the population.

Most commonly cited Confidence Intervals are the 95% Confidence Interval and the 99% Confidence Interval. The 95% CI and 99% CI of the mean are calculated as follows:

  • \(95\%\ CI = \mu ± 1.96 \times SD\)
  • \(99\%\ CI = \mu ± 2.58 \times SD\)

For example, given \(\mu\) = 10 and \(SD\) = 1.25, the \(95\%\ CI = [7.55, 12.45]\)

There are a number of misconceptions about Confidence Intervals and it would be worth reading the following papers:

4.16 conflict

Having two packages loaded that have a function with the same name.

For example, when you load the tidyverse, you will see several warnings under the Conflicts heading.

── Conflicts ───── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x lag()::filter() masks stats::lag()

This means that the stats package has functions called filter() and lag(), but the dplyr package (part of the tidyverse), also has functions with the same name.

Because you loaded dplyr after stats (which is loaded by default when you start R), the functions from dplyr mask the functions from stats, but you can still use the functions from stats, you just need to preface them with stats::.

More...

Sometime, the two functions do totally different things. There just aren't enough names to go around; this is why all packages aren't loaded by default. For example, stats::filter() applies linear filtering to a time series, while dplyr::filter() subsets a data frame.

stats::filter(1:10, rep(1, 3))
#> Time Series:
#> Start = 1 
#> End = 10 
#> Frequency = 1 
#>  [1] NA  6  9 12 15 18 21 24 27 NA
dplyr::filter(tidyr::table1, country == "Afghanistan")
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360

In other cases, a package will name a function the same as a base R function on purpose, in order to do the same thing with more features. For example, base::setdiff() and dplyr::setdiff() both return rows from the first table that are not in the second table, but the base R version requires columns to be in the same order, while the dplyr version does not.

# define overlapping tables
t1 <- data.frame(a = 1:3, b = 1:3)
t2 <- data.frame(b = 2:4, a = 2:4)

# concludes all t1 rows are not in t2
base::setdiff(t1, t2)
#> $a
#> [1] 1 2 3
# concludes only 1 t1 row is not in t2
dplyr::setdiff(t1, t2)
a b
1 1

4.17 console

The pane in RStudio where you can type in commands and view output messages.

Commands typed into the console are not saved in a script, although they may be saved in the history.

4.18 container

A data structure that aggregates data, such as a vector, list, matrix, or data frame

# vector
c(1,2,3,4) 
#> [1] 1 2 3 4
# list
list(1, "A", TRUE)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] "A"
#> 
#> [[3]]
#> [1] TRUE
# matrix
matrix(1:6, nrow = 2)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
# data frame
data.frame(
  id = 1:5,
  name = c("Lisa", "Phil", "Helena", "Rachel", "Jack")
)
id name
1 Lisa
2 Phil
3 Helena
4 Rachel
5 Jack

A container may also refer to a virtual software environment that can be used to make sure that your code has the intended versions of all packages. Docker, Binder and Code Ocean are popular conatiner types.

4.19 continuous

Data that can take on any values between other existing values.

As opposed to discrete data, where fractional values don't make sense, continuous data can take fractional values (as far as the sensitivity of the measurement allows). So even if you are measuring height to the nearest whole centimeter, this is continuous data because it is possible and makes sense for intermediate values to exist.

4.20 correlation matrix

Parameters showing how a set of vectors are correlated.

This is set up as a matrix with 1.0 along the diagonal, meaning that every variable is perfectly correlated with itself.

data <- tibble(
  A = rnorm(100),
  B = rnorm(100), # uncorrelated with A
  C = rnorm(100) + A, # positively correlated with A
  D = rnorm(100) - A # negatively correlated with A
)

cor(data) # correlation matrix
#>               A             B           C           D
#> A  1.0000000000 -0.0009943199  0.66356312 -0.69155202
#> B -0.0009943199  1.0000000000 -0.03771143 -0.04160253
#> C  0.6635631226 -0.0377114274  1.00000000 -0.39843597
#> D -0.6915520161 -0.0416025327 -0.39843597  1.00000000

4.21 correlation

The relationship two vectors have to each other.

If you sample two sets of numbers independently, they will have a correlation close to 0 (on average). If one number perfectly predicts the other, the correlation is 1 if one gets bigger as the other gets bigger and -1 if one gets smaller as the other gets smaller.

Different Correlations (n = 100)

Figure 4.1: Different Correlations (n = 100)

4.22 covariance matrix

Parameters showing how a set of vectors vary and covary.

The covariance matrix is related to the correlation matrix, but also incorporates information about the standard deviations of all the variables.

More...
data <- faux::rnorm_multi(
  n = 100,
  mu = 0,
  sd = c(A = 1, B = 2, C = 3),
  r = 0.5,
  empirical = TRUE
)

cor(data) # correlation matrix
#>     A   B   C
#> A 1.0 0.5 0.5
#> B 0.5 1.0 0.5
#> C 0.5 0.5 1.0
cov(data) # covariance matrix
#>     A B   C
#> A 1.0 1 1.5
#> B 1.0 4 3.0
#> C 1.5 3 9.0

The matrix of the products of the standard deviations is called "sigma". If you multiply this by the correlation matrix, you get the covariance matrix.

  # multiply SDs    A    B    C
sigma <- matrix(c(1*1, 1*2, 1*3,  # A
                  2*1, 2*2, 2*3,  # B
                  3*1, 3*2, 3*3), # C
                nrow = 3)
sigma * cor(data)
#>     A B   C
#> A 1.0 1 1.5
#> B 1.0 4 3.0
#> C 1.5 3 9.0

4.23 covariance

The relationship two vectors have to each other.

A positive covariance means that as one vector increases the other does as well. A negative covariance means that as one vector increases the other decreases. A covariance of 0 means that the two vectors have to relation to each other.

Covariances can take on any numerical value which can make them difficult to interpret. Because of this covariances are often converted into correlations by incorporating information about each vectors' standard deviations.

4.24 covariate

Characteristics of the observations in a study that may affect the response variable.

Covariates can be independent variables of interest in your analysis, or confounding variables.

In the context of ANCOVAs, covariate specifically refers to continuous predictor variables.

4.25 CRAN

The Comprehensive R Archive Network: a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.

Packages that you download from CRAN have been checked to make sure they won't harm your computer. You can learn more about CRAN at https://cran.r-project.org/.

4.26 Cronbach's alpha

A measure used to assess the internal consistency of scale items

If a scale is a reliable measure of a concept, then the items of the scale should consistently measure that concept. A scale where all the items perfectly measure the same concept will have a Cronbach's alpha close to 1.0, basically meaning that all the items are nearly perfectly correlated. A scale where each item measures a totally different concept will have a Cronbach's alpha close to 0.0, meaning that each item is uncorrelated with the others.

Remember that internal reliability is not the same as external reliability. A scale can have a very high Cronbach's alpha, meaning it measures something well, but that thing may not be what you meant to measure.

More...

We can use the function psych::alpha() to calculate Cronbach's alpha.

# read in data from the 3 domains of disgust scale
disgust <- readr::read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust.csv")
  
# select just the items that measure pathogen disgust
pathogen_alpha <- disgust %>%
  select(pathogen1:pathogen7) %>%
  psych::alpha()

You often just want to report the raw or standard alpha.

pathogen_alpha$total$raw_alpha
#> [1] 0.7379495
pathogen_alpha$total$std.alpha
#> [1] 0.7441191

The item.stats section gives you some descriptive statistics about each item (see ?psych::alpha for more info).

pathogen_alpha$item.stats
n raw.r std.r r.cor r.drop mean sd
pathogen1 19668 0.6035673 0.6264342 0.5293053 0.4530070 4.443156 1.457303
pathogen2 19683 0.6371098 0.6287978 0.5371302 0.4610671 3.254026 1.745723
pathogen3 19687 0.6489288 0.6599706 0.5842526 0.4917248 3.171941 1.616815
pathogen4 19683 0.6217909 0.6168016 0.5156972 0.4397615 3.657573 1.755883
pathogen5 19678 0.6384423 0.6657726 0.5920388 0.4997309 4.282397 1.435962
pathogen6 19655 0.6120328 0.5923052 0.4785793 0.4119616 3.808751 1.878335
pathogen7 19692 0.6331624 0.6061906 0.5020210 0.4326954 3.493500 1.930726

You can look at the alpha.drop section to see what the alpha would be if you dropped an item. If we add an item from the moral disgust scale, you can see that the alpha increases if you drop it.

mixed_alpha <- disgust %>%
  select(pathogen1:pathogen7, moral1) %>%
  psych::alpha()

select(mixed_alpha$alpha.drop, raw_alpha, std.alpha)
raw_alpha std.alpha
pathogen1 0.6776276 0.6873220
pathogen2 0.6770689 0.6903431
pathogen3 0.6693926 0.6799724
pathogen4 0.6787991 0.6911267
pathogen5 0.6704759 0.6776676
pathogen6 0.6860804 0.6980331
pathogen7 0.6823258 0.6950028
moral1 0.7379495 0.7441191

4.27 CSS

Cascading Style Sheet: A system for controlling the visual presentation of HTML in web pages.

R Markdown scripts are often knit to HTML. Therefore, you can also include HTML (a way to semantically tag information) and CSS in your scripts.

Here is an example of some simple CSS (in the style tag) and HTML.

<style>
  h3 { color: red; }
  .ferret { width: 200px; height: 150px; }
</style>

<h3>Red Title</h3>

<img src="images/darwin.jpg" 
     title="The cutest ferret" 
     class="ferret">

Resources

4.28 CSV

Comma-separated variable: a file type for representing data where each variable is separated from the next by a comma.

Comma-separated data:

id, name, age
1, Lisa, 45
2, Robbie, 14