6 Converting strings to numbers
- Dale Barr (August 18, 2021)
A very common situation is that you want to convert string values ("Almost Always", "Frequently") into numeric values (5, 4, etc) so that we can calculate a score.
The solution I present here will use something called a "lookup table" that matches string values to numbers.
First, let's assume the following (made-up) questionnaire asks four questions about hygiene. When you bathe, how often do you:
- Wash your legs?
- Wash behind your ears?
- Wash between your toes?
- Shampoo your hair?
Let's assume that answers are on a scale made up of the values "Never", "Sometimes", "Frequently", "Always", which we want to assign scores of 0, 1, 2, and 3 respectively.
Here is some (also made-up) data from five participants, stored in a tibble named dat
.
subj_id | wash_legs | wash_ears | wash_toes | shampoo |
---|---|---|---|---|
S01 | Sometimes | Never | Never | Frequently |
S02 | Sometimes | Frequently | Frequently | Always |
S03 | Never | Never | Never | Frequently |
S04 | Always | Always | Sometimes | Always |
S05 | Frequently | Sometimes | Never | Sometimes |
(If you want to make dat
so you can follow along with the example by running the code in R, click the button below.)
library("tidyverse")
dat <- tribble(
~subj_id, ~wash_legs, ~wash_ears, ~wash_toes, ~shampoo,
"S01", "Sometimes", "Never", "Never", "Frequently",
"S02", "Sometimes", "Frequently", "Frequently", "Always",
"S03", "Never", "Never", "Never", "Frequently",
"S04", "Always", "Always", "Sometimes", "Always",
"S05", "Frequently", "Sometimes", "Never", "Sometimes")
This data is in wide format: there is a single row for each participant with data for each question forming the columns. What we are going to do first is to convert this data from wide to long using pivot_longer()
. (You'll see why in a minute.)
dat_long <- dat %>%
pivot_longer(cols = wash_legs:shampoo,
names_to = "question", values_to = "response")
Take a look at dat_long
.
subj_id | question | response |
---|---|---|
S01 | wash_legs | Sometimes |
S01 | wash_ears | Never |
S01 | wash_toes | Never |
S01 | shampoo | Frequently |
S02 | wash_legs | Sometimes |
S02 | wash_ears | Frequently |
S02 | wash_toes | Frequently |
S02 | shampoo | Always |
S03 | wash_legs | Never |
S03 | wash_ears | Never |
S03 | wash_toes | Never |
S03 | shampoo | Frequently |
S04 | wash_legs | Always |
S04 | wash_ears | Always |
S04 | wash_toes | Sometimes |
S04 | shampoo | Always |
S05 | wash_legs | Frequently |
S05 | wash_ears | Sometimes |
S05 | wash_toes | Never |
S05 | shampoo | Sometimes |
Now we only have one variable we need to convert to numeric values (response
) instead of the original four. There is an easy solution: we create a "lookup table" mapping the string values to the numeric values like so.
But first let's check what the unique string values are in the data. Our lookup table must match these exactly or the approach won't work.
dat_long %>%
distinct(response) %>%
pull()
## [1] "Sometimes" "Never" "Frequently" "Always"
This step is important because sometimes these fields can have special characters that you don't normally see when you print out a table. The distinct() %>% pull()
pattern will give you those values in a way that makes them visible. For instance, a student once had values in the data like this:
[1] "Somewhat\nInfrequently" "Somewhat\nFrequently" "Very\nInfrequently"
[4] "Almost\nNever" "Very\nFrequently" "Almost\nAlways"
and the lookup table was constantly failing because the lookup table they created did not have the \n
in the middle of each string. Computers are very literal!
OK now we're ready to create our lookup table where we match these four values to numbers.
lookup <- tribble(
~response, ~score,
"Never", 0,
"Sometimes", 1,
"Frequently", 2,
"always", 3)
As the final step, we inner_join()
the original table dat_long
to lookup
on the variable response
.
joined <- inner_join(dat_long, lookup, by = "response")
IMPORTANT: You should check to make sure that the join worked as intended. The values in the lookup table must exactly match the values in the response
column of dat_long
. It is easy to make a typo in the lookup table, and those values will be lost. An easy test is to make sure that the number of rows in joined
matches the number of rows in dat_long
.
The function stopifnot()
will make our script fail if the stated condition (that both tables have the same number of rows) is not satisfied.
Uh oh. Running it gives Error: nrow(joined) == nrow(dat_long) is not TRUE
. The test failed, because I deliberately included a typo in the lookup table. Can you see it?
always
should be Always
. Capitalization matters!
So let's fix our lookup table and then we should be good to go. Here is the full code for this demonstration:
dat_long <- dat %>%
pivot_longer(cols = wash_legs:shampoo,
names_to = "question", values_to = "response")
## check for hidden values
dat_long %>%
distinct(response) %>%
pull()
lookup <- tribble(
~response, ~score,
"Never", 0,
"Sometimes", 1,
"Frequently", 2,
"Always", 3)
joined <- inner_join(dat_long, lookup, by = "response")
## test whether the number of rows match
stopifnot(nrow(joined) == nrow(dat_long))
You could then use
joined %>%
group_by(subj_id) %>%
summarise(hygiene = sum(score))
to calculate a score for each subject.