3 Multi-Row Headers
- Lisa DeBruine (2021-10-17)
A student on our help forum once asked for help making a wide-format dataset long. When I tried to load the data, I realised the first three rows were all header rows. Here's the code I wrote to deal with it.
First, I'll make a small CSV "file" below. In a typical case, you'd read the data in from a file.
demo_csv <- I("SUB1, SUB1, SUB1, SUB1, SUB2, SUB2, SUB2, SUB2
COND1, COND1, COND2, COND2, COND1, COND1, COND2, COND2
X, Y, X, Y, X, Y, X, Y
10, 15, 6, 2, 42, 4, 32, 5
4, 43, 7, 34, 56, 43, 2, 33
77, 12, 14, 75, 36, 85, 3, 2")
If you try to read in this data, you'll get a message about the duplicate column names and the resulting table will have "fixed" column headers and the next two columns headers as the first two rows.
data <- read_csv(demo_csv)
SUB1...1 | SUB1...2 | SUB1...3 | SUB1...4 | SUB2...5 | SUB2...6 | SUB2...7 | SUB2...8 |
---|---|---|---|---|---|---|---|
COND1 | COND1 | COND2 | COND2 | COND1 | COND1 | COND2 | COND2 |
X | Y | X | Y | X | Y | X | Y |
10 | 15 | 6 | 2 | 42 | 4 | 32 | 5 |
4 | 43 | 7 | 34 | 56 | 43 | 2 | 33 |
77 | 12 | 14 | 75 | 36 | 85 | 3 | 2 |
Instead, you should read in just the header rows by setting n_max
equal to the number of header rows and col_names
to FALSE
.
data_head <- read_csv(demo_csv,
n_max = 3,
col_names = FALSE)
You will get a table that looks like this:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
---|---|---|---|---|---|---|---|
SUB1 | SUB1 | SUB1 | SUB1 | SUB2 | SUB2 | SUB2 | SUB2 |
COND1 | COND1 | COND2 | COND2 | COND1 | COND1 | COND2 | COND2 |
X | Y | X | Y | X | Y | X | Y |
You can then make new header names by pasting together the names in the three rows by summarising across all the columns with the paste()
function and collapsing them using "_". Use unlist()
and unname()
to convert the result from a table to a vector.
new_names <- data_head %>%
summarise(across(.fns = paste, collapse = "_")) %>%
unlist() %>% unname()
new_names
## [1] "SUB1_COND1_X" "SUB1_COND1_Y" "SUB1_COND2_X" "SUB1_COND2_Y" "SUB2_COND1_X"
## [6] "SUB2_COND1_Y" "SUB2_COND2_X" "SUB2_COND2_Y"
Now you can read in the data without the three header rows. Use skip
to skip the headers and set col_names
to the new names.
data <- read_csv(demo_csv,
skip = 3,
col_names = new_names,
show_col_types = FALSE)
SUB1_COND1_X | SUB1_COND1_Y | SUB1_COND2_X | SUB1_COND2_Y | SUB2_COND1_X | SUB2_COND1_Y | SUB2_COND2_X | SUB2_COND2_Y |
---|---|---|---|---|---|---|---|
10 | 15 | 6 | 2 | 42 | 4 | 32 | 5 |
4 | 43 | 7 | 34 | 56 | 43 | 2 | 33 |
77 | 12 | 14 | 75 | 36 | 85 | 3 | 2 |
If you have an excel file that merges the duplicate headers across rows, it's a little trickier, but still do-able.
The first steps is the same: read in the first three rows.
data_head <- read_excel("data/3headers_demo.xlsx",
n_max = 3,
col_names = FALSE)
...1 | ...2 | ...3 | ...4 | ...5 | ...6 | ...7 | ...8 |
---|---|---|---|---|---|---|---|
SUB1 | NA | NA | NA | SUB2 | NA | NA | NA |
COND1 | NA | COND2 | NA | COND1 | NA | COND2 | NA |
X | Y | X | Y | X | Y | X | Y |
The code below starts at the second column and fills in any missing data with the value in the previous column.
for (i in 2:ncol(data_head)) {
prev <- data_head[, i-1]
this <- data_head[, i]
missing <- is.na(this)
this[missing, ] <- prev[missing, ]
data_head[, i] <- this
}
Now you can continue generating the pasted name the same as above.
new_names <- data_head %>%
summarise(across(.fns = paste, collapse = "_")) %>%
unlist() %>% unname()
new_names
## [1] "SUB1_COND1_X" "SUB1_COND1_Y" "SUB1_COND2_X" "SUB1_COND2_Y" "SUB2_COND1_X"
## [6] "SUB2_COND1_Y" "SUB2_COND2_X" "SUB2_COND2_Y"
If your data are set up with multiple headers, you'll probably want to change the shape of the data. Here's a quick example how to use pivot_longer()
and pivot_wider()
to do this with variable names like above.
data <- read_excel("data/3headers_demo.xlsx",
skip = 3,
col_names = new_names)
data_long <- data %>%
# add a row ID column if one doesn't exist already
mutate(trial_id = row_number()) %>%
# make a row for each data column
pivot_longer(
cols = -trial_id, # everything except trial_id
names_to = c("sub_id", "condition", "coord"),
names_sep = "_",
values_to = "val"
) %>%
# make x and y coord columns
pivot_wider(
names_from = coord,
values_from = val
)
trial_id | sub_id | condition | X | Y |
---|---|---|---|---|
1 | SUB1 | COND1 | 0.8316380 | 0.7881552 |
1 | SUB1 | COND2 | 0.3941482 | 0.2056488 |
1 | SUB2 | COND1 | 0.9332829 | 0.1530898 |
1 | SUB2 | COND2 | 0.6189847 | 0.9400281 |
2 | SUB1 | COND1 | 0.4147148 | 0.1366791 |
2 | SUB1 | COND2 | 0.9805130 | 0.7493469 |
2 | SUB2 | COND1 | 0.1048907 | 0.6573472 |
2 | SUB2 | COND2 | 0.9579583 | 0.3430333 |
3 | SUB1 | COND1 | 0.5577673 | 0.0956297 |
3 | SUB1 | COND2 | 0.3045316 | 0.3540656 |
3 | SUB2 | COND1 | 0.3621907 | 0.8460132 |
3 | SUB2 | COND2 | 0.0167339 | 0.1886913 |
4 | SUB1 | COND1 | 0.4326746 | 0.8276863 |
4 | SUB1 | COND2 | 0.2845026 | 0.6236266 |
4 | SUB2 | COND1 | 0.0439374 | 0.5379287 |
4 | SUB2 | COND2 | 0.0712748 | 0.3511542 |
5 | SUB1 | COND1 | 0.6545546 | 0.6501679 |
5 | SUB1 | COND2 | 0.9202481 | 0.2525272 |
5 | SUB2 | COND1 | 0.8117072 | 0.3455603 |
5 | SUB2 | COND2 | 0.7073851 | 0.4249118 |
6 | SUB1 | COND1 | 0.0679236 | 0.6978207 |
6 | SUB1 | COND2 | 0.3979061 | 0.6922928 |
6 | SUB2 | COND1 | 0.5282960 | 0.1093352 |
6 | SUB2 | COND2 | 0.6622162 | 0.5567239 |