2 Importing data from multiple files
- Dale Barr (October 30, 2019)
Sometimes you have multiple files, and you want to read them into one big table for analysis. The following code allows you to read in a whole bunch of files from a subdirectory.
We'll call the subdirectory where these files live datadir
; your subdirectory may have a different name. If the files are not in a subdirectory, but in the same directory as your script, replace datadir
with a full stop, i.e., dir(".", "\\.[Cc][Ss][Vv]$")
.
The main work is accomplished using the map()
function below from the purrr package, which is part of the tidyverse. The function map()
applies a function (in this case, read_csv()
) to each of the elements in the vector supplied as the first argument.
If you want to run the example to see how things work, you'll need to set up your environment with some sample files.
library("tidyverse")
dir.create("datadir", FALSE)
## create fake data files for three subjects
write_csv(iris %>% slice(1:5), file.path("datadir", "S01.csv"))
write_csv(iris %>% slice(6:10), file.path("datadir", "S02.csv"))
write_csv(iris %>% slice(11:15), file.path("datadir", "S03.csv"))
library("tidyverse")
# "\\.csv$" = find all files ending with csv or CSV
todo <- tibble(filename = dir("datadir", "\\.[Cc][Ss][Vv]$", full.names = TRUE))
all_data <- todo %>%
mutate(imported = map(filename, read_csv, col_types = "ddddc")) %>%
unnest(cols = c(imported))
all_data
filename | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
datadir/S01.csv | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
datadir/S01.csv | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
datadir/S01.csv | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
datadir/S01.csv | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
datadir/S01.csv | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
datadir/S02.csv | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
datadir/S02.csv | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
datadir/S02.csv | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
datadir/S02.csv | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
datadir/S02.csv | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
datadir/S03.csv | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
datadir/S03.csv | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
datadir/S03.csv | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
datadir/S03.csv | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
datadir/S03.csv | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
In the above example, the filename
variable of todo
contains the values over which the function read_csv()
is to be applied. The argument col_types
which we supplied to map()
is intended to be passed along to read_csv()
. Any additional arguments that you want to pass along to read_csv()
can be placed here (e.g., skip = 1
).
If there is preprocessing you need to do on each file before reading it in, you can write your own function and call that in place of read_csv()
.
Sometimes the filename contains metadata (e.g., a string identifying the subject: S01
, S02
etc) and you want to pull that out. You can do this by creating a new variable using mutate()
on the filename. In the example below, we strip away the path from the filename (datadir/S01.csv
to S01.csv
) using basename()
and then remove the file extension .csv
using tools::file_path_sans_ext()
.
all_data_id <- all_data %>%
mutate(subj_id = basename(filename) %>% tools::file_path_sans_ext()) %>%
select(subj_id, Sepal.Length:Species)
all_data_id
subj_id | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
S01 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
S01 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
S01 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
S01 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
S01 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
S02 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
S02 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
S02 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
S02 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
S02 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
S03 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
S03 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
S03 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
S03 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
S03 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |