2 Importing data from multiple files

  • Dale Barr (October 30, 2019)

Sometimes you have multiple files, and you want to read them into one big table for analysis. The following code allows you to read in a whole bunch of files from a subdirectory.

We'll call the subdirectory where these files live datadir; your subdirectory may have a different name. If the files are not in a subdirectory, but in the same directory as your script, replace datadir with a full stop, i.e., dir(".", "\\.[Cc][Ss][Vv]$").

The main work is accomplished using the map() function below from the purrr package, which is part of the tidyverse. The function map() applies a function (in this case, read_csv()) to each of the elements in the vector supplied as the first argument.

If you want to run the example to see how things work, you'll need to set up your environment with some sample files.

library("tidyverse")

dir.create("datadir", FALSE)

## create fake data files for three subjects
write_csv(iris %>% slice(1:5), file.path("datadir", "S01.csv"))
write_csv(iris %>% slice(6:10), file.path("datadir", "S02.csv"))
write_csv(iris %>% slice(11:15), file.path("datadir", "S03.csv"))



library("tidyverse")

# "\\.csv$" = find all files ending with csv or CSV
todo <- tibble(filename = dir("datadir", "\\.[Cc][Ss][Vv]$", full.names = TRUE))

all_data <- todo %>%
  mutate(imported = map(filename, read_csv, col_types = "ddddc")) %>%
  unnest(cols = c(imported))

all_data
filename Sepal.Length Sepal.Width Petal.Length Petal.Width Species
datadir/S01.csv 5.1 3.5 1.4 0.2 setosa
datadir/S01.csv 4.9 3.0 1.4 0.2 setosa
datadir/S01.csv 4.7 3.2 1.3 0.2 setosa
datadir/S01.csv 4.6 3.1 1.5 0.2 setosa
datadir/S01.csv 5.0 3.6 1.4 0.2 setosa
datadir/S02.csv 5.4 3.9 1.7 0.4 setosa
datadir/S02.csv 4.6 3.4 1.4 0.3 setosa
datadir/S02.csv 5.0 3.4 1.5 0.2 setosa
datadir/S02.csv 4.4 2.9 1.4 0.2 setosa
datadir/S02.csv 4.9 3.1 1.5 0.1 setosa
datadir/S03.csv 5.4 3.7 1.5 0.2 setosa
datadir/S03.csv 4.8 3.4 1.6 0.2 setosa
datadir/S03.csv 4.8 3.0 1.4 0.1 setosa
datadir/S03.csv 4.3 3.0 1.1 0.1 setosa
datadir/S03.csv 5.8 4.0 1.2 0.2 setosa

In the above example, the filename variable of todo contains the values over which the function read_csv() is to be applied. The argument col_types which we supplied to map() is intended to be passed along to read_csv(). Any additional arguments that you want to pass along to read_csv() can be placed here (e.g., skip = 1).

If there is preprocessing you need to do on each file before reading it in, you can write your own function and call that in place of read_csv().

Sometimes the filename contains metadata (e.g., a string identifying the subject: S01, S02 etc) and you want to pull that out. You can do this by creating a new variable using mutate() on the filename. In the example below, we strip away the path from the filename (datadir/S01.csv to S01.csv) using basename() and then remove the file extension .csv using tools::file_path_sans_ext().

all_data_id <- all_data %>%
  mutate(subj_id = basename(filename) %>% tools::file_path_sans_ext()) %>%
  select(subj_id, Sepal.Length:Species)

all_data_id
subj_id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
S01 5.1 3.5 1.4 0.2 setosa
S01 4.9 3.0 1.4 0.2 setosa
S01 4.7 3.2 1.3 0.2 setosa
S01 4.6 3.1 1.5 0.2 setosa
S01 5.0 3.6 1.4 0.2 setosa
S02 5.4 3.9 1.7 0.4 setosa
S02 4.6 3.4 1.4 0.3 setosa
S02 5.0 3.4 1.5 0.2 setosa
S02 4.4 2.9 1.4 0.2 setosa
S02 4.9 3.1 1.5 0.1 setosa
S03 5.4 3.7 1.5 0.2 setosa
S03 4.8 3.4 1.6 0.2 setosa
S03 4.8 3.0 1.4 0.1 setosa
S03 4.3 3.0 1.1 0.1 setosa
S03 5.8 4.0 1.2 0.2 setosa