Topic 6 Data Wrangling

6.1 Importing data from multiple files

The following code allows you to read in a whole bunch of files from a directory datadir all at once into a big table. If the files are in the same directory as your script, replace datadir with a full stop, i.e., dir(".", "\\.[Cc][Ss][Vv]$").
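A minimal sketch of this pattern, assuming the tidyverse is installed. The temporary directory, the two small CSV files, and the names todo and all_data are made up here so that the example runs on its own; in practice datadir would point at your real data folder.

```r
library(tidyverse)

# Hypothetical setup so the sketch is self-contained: write two small CSV
# files into a temporary directory that stands in for datadir.
datadir <- file.path(tempdir(), "datadir")
dir.create(datadir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, score = c(10, 20)), file.path(datadir, "a.csv"))
write_csv(tibble(id = 3:4, score = c(30, 40)), file.path(datadir, "b.CSV"))

# List every file whose name ends in .csv, whatever the capitalisation.
todo <- tibble(filename = dir(datadir, "\\.[Cc][Ss][Vv]$"))

# Read each file into a list-column, then stack the rows into one table.
# The filename column records which file each row came from.
all_data <- todo %>%
  mutate(imported = map(file.path(datadir, filename), read_csv)) %>%
  unnest(imported)

all_data
```

Keeping filename as a column is a design choice: it makes it easy to recover subject or session identifiers that are encoded in the file names.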

If there is preprocessing you need to do on each file before reading it in, you can write your own function and call that in place of read_csv().
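As an illustration, suppose every file begins with one line of free-text notes above the header row. The sketch below assumes that layout; the function name read_with_notes() and the example files are hypothetical.

```r
library(tidyverse)

# Hypothetical files: one line of notes, then the header, then the data.
datadir <- file.path(tempdir(), "datadir_notes")
dir.create(datadir, showWarnings = FALSE)
write_lines(c("recorded in lab A", "id,score", "1,10", "2,20"),
            file.path(datadir, "s01.csv"))
write_lines(c("recorded in lab B", "id,score", "3,30"),
            file.path(datadir, "s02.csv"))

# Custom reader used in place of read_csv(): skip the notes line and
# fix the column types (i = integer, d = double).
read_with_notes <- function(path) {
  read_csv(path, skip = 1, col_types = "id")
}

all_data <- tibble(filename = dir(datadir, "\\.[Cc][Ss][Vv]$")) %>%
  mutate(imported = map(file.path(datadir, filename), read_with_notes)) %>%
  unnest(imported)

all_data
```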

October 30, 2019. -DB

6.2 Detecting “runs” in a sequence

Let’s say you have a table like the one below, and you want to find the start and end frames of each run of Z amidst a, b, c, d. Here is code that sets up this kind of situation. Don’t worry if you don’t understand this code; just run it to create the example data in runsdata, and have a look at that table.
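The original setup code is not reproduced here. As a small stand-in (not the author's simulation), the sketch below builds a runsdata with a single subject and two trials: trial 1 uses the stimulus sequence printed further down for subject 1, trial 1, and trial 2 is simply that sequence reversed, which is an invention for illustration.

```r
library(tibble)

# Stimulus sequence for subject 1, trial 1, as printed in this section.
seq1 <- rep(c("c", "b", "d", "a", "Z", "a", "b", "d", "c", "Z", "b", "a"),
            times = c(2, 3, 3, 4, 3, 2, 3, 3, 4, 3, 4, 2))

# Stand-in data: trial 2 is the same sequence reversed (made up).
runsdata <- tibble(subject = 1L,
                   trial = rep(1:2, each = length(seq1)),
                   stimulus = c(seq1, rev(seq1)))

runsdata
```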

Let’s say you want to find the start and stop frames where Z appears in stimulus, and do this independently for each combination of subject and trial. Here’s how stimulus looks for subject 1 and trial 1.

##  [1] "c" "c" "b" "b" "b" "d" "d" "d" "a" "a" "a" "a" "Z" "Z" "Z" "a" "a" "b" "b"
## [20] "b" "d" "d" "d" "c" "c" "c" "c" "Z" "Z" "Z" "b" "b" "b" "b" "a" "a"

So here you can see that the first run of Zs is from frame 13 to 15, and the second is from 28 to 30. We want to write a function that processes the data for each trial and results in a table like this:

## # A tibble: 2 x 5
##   subject trial   run start_frame end_frame
##     <dbl> <dbl> <int>       <int>     <int>
## 1       1     1     1          13        15
## 2       1     1     2          28        30

The first thing to do is to add a logical vector to your tibble whose value is TRUE when the target value (e.g., Z) is present at that position in the sequence, and FALSE otherwise.
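A sketch of this step with dplyr::mutate(). The runsdata built here is a small stand-in (one subject, two trials, the second being the first reversed), and the name runsdata_tgt is hypothetical.

```r
library(dplyr)
library(tibble)

# Stand-in data: trial 1 is the sequence shown for subject 1, trial 1;
# trial 2 is that sequence reversed (made up for illustration).
seq1 <- rep(c("c", "b", "d", "a", "Z", "a", "b", "d", "c", "Z", "b", "a"),
            times = c(2, 3, 3, 4, 3, 2, 3, 3, 4, 3, 4, 2))
runsdata <- tibble(subject = 1L,
                   trial = rep(1:2, each = length(seq1)),
                   stimulus = c(seq1, rev(seq1)))

# Flag every frame on which the target stimulus is present.
runsdata_tgt <- runsdata %>%
  mutate(is_target = stimulus == "Z")

runsdata_tgt
```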

## # A tibble: 552 x 4
##    subject trial stimulus is_target
##      <int> <int> <chr>    <lgl>    
##  1       1     1 c        FALSE    
##  2       1     1 c        FALSE    
##  3       1     1 b        FALSE    
##  4       1     1 b        FALSE    
##  5       1     1 b        FALSE    
##  6       1     1 d        FALSE    
##  7       1     1 d        FALSE    
##  8       1     1 d        FALSE    
##  9       1     1 a        FALSE    
## 10       1     1 a        FALSE    
## # … with 542 more rows

We want to iterate over subjects and trials. We’ll start by creating a tibble with one row per combination of subject and trial, in which the column is_target is nested into a list-column called subtbl.
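One way to do the nesting, assuming tidyr 1.0.0 or later (for the nest(new_col = columns) syntax). The data are a small stand-in and the name nested is hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Stand-in data: one subject, two trials (trial 2 is trial 1 reversed).
seq1 <- rep(c("c", "b", "d", "a", "Z", "a", "b", "d", "c", "Z", "b", "a"),
            times = c(2, 3, 3, 4, 3, 2, 3, 3, 4, 3, 4, 2))
runsdata <- tibble(subject = 1L,
                   trial = rep(1:2, each = length(seq1)),
                   stimulus = c(seq1, rev(seq1)))

# Keep only is_target, then pack it into one small tibble per
# subject/trial combination.
nested <- runsdata %>%
  mutate(is_target = stimulus == "Z") %>%
  select(-stimulus) %>%
  nest(subtbl = is_target)

nested
```

Each element of subtbl is a one-column tibble holding the is_target values for a single trial, in frame order.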

We want to iterate over the little subtables stored within subtbl in each row of the table, passing each one to a function that will find the runs and return another table, which we’ll store in a new column. Let’s write a function to detect the runs. That function will need the function rle() (Run-Length Encoding) from base R. We’ll run that on the logical vector we created (is_target). Before creating the function, let’s see what rle() does on the values in is_target for subject 1, trial 1.
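A base R sketch of that check. The vector is typed in directly here (12 FALSE, 3 TRUE, 12 FALSE, 3 TRUE, 6 FALSE) rather than pulled from the nested table, so the block runs on its own.

```r
# is_target for subject 1, trial 1, written out as run lengths.
is_target <- rep(c(FALSE, TRUE, FALSE, TRUE, FALSE),
                 times = c(12, 3, 12, 3, 6))

is_target

# rle() collapses the vector into consecutive runs: lengths gives the
# size of each run, values gives the value repeated within it.
runs <- rle(is_target)
runs

# The cumulative sum of the lengths is the last frame of each run.
cumsum(runs$lengths)
```

Because the end of each run is the cumulative sum of the lengths, the start of each run is that end minus the run's length plus one.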

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## Run Length Encoding
##   lengths: int [1:5] 12 3 12 3 6
##   values : logi [1:5] FALSE TRUE FALSE TRUE FALSE

If that doesn’t make sense, look at the help for rle() (type ?rle in the console). Now we’re ready to write our function, detect_runs().
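The original definition is not shown, so the body below is an assumption: a minimal detect_runs() that takes a tibble with an is_target column and reproduces the run, start_fr and end_fr columns seen in the output. The s1t1 defined here is typed in by hand so the block is self-contained.

```r
library(tibble)

# Sketch of detect_runs(): x is a tibble with a logical column is_target.
detect_runs <- function(x) {
  runs <- rle(x$is_target)                 # lengths and values of all runs
  end_fr <- cumsum(runs$lengths)           # last frame of every run
  start_fr <- end_fr - runs$lengths + 1L   # first frame of every run
  keep <- runs$values                      # TRUE only for target runs
  tibble(run = seq_len(sum(keep)),
         start_fr = start_fr[keep],
         end_fr = end_fr[keep])
}

# is_target for subject 1, trial 1.
s1t1 <- tibble(is_target = rep(c(FALSE, TRUE, FALSE, TRUE, FALSE),
                               times = c(12, 3, 12, 3, 6)))

result <- detect_runs(s1t1)
result
```

This sketch assumes is_target contains no missing values; a trial with no target at all yields a table with zero rows.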

We can test the function on s1t1 just to make sure it works.

## # A tibble: 2 x 3
##     run start_fr end_fr
##   <int>    <int>  <int>
## 1     1       13     15
## 2     2       28     30

OK, now we’re ready to run the function.
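A sketch of this step using purrr::map() inside mutate(). Everything is rebuilt from scratch so the block runs alone: the data are a small stand-in (one subject, two trials), detect_runs() is the assumed implementation based on rle(), and the name with_runs is hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Assumed implementation of detect_runs(), based on rle().
detect_runs <- function(x) {
  runs <- rle(x$is_target)
  end_fr <- cumsum(runs$lengths)
  start_fr <- end_fr - runs$lengths + 1L
  keep <- runs$values
  tibble(run = seq_len(sum(keep)),
         start_fr = start_fr[keep], end_fr = end_fr[keep])
}

# Stand-in data: trial 2 is trial 1 reversed (made up for illustration).
seq1 <- rep(c("c", "b", "d", "a", "Z", "a", "b", "d", "c", "Z", "b", "a"),
            times = c(2, 3, 3, 4, 3, 2, 3, 3, 4, 3, 4, 2))
nested <- tibble(subject = 1L,
                 trial = rep(1:2, each = length(seq1)),
                 is_target = c(seq1, rev(seq1)) == "Z") %>%
  nest(subtbl = is_target)

# Apply detect_runs() to each subtable; the results go in a list-column.
with_runs <- nested %>%
  mutate(runstbl = map(subtbl, detect_runs))

with_runs
```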

## # A tibble: 15 x 4
##    subject trial subtbl            runstbl         
##      <int> <int> <list>            <list>          
##  1       1     1 <tibble [36 × 1]> <tibble [2 × 3]>
##  2       1     2 <tibble [39 × 1]> <tibble [2 × 3]>
##  3       1     3 <tibble [37 × 1]> <tibble [2 × 3]>
##  4       2     1 <tibble [41 × 1]> <tibble [2 × 3]>
##  5       2     2 <tibble [41 × 1]> <tibble [2 × 3]>
##  6       2     3 <tibble [36 × 1]> <tibble [2 × 3]>
##  7       3     1 <tibble [35 × 1]> <tibble [2 × 3]>
##  8       3     2 <tibble [39 × 1]> <tibble [2 × 3]>
##  9       3     3 <tibble [39 × 1]> <tibble [2 × 3]>
## 10       4     1 <tibble [37 × 1]> <tibble [2 × 3]>
## 11       4     2 <tibble [35 × 1]> <tibble [2 × 3]>
## 12       4     3 <tibble [29 × 1]> <tibble [2 × 3]>
## 13       5     1 <tibble [35 × 1]> <tibble [2 × 3]>
## 14       5     2 <tibble [35 × 1]> <tibble [2 × 3]>
## 15       5     3 <tibble [38 × 1]> <tibble [2 × 3]>

Now we just have to unnest and we’re done!
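A self-contained sketch of the whole pipeline ending in the unnest step. As before, the data are a small stand-in, detect_runs() is an assumed implementation, and the name all_runs is hypothetical; dropping subtbl before unnesting is a choice made here to get a tidy five-column result.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Assumed implementation of detect_runs(), based on rle().
detect_runs <- function(x) {
  runs <- rle(x$is_target)
  end_fr <- cumsum(runs$lengths)
  start_fr <- end_fr - runs$lengths + 1L
  keep <- runs$values
  tibble(run = seq_len(sum(keep)),
         start_fr = start_fr[keep], end_fr = end_fr[keep])
}

# Stand-in data: trial 2 is trial 1 reversed (made up for illustration).
seq1 <- rep(c("c", "b", "d", "a", "Z", "a", "b", "d", "c", "Z", "b", "a"),
            times = c(2, 3, 3, 4, 3, 2, 3, 3, 4, 3, 4, 2))

all_runs <- tibble(subject = 1L,
                   trial = rep(1:2, each = length(seq1)),
                   is_target = c(seq1, rev(seq1)) == "Z") %>%
  nest(subtbl = is_target) %>%
  mutate(runstbl = map(subtbl, detect_runs)) %>%
  select(-subtbl) %>%      # the raw frames are no longer needed
  unnest(runstbl)          # one row per run, per subject and trial

all_runs
```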

## # A tibble: 30 x 5
##    subject trial   run start_fr end_fr
##      <int> <int> <int>    <int>  <int>
##  1       1     1     1       13     15
##  2       1     1     2       28     30
##  3       1     2     1       15     17
##  4       1     2     2       31     33
##  5       1     3     1       14     16
##  6       1     3     2       30     32
##  7       2     1     1       17     19
##  8       2     1     2       32     34
##  9       2     2     1       16     18
## 10       2     2     2       34     36
## # … with 20 more rows

October 30, 2019. -DB