9  Data Wrangling

Intended Learning Outcomes

9.1 Functions used

  • built-in (you can always use these without loading any packages)
  • tidyverse (you can use all these with library(tidyverse))
    • readr:: read_csv()
    • tibble:: tibble()
    • dplyr:: count(), filter(), arrange(), mutate(), summarise(), group_by(), ungroup(), rename(), relocate(), slice_max(), bind_rows(), drop_na(), case_when(), na_if(), across(), row_number(), min_rank()
    • tidyr:: pivot_wider()
    • stringr:: str_detect()
    • tidyselect:: starts_with(), ends_with(), contains(), matches(), num_range(), all_of(), any_of(), everything(), where()
  • other (you need to load each package to use these)
    • readxl:: read_xlsx()
    • janitor:: clean_names(), round_half_up()

9.2 Set-up

Tip
  1. Open your ADS-2026 project
  2. Download the Data transformation cheat sheet
  3. Download a data file into the “data” folder
  4. Create a new quarto file called 09-wrangle.qmd
  5. Update the YAML header
  6. Replace the setup chunk with the one below
```{r}
#| label: setup
#| include: false

library(tidyverse) # for data wrangling
library(readxl)    # read excel files
library(janitor)   # clean up data tables
```
Warning

At this point in ADS it’s very likely you will have the most recent version of the tidyverse packages, but just in case, be aware that the case_when() function we use in this chapter was updated to introduce new arguments and new ways of handling NA. The code in this chapter will only work if you have v1.1.0 or above of dplyr installed. To check, run packageVersion("dplyr"); if it’s below 1.1.0, install dplyr again to update it.

9.3 Wrangling functions

Data wrangling refers to the process of cleaning, transforming, and restructuring your data to get it into the format you need for analysis, and it’s something you will spend an awful lot of time doing. Most data wrangling involves the reshaping functions you learned in Chapter 8 and six functions from the dplyr package, which is loaded as part of the tidyverse: select, filter, arrange, mutate, summarise, and group_by. We covered the last two in detail in Chapter 4, and the others have already come up several times, so we’ll focus on more advanced data wrangling.

It’s worth highlighting that in this chapter we’re going to cover these common functions and their most common uses. However, dplyr (and packages beyond it) has a huge number of additional wrangling functions, and each function has many different arguments. Essentially, if you think you should be able to wrangle your data in a particular way that we haven’t explicitly shown you, you almost certainly can; it might just take a bit of Googling to find out how.

We’ll use data from the United Nations SDG Indicators Database on sustainable development goal 11.2.1: Proportion of population that has convenient access to public transport, by sex, age and persons with disabilities. You can access metadata about this data to find out more.

Tip

Load the data into an object:

transport <- readxl::read_xlsx("data/SDG-11.2.1.xlsx")  |>
  janitor::clean_names()

9.3.1 Explore

After you load the data, use glimpse(transport) or View(transport) to get familiar with the data.

Rows: 2,473
Columns: 25
$ goal               <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ target             <chr> "11.2", "11.2", "11.2", "11.2", "11.2", "11.2", "11…
$ indicator          <chr> "11.2.1", "11.2.1", "11.2.1", "11.2.1", "11.2.1", "…
$ series_code        <chr> "SP_TRN_PUBL", "SP_TRN_PUBL", "SP_TRN_PUBL", "SP_TR…
$ series_description <chr> "Proportion of population that has convenient acces…
$ geo_area_code      <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 12, 12, 12, 12,…
$ geo_area_name      <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ time_period        <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ value              <chr> "77.5", "69.5", "43.1", "82.4", "19.1", "44.2", "56…
$ time_detail        <chr> "2020", "2020", "2020", "2020", "2020", "2020", "20…
$ time_coverage      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ upper_bound        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lower_bound        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ base_period        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ source             <chr> "UN-Habitat Urban Indicators Database", "UN-Habitat…
$ geo_info_url       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ foot_note          <chr> "Only public transport stops which are mapped are i…
$ age                <chr> "ALLAGE", "ALLAGE", "ALLAGE", "ALLAGE", "ALLAGE", "…
$ cities             <chr> "AF_CHARIKAR", "AF_FARAH", "AF_KANDAHAR", "AF_KHAN_…
$ custom_breakdown   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ disability_status  <chr> "_T", "_T", "_T", "_T", "_T", "_T", "_T", "_T", "_T…
$ nature             <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E", "…
$ reporting_type     <chr> "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "…
$ sex                <chr> "BOTHSEX", "BOTHSEX", "BOTHSEX", "BOTHSEX", "BOTHSE…
$ units              <chr> "PERCENT", "PERCENT", "PERCENT", "PERCENT", "PERCEN…

It is useful to use the count() function to explore data, especially where there are too many rows or columns to view on one screen.

count(transport, age, sex)
age sex n
60+ BOTHSEX 4
ALLAGE BOTHSEX 2409
ALLAGE FEMALE 30
ALLAGE MALE 30

It looks like most of the data is for all ages and both sexes, so if we want to look at gender disparities in access to public transport, we will need to narrow down the data.

Tip

Use count() (or other functions) to figure out how many years and how many countries are represented in the data.


# two ways to get the number of unique values in a column
n_countries <- transport |> count(geo_area_name) |> nrow()
n_years <- transport$time_period |> unique() |> length()

There are also several columns that, judging from the glimpse, seem to have no data. We can use count() to confirm that these columns are entirely NA: for example, if we count the column time_coverage, we can see that all 2473 values are NA.

count(transport, time_coverage)
time_coverage n
NA 2473

Other columns seem to have the exact same value for all rows. For example, all 2473 values of goal are equal to 11.

count(transport, goal)
goal n
11 2473

9.3.2 Select

You can select a subset of the columns (variables) in a table to make it easier to view or to prepare a table for display. You can also select columns in a new order.

9.3.2.1 By name or index

You can select columns by name or number (which is sometimes referred to as the column index). Selecting by number can be useful when the column names are long or complicated.

# select single column by name
value_col <- transport |> select(value) 

# select single column by number
value_col <- transport |> select(9) 

You can select each column individually, separated by commas (e.g., age, sex) but you can also select all columns from one to another by separating them with a colon (e.g., goal:value).

The colon notation can be much faster because you don’t need to type out each individual variable name, but make sure that you know what order your columns are in and always check the output to make sure you have selected what you intended.

# select columns individually
country_year <- transport |> 
  select(geo_area_name, time_period, value)

# select columns with colon
country_year <- transport |> 
  select(geo_area_name:value)

You can rename columns at the same time as selecting them by setting new_name = old_col.

country_year <- transport |>
  select(country = geo_area_name, 
         year = time_period, 
         value)

head(country_year, 2)
country year value
Afghanistan 2020 77.5
Afghanistan 2020 69.5

9.3.2.2 Un-selecting columns

You can select columns either by telling R which ones you want to keep as in the previous examples, or by specifying which ones you want to exclude by using a minus symbol to un-select columns. You can also use the colon notation to de-select columns, but to do so you need to put parentheses around the span first, e.g., -(time_coverage:base_period), not -time_coverage:base_period.

# de-select individual columns
transport_clean <- transport |>
  select(-time_coverage, 
         -upper_bound, 
         -lower_bound, 
         -base_period, 
         -geo_info_url, 
         -custom_breakdown)

# de-select a range of columns
transport_clean <- transport |>
  select(-(time_coverage:base_period), 
         -geo_info_url, 
         -custom_breakdown)

9.3.2.3 Select helpers

Finally, you can select columns based on criteria about the column names.

function definition
everything() all columns
starts_with() select columns that start with a character string
ends_with() select columns that end with a character string
contains() select columns that contain a character string
matches() select columns with a name that matches a regular expression pattern
num_range() select columns with a common prefix and a number in a range, e.g., x1, x2, x3
where() select columns where the function returns TRUE
all_of() select all of the columns in a vector
any_of() select any of the columns in a vector that exist in the table
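The transport data has no numbered columns, so here is a quick sketch of num_range() and matches() on a small made-up tibble (the column names q1, q2, etc. are invented for illustration):

```r
library(tidyverse)

# hypothetical table with numbered question columns
scores <- tibble(q1 = 1, q2 = 2, q3 = 3, q10 = 10, note = "a")

# select columns with prefix "q" and a number in 1:3 (q1, q2, q3 but not q10)
scores |> select(num_range("q", 1:3))

# regex alternative: any column named "q" followed by digits
scores |> select(matches("^q\\d+$"))
```

As always with select helpers, check the output: num_range("q", 1:3) keeps only q1 to q3, while the regex also picks up q10.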

Below, we make a vector of column names for columns that have only one value.

one_val_cols <- c(
  "goal",
  "target",
  "indicator",
  "series_code",
  "series_description",
  "time_coverage",
  "upper_bound",
  "lower_bound",
  "base_period",
  "geo_info_url",
  "custom_breakdown",
  "reporting_type",
  "units"
)

Now we can use this vector to unselect these columns.

transport_clean <- transport |>
  select(-all_of(one_val_cols))

Rather than typing out all of the column names, you can use functions to determine which columns have only one value and get a vector of their names. This is probably overkill for small tables, but invaluable if you have a table with hundreds of columns, or if you need to do this to many different tables.

n_values <- transport |> 
  summarise(across(
    .cols = everything(),
    .fns = \(col) unique(col) |> length()
  ))
one_val <- n_values == 1
one_val_cols <- one_val[, one_val] |>
  names()

  • .cols = everything() selects all columns
  • .fns takes an anonymous function (\(col)) that makes a vector of the unique values for each column and returns its length
  • n_values == 1 checks which counts are equal to 1
  • one_val[, one_val] chooses the columns where this is TRUE, and names() gets the names of these columns
Tip

What are the resulting columns for these four examples? Try and guess the answer from the code without looking at the output (but then use the output to check your answer if you get it wrong)

  • transport |> select(contains("_area_"))
  • transport |> select(where(is.numeric))
  • transport |> select(starts_with("geo_"))
  • transport |> select(ends_with("_bound"))

9.3.3 Rename and relocate

Whilst you can rename and reorder columns using select(), sometimes it is easier to use the rename() and relocate() functions.

transport_clean <- transport |>
  select(-all_of(one_val_cols)) |>
  rename(year = time_period,
         country = geo_area_name) |>
  relocate(age, sex, .after = value) |>
  relocate(geo_area_code, .before = cities)

9.3.4 Filter

Whilst select() chooses the columns you want to retain, filter() chooses the rows to retain by matching row or column criteria.

You can filter by a single criterion. The criterion can check whether a column’s value matches a character value (e.g., “North”) or a number (e.g., 9003), or it can be any logical expression (e.g., keep all rows where a specific column’s value is larger than some cutoff). The criterion is evaluated for each row, and if the result is FALSE, the row is removed. You can negate a comparison with !=, where ! means “not”.

# select all rows where country is Togo
transport_clean |> filter(country == "Togo")

# select all rows where country is NOT Togo
transport_clean |> filter(country != "Togo")

# select all rows where value was more than 99
transport_clean |> filter(value > 99)

# select rows where sex is FEMALE or MALE
transport_clean |> filter(sex %in%  c("FEMALE", "MALE"))
Warning

Remember to use == and not = to check if two things are equivalent. A single = assigns the right-hand value to the left-hand variable (much like the <- operator).

Tip

Which IDs are kept from the table below?

  • filter(class_tbl, score < 80)
  • filter(class_tbl, grade == "A")
  • filter(class_tbl, grade != "A")
  • filter(class_tbl, score == 91)

You can also select on multiple criteria by separating them by commas (rows will be kept if they match all criteria). Additionally, you can use & (“and”) and | (“or”) to create complex criteria.

# only data from Togo in 2020
togo_2020 <- transport_clean |> 
  filter(
    country == "Togo",
    year == 2020
  )

# the same as above, using & instead of a comma
togo_2020 <- transport_clean |> 
  filter(
    country == "Togo" &
    year == 2020
  )

# values under 10 or over 90
extreme_values <- transport_clean |>  
  filter( value < 10 | value > 90 )

If you want the filter to retain multiple specific values in the same variable, the “match operator” (%in%) should be used rather than | (or). The ! can also be used in combination here, but it is placed before the variable name.

islands <- c("Solomon Islands", "Northern Mariana Islands")

only_islands <- transport_clean |>
  filter(country %in% islands)

no_islands <- transport_clean |>
  filter(!country %in% islands)
Operator Name is TRUE if and only if
A < B less than A is less than B
A <= B less than or equal A is less than or equal to B
A > B greater than A is greater than B
A >= B greater than or equal A is greater than or equal to B
A == B equivalence A exactly equals B
A != B not equal A does not exactly equal B
A %in% B in A is an element of vector B

Finally, you can pass many other functions to filter(). For example, the stringr package, loaded as part of the tidyverse, contains many functions for working with strings (character data). Here, we use str_detect() to retain only the rows where the country name contains the word “Island”.

transport_clean |>
  filter(str_detect(country, "Island")) |>
  select(1:6)
country year value age sex time_detail
Solomon Islands 2020 7.1 ALLAGE BOTHSEX 2020
Northern Mariana Islands 2020 7.8 ALLAGE BOTHSEX 2020

Note that str_detect() is case sensitive, so it would not match “island” or “ISLAND”. If you need case-insensitive matching, use tolower() or toupper() to convert the string to lowercase or uppercase before you search for the substring.
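A quick sketch of the difference, using made-up place names rather than the transport data:

```r
library(tidyverse)

# hypothetical mixed-case country names
places <- tibble(
  country = c("Faroe Islands", "CANARY ISLANDS", "canary islands", "Iceland")
)

# case-sensitive: only matches the exact string "Island"
places |> filter(str_detect(country, "Island"))

# case-insensitive: lowercase the column before matching
places |> filter(str_detect(tolower(country), "island"))
```

The first filter keeps only “Faroe Islands”; lowercasing first also catches the all-caps and all-lowercase variants (but still not “Iceland”, which doesn’t contain the substring).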

Warning

filter() is incredibly powerful and can allow you to select very specific subsets of data. But it is also quite dangerous, because when you start combining multiple criteria and operators, it’s very easy to accidentally specify something slightly different from what you intended. Always check your output. If you have a small dataset, you can eyeball it to see if it looks right. With a larger dataset, you may wish to compute summary statistics or count the number of groups/observations in each variable to verify your filter is correct. There is no level of expertise in coding that can substitute for knowing and checking your data.

Let’s narrow down our dataset to just the rows that report values aggregated by age, but separate for sex.

transport_filtered <- transport_clean |> 
  filter(sex %in% c("FEMALE", "MALE"),
         age == "ALLAGE")

What check would you do to make sure the filtering above worked as intended?

# this should produce only two rows, 
# with MALE and FEMALE as sex 
# and ALLAGE as age
count(transport_filtered, sex, age)
sex age n
FEMALE ALLAGE 30
MALE ALLAGE 30

9.3.5 Arrange

You can sort your dataset using arrange(). You will find yourself needing to sort data in R much less than you do in Excel, since you don’t need to have rows next to each other in order to, for example, calculate group means. But arrange() can be useful when preparing data for display in tables. arrange() works on character data where it will sort alphabetically, as well as numeric data where the default is ascending order (smallest to largest). Reverse the order using desc().

# arranging the table 
# first by country in alphabetical order
# then by year in reverse order
transport_arranged <- transport_filtered |>
  arrange(country, desc(year))
Tip

If you want to sort character data/categories in a specific order, turn the column into a factor and set the levels in the desired order.

transport_factor <- transport_filtered |>
  mutate(sex = factor(sex, levels = c("MALE", "FEMALE")))
transport_factor |>
  filter(year %in% 2001:2003) |>
  select(country, year, sex) |>
  arrange(year, sex)
country year sex
Uzbekistan 2001 MALE
Uzbekistan 2001 FEMALE
Uzbekistan 2002 MALE
Uzbekistan 2002 FEMALE
Uzbekistan 2003 MALE
Uzbekistan 2003 FEMALE

9.3.6 Mutate

The function mutate() allows you to add new columns, or overwrite existing ones, using the syntax new_column = operation. You can create more than one column in the same mutate() call by separating the definitions with commas, and once you make a new column, you can use it in further column definitions. For example, the following code creates a new column called urban, which is TRUE if the string “Urban average” is detected in the foot_note column, and fixes the data type of value by replacing it with the numeric version.

transport_data <- transport_factor |>
  mutate(
    urban = str_detect(foot_note, "Urban average"),
    value = as.numeric(value)
  )

mutate() can also be used in conjunction with other functions and Boolean operators. For example, we can add another column to transport_data that states whether the value is above the median value. Just like when we used Boolean expressions with filter, it will evaluate the equation and return TRUE or FALSE depending on whether the observation meets the criteria.

transport_data <- transport_factor |>
  mutate(
    urban = str_detect(foot_note, "Urban average"),
    value = as.numeric(value),
    above_median = value > median(value)
  )
Warning

You can overwrite a column by giving a new column the same name as the old column (as we did with value above). Make sure that you mean to do this and that you aren’t trying to use the old column’s values after you redefine it.

You can also use case_when() to specify what values to return, rather than defaulting to TRUE or FALSE:

transport_data <- transport_factor |>
  mutate(
    urban = str_detect(foot_note, "Urban average"),
    value = as.numeric(value),
    category = case_when(
      value < 70 ~ "low",
      value < 85 ~ "average",
      value >= 85 ~ "high",
      .default = NA
    )
  )

The case_when() function allows you to set multiple criteria, although we’re only using one non-default criterion here. It can be very useful, but takes a little practice.

You can also use case_when() to combine different criteria:

# use different cutoffs for urban and non-urban areas
transport_data <- transport_factor |>
  mutate(
    urban = str_detect(foot_note, "Urban average"),
    value = as.numeric(value),
    category = case_when(
      urban & value < 75 ~ "low",
      urban & value < 85 ~ "average",
      urban & value >= 85 ~ "high",
      !urban & value < 20 ~ "low",
      !urban & value < 60 ~ "average",
      !urban & value >= 60 ~ "high",
      .default = NA
    )
  ) |>
  select(-foot_note) # get rid of very long data no longer used
Tip

Be mindful that .default uses = whilst the others use ~. Emily has lost quite a lot of her time and sanity to not realising this.

Just like filter(), mutate() is incredibly powerful and the scope of what you can create is far beyond what we can cover in this book.

9.3.7 Summarise

You were introduced to summarise() in Section 4.4. This chapter uses it as part of a wider wrangling workflow rather than re-teaching it.

Key points to remember:

  • summarise() returns a new table. Without groups, it returns one row. With groups, it returns one row per group.
  • Name your summaries on the left (for example, mean = mean(value)), so the output columns are readable.
  • Missing values propagate unless you handle them (for example, mean(value, na.rm = TRUE)).

Example (overall summary):

transport_data |>
  summarise(
    mean = mean(value, na.rm = TRUE),
    min  = min(value, na.rm = TRUE),
    max  = max(value, na.rm = TRUE)
  )
mean min max
68.60333 9.5 87.6

9.3.8 Group By

You were introduced to grouping in Section 4.5 (and used it with summarise() in Chapter 4). Here is the reminder version.

Key points to remember:

  • Use group_by() when you want grouping to persist across multiple steps in a pipeline.
  • Use .by = … inside a single function call (for example, summarise(.by = …)) when you only need grouping for that one operation.
  • If you use group_by(), ungroup() once you are done, unless you intentionally want subsequent steps to stay grouped.
  • .by is not the same as by. by = … will be treated as a new column name, not a grouping instruction.
# One-off grouping (recommended when you only need grouping once)
transport_data |>
  summarise(
    .by  = sex,
    mean = mean(value, na.rm = TRUE),
    min  = min(value, na.rm = TRUE),
    max  = max(value, na.rm = TRUE)
  )
sex mean min max
FEMALE 69.32667 17.3 87.6
MALE 67.88000 9.5 87.1

# Persistent grouping (useful when you want several grouped operations)
transport_data |>
  group_by(sex) |>
  summarise(max = max(value, na.rm = TRUE)) |>
  ungroup()
sex max
MALE 87.1
FEMALE 87.6

Tip

How would you find out the maximum value for each country, separately by sex?
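One way to answer the tip above, sketched on a toy stand-in for transport_data (the countries and values here are made up): pass both grouping columns to .by.

```r
library(tidyverse)

# toy stand-in for transport_data (hypothetical values)
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  sex     = c("FEMALE", "FEMALE", "MALE", "FEMALE", "MALE", "MALE"),
  value   = c(10, 20, 15, 30, 25, 5)
)

# one row per country/sex combination, with the maximum value for each
toy |>
  summarise(max = max(value, na.rm = TRUE), .by = c(country, sex))
```

The same pattern works with the real data: swap toy for transport_data.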

You can also use group_by() or by in combination with other functions. For example, slice_max() returns the top N rows, ordered by a specific variable.

# return top 2 values 
transport_data |>
  slice_max(n = 2, order_by = value)
country year value age sex time_detail source geo_area_code cities disability_status nature urban category
Uzbekistan 2022 87.6 ALLAGE FEMALE 2022 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high
Uzbekistan 2019 87.3 ALLAGE FEMALE 2019 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high

But this can be combined with by to return the top values for each sex:

# return top 2 values per sex
transport_data |>
  slice_max(n = 2, order_by = value, by = sex)
country year value age sex time_detail source geo_area_code cities disability_status nature urban category
Uzbekistan 2022 87.6 ALLAGE FEMALE 2022 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high
Uzbekistan 2019 87.3 ALLAGE FEMALE 2019 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high
Uzbekistan 2022 87.1 ALLAGE MALE 2022 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high
Uzbekistan 2019 86.8 ALLAGE MALE 2019 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high
Uzbekistan 2023 86.8 ALLAGE MALE 2023 National Statistics Committee of the Republic of Uzbekistan 860 NOCITI _T C TRUE high

9.4 Complications

9.4.1 Rounding

Let’s say we want to round all the values to the nearest percent and add a % sign.

transport_data |>
  mutate(pcnt = round(value) |> paste0("%")) |>
  filter(str_detect(value, "\\.5")) |>
  select(value, pcnt)
value pcnt
9.5 10%
65.5 66%
79.5 80%
84.5 84%

If you look at the values above, you can see that the first three round up from .5, and the last rounds down. What’s going on!?

This may seem like a mistake, but R rounds .5 to the nearest even number, rather than always up, like you were probably taught in school. This prevents overestimation biases, since x.5 is exactly halfway between x and x+1, so there is no reason it should always round up.

round(0.5)
round(1.5)
[1] 0
[1] 2

However, this might throw a monkey wrench into your own systems. For example, our school policy is to round up for course marks at x.5. The easiest solution is to use the round_half_up() function from the package janitor.

This should work as you’d expect.

round_half_up(0.5)
round_half_up(1.5)
[1] 1
[1] 2

9.4.2 Missing values

If you have control over your data, it is always best to keep missing values as empty cells rather than denoting missingness with a word or implausible number. If you used “missing” rather than leaving the cell empty, the entire variable would be read as character data, which means you wouldn’t be able to perform mathematical operations like calculating the mean. If you use an implausible number (0 or 999 are common), then you risk these values being included in any calculations as real numbers.

However, we often don’t have control over how the data come to us, so let’s run through how to fix this.

9.4.2.1 Bad missing values

First, we’re going to recode the data to add in the missing values. We can use case_when() to set the value to “missing” if the year is 2000 and country is “Uzbekistan”, otherwise use the original value. You can’t combine number and character data types in the same column, so you need to convert the default value into a character (this is your first sign that this isn’t a good idea).

missing_bad <- transport_data |>
  mutate(value = case_when(
    # set to missing when year is 2000 and country is Uzbekistan
    year == 2000 & country == "Uzbekistan" ~ "missing", 
    # otherwise, set to the existing value
    .default = as.character(value)
  ))

Now, if you try to compute the mean value, you will get a warning message and the result will be NA.

# try to compute mean value
missing_bad |>
  summarise(mean = mean(value))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `mean = mean(value)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
mean
NA

9.4.2.2 Convert missing values to NA

To set the missing values to NA, we can use the handy function na_if(). We’ll also need to transform value back to numeric.

missing_data <- missing_bad |>
  mutate(
    # if value = "missing" set as NA
    value = na_if(value, "missing"),
    # convert to numeric
    value = as.numeric(value)
  )

Now, if we try to calculate the mean value for each country, we get missing values for any summary value that contains an NA.

missing_data |>
  summarise(
    mean = mean(value),
    .by = country
  )
country mean
Bahrain 16.72500
Malaysia 56.01667
Uzbekistan NA

9.4.2.3 Ignore missing values

The reason that summarise() returns NA even if there’s a single missing value is because NA basically means “I don’t know”, and the sum of 100 and “I don’t know” is “I don’t know”, not 100. However, when you’re calculating means, you often want to just ignore missing values. Set na.rm = TRUE in the summary function to remove missing values before calculating.

missing_data |>
  summarise(
    mean = mean(value, na.rm = TRUE),
    .by = country
  )
country mean
Bahrain 16.72500
Malaysia 56.01667
Uzbekistan 79.90227

9.4.2.4 Count missing values

If you want to find out how many missing or non-missing values there are in a column, use the is.na() function to get a logical vector of whether or not each value is missing, and use sum() to count how many values are TRUE or mean() to calculate the proportion of TRUE values.

missing_data |>
  summarise(
    n_valid = sum(!is.na(value)),
    n_missing = sum(is.na(value)),
    prop_missing = mean(is.na(value)),
    .by = country
  )
country n_valid n_missing prop_missing
Bahrain 8 0 0.0000000
Malaysia 6 0 0.0000000
Uzbekistan 44 2 0.0434783

9.4.2.5 Omit missing values

You may also want to remove rows that have missing values and only work from complete datasets. drop_na() will remove any row that has a missing observation. You can use drop_na() on the entire dataset which will remove any row that has any missing value, or you can specify to only remove rows that are missing a specific value.

# remove any rows with any missing values
complete_data <- missing_data |>
  drop_na()

# remove any rows that are missing a value in the value column
complete_values <- missing_data |>
  drop_na(value)
Warning

If you use drop_na() without specifying columns, make absolutely sure that is what you want to do. There are often columns in a dataset that you don’t care about, such as a “notes” column, that contain NAs, and these will be removed unless you specify which columns are the important ones.

Missing data can be quite difficult to deal with depending on how it is represented. As always, no amount of coding expertise can make up for not understanding the structure and idiosyncrasies of your data.

9.5 Peer Coding Exercises

If you are enrolled on PSYCH1012 Applied Data Skills, do not do these exercises until the Tuesday workshop.

Tip

Scenario: Michael has decided that Dunder Mifflin needs an “SDG progress dashboard” for a corporate visit. Dwight has volunteered you to build it, using publicly available data.

  1. Decide who is going to start as the Driver and Navigator. Remember to switch every so often. The Navigator will find it helpful to have a copy of these instructions open and read the next step to the Driver.

  2. Open RStudio, ensure the environment is clear, and restart R.

  3. Create a new quarto document and save it as ads-week9-wrangle.qmd

  4. In the set-up chunk, load the tidyverse and janitor, then load the three datasets below from Our World in Data (OWID). These map loosely onto:

    • SDG 7 (Affordable and Clean Energy): renewables share; access to electricity
    • SDG 13 (Climate Action): CO₂ emissions per capita
library(tidyverse)
library(janitor)

renewables <- read_csv("https://ourworldindata.org/grapher/share-electricity-renewables.csv")
electricity_access <- read_csv("https://ourworldindata.org/grapher/share-of-the-population-with-access-to-electricity.csv")
co2_pc <- read_csv("https://ourworldindata.org/grapher/co-emissions-per-capita.csv")
  5. Use your method of choice to inspect each dataset (glimpse(), summary(), str()), and confirm the column names and types.

9.5.1 Task 1

Goal: combine the three indicators into one long (“tidy”) dataset with a consistent structure.

  1. For each dataset:

    • clean the column names with clean_names()

    • rename the measurement column to value

    • add a column called indicator with one of:

      • "renewables_share_electricity"
      • "access_to_electricity"
      • "co2_per_capita"
  2. Bind the three datasets into a single object called sdg_long.

Your target structure for sdg_long is:

  • entity (country or region name)
  • code (country code, often ISO3, sometimes missing)
  • year
  • value
  • indicator

You will usually want mutate(indicator = "...") before bind_rows(), and rename(value = ...) to standardise the measurement column.

Checkpoint:

  • glimpse(sdg_long) shows exactly five columns.
  • count(sdg_long, indicator) returns three rows (one per indicator).

9.5.2 Task 2

Dwight has chosen a shortlist of “strategic markets”:

  • United Kingdom
  • United States
  • India
  • Nigeria
  • Sweden
  1. Create sdg_focus by filtering sdg_long to those countries.
  2. Restrict to years from 2000 onwards.
  3. Arrange the result so it is easy to sanity check (for example, by entity, then indicator, then year).

Checkpoint 2:

  • count(sdg_focus, entity) returns five rows.
  • The earliest year in sdg_focus is 2000 (or later).

9.5.3 Task 3

Michael only cares about “the latest number” for each indicator and country.

  1. Create sdg_latest with one row per entity and indicator, keeping:
  • the latest year available for that country and indicator
  • the corresponding value

You may find it easiest to:

  • group by entity and indicator
  • identify the maximum year
  • keep the row(s) corresponding to that year

Checkpoint 3:

  • count(sdg_latest, indicator) returns three rows.
  • sdg_latest has 15 rows in total (5 countries × 3 indicators), unless missing data forces fewer.

Two common approaches:

  • slice_max(year, n = 1, with_ties = FALSE)
  • filter(year == max(year, na.rm = TRUE))

9.5.4 Task 4

Michael wants a single table with one row per country and one column per indicator.

  1. Start from sdg_latest.
  2. Pivot to wide format so you get columns like:
  • renewables_share_electricity
  • access_to_electricity
  • co2_per_capita

Call the result sdg_dashboard.

  3. Reorder columns so entity comes first, followed by the three indicator columns.
  4. Round sensibly:
  • renewable share and access to electricity to 1 decimal place
  • CO₂ per capita to 2 decimal places

Checkpoint 4 (swap roles):

  • sdg_dashboard has 5 rows and 4 columns.
  • The indicator columns are numeric.

9.5.5 Task 5

Dwight wants “a ranking” to show who is doing best.

  1. Create two ranked tables:

A. Highest renewables share (latest values)
B. Lowest CO₂ per capita (latest values)

For each, show:

  • country
  • value
  • rank (1 = best)

You can use arrange() plus mutate(rank = row_number()), or min_rank().
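On a toy table (made-up values; not the exercise answer), the two approaches look like this:

```r
library(tidyverse)

# hypothetical latest values for three countries
toy <- tibble(country = c("A", "B", "C"), value = c(30, 10, 20))

# sort first, then number the rows: rank 1 = highest value
toy |>
  arrange(desc(value)) |>
  mutate(rank = row_number())

# min_rank() assigns ranks without reordering the rows;
# wrap the variable in desc() so that rank 1 = highest value
toy |>
  mutate(rank = min_rank(desc(value)))
```

Note that min_rank() gives tied values the same rank, whereas row_number() breaks ties by row order.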

9.6 Glossary

term definition
boolean-expression An expression that evaluates to TRUE or FALSE.
data-wrangling The process of preparing data for visualisation and statistical analysis.
factor A data type where a specific set of values are stored with labels.
logical A data type representing TRUE or FALSE values.
operator A symbol that performs some mathematical or comparative process.
string A piece of text inside of quotes.

9.7 Further resources