7 Belonging 1

7.1 Intended Learning Outcomes

By the end of this chapter you should be able to:

7.2 Walkthrough video

There is a walkthrough video of this chapter available via Zoom We recommend first trying to work through each section of the book on your own and then watching the video if you get stuck, or if you would like more information. This will feel slower than just starting with the video, but you will learn more in the long-run. Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.

7.3 Activity 1: Belonging, engagement, and self-confidence

For the next set of chapters we're going to use data from a questionnaire that measures an important psychological construct, and one that is key to your success at university: belonging.

  • First, take part in this online version of the questionnaire. It only takes a few minutes to complete.
  • Second, read the overview of the scale you just completed, including what the scores mean.
  • Third, read the introduction and discussion of the paper "A sense of belonging at university: student retention, motivation and enjoyment". You can read the method and the results if you're interested although we haven't yet learned about the statistics they use. Note that this paper uses a slightly different scale than the one you did above.
  • Finally, answer the following questions. Please note that your responses will not save in the browser - if you want to save them, make a note of them somewhere.


  1. Which of the following best describes the concept of belongingness in the context of university students?

The concept of belongingness in the context of university students is best described as "The degree to which students feel accepted, valued, and included in their university community". This is because belongingness refers to the sense of being part of a community and feeling valued and accepted within that community. In a university setting, this includes feeling a part of the student body, feeling valued by faculty and peers, and feeling included in university activities and discussions.


  1. According to research, why is fostering a sense of belonging crucial for first-year university students?

Fostering a sense of belonging is crucial for first-year university students because "It can enhance their academic engagement and reduce dropout rates". Research has shown that when students feel a sense of belonging at their university, they are more likely to be engaged with their studies and less likely to drop out. This is likely due to feeling more motivated and supported, and being more likely to seek help when needed.

  1. According to the paper by Pedler et al., which group of students had lower belonging scores?

According to the paper by Pedler et al., first generation students (and those with just one parent who had attended university) had lower belonging scores. First generation students often face unique challenges that can affect their sense of belonging, such as lack of family understanding or support, financial difficulties, and feeling out of place among their peers.

  1. If you wanted to try and increase your sense of belonging at university, what might help?

Sense of belonging at a university often comes from a combination of academic and social integration. Joining a club or society, participating in a study group, or attending a social or extra-curricular event can all contribute to the sense of belongingness by connecting individuals with others who have similar interests, challenges, or goals. These activities provide opportunities for interaction, mutual understanding, shared experiences, and the formation of supportive relationships. These connections can enhance a student's identification with and commitment to the university, thereby strengthening their sense of belonging. Additionally, being part of a group or a community can also provide emotional support and reduce feelings of isolation, both of which can further enhance a sense of belonging.

I KNOW THIS CHAPTER ISN'T SUBTLE.

7.4 Activity 2: New project

  • Open RStudio and make a new project for the Belonging chapters (this week and next week):

7.5 Activity 3: Data files

Once you've done all this, it's time to download the files we need and then upload them to the server.

  • First, download the Belonging data zip file to your computer and make sure you know which folder you saved it in.
  • Then, extract the files and copy them into the main folder of your Belonging project. If you need a refresher on how to do this, refer back to section 2.6.

The zip file contains four files:

  • belong_stub1.Rmd and belong_stub2.Rmd, the stub files you'll complete for this chapter and the next one. Open belong_stub1.Rmd by clicking on it in the Files tab and then edit the heading to add in your GUID and today's date.
  • demographic_data.csv is a data file that contains each participant's anonymous ID, age, gender and level of study. Gender has been coded numerically where woman = 1, man = 2, and non-binary = 3.
  • questionnaire_data is a data file that contains each participant's ID, level of study, and their score on three components of the belonging scale (belonging, engagement, self-confidence).

Now, we need to load the tidyverse and the data files so that we can use them.

  • In code chunk 1, add the code that loads the tidyverse and then imports demographic_data.csv into an object named demographic_data and questionnaire_data.csv into a object named questionnaire_data using read_csv().
library(tidyverse)
demographic_data <- read_csv("demographic_data.csv")
questionnaire_data <- read_csv("questionnaire_data.csv")

7.6 Activity 3: Check the data

As usual, once you've imported the data the first thing you should do is check it.

  • Click on each object and view them in the viewing pane
  • Then run summary() on each object in the console and answer the following questions:

How are men coded in the variable gender?
Which sub-scale has the highest mean score?
How old is the oldest participant in years?
Which variables are included in both datasets?

7.7 Activity 4: Join the datasets

As we've done before, we want to join the two datasets together. However, because there are multiple columns that each dataset has in common, the code needs adapted.

We need to list both participant and level in the by argument so that R knows that these columns are the same data in both objects. Because there's multiple values being passed to a single argument, we use the combine function c() - if you don't use c(), it will only take the first variable into account and will create problems (try it and see what happens).

  • In code chunk 2, add and run the below code to join the datasets.Try removing the c() to see what happens to the table.
full_data <- inner_join(x = demographic_data, # table 1
                        y = questionnaire_data, # table 2
                        by = c("participant", "level")) # all common columns

7.8 Activity 5: Data types

Look carefully at the summary() output for demographic_data. Can you see a problem with gender?

##   participant         gender         age           level          
##  Min.   :   1.0   Min.   :1.0   Min.   :17.00   Length:1000       
##  1st Qu.: 250.8   1st Qu.:1.0   1st Qu.:19.00   Class :character  
##  Median : 500.5   Median :2.0   Median :21.00   Mode  :character  
##  Mean   : 500.5   Mean   :1.7   Mean   :21.16                     
##  3rd Qu.: 750.2   3rd Qu.:2.0   3rd Qu.:23.00                     
##  Max.   :1000.0   Max.   :3.0   Max.   :25.00

Because gender is coded numerically, R has assumed that the data in this column is numeric so it has calculated statistics as if it were a number but it's entirely meaningless (what does a mean gender of 1.7 mean?).

There are several different data types in R and when you load data in, R will do it's best to guess which one each variable is:

  • integer (whole numbers like 1L, -10L, 3000L)
  • double (numbers like -0.223, 10.324, 1e4)
  • character (letters or words like "I love R")
  • factor (a specific set of values are stored with labels using categorical variables like gender or experimental condition)
  • logical (TRUE or FALSE)
  • complex (numbers with real and imaginary parts like 2i)

Integers and doubles are both numeric.

7.8.1 Recoding factors

If you try and use the data that's in gender as it is, it can cause some problems. For example, if you tried to plot a boxplot of the scores by gender, you'd get this mess because as far as R is concerned, you haven't given it a grouping variable, you've given it two continuous variables

ggplot(full_data, aes(x = gender, y = belongingness)) +
  geom_boxplot()
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

Instead, we need to convert that data to a factor, using the mutate() and factor() functions. We could overwrite the existing gender variable or we could add a new variable - let's add a new one.

  • levels specifies the original values in the variable, in the order you want to present them (think about how they'd look on a graph from left to right). We can use this to reorganise our variable so that "Man" will be presented first by stating the order as 2,1,3.
  • labels then specifies the labels you want to use instead of the original values and these must be in the same order as the levels. R will do exactly what you tell it to so if you accidentally tell it to give a label of "Man" to number 1, you'll end up getting your gender scores the wrong way round. This is one of the reasons why it can be safer to create a second variable when recoding so that you can more easily check that the values line up with what they're supposed to be.
  • Add and run the below code to code chunk 3.
full_data <- full_data %>%
  mutate(gender_coded = factor(gender, 
                        levels = c(2,1, 3),
                        labels = c("Man", "Woman", "Non-binary")))
  • Next, redo the boxplot we tried above but use the new variable gender_coded rather than the original variable gender.
ggplot(full_data, aes(x = gender_coded, y = belongingness)) +
  geom_boxplot()

7.9 Activity 6: Demographic info

As usual, let's get some demographic info about our participants. In code chunk 4:

  • Calculate the number of participants of each gender
  • Calculate the number of participants in each level of study
  • Calculate the gender split in each level of study
data %>%
  count(group)
full_data %>%
  count(gender_coded)

full_data %>%
  count(level)

full_data %>%
  count(level, gender_coded)
gender_coded n
Man 500
Woman 400
Non-binary 100
level n
First year 920
Second year 80
level gender_coded n
First year Man 457
First year Woman 373
First year Non-binary 90
Second year Man 43
Second year Woman 27
Second year Non-binary 10

7.10 Activity 7: Filter and select

Our analysis will focus on gender differences in belonging for first year students and as we can see from the demographic counts, we have some second year students in our sample that we need to get rid of.

  • In code chunk 5, create an object named first_year_data that only contains the data from first year students. Your starting dataset should be full_data.
object_name <- dataset %>%
  filter(variable = "value")
first_year_data <- full_data %>%
  filter(level == "First year")

We can also drop the numeric gender column and level as we don't need them anymore. This also gives us a good opportunity to show a useful feature of select() which is that you can rename variables as well as reordering them.

  • Add this code to code chunk 5. It will reorder the dataset, drop the original gender column and level, and rename gender_coded as gender to make it simpler to type.
  • The : notation means that it will select all columns from belonginess to self-confidence so we don't have to type them all out.
first_year_data <- first_year_data %>%
  select(participant, 
         "gender" = gender_coded,
         age,
         belongingness:self_confidence) 

Finally, see if you can string both the filter() and select() code together using pipes (%>%) in one code chunk. You should start with full_data and save it all in an object named first_year_data.

first_year_data <- starting_dataset %>%
  first_function %>%
  second_function
first_year_data <- full_data %>%
  filter(level == "First year")%>%
  select(participant, 
         "gender" = gender_coded,
         age,
         belongingness:self_confidence) 

7.11 Activity 8: Plots

For the final activity, let's make some plots and show you some additional options for how to make them look nice.

7.11.1 Scatterplots

To visualise two continuous variables we can use a scatterplot by calling geom_point(). To use geom_point() you need to specify an x and a y variable.

ggplot(first_year_data, aes(x = belongingness, y = engagement)) +
  geom_point()

To add a line of best fit, add a layer that calls geom_smooth() (instinctively you might expect to need geom_line(), this does something very different and will make your plot look mad, try it). You can specify different methods of drawing the line, for now, we'll choose method = "lm" which stands for "linear model" which basically means draw a straight line.

ggplot(first_year_data, aes(x = belongingness, y = engagement)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

As belonging scores increase, engagement scores . This is an example of a correlation.

In a scatterplot, a positive correlation is visually represented by an upward trend of the data points. This means that as one variable (belongingness, in this case) increases, the other variable (engagement, in this case) also tends to increase. A stronger positive correlation would show the data points clustering closely around a line that slopes upwards from left to right. In contrast, a negative correlation would be represented by a downward trend, meaning that as one variable increases, the other tends to decrease. The data points would cluster around a line sloping downwards from left to right.

7.11.2 Grouped scatterplots

We can also add in a grouping variable to our scatterplot code which will draw different coloured geoms (in this case points and lines) for each level of the grouping variable.

  • color changes the outline (or the border) colour of the geom. It works with geoms like points, lines, and polygons (like the outlines of bars in a bar plot or histogram).
  • fill changes the interior colour of the geom. It works with 'filled' geoms like bars, boxplots, or the interior of polygons.
  • To use colour-blind friendly colours, you add on scale_colour_viridis_d() - colour because it's mapped to colour and d because it's a discrete variable (gender is categorical).
ggplot(first_year_data, aes(x = belongingness, 
                            y = engagement,
                            colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_viridis_d(option = "D")
## `geom_smooth()` using formula = 'y ~ x'

7.11.3 Adding a theme

In addition to adding colour, you can also apply a theme to the entire plot which will apply a preset aesthetic style.

  • The below code adds on theme_minimal() but try removing this line and then typing theme_ to see all the options that are available with the auto-complete. Try a few and see which one your favourite is.
ggplot(first_year_data, aes(x = belongingness, 
                            y = engagement,
                            colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_viridis_d(option = "D") +
  theme_minimal()

7.11.4 Editing axis labels and breaks

There's a couple of ways that you can edit axis labels and breaks. If you just want to edit the overall axis labels, the easiest way is to use the labs() function:

ggplot(first_year_data, aes(x = belongingness, 
                            y = engagement,
                            colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_viridis_d(option = "D") +
  theme_minimal() +
  labs(x = "Mean belonginess score",
       y = "Mean engagement score",
       colour = "Gender",
       title = "Relationship between belonging and engagement",
       subtitle = "First year students")
## `geom_smooth()` using formula = 'y ~ x'

However, if you want to make more fine-grained changes you may need to use the scale_ functions.

  • We don't need to add on another scale function for colour, we can just use the one that's alreedy there and add in the arguments name and labels.
  • breaks changes the breaks of the axis, i.e., the numbers that are presented. We ask it for a sequence (seq) of all the numbers from ` to 7.
ggplot(first_year_data, aes(x = belongingness, 
                            y = engagement,
                            colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_viridis_d(name = "Gender",
                         labels = c("Men", "Women", "Non-binary"),
                         option = "D") +
  theme_minimal() +
  scale_x_continuous(name = "Mean belonginess score",
                     breaks = seq(1:7)) +
  scale_y_continuous(name = "Mean engagement score",
                     breaks = seq(1:7)) +
  labs(title = "Relationship between belonging and engagement",
       subtitle = "First year students")
## `geom_smooth()` using formula = 'y ~ x'

7.11.5 Make your own

In the above example we made a scatterplot to visualise the relationship between belonging and engagement but there are two other relationships we can look at.

  • In code chunk 6, create scatterplots to visualise the relationship between belonginess and self-confidence, and between engagment and self-confidence. Be careful to change all the axis labels as necessary.
# belonging and self-confidence
ggplot(first_year_data, aes(x = belongingness, 
                            y = self_confidence,
                            colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_viridis_d(name = "Gender",
                         labels = c("Men", "Women", "Non-binary"),
                         option = "D") +
  theme_minimal() +
  scale_x_continuous(name = "Mean belonginess score",
                     breaks = seq(1:7)) +
  scale_y_continuous(name = "Mean self-confidence score",
                     breaks = seq(1:7)) +
  labs(title = "Relationship between belonging and self-confidence",
       subtitle = "First year students")
## `geom_smooth()` using formula = 'y ~ x'
# engagement and self-confidence
ggplot(first_year_data, aes(x = engagement, 
                            y = self_confidence,
                            colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_viridis_d(name = "Gender",
                         labels = c("Men", "Women", "Non-binary"),
                         option = "D") +
  theme_minimal() +
  scale_x_continuous(name = "Mean engagement score",
                     breaks = seq(1:7)) +
  scale_y_continuous(name = "Mean self-confidence score",
                     breaks = seq(1:7)) +
  labs(title = "Relationship between engagement and self-confidence",
       subtitle = "First year students")
## `geom_smooth()` using formula = 'y ~ x'

Which relationship appears to be the strongest?

In the context of scatterplots, the strength of the relationship between two variables is typically represented by how closely the points follow a clear trend (either linear or nonlinear). A tighter clustering of points along a line suggests a stronger relationship.

In this case, the scatterplot for "Engagement and Self-confidence" would show the tightest clustering of points along a line or curve, indicating a strong correlation between these two variables. This could be because the more engaged students are in their learning activities, the more confident they feel about their ability to succeed in these tasks.

In contrast, the scatterplots for "Belonging and Engagement" and "Belonging and Self-confidence" show a less tight clustering of points, indicating a weaker relationship (although it's still pretty strong). The points in these scatterplots would be more spread out and not as closely aligned with a trend line or curve.

It's important to note that correlation doesn't imply causation, and these are just observed associations in the data. Further research would be necessary to understand the causal relationships between these variables.

7.12 Finished

Finally, try knitting the file to HTML and remember to make a note of any mistakes you made and how you fixed them or any other useful information you learned. Then save your Markdown, and quit your session RStudio completely.

"Data Skills" was written by Emily Nordmann. It was last built on 2025-01-15.

This book was built by the bookdown R package.