Grouping and Summarising Data
What you’ll build today: subject-level and group-level summary tables that are essential for PMx QC, exploration, and reporting.
Learning Objectives
By the end of this lesson, you will be able to:
- Use group_by() and summarise() to compute subject-level summaries.
- Clearly distinguish between mutate() and summarise().
- Understand how grouping changes the unit of analysis.
- Control grouping behavior with .groups and ungroup().
- Use grouped mutate() for within-subject calculations.
- Build PMx-specific QC tables quickly and reproducibly.
Setup
We’ll work with a familiar PMx-style dataset.
library(tidyverse)
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT,  ~DV, ~WT, ~SEX,
    1,   0.0,     1,  100,   NA,  72,  "F",
    1,   0.5,     0,   NA,  2.1,  72,  "F",
    1,   1.0,     0,   NA,  3.8,  72,  "F",
    1,   2.0,     0,   NA,  3.0,  72,  "F",
    2,   0.0,     1,   80,   NA,  88,  "M",
    2,   0.5,     0,   NA,  1.6,  88,  "M",
    2,   1.0,     0,   NA,  2.9,  88,  "M",
    2,   2.0,     0,   NA,  2.4,  88,  "M"
)
pk
# A tibble: 8 × 7
ID TIME EVID AMT DV WT SEX
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 1 100 NA 72 F
2 1 0.5 0 NA 2.1 72 F
3 1 1 0 NA 3.8 72 F
4 1 2 0 NA 3 72 F
5 2 0 1 80 NA 88 M
6 2 0.5 0 NA 1.6 88 M
7 2 1 0 NA 2.9 88 M
8 2 2 0 NA 2.4 88 M
Key Ideas
Grouping changes the unit of analysis
Without grouping, summaries collapse the entire dataset.
pk %>% summarise(n_rows = n())
# A tibble: 1 × 1
n_rows
<int>
1 8
With grouping, summaries collapse within each group.
pk %>%
  group_by(ID) %>%
  summarise(n_rows = n(), .groups = "drop")
# A tibble: 2 × 2
ID n_rows
<dbl> <int>
1 1 4
2 2 4
Grouping tells R:
“Do this calculation separately for each group (here, each subject).”
In PMx workflows, subject-level summaries are your first line of defense against subtle data issues.
Many modeling problems are first detected in subject-level summary tables — not during estimation.
Always confirm that covariates treated as subject-level are actually constant within subject.
mutate() vs summarise(): A Critical Difference
Understanding this distinction is foundational.
mutate()
- Keeps the same number of rows
- Adds or modifies columns
- Operates within the existing structure
summarise()
- Reduces rows
- Collapses data into fewer rows (often one per group)
- Changes the unit of analysis
Example:
# mutate keeps all rows
pk %>%
  group_by(ID) %>%
  mutate(mean_DV = mean(DV, na.rm = TRUE)) %>%
  ungroup()
# A tibble: 8 × 8
ID TIME EVID AMT DV WT SEX mean_DV
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 2.97
2 1 0.5 0 NA 2.1 72 F 2.97
3 1 1 0 NA 3.8 72 F 2.97
4 1 2 0 NA 3 72 F 2.97
5 2 0 1 80 NA 88 M 2.3
6 2 0.5 0 NA 1.6 88 M 2.3
7 2 1 0 NA 2.9 88 M 2.3
8 2 2 0 NA 2.4 88 M 2.3
Every original row remains.
# summarise collapses rows
pk %>%
  group_by(ID) %>%
  summarise(mean_DV = mean(DV, na.rm = TRUE),
            .groups = "drop")
# A tibble: 2 × 2
ID mean_DV
<dbl> <dbl>
1 1 2.97
2 2 2.3
Now you have one row per subject.
This difference is crucial in PMx workflows.
Use mutate() when you want to add information to each record.
Use summarise() when you want to build a summary table.
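The two verbs are related: a grouped mutate() should give the same per-record column as summarising to one row per subject and joining the result back. The sketch below checks that on the lesson's data. The object names (via_mutate, subject_means, via_join) are illustrative, and pk is rebuilt inside the block only so it runs on its own.

```r
library(dplyr)

# Same values as the lesson's pk, rebuilt here so the block is self-contained
pk <- tibble::tibble(
  ID   = rep(c(1, 2), each = 4),
  TIME = rep(c(0, 0.5, 1, 2), times = 2),
  EVID = rep(c(1, 0, 0, 0), times = 2),
  AMT  = c(100, NA, NA, NA, 80, NA, NA, NA),
  DV   = c(NA, 2.1, 3.8, 3.0, NA, 1.6, 2.9, 2.4),
  WT   = rep(c(72, 88), each = 4),
  SEX  = rep(c("F", "M"), each = 4)
)

# Route 1: grouped mutate() attaches the subject mean to every record
via_mutate <- pk %>%
  group_by(ID) %>%
  mutate(mean_DV = mean(DV, na.rm = TRUE)) %>%
  ungroup()

# Route 2: summarise() to one row per subject, then join back onto the records
subject_means <- pk %>%
  group_by(ID) %>%
  summarise(mean_DV = mean(DV, na.rm = TRUE), .groups = "drop")

via_join <- pk %>%
  left_join(subject_means, by = "ID")

nrow(via_mutate)     # record-level: same number of rows as pk
nrow(subject_means)  # subject-level: one row per ID

# The per-record column should agree between the two routes
all.equal(via_mutate$mean_DV, via_join$mean_DV)
```

The join route is handy when the subject-level table is also needed on its own, for example as a QC listing.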
Worked Example 1: Basic subject-level summaries
subject_summary <- pk %>%
  group_by(ID) %>%
  summarise(
    n_rows = n(),
    n_dose = sum(EVID == 1, na.rm = TRUE),
    n_obs  = sum(EVID == 0, na.rm = TRUE),
    .groups = "drop"
  )
subject_summary
# A tibble: 2 × 4
ID n_rows n_dose n_obs
<dbl> <int> <int> <int>
1 1 4 1 3
2 2 4 1 3
Worked Example 2: Observation time summaries
pk %>%
  filter(EVID == 0) %>%
  group_by(ID) %>%
  summarise(
    time_min = min(TIME),
    time_max = max(TIME),
    .groups = "drop"
  )
# A tibble: 2 × 3
ID time_min time_max
<dbl> <dbl> <dbl>
1 1 0.5 2
2 2 0.5 2
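The same filter-then-summarise pattern extends to concentration summaries. As a minimal sketch, assuming DV holds observed concentrations, the block below computes the highest observed value per subject and the time at which it occurred. The column names cmax and tmax and the object name peak_summary are illustrative choices, not part of the lesson's dataset.

```r
library(dplyr)

# Same values as the lesson's pk, rebuilt here so the block is self-contained
pk <- tibble::tibble(
  ID   = rep(c(1, 2), each = 4),
  TIME = rep(c(0, 0.5, 1, 2), times = 2),
  EVID = rep(c(1, 0, 0, 0), times = 2),
  AMT  = c(100, NA, NA, NA, 80, NA, NA, NA),
  DV   = c(NA, 2.1, 3.8, 3.0, NA, 1.6, 2.9, 2.4),
  WT   = rep(c(72, 88), each = 4),
  SEX  = rep(c("F", "M"), each = 4)
)

# Keep observation records with a recorded DV; a missing DV would make max() return NA
peak_summary <- pk %>%
  filter(EVID == 0, !is.na(DV)) %>%
  group_by(ID) %>%
  summarise(
    cmax = max(DV),              # highest observed DV for the subject
    tmax = TIME[which.max(DV)],  # time of the first occurrence of that maximum
    .groups = "drop"
  )

peak_summary
# ID 1: cmax 3.8 at TIME 1; ID 2: cmax 2.9 at TIME 1
```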
Worked Example 3: Checking covariate consistency
pk %>%
  group_by(ID) %>%
  summarise(
    n_WT  = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  )
# A tibble: 2 × 3
ID n_WT n_SEX
<dbl> <int> <int>
1 1 1 1
2 2 1 1
For true subject-level covariates, every n_distinct() count should equal 1 for every subject.
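On a clean dataset the check returns all ones, so it can help to see what a failure looks like. The sketch below uses a deliberately corrupted copy of the data (pk_bad, a hypothetical object made up for this illustration) in which one record for subject 2 carries a different weight, then filters the summary down to the offending subjects.

```r
library(dplyr)

# Same values as the lesson's pk, rebuilt here so the block is self-contained
pk <- tibble::tibble(
  ID   = rep(c(1, 2), each = 4),
  TIME = rep(c(0, 0.5, 1, 2), times = 2),
  EVID = rep(c(1, 0, 0, 0), times = 2),
  AMT  = c(100, NA, NA, NA, 80, NA, NA, NA),
  DV   = c(NA, 2.1, 3.8, 3.0, NA, 1.6, 2.9, 2.4),
  WT   = rep(c(72, 88), each = 4),
  SEX  = rep(c("F", "M"), each = 4)
)

# Hypothetical corruption: subject 2's last record gets WT = 90 instead of 88
pk_bad <- pk %>%
  mutate(WT = if_else(ID == 2 & TIME == 2, 90, WT))

# Keep only subjects whose covariates are not constant
inconsistent <- pk_bad %>%
  group_by(ID) %>%
  summarise(
    n_WT  = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  ) %>%
  filter(n_WT > 1 | n_SEX > 1)

inconsistent
# One row: ID 2 with n_WT = 2 and n_SEX = 1
```

By default n_distinct() counts NA as its own value, so a subject with a partly missing covariate would also be flagged here, which is usually what a QC check wants.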
Worked Example 4: Grouped mutate()
Within-subject calculations:
pk %>%
  group_by(ID) %>%
  mutate(time_since_first = TIME - min(TIME)) %>%
  ungroup()
# A tibble: 8 × 8
ID TIME EVID AMT DV WT SEX time_since_first
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 0
2 1 0.5 0 NA 2.1 72 F 0.5
3 1 1 0 NA 3.8 72 F 1
4 1 2 0 NA 3 72 F 2
5 2 0 1 80 NA 88 M 0
6 2 0.5 0 NA 1.6 88 M 0.5
7 2 1 0 NA 2.9 88 M 1
8 2 2 0 NA 2.4 88 M 2
Observation index per subject:
pk %>%
  filter(EVID == 0) %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  mutate(obs_index = row_number()) %>%
  ungroup()
# A tibble: 6 × 8
ID TIME EVID AMT DV WT SEX obs_index
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
1 1 0.5 0 NA 2.1 72 F 1
2 1 1 0 NA 3.8 72 F 2
3 1 2 0 NA 3 72 F 3
4 2 0.5 0 NA 1.6 88 M 1
5 2 1 0 NA 2.9 88 M 2
6 2 2 0 NA 2.4 88 M 3
Worked Example 5: QC summary by subgroup
pk %>%
  group_by(SEX) %>%
  summarise(
    n_subjects = n_distinct(ID),
    mean_WT = mean(WT),
    sd_WT = sd(WT),
    .groups = "drop"
  )
# A tibble: 2 × 4
SEX n_subjects mean_WT sd_WT
<chr> <int> <dbl> <dbl>
1 F 1 72 0
2 M 1 88 0
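Note the unit of analysis in the table above: mean_WT and sd_WT are computed over records, so each subject is weighted by its number of rows, and sd_WT = 0 reflects four identical rows rather than between-subject variability. A minimal sketch of a subject-level version, assuming WT and SEX are constant within subject (as Worked Example 3 checks), reduces the data to one row per subject first. The object name wt_by_sex is illustrative.

```r
library(dplyr)

# Same values as the lesson's pk, rebuilt here so the block is self-contained
pk <- tibble::tibble(
  ID   = rep(c(1, 2), each = 4),
  TIME = rep(c(0, 0.5, 1, 2), times = 2),
  EVID = rep(c(1, 0, 0, 0), times = 2),
  AMT  = c(100, NA, NA, NA, 80, NA, NA, NA),
  DV   = c(NA, 2.1, 3.8, 3.0, NA, 1.6, 2.9, 2.4),
  WT   = rep(c(72, 88), each = 4),
  SEX  = rep(c("F", "M"), each = 4)
)

wt_by_sex <- pk %>%
  distinct(ID, SEX, WT) %>%   # one row per subject (given constant covariates)
  group_by(SEX) %>%
  summarise(
    n_subjects = n(),         # rows are now subjects, so n() counts subjects
    mean_WT = mean(WT),
    sd_WT = sd(WT),           # NA when a group has a single subject
    .groups = "drop"
  )

wt_by_sex
```

With one subject per sex in this toy dataset, sd_WT comes back as NA, which is the honest answer: a standard deviation across subjects cannot be estimated from a single subject.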
Strategies
- Group explicitly and ungroup intentionally.
- Remember: once grouped, downstream operations respect those groups until you ungroup() or a summarise() call drops them.
- Filter dose vs observation rows before summarising when needed.
- Use .groups = "drop" to avoid surprises.
- Build small, interpretable summary tables for QC.
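To make the .groups advice concrete, the sketch below groups by two variables and inspects what grouping is left after summarise(). It relies on dplyr's group_vars() to list the remaining grouping variables. The object names are illustrative.

```r
library(dplyr)

# Same values as the lesson's pk, rebuilt here so the block is self-contained
pk <- tibble::tibble(
  ID   = rep(c(1, 2), each = 4),
  TIME = rep(c(0, 0.5, 1, 2), times = 2),
  EVID = rep(c(1, 0, 0, 0), times = 2),
  AMT  = c(100, NA, NA, NA, 80, NA, NA, NA),
  DV   = c(NA, 2.1, 3.8, 3.0, NA, 1.6, 2.9, 2.4),
  WT   = rep(c(72, 88), each = 4),
  SEX  = rep(c("F", "M"), each = 4)
)

# Default: summarise() peels off only the last grouping variable (EVID)
# and prints a message saying the output is still grouped by ID
default_groups <- pk %>%
  group_by(ID, EVID) %>%
  summarise(n = n())
group_vars(default_groups)   # "ID"

# .groups = "drop": the result is fully ungrouped
dropped_groups <- pk %>%
  group_by(ID, EVID) %>%
  summarise(n = n(), .groups = "drop")
group_vars(dropped_groups)   # character(0)

# .groups = "keep": the original grouping is retained in full
kept_groups <- pk %>%
  group_by(ID, EVID) %>%
  summarise(n = n(), .groups = "keep")
group_vars(kept_groups)      # "ID" "EVID"
```

All three results contain the same four rows; only the grouping metadata differs, which is exactly why leftover groups are easy to miss.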
Common Mistakes
- Forgetting to ungroup() after group_by(), causing unintended grouped behavior in later steps
- Using summarise() when mutate() was intended, and unintentionally reducing the dataset
- Using grouped mutate() without realizing calculations are performed within each group
- Summarising over both dose and observation rows when they should be handled separately
- Assuming covariates are constant within subject without checking with n_distinct()
- Forgetting to control grouping behavior with .groups = "drop"
- Interpreting summary tables without recognizing the current unit of analysis
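The first mistake in the list is easy to reproduce. In the sketch below, a data frame is left grouped after adding an observation index, and a later mutate() that was meant to compute one overall mean silently computes one mean per subject instead. The object and column names (still_grouped, leaky, fixed, overall_mean_DV) are illustrative.

```r
library(dplyr)

# Same values as the lesson's pk, rebuilt here so the block is self-contained
pk <- tibble::tibble(
  ID   = rep(c(1, 2), each = 4),
  TIME = rep(c(0, 0.5, 1, 2), times = 2),
  EVID = rep(c(1, 0, 0, 0), times = 2),
  AMT  = c(100, NA, NA, NA, 80, NA, NA, NA),
  DV   = c(NA, 2.1, 3.8, 3.0, NA, 1.6, 2.9, 2.4),
  WT   = rep(c(72, 88), each = 4),
  SEX  = rep(c("F", "M"), each = 4)
)

# No ungroup() at the end: the result is still grouped by ID
still_grouped <- pk %>%
  filter(EVID == 0) %>%
  group_by(ID) %>%
  mutate(obs_index = row_number())

# Intended: one overall mean. Actual: one mean per subject, because groups persist
leaky <- still_grouped %>%
  mutate(overall_mean_DV = mean(DV))
n_distinct(leaky$overall_mean_DV)   # 2 distinct values, one per subject

# Ungrouping first gives the single overall mean that was intended
fixed <- still_grouped %>%
  ungroup() %>%
  mutate(overall_mean_DV = mean(DV))
n_distinct(fixed$overall_mean_DV)   # 1 distinct value
```

Neither version raises an error or warning, so the only defense is to ungroup deliberately or to check group_vars() before continuing.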
Practice Problems
- Create a subject-level table with n_dose, n_obs, and last_time.
- Identify subjects with zero doses.
- Check whether WT and SEX are constant within subject.
- Create a per-subject observation index.
- Summarise mean and SD of WT by SEX.
# Subject-level summary
pk %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1),
    n_obs = sum(EVID == 0),
    last_time = max(TIME[EVID == 0]),
    .groups = "drop"
  )
# A tibble: 2 × 4
ID n_dose n_obs last_time
<dbl> <int> <int> <dbl>
1 1 1 3 2
2 2 1 3 2
# Zero doses
pk %>%
  group_by(ID) %>%
  summarise(n_dose = sum(EVID == 1), .groups = "drop") %>%
  filter(n_dose == 0)
# A tibble: 0 × 2
# ℹ 2 variables: ID <dbl>, n_dose <int>
# Covariate consistency
pk %>%
  group_by(ID) %>%
  summarise(
    n_WT = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  )
# A tibble: 2 × 3
ID n_WT n_SEX
<dbl> <int> <int>
1 1 1 1
2 2 1 1
# Observation index
pk %>%
  filter(EVID == 0) %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  mutate(obs_index = row_number()) %>%
  ungroup()
# A tibble: 6 × 8
ID TIME EVID AMT DV WT SEX obs_index
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
1 1 0.5 0 NA 2.1 72 F 1
2 1 1 0 NA 3.8 72 F 2
3 1 2 0 NA 3 72 F 3
4 2 0.5 0 NA 1.6 88 M 1
5 2 1 0 NA 2.9 88 M 2
6 2 2 0 NA 2.4 88 M 3
# WT by SEX
pk %>%
  group_by(SEX) %>%
  summarise(
    mean_WT = mean(WT),
    sd_WT = sd(WT),
    .groups = "drop"
  )
# A tibble: 2 × 3
SEX mean_WT sd_WT
<chr> <dbl> <dbl>
1 F 72 0
2 M 88 0
Summary
You now know how to:
- Use group_by() to change the unit of analysis.
- Clearly distinguish between mutate() (adds columns) and summarise() (reduces rows).
- Build subject-level QC tables.
- Perform within-subject calculations safely.
- Control grouping behavior explicitly.
Grouping is not just syntax; it defines how your dataset is interpreted.
- Group explicitly; ungroup intentionally.
- Use mutate() to add information to rows.
- Use summarise() to collapse rows into summaries.
- Verify subject-level covariates with n_distinct().
- Subject-level tables often reveal problems before modeling does.