Grouping and Summarising Data

Learn how to compute subject-level summaries, QC tables, and grouped transformations using group_by() and summarise() in pharmacometrics (PMx) workflows.

Tip

What you’ll build today: subject-level and group-level summary tables that are essential for PMx QC, exploration, and reporting.

Learning Objectives

By the end of this lesson, you will be able to:

Use group_by() and summarise() to compute subject-level summaries.
Clearly distinguish between mutate() and summarise().
Understand how grouping changes the unit of analysis.
Control grouping behavior with .groups and ungroup().
Use grouped mutate() for within-subject calculations.
Build PMx-specific QC tables quickly and reproducibly.

Setup

library(tidyverse)

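We’ll work with a familiar PMx-style dataset with one row per record. ID identifies the subject, EVID marks dose records (1) and observation records (0), AMT is the dose amount, DV is the dependent variable (the observed value), and WT and SEX are subject-level covariates.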

pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,  ~WT, ~SEX,
    1,   0.0,    1,  100,  NA,    72, "F",
    1,   0.5,    0,   NA,  2.1,   72, "F",
    1,   1.0,    0,   NA,  3.8,   72, "F",
    1,   2.0,    0,   NA,  3.0,   72, "F",
    2,   0.0,    1,   80,  NA,    88, "M",
    2,   0.5,    0,   NA,  1.6,   88, "M",
    2,   1.0,    0,   NA,  2.9,   88, "M",
    2,   2.0,    0,   NA,  2.4,   88, "M"
)

pk

# A tibble: 8 × 7
     ID  TIME  EVID   AMT    DV    WT SEX  
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1     1   0       1   100  NA      72 F    
2     1   0.5     0    NA   2.1    72 F    
3     1   1       0    NA   3.8    72 F    
4     1   2       0    NA   3      72 F    
5     2   0       1    80  NA      88 M    
6     2   0.5     0    NA   1.6    88 M    
7     2   1       0    NA   2.9    88 M    
8     2   2       0    NA   2.4    88 M

Key Ideas

Grouping changes the unit of analysis

Without grouping, summarise() collapses the entire dataset into a single row.

pk %>% summarise(n_rows = n())

# A tibble: 1 × 1
  n_rows
   <int>
1      8

With grouping, summarise() collapses each group separately and returns one row per group.

pk %>% 
  group_by(ID) %>% 
  summarise(n_rows = n(), .groups = "drop")

# A tibble: 2 × 2
     ID n_rows
  <dbl>  <int>
1     1      4
2     2      4

Grouping tells R:

“Do this calculation separately for each subject.”

In PMx workflows, subject-level summaries are your first line of defense against subtle data issues: many modeling problems show up in a subject-level summary table before they ever surface during estimation.

Warning

Always confirm that covariates treated as subject-level are actually constant within subject.

mutate() vs summarise(): A Critical Difference

Understanding this distinction is foundational.

`mutate()`

Keeps the same number of rows
Adds or modifies columns
Operates within the existing structure

`summarise()`

Reduces the number of rows
Collapses each group into a summary (often one row per group)
Changes the unit of analysis

Example:

# mutate keeps all rows
pk %>%
  group_by(ID) %>%
  mutate(mean_DV = mean(DV, na.rm = TRUE)) %>%
  ungroup()

# A tibble: 8 × 8
     ID  TIME  EVID   AMT    DV    WT SEX   mean_DV
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   <dbl>
1     1   0       1   100  NA      72 F        2.97
2     1   0.5     0    NA   2.1    72 F        2.97
3     1   1       0    NA   3.8    72 F        2.97
4     1   2       0    NA   3      72 F        2.97
5     2   0       1    80  NA      88 M        2.3 
6     2   0.5     0    NA   1.6    88 M        2.3 
7     2   1       0    NA   2.9    88 M        2.3 
8     2   2       0    NA   2.4    88 M        2.3

Every original row remains.

# summarise collapses rows
pk %>%
  group_by(ID) %>%
  summarise(mean_DV = mean(DV, na.rm = TRUE),
            .groups = "drop")

# A tibble: 2 × 2
     ID mean_DV
  <dbl>   <dbl>
1     1    2.97
2     2    2.3

Now you have one row per subject.

This difference is crucial in PMx workflows.
Use mutate() when you want to add information to each record.
Use summarise() when you want to build a summary table.

Worked Example 1: Basic subject-level summaries

subject_summary <- pk %>%
  group_by(ID) %>%
  summarise(
    n_rows = n(),
    n_dose = sum(EVID == 1, na.rm = TRUE),
    n_obs  = sum(EVID == 0, na.rm = TRUE),
    .groups = "drop"
  )

subject_summary

# A tibble: 2 × 4
     ID n_rows n_dose n_obs
  <dbl>  <int>  <int> <int>
1     1      4      1     3
2     2      4      1     3

Worked Example 2: Observation time summaries

pk %>%
  filter(EVID == 0) %>%
  group_by(ID) %>%
  summarise(
    time_min = min(TIME),
    time_max = max(TIME),
    .groups = "drop"
  )

# A tibble: 2 × 3
     ID time_min time_max
  <dbl>    <dbl>    <dbl>
1     1      0.5        2
2     2      0.5        2

Worked Example 3: Checking covariate consistency

pk %>%
  group_by(ID) %>%
  summarise(
    n_WT  = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  )

# A tibble: 2 × 3
     ID  n_WT n_SEX
  <dbl> <int> <int>
1     1     1     1
2     2     1     1

Tip

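For true subject-level covariates, n_distinct() should equal 1 for every subject. By default, n_distinct() counts NA as a distinct value, so a covariate that is missing on some rows will show up as 2; pass na.rm = TRUE if you want to ignore missing values.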
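
In a real dataset you usually want the offending subjects rather than the full table. A minimal sketch, assuming a hypothetical dataset pk_bad in which subject 2 has two different recorded weights (these values are made up for illustration and are not part of the lesson data):

```r
library(dplyr)

# Hypothetical data: subject 2 has two different WT values recorded.
pk_bad <- tibble::tribble(
  ~ID, ~TIME, ~WT, ~SEX,
    1,   0.0,  72, "F",
    1,   1.0,  72, "F",
    2,   0.0,  88, "M",
    2,   1.0,  90, "M"
)

# Count distinct covariate values per subject, then keep only the
# subjects where any covariate takes more than one value.
inconsistent <- pk_bad %>%
  group_by(ID) %>%
  summarise(
    n_WT  = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  ) %>%
  filter(n_WT > 1 | n_SEX > 1)

inconsistent
```

An empty result means every checked covariate is constant within subject; any returned row points at a subject to investigate.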

Worked Example 4: Grouped mutate()

Within-subject calculations:

pk %>%
  group_by(ID) %>%
  mutate(time_since_first = TIME - min(TIME)) %>%
  ungroup()

# A tibble: 8 × 8
     ID  TIME  EVID   AMT    DV    WT SEX   time_since_first
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>            <dbl>
1     1   0       1   100  NA      72 F                  0  
2     1   0.5     0    NA   2.1    72 F                  0.5
3     1   1       0    NA   3.8    72 F                  1  
4     1   2       0    NA   3      72 F                  2  
5     2   0       1    80  NA      88 M                  0  
6     2   0.5     0    NA   1.6    88 M                  0.5
7     2   1       0    NA   2.9    88 M                  1  
8     2   2       0    NA   2.4    88 M                  2

Observation index per subject:

pk %>%
  filter(EVID == 0) %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  mutate(obs_index = row_number()) %>%
  ungroup()

# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX   obs_index
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>     <int>
1     1   0.5     0    NA   2.1    72 F             1
2     1   1       0    NA   3.8    72 F             2
3     1   2       0    NA   3      72 F             3
4     2   0.5     0    NA   1.6    88 M             1
5     2   1       0    NA   2.9    88 M             2
6     2   2       0    NA   2.4    88 M             3
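
Both examples end with ungroup() because a grouped mutate() leaves the grouping in place. The sketch below, using a cut-down hypothetical version of the dataset (only ID and TIME), shows how the same later step gives different answers depending on whether the grouping is still active:

```r
library(dplyr)

# Cut-down hypothetical data: two observation times per subject.
pk_small <- tibble::tribble(
  ~ID, ~TIME,
    1,   0.5,
    1,   1.0,
    2,   0.5,
    2,   1.0
)

# Grouping is still active, so row_number() restarts within each ID.
grouped <- pk_small %>%
  group_by(ID) %>%
  mutate(time_since_first = TIME - min(TIME)) %>%
  mutate(record = row_number())

# After ungroup(), row_number() counts across the whole dataset.
flat <- pk_small %>%
  group_by(ID) %>%
  mutate(time_since_first = TIME - min(TIME)) %>%
  ungroup() %>%
  mutate(record = row_number())

grouped$record
flat$record
```

Neither result is wrong in itself; the point is that the answer depends on grouping state that is invisible in the later line of code.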

Worked Example 5: QC summary by subgroup

pk %>%
  group_by(SEX) %>%
  summarise(
    n_subjects = n_distinct(ID),
    mean_WT = mean(WT),
    sd_WT   = sd(WT),
    .groups = "drop"
  )

# A tibble: 2 × 4
  SEX   n_subjects mean_WT sd_WT
  <chr>      <int>   <dbl> <dbl>
1 F              1      72     0
2 M              1      88     0
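
Note that the summary above is computed over records, not subjects: each subject's WT is counted once per row, which is why sd_WT is 0 even though each group contains a single subject. When subjects contribute different numbers of rows, this also weights the mean toward subjects with more records. A possible alternative, sketched here with hypothetical data (three subjects with unequal row counts), is to reduce to one row per subject with distinct() before summarising:

```r
library(dplyr)

# Hypothetical data: subject 1 has four records, subjects 2 and 3 have two.
pk_unbalanced <- tibble::tribble(
  ~ID, ~TIME, ~WT, ~SEX,
    1,   0.0,  60, "F",
    1,   0.5,  60, "F",
    1,   1.0,  60, "F",
    1,   2.0,  60, "F",
    2,   0.0,  80, "F",
    2,   1.0,  80, "F",
    3,   0.0,  88, "M",
    3,   1.0,  88, "M"
)

# Record-level summary: every row counts, so subject 1 dominates the F mean.
by_row <- pk_unbalanced %>%
  group_by(SEX) %>%
  summarise(mean_WT = mean(WT), sd_WT = sd(WT), .groups = "drop")

# Subject-level summary: one row per subject first, then summarise.
by_subject <- pk_unbalanced %>%
  distinct(ID, SEX, WT) %>%
  group_by(SEX) %>%
  summarise(
    n_subjects = n(),
    mean_WT    = mean(WT),
    sd_WT      = sd(WT),
    .groups    = "drop"
  )

by_row
by_subject
```

With one subject in a group, the subject-level sd_WT is NA rather than 0, which reflects that a standard deviation cannot be estimated from a single subject. This approach assumes the covariates have already been confirmed constant within subject; otherwise distinct() keeps more than one row per ID.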

Strategies

Group explicitly and ungroup intentionally.
Remember: once grouped, downstream verbs keep operating within those groups until the grouping is removed, either by ungroup() or by summarise() dropping it.
Filter dose vs observation rows before summarising when needed.
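Use .groups = "drop" to avoid surprises: by default, summarise() removes only the last grouping variable, so a result summarised over several grouping variables is still grouped.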
Build small, interpretable summary tables for QC.
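
To make the .groups point concrete, here is a small sketch with two grouping variables on a cut-down hypothetical version of the dataset. It uses group_vars() from dplyr to inspect which grouping, if any, remains on the result:

```r
library(dplyr)

# Cut-down hypothetical data: one dose and one observation row per subject.
pk_small <- tibble::tribble(
  ~ID, ~EVID, ~SEX,
    1,     1, "F",
    1,     0, "F",
    2,     1, "M",
    2,     0, "M"
)

# "drop_last" (what summarise() picks when each group yields one row)
# peels off only ID, so the result is still grouped by SEX.
still_grouped <- pk_small %>%
  group_by(SEX, ID) %>%
  summarise(n_rows = n(), .groups = "drop_last")

group_vars(still_grouped)

# "drop" returns a plain, ungrouped tibble.
ungrouped <- pk_small %>%
  group_by(SEX, ID) %>%
  summarise(n_rows = n(), .groups = "drop")

group_vars(ungrouped)
```

Stating .groups explicitly also silences the message summarise() prints when it has to choose the grouping of the output for you.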

Common Mistakes

Forgetting to ungroup() after group_by(), causing unintended grouped behavior in later steps
Using summarise() when mutate() was intended, and unintentionally reducing the dataset
Using grouped mutate() without realizing calculations are performed within each group
Summarising over both dose and observation rows when they should be handled separately
Assuming covariates are constant within subject without checking with n_distinct()
Leaving .groups unset when summarising over several grouping variables, so the result is still grouped by all but the last one
Interpreting summary tables without recognizing the current unit of analysis
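
The dose-versus-observation mistake is easy to reproduce. In the sketch below (a cut-down hypothetical version of the dataset), dose rows carry DV = NA and TIME = 0, so summarising across all rows gives results that do not describe the observations:

```r
library(dplyr)

# Cut-down hypothetical data: one dose row and two observation rows per subject.
pk_small <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    1,   0.0,     1,  100,  NA,
    1,   0.5,     0,   NA, 2.1,
    1,   1.0,     0,   NA, 3.8,
    2,   0.0,     1,   80,  NA,
    2,   0.5,     0,   NA, 1.6,
    2,   1.0,     0,   NA, 2.9
)

# Mixing record types: mean_DV is NA because of the dose rows, and
# first_time is the dose time rather than the first observation time.
mixed <- pk_small %>%
  group_by(ID) %>%
  summarise(mean_DV = mean(DV), first_time = min(TIME), .groups = "drop")

# Observation rows only: both summaries now describe the observations.
obs_only <- pk_small %>%
  filter(EVID == 0) %>%
  group_by(ID) %>%
  summarise(mean_DV = mean(DV), first_time = min(TIME), .groups = "drop")

mixed
obs_only
```

Adding na.rm = TRUE would hide the NA in mean_DV but not the wrong first_time, which is why filtering by record type first is often the clearer choice.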

Practice Problems

Create a subject-level table with n_dose, n_obs, and last_time.
Identify subjects with zero doses.
Check whether WT and SEX are constant within subject.
Create a per-subject observation index.
Summarise mean and SD of WT by SEX.

Step-by-Step Solutions

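# Subject-level summary: dose count, observation count, last observation time
pk %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1, na.rm = TRUE),
    n_obs  = sum(EVID == 0, na.rm = TRUE),
    # max() returns -Inf with a warning if a subject has no observation rows
    last_time = max(TIME[EVID == 0]),
    .groups = "drop"
  )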

# A tibble: 2 × 4
     ID n_dose n_obs last_time
  <dbl>  <int> <int>     <dbl>
1     1      1     3         2
2     2      1     3         2

# Zero doses
pk %>%
  group_by(ID) %>%
  summarise(n_dose = sum(EVID == 1), .groups = "drop") %>%
  filter(n_dose == 0)

# A tibble: 0 × 2
# ℹ 2 variables: ID <dbl>, n_dose <int>
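
The empty result is expected here, since both subjects in pk have a dose record. To see the check catch something, the sketch below uses a hypothetical dataset (pk_missing_dose, made up for illustration) in which subject 3 has observations but no dose row:

```r
library(dplyr)

# Hypothetical data: subject 3 has observation rows but no dose record.
pk_missing_dose <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    1,   0.0,     1,  100,  NA,
    1,   1.0,     0,   NA, 3.8,
    3,   0.5,     0,   NA, 1.2,
    3,   1.0,     0,   NA, 2.0
)

# Same pipeline as above: count dose rows per subject, keep the zeros.
no_dose <- pk_missing_dose %>%
  group_by(ID) %>%
  summarise(n_dose = sum(EVID == 1), .groups = "drop") %>%
  filter(n_dose == 0)

no_dose
```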

# Covariate consistency
pk %>%
  group_by(ID) %>%
  summarise(
    n_WT = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  )

# A tibble: 2 × 3
     ID  n_WT n_SEX
  <dbl> <int> <int>
1     1     1     1
2     2     1     1

# Observation index
pk %>%
  filter(EVID == 0) %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  mutate(obs_index = row_number()) %>%
  ungroup()

# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX   obs_index
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>     <int>
1     1   0.5     0    NA   2.1    72 F             1
2     1   1       0    NA   3.8    72 F             2
3     1   2       0    NA   3      72 F             3
4     2   0.5     0    NA   1.6    88 M             1
5     2   1       0    NA   2.9    88 M             2
6     2   2       0    NA   2.4    88 M             3

# WT by SEX
pk %>%
  group_by(SEX) %>%
  summarise(
    mean_WT = mean(WT),
    sd_WT = sd(WT),
    .groups = "drop"
  )

# A tibble: 2 × 3
  SEX   mean_WT sd_WT
  <chr>   <dbl> <dbl>
1 F          72     0
2 M          88     0

Summary

You now know how to:

Use group_by() to change the unit of analysis.
Clearly distinguish between mutate() (adds columns) and summarise() (reduces rows).
Build subject-level QC tables.
Perform within-subject calculations safely.
Control grouping behavior explicitly.

Grouping is not just syntax — it defines how your dataset is interpreted.

Quick Tips

Group explicitly; ungroup intentionally.
Use mutate() to add information to rows.
Use summarise() to collapse rows into summaries.
Verify subject-level covariates with n_distinct().
Subject-level tables often reveal problems before modeling does.