First 10 Minutes: Structural QC for PMx Data
What you build today: a repeatable QC checklist you run immediately after importing any PMx dataset.
Learning Objectives
By the end of this lesson, you will be able to:
- Inspect dataset structure efficiently.
- Detect missingness patterns.
- Identify duplicate records.
- Enforce correct sorting.
- Perform subject-level sanity checks.
- Explain why structural QC precedes any modeling step.
Setup
library(tidyverse)
Key Ideas
The first 10 minutes after import determine whether your modeling workflow is safe.
Structural QC focuses on:
- Column names and types
- Missingness patterns
- Duplicate records
- Correct record ordering
- Subject-level plausibility
This is not exploratory analysis — it is structural validation.
Many apparent modeling errors are actually structural data errors discovered too late.
In this lesson, you will see functions like summarise() and group_by() used before their dedicated deep-dive lessons later in the course.
For now, think of them as tools that help collapse data into subject-level summaries.
Example Dataset (Demo Only)
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1
)
Worked Example 1: Structural Inspection
glimpse(pk)
Rows: 4
Columns: 6
$ ID <dbl> 1, 1, 2, 2
$ TIME <dbl> 0.0, 0.5, 0.0, 0.5
$ EVID <dbl> 1, 0, 1, 0
$ AMT <dbl> 100, NA, 80, NA
$ DV <dbl> NA, 2.1, NA, 1.6
$ CMT <dbl> 1, 1, 1, 1
Confirm:
- Column names
- Column types
- Whether IDs appear numeric or character
- Whether time is stored as numeric
Structural issues at this stage often indicate upstream data problems.
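The confirmation list above can also be checked programmatically. The following is a minimal sketch, assuming a hypothetical specification of required columns (`required_cols`) and columns that must be numeric (`numeric_cols`); neither name comes from the lesson, and a real study would take them from its own data specification.

```r
library(dplyr)

# Demo data mirroring the lesson's example dataset
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1
)

# Assumed specification (hypothetical; adapt to your own dataset)
required_cols <- c("ID", "TIME", "EVID", "AMT", "DV", "CMT")
numeric_cols  <- c("TIME", "AMT", "DV")

# Required columns that are absent from the data
missing_cols <- setdiff(required_cols, names(pk))

# Columns that should be numeric but were imported as another type
present_numeric <- intersect(numeric_cols, names(pk))
is_num <- vapply(pk[present_numeric], is.numeric, logical(1))
non_numeric <- present_numeric[!is_num]

missing_cols
non_numeric
```

Both results are empty character vectors when the structure matches the assumed specification. A non-empty result is a prompt to go back to the import step rather than to patch the data downstream.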
Worked Example 2: Missingness Check
pk %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# A tibble: 1 × 6
ID TIME EVID AMT DV CMT
<int> <int> <int> <int> <int> <int>
1 0 0 0 2 2 0
Look for unexpected patterns such as:
- Missing doses (AMT) when EVID == 1
- Missing observations (DV) when EVID == 0
- Entire columns with high missingness
summarise() collapses the dataset into summary statistics.
We will study this function formally later.
For now, use it as a structural diagnostic tool.
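Column-wide counts do not show whether a missing value is expected for its record type. As a sketch of the event-specific checks listed above, the counts can be conditioned on EVID; the output column names `n_dose_missing_amt` and `n_obs_missing_dv` are illustrative choices, not part of the lesson.

```r
library(dplyr)

# Demo data mirroring the lesson's example dataset
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1
)

# Count records whose key value is missing for their event type
miss_by_event <- pk %>%
  summarise(
    n_dose_missing_amt = sum(EVID == 1 & is.na(AMT), na.rm = TRUE),
    n_obs_missing_dv   = sum(EVID == 0 & is.na(DV), na.rm = TRUE)
  )

miss_by_event
```

In the demo data both counts are zero: every missing AMT sits on an observation record and every missing DV sits on a dose record, which is the expected pattern.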
Worked Example 3: Duplicate Detection
pk %>%
  count(ID, TIME, EVID) %>%
  filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>
If any rows appear, investigate immediately.
Duplicate event records can silently break modeling workflows — especially in NONMEM, nlme, or simulation pipelines.
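When the count above does return rows, it helps to see the complete offending records, not just their keys. The sketch below deliberately duplicates one row of the demo data (`pk_dup` is a contrived object for illustration only) and then pulls every record that shares an ID, TIME, and EVID combination.

```r
library(dplyr)

# Demo data mirroring the lesson's example dataset
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1
)

# Illustration only: append a copy of row 2 to create a duplicate
pk_dup <- bind_rows(pk, slice(pk, 2))

# Keep every record whose (ID, TIME, EVID) key occurs more than once
dup_rows <- pk_dup %>%
  group_by(ID, TIME, EVID) %>%
  filter(n() > 1) %>%
  ungroup()

dup_rows
```

Here `dup_rows` holds both copies of the subject 1 observation at TIME 0.5, with all columns visible, so you can judge whether the records are true duplicates or differ in a column outside the key.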
Worked Example 4: Sorting Enforcement
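# desc(EVID) places dose records (EVID == 1) ahead of observations (EVID == 0) at tied times
pk <- pk %>% arrange(ID, TIME, desc(EVID))
Correct ordering is critical because many modeling tools assume: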
- Data are sorted by subject
- Records are time-ordered
- Dose records precede observations at identical times
Never assume imported data are properly ordered.
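Sorting silently also hides the fact that the import was out of order, which is often worth knowing. One possible check, sketched here with a hypothetical helper `is_pk_sorted()`, reports whether the rows already follow the intended order. It assumes that order is ID, then TIME, with dose records ahead of observations at tied times, matching the assumptions listed above.

```r
library(dplyr)

# Demo data mirroring the lesson's example dataset
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1
)

# Hypothetical helper: TRUE if rows are already in the intended order
is_pk_sorted <- function(data) {
  expected_order <- data %>%
    mutate(.row = row_number()) %>%        # remember the original position
    arrange(ID, TIME, desc(EVID)) %>%      # assumed ordering rule
    pull(.row)
  identical(expected_order, seq_len(nrow(data)))
}

sorted_ok   <- is_pk_sorted(pk)                       # TRUE
shuffled_ok <- is_pk_sorted(slice(pk, c(2, 1, 4, 3))) # FALSE
```

A FALSE result does not mean the data are wrong, only that they arrived in a different order than the modeling tool expects; record that fact before applying `arrange()`.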
Worked Example 5: Subject-Level Sanity Check
pk %>%
  group_by(ID) %>%
  summarise(
    n_rows = n(),
    n_dose = sum(EVID == 1, na.rm = TRUE),
    n_obs = sum(EVID == 0, na.rm = TRUE),
    time_min = min(TIME, na.rm = TRUE),
    time_max = max(TIME, na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 2 × 6
ID n_rows n_dose n_obs time_min time_max
<dbl> <int> <int> <int> <dbl> <dbl>
1 1 2 1 1 0 0.5
2 2 2 1 1 0 0.5
Scan for:
- Subjects with no doses
- Subjects with no observations
- Implausible time ranges
- Extremely short or long follow-up
This step frequently catches data merges gone wrong.
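With more than a handful of subjects, scanning the summary by eye stops working. As a sketch, the same summary can be filtered down to only the subjects that fail a check. The extra subject 3 below is invented purely to trigger the flag, and `subject_flags` is an illustrative name.

```r
library(dplyr)

# Demo data plus one invented subject (ID 3) with an observation but no dose
pk_demo <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1,
  3, 1.0, 0, NA, 0.9, 1
)

# Keep only subjects with no dose records or no observation records
subject_flags <- pk_demo %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1, na.rm = TRUE),
    n_obs  = sum(EVID == 0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_dose == 0 | n_obs == 0)

subject_flags
```

Only subject 3 is returned. Thresholds for implausible time ranges or follow-up length depend on the study design, so they are left out of this sketch.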
Strategies
- Always run the same QC checklist.
- Inspect structure before analysis.
- Verify sorting before modeling.
- Confirm each subject has sensible time ranges.
- Document any structural corrections made.
- Treat QC as a ritual, not an optional step.
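One way to make the checklist repeatable is to wrap the checks from the worked examples in a single function. The sketch below uses a hypothetical function name, `run_structural_qc()`, and assumes the dataset has ID, TIME, and EVID columns as in the demo data; it returns the results as a named list rather than printing them.

```r
library(dplyr)

# Demo data mirroring the lesson's example dataset
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
  1, 0.0, 1, 100, NA, 1,
  1, 0.5, 0, NA, 2.1, 1,
  2, 0.0, 1, 80, NA, 1,
  2, 0.5, 0, NA, 1.6, 1
)

# Hypothetical wrapper bundling the lesson's checks into one call
run_structural_qc <- function(data) {
  list(
    # Missing values per column
    n_missing = data %>%
      summarise(across(everything(), ~ sum(is.na(.x)))),
    # Keys that occur more than once
    duplicates = data %>%
      count(ID, TIME, EVID) %>%
      filter(n > 1),
    # Per-subject record counts and time range
    by_subject = data %>%
      group_by(ID) %>%
      summarise(
        n_rows   = n(),
        n_dose   = sum(EVID == 1, na.rm = TRUE),
        n_obs    = sum(EVID == 0, na.rm = TRUE),
        time_min = min(TIME, na.rm = TRUE),
        time_max = max(TIME, na.rm = TRUE),
        .groups  = "drop"
      )
  )
}

qc <- run_structural_qc(pk)
qc$duplicates
```

Returning a list keeps each result available for inspection or for saving alongside the analysis, which also helps with documenting any corrections made.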
Common Mistakes
- Skipping QC because the dataset “looks small”
- Checking only missing DV but not missing AMT
- Forgetting to sort before modeling
- Assuming duplicates are harmless
- Trusting exported data without verification
Practice Problems
- Inspect a dataset with glimpse().
- Count missing values.
- Detect duplicates.
- Sort correctly.
- Produce subject-level summaries.
glimpse(pk)
Rows: 4
Columns: 6
$ ID <dbl> 1, 1, 2, 2
$ TIME <dbl> 0.0, 0.5, 0.0, 0.5
$ EVID <dbl> 1, 0, 1, 0
$ AMT <dbl> 100, NA, 80, NA
$ DV <dbl> NA, 2.1, NA, 1.6
$ CMT <dbl> 1, 1, 1, 1
pk %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# A tibble: 1 × 6
ID TIME EVID AMT DV CMT
<int> <int> <int> <int> <int> <int>
1 0 0 0 2 2 0
pk %>%
  count(ID, TIME, EVID) %>%
  filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>
pk <- pk %>% arrange(ID, TIME, desc(EVID))
pk %>%
  group_by(ID) %>%
  summarise(n_rows = n(), .groups = "drop")
# A tibble: 2 × 2
ID n_rows
<dbl> <int>
1 1 2
2 2 2
Summary
After import, you should immediately:
- Inspect structure
- Check missingness
- Detect duplicates
- Enforce sorting
- Perform subject-level checks
Structural QC is a safety layer that protects all downstream modeling.
- Never analyze before inspecting.
- Sorting errors are silent but dangerous.
- Build a repeatable QC ritual.
- Structural validation always precedes modeling.