First 10 Minutes: Structural QC for PMx Data

A disciplined workflow for inspecting structure and catching common issues immediately after importing pharmacometrics (PMx) data.

Tip

What you build today: a repeatable QC checklist you run immediately after importing any PMx dataset.

Learning Objectives

By the end of this lesson, you will be able to:

  • Inspect dataset structure efficiently.
  • Detect missingness patterns.
  • Identify duplicate records.
  • Enforce correct sorting.
  • Perform subject-level sanity checks.
  • Explain why structural QC precedes any modeling step.

Setup

library(tidyverse)

Key Ideas

The first 10 minutes after import determine whether your modeling workflow is safe.

Structural QC focuses on:

  • Column names and types
  • Missingness patterns
  • Duplicate records
  • Correct record ordering
  • Subject-level plausibility

This is not exploratory analysis — it is structural validation.

Warning

Many modeling errors are not modeling errors — they are structural data errors discovered too late.

In this lesson, you will see functions like summarise() and group_by() used before their dedicated deep-dive lessons later in the course.

For now, think of them as tools that help collapse data into subject-level summaries.


Example Dataset (Demo Only)

pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
    1,   0.0,    1,  100,  NA,   1,
    1,   0.5,    0,   NA,  2.1,  1,
    2,   0.0,    1,   80,  NA,   1,
    2,   0.5,    0,   NA,  1.6,  1
)

Worked Example 1: Structural Inspection

glimpse(pk)
Rows: 4
Columns: 6
$ ID   <dbl> 1, 1, 2, 2
$ TIME <dbl> 0.0, 0.5, 0.0, 0.5
$ EVID <dbl> 1, 0, 1, 0
$ AMT  <dbl> 100, NA, 80, NA
$ DV   <dbl> NA, 2.1, NA, 1.6
$ CMT  <dbl> 1, 1, 1, 1

Confirm:

  • Column names
  • Column types
  • Whether IDs appear numeric or character
  • Whether time is stored as numeric

Structural issues at this stage often indicate upstream data problems.
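The confirmations above can also be written as explicit checks that fail loudly. This is a minimal sketch, assuming the six columns of the demo dataset are the expected set and that all of them should be numeric. The names `expected_cols`, `missing_cols`, and `non_numeric` are illustrative and not part of any package.

```r
library(dplyr)

# Demo dataset from this lesson
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
    1,   0.0,    1,  100,  NA,   1,
    1,   0.5,    0,   NA,  2.1,  1,
    2,   0.0,    1,   80,  NA,   1,
    2,   0.5,    0,   NA,  1.6,  1
)

# Assumption: these are the columns the downstream model expects
expected_cols <- c("ID", "TIME", "EVID", "AMT", "DV", "CMT")

# Expected columns that are absent from the data
missing_cols <- setdiff(expected_cols, names(pk))

# Columns that are present but not stored as numeric
non_numeric <- names(pk)[!vapply(pk, is.numeric, logical(1))]

# Stop early if either check fails
stopifnot(length(missing_cols) == 0, length(non_numeric) == 0)
```

If IDs are legitimately character in your study, drop `ID` from the numeric check rather than coercing it.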


Worked Example 2: Missingness Check

pk %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# A tibble: 1 × 6
     ID  TIME  EVID   AMT    DV   CMT
  <int> <int> <int> <int> <int> <int>
1     0     0     0     2     2     0

Look for unexpected patterns such as:

  • Missing doses (AMT) when EVID == 1
  • Missing observations (DV) when EVID == 0
  • Entire columns with high missingness

Note

summarise() collapses the dataset into summary statistics.
We will study this function formally later.
For now, use it as a structural diagnostic tool.
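Column totals alone do not reveal whether the missing values sit on the wrong record type. The sketch below counts the two patterns listed above directly. It uses a hypothetical dataset, `pk_bad`, in which one dose record has lost its AMT; the column names `dose_missing_amt` and `obs_missing_dv` are illustrative.

```r
library(dplyr)

# Hypothetical dataset: the dose record for subject 2 has no AMT
pk_bad <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
    1,   0.0,    1,  100,  NA,   1,
    1,   0.5,    0,   NA,  2.1,  1,
    2,   0.0,    1,   NA,  NA,   1,
    2,   0.5,    0,   NA,  1.6,  1
)

# Missingness that conflicts with the record type
miss_by_role <- pk_bad %>%
  summarise(
    dose_missing_amt = sum(EVID == 1 & is.na(AMT), na.rm = TRUE),
    obs_missing_dv   = sum(EVID == 0 & is.na(DV), na.rm = TRUE)
  )

# Fraction of missing values per column, to spot mostly empty columns
miss_frac <- pk_bad %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

miss_by_role
miss_frac
```

Here `dose_missing_amt` is 1 and `obs_missing_dv` is 0, which points at the dose record for subject 2 even though the AMT column total looks unremarkable.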


Worked Example 3: Duplicate Detection

pk %>%
  count(ID, TIME, EVID) %>%
  filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>

If any rows appear, investigate immediately.

Warning

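Duplicate event records can silently break modeling workflows, especially in NONMEM, nlme, or simulation pipelines: a duplicated dose record is read as a second dose, and a duplicated observation is counted twice in the fit.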
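When the count is not empty, it helps to pull the offending rows in full. A minimal sketch, using a hypothetical `pk_dup` built by appending one demo row a second time. The key `ID, TIME, EVID` follows the example above and is an assumption; datasets with several compartments may need `CMT` in the key as well.

```r
library(dplyr)

# Demo dataset from this lesson
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
    1,   0.0,    1,  100,  NA,   1,
    1,   0.5,    0,   NA,  2.1,  1,
    2,   0.0,    1,   80,  NA,   1,
    2,   0.5,    0,   NA,  1.6,  1
)

# Hypothetical: the second row appears twice, as after a repeated export
pk_dup <- bind_rows(pk, slice(pk, 2))

# Every row that shares its key with at least one other row
dup_rows <- pk_dup %>%
  group_by(ID, TIME, EVID) %>%
  filter(n() > 1) %>%
  ungroup()

# Number of rows that repeat an earlier row in every column
n_exact_dups <- nrow(pk_dup) - nrow(distinct(pk_dup))

dup_rows
n_exact_dups
```

Rows that match on the key but differ in DV or AMT are the more worrying case, because `distinct()` will not remove them and someone has to decide which value is correct.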


Worked Example 4: Sorting Enforcement

# desc(EVID) puts dose records (EVID == 1) ahead of observations (EVID == 0) at tied times
pk <- pk %>% arrange(ID, TIME, desc(EVID))

Correct ordering is critical because many modeling tools assume:

  1. Data are sorted by subject
  2. Records are time-ordered
  3. Dose records precede observations at identical times

Never assume imported data are properly ordered.
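Sorting can be verified as well as enforced. The sketch below assumes rule 3 above means doses (EVID == 1) come first at tied times, which requires descending EVID. `is_sorted()` is a hypothetical helper and `pk_raw` is made-up unsorted data. Whether dose-first is right for a true pre-dose sample depends on the study and the modeling tool.

```r
library(dplyr)

# Hypothetical unsorted data with a dose and an observation tied at TIME 0
pk_raw <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    2,   0.5,    0,   NA,  1.6,
    1,   0.0,    0,   NA,  0.0,
    1,   0.0,    1,  100,  NA,
    2,   0.0,    1,   80,  NA
)

# TRUE when rows are already ordered by ID, then TIME, then dose before observation
is_sorted <- function(d) {
  ord <- order(d$ID, d$TIME, -d$EVID)
  identical(ord, seq_len(nrow(d)))
}

sorted_before <- is_sorted(pk_raw)

pk_sorted <- pk_raw %>% arrange(ID, TIME, desc(EVID))

sorted_after <- is_sorted(pk_sorted)

sorted_before
sorted_after
```

Running the check before and after `arrange()` also tells you whether the import was out of order, which is worth documenting.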


Worked Example 5: Subject-Level Sanity Check

pk %>%
  group_by(ID) %>%
  summarise(
    n_rows = n(),
    n_dose = sum(EVID == 1, na.rm = TRUE),
    n_obs  = sum(EVID == 0, na.rm = TRUE),
    time_min = min(TIME, na.rm = TRUE),
    time_max = max(TIME, na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 2 × 6
     ID n_rows n_dose n_obs time_min time_max
  <dbl>  <int>  <int> <int>    <dbl>    <dbl>
1     1      2      1     1        0      0.5
2     2      2      1     1        0      0.5

Scan for:

  • Subjects with no doses
  • Subjects with no observations
  • Implausible time ranges
  • Extremely short or long follow-up

This step frequently catches data merges gone wrong.
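On a real dataset the summary has too many rows to scan by eye, so it helps to filter it down to the subjects that need attention. A minimal sketch, using a hypothetical `pk_gap` in which subject 3 has observations but no dose. The rule that negative times are implausible is an assumption and should be adapted to your study design.

```r
library(dplyr)

# Hypothetical: subject 3 has observations but no dose record
pk_gap <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    1,   0.0,    1,  100,  NA,
    1,   0.5,    0,   NA,  2.1,
    3,   0.5,    0,   NA,  1.9,
    3,   1.0,    0,   NA,  1.4
)

# Keep only subjects that fail at least one sanity rule
flagged <- pk_gap %>%
  group_by(ID) %>%
  summarise(
    n_dose   = sum(EVID == 1, na.rm = TRUE),
    n_obs    = sum(EVID == 0, na.rm = TRUE),
    time_min = min(TIME, na.rm = TRUE),
    .groups  = "drop"
  ) %>%
  filter(n_dose == 0 | n_obs == 0 | time_min < 0)

flagged
```

An empty `flagged` tibble means every subject passed these particular rules, not that the data are correct.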


Strategies

  • Always run the same QC checklist.
  • Inspect structure before analysis.
  • Verify sorting before modeling.
  • Confirm each subject has sensible time ranges.
  • Document any structural corrections made.
  • Treat QC as a ritual, not an optional step.
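One way to make the checklist repeatable is to wrap the checks in a single function and call it after every import. The function below, `run_structural_qc()`, is a hypothetical helper sketched for this lesson, not part of any package. It assumes the column names of the demo dataset and the dose-first rule for tied times.

```r
library(dplyr)

# Hypothetical helper: returns all structural checks in one list
run_structural_qc <- function(d) {
  list(
    # Missing values per column
    n_missing = d %>% summarise(across(everything(), ~ sum(is.na(.x)))),
    # Keys that occur more than once
    duplicates = d %>% count(ID, TIME, EVID) %>% filter(n > 1),
    # TRUE when rows are ordered by ID, TIME, dose before observation
    sorted = identical(order(d$ID, d$TIME, -d$EVID), seq_len(nrow(d))),
    # Dose and observation counts per subject
    by_subject = d %>%
      group_by(ID) %>%
      summarise(
        n_dose  = sum(EVID == 1, na.rm = TRUE),
        n_obs   = sum(EVID == 0, na.rm = TRUE),
        .groups = "drop"
      )
  )
}

# Demo dataset from this lesson
pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~CMT,
    1,   0.0,    1,  100,  NA,   1,
    1,   0.5,    0,   NA,  2.1,  1,
    2,   0.0,    1,   80,  NA,   1,
    2,   0.5,    0,   NA,  1.6,  1
)

qc <- run_structural_qc(pk)
qc$sorted
qc$duplicates
```

Returning a list rather than printing keeps the results available for logging, which supports the advice to document any structural corrections.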

Common Mistakes

  • Skipping QC because the dataset “looks small”
  • Checking only missing DV but not missing AMT
  • Forgetting to sort before modeling
  • Assuming duplicates are harmless
  • Trusting exported data without verification

Practice Problems

  1. Inspect a dataset with glimpse().
  2. Count missing values.
  3. Detect duplicates.
  4. Sort correctly.
  5. Produce subject-level summaries.

glimpse(pk)
Rows: 4
Columns: 6
$ ID   <dbl> 1, 1, 2, 2
$ TIME <dbl> 0.0, 0.5, 0.0, 0.5
$ EVID <dbl> 1, 0, 1, 0
$ AMT  <dbl> 100, NA, 80, NA
$ DV   <dbl> NA, 2.1, NA, 1.6
$ CMT  <dbl> 1, 1, 1, 1

pk %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# A tibble: 1 × 6
     ID  TIME  EVID   AMT    DV   CMT
  <int> <int> <int> <int> <int> <int>
1     0     0     0     2     2     0

pk %>%
  count(ID, TIME, EVID) %>%
  filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>

# desc(EVID) puts dose records (EVID == 1) ahead of observations (EVID == 0) at tied times
pk <- pk %>% arrange(ID, TIME, desc(EVID))

pk %>%
  group_by(ID) %>%
  summarise(n_rows = n(), .groups = "drop")
# A tibble: 2 × 2
     ID n_rows
  <dbl>  <int>
1     1      2
2     2      2

Summary

After import, you should immediately:

  1. Inspect structure
  2. Check missingness
  3. Detect duplicates
  4. Enforce sorting
  5. Perform subject-level checks

Structural QC is a safety layer that protects all downstream modeling.


  • Never analyze before inspecting.
  • Sorting errors are silent but dangerous.
  • Build a repeatable QC ritual.
  • Structural validation always precedes modeling.