Final QC and Exporting Model-Ready Datasets

Work through a defensible pre-model validation checklist and export a clean, production-ready dataset for pharmacometric (PMx) modeling.

Tip

What you’ll build today: a professional pre-model validation checklist and a clean, exportable dataset you can defend in a report or code review.

Learning Objectives

By the end of this lesson, you will be able to:

  • Perform final structural and logical QC checks on a PMx dataset.
  • Validate modeling assumptions explicitly.
  • Apply and document a consistent record-ordering convention.
  • Create compact QC summary tables for documentation.
  • Add simple programmatic validation checks.
  • Export model-ready datasets using common PMx conventions (including "." for missing values).

Setup

library(tidyverse)

Key Ideas

This is your last checkpoint before modeling.

If you cannot defend this dataset, you should not fit a model.

Final QC is about:

  • structural integrity
  • logical consistency
  • modeling readiness
  • reproducibility
  • defensibility

Warning

If something looks strange here, do not assume the model will “sort it out.” Fix it now.


Example Dataset

In many PMx datasets, BLQ (below the limit of quantification) status is stored as an integer flag:

  • BLQ = 0 → not BLQ
  • BLQ = 1 → BLQ

Dose rows may have BLQ = NA.

pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,  ~WT, ~SEX, ~BLQ,
    1,   0.0,    1,  100,  NA,    72, "F",   NA,
    1,   0.5,    0,   NA,  2.1,   72, "F",   0,
    1,   1.0,    0,   NA,  3.8,   72, "F",   0,
    1,   2.0,    0,   NA,  NA,    72, "F",   1,
    2,   0.0,    1,   80,  NA,    88, "M",   NA,
    2,   0.5,    0,   NA,  1.6,   88, "M",   0,
    2,   1.0,    0,   NA,  2.9,   88, "M",   0
)

pk
# A tibble: 7 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     BLQ
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        NA
2     1   0.5     0    NA   2.1    72 F         0
3     1   1       0    NA   3.8    72 F         0
4     1   2       0    NA  NA      72 F         1
5     2   0       1    80  NA      88 M        NA
6     2   0.5     0    NA   1.6    88 M         0
7     2   1       0    NA   2.9    88 M         0

Worked Example 1: Structural Inspection

glimpse(pk)
Rows: 7
Columns: 8
$ ID   <dbl> 1, 1, 1, 1, 2, 2, 2
$ TIME <dbl> 0.0, 0.5, 1.0, 2.0, 0.0, 0.5, 1.0
$ EVID <dbl> 1, 0, 0, 0, 1, 0, 0
$ AMT  <dbl> 100, NA, NA, NA, 80, NA, NA
$ DV   <dbl> NA, 2.1, 3.8, NA, NA, 1.6, 2.9
$ WT   <dbl> 72, 72, 72, 72, 88, 88, 88
$ SEX  <chr> "F", "F", "F", "F", "M", "M", "M"
$ BLQ  <dbl> NA, 0, 0, 1, NA, 0, 0

Confirm:

  • correct column names
  • expected column types
  • no unexpected columns

If something is off here, fix it before doing any other checks.
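The three confirmations above can also be made programmatically. The following is a minimal sketch, assuming the expected schema is the eight columns of the example dataset; `expected_types` and the small stand-in table `pk_check` are hypothetical names introduced only for illustration.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical expected schema: column name -> storage type
expected_types <- c(
  ID = "double", TIME = "double", EVID = "double", AMT = "double",
  DV = "double", WT = "double", SEX = "character", BLQ = "double"
)

# Small stand-in for pk with the same columns (illustrative values only)
pk_check <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~WT, ~SEX, ~BLQ,
    1,   0.0,     1,  100,  NA,  72,  "F",   NA,
    1,   0.5,     0,   NA, 2.1,  72,  "F",    0
)

# Storage type of every column, as a named character vector
actual_types <- vapply(pk_check, typeof, character(1))

# Columns that are absent, columns that should not be there, and type mismatches
missing_cols    <- setdiff(names(expected_types), names(actual_types))
unexpected_cols <- setdiff(names(actual_types), names(expected_types))
shared_cols     <- intersect(names(expected_types), names(actual_types))
type_mismatch   <- shared_cols[actual_types[shared_cols] != expected_types[shared_cols]]
```

All three result vectors should be empty. A non-empty `type_mismatch` often points to an import problem, such as a numeric column read as character.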


Worked Example 2: Ordering and Row Counts

nrow(pk)
[1] 7

Records should be sorted consistently before modeling.

Two common conventions are:

Dose before observation at same time

pk_dose_first <- pk %>% arrange(ID, TIME, desc(EVID))

Observation before dose at same time

pk_obs_first <- pk %>% arrange(ID, TIME, EVID)

For the regression and mixed-effects models used in this course (lm(), nls(), lme(), nlme()), either convention is acceptable as long as it is consistent and documented.

In this lesson, we will use observation-first ordering:

pk <- pk %>% arrange(ID, TIME, EVID)

Note

The key requirement is reproducibility—not a specific ordering rule.
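The example dataset has no subject with a dose and an observation at the same time, so both conventions return the same row order for it. The sketch below uses a hypothetical subject (`pk_tie`) with a pre-dose sample and a dose at TIME 0 to show where the two conventions actually differ.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical subject with a pre-dose sample and a dose at the same time
pk_tie <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    1,   0.0,     0,   NA, 0.0,
    1,   0.0,     1,  100,  NA,
    1,   0.5,     0,   NA, 2.1
)

# Dose before observation at the same time
dose_first <- pk_tie %>% arrange(ID, TIME, desc(EVID))

# Observation before dose at the same time
obs_first <- pk_tie %>% arrange(ID, TIME, EVID)

dose_first$EVID  # 1 0 0
obs_first$EVID   # 0 1 0
```

Only the tied records at TIME 0 change position; every other row is ordered identically under both rules.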


Worked Example 3: Key Uniqueness

A minimal PMx expectation: each (ID, TIME, EVID) combination should be unique.

pk %>%
  count(ID, TIME, EVID) %>%
  filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>

This should return zero rows.

If it doesn’t, you have duplicate records that must be resolved before modeling.
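The count above tells you which keys are duplicated but not what the conflicting rows contain. One way to pull the full records is sketched here on a hypothetical table (`pk_dup`) that contains one duplicated key.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical data with one duplicated observation record (ID 1, TIME 0.5)
pk_dup <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~DV,
    1,   0.0,     1,  NA,
    1,   0.5,     0, 2.1,
    1,   0.5,     0, 2.3,
    1,   1.0,     0, 3.8
)

# Keep every row whose key occurs more than once, so conflicting values are visible
dup_rows <- pk_dup %>%
  group_by(ID, TIME, EVID) %>%
  filter(n() > 1) %>%
  ungroup()

dup_rows
```

Seeing the two DV values side by side helps you decide whether the rows are true replicates, a data-entry error, or the result of an earlier join that multiplied records.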


Worked Example 4: Subject-Level Sanity Check

pk %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1),
    n_obs  = sum(EVID == 0),
    time_min = min(TIME),
    time_max = max(TIME),
    .groups = "drop"
  )
# A tibble: 2 × 5
     ID n_dose n_obs time_min time_max
  <dbl>  <int> <int>    <dbl>    <dbl>
1     1      1     3        0        2
2     2      1     2        0        1

Scan for:

  • subjects with no doses
  • subjects with no observations
  • implausible time ranges
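With more than a handful of subjects, scanning by eye becomes unreliable. The first two items in the list can be flagged with a filter, as in this sketch on a hypothetical table (`pk_subj`) in which one subject has no dose and another has no observations.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical data: ID 2 has no dose record, ID 3 has no observations
pk_subj <- tibble::tribble(
  ~ID, ~TIME, ~EVID,
    1,   0.0,     1,
    1,   0.5,     0,
    2,   0.5,     0,
    2,   1.0,     0,
    3,   0.0,     1
)

# Return only the subjects that fail one of the two checks
flagged <- pk_subj %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1),
    n_obs  = sum(EVID == 0),
    .groups = "drop"
  ) %>%
  filter(n_dose == 0 | n_obs == 0)

flagged
```

An empty `flagged` table is the expected result for a clean dataset. What counts as an implausible time range depends on the study design, so that check is left to a study-specific threshold.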

Worked Example 5: Covariate Consistency

In many datasets, covariates like WT and SEX are expected to be constant within each subject.

pk %>%
  group_by(ID) %>%
  summarise(
    n_WT  = n_distinct(WT),
    n_SEX = n_distinct(SEX),
    .groups = "drop"
  )
# A tibble: 2 × 3
     ID  n_WT n_SEX
  <dbl> <int> <int>
1     1     1     1
2     2     1     1

If n_WT > 1, you need to decide whether this is real (e.g., longitudinal weight) or a data issue.
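To make that decision you need to see the values themselves. The sketch below, on a hypothetical table (`pk_cov`) where one subject has two weights, lists the records for every subject whose WT varies.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical data: ID 2 has two different WT values
pk_cov <- tibble::tribble(
  ~ID, ~TIME, ~WT, ~SEX,
    1,   0.0,  72,  "F",
    1,   0.5,  72,  "F",
    2,   0.0,  88,  "M",
    2,   0.5,  80,  "M"
)

# Keep all rows of any subject with more than one distinct WT
wt_varies <- pk_cov %>%
  group_by(ID) %>%
  filter(n_distinct(WT) > 1) %>%
  ungroup() %>%
  select(ID, TIME, WT)

wt_varies
```

Note that `n_distinct()` counts NA as a distinct value by default, so a subject with one recorded weight and one missing weight is also flagged. That is usually what you want at this stage.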


Worked Example 6: Missingness and BLQ Review

pk %>%
  summarise(
    n_missing_DV = sum(is.na(DV)),
    n_blq = sum(BLQ == 1, na.rm = TRUE)
  )
# A tibble: 1 × 2
  n_missing_DV n_blq
         <int> <int>
1            3     1

Now check whether BLQ is logically aligned with record type:

pk %>%
  count(EVID, BLQ, name = "n")
# A tibble: 3 × 3
   EVID   BLQ     n
  <dbl> <dbl> <int>
1     0     0     4
2     0     1     1
3     1    NA     2

And check for suspicious combinations:

pk %>%
  filter(EVID == 0, BLQ == 1, !is.na(DV))
# A tibble: 0 × 8
# ℹ 8 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>,
#   WT <dbl>, SEX <chr>, BLQ <dbl>

If BLQ rows have non-missing DV, clarify whether DV is the measured value (and BLQ is an annotation) or DV should be treated as missing/excluded for modeling.
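The reverse check is also worth running: an observation row with missing DV that is not flagged as BLQ has no documented reason for being missing. This is a sketch on a hypothetical table (`pk_miss`), not a rule taken from the dataset specification.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical observation rows: the TIME 2.0 record is missing DV but not flagged BLQ
pk_miss <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~DV, ~BLQ,
    1,   0.5,     0, 2.1,    0,
    1,   1.0,     0,  NA,    1,
    1,   2.0,     0,  NA,    0
)

# %in% treats NA in BLQ as "not 1", so rows with a missing flag are also caught
unexplained <- pk_miss %>%
  filter(EVID == 0, is.na(DV), !(BLQ %in% 1))

unexplained
```

Rows returned here need a decision (a missed sample, a lab failure, or an unset BLQ flag) before they are silently dropped from the model.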


Worked Example 7: Programmatic Assertions

Use a few strict assertions to prevent accidental modeling on broken data.

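stopifnot(nrow(pk) > 0)                # dataset is not empty
stopifnot(all(pk$TIME >= 0))           # no negative (or missing) times
stopifnot(all(pk$EVID %in% c(0, 1)))   # only observation (0) and dose (1) records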

If any assertion fails, stop and investigate.
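Investigation is faster when the failure message says what went wrong. In R 4.0.0 and later, `stopifnot()` uses the argument name as the error message. The sketch below applies this to a hypothetical stand-in table (`pk_chk`); the record-type rules (AMT only on dose rows, DV only on observation rows) are assumptions that match the example dataset and may differ in yours.

```r
# Hypothetical stand-in for pk (illustrative values only)
pk_chk <- data.frame(
  ID   = c(1, 1, 1),
  TIME = c(0, 0.5, 1),
  EVID = c(1, 0, 0),
  AMT  = c(100, NA, NA),
  DV   = c(NA, 2.1, 3.8)
)

# Named conditions (R >= 4.0.0): the name becomes the error message on failure
stopifnot(
  "TIME contains missing values"        = !anyNA(pk_chk$TIME),
  "dose rows must have a positive AMT"  = all(pk_chk$AMT[pk_chk$EVID == 1] > 0),
  "observation rows must not carry AMT" = all(is.na(pk_chk$AMT[pk_chk$EVID == 0])),
  "dose rows must not carry DV"         = all(is.na(pk_chk$DV[pk_chk$EVID == 1]))
)
```

All four conditions hold for this table, so the call returns silently. A missing AMT on a dose row makes the second condition evaluate to NA, which `stopifnot()` also treats as a failure.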


Compact QC Summary Table

qc_summary <- pk %>%
  summarise(
    n_subjects = n_distinct(ID),
    n_rows     = n(),
    n_dose     = sum(EVID == 1),
    n_obs      = sum(EVID == 0),
    n_blq      = sum(BLQ == 1, na.rm = TRUE),
    time_min   = min(TIME),
    time_max   = max(TIME)
  )

qc_summary
# A tibble: 1 × 7
  n_subjects n_rows n_dose n_obs n_blq time_min time_max
       <int>  <int>  <int> <int> <int>    <dbl>    <dbl>
1          2      7      2     5     1        0        2

This one-row summary is often enough for documentation or reporting.
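For the summary to count as documentation, it has to be saved alongside the dataset it describes. A minimal sketch follows, using a hypothetical stand-in table (`qc_summary_demo`) and a temporary file; the `qc_date` column and the file name pattern are assumptions, not a course convention.

```r
library(dplyr)  # part of the tidyverse loaded in Setup
library(readr)

# Hypothetical stand-in for the qc_summary table built above
qc_summary_demo <- tibble::tibble(n_subjects = 2L, n_rows = 7L, n_blq = 1L)

# Stamp the summary with the run date so it can be matched to an export
qc_stamped <- qc_summary_demo %>%
  mutate(qc_date = as.character(Sys.Date()), .before = 1)

# tempfile() keeps the sketch free of side effects;
# a real project would write to a versioned path instead
qc_path <- tempfile(pattern = "pk_qc_summary_", fileext = ".csv")
write_csv(qc_stamped, qc_path)
```

Saving the summary at the same time as the dataset means a reviewer can later confirm that the exported file matches the counts you reported.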


Creating a Model-Ready Dataset

Create a modeling DV column that reflects your decision (exclude BLQ and missing observations):

pk_model <- pk %>%
  mutate(
    DV_model = case_when(
      EVID == 0 & BLQ == 0 & !is.na(DV) ~ DV,
      TRUE ~ NA_real_
    )
  )

pk_model %>% select(ID, TIME, EVID, AMT, DV, BLQ, DV_model)
# A tibble: 7 × 7
     ID  TIME  EVID   AMT    DV   BLQ DV_model
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1     1   0       1   100  NA      NA     NA  
2     1   0.5     0    NA   2.1     0      2.1
3     1   1       0    NA   3.8     0      3.8
4     1   2       0    NA  NA       1     NA  
5     2   0       1    80  NA      NA     NA  
6     2   0.5     0    NA   1.6     0      1.6
7     2   1       0    NA   2.9     0      2.9

This makes modeling intent explicit and defensible.

Note

A “model-ready dataset” is not just a cleaned dataset — it encodes your analysis decision (what counts as usable DV).
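To document that decision, it helps to report how many records were excluded and why. The sketch below tabulates a reason for each row of a hypothetical table (`pk_demo`); the status labels are illustrative, and the conditions are ordered so that a row is "used" only when it would also receive a non-missing DV_model under the rule above (assuming BLQ is coded 0/1).

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical dose and observation rows (illustrative values only)
pk_demo <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~DV, ~BLQ,
    1,   0.0,     1,  NA,   NA,
    1,   0.5,     0, 2.1,    0,
    1,   1.0,     0,  NA,    1,
    1,   2.0,     0,  NA,    0
)

# Hypothetical labels; the first matching condition wins
dv_status <- pk_demo %>%
  mutate(
    status = case_when(
      EVID != 0   ~ "not an observation",
      BLQ %in% 1  ~ "excluded: BLQ",
      is.na(BLQ)  ~ "excluded: BLQ flag missing",
      is.na(DV)   ~ "excluded: missing DV",
      TRUE        ~ "used"
    )
  ) %>%
  count(status)

dv_status
```

A table like this can go straight into a report next to the QC summary, so the number of observations entering the model is traceable to explicit exclusion reasons.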


Exporting Safely (Including "." for Missing)

Many PMx pipelines expect missing values to be written as "." in the final exported file.

Inside R, keep missing values as NA. Convert only at export time.

# write_csv(pk_model, "data/pk_model_v1.csv", na = ".")

Best practices:

  • Export only finalized datasets.
  • Version filenames when possible.
  • Never overwrite raw data.
  • Commit QC results before modeling.

Warning

Do not replace NA with "." inside your R objects.
Keep NA in R, and use na = "." only when writing the final file.
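A quick round trip shows why converting only at write time is safe. This sketch writes a hypothetical two-row extract (`pk_out`) to a temporary file with `na = "."`, then reads it back with the same `na` string; the explicit `col_types` is an assumption that every column in the extract is numeric.

```r
library(readr)  # part of the tidyverse loaded in Setup

# Hypothetical two-row extract with missing values in numeric columns
pk_out <- tibble::tibble(
  ID = c(1, 1), TIME = c(0, 0.5), EVID = c(1, 0),
  AMT = c(100, NA), DV = c(NA, 2.1)
)

out_path <- tempfile(fileext = ".csv")  # temporary file, not a project path
write_csv(pk_out, out_path, na = ".")

# Raw text of the file: missing cells are written as "."
readLines(out_path)

# Reading back with the same na string restores NA and keeps the columns numeric
pk_back <- read_csv(out_path, na = ".", col_types = cols(.default = col_double()))
```

The in-memory object never contains ".", so numeric columns stay numeric and functions such as `is.na()` and `sum(..., na.rm = TRUE)` keep working up to the moment of export.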


Strategies

  • Treat QC as part of modeling — not “cleanup.”
  • Check structure first (names/types/row counts), then logic (keys/order/values).
  • Decide (and document) an ordering rule once, then enforce it programmatically.
  • Use row counts and anti-joins as join/QC diagnostics.
  • Create a compact one-row QC summary suitable for reports.
  • Add a small number of stopifnot() assertions to prevent accidental modeling on corrupted data.
  • Keep NA inside R; only export "." at write time.
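The example dataset in this lesson is not built from a join, so the anti-join strategy is illustrated here with two hypothetical tables: dosing and observation records (`pk_rec`) and a subject-level covariate table (`covs`). Treat it as a sketch of the pattern rather than part of the lesson's pipeline.

```r
library(dplyr)  # part of the tidyverse loaded in Setup

# Hypothetical records and a hypothetical subject-level covariate table
pk_rec <- tibble::tribble(
  ~ID, ~TIME, ~EVID,
    1,   0.0,     1,
    1,   0.5,     0,
    2,   0.0,     1,
    3,   0.0,     1
)
covs <- tibble::tribble(
  ~ID, ~WT,
    1,  72,
    2,  88
)

# IDs present in the records but absent from the covariate table
no_covariates <- pk_rec %>%
  distinct(ID) %>%
  anti_join(covs, by = "ID")

# A left join onto a one-row-per-subject table should never change the row count
pk_joined <- pk_rec %>% left_join(covs, by = "ID")
stopifnot(nrow(pk_joined) == nrow(pk_rec))
```

Here `no_covariates` contains ID 3, which would otherwise surface only later as an unexplained NA in WT. The row-count assertion guards against the opposite problem: duplicate keys in the covariate table that multiply records.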

Common Mistakes

  • Skipping key-uniqueness checks because “the dataset is small.”
  • Mixing ordering conventions across different exports.
  • Treating BLQ as “missing” without documenting the decision.
  • Exporting a dataset without any saved QC summary.
  • Replacing NA with "." too early (inside R objects) and turning missingness into text.

Practice Problems

  1. Confirm the dataset has unique (ID, TIME, EVID) keys.
  2. Create both ordering variants (dose-first and obs-first). Verify that both contain the same rows and differ only in row order.
  3. Create a subject-level table that reports n_dose, n_obs, and the number of BLQ observations per subject.
  4. Add an assertion that DV is never negative on observation rows.
  5. Create DV_model and confirm that all BLQ == 1 observation rows have DV_model = NA.
  6. Export the dataset using write_csv(..., na = ".") (commented if you don’t want to write files during rendering).

# 1) Key uniqueness
pk %>%
  count(ID, TIME, EVID) %>%
  filter(n > 1)
# A tibble: 0 × 4
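# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>

# 2) Ordering variants
pk_dose_first <- pk %>% arrange(ID, TIME, desc(EVID))
pk_obs_first  <- pk %>% arrange(ID, TIME, EVID)

# Same rows in both variants: re-sorting dose-first by the obs-first rule must reproduce obs-first
stopifnot(isTRUE(all.equal(pk_dose_first %>% arrange(ID, TIME, EVID), pk_obs_first)))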

# 3) Subject-level QC table
pk %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1),
    n_obs  = sum(EVID == 0),
    n_blq_obs = sum(EVID == 0 & BLQ == 1, na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 2 × 4
     ID n_dose n_obs n_blq_obs
  <dbl>  <int> <int>     <int>
1     1      1     3         1
2     2      1     2         0

# 4) Assertion: no negative DV for observation rows
stopifnot(all(pk$DV[pk$EVID == 0 & !is.na(pk$DV)] >= 0))

# 5) DV_model
pk_model <- pk %>%
  mutate(
    DV_model = case_when(
      EVID == 0 & BLQ == 0 & !is.na(DV) ~ DV,
      TRUE ~ NA_real_
    )
  )

stopifnot(all(is.na(pk_model$DV_model[pk_model$EVID == 0 & pk_model$BLQ == 1])))

# 6) Export (optional)
# write_csv(pk_model, "data/pk_model_v1.csv", na = ".")

Summary

You now have a defensible, pre-model validation workflow:

  • structural validation
  • consistent ordering rules
  • logical consistency checks
  • explicit modeling-ready columns
  • programmatic safeguards
  • reproducible exports (including "." for missing)

This is professional data preparation—not just code that “runs.”


  • Treat final QC as part of modeling.
  • Choose an ordering rule and document it.
  • Confirm that BLQ is coded as an integer flag (0/1) and review it against record type.
  • Check keys: (ID, TIME, EVID) should be unique.
  • Add a few stopifnot() checks.
  • Keep NA in R; export missing as "." with na = ".".
  • Export only what you would defend.