Final QC and Exporting Model-Ready Datasets
What you’ll build today: a professional pre-model validation checklist and a clean, exportable dataset you can defend in a report or code review.
Learning Objectives
By the end of this lesson, you will be able to:
- Perform final structural and logical QC checks on a PMx dataset.
- Validate modeling assumptions explicitly.
- Apply and document a consistent record-ordering convention.
- Create compact QC summary tables for documentation.
- Add simple programmatic validation checks.
- Export model-ready datasets using common PMx conventions (including "." for missing values).
Setup
library(tidyverse)
Key Ideas
This is your last checkpoint before modeling.
If you cannot defend this dataset, you should not fit a model.
Final QC is about:
- structural integrity
- logical consistency
- modeling readiness
- reproducibility
- defensibility
If something looks strange here, do not assume the model will “sort it out.” Fix it now.
Example Dataset
In many PMx datasets, BLQ is stored as an integer flag:
- BLQ = 0 → not BLQ
- BLQ = 1 → BLQ
Dose rows may have BLQ = NA.
pk <- tibble::tribble(
~ID, ~TIME, ~EVID, ~AMT, ~DV, ~WT, ~SEX, ~BLQ,
1, 0.0, 1, 100, NA, 72, "F", NA,
1, 0.5, 0, NA, 2.1, 72, "F", 0,
1, 1.0, 0, NA, 3.8, 72, "F", 0,
1, 2.0, 0, NA, NA, 72, "F", 1,
2, 0.0, 1, 80, NA, 88, "M", NA,
2, 0.5, 0, NA, 1.6, 88, "M", 0,
2, 1.0, 0, NA, 2.9, 88, "M", 0
)
pk
# A tibble: 7 × 8
ID TIME EVID AMT DV WT SEX BLQ
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F NA
2 1 0.5 0 NA 2.1 72 F 0
3 1 1 0 NA 3.8 72 F 0
4 1 2 0 NA NA 72 F 1
5 2 0 1 80 NA 88 M NA
6 2 0.5 0 NA 1.6 88 M 0
7 2 1 0 NA 2.9 88 M 0
Worked Example 1: Structural Inspection
glimpse(pk)
Rows: 7
Columns: 8
$ ID <dbl> 1, 1, 1, 1, 2, 2, 2
$ TIME <dbl> 0.0, 0.5, 1.0, 2.0, 0.0, 0.5, 1.0
$ EVID <dbl> 1, 0, 0, 0, 1, 0, 0
$ AMT <dbl> 100, NA, NA, NA, 80, NA, NA
$ DV <dbl> NA, 2.1, 3.8, NA, NA, 1.6, 2.9
$ WT <dbl> 72, 72, 72, 72, 88, 88, 88
$ SEX <chr> "F", "F", "F", "F", "M", "M", "M"
$ BLQ <dbl> NA, 0, 0, 1, NA, 0, 0
Confirm:
- correct column names
- expected column types
- no unexpected columns
If something is off here, fix it before doing any other checks.
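The visual scan above can also be made programmatic. The following is a minimal sketch, assuming a hand-written specification of expected column names and classes; `pk_check`, `expected_types`, and the derived vectors are illustrative names, not part of the lesson's dataset.

```r
library(tibble)

# Small stand-in with the same columns as the lesson's pk dataset
pk_check <- tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV, ~WT, ~SEX, ~BLQ,
  1, 0.0, 1, 100, NA, 72, "F", NA,
  1, 0.5, 0, NA, 2.1, 72, "F", 0
)

# Assumed specification: expected column names and their classes
expected_types <- c(
  ID = "numeric", TIME = "numeric", EVID = "numeric", AMT = "numeric",
  DV = "numeric", WT = "numeric", SEX = "character", BLQ = "numeric"
)

# Columns that are missing from the data, or present but not expected
missing_cols <- setdiff(names(expected_types), names(pk_check))
extra_cols   <- setdiff(names(pk_check), names(expected_types))

# Compare the class of each shared column with the specification
actual_types  <- vapply(pk_check, function(x) class(x)[1], character(1))
shared        <- intersect(names(expected_types), names(pk_check))
type_mismatch <- shared[actual_types[shared] != expected_types[shared]]

missing_cols
extra_cols
type_mismatch
```

All three vectors should be empty for a structurally valid dataset; any non-empty result names the columns to fix first.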
Worked Example 2: Ordering and Row Counts
nrow(pk)
[1] 7
Records should be sorted consistently before modeling.
Two common conventions are:
Dose before observation at same time
pk_dose_first <- pk %>% arrange(ID, TIME, desc(EVID))
Observation before dose at same time
pk_obs_first <- pk %>% arrange(ID, TIME, EVID)
For the regression and mixed-effects models used in this course (lm(), nls(), lme(), nlme()), either convention is acceptable as long as it is consistent and documented.
In this lesson, we will use observation-first ordering:
pk <- pk %>% arrange(ID, TIME, EVID)
The key requirement is reproducibility, not a specific ordering rule.
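The example dataset has no dose and observation at the same time, so the two conventions give the same result there. The sketch below uses hypothetical records (a sample and a second dose both at TIME = 12) to show where they differ, and adds an illustrative helper, `is_obs_first()`, that tests whether a dataset is already in observation-first order.

```r
library(dplyr)
library(tibble)

# Hypothetical records: a sample and a second dose share TIME = 12
same_time <- tribble(
  ~ID, ~TIME, ~EVID,
  1, 12, 1,
  1,  0, 1,
  1, 12, 0
)

dose_first <- same_time %>% arrange(ID, TIME, desc(EVID))
obs_first  <- same_time %>% arrange(ID, TIME, EVID)

dose_first$EVID  # 1 1 0
obs_first$EVID   # 1 0 1

# TRUE when rows are already sorted by ID, TIME, EVID (observation-first)
is_obs_first <- function(d) {
  identical(order(d$ID, d$TIME, d$EVID), seq_len(nrow(d)))
}

is_obs_first(obs_first)  # TRUE
is_obs_first(same_time)  # FALSE
```

A check like this can be rerun immediately before export, so the documented convention is enforced rather than assumed.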
Worked Example 3: Key Uniqueness
A minimal PMx expectation: each (ID, TIME, EVID) combination should be unique.
pk %>%
count(ID, TIME, EVID) %>%
filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>
This should return zero rows.
If it doesn’t, you have duplicate records that must be resolved before modeling.
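When the count is not zero, it usually helps to see the full duplicated records rather than only their keys. This is a minimal sketch on hypothetical data (`pk_dup` is invented for illustration and contains one duplicated observation key).

```r
library(dplyr)
library(tibble)

# Hypothetical data: two observation records for ID 1 at TIME 0.5
pk_dup <- tribble(
  ~ID, ~TIME, ~EVID, ~DV,
  1, 0.0, 1, NA,
  1, 0.5, 0, 2.1,
  1, 0.5, 0, 2.4,
  1, 1.0, 0, 3.8
)

# Keep every row whose (ID, TIME, EVID) key occurs more than once
dup_rows <- pk_dup %>%
  group_by(ID, TIME, EVID) %>%
  filter(n() > 1) %>%
  ungroup()

dup_rows
```

Seeing both DV values side by side makes it easier to decide whether the duplicate is a re-assay, a data-entry error, or a merge artifact.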
Worked Example 4: Subject-Level Sanity Check
pk %>%
group_by(ID) %>%
summarise(
n_dose = sum(EVID == 1),
n_obs = sum(EVID == 0),
time_min = min(TIME),
time_max = max(TIME),
.groups = "drop"
)
# A tibble: 2 × 5
ID n_dose n_obs time_min time_max
<dbl> <int> <int> <dbl> <dbl>
1 1 1 3 0 2
2 2 1 2 0 1
Scan for:
- subjects with no doses
- subjects with no observations
- implausible time ranges
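With more than a handful of subjects, scanning by eye stops being reliable. One possible approach, sketched here on hypothetical data (`pk_flag` is invented and ID 3 deliberately has no dose record), is to filter the subject-level table down to the problem subjects.

```r
library(dplyr)
library(tibble)

# Hypothetical data: ID 3 has observations but no dose record
pk_flag <- tribble(
  ~ID, ~TIME, ~EVID,
  1, 0.0, 1,
  1, 0.5, 0,
  3, 0.5, 0,
  3, 1.0, 0
)

# Subjects with no dose records or no observation records
flagged <- pk_flag %>%
  group_by(ID) %>%
  summarise(
    n_dose = sum(EVID == 1),
    n_obs  = sum(EVID == 0),
    .groups = "drop"
  ) %>%
  filter(n_dose == 0 | n_obs == 0)

flagged
```

An empty result means every subject has at least one dose and one observation.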
Worked Example 5: Covariate Consistency
In many datasets, baseline covariates like WT and SEX should be constant within each subject.
pk %>%
group_by(ID) %>%
summarise(
n_WT = n_distinct(WT),
n_SEX = n_distinct(SEX),
.groups = "drop"
)
# A tibble: 2 × 3
ID n_WT n_SEX
<dbl> <int> <int>
1 1 1 1
2 2 1 1
If n_WT > 1, you need to decide whether this is real (e.g., longitudinal weight) or a data issue.
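To list only the subjects that need this decision, the distinct-value counts can be filtered directly. The sketch below assumes hypothetical data (`pk_cov`, in which ID 2 has two recorded weights) and uses `across()` and `if_any()` from dplyr.

```r
library(dplyr)
library(tibble)

# Hypothetical data: ID 2 has two different WT values
pk_cov <- tribble(
  ~ID, ~WT, ~SEX,
  1, 72, "F",
  1, 72, "F",
  2, 88, "M",
  2, 90, "M"
)

# Count distinct values per subject, then keep subjects where any count exceeds 1
varying <- pk_cov %>%
  group_by(ID) %>%
  summarise(across(c(WT, SEX), n_distinct), .groups = "drop") %>%
  filter(if_any(c(WT, SEX), ~ .x > 1))

varying
```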
Worked Example 6: Missingness and BLQ Review
pk %>%
summarise(
n_missing_DV = sum(is.na(DV)),
n_blq = sum(BLQ == 1, na.rm = TRUE)
)
# A tibble: 1 × 2
n_missing_DV n_blq
<int> <int>
1 3 1
Now check whether BLQ is logically aligned with record type:
pk %>%
count(EVID, BLQ, name = "n")
# A tibble: 3 × 3
EVID BLQ n
<dbl> <dbl> <int>
1 0 0 4
2 0 1 1
3 1 NA 2
And check for suspicious combinations:
pk %>%
filter(EVID == 0, BLQ == 1, !is.na(DV))
# A tibble: 0 × 8
# ℹ 8 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>,
# WT <dbl>, SEX <chr>, BLQ <dbl>
If BLQ rows have non-missing DV, clarify whether DV is the measured value (and BLQ is an annotation) or DV should be treated as missing/excluded for modeling.
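The flag convention described earlier (BLQ is 0 or 1 on observation rows and NA on dose rows) can be checked in the same style. This is a sketch under that assumed convention; `pk_blq` is hypothetical data with one observation row whose flag is missing.

```r
library(dplyr)
library(tibble)

# Hypothetical data: the last observation row has no BLQ flag
pk_blq <- tribble(
  ~ID, ~TIME, ~EVID, ~DV, ~BLQ,
  1, 0.0, 1, NA, NA,
  1, 0.5, 0, 2.1, 0,
  1, 1.0, 0, NA, 1,
  1, 2.0, 0, 1.2, NA
)

# Rows that break the assumed convention:
# dose rows carrying a flag, or observation rows without a valid 0/1 flag
blq_problems <- pk_blq %>%
  filter(
    (EVID == 1 & !is.na(BLQ)) |
      (EVID == 0 & !(BLQ %in% c(0, 1)))
  )

blq_problems
```

Because `NA %in% c(0, 1)` is FALSE, observation rows with a missing flag are caught instead of being silently dropped by the filter.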
Worked Example 7: Programmatic Assertions
Use a few strict assertions to prevent accidental modeling on broken data.
stopifnot(nrow(pk) > 0)
stopifnot(all(pk$TIME >= 0))
stopifnot(all(pk$EVID %in% c(0, 1)))
If any assertion fails, stop and investigate.
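Investigation is easier when a failed assertion says what went wrong. In R 4.0.0 and later, `stopifnot()` accepts named expressions and uses the name as the error message. The sketch below shows that form on a small stand-in data frame; `pk_assert`, `bad`, and the message wording are illustrative choices.

```r
# Small stand-in data frame (hypothetical values)
pk_assert <- data.frame(
  ID   = c(1, 1, 2),
  TIME = c(0, 0.5, 0),
  EVID = c(1, 0, 1)
)

# Named assertions: the name becomes the error message on failure (R >= 4.0.0)
stopifnot(
  "dataset has no rows" = nrow(pk_assert) > 0,
  "TIME contains missing or negative values" =
    all(!is.na(pk_assert$TIME) & pk_assert$TIME >= 0),
  "EVID contains values other than 0 and 1" =
    all(pk_assert$EVID %in% c(0, 1))
)

# Demonstrate the message on deliberately broken data
bad <- transform(pk_assert, TIME = c(0, -1, 0))
msg <- tryCatch(
  stopifnot("TIME contains missing or negative values" = all(bad$TIME >= 0)),
  error = function(e) conditionMessage(e)
)
msg
```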
Compact QC Summary Table
qc_summary <- pk %>%
summarise(
n_subjects = n_distinct(ID),
n_rows = n(),
n_dose = sum(EVID == 1),
n_obs = sum(EVID == 0),
n_blq = sum(BLQ == 1, na.rm = TRUE),
time_min = min(TIME),
time_max = max(TIME)
)
qc_summary
# A tibble: 1 × 7
n_subjects n_rows n_dose n_obs n_blq time_min time_max
<int> <int> <int> <int> <int> <dbl> <dbl>
1 2 7 2 5 1 0 2
This one-row summary is often enough for documentation or reporting.
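To keep the summary alongside the dataset it describes, it can be written to its own file. This is a minimal sketch: the values are copied from the summary above, while the file name `pk_model_v1_qc.csv` and the use of `tempdir()` are assumptions made so the example does not touch a project directory.

```r
library(readr)
library(tibble)

# Stand-in for qc_summary, using counts from the table above
qc_example <- tibble(
  n_subjects = 2L, n_rows = 7L, n_dose = 2L, n_obs = 5L, n_blq = 1L
)

# Hypothetical location and file name; a real project would use a versioned path
qc_file <- file.path(tempdir(), "pk_model_v1_qc.csv")
write_csv(qc_example, qc_file)

# Read it back to confirm the file holds what was intended
qc_back <- read_csv(qc_file, show_col_types = FALSE)
qc_back
```

Matching the QC file name to the dataset version makes it clear which export the summary belongs to.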
Creating a Model-Ready Dataset
Create a modeling DV column that reflects your decision (exclude BLQ and missing observations):
pk_model <- pk %>%
mutate(
DV_model = case_when(
EVID == 0 & BLQ == 0 & !is.na(DV) ~ DV,
TRUE ~ NA_real_
)
)
pk_model %>% select(ID, TIME, EVID, AMT, DV, BLQ, DV_model)
# A tibble: 7 × 7
ID TIME EVID AMT DV BLQ DV_model
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 100 NA NA NA
2 1 0.5 0 NA 2.1 0 2.1
3 1 1 0 NA 3.8 0 3.8
4 1 2 0 NA NA 1 NA
5 2 0 1 80 NA NA NA
6 2 0.5 0 NA 1.6 0 1.6
7 2 1 0 NA 2.9 0 2.9
This makes modeling intent explicit and defensible.
A “model-ready dataset” is not just a cleaned dataset — it encodes your analysis decision (what counts as usable DV).
Exporting Safely (Including "." for Missing)
Many PMx pipelines expect missing values to be written as "." in the final exported file.
Inside R, keep missing values as NA. Convert only at export time.
# write_csv(pk_model, "data/pk_model_v1.csv", na = ".")
Best practices:
- Export only finalized datasets.
- Version filenames when possible.
- Never overwrite raw data.
- Commit QC results before modeling.
Do not replace NA with "." inside your R objects.
Keep NA in R, and use na = "." only when writing the final file.
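A round trip is one way to confirm that the export convention behaves as intended. The sketch below writes a small hypothetical dataset (`pk_small`) to a temporary file with `na = "."`, inspects the raw text, and reads it back with the same `na` setting so that "." becomes NA again.

```r
library(readr)
library(tibble)

# Hypothetical miniature dataset with missing DV values
pk_small <- tibble(
  ID   = c(1, 1, 1),
  TIME = c(0, 0.5, 2),
  DV   = c(NA, 2.1, NA)
)

# Temporary file so the example leaves no trace in the project
out_file <- file.path(tempdir(), "pk_roundtrip.csv")
write_csv(pk_small, out_file, na = ".")

# In the file, missing values appear as "."
raw_lines <- readLines(out_file)
raw_lines

# Reading with na = "." restores NA and keeps DV numeric
pk_back <- read_csv(out_file, na = ".", show_col_types = FALSE)
pk_back
```

If DV came back as character, that would indicate "." was not declared as a missing-value string on import.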
Strategies
- Treat QC as part of modeling — not “cleanup.”
- Check structure first (names/types/row counts), then logic (keys/order/values).
- Decide (and document) an ordering rule once, then enforce it programmatically.
- Use row counts and anti-joins as join/QC diagnostics.
- Create a compact one-row QC summary suitable for reports.
- Add a small number of stopifnot() assertions to prevent accidental modeling on corrupted data.
- Keep NA inside R; only export "." at write time.
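The anti-join strategy in the list above can be sketched as follows. Both tables are hypothetical: `pk_ids` stands in for the event records and `demo` for a subject-level covariate table that is missing one subject.

```r
library(dplyr)
library(tibble)

# Hypothetical event records and covariate table; ID 3 has no covariate row
pk_ids <- tibble(ID = c(1, 1, 2, 2, 3), TIME = c(0, 0.5, 0, 0.5, 0))
demo   <- tibble(ID = c(1, 2), WT = c(72, 88))

# Records whose ID has no match in the covariate table
unmatched <- anti_join(pk_ids, demo, by = "ID")
unmatched

# A left join on a unique key should not change the row count
joined <- left_join(pk_ids, demo, by = "ID")
nrow(joined) == nrow(pk_ids)
```

A non-empty anti-join explains where NA covariates will appear after the join, and a changed row count signals duplicate keys in the lookup table.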
Common Mistakes
- Skipping key-uniqueness checks because “the dataset is small.”
- Mixing ordering conventions across different exports.
- Treating BLQ as “missing” without documenting the decision.
- Exporting a dataset without any saved QC summary.
- Using na = "." too early (inside R objects) and turning missingness into text.
Practice Problems
- Confirm the dataset has unique (ID, TIME, EVID) keys.
- Create both ordering variants (dose-first and obs-first). Verify both are internally consistent.
- Create a subject-level table that reports n_dose, n_obs, and the number of BLQ observations per subject.
- Add an assertion that DV is never negative on observation rows.
- Create DV_model and confirm that all BLQ == 1 observation rows have DV_model = NA.
- Export the dataset using write_csv(..., na = ".") (commented if you don’t want to write files during rendering).
# 1) Key uniqueness
pk %>%
count(ID, TIME, EVID) %>%
filter(n > 1)
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, n <int>
# 2) Ordering variants
pk_dose_first <- pk %>% arrange(ID, TIME, desc(EVID))
pk_obs_first <- pk %>% arrange(ID, TIME, EVID)
# 3) Subject-level QC table
pk %>%
group_by(ID) %>%
summarise(
n_dose = sum(EVID == 1),
n_obs = sum(EVID == 0),
n_blq_obs = sum(EVID == 0 & BLQ == 1, na.rm = TRUE),
.groups = "drop"
)
# A tibble: 2 × 4
ID n_dose n_obs n_blq_obs
<dbl> <int> <int> <int>
1 1 1 3 1
2 2 1 2 0
# 4) Assertion: no negative DV for observation rows
stopifnot(all(pk$DV[pk$EVID == 0 & !is.na(pk$DV)] >= 0))
# 5) DV_model
pk_model <- pk %>%
mutate(
DV_model = case_when(
EVID == 0 & BLQ == 0 & !is.na(DV) ~ DV,
TRUE ~ NA_real_
)
)
stopifnot(all(is.na(pk_model$DV_model[pk_model$EVID == 0 & pk_model$BLQ == 1])))
# 6) Export (optional)
# write_csv(pk_model, "data/pk_model_v1.csv", na = ".")
Summary
You now have a defensible, pre-model validation workflow:
- structural validation
- consistent ordering rules
- logical consistency checks
- explicit modeling-ready columns
- programmatic safeguards
- reproducible exports (including "." for missing)
This is professional data preparation—not just code that “runs.”
- Treat final QC as part of modeling.
- Choose an ordering rule and document it.
- Check BLQ as an integer flag (0/1).
- Check keys: (ID, TIME, EVID) should be unique.
- Add a few stopifnot() checks.
- Keep NA in R; export missing as "." with na = ".".
- Export only what you would defend.