library(tidyverse)
library(here)
derivedpath <- here("courses", "foundations-r", "data", "derived")
resultspath <- here("courses", "foundations-r", "results", "qc")
logspath <- here("courses", "foundations-r", "analysis", "logs")
dir.create(derivedpath, recursive = TRUE, showWarnings = FALSE)
dir.create(resultspath, recursive = TRUE, showWarnings = FALSE)
dir.create(logspath, recursive = TRUE, showWarnings = FALSE)
prelim_path <- file.path(derivedpath, "analysis_prelim.csv")
if (!file.exists(prelim_path)) {
stop("Preliminary dataset not found. Render the preliminary dataset assembly first:\n", prelim_path)
}
dat0 <- read_csv(prelim_path)
# Split for targeted fixes
obs0 <- dat0 %>% filter(EVID == 0)
dose0 <- dat0 %>% filter(EVID == 1)Fixing Data Issues
Big picture: We don’t “clean everything.”
We apply only fixes we can defend, and we log each decision with evidence (IDs, times, counts).
Learning Objectives
By the end of this lesson, you will be able to:
- Load the preliminary assembled dataset and isolate observation rows.
- Remove exact duplicates (same
ID + TIME) with traceable evidence. - Identify and correct the WT unit mismatch (lb → kg) using a documented rule.
- Remove structurally invalid records (e.g., missing
TIME). - Handle impossible negative concentrations in a defensible way.
- Append fixes and decisions to a QC log.
- Export a cleaned dataset version (
v2) for final locking and EDA.
Key Ideas
- Fixes must be justified. You should be able to answer: “Why did you change this?”
- Structural problems first (missing TIME, duplicates), then obvious scientific artifacts (negative concentrations).
- Unit mismatches in covariates can silently distort downstream models (e.g., allometry).
- Never overwrite raw data; export versioned derived datasets (
analysis_prelim→analysis_v2_cleaned). - A QC log is part of the deliverable, not an optional note.
Setup
Worked Example 1: Record Counts and Quick Red Flags
We begin by checking the overall shape of the preliminary dataset and comparing it to the expected study design.
tibble(
nsubj = n_distinct(dat0$ID),
nrows = nrow(dat0),
nobs = nrow(obs0),
ndose = nrow(dose0)
)# A tibble: 1 × 4
nsubj nrows nobs ndose
<int> <int> <int> <int>
1 50 508 458 50
Expected (clean world):
- 1 dose row + 9 obs rows = 10 total rows per subject
dat0 %>%
count(ID) %>%
count(n) %>%
arrange(n)# A tibble: 2 × 2
n nn
<int> <int>
1 10 42
2 11 8
Worked Example 2: Remove Structurally Invalid Rows (Missing TIME)
Rows without TIME cannot be plotted on a timeline or modeled properly.
missing_time <- obs0 %>%
filter(is.na(TIME)) %>%
select(ID, DOSE, TIME, NTIME, DV, BLQ, LLOQ)
missing_time# A tibble: 8 × 7
ID DOSE TIME NTIME DV BLQ LLOQ
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 009 0 NA NA 0 BLQ 0.05
2 009 0 NA NA 0 BLQ 0.05
3 019 12.5 NA NA 0.052 <NA> 0.05
4 039 50 NA NA 0.173 <NA> 0.05
5 039 50 NA NA 0.614 <NA> 0.05
6 046 100 NA NA 0.445 <NA> 0.05
7 049 100 NA NA 1.11 <NA> 0.05
8 049 100 NA NA 0.246 <NA> 0.05
Save evidence:
write_csv(missing_time, file.path(resultspath, "evidence_missing_time_obs.csv"))Apply fix (drop those observation rows):
obs1 <- obs0 %>% filter(!is.na(TIME))Worked Example 3: Exact Duplicates (ID + TIME)
Duplicates inflate record counts and distort summaries.
Evidence table:
dup_keys <- obs1 %>%
count(ID, TIME) %>%
filter(n > 1) %>%
arrange(desc(n))
dup_keys# A tibble: 8 × 3
ID TIME n
<chr> <dbl> <int>
1 001 2.03 2
2 002 6.00 2
3 003 4 2
4 009 7.98 2
5 018 12.0 2
6 026 12.1 2
7 038 5.95 2
8 041 2 2
Save evidence:
write_csv(dup_keys, file.path(resultspath, "evidence_duplicate_keys_id_time.csv"))Extract the duplicated rows themselves (useful when auditing):
dup_rows <- obs1 %>%
semi_join(dup_keys, by = c("ID", "TIME")) %>%
arrange(ID, TIME)
dup_rows %>% slice_head(n = 20)# A tibble: 16 × 13
ID DOSE TIME NTIME EVID CMT AMT DV AGE WT SEX BLQ
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 001 0 2.03 2 0 2 0 0 42 70.6 M BLQ
2 001 0 2.03 2 0 2 0 0 42 70.6 M BLQ
3 002 0 6.00 6 0 2 0 0 34 75.5 male BLQ
4 002 0 6.00 6 0 2 0 0 34 75.5 male BLQ
5 003 0 4 4 0 2 0 0 45 67.2 M BLQ
6 003 0 4 4 0 2 0 0 45 67.2 M BLQ
7 009 0 7.98 8 0 2 0 0 59 97.5 F BLQ
8 009 0 7.98 8 0 2 0 0 59 97.5 F BLQ
9 018 12.5 12.0 12 0 2 0 0 NA 90.2 F BLQ
10 018 12.5 12.0 12 0 2 0 0 NA 90.2 F BLQ
11 026 25 12.1 12 0 2 0 0.0667 34 81 F <NA>
12 026 25 12.1 12 0 2 0 0.0667 34 81 F <NA>
13 038 50 5.95 6 0 2 0 0.493 30 62.5 F <NA>
14 038 50 5.95 6 0 2 0 0.493 30 62.5 F <NA>
15 041 100 2 2 0 2 0 0.12 31 65.4 M <NA>
16 041 100 2 2 0 2 0 0.12 31 65.4 M <NA>
# ℹ 1 more variable: LLOQ <dbl>
write_csv(dup_rows, file.path(resultspath, "evidence_duplicate_rows.csv"))Minimal fix: keep the first record per ID + TIME.
obs2 <- obs1 %>%
arrange(ID, TIME) %>%
distinct(ID, TIME, .keep_all = TRUE)Re-check:
obs2 %>% count(ID, TIME) %>% filter(n > 1)# A tibble: 0 × 3
# ℹ 3 variables: ID <chr>, TIME <dbl>, n <int>
Worked Example 4: Negative Concentrations
Negative concentrations are not physically meaningful.
At this stage we choose a defensible minimal rule: set negative DV to NA (keep the record, but mark DV unusable).
Evidence:
neg_rows <- obs2 %>%
filter(!is.na(DV) & DV < 0) %>%
select(ID, DOSE, TIME, NTIME, DV, BLQ, LLOQ)
neg_rows# A tibble: 1 × 7
ID DOSE TIME NTIME DV BLQ LLOQ
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 047 100 6.03 6 -0.926 <NA> 0.05
write_csv(neg_rows, file.path(resultspath, "evidence_negative_dv.csv"))Apply fix:
obs3 <- obs2 %>%
mutate(
DV = if_else(!is.na(DV) & DV < 0, NA_real_, DV)
)Worked Example 5: Weight Outlier Detection (lb vs kg)
In this case study, one subject’s weight may be recorded in lb instead of kg, creating an extreme WT outlier.
Start with a quick WT scan:
dm_unique <- dat0 %>%
distinct(ID, WT, SEX, AGE)
dm_unique %>%
summarise(
min_wt = min(WT, na.rm = TRUE),
p25 = quantile(WT, 0.25, na.rm = TRUE),
median = median(WT, na.rm = TRUE),
p75 = quantile(WT, 0.75, na.rm = TRUE),
max_wt = max(WT, na.rm = TRUE)
)# A tibble: 1 × 5
min_wt p25 median p75 max_wt
<dbl> <dbl> <dbl> <dbl> <dbl>
1 44.4 65.9 73.2 85.3 113
Flag suspicious weights (simple, transparent heuristic):
- If WT > 150, it’s almost certainly lb recorded as kg (or a unit error).
wt_suspects <- dm_unique %>%
filter(!is.na(WT) & WT > 150) %>%
arrange(desc(WT))
wt_suspects# A tibble: 0 × 4
# ℹ 4 variables: ID <chr>, WT <dbl>, SEX <chr>, AGE <dbl>
Save evidence:
write_csv(wt_suspects, file.path(resultspath, "evidence_wt_unit_suspects.csv"))Apply conversion for those subjects only (lb → kg):
lb_to_kg <- 0.45359237
suspect_ids <- wt_suspects$ID
dat1 <- bind_rows(obs3, dose0) %>%
mutate(
WT = if_else(ID %in% suspect_ids, WT * lb_to_kg, WT)
)Sanity check after conversion:
dat1 %>%
distinct(ID, WT) %>%
summarise(
max_wt = max(WT, na.rm = TRUE),
median = median(WT, na.rm = TRUE)
)# A tibble: 1 × 2
max_wt median
<dbl> <dbl>
1 113 73.2
Worked Example 6: Reassemble, Order, and Recount
Now we re-order for clarity (dose row first at TIME = 0), and check record counts again.
analysis_v2 <- dat1 %>%
arrange(ID, TIME, desc(EVID))Counts per subject:
analysis_v2 %>%
count(ID) %>%
count(n) %>%
arrange(n)# A tibble: 3 × 2
n nn
<int> <int>
1 8 3
2 9 2
3 10 45
Check one dose row per subject:
analysis_v2 %>%
filter(EVID == 1) %>%
count(ID) %>%
filter(n != 1)# A tibble: 0 × 2
# ℹ 2 variables: ID <chr>, n <int>
Worked Example 7: Update the QC Decision Log
We append decisions to the course log.
log_path <- file.path(logspath, "qc_decision_log.csv")
if (!file.exists(log_path)) {
qc_log <- tibble(
issue_id = character(),
file = character(),
issue = character(),
evidence = character(),
decision = character(),
status = character(),
notes = character()
)
write_csv(qc_log, log_path)
}
qc_log <- read_csv(log_path, show_col_types = FALSE)
qc_log_new <- tibble(
issue_id = c("ISSUE-005", "ISSUE-006", "ISSUE-007", "ISSUE-008"),
file = c("analysis_prelim", "analysis_prelim", "analysis_prelim", "analysis_prelim"),
issue = c(
"Observation rows with missing TIME.",
"Exact duplicate observation keys (ID + TIME).",
"Negative DV values in observation rows.",
"WT outlier suggests lb→kg unit mismatch."
),
evidence = c(
"Saved: results/qc/evidence_missing_time_obs.csv",
"Saved: results/qc/evidence_duplicate_keys_id_time.csv and evidence_duplicate_rows.csv",
"Saved: results/qc/evidence_negative_dv.csv",
"Saved: results/qc/evidence_wt_unit_suspects.csv (WT > 150 heuristic)"
),
decision = c(
"Drop obs rows with TIME = NA (cannot be placed on timeline).",
"Keep first record per ID+TIME after ordering; remove exact duplicates.",
"Set DV < 0 to NA (retain record for traceability).",
"Convert WT to kg for suspect IDs only (WT := WT * 0.45359237)."
),
status = "closed",
notes = ""
)
qc_log <- bind_rows(qc_log, qc_log_new) %>%
distinct(issue_id, .keep_all = TRUE)
write_csv(qc_log, log_path)
qc_log %>% arrange(issue_id)# A tibble: 7 × 7
issue_id file issue evidence decision status notes
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ISSUE-001 pc TIME auto-parsed as … glimpse… Re-vali… open <NA>
2 ISSUE-002 dm DM is in long form (… glimpse… Use piv… open <NA>
3 ISSUE-003 dm Potential unit misma… dm %>% … Standar… open <NA>
4 ISSUE-005 analysis_prelim Observation rows wit… Saved: … Drop ob… closed <NA>
5 ISSUE-006 analysis_prelim Exact duplicate obse… Saved: … Keep fi… closed <NA>
6 ISSUE-007 analysis_prelim Negative DV values i… Saved: … Set DV … closed <NA>
7 ISSUE-008 analysis_prelim WT outlier suggests … Saved: … Convert… closed <NA>
Export Cleaned Dataset v2
write_csv(analysis_v2, file.path(derivedpath, "analysis_v2_cleaned.csv"))- Do not overwrite raw source files (
data/source_*). - Do not “fix by eye” in plotting code.
- Do not apply aggressive outlier removal unless you can justify it.
- Every fix should have an evidence table saved under
results/qc/.
Strategies
- Work from the assembled dataset:
courses/foundations-r/data/derived/analysis_prelim.csv. - Separate observation fixes (EVID = 0) from dose rows (EVID = 1).
- Generate evidence tables for each fix (which IDs/rows are affected).
- Apply fixes using minimal rules:
- Duplicates: keep first row per
ID + TIME(or choose a transparent tie-break rule) - Missing TIME: drop (cannot be modeled)
- Negative DV: set to
NA(or drop; but document your choice) - WT lb→kg: convert only where evidence suggests lb (extreme WT + unit investigation)
- Duplicates: keep first row per
- Append decisions to the course QC log.
Common Mistakes
- Applying fixes before saving evidence for the affected records.
- Mixing observation-row fixes with dose-row fixes.
- Removing duplicates without a clear key or tie-break rule.
- Dropping negative concentrations without documenting the decision.
- Converting all weights instead of only the suspect records.
- Exporting a cleaned dataset without updating the QC log.
Practice Problems
- Verify there are no duplicate
ID + TIMEobservation rows inanalysis_v2. - Create a short “before vs after” table showing:
- number of obs rows dropped for missing TIME
- number of duplicates removed
- number of negative DV set to NA
- Save that table to
courses/foundations-r/results/qc/. - Add one additional QC log entry that you think would be appropriate based on your results.
Duplicate check (obs only):
analysis_v2 %>%
filter(EVID == 0) %>%
count(ID, TIME) %>%
filter(n > 1)# A tibble: 0 × 3
# ℹ 3 variables: ID <chr>, TIME <dbl>, n <int>
Before vs after summary:
before <- obs0 %>% summarise(
n_obs = n(),
n_missing_time = sum(is.na(TIME)),
n_dup_keys = (obs0 %>% count(ID, TIME) %>% filter(n > 1) %>% nrow()),
n_neg = sum(!is.na(DV) & DV < 0)
)
after <- analysis_v2 %>%
filter(EVID == 0) %>%
summarise(
n_obs = n(),
n_missing_time = sum(is.na(TIME)),
n_dup_keys = (analysis_v2 %>% filter(EVID == 0) %>% count(ID, TIME) %>% filter(n > 1) %>% nrow()),
n_neg = sum(!is.na(DV) & DV < 0)
)
summary_tbl <- bind_rows(
before %>% mutate(stage = "before"),
after %>% mutate(stage = "after")
) %>% select(stage, everything())
write_csv(summary_tbl, file.path(resultspath, "qc_before_after_summary.csv"))
summary_tbl# A tibble: 2 × 5
stage n_obs n_missing_time n_dup_keys n_neg
<chr> <int> <int> <int> <int>
1 before 458 8 11 1
2 after 442 0 0 0
Summary
- You applied targeted, defensible fixes to the preliminary dataset.
- You removed structurally invalid observation rows (missing TIME).
- You removed exact duplicates using transparent keys (
ID + TIME). - You handled negative concentrations by setting DV to
NA. - You corrected a likely lb→kg weight mismatch with evidence-based criteria.
- You documented decisions and exported
analysis_v2_cleaned.csvfor the final dataset lock and EDA.
- Save evidence tables first, then fix.
- Prefer simple, transparent rules over clever ones.
- If you can’t explain a fix in one sentence, it’s probably too aggressive for this stage.
- Version your outputs: prelim → v2 cleaned → final locked.