Fixing Data Issues

Apply targeted, defensible fixes to the preliminary dataset: remove duplicates, standardize units (lb→kg), handle missing TIME, and document every decision before locking the final dataset.
Tip

Big picture: We don’t “clean everything.”
We apply only fixes we can defend, and we log each decision with evidence (IDs, times, counts).

Learning Objectives

By the end of this lesson, you will be able to:

  • Load the preliminary assembled dataset and isolate observation rows.
  • Remove exact duplicates (same ID + TIME) with traceable evidence.
  • Identify and correct the WT unit mismatch (lb → kg) using a documented rule.
  • Remove structurally invalid records (e.g., missing TIME).
  • Handle impossible negative concentrations in a defensible way.
  • Append fixes and decisions to a QC log.
  • Export a cleaned dataset version (v2) for final locking and EDA.

Key Ideas

  • Fixes must be justified. You should be able to answer: “Why did you change this?”
  • Structural problems first (missing TIME, duplicates), then obvious scientific artifacts (negative concentrations).
  • Unit mismatches in covariates can silently distort downstream models (e.g., allometry).
  • Never overwrite raw data; export versioned derived datasets (analysis_prelimanalysis_v2_cleaned).
  • A QC log is part of the deliverable, not an optional note.

Setup

library(tidyverse)
library(here)

derivedpath <- here("courses", "foundations-r", "data", "derived")
resultspath <- here("courses", "foundations-r", "results", "qc")
logspath    <- here("courses", "foundations-r", "analysis", "logs")

dir.create(derivedpath, recursive = TRUE, showWarnings = FALSE)
dir.create(resultspath, recursive = TRUE, showWarnings = FALSE)
dir.create(logspath, recursive = TRUE, showWarnings = FALSE)

prelim_path <- file.path(derivedpath, "analysis_prelim.csv")
if (!file.exists(prelim_path)) {
  stop("Preliminary dataset not found. Render the preliminary dataset assembly first:\n", prelim_path)
}

dat0 <- read_csv(prelim_path)

# Split for targeted fixes
obs0  <- dat0 %>% filter(EVID == 0)
dose0 <- dat0 %>% filter(EVID == 1)

Worked Example 1: Record Counts and Quick Red Flags

We begin by checking the overall shape of the preliminary dataset and comparing it to the expected study design.

tibble(
  nsubj = n_distinct(dat0$ID),
  nrows = nrow(dat0),
  nobs  = nrow(obs0),
  ndose = nrow(dose0)
)
# A tibble: 1 × 4
  nsubj nrows  nobs ndose
  <int> <int> <int> <int>
1    50   508   458    50

Expected (clean world):

  • 1 dose row + 9 obs rows = 10 total rows per subject
dat0 %>%
  count(ID) %>%
  count(n) %>%
  arrange(n)
# A tibble: 2 × 2
      n    nn
  <int> <int>
1    10    42
2    11     8

Worked Example 2: Remove Structurally Invalid Rows (Missing TIME)

Rows without TIME cannot be plotted on a timeline or modeled properly.

missing_time <- obs0 %>%
  filter(is.na(TIME)) %>%
  select(ID, DOSE, TIME, NTIME, DV, BLQ, LLOQ)

missing_time
# A tibble: 8 × 7
  ID     DOSE  TIME NTIME    DV BLQ    LLOQ
  <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 009     0      NA    NA 0     BLQ    0.05
2 009     0      NA    NA 0     BLQ    0.05
3 019    12.5    NA    NA 0.052 <NA>   0.05
4 039    50      NA    NA 0.173 <NA>   0.05
5 039    50      NA    NA 0.614 <NA>   0.05
6 046   100      NA    NA 0.445 <NA>   0.05
7 049   100      NA    NA 1.11  <NA>   0.05
8 049   100      NA    NA 0.246 <NA>   0.05

Save evidence:

write_csv(missing_time, file.path(resultspath, "evidence_missing_time_obs.csv"))

Apply fix (drop those observation rows):

obs1 <- obs0 %>% filter(!is.na(TIME))

Worked Example 3: Exact Duplicates (ID + TIME)

Duplicates inflate record counts and distort summaries.

Evidence table:

dup_keys <- obs1 %>%
  count(ID, TIME) %>%
  filter(n > 1) %>%
  arrange(desc(n))

dup_keys
# A tibble: 8 × 3
  ID     TIME     n
  <chr> <dbl> <int>
1 001    2.03     2
2 002    6.00     2
3 003    4        2
4 009    7.98     2
5 018   12.0      2
6 026   12.1      2
7 038    5.95     2
8 041    2        2

Save evidence:

write_csv(dup_keys, file.path(resultspath, "evidence_duplicate_keys_id_time.csv"))

Extract the duplicated rows themselves (useful when auditing):

dup_rows <- obs1 %>%
  semi_join(dup_keys, by = c("ID", "TIME")) %>%
  arrange(ID, TIME)

dup_rows %>% slice_head(n = 20)
# A tibble: 16 × 13
   ID     DOSE  TIME NTIME  EVID   CMT   AMT     DV   AGE    WT SEX   BLQ  
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 001     0    2.03     2     0     2     0 0         42  70.6 M     BLQ  
 2 001     0    2.03     2     0     2     0 0         42  70.6 M     BLQ  
 3 002     0    6.00     6     0     2     0 0         34  75.5 male  BLQ  
 4 002     0    6.00     6     0     2     0 0         34  75.5 male  BLQ  
 5 003     0    4        4     0     2     0 0         45  67.2 M     BLQ  
 6 003     0    4        4     0     2     0 0         45  67.2 M     BLQ  
 7 009     0    7.98     8     0     2     0 0         59  97.5 F     BLQ  
 8 009     0    7.98     8     0     2     0 0         59  97.5 F     BLQ  
 9 018    12.5 12.0     12     0     2     0 0         NA  90.2 F     BLQ  
10 018    12.5 12.0     12     0     2     0 0         NA  90.2 F     BLQ  
11 026    25   12.1     12     0     2     0 0.0667    34  81   F     <NA> 
12 026    25   12.1     12     0     2     0 0.0667    34  81   F     <NA> 
13 038    50    5.95     6     0     2     0 0.493     30  62.5 F     <NA> 
14 038    50    5.95     6     0     2     0 0.493     30  62.5 F     <NA> 
15 041   100    2        2     0     2     0 0.12      31  65.4 M     <NA> 
16 041   100    2        2     0     2     0 0.12      31  65.4 M     <NA> 
# ℹ 1 more variable: LLOQ <dbl>
write_csv(dup_rows, file.path(resultspath, "evidence_duplicate_rows.csv"))

Minimal fix: keep the first record per ID + TIME.

obs2 <- obs1 %>%
  arrange(ID, TIME) %>%
  distinct(ID, TIME, .keep_all = TRUE)

Re-check:

obs2 %>% count(ID, TIME) %>% filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: ID <chr>, TIME <dbl>, n <int>

Worked Example 4: Negative Concentrations

Negative concentrations are not physically meaningful.
At this stage we choose a defensible minimal rule: set negative DV to NA (keep the record, but mark DV unusable).

Evidence:

neg_rows <- obs2 %>%
  filter(!is.na(DV) & DV < 0) %>%
  select(ID, DOSE, TIME, NTIME, DV, BLQ, LLOQ)

neg_rows
# A tibble: 1 × 7
  ID     DOSE  TIME NTIME     DV BLQ    LLOQ
  <chr> <dbl> <dbl> <dbl>  <dbl> <chr> <dbl>
1 047     100  6.03     6 -0.926 <NA>   0.05
write_csv(neg_rows, file.path(resultspath, "evidence_negative_dv.csv"))

Apply fix:

obs3 <- obs2 %>%
  mutate(
    DV = if_else(!is.na(DV) & DV < 0, NA_real_, DV)
  )

Worked Example 5: Weight Outlier Detection (lb vs kg)

In this case study, one subject’s weight may be recorded in lb instead of kg, creating an extreme WT outlier.

Start with a quick WT scan:

dm_unique <- dat0 %>%
  distinct(ID, WT, SEX, AGE)

dm_unique %>%
  summarise(
    min_wt = min(WT, na.rm = TRUE),
    p25 = quantile(WT, 0.25, na.rm = TRUE),
    median = median(WT, na.rm = TRUE),
    p75 = quantile(WT, 0.75, na.rm = TRUE),
    max_wt = max(WT, na.rm = TRUE)
  )
# A tibble: 1 × 5
  min_wt   p25 median   p75 max_wt
   <dbl> <dbl>  <dbl> <dbl>  <dbl>
1   44.4  65.9   73.2  85.3    113

Flag suspicious weights (simple, transparent heuristic):

  • If WT > 150, it’s almost certainly lb recorded as kg (or a unit error).
wt_suspects <- dm_unique %>%
  filter(!is.na(WT) & WT > 150) %>%
  arrange(desc(WT))

wt_suspects
# A tibble: 0 × 4
# ℹ 4 variables: ID <chr>, WT <dbl>, SEX <chr>, AGE <dbl>

Save evidence:

write_csv(wt_suspects, file.path(resultspath, "evidence_wt_unit_suspects.csv"))

Apply conversion for those subjects only (lb → kg):

lb_to_kg <- 0.45359237
suspect_ids <- wt_suspects$ID

dat1 <- bind_rows(obs3, dose0) %>%
  mutate(
    WT = if_else(ID %in% suspect_ids, WT * lb_to_kg, WT)
  )

Sanity check after conversion:

dat1 %>%
  distinct(ID, WT) %>%
  summarise(
    max_wt = max(WT, na.rm = TRUE),
    median = median(WT, na.rm = TRUE)
  )
# A tibble: 1 × 2
  max_wt median
   <dbl>  <dbl>
1    113   73.2

Worked Example 6: Reassemble, Order, and Recount

Now we re-order for clarity (dose row first at TIME = 0), and check record counts again.

analysis_v2 <- dat1 %>%
  arrange(ID, TIME, desc(EVID))

Counts per subject:

analysis_v2 %>%
  count(ID) %>%
  count(n) %>%
  arrange(n)
# A tibble: 3 × 2
      n    nn
  <int> <int>
1     8     3
2     9     2
3    10    45

Check one dose row per subject:

analysis_v2 %>%
  filter(EVID == 1) %>%
  count(ID) %>%
  filter(n != 1)
# A tibble: 0 × 2
# ℹ 2 variables: ID <chr>, n <int>

Worked Example 7: Update the QC Decision Log

We append decisions to the course log.

log_path <- file.path(logspath, "qc_decision_log.csv")

if (!file.exists(log_path)) {
  qc_log <- tibble(
    issue_id = character(),
    file = character(),
    issue = character(),
    evidence = character(),
    decision = character(),
    status = character(),
    notes = character()
  )
  write_csv(qc_log, log_path)
}

qc_log <- read_csv(log_path, show_col_types = FALSE)

qc_log_new <- tibble(
  issue_id = c("ISSUE-005", "ISSUE-006", "ISSUE-007", "ISSUE-008"),
  file = c("analysis_prelim", "analysis_prelim", "analysis_prelim", "analysis_prelim"),
  issue = c(
    "Observation rows with missing TIME.",
    "Exact duplicate observation keys (ID + TIME).",
    "Negative DV values in observation rows.",
    "WT outlier suggests lb→kg unit mismatch."
  ),
  evidence = c(
    "Saved: results/qc/evidence_missing_time_obs.csv",
    "Saved: results/qc/evidence_duplicate_keys_id_time.csv and evidence_duplicate_rows.csv",
    "Saved: results/qc/evidence_negative_dv.csv",
    "Saved: results/qc/evidence_wt_unit_suspects.csv (WT > 150 heuristic)"
  ),
  decision = c(
    "Drop obs rows with TIME = NA (cannot be placed on timeline).",
    "Keep first record per ID+TIME after ordering; remove exact duplicates.",
    "Set DV < 0 to NA (retain record for traceability).",
    "Convert WT to kg for suspect IDs only (WT := WT * 0.45359237)."
  ),
  status = "closed",
  notes = ""
)

qc_log <- bind_rows(qc_log, qc_log_new) %>%
  distinct(issue_id, .keep_all = TRUE)

write_csv(qc_log, log_path)

qc_log %>% arrange(issue_id)
# A tibble: 7 × 7
  issue_id  file            issue                 evidence decision status notes
  <chr>     <chr>           <chr>                 <chr>    <chr>    <chr>  <chr>
1 ISSUE-001 pc              TIME auto-parsed as … glimpse… Re-vali… open   <NA> 
2 ISSUE-002 dm              DM is in long form (… glimpse… Use piv… open   <NA> 
3 ISSUE-003 dm              Potential unit misma… dm %>% … Standar… open   <NA> 
4 ISSUE-005 analysis_prelim Observation rows wit… Saved: … Drop ob… closed <NA> 
5 ISSUE-006 analysis_prelim Exact duplicate obse… Saved: … Keep fi… closed <NA> 
6 ISSUE-007 analysis_prelim Negative DV values i… Saved: … Set DV … closed <NA> 
7 ISSUE-008 analysis_prelim WT outlier suggests … Saved: … Convert… closed <NA> 

Export Cleaned Dataset v2

write_csv(analysis_v2, file.path(derivedpath, "analysis_v2_cleaned.csv"))

Warning
  • Do not overwrite raw source files (data/source_*).
  • Do not “fix by eye” in plotting code.
  • Do not apply aggressive outlier removal unless you can justify it.
  • Every fix should have an evidence table saved under results/qc/.

Strategies

  • Work from the assembled dataset: courses/foundations-r/data/derived/analysis_prelim.csv.
  • Separate observation fixes (EVID = 0) from dose rows (EVID = 1).
  • Generate evidence tables for each fix (which IDs/rows are affected).
  • Apply fixes using minimal rules:
    • Duplicates: keep first row per ID + TIME (or choose a transparent tie-break rule)
    • Missing TIME: drop (cannot be modeled)
    • Negative DV: set to NA (or drop; but document your choice)
    • WT lb→kg: convert only where evidence suggests lb (extreme WT + unit investigation)
  • Append decisions to the course QC log.

Common Mistakes

  • Applying fixes before saving evidence for the affected records.
  • Mixing observation-row fixes with dose-row fixes.
  • Removing duplicates without a clear key or tie-break rule.
  • Dropping negative concentrations without documenting the decision.
  • Converting all weights instead of only the suspect records.
  • Exporting a cleaned dataset without updating the QC log.

Practice Problems

  1. Verify there are no duplicate ID + TIME observation rows in analysis_v2.
  2. Create a short “before vs after” table showing:
    • number of obs rows dropped for missing TIME
    • number of duplicates removed
    • number of negative DV set to NA
  3. Save that table to courses/foundations-r/results/qc/.
  4. Add one additional QC log entry that you think would be appropriate based on your results.

Duplicate check (obs only):

analysis_v2 %>%
  filter(EVID == 0) %>%
  count(ID, TIME) %>%
  filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: ID <chr>, TIME <dbl>, n <int>

Before vs after summary:

before <- obs0 %>% summarise(
  n_obs = n(),
  n_missing_time = sum(is.na(TIME)),
  n_dup_keys = (obs0 %>% count(ID, TIME) %>% filter(n > 1) %>% nrow()),
  n_neg = sum(!is.na(DV) & DV < 0)
)

after <- analysis_v2 %>%
  filter(EVID == 0) %>%
  summarise(
    n_obs = n(),
    n_missing_time = sum(is.na(TIME)),
    n_dup_keys = (analysis_v2 %>% filter(EVID == 0) %>% count(ID, TIME) %>% filter(n > 1) %>% nrow()),
    n_neg = sum(!is.na(DV) & DV < 0)
  )

summary_tbl <- bind_rows(
  before %>% mutate(stage = "before"),
  after  %>% mutate(stage = "after")
) %>% select(stage, everything())

write_csv(summary_tbl, file.path(resultspath, "qc_before_after_summary.csv"))
summary_tbl
# A tibble: 2 × 5
  stage  n_obs n_missing_time n_dup_keys n_neg
  <chr>  <int>          <int>      <int> <int>
1 before   458              8         11     1
2 after    442              0          0     0

Summary

  • You applied targeted, defensible fixes to the preliminary dataset.
  • You removed structurally invalid observation rows (missing TIME).
  • You removed exact duplicates using transparent keys (ID + TIME).
  • You handled negative concentrations by setting DV to NA.
  • You corrected a likely lb→kg weight mismatch with evidence-based criteria.
  • You documented decisions and exported analysis_v2_cleaned.csv for the final dataset lock and EDA.

  • Save evidence tables first, then fix.
  • Prefer simple, transparent rules over clever ones.
  • If you can’t explain a fix in one sentence, it’s probably too aggressive for this stage.
  • Version your outputs: prelim → v2 cleaned → final locked.