Visualization-Driven QC on the Preliminary Dataset

Use plots and lightweight tables to discover duplicates, spikes, and time issues in the preliminary assembled dataset.
Tip

Big idea: Visual QC is where the real problems show up.
We are not fixing yet — we are finding, documenting, and prioritizing issues.

Learning Objectives

By the end of this lesson, you will be able to:

  • Load the preliminary assembled dataset and isolate observation rows.
  • Use profile plots to detect spikes, negative values, and inconsistent shapes.
  • Identify duplicate records by key fields.
  • Detect suspicious scaling or extreme values.
  • Create a short, prioritized QC to-do list for data fixing.

Key Ideas

  • Tables catch structural issues; plots catch scientific issues.
  • Dose-stratified profiles are the fastest way to find problems.
  • Duplicates often look harmless in tables but distort summaries and models.
  • Time misassignment often shows up as strange profile shapes.
  • Your QC log should reflect evidence, not guesses.

Setup

library(tidyverse)
library(here)

derivedpath <- here("courses", "foundations-r", "data", "derived")
resultspath <- here("courses", "foundations-r", "results", "qc")

dir.create(resultspath, recursive = TRUE, showWarnings = FALSE)

path <- file.path(derivedpath, "analysis_prelim.csv")

if (!file.exists(path)) {
  stop("Preliminary dataset not found. Render the previous lesson first:\n", path)
}

analysisprelim <- read_csv(path)

# Observation rows only
obs <- analysisprelim %>% filter(EVID == 0)

Worked Example 1: Quick Disposition Checks

We start with simple counts so we know the size of the dataset before looking for specific problems.

tibble(
  nsubj = n_distinct(analysisprelim$ID),
  nobs  = nrow(obs)
)
# A tibble: 1 × 2
  nsubj  nobs
  <int> <int>
1    50   458
obs %>% count(BLQ)
# A tibble: 2 × 2
  BLQ       n
  <chr> <int>
1 BLQ     159
2 <NA>    299

Worked Example 2: Individual Profiles by DOSE (Log Scale)

Individual profiles make shape problems visible. Faceting by DOSE helps separate expected dose-related differences from suspicious profile behavior.

p1 <- ggplot(obs, aes(x = TIME, y = DV, group = ID)) +
  geom_line(alpha = 0.35) +
  geom_point(size = 0.8) +
  facet_wrap(~DOSE) +
  scale_y_log10() +
  labs(
    title = "Individual Concentration–Time Profiles by DOSE (Log Scale)",
    x = "Time After Dose (h)",
    y = "Concentration"
  )

p1

ggsave(file.path(resultspath, "profiles_by_dose_log.png"), p1, width = 9, height = 6)

Worked Example 3: Find Duplicates by Key Fields

After the plot raises suspicion, we use key fields to locate duplicate observations precisely.

dups <- obs %>%
  count(ID, TIME) %>%
  filter(n > 1) %>%
  arrange(desc(n))

dups
# A tibble: 11 × 3
   ID     TIME     n
   <chr> <dbl> <int>
 1 001    2.03     2
 2 002    6.00     2
 3 003    4        2
 4 009    7.98     2
 5 009   NA        2
 6 018   12.0      2
 7 026   12.1      2
 8 038    5.95     2
 9 039   NA        2
10 041    2        2
11 049   NA        2
write_csv(dups, file.path(resultspath, "duplicate_id_time.csv"))

Worked Example 4: Detect Impossible Values

Next, we scan for values that are biologically or analytically implausible and save the affected records as evidence.

neg <- obs %>% filter(!is.na(DV) & DV < 0)
neg %>% select(ID, DOSE, TIME, DV) %>% slice_head(n = 10)
# A tibble: 1 × 4
  ID     DOSE  TIME     DV
  <chr> <dbl> <dbl>  <dbl>
1 047     100  6.03 -0.926
spikes <- obs %>%
  group_by(ID) %>%
  mutate(med = median(DV, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(!is.na(DV) & med > 0 & DV > 8 * med) %>%
  select(ID, DOSE, TIME, DV, med) %>%
  arrange(desc(DV))

spikes %>% slice_head(n = 10)
# A tibble: 0 × 5
# ℹ 5 variables: ID <chr>, DOSE <dbl>, TIME <dbl>, DV <dbl>, med <dbl>

Worked Example 5: Time Misassignment Clues

Finally, we look for profiles where the maximum concentration occurs very late, which can signal a sample-time problem.

latepeaks <- obs %>%
  group_by(ID, DOSE) %>%
  summarise(
    tmax = TIME[which.max(DV)],
    cmax = max(DV, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(tmax) & tmax > 12) %>%
  arrange(desc(tmax))

latepeaks
# A tibble: 0 × 4
# ℹ 4 variables: ID <chr>, DOSE <dbl>, tmax <dbl>, cmax <dbl>

Warning
  • Do not “fix by eye” inside plotting code.
  • Do not delete rows yet.
  • Visualization is for detection, not correction.
  • You should be able to point to an ID/time/value as evidence for every issue you list.

Strategies

  • Start with simple plots: individual profiles, stratified by DOSE.
  • Use log scale to expose multiplicative differences.
  • Add quick tables to locate exactly which IDs are affected.
  • Document issues now; fix in the next lesson.

Common Mistakes

  • Fixing records directly in plotting code.
  • Treating a suspicious plot pattern as proof without checking the underlying rows.
  • Looking only at summary statistics and missing subject-level profile problems.
  • Forgetting to save evidence tables for duplicate or impossible values.
  • Dropping apparent outliers before deciding whether they are data errors.

Practice Problems

  1. Create a linear-scale version of the profile plot and save it to courses/foundations-r/results/qc/.
  2. Count how many subjects have duplicate ID + TIME keys.
  3. Identify 3 subjects you suspect have time misassignment and write down why.
  4. Add at least 3 issues you found to your QC log.

p1lin <- ggplot(obs, aes(x = TIME, y = DV, group = ID)) +
  geom_line(alpha = 0.35) +
  geom_point(size = 0.8) +
  facet_wrap(~DOSE) +
  labs(
    title = "Individual Concentration–Time Profiles by DOSE (Linear Scale)",
    x = "Time After Dose (h)",
    y = "Concentration"
  )

ggsave(file.path(resultspath, "profiles_by_dose_linear.png"), p1lin, width = 9, height = 6)
p1lin

obs %>%
  count(ID, TIME) %>%
  filter(n > 1) %>%
  summarise(n_subjects = n_distinct(ID))
# A tibble: 1 × 1
  n_subjects
       <int>
1         10

Summary

  • You loaded the preliminary dataset and focused on observation rows.
  • You used dose-stratified profiles to detect spikes and inconsistencies.
  • You identified duplicates and impossible values with quick tables.
  • You generated a prioritized list of issues to address next.
  • Start with profiles by DOSE.
  • Use log scale early; it reveals multiplicative problems fast.
  • Always back every issue with a table of IDs and values.
  • Save QC figures so your decisions stay defensible.