Structural QC & Minimal Fixes

Perform structural QC and apply only the minimal fixes required to assemble a preliminary analysis dataset.
Tip

Big idea: Fix only what blocks dataset assembly.
Do not perform scientific cleaning yet — just ensure types, keys, and core fields are structurally valid.

Learning Objectives

By the end of this lesson, you will be able to:

  • Identify structural issues that prevent dataset assembly.
  • Parse datetime strings safely and detect failures.
  • Convert mixed-type numeric columns (e.g., “<0.05”) into analyzable form.
  • Validate subject alignment across DM, EX, and PC.
  • Reshape long-form DM into one row per subject for joining.
  • Export structurally clean intermediate files for assembly.

Key Ideas

  • Structural QC is different from scientific cleaning.
  • You fix types, not data behavior.
  • Datetime parsing and numeric coercion must be explicit.
  • Never overwrite raw source files.
  • Always export intermediate derived data.

Setup

library(tidyverse)
library(here)

sourcepath  <- here("courses", "foundations-r", "data", "source_issues")
derivedpath <- here("courses", "foundations-r", "data", "derived")

dir.create(derivedpath, recursive = TRUE, showWarnings = FALSE)

dm <- read_csv(file.path(sourcepath, "dm.csv"))
ex <- read_csv(file.path(sourcepath, "ex.csv"))
pc <- read_csv(file.path(sourcepath, "pc.csv"))

Worked Example 1: ID Alignment

Before modifying any columns, we first confirm that the same subjects appear across the source files.

If subjects are missing from one file, joins may silently drop records or create incomplete analysis datasets.

anti_join(dm %>% distinct(ID), ex %>% distinct(ID), by = "ID")
# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>
anti_join(dm %>% distinct(ID), pc %>% distinct(ID), by = "ID")
# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>

No rows should return.


Worked Example 2: Parse Datetimes

The raw TIME column may be stored as character text, even if it looks like a datetime.

Here we explicitly parse it into datetime columns that can be used for time calculations later.

pc <- pc %>%
  mutate(
    DT = ymd_hms(TIME, quiet = TRUE)
  )

ex <- ex %>%
  mutate(
    DOSEDT = ymd_hms(TIME, quiet = TRUE)
  )

After parsing, we always check for failures. Failed parses usually become NA.

pc %>% filter(is.na(DT))
# A tibble: 8 × 7
  ID    TIME   CONC  CONC_UNIT BLQ    LLOQ DT    
  <chr> <dttm> <chr> <chr>     <chr> <dbl> <dttm>
1 019   NA     0.052 mg/L      <NA>   0.05 NA    
2 039   NA     0.173 mg/L      <NA>   0.05 NA    
3 049   NA     1.11  mg/L      <NA>   0.05 NA    
4 046   NA     0.445 mg/L      <NA>   0.05 NA    
5 009   NA     0     mg/L      BLQ    0.05 NA    
6 009   NA     0     mg/L      BLQ    0.05 NA    
7 039   NA     0.614 mg/L      <NA>   0.05 NA    
8 049   NA     0.246 mg/L      <NA>   0.05 NA    
ex %>% filter(is.na(DOSEDT))
# A tibble: 0 × 6
# ℹ 6 variables: ID <chr>, TIME <dttm>, AMT <chr>, AMT_UNIT <chr>, ROUTE <chr>,
#   DOSEDT <dttm>

Worked Example 3: Fix Concentration Type

Concentration values should be numeric for modeling, but real datasets may include values such as "<0.05".

We keep a character copy first, then create a numeric version using parse_number().

pc <- pc %>%
  mutate(
    CONC_chr = as.character(CONC),
    CONC_num = parse_number(CONC_chr)
  )

This check looks for values that could not be converted to numeric.

pc %>% filter(is.na(CONC_num) & !is.na(CONC_chr))
# A tibble: 0 × 9
# ℹ 9 variables: ID <chr>, TIME <dttm>, CONC <chr>, CONC_UNIT <chr>, BLQ <chr>,
#   LLOQ <dbl>, DT <dttm>, CONC_chr <chr>, CONC_num <dbl>

Worked Example 4: Clean Dose Amount

Dose amount should also be numeric, but it may arrive as text such as "100 mg".

For structural QC, we only extract the numeric dose amount and keep the unit in its own column.

ex <- ex %>%
  mutate(
    AMT_num = parse_number(as.character(AMT))
  )

Quick scan for parsing failures:

ex %>% filter(is.na(AMT_num))
# A tibble: 0 × 7
# ℹ 7 variables: ID <chr>, TIME <dttm>, AMT <chr>, AMT_UNIT <chr>, ROUTE <chr>,
#   DOSEDT <dttm>, AMT_num <dbl>

Worked Example 5: Reshape DM (Long → Wide)

Your DM file is long-form:

  • ID, SEX, COV, VALUE, UNIT

To join DM to EX/PC, we typically want one row per subject.

Step 1: Basic checks

Before reshaping, we confirm which covariates are present and how many rows each subject has.

dm %>% count(COV)
# A tibble: 2 × 2
  COV       n
  <chr> <int>
1 AGE      50
2 WT       50
dm %>% count(ID) %>% count(n)
# A tibble: 1 × 2
      n    nn
  <int> <int>
1     2    50

Step 2: Standardize WT units (minimal)

We only standardize units that block basic use, such as computing weight-based summaries later.

Here we convert lbkg for WT rows, leaving everything else unchanged.

lb_to_kg <- 0.45359237

dm <- dm %>%
  mutate(
    VALUE = case_when(
      COV == "WT" & UNIT == "lb" ~ as.numeric(VALUE) * lb_to_kg,
      TRUE ~ as.numeric(VALUE)
    ),
    UNIT = case_when(
      COV == "WT" & UNIT == "lb" ~ "kg",
      TRUE ~ UNIT
    )
  )

Sanity check:

dm %>% filter(COV == "WT") %>% count(UNIT)
# A tibble: 1 × 2
  UNIT      n
  <chr> <int>
1 kg       50

Step 3: Pivot wider

Now we reshape DM so each subject has one row, with covariates stored as columns.

dm_wide <- dm %>%
  select(ID, SEX, COV, VALUE) %>%
  pivot_wider(names_from = COV, values_from = VALUE)

Check:

dm_wide %>% glimpse()
Rows: 50
Columns: 4
$ ID  <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009", "01…
$ SEX <chr> "M", "male", "M", "male", "M", "F", "F", "F", "F", "F", "M", "male…
$ AGE <dbl> 42, 34, 45, 37, 59, 43, 52, 52, 59, 60, 51, 37, 60, 46, NA, 34, 36…
$ WT  <dbl> 70.60000, 75.50000, 67.20000, 111.00000, 83.80000, 60.70000, 94.10…

Export Structurally Clean Intermediate Files

Keep only essential cleaned columns: subject IDs, parsed datetimes, numeric values, and key covariates.

These files are not the final analysis dataset yet. They are structurally clean inputs for the dataset assembly step in the next lesson.

pc_structural <- pc %>%
  transmute(
    ID,
    DT,
    CONC = CONC_num,
    CONCUNIT = CONC_UNIT,
    BLQ,
    LLOQ
  )

ex_structural <- ex %>%
  transmute(
    ID,
    DOSEDT,
    AMT = AMT_num,
    AMT_UNIT,
    ROUTE
  )

dm_structural <- dm_wide %>%
  transmute(
    ID,
    SEX,
    AGE,
    WT
  )

write_csv(pc_structural, file.path(derivedpath, "pc_structural_clean.csv"))
write_csv(ex_structural, file.path(derivedpath, "ex_structural_clean.csv"))
write_csv(dm_structural, file.path(derivedpath, "dm_structural_clean.csv"))

Warning
  • Do not remove duplicates yet.
  • Do not fix spikes yet.
  • Do not drop negative concentrations yet.
  • Structural QC ensures assembly is possible — not that data are correct.

Strategies

  • Read raw data from courses/foundations-r/data/source_issues/.
  • Use ymd_hms() for datetime parsing.
  • Use parse_number() for numeric coercion.
  • Flag parsing failures explicitly.
  • Reshape DM with pivot_wider() into one row per subject.
  • Standardize obvious covariate units only when required to merge/use them.
  • Save results to courses/foundations-r/data/derived/.

Common Mistakes

  • Performing scientific cleaning during structural QC.
  • Assuming parsed datetimes are valid without checking failures.
  • Forgetting that parse_number() silently strips characters.
  • Ignoring failed numeric coercions that become NA.
  • Overwriting raw source files instead of exporting derived files.
  • Joining DM before reshaping it into one row per subject.
  • Applying aggressive cleaning before dataset assembly is possible.
  • Standardizing units without documenting the conversion.

Practice Problems

  1. Confirm that all IDs in PC appear in EX.
  2. Count missing datetimes in PC and EX after parsing.
  3. Count how many concentrations required numeric coercion (e.g., contained “<”).
  4. Verify that DM reshaping produced 1 row per subject.

Missing datetimes:

pc %>% summarise(n_missing = sum(is.na(DT)))
# A tibble: 1 × 1
  n_missing
      <int>
1         8
ex %>% summarise(n_missing = sum(is.na(DOSEDT)))
# A tibble: 1 × 1
  n_missing
      <int>
1         0

Coercion count:

pc %>% summarise(n_coerced = sum(str_detect(CONC_chr, "<")))
# A tibble: 1 × 1
  n_coerced
      <int>
1         3

One row per subject after reshaping:

dm_wide %>% count(ID) %>% filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: ID <chr>, n <int>

Summary

  • You performed structural QC.
  • You parsed datetimes safely.
  • You converted mixed-type concentration values to numeric.
  • You verified subject alignment.
  • You reshaped DM into one row per subject and minimally standardized WT units.
  • You exported structurally clean intermediate files for dataset assembly.

  • Structural fixes come before scientific cleaning.
  • Always save intermediate datasets.
  • Never modify raw source files.
  • Separate parsing problems from data problems.