library(tidyverse)
library(here)
sourcepath <- here("courses", "foundations-r", "data", "source_issues")
derivedpath <- here("courses", "foundations-r", "data", "derived")
dir.create(derivedpath, recursive = TRUE, showWarnings = FALSE)
dm <- read_csv(file.path(sourcepath, "dm.csv"))
ex <- read_csv(file.path(sourcepath, "ex.csv"))
pc <- read_csv(file.path(sourcepath, "pc.csv"))Structural QC & Minimal Fixes
Big idea: Fix only what blocks dataset assembly.
Do not perform scientific cleaning yet — just ensure types, keys, and core fields are structurally valid.
Learning Objectives
By the end of this lesson, you will be able to:
- Identify structural issues that prevent dataset assembly.
- Parse datetime strings safely and detect failures.
- Convert mixed-type numeric columns (e.g., “<0.05”) into analyzable form.
- Validate subject alignment across DM, EX, and PC.
- Reshape long-form DM into one row per subject for joining.
- Export structurally clean intermediate files for assembly.
Key Ideas
- Structural QC is different from scientific cleaning.
- You fix types, not data behavior.
- Datetime parsing and numeric coercion must be explicit.
- Never overwrite raw source files.
- Always export intermediate derived data.
Setup
Worked Example 1: ID Alignment
Before modifying any columns, we first confirm that the same subjects appear across the source files.
If subjects are missing from one file, joins may silently drop records or create incomplete analysis datasets.
anti_join(dm %>% distinct(ID), ex %>% distinct(ID), by = "ID")# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>
anti_join(dm %>% distinct(ID), pc %>% distinct(ID), by = "ID")# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>
No rows should return.
Worked Example 2: Parse Datetimes
The raw TIME column may be stored as character text, even if it looks like a datetime.
Here we explicitly parse it into datetime columns that can be used for time calculations later.
pc <- pc %>%
mutate(
DT = ymd_hms(TIME, quiet = TRUE)
)
ex <- ex %>%
mutate(
DOSEDT = ymd_hms(TIME, quiet = TRUE)
)After parsing, we always check for failures. Failed parses usually become NA.
pc %>% filter(is.na(DT))# A tibble: 8 × 7
ID TIME CONC CONC_UNIT BLQ LLOQ DT
<chr> <dttm> <chr> <chr> <chr> <dbl> <dttm>
1 019 NA 0.052 mg/L <NA> 0.05 NA
2 039 NA 0.173 mg/L <NA> 0.05 NA
3 049 NA 1.11 mg/L <NA> 0.05 NA
4 046 NA 0.445 mg/L <NA> 0.05 NA
5 009 NA 0 mg/L BLQ 0.05 NA
6 009 NA 0 mg/L BLQ 0.05 NA
7 039 NA 0.614 mg/L <NA> 0.05 NA
8 049 NA 0.246 mg/L <NA> 0.05 NA
ex %>% filter(is.na(DOSEDT))# A tibble: 0 × 6
# ℹ 6 variables: ID <chr>, TIME <dttm>, AMT <chr>, AMT_UNIT <chr>, ROUTE <chr>,
# DOSEDT <dttm>
Worked Example 3: Fix Concentration Type
Concentration values should be numeric for modeling, but real datasets may include values such as "<0.05".
We keep a character copy first, then create a numeric version using parse_number().
pc <- pc %>%
mutate(
CONC_chr = as.character(CONC),
CONC_num = parse_number(CONC_chr)
)This check looks for values that could not be converted to numeric.
pc %>% filter(is.na(CONC_num) & !is.na(CONC_chr))# A tibble: 0 × 9
# ℹ 9 variables: ID <chr>, TIME <dttm>, CONC <chr>, CONC_UNIT <chr>, BLQ <chr>,
# LLOQ <dbl>, DT <dttm>, CONC_chr <chr>, CONC_num <dbl>
Worked Example 4: Clean Dose Amount
Dose amount should also be numeric, but it may arrive as text such as "100 mg".
For structural QC, we only extract the numeric dose amount and keep the unit in its own column.
ex <- ex %>%
mutate(
AMT_num = parse_number(as.character(AMT))
)Quick scan for parsing failures:
ex %>% filter(is.na(AMT_num))# A tibble: 0 × 7
# ℹ 7 variables: ID <chr>, TIME <dttm>, AMT <chr>, AMT_UNIT <chr>, ROUTE <chr>,
# DOSEDT <dttm>, AMT_num <dbl>
Worked Example 5: Reshape DM (Long → Wide)
Your DM file is long-form:
ID,SEX,COV,VALUE,UNIT
To join DM to EX/PC, we typically want one row per subject.
Step 1: Basic checks
Before reshaping, we confirm which covariates are present and how many rows each subject has.
dm %>% count(COV)# A tibble: 2 × 2
COV n
<chr> <int>
1 AGE 50
2 WT 50
dm %>% count(ID) %>% count(n)# A tibble: 1 × 2
n nn
<int> <int>
1 2 50
Step 2: Standardize WT units (minimal)
We only standardize units that block basic use, such as computing weight-based summaries later.
Here we convert lb → kg for WT rows, leaving everything else unchanged.
lb_to_kg <- 0.45359237
dm <- dm %>%
mutate(
VALUE = case_when(
COV == "WT" & UNIT == "lb" ~ as.numeric(VALUE) * lb_to_kg,
TRUE ~ as.numeric(VALUE)
),
UNIT = case_when(
COV == "WT" & UNIT == "lb" ~ "kg",
TRUE ~ UNIT
)
)Sanity check:
dm %>% filter(COV == "WT") %>% count(UNIT)# A tibble: 1 × 2
UNIT n
<chr> <int>
1 kg 50
Step 3: Pivot wider
Now we reshape DM so each subject has one row, with covariates stored as columns.
dm_wide <- dm %>%
select(ID, SEX, COV, VALUE) %>%
pivot_wider(names_from = COV, values_from = VALUE)Check:
dm_wide %>% glimpse()Rows: 50
Columns: 4
$ ID <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009", "01…
$ SEX <chr> "M", "male", "M", "male", "M", "F", "F", "F", "F", "F", "M", "male…
$ AGE <dbl> 42, 34, 45, 37, 59, 43, 52, 52, 59, 60, 51, 37, 60, 46, NA, 34, 36…
$ WT <dbl> 70.60000, 75.50000, 67.20000, 111.00000, 83.80000, 60.70000, 94.10…
Export Structurally Clean Intermediate Files
Keep only essential cleaned columns: subject IDs, parsed datetimes, numeric values, and key covariates.
These files are not the final analysis dataset yet. They are structurally clean inputs for the dataset assembly step in the next lesson.
pc_structural <- pc %>%
transmute(
ID,
DT,
CONC = CONC_num,
CONCUNIT = CONC_UNIT,
BLQ,
LLOQ
)
ex_structural <- ex %>%
transmute(
ID,
DOSEDT,
AMT = AMT_num,
AMT_UNIT,
ROUTE
)
dm_structural <- dm_wide %>%
transmute(
ID,
SEX,
AGE,
WT
)
write_csv(pc_structural, file.path(derivedpath, "pc_structural_clean.csv"))
write_csv(ex_structural, file.path(derivedpath, "ex_structural_clean.csv"))
write_csv(dm_structural, file.path(derivedpath, "dm_structural_clean.csv"))- Do not remove duplicates yet.
- Do not fix spikes yet.
- Do not drop negative concentrations yet.
- Structural QC ensures assembly is possible — not that data are correct.
Strategies
- Read raw data from
courses/foundations-r/data/source_issues/. - Use
ymd_hms()for datetime parsing. - Use
parse_number()for numeric coercion. - Flag parsing failures explicitly.
- Reshape DM with
pivot_wider()into one row per subject. - Standardize obvious covariate units only when required to merge/use them.
- Save results to
courses/foundations-r/data/derived/.
Common Mistakes
- Performing scientific cleaning during structural QC.
- Assuming parsed datetimes are valid without checking failures.
- Forgetting that
parse_number()silently strips characters. - Ignoring failed numeric coercions that become
NA. - Overwriting raw source files instead of exporting derived files.
- Joining DM before reshaping it into one row per subject.
- Applying aggressive cleaning before dataset assembly is possible.
- Standardizing units without documenting the conversion.
Practice Problems
- Confirm that all IDs in PC appear in EX.
- Count missing datetimes in PC and EX after parsing.
- Count how many concentrations required numeric coercion (e.g., contained “<”).
- Verify that DM reshaping produced 1 row per subject.
Missing datetimes:
pc %>% summarise(n_missing = sum(is.na(DT)))# A tibble: 1 × 1
n_missing
<int>
1 8
ex %>% summarise(n_missing = sum(is.na(DOSEDT)))# A tibble: 1 × 1
n_missing
<int>
1 0
Coercion count:
pc %>% summarise(n_coerced = sum(str_detect(CONC_chr, "<")))# A tibble: 1 × 1
n_coerced
<int>
1 3
One row per subject after reshaping:
dm_wide %>% count(ID) %>% filter(n > 1)# A tibble: 0 × 2
# ℹ 2 variables: ID <chr>, n <int>
Summary
- You performed structural QC.
- You parsed datetimes safely.
- You converted mixed-type concentration values to numeric.
- You verified subject alignment.
- You reshaped DM into one row per subject and minimally standardized WT units.
- You exported structurally clean intermediate files for dataset assembly.
- Structural fixes come before scientific cleaning.
- Always save intermediate datasets.
- Never modify raw source files.
- Separate parsing problems from data problems.