library(tidyverse)
library(here)
course_root <- here("courses", "foundations-r")
source_issues <- here("courses", "foundations-r", "data", "source_issues")
source_clean <- here("courses", "foundations-r", "data", "source_clean")
dir.create(here("courses", "foundations-r", "analysis", "logs"),
recursive = TRUE, showWarnings = FALSE)
dir.create(here("courses", "foundations-r", "data", "derived"),
recursive = TRUE, showWarnings = FALSE)
dir.create(here("courses", "foundations-r", "results"),
recursive = TRUE, showWarnings = FALSE)Case Setup & Data Overview
Big picture: Before cleaning or modeling, you must understand what you received.
This lesson is about inventory, structure, and expectations — not fixing.
Learning Objectives
By the end of this lesson, you will be able to:
- Describe the purpose of the DM, EX, and PC source files.
- Import raw CSV files reproducibly using
here::here(). - Explain how
read_csv()automatically guesses column types. - Perform design-level sanity checks using expected vs observed counts.
- Create and maintain a structured QC decision log.
Key Ideas
- Raw datasets are inputs, not analysis-ready outputs.
- Datetime columns are stored as strings in the CSV file.
readr::read_csv()may automatically detect and parse ISO datetimes.- What you see in R is not always identical to how the raw file is stored.
- Early structural checks prevent downstream modeling errors.
Setup
Note on file paths:
Make sure the folder names match your setup.
You may need to adjust the paths (inside here()) so they point to where you saved the files.
Worked Example 1: Read Raw Files
We first load the raw source files exactly as received.
At this stage, we are not cleaning or modifying anything — we are only bringing the files into R so we can inspect their structure.
dm <- read_csv(file.path(source_issues, "dm.csv"))
ex <- read_csv(file.path(source_issues, "ex.csv"))
pc <- read_csv(file.path(source_issues, "pc.csv"))
list(dm = dim(dm), ex = dim(ex), pc = dim(pc))$dm
[1] 100 5
$ex
[1] 50 5
$pc
[1] 458 6
The output gives the number of rows and columns in each file. This is the first quick check that the files loaded and have the expected general size.
read_csv() also automatically guesses column types based on the data. That is convenient, but we should still verify those types ourselves.
Worked Example 2: Inspect Structure
Next, we inspect the structure of each dataset.
glimpse() shows the column names, column types, and a preview of values. This helps us understand what each file contains before we make any changes.
glimpse(dm)Rows: 100
Columns: 5
$ ID <chr> "001", "001", "002", "002", "003", "003", "004", "004", "005", "…
$ SEX <chr> "M", "M", "male", "male", "M", "M", "male", "male", "M", "M", "F…
$ COV <chr> "AGE", "WT", "AGE", "WT", "AGE", "WT", "AGE", "WT", "AGE", "WT",…
$ VALUE <dbl> 42.0, 70.6, 34.0, 75.5, 45.0, 67.2, 37.0, 111.0, 59.0, 83.8, 43.…
$ UNIT <chr> "years", "kg", "years", "kg", "years", "kg", "years", "kg", "yea…
glimpse(ex)Rows: 50
Columns: 5
$ ID <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009"…
$ TIME <dttm> 2026-02-04 09:19:00, 2026-01-20 08:51:00, 2026-01-21 08:00:0…
$ AMT <chr> "0", "0", "0", "0", "0", "0", "0 mg", "0", "0", "0", "12.5", …
$ AMT_UNIT <chr> "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "MG", "…
$ ROUTE <chr> "oral", "oral", "oral", "oral", "oral", "oral", "oral", "oral…
glimpse(pc)Rows: 458
Columns: 6
$ ID <chr> "006", "050", "022", "026", "044", "016", "022", "048", "006…
$ TIME <dttm> 2026-01-16 20:18:57, 2026-01-31 17:40:38, 2026-01-15 08:43:…
$ CONC <chr> "0", "1.53", "0.205", "0.107", "0.941", "0.0629", "0.226", "…
$ CONC_UNIT <chr> "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/…
$ BLQ <chr> "BLQ", NA, NA, NA, NA, NA, NA, NA, "BLQ", NA, NA, NA, NA, NA…
$ LLOQ <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, …
Important Observation
Even though the CSV stores TIME as a string, read_csv() may display it as <dttm> if it detects ISO format.
This does not mean the raw file contained a datetime object — it means R parsed it automatically.
We will deliberately validate and control types during Structural QC.
DM Schema Note
The DM file is long (tidy) and includes:
ID(subject identifier)SEX(categorical, stored as its own column)COV(covariate name, e.g.,AGE,WT)VALUE(numeric covariate value)UNIT(covariate unit, e.g.,years,kg, sometimeslb)
That structure is intentional — we’ll use it to practice pivot_wider() before we merge demographics into the modeling dataset.
Worked Example 3: Design-Level Sanity Checks
Before modeling, we should confirm that the datasets roughly match the study design expectations.
Simple count checks often detect major data problems early, such as missing subjects, extra records, duplicate samples, or incomplete covariate information.
From the study design:
- 50 subjects
- 1 dose per subject
- 9 PK samples per subject (0–24h)
- 2 numeric DM covariate rows per subject (AGE and WT)
Expected vs Observed
We compare the expected number of rows to the observed number of rows in each file.
expected_ex_rows <- 50
expected_pc_rows <- 50 * 9
expected_dm_rows <- 50 * 2 # AGE + WT are long rows per subject
tibble(
file = c("dm", "ex", "pc"),
expected = c(expected_dm_rows, expected_ex_rows, expected_pc_rows),
observed = c(nrow(dm), nrow(ex), nrow(pc))
)# A tibble: 3 × 3
file expected observed
<chr> <dbl> <int>
1 dm 100 100
2 ex 50 50
3 pc 450 458
Unique Subjects per File
Each file should represent the same set of subjects. We start by checking the number of unique subject IDs in each file.
tibble(
file = c("dm", "ex", "pc"),
n_subjects = c(
n_distinct(dm$ID),
n_distinct(ex$ID),
n_distinct(pc$ID)
)
)# A tibble: 3 × 2
file n_subjects
<chr> <int>
1 dm 50
2 ex 50
3 pc 50
DM Covariate Inventory
Because DM is in long format, we count the covariate names to confirm which covariates are present.
dm %>% count(COV)# A tibble: 2 × 2
COV n
<chr> <int>
1 AGE 50
2 WT 50
Weight Unit Check (DM)
Unit checks are important before cleaning. Here, we check whether all weight values use the same unit.
dm %>%
filter(COV == "WT") %>%
count(UNIT)# A tibble: 2 × 2
UNIT n
<chr> <int>
1 kg 49
2 lb 1
Dose-Level Balance
We check the dose amounts to confirm that subjects received the expected dose.
ex %>% count(AMT)# A tibble: 6 × 2
AMT n
<chr> <int>
1 0 9
2 0 mg 1
3 100 10
4 12.5 10
5 25 10
6 50 10
Observation Counts per Subject
Finally, we check whether each subject has the expected number of PK concentration records.
pc %>% count(ID) %>% count(n)# A tibble: 2 × 2
n nn
<int> <int>
1 9 42
2 10 8
Worked Example 4: Quick Type Diagnostics
A column may visually look correct while still having the wrong underlying type.
Quick type checks help identify issues before downstream processing, especially for datetime, dose, and concentration columns.
typeof(ex$TIME)[1] "double"
typeof(pc$TIME)[1] "double"
typeof(pc$CONC)[1] "character"
typeof(ex$AMT)[1] "character"
typeof(dm$SEX)[1] "character"
typeof(dm$COV)[1] "character"
typeof(dm$VALUE)[1] "double"
typeof(dm$UNIT)[1] "character"
Some concentration columns may contain non-numeric values such as below-quantification-limit strings. We do not fix them yet, but we check whether they exist.
Detect “<” patterns in concentration:
pc %>%
mutate(CONC_chr = as.character(CONC)) %>%
filter(str_detect(CONC_chr, "<")) %>%
slice_head(n = 5)# A tibble: 3 × 7
ID TIME CONC CONC_UNIT BLQ LLOQ CONC_chr
<chr> <dttm> <chr> <chr> <chr> <dbl> <chr>
1 049 2026-01-20 09:01:00 <0.05 mg/L BLQ 0.05 <0.05
2 040 2026-01-29 09:26:00 <0.05 mg/L BLQ 0.05 <0.05
3 046 2026-01-16 09:47:00 <0.05 mg/L BLQ 0.05 <0.05
- Do not parse or mutate columns yet.
- Do not overwrite raw source files.
- Do not assume types are correct because the file loaded successfully.
- Do not skip count checks — they catch major issues early.
Worked Example 5: Create a QC Decision Log
Good workflows document issues as they are discovered.
A QC decision log creates a reproducible record of what was found, what evidence supports the issue, and how the issue will be handled later.
log_path <- here("courses", "foundations-r", "analysis", "logs", "qc_decision_log.csv")
if (!file.exists(log_path)) {
qc_log <- tibble(
issue_id = character(),
file = character(),
issue = character(),
evidence = character(),
decision = character(),
status = character(),
notes = character()
)
write_csv(qc_log, log_path)
}
qc_log <- read_csv(log_path, show_col_types = FALSE)
qc_log <- bind_rows(
qc_log,
tibble(
issue_id = "ISSUE-001",
file = "pc",
issue = "TIME auto-parsed as datetime by read_csv().",
evidence = "glimpse(pc) shows <dttm> though CSV stores strings.",
decision = "Re-validate and standardize datetime parsing during Structural QC.",
status = "open",
notes = ""
),
tibble(
issue_id = "ISSUE-002",
file = "dm",
issue = "DM is in long form (ID, SEX, COV, VALUE, UNIT) and must be reshaped before joining.",
evidence = "glimpse(dm) shows repeated IDs across covariates (COV = AGE, WT).",
decision = "Use pivot_wider() to create one row per subject before merging DM into EX/PC.",
status = "open",
notes = ""
),
tibble(
issue_id = "ISSUE-003",
file = "dm",
issue = "Potential unit mismatch in weight (WT) for some subjects.",
evidence = "dm %>% filter(COV=='WT') %>% count(UNIT) shows more than one unit (e.g., kg and lb).",
decision = "Standardize WT units (convert lb -> kg) during covariate cleaning.",
status = "open",
notes = ""
)
) %>% distinct(issue_id, .keep_all = TRUE)
write_csv(qc_log, log_path)
qc_log# A tibble: 7 × 7
issue_id file issue evidence decision status notes
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ISSUE-001 pc TIME auto-parsed as … glimpse… Re-vali… open <NA>
2 ISSUE-002 dm DM is in long form (… glimpse… Use piv… open <NA>
3 ISSUE-003 dm Potential unit misma… dm %>% … Standar… open <NA>
4 ISSUE-005 analysis_prelim Observation rows wit… Saved: … Drop ob… closed <NA>
5 ISSUE-006 analysis_prelim Exact duplicate obse… Saved: … Keep fi… closed <NA>
6 ISSUE-007 analysis_prelim Negative DV values i… Saved: … Set DV … closed <NA>
7 ISSUE-008 analysis_prelim WT outlier suggests … Saved: … Convert… closed <NA>
The goal is not to fix every issue immediately. The goal is to make sure important findings are not lost before the formal cleaning and structural QC steps.
Strategies
- Separate folders (course-scoped):
courses/foundations-r/data/source_*→ raw data (never edited)courses/foundations-r/data/derived/→ cleaned datacourses/foundations-r/analysis/logs/→ QC logscourses/foundations-r/results/→ modeling outputs
- Always inspect:
glimpse()nrow()n_distinct(ID)
- Compare observed counts to design expectations.
- Log issues before modifying data.
Common Mistakes
- Assuming a file is ready for analysis just because it imports successfully.
- Trusting auto-detected column types without checking them.
- Skipping expected vs observed row-count checks.
- Modifying raw source files instead of creating derived files.
- Joining DM before recognizing that it is in long format.
- Ignoring unit inconsistencies such as
kgvslb. - Fixing issues without recording them in the QC decision log.
Practice Problems
- Load the clean dataset and compare its schema to the issues dataset.
- Verify ID overlap across DM, EX, and PC.
- Count duplicates in PC by
IDandTIME. - In DM, verify you have exactly 2 covariate rows per subject (AGE and WT) and identify any unexpected units.
- Add at least one additional QC log entry.
Duplicate check:
pc %>%
count(ID, TIME) %>%
filter(n > 1)# A tibble: 11 × 3
ID TIME n
<chr> <dttm> <int>
1 001 2026-02-04 11:20:35 2
2 002 2026-01-20 14:50:59 2
3 003 2026-01-21 12:00:00 2
4 009 2026-02-03 16:44:37 2
5 009 NA 2
6 018 2026-01-27 20:17:18 2
7 026 2026-01-31 20:28:39 2
8 038 2026-01-29 15:06:50 2
9 039 NA 2
10 041 2026-02-01 11:37:00 2
11 049 NA 2
ID overlap:
anti_join(dm %>% distinct(ID),
ex %>% distinct(ID),
by = "ID")# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>
DM rows per subject:
dm %>%
count(ID) %>%
count(n)# A tibble: 1 × 2
n nn
<int> <int>
1 2 50
WT unit scan:
dm %>%
filter(COV == "WT") %>%
count(UNIT)# A tibble: 2 × 2
UNIT n
<chr> <int>
1 kg 49
2 lb 1
Summary
- You inspected raw DM, EX, and PC files.
- You recognized that
read_csv()automatically guesses column types. - You performed design-level sanity checks.
- You created a QC decision log to track structural issues.
- Raw CSV format ≠ how R displays it after import.
- Always compare observed row counts to expected design counts.
- Log issues before resolving them.
- Separate raw, derived, and results folders.