Case Setup & Data Overview

Kick off the case study: read the raw DM/EX/PC files, understand the schema, and initiate a structured QC decision log.
Tip

Big picture: Before cleaning or modeling, you must understand what you received.
This lesson is about inventory, structure, and expectations — not fixing.

Learning Objectives

By the end of this lesson, you will be able to:

  • Describe the purpose of the DM, EX, and PC source files.
  • Import raw CSV files reproducibly using here::here().
  • Explain how read_csv() automatically guesses column types.
  • Perform design-level sanity checks using expected vs observed counts.
  • Create and maintain a structured QC decision log.

Key Ideas

  • Raw datasets are inputs, not analysis-ready outputs.
  • Datetime columns are stored as strings in the CSV file.
  • readr::read_csv() may automatically detect and parse ISO datetimes.
  • What you see in R is not always identical to how the raw file is stored.
  • Early structural checks prevent downstream modeling errors.

Setup

library(tidyverse)
library(here)

course_root  <- here("courses", "foundations-r")

source_issues <- here("courses", "foundations-r", "data", "source_issues")
source_clean  <- here("courses", "foundations-r", "data", "source_clean")

dir.create(here("courses", "foundations-r", "analysis", "logs"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(here("courses", "foundations-r", "data", "derived"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(here("courses", "foundations-r", "results"),
           recursive = TRUE, showWarnings = FALSE)
Tip

Note on file paths:
Make sure the folder names match your setup.
You may need to adjust the paths (inside here()) so they point to where you saved the files.


Worked Example 1: Read Raw Files

We first load the raw source files exactly as received.

At this stage, we are not cleaning or modifying anything — we are only bringing the files into R so we can inspect their structure.

dm <- read_csv(file.path(source_issues, "dm.csv"))
ex <- read_csv(file.path(source_issues, "ex.csv"))
pc <- read_csv(file.path(source_issues, "pc.csv"))

list(dm = dim(dm), ex = dim(ex), pc = dim(pc))
$dm
[1] 100   5

$ex
[1] 50  5

$pc
[1] 458   6

The output gives the number of rows and columns in each file. This is the first quick check that the files loaded and have the expected general size.

read_csv() also automatically guesses column types based on the data. That is convenient, but we should still verify those types ourselves.


Worked Example 2: Inspect Structure

Next, we inspect the structure of each dataset.

glimpse() shows the column names, column types, and a preview of values. This helps us understand what each file contains before we make any changes.

glimpse(dm)
Rows: 100
Columns: 5
$ ID    <chr> "001", "001", "002", "002", "003", "003", "004", "004", "005", "…
$ SEX   <chr> "M", "M", "male", "male", "M", "M", "male", "male", "M", "M", "F…
$ COV   <chr> "AGE", "WT", "AGE", "WT", "AGE", "WT", "AGE", "WT", "AGE", "WT",…
$ VALUE <dbl> 42.0, 70.6, 34.0, 75.5, 45.0, 67.2, 37.0, 111.0, 59.0, 83.8, 43.…
$ UNIT  <chr> "years", "kg", "years", "kg", "years", "kg", "years", "kg", "yea…
glimpse(ex)
Rows: 50
Columns: 5
$ ID       <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009"…
$ TIME     <dttm> 2026-02-04 09:19:00, 2026-01-20 08:51:00, 2026-01-21 08:00:0…
$ AMT      <chr> "0", "0", "0", "0", "0", "0", "0 mg", "0", "0", "0", "12.5", …
$ AMT_UNIT <chr> "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "MG", "…
$ ROUTE    <chr> "oral", "oral", "oral", "oral", "oral", "oral", "oral", "oral…
glimpse(pc)
Rows: 458
Columns: 6
$ ID        <chr> "006", "050", "022", "026", "044", "016", "022", "048", "006…
$ TIME      <dttm> 2026-01-16 20:18:57, 2026-01-31 17:40:38, 2026-01-15 08:43:…
$ CONC      <chr> "0", "1.53", "0.205", "0.107", "0.941", "0.0629", "0.226", "…
$ CONC_UNIT <chr> "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/…
$ BLQ       <chr> "BLQ", NA, NA, NA, NA, NA, NA, NA, "BLQ", NA, NA, NA, NA, NA…
$ LLOQ      <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, …

Important Observation

Even though the CSV stores TIME as a string, read_csv() may display it as <dttm> if it detects ISO format.

This does not mean the raw file contained a datetime object — it means R parsed it automatically.

We will deliberately validate and control types during Structural QC.

DM Schema Note

The DM file is long (tidy) and includes:

  • ID (subject identifier)
  • SEX (categorical, stored as its own column)
  • COV (covariate name, e.g., AGE, WT)
  • VALUE (numeric covariate value)
  • UNIT (covariate unit, e.g., years, kg, sometimes lb)

That structure is intentional — we’ll use it to practice pivot_wider() before we merge demographics into the modeling dataset.


Worked Example 3: Design-Level Sanity Checks

Before modeling, we should confirm that the datasets roughly match the study design expectations.

Simple count checks often detect major data problems early, such as missing subjects, extra records, duplicate samples, or incomplete covariate information.

From the study design:

  • 50 subjects
  • 1 dose per subject
  • 9 PK samples per subject (0–24h)
  • 2 numeric DM covariate rows per subject (AGE and WT)

Expected vs Observed

We compare the expected number of rows to the observed number of rows in each file.

expected_ex_rows <- 50
expected_pc_rows <- 50 * 9
expected_dm_rows <- 50 * 2  # AGE + WT are long rows per subject

tibble(
  file = c("dm", "ex", "pc"),
  expected = c(expected_dm_rows, expected_ex_rows, expected_pc_rows),
  observed = c(nrow(dm), nrow(ex), nrow(pc))
)
# A tibble: 3 × 3
  file  expected observed
  <chr>    <dbl>    <int>
1 dm         100      100
2 ex          50       50
3 pc         450      458

Unique Subjects per File

Each file should represent the same set of subjects. We start by checking the number of unique subject IDs in each file.

tibble(
  file = c("dm", "ex", "pc"),
  n_subjects = c(
    n_distinct(dm$ID),
    n_distinct(ex$ID),
    n_distinct(pc$ID)
  )
)
# A tibble: 3 × 2
  file  n_subjects
  <chr>      <int>
1 dm            50
2 ex            50
3 pc            50

DM Covariate Inventory

Because DM is in long format, we count the covariate names to confirm which covariates are present.

dm %>% count(COV)
# A tibble: 2 × 2
  COV       n
  <chr> <int>
1 AGE      50
2 WT       50

Weight Unit Check (DM)

Unit checks are important before cleaning. Here, we check whether all weight values use the same unit.

dm %>% 
  filter(COV == "WT") %>%
  count(UNIT)
# A tibble: 2 × 2
  UNIT      n
  <chr> <int>
1 kg       49
2 lb        1

Dose-Level Balance

We check the dose amounts to confirm that subjects received the expected dose.

ex %>% count(AMT)
# A tibble: 6 × 2
  AMT       n
  <chr> <int>
1 0         9
2 0 mg      1
3 100      10
4 12.5     10
5 25       10
6 50       10

Observation Counts per Subject

Finally, we check whether each subject has the expected number of PK concentration records.

pc %>% count(ID) %>% count(n)
# A tibble: 2 × 2
      n    nn
  <int> <int>
1     9    42
2    10     8

Worked Example 4: Quick Type Diagnostics

A column may visually look correct while still having the wrong underlying type.

Quick type checks help identify issues before downstream processing, especially for datetime, dose, and concentration columns.

typeof(ex$TIME)
[1] "double"
typeof(pc$TIME)
[1] "double"
typeof(pc$CONC)
[1] "character"
typeof(ex$AMT)
[1] "character"
typeof(dm$SEX)
[1] "character"
typeof(dm$COV)
[1] "character"
typeof(dm$VALUE)
[1] "double"
typeof(dm$UNIT)
[1] "character"

Some concentration columns may contain non-numeric values such as below-quantification-limit strings. We do not fix them yet, but we check whether they exist.

Detect “<” patterns in concentration:

pc %>% 
  mutate(CONC_chr = as.character(CONC)) %>%
  filter(str_detect(CONC_chr, "<")) %>%
  slice_head(n = 5)
# A tibble: 3 × 7
  ID    TIME                CONC  CONC_UNIT BLQ    LLOQ CONC_chr
  <chr> <dttm>              <chr> <chr>     <chr> <dbl> <chr>   
1 049   2026-01-20 09:01:00 <0.05 mg/L      BLQ    0.05 <0.05   
2 040   2026-01-29 09:26:00 <0.05 mg/L      BLQ    0.05 <0.05   
3 046   2026-01-16 09:47:00 <0.05 mg/L      BLQ    0.05 <0.05   

Warning
  • Do not parse or mutate columns yet.
  • Do not overwrite raw source files.
  • Do not assume types are correct because the file loaded successfully.
  • Do not skip count checks — they catch major issues early.

Worked Example 5: Create a QC Decision Log

Good workflows document issues as they are discovered.

A QC decision log creates a reproducible record of what was found, what evidence supports the issue, and how the issue will be handled later.

log_path <- here("courses", "foundations-r", "analysis", "logs", "qc_decision_log.csv")

if (!file.exists(log_path)) {
  qc_log <- tibble(
    issue_id = character(),
    file = character(),
    issue = character(),
    evidence = character(),
    decision = character(),
    status = character(),
    notes = character()
  )
  write_csv(qc_log, log_path)
}

qc_log <- read_csv(log_path, show_col_types = FALSE)

qc_log <- bind_rows(
  qc_log,
  tibble(
    issue_id = "ISSUE-001",
    file = "pc",
    issue = "TIME auto-parsed as datetime by read_csv().",
    evidence = "glimpse(pc) shows <dttm> though CSV stores strings.",
    decision = "Re-validate and standardize datetime parsing during Structural QC.",
    status = "open",
    notes = ""
  ),
  tibble(
    issue_id = "ISSUE-002",
    file = "dm",
    issue = "DM is in long form (ID, SEX, COV, VALUE, UNIT) and must be reshaped before joining.",
    evidence = "glimpse(dm) shows repeated IDs across covariates (COV = AGE, WT).",
    decision = "Use pivot_wider() to create one row per subject before merging DM into EX/PC.",
    status = "open",
    notes = ""
  ),
  tibble(
    issue_id = "ISSUE-003",
    file = "dm",
    issue = "Potential unit mismatch in weight (WT) for some subjects.",
    evidence = "dm %>% filter(COV=='WT') %>% count(UNIT) shows more than one unit (e.g., kg and lb).",
    decision = "Standardize WT units (convert lb -> kg) during covariate cleaning.",
    status = "open",
    notes = ""
  )
) %>% distinct(issue_id, .keep_all = TRUE)

write_csv(qc_log, log_path)

qc_log
# A tibble: 7 × 7
  issue_id  file            issue                 evidence decision status notes
  <chr>     <chr>           <chr>                 <chr>    <chr>    <chr>  <chr>
1 ISSUE-001 pc              TIME auto-parsed as … glimpse… Re-vali… open   <NA> 
2 ISSUE-002 dm              DM is in long form (… glimpse… Use piv… open   <NA> 
3 ISSUE-003 dm              Potential unit misma… dm %>% … Standar… open   <NA> 
4 ISSUE-005 analysis_prelim Observation rows wit… Saved: … Drop ob… closed <NA> 
5 ISSUE-006 analysis_prelim Exact duplicate obse… Saved: … Keep fi… closed <NA> 
6 ISSUE-007 analysis_prelim Negative DV values i… Saved: … Set DV … closed <NA> 
7 ISSUE-008 analysis_prelim WT outlier suggests … Saved: … Convert… closed <NA> 

The goal is not to fix every issue immediately. The goal is to make sure important findings are not lost before the formal cleaning and structural QC steps.


Strategies

  • Separate folders (course-scoped):
    • courses/foundations-r/data/source_* → raw data (never edited)
    • courses/foundations-r/data/derived/ → cleaned data
    • courses/foundations-r/analysis/logs/ → QC logs
    • courses/foundations-r/results/ → modeling outputs
  • Always inspect:
    • glimpse()
    • nrow()
    • n_distinct(ID)
  • Compare observed counts to design expectations.
  • Log issues before modifying data.

Common Mistakes

  • Assuming a file is ready for analysis just because it imports successfully.
  • Trusting auto-detected column types without checking them.
  • Skipping expected vs observed row-count checks.
  • Modifying raw source files instead of creating derived files.
  • Joining DM before recognizing that it is in long format.
  • Ignoring unit inconsistencies such as kg vs lb.
  • Fixing issues without recording them in the QC decision log.

Practice Problems

  1. Load the clean dataset and compare its schema to the issues dataset.
  2. Verify ID overlap across DM, EX, and PC.
  3. Count duplicates in PC by ID and TIME.
  4. In DM, verify you have exactly 2 covariate rows per subject (AGE and WT) and identify any unexpected units.
  5. Add at least one additional QC log entry.

Duplicate check:

pc %>%
  count(ID, TIME) %>%
  filter(n > 1)
# A tibble: 11 × 3
   ID    TIME                    n
   <chr> <dttm>              <int>
 1 001   2026-02-04 11:20:35     2
 2 002   2026-01-20 14:50:59     2
 3 003   2026-01-21 12:00:00     2
 4 009   2026-02-03 16:44:37     2
 5 009   NA                      2
 6 018   2026-01-27 20:17:18     2
 7 026   2026-01-31 20:28:39     2
 8 038   2026-01-29 15:06:50     2
 9 039   NA                      2
10 041   2026-02-01 11:37:00     2
11 049   NA                      2

ID overlap:

anti_join(dm %>% distinct(ID),
          ex %>% distinct(ID),
          by = "ID")
# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>

DM rows per subject:

dm %>% 
  count(ID) %>% 
  count(n)
# A tibble: 1 × 2
      n    nn
  <int> <int>
1     2    50

WT unit scan:

dm %>% 
  filter(COV == "WT") %>%
  count(UNIT)
# A tibble: 2 × 2
  UNIT      n
  <chr> <int>
1 kg       49
2 lb        1

Summary

  • You inspected raw DM, EX, and PC files.
  • You recognized that read_csv() automatically guesses column types.
  • You performed design-level sanity checks.
  • You created a QC decision log to track structural issues.

  • Raw CSV format ≠ how R displays it after import.
  • Always compare observed row counts to expected design counts.
  • Log issues before resolving them.
  • Separate raw, derived, and results folders.