Preliminary Dataset Assembly

Assemble a preliminary event-style dataset from DM, EX, and structurally cleaned PC. Derive TIME and NTIME, and propagate DOSE for QC and plotting.
Tip

What you’ll build today: a preliminary analysis dataset you can plot and diagnose.
We assemble after minimal structural fixes, then use visualization to discover deeper issues (duplicates, spikes, misassigned times).

Learning Objectives

By the end of this lesson, you will be able to:

  • Load structurally cleaned DM/EX/PC (from the previous lesson).
  • Compute TIME (hours after dose) and keep the dataset modeling-focused.
  • Compute NTIME (nominal protocol time) using nearest-protocol binning.
  • Build an event-style dataset with EVID, CMT, AMT, and DV.
  • Propagate DOSE as a subject-level covariate using fill().
  • Export a preliminary assembled dataset to courses/foundations-r/data/derived/.

Key Ideas

  • Assemble first, then diagnose. A preliminary dataset can contain problems — that’s the point.
  • TIME (actual time after dose) drives modeling and diagnostics.
  • NTIME (nominal time) supports time-binned summaries and reporting.
  • DOSE behaves like a covariate (useful for stratified plots and checks), so we store it on every row.
  • Event-style columns (EVID, AMT, CMT, DV) follow a common clinical-data convention and make downstream workflows easier.

Setup

library(tidyverse)
library(here)

derivedpath <- here("courses", "foundations-r", "data", "derived")
resultspath <- here("courses", "foundations-r", "results", "qc")

dir.create(derivedpath, recursive = TRUE, showWarnings = FALSE)
dir.create(resultspath, recursive = TRUE, showWarnings = FALSE)

pc <- read_csv(file.path(derivedpath, "pc_structural_clean.csv"))
dm <- read_csv(file.path(derivedpath, "dm_structural_clean.csv"))
ex <- read_csv(file.path(derivedpath, "ex_structural_clean.csv"))

Worked Example 1: Confirm IDs Before Joining

Before creating the analysis dataset, we confirm that the same subjects appear in DM, EX, and PC.

If IDs don’t line up, joins may silently drop records or create incomplete subject profiles.

anti_join(dm %>% distinct(ID), ex %>% distinct(ID), by = "ID")
# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>
anti_join(dm %>% distinct(ID), pc %>% distinct(ID), by = "ID")
# A tibble: 0 × 1
# ℹ 1 variable: ID <chr>

No rows should return. If rows appear, those subjects need to be investigated before assembly.


Worked Example 2: Ensure Dose is Available and Usable

In the previous lesson, EX was exported with:

  • DOSEDT (parsed datetime)
  • AMT (numeric dose amount)

For this assembled dataset, the dose amount will be used both as an event value (AMT) and as a subject-level covariate (DOSE).

We’ll rename AMT to DOSE for clarity.

ex2 <- ex %>%
  transmute(
    ID,
    DOSEDT,
    DOSE = AMT
  )

Quick checks:

ex2 %>% filter(is.na(DOSEDT))
# A tibble: 0 × 3
# ℹ 3 variables: ID <chr>, DOSEDT <dttm>, DOSE <dbl>
ex2 %>% filter(is.na(DOSE))
# A tibble: 0 × 3
# ℹ 3 variables: ID <chr>, DOSEDT <dttm>, DOSE <dbl>

Both checks should return no rows. Missing dose dates or dose amounts would block reliable TIME calculation and event-row creation.


Worked Example 3: Compute TIME (Hours After Dose)

In a previous lesson, PC was exported with:

  • DT (parsed datetime)
  • CONC (numeric)
  • BLQ, LLOQ

TIME is the actual elapsed time after dose. This is the main time variable used for modeling and diagnostic plots.

obs0 <- pc %>%
  left_join(ex2, by = "ID") %>%
  mutate(
    TIME = as.numeric(difftime(DT, DOSEDT, units = "hours"))
  ) %>%
  select(-DT, -DOSEDT)

Inspect a narrow view (so TIME is easy to see):

obs0 %>%
  select(ID, TIME, DOSE, CONC, BLQ, LLOQ) %>%
  slice_head(n = 10)
# A tibble: 10 × 6
   ID      TIME  DOSE   CONC BLQ    LLOQ
   <chr>  <dbl> <dbl>  <dbl> <chr> <dbl>
 1 006   12.0     0   0      BLQ    0.05
 2 050    7.99  100   1.53   <NA>   0.05
 3 022    0.572  25   0.205  <NA>   0.05
 4 026    0.391  25   0.107  <NA>   0.05
 5 044    5.93  100   0.941  <NA>   0.05
 6 016    0.428  12.5 0.0629 <NA>   0.05
 7 022    1.96   25   0.226  <NA>   0.05
 8 048    8.02  100   0.496  <NA>   0.05
 9 006    5.97    0   0      BLQ    0.05
10 019   NA      12.5 0.052  <NA>   0.05

Sanity checks:

obs0 %>% filter(is.na(TIME)) %>% select(ID, TIME) %>% slice_head(n = 10)
# A tibble: 8 × 2
  ID     TIME
  <chr> <dbl>
1 019      NA
2 039      NA
3 049      NA
4 046      NA
5 009      NA
6 009      NA
7 039      NA
8 049      NA
obs0 %>% filter(!is.na(TIME) & TIME < 0) %>% select(ID, TIME) %>% slice_head(n = 10)
# A tibble: 0 × 2
# ℹ 2 variables: ID <chr>, TIME <dbl>

At this stage, a small number of missing or problematic TIME values may still exist.

That is expected in a preliminary assembly workflow — the goal here is to surface issues, not immediately fix every record.

Missing or negative TIME values usually indicate:

  • datetime parsing failures,
  • missing dose information,
  • ID join problems,
  • or incorrect sample/dose ordering.

Worked Example 4: Create NTIME by Nearest Protocol Binning

NTIME is the nominal protocol time. It is useful for summaries, tables, and binned plots when actual sampling times have small deviations.

Protocol grid (nominal times):

prot <- c(0, 0.5, 1, 2, 4, 6, 8, 12, 24)

Nearest-bin assignment (NA-safe):

obs0 <- obs0 %>%
  rowwise() %>%
  mutate(
    NTIME = if (is.na(TIME)) NA_real_
    else prot[which.min(abs(TIME - prot))]
  ) %>%
  ungroup()

Inspect:

obs0 %>%
  select(ID, TIME, NTIME) %>%
  slice_head(n = 10)
# A tibble: 10 × 3
   ID      TIME NTIME
   <chr>  <dbl> <dbl>
 1 006   12.0    12  
 2 050    7.99    8  
 3 022    0.572   0.5
 4 026    0.391   0.5
 5 044    5.93    6  
 6 016    0.428   0.5
 7 022    1.96    2  
 8 048    8.02    8  
 9 006    5.97    6  
10 019   NA      NA  

At this stage, NTIME is only a convenient grouping variable. We still keep TIME for modeling and diagnostics.


Worked Example 5: Attach Covariates from DM

DM is already 1 row per subject, so it can now be safely joined to the observation records.

This adds subject-level covariates such as AGE, WT, and SEX.

obs0 <- obs0 %>%
  left_join(dm, by = "ID")
head(obs0)
# A tibble: 6 × 11
  ID      CONC CONCUNIT BLQ    LLOQ  DOSE   TIME NTIME SEX     AGE    WT
  <chr>  <dbl> <chr>    <chr> <dbl> <dbl>  <dbl> <dbl> <chr> <dbl> <dbl>
1 006   0      mg/L     BLQ    0.05   0   12.0    12   F        43  60.7
2 050   1.53   mg/L     <NA>   0.05 100    7.99    8   F        43  53.1
3 022   0.205  mg/L     <NA>   0.05  25    0.572   0.5 M        51  73.2
4 026   0.107  mg/L     <NA>   0.05  25    0.391   0.5 F        34  81  
5 044   0.941  mg/L     <NA>   0.05 100    5.93    6   M        54  61.2
6 016   0.0629 mg/L     <NA>   0.05  12.5  0.428   0.5 F        34  66.9

Worked Example 6: Build Event-Style Rows

Now we convert the data into an event-style format.

Observation rows represent measured concentrations:

  • EVID = 0
  • CMT = 2
  • AMT = 0
  • DV = CONC
obs <- obs0 %>%
  transmute(
    ID,
    DOSE,
    TIME,
    NTIME,
    EVID = 0,
    CMT  = 2,
    AMT  = 0,
    DV   = CONC,
    AGE,
    WT,
    SEX,
    BLQ,
    LLOQ
  )

Dose rows represent dosing events. There should be one dose row per subject:

  • EVID = 1
  • CMT = 1
  • AMT = DOSE
  • DV = NA
dose <- ex2 %>%
  transmute(
    ID,
    DOSE,
    TIME  = 0,
    NTIME = 0,
    EVID = 1,
    CMT  = 1,
    AMT  = DOSE,
    DV   = NA_real_
  ) %>%
  left_join(dm, by = "ID")

Combine and order so dose rows come before observations at TIME = 0:

analysisprelim <- bind_rows(obs, dose) %>%
  arrange(ID, TIME, desc(EVID))
head(analysisprelim)
# A tibble: 6 × 13
  ID     DOSE  TIME NTIME  EVID   CMT   AMT    DV   AGE    WT SEX   BLQ    LLOQ
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 001       0 0       0       1     1     0    NA    42  70.6 M     <NA>  NA   
2 001       0 0       0       0     2     0     0    42  70.6 M     BLQ    0.05
3 001       0 0.487   0.5     0     2     0     0    42  70.6 M     BLQ    0.05
4 001       0 1.04    1       0     2     0     0    42  70.6 M     BLQ    0.05
5 001       0 2.03    2       0     2     0     0    42  70.6 M     BLQ    0.05
6 001       0 2.03    2       0     2     0     0    42  70.6 M     BLQ    0.05

This gives us a preliminary event-style dataset that can be checked, plotted, and refined in later lessons.


Worked Example 7: Propagate DOSE as a Covariate

In event-style datasets, it’s common to store subject-level covariates on every row.

Here, DOSE is useful for stratified plots and QC summaries, so we fill it within each subject.

analysisprelim <- analysisprelim %>%
  group_by(ID) %>%
  fill(DOSE, .direction = "downup") %>%
  ungroup()
head(analysisprelim)
# A tibble: 6 × 13
  ID     DOSE  TIME NTIME  EVID   CMT   AMT    DV   AGE    WT SEX   BLQ    LLOQ
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 001       0 0       0       1     1     0    NA    42  70.6 M     <NA>  NA   
2 001       0 0       0       0     2     0     0    42  70.6 M     BLQ    0.05
3 001       0 0.487   0.5     0     2     0     0    42  70.6 M     BLQ    0.05
4 001       0 1.04    1       0     2     0     0    42  70.6 M     BLQ    0.05
5 001       0 2.03    2       0     2     0     0    42  70.6 M     BLQ    0.05
6 001       0 2.03    2       0     2     0     0    42  70.6 M     BLQ    0.05

Quick check:

analysisprelim %>%
  distinct(ID, DOSE) %>%
  count(DOSE) %>%
  arrange(DOSE)
# A tibble: 5 × 2
   DOSE     n
  <dbl> <int>
1   0      10
2  12.5    10
3  25      10
4  50      10
5 100      10

Worked Example 8: Record Counts per Subject

Expected design (clean world):

  • 1 dose record
  • 9 observation records
    10 total records per subject

Let’s check what we actually have:

analysisprelim %>%
  count(ID) %>%
  count(n) %>%
  arrange(n)
# A tibble: 2 × 2
      n    nn
  <int> <int>
1    10    42
2    11     8

If you see a second row (e.g., some subjects have 11 records), that usually means duplicates exist in PC.
That’s okay here — we have not removed duplicates yet. We will diagnose them in the visualization lesson and fix them with justification later.


Export Preliminary Dataset

Finally, we save the assembled dataset for plotting-driven QC in the next lesson.

This file is still preliminary. It may contain duplicates, outliers, or other issues that we have not fixed yet.

write_csv(analysisprelim, file.path(derivedpath, "analysis_prelim.csv"))

Warning
  • Do not remove duplicates yet (extra records per subject are a diagnostic clue).
  • Do not drop spikes or negative DV yet.
  • Do not “fix” time misassignments yet.
  • This dataset is intentionally preliminary so QC plots can reveal what needs attention.

Strategies

  • Use the structurally cleaned files from the previous lesson as the starting point.
  • Compute TIME, then keep the dataset modeling-focused.
  • Assign NTIME by nearest-protocol matching (robust to small jitter).
  • Create observation rows and dose rows separately, then bind_rows().
  • After binding, fill DOSE within subject so every row can be stratified by dose.
  • Export a preliminary dataset for plotting-driven QC.

Common Mistakes

  • Treating the preliminary dataset as final.
  • Dropping duplicate records before diagnosing them.
  • Using NTIME instead of TIME for modeling.
  • Forgetting to check for missing or negative TIME.
  • Joining files before confirming subject ID alignment.
  • Creating dose rows but not ordering them before observations at TIME = 0.
  • Forgetting to propagate DOSE to all rows for stratified QC.

Practice Problems

  1. Verify that every subject has exactly one dose row (EVID = 1).
  2. Identify which subjects have more than 10 records and save that list as a CSV under courses/foundations-r/results/qc/.
  3. Confirm that NTIME values are limited to the protocol grid.
  4. Create a small table of missing covariates (AGE, WT, SEX) and save it under courses/foundations-r/results/qc/.

One dose row per subject:

analysisprelim %>%
  filter(EVID == 1) %>%
  count(ID) %>%
  filter(n != 1)
# A tibble: 0 × 2
# ℹ 2 variables: ID <chr>, n <int>

Subjects with >10 records:

morethan10 <- analysisprelim %>%
  count(ID) %>%
  filter(n > 10) %>%
  arrange(desc(n))

write_csv(morethan10, file.path(resultspath, "subjects_more_than_10_records.csv"))
morethan10
# A tibble: 8 × 2
  ID        n
  <chr> <int>
1 001      11
2 002      11
3 003      11
4 009      11
5 018      11
6 026      11
7 038      11
8 041      11

NTIME grid check:

setdiff(sort(unique(analysisprelim$NTIME)), c(0, 0.5, 1, 2, 4, 6, 8, 12, 24))
numeric(0)

Missing covariates:

misscov <- analysisprelim %>%
  distinct(ID, AGE, WT, SEX) %>%
  summarise(
    n_missing_age = sum(is.na(AGE)),
    n_missing_wt  = sum(is.na(WT)),
    n_missing_sex = sum(is.na(SEX))
  )

write_csv(misscov, file.path(resultspath, "missing_covariates_prelim.csv"))
misscov
# A tibble: 1 × 3
  n_missing_age n_missing_wt n_missing_sex
          <int>        <int>         <int>
1             2            2             0

Summary

  • You assembled a preliminary event-style dataset from DM, EX, and structurally cleaned PC.
  • You derived numeric TIME and nominal NTIME.
  • You propagated DOSE to all rows for stratified QC and summaries.
  • You validated record counts and confirmed that deviations from 10 records/subject are likely duplicates to be handled later.
  • You exported a preliminary dataset for downstream plotting-driven QC.

  • Keep prints narrow: select only ID, TIME, NTIME, DOSE, DV when inspecting rows.
  • Use TIME for modeling and diagnostics; use NTIME for binned summaries.
  • If some subjects have extra rows, don’t “panic-clean” — document and diagnose first.