Data Wrangling for Pharmacometrics

A practical, PMx-first workflow for importing, cleaning, validating, and exporting model-ready datasets in R.
Tip

How to use this section: Treat it like a playbook. You’ll learn a repeatable workflow you can apply to every PMx dataset you touch.

Learning Objectives

By the end of this module, you will be able to:

  • Import common PMx datasets reliably and inspect structure immediately.
  • Standardize column names and ordering to PMx-friendly conventions.
  • Subset and sort records safely for event-style datasets.
  • Create derived variables (flags, unit conversions, log transforms) intentionally.
  • Build subject-level QC summaries to catch issues early.
  • Join covariates/visit tables without silent row expansion.
  • Bind and reshape tables when the data structure demands it.
  • Use mapping helpers to iterate across repeated PMx tasks safely.
  • Clean common string and time fields that frequently break joins and QC.
  • Handle missing data and BLQ observations explicitly and defensibly.
  • Export a clean, model-ready dataset with a compact QC record (including "." for missing values when required).

Why Wrangling Matters in PMx

Pharmacometrics models are only as good as the data feeding them.

Most PMx “modeling problems” are actually data problems:

  • wrong types (TIME as character, IDs inconsistent)
  • hidden duplicates
  • unsorted event histories (or inconsistent ordering conventions)
  • covariate merges that silently fail or multiply rows
  • BLQ and missingness handled inconsistently

This module gives you a workflow that makes these problems visible before you fit a model.

Warning

A model can converge even when the dataset is wrong. Data QC is not optional.


PMx Dataset Mental Model

In many workflows, your event-level dataset is “long”:

  • One row per record (dose or observation)
  • Common columns look like: ID, TIME, EVID, AMT, DV
  • Covariates may be:
    • subject-level (e.g., WT, SEX, AGE)
    • time-varying (e.g., labs, vitals, concomitant meds)

A major theme of this module is keeping the meaning of each row clear — and making modeling intent explicit (for example, how you treat BLQ rows).


Common PMx / NM-TRAN Dataset Terms

If you are new to PMx datasets, the variable names can feel cryptic at first. They are often inherited from NONMEM / NM-TRAN conventions, and you will see many of the same names across software, internal company datasets, and publications.

You do not need to memorize every variable immediately. But you do want to get comfortable recognizing the most common ones and understanding what each row is trying to represent.

A Few Core Ideas First

Before looking at specific variables, keep these ideas in mind:

  • An event dataset usually mixes dose records and observation records in the same table.
  • The meaning of a row is often determined by a small set of columns like EVID, AMT, DV, and sometimes MDV.
  • Some variables describe who the subject is (ID, WT, SEX).
  • Some variables describe when something happened (TIME, DATE, DATETIME, TAD).
  • Some variables describe what happened on that row (dose given, concentration observed, infusion running, covariate recorded).

That is why PMx wrangling is not just “cleaning columns.” It is interpreting the dataset correctly.

Very Common Variables You’ll See

Here are some of the most common PMx / NM-TRAN style variables.

  • ID: subject identifier
  • TIME: nominal or elapsed time for the record
  • DV: dependent variable, often the observed concentration or response
  • AMT: dose amount
  • EVID: event ID, used to tell you what kind of row this is
  • MDV: missing dependent variable flag
  • CMT: compartment number
  • RATE: infusion rate
  • II: interdose interval
  • ADDL: number of additional doses
  • SS: steady-state flag
  • WT, AGE, SEX, CRCL, AST, ALT: common covariates
  • TAD: time after dose
  • BLQ: below limit of quantification flag

Different organizations may add their own naming conventions, but these are the terms you’ll run into constantly.

Interpreting the Most Important Ones

DV

DV usually means the measured outcome for that row.

In PK datasets, DV is often the observed concentration. In PK/PD or exposure-response work, it could be a biomarker, score, lab value, or another endpoint.

A few important habits:

  • On an observation row, DV often contains the measured value.
  • On a dose row, DV is often NA.
  • If a row is BLQ, the raw DV may be missing, zero, the LLOQ, or something study-specific — which is exactly why you need explicit BLQ handling rules.

EVID

EVID tells you what kind of event the row represents.

A very common simplified interpretation is:

  • EVID = 0: observation row
  • EVID = 1: dosing row

That alone gets you very far in basic PMx wrangling.

In broader NONMEM workflows, other EVID values can appear for reset or other event types, so you should always confirm the study convention before assuming more than you know.

Note

For this course, the most important practical distinction is usually whether a row is an observation (EVID = 0) or a dose (EVID = 1).

AMT

AMT is the dose amount for that record.

Typical pattern:

  • dose rows: AMT has a value
  • observation rows: AMT is often NA or 0

This is one of the fastest ways to sanity-check whether your dose history looks plausible.

MDV

MDV stands for missing dependent variable.

A common interpretation is:

  • MDV = 0: this row has a usable dependent variable
  • MDV = 1: this row should not be treated as a dependent-variable observation

In many datasets, dose rows have MDV = 1 because they are not concentration observations. Some observation-like rows may also have MDV = 1 if the endpoint is missing or intentionally excluded from modeling.

MDV is useful, but do not rely on it blindly. Always check whether it agrees with EVID, DV, and the study documentation.

CMT

CMT identifies the compartment tied to the event.

Examples:

  • oral dose into a depot compartment
  • IV dose into a central compartment
  • concentration observed from the central compartment

In this module, you do not need to become a compartment-number expert yet. Just know that CMT helps software interpret where the event happens.

RATE, II, ADDL, and SS

These often appear in dosing records:

  • RATE: infusion rate
  • II: interdose interval
  • ADDL: additional doses after the current one
  • SS: steady-state indicator

These variables matter because one row may represent more than a single immediate bolus. They can encode repeat dosing or infusion behavior that you need to preserve correctly during wrangling.

A Practical Way to Read Any PMx Row

When you inspect a new dataset, ask these questions for any row:

  1. Who is this? (ID)
  2. When did this happen? (TIME, date/time fields)
  3. Is this a dose or an observation? (EVID, sometimes MDV)
  4. If it is a dose, how much and how was it given? (AMT, RATE, II, ADDL, SS, CMT)
  5. If it is an observation, what was measured? (DV, endpoint flags, BLQ flags)
  6. What covariates describe the subject at this time? (WT, AGE, labs, visit info)

That simple checklist will help you decode most PMx datasets quickly.

Mini Example

Here is a tiny event-style example:

tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,  ~MDV,
    1,   0,      1,   100, NA,     1,
    1,   1,      0,    NA, 8.4,    0,
    1,   2,      0,    NA, 6.1,    0
)

How to read it:

  • first row: subject 1 received a 100-unit dose at time 0
  • second row: subject 1 had an observation at time 1 with DV = 8.4
  • third row: subject 1 had another observation at time 2 with DV = 6.1

That is the core grammar of many PMx datasets.

One Important Warning About Conventions

These names are common, but conventions are not perfectly universal.

For example:

  • some teams use . instead of NA in flat files
  • some use TAD, others derive it later
  • some store BLQ information in BLQ, others in LLOQ plus a rule
  • some use additional EVID meanings beyond just dose vs observation
  • some define special derived columns for modeling (DV_model, ANL, FLAG_EXCL)

So the goal is not to assume every dataset behaves identically. The goal is to recognize the language quickly, then verify the project-specific rules.

Tip

When you receive a new PMx dataset, do not just ask “Are the columns present?” Ask “What does each row mean, and which columns control that meaning?”


Your Reusable “First 10 Minutes” QC Checklist

Use this anytime you load a new dataset:

  1. Inspect types: glimpse()
  2. Check missingness: counts by column
  3. Check duplicates: especially (ID, TIME, EVID)
  4. Enforce sorting: choose a convention and document it
  5. Dose/obs sanity check: do subjects have expected patterns?
  6. Covariate consistency: are “subject-level” covariates constant?
  7. Join QC (if merging tables): row counts + anti_join()
  8. BLQ and missingness: flag explicitly (don’t hide it)
Tip

If you save only one thing from this module, save this checklist.


A Note on Ordering When TIME Ties

You’ll see sorting patterns like:

arrange(ID, TIME, EVID)         # observation-first if EVID=0/1
# or
arrange(ID, TIME, desc(EVID))   # dose-first if EVID=0/1

Different PMx pipelines use different conventions.

For the regression and mixed-effects models used later in this course (lm(), nls(), lme(), nlme()), either convention is acceptable as long as it is consistent and documented.

In this course, our default is observation-first when dose and observation share the same time:

arrange(ID, TIME, EVID)

A Note on BLQ and Missing Values

In many PMx datasets:

  • BLQ is stored as an integer flag (0 = not BLQ, 1 = BLQ)
  • dose rows may have BLQ = NA
  • your final modeling decision is made explicit with a column like DV_model

When exporting analysis datasets, many PMx pipelines expect missing values to be written as ".".

Inside R, keep missing values as NA. Convert only at export time (e.g., write_csv(..., na = ".")).


Lessons in This Module

Work through these in order the first time. Later, revisit individual lessons as references.

  1. Reading and Writing Data
  2. First 10 Minutes: Structural QC for PMx Data
  3. Selecting, Renaming, and Relocating Columns
  4. Filtering, Arranging, and Slicing Rows
  5. Mutating, Transmuting, and Applying Functions Across Columns
  6. Grouping and Summarising Data
  7. Joining Data
  8. Binding Data
  9. String Helpers for Data Cleaning
  10. Working with Dates and Times
  11. Reshaping Data
  12. Mapping and Iterating Across Lists
  13. Missing Data and BLQ Handling in PMx
  14. Final QC and Exporting Model-Ready Datasets
  15. A Minimal End-to-End PMx Workflow

Packages Used in This Module

Most lessons start with:

library(tidyverse)

That attaches core tools like dplyr, tidyr, ggplot2, readr, and stringr.

Some lessons also load a few additional packages where needed:

  • lubridate (dates/times)
  • readxl (Excel import, when unavoidable)
  • here (project-safe file paths)

readxl and lubridate are part of the broader tidyverse ecosystem, but they are not attached by library(tidyverse).

Note

If you ever see “could not find function …”, it usually means the package that provides it hasn’t been attached in your current session.


What’s Next

After wrangling, we move into Data Visualization (ggplot2) for PMx:

  • individual PK profiles
  • stratified plots by covariates
  • log-scale plots
  • QC visual diagnostics

Visualization is one of the fastest ways to spot patterns and problems before modeling.