Data Wrangling for Pharmacometrics
How to use this section: Treat it like a playbook. You’ll learn a repeatable workflow you can apply to every PMx dataset you touch.
Learning Objectives
By the end of this module, you will be able to:
- Import common PMx datasets reliably and inspect structure immediately.
- Standardize column names and ordering to PMx-friendly conventions.
- Subset and sort records safely for event-style datasets.
- Create derived variables (flags, unit conversions, log transforms) intentionally.
- Build subject-level QC summaries to catch issues early.
- Join covariates/visit tables without silent row expansion.
- Bind and reshape tables when the data structure demands it.
- Use mapping helpers to iterate across repeated PMx tasks safely.
- Clean common string and time fields that frequently break joins and QC.
- Handle missing data and BLQ observations explicitly and defensibly.
- Export a clean, model-ready dataset with a compact QC record (including
"."for missing values when required).
Why Wrangling Matters in PMx
Pharmacometrics models are only as good as the data feeding them.
Most PMx “modeling problems” are actually data problems:
- wrong types (
TIMEas character, IDs inconsistent) - hidden duplicates
- unsorted event histories (or inconsistent ordering conventions)
- covariate merges that silently fail or multiply rows
- BLQ and missingness handled inconsistently
This module gives you a workflow that makes these problems visible before you fit a model.
A model can converge even when the dataset is wrong. Data QC is not optional.
PMx Dataset Mental Model
In many workflows, your event-level dataset is “long”:
- One row per record (dose or observation)
- Common columns look like:
ID,TIME,EVID,AMT,DV - Covariates may be:
- subject-level (e.g.,
WT,SEX,AGE) - time-varying (e.g., labs, vitals, concomitant meds)
- subject-level (e.g.,
A major theme of this module is keeping the meaning of each row clear — and making modeling intent explicit (for example, how you treat BLQ rows).
Common PMx / NM-TRAN Dataset Terms
If you are new to PMx datasets, the variable names can feel cryptic at first. They are often inherited from NONMEM / NM-TRAN conventions, and you will see many of the same names across software, internal company datasets, and publications.
You do not need to memorize every variable immediately. But you do want to get comfortable recognizing the most common ones and understanding what each row is trying to represent.
A Few Core Ideas First
Before looking at specific variables, keep these ideas in mind:
- An event dataset usually mixes dose records and observation records in the same table.
- The meaning of a row is often determined by a small set of columns like
EVID,AMT,DV, and sometimesMDV. - Some variables describe who the subject is (
ID,WT,SEX). - Some variables describe when something happened (
TIME,DATE,DATETIME,TAD). - Some variables describe what happened on that row (dose given, concentration observed, infusion running, covariate recorded).
That is why PMx wrangling is not just “cleaning columns.” It is interpreting the dataset correctly.
Very Common Variables You’ll See
Here are some of the most common PMx / NM-TRAN style variables.
ID: subject identifierTIME: nominal or elapsed time for the recordDV: dependent variable, often the observed concentration or responseAMT: dose amountEVID: event ID, used to tell you what kind of row this isMDV: missing dependent variable flagCMT: compartment numberRATE: infusion rateII: interdose intervalADDL: number of additional dosesSS: steady-state flagWT,AGE,SEX,CRCL,AST,ALT: common covariatesTAD: time after doseBLQ: below limit of quantification flag
Different organizations may add their own naming conventions, but these are the terms you’ll run into constantly.
Interpreting the Most Important Ones
DV
DV usually means the measured outcome for that row.
In PK datasets, DV is often the observed concentration. In PK/PD or exposure-response work, it could be a biomarker, score, lab value, or another endpoint.
A few important habits:
- On an observation row,
DVoften contains the measured value. - On a dose row,
DVis oftenNA. - If a row is BLQ, the raw
DVmay be missing, zero, the LLOQ, or something study-specific — which is exactly why you need explicit BLQ handling rules.
EVID
EVID tells you what kind of event the row represents.
A very common simplified interpretation is:
EVID = 0: observation rowEVID = 1: dosing row
That alone gets you very far in basic PMx wrangling.
In broader NONMEM workflows, other EVID values can appear for reset or other event types, so you should always confirm the study convention before assuming more than you know.
For this course, the most important practical distinction is usually whether a row is an observation (EVID = 0) or a dose (EVID = 1).
AMT
AMT is the dose amount for that record.
Typical pattern:
- dose rows:
AMThas a value - observation rows:
AMTis oftenNAor0
This is one of the fastest ways to sanity-check whether your dose history looks plausible.
MDV
MDV stands for missing dependent variable.
A common interpretation is:
MDV = 0: this row has a usable dependent variableMDV = 1: this row should not be treated as a dependent-variable observation
In many datasets, dose rows have MDV = 1 because they are not concentration observations. Some observation-like rows may also have MDV = 1 if the endpoint is missing or intentionally excluded from modeling.
MDV is useful, but do not rely on it blindly. Always check whether it agrees with EVID, DV, and the study documentation.
CMT
CMT identifies the compartment tied to the event.
Examples:
- oral dose into a depot compartment
- IV dose into a central compartment
- concentration observed from the central compartment
In this module, you do not need to become a compartment-number expert yet. Just know that CMT helps software interpret where the event happens.
RATE, II, ADDL, and SS
These often appear in dosing records:
RATE: infusion rateII: interdose intervalADDL: additional doses after the current oneSS: steady-state indicator
These variables matter because one row may represent more than a single immediate bolus. They can encode repeat dosing or infusion behavior that you need to preserve correctly during wrangling.
A Practical Way to Read Any PMx Row
When you inspect a new dataset, ask these questions for any row:
- Who is this? (
ID) - When did this happen? (
TIME, date/time fields) - Is this a dose or an observation? (
EVID, sometimesMDV) - If it is a dose, how much and how was it given? (
AMT,RATE,II,ADDL,SS,CMT) - If it is an observation, what was measured? (
DV, endpoint flags, BLQ flags) - What covariates describe the subject at this time? (
WT,AGE, labs, visit info)
That simple checklist will help you decode most PMx datasets quickly.
Mini Example
Here is a tiny event-style example:
tribble(
~ID, ~TIME, ~EVID, ~AMT, ~DV, ~MDV,
1, 0, 1, 100, NA, 1,
1, 1, 0, NA, 8.4, 0,
1, 2, 0, NA, 6.1, 0
)How to read it:
- first row: subject 1 received a 100-unit dose at time 0
- second row: subject 1 had an observation at time 1 with
DV = 8.4 - third row: subject 1 had another observation at time 2 with
DV = 6.1
That is the core grammar of many PMx datasets.
One Important Warning About Conventions
These names are common, but conventions are not perfectly universal.
For example:
- some teams use
.instead ofNAin flat files - some use
TAD, others derive it later - some store BLQ information in
BLQ, others inLLOQplus a rule - some use additional
EVIDmeanings beyond just dose vs observation - some define special derived columns for modeling (
DV_model,ANL,FLAG_EXCL)
So the goal is not to assume every dataset behaves identically. The goal is to recognize the language quickly, then verify the project-specific rules.
When you receive a new PMx dataset, do not just ask “Are the columns present?” Ask “What does each row mean, and which columns control that meaning?”
Your Reusable “First 10 Minutes” QC Checklist
Use this anytime you load a new dataset:
- Inspect types:
glimpse() - Check missingness: counts by column
- Check duplicates: especially
(ID, TIME, EVID) - Enforce sorting: choose a convention and document it
- Dose/obs sanity check: do subjects have expected patterns?
- Covariate consistency: are “subject-level” covariates constant?
- Join QC (if merging tables): row counts +
anti_join() - BLQ and missingness: flag explicitly (don’t hide it)
If you save only one thing from this module, save this checklist.
A Note on Ordering When TIME Ties
You’ll see sorting patterns like:
arrange(ID, TIME, EVID) # observation-first if EVID=0/1
# or
arrange(ID, TIME, desc(EVID)) # dose-first if EVID=0/1Different PMx pipelines use different conventions.
For the regression and mixed-effects models used later in this course (lm(), nls(), lme(), nlme()), either convention is acceptable as long as it is consistent and documented.
In this course, our default is observation-first when dose and observation share the same time:
arrange(ID, TIME, EVID)A Note on BLQ and Missing Values
In many PMx datasets:
BLQis stored as an integer flag (0= not BLQ,1= BLQ)- dose rows may have
BLQ = NA - your final modeling decision is made explicit with a column like
DV_model
When exporting analysis datasets, many PMx pipelines expect missing values to be written as ".".
Inside R, keep missing values as NA. Convert only at export time (e.g., write_csv(..., na = ".")).
Lessons in This Module
Work through these in order the first time. Later, revisit individual lessons as references.
- Reading and Writing Data
- First 10 Minutes: Structural QC for PMx Data
- Selecting, Renaming, and Relocating Columns
- Filtering, Arranging, and Slicing Rows
- Mutating, Transmuting, and Applying Functions Across Columns
- Grouping and Summarising Data
- Joining Data
- Binding Data
- String Helpers for Data Cleaning
- Working with Dates and Times
- Reshaping Data
- Mapping and Iterating Across Lists
- Missing Data and BLQ Handling in PMx
- Final QC and Exporting Model-Ready Datasets
- A Minimal End-to-End PMx Workflow
Packages Used in This Module
Most lessons start with:
library(tidyverse)That attaches core tools like dplyr, tidyr, ggplot2, readr, and stringr.
Some lessons also load a few additional packages where needed:
- lubridate (dates/times)
- readxl (Excel import, when unavoidable)
- here (project-safe file paths)
readxl and lubridate are part of the broader tidyverse ecosystem, but they are not attached by library(tidyverse).
If you ever see “could not find function …”, it usually means the package that provides it hasn’t been attached in your current session.
What’s Next
After wrangling, we move into Data Visualization (ggplot2) for PMx:
- individual PK profiles
- stratified plots by covariates
- log-scale plots
- QC visual diagnostics
Visualization is one of the fastest ways to spot patterns and problems before modeling.