Understanding Modeling Datasets
Big picture: Before fitting a model, you need to understand what each row represents. Population models are built on event-based datasets where doses and observations live together.
Learning Objectives
By the end of this lesson, you will be able to:
- Explain how pharmacometric datasets differ from ordinary analysis datasets.
- Interpret dosing and observation records.
- Describe why modeling datasets use long format.
- Interpret common columns including ID, TIME, DV, AMT, EVID, and MDV.
- Recognize how event records support simulation and estimation.
Key Ideas
- Modeling datasets are event-based.
- One row usually represents one event.
- Dosing and observations are often stored together.
- Long format supports modeling and simulation.
- Event columns communicate meaning to the software.
Why Modeling Datasets Look Different
Many analysis datasets are organized around subjects.
Example:
One row = One subject
Population modeling datasets are usually organized differently.
Instead:
One row = One event
Events may include:
- dose administration
- concentration measurement
- response measurement
- additional records used during simulation
This structure gives flexibility for modeling and simulation.
Worked Example 1: Subject Dataset vs Event Dataset
Traditional analysis dataset:
| ID | AGE | WT |
|---|---|---|
| 1 | 42 | 78 |
| 2 | 35 | 64 |
Population modeling dataset:
| ID | TIME | AMT | DV |
|---|---|---|---|
| 1 | 0 | 320 | NA |
| 1 | 1 | 0 | 5.4 |
| 1 | 2 | 0 | 3.9 |
| 2 | 0 | 320 | NA |
| 2 | 1 | 0 | 4.8 |
The same subject appears across multiple rows because events occur at multiple times.
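As a minimal sketch of the "one row = one event" idea, the population modeling table above can be held as a list of records and grouped by ID. The sketch uses plain Python rather than a modeling package, and the variable names (events, rows_by_subject) are illustrative assumptions, not part of any tool.

```python
from collections import defaultdict

# Event records copied from the population modeling table above.
# None stands in for NA (no observation on a dose row).
events = [
    {"ID": 1, "TIME": 0, "AMT": 320, "DV": None},
    {"ID": 1, "TIME": 1, "AMT": 0, "DV": 5.4},
    {"ID": 1, "TIME": 2, "AMT": 0, "DV": 3.9},
    {"ID": 2, "TIME": 0, "AMT": 320, "DV": None},
    {"ID": 2, "TIME": 1, "AMT": 0, "DV": 4.8},
]

# Group rows by subject: one subject maps to several event rows.
rows_by_subject = defaultdict(list)
for row in events:
    rows_by_subject[row["ID"]].append(row)

for subject_id, rows in rows_by_subject.items():
    print(f"Subject {subject_id}: {len(rows)} event rows")
```

Subject 1 contributes three rows and subject 2 contributes two, which a subject-level table cannot show.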
Long Format
Population datasets are typically stored in long format.
Conceptually:
Subject
↓
Multiple Events
↓
Multiple Rows
This allows software to reconstruct time-dependent behavior.
Long format becomes especially important when:
- repeated dosing exists
- PK and PD endpoints coexist
- simulation is required
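To make the long-format idea concrete, here is a rough sketch that expands a hypothetical wide record (one row per subject, one column per sampling time) into event rows. The wide column names DOSE, CONC_1H, and CONC_2H and the helper wide_to_long are assumptions made up for illustration.

```python
# Hypothetical wide record: one row per subject, one column per sampling time.
wide_row = {"ID": 1, "DOSE": 320, "CONC_1H": 5.4, "CONC_2H": 3.9}

# Hypothetical mapping from wide column name to sampling time (hours).
sample_times = {"CONC_1H": 1, "CONC_2H": 2}


def wide_to_long(row, sample_times, dose_time=0):
    """Expand one wide subject row into a dose row plus one row per sample."""
    long_rows = [
        {"ID": row["ID"], "TIME": dose_time, "AMT": row["DOSE"], "DV": None}
    ]
    for column, time in sample_times.items():
        long_rows.append(
            {"ID": row["ID"], "TIME": time, "AMT": 0, "DV": row[column]}
        )
    # Keep events in chronological order.
    return sorted(long_rows, key=lambda r: r["TIME"])


long_rows = wide_to_long(wide_row, sample_times)
for r in long_rows:
    print(r)
```

Adding a second dose or a PD endpoint in the wide layout would require new columns, whereas the long layout only needs more rows.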
Common Modeling Columns
Although datasets differ, several columns appear repeatedly.
Subject Identifier
ID
Identifies which rows belong to the same subject.
Example:
ID = 1
↓
All rows belong together
Time Variable
TIME
Represents elapsed time.
Example:
| TIME |
|---|
| 0 |
| 1 |
| 2 |
| 4 |
The reference point and units (for example, hours since the first dose) depend on the study design.
Dependent Variable
DV
Observed outcome.
Examples:
- concentration
- biomarker
- response
Example:
| TIME | DV |
|---|---|
| 1 | 5.6 |
| 2 | 4.1 |
Worked Example 2: Dosing vs Observation Records
Consider:
| ID | TIME | AMT | DV |
|---|---|---|---|
| 1 | 0 | 300 | NA |
| 1 | 1 | 0 | 6.1 |
| 1 | 2 | 0 | 4.3 |
Interpretation:
Row 1:
Dose event
Rows 2–3:
Observation events
Dose rows usually contain no observation, so DV is NA.
Observation rows often have no administered amount.
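The reading of the table above can be sketched as a small rule: a positive AMT marks a dose row and anything else is treated as an observation. This is a simplification that only holds when no event columns are present. The helper name classify is hypothetical.

```python
def classify(row):
    """Label a row as 'dose' or 'observation' (simplified rule: AMT > 0 means dose)."""
    return "dose" if row["AMT"] > 0 else "observation"


# Rows copied from the table above; None stands in for NA.
records = [
    {"ID": 1, "TIME": 0, "AMT": 300, "DV": None},
    {"ID": 1, "TIME": 1, "AMT": 0, "DV": 6.1},
    {"ID": 1, "TIME": 2, "AMT": 0, "DV": 4.3},
]

labels = [classify(r) for r in records]
for r, label in zip(records, labels):
    print(f"TIME={r['TIME']}: {label}")
```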
Amount Variable
AMT
Represents administered amount.
Example:
| TIME | AMT |
|---|---|
| 0 | 300 |
| 12 | 300 |
Event Identifier
EVID
Indicates event type.
Common examples:
| EVID | Meaning |
|---|---|
| 0 | Observation |
| 1 | Dose |
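Other codes exist as well. In NONMEM-style datasets, for example, EVID=2 marks an other-type event, 3 a reset, and 4 a reset followed by a dose. Which codes are supported varies across software.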
Missing Dependent Variable
MDV
Flags rows whose DV should not be treated as an observation during estimation.
Typical values:
| MDV | Meaning |
|---|---|
| 0 | Observation present |
| 1 | Observation ignored |
Dose rows commonly use MDV=1.
Worked Example 3: Reading Event Records
Example:
| ID | TIME | AMT | DV | EVID | MDV |
|---|---|---|---|---|---|
| 1 | 0 | 300 | NA | 1 | 1 |
| 1 | 1 | 0 | 5.6 | 0 | 0 |
| 1 | 2 | 0 | 4.3 | 0 | 0 |
Interpretation:
Dose
↓
Observation
↓
Observation
This event structure is one of the core ideas of population modeling.
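The EVID and MDV values in the table above follow a pattern that can be sketched in a few lines. The rule used here (dose if AMT > 0, MDV=1 for dose rows or missing DV) is a simplified convention assumed for illustration, and add_event_columns is a hypothetical helper. Real datasets may need other codes.

```python
def add_event_columns(row):
    """Return a copy of the row with EVID and MDV filled in (simplified convention)."""
    is_dose = row["AMT"] > 0
    evid = 1 if is_dose else 0
    # MDV is 1 when the row is a dose or when no DV value was recorded.
    mdv = 1 if is_dose or row["DV"] is None else 0
    return {**row, "EVID": evid, "MDV": mdv}


# Rows from the table above without their event columns; None stands in for NA.
records = [
    {"ID": 1, "TIME": 0, "AMT": 300, "DV": None},
    {"ID": 1, "TIME": 1, "AMT": 0, "DV": 5.6},
    {"ID": 1, "TIME": 2, "AMT": 0, "DV": 4.3},
]

flagged = [add_event_columns(r) for r in records]
for r in flagged:
    print(r["TIME"], r["EVID"], r["MDV"])
```

Under this rule, the derived columns match the EVID and MDV values shown in the table.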
Why Event Structure Matters
Modeling software uses event information to determine:
- when drug enters the system
- when observations occur
- which rows affect estimation
- how simulations are generated
Without event information, concentration-time reconstruction becomes difficult.
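To illustrate how dose events can drive a simulation, the sketch below sums the contribution of each dose given at or before a requested time. It assumes a one-compartment IV bolus model with first-order elimination, which the lesson does not specify. The volume and elimination rate are made-up values, and the function name concentration is hypothetical.

```python
import math


def concentration(t, dose_events, volume, k_el):
    """Sum contributions of all doses given at or before time t.

    Assumes a one-compartment IV bolus model (an illustrative choice).
    dose_events is a list of (time, amount) pairs taken from dose rows.
    """
    total = 0.0
    for dose_time, amount in dose_events:
        if dose_time <= t:
            total += (amount / volume) * math.exp(-k_el * (t - dose_time))
    return total


# Dose rows as (TIME, AMT) pairs; the parameters below are hypothetical.
dose_events = [(0, 300), (12, 300)]
volume = 50.0  # assumed volume of distribution
k_el = 0.1     # assumed elimination rate constant (per hour)

for t in [0, 6, 12, 18]:
    print(t, round(concentration(t, dose_events, volume, k_el), 3))
```

Without the dose times and amounts carried by the event rows, the function would have nothing to sum, which is why the dataset must record when drug enters the system.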
Preview of the Course Dataset
In the next lesson, we will load:
nlmixr2data::theo_sd
and begin exploring real PK datasets.
At that point, the columns introduced here will become concrete.
Strategies
- Read rows sequentially.
- Interpret events chronologically.
- Verify observation vs dose records.
- Check identifiers early.
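These strategies can be turned into a few automated checks. The following is a rough sketch, not a complete validator. The function find_problems and its three rules are assumptions chosen for illustration, and the row numbers it reports are positions in the input list.

```python
def find_problems(rows):
    """Return a list of human-readable problems found in event rows."""
    problems = []
    last_time = {}  # most recent TIME seen for each subject ID
    for i, row in enumerate(rows, start=1):
        sid = row["ID"]
        # Events should be in chronological order within each subject.
        if sid in last_time and row["TIME"] < last_time[sid]:
            problems.append(f"row {i}: TIME goes backwards for ID {sid}")
        last_time[sid] = row["TIME"]
        # A dose record should carry a positive amount.
        if row["EVID"] == 1 and row["AMT"] <= 0:
            problems.append(f"row {i}: dose record without a positive AMT")
        # An observation flagged as present should have a DV value.
        if row["EVID"] == 0 and row["MDV"] == 0 and row["DV"] is None:
            problems.append(f"row {i}: observation record with missing DV")
    return problems


rows = [
    {"ID": 1, "TIME": 0, "AMT": 300, "DV": None, "EVID": 1, "MDV": 1},
    {"ID": 1, "TIME": 1, "AMT": 0, "DV": 5.6, "EVID": 0, "MDV": 0},
    {"ID": 1, "TIME": 2, "AMT": 0, "DV": 4.3, "EVID": 0, "MDV": 0},
]
print(find_problems(rows))
```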
Common Mistakes
- Assuming all rows contain observations.
- Ignoring event columns.
- Forgetting that time ordering matters.
- Treating modeling datasets as ordinary subject-level spreadsheets (one row per subject).
Practice Problems
- Why are modeling datasets usually long format?
- Explain the role of DV.
- Explain the difference between AMT and DV.
- Interpret EVID=1.
Problem 1
Long format allows multiple events per subject.
Problem 2
DV stores observed outcomes.
Problem 3
AMT represents dose administration while DV stores observations.
Problem 4
EVID=1 typically represents a dose event.
Summary
- Modeling datasets are event-based.
- Long format supports repeated events.
- Dosing and observations coexist.
- Event variables drive modeling behavior.
- Read datasets row-by-row.
- Think in events—not subjects.
- Time ordering matters.
- Event columns carry meaning.