library(tidyverse)Joining Data
What you’ll build today: reliable patterns for merging covariates, dosing, and observation data without silently breaking your PMx dataset.
Learning Objectives
By the end of this lesson, you will be able to:
- Clearly distinguish between
left_join(),right_join(),inner_join(),full_join(), andanti_join(). - Choose the correct join type for common PMx workflows.
- Join subject-level covariates to event-level data safely.
- Diagnose failed joins, duplicate keys, and unintended row expansion.
- Use joins as a QC tool, not just a data manipulation step.
Setup
Example Datasets
Event-level data (PK records)
pk <- tibble::tribble(
~ID, ~TIME, ~EVID, ~AMT, ~DV,
1, 0.0, 1, 100, NA,
1, 0.5, 0, NA, 2.1,
1, 1.0, 0, NA, 3.8,
2, 0.0, 1, 80, NA,
2, 0.5, 0, NA, 1.6
)Subject-level covariates
cov_subj <- tibble::tribble(
~ID, ~WT, ~SEX, ~AGE,
1, 72, "F", 34,
2, 88, "M", 41,
3, 95, "M", 29 # subject not in pk
)Visit-level information
visits <- tibble::tribble(
~ID, ~TIME, ~VISIT,
1, 0.5, "V1",
1, 1.0, "V2",
2, 0.5, "V1"
)Key Ideas
A join combines rows from two tables using shared key columns (e.g., ID, TIME).
In PMx workflows:
- Event-level data are the structural backbone.
- Subject-level data enrich event-level data.
- Joins must never silently change event history.
- Row counts are a structural diagnostic.
A join that runs without error can still produce incorrect results.
Always check row counts and unmatched keys.
Demonstrating Join Types in Action
1. left_join(): Default Enrichment
pk_left <- pk %>% left_join(cov_subj, by = "ID")
nrow(pk)[1] 5
nrow(pk_left)[1] 5
pk_left# A tibble: 5 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
4 2 0 1 80 NA 88 M 41
5 2 0.5 0 NA 1.6 88 M 41
- All PK rows are preserved.
- Subject 3 in
cov_subjis ignored. - Missing matches would appear as
NA.
This is the default PMx join.
2. inner_join(): Silent Row Loss Risk
pk_inner <- pk %>% inner_join(cov_subj, by = "ID")
nrow(pk_inner)[1] 5
pk_inner# A tibble: 5 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
4 2 0 1 80 NA 88 M 41
5 2 0.5 0 NA 1.6 88 M 41
Now remove ID 2 from covariates and repeat:
cov_missing <- cov_subj %>% filter(ID != 2)
pk %>% inner_join(cov_missing, by = "ID")# A tibble: 3 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
ID 2 rows disappear entirely.
inner_join() can silently drop event records — often inappropriate in PMx.
3. right_join(): Reversed Perspective
pk_right <- pk %>% right_join(cov_subj, by = "ID")
nrow(pk_right)[1] 6
pk_right# A tibble: 6 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
4 2 0 1 80 NA 88 M 41
5 2 0.5 0 NA 1.6 88 M 41
6 3 NA NA NA NA 95 M 29
Now subject 3 appears with missing PK values.
Rarely used in PMx; usually signals reversed logic.
4. full_join(): Diagnostic Merge
pk_full <- pk %>% full_join(cov_subj, by = "ID")
nrow(pk_full)[1] 6
pk_full# A tibble: 6 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
4 2 0 1 80 NA 88 M 41
5 2 0.5 0 NA 1.6 88 M 41
6 3 NA NA NA NA 95 M 29
You now see:
- All PK rows
- All covariate rows
- Subject 3 with no PK records
Useful for reconciliation.
5. anti_join(): QC Tool
pk %>% anti_join(cov_subj, by = "ID")# A tibble: 0 × 5
# ℹ 5 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>
cov_subj %>% anti_join(pk, by = "ID")# A tibble: 1 × 4
ID WT SEX AGE
<dbl> <dbl> <chr> <dbl>
1 3 95 M 29
This identifies:
- PK subjects missing covariates
- Covariates without PK records
Powerful structural diagnostic.
Multi-Key Join Example
pk_visit <- pk %>%
left_join(visits, by = c("ID", "TIME"))
pk_visit# A tibble: 5 × 6
ID TIME EVID AMT DV VISIT
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 1 100 NA <NA>
2 1 0.5 0 NA 2.1 V1
3 1 1 0 NA 3.8 V2
4 2 0 1 80 NA <NA>
5 2 0.5 0 NA 1.6 V1
If you join only by ID:
pk %>% left_join(visits, by = "ID")# A tibble: 8 × 7
ID TIME.x EVID AMT DV TIME.y VISIT
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 1 100 NA 0.5 V1
2 1 0 1 100 NA 1 V2
3 1 0.5 0 NA 2.1 0.5 V1
4 1 0.5 0 NA 2.1 1 V2
5 1 1 0 NA 3.8 0.5 V1
6 1 1 0 NA 3.8 1 V2
7 2 0 1 80 NA 0.5 V1
8 2 0.5 0 NA 1.6 0.5 V1
Row multiplication occurs because multiple TIME rows match.
Always include all necessary key columns in multi-key joins.
Row Multiplication Detection
Create duplicate keys:
visits_dup <- visits %>%
add_row(ID = 1, TIME = 0.5, VISIT = "V1_dup")
pk_dup <- pk %>%
left_join(visits_dup, by = c("ID", "TIME"))
nrow(pk)[1] 5
nrow(pk_dup)[1] 6
Row count increases → duplicate keys detected.
Strategies
- Default to
left_join()when enriching PK data. - Always compare
nrow()before and after joins. - Use
anti_join()as a QC diagnostic. - Verify key uniqueness in the right-hand table.
- Be explicit with
by =.
Common Mistakes
- Ignoring row count changes
- Joining on incomplete keys
- Assuming keys are unique
- Using
inner_join()for enrichment - Not checking unmatched rows
Practice Problems
- Use
left_join()and confirm row counts remain constant. - Remove one ID from
cov_subjand testinner_join(). - Use
anti_join()to identify mismatches. - Join visits incorrectly using only
ID. What happens? - Create duplicate visit keys and observe row expansion.
- Use
full_join()and describe what new rows appear.
# 1
pk_left <- pk %>% left_join(cov_subj, by = "ID")
stopifnot(nrow(pk_left) == nrow(pk))
# 2
cov_missing <- cov_subj %>% filter(ID != 2)
pk %>% inner_join(cov_missing, by = "ID")# A tibble: 3 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
# 3
pk %>% anti_join(cov_subj, by = "ID")# A tibble: 0 × 5
# ℹ 5 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>
# 4
pk %>% left_join(visits, by = "ID")# A tibble: 8 × 7
ID TIME.x EVID AMT DV TIME.y VISIT
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 1 100 NA 0.5 V1
2 1 0 1 100 NA 1 V2
3 1 0.5 0 NA 2.1 0.5 V1
4 1 0.5 0 NA 2.1 1 V2
5 1 1 0 NA 3.8 0.5 V1
6 1 1 0 NA 3.8 1 V2
7 2 0 1 80 NA 0.5 V1
8 2 0.5 0 NA 1.6 0.5 V1
# 5
visits_dup <- visits %>%
add_row(ID = 1, TIME = 0.5, VISIT = "V1_dup")
pk %>% left_join(visits_dup, by = c("ID", "TIME"))# A tibble: 6 × 6
ID TIME EVID AMT DV VISIT
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 1 100 NA <NA>
2 1 0.5 0 NA 2.1 V1
3 1 0.5 0 NA 2.1 V1_dup
4 1 1 0 NA 3.8 V2
5 2 0 1 80 NA <NA>
6 2 0.5 0 NA 1.6 V1
# 6
pk %>% full_join(cov_subj, by = "ID")# A tibble: 6 × 8
ID TIME EVID AMT DV WT SEX AGE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 0 1 100 NA 72 F 34
2 1 0.5 0 NA 2.1 72 F 34
3 1 1 0 NA 3.8 72 F 34
4 2 0 1 80 NA 88 M 41
5 2 0.5 0 NA 1.6 88 M 41
6 3 NA NA NA NA 95 M 29
Summary
You now understand:
- How each join type behaves structurally.
- Why
left_join()is safest for PMx enrichment. - How row multiplication and row loss occur.
- How to use
anti_join()andfull_join()as QC diagnostics.
Joins are structural operations.
Treat them as QC checkpoints, not just convenience tools.
- Default to
left_join(). - Always compare row counts before and after joins.
- Use
anti_join()to detect mismatches. - Include all necessary key columns.
- Duplicate keys cause row multiplication.
- Inner joins can silently drop PK records.