Joining Data

Learn how to safely join covariate, dosing, and event tables using dplyr joins, and how to diagnose common PMx join failures.

Tip

What you’ll build today: reliable patterns for merging covariates, dosing, and observation data without silently breaking your PMx dataset.

Learning Objectives

By the end of this lesson, you will be able to:

Clearly distinguish between left_join(), right_join(), inner_join(), full_join(), and anti_join().
Choose the correct join type for common PMx workflows.
Join subject-level covariates to event-level data safely.
Diagnose failed joins, duplicate keys, and unintended row expansion.
Use joins as a QC tool, not just a data manipulation step.

Setup

library(tidyverse)

Example Datasets

Event-level data (PK records)

pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    1,   0.0,    1,  100,  NA,
    1,   0.5,    0,   NA,  2.1,
    1,   1.0,    0,   NA,  3.8,
    2,   0.0,    1,   80,  NA,
    2,   0.5,    0,   NA,  1.6
)

Subject-level covariates

cov_subj <- tibble::tribble(
  ~ID, ~WT, ~SEX, ~AGE,
    1,   72, "F",  34,
    2,   88, "M",  41,
    3,   95, "M",  29   # subject not in pk
)

Visit-level information

visits <- tibble::tribble(
  ~ID, ~TIME, ~VISIT,
    1,  0.5,   "V1",
    1,  1.0,   "V2",
    2,  0.5,   "V1"
)

Key Ideas

A join combines rows from two tables using shared key columns (e.g., ID, TIME).

In PMx workflows:

Event-level data are the structural backbone.
Subject-level data enrich event-level data.
Joins must never silently change event history.
Row counts are a structural diagnostic.

Warning

A join that runs without error can still produce incorrect results.
Always check row counts and unmatched keys.

Demonstrating Join Types in Action

1. left_join(): Default Enrichment

pk_left <- pk %>% left_join(cov_subj, by = "ID")

nrow(pk)

[1] 5

nrow(pk_left)

[1] 5

pk_left

# A tibble: 5 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41

All PK rows are preserved.
Subject 3 in cov_subj is ignored.
Missing matches would appear as NA.

This is the default PMx join.

2. inner_join(): Silent Row Loss Risk

pk_inner <- pk %>% inner_join(cov_subj, by = "ID")

nrow(pk_inner)

[1] 5

pk_inner

# A tibble: 5 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41

Now remove ID 2 from covariates and repeat:

cov_missing <- cov_subj %>% filter(ID != 2)

pk %>% inner_join(cov_missing, by = "ID")

# A tibble: 3 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34

ID 2 rows disappear entirely.

Warning

inner_join() can silently drop event records — often inappropriate in PMx.

3. right_join(): Reversed Perspective

pk_right <- pk %>% right_join(cov_subj, by = "ID")

nrow(pk_right)

[1] 6

pk_right

# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
6     3  NA      NA    NA  NA      95 M        29

Now subject 3 appears with missing PK values.

Rarely used in PMx; usually signals reversed logic.

4. full_join(): Diagnostic Merge

pk_full <- pk %>% full_join(cov_subj, by = "ID")

nrow(pk_full)

[1] 6

pk_full

# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
6     3  NA      NA    NA  NA      95 M        29

You now see:

All PK rows
All covariate rows
Subject 3 with no PK records

Useful for reconciliation.

5. anti_join(): QC Tool

pk %>% anti_join(cov_subj, by = "ID")

# A tibble: 0 × 5
# ℹ 5 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>

cov_subj %>% anti_join(pk, by = "ID")

# A tibble: 1 × 4
     ID    WT SEX     AGE
  <dbl> <dbl> <chr> <dbl>
1     3    95 M        29

This identifies:

PK subjects missing covariates
Covariates without PK records

Powerful structural diagnostic.

Multi-Key Join Example

pk_visit <- pk %>%
  left_join(visits, by = c("ID", "TIME"))

pk_visit

# A tibble: 5 × 6
     ID  TIME  EVID   AMT    DV VISIT
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1     1   0       1   100  NA   <NA> 
2     1   0.5     0    NA   2.1 V1   
3     1   1       0    NA   3.8 V2   
4     2   0       1    80  NA   <NA> 
5     2   0.5     0    NA   1.6 V1

If you join only by ID:

pk %>% left_join(visits, by = "ID")

# A tibble: 8 × 7
     ID TIME.x  EVID   AMT    DV TIME.y VISIT
  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
1     1    0       1   100  NA      0.5 V1   
2     1    0       1   100  NA      1   V2   
3     1    0.5     0    NA   2.1    0.5 V1   
4     1    0.5     0    NA   2.1    1   V2   
5     1    1       0    NA   3.8    0.5 V1   
6     1    1       0    NA   3.8    1   V2   
7     2    0       1    80  NA      0.5 V1   
8     2    0.5     0    NA   1.6    0.5 V1

Row multiplication occurs because multiple TIME rows match.

Warning

Always include all necessary key columns in multi-key joins.

Row Multiplication Detection

Create duplicate keys:

visits_dup <- visits %>%
  add_row(ID = 1, TIME = 0.5, VISIT = "V1_dup")

pk_dup <- pk %>%
  left_join(visits_dup, by = c("ID", "TIME"))

nrow(pk)

[1] 5

nrow(pk_dup)

[1] 6

Row count increases → duplicate keys detected.

Strategies

Default to left_join() when enriching PK data.
Always compare nrow() before and after joins.
Use anti_join() as a QC diagnostic.
Verify key uniqueness in the right-hand table.
Be explicit with by =.

Common Mistakes

Ignoring row count changes
Joining on incomplete keys
Assuming keys are unique
Using inner_join() for enrichment
Not checking unmatched rows

Practice Problems

Use left_join() and confirm row counts remain constant.
Remove one ID from cov_subj and test inner_join().
Use anti_join() to identify mismatches.
Join visits incorrectly using only ID. What happens?
Create duplicate visit keys and observe row expansion.
Use full_join() and describe what new rows appear.

Step-by-Step Solutions

# 1
pk_left <- pk %>% left_join(cov_subj, by = "ID")
stopifnot(nrow(pk_left) == nrow(pk))

# 2
cov_missing <- cov_subj %>% filter(ID != 2)
pk %>% inner_join(cov_missing, by = "ID")

# A tibble: 3 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34

# 3
pk %>% anti_join(cov_subj, by = "ID")

# A tibble: 0 × 5
# ℹ 5 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>

# 4
pk %>% left_join(visits, by = "ID")

# A tibble: 8 × 7
     ID TIME.x  EVID   AMT    DV TIME.y VISIT
  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
1     1    0       1   100  NA      0.5 V1   
2     1    0       1   100  NA      1   V2   
3     1    0.5     0    NA   2.1    0.5 V1   
4     1    0.5     0    NA   2.1    1   V2   
5     1    1       0    NA   3.8    0.5 V1   
6     1    1       0    NA   3.8    1   V2   
7     2    0       1    80  NA      0.5 V1   
8     2    0.5     0    NA   1.6    0.5 V1

# 5
visits_dup <- visits %>%
  add_row(ID = 1, TIME = 0.5, VISIT = "V1_dup")

pk %>% left_join(visits_dup, by = c("ID", "TIME"))

# A tibble: 6 × 6
     ID  TIME  EVID   AMT    DV VISIT 
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
1     1   0       1   100  NA   <NA>  
2     1   0.5     0    NA   2.1 V1    
3     1   0.5     0    NA   2.1 V1_dup
4     1   1       0    NA   3.8 V2    
5     2   0       1    80  NA   <NA>  
6     2   0.5     0    NA   1.6 V1

# 6
pk %>% full_join(cov_subj, by = "ID")

# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
6     3  NA      NA    NA  NA      95 M        29

Summary

You now understand:

How each join type behaves structurally.
Why left_join() is safest for PMx enrichment.
How row multiplication and row loss occur.
How to use anti_join() and full_join() as QC diagnostics.

Joins are structural operations.
Treat them as QC checkpoints, not just convenience tools.

Quick Tips

Default to left_join().
Always compare row counts before and after joins.
Use anti_join() to detect mismatches.
Include all necessary key columns.
Duplicate keys cause row multiplication.
Inner joins can silently drop PK records.