Joining Data

Learn how to safely join covariate, dosing, and event tables using dplyr joins, and how to diagnose common PMx join failures.
Tip

What you’ll build today: reliable patterns for merging covariates, dosing, and observation data without silently breaking your PMx dataset.

Learning Objectives

By the end of this lesson, you will be able to:

  • Clearly distinguish between left_join(), right_join(), inner_join(), full_join(), and anti_join().
  • Choose the correct join type for common PMx workflows.
  • Join subject-level covariates to event-level data safely.
  • Diagnose failed joins, duplicate keys, and unintended row expansion.
  • Use joins as a QC tool, not just a data manipulation step.

Setup

library(tidyverse)

Example Datasets

Event-level data (PK records)

pk <- tibble::tribble(
  ~ID, ~TIME, ~EVID, ~AMT, ~DV,
    1,   0.0,    1,  100,  NA,
    1,   0.5,    0,   NA,  2.1,
    1,   1.0,    0,   NA,  3.8,
    2,   0.0,    1,   80,  NA,
    2,   0.5,    0,   NA,  1.6
)

Subject-level covariates

cov_subj <- tibble::tribble(
  ~ID, ~WT, ~SEX, ~AGE,
    1,   72, "F",  34,
    2,   88, "M",  41,
    3,   95, "M",  29   # subject not in pk
)

Visit-level information

visits <- tibble::tribble(
  ~ID, ~TIME, ~VISIT,
    1,  0.5,   "V1",
    1,  1.0,   "V2",
    2,  0.5,   "V1"
)

Key Ideas

A join combines rows from two tables using shared key columns (e.g., ID, TIME).

In PMx workflows:

  • Event-level data are the structural backbone.
  • Subject-level data enrich event-level data.
  • Joins must never silently change event history.
  • Row counts are a structural diagnostic.
Warning

A join that runs without error can still produce incorrect results.
Always check row counts and unmatched keys.


Demonstrating Join Types in Action


1. left_join(): Default Enrichment

pk_left <- pk %>% left_join(cov_subj, by = "ID")

nrow(pk)
[1] 5
nrow(pk_left)
[1] 5
pk_left
# A tibble: 5 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
  • All PK rows are preserved.
  • Subject 3 in cov_subj is ignored.
  • Missing matches would appear as NA.

This is the default PMx join.


2. inner_join(): Silent Row Loss Risk

pk_inner <- pk %>% inner_join(cov_subj, by = "ID")

nrow(pk_inner)
[1] 5
pk_inner
# A tibble: 5 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41

Now remove ID 2 from covariates and repeat:

cov_missing <- cov_subj %>% filter(ID != 2)

pk %>% inner_join(cov_missing, by = "ID")
# A tibble: 3 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34

ID 2 rows disappear entirely.

Warning

inner_join() can silently drop event records — often inappropriate in PMx.


3. right_join(): Reversed Perspective

pk_right <- pk %>% right_join(cov_subj, by = "ID")

nrow(pk_right)
[1] 6
pk_right
# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
6     3  NA      NA    NA  NA      95 M        29

Now subject 3 appears with missing PK values.

Rarely used in PMx; usually signals reversed logic.


4. full_join(): Diagnostic Merge

pk_full <- pk %>% full_join(cov_subj, by = "ID")

nrow(pk_full)
[1] 6
pk_full
# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
6     3  NA      NA    NA  NA      95 M        29

You now see:

  • All PK rows
  • All covariate rows
  • Subject 3 with no PK records

Useful for reconciliation.


5. anti_join(): QC Tool

pk %>% anti_join(cov_subj, by = "ID")
# A tibble: 0 × 5
# ℹ 5 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>
cov_subj %>% anti_join(pk, by = "ID")
# A tibble: 1 × 4
     ID    WT SEX     AGE
  <dbl> <dbl> <chr> <dbl>
1     3    95 M        29

This identifies:

  • PK subjects missing covariates
  • Covariates without PK records

Powerful structural diagnostic.


Multi-Key Join Example

pk_visit <- pk %>%
  left_join(visits, by = c("ID", "TIME"))

pk_visit
# A tibble: 5 × 6
     ID  TIME  EVID   AMT    DV VISIT
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1     1   0       1   100  NA   <NA> 
2     1   0.5     0    NA   2.1 V1   
3     1   1       0    NA   3.8 V2   
4     2   0       1    80  NA   <NA> 
5     2   0.5     0    NA   1.6 V1   

If you join only by ID:

pk %>% left_join(visits, by = "ID")
# A tibble: 8 × 7
     ID TIME.x  EVID   AMT    DV TIME.y VISIT
  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
1     1    0       1   100  NA      0.5 V1   
2     1    0       1   100  NA      1   V2   
3     1    0.5     0    NA   2.1    0.5 V1   
4     1    0.5     0    NA   2.1    1   V2   
5     1    1       0    NA   3.8    0.5 V1   
6     1    1       0    NA   3.8    1   V2   
7     2    0       1    80  NA      0.5 V1   
8     2    0.5     0    NA   1.6    0.5 V1   

Row multiplication occurs because multiple TIME rows match.

Warning

Always include all necessary key columns in multi-key joins.


Row Multiplication Detection

Create duplicate keys:

visits_dup <- visits %>%
  add_row(ID = 1, TIME = 0.5, VISIT = "V1_dup")

pk_dup <- pk %>%
  left_join(visits_dup, by = c("ID", "TIME"))

nrow(pk)
[1] 5
nrow(pk_dup)
[1] 6

Row count increases → duplicate keys detected.


Strategies

  • Default to left_join() when enriching PK data.
  • Always compare nrow() before and after joins.
  • Use anti_join() as a QC diagnostic.
  • Verify key uniqueness in the right-hand table.
  • Be explicit with by =.

Common Mistakes

  • Ignoring row count changes
  • Joining on incomplete keys
  • Assuming keys are unique
  • Using inner_join() for enrichment
  • Not checking unmatched rows

Practice Problems

  1. Use left_join() and confirm row counts remain constant.
  2. Remove one ID from cov_subj and test inner_join().
  3. Use anti_join() to identify mismatches.
  4. Join visits incorrectly using only ID. What happens?
  5. Create duplicate visit keys and observe row expansion.
  6. Use full_join() and describe what new rows appear.

# 1
pk_left <- pk %>% left_join(cov_subj, by = "ID")
stopifnot(nrow(pk_left) == nrow(pk))

# 2
cov_missing <- cov_subj %>% filter(ID != 2)
pk %>% inner_join(cov_missing, by = "ID")
# A tibble: 3 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
# 3
pk %>% anti_join(cov_subj, by = "ID")
# A tibble: 0 × 5
# ℹ 5 variables: ID <dbl>, TIME <dbl>, EVID <dbl>, AMT <dbl>, DV <dbl>
# 4
pk %>% left_join(visits, by = "ID")
# A tibble: 8 × 7
     ID TIME.x  EVID   AMT    DV TIME.y VISIT
  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
1     1    0       1   100  NA      0.5 V1   
2     1    0       1   100  NA      1   V2   
3     1    0.5     0    NA   2.1    0.5 V1   
4     1    0.5     0    NA   2.1    1   V2   
5     1    1       0    NA   3.8    0.5 V1   
6     1    1       0    NA   3.8    1   V2   
7     2    0       1    80  NA      0.5 V1   
8     2    0.5     0    NA   1.6    0.5 V1   
# 5
visits_dup <- visits %>%
  add_row(ID = 1, TIME = 0.5, VISIT = "V1_dup")

pk %>% left_join(visits_dup, by = c("ID", "TIME"))
# A tibble: 6 × 6
     ID  TIME  EVID   AMT    DV VISIT 
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
1     1   0       1   100  NA   <NA>  
2     1   0.5     0    NA   2.1 V1    
3     1   0.5     0    NA   2.1 V1_dup
4     1   1       0    NA   3.8 V2    
5     2   0       1    80  NA   <NA>  
6     2   0.5     0    NA   1.6 V1    
# 6
pk %>% full_join(cov_subj, by = "ID")
# A tibble: 6 × 8
     ID  TIME  EVID   AMT    DV    WT SEX     AGE
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1   0       1   100  NA      72 F        34
2     1   0.5     0    NA   2.1    72 F        34
3     1   1       0    NA   3.8    72 F        34
4     2   0       1    80  NA      88 M        41
5     2   0.5     0    NA   1.6    88 M        41
6     3  NA      NA    NA  NA      95 M        29

Summary

You now understand:

  • How each join type behaves structurally.
  • Why left_join() is safest for PMx enrichment.
  • How row multiplication and row loss occur.
  • How to use anti_join() and full_join() as QC diagnostics.

Joins are structural operations.
Treat them as QC checkpoints, not just convenience tools.


  • Default to left_join().
  • Always compare row counts before and after joins.
  • Use anti_join() to detect mismatches.
  • Include all necessary key columns.
  • Duplicate keys cause row multiplication.
  • Inner joins can silently drop PK records.