String Helpers for Data Cleaning

Use stringr helpers to detect, clean, standardize, and validate text fields in PMx datasets.

Tip

What you’ll build today: practical patterns for cleaning messy text fields (IDs, arms, labels, flags) using string helpers.

Learning Objectives

By the end of this lesson, you will be able to:

Use str_detect() to filter based on patterns.
Clean and standardize labels using str_replace() and str_to_upper().
Remove unwanted whitespace with str_trim().
Apply string logic safely inside mutate() and filter().
Recognize when simple pattern matching can prevent QC issues.

Setup

library(tidyverse)

Example Dataset

df <- tibble::tribble(
  ~ID, ~ARM,        ~COMMENT,
    1, "Placebo ",  "BLQ at baseline",
    2, "treatment", "OK",
    3, "Treatment", "blq at 1hr",
    4, "PLACEBO",   "Sample hemolyzed"
)

df

# A tibble: 4 × 3
     ID ARM         COMMENT         
  <dbl> <chr>       <chr>           
1     1 "Placebo "  BLQ at baseline 
2     2 "treatment" OK              
3     3 "Treatment" blq at 1hr      
4     4 "PLACEBO"   Sample hemolyzed

Key Ideas

Real-world PMx datasets often contain:

inconsistent capitalization
trailing or leading spaces
free-text comments
inconsistent group labels

String helpers allow you to standardize these safely.

Worked Example 1: Standardizing ARM labels

df_clean <- df %>%
  mutate(
    ARM = str_trim(ARM),
    ARM = str_to_upper(ARM)
  )

df_clean

# A tibble: 4 × 3
     ID ARM       COMMENT         
  <dbl> <chr>     <chr>           
1     1 PLACEBO   BLQ at baseline 
2     2 TREATMENT OK              
3     3 TREATMENT blq at 1hr      
4     4 PLACEBO   Sample hemolyzed

Worked Example 2: Detecting BLQ records

df %>%
  filter(str_detect(str_to_lower(COMMENT), "blq"))

# A tibble: 2 × 3
     ID ARM         COMMENT        
  <dbl> <chr>       <chr>          
1     1 "Placebo "  BLQ at baseline
2     3 "Treatment" blq at 1hr

Worked Example 3: Cleaning text with replacement

df %>%
  mutate(
    COMMENT = str_replace_all(COMMENT, "BLQ|blq", "BLQ")
  )

# A tibble: 4 × 3
     ID ARM         COMMENT         
  <dbl> <chr>       <chr>           
1     1 "Placebo "  BLQ at baseline 
2     2 "treatment" OK              
3     3 "Treatment" BLQ at 1hr      
4     4 "PLACEBO"   Sample hemolyzed

Strategies

Standardize case early (str_to_upper()).
Trim whitespace immediately after import.
Use str_detect() instead of exact matching when appropriate.
Create derived flags from text fields.

Common Mistakes

Using exact matches when patterns are needed.
Forgetting to standardize case before filtering.
Ignoring whitespace differences.
Writing overly complex regular expressions.

Practice Problems

Standardize ARM to uppercase without trailing spaces.
Filter rows where comments contain “BLQ” (case-insensitive).
Replace “hemolyzed” with “Hemolyzed”.
Create a logical column is_blq from COMMENT.
Explain why trimming whitespace prevents join errors.

Step-by-Step Solutions

df %>%
  mutate(
    ARM = str_to_upper(str_trim(ARM)),
    is_blq = str_detect(str_to_lower(COMMENT), "blq"),
    COMMENT = str_replace(COMMENT, "hemolyzed", "Hemolyzed")
  )

# A tibble: 4 × 4
     ID ARM       COMMENT          is_blq
  <dbl> <chr>     <chr>            <lgl> 
1     1 PLACEBO   BLQ at baseline  TRUE  
2     2 TREATMENT OK               FALSE 
3     3 TREATMENT blq at 1hr       TRUE  
4     4 PLACEBO   Sample Hemolyzed FALSE

Summary

You now know how to:

Detect patterns in text columns.
Standardize categorical labels.
Clean inconsistent strings.
Create QC flags from free text.

String cleaning is often the first hidden step in reproducible PMx workflows.

Quick Tips

Standardize case early.
Trim whitespace immediately after import.
Use str_detect() for flexible filtering.
Keep regular expressions simple and readable.