String Helpers for Data Cleaning

Use stringr helpers to detect, clean, standardize, and validate text fields in PMx datasets.
Tip

What you’ll build today: practical patterns for cleaning messy text fields (IDs, arms, labels, flags) using string helpers.

Learning Objectives

By the end of this lesson, you will be able to:

  • Use str_detect() to filter based on patterns.
  • Clean and standardize labels using str_replace() and str_to_upper().
  • Remove unwanted whitespace with str_trim().
  • Apply string logic safely inside mutate() and filter().
  • Recognize when simple pattern matching can prevent QC issues.

Setup

library(tidyverse)

Example Dataset

df <- tibble::tribble(
  ~ID, ~ARM,        ~COMMENT,
    1, "Placebo ",  "BLQ at baseline",
    2, "treatment", "OK",
    3, "Treatment", "blq at 1hr",
    4, "PLACEBO",   "Sample hemolyzed"
)

df
# A tibble: 4 × 3
     ID ARM         COMMENT         
  <dbl> <chr>       <chr>           
1     1 "Placebo "  BLQ at baseline 
2     2 "treatment" OK              
3     3 "Treatment" blq at 1hr      
4     4 "PLACEBO"   Sample hemolyzed

Key Ideas

Real-world PMx datasets often contain:

  • inconsistent capitalization
  • trailing or leading spaces
  • free-text comments
  • inconsistent group labels

String helpers allow you to standardize these safely.


Worked Example 1: Standardizing ARM labels

df_clean <- df %>%
  mutate(
    ARM = str_trim(ARM),
    ARM = str_to_upper(ARM)
  )

df_clean
# A tibble: 4 × 3
     ID ARM       COMMENT         
  <dbl> <chr>     <chr>           
1     1 PLACEBO   BLQ at baseline 
2     2 TREATMENT OK              
3     3 TREATMENT blq at 1hr      
4     4 PLACEBO   Sample hemolyzed

Worked Example 2: Detecting BLQ records

df %>%
  filter(str_detect(str_to_lower(COMMENT), "blq"))
# A tibble: 2 × 3
     ID ARM         COMMENT        
  <dbl> <chr>       <chr>          
1     1 "Placebo "  BLQ at baseline
2     3 "Treatment" blq at 1hr     

Worked Example 3: Cleaning text with replacement

df %>%
  mutate(
    COMMENT = str_replace_all(COMMENT, "BLQ|blq", "BLQ")
  )
# A tibble: 4 × 3
     ID ARM         COMMENT         
  <dbl> <chr>       <chr>           
1     1 "Placebo "  BLQ at baseline 
2     2 "treatment" OK              
3     3 "Treatment" BLQ at 1hr      
4     4 "PLACEBO"   Sample hemolyzed

Strategies

  • Standardize case early (str_to_upper()).
  • Trim whitespace immediately after import.
  • Use str_detect() instead of exact matching when appropriate.
  • Create derived flags from text fields.

Common Mistakes

  • Using exact matches when patterns are needed.
  • Forgetting to standardize case before filtering.
  • Ignoring whitespace differences.
  • Writing overly complex regular expressions.

Practice Problems

  1. Standardize ARM to uppercase without trailing spaces.
  2. Filter rows where comments contain “BLQ” (case-insensitive).
  3. Replace “hemolyzed” with “Hemolyzed”.
  4. Create a logical column is_blq from COMMENT.
  5. Explain why trimming whitespace prevents join errors.

df %>%
  mutate(
    ARM = str_to_upper(str_trim(ARM)),
    is_blq = str_detect(str_to_lower(COMMENT), "blq"),
    COMMENT = str_replace(COMMENT, "hemolyzed", "Hemolyzed")
  )
# A tibble: 4 × 4
     ID ARM       COMMENT          is_blq
  <dbl> <chr>     <chr>            <lgl> 
1     1 PLACEBO   BLQ at baseline  TRUE  
2     2 TREATMENT OK               FALSE 
3     3 TREATMENT blq at 1hr       TRUE  
4     4 PLACEBO   Sample Hemolyzed FALSE 

Summary

You now know how to:

  • Detect patterns in text columns.
  • Standardize categorical labels.
  • Clean inconsistent strings.
  • Create QC flags from free text.

String cleaning is often the first hidden step in reproducible PMx workflows.


  • Standardize case early.
  • Trim whitespace immediately after import.
  • Use str_detect() for flexible filtering.
  • Keep regular expressions simple and readable.