library(tidyverse)String Helpers for Data Cleaning
Use stringr helpers to detect, clean, standardize, and validate text fields in PMx datasets.
Tip
What you’ll build today: practical patterns for cleaning messy text fields (IDs, arms, labels, flags) using string helpers.
Learning Objectives
By the end of this lesson, you will be able to:
- Use
str_detect()to filter based on patterns. - Clean and standardize labels using
str_replace()andstr_to_upper(). - Remove unwanted whitespace with
str_trim(). - Apply string logic safely inside
mutate()andfilter(). - Recognize when simple pattern matching can prevent QC issues.
Setup
Example Dataset
df <- tibble::tribble(
~ID, ~ARM, ~COMMENT,
1, "Placebo ", "BLQ at baseline",
2, "treatment", "OK",
3, "Treatment", "blq at 1hr",
4, "PLACEBO", "Sample hemolyzed"
)
df# A tibble: 4 × 3
ID ARM COMMENT
<dbl> <chr> <chr>
1 1 "Placebo " BLQ at baseline
2 2 "treatment" OK
3 3 "Treatment" blq at 1hr
4 4 "PLACEBO" Sample hemolyzed
Key Ideas
Real-world PMx datasets often contain:
- inconsistent capitalization
- trailing or leading spaces
- free-text comments
- inconsistent group labels
String helpers allow you to standardize these safely.
Worked Example 1: Standardizing ARM labels
df_clean <- df %>%
mutate(
ARM = str_trim(ARM),
ARM = str_to_upper(ARM)
)
df_clean# A tibble: 4 × 3
ID ARM COMMENT
<dbl> <chr> <chr>
1 1 PLACEBO BLQ at baseline
2 2 TREATMENT OK
3 3 TREATMENT blq at 1hr
4 4 PLACEBO Sample hemolyzed
Worked Example 2: Detecting BLQ records
df %>%
filter(str_detect(str_to_lower(COMMENT), "blq"))# A tibble: 2 × 3
ID ARM COMMENT
<dbl> <chr> <chr>
1 1 "Placebo " BLQ at baseline
2 3 "Treatment" blq at 1hr
Worked Example 3: Cleaning text with replacement
df %>%
mutate(
COMMENT = str_replace_all(COMMENT, "BLQ|blq", "BLQ")
)# A tibble: 4 × 3
ID ARM COMMENT
<dbl> <chr> <chr>
1 1 "Placebo " BLQ at baseline
2 2 "treatment" OK
3 3 "Treatment" BLQ at 1hr
4 4 "PLACEBO" Sample hemolyzed
Strategies
- Standardize case early (
str_to_upper()). - Trim whitespace immediately after import.
- Use
str_detect()instead of exact matching when appropriate. - Create derived flags from text fields.
Common Mistakes
- Using exact matches when patterns are needed.
- Forgetting to standardize case before filtering.
- Ignoring whitespace differences.
- Writing overly complex regular expressions.
Practice Problems
- Standardize
ARMto uppercase without trailing spaces. - Filter rows where comments contain “BLQ” (case-insensitive).
- Replace “hemolyzed” with “Hemolyzed”.
- Create a logical column
is_blqfromCOMMENT. - Explain why trimming whitespace prevents join errors.
TipStep-by-Step Solutions
df %>%
mutate(
ARM = str_to_upper(str_trim(ARM)),
is_blq = str_detect(str_to_lower(COMMENT), "blq"),
COMMENT = str_replace(COMMENT, "hemolyzed", "Hemolyzed")
)# A tibble: 4 × 4
ID ARM COMMENT is_blq
<dbl> <chr> <chr> <lgl>
1 1 PLACEBO BLQ at baseline TRUE
2 2 TREATMENT OK FALSE
3 3 TREATMENT blq at 1hr TRUE
4 4 PLACEBO Sample Hemolyzed FALSE
Summary
You now know how to:
- Detect patterns in text columns.
- Standardize categorical labels.
- Clean inconsistent strings.
- Create QC flags from free text.
String cleaning is often the first hidden step in reproducible PMx workflows.
TipQuick Tips
- Standardize case early.
- Trim whitespace immediately after import.
- Use
str_detect()for flexible filtering. - Keep regular expressions simple and readable.