Regex for beginners - extract!
Introduction to Regular Expressions (regex) in R: how to extract patterns
Intro
As a Data Scientist you should know how to use regular expressions (regex).
You can use regex to detect, locate and extract patterns.
How to detect patterns:
https://rolkra.github.io/regex-for-beginners-detect/
str_extract() can be used to extract patterns:
> library(stringr)
> string <- c("apple", "orange", "apple+orange")
> str_extract(string, "a.*e")
[1] "apple" "ange" "apple+orange"
So we are searching for a substring that starts with “a” and ends with “e”. The pattern is not anchored, so the match can begin anywhere in the string. We find:
- “apple” in “apple”
- “ange” in “orange”
- “apple+orange” in “apple+orange”
But wait, why not “apple” in “apple+orange” (as it starts with “a” and ends with “e” too)?
Greedy Matching
By default the asterisk * is “greedy”, i.e. it always matches the longest possible string. That's why str_extract() returns “apple+orange” and not just “apple”.
If you want to get the shortest match:
> library(stringr)
> string <- c("apple", "orange", "apple+orange")
> str_extract(string, "a.*?e")
[1] "apple" "ange" "apple"
By adding a ? after the * in a regex, the asterisk is no longer greedy: it matches as few characters as possible.
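To see the difference at a glance, here is a minimal sketch that runs both patterns on the same strings and puts the results side by side. The variable names greedy and lazy are only illustrative.

```r
library(stringr)

string <- c("apple", "orange", "apple+orange")

greedy <- str_extract(string, "a.*e")   # * takes as many characters as possible
lazy   <- str_extract(string, "a.*?e")  # *? takes as few characters as possible

# Compare both results side by side; only the third string differs
data.frame(string, greedy, lazy)
```

Only “apple+orange” contains more than one “e” after the first “a”, so it is the only string where the two patterns disagree.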
All Matches
To get ALL matches you can use str_extract_all() from {stringr}:
> library(stringr)
> string <- c("apple", "orange", "apple+orange")
> str_extract_all(string, "a.*?e")
[[1]]
[1] "apple"
[[2]]
[1] "ange"
[[3]]
[1] "apple" "ange"
Now we find “apple” and “ange” in “apple+orange”. After getting the first pattern match (“apple” in “apple+orange”), regex will search for the next pattern match in the REMAINING string (“+orange”).
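If you want to check where each match sits in the string, str_locate_all() from {stringr} returns the start and end position of every match. The following is a small sketch for the third string only; the variable name pos is just an illustration.

```r
library(stringr)

string <- "apple+orange"

# One matrix per input string, with the columns "start" and "end"
pos <- str_locate_all(string, "a.*?e")[[1]]
pos

# Cut the matches out of the string by position
found <- str_sub(string, pos[, "start"], pos[, "end"])
found
```

The second match starts after the end of the first one, which is why “ange” is found in the remaining part “+orange”.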
To get a matrix as result (instead of a list):
> library(stringr)
> string <- c("apple", "orange", "apple+orange")
> str_extract_all(string, "a.*?e", simplify = TRUE)
     [,1]    [,2]
[1,] "apple" ""
[2,] "ange"  ""
[3,] "apple" "ange"
Replace
Use str_replace() to replace the first match of a pattern in a string:
> library(stringr)
> string <- c("apple", "orange", "apple+orange")
> str_replace(string, "a.*?e", "AE")
[1] "AE" "orAE" "AE+orange"
To replace ALL matches:
> library(stringr)
> string <- c("apple", "orange", "apple+orange")
> str_replace_all(string, "a.*?e", "AE")
[1] "AE" "orAE" "AE+orAE"
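Greedy matching affects replacements in the same way as extractions. As a short sketch (the variable names are only illustrative), compare a greedy replacement with the non-greedy str_replace_all() call:

```r
library(stringr)

string <- c("apple", "orange", "apple+orange")

# Greedy: the whole of "apple+orange" is one match and gets replaced once
greedy_first <- str_replace(string, "a.*e", "AE")

# Non-greedy: "apple" and "ange" are separate matches, both get replaced
lazy_all <- str_replace_all(string, "a.*?e", "AE")

data.frame(string, greedy_first, lazy_all)
```

With the greedy pattern the “+” disappears, because it is part of the single long match.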
Cheat Sheets
String manipulation with {stringr}:
https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf
Regex in R:
https://raw.githubusercontent.com/rstudio/cheatsheets/main/regex.pdf