Regex for beginners - detect!
Introduction to Regular Expressions (regex) in R: how to detect patterns
Intro
As a data scientist you should know how to use regular expressions (regex).
There are several ways to use regex in R to search for patterns in a character vector:
grep is a base-R function that returns an integer vector with the indices of the matching elements. In this example it returns 2:
> grep("Orange", c("Apple", "Orange", "Lemon"))
[1] 2
grepl returns a logical vector:
> grepl("Orange", c("Apple", "Orange", "Lemon"))
[1] FALSE TRUE FALSE
To return the matching values instead of the indices, use grep with the parameter value = TRUE:
> grep("a", c("Apple", "Orange", "Lemon"), value = TRUE)
[1] "Orange"
If you prefer the {tidyverse}, you can use str_detect() from {stringr}. Here the pattern is the second argument, not the first!
> library(stringr)
> str_detect(c("Apple", "Orange", "Lemon"), "Orange")
[1] FALSE TRUE FALSE
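The base-R calls above relate to each other in a simple way. The following is a minimal sketch of that relationship; the variable names (fruits, idx, hit and so on) are chosen here for illustration only:

```r
# Example vector from the text
fruits <- c("Apple", "Orange", "Lemon")

idx <- grep("Orange", fruits)    # integer indices of the matches
hit <- grepl("Orange", fruits)   # one TRUE/FALSE per element

# Both results can be used to subset the vector
by_index   <- fruits[idx]
by_logical <- fruits[hit]

# which() turns the logical vector into the indices grep() returns
same_indices <- identical(which(hit), idx)

print(by_index)
print(by_logical)
print(same_indices)
```

For filtering, grepl() is often the more convenient choice, because its result always has the same length as the input, even when nothing matches.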
Case Sensitive
grep is case sensitive by default:
> grep("a", c("Apple", "Orange", "Lemon"))
[1] 2
Only “Orange” matches, not “Apple”, because the capital “A” is a different character than “a”.
To switch off case sensitivity you can use the parameter ignore.case = TRUE:
> grep("a", c("Apple", "Orange", "Lemon"), ignore.case = TRUE)
[1] 1 2
Now “Apple” and “Orange” both match, as both contain “a” (ignoring upper/lower case).
The same using {tidyverse} and {stringr}:
> library(stringr)
> str <- c("Apple", "Orange", "Lemon")
> str_detect(str, regex("a", ignore_case = TRUE))
[1] TRUE TRUE FALSE
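There is more than one way to ignore case in base R. The sketch below compares three options; the inline flag (?i) is not mentioned in the text above and, as an assumption of this sketch, is used together with perl = TRUE (the Perl-compatible engine):

```r
fruits <- c("Apple", "Orange", "Lemon")

# 1) The parameter shown in the text
with_param <- grepl("a", fruits, ignore.case = TRUE)

# 2) Normalise the input to lower case before matching
with_tolower <- grepl("a", tolower(fruits))

# 3) Inline flag (?i) inside the pattern, using the Perl-compatible engine
with_flag <- grepl("(?i)a", fruits, perl = TRUE)

print(with_param)
print(with_tolower)
print(with_flag)
```

All three give the same result for this input. The parameter is the most readable option; the inline flag can be handy when the pattern is passed around as a single string.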
Starts With
To search for a pattern at the beginning of a string you can use ^:
> grep("^a", c("Apple", "Orange", "Lemon"), ignore.case = TRUE)
[1] 1
Now only “Apple” matches, because the “a” in “Orange” is in the middle.
Ends With
To search for a pattern at the end of a string you can use $:
> grep("e$", c("Apple", "Orange", "Lemon"), ignore.case = TRUE)
[1] 1 2
“Apple” and “Orange” match, but not “Lemon”, because the “e” in “Lemon” is not at the end!
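When the prefix or suffix is a plain string rather than a pattern, base R (version 3.3.0 or later) also offers startsWith() and endsWith(). They are not regex functions and are always case sensitive, so the sketch below uses “A” instead of “a”; it is only meant to show how they line up with the anchored patterns:

```r
fruits <- c("Apple", "Orange", "Lemon")

# Literal prefix / suffix checks (no regex involved)
starts_a <- startsWith(fruits, "A")
ends_e   <- endsWith(fruits, "e")

# The equivalent anchored regular expressions
starts_a_regex <- grepl("^A", fruits)
ends_e_regex   <- grepl("e$", fruits)

print(starts_a)
print(ends_e)
```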
Starts & Ends With
To combine both:
> grep("^A.*e$", c("Apple", "Orange", "Lemon"))
[1] 1
- ^A = begins with “A”
- .* = any string
- e$ = ends with “e”
To control the number of characters between start and end you can use the quantifiers *, +, ? and { }:
- .* = any string (0 or more characters)
- .+ = any string with a minimum length of 1 character
- .? = string with length 0 or 1
- .{3} = exactly 3 characters
- .{3,} = 3 or more characters
- .{0,3} = at most 3 characters (the short form .{,3} is not accepted by every regex engine)
- .{1,3} = between 1 and 3 characters
> str = c("Apple", "Ape", "Ae")
> grep("^A.{2,}e$", str)
[1] 1
Only “Apple” starts with “A”, ends with “e” and has 2 or more characters in between!
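To see how the quantifiers differ, it can help to run several patterns against the same strings. The following sketch does that; the test words are made up for illustration and have 0, 1, 2 and 3 characters between “A” and “e”:

```r
# Made-up words with 0, 1, 2 and 3 characters between "A" and "e"
words <- c("Ae", "Abe", "Abce", "Abcde")

patterns <- c("^A.*e$", "^A.+e$", "^A.?e$",
              "^A.{3}e$", "^A.{2,}e$", "^A.{1,3}e$")

# One column per pattern, one row per word
result <- sapply(patterns, function(p) grepl(p, words))
rownames(result) <- words

print(result)
```

Reading the printed matrix column by column shows which lengths each quantifier accepts.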
Which character?
To search for ANY character you can use .
If you want to be more specific:
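- [a-z] = letter a to z (lower case)
- [A-Z] = letter A to Z (upper case)
- [ACES] = letter A or C or E or S
- [0-9] = digit (0 to 9)
- [1234] = digit 1 or 2 or 3 or 4
- [A-Za-z0-9] = letter (A to Z, a to z) or digit (0 to 9); avoid [A-z], whose range also covers the characters between “Z” and “a” in ASCII, such as “[”, “^” and “_”
- [ ] = blank (“ ”)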
You can exclude characters too (using ^ as the first character inside the brackets):
- [^0-9] = non-digit (any character except 0 to 9)
- [^A-Z] = any character except an upper case letter (A to Z)
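Character classes are often combined with the anchors ^ and $ and a quantifier to check that a whole string consists only of certain characters. A minimal sketch, with made-up test strings (note that the interpretation of ranges such as a-z can depend on the locale; the expected behaviour described here assumes a typical C or UTF-8 locale):

```r
# Made-up test strings
x <- c("abc", "ABC", "123", "a1", " ")

only_lower   <- grepl("^[a-z]+$", x)        # lower case letters only
only_digits  <- grepl("^[0-9]+$", x)        # digits only
has_nondigit <- grepl("[^0-9]", x)          # at least one non-digit
only_alnum   <- grepl("^[A-Za-z0-9]+$", x)  # letters and digits only

print(data.frame(x, only_lower, only_digits, has_nondigit, only_alnum))
```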
Character Classes, Groups & Quantifiers
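We are looking for strings that contain two lowercase letters in a row.
> grep("[a-z]{2}", c("apple", "apple+apple", "appleapple"))
[1] 1 2 3
We define the character class [a-z] (letter a to z) with the quantifier {2} (two times).
All three strings match, because each contains two consecutive lowercase letters (for example “ap”). Note that [a-z]{2} does not require the same letter twice. How about detecting a repeated word?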
> grep("apple{2}", c("apple", "apple+apple", "appleapple"))
integer(0)
This regex does NOT detect the repeated word “apple”! The quantifier applies only to the preceding character, so the pattern searches for “applee”.
Detecting the repeated word “apple” can be done by using parentheses.
> grep("(apple){2}", c("apple", "apple+apple", "appleapple"))
[1] 3
We define the word as a group (using parentheses) and apply the quantifier {2} (two times).
Here only “appleapple” matches. “apple+apple” does not match, as there is another character between the words.
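A group can also be referenced later in the same pattern with a backreference (\\1 for the first group), which is one way to require that the same text occurs twice. This goes slightly beyond the examples above, so treat it as a sketch; perl = TRUE is an assumption made here to select the Perl-compatible engine explicitly, and the word lists are made up:

```r
# A letter directly followed by the same letter (e.g. "pp", "ll", "oo")
words <- c("apple", "orange", "lemon", "balloon")
has_double_letter <- grepl("([a-z])\\1", words, perl = TRUE)

# The word "apple", anything in between, then "apple" again
phrases <- c("apple", "apple+apple", "appleapple")
has_repeated_word <- grepl("(apple).*\\1", phrases, perl = TRUE)

print(has_double_letter)
print(has_repeated_word)
```

Unlike (apple){2}, the second pattern also accepts “apple+apple”, because .* allows any characters between the two occurrences.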
Escaping
> grep(".txt", c("f1.txt", "f2.txt", "f3-txt"))
[1] 1 2 3
This result may surprise you if you are not used to regex. Here “.” is not the character “.”, but means “any character”. To search for the character “.”, we need to “escape” it:
> grep("\\.txt", c("f1.txt", "f2.txt", "f3-txt"))
[1] 1 2
Note that a regex is not a file glob: a leading “*” as in “*.txt” has nothing to repeat and is rejected by some regex engines.
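Characters you need to escape (using \\):
- . = \\.
- + = \\+
- * = \\*
- \ = \\\\
- $ = \\$
- ^ = \\^
And more: ? | ( ) [ ] { }
A double quote (") is not a regex metacharacter, but inside an R string literal it has to be written as \".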
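If the text to search for comes from a variable, escaping by hand is error-prone. One possible approach is a small helper that puts a backslash in front of every metacharacter. The function escape_regex below is hypothetical (it is not part of base R or of the original article) and is only a sketch:

```r
# Hypothetical helper (not part of base R): put a backslash in front of
# every regex metacharacter so that the string is matched literally.
escape_regex <- function(x) {
  gsub("([.|(){}^$*+?\\[\\]\\\\])", "\\\\\\1", x, perl = TRUE)
}

escaped <- escape_regex("f1.txt")   # the dot is now escaped
files   <- c("f1.txt", "f1-txt")
hits    <- grepl(escaped, files)

cat(escaped, "\n")
print(hits)
```

Only “f1.txt” matches, because the escaped dot no longer stands for “any character”.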
Fixed (Literal)
If you just want to search for a string as it is (literal), you can use fixed = TRUE
> grep(".", c("a.b.c", "a-b-c", "a^b^c"), fixed = TRUE)
[1] 1
Here “.” is just the character “.” (no special meaning). So only the first item matches. Without fixed = TRUE all three elements match, because all contain “any character” (the meaning of “.” as a regex):
> grep(".", c("a.b.c", "a-b-c", "a^b^c"))
[1] 1 2 3
You can use fixed() in {stringr} too:
> library(stringr)
> str_detect("a[b", fixed("["))
[1] TRUE
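The difference between regex and literal matching becomes obvious with a search term that contains a quantifier. A small sketch with made-up strings:

```r
x <- c("a+b", "aab")

# As a regex, "a+b" means: one or more "a", followed by "b"
as_regex <- grepl("a+b", x)

# As a literal, "a+b" means exactly the three characters a, +, b
as_literal <- grepl("a+b", x, fixed = TRUE)

print(as_regex)
print(as_literal)
```

The two calls give opposite results: the regex finds “aab” but not the string “a+b” itself, while the literal search finds only “a+b”.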
Mix Regex + Literal
If you want to mix a regex with a literal expression, you can use \\Q and \\E:
> grep(".\\Q[x^2]=\\E[0-9]+", c("a[x^2]=0", "b(x^2)=1"))
[1] 1
- . = any character
- \\Q[x^2]=\\E = string “[x^2]=” (literal)
- [0-9]+ = one digit or more
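The same idea works when the literal part is stored in a variable: wrap it in \\Q and \\E while building the pattern. In this sketch perl = TRUE is an assumption made to select the Perl-compatible engine explicitly, and the variable names and the third test string are made up. The literal itself must not contain \\E:

```r
literal <- "[x^2]="   # to be matched exactly as written

# any character + literal part + one or more digits
pattern <- paste0(".\\Q", literal, "\\E[0-9]+")

x    <- c("a[x^2]=0", "b(x^2)=1", "[x^2]=7")
hits <- grepl(pattern, x, perl = TRUE)

print(hits)
```

The third string does not match because the leading . requires one character in front of the literal part.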
Or
You can use ( | ) for a logical OR.
Example: detect all filenames with extension “.csv” or “.txt”:
> grep(".+\\.(csv|txt)", c("f1.csv", "f2.xls", "f3.txt"))
[1] 1 3
- .+ = any string (min length 1)
- \\. = the character “.” (need to escape it)
- (csv|txt) = “csv” OR “txt”
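When the list of alternatives is stored in a vector, the OR group can be built with paste(). The sketch below also anchors the pattern with $ so that the extension has to be at the very end of the file name; the extra file name “f4.csv.bak” is made up to show the effect:

```r
extensions <- c("csv", "txt")

# Builds the pattern \.(csv|txt)$
pattern <- paste0("\\.(", paste(extensions, collapse = "|"), ")$")

files <- c("f1.csv", "f2.xls", "f3.txt", "f4.csv.bak")
hits  <- grepl(pattern, files)

print(files[hits])
```

Without the trailing $, “f4.csv.bak” would match as well, because “.csv” occurs somewhere inside the name.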
Examples
Email format
This is a very simplified email format checker
> pattern <- "^[a-zA-Z0-9_\\.\\+\\-]+@[a-zA-Z0-9\\-\\.]+$"
> grep(pattern, c("bob.hope@me.com", "bob.hope"))
[1] 1
Only the first email address matches.
- ^[a-zA-Z0-9_\\.\\+\\-]+ = starts with alphanumeric or _ . + - (min 1 character)
- @ = character “@”
- [a-zA-Z0-9\\-\\.]+$ = ends with alphanumeric or - . (min 1 character)
User name
> pattern <- "^[a-z0-9_-]{3,16}$"
> grep(pattern, c("bobhope", "x"))
[1] 1
Only the first user name matches.
- Allowed: letters a to z, digits 0 to 9, and the characters _ and -
- Minimum length = 3
- Maximum length = 16
Hex Color Code
> pattern <- "^#[a-fA-F0-9]{6}$"
> grep(pattern, c("#00FF12", "#xxff"))
[1] 1
Only the first color code matches.
- ^# = must start with “#”
- [a-fA-F0-9]{6} = 6 characters (a to f, A to F and 0 to 9 allowed)
- $ = end of string
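Patterns like these are easier to reuse when they are stored in one place. The sketch below keeps the user name and hex color patterns from above in a named vector and wraps grepl() in a small helper; the names patterns and matches_format are hypothetical and chosen here for illustration:

```r
# Patterns taken from the examples above
patterns <- c(
  user_name = "^[a-z0-9_-]{3,16}$",
  hex_color = "^#[a-fA-F0-9]{6}$"
)

# Hypothetical helper: TRUE for each element of x that has the given format
matches_format <- function(x, format) {
  grepl(patterns[[format]], x)
}

user_ok  <- matches_format(c("bobhope", "x"), "user_name")
color_ok <- matches_format(c("#00FF12", "#xxff"), "hex_color")

print(user_ok)
print(color_ok)
```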
Continue …
with pattern extraction & greedy matching
https://rolkra.github.io/regex-for-beginners-extract/