1. Regular Expressions (RegEx) Cheat Sheet
- 1. Regular Expressions (RegEx) Cheat Sheet
- 1.1 RegEx Pattern Matching Flow
- 1.2 Basic Syntax
- 1.3 Advanced Techniques
- 1.4 Python re Module Methods
- 1.4.1 re.search() - Find First Match
- 1.4.2 re.match() - Match from Beginning
- 1.4.3 re.fullmatch() - Match Entire String
- 1.4.4 re.findall() - Find All Matches
- 1.4.5 re.finditer() - Iterator of Matches
- 1.4.6 re.sub() - Replace Matches
- 1.4.7 re.subn() - Replace with Count
- 1.4.8 re.split() - Split by Pattern
- 1.4.9 re.compile() - Compile Pattern
- 1.4.10 re.escape() - Escape Special Characters
- 1.4.11 Match Object Methods
- 1.5 Common RegEx Engines and Differences
- 1.6 Best Practices
- 1.6.1 1. Be Specific and Precise
- 1.6.2 2. Optimize Performance
- 1.6.3 3. Avoid Catastrophic Backtracking
- 1.6.4 4. Test Thoroughly
- 1.6.5 5. Document Complex Patterns
- 1.6.6 6. Use Raw Strings in Python
- 1.6.7 7. Consider Alternatives
- 1.6.8 8. Know Your Engine
- 1.6.9 9. Security Considerations
- 1.6.10 10. Common Pitfalls to Avoid
- 1.7 Quick Reference Summary
This cheat sheet provides an exhaustive overview of regular expressions (RegEx), covering syntax, character classes, quantifiers, anchors, groups, flags, and advanced techniques. It aims to be a complete reference for using regular expressions, with a focus on Python examples.
1.1 RegEx Pattern Matching Flow
ββββββββββββ
β Input β
β String β
βββββββ¬βββββ
β
ββββββββββββββββ
β Pattern β
β Compile β
ββββββββ¬ββββββββ
β
ββββββββββββββββ
β Match β
β Attempt β
ββββββββ¬ββββββββ
β
βββββββ΄βββββββ
β β
ββββββββ βββββββββββ
βMatch β βNo Match β
ββββββββ βββββββββββ
1.2 Basic Syntax
1.2.1 Literals
Characters match themselves literally, except for special characters.
abc: Matches the literal string "abc".hello: Matches the literal string "hello".
import re
# Basic literal matching
text = "abc def ghi"
pattern = r"abc"
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: abc
# Multiple literal matches
text = "hello world hello"
matches = re.findall(r"hello", text)
print(matches) # Output: ['hello', 'hello']
1.2.2 Special Characters
These characters have special meanings in RegEx:
. ^ $ * + ? { } [ ] \ | ( )
To match these characters literally, escape them with a backslash (\):
\.: Matches a literal dot (.).\\: Matches a literal backslash (\).
import re
# Escaping special characters
text = "123.456"
pattern = r"\." # Matches a literal dot
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: .
# Matching literal backslash
text = "path\\to\\file"
pattern = r"\\" # Matches a literal backslash
matches = re.findall(pattern, text)
print(matches) # Output: ['\\', '\\']
# Matching special characters
text = "Price: $100 + $50 = $150"
pattern = r"\$\d+" # Matches dollar sign followed by digits
matches = re.findall(pattern, text)
print(matches) # Output: ['$100', '$50', '$150']
1.2.3 Character Classes
.: Matches any character except a newline (unless thesflag is used).[abc]: Matches any character inside the brackets (a, b, or c).[^abc]: Matches any character not inside the brackets.[a-z]: Matches any character in the range a to z (lowercase).[A-Z]: Matches any character in the range A to Z (uppercase).[0-9]: Matches any digit.[a-zA-Z0-9]: Matches any alphanumeric character.
Character Classes:
[abc] [^abc] [a-z]
β β β
β β β
Match Negate Range
a,b,c NOT a,b,c a-z
import re
# Match specific characters
text = "apple banana cherry"
pattern = r"[abc]" # Matches 'a', 'b', or 'c'
matches = re.findall(pattern, text)
print(matches) # Output: ['a', 'a', 'b', 'a', 'a', 'a', 'a', 'c']
# Negated character class
text = "apple banana cherry"
pattern = r"[^abc]" # Matches any character except 'a', 'b', or 'c'
matches = re.findall(pattern, text)
print(matches) # Output: ['p', 'p', 'l', 'e', ' ', 'n', 'n', ' ', 'h', 'e', 'r', 'r', 'y']
# Range matching
text = "apple1 banana2 cherry3"
pattern = r"[0-9]" # Matches any digit
matches = re.findall(pattern, text)
print(matches) # Output: ['1', '2', '3']
# Alphanumeric matching
text = "Hello123 World456"
pattern = r"[a-zA-Z0-9]+" # Matches alphanumeric sequences
matches = re.findall(pattern, text)
print(matches) # Output: ['Hello123', 'World456']
# Dot wildcard
text = "Hello. World!"
pattern = r"." # Matches any character except newline
matches = re.findall(pattern, text)
print(matches) # Output: ['H', 'e', 'l', 'l', 'o', '.', ' ', 'W', 'o', 'r', 'l', 'd', '!']
1.2.4 Predefined Character Classes
\d: Matches any digit (equivalent to[0-9]).\D: Matches any non-digit (equivalent to[^0-9]).\w: Matches any word character (alphanumeric + underscore, equivalent to[a-zA-Z0-9_]).\W: Matches any non-word character (equivalent to[^a-zA-Z0-9_]).\s: Matches any whitespace character (space, tab, newline, etc.).\S: Matches any non-whitespace character.\b: Matches a word boundary.\B: Matches a non-word boundary.\t: Matches a tab character.\n: Matches a newline character.\r: Matches a carriage return character.\f: Matches a form feed character.\v: Matches a vertical tab character.\0: Matches a null character.[\b]: Matches a backspace character (inside a character class).
Predefined Classes:
\d β [0-9] \D β [^0-9]
\w β [a-zA-Z0-9_] \W β [^a-zA-Z0-9_]
\s β [ \t\n\r\f\v] \S β [^ \t\n\r\f\v]
import re
# Digit matching
text = "123 abc 456"
pattern = r"\d+" # Matches one or more digits
matches = re.findall(pattern, text)
print(matches) # Output: ['123', '456']
# Word character matching
text = "user_name123 = value"
pattern = r"\w+" # Matches word characters
matches = re.findall(pattern, text)
print(matches) # Output: ['user_name123', 'value']
# Word boundary
text = "Hello World, HelloWorld"
pattern = r"\bWorld\b" # Matches "World" at word boundary
matches = re.findall(pattern, text)
print(matches) # Output: ['World']
# Whitespace matching
text = "Hello\tWorld\n"
pattern = r"\s+" # Matches one or more whitespace characters
matches = re.findall(pattern, text)
print(matches) # Output: ['\t', '\n']
# Non-digit matching
text = "abc123def456"
pattern = r"\D+" # Matches non-digits
matches = re.findall(pattern, text)
print(matches) # Output: ['abc', 'def']
1.2.5 Quantifiers
Greedy vs Lazy Quantifiers:
a+ a+?
β β
Greedy Lazy
(max) (min)
* β 0 or more *? β 0 or more (lazy)
+ β 1 or more +? β 1 or more (lazy)
? β 0 or 1 ?? β 0 or 1 (lazy)
{n} β exactly n
{n,} β n or more {n,}? β n or more (lazy)
{n,m} β n to m {n,m}? β n to m (lazy)
*: Matches 0 or more occurrences.+: Matches 1 or more occurrences.?: Matches 0 or 1 occurrence.{n}: Matches exactlynoccurrences.{n,}: Matchesnor more occurrences.{n,m}: Matches betweennandmoccurrences (inclusive).*?,+?,??,{n,}?,{n,m}?: Non-greedy (lazy) versions.
import re
# One or more
text = "aaabbbccc"
pattern = r"a+" # Matches one or more 'a's
matches = re.findall(pattern, text)
print(matches) # Output: ['aaa']
# Exact range
text = "ab abb abbb abbbb"
pattern = r"ab{2,4}" # Matches 'ab' with 2 to 4 'b's
matches = re.findall(pattern, text)
print(matches) # Output: ['abb', 'abbb', 'abbbb']
# Optional character
text = "color colour"
pattern = r"colou?r" # Matches 'color' or 'colour'
matches = re.findall(pattern, text)
print(matches) # Output: ['color', 'colour']
# Greedy vs lazy
text = "<div>content</div><div>more</div>"
greedy = re.findall(r"<div>.*</div>", text)
print(greedy) # Output: ['<div>content</div><div>more</div>']
lazy = re.findall(r"<div>.*?</div>", text)
print(lazy) # Output: ['<div>content</div>', '<div>more</div>']
# Exact count
text = "The year is 2024"
pattern = r"\d{4}" # Matches exactly 4 digits
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: 2024
1.2.6 Anchors
String Anchors:
^βββββββββββββββtextβββββββββββββββ$
β β
Start End
^ β Start of string/line
$ β End of string/line
\A β Start of string only
\Z β End of string (before final newline)
\z β End of string (absolute)
\b β Word boundary
\B β Non-word boundary
^: Matches the beginning of the string (or line).$: Matches the end of the string (or line).\A: Matches the beginning of the string.\Z: Matches the end of the string, or before a newline at the end.\z: Matches the end of the string.
import re
# Start anchor
text = "hello world"
pattern = r"^hello" # Matches 'hello' at the beginning
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: hello
# End anchor
text = "world\nhello"
pattern = r"hello$" # Matches 'hello' at the end
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: hello
# Both anchors (exact match)
text = "hello"
pattern = r"^hello$" # Matches entire string
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: hello
# String start (\A)
text = "hello world\n"
pattern = r"\Ahello"
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: hello
# String end before newline (\Z)
text = "hello world\n"
pattern = r"world\Z"
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: world
# Absolute string end (\z)
text = "hello world"
pattern = r"world\z"
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: world
1.2.7 Alternation
Alternation Flow:
Pattern
β
βββββββ΄ββββββ
β β
β β
cat dog
β β
βββββββ¬ββββββ
β
Match
|: Matches either the expression before or the expression after the|.
import re
# Simple alternation
text = "cat and dog"
pattern = r"cat|dog" # Matches 'cat' or 'dog'
matches = re.findall(pattern, text)
print(matches) # Output: ['cat', 'dog']
# Multiple alternatives
text = "I like apples, oranges, and bananas"
pattern = r"apples|oranges|bananas"
matches = re.findall(pattern, text)
print(matches) # Output: ['apples', 'oranges', 'bananas']
# Alternation with groups
text = "file.txt file.pdf file.doc"
pattern = r"file\.(txt|pdf|doc)"
matches = re.findall(pattern, text)
print(matches) # Output: ['txt', 'pdf', 'doc']
# Complex alternation
text = "http://example.com https://secure.com"
pattern = r"https?://[\w.]+"
matches = re.findall(pattern, text)
print(matches) # Output: ['http://example.com', 'https://secure.com']
1.2.8 Grouping and Capturing
( ): Groups expressions and creates a capturing group.(?: ): Groups expressions without creating a capturing group.\1,\2, etc.: Backreferences to captured groups.
import re
text = "apple apple"
pattern = r"(\w+) \1" # Matches repeated words
match = re.search(pattern, text)
print(match.group(0)) if match else print("No match") # Output: apple apple
print(match.group(1)) if match else print("No match") # Output: apple
text = "12-34-56"
pattern = r"(\d{2})-(?:(\d{2})-(\d{2}))" # Non-capturing group for the middle part
match = re.search(pattern, text)
if match:
print(match.groups()) # Output: ('12', '34', '56')
1.2.9 Lookarounds
Lookahead & Lookbehind:
(?=...) Positive Lookahead
(?!...) Negative Lookahead
(?<=...) Positive Lookbehind
(?<!...) Negative Lookbehind
Example: \w+(?=\d)
β β
β βββ Must be followed by digit
ββββββββ Matches word chars
(?= ): Positive lookahead - matches if followed by pattern.(?! ): Negative lookahead - matches if NOT followed by pattern.(?<= ): Positive lookbehind - matches if preceded by pattern.(?<! ): Negative lookbehind - matches if NOT preceded by pattern.
import re
# Positive lookahead
text = "apple123 banana456"
pattern = r"[a-zA-Z]+(?=\d)" # Matches letters followed by digits
matches = re.findall(pattern, text)
print(matches) # Output: ['apple', 'banana']
# Negative lookahead
text = "file1.txt file2.pdf file3.txt"
pattern = r"file\d+(?!\.txt)" # Matches files NOT ending with .txt
matches = re.findall(pattern, text)
print(matches) # Output: ['file2']
# Positive lookbehind
text = "$100 β¬200 $300"
pattern = r"(?<=\$)\d+" # Matches numbers preceded by dollar sign
matches = re.findall(pattern, text)
print(matches) # Output: ['100', '300']
# Negative lookbehind
text = "cat bat rat mat"
pattern = r"(?<!c)at" # Matches 'at' NOT preceded by 'c'
matches = re.findall(pattern, text)
print(matches) # Output: ['at', 'at', 'at']
# Password validation (lookahead)
password = "Abc123!@"
# Must contain uppercase, lowercase, digit, and special char
pattern = r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
match = re.match(pattern, password)
print("Valid" if match else "Invalid") # Output: Valid
# Extract price without currency symbol
text = "Price: $50.99"
pattern = r"(?<=\$)\d+\.\d+"
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: 50.99
1.2.10 Flags (Modifiers)
Common Flags:
re.IGNORECASE (re.I) β Case-insensitive
re.MULTILINE (re.M) β ^ and $ match line boundaries
re.DOTALL (re.S) β . matches newlines
re.VERBOSE (re.X) β Allow comments and whitespace
re.ASCII (re.A) β ASCII-only matching
re.UNICODE (re.U) β Unicode matching (default in Python 3)
i(re.IGNORECASE): Case-insensitive matching.m(re.MULTILINE):^and$match line boundaries.s(re.DOTALL):.matches newline characters.x(re.VERBOSE): Allow whitespace and comments in pattern.a(re.ASCII): ASCII-only matching.u(re.UNICODE): Full Unicode matching.
import re
# Case-insensitive
text = "Hello World"
pattern = r"world"
match = re.search(pattern, text, re.IGNORECASE)
print(match.group(0) if match else "No match") # Output: World
# Multiline mode
text = "Line 1\nLine 2\nLine 3"
pattern = r"^Line"
matches = re.findall(pattern, text, re.MULTILINE)
print(matches) # Output: ['Line', 'Line', 'Line']
# Dotall mode
text = "Hello\nWorld"
pattern = r"Hello.World"
match = re.search(pattern, text, re.DOTALL)
print(match.group(0) if match else "No match") # Output: Hello\nWorld
# Verbose mode
pattern = re.compile(r"""
^ # Start of string
[a-zA-Z0-9._%+-]+ # Local part
@ # At symbol
[a-zA-Z0-9.-]+ # Domain
\. # Dot
[a-zA-Z]{2,} # TLD
$ # End of string
""", re.VERBOSE)
email = "test@example.com"
match = pattern.match(email)
print("Valid" if match else "Invalid") # Output: Valid
# Combining multiple flags
text = "Hello\nWORLD"
pattern = r"hello.*world"
match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
print(match.group(0) if match else "No match") # Output: Hello\nWORLD
1.2.11 Character Properties (Unicode)
Note: Python's re module doesn't directly support \p{Property} syntax. Use the regex module for full Unicode property support.
\p{Property}: Matches a character with the specified Unicode property.\P{Property}: Matches a character without the specified Unicode property.
Common Unicode categories: * \p{L}: Letters * \p{N}: Numbers * \p{P}: Punctuation * \p{S}: Symbols * \p{Z}: Separators
# Using regex module (not standard re)
try:
import regex
# Match Unicode letters
text = "Hello 123 γγγ«γ‘γ― δ½ ε₯½"
pattern = regex.compile(r"\p{L}+")
matches = pattern.findall(text)
print(matches) # Output: ['Hello', 'γγγ«γ‘γ―', 'δ½ ε₯½']
# Match Unicode numbers
text = "Hello 123 α‘α’α£"
pattern = regex.compile(r"\p{N}+")
matches = pattern.findall(text)
print(matches) # Output: ['123', 'α‘α’α£']
except ImportError:
print("Install regex module: pip install regex")
# Alternative using standard re module
import re
# Basic Unicode support
text = "Hello 123 γγγ«γ‘γ―"
pattern = r"[^\W\d_]+" # Matches letter sequences
matches = re.findall(pattern, text)
print(matches) # Output: ['Hello', 'γγγ«γ‘γ―']
1.2.12 Examples
- Email:
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
import re
email = "test@example.com"
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
match = re.match(pattern, email)
print(bool(match)) # Output: True
- URL:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
import re
url = "https://www.example.com/path/to/page.html"
pattern = r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$"
match = re.match(pattern, url)
print(bool(match)) # Output: True
- IP Address:
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
import re
ip = "192.168.1.1"
pattern = r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
match = re.match(pattern, ip)
print(bool(match)) # Output: True
- Hex Color Code:
^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$
import re
hex_code = "#FF0000"
pattern = r"^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$"
match = re.match(pattern, hex_code)
print(bool(match)) # Output: True
- Date (YYYY-MM-DD):
^\d{4}-\d{2}-\d{2}$
import re
date = "2024-01-08"
pattern = r"^\d{4}-\d{2}-\d{2}$"
match = re.match(pattern, date)
print(bool(match)) # Output: True
- HTML Tag:
<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)
import re
html = "<p>This is a paragraph.</p>"
pattern = r"<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)"
match = re.search(pattern, html)
print(match.group(1)) if match else print("No match") # Output: p
- Phone Number (US):
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
import re
phone = "(555) 123-4567"
pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
match = re.search(pattern, phone)
print(bool(match)) # Output: True
1.3 Advanced Techniques
1.3.1 Atomic Groups
Note: Python's re module doesn't support atomic groups (?>...). This feature is available in other regex engines like PCRE.
(?> ): Atomic group (prevents backtracking).
import re
# Greedy quantifier with backtracking
text = "aaaaaaaaab"
pattern = r"a+b" # Regular + (will backtrack to find 'b')
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: aaaaaaaaab
# Non-greedy quantifier
text = "aaaaaaaaab"
pattern = r"a+?b" # Non-greedy +?
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: ab
# Note: Atomic groups would prevent backtracking
# In engines that support it: (?>a+)b would fail to match
# because a+ consumes all 'a's and doesn't backtrack
1.3.2 Recursive Patterns
Note: Python's re module doesn't support recursive patterns. Use the regex module or alternative approaches.
(?R)or(?0): Recursively matches the entire pattern.(?1),(?2), etc.: Recursively matches a specific capturing group.
# Using regex module for recursive patterns
try:
import regex
# Match balanced parentheses
text = "((abc)def(ghi))"
pattern = regex.compile(r"\(([^()]|(?R))*\)")
match = pattern.search(text)
print(match.group(0) if match else "No match") # Output: ((abc)def(ghi))
except ImportError:
print("Install regex module: pip install regex")
# Alternative using standard re module (limited depth)
import re
# Match one level of nested parentheses
text = "(abc(def)ghi)"
pattern = r"\([^()]*(?:\([^()]*\)[^()]*)*\)"
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: (abc(def)ghi)
1.3.3 Conditional Expressions
Note: Python's re module has limited support for conditional patterns.
(?(id/name)yes-pattern|no-pattern): Matches based on whether a group matched.
import re
# Conditional based on captured group
text1 = "abc123"
text2 = "xyz456"
# If group 1 exists, match digits; otherwise, match letters
pattern = r"^(abc)?(?(1)\d+|\w+)$"
match1 = re.match(pattern, text1)
print(f"{text1}: {match1.group(0) if match1 else 'No match'}") # Output: abc123: abc123
match2 = re.match(pattern, text2)
print(f"{text2}: {match2.group(0) if match2 else 'No match'}") # Output: xyz456: xyz456
# Match email with optional angle brackets
text = "<user@example.com>"
pattern = r"^(<)?(\S+@\S+)(?(1)>)$"
match = re.match(pattern, text)
print(match.group(2) if match else "No match") # Output: user@example.com
1.3.4 Named Capture Groups
Named Groups:
(?P<name>...)
β
βββ Name for capturing group
(?P=name) β Backreference by name
\g<name> β Replacement reference
(?P<name> ): Creates a named capturing group.(?P=name): Backreference to a named capturing group.
import re
# Named groups for repeated words
text = "apple apple"
pattern = r"(?P<word>\w+) (?P=word)" # Matches repeated words
match = re.search(pattern, text)
if match:
print(match.group('word')) # Output: apple
print(match.group(0)) # Output: apple apple
# Extract structured data
text = "Date: 2024-01-08"
pattern = r"Date: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, text)
if match:
print(f"Year: {match.group('year')}") # Output: Year: 2024
print(f"Month: {match.group('month')}") # Output: Month: 01
print(f"Day: {match.group('day')}") # Output: Day: 08
print(match.groupdict()) # Output: {'year': '2024', 'month': '01', 'day': '08'}
# Using named groups in substitution
text = "John Doe"
pattern = r"(?P<first>\w+) (?P<last>\w+)"
result = re.sub(pattern, r"\g<last>, \g<first>", text)
print(result) # Output: Doe, John
1.3.5 Comments
(?#comment): Inline comment.
import re
text = "123-4567"
pattern = r"\d{3}(?#This is a comment)-?\d{4}"
match = re.search(pattern, text)
print(match.group(0)) if match else print("No match") # Output: 123-4567
1.3.6 Free-Spacing Mode (x flag)
import re
text = "123-4567"
pattern = r"""
\d{3} # Area code
-? # Optional separator
\d{4} # Phone number
"""
match = re.search(pattern, text, re.VERBOSE) # Use re.VERBOSE or re.X
print(match.group(0)) if match else print("No match") # Output: 123-4567
1.3.7 Branch Reset Groups
Note: Branch reset groups (?|...) are not supported in Python's re module.
This feature is available in PCRE and some other regex engines. It resets capture group numbering within alternatives.
1.3.8 Subroutine Calls
Note: Subroutine calls (?&name) are not supported in Python's re module.
For complex nested patterns, consider using the regex module or parsing libraries.
1.4 Python re Module Methods
Python re Module Workflow:
βββββββββββββββ
β Pattern β
ββββββββ¬βββββββ
β
ββββββ΄βββββ
β β
compile Direct
β β
βββββ¬ββββββ
β
βββββββββββββ
β Methods: β
β search β
β match β
β findall β
β finditer β
β sub β
β split β
βββββββββββββ
1.4.1 re.search() - Find First Match
import re
text = "The price is $19.99 and discount is 25%"
# Find first occurrence
match = re.search(r"\d+\.?\d*", text)
if match:
print(f"Found: {match.group(0)}") # Output: Found: 19.99
print(f"Position: {match.start()}-{match.end()}") # Output: Position: 13-18
1.4.2 re.match() - Match from Beginning
import re
text = "Python 3.12"
# Match only at the start of string
match = re.match(r"Python", text)
print(match.group(0) if match else "No match") # Output: Python
# Won't match if pattern not at start
match = re.match(r"3\.12", text)
print(match.group(0) if match else "No match") # Output: No match
1.4.3 re.fullmatch() - Match Entire String
import re
# Validate entire string matches pattern
text = "12345"
match = re.fullmatch(r"\d+", text)
print("Valid" if match else "Invalid") # Output: Valid
text = "123abc"
match = re.fullmatch(r"\d+", text)
print("Valid" if match else "Invalid") # Output: Invalid
1.4.4 re.findall() - Find All Matches
import re
text = "Email: john@example.com, jane@test.org"
# Find all email addresses
emails = re.findall(r"\b[\w.-]+@[\w.-]+\.\w+\b", text)
print(emails) # Output: ['john@example.com', 'jane@test.org']
# With groups - returns tuples
text = "ID: 123, Name: John; ID: 456, Name: Jane"
pattern = r"ID: (\d+), Name: (\w+)"
results = re.findall(pattern, text)
print(results) # Output: [('123', 'John'), ('456', 'Jane')]
1.4.5 re.finditer() - Iterator of Matches
import re
text = "Phone: 555-1234, 555-5678"
pattern = r"(\d{3})-(\d{4})"
# Iterate over all matches
for match in re.finditer(pattern, text):
print(f"Full: {match.group(0)}")
print(f"Area: {match.group(1)}, Number: {match.group(2)}")
print(f"Position: {match.start()}-{match.end()}")
print()
# Output:
# Full: 555-1234
# Area: 555, Number: 1234
# Position: 7-15
#
# Full: 555-5678
# Area: 555, Number: 5678
# Position: 17-25
1.4.6 re.sub() - Replace Matches
import re
text = "Hello World, Hello Universe"
# Simple replacement
result = re.sub(r"Hello", "Hi", text)
print(result) # Output: Hi World, Hi Universe
# Replace with count limit
result = re.sub(r"Hello", "Hi", text, count=1)
print(result) # Output: Hi World, Hello Universe
# Using backreferences
text = "John Doe, Jane Smith"
result = re.sub(r"(\w+) (\w+)", r"\2, \1", text)
print(result) # Output: Doe, John, Smith, Jane
# Using function for replacement
def uppercase_match(match):
return match.group(0).upper()
text = "hello world"
result = re.sub(r"\w+", uppercase_match, text)
print(result) # Output: HELLO WORLD
# Remove extra whitespace
text = "Too many spaces"
result = re.sub(r"\s+", " ", text)
print(result) # Output: Too many spaces
1.4.7 re.subn() - Replace with Count
import re
text = "cat bat rat mat"
# Returns tuple (new_string, number_of_substitutions)
result, count = re.subn(r"[cbr]at", "hat", text)
print(f"Result: {result}") # Output: Result: hat hat hat mat
print(f"Replacements: {count}") # Output: Replacements: 3
1.4.8 re.split() - Split by Pattern
import re
# Split on whitespace
text = "apple banana cherry"
parts = re.split(r"\s+", text)
print(parts) # Output: ['apple', 'banana', 'cherry']
# Split on multiple delimiters
text = "apple,banana;cherry:orange"
parts = re.split(r"[,;:]", text)
print(parts) # Output: ['apple', 'banana', 'cherry', 'orange']
# Split with limit
text = "one two three four five"
parts = re.split(r"\s+", text, maxsplit=2)
print(parts) # Output: ['one', 'two', 'three four five']
# Keep delimiters using groups
text = "apple,banana;cherry"
parts = re.split(r"([,;])", text)
print(parts) # Output: ['apple', ',', 'banana', ';', 'cherry']
1.4.9 re.compile() - Compile Pattern
import re
# Compile for reuse (more efficient)
pattern = re.compile(r"\b\w+@\w+\.\w+\b", re.IGNORECASE)
text1 = "Contact: john@example.com"
text2 = "Email: JANE@TEST.ORG"
match1 = pattern.search(text1)
match2 = pattern.search(text2)
print(match1.group(0) if match1 else "No match") # Output: john@example.com
print(match2.group(0) if match2 else "No match") # Output: JANE@TEST.ORG
# Access compiled pattern properties
print(pattern.pattern) # Output: \b\w+@\w+\.\w+\b
print(pattern.flags) # Output: re.IGNORECASE value
1.4.10 re.escape() - Escape Special Characters
import re
# Escape special regex characters
user_input = "What is $100 + $50?"
escaped = re.escape(user_input)
print(escaped) # Output: What\ is\ \$100\ \+\ \$50\?
# Safe pattern from user input
text = "The price is $100"
user_search = "$100"
pattern = re.escape(user_search)
match = re.search(pattern, text)
print(match.group(0) if match else "No match") # Output: $100
1.4.11 Match Object Methods
import re
text = "Contact: John Doe (john@example.com)"
pattern = r"(\w+) (\w+) \(([\w@.]+)\)"
match = re.search(pattern, text)
if match:
# Get matched string
print(match.group(0)) # Output: John Doe (john@example.com)
print(match.group(1)) # Output: John
print(match.group(2)) # Output: Doe
print(match.group(3)) # Output: john@example.com
# Get all groups
print(match.groups()) # Output: ('John', 'Doe', 'john@example.com')
# Get positions
print(match.start()) # Output: 9
print(match.end()) # Output: 38
print(match.span()) # Output: (9, 38)
# Get span of specific group
print(match.span(1)) # Output: (9, 13)
# Named groups
pattern = r"(?P<first>\w+) (?P<last>\w+)"
match = re.search(pattern, "John Doe")
print(match.group('first')) # Output: John
print(match.groupdict()) # Output: {'first': 'John', 'last': 'Doe'}
1.5 Common RegEx Engines and Differences
- PCRE (Perl Compatible Regular Expressions): Widely used, feature-rich.
- JavaScript: Good support, but lookbehind assertions were limited (now widely supported).
- Python (
remodule): Excellent support, including Unicode properties. - .NET: Powerful and feature-rich.
- Java: Good support, some syntax differences.
- POSIX: Basic and Extended Regular Expressions (BRE and ERE). Limited features.
1.6 Best Practices
RegEx Best Practices:
ββββββββββββββββββ
β 1. Be Specific β
βββββββββ¬βββββββββ
β
ββββββββββββββββββ
β 2. Optimize β
βββββββββ¬βββββββββ
β
ββββββββββββββββββ
β 3. Test β
βββββββββ¬βββββββββ
β
ββββββββββββββββββ
β 4. Document β
ββββββββββββββββββ
1.6.1 1. Be Specific and Precise
- Avoid overly broad patterns: Use
\d{4}instead of.{4}for 4 digits. - Use anchors: Add
^and$for exact matches. - Be explicit:
[0-9]is clearer than\dwhen you want only ASCII digits.
import re
# Bad: Too broad
pattern_bad = r".+@.+\..+"
# Good: Specific
pattern_good = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
1.6.2 2. Optimize Performance
- Use character classes:
\dis more efficient than[0-9]. - Use non-capturing groups:
(?:...)when you don't need captured text. - Compile patterns: Use
re.compile()for repeated use. - Be aware of greediness: Use lazy quantifiers (
*?,+?) when appropriate.
import re
# Bad: Capturing unnecessary groups
pattern_bad = r"(http|https)://([\w.]+)/(.+)"
# Good: Non-capturing groups
pattern_good = r"(?:http|https)://[\w.]+/.+"
# Compile for reuse
compiled = re.compile(pattern_good)
for url in urls:
compiled.match(url)
1.6.3 3. Avoid Catastrophic Backtracking
- Avoid nested quantifiers:
(a+)+or(a*)*can cause exponential backtracking. - Use atomic groups or possessive quantifiers: (if available in your engine).
- Be careful with alternation:
(a|ab)+can be slow.
import re
# Dangerous: Can cause catastrophic backtracking
# pattern_dangerous = r"(a+)+b"
# Safe alternatives:
pattern_safe1 = r"a+b"
pattern_safe2 = r"(?:a+)b"
1.6.4 4. Test Thoroughly
- Use online tools: regex101.com, regexr.com, pythex.org
- Test edge cases: Empty strings, special characters, very long strings.
- Use unit tests: Validate against known inputs/outputs.
import re
def validate_email(email):
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
return bool(re.match(pattern, email))
# Test cases
assert validate_email("test@example.com") == True
assert validate_email("invalid@") == False
assert validate_email("@invalid.com") == False
assert validate_email("") == False
1.6.5 5. Document Complex Patterns
- Use verbose mode:
re.VERBOSEorre.Xflag. - Add comments: Explain what each part does.
- Break into parts: Define subpatterns separately.
import re
# Bad: Hard to read
pattern_bad = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
# Good: Well documented
pattern_good = re.compile(r"""
^ # Start of string
(?=.*[a-z]) # At least one lowercase letter
(?=.*[A-Z]) # At least one uppercase letter
(?=.*\d) # At least one digit
(?=.*[@$!%*?&]) # At least one special character
[A-Za-z\d@$!%*?&]{8,} # At least 8 characters from allowed set
$ # End of string
""", re.VERBOSE)
1.6.6 6. Use Raw Strings in Python
- Always use
r"..."for patterns: Prevents double-escaping issues. - Escape special characters properly: Use
\.for literal dots.
import re
# Bad: Need double escaping
pattern_bad = "\\d+\\.\\d+" # Confusing!
# Good: Raw string
pattern_good = r"\d+\.\d+" # Clear and correct
1.6.7 7. Consider Alternatives
- String methods: For simple cases, use
str.startswith(),str.endswith(),str.split(). - Parsing libraries: For structured data (JSON, XML, CSV).
- Avoid regex for: HTML/XML parsing, programming language parsing.
# Bad: Using regex for simple check
import re
if re.match(r"^Hello", text):
pass
# Good: Using string method
if text.startswith("Hello"):
pass
1.6.8 8. Know Your Engine
- Python
re: Good standard support, lacks some advanced features. - Python
regex: Extended features (Unicode properties, recursion). - PCRE: Full-featured, used in many languages.
- JavaScript: Good support, some differences in lookbehind.
1.6.9 9. Security Considerations
- Validate input length: Prevent DoS from long strings.
- Sanitize user input: Escape special characters with
re.escape(). - Set timeout limits: For untrusted patterns.
import re
# Safe: Escape user input
user_input = "$100"
escaped = re.escape(user_input)
pattern = re.compile(escaped)
# Validate input length
if len(text) > 10000:
raise ValueError("Input too long")
1.6.10 10. Common Pitfalls to Avoid
- Forgetting anchors:
\d+matches "abc123" at position 3. - Not escaping special chars:
.matches any character, not just dot. - Greedy vs lazy confusion:
.*vs.*? - Character class mistakes:
[A-z]includes more than[A-Za-z]. - Word boundaries:
\bdoesn't work with non-ASCII characters as expected.
import re
# Common mistakes:
text = "test@example.com"
# Wrong: Doesn't escape dot
pattern_wrong = r".+@.+..+" # Matches too much
# Correct: Escapes dot
pattern_correct = r".+@.+\..+"
# Wrong: Character class includes extra chars
pattern_wrong2 = r"[A-z]+" # Includes [\]^_`
# Correct: Explicit ranges
pattern_correct2 = r"[A-Za-z]+"
1.7 Quick Reference Summary
ββββββββββββββββββββββββββββββββββββββββββββββ
β REGEX QUICK REFERENCE β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β BASIC β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β . Any character except \n β
β \d Digit [0-9] β
β \w Word char [a-zA-Z0-9_] β
β \s Whitespace [ \t\n\r\f\v] β
β \D \W \S Negated versions β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β QUANTIFIERS β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β * 0 or more (greedy) β
β + 1 or more (greedy) β
β ? 0 or 1 β
β {n} Exactly n times β
β {n,m} Between n and m times β
β *? +? ?? Lazy versions β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β ANCHORS β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β ^ Start of string/line β
β $ End of string/line β
β \b Word boundary β
β \B Not word boundary β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β GROUPS β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β (...) Capturing group β
β (?:...) Non-capturing group β
β (?P<n>...) Named group β
β \1 \2 Backreference β
β (?P=n) Named backreference β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β LOOKAROUND β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β (?=...) Positive lookahead β
β (?!...) Negative lookahead β
β (?<=...) Positive lookbehind β
β (?<!...) Negative lookbehind β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β FLAGS β
ββββββββββββββββββββββββββββββββββββββββββββββ€
β re.I Case-insensitive β
β re.M Multiline ^ $ anchors β
β re.S Dot matches newline β
β re.X Verbose mode (comments) β
ββββββββββββββββββββββββββββββββββββββββββββββ
1.7.1 Essential Python Methods
| Method | Description | Returns |
|---|---|---|
re.search(pattern, string) | Find first match anywhere | Match object or None |
re.match(pattern, string) | Match at start of string | Match object or None |
re.fullmatch(pattern, string) | Match entire string | Match object or None |
re.findall(pattern, string) | Find all matches | List of strings |
re.finditer(pattern, string) | Iterator of matches | Iterator of Match objects |
re.sub(pattern, repl, string) | Replace matches | New string |
re.split(pattern, string) | Split by pattern | List of strings |
re.compile(pattern) | Compile pattern | Compiled pattern object |
1.7.2 Common Patterns Quick Guide
import re
# Email
r"^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$"
# URL
r"^https?://[\w.-]+\.[a-zA-Z]{2,}(/.*)?$"
# Phone (US)
r"^\d{3}-\d{3}-\d{4}$"
# IP Address
r"^(?:\d{1,3}\.){3}\d{1,3}$"
# Date YYYY-MM-DD
r"^\d{4}-\d{2}-\d{2}$"
# Hex Color
r"^#[a-fA-F0-9]{6}$"
# Username (3-16 alphanumeric)
r"^[a-zA-Z0-9_]{3,16}$"
# Strong Password
r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"