re2 module

Regular expressions using Google’s RE2 engine.

Compared to Python’s re, the RE2 engine compiles regular expressions to deterministic finite automata, which guarantees linear-time behavior.

Intended as a drop-in replacement for re. Unicode is supported by encoding to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given. For best performance, work with UTF-8 encoded bytes strings.

Regular expressions that are not compatible with RE2 are processed with fallback to re. Examples of features not supported by RE2:

lookahead assertions (?!...)

backreferences (\n in search pattern, where n is a group number, e.g. \1)

\W and \S not supported inside character classes

On the other hand, Unicode character classes are supported (e.g., \p{Greek}). Syntax reference: https://github.com/google/re2/wiki/Syntax

What follows is a reference for the regular expression syntax supported by this module (i.e., without requiring fallback to re).

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like “A”, “a”, or “0”, are the simplest regular expressions; they simply match themselves.

The special characters are:

"."      Matches any character except a newline.
"^"      Matches the start of the string.
"$"      Matches the end of the string or just before the newline at
         the end of the string.
"*"      Matches 0 or more (greedy) repetitions of the preceding RE.
         Greedy means that it will match as many repetitions as possible.
"+"      Matches 1 or more (greedy) repetitions of the preceding RE.
"?"      Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n}    Matches from m to n repetitions of the preceding RE.
{m,n}?   Non-greedy version of the above.
"\"      Either escapes special characters or signals a special sequence.
[]       Indicates a set of characters.
         A "^" as the first character indicates a complementing set.
"|"      A|B, creates an RE that will match either A or B.
(...)    Matches the RE inside the parentheses.
         The contents can be retrieved or matched later in the string.
(?:...)  Non-grouping version of regular parentheses.
(?imsux) Set the I, M, S, U, or X flag for the RE (see below).
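
A minimal sketch of a few of the special characters above: greedy versus non-greedy repetition, alternation inside a non-capturing group, and a bounded {m,n} repetition. The import guard is an assumption added here so the sketch also runs where re2 is not installed; it relies on re2 being a drop-in replacement for re.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

text = "<a><b>"

# Greedy: ".*" takes as much as possible, so the match spans both tags.
greedy = re2.search(r"<.*>", text).group(0)    # "<a><b>"

# Non-greedy: ".*?" stops at the first ">" that completes a match.
lazy = re2.search(r"<.*?>", text).group(0)     # "<a>"

# (?:...) groups the alternation without capturing it; only (\d{2,3}) is
# retrievable, and {2,3} takes at most three digits.
m = re2.search(r"(?:cat|dog)s?-(\d{2,3})", "two dogs-1234")
captured = m.group(1)                          # "123"
```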

The special sequences consist of “\” and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character:

\A         Matches only at the start of the string.
\Z         Matches only at the end of the string.
\b         Matches the empty string, but only at the start or end of a word.
\B         Matches the empty string, but not at the start or end of a word.
\d         Matches any decimal digit.
\D         Matches any non-digit character.
\s         Matches any whitespace character.
\S         Matches any non-whitespace character.
\w         Matches any alphanumeric character.
\W         Matches the complement of \w.
\\         Matches a literal backslash.
\pN        Unicode character class (one-letter name)
\p{Greek}  Unicode character class
\PN        negated Unicode character class (one-letter name)
\P{Greek}  negated Unicode character class
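
The special sequences can be tried out as follows. This is only a sketch on ASCII input; the Unicode classes such as \p{Greek} are left out because the stdlib fallback used in the import guard (an assumption added here so the code runs without re2) does not support them.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

# \w matches letters, digits and underscore; \b anchors at word boundaries.
words = re2.findall(r"\b\w+\b", "foo bar_1, baz")   # ['foo', 'bar_1', 'baz']

# \d matches decimal digits.
numbers = re2.findall(r"\d+", "a1b22c333")          # ['1', '22', '333']

# \s matches whitespace; collapse each run into a single space.
collapsed = re2.sub(r"\s+", " ", "a \t\n b")        # 'a b'

# \A matches only at the very start of the string.
at_start = re2.findall(r"\Aab", "ab ab")            # ['ab']
```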

This module exports the following functions:

count     Count all occurrences of a pattern in a string.
match     Match a regular expression pattern to the beginning of a string.
fullmatch Match a regular expression pattern to all of a string.
search    Search a string for a pattern and return a Match object.
contains  Same as search, but only return bool.
sub       Substitute occurrences of a pattern found in a string.
subn      Same as sub, but also return the number of substitutions made.
split     Split a string by the occurrences of a pattern.
findall   Find all occurrences of a pattern in a string.
finditer  Return an iterator yielding a match object for each match.
compile   Compile a pattern into a Pattern object.
purge     Clear the regular expression cache.
escape    Backslash all non-alphanumerics in a string.
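
The difference between match, search, and fullmatch is easiest to see side by side. The sketch below sticks to functions shared with re; contains and count are omitted because the stdlib fallback in the import guard (an assumption added so the code runs without re2) does not provide them.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

# match: the pattern must match at the beginning of the string.
m_start = re2.match(r"\d+", "123abc")       # matches "123"
m_miss = re2.match(r"\d+", "abc123")        # None: string starts with letters

# search: the pattern may match anywhere in the string.
s_any = re2.search(r"\d+", "abc123")        # matches "123"

# fullmatch: the pattern must account for the entire string.
f_miss = re2.fullmatch(r"\d+", "123abc")    # None: trailing "abc" is left over
f_all = re2.fullmatch(r"\d+", "123")        # matches "123"
```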

Some of the functions in this module take flags as optional parameters:

A  ASCII       Make \w, \W, \b, \B, \d, \D match the corresponding ASCII
               character categories (rather than the whole Unicode
               categories, which is the default).
I  IGNORECASE  Perform case-insensitive matching.
M  MULTILINE   "^" matches the beginning of lines (after a newline)
               as well as the string.
               "$" matches the end of lines (before a newline) as well
               as the end of the string.
S  DOTALL      "." matches any character at all, including the newline.
X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
U  UNICODE     Enable Unicode character classes and make \w, \W, \b, \B
               Unicode-aware (default for unicode patterns).
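
As an illustration of the flags above, the following sketch passes IGNORECASE, MULTILINE, and DOTALL through the flags keyword. It assumes the flag constants are exposed under the same names as in re; the import guard is likewise an assumption added so the code runs without re2.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

text = "Hello\nworld"

# IGNORECASE: "hello" matches "Hello".
ci = re2.search(r"hello", text, flags=re2.IGNORECASE)

# MULTILINE: "^" also matches right after each newline.
line_starts = re2.findall(r"^\w+", text, flags=re2.MULTILINE)  # ['Hello', 'world']

# Without DOTALL, "." does not match the newline between the two words.
no_dotall = re2.search(r"Hello.world", text)                   # None
with_dotall = re2.search(r"Hello.world", text, flags=re2.DOTALL)
```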

This module also defines an exception ‘RegexError’ (also available under the alias ‘error’).

exception re2.BackreferencesException

Bases: Exception

Search pattern contains backreferences.

exception re2.CharClassProblemException

Bases: Exception

Search pattern contains unsupported character class.

class re2.Match

Bases: object

end(group=0)

endpos

expand(template): Expand a template with groups.

group(*args)

groupdict()

groups(default=None)

lastgroup

lastindex

pos

re

regs

span(group=0)

start(group=0)

string
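
A short sketch of the Match methods listed above, using a pattern with two named groups. The sample text and group names are illustrative only, and the import guard is an assumption added so the code runs without re2.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

m = re2.search(r"(?P<key>\w+)=(?P<value>\d+)", "set width=80 now")

whole = m.group(0)          # 'width=80'  (group 0 is the entire match)
key = m.group("key")        # 'width'     (lookup by group name)
both = m.group(1, 2)        # ('width', '80')
groups = m.groups()         # ('width', '80')
named = m.groupdict()       # {'key': 'width', 'value': '80'}

span = m.span()             # (4, 12): start and end offsets of the match
value_span = m.span(2)      # (10, 12): offsets of the second group
start, end = m.start(), m.end()
```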

class re2.Pattern

Bases: object

contains(string, pos=0, endpos=-1)

contains(string[, pos[, endpos]]) --> bool

Scan through string looking for a match, and return True or False.

count(string, pos=0, endpos=-1): Return number of non-overlapping matches of pattern in string.

findall(string, pos=0, endpos=-1): Return all non-overlapping matches of pattern in string as a list of strings.

finditer(string, pos=0, endpos=-1): Yield all non-overlapping matches of pattern in string as Match objects.

flags

fullmatch(string, pos=0, endpos=-1)

fullmatch(string[, pos[, endpos]]) --> Match object or None

Matches the entire string.

groupindex

groups

match(string, pos=0, endpos=-1): Matches zero or more characters at the beginning of the string.

pattern

scanner(arg)

search(string, pos=0, endpos=-1): Scan through string looking for a match, and return a corresponding Match instance. Return None if no position in the string matches.

split(string, maxsplit=0)

split(string[, maxsplit = 0]) --> list

Split a string by the occurrences of the pattern.

sub(repl, string, count=0)

sub(repl, string[, count = 0]) --> newstring

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

subn(repl, string[, count = 0]) --> (newstring, number of subs): Return the tuple (new_string, number_of_subs_made) found by replacing the leftmost non-overlapping occurrences of pattern with the replacement repl.
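
The Pattern methods above can be exercised on a compiled pattern roughly as follows. The sketch avoids contains and count because the stdlib fallback in the import guard (an assumption added so the code runs without re2) lacks them.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

pat = re2.compile(r"\d+")
text = "a1 b22 c333"

first = pat.search(text).group(0)        # '1'
later = pat.search(text, 3).group(0)     # '22': scanning starts at pos=3
all_nums = pat.findall(text)             # ['1', '22', '333']

# maxsplit=1 stops after the first split point.
parts = pat.split("a1b22c", maxsplit=1)  # ['a', 'b22c']

replaced = pat.sub("#", text)            # 'a# b# c#'

# count=2 limits the number of replacements; subn also reports how many.
limited, n = pat.subn("#", text, count=2)  # ('a# b# c333', 2)
```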

re2.RegexError: alias of error

re2.SREPattern: alias of Pattern

re2.compile(pattern, flags=0, max_mem=8388608)

re2.count(pattern, string, flags=0)

Return number of non-overlapping matches in the string.

Empty matches are included in the count.

exception re2.error(msg, pattern=None, pos=None)

Bases: Exception

Exception raised for invalid regular expressions.

Attributes:

msg: The unformatted error message
pattern: The regular expression pattern
pos: The index in the pattern where compilation failed (may be None)
lineno: The line corresponding to pos (may be None)
colno: The column corresponding to pos (may be None)

re2.escape(pattern): Escape all non-alphanumeric characters in pattern.
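
A sketch of why escape matters when a literal string is used as a pattern. The exact escaped form may differ between engines, so the example only checks matching behavior; the import guard is an assumption added so the code runs without re2.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

literal = "a.b*c"

# Unescaped, "." and "*" are special, so the pattern matches unrelated text.
as_regex = re2.search(literal, "axbbc")                      # matches 'axbbc'

# Escaped, the pattern only matches the literal characters themselves.
escaped = re2.escape(literal)
no_match = re2.search(escaped, "axbbc")                      # None
exact = re2.search(escaped, "see a.b*c here")                # matches 'a.b*c'
```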

re2.findall(pattern, string, flags=0)

Return a list of all non-overlapping matches in the string.

Each match is represented as a string or a tuple (when there are two or more groups). Empty matches are included in the result.

re2.finditer(pattern, string, flags=0)

Yield all non-overlapping matches in the string.

For each match, the iterator returns a Match object. Empty matches are included in the result.
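
The following sketch shows how the shape of the findall result depends on the number of groups, and how finditer gives access to positions through Match objects. The sample text is illustrative, and the import guard is an assumption added so the code runs without re2.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

text = "a=1, b=22"

# No groups: each match is returned as a plain string.
whole = re2.findall(r"\w+=\d+", text)        # ['a=1', 'b=22']

# Two groups: each match is returned as a tuple of the group contents.
pairs = re2.findall(r"(\w+)=(\d+)", text)    # [('a', '1'), ('b', '22')]

# finditer yields Match objects, so offsets are available as well.
spans = [(m.group(1), m.span()) for m in re2.finditer(r"(\w+)=(\d+)", text)]
# [('a', (0, 3)), ('b', (5, 9))]
```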

re2.fullmatch(pattern, string, flags=0): Try to apply the pattern to the entire string, returning a Match object, or None if no match was found.

re2.match(pattern, string, flags=0): Try to apply the pattern at the start of the string, returning a Match object, or None if no match was found.

re2.purge(): Clear the regular expression caches.

re2.search(pattern, string, flags=0): Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found.

re2.set_fallback_notification(level): Set the fallback notification to a level; one of FALLBACK_QUIETLY, FALLBACK_WARNING, or FALLBACK_EXCEPTION.

re2.split(pattern, string, maxsplit=0, flags=0): Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.

re2.sub(pattern, repl, string, count=0, flags=0): Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl. repl can be either a string or a callable; if a string, backslash escapes in it are processed. If it is a callable, it’s passed the Match object and must return a replacement string to be used.

re2.subn(pattern, repl, string, count=0, flags=0): Return a 2-tuple containing (new_string, number). new_string is the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the source string by the replacement repl. number is the number of substitutions that were made. repl can be either a string or a callable; if a string, backslash escapes in it are processed. If it is a callable, it’s passed the Match object and must return a replacement string to be used.
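
To illustrate the two kinds of repl described for sub and subn, here is a small sketch with a template string and a callable. The helper name double is hypothetical, and the import guard is an assumption added so the code runs without re2.

```python
try:
    import re2  # the module documented here
except ImportError:
    import re as re2  # assumption: stdlib stand-in so the sketch runs anywhere

# String repl: backslash escapes such as \1 and \2 refer to captured groups.
swapped = re2.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")   # 'host at user'

# Callable repl: receives the Match object and returns the replacement text.
def double(match):
    return str(int(match.group(0)) * 2)

doubled = re2.sub(r"\d+", double, "3 apples, 10 pears")  # '6 apples, 20 pears'

# subn returns the new string together with the number of substitutions.
masked, n = re2.subn(r"\d+", "N", "1 2 3", count=2)      # ('N N 3', 2)
```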