regex guide

From thelinuxwiki

References taken from regular-expressions.info

Special Characters

\ ^ $ . | ? * + ( ) [ {

Inside a character class, e.g. [0-9], only the following characters are special:

\ ^ - ]

Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}.

Some flavors also support the \Q…\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E matches the literal text *\d+*.
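
A minimal sketch of escaping, assuming Python's re module as the flavor. Python's re does not support \Q…\E, so the sketch uses re.escape() to get a comparable effect; the variable names are illustrative only.

```python
import re

# A backslash turns a special character into a literal one.
plus_literal = re.search(r"1\+1=2", "we know 1+1=2")

# Unescaped, + is a quantifier: the pattern means "one or more 1s, then 1=2".
plus_quantifier = re.search(r"1+1=2", "we know 1+1=2")

# re.escape() backslash-escapes every special character in a string,
# so the whole string is matched literally.
text = r"*\d+*"
escaped = re.escape(text)
literal_match = re.search(escaped, r"the literal text *\d+* here")

print(plus_literal.group())    # 1+1=2
print(plus_quantifier)         # None
print(literal_match.group())   # *\d+*
```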

The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.


Programming Languages

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters are processed by the compiler, before the regex library sees the string.
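
As a sketch of this double processing, assuming Python and its re module: the regex that matches one literal backslash is \\, and an ordinary string literal needs each of those backslashes doubled again, while a raw string does not. The sample path is made up for illustration.

```python
import re

path = "C:\\temp"      # the actual text is C:\temp, with one backslash

# Ordinary string literal: Python turns "\\\\" into the two characters \\
# before the regex engine sees it.
ordinary = "\\\\"

# Raw string literal: backslashes are passed through untouched.
raw = r"\\"

same_pattern = ordinary == raw
found = re.search(raw, path).group()

print(same_pattern)    # True
print(len(raw))        # 2
print(len(found))      # 1
```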

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A).


Regex Syntax versus String Syntax

Many programming languages support similar escapes for non-printable characters in their syntax for literal strings in source code. Such escapes are translated by the compiler into their actual characters before the string is passed to the regex engine. If the regex engine does not support the same escapes, this can cause an apparent difference in behavior when a regex is specified as a literal string in source code compared with a regex that is read from a file or received from user input.
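
The difference can be sketched in Python, assuming the re module. For \t both layers agree, so the result is the same either way. The \b case is an added illustration of a related pitfall: Python's string syntax reads \b as a backspace character, while the regex engine reads it as a word boundary.

```python
import re

subject = "a\tb"

by_string_syntax = "\t"    # one character: Python already produced a tab
by_regex_syntax = r"\t"    # two characters: the regex engine decodes \t

tab_from_string = re.search(by_string_syntax, subject)
tab_from_regex = re.search(by_regex_syntax, subject)

# Here the two layers disagree about what the escape means.
backspace_cat = re.search("\bcat", "a cat")   # pattern starts with 0x08
boundary_cat = re.search(r"\bcat", "a cat")   # pattern starts with a word boundary

print(len(by_string_syntax), len(by_regex_syntax))   # 1 2
print(tab_from_string is not None, tab_from_regex is not None)   # True True
print(backspace_cat)           # None
print(boundary_cat.group())    # cat
```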

Character Classes or Character Sets

A character class matches only a single character.


[ae] matches a or e
[0-9] matches a single digit from 0 to 9
[0-9a-fA-F] matches a single hexadecimal digit
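
A short sketch of these classes, assuming Python's re module; the sample strings are invented for illustration.

```python
import re

# [ae] stands for exactly one character, either a or e.
gray_or_grey = re.findall(r"gr[ae]y", "gray grey griy")

# Each match of [0-9] is a single digit.
digits = re.findall(r"[0-9]", "a1b22")

# fullmatch() requires the whole string to be one hexadecimal digit.
f_is_hex = re.fullmatch(r"[0-9a-fA-F]", "F") is not None
g_is_hex = re.fullmatch(r"[0-9a-fA-F]", "G") is not None

# A class never matches two characters at once.
two_chars = re.fullmatch(r"[ae]", "ae")

print(gray_or_grey)   # ['gray', 'grey']
print(digits)         # ['1', '2', '2']
print(f_is_hex)       # True
print(g_is_hex)       # False
print(two_chars)      # None
```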


Negated Character Classes

Typing a caret after the opening square bracket negates the character class.

[^0-9\r\n] matches any character that is not a digit or a line break.
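
The negated class above can be tried out as follows, a small sketch assuming Python's re module. Note that a negated class still has to match one character; it never matches "nothing".

```python
import re

# Every character that is not a digit, carriage return or line feed.
non_digits = re.findall(r"[^0-9\r\n]", "a1\nb2")

# On an empty string there is no character to match, so the search fails.
on_empty = re.search(r"[^0-9\r\n]", "")

print(non_digits)   # ['a', 'b']
print(on_empty)     # None
```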


Metacharacters Inside Character Classes

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x.

To include an unescaped caret as a literal, place it anywhere except right after the opening bracket. [x^] matches an x or a caret.

You can generally include an unescaped closing bracket by placing it right after the opening bracket, or right after the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. JavaScript and Ruby are exceptions and require the closing bracket to be escaped.

The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.

Many regex tokens that work outside character classes can also be used inside character classes.
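
The placement rules above can be checked with a sketch like this one, assuming Python's re module (which accepts the unescaped closing bracket). Other flavors may behave differently, as noted for JavaScript and Ruby.

```python
import re

backslash_or_x = re.findall(r"[\\x]", r"a\x")   # escaped backslash in the class
x_or_caret = re.findall(r"[x^]", "x^y")         # caret not in first position
bracket_or_x = re.findall(r"[]x]", "a]x")       # ] right after the opening [
not_bracket_or_x = re.findall(r"[^]x]", "a]x")  # ] right after the negating ^
hyphen_first = re.findall(r"[-x]", "a-x")       # hyphen right after the opening [
hyphen_last = re.findall(r"[x-]", "a-x")        # hyphen right before the closing ]

print(backslash_or_x)     # ['\\', 'x']
print(x_or_caret)         # ['x', '^']
print(bracket_or_x)       # [']', 'x']
print(not_bracket_or_x)   # ['a']
print(hyphen_first)       # ['-', 'x']
print(hyphen_last)        # ['-', 'x']
```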


Repeating Character Classes

If you repeat a character class by using the ?, * or + operators, you're repeating the entire character class. You're not repeating just the character that it matched. The regex [0-9]+ can match 837 as well as 222.
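
A sketch of the distinction, assuming Python's re module. The second pattern uses a backreference (\1), which this page does not cover; it is included only as a contrast, to show what "repeating the character that was matched" would look like.

```python
import re

any_digits = re.compile(r"[0-9]+")        # repeats the class: any run of digits
same_digit = re.compile(r"([0-9])\1+")    # repeats the digit that was captured

class_837 = any_digits.fullmatch("837") is not None
class_222 = any_digits.fullmatch("222") is not None
backref_837 = same_digit.fullmatch("837") is not None
backref_222 = same_digit.fullmatch("222") is not None

print(class_837, class_222)       # True True
print(backref_837, backref_222)   # False True
```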


Shorthand Character Classes

Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9].

\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_].

\s stands for "whitespace character". In all flavors discussed in this tutorial, it includes [ \t\r\n\f].

      • Which characters these shorthands actually include depends on the regex flavor.

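Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter matches any character that is neither a digit nor whitespace. The former matches any character that is either not a digit or not whitespace, which is every character, since no character is both a digit and whitespace.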
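
Both points can be sketched in Python, assuming the re module. The first part compares the two bracket expressions; the second shows flavor dependence within Python itself, where \w on text strings also matches non-ASCII letters unless the re.ASCII flag is given.

```python
import re

sample = "a 1"

# [\D\S]: "not a digit" OR "not whitespace", so every character qualifies.
either = re.findall(r"[\D\S]", sample)

# [^\d\s]: neither a digit nor whitespace.
neither = re.findall(r"[^\d\s]", sample)

# \w on a Python str is Unicode-aware by default; re.ASCII restricts it.
e_acute = "\u00e9"
unicode_word = re.fullmatch(r"\w", e_acute) is not None
ascii_word = re.fullmatch(r"\w", e_acute, re.ASCII) is not None

print(either)         # ['a', ' ', '1']
print(neither)        # ['a']
print(unicode_word)   # True
print(ascii_word)     # False
```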


The Dot Matches (Almost) Any Character

The dot matches any single character except line break characters.

Example: verifying a date string while allowing various field separators.

\d\d.\d\d.\d\d matches a date like 02/12/03, but also 02512703
\d\d[- /.]\d\d[- /.]\d\d is a better solution
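
The two date patterns can be compared with a sketch like the following, assuming Python's re module; the names loose and strict are illustrative.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")               # dot accepts any separator
strict = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")    # only - space / . allowed

loose_date = loose.fullmatch("02/12/03") is not None
loose_junk = loose.fullmatch("02512703") is not None
strict_date = strict.fullmatch("02/12/03") is not None
strict_junk = strict.fullmatch("02512703") is not None
strict_dashes = strict.fullmatch("02-12-03") is not None

print(loose_date, loose_junk)     # True True
print(strict_date, strict_junk)   # True False
print(strict_dashes)              # True
```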

In Perl, the mode where the dot also matches line breaks is called "single-line mode". You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.

Other languages and regex libraries have adopted Perl's terminology.
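
In Python, for instance, the equivalent of Perl's /s is the re.DOTALL flag (alias re.S), or the inline modifier (?s). A minimal sketch, assuming the re module:

```python
import re

text = "first\nsecond"

# By default the dot refuses to match the line feed.
default_mode = re.search(r"first.second", text)

# With DOTALL ("single-line mode") the dot matches it.
dotall_flag = re.search(r"first.second", text, re.DOTALL)

# The same mode switched on from inside the pattern.
dotall_inline = re.search(r"(?s)first.second", text)

print(default_mode)                    # None
print(dotall_flag is not None)         # True
print(dotall_inline is not None)       # True
```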

Anchors

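^ and $ are anchors: they match positions rather than characters. ^ matches at the start of the string and $ at the end; in multi-line mode they also match at the start and end of each line. $ generally matches at the zero-width position just before a trailing \n, or at the very end after the last character in a file.

\A and \Z only match at the start and the end of the entire file, regardless of multi-line mode.

Exception: in Python, \Z matches only at the very end of the string (it does not match before a trailing line break), which is the behavior other flavors assign to the lowercase \z.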
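
A sketch of anchor behavior, assuming Python's re module, where multi-line mode is the re.MULTILINE flag and \Z is the strict end-of-string anchor. Other flavors differ in the details.

```python
import re

text = "one\ntwo\n"

# Without MULTILINE, ^ only matches at the start of the whole string,
# so no single line can satisfy ^\w+$ here.
whole_string = re.findall(r"^\w+$", text)

# With MULTILINE, ^ and $ match at the start and end of every line.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# $ also matches just before a trailing line feed...
dollar_end = re.search(r"two$", text)

# ...but Python's \Z insists on the very end of the string.
strict_end = re.search(r"two\Z", text)

# \A is not affected by MULTILINE.
start_only = re.search(r"\Atwo", text, re.MULTILINE)

print(whole_string)             # []
print(per_line)                 # ['one', 'two']
print(dollar_end is not None)   # True
print(strict_end)               # None
print(start_only)               # None
```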

Word Boundaries

\b matches at a position that is called a word boundary.

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
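
The three kinds of position listed above can be made visible with a small sketch, assuming Python's re module; the sample strings are invented.

```python
import re

# \bis\b finds "is" only as a whole word, not inside "This" or "island".
whole_words = re.findall(r"\bis\b", "This island is beautiful")

# \b is zero-width, so each match is just a position in the string.
# In "hi, yo": before h (0), after i (2), before y (4), after o (6).
positions = [m.start() for m in re.finditer(r"\b", "hi, yo")]

print(whole_words)   # ['is']
print(positions)     # [0, 2, 4, 6]
```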

GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl's \m. \> matches at the end of a word.

The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary.

Alternation with Pipe

cat|dog matches cat or dog

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping.

For example:

(cat|dog)

To improve on this, require word boundaries around the group:

\b(cat|dog)\b
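
The effect of grouping on the reach of the alternation can be sketched as follows, assuming Python's re module. Without parentheses, \bcat|dog\b is read as "\bcat" or "dog\b", because alternation binds last.

```python
import re

text = "cat catalog dog hotdog"

# Plain alternation also matches inside longer words.
anywhere = re.findall(r"cat|dog", text)

# Ungrouped: each boundary belongs to only one alternative.
ungrouped = re.findall(r"\bcat|dog\b", text)

# Grouped: both boundaries apply to both alternatives.
grouped = re.findall(r"\b(cat|dog)\b", text)

print(anywhere)    # ['cat', 'cat', 'dog', 'dog']
print(ungrouped)   # ['cat', 'cat', 'dog', 'dog']
print(grouped)     # ['cat', 'dog']
```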

Optional Items

The question mark makes the preceding token in the regular expression optional. colou?r matches both colour and color. The question mark is called a quantifier.

You can make several tokens optional by grouping them together using parentheses, and placing the question mark after the closing parenthesis. E.g.: Nov(ember)? matches Nov and November.

The question mark is greedy. The regex engine always tries to match the optional part.

If you apply the regex Feb 23(rd)? to the string Today is Feb 23rd, 2003, the match is always Feb 23rd and not Feb 23. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first.
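
The examples in this section can be reproduced with a short sketch, assuming Python's re module, which supports both the greedy ? and the lazy ?? quantifier.

```python
import re

# One optional character.
spellings = re.findall(r"colou?r", "color colour")

# An optional group; m.group() returns the whole match.
months = [m.group() for m in re.finditer(r"Nov(ember)?", "Nov and November")]

sentence = "Today is Feb 23rd, 2003"

# Greedy: the optional part is matched whenever it can be.
greedy = re.search(r"Feb 23(rd)?", sentence).group()

# Lazy: the optional part is skipped unless the rest of the regex needs it.
lazy = re.search(r"Feb 23(rd)??", sentence).group()

print(spellings)   # ['color', 'colour']
print(months)      # ['Nov', 'November']
print(greedy)      # Feb 23rd
print(lazy)        # Feb 23
```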