regex guide

From thelinuxwiki
Revision as of 02:47, 14 May 2017 by Nighthawk (Talk | contribs)

References taken from regular-expressions.info

Special Characters

\ ^ $ . | ? * + ( ) [ {

Inside a character class (e.g. [0-9]), only these characters are special:

\ ^ - ]

Most regular expression flavors treat the brace { as a literal character unless it is part of a repetition operator such as a{1,3}.

Some flavors also support the \Q…\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E matches the literal text *\d+*.

The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.
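As a minimal sketch of escaping, assuming Python and its standard re module (neither is named above), the following contrasts an unescaped dot with an escaped one and shows the \d shorthand. Python's re does not support \Q…\E, so re.escape() is used here as a rough stand-in.

```python
import re

# Unescaped, the dot is a metacharacter and matches any character except a newline.
dot_any = re.fullmatch(r"1.5", "1x5")        # matches
# Escaped with a backslash, it matches only a literal dot.
dot_literal = re.fullmatch(r"1\.5", "1x5")   # no match
dot_exact = re.fullmatch(r"1\.5", "1.5")     # matches

# \d is a shorthand token that matches a single digit.
digits = re.findall(r"\d", "a1b22")

# re.escape() backslash-escapes every metacharacter in a string,
# so the whole string is matched as literal text.
literal = re.escape("*\\d+*")
found = re.search(literal, "price: *\\d+* each")
```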


Programming Languages

In your source code, you have to keep in mind which characters get special treatment inside strings in your programming language. Those characters are processed by the compiler before the regex library sees the string.
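A short sketch of this in Python (an assumed example language): in an ordinary string literal the backslash is consumed by the language first, so it has to be doubled, while a raw string passes it through to the regex library unchanged.

```python
import re

plain = "\\d+"   # Python turns \\ into a single backslash
raw = r"\d+"     # a raw string leaves the backslash alone
same_pattern = plain == raw

# To match one literal backslash, the regex needs \\ ,
# which takes four backslashes in an ordinary string literal.
pattern_plain = "\\\\"
pattern_raw = r"\\"
path = "C:\\temp"                       # the actual text is C:\temp
matches = re.findall(pattern_raw, path)
```

Raw strings are usually the easier choice for regex patterns, because what is typed is what the regex library receives.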

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A).
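For illustration, a small Python sketch (Python is an assumption here, and the sample text is made up) that matches tabs and line breaks using these escapes:

```python
import re

text = "name\tvalue\r\nnext line\n"

# \t matches the tab character (0x09).
tabs = re.findall(r"\t", text)

# \r\n matches a Windows-style line break, \n alone a Unix-style one.
line_breaks = re.findall(r"\r\n|\n", text)

# Splitting a tab-separated record into fields.
fields = re.split(r"\t", "a\tb\tc")
```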


Regex Syntax versus String Syntax

Many programming languages support similar escapes for non-printable characters in their syntax for literal strings in source code. Such escapes are translated by the compiler into the actual characters before the string is passed to the regex engine. If the regex engine does not support the same escapes, or gives them a different meaning, a regex specified as a literal string in source code can behave differently from the same regex read from a file or received from user input.
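The effect can be sketched in Python (an assumed example language). For \t both layers agree, so the result is the same either way. For \b they disagree: Python string syntax reads it as a backspace character, while the regex engine reads it as a word boundary.

```python
import re

# \t: the string escape and the regex escape both mean "tab".
tab_via_string = re.search("\t", "a\tb")    # pattern is 1 character: a real tab
tab_via_regex = re.search(r"\t", "a\tb")    # pattern is 2 characters: \ and t

text = "cat concat"

# Raw string: the regex engine sees \b and treats it as a word boundary.
as_regex = re.findall(r"\bcat\b", text)

# Ordinary string: Python replaces \b with a backspace (0x08) first,
# so the regex engine looks for backspace, "cat", backspace.
as_string = re.findall("\bcat\b", text)
```

A pattern read from a file or typed by a user behaves like the raw-string case, since no string-literal processing is applied to it.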


Character Classes or Character Sets

A character class matches only a single character.


[ae] matches a or e
[0-9] matches a single digit between 0 and 9
[0-9a-fA-F] matches a single hexadecimal digit
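The classes above can be tried out with a few lines of Python (an assumed example language; the sample strings are made up):

```python
import re

# [ae] matches exactly one character, either a or e.
vowels = re.findall(r"[ae]", "gray grey")

# Because the class matches a single character, "graey" is not matched.
words = re.findall(r"gr[ae]y", "gray grey graey")

# A quantifier after the class repeats it: one or more hexadecimal digits.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aF3") is not None
not_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aG3") is not None
```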


Negated Character Classes

Typing a caret after the opening square bracket negates the character class.

[^0-9\r\n] matches any character that is not a digit or a line break.
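A brief Python sketch of the negated class (Python is assumed here). It also shows a point that is easy to overlook: a negated class still has to match some character, so it cannot match at the end of the text.

```python
import re

# Everything that is neither a digit nor a line break.
non_digits = re.findall(r"[^0-9\r\n]", "a1\nb2")

# q[^u] means "q followed by a character that is not u",
# not "q that is not followed by u".
at_end = re.search(r"q[^u]", "Iraq")        # nothing after the q: no match
mid_text = re.search(r"q[^u]", "Iraq is")   # matches "q" plus the space
```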


Metacharacters Inside Character Classes

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x.


To include an unescaped caret as a literal, place it anywhere except right after the opening bracket. [x^] matches an x or a caret.
You can generally include an unescaped closing bracket by placing it right after the opening bracket, or right after the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. JavaScript and Ruby are exceptions: there the closing bracket must be escaped.


The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
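The placement rules in this section can be checked with a short Python sketch (Python is an assumption; its re module accepts the unescaped forms shown here, while other flavors may differ as noted):

```python
import re

# Backslash must be escaped: [\\x] matches a backslash or an x.
backslash_or_x = re.findall(r"[\\x]", r"a\x")

# A caret that is not first in the class is a literal.
caret_or_x = re.findall(r"[x^]", "x^y")

# A closing bracket right after [ or [^ is a literal.
bracket_or_x = re.findall(r"[]x]", "a]x")
not_bracket_or_x = re.findall(r"[^]x]", "a]x")

# A hyphen at the start or end of the class is a literal.
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")
```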


Many regex tokens that work outside character classes can also be used inside character classes.
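For example, shorthand tokens such as \d and \s can appear inside a class. A minimal Python sketch, with Python assumed as the example language:

```python
import re

# [\d\s] matches one character that is either a digit or whitespace.
digit_or_space = re.findall(r"[\d\s]", "a 1b")

# Shorthands can be mixed with ordinary class members and ranges.
hex_or_space = re.findall(r"[\da-f\s]", "z9 f")
```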