What are regular expressions?
Regular expressions are a language that can be used to explicitly describe patterns within strings of text. In addition to simply describing such patterns, regular expression engines can typically be used to iterate through matches, to parse strings into substrings using patterns as delimiters, or to replace or reformat text in an intelligent fashion. They provide a powerful and usually very succinct way to solve many common tasks related to text manipulation.
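As a rough illustration of the three tasks just mentioned (iterating through matches, splitting on a delimiter pattern, and reformatting text), here is a minimal sketch. It assumes Python's built-in re module, chosen only for illustration, and a made-up sample string; other engines expose similar operations under different names.

```python
import re

# Hypothetical sample input, invented for this sketch.
text = "apples: 3, pears: 12, plums: 7"

# Iterate through every run of digits in the text.
numbers = [m.group() for m in re.finditer(r"[0-9]+", text)]

# Split the text, using a comma plus optional spaces as the delimiter.
fields = re.split(r", *", text)

# Reformat each "name: count" pair as "count name".
swapped = re.sub(r"([a-z]+): ([0-9]+)", r"\2 \1", text)

print(numbers)
print(fields)
print(swapped)
```

The pattern syntax used here is explained piece by piece in the sections that follow.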
Simple Expressions
The simplest regular expression is one you're already familiar with: the literal string. A particular string can be described, literally, by itself, and thus a regular expression pattern like foo would match the input string foo exactly once. In this case, it would also match the input: The food was quite tasty, which might not be desired if only a precise match is sought.
Of course, matching exact strings to themselves is a trivial implementation of regular expressions, and doesn't begin to reveal their power. What if instead of foo you wanted to find all words starting with the letter f, or all three-letter words? Now you've gone beyond what literal strings can do, and it's time to learn some more about regular expressions. Below is a sample literal expression and some inputs it would match.
Pattern | Inputs (Matches) |
foo | foo, food, foot, "There's evil afoot." |
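The difference between finding foo somewhere in the input and matching foo exactly can be sketched as follows. This assumes Python's re module, used here only for illustration; the input list is made up from the examples above plus one non-matching string.

```python
import re

# A literal pattern: the characters f, o, o in that order.
pattern = re.compile(r"foo")

inputs = ["foo", "food", "foot", "There's evil afoot.",
          "The food was quite tasty", "bar"]

# search() succeeds if the pattern occurs anywhere in the input.
found_anywhere = [s for s in inputs if pattern.search(s)]

# fullmatch() succeeds only if the whole input is exactly the pattern.
exact_only = [s for s in inputs if pattern.fullmatch(s)]

print(found_anywhere)
print(exact_only)
```

Every input except bar contains foo somewhere, but only the input foo matches in full.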
Quantifiers
Quantifiers provide a simple way to specify within a pattern how many times a particular character or set of characters is allowed to repeat itself. There are three non-explicit quantifiers:
- *, which describes "0 or more occurrences",
- +, which describes "1 or more occurrences", and
- ?, which describes "0 or 1 occurrence".
Quantifiers always refer to the pattern immediately preceding (to the left of) the quantifier, which is normally a single character unless parentheses are used to create a pattern group. Below are some sample patterns and inputs they would match.
Pattern | Inputs (Matches) |
fo* | foo, foe, food, fooot, "forget it", funny, puffy |
fo+ | foo, foe, food, foot, "forget it" |
fo? | foo, foe, food, foot, "forget it", funny, puffy |
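To see which part of each input the three quantifiers actually consume, the following sketch may help. It assumes Python's re module (an illustrative choice), and first_match is a hypothetical helper defined only for this example.

```python
import re

inputs = ["foo", "foe", "food", "fooot", "forget it", "funny", "puffy"]

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

# f followed by zero or more o characters.
star = {s: first_match(r"fo*", s) for s in inputs}
# f followed by one or more o characters.
plus = {s: first_match(r"fo+", s) for s in inputs}
# f followed by at most one o character.
opt = {s: first_match(r"fo?", s) for s in inputs}

for s in inputs:
    print(s, star[s], plus[s], opt[s], sep=" | ")
```

Note that funny and puffy match fo* and fo? only because a bare f already satisfies "zero o characters"; fo+ finds nothing in them.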
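In addition to specifying that a given pattern may occur 0 or 1 time, the ? character has a second role. Placed immediately after another quantifier (as in *?, +?, ??, or {2,3}?), it makes that quantifier lazy, so the pattern or subpattern matches the minimal number of characters when it might otherwise match several in an input string.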
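The minimal-match behavior of ? can be sketched like this, assuming Python's re module for illustration and a made-up input string:

```python
import re

text = "fooooo"

# Greedy quantifiers take as many characters as they can.
greedy_plus = re.search(r"fo+", text).group()
greedy_star = re.search(r"fo*", text).group()

# A trailing ? makes the quantifier take as few characters as it can.
lazy_plus = re.search(r"fo+?", text).group()
lazy_star = re.search(r"fo*?", text).group()

print(greedy_plus, lazy_plus)   # fooooo fo
print(greedy_star, lazy_star)   # fooooo f
```

The lazy forms stop as soon as the rest of the pattern is satisfied: one o for +?, and none at all for *?.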
In addition to the above non-explicit quantifiers (generally just referred to as quantifiers), there are also explicit quantifiers. Where quantifiers are fairly vague in terms of how many occurrences there may be of a pattern, explicit quantifiers allow an exact number or a range to be specified. Explicit quantifiers are positioned following the pattern they apply to, just like regular quantifiers. Explicit quantifiers use curly braces {} and number values for lower and upper occurrence limits within the braces. For example, x{5} would match exactly five x characters (xxxxx). When only one number is specified, it is the exact count; when it is followed by a comma, it becomes a lower bound with no upper bound, such as x{5,}, which would match any number of x characters greater than 4. Two numbers separated by a comma, such as x{2,5}, give a lower and an upper bound. Below are some sample patterns and inputs they would match.
Pattern | Inputs (Matches) |
ab{2}c | abbc, aaabbccc |
ab{0,2}c | ac, abc, abbc, aabbcc |
ab{2,3}c | abbc, abbbc, aabbcc, aabbbcc |
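A small sketch of explicit quantifiers in action, assuming Python's re module for illustration. The matches helper and the input list are hypothetical, and the "up to two" case is written with an explicit lower bound of 0, since not every engine treats an omitted lower bound the same way.

```python
import re

def matches(pattern, inputs):
    """Return the inputs in which the pattern occurs at least once."""
    return [s for s in inputs if re.search(pattern, s)]

inputs = ["ac", "abc", "abbc", "abbbc", "abbbbc", "aaabbccc"]

exactly_two = matches(r"ab{2}c", inputs)      # exactly two b characters
up_to_two = matches(r"ab{0,2}c", inputs)      # zero, one, or two
two_or_three = matches(r"ab{2,3}c", inputs)   # two or three
at_least_three = matches(r"ab{3,}c", inputs)  # three or more

print(exactly_two)
print(up_to_two)
print(two_or_three)
print(at_least_three)
```

The input aaabbccc matches wherever abbc does, because the pattern only has to occur somewhere inside the input.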
Metacharacters
The constructs within regular expressions that have special meaning are referred to as metacharacters. You've already learned about several metacharacters, such as the *, ?, +, and { } characters. Several other characters have special meaning within the language of regular expressions. These include the following: $ ^ . [ ( | ) ] and \.
The . (period or dot) metacharacter is one of the simplest and most used. It matches any single character. This can be useful for specifying that certain patterns can contain any combination of characters, but must fall within certain length ranges by using quantifiers. Also, we have seen that expressions will match any instance of the pattern they describe within a larger string, but what if you only want to match the pattern exactly? This is often the case for validation scenarios, such as ensuring the user entered something that is the proper format for a postal code or telephone number.
The ^ metacharacter is used to designate the beginning of a string (or line), and the $ metacharacter is used to designate the end of a string (or line). By adding these characters to the beginning and end of a pattern, you can force it to only match input strings that exactly match the pattern. The ^ metacharacter also has a special meaning when used at the start of a character class, designated by square brackets [ ]. These are covered below.
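The effect of anchoring, and of combining the dot with a quantifier to constrain length, can be sketched as follows. Python's re module is assumed for illustration; has_valid_length and its 3 to 5 character limit are hypothetical.

```python
import re

# Unanchored: the pattern may match anywhere inside the input.
unanchored = re.search(r"abc", "123abc456") is not None

# Anchored at both ends: the input must be exactly "abc".
anchored = re.search(r"^abc$", "123abc456") is not None
exact = re.search(r"^abc$", "abc") is not None

def has_valid_length(text):
    """Accept any 3 to 5 characters, whatever they are."""
    return re.search(r"^.{3,5}$", text) is not None

lengths = [has_valid_length(s) for s in ["ab", "abc", "a1-b2", "toolong"]]

print(unanchored, anchored, exact)
print(lengths)
```

Without the anchors, .{3,5} would also succeed on toolong, since any five-character stretch inside it satisfies the pattern.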
The \ (backslash) metacharacter is used to "escape" characters from their special meaning, as well as to designate instances of predefined set metacharacters. These too are covered below. In order to include a literal version of a metacharacter in a regular expression, it must be "escaped" with a backslash. So, for instance, if you wanted to match strings that begin with "c:\", you might use the pattern ^c:\\ for the job. Here the ^ metacharacter indicates that the string must begin with this pattern, and the literal backslash is escaped with a backslash metacharacter.
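Escaping can be sketched as below, assuming Python's re module for illustration. Python raw strings (r"...") are used so that the pattern appears in source code exactly as written in the text; the price example is made up.

```python
import re

# The pattern ^c:\\ : start of input, then c, colon, and a literal backslash.
pattern = re.compile(r"^c:\\")

at_start = pattern.search(r"c:\windows") is not None
elsewhere = pattern.search(r"path is c:\windows") is not None

# re.escape() adds whatever backslashes are needed to match a string literally,
# which is handy when the text contains metacharacters such as $ and .
literal_price = re.compile(re.escape("$5.00"))
price_found = literal_price.search("Total: $5.00") is not None
price_wrong = literal_price.search("Total: $5x00") is not None

print(at_start, elsewhere)
print(price_found, price_wrong)
```

Without escaping, the . in $5.00 would match any character and the $ would act as an end-of-string anchor.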
The | (pipe) metacharacter is used for alternation, essentially to specify 'this OR that' within a pattern. So something like a|b would match anything with an 'a' or a 'b' in it, and would be very similar to the character class [ab].
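A short sketch of how alternation compares with a character class, assuming Python's re module for illustration and a made-up input string:

```python
import re

text = "a cab, a bib, a dud"

# For single characters, alternation and a character class find the same matches.
by_alternation = re.findall(r"a|b", text)
by_class = re.findall(r"[ab]", text)

# Alternation can also choose between multi-character alternatives,
# which a character class cannot express.
words = re.findall(r"cab|bib", text)

print(by_alternation == by_class)
print(words)
```

For single characters the class is usually the more compact choice; alternation becomes necessary once the alternatives are longer than one character.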
Finally, the parentheses ( ) are used to group patterns. This can be done to allow a complete pattern to occur multiple times using quantifiers, for readability only, or to allow certain portions of the input to be matched separately, perhaps to allow for reformatting or parsing.
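Two of the uses of grouping described above, repeating a whole subpattern and picking out parts of the input for reformatting, might look like this. Python's re module is assumed for illustration; the "Last, First" name layout is a hypothetical input format.

```python
import re

# Grouping so that a quantifier applies to the whole subpattern abc.
repeated = re.search(r"(abc){2,3}", "xabcabcabcabc").group()

# Grouping so that parts of the input can be retrieved separately.
m = re.search(r"^(.+), (.+)$", "Lovelace, Ada")
last, first = m.groups()

# Reformatting: the replacement refers back to the groups by number.
reordered = re.sub(r"^(.+), (.+)$", r"\2 \1", "Lovelace, Ada")

print(repeated)
print(last, first)
print(reordered)
```

The {2,3} quantifier is greedy, so it takes three repetitions of abc when three are available.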
Some examples of metacharacter usage are listed below.
Pattern | Inputs (Matches) |
. | a, b, c, 1, 2, 3 |
.* | Abc, 123, any string, even no characters would match |
^c:\\ | c:\windows, c:\\\\\, c:\foo.txt, c:\ followed by anything else |
abc$ | abc, 123abc, any string ending with abc |
(abc){2,3} | abcabc, abcabcabc |
Character Classes
Character classes are a mini-language within regular expressions, defined by the enclosing square brackets [ ]. The simplest character class is a list of characters within these brackets, such as [aeiou]. When used in an expression, any one of these vowel characters can be used at this position in the pattern (but only one unless quantifiers are used). It's important to note that character classes cannot be used to define words or patterns, only single characters.
To specify any numeric digit, the character class [0123456789] could be used. However, since this would quickly get cumbersome, ranges of characters can be defined within the brackets by using the hyphen character, -. The hyphen character has special meaning within character classes, not within regular expressions (thus it doesn't qualify as a regular expression metacharacter, exactly), and it only has special meaning within a character class if it is not the first character. To specify any numeric digit using a hyphen, you would use [0-9]. Similarly for any lowercase letter, you could use [a-z], or for any uppercase letter [A-Z]. The range defined by the hyphen depends on the character set being used, so the order in which the characters occur in the (for example) ASCII or Unicode table determines which characters are included in the range. If you need a hyphen to be included in your character class, specify it as the first character. For example, [-.? ] would match any one of those four characters (note the last character is a space). Also note that most regular expression metacharacters are not treated specially within character classes, so they do not need to be escaped; the main exceptions are the backslash, the closing bracket, and the caret and hyphen in the positions described in this section. Consider character classes to be a separate language from the rest of the regular expression world, with their own rules and syntax.
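Ranges, the leading hyphen, and unescaped metacharacters inside a class can be sketched as follows, assuming Python's re module for illustration and made-up sample strings:

```python
import re

sample = "Room 42B"

digits = re.findall(r"[0-9]", sample)   # any numeric digit
lower = re.findall(r"[a-z]", sample)    # any lowercase letter
upper = re.findall(r"[A-Z]", sample)    # any uppercase letter

# A leading hyphen is literal, so this class is: hyphen, period,
# question mark, or space.
punct = re.findall(r"[-.? ]", "Well-known. Really?")

# Inside a class, . * and + stand for themselves and need no backslash.
operators = re.findall(r"[.*+]", "1+1=2.0*x")

print(digits, lower, upper)
print(punct)
print(operators)
```

Each class matches one character at a time, which is why findall returns single-character strings here.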
You can also match any character except a member of a character class by negating the class using the caret ^ as the first character in the character class. Thus, to match any non-vowel character, you could use a character class of [^aAeEiIoOuU]. Note that if you want to negate a hyphen, it should be the second character in the character class, as in [^-]. Remember that the ^ has a totally different meaning within a character class than it has at the start of a regular expression pattern.
Some examples of character classes in action are listed below.
Pattern | Inputs (Matches) |
^b[aeiou]t$ | bat, bet, bit, bot, but |
^[0-9]{5}$ | 11111, 12345, 99999 |
^[^-][0-9]$ | 10, 42, x7 (any character other than a hyphen, followed by one digit; will not match -0, -1, -2, etc.) |
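The character class patterns from this section can be tried out with a sketch like the one below. It assumes Python's re module for illustration; is_match is a hypothetical helper, and the candidate inputs are made up.

```python
import re

def is_match(pattern, text):
    """True if the pattern occurs in the text (anchors make it a full match)."""
    return re.search(pattern, text) is not None

# b, one lowercase vowel, t. Matching is case-sensitive by default.
three_letter = [w for w in ["bat", "bet", "Bat", "boot", "bt"]
                if is_match(r"^b[aeiou]t$", w)]

# Exactly five digits.
five_digits = [s for s in ["12345", "1234", "123456", "12a45"]
               if is_match(r"^[0-9]{5}$", s)]

# Negated class: every character that is not a vowel.
non_vowels = re.findall(r"[^aAeEiIoOuU]", "Audio")

# One character that is not a hyphen, followed by one digit.
no_leading_hyphen = [s for s in ["10", "x7", "-7", "7"]
                     if is_match(r"^[^-][0-9]$", s)]

print(three_letter, five_digits, non_vowels, no_leading_hyphen)
```

The single-character input 7 is rejected by the last pattern because [^-] and [0-9] each consume one character, so two are required.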
Predefined Set Metacharacters
There's a great deal that can be done with the tools we've covered so far. However, it is still rather longwinded to use [0-9] for every numeric digit in a pattern, or worse, [0-9a-zA-Z] for any alphanumeric character. To ease the pain of dealing with these common but lengthy patterns, a set of predefined metacharacters was defined. The standard syntax for these predefined metacharacters is a backslash \ followed by one or more characters. Most of these are just one character long, making them easy to use and an ideal replacement for lengthy character classes. Two such examples are \d which matches any numeric digit and \w which matches any word character (alphanumeric plus underscore). The exceptions are specific character code matches, which must specify the address of the character they are matching, such as \u000D which would match the Unicode carriage return character. Some of the most common character classes and their metacharacter equivalents are listed below.
Metacharacter | Description |
\a | Matches a bell (alarm); \u0007 |
\b | Matches a word boundary except in a character class, where it matches a backspace character, \u0008 |
\t | Matches a tab; \u0009 |
\r | Matches a carriage return; \u000D |
\v | Matches a vertical tab; \u000B |
\f | Matches a form feed; \u000C |
\n | Matches a new line; \u000A |
\e | Matches an escape; \u001B |
\040 | Matches an ASCII character with a three-digit octal. \040 represents a space (Decimal 32). |
\x20 | Matches an ASCII character using 2-digit hexadecimal. In this case, \x20 represents a space. |
\cC | Matches an ASCII control character, in this case ctrl-C. |
\u0020 | Matches a Unicode character using exactly four hexadecimal digits. In this case \u0020 is a space. |
\* | Any character that does not represent a predefined character class is simply treated as that character. Thus \* is the same as \x2A (a literal *, not the * metacharacter). |
\p{name} | Matches any character in the named character class 'name'. Supported names are Unicode groups and block ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency). |
\P{name} | Matches text not included in the named character class 'name'. |
\w | Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. |
\W | The negation of \w, this equals the ECMAScript compliant set [^a-zA-Z_0-9] or the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. |
\s | Matches any white-space character. Equivalent to the Unicode character classes [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v] (note leading space). |
\S | Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v] (note space after ^). |
\d | Matches any decimal digit. Equivalent to [\p{Nd}] for Unicode and [0-9] for non-Unicode, ECMAScript behavior. |
\D | Matches any character that is not a decimal digit. Equivalent to [\P{Nd}] for Unicode and [^0-9] for non-Unicode, ECMAScript behavior. |
Note: Different implementations of regular expressions define different sets of predefined metacharacters—the above predefined metacharacters are supported by the System.Text.RegularExpressions API in the .NET Framework.
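As the note says, engines differ, so the following sketch sticks to \d, \w, \s, and \b, which Python's re module (assumed here for illustration) also provides. The sample strings are made up, and re.ASCII is Python's own flag for restricting these sets to ASCII, roughly comparable to the non-Unicode behavior described in the table.

```python
import re

text = "Order 66:\t3 items,  total_due = 19.99"

digits = re.findall(r"\d+", text)       # runs of decimal digits
words = re.findall(r"\w+", text)        # runs of word characters
collapsed = re.sub(r"\s+", " ", text)   # squeeze each run of white space

# \b marks a word boundary, so "items" is not counted as "item".
whole_word = re.findall(r"\bitem\b", "item items item.")

# By default Python's \d follows Unicode; re.ASCII narrows it to [0-9].
# \u0663 is ARABIC-INDIC DIGIT THREE.
unicode_digits = re.findall(r"\d", "4\u0663")
ascii_digits = re.findall(r"\d", "4\u0663", re.ASCII)

print(digits)
print(words)
print(collapsed)
print(len(whole_word), len(unicode_digits), len(ascii_digits))
```

If a pattern is meant to accept only the digits 0 through 9, either [0-9] or the engine's ASCII or ECMAScript option is the safer choice.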
More Sample Expressions
Most people learn best by example, so here are a few sample expressions. For more samples, you should visit the online regular expression library, at http://www.regexlib.com/.
Pattern | Description |
^\d{5}$ | 5 numeric digits, such as a US ZIP code. |
^(\d{5}|\d{5}-\d{4})$ | 5 numeric digits, or 5 digits-dash-4 digits. This matches a US ZIP or US ZIP+4 format. |
^(\d{5})(-\d{4})?$ | Same as previous, but more efficient. Uses ? to make the -4 digits portion of the pattern optional, rather than requiring two separate patterns to be compared individually (via alternation). |
^[+-]?\d+(\.\d+)?$ | Matches any real number with an optional sign. |
^[+-]?\d*\.?\d*$ | Like the above, but every part is optional, so it also accepts forms such as .5 and 5., as well as the empty string and a lone sign or decimal point. |
^(20|21|22|23|[01]\d)[0-5]\d$ | Matches any 24-hour time value written as four digits with no separator, such as 0930 or 2359. |
/\*.*\*/ | Matches a C-style comment /* … */ on a single line, including its delimiters. |
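A few of these sample expressions can be exercised with a sketch like the following. Python's re module is assumed for illustration; accepted is a hypothetical helper, the candidate inputs are made up, and the ZIP+4 pattern is written here as ^\d{5}(-\d{4})?$.

```python
import re

ZIP = re.compile(r"^\d{5}(-\d{4})?$")
REAL = re.compile(r"^[+-]?\d+(\.\d+)?$")
TIME = re.compile(r"^(20|21|22|23|[01]\d)[0-5]\d$")

def accepted(regex, candidates):
    """Return the candidates that the compiled pattern matches."""
    return [c for c in candidates if regex.search(c)]

zips = accepted(ZIP, ["12345", "12345-6789", "1234", "12345-678", "123456"])
reals = accepted(REAL, ["42", "-3.14", "+0.5", ".5", "1.", ""])
times = accepted(TIME, ["0000", "0930", "2359", "2400", "1260"])

# The comment pattern is greedy, so two comments on one line come back
# as a single match; the lazy form .*? finds them separately.
line = "x = 1; /* first */ y = 2; /* second */"
greedy_comments = re.findall(r"/\*.*\*/", line)
lazy_comments = re.findall(r"/\*.*?\*/", line)

print(zips, reals, times)
print(greedy_comments)
print(lazy_comments)
```

Testing a pattern against inputs that should fail, as well as ones that should pass, is a quick way to catch missing anchors and overly greedy quantifiers.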