Regular Expressions in C#

codeling 1599 - 6654
@2017-09-29 11:41:53

What are regular expressions?

Regular expressions are a language that can be used to explicitly describe patterns within strings of text. In addition to simply describing such patterns, regular expression engines can typically be used to iterate through matches, to parse strings into substrings using patterns as delimiters, or to replace or reformat text in an intelligent fashion. They provide a powerful and usually very succinct way to solve many common tasks related to text manipulation.

Simple Expressions

The simplest regular expression is one you're already familiar with: the literal string. A particular string can be described, literally, by itself, and thus a regular expression pattern like foo would match the input string foo exactly once. It would also match within the input The food was quite tasty, which might not be desired if only a precise match is sought.

Of course, matching exact strings to themselves is a trivial use of regular expressions, and doesn't begin to reveal their power. What if instead of foo you wanted to find all words starting with the letter f, or all three-letter words? Now you've gone beyond what literal strings can do, and it's time to learn more about regular expressions. Below is a sample literal expression and some inputs it would match.

Pattern Inputs (Matches)
foo foo, food, foot, "There's evil afoot."
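
As a minimal sketch of the literal match described above, the following program tests a few inputs against the pattern foo. The class and method names (LiteralMatchDemo, ContainsFoo) are made up for illustration; Regex.IsMatch is the only library call used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralMatchDemo
{
    // Returns true when the literal pattern "foo" occurs anywhere in the input.
    public static bool ContainsFoo(string input)
    {
        return Regex.IsMatch(input, "foo");
    }

    public static void Main()
    {
        string[] inputs = { "foo", "The food was quite tasty", "There's evil afoot.", "bar" };
        foreach (string input in inputs)
        {
            // The first three inputs print True; "bar" prints False.
            Console.WriteLine("{0} -> {1}", input, ContainsFoo(input));
        }
    }
}
```

Because the pattern is not anchored, it reports a match for food and afoot as well as for foo itself.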
 

Quantifiers

Quantifiers provide a simple way to specify within a pattern how many times a particular character or set of characters is allowed to repeat itself. There are three non-explicit quantifiers:

  • *, which describes "0 or more occurrences",
  • +, which describes "1 or more occurrences", and
  • ?, which describes "0 or 1 occurrence".

Quantifiers always refer to the pattern immediately preceding (to the left of) the quantifier, which is normally a single character unless parentheses are used to create a pattern group. Below are some sample patterns and inputs they would match.

Pattern Inputs (Matches)
fo* foo, foe, food, fooot, "forget it", funny, puffy
fo+ foo, foe, food, foot, "forget it"
fo? foo, foe, food, foot, "forget it", funny, puffy

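In addition to specifying that a given pattern may occur 0 or 1 times, the ? character has a second role: placed immediately after another quantifier (as in *?, +?, or {2,5}?), it makes that quantifier lazy, so the pattern or subpattern matches the minimal number of characters when it might otherwise match several in an input string.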
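
The difference between the default (greedy) behavior and the minimal (lazy) behavior is easiest to see by printing what each pattern actually captures. The sketch below is illustrative only; FirstMatch is a hypothetical helper name, not part of the .NET API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match of pattern in input ("" if none).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "fooot";
        Console.WriteLine(FirstMatch(input, "fo+"));   // fooo  (greedy: as many o's as possible)
        Console.WriteLine(FirstMatch(input, "fo+?"));  // fo    (lazy: as few o's as possible)
        Console.WriteLine(FirstMatch(input, "fo*?"));  // f     (lazy, and zero o's is allowed)
        Console.WriteLine(FirstMatch(input, "fo?"));   // fo    (zero or one o)
    }
}
```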

In addition to the above non-explicit quantifiers (generally just referred to as quantifiers), there are also explicit quantifiers. Where quantifiers are fairly vague in terms of how many occurrences there may be of a pattern, explicit quantifiers allow an exact number or a range to be specified. Explicit quantifiers are positioned following the pattern they apply to, just like regular quantifiers. They use curly braces {} with number values for the lower and upper occurrence limits within the braces. For example, x{5} would match exactly five x characters (xxxxx). When only one number is specified, it is an exact count unless it is followed by a comma, such as x{5,}, in which case it is a lower bound: x{5,} matches five or more x characters. To specify only an upper bound, write the lower bound as 0, as in x{0,5}; the .NET engine does not treat {,5} as a quantifier. Below are some sample patterns and inputs they would match.

Pattern Inputs (Matches)
ab{2}c abbc, aaabbccc
ab{0,2}c ac, abc, abbc, aabbcc
ab{2,3}c abbc, abbbc, aabbcc, aabbbcc
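
To check explicit quantifiers against concrete inputs, a small test harness along these lines can help. It is a sketch: the Matches wrapper is a hypothetical convenience around Regex.IsMatch, and the upper-bound-only case is written with an explicit 0 as the lower bound.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitQuantifierDemo
{
    // Thin wrapper so each check reads as (input, pattern).
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Matches("abbc", "ab{2}c"));      // True:  exactly two b's
        Console.WriteLine(Matches("abc", "ab{2}c"));       // False: only one b
        Console.WriteLine(Matches("ac", "ab{0,2}c"));      // True:  zero b's is within 0..2
        Console.WriteLine(Matches("abbbc", "ab{0,2}c"));   // False: three b's exceeds the range
        Console.WriteLine(Matches("abbbc", "ab{2,3}c"));   // True:  three b's is within 2..3
        Console.WriteLine(Matches("xxxx", "x{5,}"));       // False: fewer than five x's
        Console.WriteLine(Matches("xxxxxx", "x{5,}"));     // True:  five or more x's
    }
}
```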

 

Metacharacters

The constructs within regular expressions that have special meaning are referred to as metacharacters. You've already learned about several metacharacters, such as the *, ?, +, and { } characters. Several other characters have special meaning within the language of regular expressions. These include the following: $ ^ . [ ( | ) ] and \.

The . (period or dot) metacharacter is one of the simplest and most used. It matches any single character. This can be useful for specifying that certain patterns can contain any combination of characters, but must fall within certain length ranges by using quantifiers. Also, we have seen that expressions will match any instance of the pattern they describe within a larger string, but what if you only want to match the pattern exactly? This is often the case for validation scenarios, such as ensuring the user entered something that is the proper format for a postal code or telephone number.

The ^ metacharacter is used to designate the beginning of a string (or line), and the $ metacharacter is used to designate the end of a string (or line). By adding these characters to the beginning and end of a pattern, you can force it to only match input strings that exactly match the pattern. The ^ metacharacter also has a special meaning when used at the start of a character class, designated by hard braces [ ]. These are covered below.

The \ (backslash) metacharacter is used to "escape" characters from their special meaning, as well as to designate instances of predefined set metacharacters. These too are covered below. In order to include a literal version of a metacharacter in a regular expression, it must be "escaped" with a backslash. So for instance if you wanted to match strings that begin with "c:\" you might use this: ^c:\\ Note that we used the ^ metacharacter to indicate that the string must begin with this pattern, and we escaped our literal backslash with a backslash metacharacter.

The | (pipe) metacharacter is used for alternation, essentially to specify 'this OR that' within a pattern. So something like a|b would match anything with an 'a' or a 'b' in it, and would be very similar to the character class [ab].

Finally, the parentheses ( ) are used to group patterns. This can be done to allow a complete pattern to occur multiple times using quantifiers, for readability only, or to allow certain portions of the input to be matched separately, perhaps to allow for reformatting or parsing.

Some examples of metacharacter usage are listed below.

Pattern Inputs (Matches)
. a, b, c, 1, 2, 3
.* Abc, 123, any string, even no characters would match
^c:\\ c:\windows, c:\\\\\, c:\foo.txt, c:\ followed by anything else
abc$ abc, 123abc, any string ending with abc
(abc){2,3} abcabc, abcabcabc
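
The anchoring and escaping rules above can be tried out with a short program such as the following sketch. The helper names are hypothetical; Regex.Escape is a real .NET method that inserts the backslashes needed to treat a string as a literal pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    // True when the input begins with the literal text c:\ (lowercase c).
    public static bool StartsWithCDrive(string path)
    {
        return Regex.IsMatch(path, @"^c:\\");
    }

    // True when the input ends with abc.
    public static bool EndsWithAbc(string s)
    {
        return Regex.IsMatch(s, "abc$");
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithCDrive(@"c:\windows"));   // True
        Console.WriteLine(StartsWithCDrive(@"d:\c:\windows")); // False: c:\ is not at the start
        Console.WriteLine(EndsWithAbc("123abc"));             // True
        Console.WriteLine(EndsWithAbc("abc123"));             // False
        // Regex.Escape produces the escaped form of a literal string.
        Console.WriteLine(Regex.Escape(@"c:\"));              // c:\\
    }
}
```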

 

Character Classes

Character classes are a mini-language within regular expressions, defined by the enclosing hard braces [ ]. The simplest character class is simply a list of characters within these braces, such as [aeiou]. When used in an expression, any one of these vowel characters can be used at this position in the pattern (but only one unless quantifiers are used). It's important to note that character classes cannot be used to define words or patterns, only single characters.

To specify any numeric digit, the character class [0123456789] could be used. However, since this would quickly get cumbersome, ranges of characters can be defined within the brackets by using the hyphen character, -. The hyphen character has special meaning within character classes, not within regular expressions (thus it doesn't qualify as a regular expression metacharacter, exactly), and it only has special meaning within a character class if it is not the first character. To specify any numeric digit using a hyphen, you would use [0-9]. Similarly for any lowercase letter, you could use [a-z], or for any uppercase letter [A-Z]. The range defined by the hyphen depends on the character set being used, so the order in which the characters occur in the (for example) ASCII or Unicode table determines which characters are included in the range. If you need a hyphen to be included in your class, specify it as the first character. For example, [-.? ] would match any one of those four characters (note the last character is a space). Also note that most regular expression metacharacters are not treated specially within character classes, so they do not need to be escaped there; the exceptions are the backslash, the closing bracket ], a leading ^, and a hyphen between two characters. Consider character classes to be a separate language from the rest of the regular expression world, with their own rules and syntax.

You can also match any character except a member of a character class by negating the class using the caret ^ as the first character in the character class. Thus, to match any non-vowel character, you could use a character class of [^aAeEiIoOuU]. Note that if you want to negate a hyphen, it should be the second character in the character class, as in [^-]. Remember that the ^ has a totally different meaning within a character class than it has at the start of a regular expression pattern.

Some examples of character classes in action are listed below.

Pattern Inputs (Matches)
^b[aeiou]t$ bat, bet, bit, bot, but
^[0-9]{5}$ 11111, 12345, 99999
^[^-][0-9]$ 10, 42, a7 (any one character other than a hyphen, followed by one digit; will not match -0, -1, -2, etc.)
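
The following sketch exercises a list class, a range class, and a negated class. The method names are invented for this example, and the case-insensitive variant relies on RegexOptions.IgnoreCase, which the text above does not cover.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // b, then one lowercase vowel, then t; optionally ignoring case.
    public static bool IsBVowelT(string s, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.IsMatch(s, "^b[aeiou]t$", options);
    }

    // Exactly five characters, each in the range 0-9.
    public static bool IsFiveDigits(string s)
    {
        return Regex.IsMatch(s, "^[0-9]{5}$");
    }

    // Negated class: removes every character that is NOT in 0-9.
    public static string KeepDigits(string s)
    {
        return Regex.Replace(s, "[^0-9]", "");
    }

    public static void Main()
    {
        Console.WriteLine(IsBVowelT("bat", false));      // True
        Console.WriteLine(IsBVowelT("Bat", false));      // False: matching is case-sensitive by default
        Console.WriteLine(IsBVowelT("Bat", true));       // True
        Console.WriteLine(IsFiveDigits("12345"));        // True
        Console.WriteLine(IsFiveDigits("1234"));         // False
        Console.WriteLine(KeepDigits("(425) 555-0123")); // 4255550123
    }
}
```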

 

Predefined Set Metacharacters

There's a great deal that can be done with the tools we've covered so far. However, it is still rather longwinded to use [0-9] for every numeric digit in a pattern, or worse, [0-9a-zA-Z] for any alphanumeric character. To ease the pain of dealing with these common but lengthy patterns, a set of predefined metacharacters was defined. The standard syntax for these predefined metacharacters is a backslash \ followed by one or more characters. Most of these are just one character long, making them easy to use and an ideal replacement for lengthy character classes. Two such examples are \d which matches any numeric digit and \w which matches any word character (alphanumeric plus underscore). The exceptions are specific character code matches, which must specify the address of the character they are matching, such as \u000D which would match the Unicode carriage return character. Some of the most common character classes and their metacharacter equivalents are listed below.

Metacharacter Equivalent Character Class
\a Matches a bell (alarm); \u0007
\b Matches a word boundary except in a character class, where it matches a backspace character, \u0008
\t Matches a tab; \u0009
\r Matches a carriage return; \u000D
\v Matches a vertical tab; \u000B
\f Matches a form feed; \u000C
\n Matches a new line; \u000A
\e Matches an escape; \u001B
\040 Matches an ASCII character with a three-digit octal. \040 represents a space (Decimal 32).
\x20 Matches an ASCII character using 2-digit hexadecimal. In this case, \x20 represents a space.
\cC Matches an ASCII control character, in this case ctrl-C.
\u0020 Matches a Unicode character using exactly four hexadecimal digits. In this case \u0020 is a space.
\* Any character that does not represent a predefined character class is simply treated as that character. Thus \* is the same as \x2A (a literal *, not the * metacharacter).
\p{name} Matches any character in the named character class 'name'. Supported names are Unicode groups and block ranges. For example Ll, Nd, Z, IsGreek, IsBoxDrawing, and Sc (currency).
\P{name} Matches text not included in the named character class 'name'.
\w Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
\W The negation of \w, this equals the ECMAScript compliant set [^a-zA-Z_0-9] or the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
\s Matches any white-space character. Equivalent to the Unicode character classes [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v] (note leading space).
\S Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v] (note space after ^).
\d Matches any decimal digit. Equivalent to [\p{Nd}] for Unicode and [0-9] for non-Unicode, ECMAScript behavior.
\D Matches any character that is not a decimal digit. Equivalent to [\P{Nd}] for Unicode and [^0-9] for non-Unicode, ECMAScript behavior.
 
Note: Different implementations of regular expressions define different sets of predefined metacharacters—the above predefined metacharacters are supported by the System.Text.RegularExpressions API in the .NET Framework.
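
The Unicode versus ECMAScript distinction in the table can be observed directly. This sketch compares \d with and without RegexOptions.ECMAScript, using the Arabic-Indic digit three (U+0663) as a sample non-ASCII digit; the helper name IsSingleDigit is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClassDemo
{
    // Tests whether s is exactly one \d character, optionally under ECMAScript rules.
    public static bool IsSingleDigit(string s, bool ecmaScript)
    {
        RegexOptions options = ecmaScript ? RegexOptions.ECMAScript : RegexOptions.None;
        return Regex.IsMatch(s, @"^\d$", options);
    }

    public static void Main()
    {
        string arabicIndicThree = "\u0663"; // a decimal digit (Unicode category Nd) outside ASCII

        Console.WriteLine(IsSingleDigit("7", false));              // True
        Console.WriteLine(IsSingleDigit(arabicIndicThree, false)); // True:  \d covers all of \p{Nd}
        Console.WriteLine(IsSingleDigit(arabicIndicThree, true));  // False: \d is limited to [0-9]
    }
}
```

For validation that must accept only ASCII digits, writing [0-9] explicitly is one way to avoid depending on the option.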

 

More Sample Expressions

Most people learn best by example, so here are a few sample expressions. For more samples, you should visit the online regular expression library, at http://www.regexlib.com/.

Pattern Description
^\d{5}$ 5 numeric digits, such as a US ZIP code.
^((\d{5})|(\d{5}-\d{4}))$ 5 numeric digits, or 5 digits-dash-4 digits. This matches a US ZIP or US ZIP+4 format. The outer parentheses make the ^ and $ anchors apply to both alternatives.
^(\d{5})(-\d{4})?$ Same as previous, but more efficient. Uses ? to make the -4 digits portion of the pattern optional, rather than requiring two separate patterns to be compared individually (via alternation).
^[+-]?\d+(\.\d+)?$ Matches any real number with an optional sign.
^[+-]?\d*\.?\d*$ Same as above, but also matches the empty string.
^(20|21|22|23|[01]\d)[0-5]\d$ Matches any 24-hour time value written as four digits with no separator (HHMM), such as 0930 or 2359.
/\*.*\*/ Matches the contents of a C-style comment /* … */
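
As a rough illustration of how such validation patterns might be used, the sketch below wraps the optional-group ZIP pattern and the 24-hour time pattern in two helper methods. The method names are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SamplePatternDemo
{
    // US ZIP or ZIP+4: five digits, optionally followed by a dash and four digits.
    public static bool IsZip(string s)
    {
        return Regex.IsMatch(s, @"^\d{5}(-\d{4})?$");
    }

    // 24-hour time written as HHMM (no separator).
    public static bool IsTime24(string s)
    {
        return Regex.IsMatch(s, @"^(20|21|22|23|[01]\d)[0-5]\d$");
    }

    public static void Main()
    {
        Console.WriteLine(IsZip("12345"));       // True
        Console.WriteLine(IsZip("12345-6789"));  // True
        Console.WriteLine(IsZip("12345-678"));   // False: the +4 part needs four digits
        Console.WriteLine(IsTime24("2359"));     // True
        Console.WriteLine(IsTime24("2400"));     // False: hour 24 is not allowed
        Console.WriteLine(IsTime24("0960"));     // False: minute 60 is not allowed
    }
}
```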
@2017-09-29 11:45:02

The following C# code snippets present some commonly used regular expressions. To compile the following code, add a using directive for the System.Text.RegularExpressions namespace, which contains the classes that provide access to the .NET Framework regular expression engine.

using System.Text.RegularExpressions;

Roman Numbers

string p1 = "^m*(d?c{0,3}|c[dm])(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])";

string t1 = "vii";
Match m1 = Regex.Match(t1, p1);

Swapping First Two Words

string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
string r2 = x2.Replace(t2, "$3$2$1", 1);

Keyword = Value

string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*";
Match m3 = Regex.Match(t3, p3);

Line of at Least 80 Characters

string t4 = "********************"
  + "******************************"
  + "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);

MM/DD/YY HH:MM:SS

string t5 = "01/01/01 16:10:01";
string p5 = @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);
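
The snippets in this section stop at the Match object. As a sketch of one way to read the captured fields back out, the following program uses Match.Success and Match.Groups with the date and time pattern; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateTimeGroupsDemo
{
    // Returns the six captured fields (month, day, year, hour, minute, second)
    // as strings, or an empty array when the input does not match.
    public static string[] SplitFields(string input)
    {
        Match m = Regex.Match(input, @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)");
        if (!m.Success)
        {
            return new string[0];
        }

        string[] fields = new string[6];
        for (int i = 0; i < 6; i++)
        {
            fields[i] = m.Groups[i + 1].Value; // Groups[0] holds the whole match
        }
        return fields;
    }

    public static void Main()
    {
        string[] fields = SplitFields("01/01/01 16:10:01");
        Console.WriteLine(string.Join(",", fields)); // 01,01,01,16,10,01
    }
}
```
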

Changing Directories (for Windows)

string t6 = @"C:\Documents and Settings\user1\Desktop\";
// Backslashes are escaped in the pattern; in the replacement string they are literal.
string r6 = Regex.Replace(t6, @"\\user1\\", @"\user2\");

Expanding (%nn) Hex Escapes

string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// uses a MatchEvaluator delegate
string r7 = Regex.Replace(t7, p7, HexConvert);
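
The snippet above refers to a HexConvert method without showing it. One possible implementation is sketched below, assuming it should turn the two captured hex digits into the character with that code; the body is a guess at the intent, not the original author's code.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexEscapeDemo
{
    // Hypothetical evaluator: converts the hex digits captured in group 1
    // into the single character with that character code.
    public static string HexConvert(Match m)
    {
        int code = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return ((char)code).ToString();
    }

    // Replaces every %nn escape in the input using the evaluator above.
    public static string Expand(string input)
    {
        return Regex.Replace(input, "%([0-9A-Fa-f][0-9A-Fa-f])", HexConvert);
    }

    public static void Main()
    {
        Console.WriteLine(Expand("%41"));        // A
        Console.WriteLine(Expand("a%20b%2Fc"));  // a b/c
    }
}
```

Passing the method name works because the compiler converts it to a MatchEvaluator delegate, which Regex.Replace calls once per match.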

Deleting C Comments (Imperfectly)

string t8 = @"
/*
 * this is an old cstyle comment block
 */
";
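string p8 = @"
  /\*  # match the opening delimiter
  .*?  # match a minimal number of characters
  \*/  # match the closing delimiter
";
// IgnorePatternWhitespace permits the layout and # comments above;
// Singleline lets . match newline characters.
string r8 = Regex.Replace(t8, p8, "",
  RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);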

Removing Leading and Trailing Whitespace

string t9a = "   leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");

string t9b = "trailing  ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");

Turning '\' Followed by 'n' Into a Real Newline

string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");

IP Address

string t11 = "55.54.53.52";
// \d\d? accepts one- or two-digit octets; the final $ rejects trailing text.
string p11 = "^" +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])$";
Match m11 = Regex.Match(t11, p11);

Removing Leading Path from Filename

string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");

Joining Lines in Multiline Strings

string t13 = @"this is 
a split line";
string p13 = @"\s*\r?\n\s*";
string r13 = Regex.Replace(t13, p13, " ");

Extracting All Numbers from a String

string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
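
A MatchCollection is usually consumed by iterating over it. The sketch below collects every number found by the pattern above into a list; ExtractNumbers is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExtractNumbersDemo
{
    // Returns the text of every number found in the input, in order of appearance.
    public static List<string> ExtractNumbers(string input)
    {
        var numbers = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(\d+\.?\d*|\.\d+)"))
        {
            numbers.Add(m.Value);
        }
        return numbers;
    }

    public static void Main()
    {
        List<string> found = ExtractNumbers("test 1\ntest 2.3\ntest 47\n");
        Console.WriteLine(string.Join(" | ", found)); // 1 | 2.3 | 47
    }
}
```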

Finding All Caps Words

string t15 = "This IS a Test OF ALL Caps";


string p15 = @"(\b[^\Wa-z0-9_]+\b)";

MatchCollection mc15 = Regex.Matches(t15, p15);

 

Finding All Lowercase Words

string t16 = "This is A Test of lowercase";

string p16 = @"(\b[^\WA-Z0-9_]+\b)";

MatchCollection mc16 = Regex.Matches(t16, p16);

 

Finding All Initial Caps

string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);

 

Finding Links in Simple HTML

string t18 = @"
<html>
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
</html>
";
string p18 = @"<A[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>";
MatchCollection mc18 = Regex.Matches(t18, p18,
  RegexOptions.Singleline | RegexOptions.IgnoreCase);

 

Finding Middle Initial

string t19 = "Hanley A. Strappman";
string p19 = @"^\S+\s+(\S)\S*\s+\S";
Match m19 = Regex.Match(t19, p19);

 

Changing Inch Marks to Quotes

string t20 = @"2' 2"" ";
string p20 = "\"([^\"]*)";
string r20 = Regex.Replace(t20, p20, "``$1''");
@2017-09-29 13:29:22

Some common regular expressions are listed below, one entry per field.

Field: Name
Expression: ^[a-zA-Z''-'\s]{1,40}$
Format samples: John Doe, O'Dell
Description: Validates a name. Allows up to 40 uppercase and lowercase characters and a few special characters that are common to some names. You can modify this list.

Field: Social Security Number
Expression: ^\d{3}-\d{2}-\d{4}$
Format samples: 111-11-1111
Description: Validates the format, type, and length of the supplied input field. The input must consist of 3 numeric characters followed by a dash, then 2 numeric characters followed by a dash, and then 4 numeric characters.

Field: Phone Number
Expression: ^[01]?[- .]?(\([2-9]\d{2}\)|[2-9]\d{2})[- .]?\d{3}[- .]?\d{4}$
Format samples: (425) 555-0123, 425-555-0123, 425 555 0123, 1-425-555-0123
Description: Validates a U.S. phone number. It must consist of 3 numeric characters, optionally enclosed in parentheses, followed by a set of 3 numeric characters and then a set of 4 numeric characters.

Field: E-mail
Expression: ^([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$
Format samples: someone@example.com
Description: Validates an e-mail address.

Field: URL
Expression: ^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:[0-9]*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
Format samples: http://www.microsoft.com
Description: Validates a URL.

Field: ZIP Code
Expression: ^(\d{5}-\d{4}|\d{5}|\d{9})$|^([a-zA-Z]\d[a-zA-Z] \d[a-zA-Z]\d)$
Format samples: 12345
Description: Validates a U.S. ZIP Code. The code must consist of 5 or 9 numeric characters. The second alternative also accepts a letter-digit-letter, space, digit-letter-digit code.

Field: Password
Expression: (?!^[0-9]*$)(?!^[a-zA-Z]*$)^([a-zA-Z0-9]{8,10})$
Format samples: (none given)
Description: Validates a strong password. It must be between 8 and 10 characters, contain at least one digit and one alphabetic character, and must not contain special characters.

Field: Non-negative integer
Expression: ^\d+$
Format samples: 0, 986
Description: Validates that the field contains an integer greater than or equal to zero.

Field: Currency (non-negative)
Expression: ^\d+(\.\d\d)?$
Format samples: 1.00
Description: Validates a non-negative currency amount. If there is a decimal point, it requires 2 numeric characters after the decimal point. For example, 3.00 is valid but 3.1 is not.

Field: Currency (positive or negative)
Expression: ^(-)?\d+(\.\d\d)?$
Format samples: 1.20
Description: Validates a positive or negative currency amount. If there is a decimal point, it requires 2 numeric characters after the decimal point.
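
As a sketch of how two of these expressions might be applied to user input, the program below wraps the Social Security Number and non-negative currency patterns in small validation methods. The class, constant, and method names are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldValidationDemo
{
    private const string SsnPattern = @"^\d{3}-\d{2}-\d{4}$";
    private const string CurrencyPattern = @"^\d+(\.\d\d)?$";

    // Checks the format only (digits and dashes), not whether the number was ever issued.
    public static bool IsSsnFormat(string s)
    {
        return Regex.IsMatch(s, SsnPattern);
    }

    // Whole amount, optionally followed by a decimal point and exactly two digits.
    public static bool IsNonNegativeCurrency(string s)
    {
        return Regex.IsMatch(s, CurrencyPattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsSsnFormat("111-11-1111"));     // True
        Console.WriteLine(IsSsnFormat("111111111"));       // False: dashes are required
        Console.WriteLine(IsNonNegativeCurrency("3.00"));  // True
        Console.WriteLine(IsNonNegativeCurrency("3.1"));   // False: needs two decimals
        Console.WriteLine(IsNonNegativeCurrency("-3.00")); // False: sign not allowed
    }
}
```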
@2017-09-29 13:31:04

What is the Difference Between Quantifier * and Quantifier + in Regular Expression?

The "*" character means match 0 or more occurrences of the previous character, or of a group of characters enclosed in a pair of parentheses. For example, the pattern be* will match:

	b
	be
	bee
	beeeeeeeeee

The "+" character is similar, but it means match 1 or more occurrences of the previous character, or of a group of characters enclosed in a pair of parentheses. For example, the pattern be+ will match:

	be
	bee
	beeeeeeeeee

Comparing the two lists, the first has one more match: be* also matches b alone, because * allows zero occurrences of e, while + requires at least one.
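
The difference can be confirmed with a short test. In this sketch the patterns are anchored with ^ and $ so that the whole input must match; MatchesWhole is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StarVersusPlusDemo
{
    // Expects a pattern that is already anchored with ^ and $.
    public static bool MatchesWhole(string input, string anchoredPattern)
    {
        return Regex.IsMatch(input, anchoredPattern);
    }

    public static void Main()
    {
        Console.WriteLine(MatchesWhole("b", "^be*$"));    // True:  zero e's is allowed
        Console.WriteLine(MatchesWhole("b", "^be+$"));    // False: at least one e is required
        Console.WriteLine(MatchesWhole("bee", "^be*$"));  // True
        Console.WriteLine(MatchesWhole("bee", "^be+$"));  // True
    }
}
```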


© 2024 Digcode.com