Learn and Test Regular Expressions Freely
The RegEx Lab in Mergemill Pro is a regex checker that lets you test and learn powerful regular expressions.

A regular expression, or regex, is a text pattern made up of
ordinary characters and metacharacters that together describe the
strings to match when searching and replacing text. With regular expressions,
you can quickly search for specific characters and match by position, such as the start or end of a line.
This introduction is specific to REAL
Software's implementation of regular expressions in Mergemill
Pro. However, most of the following ideas and syntax should also apply
to other implementations.
To help you test and debug your regexes, Mergemill Pro features
a regular expression tester called the RegEx Lab. If you are not familiar
with regular expressions, the best way to start is to follow this introduction
and try out every regex pattern described here in the RegEx Lab. Because
you may keep using the RegEx Lab without any restriction on a pre-registered
copy of Mergemill Pro, the software is effectively a free regex
test tool for mastering regular expressions.


Basic Regex Patterns
Pattern |
Description |
abc |
Matches a, b, and c in that order. For
example, regex "123" matches "123" in "01123345". |
a|b|c |
Matches one of a, b, or c. For example,
regex "1|2|3" matches each "1", "2", and "3" in "01123345". |
[a-z0-9] |
Matches any single character
of the set enclosed in square brackets. Examples: [aeiou] matches any
one of the vowels. [a-zA-Z0-9] matches any alphanumeric character. [a-e]
matches any character in the range a-e, inclusive. To match a "-",
place it at the beginning or end of the set. For example, [a-c-] finds
a character in the range a-c or the "-" sign. Other useful
patterns are: "[[]" finds a "[". "[]]" finds
a "]". |
[^a-z0-9] |
Matches any single character
NOT in the set. For example, [^aeiou] matches any character EXCEPT a
vowel. To find the caret character, place it anywhere except the first
position after the opening bracket. For example, [a-e^] finds a character
in the range a-e or the caret character. |
\d |
Matches a digit. Same as [0-9]. |
\D |
Matches a non-digit. Same as
[^0-9]. |
\w |
Matches an alphanumeric (word)
character. Same as [a-zA-Z0-9_]. |
\W |
Matches a non-word character.
Same as [^a-zA-Z0-9_]. |
\s |
Matches a whitespace character
(space, tab, return, line feed, form feed). |
\S |
Matches a non-whitespace character.
Please note that [\D\S] is NOT the same as [^\d\s]. In fact, [\D\S] matches
ANYTHING. |
\n |
Matches a newline (or line feed). |
\r |
Matches a return. |
\t |
Matches a tab. |
\f |
Matches a formfeed. |
\0 |
Matches a null character. |
\000 |
Also matches a null character. This is a
specific case of \nnn. |
\nnn |
Matches an ASCII character of the octal
value nnn. "\15" is the same as "\r". |
\xnn |
Matches an ASCII character of the hexadecimal
value nn. So another way of searching for the return character
is to use \xD. |
\cX |
Matches an ASCII control character. The
letter after the backslash is always a lowercase c. The second letter
is an uppercase letter A through Z, to indicate Control+A through Control+Z.
These are equivalent to \x01 through \x1A. |
\metachar |
Matches the metacharacter, such as \., \\,
and \|. |
. (dot) |
Matches any character except
a line break. If you use the dot alone, it matches the first character
in the target string and, if you repeat the search, each
successive character until a line break is reached. For example, "5.." matches
"567" in "0123456789". The
dot means [^\n] in Unix, [^\r\n] in Windows, and [^\r] in Mac OS. Avoid
the dot when you can; a regex is more efficient if it specifies more
precisely the strings you want to match. Optimizing a regex is important
if it is to be used repeatedly and on large chunks of data. |
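The basic patterns in the table above can be tried outside the RegEx Lab as well. The following is a minimal sketch that uses Python's `re` module as a stand-in; it is not Mergemill Pro's engine, so a few details differ (for instance, Python needs the three-digit octal form `\015` where the table uses `\15`, and it has no `\cX` escape). The sample strings not taken from the table are made up for illustration.

```python
import re

# Literal sequence: "123" is found inside the longer string.
literal = re.search(r"123", "01123345")
print(literal.group(), literal.start())

# Alternation: every 1, 2, or 3 is a separate match.
alternatives = re.findall(r"1|2|3", "01123345")
print(alternatives)

# Character set, negated set, and the \d shorthand.
vowels = re.findall(r"[aeiou]", "regex lab")
non_vowels = re.findall(r"[^aeiou]", "regex")
digits = re.findall(r"\d", "Win32")
print(vowels, non_vowels, digits)

# [\D\S] accepts any character: nothing is both a digit and whitespace.
everything = re.findall(r"[\D\S]", "a1 \t")
print(everything)

# Hexadecimal and octal escapes for the return character.
hex_cr = re.search(r"\x0D", "a\rb")
octal_cr = re.search(r"\015", "a\rb")
print(hex_cr.start(), octal_cr.start())

# The dot: "5.." takes the 5 and the two characters after it.
dot = re.search(r"5..", "0123456789").group()
print(dot)
```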
Metacharacters in Character Sets
The only metacharacters that keep their special meaning inside a character set are the closing
bracket "]", the backslash "\", the caret "^"
and the hyphen "-". Other metacharacters behave as ordinary characters,
and do not need to be escaped by a backslash. To search for a star or plus,
for example, simply use [+*]. To include a backslash as a character without
any special meaning inside a character set, you have to escape it with another
backslash. So [\\x] matches a backslash or an x. The closing bracket "]",
the caret "^" and the hyphen "-" can be included by escaping
them with a backslash, or by placing them in a position where they do not take
on their special meaning.
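The rules for character sets can be checked with a short sketch. It again assumes Python's `re` module rather than Mergemill Pro's own engine, and the sample strings are invented for the purpose.

```python
import re

# + and * are ordinary characters inside a set.
plus_star = re.findall(r"[+*]", "2+3*4")

# A backslash inside a set must itself be escaped.
backslash_or_x = re.findall(r"[\\x]", "a\\x")

# Hyphen placed last, caret placed anywhere but first: both are literal.
range_or_hyphen = re.findall(r"[a-c-]", "b-d")
range_or_caret = re.findall(r"[a-e^]", "^fa")

# The same three characters included by escaping them instead.
escaped = re.findall(r"[\]\^\-]", "x]y^z-")

print(plus_star, backslash_or_x, range_or_hyphen, range_or_caret, escaped)
```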
Anchors (position matching)
Char |
Description |
^ |
Matches the beginning of a line or string.
For example, "^Name" finds lines that begin with "Name". |
$ |
Matches the end of a line or string. For
example, ".$" finds the last character in a line. |
\b |
Matches a word boundary. For example, "\bword\b"
does a whole-word search. |
\B |
Matches a non-word boundary. It matches
where \b does not. |
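A small sketch of the anchors above, assuming Python's `re` module as a stand-in for the RegEx Lab. In Python, "^" and "$" match at every line only when the `re.MULTILINE` flag is given, so the flag is used here to reproduce the per-line behavior described in the table. The sample text is hypothetical.

```python
import re

text = "Name: Ada\nAge: 36\nName: Alan"

# "^Name" finds the lines that begin with "Name".
name_lines = re.findall(r"^Name.*$", text, flags=re.MULTILINE)

# ".$" finds the last character of each line.
last_chars = re.findall(r".$", text, flags=re.MULTILINE)

# "\bword\b" is a whole-word search: "sword" and "words" are skipped.
whole_words = re.findall(r"\bword\b", "word sword words word.")

# "\Bword" matches only where "word" does NOT start at a word boundary.
inner = [m.start() for m in re.finditer(r"\Bword", "word sword")]

print(name_lines, last_chars, whole_words, inner)
```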
Repetition
Char |
Description |
x? |
Repeats x zero or one time. That is, x is
optional in the strings to be matched. For example,
"12?3" matches "123" in "0123456789"
and "13" in "013456789". |
x* |
Repeats x zero or more times in the strings
to be matched. For example, "12*" matches
"122222" in "01222223456789". |
x+ |
Repeats x one or more times in the strings
to be matched. For example, [0-9]+ finds a string of one or more consecutive
numbers, such as "32" in "Win32". |
x{m,n} |
Repeats x m to n times in the strings to
be matched. |
x{n} |
Repeats x exactly n times in the strings
to be matched. |
x{n,} |
Repeats x at least n times in the strings
to be matched. |
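The quantifiers in the table can be exercised as follows. This is a sketch in Python's `re` module, used here only as a convenient stand-in for the RegEx Lab; the string "1 22 333 4444" is an invented sample for the counted forms.

```python
import re

# x? : the 2 is optional.
optional_present = re.search(r"12?3", "0123456789").group()
optional_absent = re.search(r"12?3", "013456789").group()

# x* : a 1 followed by any number of 2s.
star = re.search(r"12*", "01222223456789").group()

# x+ : one or more consecutive digits.
plus = re.search(r"[0-9]+", "Win32").group()

# Counted repetition on runs of digits.
sample = "1 22 333 4444"
two_to_three = re.findall(r"\d{2,3}", sample)   # x{m,n}
exactly_four = re.findall(r"\d{4}", sample)     # x{n}
three_or_more = re.findall(r"\d{3,}", sample)   # x{n,}

print(optional_present, optional_absent, star, plus)
print(two_to_three, exactly_four, three_or_more)
```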
Greediness
The repetition operators (or quantifiers) are NOT greedy in Mergemill Pro.
Greedy quantifiers repeat the preceding token as often as possible while still
allowing the rest of the regex to match. So a greedy plus in the regex
"<.+>" starts with the leftmost "<", and includes
everything in the match up to the last ">" in the string. This
is not what you want if you are trying to find only the first tag in an HTML document.
Mergemill Pro lets you control the Greedy property of the regex via a checkbox.
You may also place a "?" directly after a "*" or "+" to
reverse the "greediness" setting. So when applied to "aaaa" with
the Greedy option selected, "a+?" returns "a" and "a+" returns "aaaa".
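The difference between greedy and non-greedy matching can be seen in a short sketch. It assumes Python's `re` module, which has no global Greedy checkbox: quantifiers there are always greedy unless followed by "?", which corresponds to the behavior described above with the Greedy option selected. The HTML fragment is a made-up sample.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first "<" to the last ">" in the string.
greedy_tag = re.search(r"<.+>", html).group()

# Non-greedy: stops at the first ">" it can, giving just the first tag.
lazy_tag = re.search(r"<.+?>", html).group()

# The "aaaa" example from the text.
lazy_a = re.search(r"a+?", "aaaa").group()
greedy_a = re.search(r"a+", "aaaa").group()

print(greedy_tag)
print(lazy_tag, lazy_a, greedy_a)
```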
Grouping and Backreferences
You can group a part of a regex together by placing it inside parentheses.
This allows you to apply a regex operator, such as a quantifier, to the entire
group. For example, "Nov(ember)?" will match both "Nov" and "November".
Besides grouping part of a regex together, parentheses also create a "backreference".
Backreferences store the parts of the string matched by the parts of the regex
inside the parentheses. They can then be referenced later in the regex, or in the replacement
pattern, by \1, \2, etc. for the first group matched, the second group, and
so on. For example, "\b(\w+)\s+\1\b"
finds double words such as "the the". To match a year followed by an era,
write "(\d+)\s(B\.C\.|A\.D\.|BC|AD)"; \1 then holds only the year
number and \2 holds the letters.
Please note:
- A backreference stores only the last match of its group. Applied to "cab", "([abc]+)" captures "cab" while "([abc])+" keeps only "b".
- Parentheses and backreferences such as \1 have NO special meaning
inside [].
- Backreferences in search patterns must use the backslash, like \1, \2,
etc., whereas in replacement patterns you may use either \1 or $1, and so
on.
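The grouping and backreference examples above can be reproduced with the following sketch, which assumes Python's `re` module rather than Mergemill Pro's engine. The dots in the era pattern are escaped here so that they match literal periods, and the surrounding sample sentences are invented.

```python
import re

# A quantifier applied to a whole group.
months = [m.group() for m in re.finditer(r"Nov(ember)?", "Nov 5, November 6")]

# \1 inside the pattern repeats whatever group 1 matched.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring")

# Two groups: the year number and the era letters.
year = re.search(r"(\d+)\s(B\.C\.|A\.D\.|BC|AD)", "built in 1541 B.C.")

# A repeated group keeps only its last iteration.
whole_run = re.search(r"([abc]+)", "cab").group(1)
last_only = re.search(r"([abc])+", "cab").group(1)

print(months)
print(doubled.group(), doubled.group(1))
print(year.group(1), year.group(2))
print(whole_run, last_only)
```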
Replacement Patterns
Pattern |
Description |
$' |
Replaced with the entire target string following
the matched text. |
$& |
Same as \0 or $0, it contains
the entire matched string. For example, if "\d\d\d\d\sB\.C\." finds "1541
B.C.", then the replacement pattern "the year $&" results
in "the year 1541 B.C.". |
$0-$50 |
Same as \0 to \50. They evaluate
to nothing if the subexpression corresponding to the number doesn't exist,
otherwise they contain the last-matched subpatterns, defined by the parentheses
in the search pattern. |
\xnn |
Replaced with the character
represented by nn in Hex. |
\nnn |
Replaced with the character
represented by nnn in Octal. |
\cX |
Replaced with the character
that is the control version of X. |
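Replacement syntax varies between engines, so the following is only an approximate sketch in Python's `re` module. Python does not accept `$&`, `$0`, or `$'`; the closest equivalents assumed here are `\g<0>` for the whole match, `\1` and `\2` for numbered groups, and a replacement function for the text that follows the match. The helper `text_after` is a hypothetical name.

```python
import re

# $& in the table corresponds to \g<0> in Python: the whole match.
whole = re.sub(r"\d\d\d\d\sB\.C\.", r"the year \g<0>", "1541 B.C.")

# Numbered groups: \1 and \2 refer to the first and second group.
swapped = re.sub(r"(\d+)\s(BC|AD)", r"\2 \1", "1541 BC")

# $' (the text after the match) has no Python shorthand; a function
# replacement can read it from the match object instead.
def text_after(match):
    return match.string[match.end():]

after = re.sub(r"-", text_after, "ab-cd")

print(whole, "|", swapped, "|", after)
```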
Extension Mechanism
Pattern |
Description |
(?#text) |
Use this to insert a comment. |
(?:regex) |
Groups part of a regex without
creating a backreference, so nothing is stored for the group. |
(?=regex) |
This is a zero-width positive
look-ahead assertion. For example, \w+(?=\t) matches a word followed
by a tab, without including the tab in $&. |
(?!regex) |
This is a zero-width negative
look-ahead assertion. For example foo(?!bar) matches any occurrence of "foo" that
isn't followed by "bar". |
(?<=regex) |
This is a zero-width positive
look-behind assertion. For example, (?<=\t)\w+ matches a word that
follows a tab, without including the tab in $&. It works only for
fixed-width look-behind regex. |
(?<!regex) |
This is a zero-width negative
look-behind assertion. It works only for a fixed-width look-behind regex.
For example "\b\w+(?<!s)\b"
finds a word that does not end with an "s". Without using lookbehind,
the regex becomes \b\w*[^s\W]\b |
Please note:
- Lookaround is zero-width: as soon as the condition is satisfied, the
regex engine discards the text matched inside the lookaround. The lookaround
itself therefore does not create a backreference, and is not included in the count towards
numbering the backreferences.
- Any valid regex can be used inside the lookahead, such as (?=regex) or
(?=(regex)). If it contains capturing parentheses like the second one, the
backreferences will be saved. Example: "(?=(\d+))\w+\1"
will NOT match "123x12", but will match "56x56" in "456x56".
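The lookaround examples above can be verified with a sketch in Python's `re` module, used as a stand-in for the RegEx Lab. Like the engine described here, Python accepts only fixed-width look-behind. The sample strings other than "123x12" and "456x56" are made up.

```python
import re

# Positive look-ahead: the word before a tab, tab excluded.
before_tab = re.findall(r"\w+(?=\t)", "key\tvalue")

# Negative look-ahead: "foo" not followed by "bar".
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Positive look-behind: the word after a tab, tab excluded.
after_tab = re.findall(r"(?<=\t)\w+", "key\tvalue")

# Negative look-behind, and the equivalent regex without lookbehind.
words = "cats dog birds fish"
no_final_s = re.findall(r"\b\w+(?<!s)\b", words)
no_final_s_plain = re.findall(r"\b\w*[^s\W]\b", words)

# A capturing group inside a look-ahead is kept for \1.
no_match = re.search(r"(?=(\d+))\w+\1", "123x12")
match = re.search(r"(?=(\d+))\w+\1", "456x56")

print(before_tab, foo_positions, after_tab)
print(no_final_s, no_final_s_plain)
print(no_match, match.group(), match.group(1))
```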
Regex Options in Mergemill Pro
One Line ignores internal newlines for the purposes of matching
against "^" and "$".
Case Sensitive specifies whether case is to be considered when matching
a string.
Dot Matches All sets the dot to match everything, including newlines,
which it normally doesn't match.
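For readers who also work outside Mergemill Pro, the three options have rough counterparts in Python's `re` flags. The mapping below is an assumption made for illustration, not a statement about how Mergemill Pro implements them: One Line is taken to correspond to leaving out `re.MULTILINE`, Case Sensitive to leaving out `re.IGNORECASE`, and Dot Matches All to `re.DOTALL`.

```python
import re

text = "First line\nsecond LINE"

# "^" anchored to the whole string versus to every line.
one_line = re.findall(r"^\w+", text)
per_line = re.findall(r"^\w+", text, re.MULTILINE)

# Case-sensitive versus case-insensitive matching.
case_sensitive = re.findall(r"line", text)
case_insensitive = re.findall(r"line", text, re.IGNORECASE)

# The dot stops at a newline unless told to match everything.
dot_default = re.search(r"First.*LINE", text)
dot_all = re.search(r"First.*LINE", text, re.DOTALL)

print(one_line, per_line)
print(case_sensitive, case_insensitive)
print(dot_default, dot_all.group() == text)
```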

