Regular expressions

Download Master allows you to use regular expressions for advanced program configuration and for searching. Below we describe the syntax of regular expressions and give examples of their usage.

Introduction

Regular expressions are a widely used way of describing patterns for searching text and checking if the text matches the pattern. Special metacharacters allow you to define, for example, that you are looking for a substring at the beginning of an input string or a certain number of repetitions of a substring.

Simple matching

Any character matches itself if it is not one of the special metacharacters described below.
A sequence of characters matches the same sequence in the input string, so the pattern "bluh" will match the substring "bluh" in the input string.
If you want metacharacters or escape sequences to be treated as normal characters, precede them with the "\" character. For example, the metacharacter "^" usually matches the beginning of a string, but written as "\^" it matches the "^" character itself; likewise, "\\" matches "\", etc.

Examples:
foobar finds 'foobar'
\^FooBarPtr finds '^FooBarPtr'.
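
The two examples above can be tried out with a short sketch. It uses Python's re module purely as a stand-in: it is not the engine Download Master uses, and details may differ. The helper re.escape belongs to Python, not to Download Master.

```python
import re

# Plain characters match themselves.
plain = re.search(r"foobar", "xx foobar yy")

# An escaped metacharacter is matched literally.
escaped = re.search(r"\^FooBarPtr", "var ^FooBarPtr;")

# re.escape() (a Python helper) adds the backslash for you.
auto = re.escape("^FooBarPtr")

print(plain.group(0))    # foobar
print(escaped.group(0))  # ^FooBarPtr
print(auto)              # \^FooBarPtr
```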

Escape sequences

Any character can be defined with an escape sequence, just as in C or Perl: "\n" means a newline, "\t" means a tab, and so on. In general, \xnn, where nn is a sequence of hexadecimal digits, means the character with the ASCII code nn. If you need to define a two-byte (Unicode) character, use the format '\x{nnnn}', where 'nnnn' is one or more hexadecimal digits.

\xnn character with hexadecimal code nn
\x{nnnn} character with hexadecimal code nnnn (more than one byte can only be set in Unicode mode)
\t tab (HT/TAB), can also be \x09
\n newline (NL), can also be \x0a
\r carriage return (CR), can also be \x0d
\f form feed (FF), can also be \x0c
\a bell (BEL), can also be \x07.
\e escape (ESC), can also be \x1b.

Examples:
foo\x20bar finds 'foo bar' (note the space in the middle).
\tfoobar finds 'foobar' preceded by a tab.
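
As an illustration only, the same two patterns behave as follows in Python's re module, which stands in here for the program's own engine. Python's re does not accept the \x{nnnn} or \e forms, so they are left out of this sketch.

```python
import re

# \x20 is the character with hexadecimal code 20, i.e. a space.
m_space = re.search(r"foo\x20bar", "a foo bar b")

# \t in the pattern matches a tab character in the text.
m_tab = re.search(r"\tfoobar", "col1\tfoobar")

print(repr(m_space.group(0)))  # 'foo bar'
print(repr(m_tab.group(0)))    # '\tfoobar'
```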

Character Lists

You can define a list by enclosing characters in []. The list will match any single character listed in it.
If the first character in the list (immediately after "[") is "^", then the list matches any single character not listed in it.

Examples:
foob[aeiou]r finds 'foobar', 'foober', etc. but not 'foobbr', 'foobcr', etc.
foob[^aeiou]r finds 'foobbr', 'foobcr', etc. but not 'foobar', 'foober', etc.

Within a list, the '-' character can be used to define character ranges, e.g. a-z represents all characters between 'a' and 'z', inclusive.
If you want to include the "-" character itself in the list, place it at the beginning or end of the list, or precede it with "\". If you need to include the "]" character itself, place it at the very beginning of the list or precede it with "\".

Examples:
[-az] 'a', 'z', and '-'
[az-] 'a', 'z' and '-'.
[a\-z] 'a', 'z' and '-'.
[a-z] all 26 lowercase Latin letters from 'a' to 'z'.
[\n-\x0D] #10, #11, #12, #13.
[\d-t] digit, '-' or 't'.
[]-a] character from the range ']'..'a'.
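
A minimal sketch of how these lists behave, again using Python's re module as a stand-in for the program's engine. The helper chars_matching and the candidate characters are made up for this illustration.

```python
import re

def chars_matching(pattern, candidates):
    """Return the candidate characters that the one-character list matches."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["a", "m", "z", "-", "]", "^"]

print(chars_matching(r"[-az]", candidates))   # ['a', 'z', '-']
print(chars_matching(r"[a\-z]", candidates))  # ['a', 'z', '-']
print(chars_matching(r"[a-z]", candidates))   # ['a', 'm', 'z']
# The range from ']' to 'a' covers ']', '^', '_', '`' and 'a'.
print(chars_matching(r"[]-a]", candidates))   # ['a', ']', '^']
```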

Metacharacters

Metacharacters are special characters that are the most important concept in regular expressions. There are several groups of metacharacters.

Metacharacters - line separators

^ start of line
$ end of line
\A beginning of text
\Z end of text
. any character in the string

Examples:
^foobar only finds 'foobar' if it is at the beginning of a string
foobar$ only finds 'foobar' if it is at the end of a string
^foobar$ finds 'foobar' only if it is the only word in the string
foob.r finds 'foobar', 'foobbr', 'foob1r', etc.

By default, the "^" metacharacter matches only at the beginning of the input text, and the "$" metacharacter only at the end of the text. Internal line delimiters present in the text will not match "^" and "$".
However, if you need to handle the text as multi-line, so that "^" matches after each line separator within the text and "$" matches before each separator, you can enable the /m modifier.
The \A and \Z metacharacters are similar to "^" and "$", but they are not affected by the /m modifier, i.e. they always match only the beginning and end of the whole input text.
The "." metacharacter matches any character by default, but if you turn off the /s modifier, "." will not match line delimiters.
Line delimiters are interpreted as recommended at www.unicode.org (http://www.unicode.org/unicode/reports/tr18/):
"^" matches the beginning of the input text and, if the /m modifier is enabled, the position immediately following \x0D\x0A, \x0A, or \x0D. Note that it does not match in the gap inside the sequence \x0D\x0A.
"$" matches the end of the input text and, if the /m modifier is enabled, the position immediately preceding \x0D\x0A, \x0A, or \x0D. Note that it does not match in the gap inside the sequence \x0D\x0A.
The "." matches any character, but if the /s modifier is turned off, the "." does not match \x0D\x0A, \x0A, or \x0D.
Note that "^.*$" (the pattern for an empty line) does not match the empty string inside the sequence \x0D\x0A, but does match the empty string inside the sequence \x0A\x0D.
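
The effect of the /m and /s modifiers can be sketched in Python, used here only as a stand-in. Python's re treats only \x0A as a line separator, so the sample text uses that separator alone; the handling of \x0D and \x0D\x0A described above is not reproduced by this sketch.

```python
import re

text = "foobar one\nfoobar two\nlast foobar"

# Without the m modifier, ^ matches only at the very start of the text.
starts_default = [m.start() for m in re.finditer(r"^foobar", text)]

# With the m modifier (re.MULTILINE in Python), ^ also matches after a separator.
starts_multiline = [m.start() for m in re.finditer(r"^foobar", text, re.MULTILINE)]

# With the s modifier off, "." stops at a line separator;
# with it on (re.DOTALL in Python), "." matches the separator as well.
dot_default = re.match(r".*", text).group(0)
dot_single_line = re.match(r".*", text, re.DOTALL).group(0)

print(starts_default)           # [0]
print(starts_multiline)         # [0, 11]
print(repr(dot_default))        # 'foobar one'
print(dot_single_line == text)  # True
```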

Metacharacters - standard lists

\w alphanumeric character or "_"
\W not \w
\d numeric character
\D not \d
\s any "space" character (default is [ \t\n\r\f]).
\S not \s
The standard \w, \d, and \s lists can also be used within character lists.

Examples:
foob\dr finds 'foob1r', 'foob6r', etc. but not 'foobar', 'foobbr', etc.
foob[\w\s]r finds 'foobar', 'foob r', 'foob1r', etc. but not 'foob=r', 'foob-r', etc.
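
A small sketch of the standard lists, with Python's re module standing in for the program's engine; the sample words are made up for the illustration. Since \w covers letters, digits and "_", the second pattern accepts a digit as well as a letter or a space.

```python
import re

words = ["foob1r", "foob6r", "foobar", "foob r", "foob=r"]

# \d accepts only a digit in the fifth position.
digit_hits = [w for w in words if re.fullmatch(r"foob\dr", w)]

# [\w\s] accepts a letter, a digit, "_" or a space character.
word_or_space_hits = [w for w in words if re.fullmatch(r"foob[\w\s]r", w)]

print(digit_hits)          # ['foob1r', 'foob6r']
print(word_or_space_hits)  # ['foob1r', 'foob6r', 'foobar', 'foob r']
```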

Metacharacters - repetitions

Any element of a regular expression can be followed by another very important kind of metacharacter - a repeater. Repeaters define how many times the preceding character, metacharacter or subexpression may be repeated.

* zero or more times ("greedy"), same as {0,}
+ one or more times ("greedy"), the same as {1,}
? zero or one time ("greedy"), the same as {0,1}
{n} exactly n times ("greedy")
{n,} at least n times ("greedy").
{n,m} at least n but not more than m times ("greedy").
*? zero or more times ("not greedy"), same as {0,}?
+? one or more times ("not greedy"), same as {1,}?
?? zero or one time ("not greedy"), the same as {0,1}?
{n}? exactly n times ("not greedy").
{n,}? at least n times ("not greedy").
{n,m}? at least n but not more than m times ("not greedy").

So {n,m} specifies a minimum of n repeats and a maximum of m. The repeater {n} is equivalent to {n,n} and specifies exactly n repeats. The repeater {n,} specifies a minimum of n repeats. In theory the values of n and m are unbounded, but large values are not recommended: because the operation is recursive, processing such a repetition can in some situations require substantial time and memory.

If curly braces occur in the "wrong" place, where they cannot be read as a repeater, they are treated simply as ordinary characters.

Examples:
foob.*r finds 'foobar', 'foobalkjdflkj9r' and 'foobr'
foob.+r finds 'foobar', 'foobalkjdflkj9r' but not 'foobr'.
foob.?r finds 'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'.
fooba{2}r finds 'foobaar'.
fooba{2,}r finds 'foobaar', 'foobaaar', 'foobaaaar', etc.
fooba{2,3}r finds 'foobaar', or 'foobaaar' but not 'foobaaaar'.
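
The examples above can be checked with a short sketch. Python's re module is used as a stand-in for the program's engine, and the helper matching and the sample list are made up for the illustration.

```python
import re

samples = ["foobr", "foobar", "foobaar", "foobaaar", "foobaaaar"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"foob.+r"))      # everything except 'foobr'
print(matching(r"foob.?r"))      # ['foobr', 'foobar']
print(matching(r"fooba{2}r"))    # ['foobaar']
print(matching(r"fooba{2,}r"))   # ['foobaar', 'foobaaar', 'foobaaaar']
print(matching(r"fooba{2,3}r"))  # ['foobaar', 'foobaaar']
```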

A little clarification about "greedy". "Greedy" variants of repeaters try to capture as much of the input text as possible, while "non-greedy" variants try to capture as little as possible. For example, 'b+' applied to the input string 'abbbbc' will find 'bbbb' (as will 'b*' when matching starts at the first 'b'), while 'b+?' will find only 'b' and 'b*?' will find an empty string; 'b{2,3}?' will find 'bb', while 'b{2,3}' will find 'bbb'.
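
The difference can be seen in a short sketch on the same input string. Python's re module is used as a stand-in for the program's engine; the names text and found are made up for the illustration.

```python
import re

text = "abbbbc"
patterns = [r"b+", r"b+?", r"b*?", r"b{2,3}", r"b{2,3}?"]

# Record what the first match of each pattern captures.
found = {p: re.search(p, text).group(0) for p in patterns}

print(found[r"b+"])       # bbbb
print(found[r"b+?"])      # b
print(repr(found[r"b*?"]))  # ''
print(found[r"b{2,3}"])   # bbb
print(found[r"b{2,3}?"])  # bb
```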

You can switch all the repeaters in an expression to "non-greedy" mode by turning off the /g modifier.

Metacharacters - alternation

You can define a list of alternatives by using the "|" metacharacter to separate them, e.g. "fee|fie|foe" will find "fee" or "fie" or "foe" (just like "f(e|i|o)e"). The first alternative is everything from the preceding metacharacter "(" or "[", or from the beginning of the expression, up to the first "|"; the last alternative is everything from the last "|" to the nearest metacharacter ")" or to the end of the expression. To avoid confusion, a set of alternatives is usually enclosed in parentheses, even where they are not strictly required.

The alternatives are tried in order, starting from the first, and the search stops as soon as one is found that allows the whole rest of the expression to match (for more details, see Mechanism of Operation). This means that alternation is not necessarily "greedy". For example, if we apply the expression "foo|foot" to the input string "barefoot", then "foo" will be found, since it is the first alternative that allows the whole expression to match.

Note that the metacharacter "|" is treated as a regular character within character lists, e.g., [fee|fie|foe] means exactly the same as [feio|].

Examples:
foo(bar|foo) finds 'foobar' or 'foofoo'.
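
A minimal sketch of the points above, using Python's re module as a stand-in for the program's engine. All variable names are made up for the illustration.

```python
import re

# Alternatives are tried left to right, so "foo" wins over "foot".
first_wins = re.search(r"foo|foot", "barefoot").group(0)

# Putting the longer alternative first changes the result.
longer_first = re.search(r"foot|foo", "barefoot").group(0)

# Inside a character list, "|" is an ordinary character.
probe = "feio|xyz("
same_list = all(
    bool(re.fullmatch(r"[fee|fie|foe]", c)) == bool(re.fullmatch(r"[feio|]", c))
    for c in probe
)

grouped = [s for s in ["foobar", "foofoo", "foobaz"]
           if re.fullmatch(r"foo(bar|foo)", s)]

print(first_wins)    # foo
print(longer_first)  # foot
print(same_list)     # True
print(grouped)       # ['foobar', 'foofoo']
```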

Metacharacters - subexpressions

Metacharacters ( ... ) can also be used to specify subexpressions. After a search completes, you can refer to any subexpression using the MatchPos, MatchLen and Match properties, and substitute subexpressions into a template using the Substitute method.
Subexpressions are numbered from left to right, in the order in which the opening brackets appear.
The first subexpression is numbered '1' (the expression as a whole is '0', it can be referred to in Substitute as '$0' or '$&').

Examples:
(foobar){8,10} finds a string containing 8, 9 or 10 copies of 'foobar'
foob([0-9]|a+)r finds 'foob0r', 'foob1r', 'foobar', 'foobaar', 'foobaaar', etc.
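
The numbering of subexpressions can be sketched in Python, used here only as a stand-in. The Python calls below (group, start, expand) are rough counterparts of Match, MatchPos and Substitute, not the same interface, and Python templates write a subexpression as \g<1> rather than $1.

```python
import re

m = re.search(r"foob([0-9]|a+)r", "xx foobaar yy")

whole = m.group(0)       # subexpression 0: the expression as a whole
whole_pos = m.start(0)   # zero-based position of the whole match
whole_len = len(whole)
first = m.group(1)       # subexpression 1: the first pair of brackets
first_pos = m.start(1)

# Substitute subexpression 1 into a template.
templated = m.expand(r"[\g<1>]")

print(whole, whole_pos, whole_len)  # foobaar 3 7
print(first, first_pos)             # aa 7
print(templated)                    # [aa]
```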

Metacharacters - backreferences

Metacharacters from \1 to \9 are treated as backreferences. \<n> matches the previously found subexpression #<n>.

Examples:
(.)\1+ finds 'aaaa' and 'cc'.
(.+)\1+ also finds 'abab' and '123123'
(["']?)(\d+)\1 finds "13" (in double quotes), or '4' (in single quotes) or 77 (unquoted), etc.
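
A short sketch of backreferences, with Python's re module standing in for the program's engine. The quoted-number pattern is written with a double and a single quote inside the list; the variable names are made up for the illustration.

```python
import re

# \1 must repeat exactly what subexpression 1 captured.
repeated_char = re.search(r"(.)\1+", "xaaaay").group(0)
repeated_block = re.search(r"(.+)\1+", "123123").group(0)

quoted_number = r"""(["']?)(\d+)\1"""
double_quoted = re.fullmatch(quoted_number, '"13"')
single_quoted = re.fullmatch(quoted_number, "'4'")
unquoted = re.fullmatch(quoted_number, "77")
mismatched = re.fullmatch(quoted_number, "\"13'")  # opening and closing quotes differ

print(repeated_char)           # aaaa
print(repeated_block)          # 123123
print(double_quoted.group(2))  # 13
print(single_quoted.group(2))  # 4
print(unquoted.group(2))       # 77
print(mismatched)              # None
```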

Modifiers

You can change modifiers in several ways.
Any modifier can be changed using a special construct (?...) inside a regular expression.

i
Case-insensitive mode (uses the language selected in the OS by default).

m
Treat the input text as multiline, with the metacharacters "^" and "$" matching not only at the beginning and end of the text as a whole, but also at the beginning and end of all lines in the text (see also Line Separators).

s
Treat the input text as a single line. In this case, the "." metacharacter matches any character; if this modifier is off, it does not match line separators (see also Line Separators).

g
Not a standard modifier. By turning it off you switch all repeaters to "non-greedy" mode (by default this modifier is on). I.e. if you turn it off, all '+'s work as '+?', '*'s as '*?', etc.

x
Allows you to format the template to provide easier readability (see description below).

r
Not a standard modifier. If enabled, the range а-я also includes the letter 'ё', А-Я also includes 'Ё', and а-Я includes all Russian letters.

The /x modifier causes spaces, tabs, and line delimiters to be ignored, which allows the text of the expression to be formatted. In addition, if the # character is encountered, all subsequent characters until the end of the line are treated as a comment, e.g.:

(
(abc) # Comment 1
| # Spaces within an expression are also ignored
(efg) # Comment 2
)

Naturally, this means that if you need to insert a space, tab, line separator or # into an expression in extended (/x) mode, you can only do so by preceding it with "\" or by using \xnn (within character lists, all these characters are treated as normal).
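
The extended mode can be sketched in Python, where the closest counterpart of /x is the re.VERBOSE flag. This is a stand-in for illustration, not the program's own engine, and the variable names are made up.

```python
import re

# Layout and comments are ignored in extended mode.
formatted = re.compile(r"""
    (
      (abc)   # Comment 1
      |       # Spaces within an expression are also ignored
      (efg)   # Comment 2
    )
""", re.VERBOSE)

# A space that must be matched has to be escaped with a backslash.
with_space = re.compile(r"foo\ bar  # the escaped space is kept", re.VERBOSE)

print(bool(formatted.fullmatch("abc")))      # True
print(bool(formatted.fullmatch("efg")))      # True
print(bool(with_space.fullmatch("foo bar")))  # True
print(bool(with_space.fullmatch("foobar")))   # False
```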

Perl extensions

(?imsxr-imsxr)
Allows you to change the values of modifiers.

Examples:
(?i)Saint-Avocado finds 'Saint-avocado' and 'Saint-Avocado'
(?i)Saint-(?-i)Avocado finds 'Saint-Avocado' but not 'Saint-avocado'.
(?i)(Saint-)?Avocado finds 'Saint-avocado' and 'saint-avocado'.
((?i)Saint-)?Avocado finds 'saint-Avocado' but not 'saint-avocado'.
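
A rough Python counterpart of these examples follows. Python's re module is only a stand-in: it does not allow switching a modifier off in the middle of a pattern with (?-i), so this sketch uses Python's scoped form (?i:...) to limit case-insensitivity to one part of the expression. That form is Python syntax and is not claimed to work in Download Master.

```python
import re

# Case-insensitive for the whole expression.
all_ci = re.compile(r"(?i)Saint-Avocado")

# Case-insensitive only for the "Saint-" prefix.
prefix_ci = re.compile(r"(?i:Saint-)Avocado")

# The same, with the prefix optional.
optional_ci = re.compile(r"(?i:Saint-)?Avocado")

print(bool(all_ci.fullmatch("Saint-avocado")))       # True
print(bool(prefix_ci.fullmatch("saint-Avocado")))    # True
print(bool(prefix_ci.fullmatch("Saint-avocado")))    # False
print(bool(optional_ci.fullmatch("Avocado")))        # True
print(bool(optional_ci.fullmatch("saint-avocado")))  # False
```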

(?#text)
A comment; it is simply ignored. Note that it is impossible to put the ")" symbol in a comment of this kind, as it is taken as the end of the comment.