

Welcome to


DOGSBYTE.COM




MultiSub 2005
A FREE utility for batch Find/Replace





MultiSub Functions

* Use MultiSub for batch Find & Replace
* Supports regular expressions
* Find text from file contents
* Replace text from file contents
* Batch mode
* Ideal for WebMasters










Regular Expressions are extremely useful but can be difficult to comprehend at first. In this overview, I will not cover every single possibility but perhaps summarize what you need to know. If you want to go deeper, then there are many excellent web sites and books that cover the subject in greater depth.

Although there are many sources of information for regular expressions (known as RegEx and pronounced rej-x), many of these go from a basic RegEx to something a little too complicated in the space of a few lines. So given that I learned RegEx the hard way, I feel well qualified to document an introduction.





Special Characters


For the purposes of this overview, let us assume that we are searching a file for something we want to replace.

First of all, every character on your keyboard represents itself, with the exception of the characters below:


. # ^ $ \ ? + * | [ ] ( )

These all have a special meaning. So if you try to search for a text string like "Question???" with RegEx enabled, the "?" character is going to assume its special function. To use any of these characters as the actual characters they represent, they need to be 'escaped' using a "\" character. So the previous search would need to be "Question\?\?\?". Again, the \ character indicates that the "?" is to really be the question mark character and must not assume its special function.

So the first thing to remember is that these special characters take on their special function by "default" and need to be escaped to assume their normal representation. If we could be taken back in time to when RegEx were first proposed, it would have been better if these characters' "normal" function was the default. But there we are...

Taking this a step further, suppose you wanted to search for:
"Hello, what is an * character? (it's not a question mark)."

This would need to become:
"Hello, what is an \* character\? \(it's not a question mark\)\."

This is interpreted as the previous text, since all the special characters are escaped. One final point on this: if you want a \ character to have its normal character meaning, it is escaped exactly as before and becomes \\.

The good news is that you've now dealt with one of the trickiest things to understand! The most common mistake when using RegEx is forgetting to use the \ escape character.
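As a quick check of the escaping rule, here is a minimal sketch. It uses Python's re module as a stand-in for MultiSub's own RegEx engine, which is an assumption on my part (the two may differ in detail), and the variable names are purely illustrative.

```python
import re

text = "Hello, what is an * character? (it's not a question mark)."

# Every special character is escaped with a backslash, as in the text above.
escaped = r"Hello, what is an \* character\? \(it's not a question mark\)\."
escaped_match = re.search(escaped, text)

# Unescaped, " *" means "zero or more spaces", so the literal "*" is never found.
unescaped_match = re.search(r"an * character", text)

# Python can add the backslashes for you (re.escape is specific to Python).
auto_escaped = re.escape("Question???")

print(escaped_match.group() == text)  # True
print(unescaped_match)                # None
print(auto_escaped)                   # Question\?\?\?
```

The unescaped search fails silently rather than reporting an error, which is exactly why a forgotten backslash is so easy to miss.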

So that's part one over with!






Other special character strings


In addition to the special characters, there are a few other special strings.

\d \D \t \n \r \s \S \w \W

These are discussed later. But one is particularly important, "\n". When you type some text into an editor and press return, you get a new line. Unknown to most users, the return key actually inserts an invisible character called a 'line break'. Most software uses the line break represented in RegEx by "\n" (a 'line feed'), but for reasons known only to Microsoft, a lot of Windows tools also add a 'carriage return' character, represented by "\r", so that each line ends with "\r\n". So most of the time "\n" is the only one you need to remember; think of it as 'newline'. If you get some weird behaviour with line break characters, bear in mind that you may have a "\r" character present too.

If you wanted to search for all the line breaks in a document, search for \n and all the end of line characters will show up. You can even remove all the line breaks by substituting out \n.
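The sketch below shows the "\r" surprise in practice. It assumes Python's re module rather than MultiSub itself, and the sample strings and the remove_line_breaks helper are made up for illustration. The "?" used here (meaning "optional") is explained further on.

```python
import re

windows_text = "first line\r\nsecond line\r\nthird line"
unix_text = "first line\nsecond line\nthird line"

def remove_line_breaks(text: str) -> str:
    # "\r?" is an optional carriage return, so both "\r\n" and "\n" are replaced.
    return re.sub(r"\r?\n", " ", text)

line_break_count = len(re.findall(r"\n", windows_text))
# Replacing only "\n" leaves the invisible "\r" characters behind.
leftover = re.sub(r"\n", " ", windows_text)

print(line_break_count)                  # 2
print(remove_line_breaks(windows_text))  # first line second line third line
print(remove_line_breaks(unix_text))     # first line second line third line
print("\r" in leftover)                  # True
```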






Back to the special characters, what do they mean?


So what do the special characters mean?

Here's an explanation with examples. Note that when writing a RegEx you do not need to put it into quotes; I have done that below only to help show the start and end of each RegEx.


.
"." - the dot character means any character except a line break character.

So if you wanted to search for all three letter words beginning with b and ending in t, the RegEx would be "b.t", for similar four letter words it would be "b..t" and so on. In these cases the "." means that any character can be in this position. The only exception is a line break character. So if our text was

bo
at

and we searched for "b..t", the above would not match: the line break is a character in its own right, and "." will not match it. So to search for the specific example above, we would need a RegEx of "b.\n.t".
There is a RegEx option in MultiSub (". matches \n") which takes away the exception so that a "." character truly represents any character including line breaks. With that option on, "b...t" (three dots, one of them for the line break) would match the example.
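Here is a small sketch of the dot and its line break exception. It assumes Python's re module, where the re.DOTALL flag plays roughly the role of MultiSub's ". matches \n" option (my assumption; the tools are not identical).

```python
import re

words = "bat bit boat but"
three_letter = re.findall(r"b.t", words)
four_letter = re.findall(r"b..t", words)

split_text = "bo\nat"  # "boat" with a line break in the middle
no_match = re.search(r"b...t", split_text)             # "." refuses the line break
explicit = re.search(r"b.\n.t", split_text)            # line break spelled out
dot_all = re.search(r"b...t", split_text, re.DOTALL)   # "." now matches "\n" too

print(three_letter)          # ['bat', 'bit', 'but']
print(four_letter)           # ['boat']
print(no_match)              # None
print(explicit is not None)  # True
print(dot_all is not None)   # True
```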


|
"|" - the pipe character means "or".

If you want to search for fred or joe, the RegEx for that would be "fred|joe". To search for fred, joe or mike, it would be "fred|joe|mike".

You can also combine special characters, so with the previous "." examples, you could search for three and four letter words starting with b and ending in t at the same time using a RegEx of "b.t|b..t".
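The "or" examples above can be tried out like this. It is a sketch using Python's re module as a stand-in for MultiSub, and the sample sentences are invented.

```python
import re

# Any one of three names.
names = re.findall(r"fred|joe|mike", "fred called joe, but mike was out")

# Three and four letter b...t words in a single search.
b_words = re.findall(r"b.t|b..t", "bat boat bit")

print(names)    # ['fred', 'joe', 'mike']
print(b_words)  # ['bat', 'boat', 'bit']
```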


^
"^" - the caret character means the start of the text. So with some text like "hello, hello these are some words", the RegEx "^hello" would pick out the first word "hello" but not the second occurrence of "hello", since the first one occurs at the start of the text.


$
"$" - the dollar character means the end of the text. So with some text like "hello, hello these are some words and more words", the RegEx "words$" would pick out the last occurrence of "words", since this occurs at the end of the text.


^ $ again

The ^ and $ characters lead a double life. As mentioned previously they specify the start/end of text, even if that text covers several lines. There is an option however to make ^ and $ match the start/end of a line, this is the option "^$ match embedded \n". Let's look again...

This is some text
that straddles more
than one line.

With the option "^$ match embedded \n" deselected, the RegEx "^.h.." would match only the word "This", because it occurs at the start of the entire text, but not the words "that" or "than".

With the option "^$ match embedded \n" selected, the same RegEx of "^.h.." will match "This", "that" and "than", because they all occur at the start of a line.

The same is true of $. With "^$ match embedded \n" deselected, $ means the end of the entire text; with "^$ match embedded \n" selected, $ means the end of a line.
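The double life of ^ and $ can be seen in this sketch. It assumes Python's re module, where the re.MULTILINE flag behaves roughly like having "^$ match embedded \n" selected (an assumption about how the two tools line up). The "\S+" used for the $ examples means a run of non-whitespace characters.

```python
import re

text = "This is some text\nthat straddles more\nthan one line."

# Option deselected: ^ and $ refer to the whole text.
start_of_text = re.findall(r"^.h..", text)
end_of_text = re.findall(r"\S+$", text)

# Option selected: ^ and $ refer to each line.
start_of_lines = re.findall(r"^.h..", text, re.MULTILINE)
end_of_lines = re.findall(r"\S+$", text, re.MULTILINE)

print(start_of_text)   # ['This']
print(start_of_lines)  # ['This', 'that', 'than']
print(end_of_text)     # ['line.']
print(end_of_lines)    # ['text', 'more', 'line.']
```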



So by now, I'm sure you're beginning to see that although cryptic, RegEx expressions are actually very powerful. Next...


?
"?" - means 0 or 1 of the previous character. So "ca?t" would match "ct", or "cat", but it would not match "coat".

Beware greedy and lazy discussed below.


+
"+" - is a little like the "?", but this means 1 or more of the previous character. So "ca+t" would match "cat", "caat" and "caaat" etc., but it would not match "ct".

Beware greedy and lazy discussed soon.



*
"*" - means 0 or more of the previous character. For example, if we wanted to search for a "b", followed by any number of "o" characters, followed by a "t", we would use a RegEx of "bo*t". This would match "bt", "bot", "boot", "booot" etc.

Beware greedy and lazy discussed very soon.



.*
In terms of mixing the special characters, a very common combination is ".*", this means 0 or more of any character. For instance, to search for all words beginning with "b" and ending in "t", we could use "b.*t".

Beware greedy and lazy discussed very, very soon.
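To compare ?, + and * side by side, here is a sketch using Python's re module (an assumption; MultiSub's engine may differ). The whole_matches helper is hypothetical and uses re.fullmatch so that each word must match from its first character to its last.

```python
import re

def whole_matches(pattern: str, words: list[str]) -> list[str]:
    # Keep only the words that the pattern matches in their entirety.
    return [w for w in words if re.fullmatch(pattern, w)]

optional_a = whole_matches(r"ca?t", ["ct", "cat", "coat"])
one_or_more_a = whole_matches(r"ca+t", ["ct", "cat", "caat", "caaat"])
zero_or_more_o = whole_matches(r"bo*t", ["bt", "bot", "boot", "boat"])
anything_between = whole_matches(r"b.*t", ["bt", "bat", "boat", "bit", "bee"])

print(optional_a)        # ['ct', 'cat']
print(one_or_more_a)     # ['cat', 'caat', 'caaat']
print(zero_or_more_o)    # ['bt', 'bot', 'boot']
print(anything_between)  # ['bt', 'bat', 'boat', 'bit']
```

Note that "bo*t" also accepts "bt", because zero "o" characters is a perfectly valid count for "*".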



Lazy/Greedy Wildcards
Here we are at last. The wildcards ?, + and * are incredibly useful. But there is a subtlety that trips everyone up, and I remember having problems with this as I was learning RegEx. The problem is that just when you think you have command of the RegEx format, you search for a block of text and a huge great chunk lights up! This is caused by the feature of greedy and lazy (no, not two of Snow White's dwarves). Read on and get your brain ready...



We have mentioned that "." matches any character except a line break (unless we switch the ". Matches \n" option on), so ".*" matches any run of such characters. So if I search the word "mississippi" with a RegEx of "m.*s", what would I get? Well, most people would expect "mis" to be the reported match. But no! The match we get is "mississ". Before you start holding your head in your hands, read on; believe it or not, this is a feature.

This unexpected match is because of laziness and greediness. The way RegEx wildcards such as the *, + and ? characters work is to be greedy. A RegEx of "m.*s" actually means; keep looking for an m followed by an s with the maximum amount of text in between. So we actually get "mississ" as the match. This is the greedy result, the wildcard has eaten as much text as it can to give a valid match.

If we wanted "mis" returned, we need to make the wildcard lazy. We can do this by changing the RegEx to "m.*?s", which means: keep looking for an m followed by an s, with the minimum text in between. You will remember that a question mark normally means zero or one of the previous character, but placed directly after a wildcard (*, + or ?) it takes on a second role: it makes that wildcard lazy, so the search stops at the first available match that yields a valid result.

Probably most of the questions that users raise concern a misunderstanding of the greedy/lazy feature. There are times when you want as much text as possible returned and others when you want the minimum, so it is a useful feature. But, like several irritating things in RegEx behaviour, the default is greedy (whereas most users expect the default to be lazy). So when a huge great chunk of text lights up and your RegEx appears to skip straight over the text you intended, do not call the software a pile of horse manure, remember the greedy/lazy feature!
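Greedy versus lazy is easiest to believe when you see it run. This sketch uses Python's re module as a stand-in for MultiSub (an assumption), with the same search word as above.

```python
import re

word = "mississippi"

# Greedy: ".*" grabs as much as it can while still leaving an "s" to finish on.
greedy = re.search(r"m.*s", word).group()

# Lazy: ".*?" grabs as little as it can before the first available "s".
lazy = re.search(r"m.*?s", word).group()

print(greedy)  # mississ
print(lazy)    # mis
```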



So, we're getting to the point where you can comprehend a RegEx. By combining these special characters we can do some very clever things. A few more, then we're done...


[]
"[]" - putting characters into square brackets means 'any one of', so [abc] will match a or b or c. So why not do this as a|b|c? Well, the brackets have an additional use: you can specify a range such as [A-Z], [a-z] or [0-9], or pull them all together as [A-Za-z0-9], which will match any letter of the alphabet or any numerical digit. You can also be specific by putting [abcdefxyz], which matches any one of the letters enclosed.

For example, if we have some text such as:
"Hello, how are you today?, I'll see you at 9:30"

We could find the time part using "[0-9][:][0-9][0-9]". This means any digit, followed by a colon, followed by two more digits. A better way might be "[0-9]+[:][0-9][0-9]", which would also pick up times that have two digits before the colon.
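The difference between the two time patterns shows up as soon as the hour has two digits. The sketch below assumes Python's re module, and the second sample sentence is invented for the comparison.

```python
import re

text = "Hello, how are you today?, I'll see you at 9:30"
later = "I'll see you at 10:45"

one_digit_hour = r"[0-9][:][0-9][0-9]"
any_digit_hour = r"[0-9]+[:][0-9][0-9]"

found = re.search(one_digit_hour, text).group()
clipped = re.search(one_digit_hour, later).group()   # loses the leading "1"
complete = re.search(any_digit_hour, later).group()

print(found)     # 9:30
print(clipped)   # 0:45
print(complete)  # 10:45
```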


()
"()" - The final special characters are round brackets. These surround characters that form a group. So "(ell)" would search for any occurrence of "ell" in that order and would match the "ell" in "Hello". Round brackets are usually combined with another special character, so "(ca)*t" would match "t", "cat", "cacat", "cacacat" etc., but not "ct" as a whole (in "ct" it would only find the "t").

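Here is a sketch of grouping with round brackets, again assuming Python's re module rather than MultiSub itself. re.fullmatch tests a whole word, while re.search reports whichever part of the text matched.

```python
import re

ell = re.search(r"(ell)", "Hello").group()

candidates = ["t", "cat", "cacat", "cacacat", "ct"]
# "(ca)*" repeats the whole group "ca", not just the letter "a".
whole_words = [w for w in candidates if re.fullmatch(r"(ca)*t", w)]

# Searching inside "ct" still finds something: zero groups followed by "t".
part_of_ct = re.search(r"(ca)*t", "ct").group()

print(ell)          # ell
print(whole_words)  # ['t', 'cat', 'cacat', 'cacacat']
print(part_of_ct)   # t
```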

^ again!

I mentioned that the caret "^" had a double life, well that was not quite true, it has a triple life! When a caret is used inside square brackets, it means 'anything but', or the opposite. So "[^A-Z]" would match every character except for the capital letters.
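A short sketch of the 'anything but' caret, assuming Python's re module; the sample strings are made up.

```python
import re

# Substitute out everything that is not a capital letter.
capitals_only = re.sub(r"[^A-Z]", "", "Hello World, RegEx")

# Runs of characters that are not digits.
non_digit_runs = re.findall(r"[^0-9]+", "abc123def")

print(capitals_only)   # HWRE
print(non_digit_runs)  # ['abc', 'def']
```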


Prefix and Suffix
OK, one thing to finish with. Suppose you want to select text that lies between some other characters, e.g. the word "Text", but only when it appears surrounded in this manner: "AAATextBCD". If you simply use the whole string as your RegEx, you will end up selecting the whole string, not just the "Text" part. This is very common when you want to get the text between quotes, or between brackets, but not actually including the quotes or brackets in the result. To the rescue is prefix and suffix support.

For text of AAATextBCD
(?<=AAA)Text(?=BCD) will give the result "Text"

(?<=AAA) means a prefix of "AAA", but do not include that prefix in the Regex result.
(?=BCD) means a suffix of "BCD", but do not include that suffix in the Regex result.

Note that the prefix and suffix do not have quotes, any special characters present must be escaped.
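The prefix and suffix forms can be sketched as follows, assuming Python's re module (which, unlike some engines, requires the prefix to have a fixed length). The bracket example is an invented illustration of the "between brackets" case; note the escaped \[ and \].

```python
import re

inner = re.search(r"(?<=AAA)Text(?=BCD)", "AAATextBCD").group()

# The prefix must really be there, otherwise nothing matches.
wrong_prefix = re.search(r"(?<=AAA)Text(?=BCD)", "XXXTextBCD")

# Text between square brackets, without the brackets themselves.
in_brackets = re.findall(r"(?<=\[).*?(?=\])", "see [one] and [two]")

print(inner)         # Text
print(wrong_prefix)  # None
print(in_brackets)   # ['one', 'two']
```

The lazy ".*?" matters in the last pattern; a greedy ".*" would run from the first "[" all the way to the last "]".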





Putting it all together...


Now you know the special characters, what's left? Well, nothing, except putting it all together. By combining the special characters, you can create 'patterns' that allow you to search for just about anything.

So practice with MultiSub, open a text file and try writing a RegEx that will select certain parts of it. You'll soon get the hang of it. A summary is below with some very common and useful examples.

Is it still cryptic? Yes, but it's cryptic in an understandable way when you strip it down. In a very short time you'll be writing long RegEx expressions and someone looking over your shoulder will wonder what the hell it all means. The weird thing is, you will understand it!





A summary...


MultiSub

Quick Reference Guide for Regular Expressions

Character                   Purpose                                   Example
Any                         Represents the character typed, with the  A represents A, a represents a, etc.
                            exception of the special characters
                            below
\character                  A character that is normally a special    Special characters are:
                            character becomes its literal character   . # ^ $ \ ? + * | [ ] ( )
.                           Any character (except line breaks)        A.C will match ABC
\d                          Any digit                                 Matches characters 0..9
\D                          Any non-digit                             Will match anything except 0..9
^                           Beginning of the text (or of a line,
                            depending on the ^$ option)
$                           End of the text (or of a line,
                            depending on the ^$ option)
\t                          [TAB]                                     \tText matches "Text" preceded by a tab
\n                          Line Feed
\r                          Carriage Return [RETURN]
\s                          Whitespace                                Matches any whitespace character:
                                                                      [SPACE], [TAB], Carriage Return, Line Feed
\S                          Non whitespace                            Matches any non whitespace character
\w                          Word characters                           Matches letters, numbers and underscores
\W                          Non word characters                       Matches any non word character
[any series of characters]  Any one of the characters inside the      [abc] matches a, b or c
                            brackets
[any character -            Any character within the range of         [a-c] matches a, b or c
another character]          characters
[^ any series of            Any character except the ones after       [^c3] matches any character except
characters]                 the ^                                     c or 3
?                           0 or 1 of the previous character          ca?t matches cat or ct, but not coat
*                           0 or more of the previous character       ca*t matches ct, cat, caat etc.
+                           1 or more of the previous character       ca+t matches cat, caat etc. but not ct
pattern1|pattern2           Either of the patterns specified          ca|t matches ca or t, but not cat (it
                                                                      will match ca and then t, for two
                                                                      matches instead of one)
(pattern) followed by one   Treats the characters in the              (ca)*t matches t, cat, cacat,
of the special characters   parentheses as a group                    but not ct
Some Useful Examples


abc.*def
Matches all text between (and including) abc and def.
For 1234abcXYZ123defghi, matches
abcXYZ123def
Enable ". Matches \n" to allow the selection to include line breaks.

abc.*$
With "^$ Matches Inline \n" enabled, matches all text from (and including) abc to the end of the line.

With "^$ Matches Inline \n" disabled and ". Matches \n" enabled, matches all text from (and including) abc to the end of the text.
For xyzabcdefghi, matches
abcdefghi

^.*abc
With "^$ Matches Inline \n" enabled, matches all text from the start of the line up to (and including) abc.

With "^$ Matches Inline \n" disabled and ". Matches \n" enabled, matches all text from the start of the text up to (and including) abc.
For xyzabcdefghi, matches
xyzabc


abc$
With "^$ Matches Inline \n" enabled, matches abc when it occurs at the end of any line.

With "^$ Matches Inline \n" disabled, matches abc when it occurs at the end of the text.

^abc
With "^$ Matches Inline \n" enabled, matches abc when it occurs at the start of any line.

With "^$ Matches Inline \n" disabled, matches abc when it occurs at the start of the text.

m.*s
Search for the letter m followed by s with the maximum text in between.
For mississippi, matches
mississ

m.*?s
Search for the letter m followed by s with the minimum text in between.
For mississippi, matches
mis

(?<=abc)
(?<=abc) means a prefix of abc, but do not include that prefix in the Regex result.
For abcMyTextxyz
(?<=abc)MyText matches
MyText

(?=xyz)
(?=xyz) means a suffix of xyz, but do not include that suffix in the Regex result.
For abcMyTextxyz
MyText(?=xyz) matches
MyText
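
The examples that depend on the two options can be checked with the sketch below. It assumes Python's re module, with re.MULTILINE standing in for the "^$" option and re.DOTALL for ". Matches \n"; that mapping is my assumption, not something MultiSub documents. The two-line sample text is invented.

```python
import re

enclosed = re.search(r"abc.*def", "1234abcXYZ123defghi").group()
to_start = re.search(r"^.*abc", "xyzabcdefghi").group()

two_lines = "xyzabcdef\nghi"

# "^$" option on: $ is the end of the line, so the match stops before "\n".
to_line_end = re.search(r"abc.*$", two_lines, re.MULTILINE).group()

# "^$" option off, ". Matches \n" on: the match runs to the end of the text.
to_text_end = re.search(r"abc.*$", two_lines, re.DOTALL).group()

print(enclosed)           # abcXYZ123def
print(to_start)           # xyzabc
print(to_line_end)        # abcdef
print(repr(to_text_end))  # 'abcdef\nghi'
```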






