Regular expressions (with examples) - 1

"And now I'll tell you about regular expressions. This topic is both complex and simple at the same time. To thoroughly understand regular expressions, you may need to read two or three hefty books, but I can teach you how to use them right now."

"As experienced programmers like to joke, if you have a problem and think you're going to solve it with regular expressions, now you have two problems."

"Hmm."

"I hope I didn't scare you too much, my friend. No?"

"Okay, good. So, our new topic is regular expressions."

"If we oversimplify them, regular expressions are patterns for strings."

"You can check whether a string matches a given pattern. You can also split a string into parts using a delimiter or a pattern."

"But let's start with something simple: what is a pattern?"

"In SQL (but not in Java), you can check whether a string matches a particular pattern. This is how it looks:"

name like 'Alex%'

Here name is a column (or variable), like is the operator that checks a value against a pattern, and 'Alex%' is the pattern.

In this case, % stands for any sequence of characters, including an empty one.

Pattern     Strings matching the pattern
'Alex%'     Alex, Alexandr, Alexander, Alexandra, ...
'%x%'       Max, Maxim, Alexandr
'%a'        Olga, Helena, Ira

"In SQL, if you need to specify that there should only be one other character, then you would use the underscore character: "_"."

Pattern     Strings matching the pattern
'Alex%_'    Alexandr, Alexander, Alexandra, ... (at least one character must follow 'Alex')
'_x'        Ax, Bx, Cx
'___'       Aaa, Aab, Bbb
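Java has no LIKE operator, but the two wildcards map naturally onto regular expressions: % corresponds to ".*" and _ corresponds to ".". The sketch below shows one possible translation. The helper likeToRegex is hypothetical (it is not part of the JDK), and it ignores details such as SQL's ESCAPE clause and database-specific case sensitivity.

```java
import java.util.regex.Pattern;

class LikeDemo {
    // Hypothetical helper: translates a SQL LIKE pattern into a Java regex.
    // '%' becomes ".*", '_' becomes ".", every other character is quoted literally.
    static String likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char ch : like.toCharArray()) {
            if (ch == '%') {
                regex.append(".*");
            } else if (ch == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return regex.toString();
    }

    // Checks a value against a LIKE pattern by matching the whole string.
    static boolean like(String value, String likePattern) {
        return value.matches(likeToRegex(likePattern));
    }

    public static void main(String[] args) {
        System.out.println(like("Alexander", "Alex%")); // true
        System.out.println(like("Maxim", "%x%"));       // true
        System.out.println(like("Bx", "_x"));           // true
        System.out.println(like("Aab", "___"));         // true
        System.out.println(like("Ab", "___"));          // false: only two characters
    }
}
```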

"That makes sense."

"Okay, then let's move on to regular expressions."

"Regular expressions typically include restriction not only on the number of characters, but also their 'content'. "Any mask usually consists of two (sometimes more) parts: the first describes character 'preferences', and the second describes the number of characters."

"Here are some content examples:"

Pattern   Description                                         Example
.         Any one character                                   1
\d        Any digit                                           7
\D        Any non-digit                                       C
\s        A space, line break, or tab character               ' '
\S        Anything except spaces, tabs, and line breaks       f
[a-z]     Any lowercase letter from a to z                    z
[0-9]     Any digit from 0 to 9                               8
\w        Any word character (letter, digit, or underscore)   c
\W        Any non-word character                              !
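These character classes can be tried out in Java with String.matches, as in the minimal sketch below. The class and helper names are illustrative only. The backslashes are doubled because the patterns are written as Java string literals, a point the lesson returns to later.

```java
class CharClassDemo {
    // True if the whole text fits the pattern.
    static boolean fits(String text, String regex) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("7", "\\d"));   // true: a digit
        System.out.println(fits("C", "\\D"));   // true: not a digit
        System.out.println(fits("\t", "\\s"));  // true: a tab is whitespace
        System.out.println(fits("f", "\\S"));   // true: not whitespace
        System.out.println(fits("z", "[a-z]")); // true
        System.out.println(fits("Z", "[a-z]")); // false: the range is lowercase only
        System.out.println(fits("c", "\\w"));   // true: a word character
        System.out.println(fits("!", "\\W"));   // true: not a word character
    }
}
```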

"I won't remember those right off, but it doesn't look too hard."

"Excellent, then here are examples of the number of characters in a mask:"

Pattern   Description                                        Example
A?        The character 'A' occurs once or not at all        A
B+        The character 'B' occurs one or more times         BBBB
C*        The character 'C' occurs zero or more times        CCC
D{n}      The character 'D' occurs exactly n times           The pattern D{4} matches DDDD
E{n,}     The character 'E' occurs n or more times           The pattern E{2,} matches EEEEEEE
F{n,m}    The character 'F' occurs between n and m times     The pattern F{2,4} matches FFFF
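The boundaries of each quantifier are easiest to see by testing strings that are just inside and just outside the allowed count. The following is a small sketch with made-up class and helper names:

```java
class QuantifierDemo {
    // True if the whole text fits the pattern.
    static boolean fits(String text, String regex) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("", "A?"));           // true: zero occurrences are allowed
        System.out.println(fits("AA", "A?"));         // false: at most one 'A'
        System.out.println(fits("BBBB", "B+"));       // true
        System.out.println(fits("", "B+"));           // false: at least one 'B' is required
        System.out.println(fits("", "C*"));           // true: zero or more
        System.out.println(fits("DDDD", "D{4}"));     // true
        System.out.println(fits("DDD", "D{4}"));      // false: exactly four are required
        System.out.println(fits("EEEEEEE", "E{2,}")); // true
        System.out.println(fits("FFFF", "F{2,4}"));   // true
        System.out.println(fits("FFFFF", "F{2,4}"));  // false: more than four
    }
}
```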

"That all seems pretty straightforward."

"You're catching on to everything so quickly. Now let's see how it looks all together:"

Pattern      Description                                                     Examples
[a-d]?       A character between 'a' and 'd' occurs once or not at all       a, b, c, d
[b-dz]+      The characters 'b', 'c', 'd', or 'z' occur one or more times    b, bcdcdbdbdbdbzzzzbbzbzb, zbz
[17-9]*      The digits 1, 7, 8, or 9 occur zero or more times               1, 7, 9, 9777, 111199
1{5}         The digit 1 occurs 5 times                                      11111
[12ab]{2}    The symbols 1, 2, 'a', or 'b' occur twice                       11, 12, 1a, ab, 2b, bb, 22
[a0]{2,3}    The symbols 'a' or 0 occur 2 or 3 times                         aa, a0, 00, 0a, aaa, 000, a00, 0a0, a0a

Characters inside square brackets are listed without separators: a comma there is an ordinary character, so [b-d,z] would also match ','.
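A character class followed by a quantifier can be checked the same way. The sketch below (illustrative names only) also tests one detail that is easy to miss: inside square brackets a comma is treated as an ordinary character, so a class written with commas accepts ',' as well.

```java
class CombinedDemo {
    // True if the whole text fits the pattern.
    static boolean fits(String text, String regex) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("", "[a-d]?"));        // true: the character is optional
        System.out.println(fits("bcdz", "[b-dz]+"));   // true
        System.out.println(fits("9777", "[17-9]*"));   // true
        System.out.println(fits("1a", "[12ab]{2}"));   // true
        System.out.println(fits("a0a", "[a0]{2,3}"));  // true
        System.out.println(fits("a0a0", "[a0]{2,3}")); // false: four characters

        // A comma inside the brackets is part of the class.
        System.out.println(fits("b,z", "[b-d,z]+"));   // true
        System.out.println(fits("b,z", "[b-dz]+"));    // false
    }
}
```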

"Still all clear."

"Really? Hmm. Either I explained everything really well or you're too quick on the uptake. Well, either way, that's good for us."

"Here are a couple of new insights for you."

"Since regular expressions are often used to find substrings, we can add two more characters (^ and $) to our patterns."

"^ means that the substring must include the beginning of the string."

"$ means that the substring must include the end of the string."

"Here are some examples:"

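The string being searched is "aaa a aaa a aaa"; square brackets mark the substrings that match.

Pattern     Matching substrings
a{3}        [aaa] a [aaa] a [aaa]
a{3}$       aaa a aaa a [aaa]
^a{3}       [aaa] a aaa a aaa
^a{3}$      aaa a aaa a aaa   (no match: the whole string would have to be "aaa")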
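To search for substrings rather than test the whole string, Java offers Matcher.find. The sketch below collects the start index of every match in the sample string from the table. The class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Returns the start index of every substring that matches the pattern.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "aaa a aaa a aaa";
        System.out.println(matchStarts("a{3}", text));   // [0, 6, 12]
        System.out.println(matchStarts("a{3}$", text));  // [12]
        System.out.println(matchStarts("^a{3}", text));  // [0]
        System.out.println(matchStarts("^a{3}$", text)); // []
    }
}
```

With both anchors in place the pattern has to cover the entire string, which is why the last call finds nothing.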

"And one more important point."

"In regular expressions, the following characters have special meaning: [ ] \ / ^ $ . | ? * + ( ) { }. They're called control characters. So, you can't simply use them in strings."

"As in Java code, they must be escaped. "And again as in Java code, the '\' character is used for this."

"If we want to describe a string consisting of three '?' characters, we can't write '?{3}', because '?' is a control character. We need to do it like this: \?{3}. If we want to use a '\' character, then we need to write '\\'."

"OK, got it."

"And now here's another interesting tidbit. In files with Java code, the '\' character must also be escaped in strings, since it's a control character."

"Of course."

"So, if you're trying to define a Java regular expression in a string, then you need to escape the '\' character twice."

"Here's an example:"

I want a mask that matches 'c:\anything'
In theory, the regular expression should look like this:
one 'c' character,
colon,
backslash,
period and asterisk (to denote any number of arbitrary characters). I added spaces to improve readability:
c : \ .*
But the '\' character needs to be escaped. The period stays unescaped, because here it has to keep its special meaning of "any character". So the regular expression will look like this:
c : \\ .*
Or, without spaces:
c:\\.*
"We have two backslashes in our regular expression.
That means that in a Java file the regular expression will look like this:"
String regexp = "c:\\\\.*";
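A pattern like this is easy to verify by running it against a sample path. The sketch below (class and constant names are made up) compares two spellings: one that leaves the period unescaped, so it means "any character", and one that escapes it, so it matches only literal periods.

```java
class PathPatternDemo {
    // Regex c:\\.* : "c:", one literal backslash, then any characters.
    static final String ANY_AFTER_BACKSLASH = "c:\\\\.*";

    // Regex c:\\\.* : "c:", one literal backslash, then only literal periods.
    static final String DOTS_AFTER_BACKSLASH = "c:\\\\\\.*";

    public static void main(String[] args) {
        String path = "c:\\anything"; // the text c:\anything
        System.out.println(path.matches(ANY_AFTER_BACKSLASH));       // true
        System.out.println(path.matches(DOTS_AFTER_BACKSLASH));      // false
        System.out.println("c:\\...".matches(DOTS_AFTER_BACKSLASH)); // true
    }
}
```

Only the spelling with four backslashes in the Java source accepts an arbitrary path after c:\.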

"Wow! Whoa. Now I know."

"And if you decide to dig deeper into this, here are a couple of good links:"

Lesson on Wikipedia