What is a regular expression (regex)?
A regular expression is a pattern for finding a string in text. In Java, the original representation of this pattern is always a string, i.e. an object of the String class. However, not every string can be compiled into a regular expression: only strings that conform to the rules for creating regular expressions. The syntax is documented in the java.util.regex.Pattern class.
Regular expressions are written using letters and numbers, as well as metacharacters, which are characters that have special meaning in regular expression syntax. For example:
String regex = "java"; // The pattern is "java";
String regex = "\\d{3}"; // The pattern is three digits;
Creating regular expressions in Java
Creating a regular expression in Java involves two simple steps:
- write it as a string that complies with regular expression syntax;
- compile the string into a regular expression.
In a Java program, we start working with a regular expression by creating a Pattern object. To do this, we call one of the class's two static compile methods. The first method takes one argument, a string containing the regular expression, while the second takes an additional argument that determines the pattern-matching settings:
public static Pattern compile (String literal)
public static Pattern compile (String literal, int flags)
The possible values of the flags parameter are defined in the Pattern class and are available to us as static constants. For example:
Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE); // Pattern-matching will be case insensitive.
Basically, the Pattern class acts as a factory for regular expressions. Under the hood, the compile method calls the Pattern class's private constructor to create a compiled representation. Objects are created this way so that they are immutable. When a regular expression is created, its syntax is checked. If the string contains errors, a PatternSyntaxException is thrown.
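The two compile overloads and the syntax check can be tried out with a short program. This is only a minimal sketch: the class name CompileDemo and the helpers isValidRegex and matchesIgnoringCase are made up for illustration, while Pattern.compile, Pattern.CASE_INSENSITIVE, and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileDemo {
    // Hypothetical helper: true if the string compiles into a Pattern,
    // false if the syntax check fails.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Hypothetical helper: compiles with a flag and matches the whole text.
    static boolean matchesIgnoringCase(String regex, String text) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.compile("java").matcher("JAVA").matches()); // false
        System.out.println(matchesIgnoringCase("java", "JAVA"));               // true
        System.out.println(isValidRegex("\\d{3}"));                            // true
        System.out.println(isValidRegex("[abc"));                              // false (unclosed character class)
    }
}
```

Catching PatternSyntaxException like this is mainly useful when the pattern comes from user input. For patterns hard-coded in the source, letting the exception propagate is usually fine.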
Regular expression syntax
Regular expression syntax relies on the <([{\^-=$!|]})?*+.> characters, which can be combined with letters. Depending on their role, they can be divided into several groups:
Metacharacters for matching the boundaries of lines or text:
Metacharacter | Description |
---|---|
^ | beginning of a line |
$ | end of a line |
\b | word boundary |
\B | non-word boundary |
\A | beginning of the input |
\G | end of the previous match |
\Z | end of the input, except for a final line terminator, if any |
\z | end of the input |
Metacharacters for matching predefined character classes:
Metacharacter | Description |
---|---|
\d | digit |
\D | non-digit |
\s | whitespace character |
\S | non-whitespace character |
\w | alphanumeric character or underscore |
\W | any character except letters, numbers, and underscore |
. | any character (by default, except line terminators) |
Metacharacters for matching control characters:
Metacharacter | Description |
---|---|
\t | tab character |
\n | newline character |
\r | carriage return |
\f | form feed character |
\u0085 | next line character |
\u2028 | line separator |
\u2029 | paragraph separator |
Metacharacters for character classes:
Metacharacter | Description |
---|---|
[abc] | any of the listed characters (a, b, or c) |
[^abc] | any character other than those listed (not a, b, or c) |
[a-zA-Z] | merged ranges (Latin characters from a to z, case insensitive) |
[a-d[m-p]] | union of characters (from a to d and from m to p) |
[a-z&&[def]] | intersection of characters (d, e, f) |
[a-z&&[^bc]] | subtraction of characters (a, d-z) |
Metacharacters for indicating the number of repetitions (quantifiers):
Metacharacter | Description |
---|---|
? | one or none |
* | zero or more times |
+ | one or more times |
{n} | n times |
{n,} | n or more times |
{n,m} | at least n times and no more than m times |
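A few entries from the tables above can be checked with a small program. This is a sketch under assumptions: the class SyntaxDemo and its helper found are hypothetical names, and the sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

class SyntaxDemo {
    // Hypothetical helper: true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^Fred", "Fred Anna"));     // true: the text starts with "Fred"
        System.out.println(found("\\bAnn\\b", "Fred Anna")); // false: "Ann" is followed by a word character
        System.out.println(found("\\d{3}", "Room 42"));      // false: only two digits in a row
        System.out.println(found("\\d{2,3}", "Room 42"));    // true
        System.out.println(found("[a-z&&[^bc]]", "bcb"));    // false: b and c are subtracted
        System.out.println(found("[a-d[m-p]]", "xyzn"));     // true: n falls in m-p
        System.out.println(found("colou?r", "color"));       // true: "u" is optional
    }
}
```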
Greedy quantifiers
One thing you should know about quantifiers is that they come in three different varieties: greedy, possessive, and reluctant. You make a quantifier possessive by adding a "+" character after the quantifier. You make it reluctant by adding "?". For example:
"A.+a" // greedy
"A.++a" // possessive
"A.+?a" // reluctant
Let's use this pattern to understand how the different types of quantifiers work.
By default, quantifiers are greedy. This means that they look for the longest match in the string. If we run the following code:
public static void main(String[] args) {
    String text = "Fred Anna Alexander";
    Pattern pattern = Pattern.compile("A.+a");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        System.out.println(text.substring(matcher.start(), matcher.end()));
    }
}
we get this output:
Anna Alexa
For the regular expression "A.+a", pattern-matching is performed as follows:
- The first character in the specified pattern is the Latin letter A. Matcher compares it with each character of the text, starting from index zero. The character F is at index zero in our text, so Matcher iterates through the characters until it finds one that matches the pattern. In our example, this character is found at index 5.
- Once a match with the pattern's first character is found, Matcher looks for a match with its second character. In our case, it is the "." character, which stands for any character. The character n is at index 6. It certainly qualifies as a match for "any character".
- Matcher proceeds to check the next character of the pattern. In our pattern, it is the quantifier that applies to the preceding character: ".+". Because the number of repetitions of "any character" in our pattern is one or more, Matcher repeatedly takes the next character from the string and checks it against the pattern as long as it matches "any character". In our example, this continues until the end of the string (from index 7 to index 18). Basically, Matcher gobbles up the string to the end. This is precisely what is meant by "greedy".
- After Matcher reaches the end of the text and finishes the check for the "A.+" part of the pattern, it starts checking for the rest of the pattern: a. There's no more text going forward, so the check proceeds by "backing off", starting from the last character. Matcher "remembers" the number of repetitions in the ".+" part of the pattern. At this point, it reduces the number of repetitions by one and checks the larger pattern against the text, repeating this until a match is found. In our example, the match ends at the a at index 14.
Possessive quantifiers
Possessive quantifiers are a lot like greedy ones. The difference is that once text has been captured up to the end of the string, there is no "backing off". In other words, the first three stages are the same as for greedy quantifiers. After capturing the entire string, the matcher tries the rest of the pattern against the remaining text. Nothing is left, and the captured characters are never given back, so the attempt fails. In our example, using the regular expression "A.++a", the main method finds no match.
Reluctant quantifiers
For these quantifiers, as with the greedy variety, the code first looks for a match with the first character of the pattern, which is found at index 5.
Then it looks for a match with the pattern's next character (any character), which is the n at index 6.
Unlike greedy pattern-matching, reluctant pattern-matching searches for the shortest match. This means that after finding a match with the pattern's second character (a period, which corresponds to the character at index 6 in the text), Matcher checks whether the text matches the rest of the pattern, the character "a".
The text does not match the pattern (it contains the character "n" at index 7), so Matcher adds one more "any character", because the quantifier indicates one or more. Then it again compares the pattern with the text at indexes 5 through 8.
In our case, a match is found, but we haven't reached the end of the text yet. Therefore, the pattern-matching restarts from index 9, i.e. the pattern's first character is looked for using a similar algorithm, and this repeats until the end of the text.
The main method obtains the following result when using the pattern "A.+?a":
Anna
Alexa
As you can see from our example, different types of quantifiers produce different results for the same pattern. So keep this in mind and choose the right variety based on what you're looking for.
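To compare the three varieties side by side, the three patterns can be run against the same text. The sketch below assumes a hypothetical helper allMatches that collects every match into a list. The last pattern, "A[^a]++a", is not from the walkthrough above; it is added only to show that a possessive quantifier can still match when the part it captures cannot swallow the character that follows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns every match of the regex in the text.
    static List<String> allMatches(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group()); // group() returns the matched substring
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Fred Anna Alexander";
        System.out.println(allMatches("A.+a", text));     // greedy:     [Anna Alexa]
        System.out.println(allMatches("A.++a", text));    // possessive: []
        System.out.println(allMatches("A.+?a", text));    // reluctant:  [Anna, Alexa]
        System.out.println(allMatches("A[^a]++a", text)); // possessive: [Anna, Alexa]
    }
}
```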
Escaping characters in regular expressions
Because a regular expression in Java, or rather, its original representation, is a string literal, we need to account for Java rules regarding string literals. In particular, the backslash character "\" in string literals in Java source code is interpreted as an escape character that tells the compiler that the next character is special and must be interpreted in a special way. For example:
String s = "The root directory is \nWindows"; // Move "Windows" to a new line
String s = "The root directory is \u00A7Windows"; // Insert a section sign (§) before "Windows"
This means that string literals that describe regular expressions and use "\" characters (e.g. to indicate metacharacters) must double the backslashes to ensure that the Java compiler doesn't misinterpret the string. For example:
String regex = "\\s"; // Pattern for matching a whitespace character
String regex = "\"Windows\""; // Pattern for matching "Windows"
Double backslashes must also be used to escape special characters that we want to use as "normal" characters. For example:
String regex = "How\\?"; // Pattern for matching "How?"
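The effect of escaping is easy to see by comparing the escaped and unescaped versions of the same pattern. This is a small sketch: the class EscapeDemo and the helper matchesLiterally are hypothetical. Pattern.quote is a standard method not covered above; it is shown here only as an alternative that escapes a whole string at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: treats every character of 'literal' as a normal character.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped: "?" is a quantifier that makes the preceding "w" optional.
        System.out.println(Pattern.matches("How?", "Ho"));     // true
        System.out.println(Pattern.matches("How?", "How?"));   // false
        // Escaped: "\\?" in Java source becomes \? in the regex, a literal question mark.
        System.out.println(Pattern.matches("How\\?", "How?")); // true
        // Pattern.quote escapes the whole string for us.
        System.out.println(matchesLiterally("How?", "How?"));  // true
    }
}
```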
Methods of the Pattern class
The Pattern class has other methods for working with regular expressions:
- String pattern() returns the regular expression's original string representation used to create the Pattern object:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.pattern()); // "abc"
- static boolean matches(String regex, CharSequence input) lets you check the regular expression passed as regex against the text passed as input. It returns true if the entire text matches the pattern and false if it does not. For example:
System.out.println(Pattern.matches("A.+a","Anna")); // true
System.out.println(Pattern.matches("A.+a","Fred Anna Alexander")); // false
- int flags() returns the value of the pattern's flags parameter set when the pattern was created, or 0 if the parameter was not set. For example:
Pattern pattern = Pattern.compile("abc");
System.out.println(pattern.flags()); // 0
pattern = Pattern.compile("abc",Pattern.CASE_INSENSITIVE);
System.out.println(pattern.flags()); // 2
- String[] split(CharSequence text, int limit) splits the passed text into a String array. The limit parameter indicates the maximum number of matches searched for in the text:
  - if limit > 0, limit-1 matches;
  - if limit < 0, all matches in the text;
  - if limit = 0, all matches in the text, and empty strings at the end of the array are discarded.
For example:
public static void main(String[] args) {
    String text = "Fred Anna Alexa";
    Pattern pattern = Pattern.compile("\\s");
    String[] strings = pattern.split(text,2);
    for (String s : strings) {
        System.out.println(s);
    }
    System.out.println("---------");
    String[] strings1 = pattern.split(text);
    for (String s : strings1) {
        System.out.println(s);
    }
}
Console output:
Fred
Anna Alexa
---------
Fred
Anna
Alexa
Below we'll consider another of the class's methods, which is used to create a Matcher object.
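The three cases of the limit parameter of split are easier to tell apart on a text that ends with empty fields. The sketch below is an illustration only: the class SplitDemo, the helper splitOnComma, and the sample text "a,b,c,," are assumptions made for the example.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // Hypothetical helper: splits the text on commas with the given limit.
    static String[] splitOnComma(String text, int limit) {
        return Pattern.compile(",").split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,";
        // limit > 0: at most limit-1 matches are used, so the rest stays in the last element.
        System.out.println(splitOnComma(text, 2).length);  // 2 ("a" and "b,c,,")
        // limit < 0: all matches are used and trailing empty strings are kept.
        System.out.println(splitOnComma(text, -1).length); // 5
        // limit = 0: all matches are used and trailing empty strings are discarded.
        System.out.println(splitOnComma(text, 0).length);  // 3
    }
}
```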
Methods of the Matcher class
Instances of the Matcher class are created to perform pattern-matching. Matcher is the "search engine" for regular expressions. To perform a search, we need to give it two things: a pattern and the text to search. To create a Matcher object, the Pattern class provides the following method:
public Matcher matcher(CharSequence input)
The method takes a character sequence, which will be searched. This is an instance of a class that implements the CharSequence interface. You can pass not only a String, but also a StringBuffer, StringBuilder, Segment, or CharBuffer.
The pattern is the Pattern object on which the matcher method is called.
Example of creating a matcher:
Pattern p = Pattern.compile("a*b"); // Create a compiled representation of the regular expression
Matcher m = p.matcher("aaaaab"); // Create a "search engine" to search the text "aaaaab" for the pattern "a*b"
Now we can use our "search engine" to search for matches, get the position of a match in the text, and replace text using the class's methods.
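Because matcher accepts any CharSequence, the same compiled pattern can search a String, a StringBuilder, or a StringBuffer. The following sketch assumes a hypothetical helper containsMatch; it relies on the find() method, which is described right after this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSequenceDemo {
    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    static boolean containsMatch(Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("a*b");
        System.out.println(containsMatch(p, "aaaaab"));                  // true
        System.out.println(containsMatch(p, new StringBuilder("xxab"))); // true
        System.out.println(containsMatch(p, new StringBuffer("aaa")));   // false: there is no "b"
    }
}
```

Compiling the pattern once and reusing it for several inputs, as done here, avoids repeating the compilation step for each search.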
The boolean find() method looks for the next match in the text. We can use this method and a loop statement to analyze an entire text as part of an event model. In other words, we can perform necessary operations when an event occurs, i.e. when we find a match in the text. For example, we can use this class's int start() and int end() methods to determine a match's position in the text (end() returns the index just past the last matched character). And we can use the String replaceFirst(String replacement) and String replaceAll(String replacement) methods to replace matches with the value of the replacement parameter.
For example:
public static void main(String[] args) {
    String text = "Fred Anna Alexa";
    Pattern pattern = Pattern.compile("A.+?a");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        int start = matcher.start();
        int end = matcher.end();
        System.out.println("Match found: " + text.substring(start, end) + " from index " + start + " through " + (end - 1));
    }
    System.out.println(matcher.replaceFirst("Ira"));
    System.out.println(matcher.replaceAll("Mary"));
    System.out.println(text);
}
Output:
Match found: Anna from index 5 through 8
Match found: Alexa from index 10 through 14
Fred Ira Alexa
Fred Mary Mary
Fred Anna Alexa
The example makes it clear that the replaceFirst and replaceAll methods create a new String object: a string in which pattern matches in the original text are replaced by the text passed to the method as an argument. Additionally, the replaceFirst method replaces only the first match, but the replaceAll method replaces all the matches in the text. The original text remains unchanged.
The most frequent regex operations of the Pattern and Matcher classes are built right into the String class. These are methods such as split, matches, replaceFirst, and replaceAll. But under the hood, these methods use the Pattern and Matcher classes. So if you want to replace text or compare strings in a program without writing any extra code, use the methods of the String class. If you need more advanced features, remember the Pattern and Matcher classes.
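For comparison, here is the earlier replacement example again, this time using only the String class's methods. It is a minimal sketch: the class StringRegexDemo and the helper replaceNames are hypothetical names, while the text and pattern are the ones used above.

```java
class StringRegexDemo {
    // Hypothetical helper: replaces every match of "A.+?a" with the replacement.
    static String replaceNames(String text, String replacement) {
        return text.replaceAll("A.+?a", replacement);
    }

    public static void main(String[] args) {
        String text = "Fred Anna Alexa";
        System.out.println(text.matches("A.+a"));              // false: the whole text must match
        System.out.println(text.replaceFirst("A.+?a", "Ira")); // Fred Ira Alexa
        System.out.println(replaceNames(text, "Mary"));        // Fred Mary Mary
        System.out.println(text.split("\\s").length);          // 3
    }
}
```

No Pattern or Matcher object appears in this code, so the regex is compiled again on each call. That is fine for occasional use, while a precompiled Pattern is the usual choice when the same expression is applied many times.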
Conclusion
In a Java program, a regular expression is defined by a string that obeys specific pattern-matching rules. When executing code, the Java machine compiles this string into a Pattern object and uses a Matcher object to find matches in the text. As I said at the beginning, people often put off regular expressions for later, considering them to be a difficult topic. But if you understand the basic syntax, metacharacters, and character escaping, and study examples of regular expressions, then you'll find they are much simpler than they appear at first glance.