As most anyone who uses or has come across them can attest, regular expressions (regex) are a complicated bit of magic. Packed succinctly within their cryptic syntax lies a great deal of power. It's not the "take over the world" kind of power, at least not to the average programmer, but it is the kind of power that can save numerous lines of code. One of the more complicated regex tools I'd like to describe to you is lookaround. When executed properly, lookaround can supercharge your patterns, giving you pattern-matching capabilities otherwise achieved through numerous procedures and even more numerous lines of code.
Regular expression lookaround is not a glaringly simple concept when you first see it. For this reason, readers of this article should at least be familiar with regular expressions in general. EE contributor BatuhanCetin has written a nice introduction to regular expressions here: Regular Expressions Starter Guide
Outside of its complexity, another thing to be mindful of is that not every regex engine supports lookaround. If you plan on experimenting with any of the patterns demonstrated in this article, you should confirm that your editor or language supports lookaround. As described in the section Types of Lookaround, the two directions of lookaround are lookahead and lookbehind. Regex engines can implement none, one, or both directions. Be sure you are using a utility which supports the type of lookaround you are testing.
Let me first start with a clarification. There is a theoretical concept of regular expression and a practical concept. Of course, the practical is based on the theoretical. The difference is that we don't have a concept of lookaround in theoretical regular expressions--at least not in the sense that we use them in the practical case. This article deals with the practical case, obviously!
What Is Lookaround?
The overall concept of lookaround is simple--at my current position during the matching process, look forward (or behind, depending) and see if some pattern matches (or does not match, depending) before continuing. "Big deal! That's what a regex pattern itself does. It matches text by examining each character," you say?
Well, the first thing to be aware of when working with lookaround is that it is a non-consuming match. A non-consuming match is one that is evaluated to see if it can succeed, but whose characters are not actually consumed by the regex engine. By "consumed" I mean that when your regex engine evaluates a character and determines that it is still in line with the pattern, it "forgets" about this character and moves on to the next one; a lookaround leaves the engine's position where it was. During the course of this article you will see that the "forgetting" part is not entirely true; for the moment, accept that it is.
One way of thinking about this non-consumption idea is to think of it like going to the deli and taking a number. Let's say you pull number 5 and then you leave. At the time of your departure, you know you have 5, so you can safely assume that 6 is the next ticket (because you know the tickets are sequential, in this case). After ten minutes pass, you return to the deli. You look at the number dispenser and ask yourself, "What is the next number to be dispensed?" You are not going to actually take the number, you just want to look and see what it is. Why? Who knows. Perhaps you just like knowing that you got in-and-out before the next deli-lover arrived.
Types of Lookaround
What happened in the deli example could be considered a positive lookahead. In many (but not all) regex engines, we have two directions of lookaround: lookahead and lookbehind. Both of these directions are as they sound: lookahead peeks forward of the current position and lookbehind peeks backward. I previously said the regex engine forgets about a character once it has been determined to satisfy the pattern. Here's the contradiction: when you use a lookbehind, you can actually peek at characters the engine has already evaluated.
In addition to the directions, we also have two concepts of matching: positive (matching) and negative (not matching). When you use a positive lookaround, you are informing the regex engine that you would like to verify some pattern can be matched. With a negative lookaround, you want some pattern to not match. The thing to be mindful of in using a negative lookaround is that failing to match a pattern is actually a success. As with direction, not all regex engines implement both concepts of matching.
Here's a summary of the four primitive possibilities you can have with lookaround:
positive lookahead: ahead of current position, see if pattern matches
positive lookbehind: prior to current position, see if pattern matches
negative lookahead: ahead of current position, see if pattern does not match
negative lookbehind: prior to current position, see if pattern does not match
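For a first feel, the four cases can be tried in Python's re module, which uses the syntax most lookaround-capable engines share: (?=...), (?!...), (?<=...), and (?<!...). The sample string and patterns below are made up purely for illustration.

```python
import re

text = "x1 y2 x3"  # made-up sample input

# positive lookahead: a letter, only if "1" comes right after it
pos_ahead = re.findall(r"[a-z](?=1)", text)

# negative lookahead: a letter, only if "1" does NOT come right after it
neg_ahead = re.findall(r"[a-z](?!1)", text)

# positive lookbehind: a digit, only if "x" comes right before it
pos_behind = re.findall(r"(?<=x)\d", text)

# negative lookbehind: a digit, only if "x" does NOT come right before it
neg_behind = re.findall(r"(?<!x)\d", text)

print(pos_ahead)   # ['x']
print(neg_ahead)   # ['y', 'x']
print(pos_behind)  # ['1', '3']
print(neg_behind)  # ['2']
```

In every case only the letter or digit is returned; the text inspected by the lookaround never becomes part of the match.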
Lookaround by Example
Lookaround can be a bit mind-boggling to think about when you are staring at the construct in the pattern. It may be easier if you think of a pattern with lookaround as having two pointers--one for the pattern itself (the consuming part) and one for the lookaround (the non-consuming part). Here are two demonstrations.
Let's say you are interested in checking a password field, which can accept alpha-numeric characters, for the existence of at least one digit. There are a couple of ways you can approach this. You could write your pattern as:
which would work fine. Alternatively, you could use a lookahead to see if the target string contained a digit, and coincidentally shorten the original pattern a bit:
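Here is a sketch in Python of what the two approaches might look like. Both patterns are assumptions: the lookahead version is chosen to be consistent with the walkthrough that follows (a non-greedy dot-star and a digit inside the lookahead, then [a-zA-Z0-9]+), and the plain version is just one plausible way to write the same check without lookaround.

```python
import re

# Assumed pattern without lookaround: alphanumerics, one digit, alphanumerics.
plain = re.compile(r"^[a-zA-Z0-9]*[0-9][a-zA-Z0-9]*$")

# Assumed lookahead version: first confirm a digit exists somewhere ahead,
# then match the whole string as alphanumerics.
with_lookahead = re.compile(r"^(?=.*?[0-9])[a-zA-Z0-9]+$")

for password in ["ab1c", "abcd", "1234"]:
    print(password,
          bool(plain.match(password)),
          bool(with_lookahead.match(password)))
```

For these inputs both patterns agree: "ab1c" and "1234" are accepted, "abcd" is rejected.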
Notice there is a new construct in the pattern: (?= ... ). This denotes a lookahead, and it is positive ( = ). A negative lookahead would exchange the equals sign for an exclamation point ( ! ). This syntax is typical of most regex engines.
Now in this trivial example, the benefits aren't that bountiful. But for now, I'm going to stick with it for the subsequent illustrations. To see a more real-world-applicable example, see the "Real-world Examples" section of the article.
Let's initialize our engine with the password ab1c
The red arrow indicates our current position for match evaluation. Yes, it is in the void before the "a". This is because you can match positions as opposed to characters with regex. If you have ever used ^ or $ to match the beginning or end of a string, respectively, then you have matched positions. In fact, ^ at the beginning of our pattern above matches the location of the red arrow in the figure.
What has happened here is that we have matched the beginning of the string, and we are now moving on to the lookahead. I mentioned before that you could think of lookaround as being another pointer for the engine. Here I am representing that pointer with the blue arrow. Why? Recall that I said lookaround is non-consuming. Our red arrow marks our current position, and we want to evaluate the pattern within the lookahead, but without forgetting our current position.
Now we evaluate the lookahead.
The first part of the lookahead specifies the non-greedy dot-star notation, which matches any character, zero or more times, as few times as possible. Because the match is minimal, the lookahead stops at the first position where the rest of its pattern can succeed. In short, this part of the pattern will match the first two letters in the target string and advance our lookahead pointer to the only digit in the target string. The .*? in front of [0-9] puts the engine in this state:
Along the way, since we specified the dot-star be minimal, the digit has been checked for. It is now about to be matched. Since the last part of the lookahead specifies to match a digit, the current position of the lookahead pointer will match and since we are also at the end of our lookahead, the entire lookahead will match.
Notice that our main pointer has not moved at all. Again, this is because our lookahead is non-consuming. Because the lookahead succeeded, we can continue processing the remainder of the pattern. Here is the state of the engine after the success of the lookahead:
Yes, it's the same as our initialized engine. In the interest of space, I will not show the progress of our main pointer--just realize that at this point, since our lookahead succeeded, the remainder of the processing of our pattern will occur as we expect, checking each character one-by-one until the end of string is reached. Because the non-lookahead portion of our pattern is [a-zA-Z0-9]+ and our string consists of only letters and a single digit, the pattern as a whole will match.
Had we made our lookahead negative instead of positive, the existence of the digit within our password would have caused the match to fail. Of course, it's a bit contradictory to specify that your character class be composed of alphanumeric characters, and then have a lookahead that says to not find any digits. The point to bear in mind is that changing from positive to negative would cause failure in this example.
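Both points can be checked quickly in Python. This is a sketch under the assumption that the lookahead is (?=.*?[0-9]), as in the walkthrough; the negative variant simply swaps = for !.

```python
import re

password = "ab1c"

# A lookahead on its own matches a position, not characters:
# the match starts and ends at index 0, so nothing was consumed.
peek = re.match(r"^(?=.*?[0-9])", password)
print(peek.span())   # (0, 0)

# Assumed negative variant: succeed only if NO digit lies ahead.
negative = re.match(r"^(?!.*?[0-9])[a-zA-Z0-9]+$", password)
print(negative)      # None, because "ab1c" contains a digit
```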
Let's modify the previous password requirement set forth by our original pattern. We now want a pattern which will match passwords containing at least one digit and that digit must not occur at the end of the string. Continuing with the same target password string, we could again accomplish this via a well-structured pattern:
but another way we could handle this is by using a negative lookbehind. Recall that "negative" indicates we want a particular pattern to not be found at some point in the target string. If we modify the pattern to use a negative lookbehind, we could end up with:
Admittedly, there are not many keystroke savings in this pattern, but it will demonstrate how negative lookbehind works. As with the lookahead example, you will notice another novel construct in this pattern: (?<! ... ). The less-than sign ( < ) indicates the lookaround to be a lookbehind. This time, the matching concept is negative; to convert to positive, you would exchange the exclamation point with an equals sign.
Similarly to the lookahead example, our engine will have the same initialized state. The engine will process each character within our target string, consuming each character up to the end of the string. But, now we have a lookbehind to process. Here's what we see for the initialized lookbehind:
So here's where the white lie I told earlier comes into play--even though we "consumed" the characters in normal pattern processing, the lookbehind can still peek back at those consumed characters. This feature is what makes this next bit of logic feasible. Our lookbehind pointer will move to the first character before the end of the string--here, I'm referring to the void after the last character in the string:
Our lookbehind currently is evaluating a "c". Our lookbehind is looking for a digit, but since the lookbehind is negative, not finding a digit is a success. Since "c" is not a digit, our negative lookbehind succeeds, and subsequently our entire pattern succeeds. Had we used a positive lookbehind, our pattern would have failed since we wanted to find a digit, but instead found an alpha character.
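The behavior just described can be sketched in Python. The pattern is an assumption that fits the walkthrough: a lookahead requiring a digit, the consuming character class, and a negative lookbehind for a digit evaluated right before the end of the string. A positive-lookbehind twin is included for comparison.

```python
import re

# Assumed pattern: require a digit somewhere (lookahead), consume the
# alphanumerics, then check at the end that the previous character is
# NOT a digit (negative lookbehind).
not_trailing_digit = re.compile(r"^(?=.*?[0-9])[a-zA-Z0-9]+(?<![0-9])$")

# Same pattern with a positive lookbehind: the last character MUST be a digit.
trailing_digit = re.compile(r"^(?=.*?[0-9])[a-zA-Z0-9]+(?<=[0-9])$")

for password in ["ab1c", "abc1"]:
    print(password,
          bool(not_trailing_digit.match(password)),
          bool(trailing_digit.match(password)))
```

With "ab1c" the negative version matches and the positive one does not; with "abc1" the results are reversed.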
Again, this is a very trivial example. Please have a look at the "Real-world Examples" section of the article for a more realistic use of this feature.
Limitations of Lookaround
As described thus far, lookaround is a powerful extension to regex matching. The unfortunate truth is that not every engine supports lookaround. Many have an implementation of lookahead, but there are a few engines which do not support lookbehind. The "engines" where you will most often find lookaround to be lacking are those offered with IDEs (specifically, find/replace dialogs). Check your language's (or IDE's) documentation for support of lookaround.
Another restriction of most regex engines is that lookbehind (and possibly some lookaround) cannot have unbounded patterns within them. An unbounded pattern is one that can have an unlimited number of repetitions; the star and plus quantifiers are examples. The only language(s) I have personally encountered which do support unbounded quantifiers within lookaround are the .NET languages. One way to overcome this limitation would be to give some upper-bounded quantifier (we're talking curly braces here) that has a very large number. It's not very extensible, but it could get you by.
Addendum: Having just participated in a question dealing with it, I have come to find out that the Boost C++ libraries support unbounded lookaround--at least v1.40 does.
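Python's built-in re module is one engine where this restriction is easy to see. A minimal sketch: a fixed-width lookbehind compiles, while one containing a plus quantifier is rejected at compile time. Note that re is stricter than "bounded": it wants a fixed width, so the large-upper-bound workaround does not help there.

```python
import re

# A fixed-width lookbehind compiles fine.
fixed = re.compile(r"(?<=ab)c")

# An unbounded quantifier inside a lookbehind is rejected by Python's re.
try:
    re.compile(r"(?<=a+)c")
    unbounded_ok = True
except re.error as exc:
    unbounded_ok = False
    print("rejected:", exc)

print(bool(fixed.search("abc")), unbounded_ok)
```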
Many of the languages which support lookaround also support capture groups within lookaround. The caveat with this feature is that some languages only preserve the capture within the lookaround itself; others allow the captured value to be backreferenced outside of the lookaround. Refer to your language's documentation to confirm the scope of capture groups.
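As one data point, Python's re module keeps a value captured inside a positive lookahead available after the lookahead has finished. The pattern and input below are illustrative only.

```python
import re

# The group sits inside the lookahead, yet its value is still
# available on the match object afterwards.
m = re.match(r"(?=.*?([0-9]))[a-z0-9]+", "ab1c")

print(m.group(0))  # ab1c  (what the consuming part matched)
print(m.group(1))  # 1     (captured inside the lookahead)
```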
Real-world Examples
After all the boring stuff, I'm sure you're ready to see how to implement lookaround in a useful manner. Here is a list of a few real-world applications of lookaround and explanations of why each pattern works.
Passwords Containing Special Characters and of a Specific Length
You want to ensure that a password meets a set of criteria. The password should be between 8 and 15 characters, contain at least one upper-case alpha character, contain at least one lower-case alpha character, contain at least one digit, and contain at least one of the following: $, %, #, @, &.
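A sketch of such a check in Python. The pattern is an assumption assembled from the requirements above and the explanation that follows (four lookaheads, dot-star inside each, a bounded dot, and ^ and $ anchors); the names PASSWORD_RULES and is_valid are made up for the example.

```python
import re

PASSWORD_RULES = re.compile(
    r"^"
    r"(?=.*[A-Z])"      # at least one upper-case letter
    r"(?=.*[a-z])"      # at least one lower-case letter
    r"(?=.*[0-9])"      # at least one digit
    r"(?=.*[$%#@&])"    # at least one of the listed special characters
    r".{8,15}"          # 8 to 15 characters of any kind
    r"$"
)

def is_valid(password: str) -> bool:
    """Return True if the password satisfies every rule above."""
    return PASSWORD_RULES.match(password) is not None

for candidate in ["Passw0rd#", "passw0rd#", "Pw0#", "Passw0rd#Passw0rd#"]:
    print(candidate, is_valid(candidate))
```

Only the first candidate passes: the second lacks an upper-case letter, the third is too short, and the fourth is too long.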
Why This Works
We have four lookaheads: one for each of the above conditions specifying a type of character to be included in the password. After matching the beginning of the string, we evaluate each lookahead. Because lookaround is non-consuming, we never leave the void before the first character of the string upon completion of each lookahead. Each lookahead checks for the existence of one of the character restrictions specified, using dot-star to skip over any unimportant characters. By the time we have evaluated the last lookahead, all that is left is to evaluate the bounded dot of the pattern. Since dot matches any character, we bound the dot to restrict the length of our string (^ and $ are required to make the bound effective).
One's first instinct might be to combine the lookaheads into one to save keystrokes. The reason not to do this is that if your requirement is that the characters can be at any position, in any order within the target string, then the four separate lookaheads are needed. If instead your requirement is that they occur at any position, but in a specific order, then you could combine the four into one lookaround, concatenating each required character condition with a dot-star to ignore unimportant characters.
Use of the bounded dot at the end could be a security concern for you. I used it here for simplicity, but you would really want to provide further restrictions on what type of characters your password can consist of. No reason to accept null characters as valid input unless you really allow passwords to have them!
The ^ and $ would be required for this particular application. If you did not include them, then the bounded dot at the end of the pattern would be pointless.
Extract the Integer Portion of a Decimal Number, If and Only If it Has a Fractional Part
You want to find the integer portion of a decimal number within text. You have a peculiar requirement that you only want values if they have a decimal part. Why? Who knows. No one ever said the business side had any sense :)
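In Python this could look like the sketch below. The pattern \d+(?=\.\d+) is an assumption that matches the description in the next section: digits, followed by a positive lookahead for a decimal point and more digits. The sample text is invented.

```python
import re

text = "Totals: 12.50, 7, 300.1"  # made-up sample input

# Digits, but only when a decimal point and more digits come next.
integer_parts = re.findall(r"\d+(?=\.\d+)", text)

print(integer_parts)  # ['12', '300']
```

The plain integer 7 is skipped, and the fractional digits never appear in the results because the lookahead does not consume them.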
Why This Works
The engine finds one or more digits, then checks for the existence of a decimal point and one-or-more digits. Since the lookahead here is positive, the match will only succeed if the engine finds a decimal part. Since lookahead is non-consuming, the engine has only matched, overall, the integer portion of the decimal number.
Find a Word That Is NOT Preceded by Some Word or Phrase
You are looking for a particular word. The condition for finding this word, though, is that it not be preceded by some other particular word or phrase.
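A sketch in Python, assuming the pattern is (?<!Hello )World, which fits the explanation below (the word "World", guarded by a negative lookbehind for "Hello" plus a trailing space).

```python
import re

not_after_hello = re.compile(r"(?<!Hello )World")

for s in ["Hello World", "Goodbye World", "hello World"]:
    print(repr(s), bool(not_after_hello.search(s)))
```

Only "Hello World" is rejected; the lower-case "hello World" still matches because the lookbehind is case-sensitive.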
Why This Works
The engine searches for the word "World". Once it finds it, it begins evaluating the lookbehind. If it finds the string "Hello" and a trailing space, then the lookbehind fails, since it is a negative lookbehind.
The patterns inside the lookbehind function the same as patterns outside the lookbehind. As such, just being inside the lookbehind doesn't implicitly make the search for "Hello" case-insensitive. Having a target string of "hello World" would cause the lookbehind to succeed. One interesting feature of some regex engines is that you can turn on case-insensitivity within certain scopes. Changing the pattern to:
would turn on case-insensitivity just for the lookbehind. Even in engines which support lookaround, this feature is not always available.
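Python (3.6 and later) is one engine with this feature, spelled (?i:...). The sketch below is an assumption about how the scoped version could be written there; other engines may use a different spelling.

```python
import re

# (?i:...) switches on case-insensitivity only for "Hello " inside the
# lookbehind; "World" outside of it stays case-sensitive.
scoped = re.compile(r"(?<!(?i:Hello ))World")

for s in ["hello World", "HELLO World", "Goodbye World", "Goodbye world"]:
    print(repr(s), bool(scoped.search(s)))
```

Here only "Goodbye World" matches: the first two are blocked by the lookbehind, and the last fails because "world" is lower-case.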
Split Pascal-cased Identifiers Into Component Words
You follow good variable-naming conventions and your convention in use is Pascal casing (sometimes called CamelCase). You want to split a variable name into its component parts.
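A sketch in Python, assuming the pattern is (?<=[a-z])(?=[A-Z]) used with a replace, as the explanation below describes. The function name split_pascal is made up for the example.

```python
import re

def split_pascal(identifier: str) -> str:
    """Insert a space at every void between a lower-case and an upper-case letter."""
    # The pattern matches an empty position, so nothing is removed;
    # the space is simply inserted there.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", identifier)

print(split_pascal("MyVariableName"))  # My Variable Name
```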
Why This Works
This one is a bit tricky to explain, but I'll do my best.
You could think of this, in a way, as looping through the voids between characters. While in each of these voids, we look backward to find a lower-case alpha character. If that succeeds, we look forward to find an upper-case alpha character. If both succeed, we do a replace substituting in a space. You can think of this void as being turned into a space.
You could accomplish the same thing by doing a search for a lower-case alpha adjoined on the right with an upper-case alpha, capture each character in its own capture group, and then enter backreferences, each separated by a space into the replacement.
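That capture-group alternative might look like this in Python; again a sketch, with a made-up function name.

```python
import re

def split_pascal_with_groups(identifier: str) -> str:
    """Same split, but consuming both letters and writing them back."""
    # \1 is the lower-case letter, \2 the upper-case letter that follows it.
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", identifier)

print(split_pascal_with_groups("MyVariableName"))  # My Variable Name
```

The result is the same; the difference is that the two letters are consumed and must be put back by the replacement, whereas the lookaround version never touches them.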
Tokenizer
The scenario here is to split a string into tokens. For the question's purposes, tokens were considered anything separated by spaces, punctuation, or other non-word characters. Let me stave off the faint-of-heart by saying that if you have not gotten comfortable with lookaround prior to this point, then I would suggest you avoid looking at this next pattern. Believe me, it was difficult enough to write!
Note: The first character in the above pattern is a space.
Why This Works
What we have is a series of OR conditions. The first condition checks for a space; if we find a space then the split is trivial. The second condition, similar to the Split Pascal-cased Identifiers Into Component Words example above, effectively loops through the voids between characters. To the left of the void, we check for a non-word character; to the right, a word character. If both conditions are met, a split occurs on the void and both "words" are preserved. The remaining two conditions work the same way. The third condition checks for two non-word characters and the fourth condition checks for a non-word character on the left and a word character on the right.
The downside of this approach is that because of the inner working of OR in regex, the splits produced by the lookahead parts of the pattern will end up producing null (empty string) entries in the output array. If you were to use this example, then be aware that you would need to check for null values in some of your array slots.
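A rough Python sketch of this kind of tokenizer follows. The pattern is an assumption, not the original: it keeps the leading space alternative and the lookaround pairs described above, and uses a word-then-non-word pair as the last alternative so that all three kinds of boundary are covered. Splitting on empty matches requires Python 3.7 or later. The names TOKEN_BOUNDARY and tokenize are made up.

```python
import re

# Assumed pattern with four alternatives (the first character is a space):
#   a literal space
#   the void between a non-word and a word character
#   the void between two non-word characters
#   the void between a word and a non-word character
TOKEN_BOUNDARY = re.compile(r" |(?<=\W)(?=\w)|(?<=\W)(?=\W)|(?<=\w)(?=\W)")

def tokenize(text: str) -> list:
    """Split text into tokens, discarding empty entries left by the split."""
    parts = TOKEN_BOUNDARY.split(text)   # may contain empty strings
    return [p for p in parts if p]       # drop the empty entries

print(tokenize("Hi, you!"))  # ['Hi', ',', 'you', '!']
```

The filtering step is the "check for null values" mentioned above: the raw split result can contain empty strings where a consumed space is immediately followed by a zero-width boundary.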
Tokenizer on Steroids
It's highly unlikely you'd want to do something like the following, but I'm including it to demonstrate how you can nest lookaround expressions.
The requirement for the above tokenizer changed during the course of the question. The new requirement was for dates to be treated as single tokens rather than being split at the separators.
Note: I have split the pattern into two lines to prevent line breaks in awkward places when you are viewing this page. Take note that this is one pattern and should be treated as one string if you experiment with it.
Why This Works
The basic parts from Tokenizer are still employed here, but I added a few lookarounds to check whether, during the course of matching, the engine was currently looking at part of a date. The new lookarounds all function pretty much the same way: a negative lookahead is used to not match a condition, and within the negative lookahead, I use a combination of positive lookbehinds and positive lookaheads to see what is before and after the current position within the engine--if as a whole the engine finds what comprises a valid date structure, then the negative lookahead fails, and a split does not occur. If instead the engine does not find a valid date structure, the negative lookahead succeeds and I split at the current position.
The same caveat in Tokenizer still applies here. As you can see, this one is parentheses-laden. It's easy to leave a parenthesis out when constructing patterns such as this. I advise having a text editor which provides bracket matching so you don't lose track of your parentheses!
Recall from the Lookaround by Example section that I said you can think of lookaround as an extra, temporary pointer that moves around in your target string independent of, but relative to, the match pointer. For each level of nesting you embed in your lookarounds, add an additional pointer, where subsequent levels are relative to their parent lookaround's current position pointer.
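To make the nesting idea concrete, here is a deliberately tiny Python illustration. It is not the tokenizer pattern discussed above, only a simplified stand-in: split on a slash unless the slash sits between two 2-digit groups, which serves as a crude proxy for "part of a date". The name SLASH_NOT_IN_DATE is made up.

```python
import re

# The outer negative lookahead wraps a positive lookbehind and a positive
# lookahead. Both inner checks are evaluated at the position just after "/":
#   (?<=\d{2}/)  two digits and the slash lie behind us
#   (?=\d{2})    two digits lie ahead of us
# If both succeed, the slash looks date-like, the negative lookahead fails,
# and no split happens there.
SLASH_NOT_IN_DATE = re.compile(r"/(?!(?<=\d{2}/)(?=\d{2}))")

print(SLASH_NOT_IN_DATE.split("a/b 12/25 c/d"))  # ['a', 'b 12/25 c', 'd']
```

The slash in 12/25 survives because the inner lookarounds both succeed at that position, which makes the enclosing negative lookahead fail.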
As you should have seen by now, lookaround extends the base functionality of regular expressions. While not all engines provide an implementation of lookaround, those that do allow you to perform some interesting matching, replacing, and splitting operations in a compact unit.
Be sure to confirm which types of lookaround your language and its engine support, including whether both positive and negative lookarounds are available. Just as with base regular expressions, you should always know what your inputs will be and test your pattern with a variety of inputs. Lookarounds can accept the same regular expressions you would normally use, so the same rules apply inside the lookaround which exist outside of the lookaround. Also, be sure to be attentive in writing your patterns--it is very easy to get lost in a sea of parentheses!
You've made it this far. Congratulations! I hope I haven't embedded Matrix-esque visions of regular expression symbols into your subconscious. For most everyday matching needs, you will find satisfaction with the base functionality provided by regular expressions. With the examples you have seen above, you should be able to get a sense of when using lookaround might be beneficial--either out of necessity or out of a preference to save keystrokes. It was my intent to make you more comfortable with regular expression lookaround. If I failed, don't stress; just stay positive and take a lookaround.