asked on

Greedy Quantifier * and + in Java

I am learning how the greedy quantifiers work, but I am not getting the result I am expecting from the code below, with this argument given (Notice I am using *, dot and + quantifiers):

Possible Arguments:
C.*L "CooLooLCuuLooC"
C.+L "CooLooLCuuLooC"

I like to know why the result comes as 0 "CooLooLCuuL" in both of those arguments instead of:
0 CooL 6 LC 10 CooL

Code:

import java.util.regex.*;
public class TestClass
{
	public static void main(String[] args)
	{
		Pattern p = Pattern.compile(args[0]);
		Matcher m = p.matcher(args[1]);
		boolean b = false;
		while(b = m.find())
		{
		    System.out.print(m.start()+" \""+m.group()+"\" ");
		}
	}
}

Open in new window

I will appreciate your response.

Mick Barry

. matches any character
and * says match many of them

so it matches the whole string

SOLUTION

Mick Barry

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

for_yan

 Pattern p1 = Pattern.compile("C.*?L");
            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

Open in new window

0 "CooL" 7 "CuuL" 0

Open in new window

for_yan

If you add "?" in both cases it will match "CooL" and "CooL"
Whay would it match "LC" ?

Pattern p1 = Pattern.compile("C.+?L");
            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

        System.out.println("");

Open in new window

output:

0 "CooL" 7 "CuuL"

Open in new window

ASKER CERTIFIED SOLUTION

for_yan

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Mick Barry

for_yan,

you are just repeating what was already stated. That is not permitted at EE.

for_yan

There was no repetotion.

objects,
please, stop inappropriate remarks.

for_yan

This is even better explanation from
http://www.regular-expressions.info/repeat.html

and one more way to make it lazy (not with the help of question mark, as I showed above) from your code using negated character class (see below) which I ran and again and got the same result:

0 "CooL" 7 "CuuL"

Please, explain if you believe it should return "LC": I don't see the reason for it, but maybe I'm missing something

--------------------------------------
Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself.

This is explanation why <.+> in a string This is a first test matches first and not :
Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that has been successfully matched. That's more like it.

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.
-----------------------------------

   //  Pattern p1 = Pattern.compile("C.+?L");
          Pattern p1 = Pattern.compile("C[^L]+L");

            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

        System.out.println("");

Open in new window

SOLUTION

CEHJ

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

techbro

ASKER

Thanks objects, for_yan and CEHJ!

It was my mistake assuming the output "0 CooL 6 LC 10 CooL". Since it is greedy quantifier, it is looking for the longest possible match. So the output will be whole string.