Greedy Quantifier * and + in Java

techbro
techbro used Ask the Experts™
on
I am learning how the greedy quantifiers work, but I am not getting the result I am expecting from the code below, with this argument given (Notice I am using *, dot and + quantifiers):

Possible Arguments:
C.*L  "CooLooLCuuLooC"
C.+L  "CooLooLCuuLooC"

I like to know why the result comes as 0 "CooLooLCuuL" in both of those arguments instead of:
0 CooL 6 LC 10  CooL

Code:
import java.util.regex.*;
public class TestClass
{
	public static void main(String[] args)
	{
		Pattern p = Pattern.compile(args[0]);
		Matcher m = p.matcher(args[1]);
		boolean b = false;
		while(b = m.find())
		{
		    System.out.print(m.start()+" \""+m.group()+"\" ");
		}
	}
}

Open in new window


I will appreciate your response.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Mick BarryJava Developer
Top Expert 2010

Commented:
. matches any character
and * says match many of them

so it matches the whole string
Mick BarryJava Developer
Top Expert 2010
Commented:
so the . ends up consuming the L's (remember its greedy) , except the last one which is needed by the L in your pattern

you want to use a reluctant quantifies to stop it matching the L's
Awarded 2011
Awarded 2011

Commented:
 Pattern p1 = Pattern.compile("C.*?L");
            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

Open in new window

0 "CooL" 7 "CuuL" 0

Open in new window

C++ 11 Fundamentals

This course will introduce you to C++ 11 and teach you about syntax fundamentals.

Awarded 2011
Awarded 2011

Commented:
If you add "?" in both cases it will match "CooL" and "CooL"
Whay would it match "LC" ?

Pattern p1 = Pattern.compile("C.+?L");
            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

        System.out.println("");

Open in new window



output:
0 "CooL" 7 "CuuL" 

Open in new window

Awarded 2011
Awarded 2011
Commented:

This is the simple explanation:

By default, pattern matching is greedy, which means that the matcher returns the longest match possible. For example, applying the pattern A.*c to AbcAbcA matches AbcAbc rather than the shorter Abc. To do nongreedy matching, a question mark must be added to the quantifier. For example, the pattern A.*?c will find the shortest match possible.

in here together with some code example:
http://www.exampledepot.com/egs/java.util.regex/Greedy.html
Mick BarryJava Developer
Top Expert 2010

Commented:
for_yan,

you are just repeating what was already stated. That is not permitted at EE.
Awarded 2011
Awarded 2011

Commented:
There was no repetotion.

objects,
please, stop inappropriate remarks.
Awarded 2011
Awarded 2011

Commented:

This is even better explanation from
http://www.regular-expressions.info/repeat.html

and one more way to make it lazy (not with the help of question mark, as I showed above)  from your code using negated character class (see below) which I ran and again and got the same result:


0 "CooL" 7 "CuuL"

Please, explain if you believe it should return "LC": I don't see the reason for it, but maybe I'm missing something

--------------------------------------
Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself.

This is explanation why <.+> in a string     This is a <EM>first</EM> test   matches   <EM>first</EM> and not  <EM>:
Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that <EM> has been successfully matched. That's more like it.

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.
-----------------------------------

   //  Pattern p1 = Pattern.compile("C.+?L");
          Pattern p1 = Pattern.compile("C[^L]+L");

            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

        System.out.println("");

Open in new window

Top Expert 2016
Commented:
>>
Possible Arguments:
C.*L  "CooLooLCuuLooC"
C.+L  "CooLooLCuuLooC"
>>

>>
I like to know why the result comes as 0 "CooLooLCuuL" in both of those arguments instead of:
0 CooL 6 LC 10  CooL
>>

Both patterns are effectively identical given the input. The only difference is that the former would also match NO intervening characters between 'C' and 'L', so would also match "CL"

Author

Commented:
Thanks objects, for_yan and CEHJ!

It was my mistake assuming the output "0 CooL 6 LC 10  CooL". Since it is greedy quantifier, it is looking for the longest possible match. So the output will be whole string.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial