Avatar of techbro
techbro
Flag for United States of America asked on

Greedy Quantifier * and + in Java

I am learning how the greedy quantifiers work, but I am not getting the result I am expecting from the code below, with this argument given (Notice I am using *, dot and + quantifiers):

Possible Arguments:
C.*L  "CooLooLCuuLooC"
C.+L  "CooLooLCuuLooC"

I like to know why the result comes as 0 "CooLooLCuuL" in both of those arguments instead of:
0 CooL 6 LC 10  CooL

Code:
import java.util.regex.*;
public class TestClass
{
	public static void main(String[] args)
	{
		Pattern p = Pattern.compile(args[0]);
		Matcher m = p.matcher(args[1]);
		boolean b = false;
		while(b = m.find())
		{
		    System.out.print(m.start()+" \""+m.group()+"\" ");
		}
	}
}

Open in new window


I will appreciate your response.
Java

Avatar of undefined
Last Comment
techbro

8/22/2022 - Mon
Mick Barry

. matches any character
and * says match many of them

so it matches the whole string
SOLUTION
Mick Barry

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
for_yan

 Pattern p1 = Pattern.compile("C.*?L");
            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

Open in new window

0 "CooL" 7 "CuuL" 0

Open in new window

for_yan

If you add "?" in both cases it will match "CooL" and "CooL"
Whay would it match "LC" ?

Pattern p1 = Pattern.compile("C.+?L");
            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

        System.out.println("");

Open in new window



output:
0 "CooL" 7 "CuuL" 

Open in new window

Your help has saved me hundreds of hours of internet surfing.
fblack61
ASKER CERTIFIED SOLUTION
for_yan

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
Mick Barry

for_yan,

you are just repeating what was already stated. That is not permitted at EE.
for_yan

There was no repetotion.

objects,
please, stop inappropriate remarks.
for_yan


This is even better explanation from
http://www.regular-expressions.info/repeat.html

and one more way to make it lazy (not with the help of question mark, as I showed above)  from your code using negated character class (see below) which I ran and again and got the same result:


0 "CooL" 7 "CuuL"

Please, explain if you believe it should return "LC": I don't see the reason for it, but maybe I'm missing something

--------------------------------------
Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself.

This is explanation why <.+> in a string     This is a <EM>first</EM> test   matches   <EM>first</EM> and not  <EM>:
Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that <EM> has been successfully matched. That's more like it.

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.
-----------------------------------

   //  Pattern p1 = Pattern.compile("C.+?L");
          Pattern p1 = Pattern.compile("C[^L]+L");

            Matcher m1 = p1.matcher("CooLooLCuuLooC");
            boolean b1 = false;
            while(b1 = m1.find())
            {
                System.out.print(m1.start()+" \""+m1.group()+"\" ");
            }

        System.out.println("");

Open in new window

⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
SOLUTION
CEHJ

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
techbro

ASKER
Thanks objects, for_yan and CEHJ!

It was my mistake assuming the output "0 CooL 6 LC 10  CooL". Since it is greedy quantifier, it is looking for the longest possible match. So the output will be whole string.