I am learning how the greedy quantifiers work, but I am not getting the result I am expecting from the code below, with this argument given (Notice I am using *, dot and + quantifiers):
Possible Arguments: C.*L "CooLooLCuuLooC" C.+L "CooLooLCuuLooC"
I like to know why the result comes as 0 "CooLooLCuuL" in both of those arguments instead of: 0 CooL 6 LC 10 CooL
Code:
import java.util.regex.*;public class TestClass{ public static void main(String[] args) { Pattern p = Pattern.compile(args[0]); Matcher m = p.matcher(args[1]); boolean b = false; while(b = m.find()) { System.out.print(m.start()+" \""+m.group()+"\" "); } }}
and one more way to make it lazy (not with the help of question mark, as I showed above) from your code using negated character class (see below) which I ran and again and got the same result:
0 "CooL" 7 "CuuL"
Please, explain if you believe it should return "LC": I don't see the reason for it, but maybe I'm missing something
--------------------------------------
Laziness Instead of Greediness
The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself.
This is explanation why <.+> in a string This is a <EM>first</EM> test matches <EM>first</EM> and not <EM>:
Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with E. The requirement has been met, and the engine continues with > and M. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to EM, and the engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The engine reports that <EM> has been successfully matched. That's more like it.
An Alternative to Laziness
In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.
-----------------------------------
It was my mistake assuming the output "0 CooL 6 LC 10 CooL". Since it is greedy quantifier, it is looking for the longest possible match. So the output will be whole string.
and * says match many of them
so it matches the whole string