bhomass


huge Regex pattern

My regex skills are so-so. I now have to do a very large project with regex, where the pattern text is very, very long.

Suppose the job is to detect positive phrases such as "appropriate", "good buy", value & <positive>, etc.

Do I concatenate the individual match groups (appropriate)+(good buy)+(value <positive>)+...?

Here <positive> itself is one of many possible matches:
graet,grat,grateful,gratefull,gratefully,gratifying,grea,greaat,great........
The whole list is a few KB.

How would I go about expressing this whole pattern text and compiling it with the Java Pattern class?
for_yan

If you just need to detect occurrences of these words and there are many of them, don't use regex;
just use indexOf() or StringUtils.contains()

http://commons.apache.org/lang/api-2.3/org/apache/commons/lang/StringUtils.html#contains%28java.lang.String,%20java.lang.String%29



Check this similar discussion:

https://www.experts-exchange.com/questions/27319546/Regex-to-filter-urls-w-certain-words-inside.html?sfQueryTermInfo=1+10+30+contactu+yan

Actually, the code in the posting there (ID:36578233)
will concatenate the array of your relevant words into a regex pattern
and will use that regex to match.

But with so many words, regex is not the best way to go; use the things I mentioned above.
CEHJ
@for_yan
"If you just need occurrence of these words and there are many of them, don't do regex
just use indexOf() or StringUtils.contains()"
I disagree. The words "appropriate" and "inappropriate" mean two different things, yet both would be found by what you mention in http:#36597285.
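
To make the objection concrete, here is a minimal sketch (not code from the thread; the class and method names are made up for illustration) comparing a plain substring test with a whole-word regex test on the same text:

```java
import java.util.regex.Pattern;

class SubstringVsBoundary {
    // Plain substring test: true whenever the letters occur anywhere in the text.
    static boolean containsSubstring(String text, String word) {
        return text.contains(word);
    }

    // Whole-word test: the word must sit between two word boundaries.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "that remark was inappropriate";
        System.out.println(containsSubstring(text, "appropriate")); // true
        System.out.println(containsWord(text, "appropriate"));      // false
    }
}
```

The substring test reports a hit inside "inappropriate", while the \b version does not, which is the distinction argued over in the next replies.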

It is true that "appropriate" and "inappropriate" are different things, but each solution will have its problems; an
endless regex pattern will have problems of its own.
Actually, http:#36597285 has different solutions there, and some of them, maybe with slight modification, will handle the
distinction between "appropriate" and "inappropriate" but will probably fail on something else.
If you have long real-life search text and many words to search for, then it probably requires a good combination
of different methods and some trial-and-error debugging.
By the way, the solution in
ID:36578233, which I specifically mentioned in my posting above,
will discriminate between "appropriate" and "inappropriate" even without any modifications:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BadWordsDemo {
    public static void main(String[] args) {
        String[] badWords = {"login", "contactus", "appropriate"};

        // Join the words into a single alternation wrapped in word boundaries.
        String patStr = "\\b(?:" + String.join("|", badWords) + ")\\b";
        System.out.println(patStr);

        String[] urls = {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus", "www.blaat.com/contactus/index.html", "appropriate", "inappropriate"};

        // Equivalent hand-written pattern for the first two words:
        // Pattern p11 = Pattern.compile("\\b(?:login|contactus)\\b");
        Pattern p11 = Pattern.compile(patStr);

        for (String url : urls) {
            Matcher mu = p11.matcher(url);
            System.out.println(url + (mu.find() ? " matched" : " not matched"));
        }
    }
}


output:
\b(?:login|contactus|appropriate)\b
www.login.htm matched
www.lloogg.htm not matched
www.hello.htm not matched
www.blaat.com/contactus matched
www.blaat.com/contactus/index.html matched
appropriate matched
inappropriate not matched
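
The same idea can be scaled to the long list from the question. The following is only a sketch under assumptions: the class name WordListPattern and its helper methods are hypothetical, the word list is hard-coded here although in practice it would probably be read from a file, and case-insensitive matching is my choice rather than something the thread specifies. Each word is passed through Pattern.quote so that any regex metacharacters in the list are treated literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordListPattern {
    // Build \b(?:w1|w2|...)\b; Pattern.quote keeps characters such as '.' or '+' literal.
    static Pattern compile(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Collect every whole-word hit in the order it appears in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // These entries are the sample words quoted in the question.
        List<String> positives = Arrays.asList(
                "graet", "grat", "grateful", "gratefull", "gratefully",
                "gratifying", "grea", "greaat", "great");
        Pattern p = compile(positives);
        System.out.println(findAll(p, "Great stuff, I am grateful")); // [Great, grateful]
    }
}
```

The pattern is compiled once and reused for every text, so the cost of building the long alternation is paid only one time.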



It is true, though, that I think we should
not be stuck only with regexes, as sometimes
simpler and more straightforward means
may turn out to be more appropriate (pun unintended).


This thread, even though it says there is no strict limit on pattern length, at least shows that I'm not alone in my intuitive opinion
that enormously long patterns are not something regexes were designed for:
http://stackoverflow.com/questions/6358387/java-does-regex-pattern-matcher-have-a-size-limit
@for_yan
I wasn't disagreeing with your view of "enormously long regex patterns," only the use of indexOf   = )
bhomass

ASKER

In my case I have patterns that do need regex alongside a long list of possible matches.

For example, (value <positive>) looks for instances in the text where the word "value" occurs, followed by one of a long list of keywords that are associated with positive sentiment. This will need to be enhanced further to dictate how far away the positive word is from "value".

I can't abandon regex completely and rely only on StringUtils in this case. Any ideas?
What kind of ideas are you waiting for? I think there are plenty of them already.
bhomass

ASKER

The ideas I have seen basically point to a choice between regex and StringUtils. Am I wrong that none of them points in the direction of combining the two?
So what is your goal? You have a text which has some special words.
You have a list of those words.
Now you want to make a list of the words from your list which are present in the text?

What should be the result of the program?
bhomass

ASKER

It seems you skipped the use case in my latest posting:

I need to find a pattern where the word "value" is closely trailed by one of the keywords from a long list.

So what should be the result of the program?
Do you want to get the list of words from a certain group which follow the word "value"?
bhomass

ASKER

The result in general is just to get a match, isn't it? From there, you can build the list of matched phrases or not; it doesn't matter. In my immediate case, I actually just need to accumulate the count for each match. Why is that important?
It is important: either you want the number of occurrences, or you want the list of those words which match at least once, or you want to replace them in the text.
All these tasks may call for different approaches.
bhomass

ASKER

No, you lost me. If using regex, matcher.matches() is the point from which you can count or collect the matched phrase. Using direct match, String.contains(...) is the point from which to count or collect.

The key is how you find that match, not what you do with it afterwards.
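
One detail worth a small sketch: in java.util.regex, Matcher.matches() succeeds only when the entire input matches the pattern, whereas Matcher.find() steps through successive occurrences, so find() is the call a per-word count is usually built on. The class name KeywordCounter and the three-word pattern below are assumptions made purely for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordCounter {
    // Count how often each keyword occurs, folding case so "Great" and "great" share a bucket.
    static Map<String, Integer> count(Pattern pattern, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\b(?:great|grateful|gratifying)\\b", Pattern.CASE_INSENSITIVE);
        String text = "Great price, great service, very gratifying";
        System.out.println(p.matcher(text).matches()); // false: the whole text is not a single keyword
        System.out.println(count(p, text));            // {gratifying=1, great=2}
    }
}
```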
Best of all would be if you post an example of the list of words which you want to match,
an example of the text, and an example of the output.
No, it is different. Say you need to end up with the list of matches: you can first run with StringUtils to select a maybe longer list and
then weed out the extra elements with regex.
If you want to count only the overall number of matches (one number for all words together), you may choose another strategy.

Are there some words for which you just want to detect their presence, and some which you want to combine with other words, like "value word"? Those are also different requirements.

Once you have specific requirements we may think about a particular strategy.
Otherwise you give the general idea, and the reply is also general: what methods can be used to approach it.


bhomass

ASKER

They were already provided in my very first post. Here they are again, if you like:

(value...<positive>) - ... is meant to represent some distance; I don't quite know how to express it.
<positive> is any one of a very long list - graet,grat,grateful,gratefull,gratefully,gratifying,grea,greaat,great........

The text is any sort of comment buyers may make about a product: "great stuff, ... value is bargain, amazing offer...."

Does this clarify?

Each match against the pattern should be added to the result list, from which I can do counting or extract the matched phrase.
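
For what it's worth, one common way to express "some distance" in a Java regex is a bounded, reluctant gap such as .{0,10}? between the two parts. The sketch below is only an illustration under assumptions: the class name ValueNearPositive, the maxGap parameter, the three sample words and the case-insensitive flag are all mine, and the gap is measured in characters, not words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ValueNearPositive {
    // "value", then at most maxGap arbitrary characters, then one whole word from the list.
    static Pattern compile(List<String> positives, int maxGap) {
        String alternation = positives.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(
                "\\bvalue\\b.{0," + maxGap + "}?\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    // Collect each matched phrase, from "value" through the positive word.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("bargain", "great", "amazing"), 10);
        String text = "great stuff, value is a bargain, amazing offer";
        System.out.println(findAll(p, text)); // [value is a bargain]
    }
}
```

By default '.' does not match line terminators, so a gap never spans a line break unless Pattern.DOTALL is added.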
OK, so I understand you want the list of the fragments of the text which contain the word "value" and a word from the
known list of words.

But should the word "value" always be somewhere before the other word, or can it be either before or after?

If we don't define it, that would be a problem. So if you have

   great .... value ....gratifying

what should we select: "great ... value" or "value ... gratifying"?
This is especially so if we didn't define a requirement for the distance between them, and even with a defined distance
this situation is a problem if we don't define the direction.

Regex may indeed be a rather powerful tool, but we want to define the task clearly and unambiguously.
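
To illustrate the ambiguity, here is a small sketch (class name, word list and gap size are assumptions, not anything specified in the thread) of a pattern that accepts "value" on either side of the positive word. Because each find() resumes where the previous match ended, a "value" consumed by one match cannot be reused by the next, so only one of the two candidate fragments is reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EitherDirection {
    // positiveAlternation is a ready-made alternation such as "great|gratifying".
    static Pattern compile(String positiveAlternation, int maxGap) {
        String positive = "\\b(?:" + positiveAlternation + ")\\b";
        String gap = ".{0," + maxGap + "}?";
        // Either value ... positive, or positive ... value.
        return Pattern.compile(
                "\\bvalue\\b" + gap + positive + "|" + positive + gap + "\\bvalue\\b",
                Pattern.CASE_INSENSITIVE);
    }

    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = compile("great|gratifying", 20);
        // The leftmost match wins; "value gratifying" is not reported separately.
        System.out.println(findAll(p, "great value gratifying")); // [great value]
    }
}
```

Whether that outcome is the desired one is exactly the kind of requirement that has to be decided before the pattern is written.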


bhomass

ASKER

You are dancing around the real issue. Yes, in the end, I may want to have "value" in front or in back, less than 10 chars away or less than 20 chars away. All of that can be easily represented by regex. To expand the desired match range, all I have to do is add a (reasonable) number of additional match groups.

The ONLY technical challenge here is what happens if <positive> is one of a long list (hundreds) of words, and I surely don't want to generate a complete collection of match groups to cover all the possibilities in that list.

Let's for now just say I only want "value" followed by <positive> at some distance. The overall pattern needs a regex, but <positive> needs StringUtils. How to resolve this?
If so, I would first go through the text and find the word "value", and then make a list of substrings which start with the word "value"
and span the number of characters you specify. Then I'll go through that list once again and weed out those
strings in the list which do not contain any of the "additional" words you want.

How big are your texts, and what is the order of magnitude of the number of these texts?
bhomass

ASKER

So, basically your answer is to give up any advantages you can get from regex and go right to string processing.

It may be the only way. Let me think how that will work.
No, I'm not saying that.
I just want first to come up with a strategy and then think about tactics.
bhomass

ASKER

I am not yet concerned with how to strategize the string processing; I would be able to figure that out. I was just checking whether anyone has a way to mix regex with string processing. I believe the answer is no.
I am actually thinking that if we go the way I mentioned above, then
on the first stage it probably makes sense to use this pattern:

"\\bvalue\\b" (we probably also need to make it case-insensitive, as it may be the first word). Then
we'll match it with regex, go through m.find(), and accumulate the substrings between m.start() and m.start() + n,
and in this way collect an ArrayList of these substrings.

On the next stage we perhaps can use the method StringUtils.indexOfAny(each_element_of_our_arraylist, array_of_our_test_words),
and if it is > -1 we count it as a match, otherwise not. This seems the logical thing to do.
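
A rough sketch of this two-stage idea follows. It is an illustration under assumptions, not the poster's actual code: the class and method names are hypothetical, the keywords are assumed to be lower-case, and the second stage uses plain String.contains in place of the Commons Lang call so that the example needs no external library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageMatcher {
    private static final Pattern VALUE =
            Pattern.compile("\\bvalue\\b", Pattern.CASE_INSENSITIVE);

    // Stage 1: for each "value", take the window of up to n characters starting there.
    static List<String> windows(String text, int n) {
        List<String> out = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            int end = Math.min(text.length(), m.start() + n);
            out.add(text.substring(m.start(), end));
        }
        return out;
    }

    // Stage 2: keep only the windows that contain at least one keyword (substring test).
    static List<String> keepWithKeyword(List<String> windows, Collection<String> keywords) {
        List<String> kept = new ArrayList<>();
        for (String window : windows) {
            String lower = window.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (lower.contains(keyword)) {
                    kept.add(window);
                    break; // one hit is enough for this window
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "Value is great. The value of this thing is poor.";
        List<String> kept = keepWithKeyword(windows(text, 20), Arrays.asList("great", "gratifying"));
        System.out.println(kept.size() + " of " + windows(text, 20).size() + " windows kept");
    }
}
```

Because stage 2 is a substring test, it would also accept "greatness", and a fixed-width window can cut a keyword in half at its right edge; both are trade-offs of this approach.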
No, the answer is of course yes; just read what I was writing above in the meantime.

But the right way is first to formulate the goal clearly, then devise the general strategy, and only then think about the tools, not the other way around.
And regex is nothing but one of the tools.

 
bhomass

ASKER

OK, let me think about it more.
ASKER CERTIFIED SOLUTION
for_yan
This solution is only available to members.
bhomass

ASKER

A simple count of positive words will not do. What about "is far from great" or "anything but wonderful"?

Furthermore, the comment is all in one box, but the survey is meant to extract whether a good or bad comment is associated with service, feature, price, etc.

Therefore, the matching algorithm needs to insert additional keyword matches for subcategory clarity. I am sure there are more reasons why, in the end, you can't escape working with detailed keywords, often quite a lot of them.
Well, this is statistics: you'll incorrectly count "anything but wonderful", but you'll not count "I thought it was rubbish, but it turned out to be quite the opposite".
The overall sum will still be a good indication, and if you count the number of positive adjectives and the number of negative adjectives and then
subtract one from the other, I'm sure that over big numbers the average
will reflect the attitude of the customers, and going in the direction of guessing
their language expressions more deeply will hardly pay off.
You'll never get an exact count with whatever patterns you use; human language is too complex, and that's why they have huge teams working on these kinds of things.
The overall picture still seems achievable in your case with some reasonable effort once you use straightforward models.
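
As a rough sketch of the "positive minus negative" count described here (the class name NetSentiment and both short word lists are placeholders invented for illustration, not data from the thread):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NetSentiment {
    // Number of non-overlapping matches of the pattern in the text.
    static int countMatches(Pattern pattern, String text) {
        int n = 0;
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Positive hits minus negative hits; above zero leans positive.
    static int score(Pattern positive, Pattern negative, String text) {
        return countMatches(positive, text) - countMatches(negative, text);
    }

    public static void main(String[] args) {
        Pattern positive = Pattern.compile("\\b(?:great|wonderful|amazing)\\b", Pattern.CASE_INSENSITIVE);
        Pattern negative = Pattern.compile("\\b(?:rubbish|poor|awful)\\b", Pattern.CASE_INSENSITIVE);
        System.out.println(score(positive, negative, "great price, awful service, amazing offer")); // 1
        System.out.println(score(positive, negative, "anything but wonderful")); // 1, the miscount conceded above
    }
}
```

The second call shows the limitation under discussion: a negated phrase still scores as positive, and the argument is only that such errors tend to wash out over a large number of comments.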
bhomass

ASKER

In the end I need to write a custom analyzer and forgo regex. for_yan is right in pointing out the need for a mixed strategy.