Solved

huge Regex pattern

Posted on 2011-09-25
34
232 Views
Last Modified: 2012-06-27
My regex skill is so so. I am now having to do a very large project with it, where the pattern text is very very long.

suppose the job is to detect the positive words such as "appropriate", "good buy", value & <positive>, etc.

do I concatenate the individual match groups (appropriate)+(good buy)+(value <positive>)+...?

here <positive> itself is one of many possible matches:
graet,grat,grateful,gratefull,gratefully,gratifying,grea,greaat,great........
the whole list is a few KB.

how would I go about expressing this whole pattern text and compile in the java Pattern class?
Question by:bhomass
34 Comments
 
LVL 47

Expert Comment

by:for_yan
If you just need to detect occurrences of these words and there are many of them, don't use regex;
just use indexOf() or StringUtils.contains()

http://commons.apache.org/lang/api-2.3/org/apache/commons/lang/StringUtils.html#contains%28java.lang.String,%20java.lang.String%29


 
LVL 47

Expert Comment

by:for_yan

check similar discussion:

http://www.experts-exchange.com/Programming/Languages/Java/Q_27319546.html?sfQueryTermInfo=1+10+30+contactu+yan

Actually, the code in the posting there, ID:36578233,

will concatenate the array of your relevant words into a regex pattern
and will use this regex to match.

But with so many words regex is not the best way to go - use the things I mentioned above
 
LVL 86

Expert Comment

by:CEHJ
 
LVL 74

Expert Comment

by:käµfm³d 👽
@for_yan
If you just need occurrence of these words and there are many of them, don't do regex
just use indexOf() or StringUtils.contains()
I disagree. The words "appropriate" and "inappropriate" mean two different things, yet both would be found by what you mention in http:#36597285 .
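
To make the objection concrete, here is a minimal sketch comparing a plain substring test with a word-boundary regex. The class and method names are made up for illustration and are not from the thread:

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are hypothetical.
class SubstringVsWordBoundary {

    // Plain substring test, the kind of check indexOf()/contains() performs.
    static boolean containsSubstring(String text, String word) {
        return text.indexOf(word) >= 0;
    }

    // Whole-word test: \b requires a word boundary on both sides of the word.
    static boolean containsWholeWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "that was inappropriate";
        System.out.println(containsSubstring(text, "appropriate")); // true
        System.out.println(containsWholeWord(text, "appropriate")); // false
    }
}
```

The substring test reports a hit inside "inappropriate", while the word-boundary version does not.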
 
LVL 47

Expert Comment

by:for_yan

It is true that "appropriate" and "inappropriate" are different things, but each solution will have its problems - an
endless regex pattern will have problems of its own.
Actually http:#36597285 has different solutions there and some of them - maybe with slight modification - will handle
the distinction between "appropriate" and "inappropriate" but will probably fail on something else.
If you have a long real-life search text and many words to search for, then it probably requires a good combination
of different methods and some trial-and-error debugging
 
LVL 47

Expert Comment

by:for_yan
By the way, the solution in  
ID:36578233 which I specifically mentioned in my posting above
even without any modifications will discriminate between "appropriate" and "inappropriate"

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordListMatch {
    public static void main(String[] args) {
        String[] badWords = {"login", "contactus", "appropriate"};

        // Build one alternation anchored on word boundaries: \b(?:w1|w2|...)\b
        StringBuilder patStr = new StringBuilder("\\b(?:");
        for (String word : badWords) {
            patStr.append(word).append('|');
        }
        patStr.setLength(patStr.length() - 1); // drop the trailing '|'
        patStr.append(")\\b");
        System.out.println(patStr);

        String[] urls = {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus", "www.blaat.com/contactus/index.html", "appropriate", "inappropriate"};

        // Same as writing the pattern by hand: Pattern.compile("\\b(?:login|contactus|appropriate)\\b")
        Pattern p11 = Pattern.compile(patStr.toString());

        for (String url : urls) {
            Matcher mu = p11.matcher(url);
            System.out.println(url + (mu.find() ? " matched" : " not matched"));
        }
    }
}


output:
\b(?:login|contactus|appropriate)\b
www.login.htm matched
www.lloogg.htm not matched
www.hello.htm not matched
www.blaat.com/contactus matched
www.blaat.com/contactus/index.html matched
appropriate matched
inappropriate not matched



It is true, though, that I think we should not be stuck only with regexes, as sometimes
simpler and more straightforward means
may turn out to be more appropriate (pun unintended)
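
For a word list of a few KB, the same idea can be wrapped in a small helper. The following is only a sketch under some assumptions: the class and method names are hypothetical, the list is non-empty, and Pattern.quote() is added here (it is not in the code above) so that words containing regex metacharacters are matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original posting.
class WordListPattern {

    // Builds \b(?:w1|w2|...)\b from a non-empty word list. Each word is quoted
    // so that characters such as "." or "+" in it are treated literally.
    static Pattern compile(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("appropriate", "good buy", "great"));
        System.out.println(p.matcher("A Good Buy overall").find());  // true
        System.out.println(p.matcher("quite inappropriate").find()); // false
    }
}
```

The pattern is compiled once and reused for every text, so the cost of building a long alternation is paid only one time.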

 
LVL 47

Expert Comment

by:for_yan

This thread, even though it says there is no strict limit on pattern length, at least shows that I'm not alone in my intuitive opinion
that enormously long patterns are not something regexes were designed for:
http://stackoverflow.com/questions/6358387/java-does-regex-pattern-matcher-have-a-size-limit
 
LVL 74

Expert Comment

by:käµfm³d 👽
@for_yan
I wasn't disagreeing with your view of "enormously long regex patterns," only the use of indexOf   = )
 

Author Comment

by:bhomass
in my case I have patterns that do need regex along side a long list of possible matches.

for example, (value <positive>) looks for instances in the text where the word "value" occurs, followed by one of long list of keywords which is associated with positive sentiment. This will need to be enhanced further to dictate how far away the positive word is from "value".

I can't abandon regex completely and rely only on StringUtils in this case. Any ideas?
 
LVL 47

Expert Comment

by:for_yan
What kind of ideas are you waiting for? I think there are plenty of them already
 

Author Comment

by:bhomass
the ideas I have seen basically point to a choice between regex or StringUtils. am I wrong that none points in the direction of combining them?
 
LVL 47

Expert Comment

by:for_yan
So what is your goal - you have a text which has some special words.
You have a list of those words.
Now you want to make a list of those words from your list which are present in the text?

What should be the result of the program?
 

Author Comment

by:bhomass
it seems you skip the use case in my latest posting;

I need to find a pattern where the word "value" is closely trailed by one of the keywords from a long list.
 
LVL 47

Expert Comment

by:for_yan

  So what should be the result of the program?
 Do you want to get the list of words out of certain group which follow the word "value" ?
 

Author Comment

by:bhomass
the results in general is just to get a match, isn't it? from there, you can build the list of matched phrases or not. doesn't matter. in my immediate case, I actually just need to accumulate the count for each match. why is that important?
 
LVL 47

Expert Comment

by:for_yan
It is important - either you want the number of occurrences, or you want the list of those words which match at least once, or you want to replace them in the text.
All these tasks may call for different approaches.
 

Author Comment

by:bhomass
no, you lost me. if using regex, matcher.matches() is the point from which you can count or collect the matched phrase. using direct match, String.contains(...) is the point from which to count or collect.

the key is how you find that match, not what you do with it afterwards.

 
LVL 47

Expert Comment

by:for_yan
Best of all would be if you post an example of the list of words which you want to match,
an example of the text, and an example of the output
 
LVL 47

Expert Comment

by:for_yan
No, it is different - say, if you need to end up with the list of matches, you can first run with StringUtils - select a maybe longer list and
then weed out the extra elements with regex.
If you want to count only the overall number of matches - one number for all words together - you may choose another strategy.

Are there some words for which you just want to detect presence, and some which you want to combine with other words, like "value word"? Those are also different requirements.

Once you have specific requirements we may think about a particular strategy.
Otherwise you give the general idea, and the reply is also general - what methods can be used to approach it


 

Author Comment

by:bhomass
they are already provided in my very first post. here it is again, if you like

(value...<positive>) - ... is meant to represent some distance, I don't quite know how to express it.
<positive> is any one of a very long list - graet,grat,grateful,gratefull,gratefully,gratifying,grea,greaat,great........

text is any sort of comment buyers may say about a product: "great stuff, ... value is bargain, amazing offer....

Does this clarify?

each match against the pattern should be added to the result list, from which I can do counting or extract the matched phrase.
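
One way such a "value ... <positive>" pattern could be written is sketched below. This is an illustration under assumptions, not a tested solution for the real data: the class name, the method names and the maxGap parameter are hypothetical, the gap is measured in characters, and only the "value first" direction is covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: "value", then at most maxGap characters, then a positive word.
class ValueThenPositive {

    static Pattern build(List<String> positives, int maxGap) {
        String alternation = positives.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // .{0,maxGap}? is a lazy bounded gap; group 1 captures the positive word.
        String regex = "\\bvalue\\b.{0," + maxGap + "}?\\b(" + alternation + ")\\b";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Collects every matched phrase; the list size gives the count.
    static List<String> findPhrases(Pattern pattern, String text) {
        List<String> phrases = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            phrases.add(m.group());
        }
        return phrases;
    }

    public static void main(String[] args) {
        Pattern p = build(Arrays.asList("bargain", "amazing", "great"), 20);
        System.out.println(findPhrases(p, "great stuff, value is a bargain, amazing offer"));
        // prints [value is a bargain]
    }
}
```

The long keyword list lives in one alternation that is generated from data, so no match groups have to be written by hand for each word.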
 
LVL 47

Expert Comment

by:for_yan
Ok, So I understand  you want the list of the fragments of the text, which will have a word "value" and a word out of the
known list of words.

But should the word value be always somewhere before the other word or it can be before or can be after?

If we don't define it then it would be a problem, so if you have

   great .... value ....gratifying

what should we select "great ... value"  or "value ... gratifying"
especially if we didn't define a requirement for any distance between them; and even with a defined distance
this situation is a problem if we don't define the direction

regex may be indeed rather powerful tool, but we want  to define the task clearly and unambiguously


 

Author Comment

by:bhomass
you are dancing around the real issue. yes, at the end, I may want to have both value in front or in the back, less than 10 char away, or less than 20 chars away. All of which can be easily represented by regex. to expand the desired match range, all I have to do is add (reasonable number of) more match groups.

The ONLY technical challenge here is what happens if <positive> is one of a long list (hundreds) of words, and I surely don't want to generate a complete collection of match groups to cover all the possibilities in that list.

let's for now just say, I only want value followed by <positive> by some distance. the overall pattern needs a regex, but <positive> needs StringUtils. How to resolve this?
 
LVL 47

Expert Comment

by:for_yan
If so, I would first go through the text and find the word "value" and then make a list of substrings which start with the word "value"
and span the number of characters you specify. Then I'll go through that list once again and weed out those
strings in the list which do not contain any of the "additional" words you want.

How big are your texts, and what is the order of magnitude of the number of these texts?
 

Author Comment

by:bhomass
so, basically your answer is give up any advantages you can get from regex and go right to string processing.

it may be the only way. Let me think how that will work.
 
LVL 47

Expert Comment

by:for_yan
No, I'm not saying it.
I just want first to come up with a strategy and then think about tactics
 

Author Comment

by:bhomass
I am not into how to strategize the string processing just yet. I would be able to figure that out. I was just checking if anyone has a way to mix regex with String processing. I believe the answer is no.
 
LVL 47

Expert Comment

by:for_yan
I am actually thinking that if we go the way I mentioned above, then
at the first stage it probably makes sense to use this pattern:

"\\bvalue\\b" (we probably also need to make it case-insensitive, as it may be the first word). Then
we'll match it with regex, go through m.find() and accumulate the substrings between m.start() and m.start() + n,
and in this way collect the ArrayList of these substrings.

At the next stage we can perhaps use the method StringUtils.indexOfAny(each_element_of_our_arraylist, array_of_our_test_words),
and if it is > -1 we count it as a match, otherwise not - this seems the logical thing to do
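
A rough sketch of these two stages follows. To keep it self-contained it uses a plain indexOf() loop as a stand-in for StringUtils.indexOfAny(); the class name, method names, and the assumption that the test words are given in lower case are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the two-stage idea: regex finds "value", string search checks the window.
class TwoStageMatch {

    private static final Pattern VALUE = Pattern.compile("\\bvalue\\b", Pattern.CASE_INSENSITIVE);

    // Stage 1: collect substrings of up to n characters starting at each "value".
    static List<String> windowsAfterValue(String text, int n) {
        List<String> windows = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            int end = Math.min(text.length(), m.start() + n);
            windows.add(text.substring(m.start(), end));
        }
        return windows;
    }

    // Stage 2: stand-in for StringUtils.indexOfAny(window, words) > -1.
    // Assumes the words are lower case. This is a substring test, so it
    // would also accept "inappropriate" for "appropriate".
    static boolean containsAny(String window, String[] words) {
        String lower = window.toLowerCase(Locale.ROOT);
        for (String word : words) {
            if (lower.indexOf(word) >= 0) {
                return true;
            }
        }
        return false;
    }

    static int countMatches(String text, int n, String[] words) {
        int count = 0;
        for (String window : windowsAfterValue(text, n)) {
            if (containsAny(window, words)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Great stuff. Value is a bargain, but the value of support is poor.";
        System.out.println(countMatches(text, 25, new String[] {"bargain", "great"})); // 1
    }
}
```

Only the first "value" window contains one of the test words, so the count is 1.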
 
LVL 47

Expert Comment

by:for_yan
No, the answer is of course yes - just read what I was writing in the meantime above.

but the right way is first to formulate the goal clearly, then devise the general strategy, and only then think about the tools, not the other way around.
And regex is nothing but just one of the tools

 
 

Author Comment

by:bhomass
ok, let me think about it more.
 
LVL 47

Accepted Solution

by:
for_yan earned 400 total points
Frankly, if the point of those who pose this task is to get some assessment of how happy customers
are with some product, I'd rather simplify this thing and would not tie it to such words as "value", the distance between
the words, or any complex criteria based on
phraseology involving more than a single word, which is much more difficult to assess.
I'd rather make a list of positive adjectives (great, good, positive, wonderful, etc. - better to get them from some real texts),
all single words, and calculate the total frequency of all of those.
That would be more clearly defined. Of course any such assessment will have its limitations,
but the more clearly we define the task, the more comparable the results will be - and comparability is probably
what you want most from this kind of statistics. That is of course my opinion, and
I guess it is not you who makes these decisions. But some of the people who don't need to concentrate on details
have never thought about these kinds of things; to most people
they don't come to mind, as they never face the necessity to formulate a task very clearly, not to a human being
but to an inanimate creature like a computer.
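
As a rough illustration of this single-word counting, here is a sketch that tokenizes the text and looks each token up in a set, so the size of the word list hardly matters. The class name, the sample words and the simple "[^a-z]+" tokenizer are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: per-word frequency of positive single words.
class PositiveWordCounter {

    static Map<String, Integer> count(String text, Set<String> positives) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter a-z; tokens are whole words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (positives.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> positives = new HashSet<>(Arrays.asList("great", "wonderful", "appropriate"));
        String text = "Great stuff, great price. Not inappropriate, just wonderful!";
        System.out.println(count(text, positives)); // {great=2, wonderful=1}
    }
}
```

Because whole tokens are compared, "inappropriate" is not counted as "appropriate".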

 

Author Comment

by:bhomass
a simple count of positive words will not do. what about "is far from great", "anything but wonderful".

furthermore, the comment is all in one box, but the survey is meant to extract whether a good or bad comment is associated with service, feature, price, etc.

therefore, the matching algorithm needs to insert additional keyword match for subcategory clarity. I am sure there are more reasons why at the end, you can't escape working with detailed keywords, often quite a lot of them.
 
LVL 47

Expert Comment

by:for_yan
Well, this is statistics - you'll incorrectly count "anything but wonderful", but you'll not count "I thought it was rubbish, but it turned out to be quite the opposite" -
the overall sum will still be a good indication, and if you count the number of positive adjectives and the number of negative adjectives and then
subtract one from the other, I'm sure that over big numbers the average
will reflect the attitude of the customers, and going in the direction of guessing
their language expressions more deeply will hardly pay off.
You'll never get to an exact count with whatever patterns you use - human language is too complex - that's why they have huge teams working on these kinds of things.
The overall picture still seems to be achievable in your case with some reasonable effort once you use straightforward models.
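
The "positive minus negative" score could look roughly like this. It is a sketch only: the class name and the two small word sets are made up, and the second call shows the kind of miscount described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: net score = positive word count minus negative word count.
class NetSentiment {

    static int score(String text, Set<String> positive, Set<String> negative) {
        int score = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (positive.contains(token)) {
                score++;
            } else if (negative.contains(token)) {
                score--;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Set<String> positive = new HashSet<>(Arrays.asList("great", "wonderful"));
        Set<String> negative = new HashSet<>(Arrays.asList("awful", "rubbish"));
        System.out.println(score("great service, wonderful price, awful packaging", positive, negative)); // 1
        System.out.println(score("anything but wonderful", positive, negative)); // 1, a known miscount
    }
}
```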
 
LVL 9

Expert Comment

by:user_n
 

Author Closing Comment

by:bhomass
at the end I need to write a custom analyzer and forgo regex. for_yan is right in pointing out the need for a mixed strategy.