Solved

huge Regex pattern

Posted on 2011-09-25
Last Modified: 2012-06-27
My regex skill is so-so. I now have to do a very large project with it, where the pattern text is very long.

suppose the job is to detect the positive words such as "appropriate", "good buy", value & <positive>, etc.

do I concatenate the individual match groups (appropriate)+(good buy)+(value <positive>)+...?

here <positive> itself is one of many possible matches:
graet,grat,grateful,gratefull,gratefully,gratifying,grea,greaat,great........
the whole list is a few KB.

how would I go about expressing this whole pattern text and compiling it with the Java Pattern class?
Question by:bhomass
 
LVL 47

Expert Comment

by:for_yan
ID: 36597285
If you just need to detect occurrences of these words and there are many of them, don't use regex;
just use indexOf() or StringUtils.contains()

http://commons.apache.org/lang/api-2.3/org/apache/commons/lang/StringUtils.html#contains%28java.lang.String,%20java.lang.String%29


0
 
LVL 47

Expert Comment

by:for_yan
ID: 36597308

check similar discussion:

http://www.experts-exchange.com/Programming/Languages/Java/Q_27319546.html?sfQueryTermInfo=1+10+30+contactu+yan

Actually, the code in the posting there, ID:36578233,

will concatenate the array of your relevant words into a regex pattern
and use that regex to match.

But with so many words, regex is not the best way to go - use the things I mentioned above.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 36597924
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36599881
@for_yan
"If you just need occurrence of these words and there are many of them, don't do regex;
just use indexOf() or StringUtils.contains()"
I disagree. The words "appropriate" and "inappropriate" mean two different things, yet both would be found by what you mention in http:#36597285.
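To make the difference concrete, here is a minimal sketch using only the JDK. The class and method names are made up for this illustration; they are not from any posting in this thread.

```java
import java.util.regex.Pattern;

final class SubstringVsWordMatch {
    // Plain substring test: true whenever the characters occur anywhere in the text.
    static boolean containsSubstring(String text, String word) {
        return text.contains(word);
    }

    // Whole-word test: the word must have a word boundary on both sides.
    // Pattern.quote keeps any regex metacharacters in the word literal.
    static boolean containsWholeWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "That remark was inappropriate.";
        System.out.println(containsSubstring(text, "appropriate")); // true
        System.out.println(containsWholeWord(text, "appropriate")); // false
    }
}
```

The substring test reports a hit inside "inappropriate", while the word-boundary test does not.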
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36600195

It is true that "appropriate" and "inappropriate" are different things, but each solution will have its problems - an
endless regex pattern will have problems of its own.
Actually, http:#36597285 has different solutions there, and some of them - maybe with slight modification - will handle
the distinction between "appropriate" and "inappropriate" but will probably fail on something else.
If you have a long real-life search text and many words to search for, then it probably requires a good combination
of different methods and some trial-and-error debugging.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36600344
By the way, the solution in  
ID:36578233 which I specifically mentioned in my posting above
even without any modifications will discriminate between "appropriate" and "inappropriate"

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class WordMatchDemo {
            public static void main(String[] args) {
                String[] badWords = {"login", "contactus", "appropriate"};
                // Join the words into one alternation, anchored by word boundaries on both sides.
                // Assumes the words contain no regex metacharacters.
                String patStr = "\\b(?:" + String.join("|", badWords) + ")\\b";
                System.out.println(patStr);

                String[] urls = {"www.login.htm", "www.lloogg.htm", "www.hello.htm", "www.blaat.com/contactus", "www.blaat.com/contactus/index.html", "appropriate", "inappropriate"};

                // Hand-written equivalent for the first two words: \b(?:login|contactus)\b
                Pattern p11 = Pattern.compile(patStr);

                for (String url : urls) {
                    Matcher mu = p11.matcher(url);
                    System.out.println(url + (mu.find() ? " matched" : " not matched"));
                }
            }
        }


output:
\b(?:login|contactus|appropriate)\b
www.login.htm matched
www.lloogg.htm not matched
www.hello.htm not matched
www.blaat.com/contactus matched
www.blaat.com/contactus/index.html matched
appropriate matched
inappropriate not matched



It is true, though, that I think we should not be stuck with only regexes, as sometimes
simpler and more straightforward means
may turn out to be more appropriate (pun unintended).

0
 
LVL 47

Expert Comment

by:for_yan
ID: 36600371

This thread, even though it says there is no strict limit on pattern length, at least shows that I'm not alone in my intuitive opinion
that enormously long regex patterns are not something they were designed for:
http://stackoverflow.com/questions/6358387/java-does-regex-pattern-matcher-have-a-size-limit
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36600897
@for_yan
I wasn't disagreeing with your view of "enormously long regex patterns," only the use of indexOf   = )
0
 

Author Comment

by:bhomass
ID: 36707512
in my case I have patterns that do need regex alongside a long list of possible matches.

for example, (value <positive>) looks for instances in the text where the word "value" occurs, followed by one of a long list of keywords associated with positive sentiment. This will need to be enhanced further to dictate how far away the positive word is from "value".

I can't abandon regex completely and rely only on StringUtils in this case. Any ideas?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707517
What kind of ideas are you waiting for? I think there are plenty of them already.
0
 

Author Comment

by:bhomass
ID: 36707547
the ideas I have seen basically point to a choice between regex or StringUtils. am I wrong that none points in the direction of combining them?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707564
So what is your goal - you have a text which has some special words.
You have a list of those words.
Now you want to make a list of those words from your list which are present in the text?

What should be the result of the program?
0
 

Author Comment

by:bhomass
ID: 36707571
it seems you skip the use case in my latest posting;

I need to find a pattern where the word "value" is closely trailed by one of the keywords from a long list.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707581

  So what should be the result of the program?
 Do you want to get the list of words out of certain group which follow the word "value" ?
0
 

Author Comment

by:bhomass
ID: 36707589
the result in general is just to get a match, isn't it? from there, you can build the list of matched phrases or not. doesn't matter. in my immediate case, I actually just need to accumulate the count for each match. why is that important?
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707594
It is important - either you want the number of occurrences, or you want the list of those words which match at least once, or you want to replace them in the text.
Surely all these tasks may call for different approaches.
0
 

Author Comment

by:bhomass
ID: 36707601
no, you lost me. if using regex, matcher.matches() is the point from which you can count or collect the matched phrase. using direct match, String.contains(...) is the point from which to count or collect.

the key is how you find that match, not what you do with it afterwards.
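For the counting case, a rough sketch of accumulating a count per matched word could look like the following. Note that it loops on Matcher.find(), since matches() only succeeds when the entire input matches the pattern. The class name, method name, and sample words are assumptions made for this illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchCounter {
    // Scans the text with find() and tallies each matched string (lower-cased).
    static Map<String, Integer> countMatches(Pattern pattern, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical two-word list; a real list would be much longer.
        Pattern p = Pattern.compile("\\b(?:great|good)\\b", Pattern.CASE_INSENSITIVE);
        System.out.println(countMatches(p, "Great price, great service, good value"));
        // prints {good=1, great=2}
    }
}
```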
0

 
LVL 47

Expert Comment

by:for_yan
ID: 36707602
Best of all if you post an example of the list of words which you want to match,
and example of the text ,and an example of the output
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707616
No, it is different - say, if you need to end up with the list of matches, you can first run with StringUtils to select a possibly longer list and
then weed out the extra elements with regex.
If you want to count only the overall number of matches - one number for all words together - you may choose another strategy.

Are there some words for which you just want to detect presence, and some which you want to combine with other words, like "value word"? Those are also different requirements.

Once you have specific requirements, we may think about a particular strategy.
Otherwise you give the general idea, and the reply is also general - what methods can be used to approach it.


0
 

Author Comment

by:bhomass
ID: 36707617
they are already provided in my very first post. here it is again, if you like

(value...<positive>) - ... is meant to represent some distance, I don't quite know how to express it.
<positive> is any one of a very long list - graet,grat,grateful,gratefull,gratefully,gratifying,grea,greaat,great........

text is any sort of comment buyers may say about a product: "great stuff, ... value is bargain, amazing offer...."

Does this clarify?

each match against the pattern should be added to the result list, from which I can do counting or extract the matched phrase.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707644
OK, so I understand you want the list of fragments of the text which contain the word "value" and a word from the
known list of words.

But should the word "value" always be somewhere before the other word, or can it be either before or after?

If we don't define that, it would be a problem. Say you have

   great .... value ....gratifying

what should we select: "great ... value" or "value ... gratifying"?
Especially if we don't define a requirement for the distance between them, and even with a defined distance,
this situation is a problem if we don't define the direction.

Regex may indeed be a rather powerful tool, but we want to define the task clearly and unambiguously.


0
 

Author Comment

by:bhomass
ID: 36707658
you are dancing around the real issue. yes, in the end, I may want to have "value" in front or in back, less than 10 chars away, or less than 20 chars away. All of which can be easily represented by regex. to expand the desired match range, all I have to do is add a (reasonable number of) more match groups.

The ONLY technical challenge here is what happens if <positive> is one of a long list (hundreds) of words, and I surely don't want to generate a complete collection of match groups to cover all the possibilities in that list.

let's for now just say I only want "value" followed by <positive> at some distance. the overall pattern needs a regex, but <positive> needs StringUtils. How to resolve this?
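One possible way to combine the two, sketched under assumptions: keep the keyword list as ordinary data and generate the alternation programmatically, so nobody writes hundreds of match groups by hand. The class name, method names, and the maxGap parameter are hypothetical, and nothing here has been tested against a list of several KB.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class ValueNearPositive {
    // Builds: \bvalue\b, then at most maxGap arbitrary characters, then one listed word.
    static Pattern build(List<String> positiveWords, int maxGap) {
        String alternation = positiveWords.stream()
                .map(Pattern::quote)               // treat every word literally
                .collect(Collectors.joining("|"));
        String regex = "\\bvalue\\b.{0," + maxGap + "}?\\b(?:" + alternation + ")\\b";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Collects every matched phrase, from "value" through the positive word.
    static List<String> findPhrases(Pattern pattern, String text) {
        List<String> phrases = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            phrases.add(m.group());
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Hypothetical short list standing in for the real one.
        Pattern p = build(Arrays.asList("great", "gratifying"), 20);
        System.out.println(findPhrases(p, "great stuff, value is great, amazing offer"));
        // prints [value is great]
    }
}
```

By default `.` does not match line terminators, so the gap cannot span a line break unless Pattern.DOTALL is added.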
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707672
If so, I would first go through the text and find the word "value", and then make a list of substrings which start with the word "value"
and span the number of characters you specify. Then I'd go through that list once again and weed out those
strings which do not contain any of the "additional" words you want.

How big are your texts, and what is the order of magnitude of the number of these texts?
0
 

Author Comment

by:bhomass
ID: 36707683
so, basically your answer is give up any advantages you can get from regex and go right to string processing.

it may be the only way. Let me think how that will work.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707688
No, I'm not saying that.
I just want to first come up with a strategy and then think about tactics.
0
 

Author Comment

by:bhomass
ID: 36707694
I am not into how to strategize the string processing just yet. I would be able to figure that out. I was just checking if anyone has a way to mix regex with String processing. I believe the answer is no.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707696
I am actually thinking that if we go the way I mentioned above, then
in the first stage it probably makes sense to use this pattern:

"\\bvalue\\b" (we probably also need to make it case insensitive, as it may be the first word of a sentence). Then
we match it with regex, go through m.find(), and accumulate the substrings between m.start() and m.start() + n,
and in this way collect an ArrayList of these substrings.

In the next stage we can perhaps use the method StringUtils.indexOfAny(each_element_of_our_arraylist, array_of_our_test_words),
and if it is > -1 we count it as a match, otherwise not - this seems the logical thing to do.
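A rough sketch of those two stages, using only the JDK so it stays self-contained: the second stage swaps StringUtils.indexOfAny for a plain String.contains loop, which is likewise a substring test. The class name, method names, and the lower-case keyword assumption are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStageMatcher {
    private static final Pattern VALUE =
            Pattern.compile("\\bvalue\\b", Pattern.CASE_INSENSITIVE);

    // Stage 1: collect a window of up to n characters starting at each "value".
    static List<String> windowsAfterValue(String text, int n) {
        List<String> windows = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            int end = Math.min(m.start() + n, text.length());
            windows.add(text.substring(m.start(), end));
        }
        return windows;
    }

    // Stage 2: keep only the windows that contain at least one keyword.
    // Keywords are assumed to be lower case; this is a substring test,
    // so "grateful" would also be found inside "ungrateful".
    static List<String> keepWindowsWithKeyword(List<String> windows, List<String> keywords) {
        List<String> kept = new ArrayList<>();
        for (String window : windows) {
            String lower = window.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (lower.contains(keyword)) {
                    kept.add(window);
                    break;
                }
            }
        }
        return kept;
    }
}
```

The size of the list returned by the second stage is then the match count, and its elements are the matched fragments.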
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36707704
No, the answer is of course yes - just read what I was writing in the meantime above.

but the right way is first to formulate the goal clearly, then devise the general strategy, and only then think about the tools, not the other way around.
And regex is nothing but just one of the tools

 
0
 

Author Comment

by:bhomass
ID: 36707720
ok, let me think about it more.
0
 
LVL 47

Accepted Solution

by:
for_yan earned 400 total points
ID: 36707800
Frankly, if the point of those who pose this task is to get some assessment of how happy customers
are with some product, I'd rather simplify this thing and not tie it to words such as "value", the distance between
the words, or any complex criteria based on
phraseology involving more than a single word, which is much more difficult to assess.
I'd rather make a list of positive adjectives (great, good, positive, wonderful, etc. - better to get them from some real texts),
all single words, and calculate the total frequency of all of those.
That would be more clearly defined. Of course any such assessment will have its limitations,
but the more clearly we define the task, the more comparable the results will be, and comparability is probably
what you want most from this kind of statistics. Well, that is of course my opinion, and
I guess it is not you who makes these decisions. But people who don't need to concentrate on details
have often never thought about these kinds of things; to most people
they don't come to mind, as they never face the necessity of formulating a task very clearly, not to a human being
but to an inanimate creature like a computer.

0
 

Author Comment

by:bhomass
ID: 36711538
a simple count of positive words will not do. what about "is far from great", "anything but wonderful".

furthermore, the comment is all in one box, but the survey is meant to extract whether a good or bad comment is associated with service, feature, price, etc.

therefore, the matching algorithm needs to insert additional keyword match for subcategory clarity. I am sure there are more reasons why at the end, you can't escape working with detailed keywords, often quite a lot of them.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 36711740
Well, this is statistics - you'll incorrectly count "anything but wonderful", but you'll not count "I thought it was rubbish, but it turned out to be quite the opposite".
The overall sum will still be a good indication, and if you count the number of positive adjectives and the number of negative adjectives and then
subtract one from the other, I'm sure that over big numbers the average
will reflect the attitude of the customers, and going in the direction of guessing
their language expressions more deeply will hardly pay off.
You'll never get an exact count with whatever patterns you use - human language is too complex - that's why they have huge teams working on these kinds of things.
The overall picture still seems achievable in your case with reasonable effort once you use straightforward models.
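As a rough sketch of that straightforward model: split the comment into words, look each one up in a positive and a negative set, and subtract. The word lists below are tiny hypothetical samples, and the class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class NetSentimentScore {
    // Hypothetical sample lists; real ones would come from actual customer texts.
    static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("great", "good", "wonderful"));
    static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "poor", "rubbish"));

    // Returns (number of positive words) - (number of negative words).
    static int score(String text) {
        int score = 0;
        // Split on runs of non-letters so punctuation does not stick to words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(token)) {
                score++;
            } else if (NEGATIVE.contains(token)) {
                score--;
            }
        }
        return score;
    }
}
```

With these lists, "anything but wonderful" scores +1, which is exactly the kind of miscount discussed above; the bet is that such errors average out over many comments.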
0
 
LVL 9

Expert Comment

by:user_n
ID: 36712408
0
 

Author Closing Comment

by:bhomass
ID: 36897683
at the end I need to write a custom analyzer and forgo regex. for_yan is right in pointing out the need for a mixed strategy.
0
