• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1070
  • Last Modified:

regular expressions optimization

Hi,

Our application (written in native C++) uses regular expressions to process a huge number of strings from a database.
We've found out, if an expression matches the input string, it yields the result fast, but if it does not, it could take tens or even hundreds times more time to return an empty result.
Currently we use the regexp implementation from the MS VS 2008's std::tr1 library.

I suppose, it's a quite common problem, so, what do the experts recommend?
0
zc2
Asked:
zc2
  • 3
  • 3
  • 3
  • +2
4 Solutions
 
käµfm³d 👽Commented:
It depends on how you've written the pattern. If there is a lot of backtracking performed, then it certainly can take a long time to process the input against the pattern.

Perhaps if you post the pattern and the rules it represents, then we can help you refine it.
0
 
d-glitchCommented:
Not enough information.

Presumably, there is some matching string that would not be found for until the last test.
What is that very last test before returning an empty result looking for?

You could try preprocessing your data base to find types of strings, and then only searching for within the proper type or types.  This might let you find the null results faster.
0
 
evilrixSenior Software Engineer (Avast)Commented:
As a general case, a non-matching scenario will always take longer because when a match is found the regex can terminate. When there is no match all permutations of the pattern have to be tried against all of the target string. The more permitations the regex matches the longer this takes.

When a regex engine compiles a regex it generates a tree like data structure with each branch being one possible permitation. If the regex gets to the end of a branch without finding a match it has to "backtrack" (as kaufmed mentioned) to the last branch node and then try the next path.  If you regex has many possibilities to generate branches it will be slower. Without seeing your regex it's hard to make any recommendations.

>> Currently we use the regexp implementation from the MS VS 2008's std::tr1 library.
It's worth looking at PCRE as an alterantive. I did some performance testing with PCRE vs. Boost regex (which the tr1 regex is based on) and found PCRE to be approximately 10 times faster whilst using 10 times less memory. One thing to know with PCRE; however, is that out of the box the size of a regex is limited to 64KB. This can be extended but it requires making a small change to the codebase and recompiling it. This is documented in the man page for PCRE.

Another possibility, depending on how complex your regex is, is to consider using the Boyer Moore algorithm with the Galil Rule.
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
zc2Author Commented:
thank you all for the answers,

the input text contains strings like
"AGULLO,FRANCISCO J M.D.         AGUFR      AHMAD,SAMIR ABED M.D.           AHMSA        AHMED,FAROOQUE M.D.             AHMFA        AKALONU,AMAKA A. MD             AKAAM  "

Open in new window

this is the expression we created:
"\ *([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{13})([ A-Z\,\.]{32})([ A-Z]+)"

Open in new window

if the expression is even more abstract (no possible field lengths specified), it takes a huge amount of time (10sec) even if it's successful  and 76sec if it's not:
\ *([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+)

Open in new window

0
 
käµfm³d 👽Commented:
But what are the rules for the expression? What are you matching/extracting?
0
 
zc2Author Commented:
The input strings contain fields separated by some fixed number of spaces. We need to extract the values of those fields.
What makes things more difficult, that in some strings some of the fields might be missed (replaced with spaces), and we need to be aware of that (get an empty string as the field value in this case).
0
 
evilrixSenior Software Engineer (Avast)Commented:
Given the data has a fixed format, have you considered using a proper parser rather than regexes? Take a look at Boost spirit Qi. It is an EBNF based DSL that uses templates and overloading to allow you to write your parser in C++. It's only slightly more complicated to understand that writing a regex but it is going to be way faster (by an order of magnitude or more).
0
 
evilrixSenior Software Engineer (Avast)Commented:
Also, you would probably find that a DFA regular expression engine will be faster than an NFA engine. PCRE (as I mentioned before) supports both types. DFA engines are less flexible (because they don't support backtracking) but given that your format is pretty well defined you'll probably find that a DFA engine will provide a more performant solution.

The following article explains, in some detail, how these engine types differ and the pros and cons of both.

http://swtch.com/~rsc/regexp/regexp1.html

Generally NFS is far more flexible (and the most common engine in use). DFA is quite inflexible but, generally, much faster.

I'd also strongly recommend reading the following small "story" by Jeffrey Friedl as it will help you to understand how a regex engine works and how to optimize your queries.

http://www.foo.be/docs/tpj/issues/vol1_2/tpj0102-0006.html
0
 
käµfm³d 👽Commented:
I agree with evilrix that a regex probably isn't going to help you here--unless you can fix the number of spaces which delimit fields; then maybe it will work. A parser would work much better.
0
 
GwynforWebCommented:
Sometimes we want to find a single Regexp that will do the whole job for us, it seems elegant and makes us feel clever; I offend a much as anyone else.  Often it is better to break the problem up, perhaps using a sequence of regular expressions and perhaps per-processing the sting before hand.

Looking at your RegExp, it is complex and understandably takes ages if no match is not found. A massive chunk of a huge search tree has to be traversed to give a False. Sometimes if a tool is not working it may not be our lack of skill with it, but we in fact need another tool. I doubt attempts at simple optimization are going to work here.

I can't suggest another approach as I don't have the details of how you are using the expression.
0
 
zc2Author Commented:
Thank you all, very much! That's important to understand that not being able to create a sufficient regular expression based not on a luck of skills, but on limitations of the tool itself.
0

Featured Post

Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

  • 3
  • 3
  • 3
  • +2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now