Link to home
Start Free TrialLog in
Avatar of zc2
zc2Flag for United States of America

asked on

regular expressions optimization

Hi,

Our application (written in native C++) uses regular expressions to process a huge number of strings from a database.
We've found out, if an expression matches the input string, it yields the result fast, but if it does not, it could take tens or even hundreds times more time to return an empty result.
Currently we use the regexp implementation from the MS VS 2008's std::tr1 library.

I suppose, it's a quite common problem, so, what do the experts recommend?
Avatar of kaufmed
kaufmed
Flag of United States of America image

It depends on how you've written the pattern. If there is a lot of backtracking performed, then it certainly can take a long time to process the input against the pattern.

Perhaps if you post the pattern and the rules it represents, then we can help you refine it.
Not enough information.

Presumably, there is some matching string that would not be found for until the last test.
What is that very last test before returning an empty result looking for?

You could try preprocessing your data base to find types of strings, and then only searching for within the proper type or types.  This might let you find the null results faster.
SOLUTION
Avatar of evilrix
evilrix
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of zc2

ASKER

thank you all for the answers,

the input text contains strings like
"AGULLO,FRANCISCO J M.D.         AGUFR      AHMAD,SAMIR ABED M.D.           AHMSA        AHMED,FAROOQUE M.D.             AHMFA        AKALONU,AMAKA A. MD             AKAAM  "

Open in new window

this is the expression we created:
"\ *([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{13})([ A-Z\,\.]{32})([ A-Z]+)"

Open in new window

if the expression is even more abstract (no possible field lengths specified), it takes a huge amount of time (10sec) even if it's successful  and 76sec if it's not:
\ *([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+)

Open in new window

But what are the rules for the expression? What are you matching/extracting?
Avatar of zc2

ASKER

The input strings contain fields separated by some fixed number of spaces. We need to extract the values of those fields.
What makes things more difficult, that in some strings some of the fields might be missed (replaced with spaces), and we need to be aware of that (get an empty string as the field value in this case).
Given the data has a fixed format, have you considered using a proper parser rather than regexes? Take a look at Boost spirit Qi. It is an EBNF based DSL that uses templates and overloading to allow you to write your parser in C++. It's only slightly more complicated to understand that writing a regex but it is going to be way faster (by an order of magnitude or more).
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of zc2

ASKER

Thank you all, very much! That's important to understand that not being able to create a sufficient regular expression based not on a luck of skills, but on limitations of the tool itself.