asked on

regular expressions optimization

Hi,

Our application (written in native C++) uses regular expressions to process a huge number of strings from a database.
We've found out, if an expression matches the input string, it yields the result fast, but if it does not, it could take tens or even hundreds times more time to return an empty result.
Currently we use the regexp implementation from the MS VS 2008's std::tr1 library.

I suppose, it's a quite common problem, so, what do the experts recommend?

kaufmed

It depends on how you've written the pattern. If there is a lot of backtracking performed, then it certainly can take a long time to process the input against the pattern.

Perhaps if you post the pattern and the rules it represents, then we can help you refine it.

d-glitch

Not enough information.

Presumably, there is some matching string that would not be found for until the last test.
What is that very last test before returning an empty result looking for?

You could try preprocessing your data base to find types of strings, and then only searching for within the proper type or types. This might let you find the null results faster.

SOLUTION

evilrix

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

zc2

ASKER

thank you all for the answers,

the input text contains strings like

"AGULLO,FRANCISCO J M.D.         AGUFR      AHMAD,SAMIR ABED M.D.           AHMSA        AHMED,FAROOQUE M.D.             AHMFA        AKALONU,AMAKA A. MD             AKAAM  "

Open in new window

this is the expression we created:

"\ *([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{13})([ A-Z\,\.]{32})([ A-Z]+)"

Open in new window

if the expression is even more abstract (no possible field lengths specified), it takes a huge amount of time (10sec) even if it's successful and 76sec if it's not:

\ *([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+)

Open in new window

kaufmed

But what are the rules for the expression? What are you matching/extracting?

zc2

ASKER

The input strings contain fields separated by some fixed number of spaces. We need to extract the values of those fields.
What makes things more difficult, that in some strings some of the fields might be missed (replaced with spaces), and we need to be aware of that (get an empty string as the field value in this case).

evilrix

Given the data has a fixed format, have you considered using a proper parser rather than regexes? Take a look at Boost spirit Qi. It is an EBNF based DSL that uses templates and overloading to allow you to write your parser in C++. It's only slightly more complicated to understand that writing a regex but it is going to be way faster (by an order of magnitude or more).

ASKER CERTIFIED SOLUTION

evilrix

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

kaufmed

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

GwynforWeb

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

zc2

ASKER

Thank you all, very much! That's important to understand that not being able to create a sufficient regular expression based not on a luck of skills, but on limitations of the tool itself.