zc2
asked on
regular expressions optimization
Hi,
Our application (written in native C++) uses regular expressions to process a huge number of strings from a database.
We've found out, if an expression matches the input string, it yields the result fast, but if it does not, it could take tens or even hundreds times more time to return an empty result.
Currently we use the regexp implementation from the MS VS 2008's std::tr1 library.
I suppose, it's a quite common problem, so, what do the experts recommend?
Our application (written in native C++) uses regular expressions to process a huge number of strings from a database.
We've found out, if an expression matches the input string, it yields the result fast, but if it does not, it could take tens or even hundreds times more time to return an empty result.
Currently we use the regexp implementation from the MS VS 2008's std::tr1 library.
I suppose, it's a quite common problem, so, what do the experts recommend?
Not enough information.
Presumably, there is some matching string that would not be found for until the last test.
What is that very last test before returning an empty result looking for?
You could try preprocessing your data base to find types of strings, and then only searching for within the proper type or types. This might let you find the null results faster.
Presumably, there is some matching string that would not be found for until the last test.
What is that very last test before returning an empty result looking for?
You could try preprocessing your data base to find types of strings, and then only searching for within the proper type or types. This might let you find the null results faster.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
thank you all for the answers,
the input text contains strings like
the input text contains strings like
"AGULLO,FRANCISCO J M.D. AGUFR AHMAD,SAMIR ABED M.D. AHMSA AHMED,FAROOQUE M.D. AHMFA AKALONU,AMAKA A. MD AKAAM "
this is the expression we created:"\ *([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{10,13})([ A-Z\,\.]{30,32})([ A-Z]{13})([ A-Z\,\.]{32})([ A-Z]+)"
if the expression is even more abstract (no possible field lengths specified), it takes a huge amount of time (10sec) even if it's successful and 76sec if it's not:\ *([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+?)\ {2,}([ A-Z\,\.]+?)\ {2,}([ A-Z]+)
But what are the rules for the expression? What are you matching/extracting?
ASKER
The input strings contain fields separated by some fixed number of spaces. We need to extract the values of those fields.
What makes things more difficult, that in some strings some of the fields might be missed (replaced with spaces), and we need to be aware of that (get an empty string as the field value in this case).
What makes things more difficult, that in some strings some of the fields might be missed (replaced with spaces), and we need to be aware of that (get an empty string as the field value in this case).
Given the data has a fixed format, have you considered using a proper parser rather than regexes? Take a look at Boost spirit Qi. It is an EBNF based DSL that uses templates and overloading to allow you to write your parser in C++. It's only slightly more complicated to understand that writing a regex but it is going to be way faster (by an order of magnitude or more).
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thank you all, very much! That's important to understand that not being able to create a sufficient regular expression based not on a luck of skills, but on limitations of the tool itself.
Perhaps if you post the pattern and the rules it represents, then we can help you refine it.