
boost regex - aborts abnormally

static boost::regex basic_tokens ( "(?:(?:(\\w+)(?:\\W+))*(\\w+))" );

What I want is a list/enumeration of all tokens matching the pattern "\\w+". regex_search solves the problem only partially: for two or fewer matches it produces the desired results, but if there are more than two matches for (\\w+), I get three elements in match_results:
- the base string starting at beginning of first token
- the second last match in the input string
- the last match in the input string
e.g. for (a+(b*c)), I get
- a+(b*c
- b
- c
when the desired result is
- a+(b*c
- a
- b
- c

I was going to use regex_grep, but it is deprecated, and I am having a little trouble using the Boost library's regex_iterator. I can create an iterator using a constructor, but the problem arises when I want to check whether the iterator has reached the end.

const char * sz = str.c_str();
boost::regex_iterator<const char*> it(sz, sz+str.length(), basic_tokens);

There does not seem to be any way to figure out whether the iterator is at, or has gone beyond, the last element. Nor is there a means to get the number of items directly (i.e. something like iter.count()). I keep looping with { match_results &m = *it; it++; } and the program aborts on an assertion.

jhshukla

ASKER

I changed basic_tokens to "\\w+" and it is working pretty well now: each match is listed separately when the iterator is dereferenced. But the problem of determining the end of the sequence still remains.
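For reference, regex_iterator follows the usual iterator convention: a default-constructed iterator is the end-of-sequence marker, and the loop stops when the working iterator compares equal to it. Dereferencing or incrementing an iterator that has already reached the end is not valid, which would be consistent with the assertion failure described above. The following is a minimal sketch, not a definitive implementation. It assumes std::regex as a stand-in for Boost.Regex (through the alias rx) so that it builds without Boost, and collect_tokens is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex so the sketch builds
// without Boost. With Boost installed, including <boost/regex.hpp> and
// writing `namespace rx = boost;` should be equivalent for these calls.
namespace rx = std;

// Hypothetical helper: collects every match of `re` found in `text`.
std::vector<std::string> collect_tokens(const std::string& text,
                                        const rx::regex& re)
{
    std::vector<std::string> tokens;

    rx::sregex_iterator it(text.begin(), text.end(), re);
    const rx::sregex_iterator end;  // default-constructed: end-of-sequence

    // Compare against `end` BEFORE dereferencing or incrementing.
    for (; it != end; ++it) {
        tokens.push_back(it->str());  // *it is a match_results; str() is the whole match
    }
    return tokens;
}
```

The number of matches is then simply tokens.size(); the iterator itself has no count member.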

clockwatcher

Is that what you're after?

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::smatch match;
    boost::regex re("\\w+");

    std::string str = "(a+(b*c))";

    std::string::const_iterator start = str.begin();
    std::string::const_iterator end = str.end();

    // Search repeatedly, resuming just past the previous match each time.
    while (boost::regex_search(start, end, match, re))
    {
        std::cout << match.str() << std::endl;
        start = match[0].second;  // end of the match just found
    }

    return 0;
}

jhshukla

ASKER

Well, you have a valid solution, and I had thought of it.

But I would really prefer to call Boost functions as little as possible, because I want to avoid loops in my code.
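One way to avoid writing the loop by hand is to pass a pair of regex_token_iterator objects to a container's range constructor, so that the iteration happens inside the library. This is a sketch under stated assumptions rather than the solution from this thread: std::regex stands in for Boost.Regex via the alias rx, and tokens_without_loop is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; `namespace rx = boost;`
// together with <boost/regex.hpp> should work the same way for these calls.
namespace rx = std;

// Hypothetical helper: returns every \w+ token without an explicit loop.
std::vector<std::string> tokens_without_loop(const std::string& text)
{
    static const rx::regex word("\\w+");

    // With no sub-match index given, the token iterator yields sub-match 0,
    // i.e. the whole match, once per match found.
    rx::sregex_token_iterator first(text.begin(), text.end(), word);
    rx::sregex_token_iterator last;  // default-constructed: end-of-sequence

    // The range constructor walks [first, last) and converts each
    // sub_match to std::string.
    return std::vector<std::string>(first, last);
}
```

The regex object is a named static rather than a temporary because the iterators refer to it for as long as they are in use.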

clockwatcher

I guess I'm confused about what you're trying to do then. The iterators that I see in the boost library iterate over the submatches in the regex. So unless you have a fixed format for the string you want to match, I don't see how you're going to get what you're after: to use the iterators (and only call regex_search once) you'll need to capture everything within the submatches in one pass.

Here's an example (specific to your problem set) that uses an iterator to spit out what you're asking for.

// Q_21984194.cpp : Defines the entry point for the console application.
//

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::smatch match;
    // Three capture groups, so at most three tokens can be reported.
    boost::regex re("(\\w+).*?(\\w+).*?(\\w+)");

    std::string str = "(a+(b*c))";

    std::string::const_iterator start = str.begin();
    std::string::const_iterator end = str.end();

    if (!boost::regex_search(start, end, match, re))
    {
        return 1;  // no match, nothing to iterate over
    }

    // Element 0 is the whole match; elements 1..3 are the capture groups.
    for (boost::smatch::iterator i = match.begin(); i != match.end(); ++i)
    {
        if (i->matched)
        {
            std::cout << "iterator: " << i->str() << std::endl;
        }
    }

    return 0;
}

You could repeat the submatch expression for the max number of matches that you could ever possibly have and then make it optional... i.e. (\w+)?.*?(\w+)?.*?(\w+)?.*?... It'd be nice if this worked (?:(\w+).*?){1,} ... but it doesn't.
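The reason a repeated group does not help is that the number of sub-matches is fixed by the number of capture groups written in the pattern, and a group inside a repetition keeps only the text from its last iteration. The sketch below illustrates this; it assumes std::regex as a stand-in for Boost.Regex (alias rx), and sub_matches is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex; `namespace rx = boost;`
// together with <boost/regex.hpp> should work the same way for these calls.
namespace rx = std;

// Hypothetical helper: runs one regex_search and returns every sub-match,
// with element 0 being the whole match.
std::vector<std::string> sub_matches(const std::string& text,
                                     const std::string& pattern)
{
    const rx::regex re(pattern);
    rx::smatch m;
    std::vector<std::string> out;

    if (rx::regex_search(text, m, re)) {
        // m.size() is 1 + the number of capture groups in the pattern,
        // no matter how many times a repeated group iterated.
        for (const auto& sm : m) {
            out.push_back(sm.str());
        }
    }
    return out;
}
```

Calling sub_matches("(a+(b*c))", "(?:(\\w+)\\W*)+") yields two elements only: the whole match and the last token, c. The earlier tokens are overwritten as the group repeats, which is why one search call cannot list an unbounded number of tokens through sub-matches.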

jhshukla

ASKER

>> max number of matches that you could ever possibly have
Theoretically that limit is whatever the computer's resources permit. The worst case would be the caller passing me a 4GB string alternating between \w and \W. Writing out all 2 billion * (\\W+)(\\w+) = 24 billion characters is out of the question.

Using \w+ almost solved my problem, but it aborts for some incomprehensible reason.

ASKER CERTIFIED SOLUTION

clockwatcher


jhshukla

ASKER

That was exactly what I was after.