boost regex - aborts abnormally

static boost::regex basic_tokens ( "(?:(?:(\\w+)(?:\\W+))*(\\w+))" );

What I want is a list/enumeration of all tokens matching the pattern "\\w+". regex_search solves the problem only partially: for two or fewer matches it produces the desired results, but if there are more than two matches for (\\w+), I get three elements in match_results:
 - the base string starting at beginning of first token
 - the second last match in the input string
 - the last match in the input string
e.g. for (a+(b*c)), I get
 - a+(b*c
 - b
 - c
when the desired result is
 - a+(b*c
 - a
 - b
 - c

I was going to use regex_grep, but it is deprecated, and I am having a little trouble using Boost's regex_iterator. I can create an iterator using a constructor, but the problem arises when I want to check whether the iterator has reached the end.

const char * sz = str.c_str();
boost::regex_iterator<const char*> it(sz, sz+str.length(), basic_tokens);

There does not seem to be any way to figure out whether the iterator is at, or has gone beyond, the last element. Nor is there a means to get the number of items directly (i.e. something like iter.count()). I keep looping with { match_results &m = *it; it++; } and the program aborts on an assertion.
jhshukla Asked:

jhshukla (Author) Commented:
I changed basic_tokens to "\\w+" and it is working pretty well now: each match is listed separately when the iterator is dereferenced. But the problem of determining the end of the sequence still remains.
clockwatcher Commented:
Is that what you're after?

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
     boost::smatch match;
     boost::regex re("\\w+");

     std::string str = "(a+(b*c)";

     std::string::const_iterator start = str.begin();
     std::string::const_iterator end = str.end();

     // Search repeatedly, resuming just past the end of the previous match.
     while (boost::regex_search(start, end, match, re))
     {
          std::cout << match.str() << std::endl;
          start = match[0].second;
     }
}
jhshukla (Author) Commented:
Well, you have a valid solution, and I had thought of it.

But I would really prefer to call Boost functions as little as possible, because I want to avoid loops in my code.
clockwatcher Commented:
I guess I'm confused about what you're trying to do, then. The iterators that I see in the Boost library iterate over the submatches in the regex. So unless you have a fixed format for the string you want to match, I don't see how you're going to get what you're after: to use the iterators (and only call regex_search once), you'll need to capture everything within the submatches in one pass.

Here's an example (specific to your problem set) that uses an iterator to spit out what you're asking for.  

// Q_21984194.cpp

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
     boost::smatch match;
     // Three capture groups, so at most three tokens can be reported.
     boost::regex re("(\\w+).*?(\\w+).*?(\\w+)");

     std::string str = "(a+(b*c))";

     std::string::const_iterator start = str.begin();
     std::string::const_iterator end = str.end();

     boost::regex_search(start, end, match, re);

     // Element 0 is the whole match; elements 1..3 are the capture groups.
     for (boost::smatch::iterator i = match.begin(); i != match.end(); ++i)
     {
          if (i->matched)
          {
               std::cout << "iterator: " << i->str() << std::endl;
          }
     }
}

You could repeat the submatch expression for the maximum number of matches that you could ever possibly have and make each one optional, i.e. (\w+)?.*?(\w+)?.*?(\w+)?.*?... It'd be nice if (?:(\w+).*?){1,} worked, but it doesn't.

jhshukla (Author) Commented:
>> max number of matches that you could ever possibly have
Theoretically, that limit is whatever the computer's resources permit. The worst case would be the caller passing me a 4GB string alternating between \w and \W. Writing out 2 billion*(\\W+)(\\w+) = 24 billion characters is out of the question.

Using \w+ almost solved my problem, but it aborts for some incomprehensible reason.
clockwatcher Commented:

Helps if I look at the documentation.  Is this what you're after?

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
       std::string str = "(a+(b*c))";
       boost::regex basic_tokens( "\\w+" );

       std::string::const_iterator start = str.begin();
       std::string::const_iterator end = str.end();

       boost::sregex_iterator it(start, end, basic_tokens);
       // A default-constructed iterator marks the end of the sequence.
       boost::sregex_iterator empty;

       while (it != empty)
       {
            std::cout << it->str() << std::endl;
            ++it;
       }
}

jhshukla (Author) Commented:
That was exactly what I was after.