You should consider using a match operation instead of split (the characters matched on would depend on your specific requirements):
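The code that went with this suggestion is not preserved here. What follows is only a minimal sketch of what a match-based approach could look like, not the original poster's solution. It assumes `std::regex` (ECMAScript grammar) as a stand-in for boost::regex, and the function name `tokenize_by_match` is made up for illustration. The idea is to describe each token (a quoted string with optional backslash escapes, a run of word characters, or a single other non-space character) and collect every match instead of describing the delimiters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original thread): collects every match
// of a "token" pattern instead of splitting on a delimiter pattern.
std::vector<std::string> tokenize_by_match(const std::string& text) {
    // Alternatives are tried left to right:
    //   "(?:[^"\\]|\\.)*"   a quoted string; \\. consumes a backslash plus the next character
    //   \w+                 a run of word characters
    //   [^\w\s]             a single character that is neither word nor whitespace
    static const std::regex token(R"re("(?:[^"\\]|\\.)*"|\w+|[^\w\s])re");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), token), end; it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Calling `tokenize_by_match` on the two example strings from the question should give the splits listed there, with the quotes kept in the quoted tokens. With this pattern an unterminated quote falls through to the last alternative and comes out as a lone `"` token; whether that is acceptable depends on your requirements.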
Hi,
Been banging my head on this one for a little while. I'm using boost::regex, and I'd like to split a string by non-word characters, except where the non-word character is inside quotes.
An example string:
The gopher's bike wasn't "hot enough" for the judges.
Would get split into:
The
gopher
'
s
bike
wasn
'
t
"hot enough"
for
the
judges
.
The really tricky part is where the quoted string has an escaped quote. For example:
Here is a "string with \"a quote\" inside" of it.
That should be split as
Here
is
a
"string with \"a quote\" inside"
of
it
.
I think boost::regex is Perl compatible, so it shouldn't matter if I'm using boost::regex, or PHP's preg_split, or any other Perl compatible regex engine.
Can anyone offer any suggestions?
P.S. Yes, I'm trying to keep the quotes in the match, as the split examples above show.
This question has been solved and verified by the asker.
by: mrjoltcola, posted on 2009-05-20 at 07:16:17, ID: 24432078
You can try a negative lookbehind so that the pattern only matches non-word characters that are not preceded by an escape character (\). All that does, though, is make it skip \"; it will still split on the next space inside the quoted string (a[space]quote), so it still won't treat the whole quoted string atomically. You are really asking too much of a simple regex, because it would need unlimited lookbehind to tell whether a given character sits inside a quoted string.
This is really a job for a multi-state lexer or a recursive parser. A single regex doesn't have enough context to handle all of the possibilities properly. With lex or flex we can push and pop states when we see certain delimiters, and then treat characters differently while in that state, but you can't do that with a simple one-line regex.
Usually I don't want to bring another tool into the mix, so I approach this type of pattern by writing a simple parser by hand to properly tokenize the quoted strings.
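As a rough illustration of the hand-written approach, here is a minimal sketch of such a tokenizer. It is not the poster's code: the name `tokenize_by_hand` and the exact rules are assumptions based on the examples in the question (word characters are taken to be letters, digits and underscore; a backslash inside quotes escapes the next character; whitespace is dropped).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical hand-written tokenizer; names and rules are illustrative assumptions.
std::vector<std::string> tokenize_by_hand(const std::string& text) {
    auto is_word = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };

    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const char c = text[i];
        const std::size_t start = i;
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // whitespace separates tokens and is discarded
            continue;
        }
        if (c == '"') {
            // "Inside quotes" state: scan to the closing quote,
            // stepping over each backslash-escaped character as a pair.
            ++i;
            while (i < text.size() && text[i] != '"') {
                i += (text[i] == '\\' && i + 1 < text.size()) ? 2 : 1;
            }
            if (i < text.size()) ++i;  // include the closing quote if there is one
        } else if (is_word(c)) {
            while (i < text.size() && is_word(text[i])) ++i;  // run of word characters
        } else {
            ++i;  // any other character becomes a one-character token
        }
        tokens.push_back(text.substr(start, i - start));
    }
    return tokens;
}
```

The quote branch plays the role of the lexer state described above: while inside it, spaces and punctuation are ordinary characters. In this sketch an unterminated quote swallows the rest of the input as one token; a real parser would probably want to report that as an error instead.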