Solved

RegEx: Split by non-word char, except inside quotes

Posted on 2009-05-20
Last Modified: 2012-05-07
Hi,
Been banging my head on this one for a little while. I'm using boost::regex, and I'd like to split a string by non-word characters, except where the non-word character is inside quotes.

An example string:

The gopher's bike wasn't "hot enough" for the judges.

Would get split into:

The
gopher
'
s
bike
wasn
'
t
"hot enough"
for
the
judges
.
The really tricky part is where the quoted string has an escaped quote. For example:

Here is a "string with \"a quote\" inside" of it.

That should be split as

Here
is
a
"string with \"a quote\" inside"
of
it
.
I think boost::regex is Perl compatible, so it shouldn't matter if I'm using boost::regex, or PHP's preg_split, or any other Perl compatible regex engine.

Can anyone offer any suggestions?

P.S. Yes, I'm trying to keep the quotes in the match, as the split examples above show.
Question by:headzoo
 
LVL 40

Expert Comment

by:mrjoltcola
ID: 24432078
You can try a negative lookbehind so that you only match non-word characters that are not preceded by an escape character (\). All that does, though, is make it skip \"; it will still split on the next space inside the quoted string (a[space]quote), so it still won't treat the whole quoted string atomically. You are really asking too much of a simple regex, because it would need unlimited lookbehind to do this.
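
To see the problem concretely, here is a minimal sketch. It uses Python's `re` module purely as a stand-in for a Perl-style engine (the thread itself is about boost::regex), and the exact split pattern is an assumption about what "negative lookbehind on the escape character" would look like:

```python
import re

text = r'Here is a "string with \"a quote\" inside" of it.'

# Split on any non-word character that is NOT preceded by a backslash.
# Empty pieces produced by adjacent separators are dropped.
parts = [p for p in re.split(r'(?<!\\)\W', text) if p]
print(parts)
```

The escaped quotes survive, but the quoted string is still broken apart at every space, so pieces such as `string`, `with`, and `"a` come back as separate items.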

This is really a job for a multi-state lexer, or a recursive parser. A single regex doesn't have enough context to handle all of the possibilities properly. If using lex or flex we can push/pop states when we see certain delimiters, and then treat the characters differently while in that state, but you can't do that with a simple one-line regex.

Usually I don't want to bring another tool into the mix so I approach this type of pattern by writing a simple parser by hand to properly tokenize the quoted strings.
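
A hand-written tokenizer of that kind could look roughly like the sketch below. It is written in Python for illustration only; the function name `tokenize` is made up, and treating letters, digits, and underscore as word characters is an assumption, as is returning an unterminated quoted string as a single token.

```python
def tokenize(text):
    """Split text into words, quoted strings, and single punctuation marks."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # "Inside quotes" state: scan to the closing quote, stepping
            # over any backslash-escaped character as a two-character unit.
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == '\\' else 1
            end = min(j + 1, n)  # include the closing quote if there is one
            tokens.append(text[i:end])
            i = end
        elif ch.isalnum() or ch == '_':
            # Word state: collect a run of word characters.
            j = i
            while j < n and (text[j].isalnum() or text[j] == '_'):
                j += 1
            tokens.append(text[i:j])
            i = j
        elif ch.isspace():
            i += 1  # whitespace only separates tokens
        else:
            tokens.append(ch)  # any other character is its own token
            i += 1
    return tokens


print(tokenize(r'Here is a "string with \"a quote\" inside" of it.'))
```

The two branches for quotes and words are the "states" mentioned above: the same character (a space, say) is handled differently depending on which state the scanner is in.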
 
LVL 27

Expert Comment

by:ddrudik
ID: 24432528
You should consider using a match operation instead of split (the characters matched on would depend on your specific requirements):
Raw Match Pattern:
"[^"]*"|[A-Za-z]+|[^A-Za-z ]
 
$matches Array:
(
    [0] => Array
        (
            [0] => The
            [1] => gopher
            [2] => '
            [3] => s
            [4] => bike
            [5] => wasn
            [6] => '
            [7] => t
            [8] => "hot enough"
            [9] => for
            [10] => the
            [11] => judges
            [12] => .
        )
 
)
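
For anyone who wants to try the match-instead-of-split idea outside PHP, here is a small sketch using Python's `re` module as a stand-in for a Perl-compatible engine. The pattern is the one shown above; the variable names are illustrative only.

```python
import re

# Alternatives are tried left to right at each position:
# a quoted string, then a run of letters, then any single
# character that is neither a letter nor a space.
pattern = re.compile(r'"[^"]*"|[A-Za-z]+|[^A-Za-z ]')

text = 'The gopher\'s bike wasn\'t "hot enough" for the judges.'
tokens = pattern.findall(text)
print(tokens)
```

Because the quoted-string alternative comes first, the spaces inside `"hot enough"` are consumed as part of that one match and never get a chance to act as separators. Note that with these character classes digits, tabs, and newlines would each come back as single-character tokens, which may or may not suit your requirements.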

 
LVL 27

Accepted Solution

by:ddrudik (earned 500 total points)
ID: 24432594
For the escaped " requirement you could consider:
Raw Match Pattern:
"(?:(?!(?<!\\)")[\S\s])*"|[A-Za-z]+|[^A-Za-z ]
 
$matches Array:
(
    [0] => Array
        (
            [0] => The
            [1] => gopher
            [2] => '
            [3] => s
            [4] => bike
            [5] => wasn
            [6] => '
            [7] => t
            [8] => "hot \"test\" enough"
            [9] => for
            [10] => the
            [11] => judges
            [12] => .
        )
 
)
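
The same pattern can be checked against the escaped-quote example from the question. This is a sketch in Python's `re` module, used here only as a stand-in for a Perl-compatible engine; the names `PATTERN` and `tokens` are illustrative.

```python
import re

# The quoted-string alternative consumes one character at a time and
# stops only at a quote that is not preceded by a backslash.
PATTERN = re.compile(r'"(?:(?!(?<!\\)")[\S\s])*"|[A-Za-z]+|[^A-Za-z ]')

text = r'Here is a "string with \"a quote\" inside" of it.'
tokens = PATTERN.findall(text)
print(tokens)
```

The quoted string, escaped quotes included, comes back as a single token, followed by `of`, `it`, and `.`.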

 

Author Closing Comment

by:headzoo
ID: 31583497
Works like a charm. Of course, the thing I dislike about those crazy character-laden regexes is that it's difficult for me to understand how they work. But it works, so I guess that's good enough. :)
 
LVL 27

Expert Comment

by:ddrudik
ID: 24436251
Thanks for the question and the points.

In case you would like more info on the pattern:
The regular expression:
 
(?-imsx:"(?:(?!(?<!\\)")[\S\s])*"|[A-Za-z]+|[^A-Za-z ])
 
matches as follows:
  
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      (?<!                     look behind to see if there is not:
----------------------------------------------------------------------
        \\                       '\'
----------------------------------------------------------------------
      )                        end of look-behind
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    [\S\s]                   any character of: non-whitespace (all
                             but \n, \r, \t, \f, and " "), whitespace
                             (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  [A-Za-z]+                any character of: 'A' to 'Z', 'a' to 'z'
                           (1 or more times (matching the most amount
                           possible))
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  [^A-Za-z ]               any character except: 'A' to 'Z', 'a' to
                           'z', ' '
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
