Perl String Split question

Hi,
I have a Perl string and would like to split (tokenize) it on whitespace, with the following also taken into account. The string may contain:
a) Double Strings
b) escaped spaces, i.e. one\ part
c) Any of the special characters, such as:
- (Hyphen)
, (Comma)
: (Colon)
; (Semi-colon)
' (apostrophe)
~ (tilde)
@ (At)
# (Hash)
$ (Dollar)
% (Percentage)
^ (Caret)
! (Exclamation)
( (Open brackets)
) (Close brackets)
{ (Open braces)
} (Close braces)
[ (Open Square brackets)
] (Close Square brackets)
+ (Plus)
. (Dot)
| (Pipe)
\ (Backslash)
? (question mark)
_ (underscore)
etc.

For example, if a string contains
The te\ st "of" string_ reg-ex:
should be parsed into
Token 0: The
Token 1: te st
Token 2: "of"
Token 3: string_
Token 4: reg-ex

NOTE: Answers that reuse existing libraries for the splitting / tokenizing / parsing are preferred.

K

ASKER CERTIFIED SOLUTION

Adam314

This solution is only available to members.

Purdue_Pete

ASKER

Adam314,
Does your code take care of all the considerations above, or is it just for the example posted? If not, I am looking for a solution that handles all of them.

BTW, what does this line do?
my @tokens = split(/(?<!\\) /, $str);

Thanks.

SOLUTION

Adam314

SOLUTION

ozo

Purdue_Pete

ASKER

Adam314,
Excellent, I will try your solution with various strings.
a) I meant a double-quoted token, i.e. "of" in the example should be treated as one token, and the token should include the double quotes.
Regarding the consecutive spaces, do you mean \ \ , i.e. backslash-space-backslash-space?

Adam314

If a double-quoted string such as "double string" should be kept as one token, you'll need to use what ozo posted. Otherwise, what I posted should work.

By double space, I meant:
the test string "of" stuff
The consecutive spaces would be counted only once, so you would not end up with a bunch of empty tokens.
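
For reference, the line quoted earlier in the thread, split(/(?<!\\) /, $str), splits on a single space that is not preceded by a backslash; (?<!\\) is a negative lookbehind. Below is a minimal sketch of a variant that treats a run of whitespace as one separator. It is not the code from the answers above. The helper name split_unescaped is made up for illustration, and turning "\ " back into a plain space is an assumption based on the expected token "te st" in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace that is not escaped by a backslash.
sub split_unescaped {
    my ($str) = @_;
    # (?<!\\) : the separator must not be preceded by a backslash.
    # \s+     : a whole run of whitespace counts as one separator,
    #           so consecutive spaces do not produce empty tokens.
    my @tokens = split /(?<!\\)\s+/, $str;
    # split keeps a leading empty field when the string starts with whitespace.
    shift @tokens if @tokens && $tokens[0] eq '';
    # Assumption: an escaped space should become a plain space in the token.
    s/\\ / /g for @tokens;
    return @tokens;
}

my @tokens = split_unescaped('The te\ st   "of" string_ reg-ex');
print "Token $_: $tokens[$_]\n" for 0 .. $#tokens;
```

With this input the sketch gives five tokens: The, te st, "of", string_ and reg-ex. It does not keep a double-quoted string containing spaces together.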

Terry Woods

One thing I don't think you've clarified enough: if there is a double-quoted string, do you want to keep it together?

For example, if a string contains
The te\ st "of string_" reg-ex:
should it be parsed into
Token 0: The
Token 1: te st
Token 2: "of string_"
Token 3: reg-ex
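
If the answer is yes, one way to get that behaviour with core Perl only is to match tokens instead of splitting on separators. This is only a sketch under a few assumptions: the helper name tokenize_quoted is made up, a backslash escapes the next character (inside quotes as well), "\ " is turned back into a plain space, and an unbalanced quote is simply kept as an ordinary character. The input below leaves out the trailing colon, on the assumption that it is not part of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: whitespace-separated tokens, with double-quoted
# strings and backslash-escaped characters kept inside a single token.
sub tokenize_quoted {
    my ($str) = @_;
    my @tokens = $str =~ /(
        (?:
            "(?:[^"\\]|\\.)*"   # a complete double-quoted string, quotes kept
          | \\.                 # a backslash-escaped character (escaped space)
          | [^\s"\\]            # any other non-whitespace character
          | ["\\]               # fallback: unbalanced quote or trailing backslash
        )+
    )/gxs;
    # Assumption: an escaped space should become a plain space in the token.
    s/\\ / /g for @tokens;
    return @tokens;
}

my @tokens = tokenize_quoted('The te\ st "of string_" reg-ex');
print "Token $_: $tokens[$_]\n" for 0 .. $#tokens;
```

Punctuation such as - , : ; ~ @ # $ % ^ ! ( ) { } [ ] + . | ? _ and a lone apostrophe is treated like any other non-whitespace character, so it stays inside the token it appears in.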