raterus

asked on

Parse Search Text

I'm looking for help on something that seems like it would be highly discussed, but I can't seem to find much on it.

I'm interested in turning this mess
blah, A&Z this "some word", another word

into "searchable" tokens, e.g.

blah
A&Z
this
some word
another
word

The main delimiters here are whitespace and commas, with quotes allowing for multiple words in a token.

Certainly I can hack my way through this, but I'm looking for some good advice from someone who's already done this.
--Michael
Brian Crowe

Take a look at the String.Split method:

Dim delimStr As String = " ,.:"
Dim delimiter As Char() = delimStr.ToCharArray()
Dim words As String = "one two,three:four."
Dim split As String() = Nothing

Console.WriteLine("The delimiters are -{0}-", delimStr)
Dim x As Integer
For x = 1 To 5
    split = words.Split(delimiter, x)
    Console.WriteLine(ControlChars.Cr + "count = {0,2} ..............", x)
    Dim s As String
    For Each s In split
        Console.WriteLine("-{0}-", s)
    Next s
Next x
raterus

ASKER

Thank you; however, I'm well versed in the Split methods available to me. My main question is about properly handling delimiters inside quotes, as discussed in my original post.

blah, A&Z this "some word", another word

needs to be split/parsed into

blah
A&Z
this
some word       <-- Very important!
another
word
In the For Each s In split loop, you could do something like:

s = s.Replace("""", "")

That will strip out all double quotes...

Jake
raterus

ASKER

I don't think you quite understand what I'm doing; please reread my question and first comment. Removing quotes is NOT my only intent here.
Michael,

When you say "whitespace," are you including tabs and multiple spaces, or just single spaces?
raterus

ASKER

Tabs and spaces (any number). This is a "search" box I'm parsing.
Ah... so we have to account for the fact that SELECT * FROM Users WHERE Clue>0 returns no rows?

What about the case where the user enters "" or '?
SOLUTION
Sancler

This solution is only available to members.
raterus

ASKER

Yes Jeff, exactly.  I definitely think there were a few rows returned.

This would have to account for stupid-user tricks, something like this

"bad joe" says he wants to "break my program

would "recover" and likely ignore the third quote.

Right now I'm not going to worry about single quotes, but I want them to be included in the final token in case they are searching for "o'brien" or something.
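The requirements so far (split on any run of whitespace or commas, keep quoted phrases whole, leave single quotes inside tokens, and recover from an unmatched quote) can be sketched as a simple character scanner. This is only an illustration, not one of the solutions posted in this thread: the names ScannerTokenizerSketch, Tokenize and Flush are made up, and it assumes a stray quote is simply dropped and the words after it are split normally.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text

Module ScannerTokenizerSketch
    ' Hypothetical helper: walks the input once and collects tokens.
    Function Tokenize(ByVal input As String) As List(Of String)
        Dim tokens As New List(Of String)()
        Dim current As New StringBuilder()
        Dim i As Integer = 0

        While i < input.Length
            Dim c As Char = input(i)

            If c = """"c Then
                ' Look for the closing quote of this phrase.
                Dim closing As Integer = input.IndexOf(""""c, i + 1)
                If closing >= 0 Then
                    ' Balanced: end any pending word, then take the phrase whole.
                    Flush(current, tokens)
                    current.Append(input, i + 1, closing - i - 1)
                    Flush(current, tokens)
                    i = closing
                End If
                ' Unbalanced (assumption): the stray quote is skipped.
            ElseIf Char.IsWhiteSpace(c) OrElse c = ","c Then
                ' Tabs, spaces and commas all end the current word.
                Flush(current, tokens)
            Else
                ' Everything else, including single quotes, stays in the token.
                current.Append(c)
            End If

            i += 1
        End While

        Flush(current, tokens)
        Return tokens
    End Function

    ' Adds the pending text as a token (if there is any) and resets the buffer.
    Private Sub Flush(ByVal current As StringBuilder, ByVal tokens As List(Of String))
        If current.Length > 0 Then
            tokens.Add(current.ToString())
            current.Length = 0
        End If
    End Sub

    Sub Main()
        Dim sample As String = """bad joe"" says he wants to ""break my program"
        For Each token As String In Tokenize(sample)
            Console.WriteLine("-{0}-", token)
        Next
    End Sub
End Module
```

With the sample above, "bad joe" comes out as one token and everything after the unmatched third quote is split into single words. A search for o'brien stays a single token because the apostrophe is treated like any other character.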

Does
"bad joe" says he wants to "break my program

return

"bad joe"
says
he
wants
to
"break my program"

OR

"bad joe"
says
he
wants
to
break
my
program

??
RonaldBiemans

    Dim fieldValues As String() = ParseLine(TextBox1.Text)

    Private Shared Function ParseLine(ByVal oneLine As String) As String()
        Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
        Dim r As System.Text.RegularExpressions.Regex = _
                New System.Text.RegularExpressions.Regex(pattern)

        Return r.Split(oneLine)
    End Function
raterus

ASKER

The latter, Jeff.

Ronald, thanks for the regex. Unfortunately, it looks like it is splitting only on commas. Here are the "tokens" it parsed out of
blah, A&Z this "some word", another word

--

blah
A&Z this "some word"
another word
raterus

ASKER

@Sancler, oh I'm not ignoring you, it's actually a nice idea I may take a good look at.  However, I'm hoping to have someone do it for me, thus achieving the state of true laziness :-)

**Posted articles that cover this topic for a .NET language will get assist points.**

ASKER CERTIFIED SOLUTION
This solution is only available to members.
SOLUTION
This solution is only available to members.
oh... and add "return output" :)
raterus

ASKER

I'm extensively testing Chaosian's and Fernando's approaches

I can say both work very well for what I'm doing....
Just a small adjustment and it should work:

Dim fieldValues As String() = ParseLine(TextBox1.Text)

Private Shared Function ParseLine(ByVal oneLine As String) As String()
    Dim pattern As String = "[ ,](?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
    Dim r As System.Text.RegularExpressions.Regex = _
            New System.Text.RegularExpressions.Regex(pattern)

    Return r.Split(oneLine)
End Function
Ronald

I think that wants "+" after "[ ,]".  Otherwise - for me anyway - it gives an empty string in the places where it encounters both space and comma.

;-)

Roger
Yep, you are right Sancler, it does. I just filtered out the empty string afterwards :-)
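For reference, here is a rough sketch that combines Ronald's pattern with Roger's "+" and the empty-string filtering mentioned above. Two things are assumptions rather than part of the posted code: the character class is widened from "[ ,]" to "[\s,]" so that tabs count as delimiters too, and the surrounding quotes are trimmed from each piece. The names RegexTokenizerSketch and ParseSearchText are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexTokenizerSketch
    ' Hypothetical helper: splits search text into tokens with one regex.
    Function ParseSearchText(ByVal input As String) As List(Of String)
        ' One or more whitespace characters or commas, but only where an even
        ' number of double quotes follows (that is, outside a quoted phrase).
        Dim pattern As String = "[\s,]+(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
        Dim tokens As New List(Of String)()

        For Each piece As String In Regex.Split(input, pattern)
            ' Drop the surrounding quotes from quoted phrases.
            Dim token As String = piece.Trim(""""c)
            ' Leading or trailing delimiters can still yield empty strings.
            If token.Length > 0 Then
                tokens.Add(token)
            End If
        Next

        Return tokens
    End Function

    Sub Main()
        Dim sample As String = "blah, A&Z this ""some word"", another word"
        For Each token As String In ParseSearchText(sample)
            Console.WriteLine("-{0}-", token)
        Next
    End Sub
End Module
```

For the sample line this should give blah, A&Z, this, some word, another and word. Note that the lookahead counts quotes from the delimiter to the end of the string, so it relies on the quotes being balanced. With an odd number of quotes, such as the "bad joe" example earlier in the thread, it can split inside a quoted phrase and fail to split between words, so unbalanced input would need separate handling.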
SOLUTION
This solution is only available to members.
raterus

ASKER

All three solutions worked; I opted to go with Chaosian's. Why, I don't really know :-)