Problem with non-ASCII chars and regex

Hi folks,
I am extracting some data from an HTML page, and it contains a character that I know is non-ASCII. I have been able to remove that character with a regex. When I display the cleaned string using Response.Write, it looks right: three items, each separated by one space. But when I view the underlying HTML source of that Response.Write output, it's wacky.
When I try to split the string on spaces, I get only two items instead of three: the first two items grouped together (separated by what looks like a space), and then the last item.
I checked the length as well, and the string still will not split correctly. I need to split it on spaces, breaking the string into 3 parts.
How can I fix this?
I have inserted my code and attached images of the output.
Any help appreciated. Thank you!

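Function StripNonChars(wrkstring)
    ' Remove every character that is not a word character, a period,
    ' a dollar sign, or whitespace.
    ' Note: \s keeps tabs and line breaks as well as plain spaces.
    Dim regEx
    Set regEx = New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "[^\w\.\$\s]"
    StripNonChars = regEx.Replace(wrkstring, "")
End Function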

[Image visualchar.JPG: the extracted string as displayed with Response.Write]
[Image visual-source-view.JPG: what the underlying HTML source looks like]
Asked by: Overthere

jmcg (Owner) Commented:
I suspect you'll need to adjust your regexp to the way the HTML code actually is, not the way it is rendered. You may need to invoke a multi-line match if there are line breaks in the string of interest.
aikimark Commented:
So, you are starting with this:
<br><br><br>Length 21:<br><br>Cleaned USD
49.40

 Daily<br>

And what do you need to parse out of it, or transform it into?

Please use code snippets rather than screenshots.
Overthere (Author) Commented:
I just want to clarify: the <br> tags are mine, added for readability when echoing back the results of my regex. They have nothing to do with what I am trying to achieve. The original string contained non-ASCII characters. Rendered in a browser, they display as small white boxes before and after "USD" and before "Daily". In the underlying HTML source, those white boxes show as "?". Running the string through my regex got rid of them, but the problem remains that I cannot split the result on whitespace.
After the regex, it displays USD 49.40 Daily, which looks good, but...
You would think it would just be a matter of splitting on space, but as I stated before, it groups USD 49.40 as the first item and Daily as the second, when each should be its own item, i.e. item 1 = USD; item 2 = 49.40; item 3 = Daily.

I tried re-splitting the first item and it will not split. The result is still: USD 49.40
If I do not have \s in my regex, I have no way of knowing where one item ends and the next begins.
What I need to do is break the line USD 49.40 Daily into 3 separate items, i.e. USD, 49.40 and Daily.
I included screenshots because the length of the entire string returns 21, and because if I do this:
strSplit = Split(StrExtractedData," ")
it returns only two items, as stated above. (I have also tried Chr(32) instead of " ", with the same results.) I think it is totally wacky that the length is 21 and yet the string is spread over two lines. What's with that? I thought images would help explain it visually.
aikimark Commented:
So, you don't care about the Length 21:?
aikimark Commented:
Try this:
Cleaned\s([^\r]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>

aikimark Commented:
This is a simpler pattern:
\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>

aikimark Commented:
btw...my patterns are for matching, not replacing
aikimark Commented:
Here are two different implementations: one function returns a Variant array and the other returns a Collection.
Option Explicit

Function ParseMoneyCell(wrkstring)
    Static regEx As Object
    Dim oMatches As Object
    Dim oM As Object
    Dim vOut As Variant
    Dim lngSM As Long
    
    If regEx Is Nothing Then
        Set regEx = CreateObject("vbscript.regexp")
        regEx.Global = False
        regEx.Pattern = "\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>"
    End If
    
    If regEx.test(wrkstring) Then
        Set oMatches = regEx.Execute(wrkstring)
        Set oM = oMatches(0)
        ReDim vOut(0 To oM.submatches.Count - 1)
        For lngSM = 0 To oM.submatches.Count - 1
            vOut(lngSM) = oM.submatches(lngSM)
        Next
    Else
        vOut = Empty
    End If
    ParseMoneyCell = vOut

End Function


Function ParseMoneyCell_col(wrkstring)
    Static regEx As Object
    Dim oMatches As Object
    Dim colOut As New Collection
    Dim lngSM As Long
    
    If regEx Is Nothing Then
        Set regEx = CreateObject("vbscript.regexp")
        regEx.Global = False
        regEx.Pattern = "\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>"
    End If
    
    If regEx.test(wrkstring) Then
        Set oMatches = regEx.Execute(wrkstring)
        For lngSM = 0 To oMatches(0).submatches.Count - 1
            colOut.Add oMatches(0).submatches(lngSM)
        Next
    End If
    
    Set ParseMoneyCell_col = colOut

End Function

Invocation Examples:
Dim vParsed as Variant
Dim vItem as Variant
vParsed = ParseMoneyCell(x)
If IsEmpty(vParsed) Then
    debug.print "HTML table cell did not parse"
Else
    For Each vItem in vParsed
        debug.print vItem
    Next
End If


Dim vParsed as Collection
Dim vItem as Variant
Set vParsed = ParseMoneyCell_col(x)
If vParsed.Count =0 Then
    debug.print "HTML table cell did not parse"
Else
    For Each vItem in vParsed
        debug.print vItem
    Next
End If
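
A note for anyone trying these functions in classic ASP: they are written in VBA syntax (Static, Dim ... As ..., Collection), which VBScript does not accept, so they will not run as posted inside an ASP page. A minimal VBScript sketch of the array version might look like the following. The name ParseMoneyCellVbs and the widened line-break alternation (CRLF, lone CR, or lone LF) are assumptions for illustration, not part of the original posts.

```vbscript
' VBScript adaptation (classic ASP / WSH): untyped Dim, no Static.
' ParseMoneyCellVbs is a hypothetical name used only for this sketch.
Function ParseMoneyCellVbs(wrkstring)
    Dim regEx, oMatches, oM, vOut, i

    Set regEx = New RegExp
    regEx.Global = False
    ' Assumption: the break between the first two items may be CRLF, CR or LF.
    regEx.Pattern = "\s([^\s]+)(?:\r\n|\r|\n)([^\s]+)\s*(\w[^<]+?)<br>"

    Set oMatches = regEx.Execute(wrkstring)
    If oMatches.Count = 0 Then
        ' No match: return Empty so the caller can test with IsEmpty.
        ParseMoneyCellVbs = Empty
        Exit Function
    End If

    ' Copy the captured groups into a zero-based array.
    Set oM = oMatches(0)
    ReDim vOut(oM.SubMatches.Count - 1)
    For i = 0 To oM.SubMatches.Count - 1
        vOut(i) = oM.SubMatches(i)
    Next
    ParseMoneyCellVbs = vOut
End Function
```

The caller can check the result the same way as in the first invocation example: IsEmpty on the return value, then loop over the array.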

Overthere (Author) Commented:
Thank you to everyone who responded to my question. I know inserting images may have annoyed some of you, but it was the only way I could show what I was seeing and the discrepancies between the items.
Here is what resolved it. Aikimark's postings helped, but I could not quite get his functions to work properly. They got me thinking, though, and here is how I resolved it:
I created one function and passed it a work pattern and the string. The first pattern stripped out the non-ASCII characters. I then split the returned result. I took the first array item (the one where USD and 49.40 were grouped) and stripped out everything but letters, which gave me the currency type. I then passed the first array item again with a new pattern that stripped out everything but digits and the period, which gave me the dollar amount. The second array item from the split was already correct, so it needed no further processing.
I know that the character between USD and 49.40 is not a space, nor is it an ASCII char. I sure would like to know what it really is; my bet is that it is some hexadecimal/octal representation.
Anyway, thank you all again. I did give points to jmcg for responding.
If this is not satisfactory, please let me know.
Thanks again! :)
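
Judging by the source view quoted earlier in the thread, the separator between USD and 49.40 may well be a line break rather than a space. A browser renders a line break as a space, the \s in the cleaning pattern keeps it, and Split(..., " ") ignores it. Under that assumption, collapsing every run of whitespace into one plain space before splitting should give the three items in a single pass. This is only a sketch: SplitOnAnyWhitespace is a hypothetical helper name, and \xA0 (non-breaking space) is listed explicitly in case the engine's \s does not cover it.

```vbscript
' Hypothetical helper: collapse whitespace runs, then split on single spaces.
Function SplitOnAnyWhitespace(s)
    Dim re, cleaned
    Set re = New RegExp
    re.Global = True
    ' \s covers space, tab, CR and LF; \xA0 is the non-breaking space.
    re.Pattern = "[\s\xA0]+"
    ' Every run becomes one space; Trim drops leading/trailing spaces.
    cleaned = Trim(re.Replace(s, " "))
    SplitOnAnyWhitespace = Split(cleaned, " ")
End Function

' Usage with a test string that mimics the quoted source view:
Dim parts
parts = SplitOnAnyWhitespace("USD" & vbCrLf & "49.40" & vbLf & vbLf & " Daily")
' parts(0) = "USD", parts(1) = "49.40", parts(2) = "Daily"
```

In the original flow this would run on the output of StripNonChars, replacing the separate letters-only and digits-only passes.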
aikimark Commented:
Open the HTML in Notepad++ (or similar) and view the hex values.

If you had posted the actual HTML, I could have adjusted my regex pattern to parse it properly. My pattern matched vbCrLf, but your actual HTML might have contained only a solo vbLf or vbCr character.
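
If a hex editor is not handy, the same check can be done from the script itself by printing the code of every character in the string. The sketch below is one way to do that. DumpCharCodes is a hypothetical helper name, and StrExtractedData is the variable from the question.

```vbscript
' Hypothetical diagnostic: list each character's position and code point.
Function DumpCharCodes(s)
    Dim i, code, out
    out = ""
    For i = 1 To Len(s)
        ' AscW can return a negative Integer for code units above &H7FFF;
        ' masking with &HFFFF& brings it back into the range 0..65535.
        code = AscW(Mid(s, i, 1)) And &HFFFF&
        out = out & i & ": U+" & Right("0000" & Hex(code), 4) & vbCrLf
    Next
    DumpCharCodes = out
End Function

' In an ASP page, <pre> plus HTMLEncode keeps the browser from
' collapsing the very whitespace being inspected:
' Response.Write "<pre>" & Server.HTMLEncode(DumpCharCodes(StrExtractedData)) & "</pre>"
```

Codes worth looking for in the output are 0020 (space), 0009 (tab), 000D (CR), 000A (LF) and 00A0 (non-breaking space).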
aikimark Commented:
It is also possible to return the submatches collection directly from the function, rather than assign values to a collection object.
Example:
Function ParseMoneyCell_SM(wrkstring)
    Static regEx As Object
    Dim oMatches As Object
    
    If regEx Is Nothing Then
        Set regEx = CreateObject("vbscript.regexp")
        regEx.Global = False
        regEx.Pattern = "\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>"
    End If
    
    If regEx.test(wrkstring) Then
        Set oMatches = regEx.Execute(wrkstring)
        Set ParseMoneyCell_SM = oMatches(0).submatches
    Else
        Set ParseMoneyCell_SM = Nothing
    End If

End Function

Invocation example:
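Dim oSubmatches As Object
Dim vSM As Variant    ' submatches are strings, so the loop variable must be a Variant
Set oSubmatches = ParseMoneyCell_SM(x)
If oSubmatches Is Nothing Then
    Debug.Print "HTML table cell did not parse"
Else
    For Each vSM In oSubmatches
        Debug.Print vSM
    Next
End If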
