
Problem with non-ASCII chars and regex

Hi folks,
I am extracting some data from an HTML page, and I have a character that I know is non-ASCII. I have been able to remove that character with a regex. When I display the cleaned string using Response.Write, it looks right: it contains 3 items, each separated by one space. But when I view the underlying HTML source of my Response.Write output, it's wacky.
And when I try to split it on the spaces, it only creates two data items instead of three. It groups the first two items together, separated by a space, and then the last item.
I checked the length. It will not split correctly; I need to split it on spaces, breaking the string into 3 parts.
How can I fix this?
I have inserted my code and attached images of the output.
Any help appreciated. Thank you!

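Function StripNonChars(wrkstring)
    ' Remove every character that is not a word character, a period,
    ' a dollar sign, or whitespace.
    ' Note: \s also keeps tabs, CR and LF, so line breaks survive.
    Dim regEx
    Set regEx = New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "[^\w\.\$\s]"
    StripNonChars = regEx.Replace(wrkstring, "")
End Function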


Visual image when I display my extracted string with response.write

This is what the underlying html source code looks like

SOLUTION

jmcg

This solution is only available to members.

aikimark

So, you are starting with this:

<br><br><br>Length 21:<br><br>Cleaned USD
49.40

 Daily<br>


and you need to parse what out of it or transform it into what?

please use code snippets rather than screen shots.

Overthere

ASKER

I just want to clarify: the <br> tags are mine, for displaying the results of my regex. They have nothing to do with what I am trying to achieve; I was merely echoing back my results, hence the <br>, for readability. The original string contained non-ASCII characters. Rendered in a browser, they display as small white boxes before and after "USD" and before "Daily". In the underlying HTML source, those white boxes show as "?". Running the string through my regex got rid of them, but the problem remains that I cannot split the result on white space.
After the regex, it displays USD 49.40 Daily, which looks good, but...
You would think it would just be a matter of using Split on space, but as I stated before, it groups USD 49.40 as the first item and Daily as the second item, when each should be its own item, i.e. item 1 = USD; item 2 = 49.40; item 3 = Daily.

I tried re-splitting the first item and it will not split. The result is: USD 49.40
If I do not have \s in my regex, I have no way of knowing where one item ends and the next begins.
What I need to do is break the line USD 49.40 Daily into 3 separate items, i.e. USD, 49.40 and Daily.
I included screenshots because the length of the entire string returns 21, and because if I do this:
strSplit = Split(StrExtractedData, " ")
it returns only two items, as stated above. (I have tried using Chr(32) instead of " " with the same results.) I think it is totally wacky that the length is 21 and yet the string is spread over two lines. What's with that? I thought images would help explain visually.
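
For reference, the HTML source quoted earlier in the thread suggests that the gap between USD and 49.40 is a line break rather than a space, which would explain why Split on " " never sees a separator there. A minimal sketch of one possible workaround, assuming VBScript and assuming the separators are ordinary whitespace (CR, LF, tab, space); the function name SplitOnAnyWhitespace and the sample string are hypothetical:

```vbscript
' Hypothetical helper: collapse every run of whitespace (spaces, tabs,
' CR, LF) into one space so that Split sees a single kind of separator.
Function SplitOnAnyWhitespace(wrkstring)
    Dim regEx, cleaned
    Set regEx = New RegExp
    regEx.Global = True
    regEx.Pattern = "\s+"
    ' After the replace, only plain spaces remain, so Trim is enough
    ' to drop any leading or trailing separator.
    cleaned = Trim(regEx.Replace(wrkstring, " "))
    SplitOnAnyWhitespace = Split(cleaned, " ")
End Function

' Sample input is an assumption modelled on the source shown in the thread.
Dim parts
parts = SplitOnAnyWhitespace("USD" & vbCrLf & "49.40" & vbCrLf & vbCrLf & " Daily")
' parts(0) = "USD", parts(1) = "49.40", parts(2) = "Daily"
```

If the separator turns out to be something that \s does not match in this regex engine, the pattern would need that character added explicitly.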

aikimark

So, you don't care about the Length 21:?

aikimark

Try this:

Cleaned\s([^\r]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>


aikimark

This is a simpler pattern:

\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>


aikimark

btw...my patterns are for matching, not replacing

ASKER CERTIFIED SOLUTION

aikimark

This solution is only available to members.

Overthere

ASKER

Thank you to everyone who responded to my question. I know inserting images may have been annoying, but it was the only way I could explain what I was seeing visually and the discrepancies between the items.
Here is what resolved it. Aikimark's postings helped, but I could not quite get his functions to work properly. They got me thinking, though, and here is how I resolved it:
I created one function and passed it a work pattern and the string. The first pattern stripped out the non-ASCII characters. I then split the returned result. I passed the first array item (the one where USD and 49.40 were grouped) with a pattern that removed everything but letters. That gave me the currency type. I then passed the first array item again with a new pattern that removed everything but digits and the period. The result gave me my dollar amount. The second array item from the split was exactly right, so it did not need any further processing.
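
Roughly, the steps above might look like the sketch below. This is a reconstruction under assumptions, not the exact code: the function name ApplyPattern, the variable names, the two follow-up patterns, and the sample string are all hypothetical.

```vbscript
' Hypothetical reconstruction: one function that removes whatever
' the supplied pattern matches.
Function ApplyPattern(wrkPattern, wrkstring)
    Dim regEx
    Set regEx = New RegExp
    regEx.Global = True
    regEx.Pattern = wrkPattern
    ApplyPattern = regEx.Replace(wrkstring, "")
End Function

Dim strExtractedData, cleaned, items, currencyType, amount, frequency

' Assumed sample: line breaks between USD and 49.40, one real space before Daily.
strExtractedData = "USD" & vbCrLf & "49.40" & vbCrLf & vbCrLf & " Daily"

' Step 1: strip anything that is not a word char, period, dollar sign or whitespace.
' (The sample is already clean, so this step changes nothing here.)
cleaned = ApplyPattern("[^\w\.\$\s]", strExtractedData)

' Step 2: split on the space; the first item still holds both USD and 49.40.
items = Split(cleaned, " ")

' Step 3: keep only letters, then only digits and the period.
currencyType = ApplyPattern("[^A-Za-z]", items(0))   ' "USD"
amount = ApplyPattern("[^0-9.]", items(0))           ' "49.40"
frequency = items(1)                                 ' "Daily"
```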
I know that the character between USD and 49.40 is not a space, nor is it an ASCII character. I sure would like to know what it really is; I am betting that it is some hexadecimal/octal representation.
Anyway, thank you all again. I did give points to jmcg for responding.
If this is not satisfactory, please let me know.
Thanks again! :)

aikimark

Open the HTML in Notepad++ (or similar) and view the hex values.
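
If a hex viewer is not handy, the character codes can also be dumped from script. A minimal sketch, assuming VBScript; the function name DumpCharCodes and the sample string are hypothetical:

```vbscript
' Hypothetical helper: return the code of every character in s as
' four hex digits, separated by spaces.
Function DumpCharCodes(s)
    Dim i, out
    out = ""
    For i = 1 To Len(s)
        ' AscW can return a negative number for codes above &H7FFF;
        ' taking the last four hex digits gives the same result either way.
        out = out & Right("0000" & Hex(AscW(Mid(s, i, 1))), 4) & " "
    Next
    DumpCharCodes = Trim(out)
End Function

Dim codes
codes = DumpCharCodes("USD" & vbCrLf & "49")
' codes = "0055 0053 0044 000D 000A 0034 0039"
```

In a dump like this, 000D and 000A are CR and LF, 0020 is a normal space, and 00A0 would be a non-breaking space.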

If you had posted the actual HTML, I could have adjusted my regex pattern to parse it properly. My pattern matched vbCrLf, but your actual HTML might have contained only lone vbLf or vbCr characters.
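
As an illustration of that adjustment, here is a sketch of the simpler pattern with the fixed \r\n replaced by [\r\n]+, so that CRLF, a lone LF, or a lone CR are all accepted. It assumes VBScript; the function name ParseMoneyCell_AnyEol and the sample string are hypothetical, and the real HTML may still differ.

```vbscript
' Hypothetical variant: same matching approach, but tolerant of
' CRLF, LF-only and CR-only line endings between the first two items.
Function ParseMoneyCell_AnyEol(wrkstring)
    Dim regEx, oMatches
    Set regEx = New RegExp
    regEx.Global = False
    regEx.Pattern = "\s([^\s]+)[\r\n]+([^\s]+)\s*(\w[^<]+?)<br>"
    Set oMatches = regEx.Execute(wrkstring)
    If oMatches.Count > 0 Then
        Set ParseMoneyCell_AnyEol = oMatches(0).SubMatches
    Else
        Set ParseMoneyCell_AnyEol = Nothing
    End If
End Function

' Assumed sample using LF-only line endings.
Dim sm
Set sm = ParseMoneyCell_AnyEol("Cleaned USD" & vbLf & "49.40" & vbLf & vbLf & " Daily<br>")
If Not sm Is Nothing Then
    ' sm(0) = "USD", sm(1) = "49.40", sm(2) = "Daily"
End If
```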

aikimark

It is also possible to return the submatches collection directly from the function, rather than assign values to a collection object.
Example:

Function ParseMoneyCell_SM(wrkstring)
    ' VBA: returns the SubMatches collection of the first match,
    ' or Nothing when the pattern does not match.
    Static regEx As Object      ' created once, reused on later calls
    Dim oMatches As Object
    
    If regEx Is Nothing Then
        Set regEx = CreateObject("vbscript.regexp")
        regEx.Global = False
        regEx.Pattern = "\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>"
    End If
    
    Set oMatches = regEx.Execute(wrkstring)
    If oMatches.Count > 0 Then
        Set ParseMoneyCell_SM = oMatches(0).SubMatches
    Else
        Set ParseMoneyCell_SM = Nothing
    End If

End Function


Invocation example:

Dim oSubmatches As Object
Dim oSM As Variant   ' SubMatches items are strings, so iterate with a Variant
Set oSubmatches = ParseMoneyCell_SM(x)   ' x holds the HTML text to parse
If Not oSubmatches Is Nothing Then
    For Each oSM In oSubmatches
        Debug.Print oSM
    Next
End If
