Overthere
Problem with non-ascii chars and regex
hi folks,
I am extracting some data from an HTML page, and it contains a character that I know is non-ASCII. I have been able to remove that character with a regex. When I display the cleaned string using response.write, it looks right: three items, each separated by one space. But when I view the underlying HTML source of my response.write output, it's whacky.
And when I try to split it on the spaces, I get only two items instead of three. The first two items stay grouped together, separated by what looks like a space, and the last item comes out on its own.
I checked the length. It will not split correctly; I need to split it on spaces, breaking the string into 3 parts.
How can I fix this?
I have inserted my code and attached images of the output.
Any help appreciated. Thank you!
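Function StripNonChars(wrkstring)
    ' Remove every character that is not a word character, ".", "$" or whitespace.
    ' Because \s is in the class, tabs and line breaks survive along with spaces.
    Dim regEx, tempTxt
    Set regEx = New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "[^\w\.\$\s]"
    tempTxt = regEx.Replace(wrkstring, "")
    StripNonChars = tempTxt
End Function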
[Image: the extracted string as displayed with response.write]
[Image: the underlying HTML source]
ASKER
I just want to clarify that the <br> in my code is only there for displaying the results of my regex. It has nothing to do with what I am trying to achieve; I was merely echoing back my results, hence the <br> for readability. The original string contained non-ASCII characters: rendered in a browser, they showed as small white boxes before and after "USD" and before "Daily". In the underlying HTML source, those white boxes showed as "?". Running the string through my regex got rid of them. But the problem remains that I cannot split the result on white space.
After the regex, it displays USD 49.40 Daily, which looks good, but...
You would think it would just be a matter of using Split on a space, but as I have stated before, it groups USD 49.40 as the first item and Daily as the second item, when each should be its own item, i.e. item 1 = USD; item 2 = 49.40; item 3 = Daily.
I tried re-splitting the first item, and it will not split. The result is: USD 49.40
If I do not have \s in my regex, I have no way of knowing where one item ends and the next begins.
What I need to do is break the line USD 49.40 Daily into 3 separate items, i.e. USD, 49.40 and Daily.
I included screenshots because the length of the entire string returns 21, and because if I do this:
strSplit = Split(StrExtractedData," ") - it returns only two items, as stated above. (I have also tried using chr(32) instead of " ", with the same results.) I think it is totally whacky that the length is 21 and yet the string is spread over two lines in the source. What's with that? I thought images would help explain it visually.
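One observation on the symptoms described above: the cleaning pattern [^\w\.\$\s] keeps every character that \s matches, and in VBScript's RegExp that is documented to include tab, carriage return and line feed as well as the plain space. A separator that survives the cleaning and still looks like a gap is therefore probably one of those, and Split(..., " ") will not break on it. The following is a minimal sketch, not a confirmed fix for this page: it collapses each run of whitespace to a single space before splitting. The function name SplitOnWhitespace and the sample string are assumptions made for illustration.

```vbscript
' Hypothetical helper: split on any run of whitespace, not just Chr(32).
Function SplitOnWhitespace(ByVal text)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "\s+"
    ' Collapse each whitespace run (spaces, tabs, CR, LF) to one space,
    ' trim the ends, then split on the single space that remains.
    SplitOnWhitespace = Split(Trim(re.Replace(text, " ")), " ")
End Function

' Assumed sample: a CR/LF between the first two items, as the
' two-line source view suggests. The real data may differ.
Dim parts
parts = SplitOnWhitespace("  USD" & vbCrLf & "49.40 Daily ")
' parts now holds three items: USD, 49.40 and Daily
```

If the real separator turns out to be something \s does not match, the pattern would need that character added explicitly.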
So, you don't care about the length of 21?
Try this:
Cleaned\s([^\r]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>
This is a simpler pattern:
\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>
btw...my patterns are for matching, not replacing
ASKER
Thank you to everyone who responded to my question. I know inserting images may have been annoying, but it was the only way I could explain what I was seeing and the discrepancies between the items.
Here is what resolved it. Aikimark's postings helped, but I could not quite get his functions to work properly. They got me thinking, though, and here is how I resolved it:
I created one function and passed it a work pattern and the string. The first pattern stripped out the non-ASCII characters. I then split the returned result. I passed the first array item (the one where USD 49.40 were grouped) back to the function with a pattern that strips everything but letters. That gave me the currency type. I then passed the first array item again with a new pattern that strips everything but digits and the period. The result gave me my dollar amount. The second array item from the split was exactly right, so it did not need any further processing.
I know that the character between USD and 49.40 is not a space, and I do not think it is an ASCII character. I sure would like to know what it really is; I am betting that it shows up as a hexadecimal/octal representation.
Anyway, thank you all again. I did give points to jmcg for responding. I
If this is not satisfactory, please let me know.
thanks again! :)
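For readers who want to see the resolution described above in code, here is a rough sketch of the two-pass approach. It is a reconstruction from the description, not the asker's actual code: the function name StripByPattern, the variable names and the sample input (with a CR/LF standing in for the unknown separator) are all assumptions.

```vbscript
' Hypothetical reconstruction of the "one function, several patterns" idea.
Function StripByPattern(ByVal pattern, ByVal text)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = pattern
    ' Delete every character the pattern matches.
    StripByPattern = re.Replace(text, "")
End Function

Dim raw, cleaned, parts, currencyCode, amount, period

' Assumed sample input; the separator in the real page is unknown.
raw = "USD" & vbCrLf & "49.40 Daily"

' Pass 1: the original cleaning pattern from the question.
cleaned = StripByPattern("[^\w\.\$\s]", raw)

' Split on a plain space: yields two items, as the asker observed.
parts = Split(cleaned, " ")

' Pass 2: keep only letters from the first item.
currencyCode = StripByPattern("[^A-Za-z]", parts(0))
' Pass 3: keep only digits and the period from the first item.
amount = StripByPattern("[^0-9.]", parts(0))
' The second item needs no further processing.
period = parts(1)
```

This works without ever identifying the separator, which is its appeal, but it would also merge digits from unrelated places if the first item ever contained more than one number.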
Open the HTML in Notepad++ (or similar) and view the hex values.
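The same check can be done from inside the script. The sketch below lists the code point of every character in a string, so the separator between USD and 49.40 can be read off directly. DumpCharCodes is a hypothetical helper name, and the sample call is only an illustration.

```vbscript
' Hypothetical helper: one line per character, as "position: U+XXXX".
Function DumpCharCodes(ByVal text)
    Dim i, ch, out
    out = ""
    For i = 1 To Len(text)
        ch = Mid(text, i, 1)
        ' AscW returns the UTF-16 code unit; Hex formats it, padded to 4 digits.
        out = out & i & ": U+" & Right("0000" & Hex(AscW(ch)), 4) & vbCrLf
    Next
    DumpCharCodes = out
End Function

Dim report
report = DumpCharCodes("A" & vbTab & "B")
' report lists U+0041, U+0009 and U+0042 on separate lines
```

In a classic ASP page the result could be written out inside a pre element, passed through Server.HTMLEncode, so that the line breaks in the report stay visible. U+0009, U+000A, U+000D or U+00A0 (tab, line feed, carriage return, non-breaking space) are common culprits, though only the dump of the real string will say which one is present.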
If you posted actual HTML, I could have adjusted my regex pattern to properly parse it. My pattern matched vbCrLf, but what was in your actual HTML might have been only vbLF or vbCr solo characters.
It is also possible to return the submatches collection directly from the function, rather than assign values to a collection object.
Example:
Function ParseMoneyCell_SM(wrkstring)
    ' VBA syntax: the regex object is created once and reused across calls
    Static regEx As Object
    Dim oMatches As Object
    If regEx Is Nothing Then
        Set regEx = CreateObject("vbscript.regexp")
        regEx.Global = False
        regEx.Pattern = "\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>"
    End If
    If regEx.test(wrkstring) Then
        ' Return the three captured groups of the first match
        Set oMatches = regEx.Execute(wrkstring)
        Set ParseMoneyCell_SM = oMatches(0).submatches
    Else
        Set ParseMoneyCell_SM = Nothing
    End If
End Function
Invocation example:
Dim oSubmatches As Object
Dim oSM As Variant
Set oSubmatches = ParseMoneyCell_SM(x)
' Print each captured group only when the pattern matched
If Not oSubmatches Is Nothing Then
    For Each oSM In oSubmatches
        Debug.Print oSM
    Next
End If
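Note that the example above is written in VBA: Static, the As Object type clauses and Debug.Print are not accepted by VBScript, which is what a classic ASP page runs. The following is a sketch of a VBScript adaptation using the same pattern. The name ParseMoneyCellVbs and the sample input are assumptions, and the pattern still expects a CR/LF pair between the first two items and a trailing <br>, which may not match the real page.

```vbscript
' Hypothetical VBScript adaptation of the VBA function above.
Function ParseMoneyCellVbs(ByVal wrkstring)
    Dim re, matches
    Set re = New RegExp
    re.Global = False
    re.Pattern = "\s([^\s]+)\r\n([^\s]+)\s*(\w[^<]+?)<br>"
    Set matches = re.Execute(wrkstring)
    If matches.Count > 0 Then
        ' SubMatches holds the three captured groups of the first match.
        Set ParseMoneyCellVbs = matches(0).SubMatches
    Else
        Set ParseMoneyCellVbs = Nothing
    End If
End Function

Dim subs, i, items(2)
' Assumed sample input shaped to fit the pattern.
Set subs = ParseMoneyCellVbs(" USD" & vbCrLf & "49.40 Daily<br>")
If Not subs Is Nothing Then
    For i = 0 To subs.Count - 1
        items(i) = subs(i)   ' USD, 49.40, Daily
    Next
End If
```

The regex object is rebuilt on every call here because VBScript has no Static; for a handful of calls per page that cost is unlikely to matter.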
And you need to parse what out of it, or transform it into what? Please use code snippets rather than screenshots.