Solved

VBScript RegEx Replace using Hex Values Fails

Posted on 2004-08-25
Last Modified: 2008-01-09
I am simply trying to do a search-and-replace on ‘special’ characters in large text files using VBScript.  I have easily accomplished this in PERL, but need to port it to VBScript for other users.

Upfront facts:
WSH Version 5.6
VBScript Version 5.6
WMI Version 1085.0005
ADSI Version 5,0,00,0

I am attempting to replace all ‘special’ characters in a text file with spaces.  (for testing I use an ! for easy visibility).

In PERL the following RegEx finds everything I want and works great:
perl -p -i.bak -e "s/[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]/ /g"

All documentation for VBScript says the above syntax for the pattern is EXACTLY the same for VBS, so here is my line in the VBScript:
regEx.Pattern = "[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]"

Well, it works great, but only on MOST of the ‘special’ characters I have identified above.  I have isolated what it does NOT find and replace.  Here is the list in *Hex* values, showing ranges (there is a detailed list at the end of this request):
80; 82-8C; 8E; 91-9C; 9E; 9F

I think it is a bug.  I have:
* thoroughly tested this
* a Hex editor to verify what's going on
* written code to build a detailed text file with all 255 characters (I will post if you would like it, no big deal really, just nice for testing)
* run the PERL one-liner against the file and it works fine
* tried to use the Oct search with the same results.

QUESTION
How can I rectify my problem and create a VBScript solution (it's close now)?

Here is my script:
------- script start below ----------
If WScript.Arguments.Count = 0 Then
      WScript.Echo  "Hello! No argument on the command line."
      WScript.Quit(0)
Else

dim zFile
zFile = WScript.Arguments(0)
call CleanMe(zFile)

WScript.Echo  "All done!"

WScript.Quit(0)

end if

Function CleanMe(filespec)

Dim fso, SourceFile, TempFile, Line, NewLine, regEx

Set fso = CreateObject("Scripting.FileSystemObject")
Set SourceFile = fso.OpenTextFile(filespec, 1, False, 0)
Set TempFile = fso.CreateTextFile(GetPath & "Temp", true)

Set regEx = New RegExp
regEx.Pattern = "[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]"
regEx.Global = True

Do While SourceFile.AtEndOfStream <> True
      Line = SourceFile.ReadLine
      NewLine = regEx.Replace(Line, "!")
      TempFile.WriteLine(NewLine)
Loop

TempFile.close
SourceFile.close

 
End Function

'*******************************
'**
'**  The GetPath Function to know where we are!
'**
'*******************************

Function GetPath
      ' Return path to the current script
      DIM path
      path = WScript.ScriptFullName  ' script file name
      GetPath = Left(path, InstrRev(path, "\"))
End Function

------- script end above -------



-  Lines with special characters that VBScript can’t find  -
chr(128) = € the Asc() value is = 128 The HEX is 80 Back to string from Hex is € The Oct is 200

chr(130) = ‚ the Asc() value is = 130 The HEX is 82 Back to string from Hex is ‚ The Oct is 202
chr(131) = ƒ the Asc() value is = 131 The HEX is 83 Back to string from Hex is ƒ The Oct is 203
chr(132) = „ the Asc() value is = 132 The HEX is 84 Back to string from Hex is „ The Oct is 204
chr(133) = … the Asc() value is = 133 The HEX is 85 Back to string from Hex is … The Oct is 205
chr(134) = † the Asc() value is = 134 The HEX is 86 Back to string from Hex is † The Oct is 206
chr(135) = ‡ the Asc() value is = 135 The HEX is 87 Back to string from Hex is ‡ The Oct is 207
chr(136) = ˆ the Asc() value is = 136 The HEX is 88 Back to string from Hex is ˆ The Oct is 210
chr(137) = ‰ the Asc() value is = 137 The HEX is 89 Back to string from Hex is ‰ The Oct is 211
chr(138) = Š the Asc() value is = 138 The HEX is 8A Back to string from Hex is Š The Oct is 212
chr(139) = ‹ the Asc() value is = 139 The HEX is 8B Back to string from Hex is ‹ The Oct is 213
chr(140) = Œ the Asc() value is = 140 The HEX is 8C Back to string from Hex is Œ The Oct is 214

chr(142) = Ž the Asc() value is = 142 The HEX is 8E Back to string from Hex is Ž The Oct is 216

chr(145) = ‘ the Asc() value is = 145 The HEX is 91 Back to string from Hex is ‘ The Oct is 221
chr(146) = ’ the Asc() value is = 146 The HEX is 92 Back to string from Hex is ’ The Oct is 222
chr(147) = “ the Asc() value is = 147 The HEX is 93 Back to string from Hex is “ The Oct is 223
chr(148) = ” the Asc() value is = 148 The HEX is 94 Back to string from Hex is ” The Oct is 224
chr(149) = • the Asc() value is = 149 The HEX is 95 Back to string from Hex is • The Oct is 225
chr(150) = – the Asc() value is = 150 The HEX is 96 Back to string from Hex is – The Oct is 226
chr(151) = — the Asc() value is = 151 The HEX is 97 Back to string from Hex is — The Oct is 227
chr(152) = ˜ the Asc() value is = 152 The HEX is 98 Back to string from Hex is ˜ The Oct is 230
chr(153) = ™ the Asc() value is = 153 The HEX is 99 Back to string from Hex is ™ The Oct is 231
chr(154) = š the Asc() value is = 154 The HEX is 9A Back to string from Hex is š The Oct is 232
chr(155) = › the Asc() value is = 155 The HEX is 9B Back to string from Hex is › The Oct is 233
chr(156) = œ the Asc() value is = 156 The HEX is 9C Back to string from Hex is œ The Oct is 234

chr(158) = ž the Asc() value is = 158 The HEX is 9E Back to string from Hex is ž The Oct is 236
chr(159) = Ÿ the Asc() value is = 159 The HEX is 9F Back to string from Hex is Ÿ The Oct is 237
Question by:McDougall
Accepted Solution

by:
amg42 earned 250 total points
ID: 11900267
This is due to the fact that VBScript uses Unicode internally.

The ANSI characters \x00-\x7f map directly to Unicode, and so do \xa0-\xff (assuming you're on an English (or at least Western European) system, i.e. code page 1252).

However, \x80-\x9f map to totally different code points. For instance, \x80 (the Euro sign) maps to Unicode code point \u20ac.

Changing your pattern to

   regEx.Pattern = "[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\u007f-\uffff]"

seems to do the trick.
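
As a rough illustration of why the widened range helps, here is a small sketch in Python rather than VBScript. Python's `re` module and its `cp1252` codec stand in for the VBScript RegExp object and for the file read; the sample bytes and the helper name `clean` are made up for the example. It also suggests that merely zero-padding `\xFF` to `\u00FF` would not be enough, because that names the same range.

```python
import re

# Pattern from the question: the last class stops at code point 0xFF.
ORIGINAL = re.compile(r"[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]")
# Zero-padded variant: \u00FF is the same code point as \xFF, so nothing changes.
PADDED = re.compile(r"[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\u007F-\u00FF]")
# Pattern from the answer: the last class runs to the top of the 16-bit range.
WIDENED = re.compile(r"[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\u007F-\uFFFF]")

# Hypothetical bytes on disk: a, TAB, 0x80 (Euro sign in cp1252), b, 0xE9, c.
raw = b"a\x09\x80b\xe9c"
# Decoding stands in for what happens when the script reads the file.
text = raw.decode("cp1252")


def clean(pattern, s):
    """Replace every match with '!' (the question's test replacement)."""
    return pattern.sub("!", s)


print(ascii(clean(ORIGINAL, text)))  # 'a!\u20acb!c'  (Euro sign survives)
print(ascii(clean(PADDED, text)))    # 'a!\u20acb!c'  (same range, same result)
print(ascii(clean(WIDENED, text)))   # 'a!!b!c'
```

The byte 0xE9 is caught by all three patterns because it keeps its value after decoding; only the byte 0x80, which becomes \u20ac, needs the wider range.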

Thanks for asking this question. I'm pretty experienced w.r.t. Unicode-related issues, but I never realized that it also affects the RegExp object in this way.

Author Comment

by:McDougall
ID: 11901433
amg42

Outstanding!

You are of course correct.

I really like to expand my understanding.  So... Could you please point me in a direction where I can understand why what you have done works?

"uses Unicode internally"  I'm not sure what that really means.

In the Windows Scripting Technologies help file I use, I don't see the \u option for specifying Unicode. I do now see it listed for JScript, however... (go figure)

So how does
\x7F become \u007F  when
\xFF becomes \uFFFF?

Seems the \x7F kinda makes sense with prefixing the  zeros, but the \xFF prefixing the Fs???
The only chart I've used for HEX values and the such is here:
http://www.simotime.com/asc2ebc1.htm

If you know of a chart I can reference it would be great, I'll see what I find on my own.

Thanks again, I'll be awarding you the points when I can figure out how :^)



Expert Comment

by:amg42
ID: 11902252
McDougall,

Unicode is a really big topic, and unfortunately I don't have a "this will tell you everything you will need" URL for you... http://www.unicode.org/unicode/standard/WhatIsUnicode.html is a good start, and there's a lot more info at that site as well.

A very, very, very brief intro to Unicode (specifically targeted to the issue you ran into):

Every string in VB(Script) contains Unicode characters. Unicode is basically a huge character set, where every character has a 16-bit value. I'm ignoring all kinds of details here, but this is basically what's relevant for this discussion :-)

For many things, Windows uses so-called "code pages", which contain 256 8-bit values. An English Windows install generally uses code page 1252, which contains the characters for Western-European languages. The characters you were reading from disk come from this code page (every character has an 8-bit code (from 0 to 255) that corresponds with a character in code page 1252).

For every character in every installed code page, Windows knows how to map it to Unicode. This is what happens when your code is reading a string from disk: every character that's read is mapped to its Unicode equivalent and appended to the string (well, it's a little bit more efficient than that, but this is conceptually what happens).

Code page 1252 is very similar to the first 256 characters of Unicode: characters 0-127 and 160-255 are identical in the two systems, so they will actually appear in the string.
However, the problem arises with characters 128-159 from code page 1252. These don't map to the Unicode characters 128-159, but to totally different ones: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT . Hence, character 128 (the Euro sign) actually becomes character 8364.

Your regexp is trying to replace occurrences of character 128, but won't find any (since they've all been turned into 8364's).
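
The mapping can be checked mechanically. The sketch below uses Python's built-in `cp1252` codec as a stand-in for the Windows conversion (an assumption; the helper names `moved_bytes` and `as_ranges` are invented for the example) and collects every byte from 0x80 to 0xFF whose Unicode code point differs from its byte value.

```python
def moved_bytes(codec="cp1252"):
    """Return byte values in 0x80-0xFF whose code point differs from the byte value."""
    moved = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            # Python's cp1252 table leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
            # undefined, so they are skipped here.
            continue
        if ord(ch) != b:
            moved.append(b)
    return moved


def as_ranges(values):
    """Collapse sorted integers into hex range strings such as '82-8C'."""
    ranges = []
    start = prev = values[0]
    for v in values[1:]:
        if v != prev + 1:
            ranges.append((start, prev))
            start = v
        prev = v
    ranges.append((start, prev))
    return [f"{a:02X}" if a == b else f"{a:02X}-{b:02X}" for a, b in ranges]


print(as_ranges(moved_bytes()))
# ['80', '82-8C', '8E', '91-9C', '9E-9F']
```

That is the same set of hex values the question reports as not being replaced. The five values missing from it (81, 8D, 8F, 90, 9D) are the ones Python's table treats as undefined.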

The move from "[\x7F-\xFF]" to "[\u007F-\uFFFF]" is sort of a move from a "code page approach" to a "Unicode approach". They both indicate a range of characters ("from here till the end"), but in different "contexts".
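
To make the two "contexts" concrete, here is a sketch of the code page approach, again in Python as a stand-in. It decodes the bytes as latin-1, which maps byte N to code point N for all 256 values, so the question's original pattern works unchanged. This is presumably close to what the PERL one-liner does when it treats the file as plain bytes. The function name `clean_bytewise` and the sample data are hypothetical.

```python
import re

# The question's original pattern, unchanged.
PATTERN = re.compile(r"[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]")


def clean_bytewise(raw: bytes) -> bytes:
    """Replace 'special' bytes with spaces without any code page remapping."""
    # latin-1 keeps every byte value as its code point, so 0x80 stays 0x80.
    text = raw.decode("latin-1")
    # Every replacement is a space, so encoding back to latin-1 cannot fail.
    return PATTERN.sub(" ", text).encode("latin-1")


sample = b"price \x80 5\x99 ok\n"
print(clean_bytewise(sample))  # b'price   5  ok\n'
```

The trade-off is that the text is never interpreted as characters at all, which is fine when the goal is only to blank out everything outside printable ASCII.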

Hmm, the above sounds a bit rambling. Sorry, can't come up with anything better at the moment... Hope it clarifies the topic at least a little...


BTW, it's indeed odd that "\uxxxx" isn't mentioned in the VBScript docs for RegExp. I found out about it by Googling for "VBScript RegExp Unicode", which gave me http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnclinic/html/scripting051099.asp .

Author Comment

by:McDougall
ID: 11902571
amg42

Thanks for the valuable information.  

It is a good start and has opened a new realm for me to research, greatly appreciated.

FWIW
I must have looked in haste....  The \u syntax is listed in the help file I have, typical for me I'm afraid.

have a GREAT day and thanks again....

McDougall
