VBScript RegEx Replace using Hex Values Fails

Posted on 2004-08-25
Last Modified: 2008-01-09
I am simply trying to do an edit/replace on ‘special’ characters in large text files using VBScript.  I have easily accomplished this in Perl, but need to port it to VBScript for other users.

Upfront facts:
WSH Version 5.6
VBScript Version 5.6
WMI Version 1085.0005
ADSI Version 5,0,00,0

I am attempting to replace all ‘special’ characters in a text file with spaces.  (for testing I use an ! for easy visibility).

In PERL the following RegEx finds everything I want and works great:
perl -p -i.bak -e "s/[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]/ /g"

All documentation for VBScript says the above syntax for the pattern is EXACTLY the same for VBS, so here is my line in the VBScript:
regEx.Pattern = "[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]"

Well, it works great, but only on MOST of the ‘special’ characters I have identified above.  I have isolated what it does NOT find and replace.  Here is the list in *Hex* values, showing ranges (there is a detailed list at the end of this request):
80; 82-8C;8E;91-9C;9E;9F

I think it is a bug.  I have:
* thoroughly tested this
* a Hex editor to verify what's going on
* written code to build a detailed text file with all 255 characters (I will post if you would like it, no big deal really, just nice for testing)
* run the PERL one-liner against the file and it works fine
* tried to use the Oct search with the same results.

How can I rectify my problem and create a VBScript solution (it's close now)?

Here is my script:
------- script start below ----------
If WScript.Arguments.Count = 0 Then
      WScript.Echo  "Hello! No argument on the command line."
Else
      Dim zFile
      zFile = WScript.Arguments(0)
      Call CleanMe(zFile)
      WScript.Echo  "All done!"
End If

Function CleanMe(filespec)

Dim fso, SourceFile, TempFile, Line, NewLine, regEx

Set fso = CreateObject("Scripting.FileSystemObject")
' 1 = ForReading, False = do not create the file, 0 = open as ANSI
Set SourceFile = fso.OpenTextFile(filespec, 1, False, 0)
' Write the cleaned copy next to the script, overwriting any old one
Set TempFile = fso.CreateTextFile(GetPath & "Temp", True)

Set regEx = New RegExp
regEx.Pattern = "[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]"
regEx.Global = True

' Replace the special characters line by line
Do While SourceFile.AtEndOfStream <> True
      Line = SourceFile.ReadLine
      NewLine = regEx.Replace(Line, "!")
      TempFile.WriteLine NewLine
Loop

SourceFile.Close
TempFile.Close

End Function

'**  The GetPath Function to know where we are!

Function GetPath
      ' Return path to the current script
      DIM path
      path = WScript.ScriptFullName  ' script file name
      GetPath = Left(path, InstrRev(path, "\"))
End Function

------- script end above -------

------- Lines with special characters that VBScript can’t find -------
chr(128) = € the Asc() value is = 128 The HEX is 80 Back to string from Hex is € The Oct is 200

chr(130) = ‚ the Asc() value is = 130 The HEX is 82 Back to string from Hex is ‚ The Oct is 202
chr(131) = ƒ the Asc() value is = 131 The HEX is 83 Back to string from Hex is ƒ The Oct is 203
chr(132) = „ the Asc() value is = 132 The HEX is 84 Back to string from Hex is „ The Oct is 204
chr(133) = … the Asc() value is = 133 The HEX is 85 Back to string from Hex is … The Oct is 205
chr(134) = † the Asc() value is = 134 The HEX is 86 Back to string from Hex is † The Oct is 206
chr(135) = ‡ the Asc() value is = 135 The HEX is 87 Back to string from Hex is ‡ The Oct is 207
chr(136) = ˆ the Asc() value is = 136 The HEX is 88 Back to string from Hex is ˆ The Oct is 210
chr(137) = ‰ the Asc() value is = 137 The HEX is 89 Back to string from Hex is ‰ The Oct is 211
chr(138) = Š the Asc() value is = 138 The HEX is 8A Back to string from Hex is Š The Oct is 212
chr(139) = ‹ the Asc() value is = 139 The HEX is 8B Back to string from Hex is ‹ The Oct is 213
chr(140) = Œ the Asc() value is = 140 The HEX is 8C Back to string from Hex is Œ The Oct is 214

chr(142) = Ž the Asc() value is = 142 The HEX is 8E Back to string from Hex is Ž The Oct is 216

chr(145) = ‘ the Asc() value is = 145 The HEX is 91 Back to string from Hex is ‘ The Oct is 221
chr(146) = ’ the Asc() value is = 146 The HEX is 92 Back to string from Hex is ’ The Oct is 222
chr(147) = “ the Asc() value is = 147 The HEX is 93 Back to string from Hex is “ The Oct is 223
chr(148) = ” the Asc() value is = 148 The HEX is 94 Back to string from Hex is ” The Oct is 224
chr(149) = • the Asc() value is = 149 The HEX is 95 Back to string from Hex is • The Oct is 225
chr(150) = – the Asc() value is = 150 The HEX is 96 Back to string from Hex is – The Oct is 226
chr(151) = — the Asc() value is = 151 The HEX is 97 Back to string from Hex is — The Oct is 227
chr(152) = ˜ the Asc() value is = 152 The HEX is 98 Back to string from Hex is ˜ The Oct is 230
chr(153) = ™ the Asc() value is = 153 The HEX is 99 Back to string from Hex is ™ The Oct is 231
chr(154) = š the Asc() value is = 154 The HEX is 9A Back to string from Hex is š The Oct is 232
chr(155) = › the Asc() value is = 155 The HEX is 9B Back to string from Hex is › The Oct is 233
chr(156) = œ the Asc() value is = 156 The HEX is 9C Back to string from Hex is œ The Oct is 234

chr(158) = ž the Asc() value is = 158 The HEX is 9E Back to string from Hex is ž The Oct is 236
chr(159) = Ÿ the Asc() value is = 159 The HEX is 9F Back to string from Hex is Ÿ The Oct is 237
Question by:McDougall

Accepted Solution

amg42 earned 250 total points
ID: 11900267
This is due to the fact that VBScript uses Unicode internally.

The ANSI characters \x00-\x7f map directly to Unicode, and so do \xa0-\xff (assuming you're on an English (or at least Western European) system, i.e. code page 1252).

However, \x80-\x9f map to totally different code points. For instance, \x80 (the Euro sign) maps to Unicode code point \u20ac.

Changing your pattern to

   regEx.Pattern = "[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\u007f-\uffff]"

seems to do the trick.
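
The effect can be reproduced outside VBScript. Below is a minimal sketch in Python (not VBScript; chosen only because its `cp1252` codec and `re` module make the mapping easy to inspect). The sample bytes and variable names are made up for illustration. It decodes a few code page 1252 bytes into a Unicode string, which is conceptually what the ANSI file read does, and then applies both patterns.

```python
import re

# Hypothetical sample: "A", Euro sign (0x80), "B", e-acute (0xE9), TAB, as cp1252 bytes.
raw = b"A\x80B\xe9\t"

# Decoding turns each 8-bit value into a Unicode character.
text = raw.decode("cp1252")

# Original pattern: the last range stops at U+00FF.
old_pattern = re.compile(r"[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\x7F-\xFF]")
# Amended pattern from the answer: the last range runs up to U+FFFF.
new_pattern = re.compile(r"[\x00-\x09]|[\x0B-\x0C]|[\x0E-\x1F]|[\u007f-\uffff]")

old_result = old_pattern.sub("!", text)
new_result = new_pattern.sub("!", text)

print(repr(old_result))  # the Euro sign (U+20AC) is left untouched
print(repr(new_result))  # every character outside printable ASCII is replaced
```

The e-acute is caught by both patterns because it keeps the same number (233) in Unicode; only the Euro sign slips past the original range.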

Thanks for asking this question. I'm pretty experienced w.r.t. Unicode-related issues, but I never realized that it also affects the RegExp object in this way.

Author Comment

ID: 11901433


You are of course correct.

I really like to expand my understanding.  So... Could you please point me in a direction where I can understand why what you have done works?

"uses Unicode internally"  I'm not sure what that really means.

In the Windows Scripting Technologies help file I use I don't see the \u option for specifying Unicode, I do see it now however listed for JScript...(go figure)

So how does
\x7F become \u007F  when
\xFF becomes \uFFFF?

Seems the \x7F kinda makes sense with prefixing the  zeros, but the \xFF prefixing the Fs???
The only chart I've used for HEX values and the such is here:

If you know of a chart I can reference it would be great, I'll see what I find on my own.

Thanks again, I'll be awarding you the points when I can figure out how :^)


Expert Comment

ID: 11902252

Unicode is a really big topic, and unfortunately I don't have a "this will tell you everything you will need" URL for you... is a good start, and there's a lot more info at that site as well.

A very, very, very brief intro to Unicode (specifically targeted to the issue you ran into):

Every string in VB(Script) contains Unicode characters. Unicode is basically a huge character set, where every character has a 16-bit value. I'm ignoring all kinds of details here, but this is basically what's relevant for this discussion :-)

For many things, Windows uses so-called "code pages", which contain 256 8-bit values. An English Windows install generally uses code page 1252, which contains the characters for Western-European languages. The characters you were reading from disk come from this code page (every character has an 8-bit code (from 0 to 255) that corresponds with a character in code page 1252).

For every character in every installed code page, Windows knows how to map it to Unicode. This is what happens when your code is reading a string from disk: every character that's read, is mapped to its Unicode equivalent, and appended to the string (well, it's a little bit more efficient than that, but this is conceptually what happens).

Code page 1252 is very similar to the first 256 characters of Unicode: characters 0-127 and 160-255 are identical in the two systems, so they will actually appear in the string.
However, the problem arises with characters 128-159 from code page 1252. These don't map to the Unicode characters 128-159, but to totally different ones. Hence, character 128 (the Euro sign) actually becomes character 8364.

Your regexp is trying to replace occurrences of character 128, but won't find any (since they've all been turned into 8364's).
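
To see exactly which characters move, here is a small sketch, again in Python rather than VBScript, that walks bytes 0x80-0x9F and records every one whose code page 1252 character lands on a different Unicode code point. The names `moved` and `undefined` are illustrative only. One assumption to flag: Python's `cp1252` codec treats five bytes in this range as undefined and raises an error, so they are collected separately; Windows itself is generally understood to pass those through to the same-numbered code points, which would explain why the original pattern did catch them.

```python
# Bytes in 0x80-0x9F whose cp1252 character is NOT the same-numbered code point.
moved = {}
# Bytes that Python's cp1252 codec leaves undefined (assumption noted above).
undefined = []

for byte in range(0x80, 0xA0):
    try:
        char = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(byte)
        continue
    if ord(char) != byte:
        moved[byte] = ord(char)

for byte, code_point in moved.items():
    print(f"0x{byte:02X} -> U+{code_point:04X} ({code_point})")

print("undefined:", [f"0x{b:02X}" for b in undefined])
```

The keys of `moved` come out as 80, 82-8C, 8E, 91-9C, 9E and 9F, the same hex list the question reports as not being replaced.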

The move from "[\x7F-\xFF]" to "[\u007F-\uFFFF]" is sort of a move from a "code page approach" to a "Unicode approach". They both indicate a range of characters ("from here till the end"), but in different "contexts".
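
The two "contexts" can be put side by side. This is a hedged Python sketch, not VBScript: the first pattern works on raw 8-bit values before any decoding (presumably what the Perl one-liner does, since it worked with the \x7F-\xFF range), and the second works on code points after decoding. The sample line is invented, and the ranges are merged into one character class for brevity.

```python
import re

# Hypothetical sample line: "x", Euro sign, "y", right single quote, "z" as cp1252 bytes.
raw = b"x\x80y\x92z"

# "Code page approach": match on the raw 8-bit values, so 0xFF is the end of the range.
byte_pattern = re.compile(rb"[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]")
cleaned_bytes = byte_pattern.sub(b" ", raw)

# "Unicode approach": decode first, then match on code points, so the range must reach U+FFFF.
text_pattern = re.compile(r"[\x00-\x09\x0B\x0C\x0E-\x1F\u007F-\uFFFF]")
cleaned_text = text_pattern.sub(" ", raw.decode("cp1252"))

print(cleaned_bytes)
print(repr(cleaned_text))
```

Both routes give the same cleaned line; they differ only in whether the upper bound is counted in bytes or in Unicode code points.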

Hmm, the above sounds a bit rambling. Sorry, can't come up with anything better at the moment... Hope it clarifies the topic at least a little...

BTW, it's indeed odd that "\uxxxx" isn't mentioned in the VBScript docs for RegExp. I found out about it by Googling for "VBScript RegExp Unicode", which gave me .

Author Comment

ID: 11902571

Thanks for the valuable information.  

It is a good start and has opened a new realm for me to research, greatly appreciated.

I must have looked in haste....  The \u syntax is listed in the help file I have, typical for me I'm afraid.

have a GREAT day and thanks again....


