UNICODE characters-Easy question

I am receiving network data that at times can be in unicode. The data is received as a byte array and converted to a string. I need a way of converting the unicode characters correctly.

For example:
月 is a two-byte Unicode character.

However, in ASCII it would look like this:
g (followed by a nonprinting character that cannot be pasted here)

I need a way to distinguish between ascii and unicode and convert the byte array into a string.

Mark_FreeSoftware

Option Explicit

Private Declare Function IsTextUnicode Lib "advapi32" (ByVal lpBuffer As String, ByVal cb As Long, lpi As Long) As Long

Private Sub Form_Load()
Dim strUnicode As String
Dim strNormal As String

strUnicode = StrConv("hallo", vbUnicode) 'create unicode text
If CBool(IsTextUnicode(strUnicode, Len(strUnicode), &H2)) Then 'see if the text is unicode
strNormal = StrConv(strUnicode, vbFromUnicode) 'if the text is unicode, convert
Else
strNormal = strUnicode 'else just assign
End If

debug.print strNormal 'print the text to the debug screen

End Sub

ASKER

Wow, that was a lot simpler than I thought... the problem is VB really doesn't like Unicode, does it?

When I try the example I gave above, I think I'm getting a "?" from the StrConv() function instead of a Unicode Japanese character.

You will not be able to see your Unicode string via Debug.Print, since it renders to an ANSI window. Also, VB intrinsic controls are not Unicode aware. You can use the Forms 2.0 Object Library controls if you have MS Office installed, or obtain third-party Unicode-aware controls. Set the control's Font to "Arial Unicode MS" and you should then see Japanese Unicode OK.

Note that in the following code the Unicode-aware control doesn't care whether you send it an ANSI or a Unicode string. The demo code shows a simple to/from byte conversion and dumps the bytes to the Immediate window so you can see what the bytes look like.

Option Explicit

Private Sub Form_Load()
Dim sTemp As String
Dim b() As Byte

'Chinese + Japanese
sTemp = "CHS: " & ChrW$(&H6B22) & ChrW$(&H8FCE) & ChrW$(26376)
b = sTemp 'Convert to byte array
DumpByteAray b 'Show bytes
sTemp = b 'Back to Unicode
UniLabel1.Caption = sTemp

sTemp = "Hello" 'ANSI string
b = sTemp
DumpByteAray b
sTemp = b
UniLabel2.Caption = sTemp

End Sub

Public Sub DumpByteAray(b() As Byte)
Dim i As Long
Debug.Print
For i = 0 To UBound(b)
Debug.Print i, b(i)
Next
End Sub

Mark_FreeSoftware

>>Wow, that was a lot simpler than I thought... the problem is VB really doesn't like Unicode, does it?

No, VB really doesn't like it.

When you have normal characters in a string, VB fills the second byte of each Unicode character with Chr(0) (that is good).

However, when you try to do a Debug.Print, VB cuts the string after the first character.

So when you work with Unicode, step through your code and make sure VB doesn't cut off your string when assigning!

ASKER

Sorry I haven't responded for a while (but I increased the points)... I haven't had time to test these solutions.

I am receiving a string of ASCII bytes that should be Unicode. I need to convert them to Unicode and write them to an HTML file so they can be viewed in a web browser. Right now the web browser is still showing ASCII characters with both methods. Any other ideas?

Can you show us a sample of your Unicode Byte Array so we can give you a correct conversion?
Also if you are writing this to HTML what kind of header are you using to tell the browser it is Unicode and how are you formatting the HTML strings?

To dump the byte array:
For i = 0 To UBound(b)
Debug.Print b(i)
Next

ASKER

this is a raw network packet that I convert to text:

for i = 54 to ubound(buff)
strtemp = strtemp & chr(i)
next i

I then extract the code between the html tags and paste it to an HTML file. The HTML has no header. If necessary I can output &#26376; (月) or similar tags, even if it is not the most space efficient.

ASKER

*correction:
paste should be print...

open thefile for append as #1
print #1, isolatedText
close #1

ASKER

When I use StrConv everything converts fine except the non-ASCII characters:
h e l l o becomes hello
h e l (a single Japanese character) o becomes hel??o

Do I have to use FSO with Unicode set to true in order to print this, or is there a way that won't involve changing hundreds of lines of code?

Mark_FreeSoftware

You can't convert text with Japanese characters in it, because the Japanese characters ARE Unicode.

ASKER

I think I explained that incorrectly... the text is hel(a Japanese character)o, which in Unicode is "h e l go "

ASKER

There's a nonstandard character after the g which I can't paste. The point is that's the ASCII string that I get, which I need to convert into Unicode... "h " should become "h" but the "g & other char" should become the Japanese character.

If that makes sense...

Several problems here:

1. for i = 54 to ubound(buff)
strtemp = strtemp & chr(i)
next i
I think you meant strtemp = strtemp & chr(buff(i))

Anyway this will never give you Unicode, just a string of ANSI chars. You must use ChrW and combine every 2 bytes to get Unicode.

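To do this from a byte array you will have to do something like this:
Dim iChar As Long
For i = 54 To UBound(buff) - 1 Step 2 'stop one short so buff(i + 1) stays in range
iChar = buff(i + 1) * 256& + buff(i) 'combine 2 bytes, low byte first; 256& forces Long math so bytes >= &H80 don't overflow
strtemp = strtemp & ChrW(iChar)
Next i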

2. open thefile for append as #1
print #1, isolatedText
close #1
'This works for ANSI but not for Unicode text. First you need to write a UTF-16 byte order mark (BOM), then write the Unicode string as a byte array using Put, which means opening the file for Binary access. The code to do this via FSO is much easier; if you need an FSO version, let me know.

Change code to this:
Public Sub UnicodeFile_Write_VB(ByVal sFileName As String, _
ByVal sText As String)

Dim FF As Long
Dim b() As Byte

On Error Resume Next
Kill sFileName
On Error GoTo 0
FF = FreeFile
Open sFileName For Binary Access Write As #FF
'InsertBOM
ReDim b(1)
b(0) = &HFF
b(1) = &HFE
Put #FF, , b
Erase b
'Convert Unicode string to byte array
b = sText
'Write to file
Put #FF, , b
Close #FF
End Sub

This should get you started. Take a look at your output file in Notepad first to see if you are now getting the Unicode characters.

For more in-depth info on using Unicode with Visual Basic 6, please read my tutorial at www.cyberactivex.com/UnicodeTutorialVb.htm
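
UnicodeFile_Write_VB above deletes and recreates the file on every call. For a log that keeps growing, a minimal native-VB sketch that appends instead might look like this. The name UnicodeFile_Append_VB is hypothetical, and the sketch assumes that any existing file is already UTF-16 little-endian with a BOM:

```vb
Option Explicit

' Hypothetical appending variant of UnicodeFile_Write_VB (not from the original post).
Public Sub UnicodeFile_Append_VB(ByVal sFileName As String, ByVal sText As String)
    Dim FF As Long
    Dim b() As Byte

    If LenB(sText) = 0 Then Exit Sub          ' nothing to write
    FF = FreeFile
    ' Binary mode creates the file if it does not exist yet
    Open sFileName For Binary Access Read Write As #FF
    If LOF(FF) = 0 Then
        ' New or empty file: start with the UTF-16 little-endian BOM (FF FE)
        ReDim b(0 To 1)
        b(0) = &HFF
        b(1) = &HFE
        Put #FF, 1, b                          ' file position is now just after the BOM
    Else
        Seek #FF, LOF(FF) + 1                  ' existing file: move to the end
    End If
    b = sText                                  ' string to UTF-16LE bytes
    Put #FF, , b                               ' write at the current position
    Close #FF
End Sub
```

Calling UnicodeFile_Append_VB "log.html", sChunk repeatedly should then grow one file with a single BOM at the start. Mixing this with earlier Print # output in the same file would produce a file with two encodings, so it only suits logs that are UTF-16 from the first byte.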

ASKER

1.
Yeah, that's what I meant... I'm not looking at the code as I type this; I've been over it so many times.

The thing that makes this so hard is that I'm dealing with network packets, which means most of the packet is ASCII and the data portion is Unicode. That's why I am parsing out a piece of the data (the part between html tags) and converting that from an ASCII string to a Unicode string. I can't use ChrW() to convert the array to a string because I need to be able to view nonprinted characters that Chr() converts for me. Once I have that string I can make a Unicode string.

I need a function that will convert this string of ASCII chars (which should be Unicode) into Unicode and then write it.

2.
The problem with writing files is that I am appending to them, so writing new Unicode files every time isn't going to work. The packets are being appended to HTML logs so that all the data can be viewed from any web browser on the network.

3.
Will functions like Replace() and Mid() work with Unicode characters?

1.

> I need a function that will convert this string of ascii chars (which should be unicode) into unicode and then write it.

Something like this then:
Dim sUniText As String
For i = 1 To Len(myString) - 1 Step 2 'stop one short so Mid(myString, i + 1, 1) is never empty
'256& forces Long math; Asc() * 256 overflows an Integer for values >= 128
sUniText = sUniText & ChrW(Asc(Mid(myString, i + 1, 1)) * 256& + Asc(Mid(myString, i, 1)))
Next

2.
> The problem with writing files is that I am appending to them, so writing new Unicode files every time isn't going to work. The packets are being appended to HTML logs so that all the data can be viewed from any web browser on the network.
Then you can use FSO, which works on Win98 or later and can also append.

Here is code:

Public Enum ForWriteEnum
ForWriting = 2
ForAppending = 8
End Enum
#If False Then 'PreserveEnumCase
Private ForWriting, ForAppending
#End If

Public Enum TristateEnum
TristateTrue = -1 'Opens the file as Unicode
TristateFalse = 0 'Opens the file as ASCII
TristateUseDefault = -2 'Use default system setting
End Enum
#If False Then 'PreserveEnumCase
Private TristateTrue, TristateFalse, TristateUseDefault
#End If

Public Sub UnicodeFile_Write_FSO( _
ByVal sFileName As String, _
ByVal vVar As Variant, _
Optional ByVal ForWrite As ForWriteEnum = ForWriting, _
Optional ByVal TriState As TristateEnum = TristateTrue, _
Optional ByVal bJoin As Boolean)

Dim objFSO As Object
Dim objStream As Object
Dim sText As String

Set objFSO = CreateObject("Scripting.FileSystemObject")
If (Not objFSO Is Nothing) Then
Set objStream = objFSO.opentextfile( _
sFileName, ForWrite, True, TriState)

If (Not objStream Is Nothing) Then
With objStream
If bJoin Then
sText = Join(vVar, vbCrLf)
Else
sText = vVar
End If
.Write sText
.Close
End With
Set objStream = Nothing
End If
Set objFSO = Nothing
End If
End Sub

3.
> Will functions like Replace() and Mid() work with Unicode characters?
Yes, no problem there, and Left and Right work as well, but NOT StrComp.
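
For point 2, if the existing logs are plain ANSI files written with Print #, one more option is to keep the file ASCII and write each non-ASCII character as a decimal numeric character reference (for example &#26376; for 月), which is the idea the asker raised earlier. A rough sketch follows. HtmlEncodeNonAscii is a hypothetical helper name, and the sketch assumes the text contains no characters outside the Basic Multilingual Plane, since surrogate pairs would need to be combined first:

```vb
Option Explicit

' Hypothetical helper: returns sText with every character above 7-bit ASCII
' replaced by a decimal numeric character reference such as &#26376;
' It does not escape <, > or &, because the input is assumed to be HTML already.
Public Function HtmlEncodeNonAscii(ByVal sText As String) As String
    Dim i As Long
    Dim lCode As Long
    Dim sOut As String

    For i = 1 To Len(sText)
        ' AscW returns a signed Integer; mask it into the range 0..65535
        lCode = AscW(Mid$(sText, i, 1)) And &HFFFF&
        If lCode < 128 Then
            sOut = sOut & ChrW$(lCode)
        Else
            sOut = sOut & "&#" & CStr(lCode) & ";"
        End If
    Next i
    HtmlEncodeNonAscii = sOut
End Function
```

The result is pure ASCII, so the existing Open ... For Append / Print # code could write it unchanged, for example Print #1, HtmlEncodeNonAscii(isolatedText). The cost is size: each encoded character takes up to eight bytes instead of two.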

Mark_FreeSoftware

>I need a function that will convert this string of ascii chars (which should be unicode) into unicode and then write it.

This string of ASCII chars IS Unicode,
but it looks weird because the VB interface displays it as if it were ASCII.

If you assign the string to a Unicode-aware control, it looks "normal" (without strange characters).

ASKER

You misunderstood what I meant...

I have an array of bytes: the first 100 of them are ASCII, the middle (could be any size) is Unicode, and the last 20 bytes are ASCII. Some of the Unicode in the middle is two-byte characters, but instead of being combined into a Unicode character, each byte is turned into an ASCII character. I need this to happen first because I need to read the first hundred bytes. After I parse the middle and extract the string I want to save, I need to turn this string of ASCII characters into Unicode.

From danaseaman's answer it seems that all I have to do is combine every two bytes using the ChrW() function. If I have a string, how would I do that?

While you may be able to process the middle portion of the string and convert it to Unicode it would be better if you processed the byte array a second time like this:

For i = 101 To UBound(buff) - 20 Step 2
iChar = buff(i + 1) * 256& + buff(i) 'combine 2 bytes; 256& forces Long math so bytes >= &H80 don't overflow
strUni = strUni & ChrW(iChar)
Next i

Another way to process the byte array:

Dim b() as byte
for i = 101 to ubound(buff)-20
b(i-101) = buff(i)
Next
strUni = b

Oops, you need to Redim b()

Dim b() as byte
Redim b(ubound(buff)-120)
for i = 101 to ubound(buff)-20
b(i-101) = buff(i)
Next
strUni = b
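
Both loops above assume the low byte of each character comes first (little-endian), which matches ASCII characters followed by a zero byte. If the data turned out to be high byte first instead, each pair would need swapping. A sketch of a helper that handles either order is below. BytesToUnicode and its parameters are hypothetical names, and which byte order the packets actually use is an assumption to verify against a byte dump:

```vb
Option Explicit

' Hypothetical helper: builds a string from the 16-bit values stored in
' buff(lFirst) .. buff(lLast). Set bBigEndian when the high byte comes first.
Public Function BytesToUnicode(buff() As Byte, ByVal lFirst As Long, _
        ByVal lLast As Long, Optional ByVal bBigEndian As Boolean = False) As String
    Dim b() As Byte
    Dim i As Long
    Dim n As Long

    n = (lLast - lFirst + 1) And Not 1&     ' ignore a trailing odd byte
    If n <= 0 Then Exit Function            ' returns an empty string
    ReDim b(0 To n - 1)
    For i = 0 To n - 1 Step 2
        If bBigEndian Then
            b(i) = buff(lFirst + i + 1)     ' VB strings store the low byte first
            b(i + 1) = buff(lFirst + i)
        Else
            b(i) = buff(lFirst + i)
            b(i + 1) = buff(lFirst + i + 1)
        End If
    Next i
    BytesToUnicode = b                      ' byte array to string, no conversion
End Function
```

With the example offsets used above, the call would be strUni = BytesToUnicode(buff, 101, UBound(buff) - 20). Copying bytes and assigning the array once also avoids building the string one character at a time.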

ASKER

The problem is 100 is just an example; it could be anywhere from 90 to 130, and any length. I need to process the ASCII string to find out. I need to either turn the string into a byte array or process the string...

Are all your ascii characters followed by a 0? If so then the entire string is Unicode.

ASKER

No, it truly is ASCII (null has meaning in network packets, but it's only a few characters)... however, the Unicode portion of the string contains mostly zeros except for the special characters.

Your logic is incorrect for the code in your last comment:
character 1000 would be stored in two bytes... 03 and E8...neither of those bytes is above 256 but they need to be combined to create character 1000

AscW returns character code for both bytes.

Is the starting point of the Unicode portion followed by a '0'? How about the last Unicode char: does it have a trailing '0'? Otherwise I don't see how you could differentiate between 2 consecutive ASCII chars and a Unicode char.

ASKER

Yes, but in the string code you are only converting one byte at a time:
schar = Mid(strMain,i,1)
lchar = Ascw(schar)
^ actually I'm not sure what that code does...

Yes it does: the beginning of the Unicode is "<html>" with trailing zeros after each character... the end is "</html>" with trailing zeros after each character.

ASKER

Is there something simpler than this in a loop:

uni = uni & ChrW(CLng("&H" & Hex(Asc(Mid(tmp, i, 1))) & Hex(Asc(Mid(tmp, i + 2, 1)))))

I have a feeling that's very repetitive...

String characters are always stored in VB as 2 bytes, so sChar holds both bytes and AscW gets the real Unicode value. Unfortunately VB Integers are signed, so you either have to convert the negatives to a positive Long or just check for >255 or <0.

Let me see if I can come up with a parser now that I know you have something to flag the beginning and end of Unicode.

ASKER

That would be great thanks...

I also worked out file writing with FSO enough to test whatever you come up with

'Try this, where b is the original byte array, sHeader is the ASCII before the Unicode, sFooter is the ASCII after the Unicode, and sUni is the Unicode portion only, including the html tags:

Dim sUni As String
Dim lStart As Long
Dim lEnd As Long
Dim sStart As String
Dim sEnd As String
Dim sHeader As String
Dim sFooter As String
sUni = StrConv(b, vbUnicode)
sStart = StrConv("<html>", vbUnicode)
sEnd = StrConv("</html>", vbUnicode)
lStart = InStr(1, sUni, sStart, vbTextCompare)
lEnd = InStr(1, sUni, sEnd, vbTextCompare)
If lStart > 0 And lEnd > 0 Then
sHeader = Left(sUni, lStart - 1)
Debug.Print sHeader
UniListBox1.AddItem sHeader
sFooter = Mid(sUni, lEnd + Len(sEnd))
Debug.Print sFooter
UniListBox1.AddItem sFooter
sUni = StrConv(Mid(sUni, lStart, lEnd - lStart + Len(sEnd)), vbFromUnicode)
Debug.Print sUni
UniListBox1.AddItem sUni

End If

Also check these links from my tutorial for details on how Vb stores strings:
http://www.cyberactivex.com/UnicodeTutorialVb.htm#MapString
http://www.cyberactivex.com/UnicodeTutorialVb.htm#Byte_Array

ASKER

Very close but not there yet... it either merges the previous character and the Unicode character into two unreadable characters, or the Unicode character appears as two Unicode characters, neither one the right one.

I am testing this using the working listview :) and adding two entries: one is what the string should look like (hard coded) and the other is having the parser parse the buff for the string... which it does find very well; it just doesn't convert it correctly yet.
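
One possible cause of the behaviour described above (not confirmed anywhere in this thread) is that StrConv pushes every byte through the system ANSI code page in both directions, which may not round-trip bytes above &H7F cleanly. A sketch that avoids StrConv by searching and copying raw bytes is below. ExtractUnicodeHtml is a hypothetical name, and the sketch assumes the tags are lowercase and stored low byte first, as the trailing zeros suggest:

```vb
Option Explicit

' Hypothetical helper: returns the text from <html> through </html> as a
' Unicode string, or "" if either tag is missing. Works on raw bytes only.
Public Function ExtractUnicodeHtml(buff() As Byte) As String
    Const START_TAG As String = "<html>"    ' held in memory as 3C 00 68 00 ...
    Const END_TAG As String = "</html>"
    Dim sRaw As String
    Dim lStart As Long
    Dim lEnd As Long

    sRaw = buff                              ' copies the bytes unchanged
    ' InStrB and MidB$ work on byte positions, so an odd-length header is fine
    lStart = InStrB(1, sRaw, START_TAG, vbBinaryCompare)
    If lStart = 0 Then Exit Function
    lEnd = InStrB(lStart, sRaw, END_TAG, vbBinaryCompare)
    If lEnd = 0 Then Exit Function
    ExtractUnicodeHtml = MidB$(sRaw, lStart, lEnd - lStart + LenB(END_TAG))
End Function
```

Usage would be sUni = ExtractUnicodeHtml(buff). If the result still shows the wrong characters, dumping the bytes of one known character (for example the two bytes of 月, &H67 and &H08) would show whether the pairs arrive high byte first and need swapping before this step.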

ASKER

Help me out with this and I'll award you points for both:
https://www.experts-exchange.com/questions/21933158/Unicode-Combo-Box.html