justchat_1

asked on

UNICODE characters-Easy question

I am receiving network data that at times can be in unicode.  The data is received as a byte array and converted to a string.  I need a way of converting the unicode characters correctly.

For example:
月 is a two byte unicode character

however in ascii it would look like this:
g

I need a way to distinguish between ascii and unicode and convert the byte array into a string.
Mark_FreeSoftware

Option Explicit

Private Declare Function IsTextUnicode Lib "advapi32" (ByVal lpBuffer As String, ByVal cb As Long, lpi As Long) As Long


Private Sub Form_Load()
Dim strUnicode As String
Dim strNormal As String

strUnicode = StrConv("hallo", vbUnicode)  'widen each character to two bytes: h, Chr(0), a, Chr(0), ...
If CBool(IsTextUnicode(strUnicode, Len(strUnicode), &H2)) Then    '&H2 = IS_TEXT_UNICODE_STATISTICS
   strNormal = StrConv(strUnicode, vbFromUnicode)     'if the text is Unicode, convert back to one character per pair of bytes
Else
   strNormal = strUnicode                    'else just assign
End If

Debug.Print strNormal                   'print the text to the debug screen

End Sub
justchat_1

ASKER

Wow, that was a lot simpler than I thought... the problem is VB really doesn't like Unicode, does it?

When I try the example I gave above, I think I'm getting a "?" from the StrConv() function instead of a Unicode Japanese character.
Dana Seaman
You will not be able to see your Unicode string via Debug.Print since it renders to an ANSI window. Also, VB intrinsic controls are not Unicode aware. You can use the Forms 2.0 Object Library controls if you have MS Office installed, or obtain 3rd party Unicode aware controls. Set the control's Font to "Arial Unicode MS" and you should then see Japanese Unicode OK.

Note that in the following code the Unicode aware control doesn't care whether you send it an ANSI or a Unicode string. The demo code shows simple to/from byte conversion and dumps the bytes to the Immediate window so you can see what the bytes look like.

Option Explicit

Private Sub Form_Load()
   Dim sTemp As String
   Dim b() As Byte

   'Chinese + Japanese
   sTemp = "CHS: " & ChrW$(&H6B22) & ChrW$(&H8FCE) & ChrW$(26376)
   b = sTemp 'Convert to byte array
   DumpByteAray b 'Show bytes
   sTemp = b 'Back to Unicode
   UniLabel1.Caption = sTemp

   sTemp = "Hello" 'ANSI string
   b = sTemp
   DumpByteAray b
   sTemp = b
   UniLabel2.Caption = sTemp

End Sub

Public Sub DumpByteAray(b() As Byte)
   Dim i As Long
   Debug.Print
   For i = 0 To UBound(b)
      Debug.Print i, b(i)
   Next
End Sub


>>wow that was alot simpler then I thought...the problem is vb really doesnt like unicode does it?

No, VB really doesn't like it.



When you have normal characters in a string, VB fills the second byte of each character with Chr(0) (that is good).

However, when you try to do a Debug.Print, VB cuts the string after the first character.



So when you work with Unicode, step through your code and make sure VB doesn't cut off your string when assigning!
Sorry I haven't responded for a while (but I increased the points)... I haven't had time to test these solutions:

I am receiving a string of ASCII bytes that should be Unicode. I need to convert them to Unicode and write them to an HTML file so they can be viewed in a web browser. Right now the web browser is still showing ASCII characters with both methods. Any other ideas?
Can you show us a sample of your Unicode Byte Array so we can give you a correct conversion?
Also if you are writing this to HTML what kind of header are you using to tell the browser it is Unicode and how are you formatting the HTML strings?


To dump the byte array:
   For i = 0 To UBound(b)
     Debug.Print b(i)
   Next

 
this is a raw network packet that I convert to text:

for i = 54 to ubound(buff)
     strtemp = strtemp & chr(i)
next i

I then extract the code between the html tags and paste it to an HTML file. The HTML has no header. If necessary I can output &#26376; (which renders as 月) or similar tags, even if that is not the most space efficient.
*correction:
paste should be print...

open thefile for append as #1
         print #1, isolatedText
close #1
When I use StrConv everything converts fine except the non-ASCII characters:
h e l l o becomes hello
h e l (a single Japanese char) o becomes hel??o

Do I have to use FSO with Unicode set to True in order to print this, or is there a way that won't involve changing hundreds of lines of code?
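
One way to keep the existing Print # logging untouched is to write every non-ASCII character as an HTML numeric character reference, which is plain ASCII on disk. The following is only a minimal sketch, assuming the text has already been turned into a proper VB Unicode string and contains only Basic Multilingual Plane characters (no surrogate pairs). The function name ToNumericEntities is made up for illustration.

```vb
Option Explicit

' Hypothetical helper: returns sText with every character above 127
' replaced by a decimal numeric character reference such as &#26376;
Public Function ToNumericEntities(ByVal sText As String) As String
   Dim i As Long
   Dim lCode As Long
   Dim sOut As String

   For i = 1 To Len(sText)
      ' AscW returns a signed Integer, so mask it into the range 0..65535
      lCode = AscW(Mid$(sText, i, 1)) And &HFFFF&
      If lCode < 128 Then
         sOut = sOut & ChrW$(lCode)            ' plain ASCII passes through
      Else
         sOut = sOut & "&#" & CStr(lCode) & ";"
      End If
   Next

   ToNumericEntities = sOut
End Function
```

The result contains only ASCII characters, so it can be appended with the existing Print #1 statement; the browser turns the references back into the original characters.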

You can't change text with Japanese chars in it because the Japanese chars ARE Unicode.

I think I explained that incorrectly... the text is hel(a Japanese character)o, which in Unicode is "h e l go ".
There's a nonstandard character after the g which I can't paste. The point is that's the ASCII string that I get, which I need to convert into Unicode... "h " should become "h", but the "g & other char" should become the Japanese character.

If that makes sense...
Several problems here:

1. for i = 54 to ubound(buff)
     strtemp = strtemp & chr(i)
   next i
   I think you meant strtemp = strtemp & chr(buff(i))
 
   Anyway, this will never give you Unicode, just a string of ANSI chars. You must use ChrW and combine every 2 bytes to get Unicode.


   To do this from a byte array you will have to do something like this:
   Dim iChar As Long
   For i = 54 To UBound(buff) - 1 Step 2
     'combine 2 bytes, low byte first; 256& forces Long math so a high byte >= 128 does not overflow
     iChar = buff(i + 1) * 256& + buff(i)
     strtemp = strtemp & ChrW(iChar)
   Next i

2. open thefile for append as #1
         print #1, isolatedText
   close #1
   'This works for ANSI but not for Unicode text. First you need to insert a Unicode UTF-16 BOM marker and then write the Unicode string as a byte array using Put, which means opening the file for Binary Write. The code to do this via FSO is much easier. If you need an FSO version, let me know.

   Change code to this:
Public Sub UnicodeFile_Write_VB(ByVal sFileName As String, _
   ByVal sText As String)

   Dim FF               As Long
   Dim b()              As Byte

   On Error Resume Next
   Kill sFileName
   On Error GoTo 0
   FF = FreeFile
   Open sFileName For Binary Access Write As #FF
   'InsertBOM
   ReDim b(1)
   b(0) = &HFF
   b(1) = &HFE
   Put #FF, , b
   Erase b
   'Convert Unicode string to byte array
   b = sText
   'Write to file
   Put #FF, , b
   Close #FF
End Sub  

This should get you started. Take a look at your output file in Notepad first to see if you are now getting the Unicode characters.

For more in depth info on Using Unicode with Visual Basic 6 please read my Tutorial at www.cyberactivex.com/UnicodeTutorialVb.htm

1.
Yeah, that's what I meant... I'm not looking at the code as I type this; I've been over it so many times.

The thing that makes this so hard is that I'm dealing with network packets, which means most of the packet is ASCII and the data portion is Unicode. That's why I am parsing out a piece of the data (the part between the HTML tags) and converting that from an ASCII string to a Unicode string. I can't use ChrW() to convert the array to a string because I need to be able to view nonprinted characters that Chr() converts for me. Once I have that string I can make a Unicode string.

I need a function that will convert this string of ASCII chars (which should be Unicode) into Unicode and then write it.

2.
The problem with writing files is that I am appending to them, so writing a new Unicode file every time isn't going to work. The packets are being appended to HTML logs so that all the data can be viewed from any web browser on the network.

3.
Will functions like Replace() and Mid() work with Unicode characters?
1.

I need a function that will convert this string of ascii chars (which should be unicode) into unicode and then write it.

Something like this then:
   Dim sUniText As String
   For i = 1 To Len(myString) - 1 Step 2
      'low byte first, then high byte; 256& forces Long math and avoids an Integer overflow
      sUniText = sUniText & ChrW(Asc(Mid(myString, i + 1, 1)) * 256& + Asc(Mid(myString, i, 1)))
   Next

2.
The problem with writing files is that I am appending them, so writing new unicode files every time isnt going to work.  The packets are being appended to HTML logs so that all the data can be viewed from any web browser on the network.
Then you can use FSO which works on Win98 or later, and can also append.

Here is code:

Public Enum ForWriteEnum
   ForWriting = 2
   ForAppending = 8
End Enum
#If False Then  'PreserveEnumCase
   Private ForWriting, ForAppending
#End If

Public Enum TristateEnum
   TristateTrue = -1        'Opens the file as Unicode
   TristateFalse = 0        'Opens the file as ASCII
   TristateUseDefault = -2  'Use default system setting
End Enum
#If False Then  'PreserveEnumCase
   Private TristateTrue, TristateFalse, TristateUseDefault
#End If

Public Sub UnicodeFile_Write_FSO( _
   ByVal sFileName As String, _
   ByVal vVar As Variant, _
   Optional ByVal ForWrite As ForWriteEnum = ForWriting, _
   Optional ByVal TriState As TristateEnum = TristateTrue, _
   Optional ByVal bJoin As Boolean)

   Dim objFSO           As Object
   Dim objStream        As Object
   Dim sText            As String

   Set objFSO = CreateObject("Scripting.FileSystemObject")
   If (Not objFSO Is Nothing) Then
      Set objStream = objFSO.opentextfile( _
         sFileName, ForWrite, True, TriState)

      If (Not objStream Is Nothing) Then
         With objStream
            If bJoin Then
               sText = Join(vVar, vbCrLf)
            Else
               sText = vVar
            End If
            .Write sText
            .Close
         End With
         Set objStream = Nothing
      End If
      Set objFSO = Nothing
   End If
End Sub
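
If FSO is not wanted, appending also seems possible with native binary file I/O. The sketch below is not part of the code above: the name UnicodeFile_Append_VB is hypothetical, and it assumes the log is a UTF-16 little-endian file that should get a BOM only when it is first created.

```vb
Option Explicit

' Hypothetical helper: appends sText as UTF-16LE bytes to sFileName,
' writing the FF FE byte order mark only when the file is new or empty.
Public Sub UnicodeFile_Append_VB(ByVal sFileName As String, _
   ByVal sText As String)

   Dim FF               As Integer
   Dim b()              As Byte
   Dim bom(0 To 1)      As Byte

   If LenB(sText) = 0 Then Exit Sub      ' nothing to write

   FF = FreeFile
   ' Binary mode creates the file if it does not exist and never truncates it
   Open sFileName For Binary Access Read Write As #FF
   If LOF(FF) = 0 Then
      bom(0) = &HFF
      bom(1) = &HFE
      Put #FF, 1, bom                    ' position is now just past the BOM
   Else
      Seek #FF, LOF(FF) + 1              ' move to the end of the existing data
   End If

   b = sText                             ' raw UTF-16LE bytes of the string
   Put #FF, , b
   Close #FF
End Sub
```

Calling it twice on the same path should leave one BOM followed by both pieces of text, which is what an append-only HTML log needs.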


3.
will functions like replace() and mid() work with unicode characters??
Yes, no problem there, and the same goes for Left and Right, but NOT StrComp.



>I need a function that will convert this string of ascii chars (which should be unicode) into unicode and then write it.

This string of ASCII chars IS Unicode,
but it looks weird because the VB interface displays it as if it were ASCII.


If you assign the string to a Unicode aware control, it looks "normal" (without strange characters).
You misunderstood what I meant...

I have an array of bytes: the first 100 of them are ASCII, the middle (could be any size) is Unicode, and the last 20 bytes are ASCII. Some of the Unicode in the middle is two-byte characters, but instead of being combined into one Unicode character, each byte is turned into an ASCII character. I need this to happen first because I need to read the first hundred bytes. After I parse the middle and extract the string I want to save, I need to turn this string of ASCII characters into Unicode.

From danaseaman's answer it seems that all I have to do is combine every two bytes using the ChrW() function. If I have a string, how would I do that?
While you may be able to process the middle portion of the string and convert it to Unicode it would be better if you processed the byte array a second time like this:


   For i = 101 To UBound(buff) - 20 Step 2
     'combine 2 bytes, low byte first; 256& forces Long math and avoids an Integer overflow
     iChar = buff(i + 1) * 256& + buff(i)
     strUni = strUni & ChrW(iChar)
   Next i

Another way to process the byte array:

   Dim b() as byte
   for i = 101 to ubound(buff)-20
      b(i-101) = buff(i)
   Next
   strUni = b

Oops, you need to Redim b()

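   Dim b() As Byte
   'bytes 101 .. UBound(buff)-20 are UBound(buff)-120 bytes, so the last index is UBound(buff)-121
   ReDim b(UBound(buff) - 121)
   For i = 101 To UBound(buff) - 20
      b(i - 101) = buff(i)
   Next
   strUni = b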
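
The copy loop above can be wrapped so that the start and end offsets are parameters instead of the example values 101 and 20. This is only a sketch under the assumption that the chosen range really holds UTF-16 little-endian data; the function name BytesToUnicode and its parameters are invented for illustration.

```vb
Option Explicit

' Hypothetical helper: returns bytes lFirst..lLast of buff as a VB string,
' treating them as UTF-16 little-endian data (no code page conversion).
Public Function BytesToUnicode(buff() As Byte, _
   ByVal lFirst As Long, ByVal lLast As Long) As String

   Dim b() As Byte
   Dim i As Long

   If lLast < lFirst Then Exit Function   ' empty range, return ""

   ReDim b(0 To lLast - lFirst)
   For i = lFirst To lLast
      b(i - lFirst) = buff(i)
   Next

   BytesToUnicode = b                      ' byte array assigned straight to a String
End Function
```

For example, with the bytes &H08, &H67, &H6F, &H00 at positions 1 to 4, BytesToUnicode(buff, 1, 4) yields ChrW$(&H6708) followed by "o".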
The problem is 100 is just an example; it could be anywhere from 90 to 130 and any length. I need to process the ASCII string to find out. I need to either turn the string into a byte array or process the string...
Are all your ascii characters followed by a 0? If so then the entire string is Unicode.
ASKER CERTIFIED SOLUTION
Dana Seaman
No, it truly is ASCII (null has meaning in network packets, but it's only a few characters)... however, the Unicode portion of the string contains mostly zeros except for the special characters.

Your logic is incorrect for the code in your last comment:
character 1000 would be stored in two bytes... 03 and E8... neither of those bytes on its own is above 255, but they need to be combined to create character 1000.
AscW returns character code for both bytes.

Is the starting point of the Unicode portion followed by a '0'? How about the last Unicode char does it have a trailing '0'?  Otherwise I don't see how you could differentiate between 2 consecutive ascii chars and a Unicode char.
Yes, but in the string code you are only converting one byte at a time:
schar = Mid(strMain,i,1)
lchar = Ascw(schar)
^ Actually I'm not sure what that code does...

Yes it does: the beginning of the Unicode is "<html>" with trailing zeros after each character... the end is "</html>" with trailing zeros after each character.
Is there something simpler than this in a loop:

uni = uni & ChrW(CLng("&H" & Hex(Asc(Mid(tmp, i, 1))) & Hex(Asc(Mid(tmp, i + 2, 1)))))

I have a feeling that's very repetitive...
String chars are always stored in VB as 2 bytes, so sChar holds both bytes and AscW gets the real Unicode value. Unfortunately VB Integers are signed, so you either have to convert the negatives to a positive Long or just check for >255 or <0.

Let me see if I can come up with a parser now that I know you have something to flag the beginning and end of Unicode.
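
The signed Integer issue can be sidestepped by masking the AscW result into a Long. A small sketch, with the helper name CodePoint made up for illustration:

```vb
Option Explicit

' Hypothetical helper: returns the UTF-16 code unit of the first character
' of sChar as a non-negative Long in the range 0..65535.
Public Function CodePoint(ByVal sChar As String) As Long
   ' AscW returns a signed Integer (negative for &H8000 and above);
   ' And-ing with the Long constant &HFFFF& widens it and clears the sign extension
   CodePoint = AscW(sChar) And &HFFFF&
End Function
```

With this, a test such as CodePoint(sChar) > 255 works for every character, including those that AscW alone reports as negative.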
That would be great thanks...

I also worked out file writing with FSO enough to test whatever you come up with
'Try this, where b is the original byte array, sHeader is the ASCII before the Unicode, sFooter is the ASCII after the Unicode, and sUni is the Unicode portion only, including the HTML tags:

   Dim sUni As String
   Dim lStart As Long
   Dim lEnd As Long
   Dim sStart As String
   Dim sEnd As String
   Dim sHeader As String
   Dim sFooter As String
   sUni = StrConv(b, vbUnicode)
   sStart = StrConv("<html>", vbUnicode)
   sEnd = StrConv("</html>", vbUnicode)
   lStart = InStr(1, sUni, sStart, vbTextCompare)
   lEnd = InStr(1, sUni, sEnd, vbTextCompare)
   If lStart > 0 And lEnd > 0 Then
      sHeader = Left(sUni, lStart - 1)
      Debug.Print sHeader
      UniListBox1.AddItem sHeader
      sFooter = Mid(sUni, lEnd + Len(sEnd))
      Debug.Print sFooter
      UniListBox1.AddItem sFooter
      sUni = StrConv(Mid(sUni, lStart, lEnd - lStart + Len(sEnd)), vbFromUnicode)
      Debug.Print sUni
      UniListBox1.AddItem sUni

   End If

Also check these links from my tutorial for details on how Vb stores strings:
http://www.cyberactivex.com/UnicodeTutorialVb.htm#MapString
http://www.cyberactivex.com/UnicodeTutorialVb.htm#Byte_Array
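
A possible alternative to the StrConv round trip is to search the raw bytes directly with the byte-oriented string functions InStrB and MidB$, so that no ANSI code page conversion is involved at any point. This is a sketch only: it assumes the tags appear in the packet exactly as lowercase UTF-16 little-endian "<html>" and "</html>", and the function name ExtractUnicodeHtml is hypothetical.

```vb
Option Explicit

' Hypothetical helper: finds the UTF-16LE block from <html> to </html>
' inside a mixed ASCII/Unicode byte array and returns it as a VB string.
' Returns "" when either tag is missing.
Public Function ExtractUnicodeHtml(b() As Byte) As String
   Const START_TAG As String = "<html>"
   Const END_TAG As String = "</html>"
   Dim sRaw As String
   Dim lStart As Long
   Dim lEnd As Long

   sRaw = b   ' raw byte copy into a String; nothing is converted

   ' InStrB works on byte positions, so the block may start at an odd offset
   lStart = InStrB(1, sRaw, START_TAG, vbBinaryCompare)
   If lStart = 0 Then Exit Function
   lEnd = InStrB(lStart, sRaw, END_TAG, vbBinaryCompare)
   If lEnd = 0 Then Exit Function

   ' MidB$ copies bytes, which are already the UTF-16LE form VB uses internally
   ExtractUnicodeHtml = MidB$(sRaw, lStart, lEnd - lStart + LenB(END_TAG))
End Function
```

Because the search is binary, the tag match is case sensitive; the header and footer can still be read from the byte array separately using the byte positions found here.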

Very close, but not there yet... it either merges the previous character and the Unicode character into two unreadable characters, or the Unicode character appears as two Unicode characters, neither one the right one.

I am testing this using the working listview :) and adding two entries: one is what the string should look like (hard coded) and the other has the parser parse the buff for the string... which it does find very well; it just doesn't convert it correctly yet.
Help me out with this and I'll award you points for both:
https://www.experts-exchange.com/questions/21933158/Unicode-Combo-Box.html