Solved

UNICODE characters-Easy question

Posted on 2006-06-23
Last Modified: 2013-12-03
I am receiving network data that at times can be in Unicode.  The data is received as a byte array and converted to a string.  I need a way of converting the Unicode characters correctly.

For example:
月 is a two-byte Unicode character

however, in ASCII it would look like this:
g

I need a way to distinguish between ASCII and Unicode and convert the byte array into a string.
Question by:justchat_1
 
LVL 13

Expert Comment

by:Mark_FreeSoftware
Option Explicit

Private Declare Function IsTextUnicode Lib "advapi32" (ByVal lpBuffer As String, ByVal cb As Long, lpi As Long) As Long


Private Sub Form_Load()
Dim strUnicode As String
Dim strNormal As String

strUnicode = StrConv("hallo", vbUnicode)  'create unicode text
If CBool(IsTextUnicode(strUnicode, Len(strUnicode), &H2)) Then    'see if the text is unicode
   strNormal = StrConv(strUnicode, vbFromUnicode)     'if the text is unicode, convert
Else
   strNormal = strUnicode                    'else just assign
End If

Debug.Print strNormal                   'print the text to the Immediate window

End Sub
 
LVL 9

Author Comment

by:justchat_1
Wow, that was a lot simpler than I thought... the problem is VB really doesn't like Unicode, does it?

When I try the example I gave above, I think I'm getting a "?" from the StrConv() function instead of a Unicode Japanese character.
 
LVL 22

Expert Comment

by:danaseaman
You will not be able to see your Unicode string via Debug.Print since it is rendering to an ANSI window. Also, VB intrinsic controls are not Unicode aware. You can use Forms 2.0 Object Library controls if you have MS Office installed, or obtain third-party Unicode-aware controls. Set the control's Font to "Arial Unicode MS" and you should then see Japanese Unicode OK.

Note that in the following code the Unicode-aware control doesn't care whether you send it an ANSI or a Unicode string. The demo code shows simple to/from byte conversion and dumps the bytes to the Immediate window so you can see what the bytes look like.

Option Explicit

Private Sub Form_Load()
   Dim sTemp As String
   Dim b() As Byte

   'Chinese + Japanese
   sTemp = "CHS: " & ChrW$(&H6B22) & ChrW$(&H8FCE) & ChrW$(26376)
   b = sTemp 'Convert to byte array
   DumpByteAray b 'Show bytes
   sTemp = b 'Back to Unicode
   UniLabel1.Caption = sTemp

   sTemp = "Hello" 'ANSI string
   b = sTemp
   DumpByteAray b
   sTemp = b
   UniLabel2.Caption = sTemp

End Sub

Public Sub DumpByteAray(b() As Byte)
   Dim i As Long
   Debug.Print
   For i = 0 To UBound(b)
      Debug.Print i, b(i)
   Next
End Sub


 
LVL 13

Expert Comment

by:Mark_FreeSoftware
>>wow that was alot simpler then I thought...the problem is vb really doesnt like unicode does it?

No, VB really doesn't like it.

When you have normal characters in a string, VB fills the Unicode with Chr(0) (that is good).

However, when you try to do a Debug.Print, VB cuts the string after the first character.

So when you work with Unicode, step through your code and make sure VB doesn't cut off your string when assigning!
 
LVL 9

Author Comment

by:justchat_1
Sorry I haven't responded for a while (but I increased the points)... I haven't had time to test these solutions:

I am receiving a string of ASCII bytes that should be Unicode. I need to convert them to Unicode and write them to an HTML file so they can be viewed in a web browser.  Right now the web browser is still showing ASCII characters with both methods.  Any other ideas?
 
LVL 22

Expert Comment

by:danaseaman
Can you show us a sample of your Unicode Byte Array so we can give you a correct conversion?
Also if you are writing this to HTML what kind of header are you using to tell the browser it is Unicode and how are you formatting the HTML strings?


To dump the byte array:
   For i = 0 To UBound(b)
     Debug.Print b(i)
   Next

 
 
LVL 9

Author Comment

by:justchat_1
this is a raw network packet that I convert to text:

for i = 54 to ubound(buff)
     strtemp = strtemp & chr(i)
next i

I then extract the code between the HTML tags and paste it to an HTML file.  The HTML has no header. If necessary I can output &#26376; (月) or similar numeric references, even if that is not the most space-efficient.
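
One way to write non-ASCII characters into an ASCII-only HTML log is as numeric character references (for example &#26376; for 月). The sketch below is only an illustration: the function name ToHtmlEntities is hypothetical, and it assumes the input is already a proper VB (UTF-16) string with no characters outside the Basic Multilingual Plane.

```vb
Option Explicit

' Hypothetical helper: replace every character outside 7-bit ASCII
' with a numeric character reference so the result can be appended
' to a plain ASCII log file with Print #.
' Assumption: no surrogate pairs (each half would get its own entity).
Public Function ToHtmlEntities(ByVal sText As String) As String
   Dim i As Long
   Dim lCode As Long
   Dim sOut As String

   For i = 1 To Len(sText)
      ' AscW returns a signed Integer; the mask maps it to 0..65535
      lCode = AscW(Mid$(sText, i, 1)) And &HFFFF&
      If lCode < 128 Then
         sOut = sOut & ChrW$(lCode)
      Else
         sOut = sOut & "&#" & CStr(lCode) & ";"
      End If
   Next
   ToHtmlEntities = sOut
End Function
```

For example, ToHtmlEntities("hel" & ChrW$(26376) & "o") gives hel&#26376;o, which stays pure ASCII and so can be appended with the existing Print # code.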
 
LVL 9

Author Comment

by:justchat_1
*correction:
paste should be print...

open thefile for append as #1
         print #1, isolatedText
close #1
 
LVL 9

Author Comment

by:justchat_1
When I use StrConv everything converts fine except the non-ASCII characters:
h e l l o becomes hello
h e l (a single Japanese char) o becomes hel??o

Do I have to use FSO with Unicode set to True in order to print this, or is there a way that won't involve changing hundreds of lines of code?
 
LVL 13

Expert Comment

by:Mark_FreeSoftware

You can't change text with Japanese chars in it because the Japanese chars ARE Unicode.

 
LVL 9

Author Comment

by:justchat_1
I think I explained that incorrectly... the text is hel(a Japanese character)o, which in Unicode is "h e l go "
 
LVL 9

Author Comment

by:justchat_1
There's a nonstandard character after the g which I can't paste.  The point is that's the ASCII string that I get, which I need to convert into Unicode... "h " should become "h" but the "g & other char" should become the Japanese character.

If that makes sense...
 
LVL 22

Expert Comment

by:danaseaman
Several problems here:

1. for i = 54 to ubound(buff)
     strtemp = strtemp & chr(i)
   next i
   I think you meant strtemp = strtemp & chr(buff(i))
 
   Anyway this will never give you Unicode, just a string of ANSI chars. You must use ChrW and combine every 2 bytes to get Unicode.


   To do this from a byte array you will have to do something like this:
   Dim iChar As Long
   For i = 54 To UBound(buff) - 1 Step 2
     'combine 2 bytes (low byte first); 256& forces Long math so a high byte of &H80 or more does not overflow
     iChar = buff(i + 1) * 256& + buff(i)
     strtemp = strtemp & ChrW(iChar)
   Next i

2. open thefile for append as #1
         print #1, isolatedText
   close #1
   'This works for ANSI but not for Unicode text. First you need to insert a Unicode UTF-16 BOM and then write the Unicode string as a byte array using Put, which means opening the file for Binary write. The code to do this via FSO is much easier. If you need an FSO version, let me know.

   Change code to this:
Public Sub UnicodeFile_Write_VB(ByVal sFileName As String, _
   ByVal sText As String)

   Dim FF               As Long
   Dim b()              As Byte

   On Error Resume Next
   Kill sFileName
   On Error GoTo 0
   FF = FreeFile
   Open sFileName For Binary Access Write As #FF
   'InsertBOM
   ReDim b(1)
   b(0) = &HFF
   b(1) = &HFE
   Put #FF, , b
   Erase b
   'Convert Unicode string to byte array
   b = sText
   'Write to file
   Put #FF, , b
   Close #FF
End Sub  

This should get you started. Take a look at your output file in Notepad first to see if you are now getting the Unicode characters.

For more in depth info on Using Unicode with Visual Basic 6 please read my Tutorial at www.cyberactivex.com/UnicodeTutorialVb.htm
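
The byte-pair loop from point 1 can be wrapped in a small reusable function. This is a minimal sketch, not the poster's own code: the name Utf16LeToString and its parameters are hypothetical, and it assumes the range holds UTF-16 little-endian data (low byte first), as in the examples in this thread.

```vb
Option Explicit

' Hypothetical helper: build a VB string from UTF-16 little-endian
' bytes in buff(lFirst..lLast). A trailing odd byte is ignored.
Public Function Utf16LeToString(buff() As Byte, _
   ByVal lFirst As Long, ByVal lLast As Long) As String

   Dim i As Long
   Dim sOut As String

   For i = lFirst To lLast - 1 Step 2
      ' 256& forces Long arithmetic; Byte * 256 is Integer math and
      ' overflows when the high byte is &H80 or above
      sOut = sOut & ChrW$(buff(i + 1) * 256& + buff(i))
   Next
   Utf16LeToString = sOut
End Function
```

With the packet layout described earlier, the call would look like strtemp = Utf16LeToString(buff, 54, UBound(buff)).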

 
LVL 9

Author Comment

by:justchat_1
1.
Yeah, that's what I meant... I'm not looking at the code as I type this; I've been over it so many times.

The thing that makes this so hard is that I'm dealing with network packets, which means most of the packet is ASCII and the data portion is Unicode.  That's why I am parsing out a piece of the data (the part between HTML tags) and converting that from an ASCII string to a Unicode string.  I can't use ChrW() to convert the array to a string because I need to be able to view nonprinted characters that Chr() converts for me.  Once I have that string I can make a Unicode string.

I need a function that will convert this string of ASCII chars (which should be Unicode) into Unicode and then write it.

2.
The problem with writing files is that I am appending to them, so writing new Unicode files every time isn't going to work.  The packets are being appended to HTML logs so that all the data can be viewed from any web browser on the network.

3.
Will functions like Replace() and Mid() work with Unicode characters?
 
LVL 22

Expert Comment

by:danaseaman
1.

>I need a function that will convert this string of ascii chars (which should be unicode) into unicode and then write it.

Something like this then:
   Dim sUniText As String
   Dim i As Long
   'Combine each pair of ANSI chars into one Unicode char (low byte first); 256& avoids Integer overflow
   For i = 1 To Len(myString) - 1 Step 2
      sUniText = sUniText & ChrW(Asc(Mid$(myString, i + 1, 1)) * 256& + Asc(Mid$(myString, i, 1)))
   Next

2.
>The problem with writing files is that I am appending them, so writing new unicode files every time isnt going to work.  The packets are being appended to HTML logs so that all the data can be viewed from any web browser on the network.

Then you can use FSO, which works on Win98 or later and can also append.

Here is code:

Public Enum ForWriteEnum
   ForWriting = 2
   ForAppending = 8
End Enum
#If False Then  'PreserveEnumCase
   Private ForWriting, ForAppending
#End If

Public Enum TristateEnum
   TristateTrue = -1        'Opens the file as Unicode
   TristateFalse = 0        'Opens the file as ASCII
   TristateUseDefault = -2  'Use default system setting
End Enum
#If False Then  'PreserveEnumCase
   Private TristateTrue, TristateFalse, TristateUseDefault
#End If

Public Sub UnicodeFile_Write_FSO( _
   ByVal sFileName As String, _
   ByVal vVar As Variant, _
   Optional ByVal ForWrite As ForWriteEnum = ForWriting, _
   Optional ByVal TriState As TristateEnum = TristateTrue, _
   Optional ByVal bJoin As Boolean)

   Dim objFSO           As Object
   Dim objStream        As Object
   Dim sText            As String

   Set objFSO = CreateObject("Scripting.FileSystemObject")
   If (Not objFSO Is Nothing) Then
      Set objStream = objFSO.opentextfile( _
         sFileName, ForWrite, True, TriState)

      If (Not objStream Is Nothing) Then
         With objStream
            If bJoin Then
               sText = Join(vVar, vbCrLf)
            Else
               sText = vVar
            End If
            .Write sText
            .Close
         End With
         Set objStream = Nothing
      End If
      Set objFSO = Nothing
   End If
End Sub


3.
>will functions like replace() and mid() work with unicode characters??

Yes, no problem there; the same goes for Left and Right, but NOT StrComp.

 
LVL 13

Expert Comment

by:Mark_FreeSoftware


>I need a function that will convert this string of ascii chars (which should be unicode) into unicode and then write it.

This string of ASCII chars IS Unicode,
but it looks weird because the VB interface displays it as if it were ASCII.


If you assign the string to a Unicode-aware control, it looks "normal" (without strange characters).
 
LVL 9

Author Comment

by:justchat_1
You misunderstood what I meant...

I have an array of bytes: the first 100 of them are ASCII, the middle (could be any size) is Unicode, and the last 20 bytes are ASCII.  Some of the Unicode in the middle is two-byte characters, but instead of combining them into a Unicode character each byte is turned into an ASCII character.  I need this to happen first because I need to read the first hundred bytes.  After I parse the middle and extract the string I want to save, I need to turn this string of ASCII characters into Unicode.

From danaseaman's answer it seems that all I have to do is combine every two bytes using the ChrW() function.  If I have a string, how would I do that?
 
LVL 22

Expert Comment

by:danaseaman
While you may be able to process the middle portion of the string and convert it to Unicode, it would be better to process the byte array a second time like this:


   'assumes buff is 0-based, so bytes 0..99 are the ASCII header
   For i = 100 To UBound(buff) - 20 Step 2
     iChar = buff(i + 1) * 256& + buff(i) 'combine 2 bytes; 256& avoids Integer overflow
     strUni = strUni & ChrW(iChar)
   Next i

 
LVL 22

Expert Comment

by:danaseaman
Another way to process the byte array:

   'assumes buff is 0-based, so bytes 0..99 are the ASCII header
   Dim b() As Byte
   For i = 100 To UBound(buff) - 20
      b(i - 100) = buff(i)
   Next
   strUni = b

 
LVL 22

Expert Comment

by:danaseaman
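Oops, you need to ReDim b()

   'assumes buff is 0-based, so bytes 0..99 are the ASCII header
   Dim b() As Byte
   ReDim b(UBound(buff) - 120)
   For i = 100 To UBound(buff) - 20
      b(i - 100) = buff(i)
   Next
   strUni = b 'every 2 bytes become one Unicode character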
 
LVL 9

Author Comment

by:justchat_1
The problem is 100 is an example; it could be anywhere from 90 to 130 and any length.  I need to process the ASCII string to find out. I need to either turn the string into a byte array or process the string...
 
LVL 22

Expert Comment

by:danaseaman
Are all your ascii characters followed by a 0? If so then the entire string is Unicode.
 
LVL 22

Accepted Solution

by:
danaseaman earned 500 total points
'Extract Unicode portion from string
   Dim strMain As String
   Dim strUni  As String
   Dim sChar   As String
   Dim lChar   As Long
   Dim i       As Long

   strMain = buff 'each pair of bytes becomes one character
   For i = 1 To Len(strMain)
      sChar = Mid$(strMain, i, 1)
      lChar = AscW(sChar) 'signed: codes above &H7FFF come back negative
      If lChar > 255 Or lChar < 0 Then
         strUni = strUni & sChar
      End If
   Next

'or direct from byte array (use instead of the block above)

   Dim i       As Long
   Dim bLen    As Long
   Dim strUni  As String

   bLen = UBound(buff)
   For i = 1 To bLen Step 2
      If (buff(i) > 0) Then 'high byte is non-zero, so must be Unicode
         strUni = strUni & ChrW(buff(i) * 256& + buff(i - 1)) '256& avoids Integer overflow
      End If
   Next
 
LVL 9

Author Comment

by:justchat_1
No, it truly is ASCII (null has meaning in network packets, but it's only a few characters)... however, the Unicode portion of the string contains mostly zeros except for the special characters.

Your logic is incorrect for the code in your last comment:
character 1000 would be stored in two bytes... 03 and E8... neither of those bytes is above 256, but they need to be combined to create character 1000
 
LVL 22

Expert Comment

by:danaseaman
AscW returns the character code for both bytes.

Is the starting point of the Unicode portion followed by a '0'? How about the last Unicode char does it have a trailing '0'?  Otherwise I don't see how you could differentiate between 2 consecutive ascii chars and a Unicode char.
 
LVL 9

Author Comment

by:justchat_1
Yes, but in the string code you are only converting one byte at a time:
schar = Mid(strMain,i,1)
lchar = Ascw(schar)
^ actually I'm not sure what that code does...

Yes it does: the beginning of the Unicode is "<html>" with trailing zeros after each character... the end is "</html>" with trailing zeros after each character.
 
LVL 9

Author Comment

by:justchat_1
Is there something simpler than this in a loop:

uni = uni & ChrW(CLng("&H" & Hex(Asc(Mid(tmp, i, 1))) & Hex(Asc(Mid(tmp, i + 2, 1)))))

I have a feeling that's very repetitive...
 
LVL 22

Expert Comment

by:danaseaman
String chars are always stored in VB as 2 bytes, thus sChar holds both bytes and AscW gets the real Unicode value. Unfortunately VB Integers are signed, so you either have to convert the negatives to a positive Long or just check >255 or <0.

Let me see if I can come up with a parser now that I know you have something to flag the beginning and end of the Unicode.
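
As a small illustration of converting the signed AscW result to a positive Long, masking with &HFFFF& is one common approach. The helper name AscWU is hypothetical.

```vb
' Hypothetical helper: return the code of the first character of
' sChar as a value from 0 to 65535 instead of a signed Integer.
Public Function AscWU(ByVal sChar As String) As Long
   ' The Integer from AscW is widened to Long, then the mask
   ' clears the sign-extended upper 16 bits.
   AscWU = AscW(sChar) And &HFFFF&
End Function
```

For instance, AscW(ChrW$(&H8FCE)) is -28722, while AscWU(ChrW$(&H8FCE)) is 36814 (&H8FCE).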
 
LVL 9

Author Comment

by:justchat_1
That would be great thanks...

I also worked out file writing with FSO enough to test whatever you come up with
 
LVL 22

Expert Comment

by:danaseaman
'Try this, where b is the original byte array, sHeader is the ASCII before the Unicode, sFooter is the ASCII after the Unicode, and sUni is the Unicode portion only, including the HTML tags:

   Dim sUni As String
   Dim lStart As Long
   Dim lEnd As Long
   Dim sStart As String
   Dim sEnd As String
   Dim sHeader As String
   Dim sFooter As String
   sUni = StrConv(b, vbUnicode)
   sStart = StrConv("<html>", vbUnicode)
   sEnd = StrConv("</html>", vbUnicode)
   lStart = InStr(1, sUni, sStart, vbTextCompare)
   lEnd = InStr(1, sUni, sEnd, vbTextCompare)
   If lStart > 0 And lEnd > 0 Then
      sHeader = Left(sUni, lStart - 1)
      Debug.Print sHeader
      UniListBox1.AddItem sHeader
      sFooter = Mid(sUni, lEnd + Len(sEnd))
      Debug.Print sFooter
      UniListBox1.AddItem sFooter
      sUni = StrConv(Mid(sUni, lStart, lEnd - lStart + Len(sEnd)), vbFromUnicode)
      Debug.Print sUni
      UniListBox1.AddItem sUni

   End If

Also check these links from my tutorial for details on how Vb stores strings:
http://www.cyberactivex.com/UnicodeTutorialVb.htm#MapString
http://www.cyberactivex.com/UnicodeTutorialVb.htm#Byte_Array
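
An alternative that avoids the StrConv round trip is to search the raw bytes directly with the byte-oriented string functions InStrB and MidB. This is only a sketch under assumptions: the function name ExtractHtml is hypothetical, the tags are assumed to arrive in lower case as UTF-16 little-endian (binary compare is case sensitive), and the Unicode block is assumed to contain an even number of bytes.

```vb
Option Explicit

' Hypothetical helper: return the UTF-16LE text from "<html>" through
' "</html>" found in the raw packet bytes, or "" if either tag is missing.
Public Function ExtractHtml(buff() As Byte) As String
   Dim sRaw As String
   Dim sStart As String
   Dim sEnd As String
   Dim lStart As Long
   Dim lEnd As Long

   sRaw = buff               ' copies the bytes as-is, no conversion
   sStart = "<html>"         ' held in memory as 3C 00 68 00 ...
   sEnd = "</html>"

   ' InStrB returns a 1-based byte position, not a character position
   lStart = InStrB(1, sRaw, sStart, vbBinaryCompare)
   If lStart = 0 Then Exit Function
   lEnd = InStrB(lStart, sRaw, sEnd, vbBinaryCompare)
   If lEnd = 0 Then Exit Function

   ' MidB counts in bytes, so the slice starts exactly on the "<"
   ' even when the ASCII header has an odd length
   ExtractHtml = MidB$(sRaw, lStart, lEnd - lStart + LenB(sEnd))
End Function
```

Because the slice is cut on a byte boundary rather than a character boundary, the byte pairs inside it stay aligned with the start tag, so each pair maps to one character in the result.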

 
LVL 9

Author Comment

by:justchat_1
Very close but not there yet... it either merges the previous character and the Unicode character into two unreadable characters, or the Unicode character appears as two Unicode characters, neither one the right one.

I am testing this using the working listview :) and adding two entries: one is what the string should look like (hard coded) and the other is having the parser parse the buff for the string... which it does find very well; it just doesn't convert it correctly yet.