Link to home
Start Free TrialLog in
Avatar of SiebPaul
SiebPaul

asked on

Recommended Books/Blogs for reading large unicode fixed format flat files using VBA

I have a number of reports that have been captured to file with embedded unicode names  (Chinese, Japanese) mixed with numbers and English that I would like to read into an Excel spreadsheets. I cannot get delimited files, so I'm stuck with the fixed format files.

I've been testing the following:
Open sFileName For Binary Access Read along with MidB
Using the File System Object
Using ADODB

I will need to process large files line at a time, so I cannot read the whole file at once.

I've looked through Safari Online for books, but have not been able to find anything useful. I'd really like some books and/or blogs that would also compare the different tools available to process large flat files.

Thanks,
SiebPaul
Avatar of Jacques Bourgeois (James Burger)
Jacques Bourgeois (James Burger)
Flag of Canada image

What did not work with the File System Object? Maybe you did not use it properly.
There is an article that shows how at http://www.devx.com/tips/Tip/22282
Avatar of aikimark
Can you read a line at a time?
You should be able to access the field data in the line, using the Mid() and MidB() functions to parse the line data.
Avatar of SiebPaul
SiebPaul

ASKER

More testing shows the issue to be with file encoding. I have a test file with the following:
101      ARA: مـرحبــاً
102      CHS: 欢迎
103      CHT: æ­¡è¿Ž
104      ENG: Welcome
I believe the first characters before 101 identify the file type. I saved this using NotePad with UTF-8 encoding. ReadLine read all four lines into a single string:
¿¿¿¿¿›¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿›¿¿¿¿¿¿¿¿¿¿¿¿
StrConv of the above produced the string appears as four lines in a cell.
"101      ARA: مـرحبــاً
102      CHS: 欢迎
103      CHT: æ­¡è¿Ž
104      ENG: Welcome"
Saved with encoding "Unicode" and "Unicode big endian" gives me the desired results in four cells so I could further parse the data.

NotePad++ shows the files that work as UCS-2 Bid Endian and UCS-2 Little Endian.

How can I change the encoding (other that opening the file in NotePad and saving with a different encoding)?

Test files and test script attached.
UNICODE TEXT CHANGED AFTER SUBMITTING MY COMMENT!
Thanks,
Paul
Test-Unicode.txt
Test-UnicodeBigEndian.txt
Test-UTF8.txt
Sub-fs-Read.txt
The problem might not be with reading the characters. It can very well be of displaying them.

You need a font that can display them. I won't be able to give you precise information, my experience with a similar problem dates from many years back and was at a customer office, so I do not have the code or the documentation.

We needed to display Chinese in a few fields in a VB6 application. Standard fonts such as Arial and Times New Roman do not have Chinese characters, at least, not the set of characters that we needed for the data at hand (I could not say for Japanese). We thus looked around for a font that could and found that there was a Chinese font installed on all the users computers by Corel Draw. We simply set the controls that needed to display Chinese to that font and that solved the problem.

Hoping that your Chinese and English are not mixed in the same fields, you could try to find a Chinese font and set the columns that display these fields to that font.

You might also search for IME in your specific version of Excel. In my 2003 edition, help refers to a IME package that can be downloaded to enable you to type Chinese and Japanese characters with a standard keyboard. I suppose that it also probably installs the necessary fonts to display in these languages.
I just went back to the source of my test file:
Unicode Test Strings

The following is a paste of the first four lines of the test string"Welcome" in several languages which may not display correctly (these are attachments in a previous message):

Resource
ID      "Welcome"UTF-16 Unicode      ResourceID      "Welcome"UTF-8 Encoded
101      "ARA: ¿¿¿¿¿¿¿¿¿"      151      "ARA: مـرحبــاً"
102      "CHS: ¿¿"      152      "CHS: 欢迎"
103      "CHT: ¿¿"      153      "CHT: æ­¡è¿Ž"
104      "ENG: Welcome"      154      "ENG: Welcome"

The UTF-16 encoding displays correctly on my computer. I think the light bulb just went off: UTF-8 displays as a single-byte (8 bits) character set while the UTF-16 (2 bytes, 16 bits) enforces the use of double-byte character sets required for unicode support of languages like Arabic, Chinese, and Japanese.

So, I guess my issue boils down to using VBA opening a flat text file and specifying that it is a double-byte character set. I think that fact the UTF-8 can be used to for both single and multiple byte character sets is confusing (according UTF-8 wiki page.
 = EF BB BF = UTF-8 coding (see http://en.wikipedia.org/wiki/Byte_order_mark)
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8       As Long = 65001

Sub fsReadLine()
   Dim buff() As Byte
   Dim sFileName As String
   sFileName = "E:\Test-UTF8.txt"
   Open sFileName For Binary Access Read As #1
      ReDim buff(FileLen(sFileName) - 4)
      ' read file to buffer
      ' skip first 3 bytes (BOM)
      Dim b(2) As Byte: Get #1, , b
      ' read real content into buffer
      Get #1, , buff
   Close #1
'convert UTF8 bytes into unicode string
   Dim s As String
   s = UTF8BytesToWinString(buff)
   Worksheets("Test").Cells(5, 1) = s
End Sub


Public Function UTF8BytesToWinString(ByRef bytes() As Byte) As String
   Dim iStrSize    As Long
   Dim s           As String
   iStrSize = UBound(bytes) + 1
   s = String$(iStrSize, 0&)
   iStrSize = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(bytes(0)), &HFFFF, StrPtr(s), iStrSize)
   UTF8BytesToWinString = Left$(s, iStrSize - 1)
End Function

Open in new window

I'm getting a subscript out of range error in the above code on:
ReDim buff(FileLen(sFileName) - 4)

I put a MsgBox before the line to display the value being used: 78.

There's no Preserve, so why am I getting this error?

Also, one of my files is 15MB. What is the upper practical limit on reading whole file into an array?
It means that file length<4 bytes, ie it's empty (3 bytes for UTF8 BOM + at least 1 byte info = 4 byte)
Add If logif to check this. Smth like this
Sub fsReadLine()
   Dim buff() As Byte
   Dim sFileName As String
   sFileName = "D:\Test-UTF8.txt"
   Dim fLen As Integer
   fLen = FileLen(sFileName) 
   If fLen < 4 Then
      MsgBox("Empty file " & sFileName)
      Exit Sub
   End If
   Open sFileName For Binary Access Read As #1
      ReDim buff(fLen- 4)
      ' skip first 3 bytes (BOM)
      ' and read read file content into buffer
      Get #1, 4 , buff
   Close #1
'convert UTF8 bytes into unicode string
   Dim s As String
   s = UTF8BytesToWinString(buff)
   Worksheets("Test").Cells(5, 1) = s
End Sub

Open in new window


As for file size - array size can be up to Long type limit ie ~ 2GB
I added:  MsgBox ("File Length: " & fLen)
The file length is 82.
I'm still getting: Run-time error '9': Subscript out of range
I can't see a reason for the message.
Here's the code I'm running:
Option Explicit
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8       As Long = 65001


Sub fsReadLine()
   Dim buff() As Byte
   Dim sFileName As String
   sFileName = "D:\TestUTF8.txt"
   Dim fLen As Integer
   fLen = FileLen(sFileName)
   If fLen < 4 Then
      MsgBox ("Empty file " & sFileName)
      Exit Sub
   End If
   MsgBox ("File Length: " & fLen)
   Open sFileName For Binary Access Read As #1
      ReDim buff(fLen - 4)
      ' skip first 3 bytes (BOM)
      ' and read read file content into buffer
      Get #1, 4, buff
   Close #1
'convert UTF8 bytes into unicode string
   Dim s As String
   s = UTF8BytesToWinString(buff)
   Worksheets("Test").Cells(5, 1) = s
End Sub
Public Function UTF8BytesToWinString(ByRef bytes() As Byte) As String
   Dim iStrSize    As Long
   Dim s           As String
   iStrSize = UBound(bytes) + 1
   s = String$(iStrSize, 0&)
   iStrSize = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(bytes(0)), &HFFFF, StrPtr(s), iStrSize)
   UTF8BytesToWinString = Left$(s, iStrSize - 1)
End Function

Open in new window

You'r still getting error on Redim buff(fLen-4) line?? It's strange.
BTW, sorry for error
Dim fLen As Integer should be Dim fLen As Long to work with huge files (>65K)
I ran a short test:
Sub Buff_Test()
  Dim buff() As Byte
  ReDim buff(78)
  MsgBox "Test Done!"
End Sub

Open in new window

which worked.

I renamed buff to aBuff and it worked. Changed the name back to buff and it worked. I think Murphy was visiting.

This demonstrated that UTF8BytesToWinString works for converting an array of the whole file into a string that is written to to a single cell.

I'm going to revisit my original code sample to see if I can integrate this function into line by line processing of a file. I'll post an update later today.
If u'r working with files max 15MB size I suggest reading whole file at once. Multiple accessing HDD takes more time than 1. Even with huge files of some GB IMHO it's better read content into byte array first (using chunks to show progress), then convert it to string and only then, if you need, split this string into lines (lines = Split(fileContent,vbCrLf))
I'm finally done what I started out to do: read a unicode fixed width text file and parse it into columns. However, l still have a couple of questions. The complete Excel VBA program is included in this message below. The unicode test file is attached, because pasting the file contents into this comment is not accurate. Please note that I included the column "Len -- LenB" to confirm that I'm processing unicode strings (see IsStringUnicode).

1. The MidB function includes a space in front fields after the first one. The Trim() fixes this issue, but why is the space there?

2. When I first started parsing the string into columns, I thought that I would simply have to double the Mid start position for the MidB start position, because the string length in bytes is double the string length in characters (since UTF-8 characters can be more than two bytes, this will not always be true). However, it turns out that you simply add the previous MidB field's start position and length in bytes to get the next field's MidB start position. This is logical, but what did I miss about not doubling the previous Mid start position?


Option Explicit
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8       As Long = 65001

Sub fsReadLine()
   Dim sFileContents As String
   Dim byteArray() As Byte
   Dim strArray() As String
   Dim sFileContentsName As String
   
   Dim fLen As Long
   Dim i As Long
   Dim nR As Long 'Row counter
   Dim nLower As Long, nUpper As Long
 
   Worksheets("Sheet1").Cells.ClearContents
   
   sFileContentsName = "D:\TestUTF8.txt"
   fLen = FileLen(sFileContentsName)
   If fLen < 4 Then
      MsgBox ("Empty file " & sFileContentsName)
      Exit Sub
   End If
   
   Open sFileContentsName For Binary Access Read As #1
      ReDim byteArray(fLen - 4)
      ' skip first 3 bytes (BOM)
      ' and read read file content into buffer, or byteArray
      Get #1, 4, byteArray
   Close #1
      
   'convert UTF8 bytes into unicode string
   sFileContents = UTF8BytesToWinString(byteArray)
   
   'Split file contents into an array of strings
   strArray = Split(sFileContents, vbCrLf)
   
   Call SetColumnLabels
   
   nLower = LBound(strArray)
   nUpper = UBound(strArray)
   For i = nLower To nUpper
      nR = i + 2
      Worksheets("Sheet1").Cells(nR, 1) = Mid(strArray(i), 1, 3)
      Worksheets("Sheet1").Cells(nR, 2) = MidB(strArray(i), 1, 6)
      
      Worksheets("Sheet1").Cells(nR, 3) = Mid(strArray(i), 5, 4)
      Worksheets("Sheet1").Cells(nR, 4) = MidB(strArray(i), 7, 10)
      
      Worksheets("Sheet1").Cells(nR, 5) = Mid(strArray(i), 10, 9)
      Worksheets("Sheet1").Cells(nR, 6) = MidB(strArray(i), 17, 18)
      
      Worksheets("Sheet1").Cells(nR, 7) = Mid(strArray(i), 20, 8)
      Worksheets("Sheet1").Cells(nR, 8) = MidB(strArray(i), 35, 20)
      
      Worksheets("Sheet1").Cells(nR, 9) = Len(strArray(i)) & " -- " & LenB(strArray(i))
      Worksheets("Sheet1").Cells(nR, 10) = strArray(i)
    Application.StatusBar = i
   Next i
   Worksheets("Sheet1").Columns("A:J").AutoFit
End Sub

Public Function UTF8BytesToWinString(ByRef bytes() As Byte) As String
   Dim iStrSize    As Long
   Dim s           As String
   iStrSize = UBound(bytes) + 1
   s = String$(iStrSize, 0&)
   iStrSize = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(bytes(0)), &HFFFF, StrPtr(s), iStrSize)
   UTF8BytesToWinString = Left$(s, iStrSize - 1)
End Function

Sub SetColumnLabels()
   Worksheets("Sheet1").Cells(1, 1) = "Mid  Field 1"
   Worksheets("Sheet1").Cells(1, 2) = "MidB Field 1"
   
   Worksheets("Sheet1").Cells(1, 3) = "Mid  Field 2"
   Worksheets("Sheet1").Cells(1, 4) = "MidB Field 2"
   
   Worksheets("Sheet1").Cells(1, 5) = "Mid  Field 3"
   Worksheets("Sheet1").Cells(1, 6) = "MidB Field 3"
   
   Worksheets("Sheet1").Cells(1, 7) = "Mid  Column20"
   Worksheets("Sheet1").Cells(1, 8) = "MidB Column20"
   
   Worksheets("Sheet1").Cells(1, 9) = "Len -- LenB"
   Worksheets("Sheet1").Cells(1, 10) = "Full Text String"
End Sub

Open in new window

TestUTF8.txt
ASKER CERTIFIED SOLUTION
Avatar of Ark
Ark
Flag of Russian Federation image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you! I envy programmers like Ark who had to support non-Western alphabets in the beginning. After years of coding in ASCII, it's hard moving to unicode.