Solved

Recommended Books/Blogs for reading large unicode fixed format flat files using VBA

Posted on 2014-01-18
16
290 Views
Last Modified: 2014-01-25
I have a number of reports that have been captured to file with embedded unicode names  (Chinese, Japanese) mixed with numbers and English that I would like to read into an Excel spreadsheets. I cannot get delimited files, so I'm stuck with the fixed format files.

I've been testing the following:
Open sFileName For Binary Access Read along with MidB
Using the File System Object
Using ADODB

I will need to process large files line at a time, so I cannot read the whole file at once.

I've looked through Safari Online for books, but have not been able to find anything useful. I'd really like some books and/or blogs that would also compare the different tools available to process large flat files.

Thanks,
SiebPaul
0
Comment
Question by:SiebPaul
  • 7
  • 6
  • 2
  • +1
16 Comments
 
LVL 40
ID: 39791632
What did not work with the File System Object? Maybe you did not use it properly.
There is an article that shows how at http://www.devx.com/tips/Tip/22282
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39792659
Can you read a line at a time?
You should be able to access the field data in the line, using the Mid() and MidB() functions to parse the line data.
0
 

Author Comment

by:SiebPaul
ID: 39792720
More testing shows the issue to be with file encoding. I have a test file with the following:
101      ARA: مـرحبــاً
102      CHS: 欢迎
103      CHT: æ­¡è¿Ž
104      ENG: Welcome
I believe the first characters before 101 identify the file type. I saved this using NotePad with UTF-8 encoding. ReadLine read all four lines into a single string:
¿¿¿¿¿›¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿›¿¿¿¿¿¿¿¿¿¿¿¿
StrConv of the above produced the string appears as four lines in a cell.
"101      ARA: مـرحبــاً
102      CHS: 欢迎
103      CHT: æ­¡è¿Ž
104      ENG: Welcome"
Saved with encoding "Unicode" and "Unicode big endian" gives me the desired results in four cells so I could further parse the data.

NotePad++ shows the files that work as UCS-2 Bid Endian and UCS-2 Little Endian.

How can I change the encoding (other that opening the file in NotePad and saving with a different encoding)?

Test files and test script attached.
UNICODE TEXT CHANGED AFTER SUBMITTING MY COMMENT!
Thanks,
Paul
Test-Unicode.txt
Test-UnicodeBigEndian.txt
Test-UTF8.txt
Sub-fs-Read.txt
0
 
LVL 40
ID: 39792756
The problem might not be with reading the characters. It can very well be of displaying them.

You need a font that can display them. I won't be able to give you precise information, my experience with a similar problem dates from many years back and was at a customer office, so I do not have the code or the documentation.

We needed to display Chinese in a few fields in a VB6 application. Standard fonts such as Arial and Times New Roman do not have Chinese characters, at least, not the set of characters that we needed for the data at hand (I could not say for Japanese). We thus looked around for a font that could and found that there was a Chinese font installed on all the users computers by Corel Draw. We simply set the controls that needed to display Chinese to that font and that solved the problem.

Hoping that your Chinese and English are not mixed in the same fields, you could try to find a Chinese font and set the columns that display these fields to that font.

You might also search for IME in your specific version of Excel. In my 2003 edition, help refers to a IME package that can be downloaded to enable you to type Chinese and Japanese characters with a standard keyboard. I suppose that it also probably installs the necessary fonts to display in these languages.
0
 

Author Comment

by:SiebPaul
ID: 39792862
I just went back to the source of my test file:
Unicode Test Strings

The following is a paste of the first four lines of the test string"Welcome" in several languages which may not display correctly (these are attachments in a previous message):

Resource
ID      "Welcome"UTF-16 Unicode      ResourceID      "Welcome"UTF-8 Encoded
101      "ARA: ¿¿¿¿¿¿¿¿¿"      151      "ARA: مـرحبــاً"
102      "CHS: ¿¿"      152      "CHS: 欢迎"
103      "CHT: ¿¿"      153      "CHT: æ­¡è¿Ž"
104      "ENG: Welcome"      154      "ENG: Welcome"

The UTF-16 encoding displays correctly on my computer. I think the light bulb just went off: UTF-8 displays as a single-byte (8 bits) character set while the UTF-16 (2 bytes, 16 bits) enforces the use of double-byte character sets required for unicode support of languages like Arabic, Chinese, and Japanese.

So, I guess my issue boils down to using VBA opening a flat text file and specifying that it is a double-byte character set. I think that fact the UTF-8 can be used to for both single and multiple byte character sets is confusing (according UTF-8 wiki page.
0
 
LVL 27

Expert Comment

by:Ark
ID: 39793254
 = EF BB BF = UTF-8 coding (see http://en.wikipedia.org/wiki/Byte_order_mark)
0
 
LVL 27

Expert Comment

by:Ark
ID: 39793463
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8       As Long = 65001

Sub fsReadLine()
   Dim buff() As Byte
   Dim sFileName As String
   sFileName = "E:\Test-UTF8.txt"
   Open sFileName For Binary Access Read As #1
      ReDim buff(FileLen(sFileName) - 4)
      ' read file to buffer
      ' skip first 3 bytes (BOM)
      Dim b(2) As Byte: Get #1, , b
      ' read real content into buffer
      Get #1, , buff
   Close #1
'convert UTF8 bytes into unicode string
   Dim s As String
   s = UTF8BytesToWinString(buff)
   Worksheets("Test").Cells(5, 1) = s
End Sub


Public Function UTF8BytesToWinString(ByRef bytes() As Byte) As String
   Dim iStrSize    As Long
   Dim s           As String
   iStrSize = UBound(bytes) + 1
   s = String$(iStrSize, 0&)
   iStrSize = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(bytes(0)), &HFFFF, StrPtr(s), iStrSize)
   UTF8BytesToWinString = Left$(s, iStrSize - 1)
End Function

Open in new window

0
 

Author Comment

by:SiebPaul
ID: 39793976
I'm getting a subscript out of range error in the above code on:
ReDim buff(FileLen(sFileName) - 4)

I put a MsgBox before the line to display the value being used: 78.

There's no Preserve, so why am I getting this error?

Also, one of my files is 15MB. What is the upper practical limit on reading whole file into an array?
0
What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

 
LVL 27

Expert Comment

by:Ark
ID: 39795403
It means that file length<4 bytes, ie it's empty (3 bytes for UTF8 BOM + at least 1 byte info = 4 byte)
Add If logif to check this. Smth like this
Sub fsReadLine()
   Dim buff() As Byte
   Dim sFileName As String
   sFileName = "D:\Test-UTF8.txt"
   Dim fLen As Integer
   fLen = FileLen(sFileName) 
   If fLen < 4 Then
      MsgBox("Empty file " & sFileName)
      Exit Sub
   End If
   Open sFileName For Binary Access Read As #1
      ReDim buff(fLen- 4)
      ' skip first 3 bytes (BOM)
      ' and read read file content into buffer
      Get #1, 4 , buff
   Close #1
'convert UTF8 bytes into unicode string
   Dim s As String
   s = UTF8BytesToWinString(buff)
   Worksheets("Test").Cells(5, 1) = s
End Sub

Open in new window


As for file size - array size can be up to Long type limit ie ~ 2GB
0
 

Author Comment

by:SiebPaul
ID: 39795918
I added:  MsgBox ("File Length: " & fLen)
The file length is 82.
I'm still getting: Run-time error '9': Subscript out of range
I can't see a reason for the message.
Here's the code I'm running:
Option Explicit
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8       As Long = 65001


Sub fsReadLine()
   Dim buff() As Byte
   Dim sFileName As String
   sFileName = "D:\TestUTF8.txt"
   Dim fLen As Integer
   fLen = FileLen(sFileName)
   If fLen < 4 Then
      MsgBox ("Empty file " & sFileName)
      Exit Sub
   End If
   MsgBox ("File Length: " & fLen)
   Open sFileName For Binary Access Read As #1
      ReDim buff(fLen - 4)
      ' skip first 3 bytes (BOM)
      ' and read read file content into buffer
      Get #1, 4, buff
   Close #1
'convert UTF8 bytes into unicode string
   Dim s As String
   s = UTF8BytesToWinString(buff)
   Worksheets("Test").Cells(5, 1) = s
End Sub
Public Function UTF8BytesToWinString(ByRef bytes() As Byte) As String
   Dim iStrSize    As Long
   Dim s           As String
   iStrSize = UBound(bytes) + 1
   s = String$(iStrSize, 0&)
   iStrSize = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(bytes(0)), &HFFFF, StrPtr(s), iStrSize)
   UTF8BytesToWinString = Left$(s, iStrSize - 1)
End Function

Open in new window

0
 
LVL 27

Expert Comment

by:Ark
ID: 39796016
You'r still getting error on Redim buff(fLen-4) line?? It's strange.
BTW, sorry for error
Dim fLen As Integer should be Dim fLen As Long to work with huge files (>65K)
0
 

Author Comment

by:SiebPaul
ID: 39796946
I ran a short test:
Sub Buff_Test()
  Dim buff() As Byte
  ReDim buff(78)
  MsgBox "Test Done!"
End Sub

Open in new window

which worked.

I renamed buff to aBuff and it worked. Changed the name back to buff and it worked. I think Murphy was visiting.

This demonstrated that UTF8BytesToWinString works for converting an array of the whole file into a string that is written to to a single cell.

I'm going to revisit my original code sample to see if I can integrate this function into line by line processing of a file. I'll post an update later today.
0
 
LVL 27

Expert Comment

by:Ark
ID: 39798807
If u'r working with files max 15MB size I suggest reading whole file at once. Multiple accessing HDD takes more time than 1. Even with huge files of some GB IMHO it's better read content into byte array first (using chunks to show progress), then convert it to string and only then, if you need, split this string into lines (lines = Split(fileContent,vbCrLf))
0
 

Author Comment

by:SiebPaul
ID: 39807065
I'm finally done what I started out to do: read a unicode fixed width text file and parse it into columns. However, l still have a couple of questions. The complete Excel VBA program is included in this message below. The unicode test file is attached, because pasting the file contents into this comment is not accurate. Please note that I included the column "Len -- LenB" to confirm that I'm processing unicode strings (see IsStringUnicode).

1. The MidB function includes a space in front fields after the first one. The Trim() fixes this issue, but why is the space there?

2. When I first started parsing the string into columns, I thought that I would simply have to double the Mid start position for the MidB start position, because the string length in bytes is double the string length in characters (since UTF-8 characters can be more than two bytes, this will not always be true). However, it turns out that you simply add the previous MidB field's start position and length in bytes to get the next field's MidB start position. This is logical, but what did I miss about not doubling the previous Mid start position?


Option Explicit
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8       As Long = 65001

Sub fsReadLine()
   Dim sFileContents As String
   Dim byteArray() As Byte
   Dim strArray() As String
   Dim sFileContentsName As String
   
   Dim fLen As Long
   Dim i As Long
   Dim nR As Long 'Row counter
   Dim nLower As Long, nUpper As Long
 
   Worksheets("Sheet1").Cells.ClearContents
   
   sFileContentsName = "D:\TestUTF8.txt"
   fLen = FileLen(sFileContentsName)
   If fLen < 4 Then
      MsgBox ("Empty file " & sFileContentsName)
      Exit Sub
   End If
   
   Open sFileContentsName For Binary Access Read As #1
      ReDim byteArray(fLen - 4)
      ' skip first 3 bytes (BOM)
      ' and read read file content into buffer, or byteArray
      Get #1, 4, byteArray
   Close #1
      
   'convert UTF8 bytes into unicode string
   sFileContents = UTF8BytesToWinString(byteArray)
   
   'Split file contents into an array of strings
   strArray = Split(sFileContents, vbCrLf)
   
   Call SetColumnLabels
   
   nLower = LBound(strArray)
   nUpper = UBound(strArray)
   For i = nLower To nUpper
      nR = i + 2
      Worksheets("Sheet1").Cells(nR, 1) = Mid(strArray(i), 1, 3)
      Worksheets("Sheet1").Cells(nR, 2) = MidB(strArray(i), 1, 6)
      
      Worksheets("Sheet1").Cells(nR, 3) = Mid(strArray(i), 5, 4)
      Worksheets("Sheet1").Cells(nR, 4) = MidB(strArray(i), 7, 10)
      
      Worksheets("Sheet1").Cells(nR, 5) = Mid(strArray(i), 10, 9)
      Worksheets("Sheet1").Cells(nR, 6) = MidB(strArray(i), 17, 18)
      
      Worksheets("Sheet1").Cells(nR, 7) = Mid(strArray(i), 20, 8)
      Worksheets("Sheet1").Cells(nR, 8) = MidB(strArray(i), 35, 20)
      
      Worksheets("Sheet1").Cells(nR, 9) = Len(strArray(i)) & " -- " & LenB(strArray(i))
      Worksheets("Sheet1").Cells(nR, 10) = strArray(i)
    Application.StatusBar = i
   Next i
   Worksheets("Sheet1").Columns("A:J").AutoFit
End Sub

Public Function UTF8BytesToWinString(ByRef bytes() As Byte) As String
   Dim iStrSize    As Long
   Dim s           As String
   iStrSize = UBound(bytes) + 1
   s = String$(iStrSize, 0&)
   iStrSize = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(bytes(0)), &HFFFF, StrPtr(s), iStrSize)
   UTF8BytesToWinString = Left$(s, iStrSize - 1)
End Function

Sub SetColumnLabels()
   Worksheets("Sheet1").Cells(1, 1) = "Mid  Field 1"
   Worksheets("Sheet1").Cells(1, 2) = "MidB Field 1"
   
   Worksheets("Sheet1").Cells(1, 3) = "Mid  Field 2"
   Worksheets("Sheet1").Cells(1, 4) = "MidB Field 2"
   
   Worksheets("Sheet1").Cells(1, 5) = "Mid  Field 3"
   Worksheets("Sheet1").Cells(1, 6) = "MidB Field 3"
   
   Worksheets("Sheet1").Cells(1, 7) = "Mid  Column20"
   Worksheets("Sheet1").Cells(1, 8) = "MidB Column20"
   
   Worksheets("Sheet1").Cells(1, 9) = "Len -- LenB"
   Worksheets("Sheet1").Cells(1, 10) = "Full Text String"
End Sub

Open in new window

TestUTF8.txt
0
 
LVL 27

Accepted Solution

by:
Ark earned 500 total points
ID: 39808108
No.2: This is because both Mid and MidB starts from 1. Second char will be at #2 for Mid and #3 for MidB, third one at #3 and #5 respectively, so
MidB_Start = Mid_Start*2 - 1. And yes, MidB require double Mid length.

No.1 - See No.2 - you set start position for MidB at space char and got it (chars 17,33,35,37 in unicode  are spaces). Corrected code according to No.2 :
   For i = nLower To nUpper
      nR = i + 2
      Worksheets("Sheet1").Cells(nR, 1) = Mid(strArray(i), 1, 3)
      Worksheets("Sheet1").Cells(nR, 2) = MidB(strArray(i), 1, 6)
      
      Worksheets("Sheet1").Cells(nR, 3) = Mid(strArray(i), 5, 4)
      Worksheets("Sheet1").Cells(nR, 4) = MidB(strArray(i), 9, 8)
      
      Worksheets("Sheet1").Cells(nR, 5) = Mid(strArray(i), 10, 9)
      Worksheets("Sheet1").Cells(nR, 6) = MidB(strArray(i), 19, 18)
      
      Worksheets("Sheet1").Cells(nR, 7) = Mid(strArray(i), 20, 8)
      Worksheets("Sheet1").Cells(nR, 8) = MidB(strArray(i), 39, 16)
      
      Worksheets("Sheet1").Cells(nR, 9) = Len(strArray(i)) & " -- " & LenB(strArray(i))
      Worksheets("Sheet1").Cells(nR, 10) = strArray(i)
    Application.StatusBar = i
   Next i
'For testing - print bytes:
   Dim j As Long
   Dim x() As Byte
'Unicode
   For i = nLower To nUpper
      x = strArray(i)
      For j = 0 To UBound(x)
         Debug.Print x(j) & "(#" & j + 1 & ")",
      Next j
      Debug.Print
   Next i
   Debug.Print
'Print ASCII bytes (double-bytes chars will be replaced with spaces)
   For i = nLower To nUpper
      For j = 1 To Len(strArray(i)) - 1
         Debug.Print Asc(Mid(strArray(i), j, 1)) & "(#" & j & ")",
      Next j
      Debug.Print
   Next i

Open in new window


PS. Actually, you don't need 'xxxB' (bytes) functions like MidB and LenB. strArray(i) is already unicode string so VBA understand it correctly
0
 

Author Closing Comment

by:SiebPaul
ID: 39808577
Thank you! I envy programmers like Ark who had to support non-Western alphabets in the beginning. After years of coding in ASCII, it's hard moving to unicode.
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Suggested Solutions

Calculating holidays and working days is a function that is often needed yet it is not one found within the Framework. This article presents one approach to building a working-day calculator for use in .NET.
Using Word 2013, I was experiencing some incredible lag when typing.  Here's what worked for me....
This video shows where to find templates, what they are used for, and how to create and save a custom template using Microsoft Word.
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now