Handling UTF8 Unix flat file, while maintaining international text characters
Posted on 2004-09-17
I am trying to write a series of VBA macros to parse and validate flat files, ready for import into a database. The database, file format and technology are not changeable.
The flat file is sent to us, containing data that should be added, as a Unix text file, with UTF-8 encoding and no Byte Object Marker. However, it can contain any and all number of foreign characters (Chinese, Taiwanese, Japanese, Sri Lankan, Arabic, etc).
In its current state, my Excel application can read data, validate and recompile the desired output to another sheet, ready for export, without affecting the characters.
The primary problem occurs when I try to export the file. The exported file format has to match that of the imported file. When I view the original file in SC Unipad, I can see the file is a Unix UTF-8 file, with no Byte Order Marker. The output file, however, is a Unix UTF-16 Little Endian file, with a Byte Object Marker.
I am using the File System Object; creating the file with
Set FileObj = FSO.CreateTextFile(OutputFile, Overwrite, True)
and writing with
FileObj.Write (Sheets(OutSheet).Cells(RowCount, 1).Value)
FileObj.Write EndLine ' Endline = Chr(10)
to write each individual row.
I have tried using the WideCharToMultiByte() external function (with Codepage = 65001) in a UTF8Encode() function, as suggested in previous questions, so the write line is
FileObj.Write (UTF8Encode (Sheets(OutSheet).Cells(RowCount, 1).Value))
but this still creates a UTF-16 file and the foreign characters are lost in favour of ANSI(?) counterparts.
I figure the UTF-16 output might be due to the Unicode flag in FSO.CreateTextFile() being set to True, but I cannot see any other way of maintaining the characters.
I would prefer to avoid opening SC Unipad in a shell and using it to open the UTF-16 file, remove the BOM and convert to UTF-8, as it would rely too heavily on third party software.
Any help anyone can give would be greatly appreciated...