Solved

How to write the byte order markers at the start of a UNICODE file

Posted on 2006-10-19
Last Modified: 2010-04-01
I've got this function which writes text to a file...

int writetextfile(TCHAR * sfile, TCHAR * sbuffer)
{
      DWORD dwBytesWritten;
      SetErrorMode(SEM_NOOPENFILEERRORBOX | SEM_FAILCRITICALERRORS);

      // fix for cust xxx - files having content after eof
      // old hFile = CreateFile(outputfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
      HANDLE hFile = CreateFile(sfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, TRUNCATE_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

      if (hFile == INVALID_HANDLE_VALUE)
      {
            hFile = CreateFile(sfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
      }
      if (hFile != INVALID_HANDLE_VALUE)
      {
            WriteFile(hFile, sbuffer, (_tcslen(sbuffer) + 1) * sizeof(TCHAR), &dwBytesWritten, NULL);
            SetEndOfFile(hFile);
            CloseHandle(hFile);
            return dwBytesWritten;
      }
      else
      {
            return 0;
      }
}

I also need to write the marker characters.

Without these, VB6's FileSystemObject can read the file OK, because you can tell it the file is Unicode. But when the file is read from VB.NET 2005, the byte order marks aren't there, so .NET misinterprets the file format as ASCII, and I end up with an "x 0 x 0 x 0" string that breaks the code when you do anything with it.

All our stuff is UTF8, so I think the bytes should be EF BB BF. What's the easiest way to modify the above to include them?
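
For context, a reader that honours byte order marks does something roughly like the sketch below with the first bytes of the file. The names detect_bom and bom_kind are made up for illustration; this is not how .NET itself is implemented, and UTF-32 marks are deliberately ignored.

```c
#include <stddef.h>

/* Encodings that can be recognised from a leading byte order mark. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Hypothetical helper: classify the first bytes of a file's contents. */
bom_kind detect_bom(const unsigned char *data, size_t len)
{
    /* EF BB BF is the UTF-8 encoding of U+FEFF */
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_UTF8;
    /* FF FE is U+FEFF stored low byte first (UTF-16 little endian) */
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_UTF16_LE;
    /* FE FF is U+FEFF stored high byte first (UTF-16 big endian) */
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_UTF16_BE;
    /* No mark: the reader has to guess or fall back to a default */
    return BOM_NONE;
}
```

With no mark present the function returns BOM_NONE, which is the situation described above: the reader falls back to whatever default encoding it assumes.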

my C is a tad rusty and I need to get this working asap !

thanks

Question by:plq
9 Comments
 
LVL 48

Expert Comment

by:AlexFM
ID: 17766307
     if (hFile != INVALID_HANDLE_VALUE)
     {
          // UTF-8 byte order mark; unsigned char so 0xEF etc. fit without conversion
          unsigned char buffer[3];
          buffer[0] = 0xEF;
          buffer[1] = 0xBB;
          buffer[2] = 0xBF;
          WriteFile(hFile, buffer, sizeof(buffer), &dwBytesWritten, NULL);

          // then the text, exactly as before
          WriteFile(hFile, sbuffer, (_tcslen(sbuffer) + 1) * sizeof(TCHAR), &dwBytesWritten, NULL);
          SetEndOfFile(hFile);
          CloseHandle(hFile);
          return dwBytesWritten;
     }
 
LVL 48

Expert Comment

by:AlexFM
ID: 17766330
I didn't understand exactly where you want to add the three bytes: at the beginning or at the end. This code writes them at the beginning. If you want them at the end, move the WriteFile(hFile, buffer...) line after the WriteFile(hFile, sbuffer...) line.
 
LVL 8

Author Comment

by:plq
ID: 17766543
Thanks Alex. Is there any way we can tune this to only write once ?

I started writing this...

#ifdef UNICODE

      //      Place the unicode utf8 byte order marks at the start of the file

      TCHAR * sbufferbase = createstring(12, lcharsneeded + 3);                  // freed
      *sbufferbase       = 0xEF;
      *(sbufferbase + 1) = 0xBB;
      *(sbufferbase + 2) = 0xBF;
      TCHAR * sbuffer = sbufferbase + 3;
#else

      // no byte order marks

      TCHAR * sbufferbase = createstring(12, lcharsneeded);                  // freed
      TCHAR * sbuffer = sbufferbase;

#endif

// fill up sbuffer as before
...

writetextfile(spath, sbufferbase);



because I only want to call WriteFile once. This is a file being written to a heavily flogged share, so I want to minimise I/O.

My code above is obviously flawed: in a UNICODE build TCHAR is two bytes wide, so it puts 6 bytes at the front instead of 3.

Can you see an easy way of doing it which allows just one write ?
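
One way to get a single write is to assemble the marker and the text in one byte buffer (unsigned char rather than TCHAR) and hand that to WriteFile once. A minimal sketch, assuming the text is already available as UTF-8 bytes; prepend_utf8_bom is a hypothetical helper name, not something from the code above:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd buffer holding the 3-byte UTF-8
   byte order mark followed by a copy of the payload. The caller frees it.
   Returns NULL if the allocation fails. */
unsigned char *prepend_utf8_bom(const char *utf8, size_t len, size_t *out_len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char *buf = malloc(sizeof bom + len);
    if (buf == NULL)
        return NULL;
    memcpy(buf, bom, sizeof bom);         /* marker: exactly 3 bytes */
    memcpy(buf + sizeof bom, utf8, len);  /* payload follows directly */
    *out_len = sizeof bom + len;
    return buf;
}
```

The returned buffer and *out_len could then be passed to one WriteFile call. Reserving three bytes at the front of the buffer before filling it, as in the attempt above, would avoid the extra copy, provided the buffer is declared as bytes.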

thanks
 
LVL 8

Author Comment

by:plq
ID: 17766550
BTW Unicode byte order marks sit at the beginning as far as I know..
 
LVL 8

Author Comment

by:plq
ID: 17769445
I have tried the UTF8 marker with Alex's suggestion above, and the file displays in Notepad with a space between each character.

Notepad is capable of displaying unicode files so the format must be wrong.

I have tried the 3 bytes and the 3 bytes + one zero at the front of the file. The extra 0 in 4th place screws it up completely. With the 3 bytes it seems OK except for the spacing.
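
A likely explanation for the spacing: the marker announces UTF-8, but in a UNICODE build the text behind it is still UTF-16, where every ASCII character is followed by a zero byte. The sketch below (ascii_to_utf16le is an illustrative name) lays out ASCII text the way such a buffer looks in memory on a little-endian machine:

```c
#include <stddef.h>

/* Illustrative helper: store ASCII text as UTF-16LE bytes, i.e. each
   character as two bytes with the low byte first. Stops early if the
   output buffer is full. Returns the number of bytes stored. */
size_t ascii_to_utf16le(const char *text, unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    for (; *text != '\0'; ++text) {
        if (n + 2 > out_cap)
            break;
        out[n++] = (unsigned char)*text;  /* low byte: the ASCII code */
        out[n++] = 0x00;                  /* high byte: zero for ASCII */
    }
    return n;
}
```

For "HI" this produces the bytes 48 00 49 00. A program that trusts the UTF-8 marker decodes those as H, NUL, I, NUL, and the NUL characters are presumably what shows up as gaps.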
 
LVL 8

Author Comment

by:plq
ID: 17772217
Here's a working example

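#include "stdafx.h"      // must come first when precompiled headers are enabled
#include "globals.h"
#include <stdlib.h>
#include <string.h>
#include <tchar.h>
#include <stdio.h>
#include <windows.h>
#include <malloc.h>

// Assumes a UNICODE build, so TCHAR is wchar_t and swork holds UTF-16
int _tmain(int argc, _TCHAR* argv[])
{
      const TCHAR * sfile = TEXT("C:\\ZX.TXT");
      DWORD dwBytesWritten;

      SetErrorMode(SEM_NOOPENFILEERRORBOX | SEM_FAILCRITICALERRORS);
      HANDLE hFile = CreateFile(sfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, TRUNCATE_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
      if (hFile == INVALID_HANDLE_VALUE)
      {
            // TRUNCATE_EXISTING fails if the file is not there yet; same fallback as writetextfile
            hFile = CreateFile(sfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
      }
      if (hFile == INVALID_HANDLE_VALUE)
      {
            return 1;
      }

      const TCHAR * swork = TEXT("HELLO リーズ」の発 话作出了反");
      size_t llen = _tcslen(swork);
      size_t lbytes = llen * sizeof(TCHAR);

      // UTF-8 byte order mark
      unsigned char smarker[3];
      smarker[0] = 0xEF;
      smarker[1] = 0xBB;
      smarker[2] = 0xBF;
      WriteFile(hFile, smarker, 3, &dwBytesWritten, NULL);

      // generous output buffer: UTF-8 needs at most 3 bytes per UTF-16 code unit
      char *utf8 = (char *) malloc(lbytes * 4);
      if (utf8 != NULL)
      {
            // convert the UTF-16 text to UTF-8; -1 means swork is null terminated
            int lBytesWritten = WideCharToMultiByte(CP_UTF8, 0, swork, -1, utf8, (int)(lbytes * 4), NULL, NULL);
            // lBytesWritten includes the null (pass lBytesWritten - 1 to keep it out of the file); 0 means failure
            if (lBytesWritten > 0)
            {
                  WriteFile(hFile, utf8, lBytesWritten, &dwBytesWritten, NULL);
            }
            free(utf8);
      }

      SetEndOfFile(hFile);
      CloseHandle(hFile);

      return 0;
}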

 
LVL 8

Author Comment

by:plq
ID: 17772225
The key was to convert the TCHAR text from UTF-16 (what a UNICODE build uses) to UTF-8, as well as adding the byte order mark.
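
To show what that conversion does to the bytes, here is a portable sketch of a UTF-16 to UTF-8 encoder. It is a hand-written stand-in for illustration, not the Windows implementation; the name utf16_to_utf8 and the choice to replace unpaired surrogates with U+FFFD are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode in_len UTF-16 code units as UTF-8 into out.
   Returns 1 on success (byte count in *out_len), 0 if out is too small. */
int utf16_to_utf8(const uint16_t *in, size_t in_len,
                  unsigned char *out, size_t out_cap, size_t *out_len)
{
    size_t o = 0;
    for (size_t i = 0; i < in_len; ++i) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* high surrogate: combine with the following low surrogate */
            if (i + 1 < in_len && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
                ++i;
            } else {
                cp = 0xFFFD; /* unpaired high surrogate */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;     /* stray low surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need > out_cap)
            return 0;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *out_len = o;
    return 1;
}
```

ASCII characters come out as single bytes with no zero padding, which is why the spacing disappears once the text is converted before it is written.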
 
LVL 1

Accepted Solution

by:
DarthMod earned 0 total points
ID: 17814004
Closed, 500 points refunded.
DarthMod
Community Support Moderator
