How to save a file

Hi, I have an XML document that I need to save to a file and read back afterwards:

I am using
void saveToFile(string whereto, wstring what)
{
      FILE *stream = fopen(whereto.c_str(), "w+, ccs=UTF-8");
      fwrite(what.c_str(), sizeof(wchar_t), what.length(), stream);
      fclose(stream);
}

Afterward I am reading the file with:

char* getFile (string str,unsigned int &size)
{
      FILE * pFile;
      long lSize;
      char * buffer1;
      size_t result;

      pFile = fopen ( str.c_str() , "rb" );

      if (!pFile) return NULL;
      // obtain file size:
      fseek (pFile , 0 , SEEK_END);
      lSize = ftell (pFile);
      rewind (pFile);

      // allocate memory to contain the whole file:
      buffer1 = (char*) malloc (sizeof(char)*lSize);
      // copy the file into the buffer:
      result = fread (buffer1,1,lSize,pFile);
      // terminate
      size=sizeof(char)*lSize;
      fclose (pFile);
      return buffer1;
}      
It seems that buffer1 gets an invalid character at the beginning, so when I open the XML file in IE I get an error at line 1, column 1.
I think it has something to do with the BOM, but I am not sure.

I also get some garbage at the end of the file.

Thanks for any input
DimkovAsked:
 
itsmeandnobodyelseCommented:
>>>> is there another way to save a UTF-8 file without assigning BOM to it?
I think so. The BOM comes from writing UNICODE chars with wstring what. If you convert wstring what to ANSI first, say with the wcstombs conversion, I would assume no BOM gets written. If you write the lines by calling fputws, the file would be written in ANSI, i.e. no BOM is written (see the article http://msdn2.microsoft.com/en-us/library/c4cy2b8e(VS.80).aspx).

In the latter case you could also read the file by opening it in text mode. Depending on whether you use fgets or fgetws, you would read into a char or a wchar_t buffer. Note that I have no experience writing or reading with the ccs=UTF-8 option, but I would strongly assume it is independent of ANSI/UNICODE. Hence the UTF-8 conversion would happen in addition when you write and read in text mode.

Note that reading in binary mode has another disadvantage: when you write a newline char to a text file, Windows turns it into a carriage-return + line-feed pair (0x0D 0x0A). Reading the file in text mode gives you the single newline char back. Reading the file in binary mode gives you the raw pairs, which may not matter but should be considered.
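As a rough sketch of the "convert first, then write plain bytes" idea above: the helper names toUtf8 and saveUtf8NoBom are made up for illustration, and the conversion is done by hand instead of with wcstombs so that it does not depend on the current locale. It assumes wchar_t holds UTF-16 code units when it is 2 bytes wide (as on Windows) and whole code points otherwise. The file is opened in plain binary mode, so the runtime adds neither a BOM nor a carriage return.

```cpp
#include <cstdio>
#include <string>

// Encode a wide string as UTF-8 bytes (hypothetical helper, not a library call).
// Assumption: 2-byte wchar_t means UTF-16 code units, otherwise full code points.
std::string toUtf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            unsigned long lo = static_cast<unsigned long>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Write the text as UTF-8 without a BOM. "wb" means the bytes go out untouched.
bool saveUtf8NoBom(const std::string& path, const std::wstring& what)
{
    FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) return false;
    const std::string bytes = toUtf8(what);
    const bool ok = std::fwrite(bytes.data(), 1, bytes.size(), f) == bytes.size();
    return std::fclose(f) == 0 && ok;   // always close, then report both results
}
```

Because nothing is added automatically, a file that arrived without a BOM is saved without one; if the original had a BOM and you want to keep it, you would have to write the three bytes yourself before the text.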
 
Infinity08Commented:
>> it seems like buffer1 at the beggining get some invalid character.

What kind of garbage is that (open the file with a hex editor eg.) ? Is it before the actual XML, or does it "overwrite" part of the XML ? Is that garbage also present in the what wstring ?

Btw - from what I understand, the BOM should not be included in a UTF-8 file. Did you try erasing the file before creating it for writing ?
 
itsmeandnobodyelseCommented:
It may be easier to get the file size with the stat function:

#include <sys/stat.h>

 ...

  struct stat fs = { 0 };
  ...

  if (stat(str.c_str(), &fs) != 0)
        return 1;  /* file doesn't exist */
 
   size = fs.st_size;

>>>> size=sizeof(char)*lSize;
Why do you read the file into a char buffer while you are writing a wchar_t buffer?

>>>>  So after opening the xml file in IE
That would mean that writing already failed?

Regards, Alex



   

 
itsmeandnobodyelseCommented:
>>>> FILE *stream = fopen(whereto.c_str(), "w+, ccs=UTF-8");
If you use _wfopen, the BOM should be written correctly, and you would only need to pass the file name as a wstring.
 
DimkovAuthor Commented:
I changed the function as follows:
wchar_t* getFile (string str,unsigned int &size)
{
      FILE * pFile;
      long lSize;
      wchar_t * buffer1;
      size_t result;

      pFile = fopen ( str.c_str() , "rb" );

      if (!pFile) return NULL;
      // obtain file size:
      fseek (pFile , 0 , SEEK_END);
      lSize = ftell (pFile);
      rewind (pFile);

      // allocate memory to contain the whole file:
      buffer1 = (wchar_t*) malloc (sizeof(wchar_t)*lSize);
      // copy the file into the buffer:
      result = fread (buffer1,sizeof(wchar_t),lSize,pFile);
      // terminate
      size=sizeof(wchar_t)*lSize;
      fclose (pFile);
      return buffer1;
}      
But now buffer1 is full of Japanese text instead of the content of the file.

In the previous version the extra character came before the
 <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
 
DimkovAuthor Commented:
Isn't _wfopen just a way to pass a wide-character file name? I don't think it opens the file as UTF-8.
 
DimkovAuthor Commented:
I thought this would be simple... just open a UTF-8 *.xml file and save it again :(
 
itsmeandnobodyelseCommented:
>>>> isnt _wfopen just allowing to specify widechar name of the file?
You are right. But UTF-8 files normally should be written with a BOM.

You could try writing the BOM explicitly:

 FILE *stream = fopen(whereto.c_str(), "w+, ccs=UTF-8");
 unsigned char bom[3] = { 0xef, 0xbb, 0xbf };  // UTF-8 BOM; unsigned char, since 0xef does not fit a signed char
 fwrite(bom, 1, 3, stream);
 fwrite(what.c_str(), sizeof(wchar_t), what.length(), stream);
 fclose(stream);

 
DimkovAuthor Commented:
Alex,
What I am doing now is concentrating on the getFile function, since I noticed that when I open a UTF-8 XML file, it has one extra character at the beginning and extra characters at the end...

At this point the characters are already present when the file is written, and I can't debug it.
Can you please review the getFile function?
Its goal is just to open the file... and it is not doing it right :(
 
Infinity08Commented:
Can you try deleting the file before running your code? That way, the code can create the file itself and decide the file format.
 
DimkovAuthor Commented:
I will have to use the code for reading XML files that it did not generate.
 
Infinity08Commented:
>> I will have to use the code for reading xml files not generated by it

Do those files also contain the garbage ? Or just the files created by your code ?
 
DimkovAuthor Commented:
Just the files created by my code.
So, I have a good XML. When I read the file, I get garbage. When I save it, I save the garbage too.
 
Infinity08Commented:
>> When reading the file, I get garbage.

So, let me get this right. You get a Unicode XML from somewhere (how ?), and then you use your saveToFile function to save it to a UTF-8 file.

And the wstring containing the XML already contains the garbage, so that garbage also gets written to the file. Is that correct ? If so, can't you simply either :

    1) remove the garbage from the wstring before writing it to the file
    2) take a look at the code where you obtain the XML, and spot a problem there
 
DimkovAuthor Commented:
I get an ordinary XML file with UTF-8 encoding (from the internet or elsewhere).
Then I use the getFile function and I get garbage... which means I have a problem in this function.
I posted this function previously, please take a closer look at it.

 
itsmeandnobodyelseCommented:
If the XML has a BOM, you need to strip the BOM when you read it in binary mode. Moreover, nothing in your code converts the UTF-8 back to MS UNICODE. I am surprised that you write UNICODE strings with the ccs=UTF-8 option but expect that, when reading the same format, you can simply read the raw data into a char buffer. Did I miss any information?
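To illustrate the missing step of turning the raw UTF-8 bytes back into a wide string, here is a minimal hand-written decoder. The name fromUtf8 is hypothetical, and the sketch assumes well-formed input: it skips a leading BOM, does not diagnose malformed sequences, and produces UTF-16 surrogate pairs only when wchar_t is 2 bytes wide.

```cpp
#include <string>

// Decode UTF-8 bytes into a wide string, skipping a leading BOM (EF BB BF).
// Minimal sketch: malformed input is not reported, stray bytes are just skipped.
std::wstring fromUtf8(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    if (in.size() >= 3 && in.compare(0, 3, "\xEF\xBB\xBF") == 0)
        i = 3;                                   // skip the BOM
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i++]);
        unsigned long cp = 0;
        int extra = 0;                           // number of continuation bytes
        if (b < 0x80)       { cp = b;        extra = 0; }
        else if (b >= 0xF0) { cp = b & 0x07; extra = 3; }
        else if (b >= 0xE0) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xC0) { cp = b & 0x1F; extra = 1; }
        else continue;                           // stray continuation byte
        while (extra-- > 0 && i < in.size())
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            // Split into a UTF-16 surrogate pair for 2-byte wchar_t.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

Reading the bytes into a char buffer and then passing them through something like this avoids the "Japanese text" effect, which is what you get when pairs of UTF-8 bytes are reinterpreted as single wchar_t values.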

 
DimkovAuthor Commented:
This is too advanced for me... all I needed was this simple function...

Let's suppose the XML has no BOM. What should I change in the function in order to get the string?
 
DimkovAuthor Commented:
I found the problem with the garbage at the end:
I forgot to put:
buffer1[lSize]='\0';
to mark the end of the string.
Now all that is left is to find the problem with the starting character...
 
DimkovAuthor Commented:
The first character is actually 3 bytes: 0xef 0xbb 0xbf, which is the BOM.
Since it is XML, it doesn't need to be there, right?
 
itsmeandnobodyelseCommented:
>>>> lets suppose the xml has no BOM.
You can easily check this by opening the XML in a hex editor. In Visual Studio simply use the 'Open with ...' option and choose the hex or binary editor. If the first three bytes have the values EF BB BF, it is a UTF-8 BOM.

You can also check whether the XML contains non-ANSI characters or is a UNICODE text file at all. Look at the strings on the right side of the output. If every printable character is separated by a period '.', the XML was written as a UNICODE text file (where each character has 2 bytes and the upper byte is mostly 0, hence the period in the ANSI representation). If the strings have no periods, the XML is a single-byte file; non-ASCII characters then show up as multi-byte UTF-8 sequences or as entity references such as "&auml;".
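The same check can be done in code once the first few bytes are in memory. This is only a sketch with a hypothetical function name, detectBom; it recognizes the UTF-8 and UTF-16 marks and ignores the rarer UTF-32 ones (FF FE 00 00 would be reported as UTF-16LE here).

```cpp
#include <cstddef>
#include <string>

// Return the encoding suggested by a byte order mark at the start of the data,
// or "none" if there is no (recognized) mark. UTF-32 marks are not handled.
std::string detectBom(const unsigned char* p, std::size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";
    return "none";
}
```

Note the unsigned char pointer: with a plain char buffer on a compiler where char is signed, a comparison like buffer[0] == 0xEF is never true, because the byte is read as a negative value.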

>>>> what should I change  in the function in order to get the string?
You already read the file into a char buffer. I could not see anything wrong with that. What happens to the buffer after the read?

Depending what you've seen with the hex editor

1. you may detect and remove a BOM by

#include <sys/stat.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
using std::string;

char* getFile (string str, unsigned int &size)
{
      FILE * pFile;
      char * buffer1;
      size_t result;
      struct stat fs = { 0 };

      if (stat(str.c_str(), &fs) != 0)
            return NULL;  /* file doesn't exist */
      pFile = fopen ( str.c_str() , "rb" );

      if (!pFile) return NULL;
      // obtain file size:
      size = fs.st_size;

      // allocate memory to contain the whole file plus a terminating zero:
      buffer1 = (char*) malloc (size + 1);
      if (!buffer1) { fclose (pFile); return NULL; }
      // copy the file into the buffer:
      result = fread (buffer1, 1, size, pFile);
      fclose (pFile);
      size = (unsigned int) result;  // bytes actually read
      // check for the UTF-8 BOM (EF BB BF); compare as unsigned char,
      // because a signed char never equals 0xef
      const unsigned char * p = (const unsigned char*) buffer1;
      if (size >= 3 && p[0] == 0xef && p[1] == 0xbb && p[2] == 0xbf)
      {
             memmove(buffer1, buffer1 + 3, size - 3);  // use memmove for in-buffer copying
             size -= 3;
      }
      // terminate
      buffer1[size] = '\0';
      return buffer1;
}

Note that I assumed you have an ANSI XML with some UTF-8 characters. If the XML contains UNICODE chars, you may cast buffer1 to a wchar_t* before returning it.

Regards, Alex
 
itsmeandnobodyelseCommented:
>>>> since it is XML, it doesn't need to be there right?
Yes, you only see it because you read the raw bytes in binary mode. Cut it off using memmove as I showed above.
 
itsmeandnobodyelseCommented:
Note that I read bytes *and not* wchar_t.

If the file contains UNICODE chars, you can nevertheless take my code but make the return value a wchar_t*. You would then cast buffer1 when returning and divide the byte 'size' by 2.
 
DimkovAuthor Commented:
Just one small question: it seems the BOM is added by

void saveToFile(string whereto, wstring what)
{
      FILE *stream = fopen(whereto.c_str(), "w+, ccs=UTF-8");
      fwrite(what.c_str(), sizeof(wchar_t), what.length(), stream);
      fclose(stream);
}
Is there another way to save a UTF-8 file without adding a BOM to it? I don't want to add one to XML files that were given to me if it was not there in the first place.
Question has a verified solution.
