Solved

Reading past a zero value byte with ifstream::read

Posted on 2003-11-20
Last Modified: 2009-12-16
I have a large file (about 400 KB) that I need to read text data from. The problem is that the format looks odd to me: each character seems to take 2 bytes (some kind of Unicode text?). The first byte of each character is ASCII and the second byte is always zero. I want to read the whole file into a string, but the reading always stops after a byte that contains zero (0x00).

The way I have coded this is like:

std::ifstream file;
file.open ("example.txt");
std::ostringstream contents;
contents << file.rdbuf ();
std::string cont = contents.str ();

Inside the example.txt file would be the text (for example):
hello

But the data inside actually looks like (through a hex editor):
48 00 65 00 6C 00 6C 00 6F 00

So when the text is read, it stops after the 'H'. Is there a way to read the whole file in and not stop at the 'zero value' bytes?
Question by:Toadinator
LVL 48

Expert Comment

by:AlexFM
Use wostringstream instead of ostringstream and wstring instead of string.
 
LVL 17

Expert Comment

by:rstaveley
If you look at the file with a binary editor, I expect you'll find that the first two bytes are a BOM (byte order mark), which effectively tells you that you have little-endian UTF-16 data. If so, the first two bytes will be: FF FE.
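
To make the BOM check concrete, here is a minimal sketch that classifies the first two bytes of a buffer. The names `Utf16Bom`, `detect_utf16_bom` and `check_detect_utf16_bom` are made up for this illustration and do not come from the thread. It only looks at two bytes, so it assumes the data is UTF-16 in the first place (FF FE followed by 00 00 would also match a UTF-32 mark).

```cpp
#include <cassert>
#include <string>

enum class Utf16Bom { None, LittleEndian, BigEndian };

// Classify the first two bytes of a buffer as a UTF-16 byte order mark.
Utf16Bom detect_utf16_bom(const std::string& bytes)
{
    if (bytes.size() < 2) return Utf16Bom::None;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Bom::LittleEndian; // FF FE
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Bom::BigEndian;    // FE FF
    return Utf16Bom::None;
}

// Usage check: a little-endian buffer starting FF FE 48 00, and one with no mark.
void check_detect_utf16_bom()
{
    const char le[] = {'\xFF', '\xFE', 'H', '\0'};
    assert(detect_utf16_bom(std::string(le, sizeof le)) == Utf16Bom::LittleEndian);

    const char plain[] = {'H', '\0', 'e', '\0'};
    assert(detect_utf16_bom(std::string(plain, sizeof plain)) == Utf16Bom::None);
}
```

The buffers are built from a pointer and a length so that the embedded zero bytes survive construction of the std::string.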

If you are developing on Windows, you are in luck: you can use AlexFM's suggestion if you skip past the BOM and treat subsequent characters as wchar_t, because Windows is little-endian and wchar_t is 16 bits there.

i.e.
--------8<--------
#include <iostream>
#include <fstream>
#include <string>

int main()
{

      std::filebuf *pbuf;
      std::ifstream filestr;
      long size;
      char *buffer;

      filestr.open("test.txt");

// Get pointer to associated buffer object

      pbuf = filestr.rdbuf();

// Get file size using buffer's members

      using std::ios;
      size = pbuf->pubseekoff(0,ios::end,ios::in);
      pbuf->pubseekpos(0,ios::in);

// Allocate memory to contain file data

      buffer = new char[size];

// Get file data

      pbuf->sgetn(buffer,size);
      filestr.close();

// Write raw content to stdout

      std::cout << "Here is the raw stuff: ";
      std::cout.write(buffer,size);
      std::cout << '\n';

// Load a wstring with its contents

      std::wstring wstr(reinterpret_cast<wchar_t*>(buffer)+1,size/sizeof(wchar_t)-1);

      delete []buffer;

// Display the wstring

      std::wcout << "Here it is using wstring: " << wstr << L'\n';

}
--------8<--------

If you are developing on a platform which has 32-bit wchar_t, you need to use some more ingenuity.

Be warned that, as I very recently discovered, you cannot expect wofstream/wifstream to write/read characters of size wchar_t directly in most locales. They convert to a compact external representation, so you wind up with 8-bit characters written to disk instead of the 16-bit characters you might otherwise have expected.
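
The warning above can be observed with a small experiment. This is only a sketch: `bytes_written_by_wofstream` is a hypothetical helper, the scratch file name is whatever the caller passes in, and the one-byte-per-character result is what I would expect for plain ASCII text in the default "C" locale on common implementations, not something the standard spells out for every locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Write `text` through std::wofstream, then report how many bytes landed on disk.
// `path` is a scratch file chosen by the caller; it is deleted before returning.
std::size_t bytes_written_by_wofstream(const std::wstring& text, const char* path)
{
    {
        std::wofstream out(path, std::ios::binary);
        assert(out.is_open() && "could not open scratch file");
        out << text;
    } // leaving the scope closes the stream, which flushes it

    // Reopen as raw bytes, positioned at the end, to measure the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamoff size = in.tellg();
    in.close();
    std::remove(path);
    return size < 0 ? 0 : static_cast<std::size_t>(size);
}
```

Calling it with L"hi" should report 2 bytes rather than 2 * sizeof(wchar_t), which is why the programs in this thread read the file as raw bytes instead of using wifstream.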
 
LVL 17

Expert Comment

by:rstaveley
Oops...

> std::wcout << "Here it is using wstring: " << wstr << L'\n';

std::wcout << L"Here it is using wstring: " << wstr << L'\n';
 

Expert Comment

by:jayesh_j_patel
Another way in C:

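#include <stdio.h>
#include <stdlib.h>

/* Keeps the first byte of every 2-byte character and drops the second.
   Only suitable when every second byte is zero; a BOM is not skipped. */
char *ReadTheFile(const char *fileName)
{
  FILE *fp;
  long size;
  long i;
  int lo;
  char *buff;

  fp = fopen(fileName, "rb"); // open file to read in binary mode
  if (!fp) return NULL; // could not open the file
  fseek(fp, 0, SEEK_END); // portable replacement for _filelength(fileno(fp))
  size = ftell(fp);
  fseek(fp, 0, SEEK_SET);
  if (size < 0)
  {
     fclose(fp); // close the file before leaving
     return NULL; // could not determine the file size
  }
  buff = (char*)malloc(size / 2 + 1); // one char per 2-byte pair, plus terminator
  if (!buff)
  {
     fclose(fp); // close the file before leaving
     return NULL; // not enough memory
  }
  i = 0; // set string buffer index to zero
  // loop until the buffer is full or the end of file is reached
  while (i < size / 2 && (lo = fgetc(fp)) != EOF && fgetc(fp) != EOF)
  {
     buff[i++] = (char)lo; // store the first byte; the second was read and dropped
  }
  buff[i] = 0; // put string terminator
  fclose(fp); // now close the file as we read the whole file
  return buff; // everything was fine and here is the data
}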

Call the function above with the name of the file you want to read.
It returns the text buffer read from that file.
Don't forget to release the memory allocated to buff by calling free(buff)
when you are done with the text buffer.
Hope this helps,
Thanks,
Jayesh
 

Author Comment

by:Toadinator
Using wstring and wostringstream, all I get is empty strings. The problem is reading the file, the reading is always stopped when encountering a zero (0x00). When I tried rstaveley's way only the first 3 bytes were read into buffer in the pbuf->sgetn (buffer, size) function. Like always, it stopped after it met a byte with a value of zero. If somehow I do get the data read into a wstring is there a way to convert it to a normal string?
 
LVL 17

Expert Comment

by:rstaveley
If the first two bytes are not FF FE, it is probably the case that you are dealing with a different file format. Possibly your file is big endian UTF-16 rather than little endian, which I assumed. A wchar_t with value 00 00 will be treated as a L'\0' terminus by wcout. My technique shouldn't stop when it encounters an 8-bit (byte/char) value of 0, but it would get terminated by a 16-bit (Windows wchar_t) value of 0.

Can you let us know the byte values of (say) the first 10 bytes of the file? Then we'll have a sporting chance of guessing at the file format. It sounds like you don't have a Unicode UTF-16 file.

Windows XP Notepad accepts little endian and big endian UTF-16 UNICODE as a possible file format. If you load it into notepad, does it look like regular text?
 
LVL 4

Expert Comment

by:havman56

1. If your character is 2 bytes, then it is a wchar_t. In MSC it is
typedef unsigned short wchar_t; which is 2 bytes.

2. Use std::wostringstream and std::wstring, which AlexFM already mentioned.

3. Yes, in MSC 6.0 a wostringstream write fails when a character is greater than 0x00fd.
It treats any character greater than fd as a space character. This is a potential problem in MSC.

4. But I am sure the text "hello", in the format specified in your question, can be written into a wstringstream.

5. I feel you need to open the file in read mode, std::ios::in.
You cannot have an output stream take the rdbuf of another output object as its source,
because gptr() and egptr() mismatch.

Let us know what happens after changing the file mode
and using wstringstream and wostream.






 
LVL 17

Expert Comment

by:rstaveley
> 1.if ur charcter is 2 bute then it is wchar in MSC it is typedef wchar_t unsigned short which is 2 bytes

Agreed, but little endian or big endian? If it is, indeed, 48 00 65 00 6C 00 6C 00 6F 00 for 'H' 'e' 'l' 'l' 'o', you have little-endian 16-bit characters, but if they are actually 00 48 00 65 00 6C 00 6C 00 6F, they are big endian. That's why we need to know the first (say) 10 bytes of the file. If the file includes a BOM, so much the better. We are told that the approach I described only got 3 "bytes" (wchar_t's???), which wouldn't have happened if the first bytes were 48 00 65 00 6C 00 6C 00 6F 00 with no BOM. I'd expect to see "ello" if that was the case, because the program would have mistakenly skipped the 'H' as though it were the BOM.
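
The difference between the two byte orders can be shown with the byte pairs quoted above. This is a small sketch; `utf16_unit` and `check_utf16_unit` are names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Combine two consecutive file bytes into one 16-bit code unit.
// `first` is the byte at the lower file offset.
std::uint16_t utf16_unit(unsigned char first, unsigned char second, bool little_endian)
{
    return little_endian
        ? static_cast<std::uint16_t>(first | (second << 8))
        : static_cast<std::uint16_t>((first << 8) | second);
}

// Usage check with the 'H' bytes from the question.
void check_utf16_unit()
{
    assert(utf16_unit(0x48, 0x00, true)  == 0x0048); // 48 00 read as little endian is 'H'
    assert(utf16_unit(0x00, 0x48, false) == 0x0048); // 00 48 read as big endian is 'H'
    assert(utf16_unit(0x48, 0x00, false) == 0x4800); // 48 00 read with the wrong order is not 'H'
}
```

Assembling the unit from individual bytes like this does not depend on the byte order of the machine running the code.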
 
LVL 4

Expert Comment

by:havman56

It is 48 00 only, not 00 48.

If it were 00 48, then the stream write would stop at the 00 itself.

rstaveley, your code does something that does not solve the problem as posed; deviating from it is absurd.

Why do you need pubseek etc.?
It is not needed.

If you have
( *ostream_ptr)<< obj.rdbuf()

it will invoke basic_ostream << sb*;

Assume obj is itself an ostringstream:
then the insertion fails.

If obj is an istringstream, then it succeeds.

My statement is clear for string streams (I am not sure, but it may apply to files too):

if and only if you have an input stream can you read from it and write to an output stream.

So the file should be an input stream, or a read operation should be performed.

So open the file in input (read) mode and do the rest; I hope it will work.

I repeat, rstaveley: you are deviating from the problem.
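
For reference, the `ostream << rdbuf()` idiom under discussion can be tried on an in-memory input stream. This is a sketch with invented names (`slurp`, `check_slurp`), using a string stream in place of the file. As far as I know the insertion copies characters until the source is exhausted, so zero bytes are copied like any other byte.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Copy everything readable from an input stream into a std::string
// by inserting the stream's buffer into an ostringstream.
std::string slurp(std::istream& in)
{
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Usage check with the "hello" bytes from the question (48 00 65 00 ...).
void check_slurp()
{
    const char raw[] = {'H', '\0', 'e', '\0', 'l', '\0', 'l', '\0', 'o', '\0'};
    const std::string source(raw, sizeof raw);
    std::istringstream in(source);

    const std::string copy = slurp(in);
    assert(copy.size() == sizeof raw); // all 10 bytes arrive, zeros included
    assert(copy[2] == 'e');            // data after the first zero byte is present
}
```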










 
LVL 17

Expert Comment

by:rstaveley
You misunderstood my posting havman56, and I'm having difficulty following yours.

> it is 48 00 only not 00 48

How do you know? The author has not confirmed this. We only know that the file has the sequence 48 00 65 00 6C 00 6C 00 6F 00 in it. We don't know if the 00 bytes are on even bytes.

> if 00 48 then stream write stops at 00 itself....

...which could be what the author is encountering, when he said that he got three bytes. His file may be big endian 16-bit UNICODE.

> rstavely in ur code ur trying to do something where ur not solving the problem as it is .. deviating it is absurd...
> ...
> why u need pubseek etc.....
> not needed..
> ....
>  repeat stavesly ur deviating from the problem .....

pbuf->pubseekoff(0,ios::end,ios::in) was used to get the file size so that a buffer of a suitable size could be allocated. [I have a comment above it, which tells you that.] I'm using wcout/cout to output the contents read from test.txt as raw characters and then as wchar_t's. The program does not read and write to the same file.

You could humour me by creating a test file with notepad on your system and saving the text as test.txt as Unicode (Unicode which isn't not big endian - you have this option in the Windows XP notepad). Then execute the program above in the same directory. The program was tested with VC 7.1.

The author said, "When I tried rstaveley's way only the first 3 bytes were read into buffer in the pbuf->sgetn (buffer, size) function. Like always, it stopped after it met a byte with a value of zero." I'd like to know if this conclusion was made from the return value from the function.

i.e. Toadinator, please can you let me know what your output is from the following, which adds debug to the original program:
--------8<--------
#include <iostream>
#include <fstream>
#include <string>

int main()
{

     std::ifstream filestr;

     filestr.open("test.txt");

// Get pointer to associated buffer object

     std::filebuf *pbuf = filestr.rdbuf();

// Get file size using buffer's members

     using std::ios;
     long size = pbuf->pubseekoff(0,ios::end,ios::in);
     pbuf->pubseekpos(0,ios::in);

// Indicate the file size

     std::cout << "File size is: " << size << '\n';

// Allocate memory to contain file data

     char *buffer = new char[size];

// Get file data

     long retrieved_size = pbuf->sgetn(buffer,size);
     filestr.close();

// Indicate the amount of data retrieved

     std::cout << "Retrieved size is: " << retrieved_size << '\n';

// Write raw content to stdout

     std::cout << "Here is the raw stuff: ";
     std::cout.write(buffer,retrieved_size);
     std::cout << '\n';

// Load a wstring with its contents

     std::wstring wstr(reinterpret_cast<wchar_t*>(buffer)+1,retrieved_size/sizeof(wchar_t)-1);

     delete []buffer;

// Display the wstring

     std::wcout << "Here it is using wstring: " << wstr << L'\n';

}
--------8<--------
 
LVL 17

Expert Comment

by:rstaveley
> Unicode which isn't not big endian

Sorry about the double negative. I mean of course Unicode which isn't big endian, the Windows default, which is little endian.
 
LVL 17

Expert Comment

by:rstaveley
I overlooked your question...

> If somehow I do get the data read into a wstring is there a way to convert it to a normal string?

Here's how. You use the little-used std::string constructor with iterators:
--------8<--------
#include <iostream>
#include <string>

int main()
{

// ...

     std::wstring wstr(L"This is a test string");

// Display the wstring

     std::wcout << L"Here it is using wstring: " << wstr << L'\n';

// Assign the std::wstring to a std::string, using iterators

     std::string str(wstr.begin(),wstr.end());

// Display the string

     std::cout << "Here it is using string: " << str << '\n';
}
--------8<--------
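
The iterator constructor above simply converts each wchar_t to char, so anything outside the narrow range is silently mangled. If that matters, one alternative is to substitute a placeholder explicitly. This is a sketch and not part of the original answer: `narrow_ascii` and `check_narrow_ascii` are made-up names, and limiting the result to 7-bit ASCII is an assumption made here.

```cpp
#include <cassert>
#include <string>

// Narrow a wide string, keeping 7-bit ASCII characters and replacing
// every other character with `replacement`.
std::string narrow_ascii(const std::wstring& wide, char replacement = '?')
{
    std::string narrow;
    narrow.reserve(wide.size());
    for (wchar_t wc : wide) {
        // Converting to unsigned long first makes the range test safe
        // whether wchar_t is signed or unsigned on this platform.
        const bool is_ascii = static_cast<unsigned long>(wc) < 0x80UL;
        narrow.push_back(is_ascii ? static_cast<char>(wc) : replacement);
    }
    return narrow;
}

// Usage check: plain text passes through, a euro sign becomes '?'.
void check_narrow_ascii()
{
    assert(narrow_ascii(L"This is a test string") == "This is a test string");
    assert(narrow_ascii(L"a\u20ACb") == "a?b");
}
```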
 

Author Comment

by:Toadinator
The first ten bytes of the file are:
FF FE 2F 00 2F 00 0D 00 0A 00

I changed the file mode and also tried using wostringstream and wstream, but it is still stopping at the 0x00 byte. I looked at the contents of buffer in the pbuf->sgetn (buffer, size) function using the debugger. The content of the buffer was FF FE 2F. I copied your program code above, compiled it and opened a file I made from the first 32 bytes of the file I'm using. The results were much better:

File size is: 32
Retrieved size is: 32
Here is the raw stuff: ??//
//              NOTICE
Here it is using wstring:


This is very odd because in the debugger, the contents of buffer are still FF FE 2F. When using this method in the main program, I converted the contents of the buffer to a std::string but only the first 3 bytes entered the string. Converting to a string is a must in the main program. It seems as though the rest of the data is actually there but not being 'shown'.
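
One possible explanation for seeing only FF FE 2F, offered as a guess since the thread does not confirm it: anything that treats the buffer as a C string (a debugger's char* view, strlen, or constructing a std::string from a bare pointer) stops at the first zero byte, even though the bytes after it are in memory. The sketch below uses the first six bytes reported above; `c_string_length` and `check_embedded_zeros` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by a consumer that treats the data as a zero-terminated C string.
std::size_t c_string_length(const std::string& s)
{
    return std::strlen(s.c_str());
}

// Usage check with the bytes FF FE 2F 00 2F 00.
void check_embedded_zeros()
{
    const char raw[] = {'\xFF', '\xFE', '/', '\0', '/', '\0'};

    const std::string with_length(raw, sizeof raw); // pointer plus length keeps every byte
    const std::string from_pointer(raw);            // pointer only stops at the first zero

    assert(with_length.size() == 6);
    assert(c_string_length(with_length) == 3); // a C-string view shows just FF FE 2F
    assert(from_pointer.size() == 3);
}
```

If the conversion to std::string in the main program was done from the buffer pointer alone, passing the length as well would keep the whole buffer.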
 
LVL 17

Expert Comment

by:rstaveley
OK, you can see that it is indeed little-endian Unicode. The first two bytes (FF EE) are the usual "BOM" to tell you that.

If I put those exact 10 bytes into my program compiled with VC7.1 (Visual Studio .NET 2003), I get apparent spaces (actually 0s) between the '/' characters in the raw stuff and the wstring gets "// " followed by a carriage return and line feed.

Which compiler are you using?
 
LVL 17

Expert Comment

by:rstaveley
> The first two bytes (FF EE)...

Typo. I meant:

FF FE
 

Author Comment

by:Toadinator
I'm using gcc version 3.2.2 and I'm developing in Linux.
 
LVL 17

Accepted Solution

by:
rstaveley earned 125 total points
Ah.... that's the problem. wchar_t is 32 bits in your environment. Windows has wchar_t as 16 bits, and my program depended on that. test.txt is a UTF-16 Unicode file, which is more typically found on Windows systems.

You'll need to read the input file in binary mode using a 16-bit type, which is unsigned short on your platform (NB: this is not a portable assumption). You then need to load a wstring with the widened shorts. wostringstream is good for this.

Try this:
---------8<---------
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>

int main()
{
    std::ifstream fin("test.txt",std::ios::in|std::ios::binary);
    if (!fin) {
        std::cerr << "Unable to open file\n";
        return 1;
    }

    // Read off byte order marker
    unsigned short bom16;
    if (!fin.read(reinterpret_cast<char*>(&bom16),sizeof(bom16))) {
        std::cerr << "Unable to read BOM\n";
        return 1;
    }

    // Check the BOM - only handling little endian 16-bit
    if (bom16 != 0xfeff) {
        std::cerr << "Input file BOM (0x" << std::hex << bom16 << ") is not 16 bit little endian\n";
        return 1;
    }

    // Read 16-bit characters and load them into wostr
    unsigned short c16;
    std::wostringstream wostr;
    while (fin.read(reinterpret_cast<char*>(&c16),sizeof(c16)))
        wostr << static_cast<wchar_t>(c16);

    // Here is the wstring
    std::wstring wstr(wostr.str());

    // We can now display it with wcout
    std::wcout
        << L"Wide data:"
        << L"\n--------8<--------\n"
        << wstr
        << L"\n--------8<--------\n"
        ;

    // We can convert it to a narrow string with possible loss
    // of unusual characters (characters > 255)
    std::string str(wstr.begin(),wstr.end());

    // We can now display it with cout
    std::cout
        << "Narrow data:"
        << "\n--------8<--------\n"
        << str
        << "\n--------8<--------\n"
        ;
 
    return 0;
}
---------8<---------
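
The program above relies on unsigned short being 16 bits and on the machine being little endian. A variant that avoids both assumptions assembles each unit from two bytes. This is a sketch rather than the accepted code: `utf16le_to_wstring` and `check_utf16le_to_wstring` are invented names, the input is assumed to be already in memory as raw bytes, and surrogate pairs are copied unit by unit without being combined, which only matters for characters outside the Basic Multilingual Plane.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-16LE bytes that start with the FF FE mark into a std::wstring.
// Returns false if the mark is missing or the byte count is odd.
bool utf16le_to_wstring(const std::string& bytes, std::wstring& out)
{
    if (bytes.size() < 2 || bytes.size() % 2 != 0) return false;
    if (static_cast<unsigned char>(bytes[0]) != 0xFF ||
        static_cast<unsigned char>(bytes[1]) != 0xFE) return false;

    out.clear();
    for (std::size_t i = 2; i < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<wchar_t>(lo | (hi << 8))); // low byte comes first
    }
    return true;
}

// Usage check with the first ten bytes of the author's file:
// FF FE 2F 00 2F 00 0D 00 0A 00
void check_utf16le_to_wstring()
{
    const char raw[] = {'\xFF', '\xFE', '/', '\0', '/', '\0',
                        '\r', '\0', '\n', '\0'};
    std::wstring text;
    const bool ok = utf16le_to_wstring(std::string(raw, sizeof raw), text);
    assert(ok);
    assert(text == L"//\r\n");
    (void)ok; // keeps builds with NDEBUG free of unused-variable warnings
}
```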
 

Author Comment

by:Toadinator
Yes, that works very well. Thanks.