
UNICODE file and pointer arithmetic headache !

Ah hello.

I have a query regarding UNICODE files.

Please consider the following code, intended merely to read a file into a BYTE array. I use some MFC classes here, but that is not really relevant.

      CString strFile( _T( "C:\\Input.txt" ) );
      CFile file;
      if ( ! file.Open( strFile, CFile::modeRead ) )
            return -1;

      UINT nFileLengthBytes = file.GetLength();

      BYTE* b = new BYTE[ nFileLengthBytes ];

      if ( ! b ) return -1;

      TCHAR bom;
      BOOL bUnicode = FALSE;

      file.Read( &bom, sizeof(_TCHAR) );
      bUnicode = bom == _TCHAR( 0xFEFF );
      file.SeekToBegin();

      UINT uBytesRead = file.Read( b, nFileLengthBytes );

      BYTE* pIn = b;

      cout << pIn << endl;
      for( UINT uPos = 0; uPos < uBytesRead; uPos++)
      {
            cout << (pIn)++ << endl;
      }

As you can see, I read the whole file byte for byte into a BYTE array. Simple.

Currently, the input file contains just one line, without a carriage return, saved as ANSI in Notepad:

This is a sentence.

The output I get here on the first run of the for loop is

This is a sentence.²²²²½½½½½½½½■
This is a sentence.²²²²½½½½½½½½■

On the second run of the for loop, I get

his is a sentence.²²²²½½½½½½½½■

The third run:

is is a sentence.²²²²½½½½½½½½■

and so forth. I can see why this is happening; I am incrementing the pointer which currently points at the start of the sentence, so on each increment, it points one character further along the sentence.

Now come the questions:

1) Why is there a load of rubbish on the end of pIn ? It points to b, which is a buffer allocated to the exact size of the text I read in. Where is this rubbish coming from ?

2) I save the file C:\Input.txt as UNICODE via Notepad. Now, the output I get is completely different

First run:

■T
■T

Second run:

■T

Third run:

T

Fourth run:

<single empty char>

Fifth run:

h

etc. It appears that I output each character in the sentence followed by an empty character. (Obviously, I only notice this since I output each byte on a new line. If they were all on the same line, this would be transparent.)

Now, I think this is because of the UNICODE character encoding: each character in the sentence is two bytes, but because they are English characters, the second byte is not used, so it is just empty. Please correct me if I am wrong. *** Does the ++ notation used on the pointer increment by one BYTE because I have a BYTE array? If I had a char array, would ++ increment by one char?

But what I don't get is why the exact same method as before outputs one byte at a time, whereas before I got the whole sentence, just trimmed from the left, if you get me.

Any ideas ?

TIA

SOLUTION

e_tadeu


ASKER CERTIFIED SOLUTION

AlexFM


mrwad99

ASKER

Thanks both.

>> But what I don't get is why the exact same method as before outputs one byte at a time, whereas before I got the whole sentence, just trimmed from the left, if you get me.

Any comments on that ?

AlexFM

Because in the Unicode file every byte of English text is followed by a null byte. cout treats a BYTE* pointer as a C string, and a null byte marks the end of the string.

e_tadeu

I think it outputs one byte at a time because, in Unicode, your string looks something like this:
'T\0h\0i\0s\0 \0i\0s\0 \0'...

You see, there are \0's in between, so cout will only output one char and then end the string because of the \0, unless you use proper Unicode I/O (e.g. wcout).

mrwad99

ASKER

Thanks both :)