Solved

UNICODE file and pointer arithmetic headache !

Posted on 2006-06-28
Last Modified: 2010-04-01
Ah hello.

I have a query regarding UNICODE files.

Please consider the following code, intended merely to read a file into a BYTE array.  I use some MFC classes here, but that is not really relevant.

      CString strFile( _T( "C:\\Input.txt" ) );
      CFile file;
      if ( ! file.Open( strFile, CFile::modeRead ) )
            return -1;

      UINT nFileLengthBytes = file.GetLength();

      BYTE* b = new BYTE[ nFileLengthBytes ];

      if ( ! b ) return -1;

      TCHAR bom;
      BOOL bUnicode = FALSE;

      file.Read( &bom, sizeof(_TCHAR) );
      bUnicode =  bom == _TCHAR( 0xFEFF );
      file.SeekToBegin();

      UINT uBytesRead = file.Read( b, nFileLengthBytes );

      BYTE* pIn = b;

      cout << pIn << endl;
      for( UINT uPos = 0; uPos < uBytesRead;  uPos++)
      {
            cout << (pIn)++ << endl;
      }


As you can see, I read the whole file byte for byte into a BYTE array.  Simple.

Currently, the input file contains just one line, without a carriage return, saved as ANSI in Notepad:

This is a sentence.


The output I get here on the first run of the for loop is

This is a sentence.²²²²½½½½½½½½■
This is a sentence.²²²²½½½½½½½½■

On the second run of the for loop, I get

his is a sentence.²²²²½½½½½½½½■

The third run:

is is a sentence.²²²²½½½½½½½½■

and so forth.  I can see why this is happening; I am incrementing the pointer which currently points at the start of the sentence, so on each increment, it points one character further along the sentence.

Now come the questions:

1) Why is there a load of rubbish on the end of pIn ?  It points to b, which is a buffer allocated to the exact size of the text I read in.  Where is this rubbish coming from ?

2) I save the file C:\Input.txt as UNICODE via Notepad.  Now, the output I get is completely different

First run:

 ■T
 ■T

Second run:

■T

Third run:

T

Fourth run:

<single empty char>


Fifth run:

h

etc.  It appears that I output each character in the sentence followed by an empty character.  (Obviously, I only notice this since I output each byte on a new line.  If they were all on the same line, this would be transparent.)

Now, I think this is because of the UNICODE character encoding: each character in the sentence is two bytes, but because they are English characters, the second byte is not used, so it is just empty.  Please correct me if I am wrong.  *** Does the ++ notation used on the pointer increment by one BYTE because I have a BYTE array ?  If I had a char array, would ++ increment by one char ?

But what I don't get is why the exact same method as before outputs one byte at a time, whereas before I got the whole sentence, just trimmed from the left, if you get me.

Any ideas ?

TIA
Question by:mrwad99
 
LVL 4

Assisted Solution

by:e_tadeu
e_tadeu earned 1000 total points
ID: 17000522
For 1):

The rubbish is there because your string is not finalized with a null char. As you said, the buffer is exactly the size of the string, so there is no space for the ending null char. What you should do is:

   BYTE* b = new BYTE[ nFileLengthBytes + 1];

// ....

   UINT uBytesRead = file.Read( b, nFileLengthBytes );
   b[uBytesRead] = '\0';

for 2):

Because it is Unicode, you can't treat the BYTE buffer as text. You should really be using wchar_t for Unicode text!
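
As a rough illustration of treating the data as 16-bit units rather than bytes, here is a minimal sketch. It assumes the file is little-endian UTF-16 with an optional BOM, which is what Notepad's "Unicode" option normally writes. The helper name decodeUtf16Le is made up for this example, and it uses char16_t instead of wchar_t so the unit size does not depend on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of MFC): turns raw UTF-16LE file bytes into
// 16-bit code units. A leading byte order mark (bytes FF FE) is skipped.
std::u16string decodeUtf16Le(const std::vector<unsigned char>& bytes) {
    std::size_t start = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        start = 2;  // skip the BOM
    }
    std::u16string text;
    // Combine each low/high byte pair into one code unit.
    // A trailing odd byte, if any, is ignored.
    for (std::size_t pos = start; pos + 1 < bytes.size(); pos += 2) {
        text.push_back(static_cast<char16_t>(bytes[pos] | (bytes[pos + 1] << 8)));
    }
    assert(text.size() == (bytes.size() - start) / 2);
    return text;
}
```

With the bytes FF FE 54 00 68 00 this yields the two code units for "Th", with no zero bytes in between to cut the text short.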

 
LVL 48

Accepted Solution

by:
AlexFM earned 1000 total points
ID: 17000538
1) Why is there a load of rubbish on the end of pIn ?
You need a null character at the end; the cout operator handles BYTE* as a string. You can do the following:

BYTE* b = new BYTE[ nFileLengthBytes + 2];
b[nFileLengthBytes + 1] = b[nFileLengthBytes] = 0;

Two bytes will be OK both for ANSI and Unicode.
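
A small sketch of the same idea without MFC, in case it helps to see it in isolation. The helper name withTerminator is invented for this example; it only shows the "allocate two extra zero bytes" step, not the file reading.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies `size` bytes and appends two zero bytes, so the
// buffer ends in a terminator whether it is later read as 1-byte chars or as
// 2-byte (16-bit) characters.
std::vector<unsigned char> withTerminator(const unsigned char* data, std::size_t size) {
    std::vector<unsigned char> out(size + 2, 0);  // every byte starts as zero
    for (std::size_t i = 0; i < size; ++i) {
        out[i] = data[i];
    }
    assert(out[size] == 0 && out[size + 1] == 0);  // the two terminator bytes
    return out;
}
```

Printing out.data() as a char string now stops at the first added zero instead of running into whatever memory follows the allocation.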

2) because they are English characters, the second byte is not used, so it just empty.
Yes, it is equal to 0.

>>If I had a char array, would ++ increment by one char ?

unsigned short* or WCHAR* would advance by two bytes. Incrementing a pointer moves it by sizeof(type) bytes, so BYTE* and char* both advance by one byte.
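
To check the sizeof rule on a given compiler, something like the following sketch can be used. The template name strideInBytes is made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: measures how many bytes a T* moves when it is
// incremented once, by comparing the addresses before and after as byte pointers.
template <typename T>
std::size_t strideInBytes() {
    T items[2] = {};
    T* p = items;
    const unsigned char* before = reinterpret_cast<const unsigned char*>(p);
    ++p;  // advances by one element, not by one byte
    const unsigned char* after = reinterpret_cast<const unsigned char*>(p);
    std::size_t stride = static_cast<std::size_t>(after - before);
    assert(stride == sizeof(T));
    return stride;
}
```

strideInBytes<unsigned char>() and strideInBytes<char>() both give 1, while strideInBytes<unsigned short>() gives sizeof(unsigned short), which is 2 with Visual C++.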
 
LVL 19

Author Comment

by:mrwad99
ID: 17000625
Thanks both.  

>> But what I don't get is why the exact same method as before outputs one byte at a time, whereas before I got the whole sentence, just trimmed from the left, if you get me.

Any comments on that ?
 
LVL 48

Expert Comment

by:AlexFM
ID: 17000682
Because after every character byte there is a null byte. cout handles a BYTE* pointer as a string, and null marks the end of the string.
 
LVL 4

Expert Comment

by:e_tadeu
ID: 17000691
I think it outputs one byte at a time because, in Unicode, your string is something like this:
'T\0h\0i\0s\0 \0i\0s\0 \0'...

You see, with \0's in between. So it will only output one char and then end the string because of the \0, unless you use proper Unicode I/O (e.g. wcout).
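
A small sketch that reproduces this without any file I/O. It assumes the Unicode file starts with the two BOM bytes FF FE followed by the little-endian characters, and the helper name printedFrom is invented for this example; it mimics what a char-based stream shows when handed a pointer into the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns what a char-based stream would print when given
// a pointer to bytes[offset], i.e. everything up to the first zero byte
// (or up to the end of the buffer, so the sketch never reads past it).
std::string printedFrom(const std::vector<unsigned char>& bytes, std::size_t offset) {
    assert(offset <= bytes.size());
    std::string out;
    for (std::size_t i = offset; i < bytes.size() && bytes[i] != 0; ++i) {
        out.push_back(static_cast<char>(bytes[i]));
    }
    return out;
}
```

For the bytes FF FE 'T' 00 'h' 00, offset 0 gives the two BOM bytes plus "T", offset 2 gives "T", offset 3 gives an empty string and offset 4 gives "h", which matches the runs listed in the question.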

 
LVL 19

Author Comment

by:mrwad99
ID: 17000852
Thanks both :)

Question has a verified solution.
