Solved

ReadFile Unicode Help

Posted on 2009-04-12
Last Modified: 2013-12-14
Hello there,

I am trying to write a function that can read any type of file. It works fine with text files, but when I read HTML files the output shows only the first 3 bytes, which are "ÿþ<". The loop keeps reading, but the rest of the bytes read show up as (null).
I think the problem is that I am trying to read a Unicode file or something like that.
Could you help me edit this code so that it supports such files, including .doc, .pdf, or any other type of file?

Thank you very much
Regards
void ReadFile(char* szFileName){
	HANDLE	szFile =0;
	DWORD	szFileSize = 0;
	char*	szBuffer;
	BOOL	szReadFile = false;
	int		i=0;
	DWORD	szNumberOfBytesRead =0;

	szFile = CreateFile( szFileName , GENERIC_READ, FILE_SHARE_READ , NULL , OPEN_EXISTING , FILE_ATTRIBUTE_NORMAL , NULL );
	if ( szFile != INVALID_HANDLE_VALUE ){
		szFileSize = GetFileSize( szFile, NULL );
		szBuffer  = new char[szFileSize + 1 ];
		memset(szBuffer, 0 , (szFileSize + 1) * sizeof( char ) ); 
		do{
			szReadFile = ReadFile( szFile, &szBuffer[i++], 1 , &szNumberOfBytesRead, NULL );
		}while(szNumberOfBytesRead != 0);

		if (szReadFile)
		{
			printf("File Read Is:%s\n",szBuffer);	
		}
		else
		{
			printf("Last Error:%d\n",GetLastError());
			//handle error here
		}
	delete [] szBuffer;
	}
	CloseHandle(szFile);	
}

Question by:circler
 
LVL 7

Expert Comment

by:dolomiti
ID: 24128439
Hi,
may I suggest an exercise?

Run Notepad and save the empty document as EmptyA.txt.
Type Hallo, press Return, save it as HalloA.txt, and close the program.

Open EmptyA.txt and save it as EmptyU.txt, changing the Encoding from ANSI to Unicode, then close.
Open HalloA.txt and save it as HalloU.txt, changing the Encoding as above.

Check the Size in the Properties of these 4 files:
EmptyA = 0
EmptyU = 2 (a header)
HalloA = 7 (5 for Hallo + CR + LF)
HalloU = 16 (2 for the header, 10 for Hallo, 2 for CR, 2 for LF)

If you use Visual Studio, open these 4 files through the normal File Open dialog, choosing the binary editor instead of auto:
you'll see what is inside them. A file saved as Unicode starts with a 2-byte header (FF FE, the UTF-16 little-endian byte-order mark),
and each character (control characters too) takes 2 bytes.
For ordinary characters (ABCde123,.$&()... and control characters too), one of the two bytes is always 0.
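
That zero byte is also why a char-based printf("%s") stops after "ÿþ<": the fourth byte of the file is the 0x00 half of '<', and a C string ends at the first zero byte. A minimal sketch of this in portable C++ (no Win32 calls); the byte array kSample and the helper VisibleAsCString are made up for illustration and are not taken from the thread:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sample: the first bytes of a UTF-16LE file that starts with "<h".
// FF FE is the byte-order mark; each ASCII character is followed by a 0x00 byte.
static const unsigned char kSample[] = {
    0xFF, 0xFE,   // byte-order mark
    0x3C, 0x00,   // '<'
    0x68, 0x00,   // 'h'
    0x00          // extra terminator so strlen always stops inside the array
};

// Number of bytes a char-based routine such as printf("%s") would consume
// before it reaches the first zero byte.
std::size_t VisibleAsCString(const unsigned char* bytes)
{
    assert(bytes != nullptr);
    return std::strlen(reinterpret_cast<const char*>(bytes));
}
```

Calling VisibleAsCString(kSample) counts only FF, FE and 3C, which a Western code page renders as "ÿþ<".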

For literature see:
http://en.wikipedia.org/wiki/Unicode
http://unicode.org/

for support
http://msdn.microsoft.com/en-us/library/2dax2h36.aspx
http://msdn.microsoft.com/en-us/library/dybsewaf.aspx

Returning to your problem: open these HTML files in Notepad and check in the Save As dialog box
whether Unicode is pre-selected. As for the code, decide whether to open the file in binary mode
(which I don't recommend), skip the header, and read each character as two bytes, or to use _TCHAR
as the type together with the appropriate API calls.
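
The first option (skip the header and read each character as two bytes) can be sketched as follows. This is an illustration in portable C++, assuming the file bytes are already in memory and are little-endian UTF-16; the function name DecodeUtf16LE is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Combine pairs of bytes (low byte first) into 16-bit code units,
// skipping a leading FF FE byte-order mark if one is present.
// A trailing odd byte, if any, is ignored.
std::u16string DecodeUtf16LE(const unsigned char* bytes, std::size_t size)
{
    assert(bytes != nullptr || size == 0);
    std::size_t pos = 0;
    if (size >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        pos = 2;  // skip the header
    }
    std::u16string units;
    units.reserve((size - pos) / 2);
    for (; pos + 1 < size; pos += 2) {
        units.push_back(static_cast<char16_t>(bytes[pos] | (bytes[pos + 1] << 8)));
    }
    return units;
}
```

For the bytes FF FE 48 00 69 00 this yields the two code units of "Hi".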

bye
vic

 
LVL 86

Expert Comment

by:jkr
ID: 24130848
Well, first of all, reading a file that way, byte by byte, is terribly inefficient. Have you tried the following? Also, the reason for the strange output is that you are printing the contents as ANSI; use 'wprintf()' instead of 'printf()', as below:
        szFile = CreateFile( szFileName , GENERIC_READ, FILE_SHARE_READ , NULL , OPEN_EXISTING , FILE_ATTRIBUTE_NORMAL , NULL );
        if ( szFile != INVALID_HANDLE_VALUE ){
                szFileSize = GetFileSize( szFile, NULL );
                // leave room for a wide (2-byte) terminator, since the buffer is printed as wchar_t
                szBuffer  = new char[szFileSize + sizeof( wchar_t )];
                memset(szBuffer, 0 , szFileSize + sizeof( wchar_t ) );
                szReadFile = ReadFile( szFile, szBuffer, szFileSize, &szNumberOfBytesRead, NULL ); // read the entire file at once

                if (szReadFile)
                {
                        // with the Microsoft CRT, %s in wprintf expects a wide string
                        wprintf(L"File Read Is:%s\n",(wchar_t*)szBuffer);  // output UNICODE buffer
                }
                else
                {
                        printf("Last Error:%lu\n",GetLastError());
                        //handle error here
                }
                delete [] szBuffer;
                CloseHandle(szFile);  // close only a handle that was actually opened
        }
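
The same idea (ask for the size, then read everything at once) can be sketched without the Win32 API. The version below is an assumption-laden illustration using std::ifstream rather than CreateFile/ReadFile; the name ReadAllBytes is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read a whole file as raw bytes with a single read call.
// Returns false if the file cannot be opened or read.
bool ReadAllBytes(const std::string& path, std::vector<unsigned char>& out)
{
    // std::ios::ate opens the stream positioned at the end, so tellg() gives the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;
    const std::streamoff size = in.tellg();
    if (size < 0) return false;
    in.seekg(0, std::ios::beg);
    out.resize(static_cast<std::size_t>(size));
    if (size > 0 && !in.read(reinterpret_cast<char*>(out.data()), size)) return false;
    assert(out.size() == static_cast<std::size_t>(size));
    return true;
}
```

Keeping the bytes in a std::vector, together with their count, avoids relying on a zero terminator, which matters for Unicode and binary files that contain zero bytes.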

 
LVL 19

Expert Comment

by:LordOfPorts
ID: 24130997
One possible approach you could take is to switch to the "Use Unicode Character Set" option under project properties and change the code as in the code snippet below.

Next, after you have read the file, use the IsTextUnicode function http://msdn.microsoft.com/en-us/library/dd318672(VS.85).aspx to determine whether the text is multi-byte or Unicode, and print it accordingly. To test, use Notepad to create a file called ANSI.txt saved with the ANSI encoding option, and one called Unicode.txt saved with the encoding option labeled Unicode.


#include "stdafx.h"
#include "Windows.h"
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>

void ReadFile(const TCHAR* szFileName);

int _tmain(int argc, _TCHAR* argv[])
{
	ReadFile(_T("C:\\ANSI.txt"));
	ReadFile(_T("C:\\Unicode.txt"));

	system("pause");
	return 0;
}

void ReadFile(const TCHAR* szFileName){
        HANDLE  szFile =0;
        DWORD   szFileSize = 0;
        TCHAR*   szBuffer;
        BOOL    szReadFile = false;
        int     i=0;
        DWORD   szNumberOfBytesRead =0;

        szFile = CreateFile( szFileName , GENERIC_READ, FILE_SHARE_READ , NULL , OPEN_EXISTING , FILE_ATTRIBUTE_NORMAL , NULL );
        if ( szFile != INVALID_HANDLE_VALUE ){
                szFileSize = GetFileSize( szFile, NULL );
                szBuffer  = new TCHAR[szFileSize + 1];
                memset(szBuffer, 0 , (szFileSize + 1) * sizeof( TCHAR ) ); 
                // read one TCHAR (2 bytes in a Unicode build) at a time until nothing is left
                do{
                        szReadFile = ReadFile( szFile, &szBuffer[i++], sizeof(TCHAR) , &szNumberOfBytesRead, NULL );
                } while(szNumberOfBytesRead != 0);

                if (szReadFile)
                {
                        // If the file is not Unicode
                        if(!IsTextUnicode((LPVOID)szBuffer, szFileSize, NULL)) {
                                printf("ANSI File Read Is:%s\n", (PCHAR)szBuffer);
                        }
                        else {
                                _tprintf(_T("Unicode File Read Is:%s\n"),szBuffer);
                        }
                }
                else
                {
                    _tprintf(_T("Last Error:%lu\n"),GetLastError());
                }
                delete [] szBuffer;
                CloseHandle(szFile);  // close only a handle that was actually opened
        }
}
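
IsTextUnicode works by statistical heuristics, so a deterministic check of the byte-order mark can complement it for files that Notepad saved with a header. The sketch below is portable C++ and only an illustration; the enum TextEncoding and the function DetectBom are hypothetical names, and a file without a BOM is simply reported as Unknown:

```cpp
#include <cassert>
#include <cstddef>

enum class TextEncoding { Unknown, Utf8Bom, Utf16LE, Utf16BE };

// Classify a buffer by its byte-order mark only. Files without a BOM
// (plain ANSI text, BOM-less UTF-8, binary data) come back as Unknown.
// FF FE 00 00 (the UTF-32LE mark) is not distinguished from UTF-16LE here.
TextEncoding DetectBom(const unsigned char* bytes, std::size_t size)
{
    assert(bytes != nullptr || size == 0);
    if (size >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return TextEncoding::Utf8Bom;
    if (size >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return TextEncoding::Utf16LE;
    if (size >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return TextEncoding::Utf16BE;
    return TextEncoding::Unknown;
}
```

A caller could try DetectBom first and fall back to IsTextUnicode, or to treating the data as ANSI, when the result is Unknown.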

[Attached screenshot: UnicodeOption.png]
 

Author Comment

by:circler
ID: 24131060
Thank you, dolomiti, for that piece of info, and thank you, jkr, for the code fix.

But my problem remains: I am trying to read both Unicode and non-Unicode files. With jkr's code I can only parse non-Unicode data, and when the file is Unicode I can only see the leading Unicode marker bytes (0xFF or whatever they are).
Can anyone show me how to write a function that supports both Unicode and non-Unicode files, so that I can parse both?

Regards
 
LVL 86

Expert Comment

by:jkr
ID: 24131121
>>If anyone can direct me how to write a function that supports both unicode
>>and non unicode and how I can be able to parse both..

No problem ;o)

        szFile = CreateFile( szFileName , GENERIC_READ, FILE_SHARE_READ , NULL , OPEN_EXISTING , FILE_ATTRIBUTE_NORMAL , NULL );
        if ( szFile != INVALID_HANDLE_VALUE ){
                szFileSize = GetFileSize( szFile, NULL );
                // two zero bytes at the end, so the buffer is terminated both as char and as wchar_t
                szBuffer  = new char[szFileSize + sizeof( wchar_t )];
                memset(szBuffer, 0 , szFileSize + sizeof( wchar_t ) );
                szReadFile = ReadFile( szFile, szBuffer, szFileSize, &szNumberOfBytesRead, NULL ); // read the entire file at once

                if (szReadFile)
                {
                        // IsTextUnicode takes the buffer size in bytes, i.e. the number of bytes read
                        if(IsTextUnicode((LPVOID)szBuffer,szNumberOfBytesRead,NULL)) // <------ Test here
                          wprintf(L"File Read Is:%s\n",(wchar_t*)szBuffer);  // output UNICODE buffer
                        else
                          printf("File Read Is:%s\n",szBuffer);  // output ANSI buffer
                }
                else
                {
                        printf("Last Error:%lu\n",GetLastError());
                        //handle error here
                }
                delete [] szBuffer;
                CloseHandle(szFile);  // close only a handle that was actually opened
        }

Author Comment

by:circler
ID: 24131391
Thank you both, we're almost there. Is there any way to convert the detected Unicode buffer to a multi-byte buffer so I can use it in the rest of my functions (using dynamic memory, please)?

Regards
 
LVL 86

Accepted Solution

by:
jkr earned 300 total points
ID: 24131470
Well, you can do that using 'WideCharToMultiByte()' (http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx) or 'wcstombs()' (http://msdn.microsoft.com/en-us/library/5d7tc9zw(VS.80).aspx), e.g.
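        szFile = CreateFile( szFileName , GENERIC_READ, FILE_SHARE_READ , NULL , OPEN_EXISTING , FILE_ATTRIBUTE_NORMAL , NULL );
        if ( szFile != INVALID_HANDLE_VALUE ){
                szFileSize = GetFileSize( szFile, NULL );
                // two zero bytes at the end, so the buffer is terminated both as char and as wchar_t
                szBuffer  = new char[szFileSize + sizeof( wchar_t )];
                memset(szBuffer, 0 , szFileSize + sizeof( wchar_t ) );
                szReadFile = ReadFile( szFile, szBuffer, szFileSize, &szNumberOfBytesRead, NULL ); // read the entire file at once

                if (szReadFile)
                {
                  // IsTextUnicode takes the buffer size in bytes, i.e. the number of bytes read
                  if(IsTextUnicode((LPVOID)szBuffer,szNumberOfBytesRead,NULL)) // <------ Test here
                  {
                    wchar_t* wide = (wchar_t*)szBuffer;
                    if (*wide == 0xFEFF) ++wide;           // skip the byte-order mark
                    size_t cch = wcslen(wide);
                    char* tmp = new char[2 * cch + 1];     // up to 2 bytes per character in a DBCS code page

                    // wcstombs uses the current C locale; call setlocale(LC_ALL, "") beforehand for non-ASCII text
                    size_t n = wcstombs(tmp, wide, 2 * cch);
                    if (n == (size_t)-1) n = 0;            // a character could not be converted
                    tmp[n] = '\0';

                    strcpy(szBuffer,tmp);                  // fits: the result is never longer than the bytes read
                    delete [] tmp;
                  }

                  printf("File Read Is:%s\n",szBuffer);  // output ANSI buffer
                }
                else
                {
                        printf("Last Error:%lu\n",GetLastError());
                        //handle error here
                }
                delete [] szBuffer;
                CloseHandle(szFile);  // close only a handle that was actually opened
        }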
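
For readers who want the converted text in dynamically managed memory without depending on the ANSI code page, here is a hedged sketch that swaps in UTF-8 as the target encoding instead of the wcstombs conversion above. It is plain C++ with no Win32 calls, the function name Utf16ToUtf8 is hypothetical, and it assumes the UTF-16 code units have already been extracted from the file with the byte-order mark removed:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert UTF-16 code units to UTF-8. Unpaired surrogates become U+FFFD.
// The std::string result grows as needed, so no manual buffer sizing is required.
std::string Utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine both halves into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // lone surrogate
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Unlike a conversion to the ANSI code page, UTF-8 can represent every character, at the cost that the rest of the program must then treat its char strings as UTF-8.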

 
LVL 19

Assisted Solution

by:LordOfPorts
LordOfPorts earned 200 total points
ID: 24131510
Once you enter the IsTextUnicode block you can use the WideCharToMultiByte function http://msdn.microsoft.com/en-us/library/aa450989.aspx e.g.:
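// First call: ask how many bytes the converted string needs (terminator included, because of -1)
int nLength = WideCharToMultiByte(CP_ACP, 0, szBuffer, -1, NULL, 0, NULL, NULL);

PCHAR mbsBuf = new CHAR[nLength];

// Second call: convert, passing the same -1 so the length matches the first call
WideCharToMultiByte(CP_ACP, 0, szBuffer, -1, mbsBuf, nLength, NULL, NULL);

printf("Widechar to Multi-byte converted: %s", mbsBuf);

delete[] mbsBuf;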
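
The "measure first, then convert" pattern used with WideCharToMultiByte above also exists in the standard library. The sketch below uses std::wcsrtombs in place of the Win32 call, so it converts according to the current C locale rather than CP_ACP; the function name WideToMultiByte is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Convert a wide string using the current C locale in two passes.
// Returns an empty string when a character cannot be represented in that locale.
std::string WideToMultiByte(const wchar_t* wide)
{
    assert(wide != nullptr);
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide;
    // Pass 1: a null destination makes wcsrtombs report the required byte
    // count (terminator excluded) without writing anything.
    const std::size_t needed = std::wcsrtombs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) return std::string();

    std::string out(needed, '\0');
    state = std::mbstate_t();
    src = wide;
    // Pass 2: convert into the exactly sized buffer.
    const std::size_t written = std::wcsrtombs(&out[0], &src, needed, &state);
    if (written == static_cast<std::size_t>(-1)) return std::string();
    out.resize(written);
    return out;
}
```

In the default "C" locale only ASCII characters convert; a program would typically call setlocale(LC_ALL, "") first to pick up the user's locale.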

 

Author Closing Comment

by:circler
ID: 31569405
Thank you very much :)
 

Author Comment

by:circler
ID: 24132092
Thanks to everyone who tried to help, especially LordOfPorts and jkr.

Regards