Solved

More UNICODE file trouble

Posted on 2006-06-29
27
1,043 Views
Last Modified: 2008-01-09
Ah hello.

I have the following code, intended to make a byte-for-byte copy of a file, replacing all occurencies certain string with another.

I know that this code below has no safety in that we check for the amount of memory that will be required, but, that aside, I cannot get it to work with _UNICODE defined in my project settings.

Please take a look.

'lpwStrReplacementToken' is what I want to replace with 'strDirectory'.

I am finding that with UNICODE defined, the memcmp test never returns 0.  Why is this ?

TIA

      CString strFile(  _T( "C:\\Input.txt" ) );

      CFile file;
      if ( ! file.Open( strFile, CFile::modeRead ) )
            return -1;
      
      UINT nFileLengthBytes = file.GetLength();

      CString strDirectory = _T("Replacement");

      UINT uDirectoryLength = (strDirectory.GetLength()) * sizeof( TCHAR );
      BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof( BYTE )  * 2 ) );

      BYTE* b = new BYTE[ nFileLengthBytes ];
      if ( ! b )
            return -1;
      
      TCHAR bom;
      BOOL bUnicode = FALSE;

      file.Read( &bom, sizeof(_TCHAR) );
      bUnicode =  bom == _TCHAR( 0xFEFF );

      file.SeekToBegin();

      UINT uBytesRead = file.Read( b, nFileLengthBytes );
      BYTE* pIn = b;
      BYTE* pOut = pbOutput;

      //cout << pIn;
      USES_CONVERSION;

      LPWSTR lpwStrReplacementToken = L"sentence";
      CString strReplacementToken = lpwStrReplacementToken;
      if ( ! bUnicode )
            strReplacementToken = W2A( lpwStrReplacementToken );

      int nReplacementTokenLength = ( bUnicode ? sizeof( wchar_t ) : sizeof( char ) ) * strReplacementToken.GetLength();

      UINT uBytesToWrite = 0, uCurrentBufferSize = nFileLengthBytes * 2;
      for ( UINT uPos = 0; uPos < uBytesRead; )
      {
            cout << *pIn;
            if ( memcmp( pIn, strReplacementToken, nReplacementTokenLength ) == 0 )
            {      
                  memcpy( pOut, strDirectory, uDirectoryLength );

                  pIn += nReplacementTokenLength;
                  pOut += uDirectoryLength;
                  uPos += nReplacementTokenLength;
                  uBytesToWrite += uDirectoryLength;
            }
            else
            {
                  *pOut++ = *pIn++;
                  uPos++;
                  uBytesToWrite++;
            }
      }

      CFile fileOutput;
      if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite ) )
            return -1;

      fileOutput.Write( pbOutput, uBytesToWrite );
      fileOutput.Close();

      delete[] b;
      free ( pbOutput );

      
0
Comment
Question by:mrwad99
  • 8
  • 5
  • 4
  • +6
27 Comments
 
LVL 11

Expert Comment

by:Jase-Coder
ID: 17007533
if you using unicode you should use wmemcmp rather than memcmp
0
 
LVL 11

Expert Comment

by:Jase-Coder
ID: 17007538
to use wmemcmp include <wchar.t>
0
 
LVL 22

Expert Comment

by:mahesh1402
ID: 17007752
better is _tcscmp() which is compiled to either strcmp() or wcscmp() depending on whether or not you're doing a Unicode build.

-MAHESH
0
 
LVL 4

Expert Comment

by:e_tadeu
ID: 17008882
Also don't assume your chars are BYTE now.
0
 
LVL 30

Expert Comment

by:Axter
ID: 17012384
>>better is _tcscmp() which is compiled to either strcmp() or wcscmp() depending on whether or not you're doing a Unicode build.

I totally agree with mahesh1402.
If you're using _T objects, then you should use _t associated API functions.  Other wise, there's no point in using _T code to begin with.
0
 
LVL 11

Assisted Solution

by:cup
cup earned 100 total points
ID: 17024176
CStrings do not automatically cast to BYTE*.  You need to use the LPCTSTR operator.  This will get the data member in the correct type.

      if ( memcmp( pIn, (LPCTSTR) strReplacementToken, nReplacementTokenLength ) == 0 )
   {
           memcpy( pOut, (LPCTSTR) strDirectory, uDirectoryLength );
           ...

If you are printing a unicode string it should be

wcout << _T('"') << (TCHAR*) pIn << _T('"') << endl
wcout << _T('"') << (LPCTSTR) strReplacementToken << _T('"') << endl;

Print both strings to find out why your comparison is failing.  

0
 
LVL 19

Author Comment

by:mrwad99
ID: 17036428
Right, ok.

Sorry about the delays, had network issues.

Anyway.  I have had a good play with all the suggestions given and still cannot get it to work.

Jase-Coder:

I have changed my line that was using memcmp to

#ifdef _UNICODE
            if ( wmemcmp ( (LPCWSTR)pIn, strReplacementToken, nReplacementTokenLength ) == 0 )
            {      
#else
            if ( memcmp( pIn, strReplacementToken, nReplacementTokenLength ) == 0 )
            {      
#endif

and it still does not work.

mahesh1402/Axter:

how can I use _tcscmp ?  I am comparing a certain amount of memory, not two strings.  

Currently,

memcmp( pIn, strReplacementToken, nReplacementTokenLength ) == 0 )

starts at pIn, and compares the next nReplacementTokenLength bytes with those contained in strReplacementToken.  _tcscmp does not have any provision for a certain number of bytes to be compared...

e_tadeu:

I am not with you there.  What do you mean ?

cup:

I tried that and it still does not work.

I am also not bothered about printing, all I want to do is a simple memory comparision.  Yet this is so difficult !

>>Print both strings to find out why your comparison is failing.  

That did nothing.

?!?

Arggghhh !
0
 
LVL 11

Expert Comment

by:cup
ID: 17038397
Did you try LPCTSTR or something else?  LPCTSTR is an operator for CStrings: not a cast.  Everything else is a cast.  What does this print?

const TCHAR* str = (LPCTSTR) strReplacementToken;
wcout << "length is " << nReplacementTokenLength  << " value is " <<  str << endl;

If it doesn't print anything at all, then there is somethig wrong with the replacement string.  Next

cout << pIn << endl;  // if this doesn't print anything it is probably unicode
wcout << (LPCTSTR) pIn << endl; // if this doesn't print anything, you have an empty string in pIn

If the cout prints something then you are comparing a unicode string against a non unicode string.  That won't work.
0
 
LVL 19

Author Comment

by:mrwad99
ID: 17042762
>> const TCHAR* str = (LPCTSTR) strReplacementToken;
wcout << "length is " << nReplacementTokenLength  << " value is " <<  str << endl;

This prints

length is 8 value is sentence

>> cout << pIn << endl;  // if this doesn't print anything it is probably unicode
wcout << (LPCTSTR) pIn << endl; // if this doesn't print anything, you have an empty string in pIn

This prints

This is a sentence.²²²²½½½½½½½½&#9632;

The wcout has no effect.

??

But...

I think I have figured all this out.  Firstly, with my code as-is, compiled with _UNICODE defined, it does not work *if I have an ANSI file as the input file*.  So, with the source as-is, C:\Input.txt is an ANSI file, and it does not work.  Open the file in Notepad and save it as UNICODE however, and my code works fine.

This lead me to look into exactly what is happening when my code tries to make a comparision between the memory currently pointed to by pIn, and the replacement string.

I had a play, and have uploaded some screenshots to demontrate what I found.  Please take a look at them.

The first ( http://img240.imageshack.us/img240/3536/notworking3fx.png ) shows the memory for one of my variables setup via a call to W2A.  You can see that the memory is showing that the string is still UNICODE, i.e. has two bytes per character.

The second ( http://img154.imageshack.us/img154/4719/working0qg.png ) shows the memory for a char array, used to convert the same UNICODE CString (strTemp) as I did via w2A.  Except, the memory window clearly shows that there is only one byte per character; an ANSI string.

So, it would seem that W2A is failing.  The question now is why ?
0
 
LVL 19

Author Comment

by:mrwad99
ID: 17042766
(Points now at 500 as this is getting silly)
0
 
LVL 19

Author Comment

by:mrwad99
ID: 17043042
Incidentally,

      CStringA t = W2A( strTemp );


Looks fine when viewed in memory, it is not double byte encoded.
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 17043078
Macros like W2A use _alloca to grab a chunk of stack to put the converted string into. You should expect strReplacementToken no longer to point to  lpwStrReplacementToken after W2A has been called. It is conceivable, bearing in mind the funkiness of alloca_ that your debugger is getting things wrong. Can you confirm that strReplacementToken no longer points to lpwStrReplacementToken after you step past the W2A call??
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 17043092
I didn't see your last post, mrwad99 . Ignore mine.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 17

Expert Comment

by:rstaveley
ID: 17043316
You start by constructing a CString from a wchar_t array string, which is fine and then, if bUnicode is true, you reassign it to be the same string using a narrow char array. You get the narrow char array from W2A. That should load it with the same contents?! No matter, read on...

Your memcmp/wmemcmp needs to compare LPCTSTR string fetched from the CString object. You need the cast to call operator LPCTSTR(), however LPCTSTR will be *either* LPCSTR or LPCWSTR depending on _UNICODE, so you *must* have _UNICODE defined for your code to work to get wide character andling internally throughout. With _UNICODE, LPCTSTR() is the same as LPCWSTR().

i.e.

#ifdef _UNICODE
          if ( wmemcmp (LPCWSTR(pIn), LPCWSTR(strReplacementToken), nReplacementTokenLength ) == 0 )
          {    
#else
          #error Must be unicode!!
#endif
0
 
LVL 4

Expert Comment

by:havman56
ID: 17044344
there is a bug in vc6 regarding wchar_t.

whenever u read a char from stream if it is greater than 0xfd that is (ex: 0x3424) which 2 bytes for wchar_t then stream in MSVC identifes as space character.

no idea on vc7 .

0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 17045084
/*
You seem mixing up UNICODE inputfile and UNICODE build.

If I understand the requirements correctly there are 4 cases to consider:

      _UNICODE defined     UNICODE textfile
A:          no                           no
B:          no                           yes
C:          yes                          no
D:          yes                          yes

Unfortunately CString only works for A and D cause it either is single-char or wide-char. Same applies to all "T" - switches.

So, B and C can't be handled properly when using CString and the TCHAR switches.

I would suggest to use std::string and std::wstring instead. Both can be used same time.

The prog below compiles but isn't tested. There is the assumption that you didn't want to change file type, means a UNICODE text file keeps a UNICODE text file while an ANSI textfile keeps ANSI.

If the assumption is wrong you would need to convert from string to wstring or buf to wbuf what needs a UNICODE conversion function like mbstowcs and a second buffer of an appropriate size. I would recommend *not* to using the USES_CONVERSION macro and A2W macro cause it requires _UNICODE macro set (as far as I know) while mbstowcs works in any case. mbstowcs can calculate the amount of wchar_t  chars needed when passing a NULL pointer as output buffer.

Hope it works.

Regards, Alex


*/
   
#include <string>
#include <iostream>
#include <fstream>
using namespace std;

#include <sys/stat.h>

int main()    
{
    struct stat status;
    if (stat("c:\\input.dat", &status) != 0)
         return 1;
    ifstream ifs("c:\\input.dat", ios::binary | ios::in);
    if (!ifs)
        return 2;
   
    unsigned char* buf = new unsigned char[status.st_size];
    if (!ifs.read((char*)buf,  status.st_size))
        return 3;

    bool bUnicode =  *((wchar_t*)buf) == 0xfeff;
   
    int outLen = 0;
    if (!bUnicode)
    {
         string s((char*)buf, status.st_size);
         delete [] buf;
         int pos = 0;
         int lpos = 0;
         string sfind = "sentence";
         string srepl = "replacement";
         while ((pos = s.find(sfind.c_str(), lpos)) != string::npos)
         {
             s.replace(pos, sfind.length(), srepl);
             lpos = pos + srepl.length();
         }

         buf = new unsigned char[s.length()];
         memcpy(buf, s.c_str(), s.length());
         outLen = s.length();
    }
    else // if (bUnicode)
    {
         wchar_t* wbuf = (wchar_t*) buf;
         wstring ws(wbuf, status.st_size/2);
         delete [] buf;
         int pos = 0;
         int lpos = 0;
         wstring wsfind = L"sentence";
         wstring wsrepl = L"replacement";
         while ((pos = ws.find(wsfind.c_str(), lpos)) != wstring::npos)
         {
             ws.replace(pos, wsfind.length(), wsrepl);
             lpos = pos + wsrepl.length();
         }

         buf = (unsigned char*)(new wchar_t[ws.length()]);
         memcpy(buf, ws.c_str(), ws.length() * sizeof(wchar_t));
         outLen = ws.length() * sizeof(wchar_t);
    }

    ofstream ofs("c:\\output.dat", ios::out | ios::binary);
    ofs.write((const char*)buf, outLen);
    ofs.close();
    return 0;
}

0
 
LVL 19

Author Comment

by:mrwad99
ID: 17049600
rstaveley

Thanks for the comment.  I would like to clarify something if possible.

>> You get the narrow char array from W2A. That should load it with the same contents?!

I am not sure what you mean there, but I think you are referring to the fact that I do this:

      LPWSTR lpwStrReplacementToken = L"sentence";
      CString strReplacementToken(_T("")), strTemp = lpwStrReplacementToken;
      
      if ( ! bUnicode )
            strReplacementToken = W2A(strTemp );       // <-----------------------
      else
            strReplacementToken = strTemp;      

Are you saying that I am assigning what is a CStringA (returned by w2A) to strReplacementToken (a CStringW (due to _UNICODE) ), which ends up with strReplacementToken being a CStringW ?  Well, if that is the case, I can certainly see why everything is failing, and I can see the memory window showed strReplacementToken as having two bytes per character in my previous screenshot ( http://img240.imageshack.us/img240/3536/notworking3fx.png )

***** Is this what you are saying the problem is, and why I essentially believe that W2A is "not working" ? *****

>> Your memcmp/wmemcmp needs to compare LPCTSTR string fetched from the CString object...

Now, you use memcmp in your code, however I don't see why I *have* to, since all I am comparing is blocks of memory, which can be of any data type, including ANSI strings or UNICODE strings.  My code works fine for a UNICODE file, (with _UNICODE) defined as-is, so I am still not convinced about what you say here.  Can you re-enforce you statement about using the LPCTSTR operator ?

havman56

I am using VS.NET 2003, so I guess that does not apply to me.  Thanks anyway.

itsmeandnobodyelse

Thank you for the attempt to fully understand my issue.  You are correct, there are four cases as you have clearly layed out.

>> I would suggest to use std::string and std::wstring instead. Both can be used same time.

That is an option.  However, as this code is part of a large MFC project, I really want to stick to CString objects.  I will try that code however as it is good practice.

Since you are obviously a seasoned C++ programmer, perhaps you can answer another question.  I, as far as I can see, have not had any answer to the overall problem of dealing with an ANSI file from within my UNICODE program.  The problem, if I have correctly described what I think rstaveley was getting at, is that I am always going to have what are CStringW objects with UNICODE defined.  I have hence come up with a hybrid combination using both CStringA and CStringW objects.  Could you please comment on the code I have come up with below, which, incidentally, works fine for _UNICODE defined, and not defined, and for both ANSI and UNICODE files, in the four combinations you listed in your last post.  Thanks very much.

//


int main()
{
      USES_CONVERSION;

      // Setup the input file and open it
      CString strFile(  _T( "C:\\Input.txt" ) );
      CFile file;
      if ( ! file.Open( strFile, CFile::modeRead ) ) return -1;

      UINT nFileLengthBytes = file.GetLength();

      // Setup UNICODE and ANSI versions of the replacement strings
      CStringW strwDirectory = _T("Replacement");
      CStringA straDirectory = W2A( strwDirectory );

      // Allocate a buffer for the file, initially twice as big as the file size so we can cater for replacements needed
      BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof ( BYTE )  * 2 ) );
      BYTE* b = new BYTE[ nFileLengthBytes ];
      if ( ! b ) return -1;

      // Determine whether or not we have a UNICODE file
      TCHAR bom;
      BOOL bUnicodeFile = FALSE;
      file.Read( &bom, sizeof (_TCHAR) );
      bUnicodeFile =  bom == _TCHAR( 0xFEFF );
      file.SeekToBegin();

      // Read in the whole file into the buffer
      UINT uBytesRead = file.Read( b, nFileLengthBytes );
      BYTE* pIn = b;
      BYTE* pOut = pbOutput;

      // Get ANSI and UNICODE versions of the replacement strings
      LPWSTR lpwStrReplacementToken = L"sentence";
      CStringW strReplacementTokenW( _T("sentence") );
      CStringA strReplacementTokenA = W2A( lpwStrReplacementToken );

      // Determine the length of the replacement string, in bytes
      int nReplacementTokenLength = ( bUnicodeFile ? sizeof ( wchar_t ) : sizeof ( char ) ) * strReplacementTokenW.GetLength();

      UINT uBytesToWrite = 0, uCurrentBufferSize = nFileLengthBytes * 2;

      for ( UINT uPos = 0; uPos < uBytesRead; )
      {
            if ( bUnicodeFile )
            {
                  // We are dealing with a UNICODE file, so make the comparision with the UNICODE version of the replacement string
                  if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
                  {      
                        UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                        memcpy( pOut, strwDirectory, uDirectoryLengthBytes );

                        pIn += nReplacementTokenLength;
                        pOut += uDirectoryLengthBytes;
                        uPos += nReplacementTokenLength;
                        uBytesToWrite += uDirectoryLengthBytes;
                  }
                  else
                  {
                        *pOut++ = *pIn++;
                        uPos++;
                        uBytesToWrite++;
                  }
            }
            else
            {
                  // We are dealing with an ANSI file, so make the comparison using the ANSI version of the replacement string
                  if ( memcmp( pIn, strReplacementTokenA, nReplacementTokenLength ) == 0 )
                  {      
                        UINT uDirectoryLengthBytes = straDirectory.GetLength() * sizeof ( char );
                        memcpy( pOut, straDirectory, uDirectoryLengthBytes );

                        pIn += nReplacementTokenLength;
                        pOut += uDirectoryLengthBytes;
                        uPos += nReplacementTokenLength;
                        uBytesToWrite += uDirectoryLengthBytes;
                  }
                  else
                  {
                        *pOut++ = *pIn++;
                        uPos++;
                        uBytesToWrite++;
                  }
            }
      }

      // Write out the buffer to our output file
      CFile fileOutput;
      if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite ) )
            return -1;

      fileOutput.Write( pbOutput, uBytesToWrite );
      fileOutput.Close();

      // Cleanup
      delete[] b;
      free ( pbOutput );

      return 0;
}

//

0
 
LVL 19

Author Comment

by:mrwad99
ID: 17049606
Correction to my comment to rstaveley

>> Now, you use memcmp in your code

should be

>> Now, you use wmemcmp in your code
0
 
LVL 17

Assisted Solution

by:rstaveley
rstaveley earned 50 total points
ID: 17050062
Since you are only using wmemcmp or memcmp to test identity, either will do. You are simply seeing if two blocks of raw mamory are the same.

My point was that you need to compare the raw memory returned from the function CString::operator LPCTSTR(). You say that you can construct a CString with a wchar_t array or a char array. There are constructors for both of these. Internally it stores a set of characters which are not dependent on the contructor used, but dependent on whether you have _UNICODE defined (i.e. whether TCHAR is char or wchar_t), characters are widened or narrowed as applicable.
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 17051749
>>>> Could you please comment on the code

I am not quite through, but I'll post as far as have cause I need to get offline for some hours.

/*
Since you are obviously a seasoned C++ programmer, perhaps you can answer another question.  I, as far as I can see,
have not had any answer to the overall problem of dealing with an ANSI file from within my UNICODE program.  
The problem, if I have correctly described what I think rstaveley was getting at, is that I am always going to
have what are CStringW objects with UNICODE defined.  I have hence come up with a hybrid combination using both
CStringA and CStringW objects.  Could you please comment on the code I have come up with below, which, incidentally,
works fine for _UNICODE defined, and not defined, and for both ANSI and UNICODE files, in the four combinations you
listed in your last post.  Thanks very much.
*/

int main()
{
//>>>> ok. that's an ATL conversion macro which isn't dependent on _UNICCODE
     USES_CONVERSION;

     // Setup the input file and open it
//>>>> ok. _T is mapped either to WCHAR or CHAR same as CString
     CString strFile(  _T( "C:\\Input.txt" ) );
     CFile file;
//>>>> ok , file.Open takes a LPCTSTR where CString has a cast operator
//>>>> ???  though CFile::modeBinary is default, you should set it explicitly
//     if ( ! file.Open( strFile, CFile::modeRead ) ) return -1;
     if ( ! file.Open( strFile, CFile::modeRead | CFile::modeBinary) ) return -1;

//>>>> ok. length in bytes
     UINT nFileLengthBytes = file.GetLength();

     // Setup UNICODE and ANSI versions of the replacement strings

//>>>> ???, don't think it makes much sense to explicitly use CStringW/CStringA
//>>>> Note, the switch is done by macros not by templates what makes it unsafe
//>>>> The problem is that MFC doesn't need to handle the case where both CStringW
//>>>> and CStringA where used same time. So, some of the interfaces may/will use
//>>>> the 'T' macros what will fail for one of the types
     //CStringW strwDirectory = _T("Replacement");
     //CStringA straDirectory = W2A( strwDirectory );

//>>>> ??? better use L"Replacement" and "Replacement"
//>>>>     or do the conversion by mbstowcs
     LPWSTR lpwStrDirectory = L"Replacement"
     LPSTR  lpStrDirectory  = "Replacement";
     // int nwlen = (wcslen(pszwDirectory)+1) * 2;   // it's safe to have the same
     // pszDirectory  = new char[nwlen];             // size for both strings
     // wcstombs(pszDirectory, pszwDirectory, nwlen);

     // Allocate a buffer for the file, initially twice as big as the file size so we can cater for replacements needed
//>>>> nok, actually, it makes no sense to use malloc and new
//>>>>      in C++ malloc shouldn't beused
//>>>> nok, if the replacement string is longer than the string to replace we need a dynamic
//>>>>      string cause the output becomes longer
//>>>>      if not using a string class based on wchar_t (as I did with wstring), you need
//>>>>      make two loops: first check the number of replacements, allocate output buffer and
//>>>>      make the replacements in the second loop
     // BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof ( BYTE )  * 2 ) );

     BYTE* b = new BYTE[ nFileLengthBytes ];
     if ( ! b ) return -1;

     // Determine whether or not we have a UNICODE file

//>>>> ???, we definitively need a wchar_t here
     // TCHAR bom;
     WCHAR bom = 0;
     BOOL bUnicodeFile = FALSE;
     // file.Read( &bom, sizeof (_TCHAR) );
     // bUnicodeFile =  bom == _TCHAR( 0xFEFF );
//>>>> ???. you could check the UNICODE property after reading the buffer
     file.Read( &bom, sizeof (WCHAR) );
     bUnicodeFile =  ( bom == 0xFEFF );
     file.SeekToBegin();

     // Read in the whole file into the buffer
     UINT uBytesRead = file.Read( b, nFileLengthBytes );
     BYTE* pIn = b;
//>>>>> later if we know the output size
     // BYTE* pOut = pbOutput;

     // Get ANSI and UNICODE versions of the replacement strings
     LPWSTR lpwStrReplacementToken = L"sentence";
//>>>>> nok, same as above
     // CStringW strReplacementTokenW( _T("sentence") );
     // CStringA strReplacementTokenA = W2A( lpwStrReplacementToken );
     LPSTR lpStrReplacementToken = "sentence";

     // Determine the length of the replacement string, in bytes
//>>>> nok, use wcslen or strlen to determine lengths
     // int nReplacementTokenLength = ( bUnicodeFile ? sizeof ( wchar_t ) : sizeof ( char ) ) * strReplacementTokenW.GetLength();
     int nReplacementTokenLength = bUnicodeFile
                                       ? wcslen(lpwStrReplacementToken) * sizeof(WCHAR)
                                       : strlen(lpStrReplacementToken);
     int nDirectoryTokenLength   = bUnicodeFile
                                       ? wcslen(lpwStrDirectory )  * sizeof(WCHAR)
                                       : strlen(lpStrDirectory);
//>>>> ???, never make two statements in one line
     UINT uBytesToWrite = 0;
//>>>> ???, variable not used
     // UINT uCurrentBufferSize = nFileLengthBytes * 2;

//>>>> new, we need to count replacements first
     int numOccurrences = 0;

// new, we don't need to differ between UNICODE file and ANSI
//      when comparing tokens cause we already have the correct sizes
     BYTE* pReplacementToken = bUnicodeFile
                                 ? (BYTE*)lpwStrReplacementToken
                                 : lpStrReplacementToken;


     for ( UINT uPos = 0; uPos < uBytesRead - nReplacementTokenLength; )
     {
//>>>> nok, don't use CString it most likely fails for cases B and C

           // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
           if ( memcmp( pIn, pReplacementToken, nReplacementTokenLength ) == 0 )
           {
                numOccurrences++;    
                uPos += nReplacementTokenLength;
                pIn  += nReplacementTokenLength;
           }
           else
           {
               uPos++;
                pIn++;
           }
     }
          else
          {
               // We are dealing with an ANSI file, so make the comparison using the ANSI version of the replacement string
               if ( memcmp( pIn, strReplacementTokenA, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = straDirectory.GetLength() * sizeof ( char );
                    memcpy( pOut, straDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
               else
               {
                    *pOut++ = *pIn++;
                    uPos++;
                    uBytesToWrite++;
               }
          }
     }

     for ( UINT uPos = 0; uPos < uBytesRead; )
     {
          if ( bUnicodeFile )
          {
               // We are dealing with a UNICODE file, so make the comparision with the UNICODE version of the replacement string
//>>>> nok, don't use CString it most likely fails for cases B and C

               // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
               if ( memcmp( pIn, lpwStrReplacementToken, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                    memcpy( pOut, strwDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
               else
               {
                    *pOut++ = *pIn++;
                    uPos++;
                    uBytesToWrite++;
               }
          }
          else
          {
               // We are dealing with an ANSI file, so make the comparison using the ANSI version of the replacement string
               if ( memcmp( pIn, strReplacementTokenA, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = straDirectory.GetLength() * sizeof ( char );
                    memcpy( pOut, straDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
               else
               {
                    *pOut++ = *pIn++;
                    uPos++;
                    uBytesToWrite++;
               }
          }
     }

     // Write out the buffer to our output file
     CFile fileOutput;
     if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite ) )
          return -1;

     fileOutput.Write( pbOutput, uBytesToWrite );
     fileOutput.Close();

     // Cleanup
     delete[] b;
     free ( pbOutput );

     return 0;
}

//


0
 
LVL 19

Author Comment

by:mrwad99
ID: 17057098
Alex

Thanks for such a comprehensive set of comments. I have printed those off and will study them and get back to you ASAP.
0
 
LVL 39

Accepted Solution

by:
itsmeandnobodyelse earned 350 total points
ID: 17059093
Here is the last version of the commented code. It compiles but I didn't test it. It still might have some bugs computing sizes and switches between UNICODE/ANSI

Regards, Alex

// unicode.cpp : Defines the entry point for the console application.
//

#undef _DLL
#include <afx.h>
#include <atlbase.h>

#include <atlwin.h>
#include <windows.h>

int main()
{
//>>>> ok. that's an ATL conversion macro which isn't dependent on _UNICCODE
     USES_CONVERSION;

     // Setup the input file and open it
//>>>> ok. _T is mapped either to WCHAR or CHAR same as CString
     CString strFile(  _T( "C:\\Input.txt" ) );
     CFile file;
//>>>> ok , file.Open takes a LPCTSTR where CString has a cast operator
//>>>> ???  though CFile::typeBinary is default, you should set it explicitly
//     if ( ! file.Open( strFile, CFile::modeRead ) ) return -1;
     if ( ! file.Open( strFile, CFile::modeRead | CFile::typeBinary) ) return -1;

//>>>> ok. length in bytes
     UINT nFileLengthBytes = (UINT)file.GetLength();

     // Setup UNICODE and ANSI versions of the replacement strings

//>>>> ???, don't think it makes much sense to explicitly use CStringW/CStringA
//>>>> Note, the switch is done by macros not by templates what makes it unsafe
//>>>> The problem is that MFC doesn't need to handle the case where both CStringW
//>>>> and CStringA where used same time. So, some of the interfaces may/will use
//>>>> the 'T' macros what will fail for one of the types
     //CStringW strwDirectory = _T("Replacement");
     //CStringA straDirectory = W2A( strwDirectory );

//>>>> ??? better use L"Replacement" and "Replacement"
//>>>>     or do the conversion by mbstowcs
     LPWSTR lpwStrDirectory = L"Replacement";
     LPSTR  lpStrDirectory  = "Replacement";
     // int nwlen = (wcslen(pszwDirectory)+1) * 2;   // it's safe to have the same
     // pszDirectory  = new char[nwlen];             // size for both strings
     // wcstombs(pszDirectory, pszwDirectory, nwlen);

     // Allocate a buffer for the file, initially twice as big as the file size so we can cater for replacements needed
//>>>> nok, actually, it makes no sense to use malloc and new
//>>>>      in C++ malloc shouldn't beused
//>>>> nok, if the replacement string is longer than the string to replace we need a dynamic
//>>>>      string cause the output becomes longer
//>>>>      if not using a string class based on wchar_t (as I did with wstring), you need
//>>>>      make two loops: first check the number of replacements, allocate output buffer and
//>>>>      make the replacements in the second loop
     // BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof ( BYTE )  * 2 ) );

     BYTE* b = new BYTE[ nFileLengthBytes ];
     if ( ! b ) return -1;

     // Determine whether or not we have a UNICODE file

//>>>> ???, we definitively need a wchar_t here
     // TCHAR bom;
     WCHAR bom = 0;
     BOOL bUnicodeFile = FALSE;
     // file.Read( &bom, sizeof (_TCHAR) );
     // bUnicodeFile =  bom == _TCHAR( 0xFEFF );
//>>>> ???. you could check the UNICODE property after reading the buffer
     file.Read( &bom, sizeof (WCHAR) );
     bUnicodeFile =  ( bom == 0xFEFF );
     file.SeekToBegin();

     // Read in the whole file into the buffer
     UINT uBytesRead = file.Read( b, nFileLengthBytes );
     BYTE* pIn = b;
//>>>>> later if we know the output size
     // BYTE* pOut = pbOutput;

     // Get ANSI and UNICODE versions of the replacement strings
     LPWSTR lpwStrReplacementToken = L"sentence";
//>>>>> nok, same as above
     // CStringW strReplacementTokenW( _T("sentence") );
     // CStringA strReplacementTokenA = W2A( lpwStrReplacementToken );
     LPSTR lpStrReplacementToken = "sentence";

     // Determine the length of the replacement string, in bytes
//>>>> nok, use wcslen or strlen to determine lengths
     // int nReplacementTokenLength = ( bUnicodeFile ? sizeof ( wchar_t ) : sizeof ( char ) ) * strReplacementTokenW.GetLength();
     UINT nReplacementTokenLength = bUnicodeFile
                                       ? wcslen(lpwStrReplacementToken) * sizeof(WCHAR)
                                       : strlen(lpStrReplacementToken);
     UINT nDirectoryTokenLength   = bUnicodeFile
                                       ? wcslen(lpwStrDirectory )  * sizeof(WCHAR)
                                       : strlen(lpStrDirectory);
//>>>> ???, never make two statements in one line
     UINT uBytesToWrite = 0;
//>>>> ???, variable not used
     // UINT uCurrentBufferSize = nFileLengthBytes * 2;

//>>>> new, we need to count replacements first
     int numOccurrences = 0;

// new, we don't need to differ between UNICODE file and ANSI
//      when comparing tokens cause we already have the correct sizes
     BYTE* pReplacementToken = bUnicodeFile
                                 ? (BYTE*)lpwStrReplacementToken
                                 : (BYTE*)lpStrReplacementToken;

     UINT uPos = 0;
     for ( uPos = 0; uPos < uBytesRead - nReplacementTokenLength; )
     {
//>>>> nok, don't use CString it most likely fails for cases B and C

           // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
           if ( memcmp( pIn, pReplacementToken, nReplacementTokenLength ) == 0 )
           {
                numOccurrences++;    
                uPos += nReplacementTokenLength;
                pIn  += nReplacementTokenLength;
           }
           else
           {
                uPos++;
                pIn++;
           }
     }

       UINT uOutputLength = uBytesRead +      numOccurrences   * (nDirectoryTokenLength - nReplacementTokenLength);
      
       BYTE* pbOutput = new BYTE[uOutputLength];
       BYTE* pOut = pbOutput;
// new, we don't need to differ between UNICODE file and ANSI
//      when copying replacement cause we already have the correct sizes
     BYTE* pDirectory = bUnicodeFile
                                 ? (BYTE*)lpwStrDirectory
                                 : (BYTE*)lpStrDirectory;
     for ( uPos = 0; uPos < uBytesRead; )
     {
//>>>> nok, don't use CString it most likely fails for cases B and C

           // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
           if ( memcmp( pIn, pReplacementToken, nReplacementTokenLength ) == 0 )
           {
                      memcpy(pOut, pDirectory, nDirectoryTokenLength);
                uPos += nReplacementTokenLength;
                pIn  += nReplacementTokenLength;
                        pOut += nDirectoryTokenLength;
           }
           else
           {
                      *pOut = *pIn;
                uPos++;
                pIn++;
                pOut++;
           }
     }

     // Write out the buffer to our output file
     CFile fileOutput;
//>>>> new, need to add typeBinary
       if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite | CFile::typeBinary  ) )
          return -1;

     fileOutput.Write( pbOutput, uOutputLength );
     fileOutput.Close();

     // Cleanup
     delete[] b;
     delete[] pbOutput;

     return 0;
}
0
 
LVL 11

Expert Comment

by:cup
ID: 17067118
As I said earlier, the following will fail because you're copying a CString to a byte array.  The first two bytes of a CString are the length.
 
              if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                    memcpy( pOut, strwDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
 
If you do not wish to use the LPCTSTR operator, use the GetBuffer method

               if ( memcmp( pIn, strReplacementTokenW.GetBuffer(), nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                    memcpy( pOut, strwDirectory.GetBuffer(), uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
 
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 17067395
>>>> will fail because you're copying a CString to a byte array

Actually, it's a copy from a CString and not to a CString. Therefore a cast to a const char or wchar_t pointer would help, cause CString has a built-in cast operator.

     memcmp( pIn, (LPCTSTR)strReplacementTokenW, nReplacementTokenLength )

     memcpy( pOut, (LPCTSTR)strwDirectory, uDirectoryLengthBytes );

>>>> The first two bytes of a CString are the length.

It's the first 4 bytes (32bit word) but it most likeley doesn't matter here. The compiler normall doesn't take a struct/class for a const void* but uses the built-in cast operator which either gives a const char* (LPCSTR) or a const wchar_t* (LPCWSTR). You can get both by casting tp LPCTSTR as I did above.

GetBuffer is needed if you want to write to the CString what isn't recommended here. As I told you above, CString shouldn't be used for the task at all cause it either UNICODE or not UNICODE, what makes it unsuitable if you need it for both.

Regards, Alex

0
 
LVL 11

Expert Comment

by:cup
ID: 17067728
Sorry I meant the first 4 bytes are a pointer to the data buffer.

Are you sure it is automatically casting?  That is not in the posted code.  I'll have to check the generated code or step through it to be convinced that it will automatically cast correctly.

0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 17068239
>>>> Are you sure it is automatically casting?

I am not sure for memcmp which takes a const void *. It should give a compiler error stating that class CString cannot be converted to const void*-

It's different for wmemcmp which takes a const wchar_t*. If CString is switched to UNICODE it has a cast operator LPCWSTR which automatically was used for that. In any case you could invoke the cast operator by explicitly casting to LPCTSTR or make a static_cast<LPCTSTR>. Note, it wouldn't matter if the input is UNICODE or not as long as the correct byte lengths were considered when using memcmp.

Nevertheless, the usage of CString as a simple buffer gives less to no advantages. Better use a string class like std::basic_string that either takes a char or a wchar_t or use plain byte buffers to get not confused. IMO, the handling of UNICODE strings in Windows is a real mess cause developers and users mostly were not able to differ between printout (here UNICODE characters were senseful) and string handling which rarely has the need to care for codepages, UNICODE, multibyte, and so on. So, in 9 of 10 cases we got problems with simple buffer and text handling where no single UNICODE character was involved... Microsoft tried to solve the problem in the typically MS-like manner to ignore the problem by stating "you might have UNICODE if you want or keep ANSI if you don't want". Unfortunately, they forgot all these cillions of peoples which need a little piece of UNICODE sometime and nearly everytime ANSI. The "little piece" let them grow grey hairs and give us experts a good part of your dayly questions ...

Regards, Alex
0
 
LVL 19

Author Comment

by:mrwad99
ID: 17089341
Right

I am going to close this now as my original question has been answered by itsmeandnobodyelse.  I do have more questions regarding this solution, but they are not directly relevant to the original question hence I will open up a new question for them.

I will take this chance to thank everyone who has participated in helping me understand this problem.

Special thanks to Alex ( itsmeandnobodyelse ) for the solution: if you could I have further questions for you that I have not yet finished typing - I just want to close this so you dont all think I have forgotten it.  I will post here again with a link to my new questions, and would greatly appreciate your input on them.
0

Featured Post

What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Written by John Humphreys C++ Threading and the POSIX Library This article will cover the basic information that you need to know in order to make use of the POSIX threading library available for C and C++ on UNIX and most Linux systems.   [s…
Many modern programming languages support the concept of a property -- a class member that combines characteristics of both a data member and a method.  These are sometimes called "smart fields" because you can add logic that is applied automaticall…
The viewer will learn how to use the return statement in functions in C++. The video will also teach the user how to pass data to a function and have the function return data back for further processing.
The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now