Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1095
  • Last Modified:

More UNICODE file trouble

Ah hello.

I have the following code, intended to make a byte-for-byte copy of a file, replacing all occurencies certain string with another.

I know that this code below has no safety in that we check for the amount of memory that will be required, but, that aside, I cannot get it to work with _UNICODE defined in my project settings.

Please take a look.

'lpwStrReplacementToken' is what I want to replace with 'strDirectory'.

I am finding that with UNICODE defined, the memcmp test never returns 0.  Why is this ?

TIA

      CString strFile(  _T( "C:\\Input.txt" ) );

      CFile file;
      if ( ! file.Open( strFile, CFile::modeRead ) )
            return -1;
      
      UINT nFileLengthBytes = file.GetLength();

      CString strDirectory = _T("Replacement");

      UINT uDirectoryLength = (strDirectory.GetLength()) * sizeof( TCHAR );
      BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof( BYTE )  * 2 ) );

      BYTE* b = new BYTE[ nFileLengthBytes ];
      if ( ! b )
            return -1;
      
      TCHAR bom;
      BOOL bUnicode = FALSE;

      file.Read( &bom, sizeof(_TCHAR) );
      bUnicode =  bom == _TCHAR( 0xFEFF );

      file.SeekToBegin();

      UINT uBytesRead = file.Read( b, nFileLengthBytes );
      BYTE* pIn = b;
      BYTE* pOut = pbOutput;

      //cout << pIn;
      USES_CONVERSION;

      LPWSTR lpwStrReplacementToken = L"sentence";
      CString strReplacementToken = lpwStrReplacementToken;
      if ( ! bUnicode )
            strReplacementToken = W2A( lpwStrReplacementToken );

      int nReplacementTokenLength = ( bUnicode ? sizeof( wchar_t ) : sizeof( char ) ) * strReplacementToken.GetLength();

      UINT uBytesToWrite = 0, uCurrentBufferSize = nFileLengthBytes * 2;
      for ( UINT uPos = 0; uPos < uBytesRead; )
      {
            cout << *pIn;
            if ( memcmp( pIn, strReplacementToken, nReplacementTokenLength ) == 0 )
            {      
                  memcpy( pOut, strDirectory, uDirectoryLength );

                  pIn += nReplacementTokenLength;
                  pOut += uDirectoryLength;
                  uPos += nReplacementTokenLength;
                  uBytesToWrite += uDirectoryLength;
            }
            else
            {
                  *pOut++ = *pIn++;
                  uPos++;
                  uBytesToWrite++;
            }
      }

      CFile fileOutput;
      if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite ) )
            return -1;

      fileOutput.Write( pbOutput, uBytesToWrite );
      fileOutput.Close();

      delete[] b;
      free ( pbOutput );

      
0
mrwad99
Asked:
mrwad99
  • 8
  • 5
  • 4
  • +6
3 Solutions
 
Jase-CoderCommented:
if you using unicode you should use wmemcmp rather than memcmp
0
 
Jase-CoderCommented:
to use wmemcmp include <wchar.t>
0
 
mahesh1402Commented:
better is _tcscmp() which is compiled to either strcmp() or wcscmp() depending on whether or not you're doing a Unicode build.

-MAHESH
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
e_tadeuCommented:
Also don't assume your chars are BYTE now.
0
 
AxterCommented:
>>better is _tcscmp() which is compiled to either strcmp() or wcscmp() depending on whether or not you're doing a Unicode build.

I totally agree with mahesh1402.
If you're using _T objects, then you should use _t associated API functions.  Other wise, there's no point in using _T code to begin with.
0
 
cupCommented:
CStrings do not automatically cast to BYTE*.  You need to use the LPCTSTR operator.  This will get the data member in the correct type.

      if ( memcmp( pIn, (LPCTSTR) strReplacementToken, nReplacementTokenLength ) == 0 )
   {
           memcpy( pOut, (LPCTSTR) strDirectory, uDirectoryLength );
           ...

If you are printing a unicode string it should be

wcout << _T('"') << (TCHAR*) pIn << _T('"') << endl
wcout << _T('"') << (LPCTSTR) strReplacementToken << _T('"') << endl;

Print both strings to find out why your comparison is failing.  

0
 
mrwad99Author Commented:
Right, ok.

Sorry about the delays, had network issues.

Anyway.  I have had a good play with all the suggestions given and still cannot get it to work.

Jase-Coder:

I have changed my line that was using memcmp to

#ifdef _UNICODE
            if ( wmemcmp ( (LPCWSTR)pIn, strReplacementToken, nReplacementTokenLength ) == 0 )
            {      
#else
            if ( memcmp( pIn, strReplacementToken, nReplacementTokenLength ) == 0 )
            {      
#endif

and it still does not work.

mahesh1402/Axter:

how can I use _tcscmp ?  I am comparing a certain amount of memory, not two strings.  

Currently,

memcmp( pIn, strReplacementToken, nReplacementTokenLength ) == 0 )

starts at pIn, and compares the next nReplacementTokenLength bytes with those contained in strReplacementToken.  _tcscmp does not have any provision for a certain number of bytes to be compared...

e_tadeu:

I am not with you there.  What do you mean ?

cup:

I tried that and it still does not work.

I am also not bothered about printing, all I want to do is a simple memory comparision.  Yet this is so difficult !

>>Print both strings to find out why your comparison is failing.  

That did nothing.

?!?

Arggghhh !
0
 
cupCommented:
Did you try LPCTSTR or something else?  LPCTSTR is an operator for CStrings: not a cast.  Everything else is a cast.  What does this print?

const TCHAR* str = (LPCTSTR) strReplacementToken;
wcout << "length is " << nReplacementTokenLength  << " value is " <<  str << endl;

If it doesn't print anything at all, then there is somethig wrong with the replacement string.  Next

cout << pIn << endl;  // if this doesn't print anything it is probably unicode
wcout << (LPCTSTR) pIn << endl; // if this doesn't print anything, you have an empty string in pIn

If the cout prints something then you are comparing a unicode string against a non unicode string.  That won't work.
0
 
mrwad99Author Commented:
>> const TCHAR* str = (LPCTSTR) strReplacementToken;
wcout << "length is " << nReplacementTokenLength  << " value is " <<  str << endl;

This prints

length is 8 value is sentence

>> cout << pIn << endl;  // if this doesn't print anything it is probably unicode
wcout << (LPCTSTR) pIn << endl; // if this doesn't print anything, you have an empty string in pIn

This prints

This is a sentence.²²²²½½½½½½½½&#9632;

The wcout has no effect.

??

But...

I think I have figured all this out.  Firstly, with my code as-is, compiled with _UNICODE defined, it does not work *if I have an ANSI file as the input file*.  So, with the source as-is, C:\Input.txt is an ANSI file, and it does not work.  Open the file in Notepad and save it as UNICODE however, and my code works fine.

This lead me to look into exactly what is happening when my code tries to make a comparision between the memory currently pointed to by pIn, and the replacement string.

I had a play, and have uploaded some screenshots to demontrate what I found.  Please take a look at them.

The first ( http://img240.imageshack.us/img240/3536/notworking3fx.png ) shows the memory for one of my variables setup via a call to W2A.  You can see that the memory is showing that the string is still UNICODE, i.e. has two bytes per character.

The second ( http://img154.imageshack.us/img154/4719/working0qg.png ) shows the memory for a char array, used to convert the same UNICODE CString (strTemp) as I did via w2A.  Except, the memory window clearly shows that there is only one byte per character; an ANSI string.

So, it would seem that W2A is failing.  The question now is why ?
0
 
mrwad99Author Commented:
(Points now at 500 as this is getting silly)
0
 
mrwad99Author Commented:
Incidentally,

      CStringA t = W2A( strTemp );


Looks fine when viewed in memory, it is not double byte encoded.
0
 
rstaveleyCommented:
Macros like W2A use _alloca to grab a chunk of stack to put the converted string into. You should expect strReplacementToken no longer to point to  lpwStrReplacementToken after W2A has been called. It is conceivable, bearing in mind the funkiness of alloca_ that your debugger is getting things wrong. Can you confirm that strReplacementToken no longer points to lpwStrReplacementToken after you step past the W2A call??
0
 
rstaveleyCommented:
I didn't see your last post, mrwad99 . Ignore mine.
0
 
rstaveleyCommented:
You start by constructing a CString from a wchar_t array string, which is fine and then, if bUnicode is true, you reassign it to be the same string using a narrow char array. You get the narrow char array from W2A. That should load it with the same contents?! No matter, read on...

Your memcmp/wmemcmp needs to compare LPCTSTR string fetched from the CString object. You need the cast to call operator LPCTSTR(), however LPCTSTR will be *either* LPCSTR or LPCWSTR depending on _UNICODE, so you *must* have _UNICODE defined for your code to work to get wide character andling internally throughout. With _UNICODE, LPCTSTR() is the same as LPCWSTR().

i.e.

#ifdef _UNICODE
          if ( wmemcmp (LPCWSTR(pIn), LPCWSTR(strReplacementToken), nReplacementTokenLength ) == 0 )
          {    
#else
          #error Must be unicode!!
#endif
0
 
havman56Commented:
there is a bug in vc6 regarding wchar_t.

whenever u read a char from stream if it is greater than 0xfd that is (ex: 0x3424) which 2 bytes for wchar_t then stream in MSVC identifes as space character.

no idea on vc7 .

0
 
itsmeandnobodyelseCommented:
/*
You seem mixing up UNICODE inputfile and UNICODE build.

If I understand the requirements correctly there are 4 cases to consider:

      _UNICODE defined     UNICODE textfile
A:          no                           no
B:          no                           yes
C:          yes                          no
D:          yes                          yes

Unfortunately CString only works for A and D cause it either is single-char or wide-char. Same applies to all "T" - switches.

So, B and C can't be handled properly when using CString and the TCHAR switches.

I would suggest to use std::string and std::wstring instead. Both can be used same time.

The prog below compiles but isn't tested. There is the assumption that you didn't want to change file type, means a UNICODE text file keeps a UNICODE text file while an ANSI textfile keeps ANSI.

If the assumption is wrong you would need to convert from string to wstring or buf to wbuf what needs a UNICODE conversion function like mbstowcs and a second buffer of an appropriate size. I would recommend *not* to using the USES_CONVERSION macro and A2W macro cause it requires _UNICODE macro set (as far as I know) while mbstowcs works in any case. mbstowcs can calculate the amount of wchar_t  chars needed when passing a NULL pointer as output buffer.

Hope it works.

Regards, Alex


*/
   
#include <string>
#include <iostream>
#include <fstream>
using namespace std;

#include <sys/stat.h>

int main()    
{
    struct stat status;
    if (stat("c:\\input.dat", &status) != 0)
         return 1;
    ifstream ifs("c:\\input.dat", ios::binary | ios::in);
    if (!ifs)
        return 2;
   
    unsigned char* buf = new unsigned char[status.st_size];
    if (!ifs.read((char*)buf,  status.st_size))
        return 3;

    bool bUnicode =  *((wchar_t*)buf) == 0xfeff;
   
    int outLen = 0;
    if (!bUnicode)
    {
         string s((char*)buf, status.st_size);
         delete [] buf;
         int pos = 0;
         int lpos = 0;
         string sfind = "sentence";
         string srepl = "replacement";
         while ((pos = s.find(sfind.c_str(), lpos)) != string::npos)
         {
             s.replace(pos, sfind.length(), srepl);
             lpos = pos + srepl.length();
         }

         buf = new unsigned char[s.length()];
         memcpy(buf, s.c_str(), s.length());
         outLen = s.length();
    }
    else // if (bUnicode)
    {
         wchar_t* wbuf = (wchar_t*) buf;
         wstring ws(wbuf, status.st_size/2);
         delete [] buf;
         int pos = 0;
         int lpos = 0;
         wstring wsfind = L"sentence";
         wstring wsrepl = L"replacement";
         while ((pos = ws.find(wsfind.c_str(), lpos)) != wstring::npos)
         {
             ws.replace(pos, wsfind.length(), wsrepl);
             lpos = pos + wsrepl.length();
         }

         buf = (unsigned char*)(new wchar_t[ws.length()]);
         memcpy(buf, ws.c_str(), ws.length() * sizeof(wchar_t));
         outLen = ws.length() * sizeof(wchar_t);
    }

    ofstream ofs("c:\\output.dat", ios::out | ios::binary);
    ofs.write((const char*)buf, outLen);
    ofs.close();
    return 0;
}

0
 
mrwad99Author Commented:
rstaveley

Thanks for the comment.  I would like to clarify something if possible.

>> You get the narrow char array from W2A. That should load it with the same contents?!

I am not sure what you mean there, but I think you are referring to the fact that I do this:

      LPWSTR lpwStrReplacementToken = L"sentence";
      CString strReplacementToken(_T("")), strTemp = lpwStrReplacementToken;
      
      if ( ! bUnicode )
            strReplacementToken = W2A(strTemp );       // <-----------------------
      else
            strReplacementToken = strTemp;      

Are you saying that I am assigning what is a CStringA (returned by w2A) to strReplacementToken (a CStringW (due to _UNICODE) ), which ends up with strReplacementToken being a CStringW ?  Well, if that is the case, I can certainly see why everything is failing, and I can see the memory window showed strReplacementToken as having two bytes per character in my previous screenshot ( http://img240.imageshack.us/img240/3536/notworking3fx.png )

***** Is this what you are saying the problem is, and why I essentially believe that W2A is "not working" ? *****

>> Your memcmp/wmemcmp needs to compare LPCTSTR string fetched from the CString object...

Now, you use memcmp in your code, however I don't see why I *have* to, since all I am comparing is blocks of memory, which can be of any data type, including ANSI strings or UNICODE strings.  My code works fine for a UNICODE file, (with _UNICODE) defined as-is, so I am still not convinced about what you say here.  Can you re-enforce you statement about using the LPCTSTR operator ?

havman56

I am using VS.NET 2003, so I guess that does not apply to me.  Thanks anyway.

itsmeandnobodyelse

Thank you for the attempt to fully understand my issue.  You are correct, there are four cases as you have clearly layed out.

>> I would suggest to use std::string and std::wstring instead. Both can be used same time.

That is an option.  However, as this code is part of a large MFC project, I really want to stick to CString objects.  I will try that code however as it is good practice.

Since you are obviously a seasoned C++ programmer, perhaps you can answer another question.  I, as far as I can see, have not had any answer to the overall problem of dealing with an ANSI file from within my UNICODE program.  The problem, if I have correctly described what I think rstaveley was getting at, is that I am always going to have what are CStringW objects with UNICODE defined.  I have hence come up with a hybrid combination using both CStringA and CStringW objects.  Could you please comment on the code I have come up with below, which, incidentally, works fine for _UNICODE defined, and not defined, and for both ANSI and UNICODE files, in the four combinations you listed in your last post.  Thanks very much.

//


int main()
{
      USES_CONVERSION;

      // Setup the input file and open it
      CString strFile(  _T( "C:\\Input.txt" ) );
      CFile file;
      if ( ! file.Open( strFile, CFile::modeRead ) ) return -1;

      UINT nFileLengthBytes = file.GetLength();

      // Setup UNICODE and ANSI versions of the replacement strings
      CStringW strwDirectory = _T("Replacement");
      CStringA straDirectory = W2A( strwDirectory );

      // Allocate a buffer for the file, initially twice as big as the file size so we can cater for replacements needed
      BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof ( BYTE )  * 2 ) );
      BYTE* b = new BYTE[ nFileLengthBytes ];
      if ( ! b ) return -1;

      // Determine whether or not we have a UNICODE file
      TCHAR bom;
      BOOL bUnicodeFile = FALSE;
      file.Read( &bom, sizeof (_TCHAR) );
      bUnicodeFile =  bom == _TCHAR( 0xFEFF );
      file.SeekToBegin();

      // Read in the whole file into the buffer
      UINT uBytesRead = file.Read( b, nFileLengthBytes );
      BYTE* pIn = b;
      BYTE* pOut = pbOutput;

      // Get ANSI and UNICODE versions of the replacement strings
      LPWSTR lpwStrReplacementToken = L"sentence";
      CStringW strReplacementTokenW( _T("sentence") );
      CStringA strReplacementTokenA = W2A( lpwStrReplacementToken );

      // Determine the length of the replacement string, in bytes
      int nReplacementTokenLength = ( bUnicodeFile ? sizeof ( wchar_t ) : sizeof ( char ) ) * strReplacementTokenW.GetLength();

      UINT uBytesToWrite = 0, uCurrentBufferSize = nFileLengthBytes * 2;

      for ( UINT uPos = 0; uPos < uBytesRead; )
      {
            if ( bUnicodeFile )
            {
                  // We are dealing with a UNICODE file, so make the comparision with the UNICODE version of the replacement string
                  if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
                  {      
                        UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                        memcpy( pOut, strwDirectory, uDirectoryLengthBytes );

                        pIn += nReplacementTokenLength;
                        pOut += uDirectoryLengthBytes;
                        uPos += nReplacementTokenLength;
                        uBytesToWrite += uDirectoryLengthBytes;
                  }
                  else
                  {
                        *pOut++ = *pIn++;
                        uPos++;
                        uBytesToWrite++;
                  }
            }
            else
            {
                  // We are dealing with an ANSI file, so make the comparison using the ANSI version of the replacement string
                  if ( memcmp( pIn, strReplacementTokenA, nReplacementTokenLength ) == 0 )
                  {      
                        UINT uDirectoryLengthBytes = straDirectory.GetLength() * sizeof ( char );
                        memcpy( pOut, straDirectory, uDirectoryLengthBytes );

                        pIn += nReplacementTokenLength;
                        pOut += uDirectoryLengthBytes;
                        uPos += nReplacementTokenLength;
                        uBytesToWrite += uDirectoryLengthBytes;
                  }
                  else
                  {
                        *pOut++ = *pIn++;
                        uPos++;
                        uBytesToWrite++;
                  }
            }
      }

      // Write out the buffer to our output file
      CFile fileOutput;
      if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite ) )
            return -1;

      fileOutput.Write( pbOutput, uBytesToWrite );
      fileOutput.Close();

      // Cleanup
      delete[] b;
      free ( pbOutput );

      return 0;
}

//

0
 
mrwad99Author Commented:
Correction to my comment to rstaveley

>> Now, you use memcmp in your code

should be

>> Now, you use wmemcmp in your code
0
 
rstaveleyCommented:
Since you are only using wmemcmp or memcmp to test identity, either will do. You are simply seeing if two blocks of raw mamory are the same.

My point was that you need to compare the raw memory returned from the function CString::operator LPCTSTR(). You say that you can construct a CString with a wchar_t array or a char array. There are constructors for both of these. Internally it stores a set of characters which are not dependent on the contructor used, but dependent on whether you have _UNICODE defined (i.e. whether TCHAR is char or wchar_t), characters are widened or narrowed as applicable.
0
 
itsmeandnobodyelseCommented:
>>>> Could you please comment on the code

I am not quite through, but I'll post as far as have cause I need to get offline for some hours.

/*
Since you are obviously a seasoned C++ programmer, perhaps you can answer another question.  I, as far as I can see,
have not had any answer to the overall problem of dealing with an ANSI file from within my UNICODE program.  
The problem, if I have correctly described what I think rstaveley was getting at, is that I am always going to
have what are CStringW objects with UNICODE defined.  I have hence come up with a hybrid combination using both
CStringA and CStringW objects.  Could you please comment on the code I have come up with below, which, incidentally,
works fine for _UNICODE defined, and not defined, and for both ANSI and UNICODE files, in the four combinations you
listed in your last post.  Thanks very much.
*/

int main()
{
//>>>> ok. that's an ATL conversion macro which isn't dependent on _UNICCODE
     USES_CONVERSION;

     // Setup the input file and open it
//>>>> ok. _T is mapped either to WCHAR or CHAR same as CString
     CString strFile(  _T( "C:\\Input.txt" ) );
     CFile file;
//>>>> ok , file.Open takes a LPCTSTR where CString has a cast operator
//>>>> ???  though CFile::modeBinary is default, you should set it explicitly
//     if ( ! file.Open( strFile, CFile::modeRead ) ) return -1;
     if ( ! file.Open( strFile, CFile::modeRead | CFile::modeBinary) ) return -1;

//>>>> ok. length in bytes
     UINT nFileLengthBytes = file.GetLength();

     // Setup UNICODE and ANSI versions of the replacement strings

//>>>> ???, don't think it makes much sense to explicitly use CStringW/CStringA
//>>>> Note, the switch is done by macros not by templates what makes it unsafe
//>>>> The problem is that MFC doesn't need to handle the case where both CStringW
//>>>> and CStringA where used same time. So, some of the interfaces may/will use
//>>>> the 'T' macros what will fail for one of the types
     //CStringW strwDirectory = _T("Replacement");
     //CStringA straDirectory = W2A( strwDirectory );

//>>>> ??? better use L"Replacement" and "Replacement"
//>>>>     or do the conversion by mbstowcs
     LPWSTR lpwStrDirectory = L"Replacement"
     LPSTR  lpStrDirectory  = "Replacement";
     // int nwlen = (wcslen(pszwDirectory)+1) * 2;   // it's safe to have the same
     // pszDirectory  = new char[nwlen];             // size for both strings
     // wcstombs(pszDirectory, pszwDirectory, nwlen);

     // Allocate a buffer for the file, initially twice as big as the file size so we can cater for replacements needed
//>>>> nok, actually, it makes no sense to use malloc and new
//>>>>      in C++ malloc shouldn't beused
//>>>> nok, if the replacement string is longer than the string to replace we need a dynamic
//>>>>      string cause the output becomes longer
//>>>>      if not using a string class based on wchar_t (as I did with wstring), you need
//>>>>      make two loops: first check the number of replacements, allocate output buffer and
//>>>>      make the replacements in the second loop
     // BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof ( BYTE )  * 2 ) );

     BYTE* b = new BYTE[ nFileLengthBytes ];
     if ( ! b ) return -1;

     // Determine whether or not we have a UNICODE file

//>>>> ???, we definitively need a wchar_t here
     // TCHAR bom;
     WCHAR bom = 0;
     BOOL bUnicodeFile = FALSE;
     // file.Read( &bom, sizeof (_TCHAR) );
     // bUnicodeFile =  bom == _TCHAR( 0xFEFF );
//>>>> ???. you could check the UNICODE property after reading the buffer
     file.Read( &bom, sizeof (WCHAR) );
     bUnicodeFile =  ( bom == 0xFEFF );
     file.SeekToBegin();

     // Read in the whole file into the buffer
     UINT uBytesRead = file.Read( b, nFileLengthBytes );
     BYTE* pIn = b;
//>>>>> later if we know the output size
     // BYTE* pOut = pbOutput;

     // Get ANSI and UNICODE versions of the replacement strings
     LPWSTR lpwStrReplacementToken = L"sentence";
//>>>>> nok, same as above
     // CStringW strReplacementTokenW( _T("sentence") );
     // CStringA strReplacementTokenA = W2A( lpwStrReplacementToken );
     LPSTR lpStrReplacementToken = "sentence";

     // Determine the length of the replacement string, in bytes
//>>>> nok, use wcslen or strlen to determine lengths
     // int nReplacementTokenLength = ( bUnicodeFile ? sizeof ( wchar_t ) : sizeof ( char ) ) * strReplacementTokenW.GetLength();
     int nReplacementTokenLength = bUnicodeFile
                                       ? wcslen(lpwStrReplacementToken) * sizeof(WCHAR)
                                       : strlen(lpStrReplacementToken);
     int nDirectoryTokenLength   = bUnicodeFile
                                       ? wcslen(lpwStrDirectory )  * sizeof(WCHAR)
                                       : strlen(lpStrDirectory);
//>>>> ???, never make two statements in one line
     UINT uBytesToWrite = 0;
//>>>> ???, variable not used
     // UINT uCurrentBufferSize = nFileLengthBytes * 2;

//>>>> new, we need to count replacements first
     int numOccurrences = 0;

// new, we don't need to differ between UNICODE file and ANSI
//      when comparing tokens cause we already have the correct sizes
     BYTE* pReplacementToken = bUnicodeFile
                                 ? (BYTE*)lpwStrReplacementToken
                                 : lpStrReplacementToken;


     for ( UINT uPos = 0; uPos < uBytesRead - nReplacementTokenLength; )
     {
//>>>> nok, don't use CString it most likely fails for cases B and C

           // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
           if ( memcmp( pIn, pReplacementToken, nReplacementTokenLength ) == 0 )
           {
                numOccurrences++;    
                uPos += nReplacementTokenLength;
                pIn  += nReplacementTokenLength;
           }
           else
           {
               uPos++;
                pIn++;
           }
     }
          else
          {
               // We are dealing with an ANSI file, so make the comparison using the ANSI version of the replacement string
               if ( memcmp( pIn, strReplacementTokenA, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = straDirectory.GetLength() * sizeof ( char );
                    memcpy( pOut, straDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
               else
               {
                    *pOut++ = *pIn++;
                    uPos++;
                    uBytesToWrite++;
               }
          }
     }

     for ( UINT uPos = 0; uPos < uBytesRead; )
     {
          if ( bUnicodeFile )
          {
               // We are dealing with a UNICODE file, so make the comparision with the UNICODE version of the replacement string
//>>>> nok, don't use CString it most likely fails for cases B and C

               // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
               if ( memcmp( pIn, lpwStrReplacementToken, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                    memcpy( pOut, strwDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
               else
               {
                    *pOut++ = *pIn++;
                    uPos++;
                    uBytesToWrite++;
               }
          }
          else
          {
               // We are dealing with an ANSI file, so make the comparison using the ANSI version of the replacement string
               if ( memcmp( pIn, strReplacementTokenA, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = straDirectory.GetLength() * sizeof ( char );
                    memcpy( pOut, straDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
               else
               {
                    *pOut++ = *pIn++;
                    uPos++;
                    uBytesToWrite++;
               }
          }
     }

     // Write out the buffer to our output file
     CFile fileOutput;
     if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite ) )
          return -1;

     fileOutput.Write( pbOutput, uBytesToWrite );
     fileOutput.Close();

     // Cleanup
     delete[] b;
     free ( pbOutput );

     return 0;
}

//


0
 
mrwad99Author Commented:
Alex

Thanks for such a comprehensive set of comments. I have printed those off and will study them and get back to you ASAP.
0
 
itsmeandnobodyelseCommented:
Here is the last version of the commented code. It compiles but I didn't test it. It still might have some bugs computing sizes and switches between UNICODE/ANSI

Regards, Alex

// unicode.cpp : Defines the entry point for the console application.
//

#undef _DLL
#include <afx.h>
#include <atlbase.h>

#include <atlwin.h>
#include <windows.h>

int main()
{
//>>>> ok. that's an ATL conversion macro which isn't dependent on _UNICCODE
     USES_CONVERSION;

     // Setup the input file and open it
//>>>> ok. _T is mapped either to WCHAR or CHAR same as CString
     CString strFile(  _T( "C:\\Input.txt" ) );
     CFile file;
//>>>> ok , file.Open takes a LPCTSTR where CString has a cast operator
//>>>> ???  though CFile::typeBinary is default, you should set it explicitly
//     if ( ! file.Open( strFile, CFile::modeRead ) ) return -1;
     if ( ! file.Open( strFile, CFile::modeRead | CFile::typeBinary) ) return -1;

//>>>> ok. length in bytes
     UINT nFileLengthBytes = (UINT)file.GetLength();

     // Setup UNICODE and ANSI versions of the replacement strings

//>>>> ???, don't think it makes much sense to explicitly use CStringW/CStringA
//>>>> Note, the switch is done by macros not by templates what makes it unsafe
//>>>> The problem is that MFC doesn't need to handle the case where both CStringW
//>>>> and CStringA where used same time. So, some of the interfaces may/will use
//>>>> the 'T' macros what will fail for one of the types
     //CStringW strwDirectory = _T("Replacement");
     //CStringA straDirectory = W2A( strwDirectory );

//>>>> ??? better use L"Replacement" and "Replacement"
//>>>>     or do the conversion by mbstowcs
     LPWSTR lpwStrDirectory = L"Replacement";
     LPSTR  lpStrDirectory  = "Replacement";
     // int nwlen = (wcslen(pszwDirectory)+1) * 2;   // it's safe to have the same
     // pszDirectory  = new char[nwlen];             // size for both strings
     // wcstombs(pszDirectory, pszwDirectory, nwlen);

     // Allocate a buffer for the file, initially twice as big as the file size so we can cater for replacements needed
//>>>> nok, actually, it makes no sense to use malloc and new
//>>>>      in C++ malloc shouldn't beused
//>>>> nok, if the replacement string is longer than the string to replace we need a dynamic
//>>>>      string cause the output becomes longer
//>>>>      if not using a string class based on wchar_t (as I did with wstring), you need
//>>>>      make two loops: first check the number of replacements, allocate output buffer and
//>>>>      make the replacements in the second loop
     // BYTE* pbOutput = ( ( BYTE* ) malloc ( nFileLengthBytes * sizeof ( BYTE )  * 2 ) );

     BYTE* b = new BYTE[ nFileLengthBytes ];
     if ( ! b ) return -1;

     // Determine whether or not we have a UNICODE file

//>>>> ???, we definitively need a wchar_t here
     // TCHAR bom;
     WCHAR bom = 0;
     BOOL bUnicodeFile = FALSE;
     // file.Read( &bom, sizeof (_TCHAR) );
     // bUnicodeFile =  bom == _TCHAR( 0xFEFF );
//>>>> ???. you could check the UNICODE property after reading the buffer
     file.Read( &bom, sizeof (WCHAR) );
     bUnicodeFile =  ( bom == 0xFEFF );
     file.SeekToBegin();

     // Read in the whole file into the buffer
     UINT uBytesRead = file.Read( b, nFileLengthBytes );
     BYTE* pIn = b;
//>>>>> later if we know the output size
     // BYTE* pOut = pbOutput;

     // Get ANSI and UNICODE versions of the replacement strings
     LPWSTR lpwStrReplacementToken = L"sentence";
//>>>>> nok, same as above
     // CStringW strReplacementTokenW( _T("sentence") );
     // CStringA strReplacementTokenA = W2A( lpwStrReplacementToken );
     LPSTR lpStrReplacementToken = "sentence";

     // Determine the length of the replacement string, in bytes
//>>>> nok, use wcslen or strlen to determine lengths
     // int nReplacementTokenLength = ( bUnicodeFile ? sizeof ( wchar_t ) : sizeof ( char ) ) * strReplacementTokenW.GetLength();
     UINT nReplacementTokenLength = bUnicodeFile
                                       ? wcslen(lpwStrReplacementToken) * sizeof(WCHAR)
                                       : strlen(lpStrReplacementToken);
     UINT nDirectoryTokenLength   = bUnicodeFile
                                       ? wcslen(lpwStrDirectory )  * sizeof(WCHAR)
                                       : strlen(lpStrDirectory);
//>>>> ???, never make two statements in one line
     UINT uBytesToWrite = 0;
//>>>> ???, variable not used
     // UINT uCurrentBufferSize = nFileLengthBytes * 2;

//>>>> new, we need to count replacements first
     int numOccurrences = 0;

// new, we don't need to differ between UNICODE file and ANSI
//      when comparing tokens cause we already have the correct sizes
     BYTE* pReplacementToken = bUnicodeFile
                                 ? (BYTE*)lpwStrReplacementToken
                                 : (BYTE*)lpStrReplacementToken;

     UINT uPos = 0;
     for ( uPos = 0; uPos < uBytesRead - nReplacementTokenLength; )
     {
//>>>> nok, don't use CString it most likely fails for cases B and C

           // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
           if ( memcmp( pIn, pReplacementToken, nReplacementTokenLength ) == 0 )
           {
                numOccurrences++;    
                uPos += nReplacementTokenLength;
                pIn  += nReplacementTokenLength;
           }
           else
           {
                uPos++;
                pIn++;
           }
     }

       UINT uOutputLength = uBytesRead +      numOccurrences   * (nDirectoryTokenLength - nReplacementTokenLength);
      
       BYTE* pbOutput = new BYTE[uOutputLength];
       BYTE* pOut = pbOutput;
// new, we don't need to differ between UNICODE file and ANSI
//      when copying replacement cause we already have the correct sizes
     BYTE* pDirectory = bUnicodeFile
                                 ? (BYTE*)lpwStrDirectory
                                 : (BYTE*)lpStrDirectory;
     for ( uPos = 0; uPos < uBytesRead; )
     {
//>>>> nok, don't use CString it most likely fails for cases B and C

           // if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
           if ( memcmp( pIn, pReplacementToken, nReplacementTokenLength ) == 0 )
           {
                      memcpy(pOut, pDirectory, nDirectoryTokenLength);
                uPos += nReplacementTokenLength;
                pIn  += nReplacementTokenLength;
                        pOut += nDirectoryTokenLength;
           }
           else
           {
                      *pOut = *pIn;
                uPos++;
                pIn++;
                pOut++;
           }
     }

     // Write out the buffer to our output file
     CFile fileOutput;
//>>>> new, need to add typeBinary
       if ( ! fileOutput.Open( _T( "C:\\TestOutput.txt"), CFile::modeCreate | CFile::modeWrite | CFile::typeBinary  ) )
          return -1;

     fileOutput.Write( pbOutput, uOutputLength );
     fileOutput.Close();

     // Cleanup
     delete[] b;
     delete[] pbOutput;

     return 0;
}
0
 
cupCommented:
As I said earlier, the following will fail because you're copying a CString to a byte array.  The first two bytes of a CString are the length.
 
              if ( memcmp( pIn, strReplacementTokenW, nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                    memcpy( pOut, strwDirectory, uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
 
If you do not wish to use the LPCTSTR operator, use the GetBuffer method

               if ( memcmp( pIn, strReplacementTokenW.GetBuffer(), nReplacementTokenLength ) == 0 )
               {    
                    UINT uDirectoryLengthBytes = strwDirectory.GetLength() * sizeof ( wchar_t );

                    memcpy( pOut, strwDirectory.GetBuffer(), uDirectoryLengthBytes );

                    pIn += nReplacementTokenLength;
                    pOut += uDirectoryLengthBytes;
                    uPos += nReplacementTokenLength;
                    uBytesToWrite += uDirectoryLengthBytes;
               }
 
0
 
itsmeandnobodyelseCommented:
>>>> will fail because you're copying a CString to a byte array

Actually, it's a copy from a CString and not to a CString. Therefore a cast to a const char or wchar_t pointer would help, cause CString has a built-in cast operator.

     memcmp( pIn, (LPCTSTR)strReplacementTokenW, nReplacementTokenLength )

     memcpy( pOut, (LPCTSTR)strwDirectory, uDirectoryLengthBytes );

>>>> The first two bytes of a CString are the length.

It's the first 4 bytes (32bit word) but it most likeley doesn't matter here. The compiler normall doesn't take a struct/class for a const void* but uses the built-in cast operator which either gives a const char* (LPCSTR) or a const wchar_t* (LPCWSTR). You can get both by casting tp LPCTSTR as I did above.

GetBuffer is needed if you want to write to the CString what isn't recommended here. As I told you above, CString shouldn't be used for the task at all cause it either UNICODE or not UNICODE, what makes it unsuitable if you need it for both.

Regards, Alex

0
 
cupCommented:
Sorry I meant the first 4 bytes are a pointer to the data buffer.

Are you sure it is automatically casting?  That is not in the posted code.  I'll have to check the generated code or step through it to be convinced that it will automatically cast correctly.

0
 
itsmeandnobodyelseCommented:
>>>> Are you sure it is automatically casting?

I am not sure for memcmp which takes a const void *. It should give a compiler error stating that class CString cannot be converted to const void*-

It's different for wmemcmp which takes a const wchar_t*. If CString is switched to UNICODE it has a cast operator LPCWSTR which automatically was used for that. In any case you could invoke the cast operator by explicitly casting to LPCTSTR or make a static_cast<LPCTSTR>. Note, it wouldn't matter if the input is UNICODE or not as long as the correct byte lengths were considered when using memcmp.

Nevertheless, the usage of CString as a simple buffer gives less to no advantages. Better use a string class like std::basic_string that either takes a char or a wchar_t or use plain byte buffers to get not confused. IMO, the handling of UNICODE strings in Windows is a real mess cause developers and users mostly were not able to differ between printout (here UNICODE characters were senseful) and string handling which rarely has the need to care for codepages, UNICODE, multibyte, and so on. So, in 9 of 10 cases we got problems with simple buffer and text handling where no single UNICODE character was involved... Microsoft tried to solve the problem in the typically MS-like manner to ignore the problem by stating "you might have UNICODE if you want or keep ANSI if you don't want". Unfortunately, they forgot all these cillions of peoples which need a little piece of UNICODE sometime and nearly everytime ANSI. The "little piece" let them grow grey hairs and give us experts a good part of your dayly questions ...

Regards, Alex
0
 
mrwad99Author Commented:
Right

I am going to close this now as my original question has been answered by itsmeandnobodyelse.  I do have more questions regarding this solution, but they are not directly relevant to the original question hence I will open up a new question for them.

I will take this chance to thank everyone who has participated in helping me understand this problem.

Special thanks to Alex ( itsmeandnobodyelse ) for the solution: if you could I have further questions for you that I have not yet finished typing - I just want to close this so you dont all think I have forgotten it.  I will post here again with a link to my new questions, and would greatly appreciate your input on them.
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

  • 8
  • 5
  • 4
  • +6
Tackle projects and never again get stuck behind a technical roadblock.
Join Now