  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1185

UNICODE: Problem with find and replace in files

Ah hello.

In a previous question (http:Q_21240266.html) I was given the following code by Axter to perform find and replace in files.      


int FindAndReplaceStrInFile(LPCTSTR FileName, LPCTSTR OldStr, LPCTSTR NewStr)
{
     CFile file;
     if (file.Open(FileName, CFile::modeReadWrite))
     {
          CString Data;
          file.Read(Data.GetBuffer(file.GetLength()), file.GetLength());
          Data.ReleaseBuffer(file.GetLength());
          int ReturnValue = Data.Replace(OldStr, NewStr);
          file.SeekToBegin();
          file.Write((LPCTSTR)Data, Data.GetLength());
          file.SetLength(Data.GetLength());
          file.Close();
          return ReturnValue;
     }
     return -1;
}

However I have just changed my project settings to use UNICODE characters and after reading in the file, the CString 'Data' contains just squares (rubbish, in other words).

What needs to be done to change this to work when Unicode is defined ?

TIA
Asked by: mrwad99
 
jkr Commented:
>>What needs to be done to change this to work when Unicode is defined ?

All you need to do is use a Unicode text file for your find and replace operation.
 
AlexFM Commented:
In case you need to work with ASCII files from a Unicode program:
In VC++ 6.0, CString is a Unicode string in a Unicode configuration, and you need an ASCII string for this case. Try an STL string (std::string) instead.
In later VC++ versions there is the CStringT<> template class. CString is built on char in an ASCII build and on wchar_t in a Unicode build. To work the same way as now in a Unicode configuration, you need the char-based variant, CStringA. Notice that OldStr and NewStr must be converted from wchar_t* to char* for this.

About reading Unicode files: notice that a Unicode (UTF-16) file normally starts with the byte order mark 0xFEFF, stored as the two bytes FF FE in a little-endian file. These two bytes should not be processed by your algorithm.
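To illustrate the byte order mark check described above, here is a minimal sketch in standard C++ rather than MFC. The names DetectBom and TextEncoding are made up for this example, and it assumes the whole file has already been read into a std::string of raw bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: which byte order mark, if any, starts the data.
enum class TextEncoding { Utf16LittleEndian, Utf16BigEndian, NoBom };

// Hypothetical helper: classify raw file bytes by their first two bytes.
TextEncoding DetectBom(const std::string& bytes)
{
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        // U+FEFF stored low byte first
        if (b0 == 0xFF && b1 == 0xFE) return TextEncoding::Utf16LittleEndian;
        // U+FEFF stored high byte first
        if (b0 == 0xFE && b1 == 0xFF) return TextEncoding::Utf16BigEndian;
    }
    // No mark found: most likely ANSI/ASCII text (or UTF-16 saved without a mark)
    return TextEncoding::NoBom;
}
```

A caller could use the result to decide whether to treat the rest of the data as 16-bit characters or as plain bytes. The absence of a mark does not prove that a file is ANSI, so this is only a heuristic.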
 
mrwad99 Author Commented:
Thanks both for replying.

jkr:  Yeah, that is a possibility, but I cannot guarantee that all input files will be Unicode.

AlexFM:  I am sure what you are saying makes sense, but I cannot see how to do what you mean.  Is it possible you could please give an example ?

(Points at 500 now)
 
AlexFM Commented:
Please give additional information. I understand that you want to read ASCII files in a Unicode program, is this right? What is your C++ version?
 
mahesh1402 Commented:
Refer to this CodeProject article, 'Easy text document conversion - ANSI/Unicode and Unicode/ANSI':

 http://www.codeproject.com/file/ANSI-UNICODE_conversion.asp <====

As AlexFM said, a Unicode file has a byte order mark stored at its start, which is 0xFEFF. The sample source project above demonstrates loading and saving both Unicode and ANSI text files.

See the section 'What to do with loaded text?' at the end of the article, where a CString object is used to display the loaded text.


MAHESH
 
mahesh1402 Commented:
CFile file;
if ( file.Open( strFile, CFile::modeRead ) )
{
   // Read the first two bytes, where a Unicode file keeps its byte order mark
   wchar_t bom = 0;
   file.Read( &bom, sizeof(wchar_t) );

   // If we are reading UNICODE file
   if ( bom == 0xFEFF )
   {
      // Everything after the mark is text: count it in bytes and in characters
      UINT nBytes = (UINT)( file.GetLength() - sizeof(wchar_t) );
      UINT nChars = nBytes / sizeof(wchar_t);

      CStringW strWide;
      UINT ret = file.Read( strWide.GetBuffer( nChars ), nBytes );
      strWide.ReleaseBuffer( ret / sizeof(wchar_t) );

      strText = strWide;
   }
   file.Close();
}

As the article explains, you now have your file in a CString object. The article's own listing trims the string afterwards with strText.Left( nLen/2 ), which is its simple way to cut the extra characters that appear due to the double-byte encoding of Unicode text in the file stream: Read counts bytes, and a Unicode string holds half as many characters as bytes. Releasing the buffer with ret / sizeof(wchar_t) characters, as above, takes care of that directly.

But what if your file isn't a Unicode file, that is, if the byte order mark is not equal to 0xFEFF? Then you probably have to deal with an ANSI file...

MAHESH
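For anyone following along without MFC, the same idea (skip the byte order mark, then take two bytes per character) can be sketched in standard C++. DecodeUtf16Le is a made-up name, and the sketch assumes the file is little-endian UTF-16 and has already been read into a std::string of raw bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turn the raw bytes of a UTF-16 little-endian file into
// 16-bit code units, dropping the leading byte order mark if there is one.
std::u16string DecodeUtf16Le(const std::string& bytes)
{
    std::u16string text;
    text.reserve(bytes.size() / 2);
    // Two bytes make one code unit; a trailing odd byte is ignored.
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        text.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    // The mark is not part of the text, so remove it.
    if (!text.empty() && text[0] == 0xFEFF)
        text.erase(0, 1);
    return text;
}
```

Note that the character count comes out as half the byte count, which is the bytes-versus-characters distinction the comment above is about.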
 
mrwad99 Author Commented:
AlexFM:

>> Please give additional information

Basically, the function I posted above is being used in an application built with _UNICODE defined.  As I noted, reading the file into the CString as per the code results in rubbish being stored in the CString.  I am using VC++.NET 2003.

MAHESH:

Thanks.  I will read those links and come back shortly.
 
mahesh1402 Commented:
>>reading the file into the CString as per the code results in rubbish being stored in the CString.  I am using VC++.NET 2003

Read the above: that is because of the double-byte encoding. You need to deal with the extra characters in the CString that appear due to the double-byte encoding of Unicode text in the file stream, as covered with the code above. Also try displaying the contents of the CString using TextOut after creating a font, as demonstrated in the article linked above.

MAHESH
 
mahesh1402 Commented:
BTW, if you have a _UNICODE project and you are dealing with ASCII files in your code above, just use 'CStdioFile' instead of CFile and check the output.

MAHESH
 
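AlexFM Commented:
CString Data;

Replace with:

CStringA Data;     // always a char-based string, even in a Unicode build

int ReturnValue = Data.Replace(OldStr, NewStr);

Replace with:

#include <atlbase.h>
...

USES_CONVERSION;     // at the beginning of the function

int ReturnValue = Data.Replace(T2CA(OldStr), T2CA(NewStr));

file.Write((LPCTSTR)Data, Data.GetLength());

Replace with:

file.Write((LPCSTR)Data, Data.GetLength());

Now this code works with ASCII files in any configuration.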
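The STL string route mentioned earlier in the thread could look roughly like the sketch below. It is plain standard C++ rather than MFC, the function names ReplaceAll and FindAndReplaceBytesInFile are invented for the example, and it assumes the file is ANSI/ASCII and that the search and replacement strings have already been converted to narrow strings.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Replace every occurrence of oldStr in data; returns the number of replacements.
int ReplaceAll(std::string& data, const std::string& oldStr, const std::string& newStr)
{
    if (oldStr.empty()) return 0; // nothing sensible to search for
    int count = 0;
    std::string::size_type pos = 0;
    while ((pos = data.find(oldStr, pos)) != std::string::npos) {
        data.replace(pos, oldStr.size(), newStr);
        pos += newStr.size(); // continue after the text just inserted
        ++count;
    }
    return count;
}

// Byte-oriented find and replace in a file; returns -1 if the file cannot be opened.
int FindAndReplaceBytesInFile(const std::string& fileName,
                              const std::string& oldStr,
                              const std::string& newStr)
{
    // Binary mode, so the bytes are read exactly as stored on disk.
    std::ifstream in(fileName, std::ios::binary);
    if (!in) return -1;
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    in.close();

    const int count = ReplaceAll(data, oldStr, newStr);

    // Rewrite the file from scratch with the new contents.
    std::ofstream out(fileName, std::ios::binary | std::ios::trunc);
    if (!out) return -1;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return count;
}
```

Because std::string always holds one byte per element, the byte count from the file and the length of the string agree regardless of whether the project is built with UNICODE defined.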
 
mrwad99 Author Commented:
>> BTW if you have _UNICODE project and you are dealing with ASCII files in your above code just use 'CStdioFile' instead of CFile and check output.

I just noticed that.  The problem however is in CFile::Read.  CStdioFile::ReadString() works fine.  CFile::Read() however produces the rubbish.

>> fileInput.Read(strFile.GetBuffer(fileInput.GetLength()), fileInput.GetLength());

There is something wrong with that; but what ??

(At this stage, I am aware that I could just use repeated calls to ReadString() for my ANSI file, but for educational purposes, would like to know what is wrong with CFile::Read)
 
mrwad99 Author Commented:
(Sorry Alex, I did not see your comment above; will read it now)
 
mrwad99 Author Commented:
>> int ReturnValue = Data.Replace(OldStr, NewStr);

But "Data" still contains rubbish...

??
 
mahesh1402 Commented:
http://www.codeproject.com/cpp/unicode.asp <====== 'The Length of strings' section
 
mrwad99 Author Commented:
MAHESH

Thanks, but I am not sure how that helps in relation to my question.  Can you clarify please ?
 
mahesh1402 Commented:
>> but for educational purposes, would like to know what is wrong with CFile::Read

CFile::Read reads raw bytes. CFile is for binary data; CStdioFile is for text data.
So CStdioFile is the one that touches the text, converting to and from Unicode.

MAHESH
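A small standard C++ sketch may make the "raw bytes" point concrete. Both function names are hypothetical. The first mimics what effectively happens when ANSI bytes are copied unconverted into a buffer of 16-bit characters (assuming little-endian storage, as on Windows), and the second shows the per-character widening that plain ASCII text actually needs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: raw bytes dropped into 16-bit characters without conversion.
// Each PAIR of bytes is taken as one little-endian code unit.
std::u16string PackBytesAsUtf16(const std::string& bytes)
{
    std::u16string units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}

// Hypothetical: the conversion that is wanted for plain ASCII text,
// one 16-bit character per input byte.
std::u16string WidenAscii(const std::string& bytes)
{
    std::u16string text;
    for (unsigned char c : bytes) {
        assert(c < 0x80); // only 7-bit ASCII is handled in this sketch
        text.push_back(static_cast<char16_t>(c));
    }
    return text;
}
```

With the input "Text", the first function produces two code units, 0x6554 and 0x7478, which are unrelated characters that many fonts draw as squares, while the second produces the four characters of "Text". For real ANSI code pages with bytes above 0x7F, a proper conversion such as the Win32 MultiByteToWideChar function would be needed instead of the simple cast used here.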
 
mahesh1402 Commented:
>>But "Data" still contains rubbish...

However, I think that with _UNICODE defined, text operations on CStdioFile will be done in terms of Unicode. You'd have to use the byte-oriented Read and Write functions to work with ANSI text.

As said, in VC.NET you can use an ANSI version of the new CStringT template even in a Unicode MFC app. This won't help you with CStdioFile, because it is a non-template class defined in terms of CString and thus switches allegiance depending on _UNICODE.

>>fileInput.Read(strFile.GetBuffer(fileInput.GetLength()), fileInput.GetLength()); There is something wrong with that;

 fileInput.Read(strFile.GetBufferSetLength(fileInput.GetLength()),fileInput.GetLength()); <=== try this

MAHESH
 
mrwad99 Author Commented:
>> fileInput.Read(strFile.GetBufferSetLength(fileInput.GetLength()),fileInput.GetLength()); <=== try this

That does not work either I am afraid MAHESH.

So, just so I know for sure what the issue is, it is the fact that I am trying to read ANSI characters into a CString that is composed of UNICODE characters ?
 
mahesh1402 Commented:
>>That does not work either I am afraid MAHESH.

Please refer to this sample project, 'CStdioFile-derived class for multibyte and Unicode reading and writing':

'CStdioFileEx class' src....


MAHESH
 
mrwad99 Author Commented:
OK MAHESH I think that will do the trick.  Turns out this is a proper pain in the backside, eh ?

>> So, just so I know for sure what the issue is, it is the problem that I am trying to read ANSI characters into a CString that is composed of UNICODE characters ?

Could you just clarify that before I close this please ?

Thanks.
 
mahesh1402 Commented:
Oops, it seems I forgot to give you the link (or did you already get it?):
http://www.codeproject.com/file/stdiofileex.asp

>>it is the problem that I am trying to read ANSI characters into a CString that is composed of UNICODE characters ?

Well...
I think it may be that in a UNICODE build CStdioFile treats ANSI files as Unicode, i.e. each ASCII character is converted to a Unicode character and written to the destination string.

In other words, with _UNICODE defined, text operations on CStdioFile will be done in terms of Unicode. You'd have to use the byte-oriented Read and Write functions to work with ANSI text.

MAHESH
 
mrwad99 Author Commented:
Thanks all :)