Solved

CF_HTML from UTF-8 to ANSI on Win9x

Posted on 2000-05-01
Last Modified: 2013-11-20
I have an application which uses the clipboard format CF_HTML. The docs say that CF_HTML data is encoded as UTF-8. To use it in my application I have to convert it to ANSI, which I am trying to do with the Win32 function "MultiByteToWideChar(CP_UTF8, ...)" followed by "WideCharToMultiByte(CP_ACP, ...)".

The problem is: how can I convert this UTF-8 format to ANSI format on Win9x?
(I suppose it must be possible because the editing component in MSIE 5.0 does this conversion)

The following code shows a simplified (quick and dirty) version of what I'm trying to do:

BOOL CMyView::OnDrop(COleDataObject* pDataObject, DROPEFFECT dropEffect, CPoint point)
{
   UINT CF_HTML = RegisterClipboardFormat(_T("Html Format"));
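   HGLOBAL hGlobal = pDataObject->GetGlobalData(CF_HTML);
   if (hGlobal == NULL)
      return FALSE;   // no CF_HTML data in the dropped object
   LPCSTR lpszUtf8 = (LPCSTR)GlobalLock(hGlobal);
   int cch = (int)strlen(lpszUtf8) + 1;   // buffer size in elements, including the terminating null
   LPWSTR wchBuf = new WCHAR[cch];
   LPSTR lpszAnsi = new char[cch];

   // the following is not supported on Win9x
   // (-1 converts up to and including the terminating null; the last
   // argument is a count of WCHARs, not sizeof() of the pointer)
   MultiByteToWideChar(CP_UTF8, 0, lpszUtf8, -1, wchBuf, cch);

   WideCharToMultiByte(CP_ACP, 0, wchBuf, -1, lpszAnsi, cch, NULL, NULL);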

   // do something with lpszAnsi ...

   delete [] wchBuf;
   delete [] lpszAnsi;
   GlobalUnlock(hGlobal);

   return TRUE;
}

Question by:searching

Expert Comment

by:Lischke
ID: 2768729
Here is code to convert from UTF8 to (Delphi) WideString and vice versa without system intervention:

const
  halfShift: Integer = 10;

  halfBase: UCS4 = $0010000;
  halfMask: UCS4 = $3FF;

  offsetsFromUTF8: array[0..5] of UCS4 = ($00000000, $00003080, $000E2080, $03C82080, $FA082080, $82082080);

  bytesFromUTF8: array[0..255] of Byte = (
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5);

  firstByteMark: array[0..6] of Byte = ($00, $00, $C0, $E0, $F0, $F8, $FC);

//----------------------------------------------------------------------------------------------------------------------

function WideStringToUTF8(S: WideString): AnsiString;

var
  ch: UCS4;
  L, J, T,
  bytesToWrite: Integer; // Integer rather than Word, so results longer than 65535 bytes do not overflow the indices
  byteMask: UCS4;
  byteMark: UCS4;

begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;

  SetLength(Result, Length(S) * 6); // assume worst case
  T := 1;
  for J := 1 to Length(S) do
  begin
    byteMask := $BF;
    byteMark := $80;

    ch := UCS4(S[J]);

    if ch < $80 then
      bytesToWrite := 1
    else
    if ch < $800 then
      bytesToWrite := 2
    else
    if ch < $10000 then
      bytesToWrite := 3
    else
    if ch < $200000 then
      bytesToWrite := 4
    else
    if ch < $4000000 then
      bytesToWrite := 5
    else
    if ch <= MaximumUCS4 then
      bytesToWrite := 6
    else
    begin
      bytesToWrite := 2;
      ch := ReplacementCharacter;
    end;

    for L := bytesToWrite downto 2 do
    begin
      Result[T + L - 1] := Char((ch or byteMark) and byteMask);
      ch := ch shr 6;
    end;
    Result[T] := Char(ch or firstByteMark[bytesToWrite]);
    Inc(T, bytesToWrite);
  end;
  SetLength(Result, T - 1);
end;

//----------------------------------------------------------------------------------------------------------------------

function UTF8ToWideString(S: AnsiString): WideString;

var
  L, J, T: Cardinal;
  ch: UCS4;
  extraBytesToWrite: Word;

begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;

  SetLength(Result, Length(S)); // create enough room

  L := 1;
  T := 1;
  while L <= Cardinal(Length(S)) do
  begin
    ch := 0;
    extraBytesToWrite := bytesFromUTF8[Ord(S[L])];

    for J := extraBytesToWrite downto 1 do
    begin
      ch := ch + Ord(S[L]);
      Inc(L);
      ch := ch shl 6;
    end;
    ch := ch + Ord(S[L]);
    Inc(L);
    ch := ch - offsetsFromUTF8[extraBytesToWrite];

    if ch <= MaximumUCS2 then
    begin
      Result[T] := WideChar(ch);
      Inc(T);
    end
    else
    if ch > MaximumUTF16 then // values above $10FFFF cannot be written as a surrogate pair
    begin
      Result[T] := WideChar(ReplacementCharacter);
      Inc(T);
    end
    else
    begin
      ch := ch - halfBase;
      Result[T] := WideChar((ch shr halfShift) + SurrogateHighStart);
      Inc(T);
      Result[T] := WideChar((ch and halfMask) + SurrogateLowStart);
      Inc(T);
    end;
  end;
  SetLength(Result, T - 1); // now fix up length
end;

BTW: data type UTF8 is an 8 bit unsigned char, UCS4 unsigned long (Cardinal). Other constants are:

const
  ReplacementCharacter: UCS4 = $0000FFFD;
  MaximumUCS2: UCS4 = $0000FFFF;
  MaximumUTF16: UCS4 = $0010FFFF;
  MaximumUCS4: UCS4 = $7FFFFFFF;
                         
  SurrogateHighStart: UCS4 = $D800;
  SurrogateHighEnd: UCS4 = $DBFF;
  SurrogateLowStart: UCS4 = $DC00;
  SurrogateLowEnd: UCS4 = $DFFF;


Ciao, Mike

Expert Comment

by:Lischke
ID: 2768730
Oops, sorry, I thought I was still in the Windows section (I followed your link from there). I hope my Delphi code is still of use in the MFC area.
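
For the MFC side, a minimal C++ sketch of the same UTF-8 decoding step might look like the following. It is only an illustration under stated assumptions, not a port of the Delphi code: the names Utf16Unit, TrailBytes and Utf8ToUtf16 are hypothetical, continuation bytes are not validated, and 5- and 6-byte forms are simply replaced with U+FFFD instead of being decoded. The idea is that the resulting UTF-16 buffer could then be handed to WideCharToMultiByte(CP_ACP, ...), so the CP_UTF8 code page is never needed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One UTF-16 code unit; on Windows this would be WCHAR (assumption).
typedef unsigned short Utf16Unit;

// Number of trail bytes implied by a lead byte (hypothetical helper).
static std::size_t TrailBytes(unsigned char lead)
{
    if (lead < 0xC0) return 0;   // ASCII, or a stray continuation byte
    if (lead < 0xE0) return 1;
    if (lead < 0xF0) return 2;
    if (lead < 0xF8) return 3;
    if (lead < 0xFC) return 4;
    return 5;
}

// Decodes UTF-8 into UTF-16 code units without calling any system function.
std::vector<Utf16Unit> Utf8ToUtf16(const std::string& utf8)
{
    // Same values as the first four entries of offsetsFromUTF8.
    static const unsigned long offsets[4] = {
        0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL
    };
    std::vector<Utf16Unit> out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        std::size_t extra = TrailBytes((unsigned char)utf8[i]);
        if (i + extra >= utf8.size())
        {
            out.push_back(0xFFFD);   // truncated sequence at the end of the input
            break;
        }
        if (extra > 3)
        {
            out.push_back(0xFFFD);   // 5- and 6-byte forms are not decoded in this sketch
            i += extra + 1;
            continue;
        }
        unsigned long ch = 0;
        for (std::size_t k = 0; k < extra; ++k)
        {
            ch += (unsigned char)utf8[i++];
            ch <<= 6;
        }
        ch += (unsigned char)utf8[i++];
        ch -= offsets[extra];        // strip the accumulated marker bits

        if (ch <= 0xFFFFUL)
            out.push_back((Utf16Unit)ch);
        else if (ch > 0x10FFFFUL)
            out.push_back(0xFFFD);   // outside the range UTF-16 can represent
        else
        {
            ch -= 0x10000UL;         // split into a surrogate pair
            out.push_back((Utf16Unit)((ch >> 10) + 0xD800));
            out.push_back((Utf16Unit)((ch & 0x3FF) + 0xDC00));
        }
    }
    return out;
}
```

In OnDrop, the vector's contents (plus a terminating zero) would take the place of wchBuf before the WideCharToMultiByte call.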

Ciao, Mike

Author Comment

by:searching
ID: 2768999
Adjusted points from 100 to 120

Author Comment

by:searching
ID: 2769000
Many tnx for the answer, but could you also tell me where you've found the conversion algorithm; is it a standard, or is it your own creation? (I've increased the points to 120)

Accepted Solution

by:
Lischke earned 120 total points
ID: 2769061
The algorithm is provided by the Unicode Consortium (see www.unicode.org) and can be found in their official book (www.unicode.org/unicode/uni2book/u2.html; the code there is written in C, btw). I would like to copy the code here because we have the book too, but unfortunately I cannot find it at the moment, sorry...
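
As a side note on the constants: the values in offsetsFromUTF8 are not arbitrary. The decoder adds up the raw bytes, marker bits included, shifting left by 6 between them, so the offset to subtract is just the lead-byte marker (the entries of firstByteMark) and one $80 per continuation byte, accumulated the same way. A small C++ sketch that rebuilds the table, with Utf8Offset as a hypothetical name and a 32-bit mask assumed to mimic the wraparound of the 6-byte entry:

```cpp
// Rebuilds one entry of offsetsFromUTF8 from the marker bits alone.
// `extra` is the number of continuation bytes (0..5).
unsigned long Utf8Offset(int extra)
{
    // Lead-byte markers for sequences with 0..5 continuation bytes;
    // these match firstByteMark[1..6] in the Delphi code.
    static const unsigned char leadMark[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

    unsigned long offset = leadMark[extra];
    for (int k = 0; k < extra; ++k)
    {
        // Same shift-and-add as the decoder, but with the bare 0x80 marker
        // of a continuation byte instead of the byte itself.
        offset = ((offset << 6) + 0x80) & 0xFFFFFFFFUL;   // keep 32 bits, as UCS4 does
    }
    return offset;
}
```

For a 2-byte sequence, for example, this gives ($C0 shl 6) + $80 = $3080, which is the second entry of the table.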

Ciao, Mike