CF_HTML from UTF-8 to ANSI on Win9x

I have an application that uses the clipboard format CF_HTML. The docs say that CF_HTML data is encoded as UTF-8. To use it in my application I have to convert it to ANSI, which I am trying to do with the Win32 function "MultiByteToWideChar(CP_UTF8, ...)".

The problem is: how can I convert this UTF-8 data to ANSI on Win9x, where that call is not supported?
(I suppose it must be possible, because the editing component in MSIE 5.0 does this conversion.)

The following code shows a simplified (quick and dirty) version of what I'm trying to do:

BOOL CMyView::OnDrop(COleDataObject* pDataObject, DROPEFFECT dropEffect, CPoint point)
{
    UINT CF_HTML = RegisterClipboardFormat(_T("Html Format"));
    HGLOBAL hGlobal = pDataObject->GetGlobalData(CF_HTML);
    LPCSTR lpszUtf8 = (LPCSTR)GlobalLock(hGlobal);
    int cchUtf8 = (int)strlen(lpszUtf8);
    LPWSTR wchBuf = new WCHAR[cchUtf8 + 1];
    LPSTR lpszAnsi = new char[cchUtf8 + 1];

    // the following is not supported on Win9x
    // (the last argument is a count of WCHARs, not sizeof a pointer)
    int cchWide = MultiByteToWideChar(CP_UTF8, 0, lpszUtf8, cchUtf8, wchBuf, cchUtf8);
    wchBuf[cchWide] = L'\0'; // the input length excluded the terminator, so add one

    int cbAnsi = WideCharToMultiByte(CP_ACP, 0, wchBuf, cchWide, lpszAnsi, cchUtf8, NULL, NULL);
    lpszAnsi[cbAnsi] = '\0';

    // do something with lpszAnsi ...

    delete [] wchBuf;
    delete [] lpszAnsi;
    GlobalUnlock(hGlobal);

    return TRUE;
}

Lischke

Here is code to convert from UTF-8 to a (Delphi) WideString and vice versa without calling any system functions:

const
halfShift: Integer = 10;

halfBase: UCS4 = $0010000;
halfMask: UCS4 = $3FF;

offsetsFromUTF8: array[0..5] of UCS4 = ($00000000, $00003080, $000E2080, $03C82080, $FA082080, $82082080);

bytesFromUTF8: array[0..255] of Byte = (
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5);

firstByteMark: array[0..6] of Byte = ($00, $00, $C0, $E0, $F0, $F8, $FC);

//----------------------------------------------------------------------------------------------------------------------

function WideStringToUTF8(S: WideString): AnsiString;

var
ch: UCS4;
L, J, T,
bytesToWrite: Word;
byteMask: UCS4;
byteMark: UCS4;

begin
if Length(S) = 0 then
begin
Result := '';
Exit;
end;

SetLength(Result, Length(S) * 6); // assume worst case
T := 1;
for J := 1 to Length(S) do
begin
byteMask := $BF;
byteMark := $80;

ch := UCS4(S[J]);

if ch < $80 then
bytesToWrite := 1
else
if ch < $800 then
bytesToWrite := 2
else
if ch < $10000 then
bytesToWrite := 3
else
if ch < $200000 then
bytesToWrite := 4
else
if ch < $4000000 then
bytesToWrite := 5
else
if ch <= MaximumUCS4 then
bytesToWrite := 6
else
begin
bytesToWrite := 3; // ReplacementCharacter ($FFFD) takes three bytes in UTF-8
ch := ReplacementCharacter;
end;

for L := bytesToWrite downto 2 do
begin
Result[T + L - 1] := Char((ch or byteMark) and byteMask);
ch := ch shr 6;
end;
Result[T] := Char(ch or firstByteMark[bytesToWrite]);
Inc(T, bytesToWrite);
end;
SetLength(Result, T - 1);
end;
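
For readers working in C++ rather than Delphi, the encoding step above can be sketched as follows. This is only an illustration of the same idea (choose a byte count, fill the trailing bytes by shifting six bits at a time, then mark the first byte); the function name encodeUtf8 is hypothetical, and the sketch assumes the modern four-byte limit of UTF-8 rather than the six-byte forms the Delphi routine allows.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the original post): encode one code
// point as UTF-8, mirroring the firstByteMark / shift-by-6 loop above.
std::string encodeUtf8(std::uint32_t cp) {
    // Index = number of bytes in the sequence; value = marker bits of byte 1.
    static const unsigned char firstByteMark[5] = {0x00, 0x00, 0xC0, 0xE0, 0xF0};

    // Assumption: out-of-range values and lone surrogates become U+FFFD.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    const int n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    std::string out(static_cast<std::size_t>(n), '\0');

    // Fill trailing bytes from the end: low six bits plus the 10xxxxxx mark.
    for (int i = n - 1; i > 0; --i) {
        out[static_cast<std::size_t>(i)] = static_cast<char>((cp & 0x3F) | 0x80);
        cp >>= 6;
    }
    out[0] = static_cast<char>(cp | firstByteMark[n]);
    return out;
}
```

For instance, U+20AC (the euro sign) needs three bytes and comes out as E2 82 AC.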

//----------------------------------------------------------------------------------------------------------------------

function UTF8ToWideString(S: AnsiString): WideString;

var
L, J, T: Cardinal;
ch: UCS4;
extraBytesToWrite: Word;

begin
if Length(S) = 0 then
begin
Result := '';
Exit;
end;

SetLength(Result, Length(S)); // create enough room

L := 1;
T := 1;
while L <= Cardinal(Length(S)) do
begin
ch := 0;
extraBytesToWrite := bytesFromUTF8[Ord(S[L])];

for J := extraBytesToWrite downto 1 do
begin
ch := ch + Ord(S[L]);
Inc(L);
ch := ch shl 6;
end;
ch := ch + Ord(S[L]);
Inc(L);
ch := ch - offsetsFromUTF8[extraBytesToWrite];

if ch <= MaximumUCS2 then
begin
Result[T] := WideChar(ch);
Inc(T);
end
else
if ch > MaximumUCS4 then
begin
Result[T] := WideChar(ReplacementCharacter);
Inc(T);
end
else
begin
ch := ch - halfBase;
Result[T] := WideChar((ch shr halfShift) + SurrogateHighStart);
Inc(T);
Result[T] := WideChar((ch and halfMask) + SurrogateLowStart);
Inc(T);
end;
end;
SetLength(Result, T - 1); // now fix up length
end;

BTW: the data type UTF8 is an 8-bit unsigned char and UCS4 is an unsigned long (Cardinal). The other constants are:

const
ReplacementCharacter: UCS4 = $0000FFFD;
MaximumUCS2: UCS4 = $0000FFFF;
MaximumUTF16: UCS4 = $0010FFFF;
MaximumUCS4: UCS4 = $7FFFFFFF;

SurrogateHighStart: UCS4 = $D800;
SurrogateHighEnd: UCS4 = $DBFF;
SurrogateLowStart: UCS4 = $DC00;
SurrogateLowEnd: UCS4 = $DFFF;
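
To see how halfBase, halfShift, halfMask and the surrogate start values work together, here is a small C++ sketch of the surrogate split that UTF8ToWideString performs for code points above $FFFF. The function name toSurrogates is made up for this illustration, and the caller is assumed to pass a value in the range $10000..$10FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above 0xFFFF into a UTF-16
// surrogate pair, using the same constants as the Delphi code.
// Assumes 0x10000 <= cp <= 0x10FFFF; no range check is done here.
std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t cp) {
    const std::uint32_t halfBase = 0x10000;
    const std::uint32_t halfMask = 0x3FF;
    const int halfShift = 10;
    const std::uint32_t surrogateHighStart = 0xD800;
    const std::uint32_t surrogateLowStart = 0xDC00;

    cp -= halfBase;  // leaves a 20-bit value
    const std::uint32_t high = (cp >> halfShift) + surrogateHighStart;  // top 10 bits
    const std::uint32_t low = (cp & halfMask) + surrogateLowStart;      // low 10 bits
    return std::make_pair(static_cast<std::uint16_t>(high),
                          static_cast<std::uint16_t>(low));
}
```

As a worked case, $1F600 minus halfBase is $F600; its top ten bits are $3D and its low ten bits are $200, so the pair is $D83D, $DE00.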

Ciao, Mike

Lischke

Oops, sorry, I thought I was still in the Windows section (I followed your link from there). I hope my Delphi code is still of use in the MFC area.

Ciao, Mike
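
Since the question is about MFC, a rough C++ rendering of the UTF8ToWideString routine may be useful. The sketch below is an assumption-laden port, not the poster's code: the name utf8ToUtf16 is hypothetical, it uses C++11 types (std::u16string, char16_t) for brevity where a Win9x-era compiler would need something like unsigned short, it handles only the one- to four-byte forms, and it adds bounds and continuation-byte checks that the Delphi version does not have. It does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical port of UTF8ToWideString: decode UTF-8 into UTF-16 units
// without any system call. Malformed input yields U+FFFD per bad byte.
std::u16string utf8ToUtf16(const std::string& s) {
    // Same "magic" offsets as the Delphi table, limited to 4-byte forms.
    static const std::uint32_t offsetsFromUTF8[4] = {
        0x00000000, 0x00003080, 0x000E2080, 0x03C82080};
    const char16_t replacement = 0xFFFD;

    std::u16string out;
    out.reserve(s.size());  // UTF-16 never needs more units than UTF-8 bytes
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        int extra;  // number of trailing bytes announced by the lead byte
        if (lead < 0x80) extra = 0;
        else if (lead >= 0xC0 && lead < 0xE0) extra = 1;
        else if (lead >= 0xE0 && lead < 0xF0) extra = 2;
        else if (lead >= 0xF0 && lead < 0xF8) extra = 3;
        else extra = -1;  // stray continuation byte or obsolete 5/6-byte lead

        // Added safety checks: sequence must fit and trail bytes must be 10xxxxxx.
        bool ok = extra >= 0 && i + static_cast<std::size_t>(extra) < s.size();
        for (int k = 1; ok && k <= extra; ++k)
            ok = (static_cast<unsigned char>(s[i + k]) & 0xC0) == 0x80;
        if (!ok) {
            out.push_back(replacement);
            ++i;
            continue;
        }

        // Accumulate raw bytes, then subtract the combined marker bits.
        std::uint32_t ch = 0;
        for (int k = 0; k < extra; ++k) {
            ch += static_cast<unsigned char>(s[i++]);
            ch <<= 6;
        }
        ch += static_cast<unsigned char>(s[i++]);
        ch -= offsetsFromUTF8[extra];

        if (ch <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(ch));
        } else if (ch > 0x10FFFF) {
            out.push_back(replacement);
        } else {  // split into a surrogate pair
            ch -= 0x10000;
            out.push_back(static_cast<char16_t>((ch >> 10) + 0xD800));
            out.push_back(static_cast<char16_t>((ch & 0x3FF) + 0xDC00));
        }
    }
    return out;
}
```

In the OnDrop code from the question, a routine of this kind would stand in for the MultiByteToWideChar(CP_UTF8, ...) call; the resulting buffer could then presumably be passed to WideCharToMultiByte(CP_ACP, ...) as before, with a cast to WCHAR where needed.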

searching

ASKER

Many thanks for the answer, but could you also tell me where you found the conversion algorithm? Is it a standard, or is it your own creation? (I've increased the points to 120.)

ASKER CERTIFIED SOLUTION

Lischke
