CF_HTML from UTF-8 to ANSI on Win9x
I have an application which uses the clipboard format CF_HTML. The docs say that CF_HTML data is UTF-8 encoded. To use it in my application I have to convert it to ANSI, which I try to do with the Win32 function MultiByteToWideChar(CP_UTF8, ...).
The problem is: how can I convert this UTF-8 format to ANSI format on Win9x?
(I suppose it must be possible because the editing component in MSIE 5.0 does this conversion)
The following code shows a simplified (quick and dirty) version of what I'm trying to do:
BOOL CMyView::OnDrop(COleDataObject* pDataObject, DROPEFFECT dropEffect, CPoint point)
{
    UINT CF_HTML = RegisterClipboardFormat(_T("Html Format"));
    HGLOBAL hGlobal = pDataObject->GetGlobalData(CF_HTML);
    LPCSTR lpszUtf8 = (LPCSTR)GlobalLock(hGlobal);
    int cch = (int)strlen(lpszUtf8) + 1; // buffer size in elements, including the terminator
    LPWSTR wchBuf = new WCHAR[cch];
    LPSTR lpszAnsi = new char[cch];
    // the following is not supported on Win9x
    // (-1 converts the terminating null as well; sizes are element counts, not sizeof(pointer))
    MultiByteToWideChar(CP_UTF8, 0, lpszUtf8, -1, wchBuf, cch);
    WideCharToMultiByte(CP_ACP, 0, wchBuf, -1, lpszAnsi, cch, NULL, NULL);
    // do something with lpszAnsi ...
    delete [] wchBuf;
    delete [] lpszAnsi;
    GlobalUnlock(hGlobal);
    return TRUE;
}
Oops, sorry, I thought I was still in the Windows section (I followed your link from there). I hope my Delphi code is still of use in the MFC area.
Ciao, Mike
ASKER
Many tnx for the answer, but could you also tell me where you've found the conversion algorithm; is it a standard, or is it your own creation? (I've increased the points to 120)
ASKER CERTIFIED SOLUTION
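const
  halfShift: Integer = 10;
  halfBase: UCS4 = $0010000;
  halfMask: UCS4 = $3FF;
  offsetsFromUTF8: array[0..5] of UCS4 = ($00000000, $00003080, $000E2080, $03C82080, $FA082080, $82082080);
  // number of trailing bytes that follow each possible lead byte
  bytesFromUTF8: array[0..255] of Byte = (
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5);
  firstByteMark: array[0..6] of Byte = ($00, $00, $C0, $E0, $F0, $F8, $FC);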
//------------------------
function WideStringToUTF8(S: WideString): AnsiString;
var
  ch: UCS4;
  L, J, T,
  bytesToWrite: Integer; // Integer rather than Word, so long strings do not overflow the indices
  byteMask: UCS4;
  byteMark: UCS4;
begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;
  SetLength(Result, Length(S) * 6); // assume worst case
  T := 1;
  for J := 1 to Length(S) do
  begin
    byteMask := $BF;
    byteMark := $80;
    ch := UCS4(S[J]);
    if ch < $80 then
      bytesToWrite := 1
    else
    if ch < $800 then
      bytesToWrite := 2
    else
    if ch < $10000 then
      bytesToWrite := 3
    else
    if ch < $200000 then
      bytesToWrite := 4
    else
    if ch < $4000000 then
      bytesToWrite := 5
    else
    if ch <= MaximumUCS4 then
      bytesToWrite := 6
    else
    begin
      bytesToWrite := 2;
      ch := ReplacementCharacter;
    end;
    // write the continuation bytes back to front, 6 bits at a time
    for L := bytesToWrite downto 2 do
    begin
      Result[T + L - 1] := Char((ch or byteMark) and byteMask);
      ch := ch shr 6;
    end;
    // the remaining bits go into the lead byte together with its length marker
    Result[T] := Char(ch or firstByteMark[bytesToWrite]);
    Inc(T, bytesToWrite);
  end;
  SetLength(Result, T - 1);
end;
//------------------------
function UTF8ToWideString(S: AnsiString): WideString;
var
  L, J, T: Cardinal;
  ch: UCS4;
  extraBytesToWrite: Word;
begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;
  SetLength(Result, Length(S)); // create enough room
  L := 1;
  T := 1;
  while L <= Cardinal(Length(S)) do
  begin
    ch := 0;
    extraBytesToWrite := bytesFromUTF8[Ord(S[L])];
    // stop on a truncated trailing sequence instead of reading past the end of S
    if L + extraBytesToWrite > Cardinal(Length(S)) then
      Break;
    for J := extraBytesToWrite downto 1 do
    begin
      ch := ch + Ord(S[L]);
      Inc(L);
      ch := ch shl 6;
    end;
    ch := ch + Ord(S[L]);
    Inc(L);
    // remove the accumulated lead and continuation marker bits
    ch := ch - offsetsFromUTF8[extraBytesToWrite];
    if ch <= MaximumUCS2 then
    begin
      Result[T] := WideChar(ch);
      Inc(T);
    end
    else
    if ch > MaximumUTF16 then // anything beyond $10FFFF cannot be expressed as a surrogate pair
    begin
      Result[T] := WideChar(ReplacementCharacter);
      Inc(T);
    end
    else
    begin
      ch := ch - halfBase;
      Result[T] := WideChar((ch shr halfShift) + SurrogateHighStart);
      Inc(T);
      Result[T] := WideChar((ch and halfMask) + SurrogateLowStart);
      Inc(T);
    end;
  end;
  SetLength(Result, T - 1); // now fix up length
end;
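The subtraction of offsetsFromUTF8 in the decoder can look like magic, so here is a small C++ sketch of where those numbers appear to come from. The decoder adds up the raw bytes, marker bits included, shifting by 6 each time; the table entry is simply the sum of those marker bits at their shifted positions. The function name OffsetFor is made up for this illustration and is not part of the code above.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only): rebuild one entry of offsetsFromUTF8.
// extraBytes is the number of continuation bytes that follow the lead byte (0..5).
inline std::uint32_t OffsetFor(int extraBytes)
{
    // Marker bits carried by the lead byte of a sequence with 0..5 extra bytes.
    const std::uint32_t leadMarks[6] = {0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};

    std::uint32_t offset = leadMarks[extraBytes];
    for (int k = 0; k < extraBytes; ++k) {
        // Each continuation byte carries the marker 0x80; the shift mirrors
        // the "shl 6" in the decoder. Unsigned overflow wraps, as with Cardinal.
        offset = (offset << 6) + 0x80;
    }
    return offset;
}
```

For example, "é" is the byte pair C3 A9: (0xC3 shl 6) + 0xA9 = 0x3169, and 0x3169 - 0x3080 = 0xE9, which is U+00E9.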
BTW: the data type UTF8 is an 8-bit unsigned char and UCS4 is an unsigned long (Cardinal). The other constants are:
const
  ReplacementCharacter: UCS4 = $0000FFFD;
  MaximumUCS2: UCS4 = $0000FFFF;
  MaximumUTF16: UCS4 = $0010FFFF;
  MaximumUCS4: UCS4 = $7FFFFFFF;
  SurrogateHighStart: UCS4 = $D800;
  SurrogateHighEnd: UCS4 = $DBFF;
  SurrogateLowStart: UCS4 = $DC00;
  SurrogateLowEnd: UCS4 = $DFFF;
Ciao, Mike
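For anyone reading this in the MFC area: the decoding half of the Delphi code could be sketched in portable C++ roughly like this. It is not a line-by-line port. It uses bit masks instead of the offsetsFromUTF8 table, returns a std::u16string rather than a WCHAR buffer, and replaces malformed input (including the old 5- and 6-byte forms) with U+FFFD, which the Delphi listing does not do. Overlong encodings are not checked. The name Utf8ToUtf16 is an assumption made for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): decode UTF-8 bytes into UTF-16 code units.
// Malformed or truncated sequences become U+FFFD instead of reading past the end.
std::u16string Utf8ToUtf16(const std::string& in)
{
    const char32_t kReplacement = 0xFFFD;
    std::u16string out;
    out.reserve(in.size()); // UTF-16 never needs more units than there are UTF-8 bytes

    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = kReplacement;
        int extra = 0; // number of continuation bytes expected after the lead byte

        if (lead < 0x80)      { cp = lead; }
        else if (lead < 0xC0) { /* stray continuation byte: keep U+FFFD */ }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else if (lead < 0xF8) { cp = lead & 0x07; extra = 3; }
        // 0xF8..0xFF (5- and 6-byte forms) are not valid Unicode: keep U+FFFD

        for (int k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = kReplacement; // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }

        // Out-of-range values and encoded surrogates are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = kReplacement;

        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000; // same role as halfBase in the Delphi code
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
        }
    }
    return out;
}
```

On Win9x the resulting buffer should then be usable with WideCharToMultiByte(CP_ACP, ...), as in the question's code, since only the CP_UTF8 step was the problem there. That last step is left out here because it needs the Win32 headers.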