Solved

CF_HTML from UTF-8 to ANSI on Win9x

Posted on 2000-05-01
5
879 Views
Last Modified: 2013-11-20
I have an apllication which uses the clipboard format CF_HTML. The docs say that CF_HTML is in the UTF-8 format. To use it in my application I have to convert it to ANSI format, what I try by using the Win32 function "MultiByteToWideChar(CP_UTF8, ...)".

The problem is: how can I convert this UTF-8 format to ANSI format on Win9x?
(I suppose it must be possible because the editing component in MSIE 5.0 does this conversion)

Following code shows a simplified (quick and dirty) version of what I'm trying to do:

BOOL CMyView::OnDrop(COleDataObject* pDataObject, DROPEFFECT dropEffect, CPoint point)
{
   UINT CF_HTML = RegisterClipboardFormat(_T("Html Format"));
   HGLOBAL hGlobal = pDataObject->GetGlobalData(CF_HTML);
   LPCSTR lpszUtf8 = (LPCSTR)GlobalLock(hGlobal);
   LPWSTR wchBuf = new WCHAR[strlen(lpszUtf8) + 1];
   LPSTR lpszAnsi = new char[strlen(lpszUtf8) + 1];

   // the following is not supported on Win9x
   MultiByteToWideChar(CP_UTF8, 0, lpszUtf8, strlen(lpszUtf8), wchBuf, sizeof(wchBuf));

   WideCharToMultiByte(CP_ACP, 0, wchBuf, wcslen(wchBuf), lpszAnsi, strlen(lpszUtf8) + 1, NULL, NULL);

   // do something with lpszAnsi ...

   delete [] wchBuf;
   delete [] lpszAnsi;
   GlobalUnlock(hGlobal);

   return TRUE;
}

0
Comment
Question by:searching
  • 3
  • 2
5 Comments
 
LVL 10

Expert Comment

by:Lischke
ID: 2768729
Here is code to convert from UTF8 to (Delphi) WideString and vice versa without system intervention:

const
  halfShift: Integer = 10;

  halfBase: UCS4 = $0010000;
  halfMask: UCS4 = $3FF;

  offsetsFromUTF8: array[0..5] of UCS4 = ($00000000, $00003080, $000E2080, $03C82080, $FA082080, $82082080);

  bytesFromUTF8: array[0..255] of Byte = (
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5);

  firstByteMark: array[0..6] of Byte = ($00, $00, $C0, $E0, $F0, $F8, $FC);

//----------------------------------------------------------------------------------------------------------------------

function WideStringToUTF8(S: WideString): AnsiString;

var
  ch: UCS4;
  L, J, T,
  bytesToWrite: Word;
  byteMask: UCS4;
  byteMark: UCS4;

begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;

  SetLength(Result, Length(S) * 6); // assume worst case
  T := 1;
  for J := 1 to Length(S) do
  begin
    byteMask := $BF;
    byteMark := $80;

    ch := UCS4(S[J]);

    if ch < $80 then
      bytesToWrite := 1
    else
    if ch < $800 then
      bytesToWrite := 2
    else
    if ch < $10000 then
      bytesToWrite := 3
    else
    if ch < $200000 then
      bytesToWrite := 4
    else
    if ch < $4000000 then
      bytesToWrite := 5
    else
    if ch <= MaximumUCS4 then
      bytesToWrite := 6
    else
    begin
      bytesToWrite := 2;
      ch := ReplacementCharacter;
    end;

    for L := bytesToWrite downto 2 do
    begin
      Result[T + L - 1] := Char((ch or byteMark) and byteMask);
      ch := ch shr 6;
    end;
    Result[T] := Char(ch or firstByteMark[bytesToWrite]);
    Inc(T, bytesToWrite);
  end;
  SetLength(Result, T - 1);
end;

//----------------------------------------------------------------------------------------------------------------------

function UTF8ToWideString(S: AnsiString): WideString;

var
  L, J, T: Cardinal;
  ch: UCS4;
  extraBytesToWrite: Word;

begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;

  SetLength(Result, Length(S)); // create enough room

  L := 1;
  T := 1;
  while L <= Cardinal(Length(S)) do
  begin
    ch := 0;
    extraBytesToWrite := bytesFromUTF8[Ord(S[L])];

    for J := extraBytesToWrite downto 1 do
    begin
      ch := ch + Ord(S[L]);
      Inc(L);
      ch := ch shl 6;
    end;
    ch := ch + Ord(S[L]);
    Inc(L);
    ch := ch - offsetsFromUTF8[extraBytesToWrite];

    if ch <= MaximumUCS2 then
    begin
      Result[T] := WideChar(ch);
      Inc(T);
    end
    else
    if ch > MaximumUCS4 then
    begin
      Result[T] := WideChar(ReplacementCharacter);
      Inc(T);
    end
    else
    begin
      ch := ch - halfBase;
      Result[T] := WideChar((ch shr halfShift) + SurrogateHighStart);
      Inc(T);
      Result[T] := WideChar((ch and halfMask) + SurrogateLowStart);
      Inc(T);
    end;
  end;
  SetLength(Result, T - 1); // now fix up length
end;

BTW: data type UTF8 is an 8 bit unsigned char, UCS4 unsigned long (Cardinal). Other constants are:

const
  ReplacementCharacter: UCS4 = $0000FFFD;
  MaximumUCS2: UCS4 = $0000FFFF;
  MaximumUTF16: UCS4 = $0010FFFF;
  MaximumUCS4: UCS4 = $7FFFFFFF;
                         
  SurrogateHighStart: UCS4 = $D800;
  SurrogateHighEnd: UCS4 = $DBFF;
  SurrogateLowStart: UCS4 = $DC00;
  SurrogateLowEnd: UCS4 = $DFFF;


Ciao, Mike
0
 
LVL 10

Expert Comment

by:Lischke
ID: 2768730
Oops, sorry, I thought I'm still in the Windows section (I followed your link from there). I hope my Delphi code is still of use in the MFC area.

Ciao, Mike
0
 

Author Comment

by:searching
ID: 2768999
Adjusted points from 100 to 120
0
 

Author Comment

by:searching
ID: 2769000
Many tnx for the answer, but could you also tell me where you've found the conversion algorithm; is it a standard, or is it your own creation? (I've increased the points to 120)
0
 
LVL 10

Accepted Solution

by:
Lischke earned 120 total points
ID: 2769061
The algorithm is provided by the Unicode consortium (see www.unicode.org) and can be read it their official book (www.unicode.org/unicode/uni2book/u2.html, code written in C btw.). I would like to copy the code here because we have the book too, but unfortunately I cannot find it at the moment, sorry...

Ciao, Mike
0

Featured Post

U.S. Department of Agriculture and Acronis Access

With the new era of mobile computing, smartphones and tablets, wireless communications and cloud services, the USDA sought to take advantage of a mobilized workforce and the blurring lines between personal and corporate computing resources.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
lucky13 challenge 11 144
has77  challenge 9 85
Not needed 13 112
Remove Password Protection from pdf (with password) using an command line executable 5 50
Introduction: Finishing the grid – keyboard support for arrow keys to manoeuvre, entering the numbers.  The PreTranslateMessage function is to be used to intercept and respond to keyboard events. Continuing from the fourth article about sudoku. …
Introduction: Database storage, where is the exe actually on the disc? Playing a game selected randomly (how to generate random numbers).  Error trapping with try..catch to help the code run even if something goes wrong. Continuing from the seve…
This video will show you how to get GIT to work in Eclipse.   It will walk you through how to install the EGit plugin in eclipse and how to checkout an existing repository.
Finds all prime numbers in a range requested and places them in a public primes() array. I've demostrated a template size of 30 (2 * 3 * 5) but larger templates can be built such 210  (2 * 3 * 5 * 7) or 2310  (2 * 3 * 5 * 7 * 11). The larger templa…

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question