Solved

CF_HTML from UTF-8 to ANSI on Win9x

Posted on 2000-05-01
Last Modified: 2013-11-20
I have an application which uses the clipboard format CF_HTML. The docs say that CF_HTML data is encoded as UTF-8. To use it in my application I have to convert it to ANSI, which I am trying to do with the Win32 function "MultiByteToWideChar(CP_UTF8, ...)".

The problem is: how can I convert this UTF-8 format to ANSI format on Win9x?
(I suppose it must be possible because the editing component in MSIE 5.0 does this conversion)

The following code shows a simplified (quick and dirty) version of what I'm trying to do:

BOOL CMyView::OnDrop(COleDataObject* pDataObject, DROPEFFECT dropEffect, CPoint point)
{
   UINT CF_HTML = RegisterClipboardFormat(_T("Html Format"));
   HGLOBAL hGlobal = pDataObject->GetGlobalData(CF_HTML);
   LPCSTR lpszUtf8 = (LPCSTR)GlobalLock(hGlobal);
   int cch = (int)strlen(lpszUtf8) + 1; // buffer size in elements, including the terminator
   LPWSTR wchBuf = new WCHAR[cch];
   LPSTR lpszAnsi = new char[cch];

   // the following is not supported on Win9x
   // -1 converts the terminating null too; the last argument is a count of WCHARs
   // (sizeof(wchBuf) would only give the size of the pointer)
   MultiByteToWideChar(CP_UTF8, 0, lpszUtf8, -1, wchBuf, cch);

   WideCharToMultiByte(CP_ACP, 0, wchBuf, -1, lpszAnsi, cch, NULL, NULL);

   // do something with lpszAnsi ...

   delete [] wchBuf;
   delete [] lpszAnsi;
   GlobalUnlock(hGlobal);

   return TRUE;
}

Question by:searching
5 Comments
 
LVL 10

Expert Comment

by:Lischke
ID: 2768729
Here is code to convert from UTF8 to (Delphi) WideString and vice versa without system intervention:

const
  halfShift: Integer = 10;

  halfBase: UCS4 = $0010000;
  halfMask: UCS4 = $3FF;

  offsetsFromUTF8: array[0..5] of UCS4 = ($00000000, $00003080, $000E2080, $03C82080, $FA082080, $82082080);

  bytesFromUTF8: array[0..255] of Byte = (
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5);

  firstByteMark: array[0..6] of Byte = ($00, $00, $C0, $E0, $F0, $F8, $FC);

//----------------------------------------------------------------------------------------------------------------------

function WideStringToUTF8(S: WideString): AnsiString;

var
  ch: UCS4;
  L, J, T,
  bytesToWrite: Word;
  byteMask: UCS4;
  byteMark: UCS4;

begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;

  SetLength(Result, Length(S) * 6); // assume worst case
  T := 1;
  for J := 1 to Length(S) do
  begin
    byteMask := $BF;
    byteMark := $80;

    ch := UCS4(S[J]);

    if ch < $80 then
      bytesToWrite := 1
    else
    if ch < $800 then
      bytesToWrite := 2
    else
    if ch < $10000 then
      bytesToWrite := 3
    else
    if ch < $200000 then
      bytesToWrite := 4
    else
    if ch < $4000000 then
      bytesToWrite := 5
    else
    if ch <= MaximumUCS4 then
      bytesToWrite := 6
    else
    begin
      bytesToWrite := 2;
      ch := ReplacementCharacter;
    end;

    for L := bytesToWrite downto 2 do
    begin
      Result[T + L - 1] := Char((ch or byteMark) and byteMask);
      ch := ch shr 6;
    end;
    Result[T] := Char(ch or firstByteMark[bytesToWrite]);
    Inc(T, bytesToWrite);
  end;
  SetLength(Result, T - 1);
end;

//----------------------------------------------------------------------------------------------------------------------

function UTF8ToWideString(S: AnsiString): WideString;

var
  L, J, T: Cardinal;
  ch: UCS4;
  extraBytesToWrite: Word;

begin
  if Length(S) = 0 then
  begin
    Result := '';
    Exit;
  end;

  SetLength(Result, Length(S)); // create enough room

  L := 1;
  T := 1;
  while L <= Cardinal(Length(S)) do
  begin
    ch := 0;
    extraBytesToWrite := bytesFromUTF8[Ord(S[L])];

    for J := extraBytesToWrite downto 1 do
    begin
      ch := ch + Ord(S[L]);
      Inc(L);
      ch := ch shl 6;
    end;
    ch := ch + Ord(S[L]);
    Inc(L);
    ch := ch - offsetsFromUTF8[extraBytesToWrite];

    if ch <= MaximumUCS2 then
    begin
      Result[T] := WideChar(ch);
      Inc(T);
    end
    else
    if ch > MaximumUTF16 then // values above $10FFFF cannot be written as a surrogate pair
    begin
      Result[T] := WideChar(ReplacementCharacter);
      Inc(T);
    end
    else
    begin
      ch := ch - halfBase;
      Result[T] := WideChar((ch shr halfShift) + SurrogateHighStart);
      Inc(T);
      Result[T] := WideChar((ch and halfMask) + SurrogateLowStart);
      Inc(T);
    end;
  end;
  SetLength(Result, T - 1); // now fix up length
end;
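
A note on why subtracting offsetsFromUTF8 works: the decoding loop adds the raw bytes with their marker bits still set, and each table entry is exactly the sum of those marker bits shifted into their final positions. The small C++ check below illustrates this for two- and three-byte sequences. The names decode2 and decode3 are made up for this illustration and are not part of the code above.

```cpp
#include <cstdint>

// Accumulate raw UTF-8 bytes the way the loop above does (marker bits included),
// then remove all marker bits at once by subtracting the precomputed offset.
constexpr std::uint32_t decode2(std::uint8_t b0, std::uint8_t b1) {
    // 0x3080 == (0xC0 << 6) + 0x80: lead-byte marker shifted once, plus trail-byte marker
    return ((static_cast<std::uint32_t>(b0) << 6) + b1) - 0x00003080u;
}

constexpr std::uint32_t decode3(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    // 0xE2080 == (0xE0 << 12) + (0x80 << 6) + 0x80
    return ((((static_cast<std::uint32_t>(b0) << 6) + b1) << 6) + b2) - 0x000E2080u;
}

// The table entries are just the shifted marker bits added together.
static_assert((0xC0u << 6) + 0x80u == 0x3080u, "two-byte offset");
static_assert((0xE0u << 12) + (0x80u << 6) + 0x80u == 0xE2080u, "three-byte offset");

// C3 A9 is U+00E9, E2 82 AC is U+20AC.
static_assert(decode2(0xC3, 0xA9) == 0x00E9, "U+00E9");
static_assert(decode3(0xE2, 0x82, 0xAC) == 0x20AC, "U+20AC");
```

The same reasoning extends to the longer entries in the table.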

BTW: the data type UTF8 is an 8-bit unsigned char and UCS4 is an unsigned long (Cardinal). The other constants are:

const
  ReplacementCharacter: UCS4 = $0000FFFD;
  MaximumUCS2: UCS4 = $0000FFFF;
  MaximumUTF16: UCS4 = $0010FFFF;
  MaximumUCS4: UCS4 = $7FFFFFFF;
                         
  SurrogateHighStart: UCS4 = $D800;
  SurrogateHighEnd: UCS4 = $DBFF;
  SurrogateLowStart: UCS4 = $DC00;
  SurrogateLowEnd: UCS4 = $DFFF;
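
For readers working in C++, the encoding direction (WideStringToUTF8 above) might be sketched roughly as follows. This is only an illustration under some assumptions: EncodeUtf8 is a hypothetical name, it takes one whole code point rather than a WideString, and it stops at four bytes (U+10FFFF) instead of the six-byte forms handled above. As written, the Delphi function converts each WideChar on its own, so a caller of this sketch would have to combine surrogate pairs into one code point first.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point (up to U+10FFFF) as UTF-8.
// Surrogate values and out-of-range values are replaced by U+FFFD.
std::string EncodeUtf8(std::uint32_t ch) {
    if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF)) ch = 0xFFFD;

    // Marker bits for the lead byte, indexed by the total sequence length.
    static const std::uint8_t firstByteMark[5] = {0x00, 0x00, 0xC0, 0xE0, 0xF0};
    const int n = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;

    std::string out(n, '\0');
    // Fill the trail bytes from the end: six payload bits each, marked with 10xxxxxx.
    for (int k = n - 1; k >= 1; --k) {
        out[k] = static_cast<char>((ch & 0x3F) | 0x80);
        ch >>= 6;
    }
    // Whatever bits remain go into the lead byte together with its length marker.
    out[0] = static_cast<char>(ch | firstByteMark[n]);
    return out;
}
```

For example, EncodeUtf8(0x20AC) yields the three bytes E2 82 AC.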


Ciao, Mike
 
LVL 10

Expert Comment

by:Lischke
ID: 2768730
Oops, sorry, I thought I was still in the Windows section (I followed your link from there). I hope my Delphi code is still of use in the MFC area.
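
For the MFC side, the decoding direction could be sketched in portable C++ along the lines below. It is not a line-by-line port of the Delphi code: Utf8ToUtf16 is a hypothetical name, it uses bit masks instead of the lookup tables, it accepts at most four-byte sequences, and it emits U+FFFD for malformed or truncated input. Overlong forms are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 code units without any OS call.
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
        std::uint32_t ch;
        std::size_t extra;  // number of trail bytes expected after the lead byte
        if (b0 < 0x80)                { ch = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { ch = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { ch = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { ch = b0 & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray trail byte or invalid lead

        // Sequence cut off at the end of the input: emit one replacement and stop.
        if (extra > in.size() - i - 1) { out.push_back(0xFFFD); break; }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const std::uint8_t b = static_cast<std::uint8_t>(in[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // not a 10xxxxxx trail byte
            ch = (ch << 6) | (b & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }  // skip the bad lead byte only
        i += extra + 1;

        if (ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF)) {
            out.push_back(0xFFFD);
        } else if (ch <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(ch));
        } else {
            ch -= 0x10000;  // split into a high and a low surrogate
            out.push_back(static_cast<char16_t>(0xD800 + (ch >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (ch & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where WCHAR is 16 bits wide, the resulting code units could then be handed to WideCharToMultiByte(CP_ACP, ...) as in the question's code, so only the UTF-8 step is replaced.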

Ciao, Mike
 

Author Comment

by:searching
ID: 2768999
Adjusted points from 100 to 120
 

Author Comment

by:searching
ID: 2769000
Many thanks for the answer, but could you also tell me where you found the conversion algorithm? Is it a standard, or is it your own creation? (I've increased the points to 120.)
 
LVL 10

Accepted Solution

by:
Lischke earned 360 total points
ID: 2769061
The algorithm is provided by the Unicode Consortium (see www.unicode.org) and can be read in their official book (www.unicode.org/unicode/uni2book/u2.html; the code there is written in C, btw.). I would like to copy the code here because we have the book too, but unfortunately I cannot find it at the moment, sorry...

Ciao, Mike

Question has a verified solution.
