URLDecode UTF-8 string

Hi experts,

I'm facing a problem and need some help: I'm trying to URL-decode a UTF-8 URL-encoded string.

I tried almost every urldecode routine I found on the internet, and none of them fully works for me.

The closest one is the code attached below, but it doesn't work right with some strings like this:

%c3%9a: this URL-encoded UTF-8 sequence should decode to Ú, but the code decodes it wrongly.

I have Delphi Prism (Oxygene, .NET 3.5) installed, and there the HttpUtility.UrlDecode() function worked like a charm, but my project is written in Delphi.

So what I'm asking for is a function like urldecode(string). Please test it with %c3%9a, which should return Ú.
If that is too much work, I could use a DLL made in .NET that wraps HttpUtility.UrlDecode(). If you decide to help me by making a DLL, please upload it and show me some Delphi code for calling the function from it, including registering and all; I have very limited knowledge.

If I'm asking too much, sorry; I don't know what else to do. Any leads to fix this code would help too.

I'll be back after Easter, on Monday; I may check this question on Saturday.

See you, experts, and happy Easter to you all.


function URLDecodeUTF8(const s: PAnsiChar; const buf: PWideChar;
      var lenBuf: Cardinal): boolean; stdcall;
var
   sAnsi: String;    // normal ansi string
   sUtf8: String;    // utf8-bytes string
   sWide: WideString; // unicode string

   i, len: Cardinal;
   ESC: string[2];
   CharCode: integer;
   c: char;
begin
   sAnsi := s; // null-terminated str to pascal str
   SetLength(sUtf8, Length(sAnsi));

   // Convert URLEncoded str to utf8 str, it must
   // use utf8 hex escaping for non us-ascii chars
   //    +      = space
   //    %2A    = *
   //    %C3%84 = Ä (A with diaeresis)
   i := 1;
   len := 1;
   while (i <= Cardinal(Length(sAnsi))) do begin
      if (sAnsi[i] <> '%') then begin
         if (sAnsi[i] = '+') then begin
            c := ' ';
         end else begin
            c := sAnsi[i];
         end;
         sUtf8[len] := c;
         Inc(len);
      end else begin
         Inc(i); // skip the % char
         ESC := Copy(sAnsi, i, 2); // Copy the escape code
         Inc(i, 1); // skip ESC, another +1 at end of loop
         try
            CharCode := StrToInt('$' + ESC);
            //if (CharCode > 0) and (CharCode < 256) then begin
               c := Char(CharCode);
               sUtf8[len] := c;
               Inc(len);
            //end;
         except end;
      end;
      Inc(i);
   end;
   Dec(len); // -1 to fix length (num of characters)
   SetLength(sUtf8, len);

   sWide := UTF8Decode(sUtf8); // utf8 string to unicode
   len := Length(sWide);

   if Assigned(buf) and (len < lenBuf) then begin
      // copy result into the buffer, buffer must have
      // space for last null byte.
      //    lenBuf=num of chars in buffer, not counting null
      if (len > 0) then
         Move(PWideChar(sWide)^, buf^, len * SizeOf(WideChar));
      buf[len] := #0;
      lenBuf := len;
      Result := True;
   end else begin
      // tell calling program how big the buffer
      // should be to store all decoded characters,
      // including trailing null value.
      if (len > 0) then
         lenBuf := len+1;
      Result := False;
   end;
end;

arreegua Asked:

Emmanuel PASQUIER (Freelance Project Manager) Commented:

arreegua (Author) Commented:
Yes, the code I attached before is from this site. The code does a good job, but it makes a few mistakes.

For example, for %c3%9a the code returns ?? when the right answer is Ú.

Regards,

Arreegua


developmentguru (President) Commented:
The problem you are having is that the URL-encoded string is encoded using 8-bit characters, while what you are trying to decode it to is 16-bit characters. If you would like me to create the function for you, I will need to see a full string (encoded and decoded).

Let me know.
Emmanuel PASQUIER (Freelance Project Manager) Commented:
Are you using Delphi 2009/2010?
It is quite possible that along the way you get unwanted conversions from Unicode to ANSI, because the code assumes that String = AnsiString and Char = AnsiChar. In the newest Delphi versions that is no longer the case (String = UnicodeString and Char = WideChar), and I have no clear idea what kind of problems this could create in special cases.

Here is a stricter version of the function that should do the same job whatever the Delphi version. If it is still not OK, then, as developmentguru said, it would be easier if you could provide a full string, encoded and decoded.
function URLDecodeUTF8(const s: PAnsiChar; const buf: PWideChar;
      var lenBuf: Cardinal): boolean; stdcall;
var
   sAnsi: ANSIString;    // normal ansi string
   sUtf8: ANSIString;    // utf8-bytes string
   sWide: WideString; // unicode string
   i, len: Integer;
   CharCode: Cardinal;
begin
 sAnsi := s; // null-terminated str to pascal str
 SetLength(sUtf8, Length(sAnsi));

 // Convert URLEncoded str to utf8 str, it must
 // use utf8 hex escaping for non us-ascii chars
 //    +      = space
 //    %2A    = *
 //    %C3%84 = Ä (A with diaeresis)
 i := 1;
 len := 1;
 while (i <= Length(sAnsi)) do 
  begin
   if (sAnsi[i] <> '%') then 
    begin
     if (sAnsi[i] = '+') 
      then sUtf8[len] := ' '
      else sUtf8[len] := sAnsi[i];
     Inc(len);
    end else 
    begin
     Inc(i); // skip the % char
     try
      CharCode := StrToInt('$' + Copy(sAnsi, i, 2) );
      sUtf8[len] := AnsiChar(CharCode);
      Inc(len);
     except 
     end;
     Inc(i); // skip ESC, another +1 at end of loop
    end;
   Inc(i);
  end;
 Dec(len); // -1 to fix length (num of characters)
 SetLength(sUtf8, len);

 sWide := UTF8Decode(sUtf8); // utf8 string to unicode
 len := Length(sWide);

 if Assigned(buf) and (len < lenBuf) then 
  begin
   // copy result into the buffer, buffer must have
   // space for last null byte.
   //    lenBuf=num of chars in buffer, not counting null
   // PWideChar() is safe for an empty string, so the terminating
   // null is always copied, even when len = 0.
   Move(PWideChar(sWide)^, buf^, (len+1) * SizeOf(WideChar));
   lenBuf := len;
   Result := True;
  end else 
  begin
   // tell calling program how big the buffer
   // should be to store all decoded characters,
   // including trailing null value.
   if (len > 0) 
    then lenBuf := len+1;
   Result := False;
  end;
end;

Emmanuel PASQUIER (Freelance Project Manager) Commented:
Tested with D2007; this works for '%c3%9a'.
Display in memo => 1 Ú

procedure TForm1.FormCreate(Sender: TObject);
Var
 Buf:Array [0..10] of WideChar;
 lenBuf: Cardinal;
begin
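 lenBuf := High(Buf); // buffer capacity, not counting the terminating null; must be set before the call
 URLDecodeUTF8('%c3%9a',Buf,lenBuf);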
 Memo.Lines.Add(Format('%d %s',[lenBuf,Buf]));
end;

developmentguru (President) Commented:
The big issue here is working out which bytes in the stream stand alone and which belong to a multi-byte UTF-8 sequence. There are standards that tell us how to do this decoding. Having said that, seeing the actual encoded and decoded text would help us as experts to determine which rules are being followed, so we could design a function that will always work for you (until the other end changes the encoding, anyway). As an example: in C3, the high bits (110xxxxx) indicate that it starts a two-byte sequence, so the next byte (9A) is part of the same character rather than a character of its own. If you have Delphi 2009 or Delphi 2010, then these functions are already built in: you simply pass the encoded text to a function and it returns the correct decoding.

There are some very good reasons to upgrade Delphi.
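
To make the byte layout concrete, here is a minimal sketch of how a two-byte UTF-8 sequence such as C3 9A maps to a single code point. The function name DecodeTwoByteUtf8 is hypothetical and not from this thread; it only handles two-byte sequences and does not reject overlong forms, so treat it as an illustration rather than a full decoder.

```pascal
program Utf8TwoByteDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed when compiling with Free Pascal
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes one two-byte UTF-8 sequence to its code point.
function DecodeTwoByteUtf8(Lead, Trail: Byte): Integer;
begin
  // The lead byte must look like 110xxxxx and the trail byte like 10xxxxxx.
  if ((Lead and $E0) <> $C0) or ((Trail and $C0) <> $80) then
    raise Exception.Create('Not a two-byte UTF-8 sequence');
  // Keep the 5 payload bits of the lead byte and the 6 of the trail byte.
  Result := ((Lead and $1F) shl 6) or (Trail and $3F);
end;

begin
  // C3 9A -> U+00DA, which is the letter U with acute accent
  Writeln(IntToHex(DecodeTwoByteUtf8($C3, $9A), 4)); // prints 00DA
end.
```

Working it by hand: C3 and 1F gives 03, shifted left by 6 gives C0; 9A and 3F gives 1A; C0 or 1A gives DA.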
arreegua (Author) Commented:
Sorry for the delay, guys. I was in a small car accident and had to be hospitalized over the holiday: a head concussion, just for observation, but I'm OK now.

I'll read and try the suggestions and will be right back with the answers.

By the way, I'm using Delphi 2010.

See you
arreegua (Author) Commented:
Hi developmentguru and epasquier,

First of all, I very much appreciate the effort. Thank you both.


Epasquier, I tried your code. What I understand from it is that I now have to pass an AnsiChar to the function, and the problem is that I don't know whether I need two AnsiChars to decode or only one: for %c3%9a I need two, and for %2A I need only one to decode to *.

About the sample with the encoded strings: I'm working with the file at this link:
http://br28.tribalwars.com.br/map/ally.txt. I attached a picture from my program with samples of encoded strings and the same strings decoded with the function I provided in the first post (I marked the problematic ones).

It is a comma-separated file with info from a web browser game. Some samples, with both good and problematic strings:

Encoded Strings:
(these 4 strings below are OK)

10821,QAZSS,QAZSS,1,1,316,316,1973
10823,Alian%C3%A7a+Cartago+Esparta,ACE,1,1,681,681,1261
10824,imp%C3%A9rio+de+guerreiros,i.m.g.,1,1,825,825,1147
10830,miillerzao,mlz,1,1,75,75,3907

(problem ones)

10799,LE%C3%83O+TRIBAL+K48,LT+K48,9,9,2339,2339,662
10814,_S%C3%B3_ZiNhO_,_S_%C3%93_,1,1,747,747,1206
10820,%2A%2AFORT%C3%95ES+2%2A%2A,%5B%2AF2%2A%5D,1,1,111,111,3246

Decoded Strings:

10821,QAZSS,QAZSS,1,1,316,316,1973
10823,Aliança Cartago Esparta,ACE,1,1,681,681,1261
10824,império de guerreiros,i.m.g.,1,1,825,825,1147
10830,miillerzao,mlz,1,1,75,75,3907
10799,LE¿?O TRIBAL K48,LT K48,9,9,2339,2339,662
10814,_Só_ZiNhO_,_S_¿?_,1,1,747,747,1206
10820,**FORT¿?ES 2**,[*F2*],1,1,111,111,3246

Should be:

10821,QAZSS,QAZSS,1,1,316,316,1973
10823,Aliança Cartago Esparta,ACE,1,1,681,681,1261
10824,império de guerreiros,i.m.g.,1,1,825,825,1147
10830,miillerzao,mlz,1,1,75,75,3907
10799,LEÃO TRIBAL K48,LT K48,9,9,2339,2339,662
10814,_Só_ZiNhO_,_S_Ó_,1,1,747,747,1206
10820,**FORTÕES 2**,[*F2*],1,1,111,111,3246


I'm loading the encoded file into one memo and decoding it into another memo.

Thanks again for the help.



utf8decode.jpg
Emmanuel PASQUIER (Freelance Project Manager) Commented:
> the problem is that I don't know whether I need two AnsiChars to decode or only one:
> for %c3%9a I need two, and for %2A I need only one to decode to *.

I suppose it is UTF8Decode that is doing that.
Is the result you just posted from the code I provided?
arreegua (Author) Commented:
No, I'm embarrassed to say that I don't know how to use your function.

I tried passing the whole Memo1.Lines.Text. Then I thought I needed to pass it AnsiChar by AnsiChar, so I put Memo1.Lines.Text into an AnsiString variable and used a for loop over its length, passing one character at a time.

But it didn't work.

Can you provide the code to test your function, decoding from Memo1.Lines.Text to Memo2.Lines.Text?

Thanks very much, epasquier. Sorry for my ignorance, I know I'm really a noob '=D


Emmanuel PASQUIER (Freelance Project Manager) Commented:
Try this:
Var
 Buf:Array [0..1023] of WideChar;// 2 KB buffer (1024 WideChars)
 lenBuf: Cardinal;
 i:integer;
 S:AnsiString;
begin
 Memo2.Clear;
 for i:=0 to Memo1.Lines.Count-1 do 
  begin
   S:=Memo1.Lines[i];
   URLDecodeUTF8(PAnsiChar(S),Buf,lenBuf);
   Memo2.Lines.Add(Format('%d %s',[lenBuf,Buf]));
  end;
end;

arreegua (Author) Commented:
Hi epasquier,

I can't figure out what is going on: some lines are repeating the lenBuf and the Buf.

encoded strings:
11395,brinquedo+assassino,k43,1,1,105,105,3348
11396,todo+mundo+%C3%A9+bem+vindo,tdm%C3%A9bv,1,1,922,922,1068
11397,IMPERIO+BRASIL+DARK+LEGENDS,%7CIBDL%7C,1,1,715,715,1233
11398,Ordem+dos+Templ%C3%A1rios,%2ATMPS%2A,2,2,908,908,1079
11399,Poder+Supremo+Tribal,PSG,1,1,293,293,2050
11400,skatefehh,skt,1,1,105,105,3368
11401,tribo_los_angels,TLA,1,1,287,287,2075
11402,for%C3%A7a+jovem+goias,%2BFJG%2B,1,1,578,578,1384
11403,mano+exxx,maex,1,1,188,188,2580
11404,Hangaz+K41,%21H41,1,1,105,105,3369

decoded strings:
46 11395,brinquedo assassino,k43,1,1,105,105,3348
53 11395,brinquedo assassino,k43,1,1,105,105,3348
58 11395,brinquedo assassino,k43,1,1,105,105,3348
50 11398,Ordem dos Templários,*TMPS*,2,2,908,908,1079
47 11399,Poder Supremo Tribal,PSG,1,1,293,293,2050
36 11400,skatefehh,skt,1,1,105,105,3368
44 11400,skatefehh,skt,1,1,105,105,3368
47 11400,skatefehh,skt,1,1,105,105,3368
37 11403,mano exxx,maex,1,1,188,188,2580
39 11403,mano exxx,maex,1,1,188,188,2580

I'm attaching a screenshot of my program running your code.

Thanks for the help and the patience.



Epasquier-Function.jpg
Emmanuel PASQUIER (Freelance Project Manager) Commented:
Argh. lenBuf must be reset before each call. Otherwise it still holds the value written by the previous call, and if the new length is not below that value, the function assumes the buffer is too small and skips the copy of the resulting string. That's not a good design for the URLDecodeUTF8 parameters, but it's not much of a problem:
Var
 Buf:Array [0..1023] of WideChar;// 2 KB buffer (1024 WideChars)
 lenBuf: Cardinal;
 i:integer;
 S:AnsiString;
begin
 Memo2.Clear;
 for i:=0 to Memo1.Lines.Count-1 do 
  begin
   S:=Memo1.Lines[i];
   lenBuf:=High(Buf); // <== Add this
   URLDecodeUTF8(PAnsiChar(S),Buf,lenBuf);
   Memo2.Lines.Add(Format('%d %s',[lenBuf,Buf]));
  end;
end;
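
Since the buffer-and-length parameters are what caused the repeated lines, one alternative is a wrapper that returns the decoded string directly. The following is a minimal sketch under stated assumptions, not the code from this thread: the names URLDecodeUTF8Str and HexVal are hypothetical, invalid or truncated % escapes are copied through unchanged, and UTF8Decode is used as in the function above (newer Delphi versions flag it as deprecated in favour of UTF8ToWideString). It is also worth noting that every failing sample in this thread (%83, %93, %95, %9A) has its second byte in the range 80..9F, while the working ones (%A7, %A9, %B3, %A1) do not. A plausible cause, not verified here, is that those byte values do not survive a round trip through a Unicode String and back to ANSI on a Windows-1252 system, so the sketch keeps the raw bytes in an AnsiString throughout.

```pascal
program UrlDecodeSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed when compiling with Free Pascal
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: value of one hex digit, or -1 if C is not a hex digit.
function HexVal(C: AnsiChar): Integer;
begin
  case C of
    '0'..'9': Result := Ord(C) - Ord('0');
    'A'..'F': Result := Ord(C) - Ord('A') + 10;
    'a'..'f': Result := Ord(C) - Ord('a') + 10;
  else
    Result := -1;
  end;
end;

// Hypothetical wrapper: decodes a URL-encoded UTF-8 string and returns the
// result directly, so the caller manages no buffer and no length variable.
function URLDecodeUTF8Str(const S: AnsiString): WideString;
var
  Utf8: AnsiString; // raw UTF-8 bytes, one byte per AnsiChar
  i, n, HiNib, LoNib: Integer;
begin
  SetLength(Utf8, Length(S)); // the decoded bytes never outnumber the input
  n := 0;
  i := 1;
  while i <= Length(S) do
  begin
    Inc(n);
    if (S[i] = '%') and (i + 2 <= Length(S)) then
    begin
      HiNib := HexVal(S[i + 1]);
      LoNib := HexVal(S[i + 2]);
      if (HiNib >= 0) and (LoNib >= 0) then
      begin
        Utf8[n] := AnsiChar(HiNib * 16 + LoNib);
        Inc(i, 3);
        Continue;
      end;
    end;
    // Not a valid %XX escape: copy the character, mapping '+' to a space.
    if S[i] = '+' then
      Utf8[n] := ' '
    else
      Utf8[n] := S[i];
    Inc(i);
  end;
  SetLength(Utf8, n);
  Result := UTF8Decode(Utf8); // UTF-8 bytes to UTF-16
end;

var
  R: WideString;
begin
  R := URLDecodeUTF8Str('%c3%9a');
  // Print length and code point instead of the character itself, so the
  // output does not depend on the console code page.
  Writeln(Length(R), ' ', IntToHex(Integer(Ord(R[1])), 4)); // prints 1 00DA
end.
```

With such a wrapper, the body of the memo loop would reduce to a single call along the lines of Memo2.Lines.Add(URLDecodeUTF8Str(AnsiString(Memo1.Lines[i]))), with no lenBuf to reset.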

arreegua (Author) Commented:
Thank you very much for your help, epasquier.