Solved

Malformed Spanish Characters when want UTF-8 output

Posted on 2008-10-17
Last Modified: 2012-06-27
Hi

I'm having problems displaying UTF-8 characters on ASP webpages.

http://staging-www.bedandbreakfasts.es/ is an example: look at the body text, where lots of boxes are appearing.

I've added <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> meta tag

and

<%@codepage=65001 %>
<%Response.Charset="UTF-8"%>

at the top of the ASP code, but I'm still getting these boxes.

What gives?
Question by:bendecko
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22746808
By the time I see it, the characters are already compromised. They are of course multi-byte characters, and I would suggest you need:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" lang="es">
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22746813
It could also be the data source, or the string used to retrieve that data from the database - maybe it is not a Unicode data type...
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22747102
Well, it all appears to be part of the ANSI character set, so it should be OK. Interesting to see that the body does have the correct letter representation, and the hotkeys in the otherwise wrongly formatted text are also OK - but they do have substitution happening, e.g. funci&oacute;n de b&uacute;squeda, i.e. o acute and u acute respectively... but not in the rest. Whereas if you choose a different language, we do see the substitution... interesting...
 
LVL 1

Author Comment

by:bendecko
ID: 22747543
The page is a mix of database content and static HTML.

I've added the line

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" lang="es">

in the meta but the body is still junk.

When I look in the HTML I can see &codes for some characters and not for others.  Should they all be &codes, or should the browser theoretically display the correct characters since the Content-Type is set correctly?
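
As an illustration of the question above, here is a minimal Python sketch (Python is used only as a neutral scratchpad here; the site itself is ASP). It assumes the sample phrase quoted earlier in the thread. Character references are plain ASCII, so they survive any charset mismatch, whereas raw accented characters only display correctly when the declared charset matches the bytes actually sent.

```python
import html

text = "función de búsqueda"

# Raw characters: the bytes on the wire depend on the encoding used.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# UTF-8 bytes read by a browser that was told iso-8859-1:
print(utf8_bytes.decode("iso-8859-1"))

# ISO-8859-1 bytes read by a browser that was told utf-8:
print(latin1_bytes.decode("utf-8", errors="replace"))

# Numeric character references use ASCII only, so the charset no longer matters.
ascii_safe = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_safe)

# Named entities such as &oacute; decode to the same characters.
round_trip = html.unescape("funci&oacute;n de b&uacute;squeda")
print(round_trip == text)
```

So a page that mixes entities and raw characters can show the entity-encoded words correctly while the raw ones turn into boxes.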

In the SQL database I can see the Spanish characters via Enterprise Manager, and they do render correctly, e.g. in the left-hand side menu bar.

Thanks for the help

Bendecko
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22747575
Yes, I can see some of that - the big question is why just the top part (excluding hyperlinks) is affected - so, how is that part different from the next paragraph...
 
LVL 51

Accepted Solution

by:
Mark Wills earned 250 total points
ID: 22748177
There are definitely non-ANSI characters where the boxes appear: when I save the page and open it in TextPad, TextPad complains about non-ANSI characters.

For example, on the destination navigation indicator (i.e. El mundo > Europa > España) we find España is spelled (in hex): 45 73 70 61 C3 B1 61

meaning two bytes, C3 B1, for ñ

on the next line (i.e. <H1>) it is spelled: 45 73 70 61 EF BF BD 61

meaning three bytes, EF BF BD, for I do not know what... other than it shows as a square (maybe that is what it is)...
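
For reference, the two byte sequences quoted above can be decoded with a short Python sketch (Python is only an illustration here; the site itself is ASP). EF BF BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which a decoder substitutes for bytes it cannot interpret. The last step shows one plausible way such bytes can arise, namely an ISO-8859-1 ñ (byte F1) being read as UTF-8 and written back out; that this is what happened on this site is an assumption, not something the thread confirms at this point.

```python
# The two spellings of "España" quoted above, as raw bytes.
good = bytes.fromhex("45 73 70 61 C3 B1 61")
bad = bytes.fromhex("45 73 70 61 EF BF BD 61")

good_text = good.decode("utf-8")
bad_text = bad.decode("utf-8")

print(good_text)                            # España
print([f"U+{ord(c):04X}" for c in bad_text])

# Assumed scenario: the text started out as ISO-8859-1 (ñ = F1),
# was decoded as UTF-8 with replacement, then re-encoded as UTF-8.
latin1_original = "España".encode("iso-8859-1")
rebuilt = latin1_original.decode("utf-8", errors="replace").encode("utf-8")
print(rebuilt == bad)
```

Once a character has been turned into U+FFFD the original byte is gone, so changing the charset declaration on the page cannot bring it back.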

Now, if I manually go in and change those characters and make sure the meta tag is:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">    (I did end up removing the lang="es"; it did not seem to matter, just yet maybe)
then it does work. Changing the charset in the above line to utf-8 really screws it up again...

So, if you focus on getting <H1> correct, then the rest should likely follow on (in terms of a fix). At this stage, it looks a bit like a problem with how the page is being generated, because everything from the next title down (i.e. Costas fabulosas y escapadas a las Islas) is OK. Similarly, <H2> seems compromised as well. Maybe it has something to do with the embedded tables - though that doesn't really explain why it comes good again after that next bold heading...

When I click on any of the top few links, then the following page is similarly compromised... But, the other languages appear to be OK.

So, definitely look at how you are retrieving the data and what the datatypes are in the database. Because the other languages are OK, I would say there is some default handling behaviour or database content that is fundamentally different.

Not sure if I can help any more at this stage...  Maybe show how you retrieve Italian versus Spanish, what you are retrieving, and where you are retrieving it from.

Hope that helps...
 
LVL 1

Author Comment

by:bendecko
ID: 22755605
OK, great, I'm getting somewhere.  It looks like that file was saved in the wrong encoding, so I loaded it in TextPad and saved it out as UTF-8, and now the characters appear.   However, further down the page I'm still getting the boxes.

The text below is generated from a database and an <!--Include--> file.  I loaded that file and saved it out as before, but this time it didn't work.  The database was populated by a FORM post from a translator writing the Spanish.  I don't know the encoding of that form; it might not have been UTF-8 - probably not - so maybe the data in that part of the database is now incompatible with the encoding of the page.
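
If stored bytes really did arrive in a non-UTF-8 encoding, one common repair is "try UTF-8 first, otherwise fall back to a legacy encoding". The Python sketch below illustrates that idea only; the helper name `to_utf8` and the ISO-8859-1 fallback are assumptions for illustration, not anything known about this database or form.

```python
def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw re-encoded as UTF-8 (hypothetical helper).

    Bytes that already decode as valid UTF-8 are returned unchanged;
    anything else is assumed to be in the fallback encoding.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


legacy = b"Espa\xf1a"                    # ñ stored as the single byte F1
already_utf8 = "España".encode("utf-8")  # ñ stored as C3 B1

print(to_utf8(legacy) == already_utf8)
print(to_utf8(already_utf8) == already_utf8)
```

This heuristic only helps while the original bytes are still intact. Data that has already been replaced with U+FFFD has to be re-entered or restored from an uncorrupted source.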

Don't worry about French etc being different.  The Spanish staging site is the first one to specify the encodings and all the other languages will have the same problems later!

How do I see in textpad the byte sequences you mention above?

Thanks

Bendecko
 
LVL 51

Expert Comment

by:Mark Wills
ID: 22755636
You have to open TextPad first, then do a File > Open and select binary as the file type. You cannot edit - it becomes read-only - but it gives you the hex view, like the old-fashioned DUMP command.

Yes, it does sound like a database / data problem... It is a pity about French et al - it looks like that was working well.
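
If TextPad is not to hand, a similar hex view can be produced with a few lines of Python. This is a minimal sketch; the function name `hex_view` and the 16-bytes-per-row layout are arbitrary choices for illustration.

```python
def hex_view(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex and printable-ASCII columns."""
    pad = width * 3
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = chunk.hex(" ").upper()
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08X}  {hex_part:<{pad}} {text_part}")
    return "\n".join(rows)


sample = "España".encode("utf-8")
print(hex_view(sample))
```

To inspect a saved page, read it in binary mode first, for example `hex_view(open("page.html", "rb").read())`, where `page.html` is a placeholder file name.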
 
LVL 1

Author Closing Comment

by:bendecko
ID: 31508175
Thank you.

It turned out that a routine that copied the translated HTML sections from their initial location to the staging areas was corrupting the characters. The routine used the FSO (FileSystemObject), and this messed up the UTF-8 characters.  You have to use ADO streams (ADODB.Stream) instead to preserve the encoding.

For any other EE user embarking on internationalisation: you should definitely read, even if just for a good laugh, Joel's article: http://www.joelonsoftware.com/articles/Unicode.html

Thanks again, Mark, for helping me on this one.
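
The general point carries over to any language: a copy that works on raw bytes, or that names the encoding explicitly, preserves UTF-8, while a copy that passes the text through the wrong codec does not. The Python sketch below is an analogue only, not the VBScript routine from the site. The function names are hypothetical, and ASCII is deliberately used as a stand-in "wrong" codec; it does not reproduce how FSO actually handles text.

```python
import tempfile
from pathlib import Path


def copy_as_bytes(src: Path, dst: Path) -> None:
    """Copy without interpreting the content, so every byte survives."""
    dst.write_bytes(src.read_bytes())


def copy_as_text(src: Path, dst: Path, read_encoding: str) -> None:
    """Copy via a text decode; a wrong read_encoding damages the content."""
    text = src.read_text(encoding=read_encoding, errors="replace")
    dst.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "es.html"
    src.write_bytes("<h1>España</h1>".encode("utf-8"))

    safe = Path(tmp) / "safe.html"
    lossy = Path(tmp) / "lossy.html"
    copy_as_bytes(src, safe)
    copy_as_text(src, lossy, read_encoding="ascii")  # deliberately wrong codec

    original = src.read_bytes()
    safe_bytes = safe.read_bytes()
    lossy_bytes = lossy.read_bytes()

print(safe_bytes == original)            # byte copy is identical
print(b"\xef\xbf\xbd" in lossy_bytes)    # lossy copy contains U+FFFD
```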

Bendecko