Encoding problem with characters... French...

PagodNaUtak
Hi,

Currently the encoding that I use in my XML is
 <?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:resx="resxUri">

But I came across a scenario where, instead of "Questo messaggio è la conferma", the output becomes "Questo messaggio Ã¨ la conferma".

Any ideas why? And what is the proper encoding I should use? Your advice is greatly appreciated.

Regards,

Joseph
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
This simply means that somewhere in your processing chain a UTF-8 encoded character (two bytes for "è") is being interpreted as two single-byte ISO-8859-1 characters.
So somewhere in your chain you pass in UTF-8 encoded XML while stating that it is ISO-8859-1.
It could be that the source is already corrupt.
Are you sure that the source XML is really ISO-8859-1?
You can check that by opening the file in a binary (hex) editor and seeing whether the character is stored as two bytes.
If it is, either a UTF-8 snippet has been introduced into your source and you need to fix that,
or the encoding declared for the XML is wrong.
It helps to view your source in an XML editor to see whether the encoding is right (www.oxygenxml.com is a good choice).
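
The symptom described above can be reproduced in a few lines. This is only a sketch in Python (not the VB.NET or XSLT used in this thread), and the variable names are made up for illustration. It encodes the sentence as UTF-8, reads the bytes back as ISO-8859-1, and shows what a hex editor would display for "è" in each encoding.

```python
# Reproduce the symptom: UTF-8 bytes read back as ISO-8859-1.
original = "Questo messaggio è la conferma"

utf8_bytes = original.encode("utf-8")        # "è" becomes the two bytes C3 A8
garbled = utf8_bytes.decode("iso-8859-1")    # each byte is read as its own character

# What a hex editor would show for the single character "è" in each encoding.
hex_utf8 = "è".encode("utf-8").hex()         # two bytes
hex_latin1 = "è".encode("iso-8859-1").hex()  # one byte

print(garbled)
print(hex_utf8, hex_latin1)
```

If the file really is ISO-8859-1, the character should show up as the single byte E8. Seeing C3 A8 instead means the content is UTF-8, whatever the declaration says.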

Is this "Questo messaggio..." text
introduced in the XSLT?
Your XSLT declares the encoding iso-8859-1 as well;
it could be that you pasted text in a different encoding into your XSLT. Maybe start by setting that to UTF-8.

Encoding issues are tricky. If the above doesn't help you yet, you need to give us more information (maybe attach the source and XSLT and explain how you run the XSLT).

Author

Commented:
Attached here is the XSLT...

Is there anything wrong?
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:resx="resxUri">
  
  <xsl:output indent="yes" />
  <xsl:output method="html"/>
  <xsl:param name="locale"/>

  <xsl:template match="/">
    <xsl:text disable-output-escaping="yes">
      &lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt;
    </xsl:text>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>        
      </head>
      <body>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('GreetingsFromMyCompany', $locale)"/>
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('YourColleague', $locale)"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="emailAColleagueEntity/yourName"/>
          <xsl:value-of select="resx:GetTranslatedValue('EmailAColleaguePar1', $locale)"/>
        </p>
        <p>
          <xsl:value-of select="emailAColleagueEntity/pageTitle"/>
        </p>
        <p>          
          <a>
            <xsl:attribute name="href">
              <xsl:value-of select="emailAColleagueEntity/pageLink"/>                            
            </xsl:attribute>
            <xsl:value-of select="emailAColleagueEntity/pageLink"/>            
          </a>
        </p>
        <p>
          <xsl:value-of select="emailAColleagueEntity/comment"/>
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('EmailAColleaguePar2', $locale)"/>          
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('PleaseDoNotReplyNotAMonitoredAccount', $locale)"/>
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('AboutMyCompany', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('CombiningUnparalleled', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('MyCompanyCollaborates', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('ItsHompageIs', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('EmailConfirmSiteName', $locale)"/>
        </p>
        
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>


Information Architect
Top Expert 2006
Commented:
Well, have you read my response and taken the suggested actions?
Please do that first
(check the encoding of your source)

Nothing wrong with your XSLT.
From your profile I understand this could be C#
I strongly recommend that you use UTF-8 for the stylesheet,
but that won't solve it.
The issue likely is in your source XML
you can attach it for validation, but I have suggested some actions you could do yourself first

Author

Commented:
I will try your suggestion...
Éric MoreauSenior .Net Consultant
Top Expert 2016

Commented:
BTW, this is not French. It looks something like Spanish to me.
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
It is Italian
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
Original Poster is not referring to the language of the sentence but to the "è" I believe :-)

Author

Commented:
Yes, actually the problem is that instead of è it becomes Ã.
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
No, it becomes Ã AND another byte.
The Ã (byte 0xC3) tells a UTF-8 interpreter that this byte needs to be interpreted together with the next one
to form one two-byte character.
That is how UTF-8 works.
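
To make the "lead byte plus next byte" point concrete, here is a small Python sketch (illustrative only, names are arbitrary) that takes the two UTF-8 bytes of "è" apart by hand. A lead byte whose top bits are 110 announces a two-byte sequence, and the following byte must start with the bits 10.

```python
# Decode the two-byte UTF-8 sequence for "è" by hand.
encoded = "è".encode("utf-8")
lead, continuation = encoded[0], encoded[1]

# 110xxxxx marks the start of a two-byte sequence, 10xxxxxx a continuation byte.
is_two_byte_lead = (lead >> 5) == 0b110
is_continuation = (continuation >> 6) == 0b10

# Combine the 5 payload bits of the lead byte with the 6 of the continuation byte.
code_point = ((lead & 0x1F) << 6) | (continuation & 0x3F)
decoded = chr(code_point)

print(hex(lead), hex(continuation), hex(code_point), decoded)
```

Read as ISO-8859-1, the same two bytes are simply two separate characters, which is where the "Ã" followed by a second character comes from.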

Have you checked the source already?
If not, attach it and I will do it for you.

Author

Commented:
Hi, will there be a problem if the source is encoded as UTF-8 and the XSLT it is passed to declares iso-8859-1?

The source is not yet available at the moment... I am working on it...
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
No, you can perfectly well have a source in UTF-8.
Internally the parser converts everything to Unicode, and on serialisation the output is written in whatever encoding you want.
Just set
<xsl:output encoding="iso-8859-1"/>
and the serialiser of your XSLT processor will write ISO Latin-1 as you wish.

BUT your C# XSLT transformer could potentially override that setting,
so be careful there.
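
The same parse-then-serialise behaviour can be sketched outside XSLT. The Python snippet below is only an analogy, using the standard library's xml.etree.ElementTree rather than the .NET XSLT processor discussed here. It parses a UTF-8 document and writes it back out as ISO-8859-1, and the text survives the round trip.

```python
import xml.etree.ElementTree as ET

# A UTF-8 encoded source document (the element name is arbitrary).
source_utf8 = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<p>Questo messaggio è la conferma</p>"
).encode("utf-8")

# The parser turns the bytes into Unicode text, whatever the input encoding was.
root = ET.fromstring(source_utf8)

# The serialiser writes the requested encoding and declares it in the XML declaration.
latin1_out = ET.tostring(root, encoding="iso-8859-1")

# Parsing the ISO-8859-1 output gives back the same text.
round_trip = ET.fromstring(latin1_out).text
print(latin1_out)
```

The important part is that the encoding is chosen once, at serialisation, and that the declaration in the output matches the bytes actually written.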

Commented:
>>Your advice is greatly appreciated

As one who uses the French and German versions of Windows XP Pro and who occasionally writes XML data in Russian and Hebrew (appropriate keyboards being installed), I always use Notepad and always store the XML in UTF-8 (which, written in capital letters, is the character set name officially registered with IANA).

Author

Commented:
I think the problem is something like this:

the text is originally UTF encoded, then converted to ISO-8859-1, then converted again to UTF-8.

Will there be a problem if I set the encoding from <xsl:output encoding="iso-8859-1"/> to <xsl:output encoding="UTF-8"/>?

Will it still be converted properly?

Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
Hi BigRat,
how do you prevent Notepad from using Win-1252 in the background?
Using Notepad on Windows XP Pro with UTF-8 in the XML encoding declaration
does not necessarily force Notepad into storing the characters as UTF-8.

Personally I strongly recommend not using non-XML tools for creating XML,
especially because cutting and pasting from various sources always ends up creating encoding deadlocks (because of mixed encodings).
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
> Will there be a problem if I set the encoding from <xsl:output encoding="iso-8859-1"/> to <xsl:output encoding="UTF-8"/>?

No, not at all; that is the default, by the way.
Any encoding should work. Any XML processing tool should understand the encoding correctly.
Only UTF-8 and UTF-16 are mandated for a conformant parser, but I am not aware of a processor that doesn't understand ISO-8859-1.


Author

Commented:
I ran a test; here is the code. When I convert the ISO bytes to UTF-8, the text does not come out correctly and one byte is added. So I think the problem is in converting the ISO bytes to UTF-8.

Any ideas?
' Encode the test string as ISO-8859-1 bytes
Dim isoBytes As Byte() = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes("Questo messaggio è la conferma")
' Re-encode the ISO-8859-1 bytes as UTF-8 bytes
Dim utfBytes As Byte() = System.Text.Encoding.Convert(System.Text.Encoding.GetEncoding("ISO-8859-1"), System.Text.Encoding.UTF8, isoBytes)

' Decode as UTF-8 and display (note: this decodes isoBytes, not utfBytes)
Dim msg As String = System.Text.Encoding.UTF8.GetString(isoBytes)
MsgBox(msg)
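
A note on the test above: the last step decodes the ISO-8859-1 bytes as UTF-8, not the converted bytes. The Python sketch below is a rough equivalent of the VB, not a translation checked against .NET, and the variable names are illustrative. It shows the difference between decoding each of the two byte arrays.

```python
text = "Questo messaggio è la conferma"

# Same two steps as the VB test: ISO-8859-1 bytes, then converted to UTF-8 bytes.
iso_bytes = text.encode("iso-8859-1")
utf_bytes = iso_bytes.decode("iso-8859-1").encode("utf-8")

# Decoding the ISO-8859-1 bytes as UTF-8: the lone byte E8 is not valid UTF-8,
# so a replacement character appears where "è" should be.
wrong = iso_bytes.decode("utf-8", errors="replace")

# Decoding the converted bytes as UTF-8 gives the original text back.
right = utf_bytes.decode("utf-8")

print(wrong)
print(right)
```

So the conversion itself is probably fine; the garbling in the test most likely comes from decoding the wrong array with the wrong encoding.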


Author

Commented:
Is there any disadvantage if I use UTF-8 instead of ISO-8859-1?
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
weird, my answer comes prior to your question :-), so swap ...186 and ...182 when reading

Anyway, don't look at the XSLT for the encoding; you are messing with it in the VB.
My VB is rusty, so I don't necessarily see what you are doing.
Why do you convert at all? Why don't you read it in as UTF-8 directly?
Make the 'è' a 'Ã¨' to test.
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
> Is there any disadvantage if I use UTF-8 instead of ISO-8859-1?

Yes, a bigger character set and no issues with encodings.
UTF-8 is the default character set used in XML, and some tools (sadly) assume that the encoding is UTF-8
and tend to ignore character encoding settings.
So it is generally safer to use UTF-8 when doing XML.

If you want to rule out encoding issues,
can you try this?
"Questo messaggio &#232; la conferma"
It should work better, just for testing what happens.
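
Why the numeric character reference sidesteps the problem: &#232; is written with plain ASCII characters only, so the bytes are the same in ISO-8859-1 and UTF-8, and the XML parser turns it into "è" (code point 232) itself. A minimal Python sketch, with an arbitrary element name:

```python
import xml.etree.ElementTree as ET

# The document contains only ASCII bytes; "è" is written as the reference &#232;.
doc = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b"<p>Questo messaggio &#232; la conferma</p>"
)

is_pure_ascii = all(byte < 128 for byte in doc)

# The parser resolves the reference to the character with code point 232.
text = ET.fromstring(doc).text
print(text, ord("è"))
```

If the garbled character still appears with this test, the damage is happening after the parse, for example when the output is written or displayed.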
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
Sorry, I misread that follow-up.
I answered "Is there an advantage..."

There is no disadvantage in using UTF-8
(well, each accented character takes two bytes, so the UTF-8 tends to be bigger than the ISO-8859-1,
but that difference is marginal in Italian, so I tend to ignore the size difference).
In my opinion, there are only advantages in going to UTF-8.
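
The size difference is easy to measure. A short Python sketch using the sentence from the question:

```python
text = "Questo messaggio è la conferma"

# ISO-8859-1 uses one byte per character; UTF-8 uses two for "è" and one for ASCII.
size_latin1 = len(text.encode("iso-8859-1"))
size_utf8 = len(text.encode("utf-8"))
extra_bytes = size_utf8 - size_latin1

print(size_latin1, size_utf8, extra_bytes)
```

Only the accented character costs an extra byte here, so for mostly-ASCII Italian text the overhead stays small.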

Commented:
>>how do you prevent Notepad from using Win-1252 in the background?

Notepad is internally Unicode. On saving the file (with SaveAs)  just select  UTF-8 from the encoding. Once saved as such all subsequent edits and saves use UTF-8. I use Lucida Console as font.

>>specially because cutting and pasting from various sources, always ends up creating encoding deadlocks (because of mixed encodings)

Not under Windows, where everything is 16 bit internally. This is a typical problem under Linux where the editors work in some character set and assume the same on pasting.

Commented:
PagodNaUtak: How are you editing or changing the encoding of the file? Under Windows or Linux?
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
@BigRat,

I have had too many experiences in projects that contradict what you say... on Windows, using Notepad.
If you don't select an encoding, Notepad guesses, and guessing can introduce mistakes.
I typed example 1 into Notepad on Windows XP Pro, saved it and loaded it in Stylus Studio.
Stylus says it is invalid XML, and Stylus is right.
Hence my recommendation not to use Notepad for XML; there is too much you need to keep in sync.
OK, if you use UTF-8 only and always tell the editor to use UTF-8 encoding, it could work, but I don't consider a double binding requirement in an editor good practice.
I am not disputing that it can be done. I have seen it go wrong too many times with clients to recommend it as good practice
(note that I favour using UTF-8 throughout, as I indicated earlier, but you can't always control what you get).

Your second statement is untrue for sure (or at least incomplete).
If you copy and paste into a text editor from an ISO-8859-1 source,
tell me which text editor knows about the XML encoding?
Take example 2, add example 3, and you get example 4.
Save it: no longer valid XML.

Stupid? I know, but risky enough not to recommend XML editing in Notepad, unless you are constantly watching your back and really know what you're doing.

I am dealing with lots of customers and data entry people, and half of my income comes from EU-related projects (23 languages including Bulgarian, Greek, ...).
I can tell you that stupid things do happen :)


example 1
---------
<root>
<foo>é</foo>
<euro>€</euro>
</root>

example 2
---------
<?xml version="1.0" encoding="iso-8859-1"?>
<bar>
    <foo>é</foo>
</bar>

example 3
---------
<foo>€</foo>

example 4
---------
<?xml version="1.0" encoding="iso-8859-1"?>
<bar>
    <foo>é</foo>
    <foo>€</foo>
</bar>
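
A sketch of why example 1 fails, under the assumption that Notepad saved it in the Western European ANSI code page (Windows-1252) with no XML declaration. Python's standard parser stands in for Stylus Studio here, and the helper name is made up for illustration.

```python
import xml.etree.ElementTree as ET

example_1 = "<root>\n<foo>é</foo>\n<euro>€</euro>\n</root>"

def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# No declaration means the parser assumes UTF-8, so Windows-1252 bytes are rejected.
ansi_ok = is_well_formed(example_1.encode("cp1252"))
utf8_ok = is_well_formed(example_1.encode("utf-8"))

# The euro sign does not exist in ISO-8859-1 at all; Windows-1252 stores it as 0x80.
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
euro_cp1252 = "€".encode("cp1252")

print(ansi_ok, utf8_ok, euro_in_latin1, euro_cp1252)
```

The last two lines relate to example 4: a file declared as iso-8859-1 cannot legitimately contain a euro sign, so whatever byte the editor writes for it will not mean "€" to the parser.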


Commented:
Gert: You haven't actually said, in your examples, how the file is encoded. On Windows, Notepad writes a BOM when saving as Unicode or UTF-8, but not for ISO. On Windows Eastern Europe, Greek and Russian the default is not iso-8859-1 but windows-1251, or windows-greek or windows-russian, so my advice is always to store in UTF-8.
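
For reference, the UTF-8 BOM mentioned here is the three bytes EF BB BF at the start of the file. The Python sketch below is illustrative only: the "utf-8-sig" codec imitates what Notepad writes, and the sample content is made up.

```python
import codecs

content = "<root><foo>é</foo></root>"

# "utf-8-sig" prepends the UTF-8 byte order mark, as Notepad does on "Save As UTF-8".
with_bom = content.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" strips the BOM again; plain "utf-8" leaves it as U+FEFF.
stripped = with_bom.decode("utf-8-sig")
not_stripped = with_bom.decode("utf-8")

print(with_bom[:3].hex(), has_bom)
```

Checking the first three bytes of a file is therefore a quick way to tell whether an editor saved it as UTF-8 with a BOM.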

Now your examples. If the encoding is correct in the four files (which means, on Windows Western Europe, Windows-1252), I'd open each one with Notepad, cut and paste how I want, and finally REMOVE the encoding attribute on the <?xml?> processing instruction and SaveAs UTF-8.

It is not a question of tools but of discipline. Moreover, Notepad is free and I use it for all sorts of things as well.
Gertone (Geert Bormans)Information Architect
Top Expert 2006

Commented:
> It is not a question of tools but of discipline

I think that summarises it,
either teach enough background and discipline,
or recommend to use an XML editor
and free ones are available
