PagodNaUtak

Encoding problem with characters... French...

Hi,

Currently the encoding that I use in my XML is
 <?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:resx="resxUri">

But I came across this scenario: instead of "Questo messaggio è la conferma", the output becomes "Questo messaggio Ã¨ la conferma".

Any ideas why? And what is the proper encoding I should use? Your advice is greatly appreciated.

Regards,

Joseph
Web Languages and Standards, XML

Gertone (Geert Bormans)

This simply means that somewhere in your processing chain a UTF-8 encoded character (è takes two bytes in UTF-8) is being interpreted as two single-byte ISO-8859-1 characters.
So somewhere in your chain you pass in UTF-8 encoded XML while stating that it is ISO-8859-1.
It could be that the source is already corrupt.
Are you sure that the source XML is really ISO-8859-1?
You can check that by opening the file in a hex (binary) editor and seeing whether the character is stored as two bytes.
If it is, either a UTF-8 snippet has been introduced into your source and you need to fix that,
or the encoding declared in the XML is wrong.
It helps to view your source in an XML editor to see if the encoding is right (www.oxygenxml.com is a good choice).
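
To make the diagnosis concrete, here is a minimal sketch in Python (not from the original thread; the variable names are only illustrative). It reproduces the symptom by encoding the text as UTF-8 and then wrongly decoding the bytes as ISO-8859-1.

```python
# Reproduce the garbling: UTF-8 bytes read back as ISO-8859-1.
original = "Questo messaggio è la conferma"

utf8_bytes = original.encode("utf-8")        # è becomes the two bytes C3 A8
garbled = utf8_bytes.decode("iso-8859-1")    # each byte is shown as its own character

print(garbled)  # Questo messaggio Ã¨ la conferma

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```

If the garbled text can be repaired this way, that is a strong hint that the bytes were fine and only the declared encoding was wrong.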

Is this "Questo messaggio..." text
introduced in the XSLT?
Your XSLT declares encoding iso-8859-1 as well;
it could be that you pasted text in the wrong encoding into your XSLT, so maybe start by setting that to UTF-8.

Encoding issues are tricky. If the above doesn't help yet, you need to give us more information (maybe attach the source and XSLT and explain how you run the XSLT).
PagodNaUtak

ASKER

Attached here is the XSLT...

Is there anything wrong?
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:resx="resxUri">
  
  <xsl:output indent="yes" />
  <xsl:output method="html"/>
  <xsl:param name="locale"/>

  <xsl:template match="/">
    <xsl:text disable-output-escaping="yes">
      &lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt;
    </xsl:text>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>        
      </head>
      <body>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('GreetingsFromMyCompany', $locale)"/>
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('YourColleague', $locale)"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="emailAColleagueEntity/yourName"/>
          <xsl:value-of select="resx:GetTranslatedValue('EmailAColleaguePar1', $locale)"/>
        </p>
        <p>
          <xsl:value-of select="emailAColleagueEntity/pageTitle"/>
        </p>
        <p>          
          <a>
            <xsl:attribute name="href">
              <xsl:value-of select="emailAColleagueEntity/pageLink"/>                            
            </xsl:attribute>
            <xsl:value-of select="emailAColleagueEntity/pageLink"/>            
          </a>
        </p>
        <p>
          <xsl:value-of select="emailAColleagueEntity/comment"/>
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('EmailAColleaguePar2', $locale)"/>          
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('PleaseDoNotReplyNotAMonitoredAccount', $locale)"/>
        </p>
        <p>
          <xsl:value-of select="resx:GetTranslatedValue('AboutMyCompany', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('CombiningUnparalleled', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('MyCompanyCollaborates', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('ItsHompageIs', $locale)"/>
          <xsl:value-of select="resx:GetTranslatedValue('EmailConfirmSiteName', $locale)"/>
        </p>
        
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

PagodNaUtak

ASKER

I will try your suggestion...
Éric Moreau

BTW, this is not French. It looks something like Spanish to me.
Gertone (Geert Bormans)

It is Italian.
The original poster is not referring to the language of the sentence but to the "è", I believe :-)
PagodNaUtak

ASKER

Yes, actually the problem is that instead of è it becomes Ã.
Gertone (Geert Bormans)

Nope, it becomes Ã AND another character.
The byte shown as Ã tells a UTF-8 interpreter that this byte needs to be interpreted together with the next one,
to form one two-byte character.
That is how UTF-8 works.
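
As an illustration of that lead-byte mechanism, a small Python sketch (not part of the original exchange; names are illustrative) can take the two bytes of è apart and rebuild the code point from their payload bits.

```python
# Inspect how UTF-8 stores è (U+00E8) as a lead byte plus a continuation byte.
encoded = "è".encode("utf-8")
lead, continuation = encoded

lead_bits = format(lead, "08b")                  # 110xxxxx marks a two-byte sequence
continuation_bits = format(continuation, "08b")  # 10xxxxxx marks a continuation byte

# Rebuild the code point from the payload bits of both bytes.
code_point = ((lead & 0b00011111) << 6) | (continuation & 0b00111111)

print(encoded.hex(), lead_bits, continuation_bits, hex(code_point))
```

The lead byte C3 on its own happens to be Ã in ISO-8859-1, which is why that character shows up whenever such a sequence is misread.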

Have you checked the source already?
If not, attach it and I will do it for you.
PagodNaUtak

ASKER

Hi, will there be a problem if the source is encoded as UTF-8 and then, once passed to the XSLT, it is iso-8859-1?

The source is not yet available at the moment... I am working on it...
Gertone (Geert Bormans)

No, you can perfectly well have a source in UTF-8.
Internally the parser converts everything to Unicode, and on serialisation the output is written in whatever encoding you want.
Just set
<xsl:output encoding="iso-8859-1"/>
and the serialiser of your XSLT processor will transform to ISO Latin-1 as you wish.

BUT your C# XSLT transformer could potentially overrule that setting,
so be careful there.
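
The parse-then-serialise idea can be sketched outside XSLT as well. The following Python example is only an analogy (it uses the standard library's ElementTree in place of an XSLT processor, and the `<p>` wrapper element is made up for illustration): a UTF-8 source is parsed, then written out as ISO-8859-1.

```python
import xml.etree.ElementTree as ET

# A UTF-8 source document (hypothetical element, for illustration only).
source = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<p>Questo messaggio è la conferma</p>"
).encode("utf-8")

# The parser decodes the bytes using the declared encoding.
root = ET.fromstring(source)

# Serialise the same tree as ISO-8859-1: è is now the single byte E8,
# and the serialiser writes a matching XML declaration.
latin1_output = ET.tostring(root, encoding="iso-8859-1")

# Reading the ISO-8859-1 output back gives the same text again.
round_tripped = ET.fromstring(latin1_output).text
```

The input and output encodings are independent of each other as long as every step is told truthfully which bytes it is looking at.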
BigRat

>>Your advice is greatly appreciated

As one who uses the French and German versions of Windows XP Pro and who occasionally writes XML data in Russian and Hebrew (appropriate keyboards being installed), I always use Notepad and always store the XML in UTF-8 (which, in capital letters, is the character set name officially registered with IANA).
PagodNaUtak

ASKER

I think the problem is something like this:

The text is originally UTF-8 encoded, then converted to ISO-8859-1, then converted again to UTF-8.

Will there be a problem if I set the encoding from <xsl:output encoding="iso-8859-1"/> to <xsl:output encoding="UTF-8"/>?

Will it still be converted properly?

Gertone (Geert Bormans)

Hi BigRat,
how do you prevent Notepad from using Win-1252 in the background?
Using Notepad on Windows XP Pro with UTF-8 in the XML encoding declaration
does not necessarily force Notepad into storing the characters as UTF-8.

Personally I highly recommend not using non-XML tools for creating XML,
especially because cutting and pasting from various sources always ends up creating encoding deadlocks (because of mixed encodings).
> Will there be a problem if I set the encoding from <xsl:output encoding="iso-8859-1"/> to <xsl:output encoding="UTF-8"/>?

No, not at all; that is the default, by the way.
Any encoding should work. Any XML processing tool should understand the encoding correctly.
Only UTF-8 and UTF-16 are mandated in order to have a conformant parser, but I am not aware of a processor that doesn't understand ISO-8859-1.
PagodNaUtak

ASKER

I ran a test; here is the code. When I convert from the ISO-8859-1 bytes to UTF-8, the text is not generated correctly and one byte is added. So I think the problem is in converting the ISO-8859-1 bytes to UTF-8.

Any ideas?
Dim isoBytes As Byte() = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes("Questo messaggio è la conferma")
Dim utfBytes As Byte() = System.Text.Encoding.Convert(System.Text.Encoding.GetEncoding("ISO-8859-1"), System.Text.Encoding.UTF8, isoBytes)

Dim msg As String = System.Text.Encoding.UTF8.GetString(isoBytes)
MsgBox(msg)
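
A note for readers of the test above: the last step hands isoBytes, not utfBytes, to UTF8.GetString. A rough Python equivalent (a sketch only, assuming the same string; `errors="replace"` is used here to stand in for a decoder that substitutes invalid bytes instead of failing) shows what that does and what decoding the converted bytes gives instead.

```python
text = "Questo messaggio è la conferma"

iso_bytes = text.encode("iso-8859-1")                       # è is the single byte E8
utf_bytes = iso_bytes.decode("iso-8859-1").encode("utf-8")  # proper conversion: è is C3 A8

# What the test does: decode the ISO-8859-1 bytes as if they were UTF-8.
# E8 is not a valid UTF-8 sequence here, so it is replaced by U+FFFD.
wrong = iso_bytes.decode("utf-8", errors="replace")

# Decoding the converted bytes instead gives the text back.
right = utf_bytes.decode("utf-8")
```

The extra byte after conversion is expected: it is the second byte of the two-byte UTF-8 form of è, not a sign of corruption.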

PagodNaUtak

ASKER

Is there any disadvantage if I use UTF-8 instead of ISO-8859-1?
Gertone (Geert Bormans)

Weird, my answer comes prior to your question :-), so swap ...186 and ...182 when reading.

Anyway, don't look at the XSLT for the encoding; you are messing with it in the VB.
My VB is rusty, so I don't necessarily see what you are doing.
Why do you convert at all? Why don't you read it in as UTF-8 directly?
Make the 'è' a 'Ãè' to test.
> Is there any disadvantage if I use UTF-8 instead of ISO-8859-1?

Yes: a bigger character set and no issues with encodings.
UTF-8 is the default character set used in XML, and some tools (sadly) assume that the encoding is UTF-8
and tend to ignore character encoding settings.
So it is generally safer to use UTF-8 when doing XML.

If you want to avoid the encoding issue,
can you try this?
"Questo messaggio &#232; la conferma"
It should work better; this is just for testing what happens.
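
Why the numeric character reference sidesteps the problem can be shown with a short Python sketch (illustrative only; the `<p>` wrapper element is an assumption, not something from the thread). The reference consists of ASCII characters only, so its bytes are identical in ISO-8859-1 and UTF-8, and the XML parser turns it back into è.

```python
import xml.etree.ElementTree as ET

# The reference is plain ASCII, so the bytes are the same in ISO-8859-1 and UTF-8.
snippet = b"<p>Questo messaggio &#232; la conferma</p>"   # hypothetical wrapper element

parsed_text = ET.fromstring(snippet).text   # the parser resolves &#232; to è

# Going the other way: write a non-ASCII character as a numeric reference.
as_reference = "è".encode("ascii", errors="xmlcharrefreplace")
```

If the reference survives the pipeline but a literal è does not, the fault lies in how bytes are decoded somewhere along the chain, not in the XSLT logic.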
Sorry, I misread that follow-up;
I answered "Is there an advantage...".

There is no disadvantage in using UTF-8
(well, each accented character takes two bytes, so the UTF-8 tends to be bigger than the ISO-8859-1,
but that difference is marginal in Italian, so I tend to ignore the size difference).
In my opinion, there are only advantages in going to UTF-8.
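
To put a number on that size difference, a quick Python check (a sketch using the sentence from the question) compares the byte counts of the two encodings.

```python
text = "Questo messaggio è la conferma"

latin1_size = len(text.encode("iso-8859-1"))  # one byte per character
utf8_size = len(text.encode("utf-8"))         # è takes two bytes, everything else one

print(latin1_size, utf8_size)  # 30 31
```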
BigRat

>>how do you prevent Notepad from using Win-1252 in the background?

Notepad is internally Unicode. On saving the file (with SaveAs)  just select  UTF-8 from the encoding. Once saved as such all subsequent edits and saves use UTF-8. I use Lucida Console as font.

>>specially because cutting and pasting from various sources, always ends up creating encoding deadlocks (because of mixed encodings)

Not under Windows, where everything is 16 bit internally. This is a typical problem under Linux where the editors work in some character set and assume the same on pasting.
BigRat

PagodNaUtak: How are you editing or changing the encoding of the file? Under Windows or Linux?
Gertone (Geert Bormans)

@BigRat,

I have had too many experiences in projects that contradict what you say... on Windows, using Notepad.
If you don't select an encoding, Notepad guesses, and guessing can introduce mistakes.
I created example 1 in Notepad on Windows XP Pro, saved it and loaded it in Stylus Studio.
Stylus says it is invalid XML, and Stylus is right.
Hence my recommendation not to use Notepad for XML; there is too much you need to keep in sync.
OK, if you use UTF-8 only and always tell the editor to use UTF-8 encoding, it could work, but I don't consider a double binding requirement in an editor good practice.
I am not disputing that it can be done. I have seen it go wrong too many times with clients to recommend it as good practice
(note that I favour using UTF-8 throughout, as I indicated earlier, but you can't always control what you get).

Your second statement is untrue for sure (or at least incomplete).
If you copy and paste into a text editor from an ISO-8859-1 source,
tell me, which text editor knows about the XML encoding?
Take example 2, add example 3, and you get example 4.
Save it: no longer valid XML.

Stupid? I know, but risky enough not to recommend XML editing in Notepad, unless you are constantly watching your back and really know what you're doing.

I am dealing with lots of customers and data entry people, and half of my income comes from EU-related projects (23 languages including Bulgarian, Greek, ...).
I can tell you that stupid things do happen :)


example 1
---------
<root>
<foo>é</foo>
<euro>€</euro>
</root>

example 2
---------
<?xml version="1.0" encoding="iso-8859-1"?>
<bar>
    <foo>é</foo>
</bar>

example 3
---------
<foo>€</foo>

example 4
---------
<?xml version="1.0" encoding="iso-8859-1"?>
<bar>
    <foo>é</foo>
    <foo>€</foo>
</bar>
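
For anyone who wants to try these examples without Notepad, here is a hedged Python sketch. It assumes that the editor saved example 1 in Windows-1252 (the usual "ANSI" default on Western European Windows; the thread does not state the actual bytes), and it uses a hypothetical helper `is_well_formed` built on the standard library parser.

```python
import xml.etree.ElementTree as ET

def is_well_formed(data: bytes) -> bool:
    """Return True if the XML parser accepts the bytes as they are."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

example1 = "<root>\n<foo>é</foo>\n<euro>€</euro>\n</root>"

# No encoding declaration, so a parser assumes UTF-8.
saved_as_ansi = example1.encode("cp1252")   # assumed editor default, see note above
saved_as_utf8 = example1.encode("utf-8")

# The euro sign does not exist in ISO-8859-1 at all,
# so example 4 cannot be stored truthfully under that declaration.
try:
    "€".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(is_well_formed(saved_as_ansi), is_well_formed(saved_as_utf8), euro_fits_latin1)
```

The same text is accepted or rejected purely depending on which bytes the editor wrote, which is the point of the examples.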

BigRat

Gert: You haven't actually said, in your examples, how the file is encoded. On Windows, Notepad writes a BOM for Unicode and UTF-8, but not for ISO. On Eastern European, Greek and Russian Windows the default is not iso-8859-1 but windows-1251, or windows-greek or windows-russian, so my advice is always to store in UTF-8.

Now your examples. If the encoding is correct in the four files (which on Western European Windows means Windows-1252), I'd open each one with Notepad, cut and paste as I want, and finally REMOVE the encoding attribute on the <?xml?> processing instruction and SaveAs UTF-8.

It is not a question of tools but of discipline. Moreover, Notepad is free and I use it for all sorts of things as well.
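
The BOM mentioned above is easy to inspect. This Python sketch (illustrative, not from the thread) uses the standard `utf-8-sig` codec, which writes the three-byte mark EF BB BF in front of the UTF-8 data and strips it again on reading.

```python
import codecs

text = "<foo>é</foo>"

with_bom = text.encode("utf-8-sig")   # UTF-8 preceded by the byte order mark EF BB BF
without_bom = text.encode("utf-8")

has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding with utf-8-sig strips the mark if present and tolerates its absence.
decoded_a = with_bom.decode("utf-8-sig")
decoded_b = without_bom.decode("utf-8-sig")
```

Checking the first three bytes of a file this way is a quick test of whether an editor really saved it as UTF-8 with a BOM.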
Gertone (Geert Bormans)

> It is not a question of tools but of discipline

I think that summarises it:
either teach enough background and discipline,
or recommend using an XML editor,
and free ones are available.