teufelsdaumen

How can I avoid losing accents from an ISO-8859-1 encoded XML RSS feed when reading it with com.sun.syndication.io.XmlReader? (I'm getting "US-ASCII"...)

Hi All,

I am trying to use ROME (Rss and atOM utilitiEs - https://rome.dev.java.net/) to build a Java program to read an RSS feed that is ISO-8859-1 encoded.

I use com.sun.syndication.io.XmlReader to read the remote file, but all the accents ("´", "`", "^", "~", etc.) are being lost, probably because the encoding is not being properly recognized.

Here is my example code:

...

import java.net.URL;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

            String feed = "http://somedomain/some_rss_feed.xml";
            URL feedUrl = new URL(feed);
            XmlReader reader = new XmlReader(feedUrl);
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed result = input.build(reader);
...

The structure of the RSS feed (which is NOT under my control, so I have no way to correct anything wrong with it) is as follows:

<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="2.0">
<channel>
<title>Some title already wíth ãny àccênts intö it</title>
<link>...</link>
...
<language>pt-br</language>
...

<item>
<title>...</title>
<description>...</description>
<link>...</link>
</item>

<item>
<title>...</title>
<description>...</description>
<link>...</link>
</item>

...

</channel>
</rss>

When variable "reader" gets the result of "new XmlReader(feedUrl)", it already shows me a property named "_encoding" filled with value US-ASCII instead of ISO-8859-1.

And when I inspect the contents of the variable "result", all its attributes are already filled with the values read from the XML feed, but with all my accents already corrupted...


Please help!
YZlat

Did you try using Replace() function?

Replace("US-ASCII","ISO-8859-1")
teufelsdaumen

ASKER

Hello YZlat,

The problem is that changing the encoding attribute inside the "reader" object does not help, since that object is already the result of transferring an HTTP stream coming from the machine serving the feed. To transfer the XML file, the XmlReader class does some "magic" (or "Voodo", according to the ROME website...) to detect the encoding of the file *before* transferring it.

The "US-ASCII" encoding is thus what XmlReader took to be the encoding of the document (so by the time I get the document *it is already "corrupted", since it was transferred as US-ASCII* and not as ISO-8859-1).
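To make the corruption concrete, here is a minimal stand-alone sketch (plain JDK, no ROME involved) of what happens when ISO-8859-1 bytes are decoded as US-ASCII. The class and method names are made up for illustration; the point is only that every byte above 0x7F has no US-ASCII meaning and comes out as the replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class AsciiDecodeDemo {
    // Hypothetical helper: decode bytes with the (wrong) US-ASCII charset.
    // new String(bytes, charset) replaces undecodable bytes with U+FFFD.
    static String decodeAsAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "wíth ãny àccênts", written with Unicode escapes so the source
        // file's own encoding cannot interfere with the demonstration.
        String title = "w\u00edth \u00e3ny \u00e0cc\u00eants";

        // The bytes as an ISO-8859-1 feed would put them on the wire.
        byte[] wire = title.getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsAscii(wire);
        String right = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as US-ASCII:   " + wrong);
        System.out.println("decoded as ISO-8859-1: " + right);
        System.out.println("round trip intact: " + right.equals(title));
    }
}
```

Once the text has been decoded this way the original bytes are gone, which is why changing the encoding attribute on the reader afterwards cannot bring the accents back.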

Here is an excerpt from the ROME website about XmlReader class (https://rome.dev.java.net/apidocs/0_5/com/sun/syndication/io/XmlReader.html):

public class XmlReader
extends java.io.Reader

Character stream that handles (or at least attemtps to) all the necessary Voodo to figure out the charset encoding of the XML document within the stream.

IMPORTANT: This class is not related in any way to the org.xml.sax.XMLReader. This one IS a character stream.

All this has to be done without consuming characters from the stream, if not the XML parser will not recognized the document as a valid XML. This is not 100% true, but it's close enough (UTF-8 BOM is not handled by all parsers right now, XmlReader handles it and things work in all parsers).

The XmlReader class handles the charset encoding of XML documents in Files, raw streams and HTTP streams by offering a wide set of constructors.

By default the charset encoding detection is lenient, the constructor with the lenient flag can be used for an script (following HTTP MIME and XML specifications). All this is nicely explained by Mark Pilgrim in his blog, Determining the character encoding of a feed.

--

I went looking for online RSS validators to see whether the feed had any problems, and found this excellent validator by Mark Pilgrim and Sam Ruby: "http://feedvalidator.org/". It showed me the reason why!!! (Well, now that I have figured out what's happening, I still have to find a way around the problem...)

As I expected, the feed had not one but several problems, most of them related to non-compliance with the DTD. Below is just an excerpt of the reported errors. The first one is the reason why I am getting US-ASCII:

Sorry
This feed does not validate.

Your feed appears to be encoded as "ISO-8859-1", but your server is reporting "US-ASCII" [help]


line 1, column 164: XML parsing error: No declaration for element publisher (2 occurrences) [help]

... tscape.com/publish/formats/rss-0.91.dtd">                                            ^
line 1, column 164: XML parsing error: Element channel content does not follow the DTD, Misplaced publisher [help]

... tscape.com/publish/formats/rss-0.91.dtd">

etc...

--

Looking at the HELP (http://feedvalidator.org/docs/warning/EncodingMismatch.html) I could find the following information:

Message
Your feed appears to be encoded as “foo”, but your server is reporting “bar”

Explanation
The XML appears to be using one encoding, but the HTTP headers from the web server indicate a different charset. Internet standards require that the web server's version takes preference, but many aggregators ignore this. Note that, if you are serving content as 'text/*', then the default charset is US-ASCII, which is probably not what you want. (See RFC 3023 for technical details.)

RSS feeds should be served as application/rss+xml (RSS 1.0 is an RDF format, so it may be served as application/rdf+xml instead). Atom feeds should use application/atom+xml. Alternatively, for compatibility with widely-deployed web browsers, any of these feeds can use one of the more general XML types - preferably application/xml.

Another possible cause is the use of single quotes to delimit the charset parameter in the http header, whereas the http definition of Basic Rules only permits the use of double quotes. The result is somewhat confusing messages such as:

Your feed appears to be encoded as “utf-8”, but your server is reporting “'utf-8'”

Solution
Either ensure that the charset parameter of the HTTP Content-Type header matches the encoding declaration, or ensure that the server makes no claims about the encoding. Serving the feed as application/xml means that the encoding will be taken from the file's declaration.

The W3C has published information on how to set the HTTP charset parameter with various popular web servers.

If you are unable to control your server's charset declaration, Character and Entity References may be used to specify the full range of Unicode characters in an feed served as US-ASCII.

Not clear? Disagree?
Let us know on the feedvalidator-users discussion list!

--
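The precedence rule quoted above can be sketched in a few lines of Java. This is a deliberately simplified reading of the explanation (HTTP charset parameter first, then US-ASCII for text/*, otherwise the XML declaration), not ROME's or the validator's actual logic; it ignores BOMs and other corner cases, and the class and method names are hypothetical.

```java
import java.util.Locale;

public class EffectiveEncoding {
    // Extract the charset parameter from a Content-Type value, or null if absent.
    static String charsetParam(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Only double quotes are valid delimiters here.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    // Simplified rule: the server's charset wins; text/* without a charset
    // means US-ASCII; other XML types fall back to the XML declaration
    // (or UTF-8 when the document declares nothing).
    static String resolve(String contentType, String declaredEncoding) {
        String fromHeader = charsetParam(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        if (mime.startsWith("text/")) {
            return "US-ASCII";
        }
        return declaredEncoding != null ? declaredEncoding : "UTF-8";
    }

    public static void main(String[] args) {
        // Served as text/xml with no charset: the declaration is ignored.
        System.out.println(resolve("text/xml", "ISO-8859-1"));
        // Served as application/xml: the declaration is honoured.
        System.out.println(resolve("application/xml", "ISO-8859-1"));
    }
}
```

Under this reading, a feed served as text/xml without a charset parameter resolves to US-ASCII no matter what its XML declaration says, which matches the validator's warning.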

And going through the "feedvalidator-users" mailing list at SourceForge, I found the following reply from Sam Ruby:

...

> However, some news items have ASCII characters such as the copyright
> symbol, trademark symbol etc. These stop the XML feed from validating,
> and the validator says "Your feed appears to be encoded as "iso-8859-1"
> but your server is reporting "US-ASCII". It sends me to the following
> page: http://feedvalidator.org/docs/warning/EncodingMismatch.html which
> then links to another page of techy stuff, but it is way over my head,
> far too technical for me.

...

2) The message you cited is only a warning
 
3) Adding either or both of these lines to your Apache server config,
    virtual host, directory, or .htaccess files will eliminate this
    warning:
 
      AddCharset iso-8859-1 .xml
      AddType application/xml .xml

--

So the solution for the feed provider is clearly the one above!

I'll try to solve things here though, before contacting folks there asking to change anything regarding the feed... (although I think I will be of help to point their attention to the problems regarding their feed...).

So if anyone has any other suggestions....
I have found the solution myself. Taking the information about setting the content type to "application/xml" into account ("alternatively, for compatibility with widely-deployed web browsers, any of these feeds can use one of the more general XML types - preferably application/xml."), I changed my code from

            XmlReader reader = new XmlReader(feedUrl);

to

            InputStream is = feedUrl.openStream();
            XmlReader reader = new XmlReader(is, "application/xml");

And now XmlReader treats the HTTP stream as being "ISO-8859-1" and the accents are preserved.
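For anyone wondering why this helps: with "application/xml" and no charset, the encoding is taken from the XML declaration itself. The following is only a rough JDK-only illustration of that idea, not ROME's actual implementation; the class name, the 256-byte look-ahead and the UTF-8 fallback are my own assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrologSniffer {
    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Look at the first bytes only; the declaration itself is ASCII-compatible,
    // so reading it as ISO-8859-1 is safe for this purpose.
    static Charset declaredCharset(byte[] xml) {
        int n = Math.min(xml.length, 256);
        String head = new String(xml, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    // Decode the whole document with the charset it declares.
    static String decode(byte[] xml) {
        return new String(xml, declaredCharset(xml));
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<title>w\u00edth \u00e3ny \u00e0cc\u00eants</title>";
        byte[] wire = doc.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(declaredCharset(wire));
        System.out.println(decode(wire).equals(doc));
    }
}
```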

Thanks anyway.