

How can I avoid losing accents from an ISO-8859-1 encoded XML RSS feed when reading it with com.sun.syndication.io.XmlReader? (I'm getting "US-ASCII"...)

Posted on 2005-03-21
Medium Priority
Last Modified: 2013-11-19
Hi All,

I am trying to use ROME (Rss and atOM utilitiEs - https://rome.dev.java.net/) to build a Java program to read an RSS feed that is ISO-8859-1 encoded.

I use com.sun.syndication.io.XmlReader to read the remote file, but all the accents ("´", "`", "^", "~", etc.) are being lost, probably because the encoding is not being properly recognized.

Here is my example code:


import java.net.URL;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

            String feed = "http://somedomain/some_rss_feed.xml";
            URL feedUrl = new URL(feed);
            XmlReader reader = new XmlReader(feedUrl);
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed result = input.build(reader);

The structure of the RSS feed (which is NOT under my control, so I have no way to correct anything wrong with it...) looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="2.0">
<title>Some title already wíth ãny àccênts intö it</title>





When variable "reader" gets the result of "new XmlReader(feedUrl)", it already shows me a property named "_encoding" filled with value US-ASCII instead of ISO-8859-1.

And when I check the contents of the variable "result", all its attributes are already filled with the values read from the XML feed, but with all my accents already corrupted...

Plz help...!
Question by:teufelsdaumen

Expert Comment

ID: 13595351
Did you try using Replace() function?


Author Comment

ID: 13600227
Hello YZlat,

The problem is that changing the attribute that holds the encoding inside the "reader" object does not help, since the object is already the result of transferring an HTTP stream coming from the computer serving the feed. In order to transfer the XML file, the XmlReader class does some "magic" (or "Voodo", according to the ROME website...) to try to detect the encoding of the file *before* transferring it.

So "US-ASCII" is the encoding that XmlReader determined for the document (and when I get the document *it is already "corrupted", since it was transferred as US-ASCII* and not as ISO-8859-1).
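To illustrate the kind of corruption described above, here is a minimal sketch using only the JDK (no ROME). The class name EncodingMismatchDemo and its methods are made up for this example. It shows what happens in plain Java when ISO-8859-1 bytes are decoded as US-ASCII; XmlReader's internals may differ in detail, but the effect on accented characters should be similar.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ROME.
class EncodingMismatchDemo {

    // Decodes the bytes the way a reader would if it believed the stream was US-ASCII.
    public static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // Decodes the same bytes with the encoding the feed actually declares.
    public static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on the source file encoding.
        String original = "w\u00edth \u00e3ny \u00e0cc\u00eants";
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Accents survive when the right charset is used.
        System.out.println(decodeAsLatin1(raw));
        // Every byte above 0x7F is invalid in US-ASCII, so the JDK substitutes U+FFFD for it.
        System.out.println(decodeAsAscii(raw));
    }
}
```

Once the accented bytes have been replaced during decoding, nothing done to the resulting strings afterwards can bring them back, which is why the fix has to happen at the point where the stream is read.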

Here is an excerpt from the ROME website about XmlReader class (https://rome.dev.java.net/apidocs/0_5/com/sun/syndication/io/XmlReader.html):

public class XmlReader
extends java.io.Reader

Character stream that handles (or at least attemtps to) all the necessary Voodo to figure out the charset encoding of the XML document within the stream.

IMPORTANT: This class is not related in any way to the org.xml.sax.XMLReader. This one IS a character stream.

All this has to be done without consuming characters from the stream, if not the XML parser will not recognized the document as a valid XML. This is not 100% true, but it's close enough (UTF-8 BOM is not handled by all parsers right now, XmlReader handles it and things work in all parsers).

The XmlReader class handles the charset encoding of XML documents in Files, raw streams and HTTP streams by offering a wide set of constructors.

By default the charset encoding detection is lenient, the constructor with the lenient flag can be used for an script (following HTTP MIME and XML specifications). All this is nicely explained by Mark Pilgrim in his blog, Determining the character encoding of a feed.


I went searching for online RSS validators to see if the feed had any problems. I found this excellent validator by Mark Pilgrim and Sam Ruby, "http://feedvalidator.org/", and discovered the reason why!!! (Well, now that I have figured out what's happening, I still have to find a way around the problem...)

As I expected, the feed had not one but several problems, most of them regarding non-compliance with the DTD. Below is just an excerpt of the reported errors. The first error is the reason why I am getting US-ASCII:

This feed does not validate.

Your feed appears to be encoded as "ISO-8859-1", but your server is reporting "US-ASCII" [help]

line 1, column 164: XML parsing error: No declaration for element publisher (2 occurrences) [help]

... tscape.com/publish/formats/rss-0.91.dtd">                                            ^
line 1, column 164: XML parsing error: Element channel content does not follow the DTD, Misplaced publisher [help]

... tscape.com/publish/formats/rss-0.91.dtd">



Looking at the help page (http://feedvalidator.org/docs/warning/EncodingMismatch.html), I found the following information:

Your feed appears to be encoded as “foo”, but your server is reporting “bar”

The XML appears to be using one encoding, but the HTTP headers from the web server indicate a different charset. Internet standards require that the web server's version takes preference, but many aggregators ignore this. Note that, if you are serving content as 'text/*', then the default charset is US-ASCII, which is probably not what you want. (See RFC 3023 for technical details.)

RSS feeds should be served as application/rss+xml (RSS 1.0 is an RDF format, so it may be served as application/rdf+xml instead). Atom feeds should use application/atom+xml. Alternatively, for compatibility with widely-deployed web browsers, any of these feeds can use one of the more general XML types - preferably application/xml.

Another possible cause is the use of single quotes to delimit the charset parameter in the http header, whereas the http definition of Basic Rules only permits the use of double quotes. The result is somewhat confusing messages such as:

Your feed appears to be encoded as “utf-8”, but your server is reporting “'utf-8'”

Either ensure that the charset parameter of the HTTP Content-Type header matches the encoding declaration, or ensure that the server makes no claims about the encoding. Serving the feed as application/xml means that the encoding will be taken from the file's declaration.

The W3C has published information on how to set the HTTP charset parameter with various popular web servers.

If you are unable to control your server's charset declaration, Character and Entity References may be used to specify the full range of Unicode characters in an feed served as US-ASCII.
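The precedence rules quoted above can be sketched roughly as follows. This is a simplified illustration, not the validator's or ROME's actual logic: the class EffectiveCharset and its methods are hypothetical, byte order marks are ignored, and the fallback to UTF-8 when nothing is declared is an assumption based on the XML default.

```java
import java.util.Locale;

// Hypothetical helper, written only to illustrate the rules quoted above.
class EffectiveCharset {

    // Returns the charset a strict consumer would use for an XML document served over HTTP.
    public static String resolve(String contentType, String declaredEncoding) {
        String header = contentType == null ? "" : contentType;
        String mime = header.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);

        String fromHeader = charsetParam(header);
        if (fromHeader != null) {
            return fromHeader;              // the server's charset parameter takes preference
        }
        if (mime.startsWith("text/")) {
            return "US-ASCII";              // default for text/* when no charset is given
        }
        // e.g. application/xml: fall back to the XML declaration (UTF-8 if absent, by assumption)
        return declaredEncoding != null ? declaredEncoding : "UTF-8";
    }

    // Extracts the charset parameter, stripping double quotes only.
    public static String charsetParam(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;               // single quotes are left in place on purpose
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/xml", "ISO-8859-1"));
        System.out.println(resolve("application/xml", "ISO-8859-1"));
        System.out.println(resolve("text/xml; charset='utf-8'", "utf-8"));
    }
}
```

With these rules, a feed served as text/xml without a charset parameter resolves to US-ASCII no matter what its XML declaration says, while the same feed served as application/xml resolves to the declared ISO-8859-1. The single-quoted case keeps its quotes, which matches the confusing message mentioned in the help text.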

Not clear? Disagree?
Let us know on the feedvalidator-users discussion list!


And going through the "feedvalidator-users" mailing list at SourceForge, I found the following reply from Sam Ruby:


> However, some news items have ASCII characters such as the copyright
 > symbol, trademark symbol etc. These stop the XML feed from validating,
 > and the validator says "Your feed appears to be encoded as "iso-8859-1"
 > but your server is reporting "US-ASCII". It sends me to the following
 > page: http://feedvalidator.org/docs/warning/EncodingMismatch.html which
 > then links to another page of techy stuff, but it is way over my head,
 > far too technical for me.


2) The message you cited is only a warning
3) Adding either or both of these lines to your Apache server config,
    virtual host, directory, or .htaccess files will eliminate this
      AddCharset iso-8859-1 .xml
      AddType application/xml .xml


So the solution for the feed provider is clearly the one above!

I'll try to solve things here though, before contacting the folks there to ask them to change anything in the feed... (although I think it would help to point out the problems with their feed...).

So if anyone has any other suggestions....

Author Comment

ID: 13600415
I have found the solution myself. Taking the information about setting the content type to "application/xml" into account ("alternatively, for compatibility with widely-deployed web browsers, any of these feeds can use one of the more general XML types - preferably application/xml."), I changed my code from

            XmlReader reader = new XmlReader(feedUrl);

to

            InputStream is = feedUrl.openStream();
            XmlReader reader = new XmlReader(is, "application/xml");

And now XmlReader treats the HTTP stream as being "ISO-8859-1" and the accents are preserved.
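The reason this works appears to be that, once the stream is treated as "application/xml" with no charset parameter, the encoding is taken from the XML declaration in the file itself. The sketch below shows the same principle using only the JDK's DOM parser on raw bytes. It does not use ROME, the class DeclarationEncodingDemo is made up for the example, and the sample document is a shortened stand-in for the real feed (without the DOCTYPE, so nothing is fetched from the network).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of ROME.
class DeclarationEncodingDemo {

    // Parses raw bytes; with no external charset hint the parser reads the XML declaration.
    public static String readTitle(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep the example independent of the source file encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<rss version=\"2.0\"><channel>"
                + "<title>w\u00edth \u00e3ny \u00e0cc\u00eants</title>"
                + "</channel></rss>";

        // Encode as ISO-8859-1, the way the server would send the file on the wire.
        byte[] raw = xml.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(readTitle(new ByteArrayInputStream(raw)));
    }
}
```

The point is that handing over the raw byte stream, rather than a stream already labelled with the server's charset, lets the declared ISO-8859-1 take effect.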

Thanks anyway.

Accepted Solution

OzzMod earned 0 total points
ID: 13638497
Closed, 500 points refunded.
Community Support Moderator (Graveyard shift)
