Solved

How not to lose accents from an XML RSS feed that is ISO-8859-1 encoded when reading with com.sun.syndication.io.XmlReader? (I'm getting "US-ASCII"...)

Posted on 2005-03-21
Last Modified: 2013-11-19
Hi All,

I am trying to use ROME (Rss and atOM utilitiEs - https://rome.dev.java.net/) to build a Java program to read an RSS feed that is ISO-8859-1 encoded.

I use com.sun.syndication.io.XmlReader to read the remote file, but all the accents ("´", "`", "^", "~", etc.) are being lost, probably because the encoding is not being properly recognized.

Here is my example code:

...

import java.net.URL;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

            String feed = "http://somedomain/some_rss_feed.xml";
            URL feedUrl = new URL(feed);
            XmlReader reader = new XmlReader(feedUrl);
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed result = input.build(reader);
...

The structure of the RSS feed (which is NOT under my control, so I have no way to fix anything wrong with it...) is as follows:

<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="2.0">
<channel>
<title>Some title already wíth ãny àccênts intö it</title>
<link>...</link>
...
<language>pt-br</language>
...

<item>
<title>...</title>
<description>...</description>
<link>...</link>
</item>

<item>
<title>...</title>
<description>...</description>
<link>...</link>
</item>

...

</channel>
</rss>

When variable "reader" gets the result of "new XmlReader(feedUrl)", it already shows me a property named "_encoding" filled with value US-ASCII instead of ISO-8859-1.

And when I check the contents of the variable "result", all its attributes are already filled with the values read from the XML feed, but with all my accents already corrupted...
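To make the symptom concrete, here is a minimal stdlib-only sketch (it does not use ROME) of what happens when ISO-8859-1 bytes are decoded as US-ASCII. The class and method names are made up for illustration, and the exact replacement character that ROME ends up showing may differ from what the plain JDK decoder produces here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AccentLossDemo {

    // Encodes the text as ISO-8859-1 (as the feed does on the wire),
    // then decodes those bytes with the given charset.
    static String decodeLatin1BytesAs(String text, Charset decodeWith) {
        byte[] wire = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, decodeWith);
    }

    public static void main(String[] args) {
        String title = "w\u00EDth \u00E3ny"; // "wíth ãny"
        // Right charset: the accents survive.
        System.out.println(decodeLatin1BytesAs(title, StandardCharsets.ISO_8859_1));
        // Wrong charset: every byte above 0x7F becomes U+FFFD.
        System.out.println(decodeLatin1BytesAs(title, StandardCharsets.US_ASCII));
    }
}
```

Once the text has been decoded with the wrong charset, the original bytes are gone, which is why patching the encoding attribute afterwards cannot bring the accents back.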


Please help!
Question by:teufelsdaumen
Expert Comment

by:YZlat
ID: 13595351
Did you try using Replace() function?

Replace("US-ASCII","ISO-8859-1")

Author Comment

by:teufelsdaumen
ID: 13600227
Hello YZlat,

The problem is that changing the attribute that holds the encoding inside the "reader" object does not help. That object is already the result of reading an HTTP stream coming from the computer serving the feed, and in order to read the XML file, the XmlReader class does some "magic" (or "Voodo", according to the ROME website...) to detect the encoding of the file *before* reading its contents.

"US-ASCII" is thus the encoding that XmlReader concluded the document has, so by the time I get the document *it is already "corrupted", since it was read as US-ASCII* and not as ISO-8859-1.

Here is an excerpt from the ROME website about XmlReader class (https://rome.dev.java.net/apidocs/0_5/com/sun/syndication/io/XmlReader.html):

public class XmlReader
extends java.io.Reader

Character stream that handles (or at least attempts to) all the necessary Voodo to figure out the charset encoding of the XML document within the stream.

IMPORTANT: This class is not related in any way to the org.xml.sax.XMLReader. This one IS a character stream.

All this has to be done without consuming characters from the stream, if not the XML parser will not recognized the document as a valid XML. This is not 100% true, but it's close enough (UTF-8 BOM is not handled by all parsers right now, XmlReader handles it and things work in all parsers).

The XmlReader class handles the charset encoding of XML documents in Files, raw streams and HTTP streams by offering a wide set of constructors.

By default the charset encoding detection is lenient; the constructor with the lenient flag can be used for strict detection (following the HTTP MIME and XML specifications). All this is nicely explained by Mark Pilgrim in his blog, Determining the character encoding of a feed.

--

I ended up searching for online RSS validators to see if the feed had any problems. I found this excellent validator by Mark Pilgrim and Sam Ruby, "http://feedvalidator.org/", and discovered the reason why!!! (Well, now that I have figured out what's happening, I still have to find a way around the problem...)

As I expected, the feed had not one but several problems, most of them regarding non-compliance with the DTD. What follows is just an excerpt of the reported errors. The first error is the reason why I am getting US-ASCII:

Sorry
This feed does not validate.

Your feed appears to be encoded as "ISO-8859-1", but your server is reporting "US-ASCII" [help]


line 1, column 164: XML parsing error: No declaration for element publisher (2 occurrences) [help]

... tscape.com/publish/formats/rss-0.91.dtd">                                            ^
line 1, column 164: XML parsing error: Element channel content does not follow the DTD, Misplaced publisher [help]

... tscape.com/publish/formats/rss-0.91.dtd">

etc...

--

Looking at the HELP (http://feedvalidator.org/docs/warning/EncodingMismatch.html) I could find the following information:

Message
Your feed appears to be encoded as “foo”, but your server is reporting “bar”

Explanation
The XML appears to be using one encoding, but the HTTP headers from the web server indicate a different charset. Internet standards require that the web server's version takes preference, but many aggregators ignore this. Note that, if you are serving content as 'text/*', then the default charset is US-ASCII, which is probably not what you want. (See RFC 3023 for technical details.)

RSS feeds should be served as application/rss+xml (RSS 1.0 is an RDF format, so it may be served as application/rdf+xml instead). Atom feeds should use application/atom+xml. Alternatively, for compatibility with widely-deployed web browsers, any of these feeds can use one of the more general XML types - preferably application/xml.

Another possible cause is the use of single quotes to delimit the charset parameter in the http header, whereas the http definition of Basic Rules only permits the use of double quotes. The result is somewhat confusing messages such as:

Your feed appears to be encoded as “utf-8”, but your server is reporting “'utf-8'”

Solution
Either ensure that the charset parameter of the HTTP Content-Type header matches the encoding declaration, or ensure that the server makes no claims about the encoding. Serving the feed as application/xml means that the encoding will be taken from the file's declaration.

The W3C has published information on how to set the HTTP charset parameter with various popular web servers.

If you are unable to control your server's charset declaration, Character and Entity References may be used to specify the full range of Unicode characters in an feed served as US-ASCII.

Not clear? Disagree?
Let us know on the feedvalidator-users discussion list!
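
The precedence rules described in that help page can be sketched in a few lines of Java. This is only a simplified reading of the quoted explanation, not ROME's actual detection logic: the class and method names are hypothetical, byte order marks are ignored, and the UTF-8 fallback for a document with no declaration is the XML default rather than something stated in the help text.

```java
import java.util.Locale;

public class FeedCharsetRules {

    // contentType: raw HTTP Content-Type header value (may be null).
    // xmlDeclEncoding: encoding named in the XML declaration (may be null).
    static String effectiveEncoding(String contentType, String xmlDeclEncoding) {
        String mime = "";
        String charset = null;
        if (contentType != null) {
            String[] parts = contentType.split(";");
            mime = parts[0].trim().toLowerCase(Locale.ROOT);
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    // Only double quotes are stripped; single quotes stay part
                    // of the value, as in the validator message quoted above.
                    charset = p.substring("charset=".length()).trim().replace("\"", "");
                }
            }
        }
        if (charset != null && !charset.isEmpty()) {
            return charset;          // the server's charset takes precedence
        }
        if (mime.startsWith("text/")) {
            return "US-ASCII";       // default for text/* without a charset
        }
        if (xmlDeclEncoding != null) {
            return xmlDeclEncoding;  // e.g. application/xml: use the declaration
        }
        return "UTF-8";              // XML default when nothing is declared
    }

    public static void main(String[] args) {
        System.out.println(effectiveEncoding("text/xml", "ISO-8859-1"));
        System.out.println(effectiveEncoding("application/xml", "ISO-8859-1"));
    }
}
```

With a feed declared as ISO-8859-1, the first call in main corresponds to my situation (served as text/* with no charset) and the second to the same feed served as application/xml.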

--

And going through the "feedvalidator-users" mailing list at SourceForge, I found the following reply by Sam Ruby:

...

> However, some news items have ASCII characters such as the copyright
> symbol, trademark symbol etc. These stop the XML feed from validating,
> and the validator says "Your feed appears to be encoded as "iso-8859-1"
> but your server is reporting "US-ASCII". It sends me to the following
> page: http://feedvalidator.org/docs/warning/EncodingMismatch.html which
> then links to another page of techy stuff, but it is way over my head,
> far too technical for me.

...

2) The message you cited is only a warning
 
3) Adding either or both of these lines to your Apache server config,
    virtual host, directory, or .htaccess files will eliminate this
    warning:
 
      AddCharset iso-8859-1 .xml
      AddType application/xml .xml

--

So, for the feed provider, the solution is clearly the one above!

I'll try to solve things on my side first, though, before contacting the folks there to ask them to change anything about the feed... (although I think it would help to draw their attention to the problems with their feed...).

So if anyone has any other suggestions....

Author Comment

by:teufelsdaumen
ID: 13600415
I have found the solution myself. Taking the information about setting the content type to "application/xml" into account ("alternatively, for compatibility with widely-deployed web browsers, any of these feeds can use one of the more general XML types - preferably application/xml."), I changed my code from

            XmlReader reader = new XmlReader(feedUrl);

to

            InputStream is = feedUrl.openStream();
            XmlReader reader = new XmlReader(is, "application/xml");

And now XmlReader treats the HTTP stream as being "ISO-8859-1" and the accents are preserved.
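
For anyone who wants to see what "take the encoding from the file's declaration" amounts to without ROME, here is a rough stdlib-only sketch. It is not how XmlReader is implemented; the class name, method names and the 200-byte prolog limit are my own assumptions, and it only works for ASCII-compatible encodings (no UTF-16, no BOM handling).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclaredEncodingReader {

    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Reads the whole stream into memory (fine for a small feed).
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Decodes the document with the charset named in its own XML declaration,
    // falling back to UTF-8 when no encoding is declared.
    static String decodeUsingDeclaration(byte[] bytes) {
        // ISO-8859-1 maps every byte to a char, so it is safe for peeking
        // at the ASCII-only declaration at the start of the document.
        int prologLength = Math.min(bytes.length, 200);
        String prolog = new String(bytes, 0, prologLength, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        Charset charset = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
        return new String(bytes, charset);
    }
}
```

With the feed above, something like decodeUsingDeclaration(readAll(feedUrl.openStream())) should give the document text with the accents intact, regardless of what the server's Content-Type header claims.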

Thanks anyway.

Accepted Solution

by:
OzzMod earned 0 total points
ID: 13638497
Closed, 500 points refunded.
OzzMod
Community Support Moderator (Graveyard shift)