Unsure how to handle utf and cp437 encoding in web page / script

Posted on 2006-05-25
Medium Priority
Last Modified: 2012-06-22
Hello Experts!

I'm writing a Ruby script that will pull text from an XML web page, modify it, and save it in ebook form. Unfortunately, some of the text, primarily international characters, gets changed from one thing to another.

I'm doing it in Ruby (1.8), and I use both a Mac and Linux for the Ruby client, so my problem doesn't seem to be platform dependent.

The website is wikitravel.org, and an example of the text I'm having problems with is "gîtes d'étape". Ruby can handle this easily in irb:
> puts "gîtes d'étape"
gîtes d'étape

But, if I access the same text on a web page,
>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)

the output (not print/puts, just the scrolling text) from irb is "g\303\256tes d'\303\251tape" (my quotes).

When I tell irb to print what it got, I get
>print text
gîtes d'étape

(If it helps, I have the Ethereal output from this session.)

I tried saving the XML pages onto my Apple laptop's Apache server and accessing them through http://localhost, but I got the same results. After some googling, I think this is from cp437 encoding. To test this, I made a web page with the single word "Iñtërnâtiônàlizætiøn", as seen here - http://intertwingly.net/stories/2004/04/14/i18n.html. Using the same Ruby script:

>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)
>print text
Iñtërnâtiônà lizætiøn

What I would like is for Ruby to output the text, handling international characters correctly. Also, if this question is in the wrong section (I'm not too sure what section to put it in), please tell me.

Thanks for your help!
Question by:stoniergrunow
LVL 27

Expert Comment

ID: 16768596
wikitravel.org is returning the text in UTF-8, which is the default encoding for XML. One normally loads the XML into a DOM object and extracts the data using the DOM's methods, such as SelectNode. This is because the encoding, if it is not the default, is declared in the <?xml version="1.0" encoding="..."?> header, which the load process sees and handles.
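
To illustrate the UTF-8 point: the octal escapes that irb printed, \303\256 and \303\251, are the two-byte UTF-8 sequences for î (U+00EE) and é (U+00E9). The following is a minimal sketch that assumes only the core String#unpack and Array#pack methods; the variable names are illustrative and do not come from the original script.

```ruby
# encoding: utf-8

# The raw bytes as received over HTTP, written with the same octal escapes irb showed.
raw = "g\303\256tes d'\303\251tape"

# View the string as plain bytes: \303\256 is 0xC3 0xAE, the UTF-8 form of U+00EE.
bytes = raw.unpack("C*")

# Decode the same bytes as UTF-8 into Unicode code points.
codepoints = raw.unpack("U*")

puts bytes.length        # number of bytes
puts codepoints.length   # number of characters, smaller because two characters take two bytes each

# Re-encode the code points as UTF-8; a UTF-8 terminal displays this correctly.
puts codepoints.pack("U*")
```

Whether the printed result looks right depends on the terminal: one that interprets output as UTF-8 shows the accented letters, while one set to a single-byte code page shows two unrelated characters for each of them.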

I don't know Ruby, but I suspect that there IS an XML DOM object which you could use. Ruby probably uses an 8-bit ANSI encoding whose exact code page defaults to that of the platform on which it runs, probably 437 or 850 (they're very similar). The XML DOM would handle these conversions for you. Otherwise one has to start using the Windows API functions MultiByteToWideChar and WideCharToMultiByte to do the conversion.
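
As a rough illustration of the DOM approach described above: Ruby ships with REXML (as a bundled gem in recent versions), which reads the encoding from the XML declaration while parsing. This is a sketch under assumptions. The inline document and its page/para element names are made up for the example and are not wikitravel.org's actual format.

```ruby
# encoding: utf-8
require 'rexml/document'

# Hypothetical stand-in for the body fetched over HTTP; the element names are invented.
xml = %(<?xml version="1.0" encoding="UTF-8"?>\n<page><para>gîtes d'étape</para></page>)

# The parser picks up the encoding named in the XML declaration.
doc = REXML::Document.new(xml)

# Pull the text out through the DOM instead of scanning the raw bytes.
extracted = REXML::XPath.first(doc, "/page/para").text

puts extracted
```

REXML hands back element text as UTF-8, so the script still has to write to a terminal or file that expects UTF-8 for the characters to display correctly.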

Author Comment

ID: 16773965
Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

I found this on Google about Rails, but it might apply equally to Ruby:

By default, Rails isn’t set up to handle non-English characters.
Here’s what you’ll need to do to make it work:
1. Add the following to your config/environment.rb file:
$KCODE = 'u'
require 'jcode'
This sets Ruby’s character encoding to UTF-8.

Will this help me? I tried adding the above two lines to my method, but no luck.

LVL 27

Expert Comment

ID: 16783151
>>Hmm, not to sure about DOM - isn't it pretty large for just a single read of the text?

To an extent, yes, but the XML format and all its character encoding options are handled properly by a DOM, which would make it work for all XML sources. On Windows, the MS XML COM object loads around 800K from a DLL.

Author Comment

ID: 16839587
Well, I've got it. Perhaps not the best solution, but it works. If anyone else is having this problem, the code you can use is


it's right there in the reference, http://www.rubycentral.com/ref/ref_c_string.html

Accepted Solution

ee_ai_construct earned 0 total points
ID: 16877938
Closed, 500 points refunded.
Community Support Moderator
replacement part #xm34
