Unsure how to handle UTF-8 and cp437 encoding in a web page / script

Hello Experts!

I'm writing a Ruby script that will pull text from an XML web page, modify it, and save it in ebook form. Unfortunately, some of the text, primarily international characters, gets changed from one thing to another.

I'm doing it in Ruby (1.8), and use both a Mac and Linux for the Ruby client (so my problem doesn't seem to be platform dependent).

The website is wikitravel.org, and an example of the text I'm having problems with is "gîtes d'étape". Ruby can handle this easily in irb:
> puts "gîtes d'étape"
gîtes d'étape

But, if I access the same text on a web page,
>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)

the output (not print/puts, just the scrolling text) from irb is "g\303\256tes d'\303\251tape" (my quotes).

When I tell irb to print what it got, I get
>print text
gîtes d'étape

(if it helps, I have the ethereal output from this session).
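
As a hedged illustration of what the escaped output above means: the octal escapes irb shows are the raw bytes of the string, and they match the two-byte UTF-8 sequences for î and é. The sketch below uses a hard-coded sample string (an assumption standing in for the downloaded text) and only String#unpack, which exists in Ruby 1.8 as well as later versions.

```ruby
# Sample standing in for the text returned by Net::HTTP (hard-coded assumption).
sample = "g\303\256tes d'\303\251tape"

# "C*" reads the string as raw bytes, one unsigned 8-bit integer per byte.
bytes = sample.unpack("C*")

# Bytes above 127 belong to multi-byte UTF-8 sequences; show them in octal,
# the same notation irb uses for its escaped output.
high_bytes_octal = bytes.select { |b| b > 127 }.map { |b| "%o" % b }

# "U*" decodes the string as UTF-8, one integer per Unicode code point.
codepoints = sample.unpack("U*")

puts high_bytes_octal.join(" ")   # 303 256 303 251
puts bytes.size                   # 15
puts codepoints.size              # 13
```

Fifteen bytes decoding cleanly to thirteen code points suggests the text arrived as UTF-8, and that any garbling happens later, when those bytes are displayed or re-interpreted in another code page.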

I tried saving the XML pages onto my Apple laptop's Apache server and accessing them through http://localhost, but I got the same results. After some googling, I think this is from cp437 encoding. To test this, I made a web page with the single word "Iñtërnâtiônàlizætiøn", as seen here - http://intertwingly.net/stories/2004/04/14/i18n.html. Using the same Ruby script

>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)
>print text
Iñtërnâtiônà lizætiøn

What I would like is for Ruby to output the text, handling international characters correctly. Also, if this question is in the wrong section (not too sure what section to put it in), please tell me.

Thanks for your help!
stoniergrunow Asked:
 
BigRat Commented:
wikitravel.org is returning the text in UTF-8, which is the default encoding for XML. One normally loads the XML into a DOM object and extracts the data using the DOM's methods, such as SelectNode. This is because the encoding, if it is not the default, is declared in the <?xml version="1.0" encoding="..."?> header, which the load process reads and handles.

I don't know Ruby, but I suspect that there IS an XML DOM object which you could use. Ruby probably uses 8-bit ANSI, whose exact encoding defaults to that of the platform on which it runs, which is probably 437 or 850 (they're very similar). The XML DOM would handle these conversions for you. Otherwise one has to start using the Windows API functions MultiByteToWideChar and WideCharToMultiByte to do the conversion.
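
In Ruby, one candidate for such a DOM-style parser is REXML, which shipped with the Ruby 1.8 standard library (in recent Ruby versions it is a separate bundled gem). A minimal sketch, assuming a made-up document with the hypothetical element names `page` and `body` rather than wikitravel's real markup:

```ruby
require 'rexml/document'

# Made-up XML standing in for the downloaded page; the element names
# "page" and "body" are hypothetical, not wikitravel's actual structure.
xml = %(<?xml version="1.0" encoding="UTF-8"?><page><body>gîtes d'étape</body></page>)

# The parser reads the encoding from the XML declaration while loading.
doc = REXML::Document.new(xml)

# Pull the text of one element out of the tree by its path.
body_text = doc.elements['page/body'].text

puts body_text
```

Whether the printed text then looks right still depends on the terminal's own encoding, which the parser cannot control.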
 
stoniergrunow (Author) Commented:
Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

I found this on Google about Rails, but it might apply equally to plain Ruby:

By default, Rails isn’t set up to handle non-English characters.
Here’s what you’ll need to do to make it work:
1. Add the following to your config/environment.rb file:
$KCODE = 'u'
require 'jcode'
This sets Ruby’s character encoding to UTF-8.

Will this help me? I tried adding the above two lines to my method, but no luck.

Thanks
 
BigRat Commented:
>>Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

To an extent yes, but the XML format and all its character encoding options are handled properly by a DOM, which would make it work for all XML sources. On Windows the MS XML COM object loads around 800K from a DLL.
 
stoniergrunow (Author) Commented:
Well, I've got it. Perhaps not the best solution, but it works. If anyone else is having this problem, the code you can use is

text.unpack("U*").pack("C*")

It's right there in the String reference: http://www.rubycentral.com/ref/ref_c_string.html
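
To make clear what that one-liner does, here is a small sketch on a hard-coded sample string (an assumption, not the actual downloaded data). `unpack("U*")` decodes the UTF-8 bytes into code points, and `pack("C*")` writes each code point back out as a single byte, so the result is effectively ISO-8859-1 (Latin-1). That only round-trips for code points up to 255; characters beyond that range would not survive.

```ruby
# UTF-8 input: î and é each take two bytes (hard-coded sample).
utf8 = "g\303\256tes d'\303\251tape"

# Decode UTF-8 into code points, then emit one byte per code point.
single_byte = utf8.unpack("U*").pack("C*")

# Compare the raw byte counts before and after the conversion.
puts utf8.unpack("C*").size          # 15
puts single_byte.unpack("C*").size   # 13

# î is now the single byte 238 (0xEE) and é is 233 (0xE9).
puts single_byte.unpack("C*").inspect
```

This helps when whatever consumes the text expects a single-byte Latin-1 encoding; if the consumer expects UTF-8, the original bytes were presumably already correct and should be left alone.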
Question has a verified solution.
