stoniergrunow

Unsure how to handle utf and cp437 encoding in web page / script

Hello Experts!

I'm writing a Ruby script that will pull text from an XML web page, modify it, and save it in ebook form. Unfortunately, some of the text, primarily international characters, gets changed from one thing to another.

I'm doing it in Ruby (1.8), and I use both a Mac and Linux for the Ruby client, so the problem doesn't seem to be platform dependent.

The website is wikitravel.org, and an example of the text I'm having problems with is "gîtes d'étape". Ruby can handle this easily in irb:
> puts "gîtes d'étape"
gîtes d'étape

But, if I access the same text on a web page,
>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)

the output (not print/puts, just the scrolling text) from irb is "g\303\256tes d'\303\251tape" (my quotes).

When I tell irb to print what it got, I get
>print text
gîtes d'étape

(If it helps, I have the Ethereal output from this session.)
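
For anyone puzzled by the backslash numbers above: \303\256 and \303\251 are octal escapes for the two-byte UTF-8 sequences of î and é, so the bytes that arrived are plain UTF-8. A minimal sketch of that, assuming Ruby 1.9 or later (not the 1.8 used in the question) so that strings carry an encoding tag; the variable names are only for illustration:

```ruby
# Assumes Ruby 1.9 or later; Ruby 1.8 strings carry no encoding tag.

# The bytes irb displayed, written with the same octal escapes.
raw = "g\303\256tes d'\303\251tape".dup.force_encoding("UTF-8")

puts raw.bytesize   # number of bytes received
puts raw.length     # number of characters once decoded as UTF-8
puts raw == "g\u00EEtes d'\u00E9tape"   # same text as the literal typed in irb

# The same bytes misread as a single-byte encoding turn into mojibake.
as_latin1 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cp437  = raw.dup.force_encoding("IBM437").encode("UTF-8")
puts as_latin1
puts as_cp437
```

If the terminal shows something like the last two lines, the data is probably fine and it is the display (or a later conversion step) that is assuming a single-byte encoding.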

I tried saving the XML pages onto my Apple laptop's Apache server and accessing them through http://localhost, but I got the same results. After some googling, I think this is caused by cp437 encoding. To test this, I made a web page with the single word "Iñtërnâtiônàlizætiøn", as seen here - http://intertwingly.net/stories/2004/04/14/i18n.html. Using the same Ruby script

>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)
>print text
Iñtërnâtiônà lizætiøn

What I would like is for Ruby to output the text, handling international characters correctly. Also, if this question is in the wrong section (I'm not too sure what section to put it in), please tell me.

Thanks for your help!
BigRat

wikitravel.org is returning the text in UTF-8, which is the default encoding for XML. One normally loads the XML into a DOM object and extracts the data using the DOM's methods, such as SelectNode. This is because the encoding, if it is not the default, is declared in the <?xml version="1.0" encoding="..."?> header, which the load process sees and handles.
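
A rough sketch of the DOM approach in Ruby, using the REXML library (part of the standard distribution in Ruby 1.8, a bundled gem in recent versions). The XML string, the element names `page` and `para`, and the variable names are made up for illustration; the real body would come from Net::HTTP as in the question. The escapes \u00EE and \u00E9 assume Ruby 1.9 or later.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the body fetched over HTTP.
xml = %(<?xml version="1.0" encoding="UTF-8"?>\n<page><para>g\u00EEtes d'\u00E9tape</para></page>)

# Parse the document and pull the text out through the DOM
# instead of slicing the raw bytes by hand.
doc  = REXML::Document.new(xml)
text = doc.elements['page/para'].text

puts text            # the extracted phrase
puts text.encoding   # encoding tag of the extracted string
puts text.length     # characters, not bytes
```

The point of going through the parser is that the extracted string comes back as characters in a known encoding, so any later conversion for output can be done explicitly.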

I don't know Ruby, but I suspect that there IS an XML DOM object which you could use. Ruby probably uses 8-bit ANSI whose exact encoding defaults to the platform on which it runs, which is probably 437 or 850 (they're very similar). The XML DOM would handle these conversions for you. Otherwise one has to start using the Windows API functions MultiByteToWideChar and WideCharToMultiByte to do the conversion.
stoniergrunow

ASKER

Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

I found this on google about rails, but it might apply to ruby equally:

By default, Rails isn’t set up to handle non-English characters.
Here’s what you’ll need to do to make it work:
1. Add the following to your config/environment.rb file:
$KCODE = 'u'
require 'jcode'
This sets Ruby’s character encoding to UTF-8.

Will this help me? I tried adding the above two lines to my method, but no luck.

thanks
>>Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

To an extent yes, but the XML format and all its character encoding options are handled properly by a DOM, which would make it work for all XML sources. On Windows the MS XML COM object loads around 800K from a DLL.

Well, I've got it. Perhaps not the best, but it works. If anyone else runs into this problem, the code you can use is

text.unpack("U*").pack("C*")

It's right there in the reference: http://www.rubycentral.com/ref/ref_c_string.html
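
What that one-liner does, as far as I can tell: unpack("U*") decodes the UTF-8 bytes into an array of code points, and pack("C*") writes each code point back out as a single byte. The result is effectively Latin-1 (ISO-8859-1), which is why it only suits text whose characters all have code points below 256. A small sketch comparing it with String#encode, assuming Ruby 1.9 or later (String#encode does not exist in 1.8); the variable names are illustrative:

```ruby
utf8 = "g\u00EEtes d'\u00E9tape"

# The one-liner from the thread: UTF-8 -> code points -> one byte each.
latin1_bytes = utf8.unpack("U*").pack("C*")

# Explicit conversion to the same target encoding (Ruby 1.9+).
encoded = utf8.encode("ISO-8859-1")

puts latin1_bytes.bytes == encoded.bytes   # both produce the same bytes
puts latin1_bytes.bytesize                 # one byte per character now

# A character outside Latin-1, such as the euro sign, has no single-byte
# form there; String#encode reports that instead of writing a wrong byte.
begin
  "\u20AC".encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  puts "not representable: #{e.message}"
end
```

So the trick is fine for French or Spanish text headed to a Latin-1 display, but it is a conversion to one particular single-byte encoding rather than a general fix.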