Unsure how to handle utf and cp437 encoding in web page / script

Posted on 2006-05-25
Last Modified: 2012-06-22
Hello Experts!

I'm writing a Ruby script that will pull text from an XML web page, modify it, and save it to ebook form. Unfortunately, some of the text, primarily international characters, gets changed from one thing to another.

I'm doing it in Ruby (1.8), and I run the Ruby client on both a Mac and Linux, so my problem doesn't seem to be platform dependent.

The website is, and an example of the text I'm having problems with is "gîtes d'étape". Ruby can handle this easily in irb:
> puts "gîtes d'étape"
gîtes d'étape

But, if I access the same text on a web page,
>require 'net/http'
>resp, text =

the output (not print/puts, just the scrolling text) from irb is "g\303\256tes d'\303\251tape" (my quotes).

When I tell irb to print what it got, I get
>print text
gîtes d'étape

(if it helps, I have the ethereal output from this session).

I tried saving the XML pages onto my Apple laptop's Apache server and accessing them through http://localhost, but I got the same results. After some googling, I think this is from cp437 encoding. To test this, I made a web page with the single word "Iñtërnâtiônàlizætiøn", as seen here - Using the same ruby script

>require 'net/http'
>resp, text =
>print text
Iñtërnâtiônà lizætiøn

What I would like is for Ruby to output the text, handling international characters correctly. Also, if this question is in the wrong section (I'm not too sure what section to put it in), please tell me.

Thanks for your help!
Question by:stoniergrunow

    Expert Comment

    by:BigRat

    The site is returning the text in UTF-8 format, which is the default encoding for XML. One normally loads the XML into a DOM object and extracts the data by using the DOM's methods like SelectNode. This is because the encoding, if not defaulted, is in the <?xml version="1.0" encoding="..."?> header, which the load process sees and handles.
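One way to check that the escaped bytes irb echoed really are UTF-8 is to decode them by hand. This is only a minimal sketch using String#unpack from core Ruby; the helper name utf8_code_points is made up for illustration and does not come from the thread.

```ruby
# Decode a byte string as UTF-8 and return its Unicode code points.
def utf8_code_points(bytes)
  bytes.unpack("U*")
end

# The string exactly as irb echoed it, written with octal byte escapes.
raw = "g\303\256tes d'\303\251tape"

points = utf8_code_points(raw)
puts raw.unpack("C*").length   # number of bytes: 15
puts points.length             # number of characters: 13
puts "U+%04X" % points[1]      # second character: U+00EE (i with circumflex)
puts "U+%04X" % points[8]      # ninth character: U+00E9 (e with acute)
```

Each accented letter takes two bytes, which is why the escaped form shows two octal escapes per letter.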

    I don't know Ruby, but I suspect that there IS an XMLDOM object which you could use. Ruby probably uses 8-bit ANSI whose exact encoding defaults to the platform on which it runs, which is probably 437 or 850 (they're very similar). The XML DOM would handle these conversions for you. Otherwise one has to start using the Windows API MultiByteToWideChar and WideCharToMultiByte to do the conversion.
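For what it's worth, a by-hand conversion does not need the Windows API in Ruby. The sketch below assumes the target is ISO-8859-1 (Latin-1), which is only a guess, since the thread never establishes which encoding the output side expects. The method names utf8_to_latin1 and latin1_to_utf8 are hypothetical, and only String#unpack and Array#pack from core Ruby are used.

```ruby
# Re-encode UTF-8 bytes as ISO-8859-1 (Latin-1) by hand.
# unpack("U*") decodes UTF-8 into code points; pack("C*") writes each
# code point back as a single byte, which is only valid below 256.
def utf8_to_latin1(utf8_bytes)
  code_points = utf8_bytes.unpack("U*")
  too_big = code_points.find { |cp| cp > 255 }
  raise ArgumentError, "U+%04X has no Latin-1 equivalent" % too_big if too_big
  code_points.pack("C*")
end

# The reverse direction: every Latin-1 byte value is also its code point.
def latin1_to_utf8(latin1_bytes)
  latin1_bytes.unpack("C*").pack("U*")
end

utf8 = "g\303\256tes d'\303\251tape"   # the bytes irb displayed
latin1 = utf8_to_latin1(utf8)
puts utf8.unpack("C*").length          # 15 bytes in UTF-8
puts latin1.unpack("C*").length        # 13 bytes in Latin-1
```

This only covers characters that exist in Latin-1; anything above U+00FF raises an error rather than being silently mangled.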

    Author Comment

    Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

    I found this on google about rails, but it might apply to ruby equally:

    By default, Rails isn’t set up to handle non-English characters.
    Here’s what you’ll need to do to make it work:
    1. Add the following to your config/environment.rb file:
    $KCODE = 'u'
    require 'jcode'
    This sets Ruby’s character encoding to UTF-8.

    Will this help me? I tried adding the above two lines to my method, but no luck.

    Expert Comment

    >>Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?

    To an extent yes, but the XML format and all its character encoding options are handled properly by a DOM, which would make it work for all XML sources. On Windows the MS XML COM object loads around 800K from a DLL.
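In Ruby, the DOM approach could look roughly like the following. This is a sketch that assumes REXML, the XML library that shipped with Ruby's standard library at the time. The helper first_text, the sample document, and its element names are invented for illustration and are not taken from the site in the question.

```ruby
require 'rexml/document'

# Parse an XML string and return the text of the first element
# matching the given XPath, or nil if nothing matches.
def first_text(xml, xpath)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

# Hypothetical document; the element names are made up for this example.
xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" \
      "<item><description>g\303\256tes d'\303\251tape</description></item>"

text = first_text(xml, "//description")
puts text.unpack("U*").length   # character count of the extracted text: 13
```

The parser reads the encoding from the XML declaration, so the calling code never has to guess it.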

    Author Comment

    Well, I've got it. Perhaps not the best, but it works. If anyone else is having problems with this solution, the code you can use is


    it's right there in the reference,

    Accepted Solution

    Closed, 500 points refunded.
    Community Support Moderator
