stoniergrunow
asked on
Unsure how to handle utf and cp437 encoding in web page / script
Hello Experts!
I'm writing a Ruby script that will pull text from an XML web page, modify it, and save it to ebook form. Unfortunately, some of the text, primarily international characters, gets changed from one thing to another.
I'm doing it in Ruby (1.8), and I use both a Mac and Linux for the Ruby client (so my problem doesn't seem to be platform dependent).
The website is wikitravel.org, and an example of the text I'm having problems with is "gîtes d'étape". Ruby can handle this easily in irb:
> puts "gîtes d'étape"
gîtes d'étape
But, if I access the same text on a web page,
>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)
the output (not print/puts, just the scrolling text) from irb is "g\303\256tes d'\303\251tape" (my quotes).
When I tell irb to print what it got, I get
>print text
gîtes d'étape
(if it helps, I have the ethereal output from this session).
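A possible reading of the irb output above: the `\303\256` and `\303\251` sequences are octal escapes for the two UTF-8 bytes that encode î and é, which would mean the download itself arrived intact. The sketch below checks that idea. `octal_escapes` is a hypothetical helper written for illustration, and the sample string is built from code points so it does not depend on the source file's encoding.

```ruby
# Hypothetical helper (not part of Ruby or Net::HTTP): returns the
# non-ASCII bytes of a string as irb-style octal escapes.
def octal_escapes(str)
  str.unpack("C*").select { |byte| byte > 127 }.map { |byte| format("\\%o", byte) }
end

# "gîtes d'étape" as UTF-8, built from the code points U+00EE and U+00E9.
sample = "g" + [0xEE].pack("U") + "tes d'" + [0xE9].pack("U") + "tape"

escapes = octal_escapes(sample)
puts escapes.join   # prints \303\256\303\251
```

If that matches what irb shows, the bytes are correct UTF-8 and the trouble is more likely in how they are displayed or converted afterwards.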
I tried saving the XML pages onto my Apple laptop's Apache server and accessing them through http://localhost, but I got the same results. After some googling, I think this is from cp437 encoding. To test this, I made a web page with the single word "Iñtërnâtiônàlizætiøn", as seen here - http://intertwingly.net/stories/2004/04/14/i18n.html. Using the same Ruby script
>require 'net/http'
>resp, text = Net::HTTP.new(url).get(url_page)
>print text
Iñtërnâtiônà lizætiøn
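The garbled output looks like UTF-8 bytes being shown one byte per character, the way a Latin-1 or Windows-1252 display would, rather than a cp437 conversion (under cp437 the same bytes would mostly show up as box-drawing and punctuation symbols). This is an interpretation, not something confirmed in the thread. The sketch below reproduces the effect; `misread_as_latin1` is a hypothetical helper name.

```ruby
# Hypothetical helper: re-reads every UTF-8 byte as if it were a single
# Latin-1 character, then re-encodes the result as UTF-8 for display.
def misread_as_latin1(utf8_text)
  utf8_text.unpack("C*").pack("U*")
end

# "gîtes" as UTF-8; the î (U+00EE) is stored as the two bytes 0xC3 0xAE.
word = "g" + [0xEE].pack("U") + "tes"

garbled = misread_as_latin1(word)
puts garbled   # the î appears as the two characters U+00C3 and U+00AE
```

The two characters that replace î are exactly the Latin-1 characters for bytes 0xC3 and 0xAE, which matches the pattern in the printed text.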
What I would like is for Ruby to output the text, handling international characters correctly. Also, if this question is in the wrong section (not too sure what section to put it in), please tell me.
Thanks for your help!
ASKER
Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?
I found this on google about rails, but it might apply to ruby equally:
By default, Rails isn’t set up to handle non-English characters.
Here’s what you’ll need to do to make it work:
1. Add the following to your config/environment.rb file:
$KCODE = 'u'
require 'jcode'
This sets Ruby’s character encoding to UTF-8.
Will this help me? I tried adding the above two lines to my method, but no luck.
thanks
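As far as I know, on Ruby 1.8 `$KCODE = 'u'` only changes how Ruby interprets bytes that are already in a string (regular expressions, `inspect`, the `jcode` helpers). It does not convert text from one encoding to another, which may be why adding it had no visible effect. The sketch below does not use `$KCODE` or `jcode`, since later Ruby versions dropped or ignore them; it uses `unpack` to show the difference between counting bytes and counting characters in the same unchanged string.

```ruby
# "gîtes" as UTF-8, built from code points so the example does not
# depend on the encoding of this source file.
text = [0x67, 0xEE, 0x74, 0x65, 0x73].pack("U*")

# "C*" reads raw bytes; "U*" decodes UTF-8 sequences into code points.
byte_count = text.unpack("C*").length
char_count = text.unpack("U*").length

puts "bytes: #{byte_count}, characters: #{char_count}"
```

The bytes are identical in both cases; only the interpretation differs. Producing different bytes requires an explicit conversion step.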
>>Hmm, not too sure about DOM - isn't it pretty large for just a single read of the text?
To an extent yes, but the XML format and all its character encoding options are handled properly by a DOM, which would make it work for all XML sources. On Windows the MS XML COM object loads around 800K from a DLL.
ASKER
Well, I've got it. Perhaps not the best, but it works. If anyone else is having this problem, the code you can use is
text.unpack("U*").pack("C*")
It's right there in the reference: http://www.rubycentral.com/ref/ref_c_string.html
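For anyone reusing this: `unpack("U*")` decodes the UTF-8 text into Unicode code points, and `pack("C*")` writes each code point back as a single byte, which amounts to a UTF-8 to Latin-1 conversion. A single byte cannot hold a code point above 255, so the sketch below adds a guard. `utf8_to_latin1` is a hypothetical wrapper name and the guard is my addition, not part of the original one-liner.

```ruby
# Hypothetical wrapper around the one-liner above. Converts UTF-8 text to
# single-byte Latin-1 and refuses characters that do not fit in one byte.
def utf8_to_latin1(text)
  codepoints = text.unpack("U*")
  too_big = codepoints.select { |cp| cp > 255 }
  unless too_big.empty?
    raise ArgumentError, "code points outside Latin-1: #{too_big.inspect}"
  end
  codepoints.pack("C*")
end

# "gîtes" as UTF-8 (6 bytes) becomes 5 Latin-1 bytes.
utf8 = [0x67, 0xEE, 0x74, 0x65, 0x73].pack("U*")
latin1 = utf8_to_latin1(utf8)
puts latin1.unpack("C*").inspect
```

This only helps when the destination really expects Latin-1. On Ruby 1.9 and later, `String#encode` should cover the same conversion.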
ASKER CERTIFIED SOLUTION
I don't know Ruby, but I suspect that there IS an XML DOM object which you could use. Ruby probably uses 8-bit ANSI whose exact encoding defaults to the platform on which it runs, which is probably 437 or 850 (they're very similar). The XML DOM would handle these conversions for you. Otherwise one has to start using the Windows API MultiByteToWideChar and WideCharToMultiByte to do the conversion.