How to get the charset through MSXML2.ServerXMLHTTP

How do I return the appropriate charset if one is not defined in the Content-Type value returned by getResponseHeader? In the example below the charset is not returned by the page headers.

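<%
url = "http://www.embalgeria.nl/Contact.htm"
Set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
xmlhttp.open "GET", url, False
xmlhttp.send

' Content-Type looks like "text/html; charset=utf-8", or just "text/html"
contentType = xmlhttp.getResponseHeader("Content-Type")
mycharset = ""
pos = InStr(1, contentType, "charset=", vbTextCompare)
If pos > 0 Then
    ' Keep only the value that follows "charset="
    mycharset = Trim(Mid(contentType, pos + Len("charset=")))
Else
    ' Find the correct charset
End If

Response.Write mycharset
%>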


Nebukad Asked:
 
BigRat Commented:
>>By default the charset is set to ISO-8859-1 because no charset is set in the response header, correct?

Correct, that is the default. It is also the standard.

BUT

You'll find some web servers that don't conform to the rules, mostly in Russia, Greece, the Arab world and China/Japan/Korea. Their pages carry a <META> HTML tag with a charset that is omitted from the header and differs from ISO-8859-1 (normally, of course, a Russian, Greek, Arabic or Big-5 character set).

I'd use this rule:
   1) charset in the header -> extract it and use it as the initial setting
   2) charset not in the header -> start with ISO-8859-1
   3) META tag with a charset (the equivalent of the Content-Type header) -> override the setting with that
   4) decode the page with the resulting charset.
  Note that numeric entities are always to be interpreted as Unicode code points, whatever the charset. This is also a problem, since Netscape used to interpret them in the selected charset and you'll find Russian sites still relying on that.
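
Steps 1 to 3 of this rule could be sketched in VBScript roughly as follows. This is only an illustration: the function names CharsetFromContentType and DetectCharset are made up for this example, the regular expressions are a simplification rather than a full HTML parser, and step 4 (actually decoding the bytes) is not covered.

```vbscript
' Hypothetical helper: returns the value of the charset parameter in a
' Content-Type style string, or "" when there is none.
Function CharsetFromContentType(contentType)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True
    ' Optional quote, then the charset token itself
    re.Pattern = "charset\s*=\s*[""']?([A-Za-z0-9._:\-]+)"
    Set matches = re.Execute(contentType)
    If matches.Count > 0 Then
        CharsetFromContentType = matches(0).SubMatches(0)
    Else
        CharsetFromContentType = ""
    End If
End Function

' Hypothetical helper applying steps 1 to 3:
' header value, else ISO-8859-1, then override from the first META tag
' that carries a charset.
Function DetectCharset(headerValue, html)
    Dim result, re, matches
    result = CharsetFromContentType(headerValue)      ' step 1
    If result = "" Then result = "ISO-8859-1"         ' step 2

    Set re = New RegExp
    re.IgnoreCase = True
    ' Simplified: matches <meta charset=...> and the http-equiv form
    re.Pattern = "<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9._:\-]+)"
    Set matches = re.Execute(html)
    If matches.Count > 0 Then result = matches(0).SubMatches(0)   ' step 3
    DetectCharset = result
End Function
```

With the xmlhttp object from the question, a call might look like DetectCharset(xmlhttp.getResponseHeader("Content-Type"), xmlhttp.responseText). Bear in mind that responseText has already been decoded by MSXML, so this assumes the ASCII-range markup around the META tag is still readable there.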

HTH
 
BigRat Commented:
>>How do I return the appropriate charset if one is not defined in the getResponseHeader

Then it depends on the Content-Type: for example, with text/html it is ISO-8859-1, and with application/xml (and sometimes text/xml) it is UTF-8. If it is image/* then there isn't one.
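
As a rough sketch, that mapping could be written as a small VBScript function. The name DefaultCharsetFor is made up for this example, and treating text/xml the same as application/xml is a simplification of the "sometimes" above.

```vbscript
' Hypothetical helper: default charset for a Content-Type value that
' carries no charset parameter. Returns "" when no charset applies.
Function DefaultCharsetFor(contentType)
    Dim mediaType, semi
    mediaType = LCase(Trim(contentType))
    ' Drop any parameters such as "; boundary=..."
    semi = InStr(mediaType, ";")
    If semi > 0 Then mediaType = Trim(Left(mediaType, semi - 1))

    Select Case mediaType
        Case "text/html"
            DefaultCharsetFor = "ISO-8859-1"
        Case "application/xml", "text/xml"
            ' Assumption: text/xml is handled like application/xml here
            DefaultCharsetFor = "UTF-8"
        Case Else
            ' e.g. image/*: no charset at all
            DefaultCharsetFor = ""
    End Select
End Function
```

For a crawler that only handles text/html, application/xml and text/xml, the empty result can double as a signal to skip the response.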

What content does your URL return?
 
Nebukad (Author) Commented:
In my example the page headers return:

HTTP Status Code: HTTP/1.1 200 OK
....................
Content-Type: text/html
....................

By default the charset is set to ISO-8859-1 because no charset is set in the response header, correct?

NB: Since my application only crawls web pages, I only need the cases you already mentioned in your response (text/html, application/xml and text/xml).
 
Nebukad (Author) Commented:
Thanks for the response and the explanation on how webservers deal with charsets.
Question has a verified solution.
