Skyruner2

Missing Characters after Umlaut (ä,ö,ü) in HttpRequest.responseText

Hi, I'm using an RSS Content Feed class ( http://www.tele-pro.co.uk/scripts/rss/rss_content_feed_class.htm ) and noticed that some RSS feeds gave me a syntax error while validating via the MS parser ( MSXML2.DOMDocument.loadXML ).
After a bit of research I quickly noticed that the problem was with the umlauts.
Finally I simply read out http.responseText and noticed that following an umlaut the next 6 characters were missing.

thus out of the original
 <title>Kunst: Komprimiertes Glücksgefühl</title>
 <description>

resulted:
 <title>Kunst: Komprimiertes Gl?f?itle>
 <description>

(taken from http://newsfeed.zeit.de/ )

I thus get the following error: "GetRSS: XML error: End tag 'item' does not match the start tag 'title'. Parse Error line 207, character 7", since the title end tag is missing.

So my question: how do I need to adjust my HttpRequest object in order to get a correct responseText?

Skyruner2

ASKER

it is urgent.
fritz_the_blank
You will have to encode this:

http://www.raok.ee/programming/index.php?id=4

FtB
Someone here had the same issue with solutions suggested:

http://www.pmachine.com/forums/viewthread/13747/

FtB
OK, so it is definitely a problem with the encoding.
Either my HttpRequest object OR the XML-supplying server has a wrong setting.

Since I cannot change the setting (or the supplied XML) on the server (I could send the guys an e-mail, but the server/XML source is not under my control), I will have to work with the settings of the HttpRequest object.

So the problem is reduced to a setting of the HttpRequest object. Well, is it possible to override the character encoding? If yes, how?

~Sky


P.S.: I did not find anything after two quick Google searches for "overwrite DOM Http Requst encoding" or similar terms.
If you load the XML that gives me the error in Firefox, or in MS XML Notepad, it works just fine. If I take the XML as a string received by my request object, then I get an error.

 http://newsfeed.zeit.de/

www.heute.de (look for news feed (alternate tag in html head))
Okay, I can do this without error:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title> New Document </title>
<%
Function GetHTML(strURL)
      Dim objXMLHTTP, strReturn
      Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
      objXMLHTTP.Open "GET", strURL, false
      objXMLHTTP.Send
      strReturn = objXMLHTTP.responseText
      Set objXMLHTTP = Nothing
      GetHTML = strReturn
End Function
%>
</head>

<body>
<%
strFile=gethtml("http://newsfeed.zeit.de/")
response.write strFile
%>
</body>
</html>
SOLUTION
fritz_the_blank
This solution is only available to members.
I stripped my code of all possible post data, and here is what i am now using:

'Retrieve response and return HTML response body
Public Function XmlHttp(xAction, data, hdrs)
  Dim HTTP, Raw
  Set Http = CreateObject("MSXML2.ServerXMLHTTP")
  'MSXML2.XMLHTTP
 
   Http.open "GET", xAction, FALSE

  Http.send '(data)
  Raw = "_"
  response.Write(server.HTMLEncode(http.responseText))
  Set Http = Nothing
  XmlHttp = Raw
End Function


still gets me:



 <title>Kunst: Komprimiertes Gl&#65535;f&#65535;itle>  <description>Wie Ulrich und Sylvia Str&#65535; urpl&#65535;ich zu de

Thus still an error (see the missing </title> tag).

I am not using any XSL file. What do you mean by "set that on your .asp page as the encoding type"? Do you mean what I suggested before: overriding the (wrong) encoding setting of the HttpRequest object?
Hmm, the &#-character codes above actually appear on my side as missing-character symbols, not as the codes seen above (they were probably changed during submission of the message).

So I notice the feed does have a header which identifies the encoding, but the HTTP object does not seem to care much for it!


<?xml version="1.0" encoding="ISO-8859-1"?>
I found what seems to be a very useful text at http://www.devguru.com/Technologies/xmldom/quickref/httpRequest_send.html

If I understood correctly, the data type passed to the .send( ) method will determine the encoding of the returned data.

"[the] acceptable input types are BSTR, SAFEARRAY of UI1 (unsigned bytes), IDispatch to an XMLDOM object, and IStream.
[...]
If the input type is an XMLDOM object, the response is encoded according to the encoding attribute on the '<?' xml declaration in the document. If there is no xml declaration, an encoding attribute of UTF-8 is assumed.

If the input type is IStream, the response is sent as is, without additional encoding. The caller must set a Content-Type header with the appropriate content type."

So it looks like I'm stuck with either of those. While looking up "IStream" (I still don't know what it is), I found this:

" HttpRequest.responseStream

The responseStream property is read-only and represents the response entity body as an IStream. This stream returns the raw, uncoded bytes as received from the server. So, depending on the server, this may appear as binary-encoded data (UTF-8, UCS-2, UCS-4, shiftJis, etc).
" at http://www.devguru.com/technologies/xmldom/quickref/httprequest_responsestream.html

I'm surprised that the RSS feed specifies ISO-8859-1 as the encoding. I would've thought that UTF-8 would be more appropriate (but what do I know....??).

You can override this programmatically in the XML parser.

After you load the XML into your DOM object you can use the createProcessingInstruction method to create a new <?xml?> node with the appropriate encoding type.

Assuming your XML DOM object is called xmlDOM:

' Load the RSS XML first. Then...
' Create the processing instruction node
set pi = xmlDOM.createProcessingInstruction("xml", "version=""1.0"" encoding=""UTF-8""")
' Append the node to the beginning of the document
xmlDOM.insertBefore pi, xmlDOM.childNodes.item(0)

But the problem occurs while loading the XML!

The responseText of the HttpRequest object returns a string decoded as UTF-8, though the data is actually ISO-8859-1. Thus after every non-UTF-8 character (ä, ü, ö, for example) the following 6 characters are missing.

Would the original <?xml?> tag be overwritten, and would the raw XML string be converted back to normal?


Just to clear it up again:

Server-supplied XML (ISO-8859-1) ---loaded into---> HttpRequestObject.ResponseText (encoding ignored and set to UTF-8) ---loaded into---> DOMDocument (still in the wrong UTF-8 encoding) -> hairball (since 6 characters after every ä, ö, ü are "removed" due to the wrong encoding, vital data such as closing tags is missing, hence the parser hairball).
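
One commonly used way around this kind of mismatch, sketched here under assumptions, is to skip responseText and decode the raw bytes from responseBody with an ADODB.Stream whose Charset is set explicitly. The helper names BytesToText and FetchAs are hypothetical (they are not part of the RSS class), and the charset is passed in by hand to match the ISO-8859-1 declared in the feed:

```vbscript
<%
' Hypothetical helper: turn a raw byte array into a string using a named charset.
Function BytesToText(bytes, charsetName)
    Dim stm
    Set stm = Server.CreateObject("ADODB.Stream")
    stm.Type = 1                  ' adTypeBinary: write the bytes untouched
    stm.Open
    stm.Write bytes
    stm.Position = 0              ' Type may only be changed at position 0
    stm.Type = 2                  ' adTypeText: read back as text
    stm.Charset = charsetName     ' e.g. "iso-8859-1"
    BytesToText = stm.ReadText
    stm.Close
    Set stm = Nothing
End Function

' Hypothetical helper: fetch a URL and decode the body with the given charset
' instead of relying on the UTF-8 default used by responseText.
Function FetchAs(strURL, charsetName)
    Dim http
    Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
    http.Open "GET", strURL, False
    http.Send
    FetchAs = BytesToText(http.responseBody, charsetName)
    Set http = Nothing
End Function

Response.Write Server.HTMLEncode(FetchAs("http://newsfeed.zeit.de/", "iso-8859-1"))
%>
```

This only addresses how the bytes are turned into a string; whether the result then loads cleanly with loadXML would still need to be checked against the parser's return value.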

ASKER CERTIFIED SOLUTION
This solution is only available to members.
" HttpRequest.responseText

The responseText property is read-only and represents the response entity body as a string. The XMLHTTP object tries to decode the response into a unicode string, assuming a default encoding of UTF-8. However, it can decode any type of UCS-2 (big or little endian) or UCS-4 encoding as long as the server sends the appropriate unicode byte order mark. It does not process the XML '<?' coding declaration."

OK, great, but how do I tell it which encoding it is?!
@deighc

Good point. I am simply working with the code from  http://www.tele-pro.co.uk/scripts/rss/rss_content_feed_class.htm

It could be that way so it can handle all of the different "versions" of RSS and RDF feeds?! I will try loading it directly into the XML DOM.
It might be because of the way the feed cache works in this thing... dang, looks like I'll end up rewriting most of this code.
I misunderstood what you meant by HTTPRequest. You meant XMLHTTP.

This is an OK way to do things (but it still probably makes sense to load the xml directly into the XML DOM).

But regardless of how you load the XML into your DOM object (load(), loadXML()) ALWAYS check the return value to determine whether or not the XML is well formed.

This will eliminate the possiblity that the RSS feed itself is somehow mis-functioning.
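
A minimal sketch of that check, assuming the feed is loaded straight from its URL into a DOM object named xmlDOM. The ServerHTTPRequest property is an assumption about the setup: it is intended for server-side loading over HTTP and needs MSXML 3.0 or later.

```vbscript
<%
Dim xmlDOM
Set xmlDOM = Server.CreateObject("MSXML2.DOMDocument")
xmlDOM.async = False                           ' wait until loading has finished
xmlDOM.setProperty "ServerHTTPRequest", True   ' assumption: server-safe HTTP loading

' load() returns False when the document could not be fetched or is not well formed
If Not xmlDOM.load("http://newsfeed.zeit.de/") Then
    Response.Write "XML error: " & Server.HTMLEncode(xmlDOM.parseError.reason) & _
        " (line " & xmlDOM.parseError.line & _
        ", position " & xmlDOM.parseError.linepos & ")"
Else
    Response.Write "Loaded root element: " & xmlDOM.documentElement.nodeName
End If
%>
```

The same parseError object is filled in after loadXML(), so the check looks identical for either loading method.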
Great, looks like this is working... just to sum up: it's actually a workaround for getting the right data from the XMLHTTP, but it works.

I just implemented a quick and dirty solution, and I am not sure if everything works out correctly... if you don't mind, I would like to keep this open until I have fully changed the code (at the latest tomorrow, 12th of July, around noon Central European Time).
I'm curious to know what you mean by "a work around".

You shouldn't need any quick and dirty fixes. MSXML is designed from the ground up to work correctly with all character sets and encoding types. I've used it many times with German character sets and never had problems.

But hey, if it works use it....
Sorry, I am back now, but too late I see?

FtB
Well, it's a workaround since I am no longer using the XMLHTTP. The original question was (or rather, was established to be) how to get the responseText property of the request object to return the results in the correct encoding. Now I no longer use the XMLHTTP.

as to "Quick and Dirty" fix:

right now the code reads: Res.load (xml_URL)

it used to be: res.loadXML getResults(xml_URL)


Now, I did not write the code, and the getResults(URL) function did some work with the XML besides just checking whether a cached file was available. So I have to see if any changes made in this function are vital for the script to function with other XML sources, and I have to re-implement caching the source for X hours.
Since I do not have the time to do that now, I'd like to delay accepting your answer until I have done this work, either tonight or tomorrow morning, so I can be certain this workaround (as described above) is working, or whether I still want to fall back on using the XMLHTTP, at which point I'd be stuck with my problem again....

Don't worry, I'm not trying to delay giving out points; I simply want to be sure the solution works 100%.
Take your time, I'm not after points - I'm just curious to know what works and what doesn't.

And thanks for the more detailed explanation of the class. It makes more sense to me now to know that it uses a caching mechanism.

But if, for your purposes, you find that you can only get the class to work by loading the XML directly into a DOM instead of an XMLHTTP object it should still be quite simple to modify the class code to work properly with the cache.

My guess is that the cache stores a copy of the XML as a file somewhere on the filesystem and checks a timestamp.

You can use the Save method of the XML DOM to save the RSS XML to a file so this should  be quite simple to implement.
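
As a rough sketch of that idea (not the class's actual cache logic, which I have only guessed at): check the cache file's timestamp, load from disk if it is fresh enough, otherwise fetch the feed and write it back with save. The function name LoadFeedCached and the cachePath and maxAgeHours parameters are all hypothetical.

```vbscript
<%
' Hypothetical helper: return a DOM loaded from a fresh cache file if one exists,
' otherwise from the feed URL (refreshing the cache file on success).
Function LoadFeedCached(feedURL, cachePath, maxAgeHours)
    Dim fso, dom, useCache
    Set fso = Server.CreateObject("Scripting.FileSystemObject")
    Set dom = Server.CreateObject("MSXML2.DOMDocument")
    dom.async = False

    useCache = False
    If fso.FileExists(cachePath) Then
        ' Cache is considered fresh while it is younger than maxAgeHours
        useCache = (DateDiff("h", fso.GetFile(cachePath).DateLastModified, Now) < maxAgeHours)
    End If

    If useCache Then
        dom.load cachePath
    Else
        dom.setProperty "ServerHTTPRequest", True   ' assumption: server-side HTTP load
        If dom.load(feedURL) Then dom.save cachePath
    End If

    Set fso = Nothing
    Set LoadFeedCached = dom   ' caller should still check dom.parseError.errorCode
End Function
%>
```

A caller might use something like `Set Res = LoadFeedCached(xml_URL, Server.MapPath("cache/zeit.xml"), 2)`, where the cache path is again only an example and the web server account needs write access to that folder.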
right... you can get full info @  http://www.tele-pro.co.uk/scripts/rss/rss_content_feed_class.htm ...

it is easy to do, but im leaving for holidays on thursday, so i have lots of other things to do ;)!
I had a quick look at the class.

It's hard to say exactly how many different ways the XMLHTTP functionality is used. If all you want to do is request a URL that returns XML then loading it directly into an XML DOM will work fine.

The only time you have to use XMLHTTP is if you need more control over the HTTP request. XMLHTTP lets you control various timeouts and (more importantly) set request headers and send data (XML, binary etc.) in the request body.
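
As a rough illustration of that extra control, here is a sketch using ServerXMLHTTP. The timeout values and the User-Agent string are made-up examples, not recommendations.

```vbscript
<%
Dim http
Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")

' Timeouts in milliseconds: resolve, connect, send, receive.
' setTimeouts is called before Open.
http.setTimeouts 5000, 5000, 10000, 10000

http.Open "GET", "http://newsfeed.zeit.de/", False

' Request headers can only be set after Open and before Send
http.setRequestHeader "User-Agent", "ExampleFeedReader/1.0"   ' example value
http.Send

If http.status = 200 Then
    Response.Write "Content-Type: " & Server.HTMLEncode(http.getResponseHeader("Content-Type"))
Else
    Response.Write "Request failed with status " & http.status
End If

Set http = Nothing
%>
```

For plain GET requests against public feeds none of this is required, which is why loading directly into the DOM is usually enough.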

I **think** in your case loading the data directly into an XML DOM should work OK. But I understand why you'd want to thoroughly test this first.
Yes, I saw that defining headers was possible... all I currently want to do is load publicly provided RSS feeds from various news sites, which should not require any special header data...
It seems to be working all right, but even when I load from cache it seems a bit slow:

http://the-n-e-x-u-s.net:8080/newsfeeds.asp

I turned the response buffer off, so you can see it build up the RSS feeds
hmm woops looks like it actually builds much faster with the response buffer on!.. sorry..