There is no ASCII 133 - ASCII is only 7 bits.
In utf-8, the codepoint 133 (U+0085) should be encoded as two octets: 0xc2 0x85.
Could it be that you have only one octet, 0x85, and hence invalid utf-8?
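One way to check which case applies is sketched below. It assumes the mbstring extension is available; the sample strings are made up for illustration and are not the asker's data.

```php
<?php
// Requires the mbstring extension.

// U+0085 correctly encoded as UTF-8 is the two-octet sequence C2 85.
$twoOctets = "\xC2\x85";
// A lone 0x85 octet is a continuation byte with no lead byte: invalid UTF-8.
$oneOctet = "\x85";

var_dump(mb_check_encoding($twoOctets, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($oneOctet, 'UTF-8'));  // bool(false)

// Dumping the octets around a suspect character shows which form is present.
echo bin2hex("text\x85for"), "\n"; // 7465787485666f72
```

If the hex dump shows 85 on its own rather than c285, the string is not utf-8 at that point.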
I'm trying to load a utf-8 xml string using the PHP function SimpleXML_Load_String, but it fails and errors out when it finds a special character in the string (contained in some description fields), e.g. ASCII 133, which is 3 dots (...), and ASCII 147, which appears to be double quotes.
How can I either strip out problem characters (characters outside the allowed ASCII range) or allow their import?
Yes the field data is enclosed within CDATA tags.
I understood that ASCII 133 was in the extended character set. The character is listed in the third table down at http://www.idevelopment.in
It still looks like the input string is iso8859-1 rather than utf-8 (and I can't find a function utf8_compliant at www.php.net).
Give
SimpleXML_Load_String( utf8_encode($data), ...)
a try.
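A caveat worth keeping in mind: utf8_encode treats its input as ISO-8859-1, so a raw 0x85 byte comes out as U+0085 rather than as an ellipsis. The sketch below compares that reading with a Windows-1252 reading, using mb_convert_encoding (mbstring assumed) and a single made-up byte in place of the real data.

```php
<?php
// Requires mbstring. The byte 0x85 stands in for the problem character.
$byte = "\x85";

// ISO-8859-1 reading (what utf8_encode assumes): 0x85 becomes U+0085.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
// Windows-1252 reading: 0x85 becomes U+2026 HORIZONTAL ELLIPSIS.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c285
echo bin2hex($asCp1252), "\n"; // e280a6
```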
Ah sorry, utf8_compliant is a function I picked up off the net as shown below.
Using utf8_encode($data) made no difference.
An example of data from the xml file that is causing the problem is shown below the function in the code box below; the data has been reduced to only include the sentence with the offending character, which appears as a black rectangle in the Textpad editor, but shows as 3 dots in the code below.
OK, the utf8_compliant test does what it is supposed to do and as much as can be expected without too much computational overhead.
Hence in your case the code point 133 is correctly encoded as two octets (0xc2 0x85 as I mentioned above) and not just as a single byte character.
However, I do not see 3 dots in your post and that made me check with the unicode charts:
Codepoint 133 or U+0085 is not a glyph but a control (NEXT LINE)
Codepoint 147 or U+0093 is not a glyph but a control (SET TRANSMIT STATE)
I bet these control codes are invalid in XML.
The correct code point for three dots would be U+2026 (HORIZONTAL ELLIPSIS), and various double quotes can be found at U+201C - U+201F.
I suspect that the original data was
- produced as windows1250 (or another 125x code page),
- then wrongly interpreted as iso8859-1,
- then encoded as utf-8.
You should replace all these invalid characters with their corresponding intended characters before invoking simplexml_load_string.
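A minimal sketch of such a replacement, assuming PHP 7 or later (for the "\u{...}" escapes) and assuming the intended characters are the Windows-1252 ones, which agree with the other 125x code pages for the entries shown. The function name replace_c1_controls and the short mapping table are hypothetical; extend the table for other characters that turn up.

```php
<?php
// Hypothetical helper: maps UTF-8 encoded C1 controls back to the
// Windows-1252 characters they most likely stood for.
// Only a few common entries are listed here.
function replace_c1_controls(string $utf8): string
{
    static $map = [
        "\u{0085}" => "\u{2026}", // horizontal ellipsis
        "\u{0091}" => "\u{2018}", // left single quotation mark
        "\u{0092}" => "\u{2019}", // right single quotation mark
        "\u{0093}" => "\u{201C}", // left double quotation mark
        "\u{0094}" => "\u{201D}", // right double quotation mark
        "\u{0096}" => "\u{2013}", // en dash
        "\u{0097}" => "\u{2014}", // em dash
    ];
    return strtr($utf8, $map);
}

// Made-up sample: U+0085 and U+0093/U+0094 where punctuation was intended.
$clean = replace_c1_controls("Wait\u{0085} \u{0093}quoted\u{0094}");
echo bin2hex($clean), "\n";
```

The cleaned string can then be passed to simplexml_load_string in place of the original.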
The offending character showed 3 dots when I pasted it into the code textarea, but I see it now shows as an ampersand.
>You should replace all these invalid characters with their corresponding intended characters before invoking simplexml_load_string
How can I do that (or remove them) if I don't know what they might be? Can I do a regex to remove all characters outside a valid range?
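A regex along those lines could look like the sketch below. It assumes mbstring is available, and the function name strip_invalid_xml_chars is hypothetical. It removes everything outside the XML 1.0 character ranges and, as an extra choice of this sketch, also the range U+007F to U+009F, which XML 1.0 permits but discourages. Note that it deletes characters rather than repairing them.

```php
<?php
// Hypothetical helper: drops characters that XML 1.0 does not allow,
// plus U+007F..U+009F (allowed by XML 1.0 but discouraged).
function strip_invalid_xml_chars(string $s): string
{
    // Substitute malformed byte sequences first: a /u pattern fails
    // (preg_replace returns null) on input that is not valid UTF-8.
    $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');

    // Keep TAB, LF, CR, printable ASCII, and everything from U+00A0 up
    // that falls inside the XML 1.0 Char ranges.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    return preg_replace($pattern, '', $s) ?? '';
}

// Made-up sample containing U+0085 and U+0001.
echo strip_invalid_xml_chars("a\u{0085}b\u{0001}c"), "\n"; // abc
```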
I suggest you have a look at the contribution by user squeegee on http://de2.php.net/manual/
I've just tested using DOMDocument and it also errors out with:
DOMDocument::loadXML() [domdocument.loadxml]: Input is not proper UTF-8, indicate encoding ! Bytes: 0x85 0x66 0x6F 0x72 in Entity, line: 109
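The bytes in that message are a lone 0x85 followed by 0x66 0x6F 0x72 ("for"), which suggests a raw single-byte character rather than the two-octet form. To find every affected line rather than just the first, something like the following sketch might help. The function name find_invalid_utf8_lines and the sample string are made up, and mbstring is assumed.

```php
<?php
// Hypothetical helper: returns [line number => hex dump] for every line
// of $data that is not valid UTF-8. Requires mbstring.
function find_invalid_utf8_lines(string $data): array
{
    $bad = [];
    foreach (explode("\n", $data) as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $bad[$i + 1] = bin2hex($line); // line numbers are 1-based
        }
    }
    return $bad;
}

// Made-up sample: the second line contains a raw 0x85 byte.
$sample = "<a>ok</a>\n<b>wait\x85for</b>\n";
print_r(array_keys(find_invalid_utf8_lines($sample))); // reports line 2
```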
I tried cleaning up the xml using the function fix_latin() but it fell over at:
if(1==preg_match($nibble_g
with the following error:
Warning: preg_match() [function.preg-match]: Empty regular expression
I'm starting to get somewhere by finding the individual characters that are causing errors and doing a search and replace. I replace a character and test for the next one. But I'm stuck on a character that seems to be ASCII 150: replacing it does not solve the error, and the error only goes away if I manually delete it.
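Instead of replacing one character at a time, a single pass could convert every stray high byte. The sketch below is not the fix_latin() function mentioned above; it is a simplified stand-in under the assumption that the stray bytes are Windows-1252 (where byte 150, 0x96, is an en dash). The name fix_stray_cp1252 is hypothetical and mbstring is assumed.

```php
<?php
// Hypothetical helper: leaves well-formed UTF-8 sequences untouched and
// converts any other byte >= 0x80 from Windows-1252 to UTF-8.
function fix_stray_cp1252(string $s): string
{
    // Alternatives 1-7 match well-formed multi-byte UTF-8 sequences;
    // the final capture group catches a single leftover high byte.
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'
             . '|\xE0[\xA0-\xBF][\x80-\xBF]'
             . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
             . '|\xED[\x80-\x9F][\x80-\xBF]'
             . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
             . '|[\xF1-\xF3][\x80-\xBF]{3}'
             . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
             . '|([\x80-\xFF])/';

    $fixed = preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1])) {
            // Stray byte: reinterpret it as Windows-1252 (an assumption).
            return mb_convert_encoding($m[1], 'UTF-8', 'Windows-1252');
        }
        return $m[0]; // already valid UTF-8
    }, $s);

    return $fixed ?? $s;
}

echo bin2hex(fix_stray_cp1252("wait\x85for")), "\n"; // 77616974e280a6666f72
```

This is a heuristic: two or more Windows-1252 bytes that happen to form a valid utf-8 sequence are left as they are, and bytes that Windows-1252 leaves undefined are subject to mbstring's default handling.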
by: basic612 Posted on 2009-09-16 at 08:05:35 ID: 25346383
do you have your XML enclosed with CDATA tags around the description fields?
eg: http://www.w3schools.com/xmL/xml_cdata.asp
if this is not possible you could strip out any unwanted tags in your XML using preg_replace, this might help:
http://www.php.net/manual/en/function.preg-replace.php#64828
Otherwise can you provide some sample XML that fails.