Question

SimpleXML_Load_String fails with special characters

Asked by: ncw

I'm trying to load a UTF-8 XML string using the PHP function SimpleXML_Load_String, but it fails with an error when it finds a special character in the string (contained in some description fields), e.g. ASCII 133, which is three dots (...), and ASCII 147, which appears to be a double quote.

How can I either strip out the problem characters (characters outside the allowed ASCII range) or allow their import?

This question has been solved and verified by the asker.
Asked on: 2009-09-16 at 05:14:50, ID: 24735998
Tags: SimpleXML_Load_String, php
Topics: PHP Scripting Language, Extensible Markup Language (XML)
Participating Experts: 2
Points: 0
Comments: 15

Answers

 

by: basic612, posted on 2009-09-16 at 08:05:35, ID: 25346383

Do you have your XML enclosed in CDATA tags around the description fields?

e.g. http://www.w3schools.com/xmL/xml_cdata.asp

If this is not possible, you could strip out any unwanted tags in your XML using preg_replace; this might help:

http://www.php.net/manual/en/function.preg-replace.php#64828

Otherwise, can you provide some sample XML that fails?

 

by: thehagman, posted on 2009-09-16 at 09:19:16, ID: 25347302

There is no ASCII 133: ASCII is only 7 bits.
In UTF-8, the code point 133 (U+0085) should be encoded as two octets: 0xc2 0x85.
Could it be that you have only one octet 0x85, hence invalid UTF-8?
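
To make the octet distinction concrete, here is a minimal sketch (assuming the mbstring extension is available; the variable names are purely illustrative) that compares a lone 0x85 byte with the two-octet UTF-8 encoding of U+0085:

```php
<?php
// A single 0x85 byte, as it would appear in ISO-8859-1 or Windows-125x data.
$loneByte = "\x85";

// U+0085 encoded as UTF-8: two octets, 0xC2 0x85.
$utf8 = "\xC2\x85";

// mb_check_encoding() reports whether the bytes form valid UTF-8.
var_dump(mb_check_encoding($loneByte, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));     // bool(true)

// bin2hex() shows the raw octets, which helps when an editor hides them.
echo bin2hex($utf8), "\n"; // c285
```

Dumping a suspicious slice of the real data with bin2hex() in the same way should show which of the two cases applies.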

 

by: ncw, posted on 2009-09-16 at 10:17:11, ID: 25347834

Yes the field data is enclosed within CDATA tags.

I understood that ASCII 133 was in the extended character set. The character is listed in the third table down at http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml under DEC value 133. I'm afraid I don't understand the technicalities or significance of the number of octets. It may still be invalid UTF-8, but the PHP function utf8_compliant seems to check it as OK.


 

by: thehagman, posted on 2009-09-16 at 11:31:37, ID: 25348597

It still looks like the input string is ISO-8859-1 rather than UTF-8 (and I can't find a function utf8_compliant at www.php.net).
Give
SimpleXML_Load_String( utf8_encode($data), ...)
a try.

 

by: ncw, posted on 2009-09-16 at 12:51:36, ID: 25349392

Ah sorry, utf8_compliant is a function I picked up off the net as shown below.

Using utf8_encode($data) made no difference.

An example of data from the xml file that is causing the problem is shown below the function in the code box below; the data has been reduced to only include the sentence with the offending character, which appears as a black rectangle in the Textpad editor, but shows as 3 dots in the code below.



	// reference http://www.phpwact.org/php/i18n/charsets
	function utf8_compliant($str) {
		if ( strlen($str) == 0 ) {
			return TRUE;
		}
		// If even just the first character can be matched, when the /u
		// modifier is used, then it's valid UTF-8. If the UTF-8 is somehow
		// invalid, nothing at all will match, even if the string contains
		// some valid sequences
		return (preg_match('/^.{1}/us',$str,$ar) == 1);
	}
 
 
    <Full_Desc>
      <en><![CDATA[Modern living located in an undiscovered paradise, this is quickly becoming the perfect destination for anyone wanting more&for less.]]></en>
    </Full_Desc>
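
The utf8_compliant helper above relies on the PCRE /u modifier, which, as far as I know, validates the whole subject string before matching. A minimal sketch of the same idea follows, with a hypothetical helper name (utf8_check_reason) that also reports why the check failed via preg_last_error():

```php
<?php
// Hypothetical helper: like utf8_compliant(), but also reports why PCRE rejected the input.
function utf8_check_reason(string $str): string
{
    // An empty pattern with /u matches any valid UTF-8 string, including "".
    $result = preg_match('//u', $str);
    if ($result === 1) {
        return 'valid';
    }
    // preg_match() returns false on failure; preg_last_error() says why.
    return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'invalid UTF-8' : 'other PCRE error';
}

echo utf8_check_reason("more\xC2\x85for less"), "\n"; // valid
echo utf8_check_reason("more\x85for less"), "\n";     // invalid UTF-8
```

Note that "valid UTF-8" only means the bytes are well formed; it says nothing about whether the encoded characters are the ones the author intended.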

                                              

 

by: thehagman, posted on 2009-09-17 at 04:03:29, ID: 25354656

OK, the utf8_compliant test does what it is supposed to do and as much as can be expected without too much computational overhead.
Hence in your case the code point 133 is correctly encoded as two octets (0xc2 0x85 as I mentioned above) and not just as a single byte character.

However, I do not see three dots in your post, and that made me check the Unicode charts:
Codepoint 133 or U+0085 is not a glyph but a control (NEXT LINE)
Codepoint 147 or U+0093 is not a glyph but a control (SET TRANSMIT STATE)
I bet these control codes are invalid in XML.

The correct code point for three dots would be U+2026 (HORIZONTAL ELLIPSIS) and various double quotes can be found at U+201C - U+201F

I suspect that the original data was
- produced as windows1250 (or another 125x code page),
- then wrongly interpreted as iso8859-1,
- then encoded as utf-8.

You should replace all these invalid characters with their corresponding intended characters before invoking simplexml_load_string
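
If the suspected chain (Windows-125x, misread as ISO-8859-1, then encoded as UTF-8) is what actually happened, the damage can in principle be undone by reversing the last step and decoding the bytes properly. The following is only a sketch under that assumption: the function name is hypothetical, it uses Windows-1252 as the source code page, it needs the mbstring extension, and it only works when every character in the string is U+00FF or below.

```php
<?php
// Hypothetical repair for text that was Windows-1252, misread as ISO-8859-1,
// and then encoded as UTF-8. Assumes every character is U+00FF or below.
function repair_cp1252_mojibake(string $utf8): string
{
    // Step 1: undo the ISO-8859-1 -> UTF-8 step to recover the original bytes.
    $bytes = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    // Step 2: decode those bytes as Windows-1252, which is what they really were.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$broken = "more\xC2\x85for less";           // contains U+0085 (NEXT LINE)
$fixed  = repair_cp1252_mojibake($broken);
echo bin2hex(substr($fixed, 4, 3)), "\n";   // e280a6, i.e. U+2026
```

Characters above U+00FF have no ISO-8859-1 equivalent and would be lost in step 1, so this is unsuitable for data that mixes correct and damaged text.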

 

by: ncw, posted on 2009-09-17 at 04:18:56, ID: 25354733

The offending character showed 3 dots when I pasted it into the code textarea, but I see it now shows as an ampersand.

>You should replace all these invalid characters with their corresponding intended characters before invoking simplexml_load_string
How can I do that (or remove them) if I don't know what they might be? Can I use a regex to remove all characters outside a valid range?
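
One way to read "a valid range" is the Char production of the XML 1.0 specification: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of such a filter follows; the function name is hypothetical and the input must already be valid UTF-8. That range does include U+0085 and U+0093, so this filter alone would not remove them.

```php
<?php
// Hypothetical helper: drop every code point that the XML 1.0 Char production
// does not allow. The input must already be valid UTF-8.
function strip_non_xml_chars(string $utf8): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $utf8);
    if ($clean === null) {
        // preg_replace() returns null on error, e.g. when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

echo strip_non_xml_chars("a\x00b\x0Bc"), "\n"; // abc
```

Because the /u modifier rejects malformed input outright, a stray single byte such as 0x85 triggers the exception instead of being silently removed.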

 

by: ncw, posted on 2009-09-17 at 04:20:15, ID: 25354742

Is DOMDocument better at dealing with such characters?

 

by: thehagman, posted on 2009-09-17 at 09:24:23, ID: 25357948

I suggest you have a look at the contribution by user squeegee on http://de2.php.net/manual/de/function.utf8-encode.php

 

by: ncw, posted on 2009-09-17 at 10:18:24, ID: 25358623

The data file is 0.5GB in size so I think any parsing function would take too long. I never had an issue before with DOMDocument.

 

by: thehagman, posted on 2009-09-17 at 13:46:29, ID: 25360888

It might work, but without correcting the illegal characters you will at least have unintended / garbled content

 

by: ncw, posted on 2009-09-17 at 14:50:49, ID: 25361428

I've just tested using DOMDocument and it also errors out with:
DOMDocument::loadXML() [domdocument.loadxml]: Input is not proper UTF-8, indicate encoding ! Bytes: 0x85 0x66 0x6F 0x72 in Entity, line: 109

I tried cleaning up the xml using the function fix_latin() but it fell over at:
if(1==preg_match($nibble_good_chars,$input,$match)){
with the following error:
Warning: preg_match() [function.preg-match]: Empty regular expression

 

by: ncw, posted on 2009-09-18 at 02:48:55, ID: 25364231

I'm starting to get somewhere by finding the individual characters that are causing errors and doing a search and replace. I replace a character and test for the next one. But I'm stuck on a character that seems to be ASCII 150: replacing it does not solve the error, and the error only goes away if I delete the character manually.

$search = array(chr(150), chr(133), chr(147), chr(148), chr(149), chr(146), chr(163));
$xml_encoded = str_replace($search,'',$xml_encoded);
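
One possible explanation, assuming (from the variable name only) that $xml_encoded had already been passed through utf8_encode(): each byte above 0x7F is then stored as a two-byte sequence, and a byte-wise str_replace() on chr(150) removes only the second byte. The sketch below illustrates that effect with made-up data; it needs the mbstring extension.

```php
<?php
// Assumption: the data went through utf8_encode(), emulated here with
// mb_convert_encoding() because utf8_encode() is deprecated as of PHP 8.2.
$raw     = "a\x96b";                                          // byte 150 between two letters
$encoded = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');  // "a\xC2\x96b"

// Byte-wise replacement removes only the trailing 0x96 and strands the 0xC2 lead byte.
$bad = str_replace(chr(150), '', $encoded);
var_dump(mb_check_encoding($bad, 'UTF-8'));   // bool(false)

// Replacing the whole two-byte sequence keeps the string valid.
$good = str_replace("\xC2\x96", '-', $encoded);
var_dump($good);                              // string(3) "a-b"
```

If that is the cause, doing the replacement on the raw bytes before any encoding step, or searching for the full two-byte sequences, should avoid the leftover byte.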
                                              

 

by: ncw, posted on 2009-09-18 at 04:32:23, ID: 25364755

The error returned by libxml_get_errors gives me the line number. If I extract the line number from the error message, is there any way I can remove the field value on that line (replace the text between the tags on that line with nothing) from the string without having to write to disk?
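
A rough sketch of that idea, done entirely in memory, might look like the following. Both helper names are hypothetical, and the second one assumes each field value sits in a CDATA section on a single line, as in the sample posted earlier in the thread.

```php
<?php
// Hypothetical helper: parse a string and return the line numbers libxml complained about.
function xml_error_lines(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $lines = [];
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError object carries the message, line and column.
        $lines[] = $error->line;
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return array_values(array_unique($lines));
}

// Hypothetical helper: empty the CDATA section(s) on one 1-based line of the string.
function blank_cdata_on_line(string $xml, int $lineNo): string
{
    $lines = explode("\n", $xml);
    if (isset($lines[$lineNo - 1])) {
        $lines[$lineNo - 1] = preg_replace('/<!\[CDATA\[.*?\]\]>/', '<![CDATA[]]>', $lines[$lineNo - 1]);
    }
    return implode("\n", $lines);
}

// Usage: blank the field on every line that libxml reported, then try again.
$xml = "<Full_Desc>\n<en><![CDATA[more\x85for less]]></en>\n</Full_Desc>";
foreach (xml_error_lines($xml) as $line) {
    $xml = blank_cdata_on_line($xml, $line);
}
var_dump(simplexml_load_string($xml) !== false);
```

libxml typically stops at the first fatal error, so on real data the parse-and-blank step would probably have to be repeated until no errors remain. This discards content rather than repairing it.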

 

by: ncw, posted on 2009-10-01 at 06:57:31, ID: 25468645

I had to open the xml file, do a search and replace, and then a preg_replace to remove characters outside the standard range, then save the file, and finally read the xml file using XMLReader stream reader as it was too large for simplexml_load_file. Some of the code is below.

$search = array(' & ', chr(150), chr(133), chr(147), chr(148), chr(149), chr(146), chr(163), chr(128));
$replace = array(' &amp; ', chr(45), '...', '"', '"', chr(45), chr(39), '&pound;', '&#128;' );
$buffer = str_replace($search, $replace, $buffer);
$buffer = preg_replace('/[^(\xA\xD\x20-\x7E)]*/','', $buffer);
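
For comparison, the clean-then-stream approach described above can also be sketched without loading the whole file into memory. This is not the asker's code: the helper names are hypothetical, the source is assumed to be Windows-1252 with no conflicting encoding declaration, and the mbstring and xmlreader extensions are required.

```php
<?php
// Hypothetical helper: copy $source to $target line by line, converting each line
// from Windows-1252 (an assumption about the source data) to UTF-8.
function convert_file_to_utf8(string $source, string $target): void
{
    $in  = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }
    // Windows-1252 is a single-byte encoding, so splitting on line breaks is safe.
    while (($line = fgets($in)) !== false) {
        fwrite($out, mb_convert_encoding($line, 'UTF-8', 'Windows-1252'));
    }
    fclose($in);
    fclose($out);
}

// Hypothetical helper: stream through the cleaned file and count <en> elements.
function count_en_elements(string $file): int
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException('Could not open XML file');
    }
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'en') {
            $count++;
        }
    }
    $reader->close();
    return $count;
}
```

Converting the encoding keeps characters such as the ellipsis and the pound sign instead of replacing or deleting them, which may or may not matter for the description fields. If the file is already partly UTF-8, this conversion would damage the correct parts.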
                                              