Solved

PHP XMLReader Invalid Characters

Posted on 2010-09-21
Last Modified: 2013-11-18
Our company is paying for a feed service.  The feed is 2GB!  I have to go through the feed with xmlreader since it's so large.  I know I can do a bulk import, but we ruled that out.  

Anyways, it gets about 100,000 items in and it flags an error about invalid characters and quits...  Is there a way I can turn that off?  Or run a function that removes the characters before it flags the error?  Or is there a setting in the xmlreader I can change?

Please give an example...  I'm using more code than what is listed below...  Just put this for a reference...
$xmlReader = new XMLReader();
$xmlReader->open(XMLFILE, null, LIBXML_NOBLANKS);

$isParserActive = false;
$simpleNodeTypes = array("DealerID", "VIN", "Status", "VehicleType", "Year", "Make", "Model", "Trim", "Body", "Mileage", "Transmission", "EngineSize", "DriveTrain", "FuelType", "Doors", "GenericColorExterior", "GenericColorInterior", "InternetPrice", "Options", "VehicleComments", "DepartmentComments", "AddendumDetails");

$c = 0;
while ($xmlReader->read()) {
    $nodeType = $xmlReader->nodeType;
}

//there's more code

Question by:stephenmp
 

Author Comment

by:stephenmp
ID: 33731437
Here are the warnings I get...

Warning: XMLReader::read() [function.XMLReader-read]: /var/www/vhosts/mysite.com/httpdocs/dsi/daily/VEHICLES.XML:2: parser error : PCDATA invalid Char value 7 in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: s located at our new facility at MARS  in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: ^ in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 35
 
LVL 17

Expert Comment

by:shinuq
ID: 33731605
Try reading the file as a normal file and replacing the unwanted characters in its contents.

$xml = file_get_contents('myxml.xml');
// Replace control characters, but keep tab (\x09), LF (\x0A) and CR (\x0D), which are valid in XML.
$xml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', ' ', $xml);

//parse the XML after this now
 

Hope this helps
 

Author Comment

by:stephenmp
ID: 33731664
The file is 2GB...  It's not wise to open a file that large and run commands on the whole string...

I'll try it...  You never know....
 
LVL 48

Accepted Solution

by:
hernst42 earned 500 total points
ID: 33732191
The XML you get is invalid. Character data (including CDATA sections) may not contain x07 as a character. See

http://www.w3.org/TR/REC-xml/#dt-cdsection
http://www.w3.org/TR/REC-xml/#NT-Char

Contact the feed provider and have them fix the XML they generate.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33733505
You might try reading the file with fopen(), fgets().  If the file has end-of-line characters this would let you edit with a REGEX before giving the data to the XMLReader class.
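
A minimal sketch of that line-by-line idea, assuming the feed really does contain newline characters so that fgets() returns manageable pieces. The function name cleanXmlFile is hypothetical. It replaces the control bytes that XML 1.0 does not allow (everything below 0x20 except tab, LF and CR) with a space and writes the result to a second file:

```php
<?php
// Hypothetical helper: copy $source to $target one line at a time,
// replacing control bytes that XML 1.0 forbids with a space.
// Returns the number of bytes that were replaced.
function cleanXmlFile(string $source, string $target): int
{
    $in = fopen($source, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $source");
    }
    $out = fopen($target, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $target");
    }

    $replaced = 0;
    while (($line = fgets($in)) !== false) {
        // Tab (0x09), LF (0x0A) and CR (0x0D) are legal in XML and are kept.
        $line = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $line, -1, $count);
        $replaced += $count;
        fwrite($out, $line);
    }

    fclose($in);
    fclose($out);
    return $replaced;
}
```

Usage would be along the lines of cleanXmlFile('VEHICLES.XML', 'CLEANVEHICLES.XML'), then pointing XMLReader at the cleaned file. Because the pattern works on single bytes, the same replacement should also be safe on fixed-size fread() chunks if the feed turns out to be one very long line.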

Wonder what would happen if you suppress error reports?  Maybe it would just drop the errant stuff?  Could be worth a try.
http://us.php.net/manual/en/libxml.constants.php
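
As far as I know, suppressing the reports does not make the parser skip the bad data: an invalid character is a fatal well-formedness error, so read() still stops at that point. What suppression can do is collect the libxml messages quietly so the script can log where the feed breaks. A sketch, with firstXmlError as a hypothetical helper name:

```php
<?php
// Hypothetical helper: walk the file with XMLReader and return the first
// libxml error as a string, or null if the whole file parsed cleanly.
function firstXmlError(string $file): ?string
{
    $previous = libxml_use_internal_errors(true); // collect instead of warn
    libxml_clear_errors();

    $reader = new XMLReader();
    $message = null;
    if (!$reader->open($file)) {
        $message = "Cannot open $file";
    } else {
        // read() may still raise a generic PHP warning on failure in some
        // PHP versions (an assumption); @ silences that one.
        while (@$reader->read()) {
            // Normal per-node processing would go here.
        }
        $errors = libxml_get_errors();
        if ($errors) {
            $e = $errors[0];
            $message = sprintf('line %d, column %d: %s', $e->line, $e->column, trim($e->message));
        }
        $reader->close();
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $message;
}
```

The returned line and column give the feed provider something concrete to look at.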
 

Author Comment

by:stephenmp
ID: 33737291
Thanks!

 I'm away from my computer until this evening...  Meanwhile I contacted the data provider to see what they can do to correct it...  They can provide a csv file as well...  I might have to use that instead...
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33737906
That's good news.  The CSV file would be A LOT smaller!  PHP has some helpful functions for dealing with CSV, for example:
http://us2.php.net/manual/en/function.fgetcsv.php
 

Author Comment

by:stephenmp
ID: 33738114
But even a 1.5 GB or 1 GB CSV is going to be hard to load all at once, correct?
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33738161
Yes, a 1.5GB file is a big thing, and it should probably be in a data base.  But read the man page on fgetcsv() before you assume it is going to try to load it all at once.
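
To illustrate the point about fgetcsv(): it reads one record per call, so memory use should stay roughly constant no matter how large the file is. A minimal sketch, assuming the CSV has a header row naming the columns (the actual layout of the provider's CSV is not known here); eachCsvRow is a hypothetical name:

```php
<?php
// Hypothetical generator: yield each CSV record as an associative array
// keyed by the header row, reading one record at a time.
function eachCsvRow(string $file): Generator
{
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $file");
    }
    try {
        $header = fgetcsv($handle, 0, ',', '"', '\\');
        if ($header === false) {
            return; // empty file
        }
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            // fgetcsv() returns [null] for a blank line; also skip rows
            // whose column count does not match the header.
            if ($row === [null] || count($row) !== count($header)) {
                continue;
            }
            yield array_combine($header, $row);
        }
    } finally {
        fclose($handle);
    }
}
```

A foreach over eachCsvRow() then sees one vehicle at a time, which can be inserted or updated in the database before the next record is read.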
 

Author Comment

by:stephenmp
ID: 33738479
Let me ask this... I have to compare the new feed with the old feed...  Would it be insane to create 2 large arrays of vehicle VIN numbers (700,000 each) and then use array functions to compare them and find what was removed between the old and new?  Or even hold one in an array and loop through the other one at a time and do an array function to see if it exists?!?!?

Then I can delete those from my database...
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33740582
I cannot really answer that.  You might want to set up a test case to create arrays of 700,000 string elements.  See if your server can hold them.

This sounds like (and I am guessing a little bit here) a better application for a data base than a collection of PHP strings and arrays.
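
One way to set up that test case, as a sketch only. All three function names are hypothetical and the generated strings are made-up 17-character placeholders, not real VINs. The comparison flips the new list so that each VIN becomes an array key, which turns every "does it exist?" check into a hash lookup instead of a scan:

```php
<?php
// Hypothetical helper: VINs present in $old but missing from $new.
function removedVins(array $old, array $new): array
{
    $newLookup = array_flip($new); // VIN => index, for hash lookups
    $removed = [];
    foreach ($old as $vin) {
        if (!isset($newLookup[$vin])) {
            $removed[] = $vin;
        }
    }
    return $removed;
}

// Hypothetical generator of made-up 17-character test strings.
// Every $dropEvery-th entry is left out to simulate a removed vehicle.
function syntheticVins(int $count, int $dropEvery = 0): array
{
    $vins = [];
    for ($i = 0; $i < $count; $i++) {
        if ($dropEvery > 0 && $i % $dropEvery === 0) {
            continue;
        }
        $vins[] = sprintf('VIN%014d', $i);
    }
    return $vins;
}

// Time one comparison for a given feed size and report peak memory.
function measureVinDiff(int $count): string
{
    $start = microtime(true);
    $removed = removedVins(syntheticVins($count), syntheticVins($count, 1000));
    return sprintf(
        '%d removed in %.2f s, peak memory %.1f MB',
        count($removed),
        microtime(true) - $start,
        memory_get_peak_usage(true) / 1048576
    );
}
```

Calling echo measureVinDiff(700000); on the actual server would show whether the arrays fit under its memory_limit. If they do not, loading both VIN lists into database tables and comparing them there is the alternative suggested above.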
 

Author Comment

by:stephenmp
ID: 33766124
Ok...  I'm still waiting on word over the CSV file...  I'm trying to get it to work with the XML data I have...  The person who is generating the XML is not sure how to fix it...  Anyways..  I was able to remove some unwanted characters, but I'm still getting more errors.  This time about char #20...  I have a script I'm using to clean out the data from the file and store it in another file... Then, I can open the clean file and parse it using XMLReader...  

It doesn't make sense because I'm removing all ascii chars below 31...  Which fixed the one for char 07...  Maybe I don't have a clear understanding of what's what in encoding land...  lol  PLEASE HELP!

Can I use preg_replace() and do this faster?  If so...  I need an exact line that will remove all invalid xml characters from a string...  I've searched the internet far and wide...

-------  MY ERROR DISPLAY-----------
Warning: XMLReader::read() [function.XMLReader-read]: /var/www/vhosts/mysite.com/httpdocs/dsi/daily/CLEANVEHICLES.XML:1: parser error : xmlParseCharRef: invalid xmlChar value 20 in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ts><VehicleComments>Manners rather than muscle, space instead of hustle. &#x14; in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ^ in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 37

function stripInvalidXml($value)
{
	$ret = "";
	// empty() would also treat the string "0" as empty, so compare explicitly.
	if ($value === null || $value === "")
		return $ret;

	$length = strlen($value);
	for ($i = 0; $i < $length; $i++) {
		// $value{$i} was removed in PHP 8; use square brackets.
		$current = ord($value[$i]);
		// Note: this also replaces tab (9), LF (10) and CR (13), which XML allows.
		if ($current > 31) {
			$ret .= chr($current);
		}
		else {
			$ret .= " ";
			echo "illegal char found! - " . $value[$i] . "<br>";
		}
	}
	return $ret;
}
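
One possible explanation for the remaining error, inferred from the warning text rather than from seeing the file: xmlParseCharRef is the libxml routine that parses numeric character references, and the excerpt shows &#x14; (hex 14 = decimal 20). In the file that is six ordinary ASCII characters, so a byte-by-byte filter never encounters a byte with value 20. A sketch that handles both raw bytes and such references, using the hypothetical names isValidXmlCodePoint and stripInvalidXmlChars:

```php
<?php
// Hypothetical helper: true if $cp is allowed by the XML 1.0 Char production.
function isValidXmlCodePoint(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Hypothetical helper: replace raw control bytes, and numeric character
// references (&#20; or &#x14;) that point at invalid characters, with a space.
function stripInvalidXmlChars(string $value): string
{
    // Raw bytes below 0x20, except tab, LF and CR.
    $value = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $value);

    // Numeric character references, hexadecimal (group 1) or decimal (group 2).
    return preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]{1,8})|([0-9]{1,10}));/',
        function (array $m): string {
            $cp = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            // Keep references to valid characters untouched.
            return isValidXmlCodePoint((int) $cp) ? $m[0] : ' ';
        },
        $value
    );
}
```

Valid references such as &#65; and named entities such as &amp; pass through unchanged. Applied per line while copying the feed to a clean file, this should cover both kinds of error shown in the warnings, though other well-formedness problems in the feed would still need separate handling.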

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33766404
Hmm.. Have you considered hiring a professional developer to help with this?  It is not really a good area for novice work.  Too many snares and trip-wires.
 

Author Comment

by:stephenmp
ID: 33766461
lol...  I am the professional developer...  I've just never worked with importing such large files... I wouldn't be trying to come up with a fix for this if the idiots who send the feed did it correctly to begin with...

I do front-end for my full-time company, and I've done PHP/HTML contracting for years...   Most of my interaction is with the database and not files, xml files n stuff...  I've always used simpleXML, but that won't work because it would require I load the entire 2GB file... XMLreader works for large files...

I'm sure it's probably a simple fix...  But, I'm missing something in my cleaner script...
 
LVL 17

Expert Comment

by:shinuq
ID: 33767114
$xmlReader->setParserProperty(XMLREADER_DEFAULTATTRS);

Try calling this after the $xmlReader->open() call. This will read the XML, but it won't validate it. I think validation is causing the problem.

Hope this helps
 

Author Comment

by:stephenmp
ID: 33767184
I tried...  Thought it was working and then...

Warning: XMLReader::setParserProperty() expects exactly 2 parameters, 1 given in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 31

Warning: XMLReader::read() [function.XMLReader-read]: /var/www/vhosts/mysite.com/httpdocs/dsi/daily/VEHICLES.XML:2: parser error : xmlParseCharRef: invalid xmlChar value 20 in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ts><VehicleComments>Manners rather than muscle, space instead of hustle. &#x14; in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ^ in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 37
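
The first warning is about the call itself: setParserProperty() takes a property and a boolean value, and the properties are class constants such as XMLReader::DEFAULTATTRS. As far as I know, none of the available properties (LOADDTD, DEFAULTATTRS, VALIDATE, SUBST_ENTITIES) turns off well-formedness checking, which would explain why the invalid-character error remains either way. A sketch of the two-argument form, with openWithDefaultAttrs as a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: open a file and set a parser property using the
// two-argument form that XMLReader::setParserProperty() expects.
function openWithDefaultAttrs(string $file): XMLReader
{
    $reader = new XMLReader();
    if (!$reader->open($file, null, LIBXML_NOBLANKS)) {
        throw new RuntimeException("Cannot open $file");
    }
    // Must be called after open() and before the first read().
    // DEFAULTATTRS only controls loading default attributes from a DTD;
    // it does not relax the rules about which characters are allowed.
    $reader->setParserProperty(XMLReader::DEFAULTATTRS, true);
    return $reader;
}
```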
 

Author Closing Comment

by:stephenmp
ID: 33775655
I never found a 100% workable fix for this...  I'm sick of trying to fix their corrupt data...  I believe this user was correct in that I should make them fix it...  I'm just getting a csv file from them instead...