PHP XMLReader Invalid Characters

Our company is paying for a feed service.  The feed is 2GB!  I have to go through the feed with xmlreader since it's so large.  I know I can do a bulk import, but we ruled that out.  

Anyways, it gets about 100,000 items in and it flags an error about invalid characters and quits...  Is there a way I can turn that off?  Or run a function that removes the characters before it flags the error?  Or is there a setting in the xmlreader I can change?

Please give an example...  I'm using more code than what is listed below...  Just put this for a reference...
$xmlReader = new XMLReader();
// LIBXML_NOBLANKS drops whitespace-only text nodes while the file is streamed.
$xmlReader->open(XMLFILE, null, LIBXML_NOBLANKS);

$isParserActive = false;
// Element names of interest in the feed.
$simpleNodeTypes = array ("DealerID", "VIN", "Status", "VehicleType", "Year", "Make", "Model", "Trim", "Body", "Mileage", "Transmission", "EngineSize", "DriveTrain", "FuelType", "Doors", "GenericColorExterior","GenericColorInterior", "InternetPrice", "Options", "VehicleComments", "DepartmentComments", "AddendumDetails");

$c=0;
while ($xmlReader->read ())
{	
    $nodeType = $xmlReader->nodeType;
}

//there's more code

Asked by: stephenmp
 
stephenmpAuthor Commented:
Here are the warnings i get...

Warning: XMLReader::read() [function.XMLReader-read]: /var/www/vhosts/mysite.com/httpdocs/dsi/daily/VEHICLES.XML:2: parser error : PCDATA invalid Char value 7 in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: s located at our new facility at MARS  in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: ^ in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 108

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 35
 
Shinesh PremrajanTechnical ManagerCommented:
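Try reading the file as a normal file and replacing the unwanted characters in its contents.

$xml = file_get_contents('myxml.xml');
// Replace only the control characters XML 1.0 forbids; tab, LF and CR are legal and kept.
// No /u modifier, so malformed UTF-8 in the feed cannot make preg_replace() return null.
$xml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $xml);

//parse the XML after this now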
 

Hope this helps
 
stephenmpAuthor Commented:
The file is 2GB...  It is not wise to open a file that large and run commands on the whole string...

I'll try it...  You never know....
 
hernst42Commented:
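The XML you are getting is invalid. Character data may not contain the character 0x07, whether it sits inside a CDATA section or in ordinary element text. See

http://www.w3.org/TR/REC-xml/#dt-cdsection
http://www.w3.org/TR/REC-xml/#NT-Char

Contact the feed provider and have them fix the XML they generate.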
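
For reference, the Char production linked above can be written as a small check. This is only a sketch; the function name isValidXmlCodePoint is made up for illustration and is not part of PHP or libxml.

```php
<?php
// Hypothetical helper: true when $codePoint is allowed by the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isValidXmlCodePoint(int $codePoint): bool
{
    return $codePoint === 0x9
        || $codePoint === 0xA
        || $codePoint === 0xD
        || ($codePoint >= 0x20 && $codePoint <= 0xD7FF)
        || ($codePoint >= 0xE000 && $codePoint <= 0xFFFD)
        || ($codePoint >= 0x10000 && $codePoint <= 0x10FFFF);
}

var_dump(isValidXmlCodePoint(0x07)); // the character the parser rejected
var_dump(isValidXmlCodePoint(0x0A)); // line feed
```

The first call reports false and the second true, which matches the "invalid Char value 7" warning: 0x07 falls outside every range in the production.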
 
Ray PaseurCommented:
You might try reading the file with fopen(), fgets().  If the file has end-of-line characters this would let you edit with a REGEX before giving the data to the XMLReader class.

Wonder what would happen if you suppress error reports?  Maybe it would just drop the errant stuff?  Could be worth a try.
http://us.php.net/manual/en/libxml.constants.php
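
A rough sketch of the fopen()/fgets() idea follows. The function name cleanXmlFile and the 1 MB read limit are assumptions for illustration. Passing a length to fgets() caps how much is read at once, so it also copes with a feed that has few or no line breaks. It assumes the feed is UTF-8 or a single-byte encoding, where the bytes 0x00-0x1F never appear inside a multi-byte character, so splitting a long line at an arbitrary point is safe.

```php
<?php
// Sketch: copy $source to $target in bounded pieces, replacing control bytes
// that XML 1.0 forbids with a space. Returns the number of bytes replaced.
function cleanXmlFile(string $source, string $target): int
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open the source or target file');
    }

    $replaced = 0;
    // fgets() stops at a line break or after 1 MB, whichever comes first.
    while (($piece = fgets($in, 1048576)) !== false) {
        // Tab (\x09), LF (\x0A) and CR (\x0D) are legal in XML and left alone.
        $piece = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $piece, -1, $count);
        $replaced += $count;
        fwrite($out, $piece);
    }

    fclose($in);
    fclose($out);
    return $replaced;
}
```

It would be called as cleanXmlFile('VEHICLES.XML', 'CLEANVEHICLES.XML'), after which XMLReader opens the cleaned copy. Only one piece of the file is in memory at a time.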
 
stephenmpAuthor Commented:
Thanks!

 I'm away from my computer until this evening...  Meanwhile I contacted the data provider to see what they can do to correct it...  They can provide a csv file as well...  I might have to use that instead...
 
Ray PaseurCommented:
That's good news.  The CSV file would be A LOT smaller!  PHP has some helpful functions for dealing with CSV, for example:
http://us2.php.net/manual/en/function.fgetcsv.php
 
stephenmpAuthor Commented:
But even a 1.5 GB or 1 GB CSV is going to be hard to load all at once, correct?
 
Ray PaseurCommented:
Yes, a 1.5GB file is a big thing, and it should probably be in a data base.  But read the man page on fgetcsv() before you assume it is going to try to load it all at once.
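
To illustrate the point about fgetcsv(): it returns one row per call, so a loop over it only ever holds the current row. The sketch below assumes the CSV has a header row and comma-separated, double-quoted fields; the function name readCsvRows and the column names in the usage note are guesses, since the thread never shows the CSV layout.

```php
<?php
// Sketch: yield each CSV row as an associative array keyed by the header row.
function readCsvRows(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open $path");
    }
    try {
        $header = fgetcsv($handle, 0, ',', '"', '\\');
        if (!is_array($header)) {
            return; // empty file
        }
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            // Skip blank lines and rows whose field count does not match the header.
            if (count($row) !== count($header)) {
                continue;
            }
            yield array_combine($header, $row);
        }
    } finally {
        fclose($handle);
    }
}
```

A loop such as foreach (readCsvRows('vehicles.csv') as $row) { ... } can then insert or update one vehicle at a time, reading $row['VIN'] if the feed has a column by that name.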
 
stephenmpAuthor Commented:
Let me ask this... I have to compare the new feed with the old feed...  Would it be insane to create 2 large arrays of vehicle VINs (700,000 each) and then use array functions to compare them and find what was removed between the old and new?  Or even hold one in an array, loop through the other one item at a time, and use an array function to see if each one exists?

Then I can delete those from my database...
 
Ray PaseurCommented:
I cannot really answer that.  You might want to set up a test case to create arrays of 700,000 string elements.  See if your server can hold them.

This sounds like (and I am guessing a little bit here) a better application for a data base than a collection of PHP strings and arrays.
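
If the comparison does stay in PHP, one possible shape for it is sketched below. Storing the new VINs as array keys turns each check into a hash lookup rather than a scan with in_array(), and only the new feed's VINs have to be held in memory because the old ones can be streamed. The function name and the sample VINs are made up, and whether 700,000 keys fit depends on the server's memory_limit, so a test run as suggested above is still worthwhile.

```php
<?php
// Sketch: return the VINs that appear in the old feed but not in the new one.
function findRemovedVins(iterable $oldVins, iterable $newVins): array
{
    // Index the new feed by VIN so that membership checks are key lookups.
    $new = [];
    foreach ($newVins as $vin) {
        $new[$vin] = true;
    }

    $removed = [];
    foreach ($oldVins as $vin) {
        if (!isset($new[$vin])) {
            $removed[] = $vin;
        }
    }
    return $removed;
}

// Made-up sample values.
$removed = findRemovedVins(
    ['1HGCM82633A004352', 'JH4KA8260MC000000'],
    ['1HGCM82633A004352']
);
print_r($removed);
```

The returned list is what would be deleted from the database. Doing the same comparison in SQL, with both VIN lists loaded into tables, avoids the memory question altogether.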
 
stephenmpAuthor Commented:
Ok...  I'm still waiting on word over the CSV file...  I'm trying to get it to work with the XML data I have...  The person who is generating the XML is not sure how to fix it...  Anyways..  I was able to remove some unwanted characters, but I'm still getting more errors.  This time about char #20...  I have a script I'm using to clean out the data from the file and store it in another file... Then, I can open the clean file and parse it using XMLReader...  

It doesn't make sense because I'm removing all ASCII chars below 32...  Which fixed the one for char 07...  Maybe I don't have a clear understanding of what's what in encoding land...  lol  PLEASE HELP!

Can I use preg_replace() and do this faster?  If so...  I need an exact line that will remove all invalid xml characters from a string...  I've searched the internet far and wide...

-------  MY ERROR DISPLAY-----------
Warning: XMLReader::read() [function.XMLReader-read]: /var/www/vhosts/mysite.com/httpdocs/dsi/daily/CLEANVEHICLES.XML:1: parser error : xmlParseCharRef: invalid xmlChar value 20 in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ts><VehicleComments>Manners rather than muscle, space instead of hustle. &#x14; in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ^ in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 37
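function stripInvalidXml($value)
{
	$ret = "";
	// Compare against "" instead of using empty(), which would also reject the string "0".
	if ($value === null || $value === "")
		return $ret;
		
	$length = strlen($value);
	for ($i=0; $i < $length; $i++){
		// $value[$i] replaces $value{$i}, which is no longer valid syntax in PHP 8.
		$current = ord($value[$i]);
		// Tab (9), line feed (10) and carriage return (13) are legal in XML 1.0.
		if ($current > 31 || $current == 9 || $current == 10 || $current == 13){
			$ret .= $value[$i];
		}
		else {
			$ret .= " ";
			// Print the byte value; the raw control character would not be visible.
			echo "illegal char found! - code " . $current . "<br>";
		}
	}
	return $ret;
}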
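
A likely reason the byte filter does not catch this error: the log shows the text &#x14; in the file, and 0x14 is 20 in decimal. That is a numeric character reference spelled out in ordinary ASCII characters (ampersand, hash, x, 1, 4, semicolon), so no byte in it is below 32. The parser only complains once it resolves the reference. A sketch of a preg-based filter for such references follows; the function name is made up, and it assumes a 64-bit PHP build.

```php
<?php
// Sketch: replace numeric character references (&#20; or &#x14;) that point at
// code points XML 1.0 does not allow. Valid references are left untouched.
function stripInvalidCharRefs(string $xml): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]{1,8})|([0-9]{1,10}));/',
        function (array $m): string {
            // Group 1 holds hex digits, group 2 decimal digits; only one is filled.
            $code = ($m[1] !== '') ? (int) hexdec($m[1]) : (int) $m[2];
            $valid = $code === 0x9 || $code === 0xA || $code === 0xD
                || ($code >= 0x20 && $code <= 0xD7FF)
                || ($code >= 0xE000 && $code <= 0xFFFD)
                || ($code >= 0x10000 && $code <= 0x10FFFF);
            return $valid ? $m[0] : ' ';
        },
        $xml
    );
    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

echo stripInvalidCharRefs('space instead of hustle.&#x14;'), "\n";
```

This would run in addition to the raw control-character filter, not instead of it. If the file is processed in pieces, a reference that straddles two pieces would be missed, so the pieces need to be split at a point where no reference can be cut in half.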

 
Ray PaseurCommented:
Hmm.. Have you considered hiring a professional developer to help with this?  It is not really a good area for novice work.  Too many snares and trip-wires.
 
stephenmpAuthor Commented:
lol...  I am the professional developer...  I've just never worked with importing such large files... I wouldn't be trying to come up with a fix for this if the idiots who send the feed did it correctly to begin with...

I do front-end for my full-time company, and I've done PHP/HTML contracting for years...   Most of my interaction is with the database and not files, XML files n stuff...  I've always used SimpleXML, but that won't work because it would require I load the entire 2GB file... XMLReader works for large files...

I'm sure it's probably a simple fix...  But, I'm missing something in my cleaner script...
 
Shinesh PremrajanTechnical ManagerCommented:
$xmlReader->setParserProperty(XMLREADER_DEFAULTATTRS);

Try calling this right after the $xmlReader->open() call. This will read the XML, but it won't validate it. I think the validation is what is causing the problem.

Hope this helps
 
stephenmpAuthor Commented:
I tried...  Thought it was working and then...

Warning: XMLReader::setParserProperty() expects exactly 2 parameters, 1 given in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 31

Warning: XMLReader::read() [function.XMLReader-read]: /var/www/vhosts/mysite.com/httpdocs/dsi/daily/VEHICLES.XML:2: parser error : xmlParseCharRef: invalid xmlChar value 20 in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ts><VehicleComments>Manners rather than muscle, space instead of hustle. &#x14; in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: ^ in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 110

Warning: XMLReader::read() [function.XMLReader-read]: An Error Occured while reading in /var/www/vhosts/mysite.com/httpdocs/dsi/import_daily_dsi_vehicles.php on line 37
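
For what it is worth, the first warning above is about the call itself: setParserProperty() expects a property constant plus a boolean value. The sketch below shows the two-argument form together with libxml_use_internal_errors(), which collects libxml's messages instead of printing them. The sample document is made up and carries the same &#x14; reference as the feed. As far as I know this only changes how the error is reported; a character reference like this one is a well-formedness error, so the reader still stops there, and depending on the PHP version read() may still raise its own warning.

```php
<?php
// Sketch: collect libxml parser errors instead of letting them surface as warnings.
libxml_use_internal_errors(true);

$reader = new XMLReader();
$reader->XML('<Vehicles><VehicleComments>hustle.&#x14;</VehicleComments></Vehicles>');

// Two arguments: the property and its value. DTD validation is off here,
// which does not relax the well-formedness rules.
$reader->setParserProperty(XMLReader::VALIDATE, false);

$nodesRead = 0;
while ($reader->read()) {
    $nodesRead++;
}

// Keep a copy of the collected errors before clearing libxml's buffer.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
libxml_clear_errors();
$reader->close();
```

Logging the collected errors this way at least records which line and which character broke the import, which is useful evidence to send to the feed provider.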
 
stephenmpAuthor Commented:
I never found a 100% workable fix for this...  I'm sick of trying to fix their corrupt data...  I believe this user was correct in that I should make them fix it...  I'm just getting a csv file from them instead...