Solved

Parsing large XML file

Posted on 2013-01-03
Last Modified: 2013-05-14
I need to (basically) rewrite a large XML file. It's too large to load with simplexml_load_file, so instead I have been trying to use XMLReader and then a foreach statement to access keys and values.
This works fine for accessing most nodes, but there are some that are not so straightforward.
Example:
<number1 name="Land area (+/- 5%) in m2 " value="4500"/>
<number2 name="Habitable area (+/- 5%) in m2" value="200"/>
<number3 name="DPE numeric value (Admin) " value="274"/>
where I need to access the name and value values.

There are also nodes with children that I need to access
Example:
<pictures>
     <picture name="Photo 6">
          <filename>http://etc.jpg </filename>
     </picture>
     <picture name="Photo 7">
          <filename>http://etc.jpg</filename>
     </picture>

Searching the forum and googling have left me confused.

I need to extract the fields we require and rewrite them into another (simpler) XML file.
Any help with this will be most appreciated.
Question by:fionafenton
LVL 16

Expert Comment

by:Chris Harte
You will need to use XMLReader.

http://uk3.php.net/manual/en/book.xmlreader.php

If nobody else comes up with an example I will get one working for you tomorrow.
 
LVL 1

Author Comment

by:fionafenton
I am using XMLReader and am getting it to work except when I try to grab some attributes.
Here's my code so far:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;
$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName>Beaux Villages</AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        if($xmlReader->localName == 'advert_heading') {
            $xmlReader->read();
            $title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$title."</Title>\r\n";
        }
        if($xmlReader->localName == 'main_advert') {
            $xmlReader->read();
            $description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$description."</Description>\r\n";
        }
	if($xmlReader->localName == 'town') {
            $xmlReader->read();
            $town = $xmlReader->value;
	   $_xml .="\t\t<City>".$town."</City>\r\n";
        }
	if($xmlReader->localName == 'postcode') {
            $xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(",France","",$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
	if($xmlReader->localName == 'property_type') {
            $xmlReader->read();
            $type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$type."</PropertyType>\r\n";
        }
	if($xmlReader->localName == 'numeric_price') {
            $xmlReader->read();
            $price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$price."</Price>\r\n";
        }
	if($xmlReader->localName == 'bedrooms') {
            $xmlReader->read();
            $bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$bedrooms."</Bedrooms>\r\n";
        }
	if($xmlReader->localName == 'bathrooms') {
             $xmlReader->read();
            $bathrooms = $xmlReader->value;
	    $_xml .="\t\t<Bathrooms>".$bathrooms."</Bathrooms>\r\n";
        }
	if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
        }
        if($xmlReader->localName == 'floorplans') {
            // got to end
	    $_xml .="\t\t<WCs></WCs>\r\n";
	    $_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";
            $_xml .="\t</Property>\r\n"; 
			      $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);



When trying to access attributes this works
if($xmlReader->localName == 'property') {
           $id = $xmlReader->getAttribute('reference');
	   $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }


But this doesn't and I can't see why
if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



And I've still to work out how to extract the <picture> attributes and the associated filename url
 
LVL 82

Expert Comment

by:hielo
>>But this doesn't and I can't see why

if you were to add comments to your code it might help:
//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();

            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');

            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



From the comments above, it should be clear that your code is looking for:
<number1 />

and expects that the node/element that follows will always have a "value" attribute.  So this will work:
<number1 value="1" />
<number20 value="2" />  <= $plot will get its value from here.

This will not work:
<number1 value="1" />
<picture>test.jpg</picture>  <= there is no value attribute here.

Somewhere in your XML file there must be a <number1> node/element that is followed by another element that does NOT have a "value" attribute.

If you expect the node after <number1> to be <numberX>, then check the first six characters of the localName after advancing the "node pointer":

//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();


if( substr($xmlReader->localName,0,6)=='number')
{
            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');
}
else
{
  //this should tell you which node it is currently reading
  echo 'no value attribute detected on node '. $xmlReader->localName;
  $plot=0;
}
            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



If you don't care to find out the name of the node that is giving you problems,
then in your existing code, changing:
$plot = $xmlReader->getAttribute('value');

to:
$plot = intval( @$xmlReader->getAttribute('value'),10);

should force $plot to be set to zero when no 'value' exists.

Regards,
Hielo
 
LVL 1

Author Comment

by:fionafenton
Thanks for your input Hielo but I'm afraid it didn't work.
Through a process of trial and error I've discovered that it works once I remove the line
$xmlReader->read();
(And looking back at the code snippets I previously posted, it is now obvious!)
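A minimal sketch of why dropping the extra read() helps: attributes belong to the element the cursor is currently on, so getAttribute() can be called as soon as localName matches. The sample string and the wrapping <property> element below are assumptions made for illustration; only number1 and number2 come from the question.

```php
<?php
// Sample input modelled on the question; the <property> wrapper is assumed.
$xml = '<property>'
     . '<number1 name="Land area (+/- 5%) in m2 " value="4500"/>' . "\n"
     . '<number2 name="Habitable area (+/- 5%) in m2" value="200"/>'
     . '</property>';

$reader = new XMLReader();
$reader->XML($xml);

$values = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && strpos($reader->localName, 'number') === 0) {
        // No second read() here: the cursor is already on the element
        // that carries the attributes. getAttribute() returns null when
        // the attribute is absent, hence the casts.
        $values[$reader->localName] = [
            'name'  => trim((string) $reader->getAttribute('name')),
            'value' => (int) $reader->getAttribute('value'),
        ];
    }
}
$reader->close();

foreach ($values as $tag => $v) {
    echo $tag, ': ', $v['name'], ' = ', $v['value'], PHP_EOL;
}
```

Calling read() before getAttribute() moves the cursor off <number1> to whatever node follows it, which is why the earlier snippet came back empty.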

So one problem solved. On to the next ...

The original xml has the following format
<pictures>
  <picture name="Photo 1">
        <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 2">
         <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 3">
         <filename>http://www.etc.JPG</filename>
   </picture>
  etc ....

</pictures>


I need to convert that to the following format
<imageCount>8</imageCount>
<Images>
      <Image1>http://www.etc.jpg</Image1>    
      <Image2>http://www.etc.jpg</Image2>
      <Image3>http://www.etc.jpg</Image3>
  etc...

</Images>


The total number of images varies. There doesn't appear to be a maximum number of images and the Name attributes don't follow a pattern.
So I need to loop through all the picture nodes and collect the child filename values and also keep a running count of the picture nodes.
 
LVL 82

Expert Comment

by:hielo
Have you considered using XSL?  If that is OK, then save the following  as hielo.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="xml" indent="yes"/>
	<xsl:template match="/agency/branches/branch">
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="properties">
		<!-- This "if" clause is meant to exclude branches with no properties ex: Sales Support -->
		<xsl:if test="count(*) &gt; 0">
			<xsl:element name="Properties">
				<xsl:attribute name="branch"><xsl:value-of select="../@name"/></xsl:attribute>
				<xsl:apply-templates select="property"/>
			</xsl:element>
		</xsl:if>
	</xsl:template>

	<xsl:template match="property">
		<Property>
			<Agent>BV</Agent>
			<AgentName>Beaux Villages</AgentName>
			<ID><xsl:value-of select="@reference"/></ID>
			<Title><xsl:value-of select="advert_heading"/></Title>
			<Description><xsl:value-of select="main_advert"/></Description>
			<City><xsl:value-of select="town"/></City>
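			<!-- translate() strips individual characters, not the substring; substring-before() removes the ",France" suffix -->
			<Postcode><xsl:value-of select="substring-before(concat(postcode, ',France'), ',France')"/></Postcode>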
			<PropertyType><xsl:value-of select="property_type"/></PropertyType>
			<Price><xsl:value-of select="numeric_price"/></Price>
			<Bedrooms><xsl:value-of select="bedrooms"/></Bedrooms>
			<Bathrooms><xsl:value-of select="bathrooms"/></Bathrooms>
			<PlotSize><xsl:value-of select="number1/@value"/></PlotSize>
			<WCs></WCs>
			<Locality></Locality>
			<Status>For Sale</Status>
			<Pictures><xsl:apply-templates select="pictures/picture"/></Pictures>
		</Property>
	</xsl:template>

 	<xsl:template match="picture">
		<Picture>
			<!-- These will add an "id" and a "caption" attribute to the enclosing element "Picture" in the output -->
			<xsl:attribute name="id"><xsl:value-of select="position()"/></xsl:attribute>
			<xsl:attribute name="caption"><xsl:value-of select="@name"/></xsl:attribute>

			<!-- This retrieves the text-node value from filename and uses it as the text-node value for "Picture" (because it is enclosed in the "Picture" node)  -->
			<xsl:value-of select="filename" />
		</Picture>
	</xsl:template>

</xsl:stylesheet>



Then save the following as hielo.php:
<?php
//hielo.php

// Load the XML source
$xml = new DOMDocument;

# you need to provide the correct path below so that it "points" to the exact location 
# of properties2.xml
$xml->load('/path/to/properties2.xml');

$xsl = new DOMDocument;

# this too needs the exact location of the xsl file
$xsl->load('/path/to/hielo.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

# here provide the correct path of where you want the output saved.
file_put_contents('/path/to/outputFile.xml', $proc->transformToXml($xml) );

?>



On another note, given the size of your file, if possible, I suggest you consider porting your project to a db.
 
LVL 1

Author Comment

by:fionafenton
I tried your code but it's causing all sorts of memory problems and won't run.

I'm almost there with what I've already done. All I need to know is how to access the <filename> values for each <pictures> parent (and how to keep a running count of how many <filename> children there are).
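One possible middle ground between loading the whole document (which is what runs out of memory) and hand-walking every node is to stream with XMLReader and expand only one <property> at a time. The sketch below is not from the thread: eachProperty() is a hypothetical helper, the sample data is made up, and the agency/branches/branch/properties nesting is assumed from the XSL above.

```php
<?php
/**
 * Hypothetical helper: yields each <property> element as a small
 * SimpleXMLElement while the rest of the file is streamed.
 */
function eachProperty(XMLReader $reader): Generator
{
    $doc  = new DOMDocument();
    $more = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === 'property') {
            // expand() builds a DOM subtree for the current element only;
            // importNode() copies it so it survives the next cursor move.
            $node = $doc->importNode($reader->expand(), true);
            yield simplexml_import_dom($node);
            // next() skips over the subtree that was just expanded.
            $more = $reader->next();
        } else {
            $more = $reader->read();
        }
    }
}

// Made-up sample data for illustration.
$sample = <<<XML
<agency><branches>
  <branch name="A"><properties>
    <property reference="P1">
      <town>Town A</town>
      <number1 name="Land area" value="4500"/>
      <pictures>
        <picture name="Photo 1"><filename>http://example.com/1.jpg</filename></picture>
        <picture name="Photo 2"><filename>http://example.com/2.jpg </filename></picture>
      </pictures>
    </property>
  </properties></branch>
  <branch name="B"><properties>
    <property reference="P2"><town>Town B</town></property>
  </properties></branch>
</branches></agency>
XML;

$reader = new XMLReader();
$reader->XML($sample);

$rows = [];
foreach (eachProperty($reader) as $p) {
    $images = [];
    if (isset($p->pictures->picture)) {      // a property may have no pictures
        foreach ($p->pictures->picture as $pic) {
            $images[] = trim((string) $pic->filename);
        }
    }
    $rows[] = [
        'id'     => (string) $p['reference'],
        'town'   => (string) $p->town,
        'plot'   => (string) ($p->number1['value'] ?? ''),
        'images' => $images,                 // count($images) is the running count
    ];
}
$reader->close();
```

Memory use then depends on the size of the largest single <property> rather than on the whole file. For the real feed, $reader->open($file1) would take the place of $reader->XML($sample).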
 
LVL 82

Accepted Solution

by:
hielo earned 500 total points
try:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;

$pictures=NULL;

$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName><Company Name></AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        elseif($xmlReader->localName == 'advert_heading') {
            //$xmlReader->read();
            //$title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$xmlReader->value."</Title>\r\n";
        }
        elseif($xmlReader->localName == 'main_advert') {
            //$xmlReader->read();
            //$description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$xmlReader->value."</Description>\r\n";
        }
		elseif($xmlReader->localName == 'town') {
            //$xmlReader->read();
            //$town = $xmlReader->value;
	   $_xml .="\t\t<City>".$xmlReader->value."</City>\r\n";
        }
		elseif($xmlReader->localName == 'postcode') {
            //$xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(',France','',$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
		elseif($xmlReader->localName == 'property_type') {
            //$xmlReader->read();
            //$type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$xmlReader->value."</PropertyType>\r\n";
        }
		elseif($xmlReader->localName == 'numeric_price') {
            //$xmlReader->read();
            //$price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$xmlReader->value."</Price>\r\n";
        }
		elseif($xmlReader->localName == 'bedrooms') {
            //$xmlReader->read();
            //$bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$xmlReader->value."</Bedrooms>\r\n";
        }
		elseif($xmlReader->localName == 'bathrooms') {
            //$xmlReader->read();
	    $_xml .="\t\t<Bathrooms>".$xmlReader->value."</Bathrooms>\r\n";
        }
		elseif($xmlReader->localName == 'number1') {

			# Get rid of this.  You don't need to advance to the next 'numberX'
			#node.  You just need the 'value' attribute of the current node.
            //$xmlReader->read();


	    $_xml .="\t\t<PlotSize>".$xmlReader->getAttribute('value')."</PlotSize>\r\n";
        }

		# To get the pictures, create an empty array
        elseif($xmlReader->localName == 'pictures') {
			$pictures=array();	
		}

		# on next iteration of main 'while()', the 'read()' moves onto next node =>'picture'
		# So add one element array('caption'=>'', 'uri'=>'')
		# $pictures is no longer empty.  You still don't know what is the uri, but once again
		# the main 'while()' will advance to next node via read()
        elseif($xmlReader->localName == 'picture') {
			$pictures[]=array('caption'=>$xmlReader->getAttribute('name'), 'uri'=>'');
		}

		# now that 'read()' brought you to filename, you just need to add the image path to the uri
		# of the last array you added to $pictures
        elseif($xmlReader->localName == 'filename') {
			$index=count($pictures)-1;
			$pictures[$index]['uri']=$xmlReader->value;
			$index=null;
		}
        elseif($xmlReader->localName == 'floorplans') {
            // got to end
	    	$_xml .="\t\t<WCs></WCs>\r\n";
	    	$_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";

			if( !is_null($pictures) )
			{
				$_xml .= '<imageCount>'.count($pictures).'</imageCount>'.PHP_EOL;
				$_xml .= '<Images>'.PHP_EOL;
				foreach($pictures as $j=>$v)
				{
					// closing tag must match the numbered opening tag; stray ")" removed
					$_xml .='<Image'.(1+$j).' caption="'.$v['caption'].'">'.$v['uri'].'</Image'.(1+$j).'>'.PHP_EOL;
				}
				$_xml .= '</Images>'.PHP_EOL;
				$pictures=NULL;
			}
			else
			{
				$_xml .='<imageCount>0</imageCount>';
			}


            $_xml .="\t</Property>\r\n"; 
	    $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);

 
LVL 1

Author Closing Comment

by:fionafenton
Brilliant! Thanks.
It almost worked. I had to add back in most of the $xmlReader->read(); that you'd commented out, but the logic for getting the photo values and attributes was spot on.
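The read() calls had to go back in because $xmlReader->value is empty while the cursor sits on an element node; the text lives in the child text node. As a sketch of an alternative, XMLReader::readString() returns the text content of the current element without an explicit read(), and XMLWriter takes care of escaping on output. The function name convertPictures() and the example.com URLs are made up for illustration.

```php
<?php
// Hypothetical helper: turns one <pictures> block into the
// <imageCount>/<Images> layout requested earlier in the thread.
function convertPictures(string $picturesXml): string
{
    $reader = new XMLReader();
    $reader->XML($picturesXml);

    $uris = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === 'filename') {
            // readString() gives the element's text content directly.
            $uris[] = trim($reader->readString());
        }
    }
    $reader->close();

    $out = new XMLWriter();
    $out->openMemory();
    $out->setIndent(true);
    $out->writeElement('imageCount', (string) count($uris));
    $out->startElement('Images');
    foreach ($uris as $i => $uri) {
        // writeElement() escapes characters such as & in the URL.
        $out->writeElement('Image' . ($i + 1), $uri);
    }
    $out->endElement();

    return $out->outputMemory();
}

// Made-up sample in the shape shown in the question.
$sample = '<pictures>'
        . '<picture name="Photo 1"><filename>http://example.com/a.jpg </filename></picture>'
        . '<picture name="Photo 2"><filename>http://example.com/b.jpg</filename></picture>'
        . '</pictures>';

echo convertPictures($sample);
```

Writing through XMLWriter also avoids malformed output when a title or description contains characters like & or <, which plain string concatenation would pass through unescaped.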