Parsing large XML file

I need to (basically) rewrite a large XML file. It's too large for simplexml_load_file, so I have been using XMLReader instead, with a foreach statement to access keys and values.
This works fine for most nodes, but some are not so straightforward.
Example:
<number1 name="Land area (+/- 5%) in m2 " value="4500"/>
<number2 name="Habitable area (+/- 5%) in m2" value="200"/>
<number3 name="DPE numeric value (Admin) " value="274"/>
where I need to access the name and value attributes.

There are also nodes with children that I need to access
Example:
<pictures>
     <picture name="Photo 6">
          <filename>http://etc.jpg </filename>
     </picture>
     <picture name="Photo 7">
          <filename>http://etc.jpg</filename>
     </picture>

Searching the forum and googling have left me confused.

I need to extract the fields we require and rewrite them into another (simpler) XML file.
Any help with this will be most appreciated.
Asked by fionafenton

Chris Harte (Thaumaturge) commented:
You will need to use XMLReader.

http://uk3.php.net/manual/en/book.xmlreader.php

If nobody else comes up with an example I will get one working for you tomorrow.

fionafenton (Author) commented:
I am using XMLReader and am getting it to work except when I try to grab some attributes.
Here's my code so far:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;
$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName>Beaux Villages</AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        if($xmlReader->localName == 'advert_heading') {
            $xmlReader->read();
            $title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$title."</Title>\r\n";
        }
        if($xmlReader->localName == 'main_advert') {
            $xmlReader->read();
            $description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$description."</Description>\r\n";
        }
	if($xmlReader->localName == 'town') {
            $xmlReader->read();
            $town = $xmlReader->value;
	   $_xml .="\t\t<City>".$town."</City>\r\n";
        }
	if($xmlReader->localName == 'postcode') {
            $xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(",France","",$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
	if($xmlReader->localName == 'property_type') {
            $xmlReader->read();
            $type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$type."</PropertyType>\r\n";
        }
	if($xmlReader->localName == 'numeric_price') {
            $xmlReader->read();
            $price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$price."</Price>\r\n";
        }
	if($xmlReader->localName == 'bedrooms') {
            $xmlReader->read();
            $bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$bedrooms."</Bedrooms>\r\n";
        }
	if($xmlReader->localName == 'bathrooms') {
             $xmlReader->read();
            $bathrooms = $xmlReader->value;
	    $_xml .="\t\t<Bathrooms>".$bathrooms."</Bathrooms>\r\n";
        }
	if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
        }
        if($xmlReader->localName == 'floorplans') {
            // got to end
	    $_xml .="\t\t<WCs></WCs>\r\n";
	    $_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";
            $_xml .="\t</Property>\r\n"; 
			      $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);



When trying to access attributes this works
if($xmlReader->localName == 'property') {
           $id = $xmlReader->getAttribute('reference');
	   $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }


But this doesn't and I can't see why
if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



And I've still to work out how to extract the <picture> attributes and the associated filename url

hielo commented:
>>But this doesn't and I can't see why

If you add comments to your code, it might help:
//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();

            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');

            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



From the comments above, it should be clear that your code is looking for:
<number1 />

and expects that the node/element that follows will always have a "value" attribute.  So this will work:
<number1 value="1" />
<number20 value="2" />  <= $plot will get its value from here.

This will not work:
<number1 value="1" />
<picture>test.jpg</picture>  <= there is no value attribute here.

Somewhere in your XML file there must be a <number1> node/element that is followed by another node that does NOT have a "value" attribute.

If you expect the next node after <number1> to be <numberX>, then check the first six chars of the localName after advancing the "node pointer":

//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();


if( substr($xmlReader->localName,0,6)=='number')
{
            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');
}
else
{
  //this should tell you which node it is currently reading
  echo 'no value attribute detected on node '. $xmlReader->localName;
  $plot=0;
}
            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



If you don't care to find out the name of the node that is giving you problems,
then in your existing code, changing:
$plot = $xmlReader->getAttribute('value');

to:
$plot = intval( @$xmlReader->getAttribute('value'),10);

should force $plot to be set to zero when no 'value' exists.

Regards,
Hielo

fionafenton (Author) commented:
Thanks for your input, Hielo, but I'm afraid it didn't work.
Through a process of trial and error I've discovered that it works if I remove the line
$xmlReader->read();
(And looking back at the code snippets I previously posted, it is now obvious!)
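
For anyone following along, here is a minimal sketch of why dropping the extra read() helps: the attributes belong to the element the reader is currently on, and the node that follows a self-closing element is usually just the whitespace between tags. The sample XML and the $areas array are made up for illustration; only the element and attribute names come from the examples in this thread.

```php
<?php
// Hypothetical sample input, modelled on the <numberX> lines quoted earlier.
$xml = <<<XML
<property reference="BV1">
  <number1 name="Land area (+/- 5%) in m2 " value="4500"/>
  <number2 name="Habitable area (+/- 5%) in m2" value="200"/>
</property>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$areas = array();
while ($reader->read()) {
    // Skip everything that is not an opening element (e.g. whitespace nodes).
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    // Any element whose name starts with "number".
    if (strpos($reader->localName, 'number') === 0) {
        // No extra read() here: the attributes are on the current node.
        $areas[$reader->localName] = array(
            'name'  => trim((string) $reader->getAttribute('name')),
            'value' => (int) $reader->getAttribute('value'),
        );
    }
}
$reader->close();

print_r($areas);
```

getAttribute() returns null when the attribute is missing, so the casts above turn a missing name into an empty string and a missing value into 0.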

So one problem solved. On to the next ...

The original xml has the following format
<pictures>
  <picture name="Photo 1">
        <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 2">
         <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 3">
         <filename>http://www.etc.JPG</filename>
   </picture>
  etc ....

</pictures>


I need to convert that to the following format
<imageCount>8</imageCount>
<Images>
      <Image1>http://www.etc.jpg</Image1>    
      <Image2>http://www.etc.jpg</Image2>
      <Image3>http://www.etc.jpg</Image3>
  etc...

</Images>


The total number of images varies. There doesn't appear to be a maximum number of images and the Name attributes don't follow a pattern.
So I need to loop through all the picture nodes and collect the child filename values and also keep a running count of the picture nodes.
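
One possible way to do this with XMLReader alone, sketched under some assumptions: collect each <filename> as it is met, then write the count and the numbered <ImageN> elements when the closing </pictures> tag is reached. The sample XML, the example.com URLs and the $images / $out variable names are invented for the sketch.

```php
<?php
// Hypothetical sample input in the <pictures> format shown above.
$xml = <<<XML
<property>
  <pictures>
    <picture name="Photo 1"><filename>http://www.example.com/1.JPG</filename></picture>
    <picture name="Photo 2"><filename> http://www.example.com/2.JPG </filename></picture>
  </pictures>
</property>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$out    = '';
$images = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'filename') {
        // readString() returns the element's text without moving the reader.
        $images[] = trim($reader->readString());
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT && $reader->localName === 'pictures') {
        // Closing </pictures>: every filename of this property has been seen.
        $out .= "<imageCount>" . count($images) . "</imageCount>\r\n";
        $out .= "<Images>\r\n";
        foreach ($images as $n => $url) {
            $tag  = 'Image' . ($n + 1);
            $out .= "\t<$tag>" . htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</$tag>\r\n";
        }
        $out .= "</Images>\r\n";
        $images = array(); // reset for the next property
    }
}
$reader->close();

echo $out;
```

Note that a self-closing <pictures/> produces no END_ELEMENT node, so a property with no pictures at all would need its own check (for example on $reader->isEmptyElement).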

hielo commented:
Have you considered using XSL?  If that is OK, then save the following  as hielo.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="xml" indent="yes"/>
	<xsl:template match="/agency/branches/branch">
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="properties">
		<!-- This "if" clause is meant to exclude branches with no properties ex: Sales Support -->
		<xsl:if test="count(*) &gt; 0">
			<xsl:element name="Properties">
				<xsl:attribute name="branch"><xsl:value-of select="../@name"/></xsl:attribute>
				<xsl:apply-templates select="property"/>
			</xsl:element>
		</xsl:if>
	</xsl:template>

	<xsl:template match="property">
		<Property>
			<Agent>BV</Agent>
			<AgentName>Beaux Villages</AgentName>
			<ID><xsl:value-of select="@reference"/></ID>
			<Title><xsl:value-of select="advert_heading"/></Title>
			<Description><xsl:value-of select="main_advert"/></Description>
			<City><xsl:value-of select="town"/></City>
			<Postcode><xsl:value-of select="substring-before(concat(postcode,',France'),',France')"/></Postcode>
			<PropertyType><xsl:value-of select="property_type"/></PropertyType>
			<Price><xsl:value-of select="numeric_price"/></Price>
			<Bedrooms><xsl:value-of select="bedrooms"/></Bedrooms>
			<Bathrooms><xsl:value-of select="bathrooms"/></Bathrooms>
			<PlotSize><xsl:value-of select="number1/@value"/></PlotSize>
			<WCs></WCs>
			<Locality></Locality>
			<Status>For Sale</Status>
			<Pictures><xsl:apply-templates select="pictures/picture"/></Pictures>
		</Property>
	</xsl:template>

 	<xsl:template match="picture">
		<Picture>
			<!-- These will add and "id" and "caption" attribute to the enclosing element "Picture" in the output -->
			<xsl:attribute name="id"><xsl:value-of select="position()"/></xsl:attribute>
			<xsl:attribute name="caption"><xsl:value-of select="@name"/></xsl:attribute>

			<!-- This retrieves the text-node value from filename and uses it as the text-node value for "Picture" (because it is enclosed in the "Picture" node)  -->
			<xsl:value-of select="filename" />
		</Picture>
	</xsl:template>

</xsl:stylesheet>



Then save the following as hielo.php:
<?php
//hielo.php

// Load the XML source
$xml = new DOMDocument;

# you need to provide the correct path below so that it "points" to the exact location 
# of properties2.xml
$xml->load('/path/to/properties2.xml');

$xsl = new DOMDocument;

# this too needs the exact location of the xsl file
$xsl->load('/path/to/hielo.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

# here provide the correct path of where you want the output saved.
file_put_contents('/path/to/outputFile.xml', $proc->transformToXml($xml) );

?>



On another note, given the size of your file, I suggest you consider porting your project to a database if possible.

fionafenton (Author) commented:
I tried your code but it's causing all sorts of memory problems and won't run.

I'm almost there with what I've already done. All I need to know is how to access the <filename> values for each <pictures> parent (and how to keep a running count of how many <filename> children there are).
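
A middle ground that is sometimes used when a full DOM or XSLT load runs out of memory: stream with XMLReader, but expand one <property> at a time into a small SimpleXML object, so only a single record is held in memory. The sketch below assumes that approach; the sample XML, the town names and the $summary array are hypothetical, and the real feed may nest <property> more deeply.

```php
<?php
// Hypothetical sample input; only the element names come from the thread.
$xml = <<<XML
<properties>
  <property reference="BV1">
    <town>Eymet</town>
    <number1 name="Land area" value="4500"/>
    <pictures>
      <picture name="Photo 1"><filename>http://www.example.com/1.JPG</filename></picture>
    </pictures>
  </property>
  <property reference="BV2">
    <town>Duras</town>
    <number1 name="Land area" value="900"/>
    <pictures/>
  </property>
</properties>
XML;

$reader = new XMLReader();
$reader->XML($xml);
$doc     = new DOMDocument();
$summary = array();

$more = $reader->read();
while ($more) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'property') {
        // Build a DOM copy of this one <property> subtree and wrap it in SimpleXML.
        $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $summary[] = array(
            'id'     => (string) $node['reference'],
            'town'   => (string) $node->town,
            'plot'   => (int) $node->number1['value'],
            'images' => count($node->pictures->picture),
        );
        // next() skips the subtree that was just processed.
        $more = $reader->next();
    } else {
        $more = $reader->read();
    }
}
$reader->close();

print_r($summary);
```

With this layout the attribute and child lookups read much like simplexml_load_file code, while peak memory should depend on the largest single property rather than on the whole file.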

hielo commented:
Try:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;

$pictures=NULL;

$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName>Beaux Villages</AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        elseif($xmlReader->localName == 'advert_heading') {
            //$xmlReader->read();
            //$title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$xmlReader->value."</Title>\r\n";
        }
        elseif($xmlReader->localName == 'main_advert') {
            //$xmlReader->read();
            //$description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$xmlReader->value."</Description>\r\n";
        }
		elseif($xmlReader->localName == 'town') {
            //$xmlReader->read();
            //$town = $xmlReader->value;
	   $_xml .="\t\t<City>".$xmlReader->value."</City>\r\n";
        }
		elseif($xmlReader->localName == 'postcode') {
            //$xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(',France','',$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
		elseif($xmlReader->localName == 'property_type') {
            //$xmlReader->read();
            //$type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$xmlReader->value."</PropertyType>\r\n";
        }
		elseif($xmlReader->localName == 'numeric_price') {
            //$xmlReader->read();
            //$price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$xmlReader->value."</Price>\r\n";
        }
		elseif($xmlReader->localName == 'bedrooms') {
            //$xmlReader->read();
            //$bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$xmlReader->value."</Bedrooms>\r\n";
        }
		elseif($xmlReader->localName == 'bathrooms') {
            //$xmlReader->read();
	    $_xml .="\t\t<Bathrooms>".$xmlReader->value."</Bathrooms>\r\n";
        }
		elseif($xmlReader->localName == 'number1') {

			# Get rid of this.  You don't need to advance to the next 'numberX'
			#node.  You just need the 'value' attribute of the current node.
            //$xmlReader->read();


	    $_xml .="\t\t<PlotSize>".$xmlReader->getAttribute('value')."</PlotSize>\r\n";
        }

		# To get the pictures, create an empty array
        elseif($xmlReader->localName == 'pictures') {
			$pictures=array();	
		}

		# on next iteration of main 'while()', the 'read()' moves onto next node =>'picture'
		# So add one element array('caption'=>'', 'uri'=>'')
		# $pictures is no longer empty.  You still don't know what is the uri, but once again
		# the main 'while()' will advance to next node via read()
        elseif($xmlReader->localName == 'picture') {
			$pictures[]=array('caption'=>$xmlReader->getAttribute('name'), 'uri'=>'');
		}

		# now that 'read()' brought you to filename, you just need to add the image path to the uri
		# of the last array you added to $pictures
        elseif($xmlReader->localName == 'filename') {
			$index=count($pictures)-1;
			$pictures[$index]['uri']=$xmlReader->value;
			$index=null;
		}
        elseif($xmlReader->localName == 'floorplans') {
            // got to end
	    	$_xml .="\t\t<WCs></WCs>\r\n";
	    	$_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";

			if( !is_null($pictures) )
			{
				$_xml .= '<imageCount>'.count($pictures).'</imageCount>'.PHP_EOL;
				$_xml .= '<Images>'.PHP_EOL;
				foreach($pictures as $j=>$v)
				{
					$_xml .='<Image'.(1+$j).' caption="'.$v['caption'].'">'.$v['uri'].'</Image'.(1+$j).'>'.PHP_EOL;
				}
				$_xml .= '</Images>'.PHP_EOL;
				$pictures=NULL;
			}
			else
			{
				$_xml .='<imageCount>0</imageCount>';
			}


            $_xml .="\t</Property>\r\n"; 
	    $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);

fionafenton (Author) commented:
Brilliant! Thanks.
It almost worked. I had to add back in most of the $xmlReader->read(); that you'd commented out, but the logic for getting the photo values and attributes was spot on.
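
A short sketch of why the read() calls were needed for the text fields: on an element node, $xmlReader->value is an empty string, because the text lives in a child text node. readString() is one alternative that fetches that text without advancing the reader. The sample XML and the $seen / $line names are made up for illustration, and the htmlspecialchars() call is an addition that keeps characters such as & valid in the rewritten file.

```php
<?php
// Hypothetical sample input using element names from the thread.
$xml = '<property><town>Eymet</town>'
     . '<advert_heading>House &amp; garden</advert_heading></property>';

$reader = new XMLReader();
$reader->XML($xml);

$seen = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'advert_heading') {
        // An element node carries no value of its own.
        $seen['element_value'] = $reader->value;
        // The text content of the element, with entities already decoded.
        $seen['text'] = $reader->readString();
    }
}
$reader->close();

// Re-escape the text before writing it into the new XML.
$line = "\t\t<Title>" . htmlspecialchars($seen['text'], ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</Title>\r\n";
echo $line;
```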