Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Parsing large XML file

Posted on 2013-01-03
9
Medium Priority
?
456 Views
Last Modified: 2013-05-14
I need to (basically) rewrite a large xml file. It's too large to use simplexml_load_file so instead have been trying to use XMLReader and then use a foreach statement to access keys and values.
This works fine for accessing most nodes but there are some that are not so straight forward.
Example:
<number1 name="Land area (+/- 5%) in m2 " value="4500"/>
<number2 name="Habitable area (+/- 5%) in m2" value="200"/>
<number3 name="DPE numeric value (Admin) " value="274"/>
where I need to access the name and value values.

There are also nodes with children that I need to access
Example:
<pictures>
     <picture name="Photo 6">
          <filename>http://etc.jpg </filename>
     </picture>
     <picture name="Photo 7">
          <filename>http://etc.jpg</filename>
     </picture>

Searching forum and googling has left me confused.

I need to extract the fields we require and rewrite into another (simpler) xml file.
Any help with this will me most appreciated.
0
Comment
Question by:fionafenton
  • 4
  • 3
9 Comments
 
LVL 17

Expert Comment

by:Chris Harte
ID: 38740849
You will need to use XMLReader.

http://uk3.php.net/manual/en/book.xmlreader.php

If nobody else comes up with an example I will get one working for you tomorrow.
0
 
LVL 1

Author Comment

by:fionafenton
ID: 38740986
I am using XMLReader and am getting it to work except when I try to grab some attributes.
Here's my code so far:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;
$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName>Beaux Villages</AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        if($xmlReader->localName == 'advert_heading') {
            $xmlReader->read();
            $title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$title."</Title>\r\n";
        }
        if($xmlReader->localName == 'main_advert') {
            $xmlReader->read();
            $description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$description."</Description>\r\n";
        }
	if($xmlReader->localName == 'town') {
            $xmlReader->read();
            $town = $xmlReader->value;
	   $_xml .="\t\t<City>".$town."</City>\r\n";
        }
	if($xmlReader->localName == 'postcode') {
            $xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(",France","",$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
	if($xmlReader->localName == 'property_type') {
            $xmlReader->read();
            $type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$type."</PropertyType>\r\n";
        }
	if($xmlReader->localName == 'numeric_price') {
            $xmlReader->read();
            $price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$price."</Price>\r\n";
        }
	if($xmlReader->localName == 'bedrooms') {
            $xmlReader->read();
            $bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$bedrooms."</Bedrooms>\r\n";
        }
	if($xmlReader->localName == 'bathrooms') {
             $xmlReader->read();
            $bathrooms = $xmlReader->value;
	    $_xml .="\t\t<Bathrooms>".$bathrooms."</Bathrooms>\r\n";
        }
	if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
        }
        if($xmlReader->localName == 'floorplans') {
            // got to end
	    $_xml .="\t\t<WCs></WCs>\r\n";
	    $_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";
            $_xml .="\t</Property>\r\n"; 
			      $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);

Open in new window


When trying to access attributes this works
if($xmlReader->localName == 'property') {
           $id = $xmlReader->getAttribute('reference');
	   $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }

Open in new window

But this doesn't and I can't see why
if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }

Open in new window


And I've still to work out how to extract the <picture> attributes and the associated filename url
0
 
LVL 82

Expert Comment

by:hielo
ID: 38742707
>>But this doesn't and I can't see why

if you were to add comments to your code it might help:
//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();

            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');

            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }

Open in new window


From the comments above, it should be clear that your code is looking for:
<number1 />

and expects that the node/element that follows will always have a "value" attribute.  So this will work:
<number1 value="1" />
<number20 value="2" />  <= $plot will get its value from here.

This will not work:
<number1 value="1" />
<picture>test.jpg</picture>  <= there is no value attribute here.

Somewhere in your XML file there must be <number1> node/element that is followed by another element that does NOT have a "value" attribute.

If you expect the next node after <number1> to <numberX>, then check the first six chars of the localName after advancing the "node pointer":

//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();


if( substr($xmlReader->localName,0,6)=='number')
{
            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');
}
else
{
  //this should tell you which node it is currently reading
  echo 'no value attribute detected on node '. $xmlReader->localName;
  $plot=0;
}
            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }

Open in new window


If you don't care to find out what is the name of the node that is giving your problems,
on your existing code, changing:
$plot = $xmlReader->getAttribute('value');

to:
$plot = intval( @$xmlReader->getAttribute('value'),10);

should force $plot to be set to zero when no 'value' exists.

Regards,
Hielo
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 
LVL 1

Author Comment

by:fionafenton
ID: 38743349
Thanks for your input Hielo but I'm afraid it didn't work.
Through a process of trial and error I've discovered that removing the line
$xmlReader->read();
it works.  (And looking back at the code snippets I previously posted it is now obvious!)

So one problem solved. On to the next ...

The original xml has the following format
<pictures>
  <picture name="Photo 1">
        <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 2">
         <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 3">
         <filename>http://www.etc.JPG</filename>
   </picture>
  etc ....

</pictures>

Open in new window

I need to convert that to the following format
<imageCount>8</imageCount>
<Images>
      <Image1>http://www.etc.jpg</Image1>    
      <Image2>http://www.etc.jpg</Image2>
      <Image3>http://www.etc.jpg</Image3>
  etc...

</Images>

Open in new window

The total number of images varies. There doesn't appear to be a maximum number of images and the Name attributes don't follow a pattern.
So I need to loop through all the picture nodes and collect the child filename values and also keep a running count of the picture nodes.
0
 
LVL 82

Expert Comment

by:hielo
ID: 38744277
Have you considered using XSL?  If that is OK, then save the following  as hielo.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="xml" indent="yes"/>
	<xsl:template match="/agency/branches/branch">
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="properties">
		<!-- This "if" clause is meant to exclude branches with no properties ex: Sales Support -->
		<xsl:if test="count(*) &gt; 0">
			<xsl:element name="Properties">
				<xsl:attribute name="branch"><xsl:value-of select="../@name"/></xsl:attribute>
				<xsl:apply-templates select="property"/>
			</xsl:element>
		</xsl:if>
	</xsl:template>

	<xsl:template match="property">
		<Property>
			<Agent>BV</Agent>
			<AgentName>Beaux Villages</AgentName>
			<ID><xsl:value-of select="@reference"/></ID>
			<Title><xsl:value-of select="advert_heading"/></Title>
			<Description><xsl:value-of select="main_advert"/></Description>
			<City><xsl:value-of select="town"/></City>
			<PostCode><xsl:value-of select="translate(postcode,'France','')"/></PostCode>
			<PropertyType><xsl:value-of select="property_type"/></PropertyType>
			<Price><xsl:value-of select="numeric_price"/></Price>
			<Bedrooms><xsl:value-of select="bedrooms"/></Bedrooms>
			<Bathrooms><xsl:value-of select="bathrooms"/></Bathrooms>
			<PlotSize><xsl:value-of select="number1/@value"/></PlotSize>
			<WCs></WCs>
			<Locality></Locality>
			<Status>For Sale</Status>
			<Pictures><xsl:apply-templates select="pictures/picture"/></Pictures>
		</Property>
	</xsl:template>

 	<xsl:template match="picture">
		<Picture>
			<!-- These will add and "id" and "caption" attribute to the enclosing element "Picture" in the output -->
			<xsl:attribute name="id"><xsl:value-of select="position()"/></xsl:attribute>
			<xsl:attribute name="caption"><xsl:value-of select="@name"/></xsl:attribute>

			<!-- This retrieves the text-node value from filename and uses it as the text-node value for "Picture" (because it is enclosed in the "Picture" node)  -->
			<xsl:value-of select="filename" />
		</Picture>
	</xsl:template>

</xsl:stylesheet>

Open in new window


Then save the following as hielo.php:
<?php
//hielo.php

// Load the XML source
$xml = new DOMDocument;

# you need to provide the correct path below so that it "points" to the exact location 
# of properties2.xml
$xml->load('/path/to/properties2.xml');

$xsl = new DOMDocument;

# this too needs the exact location of the xsl file
$xsl->load('/path/to/hielo.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

# here provide the correct path of where you want the output saved.
file_put_contents('/path/to/outputFile.xml', $proc->transformToXml($xml) );

?>

Open in new window


On another note, given the size of your file, if possible, I suggest you consider porting your project to a db.
0
 
LVL 1

Author Comment

by:fionafenton
ID: 38746707
I tried your code but it's causing all sorts of memory problems and won't run.

I'm almost there with what I've already done. All I need to know is how to access the <filename> values for each <pictures> parent (and to keep a running count of how many <filename> children there are.
0
 
LVL 82

Accepted Solution

by:
hielo earned 2000 total points
ID: 38746863
try:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;

$pictures=NULL;

$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName><Company Name></AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        elseif($xmlReader->localName == 'advert_heading') {
            //$xmlReader->read();
            //$title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$xmlReader->value."</Title>\r\n";
        }
        elseif($xmlReader->localName == 'main_advert') {
            //$xmlReader->read();
            //$description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$xmlReader->value."</Description>\r\n";
        }
		elseif($xmlReader->localName == 'town') {
            //$xmlReader->read();
            //$town = $xmlReader->value;
	   $_xml .="\t\t<City>".$xmlReader->value."</City>\r\n";
        }
		elseif($xmlReader->localName == 'postcode') {
            //$xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(',France','',$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
		elseif($xmlReader->localName == 'property_type') {
            //$xmlReader->read();
            //$type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$xmlReader->value."</PropertyType>\r\n";
        }
		elseif($xmlReader->localName == 'numeric_price') {
            //$xmlReader->read();
            //$price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$xmlReader->value."</Price>\r\n";
        }
		elseif($xmlReader->localName == 'bedrooms') {
            //$xmlReader->read();
            //$bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$xmlReader->value."</Bedrooms>\r\n";
        }
		elseif($xmlReader->localName == 'bathrooms') {
            //$xmlReader->read();
	    $_xml .="\t\t<Bathrooms>".$xmlReader->value."</Bathrooms>\r\n";
        }
		elseif($xmlReader->localName == 'number1') {

			# Get rid of this.  You don't need to advance to the next 'numberX'
			#node.  You just need the 'value' attribute of the current node.
            //$xmlReader->read();


	    $_xml .="\t\t<PlotSize>".$xmlReader->getAttribute('value')."</PlotSize>\r\n";
        }

		# To get the pictures, create an empty array
        elseif($xmlReader->localName == 'pictures') {
			$pictures=array();	
		}

		# on next iteration of main 'while()', the 'read()' moves onto next node =>'picture'
		# So add one element array('caption'=>'', 'uri'=>'')
		# $pictures is no longer empty.  You still don't know what is the uri, but once again
		# the main 'while()' will advance to next node via read()
        elseif($xmlReader->localName == 'picture') {
			$pictures[]=array('caption'=>$xmlReader->getAttribute('name'), 'uri'=>'');
		}

		# now that 'read()' brought you to filename, you just need to add the image path to the uri
		# of the last array you added to $pictures
        elseif($xmlReader->localName == 'filename') {
			$index=count($pictures)-1;
			$pictures[$index]['uri']=$xmlReader->value;
			$index=null;
		}
        elseif($xmlReader->localName == 'floorplans') {
            // got to end
	    	$_xml .="\t\t<WCs></WCs>\r\n";
	    	$_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";

			if( !is_null($pictures) )
			{
				$_xml .= '<imageCount>'.count($pictures).'</imageCount>'.PHP_EOL;
				$_xml .= '<Images>'.PHP_EOL;
				foreach($pictures as $j=>$v)
				{
					$_xml .='<Image'.(1+$j).' caption="'.$v['caption'].'">'.$v['uri'].'</Image>'.PHP_EOL);
				}
				$_xml .= '</Images>'.PHP_EOL;
				$pictures=NULL;
			}
			else
			{
				$_xml .='<imageCount>0</imageCount>';
			}


            $_xml .="\t</Property>\r\n"; 
	    $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);

Open in new window

0
 
LVL 1

Author Closing Comment

by:fionafenton
ID: 38748437
Brilliant! Thanks.
It almost worked. I had to add back in most of the $xmlReader->read(); that you'd commented out, but the logic for getting the photo values and attributes was spot on.
0

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The Confluence of Individual Knowledge and the Collective Intelligence At this writing (summer 2013) the term API (http://dictionary.reference.com/browse/API?s=t) has made its way into the popular lexicon of the English language.  A few years ago, …
Password hashing is better than message digests or encryption, and you should be using it instead of message digests or encryption.  Find out why and how in this article, which supplements the original article on PHP Client Registration, Login, Logo…
The viewer will learn how to dynamically set the form action using jQuery.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …
Suggested Courses

971 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question