
Parsing large XML file

fionafenton asked
Last Modified: 2013-05-14
I need to (basically) rewrite a large XML file. It's too large to load with simplexml_load_file, so instead I have been trying to use XMLReader and then a foreach statement to access keys and values.
This works fine for most nodes, but some are not so straightforward.
Example:
<number1 name="Land area (+/- 5%) in m2 " value="4500"/>
<number2 name="Habitable area (+/- 5%) in m2" value="200"/>
<number3 name="DPE numeric value (Admin) " value="274"/>
Here I need to access both the name and the value attributes.

There are also nodes with children that I need to access.
Example:
<pictures>
     <picture name="Photo 6">
          <filename>http://etc.jpg </filename>
     </picture>
     <picture name="Photo 7">
          <filename>http://etc.jpg</filename>
     </picture>

Searching the forum and googling have left me confused.

I need to extract the fields we require and write them into another (simpler) XML file.
Any help with this will be most appreciated.

Chris Harte

Commented:
You will need to use XMLReader.

http://uk3.php.net/manual/en/book.xmlreader.php

If nobody else comes up with an example I will get one working for you tomorrow.

Author

Commented:
I am using XMLReader and am getting it to work except when I try to grab some attributes.
Here's my code so far:
$file1   = "http://<URL>/properties2.xml";
$newxml="beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
 unlink($origdir.$newxml);
}


$file=fopen($origdir.$newxml,"a");
$_xml ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
 $_xml .="<Properties>\r\n";

$i=0;
$xmlReader = new XMLReader();
$xmlReader->open($file1);
while($xmlReader->read()) {
        // check to ensure nodeType is an Element not attribute or #Text 
    if($xmlReader->nodeType == XMLReader::ELEMENT) {
        if($xmlReader->localName == 'property') {
	    $_xml .="\t<Property>\r\n";
	    $_xml .="\t\t<Agent>BV</Agent>\r\n";
	    $_xml .="\t\t<AgentName>Beaux Villages</AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
	    $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }
        if($xmlReader->localName == 'advert_heading') {
            $xmlReader->read();
            $title = $xmlReader->value;
	    $_xml .="\t\t<Title>".$title."</Title>\r\n";
        }
        if($xmlReader->localName == 'main_advert') {
            $xmlReader->read();
            $description = $xmlReader->value;
	    $_xml .="\t\t<Description>".$description."</Description>\r\n";
        }
	if($xmlReader->localName == 'town') {
            $xmlReader->read();
            $town = $xmlReader->value;
	   $_xml .="\t\t<City>".$town."</City>\r\n";
        }
	if($xmlReader->localName == 'postcode') {
            $xmlReader->read();
            $postcode = $xmlReader->value;
	    $postcode = str_replace(",France","",$postcode);
 	    $_xml .="\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
	if($xmlReader->localName == 'property_type') {
            $xmlReader->read();
            $type = $xmlReader->value;
	   $_xml .="\t\t<PropertyType>".$type."</PropertyType>\r\n";
        }
	if($xmlReader->localName == 'numeric_price') {
            $xmlReader->read();
            $price = $xmlReader->value;
	    $_xml .="\t\t<Price>".(int)$price."</Price>\r\n";
        }
	if($xmlReader->localName == 'bedrooms') {
            $xmlReader->read();
            $bedrooms = $xmlReader->value;
	   $_xml .="\t\t<Bedrooms>".$bedrooms."</Bedrooms>\r\n";
        }
	if($xmlReader->localName == 'bathrooms') {
             $xmlReader->read();
            $bathrooms = $xmlReader->value;
	    $_xml .="\t\t<Bathrooms>".$bathrooms."</Bathrooms>\r\n";
        }
	if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
        }
        if($xmlReader->localName == 'floorplans') {
            // got to end
	    $_xml .="\t\t<WCs></WCs>\r\n";
	    $_xml .="\t\t<Locality></Locality>\r\n";
            $_xml .="\t\t<Status>For Sale</Status>\r\n";
            $_xml .="\t</Property>\r\n"; 
			      $i++;
        }
       
    }
} 

$_xml .="</Properties>\r\n";
    fwrite($file, $_xml);

    fclose($file);



When trying to access attributes this works
if($xmlReader->localName == 'property') {
           $id = $xmlReader->getAttribute('reference');
	   $_xml .="\t\t<ID>".$id."</ID>\r\n";
        }


But this doesn't and I can't see why
if($xmlReader->localName == 'number1') {
            $xmlReader->read();
	    $plot = $xmlReader->getAttribute('value');
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



And I've still to work out how to extract the <picture> attributes and the associated filename url
hielo

Commented:
>>But this doesn't and I can't see why

If you were to add comments to your code it might help:
//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

            //if so, make the next node the "current/active" node
            $xmlReader->read();

            //now get the 'value' attribute from the "current/active" node.
	    $plot = $xmlReader->getAttribute('value');

            //append <PlotSize> onto variable.
	    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
 }



From the comments above, it should be clear that your code is looking for:
<number1 />

and expects that the node/element that follows will always have a "value" attribute.  So this will work:
<number1 value="1" />
<number20 value="2" />  <= $plot will get its value from here.

This will not work:
<number1 value="1" />
<picture>test.jpg</picture>  <= there is no value attribute here.

Somewhere in your XML file there must be a <number1> node/element that is followed by another element that does NOT have a "value" attribute.

If you expect the next node after <number1> to be <numberX>, then check the first six chars of the localName after advancing the "node pointer":

//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

    //if so, make the next node the "current/active" node
    $xmlReader->read();

    if(substr($xmlReader->localName,0,6) == 'number') {
        //now get the 'value' attribute from the "current/active" node.
        $plot = $xmlReader->getAttribute('value');
    }
    else {
        //this should tell you which node it is currently reading
        echo 'no value attribute detected on node '. $xmlReader->localName;
        $plot = 0;
    }

    //append <PlotSize> onto variable.
    $_xml .= "\t\t<PlotSize>".$plot."</PlotSize>\r\n";
}



If you don't care to find out the name of the node that is giving you problems,
then in your existing code, changing:
$plot = $xmlReader->getAttribute('value');

to:
$plot = intval( @$xmlReader->getAttribute('value'),10);

should force $plot to be set to zero when no 'value' exists.

Regards,
Hielo

Author

Commented:
Thanks for your input Hielo but I'm afraid it didn't work.
Through a process of trial and error I've discovered that it works if I remove the line
$xmlReader->read();
(And looking back at the code snippets I previously posted, it is now obvious!)
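A likely explanation for why dropping the read() call helps: <number1 .../> is an empty element, so its attributes sit on the element node itself. Calling read() first moves the reader on to the next node (in an indented file, usually the whitespace after the tag), where getAttribute('value') finds nothing. A minimal sketch, assuming a <property> wrapper around the sample lines from the question; the wrapper and the $attrs array are made up for illustration:

```php
<?php
// Sample data copied from the question; the <property> wrapper is an assumption.
$xml = <<<XML
<property reference="BV1">
  <number1 name="Land area (+/- 5%) in m2 " value="4500"/>
  <number2 name="Habitable area (+/- 5%) in m2" value="200"/>
</property>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$attrs = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
        && substr($reader->localName, 0, 6) == 'number') {
        // The attributes belong to the element node the reader is on now,
        // so fetch them before read() is called again.
        $attrs[$reader->localName] = [
            'name'  => trim($reader->getAttribute('name')),
            'value' => (int) $reader->getAttribute('value'),
        ];
    }
}
$reader->close();

print_r($attrs);
```

The read() calls in the other branches are still needed, because there the wanted value is the text node inside the element rather than an attribute on it.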

So one problem solved. On to the next ...

The original xml has the following format
<pictures>
  <picture name="Photo 1">
        <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 2">
         <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 3">
         <filename>http://www.etc.JPG</filename>
   </picture>
  etc ....

</pictures>


I need to convert that to the following format
<imageCount>8</imageCount>
<Images>
      <Image1>http://www.etc.jpg</Image1>    
      <Image2>http://www.etc.jpg</Image2>
      <Image3>http://www.etc.jpg</Image3>
  etc...

</Images>


The total number of images varies. There doesn't appear to be a maximum number of images and the Name attributes don't follow a pattern.
So I need to loop through all the picture nodes and collect the child filename values and also keep a running count of the picture nodes.
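One way this loop could look with XMLReader, offered as a sketch rather than the accepted answer: when the reader reaches <pictures>, keep reading until the matching end tag, collect the text of every <filename>, and emit the count once the list is complete. The example.com URLs, the <property> wrapper and the $out / $images variable names are assumptions for illustration.

```php
<?php
// Sample data in the shape shown above; URLs and wrapper are placeholders.
$xml = <<<XML
<property reference="BV1">
  <pictures>
    <picture name="Photo 1">
      <filename>http://www.example.com/1.JPG</filename>
    </picture>
    <picture name="Photo 2">
      <filename>http://www.example.com/2.JPG </filename>
    </picture>
  </pictures>
</property>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$out = '';
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
        && $reader->localName == 'pictures'
        && !$reader->isEmptyElement) {
        $images = [];
        // Walk forward until the matching </pictures> end tag.
        while ($reader->read()) {
            if ($reader->nodeType == XMLReader::END_ELEMENT
                && $reader->localName == 'pictures') {
                break;
            }
            if ($reader->nodeType == XMLReader::ELEMENT
                && $reader->localName == 'filename') {
                // readString() returns the text content of the current element.
                $images[] = trim($reader->readString());
            }
        }
        // The count is only known after the whole <pictures> block is read.
        $out .= "\t\t<imageCount>" . count($images) . "</imageCount>\r\n";
        $out .= "\t\t<Images>\r\n";
        foreach ($images as $n => $url) {
            $tag = 'Image' . ($n + 1);
            $out .= "\t\t\t<$tag>" . htmlspecialchars($url, ENT_XML1) . "</$tag>\r\n";
        }
        $out .= "\t\t</Images>\r\n";
    }
}
$reader->close();

echo $out;
```

In this sketch an empty <pictures/> element is skipped and produces no output. The htmlspecialchars() call is there because URLs often contain &, which would otherwise make the output file invalid.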
hielo

Commented:
Have you considered using XSL?  If that is OK, then save the following  as hielo.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="xml" indent="yes"/>
	<xsl:template match="/agency/branches/branch">
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="properties">
		<!-- This "if" clause is meant to exclude branches with no properties ex: Sales Support -->
		<xsl:if test="count(*) &gt; 0">
			<xsl:element name="Properties">
				<xsl:attribute name="branch"><xsl:value-of select="../@name"/></xsl:attribute>
				<xsl:apply-templates select="property"/>
			</xsl:element>
		</xsl:if>
	</xsl:template>

	<xsl:template match="property">
		<Property>
			<Agent>BV</Agent>
			<AgentName>Beaux Villages</AgentName>
			<ID><xsl:value-of select="@reference"/></ID>
			<Title><xsl:value-of select="advert_heading"/></Title>
			<Description><xsl:value-of select="main_advert"/></Description>
			<City><xsl:value-of select="town"/></City>
			<Postcode><xsl:value-of select="substring-before(concat(postcode,',France'),',France')"/></Postcode>
			<PropertyType><xsl:value-of select="property_type"/></PropertyType>
			<Price><xsl:value-of select="numeric_price"/></Price>
			<Bedrooms><xsl:value-of select="bedrooms"/></Bedrooms>
			<Bathrooms><xsl:value-of select="bathrooms"/></Bathrooms>
			<PlotSize><xsl:value-of select="number1/@value"/></PlotSize>
			<WCs></WCs>
			<Locality></Locality>
			<Status>For Sale</Status>
			<Pictures><xsl:apply-templates select="pictures/picture"/></Pictures>
		</Property>
	</xsl:template>

 	<xsl:template match="picture">
		<Picture>
			<!-- These will add an "id" and a "caption" attribute to the enclosing element "Picture" in the output -->
			<xsl:attribute name="id"><xsl:value-of select="position()"/></xsl:attribute>
			<xsl:attribute name="caption"><xsl:value-of select="@name"/></xsl:attribute>

			<!-- This retrieves the text-node value from filename and uses it as the text-node value for "Picture" (because it is enclosed in the "Picture" node)  -->
			<xsl:value-of select="filename" />
		</Picture>
	</xsl:template>

</xsl:stylesheet>



Then save the following as hielo.php:
<?php
//hielo.php

// Load the XML source
$xml = new DOMDocument;

# you need to provide the correct path below so that it "points" to the exact location 
# of properties2.xml
$xml->load('/path/to/properties2.xml');

$xsl = new DOMDocument;

# this too needs the exact location of the xsl file
$xsl->load('/path/to/hielo.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

# here provide the correct path of where you want the output saved.
file_put_contents('/path/to/outputFile.xml', $proc->transformToXml($xml) );

?>



On another note, given the size of your file, if possible, I suggest you consider porting your project to a db.

Author

Commented:
I tried your code but it's causing all sorts of memory problems and won't run.

I'm almost there with what I've already done. All I need to know is how to access the <filename> values for each <pictures> parent (and how to keep a running count of how many <filename> children there are).
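The memory trouble is probably because DOMDocument::load() builds the whole file in memory before the transform starts. A possible middle ground, sketched here under assumptions, is to stream with XMLReader and expand only one <property> subtree at a time into SimpleXML, so the child nodes can be reached by name. The sample data, the example.com URL and the $summary array are invented for illustration; the element names follow the ones used earlier in the thread.

```php
<?php
// Tiny stand-in for properties2.xml; the real file's layout may differ.
$xml = <<<XML
<agency><branches><branch name="Demo"><properties>
  <property reference="BV1">
    <number1 name="Land area" value="4500"/>
    <pictures>
      <picture name="Photo 1"><filename>http://www.example.com/1.JPG</filename></picture>
    </pictures>
  </property>
  <property reference="BV2">
    <number1 name="Land area" value="900"/>
    <pictures/>
  </property>
</properties></branch></branches></agency>
XML;

$reader = new XMLReader();
$reader->XML($xml);
$doc = new DOMDocument();   // owner document for each imported subtree

$summary = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'property') {
        // expand() builds a DOM copy of just this <property> subtree;
        // importNode() attaches it to $doc so SimpleXML can wrap it.
        $property = simplexml_import_dom($doc->importNode($reader->expand(), true));

        $files = [];
        foreach ($property->pictures->picture as $picture) {
            $files[] = trim((string) $picture->filename);
        }
        $summary[] = [
            'id'         => (string) $property['reference'],
            'plot'       => (string) $property->number1['value'],
            'imageCount' => count($files),
            'images'     => $files,
        ];
    }
}
$reader->close();

print_r($summary);
```

Only one property is held as a tree at any moment, so memory use should depend on the largest single <property> rather than on the size of the file. In a real run each record would be written out as it is produced instead of being collected in an array.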
hielo

Commented:
(The accepted solution is not included in this copy of the thread.)

Author

Commented:
Brilliant! Thanks.
It almost worked. I had to add back in most of the $xmlReader->read(); that you'd commented out, but the logic for getting the photo values and attributes was spot on.
