fionafenton

Parsing large XML file

I need to (basically) rewrite a large XML file. It's too large for simplexml_load_file, so I have been trying to use XMLReader instead, with a foreach statement to access keys and values.
This works fine for most nodes, but some are not so straightforward.
Example:
<number1 name="Land area (+/- 5%) in m2 " value="4500"/>
<number2 name="Habitable area (+/- 5%) in m2" value="200"/>
<number3 name="DPE numeric value (Admin) " value="274"/>
where I need to access both the name and value attributes.

There are also nodes with children that I need to access.
Example:
<pictures>
     <picture name="Photo 6">
          <filename>http://etc.jpg </filename>
     </picture>
     <picture name="Photo 7">
          <filename>http://etc.jpg</filename>
     </picture>
</pictures>

Searching the forum and googling have left me confused.

I need to extract the fields we require and write them into another (simpler) XML file.
Any help with this will be most appreciated.
PHP, XML

8/22/2022 - Mon
Chris Harte

You will need to use XMLReader.

http://uk3.php.net/manual/en/book.xmlreader.php

If nobody else comes up with an example I will get one working for you tomorrow.
fionafenton

ASKER
I am using XMLReader and am getting it to work except when I try to grab some attributes.
Here's my code so far:
$file1  = "http://<URL>/properties2.xml";
$newxml = "beauxvillages.xml";

if (file_exists($origdir.$newxml)) {
    unlink($origdir.$newxml);
}

$file = fopen($origdir.$newxml, "a");
$_xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\r\n";
$_xml .= "<Properties>\r\n";

$i = 0;
$xmlReader = new XMLReader();
$xmlReader->open($file1);
while ($xmlReader->read()) {
    // check to ensure nodeType is an Element not attribute or #Text
    if ($xmlReader->nodeType == XMLReader::ELEMENT) {
        if ($xmlReader->localName == 'property') {
            $_xml .= "\t<Property>\r\n";
            $_xml .= "\t\t<Agent>BV</Agent>\r\n";
            $_xml .= "\t\t<AgentName>Beaux Villages</AgentName>\r\n";
            $id = $xmlReader->getAttribute('reference');
            $_xml .= "\t\t<ID>".$id."</ID>\r\n";
        }
        if ($xmlReader->localName == 'advert_heading') {
            $xmlReader->read();
            $title = $xmlReader->value;
            $_xml .= "\t\t<Title>".$title."</Title>\r\n";
        }
        if ($xmlReader->localName == 'main_advert') {
            $xmlReader->read();
            $description = $xmlReader->value;
            $_xml .= "\t\t<Description>".$description."</Description>\r\n";
        }
        if ($xmlReader->localName == 'town') {
            $xmlReader->read();
            $town = $xmlReader->value;
            $_xml .= "\t\t<City>".$town."</City>\r\n";
        }
        if ($xmlReader->localName == 'postcode') {
            $xmlReader->read();
            $postcode = $xmlReader->value;
            $postcode = str_replace(",France", "", $postcode);
            $_xml .= "\t\t<Postcode>".$postcode."</Postcode>\r\n";
        }
        if ($xmlReader->localName == 'property_type') {
            $xmlReader->read();
            $type = $xmlReader->value;
            $_xml .= "\t\t<PropertyType>".$type."</PropertyType>\r\n";
        }
        if ($xmlReader->localName == 'numeric_price') {
            $xmlReader->read();
            $price = $xmlReader->value;
            $_xml .= "\t\t<Price>".(int)$price."</Price>\r\n";
        }
        if ($xmlReader->localName == 'bedrooms') {
            $xmlReader->read();
            $bedrooms = $xmlReader->value;
            $_xml .= "\t\t<Bedrooms>".$bedrooms."</Bedrooms>\r\n";
        }
        if ($xmlReader->localName == 'bathrooms') {
            $xmlReader->read();
            $bathrooms = $xmlReader->value;
            $_xml .= "\t\t<Bathrooms>".$bathrooms."</Bathrooms>\r\n";
        }
        if ($xmlReader->localName == 'number1') {
            $xmlReader->read();
            $plot = $xmlReader->getAttribute('value');
            $_xml .= "\t\t<PlotSize>".$plot."</PlotSize>\r\n";
        }
        if ($xmlReader->localName == 'floorplans') {
            // got to end
            $_xml .= "\t\t<WCs></WCs>\r\n";
            $_xml .= "\t\t<Locality></Locality>\r\n";
            $_xml .= "\t\t<Status>For Sale</Status>\r\n";
            $_xml .= "\t</Property>\r\n";
            $i++;
        }
    }
}

$_xml .= "</Properties>\r\n";
fwrite($file, $_xml);

fclose($file);

When trying to access attributes, this works:
if ($xmlReader->localName == 'property') {
    $id = $xmlReader->getAttribute('reference');
    $_xml .= "\t\t<ID>".$id."</ID>\r\n";
}

But this doesn't, and I can't see why:
if ($xmlReader->localName == 'number1') {
    $xmlReader->read();
    $plot = $xmlReader->getAttribute('value');
    $_xml .= "\t\t<PlotSize>".$plot."</PlotSize>\r\n";
}

And I've still to work out how to extract the <picture> attributes and the associated filename url
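
A side note on the output half of the code above: it concatenates text straight into $_xml, so an & or < in a title or description would produce malformed XML, and the whole result is held in memory until the final fwrite. A minimal sketch of one alternative, using PHP's XMLWriter, which escapes text as it writes. The sample ID and title values are made up for illustration, and openMemory() is used only to keep the example self-contained; openUri() with the output path would stream to a file instead.

```php
<?php
// Sketch only: element names follow the thread, the values are invented samples.
$writer = new XMLWriter();
$writer->openMemory();            // for a file: $writer->openUri($origdir.$newxml);
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');

$writer->startElement('Properties');
$writer->startElement('Property');
$writer->writeElement('Agent', 'BV');
$writer->writeElement('ID', 'BV1');
// The ampersand is escaped by XMLWriter, no manual handling needed.
$writer->writeElement('Title', 'House & garden');
$writer->endElement();            // </Property>
$writer->endElement();            // </Properties>

$writer->endDocument();
$output = $writer->outputMemory();

echo $output;
```

When writing to a file with openUri(), calling flush() after each <Property> keeps the writer's buffer small.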
hielo

>>But this doesn't and I can't see why

If you were to add comments to your code, it might help:
//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

    //if so, make the next node the "current/active" node
    $xmlReader->read();

    //now get the 'value' attribute from the "current/active" node.
    $plot = $xmlReader->getAttribute('value');

    //append <PlotSize> onto variable.
    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
}

From the comments above, it should be clear that your code is looking for:
<number1 />

and expects that the node/element that follows will always have a "value" attribute.  So this will work:
<number1 value="1" />
<number20 value="2" />  <= $plot will get its value from here.

This will not work:
<number1 value="1" />
<picture>test.jpg</picture>  <= there is no value attribute here.

Somewhere in your XML file there must be a <number1> node/element that is followed by another element that does NOT have a "value" attribute.

If you expect the next node after <number1> to be <numberX>, then check the first six chars of the localName after advancing the "node pointer":

//check if the "current/active" node's name equals 'number1'
if($xmlReader->localName == 'number1') {

    //if so, make the next node the "current/active" node
    $xmlReader->read();

    if( substr($xmlReader->localName,0,6)=='number')
    {
        //now get the 'value' attribute from the "current/active" node.
        $plot = $xmlReader->getAttribute('value');
    }
    else
    {
        //this should tell you which node it is currently reading
        echo 'no value attribute detected on node '. $xmlReader->localName;
        $plot=0;
    }

    //append <PlotSize> onto variable.
    $_xml .="\t\t<PlotSize>".$plot."</PlotSize>\r\n";
}

If you don't care to find out the name of the node that is giving you problems,
then in your existing code, changing:
$plot = $xmlReader->getAttribute('value');

to:
$plot = intval( @$xmlReader->getAttribute('value'),10);

should force $plot to be set to zero when no 'value' exists.

Regards,
Hielo
fionafenton

ASKER
Thanks for your input Hielo but I'm afraid it didn't work.
Through a process of trial and error I've discovered that it works if I remove the line
$xmlReader->read();
(And looking back at the code snippets I previously posted, it is now obvious!)
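
For anyone hitting the same thing, here is a small sketch of why removing that line helps. Attributes belong to the element the reader is currently positioned on, so getAttribute() has to be called before the reader advances. The <property> wrapper and the inline sample are assumptions made to keep the example self-contained; the real feed would be opened with $xmlReader->open() as in the code above.

```php
<?php
// Sample fragment modelled on the question; the <property> wrapper is assumed.
$xml = <<<XML
<property reference="BV1">
  <number1 name="Land area (+/- 5%) in m2 " value="4500"/>
  <number2 name="Habitable area (+/- 5%) in m2" value="200"/>
</property>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$attributes = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'number1') {
        // The reader is ON <number1> here, so its attributes are readable now.
        $attributes['name']  = trim($reader->getAttribute('name'));
        $attributes['value'] = $reader->getAttribute('value');

        // Advancing first (as the original code did) leaves <number1> behind.
        // In this sample the next node is the whitespace before <number2>,
        // which has no attributes at all.
        $reader->read();
        $attributes['after_read'] = $reader->getAttribute('value');
    }
}
$reader->close();

var_dump($attributes);
```

The extra read() is still needed for elements such as <town>, where the wanted data is the text node inside the element and not an attribute on it.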

So one problem solved. On to the next ...

The original xml has the following format
<pictures>
  <picture name="Photo 1">
        <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 2">
         <filename>http://www.etc.JPG</filename>
   </picture>
   <picture name="Photo 3">
         <filename>http://www.etc.JPG</filename>
   </picture>
  etc ....

</pictures>

I need to convert that to the following format
<imageCount>8</imageCount>
<Images>
      <Image1>http://www.etc.jpg</Image1>    
      <Image2>http://www.etc.jpg</Image2>
      <Image3>http://www.etc.jpg</Image3>
  etc...

</Images>

The total number of images varies. There doesn't appear to be a maximum number of images, and the name attributes don't follow a pattern.
So I need to loop through all the picture nodes and collect the child filename values and also keep a running count of the picture nodes.
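
One way the loop described above could be sketched with XMLReader alone: collect each <filename> into an array while inside <pictures>, then write the count and the numbered elements when </pictures> is reached. This is an illustration under assumptions, not the accepted solution from this thread; the example.com URLs and the <property> wrapper are placeholders.

```php
<?php
// Sample input modelled on the thread; the URLs are placeholders.
$xml = <<<XML
<property reference="BV1">
  <pictures>
    <picture name="Photo 1">
      <filename>http://www.example.com/a.JPG</filename>
    </picture>
    <picture name="Front view">
      <filename>http://www.example.com/b.JPG </filename>
    </picture>
  </pictures>
</property>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$_xml   = '';
$images = [];

while ($reader->read()) {
    $isStart = $reader->nodeType === XMLReader::ELEMENT;
    $isEnd   = $reader->nodeType === XMLReader::END_ELEMENT;

    // A new <pictures> block starts: reset the list for this property.
    if ($isStart && $reader->localName === 'pictures') {
        $images = [];
    }

    // readString() returns the text content of the current element.
    if ($isStart && $reader->localName === 'filename') {
        $images[] = trim($reader->readString());
    }

    // </pictures> reached: the array length is the running count.
    if ($isEnd && $reader->localName === 'pictures') {
        $_xml .= "\t\t<imageCount>" . count($images) . "</imageCount>\r\n";
        $_xml .= "\t\t<Images>\r\n";
        foreach ($images as $index => $url) {
            $n = $index + 1;
            $_xml .= "\t\t\t<Image{$n}>" . htmlspecialchars($url, ENT_XML1 | ENT_QUOTES) . "</Image{$n}>\r\n";
        }
        $_xml .= "\t\t</Images>\r\n";
    }
}
$reader->close();

echo $_xml;
```

Note that a self-closing <pictures/> produces no end-element node, so a property without photos would need its own handling if an empty <Images> block is required.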
hielo

Have you considered using XSL? If that is OK, then save the following as hielo.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="xml" indent="yes"/>
	<xsl:template match="/agency/branches/branch">
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="properties">
		<!-- This "if" clause is meant to exclude branches with no properties ex: Sales Support -->
		<xsl:if test="count(*) &gt; 0">
			<xsl:element name="Properties">
				<xsl:attribute name="branch"><xsl:value-of select="../@name"/></xsl:attribute>
				<xsl:apply-templates select="property"/>
			</xsl:element>
		</xsl:if>
	</xsl:template>

	<xsl:template match="property">
		<Property>
			<Agent>BV</Agent>
			<AgentName>Beaux Villages</AgentName>
			<ID><xsl:value-of select="@reference"/></ID>
			<Title><xsl:value-of select="advert_heading"/></Title>
			<Description><xsl:value-of select="main_advert"/></Description>
			<City><xsl:value-of select="town"/></City>
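			<!-- translate() would delete the individual letters F, r, a, n, c, e; this strips the literal suffix ",France" instead -->
			<Postcode><xsl:value-of select="substring-before(concat(postcode, ',France'), ',France')"/></Postcode>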
			<PropertyType><xsl:value-of select="property_type"/></PropertyType>
			<Price><xsl:value-of select="numeric_price"/></Price>
			<Bedrooms><xsl:value-of select="bedrooms"/></Bedrooms>
			<Bathrooms><xsl:value-of select="bathrooms"/></Bathrooms>
			<PlotSize><xsl:value-of select="number1/@value"/></PlotSize>
			<WCs></WCs>
			<Locality></Locality>
			<Status>For Sale</Status>
			<Pictures><xsl:apply-templates select="pictures/picture"/></Pictures>
		</Property>
	</xsl:template>

 	<xsl:template match="picture">
		<Picture>
			<!-- These will add an "id" and a "caption" attribute to the enclosing element "Picture" in the output -->
			<xsl:attribute name="id"><xsl:value-of select="position()"/></xsl:attribute>
			<xsl:attribute name="caption"><xsl:value-of select="@name"/></xsl:attribute>

			<!-- This retrieves the text-node value from filename and uses it as the text-node value for "Picture" (because it is enclosed in the "Picture" node)  -->
			<xsl:value-of select="filename" />
		</Picture>
	</xsl:template>

</xsl:stylesheet>

Then save the following as hielo.php:
<?php
//hielo.php

// Load the XML source
$xml = new DOMDocument;

# you need to provide the correct path below so that it "points" to the exact location 
# of properties2.xml
$xml->load('/path/to/properties2.xml');

$xsl = new DOMDocument;

# this too needs the exact location of the xsl file
$xsl->load('/path/to/hielo.xsl');

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules

# here provide the correct path of where you want the output saved.
file_put_contents('/path/to/outputFile.xml', $proc->transformToXml($xml) );

?>

On another note, given the size of your file, if possible, I suggest you consider porting your project to a db.
fionafenton

ASKER
I tried your code but it's causing all sorts of memory problems and won't run.

I'm almost there with what I've already done. All I need to know is how to access the <filename> values for each <pictures> parent (and to keep a running count of how many <filename> children there are).
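
A possible middle ground between the two approaches, sketched under assumptions: keep streaming with XMLReader, but expand only one <property> at a time into a small tree, so memory use depends on the size of a single property and not of the whole file. The sample data (town names, URLs, the <properties> wrapper) is invented for illustration; the real feed would be opened with $reader->open($file1).

```php
<?php
// Invented sample data; element and attribute names follow the thread.
$xml = <<<XML
<properties>
  <property reference="BV1">
    <town>Eymet</town>
    <number1 name="Land area (+/- 5%) in m2 " value="4500"/>
    <pictures>
      <picture name="Photo 1"><filename>http://www.example.com/a.JPG</filename></picture>
      <picture name="Photo 2"><filename>http://www.example.com/b.JPG</filename></picture>
    </pictures>
  </property>
  <property reference="BV2">
    <town>Bergerac</town>
    <number1 name="Land area (+/- 5%) in m2 " value="900"/>
  </property>
</properties>
XML;

$reader = new XMLReader();
$reader->XML($xml);
$doc = new DOMDocument();

$records = [];
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->localName !== 'property') {
        continue;
    }

    // expand() builds a DOM subtree for the current <property> only.
    $property = simplexml_import_dom($doc->importNode($reader->expand(), true));

    $images = [];
    if (isset($property->pictures)) {
        foreach ($property->pictures->picture as $picture) {
            $images[] = trim((string) $picture->filename);
        }
    }

    // Collected here to keep the sketch short; with a large feed each
    // record would be written to the output file straight away instead.
    $records[] = [
        'id'     => (string) $property['reference'],
        'town'   => (string) $property->town,
        'plot'   => (string) $property->number1['value'],
        'images' => $images,
    ];
}
$reader->close();

print_r($records);
```

Inside the loop each property can be handled with ordinary SimpleXML syntax, which avoids tracking reader position for nested nodes such as <picture> and <filename>.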
ASKER CERTIFIED SOLUTION
hielo

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
fionafenton

ASKER
Brilliant! Thanks.
It almost worked. I had to add back in most of the $xmlReader->read(); that you'd commented out, but the logic for getting the photo values and attributes was spot on.