Solved

Help parsing very large XML file

Posted on 2010-09-23
8
1,164 Views
Last Modified: 2013-12-12
Hello, I'm trying to extract product data from a very large XML file in minimal time (at least 20mb).

The reason I'm asking this question here is that the xml file contains around 25,000+ records. Doing a recursive loop is a resource hog, and looping over every element takes too long and at times will crash before the process is finished.

This is an example of the xml file's structure...

<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>

Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function.  

Also, when each record is extracted I need to know the full parent-child category it was listed under.

For example:

==================

productid = 1234
category = category1
title = ProductTitle

productid = 9999
category = category1:subcat1
title = ProductTitle

productid = 3333
category = category1:subcat2
title = ProductTitle

productid = 6355
category = category2
title = ProductTitle

===================

I've been playing around with the PHP methods below.... but I still can't get this to run at optimal speed.

new DOMDocument()
simplexml_load_file()
new XMLReader()

Anyway, there's gotta be someone out there who knows how to process a large XML file in under a minute!

I'd be forever grateful for any insight or examples!

Thanks in advance!
0
Comment
Question by:goldentine
8 Comments
 
LVL 7

Expert Comment

by:ziceva
ID: 33743003
I've always used SimpleXMLElement
Don't know how it behaves on big files, but I think is worth a try

http://www.php.net/manual/en/class.simplexmlelement.php

Use it like this:

$XML = new SimpleXMLElement((string)file_get_contents("/path/or/URL/for/file"));
$nodes = $XML->xpath("/catalog/category/name");

nodes is also a SimpleXMLElement object containing the requested data.

You might get goot results with this class.

 
0
 
LVL 4

Expert Comment

by:trencH87
ID: 33745034
SimpleXML and DOM is not suited for large xml files as it has to load the entire file into memory before processing it. I would have used XMLReader.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33760459
The first question that occurs to me is, "Why is this in XML?"

The size of the data set suggests that it would be faster to process it with a more economical transport mechanism like JSON.  As shown above, the size of the tag strings exceed the size of the payload strings.

The organization of the data set suggest that this belongs in a relational data base, not in an XML string.

Given that, "Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function." -- My suggestion is to make that a one-time load of the MySQL data base.  Then use the data base in the future, and use XML (if needed) to transmit only the updates to the data base.  It will be faster and MUCH easier to program.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 

Author Comment

by:goldentine
ID: 33839381
The fact that it's in XML format is out of my control. It's pulled from a 3rd party site. If I had it my way I'd definitely choose JSON over XML.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 33840090
I may not be understanding this well enough.  You wrote above that you were reading a local XML file, but now you say it comes from a 3rd party.  So please tell us a little more about the TRUE NATURE of the application.  How often do you have to run it?  What does some of the ACTUAL data look like?  What does the data base structure look like?  Is the data base normalized? Are there really nested tags with the same names?  What does an ACTUAL "INSERT" query look like, using the ACTUAL XML data.  Things like that may make us better able to help.

Some tips:  Increase your allowable memory size to "very large" whatever that is.  Use set_time_limit(3) or something like that after each DB update.  Do not index the data base until after you have loaded all of the data.

If the XML posted above is truly representative of your input string, it seems like it should not be too hard to process it in a timely manner, not should it be too hard to read it into memory and iterate over it with.  I copied the XML into this little script and it outputs numbers that are not scary at all.

int(760)
string(10) "19,000,000"

Standing by to learn more, ~Ray
<?php // RAY_temp_godentine.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// HOW BIG IS IT?
$len = strlen($xml);
var_dump($len);

// HOW BIG IS 25,000 X
$new = $len * 25000;
var_dump(number_format($new));

Open in new window

0
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 33840592
Here is another version, somewhat scaled up.  Output suggests that you will have no problems.

XML STRING IS 18,500,021 BYTES
MEMORY USAGE IS 20,185,088 BYTES
PROCESSING TIME IS 0.840 SECONDS

Best regards, ~Ray
<?php // RAY_temp_goldentine.php

error_reporting(E_ALL);



echo "<pre>" . PHP_EOL;

ob_start();

$start_time = microtime(TRUE);



// TEST DATA FROM THE POST AT EE

$xml = <<<XML

<catalog>

   <category>

        <name>category1</name>

        <product>

            <id>1234</id>

            <title>ProductTitle</title>

        </product>

        <category>

            <name>subcat1</name>

            <product>

                <id>9999</id>

                <title>ProductTitle</title>

            </product>

        </category>

        <category>

            <name>subcat2</name>

            <product>

                <id>3333</id>

                <title>ProductTitle</title>

            </product>

        </category>

    </category>

    <category>

        <name>category2</name>

        <product>

            <id>6355</id>

            <title>ProductTitle</title>

        </product>

    </category>

</catalog>

XML;



// CAN WE PARSE IT?

$obj = SimpleXML_Load_String($xml);

// var_dump($obj);



// CREATE 25,000 X THE INITIAL DATA

$str = $xml;

$str = ltrim( $str, '<catalog>'  . PHP_EOL);

$str = rtrim( $str, '</catalog>' . PHP_EOL);

$new = '<catalog>'  . PHP_EOL;

$k = 25000;

while ($k)

{

    $k--;

    $new .= $str;

}

$new .= '</catalog>'  . PHP_EOL;



// CAN WE WRITE IT TO DISK?

file_put_contents('RAY_temp_goldentine.txt', $new);



// CAN WE READ IT BACK AND PARSE IT?

$obj = SimpleXML_Load_File('RAY_temp_goldentine.txt');

// var_dump($obj);



// ITERATE OVER THE OBJECT

foreach ($obj as $category)

{

    $n = $category->name;

    $p = $category->product->id;

    $c = 0;

    if (isset($category->category)) $c = 1;

    echo PHP_EOL . "$n $p $c";

}

$buf = ob_get_clean();



// DATA, MEMORY AND TIME LIMITS

$len = number_format(strlen($new));

echo PHP_EOL . "XML STRING IS $len BYTES";



$mem = number_format(memory_get_usage(TRUE));

echo PHP_EOL . "MEMORY USAGE IS $mem BYTES";



$tim = number_format(microtime(TRUE) - $start_time, 3);

echo PHP_EOL . "PROCESSING TIME IS $tim SECONDS";



// OUTPUT

// echo $buf;

Open in new window

0
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 34228596
This question has been classified as abandoned and is being closed as part of the Cleanup Program.  See my comment at the end of the question for more details.
0

Featured Post

What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

Join & Write a Comment

Password hashing is better than message digests or encryption, and you should be using it instead of message digests or encryption.  Find out why and how in this article, which supplements the original article on PHP Client Registration, Login, Logo…
This article discusses four methods for overlaying images in a container on a web page
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now