Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Help parsing very large XML file

Posted on 2010-09-23
8
Medium Priority
?
1,190 Views
Last Modified: 2013-12-12
Hello, I'm trying to extract product data from a very large XML file in minimal time (at least 20mb).

The reason I'm asking this question here is that the xml file contains around 25,000+ records. Doing a recursive loop is a resource hog, and looping over every element takes too long and at times will crash before the process is finished.

This is an example of the xml file's structure...

<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>

Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function.  

Also, when each record is extracted I need to know the full parent-child category it was listed under.

For example:

==================

productid = 1234
category = category1
title = ProductTitle

productid = 9999
category = category1:subcat1
title = ProductTitle

productid = 3333
category = category1:subcat2
title = ProductTitle

productid = 6355
category = category2
title = ProductTitle

===================

I've been playing around with the PHP methods below.... but I still can't get this to run at optimal speed.

new DOMDocument()
simplexml_load_file()
new XMLReader()

Anyway, there's gotta be someone out there who knows how to process a large XML file in under a minute!

I'd be forever grateful for any insight or examples!

Thanks in advance!
0
Comment
Question by:goldentine
7 Comments
 
LVL 7

Expert Comment

by:ziceva
ID: 33743003
I've always used SimpleXMLElement
Don't know how it behaves on big files, but I think is worth a try

http://www.php.net/manual/en/class.simplexmlelement.php

Use it like this:

$XML = new SimpleXMLElement((string)file_get_contents("/path/or/URL/for/file"));
$nodes = $XML->xpath("/catalog/category/name");

nodes is also a SimpleXMLElement object containing the requested data.

You might get goot results with this class.

 
0
 
LVL 4

Expert Comment

by:trencH87
ID: 33745034
SimpleXML and DOM is not suited for large xml files as it has to load the entire file into memory before processing it. I would have used XMLReader.
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 33760459
The first question that occurs to me is, "Why is this in XML?"

The size of the data set suggests that it would be faster to process it with a more economical transport mechanism like JSON.  As shown above, the size of the tag strings exceed the size of the payload strings.

The organization of the data set suggest that this belongs in a relational data base, not in an XML string.

Given that, "Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function." -- My suggestion is to make that a one-time load of the MySQL data base.  Then use the data base in the future, and use XML (if needed) to transmit only the updates to the data base.  It will be faster and MUCH easier to program.
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:goldentine
ID: 33839381
The fact that it's in XML format is out of my control. It's pulled from a 3rd party site. If I had it my way I'd definitely choose JSON over XML.
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 33840090
I may not be understanding this well enough.  You wrote above that you were reading a local XML file, but now you say it comes from a 3rd party.  So please tell us a little more about the TRUE NATURE of the application.  How often do you have to run it?  What does some of the ACTUAL data look like?  What does the data base structure look like?  Is the data base normalized? Are there really nested tags with the same names?  What does an ACTUAL "INSERT" query look like, using the ACTUAL XML data.  Things like that may make us better able to help.

Some tips:  Increase your allowable memory size to "very large" whatever that is.  Use set_time_limit(3) or something like that after each DB update.  Do not index the data base until after you have loaded all of the data.

If the XML posted above is truly representative of your input string, it seems like it should not be too hard to process it in a timely manner, not should it be too hard to read it into memory and iterate over it with.  I copied the XML into this little script and it outputs numbers that are not scary at all.

int(760)
string(10) "19,000,000"

Standing by to learn more, ~Ray
<?php // RAY_temp_godentine.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// HOW BIG IS IT?
$len = strlen($xml);
var_dump($len);

// HOW BIG IS 25,000 X
$new = $len * 25000;
var_dump(number_format($new));

Open in new window

0
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 33840592
Here is another version, somewhat scaled up.  Output suggests that you will have no problems.

XML STRING IS 18,500,021 BYTES
MEMORY USAGE IS 20,185,088 BYTES
PROCESSING TIME IS 0.840 SECONDS

Best regards, ~Ray
<?php // RAY_temp_goldentine.php
error_reporting(E_ALL);

echo "<pre>" . PHP_EOL;
ob_start();
$start_time = microtime(TRUE);

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// CAN WE PARSE IT?
$obj = SimpleXML_Load_String($xml);
// var_dump($obj);

// CREATE 25,000 X THE INITIAL DATA
$str = $xml;
$str = ltrim( $str, '<catalog>'  . PHP_EOL);
$str = rtrim( $str, '</catalog>' . PHP_EOL);
$new = '<catalog>'  . PHP_EOL;
$k = 25000;
while ($k)
{
    $k--;
    $new .= $str;
}
$new .= '</catalog>'  . PHP_EOL;

// CAN WE WRITE IT TO DISK?
file_put_contents('RAY_temp_goldentine.txt', $new);

// CAN WE READ IT BACK AND PARSE IT?
$obj = SimpleXML_Load_File('RAY_temp_goldentine.txt');
// var_dump($obj);

// ITERATE OVER THE OBJECT
foreach ($obj as $category)
{
    $n = $category->name;
    $p = $category->product->id;
    $c = 0;
    if (isset($category->category)) $c = 1;
    echo PHP_EOL . "$n $p $c";
}
$buf = ob_get_clean();

// DATA, MEMORY AND TIME LIMITS
$len = number_format(strlen($new));
echo PHP_EOL . "XML STRING IS $len BYTES";

$mem = number_format(memory_get_usage(TRUE));
echo PHP_EOL . "MEMORY USAGE IS $mem BYTES";

$tim = number_format(microtime(TRUE) - $start_time, 3);
echo PHP_EOL . "PROCESSING TIME IS $tim SECONDS";

// OUTPUT
// echo $buf;

Open in new window

0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 34228596
This question has been classified as abandoned and is being closed as part of the Cleanup Program.  See my comment at the end of the question for more details.
0

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article will show, step by step, how to integrate R code into a R Sweave document
Q&A with Course Creator, Mark Lassoff, on the importance of HTML5 in the career of a modern-day developer.
The goal of the tutorial is to teach the user how to use functions in C++. The video will cover how to define functions, how to call functions and how to create functions prototypes. Microsoft Visual C++ 2010 Express will be used as a text editor an…
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.
Suggested Courses
Course of the Month9 days, 2 hours left to enroll

876 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question