Solved

Help parsing very large XML file

Posted on 2010-09-23
8
1,180 Views
Last Modified: 2013-12-12
Hello, I'm trying to extract product data from a very large XML file in minimal time (at least 20mb).

The reason I'm asking this question here is that the xml file contains around 25,000+ records. Doing a recursive loop is a resource hog, and looping over every element takes too long and at times will crash before the process is finished.

This is an example of the xml file's structure...

<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>

Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function.  

Also, when each record is extracted I need to know the full parent-child category it was listed under.

For example:

==================

productid = 1234
category = category1
title = ProductTitle

productid = 9999
category = category1:subcat1
title = ProductTitle

productid = 3333
category = category1:subcat2
title = ProductTitle

productid = 6355
category = category2
title = ProductTitle

===================

I've been playing around with the PHP methods below.... but I still can't get this to run at optimal speed.

new DOMDocument()
simplexml_load_file()
new XMLReader()

Anyway, there's gotta be someone out there who knows how to process a large XML file in under a minute!

I'd be forever grateful for any insight or examples!

Thanks in advance!
0
Comment
Question by:goldentine
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
8 Comments
 
LVL 7

Expert Comment

by:ziceva
ID: 33743003
I've always used SimpleXMLElement
Don't know how it behaves on big files, but I think is worth a try

http://www.php.net/manual/en/class.simplexmlelement.php

Use it like this:

$XML = new SimpleXMLElement((string)file_get_contents("/path/or/URL/for/file"));
$nodes = $XML->xpath("/catalog/category/name");

nodes is also a SimpleXMLElement object containing the requested data.

You might get goot results with this class.

 
0
 
LVL 4

Expert Comment

by:trencH87
ID: 33745034
SimpleXML and DOM is not suited for large xml files as it has to load the entire file into memory before processing it. I would have used XMLReader.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 33760459
The first question that occurs to me is, "Why is this in XML?"

The size of the data set suggests that it would be faster to process it with a more economical transport mechanism like JSON.  As shown above, the size of the tag strings exceed the size of the payload strings.

The organization of the data set suggest that this belongs in a relational data base, not in an XML string.

Given that, "Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function." -- My suggestion is to make that a one-time load of the MySQL data base.  Then use the data base in the future, and use XML (if needed) to transmit only the updates to the data base.  It will be faster and MUCH easier to program.
0
PeopleSoft Has Never Been Easier

PeopleSoft Adoption Made Smooth & Simple!

On-The-Job Training Is made Intuitive & Easy With WalkMe's On-Screen Guidance Tool.  Claim Your Free WalkMe Account Now

 

Author Comment

by:goldentine
ID: 33839381
The fact that it's in XML format is out of my control. It's pulled from a 3rd party site. If I had it my way I'd definitely choose JSON over XML.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 33840090
I may not be understanding this well enough.  You wrote above that you were reading a local XML file, but now you say it comes from a 3rd party.  So please tell us a little more about the TRUE NATURE of the application.  How often do you have to run it?  What does some of the ACTUAL data look like?  What does the data base structure look like?  Is the data base normalized? Are there really nested tags with the same names?  What does an ACTUAL "INSERT" query look like, using the ACTUAL XML data.  Things like that may make us better able to help.

Some tips:  Increase your allowable memory size to "very large" whatever that is.  Use set_time_limit(3) or something like that after each DB update.  Do not index the data base until after you have loaded all of the data.

If the XML posted above is truly representative of your input string, it seems like it should not be too hard to process it in a timely manner, not should it be too hard to read it into memory and iterate over it with.  I copied the XML into this little script and it outputs numbers that are not scary at all.

int(760)
string(10) "19,000,000"

Standing by to learn more, ~Ray
<?php // RAY_temp_godentine.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// HOW BIG IS IT?
$len = strlen($xml);
var_dump($len);

// HOW BIG IS 25,000 X
$new = $len * 25000;
var_dump(number_format($new));

Open in new window

0
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 33840592
Here is another version, somewhat scaled up.  Output suggests that you will have no problems.

XML STRING IS 18,500,021 BYTES
MEMORY USAGE IS 20,185,088 BYTES
PROCESSING TIME IS 0.840 SECONDS

Best regards, ~Ray
<?php // RAY_temp_goldentine.php
error_reporting(E_ALL);

echo "<pre>" . PHP_EOL;
ob_start();
$start_time = microtime(TRUE);

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// CAN WE PARSE IT?
$obj = SimpleXML_Load_String($xml);
// var_dump($obj);

// CREATE 25,000 X THE INITIAL DATA
$str = $xml;
$str = ltrim( $str, '<catalog>'  . PHP_EOL);
$str = rtrim( $str, '</catalog>' . PHP_EOL);
$new = '<catalog>'  . PHP_EOL;
$k = 25000;
while ($k)
{
    $k--;
    $new .= $str;
}
$new .= '</catalog>'  . PHP_EOL;

// CAN WE WRITE IT TO DISK?
file_put_contents('RAY_temp_goldentine.txt', $new);

// CAN WE READ IT BACK AND PARSE IT?
$obj = SimpleXML_Load_File('RAY_temp_goldentine.txt');
// var_dump($obj);

// ITERATE OVER THE OBJECT
foreach ($obj as $category)
{
    $n = $category->name;
    $p = $category->product->id;
    $c = 0;
    if (isset($category->category)) $c = 1;
    echo PHP_EOL . "$n $p $c";
}
$buf = ob_get_clean();

// DATA, MEMORY AND TIME LIMITS
$len = number_format(strlen($new));
echo PHP_EOL . "XML STRING IS $len BYTES";

$mem = number_format(memory_get_usage(TRUE));
echo PHP_EOL . "MEMORY USAGE IS $mem BYTES";

$tim = number_format(microtime(TRUE) - $start_time, 3);
echo PHP_EOL . "PROCESSING TIME IS $tim SECONDS";

// OUTPUT
// echo $buf;

Open in new window

0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 34228596
This question has been classified as abandoned and is being closed as part of the Cleanup Program.  See my comment at the end of the question for more details.
0

Featured Post

Are You Using the Best Web Development Editor?

The worlds of web hosting and web development are constantly evolving. Every year we see design trends change, coding standards adapt and new frameworks/CMS created. With such a quick pace of change it’s easy to get lost trying to keep up.

See if your editor made the list.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Build an array called $myWeek which will hold the array elements Today, Yesterday and then builds up the rest of the week by the name of the day going back 1 week.   (CODE) (CODE) Then you just need to pass your date to the function. If i…
Q&A with Course Creator, Mark Lassoff, on the importance of HTML5 in the career of a modern-day developer.
The viewer will learn how to user default arguments when defining functions. This method of defining functions will be contrasted with the non-default-argument of defining functions.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.

623 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question