Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people, just like you, are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
Solved

Help parsing very large XML file

Posted on 2010-09-23
8
1,174 Views
Last Modified: 2013-12-12
Hello, I'm trying to extract product data from a very large XML file in minimal time (at least 20mb).

The reason I'm asking this question here is that the xml file contains around 25,000+ records. Doing a recursive loop is a resource hog, and looping over every element takes too long and at times will crash before the process is finished.

This is an example of the xml file's structure...

<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>

Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function.  

Also, when each record is extracted I need to know the full parent-child category it was listed under.

For example:

==================

productid = 1234
category = category1
title = ProductTitle

productid = 9999
category = category1:subcat1
title = ProductTitle

productid = 3333
category = category1:subcat2
title = ProductTitle

productid = 6355
category = category2
title = ProductTitle

===================

I've been playing around with the PHP methods below.... but I still can't get this to run at optimal speed.

new DOMDocument()
simplexml_load_file()
new XMLReader()

Anyway, there's gotta be someone out there who knows how to process a large XML file in under a minute!

I'd be forever grateful for any insight or examples!

Thanks in advance!
0
Comment
Question by:goldentine
8 Comments
 
LVL 7

Expert Comment

by:ziceva
ID: 33743003
I've always used SimpleXMLElement
Don't know how it behaves on big files, but I think is worth a try

http://www.php.net/manual/en/class.simplexmlelement.php

Use it like this:

$XML = new SimpleXMLElement((string)file_get_contents("/path/or/URL/for/file"));
$nodes = $XML->xpath("/catalog/category/name");

nodes is also a SimpleXMLElement object containing the requested data.

You might get goot results with this class.

 
0
 
LVL 4

Expert Comment

by:trencH87
ID: 33745034
SimpleXML and DOM is not suited for large xml files as it has to load the entire file into memory before processing it. I would have used XMLReader.
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 33760459
The first question that occurs to me is, "Why is this in XML?"

The size of the data set suggests that it would be faster to process it with a more economical transport mechanism like JSON.  As shown above, the size of the tag strings exceed the size of the payload strings.

The organization of the data set suggest that this belongs in a relational data base, not in an XML string.

Given that, "Basically I'm reading a local xml file, extracting product data, and passing an associative array over to a MySql function." -- My suggestion is to make that a one-time load of the MySQL data base.  Then use the data base in the future, and use XML (if needed) to transmit only the updates to the data base.  It will be faster and MUCH easier to program.
0
Networking for the Cloud Era

Join Microsoft and Riverbed for a discussion and demonstration of enhancements to SteelConnect:
-One-click orchestration and cloud connectivity in Azure environments
-Tight integration of SD-WAN and WAN optimization capabilities
-Scalability and resiliency equal to a data center

 

Author Comment

by:goldentine
ID: 33839381
The fact that it's in XML format is out of my control. It's pulled from a 3rd party site. If I had it my way I'd definitely choose JSON over XML.
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 33840090
I may not be understanding this well enough.  You wrote above that you were reading a local XML file, but now you say it comes from a 3rd party.  So please tell us a little more about the TRUE NATURE of the application.  How often do you have to run it?  What does some of the ACTUAL data look like?  What does the data base structure look like?  Is the data base normalized? Are there really nested tags with the same names?  What does an ACTUAL "INSERT" query look like, using the ACTUAL XML data.  Things like that may make us better able to help.

Some tips:  Increase your allowable memory size to "very large" whatever that is.  Use set_time_limit(3) or something like that after each DB update.  Do not index the data base until after you have loaded all of the data.

If the XML posted above is truly representative of your input string, it seems like it should not be too hard to process it in a timely manner, not should it be too hard to read it into memory and iterate over it with.  I copied the XML into this little script and it outputs numbers that are not scary at all.

int(760)
string(10) "19,000,000"

Standing by to learn more, ~Ray
<?php // RAY_temp_godentine.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// HOW BIG IS IT?
$len = strlen($xml);
var_dump($len);

// HOW BIG IS 25,000 X
$new = $len * 25000;
var_dump(number_format($new));

Open in new window

0
 
LVL 109

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 33840592
Here is another version, somewhat scaled up.  Output suggests that you will have no problems.

XML STRING IS 18,500,021 BYTES
MEMORY USAGE IS 20,185,088 BYTES
PROCESSING TIME IS 0.840 SECONDS

Best regards, ~Ray
<?php // RAY_temp_goldentine.php
error_reporting(E_ALL);

echo "<pre>" . PHP_EOL;
ob_start();
$start_time = microtime(TRUE);

// TEST DATA FROM THE POST AT EE
$xml = <<<XML
<catalog>
   <category>
        <name>category1</name>
        <product>
            <id>1234</id>
            <title>ProductTitle</title>
        </product>
        <category>
            <name>subcat1</name>
            <product>
                <id>9999</id>
                <title>ProductTitle</title>
            </product>
        </category>
        <category>
            <name>subcat2</name>
            <product>
                <id>3333</id>
                <title>ProductTitle</title>
            </product>
        </category>
    </category>
    <category>
        <name>category2</name>
        <product>
            <id>6355</id>
            <title>ProductTitle</title>
        </product>
    </category>
</catalog>
XML;

// CAN WE PARSE IT?
$obj = SimpleXML_Load_String($xml);
// var_dump($obj);

// CREATE 25,000 X THE INITIAL DATA
$str = $xml;
$str = ltrim( $str, '<catalog>'  . PHP_EOL);
$str = rtrim( $str, '</catalog>' . PHP_EOL);
$new = '<catalog>'  . PHP_EOL;
$k = 25000;
while ($k)
{
    $k--;
    $new .= $str;
}
$new .= '</catalog>'  . PHP_EOL;

// CAN WE WRITE IT TO DISK?
file_put_contents('RAY_temp_goldentine.txt', $new);

// CAN WE READ IT BACK AND PARSE IT?
$obj = SimpleXML_Load_File('RAY_temp_goldentine.txt');
// var_dump($obj);

// ITERATE OVER THE OBJECT
foreach ($obj as $category)
{
    $n = $category->name;
    $p = $category->product->id;
    $c = 0;
    if (isset($category->category)) $c = 1;
    echo PHP_EOL . "$n $p $c";
}
$buf = ob_get_clean();

// DATA, MEMORY AND TIME LIMITS
$len = number_format(strlen($new));
echo PHP_EOL . "XML STRING IS $len BYTES";

$mem = number_format(memory_get_usage(TRUE));
echo PHP_EOL . "MEMORY USAGE IS $mem BYTES";

$tim = number_format(microtime(TRUE) - $start_time, 3);
echo PHP_EOL . "PROCESSING TIME IS $tim SECONDS";

// OUTPUT
// echo $buf;

Open in new window

0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 34228596
This question has been classified as abandoned and is being closed as part of the Cleanup Program.  See my comment at the end of the question for more details.
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

If you haven’t already, I encourage you to read the first article (http://www.experts-exchange.com/articles/18680/An-Introduction-to-R-Programming-and-R-Studio.html) in my series to gain a basic foundation of R and R Studio.  You will also find the …
This article will show, step by step, how to integrate R code into a R Sweave document
This theoretical tutorial explains exceptions, reasons for exceptions, different categories of exception and exception hierarchy.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

808 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question