Problem parsing large XML file

Here is a script I wrote that reads a 6-7 MB xml file while extracting a few xml items at a time and writing the mysql queries line by line into a writable file from a loop. This works while using very little server resources, but writing to the file it is extremely slow.

What can I do to speed this up without using much server resources?

I also tried loading the file as an array using file('file.xml'); to load the entire file into memory first, and this appeared to make everything work a whole lot faster but it exhausted the php memory limit until I had to make the value around 300MB which caused me to reconsider the design. weird? huh.

I also used fwrite, in place of file_put_contents at first (inside the loop), but it wrote to the sql file at about the same speed. Slower than a turtle. Any suggestions please.


<?php
      
      ini_set("memory_limit", "40M");
      set_time_limit(0);
      ignore_user_abort(false);
      ini_set('error_reporting', E_ALL);
      
      // Use a small test file first
      $XML_file = 'test.xml'; // marketplace_feed_v1.xml (6 MB file)
      $SQL_file = 'sql_dump.sql';
      
      // First, let's make sure the xml file is present
      if(!file_exists($XML_file))
      {
            die("<h2>ERROR - THE XML FILE COULD NOT BE FOUND</h2>");
      }
      if (!function_exists('file_put_contents') && !defined('FILE_APPEND'))
      {
            die("<h2>ERROR - file_put_contents AND FILE_APPEND NOT SUPPORTED</h2>");
      }
      // Start the timer
      $started = time();
      
    // Get the content between these xml tags using regex
      $id = '/<Id>(.*?)<\/Id>/';
      $title = '/<Title>(.*?)<\/Title>/';
      $description = '/<Description>(.*?)<\/Description>/';
            
      //Retrieve the XML file
      $xmlfile = file_get_contents($XML_file);
            
      //Process the xml file and filter results using above regex rules
      preg_match_all($id,$xmlfile,$out1);
      preg_match_all($title,$xmlfile,$out2);
      preg_match_all($description,$xmlfile,$out3);


      // Make sure to chmod 777 this file
      $fp      = fopen($SQL_file, "wb");
      
      // Write the first SQL statement in the file
      fwrite($fp, "TRUNCATE TABLE mytable;\n");
      
      // Close the file pointer
      fclose($fp);
      
      // Predefine loop counter var
      $counter = 0;
      
      //Create a loop to get all the results there are
      for ($i=0;$i<count($out1[0]);$i++)
      {
            //Remove any other tags that I don't to be shown in the output
            $out1[0] = str_replace(array('<Id>', '</Id>'), '', $out1[0]);
            $out2[0] = str_replace(array('<Title>', '<![CDATA[', ']]>', '</Title>'), '', $out2[0]);
            $out3[0] = str_replace(array('<Description>', '<![CDATA[', ']]>', '</Description>'), '', $out3[0]);


            // Start appending subsequent SQL statments into the file
            file_put_contents ($SQL_file, "INSERT INTO mytable VALUES (NULL, '".$out1[0][$i]."', '".mysql_escape_string($out2[0][$i])."', '".mysql_escape_string($out3[0][$i])."');\n", FILE_APPEND);
            
            
            
            // Increment the counter during each loop
            $counter += 1;
            
      }//end loop
      
      
      // Free the memory
      flush();
      
      $finished = time();
      $elapsed = $finished-$started;
      
      echo "<h2 style=\"color: green;\"><b>Finished..</b><br />Added $counter items to the SQL file in about $elapsed seconds</h2><br><br><h2>=]</h2>";
      
?>





Below is test.xml, a sample that I use to test with, but I only have problems parsing the bigger ~6MB file here:
http://www.clickbank.com/feeds/marketplace_feed_v1.xml.zip

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Catalog SYSTEM "marketplace_feed_v1.dtd">
<Catalog>
<Category>
<Name>Business to Business</Name>
<Site>
<Id>dayjobkill</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Day Job Killer.]]></Title>
<Description><![CDATA[Kill. Or be killed.]]></Description>
<Gravity>361.05325</Gravity>
<EarnedPerSale>42.366276</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>82.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>duvetdolla</Id>
<PopularityRank>2</PopularityRank>
<Title><![CDATA[Duvet Dollars]]></Title>
<Description><![CDATA[Make Money While You Sleep... Victoria shows you how... Affiliates Make $57.75 - $130.50 Per Sale! Conversions as high as 9.86%!]]></Description>
<Gravity>197.07547</Gravity>
<EarnedPerSale>49.51552</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>87.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>paidetc</Id>
<PopularityRank>3</PopularityRank>
<Title><![CDATA[Paid Surveys Etc.. #1 Survey Site.]]></Title>
<Description><![CDATA[Insane Conversions With New Sales Copy... Must See. ## Http://www.paidsurveysetc.com/affiliates.php.]]></Description>
<Gravity>361.72852</Gravity>
<EarnedPerSale>22.604908</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>90.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>surveyis</Id>
<PopularityRank>4</PopularityRank>
<Title><![CDATA[SurveyIncomeSystem - The #1 Survey Site]]></Title>
<Description><![CDATA[The #1 Paid Surveys Site now offers bonus commissions in addition to 75% - See http://aff.SurveyIncomeSystem.com]]></Description>
<Gravity>391.56284</Gravity>
<EarnedPerSale>21.643227</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>89.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>xoftspyse</Id>
<PopularityRank>5</PopularityRank>
<Title><![CDATA[XoftSpySE - #1 Converting Anti-Spyware.]]></Title>
<Description><![CDATA[Free Scan With Amazing Conversion Rates. Google, Overture, & Custom Conversion Tracking. Dedicated Support For Our Affiliates!]]></Description>
<Gravity>191.1444</Gravity>
<EarnedPerSale>25.63003</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>89.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>1free</Id>
<PopularityRank>6</PopularityRank>
<Title><![CDATA[Nick Marks- Multiple Streams Of Income.]]></Title>
<Description><![CDATA[Converting Like Crazy At 20 & 30%. Earn 75% Of Each Sale, Upsell and Backend. - Well Make You Rich!]]></Description>
<Gravity>195.3886</Gravity>
<EarnedPerSale>32.29731</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>86.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>bryxen4</Id>
<PopularityRank>7</PopularityRank>
<Title><![CDATA[Keyword Elite: New Keyword Software.]]></Title>
<Description><![CDATA[Go Here For Tons Of Affiliate Tools: Http://www.keywordelite.com/affiliate/ Create 1000s Of Adsense Keywords, Dominate Adwords.]]></Description>
<Gravity>194.66837</Gravity>
<EarnedPerSale>84.85204</EarnedPerSale>
<PercentPerSale>58.0</PercentPerSale>
<Referred>77.0</Referred>
<Commission>60</Commission>
</Site>
<Site>
<Id>salehoo</Id>
<PopularityRank>8</PopularityRank>
<Title><![CDATA[SaleHoo Wholesale Directory]]></Title>
<Description><![CDATA[We offer over 30 DIFFERENT sales pages for you to direct traffic to. Check out our Affiliate Area: http://www.salehoo.com/affiliates.php or email affiliates@salehoo.com to get your tracking code added!]]></Description>
<Gravity>209.82355</Gravity>
<EarnedPerSale>32.553345</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>75.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>autopilotp</Id>
<PopularityRank>9</PopularityRank>
<Title><![CDATA[Ewen Chias Autopilot Profits System]]></Title>
<Description><![CDATA[Turnkey Money Machine Prints Non-Stop Profits For You Automatically, The Lazy Way!]]></Description>
<Gravity>209.75601</Gravity>
<EarnedPerSale>25.158716</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>51.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>spywarebot</Id>
<PopularityRank>10</PopularityRank>
<Title><![CDATA[Ultra Hot* SpywareBot: #1 AntiSpyware.]]></Title>
<Description><![CDATA[We Offer 24/7/365 Live Support, Remote Assistance, Active Protection, Check-out Assistance, Live FAQs And More To Ensure Your Traffic Converts Day And Night Without Interuptions - Affiliates Make 120,000/Month With Ppc! Now With Msn/Goog/Yhoo Tracking!]]></Description>
<Gravity>101.50244</Gravity>
<EarnedPerSale>21.295887</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>79.0</Referred>
<Commission>75</Commission>
</Site>
</Category>
</Category>
</Catalog>
LVL 10
ray-solomonAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

RoonaanCommented:
You could open the file and read it line by line, having it parsed by a php's xml parser:

function startElement($parser, $elemName, $attributes) {
  //gets called when a element opens: <elemName>
}
function endElement($parser, $elemName) {
  //gets called when a element gets closed: </elemName>
}
function parseCData($parser, $data) {
  //gets called for all cdata (including cdata like <elemName>cdata</elemName>)
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "parseCData");

$f_in = fopen($XML_FILE, 'r');
while(!feof($f_in)) {
  $line = fgets($f_in, 4096);
  xml_parse($xml_parser, $line, $feof($f_in));
}
xml_parser_free($xml_parser);

You would have to track the element you are currently "in" based on the calls to startElement and endElement. You could use an global array with array_push/array_pop and test for the top tag inside the parseCData function.

Kind regards

-r-
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
ray-solomonAuthor Commented:

WOW, I am using the php's xml parser and it parsed the ~8 MB xml file while writing to the sql file in a total of about 7 seconds.
I am truly amazed Roonaan. The working code is below.

I need help on one minor problem please. After the data gets written to the sql file, the title and descriptions are offset. Like this:

<snip>

TRUNCATE TABLE mytable;
INSERT INTO mytable VALUES (NULL, 'dayjobkill', '', '');
INSERT INTO mytable VALUES (NULL, 'paidetc', 'Day Job Killer.', 'Kill. Or be killed.');
INSERT INTO mytable VALUES (NULL, 'duvetdolla', 'Paid Surveys Etc.. #1 Survey Site.', 'Insane Conversions With New Sales Copy... Must See. ## Http://www.paidsurveysetc.com/affiliates.php.');
INSERT INTO mytable VALUES (NULL, '1free', 'Duvet Dollars', 'Make Money While You Sleep... Victoria shows you how... Affiliates Make $57.75 - $130.50 Per Sale! Conversions as high as 9.86%!');

</snip>

as you can see, the second and third columns in each sql statement should be in the line above it. Weird?
The sql file can be compared to the original xml file here:
http://www.clickbank.com/feeds/marketplace_feed_v1.xml.zip (changes often)

How can it be possible to fix this error?
I had a feeling, this is caused by writing the truncate statement to the file first, so I commented out the first fwrite command and it did not make a difference. Any suggestions?


<?php

            // Define the xml file to read from
            $xml_file = "marketplace_feed_v1.xml";
            
            // Define the sql file to write to -make sure to chmod 777
            $sql_file = "sql_dump.sql";

            if(!file_exists($xml_file))
            {
                  die("<h2>ERROR - THE XML FILE COULD NOT BE FOUND</h2>");
            }
            if(!is_writable($sql_file))
            {
                  die("<h2>ERROR - THE XML FILE IS NOT WRITEABLE</h2>");
            }

            $open_tags = array("ID" => "<ID>", "TITLE" => "<TITLE>", "DESCRIPTION" => "<DESCRIPTION>");
            $close_tags = array("ID" => "</ID>", "TITLE" => "</TITLE>", "DESCRIPTION" => "</DESCRIPTION>");
            
            // Start the timer
            $started = time();
            
            $count = 0;
            
            // Make sure to chmod 777 this file
            $sqlfile      = fopen($sql_file, "wb");
      
            // Write the first SQL statement to the file
            fwrite($sqlfile, "TRUNCATE TABLE mytable;\n");

            function startElement($parser, $name, $attrs="")
            {
                  global $open_tags, $temp, $current_tag;
                  $current_tag = $name;
                  if ($format = $open_tags[$name])
                  {
                        switch($name)
                        {
                              default:
                              break;
                        }
                  }
            }
            
            function endElement($parser, $name, $attrs="")
            {
                  global $close_tags, $temp, $current_tag, $count, $numstors;
                  if ($format = $close_tags[$name])
                  {
                        switch($name)
                        {
                              case "ID":
                              return_data($temp);
                              $temp = "";
                              break;
                              default:
                              break;
                        }
                  }
            }
            
            function characterData($parser, $data)
            {
                  global $current_tag, $temp, $catID;
                  switch($current_tag)
                  {
                        case "ID":
                        $temp["id"] = $data;
                        $current_tag = "";
                        break;
                        case "TITLE":
                        $temp["title"] = $data;
                        $current_tag = "";
                        break;
                        case "DESCRIPTION":
                        $temp["description"] = $data;
                        $current_tag = "";
                        break;
                  }
            }
            
            function return_data()
            {
                  global $temp, $count, $sqlfile;
                  
                  // Start appending subsequent SQL statments into the file
                  fwrite($sqlfile, "INSERT INTO mytable VALUES (NULL, '".$temp["id"]."', '".mysql_escape_string($temp["title"])."', '".mysql_escape_string($temp["description"])."');\n");

            }
                  
                  $xml_parser = xml_parser_create();
                  xml_set_element_handler($xml_parser, "startElement","endElement");
                  xml_set_character_data_handler($xml_parser, "characterData");
                  
                  if (!($xmlfile = fopen($xml_file, "r")))
                  {
                        die("<h2>ERROR - THE XML FILE COULD NOT BE OPENED</h2>");
                  }
                  
            while ($data = fread($xmlfile, 4096))
            {
                  if (!($data = utf8_encode($data)))
                  {
                        echo "<h2>ERROR - COULD NOT ENCODE THE DATA TO UTF-8</h2>"."\n";
                  }
                  if (!xml_parse($xml_parser, $data, feof($xmlfile)))
                  {
                        die(sprintf( "XML error: %s at line %d\n\n",
                        xml_error_string(xml_get_error_code($xml_parser)),
                        xml_get_current_line_number($xml_parser)));
                  }
            }
            
            // Close the file pointer
            fclose($xmlfile);
            fclose($sqlfile);
            xml_parser_free($xml_parser);
            
            $finished = time();
            $elapsed = $finished-$started;
      
            echo "<h2 style=\"color: green;\"><b>Finished..</b><br />Added $count items to the SQL file in about $elapsed seconds</h2><br><br><h2>=]</h2>";
?>
0
ray-solomonAuthor Commented:
oops typo, I meant to say that the 3rd and 4th columns are offset instead of second and third in the sql statements.
0
C++ 11 Fundamentals

This course will introduce you to C++ 11 and teach you about syntax fundamentals.

RoonaanCommented:
Hi,

What is the usage of this line (it's in your code twice):

if ($format = $open_tags[$name])

A single = is an assignment. I think you ment to have if($format == $open_tags[$name]). The $format variable however is not located/initialized and has no value as far as I can tell.

-r-
0
ray-solomonAuthor Commented:
Since this is my first time using the xml parser, I needed an example to follow from. I used an example from another ee question and modified it. Here I cleaned those functions up now and got rid of the unused vars. The other guy's code did not have those vars initialized either.  I wondered about it too.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_20490325.html

Here is the change I made:

            function startElement($parser, $name, $attrs="")
            {
                  global $open_tags, $current_tag;
                  $current_tag = $name;
                  if ($open_tags[$name])
                  {
                        switch($name)
                        {
                              default:
                              break;
                        }
                  }
            }
            
            function endElement($parser, $name, $attrs="")
            {
                  global $close_tags, $temp;
                  if ($close_tags[$name])
                  {
                        switch($name)
                        {
                              case "ID":
                              return_data($temp);
                              //$temp = "";
                              break;
                              default:
                              break;
                        }
                  }
            }
0
ray-solomonAuthor Commented:
okay, what I am going to do is go ahead and give you the points for helping me get in the right direction and taking the time to help, but I will post a new question very soon so I get get more help on this script because it is urgent.

Thanks Roonan
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PHP

From novice to tech pro — start learning today.