Solved

Problem parsing large XML file

Posted on 2007-03-28
6
1,199 Views
Last Modified: 2012-06-27
Here is a script I wrote that reads a 6-7 MB xml file while extracting a few xml items at a time and writing the mysql queries line by line into a writable file from a loop. This works while using very little server resources, but writing to the file it is extremely slow.

What can I do to speed this up without using much server resources?

I also tried loading the file as an array using file('file.xml'); to load the entire file into memory first, and this appeared to make everything work a whole lot faster but it exhausted the php memory limit until I had to make the value around 300MB which caused me to reconsider the design. weird? huh.

I also used fwrite, in place of file_put_contents at first (inside the loop), but it wrote to the sql file at about the same speed. Slower than a turtle. Any suggestions please.


<?php
      
      ini_set("memory_limit", "40M");
      set_time_limit(0);
      ignore_user_abort(false);
      ini_set('error_reporting', E_ALL);
      
      // Use a small test file first
      $XML_file = 'test.xml'; // marketplace_feed_v1.xml (6 MB file)
      $SQL_file = 'sql_dump.sql';
      
      // First, let's make sure the xml file is present
      if(!file_exists($XML_file))
      {
            die("<h2>ERROR - THE XML FILE COULD NOT BE FOUND</h2>");
      }
      if (!function_exists('file_put_contents') && !defined('FILE_APPEND'))
      {
            die("<h2>ERROR - file_put_contents AND FILE_APPEND NOT SUPPORTED</h2>");
      }
      // Start the timer
      $started = time();
      
    // Get the content between these xml tags using regex
      $id = '/<Id>(.*?)<\/Id>/';
      $title = '/<Title>(.*?)<\/Title>/';
      $description = '/<Description>(.*?)<\/Description>/';
            
      //Retrieve the XML file
      $xmlfile = file_get_contents($XML_file);
            
      //Process the xml file and filter results using above regex rules
      preg_match_all($id,$xmlfile,$out1);
      preg_match_all($title,$xmlfile,$out2);
      preg_match_all($description,$xmlfile,$out3);


      // Make sure to chmod 777 this file
      $fp      = fopen($SQL_file, "wb");
      
      // Write the first SQL statement in the file
      fwrite($fp, "TRUNCATE TABLE mytable;\n");
      
      // Close the file pointer
      fclose($fp);
      
      // Predefine loop counter var
      $counter = 0;
      
      //Create a loop to get all the results there are
      for ($i=0;$i<count($out1[0]);$i++)
      {
            //Remove any other tags that I don't to be shown in the output
            $out1[0] = str_replace(array('<Id>', '</Id>'), '', $out1[0]);
            $out2[0] = str_replace(array('<Title>', '<![CDATA[', ']]>', '</Title>'), '', $out2[0]);
            $out3[0] = str_replace(array('<Description>', '<![CDATA[', ']]>', '</Description>'), '', $out3[0]);


            // Start appending subsequent SQL statments into the file
            file_put_contents ($SQL_file, "INSERT INTO mytable VALUES (NULL, '".$out1[0][$i]."', '".mysql_escape_string($out2[0][$i])."', '".mysql_escape_string($out3[0][$i])."');\n", FILE_APPEND);
            
            
            
            // Increment the counter during each loop
            $counter += 1;
            
      }//end loop
      
      
      // Free the memory
      flush();
      
      $finished = time();
      $elapsed = $finished-$started;
      
      echo "<h2 style=\"color: green;\"><b>Finished..</b><br />Added $counter items to the SQL file in about $elapsed seconds</h2><br><br><h2>=]</h2>";
      
?>





Below is test.xml, a sample that I use to test with, but I only have problems parsing the bigger ~6MB file here:
http://www.clickbank.com/feeds/marketplace_feed_v1.xml.zip

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Catalog SYSTEM "marketplace_feed_v1.dtd">
<Catalog>
<Category>
<Name>Business to Business</Name>
<Site>
<Id>dayjobkill</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Day Job Killer.]]></Title>
<Description><![CDATA[Kill. Or be killed.]]></Description>
<Gravity>361.05325</Gravity>
<EarnedPerSale>42.366276</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>82.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>duvetdolla</Id>
<PopularityRank>2</PopularityRank>
<Title><![CDATA[Duvet Dollars]]></Title>
<Description><![CDATA[Make Money While You Sleep... Victoria shows you how... Affiliates Make $57.75 - $130.50 Per Sale! Conversions as high as 9.86%!]]></Description>
<Gravity>197.07547</Gravity>
<EarnedPerSale>49.51552</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>87.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>paidetc</Id>
<PopularityRank>3</PopularityRank>
<Title><![CDATA[Paid Surveys Etc.. #1 Survey Site.]]></Title>
<Description><![CDATA[Insane Conversions With New Sales Copy... Must See. ## Http://www.paidsurveysetc.com/affiliates.php.]]></Description>
<Gravity>361.72852</Gravity>
<EarnedPerSale>22.604908</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>90.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>surveyis</Id>
<PopularityRank>4</PopularityRank>
<Title><![CDATA[SurveyIncomeSystem - The #1 Survey Site]]></Title>
<Description><![CDATA[The #1 Paid Surveys Site now offers bonus commissions in addition to 75% - See http://aff.SurveyIncomeSystem.com]]></Description>
<Gravity>391.56284</Gravity>
<EarnedPerSale>21.643227</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>89.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>xoftspyse</Id>
<PopularityRank>5</PopularityRank>
<Title><![CDATA[XoftSpySE - #1 Converting Anti-Spyware.]]></Title>
<Description><![CDATA[Free Scan With Amazing Conversion Rates. Google, Overture, & Custom Conversion Tracking. Dedicated Support For Our Affiliates!]]></Description>
<Gravity>191.1444</Gravity>
<EarnedPerSale>25.63003</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>89.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>1free</Id>
<PopularityRank>6</PopularityRank>
<Title><![CDATA[Nick Marks- Multiple Streams Of Income.]]></Title>
<Description><![CDATA[Converting Like Crazy At 20 & 30%. Earn 75% Of Each Sale, Upsell and Backend. - Well Make You Rich!]]></Description>
<Gravity>195.3886</Gravity>
<EarnedPerSale>32.29731</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>86.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>bryxen4</Id>
<PopularityRank>7</PopularityRank>
<Title><![CDATA[Keyword Elite: New Keyword Software.]]></Title>
<Description><![CDATA[Go Here For Tons Of Affiliate Tools: Http://www.keywordelite.com/affiliate/ Create 1000s Of Adsense Keywords, Dominate Adwords.]]></Description>
<Gravity>194.66837</Gravity>
<EarnedPerSale>84.85204</EarnedPerSale>
<PercentPerSale>58.0</PercentPerSale>
<Referred>77.0</Referred>
<Commission>60</Commission>
</Site>
<Site>
<Id>salehoo</Id>
<PopularityRank>8</PopularityRank>
<Title><![CDATA[SaleHoo Wholesale Directory]]></Title>
<Description><![CDATA[We offer over 30 DIFFERENT sales pages for you to direct traffic to. Check out our Affiliate Area: http://www.salehoo.com/affiliates.php or email affiliates@salehoo.com to get your tracking code added!]]></Description>
<Gravity>209.82355</Gravity>
<EarnedPerSale>32.553345</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>75.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>autopilotp</Id>
<PopularityRank>9</PopularityRank>
<Title><![CDATA[Ewen Chias Autopilot Profits System]]></Title>
<Description><![CDATA[Turnkey Money Machine Prints Non-Stop Profits For You Automatically, The Lazy Way!]]></Description>
<Gravity>209.75601</Gravity>
<EarnedPerSale>25.158716</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>51.0</Referred>
<Commission>75</Commission>
</Site>
<Site>
<Id>spywarebot</Id>
<PopularityRank>10</PopularityRank>
<Title><![CDATA[Ultra Hot* SpywareBot: #1 AntiSpyware.]]></Title>
<Description><![CDATA[We Offer 24/7/365 Live Support, Remote Assistance, Active Protection, Check-out Assistance, Live FAQs And More To Ensure Your Traffic Converts Day And Night Without Interuptions - Affiliates Make 120,000/Month With Ppc! Now With Msn/Goog/Yhoo Tracking!]]></Description>
<Gravity>101.50244</Gravity>
<EarnedPerSale>21.295887</EarnedPerSale>
<PercentPerSale>75.0</PercentPerSale>
<Referred>79.0</Referred>
<Commission>75</Commission>
</Site>
</Category>
</Category>
</Catalog>
0
Comment
Question by:ray-solomon
  • 4
  • 2
6 Comments
 
LVL 49

Accepted Solution

by:
Roonaan earned 500 total points
ID: 18807055
You could open the file and read it line by line, having it parsed by a php's xml parser:

function startElement($parser, $elemName, $attributes) {
  //gets called when a element opens: <elemName>
}
function endElement($parser, $elemName) {
  //gets called when a element gets closed: </elemName>
}
function parseCData($parser, $data) {
  //gets called for all cdata (including cdata like <elemName>cdata</elemName>)
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "parseCData");

$f_in = fopen($XML_FILE, 'r');
while(!feof($f_in)) {
  $line = fgets($f_in, 4096);
  xml_parse($xml_parser, $line, $feof($f_in));
}
xml_parser_free($xml_parser);

You would have to track the element you are currently "in" based on the calls to startElement and endElement. You could use an global array with array_push/array_pop and test for the top tag inside the parseCData function.

Kind regards

-r-
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 18813444

WOW, I am using the php's xml parser and it parsed the ~8 MB xml file while writing to the sql file in a total of about 7 seconds.
I am truly amazed Roonaan. The working code is below.

I need help on one minor problem please. After the data gets written to the sql file, the title and descriptions are offset. Like this:

<snip>

TRUNCATE TABLE mytable;
INSERT INTO mytable VALUES (NULL, 'dayjobkill', '', '');
INSERT INTO mytable VALUES (NULL, 'paidetc', 'Day Job Killer.', 'Kill. Or be killed.');
INSERT INTO mytable VALUES (NULL, 'duvetdolla', 'Paid Surveys Etc.. #1 Survey Site.', 'Insane Conversions With New Sales Copy... Must See. ## Http://www.paidsurveysetc.com/affiliates.php.');
INSERT INTO mytable VALUES (NULL, '1free', 'Duvet Dollars', 'Make Money While You Sleep... Victoria shows you how... Affiliates Make $57.75 - $130.50 Per Sale! Conversions as high as 9.86%!');

</snip>

as you can see, the second and third columns in each sql statement should be in the line above it. Weird?
The sql file can be compared to the original xml file here:
http://www.clickbank.com/feeds/marketplace_feed_v1.xml.zip (changes often)

How can it be possible to fix this error?
I had a feeling, this is caused by writing the truncate statement to the file first, so I commented out the first fwrite command and it did not make a difference. Any suggestions?


<?php

            // Define the xml file to read from
            $xml_file = "marketplace_feed_v1.xml";
            
            // Define the sql file to write to -make sure to chmod 777
            $sql_file = "sql_dump.sql";

            if(!file_exists($xml_file))
            {
                  die("<h2>ERROR - THE XML FILE COULD NOT BE FOUND</h2>");
            }
            if(!is_writable($sql_file))
            {
                  die("<h2>ERROR - THE XML FILE IS NOT WRITEABLE</h2>");
            }

            $open_tags = array("ID" => "<ID>", "TITLE" => "<TITLE>", "DESCRIPTION" => "<DESCRIPTION>");
            $close_tags = array("ID" => "</ID>", "TITLE" => "</TITLE>", "DESCRIPTION" => "</DESCRIPTION>");
            
            // Start the timer
            $started = time();
            
            $count = 0;
            
            // Make sure to chmod 777 this file
            $sqlfile      = fopen($sql_file, "wb");
      
            // Write the first SQL statement to the file
            fwrite($sqlfile, "TRUNCATE TABLE mytable;\n");

            function startElement($parser, $name, $attrs="")
            {
                  global $open_tags, $temp, $current_tag;
                  $current_tag = $name;
                  if ($format = $open_tags[$name])
                  {
                        switch($name)
                        {
                              default:
                              break;
                        }
                  }
            }
            
            function endElement($parser, $name, $attrs="")
            {
                  global $close_tags, $temp, $current_tag, $count, $numstors;
                  if ($format = $close_tags[$name])
                  {
                        switch($name)
                        {
                              case "ID":
                              return_data($temp);
                              $temp = "";
                              break;
                              default:
                              break;
                        }
                  }
            }
            
            function characterData($parser, $data)
            {
                  global $current_tag, $temp, $catID;
                  switch($current_tag)
                  {
                        case "ID":
                        $temp["id"] = $data;
                        $current_tag = "";
                        break;
                        case "TITLE":
                        $temp["title"] = $data;
                        $current_tag = "";
                        break;
                        case "DESCRIPTION":
                        $temp["description"] = $data;
                        $current_tag = "";
                        break;
                  }
            }
            
            function return_data()
            {
                  global $temp, $count, $sqlfile;
                  
                  // Start appending subsequent SQL statments into the file
                  fwrite($sqlfile, "INSERT INTO mytable VALUES (NULL, '".$temp["id"]."', '".mysql_escape_string($temp["title"])."', '".mysql_escape_string($temp["description"])."');\n");

            }
                  
                  $xml_parser = xml_parser_create();
                  xml_set_element_handler($xml_parser, "startElement","endElement");
                  xml_set_character_data_handler($xml_parser, "characterData");
                  
                  if (!($xmlfile = fopen($xml_file, "r")))
                  {
                        die("<h2>ERROR - THE XML FILE COULD NOT BE OPENED</h2>");
                  }
                  
            while ($data = fread($xmlfile, 4096))
            {
                  if (!($data = utf8_encode($data)))
                  {
                        echo "<h2>ERROR - COULD NOT ENCODE THE DATA TO UTF-8</h2>"."\n";
                  }
                  if (!xml_parse($xml_parser, $data, feof($xmlfile)))
                  {
                        die(sprintf( "XML error: %s at line %d\n\n",
                        xml_error_string(xml_get_error_code($xml_parser)),
                        xml_get_current_line_number($xml_parser)));
                  }
            }
            
            // Close the file pointer
            fclose($xmlfile);
            fclose($sqlfile);
            xml_parser_free($xml_parser);
            
            $finished = time();
            $elapsed = $finished-$started;
      
            echo "<h2 style=\"color: green;\"><b>Finished..</b><br />Added $count items to the SQL file in about $elapsed seconds</h2><br><br><h2>=]</h2>";
?>
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 18813465
oops typo, I meant to say that the 3rd and 4th columns are offset instead of second and third in the sql statements.
0
What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 49

Expert Comment

by:Roonaan
ID: 18813880
Hi,

What is the usage of this line (it's in your code twice):

if ($format = $open_tags[$name])

A single = is an assignment. I think you ment to have if($format == $open_tags[$name]). The $format variable however is not located/initialized and has no value as far as I can tell.

-r-
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 18814097
Since this is my first time using the xml parser, I needed an example to follow from. I used an example from another ee question and modified it. Here I cleaned those functions up now and got rid of the unused vars. The other guy's code did not have those vars initialized either.  I wondered about it too.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_20490325.html

Here is the change I made:

            function startElement($parser, $name, $attrs="")
            {
                  global $open_tags, $current_tag;
                  $current_tag = $name;
                  if ($open_tags[$name])
                  {
                        switch($name)
                        {
                              default:
                              break;
                        }
                  }
            }
            
            function endElement($parser, $name, $attrs="")
            {
                  global $close_tags, $temp;
                  if ($close_tags[$name])
                  {
                        switch($name)
                        {
                              case "ID":
                              return_data($temp);
                              //$temp = "";
                              break;
                              default:
                              break;
                        }
                  }
            }
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 18821620
okay, what I am going to do is go ahead and give you the points for helping me get in the right direction and taking the time to help, but I will post a new question very soon so I get get more help on this script because it is urgent.

Thanks Roonan
0

Featured Post

Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

Join & Write a Comment

Popularity Can Be Measured Sometimes we deal with questions of popularity, and we need a way to collect opinions from our clients.  This article shows a simple teaching example of how we might elect a favorite color by letting our clients vote for …
This article discusses four methods for overlaying images in a container on a web page
The viewer will learn how to dynamically set the form action using jQuery.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now