Solved

PHP / MySQL - Processing a 9MB-25MB .csv file

Posted on 2011-02-11
1,005 Views
Last Modified: 2012-05-11
Hi,
We have:
1. PHP / MySQL website on dedicated server.

2. Customer uploads a large file (9MB - 25MB)

3. This file is a .csv of product_id,cust_id,product_price

4. What's the fastest and most efficient way to import this data into our database table?

We have successfully created a script that does it pretty fast, but we're having trouble with the larger files.

We have been able to get it to work by putting this at the beginning of our processing PHP script: ini_set("memory_limit","375M");

...but that seems like a bit of a hack, right?

There must be a more efficient way, or is this OK since we're on a dedicated server?

Should we just set the memory_limit back to something like 64MB when we're done processing, or is it ok to leave it at 375MB?

Thanks!






Question by:dasher3000
12 Comments
 
LVL 6

Expert Comment

by:birwin
ID: 34876665
I upload and process a 400,000-line CSV file every day using a cron job. Because of the processing I do, the code is quite involved, but the upload itself is very easy.

I set:

set_time_limit(900);
ini_set('memory_limit', '500M');

$my_file = file_get_contents('http://www.xxxxxx.com/download_excel/xxxxxx.csv', true);
$file_suffix = date("ymdhis");
$destination_file = "inventory/inventory" . $file_suffix . ".csv";
file_put_contents($destination_file, $my_file);


The time limit and memory allocation settings expire when the script finishes.
 
LVL 3

Expert Comment

by:wwwdeveloper2
ID: 34876669
Using ini_set() is the best route. You are only allowing this one script to use that much memory while it is running - not the whole PHP engine (as you would if you set it in the php.ini file).

Please understand that each instance or thread running this script can use up to 375M. So if the script gets a lot of traffic, with many users running it concurrently, each instance could be using its own 375M - if you had 2GB of memory in the server and allocated 1GB, two concurrent instances would try to consume all of your memory and choke the server.

I think you are doing it the best way; just be careful and decide what your setting should be based on the usage and the available memory on the dedicated server.

It also might be neat if you could check how big the uploaded file is and then use ini_set() to assign the memory_limit as needed.

Good luck!
 
LVL 7

Expert Comment

by:lexlythius
ID: 34876708
If you have FILE privileges on MySQL, nothing will beat using MySQL's LOAD DATA INFILE on the uploaded file.

This way you don't need to worry about memory configurations.
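
In rough terms it would look something like this - a minimal sketch, assuming the uploaded file has already been saved on the database server and that the target table is called custom_pricing with columns matching the CSV (the table name, file path and credentials are placeholders, not your actual schema):

<?php
// Hypothetical sketch: bulk-load the uploaded CSV with LOAD DATA INFILE.
$link = mysql_connect('localhost', 'db_user', 'db_pass'); // placeholder credentials
mysql_select_db('shop', $link);

$file = '/tmp/pricing_upload.csv';
$sql  = "LOAD DATA INFILE '" . mysql_real_escape_string($file, $link) . "'
         INTO TABLE custom_pricing
         FIELDS TERMINATED BY ','
         LINES TERMINATED BY '\\n'
         (product_id, cust_id, product_price)";

if (!mysql_query($sql, $link)) {
    die('LOAD DATA INFILE failed: ' . mysql_error($link));
}
echo mysql_affected_rows($link) . " rows imported";
?>

If the file lives on the web server rather than the MySQL server, LOAD DATA LOCAL INFILE (when the server allows it) is the variant to use.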
 
LVL 7

Accepted Solution

by:
lexlythius earned 350 total points
ID: 34876727
If you cannot use LOAD DATA INFILE, your problem is that file_get_contents reads the whole big fat file into a single string. Don't.

If at all feasible, use fgets or fgetcsv to read and process one line at a time. That is much more efficient in terms of memory usage.

You will probably need to set_time_limit(0) anyway.
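
A minimal sketch of the line-by-line approach, assuming the same three-column CSV (product_id, cust_id, product_price) and a hypothetical custom_pricing table - the table name, file path and credentials are placeholders:

<?php
set_time_limit(0); // a large file can legitimately take a few minutes

$link = mysql_connect('localhost', 'db_user', 'db_pass'); // placeholder credentials
mysql_select_db('shop', $link);

$handle = fopen('/tmp/pricing_upload.csv', 'r');
if ($handle === false) {
    die('Could not open the uploaded file');
}

while (($row = fgetcsv($handle, 1024, ',')) !== false) {
    // $row is array(product_id, cust_id, product_price) for one line only
    $product_id = (int) $row[0];
    $cust_id    = (int) $row[1];
    $price      = (float) $row[2];

    mysql_query("INSERT INTO custom_pricing (product_id, cust_id, product_price)
                 VALUES ($product_id, $cust_id, $price)");
}
fclose($handle);
?>

Only one line is ever held in memory. If the per-row INSERTs turn out to be the bottleneck, you can accumulate a few hundred rows into one multi-row INSERT before sending it.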
 
LVL 6

Expert Comment

by:birwin
ID: 34876926
Reading 400,000 lines one at a time is an intensive process. In the real world I have no - that is, zero - problems uploading massive files with file_get_contents.

Most files need to be purged, cleaned and processed before being dumped into the database. I use LOAD DATA INFILE to insert my processed results into the database, but you had better be very sure of the content of the data before you dump the unprocessed results directly into a database.

Even innocent data can have killer characters embedded in it. Your load will stop at the first O'Reilly it meets.

That being said, if you are using GoDaddy with a shared hosting account, fgets may be the only way you can do it, but even then a large file will time out. With a dedicated server, that isn't an issue.
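
As an illustration of the apostrophe problem, values like that need escaping before they go into an INSERT (the table and column here are hypothetical, and an open mysql connection is assumed):

$name = "O'Reilly";
// Without escaping this would produce ...VALUES ('O'Reilly') and break the statement.
$sql = "INSERT INTO customers (name) VALUES ('" . mysql_real_escape_string($name) . "')";
mysql_query($sql);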
 

Author Comment

by:dasher3000
ID: 34878468
Thanks for the ideas!
Just to clarify - the problem is in processing only, not uploading.

I'll try some of these out and post my findings ASAP.

- D
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 150 total points
ID: 34878503
file_get_contents() requires that the entire file be loaded into memory at once, so the memory_limit directives come into play.  But fortunately PHP provides a function that helps us with the import process.
http://us3.php.net/manual/en/function.fgetcsv.php

As far as things like O'Reilly or 23" monitor are concerned, well, that is why we use this:
http://us3.php.net/manual/en/function.fgetcsv.php

If you're running into time limit restrictions, you can use this to reset the time limit, repeatedly if necessary:
http://us3.php.net/manual/en/function.set-time-limit.php

If you are using GoDaddy with a shared hosting account, you may find that they have disabled set_time_limit().  If that is the case and it is standing in the way of your success, you might want to move to a different hosting company!
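
A sketch of that repeated reset inside the read loop (assuming the fgetcsv-based loop suggested above, with $handle already open):

while (($row = fgetcsv($handle, 1024, ',')) !== false) {
    set_time_limit(30); // restart the 30-second clock on every row
    // ... process $row ...
}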
 

Author Comment

by:dasher3000
ID: 34879003
No GoDaddy - thank goodness.

One point of possible interest, or perhaps unnecessary:
1. We setup a script that first uploads the large file to the server without processing it.
2. Then the user clicks a second button for "Process The File."

This is theoretically so the system just processes a local file without having to do the upload process at the same time.

Is that a pointless step to take?

Thanks,
D
 
LVL 3

Expert Comment

by:wwwdeveloper2
ID: 34879082
Dasher -

I think it is good that you have it uploaded first.  Since the file is already on the server, when the user clicks "Process the File" you can have the script look at the file size and then use some conditions to decide how much memory should be allocated to the script to process the file.

$getFileSize = filesize("test.csv"); // returns the file size in bytes

// check whether the file is larger than 9MB (9437184 bytes)
if ($getFileSize > 9437184)
{
    // file is larger than 9MB, so raise the memory limit with ini_set
    ini_set("memory_limit", "375M");
}
// do your file processing below as it does now.

"Should we just set the memory_limit back to something like 64MB when we're done processing, or is it ok to leave it at 375MB?"

You do not have to worry about doing another ini_set to return the memory. The ini_set memory limit only applies for the duration of this script; it is restored after the script finishes executing. The actual memory limit for your PHP engine is set in your installation's php.ini file.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34880305
Separating the file upload from the processing?  Seems OK to me either way.  The file size is present in the $_FILES array so you do not have to look for it after-the-fact.  My instinct would be to do it in one step, but if there are very large files involved and the chance of failure is great the two-step approach might be easier to debug.
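
For example (the form field name 'pricing_file' is just a placeholder):

// size in bytes of the just-uploaded file, available in the upload request itself
$uploadSize = $_FILES['pricing_file']['size'];
if ($uploadSize > 9437184) {
    ini_set('memory_limit', '375M');
}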
 

Author Comment

by:dasher3000
ID: 34888248
Here's what we did:

- I have created a new import mode - sequential import. It consists mainly of one large function, which combines different parts of the old code with new code and handles the reading, parsing, database storing and output all together. The main difference from the old method is that the entries are read from the pricing file sequentially, and the (large) database queries are built up gradually from them, instead of reading the whole file and creating huge arrays in memory.

- I have also added some debug output: memory usage after each query run (it can be turned on with a define at the top), and overall processing time when finished.

- So, after some debugging (dozens of unsuccessful runs - luckily I was using a small pricing file), I got it to work properly, and the results were really impressive: when switched to 'sequential_import', the timing remained nearly the same (~4 min for the large 9MB file), but the memory usage dropped from 300MB to just 3MB - a 100x improvement.

Here's the function code:
-----------------------------------


function sequential_parse_and_import($dir) {

    global $count, $first, $q, $error_log, $product_errors, $customer_errors;

    if (is_dir($dir)) {
        if ($dh = opendir($dir)) {
            while (($file = readdir($dh)) !== false) {
                // find the first .txt file in the directory:
                if ($file != '.' && $file != '..' && eregi('.txt$', $file)) {
                    // open file:
                    if ($price_handle = fopen($dir . '/' . $file, "r")) {
                        // sequentially parse file contents;
                        // reset previous pricing first:
                        custom_pricing_reset();

                        $prev_entry = array();
                        while ($entry = fgetcsv($price_handle, 64, '|')) {
                            if (empty($prev_entry)) {
                                // the first customer, start adding:
                                $customer_group = start_customer_group($entry[0]);
                                if ($customer_group !== false) {
                                    // process: first query entry
                                    $count = 0;
                                    $first = true;
                                    $q = "INSERT INTO " . TABLE_PRODUCTS_GROUPS . " VALUES ";
                                    process_product_entry($customer_group, $entry[1], sprintf(PRICE_FORMAT, $entry[2]));
                                }
                            } else {
                                if ($entry[0] == $prev_entry[0]) {
                                    // the same customer, continue adding:
                                    // process: next query entry
                                    if ($customer_group !== false) {
                                        process_product_entry($customer_group, $entry[1], sprintf(PRICE_FORMAT, $entry[2]));
                                    }
                                } else {
                                    // process this customer and skip to the next one:
                                    if ($customer_group !== false) {
                                        if ($customer_group > 0) {
                                            if (tep_db_query($q)) {
                                                $count = mysql_affected_rows();
                                            } else {
                                                $count = 0;
                                            }
                                        }
                                        echo ": " . $count . " entries added." . (DEBUG_MEM ? " - mem usage: " . memory_get_usage() : '') . "<br />\n";
                                    }
                                    // start adding the next customer:
                                    $customer_group = start_customer_group($entry[0]);
                                    if ($customer_group !== false) {
                                        // process: first query entry
                                        $count = 0;
                                        $first = true;
                                        $q = "INSERT INTO " . TABLE_PRODUCTS_GROUPS . " VALUES ";
                                        process_product_entry($customer_group, $entry[1], sprintf(PRICE_FORMAT, $entry[2]));
                                    }
                                } // if ($entry[0] == $prev_entry[0]) ...
                            } // if (empty($prev_entry)) ...

                            // save the processed entry:
                            $prev_entry = $entry;
                        }

                        // process the last customer:
                        if ($customer_group !== false) {
                            if ($customer_group > 0) {
                                if (tep_db_query($q)) {
                                    $count = mysql_affected_rows();
                                } else {
                                    $count = 0;
                                }
                            }
                            echo ": " . $count . " entries added." . (DEBUG_MEM ? " - mem usage: " . memory_get_usage() : '') . "<br />\n";
                        }

                        // close file:
                        fclose($price_handle);
                    } else {
                        closedir($dh);
                        return false; // failed to open file
                    }
                    closedir($dh);
                    return true; // completed successfully
                }
            }
            return false; // failed to read a file from the dir
        } else {
            return false; // failed to open dir
        }
    } else {
        return false; // not a dir
    }
}


-----------------------------------

 

Author Closing Comment

by:dasher3000
ID: 34888322
Thanks for the tips!

The fgetcsv() seems to do the trick for importing large .csv files into a database while only using small amounts of memory.

Cheers,
D
