Dasher asked:

PHP / MySQL - Processing a 9MB-25MB .csv file

Hi,
We have:
1. PHP / MySQL website on dedicated server.

2. Customer uploads a large file (9MB - 25MB)

3. This file is a .csv of product_id,cust_id,product_price

4. What's the fastest and most efficient way to import this data into our table?

We have successfully created scripting that does it pretty fast, but we're having trouble with the larger files.

We have been able to get it to work by putting this at the beginning of our processing PHP script: ini_set("memory_limit","375M");

...but that seems like a bit of a hack, right?

There must be a more efficient way, or is this OK since we're on a dedicated server?

Should we just set the memory_limit back to something like 64MB when we're done processing, or is it ok to leave it at 375MB?

Thanks!






birwin:

I upload and process a 400,000-line CSV file every day, using a cron job. Because of the processing I do, the code is very complicated, but the upload itself is very easy.

I set:

set_time_limit(900);               // allow up to 15 minutes for the run
ini_set('memory_limit', '500M');   // raise the memory limit for this script only

// fetch the remote CSV:
$my_file = file_get_contents('http://www.xxxxxx.com/download_excel/xxxxxx.csv', true);

// save it locally under a timestamped name:
$file_suffix = date("ymdhis");
$destination_file = "inventory/inventory" . $file_suffix . ".csv";

file_put_contents($destination_file, $my_file);


The time limits and memory allocation expire with the program.
wwwdeveloper2:

Using ini_set is the best route. You are just allowing this one script to use that much memory while the script is running, not the whole PHP engine (as you would if you set it in the php.ini file).

Please understand that each instance or thread running this script can use up to 375M. So if the script gets a lot of concurrent traffic, each instance could be using its own 375M; if you had 2GB of memory in the server and allocated 1GB per script, two concurrent instances would try to consume all of your memory and choke the server.

I think you are doing it the best way; just be careful and decide what your setting should be, based on usage and the available memory on the dedicated server.

It also might be neat if you could identify how big the uploaded file is and then use your ini_set to assign memory_limit as needed.
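For example, something along these lines (just a sketch to illustrate the idea; the file path, the 3x multiplier and the ceiling are assumptions, not tested values):

// Rough sketch: scale memory_limit to the size of the uploaded file instead of
// always reserving 375M. Path and multiplier below are placeholder guesses.
$size = filesize('inventory/uploaded_pricing.csv');   // size in bytes

// allow roughly 3x the file size, with a 64M floor and a 512M ceiling:
$limitMb = min(512, max(64, (int) ceil(($size * 3) / 1048576)));
ini_set('memory_limit', $limitMb . 'M');

set_time_limit(900);   // give the import plenty of time as well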

Good luck!
If you have FILE privileges on MySQL, nothing will beat using MySQL's LOAD DATA INFILE on the uploaded file.

This way you don't need to worry about memory configurations.
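For what it's worth, a minimal sketch of what that could look like. The table name custom_prices, the connection details and the file path are placeholders; the columns follow the product_id,cust_id,product_price layout from the question, and LOCAL INFILE has to be enabled on both client and server:

// Sketch only -- table name, connection details and file path are placeholders.
$db  = new mysqli('localhost', 'db_user', 'db_pass', 'shop_db');
$csv = $db->real_escape_string('/path/to/uploaded.csv');

$sql = "LOAD DATA LOCAL INFILE '$csv'
        INTO TABLE custom_prices
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        (product_id, cust_id, product_price)";

if ($db->query($sql)) {
    echo $db->affected_rows . " rows loaded.";
} else {
    echo "Load failed: " . $db->error;
}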
ASKER CERTIFIED SOLUTION by lexlythius (solution text available to members only)
Reading 400,000 lines one at a time is an intensive process. In the real world I have had no problems, none whatsoever, uploading massive files with file_get_contents().

Most files need to be purged, cleaned and processed before being dumped into the database. I use LOAD DATA INFILE to insert my processed results into the database, but you had better be very sure of the content of the data before you dump the unprocessed results directly into a database.

Even innocent data can have killer characters embedded in it. Your load will stop with the first O'Reilly it meets.
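If you do end up building the INSERTs yourself rather than using LOAD DATA INFILE, a prepared statement sidesteps that problem entirely. A minimal sketch, with the table name, connection details and bind types as assumptions:

// Sketch only: bind each CSV field instead of pasting it into the SQL string,
// so values like O'Reilly cannot break the query.
$db   = new mysqli('localhost', 'db_user', 'db_pass', 'shop_db');
$stmt = $db->prepare(
    'INSERT INTO custom_prices (product_id, cust_id, product_price) VALUES (?, ?, ?)'
);
$stmt->bind_param('ssd', $productId, $custId, $price);   // adjust types to your schema

$fh = fopen('/path/to/uploaded.csv', 'r');               // hypothetical path
while (($row = fgetcsv($fh, 1024, ',')) !== false) {
    list($productId, $custId, $price) = $row;
    $stmt->execute();
}
fclose($fh);
$stmt->close();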

That being said, if you are using GoDaddy with a shared hosting account, fgets() may be the only way you can do it, and even then a large file will time out. But with a dedicated server, that isn't an issue.
Dasher (ASKER):
Thanks for the ideas!
Just to clarify - the problem is in processing only, not uploading.

I'll try some of these out and post my findings ASAP.

- D
SOLUTION (text available to members only)
Dasher (ASKER):
No GoDaddy, thank goodness.

One point of possible interest, or perhaps unnecessary:
1. We setup a script that first uploads the large file to the server without processing it.
2. Then the user clicks a second button for "Process The File."

This is theoretically so the system just processes a local file without having to handle the upload at the same time.

Is that a pointless step to take?

Thanks,
D
Dasher -

I think it is good that you have it uploaded first. Since the file is already on the server when the user clicks "Process The File", have the script look at the file size and then use some conditions to decide how much memory should be allocated to the script for processing the file.

$getFileSize = filesize("test.csv"); // returns the file size in bytes

// check whether the file is larger than 9MB:
if ($getFileSize > 9437184)
{
    // file is larger than 9MB, so use ini_set to allocate the extra memory
    ini_set("memory_limit", "375M");
}
// do your file processing below, as it does now.

"Should we just set the memory_limit back to something like 64MB when we're done processing, or is it ok to leave it at 375MB?"

You do not have to worry about doing another ini_set to return the memory. The ini_set memory limit only applies for the duration of this script; it is restored after the script finishes executing. The actual memory limit for your PHP engine is set in your install's php.ini file.
Separating the file upload from the processing? Seems OK to me either way. The file size is present in the $_FILES array, so you do not have to look for it after the fact. My instinct would be to do it in one step, but if very large files are involved and the chance of failure is greater, the two-step approach might be easier to debug.
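A rough sketch of that two-step flow (the form field name, file names and paths are made up for illustration):

// Step 1 -- upload script: just stash the file, no heavy processing here.
if (isset($_FILES['pricing_csv']) && $_FILES['pricing_csv']['error'] === UPLOAD_ERR_OK) {
    $size = $_FILES['pricing_csv']['size'];               // size is already known here
    $dest = 'inventory/pricing_' . date('ymdHis') . '.csv';
    move_uploaded_file($_FILES['pricing_csv']['tmp_name'], $dest);
    echo 'Stored ' . round($size / 1048576, 1) . ' MB. Click "Process The File" to import it.';
}

// Step 2 -- processing script, run when the user clicks the second button:
// check filesize() on the stored file, raise memory_limit/set_time_limit
// accordingly, then stream the file with fgetcsv().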
Dasher (ASKER):
Here's what we did:

I created a new import mode, "sequential import". It consists mainly of one large function which combines parts of the old code with new code and does the reading, parsing, database storing and output all together. The main difference from the old method is that the entries are read from the pricing file sequentially, and the database queries (large ones, as in the previous optimization) are built up gradually from them, instead of reading the whole file and creating huge arrays in memory.

- I also added some debug output: memory usage after each query run (it can be turned on with a define at the top), and the overall processing time when finished.

- After some debugging (dozens of unsuccessful runs; luckily I was using a small pricing file) I got it to work properly, and the results were really impressive. With 'sequential_import' the timing stayed nearly the same (~4 min for the large 9MB file), but memory usage dropped from 300MB to just 3MB, roughly 100x less!

Here's the function code:
-----------------------------------


function sequential_parse_and_import($dir){

  global $count, $first, $q, $error_log, $product_errors, $customer_errors;

  if(is_dir($dir)){
    if($dh = opendir($dir)){
      while (($file = readdir($dh)) !== false){
        //find first .txt file in the directory:
        if($file != '.' && $file != '..' && eregi('.txt$', $file)){
          //open file:
          if($price_handle = fopen($dir . '/' . $file, "r")){
            //sequential parse file contents:

            //reset previous pricing first:
            custom_pricing_reset();

            $prev_entry = array();
            while($entry = fgetcsv($price_handle, 64, '|')){
              if(empty($prev_entry)){
                //the first customer, start adding:
                $customer_group = start_customer_group($entry[0]);
                if($customer_group !== false){
                  //process: first query entry
                  $count = 0;
                  $first = true;
                  $q = "INSERT INTO " . TABLE_PRODUCTS_GROUPS . " VALUES ";
                  process_product_entry($customer_group, $entry[1], sprintf(PRICE_FORMAT, $entry[2]));
                }

              } else {
                if($entry[0] == $prev_entry[0]){
                  //the same customer, continue adding:

                  //process: next query entry
                  if($customer_group !== false){
                    process_product_entry($customer_group, $entry[1], sprintf(PRICE_FORMAT, $entry[2]));
                  }

                } else {
                  //process customer and skip to the next one:
                  if($customer_group !== false){
                    if($customer_group > 0){
                      if(tep_db_query($q)){
                        $count = mysql_affected_rows();
                      } else {
                        $count = 0;
                      }
                    }
                    echo ": " . $count . " entries added." . (DEBUG_MEM ? " - mem usage: " . memory_get_usage() : '') . "<br />\n";
                  }
                  //start adding next customer:
                  $customer_group = start_customer_group($entry[0]);
                  if($customer_group !== false){
                    //process: first query entry
                    $count = 0;
                    $first = true;
                    $q = "INSERT INTO " . TABLE_PRODUCTS_GROUPS . " VALUES ";
                    process_product_entry($customer_group, $entry[1], sprintf(PRICE_FORMAT, $entry[2]));
                  }

                } // if($entry[0] == $prev_entry[0]){ ...

              } // if(empty($prev_entry)){ ...

              //save the processed entry:
              $prev_entry = $entry;
            }

            //process last customer:
            if($customer_group !== false){
              if($customer_group > 0){
                if(tep_db_query($q)){
                  $count = mysql_affected_rows();
                } else {
                  $count = 0;
                }
              }
              echo ": " . $count . " entries added." . (DEBUG_MEM ? " - mem usage: " . memory_get_usage() : '') . "<br />\n";
            }

            //close file:
            fclose($price_handle);
          } else {
            closedir($dh);
            return false; //failed to open file
          }
          closedir($dh);
          return true; //completed successfully
        }
      }
      return false; //failed to read file from dir
    } else {
      return false; //failed to open dir
    }
  } else {
    return false; //not a dir
  }
}


-----------------------------------

Dasher (ASKER):

Thanks for the tips!

fgetcsv() seems to do the trick for importing large .csv files into a database while using only a small amount of memory.

Cheers,
D
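The core of the low-memory pattern described above, streaming rows with fgetcsv() and flushing a multi-row INSERT every so often, boils down to something like the following sketch. The table name, batch size, delimiter and connection details are assumptions, not the code from this thread:

// Sketch only: memory stays flat because at most $batchSize rows are held at once.
$db        = new mysqli('localhost', 'db_user', 'db_pass', 'shop_db');
$fh        = fopen('/path/to/uploaded.csv', 'r');      // hypothetical path
$batchSize = 500;
$values    = array();

while (($row = fgetcsv($fh, 1024, ',')) !== false) {   // use '|' if that is your delimiter
    $values[] = sprintf("('%s','%s',%.4f)",
        $db->real_escape_string($row[0]),
        $db->real_escape_string($row[1]),
        (float) $row[2]);

    if (count($values) >= $batchSize) {
        $db->query('INSERT INTO custom_prices (product_id, cust_id, product_price) VALUES ' . implode(',', $values));
        $values = array();
    }
}
if ($values) {   // flush the remainder
    $db->query('INSERT INTO custom_prices (product_id, cust_id, product_price) VALUES ' . implode(',', $values));
}
fclose($fh);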