xtreme109 (United States of America) asked:
PHP Execute Python in background

So I am building an application that requires importing and manipulating a large CSV file. I figure it would be best to let a Python script handle that and have PHP execute the Python script.

I would like to accomplish a few things.
Run this process in the background.
Monitor this process via AJAX.

The issue I am running into is that the script takes a while to complete, and the only way I can monitor the process is by capturing the PID, but I don't believe it is being returned until the script is complete.

Now when I run a 'quick' script, everything works great.



Here is my sample code... I am using CodeIgniter.

function index()
{
    $command = 'python /to/my/script.py arg1 arg2';
    echo "Begin.....";
    $pid = $this->run_in_background($command);
    echo $pid;

    // Poll until the background process disappears from the process list.
    while ($this->is_process_running($pid)) {
        echo " . ";
        ob_flush();
        flush();
        sleep(1);
    }
    echo "Complete";
}

// Launch the command detached from the request and return its PID.
function run_in_background($Command, $Priority = 0)
{
    if ($Priority) {
        $PID = shell_exec("nohup nice -n $Priority $Command 2> /dev/null & echo $!");
    } else {
        $PID = shell_exec("nohup $Command 2> /dev/null & echo $!");
    }
    return trim($PID);
}

// A running PID gives `ps` a header line plus at least one process line.
function is_process_running($PID)
{
    exec("ps $PID", $ProcessState);
    return (count($ProcessState) >= 2);
}


ASKER CERTIFIED SOLUTION by Mark Brady, United States of America (available to Experts Exchange members only)
What makes you think that PHP is not a good tool to import a large CSV file?  

Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).  

I'm guessing there is a PHP solution, so long as the script does not create a long-running job.  PHP is made for generating web pages, so long-running jobs are sometimes less successful in PHP.  But trying to marry PHP and Python is also kind of "out there."

Maybe if you can tell us a little more about the application (what it does, in business - non-technical - terms) we can offer some more mainstream suggestions.
xtreme109 (ASKER):

Some more background to my project. Clients are able to upload a list of products, along with all of their variations, pricing, etc. (this is in CSV format).

Our "worst case" scenario is about 1M records,  In which we have to see if the variation already exist but  if not insert the record as new (using INSERT IGNORE?);however if the product/variation was previously listed and was set to 'inactive', and is now on that newly uploaded CSV I would like to update that existing record and set it back to 'active', as I do not want duplicate skus.

Because of the time frames and the high possibility of someone leaving the page after the upload, I figured I would run a background process which I can then monitor via a simple AJAX call to check on the process and update the customer upon completion.

**1 million records with very very minor manipulation takes about 10-15 minutes.
Do you have to wait for the upload to finish before starting your integration?
Yes, the upload is relatively quick.

File size is about 30 MB.
Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).
I simplified my issue without getting overly complex, but...

Attached you will find a list which contains variations for ONE product! I generate the SKUs based on a prefix and generate numbers on the fly based on a database value, e.g. prefix = "TEST", lastSKU = "100".

Each new variation creates a SKU, for example "TEST0000101", "TEST0000102", etc.

https://dl.dropboxusercontent.com/u/50617741/large-sample-set.csv
What database are you using?

To me, this seems like a simple problem with a two-step solution.
1. You update a joined view of the CSV data with the production table, based on the SKU where the production table has an inactive flag = true, setting the production table inactive flag = false
2. Append all the CSV rows, allowing the unique index on the SKU column to prevent duplicates.
I am using MySQL.
SOLUTION (available to Experts Exchange members only)
My download from DropBox was fast, but my upload from my office PC to my server took about 75 seconds via FTP.

In my timing test, I found that there were 1,013,761 records.  My script read these in 5,536.573 ms (5.5 seconds).  So my guess is that either the database is too slow or the application design contains something that is slowing it down.   75 seconds to transfer the file, plus 5 seconds to process the CSV is nowhere near 10 minutes!
@Ray

How long to import the CSV file into a MySQL database? (no indexes)
Of course, if the CSV could be opened as an external table there might not be any reason to import it.
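For a baseline, it would be worth timing a single bulk load into an un-indexed staging table, with no row-at-a-time INSERTs. A rough sketch in Python/MySQLdb (`csv_staging` is a made-up table with seven VARCHAR columns matching the CSV, and the server must permit LOCAL INFILE):

#!/usr/bin/python
# Rough sketch: time one bulk load of the CSV into an un-indexed staging
# table. `csv_staging` is a hypothetical table with seven VARCHAR columns
# matching the CSV; the server/client must allow LOCAL INFILE.
import time
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db', local_infile=1)
cur = con.cursor()

start = time.time()
cur.execute("""
    LOAD DATA LOCAL INFILE '/path/to/file/large-data-set.csv'
    INTO TABLE csv_staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (power, diameter, base_curve, axis, cylinder, addition, color)
""")
con.commit()
print "Loaded %d rows in %.1f seconds" % (cur.rowcount, time.time() - start)

con.close()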
@aikimark: I think that INSERT may be the central issue slowing the PHP script down.  If you sort the million-line CSV and look at the data, it seems to me that there is some normalization that could be applied.  These look like they might be specifications for eyeglasses, but I'm not sure.
@Ray

You are correct.  These are glasses/lens combinations of strengths, cyl (astigmatism correction) and color combinations.
@Ray - I believe the time is consumed in performing an INSERT IGNORE on the MySQL database and manual data manipulation. 75 seconds is not bad at all!

Also, here is my table structure. The "custom" fields are for specific products which require special settings, and I am storing JSON data in those fields since they are not tied to any specific SKU.

CREATE TABLE IF NOT EXISTS `my_product_variations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `product_id` int(11) NOT NULL,
  `is_custom` tinyint(1) NOT NULL DEFAULT '0',
  `sku` varchar(15) NOT NULL,
  `power` varchar(20) NOT NULL,
  `diameter` varchar(10) NOT NULL,
  `base_curve` varchar(10) NOT NULL,
  `axis` varchar(20) NOT NULL,
  `cylinder` varchar(10) NOT NULL,
  `addition` varchar(10) NOT NULL,
  `color` varchar(25) NOT NULL,
  `frame_color` varchar(75) DEFAULT NULL,
  `lens_color` varchar(75) DEFAULT NULL,
  `custom_power` text,
  `custom_diameter` text,
  `custom_base_curve` text,
  `custom_axis` text,
  `custom_cylinder` text,
  `custom_addition` text,
  `custom_color` text,
  `custom_frame_color` text,
  `custom_lens_color` text,
  `modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `created` datetime DEFAULT NULL,
  `active` tinyint(1) DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `sku_unique` (`sku`,`active`),
  UNIQUE KEY `variation_unique` (`product_id`,`power`,`diameter`,`base_curve`,`axis`,`cylinder`,`addition`,`color`,`active`,`frame_color`,`lens_color`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=205502 ;


Remove the UNIQUE KEY `variation_unique` index and perform another import test.
Agree with @aikimark, UNIQUE keys require a table scan for every INSERT, and this causes geometrically longer INSERT processes as the tables grow.  Also, I am wondering about the columns defined as TEXT.  Why is that?  Is it possible to define those as some kind of CHAR instead?

What if you load the table without the keys, then go back and ALTER TABLE to add the keys in a separate query?

Is the data such that it could be structured using ENUM?  Or numeric values?
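For the load-without-keys test, the index handling would just be a drop before the import and a rebuild afterwards. A sketch against the table definition posted above (the rebuild will fail if duplicate combinations slipped in during the load):

# Sketch of the load-without-keys idea against the table posted above:
# drop the composite unique key before the big import, rebuild it after.
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

# Drop the expensive composite unique key before the import...
cur.execute("ALTER TABLE my_product_variations DROP INDEX variation_unique")

# ...run the big import here...

# ...then rebuild the key in a single pass once the data is in.
# (This step fails if duplicate combinations were inserted meanwhile.)
cur.execute("""
    ALTER TABLE my_product_variations
    ADD UNIQUE KEY variation_unique
        (product_id, power, diameter, base_curve, axis, cylinder,
         addition, color, active, frame_color, lens_color)
""")

con.close()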
@Ray,

What about using the CSV file as an external table?  No indexes.
SOLUTION (available to Experts Exchange members only)
You can specify an external table as an in-memory object?!?  Wow!  That should be a good performer.  I thought we would have to point to the newly uploaded .CSV file.  Or are you going to perform some of your PHP 'magic'?
I removed the UNIQUE KEY `variation_unique` index and I ended up stopping the process around 2 minutes. The script was only able to import ~65,000 records.

I was able to fully execute the script in the background on the server and capture the PID instantly. I am now able to check on the status of the script, as well as pause/kill the script at any time.

The issue is still the length of time.

Here is what I have so far...

PHP - CodeIgniter:
class Importtest extends CI_Controller
{
    function __construct()
    {
        parent::__construct();
    }

    // AJAX endpoint: prints whether the given PID is still alive.
    function check_if_running($pid){
        $command = 'ps -p '.$pid;
        exec($command, $op);
        if (!isset($op[1])) echo "FALSE";
        else echo "True!";
    }

    public function status($pid){
        $command = 'ps -p '.$pid;
        exec($command, $op);
        if (!isset($op[1])) return false;
        else return true;
    }

    public function stop($pid){
        $command = 'kill '.$pid;
        exec($command);
        if ($this->status($pid) == false) echo "TRUE";
        else echo "FALSE";
    }

    // Launch a command detached from the request and return its PID immediately.
    private function __runCom($commandToRun){
        $command = 'nohup '.$commandToRun.' > /dev/null 2>&1 & echo $!';
        exec($command, $op);
        return (int)$op[0]; // Return PID
    }

    function index(){
        echo "<h1>Test Large Import</h1>";

        $scriptPath   = $_SERVER["DOCUMENT_ROOT"] . "/python/import.py";
        $fullFilePath = "/path/to/file/large-data-set.csv";

        $productID = "1227"; // test data
        $skuPrefix = "XXX";  // test data
        $nextSku   = 105;    // test data

        // Python command
        $command   = 'python '. $scriptPath .' ' . $fullFilePath .' ' . $productID .' ' . $skuPrefix .' ' . $nextSku;
        $processID = $this->__runCom($command);
        echo "Current PID: " . $processID;
    }
}




Python script... very basic as of right now; I haven't put much time into validation, etc.
#!/usr/bin/python
import os
import time
import sys
import csv
import MySQLdb as mdb


filePath      = sys.argv[1]
productID     = int(sys.argv[2])
skuPrefix     = sys.argv[3]
nextSkuNumber = sys.argv[4]

f = None
con = None

try:
    # establish DB connection
    con = mdb.connect('db.server.com', 'user', 'pass', 'db')
    cur = con.cursor()
    # try to open the file and begin parsing
    f = open(filePath, "rb")

    skipFirst = True

    reader = csv.reader(f)  # creates the reader object
    for csvRow in reader:   # iterates the rows of the file in order

        if skipFirst == True:
            skipFirst = False
            continue

        # build the next SKU from the prefix and a zero-padded counter
        nextSkuNumber = int(nextSkuNumber) + 1
        nextSkuNumber = '%07d' % nextSkuNumber
        nextSku = skuPrefix + nextSkuNumber

        insertIgnore = "INSERT IGNORE INTO my_product_variations (product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color) VALUES (%d, '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')" % (productID, nextSku, csvRow[0], csvRow[1], csvRow[2], csvRow[3], csvRow[4], csvRow[5], csvRow[6])

        # one INSERT and one commit per CSV row
        cur.execute(insertIgnore)
        con.commit()

except mdb.Error, e:

    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

finally:
    if f:
        f.close()

    if con:
        con.close()


@Ray
I wouldn't be able to constantly add and remove the indexes, as this is for a single product and we have more than one product in our database.

As far as ENUM goes, a good majority of the fields would need over 300 ENUM values, so I am not sure that is ideal.

The TEXT fields are for custom products which will accept a JSON array, and there is no clearly defined SKU since those products require custom settings.

I hope this helps. I appreciate all the help so far!
No wonder you're experiencing such poor performance.  You're doing row-at-a-time inserts.

stopping the process around 2 minutes
Thanks for testing.  Just to confirm...that index was on the production table and not the CSV data, right?
@aikimark
I will modify to do batch inserts of 1000(?) and report back.

Correct... the index was on the table.
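Roughly, the change would be something like this (an untested sketch: same SKU generation, but rows buffered and sent with executemany, one commit per 1,000-row batch instead of one per row):

#!/usr/bin/python
# Untested sketch of the batching change: same SKU generation as before,
# but rows are buffered and sent with executemany (parameterised), with
# one commit per 1,000-row batch instead of one per row.
import sys
import csv
import MySQLdb as mdb

filePath      = sys.argv[1]
productID     = int(sys.argv[2])
skuPrefix     = sys.argv[3]
nextSkuNumber = int(sys.argv[4])

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

insertIgnore = ("INSERT IGNORE INTO my_product_variations "
                "(product_id, sku, power, diameter, base_curve, axis, "
                "cylinder, addition, color) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)")

f = open(filePath, "rb")
reader = csv.reader(f)
next(reader)                                  # skip the header row
batch = []
for csvRow in reader:
    nextSkuNumber += 1
    nextSku = skuPrefix + '%07d' % nextSkuNumber
    batch.append([productID, nextSku] + csvRow[0:7])
    if len(batch) >= 1000:
        cur.executemany(insertIgnore, batch)  # one executemany call per batch
        con.commit()
        batch = []
if batch:                                     # flush the final partial batch
    cur.executemany(insertIgnore, batch)
    con.commit()
f.close()
con.close()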
The issue with me batching inserts is that I cannot create a duplicate variation with a new SKU.

Otherwise I would have multiple variations with different SKUs and my system would not be able to manage inventory correctly.
I was thinking about two batch queries.  One to update the Active flag (matching SKU rows) and the other just does a bulk append for new SKU rows.
The unique index on the SKU column will prevent duplicate SKUs
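Roughly like this (a sketch only; `csv_staging` is a made-up name for wherever the uploaded CSV rows land, and it assumes each staging row carries the SKU it should match on):

# Sketch of the two batch queries. `csv_staging` is a made-up name for
# wherever the uploaded CSV rows land, and it assumes each staging row
# carries the SKU it should match on.
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

# 1. Re-activate any previously listed variation that shows up in the upload.
cur.execute("""
    UPDATE my_product_variations p
    JOIN csv_staging s ON s.sku = p.sku
    SET p.active = 1
    WHERE p.active = 0
""")

# 2. Append everything; the unique index on sku throws away the duplicates.
cur.execute("""
    INSERT IGNORE INTO my_product_variations
        (product_id, sku, power, diameter, base_curve, axis,
         cylinder, addition, color)
    SELECT product_id, sku, power, diameter, base_curve, axis,
           cylinder, addition, color
    FROM csv_staging
""")

con.commit()
con.close()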
I would then need to keep track of those and then mark them active... any ideas? Batching will not provide that back to me.
Using my experience with similar applications...

I think that you will have so few updates and new rows that these two operations shouldn't take more than a few seconds to execute.  You already have a two column index on SKU and active.  Depending on the percentage of the (total number of) rows that are inactive, this should be a reasonably small number of SKU comparisons with the CSV file.

Similarly, I wouldn't expect to add many new SKU rows with the append query.
@aikimark
This depends on how often variations become available (active)/inactive.

@Ray
ENUM may not be a great idea, as if anything new were to be added to one of the fields I would have to alter the entire table.
SOLUTION (available to Experts Exchange members only)
If you're going to do that, you might want to try my suggestion of hashing those columns and storing them in a new column with a unique index.  Hash the contents of each row in the CSV.  I usually advise hashers to include a field delimiter when hashing.  The most logical delimiter would be a comma, since that simplifies the hashing of the CSV lines.

Have you considered testing the external table technique?
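Something along these lines (a sketch: `row_hash` would be a new, uniquely indexed CHAR(32) column on the production table, and MD5 is just an example digest):

# Sketch of the row-hash idea: hash the comma-delimited attribute fields of
# each CSV row. The digest would go into a new, uniquely indexed column
# (call it `row_hash`, CHAR(32)); MD5 is just an example digest.
import csv
import hashlib

seen = set()
f = open('/path/to/file/large-data-set.csv', 'rb')
reader = csv.reader(f)
next(reader)                                  # skip the header row
for csvRow in reader:
    rowHash = hashlib.md5(','.join(csvRow[0:7])).hexdigest()
    seen.add(rowHash)
f.close()
print "%d distinct variation hashes" % len(seen)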
I will try hashing on my next round of testing.

Personally I have never used, nor do I know how to use, the external table. Sample script?
I think Ray would have better examples and caveats for external tables.

The help files (online documentation) take you through the steps of defining a table (DDL statements) and then instruct you to swap out the .CSV file with your actual CSV file.
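In outline it is something like this (a sketch; the table name here is a placeholder, CSV-engine columns must be NOT NULL, and the table can carry no indexes):

# Sketch of the "external table" idea with MySQL's CSV storage engine.
# The table name is a placeholder; CSV-engine columns must be NOT NULL
# and the table can carry no indexes.
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

cur.execute("""
    CREATE TABLE csv_external (
        power      varchar(20) NOT NULL,
        diameter   varchar(10) NOT NULL,
        base_curve varchar(10) NOT NULL,
        axis       varchar(20) NOT NULL,
        cylinder   varchar(10) NOT NULL,
        addition   varchar(10) NOT NULL,
        color      varchar(25) NOT NULL
    ) ENGINE=CSV
""")

# The server now keeps this table's data in csv_external.CSV inside the
# database's data directory. Per the documentation, you swap that file out
# for the uploaded CSV (and FLUSH TABLES), and the rows become queryable
# in place with no import step:
cur.execute("SELECT COUNT(*) FROM csv_external")
print "rows visible:", cur.fetchone()[0]

con.close()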
I know you have had some great advice in this thread, but you asked how to execute a Python script from within PHP. I showed you how to do it in two different ways and no comment?
@Mark
I appreciate your comments! Thank you. The title of this thread probably could have been named better. The code I put in my first post showed me executing the script and running it successfully (using the methods you shared). However, my post explained my issues with large files.
Based on the title of my post, Mark's solution was best.
However, based on my actual post and the problems I faced, I took Ray's and aikimark's help and feedback and came up with a solution.
@xtreme109

So, you really aren't working with actual SKU values?  Well, that does restrict how many ways you can solve this problem.

When you get around to playing with hashing, open up a new question and post a comment in this thread with the URL to the new question.
@aikimark
I am working with SKUs; however, they are generated dynamically the first time, assuming that particular combination of variables does not exist.

This is shown in my sample Python script.
I am working with SKUs
Let me clarify.  When I referred to a SKU, I meant a predefined value that is associated with a particular product that you do or might carry/handle/sell/process/etc.  Any CSV record would contain this SKU value along with other data.  You are using the combined attributes as your SKU.

If you needed to update an attribute in an existing product, you would create a new SKU, since the attribute combination would not match anything in your production table.