Solved

PHP Execute Python in background

Posted on 2014-09-04
42
1,002 Views
Last Modified: 2014-09-11
So I am building an application that requires importing and manipulating a large CSV file. I figure it would be best to let a Python script handle that and have PHP execute the Python script.

I would like to accomplish a few things:
Run this process in the background
Monitor this process via AJAX.

The issue I am running into is that the script takes a while to complete, and the only way I can monitor the process is by capturing the PID, but I don't believe it is being returned until the script is complete.

Now when I run a 'quick' script, everything works great.



Here is my sample code... I am using CodeIgniter.

function index(){
    $command = 'python /to/my/script.py arg1 arg2';
    echo "Begin.....";
    $pid = $this->run_in_background($command);
    echo $pid;
    while($this->is_process_running($pid))
    {
        echo(" . ");
        ob_flush(); flush();
        sleep(1);
    }
    echo "Complete";
}
    
    
    
    function run_in_background($Command, $Priority = 0)
    {
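       // Note: stdout is not redirected below, so shell_exec() does not get EOF on its pipe
       // until the backgrounded command closes stdout -- i.e. the PID is not returned
       // until the script finishes.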
       if($Priority)
           $PID = shell_exec("nohup nice -n $Priority $Command 2> /dev/null & echo $!");
       else
           $PID = shell_exec("nohup $Command 2> /dev/null & echo $!");
       return($PID);
    }
    

    function is_process_running($PID){
       exec("ps $PID", $ProcessState);
       return(count($ProcessState) >= 2);
    }


0
Comment
Question by:xtreme109
42 Comments
 
LVL 20

Accepted Solution

by:
Mark Brady earned 200 total points
ID: 40304858
You can do it several ways. Here are a couple of samples.
<?php 

$command = escapeshellcmd('/usr/custom/test.py');
$output = shell_exec($command);
echo $output;

?>



and using passthru:
ob_start();
passthru('/usr/bin/python2.7 /srv/http/assets/py/switch.py arg1 arg2');
$output = ob_get_clean(); 


0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40304912
What makes you think that PHP is not a good tool to import a large CSV file?  

Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).  

I'm guessing there is a PHP solution, so long as the script does not create a long-running job.  PHP is made for generating web pages, so long-running jobs are sometimes less successful in PHP.  But trying to marry PHP and Python is also kind of "out there."

Maybe if you can tell us a little more about the application (what it does, in business - non-technical - terms) we can offer some more mainstream suggestions.
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40304971
Some more background to my project. Clients are able to upload a list of products, along with all of their variations, pricing, etc. (this is in CSV format).

Our "worst case" scenario is about 1M records,  In which we have to see if the variation already exist but  if not insert the record as new (using INSERT IGNORE?);however if the product/variation was previously listed and was set to 'inactive', and is now on that newly uploaded CSV I would like to update that existing record and set it back to 'active', as I do not want duplicate skus.

Because of the time frames and the high possibility of someone leaving the page after the upload, I figured to run a background process which I can then monitor via a simple AJAX call to check on the process and update the customer upon completion.

**1 million records with very very minor manipulation takes about 10-15 minutes.
0

 
LVL 45

Expert Comment

by:aikimark
ID: 40304976
do you have to wait for the upload to finish before starting your integration?
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40304982
Yes, the upload is relatively quick.

File size is about 30Mb.
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40304985
Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305001
I simplified my issue without getting overly complex, but...

Attached you will find a list which contains variations for ONE product!  I generate the SKUs based on a prefix and generate numbers on the fly based on a database value, e.g. prefix="TEST", lastSKU="100".

Each new variation creates a SKU, for example "TEST0000101", "TEST0000102", etc.

https://dl.dropboxusercontent.com/u/50617741/large-sample-set.csv
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305018
What database are you using?

To me, this seems like a simple problem with a two-step solution (a rough sketch follows the list):
1. You update a joined view of the CSV data with the production table, based on the SKU where the production table has an inactive flag = true, setting the production table inactive flag = false
2. Append all the CSV rows, allowing the unique index on the SKU column to prevent duplicates.
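
A rough sketch of what those two steps could look like, written against MySQLdb as used elsewhere in this thread; csv_staging, production and the column names are placeholders, and it assumes the CSV carries the SKU and that the production table has an active flag:

# Sketch only -- table and column names are illustrative placeholders.
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

# Step 1: reactivate production rows whose SKU shows up again in the uploaded CSV
cur.execute("""
    UPDATE production p
    JOIN csv_staging c ON c.sku = p.sku
    SET p.active = 1
    WHERE p.active = 0
""")

# Step 2: append every CSV row; the unique index on sku silently drops the duplicates
cur.execute("""
    INSERT IGNORE INTO production (sku, power, diameter, base_curve, axis, cylinder, addition, color)
    SELECT sku, power, diameter, base_curve, axis, cylinder, addition, color
    FROM csv_staging
""")

con.commit()
con.close()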
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305020
i am using MySQL
0
 
LVL 45

Assisted Solution

by:aikimark
aikimark earned 200 total points
ID: 40305025
A less simple approach would be needed if you needed to check for ANY changes.  If that's the case, add a hash column to the production table and compare each CSV line's hash to the corresponding SKU's hash in the production table.  Even though you only have nine fields to compare, it might be quicker to compute the CSV hashes.
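
A minimal sketch of computing such a hash on the CSV side, in the same Python 2 style as the script posted further down (the file name, column count and hash algorithm here are illustrative):

# Sketch only -- hashes the attribute fields of each CSV line, joined with a delimiter.
import csv
import hashlib

def row_hash(fields):
    # join with a comma so the same delimiter logic can be applied to the production columns
    return hashlib.md5(','.join(fields)).hexdigest()

f = open('large-sample-set.csv', 'rb')
reader = csv.reader(f)
next(reader)                      # skip the header row
for row in reader:
    h = row_hash(row[:9])         # the nine fields being compared
    # h would then be compared against the stored hash for the matching SKU
f.close()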
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40305038
My download from DropBox was fast, but my upload from my office PC to my server took about 75 seconds via FTP

In my timing test, I found that there were 1,013,761 records.  My script read these in 5,536.573 ms (5.5 seconds).  So my guess is that either the database is too slow or the application design contains something that is slowing it down.   75 seconds to transfer the file, plus 5 seconds to process the CSV is nowhere near 10 minutes!
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305044
@Ray

How long to import the CSV file into a MySQL database? (no indexes)
Of course, if the CSV could be opened as an external table there might not be any reason to import it.
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40305051
@aikimark: I think that INSERT may be the central issue slowing the PHP script down.  If you sort the million-line CSV and look at the data, it seems to me that there is some normalization that could be applied.  These look like they might be specifications for eyeglasses, but I'm not sure.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305057
@Ray

You are correct.  These are glasses/lens combinations of strengths, cyl (astigmatism correction) and color combinations.
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305068
@Ray - I believe the time is consumed in performing an INSERT IGNORE on the MySQL database and manual data manipulation. 75 seconds is not bad at all!

Also here is my table structure. The "custom" fields are for specific products which require special settings, and I am storing JSON data in those fields since they are not tied to any specific SKU.

CREATE TABLE IF NOT EXISTS `my_product_variations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `product_id` int(11) NOT NULL,
  `is_custom` tinyint(1) NOT NULL DEFAULT '0',
  `sku` varchar(15) NOT NULL,
  `power` varchar(20) NOT NULL,
  `diameter` varchar(10) NOT NULL,
  `base_curve` varchar(10) NOT NULL,
  `axis` varchar(20) NOT NULL,
  `cylinder` varchar(10) NOT NULL,
  `addition` varchar(10) NOT NULL,
  `color` varchar(25) NOT NULL,
  `frame_color` varchar(75) DEFAULT NULL,
  `lens_color` varchar(75) DEFAULT NULL,
  `custom_power` text,
  `custom_diameter` text,
  `custom_base_curve` text,
  `custom_axis` text,
  `custom_cylinder` text,
  `custom_addition` text,
  `custom_color` text,
  `custom_frame_color` text,
  `custom_lens_color` text,
  `modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `created` datetime DEFAULT NULL,
  `active` tinyint(1) DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `sku_unique` (`sku`,`active`),
  UNIQUE KEY `variation_unique` (`product_id`,`power`,`diameter`,`base_curve`,`axis`,`cylinder`,`addition`,`color`,`active`,`frame_color`,`lens_color`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=205502 ;


0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305098
remove the UNIQUE KEY `variation_unique` index and perform another import test.
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 40305753
Agree with @aikimark, UNIQUE keys require a table scan for every INSERT, and this causes geometrically longer INSERT processes as the tables grow.  Also, I am wondering about the columns defined as TEXT.  Why is that?  Is it possible to define those as some kind of CHAR instead?

What if you load the table without the keys, then go back and ALTER TABLE to add the keys in a separate query?

Is the data such that it could be structured using ENUM?  Or numeric values?
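
A sketch of the drop-the-index / load / re-index idea, using the table and index names from the CREATE TABLE above (note that re-adding the unique key will fail if the load introduced duplicates, which is part of the trade-off):

# Sketch only -- bulk-load first, index afterwards.
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

# 1. drop the expensive unique index before the bulk load
cur.execute("ALTER TABLE my_product_variations DROP INDEX variation_unique")

# 2. ...run the bulk INSERTs here, with no uniqueness check on every single row...

# 3. re-create the index once, after all rows are in
cur.execute("""
    ALTER TABLE my_product_variations
    ADD UNIQUE KEY variation_unique (product_id, power, diameter, base_curve, axis,
                                     cylinder, addition, color, active, frame_color, lens_color)
""")
con.close()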
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305755
@Ray,

What about using the CSV file as an external table?  No indexes.
0
 
LVL 109

Assisted Solution

by:Ray Paseur
Ray Paseur earned 100 total points
ID: 40305763
Might be reasonable and certainly worth a try.  Should be able to load the CSV entirely into memory.  It's a file of 33,397,837 bytes, so it will probably require increasing the memory limit because PHP objects and arrays are explosively large when compared to string data (Like 16X-20X larger).  So a PHP memory limit of 1GB is probably a safe starting point.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305807
You can specify an external table as an in-memory object?!?  Wow!  That should be a good performer.  I thought we would have to point to the newly uploaded .CSV file.  Or are you going to perform some of your PHP 'magic'?
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305866
I removed the UNIQUE KEY `variation_unique` index and I ended up stopping the process at around 2 minutes. The script was only able to import ~65,000 records.

I was able to fully execute the script in the background on the server and capture the PID instantly. I am now able to check on the status of the script, as well as pause/kill the script at any time.

The issue is still the length of time.

Here is what I have so far...


PHP - CodeIgniter:
class Importtest extends CI_Controller 
{

    function __construct()
    {
        parent::__construct();


    }
    
    function check_if_running($pid){
        $command = 'ps -p '.$pid;
        exec($command,$op);
        if (!isset($op[1]))echo "FALSE";
        else echo "True!";
    }
    public function status($pid){
        $command = 'ps -p '.$pid;
        exec($command,$op);
        if (!isset($op[1]))return false;
        else return true;
    }
    
    public function stop($pid){
        $command = 'kill '.$pid;
        exec($command);
        if ($this->status($pid) == false)echo "TRUE";
        else echo "FALSE";
    }
    
    private function __runCom($commandToRun){
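        // Redirecting stdout as well as stderr ("> /dev/null 2>&1") below is what lets exec()
        // return the PID right away instead of waiting for the Python script to finish.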
        $command = 'nohup '.$commandToRun.' > /dev/null 2>&1 & echo $!';
        exec($command ,$op);
        return (int)$op[0]; //Return PID
    }
    
    function index(){
        echo "<h1>Test Large Import</h1>";
        
        $scriptPath = $_SERVER["DOCUMENT_ROOT"] . "/python/import.py";
        $fullFilePath = "/path/to/file/large-data-set.csv";
        
        $productID = "1227"; //test data
        $skuPrefix = "XXX";  //test data
        $nextSku= 105;       //test data
        
        //python command
        $command = 'python '. $scriptPath .' ' . $fullFilePath .' ' . $productID .' ' . $skuPrefix .' ' .$nextSku;
        $processID = $this->__runCom($command);
        echo "Current PID: " . $processID;
        
    }
}




Python script... very basic as of right now; haven't put much time into validation, etc.
#!/usr/bin/python
import os
import time
import sys
import csv
import MySQLdb as mdb


filePath    = sys.argv[1]
productID   = int(sys.argv[2])
skuPrefix   = sys.argv[3]
nextSkuNumber     = sys.argv[4]

try:
    #establish DB connection
    con = mdb.connect('db.server.com', 'user', 'pass', 'db')
    cur = con.cursor()
    #try to open the file and begin parsing
    f = open(filePath, "rb") 
    
    skipFirst = True
    
    reader = csv.reader(f)  # creates the reader object
    for csvRow in reader:   # iterates the rows of the file in order

        if skipFirst == True:
            skipFirst = False
            continue

        nextSkuNumber = int(nextSkuNumber) + 1
        nextSkuNumber = '%07d' % nextSkuNumber
        nextSku = skuPrefix + nextSkuNumber

        insertIgnore = "INSERT IGNORE INTO my_product_variations (product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color) VALUES (%d, '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')" % (productID, nextSku, csvRow[0], csvRow[1], csvRow[2], csvRow[3], csvRow[4], csvRow[5], csvRow[6])
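        # builds, runs and commits one INSERT for every single CSV line (one round trip per row)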
        
        cur.execute(insertIgnore)
        con.commit()
    
except mdb.Error, e:
  
    print "Error %d: %s" % (e.args[0],e.args[1])
    sys.exit(1)
    
finally:    
    if f:
         f.close() 
        
    if con:    
        con.close()


0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305883
@ray
I wouldn't be able to constantly add and remove the indexes, as this is for a single product and we have more than one product in our database.

As for ENUM values, a good majority of the fields would need 300+ values, so I am not sure that is ideal.

The TEXT fields are for custom products which will accept a JSON array, and there is no clearly defined SKU since those products require custom settings.

I hope this helps. I appreciate all the help so far!
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305903
No wonder you're experiencing such poor performance.  You're doing row-at-a-time inserts.

stopping the process around 2 minutes
Thanks for testing.  Just to confirm...that index was on the production table and not the CSV data, right?
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305946
@aikimark
I will modify it to do batch inserts of 1000(?) and report back.

Correct... the index was on the table.
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305960
The issue with me batching inserts is that I cannot create a duplicate variation with a new SKU.

Otherwise I would have multiple variations with different SKUs and my system would not be able to manage inventory correctly.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305962
I was thinking about two batch queries.  One to update the Active flag (matching SKU rows) and the other just does a bulk append for new SKU rows.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305966
The unique index on the SKU column will prevent duplicate SKUs
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40305971
I would then need to keep track of those and then mark them active... any ideas? Batching will not provide that back to me.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40305987
Using my experience with similar applications...

I think that you will have so few updates and new rows that these two operations shouldn't take more than a few seconds to execute.  You already have a two column index on SKU and active.  Depending on the percentage of the (total number of) rows that are inactive, this should be a reasonably small number of SKU comparisons with the CSV file.

Similarly, I wouldn't expect to add many new SKU rows with the append query.
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40306186
@aikimark
This depends on how often variations become available (active)/inactive.

@ray
ENUM may not be a great idea, as if anything new were to be added to one of the fields I would have to alter the entire table.
0
 
LVL 2

Assisted Solution

by:xtreme109
xtreme109 earned 0 total points
ID: 40306705
UPDATE***


I resolved the issue. I utilized the above code; however, I did modify my Python script to INSERT 500 records at a time with ON DUPLICATE KEY UPDATE:

INSERT INTO table (x, x, x, x, x, x) VALUES
(d, d, d, d, d, 1), (d, d, d, d, d, 2), (d, d, d, d, d, 3), (d, d, d, d, d, 4)
 ON DUPLICATE KEY UPDATE active=1



Additionally, I did use a multi-column UNIQUE constraint on the following columns: product_id, power, diameter, base_curve, axis, cylinder, addition, color, frame_color, lens_color.

Doing this allowed me to create a unique SKU for only NEW products and reset old products (if the above constraint is met) to active, while not knowing the SKU beforehand.
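
For completeness, a rough sketch of how that batching could be wired into the Python script; the batch size, connection details, product ID, SKU prefix and starting number are all placeholders taken from the earlier test data:

# Sketch only -- multi-row INSERT ... ON DUPLICATE KEY UPDATE in batches of 500.
import csv
import MySQLdb as mdb

BATCH_SIZE = 500
con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

sql_head = ("INSERT INTO my_product_variations "
            "(product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color) VALUES ")
sql_tail = " ON DUPLICATE KEY UPDATE active = 1"

def flush(rows):
    # one statement per batch: a single multi-row VALUES list; duplicate variations just get active=1
    if not rows:
        return
    placeholders = ",".join(["(%s,%s,%s,%s,%s,%s,%s,%s,%s)"] * len(rows))
    cur.execute(sql_head + placeholders + sql_tail, [v for r in rows for v in r])
    con.commit()

product_id = 1227           # placeholder test data, as in the controller above
next_sku_number = 105       # placeholder; would come from the database
batch = []

f = open('large-data-set.csv', 'rb')
reader = csv.reader(f)
next(reader)                # skip the header row
for row in reader:
    next_sku_number += 1
    sku = 'XXX%07d' % next_sku_number
    batch.append([product_id, sku] + row[:7])
    if len(batch) >= BATCH_SIZE:
        flush(batch)
        batch = []
flush(batch)                # whatever is left over
f.close()
con.close()

One multi-row statement per 500 rows replaces a million individual INSERT/COMMIT round trips, which is where most of the speed-up comes from.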

Using ONLY the above query over and over again, here are my results:

*New inserts of 1M+ records average time elapsed is 40 seconds.
*New inserts along with UPDATES of 1M+ records average time elapsed is 45 seconds.
*ALL UPDATES of 1M+ records average time elapsed is 92 seconds.

Tested on a database of 5M+ records so far.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40306773
If you're going to do that, you might want to try my suggestion of hashing those columns and storing them in a new column with a unique index.  Hash the contents of each row in the CSV.  I usually advise hashers to include a field delimiter when hashing.  The most logical delimiter would be a comma, since that simplifies the hashing of the CSV lines.

Have you considered testing the external table technique?
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40306808
I will try hashing on my next round of testing.

Personally I have never used, nor do I know how to use, an external table.  Sample script?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40307089
I think Ray would have better examples and caveats for external tables.

The help files (online documentation) take you through the steps of defining a table (DDL statements) and then instruct you to swap out its .CSV file with your actual CSV file.
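
For reference, the "external table" being discussed here is MySQL's CSV storage engine, declared with an ordinary CREATE TABLE plus ENGINE=CSV; the engine keeps the rows in a plain .CSV file in the database directory (the file the documentation says to swap out), allows no indexes, and requires every column to be NOT NULL. The table and column definitions below are illustrative only:

# Sketch only -- a CSV-engine table mirroring the uploaded columns (names are illustrative).
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()
cur.execute("""
    CREATE TABLE csv_upload (
        power      VARCHAR(20) NOT NULL,
        diameter   VARCHAR(10) NOT NULL,
        base_curve VARCHAR(10) NOT NULL,
        axis       VARCHAR(20) NOT NULL,
        cylinder   VARCHAR(10) NOT NULL,
        addition   VARCHAR(10) NOT NULL,
        color      VARCHAR(25) NOT NULL
    ) ENGINE=CSV
""")
con.close()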
0
 
LVL 20

Expert Comment

by:Mark Brady
ID: 40307174
I know you have had some great advice in this thread, but you asked how to execute a Python script from within PHP. I showed you how to do it in two different ways and got no comment?
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40307498
@mark
I appreciate your comments, thank you! This thread probably could've been titled better. The code in my first post showed me executing the script and running it successfully (using the methods you shared). However, my post explained my issues with large files.
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40307504
Based on the title of my post, Mark's solution was best.
However, based on my actual post and the problems I faced, I took Ray's and aikimark's help and feedback and came up with a solution.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40307785
@xtreme109

So, you really aren't working with actual SKU values?  Well, that does restrict how many ways you can solve this problem.

When you get around to playing with hashing, open up a new question and post a comment in this thread with the URL to the new question.
0
 
LVL 2

Author Comment

by:xtreme109
ID: 40309878
@aikimark
I am working with SKUs; however, they are generated dynamically the first time, assuming that particular combination of variables does not exist.

This is shown in my sample Python script.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40310001
i am working with SKUs
Let me clarify.  When I referred to a SKU, I meant a predefined value that is associated with a particular product that you do or might carry/handle/sell/process/etc.  Any CSV record would contain this SKU value along with other data.  You are using the combined attributes as your SKU.

If you needed to update an attribute in an existing product, you would create a new SKU, since the attribute combination would not match anything in your production table.
0
 
LVL 2

Author Closing Comment

by:xtreme109
ID: 40316503
Based on the title of my post, Mark's solution was best.
However, based on my actual post and the problems I faced, I took Ray's and aikimark's help and feedback and came up with a solution.
0
