Solved

PHP: Execute Python in background

Posted on 2014-09-04
884 Views
Last Modified: 2014-09-11
So I am building an application that requires importing and manipulating a large CSV file. I figure it would be best to let a Python script handle that and have PHP execute the Python script.

I would like to accomplish a few things:
Run this process in the background.
Monitor this process via AJAX.

The issue I am running into is that the script takes a while to complete, and the only way I can monitor the process is by capturing the PID, but I don't believe it is being returned until the script is complete.

Now when I run a 'quick' script, everything works great.



Here is my sample code... I am using CodeIgniter.

function index(){
    $command = 'python /to/my/script.py arg1 arg2';
    echo "Begin.....";

    // launch the Python script and (hopefully) get its PID back right away
    $pid = $this->run_in_background($command);
    echo $pid;

    // poll until the background process disappears from the process list
    while($this->is_process_running($pid))
    {
        echo(" . ");
        ob_flush(); flush();
        sleep(1);
    }
    echo "Complete";
}

function run_in_background($Command, $Priority = 0)
{
    // nohup detaches the command from the request; "echo $!" returns the PID of the
    // backgrounded process immediately
    if($Priority)
        $PID = shell_exec("nohup nice -n $Priority $Command 2> /dev/null & echo $!");
    else
        $PID = shell_exec("nohup $Command 2> /dev/null & echo $!");
    return trim($PID); // strip the trailing newline from shell_exec()
}

function is_process_running($PID){
    // ps prints a header line plus one line per matching process
    exec("ps $PID", $ProcessState);
    return(count($ProcessState) >= 2);
}


Question by:xtreme109
42 Comments
 
LVL 20

Accepted Solution

by:
Mark Brady earned 200 total points
You can do it several ways. Here are a couple of samples.
<?php 

$command = escapeshellcmd('/usr/custom/test.py');
$output = shell_exec($command);
echo $output;

?>



And using passthru():
ob_start();
passthru('/usr/bin/python2.7 /srv/http/assets/py/switch.py arg1 arg2');
$output = ob_get_clean(); 


 
LVL 108

Expert Comment

by:Ray Paseur
What makes you think that PHP is not a good tool to import a large CSV file?  

Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).  

I'm guessing there is a PHP solution, so long as the script does not create a long-running job.  PHP is made for generating web pages, so long-running jobs are sometimes less successful in PHP.  But trying to marry PHP and Python is also kind of "out there."

Maybe if you can tell us a little more about the application (what it does, in business - non-technical - terms) we can offer some more mainstream suggestions.
 
LVL 2

Author Comment

by:xtreme109
Some more background on my project: clients are able to upload a list of products, along with all of their variations, pricing, etc. (this is in CSV format).

Our "worst case" scenario is about 1M records, in which we have to see if the variation already exists and, if not, insert the record as new (using INSERT IGNORE?); however, if the product/variation was previously listed and was set to 'inactive', and is now on that newly uploaded CSV, I would like to update that existing record and set it back to 'active', as I do not want duplicate SKUs.

Because of the time frames and the high possibility of someone leaving the page after the upload, I figured I would run a background process which I can then monitor via a simple AJAX call to check on progress and update the customer upon completion.

**1 million records with very, very minor manipulation takes about 10-15 minutes.
 
LVL 45

Expert Comment

by:aikimark
Do you have to wait for the upload to finish before starting your integration?
 
LVL 2

Author Comment

by:xtreme109
Yes, the upload is relatively quick.

File size is about 30Mb.
 
LVL 108

Expert Comment

by:Ray Paseur
Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).
 
LVL 2

Author Comment

by:xtreme109
I simplified my issue without getting overly complex, but...

Attached you will find a list which contains variations for ONE product!  I generate the SKUs based on a prefix and generate numbers on the fly based off a database value, e.g. prefix="TEST", lastSKU="100".

Each new variation creates a SKU, for example "TEST0000101", "TEST0000102", etc.

https://dl.dropboxusercontent.com/u/50617741/large-sample-set.csv
 
LVL 45

Expert Comment

by:aikimark
What database are you using?

To me, this seems like a simple problem with a two-step solution (a rough sketch follows below):
1. Update a joined view of the CSV data with the production table, based on the SKU, where the production table's inactive flag = true, setting the production table's inactive flag = false.
2. Append all the CSV rows, allowing the unique index on the SKU column to prevent duplicates.
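
Roughly, as a sketch in Python/MySQLdb (matching the style of the script posted later in this thread): it assumes the CSV has already been loaded into a hypothetical staging table named csv_staging; that table name, its columns, and the connection values are assumptions for illustration only, not part of anything posted here.

# Sketch only: assumes a hypothetical staging table csv_staging that already
# holds the uploaded CSV rows (table/column names and credentials are assumptions).
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

# Step 1: re-activate production rows whose SKU appears in the new CSV
cur.execute("""
    UPDATE my_product_variations p
    JOIN csv_staging s ON s.sku = p.sku
    SET p.active = 1
    WHERE p.active = 0
""")

# Step 2: append the CSV rows; a unique index on sku rejects the duplicates
cur.execute("""
    INSERT IGNORE INTO my_product_variations
        (product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color)
    SELECT product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color
    FROM csv_staging
""")

con.commit()
con.close()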
 
LVL 2

Author Comment

by:xtreme109
I am using MySQL.
 
LVL 45

Assisted Solution

by:aikimark
aikimark earned 200 total points
A less simple approach would be needed if you needed to check for ANY changes.  If that's the case, add a hash column to the production table and compare each CSV line's hash to the corresponding SKU's hash in the production table.  Even though you only have nine fields to compare, it might be quicker to compute the CSV hashes.
 
LVL 108

Expert Comment

by:Ray Paseur
My download from DropBox was fast, but my upload from my office PC to my server took about 75 seconds via FTP.

In my timing test, I found that there were 1,013,761 records.  My script read these in 5,536.573 ms (5.5 seconds).  So my guess is that either the database is too slow or the application design contains something that is slowing it down.   75 seconds to transfer the file, plus 5 seconds to process the CSV is nowhere near 10 minutes!
 
LVL 45

Expert Comment

by:aikimark
@Ray

How long to import the CSV file into a MySQL database? (no indexes)
Of course, if the CSV could be opened as an external table there might not be any reason to import it.
 
LVL 108

Expert Comment

by:Ray Paseur
@aikimark: I think that INSERT may be the central issue slowing the PHP script down.  If you sort the million-line CSV and look at the data, it seems to me that there is some normalization that could be applied.  These look like they might be specifications for eyeglasses, but I'm not sure.
 
LVL 45

Expert Comment

by:aikimark
@Ray

You are correct.  These are glasses/lens combinations of strengths, cyl (astigmatism correction) and color combinations.
 
LVL 2

Author Comment

by:xtreme109
@Ray - I believe the time is consumed in performing an INSERT IGNORE on the MySQL database and manual data manipulation. 75 seconds is not bad at all!

Also, here is my table structure. The "custom" fields are for specific products which require special settings, and I am storing JSON data in those fields since they are not tied to any specific SKU.

CREATE TABLE IF NOT EXISTS `my_product_variations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `product_id` int(11) NOT NULL,
  `is_custom` tinyint(1) NOT NULL DEFAULT '0',
  `sku` varchar(15) NOT NULL,
  `power` varchar(20) NOT NULL,
  `diameter` varchar(10) NOT NULL,
  `base_curve` varchar(10) NOT NULL,
  `axis` varchar(20) NOT NULL,
  `cylinder` varchar(10) NOT NULL,
  `addition` varchar(10) NOT NULL,
  `color` varchar(25) NOT NULL,
  `frame_color` varchar(75) DEFAULT NULL,
  `lens_color` varchar(75) DEFAULT NULL,
  `custom_power` text,
  `custom_diameter` text,
  `custom_base_curve` text,
  `custom_axis` text,
  `custom_cylinder` text,
  `custom_addition` text,
  `custom_color` text,
  `custom_frame_color` text,
  `custom_lens_color` text,
  `modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `created` datetime DEFAULT NULL,
  `active` tinyint(1) DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `sku_unique` (`sku`,`active`),
  UNIQUE KEY `variation_unique` (`product_id`,`power`,`diameter`,`base_curve`,`axis`,`cylinder`,`addition`,`color`,`active`,`frame_color`,`lens_color`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=205502 ;

 
LVL 45

Expert Comment

by:aikimark
Remove the UNIQUE KEY `variation_unique` index and perform another import test.
 
LVL 108

Expert Comment

by:Ray Paseur
Agree with @aikimark; UNIQUE keys require an index check for every INSERT, and this makes INSERTs progressively slower as the tables grow.  Also, I am wondering about the columns defined as TEXT.  Why is that?  Is it possible to define those as some kind of CHAR instead?

What if you load the table without the keys, then go back and ALTER TABLE to add the keys in a separate query?

Is the data such that it could be structured using ENUM?  Or numeric values?
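
Something along these lines, as a sketch only (the index names and definitions are copied from the CREATE TABLE above; the connection values are placeholders, and re-adding a UNIQUE key will fail if the load introduced duplicates):

# Sketch of "load without keys, then ALTER TABLE to add them back".
# Index names/definitions come from the CREATE TABLE posted above.
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

# 1. Drop the unique keys so each INSERT skips the uniqueness checks
cur.execute("ALTER TABLE my_product_variations DROP INDEX sku_unique")
cur.execute("ALTER TABLE my_product_variations DROP INDEX variation_unique")

# 2. ...run the bulk import here...

# 3. Re-create the keys in one pass; this fails if duplicates were loaded
cur.execute("""
    ALTER TABLE my_product_variations
        ADD UNIQUE KEY sku_unique (sku, active),
        ADD UNIQUE KEY variation_unique (product_id, power, diameter, base_curve,
            axis, cylinder, addition, color, active, frame_color, lens_color)
""")
con.close()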
 
LVL 45

Expert Comment

by:aikimark
@Ray,

What about using the CSV file as an external table?  No indexes.
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 100 total points
Might be reasonable and certainly worth a try.  Should be able to load the CSV entirely into memory.  It's a file of 33,397,837 bytes, so it will probably require increasing the memory limit because PHP objects and arrays are explosively large when compared to string data (Like 16X-20X larger).  So a PHP memory limit of 1GB is probably a safe starting point.
 
LVL 45

Expert Comment

by:aikimark
You can specify an external table as an in-memory object?!?  Wow!  That should be a good performer.  I thought we would have to point to the newly uploaded .CSV file.  Or are you going to perform some of your PHP 'magic'?
 
LVL 2

Author Comment

by:xtreme109
I removed the UNIQUE KEY `variation_unique` index and I ended up stopping the process at around 2 minutes. The script was only able to import ~65,000 records.

I was able to fully execute the script in the background on the server and capture the PID instantly. I am now able to check on the status of the script, as well as pause/kill the script at any time.

The issue is still the length of time.

Here is what I have so far...


PHP - CodeIgniter:
class Importtest extends CI_Controller 
{

    function __construct()
    {
        parent::__construct();


    }
    
    function check_if_running($pid){
        $command = 'ps -p '.$pid;
        exec($command,$op);
        if (!isset($op[1]))echo "FALSE";
        else echo "True!";
    }
    public function status($pid){
        $command = 'ps -p '.$pid;
        exec($command,$op);
        if (!isset($op[1]))return false;
        else return true;
    }
    
    public function stop($pid){
        $command = 'kill '.$pid;
        exec($command);
        if ($this->status($pid) == false)echo "TRUE";
        else echo "FALSE";
    }
    
    private function __runCom($commandToRun){
        $command = 'nohup '.$commandToRun.' > /dev/null 2>&1 & echo $!';
        exec($command ,$op);
        return (int)$op[0]; //Return PID
    }
    
    function index(){
        echo "<h1>Test Large Import</h1>";
        
        $scriptPath = $_SERVER["DOCUMENT_ROOT"] . "/python/import.py";
        $fullFilePath = "/path/to/file/large-data-set.csv";
        
        $productID = "1227"; //test data
        $skuPrefix = "XXX";  //test data
        $nextSku= 105;       //test data
        
        //python command
        $command = 'python '. $scriptPath .' ' . $fullFilePath .' ' . $productID .' ' . $skuPrefix .' ' .$nextSku;
        $processID = $this->__runCom($command);
        echo "Current PID: " . $processID;
        
    }
}




Python script... very basic as of right now; haven't put much time into validation, etc.
#!/usr/bin/python
import os
import time
import sys
import csv
import MySQLdb as mdb


filePath    = sys.argv[1]
productID   = int(sys.argv[2])
skuPrefix   = sys.argv[3]
nextSkuNumber     = sys.argv[4]

# initialize so the finally block can safely test these even if connect/open fails
con = None
f = None

try:
    #establish DB connection
    con = mdb.connect('db.server.com', 'user', 'pass', 'db')
    cur = con.cursor()
    # try to open the file and begin parsing
    f = open(filePath, "rb") 
    
    skipFirst = True
    
    reader = csv.reader(f)  # creates the reader object
    for csvRow in reader:   # iterate over the rows of the file in order

        if skipFirst == True:
            skipFirst = False
            continue

        nextSkuNumber = int(nextSkuNumber) + 1
        nextSkuNumber = '%07d' % nextSkuNumber
        nextSku = skuPrefix + nextSkuNumber

        insertIgnore = "INSERT IGNORE INTO my_product_variations (product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color) VALUES (%d, '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')" % (productID, nextSku, csvRow[0], csvRow[1], csvRow[2], csvRow[3], csvRow[4], csvRow[5], csvRow[6])
        
        cur.execute(insertIgnore)
        con.commit()   # note: executes and commits one row at a time
    
except mdb.Error, e:
  
    print "Error %d: %s" % (e.args[0],e.args[1])
    sys.exit(1)
    
finally:
    if f:
        f.close()

    if con:
        con.close()

 
LVL 2

Author Comment

by:xtreme109
@ray
I wouldn't be able to constantly add and remove the indexes, as this is for a single product and we have more than one product in our database.

As far as ENUM values go, a good majority of the fields would need 300+ ENUM values, so I am not sure that is ideal.

The TEXT fields are for custom products which accept a JSON array, and there is no clearly defined SKU since those products require custom settings.

I hope this helps. I appreciate all the help so far!
 
LVL 45

Expert Comment

by:aikimark
No wonder you're experiencing such poor performance.  You're doing row-at-a-time inserts.

"stopping the process around 2 minutes"
Thanks for testing.  Just to confirm... that index was on the production table and not the CSV data, right?
 
LVL 2

Author Comment

by:xtreme109
@aikimark
I will modify to do batch inserts of 1000(?) and report back.

Correct... the index was on the table.
 
LVL 2

Author Comment

by:xtreme109
The issue with batching inserts is that I cannot create a duplicate variation with a new SKU.

Otherwise I would have multiple variations with different SKUs and my system would not be able to manage inventory correctly.
 
LVL 45

Expert Comment

by:aikimark
I was thinking about two batch queries.  One to update the Active flag (matching SKU rows) and the other just does a bulk append for new SKU rows.
 
LVL 45

Expert Comment

by:aikimark
The unique index on the SKU column will prevent duplicate SKUs
 
LVL 2

Author Comment

by:xtreme109
I would then need to keep track of those and then mark those active... any ideas? Batching will not provide that back to me.
 
LVL 45

Expert Comment

by:aikimark
Using my experience with similar applications...

I think that you will have so few updates and new rows that these two operations shouldn't take more than a few seconds to execute.  You already have a two column index on SKU and active.  Depending on the percentage of the (total number of) rows that are inactive, this should be a reasonably small number of SKU comparisons with the CSV file.

Similarly, I wouldn't expect to add many new SKU rows with the append query.
 
LVL 2

Author Comment

by:xtreme109
@aikimark
This depends on how often variations become available (active)/inactive.

@ray
ENUM may not be a great idea, as if anything new were to be added to one of the fields I would have to alter the entire table.
 
LVL 2

Assisted Solution

by:xtreme109
xtreme109 earned 0 total points
UPDATE***


I resolved the issue. I utilized the above code; however, I did modify my Python script to INSERT 500 records at a time with ON DUPLICATE KEY UPDATE:

INSERT INTO table (x, x, x, x, x, x) VALUES
(d, d, d, d, d, 1), (d, d, d, d, d, 2), (d, d, d, d, d, 3), (d, d, d, d, d, 4)
 ON DUPLICATE KEY UPDATE active=1



Additionally, I did use a multi-column UNIQUE constraint on the following columns: product_id, power, diameter, base_curve, axis, cylinder, addition, color, frame_color, lens_color.

By doing this, it allowed me to create a unique SKU for only NEW products and reset old products (if the above constraint is met) to active, while not knowing the SKU beforehand.

Using ONLY the above query over and over again, here are my results:

*New inserts of 1M+ records average time elapsed is 40 seconds.
*New inserts along with UPDATES of 1M+ records average time elapsed is 45 seconds.
*ALL UPDATES of 1M+ records average time elapsed is 92 seconds.

Tested on a database of 5M+ records so far.
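
For reference, a stripped-down sketch of how the batching looks inside the Python import script (illustrative only, not my exact code; the 500-row batch size and column list are from the comments above, and rows_from_csv stands in for the CSV reader loop shown earlier):

# Sketch of the batched INSERT ... ON DUPLICATE KEY UPDATE approach.
import MySQLdb as mdb

BATCH_SIZE = 500

sql_prefix = ("INSERT INTO my_product_variations "
              "(product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color) VALUES ")
sql_suffix = " ON DUPLICATE KEY UPDATE active = 1"

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()

def flush_batch(rows):
    # each row is a tuple of 9 values; the driver handles quoting/escaping
    if not rows:
        return
    placeholders = ", ".join(["(%s, %s, %s, %s, %s, %s, %s, %s, %s)"] * len(rows))
    flat = [value for row in rows for value in row]
    cur.execute(sql_prefix + placeholders + sql_suffix, tuple(flat))
    con.commit()

batch = []
for row in rows_from_csv:           # rows_from_csv stands in for the CSV reader loop above
    batch.append(row)
    if len(batch) >= BATCH_SIZE:
        flush_batch(batch)
        batch = []
flush_batch(batch)                   # flush the final partial batch
con.close()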
 
LVL 45

Expert Comment

by:aikimark
If you're going to do that, you might want to try my suggestion of hashing those columns and storing them in a new column with a unique index.  Hash the contents of each row in the CSV.  I usually advise hashers to include a field delimiter when hashing.  The most logical delimiter would be a comma, since that simplifies the hashing of the CSV lines.
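
A minimal sketch of the CSV-side hashing, assuming MD5 and a comma delimiter (the hash function, file handling, and the row_hash column name are illustrative assumptions, not a prescription):

# Sketch of hashing each CSV row's fields joined with a comma (Python 2,
# matching the script above); compare the results against a hash column
# (e.g. a hypothetical row_hash) on the production table.
import csv
import hashlib

def row_hash(fields):
    # use the same delimiter as the CSV so the raw line and the
    # reconstructed line hash identically
    return hashlib.md5(",".join(fields)).hexdigest()

f = open("large-sample-set.csv", "rb")
reader = csv.reader(f)
next(reader)                                    # skip the header row
csv_hashes = set(row_hash(row) for row in reader)
f.close()

# csv_hashes can then be compared against an indexed row_hash column
# on my_product_variations to find rows that are new or have changed.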

Have you considered testing the external table technique?
 
LVL 2

Author Comment

by:xtreme109
I will try hashing on my next round of testing.

Personally I have never used, nor do I know how to use, an external table.  Sample script?
 
LVL 45

Expert Comment

by:aikimark
I think Ray would have better examples and caveats for external tables.

The help files (online documentation) take you through the steps of defining a table (DDL statements) and then instruct you to swap out the .CSV file with your actual CSV file.
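
As a rough starting point, the "external table" idea usually means MySQL's CSV storage engine: a table declared with ENGINE=CSV (every column NOT NULL, no indexes) stores its data as a plain .CSV file that can be swapped for the uploaded file. The staging table name and columns below are illustrative only:

# Sketch of exposing the upload through MySQL's CSV storage engine.
# CSV tables allow no indexes and require NOT NULL columns; the data file
# lives as csv_staging.CSV in the database directory and can be replaced
# with the uploaded file (followed by FLUSH TABLES).
import MySQLdb as mdb

con = mdb.connect('db.server.com', 'user', 'pass', 'db')
cur = con.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS csv_staging (
        power      VARCHAR(20)  NOT NULL,
        diameter   VARCHAR(10)  NOT NULL,
        base_curve VARCHAR(10)  NOT NULL,
        axis       VARCHAR(20)  NOT NULL,
        cylinder   VARCHAR(10)  NOT NULL,
        addition   VARCHAR(10)  NOT NULL,
        color      VARCHAR(25)  NOT NULL
    ) ENGINE=CSV
""")
con.close()

An alternative that avoids touching the data directory is LOAD DATA INFILE into an ordinary staging table.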
 
LVL 20

Expert Comment

by:Mark Brady
I know you have had some great advice in this thread, but you asked how to execute a Python script from within PHP. I showed you how to do it in two different ways and no comment?
 
LVL 2

Author Comment

by:xtreme109
@mark
I appreciate your comments! Thank you. The title of this thread probably could've been named better. The code I put in my first post showed me executing the script and successfully running it (using the methods you shared). However, my post explained my issues with large files.
 
LVL 2

Author Comment

by:xtreme109
Based on the title of my post, Mark's solution was best.
However, based on my actual post and the problems I faced, I took Ray's and aikimark's help and feedback and came up with a solution.
 
LVL 45

Expert Comment

by:aikimark
@xtreme109

So, you really aren't working with actual SKU values?  Well, that does restrict how many ways you can solve this problem.

When you get around to playing with hashing, open up a new question and post a comment in this thread with the URL to the new question.
 
LVL 2

Author Comment

by:xtreme109
@aikimark
I am working with SKUs; however, they are generated dynamically the first time, assuming that particular combination of variables does not exist.

This is shown in my sample Python script.
 
LVL 45

Expert Comment

by:aikimark
"i am working with SKUs"
Let me clarify.  When I referred to a SKU, I meant a predefined value that is associated with a particular product that you do or might carry/handle/sell/process/etc.  Any CSV record would contain this SKU value along with other data.  You are using the combined attributes as your SKU.

If you needed to update an attribute in an existing product, you would create a new SKU, since the attribute combination would not match anything in your production table.
 
LVL 2

Author Closing Comment

by:xtreme109
Based on the title of my post, Mark's solution was best.
However, based on my actual post and the problems I faced, I took Ray's and aikimark's help and feedback and came up with a solution.