xtreme109
asked on
PHP: Execute Python in background
So I am building an application that requires importing and manipulating a large CSV file. I figure it would be best to let a Python script handle that and have PHP execute the Python script.
I would like to accomplish a few things.
1. Run this process in the background.
2. Monitor this process via AJAX.
The issue I am running into is that the script takes a while to complete, and the only way I can monitor the process is by capturing the PID, but I don't believe it is returned until the script is complete.
Now when I run a 'quick' script, everything works great.
Here is my sample code. I am using CodeIgniter.
function index(){
    $command = 'python /to/my/script.py arg1 arg2';
    echo "Begin.....";
    $pid = $this->run_in_background($command);
    echo $pid;
    while($this->is_process_running($pid))
    {
        echo(" . ");
        ob_flush(); flush();
        sleep(1);
    }
    echo "Complete";
}

function run_in_background($Command, $Priority = 0)
{
    if($Priority)
        $PID = shell_exec("nohup nice -n $Priority $Command 2> /dev/null & echo $!");
    else
        $PID = shell_exec("nohup $Command 2> /dev/null & echo $!");
    return($PID);
}

function is_process_running($PID){
    exec("ps $PID", $ProcessState);
    return(count($ProcessState) >= 2);
}
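The behaviour being asked for (getting the PID back immediately and then polling it) can be sketched with Python's `subprocess` module; the sleeping child below is just a stand-in for the real `script.py`, so the command line is illustrative. Note that `Popen` returns without waiting for the child, and detaching stdout/stderr matters for the same reason the later CodeIgniter version adds `> /dev/null 2>&1`: a background child that inherits the shell's pipe keeps `shell_exec` waiting.

```python
import subprocess
import sys
import time

# Start the long-running command without waiting for it; Popen returns
# immediately, so the PID is available at once. A short-sleeping child
# stands in for the real import script here.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(1)"],
    stdout=subprocess.DEVNULL,   # detach stdout so nothing blocks on a pipe
    stderr=subprocess.DEVNULL,
)
print("Current PID:", proc.pid)  # printed before the child has finished

# Poll instead of blocking; poll() returns None while the child is alive.
while proc.poll() is None:
    time.sleep(0.2)
print("Complete, exit code:", proc.returncode)
```

A separate monitoring request (the AJAX case) that only has the PID, not the `Popen` object, can check liveness with `os.kill(pid, 0)` or `ps -p`, just as the PHP code does.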
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Some more background to my project: clients are able to upload a list of products, along with all of their variations, pricing, etc. (this is in CSV format).
Our "worst case" scenario is about 1M records, in which we have to see if the variation already exists and, if not, insert the record as new (using INSERT IGNORE?). However, if the product/variation was previously listed, was set to 'inactive', and is now on that newly uploaded CSV, I would like to update that existing record and set it back to 'active', as I do not want duplicate SKUs.
Because of the time frames and the high possibility of someone leaving the page after the upload, I figured I would run a background process which I can then monitor via a simple AJAX call to check on progress and update the customer upon completion.
**1 million records with very minor manipulation take about 10-15 minutes.
Do you have to wait for the upload to finish before starting your integration?
ASKER
Yes, the upload is relatively quick.
File size is about 30 MB.
Please post a link to an appropriately sized CSV test data set and show us what you want out of it (SSCCE).
ASKER
I simplified my issue without getting overly complex, but...
Attached you will find a list which contains the variations for ONE product! I generate the SKUs based on a prefix and generate numbers on the fly based off a database value, e.g. prefix="TEST", lastSKU="100".
Each new variation creates a SKU, for example "TEST0000101", "TEST0000102", etc.
https://dl.dropboxusercontent.com/u/50617741/large-sample-set.csv
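The SKU scheme described (a prefix plus a zero-padded counter seeded from lastSKU) might be sketched as follows; the function name and sample values are illustrative:

```python
def next_skus(prefix, last_sku_number, count):
    """Generate the next `count` SKUs by zero-padding an incrementing counter."""
    return [
        "%s%07d" % (prefix, last_sku_number + i)  # 7-digit padding, as in the thread
        for i in range(1, count + 1)
    ]

skus = next_skus("TEST", 100, 3)
print(skus)  # ['TEST0000101', 'TEST0000102', 'TEST0000103']
```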
What database are you using?
To me, this seems like a simple problem with a two step solution.
1. Update a joined view of the CSV data with the production table, based on the SKU, where the production table has inactive flag = true, setting the production table's inactive flag = false.
2. Append all the CSV rows, allowing the unique index on the SKU column to prevent duplicates.
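The two-step approach above can be sketched with an in-memory database. SQLite stands in for MySQL here (its `INSERT OR IGNORE` corresponds to MySQL's `INSERT IGNORE`), and all table and column names are illustrative, not from the asker's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE products (
    sku TEXT PRIMARY KEY,
    active INTEGER NOT NULL DEFAULT 1)""")
con.execute("CREATE TABLE csv_staging (sku TEXT PRIMARY KEY)")

# Existing production rows: one active, one inactive.
con.executemany("INSERT INTO products VALUES (?, ?)",
                [("TEST0000101", 1), ("TEST0000102", 0)])
# Freshly uploaded CSV: the inactive SKU reappears, plus one new SKU.
con.executemany("INSERT INTO csv_staging VALUES (?)",
                [("TEST0000102",), ("TEST0000103",)])

# Step 1: reactivate inactive rows whose SKU is in the upload.
con.execute("""UPDATE products SET active = 1
               WHERE active = 0
                 AND sku IN (SELECT sku FROM csv_staging)""")
# Step 2: append everything, letting the unique key reject duplicates.
con.execute("""INSERT OR IGNORE INTO products (sku)
               SELECT sku FROM csv_staging""")

rows = sorted(con.execute("SELECT sku, active FROM products"))
print(rows)  # [('TEST0000101', 1), ('TEST0000102', 1), ('TEST0000103', 1)]
```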
ASKER
I am using MySQL.
SOLUTION
My download from Dropbox was fast, but my upload from my office PC to my server took about 75 seconds via FTP.
In my timing test, I found that there were 1,013,761 records. My script read these in 5,536.573 ms (5.5 seconds). So my guess is that either the database is too slow or the application design contains something that is slowing it down. 75 seconds to transfer the file, plus 5 seconds to process the CSV is nowhere near 10 minutes!
@Ray
How long to import the CSV file into a MySQL database? (no indexes)
Of course, if the CSV could be opened as an external table there might not be any reason to import it.
@aikimark: I think that INSERT may be the central issue slowing the PHP script down. If you sort the million-line CSV and look at the data, it seems to me that there is some normalization that could be applied. These look like they might be specifications for eyeglasses, but I'm not sure.
@Ray
You are correct. These are glasses/lens combinations of strengths, cyl (astigmatism correction) and color combinations.
ASKER
@Ray - I believe the time is consumed in performing an INSERT IGNORE on the MySQL database and manual data manipulation. 75 seconds is not bad at all!
Also, here is my table structure. The "custom" fields are for specific products which require special settings, and I am storing JSON data in those fields since they are not tied to any specific SKU.
CREATE TABLE IF NOT EXISTS `my_product_variations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`product_id` int(11) NOT NULL,
`is_custom` tinyint(1) NOT NULL DEFAULT '0',
`sku` varchar(15) NOT NULL,
`power` varchar(20) NOT NULL,
`diameter` varchar(10) NOT NULL,
`base_curve` varchar(10) NOT NULL,
`axis` varchar(20) NOT NULL,
`cylinder` varchar(10) NOT NULL,
`addition` varchar(10) NOT NULL,
`color` varchar(25) NOT NULL,
`frame_color` varchar(75) DEFAULT NULL,
`lens_color` varchar(75) DEFAULT NULL,
`custom_power` text,
`custom_diameter` text,
`custom_base_curve` text,
`custom_axis` text,
`custom_cylinder` text,
`custom_addition` text,
`custom_color` text,
`custom_frame_color` text,
`custom_lens_color` text,
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`created` datetime DEFAULT NULL,
`active` tinyint(1) DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `sku_unique` (`sku`,`active`),
UNIQUE KEY `variation_unique` (`product_id`,`power`,`diameter`,`base_curve`,`axis`,`cylinder`,`addition`,`color`,`active`,`frame_color`,`lens_color`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=205502 ;
Remove the UNIQUE KEY `variation_unique` index and perform another import test.
Agree with @aikimark, UNIQUE keys require a table scan for every INSERT, and this causes geometrically longer INSERT processes as the tables grow. Also, I am wondering about the columns defined as TEXT. Why is that? Is it possible to define those as some kind of CHAR instead?
What if you load the table without the keys, then go back and ALTER TABLE to add the keys in a separate query?
Is the data such that it could be structured using ENUM? Or numeric values?
@Ray,
What about using the CSV file as an external table? No indexes.
SOLUTION
You can specify an external table as an in-memory object?!? Wow! That should be a good performer. I thought we would have to point to the newly uploaded .CSV file. Or are you going to perform some of your PHP 'magic'?
ASKER
I removed the UNIQUE KEY `variation_unique` index and I ended up stopping the process around 2 minutes. The script was only able to import ~65,000 records.
I was able to fully execute the script in the background on the server and capture the PID instantly. I am now able to check on the status of the script, as well as pause/kill the script at any time.
The issue is still the length of time.
Here is what I have so far...
PHP (CodeIgniter):
class Importtest extends CI_Controller
{
    function __construct()
    {
        parent::__construct();
    }

    function check_if_running($pid){
        $command = 'ps -p '.$pid;
        exec($command, $op);
        if (!isset($op[1])) echo "FALSE";
        else echo "True!";
    }

    public function status($pid){
        $command = 'ps -p '.$pid;
        exec($command, $op);
        if (!isset($op[1])) return false;
        else return true;
    }

    public function stop($pid){
        $command = 'kill '.$pid;
        exec($command);
        if ($this->status($pid) == false) echo "TRUE";
        else echo "FALSE";
    }

    private function __runCom($commandToRun){
        $command = 'nohup '.$commandToRun.' > /dev/null 2>&1 & echo $!';
        exec($command, $op);
        return (int)$op[0]; // Return PID
    }

    function index(){
        echo "<h1>Test Large Import</h1>";
        $scriptPath = $_SERVER["DOCUMENT_ROOT"] . "/python/import.py";
        $fullFilePath = "/path/to/file/large-data-set.csv";
        $productID = "1227"; // test data
        $skuPrefix = "XXX";  // test data
        $nextSku = 105;      // test data

        // python command
        $command = 'python '. $scriptPath .' ' . $fullFilePath .' ' . $productID .' ' . $skuPrefix .' ' . $nextSku;
        $processID = $this->__runCom($command);
        echo "Current PID: " . $processID;
    }
}
Python script (very basic as of right now; haven't put much time into validation, etc.):
#!/usr/bin/python
import os
import time
import sys
import csv
import MySQLdb as mdb

filePath = sys.argv[1]
productID = int(sys.argv[2])
skuPrefix = sys.argv[3]
nextSkuNumber = sys.argv[4]

f = None
con = None
try:
    # establish DB connection
    con = mdb.connect('db.server.com', 'user', 'pass', 'db')
    cur = con.cursor()

    # try to open file and begin parsing
    f = open(filePath, "rb")
    skipFirst = True
    reader = csv.reader(f)  # creates the reader object
    for csvRow in reader:   # iterates the rows of the file in order
        if skipFirst == True:
            skipFirst = False
            continue
        nextSkuNumber = int(nextSkuNumber) + 1
        nextSkuNumber = '%07d' % nextSkuNumber
        nextSku = skuPrefix + nextSkuNumber
        insertIgnore = "INSERT IGNORE INTO my_product_variations (product_id, sku, power, diameter, base_curve, axis, cylinder, addition, color) VALUES (%d, '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')" % (productID, nextSku, csvRow[0], csvRow[1], csvRow[2], csvRow[3], csvRow[4], csvRow[5], csvRow[6])
        cur.execute(insertIgnore)
        con.commit()
except mdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)
finally:
    # guard against NameError if connect() or open() failed
    if f:
        f.close()
    if con:
        con.close()
ASKER
@ray
I wouldn't be able to constantly add and remove the indexes, as this upload is for a single product and we have more than one product in our database.
As for ENUM values, a good majority of the fields would need 300+ ENUM values, so I am not sure that is ideal.
The TEXT fields are for custom products which will accept a JSON array, and there is no clearly defined SKU since those products require custom settings.
I hope this helps. I appreciate all the help so far!
No wonder you're experiencing such poor performance. You're doing row-at-a-time inserts.
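What batching might look like can be sketched as follows, again using SQLite in place of MySQLdb (whose cursors expose the same `executemany` call); the table and column names here are illustrative, not the asker's schema. Parameterized queries also replace the string interpolation used in the original script:

```python
import sqlite3

def insert_in_batches(con, rows, batch_size=1000):
    """Replace one-INSERT-per-row with executemany on fixed-size chunks,
    committing once per chunk rather than once per row."""
    cur = con.cursor()
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # INSERT OR IGNORE is SQLite's spelling of MySQL's INSERT IGNORE.
        cur.executemany(
            "INSERT OR IGNORE INTO variations (sku, power) VALUES (?, ?)",
            chunk,
        )
        con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE variations (sku TEXT PRIMARY KEY, power TEXT)")
rows = [("TEST%07d" % n, "-1.25") for n in range(1, 2501)]
insert_in_batches(con, rows, batch_size=1000)
count = con.execute("SELECT COUNT(*) FROM variations").fetchone()[0]
print(count)  # 2500
```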
"stopping the process around 2 minutes" - Thanks for testing. Just to confirm: that index was on the production table and not the CSV data, right?
ASKER
@aikimark
I will modify to do batch inserts of 1000(?) and report back.
Correct... the index was on the table.
ASKER
The issue with batching inserts is that I cannot create a duplicate variation with a new SKU.
Otherwise I would have multiple variations with different SKUs, and my system would not be able to manage inventory correctly.
I was thinking about two batch queries. One to update the Active flag (matching SKU rows) and the other just does a bulk append for new SKU rows.
The unique index on the SKU column will prevent duplicate SKUs.
ASKER
I would then need to keep track of those and then mark them active... any ideas? Batching will not provide that back to me.
Using my experience with similar applications...
I think that you will have so few updates and new rows that these two operations shouldn't take more than a few seconds to execute. You already have a two-column index on SKU and active. Depending on the percentage of the (total number of) rows that are inactive, this should be a reasonably small number of SKU comparisons with the CSV file.
Similarly, I wouldn't expect to add many new SKU rows with the append query.
ASKER
@ali
This depends on how often variations become available (active)/inactive.
@ray
ENUM may not be a great idea, as if anything new were to be added to one of the fields I would have to alter the entire table.
SOLUTION
If you're going to do that, you might want to try my suggestion of hashing those columns and storing them in a new column with a unique index. Hash the contents of each row in the CSV. I usually advise hashers to include a field delimiter when hashing. The most logical delimiter would be a comma, since that simplifies the hashing of the CSV lines.
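A minimal sketch of the hashing idea, assuming SHA-1 and a comma delimiter (the algorithm choice and field values are illustrative): hash the delimiter-joined attribute fields so that identical variation combinations collide on one short, indexable value.

```python
import hashlib

def row_hash(fields, delimiter=","):
    """Hash the delimiter-joined attribute fields; the digest can be stored
    in an indexed column so duplicate variations collide on one value."""
    joined = delimiter.join(fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Two identical variation rows hash identically; one changed field does not.
a = row_hash(["-1.25", "14.2", "8.6", "90", "-0.75", "", "blue"])
b = row_hash(["-1.25", "14.2", "8.6", "90", "-0.75", "", "blue"])
c = row_hash(["-1.50", "14.2", "8.6", "90", "-0.75", "", "blue"])
print(a == b, a == c)  # True False
```

Since the CSV lines are already comma-delimited, hashing the raw line gives the same result as hashing the joined fields, which is why the comma is the most convenient delimiter here.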
Have you considered testing the external table technique?
ASKER
I will try hashing on my next round of testing.
Personally I have never used, nor do I know how to use, an external table. Sample script?
I think Ray would have better examples and caveats for external tables.
The help files (online documentation) take you through the steps of defining a table (DDL statements) and then instruct you to swap out the .CSV file with your actual CSV file.
I know you have had some great advice in this thread, but you asked how to execute a Python script from within PHP. I showed you how to do it in two different ways and no comment?
ASKER
@mark
I appreciate your comments! Thank you. The title of this thread probably could've been named better. The code that I put in my first post showed me executing the script and successfully running it (using the methods you shared). However, my post explained my issues with large files.
ASKER
Based on the title of my post, Mark's solution was best.
However, based on my actual post and the problems I faced, I took Ray's and aikimark's help and feedback and came up with a solution.
@xtreme109
So, you really aren't working with actual SKU values? Well, that does restrict how many ways you can solve this problem.
When you get around to playing with hashing, open up a new question and post a comment in this thread with the URL to the new question.
ASKER
@aikimark
I am working with SKUs; however, they are generated dynamically the first time, assuming that particular combination of variables does not exist.
This is shown in my sample Python script.
"i am working with SKUs" - Let me clarify. When I referred to a SKU, I meant a predefined value that is associated with a particular product that you do or might carry/handle/sell/process.
If you needed to update an attribute in an existing product, you would create a new SKU, since the attribute combination would not match anything in your production table.
I'm guessing there is a PHP solution, so long as the script does not create a long-running job. PHP is made for generating web pages, so long-running jobs are sometimes less successful in PHP. But trying to marry PHP and Python is also kind of "out there."
Maybe if you can tell us a little more about the application (what it does, in business - non-technical - terms), we can offer some more mainstream suggestions.