beer9 asked:
How can I use Python's subprocess module to spawn child processes?

I have an application which performs the following steps on a list of files:

1. Encrypt the file
2. Compress the file
3. Upload it to an FTP server

I have this functionality in the application and the code works as expected, but it takes a long time because the steps run sequentially. I would like to run the three steps on all the files in parallel so that the total time is reduced dramatically. I see that the subprocess module ( https://docs.python.org/2/library/subprocess.html ) could help me run things in parallel. I would appreciate an example of how to get this done. Any help is very much appreciated.
gelonida replied:
Is the encryption and the compression done in python?

You might look at the multiprocessing module ( https://docs.python.org/2/library/multiprocessing.html )
What might be of special interest to you is the process pool.

Please look at section 16.6.1, which has an example of how to use Pool.
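A minimal sketch of that process-pool approach (a hedged example: `encrypt_compress_and_upload` is only stubbed here; the real function would do the encryption, compression and FTP upload):

```python
from multiprocessing import Pool

def encrypt_compress_and_upload(fname):
    # stub: the real function would encrypt, compress and FTP-upload fname
    return "%s done" % fname

if __name__ == "__main__":
    files = ["a.txt", "b.txt", "c.txt"]
    pool = Pool(3)  # three worker processes run the tasks in parallel
    results = [pool.apply_async(encrypt_compress_and_upload, (f,)) for f in files]
    pool.close()    # no more tasks will be submitted
    pool.join()     # wait for all workers to finish
    for result in results:
        print(result.get())  # e.g. "a.txt done"
```

Each `apply_async` call returns immediately with an `AsyncResult`; the actual work happens in the worker processes.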
gelonida posted the accepted solution (visible to Experts Exchange members only).
beer9 (asker):
I am using the tarfile module (https://docs.python.org/2/library/tarfile.html) for the compression.
gelonida: If you start from my code snippet, then just add all the required function calls that do the encryption, compression and FTP upload to the function encrypt_compress_and_upload(), and you should be able to parallelize.
beer9 (asker):
Hi gelonida, how can I run encrypt_compress_and_upload() on all the files at the same time, instead of in a pool of 3 at a time? I do not want to limit how many run at once. I appreciate your input.
gelonida: Normally it is not a good idea to start an unlimited number of processes in parallel.

There are several reasons for this:
- The first is memory consumption (RAM).
- The second is that you will normally not gain much once you exceed a certain number of processes.
- As network bandwidth may be the limiting factor, I assume you will also not gain much once you try to upload too many files in parallel via FTP. You might even hit FTP server constraints that refuse more than a given number of concurrent connections.

Just for 'fun' you could increase the pool size to 100, so you would have at most 100 processes in parallel:

poolsize = 100
pool = Pool(poolsize)



I personally would hard-code the pool size to something like twice (or three times) the number of available CPUs:

import multiprocessing
from multiprocessing import Pool

poolsize = multiprocessing.cpu_count() * 2
pool = Pool(poolsize)


In theory you could set the pool size to the number of files you want to upload, but if that number is too large you might run into problems.
I'd recommend a pool size of twice the number of CPUs, but you can experiment to see whether larger values (three times the CPU count or even more) improve your performance.

A more optimal but more complex architecture could create two pools:
one pool for encryption/compression (limited to 2x the CPU count) and
another pool of, say, 4 workers for uploading to the FTP server.

The first pool would make sure the CPUs are fully used, and the second would keep the network busy. With only one pool it could theoretically happen that there are phases where you only compress and phases where you only upload, which would not fully use your available resources.
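A sketch of that two-pool idea (all names are illustrative; the two stub functions stand in for the real encryption/compression and FTP upload):

```python
from multiprocessing import Pool, cpu_count

def encrypt_and_compress(fname):
    # stub: the real function would encrypt and tar/compress the file
    return fname + ".tar.gz.enc"

def upload(archive):
    # stub: the real function would upload the archive via FTP
    return "%s uploaded" % archive

if __name__ == "__main__":
    files = ["a.txt", "b.txt", "c.txt"]
    cpu_pool = Pool(cpu_count() * 2)  # keeps the CPUs busy
    net_pool = Pool(4)                # keeps the network busy

    # stage 1: encrypt/compress in the CPU pool
    archives = [cpu_pool.apply_async(encrypt_and_compress, (f,)) for f in files]
    # stage 2: hand each finished archive to the upload pool
    uploads = [net_pool.apply_async(upload, (a.get(),)) for a in archives]

    for u in uploads:
        print(u.get())  # e.g. "a.txt.tar.gz.enc uploaded"

    for p in (cpu_pool, net_pool):
        p.close()
        p.join()
```

Note that `a.get()` blocks in submission order here, so this is only a sketch; a callback-based handoff between the pools would overlap the stages more aggressively.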

I suspect, however, that you would just end up with more complex code and not necessarily notice any difference in performance.
beer9 (asker):
Thanks for your help gelonida. I am getting a few 'None' values when printing with:

for result in results:
        print("    %r" % result.get()) # wait till result is ready


gelonida: None means that the task is still running and the result is therefore unknown.
To find out whether a process is finished, you can check each entry of the async result list
( https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers section 16.6.2.9, class multiprocessing.pool.AsyncResult )
to see whether its result is ready:
for result in results:
        if result.is_ready():
            print("    %r" % result.get()) # wait till result is ready
        else:
            print("    process still running")



You could also block until a single result is ready with
result.wait()
beer9 (asker):
I am getting following error:

    if result.is_ready():
AttributeError: 'ApplyResult' object has no attribute 'is_ready'



Also, is there a way to find out how many seconds each process took, using multiprocessing?
beer9 (asker):
Hi gelonida, I would appreciate it if you could help here.
gelonida: Apologies, beer9, for the late reply.

My previous answer contained a wrong function name:
the method is ready(), not is_ready().

def show_results(results):
    """Print the status of each task; return True once all results are ready."""
    all_ready = True
    for result in results:
        if result.ready():
            print("    %r" % result.get())  # result is ready, so get() returns immediately
        else:
            print("    process still running")
            all_ready = False
    return all_ready



This function displays the status of the current processes and also returns a bool telling you whether all tasks are finished.
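For example, one could poll that function until everything has finished (a self-contained sketch; work() just stands in for the real task):

```python
import time
from multiprocessing import Pool

def work(n):
    time.sleep(n)  # stand-in for encrypt/compress/upload
    return "task %d done" % n

def show_results(results):
    """Print the status of each task; return True once all are ready."""
    all_ready = True
    for result in results:
        if result.ready():
            print("    %r" % result.get())
        else:
            print("    process still running")
            all_ready = False
    return all_ready

if __name__ == "__main__":
    pool = Pool(2)
    results = [pool.apply_async(work, (n,)) for n in (1, 2)]
    while not show_results(results):  # poll until every task has finished
        time.sleep(0.5)               # avoid busy-waiting
    pool.close()
    pool.join()
```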
beer9 (asker):
Hi gelonida, thanks for the info. Is there a way to find how much time each (sub)process took to complete its task? Thanks!
gelonida: As far as I know the multiprocessing module does not have this feature, but it would be rather simple to add yourself.

Which time are you interested in: the time the process actually ran, or the time elapsed since the task was enqueued?

One simple way to get this information is to have the function you call through multiprocessing keep time itself; the return value can then be the consumed time, or a tuple containing the consumed time alongside the real result.
Just noticed that I forgot to post a small example of a task keeping track of its runtime.

As you didn't answer my previous question, I wasn't sure whether it's the CPU time or the wall-clock runtime you're interested in.

Runtime is quite simple, as you can see here: the idea is to get the current time (with time.time()) at the beginning of your task and again at the end, and calculate the difference.

import time
import random
from multiprocessing import Pool

def encrypt_compress_and_upload(fname):
    """ just some example code """
    start_time = time.time()
    time.sleep(1 + random.random() * 4)  # simulate between 1 and 5 seconds of work
    return_value = "%s uploaded successfully" % fname
    consumed_time = time.time() - start_time
    return consumed_time, return_value

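A sketch of driving such a self-timing task through a pool and printing each task's runtime (the sleep stands in for the real encrypt/compress/upload work):

```python
import time
import random
from multiprocessing import Pool

def encrypt_compress_and_upload(fname):
    """Example task that measures its own wall-clock runtime."""
    start_time = time.time()
    time.sleep(1 + random.random())  # stand-in for the real work
    return_value = "%s uploaded successfully" % fname
    consumed_time = time.time() - start_time
    return consumed_time, return_value

if __name__ == "__main__":
    files = ["a.txt", "b.txt", "c.txt"]
    pool = Pool(3)
    results = [pool.apply_async(encrypt_compress_and_upload, (f,)) for f in files]
    pool.close()
    pool.join()
    for r in results:
        consumed, message = r.get()
        print("%s took %.2f seconds" % (message, consumed))
```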