How to use Python's subprocess module to spawn child processes

I have an application which performs the following steps on a list of files:

1. Encrypt the file
2. Compress the file
3. Upload it to an FTP server

I have this functionality in the application and the code works as expected. However, it takes a long time because it runs sequentially, and I am planning to run the three steps on all of the files in parallel so the total time can be reduced dramatically. I see the subprocess module ( https://docs.python.org/2/library/subprocess.html ) can help me run things in parallel. I would appreciate an example of how to get this done. Any help is very much appreciated.
beer9Asked:
gelonidaCommented:
One small example,
- creating a pool of three parallel workers
- enqueuing 10 tasks.
- fetching the results as soon as they are available.

import time
from multiprocessing import Pool

def encrypt_compress_and_upload(fname):
    """ just some example code """
    time.sleep(2)
    return "%s uploaded successfully" % fname


if __name__ == '__main__':
    pool = Pool(3) # up to 3 files are processed in parallel
    results = []
    for num in range(10):
        fname = "file_%02d" % num
        result = pool.apply_async(encrypt_compress_and_upload, [ fname ])
        results.append(result)

    print("Results are:")
    for result in results:
        print("    %r" % result.get()) # wait till result is ready
    print("---")



Full documentation of the process Pool object is in
section 16.6.2.9 of
https://docs.python.org/2/library/multiprocessing.html
gelonidaCommented:
Is the encryption and the compression done in python?

You might look at the multiprocessing module ( https://docs.python.org/2/library/multiprocessing.html )
What might be of special interest to you is the process pool.

Please look at section 16.6.1, which has an example of how to use Pool.
beer9Author Commented:
I am using tarfile module (https://docs.python.org/2/library/tarfile.html) to do compression
gelonidaCommented:
If you start from my code snippet, then just add all the required function calls that do
the encryption, compression and FTP upload to the function encrypt_compress_and_upload(),
and you should be able to parallelize.
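A hedged sketch of what that combined function could look like, using tarfile for the compression (as mentioned earlier in the thread) and ftplib for the upload. encrypt_file() is a placeholder, and the FTP host and credentials are made-up examples you would replace with your real details:

```python
import tarfile
from ftplib import FTP


def encrypt_file(fname):
    """Placeholder -- plug your real encryption in here."""
    return fname   # pretend fname now points at the encrypted file


def compress_file(fname):
    """Pack the (encrypted) file into a gzipped tar archive."""
    archive = fname + ".tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(fname)
    return archive


def upload_file(archive, host="ftp.example.com", user="user", password="secret"):
    """Upload the archive via FTP (host and credentials are placeholders)."""
    ftp = FTP(host)
    ftp.login(user, password)
    with open(archive, "rb") as fh:
        ftp.storbinary("STOR " + archive, fh)
    ftp.quit()


def encrypt_compress_and_upload(fname):
    archive = compress_file(encrypt_file(fname))
    upload_file(archive)
    return "%s uploaded successfully" % fname
```

Passing this encrypt_compress_and_upload() to pool.apply_async() works exactly as in the earlier snippet.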
beer9Author Commented:
Hi Gelonida, how can I run the encrypt_compress_and_upload() function on all the files at the same time, instead of a pool of 3 at a time? I do not want to limit how many can run at once. Appreciate your input.
gelonidaCommented:
Normally it is not a good idea to start up an unlimited number of processes in parallel.

There are several reasons for this:
- The first one is memory consumption (RAM).
- The second is that you will normally not gain a lot once you exceed a certain number of processes.
- As network bandwidth might be a limiting factor, I also assume that you will not gain a lot once you try to upload too many files in parallel via FTP. You might even hit FTP server constraints; the server might not accept more than a given number of FTP connections in parallel.

Just for 'fun' you could increase the pool size to 100, so you would have at most 100 processes in parallel:

poolsize = 100
pool = Pool(poolsize)



I personally would hard code the pool size to something like twice (or three times) the number of available CPUs

import multiprocessing

poolsize = multiprocessing.cpu_count() * 2
pool = Pool(poolsize)



In theory you could just set the pool size to the number of files that you want to upload, but if this size is too large, you might encounter problems.
I'd recommend a pool size of twice the number of CPUs, but you can experiment to see whether larger values (3 x CPU or even more) are beneficial to your performance.

A more optimal but more complex architecture could create two pools:
one pool for encryption / compression (limited to 2x CPU) and
another pool of, say, 4 workers for uploading to the FTP server.

The first pool would make sure that the CPUs are fully used, and the second one would keep the network busy. With only one pool it might theoretically happen that there are phases where you only compress and phases where you only upload, which would not fully use your available resources.

I suspect, however, that you would just end up with more complex code and not necessarily notice any difference in performance.
beer9Author Commented:
Thanks for your help gelonida. I am getting a few 'None' values while printing with:

for result in results:
        print("    %r" % result.get()) # wait till result is ready

gelonidaCommented:
None means that the task is still running and therefore the result is not yet known.
To find out whether a process is finished, you can check each entry of the async result list
( https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers , section 16.6.2.9, class multiprocessing.pool.AsyncResult )
with .ready():
for result in results:
        if result.is_ready():
            print("    %r" % result.get()) # wait till result is ready
        else:
            print("    process still running")



You could also wait for a single result to be ready with
result.wait()
beer9Author Commented:
I am getting the following error:

    if result.is_ready():
AttributeError: 'ApplyResult' object has no attribute 'is_ready'



Also, is there a way I can find how many seconds each process took, using multiprocessing?
beer9Author Commented:
Hi gelonida, Appreciate if you could help here.
gelonidaCommented:
Apologies beer9 for the late reply:

My previous answer contained a wrong function name:
the method is ready(), not is_ready().

def show_results():
    all_ready = True
    for result in results:
        if result.ready():
            print("    %r" % result.get()) # wait till result is ready
        else:
            print("    process still running")
            all_ready = False
    return all_ready



This function displays the status of the current processes and also returns a bool telling you whether all tasks are finished.
beer9Author Commented:
Hi gelonida, thanks for the info. Is there a way I can find how much time each (sub)process took to complete its task? Thanks!
gelonidaCommented:
As far as I know the multiprocessing module does not have this feature, but it would be
rather simple to add it yourself.

Which time are you interested in:
the time the process actually ran, or the time elapsed since the task was enqueued?

One simple way to get this information would be to have the function you call via multiprocessing keep time itself.

The return value can then be the consumed time, or a tuple containing the consumed time along with the original result.
gelonidaCommented:
Just noticed that I forgot to post a small example of a task keeping track of its runtime.

As you didn't answer my previous question, I wasn't sure whether it's the CPU time or the run time you're interested in.

Runtime is quite simple, as you can see here:
the idea is to get the current time (with time.time()) at the beginning of your task
and at the end of the task, and calculate the difference.

import time
import random
from multiprocessing import Pool

def encrypt_compress_and_upload(fname):
    """ just some example code """
    start_time = time.time()
    time.sleep(1 + random.random() * 5)  # simulate 1 to 6 seconds of work
    return_value = "%s uploaded successfully" % fname
    consumed_time = time.time() - start_time
    return consumed_time, return_value

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.