

Python proc vs thread

In Python, what is the difference between a process, a child process (that has been forked), the result of a fork-exec and a thread?  Actually, I get the difference between the first two (I think): when forking you wind up with a new process that is a virtual copy of the existing parent process, only running separately with a new PID.  But I lost the trail when they got into fork-exec and then the thread concept.
1 Solution
In the first part I give the non-Python-specific answer (assuming you run on a Linux-like OS); in the second part I'll try to tackle Python-specific details (GIL, . . .).

Creating a subprocess with identical code/data is done by calling fork().
- fork() duplicates the current process (the Python interpreter) with all its
   code / data (variables, . . .).
   The only difference between the new and the old process is the return value of the
   fork() call (0 in the child, the child's PID in the parent), which allows the two
   processes to be distinguished. All variables have the same value, but if changed in
   one process, they will not be changed in the other.

- fork() / exec() is the Unix way of starting a different program as a child process.
   The newly forked child process immediately execs another executable (it loads and
   executes some completely different code, which could again be a Python program
   or something completely different), so basically all code / data of the child process
   is overwritten.

- Creating a thread is like creating a new process with fork(), BUT the new thread runs
  inside the same process and refers to the SAME data, so if one thread modifies some
  data, it is changed for the other thread as well.
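
To make the difference between "copied" and "shared" data concrete, here is a minimal sketch for a Unix-like OS. The helper names (bump, bump_in_forked_child, bump_in_thread) and the state dictionary are made up for this illustration; only os.fork(), os.pipe() and threading.Thread come from the standard library.

```python
import os
import threading


def bump(state):
    """Increment the counter stored in the given dictionary."""
    state["counter"] += 1


def bump_in_forked_child(state):
    """Run bump() in a forked child and return the value the child ended up with."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()                  # returns 0 in the child, the child's PID in the parent
    if pid == 0:                     # child: works on its own copy of state
        os.close(read_fd)
        bump(state)
        os.write(write_fd, str(state["counter"]).encode())
        os._exit(0)                  # leave the child immediately, skipping parent cleanup
    os.close(write_fd)               # parent: keep only the read end
    with os.fdopen(read_fd) as pipe:
        child_value = int(pipe.read())
    os.waitpid(pid, 0)               # reap the child
    return child_value


def bump_in_thread(state):
    """Run bump() in a second thread of this process."""
    worker = threading.Thread(target=bump, args=(state,))
    worker.start()
    worker.join()


if __name__ == "__main__":
    state = {"counter": 0}
    print("child ended with:", bump_in_forked_child(state))
    print("parent after fork:", state["counter"])    # unchanged: the child had a copy
    bump_in_thread(state)
    print("parent after thread:", state["counter"])  # changed: the thread shares the data
```

The child has to send its result back through a pipe precisely because its copy of the data is invisible to the parent; the thread needs nothing of the kind.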

2.) Now to Python:
- fork()/exec(): If you exec another Python program, you initialize a completely new Python interpreter, reload/import all the .py (or .pyc) files and run the code from the beginning.
 The Python module to use for creating such child processes is subprocess;
 the call is subprocess.Popen() or one of the helpers that simplify it (such as subprocess.run()).
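
As a rough sketch of what this looks like in code (Unix-like OS assumed): the first helper does the fork()/exec() by hand, the second one lets subprocess do the same job. The function names run_with_fork_exec and run_with_subprocess are invented for this example.

```python
import os
import subprocess
import sys


def run_with_fork_exec(argv):
    """Fork, replace the child with the program in argv, and return its exit code."""
    pid = os.fork()
    if pid == 0:                       # child
        try:
            os.execvp(argv[0], argv)   # on success this call never returns
        finally:
            os._exit(127)              # only reached if the exec itself failed
    _, status = os.waitpid(pid, 0)     # parent: wait for the child to finish
    return os.WEXITSTATUS(status)


def run_with_subprocess(argv):
    """Do the same through the subprocess module and also capture stdout."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Start a completely fresh Python interpreter as the child program.
    child = [sys.executable, "-c", "print('hello from a new interpreter')"]
    print("fork/exec exit code:", run_with_fork_exec(child))
    print("subprocess result:", run_with_subprocess(child))
```

In day-to-day code you would normally only write the subprocess variant; the manual version is here to show what happens underneath.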

- fork() duplicates the process; nothing has to be reinitialized and both processes can now live their own lives.
 A Python module that roughly uses fork() to create new processes is multiprocessing (the exact start method depends on the platform and the Python version). Multiprocessing also works under Windows, though there creating another process costs more or less as much as a fork()/exec(), since Windows doesn't really implement fork(). For you as a developer the code stays platform independent, and creating a subprocess will be as fast as the platform allows.
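
A small sketch of multiprocessing in use, assuming a Unix-like OS. It asks explicitly for the "fork" start method so that it matches the description above; on Windows that method does not exist and you would use the default context instead. The names report and run_child are made up for the example.

```python
import multiprocessing
import os


def report(queue):
    """Runs in the child: send back our own PID and our parent's PID."""
    queue.put((os.getpid(), os.getppid()))


def run_child():
    """Start one child process and return (PID seen by parent, PID and PPID seen by child)."""
    ctx = multiprocessing.get_context("fork")   # assumption: fork is available (Unix)
    queue = ctx.Queue()
    child = ctx.Process(target=report, args=(queue,))
    child.start()
    child_pid, child_ppid = queue.get()         # wait for the child's message
    child.join()
    return child.pid, child_pid, child_ppid


if __name__ == "__main__":
    print("parent PID:", os.getpid())
    print("Process.pid, child's getpid(), child's getppid():", run_child())
```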

- Threading:
Threading is very powerful and nice, but has some issues. As two threads access the same data and not all modifications to Python variables are atomic, you have to use tools such as mutexes ( threading.Lock() ) to protect your code, and you must avoid writing to the same file from different threads or using other exclusive shared resources at the same time.

If you are inexperienced, it's easy to write code that behaves differently than expected.

Another drawback specific to Python is the GIL (global interpreter lock),
which means that, due to an implementation limitation of the standard interpreter (CPython), multiple threads cannot execute Python bytecode at the same time.

So if you have a purely CPU-bound Python program on a machine with multiple cores (CPUs), you will see that with multiprocessing both processes run in parallel and both CPUs are loaded, whereas with threading you'll be unable to benefit from the other CPU.
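
If you want to try this yourself, the following sketch runs the same pure-Python, CPU-bound function once in a thread pool and once in a process pool. The function count_primes and the two wrappers are invented for the illustration, and the "fork" context again assumes a Unix-like OS. Both variants return the same results; the difference only shows up in wall-clock time and CPU load, which I deliberately do not hard-code here since it depends on your machine.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """Pure-Python, CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def with_threads(limits):
    """One thread per job; the GIL lets only one of them run bytecode at a time."""
    with ThreadPoolExecutor(max_workers=len(limits)) as pool:
        return list(pool.map(count_primes, limits))


def with_processes(limits):
    """One process per job; each has its own interpreter and its own GIL."""
    ctx = multiprocessing.get_context("fork")   # assumption: fork is available (Unix)
    with ProcessPoolExecutor(max_workers=len(limits), mp_context=ctx) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    jobs = [200_000, 200_000]
    print("threads:  ", with_threads(jobs))
    print("processes:", with_processes(jobs))
```

Wrapping each call with time.perf_counter() and watching a CPU monitor while it runs should make the effect visible on a multicore machine.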

Depending on what you want to do in your thread, this may not be a problem, as:
- Python has a lot of modules that call code from shared libraries (NumPy, PIL, . . .) and release the GIL before calling. So they would benefit from a multicore machine.
In many of my programs most of the CPU is consumed in C libraries being called by Python.

- Often threads are not used to perform calculations in parallel, but rather to structure code; most of the time most threads are sleeping and are just woken up for a short time to perform a short task.

So to summarize:
Three Python modules:
- subprocess for fork()/exec()
- multiprocessing for fork()
- threading for threads

- To benefit from a multicore machine and its CPUs you should use multiprocessing ( fork() ).
- To write code that needs to access the same data you should use threading.
  A typical example would be a GUI, where a task started by clicking a button (transferring data over the net, . . .) should not freeze the GUI,
  or a program with one thread receiving data, another one processing it and another one sending it (though such code could also be implemented without threading, using select() or the like).

Interthread communication:
- you can use locks ( threading.Lock ), wait()-style primitives ( threading.Event, threading.Condition ), queues ( queue.Queue ) and shared variables
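
For example, a queue between a producing and a consuming thread could look roughly like this. The function names and the None end marker are choices made for this sketch, not a fixed convention.

```python
import queue
import threading


def producer(q, items):
    """Put every item on the queue, then a None marker to signal the end."""
    for item in items:
        q.put(item)
    q.put(None)


def consumer(q, results):
    """Take items off the queue until the end marker arrives; store doubled values."""
    while True:
        item = q.get()          # blocks (sleeps) until something is available
        if item is None:
            break
        results.append(item * 2)


def run_pipeline(items):
    """Run one producer and one consumer thread and return the consumer's results."""
    q = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

queue.Queue does its own locking, so the two threads need no explicit Lock here.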

Interprocess Communication
- common files (if protected with file locking)
- pipes
- sockets
- a database, which supports multiple processes. (sqlite, sql servers, . . .)
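
As an illustration of the pipe option, here is a sketch using multiprocessing.Pipe between a parent and one child. The names worker and sum_in_child are made up, and the "fork" context assumes a Unix-like OS.

```python
import multiprocessing


def worker(conn):
    """Runs in the child: receive a list of numbers, send back their sum."""
    numbers = conn.recv()
    conn.send(sum(numbers))
    conn.close()


def sum_in_child(numbers):
    """Send numbers to a child process through a pipe and return its answer."""
    ctx = multiprocessing.get_context("fork")   # assumption: fork is available (Unix)
    parent_end, child_end = ctx.Pipe()          # two connected, bidirectional ends
    child = ctx.Process(target=worker, args=(child_end,))
    child.start()
    child_end.close()                           # the parent only uses its own end
    parent_end.send(numbers)
    total = parent_end.recv()
    child.join()
    return total


if __name__ == "__main__":
    print(sum_in_child([1, 2, 3]))
```

Unlike with threads, the list is pickled and copied to the other process; nothing is shared.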

Hope this is the level of detail that you were looking for.
amigan_99Author Commented:
Thank you very much!  A point of clarification, if you have a moment: can you explain "not all modifications to Python variables are atomic"?  I think of atomic, from Democritus, as a small indivisible particle.  Not sure how that applies to a Python variable.
OK: an atomic operation in computer science means something like
"not interruptible by another thread", so an atomic operation is either not executed at all or executed completely.
Under no circumstances can another thread interrupt an atomic operation while 'half' of the work is done.

you can refer to http://en.wikipedia.org/wiki/Atomic_operation

By the way, it seems that my above statement is wrong. Some googling seems to indicate that Python variable assignments are atomic operations.

However, statements like
a += 1
a, b = b, a

would not be, and neither would modifying several members of an object. Thus, normally
locks should be used to avoid an object being in an inconsistent state when it is accessed by another thread.
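
A minimal sketch of protecting such a read-modify-write with a lock. The Counter class and the helper count_with_threads are invented for this example.

```python
import threading


class Counter:
    """A counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "self.value += 1" is a read, an add and a write; the lock makes sure
        # no other thread runs this block in between those steps.
        with self._lock:
            self.value += 1


def count_with_threads(n_threads, n_increments):
    """Let n_threads threads each increment a shared counter n_increments times."""
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads(4, 10_000))
```

With the lock in place the total is always n_threads * n_increments; without it, updates may get lost, depending on the interpreter version and timing.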
amigan_99Author Commented:
Thank you again very much.
