Chosen aspects of concurrency in Python 3

There are a lot of concepts that exist in the area of parallel and concurrent programming: threads, processes, deadlocks, starvation, race condition, etc. Within concurrent and parallel programming, Python has built-in and external modules that simplify implementation. Due to significant improvements in the areas, this article is based on Python 3.4.

Some basics to start with

Writing and maintaining concurrent programs is harder than writing and maintaining non-concurrent programs. Furthermore, concurrent programs can sometimes have worse performance than equivalent non-concurrent programs. Nonetheless, if done well, it is possible to write concurrent programs whose performance compared with their non-concurrent cousins is so much better as to outweigh the additional effort.

Most modern languages (including C++ and Java) support concurrency directly in the language itself and usually have additional higher-level functionality in their standard libraries. Concurrency can be implemented in a number of ways, with the most important difference being whether shared data is accessed directly (e.g. using shared memory) or indirectly (e.g. using inter-process communication – IPC) [9].

Parallel programming can be defined as a model that aims to create programs that are compatible with environments prepared to execute code instructions simultaneously. With parallel processing, one can increase the number of calculations your program can do in a given time without needing a faster processor. The main idea is to divide a task into many sub-units and employ multiple processors to solve them independently [7][2].
Terms concurrent and parallel are related but different.

A concurrent program has multiple logical threads of control. These threads may or may not run in parallel. A parallel program potentially runs more quickly than a sequential program by executing different parts of the computation simultaneously (in parallel). It may or may not have more than one logical thread of control [2]. Within concurrent programming program dispatches several workers to run a task (figure 1).

CPU (Central Processing Unit) scheduler defines which worker is apt for using the resource at the specific moment. It happens so fast, so one may get the impression of pseudo-parallelism. Therefore, concurrent programming is an abstraction from parallel programming. In parallel programming, workers run specific tasks simultaneously in a multi-core environment, without the need for concurrency among them to access the CPU [7].

1. Threads and processes

Computer programs are merely executables or binary, which reside on disk. They do not take on a life of their own until loaded into memory and invoked by the operating system. A process is a program in execution. Each process has its own address space, memory, a data stack, and other auxiliary data to keep track of execution. The operating system manages the execution of all processes on the system, dividing the time fairly between all processes. Each new process has its own memory, data stack, etc., and cannot generally share information unless IPC is employed [3].

Threads (sometimes called lightweight processes) are similar to processes except that they all execute within the same process, and thus all share the same context. They can be thought of as “mini-processes” running in parallel within the main processor “main thread.” A thread has a beginning, an execution sequence, and a conclusion. It has an instruction pointer that keeps track of where within its context it is currently running. It can be preempted (interrupted) and temporarily put on hold (also known as sleeping) while other threads are running – this is called yielding.

Multiple threads within a process share the same data space with the main thread and can therefore share information or communicate with one another more easily than if they were separate processes. Threads are generally executed in a concurrent fashion, and it is this parallelism and data sharing that enable the coordination of multiple tasks [3].

I/O (Input/Output) operations (such as reading from memory, disk, or the network) can be quite burdensome to the flow of a program. Every time code reads from a file or writes to a network socket, it must pause to contact the kernel, request that the operation happen, and then wait for it to complete. Most of the performed I/O operations are orders of magnitude slower than CPU operations. Execution of programs is paused to complete these operations, time spent in the paused state is called „I/O wait”. Concurrency helps to utilize this wasted time by allowing to perform other operations while waiting for an I/O operation to complete. Figure 2 depicts a program that must run three tasks, all of which have periods of I/O wait within them.

If run serially, then it suffers the I/O wait penalty three times. However, if run tasks concurrently the wait time can be essentially hidden by running another task in the meantime. It is important to note that this is all still happening on a single thread and still only uses one CPU at a time [5].

2. Threading and multi-processing with Python

2.1. GIL

Execution of Python code is controlled by the Python Virtual Machine (a.k.a. the interpreter main loop). Python was designed in such a way that only one thread of control may be executed in this main loop, similar to how multiple processes in a system share a single CPU. Many programs can be in memory, but only one is live on the CPU at any given moment. Likewise, although multiple threads can run within the Python interpreter, only one thread is being executed by the interpreter at any given time. Access to the Python Virtual Machine is controlled by the global interpreter lock (GIL). This lock is what ensures that exactly one thread is running [3].

The GIL avoids conflicts between threads, simplifying the implementation of the CPython interpreter. Despite this limitation, threads can still be used to provide concurrency in situations where the lock can be released, such as in time-consuming I/O operations or in C extensions [6]. The GIL can be completely avoided by using processes instead of threads. Processes don’t share the same memory area and are independent of each other – each process has its own interpreter. By using processes, we’ll have very few disadvantages: inter-process communication is less efficient than shared memory, but it is more flexible and explicit [6].
Excellent explanation and deep dive into GIL is shown by David Beazley in this presentation and in this visualization.

2.2._thread module

This module provides low-level primitives for working with multiple threads. For synchronization, simple locks (also called mutexes or binary semaphores) are provided.]


import _thread  
from time import sleep, ctime

def loop1():  
    print('Start loop 1 at:', ctime())
    sleep(4)
    print('Loop 1 done at:', ctime())

def loop2():  
    print('Start loop 2 at:', ctime())
    sleep(2)
    print('Loop 2 done at:', ctime())

def main():  
    print('Starting at:', ctime())
    _thread.start_new_thread(loop1, ())
    _thread.start_new_thread(loop2, ())
    sleep(6)
    print('All tasks done at:', ctime())

if __name__ == '__main__':  
    main()

This is a basic usage of the _thread module [3][docs]. One may ask why line with sleep(6) is necessary. The reason is that if we did not stop the main thread from continuing, it would proceed to the next statement, displaying “All tasks done” and exit, killing both threads running loop1() and loop2(). Try what happens when line with sleep(6) is commented.

This is where locks come in, but use of the _thread module is recommended only for experts desiring lower level thread access. To emphasize this, it is renamed from thread in Python 2 to _thread in Python 3.

2.3. threading module

This module gives not only a Thread class but also a wide variety of synchronization mechanisms. It constructs higher-level threading interfaces on top of the lower level _thread module.
The program given below uses threads and lock critical sections of code to avoid race conditions [1].


import threading

class SharedCounter:  
    # A counter object that can be shared by multiple threads.
    def __init__(self, initial_value = 0):
        self._value = initial_value
        self._value_lock = threading.Lock()

    def incr(self,delta=1):
        # Increment the counter with locking
        with self._value_lock:
            self._value += delta

    def decr(self,delta=1):
        # Decrement the counter with locking
        with self._value_lock:
            self._value -= delta

A Lock guarantees mutual exclusion when used with the with statement – that is, only one thread is allowed to execute the block of statements under the with statement at a time. The with statement acquires the lock for the duration of the indented statements and releases the lock when control flow exits the indented block [1][docs].

2.4. queue module

The queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads. The Queue class in this module implements all the required locking semantics.

The module implements three types of queue, which differ only in the order in which the entries are retrieved – FIFO, LIFO and priority queue.

Queues can be used when multiple threads need to safely communicate or exchange data between them [1].

code>
from queue import Queue  
from threading import Thread

# A thread that produces data
def producer(out_queue):  
    while True:
    # Produce some data
    ...
    out_queue.put(data)

# A thread that consumes data
def consumer(in_queue):  
    while True:
    # Get some data
    data = in_queue.get()
    # Process the data
    ...

# Create the shared queue and launch both threads
q = Queue()  
t1 = Thread(target=consumer, args=(q,))  
t2 = Thread(target=producer, args=(q,))  
t1.start()  
t2.start()

Threads then use put() or get() operations to add or remove items from the queue. Queue instances already have all of the required lockings, so they can be safely shared by as many threads as you wish [1][docs].

2.5. concurrent.futures

This module was introduced in Python 3.2 (via PEP 3148). It provides the core behavior of multiprocessing, with a simpler interface based on Java’s java.util.concurrent. It is available as a backport to earlier versions of Python. It isn’t as flexible as multiprocessing, but it is suspected that the growing adoption of Python 3+ it could replace multiprocessing over time.

As of Python 3.4 – there are two classes named Future in the standard library: concurrent.futures.Future and asyncio. Future. They serve the same purpose: an instance of either Future class represents a deferred computation that may or may not have completed [8][docs].

In larger programs, it is very difficult to create and initialize more than one thread manually and manage this at a specific moment. In this scheme, ThreadPoolExecutor could be used. In some cases, there are mechanisms that allow a thread pool. A thread pool is nothing but a structure that keeps several threads, which are previously created, to be used in a certain process. It aims to reuse threads, thus avoiding unnecessary creation of threads – which is costly [docs].

2.6. multiprocessing module

This package supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the GIL by using subprocesses instead of threads. One can use process-based and thread-based parallel processing, share work over queues, and share data among processes. It is mostly focused on single-machine multi-core parallelism.

A very common use is to parallelize a task over a set of processes for a CPU-bound problem. It may be also used to parallelize an I/O-bound problem, but there are better tools for this (e.g., the new asyncio module in Python 3.4+ and gevent or tornado in Python 2+).

Here are some typical jobs for the multiprocessing module:
* Parallelize a CPU-bound task with Process or Pool objects,
* Parallelize an I/O-bound task in a Pool with threads using the dummy module,
* Share pickled work via a Queue,
* Share state between parallelized workers, including bytes, primitive datatypes, dictionaries, and lists [5][docs].
Just as the concurrent.futures module offers ThreadPoolExecutor, which facilitates the creation and manipulation of multiple threads, processes belong to the class of ProcessPoolExecutor.


with concurrent.futures.ProcessPoolExecutor(  
    max_workers=number_of_cpus) as group_link_processes:
        for i in range(urls.qsize()):
            group_link_processes.submit(group_urls_task, urls,
                result_dict, html_link_regex)

2.7. numpy + openMP

OpenMP is a low-level interface to multiple cores, one might wonder whether to focus on it rather than multiprocessing, which works at a higher level, sharing Python data structures, while OpenMP works with C primitive objects (e.g., integers and floats) once it’s been compiled to C. It only makes sense to use it if you’re compiling your code; if you’re not compiling (e.g., if you’re using efficient numpy code and you want to run on many cores), then sticking with multiprocessing is probably the right approach [5][6].

Cython provides a convenient interface to perform shared-memory parallel processing through OpenMP. This lets you write extremely efficient parallel code directly in Cython without having to create a C wrapper. OpenMP is a specification to write multithreaded programs, and includes series of C preprocessor directives to manage threads; these include communication patterns, load balancing, and synchronization features. Several C/C++ and Fortran compilers (including GCC) implement the OpenMP API.

The simplest construct is prange: a construct that automatically distributes loop operations in multiple threads.For example this standard loop:


for i in range(size):  
    out[i] = inp[i]*inp[i]

could be changed in a parallel version by substituting the range call with prange. There’s a caveat, you need to make sure that the body of the loop is interpreter-free. To make use of threads GIL must be released, since interpreter calls acquire and release the GIL, we should avoid them. Failure in doing so will result in compilation errors. In Cython, you can release the GIL by using nogil, as follows:


with nogil:  
    for i in prange(size):
        out[i] = inp[i]*inp[i]

2.8. Parallel Python (PP)

The parallel Python module is external and offers a rich API for the creation of parallel and distributed systems making use of the processes approach. This module promises to be light and easy to install and integrates with other Python programs. Among some of the features, we may highlight the following:

Automatic detection of the optimal configuration,
The fact that a number of worker processes can be changed during runtime,
Dynamic load balance,
Fault tolerance,
Auto-discovery of computational resources [7][docs].

The PP module implements the execution of parallel code in two ways:
– SMP (systems with multiple processors or cores) architecture, where there are multiple processors/cores in the same machine,
– distributing the tasks through machines in a network, configuring, and thus forming a cluster.

In both cases, the exchange of information among the processes receives a call of abstraction, which allows not to worry about details such as pipes and sockets. The information is simply exchanged through arguments and function returns using callbacks [7].

There is a class, called Server, present in the API of PP, which we can use to encapsulate and dispatch tasks among local and remote processes. There are some important arguments in the initializer (__init__) from the Server class.
The most relevant arguments are as follows:

ncpus: This argument allows us to define the number of worker processes, which will execute tasks. If a value is not informed, it will automatically detect how many processors/cores the machine has and create a total of worker processes based on this to optimize the use of resources.
ppservers: This argument represents a tuple containing names or IP addresses of machines that we call Parallel Python Execution Servers (PPES). A PPES consists of a network machine that has the ppserver.py utility running and waiting for tasks to be executed.
There are other arguments that can be visualized at [PP module API ].

2.9. PyParallel

PyParallel combines the concurrency benefits afforded by single-threaded async I/O architectures with the performance benefits afforded by simultaneous multi-threading [docs]. It’s an experimental, proof-of-concept fork of Python 3.3.5 designed to optimally exploit contemporary hardware: multiple CPU cores, fast SSDs, NUMA architectures, and fast I/O channels (10GbE, Thunderbolt, etc).

It presents a solution for removing the limitation of the GIL without needing to actually remove it at all. The code changes required to the interpreter are relatively unobtrusive, all existing semantics such as reference counting and garbage collection remain unchanged, the new mental model required in order to write PyParallel-safe code is very simple (don’t persist parallel objects), the single-thread overhead is negligible, and, most desirably, performance scales linearly with cores [benchmarks].

PyParallel is, first and foremost, an experiment. It is not currently suitable for production. It is a product of trial-and-error, intended to shape the discussions surrounding the next generation of Python.

Conclusion

Although Python fully supports thread programming, parts of the C implementation of the interpreter are not entirely thread-safe to the level of allowing fully concurrent execution. GIL that only allows one Python thread to execute at any given time. The most noticeable effect of the GIL is that multithreaded Python programs are not able to fully take advantage of multiple CPU cores (e.g., a computationally intensive application using more than one thread only runs on a single CPU). Threads in Python are great for I/O-bound tasks, but they’re a poor choice for CPU-bound problems.

If you are going to use a process pool as a workaround, be aware that doing so involves data serialization and communication with a different Python interpreter. For this to work, the operation to be performed needs to be contained within a Python function defined by the def statement (i.e., no lambdas, closures, callable instances, etc.), and the function arguments and return value must be compatible with pickle. Also, the amount of work to be performed must be sufficiently large to make up for the extra communication overhead.
There are many Python modules that support concurrency by means of threads and processes. Using them carefully can really speed up your programs.

References

Beazley D., Jones B., “Python Cookbook, 3rd edition”, O’Reilly, 2013.
Butcher P., “Seven concurrency models in seven weeks”, The Pragmatic Programmers LLC, 2014.
Chun W. J., “Core Python Applications Programming, 3rd edition”, Prentice Hall, 2012.
Downey A., “The Little Book of Semaphores”, GreenTea Press, 2008.
Gorelick M., Ozsvald I., „High Performance Python”, O’Reilly, 2014.
Lanaro G., „Python High Performance Programming”, Packt Publishing, 2013.
Palach J., „Parallel Programming with Python”, Packt Publishing, 2014.
Ramalho L., „Fluent Python”, O’Reilly, 2015.
Summerfield M., „Python in Practice”, Addison-Wesley, 2014.

Data and AI