I've been working in Python 3 on an embarrassingly parallel task: parsing data and importing it into Postgres. Once I had a single-threaded implementation, I looked into parallelizing the program, and after a few tries I got everything running in parallel.
This is a quick note on how to get a thread pool working in Python 3.6:
    from concurrent.futures import ThreadPoolExecutor


    def data_source():
        """Fetch data to be processed in parallel"""
        # Implement fetching data
        yield item


    def process_data(input_value):
        """Process data without modifying input"""
        # Implement saving data to postgres


    MAX_THREADS = 8

    pool = ThreadPoolExecutor(MAX_THREADS)
    for value in data_source():
        pool.submit(process_data, value)
In the code block above, data_source is a generator function that yields one value at a time. pool.submit schedules process_data(value) to run on the pool and returns a Future immediately. At most MAX_THREADS calls run at the same time; once every thread is busy, further submissions wait in the executor's internal queue until a thread becomes free. The loop itself does not pause, so it keeps pulling values until the data_source generator is exhausted.
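Each pool.submit call returns a Future, and keeping those around is the easiest way to wait for the work to finish and to notice exceptions raised inside worker threads. Here is a minimal sketch of that; the bodies of data_source and process_data are hypothetical stand-ins (my real ones talk to Postgres), so treat it as an illustration rather than the exact code I run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def data_source():
    """Hypothetical stand-in: yields ten integers instead of real records."""
    for item in range(10):
        yield item


def process_data(input_value):
    """Hypothetical stand-in: doubles the value instead of writing to Postgres."""
    return input_value * 2


MAX_THREADS = 8

# The with-block waits for all submitted tasks before the pool shuts down.
with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    futures = [pool.submit(process_data, value) for value in data_source()]
    # as_completed yields futures in completion order, so sort for a stable result.
    # future.result() re-raises any exception that happened in the worker thread.
    results = sorted(future.result() for future in as_completed(futures))
```

Without calling result() on the futures, an exception in process_data is stored on the Future and never shows up in the main thread, which can make failed imports easy to miss.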
In this example pool.submit passes a single parameter, value, to process_data, but pool.submit can pass any number of parameters required by the callee function, e.g.:
    pool.submit(callee_func, callee_param_1, callee_param_2, ..., callee_param_n)
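As a small illustration of passing several parameters, here is a sketch with a made-up save_row function (the name and its parameters are hypothetical, not part of my importer). Positional arguments after the callable are forwarded as-is, and keyword arguments are forwarded too.

```python
from concurrent.futures import ThreadPoolExecutor


def save_row(table, row_id, payload, dry_run=False):
    """Hypothetical callee: returns a string instead of touching a database."""
    return f"{table}:{row_id}:{payload}:{dry_run}"


with ThreadPoolExecutor(max_workers=2) as pool:
    # Everything after the callable is forwarded to save_row unchanged.
    future = pool.submit(save_row, "events", 1, "hello", dry_run=True)
    result = future.result()
```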
I really like this approach because it's really simple: there's no need to write any code to manage the thread pool, and the single- and multi-threaded implementations can live side by side if there's ever a need to debug the logic in the process_data function.
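One way to keep both variants side by side is a single entry point with a switch. This is only a sketch: the run function, its parallel flag, and the toy process_data body are my own assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_data(input_value):
    """Hypothetical stand-in for the real processing step."""
    return input_value + 1


def run(values, parallel=True, max_threads=8):
    """Process values either in the calling thread or on a thread pool."""
    if not parallel:
        # Plain loop: easy to step through in a debugger.
        return [process_data(value) for value in values]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in input order, matching the plain loop.
        return list(pool.map(process_data, values))


sequential = run(range(5), parallel=False)
threaded = run(range(5), parallel=True)
```

Both calls go through the same process_data, so a bug reproduced with parallel=False is the same bug the threaded run would hit.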
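One caveat: in CPython these are real OS threads, but they are subject to the Global Interpreter Lock (GIL), which lets only one thread execute Python bytecode at a time. CPU-bound tasks will therefore not benefit from this. I/O-bound work, such as waiting on a database, typically still does, because the GIL is released while a thread blocks on I/O.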
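To see why waiting-heavy work still gains from threads, here is a rough sketch that uses time.sleep as a stand-in for a blocking database call (the fake_io_task name and the timings are assumptions for illustration). Sleeping releases the GIL, so the eight waits overlap instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # seconds each fake I/O call blocks for
TASKS = 8


def fake_io_task(delay):
    """Hypothetical stand-in for a blocking I/O call such as a database write."""
    time.sleep(delay)
    return delay


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=TASKS) as pool:
    results = list(pool.map(fake_io_task, [DELAY] * TASKS))
elapsed = time.perf_counter() - start

# Run one after another, the same calls would take about DELAY * TASKS seconds.
print(f"threaded: {elapsed:.2f}s, sequential estimate: {DELAY * TASKS:.2f}s")
```

For CPU-bound work, the same concurrent.futures module also provides ProcessPoolExecutor, which uses separate processes and so is not limited by the GIL in the same way.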
ThreadPoolExecutor documentation -- https://docs.python.org/3/library/concurrent.futures.html