Understanding Multiprocessing in Python
Multiprocessing is a powerful module in Python that allows the creation of multiple processes, enabling concurrent execution of tasks. This is especially useful for CPU-bound tasks, where multiple CPU cores can be utilized to improve performance.
Key Concepts
- Process: An independent program that runs in its own memory space.
- Concurrency vs. Parallelism: Concurrency is about dealing with lots of things at once, while parallelism is about doing lots of things at the same time.
- CPU-bound vs. I/O-bound: CPU-bound tasks require significant processing power, whereas I/O-bound tasks involve waiting for input/output operations.
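The distinction can be illustrated with two toy functions (both names are ours, for illustration only): a CPU-bound one that spends its time computing, and an I/O-bound one that spends its time waiting.

```python
import time

def cpu_bound(n):
    # CPU-bound: busy computing; benefits from extra cores
    return sum(i * i for i in range(n))

def io_bound(delay):
    # I/O-bound: mostly waiting; threads or async I/O often suffice
    time.sleep(delay)
    return delay
```

Multiprocessing shines for the first kind of function; for the second, the cost of spawning processes may outweigh the benefit.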
Benefits of Multiprocessing
- Improved performance by utilizing multiple CPU cores.
- Isolation of processes reduces risks of data corruption.
- Bypassing Python's Global Interpreter Lock (GIL).
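As a minimal sketch of these ideas, here is the simplest use of the module: spawning a single child process and waiting for it (the function name is illustrative).

```python
from multiprocessing import Process

def greet(name):
    # Runs in a separate process with its own memory space
    print(f'Hello from {name}')

if __name__ == '__main__':
    p = Process(target=greet, args=('worker-1',))
    p.start()   # launch the child process
    p.join()    # block until it exits
    print(p.exitcode)  # 0 on a clean exit
```

The `if __name__ == '__main__':` guard matters: on platforms that start processes by re-importing the main module, omitting it can cause processes to spawn recursively.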
The Pool Class
The Pool class in the multiprocessing module allows you to create a pool of worker processes. It provides a convenient way to parallelize the execution of a function across multiple input values. The pool can be configured with the number of worker processes, which typically corresponds to the number of CPU cores.
Using the map Method
The map method applies a given function to every item of an iterable (like a list or a DataFrame chunk) in parallel. It distributes the tasks among the worker processes in the pool, collects the results, and returns them as a list.
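Before the larger pandas example, the Pool and map pattern can be sketched on a plain list:

```python
from multiprocessing import Pool

def square(x):
    # Worker function; must be defined at module level
    return x * x

if __name__ == '__main__':
    # Four workers split the five inputs between them
    with Pool(processes=4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]
```

Note that map preserves input order in its output, even though the chunks may finish in any order.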
Example: Using Multiprocessing with Pandas
In this example, we'll create a DataFrame with a large set of numbers and use multiprocessing to calculate their squares in parallel.
Step-by-Step Code
```python
import pandas as pd
import numpy as np
from multiprocessing import Pool

# Define a function to process data (module level, so workers can import it)
def calculate_square(chunk):
    return chunk ** 2

if __name__ == '__main__':
    # Create a DataFrame with a large number of entries
    num_records = 10_000_000  # 10 million records
    df = pd.DataFrame({'numbers': np.random.randint(1, 100, size=num_records)})

    # Split the DataFrame's column into chunks for processing
    num_chunks = 8  # Adjust based on the number of CPU cores
    chunks = np.array_split(df['numbers'], num_chunks)

    # Create a Pool of worker processes
    with Pool(processes=num_chunks) as pool:
        # Use map to apply the calculate_square function to each chunk
        results = pool.map(calculate_square, chunks)

    # Concatenate results back into a single Series and reset the index
    squared = pd.concat(results).reset_index(drop=True)

    # Display the first few rows of the squared values
    print(squared.head())
```
Explanation of the Code
This example demonstrates the following key steps:
- Creating the DataFrame: A DataFrame with 10 million random integers is generated to demonstrate multiprocessing.
- Defining the Processing Function: The calculate_square function computes the square of each number in a chunk.
- Using Pool for Multiprocessing: The DataFrame's column is split into smaller chunks, and a pool of worker processes is created. The map method applies the calculate_square function to each chunk in parallel.
- Combining Results: The results are concatenated back into a single Series for final output.
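Splitting the data into a handful of large chunks, as above, keeps inter-process communication cheap. When you pass many small items directly, map's chunksize parameter achieves a similar effect by batching items per task. A sketch (the numbers are illustrative):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    data = list(range(1_000))
    with Pool(processes=4) as pool:
        # chunksize=100 sends items to workers in batches of 100,
        # reducing the per-item serialization overhead
        results = pool.map(square, data, chunksize=100)
    print(results[:5])  # [0, 1, 4, 9, 16]
```

For very cheap per-item functions like this one, a single-process loop may still win; multiprocessing pays off when each task does real work.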
Example: Downloading Images with Multiprocessing
In this example, we'll demonstrate how to download multiple images concurrently using multiprocessing.
Step-by-Step Code
```python
import multiprocessing
import requests

def download_file(url, index):
    print(f'Initiated download of image {index + 1}')
    response = requests.get(url)
    response.raise_for_status()
    # Use a context manager so the file handle is closed reliably
    with open(f'file{index}.jpg', 'wb') as f:
        f.write(response.content)
    print(f'Download {index + 1} completed')

if __name__ == '__main__':
    url = 'https://picsum.photos/200/300'
    processes = []
    for i in range(10):
        p = multiprocessing.Process(target=download_file, args=(url, i))
        p.start()
        processes.append(p)

    # Wait for all downloads to finish
    for p in processes:
        p.join()
```
Explanation of the Code
This example demonstrates the following key steps:
- Importing Modules: The multiprocessing and requests modules are imported to handle process creation and HTTP requests, respectively.
- Defining the Download Function: The download_file function downloads an image from the specified URL and saves it to a file.
- Creating and Starting Processes: A loop creates and starts multiple processes to download images concurrently.
- Joining Processes: The main process waits for all started processes to complete using the join method.
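The download example fires off processes without collecting any return values. When the parent needs results back, a multiprocessing.Queue can carry them; a sketch using the same Process/join pattern (the work function is illustrative):

```python
from multiprocessing import Process, Queue

def work(n, out):
    # Each child pushes its (input, result) pair onto the shared queue
    out.put((n, n * n))

if __name__ == '__main__':
    q = Queue()
    procs = [Process(target=work, args=(i, q)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Drain the queue; items may arrive in any order
    results = dict(q.get() for _ in range(4))
    print(results)
```

For large results, a Pool with map is usually simpler, since it handles the result plumbing and ordering for you.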
Conclusion
Multiprocessing in Python is an effective way to improve the performance of CPU-bound tasks: by running work in separate processes, applications can use multiple cores in parallel and bypass the GIL. It can also overlap I/O-bound operations like downloading files, although for purely I/O-bound workloads lighter-weight threads or async I/O are often sufficient.