Multiprocessing in Python


Understanding Multiprocessing in Python

Multiprocessing is a powerful module in Python that allows the creation of multiple processes, enabling concurrent execution of tasks. This is especially useful for CPU-bound tasks, where multiple CPU cores can be utilized to improve performance.

Key Concepts

  • Process: An independent program that runs in its own memory space.
  • Concurrency vs. Parallelism: Concurrency is about dealing with lots of things at once, while parallelism is about doing lots of things at the same time.
  • CPU-bound vs. I/O-bound: CPU-bound tasks require significant processing power, whereas I/O-bound tasks involve waiting for input/output operations.
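The CPU-bound vs. I/O-bound distinction can be sketched in a few lines (the function names here are illustrative, not part of any API):

```python
import time

def cpu_bound(n):
    # CPU-bound: keeps a core busy doing arithmetic the whole time
    return sum(i * i for i in range(n))

def io_bound(seconds):
    # I/O-bound: the CPU sits idle while waiting (simulated with sleep);
    # real examples are network requests and disk reads
    time.sleep(seconds)
    return 'done'
```

Multiprocessing shines for functions like `cpu_bound`; for `io_bound`-style work, threads or asyncio are usually enough.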

Benefits of Multiprocessing

  • Improved performance by utilizing multiple CPU cores.
  • Isolation of processes reduces risks of data corruption.
  • Bypassing Python's Global Interpreter Lock (GIL).
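As a minimal sketch of bypassing the GIL, each `Process` below runs in its own interpreter with its own GIL, so the CPU-bound loops genuinely run in parallel across cores (the `burn` function is a made-up example):

```python
from multiprocessing import Process, cpu_count

def burn(n):
    # CPU-bound work; each process executes this on its own core,
    # unconstrained by the GIL of the other processes
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    # One worker per available core
    procs = [Process(target=burn, args=(100_000,)) for _ in range(cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('all workers finished')
```

The `if __name__ == '__main__':` guard is required so that worker processes, which re-import the module under the spawn start method, do not recursively spawn more workers.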

The Pool Class

The Pool class in the multiprocessing module allows you to create a pool of worker processes. It provides a convenient way to parallelize the execution of a function across multiple input values. The pool can be configured with the number of worker processes, which typically corresponds to the number of CPU cores.

Using the map Method

The map method applies a given function to every item of an iterable (like a list or a DataFrame chunk) in parallel. It distributes the tasks among the worker processes in the pool, collects the results, and returns them as a list.
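A minimal sketch of `Pool.map` before the larger pandas example (the `square` function is illustrative):

```python
from multiprocessing import Pool

def square(x):
    # Runs in a worker process; arguments and results are pickled
    return x * x

if __name__ == '__main__':
    # Four workers share the ten inputs; results come back in order
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)
```

Note that `map` preserves the order of the input iterable in its results, regardless of which worker finished first.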

Example: Using Multiprocessing with Pandas

In this example, we'll create a DataFrame with a large set of numbers and use multiprocessing to calculate their squares in parallel.

Step-by-Step Code


import pandas as pd
import numpy as np
from multiprocessing import Pool

# Define a function to process data
def calculate_square(chunk):
    return chunk ** 2

if __name__ == '__main__':
    # Create the DataFrame inside the __main__ guard so that worker
    # processes (which re-import this module under the spawn start
    # method) do not each rebuild the 10-million-row frame
    num_records = 10_000_000  # 10 million records
    df = pd.DataFrame({'numbers': np.random.randint(1, 100, size=num_records)})

    # Split the DataFrame into chunks for processing
    num_chunks = 8  # Adjust based on the number of CPU cores
    chunks = np.array_split(df['numbers'], num_chunks)

    # Create a Pool of worker processes
    with Pool(processes=num_chunks) as pool:
        # Use map to apply the calculate_square function to each chunk
        results = pool.map(calculate_square, chunks)

    # Concatenate results back into a single DataFrame
    squared_df = pd.concat(results)
    squared_df = squared_df.reset_index(drop=True)  # Reset index

    # Display the first few rows of the squared DataFrame
    print(squared_df.head())
        

Explanation of the Code

This example demonstrates the following key steps:

  • Creating the DataFrame: A DataFrame with 10 million random integers is generated to demonstrate multiprocessing.
  • Defining the Processing Function: The calculate_square function computes the square of each number in a chunk.
  • Using Pool for Multiprocessing: The DataFrame is split into smaller chunks, and a pool of worker processes is created. The map method applies the calculate_square function to each chunk concurrently.
  • Combining Results: The results are concatenated back into a single DataFrame for final output.
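One caveat worth noting: for a cheap element-wise operation like squaring, pandas' own vectorized arithmetic usually beats the multiprocessing version, because every chunk must be pickled to and from the worker processes. A minimal sketch of the vectorized equivalent:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'numbers': np.random.randint(1, 100, size=1_000)})

# Vectorized NumPy arithmetic: no pickling, no process startup cost.
# Multiprocessing pays off when the per-chunk work is genuinely expensive.
squared = df['numbers'] ** 2
print(squared.head())
```

A reasonable rule of thumb: reach for `Pool` only when the per-chunk computation dwarfs the cost of serializing the data.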

Example: Downloading Images with Multiprocessing

In this example, we'll demonstrate how to download multiple images concurrently using multiprocessing.

Step-by-Step Code


import multiprocessing
import requests

def download_file(url, name):
    print(f'Starting download {name + 1}')
    response = requests.get(url)
    # Use a context manager so the file handle is closed properly
    with open(f'file{name}.jpg', 'wb') as f:
        f.write(response.content)
    print(f'Download {name + 1} completed')

if __name__ == '__main__':
    url = 'https://picsum.photos/200/300'
    processes = []
    for i in range(10):
        # Start one process per image; each downloads independently
        p = multiprocessing.Process(target=download_file, args=[url, i])
        p.start()
        processes.append(p)

    # Wait for every download process to finish
    for p in processes:
        p.join()
        

Explanation of the Code

This example demonstrates the following key steps:

  • Importing Modules: The multiprocessing and requests modules are imported to handle process creation and HTTP requests, respectively.
  • Defining the Download Function: The download_file function initiates the download of an image from a specified URL and saves it to a file.
  • Creating and Starting Processes: A loop creates and starts multiple processes to download images concurrently.
  • Joining Processes: The main process waits for all initiated processes to complete using the join method.

Conclusion

Multiprocessing in Python is an effective way to improve the performance of CPU-bound tasks. By leveraging multiple processes, applications can execute work in true parallel across CPU cores, bypassing the GIL. It can also be applied to I/O-bound work such as downloading files, as shown above, although threads or asyncio are often a lighter-weight fit for that case.
