
Reduce memory arena contention #8714

Open

kddnewton wants to merge 3 commits into main
Conversation

@kddnewton commented Jan 24, 2025

This is a follow-up to #8692, based on @wiredfool's feedback.

Previously there was one memory arena for all threads, making it the bottleneck for multi-threaded performance. As the number of threads increased, the contention for the lock on the arena would grow, causing other threads to wait to acquire it.

This commit uses 8 memory arenas and round-robins their assignment to threads. Each thread tracks the index into the arena array that it should use, assigned the first time that thread accesses an arena.
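
For illustration, here is a minimal Python analogue of that scheme. The actual change lives in C (in src/libImaging/Storage.c); the names below are hypothetical and are not Pillow's API:

import itertools
import threading

ARENA_COUNT = 8

class Arena:
    # Stand-in for the C-level arena: a lock plus whatever bookkeeping
    # the allocator keeps (free-block lists, statistics, ...).
    def __init__(self):
        self.lock = threading.Lock()
        self.free_blocks = []

arenas = [Arena() for _ in range(ARENA_COUNT)]
_next_index = itertools.count()    # round-robin assignment counter
_thread_state = threading.local()  # per-thread cached arena index

def thread_arena_index():
    # Assigned the first time this thread touches an arena, cached after.
    if not hasattr(_thread_state, "arena_index"):
        _thread_state.arena_index = next(_next_index) % ARENA_COUNT
    return _thread_state.arena_index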

When an image is first created, its storage is allocated from an arena. When multiple arenas are enabled, the arena's index is recorded on the image, so that when the image is deleted its memory can be returned to the correct arena.
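
Continuing the same hypothetical sketch, the allocation records the arena index so that deallocation, which may happen on a different thread, returns memory to the arena it came from:

class ImageStorage:
    # Hypothetical stand-in for an image's pixel storage.
    def __init__(self, nbytes):
        # Record which arena served this allocation.
        self.arena_index = thread_arena_index()
        arena = arenas[self.arena_index]
        with arena.lock:
            # Reuse a cached block if one is available, else allocate.
            self.block = (
                arena.free_blocks.pop() if arena.free_blocks else bytearray(nbytes)
            )

    def release(self):
        # Freeing may run on a different thread than the one that
        # allocated; the stored index routes the block back to the
        # arena it came from.
        arena = arenas[self.arena_index]
        with arena.lock:
            arena.free_blocks.append(self.block)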

Effectively this means that single-threaded programs should see no real change. We also skip this logic entirely when the GIL is enabled, since the GIL already serializes access to the default arena for us.
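
As a hedged illustration of that decision: Python 3.13 exposes whether the GIL is active, and the C code makes an analogous check at the native level (the flag name below is made up for this sketch):

import sys

# On free-threaded (3.13t) builds sys._is_gil_enabled() returns False;
# on regular builds it returns True (the attribute is missing on <3.13).
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
use_multiple_arenas = not gil_enabled  # hypothetical flag for this sketch
print("multiple arenas:", use_multiple_arenas)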

As expected, this approach has no real noticeable effect on regular CPython. On free-threaded CPython, however, there is a massive difference (improvements of up to about 70% in the benchmark below).

Here is the benchmarking script that I used:

test.py
import concurrent.futures
import os
import threading
import time

from PIL import Image

num_threads = 16
num_images = 1024


def operation():
    images = []
    for i in range(num_images):
        img = Image.new(
            "RGB", (100, 100), color=(i % 256, (i // 256) % 256, (i // 65536) % 256)
        )
        images.append(img)

    for img in images:
        # convert() allocates a fresh image that is dropped immediately,
        # exercising both the allocation and deallocation paths.
        img = img.convert("CMYK")

    images.clear()


def worker(barrier):
    barrier.wait()
    runtimes = []

    for _ in range(5):
        start_time = time.time()
        operation()
        end_time = time.time()
        runtimes.append(end_time - start_time)

    return runtimes


def benchmark():
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        barrier = threading.Barrier(num_threads)
        futures = [executor.submit(worker, barrier) for _ in range(num_threads)]

        run_times = []
        for future in concurrent.futures.as_completed(futures):
            try:
                run_times.extend(future.result())
            except IndexError:
                # Hard-exit the whole process if a worker raised an IndexError.
                os._exit(-1)

        min_time = min(run_times)
        max_time = max(run_times)
        mean_time = sum(run_times) / len(run_times)
        print(f"Max: {max_time:.6f} Mean: {mean_time:.6f} Min: {min_time:.6f}")


benchmark()

Results

3.13.0 on main

$ python -VV                                        
Python 3.13.0 (tags/v3.13.0:60403a5409f, Jan  3 2025, 14:04:52) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.404962 Mean: 0.353120 Min: 0.303807
$ python test.py
Max: 0.369188 Mean: 0.320218 Min: 0.282613
$ python test.py
Max: 0.386692 Mean: 0.335509 Min: 0.294476
$ python test.py
Max: 0.394410 Mean: 0.350275 Min: 0.299456
$ python test.py
Max: 0.416075 Mean: 0.354347 Min: 0.309045

3.13.0 on branch

$ python -VV
Python 3.13.0 (tags/v3.13.0:60403a5409f, Jan  3 2025, 14:04:52) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.422371 Mean: 0.354453 Min: 0.292521
$ python test.py
Max: 0.423698 Mean: 0.358393 Min: 0.313581
$ python test.py
Max: 0.405487 Mean: 0.356346 Min: 0.299354
$ python test.py
Max: 0.431244 Mean: 0.369772 Min: 0.308096
$ python test.py
Max: 0.472806 Mean: 0.377575 Min: 0.313588

3.13.0t on main

$ python -VV
Python 3.13.0 experimental free-threading build (tags/v3.13.0:60403a5409f, Jan  7 2025, 13:53:44) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.161379 Mean: 0.114555 Min: 0.072739
$ python test.py
Max: 0.188203 Mean: 0.133376 Min: 0.095111
$ python test.py
Max: 0.181084 Mean: 0.128733 Min: 0.086086
$ python test.py
Max: 0.187286 Mean: 0.131114 Min: 0.094561
$ python test.py
Max: 0.191979 Mean: 0.133439 Min: 0.097527

3.13.0t on branch

$ python -VV  
Python 3.13.0 experimental free-threading build (tags/v3.13.0:60403a5409f, Jan  7 2025, 13:53:44) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.090055 Mean: 0.044987 Min: 0.019220
$ python test.py
Max: 0.095986 Mean: 0.040712 Min: 0.019204
$ python test.py
Max: 0.094424 Mean: 0.041949 Min: 0.016574
$ python test.py
Max: 0.084610 Mean: 0.042789 Min: 0.016866
$ python test.py
Max: 0.095480 Mean: 0.044068 Min: 0.015999

@kddnewton mentioned this pull request Jan 24, 2025
@ngoldbaum commented Jan 24, 2025

I wonder what happens if you plot the result of your benchmark as a function of thread count. See e.g. this NumPy issue, which reported a similar scaling problem; there, running a benchmark across thread counts was a very useful way to identify the scaling issue and to confirm that it was fixed by locking that scales better.
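
A sketch of such a sweep, assuming test.py is first modified to read the thread count from a NUM_THREADS environment variable (that variable is not in the script above):

import os
import subprocess

# Run the benchmark once per thread count and print its summary line.
for n in (1, 2, 4, 8, 16, 32):
    env = dict(os.environ, NUM_THREADS=str(n))
    result = subprocess.run(
        ["python", "test.py"], env=env, capture_output=True, text=True, check=True
    )
    print(f"{n:>2} threads: {result.stdout.strip()}")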

@lysnikolaou (Contributor) left a comment

Great job! The approach looks good to me, though I've left some comments on some specifics.

Review threads:
src/libImaging/Storage.c (resolved)
src/libImaging/Storage.c (outdated, resolved)
src/_imaging.c (outdated, resolved)
src/libImaging/Storage.c (resolved)
@kddnewton (Author) commented

@lysnikolaou — Thanks for the review! I've applied your suggestions. Please let me know if you see anything else.
@ngoldbaum — I'll work on a graph. I chose 8 arenas after some experimentation, but it would be good to back this up.

Labels: Free-threading PEP 703 support
4 participants