Never use heap for return buffers #112060

Open
wants to merge 13 commits into main

Conversation

EgorBo
Member

@EgorBo EgorBo commented Feb 1, 2025

CI experiment for #111127

MyStruct Foo(string name, int age)
{
    return new MyStruct(name, age);
}

record struct MyStruct(string Name, int Age);

Was:

; Method Prog:Foo(System.String,int):MyStruct:this (FullOpts)
       push     rsi
       push     rbx
       mov      rbx, rdx
       mov      rdx, r8
       mov      esi, r9d
       mov      rcx, rbx
       call     CORINFO_HELP_CHECKED_ASSIGN_REF
       mov      dword ptr [rbx+0x08], esi
       mov      rax, rbx
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 28

Now:

; Method Prog:Foo(System.String,int):MyStruct:this (FullOpts)
       mov      gword ptr [rdx], r8
       mov      dword ptr [rdx+0x08], r9d
       mov      rax, rdx
       ret      
; Total bytes of code: 11

where the write barrier is emitted at the call site if needed (presumably, this happens rarely)
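
For illustration, a rough C# sketch of what the new convention means when the caller stores the result into the GC heap (reusing the MyStruct example above; the caller type and member names are illustrative only, not taken from this PR):

using System.Runtime.CompilerServices;

record struct MyStruct(string Name, int Age);

class CallerSketch
{
    MyStruct _stored; // instance field of a class, i.e. memory on the GC heap

    [MethodImpl(MethodImplOptions.NoInlining)]
    static MyStruct Foo(string name, int age) => new MyStruct(name, age);

    public void Store(string name, int age)
    {
        // Old convention: the address of _stored could be passed to Foo as the return
        // buffer, so Foo had to use CORINFO_HELP_CHECKED_ASSIGN_REF for the string field.
        // New convention: Foo always writes into a stack temp (plain stores, as in the
        // "Now" disassembly above); the write barrier, if needed, is emitted here at the
        // call site when the temp is copied into _stored.
        _stored = Foo(name, age);
    }
}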

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 1, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member Author

EgorBo commented Feb 3, 2025

/azp run Fuzzlyn


Azure Pipelines successfully started running 1 pipeline(s).

@EgorBo
Member Author

EgorBo commented Feb 4, 2025

/azp run runtime-coreclr jitstress, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr gcstress-extra, runtime-coreclr libraries-jitstress, runtime-coreclr libraries-pgo, Fuzzlyn, runtime-coreclr pgostress, runtime-coreclr outerloop


Azure Pipelines successfully started running 8 pipeline(s).

@EgorBo
Member Author

EgorBo commented Feb 4, 2025

Ended up taking @jakobbotsch's patch (plus Jakob's suggestions) instead of handling all the places where we might propagate non-locals to return buffers. Diffs look good IMO. The regressions are caused by the fact that callers can no longer pass a heap/etc. reference as a return buffer to a callee directly and have to make a local copy (in the case of large structs it could be quite a few SIMD instructions, a memset call, or a loop of atomic stores, depending on size and the existence of GC pointers).

For a single aspnet-windows-x64 collection (TechEmpower), the stats for removed write barriers are the following:

CORINFO_HELP_ASSIGN_REF:           0 (expected)
CORINFO_HELP_ASSIGN_BYREF:      -113
CORINFO_HELP_CHECKED_ASSIGN_REF: -87
CORINFO_HELP_BULK_WRITEBARRIER:   -2

To confirm that the return buffer never points to the heap, every method performs a validation helper call (under jitstress mode) for its return-buffer argument (if any). For better coverage, I enabled it unconditionally (no stress mode) and ran all sorts of outerloop pipelines (see above ^) - no unknown failures were found.

We can enable more optimizations in JIT now that we know the return buffer is never aliased. Also, we can relax atomicity guarantees for it, e.g. today load/store coalescing just gives up on ret buffers.

@EgorBo
Member Author

EgorBo commented Feb 4, 2025

@MihuBot

@EgorBo
Member Author

EgorBo commented Feb 4, 2025

@EgorBot -amd -arm -windows_intel

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark]
    public void Bench() => _ = Test("A", "B", "C");

    [MethodImpl(MethodImplOptions.NoInlining)]
    public (string, string, string) Test(string a, string b, string c)
        => (a, b, c);
}

Results: EgorBot/runtime-utils#287

@jakobbotsch
Member

We can enable more optimizations in JIT now that we know the return buffer is never aliased.

We'll need to do a bit more JIT work before we have this property. E.g. for something like x = Foo(ref x), Foo will still end up with an aliased retbuffer with this PR.
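
A minimal sketch of that aliasing pattern (illustrative types and names, not from the PR):

using System.Runtime.CompilerServices;

struct Pair { public long First, Second; }

class AliasingSketch
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static Pair Foo(ref Pair src)
    {
        src.First = 1;   // writes through the byref argument...
        return src;      // ...while the hidden return buffer may be the same memory
    }

    static void Run()
    {
        Pair x = default;
        x = Foo(ref x);  // the return buffer is the address of x, which also backs 'ref x'
    }
}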

We can do the investigation/change of that separately from this PR, but we should make sure to get it done as part of this R2R version update to avoid having to change the calling convention once more in the future.

@jkotas
Member

jkotas commented Feb 4, 2025

The regressions are caused by the fact that callers can no longer pass a heap/etc. reference as a return buffer to a callee directly and have to make a local copy

What are the worst-case examples of this regression in real-world code?

@EgorBo
Member Author

EgorBo commented Feb 4, 2025

The regressions are caused by the fact that callers can no longer pass a heap/etc. reference as a return buffer to a callee directly and have to make a local copy

What are the worst-case examples of this regression in real-world code?

I'll check later this week. Presumably, we regress cases where large structs don't contain GC pointers and we introduce a redundant memory move when they're stored to the heap. Perhaps we could complicate things and only guarantee return-buffer-on-stack for structs with GC pointers in the worst case? Hopefully the PR improves more than it potentially regresses, especially on arm64, where all moves are a bit cheaper (due to better atomicity guarantees of SIMD moves and paired loads/stores) while calls are a bit more expensive. Also, hopefully, follow-up changes will remove some of the redundant copies.

@jkotas
Member

jkotas commented Feb 4, 2025

Perhaps we could complicate things and only guarantee return-buffer-on-stack for structs with GC pointers in the worst case?

We had a complicated scheme like that. The CLR ABI doc has mentions of it - look for IsStructRequiringStackAllocRetBuf - which decided the return buffer convention based on various factors. It was a bug farm that did not survive the test of time.

@EgorBo
Member Author

EgorBo commented Feb 6, 2025

@MihuBot

@EgorBo
Member Author

EgorBo commented Feb 6, 2025

Updated stats for write-barriers after #112227 was merged (it is supposed to help reducing the number of bulk barriers):

aspnet-win-x64 SPMI collection:

CORINFO_HELP_ASSIGN_REF:          -0
CORINFO_HELP_ASSIGN_BYREF:      -123
CORINFO_HELP_CHECKED_ASSIGN_REF: -64
CORINFO_HELP_BULK_WRITEBARRIER:  -31

Looks like the aspnet collection has too many missed contexts currently (so the actual numbers are likely 5-10% higher)

MihuBot (PMI for BCL):

CORINFO_HELP_ASSIGN_REF:           -0
CORINFO_HELP_ASSIGN_BYREF:       -342
CORINFO_HELP_CHECKED_ASSIGN_REF: -838
CORINFO_HELP_BULK_WRITEBARRIER:  -300

@EgorBo
Member Author

EgorBo commented Feb 6, 2025

@EgorBot -arm -amd -windows_intel

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Structs with GC references
[InlineArray(128)] struct StructGC_1024B { object _; }
[InlineArray(32)] struct StructGC_256B { object _; }
[InlineArray(16)] struct StructGC_128B { object _; }
[InlineArray(8)] struct StructGC_64B { object _; }
[InlineArray(4)] struct StructGC_32B { object _; }

// Structs without GC references
[InlineArray(128)] struct StructNoGC_1024B { long _; }
[InlineArray(32)] struct StructNoGC_256B { long _; }
[InlineArray(16)] struct StructNoGC_128B { long _; }
[InlineArray(8)] struct StructNoGC_64B { long _; }
[InlineArray(4)] struct StructNoGC_32B { long _; }

static class Utils
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T Create<T>() where T : struct => default;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T Create<T>(T copy) where T : struct => copy;
}

public class BoxBenchmarks
{
    object boxedStruct;

    [Benchmark] public void Box_StructGC_1024B() => boxedStruct = Utils.Create<StructGC_1024B>();
    [Benchmark] public void Box_StructGC_256B() => boxedStruct = Utils.Create<StructGC_256B>();
    [Benchmark] public void Box_StructGC_128B() => boxedStruct = Utils.Create<StructGC_128B>();
    [Benchmark] public void Box_StructGC_64B() => boxedStruct = Utils.Create<StructGC_64B>();
    [Benchmark] public void Box_StructGC_32B() => boxedStruct = Utils.Create<StructGC_32B>();

    [Benchmark] public void Box_StructNoGC_1024B() => boxedStruct = Utils.Create<StructNoGC_1024B>();
    [Benchmark] public void Box_StructNoGC_256B() => boxedStruct = Utils.Create<StructNoGC_256B>();
    [Benchmark] public void Box_StructNoGC_128B() => boxedStruct = Utils.Create<StructNoGC_128B>();
    [Benchmark] public void Box_StructNoGC_64B() => boxedStruct = Utils.Create<StructNoGC_64B>();
    [Benchmark] public void Box_StructNoGC_32B() => boxedStruct = Utils.Create<StructNoGC_32B>();
}
public class StackBenchmarks
{
    [Benchmark] public void Stack_StructGC_1024B() => _ = Utils.Create<StructGC_1024B>(default);
    [Benchmark] public void Stack_StructGC_256B() => _ = Utils.Create<StructGC_256B>(default);
    [Benchmark] public void Stack_StructGC_128B() => _ = Utils.Create<StructGC_128B>(default);
    [Benchmark] public void Stack_StructGC_64B() => _ = Utils.Create<StructGC_64B>(default);
    [Benchmark] public void Stack_StructGC_32B() => _ = Utils.Create<StructGC_32B>(default);

    [Benchmark] public void Stack_StructNoGC_1024B() => _ = Utils.Create<StructNoGC_1024B>(default);
    [Benchmark] public void Stack_StructNoGC_256B() => _ = Utils.Create<StructNoGC_256B>(default);
    [Benchmark] public void Stack_StructNoGC_128B() => _ = Utils.Create<StructNoGC_128B>(default);
    [Benchmark] public void Stack_StructNoGC_64B() => _ = Utils.Create<StructNoGC_64B>(default);
    [Benchmark] public void Stack_StructNoGC_32B() => _ = Utils.Create<StructNoGC_32B>(default);
}

@EgorBo
Member Author

EgorBo commented Feb 6, 2025

A histogram of struct sizes in the BCL (and the number of structs with GC pointers).

~45% of all structs are 16 bytes or smaller.
~65% of all structs are 32 bytes or smaller.
~85% of all structs are 64 bytes or smaller.
~95% of all structs are 128 bytes or smaller.

With GC pointers: 1134
Without GC pointers: 1197

Struct Size Histogram

Size in bytes - number of structs

Size: 1, Count: 72
Size: 2, Count: 14
Size: 3, Count: 3
Size: 4, Count: 116
Size: 5, Count: 5
Size: 6, Count: 8
Size: 7, Count: 2
Size: 8, Count: 316
Size: 9, Count: 2
Size: 10, Count: 5
Size: 11, Count: 1
Size: 12, Count: 55
Size: 13, Count: 1
Size: 14, Count: 1
Size: 16, Count: 477
Size: 17, Count: 1
Size: 18, Count: 1
Size: 19, Count: 2
Size: 20, Count: 21
Size: 21, Count: 1
Size: 22, Count: 2
Size: 24, Count: 241
Size: 25, Count: 1
Size: 27, Count: 1
Size: 28, Count: 16
Size: 30, Count: 1
Size: 31, Count: 1
Size: 32, Count: 152
Size: 33, Count: 2
Size: 34, Count: 1
Size: 36, Count: 4
Size: 37, Count: 1
Size: 38, Count: 2
Size: 39, Count: 1
Size: 40, Count: 122
Size: 44, Count: 3
Size: 48, Count: 116
Size: 50, Count: 1
Size: 51, Count: 1
Size: 52, Count: 4
Size: 56, Count: 88
Size: 60, Count: 3
Size: 64, Count: 88
Size: 65, Count: 1
Size: 66, Count: 1
Size: 68, Count: 2
Size: 70, Count: 1
Size: 72, Count: 68
Size: 76, Count: 4
Size: 80, Count: 41
Size: 82, Count: 1
Size: 84, Count: 3
Size: 88, Count: 28
Size: 90, Count: 1
Size: 96, Count: 23
Size: 98, Count: 1
Size: 99, Count: 1
Size: 104, Count: 9
Size: 108, Count: 1
Size: 112, Count: 13
Size: 120, Count: 12
Size: 124, Count: 2
Size: 128, Count: 9
Size: >128B, Count: 153

@EgorBo
Member Author

EgorBo commented Feb 6, 2025

@jkotas @jakobbotsch

So there are 2 use cases (a rough sketch of both follows the list):

  1. *byref = struct_call();
    It seems to happen a lot less frequently than storing the result to a local/stack/temp.
    This PR introduces a regression for this case since we introduce a new struct copy (for large sizes it's done via a HELP_MEMCPY or BULK_BARRIER call) - up to 2x slower in a synthetic micro-benchmark.

  2. local = struct_call();
    For this case, presumably, we don't introduce any overhead if the struct contains no pointers. If it contains pointers and its size is less than 128 bytes on arm64 or 256 bytes on x64, we should see an improvement, as we avoid a barrier completely. The biggest benefit is for small structs where the JIT doesn't even emit the "bulk" barrier, but a set of individual barriers for each field, as shown in this benchmark: Never use heap for return buffers #112060 (comment). For very large structs with GC refs it should also be a regression, since we basically have to do two bulk barriers instead of one.
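
A rough sketch of both cases (hypothetical types and names, not from the PR):

using System.Runtime.CompilerServices;

struct Payload { public object Obj; public long A, B, C; }

class UseCaseSketch
{
    Payload _onHeap; // field of a class instance, i.e. GC-heap memory

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Payload MakePayload() => default;

    // Case 1: *byref = struct_call();
    // The callee now writes into a hidden stack temp and the caller copies it to the
    // heap afterwards (for large structs via a memcpy/bulk-barrier call) - this is the
    // regressing pattern.
    public void StoreToHeap() => _onHeap = MakePayload();

    // Case 2: local = struct_call();
    // No extra copy is introduced, and the callee's stores into the return buffer no
    // longer need checked or bulk write barriers.
    public long StoreToLocal()
    {
        Payload local = MakePayload();
        return local.A;
    }
}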

So far, I wasn't able to find any benchmark in the dotnet/performance suite that regresses because of it (I was trying benchmarks which had JIT diffs). Nor was I able to see any impact on the OrchardCMS benchmark. The original issue stated that TechEmpower Fortunes should see an improvement; unfortunately, I wasn't able to kick off a run yet to verify that.

Presumably, we can land additional follow-up improvements to benefit from the stack-only return buffer, as mentioned before. So I'm not sure what exactly we should do here: I am fine with leaving things as they are in main, or we can just merge it, let it spin for a week or two, and revert it if the dotnet/performance and aspnet PerfLab results aren't motivating. Unfortunately, we still don't have a reliable way to run dotnet/performance benchmarks in parallel - all attempts in the past had issues with triaging potential improvements/regressions due to noise, since we can't use the same script we use for Tuesday/Thursday triaging yet.

PS: #112060 (comment) implies that for 85% of structs, the copy is basically 2 mov instructions (with avx512).

@EgorBo EgorBo marked this pull request as ready for review February 6, 2025 18:05
@NinoFloris
Contributor

Nor was I able to see any impact on the OrchardCMS benchmark

Status quo incentives shape common code patterns. Authors are less inclined to use structs if there is a performance cost for doing so. Greenfield design considerations in ASP.NET Core, Npgsql, and libraries based on them (e.g. OrchardCMS) are rooted in - at best - early .NET Core era JIT limitations: redundant copies, bad copying perf, no physical promotion/enregistration, and so on.

IMO, as long as there are no notable regressions - even though good portions of the measured code are class/heap focused - it seems worthwhile to allow for more struct optimizations rather than fewer (especially if knock-on optimizations can eliminate more copies). Stack-centric and shared-nothing code will continue to be a better fit for increasingly high core counts.

The original issue (#111127 (comment)) stated that TechEmpower Fortunes should see an improvement; unfortunately, I wasn't able to kick off a run yet to verify that

My measurement was done with a new driver - not Npgsql (of which I'm also a maintainer though) - it's a reset of my previous work on https://github.com/NinoFloris/Slon. That work is almost ready for TE CI runs.

@EgorBo
Member Author

EgorBo commented Feb 6, 2025

@NinoFloris thanks! I'll try benchmarks with Npgsql and TE Fortunes specifically and post the results here

@jkotas
Member

jkotas commented Feb 7, 2025

You can also delete this and this.

So I'm not sure what exactly we should do here: I am fine with leaving things as they are in main, or we can just merge it, let it spin for a week or two,

I am fine with merging this change and keeping it even if we are not able to find a motivating perf improvement. I think it is a more reasonable calling convention in general. The change unifies how we deal with arguments passed by reference and return buffers.
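
As a rough illustration of that unification (a sketch under the assumption that implicit-byref struct arguments already always point to a caller-allocated stack copy; the types here are hypothetical):

struct LargeStruct { public long A, B, C, D, E; }

static class ConventionSketch
{
    // On ABIs where LargeStruct is passed via an implicit byref, the caller makes a
    // stack copy first, so the callee never observes a GC-heap address for 'arg'.
    static long Consume(LargeStruct arg) => arg.A;

    // After this change, the hidden return-buffer pointer has the same property:
    // it always refers to caller-owned stack memory, never to the GC heap.
    static LargeStruct Produce() => default;
}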

@jakobbotsch
Member

Can you also update

GenTree* destAddr = comp->gtNewLclVarAddrNode(tmpNum, TYP_BYREF);
NewCallArg newArg = NewCallArg::Primitive(destAddr).WellKnown(WellKnownArg::RetBuffer);
call->gtArgs.InsertAfterThisOrFirst(comp, newArg);

to TYP_I_IMPL?

Also, if you want to, you can fix up most of the pointer -> byref changes made in
#72720

However, I'm also ok with leaving that as is and I can clean it up some other time. But can you please change this part of the docs:

directly pass along its own return buffer parameter to DispatchTailCalls. It is
possible that this return buffer is pointing into GC heap, so the result is
always tracked as a byref in the mechanism.
