Never use heap for return buffers #112060
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
/azp run Fuzzlyn

Azure Pipelines successfully started running 1 pipeline(s).

/azp run runtime-coreclr jitstress, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr gcstress-extra, runtime-coreclr libraries-jitstress, runtime-coreclr libraries-pgo, Fuzzlyn, runtime-coreclr pgostress, runtime-coreclr outerloop

Azure Pipelines successfully started running 8 pipeline(s).
Ended up taking @jakobbotsch's patch instead of handling all the places where we might propagate non-locals to return buffers, plus Jakob's suggestions. Diffs look good IMO. The regressions are expected: callers can no longer pass a heap (or other non-local) reference directly as a return buffer for a callee and have to make a local copy first. For large structs that copy can be quite a few SIMD instructions, a memset call, or a loop of atomic stores, depending on the size and the presence of GC pointers. For a single aspnet-windows-x64 collection (TechEmpower), the stats for removed write barriers are the following:
To confirm that we never deal with a heap pointer, every method with a return-buffer argument contains a validation helper call for it (under jitstress mode). For better coverage, I enabled it unconditionally (no stress mode) and ran all sorts of outerloop pipelines (see above ^) - no unknown failures were found. Now that we know the return buffer is never aliased, we can enable more optimizations in the JIT. We can also relax its atomicity guarantees; e.g. today load/store coalescing just gives up on ret buffers.
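To make the regression shape concrete, here is a minimal hypothetical sketch (names invented, not from the PR) of the pattern described above: a call whose result is stored straight into a heap-resident field. Previously the JIT could pass the field's address as the return buffer; now the callee writes into a caller stack temp and the caller copies it to the heap (with a bulk write barrier when the struct contains GC pointers).

using System.Runtime.CompilerServices;

// 64-byte struct with GC pointers; copying it to the heap needs a bulk write barrier.
[InlineArray(8)]
struct Refs64 { object _; }

class Holder
{
    static Refs64 s_field;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Refs64 Produce() => default;

    // Before: the address of s_field could be passed to Produce as the return buffer.
    // After: Produce writes to a stack temp; the copy to s_field happens here.
    static void Store() => s_field = Produce();
}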
@EgorBot -amd -arm -windows_intel

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark]
    public void Bench() => _ = Test("A", "B", "C");

    [MethodImpl(MethodImplOptions.NoInlining)]
    public (string, string, string) Test(string a, string b, string c)
        => (a, b, c);
}

Results: EgorBot/runtime-utils#287
We'll need to do a bit more JIT work before we have this property in some cases. We can do the investigation/change of that separately from this PR, but we should make sure to get it done as part of this R2R version update, to avoid having to change the calling convention once more in the future.
What are the worst-case examples of this regression in real-world code?
I'll check later this week. Presumably, we regress cases where large structs don't contain GC pointers and we introduce a redundant memory move for them when they're saved on the heap. Perhaps we could complicate things and, in the worst case, only guarantee return-buffer-on-stack for structs with GC pointers? Hopefully the PR improves more than it potentially regresses, especially on arm64, where all moves are a bit cheaper (due to better atomicity guarantees of SIMD moves and paired loads/stores) while calls are a bit more expensive. Also, hopefully, follow-up changes will remove some redundant copies.
We had a complicated scheme like that. The CLR ABI doc still mentions it - look for IsStructRequiringStackAllocRetBuf - which decided the return buffer convention based on different factors. It was a bug farm that did not survive the test of time.
Updated stats for write barriers after #112227 was merged (it is supposed to help reduce the number of bulk barriers):

aspnet-win-x64 SPMI collection:

Looks like the aspnet collection currently has too many missed contexts (so the actual numbers are likely 5-10% higher).

MihuBot (PMI for BCL):
@EgorBot -arm -amd -windows_intel

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Structs with GC references
[InlineArray(128)] struct StructGC_1024B { object _; }
[InlineArray(32)] struct StructGC_256B { object _; }
[InlineArray(16)] struct StructGC_128B { object _; }
[InlineArray(8)] struct StructGC_64B { object _; }
[InlineArray(4)] struct StructGC_32B { object _; }

// Structs without GC references
[InlineArray(128)] struct StructNoGC_1024B { long _; }
[InlineArray(32)] struct StructNoGC_256B { long _; }
[InlineArray(16)] struct StructNoGC_128B { long _; }
[InlineArray(8)] struct StructNoGC_64B { long _; }
[InlineArray(4)] struct StructNoGC_32B { long _; }

static class Utils
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T Create<T>() where T : struct => default;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T Create<T>(T copy) where T : struct => copy;
}

public class BoxBenchmarks
{
    object boxedStruct;

    [Benchmark] public void Box_StructGC_1024B() => boxedStruct = Utils.Create<StructGC_1024B>();
    [Benchmark] public void Box_StructGC_256B() => boxedStruct = Utils.Create<StructGC_256B>();
    [Benchmark] public void Box_StructGC_128B() => boxedStruct = Utils.Create<StructGC_128B>();
    [Benchmark] public void Box_StructGC_64B() => boxedStruct = Utils.Create<StructGC_64B>();
    [Benchmark] public void Box_StructGC_32B() => boxedStruct = Utils.Create<StructGC_32B>();
    [Benchmark] public void Box_StructNoGC_1024B() => boxedStruct = Utils.Create<StructNoGC_1024B>();
    [Benchmark] public void Box_StructNoGC_256B() => boxedStruct = Utils.Create<StructNoGC_256B>();
    [Benchmark] public void Box_StructNoGC_128B() => boxedStruct = Utils.Create<StructNoGC_128B>();
    [Benchmark] public void Box_StructNoGC_64B() => boxedStruct = Utils.Create<StructNoGC_64B>();
    [Benchmark] public void Box_StructNoGC_32B() => boxedStruct = Utils.Create<StructNoGC_32B>();
}

public class StackBenchmarks
{
    [Benchmark] public void Stack_StructGC_1024B() => _ = Utils.Create<StructGC_1024B>(default);
    [Benchmark] public void Stack_StructGC_256B() => _ = Utils.Create<StructGC_256B>(default);
    [Benchmark] public void Stack_StructGC_128B() => _ = Utils.Create<StructGC_128B>(default);
    [Benchmark] public void Stack_StructGC_64B() => _ = Utils.Create<StructGC_64B>(default);
    [Benchmark] public void Stack_StructGC_32B() => _ = Utils.Create<StructGC_32B>(default);
    [Benchmark] public void Stack_StructNoGC_1024B() => _ = Utils.Create<StructNoGC_1024B>(default);
    [Benchmark] public void Stack_StructNoGC_256B() => _ = Utils.Create<StructNoGC_256B>(default);
    [Benchmark] public void Stack_StructNoGC_128B() => _ = Utils.Create<StructNoGC_128B>(default);
    [Benchmark] public void Stack_StructNoGC_64B() => _ = Utils.Create<StructNoGC_64B>(default);
    [Benchmark] public void Stack_StructNoGC_32B() => _ = Utils.Create<StructNoGC_32B>(default);
}
A histogram of struct sizes in the BCL (and the number of structs with GC pointers). ~45% of all structs are 16 bytes or smaller. With GC pointers: 1134.

Struct Size Histogram (size in bytes - number of structs):
Size: 1, Count: 72
…
So there are two use cases:
So far, I haven't been able to find any benchmark in the dotnet/performance suite that regresses because of this (I tried the benchmarks that had jit diffs). Nor was I able to see any impact on the OrchardCMS benchmark. The original issue stated that TechEmpower Fortunes should see an improvement; unfortunately, I haven't been able to kick off a run yet to verify that. Presumably, we can land additional follow-up improvements to benefit from the stack-only return buffer, as mentioned before. So I'm not sure what exactly we should do here: I am fine with leaving things as they are in main, or we can just merge it, let it spin for a week or two, and if the dotnet/performance and aspnet PerfLab results aren't motivating, revert it. Unfortunately, we still don't have a reliable way to run dotnet/performance benchmarks in parallel - all past attempts had issues with triaging potential improvements/regressions due to noise, as we can't use the same script we use for Tuesday/Thursday triaging yet.

PS: #112060 (comment) implies that for 85% of structs the copy is basically 2 mov instructions (with AVX-512).
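As a rough, hypothetical illustration of that PS (names invented; exact codegen depends on the target ISA): for a small struct without GC pointers, the caller-side copy out of the return-buffer stack temp is just a vector load plus a vector store.

using System.Runtime.CompilerServices;

struct Small16 { public long Lo, Hi; } // 16 bytes, no GC pointers

class Tiny
{
    static Small16 s_dst;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Small16 Make() => default;

    // The extra copy introduced by this change is ~2 mov instructions here:
    // one 16-byte vector load from the stack temp, one store to s_dst.
    static void Store() => s_dst = Make();
}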
Status quo incentives shape common code patterns: authors are less inclined to use structs if there is a performance cost for doing so. Greenfield design considerations in ASP.NET Core, Npgsql, and libraries based on them (e.g. OrchardCMS) are rooted in - at best - early .NET Core era JIT limitations: redundant copies, bad copying perf, no physical promotion/enregistration, and so on. IMO, as long as there are no notable regressions - while good portions of all the measured code are class/heap focused - it seems worthwhile to allow for more struct optimizations rather than fewer (especially if knock-on optimizations can eliminate more copies). Stack-centric and shared-nothing styles of code will continue to be a better fit for increasingly high core counts.
My measurement was done with a new driver - not Npgsql (of which I'm also a maintainer, though) - it's a reset of my previous work on https://github.com/NinoFloris/Slon. That work is almost ready for TE CI runs.
@NinoFloris thanks! I'll try benchmarks with Npgsql and TE Fortunes specifically and post the results here |
You can also delete this and this.
I am fine with merging this change and keeping it even if we are not able to find a motivating perf improvement. I think it is a more reasonable calling convention in general. The change unifies how we deal with arguments passed by reference and return buffers.
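A sketch of that unification as I understand it (simplified, hypothetical example; Windows x64 semantics assumed, not taken from the PR): large struct arguments are already lowered to a pointer to a caller-owned copy, and with this change the hidden return buffer is likewise always a pointer to caller-owned stack memory.

struct Big { public long A, B, C, D; } // 32 bytes: too big to pass in registers on win-x64

class AbiSketch
{
    // Argument: lowered to a pointer to the caller's defensive stack copy.
    static long Consume(Big b) => b.A;

    // Return: the hidden buffer pointer now always targets the caller's stack
    // temp, never a heap location, matching the by-ref argument rule.
    static Big Produce() => default;
}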
Can you also update runtime/src/coreclr/jit/rationalize.cpp (lines 160 to 163 in cee8434) to TYP_I_IMPL?
Also, if you want to, you can fix up most of the pointer -> byref changes. However, I'm also ok with leaving that as is, and I can clean it up some other time. But can you please change this part of the docs: runtime/docs/design/features/tailcalls-with-helpers.md (lines 262 to 264 in cee8434).
CI experiment for #111127
Was:
Now:
where the write barrier is put at the call site if needed (presumably this happens rarely).