Skip to content

Commit

Permalink
Detect data section string literal hash collisions (#77061)
Browse files Browse the repository at this point in the history
* Detect data section string literal hash collisions

* Move check into GetOrAdd

* Remove VB error

* Remove test only option from equality

* Truncate the string in the error message

* Update the spec

* Update the spec
  • Loading branch information
jjonescz authored Feb 11, 2025
1 parent 700edd9 commit 19c9b9e
Show file tree
Hide file tree
Showing 28 changed files with 154 additions and 9 deletions.
18 changes: 17 additions & 1 deletion docs/features/string-literals-data-section.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,14 +54,15 @@ The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a ge

For every unique string literal, a unique internal static class is generated which:
- has name composed of `<S>` followed by a hex-encoded XXH128 hash of the string
(collisions [should not happen][xxh128] with XXH128 and so they aren't currently detected or reported, the behavior in that case is undefined),
(collisions [should not happen][xxh128] with XXH128, but if there are string literals which would result in the same XXH128 hash, a compile-time error is reported),
- is nested in the `<PrivateImplementationDetails>` type to avoid polluting the global namespace
and to avoid having to enforce name uniqueness across modules,
- has one internal static readonly `string` field which is initialized in a static constructor of the class,
- is marked `beforefieldinit` so the static constructor can be called eagerly if deemed better by the runtime for some reason.

There is also an internal static readonly `.data` field generated into `<PrivateImplementationDetails>` containing the actual bytes,
similar to [u8 string literals][u8-literals] and [constant array initializers][constant-array-init].
This field uses hex-encoded SHA-256 hash for its name and collisions are currently not reported by the compiler.
These other scenarios might also reuse the data field, e.g., the following statements could all reuse the same data field:

```cs
Expand Down Expand Up @@ -241,6 +242,9 @@ This fixup phase already exists in the compiler in `MetadataWriter.WriteInstruct
It is called from `SerializeMethodBodies` which precedes `PopulateSystemTables` call,
hence synthesizing the utf8 string classes in the former should be possible and they would be emitted in the latter.

Alternatively, we could collect string literals during binding, then before emit sort them by length and content (for determinism)
to find the ones that are over the threshold and should be emitted with this new strategy.

### Statistics

The compiler could emit an info diagnostic with useful statistics for customers to determine what threshold to set.
Expand Down Expand Up @@ -302,6 +306,18 @@ The compiler should report a diagnostic when the feature is enabled together wit
`[assembly: System.Runtime.CompilerServices.CompilationRelaxations(0)]`, i.e., string interning enabled,
because that is incompatible with the feature.

### Avoiding hash collisions

Instead of XXH128 for the type names and SHA-256 for the data field names, we could use index-based names.
- The compiler could assign names lazily based on metadata tokens which are deterministic.
If building on the current approach, that might require some refactoring,
because internal data structures in the compiler might not be ready for lazy names like that.
But it would be easier if combined with the first strategy suggested for [automatic threshold](#automatic-threshold) above,
where we would not synthesize the types until very late in the emit phase (during fixup of the metadata tokens).
- We could build on the second strategy suggested for [automatic threshold](#automatic-threshold) where we would collect string literals during binding
(and perhaps also constant arrays and u8 strings if we want to extend this support to them as well),
then before emit we would sort them by length and content and assign indices to them to be then used for the synthesized names.

<!-- links -->
[u8-literals]: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-11.0/utf8-string-literals
[constant-array-init]: https://github.com/dotnet/roslyn/pull/24621
Expand Down
3 changes: 3 additions & 0 deletions src/Compilers/CSharp/Portable/CSharpResources.resx
Original file line number Diff line number Diff line change
Expand Up @@ -8047,4 +8047,7 @@ To remove the warning, you can use /reference instead (set the Embed Interop Typ
<data name="WRN_InterceptsLocationAttributeUnsupportedSignature_Title" xml:space="preserve">
<value>'InterceptsLocationAttribute(string, int, int)' is not supported. Move to 'InterceptableLocation'-based generation of these attributes instead. (https://github.com/dotnet/roslyn/issues/72133)</value>
</data>
<data name="ERR_DataSectionStringLiteralHashCollision" xml:space="preserve">
<value>Cannot emit this string literal into the data section because it has XXHash128 collision with another string literal: {0}</value>
</data>
</root>
2 changes: 2 additions & 0 deletions src/Compilers/CSharp/Portable/Errors/ErrorCode.cs
Original file line number Diff line number Diff line change
Expand Up @@ -2360,6 +2360,8 @@ internal enum ErrorCode
ERR_ImplicitlyTypedParamsParameter = 9272,
ERR_VariableDeclarationNamedField = 9273,

ERR_DataSectionStringLiteralHashCollision = 9274,

// Note: you will need to do the following after adding errors:
// 1) Update ErrorFacts.IsBuildOnlyDiagnostic (src/Compilers/CSharp/Portable/Errors/ErrorFacts.cs)

Expand Down
4 changes: 3 additions & 1 deletion src/Compilers/CSharp/Portable/Errors/ErrorFacts.cs
Original file line number Diff line number Diff line change
Expand Up @@ -637,7 +637,9 @@ or ErrorCode.ERR_InterceptableMethodMustBeOrdinary
or ErrorCode.ERR_PossibleAsyncIteratorWithoutYield
or ErrorCode.ERR_PossibleAsyncIteratorWithoutYieldOrAwait
or ErrorCode.ERR_RefLocalAcrossAwait
// Update src\EditorFeatures\CSharp\LanguageServer\CSharpLspBuildOnlyDiagnostics.cs
or ErrorCode.ERR_DataSectionStringLiteralHashCollision
// Update src\Features\CSharp\Portable\Diagnostics\LanguageServer\CSharpLspBuildOnlyDiagnostics.cs
// and TestIsBuildOnlyDiagnostic in src\Compilers\CSharp\Test\Syntax\Diagnostics\DiagnosticTest.cs
// whenever new values are added here.
=> true,

Expand Down
1 change: 1 addition & 0 deletions src/Compilers/CSharp/Portable/Errors/MessageProvider.cs
Original file line number Diff line number Diff line change
Expand Up @@ -250,6 +250,7 @@ public override void ReportDuplicateMetadataReferenceWeak(DiagnosticBag diagnost
public override int ERR_EncUpdateFailedMissingSymbol => (int)ErrorCode.ERR_EncUpdateFailedMissingSymbol;
public override int ERR_InvalidDebugInfo => (int)ErrorCode.ERR_InvalidDebugInfo;
public override int ERR_FunctionPointerTypesInAttributeNotSupported => (int)ErrorCode.ERR_FunctionPointerTypesInAttributeNotSupported;
public override int ERR_DataSectionStringLiteralHashCollision => (int)ErrorCode.ERR_DataSectionStringLiteralHashCollision;

// Generators:
public override int WRN_GeneratorFailedDuringInitialization => (int)ErrorCode.WRN_GeneratorFailedDuringInitialization;
Expand Down
5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.cs.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.de.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.es.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.fr.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.it.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.ja.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.ko.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.pl.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.pt-BR.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.ru.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.tr.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.zh-Hans.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions src/Compilers/CSharp/Portable/xlf/CSharpResources.zh-Hant.xlf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

21 changes: 21 additions & 0 deletions src/Compilers/CSharp/Test/Emit/Emit/EmitMetadataTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -3175,6 +3175,27 @@ .maxstack 1
""");
}

[Fact]
public void DataSectionStringLiterals_HashCollision()
{
var emitOptions = new EmitOptions
{
// Take only the first byte of each string as its hash to simulate collisions.
TestOnly_DataToHexViaXxHash128 = static (data) => data[0].ToString(),
};
var source = """
System.Console.Write("a");
System.Console.Write("b");
System.Console.Write("aa");
""";
CreateCompilation(source,
parseOptions: TestOptions.Regular.WithFeature("experimental-data-section-string-literals", "0"))
.VerifyEmitDiagnostics(emitOptions,
// (3,22): error CS9274: Cannot emit this string literal into the data section because it has XXHash128 collision with another string literal: a
// System.Console.Write("aa");
Diagnostic(ErrorCode.ERR_DataSectionStringLiteralHashCollision, @"""aa""").WithArguments("a").WithLocation(3, 22));
}

[Fact]
public void DataSectionStringLiterals_SynthesizedTypes()
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2991,6 +2991,7 @@ public void TestIsBuildOnlyDiagnostic()
case ErrorCode.ERR_PossibleAsyncIteratorWithoutYield:
case ErrorCode.ERR_PossibleAsyncIteratorWithoutYieldOrAwait:
case ErrorCode.ERR_RefLocalAcrossAwait:
case ErrorCode.ERR_DataSectionStringLiteralHashCollision:
Assert.True(isBuildOnly, $"Check failed for ErrorCode.{errorCode}");
break;

Expand Down
3 changes: 2 additions & 1 deletion src/Compilers/Core/CodeAnalysisTest/Emit/EmitOptionsTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,8 @@ public void TestFieldsForEqualsAndGetHashCode()
nameof(EmitOptions.IncludePrivateMembers),
nameof(EmitOptions.InstrumentationKinds),
nameof(EmitOptions.DefaultSourceFileEncoding),
nameof(EmitOptions.FallbackSourceFileEncoding));
nameof(EmitOptions.FallbackSourceFileEncoding),
nameof(EmitOptions.TestOnly_DataToHexViaXxHash128));
}

[Fact]
Expand Down
Loading

0 comments on commit 19c9b9e

Please sign in to comment.