Consolidate/cleanup hash functions #10090

niloc132 · 2025-01-29T13:25:59Z

(Previous copy of this was eaten by the github issue management changeover, I'll probably continue to edit this and save it rather than wait until it is complete before submitting)

As part of #10029 the Util class is getting cleaned up - one of the functions of that class is to create "strong names" for various byte arrays throughout the codebase. This made me take a closer look at how hashing is used in the compiler and the server.

I'll be leaving the server and i18n alone for now to avoid risking changing any custom implementations or already generated constant keys. The gwt-rpc xsrf protection uses Md5Utils, and i18n uses MD5KeyGenerator. I am advocating removing some usages of Md5Utils and maybe moving/renaming that class, but for now will leave these usages alone.

There appear to be two hash algorithms used in GWT - md5 and sha1. None are used for true cryptographic purposes, and in all cases the developer/admin owns the content that would be hashed by them, or plaintext isn't available to a hypothetical attacker to find a collision. In all but one case, the JDK's own MessageDigest type is used, but one case uses Guava's hash functions.

Guava's documentation includes a summary of the provided hashing functions, generic enough to be useful here:

SHA1 provides more bits which in theory will give better guarantees about collisions, but in practice the CompilerVersion use of sha1 will hash the gwt-dev.jar itself, then rehash that hash (with other inputs) into an md5. The unitCache files are also named with the sha1 of the compiler jar. The second usage of sha1 is hashing OperationKeys for requestfactory type tokens (both in APT and generators). The third is ModuleDef.java, hashing the filenames of all sources - but then only uses 32 of the 160 bits. As far as I can tell, in none of those cases are those extra bits necessary/helpful, there are unlikely to be collisions either way - I'm unclear why md5 wasn't considered sufficient - sha1 hasn't been considered secure since 2005.

Changing RequestFactory annotation processing carries some risk of breaking compatibility for projects that update client or server but not both, but I believe that is considered to be an invalid configuration anyway?

Finally there are some oddities with how these are used:

In CompilationStateBuilder, ImageResourceGenerator, entire files are read fully into memory so they can be hashed - instead, they can be hashed as they are read (and buffers discarded as we go). In CompilationUnitBuilder, the contents are at least not discarded afterwards, but could be hashed as they are read rather than doing a second pass (...before a third+ pass over the string, its char[]).
Util.computeStrongName prefixes each byte[] with its length, so that multiple arrays can be hashed together. Without this, hash("foo" + "bar") has the same value as hash("fooba" + "r"), risking a collision if adjacent keys can exchange a few bytes of content resulting in meaningful differences - clearly this is resolved by instead by prefixing each string with its length: hash("3foo" + "3bar") does not equal hash("5fooba" + "1r"). Obviously avoiding all collisions is technically impossible, but this is an easy class to avoid. Two obvious violators of this are i18n keys and module paths - the former cannot be changed without breaking existing properties files, but the latter we can fix.

Summary of requirements:

Don't change anything explicitly user facing
Hash function should only be used on trusted data, will not be used for cryptographic purposes
Move hashing out of Util, into more local tools (specific to generators/linkers/compiler rather than a single do-it-all class which all other parts of the classpath must depend on), probably dependent on Guava
Preserve (or introduce where possible) use of delimiters or length prefix to avoid collisions from concatenation.
Offer streaming tooling, rather than requiring all resources be entirely in memory to be hashed.

Given our non-cryptographic usage (on only trusted data), the deprecation status of the current hash functions, and the , I'm proposing a few improvements:

Use murmur128 in place of both md5 and sha1, for a small performance bump and to get a non-deprecated implementation from Guava.
With Deprecate com.google.gwt.dev.util.Util for removal #10030, deprecate Util.computeStrongName(byte[]), providing a reasonably local implementation for various usages - for many cases, Hashing.murmur3_128().hashBytes(bytes).toString().toUpperCase(Locale.ROOT) is a complete drop-in replacement for Util.computeStrongName(bytes) that even preserves the output casing expectation.
Calls that provide byte[][] (only PermutationResultImpl's constructor) can do their own length prefixing. Other calls that hash multiple strings should likewise use a delimiter.
InputStreams to be hashed should either be funneled to a hasher if contents are discarded, or read with a HashingInputStream if contents are consumed, to avoid an extra pass at the data.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate/cleanup hash functions #10090

Consolidate/cleanup hash functions #10090

niloc132 commented Jan 29, 2025 •

edited

Loading

Consolidate/cleanup hash functions #10090

Consolidate/cleanup hash functions #10090

Comments

niloc132 commented Jan 29, 2025 • edited Loading

niloc132 commented Jan 29, 2025 •

edited

Loading