SHA objects #96

Open
simonbyrne opened this issue Dec 29, 2023 · 7 comments

Comments

@simonbyrne

It would be helpful if for each hash there was an object representing a hash (e.g. SHA1, SHA256 etc), similar to UUID in Base.

@staticfloat

@simonbyrne
Author

@inkydragon
Collaborator

inkydragon commented Jan 2, 2024

A quick proof-of-concept implementation:

If we really need this, I'd like to add it to Base and have SHA, MD5, CRC32, GitHash... all reuse this code.

hash_obj.jl

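# SPDX-License-Identifier: MIT
abstract type AbstractHash end

"""
    HashBytes{N}

A hash object identifier, stored as a string of `N` bytes.
"""
struct HashBytes{N} <: AbstractHash
    val::NTuple{N, UInt8}
    # Inner constructor: infer `N` from the length of the tuple.
    HashBytes(val::NTuple{N, UInt8}) where {N} = new{N}(val)
end

# All-zero hash of length `N`.
HashBytes{N}() where {N} = HashBytes(ntuple(_ -> zero(UInt8), N))
HashBytes(h::HashBytes) = h
function HashBytes{N}(u8::Vector{UInt8}) where {N}
    # Throw instead of `@assert`, since assertions may be disabled.
    N == length(u8) ||
        throw(ArgumentError("hash length mismatch: expected $N bytes, got $(length(u8))"))
    return HashBytes(ntuple(idx -> u8[idx], N))
end
# Parsing from a hex string is left unimplemented in this proof of concept.
HashBytes(s::AbstractString) = error("not implemented")


# Print the bytes as one lowercase hex string, e.g. HashBytes{20}("da39...").
function Base.show(io::IO, hash_bytes::HashBytes{N}) where {N}
    hash = bytes2hex(collect(UInt8, hash_bytes.val))
    print(io, "HashBytes{$N}($(repr(hash)))")
end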



# ==== Generate Hash Type Definitions for All SHA Types
# Examples:
#   const Sha1Hash = HashBytes{20}
#   const Sha3_512Hash = HashBytes{64}
using SHA
for (sha_prefix, sha_type) in [(:Sha1, :SHA1_CTX),
                 (:Sha224, :SHA224_CTX),
                 (:Sha256, :SHA256_CTX),
                 (:Sha384, :SHA384_CTX),
                 (:Sha512, :SHA512_CTX),
                 (:Sha2_224, :SHA2_224_CTX),
                 (:Sha2_256, :SHA2_256_CTX),
                 (:Sha2_384, :SHA2_384_CTX),
                 (:Sha2_512, :SHA2_512_CTX),
                 (:Sha3_224, :SHA3_224_CTX),
                 (:Sha3_256, :SHA3_256_CTX),
                 (:Sha3_384, :SHA3_384_CTX),
                 (:Sha3_512, :SHA3_512_CTX)]
    hashsha_type = Symbol(sha_prefix, :Hash)
    # Look up the digest length here, outside `@eval`, so no helper global is left behind.
    hashtype_len = SHA.digestlen(getfield(SHA, sha_type))
    @eval const $(hashsha_type) = HashBytes{$hashtype_len}
end


# ---- examples:
Sha1Hash(sha1(""))
Sha3_256Hash(sha3_256(""))
Sha3_512Hash(sha3_512(""))

Example output:

julia> Sha1Hash(sha1(""))
HashBytes{20}("da39a3ee5e6b4b0d3255bfef95601890afd80709")

julia> Sha3_256Hash(sha3_256(""))
HashBytes{32}("a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a")

julia> Sha3_512Hash(sha3_512(""))
HashBytes{64}("a69f73cca23a9ac5c8b567dc185a756e97c982164fe25859e0d1dcc1475c80a615b2123af1f5f94c11e3e9402c3ac558f500199d95b6d3e301758586281dcd26")

@staticfloat
Member

It would be helpful if for each hash there was an object representing a hash (e.g. SHA1, SHA256 etc), similar to UUID

Can you explain a bit more about what you want and why it would be useful? I have heard strong arguments both for hashes being objects, and for hash contexts being objects, but the hashes themselves being just arrays of bytes. I'd like to hear your argument for why it's better that they are their own objects.

If it's just for dispatch, I think a higher-level package like AbstractHashing or something similar may be a better fit for these kinds of concerns. I myself wanted something that lives at a higher level than SHA.jl (and can work with MD5 and whatnot), so I wrote this mini package to make dealing with different hashes easier. You can then constrain things to only take a certain hash type via snippets like this.

@simonbyrne
Author

I think the main advantages are:

  1. It adds semantic information: I know the bytes correspond to the output of a specific hash. This is helpful when variables are just named commit_hash.
  2. It can be a bitstype, which has some performance advantages, and makes it easier for C interop.
  3. There are many cases where hashes are passed as hexadecimal strings: having dedicated hash objects makes the conversions easier, e.g.
    • the various registry packages (Registrator.jl, RegistryTools.jl, RegistryCI.jl) all represent hashes as strings, as that is how they are represented in the TOML files.
    • interfacing with git: it's helpful to be able to do
      hash = SHA1("...")
      run(`git checkout $hash`)
      and have it work as expected
  4. It appears to be what people do anyway, but we end up with the same thing defined in multiple places: I didn't realize Base had an SHA1 type, but LibGit2 should use this rather than define its own GitHash type. Similarly, GitHub.jl should use this instead of String, etc.
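
To make points 2 and 3 concrete, here is a minimal sketch. `Digest` is a hypothetical stand-in, not a type from SHA.jl or Base, and the sketch assumes that command interpolation falls back to `string(x)` for non-iterable values:

```julia
# Hypothetical stand-in for the hash object discussed above; not part of SHA.jl or Base.
struct Digest{N}
    bytes::NTuple{N, UInt8}
end

# Parse a hexadecimal string; the byte length N is taken from the input.
function Digest(hex::AbstractString)
    raw = hex2bytes(String(hex))    # throws ArgumentError on malformed hex
    return Digest{length(raw)}(Tuple(raw))
end

# `print` emits bare hex, so `string(d)` and command interpolation both produce it.
Base.print(io::IO, d::Digest) = print(io, bytes2hex(collect(UInt8, d.bytes)))

# `show` prints a form that can be pasted back into the REPL.
Base.show(io::IO, d::Digest) = print(io, "Digest(", repr(string(d)), ")")

commit_hash = Digest("da39a3ee5e6b4b0d3255bfef95601890afd80709")
cmd = `git checkout $commit_hash`   # builds the command without running it
```

With this layout `isbitstype(typeof(commit_hash))` is `true`, which is the property that helps with C interop, and the hex string only appears at the boundaries (parsing and printing).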

@staticfloat
Member

Yes, so my main point would be that we probably want an AbstractHashType that is more than just SHA hashes, and then we have two options for implementation:

Bottom-up; define AbstractHashType in some bare-bones package, then packages like MD5.jl can define their types as inheriting from the abstract type, and get all the goodness defined in the abstract package's generic methods.

Top-down; create an AbstractHashes package that imports SHA.jl, MD5.jl, and every other hash type, then defines the shared functionality right there in terms of the things it has imported.

I think the bottom-up organization is better, but I don't think we want AbstractHashType to be tied to Julia releases as a stdlib. So perhaps the best way forward is to have a kind of middle ground, where AbstractHashes.jl is meant to be a bottom-up package, but it includes functionality for SHA.jl since it knows that will always be a part of your environment?
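
A rough sketch of what the bottom-up layout could look like. Everything here is hypothetical (`rawbytes`, `MD5Hash`, `SHA256Hash`, and `check_tree` are illustrative names, not existing APIs), and the three parts are shown in one file only for brevity:

```julia
# ==== Bare-bones abstract package (all names hypothetical)
abstract type AbstractHashType end

# Interface function that each concrete hash type implements.
function rawbytes end

# Generic method written once against the abstract type.
Base.print(io::IO, h::AbstractHashType) =
    print(io, bytes2hex(collect(UInt8, rawbytes(h))))

# ==== What a package such as MD5.jl would add
struct MD5Hash <: AbstractHashType
    bytes::NTuple{16, UInt8}
end
rawbytes(h::MD5Hash) = h.bytes

# ==== Adapter the abstract package could ship for SHA.jl
struct SHA256Hash <: AbstractHashType
    bytes::NTuple{32, UInt8}
end
rawbytes(h::SHA256Hash) = h.bytes

# Dispatch can now constrain an argument to one specific hash type.
check_tree(expected::SHA256Hash, actual::SHA256Hash) = expected == actual
```

Passing an `MD5Hash` to `check_tree` raises a `MethodError`, which is the kind of constraint discussed earlier in the thread.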

@simonbyrne
Author

simonbyrne commented Jan 4, 2024

Possibly? This is complicated somewhat by the fact that SHA1 is defined in Base: ideally that would use the same machinery, otherwise we end up with multiple implementations again.

What if we add AbstractHashType and SHA1Hash (and make Base.SHA1 an alias) in Base, adding them to Compat.jl for existing releases, and add the remaining hash objects here?
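
For reference, the existing type in Base can be exercised roughly like this. It is unexported, and its internal representation may differ between Julia versions, so treat this as a sketch:

```julia
hex = "da39a3ee5e6b4b0d3255bfef95601890afd80709"

h = Base.SHA1(hex)                         # parse from a 40-character hex string
from_bytes = Base.SHA1(hex2bytes(hex))     # or build from 20 raw bytes

same = (h == from_bytes)                   # both routes give an equal value
printed = string(h)                        # printing yields the bare hex digits
```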

@staticfloat
Member

What if we add AbstractHashType and SHA1Hash (and make Base.SHA1 an alias) in Base, adding them to Compat.jl for existing releases, and add the remaining hash objects here?

The downside to this is that it's then only available in Julia v1.12+, and if we want to change something about how hash functions work, we have to wait for a new Julia version. I think it's actually better to have an AbstractHash.jl that just implements whatever adapters are needed for the SHA that happens to be shipped with Julia, and then maybe has package extensions for MD5 and other hash types. Truly the only reason SHA is a stdlib is because Pkg needs to be able to hash things to verify their contents; we should not introduce more code into the stdlib if at all possible.
