5 bits for exponent? #21
I'd be happy if you want to add it, and I'm happy to assist implementing it. In principle there are only two things needed.
Having said that, is there any more information on how they define Float8 on the H100? As I currently don't see any other use for it, I'd be happy to stick to Nvidia's specs. For example, they may not have NaNs and may redefine what an all-ones exponent means, and similarly for subnormals. I'm asking because I've seen a bunch of very low-precision floats that redefine the floating-point standard somewhat. Not that Float8 was ever standardised, but it is the logical extension down to 8 bits. |
The 5 exp bit version is IEEE compliant. The 4 exp bit version has no infs and only one NaN; see https://arxiv.org/pdf/2209.05433.pdf. It might make more sense to be consistent with whatever you've already done for the 4 and 3 exp bit versions you already have. |
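To make the difference between the two formats concrete, here is a small sketch (in Python, purely for illustration; the function names are hypothetical and not part of Float8s.jl) decoding one byte of each format following the rules in the paper linked above: E5M2 uses the usual IEEE conventions (all-ones exponent encodes Inf/NaN), while E4M3 has no infinities and reserves only the all-ones exponent with all-ones mantissa for NaN:

```python
import math

def decode_e5m2(byte):
    """Decode one E5M2 byte (1 sign, 5 exponent, 2 mantissa bits), IEEE-style."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> 2) & 0x1F          # 5 exponent bits, bias 15
    mant = byte & 0x03                # 2 mantissa bits
    if exp == 0x1F:                   # all-ones exponent: Inf or NaN, as in IEEE formats
        return sign * math.inf if mant == 0 else math.nan
    if exp == 0:                      # subnormals: no implicit leading 1
        return sign * (mant / 4.0) * 2.0 ** (1 - 15)
    return sign * (1.0 + mant / 4.0) * 2.0 ** (exp - 15)

def decode_e4m3(byte):
    """Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits), per the paper:
    no infinities; only S.1111.111 is NaN, the rest of the top binade is finite."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> 3) & 0x0F          # 4 exponent bits, bias 7
    mant = byte & 0x07                # 3 mantissa bits
    if exp == 0x0F and mant == 0x07:  # the single special value
        return math.nan
    if exp == 0:                      # subnormals
        return sign * (mant / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```

With these rules, the largest finite E5M2 value is 57344 and the largest finite E4M3 value is 448, matching the paper.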
I started looking into the "easy" lookup table (your first bullet above) to convert Float8 to Float32, and I'm wondering how the two existing tables for 3 and 4 exp bits were created. Can they be generated programmatically? If so, it might be simplest, for me at least, to modify that code for 5 exp bits. |
Given that we may want 3 different Float8 formats in this package, it might be worth not hardcoding the tables as we did before but creating them dynamically. It's more readable, more reproducible, and serves as some form of documentation. (As I seemingly cannot remember how we created these tables some years ago!!!) |
I generated them. An advantage of hardcoding is that it allows relative timing of compound calculations to be more consistent than would otherwise be the case. This assists investigating the impact of providing certain other ops hardcoded (in hardware) or by e.g. multiple polynomial approximations. |
How about hardcoding them with metaprogramming? |
It is easier to use some subsidiary functions with a few outer "get it together" functions. |
Sorry, maybe to be a bit clearer: what I thought was to do something like abstract type AbstractFloat8 <: AbstractFloat end
primitive type Float8_e3m4 <: AbstractFloat8 8 end # 3 exponent bits, IEEE compliant
primitive type Float8_e4m3 <: AbstractFloat8 8 end # 4 exponent bits, like Nvidia's E4M3
primitive type Float8_e5m2 <: AbstractFloat8 8 end # 5 exponent bits, IEEE compliant
# define one of the above as default
Float8 = ...
function representable_float8s(::Type{T}) where {T<:AbstractFloat8}
ne = Base.exponent_bits(T)
nm = Base.significand_bits(T)
...
# call normals, subnormals and concatenate accordingly
if T == Float8_e3m4 # somehow distinguish between the different formats (or use dispatch for that)
all_float8s = cat(subnormals,normals,...)
...
end
return all_float8s
end
# the following is then executed on every using/import
const float8_e3m4_to_float32 = representable_float8s(Float8_e3m4)
const float8_e4m3_to_float32 = representable_float8s(Float8_e4m3)
const ...
# and conversion defined as
Base.Float32(x::Float8_e3m4) = @inbounds float8_e3m4_to_float32[reinterpret(UInt8,x) + 0x01]
... I find
|
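The dynamic table generation sketched in the comment above can be checked independently; the following is a Python sketch (the function name float8_table is hypothetical and not part of Float8s.jl) that builds the 256-entry code-to-Float32 table for any exponent width and bias, assuming IEEE-style specials as in the e3m4 and e5m2 formats:

```python
import math

def float8_table(exp_bits, bias):
    """Build the 256-entry code -> value lookup table for an 8-bit float with
    the given exponent width and bias; the remaining 7 - exp_bits bits are the
    mantissa. Assumes IEEE-style specials (all-ones exponent is Inf/NaN);
    a non-IEEE format like E4M3 would need a different special-value branch."""
    mant_bits = 7 - exp_bits
    table = []
    for code in range(256):
        sign = -1.0 if code >> 7 else 1.0
        exp = (code >> mant_bits) & ((1 << exp_bits) - 1)
        mant = code & ((1 << mant_bits) - 1)
        frac = mant / (1 << mant_bits)
        if exp == (1 << exp_bits) - 1:                     # Inf / NaN binade
            table.append(sign * math.inf if mant == 0 else math.nan)
        elif exp == 0:                                     # subnormals
            table.append(sign * frac * 2.0 ** (1 - bias))
        else:                                              # normals
            table.append(sign * (1.0 + frac) * 2.0 ** (exp - bias))
    return table
```

Indexing the resulting list with the raw byte (plus 1 in Julia's 1-based indexing) gives the conversion, exactly as in the Base.Float32 definition sketched above.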
FYI, papers are using this way of indicating FP8 exponent and significand bits (they should be using ...). And there are still specifiable params. |
MicroFloatingPoints.jl uses parametric types to signify the bit partitioning: Floatmu{4,3} for example is an 8-bit float with 4 exp bits and 3 fraction bits. Everything is IEEE compliant there. Perhaps we could parameterize the exp bias and NaN/Inf count similarly: maybe Float8{E,B,N,I}, where E is the no. of exp bits, B the bias, N the no. of NaNs, and I the no. of Infs. |
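The Float8{E,B,...} idea can be prototyped outside the type system first; this Python sketch (class and field names are hypothetical, and only the E and B parameters are modeled, with IEEE-style specials hardcoded) shows how a single decoder parameterized by exponent width and bias covers both Floatmu{4,3} (bias 7) and an E5M2-style format (bias 15):

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FP8Spec:
    """Hypothetical stand-in for Float8{E,B,...}: E = exp_bits, B = bias.
    NaN/Inf counts (N, I) could be added as further fields; here the
    IEEE-style convention (all-ones exponent is Inf/NaN) is hardcoded."""
    exp_bits: int
    bias: int

    def decode(self, code):
        mant_bits = 7 - self.exp_bits
        sign = -1.0 if code >> 7 else 1.0
        exp = (code >> mant_bits) & ((1 << self.exp_bits) - 1)
        mant = code & ((1 << mant_bits) - 1)
        frac = mant / (1 << mant_bits)
        if exp == (1 << self.exp_bits) - 1:        # Inf / NaN binade
            return sign * math.inf if mant == 0 else math.nan
        if exp == 0:                               # subnormals
            return sign * frac * 2.0 ** (1 - self.bias)
        return sign * (1.0 + frac) * 2.0 ** (exp - self.bias)

floatmu43 = FP8Spec(exp_bits=4, bias=7)    # like Floatmu{4,3}
e5m2 = FP8Spec(exp_bits=5, bias=15)        # like the proposed Float8_e5m2
```

The design question raised below still applies: each extra parameter (bias, NaN count, Inf count, zero signedness) multiplies the cases to get right, so starting with E alone and IEEE defaults keeps the surface small.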
It is cleaner and clearer to reduce the parameter count, at least at first. Sometimes 1 (unsigned) zero is proper; other times (Complex{Float8}) 2 signed zeros are proper. Maybe all are signed floats with two infs. Although this drops some parametric flexibility, it would be easier to get right, and then introducing more flexibility could proceed with care. I ran into this design dilemma with early versions of DoubleFloats.jl: pushing more params is not the best first way. For the more blue-sky-numeratives: I would like to experiment with a signed Huge (a nonspecific finite value that exceeds the exact finite value floatmax(T) and, where a mathematical perspective is needed, is considered much greater than that) and a signed Tiny (a nonspecific finite value that is nonzero and, where a mathematical perspective is needed, is considered much less than the exact finite value nextfloat(zero(T))). So as we augment parameters, a way to indulge this is sought. |
The new H100 from Nvidia has 8-bit floats in two flavors: 4 bits for the exponent, like Float8s.jl's Float8_4, and 5 bits. Scroll down to "NVIDIA Hopper FP8 data format" here: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
Have you considered adding this type to Float8s.jl? Currently I'm using https://github.com/goualard-f/MicroFloatingPoints.jl to simulate, to see whether that many exponent bits is better (than 4 or 3), and it is painfully slow.