`exp` and `log` are much slower than they should be. #135

oscardssmith · 2022-01-20T07:21:55Z

both are around 100x slower than float64. exp in particular should be able to be done relatively quickly since exp(hi+lo)=exp(hi)exp(lo), and since hi can easily be reduced to [1,2), exp(lo) only needs 2 terms.

I have fewer ideas for log, but there should be something better than the current behavior.

also, it's worth figuring on double64 versions of these functions because the double32 versions can be done really quickly using float64.

The text was updated successfully, but these errors were encountered:

JeffreySarnoff · 2022-01-20T07:45:47Z

Yes .. the Double32 type was an early artifact of hoping to redouble Double64s to compute Double128 accurately. At the time, Julia did not provide the mechanism for that sort involutve constructor to just work [or to work at all]. A next large reorganization of DoubleFloats just drops Double32, and just as you say .. recommends the use of Float64 has the doubled Float32 precision.

I had not worked with exp and log in quite a while. The low performance is unacceptable .. and was not always the case ... thank you for bringing this to light.

JeffreySarnoff · 2022-01-20T08:07:23Z

There are at least a few approaches to implementing exp and log. One gives each their own algorithmic specialization and whatever specific computational advantages are to be had. One works hard to speed exp for its own sake and uses it within log to refine the result. At some point I replaced one of those approaches with the other -- and the accuracy suffered some. To undo the revisions exactly. reduplicating the earlier implementations, was an ask too murky for me at the time (my comfort with GitHub has increased slowly).

oscardssmith · 2022-01-20T14:38:12Z

How much error are you willing to accept for an exp implementation?

JeffreySarnoff · 2022-01-20T14:56:08Z

If you sample deeply through all of the arithmetic and each elementary function, and then do the same with other implementations of double-double, we get at least as many and often more accurate bits in our results.
exp is or may be used in constructing other functions that are defined in terms of exp and other things. So, there is less leeway with the arithmetic ops and exp and some others. I like that we do Float106 in a way that better protects the accuracy of the lower order bits [not always the lowest order bit].

oscardssmith · 2022-01-20T15:06:22Z

How would you feel about 4 ulps (or so) in the low part of the output? I think that will be roughly the threshold that is achievable without significant usage of triple compensated arithmetic.

JeffreySarnoff · 2022-01-20T15:47:03Z

I need to find the version of that number I had decided upon some time ago. I do recall there being some threshold of cooperation throughout the functional realizations. OTOH, rather than go full triple-double inside exp, using triple-doubles for the first two-sh coeffs may be the medium.

oscardssmith · 2022-01-20T16:05:32Z

How does this look? It's still missing the overflow/underflow checks, but I'm seeing this as benching roughly 10x faster than the current implementation, and I think the error should be acceptable (I haven't fully tested it though)

using DoubleFloats, ProfileView, Plots
setprecision(BigFloat, 2048)

@inline function two_sum(a, b)
    hi = a + b
    a1 = hi - b
    lo = (a - a1) + (b - (hi - a1))
    return hi, lo
end

@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)
        hi, err2 = two_sum(pi,prod)
        lo = fma(lo, x, err1 + err2)
    end
    return hi, lo
end

const coefs = Tuple(log(big(2))^n/factorial(big(n)) for n in 1:10)
const coefs_hi =  Float64.(coefs)
const coefs_lo = Float64.(coefs .- coefs_hi)

function exp_kernel(x::Float64)
    hi, lo = exthorner(x, coefs_hi)
    lo2 = evalpoly(x, coefs_lo)
    #lo2 = Float64(evalpoly(x, coefs.-coefs_hi))
    hix = hi*x
    lo_all = fma(lo, x, fma(lo2, x, fma(hi, x, -hix)))
    return Double64(hix, lo_all)
end

function make_table(size, n=1)
    t_array = zeros(Double64, 16);
    for j in 1:size
        val = 2.0^(BigFloat(j-1)/(16*n))
        t_array[j] = val
    end
    return Tuple(t_array)
end
const T1 = make_table(16)
const T2 = make_table(16, 16)

function double_exp2(x::Float64)
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = exp_kernel(muladd((-1/256), N, x))
    return r * j1 * j2 * exp2(k)
end

function double_exp2(a::Double64)
    x = a.hi
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = fma((-1/256), N, x)
    poly = exp_kernel(r)
    e2k = exp2(k)
    poly_lo = exp_kernel(a.lo)
    lo_part = fma(poly, poly_lo, poly_lo) + poly
    ans = fma(j1*j2, lo_part, j1*j2)
    return e2k*ans
end

Edit: This still has a bug somewhere that is causing it to not give nearly as many sig figs as it should.

JeffreySarnoff · 2022-01-20T16:40:47Z

It looks good, much more useful than was available an hour ago!
I need to write a harness of more focused test regions.

oscardssmith · 2022-01-21T00:27:24Z

I just updated the code. I'm seeing about half the error of the current implementation with 10x speed improvement. Now I just need to make a PR and add back the overflow/underflow checks.

JeffreySarnoff · 2022-01-21T01:01:34Z

super good, smiling bits

…

On Thu, Jan 20, 2022 at 7:27 PM Oscar Smith ***@***.***> wrote: I just updated the code. I'm seeing about half the error of the current implementation with 10x speed improvement. Now I just need to make a PR and add back the overflow/underflow checks. — Reply to this email directly, view it on GitHub <#135 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAM2VRVYFUUC4JUCHYDVUODUXCR7NANCNFSM5ML6RYPQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.Message ID: ***@***.***>

JeffreySarnoff · 2022-01-27T02:04:10Z

let me know when you want to merge something

oscardssmith · 2022-01-27T02:17:34Z

Can you merge https://github.com/oscardssmith/DoubleFloats.jl/tree/oscardssmith-faster-square? it's a smaller change, but should be ready to merge. this PR needs a small amount of extra work to avoid an accuracy regression (unless you are OK with a minor accuracy regression).

JeffreySarnoff · 2022-01-28T04:52:46Z

merged the faster-square
strongly prefer maintaining accuracy
is the regression confined to some lesser used region?
Please give me a few examples that regress accuracy.

JeffreySarnoff · 2022-05-04T19:36:58Z

I just ran some rand * scale vectors of Double64 and Float128 through exp and log.
Double64 outperforms Float128 by 2x .. 3.5x, that is rather impressive, considering
all the person-years of development and all the more person-years of use and reporting that libquadmath has going for it.

@oscardssmith is there some corner of the domain that you found troublesome, or did merging your earlier improvement work quite well?

oscardssmith · 2022-05-04T21:00:55Z

what do you mean?

JeffreySarnoff · 2022-05-04T21:32:59Z

The title on this issue seems outdated. Before changing it, I am confirming with you that the performance of exp and log have improved over these past months. It appears they have.

oscardssmith · 2022-05-04T22:04:13Z

how have they changed? the source likes identical

JeffreySarnoff · 2022-05-04T22:49:34Z

Limited benchmarking does not report slow performance relative to Float128,
allowing some proportional time adjustment for the 7 extra sigbits does not appear to matter.

I used 1024 rands, multiplied them by 2, subtracted 1 to have both positive and negative values.
These signed floats were scaled, given them these extrema (34.8, -34.9).
A second test vector was the first one translated for log, the second vector's extrema: (0.17, 69.9).

julia> (@btime map(exp,xs) setup=(xs=float128));
  1.513 ms (1 allocation: 16.12 KiB)

julia> (@btime map(exp,xs) setup=(xs=double64));
  326.600 μs (1 allocation: 16.12 KiB)

julia> (@btime map(log,xs) setup=(xs=float128_nonneg));
  665.800 μs (1 allocation: 16.12 KiB)

julia> (@btime map(log,xs) setup=(xs=double64_nonneg));
  388.900 μs (1 allocation: 16.12 KiB)

oscardssmith · 2022-05-04T23:28:12Z

yeah, it's not slow compared to float128, but that still is roughly 100x slower than Float64.

photor · 2024-03-17T15:02:29Z

yeah, it's not slow compared to float128, but that still is roughly 100x slower than Float64.

Any progress since then?

oscardssmith · 2024-03-17T16:24:14Z

no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3.

photor · 2024-03-18T02:58:35Z

no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3.

Thanks, I was not aware of the package multifloats before. But its webpages say it doesn't support functions like exp and log? Another question is, can multifloats be fit in GPU? Doublefloats can have impressive performance with matrix ops on GPU, though I failed to make it work for exp.

JeffreySarnoff · 2024-03-18T15:47:37Z

This not an easy fix, the time taken is a result of optimizing for accuracy over speed. There is a less accurate implementation (as I recall) that is faster. I could add that as `exp_unsafe` -- let me see about tracking it down and benchmarking that.

…

On Sun, Mar 17, 2024 at 10:58 PM photor ***@***.***> wrote: no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3. Thanks, I was not aware of the package multifloats before. But its webpages say it doesn't support functions like exp and log? Another question is, can multifloats be fit in GPU? Doublefloats can have impressive performance with matrix ops on GPU, though I failed to make it work for exp. — Reply to this email directly, view it on GitHub <#135 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAM2VRQEZ63ZL3F5NRIPFS3YYZJ7DAVCNFSM5ML6RYP2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBQGI3TQNRSGE2A> . You are receiving this because you commented.Message ID: ***@***.***>

JeffreySarnoff · 2024-03-18T20:19:57Z

I just spent time on this -- the current exp benchmarks 10x slower than Float64 -- no longer 100x slower.
@oscardssmith @photor

julia> using DoubleFloats

julia> x = log(2.0)
0.6931471805599453

julia> dx = log(Double64(2.0))
6.93147180559945309417232121458179079e-01

julia> @b exp(x)
35.361 ns (2.01 allocs: 32.264 bytes)

julia> @b exp(dx)
359.211 ns (2.11 allocs: 67.158 bytes)


julia> x = log(37.0)
3.6109179126442243

julia> dx = log(Double64(37.0))
3.61091791264422444436809567103144955

julia> @b exp(x)
34.032 ns (2.01 allocs: 32.256 bytes)

julia> @b exp(dx)
374.667 ns (2.11 allocs: 67.200 bytes)

Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)

oscardssmith · 2024-03-18T20:29:51Z

What computer are you on? 35ns for Float64 exp is way too slow. I get ~5ns on an 11th gen laptop on battery.

julia> @btime exp($(Ref(x)[]))
  4.885 ns (0 allocations: 0 bytes)
2.0

JeffreySarnoff · 2024-03-18T22:11:55Z

Thanks for that. It seems that using Chairmarks for this messed up my results. Back to the mill.

photor · 2024-03-19T07:13:46Z

This not an easy fix, the time taken is a result of optimizing for accuracy over speed. There is a less accurate implementation (as I recall) that is faster. I could add that as exp_unsafe -- let me see about tracking it down and benchmarking that.
…
On Sun, Mar 17, 2024 at 10:58 PM photor @.> wrote: no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3. Thanks, I was not aware of the package multifloats before. But its webpages say it doesn't support functions like exp and log? Another question is, can multifloats be fit in GPU? Doublefloats can have impressive performance with matrix ops on GPU, though I failed to make it work for exp. — Reply to this email directly, view it on GitHub <#135 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAM2VRQEZ63ZL3F5NRIPFS3YYZJ7DAVCNFSM5ML6RYP2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBQGI3TQNRSGE2A . You are receiving this because you commented.Message ID: @.>

Any chance to make it work on GPU?

julia> using DoubleFloats, Adapt, CUDA

julia> a=randn(Double64,1000,1000);

julia> gpua=adapt(CuArray, a);

julia> gpub=similar(gpua);

julia> function gpu_exp(y,x)
       for i=1:length(y)
       @inbounds y[i]=exp(x[i])
       end
       return nothing
       end
gpu_exp (generic function with 1 method)

julia> @cuda gpu_exp(gpub,gpua);
ERROR: InvalidIRError: compiling MethodInstance for gpu_exp(::CuDeviceMatrix{Double64, 1}, ::CuDeviceMatrix{Double64, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
 [1] modf
   @ .\math.jl:902
 [2] modf
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:170
 [3] calc_exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
 [4] exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
 [5] gpu_exp
   @ .\REPL[7]:3
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
 [1] modf
   @ .\math.jl:902
 [2] modf
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:171
 [3] calc_exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
 [4] exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
 [5] gpu_exp
   @ .\REPL[7]:3
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
 [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
   @ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\validation.jl:149
 [2] macro expansion
   @ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:415 [inlined]
 [3] macro expansion
   @ C:\Users\DELL\.julia\packages\TimerOutputs\RsWnF\src\TimerOutput.jl:253 [inlined]
 [4] macro expansion
   @ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:414 [inlined]
 [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
   @ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\utils.jl:89

julia>

This was referenced Jan 21, 2022

faster exp* #136

Open

replacing expm1 implementation #131

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`exp` and `log` are much slower than they should be. #135

`exp` and `log` are much slower than they should be. #135

oscardssmith commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 20, 2022 •

edited

Loading

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 21, 2022

JeffreySarnoff commented Jan 21, 2022 via email

JeffreySarnoff commented Jan 27, 2022

oscardssmith commented Jan 27, 2022

JeffreySarnoff commented Jan 28, 2022 •

edited

Loading

JeffreySarnoff commented May 4, 2022

oscardssmith commented May 4, 2022

JeffreySarnoff commented May 4, 2022

oscardssmith commented May 4, 2022

JeffreySarnoff commented May 4, 2022

oscardssmith commented May 4, 2022

photor commented Mar 17, 2024

oscardssmith commented Mar 17, 2024

photor commented Mar 18, 2024

JeffreySarnoff commented Mar 18, 2024 via email

JeffreySarnoff commented Mar 18, 2024

oscardssmith commented Mar 18, 2024

JeffreySarnoff commented Mar 18, 2024

photor commented Mar 19, 2024 •

edited

Loading

exp and log are much slower than they should be. #135

exp and log are much slower than they should be. #135

Comments

oscardssmith commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 20, 2022

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 20, 2022 • edited Loading

JeffreySarnoff commented Jan 20, 2022

oscardssmith commented Jan 21, 2022

JeffreySarnoff commented Jan 21, 2022 via email

JeffreySarnoff commented Jan 27, 2022

oscardssmith commented Jan 27, 2022

JeffreySarnoff commented Jan 28, 2022 • edited Loading

JeffreySarnoff commented May 4, 2022

oscardssmith commented May 4, 2022

JeffreySarnoff commented May 4, 2022

oscardssmith commented May 4, 2022

JeffreySarnoff commented May 4, 2022

oscardssmith commented May 4, 2022

photor commented Mar 17, 2024

oscardssmith commented Mar 17, 2024

photor commented Mar 18, 2024

JeffreySarnoff commented Mar 18, 2024 via email

JeffreySarnoff commented Mar 18, 2024

oscardssmith commented Mar 18, 2024

JeffreySarnoff commented Mar 18, 2024

photor commented Mar 19, 2024 • edited Loading

`exp` and `log` are much slower than they should be. #135

`exp` and `log` are much slower than they should be. #135

oscardssmith commented Jan 20, 2022 •

edited

Loading

JeffreySarnoff commented Jan 28, 2022 •

edited

Loading

photor commented Mar 19, 2024 •

edited

Loading