approaching better vectorization #103

JeffreySarnoff · 2020-03-26T01:33:47Z

@chriselrod @ffevotte I could use some help providing better vectorization over basic arithmetic ops for DoubleFloats. I have tried to follow the guidelines you have given without seeing much any change. Underlying addition and multiplication, there are three core routines. Being able to vectorize well over them would go a long way.

Multiplication is faster than addition of Double64s. Both are needed to accelerate matrix ops. Please ask questions.

# Double64s are structs

struct Double64
   hi::Float64
   lo::Float64
end

# the underlying support for arithmetic

"""
    two_prod(a, b)

Computes `hi = fl(a*b)` and `lo = err(a*b)`.
"""
@inline function two_prod(a::T, b::T) where {T<:Float64}
    hi = a * b
    lo = fma(a, b, -s)
    return hi, lo
end

"""
    two_sum(a, b)

Computes `hi = fl(a+b)` and `lo = err(a+b)`.
"""
@inline function two_sum(a::T, b::T) where {T<:Float64}
    hi = a + b
    a1 = hi - b
    b1 = hi - a1
    lo = (a - a1) + (b - b1)
    return hi, lo
end

# there is an alternative implementation of two_sum.
@inline function two_sum(a::T, b::T) where {T<:Float64}
    hi = a + b
    v  = hi - a
    lo = (a - (hi - v)) + (b - v)
    return hi, lo
end

# this is a special case of two_sum when the relative magnitudes are known
#   it is used heavily to gain some speed where allowable 
"""
    two_hilo_sum(a, b)

*unchecked* requirement `|a| ≥ |b|`
Computes `hi = fl(a+b)` and `lo = err(a+b)`.
"""
@inline function two_hilo_sum(a::T, b::T) where {T<:Float64}
    hi = a + b
    lo = b - (s - a)
    return hi, lo
end

# this is the support for division (called `two_dvi` in the source)
"""
    two_div(a, b)

Computes `hi = fl(a/b)` and `lo = err(a/b)`.
"""
@inline function two_div(a::T, b::T) where {T<:Float64}
     hi = a / b
     lo = fma(-hi, b, a)
     lo /= b
     return hi, lo
end

# The basic arithmetic `add` and `mul` for Double64s

# relative error < 3u² (called `add_dddd_dd` in the source)
# each Tuple gives the Double64 struct fields `(hi, lo)`
#
# >>>> It is easy to pass the structs instead if that is easier to vectorize
# 
@inline function add(x::Tuple{T,T}, y::Tuple{T,T}) where T<:Float64
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_sum(xhi, yhi)
    thi, tlo = two_sum(xlo, ylo)
    c = lo + thi
    hi, lo = two_hilo_sum(hi, c)
    c = tlo + lo
    hi, lo = two_hilo_sum(hi, c)
    return hi, lo
end


# relative error <= 5u² (called `mul_dddd_dd` in the source)
@inline function mul(x::Tuple{T,T}, y::Tuple{T,T}) where T<:Float64
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_prod(xhi, yhi)
    t = xlo * ylo
    t = fma(xhi, ylo, t)
    t = fma(xlo, yhi, t)
    t = lo + t
    hi, lo = two_hilo_sum(hi, t)
    return hi, lo
end

# called `dvi_dddd_dd` in the source
@inline function divide(x::Tuple{T,T}, y::Tuple{T,T}) where T<:Float64
    xhi, xlo = x
    yhi, ylo = y
    hi = xhi / yhi
    uh, ul = two_prod(hi, yhi)
    lo = ((((xhi - uh) - ul) + xlo) - hi*ylo)/yhi
    hi,lo = two_hilo_sum(hi, lo)
    return hi, lo
end

The text was updated successfully, but these errors were encountered:

ffevotte · 2020-03-26T13:53:55Z

Hello Jeffrey,

there is one thing that is not clear to me yet: what would be the objective of the vectorization? I can see two very different approaches:

trying to make +(::Double64, ::Double64) faster by using vector instructions in the underlying EFTs, or
trying to make +(::Vector{Double64}, ::Vector{Double64}) faster by allowing double-double numbers to be handled in packs of 4.

Approach 1. seems hard, because I'm not sure there is enough SIMD-compatible work to do in the EFTs. On the other hand, in approach 2 it would be more straightforward to know which operations can be performed in a SIMD way; but in this case I think difficulties would lie in the data layout: I expect that it would be better to have a structure of arrays, rather than an array of Double64 structures

JeffreySarnoff · 2020-03-26T14:30:45Z

Greetings François,

I did not know that one thing! So I did not guess about effectiveness: whether (1) or (2) with judicious use of LLVM/SIMD/AVX aor LoopVectorization/AccurateArithmetic for each as appropriate, would be most beneficial, attainable and maintainable.

I do understand your thinking. Perhaps first we see how; once I know appropriate ways to code to this, I'll push it and develop benchmarking help. It is important at the start to pursue (2) without constraining the availability of gofaster through (1). When reviewing the generated code, the primitives (two_prod more than two_sum) looked ok with some room for improvement (there is opportunity for more v prefixed opcodes). The arithmetic ops built upon the primitives generate code that appears less flexible / SIMD-ready than necessary. So, I hope, (2) and (1), differently rather than (2) and not (1).

Best, Jeffrey

chriselrod · 2020-03-29T02:23:54Z

I tried this (requires LoopVectorization 0.6.24 or newer):

using LoopVectorization
function gemm_accurate_kernel!(C, A, B)
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn_hi = zero(eltype(C))
        Cmn_lo = zero(eltype(C))
        for k in 1:size(B,1)
            hiprod = evmul(A[m,k], B[k,n])
            loprod = vfmsub(A[m,k], B[k,n], hiprod)
            hi_ts = evadd(hiprod, Cmn_hi)
            a1_ts = evsub(hi_ts, Cmn_hi)
            b1_ts = evsub(hi_ts, a1_ts)
            lo_ts = evadd(evsub(hiprod, a1_ts), evsub(Cmn_hi, b1_ts))
            thi = evadd(loprod, Cmn_lo)
            a1_t = evsub(thi, Cmn_lo)
            b1_t = evsub(thi, a1_t)
            tlo = evadd(evsub(loprod, a1_t), evsub(Cmn_lo, b1_t))
            c1 = evadd(lo_ts, thi)
            hi_ths = evadd(hi_ts, c1)
            lo_ths = evsub(c1, evsub(hi_ths, hi_ts))
            c2 = evadd(tlo, lo_ths)
            Cmn_hi = evadd(hi_ths, c2)
            Cmn_lo = evsub(c2, evsub(Cmn_hi, hi_ths))
        end
        C[m,n] = Cmn_hi
    end
    C
end

But it doesn't seem to work. I may have made a mistake following the double float product and sum functions.
Maybe I should follow the AccurateArtihmetic dot product algorithm.

Example of it not working:

using LinearAlgebra
r = 1e-14:99+1e-14;
qrb = qr(rand(length(r), length(r)));
Q = Matrix{BigFloat}(qrb.Q); # I wanted a random orthogonal matrix more interesting than I
Ab = Q * Diagonal(r) * Q';
Abi = Q * Diagonal(1 ./ r) * Q';
Ab64 = Float64.(Ab); Abi64 = Float64.(Abi);
cond(Ab64), cond(Abi64)
# (6.659871254102312e15, 2.063196098117369e16)
last(r) / first(r) # Should be equal to the above
# 9.900000000000002e15
C = gemm_accurate_kernel!(similar(Ab64), Ab64, Abi64);
sum(abs, C - I), sum(abs, Ab64 * Abi64 - I) # Should be 0
# (241.8293068059044, 251.69357370578118)

Improvement is pretty negligible.

EDIT:
Actually, maybe it's doing quite well

julia> sum(abs, Ab * Abi - I)
241.0932487642165916894920436948567838194347574286817003393540001216488095375455

I think the problem is that my Q matrix isn't actually perfectly orthogonal.

Just calculating Abi using inv:

julia> Abiv2 = inv(Ab);

julia> sum(abs, Ab * Abiv2 - I)
6.952146822190708017887559626198144268623694722432354117472718405806550794109339495669806009628687039895162178322961830680289099636242996590885749743681880768558283646634882983369521213016458761115018330939573505303406194198078809992514064880960725997856230716739219579947204304410369227438083786821415686911706e-291

julia> Abi64 = Float64.(Abiv2);

julia> gemm_accurate_kernel!(C, Ab64, Abi64);

julia> sum(abs, C - I), sum(abs, Ab64 * Abi64 - I)
(22.301547859434912, 63.88009463823736)

Not great.

chriselrod · 2020-03-29T02:39:12Z

I'm not sure what an EFT is, but from context I assume it means the individual operations in the functions you described. In which case I agree with ffevotte, in that I don't think much gain there is possible.

When it comes to vectorization, the simplest approach is generally to do multiple operations in parallel. So if you have a loop, calculate multiple iterations of the loop in parallel using SIMD instructions.

Structs of arrays would definitely make things easier/faster, but if you want to support arrays of structs, you could load 2 vectors of the interleaved hi/lo elements, and then use vector shuffles to get the homogeneous vectors of hi and lo elements.
It is of course faster not to need permutes, which is why structs of arrays are faster, but because you only need two on loading (and 2 on storing), it's fast enough that the approach should still give a nice performance boost if you can't change the data layout.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

approaching better vectorization #103

approaching better vectorization #103

JeffreySarnoff commented Mar 26, 2020

ffevotte commented Mar 26, 2020

JeffreySarnoff commented Mar 26, 2020

chriselrod commented Mar 29, 2020 •

edited

Loading

chriselrod commented Mar 29, 2020

approaching better vectorization #103

approaching better vectorization #103

Comments

JeffreySarnoff commented Mar 26, 2020

ffevotte commented Mar 26, 2020

JeffreySarnoff commented Mar 26, 2020

chriselrod commented Mar 29, 2020 • edited Loading

chriselrod commented Mar 29, 2020

chriselrod commented Mar 29, 2020 •

edited

Loading