approaching better vectorization #103
Hello Jeffrey, there is one thing that is not yet clear to me: what would be the objective of the vectorization? I can see two very different approaches:

1. vectorize within the error-free transformations (EFTs) themselves, so that each individual Double64 operation uses SIMD instructions;
2. vectorize across operations, performing several independent Double64 operations in parallel.
Approach 1 seems hard, because I'm not sure there is enough SIMD-compatible work to do inside the EFTs. On the other hand, with approach 2 it would be more straightforward to know which operations can be performed in a SIMD way; but in this case I think the difficulties would lie in the data layout: I expect that it would be better to have a structure of arrays rather than an array of structs.
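To make the layout question concrete, here is a minimal sketch of the two options; `DD` and `DDVector` are hypothetical names for illustration, not DoubleFloats types:

```julia
# Array-of-structs: hi and lo words interleaved in memory.
# A Vector{DD} stores the pattern hi, lo, hi, lo, ...
struct DD
    hi::Float64
    lo::Float64
end

# Structure-of-arrays: contiguous hi and lo streams that SIMD loads
# can consume directly, one number per lane.
struct DDVector
    hi::Vector{Float64}
    lo::Vector{Float64}
end
```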
Greetings François, I did not know that! So I cannot yet guess about effectiveness: whether (1) or (2), with judicious use of LLVM/SIMD/AVX or LoopVectorization/AccurateArithmetic for each as appropriate, would be most beneficial, attainable, and maintainable. I do understand your thinking. Perhaps first we see how; once I know appropriate ways to code for this, I'll push it and develop benchmarking help. It is important at the start to pursue (2) without constraining the availability of go-faster gains through (1). When reviewing the generated code, the primitives (…)

Best, Jeffrey
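For reference, the scalar primitives underlying double-double arithmetic are the error-free transformations 2Sum and 2Prod. A minimal sketch follows, assuming the textbook formulations from the double-double literature; the names `two_sum` and `two_prod` are illustrative, not necessarily DoubleFloats' internals:

```julia
# 2Sum: hi + lo == a + b exactly, with hi the correctly rounded sum.
@inline function two_sum(a::Float64, b::Float64)
    hi = a + b
    v  = hi - a
    lo = (a - (hi - v)) + (b - v)
    return hi, lo
end

# 2Prod: hi + lo == a * b exactly, using an FMA for the residual.
@inline function two_prod(a::Float64, b::Float64)
    hi = a * b
    lo = fma(a, b, -hi)
    return hi, lo
end
```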
I tried this (requires LoopVectorization 0.6.24 or newer):

```julia
using LoopVectorization
function gemm_accurate_kernel!(C, A, B)
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn_hi = zero(eltype(C))
        Cmn_lo = zero(eltype(C))
        for k in 1:size(B,1)
            # two_prod: split A[m,k] * B[k,n] into an exact hi + lo pair
            hiprod = evmul(A[m,k], B[k,n])
            loprod = vfmsub(A[m,k], B[k,n], hiprod)
            # two_sum of hiprod and Cmn_hi
            hi_ts = evadd(hiprod, Cmn_hi)
            a1_ts = evsub(hi_ts, Cmn_hi)
            b1_ts = evsub(hi_ts, a1_ts)
            lo_ts = evadd(evsub(hiprod, a1_ts), evsub(Cmn_hi, b1_ts))
            # two_sum of loprod and Cmn_lo
            thi = evadd(loprod, Cmn_lo)
            a1_t = evsub(thi, Cmn_lo)
            b1_t = evsub(thi, a1_t)
            tlo = evadd(evsub(loprod, a1_t), evsub(Cmn_lo, b1_t))
            # combine the error terms and renormalize the accumulator
            c1 = evadd(lo_ts, thi)
            hi_ths = evadd(hi_ts, c1)
            lo_ths = evsub(c1, evsub(hi_ths, hi_ts))
            c2 = evadd(tlo, lo_ths)
            Cmn_hi = evadd(hi_ths, c2)
            Cmn_lo = evsub(c2, evsub(Cmn_hi, hi_ths))
        end
        C[m,n] = Cmn_hi
    end
    C
end
```

But it doesn't seem to work; I may have made a mistake following the double-float product and sum functions. An example of it not working:

```julia
using LinearAlgebra
r = 1e-14:99+1e-14;
qrb = qr(rand(length(r), length(r)));
Q = Matrix{BigFloat}(qrb.Q); # I wanted a random orthogonal matrix more interesting than I
Ab = Q * Diagonal(r) * Q';
Abi = Q * Diagonal(1 ./ r) * Q';
Ab64 = Float64.(Ab); Abi64 = Float64.(Abi);
cond(Ab64), cond(Abi64)
# (6.659871254102312e15, 2.063196098117369e16)
last(r) / first(r) # Should be equal to the above
# 9.900000000000002e15
C = gemm_accurate_kernel!(similar(Ab64), Ab64, Abi64);
sum(abs, C - I), sum(abs, Ab64 * Abi64 - I) # Should be 0
# (241.8293068059044, 251.69357370578118)
```

Improvement is pretty negligible.

EDIT:

```julia
julia> sum(abs, Ab * Abi - I)
241.0932487642165916894920436948567838194347574286817003393540001216488095375455
```

I think the problem is that my `Abi` is not actually an accurate inverse of `Ab`. Just calculating the inverse directly:

```julia
julia> Abiv2 = inv(Ab);
julia> sum(abs, Ab * Abiv2 - I)
6.952146822190708017887559626198144268623694722432354117472718405806550794109339495669806009628687039895162178322961830680289099636242996590885749743681880768558283646634882983369521213016458761115018330939573505303406194198078809992514064880960725997856230716739219579947204304410369227438083786821415686911706e-291
julia> Abi64 = Float64.(Abiv2);
julia> gemm_accurate_kernel!(C, Ab64, Abi64);
julia> sum(abs, C - I), sum(abs, Ab64 * Abi64 - I)
(22.301547859434912, 63.88009463823736)
```

Not great.
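As a sanity check on the kernel's logic, one could compare against a scalar version of a similar compensated accumulation. This is a sketch I am adding for illustration, not code from the thread:

```julia
# Scalar compensated dot product built from the same 2Prod/2Sum steps
# as the @avx kernel above (simplified: one compensation word).
function dot_accurate(x::AbstractVector{Float64}, y::AbstractVector{Float64})
    s_hi = 0.0
    s_lo = 0.0
    for k in eachindex(x, y)
        p_hi = x[k] * y[k]
        p_lo = fma(x[k], y[k], -p_hi)                        # exact product residual
        t = s_hi + p_hi
        e = (s_hi - (t - (t - s_hi))) + (p_hi - (t - s_hi))  # 2Sum error term
        s_hi = t
        s_lo += e + p_lo                                     # fold residuals into the low word
    end
    return s_hi + s_lo
end
```

If the scalar version shows the same error as the `@avx` kernel, the problem is in the compensation scheme itself rather than in the vectorization.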
I'm not sure what an EFT is, but from context I assume it means the individual operations in the functions you described. In that case I agree with ffevotte: I don't think much gain is possible there. When it comes to vectorization, the simplest approach is generally to do multiple operations in parallel. So if you have a loop, calculate multiple iterations of the loop in parallel using SIMD instructions. Structs of arrays would definitely make things easier/faster, but if you want to support arrays of structs, you could load two vectors of the interleaved hi/lo elements and then use vector shuffles to get homogeneous vectors of hi and lo elements. A sketch of that deinterleaving follows.
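Here is one way the shuffle idea could look with SIMD.jl; the function name, vector width, and shuffle indices are illustrative assumptions:

```julia
using SIMD

# Deinterleave four double-doubles stored as hi, lo, hi, lo, ... into
# one vector of hi words and one vector of lo words.
function deinterleave(data::Vector{Float64}, i::Int)
    a = vload(Vec{4,Float64}, data, i)      # hi1, lo1, hi2, lo2
    b = vload(Vec{4,Float64}, data, i + 4)  # hi3, lo3, hi4, lo4
    his = shufflevector(a, b, Val((0, 2, 4, 6)))  # hi1, hi2, hi3, hi4
    los = shufflevector(a, b, Val((1, 3, 5, 7)))  # lo1, lo2, lo3, lo4
    return his, los
end
```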
@chriselrod @ffevotte I could use some help providing better vectorization over basic arithmetic ops for DoubleFloats. I have tried to follow the guidelines you have given without seeing much, if any, change. Underlying addition and multiplication there are three core routines; being able to vectorize well over them would go a long way.
Multiplication of Double64s is faster than their addition. Both are needed to accelerate matrix ops. Please ask questions.
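As a starting point for one of those routines, here is a minimal sketch of elementwise Double64 addition over a structure-of-arrays layout with `@avx`. The hi/lo argument arrays and the function name are hypothetical, and the loop body inlines a plain 2Sum without the final renormalization DoubleFloats may require:

```julia
using LoopVectorization

# z = x + y where each number is a (hi, lo) pair split across two arrays.
function dd_add!(zhi, zlo, xhi, xlo, yhi, ylo)
    @avx for i in 1:length(zhi)
        hi = xhi[i] + yhi[i]
        v  = hi - xhi[i]
        lo = (xhi[i] - (hi - v)) + (yhi[i] - v)  # 2Sum error term
        zhi[i] = hi
        zlo[i] = lo + (xlo[i] + ylo[i])          # fold in the low words (no renorm)
    end
    return zhi, zlo
end
```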