Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complex Tensor Contractions are unusually slow #18

Open
mganahl opened this issue Jun 21, 2018 · 1 comment
Open

Complex Tensor Contractions are unusually slow #18

mganahl opened this issue Jun 21, 2018 · 1 comment

Comments

@mganahl
Copy link

mganahl commented Jun 21, 2018

Hi Devin,
Thanks for this nice library!

I'm currently testing the library, and I found the following:
For real tensors, tblis is substantially faster than a contraction using dgemm for the cases I tested.
For double matrices, dgemm and tblis are about equal. For a test case of 1000 matrix multiplications of
two 200 x 200 matrices, a timing gives (using OMP_NUM_THREADS=4 on a 4 core machine)

TBLIS: fastest of 1000 contractions: 281 microseconds
TBLIS: average time per contraction: 390.587 microseconds
DGEMM: fastest of 1000 contractions: 311 microseconds
DGEMM: average time per contraction: 332.135 microseconds

For complex types however, the contractions are actually much slower than when using zgemm. When I compile and run the code below (1000 matrix multiplications of two 200 x 200 matrices), I get the following output:

TBLIS: fastest of 1000 contractions: 14917 microseconds
TBLIS: average time per contraction: 15622.9 microseconds
ZGEMM: fastest of 1000 contractions: 932 microseconds
ZGEMM: average time per contraction: 1053.49 microseconds

Is this to be expected, or am I doing something wrong?
I was building with g++ 5 o and comparing to zgemm implementation from openblas.

thanks!

Below, mat_mat_prod is a wrapper around zgemm;

#include
#include <tblis.h>
#include "lib/linalg/blasroutines.hpp"
typedef double Real;
typedef complex Complex;

using std::cout;
using std::endl;
using std::vector;

int main(int argc, char** argv){

int d1=200,d2=200,d3=200;
vector < tblis::label_type > la={1,2};
vector < tblis::label_type > lb={2,3};
vector < tblis::label_type > lc={1,3};
tblis::tensor < Complex> A({d1,d2});
tblis::tensor < Complex> B({d2,d3});
tblis::tensor < Complex > C({d1,d3});
Complex beta=0.0;
Complex alpha=1.0;
auto mA=new Complex [d1d2];
auto mB=new Complex [d2
d3];
auto mC=new Complex [d1*d3];
int Nmax=1000;
vector time(Nmax);
vector time2(Nmax);
cout<<"doing TBLIS contraction"<<endl;
for(uint n=0;n<Nmax;n++){
auto time0 = std::chrono::high_resolution_clock::now();
tblis::mult(alpha,A,la.data(),B,lb.data(),beta,C,lc.data());
auto time1 = std::chrono::high_resolution_clock::now();
time[n]=std::chrono::duration_caststd::chrono::microseconds(time1-time0).count();
}
cout<<"doing ZGEMM contraction"<<endl;
for(uint n=0;n<Nmax;n++){
auto time0 = std::chrono::high_resolution_clock::now();
blasroutines::mat_mat_prod(alpha,mA ,d1,d2,mB,d2,d3,mC,'N','N');
auto time1 = std::chrono::high_resolution_clock::now();
time2[n]=std::chrono::duration_caststd::chrono::microseconds(time1-time0).count();
}
cout<< "TBLIS: fastest of "<<Nmax<<" contractions: "<<*std::min_element(time.begin(),time.end()) << " microseconds\n";
cout<< "TBLIS: average time per contraction: "<<std::accumulate(time.begin(),time.end(),0.0)*1.0/Nmax << " microseconds\n";
cout<< "ZGEMM: fastest of "<<Nmax<<" contractions: "<<*std::min_element(time2.begin(),time2.end()) << " microseconds\n";
cout<< "ZGEMM: average time per contraction: "<<std::accumulate(time2.begin(),time2.end(),0.0)*1.0/Nmax << " microseconds\n";

delete[] mA;
delete[] mB;
delete[] mC;
return 0;
}

@devinamatthews
Copy link
Owner

The problem is that TBLIS doesn't currently include optimized complex microkernels for most architectures. Basically you are getting a slightly fancy triple loop for these cases.

The approach that I had planned to take is the "1m" method, where the real microkernels are reused by cleverly packing the real and imaginary parts of the complex operands. This hasn't been implemented yet, but perhaps soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants