mm! is slower and crashing on MacBook Pro GPU #17

Closed · fonghou opened this issue May 27, 2016 · 13 comments

fonghou commented May 27, 2016

Hello,

I tried a few examples today. On my MacBook Pro (early 2013), I ran this code:


(ns matrix.core
  (:require
   [criterium.core :refer [quick-bench with-progress-reporting]]
   [uncomplicate.commons.core :refer [with-release]]
   [uncomplicate.clojurecl.core
    :refer [with-default with-platform platforms with-context context
            with-queue sort-by-cl-version devices finish!]]
   [uncomplicate.clojurecl.legacy
    :refer [with-default-1 command-queue-1]]
   [uncomplicate.neanderthal
    [core :refer [asum dot axpy! mv! mm! transfer! copy]]
    [native :refer [sv sge]]
    [opencl :refer [with-default-engine clv clge]]]))

(with-platform (first (platforms))
  (let [dev (first (sort-by-cl-version (devices :gpu)))]
    (with-context (context [dev])
      (with-queue (command-queue-1 dev)
        (with-default-engine
          (let [cnt 4096]
            (with-release [host-a (sge cnt cnt (range (* cnt cnt)))
                           host-b (copy host-a)
                           host-c (copy host-a)
                           gpu-a (transfer! host-a (clge cnt cnt))
                           gpu-b (copy gpu-a)
                           gpu-c (copy gpu-a)]
              (println "CPU:")
              (time (mm! 3 host-a host-b 2 host-c))
              (println "GPU:")
              ;; run mm! once before timing: the first call builds the
              ;; OpenCL kernels, and finish! blocks until the queue drains
              (mm! 3 gpu-a gpu-b 2 gpu-c)
              (println (finish!))
              (println gpu-c)
              (time (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish!))))))))))
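
For steadier numbers than a single time call, the criterium already required in the ns form could be used. A minimal sketch, reusing the bindings from the with-release block above:

;; quick-bench runs the expression many times and reports statistics,
;; smoothing over JIT and driver warm-up
(quick-bench (mm! 3 host-a host-b 2 host-c))
;; finish! keeps the GPU timing honest: mm! only enqueues the kernel
(quick-bench (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish!)))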

The first GPU device is the Intel HD Graphics 4000. The GPU is slower than the CPU:

:device-type :gpu, :vendor "Intel", :vendor-id 16925696, :device-version "OpenCL 1.2 ", :driver-version "1.2(Apr 26 2016 00:33:44)"

CPU:
"Elapsed time: 835.101025 msecs"
GPU:

#object[org.jocl.cl_command_queue 0x72581c95 "cl_command_queue[0x7f897a5b03f0]"]

CLGeneralMatrix[float, COL, mxn: 4096x4096, offset:0, ld:4096]

"Elapsed time: 1968.578958 msecs"

#object[org.jocl.cl_command_queue 0x72581c95 "cl_command_queue[0x7f897a5b03f0]"]

Running the same code after switching to the second GPU device, the NVIDIA GeForce GT 650M:

:device-type :gpu, :vendor "NVIDIA", :vendor-id 16918272, :device-version "OpenCL 1.2 ", :driver-version "10.10.10 310.42.25f01"

The JVM crashed in native code:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001386dd2d6, pid=1169, tid=0x000000000000e453
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libclh.dylib+0xaa2d6]  cuiModuleUnloadEx+0x66

Stack: [0x0000700000e41000,0x0000700000ec1000],  sp=0x0000700000ec07f0,  free space=509k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libclh.dylib+0xaa2d6]  cuiModuleUnloadEx+0x66
C  [libclh.dylib+0xa8f2e]  cuiModuleUnload+0xe
C  [GeForceGLDriver+0x30b3dd]  gldCopyTextureDataToBufferWithQueue+0x11fd
C  [GeForceGLDriver+0x30d8a1]  gldExecuteKernel+0x17b
C  [OpenCL+0x34a7]
C  [OpenCL+0x200da]  clSetEventCallback+0x1700
C  [OpenCL+0x239cc]  clFinish+0x2f9
C  [libdispatch.dylib+0x240b]  _dispatch_client_callout+0x8
C  [libdispatch.dylib+0x703b]  _dispatch_queue_drain+0x2f2
C  [libdispatch.dylib+0xd707]  _dispatch_queue_invoke+0x225
C  [libdispatch.dylib+0x5d53]  _dispatch_root_queue_drain+0x21a
C  [libdispatch.dylib+0x5b00]  _dispatch_worker_thread3+0x5b
C  [libsystem_pthread.dylib+0x34de]  _pthread_wqthread+0x469
C  [libsystem_pthread.dylib+0x1341]  start_wqthread+0xd

Are these results expected on an old MacBook Pro? I'd like to see some results from a newer Mac.

Another note: I made a quick attempt on an AWS G2 instance using the Amazon Linux AMI with the NVIDIA GRID GPU driver. The OS failed to load libJOCL_2_0_0-linux-x86_64.so with a LinkError because libstdc++ does not support CXXABI_1.3.8 (is there a JOCL build for CXXABI_1.3.7?). I may try to build a custom AMI. It would be really helpful if someone who has done this before could give some guidance, e.g. which Linux distribution, OpenCL packages, versions, etc.

Thanks,
Feng

blueberry commented May 27, 2016

mm! can be slower for two reasons:

  1. Low-end integrated GPUs are not very powerful, and are expected to be much slower than desktop/server GPUs.
  2. The library has not been tuned (yet) for the particular device, so it is slower than it could be.

Here is the list of devices that the current version (0.6.2) is tuned for:
NVIDIA GPUs:
GeForce GTX 480
GeForce GTX 680
GeForce GTX 750 Ti
GeForce GTX 980
GeForce GTX Titan
GeForce GTX Titan X
Tesla K20m
Tesla K40m
AMD GPUs:
Tahiti
Hawaii
Pitcairn
R9 M370X
Intel GPUs:
Iris
Iris Pro
Intel CPUs:
Core i5-6200U
Core i7-3770K
Core i7-5930K
Other devices:
ARM Mali-T628 GPU
Intel MIC

Obviously, your GPUs are not on the list, so if you would like them to be in the next version, ask and I'll send you pointers on how to tune and optimize the library. It involves some native code compilation, but it is automated and reasonably easy.

There are also tests for the native library that I use for GPU BLAS, so those may also help us debug the JVM crash you are seeing.

@blueberry

BTW, I am using Arch Linux, and one of the good things about it is that it has fairly recent drivers for my AMD GPUs, which are also regularly patched for newer kernels.
I do not know what the situation is with Nvidia's drivers, but I guess you need fairly recent ones, since they only started supporting OpenCL 1.2 recently (less than a year ago).
I guess you also need to upgrade gcc to a recent version and/or set LD_LIBRARY_PATH...

fonghou commented May 28, 2016

Thanks for the info. I'm a newbie in OpenCL, but I would love to give it a try. Though, I'm more interested in making it work than in tuning it on the AWS G2 instance, which has an NVIDIA GPU and an NVIDIA-released AMI. I've upgraded it to the latest NVIDIA driver and gcc 4.9/libstdc++, which resolved the libstdc++ CXXABI_1.3.8 link error. However, I'm getting another link error now:

java.lang.UnsatisfiedLinkError: /tmp/JOCLBlast_0_7_1-linux-x86_64_dependents/linux/x86_64/libclblast.so: /usr/lib64/libOpenCL.so: version `OPENCL_2.0' not found (required by /tmp/JOCLBlast_0_7_1-linux-x86_64_dependents/linux/x86_64/libclblast.so)

Here is some system info.

[ec2-user@ip-172-31-26-167 ~]$ uname -a
Linux ip-172-31-26-167 4.4.10-22.54.amzn1.x86_64 #1 SMP Tue May 17 22:45:04 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[ec2-user@ip-172-31-26-167 ~]$ nvidia-smi -q | head

==============NVSMI LOG==============

Timestamp : Sat May 28 03:16:13 2016
Driver Version : 361.45.11

Attached GPUs : 1
GPU 0000:00:03.0
Product Name : GRID K520
Product Brand : Grid
[ec2-user@ip-172-31-26-167 ~]$ strings /usr/lib64/libOpenCL.so |grep OPENCL
OPENCL_1.0
OPENCL_1.1
OPENCL_1.2

Apparently, the NVIDIA Linux driver only supports OpenCL 1.2. Also, the GPU device is a GRID K520, which is not on your tuned list. I think I have to compile the JOCL native libraries for OpenCL 1.2 first, then compile and tune your library, correct?

Appreciate any help you can provide!

Thanks!

blueberry commented May 28, 2016

You do not need to recompile; Neanderthal should work with OpenCL 1.2. Check the OS, the drivers, and the calling code.
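
For example, something like this prints what each device actually reports. A minimal sketch, assuming clojurecl's uncomplicate.clojurecl.info/info, which the device maps pasted above appear to come from (the :vendor, :device-version, and :driver-version keys are taken from that output):

(require '[uncomplicate.clojurecl.core :refer [platforms devices]]
         '[uncomplicate.clojurecl.info :refer [info]])

;; list every device on every platform with its reported versions
(doseq [p (platforms)
        d (devices p)]
  (println (select-keys (info d) [:vendor :device-version :driver-version])))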

@blueberry

... but you can try to compile and test https://github.com/CNugteren/CLBlast to see what happens. Once we make it work, you can then tune it to make it work fast.

fonghou commented May 28, 2016

The example above works now after compiling CLBlast. However, I couldn't make it work with CLTune. Here is the cmake invocation that failed with lots of undefined reference to ATL_... errors:

cmake -DCMAKE_INSTALL_PREFIX=/opt/CLBlast-tune -DOPENCL_ROOT=/opt/nvidia/cuda -DCLTUNE_ROOT=/opt/CLTune -DTUNERS=ON -DTESTS=ON -DCBLAS_ROOT=/opt/ATLAS ..

@blueberry

@fonghou I think it is better to ask for help with that in the CLBlast issues.

fonghou commented May 28, 2016

After removing -DTESTS=ON:

cmake -DCMAKE_INSTALL_PREFIX=/opt/CLBlast-tune -DOPENCL_ROOT=/opt/nvidia/cuda -DCLTUNE_ROOT=/opt/CLTune -DTUNERS=ON ..

Scanning dependencies of target clblast_tuner_copy
[ 85%] Building CXX object CMakeFiles/clblast_tuner_copy.dir/src/tuning/copy.cc.o
Linking CXX executable clblast_tuner_copy
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetMemObjectInfo@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clWaitForEvents@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clCreateBuffer@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clCreateCommandQueue@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clCreateContext@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clReleaseContext@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clEnqueueCopyBuffer@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clReleaseKernel@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clSetKernelArg@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetPlatformIDs@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clEnqueueNDRangeKernel@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetEventProfilingInfo@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clEnqueueReadBuffer@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetDeviceInfo@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clReleaseMemObject@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clFinish@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetProgramBuildInfo@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clBuildProgram@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clReleaseCommandQueue@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clCreateKernel@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clEnqueueWriteBuffer@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetKernelWorkGroupInfo@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clReleaseProgram@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clGetDeviceIDs@OPENCL_1.0'
/opt/CLTune/lib/libcltune.so: undefined reference to `clCreateProgramWithSource@OPENCL_1.0'
collect2: error: ld returned 1 exit status
make[2]: *** [clblast_tuner_copy] Error 1
make[1]: *** [CMakeFiles/clblast_tuner_copy.dir/all] Error 2

fonghou commented May 28, 2016

@blueberry Yes, I'll follow up there. Thanks again for your help!

@blueberry

@fonghou On the surface, it looks to me like Nvidia made a mess in the drivers, so the OpenCL 1.0 symbols get picked up, while CLBlast requires access to at least OpenCL 1.1, which is itself ancient. I hope Cedric (the CLBlast author) will be able to help, since he uses Nvidia himself.

@blueberry

Another idea: clean up old Nvidia drivers, make sure nothing is left over, and then reinstall the latest drivers.

@CNugteren

@blueberry Thanks for helping with the initial support. I'll follow up as soon as possible in the CLBlast issues. The requirements for CLBlast are indeed OpenCL 1.1 or higher and GCC 4.9.0 or higher.

@blueberry

@fonghou OK, since you found the solution through the CLBlast issues, I'll close this for now.
