This repository has been archived by the owner on May 28, 2024. It is now read-only.

Is there a way to build the custom op on my Ubuntu 16.04 with tensorflow-gpu==1.15 and cuda-8.0 #76

Open
BruceDai003 opened this issue Aug 7, 2020 · 2 comments

Comments

@BruceDai003

I know that the project description says one has to use the provided docker image in order to build the custom op successfully.
I first tried 'make' to build the ops on my own machine with tf 1.15; that only works for the ZeroOut CPU op. What I really want, though, is to build a GPU op on my own machine with tf 1.15. I also tried the provided docker image and ran 'make' to build the TimeTwo GPU op, but it failed with an unknown -fPIC flag error:

root@5f0d055f7746:/custom-op# make time_two_op
nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -DNDEBUG --expt-relaxed-constexpr
nvcc fatal   : Unknown option 'fPIC'
Makefile:35: recipe for target 'tensorflow_time_two/python/ops/_time_two_ops.cu.o' failed
make: *** [tensorflow_time_two/python/ops/_time_two_ops.cu.o] Error 1
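As an aside, nvcc rejects bare host-compiler options; the trailing `-Xcompiler -fPIC` in the same command already shows the accepted spelling. A possibly cleaner fix than removing the flag (untested on this setup) would be to forward the first `-fPIC` through `-Xcompiler` as well:

```shell
# Untested sketch: the Makefile's command, but with the bare -fPIC
# forwarded to the host compiler via -Xcompiler instead of deleted.
nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o \
  tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc \
  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include \
  -D_GLIBCXX_USE_CXX11_ABI=0 -O2 \
  -L/usr/local/lib/python3.6/dist-packages/tensorflow \
  -l:libtensorflow_framework.so.2 \
  -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC -DNDEBUG --expt-relaxed-constexpr
```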

So I removed the -fPIC flag. (Note that there are two -fPIC flags in the nvcc command; I had to remove both to make this error disappear.) I copied the command, removed the -fPIC flags, and ran it again, only to hit another error:

root@5f0d055f7746:/custom-op# nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -DNDEBUG --expt-relaxed-constexpr
In file included from tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc:21:0:
/usr/local/lib/python3.6/dist-packages/tensorflow/include/tensorflow/core/util/gpu_kernel_helper.h:22:53: fatal error: third_party/gpus/cuda/include/cuda_fp16.h: No such file or directory
compilation terminated.

It seems that this gpu_kernel_helper.h file needs to include the cuda_fp16.h header. I couldn't find that file in the docker image, but found one on my host machine, so I copied it into the directory third_party/gpus/cuda/include (which I had to create, since there was no gpus/cuda/include directory under third_party).

root@5f0d055f7746:/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include# ll
total 112
drwxr-sr-x 2 root staff     25 Aug  7 06:42 ./
drwxr-sr-x 3 root staff     21 Aug  7 06:42 ../
-rw-r--r-- 1 root staff 114479 Aug  7 06:42 cuda_fp16.h

Now running the nvcc command again:

root@5f0d055f7746:/custom-op# nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -DNDEBUG --expt-relaxed-constexpr
In file included from /usr/local/lib/python3.6/dist-packages/tensorflow/include/tensorflow/core/util/gpu_kernel_helper.h:25:0,
                 from tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc:21:
/usr/local/lib/python3.6/dist-packages/tensorflow/include/tensorflow/core/util/gpu_device_functions.h:34:53: fatal error: third_party/gpus/cuda/include/cuComplex.h: No such file or directory
compilation terminated.

Well, this file also lives in the same directory on my host machine, so in case other headers were needed too, I copied that directory's entire contents into the required path in the docker image.

Now I got at least 100 errors (too many to paste in full, so here are some for reference):

root@5f0d055f7746:/custom-op# nvcc -std=c++11 -c -o tensorflow_time_two/python/ops/_time_two_ops.cu.o tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc  -I/usr/local/lib/python3.6/dist-packages/tensorflow/include -D_GLIBCXX_USE_CXX11_ABI=0 -O2 -std=c++11 -L/usr/local/lib/python3.6/dist-packages/tensorflow -l:libtensorflow_framework.so.2 -D GOOGLE_CUDA=1 -x cu -Xcompiler -DNDEBUG --expt-relaxed-constexpr
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(129): error: invalid redeclaration of type name "__half"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(96): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(140): error: invalid redeclaration of type name "__half2"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(100): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(155): error: cannot overload functions distinguished by return type alone

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(183): error: cannot overload functions distinguished by return type alone

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(198): error: cannot overload functions distinguished by return type alone

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.h(213): error: cannot overload functions distinguished by return type alone

...
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(166): error: class "__half" has no member "__x"
...
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(360): error: cannot overload functions distinguished by return type alone
...
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(552): error: no suitable user-defined conversion from "__half2" to "__half2" exists
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2053): error: invalid redeclaration of type name "half"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(103): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp(2054): error: invalid redeclaration of type name "half2"
/usr/local/lib/python3.6/dist-packages/tensorflow/include/third_party/gpus/cuda/include/cuda_fp16.h(104): here
/usr/local/lib/python3.6/dist-packages/tensorflow/include/unsupported/Eigen/CXX11/../../../Eigen/src/Core/arch/Default/TypeCasting.h(25): error: no suitable user-defined conversion from "__half" to "Eigen::half" exists
/usr/local/lib/python3.6/dist-packages/tensorflow/include/unsupported/Eigen/CXX11/../../../Eigen/src/Core/arch/GPU/PacketMath.h(1005): error: no suitable user-defined conversion from "__half2" to "half2" exists

Error limit reached.
100 errors detected in the compilation of "/tmp/tmpxft_000000d7_00000000-6_time_two_kernels.cu.cpp1.ii".
Compilation terminated.

The errors all seem to relate to the files we just copied. But I don't know what's wrong or how to solve it.
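One guess at the cause: the cuda_fp16.h copied from the host likely comes from a different CUDA release than the toolkit inside the image, so nvcc now sees two incompatible declarations of __half on its include path (its own copy under /usr/local/cuda plus the copied one under TensorFlow's third_party path). If that is right, pointing the third_party path at the image's own headers, instead of copying host files, should keep both includes identical. An untested sketch:

```shell
# Untested sketch: replace the hand-copied headers with a symlink to the
# CUDA toolkit headers already inside the image, so both include paths
# resolve to the same files.
TF_INC=/usr/local/lib/python3.6/dist-packages/tensorflow/include
rm -rf "$TF_INC/third_party/gpus/cuda/include"
mkdir -p "$TF_INC/third_party/gpus/cuda"
ln -s /usr/local/cuda/include "$TF_INC/third_party/gpus/cuda/include"
```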

Then I turned to bazel. I finally built the TimeTwo GPU op successfully under the tensorflow/tensorflow:custom-op-gpu-ubuntu16 docker image. But I want to modify the bazel code in order to build a simple GPU custom op on Ubuntu 16.04 with tf==1.15. I am not very familiar with bazel and am trying to learn it. I don't understand why building a simple GPU op has to be so complicated. :(

@BruceDai003
Author

In your Makefile, the command to build the TimeTwo GPU op is very simple: just an nvcc command. But building with bazel requires several other folders, e.g. 'gpu', 'tf', 'third_party', which makes learning to build with bazel on a different machine quite difficult. Is there a simple way to use nvcc to build the GPU op?
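TensorFlow's "create an op" guide does describe a plain nvcc + g++ recipe driven by tf.sysconfig, with no bazel involved. Here is my untested sketch of what that might look like for this repo's file layout (the .cc file names are assumptions based on the repo structure):

```shell
# Untested sketch, adapted from TensorFlow's GPU-kernel build recipe.
# Pull compile/link flags from the installed TensorFlow itself.
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )

# 1) Compile the CUDA kernel; note -fPIC goes through -Xcompiler.
nvcc -std=c++11 -c -o time_two_kernels.cu.o \
  tensorflow_time_two/cc/kernels/time_two_kernels.cu.cc \
  "${TF_CFLAGS[@]}" -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC --expt-relaxed-constexpr

# 2) Link the shared library with the host compiler.
g++ -std=c++11 -shared -o _time_two_ops.so \
  tensorflow_time_two/cc/ops/time_two_ops.cc \
  tensorflow_time_two/cc/kernels/time_two_kernels.cc time_two_kernels.cu.o \
  "${TF_CFLAGS[@]}" -fPIC -L/usr/local/cuda/lib64 -lcudart "${TF_LFLAGS[@]}"
```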

@leimao
Contributor

leimao commented Sep 30, 2020

I got the same problem. The cuda_fp16.h file is missing (I installed TensorFlow via pip). I also checked the TensorFlow master branch; there is no such file in the third_party directory either.
