TensorRT ships in two package forms, DEB and Tar, and the choice should match how CUDA was installed: if CUDA was installed from a .deb, download the TensorRT .deb; if CUDA was installed from a runfile, download the TensorRT .tar. I used the tar form.
Download it from: https://developer.nvidia.com/tensorrt
Enter the conda environment:
conda activate sp_tensorflow
Put the downloaded TensorRT archive under a path of your choice and extract the tar file:
tar -xzvf TensorRT-xxx.tar.gz
Add the environment variable:
$ gedit ~/.bashrc
# append the following line to the end of the file
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/TensorRT-6.0.1.5/lib # adjust to your extraction path
$ source ~/.bashrc
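The append step above can be made idempotent, so re-running the setup never duplicates the entry (a sketch; the TensorRT path is the example one from above, and BASHRC defaults to ~/.bashrc):

```shell
# Append the export line to ~/.bashrc only if it is not already present.
BASHRC="${BASHRC:-$HOME/.bashrc}"
LINE='export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/TensorRT-6.0.1.5/lib'
grep -qxF "$LINE" "$BASHRC" 2>/dev/null || echo "$LINE" >> "$BASHRC"
```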
Enter the extracted TensorRT folder.
Install the Python TensorRT package:
cd ./python # switch to the python folder
pip install tensorrt-6.0.1.5-cp36-none-linux_x86_64.whl # in a conda virtual environment, pip defaults to pip3
Test whether TensorRT installed successfully:
$python
>>>import tensorrt
>>>tensorrt.__version__
'6.0.1.5'
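The same check in script form, going one step further by constructing a trt.Builder, which also exercises the native libraries on LD_LIBRARY_PATH (guarded with a fallback so it reports cleanly when the wheel is absent):

```python
def tensorrt_status():
    """Report the installed TensorRT version, or why it can't be loaded."""
    try:
        import tensorrt as trt
        # Creating a Builder fails if the wheel and native libs mismatch.
        builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
        return "TensorRT OK: " + trt.__version__
    except ImportError as exc:
        return "TensorRT not importable: %s" % exc

print(tensorrt_status())
```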
Install the uff package:
cd ../uff # switch to the uff folder
pip install uff-0.6.5-py2.py3-none-any.whl
Test:
which convert-to-uff # prints the install path
Install the graphsurgeon package:
cd ../graphsurgeon # switch to the graphsurgeon folder
pip install graphsurgeon-0.4.1-py2.py3-none-any.whl
At this point you can try some of the samples in /home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/samples; read README.md first. (P.S.: I tried sampleMNIST.)
conda activate sp_tensorflow
pip install -U pip six numpy wheel setuptools mock 'future>=0.17.1'
pip install -U keras_applications --no-deps
pip install -U keras_preprocessing --no-deps
I chose TensorFlow v2.1.1, which requires CUDA >= 10.1, cuDNN 7.6.5, and TensorRT 6.
git clone -b v2.1.1 --recursive https://github.com/tensorflow/tensorflow.git
Check the required Bazel version in tensorflow/configure.py: TensorFlow v2.1.1 supports Bazel 0.27.0 through 0.29.1. If you want to use a newer Bazel, change _TF_MAX_BAZEL_VERSION in tensorflow/configure.py.
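The range check that configure.py performs can be sketched like this (the version constants are the v2.1.1 numbers above; the helper names are mine, not TensorFlow's):

```python
# Supported Bazel range for TensorFlow v2.1.1 (configure.py keeps these
# in _TF_MIN_BAZEL_VERSION / _TF_MAX_BAZEL_VERSION).
TF_MIN_BAZEL = "0.27.0"
TF_MAX_BAZEL = "0.29.1"

def parse_version(v):
    """Turn '0.29.1' into (0, 29, 1) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))

def bazel_supported(installed):
    return (parse_version(TF_MIN_BAZEL)
            <= parse_version(installed)
            <= parse_version(TF_MAX_BAZEL))

print(bazel_supported("0.29.1"))  # True
print(bazel_supported("1.2.1"))   # False: raise _TF_MAX_BAZEL_VERSION first
```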
Download the matching installer script, bazel-<version>-installer-linux-x86_64.sh, from https://github.com/bazelbuild/bazel/releases.
Then run the installer script:
chmod +x bazel-<version>-installer-linux-x86_64.sh
./bazel-<version>-installer-linux-x86_64.sh --user
The --user flag installs Bazel into the $HOME/bin directory.
Configure the Bazel environment: add export PATH="$PATH:$HOME/bin" to your ~/.bashrc, then, as usual, either source it or open a new terminal.
First check whether you have 24 GB of memory (a build with all 8 threads running needs about that much); if not, consider using swap to extend it. Check the swap size (16 GB here after my change; the initial size was 2 GB):
$ swapon -s
Filename Type Size Used Priority
/swapfile file 16777212 3612052 -2
Resize the swap space:
# if a swapfile already exists, disable it first
sudo swapoff /swapfile
# resize the swap space to 16 GB
sudo dd if=/dev/zero of=/swapfile bs=1G count=16
# mark the file as a swap area
sudo mkswap /swapfile
# enable the swapfile
sudo swapon /swapfile
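A quick way to check whether RAM plus swap now reaches the ~24 GB target (a Linux-only sketch reading /proc/meminfo, where both values are reported in KiB):

```shell
if [ -r /proc/meminfo ]; then
  # Sum MemTotal and SwapTotal (KiB) and convert to MiB.
  total_mib=$(awk '/^(MemTotal|SwapTotal):/ {sum += $2} END {print int(sum/1024)}' /proc/meminfo)
  echo "RAM+swap: ${total_mib} MiB"
  if [ "$total_mib" -ge 24576 ]; then
    echo "enough to build with all 8 threads"
  else
    echo "consider a larger swapfile or limiting bazel's resources"
  fi
fi
```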
If you do not extend memory, you will need to limit bazel's resource usage during the build, e.g.:
bazel build --config=opt --local_resources=4096,6,10 --verbose_failures //tensorflow/tools/pip_package:build_pip_package
Here --local_resources=4096,6,10 limits the build to at most 4096 MB of memory, 6 CPUs, and 10 I/O threads; --verbose_failures prints detailed error messages. Memory pressure may still be high and errors can still occur; you can additionally pass --jobs=8 to limit the number of concurrent jobs.
Set up the build configuration:
./configure
The following session can be used as a reference:
felaim@felaim-pc:~/Documents/software/tensorflow-r1.14$ ./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.24.1 installed.
Please specify the location of python. [Default is /usr/bin/python]: /home/c2214-e/anaconda3/envs/sp_tensorflow/bin/python3.6
# check the Python environment here; Python 3 is recommended
Found possible Python library paths:
/home/c2214-e/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/home/c2214-e/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages]
# be sure to check the Python library path here: it must match the python binary above and the default install location used by pip install
Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
No XLA JIT support will be enabled for TensorFlow.
# This asks whether to enable XLA JIT support. XLA (Accelerated Linear Algebra) is still an experimental TensorFlow project; it uses JIT (just-in-time) compilation to analyze the TensorFlow graph created at runtime and specialize it for the actual runtime shapes and types. The technology is not yet mature, so adventurous readers can answer "y"; otherwise take the default "N".
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
# OpenCL SYCL is a high-level programming model; NVIDIA GPUs do not support it, while Intel and AMD GPUs do
Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.
# ROCm is AMD's GPU acceleration stack, serving a similar role to CUDA
# the keras library must be installed beforehand, or the build will fail
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
# This asks whether to use CUDA, NVIDIA's general-purpose parallel computing architecture that lets the GPU handle complex computation. Answer "y" if you have an NVIDIA GPU; press Enter for "N" if you only want the CPU-only build.
Do you wish to build TensorFlow with TensorRT support? [y/N]: y
TensorRT support will be enabled for TensorFlow.
# TensorRT is NVIDIA's model-optimization toolkit; it can speed up training and inference
Could not find any NvInferVersion.h matching version '' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/home/felaim/intel/ipp/lib/intel64_lin'
'/lib/x86_64-linux-gnu'
'/usr'
'/usr/lib'
'/usr/lib/x86_64-linux-gnu'
'/usr/lib/x86_64-linux-gnu/libfakeroot'
'/usr/lib/x86_64-linux-gnu/mesa'
'/usr/lib/x86_64-linux-gnu/mesa-egl'
'/usr/local/cuda'
'/usr/local/cuda-10.1/targets/x86_64-linux/lib'
'/usr/local/cuda/extras/CUPTI/lib64'
'/usr/local/lib'
Asking for detailed CUDA configuration...
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]:
Please specify the TensorRT version you want to use. [Leave empty to default to TensorRT 5]: 6
Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]:
Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr/local/cuda,/usr/local/cuda/bin,/usr/local/cuda/lib64,/usr/local/cuda/include,/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5,/usr/lib/x86_64-linux-gnu/,/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/targets/x86_64-linux-gnu
Found CUDA 10.1 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found TensorRT 6 in:
/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/lib
/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/include
Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 5.2]:
# set your NVIDIA GPU's compute capability; it should match the value on NVIDIA's site
Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
# This specifies CPU optimization flags for compilation. The default is "-march=native": "m" stands for machine and "arch" for architecture, so "-march=native" optimizes for the local CPU; on a recent CPU this enables SSE4.2, AVX, and similar instruction sets. The default is recommended.
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
# this part configures TensorFlow builds for the Android NDK/SDK; answer N if you do not need it
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
Start the build in the tensorflow directory:
bazel build --config=opt --config=cuda --config=v2 //tensorflow/tools/pip_package:build_pip_package
This build command is essentially the same as the official one, but various errors appeared during the build, e.g.:
bazel-out/k8-opt/bin/tensorflow/stream_executor/cuda/libcuda_stub.pic.a(cuda_stub.pic.o): In function `cudaError_enum (*(anonymous namespace)::LoadSymbol<cudaError_enum (*)(CUctx_st*)>(char const*))(CUctx_st*)':cuda_stub.cc:(.text._ZN12_GLOBAL__N_110LoadSymbolIPF14cudaError_enumP8CUctx_stEEET_PKc+0x2a): undefined reference to `tensorflow::Env::Default()'
bazel-out/k8-opt/bin/tensorflow/stream_executor/cuda/libcuda_stub.pic.a(cuda_stub.pic.o):cuda_stub.cc:(.text._ZN12_GLOBAL__N_110LoadSymbolIPF14cudaError_enumPP8CUctx_stEEET_PKc+0x2a): more undefined references to `tensorflow::Env::Default()' follow bazel-out/k8-opt/bin/tensorflow/core/distributed_runtime/librequest_id.pic.a(request_id.pic.o): In function `tensorflow::GetUniqueRequestId()':
Another problem is that some toolchains would inexplicably pick up the Python 2 environment during the build and fail.
To work around these errors I added the option --noincompatible_do_not_split_linking_cmdline to the build command, which lets bazel resolve the compatibility issues on its own (the build of some modules may be skipped as a result). I used it in my build:
bazel build --config=opt --config=cuda --noincompatible_do_not_split_linking_cmdline --config=v2 //tensorflow/tools/pip_package:build_pip_package
To rebuild from scratch, run bazel clean in the previous build directory and delete the contents of ~/.cache/bazel/.
In addition, v1.x versions may need one extra step:
cd tensorflow/contrib/makefile
sh build_all_linux.sh
Generate the .whl file:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ./tensorflow_pkg
Install the built tensorflow with pip:
cd tensorflow_pkg
pip install tensorflow-2.1.1-cp36-cp36m-linux_x86_64.whl
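A quick post-install sanity check (guarded so it also reports cleanly if the import fails, e.g. under the wrong Python):

```python
def tf_status():
    """Report the installed TensorFlow version and the GPUs it can see."""
    try:
        import tensorflow as tf
        gpus = tf.config.list_physical_devices("GPU")
        return "TensorFlow %s, GPUs visible: %d" % (tf.__version__, len(gpus))
    except ImportError as exc:
        return "TensorFlow not importable: %s" % exc

print(tf_status())
```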
If you also need TensorFlow's C++ API, continue with:
bazel build --config=opt --config=cuda --noincompatible_do_not_split_linking_cmdline --config=v2 //tensorflow:libtensorflow_cc.so //tensorflow:libtensorflow_framework.so //tensorflow:install_headers
Note that after the C++ API build finishes, the headers are installed under tensorflow/bazel-out/k8-opt/bin/tensorflow/include and the libraries under tensorflow/bazel-out/k8-opt/bin/tensorflow (they may instead land in bazel-genfiles, though oddly I did not have that directory). However, the protobuf headers are not installed, so you have to borrow them from the Python tensorflow installed earlier; they live at /home/c2214-e/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages/tensorflow_core/include/google/protobuf. Without the protobuf headers you may hit the error: google/protobuf/port_def.inc: No such file or directory
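Putting those locations together, a compile line for a C++ program against the built libraries might look like this (a sketch only: all paths are assumptions based on the locations above, and main.cc stands in for your own source file). The command is echoed here rather than executed:

```shell
TF_SRC="$HOME/tensorflow"
PY_TF="$HOME/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages/tensorflow_core"
# Headers from the bazel output tree, protobuf headers borrowed from the
# pip-installed package, libraries from the bazel output tree.
CXXFLAGS="-std=c++11 -I$TF_SRC/bazel-out/k8-opt/bin/tensorflow/include -I$PY_TF/include"
LDFLAGS="-L$TF_SRC/bazel-out/k8-opt/bin/tensorflow -ltensorflow_cc -ltensorflow_framework"
echo g++ $CXXFLAGS main.cc $LDFLAGS -o main
```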
Some modules fail to download during the build.
TensorFlow has many dependencies, so the build also downloads a number of packages. The source code lists the download URLs and bazel fetches them automatically, but with a bad network some resources download slowly or not at all, and you have to fetch them yourself and install them locally. The tricky part is that this does not raise an error, only a warning such as xxx failed downloading, possibly with the download URL. Two fixes:
The first is to modify the module's download urls in tensorflow/tensorflow/workspace.bzl, e.g.:
tf_http_archive(
    name = "llvm",
    build_file = clean_dep("//third_party/llvm:llvm.autogenerated.BUILD"),
    sha256 = "47a5cb24209c24370cd4fec7bfbda8b40d5660b3c821addcfb47a405a077eee9",
    strip_prefix = "llvm-project-ecc999101aadc8dc7d4af9fd88be10fe42674aa0/llvm",
    urls = [
        "https://github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
        "https://mirror.bazel.build/github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
    ],
)
I could not reach the link https://mirror.bazel.build/github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz, so I moved it to the end of the list.
The second is to download the source file manually and put it under /var/www/html/ (served by a local web server). Suppose the downloaded file is ABCD.tar.gz:
cp ABCD.tar.gz /var/www/html/
Then add the local path to the module:
tf_http_archive(
    name = "llvm",
    build_file = clean_dep("//third_party/llvm:llvm.autogenerated.BUILD"),
    sha256 = "47a5cb24209c24370cd4fec7bfbda8b40d5660b3c821addcfb47a405a077eee9",
    strip_prefix = "llvm-project-ecc999101aadc8dc7d4af9fd88be10fe42674aa0/llvm",
    urls = [
        "http://127.0.0.1/ABCD.tar.gz",
        "https://github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
        "https://mirror.bazel.build/github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
    ],
)
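Before pointing bazel at a manually downloaded archive, it is worth verifying that its sha256 matches the hash pinned in workspace.bzl, since bazel will reject a mismatching file anyway. A small helper (the file name ABCD.tar.gz is just the placeholder from above):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the sha256 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# usage, compared against the sha256 field of the tf_http_archive rule:
# sha256_of("/var/www/html/ABCD.tar.gz") == "47a5cb24209c..."
```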
If you installed the wrong bazel version and want to reinstall it, remove:
rm -rf ~/.bazel
rm -rf ~/bin
rm -rf /usr/bin/bazel