Building the TensorFlow C++ API (with TensorRT support) in a conda environment

Installing TensorRT

  1. Download

    TensorRT ships in two package formats, DEB and Tar, and the choice must match how CUDA was installed: if CUDA was installed from a .deb package, download the TensorRT .deb; if CUDA was installed from a runfile, download the TensorRT .tar. I used the tar form.

    Download page: https://developer.nvidia.com/tensorrt

  2. Install

    • Activate the conda environment:

      conda activate sp_tensorflow
      
    • Put the downloaded TensorRT archive under a path of your choice and extract the tar file:

      tar -xzvf TensorRT-xxx.tar.gz
      
    • Add the library path to the environment:

      $ gedit ~/.bashrc
       
      # append the following line at the end of the file
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/TensorRT-6.0.1.5/lib # the path you extracted to
       
      $ source ~/.bashrc
      
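      Alternatively, the export line can be written so that re-sourcing ~/.bashrc never adds a duplicate entry; a sketch using this guide's example path (adjust to your own extraction path):

      ```shell
      # add the TensorRT lib dir to the loader path only if it is not already there
      TRT_LIB=/opt/TensorRT-6.0.1.5/lib
      case ":$LD_LIBRARY_PATH:" in
        *":$TRT_LIB:"*) ;;   # already present, nothing to do
        *) export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$TRT_LIB" ;;
      esac
      echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -qx "$TRT_LIB" && echo "TensorRT lib dir on LD_LIBRARY_PATH"
      ```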
    • Enter the extracted TensorRT directory

      • Install the TensorRT Python package

        cd ./python    # switch to the python folder
        pip install tensorrt-6.0.1.5-cp36-none-linux_x86_64.whl    # inside a conda env, pip defaults to the env's pip3
        

        Test whether TensorRT installed successfully:

        $ python
        >>> import tensorrt
        >>> tensorrt.__version__
        '6.0.1.5'
        
      • Install the uff package

        cd ../uff    # switch to the uff folder
        pip install uff-0.6.5-py2.py3-none-any.whl
        

        Test:

        which convert-to-uff  # prints the installation path
        
      • Install the graphsurgeon package

        cd ../graphsurgeon     # switch to the graphsurgeon folder
        pip install graphsurgeon-0.4.1-py2.py3-none-any.whl
        
  3. Verify

    To verify, try some of the sample programs under /home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/samples, remembering to read README.md first. (I tried sampleMNIST.)
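    For reference, building and running sampleMNIST looks roughly like the following; TRT_ROOT is the example path used in this guide, and the exact make behavior can differ between TensorRT releases, so check the sample's README.md:

    ```shell
    # build and run the sampleMNIST example shipped in the TensorRT tarball
    TRT_ROOT=/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5
    cd "$TRT_ROOT/samples/sampleMNIST"
    make                 # binaries land in $TRT_ROOT/bin
    cd "$TRT_ROOT/bin"
    ./sample_mnist       # should report PASSED on success
    ```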

Installing TensorFlow

Activate the virtual environment:

conda activate sp_tensorflow

Install the dependencies

pip install -U  pip six numpy wheel setuptools mock 'future>=0.17.1'
pip install -U  keras_applications --no-deps
pip install -U  keras_preprocessing --no-deps

Download the TensorFlow source and pick a version

I chose TensorFlow v2.1.1, which requires CUDA >= 10.1, cuDNN 7.6.5, and TensorRT 6.

git clone -b v2.1.1 --recursive https://github.com/tensorflow/tensorflow.git
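Before configuring, it may be worth confirming the checkout really is on the v2.1.1 tag; a quick check (assuming git is on your PATH):

```shell
# confirm the checked-out revision corresponds to the v2.1.1 tag
cd tensorflow
git describe --tags    # expect: v2.1.1
```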

Installing Bazel

  • Check tensorflow/configure.py for the required Bazel version: TensorFlow v2.1.1 supports Bazel 0.27.0 through 0.29.1. If you want a newer Bazel, change _TF_MAX_BAZEL_VERSION in tensorflow/configure.py.

  • Download the matching installer script bazel-<version>-installer-linux-x86_64.sh from https://github.com/bazelbuild/bazel/releases

  • Then run the installer:

    chmod +x bazel-<version>-installer-linux-x86_64.sh
    ./bazel-<version>-installer-linux-x86_64.sh --user
    

    The --user flag installs Bazel into the $HOME/bin directory.

  • Set up Bazel's runtime environment

    Add export PATH="$PATH:$HOME/bin" to ~/.bashrc, then, as usual, either source it or open a new terminal.
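    The PATH line can also be written defensively so that sourcing ~/.bashrc repeatedly never adds duplicates; a sketch:

    ```shell
    # put Bazel's install dir ($HOME/bin, created by the --user installer) on PATH exactly once
    case ":$PATH:" in
      *":$HOME/bin:"*) ;;                    # already present
      *) export PATH="$PATH:$HOME/bin" ;;
    esac
    bazel version 2>/dev/null | head -n1     # should report the installed Bazel version
    ```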

Building the TensorFlow source

  • First check that you have 24 GB of RAM (a build with all 8 threads running needs that much); if not, consider using swap to extend memory. Check the current swap size (16 GB here, after I enlarged it from the initial 2 GB):

    $ swapon -s
    Filename		Type		Size		Used		Priority
    /swapfile      	file    	16777212	3612052		-2
    

    To resize the swap space:

    # if a swapfile already exists (see the check above), disable it first
    sudo swapoff /swapfile
    # grow the swap file to 16G (bs=1G, count=16)
    sudo dd if=/dev/zero of=/swapfile bs=1G count=16
    # format the file as swap
    sudo mkswap /swapfile
    # enable the swapfile
    sudo swapon /swapfile
    
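    If the enlarged swapfile should survive reboots, it also needs an /etc/fstab entry (root required; skip this if the extra swap is only for this one build):

    ```shell
    # register the swapfile in /etc/fstab so it is activated on every boot
    grep -q '^/swapfile' /etc/fstab || \
        echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
    free -h    # the Swap line should show the new size
    ```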

    If you cannot enlarge memory, you will need to cap Bazel's resource usage during the build later on, e.g.:

    bazel build --config=opt --local_resources=4096,6,10 --verbose_failures //tensorflow/tools/pip_package:build_pip_package
    

    Here --local_resources=4096,6,10 limits the build to at most 4096 MB of RAM, 6 CPUs, and 10 I/O threads, while --verbose_failures prints detailed error output. Memory pressure can still be high enough to fail, though; adding --jobs=8 to limit the number of concurrent jobs can also help.

  • Run the configuration script

    ./configure
    

    A sample configuration session:

    felaim@felaim-pc:~/Documents/software/tensorflow-r1.14$ ./configure 
    WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
    You have bazel 0.24.1 installed.
    Please specify the location of python. [Default is /usr/bin/python]: /home/c2214-e/anaconda3/envs/sp_tensorflow/bin/python3.6
    # check the Python location here; Python 3 is recommended
    
    Found possible Python library paths:
     /home/c2214-e/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages
    Please input the desired Python library path to use.  Default is [/home/c2214-e/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages]
    # make sure this Python library path matches the python binary above and pip's default install location
    
    
    Do you wish to build TensorFlow with XLA JIT support? [Y/n]: 
    No XLA JIT support will be enabled for TensorFlow.
    # whether to enable XLA JIT support. XLA (Accelerated Linear Algebra) is still an experimental TensorFlow project: it uses JIT (just-in-time) compilation to analyze the graphs you create at runtime and specialize them for the actual runtime shapes and types. The technology is not yet mature, so tinkerers may answer "y"; otherwise accept the default "N".
    
    Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
    No OpenCL SYCL support will be enabled for TensorFlow.
    # OpenCL SYCL is a high-level OpenCL programming model; NVIDIA GPUs do not support it, while Intel and AMD GPUs do
    
    Do you wish to build TensorFlow with ROCm support? [y/N]: 
    No ROCm support will be enabled for TensorFlow.
    # ROCm is AMD's GPU acceleration stack, playing a similar role
    # the keras package must be installed beforehand, otherwise the build fails
    
    Do you wish to build TensorFlow with CUDA support? [y/N]: y
    CUDA support will be enabled for TensorFlow.
    # whether to build with CUDA, NVIDIA's general-purpose parallel computing architecture that lets the GPU tackle heavy computation. Answer "y" if you have an NVIDIA GPU; press Enter for "N" if you only want the CPU build.
    
    Do you wish to build TensorFlow with TensorRT support? [y/N]: y
    TensorRT support will be enabled for TensorFlow.
    # TensorRT is NVIDIA's tool for optimizing trained models; it accelerates inference
    
    Could not find any NvInferVersion.h matching version '' in any subdirectory:
            ''
            'include'
            'include/cuda'
            'include/*-linux-gnu'
            'extras/CUPTI/include'
            'include/cuda/CUPTI'
    of:
            '/home/felaim/intel/ipp/lib/intel64_lin'
            '/lib/x86_64-linux-gnu'
            '/usr'
            '/usr/lib'
            '/usr/lib/x86_64-linux-gnu'
            '/usr/lib/x86_64-linux-gnu/libfakeroot'
            '/usr/lib/x86_64-linux-gnu/mesa'
            '/usr/lib/x86_64-linux-gnu/mesa-egl'
            '/usr/local/cuda'
            '/usr/local/cuda-10.1/targets/x86_64-linux/lib'
            '/usr/local/cuda/extras/CUPTI/lib64'
            '/usr/local/lib'
    Asking for detailed CUDA configuration...
    
    Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 
    
    
    Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 
    
    
    Please specify the TensorRT version you want to use. [Leave empty to  default to TensorRT 5]: 6
    
    Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 
    
    
    Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr/local/cuda,/usr/local/cuda/bin,/usr/local/cuda/lib64,/usr/local/cuda/include,/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5,/usr/lib/x86_64-linux-gnu/,/home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/targets/x86_64-linux-gnu
    
    
    Found CUDA 10.1 in:
        /usr/local/cuda/lib64
        /usr/local/cuda/include
    Found cuDNN 7 in:
        /usr/local/cuda/lib64
        /usr/local/cuda/include
    Found TensorRT 6 in:
        /home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/lib
        /home/c2214-e/anaconda3/envs/sp_tensorflow/dz_tools/TensorRT-6.0.1.5/include
    
    Please specify a list of comma-separated CUDA compute capabilities you want to build with.
    You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
    Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 5.2]: 
    # set your NVIDIA GPU's compute capability; use the value NVIDIA lists for your device
    
    Do you want to use clang as CUDA compiler? [y/N]:   
    nvcc will be used as CUDA compiler.
    
    Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 
    
    
    Do you wish to build TensorFlow with MPI support? [y/N]: 
    No MPI support will be enabled for TensorFlow.
    
    Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: 
    # CPU optimization flags for compilation. The default is "-march=native": "m" stands for "machine" and "arch" for "architecture", so -march=native targets the local CPU, enabling whatever it supports (SSE4.2, AVX, ...). The default is recommended.
    
    Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
    Not configuring the WORKSPACE for Android builds.
    # this configures the WORKSPACE for Android NDK/SDK builds; answer N if you don't need them
    
    Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
    	--config=mkl         	# Build with MKL support.
    	--config=monolithic  	# Config for mostly static monolithic build.
    	--config=gdr         	# Build with GDR support.
    	--config=verbs       	# Build with libverbs support.
    	--config=ngraph      	# Build with Intel nGraph support.
    	--config=numa        	# Build with NUMA support.
    	--config=dynamic_kernels	# (Experimental) Build kernels into separate shared objects.
    Preconfigured Bazel build configs to DISABLE default on features:
    	--config=noaws       	# Disable AWS S3 filesystem support.
    	--config=nogcp       	# Disable GCP support.
    	--config=nohdfs      	# Disable HDFS support.
    	--config=noignite    	# Disable Apache Ignite support.
    	--config=nokafka     	# Disable Apache Kafka support.
    	--config=nonccl      	# Disable NVIDIA NCCL support.
    Configuration finished
    
  • Start the build from the tensorflow directory

    bazel build --config=opt --config=cuda --config=v2 //tensorflow/tools/pip_package:build_pip_package
    

    This build command is essentially the official one, but my build still hit various errors along the way, such as:

    bazel-out/k8-opt/bin/tensorflow/stream_executor/cuda/libcuda_stub.pic.a(cuda_stub.pic.o): In function `cudaError_enum (*(anonymous namespace)::LoadSymbol<cudaError_enum (*)(CUctx_st*)>(char const*))(CUctx_st*)':cuda_stub.cc:(.text._ZN12_GLOBAL__N_110LoadSymbolIPF14cudaError_enumP8CUctx_stEEET_PKc+0x2a): undefined reference to `tensorflow::Env::Default()'
    bazel-out/k8-opt/bin/tensorflow/stream_executor/cuda/libcuda_stub.pic.a(cuda_stub.pic.o):cuda_stub.cc:(.text._ZN12_GLOBAL__N_110LoadSymbolIPF14cudaError_enumPP8CUctx_stEEET_PKc+0x2a): more undefined references to `tensorflow::Env::Default()' follow bazel-out/k8-opt/bin/tensorflow/core/distributed_runtime/librequest_id.pic.a(request_id.pic.o): In function `tensorflow::GetUniqueRequestId()':
    

    Bazel also sometimes inexplicably went looking for a Python 2 environment while building some toolchains, which caused failures.

    To work around these errors I added the option --noincompatible_do_not_split_linking_cmdline to the build command; it lets Bazel resolve the compatibility issues itself, possibly at the cost of skipping some modules' build steps. I used it in my own build:

    bazel build --config=opt --config=cuda --noincompatible_do_not_split_linking_cmdline --config=v2 //tensorflow/tools/pip_package:build_pip_package
    

    To rebuild from scratch, run bazel clean in the previous build directory and delete the contents of ~/.cache/bazel/.
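    Those two manual steps can also be combined; bazel clean's --expunge flag removes both the build outputs and the on-disk output base under ~/.cache/bazel/ in one go:

    ```shell
    # full clean before rebuilding; --expunge also deletes the output base in ~/.cache/bazel/
    cd ~/tensorflow        # assumed location of your source checkout
    bazel clean --expunge
    ```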

    v1.x releases may additionally need one extra step:

    cd tensorflow/contrib/makefile
    sh build_all_linux.sh
    
  • Generate the .whl file

    bazel-bin/tensorflow/tools/pip_package/build_pip_package  ./tensorflow_pkg
    
  • Install the built TensorFlow with pip

    cd tensorflow_pkg
    pip install tensorflow-2.1.1-cp36-cp36m-linux_x86_64.whl
    
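    A quick smoke test of the freshly installed wheel (run inside the sp_tensorflow environment):

    ```shell
    # verify the wheel imports and was built with CUDA support
    python -c "import tensorflow as tf; print(tf.__version__)"                # expect: 2.1.1
    python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda())"  # expect: True
    ```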
  • If you also need the TensorFlow C++ API, continue with:

    bazel build --config=opt --config=cuda --noincompatible_do_not_split_linking_cmdline --config=v2 //tensorflow:libtensorflow_cc.so //tensorflow:libtensorflow_framework.so //tensorflow:install_headers
    

    Note that once the C++ API build finishes, the headers are installed under tensorflow/bazel-out/k8-opt/bin/tensorflow/include and the libraries under tensorflow/bazel-out/k8-opt/bin/tensorflow (they may instead land in bazel-genfiles, though oddly I did not have that directory). The protobuf headers are not installed, however, so you have to borrow them from the Python TensorFlow installed earlier; they live at /home/c2214-e/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages/tensorflow_core/include/google/protobuf. Without the protobuf headers you may hit the error: google/protobuf/port_def.inc: No such file or directory
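    Putting those paths together, a hypothetical compile line for a minimal C++ program might look like this; main.cc is a placeholder source file, all paths are the example locations from this guide, and $HOME/tensorflow is an assumed checkout location:

    ```shell
    # link a small program against the just-built TensorFlow C++ API
    TF_ROOT=$HOME/tensorflow
    TF_INC=$TF_ROOT/bazel-out/k8-opt/bin/tensorflow/include
    PB_INC=$HOME/anaconda3/envs/sp_tensorflow/lib/python3.6/site-packages/tensorflow_core/include
    g++ -std=c++14 main.cc \
        -I"$TF_INC" -I"$PB_INC" \
        -L"$TF_ROOT/bazel-out/k8-opt/bin/tensorflow" \
        -ltensorflow_cc -ltensorflow_framework \
        -Wl,-rpath,"$TF_ROOT/bazel-out/k8-opt/bin/tensorflow" \
        -o tf_cc_demo
    ```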

Pitfalls

  1. Some modules fail to download during the build.

    TensorFlow depends on many packages, so the build also downloads a number of dependencies. The source code lists each dependency's download URL and Bazel fetches them automatically, but when the network is bad some resources download slowly or not at all, and you have to obtain them yourself and install them locally. The tricky part is that these cases don't raise an error, only a warning like "xxx failed downloading", sometimes with the download URL attached. Two solutions:

    • First option: edit the module's download URLs (the urls field) in tensorflow/tensorflow/workspace.bzl, e.g.:

          tf_http_archive(
              name = "llvm",
              build_file = clean_dep("//third_party/llvm:llvm.autogenerated.BUILD"),
              sha256 = "47a5cb24209c24370cd4fec7bfbda8b40d5660b3c821addcfb47a405a077eee9",
              strip_prefix = "llvm-project-ecc999101aadc8dc7d4af9fd88be10fe42674aa0/llvm",
              urls = [
                  "https://github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
                  "https://mirror.bazel.build/github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
              ],
          )
      

      I could not reach https://mirror.bazel.build/github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz, so I moved that link to the end of the list.

    • Second option: download the source archive manually and place it under /var/www/html/ (the document root of a local web server). Suppose the downloaded file is ABCD.tar.gz:

      cp ABCD.tar.gz /var/www/html/
      

      Then add the local URL to the module:

          tf_http_archive(
              name = "llvm",
              build_file = clean_dep("//third_party/llvm:llvm.autogenerated.BUILD"),
              sha256 = "47a5cb24209c24370cd4fec7bfbda8b40d5660b3c821addcfb47a405a077eee9",
              strip_prefix = "llvm-project-ecc999101aadc8dc7d4af9fd88be10fe42674aa0/llvm",
              urls = [
              	"http://127.0.0.1/ABCD.tar.gz",
                  "https://github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
                  "https://mirror.bazel.build/github.com/llvm/llvm-project/archive/ecc999101aadc8dc7d4af9fd88be10fe42674aa0.tar.gz",
              ],
          )
      
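    Either way, it is worth verifying the manually downloaded archive against the sha256 recorded in the tf_http_archive rule before pointing Bazel at it:

    ```shell
    # compare the local file's hash with the sha256 field from workspace.bzl
    sha256sum /var/www/html/ABCD.tar.gz   # first column must equal the rule's sha256 value
    ```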
  2. If you installed the wrong Bazel version and want to reinstall it:

    rm -rf ~/.bazel
    rm -rf ~/bin
    rm -rf /usr/bin/bazel
    

References

  1. https://blog.csdn.net/surtol/article/details/97638399
  2. https://zhuanlan.zhihu.com/p/46566618
  3. https://blog.csdn.net/qq_26550927/article/details/104159921#_63
  4. https://cloud-atlas.readthedocs.io/zh_CN/latest/machine_learning/build_tensorflow_from_source.html#tensorflow
  5. https://blog.csdn.net/Felaim/article/details/100349318
