TensorRT Installation Notes (8.2.5)

Official site: https://developer.nvidia.com/tensorrt



0 Introduction to TensorRT

NVIDIA® TensorRT™ is an SDK for optimizing trained deep learning models to enable high-performance inference. TensorRT contains a deep learning inference optimizer for trained deep learning models, and a runtime for execution. After you have trained your deep learning model in a framework of your choice, TensorRT enables you to run it with higher throughput and lower latency.

According to the official description, TensorRT is an SDK for already-trained models that lets you run high-performance inference on NVIDIA devices. So what exactly does TensorRT do to optimize a trained model? The overview figure on the TensorRT product page illustrates this; summarized, there are six main points:

  1. Reduced Precision: quantize the model to INT8 or FP16 (with accuracy preserved or only slightly reduced) to speed up inference; see the sketch after this list.
  2. Layer and Tensor Fusion: fuse multiple layers (both horizontally and vertically) to make better use of GPU memory and bandwidth.
  3. Kernel Auto-Tuning: select the best data layouts and algorithms for the GPU platform being used.
  4. Dynamic Tensor Memory: minimize memory footprint and reuse tensor memory efficiently.
  5. Multi-Stream Execution: process multiple input streams in parallel with a scalable design.
  6. Time Fusion: optimize recurrent networks over time steps with dynamically generated kernels.
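
As a small illustration of the Reduced Precision point, the TensorRT Python API exposes a precision switch on the builder config. The following is a minimal sketch (not from the original post) that parses an ONNX file and requests FP16 kernels; resnet34.onnx and resnet34_fp16.trt are just the example file names used later in this article.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# parse the ONNX model into a TensorRT network definition
with open("resnet34.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX file")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30      # 1 GiB of scratch space for tactic selection
config.set_flag(trt.BuilderFlag.FP16)    # allow FP16 kernels (reduced precision)

# build and serialize the engine
engine_bytes = builder.build_serialized_network(network, config)
with open("resnet34_fp16.trt", "wb") as f:
    f.write(engine_bytes)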

1 Installing TensorRT

For installing TensorRT, it is best to follow the official guide directly. The latest TensorRT Quick Start Guide is here:
https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html
Or use the Quick Start Guide for a specific version (here the current latest stable release, 8.2.5):
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-825/quick-start-guide/index.html

The quick start guide lists the three installation methods below, but personally I still prefer the TAR package installation (for that and other methods see https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html):

  • Container Installation
  • Debian Installation
  • pip Wheel File Installation

1.1 pip installation (trtexec not usable)

If you are comfortable with Docker, the Container Installation is recommended; this article starts with the pip Wheel File Installation. The official quick start documentation for the pip Wheel File Installation (8.2.5) clearly states that only Python 3.6 to 3.9 and CUDA 11.x are supported, and only on the Linux operating system with an x86_64 CPU; CentOS 7 or Ubuntu 18.04 (or newer) is recommended:

The pip-installable nvidia-tensorrt Python wheel files only support Python versions 3.6 to 3.9 and CUDA 11.x at this time and will not work with other Python or CUDA versions. Only the Linux operating system and x86_64 CPU architecture is currently supported. These wheel files are expected to work on CentOS 7 or newer and Ubuntu 18.04 or newer.

Besides the requirements above, also check the GPU driver version: different CUDA versions require different minimum drivers, and the TensorRT installed here (8.2.5) requires CUDA 11.x, so make sure your GPU driver is new enough. You can check the installed driver version with the nvidia-smi command, and the correspondence between CUDA versions and driver versions is listed on NVIDIA's site:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

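If you only want the driver version itself, nvidia-smi can print it directly (a small shell sketch; the query fields below are standard nvidia-smi options):

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader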

Provided the GPU driver meets the requirement, it is recommended to first create a fresh conda virtual environment (so other environments are not affected). Here I create an environment named tensorrt with Python 3.8:

conda create -n tensorrt python=3.8

Once the virtual environment is created, activate it:

conda activate tensorrt

Next install nvidia-pyindex and nvidia-tensorrt. Note that if you do not pin the version of nvidia-tensorrt, the latest release is installed by default; this article uses version 8.2.5, so I install the currently available 8.2.5.1:

pip install nvidia-pyindex
pip install nvidia-tensorrt==8.2.5.1

After installation, follow the official steps to check that it worked: enter the Python interpreter and print the version information; as long as nothing raises an error, the installation succeeded.

import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())

However, when I later followed the official tutorial and tried to convert a model with trtexec, the tool could not be found. My guess is that the pip installation only provides the TensorRT runtime and does not ship the trtexec tool.
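
A quick way to confirm whether the tool is on your PATH in the current environment (plain shell, nothing TensorRT-specific):

which trtexec || echo "trtexec not found in PATH"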


1.2 TAR Package installation

The installation mainly follows the official guide: https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-tar
Before installing, prepare the following environment:

  • CUDA 10.2, 11.0 update 1, 11.1 update 1, 11.2 update 2, 11.3 update 1, 11.4 update 3, 11.5 update 1 or 11.6
  • cuDNN 8.3.2
  • Python 3 (Optional)

Go to the official TensorRT download page (login required).


Download the matching package; here I downloaded TensorRT 8.2 GA Update 4 for Linux x86_64 and CUDA 11.0, 11.1, 11.2, 11.3, 11.4 and 11.5 TAR Package.


After downloading, extract the archive:

tar -xzvf TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz

Extracting creates a TensorRT-8.2.5.1 folder. Add the TensorRT-8.2.5.1/lib directory to the LD_LIBRARY_PATH environment variable; note that I placed the TensorRT-8.2.5.1 folder under /root, hence /root/TensorRT-8.2.5.1/lib, so adjust this to wherever you extracted it. Likewise add the TensorRT-8.2.5.1/bin directory to PATH, since it contains the trtexec tool needed later:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/TensorRT-8.2.5.1/lib
export PATH=$PATH:/root/TensorRT-8.2.5.1/bin
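
Exports entered this way only last for the current shell session. To make them persistent you can append them to your shell startup file, for example (a sketch assuming bash and the /root/TensorRT-8.2.5.1 path used above):

echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/TensorRT-8.2.5.1/lib' >> ~/.bashrc
echo 'export PATH=$PATH:/root/TensorRT-8.2.5.1/bin' >> ~/.bashrc
source ~/.bashrc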

Next, go into the TensorRT-8.2.5.1/python folder and install the TensorRT wheel. The folder contains wheel files for different Python versions; since the virtual environment uses Python 3.8, I install the cp38 wheel:

cd TensorRT-8.2.5.1/python
pip install tensorrt-8.2.5.1-cp38-none-linux_x86_64.whl

Then go into the TensorRT-8.2.5.1/graphsurgeon folder and install the graphsurgeon wheel:

cd TensorRT-8.2.5.1/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl

Then go into the TensorRT-8.2.5.1/onnx-graphsurgeon folder and install the onnx-graphsurgeon wheel:

cd TensorRT-8.2.5.1/onnx-graphsurgeon
pip install onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl

After installation, you can again enter the Python interpreter and print the version information; as long as nothing raises an error, the installation succeeded.

import tensorrt
print(tensorrt.__version__)
assert tensorrt.Builder(tensorrt.Logger())

2 Workflow for converting a model to TensorRT

According to the official documentation, the TensorRT conversion workflow consists of the following five steps:

  1. Export the Model: export the trained model.
  2. Select A Batch Size: choose a batch size that suits your actual application.
  3. Select A Precision: choose a precision, e.g. INT8, FLOAT16, or FLOAT32.
  4. Convert The Model: convert the model.
  5. Deploy The Model: deploy the model.

Which model formats can be exported and converted to TensorRT? The documentation mentions three ways:

  1. using TF-TRT: use TF-TRT (TensorFlow-TensorRT).
  2. automatic ONNX conversion from .onnx files: convert from the common ONNX format (note that you have to convert your model to ONNX yourself first).
  3. manually constructing a network using the TensorRT API (either in C++ or Python): build the network yourself with the TensorRT API (not very beginner friendly; the difficulty is fairly high).

The workflow figure in the official docs shows the same idea: for a PyTorch model, for example, we usually convert it to the common ONNX format first, then convert that to a TensorRT engine, and finally deploy with either C++ or Python.


3 Example: converting a PyTorch model to TensorRT

As described above, the usual route for converting a PyTorch model to TensorRT is to convert it to the common ONNX format first and then to TensorRT.

3.1 Converting the PyTorch model to ONNX

Here I use the ResNet34 provided by PyTorch as an example: instantiate ResNet34 from torchvision, load the weights I trained on the flower_photos dataset, and then export to ONNX. Example code:

import torch
import torch.onnx
import onnx
import onnxruntime
import numpy as np
from torchvision.models import resnet34

device = torch.device("cpu")


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


def main():
    weights_path = "resNet34(flower).pth"
    onnx_file_name = "resnet34.onnx"
    batch_size = 1
    img_h = 224
    img_w = 224
    img_channel = 3

    # create model and load pretrain weights
    model = resnet34(pretrained=False, num_classes=5)
    model.load_state_dict(torch.load(weights_path, map_location='cpu'))

    model.eval()
    # input to the model
    # [batch, channel, height, width]
    x = torch.rand(batch_size, img_channel, img_h, img_w, requires_grad=True)
    torch_out = model(x)

    # export the model
    torch.onnx.export(model,             # model being run
                      x,                 # model input (or a tuple for multiple inputs)
                      onnx_file_name,    # where to save the model (can be a file or file-like object)
                      input_names=["input"],
                      output_names=["output"],
                      verbose=False)

    # check onnx model
    onnx_model = onnx.load(onnx_file_name)
    onnx.checker.check_model(onnx_model)

    ort_session = onnxruntime.InferenceSession(onnx_file_name)

    # compute ONNX Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
    ort_outs = ort_session.run(None, ort_inputs)

    # compare ONNX Runtime and Pytorch results
    # assert_allclose: Raises an AssertionError if two objects are not equal up to desired tolerance.
    np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)
    print("Exported model has been tested with ONNXRuntime, and the result looks good!")


if __name__ == '__main__':
    main()

Note that after the PyTorch model is converted to ONNX, the exported model is loaded back with ONNX Runtime and fed the same input, and the outputs before and after conversion are compared with np.testing.assert_allclose, where rtol is the relative tolerance and atol the absolute tolerance; if the difference exceeds the specified tolerances, an error is raised. After conversion, a resnet34.onnx file appears in the current folder.


3.2 Converting ONNX to TensorRT

There are several ways to convert ONNX to a TensorRT engine; the simplest is the trtexec tool. Section 3.1 above already converted the PyTorch ResNet34 to ONNX, so we can convert it to a TensorRT engine directly with trtexec:

trtexec --onnx=resnet34.onnx --saveEngine=trt_output/resnet34.trt

Where:

  • --onnx is the path to the generated ONNX model file
  • --saveEngine is the path where the TensorRT engine is saved (one small catch: the target directory must already exist, otherwise saving fails; see the note right after this list)
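
So before running the trtexec command above, create the output directory first:

mkdir -p trt_output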

During conversion the terminal prints output like the following:

[06/23/2022-08:08:14] [I] === Model Options ===
[06/23/2022-08:08:14] [I] Format: ONNX
[06/23/2022-08:08:14] [I] Model: /root/project/resnet34.onnx
[06/23/2022-08:08:14] [I] Output:
[06/23/2022-08:08:14] [I] === Build Options ===
[06/23/2022-08:08:14] [I] Max batch: explicit batch
[06/23/2022-08:08:14] [I] Workspace: 16 MiB
[06/23/2022-08:08:14] [I] minTiming: 1
[06/23/2022-08:08:14] [I] avgTiming: 8
[06/23/2022-08:08:14] [I] Precision: FP32
[06/23/2022-08:08:14] [I] Calibration:
[06/23/2022-08:08:14] [I] Refit: Disabled
[06/23/2022-08:08:14] [I] Sparsity: Disabled
[06/23/2022-08:08:14] [I] Safe mode: Disabled
[06/23/2022-08:08:14] [I] DirectIO mode: Disabled
[06/23/2022-08:08:14] [I] Restricted mode: Disabled
[06/23/2022-08:08:14] [I] Save engine: trt_ouput/resnet34.trt
[06/23/2022-08:08:14] [I] Load engine:
[06/23/2022-08:08:14] [I] Profiling verbosity: 0
[06/23/2022-08:08:14] [I] Tactic sources: Using default tactic sources
[06/23/2022-08:08:14] [I] timingCacheMode: local
[06/23/2022-08:08:14] [I] timingCacheFile:
[06/23/2022-08:08:14] [I] Input(s)s format: fp32:CHW
[06/23/2022-08:08:14] [I] Output(s)s format: fp32:CHW
[06/23/2022-08:08:14] [I] Input build shapes: model
[06/23/2022-08:08:14] [I] Input calibration shapes: model
......
[06/23/2022-08:08:41] [I] === Performance summary ===
[06/23/2022-08:08:41] [I] Throughput: 550.406 qps
[06/23/2022-08:08:41] [I] Latency: min = 1.85938 ms, max = 2.23706 ms, mean = 1.87513 ms, median = 1.87372 ms, percentile(99%) = 1.90234 ms
[06/23/2022-08:08:41] [I] End-to-End Host Latency: min = 1.87573 ms, max = 3.56226 ms, mean = 3.38754 ms, median = 3.47742 ms, percentile(99%) = 3.50659 ms
[06/23/2022-08:08:41] [I] Enqueue Time: min = 0.402954 ms, max = 2.53369 ms, mean = 0.68202 ms, median = 0.653564 ms, percentile(99%) = 0.830811 ms
[06/23/2022-08:08:41] [I] H2D Latency: min = 0.0581055 ms, max = 0.0943298 ms, mean = 0.063807 ms, median = 0.0615234 ms, percentile(99%) = 0.0910645 ms
[06/23/2022-08:08:41] [I] GPU Compute Time: min = 1.79099 ms, max = 2.14551 ms, mean = 1.80203 ms, median = 1.80127 ms, percentile(99%) = 1.8125 ms
[06/23/2022-08:08:41] [I] D2H Latency: min = 0.00610352 ms, max = 0.0129395 ms, mean = 0.00928149 ms, median = 0.00949097 ms, percentile(99%) = 0.0119934 ms
[06/23/2022-08:08:41] [I] Total Host Walltime: 3.00324 s
[06/23/2022-08:08:41] [I] Total GPU Compute Time: 2.97876 s
[06/23/2022-08:08:41] [I] Explanations of the performance metrics are printed in the verbose logs.

For details on how to use trtexec, run trtexec --help; for example, to build the engine with FP16 precision, just add the --fp16 flag.
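
For instance, an FP16 build of the same model could look like this (resnet34_fp16.trt is just an illustrative output name):

trtexec --onnx=resnet34.onnx --saveEngine=trt_output/resnet34_fp16.trt --fp16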


3.3 Loading the TensorRT model

This mainly follows the official notebook tutorial: https://github.com/NVIDIA/TensorRT/blob/main/quickstart/SemanticSegmentation/tutorial-runtime.ipynb

Below is a sample I wrote based on the official demo; it compares the outputs of ONNX Runtime and TensorRT.

import numpy as np
import tensorrt as trt
import onnxruntime
import pycuda.driver as cuda
import pycuda.autoinit


def normalize(image: np.ndarray) -> np.ndarray:
    """
    Normalize the image to the given mean and standard deviation
    """
    image = image.astype(np.float32)
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    image /= 255.0
    image -= mean
    image /= std
    return image


def onnx_inference(onnx_path: str, image: np.ndarray):
    # load onnx model
    ort_session = onnxruntime.InferenceSession(onnx_path)

    # compute onnx Runtime output prediction
    ort_inputs = {ort_session.get_inputs()[0].name: image}
    res_onnx = ort_session.run(None, ort_inputs)[0]
    return res_onnx


def trt_inference(trt_path: str, image: np.ndarray):
    # Load the network in Inference Engine
    trt_logger = trt.Logger(trt.Logger.WARNING)
    with open(trt_path, "rb") as f, trt.Runtime(trt_logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    with engine.create_execution_context() as context:
        # Set input shape based on image dimensions for inference
        context.set_binding_shape(engine.get_binding_index("input"), (1, 3, image.shape[-2], image.shape[-1]))
        # Allocate host and device buffers
        bindings = []
        for binding in engine:
            binding_idx = engine.get_binding_index(binding)
            size = trt.volume(context.get_binding_shape(binding_idx))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            if engine.binding_is_input(binding):
                input_buffer = np.ascontiguousarray(image)
                input_memory = cuda.mem_alloc(image.nbytes)
                bindings.append(int(input_memory))
            else:
                output_buffer = cuda.pagelocked_empty(size, dtype)
                output_memory = cuda.mem_alloc(output_buffer.nbytes)
                bindings.append(int(output_memory))

        stream = cuda.Stream()
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(input_memory, input_buffer, stream)
        # Run inference
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer prediction output from the GPU.
        cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
        # Synchronize the stream
        stream.synchronize()

        res_trt = np.reshape(output_buffer, (1, -1))

    return res_trt


def main():
    image_h = 224
    image_w = 224
    onnx_path = "resnet34.onnx"
    trt_path = "trt_output/resnet34.trt"

    image = np.random.randn(image_h, image_w, 3)
    normalized_image = normalize(image)

    # Convert the resized images to network input shape
    # [h, w, c] -> [c, h, w] -> [1, c, h, w]
    normalized_image = np.expand_dims(np.transpose(normalized_image, (2, 0, 1)), 0)

    onnx_res = onnx_inference(onnx_path, normalized_image)
    ir_res = trt_inference(trt_path, normalized_image)
    np.testing.assert_allclose(onnx_res, ir_res, rtol=1e-03, atol=1e-05)
    print("Exported model has been tested with TensorRT Runtime, and the result looks good!")


if __name__ == '__main__':
    main()

3.4 Miscellaneous

Finally, a word on model quantization. Roughly (and not rigorously) it can be split into two categories: QAT (Quantization Aware Training), where quantization is simulated during training, and PTQ (Post Training Quantization), applied after training. Since there are so many deep learning frameworks and runtimes these days (TensorFlow's TF-Lite, PyTorch's TorchScript, ONNX, TensorRT, OpenVINO, and so on), there are also plenty of quantization tools. For QAT I recommend NVIDIA's pytorch-quantization tool; for PTQ, TensorRT is a good choice when deploying on NVIDIA GPUs, and OpenVINO is worth trying when deploying on CPUs.
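
As a very rough illustration of the QAT route with pytorch-quantization (a minimal sketch only; the calibration and fine-tuning steps a real QAT workflow needs are omitted, and the package must be installed separately):

import torchvision
from pytorch_quantization import quant_modules

# Monkey-patch torch so that Conv/Linear layers created afterwards
# carry fake-quantization nodes.
quant_modules.initialize()

# Any model instantiated from now on uses the quantized module variants;
# it would then be calibrated, fine-tuned, and exported to ONNX/TensorRT.
model = torchvision.models.resnet34(pretrained=False, num_classes=5)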
