Blog

Real-time multiple model inference: DeepDetect full CUDA pipeline

10 January 2022

This blog post describes improvements to the TensorRT pipeline for desktop and embedded GPUs with DeepDetect.

Full CUDA pipeline for real-time inference

DeepDetect helps create production-grade AI / Deep Learning applications with ease, at every stage of their conception. For this reason, there’s no gap between development and production phases: we do it all with DeepDetect.

This allows working from the training phase to prototyping an application, up to final production models that run at optimal performance and perform inference in real time.

This methodology has several advantages:

  • It removes development-to-production difficulties, since it preserves the input pipelines and ensures models retain their accuracy.
  • DeepDetect automatically leverages dedicated optimizations for the underlying hardware, whether GPU, desktop CPU or embedded.
  • Production-grade inference is readily available without any changes: API calls remain identical.

This post focuses on the third point, most especially GPU inference performance, and in particular on applications with real-time requirements.

At Jolibrain we see many very different industrial applications with this use-case, mainly in two categories:

  • Applications with real-time requirements, such as virtual try-ons and virtual/augmented reality. These applications typically mix object detectors and GANs, requiring multi-model real-time inference.

  • Applications with very high throughput requirements. These industrial applications need to process point clouds and other types of sensor outputs at very high speeds, e.g. 5000 frames per second.

For these applications, DeepDetect embeds a TensorRT backend to run models on NVidia GPUs with x2 or x3 performance gains. This is the best that can be done at the moment. TensorRT is efficient, but it’s fair to note that it is not easy to set up properly, and optimization comes with many caveats, such as unsupported layers and data types, sometimes hardware dependent. For these reasons, we’ve automated it all into DeepDetect, so that these difficulties are abstracted away.

This has been efficient for a while, but as inference itself gets more optimized, and model prediction time decreases accordingly, the full input pipeline sometimes becomes the bottleneck and needs to be optimized as well for even better efficiency.

In most applications the raw data is not passed to the model directly, and some preprocessing is needed. This is typically true for images, where resizing, normalizing and format conversion are standard operations before the data hits the model.
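As a rough illustration, here is what such a CPU-side preprocessing step often looks like with plain OpenCV. This is a generic sketch, not DeepDetect code; the 512x512 size and the 1/255 normalization are placeholder values:

// illustrative CPU preprocessing: resize, BGR -> RGB conversion and normalization
#include <string>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat preprocess_cpu(const std::string &path)
{
  cv::Mat img = cv::imread(path);              // decoded as 8-bit BGR
  cv::resize(img, img, cv::Size(512, 512));    // resize to the model input size
  cv::cvtColor(img, img, cv::COLOR_BGR2RGB);   // format conversion
  cv::Mat blob;
  img.convertTo(blob, CV_32F, 1.0 / 255.0);    // normalize to [0, 1] floats
  return blob;                                 // ready to be copied to the model input
}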

These preprocessing steps can take a significant amount of time, to the point of becoming a bottleneck, most especially on edge/embedded devices with low computational power.

Preprocessing can be improved by using hardware acceleration. Most of the usual preprocessing operations are already implemented in OpenCV CUDA, which has therefore recently been added to the DeepDetect preprocessing pipelines.

To really benefit from hardware acceleration, one must be careful when moving data across devices. Data moving from CPU RAM to the GPU goes across a PCIe bus, so sending data to the GPU comes at a cost. To minimize that cost, it’s best to minimize data transfers as well.
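As a sketch of this idea with OpenCV CUDA, the image can be uploaded once, fully preprocessed on the device, and kept in GPU memory for the inference engine. Again, this is an illustration using the same placeholder values, not the actual DeepDetect implementation:

// illustrative GPU preprocessing with OpenCV CUDA: a single upload, all work on device
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>   // cv::cuda::resize
#include <opencv2/cudaimgproc.hpp>   // cv::cuda::cvtColor

cv::cuda::GpuMat preprocess_gpu(const cv::Mat &img)
{
  cv::cuda::GpuMat d_img;
  d_img.upload(img);                                         // single CPU -> GPU transfer
  cv::cuda::GpuMat d_resized, d_rgb, d_blob;
  cv::cuda::resize(d_img, d_resized, cv::Size(512, 512));    // resize on device
  cv::cuda::cvtColor(d_resized, d_rgb, cv::COLOR_BGR2RGB);   // format conversion on device
  d_rgb.convertTo(d_blob, CV_32F, 1.0 / 255.0);              // normalization on device
  return d_blob;  // stays in GPU memory, ready to feed the inference engine
}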

This optimization has been implemented for the TensorRT backend in DeepDetect. This makes sense since performance is critical at inference time, which is when TensorRT is used. So data is moved onto the GPU once and for all, while the full preprocessing and multi-model (aka /chain) pipeline now applies without leaving GPU memory.

Additionally, for high-speed video processing applications, it is now possible to link an executable directly against the C++ core of DeepDetect, which allows decoding even the final output images on the GPU. This is most useful for real-time applications that involve image segmentation models or generative models such as GANs.
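For instance, a segmentation output can be turned into a displayable 8-bit mask directly on the GPU, with no intermediate round trip to CPU memory. This is only a sketch of the technique, not DeepDetect’s internal code, and the 0.5 threshold is an arbitrary assumption:

// illustrative GPU-side decoding of a segmentation output into an 8-bit mask
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>   // cv::cuda::threshold

cv::cuda::GpuMat decode_mask_gpu(const cv::cuda::GpuMat &d_probs)  // CV_32F probabilities
{
  cv::cuda::GpuMat d_bin, d_mask;
  cv::cuda::threshold(d_probs, d_bin, 0.5, 1.0, cv::THRESH_BINARY);  // binarize on device
  d_bin.convertTo(d_mask, CV_8U, 255.0);                             // 8-bit mask, still on GPU
  return d_mask;  // still in GPU memory, ready for GPU-side encoding or display
}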

This optimized pipeline allows images to stay within GPU memory during the full predict or chain call and benefit completely from hardware acceleration.

Using hardware accelerated pipeline in DeepDetect

Our DeepDetect TensorRT docker images are now built with the full CUDA pipeline and OpenCV 4. An image predict or chain call with CUDA acceleration then uses the API almost unchanged.

First, a service is created for the tensorrt backend with DeepDetect:

# service creation
curl -X PUT http://localhost:8080/squeezenet -d '{
  "create": {
    "mllib": "tensorrt",
    "model": {
      "repository": "/opt/platform/private/squeezenet"
    },
    "parameters": {
      "input": {
        "connector": "image",
        "width": 512,
        "height": 512,
        "bbox": true
      },
      "mllib": {
        "gpu": true,
        "maxBatchSize":1,
        "maxWorkspaceSize":256,
        "datatype":"fp32",
        "nclasses": 3
      },
      "output": {}
    }
  }
}'

The "cuda":true parameters switches the prediction to the full CUDA pipeline:

# predict call
curl -X POST http://localhost:8080/predict/ -d '{
  "parameters": {
    "input": {
      "cuda":true
    },
    "mllib": {},
    "output": {
      "bbox": true,
      "best_bbox": 1
    }
  },
  "service": "squeezenet"
}'