Deploying and Scaling AI Applications with the NVIDIA TensorRT Inference Server on Kubernetes
The open source NVIDIA TensorRT Inference Server (TRTIS) is production‑ready software that simplifies deployment of AI models for speech recognition, natural language processing, recommendation systems, object detection, and more. It integrates with NGINX, Kubernetes, and Kubeflow to provide a complete solution for real‑time and offline data center AI inference. It runs inference on both GPUs and CPUs, supports all popular AI frameworks, and maximizes GPU utilization by serving multiple models per GPU and dynamically batching client requests — capabilities that are crucial for avoiding under‑ or over‑provisioning and for managing costs.
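The multi-model serving and dynamic batching described above are driven by a per-model configuration file (`config.pbtxt`) in the server's model repository. As a rough sketch (the model name, batch sizes, and instance count here are illustrative, not from the session):

```
# config.pbtxt — illustrative TRTIS model configuration
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32

# Run two copies of the model on each GPU to improve utilization
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Let the server coalesce individual client requests into larger batches
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The `dynamic_batching` block is what lets the server merge concurrent requests server-side, so clients can send single-item requests while the GPU still sees efficient batch sizes.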
In this session, Davide:
- Shows how TRTIS simplifies AI deployment in production environments, whether in the data center, in the cloud, or at the edge
- Shares best practices and a sample deployment
- Explores integration with Kubernetes, Kubeflow, Prometheus, Kubernetes autoscaling, gRPC, and the NGINX load balancer
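The Kubernetes integration in the last bullet can be sketched as a Deployment plus a Service. This is a minimal, illustrative manifest, not the session's exact deployment; the image tag, model-store path, and replica count are placeholders you would adapt:

```yaml
# Illustrative Kubernetes Deployment for TRTIS (tag and paths are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtis
spec:
  replicas: 1            # scaled up later by the Kubernetes autoscaler
  selector:
    matchLabels:
      app: trtis
  template:
    metadata:
      labels:
        app: trtis
      annotations:
        prometheus.io/scrape: "true"   # TRTIS exposes Prometheus metrics
        prometheus.io/port: "8002"
    spec:
      containers:
      - name: trtis
        image: nvcr.io/nvidia/tensorrtserver:<xx.yy>-py3   # placeholder tag
        args: ["trtserver", "--model-store=/models"]
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # metrics
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-store
          mountPath: /models
      volumes:
      - name: model-store
        emptyDir: {}            # placeholder; typically a PV or object store
---
apiVersion: v1
kind: Service
metadata:
  name: trtis
spec:
  selector:
    app: trtis
  ports:
  - name: grpc
    port: 8001
```

With this shape, Prometheus scrapes the metrics endpoint, the autoscaler adjusts `replicas` based on those metrics, and a gRPC-aware load balancer such as NGINX fronts the Service.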