TensorRT Alone Is Not Enough: Lessons from Jetson Optimization
Anyone who wants to run an AI model on an NVIDIA Jetson quickly hears the same seemingly simple advice: export the model to TensorRT, and performance will follow. In practice, that is at best the starting point.
For the ARGOS traffic monitoring system, Gradion EdgeAI built a complete inference pipeline on the Jetson NX — and learned that real optimization goes far beyond model conversion alone.
The Reality Behind the Benchmark Number
Quantizing a YOLO model to TensorRT int8 and running it on a Jetson NX produces impressive numbers. In theory.
The reality of field deployment looks different: a model that runs smoothly on the bench can still fail in production at the level of the overall pipeline, whether at the camera interface, in the tracker, in the data forwarding, or in the interaction of all components under load.
With ARGOS, the requirements were clear: detect, track, and count 11 vehicle classes in real time. At intersections, on multi-lane roads, at night, and in rain. The target: 15 FPS at under 200 milliseconds end-to-end latency — from camera frame to processed result. Over weeks of unattended operation, on hundreds of autonomous sensor units in the field.
What counts here is not the best benchmark, but the most reliable overall solution.
The Pipeline, Not Just the Model
The first decision was the choice of inference framework. Three options were evaluated:
- Python/OpenCV: Too slow for real-time processing on the Jetson NX
- Google MediaPipe: Strong framework, but not optimized for Jetson hardware — GPU acceleration does not engage the way NVIDIA’s own stack does
- NVIDIA DeepStream: Leverages the Jetson’s hardware acceleration directly through GStreamer plugins and integrates seamlessly with TensorRT
DeepStream was the right choice. For object detection, YOLOv4 is used via the DeepStream-YOLO plugin, which integrates the model directly into the DeepStream pipeline. The decisive advantage: a single inference pass delivers detection and classification simultaneously. No two-stage process, no additional overhead.
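To make this concrete, here is a minimal sketch of the detection part of such a pipeline, built with the GStreamer Python bindings. It is illustrative only: element properties, resolution, and the nvinfer config path (which in a real setup would reference the TensorRT engine and the DeepStream-YOLO custom parser library) are assumptions, not the actual ARGOS configuration.

```python
# Minimal DeepStream detection pipeline sketch (Python/GStreamer bindings).
# Properties and the config path are illustrative assumptions.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Camera -> batching (nvstreammux) -> TensorRT inference (nvinfer) -> sink.
# nvinfer reads its config file, which points to the engine and the
# DeepStream-YOLO custom bounding-box parser.
pipeline = Gst.parse_launch(
    "nvarguscamerasrc ! "
    "video/x-raw(memory:NVMM),width=1280,height=720,framerate=30/1 ! "
    "m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=config_infer_yolo.txt ! "
    "nvvideoconvert ! nvdsosd ! fakesink sync=false"
)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```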
Why int8 — and Why the Tracker Matters More Than Expected
For TensorRT optimization, int8 quantization was chosen deliberately over FP16. The reason: on the Jetson NX with its limited resources, int8 provided the right balance between inference speed and detection accuracy.
FP16 would have left more headroom on accuracy, but the performance gain from int8 was critical to hitting the 15 FPS target reliably.
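For illustration, this is roughly what an int8 engine build looks like with the TensorRT Python API, assuming an ONNX export of the model. All file names, the input shape, and the calibrator are placeholders: int8 needs a representative set of calibration frames, fed here through a hypothetical pycuda-backed calibrator rather than anything ARGOS-specific.

```python
# Sketch: building an int8 TensorRT engine from an ONNX export.
# File names, shapes, and the calibrator are placeholders.
import glob
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)

class FolderCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed frames (saved as .npy batches) to calibration."""

    def __init__(self, npy_files, batch_shape=(1, 3, 608, 608)):
        super().__init__()
        self.files = iter(npy_files)
        self.batch_size = batch_shape[0]
        self.dev_mem = cuda.mem_alloc(int(np.prod(batch_shape)) * 4)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(np.load(next(self.files)),
                                         dtype=np.float32)
        except StopIteration:
            return None  # no more data -> calibration finished
        cuda.memcpy_htod(self.dev_mem, batch)
        return [int(self.dev_mem)]

    def read_calibration_cache(self):
        return None  # always recalibrate in this sketch

    def write_calibration_cache(self, cache):
        pass

builder = trt.Builder(LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, LOGGER)
with open("yolov4.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = FolderCalibrator(sorted(glob.glob("calib/*.npy")))

with open("yolov4_int8.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```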
The Tracker as Critical Component
What many underestimate: in a production pipeline, the tracker is at least as critical as the detection model itself.
ARGOS uses a tracking-by-detection approach: the model detects objects in each frame, and the tracker correlates these detections across successive frames into trajectories. For this purpose, a custom Luenberger Observer was developed: a Kalman-style state observer tuned specifically for the requirements of traffic monitoring.
Why not a standard tracker? In practice, stability issues arose with common tracking algorithms, particularly with occlusions at intersections and under changing lighting conditions.
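To illustrate the idea (not the actual ARGOS implementation, whose model and gains are tuned to real traffic data), here is a minimal fixed-gain observer for a single track under a constant-velocity motion model. The key difference from a full Kalman filter is that the gain L is fixed offline, for example by pole placement, instead of being recomputed from covariances every frame, which keeps the per-frame cost trivial and the behavior predictable.

```python
# Minimal sketch of a fixed-gain (Luenberger-style) observer for one track
# under a constant-velocity model. Gains and shapes are illustrative, not
# the tuned ARGOS values.
import numpy as np

DT = 1.0 / 15.0  # frame interval at the 15 FPS target

# State [x, y, vx, vy]; the detector measures position (x, y) only.
A = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
# Fixed observer gain, chosen offline (e.g. via pole placement). A Kalman
# filter would instead recompute this gain from covariances every frame.
L = np.array([[0.6, 0.0],
              [0.0, 0.6],
              [2.0, 0.0],
              [0.0, 2.0]], dtype=float)

class Track:
    def __init__(self, xy):
        self.state = np.array([xy[0], xy[1], 0.0, 0.0])

    def predict(self):
        # Propagate even when the detector misses the object (occlusion):
        # the track coasts on its estimated velocity.
        self.state = A @ self.state
        return C @ self.state  # predicted position, used for association

    def correct(self, xy):
        # Luenberger update: state += L * (measurement - predicted position)
        innovation = np.asarray(xy, dtype=float) - C @ self.state
        self.state = self.state + L @ innovation
```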
The Decoupling Makes the Difference
A detail that never appears in benchmark discussions: what happens when downstream processing is slower than inference?
In ARGOS, the inference pipeline runs in C++/GStreamer while data processing runs in Python. Between them sits a Redis queue. The effect: if the data collector slows down, hits a network problem, or briefly goes offline, the inference pipeline keeps running uninterrupted.
This decoupling was not a theoretical architecture decision — it was the answer to real problems discovered during field testing. Without the Redis buffer, a brief network timeout would have blocked the entire pipeline.
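As a sketch of the pattern (the key name, message format, and buffer size are assumptions, not the ARGOS schema): the pipeline side appends results and trims the list so a long outage cannot grow it without bound, while the Python collector blocks on BLPOP and simply resumes wherever it left off.

```python
# Sketch of the Redis decoupling pattern. Key name, message format, and
# buffer size are illustrative assumptions.
import json
import redis

QUEUE = "argos:detections"
r = redis.Redis(host="localhost", port=6379)

def push_result(event, maxlen=10_000):
    # Producer side (in ARGOS this lives in the C++/GStreamer pipeline):
    # append, then keep only the newest `maxlen` entries so a long consumer
    # outage cannot exhaust memory on the Jetson.
    pipe = r.pipeline()
    pipe.rpush(QUEUE, json.dumps(event))
    pipe.ltrim(QUEUE, -maxlen, -1)
    pipe.execute()

def run_collector():
    # Consumer side: blocks until data is available. If this process stalls
    # or restarts, the inference pipeline keeps filling the queue unimpeded.
    while True:
        _key, raw = r.blpop(QUEUE)
        event = json.loads(raw)
        print(event)  # placeholder for the real data-collector logic
```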
What Comes Next
The pipeline is running: 15 FPS, under 200 milliseconds latency, hundreds of units in the field.
But the work does not stop there. For the next generation, the roadmap includes:
- Switching from YOLOv4 to RT-DETR (transformer-based)
- Moving from the Luenberger Observer to ByteTrack
- Migrating to the Jetson Orin NX as the new hardware platform
Conclusion
Model optimization for Jetson is not a single step; it is a pipeline problem. TensorRT int8 is an important building block, but without the right framework, the appropriate tracker, and robust decoupling between inference and data processing, the performance exists only on paper.
Taking Edge AI to production requires optimizing the entire chain.
About Gradion EdgeAI: Gradion EdgeAI takes Edge AI products from prototype to production — with deep NVIDIA Jetson expertise, production-hardened architecture, and the reliability of a 600-engineer organization.