Triton Inference Server Deployment Guide
This guide walks through deploying an image classification pipeline using an ensemble model in NVIDIA Triton Inference Server. The architecture offloads preprocessing to a Python backend and classification to a TorchScript model.
Table of Contents
- Inference Pipeline Diagram
- Triton Model Repository Structure
- Model Configurations
- Run the Server
- Arguments Explained
- Sample Client Snippet
- Debugging Tips
- Limitations
Inference Pipeline Diagram
```mermaid
flowchart LR
    subgraph CLIENT["CLIENT"]
        A["User uploads images via FastAPI"]
        B["FastAPI reads image bytes"]
        C["read_and_pad_images → NumPy array"]
        D["gRPC call to Triton Inference Server: ensemble_model"]
    end
    subgraph subGraph1["TRITON SERVER"]
        E["Ensemble model receives RAW_IMAGE"]
        F1["Step 1: Preprocessor model"]
    end
    subgraph subGraph2["TRITON PREPROCESSOR - Python backend"]
        G1["Decode JPEG with OpenCV"]
        H1["Convert BGR → RGB → Torch tensor"]
        I1["Apply transforms: Resize → ToImage → Normalize"]
        J1["Move to CPU → convert to NumPy"]
        K1["Output: PREPROCESSED_IMAGE"]
    end
    subgraph subGraph3["CLASSIFIER - TorchScript"]
        F2["Step 2: Classifier model"]
        G2["Run forward pass"]
        H2["Generate prediction"]
    end
    subgraph CLIENT_RESPONSE["CLIENT RESPONSE"]
        I["Return prediction to FastAPI"]
        J["FastAPI sends JSON response to user"]
    end
    A --> B --> C --> D
    D --> E --> F1
    F1 --> G1 --> H1 --> I1 --> J1 --> K1 --> F2
    F2 --> G2 --> H2 --> I --> J
```
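The `read_and_pad_images` step in the diagram belongs to the FastAPI client. A minimal sketch of what it might look like is below; the function body is an assumption based on the diagram, since encoded images have different byte lengths and must be zero-padded to a common length before they can be stacked into one UINT8 batch for the `RAW_IMAGE` input.

```python
import numpy as np

def read_and_pad_images(image_bytes_list):
    """Hypothetical helper matching the diagram.

    Each element of image_bytes_list is the raw bytes of one encoded
    image (JPEG/PNG). Returns a (batch, max_len) uint8 array suitable
    for the ensemble's RAW_IMAGE input (dims: [-1] per image).
    """
    max_len = max(len(b) for b in image_bytes_list)
    batch = np.zeros((len(image_bytes_list), max_len), dtype=np.uint8)
    for i, b in enumerate(image_bytes_list):
        # Zero-pad each byte string to the longest one in the batch.
        batch[i, : len(b)] = np.frombuffer(b, dtype=np.uint8)
    return batch
```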
Triton Model Repository Structure
The repository structure depends on the model backend.
pytorch_libtorch
For a pytorch_libtorch (TorchScript) model, organize the repository as follows:
```
models/
├── ensemble_model/
│   ├── 1/                  # empty version directory
│   └── config.pbtxt
├── preprocessor/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
└── classifier/
    ├── 1/
    │   └── model.pt
    └── config.pbtxt
```
Model Configurations
preprocessor/config.pbtxt
name: "preprocess"
backend: "python"
max_batch_size: 4096
input [
{
name: "RAW_IMAGE"
data_type: TYPE_UINT8
dims: [-1]
}
]
output [
{
name: "PREPROCESSED_IMAGE"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
instance_group [
{
kind: KIND_GPU
}
]
dynamic_batching {
preferred_batch_size: [8, 16, 32, 64]
max_queue_delay_microseconds: 100
}
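The repository tree includes `preprocessor/1/model.py`, but its contents are not shown in this guide. The sketch below is one possible implementation matching the diagram (OpenCV decode, BGR → RGB, torchvision v2 transforms); the added `ToDtype` step, the resize target, and the ImageNet normalization constants are assumptions.

```python
# preprocessor/1/model.py (sketch)
import cv2
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from torchvision.transforms import v2


class TritonPythonModel:
    def initialize(self, args):
        # Resize -> ToImage -> Normalize, as in the diagram; ToDtype is
        # added so Normalize receives a float tensor scaled to [0, 1].
        self.transforms = v2.Compose([
            v2.ToImage(),
            v2.Resize((224, 224)),
            v2.ToDtype(torch.float32, scale=True),
            v2.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
        ])

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "RAW_IMAGE").as_numpy()
            images = []
            for encoded in raw:  # one (padded) byte row per image in the batch
                img = cv2.imdecode(np.frombuffer(encoded.tobytes(), np.uint8),
                                   cv2.IMREAD_COLOR)        # decode JPEG/PNG (BGR)
                img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # BGR -> RGB
                images.append(self.transforms(img))          # -> (3, 224, 224) float32
            out = torch.stack(images).cpu().numpy()
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("PREPROCESSED_IMAGE", out)
            ]))
        return responses
```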
classifier/config.pbtxt
name: "classifier"
platform: "pytorch_libtorch" # To tell which backend to use
max_batch_size: 4096 # Maximum Batch Size to expect
instance_group [
{
count: 2 # To tell how many copies of the model you want
kind: KIND_GPU # CPU or GPU
gpus: [0, 1] # How many GPU to expect. [0] means one 1 GPU
}
]
dynamic_batching { # Change this according to your needs
preferred_batch_size: [32, 64, 128, 256, 512, 1024]
max_queue_delay_microseconds: 100
}
input [ # Change this according to your model
{
name: "input__0"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [3, 224, 224]
}
]
output [ # Change this according to your model
{
name: "output__0"
data_type: TYPE_FP32
dims: [5]
}
]
response_cache { # Optional
enable: true
}
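The classifier directory expects a TorchScript file at `classifier/1/model.pt`. A minimal export sketch is below; the ResNet-18 backbone and 5-class head are placeholder assumptions chosen only to match the `[3, 224, 224]` input and `[5]` output declared in the config, so substitute your own trained network.

```python
import torch
import torchvision

# Hypothetical 5-class classifier; swap in your own trained model.
model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 5)
model.eval()

# Trace with a dummy NCHW batch matching the config's input dims [3, 224, 224].
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)
traced.save("models/classifier/1/model.pt")
```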
ensemble_model/config.pbtxt
name: "ensemble_model" # Name of the ensemble model exposed to Triton clients
platform: "ensemble" # Specifies this is an ensemble model, not a standard ML model
input [ # Define the input expected by the ensemble pipeline
{
name: "RAW_IMAGE" # Input name exposed to the client, matches the input of the first step (preprocessor)
data_type: TYPE_UINT8 # Raw image bytes (e.g., JPEG/PNG in bytes)
dims: [ -1 ] # Flat bytes array per image; handled by the preprocessor Python backend
}
]
output [ # Final output of the ensemble pipeline that gets returned to the client
{
name: "output__0" # Must match the output name from the final step (classifier model)
data_type: TYPE_FP32 # Probabilities or logits output (e.g., for classification)
dims: [5] # Example: 5-class classification output
}
]
ensemble_scheduling { # Defines the flow of inference across multiple models in this pipeline
step [ # Ordered steps to execute models sequentially
{
model_name: "preprocessor" # First step: Python backend model that decodes and preprocesses image
model_version: -1 # Use the latest version available
input_map { # Maps the ensemble input to the preprocessor model's input
key: "RAW_IMAGE" # Preprocessor model's input
value: "RAW_IMAGE" # Connect it to the ensemble input
}
output_map { # Maps the output of the preprocessor to the next step
key: "PREPROCESSED_IMAGE" # Preprocessor model's output
value: "input__0" # Connects to the input of the classifier model
}
},
{
model_name: "classifier" # Second step: TorchScript model that takes preprocessed tensor and returns predictions
model_version: -1 # Use the latest version available
input_map { # Maps preprocessed image to classifier input
key: "input__0" # Classifier model's input
value: "input__0" # From preprocessor output
}
output_map { # Final output of the pipeline
key: "output__0" # Classifier model's output
value: "output__0" # Ensemble model's output returned to client
}
}
] # End of steps
}
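With the server running (see the next section), the ensemble can also be called directly over gRPC on port 8001 instead of going through FastAPI. The sketch below uses the `tritonclient` package; the single-image batch and the `sample.jpg` filename are illustrative.

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Load one encoded image and shape it as a (batch, bytes) uint8 array,
# matching the ensemble's RAW_IMAGE input (dims [-1] per image).
with open("sample.jpg", "rb") as f:
    raw = np.frombuffer(f.read(), dtype=np.uint8)
batch = raw[np.newaxis, :]  # batch of 1; pad to equal length for larger batches

inp = grpcclient.InferInput("RAW_IMAGE", batch.shape, "UINT8")
inp.set_data_from_numpy(batch)
out = grpcclient.InferRequestedOutput("output__0")

result = client.infer(model_name="ensemble_model", inputs=[inp], outputs=[out])
print(result.as_numpy("output__0"))  # shape (1, 5): logits or probabilities
```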
Run the Triton Server
Run the following Docker command to start the Triton Inference Server on a specific GPU:
```bash
docker run --gpus="device=1" --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ~/models:/models \
  nvcr.io/nvidia/tritonserver:24.02-py3 \
  tritonserver --model-repository=/models
```
To also install the Python libraries needed by the preprocessor's Python backend before starting the server:
```bash
docker run --gpus="device=3" --rm --shm-size=4g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:24.02-py3 \
  bash -c "pip install numpy torchvision opencv-python-headless && tritonserver --model-repository=/models"
```
Arguments Explained
- `--gpus="device=1"` → use GPU 1; change this to `--gpus=all` to use all GPUs
- `-v ~/models:/models` → mount the local model repository into the container
- `--model-repository=/models` → path to the model repository inside the container
Ports:
- 8000: HTTP → REST API (health, config, and inference endpoints)
- 8001: gRPC
- 8002: Prometheus metrics
Sample client snippet
```python
import requests

# Send an image to the FastAPI endpoint, which forwards it to Triton over gRPC
with open("sample.jpg", "rb") as f:
    response = requests.post("http://localhost:8000/predict", files=[("files", f)])

print(response.json())
```
Debugging tips
- 🔍 Use `curl localhost:8000/v2/health/ready` to verify that Triton is live and ready.
- 🧠 To debug model loading issues, run Triton with `--log-verbose=1`.
- 📦 Use `curl localhost:8000/v2/models/ensemble_model/config` to verify the loaded model configuration.
- 🔄 Add retry logic in the client when testing gRPC batch loads (see the sketch below).
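Such retry logic can be a simple wrapper around `client.infer`. A minimal sketch follows; the retry count, backoff, and hard-coded `ensemble_model` name are illustrative choices.

```python
import time

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


def infer_with_retry(client, inputs, outputs, retries=3, backoff_s=1.0):
    """Retry transient gRPC failures with a simple linear backoff."""
    for attempt in range(1, retries + 1):
        try:
            return client.infer(model_name="ensemble_model",
                                inputs=inputs, outputs=outputs)
        except InferenceServerException as exc:
            if attempt == retries:
                raise  # give up after the final attempt
            print(f"attempt {attempt} failed ({exc}); retrying in {backoff_s}s")
            time.sleep(backoff_s)
```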
Limitations
- FastAPI (client side): maximum number of files per request is 1000.
- Triton gRPC client: maximum request size is 2 GB.