- Understand how to configure and run AI models directly on-device using the Unity Inference Engine.
- Learn how to convert, serialize, and quantize .onnx models into optimized .sentis assets for faster performance.
- Implement model warm-up routines to prevent stutters and ensure smooth runtime initialization.
- Optimize inference with GPU-based Non-Max Suppression (NMS) and Split Over Frames scheduling.
- Transform 2D object detections into spatially anchored 3D visualizations using DepthTextureAccess.
On-device inference allows AI models to run directly on Meta Quest headsets,
eliminating network dependencies and enabling low-latency processing. This
capability is powered by the
Unity Inference Engine
(formerly Unity Sentis), which executes ONNX or Sentis models efficiently on
CPU or GPU backends. Running inference on-device enables instant feedback,
offline operation, and complete data privacy.
Why Run Models On-Device

| Benefit | Description |
| --- | --- |
| Offline Operation | Works fully offline — essential for exhibitions, enterprise, and privacy-sensitive apps. |
| Ultra-Low Latency | All computation runs locally, removing network delays. |
| Full Privacy | Sensitive inputs like passthrough images never leave the device. |
| Deterministic Performance | Performance remains stable regardless of network or server load. |
The UnityInferenceEngineProvider
The UnityInferenceEngineProvider is the bridge between Unity and the on-device
inference runtime. It wraps your AI model asset (for example, .onnx or
.sentis) and provides configuration options for execution backend, frame
scheduling, and GPU-based post-processing.
Inspector Parameters

| Field | Description |
| --- | --- |
| Model Asset | The trained model file (.onnx or .sentis). |
| Backend Type | Choose between CPU and GPUCompute backends. |
| Split Over Frames | Run a portion of the model per frame to maintain framerate. |
| Layers Per Frame | Number of layers to execute each frame (used when splitting). |
| NMS Compute Shader | Optional GPU Non-Max Suppression for faster bounding box filtering. |
| Class Labels File | Optional .txt file mapping class indices to readable labels. |
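To illustrate how Split Over Frames and Layers Per Frame work together, here is a minimal sketch using the iterable scheduling API (`ScheduleIterable`, as in Unity Sentis; verify the method name against your installed Inference Engine version — the `worker`, `input`, and `layersPerFrame` names are illustrative):

```csharp
using System.Collections;
using Unity.InferenceEngine;

// Spreads a single inference across several frames: advance a fixed
// number of model layers per frame instead of executing the whole
// graph in one go, keeping the framerate stable.
IEnumerator RunSplitOverFrames(Worker worker, Tensor<float> input, int layersPerFrame)
{
    IEnumerator schedule = worker.ScheduleIterable(input);
    int layer = 0;
    while (schedule.MoveNext())
    {
        if (++layer % layersPerFrame == 0)
            yield return null; // resume scheduling on the next frame
    }
}
```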
Model Conversion, Serialization, and Quantization
This guide applies when you plan to use your own models, for example with the Object Detection Building Block. Converting, serializing, and quantizing
your models are key steps to prepare them for efficient runtime execution in
Unity. These optimizations ensure faster load times, lower memory usage, and
consistent performance across devices. The following sections explain how to
clean up ONNX models, convert them into Unity’s optimized .sentis format,
and optionally serialize or quantize them for deployment.
To make this process easier, Meta provides an editor window located at Meta →
Tools → Unity Inference Engine → ONNX → Sentis Converter, which allows you to
import, clean up, quantize, and export your ONNX models as optimized
.sentis assets with just a few clicks.
1. Quantize to Reduce Size
Quantization compresses your model by storing weights in lower precision formats
(Float16 or Uint8). This reduces file size and memory usage with minimal
accuracy loss.
| Type | Bits | Description |
| --- | --- | --- |
| None | 32 | Full precision (default) |
| Float16 | 16 | Half precision, preserves most accuracy |
| Uint8 | 8 | Highly compact, may slightly reduce accuracy |
You can quantize and serialize models directly from
code:
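As a sketch, assuming the `ModelQuantizer` and `ModelWriter` APIs carried over from Unity Sentis (verify the names against your installed Inference Engine version), quantizing and serializing from a script might look like this:

```csharp
using Unity.InferenceEngine; // formerly Unity.Sentis
using UnityEngine;

public static class ModelQuantizeExample
{
    // Quantizes a loaded model's weights to Float16 and saves it
    // as a .sentis file under StreamingAssets.
    public static void QuantizeAndSave(ModelAsset modelAsset)
    {
        Model model = ModelLoader.Load(modelAsset);
        ModelQuantizer.QuantizeWeights(QuantizationType.Float16, ref model);
        ModelWriter.Save(Application.streamingAssetsPath + "/mymodel.sentis", model);
    }
}
```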
2. Convert ONNX to Sentis
Most ONNX models require cleanup before use in Unity. Use the OnnxModelConverterEditor (Window → Meta → AI → ONNX Model Converter) to:
- Import your .onnx model.
- Apply cleanup options (for example, Softmax or NMS removal).
- Choose your Quantization Type (None, Float16, Uint8).
- Enter the desired path and name for your converted model.
- Press Convert to Sentis.
This generates a .sentis asset optimized for the Unity runtime.
3. Serialize and Load Models
Optionally, for large models, create a serialized asset to speed up loading:
1. In the Project window, select your ONNX model.
2. In the Inspector, click Serialize to StreamingAssets.
3. Unity generates a .sentis file inside your StreamingAssets folder.
You can then load it at runtime:
```csharp
using Unity.InferenceEngine;

Model model = ModelLoader.Load(Application.streamingAssetsPath + "/mymodel.sentis");
```
Advantages of Serialization:
- Faster load times and smaller project size
- Unity-validated format (guaranteed compatibility)
- Easier to share between projects
Runtime Initialization and Warm-Up
When a model first runs, Unity Inference Engine must allocate buffers, compile
GPU kernels, and upload weights. This can cause a one-time delay of several
seconds at startup.
Always perform a warm-up inference during loading or splash screens:
```csharp
IEnumerator Start()
{
    // modelAsset is a ModelAsset field assigned in the Inspector;
    // worker is a class field so it stays alive across frames.
    var model = ModelLoader.Load(modelAsset);
    worker = new Worker(model, BackendType.GPUCompute);
    using var input = new Tensor<float>(new TensorShape(1, 3, 224, 224));
    worker.Schedule(input); // first run allocates buffers and compiles kernels
    yield return null;
    Debug.Log("Model warmed up and ready");
}
```
Best Practices:
- Warm up before gameplay starts.
- Keep the worker alive across frames.
- Dispose workers only on scene unload.
- For multiple models (for example, STT + detection), warm them up sequentially.
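For the multi-model case, warming up sequentially might look like the following sketch (the 1×3×224×224 input shape and the `Worker` list are illustrative — use each model's actual input shape):

```csharp
using System.Collections;
using System.Collections.Generic;
using Unity.InferenceEngine;

// Warms up several workers one after another so their one-time
// initialization costs (buffer allocation, kernel compilation)
// don't all land in the same frame.
IEnumerator WarmUpSequentially(List<Worker> workers)
{
    foreach (var worker in workers)
    {
        using var input = new Tensor<float>(new TensorShape(1, 3, 224, 224));
        worker.Schedule(input);
        yield return null; // give the GPU a frame between warm-ups
    }
}
```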
Non-Max Suppression (NMS)
Object detectors often output multiple overlapping boxes for the same object.
Non-Max Suppression (NMS) filters these out, keeping only the most confident
ones.
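Conceptually, NMS greedily keeps the highest-scoring box and discards any remaining box whose overlap (intersection over union, IoU) with a kept box exceeds a threshold. A minimal CPU sketch of the algorithm, with illustrative types (not the API of the package's GPU implementation):

```csharp
using System.Collections.Generic;
using System.Linq;

public struct Box { public float X, Y, W, H, Score; }

public static class NmsSketch
{
    // Intersection-over-Union of two axis-aligned boxes.
    static float IoU(Box a, Box b)
    {
        float x1 = System.Math.Max(a.X, b.X);
        float y1 = System.Math.Max(a.Y, b.Y);
        float x2 = System.Math.Min(a.X + a.W, b.X + b.W);
        float y2 = System.Math.Min(a.Y + a.H, b.Y + b.H);
        float inter = System.Math.Max(0, x2 - x1) * System.Math.Max(0, y2 - y1);
        float union = a.W * a.H + b.W * b.H - inter;
        return union <= 0 ? 0 : inter / union;
    }

    // Greedy NMS: keep the most confident box, drop boxes that
    // overlap an already-kept box above the IoU threshold.
    public static List<Box> Run(IEnumerable<Box> boxes, float iouThreshold = 0.5f)
    {
        var kept = new List<Box>();
        foreach (var box in boxes.OrderByDescending(b => b.Score))
            if (kept.All(k => IoU(k, box) < iouThreshold))
                kept.Add(box);
        return kept;
    }
}
```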
How to efficiently run NMS on the GPU
Some ONNX models, such as YOLOX, include a CPU-based NonMaxSuppression op that can cause performance bottlenecks. If you have ever tried to run YOLO on the CPU backend — or the GPU backend, for that matter — you have likely experienced significant frame drops. Simply switching the backend to GPU does not solve the problem, because the NonMaxSuppression op still executes on the CPU. On the GPUCompute backend you will instead notice that the detection results are not filtered at all, so the model outputs a large number of bounding boxes for the same object.
To tackle this, post-process detections with the provided GPU NMS implementation:
GpuNMS.cs
NMSCompute.compute
GpuNMS.cs
NMSCompute.compute
These run NMS entirely on the GPU, avoiding GPU-to-CPU sync stalls. Note that for this to work, the NMS layer must be removed from the model. This happens automatically when you convert your model to the .sentis format using the OnnxModelConverterEditor with the Remove NMS option checked.
CPU Bottlenecks with Unity Inference Engine
You will still notice performance spikes, even though the Object Detection Building Block removes the NMS layer and runs it on the `GPUCompute` backend. This is because all results are still copied from the GPU back to the CPU in order to filter and place the bounding boxes in 3D. We are resolving this bottleneck for a future release.