- Understand how to configure and run AI models directly on-device using the Unity Inference Engine.
- Learn how to convert, serialize, and quantize .onnx models into optimized .sentis assets for faster performance.
- Implement model warm-up routines to prevent stutters and ensure smooth runtime initialization.
- Optimize inference with GPU-based Non-Max Suppression (NMS) and Split Over Frames scheduling.
- Transform 2D object detections into spatially anchored 3D visualizations using DepthTextureAccess.
On-device inference allows AI models to run directly on Meta Quest headsets,
eliminating network dependencies and enabling low-latency processing. This
capability is powered by the
Unity Inference Engine
(formerly Unity Sentis), which executes ONNX or Sentis models efficiently on
CPU or GPU backends. Running inference on-device enables instant feedback,
offline operation, and complete data privacy.
Why Run Models On-Device

| Benefit | Description |
| --- | --- |
| Offline Operation | Works fully offline — essential for exhibitions, enterprise, and privacy-sensitive apps. |
| Ultra-Low Latency | All computation runs locally, removing network delays. |
| Full Privacy | Sensitive inputs like passthrough images never leave the device. |
| Deterministic Performance | Performance remains stable regardless of network or server load. |
The UnityInferenceEngineProvider
The UnityInferenceEngineProvider is the bridge between Unity and the on-device
inference runtime. It wraps your AI model asset (for example, .onnx or
.sentis) and provides configuration options for execution backend, frame
scheduling, and GPU-based post-processing.
Inspector Parameters

| Field | Description |
| --- | --- |
| Model Asset | The trained model file (.onnx or .sentis). |
| Backend Type | Choose between CPU and GPUCompute backends. |
| Split Over Frames | Run a portion of the model per frame to maintain framerate. |
| Layers Per Frame | Number of layers to execute each frame (used when splitting). |
| NMS Compute Shader | Optional GPU Non-Max Suppression for faster bounding box filtering. |
| Class Labels File | Optional .txt file mapping class indices to readable labels. |
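To illustrate how Split Over Frames and Layers Per Frame work together, here is a minimal sketch using the iterable scheduling API (`ScheduleIterable`, as in Unity Sentis; verify the method name against your installed Inference Engine version — the `worker`, `input`, and `layersPerFrame` names are illustrative):

```csharp
using System.Collections;
using Unity.InferenceEngine;

// Spreads a single inference across several frames: advance a fixed
// number of model layers per frame instead of executing the whole
// graph in one go, keeping the framerate stable.
IEnumerator RunSplitOverFrames(Worker worker, Tensor<float> input, int layersPerFrame)
{
    IEnumerator schedule = worker.ScheduleIterable(input);
    int layer = 0;
    while (schedule.MoveNext())
    {
        if (++layer % layersPerFrame == 0)
            yield return null; // resume scheduling on the next frame
    }
}
```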
Model Conversion, Serialization, and Quantization
This guide applies when you plan to use your own models, for example with the Object Detection Building Block. Converting, serializing, and quantizing
your models are key steps to prepare them for efficient runtime execution in
Unity. These optimizations ensure faster load times, lower memory usage, and
consistent performance across devices. The following sections explain how to
clean up ONNX models, convert them into Unity’s optimized .sentis format,
and optionally serialize or quantize them for deployment.
To make this process easier, Meta provides an editor window located at Meta →
Tools → Unity Inference Engine → ONNX → Sentis Converter, which allows you to
import, clean up, quantize, and export your ONNX models as optimized
.sentis assets with just a few clicks.
1. Quantize to Reduce Size
Quantization compresses your model by storing weights in lower precision formats
(Float16 or Uint8). This reduces file size and memory usage with minimal
accuracy loss.
| Type | Bits | Description |
| --- | --- | --- |
| None | 32 | Full precision (default) |
| Float16 | 16 | Half precision, preserves most accuracy |
| Uint8 | 8 | Highly compact, may slightly reduce accuracy |
You can quantize and serialize models directly from
code:
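As a sketch, assuming the `ModelQuantizer` and `ModelWriter` APIs carried over from Unity Sentis (verify the names against your installed Inference Engine version), quantizing and serializing from a script might look like this:

```csharp
using Unity.InferenceEngine; // formerly Unity.Sentis
using UnityEngine;

public static class ModelQuantizeExample
{
    // Quantizes a loaded model's weights to Float16 and saves it
    // as a .sentis file under StreamingAssets.
    public static void QuantizeAndSave(ModelAsset modelAsset)
    {
        Model model = ModelLoader.Load(modelAsset);
        ModelQuantizer.QuantizeWeights(QuantizationType.Float16, ref model);
        ModelWriter.Save(Application.streamingAssetsPath + "/mymodel.sentis", model);
    }
}
```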
2. Convert ONNX to Sentis
Most ONNX models require cleanup before use in Unity. Use the OnnxModelConverterEditor (Window → Meta → AI → ONNX Model Converter) to:
- Import your .onnx model.
- Apply cleanup options (for example, Softmax or NMS removal).
- Choose your Quantization Type (None, Float16, Uint8).
- Enter the desired path and name for your converted model.
- Press Convert to Sentis.
This generates a .sentis asset optimized for the Unity runtime.
3. Serialize and Load Models
Optionally, for large models, create a serialized asset to speed up loading:
1. In the Project window, select your ONNX model.
2. In the Inspector, click Serialize to StreamingAssets.
3. Unity generates a .sentis file inside your StreamingAssets folder.
You can then load it at runtime:
```csharp
using Unity.InferenceEngine;

Model model = ModelLoader.Load(Application.streamingAssetsPath + "/mymodel.sentis");
```
Advantages of Serialization:
- Faster load times and smaller project size
- Unity-validated format (guaranteed compatibility)
- Easier to share between projects
Runtime Initialization and Warm-Up
When a model first runs, Unity Inference Engine must allocate buffers, compile
GPU kernels, and upload weights. This can cause a one-time delay of several
seconds at startup.
Always perform a warm-up inference during loading or splash screens:
```csharp
IEnumerator Start()
{
    // modelAsset is a ModelAsset field assigned in the Inspector;
    // worker is a class field so it stays alive across frames.
    var model = ModelLoader.Load(modelAsset);
    worker = new Worker(model, BackendType.GPUCompute);
    using var input = new Tensor<float>(new TensorShape(1, 3, 224, 224));
    worker.Schedule(input); // first run allocates buffers and compiles kernels
    yield return null;
    Debug.Log("Model warmed up and ready");
}
```
Best Practices:
- Warm up before gameplay starts.
- Keep the worker alive across frames.
- Dispose workers only on scene unload.
- For multiple models (for example, STT + detection), warm them up sequentially.
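For the multi-model case, warming up sequentially might look like the following sketch (the 1×3×224×224 input shape and the `Worker` list are illustrative — use each model's actual input shape):

```csharp
using System.Collections;
using System.Collections.Generic;
using Unity.InferenceEngine;

// Warms up several workers one after another so their one-time
// initialization costs (buffer allocation, kernel compilation)
// don't all land in the same frame.
IEnumerator WarmUpSequentially(List<Worker> workers)
{
    foreach (var worker in workers)
    {
        using var input = new Tensor<float>(new TensorShape(1, 3, 224, 224));
        worker.Schedule(input);
        yield return null; // give the GPU a frame between warm-ups
    }
}
```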
Non-Max Suppression (NMS)
Object detectors often output multiple overlapping boxes for the same object.
Non-Max Suppression (NMS) filters these out, keeping only the most confident
ones.
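Conceptually, NMS greedily keeps the highest-scoring box and discards any remaining box whose overlap (intersection over union, IoU) with a kept box exceeds a threshold. A minimal CPU sketch of the algorithm, with illustrative types (not the API of the package's GPU implementation):

```csharp
using System.Collections.Generic;
using System.Linq;

public struct Box { public float X, Y, W, H, Score; }

public static class NmsSketch
{
    // Intersection-over-Union of two axis-aligned boxes.
    static float IoU(Box a, Box b)
    {
        float x1 = System.Math.Max(a.X, b.X);
        float y1 = System.Math.Max(a.Y, b.Y);
        float x2 = System.Math.Min(a.X + a.W, b.X + b.W);
        float y2 = System.Math.Min(a.Y + a.H, b.Y + b.H);
        float inter = System.Math.Max(0, x2 - x1) * System.Math.Max(0, y2 - y1);
        float union = a.W * a.H + b.W * b.H - inter;
        return union <= 0 ? 0 : inter / union;
    }

    // Greedy NMS: keep the most confident box, drop boxes that
    // overlap an already-kept box above the IoU threshold.
    public static List<Box> Run(IEnumerable<Box> boxes, float iouThreshold = 0.5f)
    {
        var kept = new List<Box>();
        foreach (var box in boxes.OrderByDescending(b => b.Score))
            if (kept.All(k => IoU(k, box) < iouThreshold))
                kept.Add(box);
        return kept;
    }
}
```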
How to efficiently run NMS on the GPU
Some ONNX models, such as YOLOX, include a CPU-based NonMaxSuppression op that can cause performance bottlenecks. If you have ever tried to run YOLO on the CPU backend — or the GPU backend, for that matter — you have likely experienced significant frame drops. Simply switching the backend to GPU does not solve the problem, because the NonMaxSuppression op still executes on the CPU. On the GPUCompute backend you will instead notice that the detection results are not filtered at all, so the model outputs a large number of bounding boxes for the same object.
To tackle this, post-process detections with the provided GPU NMS implementation:
GpuNMS.cs
NMSCompute.compute
GpuNMS.cs
NMSCompute.compute
These run NMS entirely on the GPU, avoiding GPU-to-CPU sync stalls. Note that for this to work, the NMS layer must be removed from the model. This happens automatically when you convert your model to the .sentis format using the OnnxModelConverterEditor with the Remove NMS option checked.
CPU Bottlenecks with Unity Inference Engine
You will still notice performance spikes, even though the Object Detection Building Block removes the NMS layer and runs it on the `GPUCompute` backend. This is because all results are still copied from the GPU back to the CPU in order to filter and place the bounding boxes in 3D. We are resolving this bottleneck for a future release.