AI Building Blocks - Overview
Updated: Dec 11, 2025
AI Building Blocks bring plug-and-play machine-learning capabilities
directly into Unity-based XR projects for Meta Quest. Each block combines a
Unity Agent (runtime logic) with a configurable Provider that defines where
inference runs: (a) in the cloud, (b) on a local machine, or (c) on the Meta Quest
device itself. Cloud and local providers currently connect through HTTP requests
because Unity lacks native WebSocket/WebRTC support.
HTTP is simpler for prototyping and testing, but it is less performant than
native WebSocket/WebRTC, so expect increased latency when building real-time
experiences that use cloud or local inference over HTTP. For this reason, future
work will focus on building out on-device inference support with the
Unity Inference Engine.
AI Building Blocks offer modular functionality across four key categories:

Object Detection
Detect and label real-world objects in passthrough or camera textures using on-device or cloud models.
View Documentation
Large Language Models
Integrate contextual or multimodal AI using Llama, or custom models through any Provider.
View Documentation
Speech to Text (STT)
Transcribe microphone or audio-clip input in real time using state-of-the-art cloud models.
View Documentation
Text to Speech (TTS)
Generate natural-sounding voice output using state-of-the-art ElevenLabs or OpenAI models.
View Documentation

Requirements
- Unity 6 or newer
- Meta Quest 3 or 3S
- Meta XR Core SDK v83+ and Meta XR MR Utility Kit v83+ (for Passthrough Camera support)
- Stable internet connection when running cloud providers
Always check provider and model availability
We do our best to provide you with state-of-the-art providers and up-to-date models, but especially for cloud providers, models may not always be available on the provider's servers. Therefore, always check provider and model availability before using them in your experience.

| Category | Example | Recommended Provider |
|---|---|---|
| Vision | Real-time object detection | Unity Inference Engine / HuggingFace |
| Language | Language and vision requests to LLMs/VLMs | Llama API / OpenAI / Ollama / HuggingFace / Replicate |
| Speech | Voice commands or narration (TTS / STT) | OpenAI / ElevenLabs |
Each AI Building Block consists of two core layers:
| Layer | Role | Examples |
|---|---|---|
| Agent | Unity runtime component managing input/output and inference calls. | ObjectDetectionAgent, LlmAgent, SpeechToTextAgent, TextToSpeechAgent |
| Provider | ScriptableObject defining the inference backend and input/output structure. | OpenAIProvider, HuggingFaceProvider, OllamaProvider, UnityInferenceEngineProvider |
Example
Prototype using Llama 4 Maverick (Llama API), then switch to Llama 3.3 running on Ollama, or an on-device model, without changing your experience’s logic.
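The Agent/Provider split described above is essentially the strategy pattern: the agent's runtime logic stays fixed while the inference backend is swapped out. A minimal, language-agnostic sketch of that idea in Python (hypothetical class and method names for illustration only, not the actual Unity C# API) might look like:

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Backend configuration: where and how inference runs."""

    @abstractmethod
    def infer(self, prompt: str) -> str: ...


class LlamaApiProvider(Provider):
    def infer(self, prompt: str) -> str:
        # In the real block this would be an HTTP request to the Llama API.
        return f"[llama-api] response to: {prompt}"


class OllamaProvider(Provider):
    def infer(self, prompt: str) -> str:
        # In the real block this would be an HTTP request to a local Ollama server.
        return f"[ollama] response to: {prompt}"


class LlmAgent:
    """Runtime logic stays the same; only the provider is swapped."""

    def __init__(self, provider: Provider):
        self.provider = provider

    def ask(self, prompt: str) -> str:
        return self.provider.infer(prompt)


# Prototype against a cloud provider, then switch to local inference
# without changing the agent's logic:
agent = LlmAgent(LlamaApiProvider())
print(agent.ask("Describe this scene."))

agent.provider = OllamaProvider()  # swap the backend only
print(agent.ask("Describe this scene."))
```

In the actual blocks the Provider is a ScriptableObject assigned in the Inspector, so this swap happens as an asset reference change rather than in code.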