AI Building Blocks - Overview
Updated: Dec 11, 2025
AI Building Blocks bring plug-and-play machine-learning capabilities
directly into Unity-based XR projects for Meta Quest. Each block combines a
Unity Agent (runtime logic) with a configurable Provider that defines where
inference runs: (a) in the cloud, (b) on a local machine, or (c) on the Meta Quest
device itself. Cloud and local providers currently connect through HTTP requests
because Unity lacks native WebSocket/WebRTC support.
HTTP is simpler for prototyping and testing, but it is less performant than
native WebSocket/WebRTC, so expect increased latency when building real-time
experiences that use cloud or local inference over HTTP. For this reason, future
work will focus on building out on-device inference support with the
Unity Inference Engine.
AI Building Blocks offer modular functionality across four key categories:

Object Detection
Detect and label real-world objects in passthrough or camera textures using on-device or cloud models.
View Documentation
Large Language Models
Integrate contextual or multimodal AI using Llama, or custom models through any Provider.
View Documentation
Speech to Text (STT)
Transcribe microphone or audio-clip input in real time using state-of-the-art cloud models.
View Documentation
Text to Speech (TTS)
Generate natural-sounding voice output using state-of-the-art ElevenLabs or OpenAI models.
View Documentation

Requirements
- Unity 6 or newer
- Meta Quest 3 or 3S
- Meta XR Core SDK v83+ and Meta XR MR Utility Kit v83+ (for Passthrough Camera support)
- Stable internet connection when running cloud providers
Always check provider and model availability
We do our best to provide you with state-of-the-art providers and up-to-date models, but especially for cloud providers, models may not always be available on the provider's servers. Therefore, always check provider and model availability before using them in your experience.

| Category | Example | Recommended Provider |
|---|---|---|
| Vision | Real-time object detection | Unity Inference Engine / HuggingFace |
| Language | Language and vision requests to LLMs/VLMs | Llama API / OpenAI / Ollama / HuggingFace / Replicate |
| Speech | Voice commands or narration (TTS / STT) | OpenAI / ElevenLabs |
Each AI Building Block consists of two core layers:
| Layer | Role | Examples |
|---|---|---|
| Agent | Unity runtime component managing input/output and inference calls. | ObjectDetectionAgent, LlmAgent, SpeechToTextAgent, TextToSpeechAgent |
| Provider | ScriptableObject defining the inference backend and input/output structure. | OpenAIProvider, HuggingFaceProvider, OllamaProvider, UnityInferenceEngineProvider |
Example
Prototype using Llama 4 Maverick (Llama API), then switch to Llama 3.3 running on Ollama, or an on-device model, without changing your experience’s logic.
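The Agent/Provider split described above is essentially the strategy pattern: the agent's runtime logic stays fixed while the inference backend is swapped out. A minimal, language-agnostic sketch of that idea in Python (hypothetical class and method names for illustration only, not the actual Unity C# API) might look like:

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Backend configuration: where and how inference runs."""

    @abstractmethod
    def infer(self, prompt: str) -> str: ...


class LlamaApiProvider(Provider):
    def infer(self, prompt: str) -> str:
        # In the real block this would be an HTTP request to the Llama API.
        return f"[llama-api] response to: {prompt}"


class OllamaProvider(Provider):
    def infer(self, prompt: str) -> str:
        # In the real block this would be an HTTP request to a local Ollama server.
        return f"[ollama] response to: {prompt}"


class LlmAgent:
    """Runtime logic stays the same; only the provider is swapped."""

    def __init__(self, provider: Provider):
        self.provider = provider

    def ask(self, prompt: str) -> str:
        return self.provider.infer(prompt)


# Prototype against a cloud provider, then switch to local inference
# without changing the agent's logic:
agent = LlmAgent(LlamaApiProvider())
print(agent.ask("Describe this scene."))

agent.provider = OllamaProvider()  # swap the backend only
print(agent.ask("Describe this scene."))
```

In the actual blocks the Provider is a ScriptableObject assigned in the Inspector, so this swap happens as an asset reference change rather than in code.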