
AI Building Blocks - Overview

Updated: Dec 11, 2025
AI Building Blocks bring plug-and-play machine-learning capabilities directly into Unity-based XR projects for Meta Quest. Each block pairs a Unity Agent (runtime logic) with a configurable Provider that defines where inference runs: in the cloud, on a local machine, or on the Meta Quest device itself. Cloud and local providers currently connect through HTTP requests because Unity lacks native WebSocket/WebRTC support.
HTTP is simpler for prototyping and testing, but it is less performant than native WebSocket/WebRTC, so expect increased latency when building real-time experiences on cloud or local inference over HTTP. For this reason, future work will focus on expanding on-device inference support with the Unity Inference Engine.

Core AI Building Blocks

AI Building Blocks offer modular functionality across four key categories:
Object Detection
Detect and label real-world objects in passthrough or camera textures using on-device or cloud models.
Large Language Models
Integrate contextual or multimodal AI using Llama or custom models through any Provider.
Speech to Text (STT)
Transcribe microphone or audio-clip input in real time using state-of-the-art cloud models.
Text to Speech (TTS)
Generate natural-sounding voice output using state-of-the-art ElevenLabs or OpenAI models.

System Recommendations

  • Unity 6 or newer
  • Meta Quest 3 or 3S
  • Meta XR Core SDK v83+ and Meta XR MR Utility Kit v83+ (for Passthrough Camera support)
  • Stable internet connection when running cloud providers

Typical Use Cases

Always check provider and model availability
We do our best to provide state-of-the-art providers and up-to-date models, but cloud-hosted models in particular may not always be available on the provider's servers. Always check provider and model availability before relying on them in your experience.
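For a local provider such as Ollama, one way to run this pre-flight check is to query the server's model list before enabling the provider. The sketch below assumes Ollama's REST endpoint `GET /api/tags` on the default port 11434; the component name and fallback behavior are illustrative, not part of the Building Blocks API.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

// Illustrative pre-flight check: verify that a local Ollama server is
// reachable and that the model we intend to use is installed, before
// routing any inference calls to it.
public class OllamaAvailabilityCheck : MonoBehaviour
{
    [SerializeField] private string serverUrl = "http://localhost:11434";
    [SerializeField] private string requiredModel = "llama3.3";

    private IEnumerator Start()
    {
        // GET /api/tags returns the models currently pulled on the server.
        using var request = UnityWebRequest.Get($"{serverUrl}/api/tags");
        yield return request.SendWebRequest();

        if (request.result != UnityWebRequest.Result.Success)
        {
            Debug.LogWarning($"Ollama unreachable: {request.error}. Consider a cloud or on-device fallback.");
            yield break;
        }

        // A substring check is enough for a prototype; parse the JSON
        // response properly in production code.
        bool modelAvailable = request.downloadHandler.text.Contains(requiredModel);
        Debug.Log(modelAvailable
            ? $"Model '{requiredModel}' is available."
            : $"Model '{requiredModel}' not found on {serverUrl}.");
    }
}
```

The same pattern applies to cloud providers: issue a cheap request (for example, a model-listing call) at startup and fall back to another Provider if it fails.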
Category | Example | Recommended Provider
Vision | Real-time object detection | Unity Inference Engine / HuggingFace
Language | Language and vision requests to LLMs/VLMs | Llama API / OpenAI / Ollama / HuggingFace / Replicate
Speech | Voice commands or narration (TTS / STT) | OpenAI / ElevenLabs

Architecture Overview

Each AI Building Block consists of two core layers:
Layer | Role | Examples
Agent | Unity runtime component managing input/output and inference calls. | ObjectDetectionAgent, LlmAgent, SpeechToTextAgent, TextToSpeechAgent
Provider | ScriptableObject defining the inference backend and input/output structure. | OpenAIProvider, HuggingFaceProvider, OllamaProvider, UnityInferenceEngineProvider
Example
Prototype using Llama 4 Maverick via the Llama API, then switch to Llama 3.3 running on Ollama, or to an on-device model, without changing your experience's logic.
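The Agent/Provider split described above can be sketched as follows. The agent class name comes from the table in this section, but the `Provider` property and the way it is assigned are assumptions for illustration, not the shipped Building Blocks API surface.

```csharp
using UnityEngine;

// Illustrative only: the experience logic lives in the agent, while the
// Provider asset decides where inference runs. Swapping the assigned
// ScriptableObject changes the backend without touching this logic.
public class NpcDialogue : MonoBehaviour
{
    [SerializeField] private LlmAgent llmAgent;                 // shipped agent class (per the table above)
    [SerializeField] private ScriptableObject llamaApiProvider; // e.g. cloud: Llama 4 Maverick
    [SerializeField] private ScriptableObject ollamaProvider;   // e.g. local: Llama 3.3

    public void UseLocalInference()
    {
        // Hypothetical property: reroutes inference to the local server.
        // The dialogue logic elsewhere in this component is unchanged.
        llmAgent.Provider = ollamaProvider;
    }
}
```

Because Providers are ScriptableObject assets, the same swap can also be done in the Inspector with no code changes at all.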
