Runs entirely in your browser · Zero data sent to the cloud

⚠ WebLLM needs WebGPU to run LLMs locally. If WebGPU is not available in your browser, use Chrome 113+ or Edge 113+ to try this demo.
The How WebLLM Works section below explains the API.

Select a model and click Load Model to begin.
Model downloads once · caches in the browser · loads from cache on later visits
How WebLLM Works
📥 Model weights download once and are cached in IndexedDB, so repeat visits skip the download
🖥 Inference runs as WebGPU compute shaders on your GPU, giving near-native speed in the browser
💬 OpenAI-style API: engine.chat.completions.create() takes the same request shape as the OpenAI SDK's chat completions call
🔒 No prompt or response data is sent to any server: the entire inference pipeline runs inside your browser tab
Browser Compatibility
✅ Chrome 113+
✅ Edge 113+
⚠ Safari 18+ (macOS 15)
❌ Firefox (no WebGPU)
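A page can check for WebGPU at runtime instead of relying on browser version numbers. The sketch below is one way to do that, assuming the standard `navigator.gpu.requestAdapter()` call. The function name `detectWebGPU` and the shape of its return value are hypothetical, chosen for illustration only.

```javascript
// Sketch: report whether WebGPU looks usable in the current environment.
// `nav` is a navigator-like object; in a browser, pass the global `navigator`.
// The name `detectWebGPU` and the { supported, reason } result are assumptions.
async function detectWebGPU(nav) {
  // Browsers without WebGPU do not define `navigator.gpu` at all.
  if (!nav || !nav.gpu) {
    return { supported: false, reason: 'navigator.gpu is not defined' };
  }
  // requestAdapter() resolves to null when no suitable GPU adapter exists.
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: 'no GPU adapter available' };
  }
  return { supported: true, reason: 'WebGPU adapter found' };
}
```

In a browser this could be called as `detectWebGPU(navigator)`, showing the warning banner when `supported` is false. Taking the navigator as a parameter keeps the function easy to exercise outside a browser.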
Available Models
Model          Best for            Download size   VRAM needed
Llama 3.2 1B   First test, fast    ~700 MB         2 GB
Gemma 2 2B     Balanced quality    ~1.5 GB         3 GB
Phi-3.5 Mini   Strong reasoning    ~2.2 GB         4 GB
Llama 3.2 3B   Higher quality      ~2.0 GB         4 GB
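As a rough illustration of how the table above could drive a model picker, the sketch below filters the listed models by a VRAM budget. The `MODELS` array just restates the table, and `modelsThatFit` is a hypothetical helper. The budget is a number the caller supplies, since WebGPU does not report available VRAM directly.

```javascript
// Data copied from the "Available Models" table; VRAM figures are in GB.
const MODELS = [
  { name: 'Llama 3.2 1B', bestFor: 'First test, fast', vramGb: 2 },
  { name: 'Gemma 2 2B', bestFor: 'Balanced quality', vramGb: 3 },
  { name: 'Phi-3.5 Mini', bestFor: 'Strong reasoning', vramGb: 4 },
  { name: 'Llama 3.2 3B', bestFor: 'Higher quality', vramGb: 4 },
];

// Hypothetical helper: names of the models whose VRAM requirement
// fits within the caller-supplied budget, in table order.
function modelsThatFit(vramBudgetGb) {
  return MODELS.filter((m) => m.vramGb <= vramBudgetGb).map((m) => m.name);
}
```

For example, `modelsThatFit(3)` keeps only the two smallest models, so a picker could disable the rest on a machine with limited GPU memory.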
The Code (OpenAI-compatible)
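import * as webllm from '@mlc-ai/web-llm';
// Example model ID (an assumption); use any ID from webllm.prebuiltAppConfig.model_list
const modelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
// Load the model, reporting download and setup progress through a callback
const engine = await webllm.CreateMLCEngine(modelId, {
  initProgressCallback: (report) => console.log(report.text),
});
// Same request shape as the OpenAI SDK's chat.completions.create()
const messages = [{ role: 'user', content: 'Hello!' }];
const stream = await engine.chat.completions.create({
  messages, stream: true
});
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content ?? ''; // delta content can be empty
  /* render token */
}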
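The streaming loop above leaves the rendering step open. One way to fill it in, sketched here under the assumption that chunks follow the OpenAI streaming shape (`choices[0].delta.content`), is to accumulate tokens into a string while passing each one to a callback. `collectStream` and `onToken` are hypothetical names, not part of WebLLM.

```javascript
// Sketch: gather streamed chat-completion chunks into one string.
// `stream` is any async iterable of OpenAI-style chunks; `onToken` is an
// optional, hypothetical callback for incremental rendering.
async function collectStream(stream, onToken = () => {}) {
  let text = '';
  for await (const chunk of stream) {
    // Each chunk carries an incremental delta; its content can be missing,
    // so fall back to an empty string.
    const token = chunk.choices?.[0]?.delta?.content ?? '';
    if (token) {
      text += token;
      onToken(token);
    }
  }
  return text;
}
```

With the engine from the snippet above, this might be used as `const reply = await collectStream(stream, (t) => { output.textContent += t; })`, where `output` is an assumed DOM element.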
Storage Info
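Cached model weights count against the browser's per-origin storage quota. The sketch below shows how a readout like this one might be computed, assuming the standard `navigator.storage.estimate()` call. It is not necessarily how this page does it, and `formatBytes` and `storageSummary` are hypothetical helper names.

```javascript
// Format a byte count using binary (1024-based) steps and short unit labels.
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB'];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i += 1;
  }
  return `${value.toFixed(1)} ${units[i]}`;
}

// Summarize origin storage usage. `nav` is a navigator-like object;
// in a browser, pass the global `navigator`.
async function storageSummary(nav) {
  if (!nav || !nav.storage || typeof nav.storage.estimate !== 'function') {
    return 'Storage estimate not available';
  }
  // estimate() resolves to approximate usage and quota in bytes.
  const { usage = 0, quota = 0 } = await nav.storage.estimate();
  return `${formatBytes(usage)} used of ${formatBytes(quota)}`;
}
```

The figures from `estimate()` are approximations reported by the browser, so they are best treated as a guide to whether a model of a given size is likely to fit.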