WebLLM In-Browser LLM Demo

⚠ WebGPU is not available in this browser. WebLLM needs WebGPU to run LLMs locally.
Use Chrome 113+ or Edge 113+ to try this demo. The How It Works tab below explains the full API.

Select a model and click Load Model to begin.

Model downloads once · caches in browser · loads from cache on later visits

How WebLLM Works

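📥Model weights download once and are cached in the browser (WebLLM uses the Cache API by default and can be configured to use IndexedDB), so repeat visits skip the download

🖥WebGPU runs the model as GPU compute shaders, giving hardware-accelerated inference that can approach native speed in the browser

💬OpenAI-style API: engine.chat.completions.create() takes the same request shape as the OpenAI SDK's chat completions call, including streaming

🔒No prompts or responses are sent to any server: after the one-time weight download, the entire inference pipeline runs inside your browser tab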
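
To make the OpenAI-style call shape concrete, here is a minimal sketch of a non-streaming request. The helper name askOnce is hypothetical, and engine is assumed to be the object returned by CreateMLCEngine; the only method it relies on is engine.chat.completions.create(), named in the list above.

```javascript
// Hypothetical helper (not part of WebLLM): send one user prompt and
// return the full reply text. `engine` is assumed to expose the
// OpenAI-style chat.completions.create() method.
async function askOnce(engine, prompt) {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  // Non-streaming responses carry the text in choices[0].message.content.
  return reply.choices[0]?.message?.content ?? "";
}
```

Because the helper only depends on the request and response shape, the same function should also work against any client object that follows the OpenAI chat completions format.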

Browser Compatibility

✅ Chrome 113+

✅ Edge 113+

⚠ Safari 18+ (macOS 15)

❌ Firefox (WebGPU support is still limited and platform-dependent)
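
Version numbers are only a rough guide, so a page can also check for WebGPU at runtime. The sketch below uses the standard navigator.gpu.requestAdapter() call; the function name canRunWebGPU and the injectable nav parameter are assumptions added for illustration and testing.

```javascript
// Returns true only if the WebGPU API is exposed AND an adapter can be obtained.
// `nav` defaults to the global navigator; pass a stub when running outside a browser.
async function canRunWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false; // API not exposed in this browser
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false; // the adapter request itself failed
  }
}
```

Checking for an adapter as well as the API matters because a browser can expose navigator.gpu and still return no adapter on unsupported hardware.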

Available Models

Model	Best for	Download size	VRAM needed
Llama 3.2 1B	First test, fast	~700 MB	2 GB
Gemma 2 2B	Balanced quality	~1.5 GB	3 GB
Phi-3.5 Mini	Strong reasoning	~2.2 GB	4 GB
Llama 3.2 3B	Higher quality	~2.0 GB	4 GB
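
As a rough illustration of how the table can drive model selection, the sketch below filters the four rows by a VRAM budget. The MODELS array simply restates the table; the helper name modelsThatFit is hypothetical, and the budget is assumed to be supplied by the user rather than read from the browser.

```javascript
// Rows restated from the table above; sizes and VRAM are in GB.
const MODELS = [
  { name: "Llama 3.2 1B", bestFor: "First test, fast", sizeGB: 0.7, vramGB: 2 },
  { name: "Gemma 2 2B", bestFor: "Balanced quality", sizeGB: 1.5, vramGB: 3 },
  { name: "Phi-3.5 Mini", bestFor: "Strong reasoning", sizeGB: 2.2, vramGB: 4 },
  { name: "Llama 3.2 3B", bestFor: "Higher quality", sizeGB: 2.0, vramGB: 4 },
];

// Hypothetical helper: names of the models whose listed VRAM fits the budget.
function modelsThatFit(vramBudgetGB) {
  return MODELS.filter((m) => m.vramGB <= vramBudgetGB).map((m) => m.name);
}
```

For example, modelsThatFit(3) keeps only the two smallest models, which matches the "first test" and "balanced" rows.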

The Code (OpenAI-compatible)

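import * as webllm from '@mlc-ai/web-llm';
// Example inputs (assumed here; use any model ID from WebLLM's prebuilt list)
const modelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
const messages = [{ role: 'user', content: 'Hello!' }];
// Load model with progress callback
const engine = await webllm.CreateMLCEngine(modelId, {
  initProgressCallback: (report) => console.log(report.text),
});
// Same request shape as the OpenAI SDK
const stream = await engine.chat.completions.create({
  messages, stream: true
});
for await (const chunk of stream) {
  console.log(chunk.choices[0]?.delta?.content ?? ''); // render token
}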
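
The "render token" step can be factored into a small reusable helper. This is a sketch under the assumption that each streamed chunk follows the OpenAI format, with the text fragment in choices[0].delta.content; the names collectStream and onToken are hypothetical.

```javascript
// Hypothetical helper: consume a streamed completion, call onToken for each
// non-empty text fragment, and return the concatenated reply.
async function collectStream(stream, onToken = () => {}) {
  let text = "";
  for await (const chunk of stream) {
    // Some chunks (for example the final one) may carry no text.
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece) {
      text += piece;
      onToken(piece);
    }
  }
  return text;
}
```

In a page, onToken would typically append the fragment to the chat bubble, while the returned string can be stored in the message history for the next turn.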

Storage Info

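
A storage readout for cached models could be produced with the standard navigator.storage.estimate() call. The sketch below is an assumption about how such a panel might be built, not the demo's actual code; the helper name describeStorage and the injectable storage parameter are hypothetical.

```javascript
// Hypothetical helper: summarize this origin's storage usage as a short string.
// `storage` defaults to navigator.storage; pass a stub outside a browser.
async function describeStorage(storage = globalThis.navigator?.storage) {
  if (!storage || typeof storage.estimate !== "function") {
    return "Storage estimate not available";
  }
  // estimate() resolves to approximate byte counts for usage and quota.
  const { usage = 0, quota = 0 } = await storage.estimate();
  const toGB = (bytes) => (bytes / 1024 ** 3).toFixed(2);
  return `${toGB(usage)} GB used of ${toGB(quota)} GB`;
}
```

The numbers reported by the browser are estimates for the whole origin, so they include anything else the site has stored alongside the model weights.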