We built MetalRT from scratch in 48 hours: pure C++ straight to Metal, no abstractions, no compromises. The result is the fastest decode performance available today on Apple Silicon.
658 tokens per second on Qwen3-0.6B (4-bit) using a single M4 Max.
We benchmarked against the strongest competitors on the exact same hardware (M4 Max, 64 GB, macOS 26.3):
- MetalRT
- uzu (Rust production engine)
- mlx-lm (Apple's official MLX framework)
- llama.cpp
- Ollama (REST API)
MetalRT is fastest on 3 of 4 models and wins the only clean apples-to-apples comparison: 1.10–1.19× faster than Apple's own MLX using identical model files.
On average, 1.67× faster than llama.cpp and 1.59× faster than Ollama.
Time to first token (TTFT) on Qwen3-0.6B: 6.6 ms.
Same model weights = same output quality. Only the speed is different.
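For readers who want to reproduce numbers like these, here is a minimal, hypothetical measurement harness in the spirit of our protocol (greedy decoding, multiple runs, best run reported). This is a sketch, not MetalRT's actual code; `generate_fn` is an assumed streaming token generator:

```python
import time

def benchmark_decode(generate_fn, prompt, n_tokens=256, runs=5):
    """Measure TTFT and steady-state decode throughput for a
    streaming generator; keep the best of `runs` runs."""
    best = None
    for _ in range(runs):
        start = time.perf_counter()
        first = None
        count = 0
        for _tok in generate_fn(prompt, max_tokens=n_tokens):
            if first is None:
                first = time.perf_counter()  # first token arrived
            count += 1
        end = time.perf_counter()
        ttft_ms = (first - start) * 1000.0
        # Decode rate excludes prefill/first-token latency.
        tps = (count - 1) / max(end - first, 1e-9)
        if best is None or tps > best["tokens_per_s"]:
            best = {"ttft_ms": ttft_ms, "tokens_per_s": tps}
    return best
```

Reporting the best run (rather than the mean) is a common convention for decode benchmarks because it minimizes interference from background load on the machine.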
Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B (all 4-bit quantized, greedy decoding, 5 runs, best run reported).
Public access coming soon as part of MetalRT by the RunAnywhere team.