Inference Optimization
Running frontier models on constrained hardware
Parameter offloading, sparsity, quantization, MoE caching — the engineering that makes large models fit where they otherwise wouldn't.
FlashAttention (v1/v2/v3) · 2022–2024
IO-aware exact attention kernel. 2–4× speedup, O(N) memory. The kernel everyone's LLM runs on today.
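A minimal NumPy sketch of the idea (not the fused CUDA kernel): attention is computed tile by tile with an online softmax, so the full N×N score matrix is never materialized. Block size and shapes are illustrative.

```python
# FlashAttention-style tiled attention with an online softmax, in NumPy.
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Exact attention computed block-by-block so the full N x N score
    matrix is never materialized (O(N) extra memory instead of O(N^2))."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of the logits
    l = np.zeros(N)           # running softmax denominator

    for j in range(0, N, block):                # stream K/V tiles (as if from HBM)
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])          # tile softmax numerator
        alpha = np.exp(m - m_new)               # rescale previous partial results
        l = alpha * l + p.sum(axis=1)
        O = alpha[:, None] * O + p @ Vj
        m = m_new

    return O / l[:, None]                       # matches softmax(QK^T/sqrt(d))V
```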
LLM in a Flash · 2023
Stores parameters in flash memory with sparsity-aware on-demand loading; 20–25× speedup on GPU over naive loading.
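A rough sketch of the loading pattern under simplifying assumptions: FFN rows live in a memory-mapped file standing in for flash, a stub predictor guesses which rows will be active, and only those rows are copied into DRAM. The file name, predictor, and sizes are illustrative, not the paper's actual components.

```python
# Sparsity-aware on-demand weight loading: only predicted-active FFN rows
# are pulled from "flash" (a memmapped file) into a DRAM-resident cache.
import numpy as np

D_MODEL, D_FF = 256, 1024

# Create a dummy weight file standing in for flash storage (illustrative).
np.memmap("ffn_up.bin", dtype=np.float16, mode="w+",
          shape=(D_FF, D_MODEL))[:] = np.random.randn(D_FF, D_MODEL).astype(np.float16)

# Open read-only: weights stay "in flash" until a row is explicitly loaded.
W_up = np.memmap("ffn_up.bin", dtype=np.float16, mode="r", shape=(D_FF, D_MODEL))
dram_cache = {}  # row index -> row copied into DRAM

def predict_active_rows(x, k=64):
    # Stub for an activation-sparsity predictor: cheaply score rows with a
    # partial dot product and keep the top-k. A real predictor would avoid
    # reading W_up at all (e.g. a small low-rank model).
    scores = np.abs(W_up[:, :32].astype(np.float32) @ x[:32])
    return np.argsort(scores)[-k:]

def ffn_up_sparse(x):
    rows = predict_active_rows(x)
    for r in rows:
        if r not in dram_cache:                  # on-demand load from "flash"
            dram_cache[r] = np.array(W_up[r], dtype=np.float32)
    W_active = np.stack([dram_cache[r] for r in rows])
    return rows, W_active @ x                    # only active rows are computed

x = np.random.randn(D_MODEL).astype(np.float32)
rows, y = ffn_up_sparse(x)
print(f"loaded {len(dram_cache)} of {D_FF} rows")
```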
Fast Inference of MoE with Offloading · 2023
Offloads MoE experts to SSD/CPU and caches active ones on the GPU; runs Mixtral-8x7B on consumer hardware.
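A toy sketch of the offloading pattern, assuming an LRU cache of experts resident on the GPU; expert shapes, the cache size, and the host-to-device copy are placeholders, not the paper's implementation.

```python
# MoE expert offloading with an LRU cache of experts kept in GPU memory.
from collections import OrderedDict
import numpy as np

N_EXPERTS, GPU_SLOTS, D = 8, 2, 512           # 8 experts, room for 2 on "GPU"
cpu_experts = [np.random.randn(D, D).astype(np.float32) for _ in range(N_EXPERTS)]

gpu_cache = OrderedDict()                     # expert id -> weights "on GPU"

def get_expert(eid):
    """Return expert weights, loading from CPU/SSD on a cache miss (LRU eviction)."""
    if eid in gpu_cache:
        gpu_cache.move_to_end(eid)            # mark as most recently used
        return gpu_cache[eid]
    if len(gpu_cache) >= GPU_SLOTS:
        gpu_cache.popitem(last=False)         # evict least recently used expert
    gpu_cache[eid] = cpu_experts[eid].copy()  # stand-in for host -> device copy
    return gpu_cache[eid]

def moe_layer(x, router_logits, top_k=2):
    """Route a token to its top-k experts, fetching each through the cache."""
    chosen = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[chosen]) / np.exp(router_logits[chosen]).sum()
    return sum(g * (get_expert(e) @ x) for g, e in zip(gates, chosen))

x = np.random.randn(D).astype(np.float32)
y = moe_layer(x, np.random.randn(N_EXPERTS))
```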
TurboQuant · 2025
Data-oblivious vector quantization; 3-bit KV cache with near-zero accuracy loss; 8× on H100.
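A generic data-oblivious quantizer sketch (fixed random rotation followed by 3-bit uniform quantization), shown only to illustrate quantizing without fitting to the data distribution; this is not TurboQuant's actual algorithm, and the bit width and dimensions are illustrative.

```python
# Data-oblivious KV quantization sketch: rotate with a fixed orthonormal
# matrix to spread energy, then quantize each value uniformly to 3 bits.
import numpy as np

D, BITS = 128, 3
LEVELS = 2 ** BITS
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((D, D)))   # fixed rotation, data-independent

def quantize_kv(v):
    """Rotate, then 3-bit uniform quantization; returns integer codes + scale."""
    z = R @ v
    scale = np.abs(z).max() / (LEVELS // 2)
    codes = np.clip(np.round(z / scale), -(LEVELS // 2), LEVELS // 2 - 1).astype(np.int8)
    return codes, scale

def dequantize_kv(codes, scale):
    """Scale the codes back and undo the rotation."""
    return R.T @ (codes.astype(np.float32) * scale)

k = rng.standard_normal(D).astype(np.float32)
codes, scale = quantize_kv(k)
k_hat = dequantize_kv(codes, scale)
print("relative error:", np.linalg.norm(k - k_hat) / np.linalg.norm(k))
```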
FlashMoE · 2026
ML-based cache replacement for MoE SSD offloading, 2.6× speedup on edge devices.
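A toy sketch of learned cache replacement, assuming a hand-weighted linear scorer over recency and frequency features; the features, weights, and trace are hypothetical and not FlashMoE's actual policy.

```python
# Learned cache replacement for offloaded MoE experts: instead of LRU, a tiny
# scorer predicts how likely each cached expert is to be reused, and the
# lowest-scoring expert is evicted on a miss.
import numpy as np

CACHE_SLOTS = 4
cache = {}        # expert id -> dict(last_step, hits)
step = 0

# Toy linear scorer over [recency, frequency]; a real policy would be trained.
W = np.array([-0.05, 1.0])

def reuse_score(eid):
    meta = cache[eid]
    recency = step - meta["last_step"]
    frequency = meta["hits"]
    return W @ np.array([recency, frequency], dtype=np.float32)

def access_expert(eid):
    """Record an access, evicting the expert with the lowest predicted reuse."""
    global step
    step += 1
    if eid in cache:
        cache[eid]["last_step"] = step
        cache[eid]["hits"] += 1
        return "hit"
    if len(cache) >= CACHE_SLOTS:
        victim = min(cache, key=reuse_score)   # evict least-likely-to-be-reused
        del cache[victim]
    cache[eid] = {"last_step": step, "hits": 1}
    return "miss (loaded from SSD)"

for eid in [0, 1, 2, 3, 0, 4, 0, 5, 1]:        # toy access trace
    print(eid, access_expert(eid))
```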