Inference Optimization
Running frontier models on constrained hardware
Parameter offloading, sparsity, quantization, MoE caching — the engineering that makes large models fit where they otherwise wouldn't.
FlashAttention (v1/v2/v3) · 2022–2024
IO-aware exact attention kernel. 2–4× speedup, O(N) memory. The kernel everyone's LLM runs on today.
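A minimal NumPy sketch of the idea (not the fused CUDA kernel): attention is computed tile by tile with an online softmax, so the full N×N score matrix is never materialized. Block size and shapes are illustrative.

```python
# FlashAttention-style tiled attention with an online softmax, in NumPy.
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Exact attention computed block-by-block so the full N x N score
    matrix is never materialized (O(N) extra memory instead of O(N^2))."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of the logits
    l = np.zeros(N)           # running softmax denominator

    for j in range(0, N, block):                # stream K/V tiles (as if from HBM)
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                  # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])          # tile softmax numerator
        alpha = np.exp(m - m_new)               # rescale previous partial results
        l = alpha * l + p.sum(axis=1)
        O = alpha[:, None] * O + p @ Vj
        m = m_new

    return O / l[:, None]                       # matches softmax(QK^T/sqrt(d))V
```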
LLM in a Flash · 2023
Stores parameters in flash memory with sparsity-aware on-demand loading; 20–25× speedup on GPU over naive loading.
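A rough sketch of the loading pattern under simplifying assumptions: FFN rows live in a memory-mapped file standing in for flash, a stub predictor guesses which rows will be active, and only those rows are copied into DRAM. The file name, predictor, and sizes are illustrative, not the paper's actual components.

```python
# Sparsity-aware on-demand weight loading: only predicted-active FFN rows
# are pulled from "flash" (a memmapped file) into a DRAM-resident cache.
import numpy as np

D_MODEL, D_FF = 256, 1024

# Create a dummy weight file standing in for flash storage (illustrative).
np.memmap("ffn_up.bin", dtype=np.float16, mode="w+",
          shape=(D_FF, D_MODEL))[:] = np.random.randn(D_FF, D_MODEL).astype(np.float16)

# Open read-only: weights stay "in flash" until a row is explicitly loaded.
W_up = np.memmap("ffn_up.bin", dtype=np.float16, mode="r", shape=(D_FF, D_MODEL))
dram_cache = {}  # row index -> row copied into DRAM

def predict_active_rows(x, k=64):
    # Stub for an activation-sparsity predictor: cheaply score rows with a
    # partial dot product and keep the top-k. A real predictor would avoid
    # reading W_up at all (e.g. a small low-rank model).
    scores = np.abs(W_up[:, :32].astype(np.float32) @ x[:32])
    return np.argsort(scores)[-k:]

def ffn_up_sparse(x):
    rows = predict_active_rows(x)
    for r in rows:
        if r not in dram_cache:                  # on-demand load from "flash"
            dram_cache[r] = np.array(W_up[r], dtype=np.float32)
    W_active = np.stack([dram_cache[r] for r in rows])
    return rows, W_active @ x                    # only active rows are computed

x = np.random.randn(D_MODEL).astype(np.float32)
rows, y = ffn_up_sparse(x)
print(f"loaded {len(dram_cache)} of {D_FF} rows")
```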
Fast Inference of MoE with Offloading · 2023
Offloads MoE experts to SSD/CPU and caches active ones on the GPU; runs Mixtral-8x7B on consumer hardware.
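A toy sketch of the offloading pattern, assuming an LRU cache of experts resident on the GPU; expert shapes, the cache size, and the host-to-device copy are placeholders, not the paper's implementation.

```python
# MoE expert offloading with an LRU cache of experts kept in GPU memory.
from collections import OrderedDict
import numpy as np

N_EXPERTS, GPU_SLOTS, D = 8, 2, 512           # 8 experts, room for 2 on "GPU"
cpu_experts = [np.random.randn(D, D).astype(np.float32) for _ in range(N_EXPERTS)]

gpu_cache = OrderedDict()                     # expert id -> weights "on GPU"

def get_expert(eid):
    """Return expert weights, loading from CPU/SSD on a cache miss (LRU eviction)."""
    if eid in gpu_cache:
        gpu_cache.move_to_end(eid)            # mark as most recently used
        return gpu_cache[eid]
    if len(gpu_cache) >= GPU_SLOTS:
        gpu_cache.popitem(last=False)         # evict least recently used expert
    gpu_cache[eid] = cpu_experts[eid].copy()  # stand-in for host -> device copy
    return gpu_cache[eid]

def moe_layer(x, router_logits, top_k=2):
    """Route a token to its top-k experts, fetching each through the cache."""
    chosen = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[chosen]) / np.exp(router_logits[chosen]).sum()
    return sum(g * (get_expert(e) @ x) for g, e in zip(gates, chosen))

x = np.random.randn(D).astype(np.float32)
y = moe_layer(x, np.random.randn(N_EXPERTS))
```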
TurboQuant · 2025
Data-oblivious vector quantization; 3-bit KV cache with near-zero accuracy loss; 8× on H100.
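A generic data-oblivious quantizer sketch (fixed random rotation followed by 3-bit uniform quantization), shown only to illustrate quantizing without fitting to the data distribution; this is not TurboQuant's actual algorithm, and the bit width and dimensions are illustrative.

```python
# Data-oblivious KV quantization sketch: rotate with a fixed orthonormal
# matrix to spread energy, then quantize each value uniformly to 3 bits.
import numpy as np

D, BITS = 128, 3
LEVELS = 2 ** BITS
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((D, D)))   # fixed rotation, data-independent

def quantize_kv(v):
    """Rotate, then 3-bit uniform quantization; returns integer codes + scale."""
    z = R @ v
    scale = np.abs(z).max() / (LEVELS // 2)
    codes = np.clip(np.round(z / scale), -(LEVELS // 2), LEVELS // 2 - 1).astype(np.int8)
    return codes, scale

def dequantize_kv(codes, scale):
    """Scale the codes back and undo the rotation."""
    return R.T @ (codes.astype(np.float32) * scale)

k = rng.standard_normal(D).astype(np.float32)
codes, scale = quantize_kv(k)
k_hat = dequantize_kv(codes, scale)
print("relative error:", np.linalg.norm(k - k_hat) / np.linalg.norm(k))
```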
FlashMoE · 2026
ML-based cache replacement for MoE SSD offloading, 2.6× speedup on edge devices.
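A toy sketch of learned cache replacement, assuming a hand-weighted linear scorer over recency and frequency features; the features, weights, and trace are hypothetical and not FlashMoE's actual policy.

```python
# Learned cache replacement for offloaded MoE experts: instead of LRU, a tiny
# scorer predicts how likely each cached expert is to be reused, and the
# lowest-scoring expert is evicted on a miss.
import numpy as np

CACHE_SLOTS = 4
cache = {}        # expert id -> dict(last_step, hits)
step = 0

# Toy linear scorer over [recency, frequency]; a real policy would be trained.
W = np.array([-0.05, 1.0])

def reuse_score(eid):
    meta = cache[eid]
    recency = step - meta["last_step"]
    frequency = meta["hits"]
    return W @ np.array([recency, frequency], dtype=np.float32)

def access_expert(eid):
    """Record an access, evicting the expert with the lowest predicted reuse."""
    global step
    step += 1
    if eid in cache:
        cache[eid]["last_step"] = step
        cache[eid]["hits"] += 1
        return "hit"
    if len(cache) >= CACHE_SLOTS:
        victim = min(cache, key=reuse_score)   # evict least-likely-to-be-reused
        del cache[victim]
    cache[eid] = {"last_step": step, "hits": 1}
    return "miss (loaded from SSD)"

for eid in [0, 1, 2, 3, 0, 4, 0, 5, 1]:        # toy access trace
    print(eid, access_expert(eid))
```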