Inference Optimization Index
Reducing cost and latency for production LLM inference.
Notes
- Quantization for LLMs — bitsandbytes, GPTQ, AWQ, HQQ, and GGUF quantization methods for memory-efficient inference.
- Attention Optimization and KV Cache — Flash Attention, GQA, PagedAttention, and long-context positional encoding.
- LLM Serving Frameworks — vLLM, llama.cpp, TGI, Ollama, and SGLang for production inference.
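The quantization methods listed above all build on the same core idea: rescale floating-point weights into a low-bit integer range. A minimal sketch of absmax int8 quantization (illustrative only; real quantizers such as bitsandbytes or GPTQ work per-block or per-channel, with outlier handling and calibration):

```python
import numpy as np

def quantize_absmax(x: np.ndarray):
    """Map float weights to int8 using the tensor's absolute maximum as the scale."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the stored scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)
# Round-trip error is bounded by half the quantization step (scale / 2).
print("max abs error:", np.abs(w - w_hat).max())
```

Storing `q` (1 byte per weight) plus one scale instead of float32 weights (4 bytes each) is what cuts memory roughly 4x; the function names here are mine, not from any of the libraries above.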