Inference Optimization Index
Reducing cost and latency for production LLM inference.
Notes
- Quantization for LLMs — bitsandbytes, GPTQ, AWQ, HQQ, and GGUF quantization methods for memory-efficient inference.
- Attention Optimization and KV Cache — Flash Attention, GQA, PagedAttention, and long-context positional encoding.
- LLM Serving Frameworks — vLLM, llama.cpp, TGI, Ollama, and SGLang for production inference.
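The quantization methods listed above all build on the same core idea: rescale floating-point weights into a low-bit integer range. A minimal sketch of absmax int8 quantization (illustrative only; real quantizers such as bitsandbytes or GPTQ work per-block or per-channel, with outlier handling and calibration):

```python
import numpy as np

def quantize_absmax(x: np.ndarray):
    """Map float weights to int8 using the tensor's absolute maximum as the scale."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the stored scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)
# Round-trip error is bounded by half the quantization step (scale / 2).
print("max abs error:", np.abs(w - w_hat).max())
```

Storing `q` (1 byte per weight) plus one scale instead of float32 weights (4 bytes each) is what cuts memory roughly 4x; the function names here are mine, not from any of the libraries above.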