Inference
Running a trained model to produce outputs (as opposed to training the model in the first place).
Inference is what happens every time a user sends a prompt: the already-loaded model processes the input and generates a response token by token. Training is a one-time (or rare) cost; inference is the recurring one.
Inference cost dominates AI economics at scale, so speeding it up matters. Common techniques include quantization (storing weights in fewer bits), speculative decoding (a small model drafts several tokens that a larger model verifies in one pass), batching requests together, KV cache reuse, and specialized hardware.
Inference providers like Groq, Together AI, Fireworks, and Cerebras compete on speed and price. For self-hosting, frameworks like vLLM, TensorRT-LLM, and llama.cpp serve models efficiently on GPUs and CPUs.