Inference
Running a trained model to produce outputs (as opposed to training the model in the first place).
Inference is what happens every time a user sends a prompt: the already-loaded model processes the input and generates a response token by token. Training is a one-time (or rare) cost; inference is the recurring one.
Inference cost dominates AI economics at scale, so speeding it up matters. Common techniques include quantization (storing weights in fewer bits), speculative decoding (a small model drafts several tokens that a larger model verifies in one pass), batching requests together, KV cache reuse, and specialized hardware.
Inference providers like Groq, Together AI, Fireworks, and Cerebras compete on speed and price. For self-hosting, frameworks like vLLM, TensorRT-LLM, and llama.cpp serve models efficiently on GPUs and CPUs.