Multimodal AI
Models that handle multiple input or output modalities: text, images, audio, video, code.
Multimodal models can see, hear, read, and increasingly speak and generate visuals. ChatGPT can describe an image, Gemini can analyze a video, Claude can read a PDF, and Sora can generate footage from a prompt.
The shift from text-only to multimodal has changed what's buildable. AI customer support can now read screenshots of invoices. Coding agents can look at design mockups. Tutors can mark up handwritten math.
Under the hood, modern multimodal models are usually a single transformer with multiple input encoders (one per modality); for output, they either generate directly or route the request to a specialist model.
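A minimal sketch of that encoder-fusion pattern in PyTorch, assuming a patch-based image encoder and direct text generation; the class names, dimensions, and layer counts here are illustrative, not any particular model's architecture:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Illustrative image encoder: projects flattened image patches
    into the same token space the transformer uses for text."""
    def __init__(self, patch_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim) -> (batch, num_patches, d_model)
        return self.proj(patches)

class MultimodalTransformer(nn.Module):
    """Hypothetical fusion model: per-modality encoders feed one shared
    transformer that attends over the concatenated token sequence."""
    def __init__(self, vocab_size: int, patch_dim: int, d_model: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_encoder = ImageEncoder(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)  # direct text generation

    def forward(self, text_ids: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        text_tokens = self.text_embed(text_ids)        # (B, T, d_model)
        image_tokens = self.image_encoder(patches)     # (B, P, d_model)
        # Both modalities share one sequence, so attention mixes them freely.
        tokens = torch.cat([image_tokens, text_tokens], dim=1)
        hidden = self.backbone(tokens)
        # Predict next-token logits over the text positions only.
        return self.lm_head(hidden[:, image_tokens.size(1):, :])

# Usage: a toy batch of 7x7 grids of flattened 16x16 RGB patches plus text ids.
model = MultimodalTransformer(vocab_size=1000, patch_dim=16 * 16 * 3)
logits = model(torch.randint(0, 1000, (2, 10)), torch.randn(2, 49, 16 * 16 * 3))
print(logits.shape)  # torch.Size([2, 10, 1000])
```

The design choice this illustrates is the "many encoders, one backbone" layout: each modality gets its own projection into a shared token space, and the transformer itself never needs to know which tokens came from pixels and which from text.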