Multimodal Models — Beyond Text

PM: Read in full — 15 min

How a Language Model Sees an Image

Current flagship models from OpenAI, Anthropic, and Google (GPT, Claude, and Gemini) can all "see" images. The question worth understanding is how: what happens between a JPEG landing in the API request and the model reasoning about what's in it?

The short answer is that images get converted into tokens — the same basic unit the model uses for text. Once the image is tokenized, the model processes visual and textual tokens together in one pass through the transformer. From the model's perspective, an image is just a different kind of token sequence.

The mechanism that does this conversion is called a vision encoder, and the dominant architecture behind it is the Vision Transformer, or ViT (Dosovitskiy et al., 2020).

From Pixels to Tokens: The Patch Approach

The Vision Transformer takes a surprisingly direct approach. Instead of processing an image pixel by pixel (too granular) or as one giant blob (too coarse), it divides the image into a grid of fixed-size patches — typically 16×16 pixels each.

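Each patch is flattened and linearly projected into a vector, much as a word token is mapped to an embedding. A 224×224 pixel image divided into 16×16 patches produces a 14×14 grid, or 196 patch vectors. A positional embedding is then added to each patch embedding so the model knows where in the image each patch came from. (The original ViT uses learned 1D position embeddings; some later variants use 2D-aware schemes.)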
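The patch arithmetic above can be sketched in plain Python. This is a minimal illustration, not a real vision encoder: it assumes a single-channel image stored as a list of rows, the function names `patch_grid` and `patchify` are made up for this example, and the learned projection that turns each flattened patch into an embedding is left out.

```python
def patch_grid(height, width, patch=16):
    """Return (rows, cols) of the patch grid; sides must divide evenly."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


def patchify(image, patch=16):
    """Split an H x W image (list of pixel rows) into flattened patches.

    Returns (position_index, flat_pixels) pairs in row-major order. The
    position index is what a positional embedding would be looked up by.
    """
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            flat = [
                image[r * patch + y][c * patch + x]
                for y in range(patch)
                for x in range(patch)
            ]
            patches.append((r * cols + c, flat))
    return patches


# A dummy 224x224 single-channel image (all zeros).
image = [[0] * 224 for _ in range(224)]
patches = patchify(image)
print(len(patches))        # 196 patches (14 x 14 grid)
print(len(patches[0][1]))  # 256 pixel values per patch (16 x 16)
```

For an RGB image each flattened patch would hold 16 × 16 × 3 = 768 values instead of 256; the patch count stays at 196.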

These patch embeddings are fed into a transformer — structurally the same architecture used for text. The transformer attends across all the patches, learning relationships between different parts of the image.
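To make "attends across all the patches" concrete, here is a deliberately stripped-down sketch of self-attention over a list of patch vectors. It is an assumption-heavy simplification: a real transformer uses learned query, key, and value projections, multiple heads, and many stacked layers, none of which appear here. The names `softmax` and `self_attention` are just for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (simplified).

    Each output vector is a weighted average of all input vectors, with
    weights from scaled dot-product similarity.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Two toy "patch embeddings"; each output mixes information from both.
mixed = self_attention([[1.0, 0.0], [0.0, 1.0]])
print(len(mixed), len(mixed[0]))  # 2 2
```

The point of the sketch is that every patch's output depends on every other patch, which is how the model relates distant regions of the image.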

Native Multimodal vs. Adapters

Not all multimodal models are built the same way.

Adapter-based models take a pre-trained text model and bolt on a vision encoder after the fact. A small adapter network maps the vision encoder's output into the text model's embedding space. This is faster to train and works reasonably well, but the visual and linguistic representations were never jointly optimized.
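A rough sketch of what the adapter does, under simplifying assumptions: the adapter here is a single linear layer with random weights, whereas real adapters are trained and are often small multi-layer networks. The dimensions (8 for the vision encoder, 12 for the text model) are toy values, and `make_adapter` and `project` are hypothetical names.

```python
import random


def make_adapter(vision_dim, text_dim, seed=0):
    """Create an untrained linear adapter: a weight matrix plus a bias."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0, 0.02) for _ in range(vision_dim)] for _ in range(text_dim)
    ]
    bias = [0.0] * text_dim
    return weights, bias


def project(adapter, vision_tokens):
    """Map each vision-encoder vector into the text model's embedding size."""
    weights, bias = adapter
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weights, bias)]
        for token in vision_tokens
    ]


adapter = make_adapter(vision_dim=8, text_dim=12)
image_tokens = project(adapter, [[1.0] * 8 for _ in range(196)])  # 196 patches
text_tokens = [[0.0] * 12 for _ in range(5)]  # 5 placeholder text embeddings

# The text model then sees one combined sequence of same-sized vectors.
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 201 12
```

Only the adapter's weights need to be learned in this setup, which is why the approach is cheaper to train than a natively multimodal model.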

Native multimodal models are trained with mixed image-text data from the start. The model learns visual and linguistic representations together. These models tend to handle complex visual reasoning better — tasks that require integrating visual details with world knowledge, like reading a chart and drawing inferences from it.

The distinction matters when evaluating models for tasks involving nuanced visual understanding, like medical imaging analysis or reading diagrams in technical documents.

Audio and Video

The same patch-based approach extends to other modalities. Audio is typically converted to a mel spectrogram — a 2D representation of frequency over time — and then treated as an image with the same patch-and-embed pipeline. Video is treated as a sequence of image frames, each converted to patch tokens.
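As a back-of-the-envelope illustration of the audio case, the sketch below counts patches for a spectrogram. All parameters are assumptions chosen for the example (a 10 ms hop between spectrogram frames, 128 mel bins, 16×16 patches, padding up to whole patches); actual audio encoders vary, and the exact frame count depends on windowing and padding details.

```python
import math


def spectrogram_patch_count(duration_ms, hop_ms=10, n_mels=128, patch=16):
    """Approximate patch count for a mel spectrogram treated as an image.

    The spectrogram is n_mels tall and one column wide per hop; partial
    patches at the edge are assumed to be padded to full size.
    """
    frames = duration_ms // hop_ms
    return math.ceil(n_mels / patch) * math.ceil(frames / patch)


print(spectrogram_patch_count(10_000))  # 504 patches for 10 seconds of audio
```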

Native multimodal training on video continues to mature. The challenge is primarily computational: a 10-second video clip at 24fps generates an enormous number of patch tokens.
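The scale of the problem is easy to work out. This sketch assumes every frame is encoded independently at 224×224 with 16×16 patches and no compression across frames; production systems typically sample fewer frames or pool tokens, so treat the numbers as an upper-bound illustration. The function name is hypothetical.

```python
def video_patch_tokens(seconds, fps, height=224, width=224, patch=16):
    """Patch tokens for a clip if every frame is tokenized independently."""
    frames = seconds * fps
    per_frame = (height // patch) * (width // patch)
    return frames * per_frame


print(video_patch_tokens(10, 24))  # 47040 tokens at the full 24 fps
print(video_patch_tokens(10, 1))   # 1960 tokens if sampled at 1 frame/second
```

Sampling one frame per second instead of 24 cuts the token count by the same factor of 24, at the price of missing fast motion.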

Where This Shows Up in Products

Multimodal capabilities enable a set of product use cases that weren't possible with text-only models:

Document understanding: PDFs with charts, tables, and diagrams — not just the text layer
Visual question answering: "What does this error message say?" from a screenshot
UI testing automation: screenshot → describe what's on screen → compare to expected state
Medical imaging: describe findings in a scan (specialized models, high stakes, requires validation)
Content moderation: classify images at scale
Accessibility: generate alt text for images automatically

The Cost Implication

A high-resolution image consumes far more tokens than a typical text prompt. Depending on the model and the resolution settings you pass, a single image can consume 1,000–4,000 tokens. Across major providers, that means processing one high-resolution image can cost roughly 10–40× as much as a short text exchange of around 100 tokens.

Most providers expose a resolution/detail trade-off: low-detail mode uses a fixed small number of tokens; high-detail mode tiles the image and processes multiple crops, producing higher token counts but capturing fine details. For use cases like reading a receipt or examining a microscopy image, high detail is necessary. For use cases like classifying whether an image contains a face, low detail is usually sufficient.
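The low-detail versus high-detail trade-off can be sketched as a simple estimator. The formula and every number in it (512-pixel tiles, 200 tokens per tile, a 100-token base) are placeholders invented for this example, not any provider's published pricing rule. Real providers usually also resize the image before tiling, so check your provider's documentation for the actual calculation.

```python
import math


def estimate_image_tokens(width, height, detail="high",
                          tile=512, tokens_per_tile=200, base_tokens=100):
    """Illustrative token estimate; all default values are hypothetical."""
    if detail == "low":
        # Low detail: a fixed, small token budget regardless of image size.
        return base_tokens
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")
    # High detail: the image is covered by tiles and each tile costs tokens.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base_tokens + tiles * tokens_per_tile


print(estimate_image_tokens(1024, 1024))          # 900  (4 tiles)
print(estimate_image_tokens(2048, 1536))          # 2500 (12 tiles)
print(estimate_image_tokens(2048, 1536, "low"))   # 100  (fixed)
```

Whatever the real formula is, the shape is the same: low detail is flat, and high detail grows with image area.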

PM Takeaway

Multimodal inputs increase token count and cost significantly. A high-resolution image can consume 1,000–4,000 tokens depending on the model. Build image handling into your cost model from the start, and understand the resolution/detail trade-off your provider exposes.
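A starting point for that cost model might look like the sketch below. The traffic volume and the price of $3 per million input tokens are made-up inputs for illustration; substitute your own volumes and your provider's current pricing.

```python
def monthly_image_cost(images_per_day, tokens_per_image,
                       usd_per_million_tokens, days=30):
    """Estimated monthly spend on image input tokens alone."""
    tokens = images_per_day * tokens_per_image * days
    return tokens * usd_per_million_tokens / 1_000_000


# Hypothetical product: 10,000 images per day at $3 per million tokens.
print(monthly_image_cost(10_000, 1_000, 3.0))  # 900.0  (low end of the range)
print(monthly_image_cost(10_000, 4_000, 3.0))  # 3600.0 (high end of the range)
```

The 4× spread between the two lines comes entirely from the 1,000–4,000 token range, which is why the resolution setting deserves a deliberate decision.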
