Multimodal Models: Beyond Text
How a Language Model Sees an Image
GPT-4V, Claude 3, and Gemini can all "see" images. The question worth understanding is how: what happens between a JPEG landing in the API request and the model reasoning about what's in it?
The short answer is that images get converted into tokens, the same basic unit the model uses for text. Once the image is tokenized, the model processes visual and textual tokens together in one pass through the transformer. From the model's perspective, an image is just a different kind of token sequence.
The mechanism that does this conversion is called a vision encoder, and the dominant architecture behind it is the Vision Transformer, or ViT (Dosovitskiy et al., 2020).
From Pixels to Tokens: The Patch Approach
The Vision Transformer takes a surprisingly direct approach. Instead of processing an image pixel by pixel (too granular) or as one giant blob (too coarse), it divides the image into a grid of fixed-size patches, typically 16×16 pixels each.
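Each patch is flattened and linearly projected into an embedding vector, playing the same role as a word token embedding. A 224×224 pixel image divided into 16×16 patches yields a 14×14 grid, or 196 patch vectors. A positional embedding is then added to each patch embedding so the model knows where in the image each patch came from; the original ViT uses a learned 1D embedding over the flattened patch sequence, while some later variants use 2D encodings.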
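The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a ViT implementation: `patch_grid` and `patchify` are hypothetical helper names, the image is assumed to be a single-channel grid stored as a list of rows, and the learned projection that turns each flattened patch into an embedding is left out.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols, total) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    rows, cols, _ = patch_grid(height, width, patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[r * patch_size + dy][c * patch_size + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


print(patch_grid(224, 224))  # (14, 14, 196)

# A toy 4x4 "image" split into 2x2 patches.
tiny = [[y * 4 + x for x in range(4)] for y in range(4)]
print(patchify(tiny, 2)[0])  # [0, 1, 4, 5]
```

In a real encoder each flattened patch would next be multiplied by a learned projection matrix and summed with its positional embedding before entering the transformer.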
These patch embeddings are fed into a transformer, structurally the same architecture used for text. The transformer attends across all the patches, learning relationships between different parts of the image.
Native Multimodal vs. Adapters
Not all multimodal models are built the same way.
Adapter-based models take a pre-trained text model and bolt on a vision encoder after the fact. A small adapter network maps the vision encoder's output into the text model's embedding space. This is faster to train and works reasonably well, but the visual and linguistic representations were never jointly optimized.
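To make the adapter idea concrete, here is a toy sketch of the mapping step. Everything in it is an assumption for illustration: `make_linear_adapter` is a hypothetical name, the dimensions are tiny, and the weights are random rather than trained. Real adapters are learned, and may be an MLP or an attention-based module rather than a single linear layer.

```python
import random


def make_linear_adapter(vision_dim, text_dim, seed=0):
    """Build a toy linear map from vision-encoder space to text-embedding space."""
    rng = random.Random(seed)
    # Random weights stand in for parameters that would normally be trained.
    weights = [[rng.gauss(0.0, 0.02) for _ in range(vision_dim)]
               for _ in range(text_dim)]
    bias = [0.0] * text_dim

    def project(patch_features):
        # One output vector of length text_dim per input patch feature.
        return [
            [sum(w * x for w, x in zip(row, feature)) + b
             for row, b in zip(weights, bias)]
            for feature in patch_features
        ]

    return project


VISION_DIM, TEXT_DIM = 8, 12  # illustrative sizes only
project = make_linear_adapter(VISION_DIM, TEXT_DIM)

# Four fake patch features from a vision encoder, three fake text embeddings.
patch_features = [[0.1 * i] * VISION_DIM for i in range(4)]
image_tokens = project(patch_features)
text_tokens = [[0.0] * TEXT_DIM for _ in range(3)]

# After projection, image and text tokens share one sequence and one width.
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 7 12
```

The point of the sketch is the shape change: once projected, image tokens have the same width as text embeddings and can sit in the same input sequence.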
Native multimodal models are trained with mixed image-text data from the start. The model learns visual and linguistic representations together. These models tend to handle complex visual reasoning better, particularly tasks that require integrating visual details with world knowledge, like reading a chart and drawing inferences from it.
The distinction matters when evaluating models for tasks involving nuanced visual understanding, like medical imaging analysis or reading diagrams in technical documents.
Audio and Video
The same patch-based approach extends to other modalities. Audio is typically converted to a mel spectrogram (a 2D representation of frequency over time) and then treated as an image with the same patch-and-embed pipeline. Video is treated as a sequence of image frames, each converted to patch tokens.
Native multimodal training on video is still maturing as of 2024–2025. The challenge is primarily computational: a 10-second video clip at 24fps contains 240 frames, and at 196 patch tokens per 224×224 frame that comes to roughly 47,000 tokens before any frame sampling or compression.
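A rough way to see the scale is to count patches directly. The sketch below uses hypothetical helper names, and the audio parameters (128 mel bins, a 10 ms hop, zero-padding the last partial patch) are assumptions chosen for illustration rather than values from any particular model.

```python
import math


def video_patch_tokens(seconds, fps, height=224, width=224,
                       patch_size=16, frame_stride=1):
    """Count patch tokens for a clip, keeping every frame_stride-th frame."""
    total_frames = int(seconds * fps)
    sampled_frames = len(range(0, total_frames, frame_stride))
    tokens_per_frame = (height // patch_size) * (width // patch_size)
    return sampled_frames * tokens_per_frame


def spectrogram_patch_tokens(seconds, hop_ms=10, n_mels=128, patch_size=16):
    """Count patches over a mel spectrogram, padding the last partial column."""
    time_frames = int(seconds * 1000 / hop_ms)
    rows = math.ceil(n_mels / patch_size)
    cols = math.ceil(time_frames / patch_size)
    return rows * cols


print(video_patch_tokens(10, 24))                   # 47040
print(video_patch_tokens(10, 24, frame_stride=24))  # 1960
print(spectrogram_patch_tokens(10))                 # 504
```

Sampling one frame per second instead of every frame cuts the count by a factor of 24, which is why frame sampling is a common first lever for video inputs.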
Where This Shows Up in Products
Multimodal capabilities enable a set of product use cases that weren't possible with text-only models:
- Document understanding: PDFs with charts, tables, and diagrams, not just the text layer
- Visual question answering: "What does this error message say?" from a screenshot
- UI testing automation: screenshot → describe what's on screen → compare to expected state
- Medical imaging: describe findings in a scan (specialized models, high stakes, requires validation)
- Content moderation: classify images at scale
- Accessibility: generate alt text for images automatically
The Cost Implication
High-resolution images consume substantially more tokens than text. Depending on the model and the resolution settings you pass, a single image can consume 1,000–4,000 tokens. At GPT-4V pricing, processing a high-resolution image can cost 10–40× more than a typical short text exchange.
Most providers expose a resolution/detail trade-off: low-detail mode uses a fixed small number of tokens; high-detail mode tiles the image and processes multiple crops, producing higher token counts but capturing fine details. For use cases like reading a receipt or examining a microscopy image, high detail is necessary. For use cases like classifying whether an image contains a face, low detail is usually sufficient.
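The low-detail versus high-detail trade-off can be sketched as a small estimator. The constants below (a flat 85-token base, 170 tokens per 512-pixel tile, a 2048-pixel bounding square, a 768-pixel short side) are assumptions modeled on the tiling scheme OpenAI has documented for GPT-4 with vision. Other providers count differently, and these values may change, so treat `estimate_image_tokens` as a hypothetical helper and check your provider's current documentation.

```python
import math


def estimate_image_tokens(width, height, detail="high",
                          base_tokens=85, tile_tokens=170, tile_size=512,
                          max_side=2048, target_short_side=768):
    """Estimate image token usage under an assumed tile-based scheme."""
    if detail == "low":
        # Low detail: a fixed cost regardless of image size.
        return base_tokens

    # Step 1: shrink to fit inside a max_side x max_side square.
    scale = min(1.0, max_side / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink again so the shorter side is at most target_short_side.
    scale = min(1.0, target_short_side / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the tiles needed to cover the resized image.
    tiles = math.ceil(w / tile_size) * math.ceil(h / tile_size)
    return base_tokens + tile_tokens * tiles


print(estimate_image_tokens(1024, 1024, detail="low"))   # 85
print(estimate_image_tokens(1024, 1024, detail="high"))  # 765
print(estimate_image_tokens(2048, 4096, detail="high"))  # 1105
```

Running an estimator like this over a sample of real inputs before launch gives a more realistic cost forecast than a single per-image average.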
Multimodal inputs increase token count and cost significantly. A high-resolution image can consume 1,000–4,000 tokens depending on the model. Build image handling into your cost model from the start, and understand the resolution/detail trade-off your provider exposes.