Vision = the ability of models to see and understand images.
- e.g. a model can read and interpret text that appears inside an image
Input Format
Two types:
- Fully qualified URL to an image file
- Image as Base64-encoded data URL
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": "https://www.example.com/image.png",
            },
        ],
    }],
)
print(response.output_text)
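The second input type, a Base64-encoded data URL, can be built from raw image bytes. A minimal sketch (`to_data_url` is a hypothetical helper name, not part of the SDK; the commented request mirrors the URL example above):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    # Base64-encode raw image bytes into a data URL the API accepts.
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# Hypothetical usage with the Responses API:
# from openai import OpenAI
# client = OpenAI()
# with open("image.png", "rb") as f:
#     data_url = to_data_url(f.read())
# response = client.responses.create(
#     model="gpt-4.1-mini",
#     input=[{
#         "role": "user",
#         "content": [
#             {"type": "input_text", "text": "what's in this image?"},
#             {"type": "input_image", "image_url": data_url},
#         ],
#     }],
# )
```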
Requirements:
- PNG, JPEG, WEBP, or non-animated GIF
- <= 20 MB per image
- low-res mode: 512px x 512px
- high-res mode: 768px (short side) x 2000px (long side)
- no watermarks, no NSFW content, clear enough for a human to understand
Details Param
Three options: low, high, and auto (the default).
Choose low to speed up responses and spend fewer tokens.
With low, the model will process the image as:
- a fixed budget of 85 tokens
- a low-resolution 512px x 512px version of the image
{
    "type": "input_image",
    "image_url": "some.jpg",
    "detail": "high"
}
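Since only three detail values are valid, a small guard can catch typos before the request is sent. A sketch (`image_part` is a hypothetical helper, not part of the SDK):

```python
def image_part(url: str, detail: str = "auto") -> dict:
    # Build an input_image content part, validating the detail option.
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"invalid detail: {detail!r}")
    return {"type": "input_image", "image_url": url, "detail": detail}
```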
Cost
4.1-mini, 4.1-nano, o4-mini
1 patch = 1 token (base cost; these models then apply a model-specific multiplier)
- 1 patch covers a 32px x 32px region of the image.
Max patches per image = 1536 total.
- think of 1536 as the max surface area.
1024 x 1024 image = (1024/32) x (1024/32) = 32 x 32 = 1,024 patches
- 1,024 patches x 1 token per patch = 1,024 tokens.
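The patch arithmetic can be sketched as a quick estimator. This is a simplified upper-bound approximation: the real API scales an oversized image down until it fits the cap rather than hard-truncating the patch count.

```python
import math

def image_tokens(width: int, height: int, cap: int = 1536) -> int:
    # Each 32x32-pixel patch costs 1 base token; total patches are capped.
    patches = math.ceil(width / 32) * math.ceil(height / 32)
    return min(patches, cap)
```

For a 1024 x 1024 image this gives 32 x 32 = 1,024 tokens, matching the worked example above.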
Base64
A 200 KB JPEG expands to roughly 270,000 Base64 characters, which is about 67,500 tokens (at roughly 4 characters per token).
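The overhead math follows from Base64 turning every 3 bytes into 4 characters (~33% expansion), plus the rough rule of thumb that 4 characters is about 1 token. A quick estimator:

```python
import math

def base64_token_estimate(file_size_bytes: int) -> tuple[int, int]:
    # Base64 encodes every 3-byte group as 4 characters (~33% overhead);
    # a rough rule of thumb is ~4 characters per token.
    chars = math.ceil(file_size_bytes / 3) * 4
    tokens = chars // 4
    return chars, tokens
```

For 200 KB (204,800 bytes) this gives about 273,000 characters and ~68,000 tokens, in line with the figures above.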
Limitations
Non-English Text
- The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
Rotation
- The model may misinterpret rotated or upside-down text and images.
Visual Elements
- The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.
Spatial Reasoning
- The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
Metadata
- The model doesn't process original file names or metadata.
Resizing
- Images are resized before analysis, affecting their original dimensions.