Vision = ability for models to see and understand images.
- e.g. models can understand the text in images
Input Format
Two types:
- Fully qualified URL to an image file
- Image as Base64-encoded data URL
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                # must be a fully qualified URL, including the scheme
                "image_url": "https://www.example.com/image.png",
            },
        ],
    }],
)

print(response.output_text)
```
Requirements:
- PNG, JPEG, WEBP, or non-animated GIF
- <= 20 MB per image
- low-res mode: 512px x 512px
- high-res mode: 768px (short side) x 2000px (long side)
- no watermarks, no NSFW content, clear enough for a human to understand
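A quick local pre-check against these limits can catch bad uploads early. This is a best-effort sketch, assuming Pillow is installed; the format and animation checks are heuristics, not the API's own validation:

```python
import os
from PIL import Image  # assumption: Pillow is available

ALLOWED_FORMATS = {"PNG", "JPEG", "WEBP", "GIF"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB per-image cap

def check_image(path: str) -> None:
    """Best-effort check of the documented upload requirements."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("image exceeds the 20 MB limit")
    with Image.open(path) as im:
        if im.format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {im.format}")
        # GIFs must be non-animated; is_animated only exists on multi-frame formats
        if getattr(im, "is_animated", False):
            raise ValueError("animated GIFs are not supported")
```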
Detail Param
Three options:
- low, high, auto
Choose low to speed up responses and use fewer input tokens.
In low mode, the model processes the image with:
- a fixed budget of 85 tokens
- a low-resolution 512px x 512px version of the image
{ "type": "input_image", "image_url": "some.jpg", "detail": "high" }
Cost
For gpt-4.1-mini, gpt-4.1-nano, and o4-mini:
1 patch = 1 token
- images are tiled into 32px x 32px patches, so a 32px x 32px image = 1 patch = 1 token.
Max patches per image = 1536 total.
- think of 1536 patches as the max surface area; larger images are scaled down to fit.
1024 x 1024 image = (1024 / 32) x (1024 / 32) = 1,024 patches
- 1,024 patches x 1 token per patch = 1,024 tokens.
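A back-of-the-envelope version of that patch math. This is a sketch, not OpenAI's billing code; the proportional scale-down step for oversized images is an assumption:

```python
import math

PATCH = 32          # patch edge in px
MAX_PATCHES = 1536  # per-image cap

def image_tokens(width: int, height: int) -> int:
    """Estimate base tokens: 1 token per 32x32 patch, capped at 1536 patches."""
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    if patches > MAX_PATCHES:
        # Assumption: oversized images are scaled down proportionally to fit the cap.
        scale = math.sqrt(MAX_PATCHES * PATCH * PATCH / (width * height))
        width, height = int(width * scale), int(height * scale)
        patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return min(patches, MAX_PATCHES)

print(image_tokens(1024, 1024))  # 1024, matching the worked example above
```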
Base64
A 200 KB JPEG ≈ 270,000 Base64 characters (Base64 inflates the byte count by ~33%) ≈ 67,500 tokens (roughly 4 characters per token).
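For completeness, a minimal sketch of the Base64 path; the file name and MIME type are placeholders:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Read a local file and embed it as a data URL (path and MIME type are placeholders).
with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": f"data:image/jpeg;base64,{b64}",
            },
        ],
    }],
)

print(response.output_text)
```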
Limitations
Non-English Text
- The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
Rotation
- The model may misinterpret rotated or upside-down text and images.
Visual Elements
- The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.
Spatial Reasoning
- The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
Metadata
- The model doesn't process original file names or metadata.
Resizing
- Images are resized before analysis, affecting their original dimensions.