
Vision = ability for models to see and understand images.

  • e.g. models can understand the text in images

Input Format

Two types:

  • Fully qualified URL to an image file
  • Image as Base64-encoded data URL
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                # must be a fully qualified URL, including the scheme
                "image_url": "https://www.example.com/image.png",
            },
        ],
    }],
)

print(response.output_text)
```
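For the second input type, the image bytes can be Base64-encoded and passed as a data URL in the same image_url field. A minimal sketch (to_data_url is a hypothetical helper, and fake_png is stand-in bytes; in practice you would read a real image file):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes as a Base64 data URL for the image_url field."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Stand-in bytes; in practice, read them from a real image file.
fake_png = b"\x89PNG\r\n\x1a\n"

# Same request content shape as above, with a data URL instead of a web URL.
content = [
    {"type": "input_text", "text": "what's in this image?"},
    {"type": "input_image", "image_url": to_data_url(fake_png)},
]
```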

Requirements:

  • png, jpeg, webp, non-animated gif
  • <= 20 MB per image
  • low-res mode: 512px x 512px
  • high-res mode: 768px (short side) x 2000px (long side)
  • no watermarks, no nsfw content, clear enough for a human to understand

Details Param

Three options:

  • low, high, auto

Choose low to speed up responses and use fewer tokens.

With detail set to low, the model will process the image with:

  • a budget of 85 tokens
  • a low-resolution 512px x 512px version of the image
```json
{
    "type": "input_image",
    "image_url": "some.jpg",
    "detail": "high"
}
```

Cost

4.1-mini, 4.1-nano, o4-mini

1 token = 1 patch

  • 32px x 32px image = 1 patch = 1 token.

Max patches per image = 1,536 total.

  • think of 1,536 as the max surface area (in patches).

1024 x 1024 image = 32 x 32 = 1,024 patches

  • 1,024 patches x 1 token per patch = 1,024 tokens.
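The patch math above can be sketched as a quick estimator (an assumption: the service presumably rescales oversized images so they fit under the cap, while this sketch simply caps the count):

```python
import math

PATCH_SIZE = 32      # each patch covers a 32px x 32px square
MAX_PATCHES = 1536   # per-image cap

def image_token_estimate(width: int, height: int) -> int:
    """Rough token estimate: 1 token per 32x32 patch, capped at 1,536 patches."""
    patches = math.ceil(width / PATCH_SIZE) * math.ceil(height / PATCH_SIZE)
    return min(patches, MAX_PATCHES)

# A 1024 x 1024 image is 32 x 32 = 1,024 patches, so 1,024 tokens.
```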

Base64

200 KB JPEG = ~270,000 Base64 characters = ~67,500 tokens.
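The character count follows from how Base64 works: every 3 bytes become 4 characters, so encoding inflates a file by about one third. A small sketch of the arithmetic (the ~67,500-token figure then corresponds to a rough heuristic of ~4 characters per token):

```python
import math

def base64_length(n_bytes: int) -> int:
    """Base64 encodes every 3 bytes as 4 characters (with padding)."""
    return math.ceil(n_bytes / 3) * 4

chars = base64_length(200 * 1024)  # a 200 KB file -> ~273,000 characters
```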

Limitations

Non-english Text

  • The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

Rotation

  • The model may misinterpret rotated or upside-down text and images.

Visual Elements

  • The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.

Spatial Reasoning

  • The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

Metadata

  • The model doesn't process original file names or metadata.

Resizing

  • Images are resized before analysis, so their original dimensions may not be preserved.