Vision = ability for models to see and understand images.
- e.g. models can understand the text in images
Input Format
Two types:
- Fully qualified URL to an image file
- Image as Base64-encoded data URL
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                # must be a fully qualified URL, including the scheme
                "image_url": "https://www.example.com/image.png",
            },
        ],
    }],
)

print(response.output_text)
```
Requirements:
- PNG, JPEG, WEBP, or non-animated GIF
- <= 20 MB per image
- low-res mode: 512px x 512px
- high-res mode: 768px (short side) x 2000px (long side)
- no watermarks, no NSFW content, clear enough for a human to understand
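A quick local pre-check against these limits can catch bad uploads early. This is a best-effort sketch, assuming Pillow is installed; the format and animation checks are heuristics, not the API's own validation:

```python
import os
from PIL import Image  # assumption: Pillow is available

ALLOWED_FORMATS = {"PNG", "JPEG", "WEBP", "GIF"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB per-image cap

def check_image(path: str) -> None:
    """Best-effort check of the documented upload requirements."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("image exceeds the 20 MB limit")
    with Image.open(path) as im:
        if im.format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {im.format}")
        # GIFs must be non-animated; is_animated only exists on multi-frame formats
        if getattr(im, "is_animated", False):
            raise ValueError("animated GIFs are not supported")
```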
Detail Param
Three options:
- low, high, auto
Choose low to speed up responses and use fewer input tokens.
In low mode, the model processes the image with:
- a fixed budget of 85 tokens
- a low-resolution 512px x 512px version of the image
{ "type": "input_image", "image_url": "some.jpg", "detail": "high" }
Cost
For gpt-4.1-mini, gpt-4.1-nano, and o4-mini:
1 patch = 1 token
- images are tiled into 32px x 32px patches, so a 32px x 32px image = 1 patch = 1 token.
Max patches per image = 1536 total.
- think of 1536 patches as the max surface area; larger images are scaled down to fit.
1024 x 1024 image = (1024 / 32) x (1024 / 32) = 1,024 patches
- 1,024 patches x 1 token per patch = 1,024 tokens.
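A back-of-the-envelope version of that patch math. This is a sketch, not OpenAI's billing code; the proportional scale-down step for oversized images is an assumption:

```python
import math

PATCH = 32          # patch edge in px
MAX_PATCHES = 1536  # per-image cap

def image_tokens(width: int, height: int) -> int:
    """Estimate base tokens: 1 token per 32x32 patch, capped at 1536 patches."""
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    if patches > MAX_PATCHES:
        # Assumption: oversized images are scaled down proportionally to fit the cap.
        scale = math.sqrt(MAX_PATCHES * PATCH * PATCH / (width * height))
        width, height = int(width * scale), int(height * scale)
        patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return min(patches, MAX_PATCHES)

print(image_tokens(1024, 1024))  # 1024, matching the worked example above
```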
Base64
A 200 KB JPEG ≈ 270,000 Base64 characters (Base64 inflates the byte count by ~33%) ≈ 67,500 tokens (roughly 4 characters per token).
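For completeness, a minimal sketch of the Base64 path; the file name and MIME type are placeholders:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Read a local file and embed it as a data URL (path and MIME type are placeholders).
with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": f"data:image/jpeg;base64,{b64}",
            },
        ],
    }],
)

print(response.output_text)
```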
Limitations
Non-English Text
- The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
Rotation
- The model may misinterpret rotated or upside-down text and images.
Visual Elements
- The model may struggle to understand graphs or text where colors or styles—like solid, dashed, or dotted lines—vary.
Spatial Reasoning
- The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
Metadata
- The model doesn't process original file names or metadata.
Resizing
- Images are resized before analysis, affecting their original dimensions.