Guide: Adding Images to Surveys#

Some survey questions make more sense when the model can see a picture. QSTN lets you attach images to the whole questionnaire, to one specific item, or to both.

In this guide we use a small vision-language model and four friendly, neutral images. We first look at how images are scoped (global versus item-level), then run the same two-item gallery as a single-item, battery, and sequential survey.

Image Sources and Licenses#

We use four example images from Wikimedia Commons:

Earth from Apollo 17 — NASA, public domain.
Apple I computer — CC0.
Apple on a white background — public domain.
Banana — National Cancer Institute, public domain.

Imports#

import time
from pathlib import Path
from urllib.error import HTTPError
from urllib.request import Request, urlopen

import pandas as pd

from qstn.inference import ImageInput
from qstn.logger import configure_logging
from qstn.parser import raw_responses
from qstn.prompt_builder import LLMPrompt, QuestionnairePresentation
from qstn.survey_manager import (
    conduct_survey_battery,
    conduct_survey_sequential,
    conduct_survey_single_item,
)
from qstn.utilities import create_one_dataframe, placeholder

configure_logging(level="WARNING", force=True)

1. Set Up the Images and Cache Them for Repeated Use#

QSTN uses ImageInput to store an image source and optionally a label. Labels are useful when several images appear in one prompt because they give both us and the model a simple way to refer to each image.

A plain URL is also accepted. QSTN automatically turns it into an ImageInput without a label.
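
As a rough illustration of that normalization, the sketch below uses a stand-in dataclass, `SketchImage`, and a hypothetical helper, `as_image`. Neither name is part of QSTN, and the library's actual conversion may differ; the point is only that a bare source and a labeled record can end up in the same shape.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass(frozen=True)
class SketchImage:
    """Stand-in image record (hypothetical, not QSTN's ImageInput)."""

    source: Union[str, Path]
    label: Optional[str] = None


def as_image(value: Union[str, Path, SketchImage]) -> SketchImage:
    """Wrap a bare URL or path in a record; pass existing records through."""
    if isinstance(value, SketchImage):
        return value
    return SketchImage(source=value)


# A bare URL becomes a record without a label.
unlabeled = as_image("https://example.org/earth.jpg")
# An existing labeled record is returned unchanged.
labeled = as_image(SketchImage("https://example.org/earth.jpg", label="Earth"))
print(unlabeled.label, labeled.label)
```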

Because we run the same images in three presentation modes, the setup cell downloads each URL once to a temporary cache. This avoids repeatedly requesting the same files from Wikimedia; the rest of the guide can stay focused on image attachment and routing.

EARTH_URL = (
    "https://commons.wikimedia.org/wiki/Special:Redirect/file/"
    "The_Earth_seen_from_Apollo_17.jpg?width=512"
)
APPLE_I_URL = (
    "https://commons.wikimedia.org/wiki/Special:Redirect/file/"
    "Apple_Computer_1_%28Apple_I%29_from_Smithsonian_National_Museum_of_American_History"
    ".png?width=512"
)
APPLE_URL = (
    "https://commons.wikimedia.org/wiki/Special:Redirect/file/Apple-003.jpg?width=512"
)
BANANA_URL = (
    "https://commons.wikimedia.org/wiki/Special:Redirect/file/Banana_%281%29.jpg?width=512"
)

IMAGE_URLS = {
    "earth": (EARTH_URL, "earth.jpg"),
    "apple_i": (APPLE_I_URL, "apple_i.png"),
    "apple": (APPLE_URL, "apple.jpg"),
    "banana": (BANANA_URL, "banana.jpg"),
}
CACHE_DIR = Path("/tmp/qstn_image_guide")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cache_image(url, filename):
    """Download url into CACHE_DIR once; retry on HTTP 429 with a growing delay."""
    path = CACHE_DIR / filename
    if path.exists():
        return path
    request = Request(
        url,
        headers={
            "User-Agent": (
                "QSTN documentation tutorial "
                "(https://github.com/dess-mannheim/QSTN)"
            )
        },
    )
    for attempt in range(3):
        try:
            with urlopen(request, timeout=60) as response:
                path.write_bytes(response.read())
            return path
        except HTTPError as error:
            if error.code != 429 or attempt == 2:
                raise
            time.sleep(5 * (attempt + 1))
    return path

IMAGE_PATHS = {
    name: cache_image(url, filename)
    for name, (url, filename) in IMAGE_URLS.items()
}

print("Cached image files:", [path.name for path in IMAGE_PATHS.values()])

Cached image files: ['earth.jpg', 'apple_i.png', 'apple.jpg', 'banana.jpg']
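
The retry loop in `cache_image` waits longer after each rate-limit response (5 seconds, then 10) and gives up on the third failure. The sketch below isolates that pattern so it can be tried without network access. The names `fetch_with_retry` and `fake_fetch` and the injectable `sleep` argument are illustrative assumptions, not part of the guide's setup cell.

```python
import time
from urllib.error import HTTPError


def fetch_with_retry(fetch, attempts=3, base_delay=5, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait base_delay * attempt number, then retry."""
    for attempt in range(attempts):
        try:
            return fetch()
        except HTTPError as error:
            # Re-raise anything that is not rate limiting, and the final failure.
            if error.code != 429 or attempt == attempts - 1:
                raise
            sleep(base_delay * (attempt + 1))


# Fake fetcher for demonstration: rate-limited twice, then succeeds.
calls = {"count": 0}
waits = []


def fake_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise HTTPError(
            "https://example.org/img.jpg", 429, "Too Many Requests", None, None
        )
    return b"image-bytes"


# Record the requested delays instead of actually sleeping.
payload = fetch_with_retry(fake_fetch, sleep=waits.append)
print(payload, waits)
```

Passing `sleep=waits.append` records the delays instead of pausing, which keeps the demonstration fast; real use would keep the default `time.sleep`.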

In this tutorial we ask the model two simple questions about the images.

questions = pd.DataFrame(
    [
        {
            "questionnaire_item_id": 1,
            "question_content": "Which item 1 image shows an apple: A or B?",
        },
        {
            "questionnaire_item_id": 2,
            "question_content": "What fruit is shown in the image labeled 'Item 2 image'? \
Could the fruit be found on the planet shown in the shared reference?",
        },
    ]
)

image_setup = LLMPrompt(questionnaire_source=questions)

2. Add, Replace, Inspect, and Clear Images#

QSTN supports both global and local images.

An image is global when no item_id is supplied. Global images are placed ahead of any item images, so every question in the survey can refer to them.

An image is local when an item_id is provided. Local images are attached only to that item, directly after its question text.

There are two methods that allow you to modify images in your survey.

The method add_image() appends one image.
The method set_images() replaces the full collection at that scope, which is helpful when reusing a prompt for a new condition.

# Add one global image without a label
image_setup.add_image(EARTH_URL)

# Replace it with two labeled global images
# (the unlabeled image is no longer part of the LLMPrompt)
image_setup.set_images(
    [
        ImageInput(IMAGE_PATHS["earth"], label="Shared reference: Earth"),
        ImageInput(IMAGE_PATHS["apple_i"], label="Shared reference: Apple I computer"),
    ]
)

# Set two images on the item with id 1 (replaces any existing images for that item)
image_setup.set_images(
    [
        ImageInput(IMAGE_PATHS["apple"], label="Item 1 image A"),
        ImageInput(IMAGE_PATHS["banana"], label="Item 1 image B"),
    ],
    item_id=1,
)

# A single image can be appended to another item.
image_setup.add_image(
    ImageInput(IMAGE_PATHS["banana"], label="Item 2 image"),
    item_id=2,
)

<qstn.prompt_builder.LLMPrompt at 0x7514696cd820>

With get_images() you can see both global and local images.

# Only global images
image_setup.get_images()

(ImageInput(source=PosixPath('/tmp/qstn_image_guide/earth.jpg'), label='Shared reference: Earth'),
 ImageInput(source=PosixPath('/tmp/qstn_image_guide/apple_i.png'), label='Shared reference: Apple I computer'))

# Global images + local images for one question
image_setup.get_images(item_id=1)

(ImageInput(source=PosixPath('/tmp/qstn_image_guide/earth.jpg'), label='Shared reference: Earth'),
 ImageInput(source=PosixPath('/tmp/qstn_image_guide/apple_i.png'), label='Shared reference: Apple I computer'),
 ImageInput(source=PosixPath('/tmp/qstn_image_guide/apple.jpg'), label='Item 1 image A'),
 ImageInput(source=PosixPath('/tmp/qstn_image_guide/banana.jpg'), label='Item 1 image B'))

# Only local images for one question
image_setup.get_images(item_id=1, include_global=False)

(ImageInput(source=PosixPath('/tmp/qstn_image_guide/apple.jpg'), label='Item 1 image A'),
 ImageInput(source=PosixPath('/tmp/qstn_image_guide/banana.jpg'), label='Item 1 image B'))

To clear a collection, replace it with an empty list. We do this on a duplicate so that the configured gallery remains available for the rest of the guide.

cleared_example = image_setup.duplicate()

# Clear local images
cleared_example.set_images([], item_id=1)

# Clear global images
cleared_example.set_images([])

print("Global images after clearing:", cleared_example.get_images())
print("Item 1 images after clearing:", cleared_example.get_images(item_id=1))

Global images after clearing: ()
Item 1 images after clearing: ()
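
The scoping rules from this section can be summarized with a small toy model. `ImageScopes` below is a hypothetical stand-in written for this guide, not QSTN's implementation. It mirrors only the behavior shown above: appending versus replacing, global-first ordering, and clearing with an empty list.

```python
from collections import defaultdict


class ImageScopes:
    """Toy model of global and per-item image collections (not QSTN code)."""

    def __init__(self):
        self._global = []
        self._local = defaultdict(list)

    def add_image(self, image, item_id=None):
        # Append to the global list, or to one item's list.
        target = self._global if item_id is None else self._local[item_id]
        target.append(image)
        return self

    def set_images(self, images, item_id=None):
        # Replace the whole collection at the chosen scope.
        if item_id is None:
            self._global = list(images)
        else:
            self._local[item_id] = list(images)
        return self

    def get_images(self, item_id=None, include_global=True):
        # Global images come first, followed by the item's own images.
        result = list(self._global) if include_global else []
        if item_id is not None:
            result += self._local.get(item_id, [])
        return tuple(result)


scopes = ImageScopes()
scopes.add_image("unlabeled-earth")
scopes.set_images(["earth", "apple_i"])  # replaces the unlabeled image
scopes.set_images(["apple", "banana"], item_id=1)
print(scopes.get_images(item_id=1))
print(scopes.get_images(item_id=1, include_global=False))
```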

3. Build the Gallery Prompt#

Now we add a short prompt around the image setup. The labels let each question identify the relevant image even when shared reference images are also present.

The same LLMPrompt will be reused in all three presentation modes.

gallery = LLMPrompt(
    questionnaire_name="image_gallery",
    questionnaire_source=questions,
    system_prompt=(
        "You are looking at a small labeled image gallery. "
        "Answer each question with one short phrase."
    ),
    prompt=(
        "Look at the labeled images and answer the following questionnaire item or items.\n"
        f"{placeholder.PROMPT_QUESTIONS}"
    ),
)
_ = gallery.prepare_prompt(
    question_stem=f"{placeholder.QUESTION_CONTENT}\nAnswer briefly.",
)

gallery.set_images(image_setup.get_images())
gallery.set_images(
    image_setup.get_images(item_id=1, include_global=False),
    item_id=1,
)
_ = gallery.set_images(
    image_setup.get_images(item_id=2, include_global=False),
    item_id=2,
)

Preview the Complete Multimodal Prompt#

get_prompt_for_questionnaire_type() still returns a plain string for text-only prompts. When images are attached, the user content instead becomes an ordered sequence of text and ImageInput blocks, which is the structure passed to the vision model.

For a compact human-readable view, print(gallery) renders image labels and sources at their positions in the prompt without expanding local files or base64 payloads.

preview_system, preview_content = gallery.get_prompt_for_questionnaire_type(
    questionnaire_type=QuestionnairePresentation.BATTERY,
    item_separator="\n\n",
)

print("System prompt:")
print(preview_system)
print("\nStructured user content:")
print(preview_content)

System prompt:
You are looking at a small labeled image gallery. Answer each question with one short phrase.

Structured user content:
('Look at the labeled images and answer the following questionnaire item or items.\n', ImageInput(source=PosixPath('/tmp/qstn_image_guide/earth.jpg'), label='Shared reference: Earth'), ImageInput(source=PosixPath('/tmp/qstn_image_guide/apple_i.png'), label='Shared reference: Apple I computer'), 'Which item 1 image shows an apple: A or B?\nAnswer briefly.', ImageInput(source=PosixPath('/tmp/qstn_image_guide/apple.jpg'), label='Item 1 image A'), ImageInput(source=PosixPath('/tmp/qstn_image_guide/banana.jpg'), label='Item 1 image B'), "\n\nWhat fruit is shown in the image labeled 'Item 2 image'? Could the fruit be found on the planet shown in the shared reference?\nAnswer briefly.", ImageInput(source=PosixPath('/tmp/qstn_image_guide/banana.jpg'), label='Item 2 image'))

print(gallery)

=== image_gallery ===
=== SYSTEM_PROMPT ===
You are looking at a small labeled image gallery. Answer each question with one short phrase.
=== USER_PROMPT_WITH_ALL_QUESTIONS ===
Look at the labeled images and answer the following questionnaire item or items.
[Image: Shared reference: Earth | /tmp/qstn_image_guide/earth.jpg]
[Image: Shared reference: Apple I computer | /tmp/qstn_image_guide/apple_i.png]
Which item 1 image shows an apple: A or B?
Answer briefly.
[Image: Item 1 image A | /tmp/qstn_image_guide/apple.jpg]
[Image: Item 1 image B | /tmp/qstn_image_guide/banana.jpg]

What fruit is shown in the image labeled 'Item 2 image'? Could the fruit be found on the planet shown in the shared reference?
Answer briefly.
[Image: Item 2 image | /tmp/qstn_image_guide/banana.jpg]
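
To make the idea of ordered text and image blocks more concrete, the sketch below converts such a sequence into OpenAI-style chat content parts (`text` and `image_url` entries). This is an assumption about what a conversion could look like, not a description of QSTN's internals. The helpers `to_url` and `to_openai_parts`, the `(source, label)` tuples, and the choice to emit each label as its own text part are all hypothetical.

```python
import base64
from pathlib import Path


def to_url(source):
    """Return remote URLs unchanged; inline local files as base64 data URLs."""
    text = str(source)
    if text.startswith(("http://", "https://", "data:")):
        return text
    data = base64.b64encode(Path(text).read_bytes()).decode("ascii")
    suffix = Path(text).suffix.lower().lstrip(".")
    # Simplification: treat everything that is not PNG as JPEG.
    mime = "image/png" if suffix == "png" else "image/jpeg"
    return f"data:{mime};base64,{data}"


def to_openai_parts(content):
    """Convert strings and (source, label) pairs into chat content parts."""
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append({"type": "text", "text": block})
            continue
        source, label = block
        if label:
            # Assumption: the label is sent as a short text part before the image.
            parts.append({"type": "text", "text": f"[{label}]"})
        parts.append({"type": "image_url", "image_url": {"url": to_url(source)}})
    return parts


parts = to_openai_parts(
    [
        "Look at the labeled images.\n",
        ("https://example.org/earth.jpg", "Shared reference: Earth"),
        "Which planet is shown?",
    ]
)
for part in parts:
    print(part["type"])
```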

4. Load a Small Vision Model#

We use Qwen/Qwen3-VL-2B-Instruct. It is small enough for a local tutorial while supporting multiple images in one chat request.

Images are structured chat content, so this guide uses the default chat inference mode.

from vllm import LLM

model_id = "Qwen/Qwen3-VL-2B-Instruct"
model = LLM(
    model_id,
    max_model_len=4096,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"image": 5},
    seed=42,
)
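
The limit of five images lines up with the busiest request in this guide. The count below is a back-of-the-envelope check that uses only the image assignments from section 2. It assumes the limit applies to the total number of images in one request, including images carried over in conversation history.

```python
# Image assignments from section 2, reduced to names.
global_images = ["earth", "apple_i"]
item_images = {1: ["apple", "banana"], 2: ["banana"]}

# Single-item: each request carries the global images plus one item's images.
single_item_max = max(
    len(global_images) + len(images) for images in item_images.values()
)

# Battery: one request carries every image.
battery_total = len(global_images) + sum(
    len(images) for images in item_images.values()
)

# Sequential: history accumulates, so the final turn also sees every image.
sequential_final = battery_total

print(single_item_max, battery_total, sequential_final)  # 4 5 5
```

A gallery with more item images would need a correspondingly higher limit for the battery and sequential modes.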

5. Single-Item Presentation#

Single-item presentation starts a fresh request for every question. Each request receives the global images first and then the images assigned to the current item.

This is a good default when questions should not influence one another.

single_item_results = conduct_survey_single_item(
    model,
    gallery,
    print_progress=False,
    max_tokens=24,
)

display(
    create_one_dataframe(raw_responses(single_item_results))
)

INFO 06-09 14:25:15 [hf.py:318] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.

	questionnaire_name	questionnaire_item_id	question	llm_response	logprobs	reasoning
0	image_gallery	1	Which item 1 image shows an apple: A or B?\nAn...	A	None	None
1	image_gallery	2	What fruit is shown in the image labeled 'Item...	Banana\nYes, bananas are found on Earth.	None	None

6. Battery Presentation#

Battery presentation sends all questions in one request. QSTN inserts global images once near the start, then places each item’s images directly after that item’s question.

The result is one response for the full battery, so we ask for one short answer per line.

battery_gallery = gallery.duplicate()
battery_gallery.system_prompt = (
    "Answer both questionnaire items. Use exactly two lines: "
    "Item 1: <A or B> and Item 2: <fruit name> <Yes/No>."
)
battery_gallery.prompt = (
    "Answer every questionnaire item in order.\n"
    f"{placeholder.PROMPT_QUESTIONS}"
)

battery_results = conduct_survey_battery(
    model,
    battery_gallery,
    item_separator="\n\n",
    print_progress=False,
    max_tokens=48,
)

display(
    create_one_dataframe(raw_responses(battery_results))
)

	questionnaire_name	questionnaire_item_id	question	llm_response	logprobs	reasoning
0	image_gallery	-1	Answer every questionnaire item in order.\nWhi...	Item 1: A \nItem 2: banana Yes	None	None
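
Because the battery returns a single response string, splitting it back into per-item answers is left to post-processing. The sketch below is one minimal way to do that for the "Item N: ..." format requested in the system prompt. `split_battery_response` is a hypothetical helper, not a QSTN parser, and it assumes the model followed the requested line format.

```python
import re

# One "Item N: answer" entry per line; surrounding whitespace is ignored.
ITEM_LINE = re.compile(r"^\s*Item\s+(\d+)\s*:\s*(.*?)\s*$", re.MULTILINE)


def split_battery_response(text):
    """Map item ids to the answer text found on 'Item N: ...' lines."""
    return {int(item_id): answer for item_id, answer in ITEM_LINE.findall(text)}


answers = split_battery_response("Item 1: A \nItem 2: banana Yes")
print(answers)  # {1: 'A', 2: 'banana Yes'}
```

Lines that do not match the pattern are skipped, so a response that ignores the format yields an empty or partial dictionary rather than an error.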

7. Sequential Presentation#

Sequential presentation keeps the conversation history. Global images are attached to the first turn only. Later turns add their own item images while the earlier image context remains in the conversation.

Use this mode when the survey should feel like one continuing interview.

sequential_results = conduct_survey_sequential(
    model,
    gallery,
    print_progress=False,
    max_tokens=38,
)

display(
    create_one_dataframe(raw_responses(sequential_results))
)

	questionnaire_name	questionnaire_item_id	question	llm_response	logprobs	reasoning
0	image_gallery	1	Which item 1 image shows an apple: A or B?\nAn...	A	None	None
1	image_gallery	2	What fruit is shown in the image labeled 'Item...	The fruit shown is a banana.\n\nBased on the s...	None	None
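
The routing rules from sections 5 to 7 can be compared side by side in a toy model. The three functions below are hypothetical and use plain strings such as "<earth>" in place of images. The battery layout follows the preview in section 3. The exact position of images within single-item requests, and the omission of the prompt text on later sequential turns, are assumptions.

```python
def single_item_requests(prompt, questions, global_images, item_images):
    """One independent request per question: globals first, then item images."""
    return [
        [prompt, *global_images, text, *item_images.get(item_id, [])]
        for item_id, text in questions
    ]


def battery_request(prompt, questions, global_images, item_images, separator="\n\n"):
    """One request: globals once near the start, item images after each question."""
    content = [prompt, *global_images]
    for index, (item_id, text) in enumerate(questions):
        content.append(text if index == 0 else separator + text)
        content.extend(item_images.get(item_id, []))
    return content


def sequential_turns(prompt, questions, global_images, item_images):
    """One user turn per question: globals on the first turn only."""
    turns = []
    for index, (item_id, text) in enumerate(questions):
        shared = [prompt, *global_images] if index == 0 else []
        turns.append([*shared, text, *item_images.get(item_id, [])])
    return turns


questions = [(1, "Q1"), (2, "Q2")]
global_images = ["<earth>", "<apple_i>"]
item_images = {1: ["<apple>", "<banana>"], 2: ["<banana>"]}

single = single_item_requests("Intro\n", questions, global_images, item_images)
battery = battery_request("Intro\n", questions, global_images, item_images)
turns = sequential_turns("Intro\n", questions, global_images, item_images)
print(battery)
print(turns[1])
```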

8. Images Require Chat Mode#

Base-model completion mode accepts plain text only. QSTN raises a clear error if structured image content is attached while inference_mode="completion" is selected.

The following cell catches that expected error so the notebook can continue.

try:
    conduct_survey_single_item(
        model,
        gallery,
        inference_mode="completion",
        print_progress=False,
        max_tokens=15,
    )
except ValueError as error:
    print(error)

Structured prompt content is supported only when inference_mode='chat'.
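
A guard of this kind can be sketched in a few lines. `check_inference_mode` below is a hypothetical illustration of the idea, not the check QSTN performs internally. It treats any non-string user content as structured and reuses the error message printed above.

```python
def check_inference_mode(user_content, inference_mode):
    """Reject structured (non-string) prompt content outside chat mode."""
    if inference_mode != "chat" and not isinstance(user_content, str):
        raise ValueError(
            "Structured prompt content is supported only when inference_mode='chat'."
        )


# Plain text is acceptable in completion mode.
check_inference_mode("plain text prompt", "completion")

# Structured content (text plus an image placeholder) is rejected.
try:
    check_inference_mode(("text", {"image": "earth.jpg"}), "completion")
except ValueError as error:
    message = str(error)
    print(message)
```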

9. Quick Recap#

Use a URL directly for a simple unlabeled image, or ImageInput when you want a label.
Call add_image() to append one image and set_images() to replace a collection.
Omit item_id for global images; provide it for images that belong to one question.
Use get_images() to inspect the final global-first ordering.
Single-item, battery, and sequential surveys route images differently to match their conversation structure.
Image prompts require chat mode.
