Image detector: Summary and VQA
The image_summary module provides advanced image analysis capabilities using a vision-language model (such as Qwen2.5-VL) hosted externally and reached over an OpenAI-compatible HTTP API. The model answers users' questions about the context of a media file and produces summaries. It combines functionality from the inference.py and prompt_builder.py modules to offer comprehensive image understanding.
Core Components
InferenceModel (inference.py)
A thin OpenAI-compatible client that handles vision-language inference against an externally hosted model:
- Endpoint configuration:
base_url,api_keyandmodel_id, read from theAMMICO_API_BASE_URL,AMMICO_API_KEYandAMMICO_MODEL_IDenvironment variables (or passed to the constructor). - Provider-agnostic: the same client targets a self-hosted vLLM server, the OpenAI API, or Google Gemini (via its OpenAI-compatibility endpoint) — only configuration changes.
- Message building:
build_messages(images, text)encodes PIL images as base64 blocks for the chat API. - Inference:
chat(messages, max_new_tokens, n)andchat_batch(...)(concurrent requests via a bounded thread pool, replacing in-GPU batching). - Optional dependency: the
openaiclient is installed withpip install ammico[api].
PromptBuilder (prompt_builder.py)
Modular prompt construction system for multi-level analysis:
- Processing Levels: Frame, Clip, and Video level prompts
- Task Types: Summary generation, VQA, or combined tasks
- Audio Integration: Prompts that incorporate audio transcription when available
- Structured Output: Ensures consistent, well-formatted model outputs
Key Features
- Image Captioning: Generate concise or detailed captions for images
- Visual Question Answering (VQA): Answer custom questions about image content
- Batch Processing: Process multiple images efficiently with configurable batch sizes
- Flexible Input: Supports file paths, PIL Images, or sequences of images
- Analysis Types:
summary: Generate image captions onlyquestions: Answer questions onlysummary_and_questions: Both caption and Q&A (default)- Concise Mode: Option to generate shorter, more focused summaries and answers
- Question Chunking: Automatically processes questions in batches (default: 8 per batch)
- Error Handling: Robust error handling with retry logic for CUDA operations
Usage
from ammico.image_summary import ImageSummaryDetector
from ammico.inference import InferenceModel
# Initialize the inference client (reads AMMICO_API_* env vars, or pass base_url/api_key/model_id)
model = InferenceModel()
# Create detector
detector = ImageSummaryDetector(summary_model=model, subdict={})
# Analyze single image
results = detector.analyse_image(
entry={"filename": "image.jpg"},
analysis_type="summary_and_questions",
list_of_questions=["What is in this image?", "Are there people?"],
is_concise_summary=True,
is_concise_answer=True
)
# Batch processing
detector.subdict = image_dict
results = detector.analyse_images_from_dict(
analysis_type="summary",
keys_batch_size=16
)
Configuration
- Max Questions: Default 32 questions per image (configurable)
- Batch Size: Default 16 images per batch (configurable)
- Token Limits:
- Concise summary: 64 tokens
- Detailed summary: 256 tokens
- Concise answers: 64 tokens
- Detailed answers: 128 tokens
Output
vqa: List of answers corresponding to questions (if questions requested)
Workflow
stateDiagram-v2
[*] --> Initialize_Detector
Initialize_Detector --> Check_Analysis_Type
state Check_Analysis_Type {
[*] --> Decide
Decide --> Summary : Type=Summary
Decide --> Questions : Type=Questions
Decide --> Both : Type=Summary+Questions
}
Summary --> Generate_Caption
Questions --> Answer_Questions
Both --> Generate_Caption
Both --> Answer_Questions
Generate_Caption --> Build_Messages
Answer_Questions --> Clean_Questions
Clean_Questions --> Build_Messages
state InferenceModel {
[*] --> Configure_Endpoint
Configure_Endpoint --> Build_Chat_Request
Build_Chat_Request --> HTTP_Call
HTTP_Call --> Parse_Response
}
Build_Messages --> Build_Chat_Request
Parse_Response --> Return_Result
Return_Result --> [*]
note right of Build_Chat_Request : Uses PromptBuilder + base64 image blocks