Video detector: Summary and VQA

The Video Summary module provides comprehensive video analysis combining visual and audio information using multimodal AI models.

Key Features

  • Multi-Modal Analysis: Combines visual frames and audio transcription
  • Scene Detection: Automatically detects scene cuts using frame differencing
  • Frame Extraction: Extracts keyframes at optimal timestamps using ffmpeg
  • Audio Transcription: Transcribes audio using WhisperX models with word-level alignment
  • Segment Merging: Intelligently merges audio phrases and video scenes into coherent segments
  • Hierarchical Processing: Three-level analysis pipeline:
      • Frame Level: Individual frame analysis with captions and VQA
      • Clip Level: Segment-level summaries combining frames and audio
      • Video Level: Final video summary and Q&A
  • Parallel Processing: Multi-threaded frame extraction for performance
  • Adaptive Frame Sampling: Adjusts frame rate based on segment duration
  • Memory Management: Efficient resource cleanup for audio models
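
The scene detection listed above is based on frame differencing with an adaptive threshold. The following is a minimal sketch of that idea, not the module's internal code; the function name and the mean-plus-standard-deviation threshold heuristic are assumptions.

import numpy as np

def detect_scene_cuts(frames: list[np.ndarray], sensitivity: float = 2.0) -> list[int]:
    """Return indices of frames that start a new scene (illustrative sketch).

    Compares consecutive grayscale frames and flags a cut when the mean
    absolute difference exceeds an adaptive threshold
    (mean + sensitivity * std of all frame-to-frame differences).
    """
    if len(frames) < 2:
        return [0]
    diffs = np.array([
        np.mean(np.abs(frames[i].astype(np.float32) - frames[i - 1].astype(np.float32)))
        for i in range(1, len(frames))
    ])
    threshold = diffs.mean() + sensitivity * diffs.std()
    return [0] + [i for i, d in enumerate(diffs, start=1) if d > threshold]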

Audio Features

  • Automatic Audio Detection: Checks for audio streams in video files
  • Audio Extraction: Extracts audio using ffmpeg (16kHz mono WAV)
  • Transcription: Uses WhisperX for accurate transcription with timestamps
  • Word Alignment: Provides precise timing for each transcribed segment
  • Language Support: Supports all WhisperX languages with optional language specification
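
As a rough illustration of the detection and extraction steps above, the sketch below shells out to ffprobe and ffmpeg and mirrors the described behaviour (16 kHz mono WAV). It is not the module's own implementation.

import subprocess

def has_audio_stream(video_path: str) -> bool:
    # ffprobe prints one line per audio stream; empty output means no audio
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "csv=p=0", video_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())

def extract_audio(video_path: str, wav_path: str) -> None:
    # Extract the audio track as 16 kHz mono WAV, as described above
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", "-acodec", "pcm_s16le", wav_path],
        check=True,
    )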

Visual Features

  • Scene Cut Detection: Uses adaptive thresholding based on frame differences
  • Smart Frame Selection:
      • 2 frames/second for short segments (<2s)
      • 4 frames/second for medium segments (2-20s)
      • 6 frames/second for long segments (>20s)
  • Aspect Ratio Preservation: Maintains video aspect ratio during frame extraction
  • Multiple Extraction Strategies: Fallback strategies (MJPEG, PNG) for robust frame extraction
  • Segment Splitting: Automatically splits segments longer than 25 seconds
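
A minimal sketch of the sampling and splitting rules listed above (frame rate chosen by segment duration, segments longer than 25 seconds split); the function names and exact boundary handling are assumptions.

def frames_per_second(duration: float) -> int:
    # Sampling rates as documented: 2 fps (<2 s), 4 fps (2-20 s), 6 fps (>20 s)
    if duration < 2:
        return 2
    if duration <= 20:
        return 4
    return 6

def split_long_segments(segments: list[tuple[float, float]],
                        max_duration: float = 25.0) -> list[tuple[float, float]]:
    # Split any (start, end) segment longer than max_duration into equal chunks
    result = []
    for start, end in segments:
        duration = end - start
        if duration <= max_duration:
            result.append((start, end))
            continue
        n_chunks = int(duration // max_duration) + 1
        step = duration / n_chunks
        result.extend((start + i * step, start + (i + 1) * step) for i in range(n_chunks))
    return result

def sample_timestamps(start: float, end: float) -> list[float]:
    # Evenly spaced frame timestamps at the adaptive rate
    fps = frames_per_second(end - start)
    n = max(1, int((end - start) * fps))
    return [start + (i + 0.5) * (end - start) / n for i in range(n)]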

Usage

from ammico.video_summary import VideoSummaryDetector
from ammico.model import MultimodalSummaryModel, AudioToTextModel

# Initialize models
video_model = MultimodalSummaryModel(device="cuda")
audio_model = AudioToTextModel(model_size="large", device="cuda")

# Create detector
detector = VideoSummaryDetector(
    summary_model=video_model,
    audio_model=audio_model,
    subdict={"video1": {"filename": "video.mp4"}}
)

# Analyze video
results = detector.analyse_videos_from_dict(
    analysis_type="summary_and_questions",
    list_of_questions=["What is happening?", "Who is speaking?"]
)

Analysis Pipeline

  1. Audio Processing (if available):
      • Extract audio track
      • Transcribe with WhisperX
      • Generate timestamped audio segments

  2. Visual Processing:
      • Detect scene cuts
      • Extract frame timestamps for each scene
      • Extract frames using parallel ffmpeg processes

  3. Segment Merging:
      • Merge audio and visual boundaries
      • Create coherent temporal segments
      • Reassign frame timestamps to merged segments

  4. Frame-Level Analysis:
      • Generate captions for each extracted frame
      • Answer questions about individual frames
      • Create summary bullets and VQA bullets

  5. Clip-Level Analysis:
      • Combine frame summaries with audio transcription
      • Generate segment-level summaries
      • Answer questions using combined audio/visual context

  6. Video-Level Analysis:
      • Aggregate all segment summaries
      • Generate final video summary
      • Answer questions about the entire video
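
The merging step (step 3 above) can be pictured as taking the union of audio-phrase and scene-cut boundaries and only opening a new segment once the current one reaches the 8-second threshold from the Configuration section. This is an illustrative sketch only; the module's actual merging logic may differ.

def merge_boundaries(audio_segments: list[tuple[float, float]],
                     scene_cuts: list[float],
                     video_end: float,
                     segment_threshold: float = 8.0) -> list[tuple[float, float]]:
    """Illustrative merge of audio and visual boundaries into temporal segments."""
    # Collect candidate boundaries from both modalities
    boundaries = {0.0, video_end}
    boundaries.update(t for seg in audio_segments for t in seg)
    boundaries.update(scene_cuts)
    ordered = sorted(b for b in boundaries if 0.0 <= b <= video_end)

    merged = []
    start = ordered[0]
    for b in ordered[1:]:
        # Open a new segment only once the current one is long enough,
        # or when the end of the video is reached
        if b - start >= segment_threshold or b == video_end:
            merged.append((start, b))
            start = b
    return merged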

Configuration

  • Segment Threshold: Default 8 seconds for creating new segments
  • Max Segment Duration: Segments longer than 25 seconds are split
  • Frame Extraction Workers: Default uses half of available CPU cores (max 8)
  • Output Dimensions: Max 720px while preserving aspect ratio
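
These defaults translate roughly to the following sketch (the real module may compute them differently):

import os

def default_worker_count() -> int:
    # Half of the available CPU cores, capped at 8
    return min(8, max(1, (os.cpu_count() or 2) // 2))

def scaled_dimensions(width: int, height: int, max_dim: int = 720) -> tuple[int, int]:
    # Cap the longer side at max_dim while preserving the aspect ratio
    if max(width, height) <= max_dim:
        return width, height
    scale = max_dim / max(width, height)
    return round(width * scale), round(height * scale)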

Output

Returns dictionaries with:

  • summary: Final video summary (if summary requested)
  • vqa_answers: List of answers to questions (if questions requested)
  • audio_descriptions: Timestamped audio transcription segments
  • Per-segment data with summary_bullets and vqa_bullets
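
Assuming the returned dictionary is keyed the same way as the input subdict (an assumption, not stated above), reading the output might look like:

for video_key, result in results.items():
    print(result.get("summary"))                       # final video summary
    for answer in result.get("vqa_answers", []):       # answers to the questions
        print(answer)
    for segment in result.get("audio_descriptions", []):  # timestamped transcription
        print(segment)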

Requirements

  • ffmpeg and ffprobe for video processing
  • CUDA support recommended for performance
  • WhisperX models for audio transcription
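
A quick way to verify these requirements before running the detector (a convenience sketch, not part of the module):

import shutil

assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"
assert shutil.which("ffprobe"), "ffprobe not found on PATH"

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed; CUDA check skipped")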

Workflow

stateDiagram-v2
    [*] --> Initialize_Detector
    Initialize_Detector --> Check_Audio_Model

    state Check_Audio_Model {
        [*] --> Decisions
        Decisions --> Extract_Audio : Has Audio Model
        Decisions --> Visual_Processing_Only : No Audio Model
    }

    Extract_Audio --> Transcribe_WhisperX
    Transcribe_WhisperX --> Audio_Segments

    state Visual_Processing {
        [*] --> Scene_Cut_Detection
        Scene_Cut_Detection --> Calculate_Timestamps
        Calculate_Timestamps --> Extract_Frames_ffmpeg
    }

    Visual_Processing_Only --> Visual_Processing
    Audio_Segments --> Visual_Processing
    Extract_Frames_ffmpeg --> Merge_Audio_Visual_Segments

    state Analysis_Levels {
        [*] --> Frame_Level
        Frame_Level --> Clip_Level
        Clip_Level --> Video_Level
    }

    Merge_Audio_Visual_Segments --> Analysis_Levels

    state MultimodalSummaryModel {
        [*] --> Model_Inference
    }

    state PromptBuilder {
        [*] --> Build_Prompts
        Build_Prompts --> Frame_Prompt
        Build_Prompts --> Clip_Prompt
        Build_Prompts --> Video_Prompt
    }

    Frame_Level --> Frame_Prompt
    Frame_Prompt --> Model_Inference

    Clip_Level --> Clip_Prompt
    Clip_Prompt --> Model_Inference

    Video_Level --> Video_Prompt
    Video_Prompt --> Model_Inference

    Model_Inference --> Final_Result
    Final_Result --> [*]