Video Content Summarization with Large Language-Vision Models

Kelley Lynch, Bowen Jiang, Benjamin Lambright, Kyeongmin Rim, James Pustejovsky

2024 IEEE International Conference on Big Data (BigData), Washington, DC

CAS Workshop · Pages 2456–2463

Abstract

We present a modular pipeline for summarizing broadcast news videos using large language and vision models, specifically integrating Whisper for ASR, TransNetV2 for shot segmentation, LLaVA for image captioning, and LLaMA for generating structured summaries. Implemented within the CLAMS platform using the Multimedia Interchange Format (MMIF) for component interoperability, our approach combines ASR transcriptions and image captions to enhance metadata extraction. We evaluated our pipeline with automated metrics based on user-generated YouTube video descriptions as well as human assessments. Our analysis highlights challenges with automated metrics and emphasizes the value of human evaluation for nuanced assessment. This work demonstrates the effectiveness of multimodal summarization for video metadata extraction and paves the way for enhanced video accessibility.
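The pipeline composition described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: every function below is a hypothetical stand-in for the named component (Whisper for ASR, TransNetV2 for shot segmentation, LLaVA for captioning, LLaMA for summarization), and the real system exchanges annotations between components via MMIF on the CLAMS platform rather than plain Python values.

```python
# Hypothetical sketch of the four-stage summarization pipeline.
# All stage functions are stand-ins returning dummy data; in the actual
# system each stage is a CLAMS app reading and writing MMIF annotations.

def transcribe(video_path: str) -> str:
    """Stand-in for Whisper ASR over the video's audio track."""
    return "anchor introduces the evening's top story"

def segment_shots(video_path: str) -> list[tuple[float, float]]:
    """Stand-in for TransNetV2 shot boundaries as (start, end) seconds."""
    return [(0.0, 12.5), (12.5, 30.0)]

def caption_keyframe(shot: tuple[float, float]) -> str:
    """Stand-in for LLaVA captioning of a keyframe sampled from the shot."""
    return f"studio shot covering {shot[0]:.1f}-{shot[1]:.1f}s"

def summarize(transcript: str, captions: list[str]) -> str:
    """Stand-in for LLaMA: fuse transcript and captions into one summary."""
    return f"SUMMARY: {transcript} | visuals: {'; '.join(captions)}"

def run_pipeline(video_path: str) -> str:
    """Compose the stages: ASR + per-shot captions feed the summarizer."""
    transcript = transcribe(video_path)
    shots = segment_shots(video_path)
    captions = [caption_keyframe(s) for s in shots]
    return summarize(transcript, captions)
```

The key design point this sketch reflects is modularity: because each stage only consumes and produces annotations (transcripts, shot spans, captions), any individual model can be swapped out without changing the rest of the pipeline.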

BibTeX

@inproceedings{lynch2024videosumm,
  author    = {Lynch, Kelley and Jiang, Bowen and Lambright, Benjamin
               and Rim, Kyeongmin and Pustejovsky, James},
  title     = {Video Content Summarization with Large
               Language-Vision Models},
  booktitle = {2024 IEEE International Conference on Big Data
               (BigData)},
  year      = {2024},
  pages     = {2456--2463},
  address   = {Washington, DC, USA},
  doi       = {10.1109/BigData62323.2024.10825195}
}