
ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

Video captioning models are typically trained on datasets of short videos, usually under three minutes long, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and films that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a lack of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and easier to work with.

Advances in visual-language models have significantly improved the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and building more robust datasets. Despite these advances, the scarcity of large, annotated long-form video datasets remains a significant obstacle to progress. Traditional short-form video tasks, such as video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amid substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization due to data limitations.

Researchers from Queen Mary University of London and Spotify introduce ViSMaP, an unsupervised method for summarising hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions produced by short-form video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimisation. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labelling.
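To make the generate-evaluate-optimise loop concrete, here is a minimal Python sketch of a three-LLM meta-prompting arrangement. The `chat` stub, the prompt wording, the number of rounds, and the 0-10 scoring scheme are illustrative assumptions, not the paper's actual prompts or configuration.

```python
# Illustrative sketch of a three-LLM meta-prompting loop (generator,
# evaluator, optimizer). `chat(system, user)` stands in for any
# chat-completion client; all prompt text here is hypothetical.

def chat(system: str, user: str) -> str:
    raise NotImplementedError("wire up an LLM client here")

def meta_prompt_summary(clip_captions: list[str], rounds: int = 5) -> str:
    captions = "\n".join(clip_captions)
    gen_prompt = "Summarise these clip captions as one coherent video summary."
    best, best_score = "", float("-inf")
    for _ in range(rounds):
        # Generator: draft a pseudo-summary from the clip-level captions.
        summary = chat(gen_prompt, captions)
        # Evaluator: score the draft (assumes the model replies with a number).
        score = float(chat(
            "Rate how well the summary covers the captions, from 0 to 10. "
            "Answer with a single number.",
            f"Captions:\n{captions}\n\nSummary:\n{summary}",
        ))
        if score > best_score:
            best, best_score = summary, score
        # Optimizer: rewrite the generator's prompt using the feedback.
        gen_prompt = chat(
            "Improve this summarisation prompt so the next attempt scores higher.",
            f"Prompt:\n{gen_prompt}\n\nLatest summary (scored {score}/10):\n{summary}",
        )
    return best
```

Keeping the best-scoring candidate across rounds guards against an optimisation step that happens to make the prompt worse.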

The study addresses cross-domain video summarization by training on a labelled short-form video dataset and adapting to unlabelled, hour-long videos from a different domain. First, a model is trained to summarize 3-minute videos using TimeSformer features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. To handle longer videos, these are segmented into 3-minute clips, and pseudo-captions are generated for each clip. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) then refines the summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy loss to handle noisy labels and improve adaptation.
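The symmetric cross-entropy (SCE) loss mentioned above is a standard recipe for learning under noisy labels (Wang et al., 2019): a bounded "reverse" cross-entropy term is added to the usual one, which limits how hard the model is pushed toward a possibly wrong pseudo-label. Below is a minimal PyTorch sketch; the weights `alpha` and `beta` and the log(0) clamp value are common defaults from that line of work, not values reported for ViSMaP.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(
    logits: torch.Tensor,      # (batch, num_classes)
    targets: torch.Tensor,     # (batch,) integer class labels
    alpha: float = 0.1,        # weight on standard CE (assumed default)
    beta: float = 1.0,         # weight on reverse CE (assumed default)
) -> torch.Tensor:
    # Standard cross-entropy: -log p_model(target).
    ce = F.cross_entropy(logits, targets)
    # Reverse cross-entropy: swap the roles of prediction and label.
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    # log(one_hot) is 0 on the target class; log(0) elsewhere is clamped
    # to -4, following Wang et al. (2019).
    log_one_hot = torch.where(
        one_hot > 0,
        torch.zeros_like(one_hot),
        torch.full_like(one_hot, -4.0),
    )
    rce = -(pred * log_one_hot).sum(dim=-1).mean()
    return alpha * ce + beta * rce

# Example: token-level loss over a GPT-2-sized vocabulary.
loss = symmetric_cross_entropy(torch.randn(8, 50257), torch.randint(0, 50257, (8,)))
```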

The study evaluates ViSMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. ViSMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT-3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, and METEOR scores, as well as QA accuracy. Ablation studies highlight the benefits of meta-prompting and of individual components such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training performed on an NVIDIA A100 GPU.
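For reference, the captioning metrics listed above can be computed with the pycocoevalcap package, the widely used COCO caption-evaluation toolkit. The paper's exact evaluation code isn't specified, so treat this as one plausible setup; note that the METEOR scorer shells out to a bundled Java jar, so a Java runtime must be installed.

```python
# pip install pycocoevalcap   (METEOR additionally needs a Java runtime)
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Toy example: each video id maps to reference and generated summaries.
references = {"video1": ["a chef prepares pasta and serves it to guests"]}
candidates = {"video1": ["a person cooks pasta in a kitchen"]}

for name, scorer in [("CIDEr", Cider()), ("ROUGE-L", Rouge()), ("METEOR", Meteor())]:
    corpus_score, _ = scorer.compute_score(references, candidates)
    print(f"{name}: {corpus_score:.4f}")
```

CIDEr is corpus-level (it weights n-grams by IDF over the reference set), so scores on a one-video toy corpus are degenerate; run it over the full evaluation set in practice.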

In conclusion, ViSMaP is an unsupervised approach to summarizing long videos that leverages annotated short-video datasets and a meta-prompting strategy. It first creates high-quality pseudo-summaries through meta-prompting and then trains a summarization model on them, reducing the need for extensive annotation. Experimental results show that ViSMaP performs on par with fully supervised methods and adapts effectively across diverse video datasets. However, its reliance on pseudo-labels from a source-domain model may affect performance under significant domain shifts. Moreover, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.


Check out the Paper.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
