Video-language understanding tasks have historically focused on short video clips, and existing methods often struggle with the complexities of long-form video understanding. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) for long video question answering, transforming videos into densely sampled frame captions and asking LLMs to answer text queries over those captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient. This strategy also ignores the fact that video question answering requires varying levels of granularity: some video segments are highly relevant to the question (and hence need more fine-grained detail) while others are less relevant. Thus, these LLM-based approaches are prone to missing information and operate on large numbers of irrelevant captions, lowering both performance and efficiency. To address these shortcomings, we introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. Specifically, VideoTree dynamically extracts query-related information from the input video and builds a tree-based video representation for LLM reasoning. First, VideoTree adaptively selects frames for captioning by clustering frames based on their visual features and scoring clusters based on their relevance to the query; this process is iterated until enough query-related keyframes are extracted. Second, it organizes the visual clusters into a query-adaptive, hierarchical tree structure that encodes varying levels of granularity, with higher (deeper) resolution on relevant segments. Finally, VideoTree produces an answer to each question by traversing the tree’s keyframes and passing their captions to an LLM answering model, which answers the query. Our experiments show that our training-free, adaptive method improves both reasoning accuracy and efficiency compared to existing methods: VideoTree achieves a 7.0%, 2.2%, and 2.7% accuracy improvement over existing methods on the popular EgoSchema, NExT-QA, and IntentQA benchmarks, respectively, while reducing inference time by 40%.
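To make the keyframe-selection loop described above concrete, the following is a minimal sketch of clustering frame features, captioning one keyframe per cluster, scoring relevance, and re-clustering until enough query-related keyframes are found. This is not the authors' implementation: caption_frame and score_relevance are hypothetical stand-ins for a visual captioner and an LLM relevance scorer, and all cluster counts and thresholds are illustrative.

# Minimal sketch of adaptive keyframe selection (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_feats, query, caption_frame, score_relevance,
                     k_init=8, k_step=8, k_max=64, min_relevant=4):
    """Cluster frame features, caption one keyframe per cluster, and score each
    caption's relevance to the query; re-cluster more finely until enough
    query-relevant keyframes are found."""
    k = k_init
    while True:
        kmeans = KMeans(n_clusters=k, n_init=10).fit(frame_feats)
        # Keyframe = the frame closest to each cluster centroid.
        keyframes = [int(np.argmin(np.linalg.norm(frame_feats - c, axis=1)))
                     for c in kmeans.cluster_centers_]
        captions = [caption_frame(i) for i in keyframes]
        scores = [score_relevance(query, cap) for cap in captions]  # e.g., an integer 1-3
        relevant = [(i, cap, s) for i, cap, s in zip(keyframes, captions, scores) if s >= 2]
        if len(relevant) >= min_relevant or k >= k_max:
            return relevant, kmeans.labels_
        k += k_step  # not enough query-related keyframes yet: use more clusters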
We visualize qualitative results from VideoTree. Specifically, we show the keyframes and their captions extracted by our adaptive tree representation given a video query. This example is drawn from EgoSchema and shows the query format, which consists of a question and multiple-choice answers. With the proposed VideoTree strategy, we can split a complex multi-scene video (e.g., cleaning a house across several rooms) into key scenes via visual clustering and determine the most query-relevant scene via the relevance score. We can then obtain more fine-grained visual cues by descending into each relevant cluster (Levels 2 and 3 in Figure 5, top), as sketched below. For example, “C opens a washing machine” is deemed highly relevant to the question, which asks about the sequence of events, while frames like “C moves around” are deemed irrelevant to the query and are not expanded. In the end, VideoTree dynamically selects relevant segments and answers the given question correctly with only 50% of the baseline’s 32 input captions, whereas the baseline (fixed uniform sampling) samples a large number of redundant and irrelevant frames and fails to answer the question correctly.
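The descent into relevant clusters illustrated in this example could look roughly like the recursive sketch below, again using the hypothetical caption_frame and score_relevance helpers; the branching factor, depth limit, keyframe choice, and score threshold are illustrative assumptions, not the paper's settings.

# Sketch of query-adaptive tree expansion: relevant clusters are re-clustered
# into child nodes (deeper levels), low-relevance clusters are left unexpanded.
from sklearn.cluster import KMeans

def expand_tree(frame_ids, frame_feats, query, caption_frame, score_relevance,
                depth=1, max_depth=3, branching=4):
    n_clusters = min(branching, len(frame_ids))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frame_feats)
    nodes = []
    for c in range(n_clusters):
        member_ids = [fid for fid, l in zip(frame_ids, labels) if l == c]
        if not member_ids:
            continue
        member_feats = frame_feats[labels == c]
        keyframe = member_ids[0]  # simplification: first member as the keyframe
        caption = caption_frame(keyframe)
        score = score_relevance(query, caption)
        node = {"keyframe": keyframe, "caption": caption, "score": score, "children": []}
        if score >= 2 and depth < max_depth and len(member_ids) > branching:
            # Relevant segment: descend for finer-grained captions.
            node["children"] = expand_tree(member_ids, member_feats, query,
                                           caption_frame, score_relevance,
                                           depth + 1, max_depth, branching)
        nodes.append(node)
    return nodes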
@article{wang2024videotree,
author = {Ziyang Wang and Shoubin Yu and Elias Stengel-Eskin and Jaehong Yoon and Feng Cheng and Gedas Bertasius and Mohit Bansal},
title = {VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos},
journal = {arXiv preprint},
year = {2024},
}