General purpose and interactive video analytics
- The proliferation of video collections and the increased capabilities of machine learning models have led to a growing desire for video analytics — the process of extracting insights from video. These two trends have made automatic and meaningful analysis of video increasingly feasible, allowing users to answer queries such as "how many birds of a particular species visit a feeder per day" or "do any cars that passed an intersection match an AMBER alert." Despite these advances, video today cannot be explored as practically or as efficiently as structured data: doing so requires significant time and expertise to optimize queries for performance, cost, and accuracy goals. This thesis focuses on the design of a general purpose video analytics database management system that allows users to query videos as easily, interactively, and cost-efficiently as they query structured data with scale-out systems like Spark SQL and PrestoDB. To reach this vision, we need to address challenges across three areas: systems (performance and cost), databases (automated optimization), and artificial intelligence (ease of use).

We first focus on the systems challenge of improving the latency and resource efficiency of executing directed acyclic graphs (DAGs) of machine learning models for video analysis. The latency and resource efficiency of these DAGs can be optimized using configurable knobs for each operation (e.g., batch size or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, (b) the optimal configuration depends on users' desired latency and cost targets, and (c) input video contents may exercise different paths in the DAG and produce a variable amount of intermediate results. We present Llama: a heterogeneous and serverless framework for video processing.
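As a minimal, hypothetical sketch of the kind of per-invocation decision such a cost-based optimizer makes, one can pick the cheapest hardware/batch-size configuration whose estimated latency meets a target. The configuration names, latencies, and costs below are invented for illustration, not taken from Llama:

```python
# Hypothetical sketch of a per-invocation configuration choice:
# given a latency target for one operation invocation, pick the
# lowest-cost configuration whose estimated latency meets it.
# Hardware names, latencies, and costs are illustrative only.

CONFIGS = [
    # (hardware, batch_size, est_latency_s, est_cost_usd)
    ("cpu-serverless", 1, 4.0, 0.002),
    ("cpu-serverless", 8, 2.5, 0.004),
    ("gpu-t4", 8, 0.9, 0.010),
    ("gpu-t4", 32, 0.6, 0.018),
]

def pick_config(latency_target_s):
    """Return the cheapest configuration meeting the latency target,
    falling back to the fastest configuration if none does."""
    feasible = [c for c in CONFIGS if c[2] <= latency_target_s]
    if feasible:
        return min(feasible, key=lambda c: c[3])  # cheapest feasible
    return min(CONFIGS, key=lambda c: c[2])       # best effort: fastest

print(pick_config(3.0))  # -> ('cpu-serverless', 8, 2.5, 0.004)
```

A real optimizer must additionally account for the exponential joint search space across all DAG operations and for data-dependent intermediate result sizes, which is why Llama makes these assignments dynamically.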
Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency targets. Compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x lower cost on average.

Given the high cost of processing frames with expensive models, we then focus on query optimization. While researchers have proposed optimizations such as selectively using faster but less accurate models to replace, or to filter frames for, expensive models, users today must manually explore how and when these optimizations should be applied. This is especially difficult for complex queries with multiple predicates and models. We propose Relational Hints, a declarative interface that allows users to suggest ML model relationships based on domain knowledge. Users can express two key relationships: when a model can replace another (CAN REPLACE) and when a model can be used to filter frames for another (CAN FILTER). We then present VIVA, a video analytics system that uses relational hints to optimize SQL queries on video datasets. VIVA automatically selects and validates the hints applicable to a query, generates possible query plans using a formal set of transformations, and finds the best-performing plan that meets the user's accuracy requirements. Using VIVA, we show that hints improve performance by up to 16.6x without sacrificing accuracy.

Despite these improvements in performance and cost, existing systems still fall short in generality and ease of use: they limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results.
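The CAN REPLACE and CAN FILTER hint transformations described above can be illustrated with a small, hypothetical plan search. The model names, costs, accuracies, and filter selectivity below are invented for illustration; VIVA's actual optimizer applies a formal transformation set over SQL query plans:

```python
# Hypothetical sketch of hint-driven plan enumeration: starting from a
# baseline plan that runs an expensive model on every frame, CAN REPLACE
# swaps in a cheaper model, while CAN FILTER prepends a cheap filter so
# the expensive model only sees frames that pass it. All numbers are
# illustrative, not measured.

BASE_MODEL = ("slow_detector", 1.00)  # (name, relative cost per frame)

HINTS = [
    # (model, relationship, target, cost per frame, est. accuracy)
    ("fast_detector", "CAN REPLACE", "slow_detector", 0.10, 0.92),
    ("frame_filter",  "CAN FILTER",  "slow_detector", 0.02, 0.97),
]
FILTER_SELECTIVITY = 0.4  # assumed fraction of frames passing the filter

def candidate_plans():
    plans = [("baseline", 1.00, 1.00)]  # (plan, cost per frame, accuracy)
    for model, rel, target, cost, acc in HINTS:
        if rel == "CAN REPLACE" and target == BASE_MODEL[0]:
            plans.append((f"replace with {model}", cost, acc))
        elif rel == "CAN FILTER" and target == BASE_MODEL[0]:
            # filter runs on every frame; the expensive model only on survivors
            plans.append((f"filter with {model}",
                          cost + FILTER_SELECTIVITY * BASE_MODEL[1], acc))
    return plans

def best_plan(min_accuracy):
    ok = [p for p in candidate_plans() if p[2] >= min_accuracy]
    return min(ok, key=lambda p: p[1])  # cheapest plan meeting accuracy

print(best_plan(0.95)[0])  # -> filter with frame_filter
```

Note how the chosen plan changes with the accuracy requirement: relaxing it to 0.90 makes the cheaper replacement model the best choice, which is why hint selection must be validated per query.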
Recently proposed vision-language models enable users to query videos using natural language like "cars during daytime at traffic intersections." We show that vision-language models improve general expressivity while simplifying query optimization and achieving interactive latencies. However, these models still return large numbers of redundant and low-quality results, which can overwhelm and burden users. We present Zelda: a video analytics system that uses vision-language models to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the vision-language model with the user's query in natural language and additional terms to improve accuracy and identify low-quality frames. Zelda improves result diversity by leveraging the rich semantic information encoded in vision-language model embeddings. Across five datasets and 19 queries, Zelda achieves higher mean average precision (up to 1.15x) and improves average pairwise similarity (up to 1.16x) compared to using vision-language models out-of-the-box. Zelda also retrieves results 7.5x (up to 10.4x) faster for the same accuracy and frame diversity compared to a state-of-the-art video analytics engine.
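A diversity-aware top-K selection in the spirit of Zelda can be sketched as a greedy loop that balances a frame's relevance against its embedding similarity to frames already selected (a maximal-marginal-relevance-style heuristic). The two-dimensional embeddings, relevance scores, and trade-off weight below are toy values; Zelda uses vision-language model embeddings:

```python
import math

# Hypothetical sketch: greedily pick frames that are relevant to the
# query but dissimilar to already-selected frames, so near-duplicates
# are skipped in favor of semantically different results.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diverse_top_k(frames, k, trade_off=0.7):
    """frames: list of (frame_id, relevance, embedding).
    trade_off weights relevance against redundancy (toy value)."""
    selected, remaining = [], list(frames)
    while remaining and len(selected) < k:
        def score(f):
            # redundancy = similarity to the closest already-selected frame
            redundancy = max((cosine(f[2], s[2]) for s in selected),
                             default=0.0)
            return trade_off * f[1] - (1 - trade_off) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [f[0] for f in selected]

frames = [
    ("f1", 0.95, [1.0, 0.0]),
    ("f2", 0.94, [0.99, 0.05]),  # near-duplicate of f1
    ("f3", 0.80, [0.0, 1.0]),    # different content
]
print(diverse_top_k(frames, 2))  # -> ['f1', 'f3']
```

Here the near-duplicate f2 is skipped despite its high relevance, illustrating how embedding similarity can be reused both for ranking and for de-duplicating results.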
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Romero, Francisco Alejandro
|Degree committee member
|Degree committee member
|Stanford University, School of Engineering
|Stanford University, Department of Electrical Engineering
|Statement of responsibility
|Francisco Alejandro Romero Llamas.
|Submitted to the Department of Electrical Engineering.
|Thesis (Ph.D.)--Stanford University, 2023.
- © 2023 by Francisco Alejandro Romero
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).