General purpose and interactive video analytics

Abstract/Contents

Abstract
The proliferation of video collections and the increased capabilities of machine learning models have led to a growing desire for video analytics — the process of extracting insights from video. These two trends have made automatic and meaningful analysis of video increasingly feasible, allowing users to answer queries such as "how many birds of a particular species visit a feeder per day" or "do any cars that passed an intersection match an AMBER alert." Despite these advances, video today cannot be explored as practically and as performant as structured data. Exploring video today requires significant time and expertise for optimizing queries to meet performance, cost, and accuracy goals. This thesis focuses on the design of a general purpose video analytics database management system that allows users to query videos as easily, interactively, and cost-efficiently as querying structured data with scale-out systems like Spark SQL and PrestoDB. To reach this vision, we need to address challenges across three areas: systems (performance and cost), databases (automated optimization), and artificial intelligence (ease-of-use). We first focus on the systems challenges for how to improve the latency and resource efficiency of executing directed acyclic graphs of machine learning models for video analysis. The latency and resource efficiency of these directed acyclic graphs can be optimized using configurable knobs for each operation (e.g., batch size or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, (b) the optimal configuration depends on users' desired latency and cost targets, and (c) input video contents may exercise different paths in the directed acyclic graph and produce a variable amount of intermediate results. We present Llama: a heterogeneous and serverless framework for video processing. Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency target. Compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x cost reduction on average. Given the high cost of processing frames using expensive models, we then focus on query optimization. While researchers have proposed optimizations such as selectively using faster but less accurate models to replace or filter frames for expensive models, users today must manually explore how and when these optimizations should be applied. This is especially difficult for complex queries with multiple predicates and models. We propose Relational Hints, a declarative interface that allows users to suggest ML model relationships based on domain knowledge. Users can express two key relationships: when a model can replace another (CAN REPLACE) and when a model can be used to filter frames for another (CAN FILTER). We then present the VIVA video analytics system that uses relational hints to optimize SQL queries on video datasets. VIVA automatically selects and validates the hints applicable to the query, generates possible query plans using a formal set of transformations, and finds the best performance plan that meets a user's accuracy requirements. Using VIVA, we show that hints improve performance up to 16.6x without sacrificing accuracy. 
Despite improved performance and cost, existing systems still have generality and ease-of-use limitations. They limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade accuracy for performance, and return large amounts of redundant and low-quality results. Recently proposed vision-language models enable users to query videos using natural language, such as "cars during daytime at traffic intersections." We show that vision-language models improve general expressivity while simplifying query optimization and achieving interactive latencies. However, these models still return large numbers of redundant and low-quality results, which can overwhelm users. We present Zelda, a video analytics system that uses vision-language models to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the vision-language model with the user's natural-language query plus additional terms that improve accuracy and identify low-quality frames. It improves result diversity by leveraging the rich semantic information encoded in vision-language model embeddings. Across five datasets and 19 queries, Zelda achieves higher mean average precision (up to 1.15x) and improves average pairwise similarity (up to 1.16x) compared to using vision-language models out of the box. Zelda also retrieves results 7.5x faster on average (up to 10.4x) for the same accuracy and frame diversity compared to a state-of-the-art video analytics engine.
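Zelda's two core ideas, relevance scoring with vision-language embeddings and diversity-aware top-K selection, can also be sketched. The greedy selection below is a max-marginal-relevance style heuristic standing in for Zelda's actual diversity mechanism; it assumes L2-normalized embeddings from any CLIP-style model and requires only numpy.

    import numpy as np

    def top_k_diverse(frame_embs, query_emb, k, lam=0.7):
        """Greedily pick k frames, trading relevance to the query
        against similarity to frames already selected. frame_embs is
        (N, d) and query_emb is (d,), both L2-normalized, so dot
        products are cosine similarities."""
        relevance = frame_embs @ query_emb
        selected = [int(np.argmax(relevance))]
        while len(selected) < k:
            sim_to_selected = (frame_embs @ frame_embs[selected].T).max(axis=1)
            score = lam * relevance - (1.0 - lam) * sim_to_selected
            score[selected] = -np.inf   # never re-pick a frame
            selected.append(int(np.argmax(score)))
        return selected

The relevance scores would come from prompting the model with the query and the additional terms the abstract describes; the sketch covers only the diversity step.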

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2023
Publication date 2023
Issuance monographic
Language English

Creators/Contributors

Author Romero, Francisco Alejandro
Degree supervisor Kozyrakis, Christos
Thesis advisor Kozyrakis, Christos
Thesis advisor Rosenblum, Mendel
Thesis advisor Trippel, Caroline
Degree committee member Rosenblum, Mendel
Degree committee member Trippel, Caroline
Associated with Stanford University, School of Engineering
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Francisco Alejandro Romero Llamas.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2023.
Location https://purl.stanford.edu/zm860jk5533

Access conditions

Copyright
© 2023 by Francisco Alejandro Romero
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
