Improving computational and human efficiency in large-scale data analytics

Abstract/Contents

Abstract
Network telemetry, sensor readings, and other machine-generated data are growing exponentially in volume. Meanwhile, the computational resources available for processing this data -- as well as analysts' ability to manually inspect it -- remain limited. As the gap continues to widen, keeping up with the data volumes is challenging for analytic systems and analysts alike. This dissertation introduces systems and algorithms that focus the limited computational resources and analysts' time in modern data analytics on a subset of relevant data. The dissertation comprises two parts that focus on improving the computational and human efficiency of data analytics, respectively. In the first part of this dissertation, we improve the computational efficiency of analytics by combining precomputation and sampling techniques to select a subset of data that contributes the most to query results. We demonstrate this concept with two approximate query processing systems. PS3 approximates aggregate SQL queries with weighted, partition-level samples based on precomputed summary statistics, whereas HBE approximates kernel density estimates using precomputed hash indexes as smart data samplers. Our evaluation shows that both systems outperform uniform sampling, the best-known result for these queries, with practical precomputation overheads. PS3 achieves a 3x to 70x speedup at the same accuracy as uniform partition sampling, with less than 100 KB of storage overhead per partition; HBE offers up to a 10x improvement in query time compared to the second-best method with comparable precomputation time. In the second part of this dissertation, we improve the human efficiency of analytics by automatically identifying and summarizing unusual behaviors in large data streams to reduce the burden of manual inspection. We demonstrate this approach through two monitoring applications for machine-generated data.
First, ASAP is a visualization operator that automatically smooths time series in monitoring dashboards to highlight large-scale trends and deviations. Compared to presenting the raw time series, ASAP decreases users' response time for identifying anomalies by up to 44.3% in our user study. We subsequently describe FASTer, an end-to-end earthquake detection system that we built in collaboration with seismologists at Stanford University. By pushing down domain-specific filtering and aggregation into the analytics workflows, FASTer significantly improves the speed and quality of earthquake candidate generation, scaling the analysis from three months of data from a single sensor to ten years of data over a network of sensors. The contributions of this dissertation have had real-world impact. ASAP has been incorporated into open-source tools such as Graphite, TimescaleDB Toolkit, and the npm module downsample. ASAP has also directly inspired an auto smoother for the real-time dashboards at the monitoring service Datadog. FASTer is open source and has been used by researchers worldwide. Its improved scalability has enabled the discovery of hundreds of new earthquake events near the Diablo Canyon nuclear power plant in California.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021; 2021
Issuance monographic
Language English

Creators/Contributors

Author Rong, Kexin
Degree supervisor Levis, Philip
Thesis advisor Levis, Philip
Thesis advisor Bailis, Peter
Thesis advisor Zaharia, Matei
Degree committee member Bailis, Peter
Degree committee member Zaharia, Matei
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Kexin Rong.
Note Submitted to the Computer Science Department.
Thesis Thesis Ph.D. Stanford University 2021.
Location https://purl.stanford.edu/nc796rp3408

Access conditions

Copyright
© 2021 by Kexin Rong
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
