Data summaries for scalable, high-cardinality analytics

Placeholder Show Content

Abstract/Contents

Abstract
Sensor data, network requests, and other machine generated data continue to grow in both volume and dimensionality. Users now commonly aggregate over billions of records segmented by fine-grained time windows or a long tail of device types. To support this growth, an emerging class of data systems including Apache Druid and Apache Kylin precompute approximate summaries for different segments of a dataset, partitioned by contextual dimensions. These systems can avoid expensive full data scans by operating directly over segment summaries. However, existing streaming and mergeable summarization techniques do not scale to settings where summaries must be aggregated over hundreds or millions of segments. In this thesis, we develop data summaries tailored for high-cardinality aggregation. We find that an end-to-end approach to algorithm design, where data summaries and aggregation are designed for specific high-cardinality workloads, can yield dramatic improvements in accuracy, space usage, and runtime. We demonstrate this approach with three summarization techniques. First, we address runtime bottlenecks for quantile queries by introducing the Moments sketch, which uses compact moment statistics and efficient distribution solvers to reduce aggregation and query overhead compared to existing mergeable summaries. Second, we address accuracy and space bottlenecks for item frequency and quantile queries by developing Cooperative summaries optimized for aggregate (rather than individual) summary error, improving upon the error guarantees provided by mergeable summaries. Third, we address runtime bottlenecks for thresholded kernel density estimation queries by extending k-d tree summaries with a new aggregation procedure, TKDC, and show that it achieves asymptotic improvements in runtime compared with direct computation

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2020; ©2020
Publication date 2020; 2020
Issuance monographic
Language English

Creators/Contributors

Author Gan, Edward R
Degree supervisor Bailis, Peter
Thesis advisor Bailis, Peter
Thesis advisor Charikar, Moses
Thesis advisor Zaharia, Matei
Degree committee member Charikar, Moses
Degree committee member Zaharia, Matei
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Edward Gan
Note Submitted to the Computer Science Department
Thesis Thesis Ph.D. Stanford University 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Edward R Gan
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

Loading usage metrics...