Data summaries for scalable, high-cardinality analytics
Abstract/Contents
- Abstract
- Sensor data, network requests, and other machine generated data continue to grow in both volume and dimensionality. Users now commonly aggregate over billions of records segmented by fine-grained time windows or a long tail of device types. To support this growth, an emerging class of data systems including Apache Druid and Apache Kylin precompute approximate summaries for different segments of a dataset, partitioned by contextual dimensions. These systems can avoid expensive full data scans by operating directly over segment summaries. However, existing streaming and mergeable summarization techniques do not scale to settings where summaries must be aggregated over hundreds or millions of segments. In this thesis, we develop data summaries tailored for high-cardinality aggregation. We find that an end-to-end approach to algorithm design, where data summaries and aggregation are designed for specific high-cardinality workloads, can yield dramatic improvements in accuracy, space usage, and runtime. We demonstrate this approach with three summarization techniques. First, we address runtime bottlenecks for quantile queries by introducing the Moments sketch, which uses compact moment statistics and efficient distribution solvers to reduce aggregation and query overhead compared to existing mergeable summaries. Second, we address accuracy and space bottlenecks for item frequency and quantile queries by developing Cooperative summaries optimized for aggregate (rather than individual) summary error, improving upon the error guarantees provided by mergeable summaries. Third, we address runtime bottlenecks for thresholded kernel density estimation queries by extending k-d tree summaries with a new aggregation procedure, TKDC, and show that it achieves asymptotic improvements in runtime compared with direct computation
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2020; ©2020 |
Publication date | 2020; 2020 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Gan, Edward R | |
---|---|---|
Degree supervisor | Bailis, Peter | |
Thesis advisor | Bailis, Peter | |
Thesis advisor | Charikar, Moses | |
Thesis advisor | Zaharia, Matei | |
Degree committee member | Charikar, Moses | |
Degree committee member | Zaharia, Matei | |
Associated with | Stanford University, Computer Science Department. |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Edward Gan |
---|---|
Note | Submitted to the Computer Science Department |
Thesis | Thesis Ph.D. Stanford University 2020 |
Location | electronic resource |
Access conditions
- Copyright
- © 2020 by Edward R Gan
- License
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
Also listed in
Loading usage metrics...