Weighted sampling sketches for functions of frequencies

Abstract/Contents

Abstract
Weighted sampling is an extremely useful technique for processing streaming or distributed data. In particular, for the purpose of computing statistics or aggregates that depend on the frequencies of keys, a sample can be used to efficiently compute estimates without storing the entire table of keys and frequencies in memory. To obtain low-error estimates from a small sample, the sampling method should be weighted and tailored to the specific aggregates of interest. As a rule of thumb, the sampling weight of each key should be proportional to its contribution to the aggregate. For aggregates that sum a function f applied to the frequency of each key (e.g., frequency moments), the sampling weight of a key x with frequency w_x should be f(w_x). A main question, and the focus of this work, is designing weighted sampling schemes for broad families of functions that can be computed efficiently over streaming or distributed data.

First, we consider the family of concave sublinear functions, which includes x^p for p ≤ 1, ln(1+x), capping functions (min{x, T} for a constant T), their compositions, and more. We design a sampling sketch that can be tailored to any concave sublinear function with near-optimal guarantees on the variance of the estimates. We complement the theoretical analysis with an experimental study, which shows that in practice the variance of our estimates is even lower than the theoretical bound and is close to that of the optimal benchmark.

For some functions of frequencies (for example, x^p for p > 2), hardness results show that in the streaming setting, weighted sampling according to these functions requires polynomial space. In the second part, we consider two scenarios that go beyond worst-case analysis. In the first scenario, the input is still adversarial, but we additionally assume some noisy knowledge of the input distribution. In the second scenario, we assume that the input comes from a more restricted class of frequency distributions that is better aligned with what is observed in practice. In both scenarios, we show how to leverage these assumptions to sample according to functions that are "hard" on worst-case inputs.

In the final part, we propose a method to post-process weighted samples to satisfy element-level differential privacy while retaining the sample utility to the maximum extent possible. Our method maximizes the reporting probabilities of keys and the estimation quality for a broad family of statistics.
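To make the rule of thumb concrete, the following is a minimal sketch (not the thesis's actual sketch) of one standard way to sample keys with probability roughly proportional to f(w_x) and estimate the aggregate sum_x f(w_x) from the sample: priority sampling (Duffield, Lund, and Thorup), shown here with f(x) = sqrt(x), a concave sublinear function. The function and parameter names are illustrative assumptions.

```python
import heapq
import math
import random

def priority_sample(freqs, f, k, rng):
    """One-pass priority sample of size k, weighted by f(frequency).

    Each key gets priority f(w_x) / u for u ~ Uniform(0, 1]; the keys with
    the k largest priorities are kept. Returns (sample, tau), where sample
    maps each sampled key to f(w_x) and tau is the (k+1)-th largest
    priority (the inclusion threshold used by the estimator).
    """
    heap = []  # min-heap of (priority, key, f_w); holds the top k+1
    for key, w in freqs.items():
        fw = f(w)
        if fw <= 0:
            continue
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        q = fw / u
        if len(heap) < k + 1:
            heapq.heappush(heap, (q, key, fw))
        elif q > heap[0][0]:
            heapq.heapreplace(heap, (q, key, fw))
    if len(heap) <= k:
        # Fewer than k+1 keys: everything is sampled, estimates are exact.
        return {key: fw for _, key, fw in heap}, 0.0
    tau = heap[0][0]
    return {key: fw for q, key, fw in heap if q > tau}, tau

def estimate_sum(sample, tau):
    """Unbiased estimate of sum_x f(w_x) from a priority sample:
    each sampled key contributes max(f(w_x), tau)."""
    return sum(max(fw, tau) for fw in sample.values())

# Illustrative usage: estimate sum_x sqrt(w_x) over synthetic frequencies.
rng = random.Random(0)
freqs = {f"key{i}": 1 + (i % 50) ** 2 for i in range(1000)}
f = math.sqrt  # concave sublinear
true_sum = sum(f(w) for w in freqs.values())
sample, tau = priority_sample(freqs, f, 150, rng)
estimate = estimate_sum(sample, tau)
```

Because high-frequency keys get large priorities, they are sampled with probability close to 1 and contribute little variance; the thesis's contribution is, in part, making this kind of f-weighted sampling feasible in one pass when the frequencies w_x themselves are only revealed incrementally by the stream.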

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Geri, Ofir
Degree supervisor Charikar, Moses
Thesis advisor Charikar, Moses
Thesis advisor Cohen, Edith
Thesis advisor Valiant, Gregory
Degree committee member Cohen, Edith
Degree committee member Valiant, Gregory
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Ofir Geri.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2021.
Location https://purl.stanford.edu/vc067sy0699

Access conditions

Copyright
© 2021 by Ofir Geri
