Weighted sampling sketches for functions of frequencies

Abstract/Contents

Abstract
Weighted sampling is an extremely useful technique for processing streaming or distributed data. In particular, for the purpose of computing statistics or aggregates that depend on the frequencies of keys, a sample can be used to efficiently compute estimates without storing the entire table of keys and frequencies in memory. To obtain low-error estimates from a small sample, the sampling method should be weighted and tailored to the specific aggregates of interest. As a rule of thumb, the sampling weight of each key should be proportional to its contribution to the aggregate. For aggregates that sum a function f applied to the frequency of each key (e.g., frequency moments), the sampling weight of a key x with frequency w_x should be f(w_x). A main question, and the focus of this work, is designing weighted sampling schemes for broad families of functions that can be computed efficiently over streaming or distributed data.

First, we consider the family of concave sublinear functions, which includes x^p for p ≤ 1, ln(1+x), capping functions (min{x, T} for a constant T), their compositions, and more. We design a sampling sketch that can be tailored to any concave sublinear function with near-optimal guarantees on the variance of the estimates. We complement the theoretical analysis with an experimental study, which shows that in practice the variance of our estimates is even lower than the theoretical bound and is close to that of the optimal benchmark.

For some functions of frequencies (for example, x^p for p > 2), hardness results show that in the streaming setting, weighted sampling according to these functions requires polynomial space. In the second part, we consider two scenarios that go beyond worst-case analysis. In the first scenario, the input is still adversarial, but we additionally assume some noisy knowledge of the input distribution. In the second scenario, we assume that the input comes from a more restricted class of frequency distributions that is better aligned with what is observed in practice. In both scenarios, we show how to leverage these assumptions to sample according to functions that are "hard" on worst-case inputs.

In the final part, we propose a method to post-process weighted samples to satisfy element-level differential privacy while retaining the sample utility to the maximum extent possible. Our method maximizes the reporting probabilities of keys and the estimation quality for a broad family of statistics.
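To make the rule of thumb concrete, the following is a minimal sketch (not the thesis's actual sketch) of one standard way to sample keys with probability roughly proportional to f(w_x) and estimate the aggregate sum_x f(w_x) from the sample: priority sampling (Duffield, Lund, and Thorup), shown here with f(x) = sqrt(x), a concave sublinear function. The function and parameter names are illustrative assumptions.

```python
import heapq
import math
import random

def priority_sample(freqs, f, k, rng):
    """One-pass priority sample of size k, weighted by f(frequency).

    Each key gets priority f(w_x) / u for u ~ Uniform(0, 1]; the keys with
    the k largest priorities are kept. Returns (sample, tau), where sample
    maps each sampled key to f(w_x) and tau is the (k+1)-th largest
    priority (the inclusion threshold used by the estimator).
    """
    heap = []  # min-heap of (priority, key, f_w); holds the top k+1
    for key, w in freqs.items():
        fw = f(w)
        if fw <= 0:
            continue
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        q = fw / u
        if len(heap) < k + 1:
            heapq.heappush(heap, (q, key, fw))
        elif q > heap[0][0]:
            heapq.heapreplace(heap, (q, key, fw))
    if len(heap) <= k:
        # Fewer than k+1 keys: everything is sampled, estimates are exact.
        return {key: fw for _, key, fw in heap}, 0.0
    tau = heap[0][0]
    return {key: fw for q, key, fw in heap if q > tau}, tau

def estimate_sum(sample, tau):
    """Unbiased estimate of sum_x f(w_x) from a priority sample:
    each sampled key contributes max(f(w_x), tau)."""
    return sum(max(fw, tau) for fw in sample.values())

# Illustrative usage: estimate sum_x sqrt(w_x) over synthetic frequencies.
rng = random.Random(0)
freqs = {f"key{i}": 1 + (i % 50) ** 2 for i in range(1000)}
f = math.sqrt  # concave sublinear
true_sum = sum(f(w) for w in freqs.values())
sample, tau = priority_sample(freqs, f, 150, rng)
estimate = estimate_sum(sample, tau)
```

Because high-frequency keys get large priorities, they are sampled with probability close to 1 and contribute little variance; the thesis's contribution is, in part, making this kind of f-weighted sampling feasible in one pass when the frequencies w_x themselves are only revealed incrementally by the stream.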

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Geri, Ofir
Degree supervisor Charikar, Moses
Thesis advisor Charikar, Moses
Thesis advisor Cohen, Edith
Thesis advisor Valiant, Gregory
Degree committee member Cohen, Edith
Degree committee member Valiant, Gregory
Associated with Stanford University, Computer Science Department

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Ofir Geri.
Note Submitted to the Computer Science Department.
Thesis Thesis (Ph.D.)--Stanford University, 2021.
Location https://purl.stanford.edu/vc067sy0699

Access conditions

Copyright
© 2021 by Ofir Geri
