Skip to main content

A Bayesian Nonparametric View on Count-Min Sketch

Author(s): Cai, Diana; Mitzenmacher, Michael; Adams, Ryan P

Download
To refer to this page use: http://arks.princeton.edu/ark:/88435/pr12n7h
Full metadata record
DC FieldValueLanguage
dc.contributor.authorCai, Diana-
dc.contributor.authorMitzenmacher, Michael-
dc.contributor.authorAdams, Ryan P-
dc.date.accessioned2021-10-08T19:45:02Z-
dc.date.available2021-10-08T19:45:02Z-
dc.date.issued2018en_US
dc.identifier.citationCai, Diana, Michael Mitzenmacher, and Ryan P. Adams. "A Bayesian Nonparametric View on Count-Min Sketch." Advances in Neural Information Processing Systems 31 (2018): 8768-8777.en_US
dc.identifier.urihttps://papers.nips.cc/paper/2018/hash/0b9e57c46de934cee33b0e8d1839bfc2-Abstract.html-
dc.identifier.urihttp://arks.princeton.edu/ark:/88435/pr12n7h-
dc.description.abstractThe count-min sketch is a time- and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation. In particular, we take a nonparametric approach and consider tokens generated from a Dirichlet process (DP) random measure, which allows for an unbounded number of unique tokens. Using properties of the DP, we show that it is possible to straightforwardly compute posterior marginals of the unknown true counts and that the modes of these marginals recover the count-min sketch estimator, inheriting the associated probabilistic guarantees. Using simulated data and text data, we investigate the properties of these estimators. Lastly, we also study a modified problem in which the observation stream consists of collections of tokens (i.e., documents) arising from a random measure drawn from a stable beta process, which allows for power law scaling behavior in the number of unique tokens.en_US
dc.format.extent8768 - 8777en_US
dc.language.isoen_USen_US
dc.relation.ispartofAdvances in Neural Information Processing Systemsen_US
dc.rightsAuthor's manuscripten_US
dc.titleA Bayesian Nonparametric View on Count-Min Sketchen_US
dc.typeJournal Articleen_US
pu.type.symplectichttp://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceedingen_US

Files in This Item:
File Description SizeFormat 
BayesianNonparamCountMinSketch.pdf2.68 MBAdobe PDFView/Download


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.