A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs

Author(s): Arora, Sanjeev; Khodak, Mikhail; Saunshi, Nikunj; Vodrahalli, Kiran

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr13v8j
Full metadata record
dc.contributor.author: Arora, Sanjeev
dc.contributor.author: Khodak, Mikhail
dc.contributor.author: Saunshi, Nikunj
dc.contributor.author: Vodrahalli, Kiran
dc.date.accessioned: 2021-10-08T19:51:08Z
dc.date.available: 2021-10-08T19:51:08Z
dc.date.issued: 2018 [en_US]
dc.identifier.citation: Arora, Sanjeev, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. "A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs." In International Conference on Learning Representations (2018). [en_US]
dc.identifier.uri: https://openreview.net/pdf?id=B1e5ef-C-
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/pr13v8j
dc.description.abstract: Low-dimensional vector embeddings, computed using LSTMs or simpler techniques, are a popular approach for capturing the “meaning” of text and a form of unsupervised learning useful for downstream tasks. However, their power is not theoretically understood. The current paper derives formal understanding by looking at the subcase of linear embedding schemes. Using the theory of compressed sensing we show that representations combining the constituent word vectors are essentially information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text. This leads to a new theoretical result about LSTMs: low-dimensional embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks, up to small error, as a linear classifier over BonG vectors, a result that extensive empirical work has thus far been unable to show. Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. We also show a surprising new property of embeddings such as GloVe and word2vec: they form a good sensing matrix for text that is more efficient than random matrices, the standard sparse recovery tool, which may explain why they lead to better representations in practice. [en_US]
dc.language.iso: en_US [en_US]
dc.relation.ispartof: International Conference on Learning Representations [en_US]
dc.rights: Final published version. This is an open access article. [en_US]
dc.title: A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs [en_US]
dc.type: Conference Article [en_US]
pu.type.symplectic: http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding [en_US]
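
The abstract above has a simple concrete form: a document embedding built by summing word vectors equals A·x, where the columns of A are the word vectors and x is the document's bag-of-words count vector, so compressed sensing applies directly. The sketch below is a minimal illustration, not the authors' code; the toy vocabulary, dimensions, and random Gaussian sensing matrix are assumptions for demonstration (the paper uses pretrained embeddings such as GloVe in the role of A). It checks the measurement identity and recovers x by basis pursuit, which for nonnegative counts reduces to a linear program.

```python
# Minimal sketch of the linear-embedding view (illustrative assumptions:
# toy vocabulary, d=20, random Gaussian word vectors; not the paper's setup).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"] + [f"w{i}" for i in range(45)]
n, d = len(vocab), 20                        # vocab size, embedding dimension
A = rng.standard_normal((d, n)) / np.sqrt(d) # columns = word vectors

doc = ["the", "cat", "sat", "on", "the", "mat"]
x = np.zeros(n)
for w in doc:                                # bag-of-words count vector
    x[vocab.index(w)] += 1

# Summing the document's word vectors is exactly the linear measurement A @ x.
emb = sum(A[:, vocab.index(w)] for w in doc)
assert np.allclose(emb, A @ x)

# Basis pursuit with nonnegative counts reduces to a linear program:
# minimize sum(x_hat) subject to A @ x_hat = emb, x_hat >= 0.
res = linprog(c=np.ones(n), A_eq=A, b_eq=emb, bounds=(0, None))
print(np.round(res.x, 2))                    # recovers the sparse counts (w.h.p.)
```

In the paper's terms, pretrained embeddings such as GloVe play the role of A, and the abstract's final claim is that they sense natural text more efficiently than the random matrices used above, the standard sparse recovery tool.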

Files in This Item:
CompressedSensingView.pdf (2.28 MB, Adobe PDF)


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.