A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs

Author(s): Arora, Sanjeev; Khodak, Mikhail; Saunshi, Nikunj; Vodrahalli, Kiran

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr13v8j
Full metadata record
dc.contributor.author: Arora, Sanjeev
dc.contributor.author: Khodak, Mikhail
dc.contributor.author: Saunshi, Nikunj
dc.contributor.author: Vodrahalli, Kiran
dc.date.accessioned: 2021-10-08T19:51:08Z
dc.date.available: 2021-10-08T19:51:08Z
dc.date.issued: 2018 [en_US]
dc.identifier.citation: Arora, Sanjeev, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. "A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs." In International Conference on Learning Representations (2018). [en_US]
dc.identifier.uri: https://openreview.net/pdf?id=B1e5ef-C-
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/pr13v8j
dc.description.abstract: Low-dimensional vector embeddings, computed using LSTMs or simpler techniques, are a popular approach for capturing the “meaning” of text and a form of unsupervised learning useful for downstream tasks. However, their power is not theoretically understood. The current paper derives formal understanding by looking at the subcase of linear embedding schemes. Using the theory of compressed sensing we show that representations combining the constituent word vectors are essentially information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text. This leads to a new theoretical result about LSTMs: low-dimensional embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks, up to small error, as a linear classifier over BonG vectors, a result that extensive empirical work has thus far been unable to show. Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. We also show a surprising new property of embeddings such as GloVe and word2vec: they form a good sensing matrix for text that is more efficient than random matrices, the standard sparse recovery tool, which may explain why they lead to better representations in practice. [en_US]
dc.language.iso: en_US [en_US]
dc.relation.ispartof: International Conference on Learning Representations [en_US]
dc.rights: Final published version. This is an open access article. [en_US]
dc.title: A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs [en_US]
dc.type: Conference Article [en_US]
pu.type.symplectic: http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding [en_US]
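
The abstract above has a simple concrete form: a document embedding built by summing word vectors equals A·x, where the columns of A are the word vectors and x is the document's bag-of-words count vector, so compressed sensing applies directly. The sketch below is a minimal illustration, not the authors' code; the toy vocabulary, dimensions, and random Gaussian sensing matrix are assumptions for demonstration (the paper uses pretrained embeddings such as GloVe in the role of A). It checks the measurement identity and recovers x by basis pursuit, which for nonnegative counts reduces to a linear program.

```python
# Minimal sketch of the linear-embedding view (illustrative assumptions:
# toy vocabulary, d=20, random Gaussian word vectors; not the paper's setup).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"] + [f"w{i}" for i in range(45)]
n, d = len(vocab), 20                        # vocab size, embedding dimension
A = rng.standard_normal((d, n)) / np.sqrt(d) # columns = word vectors

doc = ["the", "cat", "sat", "on", "the", "mat"]
x = np.zeros(n)
for w in doc:                                # bag-of-words count vector
    x[vocab.index(w)] += 1

# Summing the document's word vectors is exactly the linear measurement A @ x.
emb = sum(A[:, vocab.index(w)] for w in doc)
assert np.allclose(emb, A @ x)

# Basis pursuit with nonnegative counts reduces to a linear program:
# minimize sum(x_hat) subject to A @ x_hat = emb, x_hat >= 0.
res = linprog(c=np.ones(n), A_eq=A, b_eq=emb, bounds=(0, None))
print(np.round(res.x, 2))                    # recovers the sparse counts (w.h.p.)
```

In the paper's terms, pretrained embeddings such as GloVe play the role of A, and the abstract's final claim is that they sense natural text more efficiently than the random matrices used above, the standard sparse recovery tool.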

Files in This Item:
CompressedSensingView.pdf (2.28 MB, Adobe PDF)


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.