
A Simple but Tough-to-Beat Baseline for Sentence Embeddings

Author(s): Arora, Sanjeev; Liang, Yingyu; Ma, Tengyu

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1rk2k
Full metadata record
DC Field | Value | Language
dc.contributor.author | Arora, Sanjeev | -
dc.contributor.author | Liang, Yingyu | -
dc.contributor.author | Ma, Tengyu | -
dc.date.accessioned | 2021-10-08T19:51:06Z | -
dc.date.available | 2021-10-08T19:51:06Z | -
dc.date.issued | 2017 | en_US
dc.identifier.citation | Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. "A Simple but Tough-to-Beat Baseline for Sentence Embeddings." International Conference on Learning Representations (2017). | en_US
dc.identifier.uri | https://openreview.net/pdf?id=SyK00v5xx | -
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/pr1rk2k | -
dc.description.abstract | The success of neural network methods for computing word embeddings has motivated methods for generating semantic embeddings of longer pieces of text, such as sentences and paragraphs. Surprisingly, Wieting et al (ICLR'16) showed that such complicated methods are outperformed, especially in out-of-domain (transfer learning) settings, by simpler methods involving mild retraining of word embeddings and basic linear regression. The method of Wieting et al. requires retraining with a substantial labeled dataset such as Paraphrase Database (Ganitkevitch et al., 2013). The current paper goes further, showing that the following completely unsupervised sentence embedding is a formidable baseline: Use word embeddings computed using one of the popular methods on unlabeled corpus like Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD. This weighting improves performance by about 10% to 30% in textual similarity tasks, and beats sophisticated supervised methods including RNN's and LSTM's. It even improves Wieting et al.'s embeddings. This simple method should be used as the baseline to beat in future, especially when labeled training data is scarce or nonexistent. The paper also gives a theoretical explanation of the success of the above unsupervised method using a latent variable generative model for sentences, which is a simple extension of the model in Arora et al. (TACL'16) with new "smoothing" terms that allow for words occurring out of context, as well as high probabilities for words like and, not in all contexts. | en_US
dc.language.iso | en_US | en_US
dc.relation.ispartof | International Conference on Learning Representations | en_US
dc.rights | Final published version. This is an open access article. | en_US
dc.title | A Simple but Tough-to-Beat Baseline for Sentence Embeddings | en_US
dc.type | Conference Article | en_US
pu.type.symplectic | http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding | en_US
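
The abstract in this record describes the method only in outline: a weighted average of word vectors, followed by a PCA/SVD adjustment. The following is a minimal sketch of that outline, not the authors' implementation. It assumes the weighting a / (a + p(w)) and the removal of each sentence vector's projection onto the first singular vector, as given in the paper itself rather than in this record. The function name `sif_embeddings`, its parameters, and the default `a=1e-3` are illustrative choices.

```python
import numpy as np


def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Sketch of a weighted-average sentence embedding with SVD adjustment.

    sentences    : list of token lists
    word_vectors : dict mapping token -> 1-D numpy array (pretrained embeddings)
    word_prob    : dict mapping token -> estimated unigram probability p(w)
    a            : smoothing parameter (assumed default, not from this record)
    """
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for tokens in sentences:
        # Tokens without a vector are skipped (an assumption of this sketch).
        known = [t for t in tokens if t in word_vectors]
        if not known:
            rows.append(np.zeros(dim))
            continue
        # Frequent words get small weights, rare words get weights near 1.
        weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in known])
        vectors = np.stack([word_vectors[t] for t in known])
        rows.append(weights @ vectors / len(known))
    X = np.vstack(rows)

    # First right singular vector of the sentence matrix (one row per sentence).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # Remove each sentence vector's projection onto that direction.
    return X - np.outer(X @ u, u)
```

In practice `word_prob` would be estimated from a large unlabeled corpus, and the singular vector would be computed over the whole set of sentences being embedded; both choices are left to the caller here.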

Files in This Item:
File | Description | Size | Format
BaselineSentenceEmbedding.pdf | | 318.09 kB | Adobe PDF


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.