
Learning topic models - Going beyond SVD

Author(s): Arora, Sanjeev; Ge, R.; Moitra, A.

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1wm8t
Full metadata record
DC Field: Value (Language)

dc.contributor.author: Arora, Sanjeev
dc.contributor.author: Ge, R
dc.contributor.author: Moitra, A
dc.date.accessioned: 2019-08-29T17:05:04Z
dc.date.available: 2019-08-29T17:05:04Z
dc.date.issued: 2012 (en_US)
dc.identifier.citation: Arora, S., Ge, R., & Moitra, A. (2012). Learning topic models - Going beyond SVD. 1-10. doi:10.1109/FOCS.2012.49 (en_US)
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/pr1wm8t
dc.description.abstract: Topic modeling is an approach to automatic comprehension and classification of data in a variety of settings; perhaps the canonical application is uncovering thematic structure in a corpus of documents. A number of foundational works, both in machine learning and in theory, have suggested a probabilistic model for documents, whereby each document arises as a convex combination of (i.e., a distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e., a vector of word frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation (LDA) model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD) and consequently have one of two limitations: they must either assume that each document contains only one topic, or else can only recover the span of the topic vectors rather than the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context. Using this tool, we give the first polynomial-time algorithm for learning topic models without either of the above two limitations. The algorithm relies on a fairly mild assumption about the underlying topic matrix, called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to still more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope this paper will motivate further theoretical results that use NMF as a replacement for SVD, just as NMF has come to replace SVD in many applications. (en_US)
dc.format.extent: 1-10 (en_US)
dc.language.iso: en_US (en_US)
dc.relation.ispartof: Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS (en_US)
dc.rights: Author's manuscript (en_US)
dc.title: Learning topic models - Going beyond SVD (en_US)
dc.type: Conference Article (en_US)
dc.identifier.doi: doi:10.1109/FOCS.2012.49
pu.type.symplectic: http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding (en_US)
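
The abstract above describes documents arising as convex combinations of topic vectors, and a "separability" assumption under which each topic has an anchor word. The toy sketch below illustrates that generative model only; it is not the paper's algorithm, and the topic matrix, vocabulary size, and helper function are illustrative choices, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics over a 6-word vocabulary.
# Each COLUMN of A is a topic, i.e. a probability distribution over words.
# A is "separable" in the abstract's sense: each topic has an anchor word
# (rows 0-2) whose probability is nonzero only under that topic.
A = np.array([
    [0.5, 0.0, 0.0],   # anchor word for topic 0
    [0.0, 0.5, 0.0],   # anchor word for topic 1
    [0.0, 0.0, 0.5],   # anchor word for topic 2
    [0.2, 0.3, 0.1],
    [0.2, 0.1, 0.3],
    [0.1, 0.1, 0.1],
])

def sample_document(A, n_words, rng):
    """Draw one document: pick topic proportions, then sample words
    from the induced convex combination of topic distributions."""
    k = A.shape[1]
    w = rng.dirichlet(np.ones(k))   # topic mixture for this document
    word_dist = A @ w               # convex combination of topic columns
    return rng.choice(A.shape[0], size=n_words, p=word_dist)

doc = sample_document(A, n_words=50, rng=rng)
```

Each `doc` is a sequence of word indices; recovering the columns of `A` (not just their span) from many such documents is the learning problem the paper solves in polynomial time under separability.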

Files in This Item:
File: Learning Topic Models Going beyond SVD.pdf (246.31 kB, Adobe PDF)


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.