Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach

Author(s): Fang, EX; Li, MD; Jordan, MI; Liu, H

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1vg49
Abstract: Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is emerging as a useful approach to bridge functional genomics with disease risk loci. In this article, we use large-scale gene expression and chromatin immunoprecipitation (ChIP) data corpuses to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From an experimental perspective, our method generates an informative list of tumor-related TFs and the tumor types they may affect. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SET-DB1, and nuclear receptors, in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, many of which have not been reported before. In summary, our work establishes a robust method to identify the association between TFs and biological contexts. Given the limited amount of genome-wide binding profiles of TFs and the massive number of expression profiles, our work provides a useful tool to deconvolute the gene regulatory network for tumors and other biological contexts. Supplementary materials for this article are available online.
Publication Date: 2017
Citation: Fang, EX, Li, MD, Jordan, MI, Liu, H. Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach. Journal of the American Statistical Association, 112 (519) (2017): pp. 921 - 932. doi:10.1080/01621459.2016.1256812
DOI: 10.1080/01621459.2016.1256812
ISSN: 0162-1459
EISSN: 1537-274X
Pages: 921 - 932
Type of Material: Journal Article
Journal/Proceeding Title: Journal of the American Statistical Association
Version: Author's manuscript



Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.