Guarding against spurious discoveries in high dimensions

Author(s): Fan, J; Zhou, WX

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr15w91
Full metadata record
DC Field | Value | Language
dc.contributor.author | Fan, J | en_US
dc.contributor.author | Zhou, WX | en_US
dc.date.accessioned | 2018-07-20T15:06:13Z | -
dc.date.available | 2018-07-20T15:06:13Z | -
dc.date.issued | 2016-11-01 | en_US
dc.identifier.citation | Fan, J, Zhou, WX. (2016). Guarding against spurious discoveries in high dimensions. Journal of Machine Learning Research, 17, 1 - 34 | en_US
dc.identifier.issn | 1532-4435 | en_US
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/pr15w91 | -
dc.description.abstract | © 2016 Jianqing Fan and Wen-Xin Zhou. Many data mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to enormous possibilities of such selections. How can we know statistically our discoveries better than those by chance? In this paper, we define a measure of goodness of spurious fit, which shows how good a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious fit for generalized linear models and L1 regression. Such an asymptotic distribution depends on the sample size, ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models with goodness of fits better than those by spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials. | en_US
dc.format.extent | 1 - 34 | en_US
dc.relation.ispartof | Journal of Machine Learning Research | en_US
dc.title | Guarding against spurious discoveries in high dimensions | en_US
dc.type | Journal Article | -
dc.identifier.eissn | 1533-7928 | en_US
pu.type.symplectic | http://www.symplectic.co.uk/publications/atom-terms/1.0/journal-article | en_US
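
Illustrative note (not part of the record): the abstract says that, for linear models, the goodness of spurious fit coincides with the maximum spurious correlation. The sketch below shows only the simplest case of that idea, a single selected covariate. It is not the paper's LAMM algorithm or its multiplier bootstrap. It assumes independent standard normal covariates and a response drawn independently of them, and it uses plain Monte Carlo simulation. The function names `max_abs_correlation` and `null_quantile` are hypothetical and chosen for this example.

```python
import numpy as np


def max_abs_correlation(X, y):
    """Largest absolute sample correlation between y and any single column of X."""
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    num = Xc.T @ yc                  # inner products, one per column
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return float(np.max(np.abs(num / den)))


def null_quantile(n, p, level=0.95, n_sim=500, seed=0):
    """Monte Carlo quantile of the maximum spurious correlation under a null model.

    Assumption for this sketch: X has i.i.d. N(0, 1) entries and y is drawn
    independently of X, so any observed correlation is spurious.
    """
    rng = np.random.default_rng(seed)
    stats = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)
        stats[b] = max_abs_correlation(X, y)
    return float(np.quantile(stats, level))


if __name__ == "__main__":
    # With the sample size fixed, the chance-level benchmark grows with dimension.
    for p in (10, 100, 1000):
        print(p, round(null_quantile(n=50, p=p, n_sim=200), 3))
```

Under these assumptions, an observed best single-covariate correlation that does not exceed the simulated quantile is no better than what chance selection would produce at that sample size and dimension.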

Files in This Item:
File | Description | Size | Format
Guarding against spurious discoveries in high dimensions.pdf | | 665.92 kB | Adobe PDF


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.