Guarding against spurious discoveries in high dimensions

Fan, J; Zhou, WX

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr15w91

Full metadata record

DC Field	Value	Language
dc.contributor.author	Fan, J	en_US
dc.contributor.author	Zhou, WX	en_US
dc.date.accessioned	2018-07-20T15:06:13Z	-
dc.date.available	2018-07-20T15:06:13Z	-
dc.date.issued	2016-11-01	en_US
dc.identifier.citation	Fan, J, Zhou, WX. (2016). Guarding against spurious discoveries in high dimensions. Journal of Machine Learning Research, 17 (1 - 34).	en_US
dc.identifier.issn	1532-4435	en_US
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/pr15w91	-
dc.description.abstract	© 2016 Jianqing Fan and Wen-Xin Zhou. Many data mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to enormous possibilities of such selections. How can we know statistically our discoveries better than those by chance? In this paper, we define a measure of goodness of spurious fit, which shows how good a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious fit for generalized linear models and L1 regression. Such an asymptotic distribution depends on the sample size, ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models with goodness of fits better than those by spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials.	en_US
dc.format.extent	1 - 34	en_US
dc.relation.ispartof	Journal of Machine Learning Research	en_US
dc.title	Guarding against spurious discoveries in high dimensions	en_US
dc.type	Journal Article	-
dc.identifier.eissn	1533-7928	en_US
pu.type.symplectic	http://www.symplectic.co.uk/publications/atom-terms/1.0/journal-article	en_US

Files in This Item:

File	Description	Size	Format
Guarding against spurious discoveries in high dimensions.pdf		665.92 kB	Adobe PDF