On the Feasibility of Internet-Scale Author Identification

Narayanan, Arvind; Paskov, Hristo; Gong, Neil Z; Bethencourt, John; Stefanov, Emil; Shin, Eui CR; Song, Dawn

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr13z6v

Full metadata record

DC Field	Value	Language
dc.contributor.author	Narayanan, Arvind	-
dc.contributor.author	Paskov, Hristo	-
dc.contributor.author	Gong, Neil Z	-
dc.contributor.author	Bethencourt, John	-
dc.contributor.author	Stefanov, Emil	-
dc.contributor.author	Shin, Eui CR	-
dc.contributor.author	Song, Dawn	-
dc.date.accessioned	2021-10-08T19:44:27Z	-
dc.date.available	2021-10-08T19:44:27Z	-
dc.date.issued	2012	en_US
dc.identifier.citation	Narayanan, Arvind, Hristo Paskov, Neil Z. Gong, John Bethencourt, Emil Stefanov, Eui C. R. Shin, and Dawn Song. "On the Feasibility of Internet-Scale Author Identification." In 2012 IEEE Symposium on Security and Privacy (2012): pp. 300-314. doi:10.1109/SP.2012.46	en_US
dc.identifier.issn	1081-6011	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/pr13z6v	-
dc.description.abstract	We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors. Given the increasing availability of writing samples online, our result has serious implications for anonymity and free speech - an anonymous blogger or whistleblower may be unmasked unless they take steps to obfuscate their writing style. While there is a huge body of literature on authorship recognition based on writing style, almost none of it has studied corpora of more than a few hundred authors. The problem becomes qualitatively different at a large scale, as we show, and techniques from prior work fail to scale, both in terms of accuracy and performance. We study a variety of classifiers, both "lazy" and "eager," and show how to handle the huge number of classes. We also develop novel techniques for confidence estimation of classifier outputs. Finally, we demonstrate stylometric authorship recognition on texts written in different contexts. In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses. If we allow the classifier the option of not making a guess, via confidence estimation we are able to increase the precision of the top guess from 20% to over 80% with only a halving of recall.	en_US
dc.format.extent	300 - 314	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartof	2012 IEEE Symposium on Security and Privacy	en_US
dc.rights	Author's manuscript	en_US
dc.title	On the Feasibility of Internet-Scale Author Identification	en_US
dc.type	Conference Article	en_US
dc.identifier.doi	doi:10.1109/SP.2012.46	-
dc.identifier.eissn	2375-1207	-
pu.type.symplectic	http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding	en_US
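
The abstract's closing claim, that letting the classifier decline to guess trades recall for precision, can be illustrated with a minimal sketch. Everything below is an assumption for illustration and is not taken from the paper: the function names, the toy scores, and the choice of "margin between the top two candidate scores" as the confidence measure are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def top_guess_with_abstention(scores: Dict[str, float], min_gap: float) -> Optional[str]:
    """Return the best-scoring candidate author, or None (abstain) when the
    margin over the runner-up is below min_gap.

    Assumption: the top-two score margin stands in for a confidence estimate.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= min_gap else None


def precision_recall(guesses: List[Optional[str]], truths: List[str]) -> Tuple[float, float]:
    """Precision = correct / guesses made; recall = correct / all texts."""
    made = [(g, t) for g, t in zip(guesses, truths) if g is not None]
    correct = sum(1 for g, t in made if g == t)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical toy data: per-text classifier scores and the true author.
cases = [
    ({"a": 0.9, "b": 0.1, "c": 0.0}, "a"),
    ({"a": 0.5, "b": 0.45, "c": 0.05}, "b"),
    ({"a": 0.2, "b": 0.7, "c": 0.1}, "b"),
    ({"a": 0.4, "b": 0.35, "c": 0.25}, "a"),
    ({"a": 0.1, "b": 0.5, "c": 0.4}, "c"),
]
truths = [truth for _, truth in cases]

# Always guess (min_gap=0.0) versus abstain on narrow margins (min_gap=0.3).
always = precision_recall([top_guess_with_abstention(s, 0.0) for s, _ in cases], truths)
cautious = precision_recall([top_guess_with_abstention(s, 0.3) for s, _ in cases], truths)
```

On this toy data, raising `min_gap` removes the low-margin guesses, so precision rises while recall falls. The numbers have no relation to the 20%, 80%, or 100,000-author figures reported in the abstract.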

Files in This Item:

File	Description	Size	Format
FeasibilityInternetScaleAuthorID.pdf		512.96 kB	Adobe PDF