
De-anonymizing programmers via code stylometry

Author(s): Caliskan-Islam, A; Harang, R; Liu, A; Narayanan, Arvind; Voss, C; et al

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1q24c
Full metadata record
DC Field | Value | Language
dc.contributor.author | Caliskan-Islam, A | -
dc.contributor.author | Harang, R | -
dc.contributor.author | Liu, A | -
dc.contributor.author | Narayanan, Arvind | -
dc.contributor.author | Voss, C | -
dc.contributor.author | Yamaguchi, F | -
dc.contributor.author | Greenstadt, R | -
dc.date.accessioned | 2021-10-08T19:45:31Z | -
dc.date.available | 2021-10-08T19:45:31Z | -
dc.date.issued | 2015-01-01 | en_US
dc.identifier.citation | Caliskan-Islam, A, Harang, R, Liu, A, Narayanan, A, Voss, C, Yamaguchi, F, Greenstadt, R. (2015). De-anonymizing programmers via code stylometry. Proceedings of the 24th USENIX Security Symposium, 255 - 270 | en_US
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/pr1q24c | -
dc.description.abstract | © 2015 Proceedings of the 24th USENIX Security Symposium. All rights reserved. Source code authorship attribution is a significant privacy threat to anonymous code contributors. However, it may also enable attribution of successful attacks from code left behind on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in the programming fields. In this work, we investigate machine learning methods to de-anonymize source code authors of C/C++ using coding style. Our Code Stylometry Feature Set is a novel representation of coding style found in source code that reflects coding style from properties derived from abstract syntax trees. Our random forest and abstract syntax tree-based approach attributes more authors (1,600 and 250) with significantly higher accuracy (94% and 98%) on a larger data set (Google Code Jam) than has been previously achieved. Furthermore, these novel features are robust, difficult to obfuscate, and can be used in other programming languages, such as Python. We also find that (i) the code resulting from difficult programming tasks is easier to attribute than easier tasks and (ii) skilled programmers (who can complete the more difficult tasks) are easier to attribute than less skilled programmers. | en_US
dc.format.extent | 255 - 270 | en_US
dc.language.iso | en_US | en_US
dc.relation.ispartof | Proceedings of the 24th USENIX Security Symposium | en_US
dc.rights | Final published version. Article is made available in OAR by the publisher's permission or policy. | en_US
dc.title | De-anonymizing programmers via code stylometry | en_US
dc.type | Journal Article | en_US
pu.type.symplectic | http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding | en_US
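
The abstract above describes representing coding style through properties derived from abstract syntax trees. As a rough illustration only, and not the authors' Code Stylometry Feature Set or pipeline, the sketch below computes two simple AST-derived features (node-type frequencies and maximum tree depth) with Python's standard `ast` module. The paper's experiments target C/C++; Python is used here only because the abstract notes the features carry over to it. The function name `ast_style_features` and both sample snippets are hypothetical.

```python
import ast
from collections import Counter


def ast_style_features(source: str) -> dict:
    """Return simple AST-derived style features for one Python source string.

    Illustrative assumption: node-type frequencies and tree depth stand in
    for the much richer feature set described in the abstract.
    """
    tree = ast.parse(source)

    # Relative frequency of each AST node type (a term-frequency vector).
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    features = {f"tf_{name}": n / total for name, n in counts.items()}

    # Maximum nesting depth of the tree, a rough proxy for structural style.
    def depth(node: ast.AST) -> int:
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    features["max_depth"] = depth(tree)
    return features


# Two hypothetical solutions to the same task, written in different styles.
loop_style = "total = 0\nfor x in range(10):\n    total += x\n"
builtin_style = "total = sum(range(10))\n"

f_loop = ast_style_features(loop_style)
f_builtin = ast_style_features(builtin_style)

# The explicit-loop version contains a For node; the builtin version does not.
print("tf_For" in f_loop, "tf_For" in f_builtin)
```

Feature dictionaries like these, one per source file and labeled by author, could then be vectorized and passed to a random forest classifier (for example, scikit-learn's `RandomForestClassifier`), which is the model family the abstract names; that step is omitted here to keep the sketch dependency-free.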

Files in This Item:
File | Description | Size | Format
DeanonymizingProgrammersCodeStylometry.pdf |  | 418.59 kB | Adobe PDF


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.