
Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

Author(s): Amos, Ryan; Acar, Gunes; Lucherini, Elena; Kshirsagar, Mihir; Narayanan, Arvind; Mayer, Jonathan

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1w562
Full metadata record
DC Field | Value | Language
dc.contributor.author | Amos, Ryan | -
dc.contributor.author | Acar, Gunes | -
dc.contributor.author | Lucherini, Elena | -
dc.contributor.author | Kshirsagar, Mihir | -
dc.contributor.author | Narayanan, Arvind | -
dc.contributor.author | Mayer, Jonathan | -
dc.date.accessioned | 2021-10-08T19:51:21Z | -
dc.date.available | 2021-10-08T19:51:21Z | -
dc.date.issued | 2021 | en_US
dc.identifier.citation | Amos, Ryan, Gunes Acar, Elena Lucherini, Mihir Kshirsagar, Arvind Narayanan, and Jonathan Mayer. "Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset." In Proceedings of the Web Conference (2021): pp. 2165-2176. doi:10.1145/3442381.3450048 | en_US
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/pr1w562 | -
dc.description.abstract | Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive’s Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. By comparing the occurrence of tracking-related terminology in our dataset to prior web privacy measurements, we find that privacy policies have consistently failed to disclose the presence of common tracking technologies and third parties. We also find that over the last twenty years privacy policies have become even more difficult to read, doubling in length and increasing a full grade in the median reading level. Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations. Finally, we contribute to the literature on privacy regulation by demonstrating the historic impact of the GDPR on privacy policies. | en_US
dc.format.extent | 2165 - 2176 | en_US
dc.language.iso | en_US | en_US
dc.relation.ispartof | Proceedings of the Web Conference | en_US
dc.rights | Final published version. This is an open access article. | en_US
dc.title | Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset | en_US
dc.type | Conference Article | en_US
dc.identifier.doi | 10.1145/3442381.3450048 | -
pu.type.symplectic | http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding | en_US

Files in This Item:
File | Description | Size | Format
PrivacyPolicies.pdf | - | 850.12 kB | Adobe PDF


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.