Data and Code

PrivaSeer Corpus and Language Model (ACL, 2021)

The PrivaSeer corpus is a collection of 3,967,487 privacy policies. It is described in the following paper:

Mukund Srinath, Shomir Wilson, and C. Lee Giles. Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies. In Proc. ACL 2021.

For technical questions about this data, please contact Mukund Srinath (mukund@psu.edu). For licensing questions, please contact Prof. Shomir Wilson (shomir@psu.edu).

For research, teaching, and scholarship purposes, the corpus is available under a CC BY-NC-SA license. Please contact us for any requests regarding commercial use.

Link to the corpus: Access the PrivaSeer Corpus on PSU Git

Link to the language model (PrivBERT): Access PrivBERT on Hugging Face

SoAC and SoACer (DocEng 2025)

Resources associated with the paper "SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification" by Shahriar Shayesteh, Mukund Srinath, Lucy Matheson, Lu Xian, Sumon Saha, C. Lee Giles, and Shomir Wilson, published at ACM DocEng 2025, include:

(a) Code: SoAC Code Repository (GitHub)
(b) Dataset: SoAC Dataset on Hugging Face

Layered, Overlapping, and Inconsistent (CCS 2025)

Resources associated with the paper "Layered, Overlapping, and Inconsistent: A Large-Scale Analysis of the Multiple Privacy Policies and Controls of U.S. Banks" by Lu Xian, Van Tran, Lauren Lee, Meera Kumar, Yichen Zhang, and Florian Schaub, published at ACM CCS 2025, include:

(a) Document collection: privacy policies of 2,073 U.S. banks, including GLBA, CCPA, and other types of policies;
(b) Annotated segments and codebook: policy segments annotated according to the provided codebook;
(c) Code: scripts for collecting and analyzing the privacy policies.

Link to artifacts: View CCS 2025 Artifacts on Zenodo