The PrivaSeer corpus is a collection of 3,967,487 privacy policies described in the following paper:
Mukund Srinath, Shomir Wilson, and C. Lee Giles. Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies. In Proc. ACL 2021.
For technical questions about this data, please contact Mukund Srinath (mukund@psu.edu). For licensing questions, please contact Prof. Shomir Wilson (shomir@psu.edu).
For research, teaching, and scholarship purposes, the corpus is available under a CC BY-NC-SA license. Please contact us for any requests regarding commercial use.
Link to the corpus: Access the PrivaSeer Corpus on PSU Git
Link to the language model (PrivBERT): Access PrivBERT on Hugging Face
Resources associated with the paper "SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification" by Shahriar Shayesteh, Mukund Srinath, Lucy Matheson, Lu Xian, Sumon Saha, C. Lee Giles, and Shomir Wilson, published at ACM DocEng 2025, include:
Resources associated with the paper "Layered, Overlapping, and Inconsistent: A Large-Scale Analysis of the Multiple Privacy Policies and Controls of U.S. Banks" by Lu Xian, Van Tran, Lauren Lee, Meera Kumar, Yichen Zhang, and Florian Schaub, published at ACM CCS 2025, include:
Link to artifacts: View CCS 2025 Artifacts on Zenodo