The PrivaSeer corpus is a collection of 3,967,487 privacy policies. It is described in the following paper:
Mukund Srinath, Shomir Wilson, and C. Lee Giles. Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies. In Proc. ACL 2021.
For technical questions about this data, please contact Mukund Srinath (mukund@psu.edu). For licensing questions, please contact Prof. Shomir Wilson (shomir@psu.edu).
For research, teaching, and scholarship purposes, the corpus is available under a CC BY-NC-SA license. Please contact us for any requests regarding commercial use.
Link to the corpus: https://git.psu.edu/hlt-lab/PrivaSeer-Corpus
Link to the privacy policy language model (PrivBERT): https://huggingface.co/mukund/privbert
Resources associated with the paper "SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification" by Shahriar Shayesteh, Mukund Srinath, Lucy Matheson, Lu Xian, Sumon Saha, C. Lee Giles, and Shomir Wilson, published at ACM DocEng 2025, include:
Resources associated with the paper "Layered, Overlapping, and Inconsistent: A Large-Scale Analysis of the Multiple Privacy Policies and Controls of U.S. Banks" by Lu Xian, Van Tran, Lauren Lee, Meera Kumar, Yichen Zhang, and Florian Schaub, published at ACM CCS 2025, include:
Link to artifacts: https://zenodo.org/records/17214521