Data and Code

PrivaSeer Corpus and Language Model (ACL, 2021)

The PrivaSeer corpus is a collection of 1,005,380 privacy policies, described in the following paper:

Mukund Srinath, Shomir Wilson and C. Lee Giles. Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies. In Proc. ACL 2021.

For technical questions about this data, please contact Mukund Srinath. For licensing questions, please contact Prof. Shomir Wilson.

For research, teaching, and scholarship purposes, the corpus is available under a CC BY-NC-SA license. Please contact us for any requests regarding commercial use.

Link to the corpus:

Link to the privacy policy language model (PrivBERT):