Computing That Serves

Privacy and Anonymity in Text


Thursday, November 12, 2009 - 10:00am


Chris Clifton
Associate Professor of Computer Science
Purdue University

Personal data, in the aggregate, has substantial value. For example, companies sell de-identified healthcare information for research use, and Netflix released movie rental records to encourage advances in recommender system research. Unfortunately, anonymity is hard, particularly with textual data. In July 2006, AOL released an anonymized search query log of some 600K randomly selected users. While valuable as a research tool, the anonymization was insufficient: individuals were identified from the contents of the queries alone. More detailed text (e.g., in medical records) poses even greater challenges.

We have been exploring this space, and this talk will discuss two approaches to the problem. The first is directed towards search: we propose a client-centered approach called Plausibly Deniable Search. Each user query is replaced with a standard, closely related query intended to fetch the desired results. In addition, a set of k-1 cover queries is issued; these have characteristics similar to the standard query but are on unrelated topics. The system ensures that starting from any one of these k queries would produce the same set of k queries, so an observer sees k equally plausible topics the user could have been searching for. We use a Latent Semantic Indexing (LSI) based approach to generate queries, and evaluate on the DMOZ webpage collection to show the effectiveness of the proposed approach.
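The consistency property described above (any of the k queries yields the same set of k queries) can be illustrated with a small Python sketch. This is not the speakers' implementation: the canonical query list, the function names, and the fixed partition into groups are all hypothetical, and a simple word-overlap (Jaccard) score stands in for the LSI-based query generation mentioned in the abstract.

```python
from typing import Dict, FrozenSet, List

# Hypothetical canonical ("standard") queries. They are hand-ordered so that
# each consecutive block of k queries covers unrelated topics.
CANONICAL_QUERIES: List[str] = [
    "symptoms of seasonal flu",
    "prices of used cars",
    "rules of chess openings",
    "history of roman empire",
    "recipes for vegetable soup",
    "reviews of hiking boots",
]


def build_groups(queries: List[str], k: int) -> Dict[str, FrozenSet[str]]:
    """Partition the canonical queries into fixed, disjoint groups of size k.

    Because every member of a group maps to the same group, starting from
    any of the k queries reproduces the same set of k queries.
    """
    if k < 1 or len(queries) % k != 0:
        raise ValueError("number of canonical queries must be a multiple of k")
    groups: Dict[str, FrozenSet[str]] = {}
    for start in range(0, len(queries), k):
        group = frozenset(queries[start:start + k])
        for query in group:
            groups[query] = group
    return groups


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; an assumed stand-in for LSI similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def canonicalize(user_query: str, queries: List[str]) -> str:
    """Replace the user's query with the most similar canonical query."""
    return max(queries, key=lambda q: jaccard(user_query, q))


def deniable_query_set(user_query: str,
                       groups: Dict[str, FrozenSet[str]]) -> FrozenSet[str]:
    """Return the standard query plus its k-1 cover queries."""
    canonical = canonicalize(user_query, list(groups))
    return groups[canonical]


if __name__ == "__main__":
    groups = build_groups(CANONICAL_QUERIES, k=3)
    for query in sorted(deniable_query_set("flu symptoms in adults", groups)):
        print(query)
```

In this sketch the client would submit all queries in the returned set to the search engine and keep only the results for the standard query. The fixed partition is the simplest way to obtain the consistency property; how the actual system selects cover queries with similar characteristics is not specified here.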

The talk will also discuss preliminary work in text generalization. While de-identification techniques exist, they leave behind substantial information that does not meet the traditional definition of an identifier but is identifying nonetheless. We propose a statistical generalization approach that reduces both the identifiability and the sensitivity of the data while retaining its meaning.

This is joint work with Mummoorthy Murugesan.


Dr. Clifton is an associate professor at Purdue University. He works on challenges posed by novel uses of data mining technology, including privacy-preserving data mining, data mining of text, and data mining techniques applied to interoperation of heterogeneous information sources. Fundamental data mining challenges posed by these applications include extracting knowledge from noisy data, identifying knowledge in highly skewed data (few examples of "interesting" behavior), and limits on learning. He also works on database support for widely distributed and autonomously controlled information, particularly information administration issues such as supporting fine-grained access control. Prior to joining Purdue, Dr. Clifton was a principal scientist in the Information Technology Division at the MITRE Corporation. Before joining MITRE in 1995, he was an assistant professor of computer science at Northwestern University. He has a Ph.D. from Princeton University (1991) and M.S. and B.S. degrees from the Massachusetts Institute of Technology (1986).