Computing That Serves

Accounting for Burstiness of Words in Text Mining


Thursday, October 1, 2009 - 11:00am


Charles Elkan
Professor of Computer Science
University of California, San Diego

A fundamental property of language is that if a word is used once in a document, it is likely to be used again.  To be accurate, statistical models of documents applied in text mining must take this property, known as burstiness, into account.  In this talk, I will describe how to model burstiness using a probability distribution called the Dirichlet compound multinomial (DCM).  In particular, I will present a new topic model based on DCM distributions.  The central advantage of topic models is that they allow documents to concern multiple themes, unlike standard clustering methods that assume each document concerns a single theme.  On both text and non-text datasets, the new topic model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA).
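
To make the burstiness idea concrete, here is a minimal Python sketch of the standard DCM log-probability of a word-count vector, compared with a plain multinomial. It is an illustration only, not the topic model presented in the talk; the function names, the four-word vocabulary, and the parameter values are assumptions chosen for the example.

```python
from math import lgamma, log

def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a Dirichlet compound multinomial.

    counts: non-negative integer word counts for one document.
    alpha:  positive DCM parameters, one per vocabulary word.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(n + A), where A is the sum of the alphas
    log_norm = lgamma(a) - lgamma(n + a)
    # prod over words of Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms

def multinomial_log_prob(counts, probs):
    """Log-probability of a count vector under an ordinary multinomial."""
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    return log_coef + sum(x * log(p) for x, p in zip(counts, probs) if x > 0)

# Hypothetical four-word vocabulary; small alphas make the DCM bursty.
alpha = [0.1, 0.1, 0.1, 0.1]
probs = [0.25, 0.25, 0.25, 0.25]
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each

print("DCM:        ", dcm_log_prob(bursty, alpha), dcm_log_prob(spread, alpha))
print("Multinomial:", multinomial_log_prob(bursty, probs), multinomial_log_prob(spread, probs))
```

With these assumed parameters, the DCM assigns higher probability to the document that repeats one word than to the document that uses four different words, while the uniform multinomial prefers the reverse. That reversal is the burstiness effect described above.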


Charles Elkan is a professor in the Department of Computer Science and Engineering at the University of California, San Diego.  In 2005/06 he was on sabbatical at MIT, and in 1998/99 he was a visiting associate professor at Harvard.  Dr. Elkan is known for his research in machine learning, data mining, and computational biology.  The MEME algorithm he developed with his Ph.D. student Tim Bailey has been used in over 1000 publications in biology.  Dr. Elkan has won several best paper awards and data mining contests, and his Ph.D. students have held tenure-track or equivalent positions at Columbia University, the University of Washington, the University of Queensland, other universities, and IBM Research.