Computing That Serves

Classification and Learning with Networked Data


Monday, March 14, 2005 - 10:00am


Foster Provost, Associate Professor of Information Systems, New York University

Customer accounts are linked by communications and other transactions.  Organizations are linked by joint activities.  Text documents are hyperlinked.  Such networked data create opportunities for learning and applying classification models.  For example, a common and successful strategy for detecting fraud is to use transactions to link a questionable account to previous fraudulent activity.  Document classification can be improved by considering hyperlink structure.  Marketing can change dramatically when customer communication is taken into account.  Classification with networked data has two special characteristics: (1) knowing the classifications of some entities in the network can improve the classification of others, and (2) very-high-cardinality categorical attributes (e.g., identifiers) can be used effectively in learned models.  I will present NetKit, a toolkit to facilitate research on classification and learning with networked data.  NetKit is based on a modular framework that allows components to be mixed and matched to form different network classification algorithms.  I will demonstrate NetKit with a case study of univariate classification using networked data from several domains.
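
To make characteristic (1) concrete, here is a minimal sketch of one simple way known labels can inform unknown ones: each unlabeled node's score is repeatedly set to the weighted average of its neighbors' scores.  This is an illustration only and is not NetKit's API; the function name neighbor_vote, the edge-list format, and the toy graph are all assumptions made for this example.

```python
from collections import defaultdict


def neighbor_vote(edges, known, iterations=50):
    """Estimate P(label = 1) for unlabeled nodes from their neighbors.

    edges: iterable of (node, node, weight) with positive weights (hypothetical format).
    known: dict mapping some nodes to a label of 0 or 1.
    """
    # Build an undirected, weighted adjacency map.
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    # Known nodes stay fixed; unknown nodes start at the class prior.
    prior = sum(known.values()) / len(known)
    score = {n: float(known.get(n, prior)) for n in adj}

    for _ in range(iterations):
        updated = dict(score)
        for n in adj:
            if n in known:
                continue
            total = sum(adj[n].values())
            # Weighted average of the neighbors' current scores.
            updated[n] = sum(w * score[m] for m, w in adj[n].items()) / total
        score = updated
    return score


# Toy example: accounts a and b are known fraud (1), c is known legitimate (0).
edges = [("a", "x", 1.0), ("b", "x", 1.0), ("x", "y", 1.0), ("y", "c", 1.0)]
scores = neighbor_vote(edges, {"a": 1, "b": 1, "c": 0})
print(scores["x"], scores["y"])
```

In this toy graph, x is linked to two known-fraud accounts and ends up with a higher score than y, which is linked to the known-legitimate account.  Note that the sketch uses only the link structure and the known labels, with no other attributes of the nodes.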


Foster Provost is Associate Professor of Information Systems and NEC Faculty Fellow at New York University's Stern School of Business.  He is Editor-in-Chief of the journal Machine Learning and a founding board member of the International Machine Learning Society.  Professor Provost's recent research focuses on mining networked data, economic machine learning, and applications of machine learning and data mining.  Previously, at NYNEX/Bell Atlantic Science and Technology, he studied a variety of applications of machine learning to telecommunications problems, including fraud detection, network diagnosis and monitoring, and customer contact management.