Computing That Serves

Searching for Statistical Diagrams


Thursday, November 17, 2011 - 10:00am


Michael Cafarella

Assistant Professor of Computer Science and Engineering at the University of Michigan


David Embley

Data-driven diagrams (bar charts, scatterplots, timelines,
and so on) are a valuable source of information online.
They are often the only access readers have to the data that
drives many technical documents, and are created at great expense.
However, such diagrams are a strange mixture of graphics and text,
and neither Web document search engines nor traditional
image search engines are effective at finding them.

This talk will describe our work in building a search engine that is
specially tailored for statistical diagrams.  Our contributions include
an extraction system that recovers semantic roles for different
parts of diagram text, a ranking system for diagrams, and a
“visual snippet” generator that summarizes complex diagrams for presentation
in a search engine response page. Each component outperforms conventional
approaches; the result is a keyword-driven search engine that makes a new class
of data effectively searchable.  Finally, the talk will include recent work in
querying the spreadsheets that serve as the source of many statistical diagrams.


Michael Cafarella is an assistant professor of Computer Science and
Engineering at the University of Michigan.  His research interests
include databases, information extraction, and data mining.  He is
particularly interested in applying data mining techniques to problems
of Web and scientific data management.  In addition to his academic
work, he is the cofounder of several open-source projects, including
the Hadoop distributed software suite.