SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia


Thursday, January 14, 2010 - 11:00am


C. Lee Giles

David Reese Professor at the College of Information Sciences and Technology
Director of the Intelligent Systems Research Laboratory
Pennsylvania State University

Cyberinfrastructure, or e-science, has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design and implementation of supporting cyberinfrastructure. However, no open source system exists for building an integrated search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, and table indexing. We propose the open source SeerSuite architecture, a modular, extensible system built on successful open source projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples from computer science (CiteSeerX), chemistry (ChemXSeer), and archaeology (ArchSeer).

CiteSeerX, the successor to CiteSeer, currently offers or intends to offer some unique aspects of search not yet present in other scientific search services or engines, such as table, figure, algorithm and author search. In addition, CiteSeerX continuously crawls the web and author submissions and now has nearly 1.5 million documents, close to 30 million citations, a million authors and comparable database tables. It has nearly 1 million unique users with several million hits a day.

In chemistry, the growth of data has been explosive, and timely and effective access to information and data is critical. The ChemXSeer system (funded by NSF Chemistry) is a portal and search engine for academic researchers in environmental chemistry that integrates the scientific literature with experimental, analytical, and simulation datasets. ChemXSeer consists of information crawled from the web, manually submitted scientific documents, and user-submitted datasets, as well as scientific documents and metadata provided by major publishers. Information gathered from the web is publicly accessible, whereas access to restricted resources such as user-submitted data will be determined by the users who submitted them. Thus, instead of being a fully open search engine and repository, ChemXSeer will be a hybrid one, limiting access to some resources.

Because such enterprise systems require unique information extraction approaches, several machine learning methods, such as conditional random fields, support vector machines, mutual-information-based feature selection, and sequence mining, are critical for performance. We draw lessons for other e-science and cyberinfrastructure systems in terms of design, implementation, and research, and discuss future directions and systems.





Dr. C. Lee Giles is the David Reese Professor at the College of Information Sciences and Technology at the Pennsylvania State University, University Park, PA. He is also Graduate College Professor of Computer Science and Engineering, Courtesy Professor of Supply Chain and Information Systems, and Director of the Intelligent Systems Research Laboratory. He directs the Next Generation CiteSeer (CiteSeerX) project and co-directs the ChemXSeer project at Penn State. He has been associated with Columbia University, the University of Maryland, the University of Pennsylvania, Princeton University, and the University of Trento. He and his collaborators, including current and former graduate students, have published over 300 journal and conference papers, book chapters, edited books, and proceedings. He has been involved in the creation and development of various novel search engines and digital libraries. He was one of the creators of CiteSeer, the popular search engine for computer and information science, an autonomous citation indexing search engine and digital library now hosted at the College of Information Sciences and Technology at Penn State University. It has recently been replaced by the Next Generation CiteSeer, CiteSeerX. He is a Fellow of the ACM, a Fellow of the IEEE, and a Fellow of the International Neural Network Society, and a member of AAAI and AAAS. He has twice received the IBM Distinguished Faculty Award. He is also a member of Sigma Xi, Tau Beta Pi, and Eta Kappa Nu. Much more info. here: