Next Generation CiteSeer
CiteSeer is an automated digital library of scholarly literature in the Computer and Information Science disciplines. It is a free public resource that provides access to the full text of nearly 700,000 academic science papers and over 10 million citations. CiteSeer is widely used by computer and information scientists and others, and is often cited as a search service that has greatly improved communication and progress in computer science research. Many researchers find the system invaluable and expect 24/7 availability. CiteSeer was created by Kurt Bollacker, Lee Giles and Steve Lawrence in 1997-98 at NEC Research Institute (now NEC Labs), Princeton, NJ. All other versions of CiteSeer, such as CiteSeer.IST at Penn State, operate under a limited open-source license provided by NEC that permits unlimited noncommercial use.
CiteSeer consists of three basic components: a focused crawler or harvester, the document archive and specialized index, and the query interface [Lawrence99a]. The focused crawler searches the web for relevant documents in PDF and PostScript formats. The crawled documents are filtered to keep only academic documents, which are then indexed using autonomous citation indexing [Giles98]. This technique automatically links references in research articles to facilitate navigation and evaluation. Automatic extraction of the context of citations allows researchers to determine the contributions of a given research article quickly and easily [Garfield64], and several advanced methods are employed to locate related research based on citations, text, and usage information. CiteSeer is a full-text search engine whose interface permits search by document or by number of citations, as well as fielded searching, none of which is currently possible on general-purpose web search engines.
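The core idea of citation linking described above can be illustrated with a minimal sketch. This is not CiteSeer's actual algorithm: it assumes that differently formatted references to the same work can be matched by a crude text normalization, and the names `citation_key`, `build_citation_index`, and the sample paper identifiers and reference strings are all hypothetical.

```python
import re
from collections import defaultdict


def citation_key(citation: str) -> str:
    """Crude normalization (an assumption, not CiteSeer's method):
    lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", citation.lower())
    return " ".join(text.split())


def build_citation_index(papers: dict) -> dict:
    """Map each normalized citation key to the set of papers that cite it."""
    index = defaultdict(set)
    for paper_id, references in papers.items():
        for ref in references:
            index[citation_key(ref)].add(paper_id)
    return dict(index)


# Hypothetical input: two papers citing the same work in different formats.
papers = {
    "paper_a": ["Smith, J. An Example Title. 1998."],
    "paper_b": ["smith j an example title 1998"],
}
index = build_citation_index(papers)
```

Here both reference strings normalize to the same key, so the index groups `paper_a` and `paper_b` under one cited work. A real system would need far more robust matching (author name variants, abbreviated venues, typos) than this exact-match normalization.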
While developed for search on the Web, CiteSeer can be adapted for use with an existing document database. Using a local database instead of the Web would eliminate the need for the crawler, while the archive and indexing functions would still provide the document connections, citation linking, and document navigation and evaluation features. As a result of past NSF NSDL support, CiteSeer is now Open Archives compliant, providing metadata for the search engines of other digital libraries.
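As an illustration of what Open Archives compliance enables, the sketch below builds a metadata harvesting request in the style of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The `ListRecords` verb and the `oai_dc` metadata prefix come from that protocol; the base URL `https://example.org/oai` and the function name `oai_list_records_url` are placeholders, since the actual CiteSeer endpoint is not given here.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a placeholder for a repository's OAI endpoint;
    from_date optionally restricts harvesting to records changed since then.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the request.
url = oai_list_records_url("https://example.org/oai", from_date="2005-01-01")
```

A harvester belonging to another digital library would fetch such a URL and parse the XML response to obtain the records' metadata.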
The goal of the Next Generation CiteSeer project is to add new and expand existing CiteSeer-based services for the Computer and Information Science community. As the CiteSeer collection increases in breadth and depth, we need techniques to exploit this valuable resource and to move it from a static collection of information to a dynamic, collaborative research assistant.
Our newest service is automatic acknowledgement indexing, which provides an index of all acknowledgements found in CiteSeer.
The Next Generation CiteSeer project builds on the previous work on CiteSeer, expanding the service by increasing the breadth of the collection and improving the site's usability and services. In particular, this project has the following goals:
- To redesign the CiteSeer architecture for increased utility, reliability, and range of services, making it completely modular and open source.
- To expand the index to cover authors, affiliations, acknowledgements, and other fields.
- To expand the breadth and depth of CiteSeer’s collection.
- To have CiteSeer serve as a web service for research use.
- To facilitate personalized CiteSeer search through the use of individual search histories combined with exploiting patterns of citations and searches within the community of users.
- To support collaborative CiteSeer usage and thereby to promote the formation and activity of research communities.
- To evaluate the impact of the new architecture, new content, and new services on the user community.
- To increase the reliability and sustainability of CiteSeer as a community resource.
For more information, please contact Lee Giles, Next Generation CiteSeer project director.