IST 441 Course Project

 

This project counts for 35% of your grade and is a group project. Groups can be found here. The grading is as follows.

    * Search engine - 20%
    * Report - 10%
    * Presentations - 5% (both will be graded)

In this project, you will use the open-source software Nutch/Lucene, available on the class server, to build a vertical or specialty search engine for a customer. You will deliver the finished search engine to that customer.

Suggested customer list.

Vertical/specialty Search Engine:

With this software you will construct a specialty search engine and crawl the web to collect a dataset of at least 1,000 documents of interest. Your specialty engine must be approved by both me and the customer, and should be of interest to the customer. See me for suggestions if necessary. (Other projects are possible but must be approved by the instructor.)

You will then index the documents and provide a query engine to access the index and generate a ranking based on an arbitrary query. A query interface has been included in the Nutch package. Since the goal of the project is for you to understand the fundamentals of search engines and to create a novel search engine application, you may use the Nutch/Lucene query interface. In any case, an interface must be provided to permit others to query your database. Students should not use already crawled collections but can use search engine selections. Part of this exercise is to crawl for "your" collection.
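
To make the index-then-rank step concrete, here is a minimal sketch of an inverted index with a simple TF-IDF score. It is an illustration only and is not how Nutch/Lucene computes relevance; the function names (`tokenize`, `build_index`, `search`) and the particular weighting formula are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index from a dict of doc_id -> text.

    Returns (postings, n_docs), where postings maps each term to
    a dict of {doc_id: term frequency in that document}.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    return postings, len(docs)


def search(query, postings, n_docs):
    """Return (doc_id, score) pairs sorted from most to least relevant.

    Assumed scoring: sum over query terms of (1 + ln tf) * ln(1 + N / df).
    Documents that match no query term are not returned.
    """
    scores = defaultdict(float)
    for term in tokenize(query):
        matches = postings.get(term, {})
        if not matches:
            continue
        idf = math.log(1 + n_docs / len(matches))
        for doc_id, tf in matches.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Sort by descending score, breaking ties by doc_id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    docs = {
        "a": "solar panel prices and solar installers",
        "b": "wind turbine prices",
        "c": "gardening tips",
    }
    postings, n_docs = build_index(docs)
    for doc_id, score in search("solar prices", postings, n_docs):
        print(doc_id, round(score, 3))
```

In the sketch, a document that matches more query terms, or matches a rarer term, ranks higher. Your report should describe the scoring your own engine actually uses rather than this simplified formula.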

Google Custom Search:

On the same topic, you will also build a search engine using Google Custom Search. In your Nutch query interface, you will provide a query box for your Google Custom Search engine.

Crawling:

Please exercise good judgment in what you crawl. Use an ethical crawler that respects the robots exclusion protocol (robots.txt) and does not overload a web site with requests.

DO NOT CRAWL SEARCH ENGINES WITHOUT REGISTERING!
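
As an illustration of what respecting robots.txt and not over-crawling can look like, here is a minimal sketch built on Python's standard `urllib.robotparser`. It is not Nutch's configuration or implementation. The user-agent string, the 5-second default delay, and the `PoliteGate` class are hypothetical names and values chosen for this example.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ist441-project-crawler"  # hypothetical crawler name
DEFAULT_DELAY = 5.0  # assumed seconds between requests to the same host


def parse_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteGate:
    """Tracks per-host fetch times so one site is not hit too often."""

    def __init__(self, default_delay=DEFAULT_DELAY, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock
        self.last_fetch = {}  # host -> time of the last recorded fetch

    def wait_needed(self, url, parser):
        """Return seconds to wait before fetching url, or None if disallowed."""
        if not parser.can_fetch(USER_AGENT, url):
            return None
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        # Prefer the site's own Crawl-delay when it declares one.
        delay = parser.crawl_delay(USER_AGENT) or self.default_delay
        return max(0.0, delay - (self.clock() - last))

    def record_fetch(self, url):
        """Remember that url's host was just fetched."""
        self.last_fetch[urlparse(url).netloc] = self.clock()


if __name__ == "__main__":
    rules = parse_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
    gate = PoliteGate()
    print(gate.wait_needed("http://example.org/private/x", rules))
    print(gate.wait_needed("http://example.org/docs/a.html", rules))
```

A crawler loop would call `wait_needed` before each request, skip the URL when the result is `None`, sleep for the returned number of seconds otherwise, and call `record_fetch` after the request completes.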


REQUIREMENTS:


To complete this project, you need to do the following:

    * Submit a comprehensive final report of no more than 20 pages that discusses the search engine tool and how it works. Explain the motivation for this vertical engine. Discuss in detail the indexing, the query process, and how the search engine calculates relevance. Compare your Nutch specialty engine with the one built with Google Custom Search in terms of relevance.

    * Provide evidence of the crawled documents, the built index, and the query engine: list the URLs crawled, and either submit the index on a memory stick or CD or give the instructor access to a web directory of the documents.
    * Provide the web link to the query interface that the instructor and grader can test and use.
    * Deliver the search engine to the customer and have the customer contact the instructor and TA.
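
For the relevance comparison between the Nutch engine and the Google Custom Search engine requested in the report, one possible measure is precision at k: the fraction of the top k results that are relevant. The assignment does not prescribe a metric, so the sketch below is only one option. The function names and the idea of hand-labeling a set of relevant URLs per query are assumptions.

```python
def precision_at_k(ranked_urls, relevant_urls, k):
    """Fraction of the top k ranked results that are judged relevant.

    ranked_urls: result URLs in the order the engine returned them.
    relevant_urls: set of URLs a human judged relevant for the query.
    If the engine returns fewer than k results, the missing slots
    count as non-relevant (the denominator stays k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for url in ranked_urls[:k] if url in relevant_urls)
    return hits / k


def mean_precision_at_k(runs, judgments, k):
    """Average precision at k over several queries.

    runs: dict of query -> ranked list of URLs from one engine.
    judgments: dict of query -> set of relevant URLs.
    """
    if not runs:
        raise ValueError("at least one query is required")
    total = sum(
        precision_at_k(ranked, judgments.get(query, set()), k)
        for query, ranked in runs.items()
    )
    return total / len(runs)


if __name__ == "__main__":
    judgments = {"q1": {"u1", "u3"}}
    engine_a = {"q1": ["u1", "u2", "u3", "u4"]}  # made-up example ranking
    engine_b = {"q1": ["u2", "u4", "u1", "u5"]}  # made-up example ranking
    print(mean_precision_at_k(engine_a, judgments, 4))
    print(mean_precision_at_k(engine_b, judgments, 4))
```

Running the same set of queries through both engines and comparing the averages gives one number per engine that can be reported alongside qualitative observations.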


Presentations (all in PowerPoint):

    * Give an introductory presentation on the topic to be covered. The topic of interest should be motivated. Why is this a good topic for which to build a specialty search engine? How hard will it be to get the documents? What is the competition? Who is your customer and why is your customer interested?
(5 minute presentation)
    * For the final presentation on the specialty search engine, discuss what was discovered about the topic, what was crawled, and why. Perform a live demonstration of the search engine. Identify the customer and provide the customer's address and email.
(25 minute presentation)

Both of your presentations must be professionally prepared in PowerPoint and well organized. Hard copies must be handed in before each presentation.


Report:

Two professional-quality hard copies of the report are due by May 2. The built index, the report, and the PowerPoint of the presentation must be provided to the instructor and also to the search engine customer.

The report must be a hard copy to get full credit.

Web Interface:

A web link to your search engine query interface must be up for the entire week of April 28.

Customer:

The customer will be involved in this project and will identify what the engine will do. He or she must notify the instructor and TA about acceptance of the search engine project and report.

 

*************************************************