IST 441 Course Project

Introduction:

This project counts for 35% of your grade and is a team project. Teams can be found here. Under certain conditions, projects can be individual activities. If you have any issues, please see me right away.

The project's contribution to your final grade breaks down as follows.

    * Search engine - 20%
    * Report - 10%
    * Presentations - 5% (both will be graded)

In this project, you will use open source software (Nutch/Lucene, Solr, or Lucid Works, available online at the class server) to build a vertical or specialty search engine for a customer. You will also build a Google Custom Search engine on the same topic. Depending on the project, some students will be allowed to use other open source search engine tools. You will deliver the finished search engine to that customer. Your customer must acknowledge the successful transfer of the project. A suggested customer list is below.

Suggested customer list.


Vertical/specialty Search Engine:


With this software you will construct a specialty search engine and crawl the web to collect a dataset of at least 1,000 documents of interest. Your specialty engine needs to be approved by me and the customer and should be of interest to the customer. See me for suggestions if necessary. (Other projects are possible but must be approved by the instructor.)

You will then index the documents and provide a query engine that accesses the index and returns a ranked list of results for an arbitrary query. A query interface is included in the Nutch and YouSeer package. Since the goal of the project is for you to understand the fundamentals of search engines and to create a novel search engine application, you may use the Nutch/Lucene query interface. In any case, you must provide an interface that lets others query your data and index. If possible, students should not use already crawled collections, but they can use search engine selections. Part of this exercise is to crawl for "your" special collection.
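To make the index-then-rank step above concrete, here is a minimal sketch of an inverted index with a simplified TF-IDF score. It is an illustration only, not the scoring that Nutch/Lucene or Solr actually applies. The function names (`tokenize`, `build_index`, `search`) and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Return doc ids ranked by summed TF-IDF score, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Simplified IDF (an assumption of this sketch): rarer terms weigh more.
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score; break ties by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document collection, for illustration only.
docs = {
    "d1": "solar panel installation guide",
    "d2": "solar energy news and solar policy",
    "d3": "wind turbine maintenance",
}
index = build_index(docs)
ranking = search(index, len(docs), "solar policy")
```

In this toy collection, `d2` ranks above `d1` for the query "solar policy" because it contains both query terms and mentions "solar" twice. Your final report should describe how your chosen tool computes relevance, which will be more elaborate than this.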

Google Custom Search:

On the same topic you will build a search engine using Google Custom Search.

Crawling:

Please exercise good judgment in what you crawl, and use an ethical crawler that respects the robots exclusion protocol (robots.txt) and does not overcrawl a web site.
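As a rough illustration of what respecting robots.txt means, the sketch below uses Python's standard `urllib.robotparser` module to decide whether a URL may be fetched and how long to wait between requests. The user agent name, the default delay, and the sample robots.txt content are assumptions made up for this example. A real crawler would download each site's own robots.txt instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ist441-student-crawler"  # hypothetical crawler name
DEFAULT_DELAY = 1.0  # assumed fallback (seconds) when no Crawl-delay is given

# Hypothetical robots.txt content, parsed offline for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url):
    """True if the rules allow our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def delay_for(parser):
    """Seconds to wait between requests to the same host."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY


parser = make_parser(SAMPLE_ROBOTS_TXT)
allowed = may_fetch(parser, "https://example.com/docs/page.html")
blocked = may_fetch(parser, "https://example.com/private/data.html")
delay = delay_for(parser)
```

In an actual crawl loop you would skip any URL for which `may_fetch` returns `False` and call `time.sleep(delay)` between requests to the same host. Crawlers such as Nutch and Heritrix come with their own politeness handling, so this sketch only shows the idea behind it.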

DO NOT CRAWL:

Crawling with Heritrix is explained here.
Here's a document on how to crawl with Nutch.


FINAL REQUIREMENTS:


To complete this project, you need to do the following:


Final Report:


    * Submit a comprehensive final report of about 20 pages that discusses the search engine tool and how it works. The motivation for this vertical engine should be explained. The indexing, the query process, and how the search engine calculates relevance should be discussed in detail.  In the document must be a link to your

    * You must compare your specialty engine with the one built using Google Custom Search in terms of relevance.

    * Provide evidence of the crawled documents, the built index, and the query engine: list the URLs crawled, and either submit the index on a memory stick or CD or give the instructor access to a web directory of the documents. This is not necessary for those who build their engine on the IST 441 server.

    * Provide the web link to the query interface that the instructor and grader can test and use.

    * Deliver the search engine to the customer and have the customer contact the instructor and TA.
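For the relevance comparison between your specialty engine and the Google Custom Search engine, one common measure is precision at k: the fraction of the top k results for a query that you judge relevant. The assignment does not prescribe a metric, so treat this as one possible approach. The function name, the result lists, and the relevance judgments below are hypothetical.

```python
def precision_at_k(ranked_urls, relevant_urls, k):
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_urls[:k]
    hits = sum(1 for url in top_k if url in relevant_urls)
    return hits / k


# Hypothetical relevance judgments and result lists for a single query.
relevant = {"a", "b", "c"}
specialty_results = ["a", "x", "b", "c", "y"]
google_cse_results = ["x", "a", "y", "z", "b"]

p_specialty = precision_at_k(specialty_results, relevant, 5)  # 3 of 5 relevant
p_google = precision_at_k(google_cse_results, relevant, 5)    # 2 of 5 relevant
```

Averaging this value over a set of test queries, with the same relevance judgments applied to both engines, gives a simple side-by-side comparison for the report.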


Presentations (all in PowerPoint):

You will give two presentations on your search project; both are graded. The initial one is short, 10 minutes or less; the final one is 30 minutes. The first is mid-semester and the second is at the end of the semester.

    * Introductory presentation on your specialty search engine topic. (10 minute presentation)
    * Final presentation discussing all aspects of the specialty search engine. (20 minute presentation)

Both of your presentations must be professionally prepared in PowerPoint and well organized. Hard copies must be handed in before each presentation.


Report and Presentation:

Two professional-quality hard copies of the report are due by May 1.

The URL of the built search engine interface (with the index, if the search engine is not built on the IST 441 server) must also be provided.

The report and the PowerPoint slides of the presentation must be provided to the instructor and also to the search engine customer.

The report must be in hard copy to get full credit.


Web Interface:

A web link to your search engine query interface must be up for the entire last week of the semester.


Customer:


The customer involved in your project will help determine the topic the search engine covers.

He or she must notify the instructor and TA about the final acceptance of the search engine project and report.

 
