IST 441 Course Project
This is a group project and counts for 35% of your grade. Groups
can be found here.
The grading is as follows.
* Search engine - 20%
* Report - 10%
* Presentations - 5% (both will be graded)
In this project, you will use the open source software Nutch/Lucene,
available online at the class server, to build a vertical or
specialty search engine for a customer. You will deliver the
finished search engine to that customer.
Vertical/Specialty Search Engine:
With this software you will construct a specialty search engine and
crawl the web for a dataset of at least 1,000 documents of
interest. Your specialty engine must be approved by me and by the
customer, and should be of interest to the customer. See me for
suggestions if necessary. (Other projects are possible but must be
approved by the instructor.)
You will then index the documents and provide a query engine to access
the index and generate a ranking based on an arbitrary query. A query
interface has been included in the Nutch package. Since the goal of the
project is for you to understand the fundamentals of search engines and
to create a novel search engine application, you may use the
Nutch/Lucene query interface. In any case, an interface must be
provided to permit others to query your database. Students should not
use already crawled collections but can use search engine selections.
Part of this exercise is to crawl for "your" collection.
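To make the index-then-rank step concrete, here is a minimal sketch of an inverted index with a simple TF-IDF score. It is an illustration only and is not how Nutch/Lucene scores documents; the class name `TinyIndex`, the tokenizer, and the weighting formula are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical toy index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        """Index one document under the given identifier."""
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def search(self, query, k=10):
        """Return up to k (doc_id, score) pairs, best score first."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term)
            if not docs:
                continue  # term appears in no indexed document
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += (1 + math.log(tf)) * idf
        # Sort by descending score, then by doc_id for stable ties.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, after adding a few documents with `add`, calling `search("jazz guitar")` returns the documents that contain either term, with documents matching more (and rarer) query terms ranked first.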
Google Custom Search:
On the same topic you will build a search engine using Google Custom
Search. In your Nutch query interface, you will provide a query box for
your Google Custom Search engine.
Crawling:
Please exercise good judgment in what you crawl and use an ethical
crawler that respects the robots exclusion protocol (robots.txt) and
does not overcrawl any web site.
DO NOT CRAWL SEARCH ENGINES WITHOUT REGISTERING!
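As an illustration of what respecting robots.txt and not overcrawling can look like, the sketch below uses Python's standard `urllib.robotparser` module. Nutch has its own politeness settings, so this is not its implementation; the user agent string, the one-second default delay, and the helper names are assumptions made for the example.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ist441-project-crawler"  # hypothetical crawler name


def build_parser(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, agent=USER_AGENT):
    """True if robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


def delay_for(parser, agent=USER_AGENT, default=1.0):
    """Seconds to wait between requests; falls back to an assumed default."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


class PoliteScheduler:
    """Remembers the last fetch time per host so requests can be spaced out."""

    def __init__(self):
        self._last_fetch = {}

    def wait_time(self, url, delay, now):
        """Seconds still to wait before fetching url at time now."""
        last = self._last_fetch.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, delay - (now - last))

    def record(self, url, now):
        """Note that url's host was fetched at time now."""
        self._last_fetch[urlparse(url).netloc] = now
```

A crawler loop would check `allowed(...)` before every request, sleep for `wait_time(...)` seconds, fetch the page, and then call `record(...)` with the current time.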
REQUIREMENTS:
To complete this project, you need to do the following:
* Submit a comprehensive final report of no more than 20 pages that
discusses the search engine tool and how it works. Explain the
motivation for this vertical engine. Discuss in detail the indexing,
the query process, and how the search engine calculates relevance.
Compare your Nutch specialty engine with the one built with Google
Custom Search in terms of relevance.
* Provide evidence of the crawled documents, the built index, and the
query engine: give the URLs crawled, and either submit the index on a
memory stick or CD or give the instructor access to a web directory of
the documents.
* Provide the web link to the query interface that
the instructor and grader can test and use.
* Deliver the search engine to the customer and have
the customer contact the instructor and TA.
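The requirements above ask for a relevance comparison between the Nutch engine and the Google Custom Search engine but do not prescribe a metric. One common option is precision at k, sketched below under the assumption that you have judged by hand which result URLs are relevant for each test query; the function name and inputs are hypothetical.

```python
def precision_at_k(ranked_urls, relevant_urls, k):
    """Fraction of the top k results that were judged relevant.

    ranked_urls: result URLs in the order the engine returned them.
    relevant_urls: set of URLs judged relevant for this query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_urls[:k]
    hits = sum(1 for url in top if url in relevant_urls)
    # Dividing by k (not len(top)) penalizes engines that return too few results.
    return hits / k
```

Running the same queries through both engines and averaging this value per engine gives one simple, reportable basis for the comparison.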
Presentations (all in PowerPoint):
* Give an introductory presentation on the topic to be covered.
Motivate the topic of interest: Why is this a good topic for which to
build a specialty search engine? How hard will it be to get the
documents? What is the competition? Who is your customer and why is
your customer interested?
(5 minute presentation)
* For the final presentation on the specialty search engine, discuss
what was discovered about the topic, what was crawled, and why. Give a
live demonstration of the search engine. Identify the customer; the
customer's address and email must be provided.
(25 minute presentation)
Both of your presentations must be professionally prepared in
PowerPoint and well organized. Hard copies must be handed in before
each presentation.
Report:
Two professional-quality hard copies of the report are due by May 2.
The built index, the report, and the PowerPoint file of the
presentation must be provided to the instructor and also to the search
engine customer.
The report must be a hard copy to get full credit.
Web Interface:
A web link to your search engine query interface must be up for the
entire week of April 28.
Customer:
The customer will be involved in this project and will identify what
the engine will do. He or she must notify the instructor and TA about
acceptance of the search engine project and report.