IST 441 Course Project
This project counts for 35% of your grade and is a team project. Teams
can be found here.
Under certain conditions, projects can be individual activities. For
any issues, please see me right away.
The project's contribution to your final grade breaks down as follows.
* Search engine - 20%
* Report - 10%
* Presentations - 5% (both will be graded)
In this project, you will use open source software (Nutch/Lucene, Solr, or Lucid Works), available online on
the class server, to build a vertical or specialty search engine
for a customer. You will also build a Google Custom Search engine
on the same topic. Depending on the project, some students
will be allowed to use other open source search engine
tools. You will deliver the finished search engine to that
customer, and your customer must acknowledge the successful
transfer of the project. A suggested customer list is below.
Vertical/specialty Search Engine:
With this software you will construct a specialty search engine and
crawl the web to collect a dataset of at least 1,000 documents of
interest. Your specialty engine must be approved by me and the
customer and should be of interest to the customer. See me for
suggestions if necessary. (Other projects are possible but must be
approved by the instructor.)
You will then index the documents and provide a query engine that
accesses the index and ranks the results for an arbitrary query.
A query interface is included in the Nutch and YouSeer
packages. Since the goal of the project is for you to understand the
fundamentals of search engines and to create a novel search engine
application, you may use the Nutch/Lucene query interface. In any
case, you must provide an interface that permits others to query your
data and index. If possible, students should not use already crawled
collections but can use search engine selections. Part of this
exercise is to crawl for "your" special collection.
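To make the index, query, and rank steps described above concrete, here is a minimal sketch of an inverted index with a simple TF-IDF score. It is only an illustration under stated assumptions: the class name `TinyIndex`, the tokenizer, and the scoring formula are hypothetical teaching choices, not the scoring that Nutch/Lucene or Solr actually use.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index, for illustration only."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_lengths = {}              # doc_id -> number of tokens

    def add(self, doc_id, text):
        """Index one document under the given identifier."""
        tokens = tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, k=10):
        """Return up to k (doc_id, score) pairs, best score first."""
        n_docs = len(self.doc_lengths)
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Assumed weighting: rarer terms get a larger weight.
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, tf in docs.items():
                # Normalize by document length so long pages do not dominate.
                scores[doc_id] += (tf / self.doc_lengths[doc_id]) * idf
        # Sort by descending score, breaking ties by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

A document is added with `index.add(url, page_text)` and queried with `index.search("some query")`. Your report should describe how your chosen tool scores relevance, which will differ from this sketch.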
On the same topic you will build a search engine using Google Custom Search.
- You will provide a query box in your specialty search engine
for your Google Custom Search engine.
Please exercise good judgment in what you crawl and use an ethical
crawler that respects the robots exclusion protocol (robots.txt) and does not
overcrawl a web site.
DO NOT CRAWL:
- SEARCH ENGINES WITHOUT REGISTERING
Crawling with Heritrix is explained here.
Here's a document on how to crawl with Nutch.
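As an illustration of the ethical crawling guidance above, the following sketch shows one way to check robots.txt rules and enforce a per-host delay using Python's standard `urllib.robotparser`. The class `PoliteFetchPolicy`, the user agent name, and the default one-second delay are assumptions made for this example; Nutch and Heritrix have their own politeness settings, which you should configure instead of writing your own crawler logic.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse


class PoliteFetchPolicy:
    """Hypothetical helper that decides whether and when a URL may be fetched."""

    def __init__(self, user_agent, min_delay_seconds=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay_seconds = min_delay_seconds  # assumed default delay
        self.clock = clock
        self.rules = {}       # host -> RobotFileParser
        self.last_fetch = {}  # host -> time of the most recent fetch

    def load_robots(self, host, robots_txt):
        """Parse the text of a host's robots.txt (downloaded elsewhere)."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if rules are loaded for the host and they permit the URL."""
        parser = self.rules.get(urlparse(url).netloc)
        # Assumption: with no rules loaded we refuse rather than guess.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """Seconds to sleep before fetching, so a host is not overcrawled."""
        host = urlparse(url).netloc
        delay = self.min_delay_seconds
        parser = self.rules.get(host)
        if parser is not None:
            crawl_delay = parser.crawl_delay(self.user_agent)
            if crawl_delay is not None:
                delay = max(delay, float(crawl_delay))
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, delay - (self.clock() - last))

    def record_fetch(self, url):
        """Remember when the host was last contacted."""
        self.last_fetch[urlparse(url).netloc] = self.clock()
```

A crawl loop would call `allowed(url)` first, sleep for `seconds_to_wait(url)`, fetch the page, and then call `record_fetch(url)`.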
To complete this project, you need to do the following:
* Submit a comprehensive final report (about 20 pages)
that discusses the search engine tool and how it works.
The motivation for this vertical engine should be explained. The
indexing, the query process, and how the search engine calculates
relevance should be discussed in detail. The document must
include a link to your
- specialty search engine,
- Google Custom Search engine.
* You must compare your specialty engine with
that built using the Google Custom Search engine in terms of
* Provide evidence of the crawled documents, the built index, and the
query engine by listing the URLs crawled and by
submitting the index on a memory stick or CD or giving the instructor access to
a web directory of the documents. This is not
necessary for those who build their engine on the IST 441 server.
* Provide the web link to the query interface
that the instructor and grader can test and use.
* Deliver the search engine to the customer and
have the customer contact the instructor and TA.
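The comparison criteria for the two engines are not listed in this text. As one possible approach, and purely as an assumption for illustration, the sketch below computes precision at k against a hand-judged set of relevant URLs and the overlap between the top-k results of the two engines. The function names and the choice of metrics are hypothetical; confirm the expected criteria with the instructor.

```python
def precision_at_k(ranked_urls, relevant_urls, k):
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_urls[:k]
    hits = sum(1 for url in top if url in relevant_urls)
    return hits / k


def overlap_at_k(ranked_a, ranked_b, k):
    """Jaccard overlap between the top-k result sets of two engines."""
    a, b = set(ranked_a[:k]), set(ranked_b[:k])
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For each test query you would record the ranked URLs returned by the specialty engine and by the Google Custom Search engine, judge relevance by hand, and report both numbers side by side.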
Presentations (all in PowerPoint):
You will give two presentations on your search project; both
are graded. The initial one is short, 10 minutes or less; the final
one is 30 minutes. The first is mid-semester and the second is at the
end of the semester.
* Introductory presentation on your specialty
search engine topic. (10 minute presentation)
- Your choice of a search topic should be motivated.
- Why is this a good topic for which to build a specialty search
engine?
- How hard will it be to get the documents?
- What is the competition?
- Who is your customer and why is your customer interested?
* For the final presentation discuss all aspects
of the specialty search engine. (20 minute presentation)
- Discuss what was discovered about the topic chosen.
- What was crawled and why.
- If appropriate, a live
demonstration of the search engine should be performed.
- The customer should be identified; address and email must be
provided.
Both of your presentations must be professionally prepared in
PowerPoint and well organized. Hard copies must precede each
presentation.
Two professional quality hard copies of the report are due by May 1.
Also required is the URL of the built search engine interface (with the index if the search
engine is not built on the IST 441 server).
The report and PowerPoint of the presentation must be provided to
the instructor and also to the search engine customer.
The report must be in hard copy to get full credit.
A web link to your search engine query interface must be up for the
entire last week of the semester.
The customer to be involved in your project will help determine what
topic the search engine will cover.
He or she must notify the
instructor and TA about the final acceptance of the search engine
project and report.