IST 441 Course Project

Introduction:

This project counts for 35% of your grade and is a team project for undergraduates. Teams can be found here. Under certain conditions, projects can be individual activities. For questions about any of these issues, please see the instructor right away.

The project's share of your final grade breaks down as follows.

    * Search engine - 20%
    * Report - 10%
    * Presentations - 5% (both will be graded)

In this project, you will build two search engines for a customer. See the suggested customer list below.

Suggested customer list.

1). You will use the open source software Elasticsearch, available online at the class server, to build a vertical or specialty search engine for a customer.  You can also download and use Elasticsearch on your own server or laptop. However, you will need to give the instructor access to run queries.

2). You will also build a Google Custom Search engine on the same topic. Depending on the student project, some students will be allowed to use other open source search engine tools. You will deliver the finished search engine to that customer.  Your customer must acknowledge your successful transfer of the project.


1) Vertical/specialty Search Engine:


With the Elasticsearch software you will construct a specialty search engine and crawl the web for a dataset of at least 1,000 documents of interest. Your specialty engine must be approved by the instructor and the customer and should be of interest to the customer. See the instructor for suggestions if necessary. (Other projects are possible but must be approved by the instructor.)

You will then index these documents with Elasticsearch and provide a user interface, with Kibana or another tool, to query the index and generate a ranking for an arbitrary query.  An interface must be provided to permit others to query your data and index. If possible, students should not use already crawled collections but can use search engine selections. Part of this project is to crawl for "your" special collection.
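
As a rough illustration of the indexing and query step, the sketch below builds the request bodies Elasticsearch expects: a newline-delimited body for the `_bulk` endpoint and a `multi_match` body for `_search`. The index name, the field names (`url`, `title`, `body`), and the title boost are assumptions made for this example, not project requirements. The sketch only builds the payloads; sending them to your server is left to whichever client you choose.

```python
import json

INDEX_NAME = "ist441-demo"  # hypothetical index name, chosen for this example


def build_bulk_payload(docs, index_name=INDEX_NAME):
    """Return a newline-delimited JSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document, using its URL as the document id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["url"]}}))
        # Source line: the fields that will be searchable (assumed field names).
        lines.append(json.dumps({
            "url": doc["url"],
            "title": doc["title"],
            "body": doc["body"],
        }))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def build_search_body(text, size=10):
    """Return a _search body that ranks documents by relevance to `text`."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # Assumed design choice: title matches count twice as much.
                "fields": ["title^2", "body"],
            }
        },
    }
```

Using the URL as the document id means a page that is crawled twice overwrites its earlier copy instead of appearing twice in the index.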

2) Google Custom Search:

On the same topic you will build a search engine using Google Custom Search.

Crawling:

Please exercise good judgment in what you crawl, and use an ethical crawler that respects the robots exclusion protocol and does not overcrawl a web site.

DO NOT CRAWL:
There are many crawlers available. We will install Scrapy on the IST server. Other crawlers you can use are Heritrix and Nutch.
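
If you write your own fetching code rather than using one of these crawlers, the politeness rules above can be sketched with Python's standard `urllib.robotparser`. This is a minimal sketch under stated assumptions: the user-agent string and the 2-second default delay are made up for the example, and the robots.txt text is passed in as a string so that the sketch makes no network requests itself.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ist441-student-crawler"  # hypothetical user-agent string
DEFAULT_DELAY = 2.0  # assumed pause (seconds) when robots.txt gives no Crawl-delay


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, user_agent=USER_AGENT):
    """True if the site's robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def delay_for(parser, user_agent=USER_AGENT):
    """Seconds to wait between requests: the site's Crawl-delay, else the default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self):
        self._next_allowed = {}

    def wait_time(self, url, delay, now=None):
        """Return how long to sleep before fetching `url`, and reserve the slot."""
        now = time.monotonic() if now is None else now
        host = urlparse(url).netloc
        ready_at = self._next_allowed.get(host, now)
        wait = max(0.0, ready_at - now)
        # The request after this one must wait a further `delay` seconds.
        self._next_allowed[host] = max(ready_at, now) + delay
        return wait
```

In a crawl loop you would skip any URL for which `may_fetch` returns False and call `time.sleep(throttle.wait_time(url, delay_for(rules)))` before each request. Scrapy offers comparable behavior through its `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` settings.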


FINAL REQUIREMENTS:


To complete this project, you need to do the following:


Final Report:


1) Submit 2 comprehensive hard-copy final reports, 20 pages or so each, that discuss the specialty search engine, how it works, and what was crawled.

2) Submit via email a PDF of the report.

The report should describe the motivation for your vertical engine and discuss in detail the indexing, the query process, and how the search engine calculates relevance.  In the document must be a link to your

    * You must compare your specialty engine with that built using the Google Custom Search engine in terms of relevance for at least 20 queries.

    * Provide evidence of the crawled documents by listing the URLs crawled, and of the built index and query engine by submitting the index on a memory stick or CD or by giving the instructor access to a web directory of the documents. This is not necessary for those who build their search engine on the IST 441 server.

    * Provide the web link to the query interface that the instructor and TA can test and use.

    * Deliver the search engine to the customer and have the customer contact the instructor and TA.
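
For the relevance comparison over at least 20 queries, one common measure is precision at k: the fraction of the top k results that are relevant. The sketch below is one possible way to compute it. The assignment does not prescribe a metric, so precision at k, the cutoff k = 10, and the hand-made relevance judgments (a set of relevant URLs per query) are all assumptions of this example.

```python
def precision_at_k(ranked_urls, relevant_urls, k=10):
    """Fraction of the top-k results judged relevant.

    Divides by k even when fewer than k results come back, so an engine
    that returns too few results is not rewarded for it.
    """
    top = ranked_urls[:k]
    hits = sum(1 for url in top if url in relevant_urls)
    return hits / k


def mean_precision(runs, judgments, k=10):
    """Average precision@k over all queries in `runs`.

    runs:      {query: [urls in ranked order]} for one engine
    judgments: {query: set of urls judged relevant by hand}
    """
    scores = [
        precision_at_k(ranked, judgments.get(query, set()), k)
        for query, ranked in runs.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Run `mean_precision` once with the rankings from your specialty engine and once with the rankings from the Google Custom Search engine, using the same queries and the same judgments, and report both averages alongside the per-query scores.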


Presentations (all in PowerPoint):

You will give two presentations on your search project; both are graded. The first, at mid-semester, is brief (15 minutes or less); the final one, at the end of the semester, runs 30 minutes or more.

1st presentation: Introductory presentation on your specialty search engine topic. (15 minutes)
Final presentation: Overview of the search engine. (30 or more minutes)
Both of your presentations must be professionally prepared in PowerPoint and well organized. Hard copies must precede each presentation.


Report and Presentation:

Two professional quality hard copies and a PDF of the report are due by 8 am April 29.

The built search engine interface URL (with the index if the search engine is not built on the IST 441 server)

The report and PowerPoint of the presentation must be provided to the instructor and also to the search engine customer.

The reports must be in hard copy and PDF to get full credit.


Web Interface:

A web link to your search engine query interface must be up for the entire last week of the semester.


Customer:


The customer involved in your project will help determine what topic the search engine will cover.

**The customer must notify the instructor and TA about the final acceptance of the search engine project and report.

 
