IST 441 Course Project

Introduction:

This project counts for 35% of your grade and is a team project for undergraduates. Once formed, teams can be found on Canvas. Under certain conditions, projects can be individual activities. If you have questions about any of these issues, please see the instructor right away.

Please fill out the experience form.

The 35% is broken down as follows.

    * Search engine - 20%
    * Report - 10%
    * Presentations - 5% (all will be graded)

In this project, you will build two search engines for the class.

1). You will use the open source software Elasticsearch, available online at the class server, to build a vertical or specialty search engine. You can also download and use Elasticsearch on your own server or laptop. However, you will need to give the instructor and Lab Assistant access so that they can run queries.

2). You will also build, if appropriate, a Google Custom Search engine (CSE) on the same topic.

3). You will then compare your CSE with your specialty search engine.


1) Vertical/specialty Search Engine:


Undergraduates:

With Elasticsearch, your team will construct a specialty search engine and, with Scrapy, crawl the web for a dataset of at least 10,000 documents. (Other projects are possible but must be approved by the instructor.)

You will then index these documents with Elasticsearch and provide a user interface with Kibana or another tool to query the index and generate a ranking based on an arbitrary query.  An interface must be provided to permit others to query your data and index.
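As a rough illustration of the indexing and query step, the sketch below builds the two request bodies involved: a newline-delimited body for the Elasticsearch bulk endpoint and a query body that ranks documents against free text. It uses only the standard library and does not contact a server. The index name `ist441-demo`, the field names `title` and `body`, the sample documents, and the choice of a `multi_match` query are all assumptions made for the example, not requirements of the project.

```python
import json


def bulk_payload(index_name, docs):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    Each document becomes two lines: an action line, then the source line.
    """
    lines = []
    for doc in docs:
        # Using the URL as the document id is an assumption of this sketch.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["url"]}}))
        lines.append(json.dumps({"title": doc["title"], "body": doc["body"]}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def search_body(user_query, size=10):
    """Build a query-DSL body that ranks documents against free text."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": user_query,
                # "title^2" boosts matches in the title field by a factor of 2.
                "fields": ["title^2", "body"],
            }
        },
    }


# Hypothetical sample documents standing in for crawled pages.
docs = [
    {"url": "https://example.org/a", "title": "Random forests", "body": "Bagging of trees."},
    {"url": "https://example.org/b", "title": "Gradient boosting", "body": "Sequential trees."},
]
payload = bulk_payload("ist441-demo", docs)
print(payload)
print(json.dumps(search_body("tree ensembles"), indent=2))
```

The bulk body would be sent to the `_bulk` endpoint and the query body to the index's `_search` endpoint; Kibana or another front end can issue the same query on behalf of a user.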

Graduate students:

You will build a specialty search engine project after discussions with the TA, Lab Assistant, and the Instructor.


2) Google Custom Search:


On the same topic, if possible, you will build a search engine using Google Custom Search.

Crawling:


Undergraduates:


For this project you will crawl Data Science Stack Exchange.

There are many crawlers available. We have installed Scrapy on the IST server. Other crawlers you can use are Heritrix and Nutch.
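Whichever crawler you choose, it needs a rule for deciding which links to follow. The sketch below shows one possible rule using only the Python standard library; it is a stand-in for the filtering logic you might place in a Scrapy spider, not Scrapy code itself. The URL pattern for question pages (`/questions/<number>/<slug>`) is an assumption that should be checked against the live site, and the sample HTML is made up for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed shape of a question-page path; verify against the live site.
QUESTION_PATH = re.compile(r"^/questions/\d+(/[^/]*)?$")
ALLOWED_HOST = "datascience.stackexchange.com"


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.base_url, href))


def question_links(base_url, html):
    """Return de-duplicated on-site question URLs found in the HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seen, kept = set(), []
    for link in collector.links:
        parts = urlparse(link)
        if parts.netloc != ALLOWED_HOST or not QUESTION_PATH.match(parts.path):
            continue
        # Drop query strings and fragments so each question is kept once.
        clean = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept


# Made-up page fragment for illustration.
sample = """
<a href="/questions/123/what-is-overfitting">Q1</a>
<a href="/questions/123/what-is-overfitting#comment-5">Q1 again</a>
<a href="/questions/tagged/python">tag page</a>
<a href="https://stackoverflow.com/questions/9/other-site">off-site</a>
<a href="/users/42/someone">user</a>
"""
found = question_links("https://datascience.stackexchange.com/questions", sample)
print(found)
```

Keeping the crawl on one host and on one page type makes it easier to reach the document count without filling the index with tag, user, and navigation pages.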

Graduate Students:

You will crawl the web or use a repository of documents appropriate to your project.



FINAL REQUIREMENTS:

To complete this project, you need to do the following:


Final Report:


1) Write a final report of no more than 20 pages that discusses the specialty search engine, how it works, and what was crawled and why.
        Use the ACM paper format for your submission.

2) Submit via Canvas a PDF of the report.

The indexing, the query process, and how the search engine calculates relevance should be discussed in detail.  In the document must be a link to your

    * If possible, you must compare your specialty engine with your built Google custom search engine (CSE) in terms of relevance for at least 20 queries.

    * If you do not build your search engine on the IST 441 server, please provide evidence of the crawled documents by listing in an appendix the URLs crawled, the built index, and the query engine.

    * Provide web links to the CSE query interface, which the instructor and TA can test and use, and to your specialty engine if it is not on the IST 441 server.
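One possible way to carry out the relevance comparison is to judge the top results for each query by hand and compute precision at k for both engines. The sketch below assumes that approach; the assignment does not prescribe a metric, and the query ids, result lists, and judgments are invented placeholders. A real comparison would use at least 20 queries.

```python
def precision_at_k(ranked_urls, relevant_urls, k=10):
    """Fraction of the top-k positions filled by results judged relevant."""
    hits = sum(1 for url in ranked_urls[:k] if url in relevant_urls)
    return hits / k


def mean_precision(runs, judgments, k=10):
    """Average precision@k over all queries in one engine's run."""
    scores = [precision_at_k(runs[q], judgments.get(q, set()), k) for q in runs]
    return sum(scores) / len(scores)


# Placeholder relevance judgments: query id -> set of relevant result ids.
judgments = {"q1": {"a", "b"}, "q2": {"x"}}

# Placeholder ranked results returned by each engine for the same queries.
specialty = {"q1": ["a", "c", "b"], "q2": ["y", "x", "z"]}
cse = {"q1": ["c", "a", "d"], "q2": ["y", "z", "w"]}

specialty_score = mean_precision(specialty, judgments, k=3)
cse_score = mean_precision(cse, judgments, k=3)
print(f"specialty P@3 = {specialty_score:.3f}, CSE P@3 = {cse_score:.3f}")
```

Reporting the per-query scores alongside the averages makes it easier to discuss where each engine does well or poorly.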


Presentations (all in PowerPoint):

You will give three presentations on your search project; all are graded. The first and second are brief, covering 15 minutes or less; the final one is not more than 30 minutes.

All presentations must be submitted to Canvas before the class.

1st presentation: Present your crawling progress.
2nd presentation: Status of your index
Final presentation: Overview of the search engine. (no more than 30 minutes)
All of your presentations must be professionally prepared in PowerPoint and well organized. Hard copies must be handed in before each presentation.


Report and Presentation:

A PDF of the report is due by 8 am December 12.

Also provide the URL of the built search engine interface (with the index, if the search engine is not built on the IST 441 server).

A PDF of the report and PowerPoint of the presentation must be provided to the instructor and also to the search engine customer.



Web Interface:

A web link to your search engine query interface must be up for the entire last week of the semester.


 
