IST 441 Course Project
Introduction:
This project counts for 35% of your grade and is a team project for
undergraduates. Teams, once formed, can be found on Canvas.
Under certain conditions, projects can be individual activities. If you
have questions about teams or individual projects, please see the instructor right away.
Please fill out the experience form.
The 35% is broken down as follows.
* Search engine - 20%
* Report - 10%
* Presentations - 5% (all will be graded)
In this project, you will build two search engines for the class.
1). You will use the open source software Elasticsearch, available online
on the class server, to build a vertical or specialty search engine. You can also download and use Elasticsearch on
your own server or laptop. In that case, however, you will need to give
the instructor and Lab Assistant access so that they can run queries.
2). You will also build, if appropriate, a Google
Custom Search engine (CSE) on the same topic.
3). You will then compare your CSE with your specialty search engine.
1) Vertical/specialty Search Engine:
Undergraduates:
With Elasticsearch, your team will construct a specialty
search engine, and with Scrapy you will crawl the web to collect a dataset
of at least 10,000 documents. (Other
projects are possible but must be approved by the instructor.)
You will then index these documents with Elasticsearch and provide a
user interface with Kibana or another tool to query the index and
generate a ranking based on an arbitrary query. An interface
must be provided to permit others to query your data and index.
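As a rough illustration of the indexing and querying steps above, the sketch below builds the newline-delimited body that the Elasticsearch _bulk endpoint accepts, and a multi_match query body for a free-text query. The index name dsse_posts, the field names title, body, and url, and both helper function names are assumptions made for this example, not requirements of the project. The code only constructs the JSON; sending it to your server is left out.

```python
import json


def build_bulk_body(index_name, docs):
    """Return an NDJSON string for the Elasticsearch _bulk endpoint.

    Each document becomes two lines: an action line and the document
    itself. The body must end with a newline.
    """
    lines = []
    for doc in docs:
        # Using the page URL as the document id is a design choice for this
        # sketch: a re-crawled page overwrites its earlier copy instead of
        # creating a duplicate.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["url"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


def build_query(user_query, size=10):
    """Return a search body that matches the query against title and body."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": user_query,
                # "title^2" boosts title matches relative to body matches.
                "fields": ["title^2", "body"],
            }
        },
    }


# Hypothetical crawled documents (URLs and text are placeholders).
docs = [
    {"url": "https://example.org/q/1", "title": "What is overfitting?",
     "body": "A model that memorizes its training data."},
    {"url": "https://example.org/q/2", "title": "Choosing k in k-means",
     "body": "Use the elbow method or silhouette scores."},
]
bulk_body = build_bulk_body("dsse_posts", docs)
query_body = build_query("overfitting")
```

The bulk body would be sent with POST to /_bulk and the query body with POST to /dsse_posts/_search. The title boost of 2 is an arbitrary starting point that your team would tune.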
Graduate students:
You will build a specialty search engine project after discussions with the TA, the Lab Assistant, and the instructor.
2) Google Custom Search:
If possible, you will build a search engine on the same topic using Google Custom Search.
- You will provide a query box in your specialty search engine
for your Google Custom Search engine.
Crawling:
Undergraduates:
For this project you will crawl Data Science Stack Exchange.
There are many crawlers available. We have installed Scrapy
on the IST server. Other crawlers you can use are Heritrix and Nutch.
Graduate Students:
You will crawl the web or use a repository of documents appropriate to your project.
FINAL REQUIREMENTS:
To complete this project, you need to do the following:
Final Report:
1) Write a final report of no more than 20 pages that discusses
the specialty search engine, how it works, and what was crawled and why.
Use the ACM paper format for your submission.
2) Submit via Canvas a PDF of the report.
The indexing, the query process, and how the search
engine calculates relevance should be discussed in detail. The
report must include links to your
- specialty search engine,
- Google Custom Search engine.
* If possible, you must compare your specialty
engine with your Google Custom Search engine (CSE) in terms of
relevance for at least 20 queries.
* If you do not build your search engine
on the IST 441 server, please provide evidence of the crawled documents
by giving, in an appendix, the URLs crawled, the built index, and the query
engine.
* Provide web links that the instructor and TA can test and use: one to the CSE query interface
and one to your specialty engine if it is not on the IST 441 server.
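One simple way to carry out the relevance comparison mentioned above is precision at k: for each query, judge the top k results from each engine by hand and record the fraction that are relevant. The sketch below assumes you have stored those judgments as lists of booleans in rank order. The function names, the choice of k = 5, and the sample judgments are illustrative assumptions, and other measures would be equally acceptable.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the top-k results judged relevant.

    `judgments` is a list of booleans in rank order (True = relevant).
    If fewer than k results were returned, the missing ranks count as
    non-relevant, so the denominator is always k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for rel in judgments[:k] if rel) / k


def mean_precision(per_query_judgments, k=10):
    """Average precision@k over all judged queries for one engine."""
    if not per_query_judgments:
        raise ValueError("at least one judged query is required")
    scores = [precision_at_k(j, k) for j in per_query_judgments.values()]
    return sum(scores) / len(scores)


# Hypothetical hand-made judgments for two of the (at least 20) queries.
specialty = {
    "overfitting": [True, True, False, True, False],
    "k-means": [True, False, False, False, False],
}
cse = {
    "overfitting": [True, False, False, True, False],
    "k-means": [True, True, True, False, False],
}

specialty_score = mean_precision(specialty, k=5)
cse_score = mean_precision(cse, k=5)
```

Reporting the per-query scores alongside the two averages makes it easier to discuss where each engine does better.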
Presentations (all in PowerPoint):
You will give three presentations on your search project; all
are graded. The first and second are brief, 15 minutes or less each; the final
one is no more than 30 minutes.
All presentations must be submitted to Canvas before the class.
1st presentation: Present your crawling progress.
- How hard was it to get the documents?
- How long did it take?
- How can you make the crawl faster while keeping the crawl polite?
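For the last question, Scrapy exposes settings that trade crawl speed against load on the crawled site. The fragment below is a sketch of what a Scrapy project's settings.py might contain. The specific values and the contact address in the user agent are assumptions to be tuned for your own crawl, not recommended defaults.

```python
# Sketch of a Scrapy settings.py fragment; all values are illustrative.

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Identify the crawler and give site operators a way to reach you.
# The name and address here are placeholders.
USER_AGENT = "ist441-team-crawler (+mailto:team@example.edu)"

# Base delay, in seconds, between consecutive requests to the same site.
DOWNLOAD_DELAY = 1.0

# Upper bound on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay automatically based on observed latency,
# so the crawl speeds up when the server responds quickly and backs off
# when it slows down.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Cache responses on disk so that re-running the spider during
# development does not download the same pages again.
HTTPCACHE_ENABLED = True
```

Raising the concurrency values speeds up the crawl, while the delay and AutoThrottle settings keep the request rate bounded.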
2nd presentation: Status of your index
- How big is your index?
- Which fields are searchable?
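To answer the question about searchable fields, the mapping that Elasticsearch returns from GET /<index>/_mapping can be walked to list every field and its type. The sketch below works on an already-fetched mapping and assumes the typeless response layout used by Elasticsearch 7 and later. The index name dsse_posts, the sample fields, and the function name are hypothetical. It treats a field as searchable unless its mapping sets "index" to false, which is a simplification.

```python
def searchable_fields(properties, prefix=""):
    """Return {field_path: type} for fields that are indexed for search."""
    found = {}
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object or nested field: recurse into its sub-fields.
            found.update(searchable_fields(spec["properties"], path + "."))
            continue
        # Fields are indexed by default; "index": false turns that off.
        if spec.get("index", True):
            found[path] = spec.get("type")
        # Multi-fields (for example title.keyword) are listed under "fields".
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("index", True):
                found[path + "." + sub_name] = sub_spec.get("type")
    return found


# Hypothetical response from GET /dsse_posts/_mapping
mapping_response = {
    "dsse_posts": {
        "mappings": {
            "properties": {
                "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
                "body": {"type": "text"},
                "url": {"type": "keyword", "index": False},
                "author": {"properties": {"name": {"type": "keyword"}}},
            }
        }
    }
}

fields = searchable_fields(mapping_response["dsse_posts"]["mappings"]["properties"])
for path, field_type in sorted(fields.items()):
    print(path, field_type)
```

In this sample the url field is stored but not searchable, so it is left out of the listing.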
Final presentation: Overview of the search engine. (30 minutes or
less)
- Discuss what was discovered about the topic chosen.
- What was crawled and why.
- Give a live demonstration of the search engine.
All of your presentations must be professionally prepared in
PowerPoint and well organized. Hard copies must be handed in before each
presentation.
Report and Presentation:
A PDF of the report is due by 8 am December 12.
The URL of the built search engine interface (with the index if the search
engine is not built on the IST 441 server) must also be provided.
A PDF of the report and PowerPoint of the presentation must be provided to
the instructor and also to the search engine customer.
Web Interface:
A web link to your search engine query interface must be up for the
entire last week of the semester.
*************************************************