IST 441  Information Retrieval and Search Engines
  Fall 2021

Instructor: Dr. C. Lee Giles

TA: Ankur Mali
Lab Assistant: Shaurya Rohatgi

This course can be counted for the IST 402 requirement.

Time and Place: Fall, 2021, 4:30-7:30 pm, Tuesday, Westgate Bldg E202.

Office Hours:
  Dr. Lee Giles: upon request, and Wednesday 11 am-12 pm, Westgate Bldg E350.
  TA Ankur Mali: upon request via Zoom, and Wednesday 1-3 pm, Westgate Bldg E301.
  Lab Assistant Shaurya Rohatgi: Thursday 3-4 pm, Westgate Bldg E345.

Course Overview

This is a three-hour course for juniors, seniors, and graduate students that meets once a week. The course covers: organization, representation, and access to information; categorization, indexing, and content analysis; data structures for unstructured data; design and maintenance of such data structures, indexes, and retrieval and classification schemes; use of codes, formats, and standards; analysis, construction, and evaluation of search and navigation techniques; and search engines and how they relate to the above. Students will build a specialty web search engine using open source web tools and focused web crawling.

Course Mission Statement:

This course is intended to prepare students to understand, design, develop and use information retrieval and search systems.

Course Prerequisites:

IST students should have taken IST 210 and IST 240 or equivalents.  IST 220 and IST 230 are also useful. Other students should consult with the instructor.

Schedule (syllabus):  This schedule is subject to change. Please check it on a regular basis for assignments. The reading list is here; most classes will have online handouts. It is the student's responsibility to download that material.

Course Materials and References: Course materials, including PowerPoint slides, can be found here. The schedule may also link to course materials.

Grading percentages:

Search Project & Report: 35 points
Exam: 30 points
 30 points
Class Participation: 5 points

Exercise Submission:

All exercises, whether on paper or as PDFs, are due at the start of class on the due date.

Late Policy: Starting immediately after the due date, one third of the grade will be deducted for each day an exercise is late, until no credit remains. Medical and other absences require the instructor's approval or a letter of excuse (e.g., a doctor's letter).

For more information on any of the above, please contact Lee Giles.

Grading: All exercises must be completed and turned in, even if late. Failure to turn in an exercise will result in a final grade of "DF - Deferred" until the exercise is completed.

Texts and Readings:
  The primary text is the online version of Introduction to Information Retrieval. Wikipedia, online papers, and chapters and selections from other online books may also be used.

The reading list is on the schedule above.  We will use chapters and sections from:

There are many other useful texts on both search and information retrieval. A good, though somewhat outdated, selection can be found in the resources section of the first book.

One that is very good and downloadable is
Search Engines: Information Retrieval in Practice by W. Bruce Croft, Donald Metzler, and Trevor Strohman.

A very mathematical treatment of link analysis can be found in
Google's PageRank and Beyond by Amy Langville and Carl Meyer.

Popular, less technical books that are informative, though somewhat outdated, include:

John Battelle, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture, Portfolio, 2005.
Ian Witten, Marco Gori, Teresa Numerico, Web Dragons: Inside the Myths of Search Engine Technology, Morgan Kaufmann, 2006.

We will be using the popular open source enterprise search platform Elasticsearch, which is built on the even more widely used Apache Lucene indexing library.

Email: All email to the instructor and TA about this class should contain "IST441" in the subject line.  For example, the subject line might read "IST441: Question about ....".  Email without this information might be deleted by spam filters or placed in a folder to be read at a later date.  Email with the appropriate identifier will usually be read within 24 hours of being received.

Academic Integrity


Open Educational Resources

Materials from this course can be publicly reused in other courses. This course supports Open Educational Resources (OER), with most materials readily available online to all.