IST 441 Information Retrieval and Search Engines
Fall 2022
Notice: Graduate students who want to take this course must request admission by email to Dr. Giles.
This course can be counted toward the IST 402 requirement.
Time and Place: Fall 2022, Tuesday, 4:30-7:30 pm, Westgate Bldg E208.
Office Hours: Dr. Lee Giles, Wednesday, 3-4 pm, Westgate Bldg E350 or upon request. (New date effective 11/9)
TA: Shaurya Rohatgi, Thursday, 4-5 pm, Westgate Bldg E301.
Course Overview:
This is a three-hour course for juniors, seniors, and graduate
students that meets once a week. The course will cover:
organization, representation, and access to information;
categorization, indexing, ranking, and content analysis; data structures for
unstructured data; design and maintenance of such data structures,
indexes, and retrieval and classification schemes; use of
codes, formats, and standards; analysis, construction, and evaluation
of search and navigation techniques; and search engines and how they
relate to the above. New methods using machine learning will be briefly discussed.
Students will build a specialty web search
engine using open-source web tools and focused web crawling.
Course Mission Statement:
This course is intended to prepare students to understand, design,
develop, and use information retrieval and search systems.
Course Prerequisites:
IST undergrad students should have taken IST 210 and IST 240 or equivalents. IST 220
and IST 230 are also useful. Other students should consult with the
instructor.
Most graduate students are well qualified to take this course.
Schedule (syllabus): This schedule is
subject to change. Please check it regularly for
assignments. The reading list is here; most classes will have online
handouts. It is the student's responsibility to download that
material.
Course Materials and References: Course materials,
including PowerPoint slides, can be found here. There may also be links on
the schedule to course materials.
Grading percentages:
- The project is a group activity for undergraduates and an
individual activity for graduate students unless approved by
the instructor.
- Unless stated otherwise, all exercise assignments are individual
assignments.
- All grades are accessible in Canvas.
Exercise Submission:
All exercises, whether on paper or as PDFs, are due at the start of class on the due date.
Late Policy: Starting right
after the required submission date, 1/3 of the grade will be
deducted for every day late until no grade is available.
Medical and other absences will need the approval of the instructor
or letters of excuse (e.g. a doctor's letter).
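As an illustration only, the late deduction above can be sketched in a few lines of Python. The function name late_multiplier is hypothetical, and the sketch assumes that any started 24-hour period after the deadline counts as a full day late; the policy text does not spell out how partial days are counted, so check with the instructor.

```python
from fractions import Fraction
import math


def late_multiplier(hours_late: float) -> Fraction:
    """Return the fraction of an exercise grade kept after the late deduction.

    Assumption (not stated in the syllabus): every started 24-hour period
    after the deadline counts as one full day late.
    """
    if hours_late <= 0:
        return Fraction(1)  # on time: nothing is deducted
    days_late = math.ceil(hours_late / 24)
    # 1/3 is deducted per day late, and the result never drops below zero.
    return max(Fraction(0), 1 - Fraction(days_late, 3))


if __name__ == "__main__":
    for hours in (0, 1, 25, 49):
        print(hours, "hours late ->", late_multiplier(hours))
```

Under these assumptions, an exercise loses all credit once it is three or more days late. It must still be turned in.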
For more information on any of the above, please contact Lee
Giles.
Grading: All exercises must be completed and turned in, even if
late. Failure to turn in an exercise will result in a final grade
of "DF - Deferred" until completed.
Texts and Readings: The primary text is the online version of Introduction to Information Retrieval. Wikipedia and online papers, chapters, and selections from other online
books can be used.
The reading list is on the schedule above. We will use
chapters and sections from:
There are many other useful texts
on both search and information retrieval. A good selection, although a
bit outdated, can be found in the resources section of the first book.
A very mathematical treatment of link analysis can be found in
Google's PageRank and Beyond by Amy Langville and Carl Meyer.
Popular, less technical books
that are informative but a bit outdated include:
John Battelle, The Search: How Google and Its
Rivals Rewrote the Rules of Business and Transformed Our Culture,
Portfolio, 2005.
Ian Witten, Marco Gori, Teresa Numerico, Web Dragons: Inside the Myths of Search Engine
Technology, Morgan Kaufmann, 2006.
We will be using the popular open-source
enterprise search platform, ElasticSearch,
which is based on the even more popular Lucene indexer.
Email:
All email to the instructor and TA about this class should contain
"IST441" in the subject line. For example, the subject line
might read "IST441: Question about ....". Email without this
information might be deleted by spam filters or placed in a folder
to be read at a later date. Email with the appropriate
identifier will usually be read within 24 hours of being received.
Academic Integrity
Acknowledgements!
Open Educational Resources: Materials from this
course can be publicly reused in other courses. This course supports Open Educational Resources (OER) with most materials readily available online to all.