COSC 6341 Information Retrieval and Text Mining

Fall 2015

MW 1:00 - 2:30 PM, PGH XXX

Instructor: Arjun Mukherjee


Overview

This is a graduate-level course in information retrieval (IR) and text mining. The course is intended to develop foundations in IR and Web data/text mining with a focus on solving real-world problems. Throughout the course, emphasis will be placed on tying IR techniques to specific real-world applications through hands-on experience.


Administrative details

Office hours

Instructor office hours: MW 1:00 - 2:00 PM, PGH XXX
TA: XXXXXX (xxx @ uh . edu)
TA office hours: MW 1:00 - 2:00 PM, PGH XXX

Prerequisites

The course requires a basic background in mathematics and sufficient programming skills. It will be helpful if you have taken and done well in one or more equivalent courses/topics such as Algorithms, Artificial Intelligence, or Numerical Methods, or if you have some background in linear algebra. The course, however, reviews and covers the required mathematical foundations. Sufficient experience building projects in a high-level programming language (e.g., Java) will prove beneficial.

Note: This course has some overlap with Data Mining (COSC 6335) and Natural Language Processing (COSC 6342), especially in topics such as supervised learning and clustering. However, the focus here is more applied (as opposed to theoretical), and the goal is to understand the applications of these topics in IR. They are covered to make this course standalone. Although not absolutely essential, having taken either of those courses will be useful.

Reading Materials

Textbooks:
IIR: Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press. 2008. Companion website with online version for reading.
WDM: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Bing Liu; Springer, 1st Edition.

Reference materials:

SI: Statistical Inference, Casella and Berger. Cengage Learning; 2nd edition.
Online resources (OR) per topic as appearing in the schedule below.
Lecture notes

Grading

Component Contribution
HW0 x%
HW1 x%
HW2 x%
HW3 x%
Project 1 x%
Project 2 x%
Final (Must pass to pass the course!) x%


Rules and policies

Late Assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. The reason must also be justifiable and due to extenuating circumstances. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
Cheating: All submitted work (code, homeworks, exams, etc.) must be your own. If evidence of code sharing is found, you will receive an F grade in the course. Please refer to the student handbook for details on academic honesty.
Statute of limitations: Grading questions or complaints will, in general, not be attended to beyond one week after the item in question has been returned.


Homework/Project due dates

Assignment Due date
HW0 xx/xx
HW1 xx/xx
HW2 xx/xx
Project 1 xx/xx
Project 2 xx/xx


Schedule of topics


Class Topic(s) Resources: Readings, Slides, Lecture notes, Papers, etc.
M, 8/25 Introduction
Course administrivia, semester plan, course goals
Information Retrieval in a nutshell
Readings:
OR01: IR tutorials by (1) [Zhou et al. 2006] (2) K. Radinsky
x, x/xx Retrieval Models in IR
Boolean retrieval
Vector space model
Readings:
OR02: Chapter 1 IIR, Chapter 6 IIR
OR02: Lecture notes/slides by H. Schütze on Boolean retrieval, Vector space model.
Programming resources, tools, libraries for projects and homeworks:
Lucene, Nutch
x, x/xx Relevance feedback and Probabilistic IR
Relevance feedback and query expansion
Probabilistic retrieval
Readings: Chapter 9, 11 IIR
x, x/xx Evaluation and Language Models
Evaluation metrics
Language models for IR
Readings:
OR13: Chapter 8 IIR, Chapter 12 IIR.
OR14: Lecture notes/slides by H. Schütze on Evaluation metrics for IR, Language models for IR.
Programming resources, tools, libraries for projects and homeworks:
RankLib: Implementation of ranking algorithms
x, x/xx Machine Learning Review
Basic concepts: Data and features
Decision trees
Naive Bayes
Support Vector Machines
Classifier evaluation
Readings: Chapter 3 WDM (3.1, 3.2, 3.3, 3.5, 3.6, 3.8)
Programming resources, tools, libraries for projects and homeworks:
SVM: SVMLight, Mallet, LingPipe
x, x/xx Web Search basics
Information and Text Retrieval basics
Web search: Text retrieval models
Meta Search: Combining multiple rankings
Reading(s): Chapter 6 WDM (6.1, 6.2, 6.3, 6.8, 6.9)
x, x/xx Link Analysis I
Social Network Analysis
PageRank and Random Walk models
TrustRank and Topical PageRank
HITS algorithm
Reading(s): Chapter 6 WDM (6.4), Chapter 7 WDM (7.1, 7.3, 7.4), Chapter 19, 21 IIR.
Papers on (1) TrustRank and (2) Topic-sensitive PageRank.
Programming resources, tools, libraries for projects and homeworks:
JUNG, RankLib
x, x/xx Link Analysis II
Ranking evaluation metrics (revisited): MAP, NDCG, P@n
Markov Random Fields for IR
LETOR: Combining multiple rankers
Reading(s): Chapter 6 WDM (6.4), Chapter 7 WDM (7.1, 7.3, 7.4), Chapter 19, 21 IIR.
MRF and BP tutorial
MRFs for IR/Fraud detection [Metzler and Croft, 2005], [Metzler and Croft, 2007], [Pandit et al., 2007], [Fei et al., 2013]
LETOR tutorials by (1) T. Liu (Chapter 1 reading) (2) H. Li (comprehensive reference); application papers [Mukherjee et al., 2013], [Mukherjee et al., 2012]
Programming resources, tools, libraries for projects and homeworks:
JUNG, RankLib, Belief Propagation on MRFs
x, x/xx Latent Variable Models for IR I
Stat review I: Distributions, Hierarchical models
Stat review II: Bayes nets, PGMs
Latent variable models: Statistical topic models

Readings: TBA
x, x/xx Latent Variable Models for IR II
Modeling query intent and interesting things
Topic modeling for IR
Ad-hoc IR using statistical topic/language models

Readings: TBA
Utilizing Topic Models in IR
Topic models for Ad-hoc IR
x, x/xx Adversarial Information Retrieval
Web spam
Collaborative filtering
Opinion spam

Readings: TBA
AIRWeb