Overview
This is a graduate-level course in data mining. The course is intended to develop foundations in data and text mining, with a focus on solving problems in Web/real-world domains. The broader goal is to understand how data mining tasks are carried out in the real world (e.g., the Web). Throughout the course, strong emphasis will be placed on tying data mining techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in information retrieval, applied machine learning, web mining, and opinion mining.
Administrative details
If the class is full or you are a non-CS major, contact Liz (ejfaig@central.uh.edu) with your UHID to be enrolled in the course or added to the waitlist.
Office hours
Instructor office hours: MW 4:00 - 5:00 PM, PGH 582
TA: Marjan Hosseinia (ma.hosseinia@gmail.com)
TA office hours: MW 12:30 - 2:30 PM, PGH 301
Prerequisites
The course requires a background in mathematics and sufficient programming skills. It will help if you have taken and done well in one or more equivalent courses/topics such as Algorithms, Artificial Intelligence, or Numerical Methods, or have some background in linear algebra. The course does, however, review the required mathematical foundations. Sufficient experience building projects in a high-level programming language (e.g., Java) is required.
Note: This course has some overlap with Natural Language Processing (COSC 6397) and Machine Learning (COSC 6342), especially in some topics in supervised learning and clustering. However, the focus here is more applied (as opposed to theoretical), and the goal is to build novel data mining technologies/algorithms on top of those methods. These topics are covered so that this course is self-contained. Having taken either of those courses is not required, but it will be useful.
Reading Materials
Textbooks:
WDM: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Bing Liu; Springer, 1st Edition.
Reference materials:
Required:
Online resources per topic as appearing in the schedule below.
Course materials (contains all Lecture notes + Sample exam questions)
Optional:
IIR: Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press. 2008.
Companion website with online version for reading.
SAOM: Sentiment Analysis and Opinion Mining. Bing Liu. Morgan and Claypool Publishers. Draft available from author's website.
Grading
Component | Contribution | Due date |
Project 1 | 30% [100 + 20 EC] | 10/20 |
Project 2 | 15% | |
Midterm | 25% | 10/11 |
Final | 30% | |
Schedule of topics
Topic(s) | Resources: Readings, Slides, Lecture notes, Papers, etc. |
Introduction
Course administrivia, semester plan, course goals, DM Resources, Data mining basics |
Readings: Chapter 1 WDM (up to 1.3.2)
|
Pattern Mining
Association rules, Apriori algorithm |
Required Readings: Chapter 2 WDM (2.1-2.3, 2.5)
Optional Recommended Reading: Apriori implementation leveraging Tries [Bodon et al., 2010]
Programming resources, tools, libraries for projects and homeworks: C. Borgelt's FPM library |
Supervised Learning I
Basic concepts: Data and features, Decision trees, Naive Bayes, Classifier evaluation |
Required Readings: Chapter 3 WDM (3.1, 3.2, 3.3, 3.5, 3.6, 3.7.2)
F. Keller's tutorial on Naive Bayes + notes of A. Moore for graph view (Slide 8)
Programming resources, tools, libraries for projects and homeworks: Mallet, LingPipe |
Supervised Learning II
Support Vector Machines, Feature selection |
Required Readings: Chapter 3 WDM (3.8, 3.10)
Feature selection schemes: [Forman, 2003], [Mukherjee and Liu, 2010]
Programming resources, tools, libraries for projects and homeworks: SVM: SVMLight; Boosted decision trees/Random forests: JForests, JBoost |
Clustering
K-means clustering, Hierarchical clustering, Distance functions, Clustering evaluation |
Required Readings: Chapter 4 WDM (4.1, 4.2, 4.4, 4.5, 4.6, 4.9).
Optional Recommended Reading: Clustering Analysis (Advanced) [Fred et al.], Modern Methods and Algorithmic Analyses [Müllner et al.]
Programming resources, tools, libraries for projects and homeworks: Mallet, LingPipe, PU-Learning |
Partially Supervised Learning
Naive Bayes EM estimation, Co-Training, Learning from Positive and Unlabeled examples |
Required Readings: Chapter 5 WDM (5.1, 5.1.1, 5.1.2, 5.2).
Recommended reading: T-SVMs (5.1.4 WDM)
Programming resources, tools, libraries for projects and homeworks: Mallet, LingPipe, PU-Learning |
Web Mining
Web Search and Information Retrieval, Social Network Analysis, Google PageRank, HITS algorithm |
Required Reading(s): Chapter 6 WDM, selected topics from sections 6.1, 6.2.1, 6.2.2, 6.3, 6.4, 6.5, and 6.6 that were covered in class
Chapter 7 WDM (7.1, 7.3, 7.4)
Recommended reading: Chapter 21 from [Manning et al., 2008]. Accompanying slides.
Search ranking evaluation metrics: MAP, NDCG, P@n
Programming resources, tools, libraries for projects and homeworks: JUNG, RankLib |
Opinion Mining
Aspect and sentiment extraction, Detecting opinion spam |
Required Readings:
Lecture notes + slides + selected topics (covered in lectures) from Chapter 11, WDM
Paper on opinion spam: [Ott et al., 2011], slides, demo
Programming resources, tools, libraries for projects and homeworks: Pos/Neg Sentiment Lexicon, SentiWordNet, Deep learning for sentiment analysis |