COSC 6335 Data Mining

Fall 2017

MW 2:30 - 4:00 PM, F (Lamar Fleming Jr.) 154

Instructor: Arjun Mukherjee


Overview

This is a graduate-level course in data mining. The course is intended to develop foundations in data and text mining, with a focus on solving problems in Web and real-world domains. The broader goal is to understand how data mining tasks are carried out in the real world (e.g., on the Web). Throughout the course, strong emphasis will be placed on tying data mining techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in information retrieval, applied machine learning, web mining, and opinion mining.


Administrative details

If class is full or you are a non-CS major, you should contact Liz (ejfaig@central.uh.edu) with your UHID to be enrolled in the course or added to the waitlist.

Office hours

Instructor office hours: MW 4:00 - 5:00 PM, PGH 582
TA: Marjan Hosseinia (ma.hosseinia@gmail.com)
TA office hours: MW 12:30 - 2:30 PM, PGH 301

Prerequisites

The course requires a background in mathematics and sufficient programming skills. If you have taken and done well in one or more equivalent courses/topics such as Algorithms, Artificial Intelligence, or Numerical Methods, or have some background in linear algebra, it will be helpful. The course, however, reviews and covers the required mathematical foundations. Sufficient experience building projects in a high-level programming language (e.g., Java) is required.

Note: This course has some overlap with Natural Language Processing (COSC 6397) and Machine Learning (COSC 6342), especially in some topics of supervised learning and clustering. However, the focus here is more applied (as opposed to theoretical), and the goal is to build novel data mining technologies/algorithms on top of those methods; they are covered here to make this course standalone. Although neither course is required, having taken either of them will be useful.

Reading Materials

Textbooks:
WDM: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Bing Liu; Springer, 1st Edition.

Reference materials:
Required:
Online resources per topic as appearing in the schedule below.
Course materials (contains all Lecture notes + Sample exam questions)

Optional:
IIR: Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press. 2008. Companion website with online version for reading.
SAOM: Sentiment Analysis and Opinion Mining. Bing Liu. Morgan and Claypool Publishers. Draft available from the author's website.

Grading

Component   Contribution        Due date
Project 1   30% [100 + 20 EC]   10/20
Project 2   15%
Midterm     25%                 10/11
Final       30%


Rules and policies

Late Assignments: Late assignments will not, in general, be accepted. They will never be accepted unless the student has made special arrangements with me at least one day before the assignment is due, and there must be a justifiable reason owing to extenuating circumstances. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
Cheating: All submitted work (code, homeworks, exams, etc.) must be your own. If evidence of code sharing is found, you will receive an F grade in the course. Please refer to the student handbook for details on academic honesty.
Statute of limitations: Grading questions or complaints will, in general, not be attended to beyond one week after the item in question has been returned.


Schedule of topics


Topic(s) Resources: Readings, Slides, Lecture notes, Papers, etc.
Introduction
Course administrivia, semester plan, course goals
DM Resources
Data mining basics
Readings: Chapter 1 WDM (up to 1.3.2)
Pattern Mining
Association rules
Apriori algorithm
Required Readings: Chapter 2 WDM (2.1-2.3, 2.5)
Optional Recommended Reading: Apriori implementation leveraging Tries [Bodon et al., 2010]
Programming resources, tools, libraries for projects and homeworks:
C. Borgelt's FPM library
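To give a feel for what the Apriori algorithm does before reading Chapter 2, here is a minimal sketch of its level-wise candidate generation and support counting (illustrative only; the function and variable names are ours, not from the course materials, and real implementations such as Borgelt's use far more efficient data structures like tries):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: find all itemsets with absolute support >= min_support.
    transactions: list of sets of items; returns {frozenset: support count}."""
    def count(candidates):
        freq = {}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate itemset contained in transaction
                    freq[c] = freq.get(c, 0) + 1
        return {c: n for c, n in freq.items() if n >= min_support}

    # Level 1: frequent single items
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result
```

For example, with five market-basket transactions and min_support = 3, the sketch returns all frequent singletons and pairs but correctly drops the triple {a, b, c} if it occurs only twice.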
Supervised Learning I
Basic concepts: Data and features
Decision trees
Naive Bayes
Classifier evaluation
Required Readings: Chapter 3 WDM (3.1, 3.2, 3.3, 3.5, 3.6, 3.7.2)
F. Keller's tutorial on Naive Bayes + A. Moore's notes for the graph view (Slide 8)

Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe
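As a preview of the Naive Bayes material in Chapter 3, the following is a minimal multinomial Naive Bayes text classifier with Laplace (add-one) smoothing (an illustrative sketch with class and method names of our choosing; toolkits like Mallet and LingPipe provide production-grade versions):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes for text, with Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        # Log priors P(c) from class frequencies
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        # Per-class word counts
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}

    def predict(self, doc):
        # argmax_c log P(c) + sum_w log P(w|c), with add-one smoothing
        def score(c):
            s = self.priors[c]
            for w in doc.split():
                s += math.log((self.word_counts[c][w] + 1) /
                              (self.totals[c] + len(self.vocab)))
            return s
        return max(self.classes, key=score)
```

Trained on a toy set of positive and negative sentences, the classifier picks the class whose smoothed word probabilities best explain a new document.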
Supervised Learning II
Support Vector Machines
Feature selection
Required Readings: Chapter 3 WDM (3.8, 3.10) Feature selection schemes: [Forman, 2003], [Mukherjee and Liu, 2010]

Programming resources, tools, libraries for projects and homeworks:
SVM: SVMLight, Boosted decision trees/Random forests: JForests, JBoost
Clustering
K-means clustering
Hierarchical clustering
Distance functions
Clustering evaluation
Required Readings: Chapter 4 WDM (4.1, 4.2, 4.4, 4.5, 4.6, 4.9).
Optional Recommended Reading: Clustering Analysis (Advanced) [Fred et al.], Modern Methods and Algorithmic Analyses [Müllner et al.]
Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe, PU-Learning
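To preview the k-means material in Chapter 4, here is a minimal sketch of Lloyd's algorithm on 2-D points with Euclidean distance (illustrative only; we initialize deterministically with the first k points for simplicity, whereas real implementations use random or k-means++ seeding):

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm sketch on 2-D points (tuples); Euclidean distance.
    Deterministic init: first k points serve as initial centroids."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                  (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters
```

On two well-separated blobs the centroids converge to the blob means in a few iterations.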
Partially Supervised Learning
Naive Bayes EM estimation
Co-Training
Learning from Positive and Unlabeled examples
Required Readings: Chapter 5 WDM (5.1, 5.1.1, 5.1.2, 5.2).
Recommended reading: T-SVMs 5.1.4 WDM

Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe, PU-Learning
Web Mining
Web Search and Information Retrieval
Social Network Analysis
Google Page Rank
HITS algorithm
Required Reading(s): Chapter 6 WDM Selected topics of these sections (6.1, 6.2.1, 6.2.2, 6.3, 6.4, 6.5, 6.6) which were covered in class
Chapter 7 WDM (7.1, 7.3, 7.4)
Recommended reading:
Chapter 21 from [Manning et al., 2008]. Accompanying slides.
Search Ranking evaluation metrics: MAP, NDCG, P@n
Programming resources, tools, libraries for projects and homeworks:
JUNG, RankLib
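As a preview of the link-analysis material in Chapters 6-7, here is a minimal power-iteration sketch of PageRank with the standard damping factor (illustrative only; the function name and the even redistribution of dangling-node mass are our simplifying choices, and libraries like JUNG implement this at scale):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank sketch.
    links: dict mapping each node to a list of its out-neighbors."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # uniform start
    for _ in range(iters):
        # Teleportation term (1 - d)/n for every node
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                # Node u shares d * PR(u) equally among its out-links
                share = d * pr[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # Dangling node: distribute its mass evenly (one common convention)
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr
```

On a tiny three-page graph where both b and c link to a, node a ends up with the highest rank, and the scores sum to 1.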
Opinion Mining
Aspect and sentiment extraction
Detecting opinion spam
Required Readings:
Lecture notes + slides + selected topics (covered in lectures) from Chapter 11, WDM
Paper on opinion spam: [Ott et al., 2011], slides, demo

Programming resources, tools, libraries for projects and homeworks:
Pos/Neg Sentiment Lexicon, SentiWordNet, Deep learning for sentiment analysis
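To illustrate the simplest use of a sentiment lexicon such as those listed above, here is a count-based polarity sketch (illustrative only; the function name and word sets are ours, and real lexicon-based systems also handle negation, intensifiers, and word weights):

```python
def lexicon_sentiment(text, pos_words, neg_words):
    """Count-based polarity sketch: label by (#positive - #negative) lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)
    return 'pos' if score > 0 else 'neg' if score < 0 else 'neutral'
```

With a toy lexicon, a sentence with more positive than negative hits is labeled 'pos', ties fall back to 'neutral'.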