COSC 6335 Data Mining

Fall 2017

MW 2:30 - 4:00 PM, F (Lamar Fleming Jr.) 154

Instructor: Arjun Mukherjee


Overview

This is a graduate-level course in data mining. The course is intended to develop foundations in data and text mining, with a focus on solving problems in Web and real-world domains. The broader goal is to understand how data mining tasks are carried out in the real world (e.g., on the Web). Throughout the course, strong emphasis will be placed on tying data mining techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in information retrieval, applied machine learning, web mining, and opinion mining.


Administrative details

If class is full or you are a non-CS major, you should contact Liz (ejfaig@central.uh.edu) with your UHID to be enrolled in the course or added to the waitlist.

Office hours

Instructor office hours: MW 4:00 - 5:00 PM, PGH 582
TA: Marjan Hosseinia (ma.hosseinia@gmail.com)
TA office hours: MW 12:30 - 2:30 PM, PGH 301

Prerequisites

The course requires a background in mathematics and sufficient programming skills. If you have taken and done well in one or more equivalent courses/topics such as Algorithms, Artificial Intelligence, or Numerical Methods, or have some background in linear algebra, it will be helpful. The course, however, reviews and covers the required mathematical foundations. Sufficient experience building projects in a high-level programming language (e.g., Java) is required.

Note: This course has some overlap with Natural Language Processing (COSC 6397) and Machine Learning (COSC 6342), especially in some topics of supervised learning and clustering. However, the focus here is more applied (as opposed to theoretical), and the goal is to build novel data mining technologies/algorithms on top of those methods; they are covered here to make this course standalone. Although neither course is required, having taken either of them will be useful.

Reading Materials

Textbooks:
WDM: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Bing Liu; Springer, 1st Edition.

Reference materials:
Required:
Online resources per topic as appearing in the schedule below.
Course materials (contains all Lecture notes + Sample exam questions)

Optional:
IIR: Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press. 2008. Companion website with online version for reading.
SAOM: Sentiment Analysis and Opinion Mining. Bing Liu. Morgan and Claypool Publishers. Draft available from the author's website.

Grading

Component   Contribution        Due date
Project 1   30% [100 + 20 EC]   10/20
Project 2   15%
Midterm     25%                 10/11
Final       30%


Rules and policies

Late Assignments: Late assignments will not, in general, be accepted. They will never be accepted unless the student has made special arrangements with me at least one day before the assignment is due, and there must be a justifiable reason owing to extenuating circumstances. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
Cheating: All submitted work (code, homeworks, exams, etc.) must be your own. If evidence of code sharing is found, you will receive an F grade in the course. Please refer to the student handbook for details on academic honesty.
Statute of limitations: Grading questions or complaints will, in general, not be attended to beyond one week after the item in question has been returned.


Schedule of topics


Topic(s) Resources: Readings, Slides, Lecture notes, Papers, etc.
Introduction
Course administrivia, semester plan, course goals
DM Resources
Data mining basics
Readings: Chapter 1 WDM (up to 1.3.2)
Pattern Mining
Association rules
Apriori algorithm
Required Readings: Chapter 2 WDM (2.1-2.3, 2.5)
Optional Recommended Reading: Apriori implementation leveraging Tries [Bodon et al., 2010]
Programming resources, tools, libraries for projects and homeworks:
C. Borgelt's FPM library
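To give a feel for what the Apriori algorithm does before reading Chapter 2, here is a minimal sketch of its level-wise candidate generation and support counting (illustrative only; the function and variable names are ours, not from the course materials, and real implementations such as Borgelt's use far more efficient data structures like tries):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: find all itemsets with absolute support >= min_support.
    transactions: list of sets of items; returns {frozenset: support count}."""
    def count(candidates):
        freq = {}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate itemset contained in transaction
                    freq[c] = freq.get(c, 0) + 1
        return {c: n for c, n in freq.items() if n >= min_support}

    # Level 1: frequent single items
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result
```

For example, with five market-basket transactions and min_support = 3, the sketch returns all frequent singletons and pairs but correctly drops the triple {a, b, c} if it occurs only twice.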
Supervised Learning I
Basic concepts: Data and features
Decision trees
Naive Bayes
Classifier evaluation
Required Readings: Chapter 3 WDM (3.1, 3.2, 3.3, 3.5, 3.6, 3.7.2)
F. Keller's tutorial on Naive Bayes + A. Moore's notes for the graph view (Slide 8)

Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe
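As a preview of the Naive Bayes material in Chapter 3, the following is a minimal multinomial Naive Bayes text classifier with Laplace (add-one) smoothing (an illustrative sketch with class and method names of our choosing; toolkits like Mallet and LingPipe provide production-grade versions):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes for text, with Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        # Log priors P(c) from class frequencies
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        # Per-class word counts
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}

    def predict(self, doc):
        # argmax_c log P(c) + sum_w log P(w|c), with add-one smoothing
        def score(c):
            s = self.priors[c]
            for w in doc.split():
                s += math.log((self.word_counts[c][w] + 1) /
                              (self.totals[c] + len(self.vocab)))
            return s
        return max(self.classes, key=score)
```

Trained on a toy set of positive and negative sentences, the classifier picks the class whose smoothed word probabilities best explain a new document.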
Supervised Learning II
Support Vector Machines
Feature selection
Required Readings: Chapter 3 WDM (3.8, 3.10) Feature selection schemes: [Forman, 2003], [Mukherjee and Liu, 2010]

Programming resources, tools, libraries for projects and homeworks:
SVM: SVMLight, Boosted decision trees/Random forests: JForests, JBoost
Clustering
K-means clustering
Hierarchical clustering
Distance functions
Clustering evaluation
Required Readings: Chapter 4 WDM (4.1, 4.2, 4.4, 4.5, 4.6, 4.9).
Optional Recommended Reading: Clustering Analysis (Advanced) [Fred et al.], Modern Methods and Algorithmic Analyses [Müllner et al.]
Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe, PU-Learning
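To preview the k-means material in Chapter 4, here is a minimal sketch of Lloyd's algorithm on 2-D points with Euclidean distance (illustrative only; we initialize deterministically with the first k points for simplicity, whereas real implementations use random or k-means++ seeding):

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm sketch on 2-D points (tuples); Euclidean distance.
    Deterministic init: first k points serve as initial centroids."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                  (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters
```

On two well-separated blobs the centroids converge to the blob means in a few iterations.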
Partially Supervised Learning
Naive Bayes EM estimation
Co-Training
Learning from Positive and Unlabeled examples
Required Readings: Chapter 5 WDM (5.1, 5.1.1, 5.1.2, 5.2).
Recommended reading: T-SVMs 5.1.4 WDM

Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe, PU-Learning
Web Mining
Web Search and Information Retrieval
Social Network Analysis
Google Page Rank
HITS algorithm
Required Reading(s): Chapter 6 WDM Selected topics of these sections (6.1, 6.2.1, 6.2.2, 6.3, 6.4, 6.5, 6.6) which were covered in class
Chapter 7 WDM (7.1, 7.3, 7.4)
Recommended reading:
Chapter 21 from [Manning et al., 2008]. Accompanying slides.
Search Ranking evaluation metrics: MAP, NDCG, P@n
Programming resources, tools, libraries for projects and homeworks:
JUNG, RankLib
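As a preview of the link-analysis material in Chapters 6-7, here is a minimal power-iteration sketch of PageRank with the standard damping factor (illustrative only; the function name and the even redistribution of dangling-node mass are our simplifying choices, and libraries like JUNG implement this at scale):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank sketch.
    links: dict mapping each node to a list of its out-neighbors."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}  # uniform start
    for _ in range(iters):
        # Teleportation term (1 - d)/n for every node
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                # Node u shares d * PR(u) equally among its out-links
                share = d * pr[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # Dangling node: distribute its mass evenly (one common convention)
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr
```

On a tiny three-page graph where both b and c link to a, node a ends up with the highest rank, and the scores sum to 1.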
Opinion Mining
Aspect and sentiment extraction
Detecting opinion spam
Required Readings:
Lecture notes + slides + selected topics (covered in lectures) from Chapter 11, WDM
Paper on opinion spam: [Ott et al., 2011], slides, demo

Programming resources, tools, libraries for projects and homeworks:
Pos/Neg Sentiment Lexicon, SentiWordNet, Deep learning for sentiment analysis
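To illustrate the simplest use of a sentiment lexicon such as those listed above, here is a count-based polarity sketch (illustrative only; the function name and word sets are ours, and real lexicon-based systems also handle negation, intensifiers, and word weights):

```python
def lexicon_sentiment(text, pos_words, neg_words):
    """Count-based polarity sketch: label by (#positive - #negative) lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)
    return 'pos' if score > 0 else 'neg' if score < 0 else 'neutral'
```

With a toy lexicon, a sentence with more positive than negative hits is labeled 'pos', ties fall back to 'neutral'.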