last updated: May 19, 2015

COSC 4335: Data Mining in Spring 2015 (Dr. Eick )



Goals of the Data Mining Course

Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data. It aims to transform large amounts of data into a well of knowledge. Data mining has become a very important field in industry as well as in academia. For example, almost 900 papers were submitted to the IEEE International Conference on Data Mining (ICDM) held in Shenzhen, China in December 2014 (Data Mining Conference Rankings). Data mining tools and suites (for example, see KDnuggets' DM Software Survey) are widely used in industry and in research projects. UH's Data Mining and Machine Learning Group (UH-DMML) conducts research in some of the areas that are covered by this course (UH-DMML Research Overview). Finally, having basic knowledge of data mining is a plus when you are looking for a job in industry or at major US research institutions, such as the Texas Medical Center in Houston or federal research labs.

The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. It also gives a basic introduction to data analysis. After defining what knowledge discovery and data mining are, the course discusses data mining tasks such as classification, clustering, and association analysis in detail. Basic data analysis techniques, centering on basic visualization techniques and statistics, that help you better understand the data mining task at hand will be covered. Moreover, techniques for preprocessing a data set for a data mining task will be introduced. In course projects you will obtain hands-on experience in conducting data mining and data analysis projects. Finally, as R will be used in most course projects, participants of the course will obtain valuable experience in using the R statistics, data mining, and visualization packages and will learn how to write programs in R and how to develop data mining software on top of R. A 2013 poll by Rexer Analytics found that R is currently the most popular data mining tool: 24% of the respondents use R as their primary tool, and only 30% of the respondents do not use R at all. Although R is a domain-specific language, it is versatile.

In summary, having a sound background in data analytics and data mining and knowing R well will open a lot of job opportunities for you, which, I believe, is a strong reason to take the course.

Comments concerning this website

If you have any comments concerning this website, send e-mail to: ceick@uh.edu

Basic Course Information

Instructor: Dr. Christoph F. Eick
office hours (573 PGH) TU 4-5p TH 11:30a-12:30p
e-mail: ceick@uh.edu
TA: Raju, Rezaul Karim
office hours: ...
2015 TA website: Raju's 4335 Website

Google COSC 6335 News Group
class meets: TU/TH 2:30-4p
cancelled classes: TBDL
COSC 4335 Lecture Video Link

Course Materials

COSC 4335 Syllabus
Objectives Data Mining Course

Recommended Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley, 2006,
Link to Book HomePage

Other Material:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, Third Edition, 2011.
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

Important Dates in 2015

Thursday, February 26: Nguyen Pham will be teaching this lecture
Tuesday, March 10: Midterm exam1
Tuesday, April 7: Midterm exam2, mostly covering R programming and classification (you are allowed to use your laptop in this exam!)
Tuesday, April 21: Project3 Student Presentations
Tuesday, May 12, 2p: Final Exam

2015 Assignments

Assignment1: Exploratory Data Analysis using R (Group Project)

Assignment2: Traditional Clustering with K-means and DBSCAN (Individual Project; Randomized Hill Climbing Slides)

Assignment3: Making Sense of Data—Learn Classification Models for an Interesting Dataset/Problem (Preliminary Draft; Assignment3 Groups, Assignment3 Talks)

Assignment4: Association Rule Mining (Group project; group size 2)

News COSC 4335 (Data Mining) Spring 2015

Exam Solutions

Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam May 12, 2015

Prerequisites

COSC 3380  and MATH 3336.

Course Elements and Their Tentative Weights for 2014

Assignments (4): 45%
Exams (3): 54% (17+17+20%)
Class Attendance: 1%
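
The weights above can be combined into a single course score. The following Python sketch is only an illustration, not the official grading procedure: it assumes that every component is scored on a 0-100 scale, that the four assignments are already averaged into one number, and that the two midterms carry 17% each while the final carries 20% (the page lists the exam weights as 17+17+20% without saying which exam gets which). The function and parameter names are hypothetical.

```python
# Weights in percent, taken from the list above.
# ASSUMPTION: midterms count 17% each and the final counts 20%.
WEIGHTS = {
    "assignments": 45,
    "midterm1": 17,
    "midterm2": 17,
    "final": 20,
    "attendance": 1,
}


def course_score(assignments, midterm1, midterm2, final, attendance):
    """Return the weighted course score, assuming each input is on a 0-100 scale."""
    scores = {
        "assignments": assignments,
        "midterm1": midterm1,
        "midterm2": midterm2,
        "final": final,
        "attendance": attendance,
    }
    # Weighted sum divided by the total weight (100 percent).
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


if __name__ == "__main__":
    print(course_score(assignments=80, midterm1=70, midterm2=90, final=85, attendance=100))
```

Because the weights sum to 100, a student with 100 in every component receives a course score of 100 under these assumptions.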

COSC 4335 Data Mining: Lecture Notes

I Introduction to Data Mining (COSC 4335 Knowledge Sources, Part1, Part2, Part3: Data, Differences between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III R (Arko's Short Intro Into R, Data and Some R Data Analysis Functions (download datasets prior to the Feb. 5 lecture!), Scatter Plot Code, Decision Trees in R, Some useful code for Project1, Functions and Loops in R, Directory containing R-code for Project2 (lecture on Feb. 26))
IV Clustering and Similarity Assessment (Introduction and Hierarchical Clustering and DBSCAN; R-scripts demonstrating: K-means/medoids, DBSCAN; Clustering Exercises K-Means, HC, and DBSCAN)
V Introduction to Classification: Basic Concepts and Decision Trees, kNN-Classifiers and Support Vector Machines and Ensemble Learning.
VI Association Analysis (Part1, Part2)
VII Data Preprocessing for Data Mining
VIII PageRank
IX Outlier Detection
X Final Words COSC 4335

Grading

Students will be responsible for material covered in the lectures and assigned in the readings. All assignment and project reports are due on the date specified. No late submissions will be accepted. This policy will be strictly enforced.
Several times during the semester I will check class attendance on randomly chosen dates, and an attendance score will be computed from how many of those lectures you attended.

Translation of numeric scores to letter grades in 2015:
A:100-89 A-:89-86 B+:86-82 B:82-78 B-:78-74 C+:74-70
C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
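
The scale above can be read as a lookup from a numeric score to a letter grade. The Python sketch below is an illustration only: adjacent ranges on the page share their boundary values (for example, 89 appears in both A and A-), so the code assumes that a boundary score earns the higher grade. That tie-breaking rule is an assumption, not something the page states, and the function name is hypothetical.

```python
# Lower cutoffs copied from the scale above, highest first.
# ASSUMPTION: a score exactly on a boundary receives the higher grade.
CUTOFFS = [
    (89, "A"), (86, "A-"),
    (82, "B+"), (78, "B"), (74, "B-"),
    (70, "C+"), (66, "C"), (62, "C-"),
    (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score):
    """Map a numeric score (0-100) to a letter grade using the 2015 scale."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for cutoff, grade in CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"  # anything below the lowest cutoff (50)


if __name__ == "__main__":
    for s in (95, 89, 81.2, 50, 49.9):
        print(s, letter_grade(s))
```

If boundary scores were instead meant to earn the lower grade, the comparison `score >= cutoff` would become `score > cutoff`.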

Only machine-written solutions are accepted in the assignments (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Students may discuss course material and homework, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework or course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through assignment problems or the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.