last updated: May 2, 2016
COSC 4335: Data Mining in Spring 2016
(Dr. Eick)
Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as well as academia. For example, almost 900 papers were submitted
to the IEEE International Conference on Data Mining (ICDM) that was held in Atlantic City in December 2015
(Data Mining Conference Rankings). Data mining tools and suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and in research projects.
UH's Data Mining and Machine Learning Group (UH-DMML) conducts research in some of the
areas that are covered by this course (UH-DMML Research Overview). Finally, having
basic knowledge of data mining and data analysis is a plus when you are looking for a job in
industry or at major US research institutions, such as the Texas Medical Center in Houston or federal research labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a
data mining project. It also gives a basic introduction to data analysis. After defining
what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and
association analysis will be discussed in detail. Basic data analysis techniques, centering on basic
visualization techniques and statistics, will be covered to give you a better understanding of the
data mining task at hand. Moreover, techniques for preprocessing a data set for a data mining
task will be introduced.
In addition, in course projects you will obtain hands-on experience in conducting data mining and
data analysis projects. Finally, R will be used in most course projects; therefore, participants of the
course will obtain valuable experience in using the R statistics, data mining, and visualization
packages and will learn how to write programs in R and how to develop data mining software
on top of R.
A 2013 poll by Rexer Analytics found that R was the most
popular data mining tool: 24% of the respondents used R as their primary
tool, and only 30% of the respondents did not use R at all. Although R is a domain-specific language, it is
versatile.
In summary, having a sound background in data analytics and data mining and knowing R
well will open a lot of job opportunities for you, which, I believe, is a strong
reason to take the course.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@uh.edu
Basic Course Information
Instructor: Dr. Christoph F. Eick
office hours (573 PGH) TU 2:30-3:30p TH 9:30-10:30a
e-mail: ceick@uh.edu
TA: Can Cao
office hours: Tue: 1:00pm-2:00pm, Thur: 10:00am-11:00pm in 550E PGH
Email: COSC4335TA@gmail.com
2015 TA website: Can Cao's 4335 Website (TA's 4335
Website in 2015)
class meets: TU/TH 11:30-1p in T 120 K
cancelled classes: TBDL
Course Materials
COSC 4335 Syllabus
Objectives Data Mining Course
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2006 (Link to Book HomePage)
Other Material:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011 (Link to Data Mining Book Home Page)
- NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling, and prediction)
News COSC 4335 (Data Mining) Spring 2016
- Please do not forget to complete your online teaching evaluations for COSC 4335 and the other courses you take this
semester in the next few days!
- The final exam of COSC 4335 has been scheduled for
Tu., May 10, 11a in our class room (Review List for May 10 Final Exam).
- Concerning Assignment4, you are allowed to use any tool, R-package or other software package to solve the tasks of Assignment4. Moreover, the "late" submission deadline for Assignment4 has been extended to Fr., May 6, 11a (in the morning); however, this deadline is a hard deadline.
- There will be a makeup class for the cancelled classes on Tuesday, May 3, 11:30a-1p in our classroom. The lectures in the remainder of the semester will center on anomaly and outlier detection, association rule mining, preprocessing, discussion of Assignment4, PageRank (brief), a short course summary, and a review for the COSC 4335 final exam.
Important Dates in 2016
Tuesday, February 2: Introduction to R Tutorial
Thursday, March 3: Midterm1 Exam (Review List 2016 Midterm1 Exam)
Tuesday, March 8: Lab-style class (in preparation for Assignment2; bring your laptop!)
Thursday, April 7: Midterm2 Exam (Review List 2016 Midterm2 Exam; updated on April 4, 2016 at noon!)
Thursday, April 21: Project3 Student Presentations
Tuesday, May 3, 11:30a: Makeup class
Tuesday, May 10, 11a: Final Exam
2016 Assignments
Assignment1: Exploratory Data Analysis using R for an Abalone Dataset (Group Project; HABALONE data file, HABALONE csv file).
Assignment2: Similarity Assessment and Clustering with K-medoids, K-means, and DBSCAN (Individual Project; Randomized Hill Climbing Slides).
Assignment3: Making Sense of Data—Learning Classification/Prediction Models for an Interesting Dataset of Your Own Choice (Group Project; More information about Assignment3).
Assignment4: Design and Implementation of an Outlier Detection Technique for Spatial Data (Individual Project).
Prerequisites
COSC 3380 and MATH 3336.
Course Elements and Their Tentative Weights for 2016
Assignments (4): 45%
Exams (3): 54% (17% + 17% + 20%)
Class Attendance: 1%
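The weights above can be combined into a single course score. The following is a minimal sketch in Python (chosen only for illustration; the course itself uses R), not the instructor's actual grading script. It assumes that every component is scored on a 0-100 scale, that the four assignments are summarized by one combined assignment score, and that the two midterms carry 17% each while the final exam carries 20%; the page only lists "17+17+20%" without naming the exams. The names `WEIGHTS` and `course_score` are hypothetical.

```python
# Tentative 2016 weights in percent, taken from the list above.
# Assumption: midterms count 17% each and the final exam counts 20%.
WEIGHTS = {
    "assignments": 45,  # all four assignments, as one combined 0-100 score
    "midterm1": 17,
    "midterm2": 17,
    "final": 20,
    "attendance": 1,
}


def course_score(scores):
    """Return the weighted course score for component scores on a 0-100 scale."""
    # The weights must cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up scores for a hypothetical student.
example = {
    "assignments": 90,
    "midterm1": 80,
    "midterm2": 70,
    "final": 85,
    "attendance": 100,
}
print(course_score(example))  # 84.0
```

With these made-up numbers the contributions are 40.5 (assignments), 13.6 and 11.9 (midterms), 17.0 (final), and 1.0 (attendance), which sum to 84.0.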
COSC 4335 Data Mining: Lecture Notes
I Introduction to Data Mining (COSC 4355 Knowledge Sources, Part1, Part2, Part3: Data, Differences between Clustering and Classification; covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays; Some R Data Analysis Functions I; Some R Data Analysis Functions II)
III R (Can Cao's R Tutorial (used in lab on Feb. 2, 2016), Arko's Short Intro Into R, Scatter Plot Code, Decision Trees in R, Some useful code for Project1, Functions and Loops in R, Directory containing R-code for Project2 (lecture on Feb. 26))
IV Clustering and Similarity Assessment (Introduction and Hierarchical Clustering and DBSCAN;
R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects;
Clustering Exercises K-Means, HC, and DBSCAN)
V Introduction to Classification: Basic Concepts and Decision Trees, kNN-Classifiers and Support Vector Machines and Ensemble Learning.
VI Association Analysis (Part1, Part2 (not covered in Spring 2016))
VII Outlier Detection (significantly updated on April 25, 2016)
VIII Data Preprocessing for Data Mining
IX PageRank
X Recent Trends in Data Mining and COSC 4335
Past Exam and Review Solutions
Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam May 12, 2015
Solution Sketches Review1 March 1, 2016
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Review for Final Exam, May 3, 2016
Grading
Students will be responsible for material covered in the
lectures and assigned in the readings. All assignment and
project reports are due on the date specified.
No late submissions will be accepted after
the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance on randomly
chosen dates, and an attendance score will be computed from how many
of those lectures you attended.
Translation of numeric scores to letter grades in 2015:
A: 100-89 A-: 89-86 B+: 86-82 B: 82-77 B-: 77-74 C+: 74-70
C: 70-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
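The translation table can be read as a list of lower bounds. Below is a minimal Python sketch of such a lookup, offered as an illustration rather than the official grading procedure. The table does not say which grade a score exactly on a boundary (for example 89 or 86) receives; this sketch assumes the higher grade. The names `GRADE_FLOORS` and `letter_grade` are hypothetical.

```python
# (lower bound, letter) pairs from the 2015 table, highest grade first.
GRADE_FLOORS = [
    (89, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score):
    """Map a numeric score (0-100) to a letter grade.

    Assumption: a score exactly on a boundary gets the higher grade.
    """
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"  # everything below 50


print(letter_grade(84))    # B+
print(letter_grade(89))    # A (boundary goes to the higher grade by assumption)
print(letter_grade(49.5))  # F
```

If boundary scores are instead meant to receive the lower grade, replacing `>=` with `>` changes that behavior.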
Only typed (machine-written) solutions
are accepted in the assignments; the only exceptions are figures and complex formulas.
Be aware of the fact that our
only source of information is what you have turned in. If we cannot understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homework, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material, to clarify the meaning of homework problems, or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems or in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in a grade of F
in the course and in further prosecution.
2015 Assignments
Assignment1: Exploratory Data Analysis using R (Group Project)
Assignment2: Traditional Clustering with K-means and DBSCAN (Individual Project; Randomized Hill Climbing Slides)
Assignment3: Making Sense of Data: Learn Classification Models for an Interesting Dataset/Problem (Preliminary Draft; Assignment3 Groups, Assignment3 Talks)
Assignment4: Association Rule Mining (Group Project; group size 2)