last updated: February 15, 2015
COSC 4335: Data Mining in Spring 2015
(Dr. Eick)
Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as well as academia. For example, almost 900 papers were submitted
to the IEEE International Conference on Data Mining (ICDM) held in Shenzhen, China in December 2014
(Data Mining Conference Rankings). Data mining tools and suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and in research projects.
UH's Data Mining and Machine Learning Group (UH-DMML) conducts research in some of the
areas that are covered by this course (UH-DMML Research Overview). Finally, having
basic knowledge in data mining is a plus when you are looking for a job in
industry or at major US research institutions, such as the Texas Medical Center in Houston or Federal Research Labs.
The course covers the most important data mining techniques and provides background knowledge on how to conduct a
data mining project. It also gives a basic introduction to data analysis. After defining
what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Basic data analysis techniques, centering on basic visualization techniques and statistics,
will be covered to give you a better understanding of the data mining task at hand.
Techniques for preprocessing a data set for a data mining task will also be introduced.
Moreover, in course projects you will obtain hands-on experience in conducting data mining and data analysis projects. Finally,
as R will be used in most course projects, participants of the course will obtain
valuable experience in using the R statistics, data mining, and visualization
packages and will learn how to write programs in R and how to develop data mining software
on top of R.
A 2013 poll by Rexer Analytics found that R is currently the most
popular data mining tool: 24% of the respondents use R as their primary
tool, and only 30% of the respondents do not use R at all. Although R is a domain-specific language, it is
versatile.
In summary, having a sound background in data analytics and data mining and knowing R
well will open a lot of job opportunities for you, which, I believe, is a strong
reason to take the course.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@uh.edu
Basic Course Information
Instructor: Dr. Christoph F. Eick
office hours (573 PGH) TU 4-5p TH 11:30a-12:30p
e-mail: ceick@uh.edu
TA: Raju, Rezaul Karim
office hours: ...
2015 TA website: Raju's 4335 Website
Google COSC 6335 News Group
class meets: TU/TH 2:30-4p
cancelled classes: TBDL
COSC 4335 Lecture Video Link
Course Materials
COSC 4335 Syllabus
Objectives Data Mining Course
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2006
- Link to Book HomePage
Other Material:
- Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011
- Link to Data Mining Book Home Page
- NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
Important Dates in 2015
Thursday, February 26: Nguyen Pham will be teaching this lecture
Tuesday, March 10: Midterm exam
Tuesday, March 24: N.N. will be teaching this lecture
Tuesday, April 7: R programming exam (tentative; may be take-home)
Tuesday, April 21: Project3 Student Presentations
TU/TH, May ?, 2p: Final Exam (will not be comprehensive)
2015 Assignments
Draft of Assignment1
Assignment2: Traditional Clustering with K-means and DBSCAN (Individual Project, preliminary draft; you are expected to start working on the
project after the lecture on February 24; Randomized Hill Climbing Slides)
News COSC 4335 (Data Mining) Spring 2015
- The specification of Assignment1 problem 6 has been updated.
- This website is an evolving document. As Dr. Eick is teaching this course for the first time, more teaching material will be added to this website as we move along in the semester.
- A first draft of Assignment2 (including the 2015 Abalone Data Mining Cup) has been posted; please read its specification before the lecture on Feb. 24, which will discuss its tasks.
- The lectures in the Feb. 16 week will center on clustering, and there will be an R-lab on Th., Feb. 26 (bring laptops) to prepare you for the tasks of Assignment2.
Prerequisites
COSC 3380 and MATH 3336.
Course Elements and Their Tentative Weights for 2014
Assignments (4): 41-50%
Exams (2-3): 48-55%
Class Attendance: 1%
Special Individual Tasks: 0-5%
2015 Projects
COSC 4335: Data Mining Lecture Notes
I Introduction to Data Mining (COSC 4335 Knowledge Sources, Part1, Part2,
Part3: Data, Differences between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III R (Arko's Short Intro Into R, Data and Some R Data Analysis Functions (download datasets prior to the Feb. 5 lecture!),
Scatter Plot Code, Decision Trees in R, Some useful code for Project1)
IV Clustering and Similarity Assessment (Introduction and Hierarchical Clustering and DBSCAN;
R-scripts demonstrating: K-means/medoids, DBSCAN;
Clustering Exercises: K-Means, HC, and DBSCAN)