last updated: September 8, 2014
COSC 6335: Data Mining in Fall 2014
(Dr. Eick)
Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as
well as academia. For example, almost 900 papers were submitted
for the IEEE International Conference
on Data Mining (ICDM) to be held in Shenzhen, China in December 2014
(Data Mining
Conference Rankings). Data mining tools and
suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and
in research projects.
UH's Data Mining and
Machine Learning Group (UH-DMML) conducts research in some of the
areas that are covered by this
course (UH-DMML
Research Overview). Finally, having
basic knowledge in data mining is a plus when you are looking for a job in
industry and at major US research
institutions, such as the Texas Medical Center in Houston or at Federal Research Labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a
data mining project. In the first 9
weeks, a very basic introduction to data mining will be given. After defining
what knowledge discovery and
data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Exploratory data analysis, centering on basic visualization techniques and statistics,
will be covered as a means of getting a better understanding of the data mining task at hand.
Moreover, techniques for preprocessing a data set for a data mining
task will be introduced.
In the remaining 4-5 weeks of the semester, more advanced topics, including spatial data mining, advanced clustering and classification
techniques, sequence mining, and webpage ranking, will be discussed. Moreover, in
course projects you will obtain hands-on experience in conducting data mining projects. Finally,
R will be
used in most course projects; therefore, participants of the course will obtain
valuable experience in using the R statistics, data mining, and visualization
packages and will learn how to write programs in R and how to develop data mining software
on top of R.
A 2013 poll by Rexer Analytics found that R is currently the most
popular data mining tool: 24% of the respondents use R as their primary
tool, and only 30% of the respondents do not use R at all. Although R is a domain-specific language, it is
versatile.
In summary, having a sound background in data analytics and data mining and knowing R
well will open a lot of job opportunities for you, which, I believe, is a strong
reason to take the course. You may have to work a little more in this course
than in other courses, and perhaps not everybody taking this course will get an A,
but you should also consider the merits of completing this course!
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@uh.edu
Basic Course Information
Instructor: Dr.
Christoph F. Eick
office hours (573 PGH) TU 10-11a TH 2-3p
e-mail: ceick@uh.edu
TA: Arko Barman
office hours: TU 2-3 TH 3-4p in ????
2014 TA website: Arko
Barman's COSC 6335
Website
Google COSC
6335 News Group
class meets: TU/TH 11:30a-1p
cancelled classes: TBDL
Makeup class (if necessary): Tu., December 9, 11:30a-1p
Course Materials
COSC 6335 Syllabus
Objectives Data Mining Course
Required Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
- Addison Wesley, 2006,
- Link to
Book HomePage
Recommended Texts:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques
- Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home
Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
News COSC 6335 (Data Mining) Fall 2014
- Project 1 groups have been set up; there will be 7 groups of 3 (2) students; which group each
student belongs to has been listed below.
- Dr. Eick will not have his regular office hours on Tu., Sept. 9 10-11a; however, he will
have makeup office hours 4-4:45p on Tuesday, September 9.
- 2013 Textbook Reading Assignments (updated on November 15,
2013): August 29: 1-11; Sept. 2: 19-36;
September 4: 97-131;
September 8: 65-84; September 13: 487-496; September 15: 496-508; September
15: 510-515; September 18: 526-532; September 22: 569-577;
September 24: 327-335; October 1: 532-544; October 3: 145-162;
October 7: 162-166, 168-178 and 184-188; October 11: 335-351,
lecture;
October 27: 44-64;
November 6: read the web link documents of the "Short Introduction to Spatial Data Mining"
November 8: 223-226, 256-281; November 10: 131-139; November 14: 515-526,
532-553; November 20: 415-423, 426-441; November 28: Read discussions about K-means, kNN, decision trees, and
APRIORI of the "Top 10 Data Mining Algorithms Paper".
- The course will use a mixture of group and individual projects. Course projects (unless specified otherwise) and other assignment
tasks are individual activities; therefore, collaborating with other students or, in group projects, with students from other groups
is not allowed (also see academic honesty section near the end of
this webpage). Most likely there will be 4 course projects in the course; Project2 will be an individual project, whereas
the other 3 projects will be group projects with group sizes of 2-4 students.
Important Dates in 2014
Th., September 4: Arko Barman (the TA of this course) will be teaching an R-Lab centering on R basics and knowledge
necessary for Project1 (please
bring your laptop with R installed and read the Project1 specification prior to the lab; it will be posted
no later than September 2 on this website)
Either Tu., November 4 or Th., October 23: Midterm Exam (should know by September 8, when Dr. Eick's
travel has been finalized)
Tu+Th., November 4+6: Guest lecturers will be teaching on those days, as Dr. Eick will be attending the ACM
SIGSPATIAL GIS Conference in Dallas that week.
Tu., November 18: Project3 Student Presentations in 563 PGH
Tu., December 9, 11:30a-1p: Potential Makeup Class for cancelled classes...
Th., December 11, 11a-2p: Final Exam
Review Sessions (last 35 minutes of lecture): October 2, October 21, either December 4 or December 9
2013 Reviews
Review1 (try to solve problems 2-4! Solutions
to be discussed on October 8)
Review2 (solution sketches; the answer
to question 2d has been corrected (in red); was discussed
during the lecture on October 31).
Review3 (will take place on
Tuesday, December 10. Solve the 7 review questions!)
Prerequisites
The course is mostly self-contained. However, students taking the course
should have
sound software development skills and very basic knowledge of Java.
Course Elements and Their Tentative Weights for 2014
Course Projects (4): 41%
Exams (2): 58% (midterm: 26%; final exam: 32%)
Class Attendance: 1%
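As a rough illustration (not an official grading tool), the tentative weights above could be combined into a single course score as sketched below. The function name `course_score`, the assumption that every course element is scored on a 0-100 scale, and the sample student are all made up for this sketch.

```python
# Tentative 2014 weights from the list above, in percent of the course grade.
WEIGHTS = {"projects": 41, "midterm": 26, "final": 32, "attendance": 1}


def course_score(scores):
    """Combine per-element scores (assumed 0-100 each) into a 0-100 course score."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must add up to 100%")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student; these numbers are invented for illustration only.
example = {"projects": 90, "midterm": 80, "final": 85, "attendance": 100}
print(course_score(example))
```

Because the weights are tentative, only the `WEIGHTS` dictionary would need to change if they are revised.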
2014 Projects
Project1: Exploratory Data Analysis for a Banknote
Authentication Dataset using R (Group Project; groups of size 3 (or 2);
deadline Fr. September 26, 11p (electronic submission)).
Project1 Groups
Anchlia,Puja A
Boumber,Dainis A
Dunbar,Arthur P B
Ghimire Khatiwada,Yamuna B
Hu,Zhiguan B
Jidagam,Rohith C
Kajol,. C
Memariani,Ali C
Nagineni,Kaushik D
Nandamuri,Anil Kumar D
Pampana,Renuka D
Parmar,Kunal Jagdishbhai E
Patil,Pramey Prakash E
Pavuluri,Bhagyasri E
Pham,Nguyen Dinh F
Ravipati,Prudhvi F
Settipalli,Agnjani Swetha F
Shi,Yiwen G
Yang,Yajun G
Zhang,Yongli G
COSC 6335: Data Mining Lecture Notes
I Introduction to Data Mining (Part1, Part2,
Part3,
Differences
between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and
Decision Trees
IV Introduction to Similarity Assessment
and Clustering (AGNES (not covered) and DBSCAN;
R-scripts demonstrating: K-means/medoids,
DBSCAN)
V Association Analysis (Part1, Part2)
VI A Short Introduction to Data Cubes
VII Preprocessing for Data Mining
VIII Introduction to Spatial Data Mining
(Introduction to Regional Knowledge Extraction,
Introduction to Region Discovery,
Example Fitness Functions (to
be used in the Region Discovery Lecture), Introduction
to the CLEVER Region Discovery Algorithm,
Dr. Eick's Report 2011 ACM-GIS Conference,
Spatial Regression (not covered
in 2011))
IX More on Clustering and Outlier Detection: Grid-based,
Density-based Clustering, and Subspace Clustering (
Non-Parametric Density Estimation R Demo),
Cluster Validity,
Anomaly/Outlier Detection.
X Introduction to R (R Preliminaries,
R Data Structures, R Graphics, Functions in R),
Rar-file (contains the .r (dot r) files, which are the scripts
presented in class, namely
kmean.r,
linearRegression.r,
linearRegression_iris_2.r,
and scatterlot3d_1.r.
The file RCommander_LAB_EXAMPLES.r contains all the scripts.
The rest are the powerpoints and the datasets used
), Sample Classification Plots in R)
XI More on Classification: Instance-based Learning, Support Vector Machines,
Editing, Ensembles, and ROC-Curves
(NN-Classifiers and Support Vector Machines
(Prof. Sastry's Introduction
to SVM),
Editing and Condensing Techniques for NN-Classifiers
(to be covered in Fall 2013), Ensembles and ROC Curves,
Model Evaluation).
XII The PageRank Algorithm (taught the first time in 2012; Gleich 2009 Dissertation Defense centering
on PageRank at
Stanford University)
XIII Top Ten Algorithms in Data Mining
(Top-10 Panel, Top10)
XIV Miscellaneous: Experiences in Finding
Data Mining/Internet/HPC Jobs in Industry,
2009 Netflix Contest,
90 Days at Yahoo! and Final Words
Order of Teaching in 2013 (subject to change):
I-II-IV-III-V(Part1)-V(Part2)-VII-XI-VIII-IX(only covering
hierarchical clustering, DENCLUE and subspace clustering in 2013)-XII-XIII-VI-XIV.
Remarks: topics XIII and XIV will likely be only partially covered. R (Topic X) will be
covered in a lab
in early September and as part of the lectures centering on Exploratory Data Analysis and
Clustering Part1.
2013 Projects
Project1: Exploratory Data Analysis for a Pima Indians Diabetes Dataset using R (Post Analysis Project1;
Group Project; groups of size 3 (or 2)).
Project2: Traditional Clustering with K-means and DBSCAN
(Individual Project, Yeast CSV-File,
Randomized Hill Climbing Slides,
Hints Task4 Project2,
Post Analysis Project2,
Project2 Scores (scores marked by '?' are
only preliminary and might change))
Project 3:
Something Interesting about Finding Interesting Associations in Large Amounts
of Data
(Group Project (Groups of 4), Project3 Q&A,
Project3 Scores)
Project4: Reading, Understanding,
Summarizing and Reviewing of Data Mining Papers
(2-person Group Project;
paper candidates (choose one!):
Paper 1,
Paper 2,
Paper 3,
Paper 4;
Project4 Scores)
2013 Project Weights: Project1:1, Project2:2.0, Project3: 1.3,
Project4: 1.2.
2013 Project Scores (please, verify!)
2011 Review Sessions
The review sessions will discuss questions which typically will be posted
1-5 days prior to the review session; review sessions will take about 30 minutes
and are typically held 10:45-11:15a. It is important that
you try to answer the review questions before
the review session!
2012 Review Questions
Questions 2012 Review1
Question 2012 Review2
Question 2012 Review3
2011 Review Questions
Questions October 4
Questions October 20
Questions November 22
Questions December 1
A Few Results 2012 DM Questionnaire
Student Preferences and other: 19 students joined UH in Fall 2012; 15 students joined UH earlier.
Student Languages: Languages that
students spoke as a child (based on 34 responses; if students
listed more than 2 languages only the first two languages
were counted): English(22), Telugu(12), Hindi(12), Chinese(4), Tamil(2), Greek(2), Bulgarian(1), Arabic(1), Urdu(1), French(1), Persian(1), Marathi(1), Malayalam(1).
A Few Results 2011 DM Questionnaire
Student Preferences: Of a group of 31 students (neutral statements
were not counted),
26 students like group projects and 3 students dislike group projects;
25 students like reading scientific papers and 4 students dislike reading
scientific papers; 23 students like projects which involve a significant
amount of
programming and 6 students dislike such projects; 17 students like to give presentations and 5 students dislike giving presentations.
Student Languages: Languages that
students spoke as a child (based on 29 responses; if students
listed more than 2 languages only the first two languages
were counted): English(16), Telugu(6), Hindi(6), Chinese(5), Tamil(5),
Spanish(3), African(1), French(1), Nepali(1), Marathi(1), Kannada(1).
2011 Projects
Project1: Exploratory Data Analysis for
the Vehicle Silhouette Dataset using R.
Project2: Clustering with K-means and DBSCAN
Project 3: Extracting Regional
Knowledge from Spatial Datasets: Clustering with Plug-in
Interestingness Functions with CLEVER
Project 4: Something Interesting About
Finding Something Interesting (Group Project;
Slides Project4 Student Presentations,
Video taken of the Event)
Project 5: Learning and Assessing
Classification Models
2010 Assignments
Assignment1 (see Chun-sheng's Website)
Assignment 2
Assignment3
(Earthquake 2010 Dataset)
Assignment4
First Draft of Assignment 5
2010 Review Sessions
There will be 30-minute review sessions on September 23,
October 12, November 16, and November 30. Review questions will be
posted here. Occasionally, review questions will discuss paper-and-pencil
problems of assignments.
Review Questions for September 23
Review Questions for October 12
Review Questions for November 11
Review Questions for November 30
Grading
Each student has to
have a weighted average of 74.0 or higher in the
exams of the course in order to receive a grade of "B-" or better
for the course.
Students will be responsible for material covered in the
lectures and assigned in the readings. All assignment and
project reports are due at the date specified.
No late submissions
will be accepted after
the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance at randomly
chosen dates, and an attendance score will be computed from how many
of those lectures you attended.
Translation of numbers to letter grades:
A:100-90 A-:90-86 B+:86-82 B:82-77 B-:77-74 C+:74-70
C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
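The table above and the exam rule at the start of this section can be sketched in code as follows. This is only an illustration under two assumptions that the table does not state explicitly: a score that falls exactly on a boundary (for example 90) receives the higher grade, and the "weighted average" of the exams uses the tentative 2014 exam weights (midterm 26%, final 32%). The function names are made up for this sketch.

```python
# (lower bound, letter) pairs taken from the table above, highest grade first.
# Assumption: a score exactly on a boundary receives the higher grade.
CUTOFFS = [(90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
           (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"), (0, "F")]


def letter_grade(score):
    """Translate a 0-100 course score into a letter grade."""
    for lower, letter in CUTOFFS:
        if score >= lower:
            return letter
    raise ValueError("score must not be negative")


def exam_average(midterm, final, w_midterm=26, w_final=32):
    """Weighted exam average; the default weights are an assumption (see above)."""
    return (w_midterm * midterm + w_final * final) / (w_midterm + w_final)


def may_receive_b_minus_or_better(midterm, final):
    """Exam rule: a weighted exam average of 74.0 or higher is required."""
    return exam_average(midterm, final) >= 74.0
```

For example, `letter_grade(81.5)` falls into the 82-77 range and yields "B", while a student with a midterm score of 70 and a final exam score of 75 would not satisfy the exam rule under these assumptions.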
Only machine-written solutions
are accepted in the assignments (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we are unable to understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Past Data Mining Exams
2008 Midterm Exam
2007 Final Exam
2009 Midterm Exam with Solution Sketches
2009 Final Exam with Solution Sketches
2010 Midterm Exam with Solution Sketches
2011 Midterm Exam with Solution Sketches (some
typos in the solution of Problem 5b have been corrected on
November 29, 2011)
2012 Midterm Exam with Solution Sketches
2013 Midterm Exam with Solution Sketches
2010 Final Exam with Solution Sketches
2011 Final Exam with Solution Sketches
Summary Answers COSC 6335 2009 Student Questionnaire
Student Language Summary Registered
Students: English:14, Hindi:9, Telugu:7, Bengali:2,
Vietnamese:2, Arabic:2, Sindhi:1,
French:1, Russian:1, Turkish:1, Kyrgyz(?):1, Tamil:1, Filipino:1, Spanish:1,
Urdu:1, Garhwali(?):1, Chinese:1; I am impressed: some of you spoke up to
four languages as a child! Concerning group projects, 11 students
liked group projects, 2 students disliked group projects, and 9 students
had no preference. Concerning reading scientific papers, 12 students liked
reading scientific papers, 3 students disliked it, and the rest of the students
were neutral or gave fuzzy answers such as "I like reading paper that are interesting.".
15 students like giving presentations and 4 students didn't. Concerning
projects that involve significant amounts of programming, 16 liked them and
3 didn't like them.
Master Thesis and Dissertation Research in Data Mining
If you plan to perform a dissertation or Master thesis project in the area of
data mining, I strongly recommend
taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following
courses: Pattern Classification (COSC 6343), Artificial
Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing
about evolutionary computing (COSC 6367) will be helpful, particularly
for designing novel data mining algorithms.
Moreover, having basic knowledge in data structures, software design, and
databases is important when conducting
data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice.
Moreover, taking a course that teaches high performance computing is also
desirable, because most data mining algorithms are very resource intensive.
Finally, having some knowledge
in the following fields is a plus: numerical optimization techniques, image
processing, statistics, geographical information systems (GIS), agent-based
systems and data visualization.
Also be aware of the fact that having sufficient background in the above
listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge
in data mining, machine learning, statistics and related areas. Similarly, you
will not be hired as an RA for a
data mining project without having some background in data mining.
Data Mining Links
KDnuggets
2013 Rexer Analytics Data Mining Software Highlights
Netflix $1,000,000 Grand Prize
SPMF (Sequential Pattern
Mining Framework)
Magnum Opus Data Mining Framework
UIUC Data Mining Group
Microsoft DMX Group
UMN Spatial Database and Spatial Data Mining Group
Data Mining and Machine Learning Group University
of Helsinki
Houston R Group
UH's Data Mining and Machine Learning Group (UH-DMML)
Data Mining Conferences and Journals
RapidMiner (formerly Yale)