last updated: December 1, 2006


COSC 6397 --- Data Mining, Fall 2006
(Dr. Eick)

Goals of the Data Mining Course
Data mining centers on finding novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as
well as academia. For example, 776 papers were submitted
for the 2006 IEEE International Conference
on Data Mining (ICDM) that will be held in Hong Kong in December 2006. Data mining tools and
suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and
in research projects.
UH's Data Mining and Machine Learning Group
(UH-DMML) conducts research in some of the areas that are covered by this course. Moreover, having
basic knowledge in data mining is a plus when you are looking for a job in industry and at major US research
institutions, such as the Texas Medical Center in Houston or at Federal Research Labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 8
weeks a very basic introduction to data mining will be given. After defining what knowledge discovery and
data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Moreover, techniques for preprocessing a data set for a data mining task will be covered. Also basic
visualization techniques and statistical methods will be introduced. In the remaining
5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification
techniques, and mining streaming data will be discussed.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@cs.uh.edu
Basic Course Information
Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 11:30a-12:30p and TH 4-5p
e-mail: ceick@uh.edu
class meets: TU/TH 2:30-4p in 232 PGH
cancelled classes: TBDL
makeup classes: Friday, October 27, 2-5p (if necessary)
class room: ???
Course Materials
Required Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley
- Link to Book HomePage
Mildly Recommended Text
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, second edition.
- Link to Data Mining Book Home Page
News COSC 6397 (Data Mining) Fall 2006
- The final grade for the course will be computed by weighting the different
parts of the course as follows: Midterm=27%; FinalExam=33%; Homeworks=8%;
Project=32%. The final course grade will be e-mailed to the students in the
class no later than Th., Dec. 14, 2006.
- There will be a Post-Project meeting on Th., Dec. 7, 2:30-4p; everybody
is expected to attend. Details
about the meeting will be announced on Dec. 5 on this webpage!
- Check this webpage regularly between Dec. 3 and Dec. 6, especially for Project1 updates.
- Please e-mail a brief status report concerning Part4 of the Course
project to Dr. Eick and Abraham no later than Dec. 1, 11p.
- The first draft of Part4 of the Region Discovery Project is
available now; please read it! Part4 is
due on We., Dec. 6, 11:59p and this part will conclude the project. Most
likely there will also be a Project Wrap Up Event on Dec. 7, 2006.
- The softcopies of Homework2 have been lost: please resend your softcopy
to ceick@aol.com if you haven't done so...
- Homework3 with some solution sketches has
been posted.
- Reading assignments for upcoming weeks: August 24: pages 1-29; August 28:
pages 29-44 (remark: sections 2.3 and 2.4 will be covered later) and 97-105;
August 30: pages 105-131 (section 3.4 will be covered later);
September 4: pages 144-158; September 6: pages 158-179; September 11:
pages 179-193; September 13: pages 65-84 (section 2.3) and 488-496;
September 22: pages 496-513; October 5: pages 516-526; October 10: sections
9.1.3 and 9.3.1 (pages 573-574 and 601-603); October 17: pages
327-358 and 131-140; October 19: pages 370-386; October 24: pages 415-422 and 426-435;
October 26: pages 36-57; October 30: pages 57-80; November 7: pages 526-550 and 604-612;
November 16: pages 223-227 and 256-276; November 19: pages 276-294; November 21: pages 298-306.
2006 Exam Dates and Other Deadlines:
Exams: October 12, November 30
Homework due dates: Sept. 28, October 7, November 15
Project 1 due date: multiple due-dates; deadline for final project
submissions will be: December 6, 2006, 11p.
Prerequisites
The course is mostly self-contained. However, students taking the course should have
sound software development skills and basic knowledge of Java. Lacking these skills will likely
cause trouble when working on the two course projects.
The 2006 Offering of Data Mining
The teaching in 2006 will be quite different from the 2005
teaching of the Data Mining course. A different textbook will be
used (see Course Materials) that comes with good online teaching material. There will be
2 course projects this semester. The first project centers on the implementation of a data mining
algorithm; the second project centers on applying data mining techniques to a real world dataset. Students also
will make a 10-15 minute presentation, discussing the results of one of the
two course projects. Moreover, 4
homeworks will be given that contain short, review-style questions; answers to
these exercises will be covered in review-style lectures every third Thursday.
Finally, software design for data mining will
be covered in the course, and students will be exposed to the
Java-based Cougar-Square Data Mining and Machine
Learning Environment that is currently under development by the
UH-DMML research group. In addition to learning how to design
and implement data mining algorithms, participation in the course project
will help you obtain valuable experience in Eclipse development, Java core
development, object oriented analysis & design, design patterns, and XML
technology. Having knowledge and experience in using these
technologies will
also help you get a job in the software industry.
In any case, students
who take the course should be familiar with the basic concepts of Java --- if
you have doubts
about this prerequisite, feel free to contact Dr. Eick or members
of the UH-DMML research group about this matter.
Course Elements and Their Weights in 2006
Course Projects (2): 25-31%
Homeworks: 8-14%
Exams(2): 58-62%
Class Participation: 2-4%
Tentative 2006 Teaching Plan and Transparencies (subject to change!)
I Introduction to Data Mining (Part1, Part2,
Part3 --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part)
III Introduction to Classification: Basic Concepts and
Decision Trees
IV Introduction to Similarity Assessment
and Clustering (Grid-based,
Hierarchical and Density-based Clustering,
Critical Issues with Respect to Clustering --- covered in
part; other parts will be covered in Part IX)
V A Short Introduction to Data Cubes
VI Association Analysis (Part1, Part2)
VII Preprocessing for Data Mining (mostly covers
Tan Chapter 2, was added on Oct. 26, 2006)
VIII More on Clustering: Grid-based, Hierarchical and Density-based Clustering
(more on AGNES and DBSCAN),
Critical Issues with Respect to Clustering,
Supervised Clustering.
IX Spatial Data Mining (Spatial Databases (not covered),
Spatial Data Mining (first 3 quarters will be covered ---
time permitting))
X Software Design for Knowledge Discovery Projects and Background Knowledge
for Project1 (Software Design in General,
Introduction to Region Discovery, Region Discovery Technology (please
read the first 8 pages of the Wordfile), Background
Material for Part3 of Project1)
XI More on Classification: Instance-based Learning, Support Vector Machines,
Editing, Ensembles, ROC-Curves
(NN-Classifiers and Support Vector Machines,
Editing and Condensing Techniques for NN-Classifiers,
Ensembles and ROC Curves)
XII Mining Data Streams, Online Data Mining and Incremental Data Mining
XIII Final Words
Remark: The teaching plan will be updated continuously.
Master Thesis and Dissertation Research in Data Mining
If you plan to pursue a dissertation or Master's thesis project in the area of
data mining, I strongly recommend
taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following
courses: Pattern Classification (COSC 6343), Artificial
Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing
about evolutionary computing (COSC 6367) will be helpful, particularly
for designing novel data mining algorithms.
Moreover, having basic knowledge in data structures, software design, and databases is important when conducting
data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice.
Taking a course that teaches high performance computing is also
desirable, because most data mining algorithms are very resource intensive.
Because a lot of data mining projects have to deal with images, I
suggest taking at least one of the many
biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge
in the following fields is a plus: software engineering, numerical optimization techniques, statistics, and data visualization.
Also be aware that having sufficient background in the areas listed above is a prerequisite for being considered
for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or
dissertation advisor if you do not have basic knowledge
in data mining, machine learning, statistics and related areas. Similarly, you
will not be hired as an RA for a
data mining project without having some background in data mining.
Grading
Each student has to
have a weighted average of 74.0 or higher in the
exams of the course in order to receive a grade of "B-" or better
for the course.
Students will be responsible for material covered in the
lectures and assigned in the readings. All homeworks and
project reports are due on the date specified.
No submissions will be accepted after
the due date. This policy will be strictly enforced.
Translation of numbers to letter grades:
A: 100-90  A-: 90-86  B+: 86-82  B: 82-77  B-: 77-74  C+: 74-70
C: 70-66   C-: 66-62  D+: 62-58  D: 58-54  D-: 54-50  F: 50-0
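As a rough illustration of the grading arithmetic, the following Python sketch combines the weights announced in the course news (Midterm=27%, FinalExam=33%, Homeworks=8%, Project=32%) with the translation table above and the 74.0 exam rule. All function and variable names are hypothetical. Three points are assumptions that this page does not state: a score exactly on a boundary receives the higher grade, the exam average weights the two exams 27:33, and a student whose exam average is below 74.0 is capped at "C+". This is not the official grade computation.

```python
# Weights announced in the course news, in percent of the final score.
WEIGHTS = {"midterm": 27, "final_exam": 33, "homeworks": 8, "project": 32}

# Lower bounds taken from the translation table, best grade first.
# ASSUMPTION: a score exactly on a boundary (e.g. 90) gets the higher grade.
LETTER_CUTOFFS = [
    (90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"), (0, "F"),
]

# Grades that require an exam average of 74.0 or higher.
B_MINUS_OR_BETTER = {"A", "A-", "B+", "B", "B-"}


def weighted_score(scores):
    """Combine the four parts (each on a 0-100 scale) into one score."""
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS) / 100.0


def exam_average(scores):
    """Weighted average of the two exams.

    ASSUMPTION: the exams are weighted 27:33, as in the final grade.
    """
    exam_weight = WEIGHTS["midterm"] + WEIGHTS["final_exam"]
    total = (WEIGHTS["midterm"] * scores["midterm"]
             + WEIGHTS["final_exam"] * scores["final_exam"])
    return total / exam_weight


def letter_grade(score):
    """Translate a numeric score into a letter grade."""
    for cutoff, letter in LETTER_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


def course_grade(scores):
    """Letter grade after applying the 74.0 exam-average rule.

    ASSUMPTION: a student below the exam threshold is capped at "C+".
    """
    letter = letter_grade(weighted_score(scores))
    if letter in B_MINUS_OR_BETTER and exam_average(scores) < 74.0:
        return "C+"
    return letter


if __name__ == "__main__":
    example = {"midterm": 80, "final_exam": 90, "homeworks": 100, "project": 85}
    print(weighted_score(example), course_grade(example))
```

In this sketch, strong project and homework scores cannot lift a student to "B-" or better when the exam average stays below 74.0, which is the effect the exam rule describes.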
Only machine-written solutions to homeworks and assignments
are accepted; the only exceptions are figures and complex formulas.
Be aware that our
only source of information is what you have turned in. If we cannot understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project itself. We encourage students to help each other understand course
material, to clarify the meaning of homework problems, or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
homework problems or the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Course Exams
Midterm
Th., October 12, 2006.
Final Exam
Th., November 30, 2006
Data Mining Links
KDnuggets
KDD 2006 Conference
IEEE International Conference on Data Mining (ICDM) Website
PKDD 2006 (European KDD Conference)
UIUC Data Mining Group
Microsoft DMX Group
Penn Data Mining Group
UMN Spatial Database and Spatial Data Mining Group
Vrije Universiteit Amsterdam Data Mining Group
Data Mining and Machine Learning Group University of Helsinki
Data Mining at Massey University