last updated: December 6, 2007


COSC 6397: Data Mining, Fall 2007
(Dr. Eick)

Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as
well as academia. For example, 776 papers were submitted
to the 2006 IEEE International Conference
on Data Mining (ICDM), which was held in Hong Kong in December 2006. Data mining tools and
suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and
in research projects.
UH's Data Mining and Machine Learning Group
(UH-DMML) conducts research in some of the areas that are covered by this
course. Moreover, we are currently developing Cougar^2, an open source data mining and machine
learning platform
that will be used in part in the course projects. Finally, having
basic knowledge of data mining is a plus when you are looking for a job in
industry or at major US research
institutions, such as the Texas Medical Center in Houston or federal research labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 8
weeks a very basic introduction to data mining will be given. After defining what knowledge discovery and
data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Moreover, techniques for preprocessing data for a data mining task will be covered. Basic
visualization techniques and statistical methods will also be introduced. Finally, in the remaining
5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification
techniques, and mining sequence and streaming data will be discussed.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@cs.uh.edu
Basic Course Information
Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 11:30a-12:30p and TH 4-5p
e-mail: ceick@uh.edu
Teaching Assistant: Ying Wei Kuo
office hours (313 PGH): TU 4-5p, TH 1:30-2:30p
e-mail: ykuo@cs.uh.edu
Ying Wei's Webspace
class meets: TU/TH 2:30-4p in 232 PGH
cancelled classes: TBDL
makeup classes: TBDL (if necessary)
class room: 232 PGH
Course Materials
Required Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
- Addison Wesley,
- Link to Book HomePage
Mildly Recommended Text
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
- Morgan Kaufmann Publishers, second edition.
- Link to Data Mining Book Home Page
Programming Project Part3
Draft Part3 Specification (updated
on October 26, 2007)
Groups and Algorithms assigned
1. PICPF-DBSCAN: Akash Bhatt, Francisco Ocegueda-Hernandez
2. SRIDHCR: Karthik Karra and Krushita Shah
3. RG: Shweta Savkar and Nishta Sharma
4. SCAH: Gayathri Subramanian and Deeptha Janakiraman
5. RG: Raja Yalamanchili and Ravi Mehta
6. SCMRG: Xiaofan Wu, Ashish Kapadia, and Varun Raheja
7. RG: Madhura Koppoli and Shyamali Balasubraminiyan
News COSC 6397 (Data Mining) Fall 2007
- The semester will come to an end soon. You should receive a grade summary
for the Data Mining course by e-mail no later than Dec. 10, and a more
detailed grade report will be posted on the web no later than Dec. 15, 2007. If
you have any questions concerning grading, feel free to send Dr. Eick and
Yingwei an e-mail, and
we can get together to discuss your problem in January when I am back on campus.
Moreover, if you have any comments about the course you would like to share with Dr.
Eick and/or Yingwei, feel free to send an e-mail.
I enjoyed teaching the course, and I am looking forward to seeing most of you
again in 2008!
- The final
exam will not be returned to students; however, you can see your
final exam on Th., January 17, 4-5p and on Tu., January 22, 10-11a.
- Concerning the grading of the group project, we decided that the
students belonging to the same
group will not necessarily receive the same score. In some groups, students
showed different levels of activity, and in some groups a single
student gave the presentation alone. If you believe that the
uneven score distribution in your group misrepresents the contribution
of the group members, feel free to send an e-mail to Yingwei and Dr. Eick.
- Homework4 has been posted; it is due
Monday, November 26, 11:59p (electronic submission). Your submission should
include the fifth problem of
Homework3!
- We recommend installing the following software on your
laptop: Java JDK 1.5, Eclipse 3.2 or higher, and
Weka 3.5.6.
- Reading assignments for upcoming weeks: August 24: pages 1-29; August 28:
pages 29-44 (note: sections 2.3 and 2.4 will be covered later) and 97-105;
August 30: pages 105-131 (section 3.4 will be covered later);
September 4: pages 144-158; September 6: pages 158-179; September 11:
pages 179-193; September 13: pages 65-84 (section 2.3) and 488-496;
September 22: pages 496-513; October 5: pages 516-526; October 10: sections
9.1.3 and 9.3.1 (pages 573-574 and 601-603); October 17: pages
327-358 and 131-140; October 19: pages 370-386; October 24: pages 415-422 and 426-435;
October 26: pages 36-57; October 30: pages 57-80; November 7: pages 526-550 and 604-612;
November 16: pages 223-227 and 256-276; November 19: pages 276-294; November 21: pages 298-306.
- Course Syllabus
2007 Exam Dates and Other Deadlines:
Exams: October 18, November 29
Homework due dates: Homework1: September 25; Homework2 (short): October 13;
Homework3 (long): November 11; Homework4 (short): November 25
Programming Project due dates: Part1: September 13; Part2: October 6/11; Part3: November 19
Prerequisites
The course is mostly self-contained. However, students taking the course should have
sound software development skills and basic knowledge of Java. Lacking these skills will likely
cause trouble when working on the two course projects.
The 2007 Offering of Data Mining
The
teaching in 2007 will be similar to the offering in 2006. The teaching in the first
8 weeks will closely cover material from the course textbook,
which also comes with good online teaching material. There will be
2 course projects this semester. The first project centers on the implementation of a data mining
algorithm; the second project centers on applying data mining techniques to a real world dataset.
Moreover, 3-4
homeworks will be given that contain short, review-style questions; answers to
these exercises will be covered in review-style lectures every third Thursday.
Finally, software design for data mining will
be covered in the course in part, and students will be exposed to the
Java-based Cougar^2 Data Mining
and Machine Learning Environment that is currently under development by the
UH-DMML research group. In addition to learning how to design
and implement data mining algorithms and how to interpret data mining results,
participating in the course project
will help you obtain valuable experience in Eclipse development, Java core
development, object oriented analysis & design, design patterns, and XML
technology. Having knowledge of and experience in using these
technologies will
also help you get a job in the software industry.
In any case, students
who take the course should be familiar with the basic concepts of Java. If
you have doubts
about this prerequisite, feel free to contact Dr. Eick or members
of the UH-DMML research group about this matter.
Course Elements and Their Weights in 2007
Programming Projects and Homeworks: 30-42%
Exams and Quizzes (2-3): 56-68%
Class Participation: 2%
2007 Programming Project
Part1
Specification
Preliminary Draft Part2 Specification
(Experiment Guide)
Part3
Tentative 2007 Teaching Plan and Transparencies (subject to change!)
I Introduction to Data Mining (Part1, Part2,
Part3; covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and
Decision Trees
IV Introduction to Similarity Assessment
and Clustering (AGNES and DBSCAN,
Critical Issues with Respect to
Clustering)
V A Short Introduction to Data Cubes
VI Association Analysis (Part1, Part2)
VII Preprocessing for Data Mining (mostly covers
Tan Chapter 2, was added on Oct. 26, 2006)
VIII More on Classification: Instance-based Learning, Support Vector Machines,
Editing, Ensembles, ROC-Curves
(NN-Classifiers and Support Vector Machines,
Editing and Condensing Techniques for NN-Classifiers,
Ensembles and ROC Curves,
Model Evaluation)
IX Brief Introduction to Spatial Data Mining
X More on Clustering: Grid-based,
Density-based Clustering, and Outlier Detection
XI Software Design for Knowledge Discovery Projects and Background Knowledge
for Project1 (Software Design in General,
Introduction to Region Discovery, Region Discovery Technology (please
read the first 8 pages of the Wordfile), Weka Introduction Transparencies,
Introduction to the CLEVER Region Discovery Algorithm,
Background
Material for Part3 of Project1)
XII Mining Data Streams, Online Data Mining and Incremental Data Mining
XIII Fayyad's KDD 2007 Innovation Talk and Final Words
Remark: The teaching plan will be updated continuously.
Grading
Each student has to
have a weighted average of 74.0 or higher in the
exams of the course in order to receive a grade of "B-" or better
for the course.
Students will be responsible for material covered in the
lectures and assigned in the readings. All homeworks and
project reports are due on the date specified.
No late submissions
will be accepted after
the due date. This policy will be strictly enforced.
Translation of numeric scores to letter grades:
A: 100-90 A-: 90-86 B+: 86-82 B: 82-77 B-: 77-74 C+: 74-70
C: 70-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
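The translation table can be read as a simple threshold lookup. The following Python sketch is only an illustration, not the official grading procedure: the function name `to_letter_grade` is hypothetical, and the table does not say which grade a score exactly on a boundary (for example 90) receives, so the sketch assumes that a boundary score earns the higher grade.

```python
# Cutoffs copied from the translation table above: (lower bound, letter).
# Assumption (not stated in the table): a score exactly on a boundary,
# e.g. 90 or 74, receives the higher of the two adjacent grades.
GRADE_CUTOFFS = [
    (90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def to_letter_grade(score: float) -> str:
    """Translate a numeric course score (0-100) into a letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Cutoffs are sorted from highest to lowest, so the first match wins.
    for lower_bound, letter in GRADE_CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"


if __name__ == "__main__":
    for example in (95, 81.5, 73.9, 42):
        print(example, to_letter_grade(example))
```

The sketch covers only the number-to-letter table; it does not model the separate requirement of a weighted exam average of 74.0 or higher for a grade of "B-" or better.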
Only machine-written solutions to homeworks and assignments
are accepted (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we cannot understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material, to clarify the meaning of homework problems, or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
homework problems or the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in a grade of F
in the course and in further prosecution.
Master's Thesis and Dissertation Research in Data Mining
If you plan to perform a dissertation or Master's thesis project in the area of
data mining, I strongly recommend
taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following
courses: Pattern Classification (COSC 6343), Artificial
Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing
about evolutionary computing (COSC 6367) will be helpful, particularly
for designing novel data mining algorithms.
Moreover, having basic knowledge of data structures, software design, and databases is important when conducting
data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice.
Taking a course that teaches high performance computing is also
desirable, because most data mining algorithms are very resource intensive.
Because a lot of data mining projects have to deal with images, I
suggest taking at least one of the many
biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge
of the following fields is a plus: software engineering, numerical optimization techniques, statistics, and data visualization.
Also be aware of the fact that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining.
I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge
of data mining, machine learning, statistics and related areas. Similarly, you
will not be hired as an RA for a
data mining project without having some background in data mining.
Data Mining Links
KDnuggets
KDD 2006 Conference
IEEE International Conference on Data Mining (ICDM) Website
PKDD 2006 (European KDD Conference)
UIUC Data Mining Group
Microsoft DMX Group
Penn Data Mining Group
UMN Spatial Database and Spatial Data Mining Group
Vrije Universiteit Amsterdam Data Mining Group
Data Mining and Machine Learning Group University of Helsinki
Data Mining at Massey University
Weka Data Mining Software in Java
RapidMiner (formerly Yale)