last updated: December 6, 2007

COSC 6397: Data Mining, Fall 2007 (Dr. Eick)



Goals of the Data Mining Course

Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data. It aims at transforming a large amount of data into a well of knowledge. Data mining has become a very important field in industry as well as academia. For example, 776 papers were submitted to the 2006 IEEE International Conference on Data Mining (ICDM), which was held in Hong Kong in December 2006. Data mining tools and suites (for example, see KDnuggets' DM Software Survey) are widely used in industry and in research projects. UH's Data Mining and Machine Learning Group (UH-DMML) conducts research in some of the areas that are covered by this course. Moreover, we are currently developing Cougar^2, an open source data mining and machine learning platform that will be used in part in the course projects. Finally, having basic knowledge in data mining is a plus when you are looking for a job in industry or at major US research institutions, such as the Texas Medical Center in Houston or Federal Research Labs.

The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 8 weeks a very basic introduction to data mining will be given. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Moreover, techniques for preprocessing data for a data mining task will be covered. Basic visualization techniques and statistical methods will also be introduced. Finally, in the remaining 5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification techniques, and mining sequence and streaming data will be discussed.

Comments concerning this website

If you have any comments concerning this website, send e-mail to: ceick@cs.uh.edu

Basic Course Information

Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 11:30a-12:30p and TH 4-5p
e-mail: ceick@uh.edu
Teaching Assistant: Ying Wei Kuo
office hours (313 PGH): TU 4-5p, TH 1:30-2:30p
e-mail: ykuo@cs.uh.edu
Ying Wei's Webspace
class meets: TU/TH 2:30-4p in 232 PGH
cancelled classes: TBDL
makeup classes: TBDL (if necessary)
class room: 232 PGH

Course Materials

Required Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley,
Link to Book HomePage

Mildly Recommended Text
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Morgan Kaufmann Publishers, second edition.
Link to Data Mining Book Home Page

Programming Project Part3

Draft Part3 Specification (updated on October 26, 2007)

Groups and Algorithms assigned
1. PICPF-DBSCAN: Akash Bhatt, Francisco Ocegueda-Hernandez
2. SRIDHCR: Karthik Karra and Krushita Shah
3. RG: Shweta Savkar and Nishta Sharma
4. SCAH: Gayathri Subramanian and Deeptha Janakiraman
5. RG: Raja Yalamanchili and Ravi Mehta
6. SCMRG: Xiaofan Wu, Ashish Kapadia, and Varun Raheja
7. RG: Madhura Koppoli and Shyamali Balasubraminiyan

News COSC 6397 (Data Mining) Fall 2007

2007 Exam Dates and Other Deadlines:

Exams: October 18, November 29
Homework due dates: Homework1: September 25; Homework2 (short): October 13; Homework3 (long): November 11; Homework4 (short): November 25
Programming Project due date: Part1: September 13; Part2: October 6/11; Part3: November 19

Prerequisites

The course is mostly self-contained. However, students taking the course should have sound software development skills and basic knowledge of Java. Lacking these skills will likely cause trouble in the two course projects.

The 2007 Offering of Data Mining

The teaching in 2007 will be similar to the offering in 2006. The teaching in the first 8 weeks will closely cover material from the course textbook, which also comes with good online teaching material. There will be 2 course projects this semester. The first project centers on the implementation of a data mining algorithm; the second project centers on applying data mining techniques to a real world dataset. Moreover, 3-4 homeworks will be given that contain short, review-style questions; answers to these exercises will be covered in review-style lectures every third Thursday.

Finally, software design for data mining will be covered in the course in part, and students will be exposed to the Java-based Cougar^2 Data Mining and Machine Learning Environment that is currently under development by the UH-DMML research group. In addition to learning how to design and implement data mining algorithms and how to interpret data mining results, participation in the course project will help you obtain valuable experience in Eclipse development, Java core development, object oriented analysis & design, design patterns, and XML technology. Having knowledge and experience in using these technologies will also help you get a job in the software industry. In any case, students who take the course should be familiar with the basic concepts of Java; if you have doubts about this prerequisite, feel free to contact Dr. Eick or members of the UH-DMML research group about this matter.

Course Elements and Their Weights in 2007

Programming Projects and Homeworks: 30-42%
Exams and Quizzes(2-3): 56-68%
Class Participation: 2%
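
As a rough illustration of how the weights above combine, the following Java sketch computes an overall course score from component scores. The class name CourseScore, the method weightedScore, and the concrete weights (36% projects and homeworks, 62% exams and quizzes, 2% participation) are assumptions picked from within the stated ranges; the actual weights are set by the instructor.

```java
/** Hypothetical helper for illustration; not part of any course software. */
final class CourseScore {
    // Assumed weights, chosen from inside the ranges listed above; they sum to 1.0.
    static final double PROJECTS_AND_HOMEWORKS = 0.36;
    static final double EXAMS_AND_QUIZZES = 0.62;
    static final double PARTICIPATION = 0.02;

    /** Each argument is a score on a 0-100 scale; the result is on the same scale. */
    static double weightedScore(double projects, double exams, double participation) {
        return PROJECTS_AND_HOMEWORKS * projects
             + EXAMS_AND_QUIZZES * exams
             + PARTICIPATION * participation;
    }

    public static void main(String[] args) {
        // Example inputs: 90 on projects/homeworks, 80 on exams, full participation.
        System.out.printf("%.1f%n", weightedScore(90.0, 80.0, 100.0));
    }
}
```

Because the exam weight dominates under any choice within the stated ranges, exam performance has the largest effect on the overall score.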

2007 Programming Project

Part1 Specification
Preliminary Draft Part2 Specification (Experiment Guide)
Part3

Tentative 2007 Teaching Plan and Transparencies (subject to change!)

I Introduction to Data Mining (Part1, Part2, Part3 --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and Decision Trees
IV Introduction to Similarity Assessment and Clustering (AGNES and DBSCAN, Critical Issues with Respect to Clustering)
V A Short Introduction to Data Cubes
VI Association Analysis (Part1, Part2)
VII Preprocessing for Data Mining (mostly covers Tan Chapter 2, was added on Oct. 26, 2006)
VIII More on Classification: Instance-based Learning, Support Vector Machines, Editing, Ensembles, ROC-Curves (NN-Classifiers and Support Vector Machines, Editing and Condensing Techniques for NN-Classifiers, Ensembles and ROC Curves, Model Evaluation)
IX Brief Introduction to Spatial Data Mining
X More on Clustering: Grid-based, Density-based Clustering, and Outlier Detection
XI Software Design for Knowledge Discovery Projects and Background Knowledge for Project1 (Software Design in General, Introduction to Region Discovery, Region Discovery Technology (please read the first 8 pages of the Word file), Weka Introduction Transparencies, Introduction to the CLEVER Region Discovery Algorithm, Background Material for Part3 of Project1)
XII Mining Data Streams, Online Data Mining and Incremental Data Mining
XIII Fayyad's KDD 2007 Innovation Talk and Final Words

Remark: The teaching plan will be updated continuously.

Grading

Each student has to have a weighted average of 74.0 or higher in the exams of the course in order to receive a grade of "B-" or better for the course. Students will be responsible for material covered in the lectures and assigned in the readings. All homeworks and project reports are due on the date specified. No late submissions will be accepted. This policy will be strictly enforced.

Translation of numeric scores to letter grades:
A: 100-90 A-: 90-86 B+: 86-82 B: 82-77 B-: 77-74 C+: 74-70
C: 70-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
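
The translation table and the exam threshold can be sketched in Java as follows. This is only an illustration under stated assumptions: the class GradeScale and its methods are hypothetical names, a score that falls exactly on a boundary (for example 90 or 86) is assumed to earn the higher grade, and a student whose exam average is below 74.0 is assumed to be capped at "C+", the best grade below "B-". The table itself does not say how boundary scores or the cap are handled.

```java
/** Hypothetical helper for illustration; not part of any course software. */
final class GradeScale {
    // Lower bounds and letters copied from the table above, highest first.
    private static final double[] LOWER_BOUNDS = {90, 86, 82, 77, 74, 70, 66, 62, 58, 54, 50};
    private static final String[] LETTERS =
        {"A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-"};

    /** Maps a 0-100 score to a letter; assumes a boundary score earns the higher grade. */
    static String toLetter(double score) {
        for (int i = 0; i < LOWER_BOUNDS.length; i++) {
            if (score >= LOWER_BOUNDS[i]) {
                return LETTERS[i];
            }
        }
        return "F";
    }

    /**
     * Applies the exam rule: a grade of B- or better requires an exam average
     * of at least 74.0. Capping at C+ otherwise is an assumption.
     */
    static String courseGrade(double courseScore, double examAverage) {
        boolean bMinusOrBetter = courseScore >= 74.0;
        if (bMinusOrBetter && examAverage < 74.0) {
            return "C+";
        }
        return toLetter(courseScore);
    }

    public static void main(String[] args) {
        System.out.println(toLetter(85.0));          // plain table lookup
        System.out.println(courseGrade(85.0, 70.0)); // exam average below the threshold
    }
}
```

Keeping the table in two parallel arrays makes it easy to compare the code against the published scale line by line.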

Only machine-written solutions to homeworks and assignments are accepted; the only exceptions are figures and complex formulas. Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Students may discuss course material and homeworks, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, clarify the meaning of homework problems, and discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through homework problems or the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.

Master's Thesis and Dissertation Research in Data Mining

If you plan to pursue a dissertation or Master's thesis project in the area of data mining, I strongly recommend taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following courses: Pattern Classification (COSC 6343), Artificial Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing about evolutionary computing (COSC 6367) will be helpful, particularly for designing novel data mining algorithms. Moreover, having basic knowledge in data structures, software design, and databases is important when conducting data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice. Taking a course that teaches high performance computing is also desirable, because most data mining algorithms are very resource intensive. Because a lot of data mining projects have to deal with images, I suggest taking at least one of the many biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge in the following fields is a plus: software engineering, numerical optimization techniques, statistics, and data visualization.

Also be aware that having sufficient background in the areas listed above is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge in data mining, machine learning, statistics and related areas. Similarly, you will not be hired as an RA for a data mining project without having some background in data mining.

Data Mining Links

KDnuggets
KDD 2006 Conference
IEEE International Conference on Data Mining (ICDM) Website
PKDD 2006 (European KDD Conference)
UIUC Data Mining Group
Microsoft DMX Group
Penn Data Mining Group
UMN Spatial Database and Spatial Data Mining Group
Vrije Universiteit Amsterdam Data Mining Group
Data Mining and Machine Learning Group University of Helsinki
Data Mining at Massey University
Weka Data Mining Software in Java
RapidMiner (formerly Yale)