last updated: December 18, 2009

COSC 6335: Data Mining in Fall 2009 (Dr. Eick)





Most Recent Offering of COSC 6335 Data Mining

Goals of the Data Mining Course

Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data. It aims at transforming a large amount of data into a well of knowledge. Data mining has become a very important field in industry as well as academia. For example, almost 800 papers were submitted to the IEEE International Conference on Data Mining (ICDM) that will be held in Miami, Florida in December 2009. Data mining tools and suites (for example, see KDnuggets' DM Software Survey) are widely used in industry and in research projects. UH's Data Mining and Machine Learning Group (UH-DMML) conducts research in some of the areas that are covered by this course (Research of Dr. Eick's Subgroup). Moreover, we are currently developing Cougar^2, an open source data mining and machine learning platform that will be used in part in the course projects. Finally, having basic knowledge of data mining is a plus when you are looking for a job in industry or at major US research institutions, such as the Texas Medical Center in Houston or federal research labs.

The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 8 weeks a very basic introduction to data mining will be given. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Moreover, techniques for preprocessing data for a data mining task will be covered. Basic visualization techniques and statistical methods will also be introduced. Finally, in the remaining 5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification techniques, and mining sequence and streaming data will be discussed.

Comments concerning this website

If you have any comments concerning this website, send e-mail to: ceick@cs.uh.edu

Basic Course Information

Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 11:30a-12:30p and TH 1-2p
e-mail: ceick@uh.edu
Teaching Assistant: Daquan Zhang TBDL
office hours (226 PGH): TU 1-2p TH 2-3p
www: Daquan's COSC 6335 Website
e-mail: zhang_dq@cs.uh.edu
Rachsuda's (577 PGH) e-mail: rachsuda@gmail.com (only for Assignment1)
TA's website: TBDL
class meets: TU/TH 10-11:30a
cancelled classes: Tu., Nov. 24
Makeup classes: Tu., December 8, 10-11:30a in 200 PGH
class room: 200 PGH

Course Materials

Required Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley,
Link to Book HomePage

Mildly Recommended Texts:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Morgan Kaufmann Publishers, second edition.
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, modelling and prediction)

News COSC 6335 (Data Mining) Fall 2009

2009 Exams Dates and Other Deadlines:

Midterm Exam: Th., October 15
Final Exam: Tu., December 8, 10a (in PGH 200)
Assignment 1: Tu., September 22, 11p
Assignment 2: We., September 30, 11p
Assignment 3: Part1: Mo., October 12, 11p; Part2: Th., October 29, 11p
Assignment 4: We., November 11, 11p
Assignment 5: Report due Mo., November 30, 11p (for groups presenting on December 3) / We., December 2, 11a (for groups presenting on December 1)

In general, all 2009 COSC 6335 activities will come to a halt on Tu., December 8, noon.

2009 Review Sessions

There will be 30-minute review sessions on September 29, October 13, November 17, and December 1 (or 3). Review questions will be posted here. Occasionally, review questions will cover paper-and-pencil problems from the assignments.

Review Questions for September 29
Review Questions for October 13
Review Questions for November 17
Review Questions for December 1

2009 Assignments

Instructions Concerning What Software to Install (updated on August 27 at noon!).
Assignment1: Getting Familiar with Cougar^2 (please attend the lab classes on September 10 and 17 that will help you with this task)
Introduction to Cougar^2
Assignment2: Exploratory Data Analysis (Corrected Wine Dataset)
Draft of Assignment3: Making Sense of Data using Traditional and Clustering with Plug-in Fitness Functions (How to run experiments in Cougar^2, Earthquake09 Dataset (some errors in the file were corrected on Oct. 2, 2009), Visualization Earthquake09 Dataset, Last Year's Project Specification (contains useful information for the 2009 course project))
Assignment4: Association Analysis and Similarity Assessment (contains paper and pencil style questions)
Assignment 5: Group Project (multiple topics to choose from)

Prerequisites

The course is mostly self-contained. However, students taking the course should have sound software development skills and basic knowledge of Java. Students who lack these skills are likely to run into trouble with the programming projects.

The 2009 Offering of Data Mining

The teaching in 2009 will be similar to the offering in 2008. The teaching in the first 8 weeks will closely cover material from the course textbook, which also comes with good online teaching material. The programming projects center on the implementation of a data mining algorithm and on applying data mining techniques to real-world datasets. Moreover, 3-4 homeworks will be given that contain short, review-style questions; answers to these exercises will be covered in review-style lectures every third Thursday.

Finally, software design for data mining will be covered in the course in part, and students will be exposed to the Java-based Cougar^2 Data Mining and Machine Learning Environment that is currently under development by the UH-DMML research group. In addition to learning how to design and implement data mining algorithms and how to interpret data mining results, participation in the course project will help you obtain valuable experience in Eclipse development, Java core development, object oriented analysis & design, design patterns, and XML technology. Having knowledge and experience in using these technologies will also help you get a job in the software industry. Students who take the course should be familiar with the basic concepts of Java; if you have doubts about this prerequisite, feel free to contact Dr. Eick or members of the UH-DMML research group about this matter.

Course Elements and Their Weights in 2009

Programming Projects and Homeworks: 30-42%
Exams and Quizzes (2-3): 57-68%
Class Participation: 1%

Tentative 2009 Teaching Plan and Transparencies (subject to change!)

I Introduction to Data Mining (Part1, Part2, Part3, Differences between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and Decision Trees
IV Introduction to Similarity Assessment and Clustering (AGNES and DBSCAN, Region Discovery in Spatial Datasets, Introduction to the CLEVER Region Discovery Algorithm)
V Association Analysis (Part1, Part2)
VI A Short Introduction to Data Cubes
VII Preprocessing for Data Mining
VIII Introduction to Spatial Data Mining (Spatial Regression)
IX More on Clustering and Outlier Detection: Grid-based, Density-based Clustering, and Subspace Clustering, Cluster Validity, Anomaly/Outlier Detection.
X Software Design for Knowledge Discovery Projects and Background Knowledge for Programming Projects (Software Design in General, Introduction to Region Discovery, Region Discovery Technology (please read the first 8 pages of the Word file), Weka Introduction Transparencies, Experiment Guide, Introduction to the CLEVER Region Discovery Algorithm, Post Analysis Assignment 3a)
XI More on Classification: Instance-based Learning, Support Vector Machines, Editing, Ensembles, ROC-Curves (NN-Classifiers and Support Vector Machines, Editing and Condensing Techniques for NN-Classifiers (not covered in Fall 2008), Ensembles and ROC Curves, Model Evaluation).
XII Top Ten Algorithms in Data Mining (Top10)
XIII Miscellaneous: 2009 Netflix Contest, 90 Days at Yahoo! and Final Words

Remark: The teaching plan will be updated continuously.

Grading

Each student has to have a weighted average of 74.0 or higher in the exams of the course in order to receive a grade of "B-" or better for the course. Students will be responsible for material covered in the lectures and assigned in the readings. All homeworks and project reports are due on the date specified. No late submissions will be accepted. This policy will be strictly enforced.

Translation number to letter grades:
A: 100-90  A-: 90-86  B+: 86-82  B: 82-77  B-: 77-74  C+: 74-70
C: 70-66  C-: 66-62  D+: 62-58  D: 58-54  D-: 54-50  F: 50-0
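
The translation table and the exam-average rule above can be sketched in a few lines of Python. This is only an illustration, not the official grading procedure: the function names are hypothetical, the table does not say which grade a boundary value such as 90 receives (the sketch assumes the higher grade), and it assumes that a student whose weighted exam average is below 74.0 is capped at "C+", the best grade below "B-".

```python
# Lower bounds taken from the translation table. Assigning a shared
# boundary (e.g. 90) to the higher grade is an assumption of this sketch.
GRADE_FLOORS = [
    (90.0, "A"), (86.0, "A-"), (82.0, "B+"), (77.0, "B"), (74.0, "B-"),
    (70.0, "C+"), (66.0, "C"), (62.0, "C-"), (58.0, "D+"), (54.0, "D"),
    (50.0, "D-"),
]

B_MINUS_OR_BETTER = {"A", "A-", "B+", "B", "B-"}


def letter_grade(score):
    """Translate a numeric course score (0-100) into a letter grade."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


def course_grade(score, exam_average):
    """Apply the exam-average rule on top of the translation table.

    Assumption: an exam average below 74.0 caps the grade at "C+".
    """
    letter = letter_grade(score)
    if exam_average < 74.0 and letter in B_MINUS_OR_BETTER:
        return "C+"
    return letter


if __name__ == "__main__":
    print(letter_grade(83.5))        # numeric score only
    print(course_grade(83.5, 72.0))  # same score, low exam average
```

Under these assumptions, a course score of 83.5 translates to "B+", but the same score combined with an exam average of 72.0 yields "C+".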

Only machine-written solutions to homeworks and assignments are accepted; the only exceptions are figures and complex formulas. Be aware of the fact that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Students may discuss course material and homeworks, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through homework problems and in the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.

Past Exams

2008 Midterm Exam
2007 Final Exam
2009 Midterm Exam with Solution Sketches
2009 Final Exam with Solution Sketches

Summary Answers COSC 6335 Student Questionnaire

2009 Student Language Summary (Registered Students): English: 14, Hindi: 9, Telugu: 7, Bengali: 2, Vietnamese: 2, Arabic: 2, Sindhi: 1, French: 1, Russian: 1, Turkish: 1, Kyrgyz(?): 1, Tamil: 1, Filipino: 1, Spanish: 1, Urdu: 1, Garhwali(?): 1, Chinese: 1. I am impressed: some of you spoke up to four languages as a child! Concerning group projects, 11 students liked group projects, 2 students disliked group projects, and 9 students had no preference. Concerning reading scientific papers, 12 students liked it, 3 students disliked it, and the rest of the students were neutral or gave fuzzy answers such as "I like reading paper that are interesting.". 15 students liked giving presentations and 4 students didn't. Concerning projects that involve significant amounts of programming, 16 students liked them and 3 didn't.

Master's Thesis and Dissertation Research in Data Mining

If you plan to pursue a dissertation or Master's thesis project in the area of data mining, I strongly recommend taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following courses: Pattern Classification (COSC 6343), Artificial Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing about evolutionary computing (COSC 6367) will be helpful, particularly for designing novel data mining algorithms. Moreover, having basic knowledge of data structures, software design, and databases is important when conducting data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice. Taking a course that teaches high performance computing is also desirable, because most data mining algorithms are very resource intensive. Because a lot of data mining projects have to deal with images, I suggest taking at least one of the many biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge in the following fields is a plus: software engineering, numerical optimization techniques, statistics, and data visualization.

Also be aware of the fact that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge of data mining, machine learning, statistics and related areas. Similarly, you will not be hired as an RA for a data mining project without having some background in data mining.

2008 Textbook Reading Schedule

Recommended readings 2008:
Sept. 1: 1-35
Sept. 3: 97-105
Sept. 8: 105-131
Sept. 15: 144-166
Sept. 17: 168-186
Sept. 22: 186-193
September 25: 487-491, 493-508
September 29: 510-513, 526-532
October 1: read recommended region discovery document
October 6: 569-575; material on CLEVER and SCMRG
October 15: 515-526
October 21: 327-341
October 23: 349-358, 370-382
October 27: 415-426
October 31: 429-439
November 6: 39-65, 69-74
November 16: 131-139
November 18: 600-612, 532-542, 546-550
November 20: 651-652, 658-661, 666-669, 671-674 (skip 10.5.2 and 10.5.3)
November 24: 223-227
November 25: 256-276
December 1: 276-280, 283-291
December 2: 295-301

Data Mining Links

KDnuggets
Netflix $1,000,000 Grand Prize
KDD 2009 Data Mining Contest
KDD 2009 Tutorial on Predictive Data Mining and DM-Contests
2009 IEEE International Conference on Data Mining (ICDM), Miami, December 2009.
UIUC Data Mining Group
Microsoft DMX Group
UMN Spatial Database and Spatial Data Mining Group
Data Mining and Machine Learning Group, University of Helsinki
UH's Data Mining and Machine Learning Group (UH-DMML)
Weka Data Mining Software in Java
Weka's Most Recent Version (Version 3.6)
RapidMiner (formerly Yale)