Most Recent Offering of COSC 6335 Data Mining

The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. The first 8 weeks give a basic introduction to data mining. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Moreover, techniques for preprocessing data for a data mining task will be covered, and basic visualization techniques and statistical methods will be introduced. Finally, in the remaining 5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification techniques, and mining sequence and streaming data will be discussed.

office hours (589 PGH): TU 11:30a-12:30p and TH 1-2p

e-mail: ceick@uh.edu

Teaching Assistant: Daquan Zhang TBDL

office hours (226 PGH): TU 1-2p TH 2-3p

www: Daquan's COSC 6335 Website

e-mail: zhang_dq@cs.uh.edu

Rachsuda's (577 PGH) e-mail: rachsuda@gmail.com (only for Assignment1)

TA's website: TBDL

class meets: TU/TH 10-11:30a

cancelled classes: Tu., Nov. 24

Makeup classes: Tu., December 8, 10-11:30a in 200 PGH

class room: 200 PGH

- P.-N. Tan, M. Steinbach, and V. Kumar:
*Introduction to Data Mining*, Addison Wesley.
- Link to Book HomePage

- Jiawei Han and Micheline Kamber,
*Data Mining: Concepts and Techniques*, Morgan Kaufmann Publishers, second edition.
- Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, modelling and prediction)

- I enjoyed teaching the course and would like to wish you a happy and successful year 2010.
- The letter grades for the course will be available on Fr., Dec. 18, and more detailed grading reports will be posted no later than Dec. 22; check Daquan's website. The assignment scores are converted into number grades; from those, a final number grade is derived using a formula (that will be posted on Daquan's website), and it is then converted into a letter grade using the conversion table given on this webpage. In general, the assignments counted 40% (weights for the parts can be found in Daquan's grade summary) towards the final grade, the midterm counted 27%, and the final exam counted 33%.
- The final exam will not be returned to students; however, you can look at your final exam on the following dates: We., December 23, 1-2p; Thursday, January 21, 10-11a or Tuesday, January 26, 11a-noon. Moreover, solution sketches for the final exam can be found below (see past exams).
**Reading Instructions**: Read Chapter 3 of the textbook by September 1, 2009. Read Chapter 4 by September 14, 2009. Read Chapter 8 pages 487-506 by September 30, 2009. Read Chapter 8 pages 510-514 and 526-532 and Chapter 9 pages 569-576 by October 5, 2009. Read the first 9 pages of the region discovery technology document by October 8. Make sure you have read pages 487-505 top, 526-532, 600-604, 69-74, and 80 bottom-84 by October 12, 2009. Read pages 327-349 by October 20; read pages 349-358 by October 22; read pages 370-386 by October 25; read pages 415-426 and 429-439 by October 26; read pages 131-140 by October 28; read pages 36-65 by November 2; read the Spatial Analysis Wikipedia document by November 9; read pages 516-526 and 608-612 by November 10; read pages 532-548 by November 12; read pages 651-659 and 666-674 by November 13; read pages 223-227 by November 15; read pages 256-290 by November 19; read the Top10 Data Mining Article by December 1, centering on those algorithms that were covered in the course.
- Programming projects (unless specified otherwise) and other assignment tasks are individual activities; therefore, collaborating with other students is not allowed (also see the academic honesty section near the end of this webpage).
- Course Syllabus

Final Exam: Tu., December 8, 10a (in PGH 200)

Assignment 1: Tu., September 22, 11p

Assignment 2: We., September 30, 11p

Assignment 3: Part1: Mo., October 12, 11p; Part2: Th., October 29, 11p

Assignment 4: We., November 11, 11p

Assignment 5: Report due Mo., November 30, 11p (for groups presenting on December 3) / We, December 2, 11

In general, all 2009 COSC 6335 activities will come to a halt on Tu., December 8, noon.

Review Questions for September 29

Review Questions for October 13

Review Questions for November 17

Review Questions for December 1

Assignment1: Getting Familiar with Cougar^2 (please attend the lab classes on September 10 and 17 that will help you with this task)

Introduction to Cougar^2

Assignment2: Exploratory Data Analysis (Corrected Wine Dataset)

Draft of Assignment3: Making Sense of Data using Traditional and Clustering with Plug-in Fitness Functions (How to run experiments in Cougar^2, Earthquake09 Dataset (some errors in the file have been corrected on Oct. 2, 2009), Visualization Earthquake09 Dataset, Last Year's Project Specification (contains useful information for the 2009 course project))

Assignment4: Association Analysis and Similarity Assessment (contains paper and pencil style questions)

Assignment 5: Group Project (multiple topics to choose from)

Finally, software design for data mining will be covered in the course in part, and students will be exposed to the Java-based Cougar^2 Data Mining and Machine Learning Environment that is currently under development by the UH-DMML research group. In addition to learning how to design and implement data mining algorithms and how to interpret data mining results, participation in the course project will help you obtain valuable experience in Eclipse development, Java core development, object-oriented analysis and design, design patterns, and XML technology. Having knowledge of and experience with these technologies will also help you get a job in the software industry. Students who take the course should be familiar with the basic concepts of Java; if you have doubts about this prerequisite, feel free to contact Dr. Eick or members of the UH-DMML research group about this matter.

Exams and Quizzes (2-3): 57-68%

Class Participation: 1%

II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)

III Introduction to Classification: Basic Concepts and Decision Trees

IV Introduction to Similarity Assessment and Clustering (AGNES and DBSCAN, Region Discovery in Spatial Datasets, Introduction to the CLEVER Region Discovery Algorithm)

V Association Analysis (Part1, Part2)

VI A Short Introduction to Data Cubes

VII Preprocessing for Data Mining

VIII Introduction to Spatial Data Mining (Spatial Regression)

IX More on Clustering and Outlier Detection: Grid-based, Density-based Clustering, and Subspace Clustering, Cluster Validity, Anomaly/Outlier Detection.

X Software Design for Knowledge Discovery Projects and Background Knowledge for Programming Projects (Software Design in General, Introduction to Region Discovery, Region Discovery Technology (please read the first 8 pages of the Word file), Weka Introduction Transparencies, Experiment Guide, Introduction to the CLEVER Region Discovery Algorithm, Post Analysis Assignment 3a)

XI More on Classification: Instance-based Learning, Support Vector Machines, Editing, Ensembles, ROC-Curves (NN-Classifiers and Support Vector Machines, Editing and Condensing Techniques for NN-Classifiers (not covered in Fall 2008), Ensembles and ROC Curves, Model Evaluation).

XII Top Ten Algorithms in Data Mining (Top10)

XIII Miscellaneous: 2009 Netflix Contest, 90 Days at Yahoo! and Final Words

Remark: The teaching plan will be updated continuously.

Translation of number grades to letter grades:

A: 100-90 A-: 90-86 B+: 86-82 B: 82-77 B-: 77-74 C+: 74-70

C: 70-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
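
As a rough illustration of how the conversion table can be applied, the following Python sketch combines three number grades with the 40% / 27% / 33% weights mentioned in the grading announcement for this offering and looks up the resulting letter. It is only a sketch under stated assumptions, not the official grading procedure: the function names are hypothetical, a score that sits exactly on a shared boundary (for example 90) is assumed to receive the higher letter, and the separately posted formula that turns raw assignment scores into number grades is not modeled.

```python
# Grade bands copied from the table above as (letter, lower bound).
# Assumption: a score exactly on a shared boundary gets the higher letter.
BANDS = [
    ("A", 90), ("A-", 86), ("B+", 82), ("B", 77), ("B-", 74), ("C+", 70),
    ("C", 66), ("C-", 62), ("D+", 58), ("D", 54), ("D-", 50), ("F", 0),
]

# Weights taken from the grading announcement for this offering.
WEIGHTS = {"assignments": 0.40, "midterm": 0.27, "final": 0.33}


def final_number_grade(assignments, midterm, final):
    """Weighted average of three number grades, each on a 0-100 scale."""
    return (WEIGHTS["assignments"] * assignments
            + WEIGHTS["midterm"] * midterm
            + WEIGHTS["final"] * final)


def letter_grade(number):
    """Map a number grade in [0, 100] to a letter using BANDS."""
    if not 0 <= number <= 100:
        raise ValueError("number grade must be between 0 and 100")
    # BANDS is sorted from highest to lowest lower bound, so the first
    # band whose lower bound is reached is the matching one.
    for letter, lower in BANDS:
        if number >= lower:
            return letter


if __name__ == "__main__":
    number = final_number_grade(assignments=85, midterm=80, final=90)
    print(round(number, 1), letter_grade(number))
```

With number grades of 85, 80, and 90, the weighted result is about 85.3, which falls in the B+ band under the boundary assumption above.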

Only machine-written solutions to homeworks and assignments are accepted (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Students may discuss course material and homeworks, but must take special
care to discern the difference between **collaborating** in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is **not** permissible for one
student to help or be helped by another student in working through
homework problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.

2007 Final Exam

2009 Midterm Exam with Solution Sketches

2009 Final Exam with Solution Sketches

Also be aware that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge in data mining, machine learning, statistics, and related areas. Similarly, you will not be hired as an RA for a data mining project without having some background in data mining.

Netflix $1,000,000 Grand Prize

KDD 2009 Data Mining Contest

KDD 2009 Tutorial on Predictive Data Mining and DM-Contests

2009 IEEE International Conference on Data Mining (ICDM), Miami, December 2009.

UIUC Data Mining Group

Microsoft DMX Group

UMN Spatial Database and Spatial Data Mining Group

Data Mining and Machine Learning Group University of Helsinki

UH's Data Mining and Machine Learning Group (UH-DMML)

Weka Data Mining Software in Java

Weka's Most Recent Version (Version 3.6)

RapidMiner (formerly Yale)