The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. The first 9 weeks give a basic introduction to data mining. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Exploratory data analysis, centering on basic visualization techniques and statistics, will be covered as a way to better understand the data mining task at hand. Moreover, techniques for preprocessing a data set for a data mining task will be introduced. In the remaining 4-5 weeks of the semester, more advanced topics will be discussed, including spatial data mining, advanced clustering and classification techniques, sequence mining, and webpage ranking. In the course projects you will obtain hands-on experience in conducting a data mining project. Finally, R will be used in most course projects; participants of the course will therefore obtain valuable experience in using the R statistics, data mining, and visualization packages, and will learn how to write programs in R and how to develop data mining software on top of R. A 2013 poll by Rexer Analytics found that R is currently the most popular data mining tool: 24% of the respondents use R as their primary tool, and only 30% of the respondents do not use R at all. Although R is a domain-specific language, it is versatile.

office hours (589 PGH) TU 2:30-3:30p TH 9:30-10:30a

e-mail: ceick@uh.edu

2012 TA website: Zechun's COSC 6335 Website

Google COSC 6335 News Group

class meets: TU/TH 11:30a-1p

cancelled classes: Tu., December 3, 2013.

Makeup classes: Tu., December 10, 11:30a-1p

Objectives Data Mining Course

- P.-N. Tan, M. Steinbach, and V. Kumar: *Introduction to Data Mining*, Addison Wesley, 2006.
- Link to Book HomePage

- Jiawei Han and Micheline Kamber: *Data Mining: Concepts and Techniques*, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

- The grading for the course was completed on December 20 (Grade Summary COSC 6335 F'13). We had: A:4 A-:8 B+:8 B:5 B-:3 C+:3 this semester. I enjoyed teaching the course and would like to wish you a Happy and Successful Year 2014!
- The final exam will not be returned to students; however, you will be able to see your final exam on We., January 15, 4-5p and Tu., January 21, 2:30-3:30p in 573 PGH. If you have other concerns or suggestions about the course, also talk to me at that time!
- The lectures in the remainder of the semester will cover: Spatial Data Mining, Clustering Part2 (Hierarchical Clustering and DENCLUE), Discussion Project4, Post Analysis Project2, Q&A Project3, Final Words, PageRank, Top 10 Data Mining Algorithms, and Review3.
- Solution sketches and results of the midterm exam have been posted (Midterm 2013: Solution Sketches and Results).
- The final exam is scheduled for Tu., December 17, **11a** in room **T101**; we used the same room for the midterm exam (Review List for the 2013 Data Mining Final Exam).
- 2013 Textbook Reading Assignments (updated on November 15, 2013): August 29: 1-11; September 2: 19-36; September 4: 97-131; September 8: 65-84; September 13: 487-496; September 15: 496-508; September 15: 510-515; September 18: 526-532; September 22: 569-577; September 24: 327-335; October 1: 532-544; October 3: 145-162; October 7: 162-166, 168-178, and 184-188; October 11: 335-351, lecture; October 27: 44-64; November 6: read the web link documents of the "Short Introduction to Spatial Data Mining"; November 8: 223-226, 256-281; November 10: 131-139; November 14: 515-526, 532-553; November 20: 415-423, 426-441; November 28: read the discussions of K-means, kNN, decision trees, and APRIORI in the "Top 10 Data Mining Algorithms Paper".
- The course will use a mixture of group and individual projects. Course projects (unless specified otherwise) and other assignment tasks are individual activities; therefore, collaborating with other students, or with students from other groups, is not allowed (also see the academic honesty section near the end of this webpage).

Tu., November 5: Midterm Exam

Th., November 21: Project3 Student Presentations in

Tu., December 3: No lecture and no office hours that day!

Tu., December 10, 11:30a-1p: Makeup Class for cancelled class on Dec. 3.

Tu., December 17, 11a-2p: Final Exam

Review Sessions (last 30 minutes of lecture): October 8, October 31, December 10 (long 50-minute review!).

Project2: Traditional Clustering with K-means and DBSCAN (

Project 3: *Something Interesting about Finding Interesting Associations in Large Amounts of Data* (Group Project (Groups of 4), Project3 Q&A, Project3 Scores)

Project4: *Reading, Understanding, Summarizing and Reviewing of Data Mining Papers* (2-person Group Project; paper candidates (choose one!): Paper 1, Paper 2, Paper 3, Paper 4; Project4 Scores)

2013 Project Weights: Project1: 1.0, Project2: 2.0, Project3: 1.3, Project4: 1.2.
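As an illustration of how such weights might be applied, here is a minimal Python sketch. It assumes the weights are relative weights in a weighted mean and that all four projects are scored on the same scale; the page does not confirm either point. The function name `weighted_project_average` and the example scores are hypothetical.

```python
# Relative project weights as listed for 2013.
PROJECT_WEIGHTS = {
    "Project1": 1.0,
    "Project2": 2.0,
    "Project3": 1.3,
    "Project4": 1.2,
}


def weighted_project_average(scores):
    """Return the weighted mean of per-project scores.

    Assumption: every project is scored on the same scale (e.g. 0-100)
    and the weights are relative, so they are normalized by their sum.
    """
    missing = set(PROJECT_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(PROJECT_WEIGHTS.values())
    weighted_sum = sum(
        weight * scores[name] for name, weight in PROJECT_WEIGHTS.items()
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example_scores = {
        "Project1": 80.0,
        "Project2": 90.0,
        "Project3": 70.0,
        "Project4": 85.0,
    }
    print(round(weighted_project_average(example_scores), 2))
```

Under this reading, Project2 counts twice as much as Project1, so a weak Project2 score lowers the average more than an equally weak score on any other project.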

2013 Project Scores (please, verify!)

Review2 (solution sketches; the answer to question 2d has been corrected (in red); was discussed during the lecture on October 31).

Review3 (will take place on Tuesday, December 10. Solve the 7 review questions!)

Exams (2): 58% (midterm: 26%; final exam: 32%)

Class Attendance: 1%

II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)

III Introduction to Classification: Basic Concepts and Decision Trees

IV Introduction to Similarity Assessment and Clustering (AGNES (not covered) and DBSCAN; R-scripts demonstrating: K-means/medoids, DBSCAN)

V Association Analysis (Part1, Part2)

VI A Short Introduction to Data Cubes

VII Preprocessing for Data Mining

VIII Introduction to Spatial Data Mining (Introduction to Regional Knowledge Extraction, Introduction to Region Discovery, Example Fitness Functions (to be used in the Region Discovery Lecture), Introduction to the CLEVER Region Discovery Algorithm, Dr. Eick's Report 2011 ACM-GIS Conference, Spatial Regression (not covered in 2011))

IX More on Clustering and Outlier Detection: Grid-based, Density-based Clustering, and Subspace Clustering (Non-Parametric Density Estimation R Demo), Cluster Validity, Anomaly/Outlier Detection.

X Introduction to R (R Preliminaries, R Data Structures, R Graphics, Functions in R, Rar-file (the .r files are the scripts presented in class, namely kmean.r, linearRegression.r, linearRegression_iris_2.r, and scatterlot3d_1.r; the file RCommander_LAB_EXAMPLES.r contains all the scripts; the rest are the PowerPoint slides and the datasets used), Sample Classification Plots in R)

XI More on Classification: Instance-based Learning, Support Vector Machines, Editing, Ensembles, and ROC-Curves (NN-Classifiers and Support Vector Machines (Prof. Sastry's Introduction to SVM), Editing and Condensing Techniques for NN-Classifiers (to be covered in Fall 2013), Ensembles and ROC Curves, Model Evaluation).

XII The PageRank Algorithm (taught the first time in 2012; Gleich 2009 Dissertation Defense centering on PageRank at Stanford University)

XIII Top Ten Algorithms in Data Mining (Top-10 Panel, Top10)

XIV Miscellaneous: Experiences in Finding Data Mining/Internet/HPC Jobs in Industry, 2009 Netflix Contest, 90 Days at Yahoo! and Final Words

Remarks: topics XIII and XIV will likely be only partially covered. R (Topic X) will be covered in a lab in early September and as part of the lectures centering on Exploratory Data Analysis and Clustering Part1.

2012 Review Questions

Questions 2012 Review1

Question 2012 Review2

Question 2012 Review3

2011 Review Questions

Questions October 4

Questions October 20

Questions November 22

Questions December 1

Project2: Clustering with K-means and DBSCAN

Project 3: Extracting Regional Knowledge from Spatial Datasets: Clustering with Plug-in Interestingness Functions with CLEVER

Project 4: Something Interesting About Finding Something Interesting (Group Project; Slides Project4 Student Presentations, Video taken of the Event)

Project 5: Learning and Assessing Classification Models

Assignment 2

Assignment3 (Earthquake 2010 Dataset)

Assignment4

First Draft of Assignment 5

Review Questions for September 23

Review Questions for October 12

Review Questions for November 11

Review Questions for November 30

Several times during the semester I will check

Translation of numeric scores to letter grades:

A:100-90 A-:90-86 B+:86-82 B:82-77 B-:77-74 C+:74-70

C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
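The ranges above share their endpoints (for example, 90 appears in both the A and the A- range). The following Python sketch shows one possible reading, under the assumption that a score exactly on a boundary receives the higher grade; the page does not state this rule. The names `GRADE_FLOORS` and `letter_grade` are hypothetical.

```python
# (lower bound, letter) pairs taken from the table above, highest first.
# Assumption: a score exactly on a boundary earns the higher grade.
GRADE_FLOORS = [
    (90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score):
    """Map a numeric course score in [0, 100] to a letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    # Anything below the lowest listed floor (50) is an F.
    return "F"


if __name__ == "__main__":
    for value in (95, 90, 89.9, 77, 50, 49.9):
        print(value, letter_grade(value))
```

Keeping the thresholds in one ordered list makes the boundary convention easy to change: if boundary scores were meant to receive the lower grade instead, only the comparison `score >= floor` would need to become `score > floor`.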

Only machine-written solutions are accepted in the assignments (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Students may discuss course material and homeworks, but must take special
care to discern the difference between **collaborating** in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is **not** permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.

2007 Final Exam

2009 Midterm Exam with Solution Sketches

2009 Final Exam with Solution Sketches

2010 Midterm Exam with Solution Sketches

2011 Midterm Exam with Solution Sketches (some typos in the solution of Problem 5b have been corrected on November 29, 2011)

2012 Midterm Exam with Solution Sketches

2013 Midterm Exam with Solution Sketches

2010 Final Exam with Solution Sketches

2011 Final Exam with Solution Sketches

Also be aware that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge in data mining, machine learning, statistics and related areas. Similarly, you will not be hired as an RA for a data mining project without having some background in data mining.

2013 Rexer Analytics Data Mining Software Highlights

Netflix $1,000,000 Grand Prize

SPMF (Sequential Pattern Mining Framework)

Magnum Opus Data Mining Framework

UIUC Data Mining Group

Microsoft DMX Group

UMN Spatial Database and Spatial Data Mining Group

Data Mining and Machine Learning Group University of Helsinki

Houston R Group

UH's Data Mining and Machine Learning Group (UH-DMML)

Data Mining Conferences and Journals

RapidMiner (formerly Yale)