last updated: December 20, 2011

COSC 6335: Data Mining in Fall 2011 (Dr. Eick)



Goals of the Data Mining Course

Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data. It aims at transforming a large amount of data into a well of knowledge. Data mining has become a very important field in industry as well as academia. For example, almost 800 papers were submitted to the IEEE International Conference on Data Mining (ICDM), which will be held in Vancouver, Canada in December 2011. Data mining tools and suites (for example, see KDnuggets' DM Software Survey) are widely used in industry and in research projects. UH's Data Mining and Machine Learning Group (UH-DMML) conducts research in some of the areas that are covered by this course (UH-DMML Research Overview). Finally, having basic knowledge of data mining is a plus when you are looking for a job in industry or at major US research institutions, such as the Texas Medical Center in Houston or federal research labs.

The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 9 weeks, a very basic introduction to data mining will be given. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Moreover, techniques for preprocessing a data set for a data mining task will be covered. Basic visualization techniques and statistical methods will also be introduced. In the remaining 4 weeks of the semester, more advanced topics, including spatial data mining, advanced clustering and classification techniques, sequence mining, and webpage ranking, will be discussed. In the course projects, you will obtain hands-on experience in conducting a data mining project; as R will be used in most course projects, you will also obtain valuable experience in using the R open source software for statistics, data mining, and visualization. In addition, 2-3 labs will assist you in learning how to use and program in R. A recent poll at KDnuggets found that 34% of respondents do at least half of their data mining in R. Although it is a domain-specific language, it is versatile.

Comments concerning this website

If you have any comments concerning this website, send e-mail to: ceick@cs.uh.edu

Basic Course Information

Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 1:30-2:30p and TH 11:30a-12:30p
e-mail: ceick@uh.edu
Teaching Assistant: Zechun Cao office hours: MO 2:30-3:30p TH noon-1:30p in 313 PGH
Zechun's COSC 6335 Website
email:
class meets: TU/TH 10-11:30a
cancelled classes: Th., November 3
Makeup classes: Tu., December 6, 10-noon
class room: 301 AH
exam classroom: FH 135 (October 25 and December 6, 2011)

Course Materials

2011 COSC 6335 Syllabus

Required Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley
Link to Book HomePage

Recommended Texts:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Morgan Kaufmann Publishers, Third Edition (just came out a few months ago).
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

News COSC 6335 (Data Mining) Fall 2011

Important Dates in 2011

Tu., August 30: Chun-sheng Chen will teach this lecture
Th., September 1, 2011: Lab-style R Tutorial Part I; bring your laptops and have R installed
Th., September 22, 2011: R Tutorial Part II, centering on clustering
Tu., October 25: Midterm exam 10-11:15a in FH 135
Th., October 27: Likely, Guest Lecturer and 45-minute lab that day.
Th., November 3: No lecture that day.
Th., November 17: "Something Interesting About Finding Something Interesting" Student Presentations in 563 PGH (Project4)
Tu., December 6: Final exam 10-11:40a in FH 135 (makeup class for cancelled class on November 3)

2011 Review Session Dates (last 30 minutes of lecture): October 4, October 20, November 22, December 1.

2011 Projects

Project 1: Exploratory Data Analysis for the Vehicle Silhouette Dataset using R.
Project 2: Clustering with K-means and DBSCAN
Project 3: Extracting Regional Knowledge from Spatial Datasets: Clustering with Plug-in Interestingness Functions with CLEVER
Project 4: Something Interesting About Finding Something Interesting (Group Project; Slides Project4 Student Presentations, Video taken of the Event)
Project 5: Learning and Assessing Classification Models

Prerequisites

The course is mostly self-contained. However, students taking the course should have sound software development skills and very basic knowledge of Java.

Course Elements and Their Tentative Weights in 2011

Course Projects (5): 40%
Exams (2): 59% (midterm: 26%; final exam: 33%)
Class Attendance: 1%

COSC 6335: Data Mining Lecture Notes

I Introduction to Data Mining (Part1, Part2, Part3, Differences between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and Decision Trees
IV Introduction to Similarity Assessment and Clustering (AGNES and DBSCAN)
V Association Analysis (Part1, Part2)
VI A Short Introduction to Data Cubes
VII Preprocessing for Data Mining
VIII Introduction to Spatial Data Mining (Introduction to Regional Knowledge Extraction, Introduction to Region Discovery, Example Fitness Functions (to be used in the Region Discovery Lecture), Introduction to the CLEVER Region Discovery Algorithm, Dr. Eick's Report 2011 ACM-GIS Conference, Spatial Regression (not covered in 2011))
IX More on Clustering and Outlier Detection: Grid-based, Density-based Clustering, and Subspace Clustering, Cluster Validity, Anomaly/Outlier Detection.
X DM Labs: Introduction to R and Course Projects (for teaching material, see the website of Zechun, who teaches the labs)
XI More on Classification: Instance-based Learning, Support Vector Machines, Editing, Ensembles, ROC-Curves (NN-Classifiers and Support Vector Machines, Editing and Condensing Techniques for NN-Classifiers (not covered in Fall 2011), Ensembles and ROC Curves, Model Evaluation).
XII Top Ten Algorithms in Data Mining (Top-10 Panel, Top10)
XIII Miscellaneous: Experiences in Finding Data Mining/Internet/HPC Jobs in Industry, 2009 Netflix Contest, 90 Days at Yahoo! and Final Words

Order of Teaching in 2011: I-II-IV-V(Part1)-III-VIII-VI-VII-XI(only covering kNN, support vector machines, and briefly ensembles)-IX(only covering hierarchical clustering and cluster validity in 2011)- V(Part2)-XII-XIII.
Remarks: X will be covered in 3 computer labs (see schedule above); XIII will likely be only partially covered.

2011 Review Sessions

The review sessions will discuss questions that will typically be posted 2-7 days prior to the review session; review sessions will take about 30 minutes and are typically held 10:45-11:15a. It is important that you try to answer the review questions before the review session.

2011 Review Questions
Questions October 4
Questions October 20
Questions November 22
Questions December 1

2011 "Something Interesting About Finding Something Interesting" Groups

Group1
Amalaman,Paul Koutoua 
Joshi,Sushil 
Kampalli Santhamurthy,Divya Durga 

Group2
Anurag,Ananya 
Dotson Jr,Ulysses Sidney 
Edamalapati,Raghavendra Rao 
Francis Xavier,John Brentan 

Group3
Arun,Balakrishna Sarathy 
Asodekar,Pallavi 
Chilukuri,Brundavani 
Nalan Chakravarthy,Vidya Thirumalai 

Group4
Chohan,Gaurav 
Veerappan,Vaduganathan 
Wang,Ning 
Wen,Xi 

Group5
Conjeepuramkrishnamoorthy,Manasee 
Gondu,Ananth Kumar 
Hernandez Herrera,Paul 
Kao,Hsu-Wan 

Group6
Kethamakka,Uma Shankar Koushik 
Komma,Gayathri 
Xi,Chen 
Zhu,Rui

Group7
Marathe,Deepti A 
Mauricio,Aura Elvira 
Souran,Malvika 
Vanegas,Carlos R 

Group8
Mohanam,Naveen 
Nyshadham,Harshanand 
Poolla,Veda Shruthi 
Siga,Dedeepya 

A Few Results 2011 DM Questionnaire

Student Preferences: Of a group of 31 students (neutral statements were not counted), 26 students like group projects and 3 students dislike group projects; 25 students like reading scientific papers and 4 students dislike reading scientific papers; 23 students like projects which involve a significant amount of programming and 6 students dislike such projects; 17 students like to give presentations and 5 students dislike giving presentations.

Student Languages: As far as the languages that students spoke as a child are concerned (based on 29 responses; if students listed more than 2 languages, only the first two languages were counted): English(16), Telugu(6), Hindi(6), Chinese(5), Tamil(5), Spanish(3), African(1), French(1), Nepali(1), Marathi(1), Kannada(1).

2010 Assignments

Assignment 1 (see Chun-sheng's Website)
Assignment 2
Assignment 3 (Earthquake 2010 Dataset)
Assignment 4
First Draft of Assignment 5

2010 Review Sessions

There will be 30 minute review sessions on September 23, October 12, November 16, and November 30. Review questions will be posted here. Occasionally, review questions will discuss paper-and-pencil problems of assignments.

Review Questions for September 23
Review Questions for October 12
Review Questions for November 11
Review Questions for November 30

Grading

Each student has to have a weighted average of 74.0 or higher in the exams of the course in order to receive a grade of "B-" or better for the course. Students will be responsible for material covered in the lectures and assigned in the readings. All assignment and project reports are due at the date specified. No late submissions will be accepted after the due date. This policy will be strictly enforced.
Several times during the semester, I will check class attendance on randomly chosen dates, and an attendance score will be computed from how many of those lectures you attended.

Translation of numerical scores to letter grades:
A: 100-90 A-: 90-86 B+: 86-82 B: 82-77 B-: 77-74 C+: 74-70
C: 70-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
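
As a rough illustration of how the tentative weights and the translation table fit together, the following Python sketch computes a course score and a letter grade. It is only a sketch under stated assumptions, not the official grading procedure: the function names are hypothetical, all component scores are assumed to be on a 0-100 scale, a score exactly on a boundary is assumed to earn the higher grade, and a student whose weighted exam average is below 74.0 is assumed to be capped at "C+" (the page only states that such a student cannot receive "B-" or better).

```python
# Tentative 2011 weights in percent (projects, midterm, final, attendance).
WEIGHTS = {"projects": 40, "midterm": 26, "final": 33, "attendance": 1}

# Lower bound of each letter grade, taken from the translation table.
# Assumption: a score exactly on a boundary earns the higher grade.
CUTOFFS = [(90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"),
           (70, "C+"), (66, "C"), (62, "C-"), (58, "D+"), (54, "D"),
           (50, "D-")]

B_MINUS_OR_BETTER = {"A", "A-", "B+", "B", "B-"}


def course_score(scores):
    """Weighted course score on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def exam_average(midterm, final):
    """Weighted average of the two exams (weights 26 and 33)."""
    w_mid, w_fin = WEIGHTS["midterm"], WEIGHTS["final"]
    return (w_mid * midterm + w_fin * final) / (w_mid + w_fin)


def to_letter(score):
    """Map a numerical score to a letter grade using the cutoffs."""
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"


def course_grade(scores):
    """Letter grade, applying the 74.0 exam-average rule for B- or better."""
    letter = to_letter(course_score(scores))
    exams = exam_average(scores["midterm"], scores["final"])
    if exams < 74.0 and letter in B_MINUS_OR_BETTER:
        # Assumption: the grade is capped at C+, the highest grade below B-.
        letter = "C+"
    return letter


if __name__ == "__main__":
    student = {"projects": 95, "midterm": 80, "final": 70, "attendance": 100}
    print(course_score(student), course_grade(student))
```

In this sketch, strong project scores cannot lift a student to "B-" or better when the exam average is below 74.0, which mirrors the rule stated in the Grading section.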

Only machine-written solutions are accepted in the assignments (the only exceptions are figures and complex formulas). Be aware of the fact that our only source of information is what you have turned in. If we are unable to understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Students may discuss course material and homework, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through assignment problems and in the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in obtaining a grade of F in the course, and in further prosecution.

Past Data Mining Exams

2008 Midterm Exam
2007 Final Exam
2009 Midterm Exam with Solution Sketches
2009 Final Exam with Solution Sketches
2010 Midterm Exam with Solution Sketches
2011 Midterm Exam with Solution Sketches (some typos in the solution of Problem 5b have been corrected on November 29, 2011)
2010 Final Exam with Solution Sketches

Summary Answers COSC 6335 2009 Student Questionnaire

Student Language Summary (Registered Students): English:14, Hindi:9, Telugu:7, Bengali:2, Vietnamese:2, Arabic:2, Sindhi:1, French:1, Russian:1, Turkish:1, Kyrgyz(?):1, Tamil:1, Filipino:1, Spanish:1, Urdu:1, Garhwali(?):1, Chinese:1; I am impressed: some of you spoke up to four languages as a child! Concerning group projects, 11 students liked group projects, 2 students disliked group projects, and 9 students had no preference. Concerning reading scientific papers, 12 students liked reading scientific papers, 3 students disliked it, and the rest of the students were neutral or gave fuzzy answers such as "I like reading paper that are interesting." 15 students liked giving presentations and 4 students didn't. Concerning projects that involve significant amounts of programming, 16 students liked them and 3 didn't.

Master Thesis and Dissertation Research in Data Mining

If you plan to pursue a dissertation or Master's thesis project in the area of data mining, I strongly recommend taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following courses: Pattern Classification (COSC 6343), Artificial Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing about evolutionary computing (COSC 6367) will be helpful, particularly for designing novel data mining algorithms. Moreover, having basic knowledge of data structures, software design, and databases is important when conducting data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice. Taking a course that teaches high performance computing is also desirable, because most data mining algorithms are very resource intensive. Finally, having some knowledge in the following fields is a plus: numerical optimization techniques, image processing, statistics, geographical information systems (GIS), agent-based systems and data visualization.

Also be aware of the fact that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge of data mining, machine learning, statistics and related areas. Similarly, you will not be hired as an RA for a data mining project without having some background in data mining.

Data Mining Links

KDnuggets
Rexer Analytics: Data Mining Software Survey
Netflix $1,000,000 Grand Prize
2011 IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, December 2011.
2011 ACM KDD Conference, San Diego, California, August 2011.
UIUC Data Mining Group
Microsoft DMX Group
UMN Spatial Database and Spatial Data Mining Group
Data Mining and Machine Learning Group University of Helsinki
R User Groups
UH's Data Mining and Machine Learning Group (UH-DMML)
RapidMiner (formerly Yale)