last updated:
December 20, 2010
COSC 6335: Data Mining in Fall 2010
(Dr. Eick )
Most Recent Offering of COSC 6335 Data Mining
Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as
well as academia. For example, almost 800 papers were submitted
for the IEEE International Conference
on Data Mining (ICDM) that will be held in Sydney, Australia in December
2010. Data mining tools and
suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and
in research projects.
UH's Data Mining and Machine Learning Group
(UH-DMML) conducts research in some of the areas that are covered by this
course (Research of Dr. Eick's Subgroup). Moreover, we are currently developing Cougar^2, an open source data mining and machine
learning platform
that will be used in part in the course projects. Finally, having
basic knowledge in data mining is a plus when you are looking for a job in
industry and at major US research
institutions, such as the Texas Medical Center in Houston or at Federal Research Labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 8
weeks a very basic introduction to data mining will be given. After defining what knowledge discovery and
data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Moreover, techniques for preprocessing data for a data mining task will be covered. Also, basic
visualization techniques and statistical methods will be introduced. Finally, in the remaining
5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification
techniques, and sequence mining and webpage ranking will be discussed.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@cs.uh.edu
Basic Course Information
Instructor: Dr.
Christoph F. Eick
office hours (589 PGH): TU 1:30-2:30p and TH 11:30a-12:30p
e-mail: ceick@uh.edu
Teaching Assistant: Chun-sheng Chen
office hours (577 PGH): TH 11:30a-12:15p & 12:45-1:30p
Chun-sheng's COSC 6335
Website
email: lyons19tw@sbcglobal.net
class meets: TU/TH 10-11:30a
cancelled classes: Th., October 14
Makeup classes: Tu., December 7, 10-noon room TBDL
class room: 347 PGH
Course Materials
Required Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
- Addison Wesley,
- Link to
Book HomePage
Mildly Recommended Texts:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques
- Morgan Kaufmann Publishers, second edition.
- Link to Data Mining Book Home
Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, modelling and prediction)
2010 Assignments
Assignment1 (see Chun-sheng's Website)
Assignment 2
Assignment3
(Earthquake 2010 Dataset)
Assignment4
First Draft of Assignment 5
2010 Assignment 5 Groups and Topics
1. A Survey on Text Categorization Algorithms and Tools
Shraddha Khaire & Monisha P Chengappa & Ushasi Ghosh & Yogesh Kalashkar
2. Handling Missing Values in Classification
Manish Limaye & Mansi Desai & Anuradha Ramprakash & Sanjana Shetty
3. The Netflix Contest: What Algorithms Were Successful and What Did We Learn from the Contest?
Cao,Zechun & Jimenez, Francisco & Charoenrattanaruk, Panitee & Sui, Bangsheng
4. Mining Social Networks: A Survey
Santosh Kumar & Kurashetty Srinaath & Ravchandran Sindhuja Kakaraparthy & Shashank Karam
5. Popular Algorithms for Mining Streaming Data
Maitreya Kundurthy & Rahul Lanka & Nagaraj Kanedole
6. Clustering of Proteins
Shen,Zhichao & Wang,Chong
7. Uses of Data Mining to Monitor and Protect the Environment
Ankit Anchlia & Arun Prasad Iyer & Charulata Rastogi & Darshan Modani
8. Application of Data Mining to Crime Analysis and Prevention
Dheeraj Reddy Mamidi, Raviteja Koluguri, Sandeep Sankineni, Ajoy Desai Ingala
Remarks: The first two groups will give a short presentation summarizing
their results on Tu., November 30, 2010; all other groups present on
Th., December 2, 2010. Groups of 2 students give 8 minute talks, groups of 3
students give 10 minute talks and groups of 4 students give 12 minute talks;
each student should participate in his/her group's presentation.
News COSC 6335 (Data Mining) Fall 2010
- The letter grades for this course should be available on December 18,
2011 and detailed grade reports should be available on Zechun's course
website no later than December 22, 2011.
I enjoyed
teaching this course, and might see some of you in 2012.
- Review List
2010 Final Exam (be aware that some pages of the "Top 10 Data Mining
Alg." paper are relevant for the final exam; therefore, I recommend studying
those pages carefully). The final exam
will not be returned to students; however, you can view your
final exam on Tuesday, January 4, 2011 11a-noon and on Thursday,
January 20, 2011 3:30-4:30p in Dr. Eick's office.
- Programming projects (unless specified otherwise) and other assignment
tasks are individual activities; therefore, collaborating with other students
is not allowed (also see academic honesty section near the end of
this webpage).
- 2011 Course Syllabus
2010 Exams Dates and Other Deadlines:
Midterm Exam: Tu., October 19
Final Exam: Tu., December 7, 10-11:45a (232 PGH)
Lab Classes: Th., September 9+14 (likely 563 PGH)
Assignment 1: Tu., September 21, 11p
Assignment 2: We., September 30, 11p
Assignment 3: Part1: Saturday, October 23, 11p; Part2: Sa., Nov. 6, 11p
Assignment 4: Th., November 11, 11p
Assignment 5: Th., December 2, 11p (Sa., Dec. 4, 11a (for groups presenting
on Nov. 30))
In general, all 2010 COSC 6335 activities will come to a halt on
Tu., December 7, 2p.
2010 Review Sessions
There will be 30 minute review sessions on September 23,
October 12, November 16, and November 30. Review questions will be
posted here. Occasionally, review questions will discuss paper-and-pencil
problems of assignments.
Review Questions for September 23
Review Questions for October 12
Review Questions for November 11
Review Questions for November 30
2009 Assignments
Instructions Concerning What
Software to Install (updated on August 27 at noon!).
Assignment1: Getting Familiar with
Cougar^2 (please attend the lab classes on September 10 and 17
that will help you with this task)
Introduction to Cougar^2
Assignment2: Exploratory
Data Analysis (Corrected Wine Dataset)
Draft of Assignment3: Making Sense of Data using
Traditional and
Clustering with Plug-in Fitness Functions(
How to run experiments
in Cougar^2, Earthquake09 Dataset
(some errors in the file have been corrected on Oct. 2, 2009),
Visualization Earthquake09 Dataset,
Last Year's Project Specification (contains
useful information for the 2009 course project)
Assignment4: Association Analysis and
Similarity Assessment (contains paper and
pencil style questions)
Assignment 5: Group Project (multiple topics
to choose from)
Prerequisites
The course is mostly self-contained. However, students taking the course
should have
sound software development skills and basic knowledge of Java. Lacking these skills will likely cause
trouble when carrying out the programming course projects.
The 2010 Offering of Data Mining
The
teaching in 2010 will be similar to the offering in 2009. The teaching in the first
8 weeks will closely cover material from the course textbook,
which also comes with good online teaching material. The programming projects center on the implementation of a
data mining algorithm and on applying data mining techniques to real-world datasets.
Moreover, 3-4
homeworks will be given that contain short, review-style questions; answers to
these exercises will be covered in review-style lectures every third Thursday.
Finally, software design for data mining will
be covered in the course in part, and students will be exposed to the
Java-based Cougar^2 Data Mining
and Machine Learning Environment that is currently under development by the
UH-DMML research group. In addition to learning how to design
and implement data mining algorithms and how to interpret data mining results,
the participation in the course project
will help you obtain valuable experience in Eclipse development, Java core
development, object oriented analysis & design, design patterns, and XML
technology. Having knowledge and experience in using these
technologies will
also help you get a job in the software industry.
Anyhow, students
who take the course should be familiar with the basic concepts of Java; if
you have doubts
about this prerequisite, feel free to contact Dr. Eick or members
of the UH-DMML research group about this matter.
Course Elements and Their Tentative Weights in 2010
Programming Projects and Homeworks: 40%
Exams: 59% (midterm: 26%; final exam: 33%)
Class Attendance: 1%
Tentative 2010 Teaching Plan and Transparencies (subject to change!)
I Introduction to Data Mining (Part1, Part2,
Part3,
Differences
between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and
Decision Trees
IV Introduction to Similarity Assessment
and Clustering (AGNES and DBSCAN,
Introduction to Regional Knowledge Extraction
(added on October 5, 2010; just look at the first 3 sections!),
Example Fitness Functions (to
be used in the Region Discovery Lecture),
Region Discovery in Spatial Datasets,
Introduction to the CLEVER Region Discovery Algorithm)
V Association Analysis(Part1,Part2)
VI A Short Introduction to Data Cubes
VII Preprocessing for Data Mining
VIII Introduction to Spatial Data Mining
(Spatial Regression)
IX More on Clustering and Outlier Detection: Grid-based,
Density-based Clustering, and Subspace Clustering,
Cluster Validity,
Anomaly/Outlier Detection.
X Software Design for Knowledge Discovery Projects and Background Knowledge
for Programming Projects (Software Design in General,
Introduction to Region Discovery, Region Discovery Technology (please
read the first 8 pages of the Wordfile), Weka Introduction Transparencies,
Experiment Guide,
Introduction to the CLEVER Region Discovery Algorithm,
Post Analysis Assignment
3a)
XI More on Classification: Instance-based Learning, Support Vector Machines,
Editing, Ensembles, ROC-Curves
(NN-Classifiers and Support Vector Machines,
Editing and Condensing Techniques for NN-Classifiers
(not covered in Fall 2008), Ensembles and ROC Curves,
Model Evaluation).
XII Top Ten Algorithms in Data Mining
(Top-10 Panel, Top10)
XIII Miscellaneous: 2009 Netflix Contest,
90 Days at Yahoo! and Final Words
Remark: The teaching plan will be updated continuously.
Grading
Each student has to
have a weighted average of 74.0 or higher in the
exams of the course in order to receive a grade of "B-" or better
for the course.
Students will be responsible for material covered in the
lectures and assigned in the readings. All homeworks and
project reports are due at the date specified.
No late submissions
will be accepted after
the due date. This policy will be strictly enforced.
Six times during the semester I will check class attendance on randomly
chosen dates, and an attendance score will be computed from how many
of the 6 lectures you attended.
Translation of numeric scores to letter grades:
A:100-90 A-:90-86 B+:86-82 B:82-77 B-:77-74 C+:74-70
C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
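As an illustration of how the weights and the translation table above fit together, here is a minimal Python sketch. Only the weights (40/26/33/1), the 74.0 exam threshold, and the cutoffs come from this page. The function names, the rule that a score exactly on a boundary receives the higher grade, the attendance score being the percentage of the 6 checks attended, and the cap at "C+" when the exam threshold is missed are all assumptions made for this example, not the official grading procedure.

```python
# Weights (in percent) as listed under "Course Elements and Their Tentative Weights".
WEIGHTS = {"projects": 40.0, "midterm": 26.0, "final": 33.0, "attendance": 1.0}

# (letter, lower bound) pairs from the translation table.
# Assumption: a score exactly on a boundary receives the higher grade.
CUTOFFS = [
    ("A", 90.0), ("A-", 86.0), ("B+", 82.0), ("B", 77.0), ("B-", 74.0),
    ("C+", 70.0), ("C", 66.0), ("C-", 62.0), ("D+", 58.0), ("D", 54.0),
    ("D-", 50.0), ("F", 0.0),
]
B_MINUS_OR_BETTER = {"A", "A-", "B+", "B", "B-"}


def exam_average(midterm, final):
    """Weighted exam average, using the 26:33 midterm/final weights."""
    return (WEIGHTS["midterm"] * midterm + WEIGHTS["final"] * final) / (
        WEIGHTS["midterm"] + WEIGHTS["final"]
    )


def attendance_score(attended, checks=6):
    """Assumption: attendance score is the percentage of checks attended."""
    return 100.0 * attended / checks


def course_score(projects, midterm, final, attended):
    """Overall score on a 0-100 scale; all component scores are 0-100."""
    parts = {
        "projects": projects,
        "midterm": midterm,
        "final": final,
        "attendance": attendance_score(attended),
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS) / 100.0


def to_letter(score):
    """Translate a numeric score into a letter grade using CUTOFFS."""
    for letter, lower in CUTOFFS:
        if score >= lower:
            return letter
    return "F"


def course_grade(projects, midterm, final, attended):
    """Letter grade including the 74.0 exam-average requirement."""
    letter = to_letter(course_score(projects, midterm, final, attended))
    # Assumption: missing the exam threshold caps the grade at C+,
    # the highest grade below B-.
    if exam_average(midterm, final) < 74.0 and letter in B_MINUS_OR_BETTER:
        return "C+"
    return letter


if __name__ == "__main__":
    print(course_grade(projects=90, midterm=85, final=88, attended=5))
    print(course_grade(projects=95, midterm=70, final=72, attended=6))
```

The second call shows the effect of the exam requirement: strong project scores alone do not lift a student to "B-" or better when the weighted exam average is below 74.0.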
Only machine written solutions to homeworks and assignments
are accepted (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we are unable to understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
homework problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Past Data Mining Exams
2008 Midterm Exam
2007 Final Exam
2009 Midterm Exam with Solution Sketches
2009 Final Exam with Solution Sketches
2010 Midterm Exam with Solution Sketches
Summary Answers COSC 6335 Student Questionnaire
2009 Student Language Summary Registered
Students: English:14, Hindi:9, Telugu:7, Bengali:2,
Vietnamese:2, Arabic:2, Sindhi:1,
French:1, Russian:1, Turkish:1, Kyrgyz(?):1, Tamil:1, Filipino:1, Spanish:1,
Urdu:1, Garhwali(?):1, Chinese:1; I am impressed: some of you spoke up to
four languages as a child! Concerning group projects, 11 students
liked group projects, 2 students disliked group projects, and 9 students
had no preference. Concerning reading scientific papers, 12 students liked
reading scientific papers, 3 students disliked it, and the rest of the students
were neutral or gave fuzzy answers such as "I like reading paper that are interesting."
15 students liked giving presentations and 4 students didn't. Concerning
projects that involve significant amounts of programming, 16 liked them and
3 didn't like them.
Master Thesis and Dissertation Research in Data Mining
If you plan to perform a dissertation or Master thesis project in the area of
data mining, I strongly recommend
taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following
courses: Pattern Classification (COSC 6343), Artificial
Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing
about evolutionary computing (COSC 6367) will be helpful, particularly
for designing novel data mining algorithms.
Moreover, having basic knowledge in data structures, software design, and databases is important when conducting
data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice.
Moreover, taking a course that teaches high performance computing is also
desirable, because most data mining algorithms are very resource intensive.
Because a lot of data mining projects have to deal with images, I
suggest taking at least one of the many
biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge
in the following fields is a plus: software engineering, numerical optimization techniques, statistics, and data visualization. Also be aware of the fact that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining. I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge
in data mining, machine learning, statistics and related areas. Similarly, you
will not be hired as an RA for a
data mining project without having some background in data mining.
2008 Textbook Reading Schedule
Recommended readings 2008: Sept. 1: 1-35; Sept. 3: 97-105;
Sept. 8: 105-131; Sept. 15: 144-166; Sept. 17: 168-186; Sept. 22: 186-193;
September 25: 487-491, 493-508; September 29: 510-513, 526-532;
October 1: read recommended region discovery document; October 6: 569-575;
material on CLEVER and SCMRG; October 15: 515-526;
October 21:
327-341; October 23: 349-358, 370-382; October 27: 415-426; October 31: 429-439.
November 6: 39-65; 69-74; November 16: 131-139; November 18: 600-612; 532-542;
546-550; November 20: 651-652; 658-661; 666-669; 671-674 (skip 10.5.2 and
10.5.3); November 24: 223-227; November 25: 256-276; December 1:
276-280; 283-291; December 2: 295-301.
Data Mining Links
KDnuggets
Netflix $1,000,000 Grand Prize
KDD 2009 Data Mining
Contest
KDD 2009
Tutorial on Predictive Data Mining and DM-Contests
2009 IEEE International
Conference on Data Mining (ICDM), Miami, December 2009.
UIUC Data Mining Group
Microsoft DMX Group
UMN Spatial Database and Spatial Data Mining Group
Data Mining and Machine Learning Group University
of Helsinki
UH's Data Mining and Machine Learning Group (UH-DMML)
Weka Data Mining
Software in Java
Weka's Most Recent
Version (Version 3.6)
RapidMiner (formerly Yale)