last updated:
December 18, 2009
COSC 6335: Data Mining in Fall 2009
(Dr. Eick)
Most Recent Offering of COSC 6335 Data Mining
Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as
well as academia. For example, almost 800 papers were submitted
for the IEEE International Conference
on Data Mining (ICDM) that will be held in Miami, Florida in December 2009. Data mining tools and
suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and
in research projects.
UH's Data Mining and Machine Learning Group
(UH-DMML) conducts research in some of the areas that are covered by this
course (Research of Dr. Eick's Subgroup). Moreover, we are currently developing Cougar^2, an open source data mining and machine
learning platform
that will be used in part in the course projects. Finally, having
basic knowledge in data mining is a plus when you are looking for a job in
industry and at major US research
institutions, such as the Texas Medical Center in Houston or at Federal Research Labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a data mining project. In the first 8
weeks a very basic introduction to data mining will be given. After defining what knowledge discovery and
data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Moreover, techniques for preprocessing data for a data mining task will be covered. Also, basic
visualization techniques and statistical methods will be introduced. Finally, in the remaining
5 weeks of the semester, more advanced topics including spatial data mining, advanced clustering and classification
techniques, and mining sequence and streaming data will be discussed.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@cs.uh.edu
Basic Course Information
Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 11:30a-12:30p and TH 1-2p
e-mail: ceick@uh.edu
Teaching Assistant: Daquan Zhang TBDL
office hours (226 PGH): TU 1-2p TH 2-3p
www: Daquan's COSC 6335 Website
e-mail: zhang_dq@cs.uh.edu
Rachsuda's (577 PGH) e-mail: rachsuda@gmail.com (only for Assignment1)
TA's website: TBDL
class meets: TU/TH 10-11:30a
cancelled classes: Tu., Nov. 24
Makeup classes: Tu., December 8, 10-11:30a in 200 PGH
class room: 200 PGH
Course Materials
Required Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
- Addison Wesley,
- Link to
Book HomePage
Mildly Recommended Texts:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques
- Morgan Kaufmann Publishers, second edition.
- Link to Data Mining Book Home
Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, modelling and prediction)
News COSC 6335 (Data Mining) Fall 2009
- I enjoyed teaching the course and would like to wish you a happy and successful
year 2010.
- The letter grades for the course will be available on Fr., Dec. 18,
and more detailed grading reports will be posted no later than Dec. 22;
check Daquan's website. The assignment scores are converted into number grades;
from those, a final number grade is derived using a formula (that will be posted on Daquan's website),
and this final number grade is then converted into a letter
grade using the number-to-letter translation given on this webpage. In general,
the assignments counted 40% (weights for the parts can be found in
Daquan's grade summary) towards the final grade, the midterm counted
27%, and the final exam counted 33%.
- The final exam will not be returned to students; however, you
can look at your final exam on the following dates: We., December 23, 1-2p;
Thursday, January 21, 10-11a or Tuesday, January 26, 11a-noon.
Moreover, solution sketches for the final
exam can be found below (see past exams).
- Reading Instructions: Read Chapter 3 of the
textbook by September 1, 2009. Read chapter 4 by September 14, 2009!
Read Chapter 8 pages 487-506 by September 30, 2009; Read Chapter 8 pages
510-514 and 526-532, Chapter 9 pages 569-576 by October 5, 2009. Read first 9
pages of the region discovery technology document by October 8. Make sure you read pages 487-505 top, 526-532, 600-604; 69-74, 80 bottom-84 by
October 12, 2009. Read pages 327-349 by October 20, read pages 349-358 by
October 22, read pages 370-386 by October 25, read pages 415-426 & 429-439
by October 26; read pages 131-140 by October 28; read pages
36-65 by November 2; read Spatial Analysis Wikipedia document by
November 9; read 516-526 and 608-612 by November 10; read 532-548 by
November 12; read pages 651-659 and 666-674 by November 13; read pages
223-227 by November 15; read pages 256-290 by November 19;
read Top10 Data Mining Article by December 1, centering on those algorithms
that were covered in the course!
- Programming projects (unless specified otherwise) and other assignment
tasks are individual activities; therefore, collaborating with other students
is not allowed (also see academic honesty section near the end of
this webpage).
- Course Syllabus
2009 Exams Dates and Other Deadlines:
Midterm Exam: Th., October 15
Final Exam: Tu., December 8, 10a (in PGH 200)
Assignment 1: Tu., September 22, 11p
Assignment 2: We., September 30, 11p
Assignment 3: Part1: Mo., October 12, 11p; Part2: Th., October 29, 11p
Assignment 4: We., November 11, 11p
Assignment 5: Report due Mo., November 30,
11p (for groups presenting on December
3) / We, December 2, 11a (for groups presenting on December 1)
In general, all 2009 COSC 6335 activities will come to a halt on
Tu., December 8, noon.
2009 Review Sessions
There will be 30 minute review sessions on September 29,
October 13, November 17, and December 1 (or 3). Review questions will be
posted here. Occasionally, review questions will discuss paper-and-pencil
problems of assignments.
Review Questions for September 29
Review Questions for October 13
Review Questions for November 17
Review Questions for December 1
2009 Assignments
Instructions Concerning What
Software to Install (updated on August 27 at noon!).
Assignment1: Getting Familiar with
Cougar^2 (please attend the lab classes on September 10 and 17
that will help you with this task)
Introduction to Cougar^2
Assignment2: Exploratory
Data Analysis (Corrected Wine Dataset)
Draft of Assignment3: Making Sense of Data using
Traditional and
Clustering with Plug-in Fitness Functions (
How to run experiments
in Cougar^2, Earthquake09 Dataset
(some errors in the file have been corrected on Oct. 2, 2009),
Visualization Earthquake09 Dataset,
Last Year's Project Specification (contains
useful information for the 2009 course project)
Assignment4: Association Analysis and
Similarity Assessment (contains paper and
pencil style questions)
Assignment 5: Group Project (multiple topics
to choose from)
Prerequisites
The course is mostly self-contained. However, students taking the course
should have
sound software development skills and basic knowledge of Java. Lacking these skills will likely cause
trouble when working on the programming projects.
The 2009 Offering of Data Mining
The teaching in 2009 will be similar to the offering in 2008. The teaching in the first
8 weeks will closely cover material from the course textbook,
which also comes with good online teaching material. The programming projects center on the implementation of a
data mining algorithm and on applying data mining techniques to real-world datasets.
Moreover, 3-4
homeworks will be given that contain short, review-style questions; answers to
these exercises will be covered in review-style lectures every third Thursday.
Finally, software design for data mining will
be covered in the course in part, and students will be exposed to the
Java-based Cougar^2 Data Mining
and Machine Learning Environment that is currently under development by the
UH-DMML research group. In addition to learning how to design
and implement data mining algorithms and how to interpret data mining results,
participation in the course project
will help you obtain valuable experience in Eclipse development, Java core
development, object-oriented analysis & design, design patterns, and XML
technology. Having knowledge and experience in using these
technologies will
also help you get a job in the software industry.
In any case, students
who take the course should be familiar with the basic concepts of Java; if
you have doubts
about this prerequisite, feel free to contact Dr. Eick or members
of the UH-DMML research group about this matter.
Course Elements and Their Weights in 2009
Programming Projects and Homeworks: 30-42%
Exams and Quizzes (2-3): 57-68%
Class Participation: 1%
Tentative 2009 Teaching Plan and Transparencies (subject to change!)
I Introduction to Data Mining (Part1, Part2,
Part3,
Differences
between Clustering and Classification --- covers chapter 1 and Section 2.1)
II Exploratory Data Analysis (covers chapter 3 in part; see also Interpreting Displays)
III Introduction to Classification: Basic Concepts and
Decision Trees
IV Introduction to Similarity Assessment
and Clustering (AGNES and DBSCAN,
Region Discovery in Spatial Datasets,
Introduction to the CLEVER Region Discovery Algorithm)
V Association Analysis (Part1, Part2)
VI A Short Introduction to Data Cubes
VII Preprocessing for Data Mining
VIII Introduction to Spatial Data Mining
(Spatial Regression)
IX More on Clustering and Outlier Detection: Grid-based,
Density-based Clustering, and Subspace Clustering,
Cluster Validity,
Anomaly/Outlier Detection.
X Software Design for Knowledge Discovery Projects and Background Knowledge
for Programming Projects (Software Design in General,
Introduction to Region Discovery, Region Discovery Technology (please
read the first 8 pages of the Wordfile), Weka Introduction Transparencies,
Experiment Guide,
Introduction to the CLEVER Region Discovery Algorithm,
Post Analysis Assignment
3a)
XI More on Classification: Instance-based Learning, Support Vector Machines,
Editing, Ensembles, ROC-Curves
(NN-Classifiers and Support Vector Machines,
Editing and Condensing Techniques for NN-Classifiers
(not covered in Fall 2008), Ensembles and ROC Curves,
Model Evaluation).
XII Top Ten Algorithms in Data Mining
(Top10)
XIII Miscellaneous: 2009 Netflix Contest,
90 Days at Yahoo! and Final Words
Remark: The teaching plan will be updated continuously.
Grading
Each student has to
have a weighted average of 74.0 or higher in the
exams of the course in order to receive a grade of "B-" or better
for the course.
Students will be responsible for material covered in the
lectures and assigned in the readings. All homeworks and
project reports are due at the date specified.
No late submissions
will be accepted after
the due date. This policy will be strictly enforced.
Translation number to letter grades:
A:100-90 A-:90-86 B+:86-82 B:82-77 B-:77-74 C+:74-70
C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
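As a rough illustration of how the weights and the translation table above fit together, here is a minimal Python sketch. It is not the official grading formula (that formula is posted on the TA's website). The sketch assumes that the final number grade is a plain weighted sum using the 40% / 27% / 33% split reported in the news section, that the lower bound of each grade range is inclusive, that the exam average weights midterm and final in the ratio 27:33, and that missing the 74.0 exam minimum caps the letter grade at C+. All function and constant names are hypothetical.

```python
# Weights as reported in the 2009 news item (assumed to combine linearly).
ASSIGNMENT_WEIGHT = 0.40
MIDTERM_WEIGHT = 0.27
FINAL_WEIGHT = 0.33
EXAM_MINIMUM = 74.0  # exam average required for a "B-" or better

# Lower bounds from the translation table; treating each bound as
# inclusive is an assumption, because adjacent ranges share endpoints.
GRADE_FLOORS = [
    (90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def final_number_grade(assignments, midterm, final):
    """Weighted sum of the three number grades (each on a 0-100 scale)."""
    return (ASSIGNMENT_WEIGHT * assignments
            + MIDTERM_WEIGHT * midterm
            + FINAL_WEIGHT * final)


def exam_average(midterm, final):
    """Weighted exam average; the 27:33 ratio is an assumption."""
    return ((MIDTERM_WEIGHT * midterm + FINAL_WEIGHT * final)
            / (MIDTERM_WEIGHT + FINAL_WEIGHT))


def to_letter(number_grade):
    """Translate a number grade into a letter grade using the table."""
    for floor, letter in GRADE_FLOORS:
        if number_grade >= floor:
            return letter
    return "F"


def course_letter(assignments, midterm, final):
    """Letter grade with the exam-minimum rule applied as a C+ cap (assumed)."""
    letter = to_letter(final_number_grade(assignments, midterm, final))
    if (exam_average(midterm, final) < EXAM_MINIMUM
            and letter in {"A", "A-", "B+", "B", "B-"}):
        return "C+"
    return letter


# Example: assignments 90, midterm 80, final 85 gives 85.65, i.e. "B+".
print(course_letter(90, 80, 85))
```

Under these assumptions, a student with perfect assignment scores but 70 on both exams would still receive a C+, because the exam average of 70 falls below the 74.0 minimum.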
Only machine-written solutions to homeworks and assignments
are accepted (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we are unable to understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
homework problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Past Exams
2008 Midterm Exam
2007 Final Exam
2009 Midterm Exam with Solution Sketches
2009 Final Exam with Solution Sketches
Summary Answers COSC 6335 Student Questionnaire
2009 Student Language Summary Registered
Students: English:14, Hindi:9, Telugu:7, Bengali:2,
Vietnamese:2, Arabic:2, Sindhi:1,
French:1, Russian:1, Turkish:1, Kyrgyz(?):1, Tamil:1, Filipino:1, Spanish:1,
Urdu:1, Garhwali(?):1, Chinese:1; I am impressed: some of you spoke up to
four languages as a child! Concerning group projects, 11 students
liked group projects, 2 students disliked group projects, and 9 students
had no preference. Concerning reading scientific papers, 12 students liked
reading scientific papers, 3 students disliked it, and the rest of the students
were neutral or gave fuzzy answers such as "I like reading paper that are interesting."
15 students liked giving presentations and 4 students didn't. Concerning
projects that involve significant amounts of programming, 16 liked them and
3 didn't.
Master Thesis and Dissertation Research in Data Mining
If you plan to perform a dissertation or Master's thesis project in the area of
data mining, I strongly recommend
taking the "Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following
courses: Pattern Classification (COSC 6343), Artificial
Intelligence (COSC 6368) or Machine Learning (COSC 6342). Furthermore, knowing
about evolutionary computing (COSC 6367) will be helpful, particularly
for designing novel data mining algorithms.
Moreover, having basic knowledge in data structures, software design, and databases is important when conducting
data mining projects; therefore, taking COSC 6320, COSC 6318 or COSC 6340 is also a good choice.
Moreover, taking a course that teaches high performance computing is also
desirable, because most data mining algorithms are very resource intensive.
Because a lot of data mining projects have to deal with images, I
suggest taking at least one of the many
biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge
in the following fields is a plus: software engineering, numerical optimization techniques, statistics, and data visualization.
Also be aware of the fact that having sufficient background in the above listed areas is a prerequisite for consideration for a thesis or dissertation project in the area of data mining.
I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge
in data mining, machine learning, statistics and related areas. Similarly, you
will not be hired as an RA for a
data mining project without having some background in data mining.
2008 Textbook Reading Schedule
Recommended readings 2008: Sept. 1: 1-35; Sept. 3: 97-105;
Sept. 8: 105-131; Sept. 15: 144-166; Sept. 17: 168-186; Sept. 22: 186-193;
September 25: 487-491, 493-508; September 29: 510-513, 526-532;
October 1: read recommended region discovery document; October 6: 569-575;
material on CLEVER and SCMRG; October 15: 515-526;
October 21:
327-341; October 23: 349-358 370-382; October 27: 415-426; October 31: 429-439.
November 6: 39-65; 69-74; November 16: 131-139; November 18: 600-612; 532-542;
546-550; November 20: 651-652; 658-661; 666-669; 671-674 (skip 10.5.2 and
10.5.3); November 24: 223-227; November 25: 256-276; December 1:
276-280; 283-291; December 2: 295-301.
Data Mining Links
KDnuggets
Netflix $1,000,000 Grand Prize
KDD 2009 Data Mining
Contest
KDD 2009
Tutorial on Predictive Data Mining and DM-Contests
2009 IEEE International
Conference on Data Mining (ICDM), Miami, December 2009.
UIUC Data Mining Group
Microsoft DMX Group
UMN Spatial Database and Spatial Data Mining Group
Data Mining and Machine Learning Group University
of Helsinki
UH's Data Mining and Machine Learning Group (UH-DMML)
Weka Data Mining
Software in Java
Weka's Most Recent
Version (Version 3.6)
RapidMiner (formerly Yale)