last updated: November 28, 2023
COSC 6335: Data Mining in Fall 2023
(Dr. Eick)
Goals of the Data Mining Course
Data mining centers on finding novel, interesting, valid, and potentially useful patterns in data. It aims
at transforming a large amount of data into a well of knowledge. Data mining has become a very
important field in industry as well as academia. The course covers most of the important data
mining techniques, covers the Basics of Data Science, and provides
background knowledge on how to conduct a data mining project.
Topics covered in the course include exploratory data analysis, classification and prediction,
clustering and similarity assessment, association analysis, outlier and anomaly detection, and
interpreting and evaluating data analysis/data mining results. Basic visualization techniques
and statistical methods will also be introduced. Moreover, hands-on data mining experience will be
provided in three Problem Sets. You will also get some practical experience in evaluating data
mining results from your fellow students and from data mining publications. Finally, you will learn how to use and
program in the popular statistics, visualization, and data mining environment R. The topics
of the course have some overlap with what is taught in the Machine Learning (COSC 6342) course. To
reduce this overlap, this course places a little less emphasis on learning
classification and prediction models (this topic will be covered "more quickly" and not a
lot of points are allocated in the problem sets to this topic)
and more emphasis on Data Science Basics, Exploratory Data Analysis,
Association Analysis, Clustering, and Outlier Detection.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
Basic Course Information
Christoph F. Eick
office hours: TU 4:10-5p TH 8:50-10a (on MS Teams)
TA: Md. Mahin
TA office hours: MO 12:30-1:30p TU 12:30-1:30p (on MS Teams)
2014 TA website: Arko
Barman's COSC 6335
class meets: TU/TH 2:30-4p
class room: S 120
classes taught by others: Tuesday, October 3, Tuesday, November 14, Thursday, November 16;
moreover, Mahin will teach 3-4 labs of 30-75 minutes each as part of the lecture to provide background
knowledge for some problem set tasks.
Makeup class (if necessary): TBDL
Objectives Data Mining Course
Highly Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
- Addison Wesley, Second Edition,
- Link to
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and
- Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home
NIST/SEMATECH e-Handbook of Statistical Methods (good online
source covering exploratory data analysis, statistics, modelling and prediction)
2023 Course Organization
I. Introduction to Data Mining
II. Data Science Basics and Exploratory Data Analysis
IV. Brief Introduction to Peer Reviewing and using Kritik for it
V. Labs: Using R and Python for Data Science and Data Mining
VI. Introduction to Clustering and Similarity Assessment
VII. Data Storytelling
VIII. Introduction to Density Estimation
IX. Outlier and Anomaly Detection
X. Introduction to Classification: Basic Concepts and Decision Trees, Support Vector Machines and Neural Networks.
XI. Brief Introduction to Deep Learning Centering on Autoencoders
XII. Reviewing Data Mining Papers
XIII. Association Analysis: Rule, Sequence, Graph and Collocation Mining
XIV. Spatial Data Mining (optional topic)
XV. Advanced Clustering
Important Course Dates
Tuesday, September 5: Lab taught by Mahin (in preparation to Task1)
Tuesday, October 17, 2:30p: Midterm Exam (Review List 2023 Midterm Exam)
Thursday, December 7, 2p: Final exam
News COSC 6335 (Data Mining) Fall 2023
- Dr. Eick's office hours this week are: WE 9:10-10a and TH 8:50-10a!
- The lecture on November 28 will have Group L's GHC presentation and
will start the discussion of Advanced Clustering; moreover, there will be polls concerning
the 5 problem set tasks and a discussion about the use of KRITIK in this course 3:15-3:45p (Discussion
Questions; please take a look!).
Mahin's Sept. 5 Lab
Attendance counts 2% towards the course grade.
Attendance will be taken starting Tuesday, August 29, 2023 and throughout the remainder of the semester (with
the exception of Tuesday, November 21). Only
F2F attendance counts. Attendance will be taken at approx. 2:40p;
that is, if you show up 20 minutes late, your attendance will not count. In total, 25 attendances
(August (2), September (8), October (8), November (7)) will
be recorded. Your number of attendances will be converted into a number grade as follows:
24-25:92, 22-23:91, 21:90, 20:89, 19:87, 18:83, 17:79, 16:75, 15:71, 14:67, 13:63, 12:59, 0-11:55.
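The conversion table above can be read as a simple lookup. The following is a minimal sketch, not the official grading script; the function name `attendance_grade` and the table name `ATTENDANCE_GRADES` are hypothetical and chosen only for illustration.

```python
# Attendance-to-grade table from the course page, written as
# (minimum number of attendances, number grade), highest band first.
ATTENDANCE_GRADES = [
    (24, 92), (22, 91), (21, 90), (20, 89), (19, 87), (18, 83),
    (17, 79), (16, 75), (15, 71), (14, 67), (13, 63), (12, 59), (0, 55),
]


def attendance_grade(attended: int) -> int:
    """Return the number grade for a count of recorded attendances (0-25)."""
    if not 0 <= attended <= 25:
        raise ValueError("attended must be between 0 and 25")
    # The first band whose minimum is reached determines the grade.
    for minimum, grade in ATTENDANCE_GRADES:
        if attended >= minimum:
            return grade
    raise AssertionError("unreachable: the table covers 0-25")
```

For example, `attendance_grade(23)` falls into the 22-23 band, and any count of 11 or fewer maps to the lowest band.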
Midterm Exam (scheduled for Oct. 17, 2023 in 105 SEC; Review List;
October 12, 2023 Review for the Midterm Exam; Solution Sketches 2022 Midterm
Exam; Solution Sketches 2023 Midterm Exam)
The course final exam (2022 Review List, Dec. 2 Review
for the 2022 Final Exam) has been scheduled for Thursday, December 7, 2023, at 2p
in S105. It will take slightly less than 2 hours.
Course exams will be open book/notes paper exams; however, the use of cell phones and
computers during the exam is not allowed!
2023 Problem Sets
Exploratory Data Analysis for a Basel Weather Dataset, Task2: Develop
an Intelligent Tool which Compares Boxplots; individual
tasks; you should read the specification of Task1 by August 30, and should start working on Task1 on
ProblemSet2 (contains Task3 a clustering group task, and Task4 an individual, peer reviewed
outlier detection task)
ProblemSet3 (contains Task5, a peer reviewed group task in which you will review a
data mining paper which is due on November 14, 2023; Short Discussion
Concerning Reviewing Data Mining Papers).
2023 Groups for Tasks 3 and 5; Task 1, 2, and 4 are individual tasks!
2023 Weights for the ProblemSet Tasks: Task1:4, Task2:4, Task3:4, Task4:5, Task5:2.
2023 Weights of Course Elements: Midterm Exam:21%, Final Exam:27%; Group Homework Credit: 3%; Attendance:2%; Problem
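One way the task weights listed above might combine into a problem set total is a weighted average. This is a sketch under stated assumptions: each task score is taken to be already normalized to a 0-100 scale, and the weighted average itself is an assumption, since the page only says that problem set scores are normalized and added. The names `TASK_WEIGHTS` and `problem_set_total` are hypothetical.

```python
# Task weights from the course page (Task1:4, Task2:4, Task3:4, Task4:5, Task5:2).
TASK_WEIGHTS = {"Task1": 4, "Task2": 4, "Task3": 4, "Task4": 5, "Task5": 2}


def problem_set_total(scores: dict[str, float]) -> float:
    """Weighted average of normalized task scores (assumed to be on a 0-100 scale)."""
    missing = set(TASK_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(TASK_WEIGHTS.values())  # 4 + 4 + 4 + 5 + 2 = 19
    weighted_sum = sum(TASK_WEIGHTS[task] * scores[task] for task in TASK_WEIGHTS)
    return weighted_sum / total_weight
```

Under this reading, Task4 contributes 5/19 of the total and Task5 only 2/19, so a weak Task5 score costs less than a weak Task4 score.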
Dr. Eick's COSC 6335 2022 Lecture Notes
I Introduction to Data Mining (Part1, Part2,
Peer Reviewing and Kritik,
Part3--- covers chapter 1 and Section 2.1)
II Data Science Basics (formerly called Exploratory Data Analysis) (covers
chapter 3 of the first edition of the Tan book)
III R (Arko's Short Intro Into R (used in Lab),
Plot Code, Decision Trees in R, Some useful code for Task1 (to be discussed
on Sept. 16),
Some other code for Task1 (not discussed in the lecture),
Computing Statistical Summaries In the Presence of Missing Values (NA) (not discussed in the lecture),
and Loops in R (useful, but not discussed in the lecture))
IV Peer Reviewing and Kritik (
Introduction to Kritik Video (we will
watch this video during the lecture!))
V A Brief Introduction to Naive, Parametric
and Non-Parametric Density Estimation
VI Introduction to Similarity Assessment
VII More on Clustering: Density-based Clustering and Hierarchical Clustering,
R-scripts demoing K-means/medoids and
DBSCAN, Randomized Hill Climbing and Cluster Validity.
VIII Outlier Detection
IX Classification (Introduction to Classification: Basic Concepts and Decision Trees,
3blue1brown: What is a
Neural Network? (will show the first 12:30 of this video),
Neural Networks Part2, kNN-Classifiers and Support Vector Machines)
X Deep Learning (Introduction to Deep Learning (will watch and discuss some MIT Deep Learning
Bootcamp Videos), Review Neural Network Basics, Autoencoders, Language Models and Convolutional
neural networks (CNN)); taught by Mahin on Nov. 14+16, 2023, More on VAEs (not covered in 2023!))
XI Association Analysis: Association Rule Mining,
XII Introduction to Spatial Data Mining
XIII Data Preprocessing for Data Mining (will already be covered on Sept. 5, 2023)
XIV Advanced Clustering (might only cover FCM and EM)
XV Data Storytelling (already covered in early October 2023)
Old Webpages of COSC 6335: 2013 and
2023 Group Homework Credit
2023 GHC Groups and Contact E-mails
Th., Sept 7: Group A
Th., Sept. 14: Group B and Group C
Th., Sept. 21/Tu., September 26: Group D and Group E
Th., Sept. 28: Group F
Th., October 5: Group G
Tu., October 10: Denclue 2.0 paper Walkthrough led by Group H
Th., October 12: Group I
Th., November 9: Group J and K (group K's
presentation has been moved to November 21)
Tu., November 28: Group L
Th., November 30: Group M
Group tasks will be posted here, at least six days before a group's presentation day!
Graduate Research Opportunities
UH-DAIS Research Overview
2023 COSC 6335 Grading
Dr. Eick uses the following number grade scale for graduate courses: A: 100-90, A-: 90-85, B+: 85-82, B: 82-77,
B-: 77-74, C+: 74-70, C: 70-66, C-: 66-62, D+: 62-58, D: 58-54, D-: 54-50, F: 50-0. Exams and problem sets
are curved after your exam and task scores
have been determined. Exam scores are curved immediately, whereas problem set scores are normalized and added;
ultimately, your problem set total is converted at the end of the semester into a number grade which
counts about 48% towards the overall number grade score, which is then converted into a letter grade.
Number grades higher than 95 are rarely used, except for
truly exceptional performances and outliers; a grade of 95
already represents a letter grade of A+. Exams of graduate courses are usually
curved so that the exam number grade average is in the
range 81-84, depending on how well the students performed in the particular exam. Dr. Eick
first determines the class's performance in the exam and selects an average,
and then the exam is curved accordingly.
Number grades do not directly correspond to the percentage obtained: when the
class's percentage average is high, percentages
will be downgraded, and when it is low, percentages will be upgraded.
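The number grade scale listed at the start of this section can be sketched as a threshold lookup. Because adjacent ranges share their boundary value (90 appears in both A and A-), this sketch assumes that a boundary value belongs to the higher grade; that tie-breaking rule is an assumption, not something the page states. The names `GRADE_FLOORS` and `letter_grade` are hypothetical, and the curving step is not modeled.

```python
# Lower bound of each letter grade, taken from the scale on the course page.
# Assumption: a boundary value (e.g. 90) is assigned to the higher grade.
GRADE_FLOORS = [
    (90, "A"), (85, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"), (0, "F"),
]


def letter_grade(number_grade: float) -> str:
    """Map an already curved number grade (0-100) to a letter grade."""
    if not 0 <= number_grade <= 100:
        raise ValueError("number_grade must be between 0 and 100")
    # The first floor that the number grade reaches determines the letter.
    for floor, letter in GRADE_FLOORS:
        if number_grade >= floor:
            return letter
    raise AssertionError("unreachable: the table covers 0-100")
```

Under this assumption, the reported midterm number grade average of 82.03 corresponds to a B+.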
As far as individual Kritik tasks are concerned, creation scores were weighted by 66%, written evaluation scores by 15%, grading scores
by 10% and feedback scores by 9%. As far as the grading of the 2 group tasks is concerned:
Dr. Eick read your collocation mining reports and created his own creation score for each group project,
based on his impressions and his evaluation of the
quantity and the quality of the work you did for the collocation mining task; this score will be combined with the
creation score your group received from your student peers (by averaging the two scores and then
converting the average into a number grade). The same procedure was used
to obtain a final creation score for Task 7, except that Nour, and not Dr. Eick, took a look at your reviews. Moreover, no feedback with
respect to written comments is solicited from the groups in Kritik group projects. However, your written
comments for Task 6 will be graded by Dr. Eick and for Task 7 by Nour, and this written comment score will be combined (counting 15%) with
your group's creation score (counting 85%) when the final group task grade is computed. In contrast to individual Kritik projects,
there will be no feedback and grading scores for group projects.
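The Kritik weighting described above can be sketched as two weighted sums. This is an illustration under assumptions: the feedback weight is taken to be 9% so that the four individual weights sum to 100%, all component scores are assumed to share one scale, and the conversion of the averaged creation score into a number grade is not modeled. All names (`INDIVIDUAL_WEIGHTS`, `GROUP_WEIGHTS`, `weighted_score`, `group_creation_score`) are hypothetical.

```python
# Component weights for individual Kritik tasks.
# Assumption: feedback counts 9%, so the weights sum to 100%.
INDIVIDUAL_WEIGHTS = {
    "creation": 0.66,
    "written_evaluation": 0.15,
    "grading": 0.10,
    "feedback": 0.09,
}

# Component weights for group tasks (no feedback or grading scores).
GROUP_WEIGHTS = {"creation": 0.85, "written_comments": 0.15}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine component scores using the given weights."""
    if set(scores) != set(weights):
        raise ValueError("scores must have exactly the same components as weights")
    return sum(weights[name] * scores[name] for name in weights)


def group_creation_score(instructor_score: float, peer_score: float) -> float:
    """Average the instructor's creation score with the peers' creation score."""
    return (instructor_score + peer_score) / 2
```

For a group task, the creation component would first be computed with `group_creation_score` and then passed to `weighted_score` together with the written comment score and `GROUP_WEIGHTS`.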
Comments about COSC 6335 Exams: to reduce the probability of cheating, the COSC 6335 final exam was designed to
be slightly too long. If it turns out that the exam was more than slightly too long, be aware of the fact that all
students took the same exam, and that the final exam is still subject to curving to possibly rectify some problems with
In most of Dr. Eick's exams students who score more than 75% of the available points in the exam usually will get a grade of A- or better.
Moreover, I believe that graduate students should be "challenged" to demonstrate their skills during final exams. On the other
hand, this semester's
midterm exam was not very challenging, in my opinion,
and more importantly it was much too short, encouraging cheating.
Dr. Eick also concluded, after using them a lot for 2 semesters, that multiple choice exams are not appropriate to
assess particular skills, and that multiple choice exams often
test a student's capability with respect to differences in natural language semantics rather than assessing whether students actually
understand and can apply what was taught in a course.
Peer Review using Kritik
Moreover, peer review will be a component of this course: you will evaluate work of other students
in the course and work of other peers in
the field of data mining; Kritik will be used for producing
and evaluating peer evaluations. There will be a $29+tax student fee to get Kritik access; however, if you
are a really poor graduate student, feel free to contact Dr. Eick to subsidise this fee! Moreover, we received
2 'free' accounts for deserving students!
Kritik Tasks: There is a 24-hour grace period for each Kritik task. You are allowed to use this grace period for
up to 5 submissions. "Draft" rubrics for the peer reviewed tasks in the problem sets can be found in the ProblemSet
channel in DM2000 and ultimately you will find the "final" rubric in Kritik. As these rubrics tell
you how your submissions will be graded by your peers, Nour and Dr. Eick, it might be worthwhile looking
at those rubrics closely.
2022 Problem Sets
Problem Set1 (Task1 Specification, Task2 Specification,
Discussion Task2; T2 Density Plots
for Data Sets D1, D2 and D3; Task1 is due Sunday, Sept. 25, 11:59p and Task2 is due Thursday, October 6, 11:59p)
Task3: Clustering (group activity of groups of 2 or 3 students; due on Tuesday, November
8, 11:59p; Spatial Analysis
Lab (taught by Mahin))
Task4: Autoencoder and Dimensionality Reduction (individual activity; is due on Tuesday, November 22, 11:59p,
last updated on October 25, 2022)
Task5: Reviewing a Data Mining Paper (group task; rubrics have been added to
the specification on Nov. 28; Short Discussion
Concerning Reviewing Data Mining Papers; initial review is due Dec. 1, followed
by some Kritik peer reviewing in the window Dec. 4-6)
Old COSC 6335 News Items
- The number grades of the midterm exam will be available by the end of the day of Nov. 2 on MS Teams; the results were not
that good; the average was 12% lower than the results for last year's midterm exam; 8 students had quite low
scores of 21 points or less, but, on the positive side, five students did very well in this year's midterm exam, scoring
in the 59-64.5 range. The score average was 36 and the number grade
average was 82.03! Solution sketches for the midterm exam will be posted on the course website early next week.
- Every student in COSC 6335 should have received an e-mail from Kritik, about setting up an account for COSC 6335; if you
did not receive such an e-mail, send an e-mail to Mahin immediately!
- We will start taking F2F attendance on Tuesday, August 29. Attendance counts 2% towards the
course grade and will be taken at 2:40p for the remainder of the semester. You will find more details about the 2023
attendance policy for this course below!
- Read Chapter 3 of the first edition of the Tan book before the lecture on August 29; the link to the chapter is in the
Lecture Notes Section below!
- At the moment you see the website of the Fall 2022 teaching of the course; this website will be updated
continuously as we move along with the teaching of the course in the Fall 2023 semester!
- We will be using Kritik for the course and there is a fee of $29 for using Kritik; however, Dr. Eick is willing to
pay your fee if you cannot afford to pay it. If you believe you belong to this group, send Dr. Eick an e-mail
providing a short but convincing justification why you cannot afford to pay a fee of $29
by September 5, 2023 at 1p at the latest. I will respond to your e-mails by September 8 at the latest.
- Always download documents from the course website, as it always stores the most recent version of
the respective documents.
2022 Group Homework Credit
In this activity, which will be called group homework credit, each group formed for this activity
receives a different homework-style problem, presents its solution during the lecture (the presentation
should take about 11-14 minutes), and
shares its solution in the form of a Word, pptx, or pdf file in the respective channel in
DM2022 (2022 Group Homework Credit
Groups). Here is a list of the already assigned tasks and associated groups and dates;
tasks will be added as we move along with the teaching of the course:
Group A and B Tasks (to present Friday, September 16)
Group C and D Tasks (to present Friday, September 23/30)
Group E will lead a discussion about DENCLUE!
Group F and G Tasks (Group F will present on October 14 and group
G will present on October 28!)
Group H Task (Group H will present November 11)
Group I Task (will present on Nov. 18)
Group J and K Task (will present on Dec. 2)
Other Ideas for COSC 6335
Another idea is to reach out to industry to sponsor COSC 6335 course activities; e.g.
we could have the "Collocation Mining Group Project (sponsored by Company X)". If you have any good ideas and/or
expertise on obtaining such sponsorships, feel free to contact Nour or Dr. Eick!