last updated: February 23, 2026


COSC 6335: Data Mining in Fall 2024 (Dr. Eick)



2024 COSC 6335 Syllabus

Goals of the Data Mining Course

Data mining centers on finding novel, interesting, valid, and potentially useful patterns in data. It aims at transforming a large amount of data into a well of knowledge. Data mining has become a very important field in industry as well as academia. The course covers most of the important data mining techniques, covers the basics of data science, and provides background knowledge on how to conduct a data mining project. Topics covered in the course include exploratory data analysis, classification and prediction, clustering and similarity assessment, association analysis, outlier and anomaly detection, and interpreting and evaluating data analysis/data mining results. Basic visualization techniques and statistical methods will also be introduced. Moreover, hands-on data mining experience will be provided in three assignments. You will also get some practical experience in evaluating data mining results from your fellow students and from data mining publications. Finally, you will learn how to use and program in R, the popular statistics, visualization, and data mining environment. The topics of the course have some overlap with what is taught in the Machine Learning (COSC 6342) course. To reduce this overlap, this course places a little less emphasis on learning classification and prediction models (this topic will be covered "more quickly"), and more emphasis will be put on Data Science Basics, Exploratory Data Analysis, Association Analysis, Clustering, Outlier Detection and Data Set Augmentation. There will be 3 assignments and 2 "paper walkthroughs" in 2026. There will be 3 group activities in 2026: Assignment2, leading paper walkthroughs, and Group Homework Credit (you will find a more detailed description of the group activities below).

Comments concerning this website

If you have any comments concerning this website, send e-mail to: ceick@uh.edu

Basic Course Information

Instructor: Dr. Christoph F. Eick
office hours: TU 4-5p TH 9-10a
e-mail: ceick@uh.edu
TA: Janet Anagli
e-mail: jyanagli@CougarNet.UH.EDU
TA office hours: MO+TU 9-10a (on MS Teams)
class meets: TU/TH 2:30-4p
class room: SEC 203
classes taught by others: Likely, February 19
cancelled classes: Likely, March 12

Course Materials

Objectives Data Mining Course

Highly Recommended Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley, Second Edition, 2019
Link to Book HomePage

Recommended Texts: NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

2026 Course Organization

I. Introduction to Data Mining
II. Density Estimation
III. Similarity Assessment
IV. Outlier and Anomaly Detection
V. Autoencoders
VI. Data Science Basics and Exploratory Data Analysis
VII. Introduction to Clustering
VIII. Classification: Basic Concepts, Decision Trees and Neural Networks
IX. Association Analysis: Rule, Sequence, Graph and Collocation Mining
X. Spatial Data Mining
XI. Data Augmentation
XII. Data Storytelling
XIII. Advanced Clustering
XIV. Preventing False Discoveries (Optional Topic)
XV. Preprocessing
  

Important Course Dates

Thursday, February 5: Assignment1 Lab
Tu., Feb. 24: In-class Activity (30 minutes) and Lockdown Browser Mock Exam (10 minutes)
Thursday, March 5, 11:30a: First Course Exam
March 17+19: no lecture (Spring Break)
Th., March 26: First Paper Walkthrough
Th+Tu, April 2+7: Task2 Student Presentations
Th., April 16: Second Paper Walkthrough
Thursday, April 30, 11:30a: Second Course Exam

News COSC 6335 (Data Mining) Fall 2026

2026 Assignments

Task1: Outlier Detection for a Houston Weather Dataset (second draft; tentative deadline: Feb. 26; 2023 Houston Weather Dataset); Task1 is an individual task and centers on density estimation, autoencoders and outlier detection.

Task2: Group Project (tentative deadline: March 30) centering on classification/prediction and clustering; each group will find an "interesting" dataset and choose their own topic, which will be approved by Feb. 20 at the latest. Groups will present their Task2 results on April 2 and 7!

Task3: Collocation Mining for a Building Dataset (tentative deadline: April 22). Individual task centering on association analysis which will be posted March 30; task and dataset used might still change!



Tentative Task Weights: Task1: 39%, Task2: 34%, Task3: 27%.

Course Elements and Their Weights

Exams (2): 47% (Exam1 (March 5, 2026): 22%, Exam2 (April 30, 2026): 25%)
Assignments (3 tasks): 39%
Group Homework Credit, Paper Walkthroughs, Assignment2 Presentation and Other In-Class Activities: 12%
Attendance: 2%
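
As a rough illustration of how the weights listed above combine, the following Python sketch computes an overall number grade as a weighted average. The weights are taken from this page. The dictionary keys, the function names, and the example scores are hypothetical, and the sketch assumes that every score is already a number grade on a 0-100 scale and that elements are combined linearly; curving is not modeled.

```python
# Weights from "Course Elements and Their Weights" (percent of the course grade).
# The key names are hypothetical labels chosen for this sketch.
ELEMENT_WEIGHTS = {
    "exam1": 22,
    "exam2": 25,
    "assignments": 39,
    "group_activities": 12,  # GHC, paper walkthroughs, Assignment2 presentation, in-class activities
    "attendance": 2,
}

# Tentative task weights within the assignments element (percent).
TASK_WEIGHTS = {"task1": 39, "task2": 34, "task3": 27}


def weighted_average(scores, weights):
    """Return the weighted average of the scores named in weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


def course_number_grade(element_scores, task_scores):
    """Combine task scores into one assignment score, then combine all elements.

    Assumption: a plain linear combination of 0-100 number grades.
    """
    scores = dict(element_scores)
    scores["assignments"] = weighted_average(task_scores, TASK_WEIGHTS)
    return weighted_average(scores, ELEMENT_WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example_tasks = {"task1": 90, "task2": 80, "task3": 85}
    example_elements = {
        "exam1": 82,
        "exam2": 88,
        "group_activities": 90,
        "attendance": 92,
    }
    print(round(course_number_grade(example_elements, example_tasks), 2))
```

Both weight tables sum to 100, so dividing by the total weight keeps the result on the same 0-100 scale as the inputs.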

2026 Group Activities

Groups will perform Assignment2, summarize paper sections and lead the discussion in the two paper walkthroughs, and make a 13-18 minute group homework credit (GHC) presentation. GHC tasks include presenting solutions to homework-style problems, leading in-class discussions and demoing tools. Each group gets a different GHC task.

2026 Group Homework Credit

Group A and Group B Tasks (Group A will present on Feb. 12 and Group B will present on Feb. 17)
Group C Task (will present on Feb. 24)
Group D Task (will present on March 3)

Remarks: Tasks will be assigned at least 4 days before your group's presentation date!

Dr. Eick's COSC 6335 2026 Lecture Notes

I Introduction to Data Mining (Part1, Part2, Part3; covers Chapter 1 and Section 2.1)
II Data Science Basics / Exploratory Data Analysis (covers chapter 3 of the first edition of the Tan book)
III R (Arko's Short Intro Into R (used in Lab), Scatter Plot Code, Decision Trees in R, Some useful code for Task1 (to be discussed on Sept. 16), Some other code for Task2 (not discussed in the lecture), Computing Statistical Summaries in the Presence of Missing Values (NA) (not discussed in the lecture), Functions and Loops in R (useful, but not discussed in the lecture))
IV Similarity Assessment
V Naive, Parametric (MLE Example) and Non-Parametric Density Estimation
VI Autoencoders (presented by Janet on Feb. 5)
VII Outlier Detection (extended and updated on Feb. 6, 2026)
VIII Clustering: Introduction to Clustering, Density-based Clustering, Hierarchical Clustering, DENCLUE, EM, R-scripts demoing K-means/medoids and DBSCAN and Cluster Validity.
IX Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, Neural Networks Part1 (3blue1brown: What is a Neural Network? (will show the first 12:30 of this video)), Neural Networks Part2, kNN-Classifiers and Support Vector Machines)
X Deep Learning (Introduction to Deep Learning (will watch and discuss some MIT Deep Learning Bootcamp Videos), Review Neural Network Basics, Autoencoders, Language Models and Convolutional Neural Networks (CNN); taught by Mahin on October 31, 2024; More on VAEs (not covered in 2023!))
XI Association Analysis: Association Rule Mining, Sequence Mining.
XII Introduction to Spatial Data Mining
XIII Data Preprocessing for Data Mining (will already be covered in early September)
XIV Advanced Clustering (will cover CLIQUE, FCM and EM)
XV Data Storytelling

Old Webpages of COSC 6335: 2013 and 2009.

In-Class Activity

Feb. 24 Activity (Group Activity; group size 4-5)

Data Sets

Complex8
Complex9 with 8% Gaussian Noise Added
Bank Note Authentication

2024 Group Homework Credit and Other In-class Activities

2024 GHC Groups and Contact E-mails

In this activity, which is called group homework credit, each group formed for this activity receives a different task, usually a homework-style problem (other tasks include demos and leading discussions). The group presents its solution during the lecture (10-14 minutes) and shares it in the form of a Word or pptx file. The groups and e-mail addresses of the group members have been posted in the 'Group Homework Credit' channel of this section's MS Team. Below is the old 2024 list of assigned tasks and associated groups and presentation dates; tasks will be added as we move along with the teaching of the course; 2026 tasks will be posted at least 5 days before a group's presentation date!

2024 Schedule and Tasks

Groups A and B will present on Sept. 12 and Group C will present on Sept. 17 (Tasks for groups A, B, C)
Group D will present on Sept. 19 and Group E will present on Sept. 26 (Group D and E Tasks)
Group F Task (will present on Th., October 3!)
Group G Task (will present on Oct. 8)
Group H, I, J and K Tasks (Group H will present on Oct. 22, Group I will present on October 29, Group J will present on Nov. 1 (online) and Group K will present and lead a discussion on November 7)
Group L and M Tasks (group L will present Nov. 14 and group M will present Nov. 19)
Group N Task (will present on November 21)


Another important skill is the capability to read, understand and assess the quality of data mining papers. To practice this skill, there will be two paper walkthroughs (Feb. 26 and March 31) where we will go slowly through a data mining paper, section by section, under the guidance of students who are responsible for summarizing a particular section of the paper and leading the discussion of the section they are assigned to.

2024 Exams

Midterm1 Exam (scheduled for Oct. 10, 2024, 2:30p in 203 SEC, Review List, October 8, 2024 Review for the Midterm Exam, Solution Sketches 2022 Midterm Exam, Solution Sketches 2023 Midterm Exam)

Midterm2 Exam (2024 Review List, Nov. 21 Review for the Nov. 26, 2024 Exam) has been scheduled for Tuesday, November 26, 2024, 2:30p (Some solution sketches for the 12/09/22 Final Exam; Some solution sketches for the 12/07/23 Final Exam)

Course exams will be open book/notes paper exams; however, the use of cell phones and computers during the exam is not allowed; basic calculators are okay.

Midterm1 counts 22% and Midterm2 counts 24% toward the course grade.

Attendance 2024

Attendance counts 2% towards the course grade. Attendance will be taken starting Tuesday, August 27 throughout the remainder of the semester. Only F2F attendance counts. Therefore, 24 attendances will be taken (August (2), September (7), October (9), November (6)). Your number of attendances will be converted as follows into a number grade:
24-23: 92, 22-21: 91, 20-19: 90, 18: 88, 17: 86, 16: 84, 15: 81, 14: 78, 13: 75, 12: 71, 11: 67, 10: 63, 9: 59, 8-0: 55.
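
The conversion table above can be read as a simple threshold lookup. The following Python sketch encodes the listed values. The names `ATTENDANCE_SCALE`, `MAX_ATTENDANCES` and `attendance_grade` are hypothetical, and rejecting counts outside 0-24 is an assumption of this sketch rather than something stated on this page.

```python
# (minimum number of attendances, number grade), highest threshold first.
# Values are copied from the 2024 conversion table on this page.
ATTENDANCE_SCALE = [
    (23, 92), (21, 91), (19, 90), (18, 88), (17, 86), (16, 84), (15, 81),
    (14, 78), (13, 75), (12, 71), (11, 67), (10, 63), (9, 59), (0, 55),
]

MAX_ATTENDANCES = 24  # August (2) + September (7) + October (9) + November (6)


def attendance_grade(attendances):
    """Convert a count of attended classes into the attendance number grade."""
    if not 0 <= attendances <= MAX_ATTENDANCES:
        # Assumption: counts outside 0-24 are treated as input errors.
        raise ValueError("attendances must be between 0 and 24")
    for minimum, grade in ATTENDANCE_SCALE:
        if attendances >= minimum:
            return grade


if __name__ == "__main__":
    for count in (24, 20, 18, 12, 5):
        print(count, attendance_grade(count))
```

Because the thresholds are checked from highest to lowest, the first match is the correct table row, and the final `(0, 55)` entry covers the whole 8-0 range.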

2023 COSC 6335 Grading

Dr. Eick uses the following number grade scale for graduate courses: A: 100-90, A-: 90-85, B+: 85-82, B: 82-77, B-: 77-74, C+: 74-70, C: 70-66, C-: 66-62, D+: 62-58, D: 58-54, D-: 54-50, F: 50-0. Exams and problem sets are curved after your exam and task scores have been determined. Exam scores are curved immediately, whereas problem set scores are normalized and added; at the end of the semester, your problem set total is converted into a number grade, which counts about 48% toward the overall number grade score, which is then converted into a letter grade.
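
The number grade scale above can be sketched as a threshold lookup in Python. The ranges on this page share their boundary values (for example, 90 ends the A range and starts the A- range), so this sketch assumes that a score exactly on a boundary receives the higher letter grade; that tie-breaking rule and the names `GRADE_SCALE` and `letter_grade` are assumptions made for this sketch and are not stated on this page.

```python
# (lower bound of the range, letter grade), highest bound first.
# Bounds are copied from the number grade scale on this page.
GRADE_SCALE = [
    (90, "A"), (85, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"), (0, "F"),
]


def letter_grade(number_grade):
    """Map a (curved) number grade in 0-100 to a letter grade.

    Assumption: a score exactly on a boundary gets the higher letter grade.
    """
    if not 0 <= number_grade <= 100:
        raise ValueError("number_grade must be between 0 and 100")
    for lower_bound, letter in GRADE_SCALE:
        if number_grade >= lower_bound:
            return letter


if __name__ == "__main__":
    for score in (93, 85, 76.5, 61, 42):
        print(score, letter_grade(score))
```

The input is expected to be the final number grade after curving; the curving itself is not modeled here.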

Number grades higher than 95 are rarely used, except for truly exceptional performances and outliers; a grade of 95 already represents a letter grade of A+. Exams of graduate courses are usually curved so that the exam number grade average is in the range 81-84, depending on how well the students performed in the particular exam. Dr. Eick first determines the class's performance in the exam and selects an average, and then the exam is curved accordingly. Number grades do not directly correspond to the percentage obtained: when the class's percentage average is high, percentages will be downgraded, and when it is low, percentages will be upgraded.

Comments about COSC 6335 Exams: to reduce the probability of cheating, the COSC 6335 final exam was designed to be slightly too long. If it turns out that the exam was more than slightly too long, be aware that all students took the same exam, and that the final exam is still subject to curving to possibly rectify some problems with exam length. In most of Dr. Eick's exams, students who score more than 75% of the available points usually get a grade of A- or better. Moreover, I believe that graduate students should be "challenged" to demonstrate their skills during final exams. On the other hand, this semester's midterm exam was not very challenging, in my opinion, and more importantly it was much too short, encouraging cheating. Dr. Eick also concluded, after using them a lot for 2 semesters, that multiple choice exams are not appropriate for assessing particular skills, and that multiple choice exams often test a student's capability with respect to differences in natural language semantics rather than assessing whether students actually understand and can apply what was taught in a course.

Polls

Nov. 19, 2024 Poll about Problemset Tasks and Kritik

2023 Problem Sets

ProblemSet1 (Task1: Exploratory Data Analysis for a Basel Weather Dataset, Task2: Develop an Intelligent Tool which Compares Boxplots; individual tasks; you should read the specification of Task1 by August 30, and should start working on Task1 on Sept. 5!)

ProblemSet2 (contains Task3, a clustering group task, and Task4, an individual, peer-reviewed outlier detection task)

ProblemSet3 (contains Task5, a peer-reviewed group task in which you will review a data mining paper, which is due on November 14, 2023; Short Discussion Concerning Reviewing Data Mining Papers).

2023 Groups for Tasks 3 and 5; Task 1, 2, and 4 are individual tasks!

2023 Weights for the ProblemSet Tasks: Task1:4, Task2:4, Task3:4, Task4:5, Task5:1.5.

2023 Weights of Course Elements: Midterm Exam:21%, Final Exam:27%; Group Homework Credit: 3%; Attendance:2%; Problem Set Tasks:47%.

Old COSC 6335 News Items