office hours: TU 4:10-5p TH 8:50-10a (on MS Teams)

e-mail: ceick@uh.edu

TA: Md. Mahin

e-mail: mdmahin3@gmail.com

TA office hours: MO 12:30-1:30p TU 12:30-1:30p (on MS Teams)

2014 TA website: Arko Barman's COSC 6335 Website

class meets: TU/TH 2:30-4p

class room: S 120

classes taught by others: Tuesday, October 3; Tuesday, November 14; Thursday, November 16. Moreover, Mahin will teach three to four 30-75 minute labs as part of the lecture to provide background knowledge for some problem set tasks.

- P.-N. Tan, M. Steinbach, and V. Kumar:
*Introduction to Data Mining*, Addison Wesley, Second Edition.
- Link to Book HomePage

- Jiawei Han and Micheline Kamber,
*Data Mining: Concepts and Techniques*, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

- I. Introduction to Data Mining
- II. Data Science Basics and Exploratory Data Analysis
- III. Preprocessing
- IV. Brief Introduction to Peer Reviewing and using Kritik for it
- V. Labs: Using R and Python for Data Science and Data Mining
- VI. Introduction to Clustering and Similarity Assessment
- VII. Data Storytelling
- VIII. Density Estimation
- IX. Outlier and Anomaly Detection
- X. Classification: Basic Concepts and Decision Trees, Support Vector Machines and Neural Networks
- XI. Introduction to Deep Learning Centering on Autoencoders
- XII. Reviewing Data Mining Papers
- XIII. Association Analysis: Rule, Sequence, Graph and Collocation Mining
- XIV. Spatial Data Mining
- XV. Advanced Clustering

Tuesday, October 17, 2:30p: Midterm Exam (Review List 2023 Midterm Exam)

Thursday, December 7: Final Exam

- DM Fall 2023 Letter Grade Distribution: A:6 A-:7 B+:9 B:8 B-:6 C+:0 C:1 W:0.
- The 2023 final exam student performance was slightly better than the 2023 midterm exam performance, mostly because there were fewer really low scores; its number grade average was therefore curved to be 0.27 higher than the midterm number grade average. Solution sketches for most final exam problems will be posted below by Dec. 19 at the latest. The final exams will not be returned to students; however, you have the chance to see your exams in 575 PGH on Wednesday, January 17, 11-noon or on Monday, January 22, 2-3p. As Mahin is out of the country until January 15, 2024, I also suggest discussing any other grading issues you might have with Mahin on one of those two dates.
- Here are the final 2023 weights for the ProblemSet tasks: Task1:4, Task2:4, Task3:4, Task4:5, Task5:1.5.
Mahin will elaborate on the weight reduction for Task5 at a later stage.
## 2023 Labs

Mahin's Sept. 5 Lab

## Attendance 2023

Attendance counts 2% towards the course grade. Attendance will be taken starting Tuesday, August 29, 2023 throughout the remainder of the semester (with the exception of Tuesday, November 21). Only F2F attendance counts. Attendance will be taken at approx. 2:40p; that is, if you show up 20 minutes late, your attendance will not count. In total, 25 attendances (August (2), September (8), October (8), November (7)) will be recorded. Your number of attendances will be converted into a number grade as follows:

24-25:92, 22-23:91, 21:90, 20:89, 19:87, 18:83, 17:79, 16:75, 15:71, 14:67, 13:63, 12:59, 0-11:55.

## 2023 Exams

Midterm Exam (scheduled for Oct. 17, 2023 in **105 SEC**, Review List, October 12, 2023 Review for the Midterm Exam, Solution Sketches 2022 Midterm Exam, Solution Sketches 2023 Midterm Exam)

The course final exam (2023 Review List, Nov. 30 Review for the 2023 Final Exam) has been scheduled for Thursday, December 7, 2023 (Some solution sketches for the 12/07/23 Final Exam)

Course exams will be open book/notes paper exams; however, the use of cell phones and computers during the exam is not allowed!

## 2023 Problem Sets

ProblemSet1 (Task1: Exploratory Data Analysis for a Basel Weather Dataset, Task2: Develop an Intelligent Tool which Compares Boxplots; individual tasks; you should read the specification of Task1 by August 30, and should start working on Task1 on Sept. 5!)

ProblemSet2 (contains Task3, a clustering group task, and Task4, an individual, peer reviewed outlier detection task)

ProblemSet3 (contains Task5, a peer reviewed group task in which you will review a data mining paper; it is due on November 14, 2023; Short Discussion Concerning Reviewing Data Mining Papers).

2023 Groups for Tasks 3 and 5; Tasks 1, 2, and 4 are individual tasks!

2023 Weights for the ProblemSet Tasks: Task1:4, Task2:4, Task3:4, Task4:5, Task5:1.5.
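
To make the task weights above easier to read, the following minimal sketch converts them into shares of the overall course grade. It assumes that tasks contribute to the problem set total in proportion to their weights and that the problem set total counts 47% (the 2023 figure listed under the weights of course elements); the actual normalization and curving steps are not modeled. The names `TASK_WEIGHTS`, `PROBLEM_SET_PERCENT` and `task_shares` are hypothetical.

```python
from typing import Dict, Mapping

# Task weights as listed on this page for 2023.
TASK_WEIGHTS: Dict[str, float] = {
    "Task1": 4.0,
    "Task2": 4.0,
    "Task3": 4.0,
    "Task4": 5.0,
    "Task5": 1.5,
}

# Share of the problem set tasks in the course grade, in percent (2023).
PROBLEM_SET_PERCENT = 47.0


def task_shares(
    weights: Mapping[str, float] = TASK_WEIGHTS,
    problem_set_percent: float = PROBLEM_SET_PERCENT,
) -> Dict[str, float]:
    """Return each task's share of the course grade in percent.

    Assumption: shares are proportional to the task weights.
    """
    total = sum(weights.values())
    return {task: problem_set_percent * w / total for task, w in weights.items()}


if __name__ == "__main__":
    for task, share in task_shares().items():
        print(f"{task}: {share:.2f}% of the course grade")
```

Under these assumptions the weights add up to 18.5, so a task with weight 4 accounts for 4/18.5 of the problem set total.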

2023 Weights of Course Elements: Midterm Exam: 21%; Final Exam: 27%; Group Homework Credit: 3%; Attendance: 2%; Problem Set Tasks: 47%.

### Dr. Eick's COSC 6335 2022 Lecture Notes

I Introduction to Data Mining (Part1, Part2, Peer Reviewing and Kritik, Part3; covers chapter 1 and Section 2.1)

II Data Science Basics (formerly called Exploratory Data Analysis) (covers chapter 3 of the first edition of the Tan book)

III R (Arko's Short Intro Into R (used in Lab), Scatter Plot Code, Decision Trees in R, Some useful code for Task1 (to be discussed on Sept. 16), Some other code for Task1 (not discussed in the lecture), Computing Statistical Summaries in the Presence of Missing Values (NA) (not discussed in the lecture), Functions and Loops in R (useful, but not discussed in the lecture))

IV Peer Reviewing and Kritik (Introduction to Kritik Video (we will watch this video during the lecture!))

V Naive, Parametric and Non-Parametric Density Estimation

VI Introduction to Similarity Assessment and Clustering

VII More on Clustering: Density-based Clustering and Hierarchical Clustering, DENCLUE, EM, R-scripts demoing K-means/medoids and DBSCAN, Randomized Hill Climbing and Cluster Validity.

VIII Outlier Detection

IX Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, Neural Networks Part1 (3blue1brown: *What is a Neural Network?* (will show the first 12:30 of this video)), Neural Networks Part2, kNN-Classifiers and Support Vector Machines)

X Deep Learning (Introduction to Deep Learning (will watch and discuss some MIT Deep Learning Bootcamp Videos), Review Neural Network Basics, Autoencoders, Language Models and Convolutional Neural Networks (CNN); taught by Mahin on Nov. 14+16, 2023, More on VAEs (not covered in 2023!))

XI Association Analysis: Association Rule Mining, Sequence Mining.

XII Introduction to Spatial Data Mining

XIII Data Preprocessing for Data Mining (will already be covered on Sept. 5, 2023)

XIV Advanced Clustering (will cover CLIQUE, FCM and EM)

XV Data Storytelling (already covered in early October 2023)

Old Webpages of COSC 6335: 2013 and 2009.

## 2023 Group Homework Credit

2023 GHC Groups and Contact E-mails

2023 Schedule

Th., Sept. 7: Group A

Th., Sept. 14: Group B and Group C

Th., Sept. 21/Tu., September 26: Group D and Group E

Th., Sept. 28: Group F

Th., October 5: Group G

Tu., October 10: DENCLUE 2.0 paper walkthrough led by Group H

Th., October 12: Group I

Th., November 9: Group J and K (group K's presentation has been moved to November 21)

Tu., November 28: Group L

Th., November 30: Group M

Group tasks will be posted here, at least six days before a group's presentation day!

## Graduate Research Opportunities

UH-DAIS Research Overview

## 2023 COSC 6335 Grading

Dr. Eick uses the following number grade scale for graduate courses: A: 100-90, A-: 90-85, B+: 85-82, B: 82-77, B-: 77-74, C+: 74-70, C: 70-66, C-: 66-62, D+: 62-58, D: 58-54, D-: 54-50, F: 50-0. Exams and problem sets are still curved after your exam and task scores have been determined. Exam scores are curved immediately, whereas problem set scores are normalized and added; ultimately, your problem set total is converted at the end of the semester into a number grade, which counts about 48% towards the overall number grade score, which is then converted into a letter grade.
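
The path from individual number grades to a letter grade can be sketched as follows. This is only an illustration under stated assumptions, not the official grading procedure: it uses the 2023 weights of course elements listed on this page (where problem set tasks count 47%), the attendance conversion table from the Attendance 2023 section, and the number grade scale above. It assumes that a boundary value such as 90 belongs to the higher letter grade, and that all inputs are already curved number grades; the curving and normalization steps themselves are not modeled. All function and constant names are hypothetical.

```python
from typing import Mapping

# 2023 weights of course elements, in percent (as listed on this page).
COURSE_WEIGHTS = {
    "midterm": 21.0,
    "final": 27.0,
    "group_homework": 3.0,
    "attendance": 2.0,
    "problem_sets": 47.0,
}

# (lower bound, letter) pairs, highest first.
# Assumption: a boundary value belongs to the higher letter grade.
LETTER_SCALE = [
    (90, "A"), (85, "A-"), (82, "B+"), (77, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]

# (minimum number of attendances, number grade) pairs, highest first.
ATTENDANCE_SCALE = [
    (24, 92), (22, 91), (21, 90), (20, 89), (19, 87), (18, 83), (17, 79),
    (16, 75), (15, 71), (14, 67), (13, 63), (12, 59), (0, 55),
]


def attendance_number_grade(attended: int) -> int:
    """Convert a number of recorded attendances (0-25) into a number grade."""
    if not 0 <= attended <= 25:
        raise ValueError("attended must be between 0 and 25")
    for minimum, grade in ATTENDANCE_SCALE:
        if attended >= minimum:
            return grade
    raise AssertionError("unreachable: the table covers 0-25")


def overall_number_grade(number_grades: Mapping[str, float]) -> float:
    """Weighted average of the (already curved) number grades per course element."""
    missing = set(COURSE_WEIGHTS) - set(number_grades)
    if missing:
        raise ValueError(f"missing course elements: {sorted(missing)}")
    total_weight = sum(COURSE_WEIGHTS.values())
    weighted = sum(COURSE_WEIGHTS[k] * number_grades[k] for k in COURSE_WEIGHTS)
    return weighted / total_weight


def letter_grade(number_grade: float) -> str:
    """Map a number grade to a letter grade using the scale above."""
    for lower, letter in LETTER_SCALE:
        if number_grade >= lower:
            return letter
    return "F"


if __name__ == "__main__":
    grades = {
        "midterm": 82, "final": 84, "group_homework": 90,
        "attendance": attendance_number_grade(25), "problem_sets": 88,
    }
    total = overall_number_grade(grades)
    print(f"{total:.2f} -> {letter_grade(total)}")
```

In the usage example, the made-up number grades are combined as a weighted average and the result is then looked up in the letter grade scale.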

Number grades higher than 95 are rarely used, except for truly exceptional performances and outliers; a grade of 95 already represents a letter grade of A+. Exams of graduate courses are usually curved so that the exam number grade average is in the range 81-84, depending on how well the students performed in the particular exam. Dr. Eick first determines the class's performance in the exam and selects an average, and then the exam is curved accordingly. Number grades do not directly correspond to the percentage obtained: for high percentage averages, percentages will be downgraded, and for low percentage averages, percentages will be upgraded.

As far as individual Kritik tasks are concerned, creation scores were weighted by 66%, written evaluation scores by 15%, grading scores by 10% and feedback scores by 9%.

As far as the grading of the 2 group tasks is concerned: Dr. Eick read your collocation mining reports and created his own creation score for each group project, based on his impressions and his evaluation of the quantity and the quality of the work you did for the collocation mining task; this score will be combined with the creation score your group received from your student peers (by averaging the two scores and then converting the result into a number grade). The same procedure was used to obtain a final creation score for Task 7, except that Nour, and not Dr. Eick, took a look at your reviews. Moreover, no feedback with respect to written comments is solicited from the groups in Kritik group projects. However, your written comments for Task 6 will be graded by Dr. Eick and for Task 7 by Nour, and this written comment score will be combined (counting 15%) with your group's creation score (counting 85%) when the final group task grade is computed. In contrast to individual Kritik projects, there will be no feedback and grading scores for group projects.
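
The Kritik weighting described above can be illustrated with a short sketch. It assumes that all component scores are on the same 0-100 scale, that the feedback weight is 9% (so that the four individual weights add up to 100%), and that the later conversion into a number grade is a separate step that is not modeled here. The function and constant names are hypothetical.

```python
from typing import Mapping

# Weights for individual Kritik tasks, in percent.
# Assumption: the feedback weight is 9, so that the weights total 100.
INDIVIDUAL_WEIGHTS = {
    "creation": 66.0,
    "written_evaluation": 15.0,
    "grading": 10.0,
    "feedback": 9.0,
}


def individual_task_score(scores: Mapping[str, float]) -> float:
    """Weighted combination of the four Kritik component scores (0-100 each)."""
    missing = set(INDIVIDUAL_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(INDIVIDUAL_WEIGHTS.values())
    weighted = sum(INDIVIDUAL_WEIGHTS[k] * scores[k] for k in INDIVIDUAL_WEIGHTS)
    return weighted / total_weight


def group_task_score(
    instructor_creation: float,
    peer_creation: float,
    written_comment: float,
) -> float:
    """Group tasks: average the two creation scores, then combine that
    average (85%) with the written comment score (15%)."""
    creation = (instructor_creation + peer_creation) / 2.0
    return 0.85 * creation + 0.15 * written_comment


if __name__ == "__main__":
    example = {"creation": 85, "written_evaluation": 90, "grading": 80, "feedback": 100}
    print(f"individual: {individual_task_score(example):.2f}")
    print(f"group: {group_task_score(90, 80, 70):.2f}")
```

Group tasks have no grading and feedback components, which is why `group_task_score` only takes the two creation scores and the written comment score.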

Comments about COSC 6335 Exams: to reduce the probability of cheating, the COSC 6335 final exam was designed to be slightly too long. If it turns out that the exam was more than slightly too long, be aware that all students took the same exam, and that the final exam is still subject to curving to possibly rectify some problems with exam length. In most of Dr. Eick's exams, students who score more than 75% of the available points usually get a grade of A- or better. Moreover, I believe that graduate students should be "challenged" to demonstrate their skills during final exams. On the other hand, this semester's midterm exam was not very challenging, in my opinion, and more importantly it was much too short, encouraging cheating. Dr. Eick also concluded, after using them a lot for 2 semesters, that multiple choice exams are not appropriate to assess particular skills, and that multiple choice exams often test a student's capability with respect to differences in natural language semantics rather than assessing whether students actually understand and can apply what was taught in a course.

## Peer Review using Kritik

Moreover, peer review will be a component of this course: you will evaluate work of other students in the course and work of other peers in the field of data mining; Kritik will be used for producing and evaluating peer evaluations. There will be a $29+tax student fee to get *Kritik* access; however, if you are a really poor graduate student, feel free to contact Dr. Eick and ask him to subsidise this fee! Moreover, we received 2 'free' accounts for deserving students!

Kritik Tasks: There is a 24 hour grace period for each Kritik task. You are allowed to use this grace period for up to 5 submissions. "Draft" rubrics for the peer reviewed tasks in the problem sets can be found in the ProblemSet channel in DM2000, and ultimately you will find the "final" rubric in Kritik. As these rubrics tell you how your submissions will be graded by your peers, Nour and Dr. Eick, it might be worthwhile to look at those rubrics closely.

## 2022 Problem Sets

Problem Set1 (Task1 Specification, Task2 Specification, Discussion Task2; T2 Density Plots for Data Sets D1, D2 and D3; Task1 is due Sunday, Sept. 25, 11:59p and Task2 is due Thursday, October 6, 11:59p)

Task3: Clustering (group activity of groups of 2 or 3 students; due on Tuesday, November 8, 11:59p; Spatial Analysis Lab (taught by Mahin))

Task4: Autoencoder and Dimensionality Reduction (individual activity; is due on Tuesday, November 22, 11:59p, last updated on October 25, 2022)

Task5: Reviewing a Data Mining Paper (group task; rubrics have been added to the specification on Nov. 28, Short Discussion Concerning Reviewing Data Mining Papers, initial review is due Dec. 1 followed by some Kritik peer reviewing in the window Dec. 4-6)

## Old COSC 6335 News Items

- The number grades of the midterm exam will be available by the end of the day of Nov. 2 on MS Teams; the results were not that good; the average was 12% lower than the average for last year's midterm exam; 8 students had quite low scores of 21 points or less, but, on the positive side, five students did very well in this year's midterm exam, scoring in the 59-64.5 range. The score average was 36 and the number grade average was 82.03! Solution sketches for the midterm exam will be posted on the course website early next week.
- Every student in COSC 6335 should have received an e-mail from Kritik, about setting up an account for COSC 6335; if you did not receive such an e-mail, send an e-mail to Mahin immediately!
- We will start taking F2F attendance on Tuesday, August 29. Attendance counts 2% towards the course grade and will be taken at 2:40p for the remainder of the semester. You can find more details about the 2023 attendance policy for this course in the Attendance 2023 section of this page!
- Read Chapter 3 of the first Edition of the Tan book, before the lecture on August 29; link to the chapter is in the Lecture Notes Section below!
- At the moment you see the website of the Fall 2022 teaching of the course; this website will be updated continuously as we move along with the teaching of the course in the Fall 2023 semester!
- We will be using Kritik for the course and there is a fee of $29 for using Kritik; however, Dr. Eick is willing to pay your fee if you cannot afford to pay it. If you believe you belong to this group, send Dr. Eick an e-mail, providing a short but convincing justification why you cannot afford to pay a fee of $29, by September 5, 2023 at 1p at the latest. I will respond to your e-mails by September 8 at the latest.
- Always download documents from the course website, as it always stores the most recent version of the respective documents.