office hours: TU 4-5p TH 11:30a-12:30p (on MS Teams)

e-mail: ceick@uh.edu

TA: Nour Smaoui

e-mail: nsmaoui@central.uh.edu

TA office hours: TU 1-2p TH 4-5p (on MS Teams)

2014 TA website: Arko Barman's COSC 6335 Website

class meets: TU/TH 2:30-4p

classes taught by others: none

cancelled classes: TBDL

Makeup class (if necessary): none

lecture not taught by Dr. Eick: Th., October 29

Objectives Data Mining Course

- P.-N. Tan, M. Steinbach, and V. Kumar:
*Introduction to Data Mining*, Addison Wesley, Second Edition.
- Link to Book HomePage

- Jiawei Han and Micheline Kamber,
*Data Mining: Concepts and Techniques*, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

2. Exploratory Data Analysis

3. A Short Introduction to R

4. Brief Discussion Peer Reviewing and Using Kritik for it

5. Introduction to Classification: Basic Concepts, Decision Trees, Support Vector Machines, k-NN and Neural Networks.

6. Association Analysis: Rule, Sequence, Graph and Collocation Mining

7. Outlier and Anomaly Detection

8. Introduction to Density Estimation

9. Introduction to Clustering and Similarity Assessment

10. On Convolutional Neural Networks and Autoencoders in Particular and on Deep Learning in General

11. More on Clustering: Density-based Clustering, EM, Hierarchical Clustering and Cluster Validity

12. A Brief Introduction to Spatial Data Mining

13. Data Preprocessing

Th., October 8: 35 minute Review for 2020 Midterm Exam

Tu., October 13: Midterm Exam (Review List 2020 Midterm Exam)

Th., November 5, 11p: Deadline Collocation Mining Group Project

Th., November 26: no class (Thanksgiving)

Tu., December 1: Peer Reviewing and Kritik Discussion

Th., December 3: last lecture including a Review for the 2020 Final Exam

Th., December 10, 2p: Final Exam

Tu., December 22, 3:30p: Collocation Mining Group Project Post Analysis (only if there is some interest in having this meeting)

- I enjoyed teaching the course, and I would like to wish you all *a Happy and Successful year 2021*!
- The letter grades this semester were as follows: A:6, A-:9, B+:10, B:2, B-:2, C:1, W:1. More detailed grade summaries and information about curving functions and weights will be posted in Blackboard by December 20 or earlier! If you have any concerns or issues, feel free to discuss those with Nour, who will have an office hour next week on Monday, December 21, 1p. Dr. Eick will be mostly offline and travelling December 19-27, 2020; if there is anything you would like to discuss about the course, Dr. Eick has two office hours during the winter break: Tuesday, December 29, 4-5p and Tuesday, January 5, 4-5p.
- If you are potentially interested in starting an MS Thesis (or in taking a special problems course in Spring 2021, in preparation for your MS thesis which you will then start in Summer or Fall 2021), send Dr. Eick an e-mail by January 14, 2021 at the latest.
- It looks like a lot of you are currently travelling until shortly before the semester begins, and there was not much interest in having a Collocation Mining Group Project Discussion before January 19, 2021.
- The results of the Kritik Poll have been posted in a channel with the same name in the DM2020 team.
Please, take a look!
## COSC 6335 Grading

Dr. Eick uses the following number grade scale: A: 100-90, A-: 90-86, B+: 86-82, B: 82-77, B-: 77-74, C+: 74-70, C: 70-66, C-: 66-62, D+: 62-58, D: 58-54, D-: 54-50, F: 50-0. Exams and problem sets are curved after your exam and task scores have been determined. Exam scores are curved immediately, whereas problem set scores are normalized and added; at the end of the semester, your problem set total is converted into a number grade, which counts about 48% toward the overall number grade, and the overall number grade is then converted into a letter grade.
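The number grade scale can be read as a simple lookup. The following is only a minimal sketch, not the procedure Dr. Eick actually uses: the stated ranges share their endpoints (e.g. 90 appears in both A and A-), so the sketch assumes that each lower bound is inclusive, and the names `GRADE_FLOORS` and `letter_grade` are made up for illustration.

```python
# Lower bounds taken from the number grade scale above.
# ASSUMPTION: each lower bound is inclusive (90 -> A, 86 -> A-, ...),
# because the published ranges share their endpoints.
GRADE_FLOORS = [
    (90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"),
    (70, "C+"), (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(number_grade: float) -> str:
    """Map an already curved number grade (0-100) to a letter grade."""
    if not 0 <= number_grade <= 100:
        raise ValueError("number grade must be between 0 and 100")
    # Floors are sorted from highest to lowest, so the first match wins.
    for floor, letter in GRADE_FLOORS:
        if number_grade >= floor:
            return letter
    return "F"
```

For example, under the inclusive-lower-bound assumption, `letter_grade(86)` yields `A-` and `letter_grade(81.5)` yields `B`.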

Number grades higher than 95 are rarely used, except for truly exceptional performances and outliers; a grade of 95 already represents a letter grade of A+. Exams of graduate courses are usually curved so that the exam number grade average is in the range 81-84, depending on how well the students performed in the particular exam. Dr. Eick first determines the class's performance in the exam, selects an average, and then curves the exam accordingly. Number grades do not directly correspond to the percentage obtained: if the class's percentage average is high, percentages are downgraded, and if it is low, percentages are upgraded.

As far as individual Kritik tasks are concerned, creation scores were weighted by 66%, written evaluation scores by 15%, grading scores by 10% and feedback scores by 9%.

As far as the grading of the 2 group tasks is concerned: Dr. Eick read your collocation mining reports and created his own creation score for each group project, based on his impressions and his evaluation of the quantity and the quality of the work you did for the collocation mining task; this score will be combined with the creation score your group received from your student peers (by averaging the two scores and then converting the average into a number grade). The same procedure was used to obtain a final creation score for Task 7, except that Nour and not Dr. Eick took a look at your reviews. Moreover, no feedback with respect to written comments is solicited from the groups in Kritik group projects. However, your written comments for Task 6 will be graded by Dr. Eick and for Task 7 by Nour, and this written comment score will be combined (counting 15%) with your group's creation score (counting 85%) when the final group task grade is computed. In contrast to individual Kritik projects, there will be no feedback and grading scores for group projects.
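The weighting of Kritik task components described above amounts to a weighted sum. The sketch below is an illustration under stated assumptions, not the actual grading code: it assumes all component scores are on the same 0-100 scale, it assumes the feedback weight is 9% (so that the four individual weights sum to 100%), and it leaves out the conversion of the averaged group creation score into a number grade, since that conversion is not specified. All names are hypothetical.

```python
# Weights for individual Kritik tasks.
# ASSUMPTION: feedback counts 9%, so the four weights sum to 100%.
INDIVIDUAL_WEIGHTS = {
    "creation": 0.66,
    "written_evaluation": 0.15,
    "grading": 0.10,
    "feedback": 0.09,
}

# Weights for group tasks: no grading or feedback scores are used.
GROUP_WEIGHTS = {"creation": 0.85, "written_evaluation": 0.15}


def weighted_score(scores: dict, weights: dict) -> float:
    """Combine component scores (assumed 0-100) using the given weights."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must name the same components")
    return sum(weights[name] * scores[name] for name in weights)


def group_creation_score(instructor_score: float, peer_score: float) -> float:
    """Average the instructor's (or TA's) creation score with the peer score."""
    return (instructor_score + peer_score) / 2
```

A group task score could then be obtained as `weighted_score({"creation": group_creation_score(90, 80), "written_evaluation": 70}, GROUP_WEIGHTS)`, which combines an averaged creation score of 85 with a written comment score of 70.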

Comments about COSC 6335 Exams: to reduce the probability of cheating, the COSC 6335 final exam was designed to be slightly too long. If it turns out that the exam was more than slightly too long, be aware of the fact that all students took the same exam, and that the final exam is still subject to curving to possibly rectify some problems with exam length. In most of Dr. Eick's exams, students who score more than 75% of the available points in the exam will usually get a grade of A- or better. Moreover, I believe that graduate students should be "challenged" to demonstrate their skills during final exams. On the other hand, this semester's midterm exam was not very challenging, in my opinion, and more importantly it was much too short, encouraging cheating. Dr. Eick also concluded, after using them a lot for 2 semesters, that multiple choice exams are not appropriate to assess particular skills, and that multiple choice exams often test a student's capability with respect to differences in natural language semantics rather than assessing whether students actually understand and can apply what was taught in a course.

## Course Elements and Their Tentative Weights for 2020

Problem Set Tasks: 48%

Spontaneous Online Credit (small exploratory tasks will be given to the small groups during the lecture) and Attendance: 4%

Midterm Exam: 19%

Final Exam: 29%
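As a rough illustration of how these tentative weights combine, the sketch below computes an overall number grade as a weighted sum. It assumes that every course element has already been curved and converted into a number grade on the 0-100 scale; the dictionary keys and the function name are invented for this example and do not come from the course material.

```python
# Tentative 2020 weights of the course elements (they sum to 100%).
COURSE_WEIGHTS = {
    "problem_sets": 0.48,
    "online_credit_attendance": 0.04,
    "midterm": 0.19,
    "final": 0.29,
}


def overall_number_grade(number_grades: dict) -> float:
    """Weighted sum of per-element number grades (each assumed 0-100)."""
    missing = set(COURSE_WEIGHTS) - set(number_grades)
    if missing:
        raise ValueError(f"missing course elements: {sorted(missing)}")
    return sum(w * number_grades[name] for name, w in COURSE_WEIGHTS.items())
```

For instance, number grades of 88 (problem sets), 100 (online credit and attendance), 84 (midterm) and 80 (final) combine to roughly 85.4.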

## Problem Sets

Problem Sets contain paper and pencil tasks which review your understanding of basic data mining concepts and algorithms, tasks which use data mining tools, and small and medium sized data analysis/data mining projects, and tasks in which you evaluate data mining results of other students and data mining publications. Some tasks will be group tasks. There will be three Problem Sets in Fall 2020:

Problem Set1: Exploratory Data Analysis, Classification, and Evaluating Data Analysis Results (Cleaned Pima Indian Diabetes Dataset)

Problem Set2: Outlier Detection and Collocation Mining (Task6 Discussions)

Problem Set3: Data Mining Paper Reviewing and Clustering (Some Discussion of Task7; some potentially useful material for Task8: Loops and Functions in R, Randomized Hill Climbing)

Two of the tasks in the Problem Sets will be group tasks: There will be a collocation mining group project in ProblemSet2, and you will conduct a group review of a data mining paper in ProblemSet3.

*Peer Assessment* is a new element of COSC 6335: you will get some practical experience in evaluating data mining results of your fellow students as well as data mining publications. Kritik will be used for the peer assessment tasks of the course. Each student taking this course will need to pay a $15+tax usage fee for the Kritik software.

### Dr. Eick's COSC 6335 2020 Lecture Notes

I Introduction to Data Mining (Part1, Part2, Part3; covers chapter 1 and Section 2.1)

II Exploratory Data Analysis (covers chapter 3 of the first edition of the Tan book)

III R (Arko's Short Intro Into R (used in Lab), Scatter Plot Code, Decision Trees in R, Some useful code for Task1, Computing Statistical Summaries In the Presence of Missing Values (NA), Functions and Loops in R)

IV Peer Reviewing and Kritik (Introduction to Kritik Video (we will watch this video during the lecture!))

V Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, Neural Networks Part1 (3blue1brown: *What is a Neural Network?* (will show the first 12:30 of this video)), Neural Networks Part2, kNN-Classifiers and Support Vector Machines)

VI Association Analysis: Association Rule Mining, Sequence and Graph Mining, Collocation Mining

VII Outlier Detection

VIII A Brief Introduction to Naive, Parametric and Non-Parametric Density Estimation

IX Introduction to Similarity Assessment and Clustering

X Two Popular Deep Learning Approaches: Convolutional Neural Networks and Autoencoders (by Rishabh Sharma)

XI More on Clustering: Density-based Clustering and Hierarchical Clustering, DENCLUE, EM, R-scripts demoing K-means/medoids and DBSCAN, Randomized Hill Climbing and Cluster Validity.

XII A Brief Introduction to Spatial Data Mining

XIII Data Preprocessing for Data Mining

Other Material: COSC 6335 Grading

Old Webpages of COSC 6335: 2013 and 2009.

## Interactive Course Elements and Peer Review using Kritik

The course will try out some "new" approaches to reduce *student isolation during online teaching*. Each student taking COSC 6335 belongs to two groups (a small one with 2 students and a larger group of 4-5 students) and some of the course problem set tasks will involve activities of the larger groups. Problem Set2 and Set3 will each contain a single task for the larger groups. Small tasks will be assigned to small groups for online credit as we move along during the semester; groups present their findings/solutions during the lecture using a single (rarely 2) PowerPoint slide. Also mail your solution slide to Dr. Eick with your group name and the names of the group members in the header of the slide, so that he can add those slides to the COSC 6335 teaching material!

Moreover, peer review will be a new component of this course: you will evaluate work of other students in the course and work of other peers in the field of data mining; Kritik will be used for producing and evaluating peer evaluations. There will be a $15+tax student fee to get *Kritik* access; however, if you are a really poor graduate student, feel free to contact Dr. Eick about subsidising this fee!

Kritik Tasks: There is a 24-hour grace period for each Kritik task. You are allowed to use this grace period for up to 5 submissions. "Draft" rubrics for the peer reviewed tasks in the problem sets can be found in the ProblemSet channel in DM2020 and ultimately you will find the "final" rubric in Kritik. As these rubrics tell you how your submissions will be graded by your peers, Nour and Dr. Eick, it might be worthwhile to look at those rubrics closely.

As we are trying out some of those approaches for the first time, I would like to ask you for a little patience, as there might be some startup problems and not everything might work well initially, and some of the new approaches might not work as well as we expect them to work.

## Online Teaching Information

In general, the course will be taught 100% online, with lectures held on MS Teams TU+TH 2:30-4p! The course will use MS Teams; the 2020 Team is called *DM2020*. The team pass code to join the team is: **707alkj**. For taking the course you will need to use your UH-cougarnet account: if you do not have a working cougarnet account, fix this problem as soon as possible. As we move along with the course, various channels will be created for course discussions, content, problem sets and polls in DM2020! Dr. Eick will schedule his Th. office hours as a separate meeting in DM2020 (Th. 11:30a-12:30p) that you can join the same way you join the course lecture. If you have some private matters to discuss, a private channel will be used to ensure your privacy. There is no separate meeting for his office hour Tu. 4-5p; join the Tuesday lecture Teams event if you want to meet him. **I also suggest joining the team as soon as possible**, as by doing so you will receive all relevant course information!

## COVID-19 Related Matters Interfering with Taking COSC 6335

Moreover, if you face serious problems taking COSC 6335 that are related to COVID-19, to obtaining real-time access to course meetings, to not being able to be physically in Houston, ... please send Dr. Eick an e-mail!

## Other Ideas for COSC 6335

Another idea is to reach out to industry to sponsor COSC 6335 course activities; e.g. we could have the *"Collocation Mining Group Project (sponsored by Company X)"*. If you have any good ideas and/or expertise on obtaining such sponsorships, feel free to contact Nour or Dr. Eick!

### 2020FA-27014-COSC6335-Data Mining: Nour's Dec. 10, 2p Final Exam Instructions

Dear students,

The COSC 6335 Final Exam consists of a single part, containing 4 single answer multiple-choice questions (1 of the presented n alternatives is correct) and 13 free text questions. Questions will be presented in a random sequential order and you will not have the opportunity to change answers to previous questions. For most exam problems multiple versions exist, and questions are assigned to students randomly.

You will need to take the exam between 2:00 pm and 3:45 pm on Thursday, Dec 10, 2020. You will have 105 minutes to complete the exam. The exam will be available at 2:00 pm on the Blackboard Assignments section with the name "COSC 6335: Data Mining 2020 Final Exam".

Be mindful of the weight of each question. The available scores for the 17 questions will be: 8-5-5-7-8-3-3-3-5-8-4-5-3-3-8-9-4 for a total of 91 points. You may not see this order, as the question order will also be randomized.

CASA Monitor will NOT be used for this exam. You won't be asked to show your webcam or share your screen.

- Important Note: UH academic honesty rules apply to the taking of the 6335 Final Exam.

If you have any questions during the exam please feel free to contact me at:

Nour: +1 832-866-7732

or by chat on DM2020 Teams.

Best of luck to everyone!

Nour

## Online Credit Tasks Fall 2020

Task A: Demo and Evaluation of Tools which Create Histograms

Task B: Demo and Evaluation of Tools which Create Histograms

Task C: Scatter Plot Interpretation

Task D: Comparing two Age Distribution Histograms

Task E: Comparing two Box Plots

Task F: Example Information and GINI Gain Computations

Task G: Parametric Density Estimation for an Example Dataset

Task H: Using Maximum Likelihood to Get a Parametric Model for a given Dataset

Task I: Application of Non-parametric Density Estimation to an Example Dataset

Task J: Demo of the Expectation Maximization (EM) Algorithm

Task K: Design of a Distance Function for a Supermarket Customer Dataset

Task L: Running PAM/K-Medoids for an Example Dataset

Task N: Leading a Discussion about the DENCLUE Density-based Clustering Algorithm

Task O: Cluster Evaluation using Silhouette