The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. It also gives a basic introduction to data analysis. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Basic data analysis techniques, centering on basic visualization techniques and statistics, will be covered to give a better understanding of the data mining task at hand. Moreover, techniques for preprocessing a data set for a data mining task will be introduced. In course projects you will obtain hands-on experience in conducting data mining and data analysis projects. Finally, as R will be used in most course projects, participants of the course will obtain valuable experience in using the R statistics, data mining, and visualization packages and will learn how to write programs in R and how to develop data mining software on top of R. A 2013 poll by Rexer Analytics found that R was the most popular data mining tool: 24% of the respondents used R as their primary tool, and only 30% of the respondents did not use R at all. Although R is a domain-specific language, it is versatile.

In summary, having a sound background in data analytics and data mining and knowing R well will open a lot of job opportunities for you, which, I believe, is a strong reason to take the course.

Office hours (573 PGH)

Office Hours: MO 4-4:45p WE 1-2:15p

e-mail: ceick@uh.edu

TA: Romita Banerjee

Office Hours: MO 1-2p WE noon-1p

Office: PGH 313

Email: rbanerj2@central.uh.edu

2018 TA website: Romita's 4335 Website (?!)

class meets: MO/WE 2:30-4p

class room: 201 SEC

Instructor's Travel: September 26, October 8, Nov. 7

TA's Travel: October 5-13

Midterms: October 1 and November 7, 2018

Lectures by 'Guests': September 26, maybe November 13

Makeup class: Mo., December 3, 2:30-4p

- P.-N. Tan, M. Steinbach, and V. Kumar:
*Introduction to Data Mining*, Addison Wesley, 2018.
- Link to Book HomePage

- Jiawei Han and Micheline Kamber,
*Data Mining: Concepts and Techniques*, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

- I enjoyed teaching the course and would like to wish you a happy and successful 2019.
- The letter grades for the Fall 2018 teaching of COSC 4335 have been posted. This semester we had: A:6, A-:3, B+:4, B:5, B-:3, C+:5, C:2, C-:3, D+:2, W:3. In general, student grades in the Fall 2018 teaching of the course were better than in the previous 2 teachings of the course; in particular, the performance of the top 60% of students was better, whereas the performance of the bottom 40% of students was about the same. Detailed grade reports can be found on Romita's 4335 webpage.
- This is probably the last offering of COSC 4335: Data Mining; about 65% of its content will be taught in a new Data Science I 3000-level course, and about 20% of its content will be covered in a new Data Science II 4000-level course.
- The final exam will not be returned to students; however, you can view your final exam on the following two dates: Monday, December 17, 11:30a-12:30p and Thursday, January 17, 11:30-12:30p in Romita's TA office.
- Concerning number grade curving: as A in Dr. Eick's number grade scale ranges from 90 to 100, the top number grade assigned to course assignments and exams is usually not higher than 95, with grades 96-100 only being assigned to exceptional, outlier performances or in cases where there is a very large spread in the distribution of scores.

Wednesday, September 5: Introduction to R Lab (taught by Romita; please bring your laptop with R installed!)

Tuesday, September 18, 11p: Deadline Assignment1

Monday, September 24: 45 minute review for Midterm1 exam

Wednesday, September 26: Lecture will be given by guest lecturer

Monday, October 1: Midterm1 Exam (Review List 2018 Midterm1 Exam; updated recently)

Wednesday, October 3: R lab centering on R programming (bring your laptop!)

Wednesday, November 7: Midterm2 Exam (Review List Fall 2018 Midterm2 Exam, Questions for Nov. 5, 2018 Midterm2 exam Review)

Wednesday, November 21: no lecture due to Thanksgiving next day

Monday, December 3, 2:30p: "Makeup" Lecture (last lecture of COSC 4335)

Monday, December 10,

Assignment2: Similarity Assessment, Clustering, and Using Clustering to Create Background Knowledge for Classification Tasks (individual project; Running PAM with Distance Functions)

Assignment3: Making Sense of Data: Learning and Comparing Classification Models for a Dataset (Individual Project)

Assignment4: Design, Implementation and Comparison of Outlier Detection Techniques for a Spatial Dataset (Group Project, Assignment4 Groups)

This semester, the student performance for Assignment2 and Assignment3 met expectations; more than 45% of the students did not do well at all in Assignment1, and most groups did very well in Assignment4. Based on this assessment, the assignment number grade averages after curving for the four assignments were approximately as follows: 75, 78-79, 78-79, and 82.

Exams (3): 52% (16+16+20%)

Class Attendance: 2%

CrowdFlower's 2016 Data Science Report

Summary Rexer Analytics' 2017 Data Science Survey (the complete survey is supposed to appear in October 2018)

II Exploratory Data Analysis (updated on August 29, 2018; covers chapter 3 from the First Edition of the Tan Book (download it, as this material is not in the second edition); see also Interpreting Displays; Some R Data Analysis Functions I; Some R Data Analysis Functions II)

III R (Arko's Short Intro Into R, Scatter Plot Code, Decision Trees in R, Some useful code for Assignment1, Computing Statistical Summaries in the Presence of Missing Values (NA), Functions and Loops in R, Directory containing R-code for Project2 (lecture on Feb. 26); moreover, check out Romita's 4335 webpage for more R-code, datasets, and slides)

IV Clustering and Similarity Assessment (Introduction and Hierarchical Clustering and DBSCAN; R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects; Clustering Exercises K-Means, HC, and DBSCAN)

V Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, Neural Networks, kNN-Classifiers and Support Vector Machines and Ensemble Learning)

VI Outlier Detection (significantly updated on April 7, 2018)

VII Data Science (Introduction to Data Science, Introduction to Data Science Part2; this part of the course will discuss the new developments in Data Science and how Computer Science Departments adapt to its increased importance. It also discusses the relationship between Data Mining and Data Science, and finally discusses some developments in Data Science, in particular the importance of Data Storytelling).

VIII Data Preprocessing for Data Mining

IX Association Analysis (Part1, Part2 (not covered in 2018))

Solution Sketches Midterm2 April 7, 2015

Solution Sketches Final Exam December 10, 2018

Solution Sketches Review1 March 1, 2016

Solution Sketches Review1 Feb. 27, 2018

Solution Sketches Review1 September 24+26, 2018

Solution Sketches Midterm1 March 3, 2016

Solution Sketches Midterm1 March 1, 2018

Solution Sketches Review2 April 5, 2016

Solution Sketches Midterm2 April 7, 2016

Solution Sketches Midterm2 April 5, 2018

Review for Final Exam, May 3, 2016

Solution Sketches of Review for Final Exam, April 26, 2018

Solution Sketches Final Exam May 10, 2016

Review2 solution sketches on November 5, 2018

Several times during the semester I will check

Translation number to letter grades in 2015:

A:100-90 A-:90-86 B+:86-82 B:82-78 B-:78-74 C+:74-70

C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
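As an illustration only, the translation scale above can be sketched as a small lookup. This is a minimal sketch, not the instructor's actual grading procedure: the name `to_letter` is hypothetical, and since adjacent ranges share their boundary values, it assumes that a score exactly on a boundary (for example 90) receives the higher letter grade.

```python
# Lower bound of each letter grade, copied from the 2015 scale above.
# Assumption: a score exactly on a boundary gets the higher grade.
GRADE_FLOORS = [
    (90, "A"), (86, "A-"), (82, "B+"), (78, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"), (0, "F"),
]


def to_letter(score: float) -> str:
    """Translate a number grade in [0, 100] into a letter grade."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Floors are sorted from highest to lowest, so the first match wins.
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    raise AssertionError("unreachable: the last floor is 0")


if __name__ == "__main__":
    for s in (95, 86, 73.5, 49):
        print(s, to_letter(s))
```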

Only machine-written solutions are accepted in the assignments (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Assignment2: Clustering and Using Clustering for Prediction (Individual Project, second draft, Slides Discussing Assignment2)

Assignment3: Creating and Comparing Different Classification Models for a Dataset (Individual Project)

Assignment4: Developing and Comparing Outlier Detection Techniques for a Spatial Dataset (Group Project; Assignment4 Groups)

Weights for the 4 Assignments: Romita and I met and selected the following weights for the assignments: Assignment1: 24%, Assignment2: 36%, Assignment3: 16%, Assignment4: 24%. In general, we decided to lower the weight of Assignment2 and to give a higher weight to Assignment4, so that students have a chance to make up for some points they lost in Assignment1 and Assignment2.
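To see how these weights combine, here is a minimal sketch of the weighted assignment average. The weights are the ones listed above; the function name `weighted_assignment_average` and the example scores are hypothetical and only serve to illustrate the arithmetic.

```python
# Assignment weights in percent, as listed above.
WEIGHTS = {
    "Assignment1": 24,
    "Assignment2": 36,
    "Assignment3": 16,
    "Assignment4": 24,
}


def weighted_assignment_average(scores: dict) -> float:
    """Combine per-assignment number grades (0-100) using WEIGHTS."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four assignments")
    total_weight = sum(WEIGHTS.values())  # 100 for the weights above
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


if __name__ == "__main__":
    # Hypothetical number grades for one student, not real course data.
    example = {
        "Assignment1": 80,
        "Assignment2": 70,
        "Assignment3": 90,
        "Assignment4": 85,
    }
    print(weighted_assignment_average(example))
```

Because Assignment2 carries the largest weight, a weak Assignment2 score pulls the average down more than a weak score in any other assignment.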

Evaluation of Assignment performance in Spring 2018: About 25% of the students did a very good job in Assignment2, and about 40% of the students did quite a poor job; the class performance for Assignment2 was below expectation. Moreover, only 17 students attended the lab that was intended as a preparation for Assignment2: why?? The performance for Assignment1 and Assignment3 met expectations; however, there was a significant spread in scores. Finally, students performed quite well in Assignment4, and the class performance was above expectation. This fact was considered when curving the assignment scores, and the average number grades for the 4 assignments after curving were: 77, 74, 78 and 81.

9 students felt assignment 2 was most difficult and took the most time,
2 students felt assignment 3 was most difficult and took the most time, followed by 1 student for assignment 1. Several students thought that assignment 1 was also quite time consuming,
although not as time consuming as assignment 2; assignment 1 also had a lot of second
place finishes in other categories.
3 students felt that assignment 3 was least difficult and another 3 students felt
assignment 4 was least difficult.
Concerning interestingness, 8 students felt that assignment 3 was most
interesting, followed by 5 students for assignment 2 and 2 students for assignment 1.

All the students agreed that the assignments helped them better understand the concepts covered in class. They also liked the fact that the assignments gave them an opportunity to work
with real datasets. 5 students felt they learned the most in assignment 3 and 3 students mentioned
they learned the most in assignment 2. Another common comment was that the assignments helped
them in learning and using R.