last updated: December 15, 2018 at 5p

COSC 4335: Data Mining in Fall 2018 (Dr. Eick)



Goals of the Data Mining Course

Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data. It aims at transforming a large amount of data into a well of knowledge. Data mining has become a very important field in industry as well as academia. For example, almost 1000 papers are submitted each year to the IEEE International Conference on Data Mining (ICDM), which will be held in Singapore in November 2018. Data mining tools and suites (for example, see KDnuggets' DM Software Survey) are widely used in industry and in research projects. UH's Data Analysis and Intelligent Systems Lab (UH-DAIS) conducts research in data science and data mining, and also in geographic information systems (GIS) and artificial intelligence. Finally, having basic knowledge of data mining and data analytics is a plus when you are looking for a job in industry or at major US research institutions, such as the Texas Medical Center in Houston or federal research labs.

The course covers the most important data mining techniques and provides background knowledge on how to conduct a data mining project. It also gives a basic introduction to data analysis. After defining what knowledge discovery and data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in detail. Basic data analysis techniques, centering on basic visualization techniques and statistics, will be covered to give a better understanding of the data mining task at hand. Techniques for preprocessing a data set for a data mining task will also be introduced. Moreover, in course projects you will obtain hands-on experience in conducting data mining and data analysis projects. Finally, as R will be used in most course projects, participants of the course will obtain valuable experience in using the R statistics, data mining, and visualization packages, and will learn how to write programs in R and how to develop data mining software on top of R. A 2013 poll by Rexer Analytics found that R was the most popular data mining tool: 24% of the respondents used R as their primary tool, and only 30% of the respondents did not use R at all. Although R is a domain-specific language, it is versatile.

In summary, having a sound background in data analytics and data mining and knowing R well will open a lot of job opportunities for you, which, I believe, is a strong reason to take the course.

Comments concerning this website

If you have any comments concerning this website, send e-mail to: ceick@uh.edu

Basic Course Information

Instructor: Dr. Christoph F. Eick
Office: 573 PGH
Office Hours: MO 4-4:45p WE 1-2:15p
e-mail: ceick@uh.edu
TA: Romita Banerjee
Office Hours: MO 1-2p WE noon-1p
Office: PGH 313
Email: rbanerj2@central.uh.edu
2018 TA website: Romita's 4335 Website (?!)

class meets: MO/WE 2:30-4p
class room: 201 SEC
Instructor's Travel: September 26, October 8, Nov. 7
TA's Travel: October 5-13
Midterms: October 1 and November 7, 2018
Lectures by 'Guests': September 26, maybe November 13
Makeup class: Mo., December 3, 2:30-4p

Course Materials

COSC 4335 Syllabus for Fall 2018

Recommended Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley, 2018.
Link to Book HomePage

Other Material:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, Third Edition, 2011.
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

News COSC 4335 (Data Mining) Fall 2018

Important Dates in Fall 2018

Monday, September 3: Labor Day (no lecture)
Wednesday, September 5: Introduction to R Lab (taught by Romita; please bring your laptop with R installed!)
Tuesday, September 18, 11p: Deadline Assignment1
Monday, September 24: 45 minute review for Midterm1 exam
Wednesday, September 26: Lecture will be given by guest lecturer
Monday, October 1: Midterm1 Exam (Review List 2018 Midterm1 Exam; updated recently)
Wednesday, October 3: R-lab centering on R programming (bring your laptop!)
Wednesday, November 7: Midterm2 Exam (Review List Fall 2018 Midterm2 Exam, Questions for Nov. 5, 2018 Midterm2 exam Review)
Wednesday, November 21: no lecture due to Thanksgiving next day
Monday, December 3, 2:30p: "Makeup" Lecture (last lecture of COSC 4335)
Monday, December 10, 2p: Final Exam (Review List for Dec. 10, 2018 Final Exam, Questions and Solution Sketches Dec. 3 Review for the Final Exam) in SEC 201 (our regular class room).

Fall 2018 Assignments

Assignment1: Exploratory Data Analysis for a Portuguese School Performance Dataset (Individual Assignment)
Assignment2: Similarity Assessment, Clustering, and Using Clustering to Create Background Knowledge for Classification Tasks (individual project; Running PAM with Distance Functions)
Assignment3: Making Sense of Data: Learning and Comparing Classification Models for a Dataset (Individual Project)
Assignment4: Design, Implementation and Comparison of Outlier Detection Techniques for a Spatial Dataset (Group Project, Assignment4 Groups)

This semester, the student performance for Assignment2 and Assignment3 met expectations; however, more than 45% of the students did not do well at all in Assignment1, and most groups did very well in Assignment4. Based on this assessment, the average number grades after curving for the four assignments were approximately 75, 78-79, 78-79, and 82.

Course Elements and Their Tentative Weights for 2018

Assignments (4): 46%
Exams (3): 52% (16+16+20%)
Class Attendance: 2%
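
As a rough illustration of how the tentative weights above combine, the following Python sketch computes a weighted course score. The function name, the example scores, and the assumptions that every component is on a 0-100 scale and that the two midterms count 16% each and the final exam 20% are hypothetical; the actual computation, including any curving, may differ.

```python
# Tentative 2018 weights from the list above (percent of the course grade).
# Assumption: the "16+16+20%" split means midterm1, midterm2, final exam.
WEIGHTS = {
    "assignments": 46,
    "midterm1": 16,
    "midterm2": 16,
    "final_exam": 20,
    "attendance": 2,
}


def course_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student, for illustration only.
example = {
    "assignments": 80,
    "midterm1": 75,
    "midterm2": 85,
    "final_exam": 90,
    "attendance": 100,
}
print(course_score(example))
```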

Reports

Rexer Analytics' 2015 Data Science Survey
CrowdFlower's 2016 Data Science Report
Summary Rexer Analytics' 2017 Data Science Survey (the complete survey is supposed to appear in October 2018)

COSC 4335 Data Mining: Lecture Notes

I Introduction to Data Mining (COSC 4355 Knowledge Sources (to be discussed Sept. 10, 2018), Part1, Part2, Differences between Clustering and Classification).
II Exploratory Data Analysis (updated on August 29, 2018; covers chapter 3 from the First Edition of the Tan Book (download as this material is not in the second edition); see also Interpreting Displays; Some R Data Analysis Functions I; Some R Data Analysis Functions II)
III R (Arko's Short Intro Into R, Scatter Plot Code, Decision Trees in R, Some useful code for Assignment1, Computing Statistical Summaries In the Presence of Missing Values (NA), Functions and Loops in R, Directory containing R-code for Project2 (lecture on Feb. 26); moreover, check out Romita's 4335 webpage for more R-code, datasets, and slides)
IV Clustering and Similarity Assessment (Introduction and Hierarchical Clustering and DBSCAN; R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects; Clustering Exercises K-Means, HC, and DBSCAN)
V Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, Neural Networks, kNN-Classifiers and Support Vector Machines and Ensemble Learning)
VI Outlier Detection (significantly updated on April 7, 2018)
VII Data Science (Introduction to Data Science, Introduction to Data Science Part2; this part of the course will discuss the new developments in Data Science and how Computer Science Departments adapt to its increased importance. It also discusses the relationship between Data Mining and Data Science, and finally discusses some developments in Data Science, in particular the importance of Data Storytelling).
VIII Data Preprocessing for Data Mining
IX Association Analysis (Part1, Part2 (not covered in 2018))

Past Exam and Review Solutions

Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches Review for Final Exam, April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on November 5, 2018

Grading

Students will be responsible for material covered in the lectures and assigned in the readings. All assignment and project reports are due on the date specified. No late submissions will be accepted after the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance on randomly chosen dates, and an attendance score will be computed from how many of those lectures you attended.

Translation of number grades to letter grades in 2015:
A: 100-90  A-: 90-86  B+: 86-82  B: 82-78  B-: 78-74  C+: 74-70
C: 70-66  C-: 66-62  D+: 62-58  D: 58-54  D-: 54-50  F: 50-0
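
The translation table can be read as a list of lower bounds. The Python sketch below is one possible reading, not the official procedure: it assumes that a number grade exactly on a boundary receives the higher letter, which is consistent with the statement elsewhere on this page that number grades of 90 or higher are A's. The names `CUTOFFS` and `letter_grade` are hypothetical.

```python
# Lower bounds taken from the 2015 translation table, highest grade first.
CUTOFFS = [
    (90, "A"), (86, "A-"), (82, "B+"), (78, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(number_grade):
    """Map a number grade to a letter grade.

    Assumption: boundary values go to the higher letter (e.g. 90 is an A).
    """
    for lower_bound, letter in CUTOFFS:
        if number_grade >= lower_bound:
            return letter
    return "F"  # anything below 50


for grade in (93, 90, 81.5, 50, 49):
    print(grade, letter_grade(grade))
```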

Only machine-written solutions are accepted in the assignments (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

  • In contrast to the exam grades, where you receive your number grades immediately, the assignment scores will still be curved near the end of the semester. Your curved assignment scores will ultimately be converted into a number grade using a conversion function (the conversion function incorporates the different weights of the four assignments), and this number grade will then count 45-48% towards your final course grade. In general, assignment weights were selected considering both the amount of work required and the difficulty; moreover, group projects carry lower weights. When looking at the detailed grade reports, be aware that number grades of 90 or higher are A's in Dr. Eick's curving.

  • Students may discuss course material and homeworks, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through assignment problems and in the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.

    Prerequisites

    COSC 3380 and MATH 3336. You are allowed to take COSC 3380 concurrently with COSC 4335!

    Spring 2018 Assignments

    Assignment1: Exploratory Data Analysis using R for the Pima Indians Diabetes Dataset (Individual Project)

    Assignment2: Clustering and Using Clustering for Prediction (Individual Project, second draft, Slides Discussing Assignment2)

    Assignment3: Creating and Comparing Different Classification Models for a Dataset (Individual Project)

    Assignment4: Developing and Comparing Outlier Detection Techniques for a Spatial Dataset (Group Project; Assignment4 Groups)

    Weights for the 4 Assignments: Romita and I met and selected the following weights for the assignments: Assignment1: 24%, Assignment2: 36%, Assignment3: 16%, Assignment4: 24%. In general, we decided to lower the weight of Assignment2 and to give a higher weight to Assignment4, so that students have a chance to make up for some points they lost in Assignment1 and Assignment2.

    Evaluation of Assignment performance in Spring 2018: About 25% of the students did a very good job in Assignment2, and about 40% of the students did quite a poor job; the class performance for Assignment2 was below expectation; moreover, only 17 students attended the lab that was intended as a preparation for Assignment2: why?? The performance for Assignment1 and Assignment3 met expectations; however, there was a significant spread in scores. Finally, students performed quite well in Assignment4, and the class performance was above expectation. This fact was considered when curving the assignment scores, and the average number grades for the 4 assignments after curving were: 77, 74, 78 and 81.

    Spring 2018 Assignment Evaluation Questionnaire Result Summary

    The questionnaire was conducted on April 24, 2018, and Romita created the following summary of the answers you gave (based on 17 responses):

    9 students felt assignment 2 was most difficult and took the most time, 2 students felt assignment 3 was most difficult and took the most time, followed by 1 student for assignment 1. Several students thought that assignment 1 was also quite time consuming, although not as time consuming as assignment 2; assignment 1 also had a lot of second place finishes in other categories. 3 students felt that assignment 3 was least difficult and another 3 students felt assignment 4 was least difficult. Concerning interestingness, 8 students felt that assignment 3 was most interesting, followed by 5 students for assignment 2 and 2 students for assignment 1.
    All the students agreed that the assignments helped them better understand the concepts covered in class. They also liked the fact that the assignments gave them an opportunity to work with real datasets. 5 students felt they learned the most in assignment 3 and 3 students mentioned they learned the most in assignment 2. Another common comment was that the assignments helped them in learning and using R.