last updated: December 15, 2018 at 5p
COSC 4335: Data Mining in Fall 2018
(Dr. Eick)
Goals of the Data Mining Course
Data mining centers on finding valid, novel, interesting, and potentially useful patterns in data.
It aims at transforming a large amount of data into a well of knowledge. Data mining
has become a very important field in industry as
well as academia. For example, almost 1000 papers are submitted each year to the IEEE International
Conference
on Data Mining (ICDM), which will be held in Singapore in
November 2018.
Data mining tools and
suites (for example, see KDnuggets' DM Software
Survey) are widely used in industry and
in research projects.
UH's Data Analysis and Intelligent Systems Lab (UH-DAIS) conducts research in
data science and data mining, but also in geographic information systems (GIS) and artificial intelligence. Finally, having
basic knowledge in data mining and data analytics is a plus when you are looking for a job in
industry and at major US research
institutions, such as the Texas Medical Center in Houston or at Federal Research Labs.
The course covers the most important
data mining techniques and provides background knowledge on how to conduct a
data mining project. It also gives a basic introduction to data analysis. After defining
what knowledge discovery and
data mining are, data mining tasks such as classification, clustering, and association analysis will be discussed in
detail. Basic data analysis techniques, centering on basic visualization techniques and statistics,
that help to get a better understanding of the data mining task at hand will be covered.
Moreover, techniques for preprocessing a data set for a data mining
task will be introduced.
In addition, in
course projects you will obtain hands-on experience in conducting data mining and data analysis projects. Finally,
R will be
used in most course projects; therefore, participants of the course will obtain
valuable experience in using the R statistics, data mining, and visualization
packages and will learn how to write programs in R and how to develop data mining software
on top of R.
A 2013 poll by Rexer Analytics found that R was the most
popular data mining tool: 24% of the respondents used R as their primary
tool, and only 30% of the respondents did not use R at all. Although R is a domain-specific language, it is
versatile.
In summary, having a sound background in data analytics and data mining and knowing R
well will open a lot of job opportunities for you, which, I believe, is a strong
reason to take the course.
Comments concerning this website
If you have any comments
concerning this website, send e-mail
to: ceick@uh.edu
Basic Course Information
Instructor: Dr.
Christoph F. Eick
Office: 573 PGH
Office Hours: MO 4-4:45p WE 1-2:15p
e-mail: ceick@uh.edu
TA: Romita Banerjee
Office Hours: MO 1-2p WE noon-1p
Office: PGH 313
Email: rbanerj2@central.uh.edu
2018 TA website: Romita's 4335 Website (?!)
class meets: MO/WE 2:30-4p
class room: 201 SEC
Instructor's Travel: September 26, October 8, Nov. 7
TA's Travel: October 5-13
Midterms: October 1 and November 7, 2018
Lectures by 'Guests': September 26, maybe November 13
Makeup class: Mo., December 3, 2:30-4p
Course Materials
COSC 4335 Syllabus for Fall 2018
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2018.
- Link to Book HomePage
Other Material:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
News COSC 4335 (Data Mining) Fall 2018
- I enjoyed teaching the course, and I would like to wish you a Happy
and Successful Year 2019.
- The letter grades for the Fall 2018 teaching of COSC 4335 have been
posted. This semester we had: A:6, A-:3, B+:4, B:5, B-:3, C+:5, C:2,
C-:3, D+:2, W:3. In general, the student grades in the Fall 2018 teaching
of the course were better than in the previous 2 teachings of
the course; in particular, the performance of the top 60% of the students was better,
whereas the performance of the bottom 40% of the students was about the same.
Detailed grade reports can be found on Romita's 4335 webpage.
- This is probably the last offering of COSC 4335: Data Mining; about
65% of its content will be taught in a new Data
Science I 3000-level course, and about 20% of its content will be covered
in a new Data Science II 4000-level course.
- The final exam will not be returned to students; however, you can view your final
exam on the following two dates: Monday, December 17, 11:30a-12:30p and Thursday, January 17,
11:30a-12:30p in Romita's TA office.
- Concerning Number Grade Curving: As an A in Dr. Eick's number grade scale ranges from 90 to 100, the top number grade assigned to course assignments and
exams is usually not higher than 95, with grades 96-100 only being assigned
to exceptional, outlier-like performances or in cases when there is a
very large spread in the distribution of scores.
Important Dates in Fall 2018
Monday, September 3: Labor Day (no lecture)
Wednesday, September 5: Introduction to R Lab (taught by Romita; please bring your laptop with R installed!)
Tuesday, September 18, 11p: Deadline Assignment1
Monday, September 24: 45 minute review for Midterm1 exam
Wednesday, September 26: Lecture will be given by guest lecturer
Monday, October 1: Midterm1 Exam (Review List 2018 Midterm1 Exam; updated recently)
Wednesday, October 3: R-lab centering on R-programming (bring laptop)!
Wednesday, November 7: Midterm2 Exam (Review List
Fall 2018 Midterm2 Exam,
Questions for Nov. 5, 2018
Midterm2 exam Review)
Wednesday, November 21: no lecture due to Thanksgiving next day
Monday, December 3, 2:30p: "Makeup" Lecture (last lecture of
COSC 4335)
Monday, December 10, 2p: Final Exam (Review List
for Dec. 10, 2018 Final Exam, Questions and Solution
Sketches Dec. 3 Review
for the Final Exam) in SEC 201 (our regular class room).
Fall 2018 Assignments
Assignment1: Exploratory Data Analysis for a Portuguese
School Performance Dataset (Individual Assignment)
Assignment2: Similarity Assessment, Clustering, and Using
Clustering to Create Background Knowledge for Classification Tasks (individual project;
Running PAM with Distance Functions)
Assignment3: Making Sense of Data: Learning and Comparing
Classification Models for a Dataset (Individual Project)
Assignment4: Design, Implementation and Comparison of Outlier
Detection Techniques for a Spatial Dataset (Group Project,
Assignment4 Groups)
This semester, the student performance for Assignment2 and Assignment3 met expectations;
overall, more
than 45% of the students did not do well at all in Assignment1, and most
groups did very
well in Assignment4. Based on this assessment, the Assignment Number Grade Averages
after curving for the four assignments were approximately as follows: 75, 78-79, 78-79,
and 82.
Course Elements and Their Tentative Weights for 2018
Assignments (4): 46%
Exams (3): 52% (16+16+20%)
Class Attendance: 2%
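As a rough illustration of how the tentative weights above combine, the following Python sketch computes an overall number grade from component scores. The names WEIGHTS and course_number_grade and the sample scores are hypothetical. It also assumes that every component is scored on a 0-100 scale and that the 16+16+20% split refers to Midterm1, Midterm2, and the Final Exam, in that order; the computation actually used in the course may differ.

```python
# Tentative 2018 weights from the list above (percent of the course grade).
# Assumption: the exam split 16+16+20 means Midterm1, Midterm2, Final Exam.
WEIGHTS = {
    "assignments": 46,
    "midterm1": 16,
    "midterm2": 16,
    "final": 20,
    "attendance": 2,
}


def course_number_grade(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical student scores, for illustration only.
example = {
    "assignments": 80,
    "midterm1": 75,
    "midterm2": 85,
    "final": 90,
    "attendance": 100,
}
print(course_number_grade(example))
```

Dividing by the sum of the weights keeps the result on the 0-100 scale even if the tentative weights are adjusted later.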
Reports
Rexer Analytics' 2015
Data Science Survey
CrowdFlower's 2016 Data Science
Report
Summary Rexer Analytics' 2017
Data Science Survey (the complete survey is supposed
to appear in October 2018)
COSC 4335 Data Mining: Lecture Notes
I Introduction to Data Mining (COSC 4355 Knowledge Sources (to be discussed Sept. 10, 2018), Part1, Part2,
Differences
between Clustering and Classification).
II Exploratory Data Analysis (updated on August 29, 2018; covers chapter 3 from the
First Edition of the Tan Book (download as this material is not
in the second edition); see
also Interpreting Displays; Some R Data Analysis Functions I; Some R Data Analysis Functions II)
III R (Arko's Short Intro Into R,
Scatter Plot Code, Decision Trees in R,
Some useful code for Assignment1,
Computing Statistical Summaries In the Presence of Missing Values (NA),
Functions
and Loops in R,
Directory containing R-code for Project2 (lecture on Feb. 26); moreover,
check out Romita's 4335 webpage for more R-code, datasets, and slides)
IV Clustering and Similarity Assessment (Introduction and Hierarchical Clustering and DBSCAN;
R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects;
Clustering Exercises
K-Means, HC, and DBSCAN)
V Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting,
Neural Networks, kNN-Classifiers and Support Vector Machines and Ensemble Learning)
VI Outlier Detection (significantly updated on April 7, 2018)
VII Data Science (Introduction to
Data Science, Introduction to Data Science Part2; this part of the course will
discuss the new developments in Data Science
and how Computer Science Departments adapt to its increased importance. It also discusses the relationship between
Data Mining and Data Science, and finally discusses some developments in Data Science,
in particular the importance of Data Storytelling).
VIII Data Preprocessing for Data
Mining
IX Association Analysis (Part1, Part2 (not covered in 2018))
Past Exam and Review Solutions
Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches Review for Final Exam,
April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on
November 5, 2018
Grading
Students will be responsible for material covered in the
lectures and assigned in the readings. All assignment and
project reports are due at the date specified.
No submissions
will be accepted after
the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance on randomly
chosen dates, and an attendance score will be computed from how many
of those lectures you attended.
Translation of number grades to letter grades in 2015:
A:100-90 A-:90-86 B+:86-82 B:82-78 B-:78-74 C+:74-70
C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
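The translation table above can be sketched as a small lookup function in Python. Because adjacent ranges in the table share an endpoint (for example, 90 appears under both A and A-), this sketch assumes that a score exactly on a boundary receives the higher letter grade, which matches the statement elsewhere on this page that number grades of 90 or higher are A's. The names GRADE_FLOORS and letter_grade are hypothetical, and the table itself is the 2015 scale, so other semesters may differ.

```python
# Lower bounds taken from the 2015 table above, highest grade first.
# Assumption: a score exactly on a boundary gets the higher letter (90 -> A).
GRADE_FLOORS = [
    (90, "A"), (86, "A-"), (82, "B+"), (78, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"), (0, "F"),
]


def letter_grade(number_grade):
    """Return the letter grade for a number grade between 0 and 100."""
    if not 0 <= number_grade <= 100:
        raise ValueError("number grade must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if number_grade >= floor:
            return letter


# Hypothetical scores, for illustration only.
for score in (95, 90, 85.9, 50, 49):
    print(score, letter_grade(score))
```

Listing only the lower bound of each range keeps the shared endpoints unambiguous: the first floor that the score reaches decides the letter.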
Only machine written solutions
are accepted (the only exception to this point are figures and complex formulas) in the assignments.
Be aware of the fact that our
only source of information is what you have turned in. If we are unable to understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
In contrast to the exam grades, where you receive your number grades immediately, the assignment scores will still be curved near the end
of the semester, and
your curved assignment scores will ultimately be converted into a number grade using a conversion function (the conversion function incorporates
the different weights of the four assignments); this number grade will then count 45-48% towards your final course grade.
In general, assignment weights were selected considering the amount of work required as well as the difficulty of the assignment;
moreover, group projects carry lower weights. Finally, when
looking at the detailed grade reports, be aware of the fact that number grades of 90 or higher are A's in Dr. Eick's curving.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Prerequisites
COSC 3380 and MATH 3336. You are allowed to take COSC 3380 concurrently with COSC 4335!
Spring 2018 Assignments
Assignment1: Exploratory Data Analysis using R for the Pima Indians Diabetes
Dataset (Individual Project)
Assignment2: Clustering and Using Clustering for Prediction
(Individual Project, second draft, Slides Discussing Assignment2)
Assignment3: Creating and Comparing Different Classification
Models for a Dataset
(Individual Project)
Assignment4: Developing and Comparing Outlier Detection Techniques for a Spatial Dataset
(Group Project; Assignment4 Groups)
Weights for the 4 Assignments: Romita and I met, and selected the following weights for
the assignments: Assignment1: 24%, Assignment2: 36%, Assignment3: 16%, Assignment4: 24%. In
general, we decided to lower the weight of Assignment2, and decided to give a higher weight to
Assignment4, so that students have a chance to make up for some points they lost for Assignment1 and Assignment2.
Evaluation of Assignment performance in Spring 2018: About 25% of the students did a very good
job in Assignment2, and about 40% of the students did quite a poor job; the class performance
for Assignment2 was below expectation; moreover, only 17 students attended the lab that was intended as a preparation for Assignment2: why?? The performance for Assignment1 and Assignment3
met expectations; however, there was a significant spread in scores. Finally, students performed quite well in Assignment4, and the class performance was above expectation. This fact was
considered when curving the assignment scores, and the average number grades for the 4 assignments after
curving were: 77, 74, 78 and 81.
Spring 2018 Assignment Evaluation Questionnaire Result Summary
The questionnaire was conducted on April 24, 2018, and Romita created the following
summary of the answers you gave (based on 17 responses):
9 students felt assignment 2 was most difficult and took the most time,
2 students felt assignment 3 was most difficult and took the most time, followed by 1 student for assignment 1. Several students thought that assignment 1 was also quite time consuming,
although not as time consuming as assignment 2; assignment 1 also had a lot of second
place finishes in other categories.
3 students felt that assignment 3 was least difficult and another 3 students felt
assignment 4 was least difficult.
Concerning interestingness, 8 students felt that assignment 3 was most
interesting, followed by 5 students for assignment 2 and 2 students for assignment 1.
All the students agreed that the assignments helped in better understanding of the concepts covered in class. They also liked the fact that the assignments gave them an opportunity to work
with real datasets. 5 students felt they learned most in assignment 3 and 3 students mentioned
they learned the most in assignment 2. Another common comment was that the assignments helped
them in learning and using R.