last updated: August 15 at 3p
COSC 3337: Data Science I in Fall 2021
(Dr. Eick)
Goals of the Data Science I Course
COSC 3337 Syllabus
Upon completion of this course, students
1. will know what the goals and objectives of data science are and how to conduct a data science project.
2. will have a sound knowledge of basic statistics and basic machine learning concepts.
3. will have sound knowledge about the most important data visualization techniques.
4. will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles, and neural networks.
5. will have some sound knowledge about how to construct distance functions.
6. will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering and cluster evaluation.
7. will get hands-on experience in the course assignments in applying data analysis techniques to real-world data sets. They will gain valuable experience in selecting parameters of data analysis tools and in interpreting and evaluating data analysis results.
8. will learn how to use the popular data analysis and visualization environment R and its popular libraries.
9. will get some exposure to and experience in data storytelling.
Course Content
1. Introduction to Data Analysis, Data Science and Data Mining
2. Exploratory Data Analysis—how to Visualize and Compute Basic Statistics for Datasets and how to Interpret the Findings and Assignment1
3. Introduction to R
4. Introduction to Data Visualization (new!)
5. Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, Support Vector Machines,
Neural Networks, Regression and Assignment2
6. Introduction to Clustering, Similarity Assessment and Assignment3
7. Data Preprocessing
8. Data Storytelling
9. Outlier Detection (optional, might be running out of time to discuss this topic!)
Basic Course Information
Instructor: Dr. Christoph F. Eick
Office: 573 PGH
Office Hours: TU 1-2p TH 9:30-10:30a
e-mail: ceick@uh.edu
TA: Nour Smaoui
Office Hours: MO noon-1p WE 1-2p
Office: 350 PGH
Email: nour@cs.uh.edu
2018 TA website: Romita's 4335 Website (?!)
class meets: TU/TH 11:30a-1p
class room: 108 AH or online in the MS Team '3337-Class'
Instructor's Travel: October 12+14
Lectures by 'Guests': October 14
Course Materials
COSC 3335 Syllabus for Fall 2019
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2018.
- Link to Book HomePage
Other Material:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
News COSC 3337 (Data Science I) Fall 2021
- We will be using a MS Team called "3337-Class" for the teaching of the course; please, go ahead and register for
this team using the passcode '617ouxx'.
- The first course lecture of COSC 3337 on Tu., August 24, 11:30a will be taught online (and not F2F)
via MS Teams in '3337-Class'.
- There is still a lot of uncertainty about the rules and regulations concerning teaching courses in Fall 2021.
More detailed information on how the course will be taught should be available
on this website by Monday morning, August 23, at the latest. The course syllabus should also be available by then.
- You still see mostly the Fall 2019 version of the course website, which will be updated as the course progresses.
Important Dates in Fall 2021
Thursday, September 9: Maybe, Introduction to R and Assignment1 Lab (taught by Nour; please bring your laptop with R installed!)
Saturday, September 25, 11p: Deadline Assignment1
Tuesday, October 12: Midterm1 Exam (Review List 2019 Midterm1 Exam, October 7 Review1 Questions and Answers)
Thursday, October 14: Lecture given by a Guest Lecturer
Monday, October 28: R-lab centering on R programming (bring laptop)!
Wednesday, October 30: Group Assignment2 Student Presentations
Monday, November 25: Last Lecture COSC 3337
Friday, December 6, 2p: Final Exam (Review List
for Dec. 6, 2019, 2p Final Exam (will be posted on Nov. 26, 2019), Questions and Solution
Sketches of the Nov. 25, 2019 Review
for the Final Exam) in SEC 202 (our regular class room).
Course Elements and Their Tentative Weights for 2021
Assignments (3) and Online Credit: 52%
Exams (2): 48% (20+28%)
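As a rough illustration of how the tentative 2021 weights above combine, the following sketch computes a course number grade. It is written in Python rather than the course's R, the function name `course_score` is hypothetical, and it assumes that the first exam carries the 20% and the second exam the 28%, which the weights above do not state explicitly. Online credit is assumed to be already included in the assignment score.

```python
def course_score(assignments, exam1, exam2):
    """Combine number grades (0-100) using the tentative 2021 weights."""
    # 52% for assignments and online credit, 48% for exams (20% + 28%).
    # Assumption: exam1 carries 20% and exam2 carries 28%.
    return 0.52 * assignments + 0.20 * exam1 + 0.28 * exam2


# Hypothetical student: 80 on assignments, 70 and 90 on the two exams.
print(round(course_score(80, 70, 90), 2))
```

Because the weights are tentative, the constants in this sketch would have to be adjusted if the final syllabus differs.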
Fall 2019 Assignments
There will be 3 assignments in the course, centering on:
Assignment 1: Exploratory Data Analysis and Data Visualization (individual
assignment, second draft; Few Remarks Tasks 12+13).
Assignment 2: Learning Classification Models
and Model Evaluation (group assignment;
35 Cents of Wisdom on Giving Presentations,
On AS2 Group Presentations)
Assignment 3: Similarity Assessment, Clustering
and Using Clustering to Create Background Knowledge for Classification Tasks (individual assignment;
Additional Material for Assignment3)
Late Policies for Assignments: Assignments are due at the submission deadline; assignments submitted
after the deadline will not be graded; however, students are allowed to submit either Assignment1 or
Assignment3 36 hours late!
Assignment Weights and Curving: This semester, assignment weights were chosen as follows: Assignment1: 37%,
Assignment2: 30%, Assignment3: 33%. In general, there was a large spread of scores in the three
assignments; we also felt that the average performance
in Assignment1 and Assignment2 was slightly below expectations and the performance in Assignment3 was above
expectations. Consequently, the number grade averages for Assignment1 and Assignment2 were curved to be about 76 and 77,
respectively, whereas for Assignment3 this average is about 82.
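To make the effect of these weights concrete, here is a minimal Python sketch (not the course's R) that combines three curved assignment grades into one number. It assumes a plain weighted average; the actual conversion function used in the course is not given on this page, and the names `ASSIGNMENT_WEIGHTS` and `weighted_assignment_grade` are hypothetical.

```python
# Assignment weights as stated above (they sum to 100%).
ASSIGNMENT_WEIGHTS = {
    "Assignment1": 0.37,
    "Assignment2": 0.30,
    "Assignment3": 0.33,
}


def weighted_assignment_grade(grades):
    """Weighted average of curved number grades, keyed by assignment name."""
    # Assumption: a plain weighted average stands in for the real conversion.
    return sum(ASSIGNMENT_WEIGHTS[name] * grades[name]
               for name in ASSIGNMENT_WEIGHTS)


# Example: a student who scored exactly the curved class averages.
example = {"Assignment1": 76, "Assignment2": 77, "Assignment3": 82}
print(round(weighted_assignment_grade(example), 2))
```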
COSC 3337 Data Science I: Lecture Notes
I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining,
Part2:
Course Information, Part3:Introduction to Data Science,
Differences
between Clustering and Classification).
II Exploratory Data Analysis (updated on August 29, 2019; covers chapter 3 from the
First Edition of the Tan Book (download as this material is not
in the second edition); more material: Interpreting Displays;
Introduction to Non-Parametric
Density Estimation; KDE Density Functions,
Some R Data Analysis Functions I (not covered in the 2019 lecture; overlap with the R-Lab); Some
R Data Analysis
Functions II (not covered in the 2019 lecture))
III R (Arko's Short Intro Into R (used in Lab),
Scatter
Plot Code, Decision Trees in R,
Some useful code for Assignment1,
Computing Statistical Summaries In the Presence of Missing Value (NA),
Functions
and Loops in R,
Directory containing R-code for Project2 (lecture on Feb. 26); moreover,
check out Romita's 4335 webpage for more R-code, datasets, and slides)
IV Introduction to Data Visualization (Part1 (Most of the slides in this slideshow were
created by Guoning Chen, Department of Computer Science,
University of Houston), Part2 (slides were created
by Alark Joshi, Department of Computer Science,
University of San Francisco; Data Visualization Reading Material for DS I)
V Classification (Introduction to Classification: Basic Concepts and Decision Trees,
Overfitting, kNN-Classifiers
and Support Vector Machines, Neural Networks (updated on Oct. 14, 2019),
Ensemble Learning, Naive Bayes Classifiers&Bayes' Theorem)
VI Clustering and Similarity Assessment (
Introduction, Hierarchical Clustering and Cluster Validation,
DBSCAN;
R-scripts demonstrating: K-means/medoids, DBSCAN, More on
PAM and using PAM/DBSCAN with dist-objects;
Clustering Exercises
K-Means, HC, and DBSCAN)
VII Preprocessing
VIII Data Storytelling
IX Outlier Detection (Only Sections 0 and 4 of the slides will be covered in 2019)
Past Exam and Review Solutions
Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Midterm1 October 2, 2019
Solution Sketches Midterm2 November 6, 2019
Solution Sketches Final Exam December 6, 2019
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches of Review for Final Exam on
April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on
November 5, 2018
Reports
Rexer Analytics' 2015
Data Science Survey
CrowdFlower's 2016 Data Science
Report
Summary Rexer Analytics' 2017
Data Science Survey (the complete survey is supposed
to appear in October 2018)
Grading
Students will be responsible for material covered in the
lectures and assigned in the readings. All assignment and
project reports are due at the date specified.
No late submissions will be accepted after the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance at randomly
chosen dates, and an attendance score will be computed from how many
of those lectures you attended.
Translation of number grades to letter grades in 2015:
A:100-90 A-:90-86 B+:86-82 B:82-78 B-:78-74 C+:74-70
C: 70-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
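The translation table above can be sketched as a small lookup, shown here in Python rather than the course's R. The ranges in the table share their end points (for example, 90 appears under both A and A-); this sketch assumes that a boundary value receives the higher letter grade, which matches the remark elsewhere on this page that number grades of 90 or higher are A's. The names `GRADE_FLOORS` and `letter_grade` are hypothetical.

```python
# Lower bounds of each letter grade, copied from the 2015 table above.
GRADE_FLOORS = [
    (90, "A"), (86, "A-"), (82, "B+"), (78, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score):
    """Translate a number grade (0-100) into a letter grade."""
    # Assumption: a score exactly on a boundary gets the higher grade.
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


for score in (95, 90, 85.9, 70, 49):
    print(score, letter_grade(score))
```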
Only machine-written solutions
are accepted in the assignments (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we cannot understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
In contrast to the exam grades, where you receive your number grades immediately, the assignment scores
will still be curved near the end of the semester. Your curved assignment scores will ultimately
be converted into a number grade using a conversion function (the conversion function incorporates
the different weights of the four assignments), and this number grade will then count 45-48% towards your final course grade.
In general, assignment weights were selected considering amount of work required but also difficulty was considered;
moreover, group projects carry lower weights. Moreover, when
looking at the detailed grade reports, be aware of the fact that number grades of 90 or higher are A's in Dr. Eick's curving.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Fall 2018 Assignments
Assignment1: Exploratory Data Analysis for a Portuguese
School Performance Dataset (Individual Assignment)
Assignment2: Similarity Assessment, Clustering, and Using
Clustering to Create Background Knowledge for Classification Tasks (individual project;
Running PAM with Distance Functions)
Assignment3: Making Sense of Data: Learning and Comparing
Classification Models for a Dataset (Individual Project)
Assignment4: Design, Implementation and Comparison of Outlier
Detection Techniques for a Spatial Dataset (Group Project,
Assignment4 Groups)
This semester, the student performance for Assignment2 and Assignment3 met expectations;
overall, more
than 45% of the students did not do well at all in Assignment1, and most
groups did very
well in Assignment4. Based on this assessment, the Assignment Number Grade Averages
after curving for the four assignments were approximately as follows: 75, 78-79, 78-79,
and 82.
Spring 2018 Assignment Evaluation Questionnaire Result Summary
The questionnaire was conducted on April 24, 2018, and Romita created the following
summary of the answers you gave (based on 17 responses):
9 students felt assignment 2 was most difficult and took the most time,
2 students felt assignment 3 was most difficult and took the most time, followed by 1 student for assignment 1. Several students thought that assignment 1 was also quite time consuming,
although not as time consuming as assignment 2; assignment 1 also had a lot of second
place finishes in other categories.
3 students felt that assignment 3 was least difficult and another 3 students felt
assignment 4 was least difficult.
Concerning interestingness, 8 students felt that assignment 3 was most
interesting, followed by 5 students for assignment 2 and 2 students for assignment 1.
All the students agreed that the assignments helped in better understanding of the concepts covered in class. They also liked the fact that the assignments gave them an opportunity to work
with real datasets. 5 students felt they learned most in assignment 3 and 3 students mentioned
they learned the most in assignment 2. Another common comment was that the assignments helped
them in learning and using R.