last updated: December 20, 9a
COSC 3337: Data Science I in Fall 2022
(Dr. Eick )
Goals of the Data Science I Course
COSC 3337 Syllabus
Upon completion of this course, students
1. will know what the goals and objectives of data science are and how to conduct a data science project.
2. will have a sound knowledge of basic statistics and basic machine learning concepts.
3. will have sound knowledge about exploratory data analysis.
4. will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles and neural networks.
5. will have some basic knowledge about how to construct distance functions.
6. will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering and cluster evaluation.
7. will have some basic knowledge about anomaly and outlier detection.
8. will get some basic knowledge about association analysis.
9. will get hands-on experience in the course assignments in applying data analysis techniques to real-world data sets.
You will also obtain valuable experience in creating data visualizations, selecting parameters of data analysis
tools, interpreting and evaluating data analysis results, and data storytelling.
10. will get some practical experience with popular
data analysis and visualization environments, such as R or Python data science frameworks, and their popular libraries.
Course Content
1. Introduction to Data Analysis, Data Science and Data Mining
2. Preprocessing
3. Exploratory Data Analysis: How to Visualize and Compute Basic Statistics for Datasets and How to Interpret the Findings
4. Brief Introduction to R and Python Tools for Data Science
5. Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, Support Vector Machines,
Neural Networks and Regression
6. Introduction to Clustering and Similarity Assessment
7. A Brief Exposure to Deep Learning
8. A Brief Introduction to Association Analysis Centering on the Apriori Algorithm
9. Outlier and Anomaly Detection
10. Data Storytelling
Basic Course Information
Instructor: Dr. Christoph F. Eick
Office hours (573 PGH)
Office Hours: TU 2-3p FR 10-11a (in MS Teams)
e-mail: ceick@uh.edu
TA: Navid Ayoobi
Office Hours: WE 2-3p TH 1:30-2:30p (scheduled in MS Teams)
Email: nayoobi@cougarnet.uh.edu
TA: Raunak Sarbajna
Office Hours: MO 1:30-2:30p TU 4-5p (scheduled in MS Teams)
Email: rsarbajn@cougarnet.uh.edu
class meets: TU/TH 11:30a-1p
Cancelled class: none yet
Lectures taught by others: Th., Sept. 1: Raunak Sarbajna;
Tu., Nov. 1: TBDL
All other lectures will be taught by Dr. Eick
Course Materials
COSC 3337 Syllabus for Fall 2022
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2018.
- Link to Book HomePage
Other Material:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
News COSC 3337 (Data Science I) Fall 2022
Important Dates in Fall 2022
Tuesday, September 6: R-Lab taught by Raunak (last 40 minutes of the lecture that day)
Thursday, September 8: Using Python for Data Science Lab taught by Navid (last 40 minutes of the lecture that day)
Friday, Sept. 23, 11:59p: Deadline to submit Task1 of ProblemSet1 focusing on Exploratory Data Analysis
Tuesday, October 4: Midterm1 Exam
Thursday, October 13: 40-50 minute lab taught by Raunak in preparation for the group project
Tuesday, October 18: 55-75 minute lab taught by Navid in preparation for Task3
Thursday, November 3: Midterm2 Exam
Wednesday, November 9: Submission Deadline Task3
Tuesday, November 22: Submission Deadline Group Project
Thursday, December 1: Last class of the semester
Tuesday, December 6: Submission Deadline Task4 (Clustering)
Tuesday, December 13, 11a: Final Exam
Course Elements and Their Tentative Weights for 2022
Problem Sets (3), Group Project and Group Homework Credit and attendance: 51%
Exams (3): 49% (14%, 15%, 20%)
Tentative weights of non-exam tasks: Problem Sets: 31-33%, Group Project: 13-15%, Group Homework Credit: 3%,
Attendance: 2%.
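The arithmetic behind these weights can be sketched as a weighted average. This is only an illustration, not the official grade computation: the function name and dictionary keys are hypothetical, the assignment of the three exam percentages to Midterm1, Midterm2 and the Final Exam (in that order) is an assumption, and the curving of assignment scores described in the Grading section is ignored.

```python
# Hypothetical sketch of a weighted course score; NOT the official formula.
# Assumption: 14% / 15% / 20% belong to Midterm1 / Midterm2 / Final Exam,
# and all non-exam work is already combined into one 0-100 number grade.
WEIGHTS_PERCENT = {
    "midterm1": 14,
    "midterm2": 15,
    "final_exam": 20,
    "non_exam": 51,  # problem sets, group project, homework credit, attendance
}


def weighted_course_score(number_grades: dict) -> float:
    """Combine 0-100 number grades into one 0-100 score using the weights."""
    missing = set(WEIGHTS_PERCENT) - set(number_grades)
    if missing:
        raise ValueError(f"missing grades for: {sorted(missing)}")
    total = sum(number_grades[name] * weight
                for name, weight in WEIGHTS_PERCENT.items())
    return total / 100.0


if __name__ == "__main__":
    example = {"midterm1": 80, "midterm2": 70, "final_exam": 90, "non_exam": 85}
    print(weighted_course_score(example))
```

Keeping the weights as integer percentages makes it easy to check that they add up to 100.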
Fall 2022 Problem Sets and Group Project
Problem Set1 (Task1 Specification;
Task2 Specification; updated on Sept. 28)
Problem Set2 (Task3, centering on Recurrent Neural Networks; individual task;
Navid's Introduction to RNN)
ProblemSet3 (centering on clustering; individual task)
POIMAGIC: an Early Warning System for Streaming Spatial Events (group project October 4-November 22, 2022;
Groups in 2022)
2022 Course Exams
Mid1 Exam(Oct. 4, 2022): Sept 29, 2022 Review for Mid1,
Review List for 2022 Midterm1 Exam
Mid2 Exam(Nov. 3, 2022): Nov. 1, 2022 Review for Mid2,
Review List for 2022 Midterm2 Exam, Solution Sketches 2022 Midterm2 Exam.
Final Exam(Dec. 13, 11a, 2022): Dec. 1, 2022 Review for Final Exam,
Review List for 2022 Final Exam.
Nov. 22, 2022 Activities
There will be no online lecture on Tu., Nov. 22; instead, you will read an article about recent trends in data science and
watch a 30-minute video concerning the challenges of collecting and interpreting census data. Finally, you will have the opportunity to
earn a COSC 3337 bonus by writing a 1-1.25 page essay about the video you watched. For more details about these
activities, click the following link:
Nov. 22, 2022 Offline Tasks
Fall 2022 Group Homework Credit Tasks and Schedule
In this activity, called group homework credit, each group formed for this activity
receives a different homework-style problem,
presents its solution during the lecture, and shares its solution in the form of a Word or pptx file.
The groups and e-mail addresses of the group members have been posted in the 'File' Section of the General Channel of 3337-Class. Here is a list of the already
assigned tasks and associated groups; tasks will be added as we move along with the teaching of the course:
Group A, B and C Tasks (Group A will present during the lecture on September 13, and groups B and C will
present on September 15)
Group D Task (Group D will present during the lecture on September 22)
Group E and F Tasks (both groups will present on September 29)
Group G Task (to be presented on October 13)
Group H, I and J Task (groups H and I will present on October 20, and group J will
present on October 25)
Group K Task (to be presented on November 1)
Group L and M Task (both groups will present on Nov. 10)
Group N Task (to be presented on Nov. 17)
Group O Task (to be presented on Dec. 1)
For groups see: 2022 Group Homework Credit Groups
2022 Late Submission Policy
Tasks are due at the time specified; however,
a. tasks that are submitted one day late receive a 12% penalty; multiply the task score by 0.88
b. tasks that are submitted two days late receive a 30% penalty; multiply the task score by 0.7
c. tasks that are more than 2 days late will receive a score of 0
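The penalty rules above can be sketched as a small function. This is a minimal illustration rather than the graders' actual procedure; the function name is hypothetical, and how "days late" is counted (for example, full 24-hour periods after the deadline) is an assumption not stated in the policy.

```python
# Hypothetical sketch of the 2022 late submission policy.
# Assumption: days_late is a whole number already determined by the grader.
def apply_late_penalty(task_score: float, days_late: int) -> float:
    """Return the task score after applying the late submission penalty."""
    if days_late < 0:
        raise ValueError("days_late must not be negative")
    if days_late == 0:
        return task_score          # on time: no penalty
    if days_late == 1:
        return task_score * 0.88   # one day late: 12% penalty
    if days_late == 2:
        return task_score * 0.7    # two days late: 30% penalty
    return 0.0                     # more than two days late: score of 0


if __name__ == "__main__":
    for days in range(4):
        print(days, apply_late_penalty(90.0, days))
```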
COSC 3337 Data Science I: Lecture Notes
I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining,
Part2: Mostly
Course Information, Part3:Introduction to Data Science,
Preprocessing in Data Science).
II Exploratory Data Analysis (covers chapter 3 from the
First Edition of the Tan Book (download as this material is not
in the second edition); more material: these slides will not be covered in 2021:
Introduction to Non-Parametric
Density Estimation; KDE Density Functions,
Some R Data Analysis Functions I; Some
R Data Analysis
Functions II
III R and Python for Data Science (only some of the listed slide sets will be
covered in the lecture; Arko's Short Intro Into R (not covered, but a good "refresher" if
you forgot most details of using R, because you learnt it some time ago),
Scatter
Plot Code, Decision Trees in R,
Some useful code for Task1 ProblemSet1 (will be covered in part during the lecture),
Computing Statistical Summaries In the Presence of Missing Values (NA),
Functions
and Loops in R,
Directory containing R-code for ProblemSet3; Python:
Saying Hi to Python, Python Refresher.
IV Classification (Introduction to Classification: Basic Concepts and Decision Trees,
Overfitting, kNN-Classifiers
and Support Vector Machines, Neural Networks, Recurrent Neural Networks,
Colah's Blog: Understanding LSTMs,
Ensemble Learning (not covered in 2022), Naive
Bayes Classifiers&Bayes' Theorem (not covered in 2022)
V Density Estimation (Naive and Parametric Density Estimation, Non-parametric
Density Estimation)
VI Clustering and Similarity Assessment (
Introduction, Density-based Clustering Centering on DBSCAN,
Hierarchical Clustering, Cluster Validity,
R-scripts demonstrating: K-means/medoids, DBSCAN,
More on
PAM and using PAM/DBSCAN with dist-objects (not relevant and not covered in 2022);
Clustering Exercises
K-Means, HC, and DBSCAN)
VII Outlier Detection
VIII Association Analysis: Brief Introduction to Association Analysis Centering on APRIORI
and Sequence Mining
IX Data Storytelling
X Introduction to Spatial Data Mining
XI Advanced Clustering (will be partially covered in 2022)
XII Overview of Data Preprocessing Techniques (was already discussed in the August
30 lecture)
XIII Introduction to Data Visualization (not covered in 2022; Part1 (Most of the slides in this slideshow were
created by Guoning Chen, Department of Computer Science,
University of Houston), Part2 (slides were created
by Alark Joshi, Department of Computer Science,
University of San Francisco; Data Visualization Reading Material for DS I)
"Old" News COSC 3337 (Data Science I) Fall 2022
- Midterm2 has been graded; although about 30-40 students "did well" in the exam, it is our assessment
that 25+ students were "ill prepared" for the exam, and overall the results were not good. We will go over solutions to
some Midterm2 problems during the lecture on Nov. 29. As there will be
no course activities Dec. 7-12, I strongly suggest you use this time window to get "well prepared" for the
course final exam, so that the final exam results will be better than those for Midterm2!
- At the moment you mostly see information from the Fall 2021 teaching of the course; this information and teaching material will be replaced
as we move along with teaching this course. The same remark applies to the teaching material, problem set and group project information, and chat
in MS Team 3337-Class.
- MS Teams will be used for teaching the online course; however, we will be using classical paper exams,
which will be given in UH classrooms. Navid and Raunak will be the TAs for this course; you will find
their MS Teams office hours and e-mail addresses above.
- The lectures in the Sept. 5 week will continue to discuss EDA; there will be two 40-minute labs on using
R/Python for Data Science. Please attend these labs as they prepare you for Task1 and Task2 of ProblemSet1.
The next topic we will discuss in the lecture is classification!
- We will be using an MS Team called "3337-Class" for the teaching of the course; please go ahead and register for
this team using the passcode '617ouxx'. You will not automatically be added to 3337-Class! We will use MS Team 3337-Class
for a lot of things in the course:
for online lectures, likely for problem set and group project submissions, for posting grades, and as a chat venue.
You need to be a member of the team 3337-Class to take the course!
- When looking for the most recent version of a course document, always download it from the course website! The reason
is that there is no security in MS Teams: documents can be modified by any team member;
consequently, documents you find in MS Teams might be outdated or modified by your classmates.
- Please post your solution files/slides for the Group Homework Credit tasks in the respective channel in MS Teams! Use your group name
as the name of the file that you post.
- A first, very preliminary draft of the group project has been posted below: you can either form your own group of 4 students (requests for
other group sizes will be rejected) and e-mail the group information to Navid no later than Thursday, Sept. 29, 11:59,
or Navid will assign you to a group.
Group compositions will be posted in MS Teams by the end of the day of October 4. You will need to complete
the Group Project by Saturday, November 12!
- The first course exam has been scheduled for Tuesday, October 4, 11:30a-12:45p in MH 170; we will use
"classical paper exams" with no multiple choice questions. A review list for the
exam has been posted below.
Midterm1 will be open notes and books (no computers will be allowed, but calculators are okay). There will also be a
35-minute review for Midterm1 during the Sept. 29
lecture.
- An updated specification of Task2 has been uploaded (see link to download it below). Moreover, you get an extra
day to complete it: Task2 is now due Oct. 1, 2022 at 11:59p, and there will be no further extensions!
Past Exam and Review Solutions
Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Midterm1 October 2, 2019
Solution Sketches Midterm2 November 6, 2019
Solution Sketches Final Exam December 6, 2019
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches of Review for Final Exam on
April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on
November 5, 2018
Solution Sketches Midterm Exam October 14 2021
Fall 2021 Problem Sets and Group Project
Problem Set1 (final draft; individual tasks)
2021 Group Project (updated on November 8, 2021,
COSC 3337 Group Project Presentations)
Problem Set2(Task3 Description(updated on Nov. 3!), Task3
Deliverables, Helper Function; individual
task)
Problem Set3 (individual tasks; Some Information
concerning Task4, Randomized Hill Climbing (a potential algorithm you could use
for DBSCAN parameter search))
Grading
Students will be responsible for material covered in the
lectures and assigned in the readings. All assignment and
project reports are due at the date specified.
No late submissions
will be accepted after
the due date. This policy will be strictly enforced.
Translation of number grades to letter grades, starting in Fall 2021:
A:100-92 A-:92-88 B+:88-84 B:84-80 B-:80-76 C+:76-71
C: 71-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
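The table above can be read as a lookup from a number grade to a letter grade. The sketch below is one possible reading, not an official tool: because adjacent ranges share their boundary values, it assumes that a boundary value belongs to the higher grade (consistent with the remark in this section that number grades of 92 or higher are A's), and that 50 counts as D-. The function and variable names are hypothetical.

```python
# Hypothetical sketch of the number-to-letter grade translation.
# Assumption: each boundary value belongs to the HIGHER letter grade.
GRADE_FLOORS = [
    (92, "A"), (88, "A-"), (84, "B+"), (80, "B"), (76, "B-"), (71, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(number_grade: float) -> str:
    """Map a 0-100 number grade to a letter grade using the table above."""
    if not 0 <= number_grade <= 100:
        raise ValueError("number grade must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if number_grade >= floor:
            return letter
    return "F"


if __name__ == "__main__":
    for grade in (95, 92, 91.9, 76, 50, 49.5):
        print(grade, letter_grade(grade))
```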
Only machine-written solutions
are accepted in the assignments/problem sets (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we cannot understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
In contrast to the exam grades, where you receive your number grades immediately, the assignment/problem set scores will still be curved near the end
of the semester, and
your curved assignment scores will ultimately be converted into a number grade using a conversion function (the conversion function incorporates
the different weights of the four assignments); this number grade will then count about 50% towards your final course grade.
In general, assignment weights were selected considering both the amount of work required and the difficulty;
moreover, group projects carry lower weights. When
looking at the detailed grade reports, be aware of the fact that number grades of 92 or higher are A's in Dr. Eick's curving.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Excused Missing of Course Exams:
If you miss course exams for other reasons, you might get a grade of 'F' for the exam, unless highly unusual circumstances
lead to your missing of the exam!
Fall 2021 Group Homework Credit Tasks and Schedule
In this activity, called group homework credit, each group formed for this activity
receives a different homework-style problem,
presents its solution during the lecture, and shares its solution in the form of a Word or pptx file.
The groups and e-mail addresses of the group members have been posted in the 'File' Section of the General Channel of 3337-Class. Here is a list of the already
assigned tasks and associated groups:
Group A Task (to be presented on Sept. 16)
Group B Task (to be presented on Sept. 16)
Group C Task (to be presented on Sept. 21)
Group D Task (to be presented on Sept. 28)
Group E Task (to be conducted on October 5; updated on Sept. 30)
Group F Task (to be presented on October 12)
Group G Task (to be presented on October 26)
Group H Task (to be presented on November 2)
Group I Task (to be presented on November 9)
Group J and Group K Task (both to be presented on November 11)
Group L and Group M Task (both to be presented on December 2)
Group N Task (to be presented on December 7)