last updated: November 20, 2024
COSC 3337: Data Science I in Fall 2024
(Dr. Eick)
Goals of the Data Science I Course
COSC 3337 Syllabus
Upon completion of this course, students
1. will know what the goals and objectives of data science are and how to conduct a data science project.
2. will have a sound knowledge of basic statistics and basic machine learning concepts.
3. will have sound knowledge about exploratory data analysis.
4. will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles and neural networks.
5. will have some basic knowledge about how to construct distance functions.
6. will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering and cluster evaluation.
7. will have some basic knowledge about anomaly and outlier detection.
8. will get some basic knowledge about association analysis.
9. will get hands-on experience in the course assignments in applying data analysis techniques to real-world data sets.
You will also obtain valuable experience in creating data visualizations, selecting parameters of data analysis
tools, interpreting and evaluating data analysis results, and data storytelling.
10. will get some practical experience with popular
data analysis and visualization environments, such as R or Python Data Science frameworks, and their popular libraries.
Course Content
1. Introduction to Data Analysis, Data Science and Data Mining
2. Preprocessing
3. Exploratory Data Analysis: How to Visualize and Compute Basic Statistics for Datasets and How to Interpret the Findings
4. Brief Introduction to R and Python Tools for Data Science
5. Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning,
Support Vector Machines and Neural Networks
6. Density Estimation
7. Outlier and Anomaly Detection
8. Introduction to Clustering and Similarity Assessment
9. Data Storytelling
10. Introduction to Association Analysis Centering on the Apriori Algorithm (short)
11. Introduction to Deep Learning Centering on Autoencoders
12. Ethical Issues of Data Science (short)
13. Advanced Clustering
14. Spatial Data Analysis and Spatial Data Mining
Basic Course Information
Instructor: Dr. Christoph F. Eick
Office hours (573 PGH)
Office Hours: TU 4:10-5p (through October 29) TH 8:50-10a FR 2:10-3p (Nov. 8+15) (in MS Teams)
e-mail: ceick@uh.edu
TA: Janet Anagli
Office Hours: MO 10-11a TU 10-11a (scheduled in MS Teams)
Email: jyanagli@CougarNet.UH.EDU
TA: Raunak Sarbajna
Office Hours: WE 1:30-2:30p TH 9:30-10:30a (scheduled in MS Teams)
Email: rsarbajn@cougarnet.uh.edu
TA: Arthur Dunbar
Office Hours: TU 3-4p WE 9:30-10:30a (scheduled in MS Teams)
Email: apdunbar12@gmail.com
class meets: TUTH 11:30a-1p in 103 SEC
Cancelled class: none yet
Lectures taught by others: on Tuesday, September 24, Raunak will discuss neural networks; on Thursday, October 10, Arthur will teach the first 35 minutes,
discussing outlier detection, followed by a Task3 lab taught by Raunak; on Tuesday, November 12, Raunak will teach the deep learning lecture and lab.
All other lectures will be taught by Dr. Eick, but some lectures include labs that are taught by
the course TAs.
Course Materials
COSC 3337 Syllabus for Fall 2024
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2018.
- Link to Book HomePage
Other Material:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
News COSC 3337 (Data Science I) Fall 2024
- Dr. Eick's last office hour in the Fall 2024 semester
will be Th., November 21, 8:50-10a; moreover he will have an extra office hour
on Wednesday, November 20, 4:10-5p!
- There will be another poll on Nov. 21 around noon to obtain student feedback concerning the teaching of
COSC 3337!
- Due to your instructor's originally unexpected conference travel in early December 2024,
the COSC 3337 grades have to be completed by December 4.
Therefore, the final exam, originally scheduled for Dec. 9, 2024, will be replaced by a third course exam which will take
place on the last day of classes: Tuesday, November 26, 11:30a in SEC 103!
- If you have something important to discuss with Dr. Eick or the course TAs, please use e-mail
and not MS Teams!!
Important Dates in Fall 2024 (preliminary; will be finalized by Sept. 4, 2024)
Thursday, August 29, 2024: R-Lab / Task1 Lab taught by Raunak
Tuesday, September 17: Using Python for Data Science & Task2 Lab (taught by Janet) (last 55-60 minutes of the class that day)
Saturday, September 21, 11:59p: Deadline to submit Task1 of ProblemSet1 focusing on Exploratory Data Analysis
Friday, September 7, 11:59p: Deadline to submit Task2 of ProblemSet1 focusing on Classification
Tuesday, October 1, 11:30a: Midterm1 Exam
Thursday, October 10, 12:05p: Lab/lecture in preparation for Task3 taught by Raunak and "Outlier Detection"
lecture taught by Arthur!
Thursday, October 31, 11:30a: Midterm2 Exam
Tuesday, November 12: Deep Learning Lecture and Lab taught by Raunak in preparation of Task6
Thursday, November 21, 11:30a: Last regular COSC 3337 Class
Tuesday, November 26, 11:30a: Third Course Exam
Course Elements and Their Tentative Weights for 2024
Problem Set Tasks, Group Project, Group Homework Credit and Attendance: 53%
Exams (3): 48% (14%, 17%, 17%)
Tentative weights of non-exam tasks: Problem Set Tasks: 47%, Group Homework Credit: 3%,
Attendance: 2%.
Fall 2024 Problem Set Tasks
Task1: Exploratory Data Analysis for a Baseball Databank (group task)
Task2: Learning Decision Tree and Support Vector Machine Models (individual task)
Task3: Juracan: Analyzing Hurricane Trajectories and Assessing
Hurricane Risks for Gulf of Mexico Locations (Group Task; is due on Nov. 8;
Benchmark of Gulf Cities for Hurricane Risk Assessment,
Assessing City Hurricane Risk with Influence Functions)
Task4: Outlier Detection for a Houston Weather Dataset (individual task; is due on Tuesday, October 22; Houston Weather Data Set)
Task 5: Clustering a Spatial Dataset with K-Means and DBSCAN (individual task;
due on We., Nov. 13;
Complex9 Dataset with 8% Gaussian Noise Added)
Task 6: Using Deep Learning for Outlier Detection
(individual task; last task of the semester; due on Nov. 21 end of the day; there will be a lab taught by Raunak on
Tu., Nov. 12 to prepare you for this task).
The tentative weights for the six tasks in 2024 are as follows:
Task1: 17%, Task2: 14%, Task3: 21%, Task4: 19%, Task5: 15%, Task6: 14%. Weights will be finalized on Nov. 21 at the latest.
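As a rough illustration of how the tentative task weights above combine, here is a minimal Python sketch of a plain weighted average. It is not the course's actual conversion function, since task scores are curved before they are converted. The names `TASK_WEIGHTS` and `weighted_task_score`, and the assumption that every task score is on a 0-100 scale, are hypothetical.

```python
# Tentative 2024 task weights in percent, copied from the line above.
TASK_WEIGHTS = {
    "Task1": 17, "Task2": 14, "Task3": 21,
    "Task4": 19, "Task5": 15, "Task6": 14,
}


def weighted_task_score(scores: dict) -> float:
    """Plain weighted average of task scores (assumed to be on a 0-100 scale).

    Assumption: no curving is applied here; the real course conversion
    function is not published in this document.
    """
    total_weight = sum(TASK_WEIGHTS.values())  # 100 for the tentative weights
    weighted_sum = sum(TASK_WEIGHTS[task] * scores[task] for task in TASK_WEIGHTS)
    return weighted_sum / total_weight
```

For example, a student with 80 on every task gets a weighted score of 80, and a higher score on Task3 moves the result more than the same improvement on Task2 or Task6.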
2024 Group Homework Credit Groups, Tasks and Presentation Dates
2024 GHC Groups
In this activity, called group homework credit, each group formed for this activity
receives a different, usually homework-style problem (other tasks include demos and leading discussions),
presents its solution during the lecture (10-14 minutes),
and shares its solution in the form of a
Word or pptx file. The groups and e-mail addresses of the group members have been posted in the 'Group Homework Credit' channel
of this section's MS Team. Below is a list of the already assigned tasks and associated groups and presentation dates;
tasks will be added as we move along with the teaching of the course; tasks will be posted at
least 5 days before a group's presentation date!
Schedule and Tasks:
Group A Task (group A will present on Sept. 10)
Group B will present on Sept. 17 and Group C will present on Sept. 19 (Group B and C Tasks)
Group D will present on Sept. 26 (Group D Task)
Group E Task and Group F, G, and H Task (groups E and F will present on October 8,
and groups G and H will present on October 15).
Group I and Group J Tasks (will both present on October 24)
Group K, L, and M Tasks (group K will present on Nov. 7, group L will present
on Nov. 14, and group M will present and lead a discussion on Nov. 14).
Groups N and O Tasks (group N will present on November 19
and Group O will lead an ethics discussion on November 21)
Course Exams
Mid1 Exam (October 1, 11:30a, 2024): Sept 26, 2024 Review for Mid1,
Review List for 2024 Midterm1 Exam, Solution Sketches October 3, 2023 Midterm1
Exam, Solution Sketches October 1, 2024 Midterm1 Exam
Mid2 Exam (October 31, 11:30a, 2024): Oct 29 Review for 2024 Mid2 Exam,
Review List for 2024 Midterm2 Exam, Solution Sketches 2023 Midterm2 Exam.
Mid3 Exam (November 26, 11:30a, 2024): Nov. 21, 2024 Review for Third Exam,
Review List for 2024 Midterm3 Exam, A Few Solution Sketches of the 2023
Final Exam.
2024 Late Submission Policy
Tasks are due at the time specified; however,
a. tasks that are submitted one day late receive a 12% penalty; multiply the task score by 0.88
b. tasks that are submitted two days late receive a 30% penalty; multiply the task score by 0.7
c. tasks that are more than 2 days late will receive a score of 0.
There will be a short grace period of a few minutes for each submission deadline (at
the discretion of the respective Teaching Assistant); submissions that are received
after this grace period will be considered late!
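The penalty rules above can be sketched in a few lines of Python. This is only an illustration: the function names `late_multiplier` and `penalized_score` are hypothetical, and the sketch assumes the number of days late has already been determined as a whole number (with the grace period, which is at the TA's discretion, counted as on time).

```python
def late_multiplier(days_late: int) -> float:
    """Factor applied to a task score, following rules a-c above.

    Assumption: days_late is a whole number; 0 means on time or
    within the grace period.
    """
    if days_late <= 0:
        return 1.0   # on time: no penalty
    if days_late == 1:
        return 0.88  # one day late: 12% penalty
    if days_late == 2:
        return 0.7   # two days late: 30% penalty
    return 0.0       # more than two days late: score of 0


def penalized_score(raw_score: float, days_late: int) -> float:
    """Task score after the late penalty has been applied."""
    return raw_score * late_multiplier(days_late)
```

For example, a task with a raw score of 90 that is submitted two days late would count as roughly 63 points, and the same task submitted three days late would count as 0.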
COSC 3337 Data Science I: Lecture Notes
I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining,
Part2: Mostly
Course Information, Part3: Introduction to Data Science,
Preprocessing in Data Science).
II Exploratory Data Analysis (covers chapter 3 from
the First Edition of the Tan Book (download, as this material is not
in the second edition); more material; these slides have not been covered since 2021:
Some R Data Analysis Functions I; Some R Data Analysis Functions II).
III R and Python for Data Science (only some of the listed slide sets will be
covered in the lecture): Arko's Short Intro Into R (not covered, but a good "refresher" if
you forgot most details of using R because you learnt it some time ago),
Scatter Plot Code, Decision Trees in R,
Some useful code for Task1 ProblemSet1 (will be covered in part during the lecture),
Computing Statistical Summaries in the Presence of Missing Values (NA),
Functions and Loops in R,
Directory containing R-code for ProblemSet3; Python:
Saying Hi to Python, Python Refresher.
IV Classification (Introduction to Classification: Basic Concepts and Decision Trees,
Overfitting, kNN-Classifiers and Support Vector Machines, Neural Networks,
Ensemble Learning (not covered), Naive Bayes Classifiers & Bayes' Theorem (not covered))
V Density Estimation (Naive and Parametric Density Estimation (PDE Task (added
on Nov. 6, 2023)), Non-parametric Density Estimation)
VI Clustering and Similarity Assessment (Introduction, Density-based Clustering Centering on DBSCAN,
Hierarchical Clustering, Cluster Validity,
R-scripts demonstrating: K-means/medoids, DBSCAN,
EM and Gaussian Mixture Models,
Clustering Exercises K-Means, HC, and DBSCAN)
VII Outlier Detection
VIII Association Analysis: Brief Introduction to Association Analysis Centering on APRIORI
and Sequence Mining
IX Data Storytelling
X Spatial Data Analysis: Spatial Data Analysis and Hotspot Discovery
and Introduction to Spatial Data Mining
XI Introduction to Deep Learning Centering on Autoencoders (In Part 1 we show
parts of MIT 6.S191 (MIT Deep Learning Bootcamp) videos and discuss their
content (Introduction to Deep Learning
(watch the first 8:20 of the video and 11:20-15:00; the remainder of the video was actually covered in the
neural network part of this course),
Deep Generative Learning
(watch the first 22 minutes of this video; if you want to know how VAEs generate "new" examples, resume watching the video at 31:05 for a few minutes)
and maybe---if enough time---New Horizons:
Diffusion Models; watch 39:40-58:30); Part 2: Intro to Deep Learning,
AutoEncoders (taught by Raunak on Nov. 12, 2024)).
XII Advanced Clustering
XIII Overview of Data Preprocessing Techniques (was already discussed in the August
30 lecture)
XIV Ethical Aspects of Data Science centering on Ethics Involving
Census Data Collection and Interpretation (Danah Boyd Video)
XV Introduction to Data Visualization (not covered since 2021; Part1 (most of the slides in this slideshow were
created by Guoning Chen, Department of Computer Science,
University of Houston), Part2 (slides were created
by Alark Joshi, Department of Computer Science,
University of San Francisco); Data Visualization Reading Material for DS I)
Grading
Translation number to letter grades, starting in Fall 2021:
A: 100-91 A-: 91-87 B+: 87-84 B: 84-80 B-: 80-76 C+: 76-71
C: 71-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
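The translation table above can be read as a simple threshold lookup, sketched here in Python. The names `GRADE_BOUNDS` and `letter_grade` are hypothetical. Because adjacent ranges in the table share their boundary value (for example, 91 appears under both A and A-), the sketch assumes that a boundary value earns the higher letter; the table itself does not say which way such ties go.

```python
# (lower bound, letter) pairs in descending order, taken from the table above.
# Assumption: a number grade exactly on a boundary earns the higher letter.
GRADE_BOUNDS = [
    (91, "A"), (87, "A-"), (84, "B+"), (80, "B"), (76, "B-"), (71, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(number_grade: float) -> str:
    """Translate a number grade (0-100) into a letter grade."""
    for lower_bound, letter in GRADE_BOUNDS:
        if number_grade >= lower_bound:
            return letter
    return "F"  # anything below 50
```

Under this assumption, a number grade of 86.5 maps to B+ and a number grade of 49.9 maps to F.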
Only machine-written solutions
are accepted in the assignments/problem sets (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we are unable to understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
In contrast to the exam grades, where you receive your number grades immediately, the assignment/problem set scores will still be curved near the end
of the semester, and
your curved assignment scores will ultimately be converted into a number grade using a conversion function---the conversion function incorporates
the different weights of the four assignments---and then this number grade will count about 50% towards your final course grade.
In general, assignment weights were selected considering the amount of work required, but difficulty was also considered;
moreover, group projects carry lower weights. Moreover, when
looking at the detailed grade reports, be aware of the fact that number grades of 92 or higher are A's in Dr. Eick's curving.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.
Excused Missing of Course Exams:
If you miss course exams for other reasons, you might get a grade of 'F' for the exam, unless highly unusual circumstances
caused you to miss the exam!
Attendance 2024
Attendance counts 2% towards the course grade.
Attendance will be taken starting Tuesday, Sept. 5 throughout the remainder of the semester. Only
F2F attendance counts. Therefore, 23 attendances will be taken.
(September (8), October (8), November (7)) Your number of attendances will be converted into a number grade as follows:
23: 93, 22-21: 92, 20-19: 91, 18-17: 90, 16: 87, 15: 83, 14: 79, 13: 75, 12: 71, 11: 67, 10: 63, 9: 59, 8: 55, 7: 51, 0-6: 47.
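The attendance conversion above is a direct lookup, which the following minimal Python sketch makes explicit. The names `ATTENDANCE_TO_GRADE` and `attendance_grade` are hypothetical; the values are copied from the conversion list, and the maximum of 23 attendances is taken from the text above.

```python
# Number of attendances -> number grade, copied from the conversion list above.
ATTENDANCE_TO_GRADE = {
    23: 93, 22: 92, 21: 92, 20: 91, 19: 91, 18: 90, 17: 90,
    16: 87, 15: 83, 14: 79, 13: 75, 12: 71, 11: 67, 10: 63,
    9: 59, 8: 55, 7: 51,
}


def attendance_grade(attendances: int) -> int:
    """Convert a number of recorded attendances (0-23) into a number grade."""
    if not 0 <= attendances <= 23:
        raise ValueError("attendances must be between 0 and 23")
    # Six or fewer attendances all map to the same number grade of 47.
    return ATTENDANCE_TO_GRADE.get(attendances, 47)
```

Between 16 and 7 attendances, each additional missed class lowers the number grade by 4 points, while the first few absences cost at most one point each.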
Past Exam and Review Solutions
Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Midterm1 October 2, 2019
Solution Sketches Midterm2 November 6, 2019
Solution Sketches Final Exam December 6, 2019
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches of Review for Final Exam on
April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on
November 5, 2018
Solution Sketches Midterm Exam October 14 2021
"Old" News COSC 3337 (Data Science I)
- We will start taking F2F attendance on Thursday, August 29; attendance will count 2% towards your overall course grade!
- Paper which analyzes the Old Faithful temporal eruption patterns (donated by Aleia Sen)
- Last year we only had one group task/group project; however, this semester there will
be two group tasks: Task1, which focuses on exploratory data analysis for a baseball data bank, and Task3, in which you will
likely be analyzing hurricane data. All other problem set tasks
will be individual tasks! The specification of Task3 should be available approx. Sept. 25!
- If you use ChatGPT or other AI tools for course tasks, you have to mention this fact in your
course report and describe for which subtasks you used the AI tool; not doing so represents a
serious academic honesty violation.
- When looking for the most recent version of a course document, always download it from the course website! The reason
is that there is no security in MS Teams: documents can be modified by any team member;
consequently, documents you find in MS Teams might be outdated or modified by your classmates.
- Please post your solution files/slides for the Group Homework Credit tasks in the respective channel in MS Teams! Use your group name
as the name of the file that you will post.
- Task3 which centers on analyzing hurricane trajectory data and hurricane risk assessment
has been posted below. Task3 is a group task which is due on Fr., November 8. Please read its specification and meet
your team mates and discuss what you are supposed to do prior to October 24. There will be a Task3 Q&A at 11:40a,
during the class on October 24
to answer your questions about Task3!
Fall 2023 Problem Sets and Group Project
Problem Set1 (Task1: Exploratory Data Analysis for a Video Sales Dataset;
Task2: Learning SVM and Decision Tree Models)
Group Project (Oct. 10-Nov. 10, 2023 (1 month), centering on analyzing
solar flares; Discussions Helios Project)
Problem Set2 (consists of a clustering task which is due on Nov. 6, and
an outlier detection task which is due on Nov. 20).
Problem Set3 (Task5: Autoencoders).
2023 Group Homework Credit Tasks and Presentation Dates
Group A and Group B Tasks (Group A will present on Sept. 12 and Group B will present on Sept. 14)
Group C and Group D Tasks (Group C will present on Sept. 19 and Group D will present on Sept. 21)
Group E and Group F Tasks (both will present on Sept. 28)
Group G Task (will present Thursday, October 12)
Group H and Group I Tasks (both will present on Thursday, October 26)
Group J Task (will present on November 7)
Group K and L Tasks (Group K will present on Nov. 7 and Group L will present on November 9)
Group M will present on November 28
Group N will present on November 30
Fall 2022 Problem Sets and Group Project
Problem Set1 (Task1 Specification;
Task2 Specification; updated on Sept. 28)
Problem Set2 (Task3, centering on Recurrent Neural Networks; individual task;
Navid's Introduction to RNN)
ProblemSet3 (centering on clustering; individual task)
POIMAGIC: an Early Warning System for Streaming Spatial Events (group project October 4-November 22, 2022;
Groups in 2022)