last updated: December 19 at noon
COSC 3337: Data Science I in Fall 2021
(Dr. Eick)
Goals of the Data Science I Course
COSC 3337 Syllabus
Upon completion of this course, students
1. will know what the goals and objectives of data science are and how to conduct a data science project.
2. will have a sound knowledge of basic statistics and basic machine learning concepts.
3. will have sound knowledge about exploratory data analysis.
4. will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles and neural networks.
5. will have some basic knowledge about how to construct distance functions.
6. will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering, and of cluster evaluation.
7. will have some basic knowledge about anomaly and outlier detection.
8. will get some basic knowledge about association analysis.
9. will get hands-on experience in the course assignments in applying data analysis techniques to real-world data sets.
Students will also obtain valuable experience in creating data visualizations, selecting parameters of data analysis
tools, interpreting and evaluating data analysis results, and data storytelling.
10. will get some practical experience with popular data analysis and visualization environments, such as R or Python data science frameworks, and their popular libraries.
Course Content
1. Introduction to Data Analysis, Data Science and Data Mining
2. Exploratory Data Analysis: How to Visualize and Compute Basic Statistics for Datasets and How to Interpret the Findings
3. Brief Introduction to R and Python Tools for Data Science
4. Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, Support Vector Machines,
Neural Networks and Regression
5. Introduction to Clustering and Similarity Assessment
6. A Brief Introduction to Association Analysis Centering on the Apriori Algorithm
7. Outlier and Anomaly Detection
8. Data Storytelling
Basic Course Information
Instructor: Dr. Christoph F. Eick
Office hours (573 PGH)
Office Hours: TU 1-2p TH 9:30-10:30a
e-mail: ceick@uh.edu
TA: Sadat Shahriar
Office Hours: TU 3-4p TH 4-5p (scheduled in MS Teams)
Office: ?
Email: sshahria@CougarNet.UH.EDU
TA: Mathew Banda
Office Hours: TU 2-3p TH 2-3p (scheduled in MS Teams)
Office: ?
Email: mabanda3@CougarNet.UH.EDU
class meets: TU/TH 11:30a-1p
class room: 108 AH and/or online in MS Team 3337-Class
Cancelled class: Tuesday, Nov. 23
Makeup class: Tuesday, December 7, 2021 at 11:30a in our class room!
Course Materials
COSC 3337 Syllabus for Fall 2021
Recommended Text:
- P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, 2018.
- Link to Book HomePage
Other Material:
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page
NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering
exploratory data analysis, statistics, modelling and prediction)
News COSC 3337 (Data Science I) Fall 2021
- The course letter grades have been posted, and more detailed grade reports will be posted in 72-96 hours. I enjoyed teaching the
course and would already like to wish you a happy and successful year 2022.
- In Fall 2021 the course letter grade distribution for Data Science I was
as follows: A:7, A-:7, B+:11, B:10, B-:9, C+:8, C:4, C-:0, D+:2, D:0, D-:1,
F:0, I:3, W:5.
- The final exam will not be returned to students but you can view your exam (come to 577 PGH):
Thursday, January 13, 11a-noon and Monday, January 24, 10:30-11:30a.
Important Dates in Fall 2021
Thursday, September 14: R-Refresher Lab (35 minutes) taught by Mathew
Tuesday, September 19: Python Data Science Basics Lab (35 minutes) taught by Sadat
Thursday, October 14: Midterm Exam (Review List 2021 Midterm1 Exam; October 12, 2021 Review1 Questions and Answers; Solution Sketches October 14, 2021 Midterm Exam A; Solution Sketches October 14 Midterm Exam B)
Tuesday, December 7, 11:30a: Last Lecture COSC 3337
Tuesday, December 14, 11a: Final Exam (Review List for the Dec. 14, 11a 2021 Final Exam; Questions and Solution Sketches of the Dec. 7, 2021 Review for the 2021 Final Exam)
Course Elements and Their Tentative Weights for 2021
Problem Sets (3), Group Project and Group Homework Credit: 52%
Exams (2): 48% (20% + 28%)
Tentative weights of non-exam tasks: Problem Sets: 33%, Group Project: 16%, Group Homework Credit: 3%
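The arithmetic behind these tentative weights can be sketched as a weighted average. This is only an illustration, not the course's actual grade computation: the names WEIGHTS and course_number_grade are hypothetical, each component is assumed to be already expressed as a number grade on a 0-100 scale, and assigning 20% to the midterm and 28% to the final is an assumption, since the syllabus only says "20+28".

```python
# Tentative Fall 2021 weights taken from the syllabus (percent of the course grade).
# ASSUMPTION: the midterm is the 20% exam and the final is the 28% exam.
WEIGHTS = {
    "problem_sets": 33,
    "group_project": 16,
    "group_homework_credit": 3,
    "midterm_exam": 20,
    "final_exam": 28,
}


def course_number_grade(scores):
    """Return the weighted average of per-component number grades (0-100 each).

    Hypothetical helper: the real conversion used in the course may differ.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())  # 100 for the weights above
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


if __name__ == "__main__":
    example = {
        "problem_sets": 90,
        "group_project": 85,
        "group_homework_credit": 100,
        "midterm_exam": 78,
        "final_exam": 82,
    }
    print(round(course_number_grade(example), 2))
```

With these made-up example scores, the non-exam tasks contribute 52% and the two exams 48% of the result, matching the split listed above.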
Fall 2021 Problem Sets and Group Project
Problem Set1 (final draft; individual tasks)
2021 Group Project (updated on November 8, 2021; COSC 3337 Group Project Presentations)
Problem Set2 (Task3 Description (updated on Nov. 3!), Task3 Deliverables, Helper Function; individual task)
Problem Set3 (individual tasks; Some Information concerning Task4; Randomized Hill Climbing (a potential algorithm you could use
for DBSCAN parameter search))
Fall 2021 Group Homework Credit Tasks and Schedule
In this activity, called group homework credit, each group formed for it receives a different homework-style problem,
presents its solution during the lecture, and shares its solution as a Word or pptx file.
The groups and the e-mail addresses of the group members have been posted in the 'File' section of the General channel of 3337-Class. Here is a list of the already
assigned tasks and associated groups:
Group A Task (to be presented on Sept. 16)
Group B Task (to be presented on Sept. 16)
Group C Task (to be presented on Sept. 21)
Group D Task (to be presented on Sept. 28)
Group E Task (to be conducted on October 5; updated on Sept. 30)
Group F Task (to be presented on October 12)
Group G Task (to be presented on October 26)
Group H Task (to be presented on November 2)
Group I Task (to be presented on November 9)
Group J and Group K Task (both to be presented on November 11)
Group L and Group M Task (both to be presented on December 2)
Group N Task (to be presented on December 7)
COSC 3337 Group Project Presentations
Vermont, Pennsylvania, Utah, Washington, Nevada, California, Delaware, Illinois and Tennessee will be presenting on Tuesday, November 16!
Data Engineers, Alaska, Ohio, Data Scientists 1, Colorado, DarkSide, Virginia, Team Rocket and Arizona will be presenting on Thursday, November 18!
More Information About the Group Project Presentations and the Event Itself (updated on Nov. 15, 9a; please take a look at the first 2 slides before the event!)
"Final" Presentation Schedule for Nov. 16+18, 2021 (updated on November 15, 9a; please take a look!)
COSC 3337 Data Science I: Lecture Notes
I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining, Part2: Mostly Course Information, Part3: Introduction to Data Science,
Differences between Clustering and Classification).
II Exploratory Data Analysis (covers chapter 3 from the First Edition of the Tan Book (download, as this material is not
in the second edition); more material: these slides will not be covered in 2021:
Introduction to Non-Parametric Density Estimation; KDE Density Functions;
Some R Data Analysis Functions I; Some R Data Analysis Functions II)
III R and Python for Data Science (only some of the listed slide sets will be
covered in the lecture; Arko's Short Intro Into R (not covered, but a good "refresher" if
you have forgotten most details of using R because you learned it some time ago),
Scatter Plot Code, Decision Trees in R,
Some useful code for Task1 ProblemSet1 (will be covered in part during the lecture),
Computing Statistical Summaries in the Presence of Missing Values (NA),
Functions and Loops in R,
Directory containing R-code for ProblemSet3; Python:
Saying Hi to Python, Python Refresher)
IV Classification (Introduction to Classification: Basic Concepts and Decision Trees,
Overfitting, kNN-Classifiers and Support Vector Machines, Neural Networks,
Ensemble Learning, Naive Bayes Classifiers & Bayes' Theorem)
V Clustering and Similarity Assessment (Introduction, Density-based Clustering Centering on DBSCAN,
Hierarchical Clustering, Cluster Validity,
R-scripts demonstrating: K-means/medoids, DBSCAN,
More on PAM and using PAM/DBSCAN with dist-objects (not relevant and covered in 2021);
Clustering Exercises K-Means, HC, and DBSCAN)
VI Brief Introduction to Association Analysis Centering on APRIORI
VII Outlier Detection
VIII Data Storytelling
IX Overview of Data Preprocessing Techniques (likely not or only briefly covered in 2021)
X Introduction to Data Visualization (not covered in 2021; Part1 (most of the slides in this slideshow were
created by Guoning Chen, Department of Computer Science,
University of Houston), Part2 (slides were created
by Alark Joshi, Department of Computer Science,
University of San Francisco); Data Visualization Reading Material for DS I)
"Old" News COSC 3337 (Data Science I) Fall 2021
- TAs for this semester are Sadat Shahriar and Mathew Banda; their e-mails and office hours are listed above under "Basic Course Information".
All office hours through Sept. 30, including Dr. Eick's office hours, will be held in MS Teams and not F2F!
- We will be using a MS Team called "3337-Class" for the teaching of the course; please, go ahead and register for
this team using the passcode '617ouxx'.
We will use MS Team 3337-Class for a lot of things in the course:
for online and F2F lectures, likely for submissions, for posting grades, and as a chat venue.
You need to be a member of the team 3337-Class to take the course!
- There is still a lot of uncertainty about the guidelines, rules and regulations concerning teaching courses in Fall
2021. Considering this fact, a tentative Syllabus has been posted at this website;
more detailed information on how the course will be taught will
be added to this Syllabus as soon as it becomes available!
- UH announced on Saturday that the soft transition period into the Fall 2021 semester ended last week.
Consequently, COSC 3337 will be taught F2F on both Tuesdays and
Thursdays 11:30a-12:45p for the remainder of this semester.
- Always download documents from the course website, as it always stores the most recent version of
the respective documents.
- On October 22: The dates of the 7-week-long group project have been finalized. Please, download the most recent version
of the group project specification.
- For code and datasets used in demos, check out the Datasets and Code channel in the course's MS Teams page.
- The 7-week group project groups have been set up. They can be found in the general channel of
the course's MS Teams page!
- Due to the cancellation of the Sept. 14 lecture, course activities will be shifted by one lecture.
Moreover, the deadline
for submitting Task1 of ProblemSet1 has been extended to Monday, September 27, 11:59p.
- The Neural Network Lecture has been updated on October 9; please download its most recent version from the
course website!
- The course midterm exam has been scheduled for Thursday, October 14, 11:30a-12:45p
in Fleming 160 (F 160). It will be a traditional paper exam (where you write your answers on the exam paper);
that is, we are back to giving exams
as we did before COVID-19. The exam will be open notes and books, but the use of computers will not be allowed!
A detailed review list has been posted below.
- There are two group activities in the course: the 7-week long
group project and group homework credit; all other activities in the course are individual tasks and academic
honesty rules apply for problem set tasks.
- When looking for the most recent version of a course document, always download it from the course website! The reason:
there is no security in MS Teams, and documents can be modified by any team member;
consequently, documents you find in MS Teams might be outdated or modified by your classmates.
- The specification of the group project has been finalized; please, take a look at it.
- On Nov. 16 and 18 we will have group project presentations (approx. 6-7 minutes per group); groups will be
subdivided into two meta groups, A and B, which present on November 16 and 18, respectively. A schedule for the presentations will be posted on
this website by Nov. 10 at the latest.
- Please upload your group project presentation slides in the Group Project Channel of the course MS Teams page by the end of the day
on which you give the presentation, as Sadat, Mathew and Dr. Eick might take a look at your slides before finalizing their presentation scores.
- The grading of the midterm exam has been finished. The average exam score was 36.4. The exam has already been curved: the average number
grade was 78.4. The exam will be returned to students during the lecture on Tuesday, November 30! Moreover, solution sketches for
the midterm exam have been posted directly below.
- The course final exam has been scheduled for Tu., Dec. 14, 11a in Fleming 160 (same room as the midterm exam);
it will take 105 minutes. A review list for the exam is available below!
Excused Missing of Course Exams:
If you miss course exams for other reasons, you might get a grade of 'F' for the exam, unless highly unusual circumstances
lead to your missing of the exam!
- Please post your solution files/slides for the Group Homework Credit tasks in the respective channel in MS Teams! Use your group name
as the name of the file that you will post.
- Mathew is grading Task 1 and Task 4 and the midterm exam; Sadat will be grading Task2 and Task3 and the group project
(mostly). Please ask the respective TA if you have any questions about the grading.
Past Exam and Review Solutions
Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Midterm1 October 2, 2019
Solution Sketches Midterm2 November 6, 2019
Solution Sketches Final Exam December 6, 2019
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches of Review for Final Exam on
April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on
November 5, 2018
Makeup Activities for the Cancelled Class on Sept. 12, 2021
Please watch the following two introductory videos by 3Blue1Brown about neural networks:
Introduction to Neural Networks (watch the whole video)
Weight Learning in Neural Networks (just watch the first 15 minutes of the second video)
Grading
Students will be responsible for material covered in the
lectures and assigned in the readings. All assignment and
project reports are due at the date specified.
No late submissions will be accepted after the due date. This policy will be strictly enforced.
Translation of number grades to letter grades, starting in Fall 2021:
A: 100-92 A-: 92-88 B+: 88-84 B: 84-80 B-: 80-76 C+: 76-71
C: 71-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
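The translation above can be written as a simple lookup; the following is a minimal sketch under stated assumptions, not the official grading code. The ranges in the table share their endpoints, so the sketch assumes that the lower bound of each range is inclusive. The grading text confirms this only for the A cutoff (number grades of 92 or higher are A's); treating the other boundaries, including 50 as a D-, the same way is an assumption. The names GRADE_CUTOFFS and letter_grade are hypothetical.

```python
# (lower cutoff, letter) pairs from the Fall 2021 translation table, highest first.
# ASSUMPTION: a number grade exactly on a boundary receives the higher letter.
GRADE_CUTOFFS = [
    (92, "A"), (88, "A-"), (84, "B+"), (80, "B"), (76, "B-"), (71, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(number_grade):
    """Map a number grade in [0, 100] to a letter grade (hypothetical helper)."""
    if not 0 <= number_grade <= 100:
        raise ValueError("number grade must be between 0 and 100")
    for cutoff, letter in GRADE_CUTOFFS:
        if number_grade >= cutoff:
            return letter
    return "F"  # everything below the lowest cutoff


if __name__ == "__main__":
    for grade in (95, 92, 91.9, 78.4, 50, 49.9):
        print(grade, letter_grade(grade))
```

Under these assumptions, the curved midterm average of 78.4 reported in the news section would correspond to a B-.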
Only machine-written solutions are accepted in the assignments/problem sets (the only exceptions are figures and complex formulas).
Be aware of the fact that our
only source of information is what you have turned in. If we are not able to understand your
solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
In contrast to the exam grades, where you receive your number grades immediately, the assignment/problem set scores will still be curved near the end
of the semester. Your curved assignment scores will ultimately be converted into a number grade using a conversion function that incorporates
the different weights of the four assignments, and this number grade will then count about 50% towards your final course grade.
In general, assignment weights were selected considering the amount of work required, but difficulty was also considered;
moreover, group projects carry lower weights. When
looking at the detailed grade reports, be aware of the fact that number grades of 92 or higher are A's in Dr. Eick's curving.
Students may discuss course material and homework, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
assignment problems and in the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in obtaining a grade of F
in the course, and in further prosecution.