last updated: November 8 at 8p.

COSC 3337: Data Science I in Fall 2021 (Dr. Eick)

Goals of the Data Science I Course

COSC 3337 Syllabus

Upon completion of this course, students
1.	will know what the goals and objectives of data science are and how to conduct a data science project.
2.	will have a sound knowledge of basic statistics and basic machine learning concepts.
3.	will have sound knowledge of exploratory data analysis.
4.	will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles and neural networks.
5.	will have sound knowledge of how to construct distance functions.
6.	will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering and cluster evaluation.
7.	will have some basic knowledge about anomaly and outlier detection.
8.	will get hands-on exposure, through the course assignments, to applying data analysis techniques to real-world data sets. Students will also obtain valuable experience in learning how to interpret data visualizations, how to select parameters of data analysis tools, and how to interpret and evaluate data analysis results.
9.	will get some practical experience with respect to popular data analysis and visualization environments, such as R or Python, and their popular libraries. 
10.	will get some exposure to and experience in data storytelling.

Course Content

1.	Introduction to Data Analysis, Data Science and Data Mining  
2.	Exploratory Data Analysis: How to Visualize and Compute Basic Statistics for Datasets and How to Interpret the Findings 
3.      Introduction to R (optional topic) 
4.	Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, Support Vector Machines, 
         Neural Networks and Regression 
5.	Introduction to Clustering and Similarity Assessment 
6.      Outlier and Anomaly Detection 
7.	Data Preprocessing 
8.      Data Storytelling 

Basic Course Information

Instructor: Dr. Christoph F. Eick
Office: 573 PGH
Office Hours: TU 1-2p TH 9:30-10:30a
e-mail: ceick@uh.edu
TA: Shahriar Sadat
Office Hours: TU 3-4p TH 4-5p (scheduled in MS Teams)
Office: ?
Email: sshahria@CougarNet.UH.EDU
TA: Mathew Banda
Office Hours: TU 2-3p TH 2-3p (scheduled in MS Teams)
Office: ?
Email: mabanda3@CougarNet.UH.EDU
class meets: TU/TH 11:30a-1p
classroom: 108 AH and/or online in MS Teams (3337-Class)
Cancelled class: Tuesday, Nov. 23
Makeup class: possibly Tuesday, December 7, 2021

Course Materials

COSC 3337 Syllabus for Fall 2021

Recommended Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley, 2018.
Link to Book HomePage

Other Material:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, Third Edition, 2011.
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

News COSC 3337 (Data Science I) Fall 2021

Important Dates in Fall 2021

Thursday, September 14: R-Refresher Lab (35 minutes) taught by Mathew
Tuesday, September 19: Python Data Science Basics Lab (35 minutes) taught by Sadat
Thursday, October 14: Midterm Exam (Review List 2021 Midterm1 Exam, October 12, 2021 Review1 Questions and Answers)
Thursday, December 2 or Tuesday, December 7: Last Lecture COSC 3337
Tuesday, December 14, 11a: Final Exam (Review List for Dec. 6, 2019, 2p Final Exam (will be posted on Nov. 26, 2019), Questions and Solution Sketches of the Nov. 25, 2019 Review for the Final Exam)

Course Elements and Their Tentative Weights for 2021

Problem Sets (3), Group Project and Group Homework Credit: 52%
Exams (2): 48% (20% + 28%)
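
As a rough illustration of how these tentative weights combine, the following Python sketch computes an overall number grade. It assumes that the 20% and 28% exam weights belong to the midterm and the final exam, respectively (the page does not say which is which), and that all inputs are number grades on a 0-100 scale. The function name and parameters are hypothetical, and curving is not modeled.

```python
def overall_score(assignments: float, midterm: float, final_exam: float) -> float:
    """Combine number grades (0-100) using the tentative 2021 weights."""
    # Assumed split: 52% problem sets/group project/group homework credit,
    # 20% midterm exam, 28% final exam.
    weights = {"assignments": 0.52, "midterm": 0.20, "final_exam": 0.28}
    return (weights["assignments"] * assignments
            + weights["midterm"] * midterm
            + weights["final_exam"] * final_exam)


if __name__ == "__main__":
    # Example with made-up number grades.
    print(round(overall_score(90, 80, 70), 1))
```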

Fall 2021 Problem Sets and Group Project

Problem Set1 (final draft; individual tasks)

2021 Group Project (updated on November 8, 2021, COSC 3337 Group Project Presentations)

Problem Set2 (Task3 Description (updated on Nov. 3!), Task3 Deliverables, Helper Function; individual task)

Problem Set3 (first, preliminary draft; individual tasks; Some Information on Task4 (will be extended by November 29, 2021))

Fall 2021 Group Homework Credit Tasks and Schedule

In this activity, called group homework credit, each group formed for the activity receives a different homework-style problem, presents its solution during the lecture, and shares its solution in the form of a Word or pptx file. The groups and e-mail addresses of the group members have been posted in the 'File' Section of the General Channel of 3337-Class. Here is a list of the already assigned tasks and associated groups:
Group A Task (to be presented on Sept. 16)
Group B Task (to be presented on Sept. 16)
Group C Task (to be presented on Sept. 21)
Group D Task (to be presented on Sept. 28)
Group E Task (to be conducted on October 5; updated on Sept. 30)
Group F Task (to be presented on October 12)
Group G Task (to be presented on October 26)
Group H Task (to be presented on November 2)
Group I Task (to be presented on November 9)
Group J and Group K Task (both to be presented on November 11)

COSC 3337 Group Project Presentations

Vermont, Pennsylvania, Utah, Washington, Nevada, California, Delaware, Illinois and Tennessee will be presenting on Tuesday, November 16!
Data Engineers, Alaska, Ohio, Data Scientists 1, Colorado, DarkSide, Virginia, Team Rocket and Arizona will be presenting on Thursday, November 18!

Also take a look at: More Information About the Group Project Presentations

COSC 3337 Data Science I: Lecture Notes

I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining, Part2: Mostly Course Information, Part3: Introduction to Data Science, Differences between Clustering and Classification)
II Exploratory Data Analysis (covers Chapter 3 from the First Edition of the Tan Book (download it, as this material is not in the second edition); more material, which will not be covered in 2021: Introduction to Non-Parametric Density Estimation; KDE Density Functions, Some R Data Analysis Functions I; Some R Data Analysis Functions II)
III R and Python for Data Science (only some of the listed slide sets will be covered in the lecture; Arko's Short Intro Into R (not covered, but a good "refresher" if you learnt R some time ago and have forgotten most of the details), Scatter Plot Code, Decision Trees in R, Some useful code for Task1 ProblemSet1 (will be covered in part during the lecture), Computing Statistical Summaries in the Presence of Missing Values (NA), Functions and Loops in R, Directory containing R-code for ProblemSet3; Python: Saying Hi to Python, Python Refresher)
IV Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, kNN-Classifiers and Support Vector Machines, Neural Networks, Ensemble Learning, Naive Bayes Classifiers & Bayes' Theorem)
V Clustering and Similarity Assessment (Introduction, Density-based Clustering Centering on DBSCAN, Hierarchical Clustering, Cluster Validity, R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects (not relevant and not covered in 2021); Clustering Exercises K-Means, HC, and DBSCAN)
VI Brief Introduction to Association Analysis Centering on APRIORI
VII Outlier Detection
VIII Data Storytelling (likely not or only briefly covered in 2021)
IX Preprocessing
X Introduction to Data Visualization (not covered in 2021; Part1 (most of the slides in this slideshow were created by Guoning Chen, Department of Computer Science, University of Houston), Part2 (slides were created by Alark Joshi, Department of Computer Science, University of San Francisco); Data Visualization Reading Material for DS I)

"Old" News COSC 3337 (Data Science I) Fall 2021

Past Exam and Review Solutions

Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Midterm1 October 2, 2019
Solution Sketches Midterm2 November 6, 2019
Solution Sketches Final Exam December 6, 2019
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches of Review for Final Exam on April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on November 5, 2018

Makeup Activities for the Cancelled Class on Sept. 12, 2021

Please watch the following two introductory videos by 3Blue1Brown about neural networks:
Introduction to Neural Networks (watch the whole video)
Weight Learning in Neural Networks (just watch the first 15 minutes of the second video)

Grading

Students will be responsible for material covered in the lectures and assigned in the readings. All assignment and project reports are due on the date specified. No late submissions will be accepted after the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance on randomly chosen dates, and an attendance score will be computed from how many of those lectures you attended.

Translation of number grades to letter grades in 2020:
A: 100-92 A-: 92-88 B+: 88-84 B: 84-80 B-: 80-76 C+: 76-71
C: 71-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
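
The table can be read as a simple lookup on lower bounds. The Python sketch below is only an illustration: it assumes that a score lying exactly on a boundary receives the higher letter grade (consistent with number grades of 92 or higher being A's), and the names GRADE_BOUNDS and letter_grade are hypothetical.

```python
# (lower bound, letter) pairs taken from the 2020 table, highest first.
GRADE_BOUNDS = [
    (92, "A"), (88, "A-"), (84, "B+"), (80, "B"), (76, "B-"), (71, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score: float) -> str:
    """Map a number grade (0-100) to a letter grade."""
    # Assumption: a boundary score gets the higher of the two adjacent letters.
    for lower, letter in GRADE_BOUNDS:
        if score >= lower:
            return letter
    return "F"


if __name__ == "__main__":
    for s in (95, 88, 70.5, 49):
        print(s, letter_grade(s))
```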

Only machine-written solutions are accepted in the assignments/problem sets (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we are unable to understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

  • In contrast to the exam grades, where you receive your number grades immediately, the assignment/problem set scores will be curved near the end of the semester. Your curved assignment scores will ultimately be converted into a number grade using a conversion function, which incorporates the different weights of the four assignments, and this number grade will then count about 50% towards your final course grade. In general, assignment weights were selected considering both the amount of work required and the difficulty; moreover, group projects carry lower weights. When looking at the detailed grade reports, be aware that number grades of 92 or higher are A's in Dr. Eick's curving.

  • Students may discuss course material and homeworks, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through assignment problems or in the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.

    Excused Missing of Course Exams: If you miss course exams for other reasons, you might get a grade of 'F' for the exam, unless highly unusual circumstances led to your missing the exam!