last updated: August 15 at 3p

COSC 3337: Data Science I in Fall 2021 (Dr. Eick)

Goals of the Data Science I Course

COSC 3337 Syllabus

Upon completion of this course, students
1.	will know what the goals and objectives of data science are and how to conduct a data science project.
2.	will have a sound knowledge of basic statistics and basic machine learning concepts.
3.	will have sound knowledge about the most important data visualization techniques.
4.	will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles, and neural networks.
5.	will have some sound knowledge about how to construct distance functions.
6.	will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering and cluster evaluation.
7.	will get hands-on experience in the course assignments in applying data analysis techniques to real-world data sets. They will obtain valuable experience in selecting parameters of data analysis tools and in interpreting and evaluating data analysis results.
8.	will learn how to use the popular data analysis and visualization environment R and its popular libraries.
9.	will get some exposure to and experience in data storytelling.

Course Content

1.	Introduction to Data Analysis, Data Science and Data Mining  
2.	Exploratory Data Analysis—how to Visualize and Compute Basic Statistics for Datasets and how to Interpret the Findings and Assignment1
3.	Introduction to R
4.	Introduction to Data Visualization (new!)
5.	Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, Support Vector Machines, Neural Networks, Regression and Assignment2
6.	Introduction to Clustering, Similarity Assessment and Assignment3
7.	Data Preprocessing 
8.      Data Storytelling 
9.      Outlier Detection (optional, might be running out of time to discuss this topic!)

Basic Course Information

Instructor: Dr. Christoph F. Eick
Office: 573 PGH
Office Hours: TU 1-2p TH 9:30-10:30a
e-mail: ceick@uh.edu
TA: Nour Smaoui
Office Hours: MO noon-1p WE 1-2p
Office: 350 PGH
Email: nour@cs.uh.edu
2018 TA website: Romita's 4335 Website (?!)

class meets: TU/TH 11:30a-1p
class room: 108 AH or online in the MS Teams team 3337-Class
Instructor's Travel: October 12+14
Lectures by 'Guests': October 14

Course Materials

COSC 3335 Syllabus for Fall 2019

Recommended Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley, 2018.
Link to Book HomePage

Other Material:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Morgan Kaufmann Publishers, Third Edition, 2011.
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

News COSC 3337 (Data Science I) Fall 2021

Important Dates in Fall 2021

Thursday, September 9 (tentative): Introduction to R and Assignment1 Lab (taught by Nour; please bring your laptop with R installed!)
Saturday, September 25, 11p: Deadline Assignment1
Tuesday, October 12: Midterm1 Exam (Review List 2019 Midterm1 Exam, October 7 Review1 Questions and Answers)
Thursday, October 14: Lecture given by a Guest Lecturer
Monday, October 28: R-lab centering on R programming (bring your laptop!)
Wednesday, October 30: Group Assignment2 Student Presentations
Monday, November 25: Last Lecture COSC 3337
Friday, December 6, 2p: Final Exam (Review List for Dec. 6, 2019, 2p Final Exam (will be posted on Nov. 26, 2019), Questions and Solution Sketches of the Nov. 25, 2019 Review for the Final Exam) in SEC 202 (our regular class room).

Course Elements and Their Tentative Weights for 2021

Assignments (3) and Online Credit: 52%
Exams (2): 48% (20% + 28%)
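
As a rough illustration of how these tentative weights combine, here is a minimal sketch in Python (the course itself uses R). The weights come from the list above; the function name, the sample scores, and the assumption that the first exam counts 20% and the second 28% are hypothetical and not confirmed by the syllabus.

```python
# Tentative 2021 weights from the syllabus.
ASSIGNMENTS_WEIGHT = 0.52  # three assignments plus online credit, combined
EXAM1_WEIGHT = 0.20        # assumption: the first exam is the 20% exam
EXAM2_WEIGHT = 0.28        # assumption: the second exam is the 28% exam


def course_number_grade(assignments, exam1, exam2):
    """Weighted average of three number grades, each on a 0-100 scale."""
    return (ASSIGNMENTS_WEIGHT * assignments
            + EXAM1_WEIGHT * exam1
            + EXAM2_WEIGHT * exam2)


# Hypothetical student: 80 on assignments, 70 and 90 on the two exams.
print(round(course_number_grade(80, 70, 90), 1))  # 80.8
```

The sketch ignores curving, which is applied to the assignment scores before they enter the course grade.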

Fall 2019 Assignments

There will be 3 assignments in the course, centering on:
Assignment 1: Exploratory Data Analysis and Data Visualization (individual assignment, second draft; Few Remarks Tasks 12+13).
Assignment 2: Learning Classification Models and Model Evaluation (group assignment; 35 Cents of Wisdom on Giving Presentations, On AS2 Group Presentations)
Assignment 3: Similarity Assessment, Clustering and Using Clustering to Create Background Knowledge for Classification Tasks (individual assignment; Additional Material for Assignment3)

Late Policies for Assignments: Assignments are due at the submission deadline, and assignments submitted after the deadline will not be graded. As the only exception, students are allowed to submit either Assignment1 or Assignment3 up to 36 hours late!

Assignment Weights and Curving: This semester, the assignment weights were chosen as follows: Assignment1: 37%, Assignment2: 30%, Assignment3: 33%. In general, there was a large spread of scores in the three assignments; we also felt that the average performance in Assignment1 and Assignment2 was slightly below expectations and that the performance in Assignment3 was above expectations. Consequently, the number grade averages for Assignment1 and Assignment2 were curved to about 76 and 77, respectively, whereas the average for Assignment3 is about 82.

COSC 3337 Data Science I: Lecture Notes

I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining, Part2: Course Information, Part3: Introduction to Data Science, Differences between Clustering and Classification).
II Exploratory Data Analysis (updated on August 29, 2019; covers chapter 3 from the First Edition of the Tan Book (download as this material is not in the second edition); more material: Interpreting Displays; Introduction to Non-Parametric Density Estimation; KDE Density Functions, Some R Data Analysis Functions I (not covered in the 2019 lecture; overlap with the R-Lab); Some R Data Analysis Functions II (not covered in the 2019 lecture))
III R (Arko's Short Intro Into R (used in Lab), Scatter Plot Code, Decision Trees in R, Some useful code for Assignment1, Computing Statistical Summaries in the Presence of Missing Values (NA), Functions and Loops in R, Directory containing R-code for Project2 (lecture on Feb. 26); moreover, check out Romita's 4335 webpage for more R-code, datasets, and slides)
IV Introduction to Data Visualization (Part1 (most of the slides in this slideshow were created by Guoning Chen, Department of Computer Science, University of Houston), Part2 (slides were created by Alark Joshi, Department of Computer Science, University of San Francisco); Data Visualization Reading Material for DS I)
V Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, kNN-Classifiers and Support Vector Machines, Neural Networks (updated on Oct. 14, 2019), Ensemble Learning, Naive Bayes Classifiers & Bayes' Theorem)
VI Clustering and Similarity Assessment (Introduction, Hierarchical Clustering and Cluster Validation, DBSCAN; R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects; Clustering Exercises K-Means, HC, and DBSCAN)
VII Preprocessing
VIII Data Storytelling
IX Outlier Detection (Only Sections 0 and 4 of the slides will be covered in 2019)

Past Exam and Review Solutions

Solution Sketches Midterm1 March 10, 2015
Solution Sketches Midterm2 April 7, 2015
Solution Sketches Final Exam December 10, 2018
Solution Sketches Review1 March 1, 2016
Solution Sketches Review1 Feb. 27, 2018
Solution Sketches Review1 September 24+26, 2018
Solution Sketches Midterm1 March 3, 2016
Solution Sketches Midterm1 March 1, 2018
Solution Sketches Midterm1 October 2, 2019
Solution Sketches Midterm2 November 6, 2019
Solution Sketches Final Exam December 6, 2019
Solution Sketches Review2 April 5, 2016
Solution Sketches Midterm2 April 7, 2016
Solution Sketches Midterm2 April 5, 2018
Review for Final Exam, May 3, 2016
Solution Sketches of Review for Final Exam on April 26, 2018
Solution Sketches Final Exam May 10, 2016
Review2 solution sketches on November 5, 2018

Reports

Rexer Analytics' 2015 Data Science Survey
CrowdFlower's 2016 Data Science Report
Summary Rexer Analytics' 2017 Data Science Survey (the complete survey is supposed to appear in October 2018)

Grading

Students will be responsible for material covered in the lectures and assigned in the readings. All assignment and project reports are due at the date specified. No late submissions will be accepted after the due date. This policy will be strictly enforced.
Several times during the semester I will check class attendance on randomly chosen dates, and an attendance score will be computed from how many of those lectures you attended.

Translation of number grades to letter grades in 2015:
A: 100-90 A-: 90-86 B+: 86-82 B: 82-78 B-: 78-74 C+: 74-70
C: 70-66 C-: 66-62 D+: 62-58 D: 58-54 D-: 54-50 F: 50-0
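
The 2015 table can be read as a simple lookup, sketched below in Python (the course itself uses R). The table lists each boundary value in two adjacent ranges, so the sketch assumes that the lower bound of each range is inclusive; this matches the statement elsewhere on this page that number grades of 90 or higher are A's, but it is an assumption for the other boundaries. The names `GRADE_BOUNDS` and `letter_grade` are hypothetical.

```python
# (lower bound, letter) pairs taken from the 2015 table, highest first.
GRADE_BOUNDS = [
    (90, "A"), (86, "A-"), (82, "B+"), (78, "B"), (74, "B-"), (70, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score):
    """Map a 0-100 number grade to a letter grade.

    Assumption: a score exactly on a boundary receives the higher letter.
    """
    for lower_bound, letter in GRADE_BOUNDS:
        if score >= lower_bound:
            return letter
    return "F"


print(letter_grade(90))    # A
print(letter_grade(77.5))  # B-
print(letter_grade(49))    # F
```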

Only machine-written solutions are accepted in the assignments (the only exceptions are figures and complex formulas). Be aware of the fact that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

  • In contrast to the exam grades, where you receive your number grades immediately, the assignment scores will still be curved near the end of the semester. Your curved assignment scores will ultimately be converted into a number grade using a conversion function that incorporates the different weights of the four assignments, and this number grade will then count 45-48% towards your final course grade. In general, assignment weights were selected considering the amount of work required, but difficulty was also considered; moreover, group projects carry lower weights. When looking at the detailed grade reports, be aware of the fact that number grades of 90 or higher are A's in Dr. Eick's curving.

  • Students may discuss course material and homeworks, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through assignment problems and in the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not given proper citation may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.

    Fall 2018 Assignments

    Assignment1: Exploratory Data Analysis for a Portuguese School Performance Dataset (Individual Assignment)
    Assignment2: Similarity Assessment, Clustering, and Using Clustering to Create Background Knowledge for Classification Tasks (individual project; Running PAM with Distance Functions)
    Assignment3: Making Sense of Data: Learning and Comparing Classification Models for a Dataset (Individual Project)
    Assignment4: Design, Implementation and Comparison of Outlier Detection Techniques for a Spatial Dataset (Group Project, Assignment4 Groups)

    This semester, the student performance for Assignment2 and Assignment3 met expectations; overall, more than 45% of the students did not do well at all in Assignment1, and most groups did very well in Assignment4. Based on this assessment, the assignment number grade averages after curving for the four assignments were approximately as follows: 75, 78-79, 78-79, and 82.

    Spring 2018 Assignment Evaluation Questionnaire Result Summary

    The questionnaire was conducted on April 24, 2018, and Romita created the following summary of the answers you gave (based on 17 responses):

    9 students felt assignment 2 was the most difficult and took the most time, 2 students felt this about assignment 3, and 1 student about assignment 1. Several students thought that assignment 1 was also quite time consuming, although not as time consuming as assignment 2; assignment 1 also had a lot of second-place finishes in other categories. 3 students felt that assignment 3 was the least difficult and another 3 students felt assignment 4 was the least difficult. Concerning interestingness, 8 students felt that assignment 3 was the most interesting, followed by 5 students for assignment 2 and 2 students for assignment 1.
    All the students agreed that the assignments helped them better understand the concepts covered in class. They also liked the fact that the assignments gave them an opportunity to work with real datasets. 5 students felt they learned the most in assignment 3, and 3 students mentioned they learned the most in assignment 2. Another common comment was that the assignments helped them in learning and using R.