last updated: October 9, 8a

COSC 3337: Data Science I in Fall 2025 (Dr. Eick)

Goals of the Data Science I Course

COSC 3337 Syllabus

Upon completion of this course, students
1. will know what the goals and objectives of data science are and how to conduct a data science project.
2. will have a sound knowledge of basic statistics and basic machine learning concepts.
3. will have sound knowledge of exploratory data analysis.
4. will have knowledge of popular classification techniques, such as decision trees, support vector machines, K-nearest neighbor and neural networks.
5. will have some basic knowledge about how to construct distance functions.
6. will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, EM and hierarchical clustering, and of cluster evaluation.
7. will have some basic knowledge about anomaly and outlier detection and some hands-on experience in using these techniques on a dataset.
8. will have some basic knowledge about association analysis.
9. will get hands-on exposure, in the course assignments, to applying data analysis techniques to real-world data sets. They will also obtain valuable experience in creating data visualizations, selecting parameters of data analysis tools, interpreting and evaluating data analysis results, and data storytelling.
10. will get some practical experience with popular data analysis and visualization environments, such as R or Python data science frameworks, and their popular libraries.

Course Content and Order of Coverage (updated on Sept. 1, 2025)

1. Introduction to Data Analysis, Data Science and Data Mining
2. Exploratory Data Analysis: How to Visualize and Compute Basic Statistics for Datasets and How to Interpret the Findings
3. Preprocessing (short)
4. Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, and Neural Networks
5. Density Estimation
6. Similarity Assessment
7. Outlier and Anomaly Detection
8. Data Storytelling
9. Ethical and Societal Issues of Data Science
10. Introduction to Clustering
11. Introduction to Deep Learning
12. Introduction to Association Analysis Centering on the Apriori Algorithm (short)
13. Advanced Clustering

Basic Course Information 2025

Instructor: Dr. Christoph F. Eick
Office hours (F2F and online on Tuesday using 4368 MS Team; F2F on Thursday: send me an e-mail and I will contact you): TU 8:45-10a TH 4-4:45p
e-mail: ceick@uh.edu
TA: Janet Anagli
Office Hours: MO 10-11a TU 10-11a (scheduled in MS Teams)
Email: jyanagli@CougarNet.UH.EDU
TA: Athina Bikaki
Office Hours: MO 2-3p TH 8:30-9:30a (scheduled in MS Teams)
Email: abikaki@CougarNet.UH.EDU
TA: Pei-Chi Pan
Office Hours: WE&FR 9:30-10:30a (scheduled in MS Teams)
Email: ppan@CougarNet.UH.EDU
Classes not taught by Dr. Eick: Th., October 16; Th., Nov. 6; maybe Tu., December 2.

Recommended Text:
P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining,
Addison Wesley, 2018.
Link to Book HomePage

Other Material:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
Morgan Kaufmann Publishers, Third Edition, 2011.
Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

News COSC 3337 (Data Science I) Dr. Eick's Section Fall 2025

Course Elements and Their Tentative Weights for 2025

ProblemSet Tasks (5): 47%
Single Group Presentation or Preparation of an Inclass Discussion: 4%
Exams (3): 47% (14%, 16%, 17%)
Attendance: 2%
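
The weights above can be combined into one course number grade with a weighted average. The following is only a minimal sketch: it assumes that each course element has already been converted into a number grade on a 0-100 scale, and the function and variable names are made up for this illustration. It is not the official grade computation, which also involves curving.

```python
# Tentative 2025 weights in percent, copied from the list above.
WEIGHTS = {
    "problem_set_tasks": 47,
    "presentation_or_discussion": 4,
    "exams": 47,
    "attendance": 2,
}
# Individual weights of Exam1, Exam2 and Exam3 (they add up to 47).
EXAM_WEIGHTS = (14, 16, 17)


def course_number_grade(tasks, presentation, exams, attendance):
    """Weighted average of number grades, each assumed to be on a 0-100 scale.

    tasks, presentation, attendance: one number grade each.
    exams: sequence with the three exam number grades.
    """
    if len(exams) != len(EXAM_WEIGHTS):
        raise ValueError("expected exactly three exam grades")
    exam_points = sum(w * g for w, g in zip(EXAM_WEIGHTS, exams))
    total = (
        WEIGHTS["problem_set_tasks"] * tasks
        + WEIGHTS["presentation_or_discussion"] * presentation
        + exam_points
        + WEIGHTS["attendance"] * attendance
    )
    return total / 100


# Example with hypothetical grades.
print(course_number_grade(88, 95, (80, 85, 90), 94))
```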

Task1 and Task3 will be group tasks (most groups have 6 students; a few groups have 7). Groups will be formed shortly after Labor Day. Moreover, on September 25, October 21 and October 30 there will be a 40-minute in-class discussion about a data science related topic. Three groups will prepare and lead the discussions on those three days. 3-4 groups will give 10-13 minute presentations discussing their Task1 results on September 30. All the other groups will give 10-13 minute presentations discussing their Task3 results either on November 11 or on November 13.

Important Dates in 2025

Sept. 12, 12:10p: Janet will teach a lab for Task1.
Sept. 23, 12:10p: Pei-Chi will teach a lab for Task2.
Sept. 25, 12:10p: Group Bar will lead a panel discussion about "ChatGPT and Data Science I"
Sept. 30, 12:10p: Groups Dendrogram, Heat Map and Histogram will give presentations, showcasing their Task1 Results.
Oct. 2, 11:30a: Exam1 in our classroom
Oct. 16, 11:30a: Athina will teach a lab for Task3 and also Support Vector Machines
Oct. 21, 12:10p: Panel Discussion led by Group Area centering on 'Data Privacy'
Oct. 30, 12:10p: Panel Discussion led by Group Bubbles; tentative topic: 'Visualization: Transforming raw data into meaningful insights.'
Nov. 4, 11:30a: Exam2 in our classroom
Nov. 6, 11:30a: Abc will teach this lecture, as Dr. Eick is attending a conference.
Nov. 11, 11:30a: Six groups will give presentations showcasing their Task3 results
Nov. 13, 11:30a: Six groups will give presentations showcasing their Task3 results
Nov. 20, 12:10p: Optional Panel Discussion
Nov. 27: No Class (Thanksgiving)
Dec. 4, 11:30a: Exam3 in our classroom
Dec. 4, 12:50p: COSC 3337 activities end!

Fall 2025 Problem Set Tasks

Task1: Exploratory Data Analysis for a 2023 Houston Weather Dataset (Group Task; 2023 Houston Weather Dataset)
Task2: Neural Networks and Deep Learning (Individual Task)
Task3: Helios: Analyzing and Understanding Solar Flares (significantly updated on Oct. 8, Group Task, solar flares; Discussions Helios Project)
Task4: Outlier Detection for a Houston Weather Dataset (individual task; 2023 Houston Weather Data Set)
Task 5: Clustering a Spatial Dataset with K-Means and DBSCAN (individual task; Complex9 Dataset with 8% Gaussian Noise Added; currently, you see last year's task; the task description will change)

2025 Groups

The tentative weights for the five tasks in 2025 are as follows: Task1: 17%, Task2: 20%, Task3: 26%, Task4: 22%, Task5: 15%; weights will be finalized by Nov. 11 at the latest. This is the timeline of the five problem set tasks: Task1: Sept. 4-22; Task2: Sept. 23-October 8; Task3: October 9-November 8; Task4: October 9-October 30; Task5: Nov. 10-Nov. 23.

Course Exams

Mid1 Exam (October 2, 11:30a, 2025): Sept 30, 2025 Review for Mid1, Review List for 2025 Midterm1 Exam, Solution Sketches October 3, 2023 Midterm1 Exam, Solution Sketches October 1, 2024 Midterm1 Exam

Mid2 Exam (November 4, 11:30a, 2025): Oct 29 Review for 2024 Mid2 Exam, Review List for 2024 Midterm2 Exam, Solution Sketches 2023 Midterm2 Exam.

Mid3 Exam (December 4, 11:30a, 2025): Nov. 21, 2024 Review for Third Exam, Review List for 2024 Midterm3 Exam, A Few Solution Sketches of the 2023 Final Exam.

2025 Late Submission Policy

Tasks are due at the time specified; however,
a. tasks that are submitted one day late receive an 8% penalty (the task score is multiplied by 0.92);
b. tasks that are submitted two days late receive a 20% penalty (the task score is multiplied by 0.8);
c. tasks that are more than two days late receive a score of 0.
There will be a short grace period of a few minutes for each submission deadline (at the discretion of the respective Teaching Assistant); submissions received after this grace period will be considered late!
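
The penalty rules above amount to a small lookup. The sketch below is only an illustration: the function name and the way late days are counted (whole days, with 0 meaning on time or within the grace period) are assumptions made here, and the grace period itself is left to the Teaching Assistant.

```python
def apply_late_penalty(task_score, days_late):
    """Return the task score after applying the 2025 late submission penalty.

    days_late: 0 for on time (or within the grace period), 1 for one day
    late, 2 for two days late, and so on (counting convention assumed here).
    """
    if days_late < 0:
        raise ValueError("days_late must not be negative")
    # Multipliers from the policy: 8% penalty after one day, 20% after two.
    multipliers = {0: 1.0, 1: 0.92, 2: 0.8}
    # More than two days late: the score is 0.
    return task_score * multipliers.get(days_late, 0.0)


# Example: a task scored at 85 points and submitted one day late.
print(apply_late_penalty(85, 1))
```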

COSC 3337 Data Science I: Lecture Notes

I Introduction to Data Mining/Data Science (Part1: Introduction to Data Mining, Part2: Mostly Course Information, Part3: Introduction to Data Science (not covered in 2025), Preprocessing in Data Science (will be covered approx. Sept. 4)).
II Exploratory Data Analysis (covers chapter 3 from the First Edition of the Tan Book (download as this material; it is not in the second edition); more material; these slides have not been covered since 2021: Some R Data Analysis Functions I; Some R Data Analysis Functions II).
III R and Python for Data Science (only some of the listed slide sets will be covered in the lecture; Arko's Short Intro Into R (not covered, but a good "refresher" if you have forgotten most details of using R because you learned it some time ago), Scatter Plot Code, Decision Trees in R, Some useful code for Task1 ProblemSet1 (will be covered in part during the lecture), Computing Statistical Summaries In the Presence of Missing Value (NA), Functions and Loops in R, Directory containing R-code for ProblemSet3; Python: Saying Hi to Python, Python Refresher).
IV Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, kNN-Classifiers and Support Vector Machines, Neural Networks (Pei-Chi's slides discussing CNNs were added on Sept. 25, 2025), Ensemble Learning (not covered), Naive Bayes Classifiers & Bayes' Theorem (not covered))
V Density Estimation (Naive and Parametric Density Estimation (PDE Task (added on Nov. 6, 2023)), Non-parametric Density Estimation)
VI Clustering and Similarity Assessment (Introduction, Density-based Clustering Centering on DBSCAN, Hierarchical Clustering, Cluster Validity, R-scripts demonstrating: K-means/medoids, DBSCAN, EM and Gaussian Mixture Models, Clustering Exercises K-Means, HC, and DBSCAN)
VII Outlier Detection
VIII Association Analysis: Brief Introduction to Association Analysis Centering on APRIORI and Sequence Mining
IX Data Storytelling
X Spatial Data Analysis: Spatial Data Analysis and Hotspot Discovery and Introduction to Spatial Data Mining
XI Introduction to Deep Learning Centering on Autoencoders (In Part 1 we show parts of the MIT 6.S191 (MIT Deep Learning Bootcamp) videos and discuss their content (Introduction to Deep Learning (watch the first 8:20 of the video and 11:20-15:00; the remainder of the video was already covered in the neural network part of this course), Deep Generative Learning (watch the first 22 minutes of this video; if you want to know how VAEs generate "new" examples, resume watching the video at 31:05 for a few minutes) and maybe, if there is enough time, New Horizons: Diffusion Models (watch 39:40-58:30)); Part 2: Intro to Deep Learning, AutoEncoders (taught by Raunak on Nov. 12, 2024)).
XII Advanced Clustering
XIII Overview of Data Preprocessing Techniques (was already discussed in the August 30 lecture)
XIV Ethical Aspects of Data Science centering on Ethics Involving Census Data Collection and Interpretation (Danah Boyd Video)
XV Introduction to Data Visualization (not covered since 2021; Part1 (most of the slides in this slideshow were created by Guoning Chen, Department of Computer Science, University of Houston), Part2 (slides were created by Alark Joshi, Department of Computer Science, University of San Francisco); Data Visualization Reading Material for DS I)

Grading

Translation of number grades to letter grades, starting in Fall 2021:
A: 100-92  A-: 92-88  B+: 88-84  B: 84-80  B-: 80-76  C+: 76-71
C: 71-66  C-: 66-62  D+: 62-58  D: 58-54  D-: 54-50  F: 50-0
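
The grade ranges above share their boundaries (for example, 92 appears under both A and A-). The sketch below assumes that a boundary value always receives the higher grade, which matches the statement on this page that number grades of 92 or higher are A's; applying the same rule to the other boundaries is an assumption, and the names used are made up for this illustration.

```python
# Lower bound of each letter grade, taken from the table above.
GRADE_FLOORS = [
    (92, "A"), (88, "A-"), (84, "B+"), (80, "B"), (76, "B-"), (71, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(number_grade):
    """Translate a number grade (0-100) into a letter grade.

    Assumption: a value exactly on a boundary gets the higher grade.
    """
    if not 0 <= number_grade <= 100:
        raise ValueError("number grade must be between 0 and 100")
    # The floors are sorted from high to low, so the first match wins.
    for floor, letter in GRADE_FLOORS:
        if number_grade >= floor:
            return letter
    return "F"


# Example with a hypothetical number grade.
print(letter_grade(86.5))
```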

Only machine-written solutions are accepted in the assignments/problem sets (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we cannot understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

  • In contrast to the exam grades, where you receive your number grades immediately, the assignment/problem set scores will be curved near the end of the semester, and your curved assignment scores will ultimately be converted into a number grade using a conversion function (the conversion function incorporates the different weights of the four assignments); this number grade will then count about 50% towards your final course grade. In general, assignment weights were selected considering the amount of work required, but difficulty is also considered; moreover, group projects carry lower weights. When looking at the detailed grade reports, be aware that number grades of 92 or higher are A's in Dr. Eick's curving.

    Students may discuss course material and homework, but must take special care to discern the difference between collaborating in order to increase understanding of course materials and collaborating on the homework / course project itself. We encourage students to help each other understand course material, to clarify the meaning of homework problems, or to discuss problem-solving strategies, but it is not permissible for one student to help or be helped by another student in working through assignment problems and in the course project. If, in discussing course materials and problems, students believe that their like-mindedness from such discussions could be construed as collaboration on their assignments, students must cite each other, briefly explaining the extent of their collaboration. Any assistance that is not properly cited may be considered a violation of the Honor Code, and might result in a grade of F in the course and in further prosecution.

    Excused Missing of Course Exams: If you miss course exams for other reasons, you might get a grade of 'F' for the exam, unless highly unusual circumstances led to your missing the exam!

    Attendance 2025

    Attendance counts 2% towards the course grade. Attendance will be taken starting Thursday, Sept. 4. Only F2F attendance counts. Attendance will be taken on the following days: Sept. 4+9+11+16+18+23+25+30 (8), October 7+9+14+16+21+23+28 (7), November 6+18+20+25 (4), December 6 (1): 20 times total. Your number of attendances will be converted into a number grade as follows:
    20: 94, 19-18: 93, 17: 91, 16: 89, 15: 87, 14: 83, 13: 79, 12: 75, 11: 71, 10: 67, 9: 63, 8: 59, 7: 55, 6: 50, 0-5: 45.
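
The attendance conversion above is a plain table lookup. The following sketch encodes the listed values; the function and variable names are hypothetical and chosen only for this illustration.

```python
# Number of recorded attendances -> number grade, from the list above.
ATTENDANCE_GRADES = {
    20: 94, 19: 93, 18: 93, 17: 91, 16: 89, 15: 87, 14: 83, 13: 79,
    12: 75, 11: 71, 10: 67, 9: 63, 8: 59, 7: 55, 6: 50,
}


def attendance_grade(times_present):
    """Convert a number of attendances (0-20) into a number grade."""
    if not 0 <= times_present <= 20:
        raise ValueError("attendance is taken 20 times in total")
    # 0 to 5 attendances all map to a number grade of 45.
    return ATTENDANCE_GRADES.get(times_present, 45)


# Example: a student who was present 16 times.
print(attendance_grade(16))
```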

    Past Exam and Review Solutions

    Solution Sketches Midterm1 March 10, 2015
    Solution Sketches Midterm2 April 7, 2015
    Solution Sketches Final Exam December 10, 2018
    Solution Sketches Review1 March 1, 2016
    Solution Sketches Review1 Feb. 27, 2018
    Solution Sketches Review1 September 24+26, 2018
    Solution Sketches Midterm1 March 3, 2016
    Solution Sketches Midterm1 March 1, 2018
    Solution Sketches Midterm1 October 2, 2019
    Solution Sketches Midterm2 November 6, 2019
    Solution Sketches Final Exam December 6, 2019
    Solution Sketches Review2 April 5, 2016
    Solution Sketches Midterm2 April 7, 2016
    Solution Sketches Midterm2 April 5, 2018
    Review for Final Exam, May 3, 2016
    Solution Sketches of Review for Final Exam on April 26, 2018
    Solution Sketches Final Exam May 10, 2016
    Review2 solution sketches on November 5, 2018
    Solution Sketches Midterm Exam October 14 2021

    "Old" News COSC 3337 (Data Science I)

    2024 Group Homework Credit

    Due to the large number of students in Dr. Eick's section, there will be no Group Homework Credit in 2025!

    In this activity, which is called group homework credit, each group formed for this activity receives a different, usually homework-style problem (other tasks include demos and leading discussions); the group presents its solution during the lecture (10-14 minutes) and shares it in the form of a Word or pptx file. The groups and e-mail addresses of the group members have been posted in the 'Group Homework Credit' channel of this section's MS Team. Below is a list of the already assigned tasks and associated groups and presentation dates; tasks will be added as we move along with the teaching of the course; tasks will be posted at least 5 days before a group's presentation date!

    Schedule and Tasks:
    Group A Task (group A will present on Sept. 10)
    Group B will present on Sept. 17 and Group C will present on Sept. 19 (Group B and C Tasks)
    Group D will present on Sept. 26 (Group D Task)
    Group E Task and Group F, G, and H Task (groups E and F will present on October 8, and groups G and H will present on October 15).
    Group I and Group J Tasks (will both present on October 24)
    Group K, L, and M Tasks (group K will present on Nov. 7, group L will present on Nov. 14, and group M will present and lead a discussion on Nov. 14).
    Groups N and O Tasks (group N will present on November 19 and Group O will lead an ethics discussion on November 21)

    Fall 2024 Problem Set Tasks

    Task1: Exploratory Data Analysis for a Baseball Databank (group task)
    Task2: Learning Decision Tree and Support Vector Machine Models (individual task)
    Task3: Juracan: Analyzing Hurricane Trajectories and Assessing Hurricane Risks for Gulf of Mexico Locations (Group Task; is due on Nov. 8; Benchmark of Gulf Cities for Hurricane Risk Assessment, Assessing City Hurricane Risk with Influence Functions)
    Task4: Outlier Detection for a Houston Weather Dataset (individual task; is due on Tuesday, October 22; Houston Weather Data Set)
    Task 5: Clustering a Spatial Dataset with K-Means and DBSCAN (individual task; due on We., Nov. 13; Complex9 Dataset with 8% Gaussian Noise Added)
    Task 6: Using Deep Learning for Outlier Detection (individual task; last task of the semester; due on Nov. 21 end of the day; there will be a lab taught by Raunak on Tu., Nov. 12 to prepare you for this task).

    The tentative weights for the six tasks in 2024 are as follows: Task1: 17%, Task2: 14%, Task3: 21%, Task4: 19%, Task5: 15%, Task6: 14%! Weights will be finalized by Nov. 21 at the latest.