Upon completion of this course, students

1. will know what the goals and objectives of data science are and how to conduct a data science project.
2. will have a sound knowledge of basic statistics and basic machine learning concepts.
3. will have sound knowledge of exploratory data analysis.
4. will have knowledge of popular classification techniques, such as decision trees, support vector machines, ensembles, and neural networks.
5. will have some basic knowledge of how to construct distance functions.
6. will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering, and of cluster evaluation.
7. will have some basic knowledge of anomaly and outlier detection.
8. will have some basic knowledge of association analysis.
9. will get hands-on exposure, through the course assignments, to applying data analysis techniques to real-world data sets. They will also obtain valuable experience in creating data visualizations, selecting parameters of data analysis tools, interpreting and evaluating data analysis results, and data storytelling.
10. will get some practical experience with popular data analysis and visualization environments, such as R or Python data science frameworks, and their popular libraries.

1. Introduction to Data Analysis, Data Science and Data Mining
2. Preprocessing
3. Exploratory Data Analysis: How to Visualize and Compute Basic Statistics for Datasets and How to Interpret the Findings
4. Brief Introduction to R and Python Tools for Data Science
5. Introduction to Supervised Learning: Basic Concepts, Decision Trees, Instance-based Learning, Support Vector Machines and Neural Networks
6. Density Estimation
7. Outlier and Anomaly Detection
8. Introduction to Clustering and Similarity Assessment
9. Data Storytelling
10. Introduction to Association Analysis Centering on the Apriori Algorithm (short)
11. Introduction to Deep Learning Centering on Autoencoders
12. Ethical Issues of Data Science (short)
13. Advanced Clustering
14. Spatial Data Analysis and Spatial Data Mining

Office hours (573 PGH)

Office Hours: TU 4:10-5p TH 8:50-10a (in MS Teams)

e-mail: ceick@uh.edu

TA: Janet Anagli

Office Hours: MO 3-4p TU 9:30-10:30a (scheduled in MS Teams)

Email: jyanagli@CougarNet.UH.EDU

TA: Raunak Sarbajna

Office Hours: WE 1:30-2:30p TH 9:30-10:30a (scheduled in MS Teams)

Email: rsarbajn@cougarnet.uh.edu

class meets: TUTH 11:30a-1p in S105

Cancelled class: none yet

Lectures taught by others: Tu., October 17: Janet; Th., November 16: Raunak

All other lectures will be taught by Dr. Eick, but some lectures include labs that are taught by Janet and Raunak!

- P.-N. Tan, M. Steinbach, and V. Kumar:
*Introduction to Data Mining*, Addison Wesley, 2018.
- Link to Book HomePage

- Jiawei Han and Micheline Kamber,
*Data Mining: Concepts and Techniques*, Morgan Kaufmann Publishers, Third Edition, 2011.
- Link to Data Mining Book Home Page

NIST/SEMATECH e-Handbook of Statistical Methods (good online source covering exploratory data analysis, statistics, modelling and prediction)

- Dr. Eick's office hours this week are: WE 9:10-10a and TH 8:50-10a!
- Task5 has been posted. See below! It is due at the end of the day on Saturday, December 2 (1 day later than originally announced)!
- The lectures in the Nov. 27 week will cover advanced clustering and spatial data mining; there will be a review for the final exam, and, maybe, there will be a little more discussion about VAEs.
- If you use ChatGPT or other AI tools for course tasks, you have to mention this fact in your course report and describe for which subtasks you used the AI tool; not doing so represents a serious academic honesty violation.
- The final exam will take place on Thursday, December 7,
**11a** in SW101 (same classroom as before); you will have 105 minutes to complete this exam; a detailed review list for the final should be available on this website by Friday, Dec. 1, 2023 at the latest!

Thursday, September 14: Using Python for Data Science & Task2 Lab, taught by Janet (first 50-55 minutes of the lecture that day)

Saturday, September 23, 11:59p: Deadline to submit Task1 of ProblemSet1 focusing on Exploratory Data Analysis

Friday, September 29, 11:59p: Deadline to submit Task2 of ProblemSet1 focusing on Classification

Tuesday, October 3: Midterm1 Exam

Thursday, October 5: 25-45 minute presentation given by Raunak in preparation for the group project

Thursday, November 2: Spatial Analysis and Hotspot Discovery Lab taught by Raunak (45+ minutes)

Tuesday, November 14: Midterm2 Exam

Thursday, November 16: Deep Learning Lecture and Lab taught by Raunak in preparation for Task5

Thursday, November 23: Thanksgiving (no class!)

Thursday, November 30: Last class of the semester

Thursday, December 7, 11a: Final Exam

Exams (3): 49% (14%, 15%, 20%)

Tentative weights of non-exam tasks: Problem Sets: 31-32%, Group Project: 13-15%, Group Homework Credit: 3%, Attendance: 3%.
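As a rough illustration of how the weights above combine into a course score, here is a minimal sketch. The exam weights (14%, 15%, 20%) are taken from the list above; assigning them to Midterm1, Midterm2, and the final in that order is an assumption, as is picking 32% for the problem sets and 13% for the group project from the tentative ranges so that the total is 100%. The names `WEIGHTS` and `course_score` are hypothetical and not part of any official course tool.

```python
# Weights in percent. Exam weights come from the syllabus; their assignment to
# individual exams and the two values marked "assumed" are illustrative choices.
WEIGHTS = {
    "midterm1": 14,
    "midterm2": 15,
    "final": 20,
    "problem_sets": 32,    # assumed (tentative range: 31-32%)
    "group_project": 13,   # assumed (tentative range: 13-15%)
    "group_homework": 3,
    "attendance": 3,
}


def course_score(scores: dict) -> float:
    """Return the weighted average of component scores (each on a 0-100 scale)."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted_sum / total_weight


if __name__ == "__main__":
    example = {name: 85 for name in WEIGHTS}
    print(course_score(example))
```

Because the function divides by the sum of the weights, it still returns a value on the 0-100 scale if the tentative weights are adjusted later.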

Group Project (Oct. 10-Nov. 10, 2023 (1 month), centering on analyzing solar flares; Discussions Helios Project)

Problem Set2 (consists of a clustering task which is due on Nov. 6, and an outlier detection task which is due on Nov. 20).

Problem Set3 (Task5: Autoencoders).

Group A and Group B Tasks (Group A will present on Sept. 12 and Group B will present on Sept. 14)

Group C and Group D Tasks (Group C will present on Sept. 19 and Group D will present on Sept. 21)

Group E and Group F Tasks (both will present on Sept. 28)

Group G Task (will present Thursday, October 12)

Group H and Group I Tasks (both will present on Thursday, October 26)

Group J Task (will present on November 7)

Group K and L Tasks (Group K will present on Nov. 7 and Group L will present on November 9)

Group M will present on November 28

Group N will present on November 30

Mid2 Exam(Nov. 14, 11:30a, 2023): Nov. 9, 2023 Review for Mid2, Review List for 2023 Midterm2 Exam, Solution Sketches 2022 Midterm2 Exam.

Final Exam(Dec. 7, 11a, 2023): Dec. 1, 2022 Review for Final Exam, Review List for 2022 Final Exam.

All Exams are in SW 101.

Nov. 21, 2023 Offline Tasks

Tasks are due at the time specified; however,

- tasks that are submitted one day late receive a 12% penalty (the task score is multiplied by 0.88);
- tasks that are submitted two days late receive a 30% penalty (the task score is multiplied by 0.7);
- tasks that are more than two days late receive a score of 0.

There will be a short grace period of a few minutes for each submission deadline (at the discretion of the respective Teaching Assistant); submissions received after this grace period will be considered late!
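The late-submission rule above can be sketched as a small function. This is only an illustration: the function name `late_adjusted_score` is hypothetical, and how a submission is counted as one or two days late (including the grace period) is left to the Teaching Assistant, so the sketch simply takes the number of days late as an input.

```python
def late_adjusted_score(score: float, days_late: int) -> float:
    """Apply the late-penalty multipliers stated in the policy.

    days_late is assumed to have been determined already (0 = on time or
    within the grace period); the policy does not define it more precisely.
    """
    if days_late <= 0:
        return score          # on time: no penalty
    if days_late == 1:
        return score * 0.88   # 12% penalty
    if days_late == 2:
        return score * 0.7    # 30% penalty
    return 0.0                # more than two days late


if __name__ == "__main__":
    for days in range(4):
        print(days, late_adjusted_score(90, days))
```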

II Exploratory Data Analysis (covers chapter 3 from the First Edition of the Tan Book (download, as this material is not in the second edition); more material: these slides will not be covered in 2021: Introduction to Non-Parametric Density Estimation; KDE Density Functions, Some R Data Analysis Functions I; Some R Data Analysis Functions II).

III R and Python for Data Science (only some of the listed slide sets will be covered in the lecture; Arko's Short Intro Into R (not covered, but a good "refresher" if you forgot most details of using R, because you learnt it some time ago), Scatter Plot Code, Decision Trees in R, Some useful code for Task1 ProblemSet1 (will be covered in part during the lecture), Computing Statistical Summaries In the Presence of Missing Value (NA), Functions and Loops in R, Directory containing R-code for ProblemSet3; Python: Saying Hi to Python, Python Refresher).

IV Classification (Introduction to Classification: Basic Concepts and Decision Trees, Overfitting, kNN-Classifiers and Support Vector Machines, Neural Networks, Recurrent Neural Networks (not covered in 2023), Colah's Blog: Understanding LSTMs (not covered in 2023), Ensemble Learning (not covered), Naive Bayes Classifiers & Bayes' Theorem (not covered))

V Density Estimation (Naive and Parametric Density Estimation (PDE Task (added on Nov. 6, 2023)), Non-parametric Density Estimation (slides were added on November 3, 2023!))

VI Clustering and Similarity Assessment ( Introduction, Density-based Clustering Centering on DBSCAN, Hierarchical Clustering, Cluster Validity, R-scripts demonstrating: K-means/medoids, DBSCAN, More on PAM and using PAM/DBSCAN with dist-objects (not relevant and covered in 2022); Clustering Exercises K-Means, HC, and DBSCAN)

VII Outlier Detection

VIII Association Analysis: Brief Introduction to Association Analysis Centering on APRIORI and Sequence Mining

IX Data Storytelling

X Spatial Data Analysis: Spatial Data Analysis and Hotspot Discovery and Introduction to Spatial Data Mining

XI Introduction to Deep Learning Centering on Autoencoders (Part 1: we show parts of the MIT 6.S191 (MIT Deep Learning Bootcamp) videos and discuss their content (Introduction to Deep Learning (watch the first 8:20 of the video and 11:20-15:00; the remainder of the video was already covered in the neural network part of this course), Deep Generative Learning (watch the first 22 minutes of this video; if you want to know how VAEs generate "new" examples, resume watching the video at 31:05 for a few minutes), and maybe, if there is enough time, New Horizons: Diffusion Models (watch 39:40-58:30)); Part 2: Autoencoders and More on Deep Learning and Lab for Task 5 (taught by Raunak on Nov. 16, 2023)).

XII Advanced Clustering

XIII Overview of Data Preprocessing Techniques (was already discussed in the August 30 lecture)

XIV Ethical Aspects of Data Science centering on Ethics Involving Census Data Collection and Interpretation (Danah Boyd Video)

XV Introduction to Data Visualization (not covered since 2021; Part1 (most of the slides in this slideshow were created by Guoning Chen, Department of Computer Science, University of Houston), Part2 (slides were created by Alark Joshi, Department of Computer Science, University of San Francisco); Data Visualization Reading Material for DS I)

A:100-92 A-:92-88 B+:88-84 B:84-80 B-:80-76 C+:76-71

C: 71-66 C-:66-62 D+:62-58 D:58-54 D-:54-50 F: 50-0
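The grading scale above can be read as a simple threshold lookup. The sketch below is an illustration only: adjacent ranges share their boundary values (for example, 92 appears under both A and A-), so it assumes that a score exactly on a boundary receives the higher grade. That tie-breaking rule and the names `GRADE_CUTOFFS` and `letter_grade` are assumptions, not something the scale itself states.

```python
# Lower bound of each grade range, highest grade first.
GRADE_CUTOFFS = [
    (92, "A"), (88, "A-"), (84, "B+"), (80, "B"), (76, "B-"), (71, "C+"),
    (66, "C"), (62, "C-"), (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(score: float) -> str:
    """Map a course score (0-100) to a letter grade.

    Assumption: a score exactly on a shared boundary gets the higher grade.
    """
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    for s in (95, 92, 83.5, 50, 49.9):
        print(s, letter_grade(s))
```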

Only machine-written solutions are accepted in the assignments/problem sets (the only exceptions are figures and complex formulas). Be aware that our only source of information is what you have turned in. If we are unable to understand your solution, you will receive a low score. Moreover, students should not throw away returned assignments or tests.

Excused Missing of Course Exams: If you miss course exams for other reasons, you might get a grade of 'F' for the exam, unless highly unusual circumstances lead to your missing of the exam!

23-24: 94, 22: 93, 21:92, 20:91, 19:87, 18:83, 17:79, 16:75, 15:71, 14:67, 13:63, 12:59, 11:55, 10: 51, 9:47, 0-8:43.
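The table above can be expressed as a lookup. The sketch assumes that each left-hand number is a count of attended lectures and each right-hand number is the resulting attendance score; the page does not label the table explicitly, so this reading is an assumption, and the name `attendance_score` is hypothetical. Counts outside the listed range of 0 to 24 are rejected rather than guessed.

```python
# Scores for the counts that are listed individually in the table.
_SCORES_BY_COUNT = {
    22: 93, 21: 92, 20: 91, 19: 87, 18: 83, 17: 79, 16: 75,
    15: 71, 14: 67, 13: 63, 12: 59, 11: 55, 10: 51, 9: 47,
}


def attendance_score(count: int) -> int:
    """Look up the attendance score for a count (assumed: lectures attended)."""
    if not 0 <= count <= 24:
        raise ValueError("the table only covers counts from 0 to 24")
    if count >= 23:
        return 94   # row "23-24: 94"
    if count <= 8:
        return 43   # row "0-8: 43"
    return _SCORES_BY_COUNT[count]


if __name__ == "__main__":
    for c in (24, 22, 15, 9, 3):
        print(c, attendance_score(c))
```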

Solution Sketches Midterm2 April 7, 2015

Solution Sketches Final Exam December 10, 2018

Solution Sketches Review1 March 1, 2016

Solution Sketches Review1 Feb. 27, 2018

Solution Sketches Review1 September 24+26, 2018

Solution Sketches Midterm1 March 3, 2016

Solution Sketches Midterm1 March 1, 2018

Solution Sketches Midterm1 October 2, 2019

Solution Sketches Midterm2 November 6, 2019

Solution Sketches Final Exam December 6, 2019

Solution Sketches Review2 April 5, 2016

Solution Sketches Midterm2 April 7, 2016

Solution Sketches Midterm2 April 5, 2018

Review for Final Exam, May 3, 2016

Solution Sketches of Review for Final Exam on April 26, 2018

Solution Sketches Final Exam May 10, 2016

Review2 solution sketches on November 5, 2018

Solution Sketches Midterm Exam October 14 2021

- We will be taking attendance starting Tuesday, August 29 throughout the semester. Attendance counts 3% towards the course grade. Attendance will be taken approx. 11:45a; that is, if you show up 30 minutes late, your attendance will not count. The Fall Attendance grade computation can be found near the end of this webpage!
- A first draft of Task2 has been posted below. Janet will teach a lab preparing you for Task2 on Thursday, September 14, 2:30-3:15p; the lab will also briefly discuss "Python Data Science Basics". Please bring your laptop to the class and read the Task2 specification before the lab.
- Midterm2 has been graded; although about 30-40 students "did well" in the exam, it is our assessment that 25+ students were "ill prepared" for the exam, and overall the results were not good. We will go over solutions to some Midterm2 problems during the lecture on Nov. 29. As there will be no course activities Dec. 7-12, I strongly suggest you use this time window to get "well prepared" for the course final exam, so that the final exam results will be better than those for Midterm2!
- At the moment you mostly see information from the Fall 2021 teaching of the course; this information and teaching material will be replaced as we move along with teaching this course. The same remark applies for the teaching material, problem set and group project information and chat in MS Team 3337-Class.
- MS Teams will be used for teaching the online course; however, we will be using classical paper exams, which will be given in UH classrooms. Navid and Raunak will be the TAs for this course; you will find their MS Teams office hours and e-mail addresses above.
- The lectures in the Sept.5 week will continue to discuss EDA; there will be two 40 minute labs on using R/Python for Data Science. Please attend these labs as they prepare you for Task1 and Task2 of ProblemSet1. The next topic we will discuss in the lecture is classification!
- You currently mostly see the Website of the Fall 2022 teaching of the course; this website will be continuously updated as we move along with teaching the course in the Fall 2023 semester. However, the dates and times for the 2 course midterm exams and the final exam have already been posted!
- When looking for the most recent version of a course document, always download it from the course website! Reason: as there is no security in MS Teams, documents can be modified by any team member; consequently, documents you find in MS Teams might be outdated or modified by your classmates.
- Please post your solution files/slides for the Group Homework Credit tasks in the respective channel in MS Teams! Use your group name as the name of the file that you post.
- There have been minor extensions of deadlines: Task3 is now due Nov. 6 (2 extra days), the Group Project is due on Nov. 11 (one extra day), Task4 is due on Nov. 20 (one extra day). The deadline for Task5 remains Dec. 1! There will be no further extensions in 2023!

Group A, B and C Tasks (Group A will present during the lecture on September 13, and groups B and C will present on September 15)

Group D Task (Group D will present during the lecture on September 22)

Group E and F Tasks (both groups will present on September 29)

Group G Task (to be presented on October 13)

Group H, I and J Task (groups H and I will present on October 20, and group J will present on October 25)

Group K Task (to be presented on November 1)

Group L and M Task (both groups will present on Nov. 10)

Group N Task (to be presented on Nov. 17)

Group O Task (to be presented on Dec. 1)

For groups see: 2022 Group Homework Credit Groups

Problem Set2 (Task3, centering on Recurrent Neural Networks; individual task; Navid's Introduction to RNN)

ProblemSet3 (centering on clustering; individual task)

POIMAGIC: an Early Warning System for Streaming Spatial Events (group project October 4-November 22, 2022; Groups in 2022)