2011 Course Syllabus
Basic Course Information
Instructor: Dr. Christoph F. Eick
office hours (589 PGH): TU 2:30-3:30p and TH 11a-noon
e-mail: ceick@uh.edu
Teaching Assistant: Chun-sheng Chen
office hours (577 PGH): TU noon-1p & 2:30-3:30p
e-mail:
Link to Chun-sheng's COSC 6342 Website
class meets: TU/TH 1-2:30p
cancelled classes: Tu., April 26, 2011
Makeup class: Tu., May 3, 2011 (in 301 AH)
last class:
teaching class room: AH 301
Course Materials
Required Text:
Ethem Alpaydin, Introduction to Machine Learning, MIT Press, 2010
Recommended Texts: Tom Mitchell, "Machine Learning", McGraw-Hill, 1997.
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Course Elements and their Weights
Due to the more theoretical nature of machine learning, there will
be somewhat more emphasis on exams and on understanding the course material
presented in the lecture and textbook. However, there will be two
hands-on projects and a group project, as well as 4 graded homeworks, which
together count about 39% towards the overall grade. In 2011 the weights of
the different parts of the course are as follows:
Midterm Exam 27%
Final Exam 33%
Attendance 1%
Project1 16%
Project2 8%
Project3 9%
Homeworks 6%
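To make the arithmetic concrete, here is a minimal Python sketch that combines component scores using the weights above; the example scores are invented for illustration, not real grades:

```python
# Weighted overall grade for COSC 6342, Spring 2011.
# The weights come from the syllabus; the example scores are hypothetical.
WEIGHTS = {
    "midterm": 0.27, "final": 0.33, "attendance": 0.01,
    "project1": 0.16, "project2": 0.08, "project3": 0.09,
    "homeworks": 0.06,
}

def overall_grade(scores):
    """Combine per-component scores (0-100) into the weighted average."""
    return sum(WEIGHTS[name] * score for name, score in scores.items())

example = {"midterm": 80, "final": 85, "attendance": 100,
           "project1": 90, "project2": 75, "project3": 88, "homeworks": 95}
print(round(overall_grade(example), 2))  # -> 84.67 (hypothetical)
```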
2011 Homeworks and Projects
Graded Homework1
Graded Homework2
Graded Homework3+4
Project1: Using Machine Learning to Make Money in Horse
Race Betting (HorseRaceExample,
Project1 Discussions,
Preference Learning Tutorial,
Wolverhampton Statistics)
Project2: Group Project---Exploring
a Subfield of Machine Learning (Project2
Group Presentation Schedule).
Project3: Application and Evaluation of Temporal Difference
Learning (Project
Description, RST-World)
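Since Project3 centers on temporal difference learning, the following minimal Python sketch illustrates the tabular TD(0) update on a toy environment; the 1-D corridor used here is a stand-in assumption for illustration, not the course's RST-World:

```python
import random

# Minimal tabular TD(0) sketch for Project3-style experiments.
# Hypothetical environment: states 0..5 in a corridor, reward +1 on
# reaching state 5, episodes start at state 0, random policy.
N_STATES, GAMMA, ALPHA = 6, 0.9, 0.1
V = [0.0] * N_STATES  # state-value estimates

def step(s):
    """Random policy: move left or right; returns (next_state, reward, done)."""
    s2 = max(0, s - 1) if random.random() < 0.5 else s + 1
    if s2 == N_STATES - 1:
        return s2, 1.0, True
    return s2, 0.0, False

for _ in range(2000):
    s, done = 0, False
    while not done:
        s2, r, done = step(s)
        # TD(0) update: move V(s) toward the bootstrapped target r + gamma*V(s').
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
        s = s2

print([round(v, 2) for v in V])  # values should increase toward the goal state
```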
Tentative Course Organization Spring 2011
Topic 1: Introduction to Machine Learning
Topic 2: Supervised Learning
Topic 3: Bayesian Decision Theory (excluding Belief Networks)
Topic 5: Parametric Model Estimation
Topic 6: Dimensionality Reduction Centering on PCA
Topic 7: Clustering1: Mixture Models, K-Means and EM
Topic 8: Non-Parametric Methods Centering on kNN and density estimation
Topic 9: Clustering2: Density-based Approaches
Topic 10: Decision Trees
Topic 11: Comparing Classifiers
Topic 12: Combining Multiple Learners
Topic 13: Support Vector Machines
Topic 14: More on Kernel Methods
Topic 15: Naive Bayes' and Belief Networks
Topic 16: Applications of Machine Learning---Urban Driving, Netflix, etc.
Topic 18: Reinforcement Learning
Topic 19: Neural Networks
Topic 20: Computational Learning Theory
The topic numbers are unchanged from the 2009 offering of the course;
the main reason for this is that the names of the PowerPoint files then
remain the same.
Prerequisites
The course is mostly self-contained. However, students taking the course
should have sound software development skills and some basic knowledge of
statistics.

2011 Transparencies and Other Teaching Material
Course Organization ML Spring 2011
Topic 1: Introduction to Machine Learning (Eick/Alpaydin
Introduction, Tom Mitchell's Introduction
to ML---only slides 1-8 and 15-16 will be used)
Topic 2: Supervised Learning
(examples of classification techniques: Decision
Trees, k-NN)
Topic 3: Bayesian Decision Theory (excluding Belief Networks)
Topic 4: Using Curve Fitting as an Example to Discuss Major Issues in ML (read
Bishop Chapter 1 in conjunction with this material; not covered
in 2011)
Topic 5: Parametric Model Estimation
Topic 6: Dimensionality Reduction Centering on PCA
(PCA Tutorial, Arindam
Banerjee's More Formal Discussion of the Objectives of
Dimensionality Reduction)
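As a companion to the PCA material, here is a minimal Python/NumPy sketch of PCA via eigendecomposition of the sample covariance matrix; the data is synthetic, for illustration only:

```python
import numpy as np

# Minimal PCA sketch: center the data, eigendecompose the covariance
# matrix, and project onto the top principal directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 synthetic samples, 5 features
Xc = X - X.mean(axis=0)                  # center the data
C = np.cov(Xc, rowvar=False)             # 5x5 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh: for symmetric matrices, ascending
order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
W = eigvecs[:, order[:2]]                # keep the top-2 principal directions
Z = Xc @ W                               # project onto the 2-D subspace
print(Z.shape, eigvals[order][:2])       # (200, 2) and the top-2 variances
```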
Topic 7: Clustering1: Mixture Models, K-Means and EM
(Introduction to Clustering, Modified Alpaydin transparencies,
Top 10 Data Mining Algorithms paper)
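For Topic 7, a minimal Python/NumPy sketch of k-means (Lloyd's algorithm) on synthetic data; k, the data, and the initialization are illustrative assumptions:

```python
import numpy as np

# Minimal k-means sketch: alternate assignment and centroid-update steps
# until the centers stop moving. Data: two synthetic 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers

for _ in range(100):
    # Assignment step: each point goes to its nearest center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    # (keeping the old center if a cluster happens to be empty).
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)  # should land near (0,0) and (5,5)
```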
Topic 8: Non-Parametric Methods Centering on kNN and Density
Estimation (kNN, Non-Parametric
Density Estimation, Summary Non-Parametric
Density Estimation, Editing and
Condensing Techniques to Enhance kNN, Toussaint's survey paper on
editing, condensing and proximity graphs)
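For Topic 8, a minimal Python/NumPy sketch of a plain k-nearest-neighbor classifier (without the editing/condensing refinements discussed in the readings); the data and choice of k are illustrative:

```python
import numpy as np

# Minimal kNN sketch: majority vote among the k training points closest to x.
def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(d)[:k]               # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()  # majority class label

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> 1
```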
Topic 9: Clustering 2: Density-based Clustering
(DBSCAN paper,
DENCLUE2 paper)
Topic 10: Decision Trees
Topic 11: Comparing Classifiers
Topic 12: Ensembles: Combining Multiple Learners
for Better Accuracy
Topic 13: Support Vector Machines (Eick: Introduction
to Support Vector Machines, Alpaydin on
Support Vectors and the Use of Support Vector Machines for
Regression, PCA, and Outlier Detection (only transparencies which
carry the word "cover" will be discussed),
Smola/Schoelkopf Tutorial on Support Vector
Regression)
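For Topic 13, a minimal sketch of training a soft-margin SVM with an RBF kernel; the use of scikit-learn and the synthetic data are assumptions for illustration, not part of the course materials:

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Minimal SVM sketch: fit a soft-margin classifier with an RBF kernel
# on two synthetic blobs and inspect the resulting support vectors.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = svm.SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(len(clf.support_vectors_), clf.score(X, y))  # #support vectors, accuracy
```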
Topic 14: More on Kernel Methods (Arindam
Banerjee on Kernels,
Nuno Vasconcelos Kernel Lecture,
Bishop on Kernels; only transparencies 13-25 and
30-35 of the excellent Vasconcelos (Homepage) lecture
will be covered in 2011)
Topic 15: Naive Bayes and Belief Networks (Eick on Naive Bayes,
Eick on Belief Networks,
Bishop on Belief Networks)
Topic 16: Successful Applications of Machine Learning
Topic 18: Reinforcement Learning (Alpaydin on RL
(not used),
Eick on RL---try to understand those
transparencies; Using Reinforcement
Learning for Robot Soccer,
Kaelbling's RL Survey Article---read
sections 1, 2, 3, 4.1 and 4.2 centering on what was
discussed in the lecture)
Topic 20: Computational Learning Theory (Greiner
on PAC Learning, ...)
Review May 3, 2011 (2009 Exam2 Solution
Sketches, 2009 Exam3 Solution
Sketches)
Remark: The teaching material will be extended and possibly
corrected during the course of the semester.

Grading
Each student must have a weighted average of 74.0 or higher on the
exams of the course in order to receive a grade of "B-" or better
for the course.
Students will be responsible for material covered in the
lectures and assigned in the readings. All homeworks and
project reports are due on the date specified; no late submissions
will be accepted. This policy will be strictly enforced.
Translation of numeric scores to letter grades:
A: 100-90  A-: 90-86  B+: 86-82  B: 82-77  B-: 77-74  C+: 74-70
C: 70-66  C-: 66-62  D+: 62-58  D: 58-54  D-: 54-50  F: 50-0
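A minimal Python sketch of this mapping (the syllabus lists shared boundary values such as 90 in both adjacent ranges; treating each lower bound as inclusive is my assumption):

```python
# Map a numeric overall grade to a letter grade using the syllabus cutoffs.
CUTOFFS = [(90, "A"), (86, "A-"), (82, "B+"), (77, "B"), (74, "B-"),
           (70, "C+"), (66, "C"), (62, "C-"), (58, "D+"), (54, "D"),
           (50, "D-")]

def letter_grade(score):
    for lo, letter in CUTOFFS:
        if score >= lo:        # lower bounds treated as inclusive (assumption)
            return letter
    return "F"                 # everything below 50

print(letter_grade(84.67))  # -> "B+" (the hypothetical score computed above)
```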
Only machine-written solutions to homeworks and assignments will be
accepted (the only exceptions are figures and complex formulas).
Be aware that our only source of information is what you have turned in;
if we cannot understand your solution, you will receive a low score.
Moreover, students should not throw away returned assignments or tests.
Students may discuss course material and homeworks, but must take special
care to discern the difference between collaborating in order to increase
understanding of course materials and collaborating on the homework /
course project
itself. We encourage students to help each other understand course
material to clarify the meaning of homework problems or to discuss
problem-solving strategies, but it is not permissible for one
student to help or be helped by another student in working through
homework problems or the course project. If, in discussing course materials and problems,
students believe that their like-mindedness from such discussions could be
construed as collaboration on their assignments, students must cite each
other, briefly explaining the extent of their collaboration. Any
assistance that is not given proper citation may be considered a violation
of the Honor Code, and might result in a grade of F
in the course and in further prosecution.
Master's Thesis and Dissertation Research in Data Mining and Machine Learning
If you plan to perform a dissertation or Master's thesis project in the areas of
data mining or machine learning, I strongly recommend taking the
"Data Mining" course; moreover, I also suggest taking at least one, preferably two, of the following
courses: Pattern Classification (COSC 6343), Artificial
Intelligence (COSC 6368), or Machine Learning (COSC 6342). Furthermore, knowing
about evolutionary computing (COSC 6367) will be helpful, particularly
for designing novel data mining algorithms.
Moreover, having basic knowledge of data structures, software design, and databases is important when conducting
data mining projects; therefore, taking COSC 6320, COSC 6318, or COSC 6340 is also a good choice.
Taking a course that teaches high performance computing is also
desirable, because most data mining algorithms are very resource intensive.
Because a lot of data mining projects have to deal with images, I
suggest taking at least one of the many
biomedical image processing courses that are offered in our curriculum. Finally, having some knowledge
of the following fields is a plus: software engineering, numerical optimization techniques, statistics,
and data visualization. Also be aware that having sufficient background in the areas listed above
is a prerequisite for consideration for a thesis or dissertation project in the area of data mining.
I will not serve as your MS thesis or dissertation advisor if you do not have basic knowledge
of data mining, machine learning, statistics, and related areas. Similarly, you
will not be hired as an RA for a
data mining project without some background in data mining.

Machine Learning Resources
ICML 2011 (ICML is the #1 Machine
Learning Conference)
Carlos Guestrin's
2009 CMU Machine Learning Course
Andrew Ng's Stanford
Machine
Learning Course
Andrew Moore's Statistical Data Mining Tutorial
Christopher Bishop IET/CBS Turing Lecture
Alpaydin Textbook
Webpage