**Overview**

This is an introductory natural language processing (NLP) course. The course is intended for developing foundations in NLP and text mining. The broader goal is to understand how NLP tasks are carried out in the real world (e.g., on the Web) and how to build tools for solving practical text mining problems. Throughout the course, emphasis will be placed on understanding NLP concepts and tying NLP techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in statistical machine learning and touches upon topics in sentiment analysis and psycholinguistics.

**Administrative details**

Flyer

Syllabus

Instructor office hours: MW 2:30 - 3:30 PM, PGH 582

**Prerequisites**

The course requires a basic background in mathematics and sufficient programming skills. It will be helpful if you have taken and done well in one or more equivalent courses or topics such as Algorithms, Artificial Intelligence, or Numerical Methods, or if you have some background in probability/statistics. The course, however, reviews and covers the required mathematical and statistical foundations. Sufficient experience building projects in a high-level programming language (e.g., Java) will prove beneficial.

**Note:** This course has minor overlap with Data Mining (COSC 6335) and Machine Learning (COSC 6342), especially in some supervised learning topics, which lay the foundation for other NLP algorithms covered in this course. These topics are covered here to make the course standalone. Having taken either of those courses is helpful but not required.

**Reading Materials**

**Textbooks:**

SI: Statistical Inference, Casella and Berger. Cengage Learning; 2nd edition.

FSNLP: Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze. MIT Press. Cambridge, MA: May 1999. Companion website for book.

WDM: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Bing Liu; Springer, 1st Edition.

**Required reference materials:**

Online resources (OR) for each topic, as listed in the schedule below.

Lecture notes

**Course materials (slides, lecture notes, etc.)** (You may use 7zip to unpack)

**NLP tools (POS Tagger, Chunker, Naive Bayes, etc.) and templates with linked libraries for research project**

**Assignment Due Dates and Grading**

| Component | Contribution | Due date |
|---|---|---|
| HW1 | 10% | 09/03 |
| HW2 | 10% | 09/17 |
| HW3 | 10% | 10/06 |
| HW4 | 10% | 10/22 |
| HW5 | 10% | 11/10 |
| Mini Project | 25% | 09/29 |
| Optional Research Project | 50% | 12/08 |
| Final | 20% | 12/05, 1-4 PM, SW 221 |
| Class contribution | 5% | - |

**Schedule of topics**

Topic(s) | Resources: Readings, Slides, Lecture notes, Papers, Pointers to useful materials, etc. |

Introduction
Course administrivia, plan, goals, NLP resources; Language as a probabilistic phenomenon, Zipf's law; Word collocations and text retrieval |
Required readings:
Lecture notes/slides; Chapter 1 FSNLP (Sections 1.2.3, 1.4, 1.4.1, 1.4.2, 1.4.3, 1.4.4); Boolean retrieval slides by H. Schütze; Boolean retrieval [Manning et al., 2008] (up to Section 1.4) |

Statistical foundations I: Basics
Probability theory; Conditional probability and independence |
Required Readings:
Lecture notes/slides; Chapter 2 FSNLP (Sections 2.1.1 - 2.1.10); Chapter 1 SI (Full reading recommended. Focus on topics covered in class and solved examples); OR01: X. Zhu's notes on mathematical background for NLP |

Statistical foundations II: Random variables and Distributions
Random variables, density and mass functions; Mean, Variance; Common families of distributions; Multiple random variables: joints and marginals |
Required Readings:
Lecture notes/slides; Chapter 2 SI (Theorem 2.1.10, 2.2, 2.2.1, 2.2.2, 2.2.3, 2.2.5, 2.3.1, 2.3.2, 2.3.4, and topics covered in class); Chapter 3 SI (all sections + worked-out examples up to 3.4), only distributions that were covered in class; Chapter 4 SI (4.1, 4.1.1, 4.1.2, 4.1.3, 4.1.4, 4.1.5, 4.1.6, 4.1.10, 4.1.11, 4.1.12, 4.2.1, 4.2.2, 4.2.3, 4.2.4, 4.2.5); OR02: K. Zhang's notes on common families of distributions with worked-out examples [skip the hypergeometric, negative binomial, and geometric distributions and read only those covered in class].
Optional recommended reading/solved examples:
OR03: Notes on joints, marginals, worked-out examples by S. Fan; OR04: Tutorial on joints and marginals by M. Osborne [contains NLP-specific examples] |

Words
Collocations; Hypothesis testing, statistical tests, p-values; N-gram language models |
Required Reading(s):
Chapter 5 FSNLP (5, 5.1, 5.3, 5.3.1, 5.3.3), Chapter 6 FSNLP (up to 6.2.2); OR07: J. Zhu's notes on t-test; OR08: Lecture notes/slides on N-gram language models by Y. Choi: (1), (2); OR09: Recommended readings: (0) Chapter 6 of [Jurafsky and Martin], (1) Language modeling notes by M. Collins, (2) Lecture notes by K. McKeown
t-table, Chi-square table |

Markov models and POS tagging
Hidden Markov model (HMM); Part-of-speech tagging |
Required Readings:
Chapter 9 FSNLP (up to 9.4), Chapter 10 FSNLP (up to 10.2.2); OR10: (1) Lecture notes/slides by M. Marszalek on "A Tutorial on Hidden Markov Models by Lawrence R. Rabiner", (2) Toy problems by E. Lussier, (3) POS Tagging by Y. Choi
Programming resources, tools, libraries for projects and homeworks:
(1) HMMs and sequence taggers: JAHMM: Implementation of an HMM in Java, Mallet, SVMHMM, CRF++; (2) POS Taggers: OpenNLP, Stanford Parser (Online version), Illinois Chunker, POS Tagger |

Grammar and Parsing
Shallow Parsing and Phrase Chunking; Context Free Grammars (CFGs); Top-down and Bottom-up parsing; Probabilistic Context Free Grammars (PCFGs); Statistical Parsing and PCFGs |
Required Readings:
Lecture notes/slides; Refer to Chapters 9, 10, and 12 of this book for select topics covered in class
Programming resources, tools, libraries for projects and homeworks:
Chunkers, shallow and full parsers: (1) OpenNLP, (2) Stanford Parser (Online version), (3) Illinois Chunker, POS Tagger |

Text categorization
Decision trees; Naive Bayes; Support Vector Machines; Evaluation metrics; Significance testing (revisited); Feature selection schemes |
Required Readings:
Chapter 3 WDM (3.1, 3.2, 3.6, 3.8, 3.3); F. Keller's tutorial on Naive Bayes + notes of A. Moore for graph view (Slide 8)
Programming resources, tools, libraries for projects and homeworks:
SVMLight, LIBLINEAR (implements L1/L2 regularized classification and regression using SVM, SVR, LR with support for L1/L2 loss), Mallet, LingPipe, C. Borgelt's DM Tools |

Sentiment Analysis and Psycholinguistics
Aspect extraction; Deception and opinion spam |
Required Readings:
Lecture notes + slides + selected topics (covered in lectures) from Chapter 11, WDM; Paper on opinion spam: [Ott et al., 2011], slides, demo
Programming resources, tools, libraries for projects and homeworks:
Pos/Neg Sentiment Lexicon, SentiWordNet, Deep learning for sentiment analysis |