Natural Language Processing

COSC 4397 Natural Language Processing

Summer 2025

Asynchronous Online (via Microsoft Teams)

Instructor: Arjun Mukherjee

Overview

This is an introductory natural language processing course (NLP). The course in intended for developing foundations in NLP and text mining. The broader goal is to understand how NLP tasks are carried out in the real world (e.g., Web) and how to build tools for solving practical NLP and text mining problems. Throughout the course, emphasis will be placed on understanding NLP concepts and tying NLP techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in statistics and important topics in NLP such as embeddings, semantics, part of speech tagging, parsing, infomration retrieval, sentiment analysis and psycholinguistics

Administrative details

Syllabus
Instructor office hours: Online via Teams
TA Yang Lu (ylu17@Central.UH.EDU) office hours: Wed, Fri 11-12 Teams
TA Zhiming Zhao (zzhao41@cougarnet.uh.edu) office hours: Mon 10-12 PGH 376
Prerequisites

The course requires basic background in mathematics and sufficient programming skills. If you have taken and did well in one or more of the equivalent courses/topics such as Data Structures, Algorithms, Artificial Intelligence, Numerical methods, Data Science or have some background in probability/statistics, it will be helpful. The course however reviews and covers required mathematical and statistical foundations. Sufficient experience for building projects in a high level programming language (e.g., C++, Python, Java) is required.

Reading Materials

Textbooks:
Natural Langugae Processing in Python, NLTK: This is an Open Access book. An excellent resouce for self directed learning and porgraming assignments.

SI: Statistical Inference, Casella and Berger. Cengage Learning; 2nd edition.

FSNLP: Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schtuetze. MIT Press. Cambridge, MA: May 1999. Companion website for book.

WDM: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Bing Liu; Springer, 1st Edition.

SLP: Speech and Language Processing, Jurfsky and Martin, 2024 Edition.

IR: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

Required reference materials:

Online resources (OR) per topic as appearing in the schedule below.
Lecture notes

Course materials (slides, lecture notes, etc.) (You may use 7zip to unpack)

Course materials (slides, lecture notes, etc.) .zip version

NLP tools (POS Tagger, Chunker, Naive Bayes, etc.) and templates with linked libraries for research project

Assignment Due Dates and Grading

Component Contribution Due date

HW1 5% 6/20

HW2 10% 6/29

HW3 7% 7/12

HW4 8%

Project 40% 8/1

Exam 1 30%

Rules and policies

Late Assignments: Late assignments will not, in general, be accepted. They will never be accepted if the student has not made special arrangements with me at least one day before the assignment is due. It also needs to be a justifiable reason owing to exacting circumstances. If a late assignment is accepted it is subject to a reduction in score as a late penalty.
Cheating: All submitted work (code, homeworks, exams, etc.) must be your own. If evidence of code sharing is found, you will receive an F grade in the course. Please refer to the student handbook for details on academic honesty.
Statute of limitations: Grading questions or complaints, will in general not be attended to beyond one week after the item in question has been returned.

Schedule of topics

Please note that the following is a list of tentative topics. During the course, and interleaved between lectures, time will be invested in review questions, homeworks, discussion of novel ideas, project updates, and concept review before exams.

Topic(s)	Resources: Readings, Slides, Lecture notes, Papers, Pointers to useful materials, etc.
Introduction Course administrivia, plan, goals, NLP Resources Language as a probabilistic phenomenon, Zipf's law Word collocations and text retrival	Required readings: Lecture notes/slides Chapter 1 FSNLP (Sections 1.2.3, 1.4, 1.4.1, 1.4.2, 1.4.3, 1.4.4) Boolean retrieval slides by H.Schutze Boolean retrieval [Manning et al., 2008] (upto section 1.4)
Statistical foundations I:Basics Probability theory Conditional probability and independence	Required Readings: Lecture notes/slides Chapter 2 FSNLP (Section 2.1.1 - 2.1.10) Chapter 1 SI (Full reading recommended. Focus on topics covered in class and solved examples) OR01: X.Zhu's notes on mathematical background for NLP
Statistical foundations II: Random varibales and Distributions Random variables, density and mass fuctions Mean, Variance Common families of distributions Multiple random variables: joints and marginals	Required Readings: Lecture notes/slides Chapter 2 SI (Theorem 2.1.10, 2.2, 2.2.1, 2.2.2, 2.2.3, 2.2.5, 2.3.1, 2.3.2, 2.3.4, and topics covered in class). Chapter 3 SI (All sections + worked out examples upto 3.4), only distributions that were covered in class. Chapter 4 SI (4.1, 4.1.1, 4.1.2, 4.1.3, 4.1.4, 4.1.5, 4.1.6, 4.1.10, 4.1.11, 4.1.12, 4.2.1, 4.2.2, 4.2.3, 4.2.4, 4.2.5). OR02: K.Zhang's notes on common families of distribution with worked out examples [Skip hyupergeometric, neg-binomial, geometric distributions and read only those covered in class]. Optional Recommended reading/solved examples: OR03: Notes on Joint, marginals, worked out examples by S.Fan OR04: Tutorial on joints and marginals by M.Osborne [Contains NLP specific examples]
Words Collocations Hypothesis testing, statistical tests, p-values N-gram language models	Required Reading(s): Chapter 5 FSNLP (5, 5.1, 5.3, 5.3.1, 5.3.3), Chapter 6 FSNLP (upto 6.2.2). OR07: J.Zhu's notes on t-test OR08: Lecture notes/slides on N-gram langauge models by Y. Choi: (1), (2) OR09: Recommended readings: (0) Chapter 6 of [Jurafsky and Martin] (1) Langugae modeling notes by M.Collins (2) Lecture notes by K.Mckeown t-table, Chi-square table
Markov models and POS tagging Hidden markov model (HMM) Part of speech tagging	Required Readings: Chapter 9 FSNLP (upto 9.4), Chapter 10 FSNLP (upto 10.2.2) OR10: (1) Lecture notes/slides by M. Marszalek on "A Tutorial on Hidden Markov Models by Lawrence R. Rabiner", (2) Toy problems by E.Lussier, (3) POS Tagging by Y.Choi Programming resources, tools, libraries for projects and homeworks: (1) HMMs and sequence taggers JAHMM: Implementation of an HMM in Java, Mallet, SVMHMM, CRF++ (2) POS Taggers OpenNLP, Stanford Parser, (Online version), Illinois Chunker, POS Tagger
Grammar and Parsing Shallow Parsing and Phrase Chunking Context Free Grammars (CFGs) Top-down and Bottom-up parsing Probabilistic Context Free Grammars (PCFGs) Statistical Parsing and PCFGs	Required Readings: Lecture notes/slides Refer to Chapter 9, 10 and 12 of this book for select topics covered in class Programming resources, tools, libraries for projects and homeworks: Chunkers, shallow and full parsers: (1)OpenNLP, (2) Stanford Parser, (Online version), (3) Illinois Chunker, POS Tagger
Text Clustering and Topic Models Hierarchical Models Sampling from Distributions Topic Models	Required Readings: Lecture notes/slides/Programming resources Derivation and Java implementation by G. Heinrich Topic Models using Gensim library Topic Models using NLTK library Topic Models using Scikit Learn library
Text categorization Decision trees Naive Bayes Support Vector Machines Evaluation metrics Significance testing (revisited) Feature selection schemes	Required Readings: Chapter 3 WDM (3.1, 3.2, 3.6, 3.8, 3.3) F. Keller's tutorial on Naiye Bayes + notes of A.Moore for graph view (Slide 8) Programming resources, tools, libraries for projects and homeworks: SVMLight LIBLINEAR (Implements L1/L2 regularized classification and regression using SVM, SVR, LR with support for L1/L2 loss), Mallet, LingPipe, C. Borgelt's DM Tools
Advanced Topics: Neural Text Models Logistic Regression Word embeddings: word2vec , Glove Transformer Models: BERT	Required Readings: Lecture notes/slides Word2Vec demo using Gensim Word2Vec demo using Keras Glove using Keras BERT demo using huggingface
Sentiment Analysis and Psycholinguistics Aspect extraction Deception and opinion spam	Required Readings: Lecture notes + slides + selected topics (covered in lectures) from Chapter 11, WDM Paper on opinon spam: [Ott et al., 2011], slides, demo Programming resources, tools, libraries for projects and homeworks: Pos/Neg Sentiment Lexicon, SentiWordNet, Deep learning for senitment analysis

Component	Contribution	Due date
HW1	5%	6/20
HW2	10%	6/29
HW3	7%	7/12
HW4	8%
Project	40%	8/1
Exam 1	30%