Course goals
"Every time I fire a linguist, the performance of our speech recognition system goes up." - Fred Jelinek.
Although Jelinek's quote may be an extreme position, it is no exaggeration to say that anyone wishing to work in natural language processing (NLP) must have some understanding of the statistical methods in common practice. This course will introduce you to the fundamentals of statistical NLP. Statistical NLP builds on ideas from many fields, including linguistics, probability theory, information theory, programming, and computer science. We will see how these fields provide us with tools for part-of-speech (POS) tagging, parsing, word sense disambiguation, machine translation, and information retrieval.
The course is heavily data-driven: students will work with large corpora and learn how to handle data at that scale. Applying statistical techniques to large corpora will also allow us to examine collocations and n-grams, along with techniques for categorizing text.
We will be focusing on statistical methods in the context of particular tasks, e.g., parsing. However, all of the methods we will use are applicable to a range of tasks in NLP, and thus this course provides an essential platform for finding one's way in the field of NLP.
Instructor: Markus Dickinson
Office: Intercultural Center (ICC) 452
Phone: 687-5753
E-mail: mad87 AT georgetown DOT edu
Office hours (at least for the first week):
| M | 3:00-4:00pm |
| R | 3:00-4:00pm |
or by appointment
Meeting time: R, 4:15-6:45pm
Classroom: Reiss Science Building (REI) 282
Course website: http://www9.georgetown.edu/faculty/mad87/06/420/
Course notes will be posted to this website.
Credits: 3
Course prerequisites:
Introduction to NLP (Ling 362) or permission of instructor. Some programming experience is expected.
| Component | Weight | Notes |
| Participation | 10% | |
| Assignments | 70% | 10 assignments at 7% each |
| Final assignment | 20% | |
I rely on the Academic Resource Center for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted the Academic Resource Center are encouraged to do so (202-687-8354; http://ldss.georgetown.edu/index.html).
I'm including both the main slides (.pdf) and the more printer-friendly versions (3x3.pdf). If you want a different size (e.g., 2x3), let me know.
| Month | Week | Day | Date | Topic | Reading | Assignments |
| Jan. | 1 | R | 12 | Intro to class (.pdf, 3x3.pdf) | M&S, ch. 1 | |
| | 2 | R | 19 | Probability Theory (.pdf, 3x3.pdf) | M&S, ch. 2.1 | HW1 due |
| | | | | Programming (1): Python (.pdf, 3x3.pdf) | | |
| | 3 | R | 26 | Collocations (.pdf, 3x3.pdf) | M&S, ch. 5 | HW2 due |
| Feb. | 4 | R | 2 | Information Theory (.pdf, 3x3.pdf) | M&S, ch. 2.2 | HW3 due |
| | | | | Programming (2): UNIX (.pdf, 3x3.pdf) | | |
| | 5 | R | 9 | POS tagging (.pdf, 3x3.pdf) | M&S, ch. 9, 10 | HW4 due |
| | 6 | R | 16 | Corpora and Linguistic Annotation (.pdf, 3x3.pdf) | M&S, ch. 3, 4 | HW5 due |
| | | | | Programming (3): NLTK (.pdf, 3x3.pdf) | | |
| | 7 | R | 23 | Probabilistic Context-Free Grammars (.pdf, 3x3.pdf) | M&S, ch. 11 | HW6 due |
| | | | | Programming (4): XML (.pdf, 3x3.pdf) | | |
| Mar. | 8 | R | 2 | Probabilistic Parsing (.pdf, 3x3.pdf) | M&S, ch. 12 | HW7 due |
| | 9 | R | 9 | NO CLASS, SPRING BREAK | | |
| | 10 | R | 16 | Lexicalized Parsing | | |
| | | | | Programming (5): Practicum | | |
| | 11 | R | 23 | Text categorization (.pdf, 3x3.pdf) | M&S, ch. 16 | HW8 due |
| | | | | PP attachment (.pdf, 3x3.pdf) | M&S, ch. 8 | |
| | 12 | R | 30 | Word Sense Disambiguation (.pdf, 3x3.pdf) | M&S, ch. 7 | |
| Apr. | 13 | R | 6 | Programming (6): Practicum 2 | | HW9 due |
| | 14 | R | 13 | NO CLASS, EASTER BREAK | | |
| | 15 | R | 20 | Statistical Machine Translation (.pdf, 3x3.pdf) | M&S, ch. 13 | |
| | 16 | R | 27 | General techniques (.pdf, 3x3.pdf) | | HW10 due |
| May | 17 | R | 11 | | | Final HW due |
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)
The translation was initiated by Markus Dickinson on 2005-10-26