A Brief Overview of Parsing
-
Parsing is the assignment of structure, usually constituent or dependency structure,
to a phrase or sentence.
-
A parser is a device (or algorithm) that takes a phrase or sentence as
input, and uses a grammar (including a lexicon) to produce the syntactic
structure(s) appropriate for that phrase or sentence (often called parse trees
or just trees).
-
The type of grammar most often discussed when introducing parsing is known as a
context-free grammar.
- A CFG is a set of rules (or "productions") that specify how a syntactic
constituent can be composed of smaller constituents.
- The term "context-free" simply means that the possibilities for expanding
a consitutent don't depend on what other constituents are around it.
-
An example of a small CFG (from Jurafsky and Martin, p. 360):

-
Some remarks on these rules and the notation:
- The symbols for constituents (e.g., phrases and sentences) are called
nonterminal symbols. Those representing words are called terminal
symbols.
- Each rule has a single nonterminal symbol on the left hand side of the
arrow. This is the symbol that can be "expanded" into the symbols on the
right hand side; that is, a constituent of the type on the left hand side can
be composed of the constituents or words on the right hand side (in order!).
Nonterminals on the right hand side can then be expanded by other rules.
- The vertical stroke | is just shorthand for alternative expansions; thus:
Prep ® from | to | on
is just a way of writing these three rules:
Prep ® from
Prep ® to
Prep ® on
- The grammar "accepts" a sentence if there is a way of expanding S
(the start symbol), then expanding all the subconsitituents, and so on, until
the leaves of the tree are all the words in the sentence (which are terminal
symbols). Of course we can allow other start symbols, too; if we want to
accept noun phrases, we can treat NP as a start symbol also.
- The parser's job is to find the ways this expansion can be done, or if it
can't, to reject the sentence.
- It's worth thinking a bit about what the shorcomings of CFGs are. What
kinds of phenomena in language are they ill-suited for?
(2)
(back to part-of-speech tagging)
(return to syllabus)