Thursday, September 6, 2007

Formal Grammars and Natural Language Parsers

Formal Grammars

The first topic we're covering in AI is natural language processing, and the first subtopic is Parsing. As background, we're working with formal grammars to define the structure of valid parse trees. As an example, a very basic formal grammar for the English language is as follows:

S -> NP VP
NP -> Noun
NP -> Noun PPList
VP -> Verb NP
VP -> Verb NP PPList
PPList -> PP
PPList -> PP PPList
PP -> Prep NP
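
As a concrete sketch (this encoding is my own, not from the course), the grammar might be written in Python as a mapping from each non-terminal to its list of possible right-hand sides:

# The toy grammar as a dictionary: each non-terminal maps to a list
# of right-hand sides, and each right-hand side is a tuple of symbols.
GRAMMAR = {
    'S':      [('NP', 'VP')],
    'NP':     [('Noun',), ('Noun', 'PPList')],
    'VP':     [('Verb', 'NP'), ('Verb', 'NP', 'PPList')],
    'PPList': [('PP',), ('PP', 'PPList')],
    'PP':     [('Prep', 'NP')],
}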

English is (nearly) a context-free language; the grammar specified above is a true context-free grammar. What that means in terms of the specification is that the symbols to the left of an arrow are always what are called 'non-terminal' symbols, which basically means that they expand into at least one other (possibly terminal) symbol. Terminal symbols are things like Noun and Verb, and generally indicate a part of speech.

The grammar above is making statements about the structure of valid English sentences. The first line says that a Sentence (S) must be composed of a Noun Phrase (NP) followed by a Verb Phrase (VP). In turn, an NP must be composed of either a Noun alone or a Noun followed by a PPList (which in turn must be either a single PP (prepositional phrase) or a PP followed by another PPList). It's obvious why this is a sketchy grammar, as it rejects perfectly valid English sentences such as one-word responses to questions; a full English grammar necessarily includes repetition (*), alternation (|), and other operators, and contains many, many more rules. By eliminating those operators and restructuring the rules so that each non-terminal expands into either exactly two non-terminals or a single terminal (which also rules out unit-production cycles, in the graph-theoretic sense), a context-free grammar can be put into what's called Chomsky Normal Form. It's enormously easier to write an efficient parser for a CNF grammar, at the obvious expense of the grammar's readability and, for a small grammar like this one, its coverage of real English.
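
To make that concrete, here's a sketch of what converting the grammar above involves (treating part-of-speech symbols like Verb as expanding directly to words). The one three-symbol rule gets binarized with a helper non-terminal:

VP -> Verb NP PPList

becomes

VP -> Verb VP2
VP2 -> NP PPList

and the unit rule PPList -> PP is removed by substituting PP's expansion, giving PPList -> Prep NP. The helper symbol VP2 is just a name I made up for illustration; any fresh symbol works.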


Parsing Strategies

There are two ways to approach a corpus if you intend to build a full parse tree. You can start at the highest level of organization (in this case the entire sentence) and break it down into parts, or you can start with an instance of the smallest level of organization (in this case a single word) and aggregate into larger categories until you've built a complete sentence. The first strategy is rather clearly called Top-Down, and the second is called Bottom-Up. Each is valid, but is prone to a specific type of erroneous processing:

A Top-Down parser might approach the sentence "People with beards buy books" by tagging the word "People" as the NP and the remainder of the sentence, "with beards buy books", as the VP (recall that S -> NP VP). At this point the parser will fail to match "with beards buy books" to either of the valid VP forms, realize there's been a mistake, and backtrack. In contrast, a Bottom-Up parser would know that "People" is at least the beginning of an NP, and would add legal words to that NP until it hit one which didn't fit, at which point it would start building the VP. With this example, the Bottom-Up parser would get the proper tree on the first try, with no backtracking.
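
Sketched as a trace against the toy grammar above, the top-down attempt looks like this:

S  -> NP VP
NP -> Noun                    matches "People"
VP -> Verb NP                 fails: "with" is not a Verb
VP -> Verb NP PPList          fails for the same reason
backtrack: NP -> Noun PPList  matches "People with beards"
VP -> Verb NP                 matches "buy books": success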

Bottom-Up parsers, though, have their own habitual mode of failure. Consider the sentence "People buy books for classes in bookstores". A Bottom-Up parser might commit to "books for classes" as the object NP and close off the VP as "buy books for classes", leaving "in bookstores" with no place to attach; unable to fit the leftover words into a valid S form, it needs to backtrack.
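
One way that failure might play out (again, a sketch):

"books"              reduced to NP
"for classes"        reduced to PP and attached: NP covers "books for classes"
"buy" + NP           reduced to VP covering "buy books for classes"
"People" + VP        reduced to S covering "People buy books for classes"
"in bookstores"      left over; nothing can attach it to a finished S,
                     so the parser backtracks and rebuilds the VP
                     (for instance as Verb NP PPList)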

Ideally, a parser would know what parts of speech a given word could possibly be (this associative list is called a Lexicon), and make smart decisions during the parsing process using that information. In practice, the first half of this is possible, but the second turns out to be much more difficult: for every occurrence of every word, the parser must at least consider all of its possible parts of speech. This has an enormous impact on efficiency, and it mostly succeeds in generating a larger number of possibly-valid parse trees, which must then be winnowed in some other way. On top of this larger output size, an ambiguously worded sentence can produce many valid trees in its own right; disambiguation is another very expensive postprocessing step.
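
A minimal lexicon for the example sentences might look like this (a sketch; real lexicons are far larger and usually built from tagged corpora):

# Each word maps to the list of parts of speech it can fill.
LEXICON = {
    'people':     ['Noun'],
    'with':       ['Prep'],
    'beards':     ['Noun'],
    'buy':        ['Verb'],
    'books':      ['Noun', 'Verb'],  # 'books' is also a verb: "she books a flight"
    'for':        ['Prep'],
    'classes':    ['Noun'],
    'in':         ['Prep'],
    'bookstores': ['Noun'],
}

The entry for 'books' shows the problem exactly: every parse has to consider both readings, even though only one can survive in any given tree.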

The Recursive Descent Parser

The top-down parser that we studied as an example is called the Recursive Descent Parser. The pseudocode for the RDP is fairly short and straightforward: it takes a Sentence, a Grammar, and a Lexicon, and returns a list of valid parse Trees. It's also incredibly inefficient; in fact it runs in exponential time in the worst case. The main reason for this behavior is that the same sub-trees (the same constituents over the same spans of words) are re-evaluated in their entirety every time the parser backtracks over them. This redundant re-parsing is a fundamental flaw in the design of the RDP.
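
Here is a compact sketch of such a parser in Python, assuming the GRAMMAR and LEXICON dictionaries from the earlier sketches are in scope. It returns every tree that spans the whole sentence, and it exhibits exactly the redundant re-parsing described above:

# Recursive descent with backtracking: return every (tree, next_pos)
# pair for each way `symbol` can match `words` starting at `start`.
def parse(symbol, words, start):
    # Part-of-speech symbols are looked up directly in the lexicon.
    if symbol in ('Noun', 'Verb', 'Prep'):
        if start < len(words) and symbol in LEXICON.get(words[start], []):
            return [((symbol, words[start]), start + 1)]
        return []
    # Otherwise try every right-hand side for this non-terminal,
    # expanding its symbols left to right.
    results = []
    for rhs in GRAMMAR[symbol]:
        partials = [([], start)]  # (children so far, position reached)
        for child in rhs:
            partials = [(kids + [tree], after)
                        for kids, pos in partials
                        for tree, after in parse(child, words, pos)]
        results.extend(((symbol, kids), pos) for kids, pos in partials)
    return results

def parse_sentence(sentence):
    words = sentence.lower().split()
    # Keep only the parses that consume the entire input.
    return [tree for tree, pos in parse('S', words, 0) if pos == len(words)]

print(parse_sentence("People with beards buy books"))

Note that a call like parse('NP', words, 2) may be recomputed from scratch many times during backtracking; memoizing parse on the pair (symbol, start), which is essentially what a chart parser does, removes exactly that redundancy.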
