This is my first week back after my long wedding- and honeymoon-induced hiatus. As such, I knew Tuesday's practice was going to be hard; as it happens, it was unbelievably frustrating. I had almost no energy, which was hard enough, but my technique was so bad that Shifu pulled me out of the line on Lunbi Caijow. I was really looking forward to working on forms, so it was a real downer to have to go off to the side. Once over there, Heng Li worked on Fanyao with four of us, and I just couldn't get it right. It's not so much that I don't know how to do it; it's more that my damned shoes didn't fit.
If you're like me and you're not precisely one size or another, it's much better to get the slightly smaller size of Feiyues. They're canvas, you know; they'll stretch. I'm basically a 45.5, and I thought I'd try a 46 to see how it would work, which was a terrible mistake. When your shoes are too big it's impossible to get your feet in the right position if there's any twisting involved, as I've learned over the past few practices.
For Thursday's practice I went out and got some size 45 shoes, and everything was excellent. In fact, this was the first time Shifu had us walk where I was actually surprised by the order: I was just cruising along at full speed, and he stopped us twice during basics to walk. When we got to forms, I'd finally watched enough and gotten enough instruction to do Yiluchuan all the way through. I was a little nervous when Shifu came over to watch later on, since I hadn't had formal instruction on anything after the first pubu, but he watched the entire form and had no comments - I consider this a success. It looks like I'll be able to test the first two forms in October and have a full six months to work on Erlichuan. I'm very excited to get to the five kicks everyone seems to love so much.
Friday, September 7, 2007
Thursday, September 6, 2007
Formal Grammars and Natural Language Parsers
Formal Grammars
The first topic we're covering in AI is natural language processing, and the first subtopic is parsing. As background, we're working with formal grammars to define the structure of valid parse trees. As an example, a very basic formal grammar for the English language is as follows:
S -> NP VP
NP -> Noun
NP -> Noun PPList
VP -> Verb NP
VP -> Verb NP PPList
PPList -> PP
PPList -> PP PPList
PP -> Prep NP
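For reference, here's one way this rule set might be written down in Python - a sketch of my own, not something from the class. Each non-terminal maps to a list of its possible expansions; any symbol that never appears on a left-hand side (Noun, Verb, Prep) stands for a part of speech.

# A toy grammar: each non-terminal maps to a list of possible expansions.
# Symbols that never appear as keys (Noun, Verb, Prep) are parts of speech.
GRAMMAR = {
    "S":      [["NP", "VP"]],
    "NP":     [["Noun"], ["Noun", "PPList"]],
    "VP":     [["Verb", "NP"], ["Verb", "NP", "PPList"]],
    "PPList": [["PP"], ["PP", "PPList"]],
    "PP":     [["Prep", "NP"]],
}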
English is (nearly) a context-free language; the grammar specified above is a true context-free grammar. What that means in terms of the specification is that the symbols to the left of an arrow are always what are called 'non-terminals', which basically means that each of them expands into at least one other (possibly terminal) symbol. Terminal symbols are things like Noun and Verb, and generally indicate a part of speech.
The grammar above is making statements about the structure of valid English sentences. The first line says that a Sentence (S) must be composed of a Noun Phrase (NP) followed by a Verb Phrase (VP). In turn, an NP must be composed of either a Noun or a Noun followed by a PPList (which must be either a single PP (prepositional phrase) or a PP followed by another PPList). It's obvious why this is a sketchy grammar, as it treats perfectly valid English sentences such as one-word interrogative responses as invalid; a full English grammar necessarily includes repetition (*), alternation (|), and other operators, and contains many, many more rules. By eliminating those operators and restricting every rule to expand into either exactly two non-terminals or a single terminal, a context-free grammar can be rewritten into what's called Chomsky Normal Form. It's enormously easier to write a parser for a CNF grammar, at the obvious expense of a much larger and clumsier rule set.
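For instance (this particular rewrite is mine, not from the lecture), the one three-symbol rule above, VP -> Verb NP PPList, can be brought closer to that shape by introducing a helper non-terminal, so that no rule expands into more than two symbols:
VP -> Verb VP1
VP1 -> NP PPList
A true CNF conversion would also give Verb a category of its own on that first line, but the binarization is the important part.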
Parsing Strategies
There are two ways to approach a sentence if you intend to build a full parse tree. You can start at the highest level of organization (in this case the entire sentence) and break it down into parts, or you can start with an instance of the smallest level of organization (in this case a single word) and aggregate into larger categories until you've built a complete sentence. The first strategy is rather clearly called Top-Down, and the second is called Bottom-Up. Each is valid, but each is prone to a specific type of erroneous processing:
A Top-Down parser might approach the sentence "People with beards buy books" by tagging the word "People" as the NP, and the remainder of the sentence, "with beards buy books", as the VP (recall that S -> NP VP). At this point the parser will fail to match "with beards buy books" to either of the valid VP forms, realize there's been a mistake, and backtrack. In contrast, a Bottom-Up parser would know that "People" is at least the beginning of an NP, and then add legal terms to that NP until it hit one that didn't match, at which point it would start building the VP. With this example, the Bottom-Up parser would get the proper tree on the first try, with no backtracking.
Bottom-Up parsers, though, have their own habitual mode of failure. Consider the sentence "People buy books for classes in bookstores". A Bottom-Up parser might commit to "books for classes" as the object NP, then fail to fit the rest of the sentence into a valid S form, and need to backtrack.
Ideally, a parser would know what parts of speech a given word could possibly be, and make smart decisions during the parsing process using this information (the associative list that holds this knowledge is called a Lexicon). In practice, the first half of this is possible, but the second turns out to be much more difficult. What happens is that for every occurrence of every word, the parser must at least consider all possible parts of speech. This has an enormous impact on efficiency, first of all, and really only succeeds in generating a larger number of possibly-valid parse trees, which must then be winnowed in some other way. On top of that larger output size, an ambiguously worded sentence can produce many genuinely valid trees; disambiguation is another very expensive postprocessing step.
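Concretely (again my own sketch, with words hand-picked to match the example sentences above), a lexicon can be as simple as a dictionary from each word to every part of speech it might fill:

# A toy lexicon: each word maps to every part of speech it can fill.
LEXICON = {
    "people":     ["Noun"],
    "beards":     ["Noun"],
    "books":      ["Noun"],
    "classes":    ["Noun"],
    "bookstores": ["Noun"],
    "buy":        ["Verb", "Noun"],   # usually a verb, but "a good buy" is a noun
    "with":       ["Prep"],
    "for":        ["Prep"],
    "in":         ["Prep"],
}

The "buy" entry is the whole problem in miniature: the parser can't know which reading is meant until it has tried to build trees around both.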
The Recursive Descent Parser
The top-down parser that we studied as an example is called the Recursive Descent Parser. The pseudocode for the RDP is fairly short and straightforward. The summary is that it takes a Sentence, a Grammar, and a Lexicon and returns a list of valid parse Trees. It's incredibly inefficient - in the worst case it runs in exponential time. The main reason for this behavior is that the same sub-trees within a sentence get re-built in their entirety, over and over, as the parser backtracks; no work is ever saved between attempts. This is a fundamental flaw in the design of the RDP.
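To make that concrete, here's a minimal Python sketch of a recursive-descent parser - my own code, not the class pseudocode - using the GRAMMAR and LEXICON mappings sketched above. expand tries every rule for a symbol and backtracks simply by moving on to the next alternative; parse keeps only the trees that account for the entire sentence.

def expand(symbol, words, grammar, lexicon):
    """Yield (tree, remaining_words) pairs for one symbol."""
    if symbol in grammar:                      # non-terminal: try each rule in turn
        for rhs in grammar[symbol]:
            for children, rest in expand_seq(rhs, words, grammar, lexicon):
                yield (symbol, children), rest
    elif words and symbol in lexicon.get(words[0], []):
        yield (symbol, words[0]), words[1:]    # part of speech: consume one word

def expand_seq(symbols, words, grammar, lexicon):
    """Yield (list_of_trees, remaining_words) for a sequence of symbols."""
    if not symbols:
        yield [], words
        return
    for tree, rest in expand(symbols[0], words, grammar, lexicon):
        for trees, leftover in expand_seq(symbols[1:], rest, grammar, lexicon):
            yield [tree] + trees, leftover

def parse(sentence, grammar, lexicon):
    words = sentence.split()
    return [tree for tree, rest in expand("S", words, grammar, lexicon) if not rest]

Calling parse("people buy books for classes in bookstores", GRAMMAR, LEXICON) returns several distinct trees, because the grammar lets "in bookstores" attach either to "classes" or higher up in the verb phrase - exactly the ambiguity problem described above. And since nothing is ever memoized, pieces like the NP starting at "books" get re-derived from scratch every time an enclosing rule is retried, which is where the exponential behavior comes from.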
Overloaded again.
The past few months ended up being pretty extraordinary, in terms of workload. I dropped off of posting as finals started to loom, then went straight into intensive Mandarin for the summer, after which I got married. So, the long silence is explicable, if not entirely justified.
In any case, a new semester is before me, the wedding and honeymoon are behind, I'm on a new project at work, and I'm looking to get back into the posting routine. My only two formal classes right now are Artificial Intelligence and Transformation and Geometries, which is a course on basic analytical geometry in Euclidean, non-Euclidean, and various non-metric geometries. I have an Independent Study section this semester as well, in which I'm working on a relatively generic assembler framework written in Python.
I'm still training, and looking forward to testing (my first!) in about a month. More about that after practice tonight!