
Four Monkeys and a Bag of Bananas: Milestone 1



Milestone 1: Parser
By: Jared Parsons
Group: Four Monkeys and a Bag of Bananas
========================================================================
The goal of this case is to explain an alternate solution for an
NFA parser. Most of the NFA parsers that I saw for this project used
heavy recursion, while mine relies on a doubly linked list and several
stacks/queues. The actual parser itself was extremely "busy", but it
worked virtually without modification from the beginning of the project
to the end.
First, for future students, I need to explain the end goal of our
parsers. We were given the task of building a parser that would match a
natural language with variables: basically, a set of commands for our
soon-to-be-built adventure game. For instance:

My !animal

is an expression in which !animal is a variable that matches one word
in its place. So it would match:

My dog
My cat
My fish

The other types of variables were words preceded by the
following symbols.

! - match one word
? - match one or zero words
& - match one or more words
# - match zero or more words
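These four types boil down to a minimum and maximum number of words each
variable may absorb. A minimal Python sketch of that table (hypothetical
helper names; the project itself was written in Squeak Smalltalk):

```python
# Minimum and maximum number of words each variable prefix may match;
# None means "unbounded". Hypothetical helper, for illustration only.
VARIABLE_ARITY = {
    "!": (1, 1),     # match one word
    "?": (0, 1),     # match one or zero words
    "&": (1, None),  # match one or more words
    "#": (0, None),  # match zero or more words
}

def arity(definition):
    """(min, max) word count for a token definition; a plain word is a
    constant that matches exactly one word: itself."""
    if definition and definition[0] in VARIABLE_ARITY:
        return VARIABLE_ARITY[definition[0]]
    return (1, 1)

print(arity("!animal"))  # (1, 1)
print(arity("#name"))    # (0, None)
print(arity("dog"))      # (1, 1)
```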

Without the introduction of & and #, this could be solved with a
fairly simple parser. However, those two symbols allow strings like
the following to be created.

My &animal and fred

The matches for this are what make it so difficult to parse.
Take for example the string "My and dog and fred". This would match the
above expression, but it is incredibly hard to parse with a DFA, mainly
because you're not sure where the "and" goes. This becomes even more
complicated with the addition of #'s.

My &animal #name and fred

Now "My and dog and fred" still matches, but it can match two
ways: one where &animal matches "and" and #name matches "dog", and one
where &animal matches "and dog" and #name matches nothing. Another very
difficult one to parse is

#Jack and #Jill and #Henry

Since we were required to store and return the values of the
variables, not only did our parsers have to match the incredibly hard
cases, but they also had to do it consistently. The resolution of
grammar ambiguity was left up to us. The two expressions above were the
ones I set out to solve with my parser.
I struggled for a while on how to design my parser, mainly because
I had no concept of how to build an NFA parser. I hadn't taken theory,
and parsers weren't my best subject. However, I did understand that an
NFA parser considers every possibility when examining an expression, so
I designed my parser to do that. After much toil and struggle I ended
up with the following idea.
The overall design of the parser is very simple. First I am
going to give a brief overview of the classes that I used.
I created two token classes to represent the two types of tokens
that you could have (variable and constant tokens). Both classes
understood how to accept a value and how to know when they were correctly
evaluated. VariableTokens did this based on the first character of the
word they were passed as their definition. VariableTokens had the added
functionality to hold any number of values and store them in a linked
list. This linked list could be accessed as a stack or a queue, which was
very necessary for the matching part of this parser.
Both of these tokens inherited from a superclass called
GenericToken, which held the doubly linked list functionality and the
prototypes for what both should understand. I realize now that prototypes
don't have any functional significance, since you can send any class
a message, but they were leftovers from Java programming.
TokenList was a class that held a doubly linked list of tokens.
This class was mainly used as a wrapper for the doubly linked list
functionality. Later it was also used to calculate the power of an
expression. The Rule class was mainly used to wrap the TokenList so
that it could store a value to be put through the TokenList.
WordMatcher held a set of Rules to be evaluated and would return the
one that best matched an input string.
The transformation from a string containing an expression to an
actual Rule was very simple. Simply break the string up on whitespace.
Then, based on the first character of each resulting token, you create
either a RegularToken or a VariableToken. As you create the tokens, you
link them into a doubly linked list based on their order in the string.
So the expression "My !dog is !color" gets transformed into...


 -----     ######     -----     ########
 | my| <-> |!dog| <-> | is| <-> |!color|
 -----     ######     -----     ########


Boxes enclosed in #'s represent VariableTokens, and boxes wrapped
in -'s represent RegularTokens. This will be true for the rest of the
paper. Now on to how these expressions actually get parsed.
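That transformation can be sketched as code. A hypothetical Python
version of building the linked token list (the project itself was
Squeak Smalltalk; class names follow the ones described above):

```python
# Hypothetical sketch: turn an expression string into a doubly linked
# token list, choosing the class by the word's first character.
class GenericToken:
    def __init__(self, definition):
        self.definition = definition
        self.prev = self.next = None

class RegularToken(GenericToken): pass
class VariableToken(GenericToken): pass

def build_token_list(expression):
    """Split on whitespace and link the tokens in order."""
    head = tail = None
    for word in expression.split():
        cls = VariableToken if word[0] in "!?&#" else RegularToken
        tok = cls(word)
        if tail:
            tail.next, tok.prev = tok, tail
        else:
            head = tok
        tail = tok
    return head

head = build_token_list("My !dog is !color")
tok = head
while tok:
    print(type(tok).__name__, tok.definition)
    tok = tok.next
# RegularToken My
# VariableToken !dog
# RegularToken is
# VariableToken !color
```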
Before we move on to how to actually parse this, I need to define
a couple of terms.

controlling token - the VariableToken that is currently being
used to control the flow of tokens.
bridge - when all RegularTokens between two VariableTokens are
correctly matched.
shift - moves a value from the controlling token to the next
available acceptor.
<-> - shows a doubly linked list connection.

Shifting is somewhat complicated, so I will go into a bit more
detail here. To shift, you take the last value off of the controlling
token. Then you move one token to the right. If that token is a
VariableToken, you simply add the value to the very front of that token,
and that's it. If the token is a RegularToken, then there are several
cases. If the token already has a value on it, you take the current
value you are shifting, place it back on the controlling token, and the
value in the RegularToken you are looking at becomes the value you are
shifting. If the RegularToken is empty and the value you are shifting
correctly matches it, then you place the value on the token. If the
token is empty and the value does not match, you move on to the next
token in the list. Once again, this is more easily explained with the
visual examples below.
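As a sketch, here is the shift procedure in Python, with each token's
values held in a plain list (a hypothetical rendering of the cases just
described; the original was Squeak Smalltalk):

```python
class Token:
    """Minimal stand-in for the token classes described in the text."""
    def __init__(self, definition):
        self.definition = definition
        self.values = []               # words this token currently holds
        self.prev = self.next = None

class RegularToken(Token): pass
class VariableToken(Token): pass

def shift(controlling):
    """Move the last value off the controlling token rightward to the
    next token that will accept it; True if the value found a home."""
    if not controlling.values:
        return False
    value = controlling.values.pop()          # take the last value
    tok = controlling.next
    while tok is not None:
        if isinstance(tok, VariableToken):
            tok.values.insert(0, value)       # front of the next variable
            return True
        if tok.values:                        # occupied RegularToken:
            controlling.values.append(value)  #   put the value back and
            value = tok.values.pop()          #   shift its value instead
        elif tok.definition == value:         # empty and it matches: place it
            tok.values.append(value)
            return True
        tok = tok.next                        # empty, no match: keep moving
    controlling.values.append(value)          # nothing accepted it: undo
    return False

# |&dog| <-> |and| <-> |cat| with "big dog and cat" piled on &dog:
dog, and_, cat = VariableToken("&dog"), RegularToken("and"), RegularToken("cat")
dog.next, and_.prev = and_, dog
and_.next, cat.prev = cat, and_
dog.values = ["big", "dog", "and", "cat"]
shift(dog)   # "cat" passes the empty |and| box and lands on |cat|
shift(dog)   # "and" fills |and|
print(dog.values, and_.values, cat.values)   # ['big', 'dog'] ['and'] ['cat']
```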
There are ten rules my parser runs by. Some seem complicated,
but I will go over them when we work through some examples.

1) Controlling tokens can only be VariableTokens.
2) The first VariableToken is the first controlling token.
3) Controlling tokens will shift values until a valid bridge is formed
or until there are no tokens left to shift.
4) A controlling token must try to shift at least once.
5) A bridge is valid when there is a bridge to the next VariableToken
and the controlling token correctly evaluates its contents.
6) If a controlling token cannot shift and there is not a valid bridge
forward, then control goes to the VariableToken preceding the
current controlling token.
7) If a controlling token forms a valid bridge and has attempted to
shift, then control passes to the VariableToken at the end of the
bridge.
8) If the first shift attempt fails, then control must be passed
backwards in the list regardless of whether or not there is a valid
bridge.
9) If all tokens in the list ever evaluate to true, then the sentence
is matched.
10) If control should ever pass backwards to the previous VariableToken
and the controlling token is the first VariableToken in the expression,
then the sentence is invalid.

These rules probably seem extraordinarily complicated and odd, but
the actual parsing algorithm for this is less than 75 lines of code.
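To show how the rules might fit together, here is a hypothetical Python
reconstruction of the whole cycle (token classes, initial placement,
shifting, and the control loop), with rule numbers noted in comments.
This is only a reading of the rules as stated above, not the original
Smalltalk code, which differed in detail:

```python
class Token:
    def __init__(self, definition):
        self.definition = definition
        self.values = []                        # matched words, front to back
        self.prev = self.next = None

class RegularToken(Token):
    def satisfied(self):
        return self.values == [self.definition]

class VariableToken(Token):
    def satisfied(self):
        n = len(self.values)
        return {"!": n == 1, "?": n <= 1,
                "&": n >= 1, "#": True}[self.definition[0]]

def build(expression):
    """Link the expression's words into a doubly linked token list."""
    head = tail = None
    for word in expression.split():
        tok = (VariableToken if word[0] in "!?&#" else RegularToken)(word)
        if tail:
            tail.next, tok.prev = tok, tail
        else:
            head = tok
        tail = tok
    return head

def tokens(head):
    while head:
        yield head
        head = head.next

def shift(ctl):
    """Move the last value off the controlling token to the next acceptor."""
    if not ctl.values:
        return False
    value = ctl.values.pop()
    tok = ctl.next
    while tok is not None:
        if isinstance(tok, VariableToken):
            tok.values.insert(0, value)         # front of the next variable
            return True
        if tok.values:                          # occupied constant: swap
            ctl.values.append(value)
            value = tok.values.pop()
        elif tok.definition == value:           # empty constant that matches
            tok.values.append(value)
            return True
        tok = tok.next
    ctl.values.append(value)                    # no acceptor: undo the pop
    return False

def neighbor(tok, attr):
    """Nearest VariableToken in the given direction ('next' or 'prev')."""
    tok = getattr(tok, attr)
    while tok is not None and not isinstance(tok, VariableToken):
        tok = getattr(tok, attr)
    return tok

def bridge(ctl):
    """Rule 5: ctl evaluates, and every constant up to the next
    VariableToken is correctly matched."""
    if not ctl.satisfied():
        return False
    tok = ctl.next
    while tok is not None and not isinstance(tok, VariableToken):
        if not tok.satisfied():
            return False
        tok = tok.next
    return True

def match(head, sentence):
    words, i = sentence.split(), 0
    for tok in tokens(head):                    # initial placement
        if isinstance(tok, VariableToken):
            tok.values, i = words[i:], len(words)
            break
        if i == len(words) or words[i] != tok.definition:
            return False
        tok.values, i = [words[i]], i + 1
    if i < len(words):
        return False                            # leftover words, nowhere to go
    ctl = next((t for t in tokens(head) if isinstance(t, VariableToken)), None)
    if ctl is None:
        return all(t.satisfied() for t in tokens(head))
    while ctl is not None:                      # Rule 2 picked the first ctl
        if all(t.satisfied() for t in tokens(head)):
            return True                         # Rule 9
        if shift(ctl):                          # Rule 4: must try at least once
            while not bridge(ctl) and shift(ctl):
                pass                            # Rule 3
            if all(t.satisfied() for t in tokens(head)):
                return True                     # Rule 9
            if bridge(ctl) and neighbor(ctl, "next") is not None:
                ctl = neighbor(ctl, "next")     # Rule 7
                continue
        ctl = neighbor(ctl, "prev")             # Rules 6 and 8
    return False                                # Rule 10

print(match(build("#Jack and #Jill and #Henry"), "Jack and Jill and Henry"))  # True
print(match(build("!jack and !jill and !henry"), "jack and jill"))  # False
```

This sketch reproduces the worked examples that follow, including the
failing one at the end.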
At first we will work with a very simple string and then move on
to more complicated ones. Take the expression "My &dog" as our first
expression. We are going to go through the process of matching it
against the sentence "My big dog". Here is how the basic process goes.
The first thing you do is break the sentence up into tokens, so "My
big dog" gets broken into 3 tokens (My, big, dog). Then you go along the
string, adding the sentence tokens to the expression tokens, until the
current expression token is a VariableToken. Once you get to a
VariableToken, you add all of the remaining tokens onto it. A visual
example of this is...


 ----     ######
 |my| <-> |&dog|
 ----     ######
  my       big
           dog


Now by Rule #9 this sentence matches the expression. The
RegularToken obviously evaluates to true, and &dog takes one or more
words, so "big dog" will match. This example was very simple, used mainly
to illustrate how to initially place a sentence onto an expression.
Let's now use the expression "My &dog and cat". This is a little
harder because it actually requires us to shift values a couple of
times. So we start in the initial state...



 ----     ######     -----     -----
 |my| <-> |&dog| <-> |and| <-> |cat|
 ----     ######     -----     -----
  my       big
           dog
           and
           cat



This does not evaluate to true, because neither "and" nor "cat"
is satisfied. So by Rule #4 we must shift a value. The value we shift is
"cat". After the first step in shifting we come to this:


 ----     ######     -----     -----
 |my| <-> |&dog| <-> |and| <-> |cat|
 ----     ######     -----     -----
  my       big        cat
           dog
           and


Since "cat" does not correctly match "and", we continue shifting
the value along the list, leaving us with the following.


 ----     ######     -----     -----
 |my| <-> |&dog| <-> |and| <-> |cat|
 ----     ######     -----     -----
  my       big                  cat
           dog
           and


This correctly matches the "cat" token, so we are done shifting.
Now there is no bridge, since "and" is not satisfied, so by Rule #3 we
shift again. This leaves us with the following.


 ----     ######     -----     -----
 |my| <-> |&dog| <-> |and| <-> |cat|
 ----     ######     -----     -----
  my       big        and       cat
           dog


Now the expression is correctly matched, and by Rule #9 we are
done.
Now let's take a harder example that will force the VariableToken
in control to change hands a couple of times. The expression we will
use is "#Jack and #Jill and #Henry", and the sentence will be "Jack and
Jill and Henry". After two shifts we are left with this (the CTRL label
marks which token is controlling):


   CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and       henry
  and
  jill


Now by Rule #7 control passes to "#jill", because there is a
bridge between "#jack" and "#jill" and "#jack" has shifted at least
once. So now we have:


                        CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and       henry
  and
  jill


After one shift we have...


                        CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and                             henry
  and
  jill


At this point "#jill" has shifted once, but there is nothing left
for "#jill" to shift, and there is no bridge to the next VariableToken.
So by Rule #6, "#jack" is now the controlling token.


   CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and                             henry
  and
  jill


Now "#jack" must shift at least one time, so it first tries to
shift "jill". However, the next token is already occupied by "and", so
"jill" is pushed back onto the controlling token and the value being
shifted becomes "and". Once this is done, the result is:


   CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack                  and                   henry
  and
  jill


After two more shifts we have


   CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and       jill                  henry
                        and


So once again, by Rule #7, control passes to "#jill".


                        CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and       jill                  henry
                        and


After one shift we are left with this.


                        CTRL
 #######     -----     #######     -----     ########
 |#jack| <-> |and| <-> |#jill| <-> |and| <-> |#henry|
 #######     -----     #######     -----     ########
  jack        and       jill       and        henry


So by Rule #9 this sentence matches the expression.
Using an expression very similar to the previous one, let's try to
match a sentence that fails. The sentence we will be matching is "jack
and jill", and the expression is "!jack and !jill and !henry". After
several shifts you end up with:


                        CTRL
 #######     -----     #######     -----     ########
 |!jack| <-> |and| <-> |!jill| <-> |and| <-> |!henry|
 #######     -----     #######     -----     ########
  jack        and       jill


"!jill" shifts once, and then by Rule #6 we get:


   CTRL
 #######     -----     #######     -----     ########
 |!jack| <-> |and| <-> |!jill| <-> |and| <-> |!henry|
 #######     -----     #######     -----     ########
  jack        and                             jill


"!jack" will shift twice, giving us...


   CTRL
 #######     -----     #######     -----     ########
 |!jack| <-> |and| <-> |!jill| <-> |and| <-> |!henry|
 #######     -----     #######     -----     ########
                        jack                  jill
                        and


Now "!jack" cannot shift, nor does it evaluate to true, since !'s
require exactly one word and "!jack" has none. Since there is no valid
bridge forward, control goes back by Rule #6. However, there is no
VariableToken preceding "!jack", so by Rule #10 this sentence does not
match.
This parser proved to be excellent for the task at hand in this
set of milestones. I believe the overall design is very sound, and I
think it would work great for projects in later semesters with very
little code modification. With the link to this case there should be a
link to the parser code if you wish to look at it. A quick disclaimer:
I am not responsible for anything that happens as a result of using
this code. A more detailed disclaimer should be present on the link to
the code.
Any questions or comments should be directed to Jared
Parsons (tremul@cc.gatech.edu). Thanks for reading this. I appreciate
feedback.

Other stuff:

Design Flaws:
My basic design used two main token classes: VariableTokens and
RegularTokens. However, I probably should have had four classes extend
VariableToken, one for each type of variable. This would have made a
clearer line between the tokens and divided up responsibilities. For
this small example, with only 4 types of variables, it worked great.
However, this design allows for extension and the addition of other
VariableToken types. If many more were added, it would be more
maintainable in separate classes.

Problems:
This parser tended to be problem free. As stated, the actual
parsing algorithm didn't change at all through the entire project. The
only things that changed were the way strings were tokenized when we
had to deal with commas in M5, and the way WordMatchers determined
which Rule to return when more than one matched.
The problem with commas was the following. We were later asked
to be able to handle the following rule:

"&character, &message"

The problem here was that, as it was, our parser interpreted this
as "&character," with the comma, which was not what we wanted. To fix
this little problem we simply had our strings tokenize with "," as a
delimiter and did not let it keep that character when it returned the
tokens.
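The fix amounts to splitting on commas as well as whitespace and
throwing the commas away. A hypothetical Python equivalent (the real
change was to the Smalltalk tokenizer; the sample sentence is made up):

```python
import re

def tokenize(sentence):
    """Split on runs of whitespace and commas, discarding the commas
    themselves and any empty pieces."""
    return [t for t in re.split(r"[,\s]+", sentence) if t]

print(tokenize("Fred, hello there"))  # ['Fred', 'hello', 'there']
```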
The multiple-matched-Rule problem was resolved with a system of
priorities. Each Rule had a point value associated with it: every
RegularToken in the Rule got 2 points, and every VariableToken got 1
point. RegularTokens were given higher point values because they are
more difficult to match. This way "drop &object" would have higher
priority than "&action &object", because the first has a point value of
3 and the second has a point value of 2. Otherwise, inside the callback
for "&action &object" you had to do horrible hacks to see what the
value of action was.
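The scoring is simple enough to sketch. A hypothetical Python version
of the "power" calculation that TokenList performed:

```python
def power(expression):
    """Priority of a rule: 2 points per constant word, 1 per variable.
    Hypothetical sketch; the real version lived on TokenList."""
    return sum(1 if word[0] in "!?&#" else 2
               for word in expression.split())

print(power("drop &object"))     # 3
print(power("&action &object"))  # 2
```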

Programmer Error:
I like to look back on my stupid mistakes and laugh at them. It
illustrates were testing simple cases is always necssary even if your
parser can handle the really hard ones.
The only problems we encountered with this parser were the
following two programmer errors, and after M1 we got both bugs out.
The first was a simple underrun of tokens: if your expression was
"My dog" and you put in "My" as the sentence, the parser would crash
horrifically, which cost me several points on the project. The other
was similar: "My #dog" would also crash on "My". Other than that it
was fairly efficient.



