






Adaptive ABL
Papers
Andre, D., and Russell, S.J., 2001. (NIPS-2000) Programmable Reinforcement Learning Agents. Proceedings of the 13th Conference on Neural Information Processing Systems. MIT Press, pages 1019-1025. pham_nips_final.pdf
Michael's comments 5.17.05
Their motivation for PHAMS is to provide a language for expressing prior knowledge that allows the user (programmer) to provide a partial description of desired behaviors. By providing a partial description for a behavior you want to learn, the learner doesn't have to do as much work since it already has a partial description to start from. Thus, for them, the motivation for combining RL with a programming language is to use the language to specify an initial policy the learner then bootstraps from (the PHAM structure plus the initial random policy for the choice points exactly specifies an initial policy).
Our motivation for incorporating adaptation into ABL is to allow a programmer to write arbitrary (large) programs, then single out pieces of the program for adaptation (metaphorically point at a listing of their program and say "adapt here, here, and here"). This is a different motivation than the PHAM one. The "language for specifying partial policies" is a very ML-centric viewpoint (let me simplify the learning problem by specifying prior knowledge) while our approach is programming centric (let me write programs in which I can specify arbitrary adaptation points). While at a technical level these two viewpoints can be mapped into each other, at a research motivation level, they are quite different. With the PHAM motivation it is OK to require the programmer to be ML-savvy - with our motivation, it is not (we're "ML for the masses").
It's unclear to me whether they are addressing the "self-interference" (temporally extended actions) problem or not. The talk about how semi-Markov decision processes allow for actions that can take more than one timestep. But I don't follow how including the natural numbers in the SxAxS tuple models this (I don't get what the natural numbers are for - someone explain this to me). But, it seems like their interrupt mechanism can introduce this problem. For example, if I'm in the guts of DoDelivery(), and I get a "water" interrupt, then I'll suspend DoDelivery(), do the "clean" action, and resume DoDelivery(). The "clean" action could be an arbitrary PHAM and could thus arbitrarily mess up the state of DoDelivery(). But I guess even in this case only one PHAM is running at a time; when DoDelivery() gets to it's next choice point, it can make a policy decision based purely on external state, even if that external state has been changed by the interrupt PHAM. In ABL you can have the equivalent of multiple PHAMs running simultaneously - this is where adaptation really has to start paying attention to the agent's execution state. It doesn't look like the PHAM formalism has a model of parallel execution.
I don't understand how choice states work - this is where adaptation occurs, but looking at their PHAMs for the delivery domain, I always only see one arrow out of a choice state (the boxes), so what the #$#$% are they adapting over. Since there is only one arrow out of the boxes, I don't understand what this sentence means: "A choice state may have several possible next states; after learning, the choice is reduced to a single next state." I was going to reexpress the Delivery domain in ABL (it's a small ABL program), but since I don't understand the choice states, I don't understand where to use adaptive behaviors.
Memory is peculiarly described in PHAMs. I can't tell if their formalization of memory ("Memory Variables" section 3) is a restriction of standard programming language variables, or only a formalization of standard variables. I'm not sure what this means vs. ABL's arbitrary use of java variables and working memory.
I can't tell if they require the specification of an MDP in order to execute a PHAM (therefore the programmer must specify the MDP in addition to the PHAM) or if they are only talking about MDP's for theoretical reasons (so that they can talk about the SMDP induced by composing the PHAM with the underlying MDP, and then prove theorems about the SMDP). This sentence holds the key: "Alternatively, and much more simply, we can solve it using online, model-free HAMQ-learning [8], which learns directly in the reduced state space of choice points." If HAM-Q requires that you explicitly represent the SMDP, then you need to explicitly represent the MDP (how else would you get the SMDP). Which of course sucks sucks sucks. Only ML people think that explicit MDPs are a useful way to think about the world (talk about an impoverished knowledge representation...). But using MDPs as a mathematical construct to get a proof to go through is entirely kosher.
Do either of you know what an SMDP is (other than saying that actions can take more than one time step)? If so, explain it to me sometime. Sounds like we'll be using SMDPs for whatever formal analysis we want to do for A2BL.
I like the qualitative analysis they can give of how the knowledge specified in the PHAM results in a smaller number of choice points in the induced SMDP than in the underlying MDP ("... Deliver-Patrol PHAM induces 7,816 choice points in the joint SMDP, compared with 38,400 in the original MDP. Furthermore, only 15,800 Q-values must be learned, compared with 307,200 for flat Q-learning."). But again, this reveals the ML bias in their underlying motivation. The assumption is that the agent has a single, global policy it is trying to accomplish. However, when thinking about an ABL agent, I find it more profitable to think of it as simultaneously pursuing many local policies, some human authored, some adaptive. When thinking about my own behavior, there is just not much traction to be gained by thinking about the single global policy that defines my behavior - it is much more intuitive to think about the many local policies I pursue. This is true of any complex agent.
I think to have a good NIPS paper we would really need to be able to do a head-to-head comparison of A2BL with PHAMs. And to do this we need to answer a number of my questions above. Clearly A2BL is much more expressive than PHAMs (simultaneous execution of multiple behaviors, possibly a broader variable model, other stuff...?), but this means that it will be harder (possibly much harder) to make the clean theoretical statements that the PHAM paper does. My overall impression is that, yes, PHAMs are an example of integrating RL into a language, but not a language that anyone would really want to use. It's like integrating RL into a Turing machine by adding "choice" states that use a learned policy to jump to some location on tape and write a symbol - sure, you've integrated RL into a Turing complete framework, but a) would any programmer ever want to use it (we want an expressive language) and b) would the programmer have to be a machine learning expert to usefully employ learning (we want an adaptive language that works for programmers naive about learning).
Link to this Page
- Projects last edited on 27 October 2005 at 11:21 pm by 65-182-51-67.biltmorecommunications.net