PhD Progress: Infinity Pacman

Posted by Sam on June 30th, 2009 under Uncategorized  •  No Comments

Well, I started the grand experiment(s) yesterday at about noon (I can’t remember the exact time) and I have come back to check on them. One thing I thought I should do after I had started them was to make sure they output their progress every episode (after the generators were updated) so they could pick up where they left off.

However, the problem I came to find was not that of crashed experiments; they were all working fine. The problem was that of infinite PacMan. Two of the experiments (from 3) had scores in the millions, and spare lives equally as large, and the other experiment was exploiting a bug in the code where PacMan moves back and forth along the wall at a warp-point.

So, the first issue can be solved by having a level-limit. Perhaps I’ll stick with the 10 levels. So the nature of the experiment will be different to that of the paper, but it always was anyway. If the agent completes all 10 levels, then the episode ends.

The second issue seems to be a bug in the ghost code. Although PacMan was practically staying in the same place, the ghosts were just moving around in circles. The ghost’s movement is determined by a target point which they move towards. However, this movement algorithm doesn’t seem to take into account warp points (paths in the level that go out one wall and come in the opposite wall). I rarely see a ghost take one of these warp paths – if ever! A further method of halting infinite games (perhaps PacMan is able to outmaneuvre a ghost forever) is to simply place step limits on the episodes. Something quite big oughta do it.

Perhaps it is time to give the ghosts individual behaviour while I’m at it too. Currently, all the ghosts have the same greedy, but lonely, behaviour. A ghost will chase the player, but maintain its distance from other ghosts. This needs to be changed to the proper ghost logic seen in PacMan (or Ms. PacMan, in this case).

Anyway, here’s the two policies that were all about achieving infinite points:
[1]: if CONSTANT>0.0 then TO_DOT+
[1]: if NEAREST_ED_GHOST<99.0 and NEAREST_POWER_DOT<5.0 then FROM_POWER_DOT+
[1]: if NEAREST_GHOST<5.0 then FROM_GHOST+
[2]: if MAX_JUNCTION_SAFETY>5.0 then TO_SAFE_JUNCTION-
[2]: if CONSTANT>0.0 then FROM_GHOST_CENTRE+
[2]: if NEAREST_GHOST>7.0 then FROM_GHOST-
[2]: if MAX_JUNCTION_SAFETY<1.0 then FROM_GHOST+
[2]: if MAX_JUNCTION_SAFETY>5.0 then TO_SAFE_JUNCTION-
[2]: if NEAREST_ED_GHOST>99.0 then TO_POWER_DOT+
[2]: if NEAREST_ED_GHOST>99.0 then TO_POWER_DOT+
[2]: if NEAREST_POWER_DOT>10.0 then FROM_POWER_DOT-
[3]: if NEAREST_ED_GHOST<99.0 then FROM_POWER_DOT+
[3]: if NEAREST_GHOST>7.0 then FROM_GHOST-
[3]: if NEAREST_GHOST<4.0 then FROM_GHOST+

[1]: if NEAREST_GHOST<3.0 then FROM_GHOST+
[1]: if MAX_JUNCTION_SAFETY<2.0 then FROM_GHOST+
[1]: if GHOST_DENSITY<1.5 and NEAREST_POWER_DOT<5.0 then FROM_POWER_DOT+
[1]: if MAX_JUNCTION_SAFETY<2.0 then FROM_GHOST+
[1]: if CONSTANT>0.0 then TO_DOT+
[1]: if MAX_JUNCTION_SAFETY>3.0 then FROM_GHOST-
[1]: if NEAREST_GHOST<5.0 then FROM_GHOST+
[2]: if CONSTANT>0.0 then TO_DOT+
[2]: if CONSTANT>0.0 then FROM_GHOST+
[2]: if NEAREST_ED_GHOST>99.0 then FROM_POWER_DOT-
[2]: if NEAREST_ED_GHOST<99.0 then FROM_POWER_DOT+
[2]: if MAX_JUNCTION_SAFETY<3.0 then FROM_GHOST+
[2]: if NEAREST_GHOST>7.0 then FROM_GHOST-
[2]: if NEAREST_ED_GHOST>99.0 then TO_POWER_DOT+
[3]: if NEAREST_ED_GHOST<99.0 then TO_POWER_DOT-
[3]: if MAX_JUNCTION_SAFETY>5.0 then TO_SAFE_JUNCTION-
[3]: if NEAREST_ED_GHOST>99.0 then FROM_POWER_DOT-

Big policies, but they were still from the early part of he cross-entropy process. They seem to basically focus on getting all the dots, and avoiding ghosts when near.

PhD Progress: Pacman has a brain!

Posted by Sam on June 28th, 2009 under PhD Project  •  No Comments

After some weeks (I’m not entirely sure how many. Probably about 4-5 weeks) I have successfully re-implemented most of the RL Pacman paper by Szita and Lorincz. I say most because I have left out two possible action modules, but they were rarely used in the actual paper and they didn’t seem all that useful anyway.

Although I haven’t run a full testing run as they did (population=1000, episodes=50, p=0.05), I have run a shorter episode that demonstrates some learning (well I assume it has. I just record the best policy). Perhaps I should add some testing criteria that actually record the mean and SD of an episode, giving an estimate towards the effectiveness of the generators. Anyway, the best policy from a shorter test run (population=100, episodes=20, p=0.1) is:
[1]: if CONSTANT>0.0 then TO_DOT+
[1]: if CONSTANT>0.0 then TO_DOT+
[1]: if NEAREST_GHOST>6.0 then FROM_GHOST-
[1]: if NEAREST_GHOST<4.0 then FROM_GHOST+
[2]: if MAX_JUNCTION_SAFETY<3.0 then TO_SAFE_JUNCTION+
[2]: if CONSTANT>0.0 then FROM_GHOST+
[3]: if NEAREST_GHOST>5.0 then FROM_GHOST-
[3]: if CONSTANT>0.0 then TO_SAFE_JUNCTION+
[3]: if MAX_JUNCTION_SAFETY>5.0 then FROM_GHOST-
[3]: if MAX_JUNCTION_SAFETY<1.0 then FROM_GHOST+

Obviously there are some redundant rules, but the policy basically functions as such:

  • The default behaviour is to eat dots. That's what Pacman is all about. I think the reason this behaviour wasn't present in Szita and Lorincz's agent was because they perhaps ended the episode once all dots are eaten. So their agent looked to maximise the score over a single map, instead of over the entire game.
  • If a ghost ventures near, start running from it instead of chasing dots, but if it falls away, eat dots again.
  • When at a junction where priority one (probably eating dots) gives several possible directions, choose the direction from a ghost. The other action will not matter, as it will always be overwritten by the ghost running.
  • Finally, there is just further ghost running rules. Although that last rule seems redundant, but hey, it's there.

This agent earns an average reward (over 3 games) of 69050 points. Note that this also includes points from Fruit, which occasionally appears, rewarding 750 points if eaten.

Fuck Macs

Posted by Sam on May 29th, 2009 under Uncategorized  •  2 Comments

Yes. Fuck Macintosh OSs. They are horrible unintuitive pieces of crap. Why? Here’s an example.

I walked up to the labs today, in an effort to make sure I produce some work today. However, there is one key problem with this. The computers in my lab are all Macintoshs. I have been sparingly using this computer as I went through my book, using Google documents to create presentations (fortunately, I don’t have to use any Mac Powerpoint-like software).

Anyway, I started up Google Documents (through Safari, which is another area of frustration, as opposed to Firefox) and started bullet-pointing key ideas. Then I arrive at an equation in the PDF, which cannot be properly written in Google Documents, so I do what I normally do (on my Windows machine, anyway) and attempt to take a screenshot. Oh no! Macs don’t have the print screen key. Well, I’ll attempt to copy the screen using the Edit->Copy command. Perhaps that will take a screenshot of the PDF. Now, to paste it into some sort of image editing program. WAIT! Mac doesn’t have one of them. Not a simple-to-use one anyway. Mac, known for its awesomeness at editing images and all that design stuff, doesn’t even have a basic Paint program. Lovely…

Well, the internet will have the solution! I search for a Mac paint program and find a program that states it is similar to MS Paint. Works for me! I download it, then try to start it up.

Move this program to the Applications folder

Where the fuck is that! But I then remember something about it being in the Finder (the hell is Finder anyway?! I assume it is similar to Explorer for Windows, in that it controls everything). I find it, open it up, and drag the file in.

ACCESS DENIED!

Macs by themselves infuriate me, but when coupled with an access restricting laboratory environment, using them is torture. After several attempts, I couldn’t get it to work. So, I must find an alternate solution. Surely one can take a screenshot on a Mac. A quick Google reveals that you can, through several options: Shift+Command+Ctrl+3 or Shift+Command+Control+4+Space. Who the fuck comes up with this shit?! Why does taking a screenshot require 4 or more keys to be pressed? Windows has a convenient (well, not convenient, but it exists) key called ‘Prt Scr’. One press, and you have a screenshot. Surely the creators of Mac could have assigned a different key combination, perhaps something more intuitive? Maybe Shift+Command+S?

Well, I tried this demonic combination, but nothing happened! It allegedly takes a shot of the screen and dumps the file to the desktop, but nothing appeared there. Getting to the desktop is annoying enough as well. F11 is the only way I can get there, and even then, there are windows hovering about the edges of the screen. F11 sure is intuitive, you white, rounded-edge-loving pricks! Anyway, at this point, I gave up in frustration, and decided to return home and work there (after this post, of course).

So Mac, why no print screen key? You have F13 and F14, which, I’m sure are very useful, maybe as useful as all the other function keys! You remove print screen, a harmless key in no-one’s way, but leave Caps Lock? FUCK I HATE MACS!

PhD progress: This week’s work

Posted by Sam on May 13th, 2009 under PhD Project  •  No Comments

I’ve finished reading the book, so now I will be reading related papers and articles. I will be making note of which articles I read and noting down comments and ideas the paper inspires.

Using background knowledge to speed reinforcement learning in physical agents
This paper talks about Icarus, a hierarchical reactive reinforcement learning agent. The agent is given a general hierarchy for what to do and the agent fine-tunes how to do stuff using the hierarchy. The paper uses the example of driving a car on an infinite highway, which could be a possible example. The paper also looks at the results of obscuring the agent’s knowledge by constricting the number of questions an agent can ask, rendering the domain a POMDP. The paper shows results of the hierarchical Icarus performing better than a general reinforcement learner, even after a great number of iterations. The paper highlights the benefits of domain knowledge for learning.

Relational Reinforcement Learning
The thesis that started my interest (sort of). Though I only skimmed, it brought up important issues in environments. How do you convert a typical environment into relations? For example, Digger. The player has a position and the state is set up in a coordinated fashion. The answer lies in how the agent receives the info. The environment appears to send the agent information on the set questions it can ask (closest emerald, is there a monster in my line of fire, etc). I have begun to understand that the language bias is indeed the strongest thing an agent can receive.

In setting up my environments, I need to store them in whatever way is necessary, but give the agent relational observations to use. This allows the agent to operate on abstractions rather than directly on the environment itself. However, there is the problem of numbers. For instance, when measuring distance between the agent and something, a number would usually be given. But logic doesn’t really allow numbers. This sort of problem has been solved in ML Weka algorithms, so I’ll need to investigate some of them. I have a feeling it will be dealing with typed logic. E.g. distanceBetween\3: distanceBetween(agent, wall, 5.3) could be abstracted as distanceBetween(agent, X, numberLessThan(6)) or distanceBetween(agent, X, numberBetween(2.5, 7.5)). It will certainly be a stumbling block.

Combining Model-Based and Instance-Based Learning for First Order Regression
This publication talked of Trendi, the child of TG and RIB. The paper stated that it performed better than both TG and RIB, but slightly worse than KBR. However, it runs much faster than KBR. Could be a possible implementation beginning point.

Transfer Learning in Reinforcement Learning Problems Through Partial Policy Recycling
This publication outlines the TGR algorithm, an extension of the TG algorithm that deals with transfer learning and concept drift. The results were quite good, making transfer learning a viable opportunity. The flexibility of the learner was achieved by using more tree-restructuring operations, allowing the learner to re-adapt its tree.

TD(λ) networks: temporal-difference networks with eligibility traces
This paper talks about using a predictive representation as a solution for the POMDP problem. It attempts to predict future states using a network of states. The paper was a bit math-intensive (or I was just getting tired), so I didn’t follow the exact algorithm.

Reinforcement Learning in Relational Domains: A Policy-Language Approach
This paper concerns the LRW-API learner. This technique of policy iteration uses many tricks to achieve a quick learning and low computation policy learner.
One of the tricks is policy rollout (previously seen in the book, but never fully grasped). Policy rollout computes π^, an approximation of π’, the improved policy (normally computed by iterating between π and Vπ(s)). This approximation is computed by estimating Qπ(s,a) for every action by taking w trajectories of length h for each action, following π. The estimated Q value is then given as the average of the cumulative discounted (remember that rewards obtained later on are more discounted than earlier) rewards obtained. w accounts for the transition probabilities (averaged out) and h accounts for long-term reward.

This policy-rollout is used in the Improved-Trajectories procedure, which creates n length h trajectories starting from random initial states. These rollouts store more than just the action selected by π, they also store the Q^-values (approximate Q-values) for each state in h. For example, at state i, the information about the state is (s, π(s), {Q^(s,a1), …, Q^(s,am)}). These Q^-values are filled by the trajectories, allowing an approximate policy π^ to be extracted from the Q^-values. Storing the Q^-values allows the learner to make trade-offs, such as avoiding states with heavy penalties for mistakes.

A further minor technique used is the existence of predicates with relation to the goal. These predicates take the for gsomething(X,Y) for goal predicates (gon(a,b) implies that a is on b in the goal state), and csomething(X,Y) for comparison predicates (con(a,b) implies that a is on b in the goal state and in the current state)

Taxonomic syntax is a language used for denoting sets of objects. The syntax is made up of predicates (max arity of 2) and variables. For instance, the class expression (on on-table) denotes the set of blocks that are on blocks on the table (The blocks directly above the blocks on the table). The * operator denotes a chain of objects: (on* a) denotes all blocks above a (linked by the on operator). In a similar vein, (min on) denotes all objects that are on something, but nothing is on them (a minimal chain).

An example of the taxonomic decision list policy put together is as follows. The goal is clear(red). So, a policy in the following format will solve any problem:
putdown(x1) : x1 ∈ holding
pickup(x1) : x1 ∈ clear, x1 ∈ (on* (on red))

This is getting a bit big, so I’ve moved it into powerpoint format. Stored in my Google docs.

An extra note. If I implement a similar bootstrapping system to that used in LRW-API, I could use the bootstrapping process to help realise sub-goals in a hierarchical manner.

Things I hate…

Posted by Sam on May 11th, 2009 under Mind of Me  •  1 Comment

There are a lot of individual things/people that I hate, but this list will try to list general things I hate. Pet peeves, gear-grinders, etc. It will be added to as I remember stuff.

  • Those dubbed cleaning product advertisements on TV. They were filmed in German or whatever, and a crappy dub is just layered over it.
  • Slow people. Not retarded. Just slow moving
  • Dishes put in the sink. Even if there is no room on the bench, they shouldn’t be put in the sink. There’s no room to rinse shit when you’re done with it.
  • Car ads. Fuck they annoy me. Usually the actual content of the ad has nothing to do with cars, save the placement of the car itself. Especially the Mazda ‘Zoom zoom’ thing and the Holden Berina or whatever ad.
  • More to come…