PhD Progress: Successful developments
Posted by Sam on August 11th, 2010 under PhD Project • No Comments
Seems that all of my work over the past few weeks is paying off. All the additions of extra learning options and such. Blocks World is now able to complete its onAB learning task in little over 36 minutes (just under 20 minutes learning time) (this includes learning modules as well). The modules it learns are compact, neat and valid, consisting of minimal rules (thanks to the slot removal aspect). The convergance property allows the learning to progress quickly along with little over a minute per iteration used.
As for Ms. PacMan, it is still slow, but after 12 (learning) hours, the experiment is 2.7125% along (27.125% iteration). The ruleset is beginning to shape itself to something useful, with handy fromGhost rules (which include conditions for ghost state: aggressive) and toDot rules near the top, and other, less useful rules near the bottom/disappearing. Assuming it continues at this speed, one iteration will take ~45 hours (2 days). Hence the entire experiment takes 20 days, but these can be split up.
Speaking of splitting up, Eibe brought up the possibility of splitting the learning across multiple machines (i.e. Symphony). This could be easily achieved, as the very nature of the experiment allows it to be split. Simply send out X agents operating in their own environments and when they return, sort them in order and update generator. Then repeat. Of course this current system operates in an iterative manner, but the learning should be roughly equal if the update parameter is proprotional to the number of samples.
An alternative to that method is a much larger one which only requires 10 machines, each running entire experiments. Then the results are averaged and stored. But that takes much longer and doesn’t make full use of the number of processors Symphony has available.
Seems there is still the problem of statistical pre-goal unification which hasn’t upset the PacMan experiment yet, but is likely to when a pre-goal is created with edible ghosts. I’ll have top give more thought to sorting that out later.
A testing policy from PacMan:
Policy:
(distanceGhost player ?X ?__Num7&:(betweenRange ?__Num7 2.0 12.666666666666666)) (nonblinking ?X) (aggressive ?X) (pacman player) => (fromGhost ?X ?__Num7)
(distanceFruit player fruit ?__Num9&:(betweenRange ?__Num9 2.0 14.5)) (pacman player) => (toFruit fruit ?__Num9)
(distancePowerDot player ?X ?__Num1&:(betweenRange ?__Num1 0.0 51.0)) (pacman player) => (toPowerDot ?X ?__Num1)
(distanceDot player ?X ?__Num3&:(betweenRange ?__Num3 0.0 52.0)) (pacman player) => (toDot ?X ?__Num3)
(distancePowerDot player ?X ?__Num2&:(betweenRange ?__Num2 19.5 29.25)) (pacman player) => (fromPowerDot ?X ?__Num2)
(distanceGhostCentre player ?X ?__Num8&:(betweenRange ?__Num8 0.0 13.0)) (pacman player) => (toGhostCentre ?X ?__Num8)
(distanceGhost player ?X ?__Num6&:(betweenRange ?__Num6 34.0 43.0)) (pacman player) => (toGhost ?X ?__Num6)
(junctionSafety ?X ?__Num4&:(betweenRange ?__Num4 -8.0 0.0)) => (toJunction ?X ?__Num4)
(distanceGhostCentre player ?X ?__Num5&:(betweenRange ?__Num5 0.0 52.0)) (pacman player) => (fromGhostCentre ?X ?__Num5)
Clearly from ghost behaviour is most important, along with eating fruit and using the powerdot to keep the ghosts placid. The toDot is all encompassing and will always be active. The fromPowerDot rule will only trigger at a distance, so it has no value. The toGhost rules have little value currently, as they don’t include the edible attribute, but that is likely because they can’t be included. The last 2 are practically useless, though the all-compassing fromGhostCentre does lend some defensive behaviour.
Just had a thought about pre-goal mutation and such. Haven’t fleshed out the possibilities yet, but mutate rules based on what constant elements are present and mutate in relevant conditions seen in conjunction with said elements. So a rule concerning a ghost would mutate in the 4 attributes concerning the ghost: edible, aggressive, blinking or nonblinking. This could also create an opening for negation, allowing me to remove half of those attributes. I’ll think on it.
PhD Progress: Slot Addition/Removal
Posted by Sam on August 5th, 2010 under PhD Project • No Comments
With rule selection out of the way, I move on to the next issue of address: slot addition/removal. In some problems, the same action may need to be performed more than once under different circumstances. For example, in Towers of Hanoi, all we can do is move a tile from A to B. But a single rule is not enough to solve the problem; in fact it will only get us 1, maybe 2 steps. And on the other side of the equation, for the sake of policy simplification, it is better to remove slots we never use than to have them hang around. Sure, in Ms. PacMan, every action has a use, and could potentially have a rule which works for it, but some actions are better left without (toGhostCentre, fromPowerDot for example).
And in a simple case, Blocks World has an action without use: moveFloor. Yes, it is used in modules, but when using modules, it is useless. So for the sake of these environments and possibly future environments, I will implement an optional slot addition/removal system.
The algorithm is something like this:
- Each slot maintains an M value for the base chance of it being used (initially M = 1, M >= 0) and an S value for variance (initially S = say 0.5, 0 <= S <= 0.5).
- The chance of a slot being used depends on a normally sampled value of m = M +- S.
- If m is above 1.0, the slot will be used but not removed from the slot distribution. The value for the not removed slot is equal to m – 1.0.
- If m is below 1.0, the slot will only be used (and removed from the distribution) if a random number is below m. Otherwise the slot is not used but is removed from the distribution.
For example, a slot has M = 1.0 and S = 0.5. It is drawn with m = 1.4. Hence, the slot is used at least once and a 40% chance of being used twice.
Another example, a slot is drawn with m = 2.3. The slot is used at least twice and a 30% chance of being used thrice.
Another example, a slot is drawn with m = 0.8. The slot only has an 80% chance of being used at all.
During the update procedure, these slot values M and S are updated as well, with M’s values being updated in a standard step-wise manner using the elite solution count. S’s value needs to slowly decrease over time if M remains relatively stable. Perhaps a standard deviation for the elite samples can be calculated and used to update S in a step-wise manner. That should suffice.
PhD Progress: Rarely Used Rule Selection Algorithm
Posted by Sam on August 3rd, 2010 under PhD Project • No Comments
It is said that the cross-entropy method is good for finding rarely used rules and improve their selection chance, but in a massive ruleset which is already converged, it can be difficult to chance upon them.
The algorithm can be changed to modify the probabilities of individual rules being selected by taking into account how many times a rule has been previously used. So given a distribution of rules D, where each rule has probability P(R) and each rule records how many times it has been used U(R) (this includes times ruls are used but never fired), the probabilities of D can be updated at each step using the following formula:
D’ = forall(R): P’(R) = if (U(R) < #(R in D) * alpha) ? P(R) * (#(R in D) + 1 - U(R)) * beta : P(R)
This states, if a rule's uses are less than the number of rules in the distribution * alpha (a coefficient), then modify the probability of being selected by the probability * the number of rules in the distribution + 1 - the number of uses * beta (another coefficient).
This would probably work well enough with alpha and beta both set to 1, but some modifications may be needed to properly test each little used rule.
An example distribution:
Rule A: 0.4, U = 5 => 0.4
Rule B: 0.17, U = 0 => 0.17 * 5 = 0.85
Rule C: 0.26, U = 1 => 0.26 * 4 = 1.04
Rule D: 0.17, U = 0 => 0.17 * 5 = 0.85
Normalised:
Rule A: 0.13
Rule B: 0.27
Rule C: 0.33
Rule D: 0.27
This system attempts to maintain the probabilities of the original distribution but offsets them by the number of uses, which will eventually decrease to the threshold, resulting in a fair and stable distribution again.
I suppose I could ignore existing probabilities and just set the distribution to reflect the inverse uses (until all thresholds are met). But this could result in unfair balancing.
PhD Progress: Performance Decrease
Posted by Sam on July 21st, 2010 under PhD Project • No Comments
Seems the performance of Ms.PacMan has decreased. This is possibly due to the ‘broken’ mutation operators used at the moment. Because the unification process was modified such that general rules now contain ranges, the mutation process no longer mutates these ranges because it lacks the capabilities at the moment. Therefore, PacMan achieves less reward.
A Ms. PacMan experiment was run recently (40 hours in, though half of that was just testing) which has the following results (about 62% completion):
2064.0
3259.2
3441.8
3574.0334
3836.2334
3582.9666
3420.2
3421.8333
3715.6667
3890.8667
3555.1333
3555.1667
3542.5
3654.1667
3400.3667
3484.4
3443.6667
3809.3333
3616.8667
3061.9
2986.3
3770.3667
3762.3667
3142.9333
3247.0334
3266.5
4481.4
4112.433
4163.967
3327.4333
3934.3
4198.533
This is significantly worse than the previous experiment over the regular environment. While there is some initial improvement in the rules, it doesn’t appear to be increasing very fast and the last few results could simply have been flukes.
The readable generator is:
A typical policy:
(distanceDot player ?X ?__Num3&:(betweenRange ?__Num3 0.0 52.0)) (pacman player) => (toDot ?X ?__Num3)
(distanceGhost player ?X ?__Num7&:(betweenRange ?__Num7 0.0 52.0)) (pacman player) => (fromGhost ?X ?__Num7)
(distancePowerDot player ?X ?__Num1&:(betweenRange ?__Num1 0.0 50.0)) (pacman player) => (toPowerDot ?X ?__Num1)
(distanceGhostCentre player ?X ?__Num5&:(betweenRange ?__Num5 0.0 51.0)) (pacman player) => (fromGhostCentre ?X ?__Num5)
(junctionSafety ?X ?__Num4&:(betweenRange ?__Num4 -16.0 28.0)) => (toJunction ?X ?__Num4)
(distancePowerDot player ?X ?__Num2&:(betweenRange ?__Num2 0.0 50.0)) (pacman player) => (fromPowerDot ?X ?__Num2)
(distanceGhostCentre player ?X ?__Num8&:(betweenRange ?__Num8 0.0 51.0)) (pacman player) => (toGhostCentre ?X ?__Num8)
(distanceGhost player ?X ?__Num6&:(betweenRange ?__Num6 0.0 52.0)) (pacman player) => (toGhost ?X ?__Num6)
(distanceFruit player fruit ?__Num9&:(betweenRange ?__Num9 0.0 52.0)) (pacman player) => (toFruit fruit ?__Num9)
Obviously, toDot behaviour is best, followed by fromGhost behaviour. Normally, I’d prefer fromGhost to be highest weighted, but because it only contains one (or two) all-encompassing rules, it would mean the agent spends most of its time cowering in the corner. ToPowerDot is above fromPowerDot, generally a good choice, and toGhost and toGhostCentre are both lowly weighted. Strangely, toFruit is quite low. Possibly because it doesn’t turn up so much? You know, the fruit may be making all the difference between the scores from this run and the previous one.
On the bright side, the faster optimisation seems to be working, with this run due to be completed in about 24 hours, making a total time of 63 hours.
I need to fix this range mutation and also to fix the mutation towards useful to/fromGhost rules that include the binary attributes edible/aggressive. Furthermore, the modularisation for learning clear still hasn’t been 100% solved, as sub-optimal rules are being chosen as the best.
PhD Progress: Relational Clustering Possibly Needed
Posted by Sam on July 17th, 2010 under PhD Project • No Comments
Thinking about Ms. PacMan environment, I realised it’s kind of cheating, with all the actions based on particular behaviours. A true Ms. PacMan environment would use two actions: moveFrom and moveTo.
While these were proposed a while ago, the larger number of actions was chosen to make learning easier. But if I truly want a general learning algorithm, I can’t always rely on acctions to be given. While I am working towards slot addition and removal (funny how I’m reimplementing things present in the original paper) which should be effective for learning a set of good rules per domain, I may need to also look at clustering states.
The problem is present in Hanoi: there are essentially two actions needed: move the smallest tile, and move a different tile. The key is being able to distinguish the two. For this, I may need a relational clustering algorithm.
Ergh. My mind is dead, so I can’t continue this. Well, something for thought later on perhaps.