||||||||||||
Title: lightvector's Thesis Post by lightvector on May 15th, 2011, 10:46pm I hope this is of interest to others. Let me know if there are any problems accessing it. http://icosahedral.net/downloads/djwuthesis.pdf Enjoy! |
||||||||||||
Title: Re: lightvector's Thesis Post by Swynndla on May 15th, 2011, 10:50pm Thanks!!! |
||||||||||||
Title: Re: lightvector's Thesis Post by UruramTururam on May 16th, 2011, 7:32am Great! I've printed it out to read in my spare time. :D |
||||||||||||
Title: Re: lightvector's Thesis Post by jdb on May 16th, 2011, 12:49pm I look forward to reading it. |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 17th, 2011, 1:02pm Thanks very much for making this public, lightvector. Your writing is very clear and readable. I'm sure I will have many questions as I try to understand it, but here's a quick, easy one. On page 9 you say, "the average game length was about 92 turns, or 41 complete turn alternations." That reminds me of the Rocky and Bullwinkle episode in which they said, "after forty days and thirty nights..." :) |
||||||||||||
Title: Re: lightvector's Thesis Post by Hippo on May 17th, 2011, 1:30pm Nice reading. I would concentrate on fast approximations of the ordering (and on incremental computation of it), and on iterative deepening with heavy pruning deep in the tree. It seems to me that pruning deeper in the tree is more important than pruning at the root. I would also think about the option of remembering the pruned lists for use at the next level of iterative deepening. It's a pity you tested pruning only on 5s games; it would be interesting to see how the results change at longer time controls. But Omar must be happy with your approach ... using automated learning to improve itself. Maybe in the future it will be difficult to learn from humans ... :). |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 17th, 2011, 2:27pm After my first pass reading of how expert move prediction was done, I am stunned that it could work at all. Let me check my understanding: You had over 3600 features, and those features contributed additively (in the generalized Bradley-Terry model) to the evaluation of the move. For example, the feature "pushes horse to c5" made a contribution to the value of the move independent of the feature "enemy elephant defends c6 trap"? It seems that all the features that are present contribute equally to the strength of the "team of features"; pairs or groups of features have no meaning. Or did I misunderstand the math? Furthermore, you trained these features on part of the data from only 4121 games? That's a whale of a training method to extract meaningful results from so little data. From Haizhi's failed attempt at training an evaluation function containing tens of thousands of features, I must have drawn incorrect conclusions. I thought that the presence of so many features, with so little data to go on, would inevitably result in overfitting, or what might be called "superstitions" whereby the AI homes in on features that just happened to have been associated with winning moves in a few instances. Like I was listening to Bruce Springsteen just before I got word of my grandfather's death, so I think Springsteen is bad luck. Exact move prediction 12% of the time is astounding; for comparison, in the 2009 Mob game, I predicted the Mob's exact move nine times out of forty, or 22.5%. Has any developer done statistics on how often his bot correctly anticipates the opponent's move on the basis of full search? I am extremely curious as to the distribution of the trained weights. Were the weights so different as to make a near hierarchy, e.g. goal >> capture >> trap control >> piece-square-location? If so, then perhaps the fact that the value of the features was merely additive is less of a handicap than I would have thought. Indeed, perhaps a hand-tuned expert move predictor could have outperformed the machine-learned expert move predictor. Did you try comparing hand-tuned pruning with machine-learned pruning, as you later compared hand-tuned evaluation with machine-learned evaluation? If I were your competitor in the Computer Championship, I might be tempted to imitate you in point of forward pruning all but 5% of the moves at the root, given the obvious success of such a result, but diverge in point of making my own pruning function by hand, and if I could make the pruning function sufficiently lightweight, applying it recursively as Hippo suggests. |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 17th, 2011, 2:55pm In your second pruning round-robin, where you pruned moves without using the expert move predictor, did you use static evaluation to decide which moves to prune, or just prune at random? If you pruned unordered moves (i.e. pruned at random), the result seems pretty meaningless, but if you used static evaluation then you have confirmed earlier results (including by 99of9 with GnoBot) that forward pruning on the basis of evaluation results in weaker play. Intuitively, this is a very important result. Conventional wisdom says that forward pruning is a bad idea, but what if it is a great idea that has simply been badly implemented? What if we need to do forward pruning on a completely different basis than we do evaluation? If that turns out to be a good idea, then it could be a landmark insight in the evolution of alpha-beta searchers. Overall, I felt that I was reading (at least) a Master's thesis. It is a bit scary to think what you will come up with next, should you continue to assail the Arimaa Challenge. Congratulations, and thanks again for sharing! |
||||||||||||
Title: Re: lightvector's Thesis Post by 99of9 on May 17th, 2011, 7:45pm A minor typo fix for page 26 - under TRAP_STATUS, it is rare to have 4 players in a game. |
||||||||||||
Title: Re: lightvector's Thesis Post by 99of9 on May 17th, 2011, 7:53pm on 05/17/11 at 14:55:44, Fritzlein wrote:
Please bear in mind that the 2004 version of Gnobot was my first try at a bot, and I really didn't know what I was doing. I don't think it can be taken as much evidence for any particular "result". My pruning was savage (only 10 moves were examined at every node... out of all 17000). And I don't recall doing any direct comparison to a non-pruning version of the bot. (because my algorithms were so slow that I don't think I could make it to 8 steps without pruning!) |
||||||||||||
Title: Re: lightvector's Thesis Post by 99of9 on May 17th, 2011, 11:10pm Here's something I don't understand. Surely "players" on a team can have a negative strength? For example, what is the strength of the ALLOWS_GOAL feature? Suppose I have two possible moves that look similar in every other way (i.e. all their other features are identical), but one of the two has ALLOWS_GOAL = 1. Surely it doesn't help to have that extra player! Maybe some of the features need to be turned around so that they are always helpful? |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 17th, 2011, 11:26pm on 05/17/11 at 23:10:20, 99of9 wrote:
Good question. Here the math appears to be done before taking logarithms, so instead of the formulas looking like 1/(1+exp(b-a)) or equivalently exp(a)/(exp(a) + exp(b)) as we are used to, the formulas instead look like 1/(1 + B/A) or equivalently A/(A + B) where exp(a) = A; a = log(A); exp(b) = B; b = log(B). Now it doesn't make sense to have negative factors. If A is positive and B is negative, then A/(A + B) is greater than one or negative, whereas probabilities must be between zero and one. Also you can't take logarithms of negative numbers. So I think the answer is that features can't have negative strengths, or else whole teams of features could have negative strengths, resulting in impossible win percentages. However, the A and B are themselves computed as the product of all the features that are present in move A and move B respectively. If a feature isn't present in move A, it doesn't contribute, which is the same as saying it contributes a factor of 1. A bad feature which is present could contribute a factor of 1/2, or 1/1,000,000, severely dragging down the product compared to move B which doesn't have the feature. So the strength of a bad feature isn't a negative number, but after taking logarithms it is. A fraction being multiplied into a team (instead of a non-factor of 1) is the equivalent of a negative elo rating being added into a team (instead of a non-rating of 0). If that makes sense. By the way, the feature "allows opposing goal" shouldn't be infinitely bad. In particular, it had better contribute less than the feature "moves friendly rabbit to goal". :-) |
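To make the team-of-features arithmetic above concrete, here is a minimal Python sketch (the feature strengths are invented for illustration, not the trained values): each move's team strength is the product of the strengths of the features it contains, and the predicted probability that move A rather than move B is the expert's choice is A/(A + B).

    import math

    # Invented multiplicative feature strengths. A "bad" feature has strength < 1
    # (a negative log-strength): it drags the product down rather than adding a
    # negative term, exactly as described above.
    strengths = {
        "capture_camel":      math.exp(4.6),
        "threaten_goal":      math.exp(2.7),
        "allows_goal":        math.exp(-6.8),
        "freeze_enemy_camel": math.exp(0.9),
    }

    def team_strength(features):
        """Product of the strengths of the features present in a move."""
        s = 1.0
        for f in features:
            s *= strengths[f]
        return s

    def p_first_preferred(features_a, features_b):
        """P(move A is preferred to move B) = A / (A + B)."""
        a, b = team_strength(features_a), team_strength(features_b)
        return a / (a + b)

    # Move A captures the camel but allows a goal; move B merely freezes the camel.
    print(p_first_preferred(["capture_camel", "allows_goal"],
                            ["freeze_enemy_camel"]))   # about 0.04

Since every strength is positive, no team can reach a negative total, but a bad feature (a factor well below 1) can make its team almost arbitrarily weak.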
||||||||||||
Title: Re: lightvector's Thesis Post by 99of9 on May 18th, 2011, 1:12am Ok good, I'm glad it's a problem with my reading rather than the method. |
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 18th, 2011, 1:48am Yes, Fritzlein is spot-on regarding "negative values" for features. on 05/17/11 at 14:27:58, Fritzlein wrote:
Fritzlein: Yes, I was very pleased to see that it was so effective. But I also strongly suspected from the beginning that it would work with at least some good degree of success. Open up a typical Arimaa game and forget about higher-level strategy, just look at what basic "good" features each move has. Almost all moves will have at least one easily recognizable good feature, and often many. And for moves that aren't so obvious, things like Haizhi's step combos (dependency components) as well as closeness to the previous two moves will help distinguish them from random piece shuffling. It's a decent amount of data. About 100000 training instances, where in each one the learner is taught to distinguish the game move from about 100 random moves. Most features get plenty of positive and negative instances. on 05/17/11 at 14:55:44, Fritzlein wrote:
Pruned at random. This was just a sanity check, basically. I hadn't tested the eval function for ordering/prediction. I expect it to do pretty well at predicting the exact move a lot of the time, perhaps better than either learned predictor, but to have much worse tail behavior. Let's run it now! Here are the prediction stats on a little less than 1/3 of the testing set (127 games, 10500 instances):

Top X moves    % of expert moves within top X
    1          10.0371
    2          14.4367
    5          20.4933
   10          24.7786
   50          40.52
  100          48.2716
  500          68.9172
 1000          77.8497

Top % of moves    % of expert moves within top X%
    5             71.0885
   10             80.2685
   20             88.6297
   40             95.8575
   60             98.8573
   80             99.5905

So apparently, it's better than Naive Bayes at getting the exact move, or getting it within the top several moves. But as you go further out, it gets worse. Also, throughout, it's uniformly worse at predicting than Bradley-Terry. This makes sense. Here's my guess at one reason why something like this might happen: In forward pruning, you're trying to do something a little different than in static evaluation. You want the moves that capture as much as possible of the probability mass for being the best, even if they have a good chance of also backfiring. A lot of difficult tactical moves can fall into this category. The evaluation function may score such moves poorly, because if a move opens up a lot of tactics, a static evaluation has no sense of what the true value ultimately is and will misevaluate them. However, the move-prediction algorithms might learn that tactical moves that do things like create threats or trade pieces, even if they appear initially questionable, will also contain a decent part of the probability mass for being the best move, larger than one might expect from the static evaluation. on 05/17/11 at 19:45:13, 99of9 wrote:
Thanks! on 05/17/11 at 14:55:44, Fritzlein wrote:
Thanks! I'm not out of ideas yet! |
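As an aside, prediction tables in the Top-X / Top-% form quoted above are cheap to produce once any ranking function is in hand. A minimal sketch (Python; the ranking function and the test-set format are placeholders, not the thesis code):

    def prediction_stats(test_set, rank_moves,
                         top_counts=(1, 2, 5, 10, 50, 100, 500, 1000),
                         top_fracs=(0.05, 0.10, 0.20, 0.40, 0.60, 0.80)):
        """test_set: iterable of (position, legal_moves, expert_move) instances.
        rank_moves(position, legal_moves) returns the moves sorted best-first."""
        n = 0
        count_hits = {k: 0 for k in top_counts}
        frac_hits = {f: 0 for f in top_fracs}
        for position, legal_moves, expert_move in test_set:
            ranked = rank_moves(position, legal_moves)
            idx = ranked.index(expert_move)        # 0-based rank of the game move
            n += 1
            for k in top_counts:
                if idx < k:
                    count_hits[k] += 1
            for f in top_fracs:
                if idx < f * len(ranked):
                    frac_hits[f] += 1
        return ({k: 100.0 * v / n for k, v in count_hits.items()},
                {f: 100.0 * v / n for f, v in frac_hits.items()})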
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 18th, 2011, 2:46am Here's a selection of logarithm'ed feature values. All from gold's perspective. Most common piece for a given piece type assumed (0 stronger = elephant, 1 stronger = camel, etc). Keep in mind that there are tons and tons of correlated features, which will affect things a LOT. In the extreme case, if two features were perfectly correlated, then each would receive only part of the value, so that their sum gives the full value. So a feature's value may be deceptive, depending on which other features frequently occur with it.

Movement:
  Move Elephant off c3: 1.3942674695377
  Move Elephant off d5: 0.32809975663243
  Move Elephant off d6: -0.32646524296478
  Push/pull rabbit at a3: 0.95719375501162
  Push/pull rabbit at c7: 1.4152185138243

Traps:
  Change # of c3 trap defenders from 1 to 2: 0.2447351795347
  Change # of c3 trap defenders from 2 to 1: -0.20043139578781

Capture:
  Suicide Elephant: -4.0458695575478
  Suicide Rabbit: -3.6913757061848
  Capture Camel: 4.6433340488424
  Capture Rabbit: 3.4709199670508

Freezing:
  Freeze opponent's Camel: 0.89834689641162
  Freeze opponent's Rabbit (7-8 stronger pieces): 0.060348724742266
  Freeze opponent's Rabbit (5-6 stronger pieces): 0.20640037768827

Goal threat:
  Threaten Goal, 1-step goal next turn: 2.6907457101836
  Threaten Goal, 4-step goal next turn: 1.9281538285566

Goal:
  Score Goal: 2.503911619406
  Move Rabbit to c8: 4.9692384474976
  Allow Goal: -6.7643656178349

Dependency structure of move:
  All steps dependent: 0.66186058142058
  Two independent step combos: -0.38533435579265
  Four independent steps: -1.7993513726477 |
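Since these are log-strengths, a move's overall score is simply the sum of the weights of its features (equivalently, the product of the raw strengths). A tiny sketch using a few of the values above, rounded and purely for illustration, shows how one large bonus can absorb an even larger penalty:

    # Log-strengths copied (rounded) from the list above; illustrative only.
    w = {
        "score_goal":        2.50,
        "rabbit_to_c8":      4.97,
        "allow_goal":       -6.76,
        "capture_camel":     4.64,
        "suicide_elephant": -4.05,
    }

    def move_score(features):
        # Sum of the log-weights of the features present in the move.
        return sum(w[f] for f in features)

    # Scoring goal while moving a rabbit to c8 (2.50 + 4.97 = 7.47) outweighs
    # even the heavy penalty for allowing a goal (-6.76).
    print(move_score(["score_goal", "rabbit_to_c8"]))          # about 7.47
    print(move_score(["capture_camel", "suicide_elephant"]))   # about 0.59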
||||||||||||
Title: Re: lightvector's Thesis Post by Hippo on May 18th, 2011, 4:15am on 05/18/11 at 01:48:03, lightvector wrote:
Good explanation. Maybe a similar effect could be obtained from a static evaluation that gives a range rather than a single value ... |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 18th, 2011, 12:40pm on 05/18/11 at 01:48:03, lightvector wrote:
Very lucidly stated! There are many sharp tactical moves which, if they fail, will have a very negative evaluation. For example, I might consider sacrificing my camel for a strong goal attack. If the attack fails, I am down a camel and will lose the game, so giving up my camel is absolutely horrible, one of the worst moves on the board. Yet in spite of the high probability of the move backfiring, it is the type of move that must be examined, because it is unclear in a way that only search will clarify. The go-for-broke attack has a higher probability of being the best move than some random safe move with a higher static evaluation. The safe move, although it has a low probability of being catastrophic, also has a low probability of being brilliant, and thus probably isn't the move you want to play. Static evaluation simply can't be expected to properly handle sharp moves such as sacrifices. Even if you miraculously managed to give a static weight to the goal attack that roughly balanced the static weight given to losing a camel, your summed evaluation would still be wrong. The exchange is certainly not an equal one, certainly not near a neutral move in value. It is either disastrous or brilliant, not in between, and it must be searched to know which. Intuitively, late-move reductions might be better at handling this problem than static evaluation, but still poor for the same reason. There will be critical lines which fare poorly on shallow search, but are still critical lines deserving of being searched at least as deeply as the main line. Late move reductions will not give them due attention. This line of thought reinforces the plausibility of having a completely different pruning function than evaluation function. Both functions should like and dislike some of the same features, such as trap control and greater material, but the pruning function should be asymmetrical. And, voila, your pruning function, by virtue of looking at moves rather than positions, is highly asymmetrical. It emphasizes what you are doing and de-emphasizes what the opponent will do in response. The pruning function should give bonuses for being sharp, unstable, purposeful, i.e. something that might work even if it might also go horribly wrong. It should also want to examine "extreme" moves more than "average" moves, i.e. moves where a player puts all his eggs in one basket are more important than moves where a player does a little of this and a little of that. The pruning function should want to examine races above all, where both sides are slugging it out. By the same token it should be very eager to prune tame positions where the static evaluation is likely to be nearly as good as the evaluation after searching, and to prune tame moves which, even if they are good, are likely to be dominated by a more committed, purposeful version of the move. With this strategy in mind it becomes very plausible that aggressive forward pruning is a good idea. Indeed, the only reason I know of to avoid forward pruning is its poor track record (including in chess, not just for the early GnoBot). But if that poor track record is because people couldn't get away from using eval to do pruning, the conceptual obstacle disappears. In the long run it might turn out that your current, learned pruning function is relatively mediocre compared to what is possible, and the brilliance was merely to do pruning completely independently of how you do evaluation. Does it also matter that you gave moves different features than positions? 
Does it also matter that you trained the pruning function instead of hand-coding it? This could be analogous to the breakthrough in Go-playing bots which at first appeared to be about UCT exploration/exploitation tradeoff, but later turned out to be all about random playouts being decent at evaluation, quite apart from the other features of UCT, such that now one refers more generally to MCTS instead. One has to experiment with variations on the new theme before one knows what parts were essential and what parts extraneous. |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 18th, 2011, 1:47pm on 05/18/11 at 01:48:03, lightvector wrote:
OK, you are helping make it more plausible. It's like the chess adage that a bad plan is better than no plan. The features are a sort of ranking of what things are there to be done, and by picking the highest value thing to be done, you are at least doing something. I have to get used to understanding that this is pruning not evaluation. A move can be purposeful and still quite weak. The pruning function doesn't need to care if the move is weak; search/evaluation will take care of that later. The pruning function just needs to select which moves are worth looking at, and for that purpose looking for independent features seems entirely reasonable. Indeed, I do it myself all the time, for example looking at a move because it pulls a rabbit and only later rejecting it because of what it allows the opponent to do in response. Quote:
Hmm, how much data is in one move being preferred to a hundred others? Surely less data than knowing for fifty pairs of moves which move of the pair is preferred. But I take your point that you have somewhat more data than just 100,000 paired comparisons. Quote:
Ah, too bad. Hopefully I can help persuade you to compare different types of pruning functions to your learned pruning function, starting with pruning by static eval and/or pruning by static eval plus quiescence search. Quote:
OK, since static evaluation is worse at predicting human moves, it would presumably give you less strength gain to prune on the basis of static evaluation than to prune on the basis of trained features. It might even give you a strength loss. Still worth testing to create a baseline. Quote:
Thanks for this sampling of actual trained values. Quote:
Got it. In particular it would look odd that scoring goal is less positive than allowing goal is negative, if not for the fact that moving a rabbit to a given square gets a separate bonus, which is high if that square is on the eighth rank! By magnitude of features, the hierarchy seems to be:

1) Score goal
2) Don't allow goal
3) Capture a piece
4) Don't suicide a piece
5) Threaten immediate goal
6) Haizhi pruning (not four independent steps)
7) Dislodge advanced enemy rabbit
8) Get your elephant off the trap
9) Pull a rabbit
10) Freeze the enemy camel
11) Reposition elephant
12) Protect your home traps
13) Freeze a rabbit

Also from the magnitudes, it doesn't seem likely that several small features will outweigh one big one, i.e. most of the time this function will find the biggest thing that can be done and do it. Or no, again, I have to remember that this function is only used to recommend which moves to look at. It isn't eval. Thus this function finds the biggest thing that can be done and demands that it be examined further. A completely different function decides what to do after that examination. Very plausible, if I can keep it straight in my head. |
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 18th, 2011, 2:20pm Actually, it is quite possible for lots of small features to outweigh big ones, because there are a lot of them. It's not uncommon for a move to have > 20 features. For example, a typical move where the elephant pushes a piece and then two more pieces are advanced a step might have the following features:

  3 moved-own-piece-from-square
  3 moved-own-piece-to-square
  1 moved-opponent-piece-from-square
  1 moved-opponent-piece-to-square
  2 changed opponent trap status at traps X and Y
  2 changed own trap status at traps X and Z
  1 froze opponent's piece of type T
  1 froze opponent's piece at location L
  2 advanced piece with certain influence
  2 features indicating changes in distance of the advanced pieces from the nearest stronger piece
  1 threatens capture of opponent piece
  1 defends capture of own piece
  1 closeness to last move
  1 closeness to last last move
  1 haizhi move structure (3 indep components)

That's a lot of features, and depending on how positive or negative they add up to be overall, there is a lot of potential for outweighing large-weighted features in other moves. Even moves with all 4 steps independent can still get good scores, because if all or most of the 4 steps do things that are very good, then their sum will easily outweigh the penalty. Other major correlations to keep in mind:

  Increase trap defense + move piece to one of the squares adjacent to traps
  Approach stronger piece + freeze own piece
  Push opponent's piece + make capture threat
  Increase trap defense + defend capture threat
  Unfreeze piece + defend capture threat
  Advance rabbit near opponent's goal + make goal threat
  Push opponent's piece adjacent to trap square + decrease opponent trap defense
  Have only 1 feature each for move-from-square and move-to-square + have only 1 dependency component (because the move was a single piece moving for 4 steps)

Etc. These are just some of the obvious ones. There are many many more. The presence of all of these interwoven correlating features would make it quite a pain to weight everything correctly by hand. |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 18th, 2011, 3:01pm on 05/18/11 at 14:20:53, lightvector wrote:
I take your point about the abundance of features and the many correlations between them. It won't necessarily be easy to read off human-comprehensible lessons from a set of thousands of interrelated values. It could be that only a trained algorithm will be able to make sense of the underlying complexity and thereby make good pruning recommendations. Nevertheless, your conclusion is deeply undermined by the analogy to the evaluation function. A priori, you could make exactly the same case for a machine-learned evaluation function and against a hand-tuned evaluation function (i.e. thousands of interrelated weights which would be bewildering to set properly), but you know for a fact that your hand-tuned evaluation function out-performed your machine-learned evaluation function. Contrariwise, you didn't even try the efficacy of a hand-crafted pruning function that has fewer features and more human-comprehensible ones. I am wholly unpersuaded that it couldn't be done, or would perform worse if done by hand. I congratulate you heartily on having made a breakthrough in playing strength by novel means. But you most definitely didn't apply machine learning to something that had previously been done by hand, thereby showing that the machine-learned function surpassed the hand-tuned function. Rather you had already innovated merely by using a pruning function which is asymmetric to the evaluation function, with different features, including features capable of recommending sharp, purposeful moves rather than recommending merely good moves. Since nobody had created such a pruning function before in any manner whatsoever, it is entirely unclear whether machine learning is necessary to the task. You introduced multiple innovations at the same time, and my intuition as to which was the most critical part of the innovation obviously differs from yours. :) At the end of the day, the intuitions that get pursued are the intuitions of the hard workers who actually put in the elbow grease needed to try things out. By no means let me dissuade you from the research that seems most interesting to you, and most likely to bear fruit. But maybe I can make a strong enough case for my own intuition that you will bend your research direction slightly, if only enough to make a case that I'm looking at the problem all wrong. ;) |
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 18th, 2011, 3:45pm Sure, a hand-tuned function might be able to do better, I agree. I just don't want to deal with it right now, when I'm still rapidly changing things, such as adding new features (more trap-control, multiple-piece relations, piece patterns) as well as removing or simplifying features (to train alternate, lighter versions for use in the tree). The advantage of the learning is that when I add a new category with 128 different features, I don't have to meticulously tune and test the values myself. I just toss them in, wait 2 hours, and then I have some excellent weights and can immediately see whether they might be any good. My feeling is also that this method is limited only by the amount of data and by the features themselves. The model itself is reasonably robust given enough data. And with enough data, the math guarantees that you will get the true "optimal" linear function, and moreover the particular definition of "optimal" here happens to correspond decently well to the practical goal of pruning without losing the best move. Some informal testing (I forget the numbers) suggests that accuracy does degenerate a little when you further lower the amount of data. But not too much. So if I can mitigate the immediate issues (which in the short term might just be "buy a few more GB of RAM"), and just train on a somewhat larger proportion of the data, I think the amount of data is okay. There are also hundreds of new high-quality games from the last year that I haven't used, which is a substantial boost, given that I'm basically using only 4000/3 ~= 1300 games of data right now. If data is no longer a problem, and the algorithm is giving something close to an optimal linear function on that data, then this suggests that the major limitations are the features themselves (which are a bit crude and only include the basics) and the fact that the ordering function itself is linear. The weights themselves not being hand-tuned should not be that big of a deal compared to these two issues. You can try to fix the linearity by hand-coding a fundamentally nonlinear evaluation, but that would take a lot of testing. While far from a perfect substitute, it's often easier and far faster to test by coding your nonlinear computation separately and then just feeding the result in as a new set of features! |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 18th, 2011, 4:48pm on 05/18/11 at 15:45:07, lightvector wrote:
And limited by the quality of the data. IdahoEv created a machine-learned material evaluation function, and concluded from his research that an initial cat is worth less than an initial rabbit, based both on the damage done by losing each piece in isolation, and based directly on games in which cat-for-rabbit was the initial trade. We have proof that rabbits are more valuable than cats. Despite this evidence based squarely on the available data, I am unaware of any developer that uses a material evaluation that values a rabbit more highly than a cat as the first capture. In this particular instance, we have collectively decided to trust human intuition more than we trust optimized parameters. This can be justified only in that we mistrust the quality of the data. Of course, in your case you are explicitly trying to mimic strong players. Since almost everyone values an initial cat more than an initial rabbit, I'm sure your optimized weights do so as well. Even so, there is the potential that your own insights are closer to optimal than insights derived, however optimally, from imperfect data. Quote:
OK, so if you are hard-coding the most important inputs to the feature set, the line between a hand-coded function and a machine learned function starts to blur. I guess that is already apparent in the "allows goal" feature; does any other feature have such a high weight as this hand-coded feature which can't be assembled from any combination of more primitive features? If you additionally try to pare the full feature set, by hand, down to the most important few features for the sake of speed, the line blurs further. This process of speeding up the function also tacitly hard-codes human insights for which the machine learning can't then correct. Ultimately you may end up with a lightweight function where you understand the meaning of all the machine-learned values and can read off insights relatively easily. At that point, it might become tempting to say, "Hmm, I see what the machine is thinking, but my intuition is different. Let me tweak the actual values to be closer to my intuition and see whether that helps playing strength." If ever you succumb to that temptation, even briefly to see what happens, I carry my argument. ;) But as long as you remain committed to a large and generic set of features, I do take your point about the value of quickly re-optimizing after each change to the feature set. This can otherwise be a serious problem, for example when adding a new feature that is partially, but not entirely, correlated to existing features. How much to weaken the old weights so as to avoid double-counting? And how important is the new feature compared to existing features? I can imagine how the difficulty of changing the feature set is a major drawback of hand-tuned functions, and that having an automated optimization of relative weights is a big plus in that it allows for a large and rapidly-changing feature set. |
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 18th, 2011, 7:16pm on 05/18/11 at 16:48:30, Fritzlein wrote:
Of course I intend to do that! As long as I can get the computer to do most of the slow and tedious work, I'm happy to tweak things where I think it got it wrong. ;D |
||||||||||||
Title: Re: lightvector's Thesis Post by 99of9 on May 18th, 2011, 9:12pm on 05/18/11 at 15:01:20, Fritzlein wrote:
This doesn't prove the issue either way, because it was never compared like for like, but... Gnobot, and I'm sure many of the other bots have a "hand written" move-ordering function. Although in Gnobot it does not actually do any pruning beyond Haizhi-pruning (it may for other bots), it still affects speed. And obviously it's not equivalent to the eval. Anyway, I agree with the issue you raise Fritz, hand writing may end up better... Ok, so here's another idea for you lightvector. Could you try adding into your features something like EVAL_TOP_10, EVAL_TOP_100, etc? I'd be interested to know how much of the final weightings they bleed away from the move-centric terms. |
||||||||||||
Title: Re: lightvector's Thesis Post by Swynndla on May 18th, 2011, 9:32pm You are kind lightvector, sharing your interesting thoughts for the sake of AI advancement, and in a way that's easy to follow. on 05/18/11 at 12:40:36, Fritzlein wrote:
Fritz, I'm enjoying the clear explanations you are giving too, and thanks to you I now have an understanding of the difference between evaluating a position & evaluating moves (for ordering & pruning) (I hope I'm not abusing those terms). I can follow everything except "the pruning function should be asymmetrical" ... I don't understand what it means for the pruning function (as opposed to the evaluation function) to be asymmetrical in this context? |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 19th, 2011, 12:46pm on 05/18/11 at 21:32:12, Swynndla wrote:
I can see why the term isn't very clear; intuitively there is as much justification for calling it "symmetric" as "asymmetric" :P. At the time that computer chess engines started to challenge strong humans, humans took a growing interest in anti-computer play. This turned out to be a temporary phenomenon. Normally one is interested in playing the "objectively best" move regardless of the opponent. When computers were weak, and playing objectively good chess was enough to beat them, there was no reason to play anti-computer chess. And similarly, now that computers are better than any of us, there is again no reason to play anti-computer chess. Instead we use computers to help figure out the objectively best move, the move we would like to play against our fellow humans. For a time, however, anti-computer play was a hot topic. It emerged that humans could play moves that were objectively poor, or at least questionable, and win anyway because the confused computers couldn't see how to take advantage of the strategic characteristics of the position. Some programmers started changing their programs to avoid certain types of positions, not because the positions were bad per se, but because they knew their programs would not be able to understand how to play in those positions. I remember watching a grandmaster game on the Internet Chess Club. Bob Hyatt was generously giving us lines suggested by his program Crafty. But something interesting emerged. Tacked on to those lines was Crafty's evaluation. It appeared that Crafty thought white was losing by about a pawn, and also that Crafty thought black was losing by about a pawn. When Hyatt was questioned about it, he explained that the central pawn structure was locked, and Crafty knew that it played such positions poorly. Crafty would have been unhappy to play white, and also unhappy to play black, because either way the position had a feature that would put Crafty at a disadvantage. Normally, if an engine thinks a position is -1.0 for white, it must be +1.0 for black. Objectively it must be so. It's a zero-sum game; both players can't be losing. But Crafty was using what they called asymmetric evaluation, where the evaluations don't have to sum to zero, so white was -1.0 and black was also -1.0. Maybe I shouldn't have borrowed the term asymmetric from that context to this one. My excuse is that it came to my mind because it applied to bots wanting to avoid certain types of positions, regardless of which side was objectively leading, and also to seek out certain types of positions, regardless of which side was objectively leading. The same now goes for moves. Instead of wanting only to examine objectively good moves, we want to examine also sharp moves that may be objectively bad. Even the direction of the effect is the same as in Crafty's asymmetric evaluation. Crafty avoided locked, slow-changing positions and favored open, fast-changing positions, which is what I now claim pruning should do, even as evaluation remains "symmetric". Typically in search, you want to spend your time looking at the best moves. Why should I waste time looking at what the opponent will do if I suicide my camel? I'm not going to suicide my camel. It's a waste of time to explore that thought. And recursively I only want to look at the opponent's best replies, etc. I would like to prune out all the bad moves. Pruning out all the bad moves will help me, because I look more deeply at the good moves, on the way to finding the best move. 
Unfortunately for pruning, we might accidentally throw out the best move because it superficially looks bad. I might have a forced goal in three that involves suiciding my camel on the first move. If I prune out all camel suicides, then I will throw away the best move without even looking at it, which will mean pruning has weakened the bot. This is why lightvector is going on about capturing the greatest "probability mass" of the best move. We can't know in advance which move will be best, so we have to retain any moves with a significant probability of being best. One way of guessing which move will turn out best is using static evaluation. But if your static evaluation is symmetric (as I think it should be), then it thinks both players getting a goal attack is the same as neither player getting a goal attack. It cancels out. According to objective evaluation, an equal exchange is just a mediocre, middle-of-the-road move which is probably not best, and therefore can be safely pruned. But this is a lunatic pruning policy: a move which launches a goal race is highly likely to be the best move. Never mind that it is also highly likely to be the worst move; accidentally keeping in the worst move hurts only a tiny bit in lost time. In contrast, throwing away the best move hurts a lot because it creates a blind spot. Lightvector's features for move prediction allow for asymmetry. They permit offense to be ranked higher than defense. Capturing a horse can be a bigger plus than losing a horse is a minus. Threatening goal can be a bigger plus than allowing a goal threat is a minus. Gaining control of an enemy trap can be a bigger plus than losing control of a home trap is a minus. Etc. If any of those asymmetries were part of evaluation, the bot would play like a moron, always trying to do something to you while never stopping you from doing something to it. It would play speculatively and unsoundly and be punished for it. But for the purposes of making sure the best move isn't pruned, the asymmetry of favoring offense over defense is brilliant. Speculative lines must be examined. A move that trades horses is far more important to examine than a move that neither captures nor concedes material, even though they may be evaluated the same. Given this explanation, Swynndla, would you suggest a different term than "asymmetric" to refer to what I am trying to get at? The main point is to throw away as many moves as possible while still capturing all moves that might be best. To do this, one needs to get away from using a function that tries to objectively say who is winning. We can't use symmetric evaluation. But what we are really after is not an evaluation function that is asymmetric in the sense that Crafty's was. We want something else that is a combination of evaluation per se and uncertainty of evaluation. Instead of putting +0.2 ahead of -0.2, we want to put -0.2 plus or minus 7 ahead of +0.2 plus or minus 0.5. The one with the greater uncertainty is more likely to prove best after search. But how does one capture this uncertainty? And what should one name the technique? Offense-first? Sharpness metric? Fog of war barometer? |
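One way to make the "-0.2 plus or minus 7 ahead of +0.2 plus or minus 0.5" idea concrete is to rank candidate moves for pruning by an optimistic bound rather than by the central estimate. A minimal sketch (Python; the value/uncertainty estimator is whatever heuristic a bot can supply, so everything here is an assumed interface, not lightvector's method):

    def prune_by_optimism(moves, estimate, keep_fraction=0.05, k=2.0):
        """moves: list of candidate moves.
        estimate(move) -> (value, uncertainty): a static guess plus a guess at
        how wrong it might be. Ranking by value + k * uncertainty lets sharp,
        hard-to-evaluate moves survive the cut."""
        scored = [(m, *estimate(m)) for m in moves]
        scored.sort(key=lambda t: t[1] + k * t[2], reverse=True)
        keep = max(1, int(keep_fraction * len(scored)))
        return [m for m, _, _ in scored[:keep]]

With k = 0 this collapses back to pruning by static evaluation; a large k keeps the speculative, high-variance moves described above. The hard part, of course, is estimating the uncertainty at all.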
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 19th, 2011, 1:06pm There's another subtle point I didn't think of at first. Throwing away the best move isn't a big deal if we have retained a similar move that is nearly as good. It might be that Haizhi pruning routinely throws out the best move, but we don't care, because it generally leaves behind a nearly-identical move with two dependent steps, and that can be played instead without much loss. I think the important metric is not the probability of retaining the best move, but rather how much worse the best move we retained is than the best move overall. Probably this is a harder metric to use, but it could be good to keep it in mind in any case. |
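That metric can be stated precisely: for each test position, measure how much worse the best retained move is than the best move overall, instead of asking whether the single best move survived. A minimal sketch (Python; true_value stands in for whatever deep-search or oracle valuation is available, which is of course the expensive part):

    def pruning_regret(positions, candidate_moves, prune, true_value):
        """Average value lost to pruning: value(best overall) - value(best retained)."""
        total = 0.0
        for pos in positions:
            moves = candidate_moves(pos)
            kept = prune(pos, moves)
            best_overall = max(true_value(pos, m) for m in moves)
            best_kept = max(true_value(pos, m) for m in kept)
            total += best_overall - best_kept   # zero whenever an equally good move survives
        return total / len(positions)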
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 19th, 2011, 2:06pm on 05/18/11 at 21:12:25, 99of9 wrote:
I could do this when I have time. But given that the eval is a big sum of different terms for different features, and we suspect that the ordering function may want to weight each feature a little differently, it seems more interesting to me to add each term separately. This amounts to adding "trap control" as a move feature (currently it only counts the elephant and # of other defenders), as well as some of the eval's terms for things like hostages, blockades, and frames, and some more features for things like advanced rabbits, all of which the move features lack right now. It might be interesting to extend to non-binary features to allow for "continuous" inputs whose values come from the evaluation. Anyways, regarding hand-coding, one can compare it to doing a multidimensional linear regression - you have a model, and an algorithm, like least-squares, that does a great job of finding the best fit. Likely you can out-do it by hand if you spend enough time, but the best improvement for the work will come from other things. One would prefer to do things like find ways to remove the outliers in the data (filter for blunders, missed goals, lemming captures), and transform the data to eliminate nonlinearities (adding new features). Unless you think the model is untrustworthy and a bad approximation to the task (which you might :) ), reproducing all of the numeric optimization by hand has a poor work-to-improvement ratio so long as there are other things to be done. Indeed, there's no way that the existing, very basic feature set I'm using gives enough information to distinguish the best move that much more accurately than it already does now. I'd spend a lot of time adding and testing new features anyways when hand-coding, so I might as well just do it with the learning algorithm in place, because it does a pretty good job and testing is much much easier. |
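Feeding the evaluation's terms in as move features, as suggested above, might look something like the following sketch (Python; the term names, the bucketing, and the position/eval interfaces are all made up, since the real feature set is binary while the eval terms are continuous):

    def eval_term_features(position, move, eval_terms, thresholds=(-2.0, -0.5, 0.5, 2.0)):
        """Turn continuous evaluation terms (trap control, hostages, frames, ...)
        into binary move features by bucketing the change the move causes."""
        features = []
        before = eval_terms(position)                    # dict: term name -> value
        after = eval_terms(position.play(move))          # hypothetical position API
        for name in before:
            delta = after[name] - before[name]
            bucket = sum(delta > t for t in thresholds)  # 0 .. len(thresholds)
            features.append(f"{name}_delta_bucket_{bucket}")
        return features

Each bucketed feature then gets its own trained weight, which is one way to let the ordering function weight each eval term a little differently, as described above.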
||||||||||||
Title: Re: lightvector's Thesis Post by lightvector on May 19th, 2011, 2:15pm on 05/19/11 at 13:06:41, Fritzlein wrote:
Yep, definitely. You sort of get this as a byproduct too if you're using game records. Because human players certainly don't play optimally, and strong players will disagree in many positions about the best move. So to do well, you have to rank a lot of the good moves highly, because there's no telling which one the human player will pick. Unfortunately, this doesn't stop you from consistently throwing out a certain category of good moves. I wonder if you could detect different "playing styles" for different players. It would be hard to get enough data, but if possible, it could be cool. :) By the way, it would be interesting to try what Fritz suggests regarding asymmetric pruning. The current learning algorithm does this in some ways. For example, passive piece-shuffling moves on one's own side of the board get slightly lower weights because people don't usually play them, even if they get safe evaluations. But it doesn't do it in other ways. If players regularly defend as often as they trade or race, then the learning algorithm will learn the same, and will not favor exploring races any more highly. I'm enjoying this discussion. I'm glad my senior thesis could spawn such a thread, and I'm grateful to everyone for providing input. |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 19th, 2011, 5:08pm on 05/19/11 at 14:15:27, lightvector wrote:
I am not quite sure how this would play out. Your feature set is asymmetric. You have a feature for "push enemy camel to b3" but no feature for "allow my camel to be pushed to b6", so the defensive feature can't be given any weight, never mind a negative weight equal and opposite to the weight of the offensive feature. Similarly you have a feature "advance rabbit to a7" but no feature "prevent opposing rabbit from advancing to a2". The piece-square features by nature look only at what you can do and not at what the opponent can do back to you. For capture you included defensive features as well as offensive features, but it still isn't symmetrical. You have the feature of immediately capturing, allowing a capture next ply, and threatening a capture on the ply after that. The offense gets one more ply than the defense in a capture race, i.e. a move can give me credit for being able to capture in a way that threatens another capture, but my opponent can't get this double credit, even if the race is entirely equal. Again for goals the offense gets an extra ply. I can get credit for immediate goal, my opponent can get credit for goal the ply after that to cancel it out, but only I can get credit for a goal threat the following ply. There is no feature, "allows opponent to make goal threat". Won't the inherent asymmetry of the features naturally bias the move selection to sharp moves? If the move selector can't see the counter-attack, it will at least greedily grab the attack, thus generally overweighting sharp moves compared to tame ones. Quote:
I'm grateful to you for sharing the thesis per se, and for participating in ongoing discussion about it. I hope my ideas are somewhat useful to you, since I am having great fun cranking them out. |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 19th, 2011, 6:02pm on 05/18/11 at 21:12:25, 99of9 wrote:
True, but it is interesting that lightvector's move-ranking function, which gave a significant strength boost when used to prune, gave no measurable boost when used for move ordering. Since the move ordering is redone at each ply of iterated alpha-beta searching, and the re-ordering is done on the basis of search+eval, presumably any non-eval-based ordering will be quickly "overwritten" by the next iteration, unlike forward pruning, which is permanent. In other words, mere move-ordering not only isn't used to prune, it shouldn't be used to prune, whereas the pruning order probably shouldn't be used to permanently order moves, since it would be fighting with eval and slowing down search. So in this respect the existence of hand-coded move-ordering isn't very close to what lightvector is doing, and lightvector's innovation is truly sui generis. Quote:
Good point. I shouldn't have short-changed Haizhi-pruning, which is indeed a form of hand-coded pruning that is completely independent from evaluation. Indeed, now that I recollect, didn't that inspire a discussion of more general non-evaluation-based pruning? And that no developer was interested in trying it because it wasn't clear on what principle to do such pruning? I forget so quickly... |
||||||||||||
Title: Re: lightvector's Thesis Post by jdb on May 20th, 2011, 9:07am Almost any form of pruning, or reductions for that matter, will be an improvement. Increased pruning means greater search depth, which means greater playing strength. Jdb's theorem: In arimaa, 99% of the available moves are garbage 99% of the time. Pick a position and print out all of the legal moves. It's amazing how many of them are pointless. As long as the pruning scheme stays away from the 1%, any method will do a decent job. That's why something like haizhi pruning works. It identifies moves that fall into the 99% category. In order to be safe, some form of verification search is required. For example, a test to make sure a goal in 2 or a capture in 2 has not been missed. This can be done with a dedicated function and takes little time. |
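The verification step jdb describes might look like this in outline (Python; find_goal_in_two and find_capture_in_two are placeholders for the dedicated detection routines he mentions, and prune stands for whatever forward-pruning scheme is in use):

    def verified_prune(position, moves, prune, find_goal_in_two, find_capture_in_two):
        """Forward-prune aggressively, then verify with cheap dedicated searches:
        if a short forced goal or capture exists, make sure its first move survives."""
        kept = list(prune(position, moves))
        for finder in (find_goal_in_two, find_capture_in_two):
            forced_move = finder(position)    # first move of the forced line, or None
            if forced_move is not None and forced_move not in kept:
                kept.append(forced_move)      # don't let pruning hide a forced tactic
        return kept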
||||||||||||
Title: Re: lightvector's Thesis Post by Swynndla on May 21st, 2011, 6:45am on 05/19/11 at 12:46:19, Fritzlein wrote:
Your explanation was extremely clear and informative. Thank you also for the crafty example. You convinced me that asymmetric is indeed the correct term. All this is very thought provoking, and I haven't been able to think about anything else. Quote:
With all my pondering, I thought I might have it all figured out, but I'm not quite there. I understand what you said just fine, but when I read the rest of what you've written and think about how it works ... are you saying that the pruning function looks at only risky/extreme/speculative moves (and prunes all others)? I can't imagine you are saying that, as not many expert moves are like that. You're just saying they shouldn't be excluded, right? - and that that is a clear example of how it differs from the evaluation function. What you're saying is that risky moves should be included along with solid moves (solid moves like threats and trap strengthening, etc) and then the rest can be pruned, right? There are a lot of risky moves. I keep thinking that the Bradley-Terry optimization did so well (ie around 90% of expert moves falling within the top 5% of the ordering, and around 94% within 10%), that there isn't a lot of room to fit all the risky moves in. In fact it seems to me the risky moves are so few and far between, that to exclude them entirely would give better percentages, because most expert moves would be solid moves (ie not sacrificing my camel), and so the expert moves would be predicted within a smaller percentage of the move ordering. If I've explained what I mean properly, then I guess you'll see that I'm a bit confused about it. I have a feeling it's because lightvector's optimization orders moves that are likely to be expert moves not in general, but based on a given position, and that for a given position, it will have only a small number of risky (but likely) moves near the top. Would I be correct? It makes sense to me, although I'm impressed that the optimization function works so well. I suppose it works like this, eg it only looks at sacrificing its camel if there's a potential goal threat (eg advanced rabbit, weakened defense), but at the beginning of the game, a camel sacrifice would be pruned. So an advanced rabbit feature and a weakened defense & supporting friendlies would increase the value given to sacrifices (value to be ordered higher & therefore not as likely to be pruned), and lightvector's optimization would work this out for itself, because strong players will start sacrificing when there are goal threats. These sacrifices by strong players are still fairly rare, so I'm not sure how the optimization is able to kick up the value high enough not to be pruned, as there's more pressure in the data not to sacrifice even in those positions. So nope, it's still not straight in my head (and it's giving me a headache). on 05/18/11 at 01:48:03, lightvector wrote:
Haizhi's pruning I understand, but closeness to the move before, and the move before that I'm not sure about. It doesn't mean that because I moved my rabbit and horse up on the east, that I should consider moving them further, does it? And if I added a defender to a trap, that I should look at adding another? Or does it really mean those sort of things? |
||||||||||||
Title: Re: lightvector's Thesis Post by Fritzlein on May 21st, 2011, 8:50am on 05/21/11 at 06:45:54, Swynndla wrote:
Right. The goal here is not to favor playing risky moves more often than solid moves. The goal is to play the objectively best move, regardless of its nature. For the purposes of pruning, risky moves should be less likely to be pruned, because they have high variance, and might turn out to be the best move even if static evaluation is unimpressed at first. Quote:
This sounds counter-intuitive, but my hunch is that even if most expert moves are solid moves, the move prediction function is going to fare badly if it biases towards solid moves. The point is that initial impressions of risky moves are most likely to be wrong. If you think a solid move is decent (or weak) without searching it, you are probably right. If you think a risky move is decent (or weak) without searching it, you have a good chance of being wrong. The true value of the risky move could be wildly different than your initial evaluation of it. So you can't prune on that initial evaluation. You have to keep the risky move around and look at it just in case. Quote:
I think you are correct, but maybe also missing the point. The point isn't that the expert move predictor has some magical way to tell what move is the best move without searching ahead. That would be strong magic indeed, and you would want to use that function for evaluation as well as for pruning. The point is that the expert move predictor can tell which moves are likely to be the best move. I believe that this means it includes more risky moves in its list than the evaluation function would include in its list of best moves. The evaluation function has to rank from best to worst knowing no more search will happen. This is a harsh constraint not laid on the expert move predictor. The pruning function realizes that more search is yet to come, so it doesn't have to throw away dubious-looking risky moves that might secretly be brilliant. Those can be thrown away later by evaluation if they are still no good when searched. Perhaps this thought experiment will help clarify: Why not use the expert move predictor for post-search evaluation as well? Lightvector couldn't, of course, because it is too slow to use for evaluation. My hunch is that even if the speed weren't a problem, it would work badly. The expert move predictor would play too unsoundly, because too many of the "it might be best" moves will actually turn out to be sub-optimal when searched. Lightvector, would you be interested in making this thought experiment a real experiment? You could pit your machine-learned move-predictor against your hand-tuned evaluation function at one ply. You have two ways to rank the moves: which one plays better in the absence of search? Quote:
No, you are giving the expert move predictor too much credit. It doesn't have fancy connections like "if you control the c6 trap, then advance rabbits in the west". Advancing rabbits in the west has a weight that is independent of trap control. |
||||||||||||
Title: Re: lightvector's Thesis Post by 99of9 on May 22nd, 2011, 5:07am on 05/21/11 at 08:50:53, Fritzlein wrote:
As far as I recall, density of pieces in the locale is included in the rabbit advancement terms. |
||||||||||||
Title: Re: lightvector's Thesis Post by omar on May 25th, 2011, 4:50pm Thanks for posting the thesis, David. I've added it to the Arimaa papers page. I haven't read through it yet, but it looks very interesting. Developing automated techniques for generating the evaluation function is definitely in the spirit of the Arimaa challenge :-) |
||||||||||||
Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1! YaBB © 2000-2003. All Rights Reserved. |