Arimaa Forum » Arimaa » General Discussion (Moderator: supersamu)
Topic: Handicap Order - what beats what?  (Read 7164 times)
Fritzlein (Arimaa player #706)
Re: Handicap Order - what beats what?
« Reply #45 on: Apr 20th, 2008, 10:14pm »

on Apr 20th, 2008, 9:36pm, mistre wrote:
I can't wait to see what your results will bring when you re-run the analysis.  There are quite a few more E handicap games in the database, thanks to me.  Grin

If your elephant handicap games are included, it could significantly distort the value of the elephant.  You probably won a lot more often than one would expect from starting down an elephant, so a function that is optimized to fit that data will do better if it values the elephant less than it should.
 
As for the superiority of LinearAB in predicting winners in games databases, let's just say that when Zombie wins the Computer Championship and I am defending the Arimaa Challenge, I will have my fingers crossed that Zombie still likes trading its camel for a horse and a rabbit, and I get to make that trade at the start of every game. Smiley  For all their flaws, FAME and DAPE correctly prefer having the camel, while LinearAB and DAPE(eo) get it wrong.  But I'm willing to believe that as more pieces get traded, FAME gets progressively less accurate, while LinearAB gets better.  I mostly tuned FAME for opening trades, not midgames and endgames.
 
IdahoEv (Arimaa player #1753)
Re: Handicap Order - what beats what?
« Reply #46 on: Apr 20th, 2008, 10:16pm »

on Apr 20th, 2008, 9:36pm, mistre wrote:
Were the games that were looked at human vs human games? bot vs bot games? human vs bot games? or all 3?

 
Most of the analysis was done with games where both players were rated over 1600, including HvH and BvB games but not HvB games, because there was so much botbashing strangeness in that category and I was trying to represent the true value of the pieces to players honestly trying to win a straight-up game.  Smiley  Some of the analysis was done both ways, with and without HvB games.
 
You can find the original discussion in these three threads:
thread one,  thread two, and thread three.
« Last Edit: Apr 21st, 2008, 2:25am by IdahoEv »

Janzert (Arimaa player #247)
Re: Handicap Order - what beats what?
« Reply #47 on: Apr 20th, 2008, 10:41pm »

The third link above should go to this thread, I believe.
 
If you do end up running the evaluators over new games again, there are two things I would find interesting. First, some sort of cross-validation, where a subset of the games is used for training and the rest for testing. An obvious initial application would be to test all the old constants on the games played since then. Second, in addition to looking at a win-percentage prediction, also check a straight side-to-win prediction.
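The train-then-test idea can be sketched on toy data. Everything below is a stand-in of my own invention (the `games` list, the logistic `predict`, and the grid-search `fit_k`), not the actual analysis code from the earlier threads; it only shows the shape of holding out a test set.

```python
import random
from math import exp

# Toy stand-ins: each "game" is (material score in rabbits, did gold win).
random.seed(0)
games = [(s, random.random() < 0.5 + 0.05 * s)
         for s in [random.uniform(-4.0, 4.0) for _ in range(1000)]]

def predict(score, k):
    # Map a material score to a gold-win probability (logistic shape).
    return 1.0 / (1.0 + exp(-score / k))

def mean_sq_error(data, k):
    return sum((predict(s, k) - (1.0 if won else 0.0)) ** 2
               for s, won in data) / len(data)

def fit_k(train):
    # Crude 1-D grid search for the scale constant k.
    return min((mean_sq_error(train, k), k)
               for k in [x / 10.0 for x in range(5, 100)])[1]

# Cross-validation: tune the constant on one half, score it on the
# held-out half, so the fit is judged on games it never saw.
random.shuffle(games)
half = len(games) // 2
train, test = games[:half], games[half:]
k = fit_k(train)
train_err, test_err = mean_sq_error(train, k), mean_sq_error(test, k)
print(k, train_err, test_err)
```

Testing the old constants on games played since the original fit is the same idea, with the chronological split taking the place of the random one.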
 
Janzert
 
p.s. Another thing I just thought of: instead of scoring evaluators per game, score them per material pattern.
« Last Edit: Apr 20th, 2008, 10:56pm by Janzert »

IdahoEv (Arimaa player #1753)
Re: Handicap Order - what beats what?
« Reply #48 on: Apr 21st, 2008, 2:45am »

Fixed the third link above, thanks.
 
Cross-validation I will definitely do.
 
on Apr 20th, 2008, 10:41pm, Janzert wrote:
Second, in addition to looking at a win percentage prediction also check a straight side to win prediction.

 
I'm not quite certain what you mean by that.    
 
All the material functions simply output a number that represents who is ahead and by how many rabbits.  One could  simply check whether score > 0 and assume that means "gold win", then add a point of error if silver won, and vice versa.     Or you can use another function to convert the score into a probability estimate as to who is more likely to win, and score the difference, so if the eval predicts an 80% chance of a gold win, it accrues 0.2 error  for every gold win from that state and 0.8 error for every silver win.  
 
In the limit of large numbers of sample cases, these amount to the same thing: the training of the coefficients will find the same solution.   When the number of test cases is finite, the probability estimate will find the solution a bit faster and more reliably.
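The two scoring schemes can be written out concretely. A small sketch (the function names are mine, and the 80% example mirrors the one above):

```python
def binary_error(score, gold_won):
    # Scheme 1: score > 0 is read as "gold wins"; a full point of
    # error accrues whenever the predicted side did not win.
    return 0.0 if (score > 0) == gold_won else 1.0

def probability_error(p_gold, gold_won):
    # Scheme 2: error is the gap between the predicted gold-win
    # probability and the actual result (1 for a gold win, 0 otherwise).
    return (1.0 - p_gold) if gold_won else p_gold

# An eval that gives gold an 80% chance accrues 0.2 per gold win
# from that state and 0.8 per silver win, exactly as described above.
print(probability_error(0.8, True), probability_error(0.8, False))
```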
 
Quote:
p.s. Another thing I just thought is instead of scoring evaluators per game, score them per material pattern.
   
 
You mean so that the error function is sum over states(error for state n), instead of sum over states(error for state n)*(number of times n has appeared)?  I suppose one could.  This would cause the functions to attempt to match mid-game and end-game states more strongly than early-game states (relative to the way I did it before), because later states are much less likely to be duplicated in the database.  It also leaves open the question of how to score the individual states, though, when we do have multiple examples.  If state N is won by gold 53% of the time and silver 47%, how much error do we accrue the training function for predicting a gold win?  0.47?  0.0?
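For concreteness, the two error sums being contrasted look like this. The counts, win rates, and predictions below are invented purely for illustration:

```python
# pattern -> (times seen in DB, observed gold-win rate, eval's prediction)
patterns = {
    "R":   (500, 0.55, 0.58),  # one-rabbit lead: very common
    "M":   (80,  0.70, 0.66),
    "EMH": (3,   0.95, 0.80),  # huge handicap: rare in the database
}

def per_game_error(pats):
    # Original scheme: error for state n times the number of times n
    # appeared, so the common one-rabbit states dominate the total.
    return sum(n * (pred - won) ** 2 for n, won, pred in pats.values())

def per_pattern_error(pats):
    # Per-pattern scheme: each distinct material state counts once.
    return sum((pred - won) ** 2 for n, won, pred in pats.values())

print(per_game_error(patterns), per_pattern_error(patterns))
```

In the per-game sum the rare big-handicap state is nearly invisible; in the per-pattern sum it carries as much weight as the one-rabbit state.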
 
I don't intuitively see this as an improvement, though you're more than welcome to try to convince me.  Smiley
Janzert (Arimaa player #247)
Re: Handicap Order - what beats what?
« Reply #49 on: Apr 21st, 2008, 11:12am »

on Apr 21st, 2008, 2:45am, IdahoEv wrote:
All the material functions simply output a number that represents who is ahead and by how many rabbits.  One could  simply check whether score > 0 and assume that means "gold win", then add a point of error if silver won, and vice versa.

 
Yes, this is what I mean. But I'm just interested in seeing the result; I didn't mean for it to be used for training. I'm simply wondering whether some of the evaluators predict the correct side to win more frequently, but when they get it wrong, get it wrong by a larger margin.
 
Quote:
You mean so that the error function is  Sum over states(error for state n),  instead of sum over states(error for state n)*(number of times n has appeared)?

 
Right, although states that have occurred less than some cutoff should be excluded. Perhaps better would be sum over states(error for state n) * log(number of times n has appeared), or some other sub-linear weighting.
 
My reasoning for this is to try and see what the error is over the whole space of possible material states, rather than weighted towards how frequently those states appear in the game database. Basically I was motivated by this comment you made in that third thread.
 
Quote:
Since there are many many examples of a single rabbit loss in the DB - and since those equate to a 55/45 win/loss or so, the optimizer has to work very hard to generate a 0.55 output for that case in order to minimize the error.

 
In regards to,
Quote:
It also leaves open the question of how to score the individual states, though, when we do have multiple examples.   If state N is won by gold 53% of the time and silver 47%, how much error do we accrue the training function for predicting a gold win?

 
I meant to use your previous formula to turn an evaluator score into a prediction percentage, then directly compare against the actual percentage of wins in the database (e.g. since FAME has a K of 2.92, the prediction for a 1 rabbit loss is 0.58, which is a 0.03 error).
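A logistic curve does reproduce the numbers quoted here, so the score-to-percentage conversion presumably has roughly this shape. This is a reconstruction on my part, not the exact formula from the earlier threads:

```python
from math import exp

def win_probability(score, k):
    # Map an evaluator's score (in rabbits of advantage) to a gold-win
    # probability; k is the evaluator's fitted scale constant.
    # Logistic form assumed; the real formula may differ.
    return 1.0 / (1.0 + exp(-score / k))

# With FAME's K of 2.92, a one-rabbit lead maps to about a 58% win
# chance, versus the roughly 55% observed in the database: ~0.03 error.
p = win_probability(1.0, 2.92)
print(round(p, 2))  # → 0.58
```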
 
Janzert
Fritzlein (Arimaa player #706)
Re: Handicap Order - what beats what?
« Reply #50 on: Apr 21st, 2008, 2:14pm »

on Apr 21st, 2008, 2:45am, IdahoEv wrote:
Or you can use another function to convert the score into a probability estimate as to who is more likely to win, and score the difference, so if the eval predicts an 80% chance of a gold win, it accrues 0.2 error  for every gold win from that state and 0.8 error for every silver win.  
 
In the limit of large numbers of sample cases, these amount to the same thing: the training of the coefficients will find the same solution.

That's right, there is the same minimum in both cases.  For example, say that there have only been three material states ever, always a cat for a rabbit, and the side with the extra cat won two out of three.  If my penalty function for predicting percentage P on the side with the cat is (1-P) when I am right and P when I am wrong, then my total penalty function is 2*(1-P) + 1*P = 2-P.  I minimize my penalty by setting P=1, i.e. I should predict 100% for the side with the cat.
 
Something is wrong when you are optimizing a variable P in such a way that the "optimum" value doesn't match observation.  The root cause of this problem is that you are optimizing using the wrong penalty function.  You should penalize squared error.  Then the total penalty function is 2*(1-P)^2 + 1*P^2 = 3*P^2 - 4*P + 2.  What is the value of P that minimizes this penalty?  Astonishingly, it is P=2/3, i.e. the exact fraction of the time that the cat won.  The least-squared-error metric is a wonderful thing...
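The two minimizations can be checked with a quick grid search, just to verify the algebra above:

```python
# Penalty for predicting probability p for the side with the extra cat,
# given that side won two games out of three.
def linear_penalty(p):
    return 2 * (1 - p) + 1 * p            # = 2 - p, minimized at p = 1

def squared_penalty(p):
    return 2 * (1 - p) ** 2 + 1 * p ** 2  # = 3p^2 - 4p + 2, min at p = 2/3

grid = [i / 1000.0 for i in range(1001)]
best_linear = min(grid, key=linear_penalty)
best_squared = min(grid, key=squared_penalty)
print(best_linear, best_squared)  # → 1.0 0.667
```

The linear penalty drives the prediction to the extreme, while the squared penalty settles on the observed 2/3 win rate.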
 
Now that you have posted that detail, I see you were effectively only rewarding the function that was right the most often, and ignoring how much it was right by, even when you brought percentages into the mix.  So the optimization was all provided by the few cases in the middle that are in doubt, and you were only tweaking coefficients to be right in the maximum number of close cases.  I can't help but point out that we should not expect the optimized functions to do well in extreme cases (such as massive handicaps), if the function was optimized on the basis of only close cases.  Moreover, since only the close cases matter, you are effectively throwing away most of your optimization data, and optimizing over a much smaller set, which one can expect to make the results less reliable.
 
Suddenly I am extremely curious to have you re-run your optimization with percentages and the least square error function, so that the scaling actually does matter, i.e. it does matter how much one side is ahead.  The results might be essentially the same, or they might be substantially different.
 
« Last Edit: Apr 21st, 2008, 3:31pm by Fritzlein »

IdahoEv (Arimaa player #1753)
Re: Handicap Order - what beats what?
« Reply #51 on: Apr 22nd, 2008, 1:47am »

Okay Karl, I was with you right until you said this:
 
on Apr 21st, 2008, 2:14pm, Fritzlein wrote:

Now that you have posted that detail, I see you were effectively only rewarding the function that was right the most often, and ignoring how much it was right by, even when you brought percentages into the mix.  So the optimization was all provided by the few cases in the middle that are in doubt, and you were only tweaking coefficients to be right in the maximum number of close cases.

 
Because what you are asking me to do here:
 
Quote:
Suddenly I am extremely curious to have you re-run your optimization with percentages and the least square error function, so that the scaling actually does matter, i.e. it does matter how much one side is ahead.

 
is exactly what I did in 2006/2007, and I'm not following what led you to believe otherwise.  The functions were all optimized by minimizing the least-squared error, computed as the difference between the percentage confidence of a gold win vs. the actual game result for every state in the database.
 
They definitely weren't optimized against only the close cases; they were optimized against all cases (subject to the inclusion criteria described in those threads, i.e. ratings >= 1600, no HvB, no mid-exchange states, etc.).
 
It seems to me that Janzert was asking me to ignore the % confidence and just use a binary error function, and I was explaining why I doubt that would be an improvement.
« Last Edit: Apr 22nd, 2008, 1:48am by IdahoEv »

Fritzlein (Arimaa player #706)
Re: Handicap Order - what beats what?
« Reply #52 on: Apr 22nd, 2008, 7:33am »

on Apr 22nd, 2008, 1:47am, IdahoEv wrote:
Because what you are asking me to do here [...] Is exactly what I did in 2006/2007, and I'm not following what led you to believe otherwise.     The functions were all optimized by minimizing the least-squared-error, computed as the difference between the %age confidence of gold win vs. the actual game result for every state in the database.

Oh, whoops.  I thought that was what you had done, but when you started talking about errors of 0.2 and 0.8 for an 80% prediction, I thought I had mis-remembered.  But I guess I was mistaken in my mistake.  My apologies.  I should have at least checked the original thread and given you the credit I had been giving you!
 
Given that all the material states did in fact contribute to your optimized coefficients, I'll have to think harder to explain why I don't like the results.  I mean, harder than the obvious explanation that my own intuitive material evaluation is wrong.  Tongue
 
But, given how many things I've been wrong about in the past, why think too hard?  I used to think that 99of9 was overly bold to open with two rabbits in front, but now I have four in front every game.  I once said I knew I was winning because I had gotten a camel hostage by sacrificing only two cats, which I'm now confident means I was losing.  My mocking LinearAB for preferring HR to M in the opening may only be fodder for history to laugh at me some more.  Smiley
« Last Edit: Apr 22nd, 2008, 7:47am by Fritzlein »

Janzert (Arimaa player #247)
Re: Handicap Order - what beats what?
« Reply #53 on: Apr 22nd, 2008, 1:05pm »

As the ranting seems to have died down over this, here are my thoughts on trying to rank handicaps by difficulty. These are more or less in the order they occurred to me. Wink
 
First, current evaluators were mostly developed to look at near even piece trades. So it's unsurprising to me that they give unreasonable results when used to compare large handicaps. Also all but the empirically optimized ones by IdahoEv are simply based on current intuition rather than any objective measure. Because of the way the empirically optimized ones are trained I wouldn't expect them to do any better on extreme handicaps either.
 
So what do we actually mean when trying to rank the piece handicaps? One thought I had was to try and relate it to the probability of a game theoretic proven loss. But I can easily give a fairly simple algorithm (involving no search, a p0 bot if you will) that would beat perfect play when playing against any handicap that leaves only cats and rabbits. So on some level you could say that all handicaps leaving only cats and rabbits are equivalent (trivially provably lost). Obviously any difficulty measure needs to be able to distinguish between any, or almost any, handicap.
 
Another aspect is that almost certainly different opponents are going to have different rankings in difficulty for various sacrifices. Almost certainly even to the point of handicap 1 being impossible and 2 being possible against opponent A but the reverse against opponent B. But of course we want to establish a general ranking that in some way represents an overall ordering of difficulty, i.e. free of 'opponent bias'.
 
My current idea for an objective, bias-free, although theoretical, measure of difficulty would be to look at certain features of the game tree derived from a given handicap. The first thought I had was to use the percentage of leaves that are wins, lower percentages being more difficult of course. This has the potential problem, though, of a given tree having many early critical moves followed by a period where most moves lead to wins. This could lead to it having a percentage just as high as, or higher than, another tree that has several viable choices at every stage throughout the game and is therefore easier to play. A better metric that avoids this problem would be one that looks at the interior nodes instead of just the leaves; perhaps the percentage of critical moves as the game progresses, or maybe something involving the average length of critical move sequences. Another potential problem with any similar approach is that by definition 'good play' does not travel uniformly through the game tree. But I'm afraid any attempt to correct for that will probably lead to opponent bias in the results.
 
Of course it is beyond any current, or apparent near future, resources to construct and/or look at the whole game tree for even a single handicap. I wonder though if some sort of random sampling could produce useful results.
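The random-sampling idea can at least be sketched on a toy tree. The tree below is hand-built and stands in for a real game tree, which is far too large to enumerate; the point is only to show the shape of estimating win fractions by random descents:

```python
import random

# A node is either a terminal result ('win'/'loss' for the handicapped
# side) or a list of child nodes.
def random_leaf(node, rng):
    # One uniform random descent from the root to a leaf.
    while isinstance(node, list):
        node = rng.choice(node)
    return node

def estimate_win_fraction(root, samples, seed=0):
    rng = random.Random(seed)
    wins = sum(random_leaf(root, rng) == "win" for _ in range(samples))
    return wins / samples

# One early critical move, then mostly wins down the second branch.
tree = [["loss", "loss", "loss"],
        ["win", "win", ["win", "loss"]]]
est = estimate_win_fraction(tree, 10000)
print(est)
```

One subtlety any real attempt would have to address: uniform descent weights shallow leaves more heavily than deep ones, so this estimates a playout-weighted win rate rather than the raw fraction of winning leaves.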
 
Anyone have further thoughts on this or other ideas?
 
Janzert
IdahoEv (Arimaa player #1753)
Re: Handicap Order - what beats what?
« Reply #54 on: Apr 22nd, 2008, 2:14pm »

on Apr 22nd, 2008, 7:33am, Fritzlein wrote:

Given that all the material states did in fact contribute to your optimized coefficients, I'll have to think harder to explain why I don't like the results.  I mean, harder than the obvious explanation that my own intuitive material evaluation is wrong.  Tongue

 
Cognitive dissonance and confirmation bias are, after all, the reason we have the scientific method.  Tongue  You can see evidence of my own biases in the newly-reconstituted Material Eval II thread, defending my evaluator against a big fault demonstrated by 99:  I'm convinced my own system is correct despite evidence to the contrary.  Smiley
 
What I can tell you is that as of the last time I ran the data, the coefficients of LinearAB and optimized DAPE -- coefficients which lean more towards piece number and less towards piece size, relative to human expert intuition -- were definitely supported by the actual game history, both in the aggregate of all states, and in the specifics we examined like M vs. HR.   I was as surprised by these coefficients as you are.   One can argue that the game history doesn't actually represent the value of the pieces, and you might even be correct!  But it's definitely an uphill argument.
 
Your argument about causality is IMHO the strongest argument here.  We don't know if HR captures are causing wins more often, or if winning play is causing HR captures more often.    Put statistically, there's no way to tell whether the database results are measuring p(win|HR) or p(HR|win), and poor Thomas Bayes will spin pirouettes in the ground if we assume those are the same thing.
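The p(win|HR) vs. p(HR|win) distinction can be made concrete with Bayes' rule. All the numbers below are invented for illustration:

```python
# Suppose winners captured a horse+rabbit in 30% of their games and
# losers in only 10%, with a 50% base rate of winning.
p_win = 0.5
p_hr_given_win = 0.30   # p(HR | win)
p_hr_given_loss = 0.10  # p(HR | loss)

# Bayes' rule: p(win | HR) = p(HR | win) * p(win) / p(HR)
p_hr = p_hr_given_win * p_win + p_hr_given_loss * (1 - p_win)
p_win_given_hr = p_hr_given_win * p_win / p_hr
print(round(p_win_given_hr, 2))  # → 0.75
```

The two conditionals (0.30 and 0.75) are nothing alike, which is exactly why a raw database correlation between HR captures and wins cannot by itself separate cause from effect.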
 
But consider this:  being a winning player would tend to impose your understanding of the game on the database, because you would play in the way you believe is right, and the win (since you're a strong player) would demonstrate the "rightness" of your approach, even if it's not actually optimal.   So the database evidence should be skewed towards the human belief that bigger pieces are much more valuable, simply because the top players have been playing that way and winning.   So maybe even LinearAB is overvaluing the big pieces!
 
In any case, the database history is the only set of objective data we've got.  Despite possible (but unproven) flaws in it, my instinct as a scientist is to trust data over human intuition.  And if I'm the only bot developer taking that approach, believe me, that's perfectly fine with me!  If there's any modest possibility it's the correct approach, I'd certainly prefer to be the only one doing it.  Tongue
 
Edited:  Move 'grammar'.  For great clarity.
« Last Edit: Apr 22nd, 2008, 2:22pm by IdahoEv »

Fritzlein (Arimaa player #706)
Re: Handicap Order - what beats what?
« Reply #55 on: Apr 22nd, 2008, 3:43pm »

on Apr 22nd, 2008, 1:05pm, Janzert wrote:
Another aspect is that almost certainly different opponents are going to have different rankings in difficulty for various sacrifices. Almost certainly even to the point of handicap 1 being impossible and 2 being possible against opponent A but the reverse against opponent B.

Hey, that's a pretty strong argument in favor of my proposal to list both handicaps any time the two are incommensurate!
 
Quote:
But of course we want to establish a general ranking that in some way represents an overall ordering of difficulty, i.e. free of 'opponent bias'.

Do we really want this even if it is wrong?  If there is one bot that can be beaten more easily with RRR and another that can be beaten more easily with CR, do we want to lock ourselves into recognizing only one as the greater achievement in both cases?
 
It seems that trying to get this definitive relative ranking is like trying to get a definitive answer as to whether a camel is worth 4 rabbits or 5.  Why break your back tuning it to 4.38051 rabbits when you know it is only an average, and sometimes the camel will be worth more than that and sometimes less?  Instead we try to build dynamic evaluations that recognize that how much a camel is worth depends on the situation.
 
Instead of trying to say in advance what is best for a certain bot, we can instead keep all incommensurate handicap records, and see what all different handicaps can be achieved...
aaaa (Arimaa player #958)
Re: Handicap Order - what beats what?
« Reply #56 on: Apr 22nd, 2008, 6:25pm »

If we want to try to make any headway with the theory of material evaluation, we should work out the various relevant considerations that could be captured in intermediate variables. It may even be the case that four of those, namely those that stand for "army strength" and "goal threat" for each side, are not enough.
Janzert (Arimaa player #247)
Re: Handicap Order - what beats what?
« Reply #57 on: Apr 22nd, 2008, 6:28pm »

Regarding handicaps having different difficulties against different opponents:
on Apr 22nd, 2008, 3:43pm, Fritzlein wrote:
Hey, that's a pretty strong argument in favor of my proposal to list both handicaps any time the two are incommensurate!

 
I'm not sure how the word "incommensurate"1 applies here, but I'll take a stab at interpreting your meaning to be "roughly equal" (i.e. commensurate).
 
I think either every handicap that is accomplished has to be listed2 or you have to define a full ordering. We could judge some handicaps to be equal but the more of those there are the less useful the ordering is.
 
But even disregarding that, I think the opponent bias is going to be larger than just roughly equal handicaps being possible against one opponent and not another. I wouldn't be surprised to see a bot against which an E handicap is possible but not an M handicap, while others would be more likely to have the reverse.
 
Regarding a general list of handicaps:
Quote:
Do we really want this even if it is wrong?  If there is one bot that can be beaten more easily with RRR and another that can be beaten more easily with CR, do we want to lock ourselves into recognizing only one as the greater achievement in both cases?

 
Actually I think we want to answer yes to both questions or at least yes with qualifications. Wink
 
Without a general list we have to define an ordering for each bot. I believe it is impossible to make a bot-specific ordering that includes handicaps that have not yet been accomplished. Even just stating whether a handicap is possible or not would seem to be impossible, never mind creating a more fine-grained ordering. This leads to the problem of defining the finish line after the race is over. Also, even after various handicaps have been accomplished, I'm not sure how to simply define an objective measure of difficulty against the specific bot.
 
Having a general list allows predefined goals and reduces the burden on the community to making only one ordering instead of one for each bot. Even if, as more Arimaa knowledge is gained, it is decided that some specific handicap ordering needs to be changed, I think this is a better solution than trying to make separate orderings for all bots.
 
Janzert
 
1 (adj) incommensurate (not corresponding in size or degree or extent) "a reward incommensurate with his effort".
 
2 I seem to recall a proposal a long time ago to simply list the first instance of every handicap that was done and some very vocal resistance to the idea.
Fritzlein (Arimaa player #706)
Re: Handicap Order - what beats what?
« Reply #58 on: Apr 22nd, 2008, 7:31pm »

on Apr 22nd, 2008, 6:28pm, Janzert wrote:
Regarding handicaps having different difficulties against different opponents:
 
I'm not sure how the word "incommensurate"1 applies here, but I'll take a stab at interpreting your meaning to be "roughly equal" (i.e. commensurate).

Read reply #24 in this thread to understand what I meant by incommensurate, and my proposal to deal with it.  But I used the wrong word.  I meant "incommensurable", meaning they can't be compared.
 
We can know for sure that CRR > CR > RR, but we can't compare CR to RRR or DRR to CCR.  My idea is not to have a separate, fully-ordered ranking for each bot, but to treat as records all handicaps for a bot that aren't clearly beaten by some better handicap.
99of9, Gnobby's creator (player #314)
Re: Handicap Order - what beats what?
« Reply #59 on: Apr 22nd, 2008, 7:59pm »

If we're willing to have more than one record per bot (I estimate 3-6 for most bots), then partial ordering is a great way to go, because it removes the need to rely on any material eval at all.
 
Even if we go down that track, I think it is worth having the conversation about which handicaps are harder in general, but it would certainly take the angst out of that conversation.
 
Regarding mistre's suggestion of one list per person: I agree with Arimaabuff that this is not the best way to go on the main "records" page, because the competition does help spur us on.  However, I agree with mistre that this would be helpful for newer or less competitive botbashers, so I recommend setting this up as a subpage (where you can fit many more people).
Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1!