Arimaa Forum - Print Page


    
      
        Arimaa Forum
        (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
      

        Arimaa >> General Discussion >> Adapting WHR to Material Handicaps
        
(Message started by: bestaludulo on Oct 20^th, 2020, 7:33pm)

Title: Adapting WHR to Material Handicaps
Post by bestaludulo on Oct 20^th, 2020, 7:33pm

# How is WHR Adapted to Handicap Games?

## How to Even Start?

Arimaa has 864 possible handicaps including the even game. Do each of the handicaps get a rating? Probably. How is it calculated? WHR as for the player ratings? Or would Baysian Elo be better since one does not expect the value of a given handicap when played at the same rating (handicaps will not have the same value in WHR rating points across different skill levels) to ever change, unlike player ratings?

Presumably each handicap might be associated with some `function(rating) = handicapValue` where "rating" is the average of the two players ratings, and "handicapValue" is the number of WHR rating points would be added to the rating of the player being given the handicap in order for the WHR system to correctly predict the win chance of both players without knowledge of the handicap? Thus when any two players play, the system would get their average/median rating, and plug it into the associated functions of each (optimizations definitely possible) handicap, and suggest the handicap for which the rating of the weaker player plus the result of the suggested handicap's associated function is a close as possible to the rating of the higher ranked player?

## Processing Games

### Player Ratings

Once the game is played (and it might have been played at any handicap at all: even one wildly inappropriate like me giving browni3141 7 rabbits) two systems must appropriately handle the result. The normal WHR would need to take into account that the expected result has been influenced by the used handicap: and that handicap may not be at all what the system suggested: the system can not rely on all games being roughly even.

I am still struggling to grok or even understand WHR (and, more generally, Bayes' Theorem), but my intuition says that this particular adjustment would be fairly trivial for someone who grokked vanilla WHR.

### Handicap Ratings

The handicaps must also be adjusted, though. I feel that me trying to think of how to do this is like someone who has no idea how baseball works except that it uses a baseball bat, and deciding to try and run around clocking people over the head with it because maybe that's how baseball works? Is there a parallel WHR system? Is it all the same system? are the rating points for the handicaps in a 1:1 correspondence with the rating points difference which they cover? If so, how is the different worth of the same handicap across skill levels handled? If not, how are the handicap ratings converted to and from suggested handicaps? How does one prevent degenerate behavior with two systems influencing eachother: seems like a recipe for feedback loops. Is WHR used or Baysian Elo or something else entirely?

## Handicap Relative Size

Presumably a solid system such as WHR or Baysian Elo would correctly evaluate the strength of different handicaps relatively quickly and accurately, but with 864 handicaps to rate, most of them absurd, and no domain-specific knowledge of their relation to eachother before WHR starts to work its magic one game at a time, I fear it would take a lot of seemingly random suggested handicaps before the system learned that 7 rabbits 2 horses and an elephant (`[7, 0, 0, 2, 0, 1]`)is not much more appropriate a handicap for most matchups than 6 rabbits 2 horses and a camel (`[6, 0, 0, 2, 1, 0]`) would be. But some handicaps, such as the two seen above, are just straight up <i>better</i> than others. I do not know how or even if such knowledge could be used to bootstrap the system (maybe setting the initial ratings roughly or seeding the system with some sample games demonstrating the relative strengths of handicaps for the initial prior?) but the following metric seems like it would go a long way towards reducing the number of games which would need to be played against crazy handicaps before the vast majority of them gained enough rating to virtually never be seen suggested, and to be decently accurate when they were used anyway:

let h1, h2 be handicaps denoted by arrays a1[6], a2[6] respectively.

let subtraction of two arrays be defined as returning an array with length equal to the longer of the two input arrays (extra 0s being imagined to be appended to the shorter array) with out[n] = arrayIn1[n] - arrayIn2[n] for each n in out[]. I don't know if that makes sense as subtraction, but call it something else if you like; this definition is useful here regardless of what it's called.

let tempArray = a1 - a2

for (int x = 5; x >=0; x--) {
if tempArray[x] != 0 {
if tempArray[x] < 0 {
break; // (not sure if break works like this; but this is all pseudo-code at best anyway): exit for loop: h1 cannot be said to be strictly better than h2 by this metric.
} else {
tempArray[x] = tempArray[x] - 1;

for (int y = x; y <= 0; y--) {
if tempArray[y] < 0 {
tempArray[y] = tempArray[y] + 1;
break; // again, probably wrong syntax: my intent is to break out of the inner for loop but not the outer here
} // end if
} // end inner for
x++; // counteract the decrementing of the counter because I shouldn't have used a for loop here. :D
} // end < 0 if-then
} // end != 0 if-then
} // end outer for

There's definitely a cleaner way to do this (syntax highlighting and debugging might help, lol) but the idea is that once it exits, if tempArray has no negative values, then h1 can be said to be strictly better than h2 by this metric. It of course doesn't work if h1 = h2, but why would you test that?

If applied to every possible pair of handicaps, this will create a lot of directed paths between handicaps where the arrow is pointing towards a handicap strictly smaller than the handicap the arrow originates from. Since all this is transitive, redundant arrows can be skipped. Perhaps this can somehow expedite the process of rating the handicaps?

## Conclusion

Does anyone have any idea how to do this? I'm sure someone must have implemented it, maybe for Go handicap stones or something, but I can't find anything by googling. Any links to good explanations of WHR would also be awesome if you have them on hand; maybe something will click with me. How did everyone else begin to grok WHR? Would going line by line through one of the implementations of it (it seems there's a Ruby one and a Python adaption of the same one on github) be helpful, or would I just be wasting my time if I don't understand the math from Column's paper first?

Title: Re: Adapting WHR to Material Handicaps
Post by clyring on Oct 23^rd, 2020, 11:05am

The code that actually performs the update calculations in a WHR system isn't especially hard to follow, but may also not be all that enlightening if you've read Coulom's paper (https://www.remi-coulom.fr/WHR/WHR.pdf). In this implementation (https://github.com/goshrine/whole_history_rating), the relevant code is at line 98 of player.rb (https://github.com/goshrine/whole_history_rating/blob/3940a70a01b46878ce55636821270bf3915655c4/lib/whole_history_rating/player.rb#L98), and follows appendix B of Coulom's paper almost exactly. I recall my own implementation performed Gaussian elimination without saving the intermediate coefficients of the LU decomposition, but this is equivalent.

If there's any recent research on properly handling handicaps in fully Bayesian rating systems, I haven't heard about it. (But I am not omniscient!) From what I understand, rating systems in use in the Go world deal with the differing values of handicap stones at different levels of play by adding a few parameters to the system that describe how this is pre-supposed to happen. I don't know how rating system designers have determined these parameters in practice, but in theory they can be tweaked either by hand or by a meta-optimizing program if they seem off.

At a quick glance, the EGF rating system uses a linearly-varying a to do this (see this page (https://senseis.xmp.net/?EGFRatingSystem) and its sources); The KGS rating system uses some value k which varies with rating in a manner I wasn't able to quickly find (see this page (https://senseis.xmp.net/?KGSRatingMath) and its sources).

A Bayesian learned function model is certainly possible in theory, but performing approximate inference quickly on such a model will likely mean reducing the space of candidate functions and/or finding clever ways to reduce the (potentially literally infinite) dimensionality of the candidate function space without losing too much information. Devising such a system would certainly be an accomplishment!

Trying to learn 863 nontrivial handicap amounts simultaneously within a fairly transparent Bayesian model seems unnecessarily difficult. I doubt there will ever be much demand for a rating system to handle more than a handful of standard handicap amounts, even if handicap Arimaa games become popular.

Title: Re: Adapting WHR to Material Handicaps
Post by bestaludulo on Oct 27^th, 2020, 7:36pm

I guess it's good to know that the code doesn't really add much that the paper doesn't. I have read the paper, I suppose I will reread the paper and see if I can understand more this time around.

That's essentially what they did on [OGS](online-go.com): they have glicko-2 in the background, and found a best-fit curve to relate the glicko-2 ratings with kyu-dan rankings. The inelegance of this bothers me somewhat, but if it's stupid and it works, it's not stupid.

I did knock around in my head the idea that the function mapping handicaps to WHR adjustment would have to be constrained in some way; I'm glad to see you thinking the same thing. The OGS formula is of the form `a - {[ln(WHR/b)] * c}`. Perhaps if we assume that this form is correct, varying a, b, and c would be feasible? But now you're trying to find a point in 3d space. Or a point in a 3d solid in 4d space, since WHR also gets an axis. I suppose we can drop `a` since it's only there to give kyu-dan numbers, so that's only two variables plus WHR (probably the midpoint between the two players WHR ratings). Then one is finding a minimum on the surface for different values of `a` and `b`. Is there any reason to assume that there would be only one minimum, and that there would exist some computationally efficient algorithm for finding it quickly? If the surface had some bumps in it, then guessing a minima outside of the bump could result in estimates approaching infinity for a and b, going ever further away from the actual minimum... Of course, this assumes we know this surface, which we don't. Wait, but don't we? If, instead of treating the expression as only one side of an equation, we take it as is and solve for WHR, we have `f(b, c, r) = b * e^(r/c)` where r is the rating adjustment to estimate the effective strength after handicap of the weaker player, I think. But this formulation just makes it clear that it is actually a point in 4d space. But it can (I assume) be solved as is, but my brain is starting to misfire trying to grok what good that does. It's like the feeling when you're reading out a move and you're at the point where you can't quite read further without losing track of the moves you've already read.

I guess the end goal is that each handicap has a function `f(x, y)` such that if you took some cross-section defined by some constant `y` value, `g(x)` would be the probability distribution of the given handicap being worth different `z` rating points, if the midpoint rating of the two players is `y`. Thus it's only a surface in 3d space. Now I'm getting two contradictory answers via two different lines of reasoning: this bodes well for me being anywhere near correct. Ahh. The previous reasoning was also trying to solve for two constants, whereas this reasoning implicitly assumes that any constants are already solved for.

Regarding finding the relative strengths of 800+ handicaps simultaneously, what about only allowing the system to recommend handicaps with a standard deviation less than or equal to a semi-arbitrarily chosen number. At first, when asked for a recommended handicap between two given players of a certain rating, the system would return some message such as "No appropriate handicaps found.". The players might then decide to play a handicap they feel is about right and enter it into the system. After a bit of this, the standard deviations of the popular handicaps will fall below the threshold and the system will begin recommending them. Perhaps the threshold could also be eventually raised if the system gets enough data eventually.

I do agree that the vast majority of handicaps aren't really worth bothering with; but it feels arbitrary to just unilaterally choose some to offer. If I wasn't scared by how big a number 864^2 was, I'd like to expand it to both the weaker and stronger player sacrificing 0 or more pieces, so that handicaps like "M for a H" could be used. But 746,496 is, ummm, a big number. The only way it could be plausible is with a very large number of games to train the system, or with a very good way to seed the relative strengths of the handicaps.