There are four tests used to evaluate the performance of a rating system: drift, convergence, disruption, and step change. A simulation sketch of the first two follows the test descriptions.

Drift - measures how far the calculated ratings drift from the true values. A pool of P (usually 10) players is created, each with a known true rating chosen randomly in [1000, 2000) points. The calculated ratings are initially set to the true ratings. Games are played by randomly selecting two players, with the outcome determined from the true ratings using a normal-distribution outcome function FO(r1, r2). The calculated ratings are updated based on the outcome of each game. Px100 games are played to allow the calculated ratings to drift, then Px10 more games are played while the system error is measured and normalized to represent rating points. A total of 100 trials are run and the results averaged to get the final drift error value.

Convergence - measures how long it takes for the system error to drop below the drift level. A pool of P players is created as in the drift trials, but the calculated ratings are all initially set to 1500. Games are played by selecting two players and determining the outcome as in the drift trials, with the calculated ratings updated after each game. The trial stops when the system error drops below the drift level. The number of games played is normalized to give the number of games each player must play before the ratings converge. A total of 100 trials are run and the results averaged to get the final convergence rate.

Disruption - measures how much disruption is caused when new players are added to a pool of established players. A pool of P players is created and initialized as in the convergence trials. Half of the players are set aside while the other half play games and update their calculated ratings, as in the convergence trials, until the system error falls below the drift level. Once the system error has converged, all P players are allowed to play and the average system error is measured until it again drops below the drift level. The average system error, normalized to represent rating points, and the number of games each player must play before the ratings settle are both averaged over 100 trials.

Step change - measures how much disruption is caused when some of the established players' ratings shift up or down by 100 points. A pool of P players is created and allowed to play as in the convergence test until the system error drops below the drift level. A quarter of the players' ratings are then increased by 100 points and a quarter are decreased by 100 points. The players play again until the system error drops below the drift level. The system error and the number of games played by each player are averaged over 100 trials.
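To make the simulation concrete, here is a minimal sketch of the drift and convergence tests in Python. The text describes FO(r1, r2) only as a normal-distribution outcome function, so it is modeled here as a normal CDF with an assumed spread SIGMA; the rating update is a generic Elo-style step with an assumed factor K, standing in for the actual systems tested; and "system error" is read as the RMS difference between calculated and true ratings. All three are assumptions made for illustration.

    import math
    import random

    P = 10         # pool size
    SIGMA = 200.0  # assumed spread of the normal outcome function (not in the text)
    K = 32.0       # assumed Elo-style update step (stand-in for the real systems)

    def fo(r1, r2):
        """Probability that player 1 beats player 2: normal CDF of the rating gap."""
        return 0.5 * (1.0 + math.erf((r1 - r2) / (SIGMA * math.sqrt(2.0))))

    def play_game(true_r, calc_r, i, j):
        """Decide a game from the true ratings, then update the calculated ones."""
        score = 1.0 if random.random() < fo(true_r[i], true_r[j]) else 0.0
        expected = fo(calc_r[i], calc_r[j])
        calc_r[i] += K * (score - expected)
        calc_r[j] -= K * (score - expected)

    def system_error(true_r, calc_r):
        """RMS gap between calculated and true ratings, in rating points."""
        return math.sqrt(sum((c - t) ** 2 for c, t in zip(calc_r, true_r)) / len(true_r))

    def drift_trial():
        """Drift test: start at the true ratings and let them wander."""
        true_r = [random.uniform(1000, 2000) for _ in range(P)]
        calc_r = list(true_r)              # calculated ratings start exact
        for _ in range(P * 100):           # burn-in: let the ratings drift
            play_game(true_r, calc_r, *random.sample(range(P), 2))
        total = 0.0
        for _ in range(P * 10):            # measurement phase
            play_game(true_r, calc_r, *random.sample(range(P), 2))
            total += system_error(true_r, calc_r)
        return total / (P * 10)

    def convergence_trial(drift_level):
        """Convergence test: start everyone at 1500, play until the error
        falls below the drift level, and report games per player."""
        true_r = [random.uniform(1000, 2000) for _ in range(P)]
        calc_r = [1500.0] * P
        games = 0
        while system_error(true_r, calc_r) >= drift_level:
            play_game(true_r, calc_r, *random.sample(range(P), 2))
            games += 1
        return 2 * games / P               # each game involves 2 of the P players

    drift = sum(drift_trial() for _ in range(100)) / 100
    rate = sum(convergence_trial(drift) for _ in range(100)) / 100
    print(f"drift: {drift:.1f} points, convergence: {rate:.1f} games/player")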
After trying various rating systems it was found that the choice among reasonable systems does not make a large difference in performance; the gap between the best system and the worst is about a factor of 2. However, it is very easy to break a rating system so that it fails to converge or has a very high drift level. The formulas do have to be selected carefully, but among well-behaved systems there is not much difference in performance. Inserting ad hoc factors into the rating formula does not significantly improve its performance and runs the risk of making the system unstable. There is also a strong inverse relationship between the drift level and the convergence rate: it takes a very large number of games to converge to a low drift level. The os19 system was selected for Arimaa.
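This tradeoff shows up directly in the sketch above (the snippet below continues it, so the same assumed Elo-style update applies): sweeping the step size K, a larger step converges in fewer games but settles at a higher drift error. The specific numbers reflect the assumed update rule and SIGMA, not any of the systems actually tested.

    # Continues the sketch above: sweep the assumed step size K.
    # Larger K converges faster but leaves a higher drift error,
    # illustrating the drift/convergence tradeoff described here.
    for k in (8.0, 16.0, 32.0, 64.0):
        K = k
        drift = sum(drift_trial() for _ in range(100)) / 100
        rate = sum(convergence_trial(drift) for _ in range(100)) / 100
        print(f"K={k:5.1f}  drift={drift:6.1f} pts  games/player={rate:7.1f}")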