There are four tests used to evaluate the performance of a rating system: drift, convergence, disruption, and step change. A simulation sketch of the first two follows the test descriptions.

Drift - measures how far the calculated ratings drift from the true values. A pool of P (usually 10) players is created, each with a known true rating chosen randomly in [1000, 2000) points. The calculated ratings are initially set to the true ratings. Games are played by randomly selecting two players, with the outcome determined from the true ratings using a normal-distribution outcome function FO(r1, r2). The calculated ratings are updated based on the outcome of each game. Px100 games are played to allow the calculated ratings to drift, then Px10 more games are played while the system error is measured and normalized to represent rating points. A total of 100 trials are run and the results averaged to get the final drift error value.

Convergence - measures how long it takes for the system error to drop below the drift level. A pool of P players is created as in the drift trials, but the calculated ratings are all initially set to 1500. Games are played by selecting two players and determining the outcome as in the drift trials, with the calculated ratings updated after each game. The trial stops when the system error drops below the drift level. The number of games played is normalized to give the number of games each player must play before the ratings converge. A total of 100 trials are run and the results averaged to get the final convergence rate.

Disruption - measures how much disruption is caused when new players are added to a pool of established players. A pool of P players is created and initialized as in the convergence trials. Half of the players are set aside while the other half play games and update their calculated ratings, as in the convergence trials, until the system error falls below the drift level. Once the system error has converged, all P players are allowed to play and the average system error is measured until it again drops below the drift level. The average system error, normalized to represent rating points, and the number of games each player must play before the ratings settle are both averaged over 100 trials.

Step change - measures how much disruption is caused when some of the established players' ratings shift up or down by 100 points. A pool of P players is created and allowed to play as in the convergence test until the system error drops below the drift level. A quarter of the players' ratings are then increased by 100 points and a quarter are decreased by 100 points. The players play again until the system error drops below the drift level. The system error and the number of games played by each player are averaged over 100 trials.
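To make the simulation concrete, here is a minimal sketch of the drift and convergence tests in Python. The text describes FO(r1, r2) only as a normal-distribution outcome function, so it is modeled here as a normal CDF with an assumed spread SIGMA; the rating update is a generic Elo-style step with an assumed factor K, standing in for the actual systems tested; and "system error" is read as the RMS difference between calculated and true ratings. All three are assumptions made for illustration.

    import math
    import random

    P = 10         # pool size
    SIGMA = 200.0  # assumed spread of the normal outcome function (not in the text)
    K = 32.0       # assumed Elo-style update step (stand-in for the real systems)

    def fo(r1, r2):
        """Probability that player 1 beats player 2: normal CDF of the rating gap."""
        return 0.5 * (1.0 + math.erf((r1 - r2) / (SIGMA * math.sqrt(2.0))))

    def play_game(true_r, calc_r, i, j):
        """Decide a game from the true ratings, then update the calculated ones."""
        score = 1.0 if random.random() < fo(true_r[i], true_r[j]) else 0.0
        expected = fo(calc_r[i], calc_r[j])
        calc_r[i] += K * (score - expected)
        calc_r[j] -= K * (score - expected)

    def system_error(true_r, calc_r):
        """RMS gap between calculated and true ratings, in rating points."""
        return math.sqrt(sum((c - t) ** 2 for c, t in zip(calc_r, true_r)) / len(true_r))

    def drift_trial():
        """Drift test: start at the true ratings and let them wander."""
        true_r = [random.uniform(1000, 2000) for _ in range(P)]
        calc_r = list(true_r)              # calculated ratings start exact
        for _ in range(P * 100):           # burn-in: let the ratings drift
            play_game(true_r, calc_r, *random.sample(range(P), 2))
        total = 0.0
        for _ in range(P * 10):            # measurement phase
            play_game(true_r, calc_r, *random.sample(range(P), 2))
            total += system_error(true_r, calc_r)
        return total / (P * 10)

    def convergence_trial(drift_level):
        """Convergence test: start everyone at 1500, play until the error
        falls below the drift level, and report games per player."""
        true_r = [random.uniform(1000, 2000) for _ in range(P)]
        calc_r = [1500.0] * P
        games = 0
        while system_error(true_r, calc_r) >= drift_level:
            play_game(true_r, calc_r, *random.sample(range(P), 2))
            games += 1
        return 2 * games / P               # each game involves 2 of the P players

    drift = sum(drift_trial() for _ in range(100)) / 100
    rate = sum(convergence_trial(drift) for _ in range(100)) / 100
    print(f"drift: {drift:.1f} points, convergence: {rate:.1f} games/player")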
After trying various rating systems it was found that the choice among reasonable systems does not make a large difference in performance; the gap between the best system and the worst is about a factor of 2. However, it is very easy to break a rating system so that it fails to converge or has a very high drift level. The formulas do have to be selected carefully, but among well-behaved systems there is not much difference in performance. Inserting ad hoc factors into the rating formula does not significantly improve its performance and runs the risk of making the system unstable. There is also a strong inverse relationship between the drift level and the convergence rate: it takes a very large number of games to converge to a low drift level. The os19 system was selected for Arimaa.
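This tradeoff shows up directly in the sketch above (the snippet below continues it, so the same assumed Elo-style update applies): sweeping the step size K, a larger step converges in fewer games but settles at a higher drift error. The specific numbers reflect the assumed update rule and SIGMA, not any of the systems actually tested.

    # Continues the sketch above: sweep the assumed step size K.
    # Larger K converges faster but leaves a higher drift error,
    # illustrating the drift/convergence tradeoff described here.
    for k in (8.0, 16.0, 32.0, 64.0):
        K = k
        drift = sum(drift_trial() for _ in range(100)) / 100
        rate = sum(convergence_trial(drift) for _ in range(100)) / 100
        print(f"K={k:5.1f}  drift={drift:6.1f} pts  games/player={rate:7.1f}")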