Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> Bot Development >> Multiple core and non-locking transposition tables
(Message started by: rednaxela on Aug 29th, 2010, 12:14am)

Title: Multiple core and non-locking transposition tables
Post by rednaxela on Aug 29th, 2010, 12:14am
So, I saw a thread about using multi-core before but the scaling I heard of was not that great. Since they've been a while ago I won't bump that (if I should actually bump in the future, just let me know, sorry).

One thing I've wondered is if syncronization for the transposition table had much to do with the scaling issues, so I ended up finding some non-locking hashtables/maps that are threading safe via use of compare-and-swap instructions. See http://sourceforge.net/projects/high-scale-lib/ for Java and http://code.google.com/p/nbds/ for C. I'm thinking they could be modified to become a nice fixed-size transposition table pretty easily, and be very optimal for parallel use. The Java lock-free hashtable I linked apparently even scales near-linearly on a crazy 768 core beast that the it's author's employer makes.

I don't think we'll be running an Arimaa bot on anything quite that big any time soon, but I still think that a non-locking transposition table would be a good thing. Has anyone with a parallel bot measured how much of their scaling limits are due to overhead of locking for the transposition table? Any thoughts about the value or lack thereof in a non-locking one?

Title: Re: Multiple core and non-locking transposition ta
Post by tize on Aug 29th, 2010, 3:34am
I would think that most bots that are parallel reads and writes to the hash table as "normal", and just ignores the problem with multiple threads reading and writing to the same hash. See http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=devTalk;action=display;num=1205320786 for why the locking is ignored.

Title: Re: Multiple core and non-locking transposition ta
Post by rbarreira on Aug 29th, 2010, 6:21am
I had problems when I was using a plain non-locking hash table as soon as I moved from a dual core to a hexa core.

For me, the solution was to use Hyatt's lockless hashing:

http://www.cis.uab.edu/hyatt/hashing.html

It is very simple to implement (just a handful of lines of code changed for me), and it has almost zero overhead compared to the plain hashing.

I'm sure a more fancy and "correct" solution is possible, but with more overhead to ensure 0% data loss (which isn't anywhere near a must with a transposition table).

Title: Re: Multiple core and non-locking transposition ta
Post by rednaxela on Aug 29th, 2010, 6:56am

on 08/29/10 at 06:21:41, rbarreira wrote:
http://www.cis.uab.edu/hyatt/hashing.html

It is very simple to implement (just a handful of lines of code changed for me), and it has almost zero overhead compared to the plain hashing.

I'm sure a more fancy and "correct" solution is possible, but with more overhead to ensure 0% data loss (which isn't anywhere near a must with a transposition table).

Ahh very interesting link there. It's simplicity looks pretty elegant.

As far as overhead, while I haven't benchmarked  it yet, the lockless hash tables I linked are supposed to have no data loss and also not have any overhead. If it benchmarks favorably it's probably what I'll use as a basis for a transposition table I think. I imagine the difference is fairly slim indeed,  but as long as there's no overhead, why not.

Title: Re: Multiple core and non-locking transposition ta
Post by doublep on Aug 29th, 2010, 10:15am
Plain non-locking hashing is really not an option, because it can yield not just a wrong move/score when two threads read/write the same entry, but something illegal and crash your program.  Badger currently uses spin-locks.  I need to look closer at that article, though it's very hard to read for some reason (or maybe I'm too stupid for it).

Title: Re: Multiple core and non-locking transposition ta
Post by speek on Aug 29th, 2010, 1:34pm
How much help are transposition tables anyway in Arimaa?

Title: Re: Multiple core and non-locking transposition ta
Post by jdb on Aug 29th, 2010, 2:06pm

on 08/29/10 at 13:34:57, speek wrote:
How much help are transposition tables anyway in Arimaa?



A very big help.   Consider Ra1n Rb1n or Rb1n Ra1n. The same board position is reached after two steps. This happens all the time in the search.

Title: Re: Multiple core and non-locking transposition ta
Post by speek on Aug 29th, 2010, 2:23pm
Well, that's different - that's something you have to do to find all the unique legal moves from a given position, but it only applies to that one ply.

I thought transpositional tables were about collapsing the search tree to only unique positions and evaluating each unique position only once.

However, given an engine that correctly produces unique legal moves per ply, how much advantage is there to transpositional tables? If I search all my moves, and all your responses to all my moves, how many duplicate positions occur and how much time does it take dealing with them vs managing a transpositional table that is probably hundreds of MB in size?  I'm already storing the game tree which is hundreds of MB in size, and I only get 500 to work with.

Currently, I am not keeping a transpositional table.  At the start of the game, searching a full 2-ply takes less than 3 seconds.  I know that when I am generating moves, managing the hashtable to prevent duplicate moves being generated is my biggest use of cpu time.  So, I have doubts on the value of the transpositional table outside that use.

Title: Re: Multiple core and non-locking transposition ta
Post by rednaxela on Aug 29th, 2010, 4:52pm

on 08/29/10 at 10:15:00, doublep wrote:
Plain non-locking hashing is really not an option, because it can yield not just a wrong move/score when two threads read/write the same entry, but something illegal and crash your program.  Badger currently uses spin-locks.  I need to look closer at that article, though it's very hard to read for some reason (or maybe I'm too stupid for it).

If I'm interpreting it correctly, the one in the article that  rbarreira links appears to get around the issue, by 1) assuming a write of the key is atomic and that a write of the data is atomic (i.e. fits in a 64 bit long or is a pointer), the 2) XORing the key over the data, such that if the key/data are mismatched the error will most likely be detected and subsequently ignored. If you have spare bits in a 64 bit data value, I imagine you could probably pack a checksum into  it as well to further increase the chance of detecting the bad writes. Essentially the idea is that instead of getting getting undetectably corrupt data, you on rare occasional lose data, which for a transposition table is tolerable.

The high-scale-lib that I linked gets around the issue by a different approach. It appears to make use of the "compare and set" instructions to manage to operate with neither locks nor occasional lost data. There might be limitations though that I'm not yet aware of. I'll have to look closer. I still don't fully understand exactly how it's achieving it.

Here's another possible approach for scalability that just came to mind here thanks to you doublep. I recently read that Java's ConcurrentHashMap class deals with concurrency by having 16 separate locks for different parts of the array backing it. Perhaps this partitioned-locking approach could be used with spinlocks as used in Badger, to have a simple way to reduce how often things run into a lock? Just a thought.


on 08/29/10 at 14:23:39, speek wrote:
Well, that's different - that's something you have to do to find all the unique legal moves from a given position, but it only applies to that one ply.

I thought transpositional tables were about collapsing the search tree to only unique positions and evaluating each unique position only once.

However, given an engine that correctly produces unique legal moves per ply, how much advantage is there to transpositional tables? If I search all my moves, and all your responses to all my moves, how many duplicate positions occur and how much time does it take dealing with them vs managing a transpositional table that is probably hundreds of MB in size?  I'm already storing the game tree which is hundreds of MB in size, and I only get 500 to work with.
Well, what you're doing with your hash tables to check the uniqueness of moves is very similar to what the transposition tables do, but transposition tables do a little more at the same time. As I see it, the differences are:
1) They stay fixed size so they don't need to be re-allocated or cleared
2) They take the 'lazy' way out with regards to collisions, just storing one value or the other usually (some can try alternative indexes but they don't put try to remember everthing permanently). This is faster and perfect recall is not necessary for the transposition table's performance in the big picture.
3) They're remembered for future turns as well, used as appropriately without needing special handling
4) They usually remember the next principal variation step that was last found at that position. When doing deeper searches, it is faster to check the principal variation first, because this often leads to to alpha-beta pruning making a cutoff quicker.

Title: Re: Multiple core and non-locking transposition ta
Post by speek on Aug 29th, 2010, 5:33pm

on 08/29/10 at 16:52:17, rednaxela wrote:
Well, what you're doing with your hash tables to check the uniqueness of moves is very similar to what the transposition tables do, but transposition tables do a little more at the same time. As I see it, the differences are:
1) They stay fixed size so they don't need to be re-allocated or cleared
2) They take the 'lazy' way out with regards to collisions, just storing one value or the other usually (some can try alternative indexes but they don't put try to remember everthing permanently). This is faster and perfect recall is not necessary for the transposition table's performance in the big picture.
3) They're remembered for future turns as well, used as appropriately without needing special handling
4) They usually remember the next principal variation step that was last found at that position. When doing deeper searches, it is faster to check the principal variation first, because this often leads to to alpha-beta pruning making a cutoff quicker.


I understand this, I'm just dubious of it's value for Arimaa.  Firstly, these differences mean you can't use it to generate a correct collection of unique moves for a given position - that can't be a fuzzy operation.  It also does have to be cleared (though not necessarily reallocated) each time.  Given that, you'd have to have two such datastores.  As I said, I already spend a plurality of cpu time doing hashes for discarding duplicate moves - I'm doubtful that storing duplicate  positions between plys for the sake of forgoing evaluation of the position - I'm doubting this is all that useful.  Primarily I doubt it because we're not going terribly deep into the Arimaa tree.  It's not like chess where 10 ply will get you a lot of transpositions, yet not so many that we're straining memory capacities.

I would like to see empirical evidence of improvement strictly from the use of such tables.

Title: Re: Multiple core and non-locking transposition ta
Post by Fritzlein on Aug 29th, 2010, 5:56pm
Speek, if you are throwing out duplicate moves and only achieving shallow searches, I rather doubt transposition tables will help you much.  At two ply the help should be almost zero; the only possible duplication is from dislodging opposing pieces.  At three ply, however, consider the worst-case scenario of your opponent making a fixed move while your two moves consisted of eight independent steps that don't interact with each other or the opponent's move.  In that case, you can reach the same position in 8!/(4!4!) = 70 different ways in spite of duplicate moves being thrown out.  Your evaluation may be cheap, but do you want to blow it up by a factor of 70?  Of course, eight independent steps will be very rare, and if you do haizhi-pruning of moves, your maximum different ways to reach each position is only 2*4*3/2 = 12.  Still significant, it seems to me.

I understand that you don't want to waste time on duplicate-position detection after having already spent so much time on duplicate-move detection, but why not get rid of the duplicate-move detection and let the duplicate-position detection do both jobs?

Title: Re: Multiple core and non-locking transposition ta
Post by rbarreira on Aug 29th, 2010, 6:16pm

on 08/29/10 at 16:52:17, rednaxela wrote:
2) XORing the key over the data, such that if the key/data are mismatched the error will most likely be detected and subsequently ignored.


I'm not sure if that's what you mean, but I think the better way is to xor the key and the data to get the key which will be stored. This way the data is always readable, which is good since it needs to be readable for the corrupted entry to be replaced with the usual checks.


on 08/29/10 at 14:23:39, speek wrote:
I know that when I am generating moves, managing the hashtable to prevent duplicate moves being generated is my biggest use of cpu time.  So, I have doubts on the value of the transpositional table outside that use.


If the hashing is such a big bottleneck, that could indicate that your approach to get non-unique moves is not so efficient. In my bot, accesses to the transposition table take up about 10% of the running time.

Title: Re: Multiple core and non-locking transposition ta
Post by speek on Aug 29th, 2010, 6:26pm
Well, I don't understand your example of an opponent making a "fixed" move (what is that?) and 8 independent steps.  The first 4 independent steps resulting in position A will be in my set of unique moves once.  Then, the opponent has a "fixed" move.  Then I do 4 more moves independent of each other and the one move I started with.  How am I duplicating positions?  I think without getting to the 4th ply, I'm not duplicating positions without dislodging pieces, but please help me understand better.

As for using the same facility for both, I think that'd be great, but I'm frankly unsure of how it would work.  The one is checking positions at every step, whereas the other only cares about end-of-ply positions.  Necessarily, all the positions I've recorded for making unique moves have to be discarded to allow for new sets of unique moves next ply.  If I save them, then I need a different storage for generating new moves, which means I have two storage locations anyway.

The thing is, I can see that saving position evaluations would save a lot of time - especially as sometimes an evaluation has come as a result of a cascade via min-max through multiple ply that have been checked.  So, some of those evaluations are pretty thorough.  However, doing this comes at the cost of a lot of uselessly stored positions that the engine will never care about anyway, a lot of memory.  I'm already having trouble getting below 500MB because my game tree is currently using 160MB by itself.

Title: Re: Multiple core and non-locking transposition ta
Post by rbarreira on Aug 29th, 2010, 6:30pm
Your bot can use more than 500 MB, the worst case is that if omar thinks it's too much he won't make your bot available for play after the computer tournament.

IMO you really shouldn't let a 500 MB limit interfere with what you want to do in your bot...

Title: Re: Multiple core and non-locking transposition ta
Post by speek on Aug 29th, 2010, 6:30pm

on 08/29/10 at 18:16:29, rbarreira wrote:
If the hashing is such a big bottleneck, that could indicate that your approach to get non-unique moves is not so efficient. In my bot, accesses to the transposition table take up about 10% of the running time.


Well, 10% is a hefty percentage for one low-level operation like doing a put to a hash table.  It would certainly be the kind of cost I'm talking about.

Title: Re: Multiple core and non-locking transposition ta
Post by rbarreira on Aug 29th, 2010, 6:32pm

on 08/29/10 at 18:30:13, speek wrote:
Well, 10% is a hefty percentage for one low-level operation like doing a put to a hash table.  It would certainly be the kind of cost I'm talking about.


Reading and writing to a huge hash table which doesn't fit in the cache necessarily means accessing RAM most of the time. It takes many CPU cycles to access RAM.

For me it was definitely worth it to add the transposition table.

Title: Re: Multiple core and non-locking transposition ta
Post by rednaxela on Aug 29th, 2010, 6:56pm

on 08/29/10 at 18:16:29, rbarreira wrote:
I'm not sure if that's what you mean, but I think the better way is to xor the key and the data to get the key which will be stored.

Sorry, yeah that's what I meant.

Title: Re: Multiple core and non-locking transposition ta
Post by Fritzlein on Aug 29th, 2010, 10:13pm

on 08/29/10 at 18:26:57, speek wrote:
Well, I don't understand your example of an opponent making a "fixed" move (what is that?) and 8 independent steps.  The first 4 independent steps resulting in position A will be in my set of unique moves once.  Then, the opponent has a "fixed" move.  Then I do 4 more moves independent of each other and the one move I started with.  How am I duplicating positions?

OK, for illustration suppose the gold pieces are all on the second and third ranks, with the silver pieces all on the sixth and seventh ranks, with Gold to move.  The position I have in mind three ply from now has all of Gold's third-rank pieces advanced one step to the fourth rank (eight independent steps) and Silver's four central pieces advanced one step each to the fifth rank (the opponent's "fixed" reply).  On the way to this position three-ply deep, you can make 70 different moves at the root of the search, as there are 70 different ways to step four of your front pieces foward.  The opponent responds the same way in each case, and then you move the other four pieces forward.  Your duplicate move detection doesn't eliminate these different ways of arriving at the same position, as all the root moves are different.

Does this clarify what duplicate-position detection can do that duplicate-move detection can't do?

Title: Re: Multiple core and non-locking transposition ta
Post by tize on Aug 30th, 2010, 12:03am
I was clearly wrong about that most used lock-less hashing...


on 08/29/10 at 10:15:00, doublep wrote:
Plain non-locking hashing is really not an option, because it can yield not just a wrong move/score when two threads read/write the same entry, but something illegal and crash your program.

To fix that problem you don't have to lock the hash, just validate that the move is a legal one, even with a lock I wouldn't trust the move without validation but maybe I'm just paranoid. :)


on 08/29/10 at 17:33:43, speek wrote:
I understand this, I'm just dubious of it's value for Arimaa.  Firstly, these differences mean you can't use it to generate a correct collection of unique moves for a given position - that can't be a fuzzy operation.  It also does have to be cleared (though not necessarily reallocated) each time.

First you should never clear the hash other than when it's being initialized, instead use an age counter or something similar if you would like to know if the entry is from a previous iteration.

In marwin I also use a hash that filters away dublicates inside of a move, in fact the 2010 version doesn't have any transposition table. I have now added such a table and it helps, on p2 searches it's mostly in the 0-30% range, but on a 2 minutes search it usually searches one step deeper. Not earth shattering but a nice improvement.


on 08/29/10 at 06:21:41, rbarreira wrote:
For me, the solution was to use Hyatt's lockless hashing:

Nice I will use that too.

Title: Re: Multiple core and non-locking transposition ta
Post by rednaxela on Aug 30th, 2010, 8:26am
While on the topic of transposition tables, has anyone ever considered persisting some entries on-disk for future games? As to not add overhead it could be implemented in a manner such that between turns or at end of game, the deepest or most seen entries could be written to an on-disk transposition table, which could then be used to either initialize or supplement the normal one in subsequent matches.

Recently I ran some processing on games since the start of this year and found that out of 983873 turns, there were 946219 unique positions (mirror images counting as duplicate). Since I'm sure most of the duplicates are very early game, I don't think the on-disk table would have that big an impact overall, but I wonder if it could be worth it for perhaps getting slightly more depth in the first few turns. I mean.... surely out of all the times double-99of9 opening is seen, it could be searched slightly deeper by re-using information found in past games.

Title: Re: Multiple core and non-locking transposition ta
Post by rbarreira on Aug 30th, 2010, 9:00am
The problem with persistent knowledge is that it would have to be recalculated every time you change your evaluation function or even much of the search code. But it could be worth it for a quite mature bot (I'm guessing none or very few of the existing Arimaa bots can really be called mature yet).

Title: Re: Multiple core and non-locking transposition ta
Post by speek on Aug 30th, 2010, 9:54am

on 08/29/10 at 22:13:23, Fritzlein wrote:
Does this clarify what duplicate-position detection can do that duplicate-move detection can't do?



Thanks Fritz, yes, I understand now.  I'll have to try not clearing my unique move hash and see what effect it has.  

Title: Re: Multiple core and non-locking transposition ta
Post by Fritzlein on Aug 30th, 2010, 7:51pm

on 08/30/10 at 00:03:30, tize wrote:
In marwin I also use a hash that filters away dublicates inside of a move, in fact the 2010 version doesn't have any transposition table. I have now added such a table and it helps, on p2 searches it's mostly in the 0-30% range, but on a 2 minutes search it usually searches one step deeper. Not earth shattering but a nice improvement.

Thanks for the practical results, tize.  That's much more informative that my speculation about duplicate positions.  You say that searching one step deeper isn't earth-shattering, but isn't that equivalent to searching more than three times as fast?  I'm trying to think what other optimizations (other than alpha-beta pruning) result in an additional step of depth in a two-minute search in an already-strong bot.

Title: Re: Multiple core and non-locking transposition ta
Post by rednaxela on Aug 31st, 2010, 8:57pm

on 08/30/10 at 09:00:21, rbarreira wrote:
The problem with persistent knowledge is that it would have to be recalculated every time you change your evaluation function or even much of the search code. But it could be worth it for a quite mature bot (I'm guessing none or very few of the existing Arimaa bots can really be called mature yet).


Well as far as numerical score values or bounds, yes, that data does get broken when changing the bot, however I have a feeling that the best/refutation moves are still of significant value. Even after changing the evaluation function drastically, chances are the best/refutation move from an old eval function is still a good first guess when searching. This could work simply by including a "bot revision" field in the on-disk transposition table data.



Today as I was pondering parallel searches... and I think I realized  one possible optimization for the parallel alpha-beta search, that applies best in multiple-steps-per-turn situations like with Arimaa. During any given turn from a given starting position, the resulting value of alpha will always end up being the greatest alpha found in the 4-step tree of that turn. Due to that, perhaps it would be advantageous to use a "shared alpha" within each subtree for a turn. Normally some parallel branches would start with non-optimal alpha values thus leading to more visited nodes than a non-parallel implementation, limiting the scalability. Using a "shared alpha" as I describe should bring the visited nodes of the parallel search back to something close to the non-parallel search, thus improving the scaling.

Now, implementing the "shared alpha" with traditional locking would probably have too much overhead, however it could be implemented very efficiently by making use of atomic compare-and-set/swap instructions. In Java this would be implemented pretty easily by using the AtomicInteger class for the shared alpha. In C/C++ this could be implemented by careful use of a "volatile int *" along with the compare-and-swap instruction.



Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1!
YaBB © 2000-2003. All Rights Reserved.