Monday, August 16, 2010

Solving transaction contention issues in Ranklist

This weekend I began seeing the following message on my dashboard:

08-13 08:21PM 44.950 Transaction collision for entity group with key datastore_types.Key.from_path(u'ranker', 23150L, _app=u'railroadempire'). Retrying...

After a little digging I discovered that the transaction ranklist runs is wrapped around the entire ranker. In Google App Engine terms, each individual ranker is its own entity group and the transaction covers that whole entity group. In practice this means you can only write one rank into your ranklist at a time.

My game has picked up enough players that more than one ranking update was being processed at the same time. The colliding transactions had to be retried, which added significantly to my CPU usage.

After a quick post on the ranklist Google group I got a great response from Bartholomew Furrow:

I'm only guessing, but probably the best way to deal with this is batch your updates.  Whenever a scoreboard row changes in score, set a 'dirty' field in it to True; then, separately, have a cron job continually querying for rows with 'dirty' set to true and then running set_scores on the ranker for all of them.  That way you get the advantage of having a lot of updates at the same time, and during periods of heavy use you just fall behind a bit instead of failing requests.

Beyond that I think you'd need to get into specifics of the tree structure and whether certain nodes happen to be getting hit more often than the others.  For example, if you're calling num_ranked_users (or whatever it's called) a lot, you could consider slapping a memcache on that with an expiry time of 20s so your root node doesn't get hit nearly as often.
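
The caching idea in the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: TtlCache is a hypothetical in-process stand-in for memcache so the example runs anywhere, and compute_total stands for whatever call asks the ranker for the number of ranked users. On App Engine the same shape would use memcache.get(key) and memcache.set(key, value, time=20) instead.

```python
import time


class TtlCache(object):
    """Hypothetical in-process stand-in for memcache: entries expire after ttl seconds."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)


def cached_total(cache, key, compute_total, ttl=20):
    """Return the cached total, hitting the ranker's root node only on a miss."""
    total = cache.get(key)
    if total is None:
        total = compute_total()  # the expensive read of the ranker
        cache.set(key, total, ttl)
    return total
```

With a 20 second expiry, the expensive call runs at most once per 20 seconds per key, at the cost of the count being slightly stale.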

I chose the batch method he described above, and here is what I did:

  1. I added a rankChange Boolean property to the player entity.
  2. Whenever the player's cashOnHand changes, I set rankChange to True.
  3. I removed the old individual calls to setRanking in ranklist.
  4. I set up a cron job in the cron.yaml file to run every minute, using the synchronized option.
  5. I added a request handler to do the work:
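def get(self):
    # Needs at module level: import logging; from google.appengine.ext import db
    playerscores = {}
    changed = []
    playerquery = model.Player.all()
    playerquery.filter("rankChange =", True)
    ranker = model.Player.getRanker()

    # Collect at most 50 changed players per run to stay clear of timeouts.
    for player in playerquery.fetch(50):
        playerscores[str(player.key())] = [player.cashOnHand]
        player.rankChange = False
        changed.append(player)

    if changed:
        # Assumption: the listing as posted never wrote anything back. SetScores
        # is the ranklist batch call taking a {name: [score]} dict; db.put then
        # saves the cleared flags so these players are not picked up again.
        ranker.SetScores(playerscores)
        db.put(changed)
    logging.debug("Set Rank for %s players" % len(changed))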

This solved the problem nicely, though not perfectly. The rankings are no longer real-time; they lag by about a minute. I had to cap the fetch size at 50 because beyond that you get timeouts and memory issues. Currently a request only uses about 38 of the 50. I should extend this function so that when the count reaches 50 it launches a Task Queue request to run the same code; that way, if there are more than 50 changed players, they won't back up until the next cron job runs.
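
The Task Queue follow-up described above might look something like this sketch. It is not the code from my handler: the datastore, the ranker and the queue are passed in as hypothetical callables (fetch_dirty, set_scores, save_all, enqueue_followup) and players are plain dicts, so the batching logic can be run and tested on its own. On App Engine, enqueue_followup would be a taskqueue.add(url=...) call pointing at whatever URL the handler is mapped to.

```python
def process_rank_batch(fetch_dirty, set_scores, save_all, enqueue_followup,
                       batch_size=50):
    """Push one batch of changed scores to the ranker.

    All four callables are hypothetical hooks standing in for the datastore
    query, the ranker's batch update, the datastore write and the task queue.
    Returns the number of players processed.
    """
    players = fetch_dirty(batch_size)
    if not players:
        return 0

    scores = {}
    for player in players:
        scores[player["key"]] = [player["cashOnHand"]]
        player["rankChange"] = False

    # Update the ranker first. If saving the flags fails afterwards, the
    # players stay dirty and are simply re-ranked on the next run.
    set_scores(scores)
    save_all(players)

    # A full batch suggests more dirty rows are waiting, so queue another
    # pass right away instead of waiting for the next cron tick.
    if len(players) == batch_size:
        enqueue_followup()
    return len(players)
```

A batch that comes back smaller than batch_size means the backlog is drained, so no follow-up is queued and the next cron run picks up any later changes.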

Hope this helps someone else.

Thanks Bartholomew Furrow for the help.


  1. I don't know much about this environment, but the first thing I thought of was using a semaphore and a co-process, in place of the cron job.

  2. Jim: I suggested a cron job simply because that's what was available to us on Code Jam when we were first using the library; you're completely right that there are technologies more suited to the task, but I'm not sure which of them App Engine currently supports.

  3. Good post! Thanks for the insight.

  4. Thanks!
    I had a similar issue with ranklist and your post helped.

    I decided to use a fork-join-queue to do the bulk updates, which may be better than constantly polling with a cron job?

    Thanks again.