Monday, July 11, 2011

Writing data to Google Fusion Tables from App Engine

This weekend I was asked by some potential investors to supply some statistics for my game. The information was embedded in my datastore in the player and transaction objects. To extract the data, I decided to iterate over each player in the game. This could have been done with the map part of map/reduce, but since there wasn't a ton of data and speed wasn't a huge issue, I just wrote a simple task queue job to walk the data serially; the fifteen minutes it took was fine.

To access Fusion Tables I downloaded the Python library for Fusion Tables:
http://code.google.com/p/fusion-tables-client-python/

Then I built the table I wanted to create using the Fusion Tables UI:
http://www.google.com/fusiontables/Home?pli=1

Then I simply set up my task to insert a Fusion Tables row for each player entity with the desired data in it.

To authenticate I used ClientLogin:

from authorization.clientlogin import ClientLogin
import ftclient

token = ClientLogin().authorize(username, password)
ft_client = ftclient.ClientLoginFTClient(token)

I created a template for the insert I wanted to do:


ftquerytemplate = """insert into 1116661 (created_date, login_date, first_transaction_date, last_transaction_date, player_status, number_of_transactions, length_of_play, time_to_txn) 
VALUES ('%(created_date)s', '%(login_date)s', '%(first_transaction_date)s', '%(last_transaction_date)s', '%(player_status)s', %(number_of_transactions)s, %(length_of_play)s, %(time_to_txn)s);"""



Then I just iterated over my players; after every 100 players, I saved a cursor and launched a new task with that cursor. I stacked up the inserts into one string and sent them as a single query at the end. There are some things you have to watch out for:

  • Null values are entered as empty strings ''
  • When entering dates, Fusion Tables will store either just the date or the entire timestamp. However, there is no way in Fusion Tables that I could find to aggregate on a date when you have stored a timestamp, so make sure you do that conversion in code.
  • If you want a number field to be null you still pass it an empty string ''
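These rules can be captured in a small formatting helper. This is just a sketch in plain Python, not part of the Fusion Tables client library, and `ft_value` is an illustrative name of my own:

```python
import datetime

def ft_value(value):
    # Format a Python value as a Fusion Tables SQL literal.
    # None becomes an empty string '' (for both text and number
    # columns), and datetimes are truncated to dates so that
    # aggregating on the date column works.
    if value is None:
        return "''"
    if isinstance(value, datetime.datetime):
        value = value.date()  # drop the time portion
    if isinstance(value, (int, float)):
        return str(value)
    return "'%s'" % value
```

For example, `ft_value(datetime.datetime(2011, 7, 9, 14, 30))` yields `'2011-07-09'`, a quoted date with no timestamp.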
Below is the complete controller I used to get the data and place it in Fusion Tables. There is a lot you could do by mixing Fusion Tables and Google App Engine; this was just a way for me to solve a problem quickly. That's one of the things I love about these technologies: they are very easy to try out.




from controllers import *
from models import model
import datetime
import logging
import ftclient
from google.appengine.api import taskqueue
from authorization.clientlogin import ClientLogin

ftquerytemplate = """insert into [mytablenumber] (created_date, login_date, first_transaction_date, last_transaction_date, player_status, number_of_transactions, length_of_play, time_to_txn) 
VALUES ('%(created_date)s', '%(login_date)s', '%(first_transaction_date)s', '%(last_transaction_date)s', '%(player_status)s', %(number_of_transactions)s, %(length_of_play)s, %(time_to_txn)s);"""

class UserReporting(BaseController):
    def get(self):
        self.handleReporting()
        
    def post(self):
        self.handleReporting()
        
    def handleReporting(self):
        token = ClientLogin().authorize("[myusername]", "[mypassword]")
        ft_client = ftclient.ClientLoginFTClient(token)
        playerquery = model.Player.all()

        last_cursor = self.request.get('cursor')
        if last_cursor:
            playerquery.with_cursor(last_cursor)

        fetched = playerquery.fetch(100)
        logging.debug("Fetched %s records" % len(fetched))
        ftquery = ""
        for player in fetched:
            # Build the row of data for Fusion Tables
            playerdata = {'created_date': player.createdDate.date(),
                          'login_date': player.lastLoginDate if player.lastLoginDate else "",
                          'player_status': 'inactive' if player.inactive else "active"}
            if player.lastLoginDate and player.createdDate:
                playerdata['length_of_play'] = (player.lastLoginDate.date() - player.createdDate.date()).days
            else:
                playerdata['length_of_play'] = 0
                          
            count = 0
            first_txn = datetime.datetime.max
            last_txn = datetime.datetime.min
            for txn in player.transaction_set:
                count += 1
                if txn.timestamp < first_txn:
                    first_txn = txn.timestamp
                if txn.timestamp > last_txn:
                    last_txn = txn.timestamp
            if first_txn != datetime.datetime.max and player.createdDate:
                playerdata['time_to_txn'] = (first_txn - player.createdDate).days
            else:
                playerdata['time_to_txn'] = "''"

            playerdata['number_of_transactions'] = count
            playerdata['last_transaction_date'] = '' if last_txn == datetime.datetime.min else last_txn 
            playerdata['first_transaction_date'] = '' if first_txn == datetime.datetime.max else first_txn
            ftquery += ftquerytemplate % playerdata
        logging.debug("SQL: %s" % ftquery)
        ft_client.query(ftquery) 
        if len(fetched) == 100:
            taskqueue.add(url='/reporting/userreporting', params={'cursor':playerquery.cursor()})
        

Friday, January 28, 2011

More App Engine Optimization Goodness

So I was recently trying to improve performance on the application and used two tricks to great effect that I want to share with everyone.

First I had some code in one of my tasks like this:

def doTask(self):
    train = NTrain.get(self.request['trainkey'])
    # Do some work
    # This call is a ReferenceProperty and does a get
    player = train.player
    #Do Some work with the player


The problem here is that we are doing two gets when we could do one, especially since all NTrain entities have their player as their parent. So instead we can do this:


def doTask(self):
    trainkey = db.Key(self.request['trainkey'])
    playerkey = trainkey.parent()
    train, player = db.get((trainkey, playerkey))
    #This will keep you from redoing the get if you use 
    #train.player later
    train.player = player
    #Do all your work here


The second optimization was to completely remove the need for a write. In the past, my ranking code queried for players that had a flag set indicating they needed ranking. I would call the ranking code, clear the flag on each player, write all the rankings in a batch, and then write all the players in a batch.


class Player(db.Model):
    flag = db.BooleanProperty()

    def requireRanking(self):
        self.flag = True


class SetRanks(BaseController):
    def handleranks(self):
        playerquery = model.Player.all()
        playerquery.filter("flag =", True)

        players = []
        for player in playerquery:
            # Do work
            player.flag = False
            players.append(player)

        db.put(players)


Now I set a timestamp instead of a flag when ranking is needed, and query for players whose timestamp is greater than the last time the task ran. I still write all the rankings in a batch, but there is no longer any need to clear a flag and write changes back to the players.

class Player(db.Model):
    lastMoney = db.DateTimeProperty()

    def requireRanking(self):
        self.lastMoney = datetime.datetime.now()

class SetRanks(BaseController):
    def handleranks(self):
        lastRankCheck = datetime.datetime.now() - datetime.timedelta(minutes=5)
        playerquery = model.Player.all()
        playerquery.filter("lastMoney >", lastRankCheck)
        for player in playerquery:
            # Do work
There are many options for keeping track of the last time the query was run. In my case, knowing the task runs every five minutes and assuming as much was OK because of the non-critical nature of the data. In a more critical environment you may want to store the last run time in the datastore. You could also pass it forward from task to task (the last task creates an instance of the next task with the time it ran as a parameter).
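The pass-it-forward variant can be sketched in plain Python. Here dicts stand in for the datastore player entities, and the names are illustrative:

```python
import datetime

def rank_pass(players, last_run, now=None):
    # One ranking pass: pick out the players whose lastMoney
    # timestamp changed since the previous run, and return the new
    # cutoff to hand forward to the next task. There is no flag to
    # clear, so nothing is written back to the player entities.
    now = now or datetime.datetime.now()
    to_rank = [p for p in players if p['lastMoney'] > last_run]
    return to_rank, now
```

Each task passes the returned `now` forward as the next task's `last_run`, so no updates are missed between runs.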

Wednesday, October 27, 2010

Updating AppEngine Kinds without taking your site down

I recently had the need to change one of my datastore Kinds. In the past when I have done this, I have shut down the site and migrated every entity of the Kind to the new version. However, my application has grown to the point where that takes a long time, and I didn't want the site down that long. After reading several articles on the subject, most notably this one: http://codebetter.com/blogs/kyle.baley/archive/2010/09/28/options-for-upgrading-an-app-engine-application.aspx, and discussing the matter in the #appengine IRC channel on Freenode, I had a plan.

The first thing I did was add a new property to my Kind called version; next I added an __init__ constructor to the class. I had previously thought you were not supposed to do that with App Engine, but Nick Johnson on the IRC channel said that as long as you are careful it would be the right thing to use in this case.

The new constructor has a version check, and if it's the old version the upgrade on the entity is performed. The super call to the base constructor must be called first.


class NTrain(db.Model):
    version = db.IntegerProperty()

    def __init__(self, *args, **kwargs):
        super(NTrain, self).__init__(*args, **kwargs)
        # Since this is the first time I am using version, just do a
        # boolean check; otherwise you would want
        # `if self.version < 2:` or whatever your version number is.
        if not self.version:
            self.version = 1
            # do upgrade


Next I changed my code to use the new version of the entities.

Then I set up the code that creates all new instances of the Kind to use the new version number.

Finally I will (I haven't done this part yet) run a mapper over the entire kind that just has to load the entity and save it.

That last step is technically not necessary, but by doing that I will guarantee that all entities of the kind have been upgraded to the new version, and once that has been done I can remove the version checking code from the constructor.

You do not have to use a version number to decide that an upgrade is necessary. You could also use an implicit method of upgrading, such as checking if a new property exists on the entity in the datastore and if not then run the upgrade. For my use case I decided that the explicit check would be simpler and more robust.
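The same lazy-upgrade pattern works anywhere you load serialized records. Here is a plain-Python sketch using dicts in place of datastore entities; the field names and the specific migration are made up for illustration:

```python
CURRENT_VERSION = 2

def upgrade(entity):
    # Lazily upgrade an entity dict to the current schema version.
    # Mirrors the constructor check above: it runs once per entity,
    # the first time the entity is loaded after the schema change,
    # and is a no-op for already-upgraded entities.
    version = entity.get('version') or 1
    if version < CURRENT_VERSION:
        # Example migration: hoist a nested field to the top level.
        entity['speed'] = entity.pop('stats', {}).get('speed', 0)
        entity['version'] = CURRENT_VERSION
    return entity
```

An old-format record like `{'stats': {'speed': 30}}` comes out as `{'speed': 30, 'version': 2}`, while an already-upgraded record passes through unchanged.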

Tuesday, September 7, 2010

Datastore lessons learned the hard way

With the latest release of the Railroad Empire game I have reduced my App Engine CPU usage by about 66%. Looking back on why my CPU usage was so high before, I have come to the conclusion that when I originally designed my application I made some poor datastore design decisions and fixing those design flaws drastically decreased my CPU usage. So below I am detailing some of the important lessons I have learned.

Use Entity Groups to intelligently partition your data
According to the App Engine documentation, you should only use Entity Groups for transactions: http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html

"Only use entity groups when they are needed for transactions. For other relationships between entities, use ReferenceProperty properties and Key values, which can be used in queries."

I have discovered that you may also want to use Entity Groups to partition your data, for example placing all the data for a player in a game in an Entity Group with the player as the parent. The documentation alludes to this later on the same page:

"A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller."

The problem is that the documentation reads as if you should only do that when you are actively using transactions on that data. First, I have found that you may start out thinking you won't need transactions and later discover that you do. Second, you can use the parent relationship in conjunction with your Key Names to establish unique keys.

Make intelligent use of your Key Names.
This is the big one. No matter what you do, don't just go and use a generated key for every one of your entities. Whenever it makes sense use a key name that uniquely identifies the entity using data within the entity. You can often use entity groups to help create intelligent unique keys.

Here's why this is important: queries on key names are much faster than queries on full entities. Say you have a screen that displays a list of entities, and the user selects one to get a deeper view. If you do a full query for that list screen, you are loading every single entity into memory just to get the information necessary to click on it. Sometimes this can't be avoided, and when it can't you can use Memcache to cache the list, but if at all possible display only the information in the key name.

The other place that intelligent use of Key Names can help you, and where it helped me the most, was having my key names match some static lookup data that I used in my maps. In my application I use Google Maps to display the cities, stations, and routes of your railroad. There is a fixed set of cities and routes you can choose from, all saved in an XML file. Before I made the most recent change, you had to load each station entity to display its icon on the map, because the city the station referred to was stored as a property on the entity.

Since you can only have one station per city, I changed the datastore so that the key name for the station is the city being referenced. For this to work I had to put the Station in an Entity Group with the player as the parent. Now whenever I want all of a player's stations but only care about which city each station is in, which is about 95% of the time, I don't have to load the large, bulky station objects.
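The idea can be sketched in plain Python (the key-name format here is illustrative): the city id is embedded in the key name, so a keys-only query recovers everything the map needs without loading any entities.

```python
def station_key_name(city_id):
    # One station per city, so the city uniquely names the station
    # within a player's entity group.
    return 'station:%s' % city_id

def cities_with_stations(key_names):
    # Recover the city ids from a keys-only query result --
    # no need to load the full Station entities.
    return [name.split(':', 1)[1] for name in key_names]
```

For example, `cities_with_stations([station_key_name('cleveland')])` gives back `['cleveland']` straight from the keys.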

There is another way you could accomplish the same goal: de-normalize your data and store a list of the player's stations in the player object. This approach will work, but be wary, it has several drawbacks. You can run into index bloat as well as entity size issues; remember, each entity can have only 1000 index rows and can be at most 1MB in size.

How to fix your ailing datastore
Luckily, if you have these sorts of woes in your datastore, or need to refactor it for any reason, all is not lost. There are several ways to move data around in the datastore without too much interference. There is a great project for doing the map part of a map/reduce here: http://code.google.com/p/appengine-mapreduce/.
For datastore refactoring you only need the map part. Beware that sharding the data can be problematic in the current implementation.

Another good alternative, which ends up being a poor man's map, is the following. Have an admin-only URL that launches a task (we'll call it the launcher task) to iterate, keys only, over the data you want to change using a cursor. The launcher task launches a new task for each entity to do the actual work of changing that entity (the worker task). After the launcher task has enqueued a reasonable number of worker tasks (I used 100), it adds itself back to the queue with its cursor as a parameter and ends. The launcher task thus iterates over your entire query, launching tasks that do the real work, and the worker tasks make each individual datastore change.
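Here is a minimal simulation of that launcher/worker flow in plain Python. An in-memory list stands in for the task queue and an integer offset for the datastore cursor:

```python
def launcher(keys, cursor, batch_size, enqueue):
    # One run of the launcher task: enqueue a worker per key in this
    # batch, then re-enqueue itself with the advanced cursor if
    # there is more data left.
    for key in keys[cursor:cursor + batch_size]:
        enqueue(('worker', key))
    if cursor + batch_size < len(keys):
        enqueue(('launcher', cursor + batch_size))

# Drive it to completion with an in-memory queue.
keys = ['k%d' % i for i in range(250)]
queue, updated = [], []
launcher(keys, 0, 100, queue.append)
while queue:
    kind, arg = queue.pop(0)
    if kind == 'launcher':
        launcher(keys, arg, 100, queue.append)
    else:
        updated.append(arg)  # a real worker would change one entity here
```

Every key is handed to exactly one worker, and no single task ever touches more than one batch of the query.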

Be sure to stop your cron jobs and other task queues while you are updating your datastore.

I hope that some of these lessons I have learned help other people.

Monday, August 16, 2010

Solving transaction contention issues in Ranklist

This weekend I began seeing the following message on my dashboard:

08-13 08:21PM 44.950 Transaction collision for entity group with key datastore_types.Key.from_path(u'ranker', 23150L, _app=u'railroadempire'). Retrying...



After a little digging I discovered that the transaction run by ranklist is wrapped around the entire ranker. In Google App Engine terms, that means each individual ranker is its own entity group and the transaction covers the ranker's entity group. The upshot is that you can only write one rank into your ranklist at a time.


My game has picked up enough players that more than one ranking update was being processed at a time. This was causing the transactions to be retried, adding significant weight to my CPU usage.


After a quick post on the ranklist Google group I got a great response from Bartholomew Furrow:


I'm only guessing, but probably the best way to deal with this is batch your updates.  Whenever a scoreboard row changes in score, set a 'dirty' field in it to True; then, separately, have a cron job continually querying for rows with 'dirty' set to true and then running set_scores on the ranker for all of them.  That way you get the advantage of having a lot of updates at the same time, and during periods of heavy use you just fall behind a bit instead of failing requests.

Beyond that I think you'd need to get into specifics of the tree structure and whether certain nodes happen to be getting hit more often than the others.  For example, if you're calling num_ranked_users (or whatever it's called) a lot, you could consider slapping a memcache on that with an expiry time of 20s so your root node doesn't get hit nearly as often.

Well, I chose the batch method he mentioned above, and here is what I did:

  1. I added a rankChanged Boolean Property to the player entity.
  2. Whenever the player's cashOnHand changes, I set rankChanged to True.
  3. I removed the old individual calls to setRanking in ranklist.
  4. I set up a Cron Job, configured in the cron.yaml file, to run every minute.
  5. I added a request handler to do the work:
def get(self):
    playerscores = {}
    players = []
    count = 0
    playerquery = model.Player.all()
    playerquery.filter("rankChanged =", True)

    ranker = model.Player.getRanker()

    for player in playerquery.fetch(50):
        playerscores[str(player.key())] = [player.cashOnHand]
        player.rankChanged = False
        players.append(player)
        count += 1

    logging.debug("Set Rank for %s players" % count)
    ranker.SetScores(playerscores)
    db.put(players)

This solved the problem nicely, but not perfectly: the rankings are no longer real-time; they are delayed about a minute. I had to set the fetch size to 50 because beyond that you get timeouts and memory issues. Currently I am only using about 38 of the 50 per request. I should extend this function so that if the count is 50 it launches a Task Queue request to run this same code; that way, if there are more than 50 it won't back up until the next Cron job runs.
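That follow-up improvement might look something like this sketch. It is plain Python: `fetch_dirty`, `set_scores`, and `enqueue` stand in for the datastore query, ranker.SetScores, and taskqueue.add, and the task URL is made up:

```python
def sync_ranks(fetch_dirty, set_scores, enqueue, batch_size=50):
    # One pass of the rank-sync job. A full batch suggests more
    # dirty players are waiting, so enqueue a follow-up task
    # immediately instead of waiting for the next cron run.
    players = fetch_dirty(batch_size)
    scores = dict((p['key'], [p['cashOnHand']]) for p in players)
    set_scores(scores)
    for p in players:
        p['rankChanged'] = False  # clear the dirty flag
    if len(players) == batch_size:
        enqueue('/tasks/syncranks')  # hypothetical task URL
    return len(players)
```

With 120 dirty players, the first two passes each handle 50 and enqueue a follow-up task, and the third handles the remaining 20 without re-enqueueing.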

Hope this helps someone else.

Thanks Bartholomew Furrow for the help.

Sunday, August 1, 2010

Developing Games on Google App Engine

Yesterday at the Cleveland Game Developer Meetup Tech Workshop I gave a talk on using Google App Engine for games. I discussed many aspects and gave a brief overview of Google App Engine for those not familiar with it. In the talk I told people about the tools I used to create Railroad Empire and I presented the code for a simple demonstration multi-player text adventure.




You can get the code for the sample from here:
http://bitbucket.org/dramus/samplerpg

You can try it out here:
http://samplerpg.appspot.com