Monday, July 11, 2011

Writing data to Google Fusion Tables from App Engine

This weekend I was asked by some potential investors to supply some statistics for my game. The information was embedded in my datastore in the player and transaction objects. To extract the data, I decided to iterate over each player in the game. This could have been done with a map part of map/reduce, but I just wrote a simple task queue job to walk the data serially for simplicity since there wasn't a ton of data and speed wasn't a huge issue, the fifteen minutes it took was fine.

To access Fusion Tables I downloaded the Python library for Fusion Tables:
http://code.google.com/p/fusion-tables-client-python/

Then I built the table I wanted to create using the Fusion Tables UI:
http://www.google.com/fusiontables/Home?pli=1

Then I simply setup my task to insert a fusion table row for each player entity with the desired data in it.

To authenticate I user Client Login:

from authorization.clientlogin import ClientLogin
import ftclient

token = ClientLogin().authorize(username, password)
ft_client = ftclient.ClientLoginFTClient(token)

I created a template for the insert I wanted to do


ftquerytemplate = """insert into 1116661 (created_date, login_date, first_transaction_date, last_transaction_date, player_status, number_of_transactions, length_of_play, time_to_txn) 
VALUES ('%(created_date)s', '%(login_date)s', '%(first_transaction_date)s', '%(last_transaction_date)s', '%(player_status)s', %(number_of_transactions)s, %(length_of_play)s, %(time_to_txn)s);"""



Then I just iterated over my players, for each 100 players, I saved a cursor and launched a new task with that cursor. I stacked up the inserts into one string and sent them as a single query at the end. There are some things you have to watch out for:

  • Null values are entered as empty strings ''
  • When entering dates, Fusion tables will enter either just the date or the entire timestamp, however there is no way in fusion tables that I can find to aggregate on a date when you have a timestamp, so make sure you do that conversion in code.
  • If you want a number field to be null you still pass it an empty string ''
Below is the complete controller I used to get the data and place it in Fusion tables. There are a lot of things that could be done by mixing fusion tables and Google App Engine. This was just a quick way for me to solve a problem quickly, that's one of the things I love about these technologies, they are very easy to try out.




from controllers import *
from models import model
import datetime
import logging
import ftclient
from google.appengine.api import taskqueue
from authorization.clientlogin import ClientLogin

ftquerytemplate = """insert into [mytablenumber] (created_date, login_date, first_transaction_date, last_transaction_date, player_status, number_of_transactions, length_of_play, time_to_txn) 
VALUES ('%(created_date)s', '%(login_date)s', '%(first_transaction_date)s', '%(last_transaction_date)s', '%(player_status)s', %(number_of_transactions)s, %(length_of_play)s, %(time_to_txn)s);"""

class UserReporting(BaseController):
    def get(self):
        self.handleReporting()
        
    def post(self):
        self.handleReporting()
        
    def handleReporting(self):
        token = ClientLogin().authorize("[myusername]", "[mypassword]")
        ft_client = ftclient.ClientLoginFTClient(token)
        playerquery = model.Player.all()

        last_cursor = self.request.get('cursor')
        if last_cursor:
            playerquery.with_cursor(last_cursor)

        fetched = playerquery.fetch(100)
        logging.debug("Fetched %s records" % len(fetched))
        ftquery = ""
        for player in fetched:
            #Created data in Fusion Tables 
            playerdata = {'created_date':player.createdDate.date(),
                          'login_date':player.lastLoginDate if player.lastLoginDate else "",
                          'player_status':'inactive' if player.inactive else "active"}
            if player.lastLoginDate and player.createdDate:
                playerdata['length_of_play'] = (player.lastLoginDate - player.createdDate.date()).days
            else:
                playerdata['length_of_play'] = 0
                          
            count = 0
            first_txn = datetime.datetime.max
            last_txn = datetime.datetime.min
            for txn in player.transaction_set:
                count += 1
                if txn.timestamp < first_txn:
                    first_txn = txn.timestamp
                if txn.timestamp > last_txn:
                    last_txn = txn.timestamp
            if first_txn != datetime.datetime.max and player.createdDate:
                playerdata['time_to_txn'] = (first_txn - player.createdDate).days
            else:
                playerdata['time_to_txn'] = "''"

            playerdata['number_of_transactions'] = count
            playerdata['last_transaction_date'] = '' if last_txn == datetime.datetime.min else last_txn 
            playerdata['first_transaction_date'] = '' if first_txn == datetime.datetime.max else first_txn
            ftquery += ftquerytemplate % playerdata
        logging.debug("SQL: %s" % ftquery)
        ft_client.query(ftquery) 
        if len(fetched) == 100:
            taskqueue.add(url='/reporting/userreporting', params={'cursor':playerquery.cursor()})
        

Friday, January 28, 2011

More App Engine optimization Goodness

So I was recently trying to improve performance on the application and used two tricks to great effect that I want to share with everyone.

First I had some code in one of my tasks like this:

def doTask(self):
    train = NTrain.get(self.request['trainkey'])
    # Do some work
    # This call is a ReferenceProperty and does a get
    player = train.player
    #Do Some work with the player


The problem here is we are doing two gets when we could do 1 get, especially when all NTrain entites are parents of their player. So instead we can do this:


def doTask(self):
    trainkey = db.Key(self.request['trainkey'])
    playerkey = trainkey.parent()
    train, player = db.get((trainkey, playerkey))
    #This will keep you from redoing the get if you use 
    #train.player later
    train.player = player
    #Do all your work here


The second optimization I did was to completely remove the need for a write. In the past my ranking code that ranks players would query players that had a flag set for needing ranking done. Next I would call the ranking code, and then clear the flag for each of the players. Finally I would write all the ranking in a batch and then I would write all the players in a batch.


class Player(db.Model):
 flag = db.BooleanProperty()

def requireRanking(self):
flag = True


class SetRanks(BaseController):
def handleranks(self):
        playerquery = model.Player.all()
        playerquery.filter("flag =", True)

        players = []
        for player in playerquery:
            #Do Work
            player.flag = False
players.append(player)


db.put(players)


Now what I do is set a timestamp instead of a flag when ranking needs done. Now I query for players whose timestamp is greater then the last time the task ran. Now I would write all the ranking in a batch, but there is no longer a need to clear the flag and write changes to the player.

class Player(db.Model):
    lastMoney = db.DateTimeProperty()

    def requireRanking(self):
        lastMoney = datetime.datetime.now()

class SetRanks(BaseController):
    def handleranks(self): 
        lastRankCheck = datetime.datetime.now() -datetime.timedelta(minutes =5)
        playerquery = model.Player.all()        
        playerquery.filter("lastMoney >", lastRankCheck)
        for player in playerquery:
            #Do Work
There are many options for how to keep track of the last time the query was run. In my case knowing it is run every five minutes and making that assumption was OK because of the non-critical nature of the data. In a more critical environment you may want to store the last run time in the datastore. You could also pass it forward from task to task (Having the last task create an instance of the next task with the time it ran as a parameter).