Sunday, October 18, 2015

Excited for Postgres 9.5 (Locking)

I was reading about some of the new things coming up in the Postgres 9.5 release and I was very excited about some new locking options that will make concurrency even better in Postgres.

I remember several years ago I thought it would be a good idea to use Postgres as a queue service. At the time I did not want to add a new service to our infrastructure, and I thought Postgres would be up to the challenge. This was the first time I ran into the problem that the first locking feature we are going to talk about solves.

In my early days of trying to make this work I did something along the lines of:

select * from queue_table
where processing = false
limit 1
for update;
The problem with this is that the first call will lock the first row for writing, and the second call will sit and wait for that row lock to be released before continuing. In Postgres 9.5 we get a new option:
select * from queue_table
where processing = false
limit 1
for update skip locked;


Problem solved. I still wouldn't use Postgres as a queue system given the many things I have learned since then; however, there are other places where you will run into this problem.
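To make the usage concrete, here is a minimal sketch of a queue worker built around that query. It assumes a hypothetical queue_table(id, payload, processing) and the psycopg2 driver; neither is from the original post.

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=myapp")

def claim_next_job():
    # "with conn" commits on success and rolls back on error, so the row
    # lock taken by FOR UPDATE is held only for the body of this function.
    with conn:
        with conn.cursor() as cur:
            cur.execute("""
                select id, payload from queue_table
                where processing = false
                limit 1
                for update skip locked
            """)
            row = cur.fetchone()
            if row is None:
                # Every unclaimed row is locked by another worker, or the queue is empty.
                return None
            job_id, payload = row
            cur.execute("update queue_table set processing = true where id = %s",
                        (job_id,))
            return job_id, payload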

Here is another one that would have saved us a lot of headaches at my previous employer. We had a data_table that you joined against the users table. You would then want to select the data in data_table for update while joining on the users table. Previously this would also row lock the matching row in the users table, which could cause issues.

select * from data_table
join users on data_table.user_id = users.id where data_table.id = 12345
for update;
The new solution to this is:
select * from data_table
join users on data_table.user_id = users.id where data_table.id = 12345
for update of data_table for share of users;
I hope I have shown you some new things that are going to make Postgres just that much better.

Monday, August 17, 2015

Excited for Postgres 9.5 (Row Level Security)

Every year around this time I get very excited about the new features that are getting ready to be released in the next version of Postgres. This year is no exception; there is a trove of new features that have me salivating. Today I am going to talk about one of these new features, Row Level Security. I hope to talk about some of the other features soon.

As an architect/developer of SaaS apps I can't help but be intrigued by the new Row Level Security model. Right now SaaS apps typically implement security either at the application level or by having separate databases or schemas per customer. Each of these methodologies has its pluses and minuses, but now we will have a new option to consider.

Application Level Security

This is what most people use when they write applications. You simply add a column to your tables to limit rows to a specific user or customer, and make sure you always query with those columns in your where clause or join. This works great 99% of the time, but what happens when some big customer comes to you and wants to connect their data in your app to a custom reporting system, or exercises the clause in their contract that lets them take all their data from your database because they are leaving you? Another place this becomes an issue is if your application layer is vulnerable to a SQL injection attack or some other nefarious attack. With application level security the attacker now has access to all the data in your database.

Separate Databases/Schemas

This model of development solves the problems above very effectively. It also has the added benefit of allowing you to shard your data horizontally across many machines, which can be very advantageous as your company grows. There are, however, tradeoffs, and one of the big ones is complexity. You now have to track which database and schema to connect to for each user/customer, and you have to figure out how to manage pools of connections from each web box to each database. Superuser/administration activities become a big problem. Shared data is now a problem. Running analytics and metrics on your system is much harder.

Row Level Security

RLS gives you the ability to have the advantages of both, with the exception of a convenient sharding mechanism. Using RLS you set up policies for tables, which are nothing more than where clauses or check constraints that are applied to the table for every query. Now that you can move those security checks to the database, it becomes easy to give access to only a single customer's data. There is also an option on pg_dump to enable RLS so you can easily dump only one user's data. If your web box becomes compromised, the only data that can be accessed is the data for the customer that is logged in. (Clearly, if the compromised user is a superuser the original problem persists.)
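To give a flavor of what this looks like, here is a rough sketch of a policy for a multi-tenant table. The table, column, and setting names are made up for the example, and the application is assumed to set app.current_customer on each connection.

-- Turn RLS on for the table; ordinary (non-owner) roles see no rows
-- until a policy grants access.
alter table invoices enable row level security;

-- Rows are only visible (USING) and only writable (WITH CHECK) when they
-- belong to the customer the application has set for this session.
create policy customer_isolation on invoices
    using (customer_id = current_setting('app.current_customer')::int)
    with check (customer_id = current_setting('app.current_customer')::int);

-- The application would run something like this after connecting:
set app.current_customer = '42';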

I can't wait to try these features out in production and see if they deliver on their promises. 

Thursday, July 16, 2015

Don't test with SQLite when you use Postgres in Production

Yesterday, I was looking for some help trying to have a clean way to roll back database transactions at the end of a test in a Python TestCase. So, naturally, I started searching and found some posts on Stack Overflow and started to read them. I was instantly mortified by what I read.
But in any case, an in-memory database (SQLite makes this easy) is the best choice for unit tests. 
Huh? I, at least, will listen and have a nice healthy debate when people suggest you should mock out every call to the database in unit tests, because if you are talking to the database it's not a unit test, it's a functional test. That argument is mainly one of semantics about what you call the test. But using a different database engine in your testing environment than in your production environment? Why would you want to do that?

What's worse is that the next Stack Overflow question I went to had almost the same answer from another person.
You can simply create an in-memory SQLite database and bind your session to that.

So, first, I wanted to stand on the rooftop and yell at the top of my lungs to anyone who would listen. This is a really bad idea. Here's why.

  1. SQLite has different SQL semantics than Postgres, so a query written on SQLite may not run in Postgres.
  2. SQLite has different bugs than Postgres, so just because something works in SQLite does not mean it will work in Postgres. The reverse is also true: just because you can't get it to work in SQLite doesn't mean it won't work in Postgres.
  3. Postgres has way more features than SQLite. Try doing a CTE query in SQLite. You can't use any of these features in your application if you test against a different database than you use in production. (It turns out SQLite added CTEs in 2014; however, there are differences, which is the point. For instance, it does not appear that you can do an insert or update inside a SQLite CTE; see the example after this list.)
  4. How can you test deployments if your test environment uses a different database than your production environment?
  5. How can you test that your application works with the newest version of the database before you push it to production?
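For example, here is a data-modifying CTE, where the WITH clause itself performs a DELETE. Postgres has supported this for years, while SQLite's CTEs are read-only. (The table names are made up for illustration.)

with archived as (
    delete from jobs
    where finished = true
    returning *
)
insert into jobs_archive
select * from archived;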
This sentiment about testing against a different database than you use in production is not just limited to Stack Overflow, it seems. I found a document on a leading cloud service provider's site that said this:


When To Use SQLite
  • Testing:
It is an overkill for a large portion of applications to use an additional process for testing the business-logic (i.e. the application's main purpose: functionality).
This is even more strange coming from them since one of the valid reasons cloud service providers give for using their services is that you can have a test environment that mirrors your production environment as closely as possible.

I specifically mentioned SQLite and Postgres, but in reality you could replace any two technologies and the arguments don't change. You wouldn't want to test on SQLite and use MySQL in production, nor would you want to test on Postgres and use Oracle in production.

So remember: don't develop and test against one database and release against a different one.

Edited Aug. 4
Fixed spelling error for SQLLite -> SQLite.
SQLite has had CTE support since 2014.

Thanks for all the great comments, and the good discussion this is creating within the community.

Monday, July 11, 2011

Writing data to Google Fusion Tables from App Engine

This weekend I was asked by some potential investors to supply some statistics for my game. The information was embedded in my datastore in the player and transaction objects. To extract the data, I decided to iterate over each player in the game. This could have been done with the map part of map/reduce, but for simplicity I just wrote a task queue job to walk the data serially; there wasn't a ton of data and speed wasn't a huge issue, so the fifteen minutes it took was fine.

To access Fusion Tables I downloaded the Python library for Fusion Tables:
http://code.google.com/p/fusion-tables-client-python/

Then I built the table I wanted to create using the Fusion Tables UI:
http://www.google.com/fusiontables/Home?pli=1

Then I simply set up my task to insert a Fusion Tables row for each player entity with the desired data in it.

To authenticate I use ClientLogin:

from authorization.clientlogin import ClientLogin
import ftclient

token = ClientLogin().authorize(username, password)
ft_client = ftclient.ClientLoginFTClient(token)

I created a template for the insert I wanted to do:


ftquerytemplate = """insert into 1116661 (created_date, login_date, first_transaction_date, last_transaction_date, player_status, number_of_transactions, length_of_play, time_to_txn) 
VALUES ('%(created_date)s', '%(login_date)s', '%(first_transaction_date)s', '%(last_transaction_date)s', '%(player_status)s', %(number_of_transactions)s, %(length_of_play)s, %(time_to_txn)s);"""



Then I just iterated over my players; for every 100 players, I saved a cursor and launched a new task with that cursor. I stacked up the inserts into one string and sent them as a single query at the end. There are some things you have to watch out for:

  • Null values are entered as empty strings ''
  • When entering dates, Fusion Tables will accept either just the date or the entire timestamp; however, I can find no way in Fusion Tables to aggregate on a date when you have a timestamp, so make sure you do that conversion in code.
  • If you want a number field to be null you still pass it an empty string ''
Below is the complete controller I used to get the data and place it in Fusion Tables. There are a lot of things that could be done by mixing Fusion Tables and Google App Engine. This was just a quick way for me to solve a problem; that's one of the things I love about these technologies: they are very easy to try out.




from controllers import *
from models import model
import datetime
import logging
import ftclient
from google.appengine.api import taskqueue
from authorization.clientlogin import ClientLogin

ftquerytemplate = """insert into [mytablenumber] (created_date, login_date, first_transaction_date, last_transaction_date, player_status, number_of_transactions, length_of_play, time_to_txn) 
VALUES ('%(created_date)s', '%(login_date)s', '%(first_transaction_date)s', '%(last_transaction_date)s', '%(player_status)s', %(number_of_transactions)s, %(length_of_play)s, %(time_to_txn)s);"""

class UserReporting(BaseController):
    def get(self):
        self.handleReporting()
        
    def post(self):
        self.handleReporting()
        
    def handleReporting(self):
        token = ClientLogin().authorize("[myusername]", "[mypassword]")
        ft_client = ftclient.ClientLoginFTClient(token)
        playerquery = model.Player.all()

        last_cursor = self.request.get('cursor')
        if last_cursor:
            playerquery.with_cursor(last_cursor)

        fetched = playerquery.fetch(100)
        logging.debug("Fetched %s records" % len(fetched))
        ftquery = ""
        for player in fetched:
            #Collect the values for this player's Fusion Tables row
            playerdata = {'created_date':player.createdDate.date(),
                          'login_date':player.lastLoginDate if player.lastLoginDate else "",
                          'player_status':'inactive' if player.inactive else "active"}
            if player.lastLoginDate and player.createdDate:
                playerdata['length_of_play'] = (player.lastLoginDate - player.createdDate.date()).days
            else:
                playerdata['length_of_play'] = 0
                          
            count = 0
            first_txn = datetime.datetime.max
            last_txn = datetime.datetime.min
            for txn in player.transaction_set:
                count += 1
                if txn.timestamp < first_txn:
                    first_txn = txn.timestamp
                if txn.timestamp > last_txn:
                    last_txn = txn.timestamp
            if first_txn != datetime.datetime.max and player.createdDate:
                playerdata['time_to_txn'] = (first_txn - player.createdDate).days
            else:
                playerdata['time_to_txn'] = "''"

            playerdata['number_of_transactions'] = count
            playerdata['last_transaction_date'] = '' if last_txn == datetime.datetime.min else last_txn 
            playerdata['first_transaction_date'] = '' if first_txn == datetime.datetime.max else first_txn
            ftquery += ftquerytemplate % playerdata
        logging.debug("SQL: %s" % ftquery)
        ft_client.query(ftquery) 
        if len(fetched) == 100:
            taskqueue.add(url='/reporting/userreporting', params={'cursor':playerquery.cursor()})
        

Friday, January 28, 2011

More App Engine optimization Goodness

So I was recently trying to improve the performance of my application and used two tricks to great effect that I want to share with everyone.

First I had some code in one of my tasks like this:

def doTask(self):
    train = NTrain.get(self.request.get('trainkey'))
    # Do some work
    # This call is a ReferenceProperty and does a get
    player = train.player
    #Do Some work with the player


The problem here is that we are doing two gets when we could do one, especially since every NTrain entity has its player as its parent. So instead we can do this:


def doTask(self):
    trainkey = db.Key(self.request.get('trainkey'))
    playerkey = trainkey.parent()
    train, player = db.get((trainkey, playerkey))
    #This will keep you from redoing the get if you use 
    #train.player later
    train.player = player
    #Do all your work here


The second optimization I did was to completely remove the need for a write. In the past, my ranking code would query for players that had a flag set indicating they needed to be ranked. Next I would run the ranking code and then clear the flag for each of the players. Finally I would write all the rankings in a batch and then write all the players in a batch.


class Player(db.Model):
    flag = db.BooleanProperty()

    def requireRanking(self):
        self.flag = True


class SetRanks(BaseController):
    def handleranks(self):
        playerquery = model.Player.all()
        playerquery.filter("flag =", True)

        players = []
        for player in playerquery:
            #Do the ranking work, then clear the flag
            player.flag = False
            players.append(player)

        db.put(players)


Now what I do is set a timestamp instead of a flag when ranking needs to be done. I then query for players whose timestamp is greater than the last time the task ran. I still write all the rankings in a batch, but there is no longer a need to clear a flag and write changes back to the players.

class Player(db.Model):
    lastMoney = db.DateTimeProperty()

    def requireRanking(self):
        self.lastMoney = datetime.datetime.now()


class SetRanks(BaseController):
    def handleranks(self):
        lastRankCheck = datetime.datetime.now() - datetime.timedelta(minutes=5)
        playerquery = model.Player.all()
        playerquery.filter("lastMoney >", lastRankCheck)
        for player in playerquery:
            pass  #Do the ranking work; no write back to the player is needed
There are many options for keeping track of the last time the query was run. In my case, knowing the task runs every five minutes, assuming that interval was OK because of the non-critical nature of the data. In a more critical environment you may want to store the last run time in the datastore. You could also pass it forward from task to task, having each task enqueue the next one with the time it ran as a parameter; a sketch of that hand-off is below.
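Here is a rough sketch of that last approach. The handler URL and the last_run parameter name are made up for the example; the rest mirrors the code above.

import datetime
from google.appengine.api import taskqueue

class SetRanks(BaseController):
    def handleranks(self):
        run_started = datetime.datetime.now()

        # Use the timestamp handed to us by the previous run, falling back
        # to the five-minute assumption on the very first run. Seconds
        # precision is enough; at worst a player gets re-ranked once.
        last_run = self.request.get('last_run')
        if last_run:
            lastRankCheck = datetime.datetime.strptime(last_run, "%Y-%m-%d %H:%M:%S")
        else:
            lastRankCheck = run_started - datetime.timedelta(minutes=5)

        playerquery = model.Player.all()
        playerquery.filter("lastMoney >", lastRankCheck)
        for player in playerquery:
            pass  # Do the ranking work

        # Queue the next run and hand it the time this run started.
        taskqueue.add(url='/reporting/setranks',
                      countdown=300,
                      params={'last_run': run_started.strftime("%Y-%m-%d %H:%M:%S")})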

Wednesday, October 27, 2010

Updating AppEngine Kinds without taking your site down

I recently had the need to change one of my datastore Kinds. In the past when I have done this, I have shut down the site and migrated every entity of the Kind to the new version. However, my application is getting to the point where that takes a long time, and I didn't want the site down that long. After reading several articles on the subject, most notably this one: http://codebetter.com/blogs/kyle.baley/archive/2010/09/28/options-for-upgrading-an-app-engine-application.aspx, and discussing the matter in the #appengine IRC channel on Freenode, I had a plan.

The first thing I did was add a new property to my Kind called version; next I added an __init__ constructor to the class. I had previously thought you were not supposed to do that with App Engine, but Nick Johnson on the IRC channel said that as long as you are careful it would be the right thing to use in this case.

The new constructor has a version check, and if the entity is an old version, the upgrade is performed. The super call to the base constructor must be made first.


class NTrain(db.Model):
    version = db.IntegerProperty()
    
    def __init__(self, *args, **kwargs):
        super(NTrain, self).__init__(*args, **kwargs)
        #Since this is the first time I am using version, just do a boolean check;
        #otherwise you would want: if self.version < 2: (or whatever your version number is).
        if not self.version:
            self.version = 1
            #do upgrade


Next I changed my code to use the new version of the entities.

Then I set up the code that creates all new instances of the Kind to use the new version number.

Finally I will (I haven't done this part yet) run a mapper over the entire kind that just has to load the entity and save it.

That last step is technically not necessary, but by doing that I will guarantee that all entities of the kind have been upgraded to the new version, and once that has been done I can remove the version checking code from the constructor.

You do not have to use a version number to decide that an upgrade is necessary. You could also use an implicit method of upgrading, such as checking whether a new property exists on the entity loaded from the datastore and, if not, running the upgrade; a sketch of this is below. For my use case I decided that the explicit check would be simpler and more robust.
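As a sketch of that implicit alternative (new_property is a made-up field for illustration):

class NTrain(db.Model):
    new_property = db.StringProperty()

    def __init__(self, *args, **kwargs):
        super(NTrain, self).__init__(*args, **kwargs)
        if self.new_property is None:
            # The entity predates new_property, so upgrade it here.
            # Note that a freshly constructed NTrain also lands in this
            # branch unless new_property is passed in, which is part of
            # why an explicit version check can be more robust.
            self.new_property = 'some default'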

Tuesday, September 7, 2010

Datastore lessons learned the hard way

With the latest release of the Railroad Empire game I have reduced my App Engine CPU usage by about 66%. Looking back on why my CPU usage was so high before, I have come to the conclusion that when I originally designed my application I made some poor datastore design decisions, and fixing those flaws drastically decreased my CPU usage. Below I detail some of the important lessons I have learned.

Use Entity Groups to intelligently partition your data
According to the App Engine documentation you should only use Entity Groups for transactions. http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html
 Only use entity groups when they are needed for transactions. For other relationships between entities, use ReferenceProperty properties and Key values, which can be used in queries.
I have discovered that you may also want to use entity groups to partition your data, for example placing all the data for a player in a game in an entity group with the player as the parent. The documentation alludes to this later on the same page.
A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller.
The problem is that the documentation reads as if you should only do this if you are actively using transactions on that data. First, I have found that you may start out thinking you won't need transactions and later discover that you do. Second, you can use the parent relationship in conjunction with your key names to establish unique keys.

Make intelligent use of your Key Names.
This is the big one. No matter what you do, don't just use a generated key for every one of your entities. Whenever it makes sense, use a key name that uniquely identifies the entity using data within the entity. You can often use entity groups to help create intelligent unique keys.

Here's why this is important: keys-only queries are so much faster than queries that return full entities. Say you have a screen that displays a list of your entities, and the user selects one of them to get a deeper view. If you do a full query for that list screen, you are loading every single entity into memory just to get the information necessary to click on it. Sometimes this can't be avoided, and when it can't you can use Memcache to cache the list for you, but if at all possible display only the information in the key name.

The other place where intelligent use of key names can help you, and where it helped me the most, was having my key names match some static lookup data that I use in my maps. In my application I use Google Maps to display the cities, stations, and routes of your railroad. There is a fixed set of cities and routes you can choose from, all saved in an XML file. Before I made the most recent change, you had to load each station entity to display its icon on the map, because the city the station referred to was stored as a property on the entity.

Since you can only have one station per city, I changed the datastore so that the key name for the station is the city being referenced. In order for this to work I had to put the Station in an entity group with the player as the parent. Now, whenever I want all of a player's stations but only care about which city each station is in (which is about 95% of the time), I don't have to load the large, bulky station objects. A sketch of this pattern is below.
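Here is a minimal sketch of that pattern. The Station model shown is illustrative; only the key_name/parent idea and the keys-only query matter.

from google.appengine.ext import db

class Station(db.Model):
    # The bulky properties (icon data, upgrades, etc.) would live here.
    level = db.IntegerProperty(default=1)

# 'player' is assumed to be an already-fetched Player entity.
# Create the station keyed by its city, inside the player's entity group;
# parent + key_name together make it unique per player and per city.
station = Station(parent=player, key_name='Chicago')
station.put()

# Later: find every city this player has a station in without loading the
# full entities; a keys-only query returns just the keys.
station_keys = Station.all(keys_only=True).ancestor(player).fetch(1000)
cities = [key.name() for key in station_keys]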

There is another way you could accomplish this same goal: you could de-normalize your data and store a list of the player's stations in the player object. This approach will work, but be wary, it has several drawbacks. You can have issues with index bloat, as well as entity size issues. Remember, each entity can only have 1000 index rows and can only be 1MB in size.

How to fix your ailing datastore
Luckily, if you have these sorts of woes in your datastore, or need to refactor it for any reason, all is not lost. There are several ways to move data around in the datastore without too much interference. There is a great project for doing the map part of a map-reduce over here: http://code.google.com/p/appengine-mapreduce/.
For datastore refactoring you only need the map part of map-reduce. Beware that sharding of the data can be problematic in the current implementation.

Another good alternative is the following, which ends up being a poor man's map (a sketch is below). Have an admin-only URL that launches a task (we'll call it the launcher task) to iterate, with a cursor, over the keys of the data you want to change. Have the launcher task launch a new task for each entity you want to change, to do the actual work of changing that entity (the worker task). Have the launcher task launch a reasonable number of worker tasks (I used 100), then add itself back to the queue with its cursor as a parameter, and then end. The launcher task will thus iterate over your entire query, launching tasks to do the real work, and the worker tasks make each individual datastore change.
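Here is a rough sketch of that launcher/worker pattern. The URLs, handler class names, and the Station kind are made up for the example.

from google.appengine.api import taskqueue
from google.appengine.ext import db

class MigrationLauncher(BaseController):
    def post(self):
        query = Station.all(keys_only=True)
        cursor = self.request.get('cursor')
        if cursor:
            query.with_cursor(cursor)

        # Launch one worker task per entity in this batch of keys.
        keys = query.fetch(100)
        for key in keys:
            taskqueue.add(url='/tasks/migrate_station', params={'key': str(key)})

        # If we got a full batch there may be more, so re-queue ourselves
        # with the cursor and end; the next launcher picks up where we left off.
        if len(keys) == 100:
            taskqueue.add(url='/tasks/migration_launcher',
                          params={'cursor': query.cursor()})

class MigrationWorker(BaseController):
    def post(self):
        station = db.get(db.Key(self.request.get('key')))
        # ... make the individual change to this one entity here ...
        station.put()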

Be sure to stop your cron jobs and other task queues while you are updating your datastore.

I hope that some of these lessons I have learned help other people.