Tuesday, September 7, 2010

Datastore lessons learned the hard way

With the latest release of the Railroad Empire game I have reduced my App Engine CPU usage by about 66%. Looking back on why my CPU usage was so high before, I have concluded that I made some poor datastore design decisions when I originally designed the application, and that fixing those design flaws is what drastically decreased my CPU usage. Below are some of the important lessons I have learned.

Use Entity Groups to intelligently partition your data
According to the App Engine documentation (http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html), you should only use Entity Groups for transactions:
 "Only use entity groups when they are needed for transactions. For other relationships between entities, use ReferenceProperty properties and Key values, which can be used in queries."
I have discovered that you may also want to use Entity Groups to partition your data, for example by placing all the data for a player in a game in an Entity Group with the player as the parent. The documentation alludes to doing this later on the same page:
 "A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller."
The problem is that the documentation reads as if you should only do this when you are actively using transactions on that data. First, I have found that you may start out thinking you won't need transactions and then discover that you do. Second, you can use the parent relationship in conjunction with your Key Names to establish unique keys.

Make intelligent use of your Key Names
This is the big one. No matter what you do, don't just use a generated key for every one of your entities. Whenever it makes sense, use a key name that uniquely identifies the entity using data within the entity. You can often use entity groups to help create intelligent unique keys.

Here's why this is important: queries that return only keys are so much faster than queries that return full entities. Say you have a screen that displays a list of your entities, from which a user can select one to get a deeper view. If you do a full query for that list screen, you are loading every single entity into memory just to get the information necessary to click on it. Sometimes this can't be avoided, and when it can't you can use Memcache to cache the list for you, but if at all possible display only the information in the key name.

The other place where intelligent use of Key Names can help you, and where it helped me the most, is in having key names match some static lookup data, which I use specifically in my maps. In my application I use Google Maps to display the cities, stations, and routes of your railroad. There is a fixed set of cities and routes you can choose from, all saved in an XML file. Before I made the most recent change, you had to load each station entity to display its icon on the map, because the city the station referred to was stored as a property in the entity.

Since you can only have one station per city, I changed the datastore so that the key name for the station is the city being referenced. For this to work I had to put the Station in an Entity Group with the player as the parent, since otherwise two players' stations in the same city would share a key. Now whenever I want all of a player's stations but only care about which city each station is in, which is about 95% of the time, I don't have to load the large, bulky station object.
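
To make the idea concrete, here is a minimal sketch in plain Python rather than the real datastore API. The Key tuple and the helper names are hypothetical stand-ins of my own; in the App Engine Python API I believe this corresponds roughly to building keys with db.Key.from_path and running a keys-only query with an ancestor filter, but treat that mapping as an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a datastore key: parent key, kind, and key name.
Key = namedtuple("Key", ["parent", "kind", "name"])


def player_key(player_id):
    """Root key for a player; it has no parent."""
    return Key(parent=None, kind="Player", name=player_id)


def station_key(player_id, city):
    """Key for a player's station; the city is the key name."""
    return Key(parent=player_key(player_id), kind="Station", name=city)


def station_cities(keys, player_id):
    """Read the cities straight from the keys, without loading any entity."""
    parent = player_key(player_id)
    return sorted(k.name for k in keys
                  if k.kind == "Station" and k.parent == parent)


# Two players can each own a station in the same city without a collision,
# because the parent is part of the key.
keys = [
    station_key("alice", "Chicago"),
    station_key("alice", "Denver"),
    station_key("bob", "Chicago"),
]
assert station_key("alice", "Chicago") != station_key("bob", "Chicago")
print(station_cities(keys, "alice"))  # ['Chicago', 'Denver']
```

The map screen only needs what station_cities returns; the static city data in the XML file supplies everything else.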

There is another way you could accomplish this same goal. You could de-normalize your data and store a list of the player's stations in the player object. This approach will work, but be wary: it has several drawbacks. You can have issues with index bloat, as well as entity size issues. Remember that each entity can have at most 1000 index rows and can be at most 1MB in size.

How to fix your ailing datastore
Luckily, if you have these sorts of woes in your datastore, or need to refactor it for any reason, all is not lost. There are several ways to move data around in the datastore without too much interference. There is a great project for doing the map part of a map-reduce over here: http://code.google.com/p/appengine-mapreduce/.
For datastore refactoring you only need the map part of map-reduce. Beware that sharding of the data can be problematic in the current implementation.

Another good alternative is the following, which ends up being a poor man's map. Have an admin-only URL that launches a task (we'll call it the launcher task) to iterate, using a cursor and a keys-only query, over the data you want to change. For each entity, the launcher task launches a new task (the worker task) that does the actual work of changing that entity. After launching a reasonable number of worker tasks (I used 100), the launcher task adds itself back to the queue with its cursor as a parameter and then ends. In this way the launcher task iterates over your entire query, launching tasks to do the real work, and each worker task makes one individual datastore change.
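
The launcher and worker flow described above can be sketched in plain Python. This is a simulation under stated assumptions, not App Engine code: the in-memory datastore dict, the deque used as a task queue, the integer cursor, and all function names are hypothetical stand-ins for the real datastore, task queue, and query cursor.

```python
from collections import deque

BATCH_SIZE = 100  # worker tasks enqueued per launcher run (the post used 100)

# Hypothetical in-memory stand-ins for the datastore and the task queue.
datastore = {"station-%04d" % i: {"migrated": False} for i in range(250)}
task_queue = deque()


def fetch_keys(cursor, limit):
    """Keys-only 'query': up to `limit` keys after `cursor`, plus the new cursor."""
    keys = sorted(datastore)[cursor:cursor + limit]
    return keys, cursor + len(keys)


def worker(key):
    """Worker task: apply the change to exactly one entity."""
    datastore[key]["migrated"] = True


def launcher(cursor=0):
    """Launcher task: enqueue one batch of workers, then re-enqueue itself."""
    keys, next_cursor = fetch_keys(cursor, BATCH_SIZE)
    for key in keys:
        task_queue.append((worker, key))
    if len(keys) == BATCH_SIZE:
        # A full batch means more data may remain; continue from the cursor.
        task_queue.append((launcher, next_cursor))


def run_queue():
    """Drain the queue in order, as the task queue service would over time."""
    executed = 0
    while task_queue:
        task, arg = task_queue.popleft()
        task(arg)
        executed += 1
    return executed


task_queue.append((launcher, 0))  # what the admin-only URL would do
executed = run_queue()
print(executed, all(e["migrated"] for e in datastore.values()))
```

Because each launcher run only touches keys and hands off a bounded batch, no single task has to walk the whole dataset, and a failed worker affects only its own entity.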

Be sure to stop your cron jobs and other task queues while you are updating your datastore.

I hope that some of these lessons I have learned help other people.