In my previous post, I described a caching solution based on subclassing the db.ReferenceProperty class. This time I want to take this one step further and implement it at a lower level of the API: apiproxy_stub_map.apiproxy.
You can add hooks to the datastore, as I already showed in a previous (profiling) post. However, in order to return a cached entity instead of one retrieved from the datastore, we need more than a PreCall hook: from a PreCall hook we cannot prevent the actual datastore call from being made. Maybe we could “trick” MakeSyncCall by removing the keys from the argument list, but that is not a road I want to explore…
So, we need to replace the actual apiproxy.MakeSyncCall. Please be aware that there may already be references to the original MakeSyncCall, as described in this post. For our caching solution, this is not a real problem: the worst that can happen is that there will be no caching. The app will then be slower, but it will still function the same…
Replacing a registered datastore Stub
Replacing the registered datastore stub may seem like the way to go if you want to wrap the datastore_v3 call. However, registering a stub is allowed only when no stub is registered yet for that service. And, you guessed it, in production there is already a stub present. This method is therefore only useful for unit testing, not for production. We could replace the global MakeSyncCall instead, but then we add overhead to every remote call, which I would rather avoid.
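To make the registration constraint concrete, here is a minimal sketch that does not use the SDK at all. StubRegistry, EchoStub and WrappingStub are hypothetical stand-ins invented for this illustration; they only mimic the "one stub per service" rule described above and are not the real apiproxy_stub_map API.

```python
class StubRegistry(object):
    """Hypothetical stand-in for the SDK's stub map: one stub per service."""

    def __init__(self):
        self._stubs = {}

    def register_stub(self, service, stub):
        if service in self._stubs:
            raise ValueError("a stub is already registered for %s" % service)
        self._stubs[service] = stub

    def make_sync_call(self, service, call, request):
        return self._stubs[service].make_sync_call(service, call, request)


class EchoStub(object):
    """Stand-in for the real datastore stub."""

    def make_sync_call(self, service, call, request):
        return "%s.%s(%s)" % (service, call, request)


class WrappingStub(object):
    """Delegates to another stub; a cache lookup could go before the call."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.seen = []

    def make_sync_call(self, service, call, request):
        self.seen.append(call)
        return self.wrapped.make_sync_call(service, call, request)


# Unit-test situation: a fresh registry, so our wrapper can be registered.
registry = StubRegistry()
wrapper = WrappingStub(EchoStub())
registry.register_stub("datastore_v3", wrapper)
result = registry.make_sync_call("datastore_v3", "Get", "key-1")

# Production-like situation: the service is already taken, so this fails.
try:
    registry.register_stub("datastore_v3", EchoStub())
    second_registration_failed = False
except ValueError:
    second_registration_failed = True
```

In a unit test you control the registry from the start, which is why wrapping works there; in production the slot is already occupied before your code runs.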
Replacing db.get
If we want to retrieve a model by using a key or key_name, all calls to the datastore are implemented using the global module function db.get. So, if we can wrap this call with a caching function, we’re done. Here’s the solution I came up with:
import weakref
from functools import partial

from google.appengine.api import datastore
from google.appengine.ext import db

# define a global, weakly referenced CACHE...
_cache = weakref.WeakValueDictionary({})

def getCached(real_func, keys):
    real_keys, multiple = datastore.NormalizeAndTypeCheckKeys(keys)
    if multiple: # too difficult for now ;-)
        return real_func(keys)
    real_key = real_keys[0]
    instance = _cache.get(real_key, None)
    if instance is not None:
        return instance
    model = real_func(keys)
    if model is not None:
        _cache[real_key] = model
    return model

# Patch db.get
db.get = partial(getCached, db.get)
A global function getCached is defined, which wraps the original db.get function. First, we check whether the function is called for multiple keys. If that is the case, we simply call the original function (implementing this would be too much code for this example). Otherwise, we check whether the key is already present in the cache. If so, we return the cached value; otherwise the real db.get is called and the result is added to the cache.
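For the curious, the multiple-key branch that is skipped above could look roughly like the sketch below. It runs without App Engine: FakeModel, fake_get and fetch_log are hypothetical stand-ins for a model instance, db.get and the datastore, and plain strings stand in for normalized keys. Treat it as an illustration of the idea (serve hits from the cache, fetch the misses in one batch), not as a drop-in replacement.

```python
import weakref


class FakeModel(object):
    """Hypothetical stand-in for a db.Model instance (weak-referenceable)."""

    def __init__(self, key):
        self.key = key


# Hypothetical stand-in for the datastore contents.
_STORE = {"a": 1, "b": 2, "c": 3}
fetch_log = []  # records every batch of keys that reaches the "datastore"


def fake_get(keys):
    """Mimics db.get for a list of keys: one result per key, None if missing."""
    fetch_log.append(list(keys))
    return [FakeModel(k) if k in _STORE else None for k in keys]


_cache = weakref.WeakValueDictionary()


def get_cached_multi(real_func, keys):
    """Serve cached keys from memory and fetch only the misses in one batch."""
    results = dict((k, _cache.get(k)) for k in keys)
    missing = [k for k in keys if results[k] is None]
    if missing:
        for k, model in zip(missing, real_func(missing)):
            if model is not None:
                _cache[k] = model
            results[k] = model
    return [results[k] for k in keys]


first = get_cached_multi(fake_get, ["a", "b"])        # both keys are fetched
second = get_cached_multi(fake_get, ["a", "b", "c"])  # only "c" is fetched
```

Because the cache holds weak references, the hits in the second call only happen while something else (here the list first) still references the instances.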
Query results
How about query results? Are they added to the cache as well? Unfortunately, they are not. When query results are fetched, the entire entity is retrieved at once, so there is no need to get the values by key. It would be nice if we could add the retrieved entities to the cache, so that we don’t have to retrieve them again for a ReferenceProperty value.
This turns out to be more difficult. We could subclass db.Query (and db.GqlQuery) so that we could implement a different iterator (a caching subclass of db._QueryIterator), but that seems a bit tedious. Besides, we would have to override more functions, since the model values are not always retrieved using the iterator. The fetch function, for example, uses a map function to run the entities over the Model.from_entity function.
Replacing Model.from_entity
The fact that entities are converted to Model instances by the Model.from_entity function gives us a nice hook to implement a caching strategy, doesn’t it?
All we have to do is subclass db.Model and re-implement the from_entity function:
class CachingModel(db.Model):
    @classmethod
    def from_entity(cls, entity):
        # let db.Model build the instance, then remember it by key
        model = super(CachingModel, cls).from_entity(entity)
        if model is not None:
            _cache[model.key()] = model
        return model
The resulting CachingModel class is the one we subclass to implement our business models, like this:
class Team(CachingModel):
    club = db.ReferenceProperty(Club)
    naam = db.StringProperty()
    poule = db.ReferenceProperty(Poule)
If we add all model instances to the cache in the from_entity function, we can remove that from our global getCached function…
def getCached(real_func, keys):
    real_keys, multiple = datastore.NormalizeAndTypeCheckKeys(keys)
    if multiple: # too difficult for now ;-)
        return real_func(keys)
    real_key = real_keys[0]
    instance = _cache.get(real_key, None)
    if instance is not None:
        return instance
    model = real_func(keys)
    # if model:
    #     _cache[real_key] = model
    return model
We now have a complete, in-memory caching solution, which can enhance our application performance significantly with minimal effort!
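How the two pieces cooperate can be shown in a small sketch that runs without App Engine. BaseModel, real_get and the dict-shaped "entity" are hypothetical stand-ins for db.Model, the unpatched db.get and a datastore entity; only the division of labour is the point: from_entity fills the cache, the get wrapper reads it.

```python
import weakref

_cache = weakref.WeakValueDictionary()
datastore_gets = []  # keys that actually reached the stand-in datastore


class BaseModel(object):
    """Hypothetical stand-in for db.Model; an 'entity' is a plain dict here."""

    def __init__(self, key, props):
        self._key = key
        self.props = props

    def key(self):
        return self._key

    @classmethod
    def from_entity(cls, entity):
        return cls(entity["key"], entity["props"])


class CachingModel(BaseModel):
    @classmethod
    def from_entity(cls, entity):
        model = super(CachingModel, cls).from_entity(entity)
        if model is not None:
            _cache[model.key()] = model  # writes to the cache, never reads it
        return model


def real_get(key):
    """Stand-in for the unpatched db.get (single key only)."""
    datastore_gets.append(key)
    return CachingModel(key, {})


def get_cached(real_func, key):
    instance = _cache.get(key)  # the only place where the cache is read
    if instance is not None:
        return instance
    return real_func(key)


# A "query" returns raw entities, which all pass through from_entity.
query_results = [CachingModel.from_entity({"key": "club-1", "props": {}})]

# A later lookup by key (e.g. resolving a reference) is served from memory.
club = get_cached(real_get, "club-1")
other = get_cached(real_get, "club-2")  # never seen by a query: real fetch
```

Only "club-2" reaches the stand-in datastore; "club-1" comes back as the very instance the query produced.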
I like the idea that you have laid out here, but I don’t think that your caching class takes into account any updates to cached objects.
I don’t understand how your from_entity method reduces the number of calls to the datastore (it may be my limited Python).
Doesn’t your from_entity method just cache results, without ever looking in the cache to see whether the results are already there?
The from_entity method is able to cache query results as well. Therefore more entities are present in my cache and my hitrate will increase. The getCached method is still used to retrieve the results from the cache.
Hope this clarifies things a bit!
Do you think that there would be a problem looking for hits in the cache inside the from_entity method?
and
How does this cope with a model.put()? have you also overridden something in the put() method that updates the cached version?
Jonathan,
Looking in the cache inside the from_entity method is not very useful, since from_entity is called immediately after the entity is retrieved from the datastore.
I do not cope with model.put() since the scope of the cache is just one request. But, if you need this, you could add it of course…
Regarding using stubs – with a little messing around, you can replace the existing stub with your own one, and call the real one when necessary. It requires messing with SDK internals, though, so it may break with changes.
There’s a recipe around that uses this technique for transparent memcache caching, but I can’t find it at the moment.
Henri,
Thanks for this post, very interesting. A few questions.
1. Is it possible this acts differently in development and production? When I put logging into getCached(), I see it in production but not development, which seems confusing.
2. Wouldn’t you want to override put() to clear the cache for that key? That is, if you get then put then get, you don’t want the stale copy, even if it’s just a per-request cache.
3. Why does getCached() not need to declare _cache as a global? For me, python interpreted _cache as a local until I declared it.
4. If you put the override of db.get at the global level into a file, then import that file from two different places, will you get it doubly overridden? Seems like maybe it should be protected.
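On question 4: a module body normally runs only once per process, however many places import it, so a plain double import should not wrap db.get twice. A reload, or importing the same file under two different module names, could. A guard along the following lines would protect against that. This is a sketch: the db object here is a hypothetical stand-in module, and the _is_caching_wrapper marker is a convention made up for this example.

```python
from functools import partial
import types

# Hypothetical stand-in for the google.appengine.ext.db module.
db = types.ModuleType("db")
db.get = lambda keys: "fetched %s" % (keys,)


def get_cached(real_func, keys):
    return real_func(keys)  # caching logic omitted in this sketch


def patch_db_get():
    """Wrap db.get once; later calls are no-ops and return False."""
    if getattr(db.get, "_is_caching_wrapper", False):
        return False
    wrapper = partial(get_cached, db.get)
    wrapper._is_caching_wrapper = True  # marker attribute, our own convention
    db.get = wrapper
    return True


first = patch_db_get()
second = patch_db_get()
```

After both calls, db.get is still a single partial around the original function.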
Also, doesn’t this cache persist across requests if you have a main method? (See http://code.google.com/appengine/docs/python/runtime.html#App_Caching).
Thus, one might wish to clear the cache in main.
The link without ):
http://code.google.com/appengine/docs/python/runtime.html#App_Caching
Note that this depends on the signature of db.get(). As of yesterday, db.get() in production seems to now have a new parameter (“rpc”), so this code fails.
Change to
def getCached(real_func, keys, **kwargs):
    …
and where you call real_func, pass also **kwargs, e.g.,
model = real_func(keys, **kwargs)