
AppEngine doesn’t fit the needs of startups on the runway

I tweeted yesterday that I have found AppEngine a poor fit for my startup. The topic deserves follow-up, since I am a big fan of AppEngine in general.

The primary need in a startup is to find a sufficiently large, sufficiently paying audience to enable continued survival. Since by definition the startup is boldly going where no startup has gone before, this implies experimentation. Startups need to mitigate risk by validating a large portfolio of ideas. Spinning the “idea -> experiment -> validation” cycle at hyperspeed reduces risk and maximizes value.

JUnit Max logs all of its errors to an AppEngine-hosted server. It also stores a summary of each test run: number of tests run, number passed, time, and (optionally) the user id of the programmer. I use the errors to prioritize the time I spend fixing defects. I hope to use the log of test runs to power novel services that help programmers find meaning in their work. I picked AppEngine because it was free for small-scale projects and because it handled lots of system administration tasks for me.

But…

I now have a month’s worth of data from the Max user community in the server. AppEngine has two attributes that have frustrated my efforts to spin my validation loop for JUnit Max using this data: owned data and time limits. First, what’s stored in AppEngine stays in AppEngine. There is no general way to download my data set. In extremis I have resorted to copying HTML tables from the online user interface and pasting them into Excel. Locked up in the data on the server are answers to a million questions like, “How many people are setting their user id?” If I had the data locally, answering this question would take seconds. As the data is in AppEngine, I can’t run an ad hoc query.
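With the data held locally, a question like the one above reduces to a single pass over a list. A minimal sketch, assuming a hypothetical RunSummary class whose fields mirror the run summary described earlier; the class names, field names, and sample values are illustrative and not taken from JUnit Max:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class UserIdQuery {
    // Counts the distinct non-null user ids that appear in the logged runs.
    static long distinctUsersWithId(List<RunSummary> runs) {
        return runs.stream()
                .map(run -> run.userId)
                .filter(Objects::nonNull)
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        List<RunSummary> runs = Arrays.asList(
                new RunSummary(10, 10, "alice"),
                new RunSummary(12, 11, null),
                new RunSummary(12, 12, "alice"),
                new RunSummary(5, 5, "bob"));
        System.out.println(distinctUsersWithId(runs)); // prints 2
    }
}

// Hypothetical in-memory form of one logged test run (assumed fields).
final class RunSummary {
    final int testsRun;
    final int testsPassed;
    final String userId; // null when the programmer has not set a user id

    RunSummary(int testsRun, int testsPassed, String userId) {
        this.testsRun = testsRun;
        this.testsPassed = testsPassed;
        this.userId = userId;
    }
}
```

At this data size the whole list can be rescanned for every new question, so no index or schema work is needed up front.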

Which leads to the second attribute of AppEngine that has proven to be a problem: time limits. All processing in AppEngine is triggered by requesting a URL, and a request can’t take longer than 5 seconds to process. Ad hoc queries of the kind required to quickly answer the “How many people are setting their user ids?” question can’t run in five seconds, at least not the way I write them (I’m using JDO as an interface). Conceptually simple operations like setting a default value for a new data column in existing rows turn into a game: write a servlet that queries for rows that don’t have the column set and sets as many rows as possible before timing out (this has turned out to be 30-40 rows for simple operations), then set up a cron job to run the servlet once a minute until done. An operation that could have been done manually in a second turns into an hour of messing about.
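The backfill game described above can be sketched roughly as follows. This is not the actual servlet: a plain list stands in for the datastore and JDO, the clock is passed in so the time budget can be exercised without waiting, and StoredRow, Backfill, and the 4-second budget are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

final class BackfillDemo {
    public static void main(String[] args) {
        List<StoredRow> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(new StoredRow());
        }
        int requests = 0;
        // Each loop iteration stands in for one cron-triggered request.
        while (Backfill.runOnce(rows, "unknown", 4_000L, System::currentTimeMillis) > 0) {
            requests++;
        }
        System.out.println("rows updated over " + requests + " request(s)");
    }
}

// Hypothetical stored row; a null value means the new column is not set yet.
final class StoredRow {
    String newColumn;
}

final class Backfill {
    // Sets the default on unset rows until the time budget is spent.
    // Returns how many rows this call updated. A result of 0 means nothing
    // was left to do, or the budget was too small to update even one row.
    static int runOnce(List<StoredRow> rows, String defaultValue,
                       long budgetMillis, LongSupplier clock) {
        long deadline = clock.getAsLong() + budgetMillis;
        int updated = 0;
        for (StoredRow row : rows) {
            if (row.newColumn != null) {
                continue; // already migrated by an earlier request
            }
            if (clock.getAsLong() >= deadline) {
                break; // stop before the platform would time the request out
            }
            row.newColumn = defaultValue;
            updated++;
        }
        return updated;
    }
}
```

Because every call skips rows that already have a value, rerunning it once a minute is safe: each request picks up where the last one stopped.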

Since Max is still young, the user base is small and the data involved is also small. This negates one of the great advantages advertised for AppEngine: scale. While I’m sure that transparent scaling would be great, I don’t have that problem. If I can’t find a way to experiment faster I fear I will never have that problem.

The frustrating part is that I’m sure that if I had the data in, say, Smalltalk I could do all this experimentation at lightning speed. The data fits in memory. If I wanted to process it to understand user behavior I could do so in a second. If I wanted to produce an experiment like this one, which shows the time lapse between green test runs (thanks to David Saff for being an example), I could code it in minutes. Even if a product idea required processing the entire data set on every hit, I could do it. As it is, I have to scale back my ideas to what AppEngine can accomplish fetching a handful of records.
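As an illustration of how little code such an experiment needs once the data is in memory, here is a minimal sketch of the green-run time lapse. It assumes the logged “time” is a timestamp in milliseconds, that a run is green when every test passed, and that the runs belong to one programmer and are already sorted oldest first. TimedRun and GreenGaps are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GreenGapDemo {
    public static void main(String[] args) {
        List<TimedRun> runs = Arrays.asList(
                new TimedRun(0L, 10, 10),
                new TimedRun(60_000L, 10, 9),
                new TimedRun(90_000L, 10, 10),
                new TimedRun(100_000L, 11, 11));
        System.out.println(GreenGaps.millisBetweenGreenRuns(runs)); // prints [90000, 10000]
    }
}

// Hypothetical in-memory test run; timestampMillis is an assumed field.
final class TimedRun {
    final long timestampMillis;
    final int testsRun;
    final int testsPassed;

    TimedRun(long timestampMillis, int testsRun, int testsPassed) {
        this.timestampMillis = timestampMillis;
        this.testsRun = testsRun;
        this.testsPassed = testsPassed;
    }

    boolean isGreen() {
        return testsRun == testsPassed;
    }
}

final class GreenGaps {
    // Returns the elapsed time between each green run and the previous green run.
    // Assumes the input is sorted by timestamp, oldest first.
    static List<Long> millisBetweenGreenRuns(List<TimedRun> runs) {
        List<Long> gaps = new ArrayList<>();
        TimedRun previousGreen = null;
        for (TimedRun run : runs) {
            if (!run.isGreen()) {
                continue; // red runs only lengthen the gap
            }
            if (previousGreen != null) {
                gaps.add(run.timestampMillis - previousGreen.timestampMillis);
            }
            previousGreen = run;
        }
        return gaps;
    }
}
```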

Where next?

I am emphatically not anti-AppEngine in general. Once I work out which features people will pay for and how I can incrementally process data to deliver those features, I’m sure it will be a cheap way to scale quickly. That is, AppEngine supports climb out and level flight just fine. On the runway, though, its limitations are potentially fatal. I’m now looking for an environment for my data that encourages experimentation and is still cheap and simple to operate. I’m looking forward to the day when moving back to AppEngine is a way to solve my scaling problems.

Now if I can just figure out how to liberate my data 5 seconds at a time…

22 Comments

Dominic MitchellJune 5th, 2009 at 6:33 am

You may be interested in looking at GAEBAR (Google App Engine Backup and Restore). I haven’t tried it, but it claims to back up your entire dataset. Even if it’s not usable directly, it may well contain some hints on how to do so for you.

KentBeckJune 5th, 2009 at 6:36 am

Thanks for the pointer. That solves the ad hoc analysis problem, but if I want to trial features that happen to need to traverse a lot of data I still can’t do it in AppEngine.

Michael O'BrienJune 5th, 2009 at 6:39 am

Hi Kent, I was asking you about this on twitter yesterday (I’m @mcobrien). I hadn’t realised you were using Java on appengine, but remote download (and a slow but useful remote REPL) are both possible with the python runtime.

You can actually upload a python version of your app to get access to these features — see this post on the ae group: http://groups.google.com/group/google-appengine/browse_thread/thread/2f31defb815308ee/af9650efeac0666b

Once you have a python version installed, take a look at setting up the remote API:

http://code.google.com/appengine/articles/remote_api.html

Hope this helps, let me know if you have any problems as I’ve done this (via python) a couple of times!

cheers
Michael

KentBeckJune 5th, 2009 at 6:45 am

Thanks. That will solve my immediate need to mine the data I have so far.

Brett SlatkinJune 5th, 2009 at 8:23 am

Hey Kent,

Sorry to hear you’ve encountered some bumps along the way. Things can work a little differently in App Engine, but I believe it’s great for small and large applications alike. Comments here have already pointed out some good solutions to your problems; I’d like to point out a couple others:

First, the App Engine request time limit is now *30 seconds*. It was 10 seconds at launch in April 2008. You can do a full 30 second burn of CPU usage in a single request to accommodate the type of work you’ve described.

You can definitely do *online* ad hoc queries on the production data (without downloading it), as long as your query does not require a new index to be built. You can use the Datastore Viewer in the admin console to do this or add a shell/repl app to run arbitrary code that you type in. See an example shell here:
http://lotrepls.appspot.com/

Ad hoc queries can be satisfied primarily with the “merge-join” type of query, which lets you do multiple equality filters in a single query. This can satisfy set membership tests (like “How many people are setting their user ids?”) very easily. This concept is covered in detail in this Google I/O talk (the video should be online today or very soon):
http://code.google.com/events/io/sessions/BuildingScalableComplexApps.html

Otherwise, for background work like schema migration and modifying a whole sequence of data, we are soon launching our Task Queue API. It’s described in detail in this Google I/O talk (which should also have its video online sometime today or very soon):
http://code.google.com/events/io/sessions/OfflineProcessingAppEngine.html

Hope that helps,

-Brett

brad dunbarJune 5th, 2009 at 8:53 am

Also, the time limits have been significantly raised.

http://code.google.com/appengine/docs/python/runtime.html

(They are at the bottom of the page)

KentBeckJune 5th, 2009 at 9:09 am

Brett,

It’s good to know someone there is listening. The changes you mention will help with some of my concerns, especially mining data for patterns that can be used as the seeds of features.

The big remaining problem for me is performance. If I have < 100MB of data that could easily be represented in memory as POJOs, it seems like there is at least a two order of magnitude difference in performance between working through JDO and working in memory (I should measure this). If I am experimenting quickly, a feature could take 5 minutes to code against in-memory data and a day to code going through the database. As a bootstrap startup searching for the thread that will begin unraveling the revenue ball, this is an enormous difference. A higher timeout limit won’t fix this–I get better market feedback from a snappy service than a sluggish one.

David VydraJune 5th, 2009 at 9:14 am

Kent,

I think it’s a reasonable strategy to use GAE to collect data since scalability needs arise at the most inconvenient time :) . Once collected, transfer the data to Amazon EC2 and only pay for storage and CPU time of the actual analysis. They even have http://aws.amazon.com/elasticmapreduce now. I don’t know if there are already libraries to help with this, but even a hand-rolled solution should be very small.

KentBeckJune 5th, 2009 at 9:30 am

I was coming to something like that as well. It is still more engineering investment than I would like to be making given the infant mortality rate of my features. What’s frustrating is that there isn’t a good dynamic, in-memory solution. I keep going back to Smalltalk–if I could just have a Smalltalk image live on the web I could do all the experimenting I want five minutes of coding at a time. When the data started pushing 300MB I could either go to Gemstone or do pre-processing as necessary for performance in a more conventional database technology.

Brett SlatkinJune 5th, 2009 at 11:03 am

Hey Kent,

There is memcache support right now available through JCache:
http://code.google.com/appengine/docs/java/memcache/usingjcache.html

This gives you a high-speed, in-memory cache for accessing frequently used objects. You can insert POJOs in here as long as they can serialize. I’m not sure which frameworks support transparent caching of JDO queries, but this seems straightforward to add for your needs.

Otherwise, you should take another read through the Google Bigtable research paper. It describes how data is cached in memory when it’s frequently accessed. This is what gives the App Engine Datastore such good read performance. That doc is here:
http://labs.google.com/papers/bigtable.html

Combining these two together (memcache and Bigtable caching through our Datastore) you can get some serious read performance from App Engine.

-Brett

ps: Looks like you didn’t moderate my first comment to be visible to everyone. =)

KentBeckJune 5th, 2009 at 11:12 am

Brett,

Not approving your previous comment was a complete oversight on my part. No intention intended.

Thanks for the additional information about performance tuning. I have read the BigTable paper a couple of times. However, I am not seeing delivered performance by the time I get through all the layers. I have what look to me like modest queries that fail intermittently.

Again, I’d rather not spend time engineering this. I’d rather have a bunch of objects in memory. I know how to engineer that for performance and I’m willing to take the hit scaling later. However, I can imagine I am not your target audience. It could be that I just need to bring my performance tuning skills into the 21st century.

Regards (and I mean that :-) ,

Kent

James AbleyJune 5th, 2009 at 11:37 am

There are also Java REPL options – http://googleappengine.blogspot.com/2009/04/many-languages-and-in-runtime-bind-them.html

James AbleyJune 5th, 2009 at 11:40 am

Doh – didn’t refresh before I posted the previous comment!

Sri PanyamJune 5th, 2009 at 8:29 pm

Hi,

For doing ad hoc queries you could use the remote API. The time limit does not apply here. (But the limits on query length and offsets unfortunately do, unless you want to resort to custom orderable keys, which is another huge pain.) However, the biggest annoyance I am facing is clearing the production server. It takes a good couple of days to erase the entire DB. Nobody has been able to tell me why this has to be the case. Why can’t there be a simple clear switch?

cheers
Sri

David VydraJune 5th, 2009 at 8:47 pm

Kent,

Your comment about having an image in the cloud made me think that my attempt to run the Scala compiler inside the GAE Servlet is not just for building an educational site – http://github.com/vydra/gae-scala/tree/master

William ShieldsJune 6th, 2009 at 2:42 am

I too was going to ask whether in-memory caching (namely memcache) would help with the data traversal and time limit problems. Memcache isn’t as sophisticated as, say, Oracle Coherence but it’s better than the quite limited persistence layer that GAE has with DataNucleus.

Joe BowmanJune 6th, 2009 at 3:46 am

I think you need to think outside of the box a little on this. You need to understand appengine is a bunch of servers that you can create java/python applications in, and it has a datastore backend. Most requests are done via HTTP. It also has time constraints for how long a request can run.

With these limitations in mind, you need to do some development in order to get at your data reliably.

First, since you’re pulling logs, I assume you’re using timestamps and trying to get information by date? Easy enough.

Make sure your model has the date as one of the fields so you can limit by date, and sort by time.

Generate a view that fetches x requests, sorted by time and limited by date, as a results array. Actually fetch x + 1 and store the extra entry, if it exists, as a field called next. Output the data in a format that can be parsed, say XML or JSON.

On your server side, create a script that fetches results. If the next variable is passed then fetch again, except you want >= or <= (depending on your sort) for the timestamp in the “next” entry.

Your server side script can then keep fetching until next is not passed, and you have all your results.

I’ve also used this method for deleting large amounts of data using javascript, where I set up a delete button that kicked off a javascript function that would keep requesting to delete until appengine told my script (based on logic I wrote) there were no more to delete.
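The paging scheme in this comment can be sketched as follows. It is an in-memory stand-in rather than App Engine code: the HTTP and XML/JSON layers are left out, a sorted list of timestamps plays the part of the datastore, and ExportPage and PagedExport are hypothetical names. It also assumes timestamps are unique, because with duplicates a resume point of “>= next” can repeat entries or stop making progress; a real export would need a tie-breaking key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PagedExportDemo {
    public static void main(String[] args) {
        List<Long> stored = Arrays.asList(100L, 200L, 300L, 400L, 500L);
        System.out.println(PagedExport.downloadAll(stored, 2)); // prints [100, 200, 300, 400, 500]
    }
}

// One response from the hypothetical export view.
final class ExportPage {
    final List<Long> timestamps; // at most pageSize entries
    final Long next;             // timestamp to resume from, or null on the last page

    ExportPage(List<Long> timestamps, Long next) {
        this.timestamps = timestamps;
        this.next = next;
    }
}

final class PagedExport {
    // "Server" side: read pageSize + 1 entries at or after 'from';
    // the extra entry is not returned, only its timestamp as 'next'.
    static ExportPage fetchPage(List<Long> sortedTimestamps, long from, int pageSize) {
        List<Long> window = new ArrayList<>();
        for (Long t : sortedTimestamps) {
            if (t >= from && window.size() < pageSize + 1) {
                window.add(t);
            }
        }
        boolean more = window.size() > pageSize;
        Long next = more ? window.get(pageSize) : null;
        List<Long> results = new ArrayList<>(more ? window.subList(0, pageSize) : window);
        return new ExportPage(results, next);
    }

    // "Client" side: keep fetching until a page arrives without 'next'.
    static List<Long> downloadAll(List<Long> sortedTimestamps, int pageSize) {
        List<Long> all = new ArrayList<>();
        Long from = Long.MIN_VALUE;
        while (from != null) {
            ExportPage page = fetchPage(sortedTimestamps, from, pageSize);
            all.addAll(page.timestamps);
            from = page.next;
        }
        return all;
    }
}
```

Each call to fetchPage does a bounded amount of work, which is what lets the corresponding request stay under a per-request time limit.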

Spike WashburnJune 6th, 2009 at 7:05 pm

Kent, I dropped you a tweet but wanted to explain in more detail…
Amazon EC2 is a very good option for building cloud-based scalable Java apps. Raw EC2 is more complicated and more pricey than GAE, but more flexible and supports standard Java+DB apps. If you’re looking for your “lightning speed” dev workflow, Stax.net is a Java Platform-as-a-Service for building and deploying apps on EC2 that provides Java developers with a similar create/dev/deploy workflow, but without many of the GAE restrictions. Like GAE, Stax.net is still in beta, but you get a hefty load of EC2 goodness for free while you are building your apps.

KentBeckJune 7th, 2009 at 9:38 am

I’m aware of EC2. It isn’t the technology platform in AppEngine that forms the bottleneck, it’s the difficulty in experimenting.

Even Java isn’t what I’m looking for. What I’d like is a maximum of 100MB of easily-persistable objects that can be programmed very flexibly and (because they are all in memory) very efficiently. Maybe JavaScript/V8 on the server and/or an in-memory version of Map/Reduce. I’m afraid I’m not terribly articulate about this, because it’s just a feeling at this point. I’ve been through this enough times to know that I may be just beginning to uncover something very interesting or not.

AppEngine does a brilliant job at its location on the scaling/flexibility tradeoff; I’m just looking for a different spot.

Tahir AkramJune 7th, 2009 at 10:24 pm

Hi Kent;

You discuss an area in your post that I am stuck on. I am fluent in web app development in Java, and I am evaluating GAE for my start-up too. My only reason for sticking with GAE is that it gives me managed hosting for some popular Java web frameworks (I will use Struts). I don’t want to deal with a bare server and spend all day in PuTTY solving configuration issues for my web app. So this is my motivation for considering GAE.

So far I have coded some servlets in it to understand its environment. But my biggest concern is the DB. I am unable to find any Java code snippet about interacting with the database. They do not support a relational database, which increases my concern about data. How will I see my data or, as you said, how will I run queries on it?

I would love it if you could point me towards a better understanding of the DB stuff. Is there any tutorial or code snippet?

Comments are welcomed.

KentBeckJune 7th, 2009 at 10:57 pm

To learn AppEngine’s database I worked through http://code.google.com/appengine/docs/java/datastore/ several times. I’d read it through, add a little functionality, read it again, add a little more. Also, I found the local testing mode helpful for speeding up the learning cycle, especially when used with TDD. Copying from code that was out there in bits and pieces, I use this fixture to write tests that can read and write data:

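import java.io.File;
import org.junit.After;
import org.junit.Before;
import com.google.appengine.api.datastore.dev.LocalDatastoreService;
import com.google.appengine.tools.development.ApiProxyLocalImpl;
import com.google.apphosting.api.ApiProxy;

// These methods belong inside a JUnit 4 test class. TestEnvironment is not shown;
// it is assumed to be your own implementation of ApiProxy.Environment.

@Before
public void setUp() throws Exception {
    // Route API calls made on this thread to a local, in-process stub.
    ApiProxy.setEnvironmentForCurrentThread(new TestEnvironment());
    ApiProxy.setDelegate(new ApiProxyLocalImpl(new File(".")){});
    ApiProxyLocalImpl proxy = (ApiProxyLocalImpl) ApiProxy.getDelegate();
    // Keep the test datastore in memory instead of writing it to disk.
    proxy.setProperty(LocalDatastoreService.NO_STORAGE_PROPERTY, Boolean.TRUE.toString());
}

@After
public void tearDown() throws Exception {
    ApiProxyLocalImpl proxy = (ApiProxyLocalImpl) ApiProxy.getDelegate();
    LocalDatastoreService datastoreService = (LocalDatastoreService) proxy.getService("datastore_v3");
    // Discard stored data so the next test starts clean.
    datastoreService.clearProfiles();
    ApiProxy.setDelegate(null);
    ApiProxy.setEnvironmentForCurrentThread(null);
}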

