Posts Tagged GAE

software as search

Brave New World

The rules of development are changing. In the not so olde days, ambitious programmers, or even groups of programmers, came to the inevitable stumbling block: where to host the application?

Making the leap from “hosting it at a friend’s house” or the “server in the basement” to 24/7 real internet hoo-ha used to involve plunking down $100+ a month to rack a machine in a colo or some rot.

Well now there’s a real answer: Amazon S3 + Google’s App Engine

OK, well at least the S3 part…

My Toy App

My original toy idea was to create a catalog of all the classes in all the jar files I care about. I have a local application that I use that I find pretty nifty:

  • run over my local ${HOME}/.m2/repository, glassfish, etc and find every jar
  • jar tf "${path_to_jar}" | sed "s,^,${path_to_jar}:," >> ~/.jar_minder_file.txt
  • grep someclass ~/.jar_minder_file.txt
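For anyone who would rather not shell out, here is a rough Python sketch of the same three steps. It is not the tool I actually run; the function names index_jars and grep_index are made up for illustration, and it assumes a jar is just a zip file that the standard zipfile module can list.

```python
import os
import zipfile


def index_jars(roots, index_path):
    """Write one 'jar_path:entry' line for every entry of every jar under roots."""
    count = 0
    with open(index_path, "w", encoding="utf-8") as out:
        for root in roots:
            for dirpath, _dirs, files in os.walk(root):
                for fname in sorted(files):
                    if not fname.endswith(".jar"):
                        continue
                    jar_path = os.path.join(dirpath, fname)
                    try:
                        with zipfile.ZipFile(jar_path) as jar:
                            names = jar.namelist()
                    except zipfile.BadZipFile:
                        continue  # skip corrupt or half-downloaded jars
                    for name in names:
                        out.write(f"{jar_path}:{name}\n")
                        count += 1
    return count


def grep_index(index_path, needle):
    """Return the index lines containing needle (plain substring match, not a regex)."""
    with open(index_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if needle in line]
```

Point index_jars at something like the expanded ~/.m2/repository path plus wherever glassfish lives, then call grep_index with a class name. It is slower than grep, but it is one file and needs no JDK on the path.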

Originally, that was going to be my GAE app: upload my 62mb txt file, make it searchable, the end… but…

I ran into a lot of issues… bulkloading issues and search issues… most of which I did eventually overcome, but it made me change my application model considerably.

I ❤ grep.

I do. I’m not ashamed of it and yes, I have told my parents about it… They used to work at AT&T pre-split, so they grokked it to a certain extent.

Grep like Google

But I thought, I also ❤ Google. I thought the bulkload+keyword thing would get it, but once that 62mb was indexed… well… let’s say it grew an ittle bit… an ittle bit beyond the 500mb of free GAE space.

How fortunate you are not to have been in the room as I stomped my little feet and shook my mad fists in impotent rage!

Google wouldn’t let me search my 62mb text file… It was too big… then I thought… but Google lets me search that internet thing! What if I put my file on the so-called “internet”, Google crawled it, and when I wanted to do a search I could Google up: someclass site:somesite

Instead of one big txt file, I thought I could generate a buncha little xml files. That way every search wouldn’t return the same 62mb txt file and I could just handle the presentation layer with XSL + JS and bada-boom: search application.
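Roughly, the splitting step could look like the sketch below. This is an illustration and not my actual generator: the function name write_jar_xml, the jar and entry element names, and the hash-based file names are all assumptions, and it expects the "jar_path:entry" line format from the big txt file with no ':' inside entry names.

```python
import hashlib
import os
import xml.etree.ElementTree as ET
from collections import defaultdict


def write_jar_xml(index_path, out_dir):
    """Split a 'jar_path:entry' index into one small XML file per jar."""
    entries = defaultdict(list)
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            # Split on the last ':' so a ':' in the jar path does not break things.
            jar_path, sep, entry = line.rstrip("\n").rpartition(":")
            if sep:
                entries[jar_path].append(entry)

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for jar_path, names in sorted(entries.items()):
        root = ET.Element("jar", {"path": jar_path})
        for name in names:
            ET.SubElement(root, "entry").text = name
        # Hash the path to get a stable, filesystem-safe file name.
        digest = hashlib.sha1(jar_path.encode("utf-8")).hexdigest()
        target = os.path.join(out_dir, f"s{digest}.xml")
        ET.ElementTree(root).write(target, encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written
```

One file per jar keeps each search hit small and specific, and leaves the presentation to whatever XSL + JS gets layered on top.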

First I tried GAE.. but even my toy set of jars was too much! Did I mention GAE has a 1,000 file limit? Yeah… no one mentioned it to me either!

OK.. no need to flail my manhooks at the ether! I knew a guy who had been telling me about how I was being all dopey not to take advantage of all the S3 cheapness that was out there.

Change of plans

Instead of trying to python everything myself (oh, didn’t I mention GAE only runs python? yeah… it’s not that bad, but then I say that about XSLT)… I decided I would host it on cheapo S3 and put the “smart” bits on GAE.

The search part would then just be a matter of getting the site indexed (still pending as of today).

Ironically, Yahoo’s BOSS works really well on GAE… but once again… still waiting for it to be indexed.

The smart bits

Of course, that’s all well and good, but I wanted to put a little something+something on top.

GAE has nice user hooks, so I borrowed the idea of “starring” from Netflix to let logged-in users mark jars they like. If it gets off the ground, I think ultimately that is going to be the really interesting part.

I’d also like to expand to not just cover jar contents but actually javap up the classes… Move beyond resolving classpath issues into a fullblown method signature to implementation resolution… Track API changes across releases from every OSS Java project…

Lots of possibilities… for now… gotta wait for it to get indexed…

Of course, if Google / Yahoo don’t hack it… I’m not just going to give up… after all… there’s always EC2! and I have a couple of pals named Jackrabbit and Lucene who enjoy kicking some ass!




so much free stuff!!!

Yeah… I know… I was doing stuff, but then I went totally nuts on Google’s App Engine!


Just wanted to say: go and get at it! Sure it is suck ass python, but at least it’s not perl.

What else? S3 is what else?

Waiting on YAP to drop. I’m out…

Parting shot:

% mkdir amazonS3
% wsimport -d amazonS3 -s amazonS3 
% javadoc -d javadoc $( find amazonS3 -name "*.java" )
% jar cf amazonS3.jar -C amazonS3 .


Yeah… let me try to write something useful…

Here is a good tip on managing your own key on bulkload with GAE.

Here is what I got (names changed to keep my sh!t on the d/l):

from google.appengine.ext import db
from google.appengine.ext import bulkload
from google.appengine.api import datastore
from google.appengine.api import datastore_types
from google.appengine.ext import search

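class LoadMyJunk( bulkload.Loader ):
    def __init__(self):
        # kind name, then ( property, converter ) pairs in CSV column order
        bulkload.Loader.__init__( self
            , 'SomeJunk'
            , [
                  ( 'sha1',     str )
                , ( 'pwd',      str )
                , ( 'filesize', int )
                , ( 'datal',    db.Text )
              ] )

    def HandleEntity( self, entity ):
        # build our own key name from the sha1 instead of taking an auto-generated id
        name = 's' + entity[ 'sha1' ]
        newent = datastore.Entity( 'SomeJunk', name=name )
        newent.update( entity )
        # wrap it so the entity gets indexed for keyword search
        ent = search.SearchableEntity( newent )
        return ent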

if __name__ == '__main__':
    bulkload.main( LoadMyJunk() )

I would say how neat it is, but I am too busy kicking ass. STOP Suggest you do same STOP


Client/Server is dead! Long live, commodity computing!

Power to the people!

