software as search

Brave New World

The rules of development are changing. In the not-so-olde days, ambitious programmers or even groups of programmers came to the inevitable stumbling block: where to host the application?

Making the leap from “hosting it at a friend’s house” or the “server in the basement” to 24/7 real internet hoo-ha used to involve plunking down $100+ a month to rack a machine in a colo or some rot.

Well, now there’s a real answer: Amazon S3 + Google’s App Engine

OK, well at least the S3 part…

My Toy App

My original toy idea was to create a catalog of all the classes in all the jar files I care about. I have a local setup that I find pretty nifty:

  • run over my local ${HOME}/.m2/repository, glassfish, etc and find every jar
  • jar tf ${path_to_jar} | sed "s,^,${path_to_jar}:," >> ~/.jar_minder_file.txt
  • grep someclass ~/.jar_minder_file.txt
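The three steps above boil down to “walk, list, grep,” which can be sketched in Python too (function names are mine; the index format matches the `path:entry` lines the sed step produces):

```python
import os
import zipfile

def index_jars(root, out_path):
    """Walk `root` and, for every .jar found, append one
    '<jar-path>:<entry>' line per archive member to `out_path`
    -- the `jar tf | sed` steps from the list above."""
    with open(out_path, "a") as out:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith(".jar"):
                    continue
                jar_path = os.path.join(dirpath, name)
                # jars are just zip files, so zipfile can list them
                with zipfile.ZipFile(jar_path) as jar:
                    for entry in jar.namelist():
                        out.write("%s:%s\n" % (jar_path, entry))

def grep_index(pattern, index_path):
    """The grep step: return the index lines containing `pattern`."""
    with open(index_path) as index:
        return [line.rstrip("\n") for line in index if pattern in line]
```

Point `index_jars` at `~/.m2/repository` (or a glassfish install) and `grep_index("SomeClass", ...)` tells you exactly which jars carry the class.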

Originally, that was going to be my GAE app: upload my 62mb txt file, make it searchable, the end… but…

I ran into a lot of issues… bulkloading issues and search issues… most of which I did eventually overcome, but it made me change my application model considerably.

I ❤ grep.

I do. I’m not ashamed of it and yes, I have told my parents about it… They used to work at AT&T pre-split, so they grokked it to a certain extent.

Grep like Google

But I thought, I also ❤ Google. I thought the bulkload+keyword thing would get it, but once that 62mb was indexed… well… let’s say it grew an ittle bit… an ittle bit beyond the 500mb of free GAE space.

How fortunate you are not to have been in the room as I stomped my little feet and shook my mad fists in impotent rage!

Google wouldn’t let me search my 62mb text file… It was too big… then I thought… but Google lets me search that internet thing! What if I put my file on the so-called “internet” and then Google crawled it, and when I wanted to do a search I could Google up: someclass site:somesite

Instead of one big txt file, I thought I could generate a buncha little xml files. That way every search wouldn’t return the same 62mb txt file and I could just handle the presentation layer with XSL + JS and bada-boom: search application.
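That splitting step is easy to picture: group the `jar:entry` index lines by jar and write one small, properly escaped XML file per jar. A minimal sketch, with a file-naming scheme and XML shape of my own invention:

```python
import os
from xml.sax.saxutils import escape, quoteattr

def write_jar_pages(index_lines, out_dir):
    """Group '<jar-path>:<entry>' index lines by jar and emit one small
    XML file per jar, so a search hit lands on a page-sized file
    instead of one huge text blob."""
    by_jar = {}
    for line in index_lines:
        jar_path, _, entry = line.partition(":")
        by_jar.setdefault(jar_path, []).append(entry)
    os.makedirs(out_dir, exist_ok=True)
    pages = []
    for jar_path, entries in by_jar.items():
        page = os.path.join(out_dir, os.path.basename(jar_path) + ".xml")
        with open(page, "w") as f:
            f.write('<?xml version="1.0"?>\n')
            f.write("<jar path=%s>\n" % quoteattr(jar_path))
            for entry in entries:
                f.write("  <entry>%s</entry>\n" % escape(entry))
            f.write("</jar>\n")
        pages.append(page)
    return pages
```

Each little XML file can then point at a shared XSL stylesheet, so the browser handles the presentation.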

First I tried GAE… but even my toy set of jars was too much! Did I mention GAE has a 1,000-file limit? Yeah… no one mentioned it to me either!

OK… no need to flail my manhooks at the ether! I knew a guy who had been telling me how I was being all dopey not to take advantage of all the S3 cheapness that was out there.

Change of plans

Instead of trying to python everything myself (oh, didn’t I mention GAE only runs Python? yeah… it’s not that bad, but then I say that about XSLT)… I decided I would host it on cheapo S3 and put the “smart” bits on GAE.

The search part would then just be a matter of getting the site indexed (still pending as of today).

Ironically, Yahoo’s BOSS works really well on GAE… but once again… still waiting for it to be indexed.

The smart bits

Of course, that’s all well and good, but I wanted to put a little something+something on top.

GAE has nice user hooks, so I borrowed the idea of “starring” from Netflix to let logged-in users mark jars they like. If it gets off the ground, I think ultimately that is going to be the really interesting part.
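GAE’s real user and datastore APIs aside, the starring idea reduces to a per-jar set of users plus a ranking by star count. A plain-Python sketch with made-up names (on GAE this would be a datastore model keyed by the logged-in user):

```python
class StarBook:
    """Toy model of 'starring': users mark jars they like,
    and jars can be ranked by how many users starred them."""

    def __init__(self):
        self._stars = {}  # jar name -> set of user ids

    def star(self, user, jar):
        """Record that `user` starred `jar` (idempotent)."""
        self._stars.setdefault(jar, set()).add(user)

    def unstar(self, user, jar):
        """Remove `user`'s star from `jar`, if present."""
        self._stars.get(jar, set()).discard(user)

    def top_jars(self, n=10):
        """The n jars with the most stars, most-starred first."""
        ranked = sorted(self._stars.items(),
                        key=lambda kv: len(kv[1]),
                        reverse=True)
        return [jar for jar, _users in ranked[:n]]
```

The interesting part would be what falls out of `top_jars` once real users start voting.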

I’d also like to expand beyond jar contents and actually javap up the classes… Move beyond resolving classpath issues into full-blown method-signature-to-implementation resolution… Track API changes across releases from every OSS Java project…

Lots of possibilities… for now… gotta wait for it to get indexed…

Of course, if Google / Yahoo don’t hack it… I’m not just going to give up… after all… there’s always EC2! and I have a couple of pals named JackRabbit and Lucene who enjoy kicking some ass!

😀

2 Comments

  1. Vic said

Well, what about having your friends that regularly have spiders crawl them put references to your site in their content? It is not as shameless as link farming, and who knows what other benefits might come of letting the spiders wander across your old corpus?

    If you want to see this in action, check out the page,

    http://cvs.dlogic.org/

Better yet, if you have a dlogic account, check the access logs,

    $ sudo less /var/log/httpd/cvs.dlogic.com-access_log

    more,
    l8r,
    v

  2. brianin3d said

    The more links the merrier, but ultimately it’s not much of an offering if it’s not searchable…

    Of course, I’m not picky! The more sites that index it the better!

