
software as search

Brave New World

The rules of development are changing. In the not-so-olde days, ambitious programmers, or even groups of programmers, came to the inevitable stumbling block: where to host the application?

Making the leap from “hosting it at a friend’s house” or the “server in the basement” to 24/7 real internet hoo-ha used to involve plunking down $100+ a month to rack a machine in a colo or some rot.

Well now there’s a real answer: Amazon S3 + Google App Engine

OK, well at least the S3 part…

My Toy App

My original toy idea was to create a catalog of all the classes in all the jar files I care about. I already have a local setup that I find pretty nifty:

  • run over my local ${HOME}/.m2/repository, glassfish, etc and find every jar
  • jar tf ${path_to_jar} | sed "s,^,${path_to_jar}:," >> ~/.jar_minder_file.txt
  • grep someclass ~/.jar_minder_file.txt
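
The list above is easy to sketch in Java with java.util.jar.JarFile. This is my reconstruction, not the original script — the class name JarIndex, the output format, and the directory walk are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class JarIndex {

    // List every entry of one jar, prefixed with the jar's path --
    // the same output as the `jar tf | sed` pipeline above.
    static List<String> indexJar( Path jar ) throws IOException {
        List<String> lines = new ArrayList<>();
        try ( JarFile jf = new JarFile( jar.toFile() ) ) {
            jf.stream().forEach( e -> lines.add( jar + ":" + e.getName() ) );
        }
        return lines;
    }

    // Walk a repository root, find every *.jar, and write the
    // combined, grep-able catalog to the index file.
    static void buildIndex( Path root, Path indexFile ) throws IOException {
        List<String> all = new ArrayList<>();
        try ( Stream<Path> paths = Files.walk( root ) ) {
            for ( Path p : paths.filter( x -> x.toString().endsWith( ".jar" ) )
                                .collect( Collectors.toList() ) ) {
                all.addAll( indexJar( p ) );
            }
        }
        Files.write( indexFile, all );
    }

    public static void main( String[] args ) throws IOException {
        buildIndex( Paths.get( args[ 0 ] ), Paths.get( args[ 1 ] ) );
    }
}
```

Same grep afterwards: grep someclass on the index file.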

Originally, that was going to be my GAE app: upload my 62MB txt file, make it searchable, the end… but…

I ran into a lot of issues… bulkloading issues and search issues… most of which I did eventually overcome, but they made me change my application model considerably.

I ❤ grep.

I do. I’m not ashamed of it and yes, I have told my parents about it… They used to work at AT&T pre-split, so they grokked it to a certain extent.

Grep like Google

But I thought: I also ❤ Google. I thought the bulkload+keyword thing would get it, but once that 62MB was indexed… well… let’s say it grew an ittle bit… an ittle bit beyond the 500MB of free GAE space.

How fortunate you are not to have been in the room as I stomped my little feet and shook my mad fists in impotent rage!

Google wouldn’t let me search my 62MB text file… It was too big… then I thought… but Google lets me search that internet thing! What if I put my file on the so-called “internet”, Google crawled it, and when I wanted to do a search I could Google up: someclass site:somesite

Instead of one big txt file, I thought I could generate a buncha little XML files. That way every search wouldn’t return the same 62MB txt file, and I could just handle the presentation layer with XSL + JS and bada-boom: search application.
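
For concreteness, here’s how those little XML files might get generated — a sketch only: the <jar>/<entry> element names and the JarToXml class are my invention, not necessarily what the real site serves:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarFile;

public class JarToXml {

    // One small XML file per jar instead of one giant txt file.
    // No escaping is done: jar entry names are plain paths here.
    static String toXml( Path jar ) throws IOException {
        StringBuilder sb = new StringBuilder();
        sb.append( "<jar name=\"" ).append( jar.getFileName() ).append( "\">\n" );
        try ( JarFile jf = new JarFile( jar.toFile() ) ) {
            jf.stream().forEach( e ->
                sb.append( "  <entry>" ).append( e.getName() ).append( "</entry>\n" ) );
        }
        sb.append( "</jar>\n" );
        return sb.toString();
    }

    public static void main( String[] args ) throws IOException {
        for ( String arg : args ) {
            // demo.jar -> demo.jar.xml, side by side with the jar
            Files.write( Paths.get( arg + ".xml" ),
                         toXml( Paths.get( arg ) ).getBytes( StandardCharsets.UTF_8 ) );
        }
    }
}
```

An xml-stylesheet reference added to each file could then hand the raw listing to XSL + JS in the browser.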

First I tried GAE… but even my toy set of jars was too much! Did I mention GAE has a 1,000-file limit? Yeah… no one mentioned it to me either!

OK.. no need to flail my manhooks at the ether! I knew a guy who had been telling me about how I was being all dopey not to take advantage of all the S3 cheapness that was out there.

Change of plans

Instead of trying to python everything myself (oh, didn’t I mention GAE only runs Python? yeah… it’s not that bad, but then I say that about XSLT), I decided I would host it on cheapo S3 and put the “smart” bits on GAE.

The search part would then just be a matter of getting the site indexed (still pending as of today).

Ironically, Yahoo’s BOSS works really well on GAE… but once again… still waiting for it to be indexed.

The smart bits

Of course, that’s all well and good, but I wanted to put a little something+something on top.

GAE has nice user hooks, so I borrowed the idea of “starring” from Netflix to let logged-in users mark jars they like. If it gets off the ground, I think ultimately that is going to be the really interesting part.

I’d also like to expand beyond jar contents and actually javap up the classes… Move beyond resolving classpath issues into full-blown method-signature-to-implementation resolution… Track API changes across releases of every OSS Java project…

Lots of possibilities… for now… gotta wait for it to get indexed…

Of course, if Google / Yahoo don’t hack it… I’m not just going to give up… after all… there’s always EC2! And I have a couple of pals named Jackrabbit and Lucene who enjoy kicking some ass!

😀



using java.util.jar.JarFile

Of course Java can read zip files / jars. It’s pretty straightforward:

import java.util.jar.JarFile;
import java.util.jar.JarEntry;
import java.util.Enumeration;
//...
    public void print( String arg ) throws Exception {
        this.print( new JarFile( arg ) );
    }

    public void print( JarFile jarFile ) {
        this.print( jarFile.entries() );
    }

    public void print( Enumeration<JarEntry> entries ) {
        while ( entries.hasMoreElements() ) {
            // each element is a JarEntry; println prints its path inside the jar
            System.out.println( entries.nextElement() );
        }
    }

program, compile and execute thyself!

I wrote a little test program called Jarout.java, which is a “self-compiling” program:

% ./Jarout.java ~/.m2/repository/org/springframework/spring/2.5.5/spring-2.5.5.jar

This works (in reasonable environments) because the first 4 lines are a shell script protected by a block comment:

/*2222222 2>/dev/null
javac Jarout.java && java Jarout ${*}
exit ${?}
*/

import java.util.jar.JarFile;
import java.util.Enumeration;
//...

The first line keeps the shell quiet — no file matches the glob /*2222222, so the shell’s complaint goes to /dev/null — and the exit keeps the rest of the program from being interpreted. It’s a silly trick, but useful sometimes. I also use it for C and C++. gcc is so fast that it gives traditional interpreted languages a run for their money!

Even though it is a silly trick, it allows you to use any compiled language like a scripting language…
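
Putting the header together with the JarFile code from earlier gives a complete, minimal Jarout.java. This is my reconstruction — the body the original elides may well differ:

```java
/*2222222 2>/dev/null
javac Jarout.java && java Jarout ${*}
exit ${?}
*/

// To the shell, the first four lines are a build-and-run script;
// to javac, they are just a block comment.
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class Jarout {
    public static void main( String[] args ) throws Exception {
        for ( String arg : args ) {
            try ( JarFile jarFile = new JarFile( arg ) ) {
                Enumeration<JarEntry> entries = jarFile.entries();
                while ( entries.hasMoreElements() ) {
                    System.out.println( entries.nextElement() );
                }
            }
        }
    }
}
```

chmod +x Jarout.java and it runs like a script.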

😀


using libzip

jar tf too slow

So I recently had occasion to notice that “jar tf” was much slower than “unzip -l”:

% for f in 0 1 2 3 4 5 6 7 8 9 ; do time jar tf velocity-dvsl-0.43.20020711.010949.jar > a ; done 2>&1 | grep ^real | cut -f2- -dm | cut -f1 -ds | xargs | tr ' ' + | bc -l
2.750
% for f in 0 1 2 3 4 5 6 7 8 9 ; do time unzip -l velocity-dvsl-0.43.20020711.010949.jar > u ; done 2>&1 | grep ^real | cut -f2- -dm | cut -f1 -ds | xargs | tr ' ' + | bc -l
.096

So the jar command is almost 30 times slower (2.750s vs .096s over ten runs)! But it prints the list of contents in a simple format I like. Instead of grep/cut/sed’ing unzip’s output into shape, I decided to write a custom application.

zip_print.c

% sudo apt-get install libzip-dev libzip1

Here it is, filled out with the declarations and error handling the highlights skipped:

#include <stdio.h>
#include <zip.h>

int
main( int argc, char *argv[] ) {
    struct zip *zip_ptr;
    const char *path;
    int errorp, i, j, max;

    for ( i = 1 ; i < argc ; i++ ) {
        path = argv[ i ];
        /* ZIP_CHECKCONS asks libzip to run consistency checks on open */
        zip_ptr = zip_open( path, ZIP_CHECKCONS, &errorp );
        if ( zip_ptr == NULL ) {
            fprintf( stderr, "%s: could not open (error %d)\n", path, errorp );
            continue;
        }
        max = zip_get_num_files( zip_ptr );
        for ( j = 0 ; j < max ; j++ ) {
            printf( "%s\n", zip_get_name( zip_ptr, j, ZIP_FL_UNCHANGED ) );
        }
        zip_close( zip_ptr );
    }
    return 0;
}
% for f in 0 1 2 3 4 5 6 7 8 9 ; do time ./zip_print velocity-dvsl-0.43.20020711.010949.jar > b ; done 2>&1 | grep ^real | cut -f2- -dm | cut -f1 -ds | xargs | tr ' ' + | bc -l
.092
% diff a b

Now I’m not going to claim zip_print is faster than “unzip -l”… that .004 is basically bunk, but it is certainly faster than “jar tf”.
