Archive for nlp

noun chunking with openNLP

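This is pretty much straight from the openNLP README. Note that the shell expands $3 to nothing inside the double quotes, which is why the output ends in 000/CD: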

% echo "The openNLP project is also pretty cool, LGPL and it doesn't cost $3000" | ./textMe.sh 
[NP The/DT openNLP/JJ project/NN ] [VP is/VBZ ] [ADVP also/RB ] [ADJP pretty/RB cool/JJ ] 
,/, [NP LGPL/NNP ] and/CC [NP it/PRP ] [VP does/VBZ n't/RB cost/VB ] 000/CD

The script is dumb:

#!/bin/bash
# Pipe stdin through the openNLP sentence detector, tokenizer, POS tagger, and chunker.
_textMe_main() {
    # Classpath: the openNLP jar plus every jar found under lib/, joined with ':'.
    local classpath="output/opennlp-tools-1.4.3.jar"
    classpath="${classpath}:$( find lib -name "*.jar" | xargs | tr ' ' ':' )"
    # Expanded unquoted below on purpose, so it splits into the command and its arguments.
    local jMe="java -classpath ${classpath}"
    ${jMe} opennlp.tools.lang.english.SentenceDetector \
        english/sentdetect/EnglishSD.bin.gz \
    | ${jMe} opennlp.tools.lang.english.Tokenizer \
        english/tokenize/EnglishTok.bin.gz \
    | ${jMe} opennlp.tools.lang.english.PosTagger -d \
        english/parser/tagdict english/parser/tag.bin.gz \
    | ${jMe} opennlp.tools.lang.english.TreebankChunker \
        english/chunker/EnglishChunk.bin.gz
}

_textMe_main "$@"

Break it down, one stage at a time (with ${jMe} set the same way as in the script):

% echo "This is a sentence? I know this is, Mr. Funzone... What will happen?" | ${jMe} opennlp.tools.lang.english.SentenceDetector english/sentdetect/EnglishSD.bin.gz
This is a sentence?
I know this is, Mr. Funzone... What will happen?
% echo "This is a sentence?" | ${jMe} opennlp.tools.lang.english.Tokenizer english/tokenize/EnglishTok.bin.gz
This is a sentence ?
% echo 'This is a sentence ?' | ${jMe} opennlp.tools.lang.english.PosTagger -d english/parser/tagdict english/parser/tag.bin.gz
This/DT is/VBZ a/DT sentence/NN ?/.
% echo 'This/DT is/VBZ a/DT sentence/NN ?/.' | ${jMe} opennlp.tools.lang.english.TreebankChunker english/chunker/EnglishChunk.bin.gz
[NP This/DT ] [VP is/VBZ ] [NP a/DT sentence/NN ] ?/.
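
Since the point here is noun chunking, the chunker output still needs one more step to get at the noun phrases themselves. Here is a rough sketch of that step, assuming the bracketed format shown above (chunks wrapped in "[NP ... ]", tokens written as word/TAG). The name np_chunks is made up for this example and is not part of openNLP.

```shell
# Hypothetical helper (not part of openNLP): read chunker output on stdin
# and print one noun phrase per line, with the POS tags stripped.
np_chunks() {
    # grep -o prints each "[NP ... ]" group on its own line.
    # sed then drops the "[NP " prefix and the " ]" suffix, and finally
    # removes the trailing "/TAG" from every token.
    grep -o '\[NP [^]]*\]' \
    | sed -E -e 's/^\[NP //' -e 's/ \]$//' -e 's#/[^/ ]+( |$)#\1#g'
}
```

Something like `echo "This is a sentence?" | ./textMe.sh | np_chunks` should then print each noun phrase on its own line. Only the last slash-separated piece of each token is treated as the tag, so a word that itself contains a slash keeps it.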

The Stanford Parser is neat

The Stanford Parser: A statistical parser is a pretty neat piece of GPL Java code that does all kinds of wild text processing.

It can parse a sentence into a tree with POS tags and build a dependency graph that relates the words to each other. Gadzooks!

I had to change the lexparser.csh script to give Java 512m of memory instead of 150m to process even pretty small files…

% ./lexparser.csh f
Loading parser from serialized file ./englishPCFG.ser.gz ... done [2.1 sec].
f
Parsing file: f with 1 sentences.
Parsing [sent. 1 len. 24]: [The, Stanford, Parser, :, A, statistical, parser, is, a, pretty, neat, piece, of, GPL, java, code, does, all, kinds, of, wild, text, processing, .]
(ROOT
  (NP
    (NP (DT The) (NNP Stanford) (NNP Parser))
    (: : )
    (S 
      (NP (DT A) (JJ statistical) (NN parser))
      (VP (VBZ is)
        (NP
          (NP (DT a)
            (ADJP (RB pretty) (JJ neat))
            (NN piece))
          (PP (IN of)
            (NP (NNP GPL)))
          (SBAR 
            (S
              (NP (NN java) (NN code))
              (VP (VBZ does)
                (NP
                  (NP (DT all) (NNS kinds))
                  (PP (IN of)
                    (NP (JJ wild) (NN text) (NN processing))))))))))
    (. .)))         
    
det(Parser-3, The-1)
nn(Parser-3, Stanford-2)
det(parser-7, A-5)
amod(parser-7, statistical-6)
nsubj(piece-12, parser-7)
cop(piece-12, is-8)
det(piece-12, a-9)
advmod(neat-11, pretty-10)
amod(piece-12, neat-11)
dep(Parser-3, piece-12)
prep_of(piece-12, GPL-14)
nn(code-16, java-15)
nsubj(does-17, code-16)
rcmod(piece-12, does-17)
det(kinds-19, all-18)
dobj(does-17, kinds-19)
amod(processing-23, wild-21)
nn(processing-23, text-22)
prep_of(kinds-19, processing-23)

Parsed file: f [1 sentences].
Parsed 24 words in 1 sentences (17.28 wds/sec; 0.72 sents/sec).
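
The dependency lines in that output are regular enough to pick apart with sed. Below is a small sketch, assuming the "relation(head-N, dependent-M)" layout shown above. The name deps_to_table is made up for this example and is not part of the Stanford Parser.

```shell
# Hypothetical helper (not part of the Stanford Parser): read parser output
# on stdin and print one dependency per line as
#   relation head head_index dependent dependent_index
# Lines that are not dependencies (the tree, the log messages) are dropped.
deps_to_table() {
    sed -E -n 's/^([a-z_]+)\(([^ ]+)-([0-9]+), ([^ ]+)-([0-9]+)\)$/\1 \2 \3 \4 \5/p'
}
```

For example, `./lexparser.csh f | deps_to_table | awk '$1 == "nsubj"'` would keep only the subject relations, which for the sentence above are the two nsubj lines.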
