Between other work, I have ported my Sitar (C++) tagger to Java. The port is named Jitar. I took the opportunity to redesign some aspects:
- Simplify the training data: lexicon entries now store frequencies rather than probabilities. This allows us to use the same lexicon for the known word handler and the unknown word handler (which relies on suffix analysis). The probabilities are calculated when the model is loaded; since CPU calculations often beat disk I/O, this does not lead to a longer startup time.
- Store the suffixes for the unknown word handler in a tree. This makes the handler faster and reduces its memory use.
- Apply some more tweaks for unknown word handling. With these tweaks, the unknown word accuracy for our test set seems to be at the same level as TnT.
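To give an idea of how the first two points fit together, here is a minimal sketch of a suffix tree that is filled from lexicon frequencies and only turns them into probabilities at lookup time. This is not Jitar's actual code: the class `SuffixTrieSketch` and its methods are hypothetical names, and the lookup simply takes the relative tag frequency at the longest matching suffix, whereas a TnT-style handler would smooth over several suffix lengths.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; class and method names are not taken from Jitar. */
final class SuffixTrieSketch {
    /** A node holds the tag counts of all words ending in the suffix that leads to it. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Map<String, Integer> tagFreqs = new HashMap<>();
        int total = 0;
    }

    private final Node root = new Node();
    private final int maxSuffixLength;

    SuffixTrieSketch(int maxSuffixLength) {
        this.maxSuffixLength = maxSuffixLength;
    }

    /** Adds one lexicon entry (word, tag, frequency), walking the word from its last character backwards. */
    void add(String word, String tag, int freq) {
        Node node = root;
        count(node, tag, freq); // the root corresponds to the empty suffix
        int stop = Math.max(0, word.length() - maxSuffixLength);
        for (int i = word.length() - 1; i >= stop; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
            count(node, tag, freq);
        }
    }

    private static void count(Node node, String tag, int freq) {
        node.tagFreqs.merge(tag, freq, Integer::sum);
        node.total += freq;
    }

    /**
     * Relative frequency of the tag at the longest stored suffix of the word.
     * Simplification: no smoothing across suffix lengths.
     */
    double tagProbability(String word, String tag) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            Node next = node.children.get(word.charAt(i));
            if (next == null) {
                break; // no longer suffix is stored
            }
            node = next;
        }
        if (node.total == 0) {
            return 0.0; // empty tree
        }
        return node.tagFreqs.getOrDefault(tag, 0) / (double) node.total;
    }
}
```

Because suffixes that share an ending share nodes, each suffix character is stored once rather than once per suffix string, and a lookup touches at most as many nodes as the maximum suffix length.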
The Java port provides some other nice advantages as well, such as easy integration with programs written in other languages that run on top of a JVM (Groovy, Scala, JRuby, etc.). Jitar is also licensed under the Apache License 2.0, which allows use in FLOSS and proprietary software.
Jasper Spaans has agreed to help with the maintenance of Jitar (thanks!). I expect that we can tag a 0.0.1 version soon, and provide precompiled and source archives. Before that, I'd like to move the whole tagger to another (more general) namespace, make the training parameters less specific, add more assertions, and preferably unit tests. In the meantime, the code can be checked out from the development project of the Jitar project.