Recently, I wrote a part-of-speech (POS) tagger in C++, named Sitar. A POS tagger assigns morphosyntactic labels to words; these labels can then be used in subsequent processing, such as chunking or parsing. The tagger uses a trigram Hidden Markov Model (HMM), combined with suffix analysis for unknown words. On my Brown corpus-based training set, it achieves an overall accuracy of 95.5% (74.8% for unknown words). With two parameters hand-tuned, it achieves an accuracy above 76% for unknown words.
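To give a rough idea of what suffix analysis for unknown words means, here is a minimal sketch. It is not the tagger's actual code: the class `SuffixGuesser`, its methods, the maximum suffix length of 3, and the fallback tag are all made up for illustration. It also simplifies heavily, since it just picks the most frequent tag of the longest known suffix, whereas a TnT-style tagger would typically smooth tag probabilities over several suffix lengths and feed them into the HMM.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical helper (an assumption, not the tagger's real API): guesses a
// tag for an unknown word from the tags seen with the same word endings.
class SuffixGuesser {
public:
    explicit SuffixGuesser(std::size_t maxSuffixLength = 3)
        : maxSuffixLength_(maxSuffixLength) {}

    // Record one (word, tag) training observation for every suffix of the
    // word, up to maxSuffixLength_ characters.
    void observe(const std::string &word, const std::string &tag) {
        std::size_t longest = std::min(maxSuffixLength_, word.size());
        for (std::size_t len = 1; len <= longest; ++len)
            ++counts_[word.substr(word.size() - len)][tag];
    }

    // Return the most frequent tag of the longest suffix seen in training,
    // or fallbackTag if no suffix of the word is known.
    std::string guess(const std::string &word,
                      const std::string &fallbackTag = "NN") const {
        std::size_t longest = std::min(maxSuffixLength_, word.size());
        for (std::size_t len = longest; len >= 1; --len) {
            auto it = counts_.find(word.substr(word.size() - len));
            if (it == counts_.end())
                continue;  // this suffix was never seen; try a shorter one
            auto best = std::max_element(
                it->second.begin(), it->second.end(),
                [](const TagCounts::value_type &a,
                   const TagCounts::value_type &b) {
                    return a.second < b.second;
                });
            return best->first;
        }
        return fallbackTag;
    }

private:
    typedef std::map<std::string, std::size_t> TagCounts;

    std::size_t maxSuffixLength_;
    std::unordered_map<std::string, TagCounts> counts_;
};
```

For example, after observing "walking" and "running" as VBG and "morning" as NN, `guess("jogging")` returns "VBG", because the suffix "ing" was seen twice with VBG and only once with NN.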

The TnT tagger, which Sitar is partly modeled after, is more accurate at tagging unknown words, so this is an area with room for improvement (though Sitar scores better than many other taggers that do not follow this methodology).

If you are interested in tinkering with Sitar, the source code is available under the liberal Apache License, version 2.0. This license also allows use in proprietary software, although I hope improvements will be contributed back. I hope this is useful to some people :).