Odds and ends

Monday 21 July 2008

Jitar: a port of Sitar

In between other work I have ported my Sitar (C++) tagger to Java; the port is named Jitar. I took the opportunity to redesign some aspects:

  • Simplify the training data: lexicon entries now store frequencies rather than probabilities. This allows us to use the same lexicon for the known word handler and the unknown word handler (which relies on suffix analysis). Since computing the probabilities on the CPU is often cheaper than the disk I/O needed to read them, this does not lead to a longer startup time.
  • Store the suffixes for the unknown word handler in a tree. This makes the handler faster and reduces its memory use.
  • Apply some more tweaks for unknown word handling. With these tweaks, the unknown word accuracy for our test set seems to be at the same level as TnT.
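
The first two points can be illustrated together. The following is a minimal sketch in Python, not Jitar's actual Java code: a trie over reversed word endings whose nodes hold raw tag frequencies, with probabilities computed only at lookup time. The class name `SuffixTrie`, the maximum suffix length of 3, and the choice to use the longest matching suffix without any smoothing are assumptions made for this example.

```python
from collections import Counter


class SuffixTrie:
    """Trie over reversed word endings; each node counts the tags seen with that suffix."""

    def __init__(self, max_suffix_len=3):
        self.max_suffix_len = max_suffix_len
        self.root = {"tags": Counter(), "children": {}}

    def add(self, word, tag, freq=1):
        """Record that `word` was seen with `tag` `freq` times."""
        node = self.root
        node["tags"][tag] += freq  # the empty suffix matches every word
        for ch in reversed(word[-self.max_suffix_len:]):
            node = node["children"].setdefault(
                ch, {"tags": Counter(), "children": {}})
            node["tags"][tag] += freq

    def tag_probabilities(self, word):
        """Return P(tag | longest stored suffix of word), derived from the counts."""
        node = self.root
        for ch in reversed(word[-self.max_suffix_len:]):
            child = node["children"].get(ch)
            if child is None:
                break  # no longer suffix stored; fall back to the current node
            node = child
        total = sum(node["tags"].values())
        if total == 0:
            return {}
        return {tag: n / total for tag, n in node["tags"].items()}


# Hypothetical lexicon entries: (word, tag, frequency).
trie = SuffixTrie(max_suffix_len=3)
for word, tag, freq in [("walking", "VBG", 3), ("talking", "VBG", 2), ("ceiling", "NN", 1)]:
    trie.add(word, tag, freq)

# "blogging" is not in the lexicon, but it shares the suffix "ing".
probs = trie.tag_probabilities("blogging")
```

Because words that share an ending share the same path in the trie, each suffix is stored only once, and a lookup touches at most `max_suffix_len` nodes.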

The Java port provides some other nice advantages as well, such as easy integration with programs written in other languages that run on top of a JVM (Groovy, Scala, JRuby, etc.). Jitar is also licensed under the Apache License 2.0, which allows use in FLOSS and proprietary software.

Jasper Spaans has agreed to help with the maintenance of Jitar (thanks!). I expect that we can tag a 0.0.1 version soon, and provide precompiled and source archives. I'd like to move the whole tagger to another (more general) namespace, make the training parameters less specific, add more assertions, and preferably unit tests. In the meantime, the code can be checked out from the Jitar development project.

Tuesday 20 May 2008

Sitar: a simple part of speech tagger

Recently, I wrote a part of speech (POS) tagger in C++. A POS tagger assigns morphosyntactic labels to words, which can be used in subsequent processing such as chunking or parsing. The tagger uses a trigram Hidden Markov Model (HMM), combined with suffix analysis for unknown words. On my Brown corpus-based training set, it achieves an overall accuracy of 95.5% (74.8% for unknown words). With two parameters hand-tuned, it achieves an accuracy of above 76% for unknown words.
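
To make the "trigram" part concrete: in such a model, the probability of a tag depends on the two preceding tags. The sketch below, in Python rather than Sitar's C++, estimates that probability from raw tag counts. The post does not say how Sitar smooths its estimates; linear interpolation of unigram, bigram and trigram estimates with fixed weights is assumed here, and the class name `TrigramModel` and the weights `(0.1, 0.3, 0.6)` are made up for the example.

```python
from collections import Counter


class TrigramModel:
    """Tag trigram model with linearly interpolated estimates (fixed weights)."""

    def __init__(self, lambdas=(0.1, 0.3, 0.6)):
        self.lambdas = lambdas  # weights for unigram, bigram, trigram estimates
        self.uni = Counter()
        self.bi = Counter()
        self.tri = Counter()
        self.ctx1 = Counter()  # how often t2 was followed by any tag
        self.ctx2 = Counter()  # how often (t1, t2) was followed by any tag
        self.total = 0

    def train(self, tag_sequences):
        """Collect counts from an iterable of tag sequences (one per sentence)."""
        for tags in tag_sequences:
            padded = ["<s>", "<s>"] + list(tags)  # two start-of-sentence markers
            for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
                self.uni[t3] += 1
                self.bi[(t2, t3)] += 1
                self.tri[(t1, t2, t3)] += 1
                self.ctx1[t2] += 1
                self.ctx2[(t1, t2)] += 1
                self.total += 1

    def prob(self, t1, t2, t3):
        """Interpolated estimate of P(t3 | t1, t2)."""
        l1, l2, l3 = self.lambdas
        c1, c2 = self.ctx1[t2], self.ctx2[(t1, t2)]
        p1 = self.uni[t3] / self.total if self.total else 0.0
        p2 = self.bi[(t2, t3)] / c1 if c1 else 0.0
        p3 = self.tri[(t1, t2, t3)] / c2 if c2 else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3


# Two tiny hypothetical training sentences, given as tag sequences only.
model = TrigramModel()
model.train([["DT", "NN", "VB"], ["DT", "NN", "NN"]])
p = model.prob("DT", "NN", "VB")
```

A full tagger would combine these transition probabilities with word emission probabilities and search for the best tag sequence (for example with the Viterbi algorithm); that part is left out here.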

The TnT tagger, which Sitar is partly modeled after, is more accurate in assigning tags to unknown words, so this is an area that can use improvement (though Sitar scores better than many other taggers that do not follow this methodology).

If you are interested in tinkering with Sitar, you may want to know that the source code is available under the liberal Apache License version 2.0. This license also allows for use in proprietary software, although I hope improvements are contributed. I hope this is useful to some people :).

Saturday 15 March 2008

C++ book recommendations

After having completed an excellent C++ course, I have been on the lookout for good books to venture deeper into the language. The following books turned out to be must-haves that I always try to keep within reach:

  • The C++ Standard Library - A Tutorial and Reference, Nicolai M. Josuttis
  • Beyond the C++ Standard Library: An Introduction to Boost, Björn Karlsson
  • C++ Templates - The Complete Guide, David Vandevoorde and Nicolai M. Josuttis
  • C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond, David Abrahams and Aleksey Gurtovoy

To people not yet familiar with C++, I have been recommending Accelerated C++: Practical Programming by Example by Andrew Koenig and Barbara E. Moo. I have only had the opportunity to skim through this book, but it seems to be aimed at leveraging C++ features and the standard library right away, rather than building up to C++ from C.

Wednesday 4 July 2007

Extending the DocBook RELAX NG schema

A while ago I switched from the DocBook 4.x DTDs to the DocBook 5 RELAX NG schema. RELAX NG supports XML namespaces, I like the compact RELAX NG syntax for its readability, and it is arguably easier to extend than DTDs. While I agree that XML can be painful to edit by hand without a proper editor, XML with proper schemas and stylesheets certainly makes writing content easier, especially due to its extensibility. For a book I am working on, I wanted to extend DocBook with platform testing information, so that I can see in hindsight whether I tested a piece of content on every system I want to cover. For example:

<sect1 xml:id="somesection">
  <title>Some section</title>

  <platformtests>
      <platform os="Debian GNU/Linux" version="4.0" arch="i386" />
      <platform os="FreeBSD" version="6.2" arch="i386" />
  </platformtests>

  <!-- ... -->
</sect1>

Of course, documents with "platformtests" blocks will be invalid, because these elements are not described in the schema. This is easily solved by writing a small schema on top of the DocBook schema:

default namespace db = "http://docbook.org/ns/docbook"

include "docbook.rnc"

dbx.platformtests = element platformtests {
        element platform {
                attribute os { text },
                attribute version { text },
                attribute arch { text }
        }+
}

db.extension.blocks |= dbx.platformtests

This schema fragment includes the DocBook schema, and defines "platformtests" blocks that can be used wherever normal content blocks can be used (e.g. section bodies). With this schema you can validate DocBook documents that contain platform tests. Of course, the stylesheets do not know about platform tests, so you will want to modify them to handle "platformtests" elements. In my case, the information is just an aid for me, and I do not want it to show up in HTML or FO documents, so an empty XSL template for this element handles it fine.
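
Checking in hindsight whether every section was tested on every platform can also be automated. The following is a rough sketch in Python using only the standard library; the function name `untested_sections`, the sample document, and the assumption that the tests sit directly inside "sect1" elements in the DocBook 5 namespace are illustrative choices, not part of the schema above.

```python
import xml.etree.ElementTree as ET

DB_NS = "http://docbook.org/ns/docbook"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def untested_sections(xml_text, required_os):
    """Map each sect1 xml:id to the required OS names it has no platform test for."""
    root = ET.fromstring(xml_text)
    missing = {}
    for sect in root.iter(f"{{{DB_NS}}}sect1"):
        # Collect the "os" attribute of every platform listed in this section.
        tested = {
            p.get("os")
            for p in sect.findall(f"{{{DB_NS}}}platformtests/{{{DB_NS}}}platform")
        }
        gaps = sorted(set(required_os) - tested)
        if gaps:
            missing[sect.get(XML_ID)] = gaps
    return missing


# Hypothetical document: the second section was only tested on FreeBSD.
SAMPLE = """<chapter xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Sample</title>
  <sect1 xml:id="somesection">
    <title>Some section</title>
    <platformtests>
      <platform os="Debian GNU/Linux" version="4.0" arch="i386" />
      <platform os="FreeBSD" version="6.2" arch="i386" />
    </platformtests>
  </sect1>
  <sect1 xml:id="othersection">
    <title>Other section</title>
    <platformtests>
      <platform os="FreeBSD" version="6.2" arch="i386" />
    </platformtests>
  </sect1>
</chapter>"""

report = untested_sections(SAMPLE, ["Debian GNU/Linux", "FreeBSD"])
for section_id, gaps in report.items():
    print(f"{section_id}: not tested on {', '.join(gaps)}")
```

This only reads the "os" attribute; comparing versions or architectures as well would follow the same pattern.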

Of course, you can do a lot of other fun stuff with the RELAX NG schemas for DocBook. E.g. Norman Walsh has a nice recipe for making a minimal DocBook variant with relatively little work.