As I mentioned in my previous blog
post, I just started studying OCaml. Since the OCaml for Scientists book is
still in the delivery pipeline, I decided to make a headstart with the
Introduction to Objective Caml book. It is a lot of fun trying the
examples, and making some exercises from various OCaml courses on the web.
Programming functionally requires a paradigm switch very much like when you
start writing Prolog for the first time, and is a good brain exercise in
itself. And what is a better way to get into shape than writing some basic
algorithms that most of us have probably implemented often in an imperative
language? One such simple algorithm is determining the longest common
subsequence (LCS) of two sequences. If we model a sequence as a list, the
algorithm very simple:
- The longest common subsequence of a list and and empty list is the empty
list.
- The longest common subsequence of two lists with equal heads is the head
concatenated with the LCS of the tails of the lists.
- The longest common subsequence of two lists with different heads is the
longest LCS of:
- The first list and the tail of the second list.
- The second list and the tail of the first list.
It should be noted that this approach only returns one LCS if there are
multiple possible LCSes for two sequences.
We can almost directly implement this in OCaml code:
let rec lcs l1 l2 = match l1, l2 with
[], _
| _, [] -> []
| h1::t1, h2::t2 when h1 = h2 -> h1 :: (lcs t1 t2)
| h1::t1, h2::t2 -> let s1 = lcs t1 (h2::t2) in
let s2 = lcs (h1::t1) t2 in
if (List.length s1 > List.length s2) then s1
else s2;;
Here you can see that lcs is a recursive function with two
arguments. We can use a technique called pattern matching to match each of the
cases listed in the description above. The first two patterns ([], _
and _, []) match if one of the lists is empty, and if l1, l2
matches one of these patterns, the expression after the arrow wil be evaluated
and returned. In this case, it's just the empty list. The next pattern matches
two non-empty list, but uses an additional pattern guard after the
when keyword to restrict this case further. The list matching itself
uses the so-called cons operator to split the list in a head and a
tail (very much like [Head|Tail] in Prolog). The guard adds the
requirement that the heads should be equal. If this is the case, we will return
a list with the shared head as the list head, and a tail which is the LCS of
the tail of the lists. Finally, the fourth pattern also matches two non-empty
lists, but has no further guards. As a consequence of the preceding paterns, it
will match to non-empty lists with inequal heads. In this case, we will
calculate two LCSes as described above, and will return the longest.
Although this implementation has the elegance of almost directly
representing the algorithm, it is also very slow. For example, computing the
LCS of the sequences [2;7;5;6;7;4;3;2;9;6;4;2;7;3;2;7;8;1] and
[3;5;8;2;3;4;5;7;2;8;3;2;3;8;7;3;5;5] took 50 seconds on my machine.
Fortunately, we can optimize this quite easily: the LCS algorithm divides the
problem in subproblems, and there are many overlapping subproblems. So, if we
memoize the LCSes of subproblems, we don't have to calculate them over and over
again. We could either directly introduce some storage for memoizing results in
the lcs function directly, but why not make a separate memoization
function that the lcs function can use? Such a memoization function should do
the following given two arguments: look the arguments up and see if we already
have a result for these arguments, and if not call the given function to
calculate the result and store the arguments with the result. Of course, the
argument/result storage needs to be persistent to the memoization function.
This is my first attempt (I am not sure if it can be made more elegant, since I
am explicitly naming two arguments here):
let memo2 f =
let h = Hashtbl.create 13 in
let rec fmem a1 a2 =
try Hashtbl.find h (a1, a2) with
| Not_found -> try Hashtbl.find h (a2, a1) with
| Not_found ->
let ha r = Hashtbl.add h (a1, a2) r; r in
ha (f fmem a1 a2) in
fmem;;
This function returns a function that takes two arguments (like the original
lcs function), and evaluates in the following manner:
- If the tuple of arguments (a1, a2) is known as a key in the hash table,
return its value.
- Otherwise, if the tuple of arguments (a2, a1) is known as a key in the hash
table, return its value. We want to do this here, since the lcs
function is symmetric for two given arguments.
- Otherwise, call the function f, and restore the result in the hash
table.
As you can see here, the memoizing function fmem is passed to
f, since f itself will have to apply the memoization function
for a recursive call:
let lcs f l1 l2 = match l1, l2 with
[], _
| _, [] -> []
| h1::t1, h2::t2 when h1 = h2 -> h1 :: (f t1 t2)
| h1::t1, h2::t2 -> let s1 = f t1 (h2::t2) in
let s2 = f (h1::t1) t2 in
if (List.length s1 > List.length s2) then s1
else s2;;
With these functions implemented, we can make a memoizing lcs in one
line:
let lcs_memo = memo2 lcs;;
And, as expected, the memoizing function is much faster, and completes on my
machine in a fraction of a second.