Planet Haskell

May 15, 2008

Don Stewart (dons)

Write Haskell as fast as C: exploiting strictness, laziness and recursion

In a recent mailing list thread Andrew Coppin complained of poor performance with "nice, declarative" code for computing the mean of a very large list of double precision floating point values:

    import System.Environment
    import Text.Printf

    mean :: [Double] -> Double
    mean xs = sum xs / fromIntegral (length xs)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean [1 .. d])

Which, when executed, churns away for a while, but eventually runs out of memory:

    $ ghc -O2 --make A.hs
    $ time ./A 1e9
    A: out of memory (requested 1048576 bytes)
    ./A 1e9  12.74s user 1.20s system 99% cpu 13.997 total

Well, it got 13 seconds into the job doing something. At least it was making progress.

While Andrew was citing this as an example of GHC Haskell being unpredictable (it's not an example of that, as we'll see), there is something here -- this little program reveals a lot about the interaction between strictness, laziness and approaches to compiler optimisation in Haskell.

This post is about writing reliably fast code, and then how to write more naive code that is also reliably fast. In particular, how recursion consistently produces excellent code, competitive with heavily optimised C, and how to get similar results from higher order functions. Throughout this post I'll be using GHC 6.8.2, with -O2. Sometimes I'll switch to the C backend (-fvia-C -optc-O2). Also, I don't care about smarter ways to implement this -- we're simply interested in understanding the compiler transformations involved in the actual loops involved.

Understanding strictness and laziness

First, let's work out why this ran out of memory. The obvious hard constraint is that we request a list of 1e9 doubles: that is very, very large. It's 8 * 10^9 bytes, if allocated directly, or about 7.5G. Inside a list there is yet further overhead, for the list nodes, and their pointers. So no matter the language, we simply cannot allocate this all in one go without burning gigabytes of memory. The only approach that will work is to lazily generate the list somehow. To confirm this, we can use a strict list data type, just to see how far we get.

First, let's define a strict list data type. That is one whose head and tail are always fully evaluated. In Haskell, strictness is declared by putting bangs on the fields of the data structure, though it can also be inferred based on how values are used:

    data List a = Empty
                | Cons !a !(List a)

This is a new data type, so we'll need some new functions on it:

    length' :: List a -> Int
    length' = go 0
        where
            go n Empty       = n
            go n (Cons _ xs) = go (n+1) xs

    sum' :: Num a => List a -> a
    sum' xs = go 0 xs
        where
            go n Empty       = n
            go n (Cons x xs) = go (n+x) xs

    enumFromTo' :: (Num a, Ord a) => a -> a -> List a
    enumFromTo' x y | x > y     = Empty
                    | otherwise = go x
        where
            go x = Cons x (if x == y then Empty else go (x+1))

I've written these in a direct worker/wrapper transformed, recursive style. It's a simple, functional idiom that's guaranteed to get the job done.

So now we can compute the length and sum of these things, and generate one given a range. Our 'mean' function stays much the same, just thee type changes from a lazy to strict list:

    mean :: List Double -> Double
    mean xs = sum' xs / fromIntegral (length' xs)

Testing this, we see it works fine for small examples:

    $ time ./B 1000
    500.5
    ./A 1000  0.00s user 0.00s system 0% cpu 0.004 total

But with bigger examples, it quickly exhausts memory:

    $ time ./B 1e9 
    Stack space overflow: current size 8388608 bytes.
    Use `+RTS -Ksize' to increase it.
    ./A 1e9  0.02s user 0.02s system 79% cpu 0.050 tota

To see that we can't even get as far as summing the list, we'll replace the list body with undefined, and just ask for the list argument to be evaluated before the body with a bang patterns (a strictness hint to the compiler, much like a type annotation suggests the desired type):

    mean :: List Double -> Double
    mean !xs = undefined

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean (enumFromTo' 1 d))

And still, there's no hope to allocate that thing:

    $ time ./A 1e9
    Stack space overflow: current size 8388608 bytes.
    Use `+RTS -Ksize' to increase it.
    ./A 1e9  0.01s user 0.04s system 93% cpu 0.054 total

Strictness is not the solution here.

Traversing structures and garbage collection

So why is the naive version actually allocating the whole list anyway? It was a lazy list, shouldn't it somehow avoid allocating the elements? To understand why it didn't work, we can look at what:

    mean :: [Double] -> Double
    mean xs = sum xs / fromIntegral (length xs)

actually means. To reason about the code, one way is to just look at the final, optimised Haskell GHC produces, prior to translation to imperative code. This reduced Haskell is known as "core", and is the end result of GHC's optimisations-by-transformation process, which iteratively rewrites the original source into more and more optimised versions.

Referring to the core is useful if you're looking for C-like performance, as you can state precisely, to the register level, what your program will do -- there are no optimisations to perform that would confuse the interpretation. It is an entirely unambiguous form of Haskell, so we'll use it to clarify any uncertainties. (For everyday programming, a comfortable knowledge of laziness and strictness is entirely adequate, so don't panic if you don't read core!).

To view the core, we can use ghc -O2 -ddump-simpl, or the ghc-core tool. Looking at it for the naive program we see that 'mean' is translated to:

    $wlen :: [Double] -> Int# -> Int#
    ...

    $wsum'1 :: [Double] -> Double# -> Double#
    ...

    main = ...
       shows (
        let xs :: [Double]
            xs = enumFromTo1 lvl3 d_a6h
        in 
             case $wsum'1 xs 0.0 of {
                 _ -> case $wlen xs 0 of {
                     z -> case ww_s19a /## (int2Double# z) of {
                          i -> D# i
       )

It looks like a lot of funky pseudo-Haskell, but its made up of very simple primitives with a precise meaning. First, length and sum are specialised -- their polymorphic components replaced with concrete types, and in this case, those types are simple, atomic ones, so Double and Int are replaced with unlifted (or unboxed) values. That's good to know -- unlifted values can be kept in registers.

We can also confirm that the input list is allocated lazily.

    let xs :: [Double]
        xs = enumFromTo1 lvl3 d_a6h
    in ..

The 'let' here is the lazy allocation primitive. If you see it on non-unboxed type, you can be sure that thing will be lazily allocated on the heap. Remeber: core is like Haskell, but there's only 'let' and 'case' (one for laziness, one for evaluation).

So that's good. It means the list itself isn't costing us anything to write. This lazy allocation also explains why we're able to make some progress before running out of memory resources -- the list is only allocated as we traverse it.

However, all is not perfect, the next lines reveal:

         case $wsum'1 xs 0.0 of {
             _ -> case $wlen xs 0 of {
                 z -> case ww_s19a /## (int2Double# z) of {
                      i -> D# i

Now, 'case' is the evaluation primitive. It forces evaluation to the outermost constructor of the scrutinee -- the expression we're casing on -- right where you see it.

In this way we get an ordering of operations set up by the compiler. In our naive program, first the sum of the list is performed, then the length of the list is computed, until finally we divide the sum by the length.

That sounds fine, until we think about what happened to our lazily allocated list. The initial 'let' does no work, when allocating. However, the first 'sum' will traverse the entire list, computing the sum of its elements. To do this it must evaluate the entire huge list.

Now, on its own, that is still OK -- the garbage collector will race along behind 'sum', deallocating list nodes that aren't needed anymore. However, there is a fatal flaw in this case. 'length' still refers to the head of the list. So the garbage collector cannot release the list: it is forced to keep the whole thing alive.

'sum', then, will allocate the huge list, and it won't be collected till after 'length' has started consuming it -- which is too late. And this is exactly as we observed. The original poster's unpredictably is an entirely predictable result if you try to traverse a 7G lazy list twice.

Note that this would be even worse if we'd been in a strict language: the initial 'let' would have forced evaluation, so you'd get nowhere at all.

Now we have enough information to solve the problem: make sure we make only one traversal of the huge list. so we don't need to hang onto it.

Lazy lists are OK

The simple thing to do then, and the standard functional approach for the last 40 years of functional programming is to write a loop to do both sum and length at once:

    mean :: [Double] -> Double
    mean = go 0 0
        where
            go :: Double -> Int -> [Double] -> Double
            go s l []     = s / fromIntegral l
            go s l (x:xs) = go (s+x) (l+1) xs

We just store the length and sum in the function parameters, dividing for the mean once we reach the end of the list. In this way we make only one pass. Simple, straight forward. Should be the standard approach in the Haskell hackers kit.

By writing it in an obvious recursive style that walks the list as it is created, we can be sure to run in constant space. Our code compiles down to:

    $wgo :: Double# -> Int# -> [Double] -> Double#
    $wgo x y z =
        case z of
          []             -> /## x (int2Double# y);
          x_a9X : xs_a9Y ->
             case x_a9X of 
                D# y_aED -> $wgo (+## x y_aED) (+# y 1) xs_a9Y

We see that it inspects the list, determining if it is an empty list, or one with elements. If empty, return the division result. If non-empty, inspect the head of the list, taking the raw double value out of its box, then loop, On a small list:

    $ time ./B 1000
    500.5
    ./B 1000  0.00s user 0.00s system 0% cpu 0.004 total

On the 7 gigabyte list:

    $ time ./B 1e9
    500000000.067109
    ./B 1e9  68.92s user 0.19s system 99% cpu 1:09.57 total

Hooray, we've computed the result! The lazy list got us home.

But can we do better? Looking at the GC and memory statistics (run the program with +RTS -sstderr ) we can see how much work was going on:

             24,576 bytes maximum residency (1 sample(s))

      %GC time       1.4%  (4.0% elapsed)
  Alloc rate    2,474,068,932 bytes per MUT second
  Productivity           98.6% of total user

What does this tell us? Firstly, garbage collection made up a tiny fraction of the total time (1.4%), meaning the program was doing productive thing 98.6% of the time -- that's very good.

Also, we can see it ran in constant space: 24k bytes maximum allocated in the heap. And the program was writing and releasing these list nodes at an alarming rate. 2G a second (stunning, I know). Good thing allocation is cheap.

However, this does suggest that we can do better -- much better -- by avoiding any list node allocation at all and keeping off the bus. We just need to prevent list nodes from being allocated at all -- by keeping only the current list value in a register -- if we can stay out of memory, we should see dramatic improvements.

Recursion kicks arse

So let's rewrite the loop to no longer take a list, but instead, the start and end values of the loop as arguments:

    mean :: Double -> Double -> Double
    mean n m = go 0 0 n
        where
            go :: Double -> Int -> Double -> Double
            go s l x | x > m      = s / fromIntegral l
                     | otherwise  = go (s+x) (l+1) (x+1)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)

We've moved the list lower and upper bounds into arguments to the function. Now there are no lists whatsover. This is a fusion transformation, instead of two separate loops, one for generation of the list, and one consuming it, we've fused them into a single loop with all operations interleaved. This should now avoid the penalty of the list memory traffic. We've also annotated the types of the functions, so there can be no ambiguity about what machine types are used.

Looking at the core:

    $wgo_s17J :: Double# -> Int# -> Double# -> Double#
    $wgo_s17J x y z =
      case >## z m of
        False ->
          $wgo_s17J
            (+## x z)
            (+#  y 1)
            (+## z 1.0);
        True -> /## x (int2Double# y)

it is remarkably simple. All parameters are unboxed, the loop is tail recursive, and we can expect zero garbage collection, all list parameters in registers, and we'd expect excellent performance. The >## and +## are GHC primitives for double comparison and addition. +# is normal int addition.

Let's compile it with GHC's native code generator first (-O2 -fexcess-precision):

    $ time ./C 1e9                 
    500000000.067109
    ./C 1e9  3.75s user 0.00s system 99% cpu 3.764 total

Great. Very good performance! The memory statistics tell a similar rosy story:

     20,480 bytes maximum residency (1 sample(s))
      %GC time       0.0%  (0.0% elapsed)
  Alloc rate    10,975 bytes per MUT second
  Productivity 100.0% of total user

So it ran in constant space, did no garbage collection, and was allocating only a few bytes a second. Recursion is the breakfast of champions.

And if we use the C backend to GHC: (-O2 -fexcess-precision -fvia-C -optc-O2), things get seriously fast:

    $ time ./C 1e9
    500000000.067109
    ./C 1e9  1.77s user 0.00s system 99% cpu 1.776 total

Wow, much faster. GCC was able to really optimise the loop GHC produced. Good job.

Let's see if we can get this kind of performance in C itself. We can translate the recursive loop directly into a for loop, where function arguments becomes the variables in the loop:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {

            double d = atof(argv[1]);

            double n;
            long long a; // 64 bit machine
            double b;

            // go_s17J :: Double# -> Int# -> Double# -> Double#
            for (n = 1,
                 a = 0,
                 b = 0; n <= d; b+=n,
                                n++,
                                a++)
                ;

            printf("%f\n", b / a);

            return 0;
    }

That was quite straight forward, and the C is nicely concise. It is interesting to see how function parameters are updated in place explicitly in the C code, while its implicitly reusing stack slots (or registers) in the functional style. But they're really describing the same code. Now, compiling the C without optimisations:

    $ time ./a.out 1e9
    500000000.067109
    ./a.out 1e9  4.72s user 0.00s system 99% cpu 4.723 total

Ok. That's cool. We got the same results, and optimised Haskell beat unoptimsed C. Now with -O, gcc is getting closer to catch GHC:

    $ time ./a.out 1e9
    500000000.067109
    ./a.out 1e9  2.10s user 0.00s system 99% cpu 2.103 total

And with -O2 it makes it in front by a nose:

    $ time ./a.out 1e9
    500000000.067109
    ./a.out 1e9  1.76s user 0.00s system 99% cpu 1.764 total

gcc -O2 wins the day (barely?), just sneaking past GHC -O2, which comes in ahead of gcc -O and -Onot.

We can look at the generated assembly (ghc -keep-tmp-files) to find out what's really going on.. GCC generates the rather nice:

    .L6:
        addsd   %xmm1, %xmm2
        incq    %rax
        addsd   %xmm3, %xmm1
        ucomisd %xmm1, %xmm0
        jae .L6

    .L8:
        cvtsi2sdq   %rax, %xmm0
        movl    $.LC2, %edi
        movl    $1, %eax
        divsd   %xmm0, %xmm2

Note the very tight inner loop, and final division. Meanwhile, GHC produces, with its native code backend:

    s1bn_info:
      ucomisd 5(%rbx),%xmm6
      ja .Lc1dz
      movsd %xmm6,%xmm0
      addsd .Ln1dB(%rip),%xmm0
      leaq 1(%rsi),%rax
      addsd %xmm6,%xmm5
      movq %rax,%rsi
      movsd %xmm0,%xmm6
      jmp s1bn_info

    .Lc1dz:
      cvtsi2sdq %rsi,%xmm0
      divsd %xmm0,%xmm5
Quite a bit more junk in the inner loop, which explains the slowdown with -fasm. That native code gen needs more work for competitve floating point work. But with the GHC C backend:
    s1bn_info:
      ucomisd     5(%rbx), %xmm6
      ja  .L10
      addsd       %xmm6, %xmm5
      addq        $1, %rsi
      addsd       .LC1(%rip), %xmm6
      jmp s1bn_info

    .L10:
      cvtsi2sdq   %rsi, %xmm7
      divsd       %xmm7, %xmm5

Almost identical to GCC, which explains why the performance was so good!

Some lessons

Lesson 1: To write predictably fast Haskell -- the kind that competes with C day in and out -- use tail recursion, and ensure all types are inferred as simple machine types, like Int, Word, Float or Double that simple machine representations. The performance is there if you want it.

Lesson 2: Laziness has an overhead -- while it allows you to write new kinds of programs (where lists may be used as control structures), the memory traffic that results can be a penalty if it appears in tight inner loops. Don't rely laziness to give you performance in your inner loops.

Lesson 3: For heavy optimisation, the C backend to GHC is still the way to go. Later this year a new bleeding edge native code generator will be added to GHC, but until then, the C backend is still an awesome weapon.

In a later post we'll examine how to get the same performance by using higher order functions.

May 15, 2008 07:00 PM

Luke Palmer

Know any good GUI languages?

Recently an urge has been growing inside me. I have been wanting to play with graphical programming; I believe the age of textual programming is nearly over1, and I want to experiment with the next phase.

I’m asking for recommendations for languages or libraries in which to create this experiment. It’s purely an experiment, so experimental languages and environments are okay. Things along the lines of Squeak are okay (I don’t know smalltalk, but I was going to start researching it, until I realized that there might be something better I’ve never even heard of). Basically I want it to be a nice match for what I’m doing; I want it to be ridiculously easy to create a GUI (and I don’t care how it looks, so long as it does the right thing).

I’ll describe what I’m doing. My first experiment is a dependently-typed “environment”. I’m picturing bubbles floating around (could be boxes in a table, I don’t care) which are not organized in any linear fashion. These bubbles are tagged with a type and possibly a name, and they represent “I have an object of type t“. Arrow types are multi-parameter and look like this:

Visualization of an Arrow type

Where the things on the left are “important” parameters, the thing on the right is the return type, and the things on the bottom are “extra” or “implicit” parameters. And you can drag a bubble onto a parameter of an arrow to represent application, and to get a new curried composite bubble representing it. That’s basically the whole environment. What could I make that in?

1 The same way that C is an obsolete language today. :-)

by Luke Palmer at May 15, 2008 06:26 PM

Mark Jason Dominus

Luminous band-aids

Last night after bedtime Iris asked for a small band-aid for her knee. I went into the bathroom to get one, and unwrapped it in the dark.

The band-aid itself is circular, about 1.5 cm in diameter. It is sealed between two pieces of paper, each about an inch square, that have been glued together along the four pairs of edges. There is a flap at one edge that you pull, and then you can peel the two glued-together pieces of paper apart to get the band-aid out.

As I peeled apart the two pieces of paper in the dark, there was a thin luminous greenish line running along the inside of the wrapper at the place the papers were being pulled away from each other. The line moved downward following the topmost point of contact between the papers as I pulled the papers apart. It was clearly visible in the dark.

I've never heard of anything like this; the closest I can think of is the thing about how wintergreen Life Savers glow in the dark when you crush them.

My best guess is that it's a static discharge, but I don't know. I don't have pictures of the phenomenon itself, and I'm not likely to be able to get any. But the band-aids look like this:

Have any of my Gentle Readers seen anything like this before? A cursory Internet search has revealed nothing of value.

by Mark Dominus at May 15, 2008 04:56 PM

Gareth Smith (gds)

Functional Reactive (web) Programming

I just saw a talk by Shriram Krishnamurthi.

He showed us something live - essentially the thing I'm doing in this sub-one-minute video.

From some of the questions asked afterwards, I gather that the concept may not be a new one, but I don't care - I'd not seen it before, and the idea blew me away.

In the video, I show the scheme setup. The good folks at Brown have also seen fit to furnish us with a web programming setup which works on all the main browsers right now.

May 15, 2008 04:34 PM

Adam Turoff

Static vs. Dynamic Languages

One permatopic across programming blogs is the good ol' static-vs-dynamic languages debate. Static languages (like C, C++, C#, C--, Java, etc.) require variables to have type declarations primarily to generate compiled code and secondarily to catch errors or emit warnings at compile time. Dynamic languages (like Perl, Python, PHP, Ruby, Tcl, Lisp, Scheme, Smalltalk, etc.) do away with all those needless keystrokes, allow programmers to be more productive, and generally aren't as dangerous as the C-programmers would have you believe. Furthermore, dynamic language programmers are quick to point out that any real program in a static language requires typecasting, which throws all of the "guarantees" out the window, and bring in all of the "dangers" of dynamic languages with none of the benefits.

However, this debate isn't limited to these opposite points of view. There's another perspective, which I heard from Mark-Jason Dominus about a decade ago: static typing in languages like C and Pascal isn't really static typing! (Correction: Mark actually said that "strong" typing in C and Pascal isn't really strong typing.)

The kind of static typing supported by C, its ancestors and its descendants, dates back to the 1950s and Fortran (or, rather FORTRAN, since lower case letters hadn't been invented yet). Type annotations in these languages are guides to the compiler in how much storage to allocate for each variable. Eventually, compilers began to emit warnings or compiler errors mixing variables of incompatible types (like floating point numbers and strings), because those kinds of operation don't make sense. If the irksome compiler gets in your way, you can always tell the it to shut up and do the operation anyway and maybe let the program crash at runtime. Or maybe silently corrupt data. Or maybe something completely different. None of which makes much sense, but there you go.

Modern static typing, on the other hand, uses a strong dose of the Hindley-Milner type inference algorithm to determine (a) what values a variable may contain and (b) prevent the use of incompatible values in expressions. Furthermore, Hindley-Milner prevents values being cast from one type to another on a whim (unlike C), and thus prevents odd runtime errors through abuse of the type system. This leads to an strong, unbreakable type system that is enforced by the compiler, and prevents a whole class of loopholes allowed by lesser type systems. Because types can be inferred, explicit type annotations aren't always needed, which leads to an interesting property: languages that are safer than "safe" static languages like C and Java, and programs that are free of those troublesome, noisy type declarations, just like dynamic languages.

Use of the Hindley-Milner type system comes to us through ML, and descendants like Haskell, OCaml, SML, F# and the like. It's such a good idea, that it's slowly making its way into new programming languages like Perl 6, Fortress, Factor, Scala, and future versions of C++, C#, and VB.Net.

So, really, the whole static-vs-dynamic languages debate is both misleading and moot. It's misleading, because it's not about static vs. dynamic typing, it's about weak compile-typing vs. runtime-typing. It's moot, because the writing is on the wall, and weakly typed languages are on the way out, to be replaced (eventually) by strongly-typed languages of one form or another. (It also tends to be a meaningless "debate", because it's typically a shouting match between C/C++/Java/etc. acolytes who are uncomfortable with Perl/Python/Ruby/etc., and Perl/Python/Ruby/etc. acolytes who have a strong distaste for C/C++/Java/etc., and neither group has a hope of persuading the other.)

Normally, I would avoid topic entirely, but today, I couldn't resist. Steve Yegge posted a transcription of his talk on dynamic languages, where he covers both sides of the debate, and moves beyond the two-sides-talking-past-each-other and uncovers the core issues. He covers the meat of the discussion, not the typical push-button issues that go nowhere and convince no one. (I think he misses the point behind Hindley-Milner, but that's an insignificant side issue.)

  • Most of the work in compiler optimization in the last few decades has focused on weakly-typed static languages.
  • Dynamic languages are, in general, slower, only because there has been relatively little research interest in making them faster.
  • There are some interesting approaches to optimizing dynamic languages that are now ready for prime time. Much of this comes from Strongtalk, an effort to optimize Smalltalk that led to Java's HotSpot VM.
  • Some of the "benefits" of "static" typing, like refactoring IDEs, aren't as good as they appear at first. Refactoring dynamic languages is different but IDEs could do about as good a job as they do for C++/C#/Java. Refactoring properly is a hard problem that no one has solved completely, yet.
  • "Fast" is relative. C++ used to be the "fast" language because it compiles down to native code. But in a multicore world, C/C++ is becoming slow, because C/C++ programs cannot take advantage of multiple CPUs easily, unlike Java.
  • Dynamic languages of the 1990s (Perl/Python/Ruby/Tcl) don't have a reasonable answer to scaling across CPUs, either.
  • Making Java bytecode run quickly takes more work than you probably realized.
  • Any sufficiently large system built with a static language will need to build in the kind of introspection that's available with a dynamic language and REPL. (Greenspun's Tenth Rule in action)
  • It was easy for the industry to switch programming languages every decade or so, up until 1995. Today, the bar is higher, and the amount of effort invested in existing code makes it much more difficult to migrate to a new language and platform.
  • Google has some of the most brilliant people and language designers in the industry, but they use C++, Java, Python and JavaScript, and none of the esoteric languages the smart kids love. This is because it's more important to standardize on a set of technologies and get stuff done than let the geeks run wild and free. But you probably knew that already. ;-)

Steve's talk is long and rambling, but certainly worth the time read or at least skim through.

by Adam Turoff (noreply@blogger.com) at May 15, 2008 02:34 PM

Joel Reymont

Maria


001-DSC_5931
Originally uploaded by joelr1
We are 3 months pregnant and that would be my 2nd :D.

by Joel Reymont at May 15, 2008 09:23 AM

Mark Jason Dominus

More artificial Finnish

Several Finns wrote to me to explain in some detail what was wrong with the artificial Finnish in yesterday's article. As I surmised, the words "ssän" and "kkeen" are lexically illegal in Finnish. There were a number of similar problems. For example, my sample output included the non-word "t". I don't know how this could have happened, since the input probably didn't include anything like that, and the Markov process I used to generate it shouldn't have done so. But the code is lost, so I suppose I'll never know.

Of the various comments I received, perhaps the most interesting was from Ilmari Vacklin. ("Vacklin", huh? If my program had generated "Vacklin", the Finns would have been all over the error.) M. Vacklin pointed out that a number of words in my sample output violated the Finnish rules of vowel harmony.

(M. Vacklin also suggested that my article must have been inspired by this comic, but it wasn't. I venture to guess that the Internet is full of places that point out that you can manufacture pseudo-Finnish by stringing together a lot of k's and a's and t's; it's not that hard to figure out. Maybe this would be a good place to mention the word "saippuakauppias", the Finnish term for a soap-dealer, which was in the Guinness Book of World Records as the longest commonly-used palindromic word in any language.)

Anyway, back to vowel harmony. Vowel harmony is a phenomenon found in certain languages, including Finnish. These languages class vowels into two antithetical groups. Vowels from one group never appear in the same word as vowels from the other group. When one has a prefix or a suffix that normally has a group A vowel, and one wants to join it to a word with group B vowels, the vowel in the suffix changes to match. This happens a lot in Finnish, which has a zillion suffixes. In many languages, including Finnish, there is also a third group of vowels which are "neutral" and can be mixed with either group A or with group B.

Modern Korean does not have vowel harmony, mostly, but Middle Korean did have it, up until the early 16th century. The Korean alphabet was invented around 1443, and the notation for the vowels reflected the vowel harmony:

The first four vowels in this illustration, with the vertical lines, were incompatible with the second four vowels, the ones with the horizontal lines. The last two vowels were neutral, as was another one, not shown here, which was written as a single dot and which has since fallen out of use. Incidentally, vowel harmony is an unusual feature of languages, and its presence in Korean has led some people to suggest that it might be distantly related to Turkish.

The vowel harmony thing is interesting in this context for the following reason. My pseudo-Finnish was generated by a Markov process: each letter was selected at random so as to make the overall frequency of the output match that of real Finnish. Similarly, the overall frequency of two- and three-letter sequences in pseudo-Finnish should match that in real Finnish. Is this enough to generate plausible (although nonsensical) Finnish text? For English, we might say maybe. But for Finnish the answer is no, because this process does not respect the vowel harmony rules. The Markov process doesn't remember, by the time it gets to the end of a long word, whether it is generating a word in vowel category A or B, and so it doesn't know which vowels it whould be generating. It will inevitably generate words with moxed vowels, which is forbidden. This problem does not come up in the generation of pseudo-English.

None of that was what I was planning to write about, however. What I wanted to do was to present samples of pseudo-Finnish generated with various tunings of the Markov process.

The basic model is this: you choose a number N, say 2, and then you look at some input text. For each different sequence of N characters, you count how many times that sequence is followed by "a", how many times it is followed by "b", and so on.

Then you start generating text at random. You pick a sequence of N characters arbitrarily to start, and then you generate the next character according to the probabilities that you calculated. Then you look at the last N characters (the last N-1 from before, plus the new one) and repeat. You keep doing that until you get tired.

For example, suppose we have N=2. Then we have a big table whose keys are 2-character strings like "ab", and then associated with each such string, a table that looks something like this:

r 54.52
a 15.89
i 10.41
o 7.95
l 4.11
e 3.01
u 1.10
space 0.82
: 0.55
t 0.55
, 0.27
. 0.27
b 0.27
s 0.27
So in the input to this process, "ab" was followed by "r" more than 54% of the time, by "a" about 16% of the time, and so on. And when generating the output, every time our process happens to generate "ab", it will follow by generating an "r" 54.52% of the time, an "a" 15.89% of the time, and so on.

Whether to count capital letters as the same as lowercase, and what to do about punctuation and spaces and so forth, are up to the designer.

Here, as examples, are some samples of pseudo-English, generated with various N. The input text was the book of Genesis, which is not entirely typical. In each case, I deleted the initial N characters and the final partial word, cleaned up the capitalization by hand, and appended a final period.

N=0
Lt per f idd et oblcs hs hae:uso ar w aaolt y tndh rl ohn otuhrthpboleel.ee n synenihbdrha,spegn.
N=1
Cachand t wim, heheethas anevem blsant ims, andofan, ieahrn anthaye s, lso iveeti alll t tand, w.
N=2
Ged hich callochbarthe of th to tre said nothem, and rin ing of brom. My and he behou spend the.
N=3
Sack one eved of and refor ther of the hand he will there that in the ful, when it up unto rangers.
It should be clear that the quality improves as one increases the N parameter. The N=3 sample has mostly real words, and the few nonsense ones it contains ("eved", "ful") are completely plausible English. N=2, on the other hand, is mostly nonsense, although it's mostly plausible nonsense. Even "callochbarthe" is almost plausible. (The unfortunate "chb" in the middle is just bad luck. It occurs because Genesis 36 mentions Baalhanan the son of Achbor.) The N=1 sample is recognizably bogus; no English word looks like "ieahrn", and the triple "l" in "alll" is nearly impossible. (I did once write to Jesse Sheidlower, an editor of the Big Dictionary, to ask his advice about whether "ballless" should be hyphenated.)

I have prepared samples of pseudo-Finnish of various qualities. The input here was a bunch of text I copied out of Finnish Wikipedia. (Where else? If you need Finnish text in 1988, you get it from the Usenet fi.talk group; if you need Finnish text in 2008, you get it from Finnish Wikipedia.) I did a little bit of manual cleanup, as with the English, but not too much.

N=0
Vtnnstäklun so so rl sieesjo.Aiijesjeäyuiotiannorin traäl.N vpojanti jonn oteaanlskmt enhksaiaaiiv oenlulniavas. Rottlatutsenynöisu iikännam e lavantkektann eaagla admikkosulssmpnrtinrkudilsorirumlshsmoti,anlosa anuioessydshln.Atierisllsjnlu e.Itatlosyhi vnko ättr otneän akho smalloailäi jiaat kajvtaopnasneilstio tntin einteaonaiimotn:r apoya oruasnainttotne wknaiossäelaäinoev aobrs,vteorlokynv. Aevsrikhanä tp s s oälnlke rvmi il ynae nara ign ssm lkimttbhineaatismäi tst lli ahaltineshne kr keöunv ah s itenh s .Ia pa elstpnanmnuiksriil anaalnttt mr ti.Ooa ka eee eiiei,tnees äusee a nanhetv.Iopkijeatatits,i l eklbiik suössmap tioaotaktdiir rkeaviohiesotkeagarihv nnadvö jlape öt kaeakmjkhykoto tnt iunnuyknnelu rutliie.Leva eiriaösnaj,rk oyumtsle,iioa,aspa aeiaä wsuinn eta y tvati klssviutkuaktmlpnheomi.T akapskushhnuksnhnnheaaaaussitseminmpnamäiaä pät.Kaaaabl unnionuhnpa iaes,outka.Cväinvkshvrnlteeoea rmi re suodmpr autlysa tnliaanäass. Srs rnvrtsita kmidusvjn tii.
N=1
Ava pän svun kerekent lsita batävomenasttenerga kovosuujalules rma punntäni rtraliksainoi van eukällä. Enäkukänesinntampalä ttan kolpäsäkyönsllvitivenestakkesenelussivaliite kuuksä kttteni einsuekeita kuterissalietäkilpöikalit ojatäjä pinsin atollukole idoitenn kkaorhjajasteden en vuolynkoiverojaa hta puon ehalan vaivä ihoshäositi. Hde setua tämpitydi makta jasyn sää oinncgrkai jeeten. Ljalanekikeri toiskkksypohoin ta yö atenesällväkeesaatituuun. Paait pukata tuon ktusumitttan zagaleskli va kkanäsin siikutytowhenttvosa veste eten vunovivä. Vorytellkeeni stan jä taa eka kaine ja kurenntonsin kyn o nta ja. Aisst urksetaka. Hotimivaa ta mppussternallai ja. Hdä on koraleerermohtydelen on jon. Rgienon kulinoilisälsa ja holälimmpa vitin, kukausoompremänn ra, palestollebilsen kaalesta, oina. Blilullaushoingiötideispaanoksiton, mulurklimi kermalli pota atebau lmomarymin kypa hta vanon tin kela vanaspoita s kulitekkäjen jäleetuolpan, veesalekäilin oii. Häreli. Ymialisstermimpriekaksst on.
N=2
Omaalis onino osa josa hormastaaraktse tyi altäänä tyntellevääostoidesenä, la siä vuansilliana inöön akalkuulukempellys kisä nen myöhelyaminenkiemostamahti omuonsa onite oni kusissa. Kungin sykynteillalkaai ellahasiteisuunnaja eroniemmin javai musuuasinä, sittan tusuovatkryt tormon vuolisenitiivansaliuotkietjuuta sensa. Kutumppalvinen. Vaikintolat hän ja kilkuossa osa koiseuvo keyhdysvisakeemppolowistoisijouliuodosijolasissän muoli ogro soluksi valuksasverix intetormon patlantaan et muiksen paiettaatulun kan vuomesyklees ovain pun. Sesva sa hänerittämpiraun tyi vuoden sälisen sän yhtiit, set tämpiraalletä. Senssaikanoje leemp:tabeten ain raa olliukettyi su. Solulukuuttellerrotolit hee säkinessa hän sekketäärinenvaikeihakti umallailuksin sestunno klossi ilunuta. Klettisaa osen vua vuola, jani ja hinangia en ta kaineemonimien polin barkiviäliukkuta joseseva. Ebb rautta onistärään on ml jokoulistä oheksi anoton allysvallelsiliineuvoja kutuko ala ulkietutablohitkain. Ituno.
N=3
Ävivät mena osakeyhti yhdysvalmiininäkin rakenne tuliitä hermoni ja umpirauhastui liin baryshnikoneja. Ain viljelukuullisää olisäke spesideksyylikoliittu latvia. Helsina hän solukeskuksen kannumme, peri palkin vieskeinä sisään on orgaan poikanssisäätelukauno klee laisenäläinen tavastui kauno on länteen muttava hän voimista kilometsästymistettäjän lehtiöiksitoreisö. Sitoutuvat mukalle. Ainettiin sisäke suomaihin, jouluun. Verenkilpalveli valtaineen opisteri poli ohjasionee rakennuttikolan aivastisenäläistuu kehittisetoja, rajahormaailmanajan kulkopuolesti kuluu mooliitoutuvat ovat olle. Ainen yhdysvaltai valiolähtiöiksi vasta, S. Muidentilaisteri jotka verenkirovin verenkiehumistä nelle väliaivoittynyt baleviiliukoisiin maailmestavarasta, jokakuudessa laisu. Sai rakeyhti yhtiö eli gluksessa. Ebbin, ja linnosakkeen hormonien I hallistehtiin kilpirasvua jaajana hormaailusta kunnetteluskäyttöön suomalaivat yhdysvalmistämistammonit veteet olimistuvatta. Hormon oli rautta.
Before anyone objects to the non-word "ml" in the N=2 sample, let me explain that this is the standard abbreviation for "millilitra". The "i" in the N=3 sample was a puzzle, since Marko Heiskanen assures me that Finnish has no one-letter words. But it appears in my sample in connection with Sukselaisen I hallitus, whatever that is, so I capitalized it.

I must say that I found "yhdysvalmistämistammonit" rather far-fetched, even in Finnish. But then I discovered that "yhdeksänkymmenvuotiaaksi" and "yhdysvalloissakaan" are genuine, so who am I to judge?

by Mark Dominus at May 15, 2008 04:30 AM

David R. MacIver

A cute hack: Changing the scope of System.out

This isn’t any sort of deep trick. I just found the results surprisingly pleasant, so thought I’d share.

In the GUI interpreter I need to be able to capture standard out. Doing println in the interpreter should print to the screen, not the console. And that’s fine. Java provides System.setOut. I had to fight the stupidities of java.io a bit to make it work (annoying details with flushing among other things), but it did in the end.

There is of course a problem with this. System.out is basically a great honking big global variable. We’ve now made it impossible to run multiple interpreters in the same VM. Le sigh.

Ok. So, what we do is a bit of juggling and set System.out appropriately before interpreting each thing. Right?

Nope. Doesn’t work. These things can and do live in separate threads. And for that matter, what about background threads spawned by one of the interpreters?

The solution is to make System.out thread local. And not just any type of thread local - an inheritable thread local. Spawned threads need to inherit the System.out of their parent thread so that something like spawn { println(”Hello world”) } prints to the right screen.

Once you have this, the idea is obvious. You define an object which extends OutputStream but delegates all functionality to some thread local variable (snarfing the System.out that was present when the object loads as the default value), set System.out to be a print stream wrapping that and you’re done. You can now set its value in a thread local way and multiple interpreters can coexist peacefully.

Like I said, nothing special, but it seems like a surprisingly handy trick.

by david at May 15, 2008 12:05 AM

May 14, 2008

David R. MacIver

GUI interpreter frontend coming along nicely

I’ve separated the interpreter frontend out into its own separate project. It’s available at http://www.drmaciver.com/repos/guinterpreter/

It’s doing pretty well. The UI has been improved, it supports multiple interpreter windows in the same VM, and is generally in a pretty usable state. It still has a few quirks, but it’s getting there (still a little bit of a headache to set up, but I’ll work on that). My plans for the immediate future mainly revolve around UI polishing and embeddability - I’d like it to be really easy to include this inside another (Jambi) application. hemant / gnufied has been taking a look at the code and has some interesting plans for it, starting with syntax highlighting. I’ve given him commit access, so we’ll see what happens on that front.

If you want to give it a try, go right ahead. You’ll need to resolve the Jambi dependency yourself, but otherwise it should be pretty easy (I will probably set up some sort of maven (sigh) repository for the jambi dependencies and switch the build script over to buildr so it can use them at some point, but I haven’t yet).

by david at May 14, 2008 11:20 PM

Mark Jason Dominus

Artificial Finnish

Order
Symbols, Signals, and Noise
Symbols, Signals, and Noise
with kickback
no kickback
By 1988 or 1989 I had read in several places, most recently in J. R. Pierce's Symbols, Signals, and Noise, that if you compile a table of the relative frequencies of three-letter sequences (trigraphs) in English text, and then generate random text with the same trigraph frequencies, the result cannot be distinguished from meaningful English text except by people who actually know English. Examples were provided, containing weird but legitimate-sounding words like "deamy" and "grocid", and the claim seemed plausible. But since I did actually know English, I could not properly evaluate it.

But around that time the Internet was just beginning to get into full swing. The Finnish government was investing a lot of money in networking infrastructure, and a lot of people in Finland were starting to appear on the Internet.

I have a funny story about that: Around the same time, a colleague named Marc Edgar approached me in the computer lab to ask if I knew of any Internet-based medium he could use to chat with his friend at the University of Oulu. I thought at first that he was putting me on (and maybe he was) because in 1989 the University of Oulu was just about the only place in the world where a large number of people were accessible via internet chat, IRC having been invented there the previous autumn.

A new set of Finnish-language newsgroups had recently appeared on Usenet, and people posted to them in Finnish. So I had access to an unlimited supply of computer-readable Finnish text, something which would have been unthinkable a few years before, and I could do the experiment in Finnish.


I wrote up the program, which is not at all difficult, gathered Finnish news articles, and produced the following sample:

Uttavalon estaa ain pahalukselle? Min omatunu selle menneet hy, toista. Palveljen alh tkö an välin oli ei alkohol pisten jol elenin. Että, ille, ittavaikki oli nim tor taisuuristä usein an sie a in sittä asia krista sillo si mien loinullun, herror os; riitä heitä suurinteen palve in kuk usemma. Tomalle, äs nto tai sattia yksin taisiä isiäk isuuri illää hetorista. Varsi kaikenlaineet ja pu distoja paikelmai en tulissa sai itsi mielim ssän jon sn ässäksi; yksen kos oihin! Jehovat oli kukahdol ten on teistä vak kkiasian aa itse ee eik tse sani olin mutta todistanut t llisivat oisessa sittä on raaj a vaisen opinen. Ihmisillee stajan opea tajat ja jumalang, sitten per sa ollut aantutta että voinen opeten. Ettuj, jon käs iv telijoitalikantaminun hä seen jälki yl nilla, kkeen, vaaraajil tuneitteistamaan same?

In those days, the world was 7-bit, and Finnish text was posted in a Finnish national variant of ASCII that caused words like "tkö an välin" to look like "tk| an v{lin". The presence of the curly braces heightened the apparent similarity, because that was all you could see at first glance.

At the time I was pleased, but now I think I see some defects. There are some vowelless words, such as "sn" and "t", which I think doesn't happen in Finnish. Some other words look defective: "ssän" and "kkeen", for example. Also, my input sample wasn't big enough, so once the program generated "alk" it was stuck doing the rest of "alkohol". Still, I think this could pass for Finnish if the reader wasn't paying much attention. I was satisfied with the results of the experiment, and was willing to believe that randomly-contructed English really did look enough like English to fool a non-English-speaking observer.

[ Addendum 20080514: There is a followup to this article. ]

by Mark Dominus at May 14, 2008 10:17 PM

Joel Reymont

We eat well down here!


DSC_5840
Originally uploaded by joelr1
That bottle of wine is 25 years old.

by Joel Reymont at May 14, 2008 08:48 PM

I love Fred Olsen


DSC_5826
Originally uploaded by joelr1
This trimaran goes up to 40 miles per hour, takes an hour to reach Gran Canaria from Tenerife and about 45 minutes to reach La Gomera.

by Joel Reymont at May 14, 2008 08:42 PM

Alex McLean

Late summer events

I’ve had a paper accepted to ICMC (International Computer Music Conference) in Belfast.  My paper isn’t directly about livecoding but according to chatter on the TOPLAP list there will be a fair number of livecoding papers and performances around the conference, including a off-icmc livecoding event organised by Graham Coleman.   Looking forward to the schedule appearing…

Just after that from the 29th August is the 3rd annual dorkcamp, a weekend in a field doing strange things with electricity. The previous camps were fantastic, I can’t wait.

Then probably the following weekend, 6th September will be the London Placard headphone festival, an intense evening of diverse back-to-back 20 minute performances over a bank of headphone distribution amplifiers (and no PA).  Always extra-special and full of surprises, it looks like this will be a big one…

by Alex at May 14, 2008 06:39 PM

Adam Turoff

Thought of the Day

Also from Steve Yegge's talk:

[W]hen we talk about performance, it's all crap. The most important thing is that you have a small system. And then the performance will just fall out of it naturally.

by Adam Turoff (noreply@blogger.com) at May 14, 2008 06:54 PM

David R. MacIver

Trampoline Systems and Scala

So, as I’ve mentioned a few times my company Trampoline Systems have been using Scala at work. This is, somewhat unfortunately, about to change.

It’s not really due to any problems with Scala. I’m certainly still planning to continue using it myself. There have been a few hitches that have meant we’ve not been able to take advantage of it as well as I’d like, but this is mainly a strategic rather than a technical decision. The majority of our code is in Ruby (even more so than it was at the start of this project), and most of our expertise is in Ruby, so it was starting to look increasingly silly that we had just this one project in Scala. Consequently we’ve decided to move the stuff we were previously doing in Scala to JRuby.

Oh well. It was nice to be a professional Scala developer for a bit. Now I get to be a professional Ruby developer instead. Life’s all about dealing with changes. :-)

by david at May 14, 2008 04:17 PM

John Goerzen (CosmicRay)

Thoughtfulness on the OpenSSL bug

By now, I'm sure you all have read about the OpenSSL bug discovered in Debian.

There's a lot being written about it. There's a lot of misinformation floating about, too. First thing to do is read this post, which should clear up some of that.

Now then, I'd like to think a little about a few things people have been saying.

People shouldn't try to fix bugs they don't understand.

At first, that sounds like a fine guideline. But when I thought about it a bit, I think it's actually more along the lines of useless.

First of all, there is this problem: how do you know whether or not you understand it? Obviously, sometimes you know you don't understand code well. But there are times when you think you do, but don't. Especially when we're talking about C and its associated manual memory management and manual error handling. I'd say that, for a C program of any given size, very few people really understand it. Especially since you may be dealing with functions that call other functions 5 deep, and one of those functions modifies what you thought was an input-only parameter in certain rare cases. Maybe it's documented to do that, maybe not, but of course documentation cannot always be trusted either.

I'd say it's more useful to say that people should get peer review of code whenever possible. Which, by the way, did occur here.

The Debian maintainer of this package {is an idiot, should be fired, should be banned}

I happen to know that the Debian programmer that made this patch is a very sharp individual. I have worked with him on several occasions and I would say that kicking him out of maintaining OpenSSL would be a quite stupid thing to do.

He is, like the rest of us, human. We might find that other people are considerably less perfect than he.

Nobody that isn't running Debian or Ubuntu has any need to worry. This is all Debian's fault.

I guess you missed the part of the advisory that mentioned that it also fixed an OpenSSL upstream bug (that *everyone* is vulnerable to) that permitted arbitrary code execution in a certain little-used protocol? OpenSSL has a history of security bugs over the years.

Of course, the big keygen bug is a Debian-specific thing.

Debian should send patches upstream

This is general practice in Debian. It happens so often, in fact, that the Debian bug-tracking system has had -- for probably more than a decade -- a feature that lets a Debian developer record that a bug reported to Debian has been forwarded to an upstream developer or bug-tracking system.

It is routine to send both bug reports and patches upstream. Some Debian developers are more closely aligned with upstream than others. In some cases, Debian developers are part of the upstream team. In others, upstream may be friendly and responsive enough that Debian developers run any potential patches to upstream code by them before committing them to Debian. (I tend to do this for Bacula). In some cases, upstream is busy and doesn't respond fast or reliably or helpfully enough to permit Debian to make security updates or other important fixes in a timely manner. And sometimes, upstream is plain AWOL.

Of course, it benefits Debian developers to send patches upstream, because then they have a smaller diff to maintain when each new version comes out.

In this particular case, communication with upstream happened, but the end result just fell through the cracks.

Debian shouldn't patch security-related stuff itself, ever

Well, that's not a very realistic viewpoint. Every Linux distribution does this, for several reasons. First, a given stable release of a distribution may be older than the current state of the art upstream software, and some upstreams are not interested in patching old versions, while the new upstream versions introduce changes too significant to go into a security update. Secondly, some upstreams do not respond in a timely manner, and Debian wants to protect its users ASAP. Finally, some upstreams are simply bad at security, and having smart folks from Debian -- and other distributions -- write security patches is a benefit to the community.

by nospam@example.com (John Goerzen) at May 14, 2008 10:48 AM

Edward Kmett

Generatingfunctorology

Ok, I decided to take a step back from my flawed approach in the last post and play with the idea of power series of functors from a different perspective.

I dusted off my copy of Herbert Wilf's generatingfunctionology and switched goals to try to see some well known recursive functors or species as formal power series. It appears that we can pick a few things out about the generating functions of polynomial functors.

As an example:

 
Maybe x = 1 + x
 

Ok. We're done. Thank you very much. I'll be here all week. Try the veal...

For a more serious example, the formal power series for the list [x] is just a geometric series:

 
[x] = mu y . 1 + x y  -- the mu here is a pleasant fiction, more below
    = 1 + x (1 + x (1 + x (...)))
    = 1 + x + x^2 + x^3 + ...
    = 1/(1-x)
 

Given the power series of a functor, its nth coefficient * n! tells you how many distinguishable ways its constructors can house n values. If we see that a list of n values can be permuted n! ways this has some interesting possibilities for linearizing the storage of some functors. The list case is boring, we can store a finite list of n elements by supplying the length of the array and an array of n elements, hence (among other reasons) the mu above.

Lets try decorated binary trees:

 
data Tree x = Leaf | Node (Tree x) x (Tree x)
 
Tree x = mu y. 1 + x * y * y
       = 1 + x * (1 + x * (...)^2)^2
       = 1 + x + 2x^2 + 5x^3 + 14x^4 + 42x^5 + ...
 

It turns out the coefficients of our generating function are the Catalan numbers, A000108, commonly denoted C(n), which happen to be well known for among other things, being the number of ways you can build a binary tree of n nodes.

This tells us we could store a tree by storing a number n of nodes it contains, an array of that many nodes, and an index 0 = i C(n) to tell you which particular tree you selected. Not that this is likely to be an incredibly time-efficient encoding, but you could then fmap over the tree by just fmapping over your array.

For a formal power series,

$f(x) = \sum_{i=0}^{\infty} a_n x^n = a_0 + a_1 x + a_2 x^2 + ... $

its derivative is given by differentiating the series term by term:

$f'(x) = \sum_{i=1}^{\infty} n a_n x^{n - 1} = a_1 + 2 a_2 x + 3 a_3 x^2 + ...$

Consequently we can take the derivative of a list:

([]') x = 1 + 2x + 3x^2 + ... = 1/(1-x)^2 = ([] :*: []) x

and rederive the notion that a derivative/one hole context of a list can be represented by a pair of lists.

If we step slightly outside of the Haskell users' comfort zone and notion of a Functor and allow other Species, we get (as noted by apfelmus the other day) that Exp a is just a bag of elements with no predetermined order.

 
Exp x = 1 + x + x^2/2! + x^3/3! + ... = Bag x
 

Since there are n! ways to order n elements and Bag manages to forget that information, we can get a feeling for the meaning of division in this setting.

Similarly we can define:

 
Sinh x = x + x^3/3! + ... -- a Bag of some odd number of elements.
Cosh x = 1 + x^2/2! + ... --  a Bag of some even number of elements.
 

Then by construction:

 
Exp = Cosh :+: Sinh
 

The derivative of Exp is Exp, of Cosh is Sinh, of Sinh is Cosh, all as you would expect.

We can handle other species as well:

 
Cycle a = 1 + x^2/2 + x^3/3 + x^4/4 + ... -- cycles
Cycle_n a = x^n/n -- cycles of n elements
Bag_n a = x^n/n!  -- bags of n elements
 

That said, there seem to be some problems, not every functor is well behaved in this way. Lets take for instance the type of natural numbers given by the functor:

 
data Nat a = S (Nat a) | Z
 

Then the recurrence blows up, the coefficient for 0 is $\aleph_0$!

Similarly, if we parameterized a functor on another value we have to deal with the number of cases that other value can denote.

 
Either a x = |a| + x
(a,x) = |a| x
 

This is both good and bad, using the above, we can quickly establish an isomorphism between Either () and the Maybe Functor, but we blow up again for Either Integer. This gets even worse if we allow 'functors in space.' (i.e. functors that can contain functions)

On the other extreme, we might modify our tree example and remove the leaves, yielding infinite decorated cotrees.

 
Tree x = nu y. x * y * y
       = x * (x * (...)^2)^2
       = 0 + 0x + 0x^2 + 0x^3 + ...
 

Then a_n = 0 for all n in the natural numbers, so you can't use the coefficients of the generating function to tell you about the behavior of an infinite structure! It would appear that the generating function of a functor does not capture what happens in the greatest fixed point case, so we can only use generating functions to describe the behavior of data defined with mu, not in general codata defined by nu.

The Bags and Cycles above are nice examples, but if we wanted to rule out the non-polynomial Functors (from the Haskell perspective) in the above then we can simply limit ourselves to ordinary generating functions with natural number coefficients, that is to say generating functions of the form:

$f(x) = \sum_{i=0}^{\infty} a_n x^n = a_0 + a_1 x + a_2 x^2 + ... $

To choose to admit bags, cycles and other species etc. then you need merely also permit exponential generating functions with natural coefficients, that is to say, generating functions of the form:

$f(x) = \sum_{i=0}^{\infty} a_n x^n / n! = a_0 + a_1 x / 1! + a_2 x^2 / 2! + ... $

by Edward Kmett at May 14, 2008 07:45 AM

May 13, 2008

David R. MacIver

Code move

I’m going to be moving my hg repositories to be hosted on this domain, as freehg.org is very flaky. I’ve moved the miscellaneous code repository over already. It’s available from http://drmaciver.com/repos/misc/

by david at May 13, 2008 06:12 PM

Eric Kow (kowey)

recurring problem (boring text file merging)

I keep solving variations of this problem at work, whether I'm trying to merge some log files together, or identify token offsets with bits of parse tree. I had better jot it down so that I don't forget there may be something more general hidden behind all this.

mergeFoo :: [a] -> [(Int,Int,b)] -> [Either a ([a],b)]




I'm not necessarily looking for a solution -- I could just boil one out from my previous solutions -- but I am at least officially and publicly reminding myself that I shouldn't keep solving the same thing over and over again (unless I'm engaged in some kind of lateral thinking exercise, which is a different story)

by kowey (noreply@blogger.com) at May 13, 2008 05:02 PM

David R. MacIver

Request for help: Numerical linear algebra

I have a bit of an odd background. I have a fair bit of mathematics (four year MA equivalent course in mathematics at Cambridge) and at least a reasonable background in software engineering and programming (about three years working as a software engineer now, plus a lot of random stuff on my spare time during that period and a little bit before hand), but I don’t have a lot of knowldge of the overlap between the two. This is increasingly causing me trouble, and I’d like to fix it. It seems like the sort of subject my blog audience should know about, so I’m asking for help here. :-)

The immediate subject I want to learn more about is numerical linear algebra. It’s come up a couple times recently and I’ve had to go “Bwrgh. Um. Halp!”. I usually muddle through, but it’s really not something I should be struggling on and I’ve probably done stupid things as a result of lack of knowledge.

So, my goal here is to become reasonably well versed in at least the basics of the subject. At the moment nothing terribly specific - I want to get a good grasp of the theory and practice of working with matrices, vectors, etc. computationally. Linear optimisation and eigenvector problems especially. I’m looking for recommendations for books, libraries, software packages and anything else you think would be useful. I’m willing to spend a reasonable amount of money (more so on the books than the software packages, as long term I’ll really be looking for stuff I can freely bundle with my personal projects) in doing so, ‘though obviously free is a plus. Libraries I can use from a variety of languages are definitely a good thing, though I’m willing to look at language specific things like NumPy, etc. too if they come highly recommended (at least as a starting point).

Relevant background:

Mathematics: Quite a lot of real and complex analysis, though mostly pure. Enough linear algebra that I at least know where to start with it, though I’m going to be rusty.

Programming/computer science: Some amount of general algorithms and datastructures work. Generally comfortable with Java, Haskell and Scala. Comfortable enough with C that I can write it if I have to. A little bit of Python and Lua (plus a few others which are probably not relevant). Fairly willing to pick up a new language if it would help (though if you tell me to write fortran I’ll cry).

So, suggestions?

Edit: I’ll keep a list of recommendations updated here. If you have any comments for or against these, please pipe up.

I found a copy of Epperson going very cheaply and for no particular reason it looked interesting, so I’ve ordered a copy. I’ll almost certainly order a copy of Golub and Loan. I’m also going to be playing around with Octave. Thanks guys!

by david at May 13, 2008 03:38 PM

Clifford Beshers

Paul McCartney was in a band before Wings?

Jeremy Shaw observed today that "Haskell is moving up in the world." His evidence was that a Google search for "haddock" now puts the Haskell documentation tool third on the list.

I guess we'll know that Haskell is a complete and total success when we overhear some kid in a supermarket asking, "Haddock is a fish?"

by Clifford.Beshers (noreply@blogger.com) at May 13, 2008 04:14 PM

Thomas Schilling

The Thing That Should Not Be (Or: How to import 18500+ patches from Darcs into Git in less than three days)

I like Darcs. Really. It is easy to learn and use and for smallish projects I never had any real problems. Unfortunately, it still has some performance problems and it is likely that some operations will never be fast.

An extreme example of where you run into those problems is the GHC repository. It consists of over 18500 patches and spans over 12 years of history. When I tried to build the latest version I ran into a linker error which I know I didn't get with the snapshot from one month ago. As GHC builds take quite a while I wanted to use an efficient way to find which exact change introduced the problem. More precisely I wanted git bisect.

I know that Don had converted Darcs repositories to Git in order to get ohloh statistics, but he reported that this process was rather painful. It took four weeks(!) to convert the GHC repository.

So, I looked what tools were out there, and how to improve them. I know that there is Tailor, but I looked at darcs-to-git by Steve Purcell first and found it very hackable. I didn't like that it saved the Darcs patch ID in the Git commit message, so I changed that and I extended it to properly translate Darcs escape sequences. I also added a parameter to only pull a number of patches at the same time, so that I can import a big repository in stages and I allowed custom mapping from committer names to other committer names. I used this to map various pseudonyms to (a unique) full name and email address. (I hope no one minds being credited with his or her full name. ;) )

It worked rather well for smallish repositories (a bit less than 2000 patches) but I had serious problems to get it to work with GHC.

  • Darcs has a bug on case-insensitive volumes (which OS X uses by default), so Steve suggested using a case-sensitive sparse image. This works, but it is probably a bit slower. I tried running it on my FreeBSD home server, but it has only 256 MB of RAM (usually fine for a home file server) so Darcs ran out of space and eventually got killed by the OS. (Getting Darcs to compile on my server was an adventure in itself--first a few hours to update the ports tree, then one more to compile GHC 6.8 which then just failed to install...) Fortunately, my Laptop has 2 GB, so it works fine there.
  • At startup darcs-to-git reads the full Darcs patch inventory. For such a big repo as GHC this takes over a minute (and lots of RAM). Caching it in a file didn't seem to help much. I could have lived with that, but there was a more serious problem: the approach used by darcs-to-git (and, it seems, also by Tailor) doesn't work!
  • darcs-to-git pulls one patch at a time by giving it's ID to darcs pull --match 'hash ...id...', then git adds the changes on the Git side and git commits it with the appropriate commit message. The patches are pulled in the order in which they were applied in the source repo, so any dependencies should be fulfilled. Nevertheless, Darcs refused to apply some patches -- silently. Darcs just determined that I didn't want to pull any patches and didn't do anything. This is most likely a Darcs bug, but I heard it was only a known bug for some development version of Darcs 2 (I used Darcs 1.0.9 at that time). Anyways, that didn't work; it failed at about patch 30 of the GHC repository.
  • OK. So instead of pulling patches by ID we could fake user interaction. Something like this:
    $ echo "yd" | darcs pull source-repo
    The input corresponds to "Yes, I want to pull this patch" and "Ok, I'm done and want to pull all the selected patches". This works reliably and also has the advantage that we don't have to read the whole history up front but instead can just retrieve the info for the last applied patch via
    darcs changes --last 1 --xml
  • By now you might have guessed, though, that this still didn't work very well. It took about 60 seconds per patch (with about 1 second of this used by Git), resulting in estimated 13 days(!) CPU time for the full repository.
  • Interestingly, most of those 60 seconds are spent before any patch choice is displayed, so apparently Darcs is doing something to calculate which patches to show. After that, displaying more choices is relatively quick. Apparently, the startup time depends on the number of patches not yet pulled.

This leads to the following trick.

We use two intermediate repositories. We use one to pull several patches at a time from the source repository. I use 15 patches, ie.:

$ cd tmp/ghc.pull
$ echo "yyyyyyyyyyyyyyyd" | darcs pull /path/to/ghc
We now could import from this intermediate repository into Git, since the startup time to pull from this repo is now much lower. However, we'd like to already start and pull the next 15 patches into the temporary repository. Pulling from and into the same repo at the same time doesn't work (Dars locks the repo), so we also need to mirror this temporary repository. A cp -r would work, but as the repository grows larger, this would do unnecessary work. So I just pull the changes at once.
$ cd /tmp/ghc.pull_mirror
$ darcs pull --all /tmp/ghc.pull  # this is pretty quick now
Now we can import into our git mirror from there, and already start pulling the next 15 patches (proper term for this is "macro pipelining", I believe).
$ cd /path/to/ghc.git
$ ./darcs-to-git /tmp/ghc.pull_mirror &  # run in background
$ cd /tmp/ghc.pull
$ echo "yyyy..." # etc
Of course, before pulling from the first mirror into the second mirror we have to make sure that darcs-to-git has finished pulling from the second mirror. I have implemented this as a shell script on top of darcs-to-git, but I may move it into darcs-to-git at some point.

My fork of darcs-to-git as well as Steve's main repo are both available at Github. I haven't pushed all of my local changes yet, but I plan to implement pulling dars patches "interactively" as a possible option for darcs-to-git, so maybe check the repo in a week or two.

With this approach I am down to about 200 seconds per 15 patches or about 68 hours fo the 18500 patches of the GHC repo which is just below the promised three days. (Of course, YMMV)

So the moral of this story? Darcs is very slow for biggish repositories, especially for rarely used border cases (such as pulling patches one by one). It may be possible to fix them, but I doubt that this will be easy. I tried using the new hashed format and the darcs-2 format, but converting the GHC repo didn't work for me. I certainly hope that things get better, and I plan to help at least a little by submitting several bug reports in the next couple of days about the problems I ran into in the past days. Let's see what happens.

Oh, and Darcs needs a killer-app like Github!

by Thomas Schilling (noreply@blogger.com) at May 13, 2008 01:34 PM

John Goerzen (CosmicRay)

Imagine 1

Imagine, for a moment, that you are a young man in your 20s, trying to make your way in the world. You are married and have a young daughter, just old enough to start to talk. You live in a run-down neighborhood, long passed-over by any economic advances. What schools you had access to barely taught anyone much. The few jobs you can reach have fierce competition, even though the pay is low. You worry about your health, but even more about that of your wife and child. Finding food is a constant concern. Although you are still healthy now, and you are willing and able to be a hard worker, there is simply nobody hiring people in your area. Not to mention the gunfights that erupt between gangs or drug dealers. Oh, and did I mention that your wife is 4 months pregnant?

Your top priority is to do your best to keep your family safe. You're afraid that your whole family will starve, or be killed by an errant bullet. You've tried for a long time -- it seems like forever -- to do everything you can think of, with no success. Finally, you decide that the only way you can have the hope for a better life is to move somewhere where the economy is better, and the drug dealers are fewer.

But moving hundreds or thousands of miles away is no easy task when you have no money to move. Somehow, with some luck, ingenuity, and tenacity, you have finally managed to find a way. You have no job offer in your new town, but conditions are so bleak at home that you just can't risk staying there. So the three of you move 1500 miles away.

You arrive with no money, no apartment, and don't know anybody. But you're a hard worker, and have talked yourself into a job. It pays what passes for minimum wage in your new home, but it's a fortune compared to what you made before. It's backbreaking work, and you work long hours. But soon you can afford a cramped apartment, and keep your refrigerator stocked with food. What a luxury!

Pretty soon your new baby son is born. You can afford to feed him, your daughter, your wife, and yourself, every day. When you're really lucky, you even have some money left over to send to your brother back home, who is still struggling to make ends meet there. You seem to have climbed the first rung on the American Dream ladder.

Years pass. Your old home becomes a memory; your daily life revolves around new struggles now. Your oldest child is in school, your wife finds part-time work sometimes too, cleaning houses for rich people. You've been laid off several times, your income isn't guaranteed, and the others in your new home don't take kindly to strangers -- and they still think you're one. But it's better than flying bullets and never knowing where your next meal will come from.

Then one day, while you are at work, federal agents show up. You are arrested and taken to jail. Agents show up at home, too, arresting your wife. It turns out that they realized you entered the country illegally from Ecuador those years ago. Meanwhile, your wife wonders what will happen to your son that was playing in a neighbor's yard while she was arrested, or to your daugther that was at school.

After months in jail, with little contact with each other, and poor medical care, the government decides to deport you to Mexico. Why Mexico? Well, it's cheaper, and there's no documentation showing where you came from. Apparently you "look" Mexican, and they don't believe your story.

After months in jail with no income, you are once again bankrupt. A government bus takes you to Mexico and drops you down someplace there, with your wife and your oldest child. Your younger child was born in the United States, and so is an American citizen and can't be deported. But the government isn't going to give him a free ride on a prison bus (and Mexico wouldn't take him anyway, since everyone knows he's American). You have no idea where he is. You have no idea how you're going to find food in Mexico, no idea how to find your son, no idea where to find refuge from the ever more prevalent drug dealers. Meanwhile, the Americans think you're scum because you wanted to protect your family, and it's going to be much more difficult to get back in to try to reunite your family.

This story is based on true events.

It's truly easy to demonize illegal immigrants, isn't it? Easy to round them up by the thousands, easy to build a bigger fence, easy to lock them away.

Sometimes it seems like this nation built on freedom, supposedly on Christian values, has lost sight of compassion for the lowly. In this country, we would throw in jail parents that didn't do everything humanly possible to find food for their children. We also throw in jail parents that grew up in other countries that are just doing the same.

How sad that we