Planet Haskell

September 29, 2016

Well-Typed.Com

Sharing, Memory Leaks, and Conduit and friends

TL;DR: Sharing conduit values leads to memory leaks. Make sure to disable the full laziness optimization in the module with your top-level calls to runConduit or ($$) (skip to the end of the conclusion for some details on how to do this). Similar considerations apply to other streaming libraries and indeed any Haskell code that uses lazy data structures to drive computation.

Motivation

We use large lazy data structures in Haskell all the time to drive our programs. For example, consider

import Control.Monad (forM_)

main1 :: IO ()
main1 = forM_ [1..5] $ \_ -> mapM_ print [1 .. 1000000]

It’s quite remarkable that this works and that this program runs in constant memory. But this stands on a delicate cusp. Consider the following minor variation on the above code:

ni_mapM_ :: (a -> IO b) -> [a] -> IO ()
{-# NOINLINE ni_mapM_ #-}
ni_mapM_ = mapM_

main2 :: IO ()
main2 = forM_ [1..5] $ \_ -> ni_mapM_ print [1 .. 1000000]

This program runs, but unlike main1, it has a maximum residency of 27 MB; in other words, this program suffers from a memory leak. As it turns out, main1 was running in constant memory because the optimizer was able to eliminate the list altogether (due to the fold/build rewrite rule), but it is unable to do so in main2.

But why is main2 leaking? In fact, we can recover constant space behaviour by recompiling the code with -fno-full-laziness. The full laziness transformation is effectively turning main2 into

longList :: [Integer]
longList = [1 .. 1000000]

main3 :: IO ()
main3 = forM_ [1..5] $ \_ -> ni_mapM_ print longList

The first iteration of the forM_ loop constructs the list, which is then retained to be used by the next iterations. Hence, the large list is retained for the duration of the program, which is the aforementioned space leak.

The full laziness optimization is taking away our ability to control when data structures are not shared. That ability is crucial when we have actions driven by large lazy data structures. One particularly important example of such lazy structures that drive computation is conduits or pipes. For example, consider the following conduit code:

import Control.Monad.IO.Class (liftIO)
import qualified Data.Conduit as C

countConduit :: Int -> C.Sink Char IO ()
countConduit cnt = do
    mi <- C.await
    case mi of
      Nothing -> liftIO (print cnt)
      Just _  -> countConduit $! cnt + 1

getConduit :: Int -> C.Source IO Char
getConduit 0 = return ()
getConduit n = do
    ch <- liftIO getChar
    C.yield ch
    getConduit (n - 1)

Here countConduit is a sink that counts the characters it receives from upstream, and getConduit n is a conduit that reads n characters from the console and passes them downstream. Suppose we connect these two conduits and run them inside an exception handler that retries when an error occurs:

-- needs {-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception (SomeException, catch)

retry :: IO a -> IO a
retry io = catch io (\(_ :: SomeException) -> retry io)

main :: IO ()
main = retry $ C.runConduit $ getConduit 1000000 C.=$= countConduit 0

we again end up with a large memory leak, this time of values of type Pipe and ->Pipe (conduit’s internal type, and function closures producing it).

Although the values that stream through the conduit come from IO, the conduit itself is fully constructed and retained in memory. In this blog post we examine what exactly is being retained here, and why. We will also suggest a simple workaround: it usually suffices to avoid sharing at the very top-level calls to runConduit or ($$). Note that these problems are not specific to the conduit library, but apply equally to all other similar libraries.

We will not assume any knowledge of conduit but start from first principles; however, if you have never used any of these libraries before, this blog post is probably not the best starting point; you might for example first want to watch my presentation Lazy I/O and Alternatives in Haskell.

Lists

Before we look at the more complicated case, let’s first consider another program using just lists:

main :: IO ()
main = retry $ ni_mapM_ print [1..1000000]

This program suffers from a space leak for similar reasons to the example with lists we saw in the introduction, but it’s worth spelling out the details here: where exactly is the list being retained?

Recall that the IO monad is effectively a state monad over a token RealWorld state (if that doesn’t make any sense to you, you might want to read ezyang’s article Unraveling the mystery of the IO monad first). Hence, ni_mapM_ (just a wrapper around mapM_) is really a function of three arguments: the action to execute for every element of the list, the list itself, and the world token. That means that

ni_mapM_ print [1..1000000]

is a partial application, and hence we are constructing a PAP object. Such a PAP object is a runtime representation of a partial application of a function; it records the function we want to execute (ni_mapM_), as well as the arguments we have already provided. It is this PAP object that we give to retry, and which retry retains until the action completes because it might need it in the exception handler. The long list in turn is being retained because there is a reference from the PAP object to the list (as one of the arguments that we provided).
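
To make this concrete, here is a simplified model of IO (a sketch: IO', World, and mapM_' are made-up names; GHC's real definition uses an unboxed State# RealWorld token):

-- World stands in for GHC's RealWorld token.
data World = World

newtype IO' a = IO' { unIO' :: World -> (World, a) }

-- Under this model, mapM_' really is a function of three arguments: the
-- action, the list, and (hidden inside IO') the world token.
mapM_' :: (a -> IO' ()) -> [a] -> IO' ()
mapM_' _ []     = IO' $ \w -> (w, ())
mapM_' f (x:xs) = IO' $ \w ->
    let (w', _) = unIO' (f x) w
    in  unIO' (mapM_' f xs) w'

-- mapM_' print [1 .. 1000000] is then a function applied to only two of
-- its three arguments: a PAP.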

Full laziness does not make a difference in this example; the list is retained whether or not the [1 .. 1000000] expression gets floated out.

Reminder: Conduits/Pipes

Just to make sure we don’t get lost in the details, let’s define a simple conduit-like or pipe-like data structure:

data Pipe i o m r =
    Yield o (Pipe i o m r)
  | Await (Either r i -> Pipe i o m r)
  | Effect (m (Pipe i o m r))
  | Done r

A pipe or a conduit is a free monad which provides three actions:

  1. Yield a value downstream
  2. Await a value from upstream
  3. Execute an effect in the underlying monad.

The argument to Await is passed an Either; we give it a Left value if upstream terminated, or a Right value if upstream yielded a value.1

This definition is not quite the same as the one used in real streaming libraries and ignores various difficulties (in particular exception safety, as well as other features such as leftovers); however, it will suffice for the sake of this blog post. We will use the terms “conduit” and “pipe” interchangeably in the remainder of this article.

Sources

The various Pipe constructors differ in their memory behaviour and the kinds of memory leaks that they can create. We therefore consider them one by one. We will start with sources, because their memory behaviour is relatively straightforward.

A source is a pipe that only ever yields values downstream.2 For example, here is a source that yields the values [n, n-1 .. 1]:

yieldFrom :: Int -> Pipe i Int m ()
yieldFrom 0 = Done ()
yieldFrom n = Yield n $ yieldFrom (n - 1)

We could “run” such a pipe as follows:

printYields :: Show o => Pipe i o m () -> IO ()
printYields (Yield o k) = print o >> printYields k
printYields (Done ())   = return ()

If we then run the following program:

main :: IO ()
main = retry $ printYields (yieldFrom 1000000)

we get a memory leak. This memory leak is very similar to the memory leak we discussed in section Lists above, with Done () playing the role of the empty list and Yield playing the role of (:).

Sinks

A sink is a conduit that only ever awaits values from upstream; it never yields anything downstream.2 The memory behaviour of sinks is considerably more subtle than the memory behaviour of sources and we will examine it in detail. As a reminder, the constructor for Await is

data Pipe i o m r = Await (Either r i -> Pipe i o m r) | ...

As an example of a sink, consider this pipe that counts the number of characters it receives:

countChars :: Int -> Pipe Char o m Int
countChars cnt =
    Await $ \mi -> case mi of
      Left  _ -> Done cnt
      Right _ -> countChars $! cnt + 1

We could “run” such a sink by feeding it a bunch of characters; say, 10000000 of them:

feed :: Char -> Pipe Char o m Int -> IO ()
feed ch = feedFrom 10000000
  where
    feedFrom :: Int -> Pipe Char o m Int -> IO ()
    feedFrom _ (Done r)  = print r
    feedFrom 0 (Await k) = feedFrom 0     $ k (Left 0)
    feedFrom n (Await k) = feedFrom (n-1) $ k (Right ch)

If we run this as follows and compile with optimizations enabled, we once again end up with a memory leak:

main :: IO ()
main = retry $ feed 'A' (countChars 0)

We can recover constant space behaviour again by disabling full laziness; however, the effect of full laziness on this example is a lot more subtle than the example we described in the introduction.

Full laziness

Let’s take a brief moment to describe what full laziness is, exactly. Full laziness is one of the optimizations that ghc applies by default when optimizations are enabled; it is described in the paper “Let-floating: moving bindings to give faster programs”. The idea is simple; if we have something like

f = \x y -> let e = .. -- expensive computation involving x but not y
            in ..

full laziness floats the let binding out over the lambda to get

f = \x -> let e = .. in \y -> ..

This potentially avoids unnecessarily recomputing e for different values of y. Full laziness is a useful transformation; for example, it turns something like

f x y = ..
  where
    go = .. -- some local function

into

f x y   = ..
f_go .. = ..

which avoids allocating a function closure every time f is called. It is also quite a notorious optimization, because it can create unexpected CAFs (constant applicative forms; top-level definitions of values); for example, if you write

nthPrime :: Int -> Int
nthPrime n = allPrimes !! n
  where
    allPrimes :: [Int]
    allPrimes = ..

you might expect nthPrime to recompute allPrimes every time it is invoked; but full laziness might move that allPrimes definition to the top-level, resulting in a large space leak (the full list of primes would be retained for the lifetime of the program). This goes back to the point we made in the introduction: full laziness is taking away our ability to control when values are not shared.
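
Schematically, the floated version looks like this (a sketch; the naive primes definition is filled in purely for illustration):

-- After let-floating, allPrimes is a top-level CAF: once forced, the
-- computed prefix of the list is retained for the rest of the program.
allPrimes :: [Int]
allPrimes = sieve [2..]
  where
    sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]

nthPrime :: Int -> Int
nthPrime n = allPrimes !! n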

Full laziness versus sinks

Back to the sink example. What exactly is full laziness doing here? Is it constructing a CAF we weren’t expecting? Actually, no; it’s more subtle than that. Our definition of countChars was

countChars :: Int -> Pipe Char o m Int
countChars cnt =
    Await $ \mi -> case mi of
      Left  _ -> Done cnt
      Right _ -> countChars $! cnt + 1

Full laziness is turning this into something more akin to

countChars' :: Int -> Pipe Char o m Int
countChars' cnt =
    let k = countChars' $! cnt + 1
    in Await $ \mi -> case mi of
                        Left  _ -> Done cnt
                        Right _ -> k

Note how the computation of countChars' $! cnt + 1 has been floated over the lambda; ghc can do that, since this expression does not depend on mi. So in memory, the countChars 0 expression from our main function (retained, if you recall, because of the surrounding retry wrapper) develops something like this. It starts off as a simple thunk.

Then when feed matches on it, it gets reduced to weak head normal form, exposing the top-most Await constructor.

The body of the await is a function closure pointing to the function inside countChars (\mi -> case mi ..), which has countChars $! (cnt + 1) as an unevaluated thunk in its environment. Evaluating it one step further exposes the next Await constructor, with the next thunk in its environment.

So where for a source the data structure in memory was a straightforward “list” consisting of Yield nodes, for a sink the situation is more subtle: we build up a chain of Await constructors, each of which points to a function closure which in its environment has a reference to the next Await constructor. This wouldn’t matter of course if the garbage collector could clean up after us; but if the conduit itself is shared, then this results in a memory leak.

Without full laziness, incidentally, evaluating countChars 0 yields just a single Await constructor, and the chain stops there; the only thing in the function closure now is cnt. Since we don’t allocate the next Await constructor before running the function, we never construct a chain of Await constructors and hence we have no memory leak.

Depending on values

It is tempting to think that if the conduit varies its behaviour depending on the values it receives from upstream the same chain of Await constructors cannot be constructed and we avoid a memory leak. For example, consider this variation on countChars which only counts spaces:

countSpaces :: Int -> Pipe Char o m Int
countSpaces cnt =
    Await $ \mi ->
      case mi of
        Left  _   -> Done cnt
        Right ' ' -> countSpaces $! cnt + 1
        Right _   -> countSpaces $! cnt

If we substitute this conduit for countChars in the previous program, do we fare any better? Alas, the memory behaviour of this conduit, when shared, is in fact far, far worse.

The reason is that both the expression countSpaces $! cnt + 1 and the expression countSpaces $! cnt can be floated out by the full laziness optimization. Hence, now every Await constructor will have a function closure in its payload with two thunks, one for each alternative way to execute the conduit. What’s more, both of these thunks are retained as long as we retain a reference to the top-level conduit.
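
By analogy with countChars', full laziness effectively turns countSpaces into something like the following sketch, with one floated thunk per alternative:

countSpaces' :: Int -> Pipe Char o m Int
countSpaces' cnt =
    let kSpace = countSpaces' $! cnt + 1  -- thunk for the space path
        kOther = countSpaces' $! cnt      -- thunk for the non-space path
    in Await $ \mi ->
         case mi of
           Left  _   -> Done cnt
           Right ' ' -> kSpace
           Right _   -> kOther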

We can neatly illustrate this using the following program:

main :: IO ()
main = do
    let count = countSpaces 0
    feed ' ' count
    feed ' ' count
    feed ' ' count
    feed 'A' count
    feed 'A' count
    feed 'A' count

The first feed ' ' explores a path through the conduit where every character is a space; so this constructs (and retains) one long chain of Await constructors. The next two calls to feed ' ' however walk over the exact same path, and hence memory usage does not increase for a while. But then we explore a different path, in which every character is a non-space, and hence memory usage goes up again. Then during the second call to feed 'A' memory usage is stable again, until we start executing the last feed 'A', at which point the garbage collector can finally start cleaning things up.

What’s worse, there is an infinite number of paths through this conduit. Every different combination of space and non-space characters will explore a different path, leading to combinatorial explosion and terrifying memory usage.

Effects

The precise situation for effects depends on the underlying monad, but let’s explore one common case: IO. As we will see, for the case of IO the memory behaviour of Effect is actually similar to the memory behaviour of Await. Recall that the Effect constructor is defined as

data Pipe i o m r = Effect (m (Pipe i o m r)) | ...

Consider this simple pipe that prints the numbers [n, n-1 .. 1]:

printFrom :: Int -> Pipe i o IO ()
printFrom 0 = Done ()
printFrom n = Effect $ print n >> return (printFrom (n - 1))

We might run such a pipe using3:

runPipe :: Show r => Pipe i o IO r -> IO ()
runPipe (Done r)   = print r
runPipe (Effect k) = runPipe =<< k

In order to understand the memory behaviour of Effect, we need to understand how the underlying monad behaves. For the case of IO, IO actions are state transformers over a token RealWorld state. This means that the Effect constructor actually looks rather similar to the Await constructor. Both have a function as payload; Await a function that receives an upstream value, and Effect a function that receives a RealWorld token. To illustrate what printFrom might look like with full laziness, we can rewrite it as

printFrom :: Int -> Pipe i o IO ()
printFrom n =
    let k = printFrom (n - 1)
    in case n of
         0 -> Done ()
         _ -> Effect $ IO $ \st -> unIO (print n >> return k) st

If we visualize the heap (using ghc-vis), we can see that it does indeed look very similar to the picture for Await.

Increasing sharing

If we cannot guarantee that our conduits are not shared, then perhaps we should try to increase sharing instead. If we can avoid allocating these chains of pipes, but instead have pipes refer back to themselves, perhaps we can avoid these memory leaks.

In theory, this is possible. For example, when using the conduit library, we could try to take advantage of monad transformers and rewrite our feed source and our count sink as:

import Conduit
import Control.Monad.State.Strict (StateT, get, put, modify')
import Data.Conduit.Lift (evalStateC)

feed :: Source IO Char
feed = evalStateC 1000000 go
  where
    go :: Source (StateT Int IO) Char
    go = do
      st <- get
      if st == 0
        then return ()
        else do put $! (st - 1) ; yield 'A' ; go

count :: Sink Char IO Int
count = evalStateC 0 go
  where
    go :: Sink Char (StateT Int IO) Int
    go = do
        mi <- await
        case mi of
          Nothing -> get
          Just _  -> modify' (+1) >> go

In both definitions go refers back to itself directly, with no arguments; hence, it ought to be self-referential, without any long chain of sources or sinks ever being constructed. This works; the following program runs in constant space:

main :: IO ()
main = retry $ print =<< (feed $$ count)

However, this kind of code is extremely brittle. For example, consider the following minor variation on count:

count :: Sink Char IO Int
count = evalStateC 0 go
  where
    go :: Sink Char (StateT Int IO) Int
    go = withValue $ \_ -> modify' (+1) >> go

    withValue :: (i -> Sink i (StateT Int IO) Int)
              -> Sink i (StateT Int IO) Int
    withValue k = do
      mch <- await
      case mch of
        Nothing -> get
        Just ch -> k ch

This seems like a straightforward variation, but this code in fact suffers from a memory leak again4. The optimized core version of this variation of count looks something like this:

count :: ConduitM Char Void (StateT Int IO) Int
count = ConduitM $ \k ->
    let countRec = modify' (+ 1) >> count
    in unConduitM await $ \mch ->
         case mch of
           Nothing -> unConduitM get      k
           Just _  -> unConduitM countRec k

In the conduit library, ConduitM is a codensity transformation of an internal Pipe datatype; the latter corresponds more or less to the Pipe data structure we’ve been describing here. But we can ignore these details: the important point here is that this has the same typical shape that we’ve been studying above, with an allocation inside a lambda but before an await.
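
For reference, a codensity-style wrapper over the simplified Pipe type from this post would look something like the following sketch (conduit's real internal Pipe type has additional type parameters):

{-# LANGUAGE RankNTypes #-}

-- A conduit is represented by its action on a continuation rather than
-- as a concrete Pipe value; this avoids left-nested binds.
newtype ConduitM i o m r = ConduitM
  { unConduitM :: forall b. (r -> Pipe i o m b) -> Pipe i o m b }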

We can fix it by writing our code as

count :: Sink Char IO Int
count = evalStateC 0 go
  where
    go :: Sink Char (StateT Int IO) Int
    go = withValue goWithValue

    goWithValue :: Char -> Sink Char (StateT Int IO) Int
    goWithValue _ = modify' (+1) >> go

    withValue :: (i -> Sink i (StateT Int IO) Int)
              -> Sink i (StateT Int IO) Int
    withValue k = do
      mch <- await
      case mch of
        Nothing -> get
        Just ch -> k ch

Ironically, it would seem that full laziness here could have helped us by floating out that modify' (+1) >> go expression for us. The reason that it didn’t is probably related to the exact way the k continuation is threaded through in the compiled code (I simplified a bit above). Whatever the reason, tracking down problems like these is difficult and incredibly time consuming; I’ve spent many, many hours studying the output of -ddump-simpl and comparing before and after pictures. Not a particularly productive way to spend my time, and this kind of low-level thinking is not what I want to do when writing application-level Haskell code!

Composed pipes

Normally we construct pipes by composing components together. Composition of pipes can be defined as

(=$=) :: Monad m => Pipe a b m r -> Pipe b c m r -> Pipe a c m r
{-# NOINLINE (=$=) #-}
_         =$= Done   r   = Done r
u         =$= Effect   d = Effect $ (u =$=) <$> d
u         =$= Yield  o d = Yield o (u =$= d)
Yield o u =$= Await    d = u =$= d (Right o)
Await   u =$= Await    d = Await $ \ma -> u ma =$= Await d
Effect  u =$= Await    d = Effect $ (=$= Await d) <$> u
Done  r   =$= Await    d = Done r =$= d (Left r)

The downstream pipe “is in charge”; the upstream pipe only plays a role when downstream awaits. This mirrors Haskell’s lazy “demand-driven” evaluation model.

Typically we only run self-contained pipes that don’t have any Awaits or Yields left (after composition), so we are only left with Effects. The good news is that if the pipe components don’t consist of long chains, then their composition won’t either; at every Effect point we wait for either upstream or downstream to complete its effect; only once that is done do we receive the next part of the pipeline and hence no chains can be constructed.

On the other hand, of course composition doesn’t get rid of these memory leaks either. As an example, we can define a pipe equivalent to the getConduit from the introduction

getN :: Int -> Pipe i Char IO Int
getN 0 = Done 0
getN n = Effect $ do
           ch <- getChar
           return $ Yield ch (getN (n - 1))

and then compose getN and countChars to get a runnable program:

main :: IO ()
main = retry $ runPipe $ getN 1000000 =$= countChars 0

This program suffers from the same memory leaks as before because the individual pipeline components are kept in memory. As in the sink example, memory behaviour would be much worse still if there were different paths through the conduit network.

Conclusions

At Well-Typed we’ve been developing an application for a client to do streaming data processing. We’ve been using the conduit library to do this, with great success. However, occasionally memory leaks arise that are difficult to fix, and even harder to track down; of course, we’re not the first to suffer from these problems; for example, see ghc ticket 9520.

In this blog post we described how such memory leaks arise. Similar memory leaks can arise with any kind of code that uses large lazy data structures to drive computation, including other streaming libraries such as pipes or streaming, but the problem is not restricted to streaming libraries.

The conduit library tries to avoid these intermediate data structures by means of fusion rules; naturally, when this is successful the problem is avoided. We can increase the likelihood of this happening by using combinators such as folds etc., but in general the intermediate pipe data structures are difficult to avoid.

The core of the problem is that in the presence of the full laziness optimization we have no control over when values are not shared. While it is possible in theory to write code in such a way that the lazy data structures are self-referential and hence keeping them in memory does not cause a memory leak, in practice the resulting code is too brittle and writing code like this is just too difficult. Just to provide one more example, in our application we had some code that looked like this:

go x@(C y _) = case y of
         Constr1 -> doSomethingWith x >> go
         Constr2 -> doSomethingWith x >> go
         Constr3 -> doSomethingWith x >> go
         Constr4 -> doSomethingWith x >> go
         Constr5 -> doSomethingWith x >> go

This worked and ran in constant space. But after adding a single additional clause to this pattern match, we suddenly reintroduced a memory leak:

go x@(C y _) = case y of
         Constr1 -> doSomethingWith x >> go
         Constr2 -> doSomethingWith x >> go
         Constr3 -> doSomethingWith x >> go
         Constr4 -> doSomethingWith x >> go
         Constr5 -> doSomethingWith x >> go
         Constr6 -> doSomethingWith x >> go

This was true even when that additional clause was never used; it had nothing to do with the change in the runtime behaviour of the code. Instead, when we added the additional clause some limit got exceeded in ghc’s bowels and suddenly something got allocated that wasn’t getting allocated before.

Full laziness can be disabled using -fno-full-laziness, but sadly this throws out the baby with the bathwater. In many cases, full laziness is a useful optimization. In particular, there is probably never any point allocating a thunk for something that is entirely static. We saw one such example above; it’s unexpected that when we write

go = withValue $ \_ -> modify' (+1) >> go

we get memory allocations corresponding to the modify' (+1) >> go expression.

Fortunately, there is a simple workaround. Any internal sharing in the conduit is (usually) fine, as long as we don’t retain the conduit from one run to the next. So it’s the argument to the top-level calls to runConduit or ($$) that we need to worry about (or the equivalent “run” functions from other libraries). This leads to the following recommendation:

Conduit code typically looks like

runMyConduit :: Some -> Args -> IO r
runMyConduit some args =
    runConduit $ stage1 some
             =$= stage2 args
             ...
             =$= stageN

You should put all top-level calls to runConduit into a module of their own, and disable full laziness in that module by declaring

{-# OPTIONS_GHC -fno-full-laziness #-}

at the top of the file. This means the computation of the conduit (stage1 =$= stage2 .. =$= stageN) won’t get floated to the top and the conduit will be recomputed on every invocation of runMyConduit (note that this relies on runMyConduit having some arguments; if it doesn’t, you should add a dummy one).

It is not necessary to disable full laziness anywhere else. In particular, the conduit stages themselves (stage1 etc.) can be defined in modules where full laziness is enabled as usual.

There is a recent proposal for adding a pragma to ghc that might make it possible to disable full laziness on specific expressions, but for now the above is a reasonable workaround.

Addendum 1: ghc’s “state hack”

Let’s go back to the section about sinks; if you recall, we considered this example:

countChars :: Int -> Pipe Char o m Int
countChars cnt =
    let k = countChars $! cnt + 1
    in Await $ \mi -> case mi of
                        Left  _ -> Done cnt
                        Right _ -> k

feedFrom :: Int -> Pipe Char o m Int -> IO ()
feedFrom n (Done r)  = print r
feedFrom 0 (Await k) = feedFrom 0 $ k (Left 0)
feedFrom n (Await k) = feedFrom (n - 1) $ k (Right 'A')

main :: IO ()
main = retry $ feedFrom 10000000 (countChars 0)

We explained how countChars 0 results in a chain of Await constructors and function closures. However, you might be wondering, why would this be retained at all? After all, feedFrom is just an ordinary function, albeit one that computes an IO action. Why shouldn’t the whole expression

feedFrom 10000000 (countChars 0)

just be reduced to a single print 10000000 action, leaving no trace of the pipe at all? Indeed, this is precisely what happens when we disable ghc’s “state hack”; if we compile this program with -fno-state-hack it runs in constant space.

So what is the state hack? You can think of it as the opposite of the full laziness transformation; where full laziness transforms

     \x -> \y -> let e = <expensive> in ..    
~~>  \x -> let e = <expensive> in \y -> ..

the state hack does the opposite

     \x -> let e = <expensive> in \y -> ..
~~>  \x -> \y -> let e = <expensive> in ..    

though only for arguments y of type State# <token>. In general this is not sound, of course, as it might duplicate work; hence, the name “state hack”. Joachim Breitner’s StackOverflow answer explains why this optimization is necessary; my own blog post Understanding the RealWorld provides more background.

Let’s leave aside the question of why this optimization exists, and consider the effect on the code above. If you ask ghc to dump the optimized code (-ddump-stg), and translate the result back to readable Haskell, you will realize that it boils down to a single line change. With the state hack disabled the last line of feedFrom is effectively:

feedFrom n (Await k) = IO $
    unIO (feedFrom (n - 1) (k (Right 'A')))

where IO and unIO just wrap and unwrap the IO monad. But when the state hack is enabled (the default), this turns into

feedFrom n (Await k) = IO $ \w ->
    unIO (feedFrom (n - 1) (k (Right 'A'))) w

Note how this floats the recursive call to feedFrom into the lambda. This means that

feedFrom 10000000 (countChars 0)

no longer reduces to a single print statement (after an expensive computation); instead, it reduces immediately to a function closure, waiting for its world argument. It’s this function closure that retains the Await/function chain and hence causes the memory leak.

Addendum 2: Interaction with cost-centres (SCC)

A final cautionary tale. Suppose we are studying a memory leak, and so we are compiling our code with profiling enabled. At some point we add some cost centres, or use -fprof-auto perhaps, and suddenly find that the memory leak disappeared! What gives?

Consider one last time the sink example. We can make the memory leak disappear by adding a single cost centre:

feed :: Char -> Pipe Char o m Int -> IO ()
feed ch = feedFrom 10000000
  where
    feedFrom :: Int -> Pipe Char o m Int -> IO ()
    feedFrom n p = {-# SCC "feedFrom" #-}
      case (n, p) of
        (_, Done r)  -> print r
        (0, Await k) -> feedFrom 0     $ k (Left 0)
        (_, Await k) -> feedFrom (n-1) $ k (Right ch)

Adding this cost centre effectively has the same result as specifying -fno-state-hack; with the cost centre present, the state hack can no longer float the computations into the lambda.

Footnotes

  1. The ability to detect upstream termination is one of the characteristics that sets conduit apart from the pipes package, in which this is impossible (or at least hard to do). Personally, I consider this an essential feature.

  2. Sinks and sources can also execute effects, of course; since we are interested in the memory behaviour of the individual constructors, we treat effects separately.

  3. runPipe is (close to) the actual runPipe we would normally use; we connect pipes that await or yield into a single self-contained pipe that does neither.

  4. For these simple examples, the optimizer can actually work its magic and the memory leak doesn’t appear, unless evalStateC is declared NOINLINE. Again, for larger examples problems arise whether it’s inlined or not.

by edsko at September 29, 2016 06:20 AM

September 27, 2016

FP Complete

Updated Hackage mirroring

As we've discussed on this blog before, FP Complete has been running a Hackage mirror for quite a few years now. In addition to a straight S3-based mirror of raw Hackage content, we've also been running some Git repos providing the same content in an arguably more accessible format (all-cabal-files, all-cabal-hashes, and all-cabal-metadata).

In the past, we did all of this mirroring using Travis, but had to stop doing so a few months back. Also, a recent revelation showed that the downloads we were making were not as secure as I'd previously believed (due to lack of SSL between the Hackage server and its CDN). Finally, there's been off-and-on discussion for a while about unifying on one Hackage mirroring tool. After some discussion among Duncan, Herbert, and myself, all of these goals ended up culminating in this mailing list post.

This blog post details the end result of these efforts: what code is running, where it's running, how secret credentials are handled, and how we monitor the whole thing.

Code

One of the goals here was to use the new hackage-security mechanism in Hackage to validate the package tarballs and cabal file index downloaded from Hackage. This made it natural to rely on Herbert's hackage-mirror-tool code, which supports downloads, verification, and uploading to S3. There were a few minor hiccups getting things set up, but overall it was surprisingly easy to integrate, especially given that Herbert's code had previously never been used against Amazon S3 (it had been used against the Dreamhost mirror).

I made a few downstream modifications to the codebase to make it compatible with officially released versions of Cabal, Stackify it, and in the process generate Docker images. I also included a simple shell script for running the tool in a loop (based on Herbert's README instructions). The result is the snoyberg/hackage-mirror-tool Docker image.

After running this image (we'll get to how it's run later), we have a fully populated S3 mirror of Hackage guaranteeing a consistent view of Hackage (i.e., all package tarballs are available, without CDN caching issues in place). The next step is to use this mirror to populate the Git repositories. We already have all-cabal-hashes-tool and all-cabal-metadata-tool for updating the appropriate repos, and all-cabal-files is just a matter of running a tar xf on the tarball containing .cabal files. Putting all of this together, I set up the all-cabal-tool repo, containing:

  • run-inner.sh will:
    • Grab the 01-index.tar.gz file from the S3 mirror
    • Update the all-cabal-files repo
    • Use git archive in that repo to generate and update the 00-index.tar.gz file*
    • Update the all-cabal-hashes and all-cabal-metadata repos using the appropriate tools
  • run.sh uses the hackage-watcher to run run-inner.sh each time a new version of 01-index.tar.gz is available. It's able to do a simple ETag check, saving on bandwidth, disk IO, and CPU usage.
  • Dockerfile pulls in all of the relevant tools and provides a commercialhaskell/all-cabal-tool Docker image
  • You may notice some other code in that repo. I did intend to rewrite the Bash scripts and other Haskell code into a single Haskell executable for simplicity, but haven't gotten around to it yet. If anyone's interested in taking up the mantle on that, let me know.

* About this 00/01 business: 00-index.tar.gz is the original package format, without hackage-security, and is used by previous cabal-install releases, as well as Stack and possibly some other tools too. hackage-mirror-tool does not mirror this file since it has no security information, so generating it from the known-secure 01-index.tar.gz file (via the all-cabal-files repo) seemed the best option.

In setting up these images, I decided to split them into two pieces instead of combining them so that the straight Hackage mirroring bits would remain unaffected by the rest of the code, since the Hackage mirror (as we'll see later) will be available for users outside of the all-cabal* set of repos.

At the end of this, you can see that we're no longer using the original hackage-mirror code that powered the FP Complete S3 mirror for years. Unification achieved!

Kubernetes

As I mentioned, we previously ran all of this mirroring code on Travis, but had to move off of it. Anyone who's worked with me knows that I hate being a system administrator, so it was a painful few months where I had to run this code myself on an EC2 machine I set up personally. Fortunately, FP Complete runs a Kubernetes cluster these days, and that means I don't need to be a system administrator :). As mentioned, I packaged up all of the code above in two Docker images, so running them on Kubernetes is very straightforward.

For the curious, I've put the Kubernetes deployment configurations in a Gist.

Credentials

We have a few different credentials that need to be shared with these Docker containers:

  • AWS credentials for uploading
  • GPG key for signing tags
  • SSH key for pushing to Github

One of the other nice things about Kubernetes (besides allowing me to not be a sysadmin) is that it has built-in secrets support. I obviously won't be sharing those files with you, but if you look at the deployment configs I shared before, you can see how they are being referenced.

Monitoring

One annoyance I've had in the past is, if there's a bug in the scripts or some system problem, mirroring will stop for many hours before I become aware of it. I was determined to not let that be a problem again. So I put together the Hackage Mirror status page. It compares the last upload date from Hackage itself against the last modified time on various S3 artifacts, as well as the last commit for the Git repos. If any of the mirrors fall more than an hour behind Hackage itself, it returns a 500 status code. That's not technically the right code to use, but it does mean that normal HTTP monitoring/alerting tools can be used to watch that page and tell me if anything has gone wrong.

If you're curious to see the code powering this, it's available on Github.

Official Hackage mirror

With the addition of the new hackage-security metadata files to our S3 mirror, one nice benefit is that the FP Complete mirror is now an official Hackage mirror, and can be used natively by cabal-install without having to modify any configuration files. Hopefully this will be useful to end users.

And strangely enough, just as I finished this blog post, I got my first "mirrors out of sync" 500 error message ever, proving that the monitoring itself works (even if the mirroring had a bug).

What's next?

Hopefully nothing! I've spent quite a bit more time on this in the past few weeks than I'd hoped, but I'm happy with the end result. I feel confident that the mirroring processes will run reliably, I understand and trust the security model from end to end, and there's less code and machines to maintain overall.

Thank you!

Many thanks to Duncan and Herbert for granting me access to the private Hackage server to work around CDN caching issues, and to Herbert for the help and quick fixes with hackage-mirror-tool.

September 27, 2016 12:00 PM

September 26, 2016

Functional Jobs

Senior Backend Engineer at Euclid Analytics (Full-time)

We are looking to add a senior individual contributor to the backend engineering team! Our team is responsible for creating and maintaining the infrastructure that powers the Euclid Analytics Engine. We leverage a forward thinking and progressive stack built in Scala and Python, with an infrastructure that uses Mesos, Spark and Kafka. As a senior engineer you will build out our next generation ETL pipeline. You will need to use and build tools to interact with our massive data set in as close to real time as possible. If you have previous experience with functional programming and distributed data processing tools such as Spark and Hadoop, then you would make a great fit for this role!

RESPONSIBILITIES:

  • Partnering with the data science team to architect and build Euclid’s big data pipeline
  • Building tools and services to maintain a robust, scalable data service layer
  • Leveraging technologies such as Spark and Kafka to grow our predictive analytics and machine learning capabilities in real time
  • Finding innovative solutions to performance issues and bottlenecks

REQUIREMENTS:

  • At least 3 years industry experience in a full time role utilizing Scala or other modern functional programming languages (Haskell, Clojure, Lisp, etc.)
  • Database management experience (MySQL, Redis, Cassandra, Redshift, MemSQL)
  • Experience with big data infrastructure including Spark, Mesos, Scalding and Hadoop
  • Excited about data flow and orchestration with tools like Kafka and Spark Streaming
  • Have experience building production deployments using Amazon Web Services or Heroku’s Cloud Application Platform
  • B.S. or equivalent in Computer Science or another technical field

Get information on how to apply for this position.

September 26, 2016 09:53 PM

September 23, 2016

Derek Elkins

Quotient Types for Programmers

Introduction

Programmers in typed languages with higher order functions and algebraic data types are already comfortable with most of the basic constructions of set/type theory. In categorical terms, those programmers are familiar with finite products and coproducts and (monoidal/cartesian) closed structure. The main omissions are subset types (equalizers/pullbacks) and quotient types (coequalizers/pushouts) which would round out limits and colimits. Not having a good grasp on either of these constructions dramatically shrinks the world of mathematics that is understandable, but while subset types are fairly straightforward, quotient types are quite a bit less intuitive.

Subset Types

In my opinion, most programmers can more or less immediately understand the notion of a subset type at an intuitive level.
A subset type is just a type combined with a predicate on that type that specifies which values of the type we want. For example, we may have something like { n:Nat | n /= 0 } meaning the type of naturals not equal to 0. We may use this in the type of the division function for the denominator. Consuming a value of a subset type is easy: a natural not equal to 0 is still just a natural, and we can treat it as such. The difficult part is producing a value of a subset type. To do this, we must, of course, produce a value of the underlying type — Nat in our example — but then we must further convince the type checker that the predicate holds (e.g. that the value does not equal 0). Most languages provide no mechanism to prove potentially arbitrary facts about code, and this is why they do not support subset types. Dependently typed languages do provide such mechanisms and thus either have or can encode subset types. Outside of dependently typed languages the typical solution is to use an abstract data type and use a runtime check when values of that abstract data type are created.
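
In Haskell, that abstract-data-type encoding might look like the following sketch (the module and names are made up):

module NonZero (NonZero, mkNonZero, getNonZero) where

-- The constructor is not exported, so the invariant can only be
-- established via the runtime check in mkNonZero.
newtype NonZero = NonZero Int

mkNonZero :: Int -> Maybe NonZero
mkNonZero 0 = Nothing
mkNonZero n = Just (NonZero n)

getNonZero :: NonZero -> Int
getNonZero (NonZero n) = n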

Quotient Types

The dual of subset types are quotient types. My impression is that this construction is the most difficult basic construction for people to understand. Further, programmers aren’t much better off, because they have little to which to connect the idea. Before I give a definition, I want to provide the example with which most people are familiar: modular (or clock) arithmetic. A typical way this is first presented is as a system where the numbers “wrap around”. For example, in arithmetic mod 3, we count 0, 1, 2, and then wrap back around to 0. Programmers are well aware that it’s not necessary to guarantee that an input to addition, subtraction, or multiplication mod 3 is either 0, 1, or 2. Instead, the operation can be done and the mod function can be applied at the end. This will give the same result as applying the mod function to each argument at the beginning. For example, 4 + 7 = 11 and 11 mod 3 = 2, and 4 mod 3 = 1 and 7 mod 3 = 1 and 1 + 1 = 2 = 11 mod 3.

For mathematicians, the type of integers mod n is represented by the quotient type ℤ/nℤ. The idea is that the values of ℤ/nℤ are integers except that we agree that any two integers a and b are treated as equal if a - b = kn for some integer k. For ℤ/3ℤ, … = -6 = -3 = 0 = 3 = 6 = … and … = -5 = -2 = 1 = 4 = 7 = … and … = -4 = -1 = 2 = 5 = 8 = ….

Equivalence Relations

To start to formalize this, we need the notion of an equivalence relation. An equivalence relation is a binary relation (~) which is reflexive (x ~ x for all x), symmetric (if x ~ y then y ~ x), and transitive (if x ~ y and y ~ z then x ~ z). We can check that “a ~ b iff there exists an integer k such that a - b = kn” defines an equivalence relation on the integers for any given n. For reflexivity we have a - a = 0n. For symmetry we have if a - b = kn then b - a = -kn. Finally, for transitivity we have if a - b = k₁n and b - c = k₂n then a - c = (k₁ + k₂)n, which we get by adding the preceding two equations.

Any relation can be extended to an equivalence relation. This is called the reflexive-, symmetric-, transitive-closure of the relation. For an arbitrary binary relation R we can define the equivalence relation (~_R) via “a ~_R b iff a = b, or R(a, b), or b ~_R a, or a ~_R c and c ~_R b for some c”. To be precise, ~_R is the smallest relation satisfying those constraints. In Datalog syntax, this looks like:

eq_r(A, A).
eq_r(A, B) :- r(A, B).
eq_r(A, B) :- eq_r(B, A).
eq_r(A, B) :- eq_r(A, C), eq_r(C, B).
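
As a quick sanity check, we can transcribe the mod-n relation into Haskell and test the three laws with QuickCheck (a sketch; equivMod is a made-up name):

import Test.QuickCheck

-- a ~ b iff a - b = k*n for some integer k, i.e. n divides a - b
equivMod :: Integer -> Integer -> Integer -> Bool
equivMod n a b = (a - b) `mod` n == 0

main :: IO ()
main = do
    quickCheck $ \a     -> equivMod 3 a a                     -- reflexivity
    quickCheck $ \a b   -> equivMod 3 a b == equivMod 3 b a   -- symmetry
    quickCheck $ \a b c -> not (equivMod 3 a b && equivMod 3 b c)
                           || equivMod 3 a c                  -- transitivity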

Quotient Types: the Type Theory view

If T is a type, and (~) is an equivalence relation, we use T/~ as the notation for the quotient type, which we read as “T quotiented by the equivalence relation (~)”. We call T the underlying type of the quotient type. We then say a = b at type T/~ iff a ~ b. Dual to subset types, to produce a value of a quotient type is easy. Any value of the underlying type is a value of the quotient type. (In type theory, this produces the perhaps surprising result that ℤ is a subtype of ℤ/nℤ.) As expected, consuming a value of a quotient type is more complicated. To explain this, we need to explain what a function f : T/~ -> X is for some type X. A function f : T/~ -> X is a function g : T -> X which satisfies g(a) = g(b) for all a and b such that a ~ b. We call f (or g, they are often conflated) well-defined if g satisfies this condition. In other words, any well-defined function that consumes a quotient type isn’t allowed to produce an output that distinguishes between equivalent inputs. A better way to understand this is that quotient types allow us to change what the notion of equality is for a type. From this perspective, a function being well-defined just means that it is a function. Taking equal inputs to equal outputs is one of the defining characteristics of a function.

Sometimes we can finesse needing to check the side condition. Any function h : T -> B gives rise to an equivalence relation on T via a ~ b iff h(a) = h(b). In this case, any function g : B -> X gives rise to a function f : T/~ -> X via f = g ∘ h. In particular, when B = T we are guaranteed to have a suitable g for any function f : T/~ -> X. In this case, we can implement quotient types in a manner quite similar to subset types, namely we make an abstract type and we normalize with the h function as we either produce or consume values of the abstract type. A common example of this is rational numbers. We can reduce a rational number to lowest terms either when it’s produced or when the numerator or denominator get accessed, so that we don’t accidentally write functions which distinguish between 1/2 and 2/4. For modular arithmetic, the mod-by-n function is a suitable h.
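
Here is a sketch of that normalize-on-construction approach for rationals in Haskell (the names are made up; Data.Ratio's % operator does essentially this):

-- Invariant: the denominator is positive and the fraction is in lowest
-- terms. Normalizing in the sole exported constructor makes the derived
-- structural equality coincide with the quotient's equality, e.g.
-- mkRat 1 2 == mkRat 2 4.
data Rat = Rat Integer Integer deriving (Eq, Show)

mkRat :: Integer -> Integer -> Rat
mkRat _ 0 = error "mkRat: zero denominator"
mkRat n d = Rat (signum d * (n `div` g)) (abs d `div` g)
  where g = gcd n d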

Quotient Types: the Set Theory view

In set theory such an h function can always be made by mapping the elements of T to the equivalence classes that contain them, i.e. a gets mapped to {b | a ~ b}, which is called the equivalence class of a. In fact, in set theory, T/~ is usually defined to be the set of equivalence classes of (~). So, for the example of ℤ/3ℤ, in set theory, it is a set of exactly three elements: the elements are {3n + k | n ∈ ℤ} for k = 0, 1, 2. Equivalence classes are also called partitions and are said to partition the underlying set. Elements of these equivalence classes are called representatives of the equivalence class. Often a notation like [a] is used for the equivalence class of a.

More Examples

Here is a quick run-through of some significant applications of quotient types. I’ll give the underlying type and the equivalence relation and what the quotient type produces. I’ll leave it as an exercise to verify that the equivalence relations really are equivalence relations, i.e. reflexive, symmetric, and transitive. I’ll start with more basic examples. You should work through them to be sure you understand how they work.

Integers

Integers can be presented as pairs of naturals (n, m) with the idea being that the pair represents “n - m”. Of course, 1 - 2 should be the same as 2 - 3. This is expressed as (n₁, m₁) ~ (n₂, m₂) iff n₁ + m₂ = n₂ + m₁. Note how this definition only relies on operations on natural numbers. You can explore how to define addition, subtraction, multiplication, and other operations on this representation in a well-defined manner.
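
A sketch of this representation in Haskell, with the equivalence relation and a few well-defined operations written directly against the pairs:

import Numeric.Natural (Natural)

-- (n, m) represents the integer n - m
type Z = (Natural, Natural)

eqZ :: Z -> Z -> Bool
eqZ (n1, m1) (n2, m2) = n1 + m2 == n2 + m1   -- the equivalence relation

addZ :: Z -> Z -> Z
addZ (n1, m1) (n2, m2) = (n1 + n2, m1 + m2)

negZ :: Z -> Z
negZ (n, m) = (m, n)                         -- negation swaps the pair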

Rationals

Rationals can be presented very similarly to integers, only with multiplication instead of addition. We also have pairs (n, d), usually written n/d, in this case of an integer n and a non-zero natural d. The equivalence relation is (n₁, d₁) ~ (n₂, d₂) iff n₁d₂ = n₂d₁.

(Topological) Circles

We can extend the integers mod n to the continuous case. Consider the real numbers with the equivalence relation r ~ s iff r - s = k for some integer k. You could call this the reals mod 1. Topologically, this is a circle. If you walk along it far enough, you end up back at a point equivalent to where you started. Occasionally this is written as ℝ/ℤ.

Torii

Doing the previous example in 2D gives a torus. Specifically, we have pairs of real numbers and the equivalence relation (x₁, y₁) ~ (x₂, y₂) iff x₁ - x₂ = k and y₁ - y₂ = l for some integers k and l. Quite a bit of topology relies on similar constructions, as will be expanded upon in the section on gluing.

Unordered pairs

Here’s an example that’s a bit closer to programming. Consider the following equivalence relation on arbitrary pairs: (a₁, b₁) ~ (a₂, b₂) iff a₁ = a₂ and b₁ = b₂, or a₁ = b₂ and b₁ = a₂. This just says that a pair is equivalent to either itself, or a swapped version of itself. It’s interesting to consider what a well-defined function is on this type.1
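
In Haskell terms, a well-defined function out of this quotient is precisely one that is symmetric in the two components; a sketch:

-- Well-defined on unordered pairs: the result is invariant under swapping.
sumPair :: (Int, Int) -> Int
sumPair (a, b) = a + b

-- NOT well-defined on the quotient: it distinguishes (1, 2) from (2, 1),
-- which the equivalence relation identifies.
firstOf :: (Int, Int) -> Int
firstOf (a, _) = a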

Gluing / Pushouts

Returning to topology and doing a bit more involved construction, we arrive at gluing or pushouts. In topology, we often want to take two topological spaces and glue them together in some specified way. For example, we may want to take two discs and glue their boundaries together. This gives a sphere. We can combine two spaces into one with the disjoint sum (or coproduct, i.e. Haskell’s Either type). This produces a space that contains both the input spaces, but they don’t interact in any way. You can visualize them as sitting next to each other but not touching. We now want to say that certain pairs of points, one from each of the spaces, are really the same point. That is, we want to quotient by an equivalence relation that would identify those points. We need some mechanism to specify which points we want to identify. One way to accomplish this is to have a pair of functions, f : C -> A and g : C -> B, where A and B are the spaces we want to glue together. We can then define a relation R on the disjoint sum via R(a, b) iff there’s a c : C such that a = inl(f(c)) and b = inr(g(c)). This is not an equivalence relation, but we can extend it to one. The quotient we get is then the gluing of A and B specified by C (or really by f and g). For our example of two discs, f and g are the same function, namely the inclusion of the boundary of the disc into the disc. We can also glue a space to itself. Just drop the disjoint sum part. Indeed, the circle and torus are examples.

Polynomial ring ideals

We write ℝ[X] for the type of polynomials in one indeterminate X with real coefficients. For two indeterminates, we write ℝ[X, Y]. Values of these types are just polynomials such as X^2 + 1 or X^2 + Y^2. We can consider quotienting these types by equivalence relations generated from identifications like X^2 + 1 ~ 0 or X^2 - Y ~ 0, but we want more than just the reflexive-, symmetric-, transitive-closure. We want this equivalence relation to also respect the operations we have on polynomials, in particular addition and multiplication. More precisely, we want: if a ~ b and c ~ d then ac ~ bd, and similarly for addition. An equivalence relation that respects all operations is called a congruence. The standard notation for the quotient of ℝ[X, Y] by the congruence generated by both of the previous identifications is ℝ[X, Y]/(X^2 + 1, X^2 - Y). Now if X^2 + 1 = 0 in ℝ[X, Y]/(X^2 + 1, X^2 - Y), then for any polynomial P(X, Y) we have P(X, Y)(X^2 + 1) = 0, because 0 times anything is 0. Similarly, for any polynomial Q(X, Y), Q(X, Y)(X^2 - Y) = 0. Of course, 0 + 0 = 0, so it must be the case that P(X, Y)(X^2 + 1) + Q(X, Y)(X^2 - Y) = 0 for all polynomials P and Q. In fact, we can show that all elements in the equivalence class of 0 are of this form. You’ve now motivated the concrete definition of a ring ideal and given its significance. An ideal is an equivalence class of 0 with respect to some congruence. Let’s work out what ℝ[X, Y]/(X^2 + 1, X^2 - Y) looks like concretely. First, since X^2 - Y = 0, we have Y = X^2 and so we see that values of ℝ[X, Y]/(X^2 + 1, X^2 - Y) will be polynomials in only one indeterminate, because we can replace all Ys with X^2s. Since X^2 = -1, we can see that all those polynomials will be linear (i.e. of degree at most 1) because we can just keep replacing X^2s with -1s, i.e. X^(n+2) = X^n X^2 = -X^n. The end result is that an arbitrary polynomial in ℝ[X, Y]/(X^2 + 1, X^2 - Y) looks like a + bX for real numbers a and b, and we have X^2 = -1. In other words, ℝ[X, Y]/(X^2 + 1, X^2 - Y) is isomorphic to the complex numbers, ℂ.
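
The normal form a + bX with X^2 = -1 transcribes directly into Haskell; multiplication works out to the familiar complex multiplication (a sketch):

-- Values of ℝ[X, Y]/(X^2 + 1, X^2 - Y) in normal form: a + bX, X^2 = -1.
data C = C Double Double deriving (Eq, Show)

addC :: C -> C -> C
addC (C a b) (C c d) = C (a + c) (b + d)

mulC :: C -> C -> C
mulC (C a b) (C c d) = C (a*c - b*d) (a*d + b*c)   -- uses X^2 = -1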

As a reasonably simple exercise, given a polynomial P(X) : ℝ[X], what does it get mapped to when embedded into ℝ[X]/(X - 3), i.e. what is [P(X)] : ℝ[X]/(X - 3)?2

Free algebras modulo an equational theory

Moving much closer to programming, we have a rather broad and important example that a mathematician might describe as free algebras modulo an equational theory. This example covers several of the preceding examples. In programmer-speak, a free algebra is just a type of abstract syntax trees for some language. We’ll call a specific abstract syntax tree a term. An equational theory is just a collection of pairs of terms, with the idea being that we’d like these terms to be considered equal. To be a bit more precise, we will actually allow terms to contain (meta)variables. An example equation for an expression language might be Add(x, x) = Mul(2, x). We call a term with no variables a ground term. We say a ground term matches another term if there is a consistent substitution for the variables that makes the latter term syntactically equal to the ground term. E.g. Add(3, 3) matches Add(x, x) via the substitution x ↦ 3. Now, the equations of our equational theory give rise to a relation on ground terms: R(t₁, t₂) iff there exists an equation l = r such that t₁ matches l and t₂ matches r. This relation can be extended to an equivalence relation on ground terms, and we can then quotient by that equivalence relation.

Let’s consider a worked example. We can consider the theory of monoids. We have two operations (types of AST nodes): Mul(x, y) and 1. We have the following three equations: Mul(1, x) = x, Mul(x, 1) = x, and Mul(Mul(x, y), z) = Mul(x, Mul(y, z)). We additionally have a bunch of constants subject to no equations. In this case, it turns out we can define a normalization function, what I called h far above, and that the quotient type is isomorphic to lists of constants. Now, we can extend this theory to the theory of groups by adding a new operation, Inv(x), and new equations: Inv(Inv(x)) = x, Inv(Mul(x, y)) = Mul(Inv(y), Inv(x)), and Mul(Inv(x), x) = 1. If we ignore the last of these equations, you can show that we can normalize to a form that is isomorphic to a list of a disjoint sum of the constants, i.e. [Either Const Const] in Haskell if Const were the type of the constant terms. Quotienting this type by the equivalence relation extended with that final equality corresponds to adding the rule that a Left c cancels out Right c in the list whenever they are adjacent.
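
A sketch of the monoid case in Haskell: terms are abstract syntax trees, and normalization (the h from earlier) maps each term to the list of its constants:

-- Terms of the theory of monoids over constants of type c.
data MTerm c = One | MMul (MTerm c) (MTerm c) | Const c

-- Two ground terms are related by the monoid equations exactly when they
-- normalize to the same list; lists are the quotient's normal forms.
normalize :: MTerm c -> [c]
normalize One        = []
normalize (MMul x y) = normalize x ++ normalize y
normalize (Const c)  = [c]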

This overall example is a fairly profound one. Almost all of abstract algebra can be viewed as an instance of this or a closely related variation. When you hear about things defined in terms of “generators and relators”, it is an example of this sort. Indeed, those “relators” are used to define a relation that will be extended to an equivalence relation. Being defined in this way is arguably what it means for something to be “algebraic”.

Postscript

The Introduction to Type Theory section of the NuPRL book provides a more comprehensive and somewhat more formal presentation of these and related concepts. While the quotient type view of quotients is conceptually different from the standard set theoretic presentation, it is much more amenable to computation, as the ℤ/nℤ example begins to illustrate.


  1. It’s a commutative function.

  2. It gets mapped to its value at 3, i.e. P(3).

September 23, 2016 05:21 AM

Michael Snoyman

Proposed conduit reskin

In a few different conversations I've had with people, the idea of reskinning some of the surface syntax of the conduit library has come up, and I wanted to share the idea here. I call this "reskinning" since all of the core functionality of conduit would remain unchanged in this proposal; we'd just be changing operators and functions a bit.

The idea here is: conduit borrowed the operator syntax of $$, =$ and $= from enumerator, and it made sense at the beginning of its lifecycle. However, for quite a while now conduit has evolved to the point of having a unified type for Sources, Conduits, and Sinks, and the disparity of operators adds more confusion than it may be worth. So without further ado, let's compare a few examples of conduit usage between the current skin:

import Conduit
import qualified Data.Conduit.Binary as CB

main :: IO ()
main = do
    -- copy files
    runResourceT $ CB.sourceFile "source.txt" $$ sinkFile "dest.txt"

    -- sum some numbers
    print $ runIdentity $ enumFromToC 1 100 $$ sumC

    -- print a bunch of numbers
    enumFromToC 1 100 $$ mapC (* 2) =$ takeWhileC (< 100) =$ mapM_C print

With a proposed reskin:

import Conduit2
import qualified Data.Conduit.Binary as CB

main :: IO ()
main = do
    -- copy files
    runConduitRes $ CB.sourceFile "source.txt" .| sinkFile "dest.txt"

    -- sum some numbers
    print $ runConduitPure $ enumFromToC 1 100 .| sumC

    -- print a bunch of numbers
    runConduit $ enumFromToC 1 100 .| mapC (* 2) .| takeWhileC (< 100) .| mapM_C print

This reskin is easily defined with this module:

{-# LANGUAGE FlexibleContexts #-}
module Conduit2
    ( module Conduit
    , module Conduit2
    ) where

import Conduit hiding (($$), (=$), ($=), (=$=))
import Data.Void (Void)

infixr 2 .|
(.|) :: Monad m
     => ConduitM a b m ()
     -> ConduitM b c m r
     -> ConduitM a c m r
(.|) = fuse

runConduitPure :: ConduitM () Void Identity r -> r
runConduitPure = runIdentity . runConduit

runConduitRes :: MonadBaseControl IO m
              => ConduitM () Void (ResourceT m) r
              -> m r
runConduitRes = runResourceT . runConduit

To put this in words:

  • Replace the $=, =$, and =$= operators - which are all synonyms of each other - with the .| operator. This borrows intuition from the Unix shell, where the pipe operator denotes piping data from one process to another. The analogy holds really well for conduit, so why not borrow it? (We call all of these operators "fusion.")
  • Get rid of the $$ operator - also known as the "connect" or "fuse-and-run" operator - entirely. Instead of having this two-in-one action, separate it into .| and runConduit. The advantage is that no one needs to think about whether to use .| or $$, as happens today. (Note that runConduit is available in the conduit library today, it's just not very well promoted.)
  • Now that runConduit is a first-class citizen, add in some helper functions for two common use cases: running with ResourceT and running a pure conduit.

The goals here are to improve consistency, readability, and intuition about the library. Of course, there are some downsides:

  • There's a slight performance advantage (not benchmarked recently unfortunately) to foo $$ bar versus runConduit $ foo =$= bar, since the former combines both sets of actions into one. We may be able to gain some of this back with GHC rewrite rules, but my experience with rewrite rules in conduit has been less than reliable.
  • Inertia: there's a lot of code and material out there using the current set of operators. While we don't need to ever remove (or even deprecate) the current operators, having two ways of writing conduit code in the wild can be confusing.
  • Conflicting operator: doing a quick Hoogle search reveals that the parallel package already uses .|. We could choose a different operator instead (|. for instance seems unclaimed), but generally I get nervous any time I'm defining new operators.
  • For simple cases like source $$ sink, code is now quite a few keystrokes longer: runConduit $ source .| sink.

Code wise, this is a trivial change to implement. Updating docs to follow this new convention wouldn't be too difficult either. The question is: is this a good idea?

September 23, 2016 12:00 AM

September 22, 2016

Philip Wadler

Lambdaman, supporting Bootstrap


After watching talks or videos of Propositions as Types, folk ask me how they can get their own Lambdaman t-shirt. In the past, I tried to make it available through various services, but they always rejected the design as a copyright violation. (It's not, it's fair use.) Thanks to a little help from my friends, CustomInk has agreed to print the design as a Booster. Sign up now; orders will be printed on October 15. Any profits (there will be more if there is a bigger order) go to Bootstrap, an organisation run by Shriram Krishnamurthi, Matthias Felleisen, and the PLT group that teaches functional programming to middle and high school students. Orders have already surpassed our goal of fifty shirts!

by Philip Wadler (noreply@blogger.com) at September 22, 2016 12:16 AM

September 21, 2016

FP Complete

Practical Haskell: Simple File Mirror (Part 2)

This is part 2 of a three part series. If you haven't seen it already, I'd recommend starting with the first part, which covers communication protocols and streaming of data. This second part will cover network communication and some basic concurrency in Haskell.

Simple HTTP client

We saw previously how to send and receive binary data using the conduit library. We're going to build on this with a conduit-aware network library. This first example will make a very simplistic, hard-coded HTTP request and send the entire response from the server to standard output.

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import Data.Conduit.Network (runTCPClient, appSource, appSink, clientSettings)

main :: IO ()
main = runTCPClient settings $ \appData -> do
    yield request $$ appSink appData
    appSource appData $$ stdoutC
  where
    settings = clientSettings 80 "httpbin.org"

request :: ByteString
request = encodeUtf8 $ unlines
    [ "GET /get?foo=bar&baz=bin HTTP/1.1"
    , "Host: httpbin.org"
    , "User-Agent: Practical Haskell"
    , "Connection: close"
    , ""
    ]

The runTCPClient function creates the actual TCP connection, and provides access to it via the appData value. This value allows us to send data to the server (via appSink) and get data from the server (via appSource). We can also get information about the connection such as the locally used port number, which we're not using in this example.

We've hard-coded a settings value that states we should connect to host httpbin.org* on port 80. We've also hard-coded an HTTP request, which is thoroughly uninteresting.

Once our connection has been established, we send our hard-coded request to the server with yield request $$ appSink appData. When that's complete, we stream all data from the server to standard output with appSource appData $$ stdoutC.

The output from this looks very much like you'd expect it to:

HTTP/1.1 200 OK
Server: nginx
Date: Wed, 21 Sep 2016 07:38:30 GMT
Content-Type: application/json
Content-Length: 224
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "args": {
    "baz": "bin",
    "foo": "bar"
  },
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "Practical Haskell"
  },
  "origin": "31.210.186.0",
  "url": "http://httpbin.org/get?foo=bar&baz=bin"
}

* Side note: anyone playing with HTTP client software should definitely check out httpbin.org; it's a great resource.

Upgrading to TLS

On a small tangent, it's trivial to adapt the above program to work over secure HTTPS instead of plaintext HTTP. All we need to do is:

  • Use the Data.Conduit.Network.TLS module from the network-conduit-tls library
  • Swap runTLSClient for runTCPClient, and tlsClientConfig for clientSettings
  • Change port 80 to port 443

The code looks as follows. To convince yourself that this is real: go ahead and run it and see what the url value in the response body looks like.

#!/usr/bin/env stack
{- stack --resolver nightly-2016-09-10 --install-ghc runghc
     --package classy-prelude-conduit
     --package network-conduit-tls
-}

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import Data.Conduit.Network (appSink, appSource)
import Data.Conduit.Network.TLS (runTLSClient, tlsClientConfig)

main :: IO ()
main = runTLSClient settings $ \appData -> do
    yield request $$ appSink appData
    appSource appData $$ stdoutC
  where
    settings = tlsClientConfig 443 "httpbin.org"

request :: ByteString
request = encodeUtf8 $ unlines
    [ "GET /get?foo=bar&baz=bin HTTP/1.1"
    , "Host: httpbin.org"
    , "User-Agent: Practical Haskell"
    , "Connection: close"
    , ""
    ]

Echo server

Let's play with the server side of things. We're going to implement an echo server, which will receive a chunk of data from the client and then send it right back.

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import Data.Conduit.Network (appSink, appSource, runTCPServer, serverSettings)

main :: IO ()
main =
    runTCPServer settings $ \appData -> appSource appData $$ appSink appData
  where
    settings = serverSettings 4200 "*"

This listens on port 4200, on all network interfaces ("*"). We start our server with runTCPServer, which grabs a listening socket and waits for connections. For each connection, it forks a new thread, and runs the provided application. In this case, our application is trivial: we connect the source to the sink, automatically piping data from the connection back to itself.

To stress a point above: this is a fully multithreaded server application. You can make multiple telnet connections to the server and interact with each of them independently. This is a lot of bang for very little buck.

For those of you concerned about the inefficiency of forking a new thread for each incoming connection: Haskell's runtime is built on top of green threads, making the act of forking very cheap. There are more details available in a talk I gave on "Haskell for fast, concurrent, robust services" (relevant slide and video link).
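
If you'd like to see that cheapness for yourself, here is a small illustrative script (not part of the file mirror) that forks ten thousand sleeping green threads; a one-OS-thread-per-task design would struggle with this:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)

main :: IO ()
main = do
    -- Fork 10,000 green threads, each sleeping for one second.
    replicateM_ 10000 $ forkIO $ threadDelay 1000000
    -- Give them time to finish before the main thread exits.
    threadDelay 2000000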

Full duplex

The examples so far have all been half duplex, meaning they have always been either sending or receiving data. Let's implement a full duplex application: a simple telnet client replacement. We need to wait for any input from standard input, while at the same time waiting for any input from the socket. We're going to take advantage of Haskell threading to handle this case too:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import Data.Conduit.Network (appSink, appSource, runTCPClient, clientSettings)

main :: IO ()
main = runTCPClient settings $ \appData -> race_
    (stdinC $$ appSink appData)
    (appSource appData $$ stdoutC)
  where
    settings = clientSettings 4200 "localhost"

The race_ function is a wonderful helper for concurrency, which says "run these two actions, see which one finishes first, kill the other one, and ignore any results (the _ at the end of the name)." It has a sibling function, concurrently, for running two things until they both complete. You can implement a surprisingly large number of common concurrency solutions using just these two functions. For more information, see the library package tutorial on haskell-lang.org.
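
If you haven't used these functions before, here is a minimal standalone sketch of both (using the async package directly; the versions re-exported by classy-prelude-conduit behave the same way):

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package async

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (concurrently, race_)

main :: IO ()
main = do
    -- race_: the one-second thread finishes first; the two-second
    -- thread is killed before it can print.
    race_ (threadDelay 1000000 >> putStrLn "fast thread wins")
          (threadDelay 2000000 >> putStrLn "never printed")

    -- concurrently: both actions run to completion.
    ((), ()) <- concurrently (putStrLn "left") (putStrLn "right")
    putStrLn "both finished"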

You may be terrified of the performance characteristics of this: we've introduced two blocking threads, when theoretically callback-based I/O would be far more efficient! Not to worry: in Haskell, the runtime system uses a fully callback based system under the surface, using whatever system calls are relevant for your operating system. When a Haskell green thread makes a "blocking" I/O call, what actually happens is the runtime puts the thread to sleep, installs a callback handler to wait for data to be available, and when the callback is triggered, wakes the green thread up again.

The details of the Haskell runtime are well described in the paper Mio: A High-Performance Multicore IO Manager for GHC. Fortunately, for most real world cases, you can write the naive, easy-to-conceptualize I/O operations based on blocking semantics, and automatically get the great performance you'd want from event/callback based system calls.

Client and server in same process

Just to prove that we can: let's throw our client and server into a single process, using the same concurrency approach we've had until now.

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import Data.Conduit.Network (appSink, appSource, runTCPClient, clientSettings, runTCPServer, serverSettings)

main :: IO ()
main = race_ server client

server :: IO ()
server =
    runTCPServer settings $ \appData -> appSource appData $$ appSink appData
  where
    settings = serverSettings 4200 "*"

client :: IO ()
client = do
    -- Sleep for 1 second (1 million microsecond) to give the server a
    -- chance to start up. There are definitely better ways to do
    -- this, but this is good enough for our example.
    threadDelay 1000000

    runTCPClient settings $ \appData -> race_
        (stdinC $$ appSink appData)
        (appSource appData $$ stdoutC)
  where
    settings = clientSettings 4200 "localhost"

This isn't a particularly useful application (stdinC $$ stdoutC would do the same thing without wasting a network connection), but it does show how easy it is to combine various pieces of code in Haskell for concurrent applications.

Next time on Practical Haskell

We've so far figured out how to deal with our simple file mirror's communication protocol, and how to do network communication. All that's left is combining these two things together and wrapping it up with a command line interface. Stay tuned!

September 21, 2016 12:00 PM

September 20, 2016

Brent Yorgey

The generic-random library, part 1: simple generic Arbitrary instances

In a previous post I pointed out that we know all the theory to make nice, principled, practical random generators for recursive algebraic data types; someone just needed to step up and do the work. Well, Li-yao Xia took up the challenge and produced a brilliant package, generic-random, available on Hackage right now for you to use!

However, although the package does include some Haddock documentation, it is probably difficult for someone with no experience or background in this area to navigate. So I thought it would be worth writing a few blog posts by way of a tutorial and introduction to the package.

> {-# LANGUAGE GADTSyntax           #-}
> {-# LANGUAGE DeriveGeneric        #-}
> {-# LANGUAGE FlexibleContexts     #-}
> {-# LANGUAGE UndecidableInstances #-}
> 
> import GHC.Generics
> import Test.QuickCheck
> 
> import Generic.Random.Generic

The problem

First, a quick recap of the problem we are trying to solve: the obvious, naive way of generating random instances of some recursive algebraic data type often produces really terrible distributions. For example, one might generate really tiny structures most of the time and then occasionally generate a humongous one. For more background on the problem, see this post or this one.

A first example: generating generic Arbitrary instances

As a first example, consider the following algebraic data type:

> data Foo where
>   Bar  :: Char -> Int -> String -> Foo
>   Baz  :: Bool -> Bool -> Foo
>   Quux :: [Woz] -> Foo
>   deriving (Show, Generic)
> 
> data Woz where
>   Wiz :: Int -> Woz
>   Waz :: Bool -> Woz
>   deriving (Show, Generic)

You have probably noticed by now that this is not recursive (well, except for the embedded lists). Patience! We’ll get to recursive ADTs in due time, but it turns out the library has some nice things to offer for non-recursive ADTs as well, and it makes for an easier introduction.

Now, suppose we wanted to use QuickCheck to test some properties of a function that takes a Foo as an argument. We can easily make our own instances of Arbitrary for Foo and Woz, like so:

instance Arbitrary Foo where
  arbitrary = oneof
    [ Bar <$> arbitrary <*> arbitrary <*> arbitrary
    , Baz <$> arbitrary <*> arbitrary
    , Quux <$> arbitrary
    ]

instance Arbitrary Woz where
  arbitrary = oneof
    [ Wiz <$> arbitrary
    , Waz <$> arbitrary
    ]

This works reasonably well:

λ> sample (arbitrary :: Gen Foo)
Baz True True
Baz False True
Baz True True
Quux []
Baz False True
Bar '<' 3 "zy\\\SOHpO_"
Baz False True
Bar '\SOH' 0 "\"g\NAKm"
Bar 'h' (-9) "(t"
Quux [Wiz (-2),Waz False]
Baz False True

The only problem is that writing those instances is quite tedious. There is no thought required at all. Isn’t this exactly the sort of thing that is supposed to be automated with generic programming?

Why yes, yes it is. And the generic-random package can do exactly that. Notice that we have derived Generic for Foo and Woz. We can now use the genericArbitrary function from Generic.Random.Generic to derive completely standard Arbitrary instances, just like the ones we wrote above:

> instance Arbitrary Foo where
>   arbitrary = genericArbitrary
> 
> instance Arbitrary Woz where
>   arbitrary = genericArbitrary

λ> sample (arbitrary :: Gen Foo)
Quux []
Bar '\159' (-2) ""
Baz True True
Baz False False
Baz True True
Baz True False
Quux [Wiz 9,Wiz 7,Waz True,Waz True,Waz False]
Quux [Wiz (-10),Waz False,Waz False,Waz True,Waz True,Wiz (-14),Wiz 13,Waz True,Wiz (-8),Wiz 12,Wiz (-13)]
Bar '\130' 10 "FN\222j?\b=\237(\NULW\231+ts\245"
Bar 'n' 14 ""
Bar '\205' 4 "\SYN"

Seems about the same, except we wrote way less code! Huzzah!

If we want certain constructors to occur more frequently, we can also control that using genericArbitraryFrequency, which takes a list of Ints (each Int specifies the weight for one constructor).
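
For example, assuming the weights line up with the constructors in declaration order, something like this (an untested sketch) would make Bar the most common Foo constructor:

instance Arbitrary Foo where
  arbitrary = genericArbitraryFrequency [3, 2, 1]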

A few notes:

  • Using the Generic.Random.Generic module is the quickest and simplest way to generate random instances of your data type, if it works for your use case.

  • It has some limitations, namely:

    • It only generates Arbitrary instances for QuickCheck. It can’t create more general random generators.

    • It probably won’t work very well for recursive data types.

However, these limitations are addressed by other parts of the library. Intrigued? Read on!

Recursive types, the simple way

Let’s now consider a simple recursive type:

> data Tree a where
>   Leaf   :: a                -> Tree a
>   Branch :: Tree a -> Tree a -> Tree a
>   deriving (Show, Generic)
> 
> treeSize :: Tree a -> Int
> treeSize (Leaf _)     = 1
> treeSize (Branch l r) = 1 + treeSize l + treeSize r

We can try using genericArbitrary:

instance Arbitrary a => Arbitrary (Tree a) where
  arbitrary = genericArbitrary

The problem is that this tends to generate some tiny trees and some enormous trees, with not much in between:

λ> map treeSize <$> replicateM 50 (generate (arbitrary :: Gen (Tree Int)))
[1,1,1,269,1,1,1,1,1,11,7,3,5,1,1,1,7,1,1,1,3,3,83,5,1,1,3,111,265,47,1,3,19,1,11,1,5,3,15,15,1,91,1,13,4097,119,1,15,5,3]

And this is not a problem specific to trees; this kind of thing is likely to happen for any recursive type.

Before we get to more interesting/complicated tools, it’s worth noting that generic-random provides a simple mechanism to limit the size of the generated structures: the genericArbitrary' function works like genericArbitrary but uses QuickCheck’s sized mechanism to cut off the recursion when it gets too big. The available size is partitioned among recursive calls, so it does not suffer from the exponential growth you might see if only the depth was limited. When the size counter reaches zero, the generator tries to terminate the recursion by picking some finite, non-recursive value(s). The parameter to genericArbitrary' is a natural number specifying how deep the finite, recursion-terminating values can be. Z (i.e. zero) means the generator will only be willing to terminate the recursion with nullary constructors. In our case, Tree does not have any nullary constructors, so we should not use Z: if we do, the generator will be unable to terminate the recursion when the size reaches zero and we will get the same behavior as genericArbitrary. Instead, we should use S Z, which means it will be able to pick the depth-1 term Leaf x (for some arbitrary x) to terminate the recursion.

Let’s try it:

> instance (Arbitrary a, Generic a, BaseCases Z (Rep a)) => Arbitrary (Tree a) where
>   arbitrary = genericArbitrary' (S Z)

λ> sample (arbitrary :: Gen (Tree Int))
Leaf 0
Branch (Leaf 0) (Branch (Leaf 0) (Branch (Leaf 0) (Leaf 0)))
Branch (Leaf (-1)) (Leaf 1)
Leaf (-3)
Leaf 7
Branch (Leaf (-4)) (Branch (Branch (Leaf 1) (Leaf (-1))) (Leaf (-1)))
Branch (Leaf (-2)) (Branch (Leaf 1) (Branch (Leaf 0) (Branch (Leaf 0) (Leaf 0))))
Leaf 14
Branch (Branch (Leaf 2) (Leaf 2)) (Branch (Branch (Branch (Leaf 1) (Branch (Branch (Leaf 0) (Branch (Leaf 0) (Leaf 0))) (Branch (Leaf 0) (Leaf 0)))) (Branch (Branch (Branch (Leaf 0) (Leaf 0)) (Leaf 0)) (Leaf 0))) (Leaf (-3)))
Leaf 4
Leaf 9

Ah, that’s much better.

Finally, genericArbitraryFrequency' is the same as genericArbitraryFrequency but limits the recursion depth as genericArbitrary' does.

If you have a recursive data type you want to use with QuickCheck, it’s worth trying this, since it is quick and simple. The main problem with this approach is that it does not generate a uniform distribution of values. (Also, it is limited in that it is specifically tied to QuickCheck.) In this example, although you can’t necessarily tell just by looking at the sample random trees, I guarantee you that some kinds of trees are much more likely to be generated than others. (Though I couldn’t necessarily tell you which kinds.) This can be bad if the specific trees that will trigger a bug are in fact unlikely to be generated.

Next time, we’ll look at how we can actually have efficient, size-limited, uniform random generators using Boltzmann samplers.


by Brent at September 20, 2016 10:27 PM

September 18, 2016

Roman Cheplyaka

How to prepare a good pull request

  1. A pull request should have a specific goal and a descriptive title. Do not put multiple unrelated changes in a single pull request.

  2. Do not include any changes that are irrelevant to the goal of the pull request.

    This includes refactoring or reformatting unrelated code and changing or adding auxiliary files (.gitignore, .travis.yml etc.) in a way that is not related to your main changes.

  3. Make logical, not historical commits.

    Before you submit your work for review, you should rebase your branch (git rebase -i) and regroup your changes into logical commits.

    Logical commits achieve different parts of the pull request goal. Each commit should have a descriptive commit message. Logical commits within a single pull request rarely overlap with each other in terms of the lines of code they touch.

    If you want to amend your pull request, I’d rather you rewrite the branch and force-push it instead of adding new (historical) commits or creating a new pull request. Note, however, that other maintainers may disagree with me on this one.

  4. Make clean commits. Run git diff or git show on your commits. It will show you issues like trailing whitespace or missing newlines at the end of the file.

.gitignore

My .gitignore policy is that the project-specific .gitignore file should only contain patterns specific for this project. For instance, if a test suite generates files *.out, this pattern belongs to the project’s .gitignore.

If a pattern is standard across a wide range of projects (e.g. *.o, or .stack-work for Haskell projects), then it belongs to the user-specific ~/.gitignore.
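
For example (illustrative patterns only):

# project-specific .gitignore: artifacts this project generates
*.out

# user-specific ~/.gitignore: patterns common to many projects
*.o
.stack-work/

(Note that git only consults a user-level ignore file if you point it there, e.g. with git config --global core.excludesFile ~/.gitignore.)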

stack.yaml

(This section is specific to Haskell.)

My policy is to track stack.yaml inside the repo for applications, but not for libraries.

The rationale is that for an application, stack.yaml provides a useful bit of metainformation: which snapshot the app is guaranteed to build with. Additionally, non-programmers (or non-Haskell programmers) may want to install the application, and the presence of stack.yaml makes it easy for them.
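
For instance, a minimal application stack.yaml might look like this (a sketch; the snapshot name is illustrative):

resolver: lts-7.0
packages:
- '.'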

These benefits do not apply to libraries. And the cost of including stack.yaml is:

  • The snapshot version gets out of date quickly, so you need to update this file regularly.
  • This file is often changed temporarily (e.g. to test a specific version of a dependency), and if it is tracked, you need to pay attention not to commit those changes by accident.

September 18, 2016 08:00 PM

September 17, 2016

Tom Schrijvers

Doctoral or Post-Doctoral Position in Programming Languages Theory & Implementation

I am looking for a new member to join my research team in either a doctoral or post-doctoral position.

You can find more details here.

by Tom Schrijvers (noreply@blogger.com) at September 17, 2016 01:08 AM

September 16, 2016

wren gayle romano

Visiting Nara over the next week

I announced this on twitter a while back, but tomorrow I'm flying out to Nara Japan. I'll be out there all week for ICFP and all that jazz. It's been about a decade since last time I was in the Kansai region, and I can't wait. As I've done in the past, if you want to meet up for lunch or dinner, just comment below (or shoot me a tweet, email, etc).




September 16, 2016 04:39 AM

September 15, 2016

Douglas M. Auclair (geophf)

August 2016 1HaskellADay 1Liners

  • August 20th, 2016: maybeify :: (a, Maybe b) -> Maybe (a, b)
    Define maybeify. Snaps for elegance.
    • Hardy Jones @st58 sequence
    • Bruno @Brun0Cad mapM id
    • Thomas D @tthomasdd {-# LANGUAGE TupleSections #-}
      maybeify (x,mY) = maybe Nothing (return . (x,)) mY
    • Андреев Кирилл @nonaem00 import "category-extras" Control.Functor.Strong
      maybeify = uncurry strength
    • bazzargh @bazzargh I can't beat 'sequence', but: uncurry (fmap.(,))
    • Nick @crazy_fizruk distribute (from Data.Distributive)

by geophf (noreply@blogger.com) at September 15, 2016 02:29 PM

Brent Yorgey

Meeting people at ICFP in Nara

In less than 24 hours I’m getting on a plane to Japan (well, technically, Dallas, but I’ll get to Japan eventually). As I did last year, I’m making an open offer here: leave a comment on this post, and I will make a point of finding and meeting you sometime during the week! One person took me up on the offer last year and we had a nice chat over dinner.


by Brent at September 15, 2016 01:53 PM

Manuel M T Chakravarty

This is the video of my Compose :: Melbourne keynote. I am...


This is the video of my Compose :: Melbourne keynote. I am making the case for purely functional graphics programming with Haskell playgrounds, including live programming a little game (for the second half of the talk).

Some of the code is hard to read in the video; you may like to refer to the slides.

September 15, 2016 03:00 AM

September 14, 2016

Jens Petersen

Stackage LTS 7 is released

The Stackage curator team is pleased to announce the initial release of Stackage LTS 7 for ghc-8.0.1. Released over 3.5 months after LTS 6.0, LTS 7.0's biggest change is the move from ghc-7.10.3 to ghc-8.0.1. There is also the usual large number of (major) version bumps: to achieve this we used a new one-time pruning step in Stackage Nightly development, dropping all packages with constraining Stackage upper bounds. Nevertheless, LTS 7.0 has 1986 packages, which is almost as many as LTS 6.0 with its 1994 packages (the latest, 6.17, has 2002 packages at the time of writing), thanks to all the work of the community of Haskell maintainers. Now that LTS 7 is branched from Nightly, we are going to do the major pruning step earlier for current Stackage Nightly, which will give package maintainers even more time to get ready for LTS 8.

by Jens Petersen (noreply@blogger.com) at September 14, 2016 06:01 PM

The haskell-lang.org team

Updates for September 14, 2016

The biggest update to the site is the addition of three new targeted "next steps" tutorials on the get started page for using Stack, aimed at Play (using the REPL), Script (single-file programs), and Build (full projects). Hopefully this will help people with different goals all get started with Haskell quickly.

In addition, we have included a few new tutorials:

Plus a few other minor edits throughout the site.

The complete diff can be found here.

September 14, 2016 04:00 PM

Jan Stolarek

Moving to University of Edinburgh

I wanted to let you all know that after working for 8 years as a Lecturer at the Institute of Information Technology (Lodz University of Technology, Poland), I have received a sabbatical leave to focus solely on research. Yesterday I began my work as a Research Associate at the Laboratory for Foundations of Computer Science, University of Edinburgh. This is a two-year post-doc position. I will be part of the team working on the Skye project under supervision of James Cheney. This means that from now on I will mostly focus on developing the Links programming language.

by Jan Stolarek at September 14, 2016 12:45 PM

FP Complete

Practical Haskell: Simple File Mirror (Part 1)

The other day I threw together a quick program to solve an annoyance some people on our team were expressing. We sometimes do our development on remote machines, but would still like to use local file editors. There are plenty of solutions for this (SSHFS, inotify+rsync), but none of us ever found a prebaked solution that was low-latency and robust enough to use regularly.

The program I put together is a simple client/server combo, where the client watches for local file changes and sends the updates to the server, which writes them to disk. This approach reduces latency significantly by keeping an open TCP connection, and can be tunneled over SSH for security. It's simple, and the code is short and readable.

This program turned out to be a good demonstration of a few different common problem domains when writing practical, real world Haskell code:

  • Communication protocols
  • Streaming of data
  • Network communications
  • File monitoring (eh, probably not that common, but cool)
  • Command line option parsing

This blog post series will build up from basics to being able to implement a program like the simple file mirror. This first post of three planned posts will cover communication protocols and streaming of data. You can also read:

  • Part 2, on network communication and concurrency

Prereqs

This series will contain a number of self-contained code samples, presented as runnable scripts. In order to run these scripts, copy them into a file ending with .hs (e.g., practical.hs), and then run it with stack practical.hs. You will need to install the Stack build tool. Stack will take care of downloading and installing any compiler and libraries necessary. Because of that, your first time running a script will take a bit of time while downloading and installing.

If you'd like to test that you're ready to proceed, try running this program:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc

main :: IO ()
main = putStrLn "Hello world!"

Also, this series of posts will use the classy-prelude-conduit library, which provides a lot of built-in functionality to make it easier to follow. If you'd like to kick off a build of that library now so it's available when you want to use it, run:

$ stack --resolver nightly-2016-09-10 --install-ghc build classy-prelude-conduit

Textual versus binary data

We humans have an interesting tendency to want to communicate - with each other and our computers - in human languages. Computers don't care: they just see binary data. Often, this overlaps just fine, specifically when dealing with the ASCII subset of the character set. But generally speaking, we need some way to distinguish between textual and binary data.

In Haskell, we do this by using two different data types: Text and ByteString. In order to convert between these two, we need to choose a character encoding, and most often we'll choose UTF-8. We can perform this conversion with the encodeUtf8 and decodeUtf8 functions:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit

main :: IO ()
main = do
    let someText :: Text
        someText = unlines
            [ "Hello, this is English."
            , "Hola, este es el español."
            , "שלום, זה עברית."
            ]
        binary :: ByteString
        binary = encodeUtf8 someText

        filePath :: FilePath
        filePath = "say-hello.txt"

    writeFile filePath binary
    binary2 <- readFile filePath :: IO ByteString

    let text2 = decodeUtf8 binary2
    putStrLn text2

If you're coming from a language that doesn't have this strong separation, it may feel clunky at first. But after years of experience with having a type-level distinction between textual and binary data, I can attest to the fact that the compiler has many times prevented me from making some mistakes that would have been terrible in practice. One example popped up when working on the simple-file-mirror tool itself: I tried to send the number of characters in a file path over the wire, instead of sending the number of bytes after encoding it.

The OverloadedStrings language extension lets us use string literals for all of these string-like things, including (as you saw above) file paths.

NOTE: For historical reasons which are being worked on, there is additionally a String type, which is redundant with Text but far less efficient. I mention it because you may see references to it when working with some Haskell documentation. Whenever possible, stick to Text or ByteString, depending on your actual data content.
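
If you do need to cross between String and Text (say, for a String-based legacy API), pack and unpack do the conversion. A tiny sketch:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package text

import Data.Text (Text, pack, unpack)

main :: IO ()
main = do
    let legacy :: String
        legacy = "a legacy String"
        modern :: Text
        modern = pack legacy    -- String -> Text
    putStrLn (unpack modern)    -- Text -> String, for String-based APIs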

Some simple data streaming

OK, that was a bit abstract. Let's do something more concrete. We're going to use the conduit library to represent streaming data. To start off simple, let's stream the numbers 1 to 10 and print each of them:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit

main :: IO ()
main = yieldMany [1..10] $$ mapM_C print

Our yieldMany function will take a sequence of values - in this case a list from 1 to 10 - and yield them downstream. For those familiar, this is similar to yielding in Python with generators. The idea is that we will build a pipeline of multiple components, each awaiting values from upstream and yielding values downstream.

Our mapM_C print component will apply the function print to every value it receives from upstream. The C suffix is used for disambiguating conduit functions from non-conduit functions. Finally, the $$ operator in the middle is the "connect" function, which connects a source of data to a sink of data and runs it. As you might guess, the above prints the numbers 1 to 10.

We can also put other components into our pipeline, including functions that both await values from upstream and yield them downstream. For example:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit

main :: IO ()
main = yieldMany [1..10] $$ mapC (* 10) =$ mapM_C print

The mapC function applies a function to each value in the stream and yields it downstream, while the =$ operator connects two components together. Take a guess at what the output from this will be, and then give it a try.
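
Under the hood, transformers like mapC are just loops over await and yield. Here is a hand-rolled equivalent of mapC (* 10) (my own illustrative code, using the same resolver as the other scripts):

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
import ClassyPrelude.Conduit

-- Await each value from upstream; if one exists, multiply it by 10,
-- yield it downstream, and loop. Upstream exhaustion ends the loop.
timesTen :: Monad m => Conduit Int m Int
timesTen = do
    mi <- await
    case mi of
        Nothing -> return ()
        Just i  -> do
            yield (i * 10)
            timesTen

main :: IO ()
main = yieldMany [1..10] $$ timesTen =$ mapM_C print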

NOTE There's a subtle but important difference between $$ and =$. $$ will connect two components and then run the result to completion to get a result. =$ will connect two components without running them, so that the result can be further connected to other components. In a single pipeline, you'll end up with one $$ usage.

We can also do more interesting consumption of a stream, like summing:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit

main :: IO ()
main = do
    total <- yieldMany [1..10] $$ mapC (* 10) =$ sumC
    print total

    -- or more explicitly...
    total <- yieldMany [1..10] $$ mapC (* 10) =$ foldlC (+) 0
    print total

Or limit how much of the stream we want to consume:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit

main :: IO ()
main = do
    total <- yieldMany [1..10] $$ mapC (* 10) =$ takeC 5 =$ sumC
    print total

This only scratches the surface of what we can do with conduit, but hopefully it gives enough of a basic intuition for the library to get started. If you're interested in diving in deep on the conduit library, check out the previously linked tutorial.

Streaming chunked data

Having a stream of individual bytes turns out to be inefficient in practice. It's much better to chunk a series of bytes into an efficient data structure like a ByteString. Let's see what it looks like to stream data from a file to standard output:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import qualified System.IO as IO

main :: IO ()
main = do
    writeFile "some-file.txt" ("This is just some text." :: ByteString)

    -- Yes, this is clunky. We'll do something much slicker in a bit.
    IO.withBinaryFile "some-file.txt" IO.ReadMode $ \fileHandle ->
           sourceHandle fileHandle
        $$ decodeUtf8C
        =$ stdoutC

This is good, but what if we want to deal with the individual bytes in the stream instead of the chunks? For example, let's say we want to get just the first 10 bytes of our file. The takeC function we used above operates on whole chunks, so takeC 10 would take the first 10 chunks of data, not the first 10 bytes. We instead need a function which will work on the elements of the bytestrings (the individual bytes). Fortunately, conduit provides for this with the E-suffixed element-specific functions:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import qualified System.IO as IO

main :: IO ()
main = do
    writeFile "some-file.txt" ("This is just some text." :: ByteString)

    -- Yes, this is clunky. We'll do something much slicker in a bit.
    IO.withBinaryFile "some-file.txt" IO.ReadMode $ \fileHandle ->
           sourceHandle fileHandle
        $$ takeCE 10
        =$ decodeUtf8C
        =$ stdoutC

In the simple-file-mirror program, we will be sending files over the network, and will need to limit some operations to the actual file sizes in question. Functions like takeCE will be vital for doing this.

Managing resources

While the withBinaryFile approach above worked, it felt kind of clunky. And for more complicated control flows, opening up the file in advance won't be an option (like when we'll only know which file to open after the network connection tells us the file path). To allow for these cases, we're going to introduce the runResourceT function, which allows us to acquire resources in an exception-safe manner. Let's rewrite the above example with runResourceT:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit

main :: IO ()
main = do
    writeFile "some-file.txt" ("This is just some text." :: ByteString)

    runResourceT
         $ sourceFile "some-file.txt"
        $$ takeCE 10
        =$ decodeUtf8C
        =$ stdoutC

Internally, sourceFile uses the bracketP function, which runs some initialization function (in our case, opening a file handle), registers some cleanup function (in our case, closing that file handle), and then performs an action with the resource. To demonstrate what that looks like more explicitly, let's write a modified sourceFile function which will return some default file contents if the requested file can't be read from.

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import ClassyPrelude.Conduit
import qualified System.IO as IO

sourceFileDef :: MonadResource m
              => FilePath
              -> Source m ByteString
sourceFileDef fp = do
    let -- tryIO catches all IO exceptions and puts them in a Left
        -- value.
        open = tryIO $ IO.openBinaryFile fp IO.ReadMode

        -- Exception occurred, so nothing to close.
        close (Left e) =
            putStrLn $ "No file to close, got an exception: " ++ tshow e
        -- No exception, so close the file handle.
        close (Right h) = hClose h

    bracketP open close $ \eitherHandle ->
        case eitherHandle of
            Left ex -> do
                yield "I was unable to open the file in question:\n"
                yield $ encodeUtf8 $ tshow ex ++ "\n"
            Right fileHandle -> sourceHandle fileHandle

main :: IO ()
main = runResourceT
     $ sourceFileDef "some-file.txt"
    $$ decodeUtf8C
    =$ stdoutC

Implementing our protocol

Let's at least get started on our actual simple-file-mirror code. The wire protocol we're going to use is defined in the README, but we can describe it briefly as:

  • Client sends data to server, server never sends to client
  • A line like 9:hello.txt11:Hello World means "write the file hello.txt with the content Hello World"
  • We provide lengths of both the file paths and the file contents with a decimal-encoded integer followed by a colon (similar to netstrings, but without the trailing comma)
  • If a file has been deleted, we use a special length of -1, e.g. 9:hello.txt-1: means "hello.txt was deleted"

So let's get started with implementing the integer send/receive logic in conduit. There are many ways of doing this, some more efficient than others. For this program, I elected to go with the simplest approach possible (though you can see some more complicated and efficient historical versions). Let's start with sending, which is pretty trivial:

#!/usr/bin/env stack
-- stack --resolver nightly-2016-09-10 --install-ghc runghc --package classy-prelude-conduit

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}
import ClassyPrelude.Conduit

sendInt :: Monad m => Int -> Producer m ByteString
sendInt i = yield $ encodeUtf8 $ tshow i ++ ":"

main :: IO ()
main =
    yieldMany sampleInts $$ awaitForever sendInt =$ stdoutC
  where
    sampleInts =
        [ 1
        , 10
        , -5
        , 0
        , 60
        ]

Here we've introduced a new function, awaitForever, which repeatedly applies a function as long as data exists on the stream. Take a guess at what the output of this program will be, and then try it out.

Now let's try out the receiving side of this, which is slightly more complicated, but not too bad:

#!/usr/bin/env stack
{- stack --resolver nightly-2016-09-10 --install-ghc runghc
    --package classy-prelude-conduit
    --package word8
-}

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}
import ClassyPrelude.Conduit
import Data.Word8 (_colon)

sendInt :: Monad m => Int -> Producer m ByteString
sendInt i = yield $ encodeUtf8 $ tshow i ++ ":"

recvInt :: MonadThrow m => Consumer ByteString m Int
recvInt = do
    intAsText <- takeWhileCE (/= _colon) =$ decodeUtf8C =$ foldC
    dropCE 1 -- drop the colon from the stream
    case readMay $ unpack intAsText of
        Nothing -> error $ "Invalid int: " ++ show intAsText
        Just i -> return i

main :: IO ()
main =
       yieldMany sampleInts
    $$ awaitForever sendInt
    =$ peekForeverE (do i <- recvInt
                        print i)
  where
    sampleInts =
        [ 1
        , 10
        , -5
        , 0
        , 60
        ]

peekForeverE is similar to awaitForever, in that it repeatedly performs an action as long as there is data on the stream. However, it's different in that it doesn't grab the data off of the stream itself, leaving it to the action provided to do that, and it deals correctly with chunked data by ignoring empty chunks.

We've also introduced takeWhileCE, which is like takeCE, but instead of consuming a fixed amount of the stream, it keeps consuming as long as the given predicate holds: in our case, until we reach the colon. Then we decode the bytes as UTF-8, and use foldC to concatenate the resulting chunks of Text into a single Text value. Then we use readMay to parse the textual value into an Int. (And yes, we could do a much better job at error handling, but using error is the simplest approach.)

Building on top of these two functions, it becomes a lot easier to send and receive complete file paths. Let's put that code in here, together with a test suite to prove it's all working as expected:

#!/usr/bin/env stack
{- stack --resolver nightly-2016-09-10 --install-ghc runghc
    --package classy-prelude-conduit
    --package word8
    --package hspec
-}

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}
import ClassyPrelude.Conduit
import Data.Word8 (_colon)
import Test.Hspec
import Test.Hspec.QuickCheck (prop)

sendInt :: Monad m => Int -> Producer m ByteString
sendInt i = yield $ encodeUtf8 $ tshow i ++ ":"

sendFilePath :: Monad m => FilePath -> Producer m ByteString
sendFilePath fp = do
    -- UTF-8 encode the filepath
    let bs = encodeUtf8 $ pack fp :: ByteString

    -- Send the number of bytes
    sendInt $ length bs

    -- Send the actual path
    yield bs

recvInt :: MonadThrow m => Consumer ByteString m Int
recvInt = do
    intAsText <- takeWhileCE (/= _colon) =$ decodeUtf8C =$ foldC
    dropCE 1 -- drop the colon from the stream
    case readMay $ unpack intAsText of
        Nothing -> error $ "Invalid int: " ++ show intAsText
        Just i -> return i

recvFilePath :: MonadThrow m => Consumer ByteString m FilePath
recvFilePath = do
    -- Get the byte count
    fpLen <- recvInt

    -- Read in the given number of bytes, decode as UTF-8 text, and
    -- then fold all of the chunks into a single Text value
    fpText <- takeCE fpLen =$= decodeUtf8C =$= foldC

    -- Unpack the text value into a FilePath
    return $ unpack fpText

main :: IO ()
main = hspec $ do
    prop "sendInt/recvInt are inverses" $ \i -> do
        res <- sendInt i $$ recvInt
        res `shouldBe` i

    prop "sendFilePath/recvFilePath are inverses" $ \fp -> do
        res <- sendFilePath fp $$ recvFilePath
        res `shouldBe` fp

We've used the hspec test framework library and its QuickCheck support to create a test suite which automatically generates test cases based on the types in our program. In this case, it will generate 100 random Ints and 100 random FilePaths each time it's run, and ensure that our properties hold. This is a great way to quickly get significant test coverage for a program.

Sending the files themselves

OK, finally time to put all of this together. We're going to add in some new functions for sending and receiving files themselves. This is a fairly simple composition of all of the work we've done until now. And this is the nice thing about Haskell in general, as well as the conduit library: purity and strong typing often make it possible to trivially combine smaller functions into more complicated beasts. Let's see this all in action:

#!/usr/bin/env stack
{- stack --resolver nightly-2016-09-10 --install-ghc runghc
    --package classy-prelude-conduit
    --package word8
    --package hspec
    --package temporary
-}

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}
import ClassyPrelude.Conduit
import Data.Word8            (_colon)
import System.Directory      (createDirectoryIfMissing, doesFileExist,
                              removeFile)
import System.FilePath       (takeDirectory)
import System.IO             (IOMode (ReadMode), hFileSize, openBinaryFile)
import System.IO.Temp        (withSystemTempDirectory)
import Test.Hspec
import Test.Hspec.QuickCheck (prop)

sendInt :: Monad m => Int -> Producer m ByteString
sendInt i = yield $ encodeUtf8 $ tshow i ++ ":"

sendFilePath :: Monad m => FilePath -> Producer m ByteString
sendFilePath fp = do
    -- UTF-8 encode the filepath
    let bs = encodeUtf8 $ pack fp :: ByteString

    -- Send the number of bytes
    sendInt $ length bs

    -- Send the actual path
    yield bs

sendFile :: MonadResource m
         => FilePath -- ^ root directory
         -> FilePath -- ^ path relative to root
         -> Producer m ByteString
sendFile root fp = do
    -- Send the relative file path, so the other side knows where to
    -- put the contents
    sendFilePath fp

    let open = tryIO $ openBinaryFile fpFull ReadMode

        -- If the opening failed, we'll have an error message. So
        -- there's nothing to close, just do nothing!
        close (Left _err) = return ()
        -- Opening succeeded, so close the file handle.
        close (Right h) = hClose h

    -- Grab the file handle...
    bracketP open close $ \eh ->
        case eh of
            -- No file, send a -1 length to indicate file does not
            -- exist
            Left _ex -> sendInt (-1)

            -- File exists
            Right h -> do
                -- Send the size of the file
                size <- liftIO $ hFileSize h
                sendInt $ fromInteger size

                -- And stream the contents
                sourceHandle h
  where
    fpFull = root </> fp

recvInt :: MonadThrow m => Consumer ByteString m Int
recvInt = do
    intAsText <- takeWhileCE (/= _colon) =$ decodeUtf8C =$ foldC
    dropCE 1 -- drop the colon from the stream
    case readMay $ unpack intAsText of
        Nothing -> error $ "Invalid int: " ++ show intAsText
        Just i -> return i

recvFilePath :: MonadThrow m => Consumer ByteString m FilePath
recvFilePath = do
    -- Get the byte count
    fpLen <- recvInt

    -- Read in the given number of bytes, decode as UTF-8 text, and
    -- then fold all of the chunks into a single Text value
    fpText <- takeCE fpLen =$= decodeUtf8C =$= foldC

    -- Unpack the text value into a FilePath
    return $ unpack fpText

recvFile :: MonadResource m
         => FilePath -- ^ directory to store files in
         -> Sink ByteString m ()
recvFile root = do
    -- Get the relative path to store in
    fpRel <- recvFilePath

    -- Prepend with the root directory to get a complete path
    let fp = root </> fpRel

    -- Get the size of the file
    fileLen <- recvInt

    if fileLen == (-1)
        -- We use -1 to indicate the file should be removed. Go ahead
        -- and call removeFile, but ignore any IO exceptions that
        -- occur when doing so (in case the file doesn't exist
        -- locally, for example)
        then liftIO $ void $ tryIO $ removeFile fp
        else do
            -- Create the containing directory
            liftIO $ createDirectoryIfMissing True $ takeDirectory fp

            -- Stream out the specified number of bytes and write them
            -- into the file
            takeCE fileLen =$= sinkFile fp

main :: IO ()
main = hspec $ do
    prop "sendInt/recvInt are inverses" $ \i -> do
        res <- sendInt i $$ recvInt
        res `shouldBe` i

    prop "sendFilePath/recvFilePath are inverses" $ \fp -> do
        res <- sendFilePath fp $$ recvFilePath
        res `shouldBe` fp

    -- A more standard unit test, checking that sending and receiving
    -- a file does what is expected.
    it "create and delete files" $
      -- Get temporary source and destination directories
      withSystemTempDirectory "src" $ \srcDir ->
      withSystemTempDirectory "dst" $ \dstDir -> do

        let relPath = "somepath.txt"
            content = "This is the content of the file" :: ByteString

        -- Ensure that sending a file that exists makes it appear in
        -- the destination
        writeFile (srcDir </> relPath) content

        runResourceT $ sendFile srcDir relPath $$ recvFile dstDir

        content' <- readFile (dstDir </> relPath)
        content' `shouldBe` content

        -- Ensure that sending a file that doesn't exist makes it
        -- disappear in the destination
        removeFile (srcDir </> relPath)

        runResourceT $ sendFile srcDir relPath $$ recvFile dstDir

        exists <- doesFileExist (dstDir </> relPath)
        exists `shouldBe` False

Our sendFile function looks very similar to our sourceFileDef example at the beginning of the post. But instead of streaming a default value, we just send a length of -1, as our protocol dictates. The recvFile function relies heavily on recvFilePath and recvInt. In the case of a -1, it removes the file in question. Otherwise, it creates the containing directory if necessary, and then composes takeCE with sinkFile to stream the correct number of bytes into the file.

We also have a unit test covering the interaction of these two new functions. While some kind of property could perhaps be devised for testing this with QuickCheck, a more standard unit test seemed far more straightforward in this case.

Next time on Practical Haskell

This part of the tutorial covered quite a number of topics, so this is a good place to take a break. Next time, we'll dive into the network communication aspect of things, including:

  • Implementing a really simple HTTP client
  • Implementing an echo server
  • Using some basic concurrency in Haskell to have a client and server in the same process

If you have feedback on how to make this tutorial series more useful, please share it in the comments below, on Twitter (@snoyberg), or in any Reddit discussions about it. I'll try to review feedback and make changes to parts 2 and 3.

September 14, 2016 07:00 AM

Working with data in Haskell

In data mining or general exploration, it's common to need to access data easily, efficiently, and without ceremony. Typically, a programming language will be designed for this case specifically, like R, or a library will be written for it, like Python with the pandas library.

Implementing this in Haskell, we improve upon this area with all the benefits that come with using Haskell over Python or R, such as:

  • Safety - garbage collected memory, no need for pointers.
  • Performance - it's fast: if you write new algorithms in Haskell, they don't need to be added to the language (like R) or written in C (like Python).
  • Concurrency - make use of concurrent algorithms trivially, and take advantage of your other CPU cores.
  • Maintainability - static types ensure safety at the time you write the program, and when you come back later to change them, it's harder to break them.

Let's look at an example of doing this in Haskell, and compare with how this is done in Python's pandas. The steps are:

  1. Download a zip file containing a CSV file.
  2. Unzip the file.
  3. Read through the CSV file.
  4. Do some manipulation of the data from the file.

In Haskell we have all the libraries needed (streaming HTTP, CSV parsing, etc.) to achieve this goal, so specifically for this post I've made a wrapper package that brings them together like pandas does. We have some goals:

  • Convenience: We don't want to have to write more code than necessary while exploring data.
  • Constant memory: Be able to process the file in constant memory space. If I have a 1GB file I don't want to have to load all 1GB into memory in one go.
  • Type-safe: I would like that once parsed from CSV, I have a statically-typed data structure with proper types (integers, dates, text, etc.).

Python example

This example code was taken from Modern Pandas. In Python we request the web URL in chunks, which we then write to a file. Next, we unzip the file, and then the data is available as df, with column names downcased.

import zipfile
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

r = requests.get('http://chrisdone.com/ontime.csv.zip', stream=True)
with open("flights.csv.zip", 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)

zf = zipfile.ZipFile("flights.csv.zip")
filename = zf.filelist[0].filename
fp = zf.extract(filename)
df = pd.read_csv(fp, parse_dates=["FL_DATE"]).rename(columns=str.lower)

Finally, we can look at the 5 rows starting at row 10, for the columns fl_date and tail_num, like this:

df.ix[10:14, ['fl_date', 'tail_num']]

=>

    fl_date     tail_num
10  2014-01-01  N002AA
11  2014-01-01  N3FXAA
12  2014-01-01  N906EV
13  2014-01-01  N903EV
14  2014-01-01  N903EV

Python: good and bad

Good parts of the Python code:

  • Extracting from the Zip file was fairly easy.
  • We were able to specify for some fields to parse them as a different data type (parse_dates).
  • Referencing a range of the data was easy.

Bad parts of the Python code:

  • Reading the HTTP request was very verbose. We manually streamed the chunks to disk, which seems pointless.
  • It's not statically typed. Even though we parsed fl_date and tail_num, we can't be certain down the line if they still exist, or are of the right type.
  • We loaded the whole 100MB CSV into memory.

Let's compare with the solution I prepared in Haskell. While reading, you can also clone the repository that I put together:

$ git clone git@github.com:chrisdone/labels.git --recursive

The wrapper library created for this post is under labels-explore, and all the code samples are under labels-explore/app/Main.hs.

Haskell example

I prepared the module Labels.RunResourceT which provides us with some data manipulation functionality: web requests, unzipping, CSV parsing, etc.

{-# LANGUAGE TypeApplications, OverloadedStrings, OverloadedLabels,
    TypeOperators, DataKinds, FlexibleContexts #-}

import Labels.RunResourceT

main =
  runResourceT $
  httpSource "http://chrisdone.com/ontime.csv.zip" responseBody .|
  zipEntryConduit "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    (set #downcase True csv) .|
  dropConduit 10 .|
  takeConduit 5 .>
  tableSink

Output:

fl_date     tail_num
2014-01-01  N002AA
2014-01-01  N3FXAA
2014-01-01  N906EV
2014-01-01  N903EV
2014-01-01  N903EV

Breaking this down: src .| c .| c .> sink can be read like the UNIX pipeline src | c | c > sink.
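Since labels-explore is a thin wrapper over conduit, a plausible reading of these operators (an assumption about the wrapper's internals, not its published source) is:

import Data.Conduit (ConduitM, runConduit, (=$=))
import Data.Void (Void)

-- (.|) fuses two pipeline stages; (.>) fuses the final stage and then
-- runs the assembled pipeline.
(.|) :: Monad m => ConduitM a b m () -> ConduitM b c m r -> ConduitM a c m r
(.|) = (=$=)
infixr 2 .|

(.>) :: Monad m => ConduitM () a m () -> ConduitM a Void m r -> m r
src .> sink = runConduit (src =$= sink)
infixr 1 .>

With these fixities, src .| c1 .| c2 .> sink groups the conduits first and hands the whole pipeline to runConduit last, just like a shell redirect at the end of a pipe.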

The steps are:

  • Make a web request for the zip file and yield a stream of bytes.
  • Unzip that stream of bytes and yield a stream of bytes of the CSV.
  • Parse from CSV into records of type ("fl_date" := Day, "tail_num" := String).
  • Specify the downcase option so we can deal with lower-case names.
  • Drop the first 10 results.
  • Take 5 of the remaining results.
  • Print the table out.

In this library, the naming convention for parts of the pipeline is:

  • fooSource -- something which is at the beginning of the pipeline, a source of streaming input.
  • fooConduit -- something which connects two parts of the pipeline together and perhaps does some transformations (such as parsing the Zip, CSV or other things).
  • fooSink -- something into which all streaming input is poured, and produces a final result.

Haskell: good parts

What's good about the Haskell version:

  • Reading the HTTP request was trivial.
  • It's statically typed. If I try to multiply the fl_date as a number, for example, or mistakenly write fl_daet, I'll get a compile error before ever running the program.
  • It achieves it all in a streaming fashion, with constant memory usage.

How is it statically typed? Here:

fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    csv

We've statically told fromCsvConduit the exact type of record to construct: a record of two fields fl_date and tail_num with types Day and String. Below, we'll look at accessing those fields in an algorithm and demonstrate the safety aspect of this.
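As a quick feel for these anonymous records (a minimal sketch, assuming the labels package's API as used throughout this post):

{-# LANGUAGE OverloadedLabels, TypeOperators, DataKinds #-}

import Labels
import Data.Time.Calendar (Day, fromGregorian)

-- Build a record without declaring or naming any record type beforehand:
row :: ("fl_date" := Day, "tail_num" := String)
row = (#fl_date := fromGregorian 2014 1 1, #tail_num := "N002AA")

-- Field access is checked at compile time: get #fl_date has type Day
-- here, and a misspelled label simply fails to type-check.
flightTail :: String
flightTail = get #tail_num row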

Swapping out pipeline parts

We can also easily switch to reading from file. Let's write that URL to disk, uncompressed:

main =
  runResourceT
    (httpSource "http://chrisdone.com/ontime.csv.zip" responseBody .|
     zipEntryConduit "ontime.csv" .>
     fileSink "ontime.csv")

Now our reading becomes:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    (set #downcase True csv) .|
  dropConduit 10 .|
  takeConduit 5 .>
  tableSink

Data crunching

It's easy to perform more detailed calculations. For example, to display the number of total flights, and the total distance that would be travelled, we can write:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit @("distance" := Double) (set #downcase True csv) .|
  sinkConduit
    (foldSink
       (\table row ->
          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
       (#flights := (0 :: Int), #distance := 0)) .>
  tableSink

The output is:

flights  distance
471949   372072490.0

Above we made our own sink which consumes all the rows and then yields its result downstream to the table sink, so that we get the nice table display at the end.
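The post doesn't show how sinkConduit and foldSink are defined; a plausible shape in terms of plain conduit (an assumption about labels-explore, not its actual source) is:

import Data.Conduit (Conduit, Sink, toConsumer, yield)
import qualified Data.Conduit.List as CL

-- Run a sink to completion, then yield its result as the single element
-- of a new stream, so a further sink (tableSink above) can consume it.
sinkConduit :: Monad m => Sink a m r -> Conduit a m r
sinkConduit sink = toConsumer sink >>= yield

-- A left fold over every element of the stream:
foldSink :: Monad m => (b -> a -> b) -> b -> Sink a m b
foldSink = CL.fold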

Type correctness

Returning to our safety point, imagine above we made some mistakes.

First mistake, I wrote modify #flights twice by accident:

-          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
+          modify #flights (+ 1) (modify #flights (+ get #distance row) table))

Before running the program, the following message would be raised by the Haskell type checker:

• Couldn't match type ‘Int’ with ‘Double’
  arising from a functional dependency between:
  constraint ‘Has "flights" Double ("flights" := Int, "distance" := value0)’
  arising from a use of ‘modify’

See below for where this information comes from in the code:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  --
  --              The distance field is actually a double
  --                             ↓
  --
  fromCsvConduit @("distance" := Double) (set #downcase True csv) .|
  sinkConduit
    (foldSink
       (\table row ->
          modify #flights (+ 1) (modify #flights (+ get #distance row) table))
  --
  -- But we're trying to modify `#flights`, which is an `Int`.
  --                      ↓
  --
       (#flights := (0 :: Int), #distance := 0)) .>
  tableSink

Likewise, if we misspelled #distance as #distant in our algorithm:

-          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
+          modify #flights (+ 1) (modify #distance (+ get #distant row) table))

We would get this error message:

No instance for (Has "distant" value0 ("distance" := Double))
arising from a use of ‘get’

Summarizing:

  • Values are parsed from the CSV at their declared types.
  • Fields must exist if we're accessing them.
  • We can't mismatch types.

All this adds up to more maintainable software, and yet we didn't have to state any more than necessary!

Grouping

If instead we'd like to group by a field, in pandas it's like this:

first = df.groupby('airline_id')[['fl_date', 'unique_carrier']].first()
first.head()

We simply update the type of our csv value, adding the additional fields we want to parse:

csv :: Csv ("fl_date" := Day, "tail_num" := String
           ,"airline_id" := Int, "unique_carrier" := String)

And then our pipeline instead becomes:

fromCsvConduit
  @("fl_date" := Day, "tail_num" := String,
    "airline_id" := Int, "unique_carrier" := String)
  (set #downcase True csv) .|
groupConduit #airline_id .|
explodeConduit .|
projectConduit @("fl_date" := _, "unique_carrier" := _) .|
takeConduit 5 .>
tableSink

  • We added the two new fields to be parsed.
  • We grouped the stream by the #airline_id field into a stream of lists of rows: e.g. [x,y,z,a,b,c] becomes [[x,y],[z,a],[b,c]].
  • We explode those groups [[x,y],[z,a],[b,c],...] back into a stream of each group's elements: [x,y,z,a,b,c,...] (see the pure sketch after this list).
  • We projected a new record type just for the table display to include fl_date and unique_carrier. The types are to be left as-is, so we use _ to mean "you know what I mean". This is like SELECT fl_date, unique_carrier in SQL.
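Before looking at the pipeline's output, here is a pure list analogue of what groupConduit and explodeConduit do to the stream (an illustration only; the real operators work on conduit streams, and groupConduit's exact grouping strategy is an assumption):

import Data.Function (on)
import Data.List (groupBy)

-- Group consecutive rows that share a key:
groupOn :: Eq k => (row -> k) -> [row] -> [[row]]
groupOn key = groupBy ((==) `on` key)

-- Flatten the groups back into a single stream:
explode :: [[row]] -> [row]
explode = concat

-- groupOn fst [(1,'a'),(1,'b'),(2,'c')] == [[(1,'a'),(1,'b')],[(2,'c')]]
-- explode [[1,2],[3,4]]                 == [1,2,3,4]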

Output:

unique_carrier  fl_date
AA              2014-01-01
AA              2014-01-01
EV              2014-01-01
EV              2014-01-01
EV              2014-01-01

The Python blog post states that a further query upon that result,

first.ix[10:15, ['fl_date', 'tail_num']]

yields an unexpected empty data frame, due to strange indexing behaviour in pandas. But ours works out fine: we just drop 10 elements from the input stream and project tail_num instead:

dropConduit 10 .|
projectConduit @("fl_date" := _, "tail_num" := _) .|
takeConduit 5 .>
tableSink

And we get

fl_date     tail_num
2014-01-01  N002AA
2014-01-01  N3FXAA
2014-01-01  N906EV
2014-01-01  N903EV
2014-01-01  N903EV

Conclusion

In this post we've demonstrated:

  1. Concise handling of a chain of tasks, as smoothly as a bash script.
  2. Doing all of the above in constant memory.
  3. Doing so with a type-safe parser, specifying our types statically, but without having to declare or name any record type ahead of time.

This has been a demonstration rather than a finished product. Haskell needs work in this area, and the examples in this post are not yet performant (though they could be made so); such work would be very fruitful.

Are the advantages of using Haskell something you're interested in? If so, contact us at FP Complete.

September 14, 2016 07:00 AM

September 13, 2016

Functional Jobs

Software Engineer/Researcher at Galois Inc (Full-time)

We are currently seeking software engineers/researchers to play a pivotal role in fulfilling our mission to make critical systems trustworthy.

Galois engineers participate in one or more projects concurrently, and specific roles vary greatly according to skills, interests, and company needs. Your role may include technology research and development, requirements gathering, implementation, testing, formal verification, infrastructure development, project leadership, and/or supporting new business development.

Skills & Requirements

Education – Minimum of a Bachelor’s degree in computer science or equivalent. MS or PhD in CS or a related field desirable but optional, depending on specific role.

Required Technical Expertise – Must have hands-on experience developing software and/or performing computer science research. Demonstrated expertise in aspects of software development mentioned above.

Desired Technical Expertise – Fluency in the use of formal or semi-formal methods such as Haskell or other functional programming languages. Direct experience in developing high assurance systems and/or security products. Experience with identity management, security risk analysis, systems software, or networking.

Required General Skills – Must work well with customers, including building rapport, identifying needs, and communicating with strong written, verbal, and presentation skills. Must be highly motivated and able to self-manage to deadlines and quality goals.

Our engineers develop in programming languages including functional languages, designing and developing advanced technologies for safety- and security-critical systems, networks, and applications. Engineers work in small team settings and must successfully interact with clients, partners, and other employees in a highly cooperative, collaborative, and intellectually challenging environment.

We’re looking for people who can invent, learn, think, and inspire. We reward creativity and thrive on collaboration.

Get information on how to apply for this position.

September 13, 2016 10:43 PM

Edward Z. Yang

Seize the Means of Production (of APIs)

There's a shitty API and it's ruining your day. What do you do?

Without making a moral judgment, I want to remark that there is something very different about these two approaches. In Dropbox's case, Dropbox has no (direct) influence on what APIs Apple provides for its operating system. So it has no choice but to work within the boundaries of the existing API. (When Apple says jump, you say, "How high?") But in Adam's case, POSIX is implemented by an open source project Linux, and with some good ideas, Adam could implement his new interface on top of Linux (avoiding the necessity of writing an operating system from scratch.)

APIs cross social boundaries: there is the proletariat that produces software using an API and the bourgeoisie that controls the API. When the Man(TM) is a big corporation, our only choices are to work around their shitty API or pay them sufficiently large amounts of money to fix their shitty APIs. But when the Man(TM) is an open source project, your choices change. True, you could work around their shitty API. Or you could seize the means of production, thereby enabling you to fix the shitty API.

What do I mean by seize the means of production? Indeed, what are the means of production? An open source API does not live in a vacuum; it is made useful by the software that provides the API, the developers who contribute their time and expertise to maintain this software, even the publicity platform which convinces people to use an API. To seize the means of production is to gain control of all of these aspects. If you can convince the establishment that you are a core contributor of a piece of open source software, you in turn gain the ability to fix the shitty APIs. If you are unwilling or unable to do so, you might still fork, vendor or rewrite the project, but this is not so much seizing the means of production as it is recreating it from scratch. Another possibility is to build the abstraction you need on top of the existing APIs (as Adam did), though you are always at risk of the original project not being sensitive to your needs.

Time and time again, I see people working with open source projects who refuse to seize the means of production. Instead, they are willing to write increasingly convoluted workarounds to solve problems, all to stay within the boundaries. You might say, "This is just Getting Things Done(TM)", but at some point you will have done more work working around the problem, than you would have spent just fixing the damn thing.

So stop it. Stop putting up with shitty APIs. Stop working within the boundaries. Seize the means of production.

Counterpoint.

  1. What is advocated for in this post is nothing less than infinite yak shaving; if you take the advice seriously you will proceed to never get anything done ever again.
  2. It may be true that in aggregate, the cost of working around a bad API exceeds the cost to fix it, but for any individual the cost is generally less. Thus, even if you could perfectly predict the cost of a workaround versus a proper fix, individual incentives would prevent the proper fix.
  3. Users (including developers) don't know anything about the software they use and are ill-equipped to design better APIs, even if they know where it hurts.
  4. Rarely can you unilaterally seize the means of production. In an ideal world, to become a core contributor, it would merely be sufficient to demonstrate sustained, useful contributions to a project. We all know the real world is more messy.

by Edward Z. Yang at September 13, 2016 05:09 AM

Sandy Maguire

Better Data Types a la Carte

To be honest with you, my approach to procedurally generating RPG stories has been stymied until very recently. Recall the command functor:

data StoryF a = Change Character ChangeType (ChangeResult -> a)
              | forall x y. Interrupt (Free StoryF x) (Free StoryF y) (y -> a)
              | -- whatever else

This recursively defined Interrupt command has caused more than its fair share of grief. The idea is that it should represent one potential line of action being interrupted by another. The semantics are rather hazy, but this should result in grafting the Free StoryF y monad somewhere inside of the Free StoryF x monad. Once we’ve done whatever analysis on the original story, we can then forget that the Interrupt bit was ever there in the first place.

In effect, we want this:

data StoryF' a = Change Character ChangeType (ChangeResult -> a)
               | -- whatever else

runInterrupt :: StoryF a -> StoryF' a
runInterrupt = -- ???

where runInterrupt’s job is to remove any instances of the Interrupt command from our story – replacing them with the “canonical” version of what actually happened.

Of course, we could just remove all of the Interrupt data constructors from our Free StoryF a object, and rely on convention to keep track of that for us. However, if you’re like me, whenever the phrase “by convention” is used, big alarm bells should go off in your head. Convention isn’t enforced by the compiler, and so anything maintained “by convention” is highly suspect to bugs.

What would make our lives better is if we could define StoryF and StoryF' somehow in terms of one another, so that there’s no hassle to keep them in sync with one another. Even better, in the future, maybe we’ll want to remove or add other constructors as we interpret our story.

What we really want to be able to do is to mix and match individual constructors into one larger data structure, which we can then transform as we see fit.

Fortunately for us, the machinery for this has already been built. It’s Swierstra’s Data Types a la Carte (henceforth DTalC) – essentially a set of combinators capable of composing data types together, and tools for working with them in a manageable way.

Unfortunately for us, Data Types a la Carte isn’t as type-safe as we’d like it to be. Additionally, it’s missing (though not fundamentally) the primitives necessary to remove constructors.1

This post presents a variation of DTalC which is type-safe, and contains the missing machinery.

But first, we’ll discuss DTalC as it is described in the original paper, in order to get a feeling for the approach and where the problems might lie. If you know how DTalC works already, consider skipping to the next heading.

Data Types a la Carte

Data Types a la Carte presents a novel strategy for building data types out of other data types with kind2 * -> *. A code snippet is worth a thousand words, so let’s dive right in. Our StoryF command functor as described above would instead be represented like this:

data ChangeF    a = Change Character ChangeType (ChangeResult -> a)
data InterruptF a = forall x y.
                    Interrupt (Free StoryF x) (Free StoryF y) (y -> a)

type StoryF = ChangeF :+: InterruptF

Here, (:+:) is the type operator which composes data types together into a sum type (there is a corresponding (:*:) for products, but we’re less interested in it today.)

Because the kindedness of (:+:) lines up with that of the data types it combines, we can nest (:+:) arbitrarily deep:

type Something = Maybe :+: Either Int :+: (,) Bool :+: []

In this silly example, Something a might be any of the following:

  • Maybe a
  • Either Int a
  • (Bool, a)
  • [a]

but we can’t be sure which. We will arbitrary decide that (:+:) is right-associative – although it doesn’t matter in principle (sums are monoidal), part of our implementation will depend on this fact.

Given a moment, if you’re familiar with Haskell, you can probably figure out what the machinery must look like:

data (f :+: g) a = InL (f a)
                 | InR (g a)
                 deriving Functor
infixr 8 :+:

(:+:) essentially builds a tree of data types, and then you use some combination of InL and InR to find the right part of the tree to use.

However, in practice, this becomes annoyingly painful and tedious; adding new data types can completely shuffle around your internal tree structure, and unless you're careful, things that used to compile no longer will.

But fear not! Swierstra has got us covered!

class (Functor sub, Functor sup) => sub :<: sup where
    inj :: sub a -> sup a

This class (and its instances) say that f :<: fs means that the data type f is nestled somewhere inside of the big sum type fs. Furthermore, it gives us a witness to this fact, inj, which lifts our small data type into our larger one. With some clever instances of this typeclass, inj will expand to exactly the right combination of InL and InRs.

These instances are:

instance Functor f => f :<: f where
    inj = id

instance (Functor f, Functor g) => f :<: (f :+: g) where
    inj = InL

instance {-# OVERLAPPABLE #-}
         (Functor f, Functor g, Functor h, f :<: g) => f :<: (h :+: g) where
    inj = InR . inj

The first one states “if there’s nowhere left to go, we’re here!”. The second: “if our desired functor is on the left, use InL”. The third is: “otherwise, slap a InR down and keep looking”.

And so, we can now write our smart constructors in the style of:

change :: (ChangeF :<: fs) => Character -> ChangeType -> Free fs ChangeResult
change c ct = liftF . inj $ Change c ct id

which will create a Change constructor in any data type which supports it (witnessed by ChangeF :<: fs).

Astute readers will notice immediately that the structural induction carried out by (:<:) won’t actually find the desired functor in any sum tree which isn’t right-associative, since it only ever recurses right. This unfortunate fact means that we must be very careful when defining DTalC in terms of type aliases.

In other words: we must adhere to a strict convention in order to ensure our induction will work correctly.

Better Data Types a la Carte

The problem, of course, is caused by the fact that DTalC can be constructed in ways that the structural induction can’t handle. Let’s fix that by constraining how DTalCs are constructed.

At the same time, we'll add the missing inverse of inj, namely outj :: (f :<: fs) => fs a -> Maybe (f a)3, which we'll need later to remove constructors; it's missing from Swierstra's method, though not for any fundamental reason.

On the surface, our structural induction problem seems to be that we can only find data types in right-associative trees. But since right-associative trees are isomorphic to lists, the real flaw is that we’re not just using lists in the first place.

With the help of {-# LANGUAGE DataKinds #-}, we can lift lists (among other term-level things) to the type level. Additionally, using {-# LANGUAGE TypeFamilies #-}, we’re able to write type-level functions – functions which operate on and return types!

We define a type class with an associated data family:

class Summable (fs :: [* -> *]) where
    data Summed fs :: * -> *

Here fs is a type, as is Summed fs. Take notice, however, of the explicit kind annotations: fs is a list of things that look like Functors, and Summed fs looks like one itself.

Even with all of our fancy language extensions, a type class is still just a type class. We need to provide instances of it for it to become useful. The obvious case is if fs is the empty list:

instance Summable '[] where
    data Summed '[] a = SummedNil Void
                      deriving Functor

The funny apostrophe in '[] indicates that what we’re talking about is an empty type-level list, rather than the type-constructor for lists. The distinction is at the kind level: '[] :: [k] for all kinds k, but [] :: * -> *.

What should happen if we try to join zero data types together? This is obviously crazy, but since we need to define it to be something, we make it wrap Void. Since Void doesn't have any inhabitants at the term-level, it is unconstructible, and thus so too is SummedNil.

But what use case could an unconstructible type possibly have? By itself, nothing, but notice that Either a Void must be Right a, since the Left branch can never be constructed. Now consider that Either a (Either b Void) is isomorphic to Either a b, but has the nice property that its innermost data constructor is always Left (finding the a is Left, and finding b is Right . Left).
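Concretely, the isomorphism can be written out like this (a small illustration of the point, not code from the original post):

import Data.Void (Void, absurd)

-- Either a (Either b Void) has exactly the inhabitants of Either a b:
-- the innermost Right can never be built, so absurd discharges it.
flatten :: Either a (Either b Void) -> Either a b
flatten (Left a)          = Left a
flatten (Right (Left b))  = Right b
flatten (Right (Right v)) = absurd v  -- unreachable: Void has no values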

Let’s move to the other case for our Summable class – when fs isn’t empty:

instance Summable (f ': fs) where
    data Summed (f ': fs) a = Here (f a)
                            | Elsewhere (Summed fs a)

Summed for a non-empty list is either Here with the head of the list, or Elsewhere with the tail of the list. For annoying reasons, we need to specify that Summed (f ': fs) is a Functor in a rather obtuse way:

instance Summable (f ': fs) where
    data Summed (f ': fs) a = Functor f => Here (f a)
                            | Elsewhere (Summed fs a)

{-# LANGUAGE StandaloneDeriving #-}
deriving instance Functor (Summed fs) => Functor (Summed (f ': fs))

but this now gives us what we want. Summed fs builds a nested sum-type from a type-level list of data types, and enforces (crucially, not by convention) that they form a right-associative list. We now turn our attention to building the inj machinery a la Data Types a la Carte:

class Injectable (f :: * -> *) (fs :: [* -> *]) where
    inj :: f a -> Summed fs a

We need to write instances for Injectable. Note that there is no instance Injectable f '[], since Summed '[] is unconstructible.

instance Functor f => Injectable f (f ': fs) where
    inj = Here

instance {-# OVERLAPPABLE #-} Injectable f fs => Injectable f (g ': fs) where
    inj = Elsewhere . inj

These instances turn out to be very inspired by the original DTalC. This should come as no surprise, since the problem was with our construction of (:+:) – which we have now fixed – rather than our induction on (:<:).

At this point, we could define an alias between f :<: fs and Injectable f fs, and call it a day with guaranteed correct-by-construction data types a la carte, but we’re not quite done yet.

Remember, the original reason we dived into all of this mumbo jumbo was in order to remove data constructors from our DTalCs. We can’t do that yet, so we’ll need to set out on our own.

We want a function outj :: Summed fs a -> Maybe (f a) which acts as a prism into our a la carte sum types. If our Summed fs a is constructed by a f a, we should get back a Just – otherwise Nothing. We define the following type class:

class Outjectable (f :: * -> *) (fs :: [* -> *]) where
    outj :: Summed fs a -> Maybe (f a)

with instances that again strongly resemble DTalC:

instance Outjectable f (f ': fs) where
    outj (Here fa)     = Just fa
    outj (Elsewhere _) = Nothing

instance {-# OVERLAPPABLE #-} Outjectable f fs => Outjectable f (g ': fs) where
    outj (Here _ )      = Nothing
    outj (Elsewhere fa) = outj fa

The first instance says, “if what I’m looking for is the head of the list, return that.” The other says, “otherwise, recurse on an Elsewhere, or stop on a Here.”
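To see the two instances in action, here is a tiny round trip (my own illustration, assuming the definitions above are in scope):

{-# LANGUAGE DataKinds #-}

hit :: Maybe (Maybe Int)
hit = outj (inj (Just 5) :: Summed '[Maybe, []] Int)
-- Just (Just 5): inj placed the value at Here, and outj found it there.

miss :: Maybe [Int]
miss = outj (inj (Just 5) :: Summed '[Maybe, []] Int)
-- Nothing: the summed value was built from Maybe, not from [].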

And all that’s left is to package all of these typeclasses into something more easily pushed around:

class ( Summable fs
      , Injectable f fs
      , Outjectable f fs
      , Functor (Summed fs)
      ) => (f :: * -> *) :<: (fs :: [* -> *])
instance ( Summable fs
         , Injectable f fs
         , Outjectable f fs
         , Functor (Summed fs)
         ) => (f :<: fs)

This is a trick I learned from Edward Kmett’s great talk on Monad Homomorphisms, in which you build a class that has all of the right constraints, and then list the same constraints for an instance of it. Adding the new class as a constraint automatically brings all of its dependent constraints into scope; f :<: fs thus implies Summable fs, Injectable f fs, Outjectable f fs, and Functor (Summed fs) in a much more terse manner.

As a good measure, I wrote a test that outj is a left-inverse of inj:

injOutj_prop :: forall fs f a. (f :<: fs) => Proxy fs -> f a -> Bool
injOutj_prop _ fa = isJust $ (outj (inj fa :: Summed fs a) :: Maybe (f a))

{-# LANGUAGE TypeApplications #-}
main = quickCheck (injOutj_prop (Proxy @'[ []
                                         , Proxy
                                         , Maybe
                                         , (,) Int
                                         ]) :: Maybe Int -> Bool)

where we use the Proxy fs to drive type checking for the otherwise hidden fs from the type signature in our property.

And there you have it! Data types a la carte which are guaranteed correct-by-construction, which we can automatically get into and out of. In the next post we’ll look at how rewriting our command functor in terms of DTalC solves all of our Interrupt-related headaches.

A working version of all this code together can be found on my GitHub repository.


  1. EDIT 2016-09-14: After re-reading the paper, it turns out that it describes (though doesn’t implement) this functionality.

  2. For the uninitiated, kinds are to types as types are to values – a kind is the “type” of a type. For example, Functor has kind * -> * because it doesn’t become a real type until you apply a type to it (Functor Int is a type, but Functor isn’t).

  3. I couldn’t resist the fantastic naming opportunity.

September 13, 2016 12:00 AM

September 12, 2016

Ketil Malde

HP EliteBook Folio G1

I got a new laptop the other day. My aging Thinkpad X220 was beginning to wear out, keys were falling off and the case was starting to crack here and there. Sometimes the system log would contain ominous messages about temperatures and such. So it was finally replaced with a new laptop, a HP EliteBook Folio G1.

EliteBook Folio G1 as presented by HP.

This is a Mac-a-like thin little thing, very shiny and sleek compared to the Thinkpad's rough and rugged exterior. They are both "business models", where the Thinkpad has a look that clearly means business, the HP has more the "executive" feel, and there is probably a matching tie pin and cuff links somewhere near the bottom of the very elaborate cardboard package.

I took care to order the model with the non-touch, full-HD display, and not the one with a 4K touch display. I don't really see the point of a touch display on a PC, and I'd rather have the lighter weight, less reflective screen, and longer battery life. And it's only 12 inches or so, how small do pixels need to get, really? Full HD is something like 180 dpi, I'm pretty sure it suffices.

Did I mention shiny? I was surprised, and somewhat disappointed to unpack it and discover that the display is indeed very reflective. It's nice and bright (especially compared to the totally mediocre display on the old Lenovo), but you still need to dress like Steve Jobs (i.e. black sweater) to see anything else than your own reflection. My suspicions were further aroused when I discovered the screen actually was a touch screen, after all. I then weighed the thing, and true enough, it is ten percent heavier than it should be.

After going back and checking the model number (which was correct) and the description at the seller (which was clearly wrong), it was made clear that, although they had given the wrong specifications, a return was not possible. So that's that, I guess: I'm stuck with a heavier laptop with a less usable screen than I thought I would get. So be warned: this model has a reflective touch display, even though the specs don't say so.

The rest of the hardware is okay, for the most part. It's still fairly light at just a bit over one kg, and apart from its two USB-C ports, the only other connector is an audio jack. The chicklet keyboard is...not very good, compared to the Lenovo's, but I guess I'll learn to accept it. At least it's better than Apple's very shallow version. Oh, and I really like the fanless design. It's not a quiet PC - it's a silent one, and the only moving parts are the keyboard keys.

Apple may have more fanatic users, but they too don't much care for the Mac built-in keyboard.

Installing Linux

As an old-time Linux user, of course I wanted to install my favorite distribution. My previous computer had Debian installed, but I thought I'd try the latest Ubuntu for a (very small) change. Since I am particular in what software I use, and have little use for complex "desktop environments", my usual modus operandi is to install a minimal system -- meaning a Ubuntu server install -- and then pull whatever software I need.

So I copied the server install image to a USB stick (it took me ages of meddling with specialized and buggy USB stick creator software before I realized you can simply dump an ISO directly with something like dd), and it happily booted, installed, and rebooted. Only to discover that networking is not supported: not the built-in wifi, nor the USB brick's ethernet. Really, Ubuntu?

Microsoft Windows tries to "repair" my Linux install, but gave up after a long period of trying.

Disappointed, I decided to try Debian, which provides a 'netinst' image. Unfortunately, the computer's UEFI copy protection fascist control skynet terminator system kicked in, and Debian images are non grata. So booting into the BIOS, meddle around with UEFI settings and "secure boot", to get this to work. For some reason, I needed to do this several times before it would take, and although I'm pretty sure the BIOS initially identified my USB stick by manufacturer and model, it now gave some generic description. It also wouldn't boot it directly, but there was a "boot from file" option I could use. But of course, installation failed here as well, network configuration didn't work.

Back to Ubuntu, the desktop image this time. And although I still needed to "boot from file" and dig around for an EFI image (or some such), it now managed to configure the network properly, and the install went pretty smoothly from there. I am now running Ubuntu 16.04, and wondering why the server image doesn't come with the same drivers as the desktop image. As long as the network is up, everything else can be fixed - and unless you are installing from a stack of DVDs (something I vaguely remember doing back in a previous century), you can't get anywhere without networking.

Software

Not being a fan of desktop environments, I prefer to run the Xmonad window manager. After apt-getting it, I can select it at login, and things work the way they should.

That is: almost.

Since I don't have a graphical tool to inspect or configure networking, I previously used wpa_supplicant for this. Running this in the background (as root, of course), it reads a config file where all my networks are configured. When it connects and authenticates to a network, I can then run a DHCP client to obtain an IP address and things like the DNS server.

My initial attempt at this config failed, so I tried NetworkManager instead. That didn't work too well the last time I tried it, which I think is the primary reason I used wpa_supplicant. The nmcli command connects with the NetworkManager daemon, and allows me to connect to networks with something like:

      nmcli dev wifi connect my_ssid password my_psk_key

NetworkManager stores this information in /etc/NetworkManager/system-connections, not unlike the wpa_supplicant config file, and after adding things once, it switches automatically - and unlike wpa_supplicant, also between other network interfaces. And it uses dnsmasq to proxy DNS queries, also in a smart manner (as far as I can tell).

A new install is also a good time to revisit configurations. I had a setup where XMonad's terminal command would bring up an

    uxterm +sb -font '-*-fixed-medium-r-*-*-14-*-*-*-*-*-*-*'

but that font is way too small, and doesn't look very good. After some googling, I ended up with XTerm and specifically:

    xterm -fn 7x13 -fa 'Liberation Mono:size=12:antialias=false' +rv

which I think looks pretty good. (Not sure why I have the antialias option; changing it to 'true' doesn't appear to change things, and fonts look to be antialiased regardless.)

Docking

I got a docking station as well. Unlike the old Lenovo, where you press the whole PC down in a specialized dock, the new one is just a smallish brick with a bunch of connectors. It hooks up to the PC with a single USB-C cable, which supplies both power, and...well, USB. Running lsusb shows some devices, including a virtual disk, an ethernet interface, and something called a DisplayLink.

    Bus 004 Device 003: ID 17e9:4354 DisplayLink 

Ethernet and USB appear to work out of the box, but the external monitor doesn't show up, and 'xrandr' only displays the built-in display. Googling "displaylink" quickly brings one to http://www.displaylink.com/ which also has downloadable Linux drivers for Ubuntu. Distributed as a self-extracting shell script - haven't seen one of those for a while. First I needed to install dkms, to get support for building kernel modules, and after that, the DisplayLink driver made the external monitor available to xrandr configuration - and rescaled my screen and cloned it to the external display. So far so good.

There is a downside, however. The external display introduces a noticeable lag, and the DisplayLink daemon often hogs the CPU quite a bit. I also think it incurs a cost to graphically intensive programs, but so far I only noticed that chromium appears more CPU-greedy, so I could be wrong. There's a bunch of "WARNING" dumps in syslog, in addition to messages like this:

    [107073.458870] evdi: [W] collapse_dirty_rects:102 Not enough space for clip rects! Rects will be collapsed

so I find it a bit hard to entirely trust this system. But at least it works.

Battery

Battery was supposed to be reasonable. Much of this depends on the CPU being able to enter sleep states. The obvious thing to do is to use 'powertop', go to the rightmost tab, and enable all kinds of hardware and device power saving. But userspace is also important, and I find that my web browser - typically with tens of tabs open on sites peppered with javascript and flash and whatnot - tend to keep the system very busy. Chromium consists of a giant cluster of threads, but as I start it from a terminal, I just bring it to the foreground and hit ctrl-Z.

Having done that, I recheck powertop, and apparently the wifi interface and the bluetooth keep interrupting the system. I don't think the function button for disabling wireless works, but unloading the iwlwifi module brings the power use down from 6W (about five hours) to 5.8W (five and a half)1. A kill -STOP on the DisplayLink process brought it down to 4.4W (over seven hours), but then it bounced back to 5.4W (6 hours). Dimming the backlight gets it down to 4.2W - but, okay, a dark screen and a PC doing nothing - how useful is that? In the end, what matters is how long a flight I can take and still expect to be able to use my computer, and only experience will tell.

And one nice thing about USB-C is that with a universal standard, chances are someone nearby will have a compatible charger - much like for mobile phones these days. And another nice thing is that it is already possible to buy external battery packs with USB-C that will keep your computer juiced up even for really long haul flights.

In the end?

Some things are still not working (e.g. the special keys to adjust display brightness or switch off wireless - strangely, the key to dim the keyboard backlight works), and other things are funky or unpredictable (e.g. that you must use the "fn" key to use the function keys F1 to F12, or the direction in which two-finger scroll works). But by and large, things work pretty well, really.


  1. And reinserting the iwlwifi module, the device pops back into existence, and after a few seconds, NetworkManager has me reconnected. Nice!

September 12, 2016 10:00 AM

Michael Snoyman

Monads are like Lannisters

As many people are likely aware, monads (incorrectly) have a bad rap in the programming community for being difficult to learn. A string of extremely flawed monad tutorials based on analogies eventually led to a blog post by Brent Yorgey about the flaws of this analogy-based approach. And we've seen great learning materials on Haskell and monads.

However, I'm disappointed to see the analogy-based route disappear. Based on a recent Twitter poll I ran, which is held to the highest levels of scientific and statistical scrutiny, it's obvious that there is very high demand for a rigorous analogy-based monad tutorial. I claim that the flaw in all previous analogy based tutorials is lack of strong pop culture references. Therefore, I'm happy to announce the definitive guide to monads: Monads are like Lannisters.

Spoiler alert: if you haven't completed book 5 or season 6 of Haskell yet, there will be spoilers.

Prereqs

The examples below will all be Haskell scripts that can be run with the Stack build tool. Please grab Stack to play along. Copy-paste the full example into a file like foo.hs and then run it with stack foo.hs. (This uses Stack's script interpreter support.)

Hear me roar

Many people believe that the Lannister house words are "A Lannister always pays his debts" (or her debts, of course). We'll get to that line in a moment. This belief, however, is false: the true Lannister house words are Hear me roar. So let's hear a Lannister roar:

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

roar :: IO ()
roar = putStrLn "Roar"

main :: IO ()
main = do
    roar -- Tywin
    roar -- Cersei
    roar -- Jaime
    roar -- Tyrion

Roaring is clearly an output, and therefore it makes sense that our action is an IO action. But roaring doesn't really do much besides making sound, so its result is the unit value (). In our main function, we use do notation to roar multiple times. But we can just as easily use the replicateM_ function to replicate a monadic action multiple times and discard (that's what the _ means) the results:

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import Control.Monad (replicateM_)

roar :: IO ()
roar = putStrLn "Roar"

main :: IO ()
main = replicateM_ 4 roar

Tyrion, the scholar

As we all know, Tyrion is a prolific scholar, consuming essentially any book he can get his hands on (we'll discuss some other consumption next). Fortunately, monads are there to back him up, with the Reader monad. Let's say that Tyrion is doing some late night research on wine production (epic foreshadowment) in various kingdoms, and wants to produce a total:

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import Data.Map (Map)
import qualified Data.Map as Map
import Control.Monad.Trans.Reader
import Data.Maybe (fromMaybe)

type Kingdom = String
type WineCount = Int
type WineData = Map Kingdom WineCount

tyrionResearch :: Reader WineData Int
tyrionResearch = do
    mnorth <- asks $ Map.lookup "north"
    mriverlands <- asks $ Map.lookup "riverlands"
    return $ fromMaybe 0 mnorth + fromMaybe 0 mriverlands

main :: IO ()
main = print $ runReader tyrionResearch $ Map.fromList
    [ ("north", 5)
    , ("riverlands", 10)
    , ("reach", 2000)
    , ("dorne", 1000)
    ]

While Tyrion may have chosen inferior kingdoms for wine production, it does not take away from the fact that the Reader type has allowed him access to data without having to explicitly pass it around. For an example this small (no dwarf joke intended), the payoff isn't great. But Reader is one of the simplest monads, and can be quite useful for larger applications.

Tyrion, the drinker

Let's try another one. We all know that Tyrion also likes to drink wine, not just count it. So let's use our Reader monad to give him access to a bottle of wine.

However, unlike reading data from a book, drinking from a bottle actually changes the bottle. So we have to leave our pure Reader world and get into the ReaderT IO world instead. Also known as monad transformers. (Unfortunately I don't have time now for a full post on it, but consider: Transformers: Monads in Disguise.)

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import Control.Monad (replicateM_)
import Control.Monad.Trans.Reader
import Control.Monad.IO.Class
import Control.Concurrent.Chan

data Wine = Wine
type Bottle = Chan Wine

drink :: MonadIO m => Bottle -> m ()
drink bottle = liftIO $ do
    Wine <- readChan bottle
    putStrLn "Now I'm slightly drunker"

tyrionDrinks :: ReaderT Bottle IO ()
tyrionDrinks = replicateM_ 10 $ ReaderT drink

main :: IO ()
main = do
    -- Get a nice new bottle
    bottle <- newChan

    -- Fill up the bottle
    replicateM_ 20 $ writeChan bottle Wine

    -- CHUG!
    runReaderT tyrionDrinks bottle

A Lannister always pays his debts

What is a debt, but receiving something from another and then returning it? Fortunately, there's a monad for that too: the State monad. It lets us take in some value, and then give it back - perhaps slightly modified.

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import Control.Monad.Trans.State

type Sword = String

forge :: State Sword Sword
forge = do
    "Ice" <- get
    -- do our Valyrian magic, and...

    -- Repay our debt to Brienne
    put "Oathkeeper"

    -- And Tywin gets a sword too!
    return "Widows Wail"

killNedStark :: IO Sword
killNedStark = do
    putStrLn "Off with his head!"
    return "Ice"

main :: IO ()
main = do
    origSword <- killNedStark

    let (forTywin, forBrienne) = runState forge origSword

    putStrLn $ "Tywin received: " ++ forTywin
    putStrLn $ "Jaime gave Brienne: " ++ forBrienne

Not exactly justice, but the types have been satisfied! A monad always pays its debts.

Throwing

Monads are also useful for dealing with exceptional cases:

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import Control.Exception
import Data.Typeable (Typeable)

data JaimeException = ThrowBranFromWindow
    deriving (Show, Typeable)
instance Exception JaimeException

jaime :: IO ()
jaime = do
    putStrLn "Did anyone see us?"
    answer <- getLine
    if answer == "no"
        then putStrLn "Good"
        else throwIO ThrowBranFromWindow

main :: IO ()
main = jaime

Killing

And thanks to asynchronous exceptions, you can also kill other threads with monads.

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import Control.Concurrent
import Control.Exception
import Control.Monad
import Data.Typeable (Typeable)

data JaimeException = Defenestrate
    deriving (Show, Typeable)
instance Exception JaimeException

bran :: IO ()
bran = handle onErr $ forever $ do
    putStrLn "I'm climbing a wall!"
    threadDelay 100000
  where
    onErr :: SomeException -> IO ()
    onErr ex = putStrLn $ "Oh no! I've been killed by: " ++ show ex

jaime :: ThreadId -> IO ()
jaime thread = do
    threadDelay 500000
    putStrLn "Oh, he saw us"
    throwTo thread Defenestrate
    threadDelay 300000
    putStrLn "Problem solved"

main :: IO ()
main = do
    thread <- forkIO bran
    jaime thread

Exercise for the reader: modify bran so that he properly recovers from that exception and begins warging instead. (WARNING: don't recover from async exceptions in practice.)

Exiting

You can also exit your entire process with monads.

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

import System.Exit

tommen :: IO ()
tommen = do
    putStrLn "Oh, my dear wife!"
    exitFailure

main :: IO ()
main = tommen

Passing the baton of reign

We've seen Lannisters pass the baton of reign from family member to family member. We've also seen them ruthlessly destroy holy institutions and have their private, internal affairs exposed. As it turns out, we can do all of that with some explicit state token manipulation!

#!/usr/bin/env stack
-- stack --resolver lts-6.15 --install-ghc runghc

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
import GHC.Prim
import GHC.Types

joffrey :: IO ()
joffrey = do
    putStrLn "Look at your dead father's head Sansa!"
    putStrLn "Oh no, poisoned!"

tommen :: IO ()
tommen = do
    putStrLn "I'm in love!"
    putStrLn "*Swan dive*"

cersei :: IO ()
cersei = undefined -- season 7 isn't out yet

unIO :: IO a -> State# RealWorld -> (# State# RealWorld, a #)
unIO (IO f) = f

main :: IO ()
main = IO $ \s0 ->
  case unIO joffrey s0 of
    (# s1, () #) ->
      case unIO tommen s1 of
        (# s2, () #) ->
          unIO cersei s2

Honorable mentions

There's much more to be done with this topic, and there were other ideas besides Lannisters for this post. To mention a few:

  • There's plenty of other Lannister material to work with (skinning animals, Jaime and Cersei affairs, or Tyrion proclivities). But I had to draw the line somewhere (both for length of post and topics I felt like discussing...). Feel free to comment about other ideas.
  • One of my favorites: monads are like onions
  • Monads are just your opinion, man
  • "they're like Gargoyles, they decorate the public facing interface of important landmarks in Haskell-land." link

Personally, I was going to go with a really terrible "monads are like printers" based on them taking plain paper in and spitting out printed pages, but Lannisters was a lot more fun. Also, I'm sure there's something great to be done with presidential candidates, at the very least throwTo opponent Insult.

Disclaimer

Since I'm sure someone is going to get upset about this: YES, this post is meant entirely as parody, and should not be used for actually learning anything except some random details of Game of Thrones, and perhaps a bit of Haskell syntax. Please use the learning materials I linked at the beginning for actually learning about Haskell and monads. And my real advice: don't actually "learn monads," just start using Haskell and you'll pick them up naturally.

September 12, 2016 12:00 AM

September 11, 2016

Philip Wadler

Option A vs B: Kicked into the long grass



I've been putting off posting about the final outcome of the West-East Cycle Route confrontation at Edinburgh City Council's Transport and Environment committee, in part because there was no final outcome.

I sat through four hours of Edinburgh Council's Transport and Environment Committee. The end result was a fudge. The Council heard adamant depositions from both sides, and decided not to decide. They will form a stakeholder group, with representatives from all involved, and attempt to come to a compromise that satisfies all sides. But it's hard to see how that can happen: if we don't get a straight route it will be a disaster, but if we do folk in Roseburn will fear an impact on their businesses (studies show cycle routes don't harm business, but they seem to be taking the Gove line: they are tired of hearing from experts).

Though perhaps we were lucky to get a fudge. Had it gone to a straight vote, following the whip, Greens and Labour would have voted for Option A (7 votes) while the SNP, Conservatives, and Liberal Democrats would have voted for Option B (8 votes). As it was, the Greens gave a rousing defense of Option A, with fantastic speeches from Nigel Bagshaw and Chas Booth. Everyone else supported the 'compromise', and took the opportunity to excoriate the Greens for taking a principled stand. Even those who would have voted for Option B expressed support for cycling, and perhaps that is something we can build on.

Dave duFeu of Spokes posted an excellent summary, including links to the webcast of the meeting and a list of the steps you can take to support the West-East Cycle Route.

The Edinburgh Evening News offered coverage in the two days before: Cyclists take to street to support £6m cycle “superhighway” and D-Day for cycleway in Edinburgh amidst anger and division. The first of these features a couple of photos of me arguing with opponents of the cycle route in Roseburn. I don't look calm and collected. Oddly, I can't find any coverage the Evening News gave to the outcome of the Transport and Environment committee meeting. Was there any?

Not only are Roseburn residents up in arms. A similar cycleway in Bearsden, near Glasgow, is attracting comparable ire from its locals. What is our best way forward to fight bicycle bigots?

Previously: 
  Option A: Think about the children

by Philip Wadler (noreply@blogger.com) at September 11, 2016 07:50 PM

What is it like to understand advanced mathematics?


A correspondent on Quora explains the insider's view to an outsider. Some selections:

  • You can answer many seemingly difficult questions quickly. But you are not very impressed by what can look like magic, because you know the trick. The trick is that your brain can quickly decide if a question is answerable by one of a few powerful general purpose "machines" (e.g., continuity arguments, the correspondences between geometric and algebraic objects, linear algebra, ways to reduce the infinite to the finite through various forms of compactness) combined with specific facts you have learned about your area. The number of fundamental ideas and techniques that people use to solve problems is, perhaps surprisingly, pretty small -- see http://www.tricki.org/tricki/map for a partial list, maintained by Timothy Gowers.
  • You go up in abstraction, "higher and higher". The main object of study yesterday becomes just an example or a tiny part of what you are considering today. For example, in calculus classes you think about functions or curves. In functional analysis or algebraic geometry, you think of spaces whose points are functions or curves -- that is, you "zoom out" so that every function is just a point in a space, surrounded by many other "nearby" functions. Using this kind of zooming out technique, you can say very complex things in short sentences -- things that, if unpacked and said at the zoomed-in level, would take up pages. Abstracting and compressing in this way makes it possible to consider extremely complicated issues with one's limited memory and processing power.
  • You are easily annoyed by imprecision in talking about the quantitative or logical. This is mostly because you are trained to quickly think about counterexamples that make an imprecise claim seem obviously false.

by Philip Wadler (noreply@blogger.com) at September 11, 2016 05:31 PM

ISDS enables corruption


I've long known that ISDS (Investor-State Dispute Settlement) is one of the worst aspects of TTIP, TPP, and CETA. But I've thought the main problem with ISDS was first, that it enabled businessmen to sue governments over laws enacted to save their citizenry (as in the cartoon below), and, second, its secrecy. What I did not understand was how it enables corruption. Kudos to Buzzfeed for their detailed investigation.
Say a nation tries to prosecute a corrupt CEO or ban dangerous pollution. Imagine that a company could turn to this super court and sue the whole country for daring to interfere with its profits, demanding hundreds of millions or even billions of dollars as retribution.

Imagine that this court is so powerful that nations often must heed its rulings as if they came from their own supreme courts, with no meaningful way to appeal. That it operates unconstrained by precedent or any significant public oversight, often keeping its proceedings and sometimes even its decisions secret. That the people who decide its cases are largely elite Western corporate attorneys who have a vested interest in expanding the court’s authority because they profit from it directly, arguing cases one day and then sitting in judgment another. That some of them half-jokingly refer to themselves as “The Club” or “The Mafia.”

And imagine that the penalties this court has imposed have been so crushing — and its decisions so unpredictable — that some nations dare not risk a trial, responding to the mere threat of a lawsuit by offering vast concessions, such as rolling back their own laws or even wiping away the punishments of convicted criminals.
This system is already in place, operating behind closed doors in office buildings and conference rooms in cities around the world. Known as investor-state dispute settlement, or ISDS, it is written into a vast network of treaties that govern international trade and investment, including NAFTA and the Trans-Pacific Partnership, which Congress must soon decide whether to ratify.

These trade pacts have become a flashpoint in the US presidential campaign. But an 18-month BuzzFeed News investigation, spanning three continents and involving more than 200 interviews and tens of thousands of documents, many of them previously confidential, has exposed an obscure but immensely consequential feature of these trade treaties, the secret operations of these tribunals, and the ways that business has co-opted them to bring sovereign nations to heel.

The series starts today with perhaps the least known and most jarring revelation: Companies and executives accused or even convicted of crimes have escaped punishment by turning to this special forum. Based on exclusive reporting from the Middle East, Central America, and Asia, BuzzFeed News has found the following:

  • A Dubai real estate mogul and former business partner of Donald Trump was sentenced to prison for collaborating on a deal that would swindle the Egyptian people out of millions of dollars — but then he turned to ISDS and got his prison sentence wiped away.
  • In El Salvador, a court found that a factory had poisoned a village — including dozens of children — with lead, failing for years to take government-ordered steps to prevent the toxic metal from seeping out. But the factory owners’ lawyers used ISDS to help the company dodge a criminal conviction and the responsibility for cleaning up the area and providing needed medical care.
  • Two financiers convicted of embezzling more than $300 million from an Indonesian bank used an ISDS finding to fend off Interpol, shield their assets, and effectively nullify their punishment.

by Philip Wadler (noreply@blogger.com) at September 11, 2016 05:22 PM

September 10, 2016

LHC Team

Haskell Suite: Type inference.

Disclaimer: I am not an academic. The following post is akin to a plumber's guide to brain surgery.

Type inference is complex and has evolved over time. In this post, I will try to explain how I see the landscape and where LHC and other Haskell compilers fit into this landscape.

The beginning: The Hindley–Milner type system. 1982.
The typing rules here are quite simple and every Haskeller seems to learn them intuitively. They include things like: if 'f :: A → B' and 'a :: A' then 'f a :: B'.
In this system, types are always inferred and there must always be a single, most general type for every expression. This becomes a problem when we want higher ranked types, because for them a single, most general type cannot be inferred. There may be many equally valid type solutions, and it has to be up to the programmer to select the appropriate one. But this cannot happen in plain HM, as type signatures are only used to make inferred types less general (e.g. [a] was inferred but the programmer wanted [Int]).
Omitting the type signature in the following code can show us what plain HM would be like:
(Code in this gist: https://gist.github.com/Lemmih/3d9e64eb6b6dd4735e823e95094555e9) In GHC, the snippet will run fine with the type signature but not without it.
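
An example in the same spirit as the gist (this one is mine, not the gist's code):

{-# LANGUAGE RankNTypes #-}

-- With the signature, GHC merely checks the rank-2 type and accepts this.
-- Without it, plain HM inference gives both :: (a -> b) -> (a, a) -> (b, b),
-- and the call below is rejected because 1 and True have different types.
both :: (forall a. a -> a) -> (Int, Bool) -> (Int, Bool)
both f (x, y) = (f x, f y)

main :: IO ()
main = print (both id (1, True))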


Version two: Bidirectional type system. 2000.
People realised that checking the correctness of a given type is much easier than inferring a correct type. Armed with this knowledge, a new kind of type checker was born, with two modes usually called 'up' and 'down': the 'up' mode lifts a new correct type up from an expression, and the 'down' mode pushes a given type down onto an expression, checking it for correctness. Because of these two modes, this kind of system is called bidirectional, and it deals with higher ranked types quite well.
LHC currently implements this; a toy version of the two modes is sketched below.
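
Here is a generic toy illustration of the two modes for a tiny lambda calculus (this is a sketch of the idea, not LHC's actual code):

data Type = TInt | TFun Type Type deriving (Eq, Show)
data Expr = Lit Int | Var String | Lam String Expr
          | App Expr Expr | Ann Expr Type

type Env = [(String, Type)]

-- 'up' mode: synthesize a type from an expression
infer :: Env -> Expr -> Maybe Type
infer _   (Lit _)   = Just TInt
infer env (Var x)   = lookup x env
infer env (Ann e t) = if check env e t then Just t else Nothing
infer env (App f a) = case infer env f of
  Just (TFun t1 t2) | check env a t1 -> Just t2
  _                                  -> Nothing
infer _   (Lam _ _) = Nothing  -- lambdas can only be checked, not inferred

-- 'down' mode: check an expression against a given type
check :: Env -> Expr -> Type -> Bool
check env (Lam x e) (TFun t1 t2) = check ((x, t1) : env) e t2
check env e t                    = infer env e == Just t

The point is that a lambda never needs its type inferred: the annotation (or the surrounding context) pushes a type down onto it, which is exactly what makes checking higher ranked types tractable.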

Version three: Boxy types. 2006.
At this point it had become apparent that higher ranked types didn't really play well with higher order functions. People often found themselves in situations where slight, seemingly innocent changes caused the type-checker to reject their programs. An example of this can be seen in this gist:
(Code in this gist: https://gist.github.com/Lemmih/52d35e65b915a722bc377165109715ab) Impredicative polymorphism is required for the above code, and boxy types were a stab in that direction. Bidirectional type checking was a big improvement over plain HM, but it lacked granularity: types are either 100% inferred or 100% checked, with no middle ground. What if you wanted to check parts of a type and infer the rest? Boxy types solve exactly that problem. Boxes are added to types (internally; we're not making changes to Haskell here) to mark the unknown parts that should be inferred. Now parts of a type can be checked while the boxed parts are inferred, and we're left with the best of both worlds. This is what JHC implements, by the way. Boxy types were also implemented in GHC but were deemed too complicated.
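
A classic instance of that brittleness (again my illustration, not the gist's code) is runST. Applying it directly works, but routing it through ($) requires instantiating ($)'s type variable at the polymorphic type forall s. ST s Int, i.e. impredicative polymorphism; modern GHC accepts the second version only because it special-cases ($):

import Control.Monad.ST (runST)
import Data.STRef (newSTRef, modifySTRef, readSTRef)

ok :: Int
ok = runST (do r <- newSTRef (41 :: Int); modifySTRef r (+1); readSTRef r)

-- the seemingly innocent change: same code, but applied via ($)
alsoOk :: Int
alsoOk = runST $ do r <- newSTRef (41 :: Int); modifySTRef r (+1); readSTRef r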

Version four: FPH, First-class Polymorphism for Haskell. 2008.
Impredicative polymorphism, second attempt from the authors of boxy types. Improvements were made but the problem is still not solved.

Version five: OutsideIn(X). 2011.
GHC is a hotbed for experimentation in type checkers. GADTs, multi-parameter type classes, type families: these are just some of the features that make the type-checker the largest and most complicated component of GHC. To deal with all of this, researchers came up with OutsideIn, described in a paper longer than all the previous papers put together. The algorithm is relatively simple, but, for practical reasons, implementations must reject some programs that are valid according to the specification.

by David Himmelstrup (noreply@blogger.com) at September 10, 2016 09:39 AM

September 09, 2016

Roman Cheplyaka

A case for static linking in scientific computing

When researchers run scientific software on high-performance clusters, they often experience problems with shared libraries, such as this one:

bcftools: /lib64/libz.so.1: version `ZLIB_1.2.5.2' not found

Or this one:

eagle: error while loading shared libraries: libhts.so.1: cannot open shared object file: No such file or directory

Popular advice points them in the direction of LD_LIBRARY_PATH, but a simple and robust solution—static linking—is often overlooked.

In this article, I explain the background behind static and dynamic linking, demonstrate the advantages of static linking, address some of the objections against static linking, and give instructions on how to prepare static binaries.

What is static and dynamic linking?

The word linking itself refers to the process of assembling a complete program from libraries of subprograms [1].

Static linking occurs as the last stage of the compilation process; the required libraries are embedded in the final binary file of your program.

Some (or even all) of the libraries may not be included in the binary at this stage. In this case, when we attempt to run the program, we need dynamic linking in order to find the libraries and make the missing subroutines accessible to the program. This second type of libraries is called dynamic, or shared, libraries. The files with these libraries are usually named libsomething.so on Linux and something.dll on Windows.

The rules that an operating system follows when it searches for dynamic libraries are complex. And simply having a library in the right place is not enough; it needs to be the same version of the library that was used during the compilation, or at least a different version with the same ABI. (So no, you shouldn’t ln -s /usr/lib/libsomething.so.2 /usr/lib/libsomething.so.1 when a program doesn’t work, contrary to another popular piece of advice one can find on the Internet.)

Linux distributions, most of which dynamically link the software they distribute, manage this by engaging qualified package maintainers and by having tools and centralized infrastructure to build and distribute packages.

But the world of scientific software is not there yet. And if you fail to take care of your dynamic libraries, the result can vary from a program refusing to start (as we saw earlier), to a program crash in the middle of operation, to a hard to detect and diagnose case of data corruption.

This is why I think that scientific software should be linked statically by default.

Advantages of static linking

Reproducibility. When a program is linked statically, it executes the same algorithm wherever it is run. A dynamic executable executes the code from the version of the dynamic library that happens to be installed on a particular computing node.

Note that the static executable doesn’t contain the metainformation about the library versions used to build it. You should record that information when you compile the software, and ideally use a binary repository manager for your builds.

But replacing dynamic linking with static linking by itself dramatically increases the probability that your program will run tomorrow in the same way as it runs today.

Ease of distribution. Suppose that you want your colleague to run the program you have compiled. If you link statically, you only need to distribute a single binary. If you link dynamically, you need to distribute the binary, all dynamic libraries that it depends on, and the instructions on how to make the binary find its libraries.

Portability. Static linking ensures that you can compile the program on your Ubuntu laptop and run it on a Debian or CentOS cluster. With dynamic linking, you’d have to compile it directly on a cluster or in an identical environment.

That said, you still need to ensure that both systems use the same or compatible architectures (the majority of scientific computing happens on x86-64 anyway), the same OS (probably Linux) and not too different kernel versions (or libc versions if you link libc dynamically).

No problems with finding libraries. Since no dynamic libraries are needed, the OS cannot fail to find them. No more cannot open shared object file messages. No more LD_LIBRARY_PATH tricks.

Isn’t static linking considered harmful?

Ulrich Drepper says it is:

There are still too many people out there who think (or even insist) that static linking has benefits. This has never been the case and never will be the case.

Ulrich certainly knows about this stuff much more than I ever hope to. But as you can tell from the above quote, he is sometimes a bit extreme in his judgment.

There is no shortage of knowledgeable people who disagree with him on this issue.

But more importantly, he looks at linking from a very different perspective. For many years, he was employed by Red Hat. He was one of those people who knew a lot about dealing with dynamic libraries and maintained a centralized repository of packages that worked well together in a controlled environment.

It is understandable that he would not care about any of the advantages I list above (though this is different from claiming that there has never been and never will be any benefits to static linking).

But what about the advantages of the dynamic linking that Ulrich describes in his article?

Centralized bug/security fixes.

  1. Security issues matter less for scientific software because it is not exposed to the outside world.

  2. HPC cluster users don’t benefit from centralized bug fixes because usually they don’t have the permissions to install software system-wide. Every user of the same cluster or node would still be responsible for their own updates.

  3. The scale is very different. If you are Red Hat, re-linking hundreds or thousands of binaries every time there is an update in a library is a significant burden. If you are a researcher, you deal maybe with a dozen or two programs, and you may not have to update them often.

  4. Even when centralized updates are possible (e.g. if you can request libraries to be installed centrally and then link against them), scientists would not want them because they are directly at odds with reproducibility.

More efficient use of physical memory through sharing the code.

  1. In high-performance computing, the size of the libraries is usually negligible compared to the size of the data being processed.

  2. When the number of running processes is small, and they don’t have many common dependencies, there’s not much opportunity for sharing.

  3. On the other hand, sometimes multiple copies of the same executable are run in parallel. This happens with software that is not capable of multithreading or cannot exploit it efficiently. Well, in this case, the OS actually can share the code across the processes because it is exactly the same.

  4. When there’s little sharing of code between processes, static linking can sometimes be more memory-efficient. This is because static linking only embeds the object files (i.e. parts of a library) that are actually used by the application, whereas dynamic linking has to load the entire library into memory.

Security measures like load address randomization—see above.

Some features of glibc require dynamic linking. Ulrich, by the way, was one of the core developers of glibc—just in case you were wondering why he considers this a problem of static linking and not a problem of glibc.

Fortunately, most scientific software doesn’t perform character conversions or go to the network. It just crunches numbers. You don’t need dynamic linking for that.

Licensing considerations. I am not a lawyer, but as far as I can tell, this should concern you only if the software is closed-source (or distributed under a license incompatible with GPL) and some of those dependencies are licensed under LGPL. In that case, those dependencies must be linked dynamically, although the other ones can still be linked statically.

I am not sure why Ulrich writes “(L)GPL”, since, to my knowledge, GPL itself does not make a distinction between static and dynamic linking, but I am happy to be corrected.

Tools and hacks like ltrace, LD_PRELOAD, LD_PROFILE, LD_AUDIT don’t work. Oh well.

OK, how do I do this?

Unfortunately, most of the scientific software I come across is linked dynamically by default. Otherwise, I wouldn’t be writing this article.

Convincing the build system

Read the installation instructions. They usually can be found in a file named README, INSTALL, or on the website. If they mention static linking, congratulations.

If not, also try looking inside the Makefile (or whatever build system the software uses). If there is a configure script, try ./configure --help. There could be a target for static linking that the author has not documented.

If the build system doesn’t support static linking out of the box, you will have to modify the linker flags. If a Makefile is well-written, this should be as simple as LDFLAGS=-static make or make LDFLAGS=-static (these are different; try them in this order).

If that doesn’t work, edit the Makefile. Locate the rule that does linking, i.e. produces the final executable.

It usually looks like this:

program: $(OBJS)
    gcc -o program $(OBJS)

Note that gcc could also be g++, or ld, or hidden behind a variable such as $(CC), $(CXX), or $(LD). The variable $(OBJS) could also be named differently, or be a literal list of .o, .c, or .cpp files. But program is usually exactly the name of the final program (or, again, a variable that expands to it).

Once you've located this rule, try adding a -static flag to the gcc command line:

program: $(OBJS)
    gcc -static -o program $(OBJS)

In many cases this will be enough to produce a statically linked binary.
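
If the program in question happens to be written in Haskell, the same idea applies: GHC links Haskell libraries statically by default, and you can ask it to pass -static through to the C linker for a fully static binary. A commonly used recipe (assuming static versions of the C libraries are installed):

% ghc -O2 Main.hs -optl-static -optl-pthread -o program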

Sometimes you need to get more creative. One tool, for instance, explicitly built itself as a shared library, which is incompatible with a (global) -static flag set as part of $(LDFLAGS). The way I solved this was to specify the targets explicitly, i.e. make prog1 prog2, so that make wouldn't attempt to build the dynamic library and fail.

Dependencies

In order to statically link a program, you need to have its dependencies available as static libraries. These are files named libsomething.a.

If this is a library available in your distribution:

  • For Debian and derived distros (e.g. Ubuntu), static libraries usually reside in the libsomething-dev package. If you’ve done dynamic linking of the program, you probably have this package installed already because it also contains the header files.

  • For Red Hat and derived distributions (Fedora, CentOS; not sure about other rpm-based distros), static libraries are often placed in separate packages named libsomething-static.

If this is a third-party library, you’ll need to get a static version of it or compile it from source, following the same instructions that you are reading right now.

Verifying the result

How do you check that you got a static binary? Try running file and ldd on it.

For a dynamic binary, you’ll get something like this:

% ldd ./eagle 
    linux-vdso.so.1 (0x00007ffd47d87000)
    libhts.so.1 => not found
    libboost_program_options.so.1.49.0 => not found
    libboost_iostreams.so.1.49.0 => not found
    libz.so.1 => /lib64/libz.so.1 (0x00007fe77a445000)
    libopenblas.so.0 => /lib64/libopenblas.so.0 (0x00007fe778133000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe777f17000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fe777b90000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fe777886000)
    libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fe777658000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fe777441000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fe77707e000)
    libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007fe776d4d000)
    /lib64/ld-linux-x86-64.so.2 (0x000055f7b3885000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fe776b49000)
    libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007fe776908000)

% file ./eagle
./eagle: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.26, BuildID[sha1]=af18461c835d6f0209754b78c639581c67ed1443, stripped

For a static binary, you’ll see this instead:

% ldd ./Minimac3
    not a dynamic executable

% file ./Minimac3
./Minimac3: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.26, BuildID[sha1]=edf443bb3e695b3f421b0d162ca30bb0f422c2ad, stripped

By the way, if file says that your binary is “not stripped”, run the strip command on it:

% strip eagle

This will significantly reduce its on-disk size and make it faster to copy to the cluster.

  1. To avoid confusion, I should note that I am talking here about software written in compiled languages such as C, C++, or Fortran. I am not talking about interpreted languages (Perl, Python, Ruby) or bytecode-compiled languages (Java, Scala).


September 09, 2016 08:00 PM

LHC Team

Haskell Suite: Scoping.

This post answers why I created 'haskell-scope' even though there's already another library that addresses the same problem.

There are two libraries for resolving references in Haskell source code on Hackage: haskell-names and haskell-scope. Of the two, haskell-names is the oldest, the most feature complete, and the most ambitious. It uses a very innovative scheme that allows the scope to be inspected at any point in the syntax tree. You can read more about it in the linked article. Unfortunately, all this innovation comes at the price of complexity.

Here's the complete list of extensions used by haskell-names: CPP, ConstraintKinds, DefaultSignatures, DeriveDataTypeable, DeriveFoldable, DeriveFunctor, DeriveTraversable, FlexibleContexts, FlexibleInstances, FunctionalDependencies, GADTs, GeneralizedNewtypeDeriving, ImplicitParams, KindSignatures, MultiParamTypeClasses, NamedFieldPuns, OverlappingInstances, OverloadedStrings, RankNTypes, ScopedTypeVariables, StandaloneDeriving, TemplateHaskell, TupleSections, TypeFamilies, TypeOperators, UndecidableInstances, and ViewPatterns.

A total of 27 extensions and many of them will never be implemented by LHC. If LHC is to compile itself one day, this obviously won't do. Enter haskell-scope: a library more plain than bread without butter. Give it an AST and it will annotate all the references. Nothing more, nothing less.

by David Himmelstrup (noreply@blogger.com) at September 09, 2016 11:14 AM

Vincent Hanquez

Foundation

A new hope. Foundation is a new library that tries to define a new modern Haskell framework. It is also trying to be more than a library: a common place for the community to improve things and define new things.

It started as a thought experiment:

What would a modern Haskell base look like if I could start from scratch?

What would I need to complete my Haskell projects without falling into traditional pitfalls like inefficient String, all-in-one Num, unproductive packing and unpacking, etc.?

One of the constraints set early on was to not depend on any external packages, instead depending only on what GHC provides (itself, base, and libraries like ghc-prim). While it may sound surprising, especially considering the usually high quality and precision of libraries available on Hackage, there are many reasons for not depending on anything; I'll motivate them later in the article, so hang on.

A very interesting article from Stephen Diehl on Haskell in production, which details some of the pitfalls of Haskell and the workarounds for less-than-ideal situations and choices, outlines pretty well some of the reasons for this effort.

Starting with the basics

Among the basic things you'll find in any modern Haskell project are byte arrays, arrays, and packed strings, usually in the form of the bytestring, vector and text packages.

We decided to start here. One of the common problems with those types is their lack of interoperability. There's usually a way to convert one into another, but it's either exposed in an Unsafe or Internal module, or has a scary name like unsafeFromForeignPtr.

Then, if you're unlucky, you will see some issues with unpinned and pinned memory (probably in production settings, to maximise fun): the common ByteString uses pinned memory, ShortByteString and Text use unpinned memory, and Vector, well, it's complicated (there are 4 different kinds of Vector).

Note: pinned and unpinned represent whether the memory is allowed to be moved by the GC. Unpinned is usually better, as it allows the memory system to reduce fragmentation, but pinned memory is crucial for dealing with input/output with the real world, large data, and some other uses.

Unboxed Array

Our cornerstone is the unboxed array. The unboxed array is a native Haskell ByteArray (represented by the ByteArray# primitive type), and it can be either unpinned or pinned (chosen at allocation time). To support further interesting use cases, we supplement it with a second constructor that natively supports a chunk of memory referenced by a pointer.

In simplified terms it looks like:

data UArray ty = UArrayBA (Offset ty) (Size ty) PinnedStatus ByteArray#
               | UArrayAddr (Offset ty) (Size ty) (Ptr ty)

With this capability, we have the equivalent of ByteString, ShortByteString, Unboxed Vector and (Some) Storable Vector, implemented in one user friendly type. This is a really big win for users, as suddenly all those types play better together; they are all the same thing working the same way.

Instead of differentiating ByteString and Vector, now ByteString disappears completely in favor of just being a UArray Word8. This has been tried before with the current ecosystem with vector-bytestring.

String

String is a big pain point. Base represents it as a list of Char ([Char]), which, as you can imagine, is not efficient for most purposes. Text from the popular text package implements a packed version of this, using UTF-16 and an unpinned native ByteArray#.

While text is almost a standard in Haskell, it's very likely you'll need to pack and unpack this representation to interact with base functions, or switch representations often to interact with some libraries.

Note on Unicode: UTF-8 is an encoding format where unicode codepoints are encoded as sequences of 1 to 4 bytes (4 different cases). UTF-16 represents unicode codepoints with either 2 or 4 bytes (2 different cases).

Foundation's String is packed UTF-8 data backed by an unboxed vector of bytes. This means we can offer a lightweight type based on UArray Word8:

newtype String = String (UArray Word8)

So by doing this, we directly inherit all the advantages of our vector types; namely, we have a String type that is unpinned or pinned (depending on needs) and supports native pointers. It is also extremely lightweight to convert between the two: given valid UTF-8 binary data, we only validate it, without re-allocating anything.

There's no perfect representation of unicode; each representation has its own advantages and disadvantages, and the right choice really depends on what kind of data you're actually processing. One easy rule of thumb is that the more cases your representation has, the slower it will be to process the higher unicode codepoints.

By extension, it means that choosing a single representation leads to compromise. In early benchmarks against text, we consistently outperform Text when the data is predominantly ASCII (i.e. 1-byte encoding). On other kinds of data it really depends: sometimes we're faster, sometimes slower, and sometimes on par.

Caveat emptor: benchmarks are far from reliable, and have only been run on 2 machines with similar characteristics so far.

Other Types

We also support already:

  • Boxed Array. This is an array of any other Haskell type. Think of it as an array of pointers to other Haskell values.
  • Bitmap. A 1-bit packed unboxed array.

In the short term, we expect to add:

  • tree-like structures.
  • hash-based structures.

Unified Collection API

Many types of collections support the same kind of operations.

For example, commonly you have very similar functions defined with different types:

    take :: Int -> UArray a -> UArray a
    take :: Int -> [a]      -> [a]
    take :: Int -> String   -> String

    head :: UArray a -> a
    head :: String   -> Char
    head :: [a]      -> a

So we tried to avoid monomorphic versions of common Haskell functions, and instead provide type-family-infused versions of them. In foundation we have:

    take :: Int -> collection -> collection

    head :: collection -> Element collection

Note: Element collection is a type family. This allows one type, the "collection", to determine another type. For example, the Element of [a] is a, and the Element of String is Char.

Note 2: head is not exactly defined this way in foundation; this was just the simplest example showing type families and overloading in action. foundation's head is not partial, and is defined as: head :: NonEmpty collection -> Element collection
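
To make the mechanics concrete, here is a minimal, self-contained sketch of the technique (the Sequential class and the PackedString stand-in are my illustrative assumptions, not foundation's actual definitions):

{-# LANGUAGE TypeFamilies #-}
import Prelude hiding (head)

-- stand-in for foundation's packed String
newtype PackedString = PackedString [Char]

-- Element maps a collection type to its element type
type family Element c
type instance Element [a]          = a
type instance Element PackedString = Char

class Sequential c where
  head :: c -> Element c  -- partial here, unlike foundation's total version

instance Sequential [a] where
  head (x:_) = x
  head []    = error "head: empty list"

instance Sequential PackedString where
  head (PackedString (c:_)) = c
  head (PackedString [])    = error "head: empty string"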

The consequence is that the same head or take (or other generic functions) works the same way for many different collection types, even when they are monomorphic (e.g. String).

For another good example of this approach being taken, have a look at the mono-traversable package

For other operations that are specific to a data structure, and hard to generalize, we still expose dedicated operations.

The question of dependencies

If you're not convinced by how we provide a better foundation for the standard Haskell types, then it raises the question: why not depend on those high quality libraries doing the exact same thing?

Consistency. I think it’s easier to set a common direction, and have a consistent approach when working in a central place together, than having N maintainers working on M packages independently.

Common place. An example sometimes speaks better than words: I have this X thing that depends on the A package and the B package. Should I add it to A, to B, or create a new C package?

Atomic development. We don't have to jump through hoops to improve our types or functions against other parts of foundation. Having more things defined in the same place means we can be more aggressive about improving things faster, while retaining an overall package that makes sense.

Versions and releases. It is far easier to depend on a small set of libraries than on hundreds of different versions. Particularly in an industrial setting, I will be much more confident tracking 1 package, watching 1 issue tracker and dealing with a set of known people, than having to deal with N packages, N issue trackers (possibly in different places), and N maintainers.

Some final notes

A fast, iterative release schedule. We plan to release early, release often, and on a predictable schedule.

We're still at an early stage. While we're at an exciting place, don't expect a finished product right now.

You don't need to be an expert to help. Anyone can help us shape foundation.

Join us. If you want to get involved: all Foundation work takes place in the open, on the haskell-foundation organisation, with code, proposals, issues, voting, and questions.

September 09, 2016 12:00 AM

September 08, 2016

Theory Lunch (Institute of Cybernetics, Tallinn)

Nonuniversality in computation: A proof by semantic shift?

Today, the 8th of September 2016, we had a very interesting discussion about a theorem, due to Selim G. Akl, pointed out to me in a tweet by Andy Adamatzky. According to Akl, this theorem has the consequence that the Church-Turing thesis, a basic tenet of theoretical computer science, is false. Of course, surprising statements require solid arguments: is Akl's solid enough?

First of all, let us recall what the Church-Turing thesis is, and what it is not. Its statement, as reported by the Stanford Encyclopedia of Philosophy, goes as follows:

A function of positive integers is effectively calculable only if recursive.

Here, for a calculation procedure to be “effective” means the following:

  1. it has a finite description;
  2. it always returns the correct output, given any valid input;
  3. it can be “carried on by pencil and paper” by a human being; and
  4. it requires no insight or ingenuity on the human’s behalf.

One model of effective procedures is given by the recursive functions; another one, by the functions computable by Turing machines; a third one, by the functions which are representable in Church's λ-calculus. Alan Turing and Stephen Cole Kleene proved that the three classes coincide: thus, in ordinary practice, the Church-Turing thesis is often stated with "Turing-computable" in place of "recursive".

The class of Turing machines has the advantage of containing a universal element: there exist a special Turing machine and an encoding of Turing machines as natural numbers such that, when the special Turing machine is provided the encoding of an arbitrary Turing machine together with a valid input for the latter, it returns the value of the encoded Turing machine on the provided input.

Now that we have written down what the Church-Turing thesis is, we can examine Akl’s theorem.

In his 2005 paper, Akl defines a universal computer as a system U having the following features:

  1. It has means of communicating with the outside world, to receive its input and to send its output.
  2. It can perform every elementary arithmetic and logic operation.
  3. It can be programmed, according to the two previous rules.
  4. It has unlimited memory to use for input, output, and temporary values.
  5. It can only execute finitely many operations (evaluating input, producing output, performing an elementary operation, etc.) at each time step.
  6. It can simulate any computation performed by any other model of computation.

The statement of the theorem, which does not appear explicitly in the original paper but is written down in the one from 2015 which clarifies the idea and addresses criticism, is hereby reported verbatim:

Nonuniversality in Computation Theorem (NCT): No computer is universal if it is capable of exactly T(i) operations during time unit i of computation, where i is a positive integer, and T(i) is finite and fixed once and for all.

The main argument is that no such computer can perform a computation which requires more than T(i) operations at some time i. Explicit examples happen in parallel computation, a field Akl is a master of, where the number of operations that can be performed in a time unit grows linearly with the number of processors: for instance, reading n values in input can be done in time T(i)=1 by a parallel machine with n processors, but not by any machine with m<n processors.

Such a requirement, however, does not appear in the notion of universality at the base of the original, and actual, Church-Turing thesis. There, to "simulate" a machine or algorithm means being able to always reproduce the output of the algorithm, given any valid input for it, up to an encoding of the input and the output. But no hypothesis is made on how the output is obtained from the input: a simulation in linear time, where each step of the simulated algorithm is reproduced by exactly k operations of the Turing machine, is as good as one where the simulation of the i-th step takes 17^(i^2 - i) operations from the Turing machine, or where no such regularity appears.

Among the (counter)examples provided by Akl are:

  1. Computations with time-varying variables.
  2. Computations with time-varying computational complexity.
  3. Computations whose complexity depends on their placement on a schedule.
  4. Computations with interacting variables, e.g., states of entangled electrons.
  5. Computations with uncertain time constraints.

None of these, however, respects the definition of computation from the model of recursive functions, where the values of the variables are given once and for all, and can possibly change for recursive calls, but not for the original call. They can be seen as instances of unconventional models of computation: but in adopting them, one changes the very notion of computation, which ceases to be the one at the basis of the Church-Turing thesis.

So my guess is that Akl’s statement about the falsity of the Church-Turing thesis actually falls in the following category, as reported in the humorous list by Dana Angluin:

Proof by semantic shift: Some standard but inconvenient definitions are changed for the statement of the result.

Actually, if we go back to Akl's definition of a universal computer, it appears to be fine until the very last point: the first two points agree with the definition of effective computation at the basis of the actual Church-Turing thesis, and the next three are features of any universal Turing machine. The problem comes from the last point, which has at least two weak spots. The first is that it does not define precisely what a model of computation is; this can be accepted, as Akl is talking about unconventional computation, and it is wiser to be open to other possibilities. But there is a more serious one: it is not clear

what the expression "to simulate" means.

Note that the Stanford Encyclopedia of Philosophy reports the following variant of the Church-Turing thesis, attributed to David Deutsch:

Every finitely realizable physical system can be perfectly simulated by a universal model computing machine operating by finite means.

Deutsch’s thesis, however, does not coincide with the Church-Turing thesis! (This, notwithstanding Deutsch’s statement that “[t]his formulation is both better defined and more physical than Turing’s own way of expressing it”.) Plus, there is another serious ambiguity, which is of the same kind as the one in Akl’s definition:

what is “perfectly simulated” supposed to mean?

Does it mean that every single step performed by the system can be reproduced in real time? In this case, Akl is perfectly right in disproving it under the constraint of boundedly many operations at each time unit. Or does it mean that the simulation of each elementary step of the process (e.g., one performed in a quantum of time) ends with the correct result if the correct initial conditions are given? In this case, the requirement to reproduce exactly what happens between the reading of the input and the writing of the output is null and void.

Worse still, there is a vulgarized form of the Church-Turing thesis, which is reported by Akl himself on page 172 of his 2005 paper, no less! It goes as follows:

Any computable function can be computed on a Turing machine.

If one calls that "the Church-Turing thesis", then Akl's NCT is absolutely correct in disproving it. But that is not the actual Church-Turing thesis! It is actually a rewording of what the Stanford Encyclopedia of Philosophy calls "Thesis M", which is explicitly stated not to be equivalent to the original Church-Turing thesis, and to be false as well. Again, the careful reader will have noticed that, in the statement above, being "computable by a Turing machine" is a well defined property, but "computable" tout court is definitely not.

At the end of this discussion, my thesis is that Akl’s proof is correct, but NCT’s consequences and interpretation might not be what Akl means, or (inclusive disjunction) what his critics understand. As for my personal interpretation of NCT, here it goes:

No computer which is able to perform a predefined, finite number of operations at each finite time step is universal across all the different models of computation, where the word "computation" may be taken in a meaning different from that of the Church-Turing thesis.

Is mine an interpretation by semantic shift? Discussion is welcome.

References:

  1. Selim G. Akl. The Myth of Universal Computation. Parallel Numerics ’05, 167–192.
  2. Selim G. Akl. Nonuniversality explained. International Journal of Parallel, Emergent and Distributed Systems 31:3, 201–219. doi:10.1080/17445760.2015.1079321
  3. The Church-Turing Thesis. Stanford Encyclopedia of Philosophy. First published January 8, 1997; substantive revision August 19, 2002. http://plato.stanford.edu/entries/church-turing/
  4. Dana Angluin’s List of Proof Techniques. http://www.cs.northwestern.edu/~riesbeck/proofs.html

by Silvio Capobianco at September 08, 2016 02:59 PM

September 07, 2016

Edward Z. Yang

The Base of a String Theory for Haskell

One of the early posts from this blog, from 2010, was on the subject of how to pick your string library in Haskell. Half a decade later, the Haskell ecosystem is still largely in the same situation: most of the boot libraries shipped with GHC (e.g., base) still use the String type, despite the existence of superior string types. The problem is twofold:

  1. No one wants to break all of the existing code, which means libraries like base have to keep String versions of all their code. You can't just search-replace every occurrence of String with Text.
  2. No one wants to be in the business of maintaining two copies of any piece of code, which are copy-pastes of each other but subtly different. In practice, we must: e.g., unix has ByteString variants of all of its functions (done by copy-paste); text provides some core IO functionality (also done by copy-paste). But it is terrible and scales poorly: every downstream library that wants to support two string types (or more) now has to publish two copies of itself, and any new string implementation has the unenviable task of reimplementing the world to make itself useful.

Backpack solves these problems, by allowing you to parametrize over a signature rather than a concrete implementation of a string type, and instantiate such an indefinite library whenever you want. This solves both problems:

  1. Because you are allowed to instantiate an indefinite library whenever you want, we can eagerly instantiate a posix-indef using String and ship it as posix, keeping backwards compatibility with all packages which are Backpack ignorant.
  2. At the same time, if packages depend directly on posix-indef, they themselves are parametrizable over a string type. Entire library ecosystems can defer the choice of string type to the end user, which on a sufficiently new version of GHC offers a backwards compatible way of adding support for new string types to a library. (I don't want to say, support multiple string types, because this is not necessarily a virtue in-and-of-itself.)

To this end, I would like to propose a string theory, for the base of GHC Haskell: namely the core boot libraries that are distributed with GHC today. These packages will set the tone for the eventual Backpackification of the rest of the ecosystem.

But first, what is it that we are parametrizing over? A string is not so simple...

A digression on file paths (and OS strings)

File paths (FilePath) are an important form of String which aren't really Unicode strings at all. POSIX specifies that file paths can be arbitrary C strings, thus, code which decodes a file path as Unicode must be cognizant of the fact that the underlying ByteString could contain arbitrary, undecodable nonsense. To make matters worse, even the encoding can vary: on Windows file paths are encoded in UTF-16 (with unpaired surrogates, eek!), while in modern Linux environments the encoding is dictated by the locale (base uses locale_charset to determine how to interpret file paths; the locale is often UTF-8, but not always).

Thus, the definition type FilePath = String is very questionable indeed. There is an existing proposal, the Abstract FilePath Proposal to turn FilePath into an abstract type, and not just a type synonym for String. Unfortunately, a change like this is a BC-breaking one, so it will take some time to implement, since GHC must first be taught to warn when FilePath is used as if it were a String, to help people find out that they are using it incorrectly.

Backpack offers a more decentralized way to move into the future: just define an abstract signature for FilePath to depend upon. The low level signature might look like this:

signature FilePath where

-- | File and directory names, whose precise
-- meaning is operating system dependent. Files can be opened, yielding a
-- handle which can then be used to operate on the contents of that file.
data FilePath

-- | A C string (pointer to an array of C characters terminated by NUL)
-- representing a file path, suitable for use with the operating system
-- C interface for file manipulation.  This exact type is architecture
-- dependent.
type CFilePath =
#ifdef mingw32_HOST_OS
        CWString
#else
        CString
#endif

withFilePath :: FilePath -> (CFilePath -> IO a) -> IO a
newFilePath  :: FilePath -> IO CFilePath
peekFilePath :: CFilePath -> IO FilePath
-- peekFilePath >=> newFilePath should be identity
-- (this is tricky to achieve if FilePath is a
-- Unicode-based type, like String)

And of course, you would want all of the FilePath manipulation functions that people use.

To maintain compatibility with the existing ecosystem, you would likely instantiate your library with type FilePath = String. But there is nothing stopping you from picking your own abstract FilePath type and using it instead.
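
For concreteness, here is a sketch of what picking an implementation might look like for a consumer, using the mixins syntax that later shipped with Cabal 2.0 (the package names posix-indef and my-filepath here are hypothetical):

library
  build-depends: posix-indef, my-filepath
  mixins: posix-indef requires (FilePath as My.FilePath)

Here My.FilePath would be a module of my-filepath providing everything the FilePath signature above demands.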

File paths are not unique in this sense; there are other strings (such as the values of environment variables) which have similar properties. I've taken to calling these OSStrings (as they are called in Rust).

Axes of parametrization

With this in mind, there are three "string variants" any given library can be parametrized over:

  1. They can be parametrized over FilePath, for modules which deal with the file system (e.g., System.Posix.Directory)
  2. They can be parametrized over an OSString, because they deal with various operating system specific APIs (e.g., System.Posix.Env)
  3. They can be parametrized over a String, because, well, sometimes a string is just a string. (e.g., Text.ParserCombinators.ReadP)

Some libraries may be parametrized in multiple ways: for example, readFile needs to be parametrized over both FilePath and String.

Split base (and friends) for Backpack

For technical reasons, Backpack cannot be used to parametrize specific modules; you have to parametrize over an entire library. So a side-effect of Backpack-ing the core libraries is that they will be split into a number of smaller libraries. Using module reexports, you can still keep the old libraries around as shims.

There are four GHC boot libraries which would most benefit from modularization on strings:

  • base
    • base-io (System.IO and submodules; parametrized over FilePath and String)
    • There are a few other modules which could be stringified, but the marginal benefit may not justify making a new package for each (Data.String, System.Console.GetOpt, Text.ParserCombinators.ReadP, Text.Printf). Each of these only needs to be parametrized over String.
    • Control.Exception, Text.Read and Text.Show are explicit non-goals, they are too deeply wired into GHC at present to muck about with.
  • unix
    • unix-env (System.Posix.Env, parametrized over OSString)
    • unix-fs (System.Posix.Directory, System.Posix.Files, System.Posix.Temp parametrized over FilePath)
    • unix-process (System.Posix.Process, parametrized over FilePath and OSString)
  • pretty (parametrized over String; then GHC could use it rather than roll its own copy!)
  • process (parametrized over String, OSString and FilePath)

The naming scheme I propose is that, e.g., the package unix continues to be the package instantiated with old-fashioned Strings. Then unix-indef is a package which is uninstantiated (the user can instantiate it to what they want, or pass on the decision to their users). Some packages may choose to also provide shims of their package instantiated with specific types, e.g., base-io-bytestring, which would be base-io instantiated with ByteString rather than String, though these names could get to be quite long, so it's uncertain how useful this would be.

Closing remarks

Of all the packages mentioned here, only base could make the bold step of using Backpack next GHC release (although it won't; at least, not for GHC 8.2); the rest need to maintain backwards compatibility with old versions of GHC and so would have to be forked to use Backpack.

The real test for Backpack will be whether or not string-using packages in the ecosystem decide to sign on, and parametrize themselves over signatures. I hope that eventually, you can use any library with ByteString or Text with the same ease that you can use libraries with String (and maybe even use your own, home-grown type.) The problem with module systems is that you rarely see the benefits until you use them for big systems, but that makes it difficult to evaluate them before hand. But the benefits seem tantalizing: let's boldly backpack forth to a brighter future!

by Edward Z. Yang at September 07, 2016 06:22 AM

September 05, 2016

Darcs

darcs 2.12.1 release

The Darcs team is pleased to announce the release of Darcs 2.12.1!

The most important changes are:
  • fixes for building with GHC 8
  • drop support for GHC 7.6 and 7.8, i.e., require GHC 7.10
  • improvements in `darcs whatsnew` output with irrelevant files (Ben Franksen)
This release can be installed via cabal or (soon) stack. The 2.12 branch is also available as a darcs repository from http://darcs.net/releases/branch-2.12 .

by guillaume (noreply@blogger.com) at September 05, 2016 08:17 PM

The GHC Team

Call for Nominations: GHC Steering Committee

Hello everyone,

As you likely know, over the last few months we have been discussing options for reforming the process for proposing language and compiler changes to GHC. After much discussion, we have a process which, while not perfect, is acceptable to a majority of our contributor base and will be an improvement over the status quo. While we invite suggestions for future improvements, we are at a point where we can move ahead with implementation.

Consequently, we are seeking nominations for the initial GHC steering committee. This body is responsible for overseeing the progression of proposals through the process, working with authors on refining their ideas, and evaluating proposals for acceptance. The committee will consist of five to eight members of diverse backgrounds.

We would like to offer the following as criteria for membership. Note that a candidate is not expected to satisfy all or even most of these criteria, but a good candidate should satisfy at least one:

  • A history of contributions to the design of new language features
  • Experience developing Haskell libraries and applications
  • A demonstrated track record of contributing code to GHC
  • A pedagogical background, with experience in either teaching or authoring educational materials
  • Experience in compiler development
  • Knowledge of functional programming language theory

I would like to emphasize that committee membership is as much a duty as it is a privilege. Membership is not intended to be a platform to be used by members to drive their own ideas; rather it is a way of serving the Haskell community by helping other community members refine and advance their proposals. This, of course, requires an investment of time and effort, which you should be willing and able to consistently put forth.

If you would like to be considered for committee membership then please write a statement describing why you feel you are well-qualified to serve, in terms of the criteria above and any others that you would like to offer. Please send your statements to ben at well-typed.com by September 30th. The initial committee selection will be made by the Simons soon thereafter.

Thanks to everyone for their feedback and cooperation so far!

Cheers,

~ Ben

by bgamari at September 05, 2016 06:56 PM

Joachim Breitner

The new CIS-194

The Haskell minicourse at the University of Pennsylvania, also known as CIS-194, has always had a reach beyond the students of Penn, at least since Brent Yorgey gave the course in 2013; he wrote extensive lecture notes and eventually put the material on GitHub.

This year, it is my turn to give the course. I could not resist making some changes, at least to the first few weeks: instead of starting with a locally installed compiler and doing exercises that revolve mostly around arithmetic and lists, I send my students to CodeWorld, a web programming environment created by Chris Smith [1].

This greatly lowers the initial hurdle of having to set up the local toolchain, and is inclusive towards those who have had little exposure to the command line before. Not that I do not expect my students to handle that, but it does not hurt to move it towards later in the course.

But more importantly: CodeWorld comes with a nicely designed, simple API to create vector graphics, to animate these graphics, and even to create interactive programs. This means that instead of having to come up with yet another set of exercises revolving around lists and numbers, I can have the students create Haskell programs that are visual. I believe that this is more motivating and stimulating, and will nudge the students to spend more time programming and thus practicing.
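
To give a flavour of the API, here is a tiny example of my own (not from the course material; drawingOf, colored, solidCircle, rectangle and & are from the codeworld-api vocabulary as I understand it):

import CodeWorld

-- a green disc inside a square outline
main :: IO ()
main = drawingOf (colored green (solidCircle 2) & rectangle 8 8)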

(Interactive CodeWorld program embedded here: https://code.world/run.html?mode=haskell&dhash=D7Y0vPUj4D2tE2iKoYnvZrg)

In fact, the goal is that in their third homework assignment, the students will implement a fully functional, interactive Sokoban game. And all that before talking about the built-in lists or tuples, just with higher order functions and custom datatypes. (Click on the picture above, which is part of the second week's homework. You can use the arrow keys to move the figure around and press the escape key to reset the game. Boxes cannot be moved yet -- that will be part of homework 3.)

If this sounds interesting to you, and you always wanted to learn Haskell from scratch, feel free to tag along. The lecture notes should be elaborate enough to learn from, and with the homework problems, you should be able to tell whether you have solved them yourself. Just do not publish your solutions before the due date. Let me know if you have any comments about the course so far.

Eventually, I will move to local compilation, use of the interpreter and text-based IO and then start using more of the material of previous iterations of the course, which were held by Richard Eisenberg in 2014 and by Noam Zilberstein in 2015.


  1. Chris has been very helpful in making sure CodeWorld works in a way that suits my course, thanks for that!

by Joachim Breitner (mail@joachim-breitner.de) at September 05, 2016 06:09 PM

Ken T Takusagawa

[fltalwhq] integerLog2 and an introduction to unboxed types

Some notes on using unboxed types in Haskell, and in particular, notes on creating a boxed wrapper for the integer-gmp library function integerLog2# :: Integer -> Int# which returns an unboxed type.

{-# LANGUAGE MagicHash #-}
import GHC.Exts(Int(I#));
import GHC.Integer.Logarithms(integerLog2#);

integerLog2 :: Integer -> Int;
integerLog2 i = if i < 1
then error "must be positive"
-- because integerLog2# does no bounds checking
else I# (integerLog2# i);
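
For example (integerLog2# computes the floor of the base-2 logarithm):

main :: IO ();
main = print (integerLog2 1024, integerLog2 1023); -- prints (10,9)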

The pragma MagicHash prevents GHC from interpreting the hash symbol as an operator.  Without it, one gets error messages like this:

parse error on input `#'
not in scope: `#'

It would be nice if GHC emitted a suggestion of MagicHash on errors like this.

The constructor I# is findable using Hoogle, searching for Int#->Int.

One must use parentheses around the argument to I#.  The standard trick of removing parentheses with the dollar sign results in an error:

bad1 i = I# $ integerLog2# i;

Couldn't match kind `*' with `#'
When matching types
r0 :: *
GHC.Prim.Int# :: #

Using the composition operator in point-free style fails similarly:

bad2 = I# . integerLog2#;

Couldn't match kind `*' with `#'
When matching types
b0 :: *
GHC.Prim.Int# :: #
Expected type: b0 -> Int
Actual type: GHC.Prim.Int# -> Int
In the first argument of `(.)', namely `I#'

Unboxed types have a different "kind", the # kind, whereas boxed types have the * kind.

by Ken (noreply@blogger.com) at September 05, 2016 07:18 AM

Yesod Web Framework

Better CI for the Yesod scaffoldings

After having completely forgotten to do this for a long time, I finally set aside some time last night to fix up the Travis CI for the Yesod scaffoldings. We have a number of different flavors of scaffoldings (e.g., PostgreSQL, MySQL, simple, and minimal), and keep all of the different flavors as branches on a single repo so that improvements to the base scaffolding can easily be merged into all of the others. The goal of my changes was to:

  • Have Travis check against the different snapshots likely to be selected by the stack new command
  • Automate testing against live databases, so that I can be lazy and not set up those databases for local testing

Overall, this went pretty smoothly, and also serves as a nice example of a short Stack-based Travis configuration. You can see the latest PostgreSQL Travis configuration.

Beyond my desire to be lazy in the scaffolding release process, the other obvious benefit is much more confidence, when reviewing PRs against the scaffolding, that things will actually work.

Interesting discoveries

I discovered two things in this work that I hadn't realized previously:

  • Due to a dependency on a relatively recent yesod-auth version, only LTS Haskell 6 and up and Stackage Nightly are supported by the scaffolding. Therefore, for the build matrix, I haven't included LTS 5 and lower.
  • I was unaware of a bug in GHC 8.0.1 which prevents the scaffolding from compiling. This bug has already been resolved upstream and the fix will be included in GHC 8.0.2. In the meanwhile, I've blocked GHC 8.0.1 from usage in the scaffolding.

Upon reflection, this is probably a good thing: we are all but guaranteed that a new user will start off with LTS 6, which is a well tested set of packages and a very stable GHC version, leading to hopefully little user friction when getting started.

A user could in theory do something like stack new foo yesod-simple --resolver lts-5 --solver, but I think it's fair to assume that someone passing in such specific flags knows what he/she is doing and we should just stay out of his/her way.

Contributing

Now that CI is in better shape, this is a good time to remind everyone of how to contribute to the Yesod scaffoldings. The PostgreSQL scaffolding is considered the base, with the other flavors merging in changes from there. If you want to make a change, please submit a PR to the postgres branch of the yesod-scaffold repo. If you have a patch which is specific to one of the scaffoldings instead (like changing MySQL config settings), please submit it to that branch.

Note that it is not supported to send pull requests against the .hsfiles files in the stack-templates repo, as such changes can't be properly tested, and will be overwritten the next time a change from the yesod-scaffold repo is merged in.

September 05, 2016 07:00 AM

September 03, 2016

Edward Z. Yang

The Edit-Recompile Manager

A common claim I keep seeing repeated is that there are too many language-specific package managers, and that we should use a distribution's package manager instead. As an example, I opened the most recent HN discussion related to package managers, and sure enough the third comment was on this (very) dead horse. (But wait! There's more.) But it rarely feels like there is any forward progress on these threads. Why?

Here is my hypothesis: these two camps of people are talking past each other, because the term "package manager" has been overloaded to mean two things:

  1. For end-users, it denotes an install manager, primarily responsible for installing some useful software so that they can use it. Software here usually gets installed once, and then used for a long time.
  2. For developers, it denotes an edit-recompile manager: a piece of software for letting you take a software project under development and (re)build it, as quickly as possible. The installation of packages is a means, but it is not the end.

It should be clear that while these two use-cases have some shared mechanism, the priorities are overwhelmingly different:

  • End-users don't care about how a package is built, just that the things they want to install have been built. For developers, speed on rebuild is an overriding concern. To achieve this performance, a deep understanding of the structure of the programming language is needed.
  • End-users usually just want one version of any piece of software. Developers use multiple versions, because that is the cost of doing business with a diverse, rapidly updated, decentralized package ecosystem.
  • End-users care about it "just working": thus, a distribution package manager emphasizes control over the full stack (usually requiring root). Developers care about flexibility for the software they are rebuilding and don't mind if a little setup is needed.

So the next time someone says that there are too many language-specific package managers, mentally replace "package manager" with "edit-recompile manager". Does the complaint still make sense? Maybe it does, but not in the usual sense: what they may actually be advocating for is an interface between these two worlds. And that seems like a project that is both tractable and worth doing.

by Edward Z. Yang at September 03, 2016 12:40 AM

Christopher Allen

The Hashrocket websocket shootout in Haskell

I recently PR’d a Haskell entry to Hashrocket’s websocket shootout. Haskell seemed to do a lot better than C++, Rust, Golang, Elixir, Erlang, NodeJS, Ruby MRI, and JRuby. However, the Haskell version has since been fixed, so I can no longer run the benchmark reliably on my machine; any final results will have to come from Hashrocket running the unagi-chan variant.

How the benchmark works

The idea is to test how many concurrent clients a single websocket server (process?) can serve and how efficiently it can broadcast messages to all the clients.

The constraint of the benchmark is that your 95th percentile round-trip time cannot exceed 250ms. This is a better measurement/restriction for concurrency benchmarks than the usual “how many can it handle before it crashes” or throughput metrics, so props to Hashrocket on that point.

The client, as designed, increases the number of connected clients by the specified step size and sends test events at each step. If the 95th percentile round-trip time exceeds 250ms, the benchmark client disconnects all client connections and halts. So, the last “line” of output you see from the client is essentially where you peaked before failing the SLA constraint.
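
To make the SLA concrete, here is a minimal sketch of the percentile check (my own illustration, not Hashrocket's client code):

import Data.List (sort)

-- Rank-based 95th percentile of round-trip times, in milliseconds.
percentile95 :: [Double] -> Double
percentile95 rtts = sorted !! idx
  where
    sorted = sort rtts
    idx    = min (length rtts - 1)
                 (ceiling (0.95 * fromIntegral (length rtts)) - 1)

-- The benchmark client halts once this returns False.
withinSLA :: [Double] -> Bool
withinSLA rtts = percentile95 rtts <= 250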

What follows is the flawed Broadcast implementation I wrote that drops messages, so caveat lector.

Everything below is retracted for now as Broadcast was dropping messages, which wasn’t explicitly permitted in the benchmark. I’m currently kicking around an unagi-chan based variant PR’d by Sebastian Graf, but I don’t think unagi-chan was designed for broadcasting across many thousands of channels.

For some context, this benchmark using Tsung is roughly what I expected in terms of results modulo hardware differences, which is why I wasn’t that surprised when I saw the initial results. Currently the Go websocket client seems to behave very differently from Tsung’s, so I don’t have a reproduction of what Tom Hunger’s benchmark did.

Before I retracted the results, I was peaking at 45,000 concurrent clients and a very low/flat latency with this version that uses Broadcast. However, Broadcast was dropping messages, so it’s not a valid comparison when the other benchmark servers weren’t dropping any messages. Incidentally, load-shedding is a great strategy for consistent server performance when it’s permissible ;)

Here’s the source to the Haskell version at time of writing:

{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes       #-}

module Main where

import qualified Control.Concurrent as C
import qualified Control.Concurrent.Broadcast as BC
import Control.Lens hiding ((.=))
import Control.Monad (forever)
import Data.Aeson
import Data.Aeson.Lens
import Data.Aeson.Types
import Data.ByteString.Lazy (ByteString, toStrict)
import qualified Data.Char as DC
import Data.Functor (void)
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import GHC.Generics
import Network.HTTP.Types (status400)
import Network.Wai
import Network.Wai.Handler.Warp
import Network.Wai.Handler.WebSockets
import Network.WebSockets
import Text.RawString.QQ

The above is just the usual preamble/noise. I had quasiquotes for a test/example I didn’t use in the actual server.

type Broadcaster = BC.Broadcast ByteString

Hedging my bets in case I switched again after changing the broadcast type from Text to a lazy ByteString.

amendTest :: Maybe Value
amendTest = decode $ [r|
{"type":"broadcast","payload":{"foo": "bar"}}
|]

amendBroadcast :: Value -> Value
amendBroadcast v =
  v & key "type" . _String .~ "broadcastResult"

Above was just test code.

broadcastThread :: Broadcaster -> Connection -> IO ()
broadcastThread bc conn = forever $ do
  t <- BC.listen bc
  sendTextData conn t

That’s all I do to relay broadcasted data to the listeners. Under the hood, Broadcast is:

MVar (Either [MVar a] a)

I used broadcast from concurrent-extra because I knew I wanted the propagation/thread wake to happen via the MVar machinery in the GHC runtime system.
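
As a rough mental model, here is a toy transient broadcast built from the same MVar machinery (my own sketch, not the actual concurrent-extra implementation, which also has the "signalled" state shown as Right above):

import Control.Concurrent.MVar

newtype Toy a = Toy (MVar [MVar a])

newToy :: IO (Toy a)
newToy = Toy <$> newMVar []

-- Block until the next signal.
listenToy :: Toy a -> IO a
listenToy (Toy mv) = do
  box <- newEmptyMVar
  modifyMVar_ mv (return . (box :))
  takeMVar box

-- Wake every thread currently listening, then reset to silent.
signalToy :: Toy a -> a -> IO ()
signalToy (Toy mv) x =
  modifyMVar_ mv (\waiters -> mapM_ (`putMVar` x) waiters >> return [])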

wtf conn =
  sendTextData conn ("<img src=\"http://bit.ly/1kmRC7Q\" />" :: Text)

Error return method borrowed from ocharles.

mkPayload :: Text -> Value -> ByteString
mkPayload type_ payload = encode $
  object [ "type" .= String type_
         , "payload" .= payload
         ]

Constructing a JSON value fitting the format expected by the test client and then encode-ing it into a ByteString.

bidiHandler :: Broadcaster -> Connection -> IO ()
bidiHandler bc conn = do
  _ <- C.forkIO (broadcastThread bc conn)
  --   [               1                ]
  forever $ do
  -- [2]
    msg <- receiveDataMessage conn
    --     [3]
    case msg of
      Text t -> do
        let Just payload = t ^? key "payload"
        --                 [       4       ]
        case t ^? key "type" . _String of
        --   [           5           ]
          Just "echo" -> sendTextData conn (mkPayload "echo" payload)
          --             [                   6                      ]
          Just "broadcast" -> BC.signal bc (mkPayload "broadcastResult" payload)
          --                  [                       7                        ]
          _ -> wtf conn
      _ -> do
        wtf conn

I hate reading overly chopped-up code, so I annotated this one in the mode of the haskell book.

  1. We run the broadcast listener that relays data to the websocket client in a separate thread

  2. Running the client listener that (potentially) broadcasts data or just echoes back to the client in a Control.Monad.forever block.

  3. Block on receiving a data message (sum type, Text or Bytes)

  4. Pluck the payload value out of the JSON body because I’m too lazy to make a datatype for this.

  5. Get the event type out of the JSON body to dispatch on. We’re going to either echo or broadcast.

  6. If the event type was echo, kick the JSON data back to the client, but with the event type amended to echo.

  7. If the event type was broadcast, signal the broadcast handle to propagate the new JSON body with the payload and a broadcastResult event type.

wsApp :: Broadcaster -> ServerApp
wsApp bc pending = do
  conn <- acceptRequest pending
  bidiHandler bc conn

Passing on the Broadcast handle and Connection to the handler.

main :: IO ()
main = do
  bc <- BC.new
  runServer "127.0.0.1" 3000 (wsApp bc)

Spawn a Broadcast, pass the handle on to wsApp, run it with the provided server from the wai-websockets library. That’s it.

Some thoughts

Erlang is the only runtime competitive on per-thread (process in their lingo) overhead, but they bite the dust on message send: an MVar take/put pair is ~25-40ns, while in Erlang you’re eating at least 1,000ns per send. It’s possible a custom Erlang implementation (Cowboy?) could do a better job here, but I’m not sure how to do broadcast especially efficiently in Erlang.
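
If you want to check that MVar number on your own hardware, a micro-benchmark along these lines with the criterion library should do (a sketch of mine; absolute numbers vary by machine):

import Control.Concurrent.MVar
import Criterion.Main

main :: IO ()
main = do
  mv <- newMVar ()
  defaultMain
    [ bench "MVar take/put pair" (whnfIO (takeMVar mv >> putMVar mv ())) ]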

Asking how to efficiently broadcast to many Erlang processes on the mailing list gets you smarmy answers.

I was initially disappointed I didn’t get an excuse to optimize any of the Haskell code. It was limited only by the number of TCP connections I could bind; I still had two-thirds of my 95th percentile RTT budget to burn. I messed with ulimit and the like a bit, but to really uncap it I’d need to change the client to connect to multiple IP addresses so I could use more TCP connections. Now I know it was because Broadcast was dropping messages and not tickling the slow parts as much as an implementation that forces broadcasts to all clients.

I know this site is a bit of a disaster zone, but if you like my writing or think you could learn something useful from me, please take a look at the book I've been writing with my coauthor Julie. There's a free sample available too!

September 03, 2016 12:00 AM

September 02, 2016

Brent Yorgey

Deep work and email habits

Lately I have been enjoying Cal Newport’s writing on work, and particularly his new book Deep Work which I am in the middle of reading (definitely recommended). His basic thesis is about the power of sustained, focused, distraction-free work on cognitively demanding tasks—what he calls deep work. It takes intentional effort to make the time and space for this kind of work, but Newport argues cogently that doing so can have enormous benefits.

Newport’s ideas have really resonated with me—I think I was already converging (albeit slowly, with little clarity) on similar ideas and practices over the last few years—and I’ve begun trying to put some of them more deliberately into practice. First, I have scheduled two large (4 hour) blocks of time for deep work each week. These blocks are sacrosanct: I won’t meet with students, schedule committee meetings, or do anything else during those times. I physically go somewhere other than my office—usually the library, occasionally my favorite coffee shop, somewhere relatively quiet and interruption-free where students and colleagues won’t find me. I first do as much as possible without turning on my laptop: course planning, reading, brainstorming, a lot of longhand writing (blog posts, papers, grant proposals, whatever—for example, I wrote this blog post itself longhand during my deep work session this morning). Sometimes if I need to write a longer, thoughtful email response, I will even print out the message beforehand and write my response longhand. Only towards the end of the session will I pull out my laptop, if I have specific projects to work on deeply that require a computer, like some sort of coding project.

Anecdotally at least, so far this feels incredibly successful—I get a lot done during these deep work sessions and always come away feeling accomplished and energized. The thing that feels especially good is that I’m not just getting a large amount of stuff done, but I’m getting important, difficult stuff done.

Another related practice I have recently adopted is that I do not read or write any email before 4pm. I have literally blocked myself from accessing email on my computers and phone between midnight and 4pm. Perhaps this sounds heretical, but that’s just the point—“because doing otherwise would be heresy” is a terrible reason for doing anything, and the fact is that not many of us really stop to consider and consciously choose the way we make use of technologies like email and social media. It’s taken some getting used to, but by now I don’t think I am ever going back. At 4pm I triage my inbox—respond to things that need a quick response, archive or delete stuff I don’t care about, and forward other things to my personal bug tracker for dealing with later. I am typically able to totally clear out my inbox before going home for the day. Over the course of the day I keep a list of emails I want to write later, and I write those at the same time that I triage my inbox, or sometimes later in the evening before going to bed. It feels way more efficient to batch most of my email processing into a focused session like this, and freeing to not be distracted by it the rest of the day. But do I ever miss it? Yes, all the time—and that’s exactly the point! Left to my natural tendencies I distract myself silly checking my email constantly.

Time will tell how much of this sticks and how my approach might change over time—I’ve scheduled a reminder for myself to write a followup post six months from now. As always, I’m happy to hear and respond to thoughts, reactions, questions, etc. in the comments.


by Brent at September 02, 2016 07:54 PM

Dan Burton

GPG signing for github & mac

I just went through a few steps to get gpg signing to work on my mac and show up on github. I wanted to quickly document the process since the instructions are a little bit scattered. All of it basically came … Continue reading

by Dan Burton at September 02, 2016 07:18 PM

September 01, 2016

Well-Typed.Com

Haskell eXchange, Hackathon, and Courses

In October 2016, we are co-organizing various events in London. Now is the time to register for:

  • the Haskell eXchange, a two-day three-track conference with a large number of Haskell-related talks and workshops on a wide variety of topics, including keynotes by Simon Peyton Jones, Don Stewart, Conor McBride and Graham Hutton;

  • the Haskell eXchange Hackathon, a two-day event for everyone who wants to get involved coding on projects related to the Haskell infrastructure, such as Hackage and Cabal;

  • our Haskell courses, including a two-day introductory course, a one-day course on type-level programming in GHC, and a two-day course on lazy evaluation and performance.

Haskell eXchange

Thursday, October 6 – Friday, October 7

The Haskell eXchange is a general Haskell conference aimed at Haskell enthusiasts of all skill levels. The Haskell eXchange is organized annually, and 2016 is its fifth year. For the second year in a row, the venue will be Skills Matter’s CodeNode, where we have space for three parallel tracks. New this year: a large number of beginner-focused talks. At all times, at least one track will be available with a talk aimed at (relative) newcomers to Haskell. Of course, there are also plenty of talks on more advanced topics. The four keynote speakers are Simon Peyton Jones, Don Stewart, Conor McBride and Graham Hutton.

Registration is open; you can buy tickets via Skills Matter.

Haskell eXchange Hackathon

Saturday, October 8 – Sunday, October 9

We are going to repeat the successful Haskell Infrastructure Hackathon that we organized last year directly after the Haskell eXchange. Once again, everyone who is already contributing to Haskell projects related to the Haskell infrastructure as well as everyone who wants to get involved and talk to active contributors is invited to spend two days hacking on various projects, such as Hackage and Cabal.

Registration is open. This event is free to attend (and you can attend independently of the Haskell eXchange), but there is limited space, so you have to register.

Haskell courses

Fast Track to Haskell

Monday, October 3 – Tuesday, October 4

This is a two-day general introduction to Haskell, aimed at developers who have experience with other (usually non-functional) programming languages, and want to learn about Haskell. Topics include defining basic datatypes and functions, the importance of type-driven design, abstraction via higher-order functions, handling effects (such as input/output) explicitly, and general programming patterns such as applicative functors and monads. This hands-on course includes several small exercises and programming assignments that allow you to practice and to get feedback from the instructor during the course.

Registration is open; you can buy tickets via Skills Matter.

Guide to the Haskell Type System

Wednesday, October 5

This one-day course focuses on several of the type-system-oriented language extensions that GHC offers and shows how to put them to good use. Topics include the kind system and datatype promotion, GADTs, type families, and moving even further towards dependent types via the new TypeInType extension. The extensions will be explained and illustrated with examples, and we will provide advice on how and when to best use them.

Registration is open; you can buy tickets via Skills Matter.

Guide to Haskell Performance

Monday, October 10 – Tuesday, October 11

In this two-day course, we focus on how to write performant Haskell code that scales. We systematically explain how lazy evaluation works, and how one can reason about the time and space performance of code that is evaluated lazily. We look at various common pitfalls and explain them. We look at data structures and their performance characteristics and discuss their suitability for various tasks. We also discuss how one can best debug the performance of Haskell code, and look at existing high-performance Haskell libraries and their implementation to learn general techniques that can be reused.
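
As a small taste of the kind of pitfall the course dissects (a standard illustration, not an excerpt from the course material): the lazy left fold below builds a chain of a million thunks before any addition happens, while the strict variant runs in constant space.

import Data.List (foldl')

leaky :: Integer
leaky = foldl (+) 0 [1 .. 1000000]    -- accumulator becomes a chain of thunks

constantSpace :: Integer
constantSpace = foldl' (+) 0 [1 .. 1000000]   -- accumulator forced at each step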

Registration is open; you can buy tickets via Skills Matter.

Other courses and events

Well-Typed also offers on-demand on-site training and consulting. Please contact us if you are interested in consulting, or in events or courses that are not listed here.

We also have a low-volume mailing list where we occasionally announce events that we organize or participate in (subscribe here).

by andres at September 01, 2016 01:04 PM

Douglas M. Auclair (geophf)

August 2016 1HaskellADay Problems and Solutions

August 2016
  • August 25th, 2016: Today's #haskell exercise looks at historical prices of #bitcoin
    Today's #haskell solution is worth $180k ... five years ago. I wonder what it will be worth 5 years hence? 
     
  • August 23rd, 2016: Enough diving into the node's data, let's look at the structure of the related nodes for today's #haskell problem. The structure of tweets and related data for today's #haskell solution 
  • August 22nd, 2016: Today's #haskell problem is parsing twitter hashtags and a bit of data fingerprinting/exploration of same. BOOM! Today's #haskell solution analyzes hashtags twitter-users ('tweeps') use
  • August 19th, 2016: For today's #haskell exercise we look at unique users in a set of twitter graph-JSON. Today's #haskell solution gives us a list of users, then their tweets, from twitter graph-JSON data
  • August 18th, 2016: For today's #haskell problem we extract and reify URLs from twitter graph-JSON. Today's #haskell solution extracts URLs from twitter data as easily as looking up the URLs in a JSON map.
  • August 17th, 2016: For today's #haskell problem we explore the relationships from and to tweets and their related data. Today's #haskell solution relates data to tweets extracted from graph-JSON 
  • August 16th, 2016: For today's #haskell exercise we begin converting nodes in a graph to more specific types (Tweets are up first). We create some JSON Value-extractors and with those find the tweets in graph JSON in today's #Haskell solution 
  • August 15th, 2016: Today's #haskell exercise looks at twitter data as labeled/typed nodes and relations in JSON  

    Okay! For today's #haskell solution we discover our node and relation types in twitter data-as-graphs JSON! 
  • August 10th, 2016: Today's #Haskell problem we look at the big data-problem: getting a grasp of large indices of tweets in graph JSON. Today's #Haskell solution time-stamps and gives 'small-data' indices to tweets from graph JSON 
  • August 9th, 2016: For today's #haskell problem we extract the tweets from rows of graph data encoded in JSON. Today's #Haskell solution extracts the tweets from graph JSON and does some simple queries
  • August 8th, 2016: For today's #haskell problem we look at reading in the graph of a twitter-feed as JSON and just a bit of parsing. We leverage the Cypher library for today's #haskell solution to look at 100 rows of tweets encoded as JSON 
  • August 5th, 2016: Today's #Haskell problem we go for the Big Kahuna: solving a Kakuro puzzle
    Okay, we have a #Haskell solution ... finally ... maybe. The solver took too long, so I solved it myself faster :/ 
  • August 4th, 2016: Today's #Haskell exercise looks at (simple) constraints of unknown values for a sum-solver. Today's #Haskell solution also uses QBits to solve constrained unknowns
  • August 3rd, 2016: Today's #haskell problem provides the cheatsheet: "What are the unique 4-number sums to 27?" We round-trip the Set category for today's #haskell solution
  • August 2nd, 2016: Today's #haskell exercise looks at solving our sums when we know some of the numbers already. QBits actually work nicely for today's #Haskell solution
  • August 1st, 2016: For today's #Haskell exercise we play the 'Numbers Game.' The #haskell solution is a guarded combine >>= permute in the [Int]-domain. I like the Kleisli category; ICYMI.
by geophf (noreply@blogger.com) at September 01, 2016 12:57 PM

    Edward Z. Yang

    Backpack and separate compilation

    When building a module system which supports parametrizing code over multiple implementations (i.e., functors), you run into an important implementation question: how do you compile said parametric code? In existing language implementations, there are three major schools of thought:

    1. The separate compilation school says that you should compile your functors independently of their implementations. This school values compilation time over performance: once a functor is built, you can freely swap out implementations of its parameters without needing to rebuild the functor, leading to fast compile times. Pre-Flambda OCaml works this way. The downside is that it's not possible to optimize the functor body based on implementation knowledge (unless, perhaps, you have a just-in-time compiler handy).
    2. The specialize at use school says, well, you can get performance by inlining functors at their use-sites, where the implementations are known. If the functor body is not too large, you can transparently get good performance benefits without needing to change the architecture of your compiler in major ways. Post-Flambda OCaml and C++ templates in the Borland model both work this way. The downside is that the code must be re-optimized at each use site, and there may end up being substantial duplication of code (this can be reduced at link time).
    3. The repository of specializations school says that it's dumb to keep recompiling the instantiations: instead, the compiled code for each instantiation should be cached somewhere globally; the next time the same instance is needed, it should be reused. C++ templates in the Cfront model and Backpack work this way.

    The repository perspective sounds nice, until you realize that it requires major architectural changes to the way your compiler works: most compilers don't try to write intermediate results into some shared cache, and adding support for this can be quite complex and error-prone.

    Backpack sidesteps the issue by offloading the work of caching instantiations to the package manager, which does know how to cache intermediate products. The trade-off is that Backpack is not as integrated into Haskell itself as some might like (it's extremely not first-class).

    by Edward Z. Yang at September 01, 2016 06:26 AM

    August 31, 2016

    Michael Snoyman

    Using AppVeyor for Haskell+Windows CI

    I don't think I ever documented this before, so just a quick post to get this out there. Many of us working on open source Haskell libraries already use Travis CI for doing continuous integration builds of our software. Some time ago they added support for OS X, making it possible to cover Linux and OS X with multiple configurations on their systems. For any project with a stack.yaml file, this can be easily achieved using the Stack recommended Travis configuration.

    Unfortunately, this leaves Windows testing out, and Windows is likely to be the most common build to fail. Fortunately, AppVeyor provides a similar experience to Travis, but for Windows. In order to get set up, just:

    1. Sign in with their web interface and add your Github repo
    2. Add an appveyor.yml to your project

    Here's a simple file I've used on a few projects with success:

    build: off
    
    before_test:
    - curl -sS -ostack.zip -L --insecure http://www.stackage.org/stack/windows-i386
    - 7z x stack.zip stack.exe
    
    clone_folder: "c:\\stack"
    environment:
      global:
        STACK_ROOT: "c:\\sr"
    
    test_script:
    - stack setup > nul
    # The ugly echo "" hack is to avoid complaints about 0 being an invalid file
    # descriptor
    - echo "" | stack --no-terminal test

    All this does is:

    • Downloads the Stack zip file
    • Unpacks the stack.exe executable
    • Changes the STACK_ROOT to deal with Windows long path issues
    • Runs stack setup to get a toolchain
    • Runs stack --no-terminal test to build your package and run the test suites

    You're free to modify this in any way you want, e.g., add in --bench to build benchmarks, add --pedantic to fail on warnings, etc. If you have more system library dependencies, you'll need to consult the AppVeyor docs to see how to install them. And in our use cases for Stack, we found that using the AppVeyor caching functionality made builds unreliable (due to the large size of the cache). You may want to experiment with turning it back on, since this setup is slow (it downloads and installs a full GHC toolchain and builds all library dependencies each time).

    August 31, 2016 12:00 AM

    August 30, 2016

    Joachim Breitner

    Explicit vertical alignment in Haskell

    Chris Done’s automatic Haskell formatter hindent has been released in a new version and is getting quite a bit of deserved attention. He is polling the Haskell programmers on whether two or four spaces are the right indentation. But that is just cosmetics…

    I am in principle very much in favor of automatic formatting, and I hope that a tool like hindent will eventually be better at formatting code than a human.

    But it currently is not there yet. Code is literature meant to be read, and good code goes to great lengths to be easily readable; formatting can carry semantic information.

    The Haskell syntax was (at least I get that impression) designed to allow the authors to write nicely looking, easy to understand code. One important tool here is vertical alignment of corresponding concepts on different lines. Compare

    maze :: Integer -> Integer -> Integer
    maze x y
      | abs x > 4  || abs y > 4  = 0
      | abs x == 4 || abs y == 4 = 1
      | x ==  2    && y <= 0     = 1
      | x ==  3    && y <= 0     = 3
      | x >= -2    && y == 0     = 4
      | otherwise                = 2

    with

    maze :: Integer -> Integer -> Integer
    maze x y
      | abs x > 4 || abs y > 4 = 0
      | abs x == 4 || abs y == 4 = 1
      | x == 2 && y <= 0 = 1
      | x == 3 && y <= 0 = 3
      | x >= -2 && y == 0 = 4
      | otherwise = 2

    The former is a quick-to-grasp specification; the latter (the output of hindent at the moment) is a desert of numbers and operators.

    I see two ways forward:

    • Tools like hindent get improved to the point that they are able to detect such patterns, and indent it properly (which would be great, but very tricky, and probably never complete) or
    • We give the user a way to indicate intentional alignment in a non-obtrusive way that gets detected and preserved by the tool.

    What could such ways be?

    • For guards, it could simply detect that within one function definition, there are multiple | in the same column, and keep them aligned.
    • More generally, one could take the approach that lhs2TeX takes (which, IMHO, with careful input, a proportional font and the great polytable LaTeX backend, produces the most pleasing code listings). There, two spaces or more indicate an alignment point, and if two such alignment points are in the same column, their alignment is preserved – even if there are lines in between! (A rough sketch of this detection rule is given after the example below.)

      With the latter approach, the code up there would be written

      maze :: Integer -> Integer -> Integer
      maze x y
        | abs x > 4   ||  abs y > 4   = 0
        | abs x == 4  ||  abs y == 4  = 1
        | x ==  2     &&  y <= 0      = 1
        | x ==  3     &&  y <= 0      = 3
        | x >= -2     &&  y == 0      = 4
        | otherwise                   = 2

      And now the intended alignment is explicit.
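
    Here is a minimal sketch of that detection rule (my own illustration in Haskell, not hindent or lhs2TeX code):

    -- Each run of two or more spaces marks the column of the next
    -- non-space character as an alignment point. (A real implementation
    -- would exclude leading indentation.)
    alignmentPoints :: String -> [Int]
    alignmentPoints = go 0
      where
        go _ [] = []
        go col (' ' : ' ' : rest) =
          let (extra, rest') = span (== ' ') rest
              col' = col + 2 + length extra
          in if null rest' then [] else col' : go col' rest'
        go col (_ : rest) = go (col + 1) rest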

    (This post is cross-posted on reddit.)

    Update (2016-09-05) Shortly after this post, the Haskell formatter brittany was released, which supports vertical alignment. Yay!

    by Joachim Breitner (mail@joachim-breitner.de) at August 30, 2016 01:35 PM

    August 29, 2016

    Edward Z. Yang

    cabal new-build is a package manager

    An old article I occasionally see cited today is Repeat after me: "Cabal is not a Package Manager". Many of the complaints don't apply to cabal-install 1.24's new Nix-style local builds. Let's set the record straight.

    Fact: cabal new-build doesn't handle non-Haskell dependencies

    OK, so this is one thing that hasn't changed since Ivan's article. Unlike Stack, cabal new-build will not handle downloading and installing GHC for you, and like Stack, it won't download and install system libraries or compiler toolchains: you have to do that yourself. This is definitely a case where you should lean on your system package manager to bootstrap a working installation of Cabal and GHC.

    Fact: The Cabal file format can record non-Haskell pkg-config dependencies

    Since 2007, the Cabal file format has a pkgconfig-depends field which can be used to specify dependencies on libraries understood by the pkg-config tool. It won't install the non-Haskell dependency for you, but it can let you know early on if a library is not available.

    In fact, cabal-install's dependency solver knows about the pkgconfig-depends field, and will pick versions and set flags so that we don't end up with a package with an unsatisfiable pkg-config dependency.
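
    For reference, here is roughly what such a field looks like in a library's .cabal file (a sketch; the package names are just illustrative):

    library
      exposed-modules:   Graphics.MyBindings
      build-depends:     base >= 4 && < 5
      pkgconfig-depends: cairo >= 1.0, gtk+-2.0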

    Fact: cabal new-build 2.0 handles build-tools dependencies

    As of writing, this feature is unreleased (if you are impatient, get a copy of HEAD from the GitHub repository or install cabal-install-head from hvr's PPA). However, in cabal-install 2.0, build-tools dependencies will be transparently built and added to your PATH. Thus, if you want to install a package which has build-tools: happy, cabal new-build will automatically install happy and add it to the PATH when building this package. These executables are tracked by new-build and we will avoid rebuilding the executable if it is already present.

    Since build-tools identify executable names, not packages, there is a set of hardcoded build-tools which are treated in this way, coinciding with the set of build-tools that simple Setup scripts know how to use natively. They are hscolour, haddock, happy, alex, hsc2hs, c2hs, cpphs and greencard.

    Fact: cabal new-build can upgrade packages without breaking your database

    Suppose you are working on some project which depends on a few dependencies. You decide to upgrade one of your dependencies by relaxing a version constraint in your project configuration. After making this change, all it takes is a cabal new-build to rebuild the relevant dependency and start using it. That's it! Even better, if you had an old project using the old dependency, well, it still is working, just as you would hope.

    What is actually going on is that cabal new-build doesn't do anything like a traditional upgrade. Packages installed to cabal new-build's global store are uniquely identified by a Nix-style identifier which captures all of the information that may have affected the build, including the specific versions that were built against. Thus, a package "upgrade" actually is just the installation of a package under a different unique identifier which can coexist with the old one. You will never end up with a broken package database because you typed new-build.
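
    As a toy illustration of the idea (a sketch of mine, not cabal's actual hashing scheme), you can think of the identifier as a hash over every input that influenced the build:

    import Data.Hashable (hash)

    data BuildInputs = BuildInputs
      { pkgId    :: String             -- e.g. "text-1.2.2.1"
      , ghcId    :: String             -- compiler version used
      , flags    :: [(String, Bool)]   -- resolved Cabal flags
      , depUnits :: [String]           -- unit ids of the exact dependencies
      } deriving Show

    -- Two builds of "the same" package with different inputs get different
    -- unit ids, so they can coexist in the store.
    unitId :: BuildInputs -> String
    unitId inputs = pkgId inputs ++ "-" ++ show (abs (hash (show inputs)))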

    There is not presently a mechanism for removing packages besides deleting your store (.cabal/store), but it is worth noting that deleting your store is a completely safe operation: cabal new-build won't decide that it wants to build your package differently if the store doesn't exist; the store is purely a cache and does not influence the dependency solving process.

    Fact: Hackage trustees, in addition to package authors, can edit Cabal files for published packages to fix bugs

    If a package is uploaded with bad version bounds and a subsequent new release breaks them, a Hackage Trustee can intervene, making a modification to the Cabal file to update the version bounds in light of the new information. This is a more limited form of intervention than the patches of Linux distributions, but it is similar in nature.

    Fact: If you can, use your system package manager

    cabal new-build is great, but it's not for everyone. If you just need a working pandoc binary on your system and you don't care about having the latest and greatest, you should download and install it via your operating system's package manager. Distro packages are great for binaries; they're less good for libraries, which are often too old for developers (though it is often the easiest way to get a working install of OpenGL). cabal new-build is oriented at developers of Haskell packages, who need to build and depend on packages which are not distributed by the operating system.

    I hope this post clears up some misconceptions!

    by Edward Z. Yang at August 29, 2016 09:32 PM

    Functional Jobs

    Senior Software Engineer (Haskell) at Front Row Education (Full-time)

    Position

    Senior Software Engineer to join fast-growing education startup transforming the way 3+ million K-12 students learn Math and English.

    What you tell your friends you do

    “You know how teachers in public schools are always overworked and overstressed with 30 kids per classroom and never ending state tests? I make their lives possible and help their students make it pretty far in life”

    What you really will be doing

    Architect, design and develop new web applications, tools and distributed systems for the Front Row ecosystem in Haskell, Flow, PostgreSQL, Ansible and many others. You will get to work on your deliverable end-to-end, from the UX to the deployment logic

    Mentor and support more junior developers in the organization

    Create, improve and refine workflows and processes for delivering quality software on time and without incurring debt

    Work closely with Front Row educators, product managers, customer support representatives and account executives to help the business move fast and efficiently through relentless automation.

    How you will do this

    You’re part of an agile, multidisciplinary team. You bring your own unique skill set to the table and collaborate with others to accomplish your team’s goals.

    You prioritize your work with the team and its product owner, weighing both the business and technical value of each task.

    You experiment, test, try, fail and learn all the time

    You don’t do things just because they were always done that way, you bring your experience and expertise with you and help the team make the best decisions

    What have we worked on in the last quarter

    We have rewritten our business logic to be decoupled from the Common Core math standards, supporting US state-specific standards and international math systems

    Prototyped and tested a High School Math MVP product in classrooms

    Changed assigning Math and English to a work queue metaphor across all products for conceptual product simplicity and consistency

    Implemented a Selenium QA test suite 100% in Haskell

    Released multiple open source libraries for generating automated unit test fixtures, integrating with AWS, parsing and visualizing Postgres logs and much more

    Made numerous performance optimization passes on the system for supporting classrooms with weak Internet bandwidth

    Team

    We’re an agile and lean small team of engineers, teachers and product people working on solving important problems in education. We hyper-focus on speed, communication and prioritizing what matters to our millions of users.

    Requirements

    • You’re smart and can find a way to show us.
    • A track record of 5+ years of working in, or leading, teams that rapidly ship high quality web-based software that provides great value to users. Having done this at a startup a plus.
    • Awesome at a Functional Programming language: Haskell / Scala / Clojure / Erlang etc
    • Exceptional emotional intelligence and people skills
    • Organized and meticulous, but still able to focus on the big picture of the product
    • A ton of startup hustle: we're a fast-growing, VC-backed, Silicon Valley tech company that works hard to achieve the greatest impact we can.

    Benefits

    • Money, sweet
    • Medical, dental, vision
    • Incredible opportunity to grow, learn and build lifetime bonds with other passionate people who share your values
    • Food, catered lunch & dinner 4 days a week + snacks on snacks
    • Room for you to do things your way at our downtown San Francisco location right by the Powell Station BART, or you can work remotely from anywhere in the US, if that’s how you roll
    • Awesome monthly team events + smaller get-togethers (board game nights, trivia, etc)

    Get information on how to apply for this position.

    August 29, 2016 05:59 PM

    Philip Wadler

    Option A: Think about the children

    Fellow tweeter @DarlingSteveEDI captured my image (above) as we gathered for Ride the Route this morning, in support of Option A for the Edinburgh's proposed West-East Cycle Route (the route formerly known as Roseburn to Leith Walk). My own snap of the gathering is below.

    Fellow blogger Eilidh Troup considers another aspect of the route, safety for schoolchildren. Option A is far safer than Option B for young children cycling to school: the only road crossing in Option A is guarded by a lollipop lady, while children taking Option B must cross *three* busy intersections unaided.

    It's down to the wire: members of the Transport and Environment Committee vote tomorrow. The final decision may be closely balanced, so even sending your councillor (and councillors on the committee) a line or two can have a huge impact. If you haven't written, write now, right now!

    Previously:
      Roseburn to Leith Walk A vs B: time to act!
      Ride the Route in support of Option A

    Late breaking addendum:
      Sustrans supports Option A: It’s time for some big decisions…

    by Philip Wadler (noreply@blogger.com) at August 29, 2016 04:47 PM

    Michael Snoyman

    Follow up: haskell.org and the Evil Cabal

    Yesterday I put out a blog post describing a very problematic situation with the haskell.org committee. As often happens with this kind of thing, a very lively discussion occurred on Reddit. There are many repeating themes over there, so instead of trying to address the points in that discussion, I'm going to give some responses in this post.

    • Firstly: thank you to those of you who subscribed to the haskell-community list and made your voices heard. That was the best response to the blog post I could have hoped for, and it happened. At this point, the Twitter poll and mailing list discussion both point to a desire to have Stack as the primary option on the downloads page (the latter is a tied vote of 6 to 6, indicating the change proposed should not happen). As far as I'm concerned, the committee has two options:

      • Listen to the voices of the community and make Stack the primary option on haskell.org.

      • Ignore the community voices and put the Haskell Platform at the top of the page, thus confirming my claims of an oligarchy.

    • Clarification: I do not believe anyone involved in this is an evil person. I thought my wording was unambiguous, but apparently not. The collusion among the projects is what gets the term "Evil Cabal." That said, I do believe that there were bad actions taken by individuals involved, and I've called some of those out. There's a much longer backstory here of the nepotism I refer to, starting at least at ICFP 2014 and GPS Haskell, but that's a story I'm not getting into right now.

    • A few people who should know better claimed that there's no reason for my complaint given that the Haskell Platform now ships with Stack. This is incorrect for multiple reasons. Firstly, one of my complaints in the blog post is that we've never discussed technical merits, so such a claim should be seen as absurd immediately. There's a great Reddit comment explaining that this inclusion is just misdirection. In any event, here are just 140 characters worth of reasons the Haskell Platform is inferior to Stack for a new user

      • There is no clear "getting started" guide for new users. Giving someone a download is only half the battle. If they don't know where to go next, the download is useless. (Compare with haskell-lang's getting started.)

      • Choice confusion: saying "HP vs Stack" is actually misleading. The real question is "HP+cabal-install vs HP+Stack vs Stack". A new user is not in a strong enough position to make this decision.

      • Stack will select the appropriate version of GHC to be used based on the project the user is working on. Bundling GHC with Stack insists on a specific GHC version. (I'm not arguing that there's no benefit to including GHC in the installer, but there are definitely downsides too.)

      • The HP release process has historically been very slow, whereas the Stack release process is a well oiled machine. I have major concerns about users being stuck with out-of-date Stack executables by using the HP and running into already fixed bugs. This isn't hypothetical: GHC for Mac OS X shipped an old Stack version for a while resulting in many bug reports. (This is an example of haskell.org download page decisions causing extra work for the Stack team.)

      • Bonus point (not on Twitter): Stack on its own is very well tested. We have little experience in the wild of HP+Stack. Just assuming it will work is scary, and goes against the history of buggy Haskell Platform releases.

    • A lot of the discussion seemed to assume I was saying to get rid of cabal-install entirely. In fact, my blog post said the exact opposite: let it continue if people want to work on it. I'm talking exclusively about the story we tell to new users. Again, technical discussions should have occurred long ago about what's the best course of action. I'm claiming that Stack is by far the best option for the vast majority of new users. The committee has never to my knowledge argued publicly against that.

    • There was a lot of "tone policing," saying things like I need to have more patience, work with not against the committee, follow the principle of charity, etc. If this is the first time I raised these issues, you'd be right. Unfortunately, there is a long history here of many years of wasted time and effort. The reason I always link back to pull request #130 is because it represents the tipping point from "work with the committee without making a fuss" to "I need to make all of these decisions as public as possible so bad decisions don't slip in."

      Let me ask you all: if I had just responded to the mailing list thread asking for a different course of action to be taken, would most of you know that this drama was happening? This needed to be public, so that no more massive changes could slip under everyone's radar.

      Also: it's ironic to see people accusing me of violating the principle of charity by reading my words in the most negative way they possibly can. That's true irony, not just misrepresenting someone's position.

    • For a long time, people have attacked FP Complete every chance they could, presumably because attacking a company is easier than attacking an individual. There is no "FP Complete" conspiracy going on here. I decided to write this blog post on my own, not part of any FP Complete strategy. I discussed it with others, most of whom do not work for FP Complete. In fact, most of the discussion happened publicly, on Twitter, for you all to see.

      If you want to attack someone, attack me. Be intellectually honest. And while you're at it: try to actually attack the arguments made instead of resorting to silly ad hominems about power grabs. Such tin-foil hattery is unbecoming.

    • There's a legitimate discussion about how we get feedback from multiple forms of communication (mailing lists, Twitter, Reddit). While that's a great question to ask and a conversation to have, it really misses the point here completely: we're looking for a very simple vote on three options. We can trivially put up a Google Form or similar and link to it from all media. We did this just fine with the FTP debate. It feels almost disingenuous to claim that we don't know how to deal with this problem when we've already dealt with it in the past.

    August 29, 2016 12:00 AM

    Christopher Done

    hindent 5: One style to rule them all

    Reminder of the past

    In 2014, in my last post about hindent, I wrote these points:

    1. Automatic formatting is important:
        1. Other people also care about this
        2. The Haskell community is not immune to code formatting debates

    I proposed my hindent tool, which:

    1. Would format your code.
    2. Supported multiple styles.
    3. Supported further extension/addition of more styles trivially.

    Things learned

    I made some statements in that post that I’m going to re-evaluate in this post:

    1. Let’s have a code style discussion. I propose to solve it with tooling.
    2. It’s not practical to force everyone into one single style.

    Code formatting is solved with tooling

    I’ve used hindent for two years; it solves the problem. There are a couple of exceptions1. On the whole, though, it’s a completely different working experience:

    • Code always looks the same.
    • I don’t make any style decisions. I just think about the tree I need for my program.
    • I don’t do any manual line-breaking.
    • I’ve come to exploit it by writing lazy code like do x<-getLine;when(x>5)(print 5) and then hitting a keybinding to reformat it.

    Switching style is realistic

    I’ve been writing Haskell in my own style for years. For me, my style is better for structured editing, more consistent, and visually easier to read, than most code I’ve seen. It’s like Lisp. Using hindent, with my ChrisDone style, I had it automatically formatted for me. I used 2-space indents.

    The most popular style in the community2 is JohanTibell: the alignment, line-breaking, and spacing (4 spaces instead of 2) differ significantly from my own style.

    At FP Complete I’ve done a lot of projects, private FP Complete projects, client projects, and public FP Complete projects (like Stack). For the first year or so I generally stuck to my guns when working on code only I was going to touch and used my superior style.

    But once the JohanTibell style in hindent was quite stable, I found that I didn’t mind using it while collaborating with people who prefer that style. The tooling made it so automatic, that I didn’t have to understand the style or make any style decisions, I just wrote code and got on with it. It doesn’t work great with structured-haskell-mode, but that’s ok. Eventually I got used to it, and eventually switched to using it for my own personal projects.

    I completely did a U-turn. So I’m hoping that much of the community can do so too and put aside their stylistic preferences and embrace a standard.

    Going forward

    hindent-5.* now supports one style, based on the Johan Tibell style guide. My own style guide is now deprecated in favor of that. The style flag --style foo is now silently ignored.

    There is a demonstration web site in which you can try examples, and also get a link for the example to show other people the output (for debugging).

    HIndent now has a “literate” test suite here: TESTS.md. You can read through it as a document, a bit like Johan’s style guide. But running the test suite parses this file and checks that each code fence is printed as written.

    There’s also a BENCHMARKS.md: since I rewrote comment handling, switched to a bytestring-builder, and improved the quadratic line-breaking algorithm to short-circuit, among other improvements, hindent now formats things in 1.5ms instead of 1s.

    For those who still want to stick with their old hindent, Andrew Gibiansky is keeping a fork of hindent 4 for his personal use, and has said he’ll accept PR’s for that.

    HIndent is not perfect, there’s always room for improvement (issue tracker welcomes issues), but over time that problem space gets smaller and smaller. There is support for Emacs, Vim and Atom. I would appreciate support for SublimeText too.

    Give it a try!


    1. Such as CPP #if directives–they are tricky to handle. Comments are also tricky, but I’ve re-implemented comment handling from scratch and it works pretty well now. See the pretty extensive tests.

    2. From a survey of the top downloaded 1000 packages on Hackage, 660 are 4-spaced and 343 are 2-spaced. All else being equal, 4 spaces wins.

    August 29, 2016 12:00 AM

    August 28, 2016

    Dimitri Sabadie

    luminance designs

    luminance-0.7.0 was released a few days ago and I decided it was time to explain exactly what luminance is and what were the design choices I made. After a very interesting talk with nical about other rust graphics frameworks (e.g. gfx, glium, vulkano, etc.), I thought it was time to give people some more information about luminance and how to compare it to other frameworks.

    Origin

    luminance started as a Haskell package, extracted from a “3D engine” I had been working on for a while called quaazar. I came to the realization that I wasn’t using the Haskell garbage collector at all and that I could benefit from using a language without GC. Rust is a very famous language and well appreciated in the Haskell community, so I decided to jump in and learn Rust. I migrated luminance in a month or two. The mapping is described in this blog entry.

    What is luminance for?

    I’ve been writing 3D applications for a while and I always was frustrated by how OpenGL is badly designed. Let’s sum up the lack of design of OpenGL:

    • weakly typed: OpenGL has types, but… it actually does not. GLint, GLuint or GLbitfield are all defined as aliases of primitive types (i.e. something like typedef float GLfloat). Try it with grep -Rn "typedef [a-zA-Z]* GLfloat" /usr/include/GL. This leads to the fact that framebuffers, textures, shader stages, shader programs or even uniforms, etc. have the same type (GLuint, i.e. unsigned int). Thus, a function like glCompileShader expects a GLuint as argument, though you can pass a framebuffer, because it’s also represented as a GLuint – very bad for us. It’s better to consider that those are just untyped – :( – handles.
    • runtime overhead: Because of the point above, functions cannot assume you’re passing a value of the expected type – e.g. the example just above with glCompileShader and a framebuffer. That means OpenGL implementations have to check all the values you’re passing as arguments to be sure they match the type. That’s basically several tests for each call of an OpenGL function. If the type doesn’t match, you’re screwed – see the next point.
    • error handling: This is catastrophic. Because of the runtime overhead, almost all functions might set the error flag. You have to check the error flag with the glGetError function, adding a side-effect, preventing parallelism, etc.
    • global state: OpenGL works on the concept of global mutation. You have a state, wrapped in a context, and each time you want to do something with the GPU, you have to change something in the context. Such a context is important; however, some mutations shouldn’t be required. For instance, when you want to change the value of an object or use a texture, OpenGL requires you to bind the object. If you forget to bind the next object, the mutation will occur on the first one. Side effects, side effects…

    The goal of luminance is to fix most of those issues by providing a safe, stateless and elegant graphics framework. It should be as low-level as possible, but shouldn’t sacrifice runtime performance – CPU load as well as memory bandwidth. That is why if you know how to program with OpenGL, you won’t feel lost when getting your feet wet with luminance.

    Because of the many OpenGL versions and other technologies (among them, vulkan), luminance has an extra aim: abstract over the trending graphics API.

    Types in luminance

    In luminance, all graphics resources – and even concepts – have their own respective type. For instance, instead of GLuint for both shader programs and textures, luminance has Program and Texture. That ensures you don’t pass values with the wrong types.

    Because of the static guarantees provided at compile time by such a strong-typing scheme, the runtime shouldn’t have to check for type safety. Unfortunately, because luminance wraps over OpenGL in the luminance-gl backend, we can only add static guarantees; we cannot remove the runtime overhead.

    Error handling

    luminance follows the Rust conventions and uses the famous Option and Result types to specify errors. You will never have to check against a global error flag, because this is just all wrong. Keep in mind, you have the try! macro in your Rust prelude; use it as often as possible!

    Even though Rust provides an escape hatch – i.e. panics – there’s no such thing as exceptions in Rust. The try! macro is just syntactic sugar for:

    match result {
        Ok(x) => x,
        // try! propagates the error to the caller (converting via From):
        Err(e) => return Err(e.into()),
    }

    Stateless

    luminance is stateless. That means you don’t have to bind an object to be able to use it; luminance takes care of that for you in a very simple way. To achieve this while keeping performance up, a bit of higher-level structure is added on top of the OpenGL API, leveraging how binds should happen.

    Whatever the task you’re trying to achieve, whatever computation or problem, it’s always better to gather / group the computation by batches. A good example of that is how magnetic hard disk drives work, or your RAM. If you spread your data across the disk region (fragmented data) or across several non-contiguous addresses in your RAM, it will end up causing unnecessary moves. The hard drive’s head will have to go all over the disk to gather the information, and it’s very likely you’ll destroy the RAM performance (and your CPU caches) if you don’t put the data in a contiguous area.

    If you don’t group your OpenGL resources – for instance, you render 400 objects with shader A, 10 objects with shader B, then 20 objects with shader A, 32 objects with shader C, 348 objects with shader A and finally 439 objects with shader B – you’ll add more OpenGL calls to the equation – hence more global state mutations, and those are costly.

    Instead of this:

    1. 400 objects with shader A
    2. 10 objects with shader B
    3. 20 objects with shader A
    4. 32 objects with shader C
    5. 348 objects with shader A
    6. 439 objects with shader B

    luminance forces you to group your resources like this:

    1. 400 + 20 + 348 objects with shader A
    2. 10 + 439 objects with shader B
    3. 32 objects with shader C

    This is done via types called Pipeline, ShadingCommand and RenderCommand.
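
    To make the shape of that grouping concrete, here is a schematic sketch in Haskell (the language luminance originally came from); the types and names are illustrative stand-ins, not the actual luminance API:

    import Control.Monad (forM_)

    type Framebuffer = String   -- stand-ins for the real resource types
    type Program     = String
    type Object      = String

    data Pipeline       = Pipeline Framebuffer [ShadingCommand]
    data ShadingCommand = ShadingCommand Program [Object]

    -- One framebuffer bind per Pipeline, one program bind per group:
    runPipeline :: Pipeline -> IO ()
    runPipeline (Pipeline fb groups) = do
      putStrLn ("bind framebuffer " ++ fb)
      forM_ groups $ \(ShadingCommand prog objs) -> do
        putStrLn ("use program " ++ prog)
        mapM_ (putStrLn . ("render " ++)) objs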

    Pipelines

    A Pipeline gathers shading commands under a Framebuffer. That means that all ShadingCommands embedded in the Pipeline will output to the embedded Framebuffer. Simple, yet powerful, because we can bind the framebuffer when executing the pipeline and don’t have to worry about it until the next execution of another Pipeline.

    ShadingCommand

    A ShadingCommand gathers render commands under a shader Program, along with an update function. The update function is used to customize the Program by providing uniforms – i.e. Uniform. If you want to change a Program's Uniforms once per frame – and only if the Program is used only once in the frame – this is the right place to do it.

    All the RenderCommands embedded in the ShadingCommand will be rendered using the embedded shader Program. As with the Pipeline, we don't have to worry about binding: we just use the embedded shader program when executing the ShadingCommand, and we'll bind another program the next time a ShadingCommand is run!

    RenderCommand

    A RenderCommand gathers all the information required to render a Tessellation, that is:

    • the blending equation, source and destination blending factors
    • whether the depth test should be performed
    • an update function to update the Program being in use – so that each object can have different properties used in the shader program
    • a reference to the Tessellation to render
    • the number of instances of the Tessellation to render
    • the size of the rasterized points (if the Tessellation contains any)

    What about shaders?

    Shaders are written in… the backend's expected format. For OpenGL, you'll have to write GLSL. The backend automatically inserts the version pragma (#version 330 core for OpenGL 3.3, for instance). Initially, I wanted to port cheddar, my Haskell shader EDSL. But the sad part of the story is that Rust is not – yet – able to handle that kind of thing correctly. I started to implement an EDSL for luminance with macros. Even though it was usable, the error handling was seriously terrible – macros shouldn't be used for such an important purpose. Then some rustaceans pointed out that I could implement a (rustc) compiler plugin, which enables new constructs directly in Rust by extending its syntax. This is great.

    However, with hindsight, I won't do that, for a very simple reason: luminance is, currently, simple, stateless and, most of all, it works! I released a PC demo in Köln, Germany using luminance and a demoscene graphics framework I'm working on:

    pouët.net link

    youtube capture

    ion demoscene framework

    While developing Céleri Rémoulade, I decided to bake the shaders directly into the Rust source – to get a feel for what I had wanted to build, i.e., a shader EDSL. So there are a bunch of constant &'static str values everywhere. Each time I wanted to fix a shader, I had to leave the application, make the change, recompile, rerun… I'm not sure that's a good thing. Interactive programming is a very good thing we can enjoy – yes, even in strongly typed languages ;).

    I saw that gfx doesn't have its own shader EDSL either and requires you to provide several shader implementations (one per backend). I don't know; I think that's not so bad if you only target a single backend (i.e. OpenGL 3.3 or Vulkan). Transpiling shaders is a thing, I've been told…

    sneaking out…

    Feel free to dig into the code of Céleri Rémoulade here. It's demoscene code, so it was rushed out before the release – read: it's not as clean as I wanted it to be.

    I'll provide more information in the coming weeks, but I prefer spending my spare time writing code rather than explaining what I'm going to do – and then missing the time to actually do it. ;)

    Keep the vibe!

    by Dimitri Sabadie (noreply@blogger.com) at August 28, 2016 11:46 PM

    Michael Snoyman

    haskell.org and the Evil Cabal

    There's no point being coy or saying anything but what I actually believe, and saying it bluntly. So here it is:

    The haskell.org committee has consistently engaged in tactics which silence the voices of all non-members, and has stacked its own membership to prevent dissenting opinions from joining.

    I've said various parts of this previously. You may have heard me say things like the haskell.org oligarchy, refer to the "evil cabal of Haskell" (referring to the nepotism which exists amongst Hackage, cabal-install, haskell.org, and the Haskell Platform), or engage in lengthy debates with committee members about their actions.

    This is a pretty long post; if you just want to see my request, please jump to the end.

    The backstory

    To summarize a quick backstory: many of us in the community have been dissatisfied with the four members of the "evil cabal" for years, and have made efforts to improve them, only to be met with opposition. One by one, some of us have been replacing these components with alternatives. Hackage's downtime led to an FP Complete mirror and more reliable doc hosting on stackage.org. cabal-install's weaknesses led to the creation of the Stack build tool. Haskell Platform's poor curation process and broken installer led to Stackage Nightly and LTS Haskell, as well as some of the Stack feature set. And most recently, the haskell.org committee's poor decisions (as I'll demonstrate shortly) about website content led to resurrecting haskell-lang.org, a website devoted to actually making Haskell a more approachable language.

    As you can see, at this point all four members of the evil cabal have been replaced with better options, and community discussions and user statistics indicate that most users are switching over. (For an example of statistics, have a look at the package download count on Hackage, indicating that the vast majority of users are no longer downloading packages via cabal-install+Hackage.) I frankly have no problem at all with the continued existence and usage of these four projects; if people want to spend their time on them and use what I consider to be inferior tools, let them. The only remaining pain point is that new, unsuspecting users will arrive at the haskell.org download page instead of the much more intuitive haskell-lang.org get-started page.

    EDIT: Ignore that bit about the download statistics; it's apparently an artifact of the CDN usage on Hackage. Instead, one need only look at how often a question about Haskell Platform is answered with "don't do that, use Stack instead." For a great example, see the discussion of the Rust Platform.

    The newest attempt

    Alright, with that out of the way, why am I writing this blog post now? It's due to this post on the Haskell-community mailing list, proposing to promote the Haskell Platform above all other options (yet again). Never heard of that mailing list? That's not particularly surprising. That mailing list was created in response to a series of complaints by me, claiming that the haskell.org committee acted in a secretive way and ignored all community input. The response to this was, instead of listening to the many community discussions already occurring on Twitter and Reddit, to create a brand new mailing list, have an echo chamber of people sympathetic to Evil Cabal thought, and insist that "real" community discussions go on there.

    We're seeing this process work exactly as the committee wants. Let me demonstrate clearly how. At the time of writing this blog post, three people have voted in favor of promoting the HP on haskell-community, including two haskell.org committee members (Adam Foltzer and John Wiegley) and the person who originally proposed it, Jason Dagit. There were two objections: Chris Allen and myself. So with a sample size of 5, we see that 60% of the community wants the HP.

    The lie

    A few hours after this mailing list post, I put out a poll on Twitter. At the time of writing (4 hours or so into the poll), we have 122 votes, with 85% in favor of Stack, and 15% in favor of some flavor of the Haskell Platform (or, as we'll now be calling it, the Perfect Haskell Platform). Before anyone gets too excited: yes, a poll of my Twitter followers is obviously a biased sample, but no more biased than the haskell-community list. My real point is this:

    The haskell.org committee is posing questions of significant importance in echo chambers where they'll get the response they want from a small group of people, instead of engaging the community correctly on platforms that make participation easy.

    This isn't the first time this has happened. When we last discussed haskell.org download page content, a similar phenomenon occurred. Magically, the haskell-community discussion had a bias in favor of the Haskell Platform. In response, I created a Google Form, and Stack was the clear victor. (The original post includes a chart of the poll results.)

    Yet despite this clear feedback, the committee went ahead with putting minimal installers at the top, not Stack (they weren't quite brazen enough to put the Perfect Haskell Platform at the top or even above Stack, for which I am grateful).

    Proper behavior

    As I see it, the haskell.org committee has two correct options to move forward with making the download page decision:

    • Accept the votes from my Twitter poll in addition to the haskell-community votes
    • Decide that my poll is invalid for some reason, and do a proper poll of the community, with proper advertisement on Reddit, Twitter, the more popular mailing lists, etc

    If past behavior is any indication, though, I predict a third outcome: stating that the only valid form of feedback is on the haskell-community mailing list, ignoring the clear community groundswell against their decisions, and continuing to make unilateral, oligarchic decisions. Namely: promote the Haskell Platform, thereby misleading all the unfortunate new Haskellers who end up at haskell.org instead of the much better haskell-lang.org.

    Further evidence

    Everyone's always asking me for more of the details on what's gone on here, especially given how some people vilify my actions. I've never felt comfortable putting that kind of content on blogs shared with other authors when some of those others don't want me to call out the negative actions. However, thankfully I now have my own blog to state this from. This won't include every punch thrown in this long and sordid saga, but hopefully will give a much better idea of what's going on here.

    • Not only are conversations held in private by the committee, but:

      • Their private nature is used to shut down commentary on committee actions
      • There is open deception about what was actually discussed in private

      Evidence: see this troubling Reddit thread. I made the (very true) claim that Gershom made a unilateral decision about the downloads page. You can see the evidence of this where he made that decision. Adam Foltzer tried to call my claim false, and ultimately Gershom himself confirmed I was correct. Adam then claimed offense at this whole discussion and backed out.

    • When I proposed making Stack the preferred download option (at a time when Stack did not appear at all on haskell.org), Gershom summarily closed the pull request. I have referenced this pull request many times. I don't believe any well-intentioned person can read that long discussion and believe that the haskell.org committee has a healthy process for maintaining a community website.

    • At no point in any of these discussions has the committee opened up discussion to either the technical advantages of the HP vs Stack, or the relative popularity. Instead, we get discussions of committee process, internal votes, an inability to make changes at certain periods of time based on previously made and undocumented decisions.

    • We often hear statements from committee members about the strong support for their actions, or lack of controversy on an issue. These claims are many times patently false to any objective third party. For example, Gershom claimed that the pull request #122 that he unilaterally decided to merge was "thought to be entirely mundane and uncontroversial." Everyone is welcome to read the Reddit discussion and decide if Gershom is giving a fair summary or not.

    • Chris Done - a coworker of mine - spent his own time on creating the first haskell-lang.org, due to his unhappiness with the homepage at that time. His new site was met with much enthusiasm, and he was pressured by many to get it onto haskell.org itself. What ensued was almost a year of pain working out the details, having content changed to match the evil cabal narrative, and eventually a rollout. At the end of this, Chris was - without any given reason - not admitted to the haskell.org committee, denying him access to share an opinion on what should be on the site he designed and created.

    My request

    Thank you for either getting through all of that, or skipping to this final section. Here's my request: so many people have told me that they feel disenfranchised by these false-flag "community" processes, and just give up on speaking up. This is what has allowed the negative behavior of the evil cabal to dominate Haskell for so long. If you've already moved on to Stack and Stackage yourself, you're mostly free of this cabal. I'm asking you to think of the next generation of Haskell users, and speak up.

    Most powerful course of action: subscribe to the haskell-community mailing list and speak out about how the committee has handled the downloads page. Don't just echo my message here: say what you believe. If you think they've done a good job, then say so. If you think (like I do) that they've done a bad job, and are misleading users with their decisions, say that.

    Next best: comment about this on Reddit or Twitter. Get your voice out there and be heard, even if it isn't in the haskell.org committee echo chamber.

    In addition to that: expect me to put out more polls on Twitter and possibly elsewhere. Please vote! We've let a select few make damaging decisions for too long; make your voice heard. I'm confident that we will have a more user-friendly Haskell experience if we actually start listening to users.

    And finally: as long as it is being mismanaged, steer people away from haskell.org. This is why we created haskell-lang.org. Link to it, tell your friends about it, warn people away from haskell.org, and maybe even help improve its content.


    Archive links of the Reddit and GitHub threads quoted above:

    • http://archive.is/7zFkb
    • http://archive.is/NTzUD
    • http://archive.is/roexm
    • http://archive.is/uwdzr
    • http://archive.is/uduu5

    August 28, 2016 12:00 AM

    August 26, 2016

    Functional Jobs

    Full-Stack Developer (Haskell/PureScript) at CollegeVine (Full-time)

    Overview

    CollegeVine is looking for a product-focused full-stack developer to help engineer the future of mentorship and higher education attainment.

    There aren't many industries left that haven't been significantly disrupted by technology in some way, but you're reading about one right here! You will find many opportunities to apply high-leverage computer science (think machine learning, probabilistic reasoning, etc.) as well as plenty of opportunities for the more human side of the problem. As it stands, the current admissions process is a huge source of stress and confusion for students and parents alike. If we execute correctly, your work will impact the entire next generation of college graduates-to-be.

    You will join a fast-moving company whose culture centers around authenticity, excellence, and balance. You'll find that everyone likes to keep things simple and transparent. We prefer to be goal-oriented and hands-off as long as you are a self-starter.

    Our modern perspective on developer potential means we celebrate and optimize for real output. And that's probably the reason why we're a polyglot functional programming shop, with emphasis on Haskell and functional paradigms. Our infrastructure and non-mission-critical tooling tends to be in whatever works best for the task at hand: sometimes that's Haskell with advanced GHC extensions a-blazin', other times it's minimalist Ruby or bash—basically, it's a team decision based on whatever sits at the intersection of appropriateness, developer joy, quality, and velocity.

    As an early-stage company headquartered in Cambridge, MA, we have a strong preference for key members of our team to be located in the Boston metro area; however, given that our company has its roots in remote work (and that it's 2016), we are open to remote arrangements after one year of continuous employment and/or executive approval.

    Requirements

    You know you are right for this position if:

    • You have at least five years of professional software engineering experience, and at least two years of experience with a high-level programming language that's used in industry, like Haskell, Clojure, OCaml, Erlang, F#, or similar.
    • You have some front-end experience with JS or a functional language that compiles to JS, like PureScript, Elm, Clojurescript, or similar. We use PureScript, React, and ES6 on the front-end. It's pretty awesome.
    • You are a self-starter and internally motivated, with a strong desire to be part of a successful team that shares your high standards.
    • You have great written communication skills and are comfortable making big decisions over digital channels (e.g. video chat).
    • You have polyglot experience along several axes (dynamic/static, imperative/functional, lazy/strict, weird/not-weird).
    • You are comfortable with modern infrastructure essentials like AWS, Heroku, Docker, CI, etc. You have basic but passable sysadmin skills.
    • You are fluent with git.
    • You instrument before you optimize. You test before you ship. You listen before you conclude. You measure before you cut. Twice.

    Benefits

    We offer a competitive salary and a full suite of benefits, some of them unconventional, but awesome for the right person:

    • Medical, dental, and vision insurance come standard.
    • Flexible hours with a 4-hour core - plan the rest of your workday as you wish, just give us the majority of your most productive hours. Productivity ideas: avoid traffic, never wait in line at the grocery store, wake up without an alarm clock.
    • Goal-based environment (as opposed to grind-based or decree-based environment; work smarter, not harder; intelligently, not mindlessly). We collaborate on setting goals, but you set your own process for accomplishing those goals. You will be entrusted with a lot of responsibility and you might even experience fulfillment and self-actualization as a result.
    • Daily physical activity/mindfulness break + stipend: invest a non-core hour to make yourself more awesome by using it for yoga, tap-dance lessons, a new bike, massage, a surfboard - use your imagination! Just don’t sit at a computer all day! Come back to work more relaxed and productive and share your joy with the rest of the team. Note: You must present and share proof of your newly enriched life with the team in order to receive the stipend.

    Remember: We’re a startup. You’re an early employee. We face challenges. We have to ship. Your ideas matter. You will make a difference.

    Get information on how to apply for this position.

    August 26, 2016 11:22 PM

    August 25, 2016

    Brandon Simmons

    Announcing: unagi-bloomfilter

    I just released a new Haskell library called unagi-bloomfilter that is up now on hackage. You can install it with:

    $ cabal install unagi-bloomfilter
    

    The library uses the bloom-1 variant from “Fast Bloom Filters and Their Generalization” by Yan Qiao, et al. I’ll try to write more about it when I have the time. Also, last night I gave a talk at the New York Haskell User Group on things I learned while working on the project:

    http://www.meetup.com/NY-Haskell/events/233372271/
    

    It was quite rough, but I was happy to hear from folks that found some interesting things to take away from it.

    Thanks to Gershom for inviting me to speak, for my company Signal Vine for sponsoring my trip out, and to Yan Qiao for generously answering my silly questions and helping me understand the paper.

    P.S. We’re hiring Haskell developers

    Signal Vine is an awesome group of people, with interesting technology and problems to solve, and we’re looking to grow the small development team. If you have some experience with Haskell (you don’t have to be a guru) and are interested, please reach out to Jason or me at:

    brandon@signalvine.com
    jason@signalvine.com
    

    August 25, 2016 02:47 PM

    August 23, 2016

    Roman Cheplyaka

    Extract the first n sequences from a FASTA file

    A FASTA file consists of a series of biological sequences (DNA, RNA, or protein). It looks like this:

    >gi|173695|gb|M59083.1|AETRR16S Acetomaculum ruminis 16S ribosomal RNA
    NNTAAACAAGAGAGTTCGATCCTGGCTCAGGATNAACGCTGGCGGCATGCCTAACACATGCAAGTCGAAC
    GGAGTGCTTGTAGAAGCTTTTTCGGAAGTGGAAATAAGTTACTTAGTGGCGGACGGGTGAGTAACGCGTG
    
    >gi|310975154|ref|NR_037018.1| Acidaminococcus fermentans strain VR4 16S ribosomal RNA gene, partial sequence
    GGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAGAACTTTCTTCGGAATGTTC
    TTAGTGGCGAACGGGTGAGTAACGCGTAGGCAACCTNCCCCTCTGTTGGGGACAACATTCCGAAAGGGAT

    There probably exist dozens of Python scripts to extract the first n sequences from a FASTA file. Here I will show an awk one-liner that performs this task, and explain how it works.

    Here it is (assuming the number of sequences is stored in the environment variable NSEQS):

    awk "/^>/ {n++} n>$NSEQS {exit} {print}"

    This one-liner can read from standard input (e.g. as part of a pipe), or you can append one or more file names to the end of the command, e.g.

    awk "/^>/ {n++} n>$NSEQS {exit} {print}" file.fasta

    An awk script consists of one or more statements of the form pattern { actions }. The input is read line-by-line, and if the current line matches the pattern, the corresponding actions are executed.

    Our script consists of 3 statements:

    1. /^>/ {n++} increments the counter each time a new sequence is started. /.../ denotes a regular expression pattern, and ^> is a regular expression that matches the > sign at the beginning of a line.

      An uninitialized variable in awk has the value 0, which is exactly what we want here. If we needed some other initial value (say, 1), we could have added a BEGIN pattern like this: BEGIN {n=1}.
    2. n>$NSEQS {exit} aborts processing once the counter reaches the desired number of sequences.
    3. {print} is an action without a pattern (and thus matching every line), which prints every line of the input until the script is aborted by exit.

    A shorter and more cryptic way to write the same is

    awk "/^>/ {n++} n>$NSEQS {exit} 1"

    Here I replaced the action-without-pattern by a pattern-without-action. The pattern 1 (meaning “true”) matches every line, and when the action is omitted, it is assumed to be {print}.
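
    Since this appears on a Haskell planet, here is a rough sketch of the same counting logic in Haskell (my own illustration, not part of the original post; firstNSeqs is a made-up name). It streams lazily, so large files are not read into memory all at once:

    import qualified Data.ByteString.Lazy.Char8 as L

    -- Keep every line up to, but not including, the header that starts
    -- sequence number n+1 (mirroring the awk script above).
    firstNSeqs :: Int -> L.ByteString -> L.ByteString
    firstNSeqs n = L.unlines . go 0 . L.lines
      where
        go _ [] = []
        go k (l:ls)
          | isHeader l = if k + 1 > n then [] else l : go (k + 1) ls
          | otherwise  = l : go k ls
        isHeader l = not (L.null l) && L.head l == '>'

    main :: IO ()
    main = L.interact (firstNSeqs 2)  -- like running the awk script with NSEQS=2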

    August 23, 2016 08:00 PM

    August 22, 2016

    mightybyte

    Measuring Software Fragility


    While writing this comment on reddit I came up with an interesting question that I think might be a useful way of thinking about programming languages. What percentage of single non-whitespace characters in your source code could be changed to a different character such that the change would pass your CI build system but would result in a runtime bug? Let's call this the software fragility number, because I think that metric gives a potentially useful measure of how bug-prone your software is.

    At the end of the day software is a mountain of bytes, and you're trying to get them into a particular configuration. Whether you're writing a new app from scratch, fixing bugs, or adding new features, the number of bytes of source code you have (similar to LOC, SLOC, or maybe the compressed number of bytes) is a rough indication of the complexity of your project. If we model programmer actions as random byte mutations over all of a project's source and we're trying to predict the project's defect rate, this software fragility number is exactly the thing we need to know.

    Now I'm sure many people will be quick to point out that this random mutation model is not accurate. Of course that's true. But I would argue that in this way it's similar to the efficient markets hypothesis in finance. Real world markets are obviously not efficient (Google didn't become $26 billion less valuable because the UK voted for Brexit). But the efficient markets model is still really useful--and good luck finding a better one that everybody will agree on.

    What this model lacks in real world fidelity, it makes up for in practicality. We can actually build an automated system to calculate a reasonable approximation of the fragility number. All that has to be done is take a project, randomly mutate a character, run the project's whole CI build, and see if the result fails the build. Repeat this for every non-whitespace character in the project and count how many characters pass the build. Since the character was generated at random, I think it's reasonable to assume that any mutation that passes the build is almost definitely a bug.

    Performing this process for every character in a large project would obviously require a lot of CPU time. We could make this more tractable by picking characters at random to mutate. Repeat this until you have done it for a large enough number of characters and then see what percentage of them made it through the build. Alternatively, instead of choosing random characters you could choose whole modules at random to get more uniform coverage over different parts of the language's grammar. There are probably a number of different algorithms that could be tried for picking random subsets of characters to test. Similar to numerical approximation algorithms such as Newton's method, any of these algorithms could track the convergence of the estimate and stop when the value gets to a sufficient level of stability.
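
    To make that procedure concrete, here is a rough Haskell sketch of the sampling estimator (all names here are hypothetical, and "make ci" stands in for whatever command runs your CI build):

    import Control.Exception (evaluate)
    import Data.Char (isAlpha, isDigit)
    import System.Exit (ExitCode (..))
    import System.Process (readProcessWithExitCode)
    import System.Random (randomRIO)

    -- Mutate a character while preserving its character class, giving the
    -- mutant the best chance of still building (a worst-case estimate).
    mutateChar :: Char -> IO Char
    mutateChar c
      | isAlpha c = pick ['a' .. 'z']
      | isDigit c = pick ['0' .. '9']
      | otherwise = pick "+-*/<>=!&|"
      where pick xs = (xs !!) <$> randomRIO (0, length xs - 1)

    -- One trial: mutate one random character, run the build, restore the
    -- file. A successful build means the mutation slipped through.
    -- (A real tool would skip whitespace and comment characters.)
    trial :: FilePath -> String -> IO Bool
    trial file src = do
      i  <- randomRIO (0, length src - 1)
      c' <- mutateChar (src !! i)
      writeFile file (take i src ++ [c'] ++ drop (i + 1) src)
      (code, _, _) <- readProcessWithExitCode "make" ["ci"] ""
      writeFile file src
      return (code == ExitSuccess)

    -- Estimate the fragility number from k random single-character trials.
    fragility :: FilePath -> Int -> IO Double
    fragility file k = do
      src <- readFile file
      _   <- evaluate (length src)  -- force the read before overwriting the file
      passes <- length . filter id <$> mapM (const (trial file src)) [1 .. k]
      return (fromIntegral passes / fromIntegral k)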

    Now let's investigate actual fragility numbers for some simple bits of example code to see how this notion behaves. First let's look at some JavaScript examples.

    It's worth noting that comment characters should not be allowed to be chosen for mutation since they obviously don't affect the correctness of the program. So the comments you see here have not been included in the calculations. Fragile characters are highlighted in orange.

    // Fragility 12 / 48 = 0.25
    function f(n) {
      if ( n < 2 ) return 1;
      else return n * f(n-1);
    }
    
    // Fragility 14 / 56 = 0.25
    function g(n) {
      var p = 1;
      for (var i = 2; i <= n; i++ ) {
        p *= i;
      }
      return p;
    }
    

    First I should say that I didn't write an actual program to calculate these. I just eyeballed it and thought about what things would fail. I easily could have made mistakes here. In some cases it may even be subjective, so I'm open to corrections or different views.

    Since JavaScript is not statically typed, every character of every identifier is fragile--mutating them will not cause a build error because there isn't much of a build. JavaScript won't complain; you'll just start getting undefined values. If you've done a significant amount of JavaScript development, you've almost certainly encountered bugs from mistyped identifier names like this. I think it's mildly interesting that the recursive and iterative formulations of this function both have the same fragility. I expected them to be different. But maybe that's just luck.

    Numerical constants as well as comparison and arithmetic operators will also cause runtime bugs. These, however, are more debatable, because if you use the random procedure I outlined above, you'll probably get a build failure because the character would have probably changed to something syntactically incorrect. In my experience, it seems like when you mistype an alpha character, it's likely that the wrong character will also be an alpha character. The same seems to be true for the classes of numeric characters as well as symbols. The method I'm proposing is that the random mutation should preserve the character class: alpha characters should remain alpha, numeric should remain numeric, and symbols should remain symbols. In fact, my original intuition goes even further than that by only replacing comparison operators with other comparison operators--you want to maximize the chance that the new mutated character will cause a successful build, so the metric gives you a worst-case estimate of fragility. There's certainly room for research into what patterns tend to come up in the real world and other algorithms that might describe them better.

    Now let's go to the other end of the programming language spectrum and see what the fragility number might look like for Haskell.

    -- Fragility 7 / 38 = 0.18
    f :: Int -> Int
    f n
      | n < 2 = 1
      | otherwise = n * f (n-1)
    

    Haskell's much more substantial compile-time checks mean that mutations to identifier names can't cause bugs in this example. The fragile characters here are clearly essential parts of the algorithm we're implementing. Maybe we could relate this idea to information theory and think of it as a measure of how much information is contained in the algorithm.

    One interesting thing to note here is the effect of the length of identifier names on the fragility number. In JavaScript, long identifier names will increase the fragility because all identifier characters can be mutated and will cause a bug. But in Haskell, since identifier characters are not fragile, longer names will lower the fragility score. Choosing to use single character identifier names everywhere makes these Haskell fragility numbers the worst case and makes JavaScript fragility numbers the best case.

    Another point is that since I've used single letter identifier names it is possible for a random identifier mutation in Haskell to not cause a build failure but still cause a bug. Take for instance a function that takes two Int parameters x and y. If y was mutated to x, the program would still compile, but it would cause a bug. My set of highlighted fragile characters above does not take this into account because it's trivially avoidable by using longer identifier names. Maybe this is an argument against one letter identifier names, something that Haskell gets criticism for.

    Here's the snippet of Haskell code I was talking about in the above reddit comment that got me thinking about all this in the first place:

    -- Fragility 31 / 277 = 0.11
    data MetadataInfo = MetadataInfo
        { title       :: Text
        , description :: Text
        }
    
    pageMetadataWidget :: MonadWidget t m => Dynamic t MetadataInfo -> m ()
    pageMetadataWidget i = do
        el "title" $ dynText $ title <$> i
        elDynAttr "meta" (mkDescAttrs . description <$> i) blank
      where
        mkDescAttrs desc =
          "name" =: "description" 
          "content" =: desc
    

    In this snippet, the fragility number is probably close to 31--the number of characters in string literals. This is out of a total of 277 non-whitespace characters, so the software fragility number for this bit of code is 11%. That's less than half the fragility of the JS code we saw above! And as I've pointed out, larger real-world JS examples are likely to have even higher fragility. I'm not sure how much we can conclude from the actual ratios of these fragility numbers, but at the very least it matches my experience that JS programs are significantly more buggy than Haskell programs.

    The TDD people are probably thinking that my JS examples aren't very realistic because none of them have tests, and that tests would catch most of the identifier name mutations, bringing the fragility down closer to Haskell territory. It is true that tests will probably catch some of these things. But you have to write code to make that happen! It doesn't happen by default. Also, you need to take into account the fact that the tests themselves will have some fragility. Tests require time and effort to maintain. This is an area where this notion of the fragility number becomes less accurate. I suspect that since the metric only considers single character mutations it will underestimate the fragility of tests since mutating single characters in tests will automatically cause a build failure.

    There seems to be a slightly paradoxical relationship between the fragility number and DRY. Imagine our above JS factorial functions had a test that completely reimplemented factorial and then tried a bunch of random values, QuickCheck-style. This would yield a fragility number of zero! Any single character change in the code would cause a test failure. And any single character change in the tests would also cause a test failure. Single character changes can no longer be classified as fragile because we've violated DRY. You might say that the test suite shouldn't reimplement the algorithm--you should just test specific cases like f(5) == 120. But in an information theory sense this is still violating DRY.
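
    To make that scenario concrete, here's a sketch of such a test (ported to Haskell for brevity; the names are mine, not from the JS example above):

    import Test.QuickCheck

    f :: Integer -> Integer
    f n | n < 2     = 1
        | otherwise = n * f (n - 1)

    -- The property reimplements factorial, violating DRY: any single
    -- character mutation to either the code or the test now fails the suite.
    prop_factorial :: Integer -> Property
    prop_factorial n = n >= 0 && n <= 100 ==> f n == product [1 .. n]

    main :: IO ()
    main = quickCheck prop_factorial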

    Does this mean that the fragility number is not very useful? Maybe. I don't know. But I don't think it means we should just throw away the idea. Maybe we should just keep in mind that this particular formulation doesn't have much to tell us about the fragility of more complex, coordinated multi-character changes. I could see the usefulness of this metric going either way. It could simplify down to something not very profound. Or it could be that measurements of the fragility of real-world software projects end up revealing some interesting insights that are not immediately obvious even from my analysis here.

    Whatever the usefulness of this fragility metric, I think the concept gets us thinking about software defects in a different way than we might be used to. If it turns out that my single-character mutation model isn't very useful, perhaps the extension to multi-character changes could be. Hopefully this will inspire more people to think about these issues and play with the ideas in a way that helps us progress towards more reliable software and tools to build it with.

    EDIT: Unsurprisingly, I'm not the first person to have thought of this. It looks like it's commonly known as mutation testing. That Wikipedia article makes it sound like mutation testing is commonly thought of as a way to assess your project's test suite. I'm particularly interested in what it might tell us about programming languages...i.e. how much "testing" we get out of the box because of our choice of programming language and implementation.

    by mightybyte (noreply@blogger.com) at August 22, 2016 03:41 PM

    Why version bounds cannot be inferred retroactively (using dates)

    In past debates about Haskell's Package Versioning Policy (PVP), some have suggested that package developers don't need to put upper bounds on their version constraints because those bounds can be inferred by looking at what versions were available on the date the package was uploaded. This strategy cannot work in practice, and here's why.

    Imagine someone creates a small new package called foo. It's a simple package, say something along the lines of the formattable package that I recently released. One of the dependencies for foo is errors, a popular package supplying frequently used error handling infrastructure. The developer happens to already have errors-1.4.7 installed on their system, so this new package gets built against that version. The author uploads it to hackage on August 16, 2015 with no upper bounds on its dependencies. Let's for simplicity imagine that errors is the only dependency, so the .cabal file looks like this:

    name: foo
    build-depends:
      errors
    

    If we come back through at some point in the future and try to infer upper bounds by date, we'll see that on August 16 the most recent version of errors was 2.0.0. (The original post includes a release-date timeline illustrating this picture.)

    If we look only at release dates, and assume that packages were building against the most recent version, we will try to build foo with errors-2.0.0. But that is incorrect! Building foo with errors-2.0.0 will fail because errors had a major breaking change in that version. Bottom line: dates are irrelevant--all that matters is which dependency versions the author happened to be building against! You cannot assume that package authors will always be building against the most recent versions of their dependencies. This is especially true if our developer was using the Haskell Platform or LTS Haskell, because those package collections lag the bleeding edge even more. So this scenario is not at all unlikely.

    It is also possible for packages to maintain multiple major versions simultaneously. Consider large projects like the Linux kernel. Developers routinely do maintenance releases on 4.1 and 4.0 even though 4.2 is the latest version. This means that version numbers are not always monotonically increasing as a function of time.

    I should also mention another point on the meaning of version bounds. When a package specifies version bounds like this...

    name: foo
    build-depends:
      errors >= 1.4 && < 1.5
    

    ...it is not saying "my package will not work with errors-1.5 and above". It is actually saying, "I warrant that my package does work with those versions of errors (provided errors complies with the PVP)". So the idea that "< 1.5" is a "preemptive upper bound" is wrong. The package author is not preempting anything. Bounds are simply information. The upper and lower bounds are important things that developers need to tell you about their packages to improve the overall health of the ecosystem. Build tools are free to do whatever they want with that information. Indeed, cabal-install has a flag --allow-newer that lets you ignore those upper bounds and step outside the version ranges that the package authors have verified to work.
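
    For example (a sketch; foo is the hypothetical package from above), a user who wants to step outside foo's verified ranges can say:

    $ cabal install foo --allow-newer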

    In summary, the important point here is that you cannot use dates to infer version bounds. You cannot assume that package authors will always be building against the most recent versions of their dependencies. The only reliable thing to do is for the package maintainer to tell you explicitly what versions the package is expected to work with. And that means lower and upper bounds.

    Update: Here is a situation that illustrates this point perfectly: cryptonite issue #96. cryptonite-0.19 was released on August 12, 2016. But cryptonite-0.15.1 was released on August 22, 2016. Any library published after August 22, 2016 that depends on cryptonite-0.15.1 would not be able to build if the solver used dates instead of explicit version bounds.

    by mightybyte (noreply@blogger.com) at August 22, 2016 03:35 PM