Planet Haskell

December 17, 2017

Neil Mitchell

Announcing the 'debug' package

Haskell is a great language, but debugging Haskell is undoubtedly a weak spot. To help with that problem, I've just released the debug library. This library is intended to be simple and easy to use for a common class of debugging tasks, without solving everything. As an example, let's take a function we are interested in debugging, e.g.:

module QuickSort(quicksort) where
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (x:xs) = quicksort lt ++ [x] ++ quicksort gt
  where (lt, gt) = partition (<= x) xs

Turn on the TemplateHaskell and ViewPatterns extensions, import Debug, indent your code and place it under a call to debug, e.g.:

{-# LANGUAGE TemplateHaskell, ViewPatterns #-}
module QuickSort(quicksort) where
import Data.List
import Debug

debug [d|
   quicksort :: Ord a => [a] -> [a]
   quicksort [] = []
   quicksort (x:xs) = quicksort lt ++ [x] ++ quicksort gt
       where (lt, gt) = partition (<= x) xs
   |]

We can now run our debugger with:

$ ghci QuickSort.hs
GHCi, version 8.2.1: :? for help
[1 of 1] Compiling QuickSort ( QuickSort.hs, interpreted )
Ok, 1 module loaded.
*QuickSort> quicksort "haskell"
*QuickSort> debugView

The call to debugView starts a web browser to view the recorded information.

From there you can click around to explore the computation.

I'm interested in experiences using debug, and also have a lot of ideas for how to improve it, so feedback or offers of help are most welcome at the bug tracker.

If you're interested in alternative debuggers for Haskell, you should check out the GHCi debugger or Hood/Hoed.

by Neil Mitchell at December 17, 2017 10:02 PM

Michael Snoyman

What Makes Haskell Unique

I gave a talk today at the F(by) 2017 conference in Minsk, Belarus. The conference was great, and I would definitely recommend it in the future. Thank you very much to the organizers for the opportunity to present on Haskell.

I prepared for this talk differently than I've prepared for other talks in the past. I'm very comfortable writing up blog posts, but have always found slide preparation difficult. This time around, I wrote up the content in mostly-blog-post form first, and only created the slides after that was complete. Overall, this worked very well for me, and I'll try it again in the future. (If others want to share their approaches to preparing talks, I'd definitely be happy to hear them.)

As a result: I'm able to share the original write-up I did as well. For those who saw the live talk (or the video): you may want to skip towards the end, which covers some material that there wasn't time for in the talk itself.

If you'd like to follow along with the slides, they're also available.

My name is Michael Snoyman. I work at a company called FP Complete. One of the things we do is help individuals and companies adopt Haskell, and functional programming in general. And that leads right in to the topic of my talk today:

What makes Haskell unique

Programmers today have a large number of languages to choose from when deciding what they will learn and use in their day to day coding. In order to make intelligent decisions about which languages to pursue, people need to be able to quickly learn and understand what distinguishes one language from another.

Given that this is a functional programming conference, it's probably no surprise to you that Haskell can be called a functional programming language. But there are lots of languages out there that can be called functional. Definitions vary, but let's take a particularly lax version of functional programming: first class functions, and higher order functions. Well, by this definition, even a language like C counts! You may want to limit the definition further to include syntactic support for closures, or some other features. Regardless, the same point remains:

Haskell may be functional, but that doesn't make it unique

In fact, there's a long list of features I could rattle off that could be used to describe Haskell.

  • Functional
  • Statically typed
  • Pure
  • Lazy
  • Strongly typed
  • Green threads
  • Native executables
  • Garbage collected
  • Immutability

Some of these features, like being pure and lazy, are relatively rare in mainstream languages. Others, however, are commonplace. What I'm going to claim is that not one of these features is enough to motivate people new to Haskell (including people in this audience) to start using it. Instead:

It's the combination of these features that makes Haskell unique

As an example: the intersection of purity, strong typing, and functional programming style lends itself to a high level form of expression which is simultaneously easy to write, easy to read, easy to modify, and efficient. I want to share some code examples in Haskell that demonstrate how the language encourages you to write code differently from other languages. And I'm going to try to claim that this "different" style is awesome, though it also has some downsides.

Async I/O and Concurrency

Let's start off with a use case that's pretty popular today. Look at this pseudocode and tell me what's wrong with it:

json1 := httpGet(url1)
json2 := httpGet(url2)
useJsonBodies(json1, json2)

Given the heading of this slide, you may have guessed it: this is blocking code. It will tie up an entire thread waiting for the response body from each of these requests to come back. Instead, we should be using asynchronous I/O calls to allow more efficient usage of system resources. One common approach is to use callbacks:

httpGetA(url1, |json1| =>
  httpGetA(url2, |json2| =>
    useJsonBodies(json1, json2)
  )
)

You may recognize this coding style as "callback hell." There are plenty of techniques in common languages to work around that, usually around the idea of promises or futures. And you may have heard something about how JavaScript futures are a monad, and expect me to be talking about how Haskell does monads better. But I'm not going to do that at all. Instead, I want to show you what the asynchronous version of the code looks like in Haskell:

json1 <- httpGet url1
json2 <- httpGet url2
useJsonBodies json1 json2

This may surprise you, since this looks exactly like the blocking pseudocode I showed above. It turns out that Haskell has a powerful runtime system. It will automatically convert your blocking-style code into asynchronous system calls, and automatically handle all of the work of scheduling threads and waking them up when data is available.

This is pretty great, but it's hardly unique to Haskell. Erlang and Go, as two popular examples, both have this as well. If we want to see what makes Haskell different...

we have to go deeper.


It's pretty lame that we need to wait for our first HTTP request to complete before even starting our second. What we'd like to do is kick off both requests at the same time. You may be imagining some really hairy APIs with threads, and mutable variables, and locks. But here's how you do this in Haskell:

(json1, json2) <- concurrently
  (httpGet url1)
  (httpGet url2)
useJsonBodies json1 json2

Haskell has a green thread implementation which makes forking threads cheap. The async library provides a powerful, high level interface for performing actions in parallel without bothering with the low level aspects of locking primitives and mutable variables. And this builds naturally on top of the async I/O system already described to be cheap about system resource usage.
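To make the idea concrete without depending on the async package, here is a minimal sketch built only from base. The names fakeHttpGet and bothOf are hypothetical stand-ins (not part of any library), and unlike the real concurrently this version does no exception propagation or cancellation:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Hypothetical stand-in for httpGet: sleeps, then returns a fake body.
fakeHttpGet :: String -> IO String
fakeHttpGet url = do
  threadDelay 100000 -- 0.1 seconds, simulating network latency
  return ("body of " ++ url)

-- Simplified sketch of what concurrently does: run both actions in
-- their own green threads and wait for both results.
-- Assumption: neither action throws; the real library handles that.
bothOf :: IO a -> IO b -> IO (a, b)
bothOf left right = do
  leftVar  <- newEmptyMVar
  rightVar <- newEmptyMVar
  _ <- forkIO (left  >>= putMVar leftVar)
  _ <- forkIO (right >>= putMVar rightVar)
  a <- takeMVar leftVar
  b <- takeMVar rightVar
  return (a, b)

main :: IO ()
main = do
  (json1, json2) <- bothOf (fakeHttpGet "url1") (fakeHttpGet "url2")
  putStrLn json1
  putStrLn json2
```

Both sleeps overlap, so the whole program takes roughly as long as one request rather than two.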


What we've seen already is elegant in Haskell, but it's not terribly difficult to achieve in other languages. Let's take it to the next level. Instead of needing both JSON response bodies, we only need one: whichever one comes back first. In pseudocode, this might look like:

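promise1 := httpGet(url1)
promise2 := httpGet(url2)
result := newMutex()
promise1.andThen(|json1| =>
  result.set(json1)
  promise2.cancel())
promise2.andThen(|json2| =>
  result.set(json2)
  promise1.cancel())
useJsonBody(result.get())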

This code is tedious and error prone, but it gets the job done. As you can probably guess, there's a simple API for this in Haskell:

eitherJson <- race
  (httpGet url1)
  (httpGet url2)
case eitherJson of
  Left  json1 -> useJsonBody1 json1
  Right json2 -> useJsonBody2 json2

At first, this may seem like it's just a well designed API. But there's quite a bit more going on under the surface. The Haskell runtime system itself supports the idea of an asynchronous exception, which allows us to cancel any other running thread. This feature is vital to making race work.
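As a rough illustration of why asynchronous exceptions matter here, the following sketch implements a simplified race using only base. The name firstOf is hypothetical, and the real race in the async library is more careful about exceptions thrown by the actions themselves; the point is only that killThread (which delivers an asynchronous exception) is what lets the winner cancel the loser:

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Simplified sketch of race: whichever action finishes first wins,
-- and both worker threads are then cancelled with killThread.
firstOf :: IO a -> IO b -> IO (Either a b)
firstOf left right = do
  done <- newEmptyMVar
  t1 <- forkIO (left  >>= putMVar done . Left)
  t2 <- forkIO (right >>= putMVar done . Right)
  winner <- takeMVar done
  killThread t1 -- harmless if the thread has already finished
  killThread t2
  return winner

main :: IO ()
main = do
  result <- firstOf
    (threadDelay 500000 >> return "slow")       -- 0.5 seconds
    (threadDelay 10000  >> return (42 :: Int))  -- 0.01 seconds
  print result
```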

And here's the final piece in the puzzle. All of the thread scheduling and canceling logic I've described doesn't just apply to async I/O calls. It works for CPU-intensive tasks as well. That means you can fork thousands of threads, and even if one of them is busy performing computation, other threads will not be starved. Plus, you can interrupt these long-running computations:

let tenSeconds = 10 * 1000 * 1000
timeout tenSeconds expensiveComputation
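For a self-contained version, timeout lives in System.Timeout in base and returns Nothing when the time limit is hit. In this sketch, slowAction is a hypothetical stand-in that simply sleeps (an assumption made to keep the example deterministic) rather than a real expensive computation:

```haskell
import Control.Concurrent (threadDelay)
import System.Timeout (timeout)

-- Hypothetical stand-in for a long-running action: sleeps for 2 seconds.
slowAction :: IO String
slowAction = threadDelay 2000000 >> return "finished"

main :: IO ()
main = do
  -- Limit of 0.1 seconds: the action is cancelled and we get Nothing.
  r1 <- timeout 100000 slowAction
  print r1
  -- Limit of 1 second: the action completes in time and we get Just.
  r2 <- timeout 1000000 (return "quick")
  print r2
```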

Summary: concurrency and async I/O

Advantages

  • Cheap threads
  • Simple API
  • Highly responsive

Disadvantages

  • Complicated runtime system
  • Need to be aware of async exceptions when writing code

Immutability and purity

Most programming languages out there default to mutability: a variable or field in a data structure can be changed at any time. Haskell is different in two ways:

  1. Values are immutable by default, and mutability must be explicitly indicated with a variable type
  2. Mutating a mutable variable is considered a side effect, and that mutation is tracked by the type system

For example, the following Haskell-like code is impossible:

let mut total = 0
    loop i =
      if i > 1000000
        then total
        else total += i; loop (i + 1)
 in loop 1

From pure code, we cannot create, read, or modify a mutable variable. We also need to say what kind of mutable variable we want:

total <- newIORef 0
let loop i =
      if i > 1000000
        then readIORef total
        else do
          modifyIORef total (+ i)
          loop (i + 1)
loop 1

This is a lot of ceremony for a simple algorithm. Of course, the recommended Haskell way of doing this would be to avoid mutable variables, and use a more natural functional style.

let loop i total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

Besides pushing us towards this supposedly better functional approach, why is immutable, pure code such a nice thing?

Reasoning about code

You'll often hear Haskellers throw around the phrase "reasoning about code." Personally, I think the phrase is used to mean too many different things. But let me give you an example that I think is accurate. Let's look at some pseudocode:

// scores.txt

func main() {
  results := readResultsFromFile("results.txt")
  printScoreRange(results)
  print("First result was by: " + results[0].name)
}

func printScoreRange(results: Vector<TestResult>) {
  ...
}

If you look at the code above, what do you expect the output to be? I think it would be reasonable to guess something like:

Lowest: 22
Highest: 55
First result was by: Alice

However, now let's throw in another piece of information: the definition of printScoreRange:

func printScoreRange(results: Vector<TestResult>) {
  results.sortBy(|result| => result.score)
  print("Lowest: " + results[0].score)
  print("Highest: " + results[results.len() - 1].score)
}

Suddenly our assumptions change. We can see that this function mutates the results value passed to it. If we're passing mutable references to vectors in this made-up language, then our output is going to look more like:

Lowest: 22
Highest: 55
First result was by: Charlie

This is because the original results value in our main function has been modified. This is what I mean by hurting our ability to reason about the code: it's no longer sufficient to look at just the main function to understand what will be happening. Instead, we're required to understand what may possibly be occurring in the rest of our program to mutate our variables.

In Haskell, the code would instead look like:

main :: IO ()
main = do
  results <- readResultsFromFile "results.txt"
  printScoreRange results
  putStrLn $ "First result was by: " ++ name (head results)

printScoreRange :: [TestResult] -> IO ()
printScoreRange results = do
  let results' = sortOn score results -- sortOn is from Data.List
  putStrLn $ "Lowest: " ++ show (score (head results'))
  putStrLn $ "Highest: " ++ show (score (last results'))

We know that it's impossible for printScoreRange to modify the results value we have in the main function. Looking at only this bit of code in main is sufficient to know what will happen with the results value.
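Here is a small self-contained sketch of the same idea. The TestResult record, the sample names and scores, and the scoreRange helper are all hypothetical, invented purely for illustration. Sorting produces a new list, so the original order is untouched:

```haskell
import Data.List (sortOn)

-- Hypothetical record type for illustration.
data TestResult = TestResult { name :: String, score :: Int }

-- Hypothetical sample data, standing in for the contents of a file.
sampleResults :: [TestResult]
sampleResults =
  [ TestResult "Alice" 40
  , TestResult "Bob" 55
  , TestResult "Charlie" 22
  ]

-- Returns (lowest, highest) score; sortOn builds a new sorted list
-- and cannot modify the list it was given.
scoreRange :: [TestResult] -> (Int, Int)
scoreRange results = (score (head sorted), score (last sorted))
  where sorted = sortOn score results

main :: IO ()
main = do
  print (scoreRange sampleResults)
  -- The original list is unchanged, so the first entry is still Alice.
  putStrLn ("First result was by: " ++ name (head sampleResults))
```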

Data races

Even more powerful than the single threaded case is how immutability affects multithreaded applications. Ignoring the insanity of multiple threads trying to output to the console at the same time, we can easily parallelize our code:

main :: IO ()
main = do
  results <- readResultsFromFile "results.txt"
  concurrently_ (printFirstResult results) (printScoreRange results)

printFirstResult results =
  putStrLn $ "First result was by: " ++ name (head results)

printScoreRange results = do
  let results' = sortOn score results -- sortOn is from Data.List
  putStrLn $ "Lowest: " ++ show (score (head results'))
  putStrLn $ "Highest: " ++ show (score (last results'))

There's no need to worry about concurrent accesses to data structures. It's impossible for the other threads to alter our data. If you do want other threads to affect your local data, you'll need to be more explicit about it, which we'll get back to.

Mutability when needed

One thing you may be worried about is how this affects performance. For example, it's much more efficient to sort a vector using mutable access instead of only pure operations. Haskell has two tricks for that. The first is the ability to explicitly create mutable data structures, and mutate them in place. This breaks all of the guarantees I already mentioned, but if you need the performance, it's available. And unlike mutable-by-default approaches, you now know exactly which pieces of data you need to handle with care when coding to avoid tripping yourself up.

The other approach is to create a mutable copy of the original data, perform your mutable algorithm on it, and then freeze the new copy into an immutable version. With sorting, this looks something like:

sortMutable :: Ord a => MutableVector s a -> ST s ()
sortMutable = ... -- normal in-place sorting algorithm

sortImmutable :: Ord a => Vector a -> Vector a
sortImmutable orig = runST $ do
  mutable <- newMutableVector (length orig)
  copyValues orig mutable
  sortMutable mutable
  freeze mutable

ST is something we use to have temporary and local mutable effects. Because of how it's implemented, we know that none of the effects can be visible from outside of our function, and that for the same input, the sortImmutable function will always have the same output. While this approach requires an extra memory buffer and an extra copy of the elements in the vector, it completely avoids the worry of your data being changed behind your back.
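A smaller runnable illustration of the same pattern, using only Control.Monad.ST and Data.STRef from base instead of a vector library (sumST is a hypothetical name chosen for this sketch): the function mutates a reference internally, yet its type is pure.

```haskell
import Control.Monad.ST (runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- Pure from the outside, mutable on the inside: the STRef cannot
-- escape runST, so callers can never observe the mutation.
sumST :: [Int] -> Int
sumST xs = runST $ do
  ref <- newSTRef 0
  mapM_ (\x -> modifySTRef' ref (+ x)) xs
  readSTRef ref

main :: IO ()
main = print (sumST [1 .. 100]) -- prints 5050
```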

Summary: immutability and purity

Advantages

  • Easier to reason about code
  • Avoid many cases of data races
  • Functions are more reliable, returning the same output for the same input

Disadvantages

  • Lots of ceremony if you actually want mutation
  • Some runtime performance hit for mutable algorithms

Software Transactional Memory

Let's say you actually need to be able to mutate some values. And for fun, let's say you want to do this from multiple threads. A common example of this is a bank. Let's again play with some pseudocode:

runServer (|request| => {
  from := accounts.lookup(request.from)
  to := accounts.lookup(request.to)
  accounts.set(request.from, from - request.amt)
  accounts.set(request.to, to + request.amt)
})

This looks reasonable, except that if two requests come in at the same time for the same account, we can end up with a race condition. Consider something like this:

Thread 1: receive request: Alice gives $25
Thread 2: receive request: Alice receives $25
Thread 1: lookup that Alice has $50
Thread 2: lookup that Alice has $50
Thread 1: set Alice's account to $25
Thread 2: set Alice's account to $75

We know that we want Alice to end up with $50, but because of our data race, Alice ends up with $75. Or, if the threads ran differently, it could be $25. Neither of these is correct. In order to avoid this, we would typically deal with some kind of locking:

runServer (|request| => {
  accounts.lock(request.from); accounts.lock(request.to)
  // same code as before, then unlock both accounts
})

Unfortunately, this leads to deadlocks! Consider this scenario:

Thread 1: receive request: $50 from Alice to Bob
Thread 2: receive request: $50 from Bob to Alice
Thread 1: lock Alice
Thread 2: lock Bob
Thread 1: try to lock Bob, but can't, so wait
Thread 2: try to lock Alice, but can't, so wait

This kind of problem is the bane of many concurrent programs. Let me show you another approach. As you may guess, here's some Haskell:

runServer $ \request -> atomically $ do
  let fromVar = lookup (from request) accounts
      toVar = lookup (to request) accounts
  origFrom <- readTVar fromVar
  writeTVar fromVar (origFrom - amt request)
  origTo <- readTVar toVar
  writeTVar toVar (origTo + amt request)

There are helper functions to make this shorter, but I wanted to do this the long way to prove a point. This looks like exactly the kind of race condition I described before. However, that atomically function is vital here. It ensures that only a complete transaction is ever committed. If any of the variables we touch are mutated by another thread before our transaction is complete, all of our changes are rolled back, and the transaction is retried. No need for explicit locking, and therefore far fewer worries about data races and deadlocks.

A TVar is a "transactional variable." It's an alternative to the IORef that I mentioned earlier. There are other kinds of mutable variables in Haskell, including channels and MVars which are like mutexes. This is what I meant when I said you need to be explicit about what kind of mutation you want in Haskell.
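As a runnable sketch of the transfer, the following uses the STM primitives exported from GHC.Conc in base (the stm package's Control.Concurrent.STM offers the same functions plus helpers). The transfer function and the starting balances are hypothetical. Two threads move money in opposite directions, which is the pattern that deadlocked in the locking scenario:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import GHC.Conc (STM, TVar, atomically, newTVarIO, readTVar, readTVarIO, writeTVar)

-- Move amt from one account to the other as a single transaction.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer fromVar toVar amt = do
  origFrom <- readTVar fromVar
  writeTVar fromVar (origFrom - amt)
  origTo <- readTVar toVar
  writeTVar toVar (origTo + amt)

main :: IO ()
main = do
  alice <- newTVarIO 50
  bob   <- newTVarIO 50
  done  <- newEmptyMVar
  -- Each thread performs 1000 one-dollar transfers, in opposite directions.
  _ <- forkIO (replicateM_ 1000 (atomically (transfer alice bob 1)) >> putMVar done ())
  _ <- forkIO (replicateM_ 1000 (atomically (transfer bob alice 1)) >> putMVar done ())
  takeMVar done
  takeMVar done
  a <- readTVarIO alice
  b <- readTVarIO bob
  print (a, b) -- prints (50,50): the transfers cancel out exactly
```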

Purity's role

What do you think will happen with this program:

atomically $ do
  buyBitcoins 3 -- side effects on my bank account

  modifyTVar myBitcoinCount (+ 3)

Here, buyBitcoins is going off to some exchange and buying about $100,000 in bitcoin (or whatever ridiculous amount they're selling for now). I said before that, if the variables are modified while running, the transaction will be retried. It seems like this function is very dangerous, as it may result in me going about $10,000,000 into debt buying bitcoins!

This is where purity steps in. Inside atomically, you are not allowed to perform any side effects outside of STM itself. That means you can modify TVars, but you cannot read or write files, print to the console, fire the missiles, or place multimillion-dollar currency purchases. This may feel like a limitation, but the tradeoff is that it's perfectly safe for the runtime system to retry your transactions as many times as it wants.

Summary of STM

Advantages

  • Makes concurrent data modification much easier
  • Bypass many race conditions and deadlocks

Disadvantages

  • Depends on purity to work at all
  • Not really a disadvantage, you're already stuck with purity in Haskell
  • Not really any other disadvantages, so just use it!


It's a little cheeky of me to get this far into a talk about unique features of Haskell and ignore one of its most notable features: laziness. Laziness is much more of a double-edged sword than the other features I've talked about, and let me prove that by revisiting one of our previous examples.

let loop i total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

I didn't describe it before, but this function will sum up the numbers from 1 to 1,000,000. There are two problems with this function:

  1. There's a major performance bug in it
  2. It's much more cumbersome than it should be

Space leaks

The bane of laziness is space leaks, something you've probably heard about if you've read at all about Haskell. To understand this, let's look at how laziness is implemented. When you say something like:

let foo = 1 + 2

foo doesn't actually contain 3 right now. Instead, it contains an instruction to apply the operator + to the values 1 and 2. This kind of instruction is called a thunk. And as you might guess, storing the thunk is a lot more expensive than storing a simple integer. We'll see why this helps in a bit, but for now we just care about why it sucks. Let's look at what happens in our loop function:

let loop i total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

Each time we step through the loop, we have to compare i to the number 1,000,000. Therefore, we are forced to evaluate it, which means turning it into a simple integer. But we never look at the value of total. Instead of storing a simple integer, which would be cheap, we end up building a huge tree that looks like "add 1 to the result of add 2 to the result of ... to 1,000,000." This is really bad: it uses more memory and more CPU than we'd like.

We can work around this in Haskell by being explicit about which values should be evaluated. There are a few ways to do this, but in our case, the easiest is:

let loop i !total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

All I've done is add an exclamation point in front of the total argument. This is known as a bang pattern, and says "make sure this is evaluated before running the rest of this function." The need to do this in some cases is definitely a downside to Haskell's laziness. On the other hand, as we'll see shortly, you often don't need to bother if you use the right kinds of functions.
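As a complete program, the strict loop might look like the sketch below. The wrapper name sumTo is hypothetical, and the BangPatterns pragma is needed when compiling as Haskell2010 or on older GHC versions:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Sum the numbers from 1 to n with a strict accumulator, so no
-- chain of unevaluated (total + i) thunks builds up.
sumTo :: Int -> Int
sumTo n = loop 1 0
  where
    loop i !total
      | i > n     = total
      | otherwise = loop (i + 1) (total + i)

main :: IO ()
main = print (sumTo 1000000) -- prints 500000500000
```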

Laziness is awesome

Let's go back to pseudocode and rewrite our summation:

total := 0
for(i := 1; i <= 1000000; i++) {
  total += i
}

Pretty simple. But now let's modify this to only sum up the even numbers:

total := 0
for(i := 1; i <= 1000000; i++) {
  if (isEven(i)) {
    total += i
  }
}

OK, that's fine. But now, let's sum up the indices modulo 13 (for some weird reason):

total := 0
for(i := 1; i <= 1000000; i++) {
  if (isEven(i)) {
    total += i % 13
  }
}

Each of these modifications is fine on its own, but at this point it's getting harder to see the forest for the trees. And fortunately each of these transformations was relatively simple. If some of the requirements were more complicated, fitting it into the for loop may be more challenging.

Let's go back to the beginning with Haskell. We saw how we could do it with a loop, but let's see the real way to sum the numbers from 1 to 1,000,000:

-- Bad
let loop i !total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

-- Awesome!
sum [1..1000000]

We use list range syntax to create a list with one million numbers in it. On its face, this looks terrible: we need to allocate about 8 MB of data to hold onto these integers, when this should run in constant space. But this is exactly where laziness kicks in: instead of allocating all of these values immediately, we allocate a thunk. Each time we step through the list, our thunk generates one new integer and a new thunk for the rest of the list. We're never using more than a few machine words.

There are also other optimizations in GHC to avoid even allocating those thunks, but that's not something I'm going to cover today.

Anyway, let's continue. We can easily tweak this to only add up the even numbers:

sum (filter even [1..1000000])

This uses the filter higher order function, and likewise avoids allocating an entire list at once. And doing the silly modulo 13 trick:

sum (map (`mod` 13) (filter even [1..1000000]))

Laziness is definitely a mixed bag, but combined with the functional style of Haskell in general, it allows you to write higher level, declarative code, while keeping great performance.
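One way to see that these pipelines really are evaluated on demand is to run the same functions over an infinite list. In this sketch (evens is a hypothetical name), only as many elements are generated as take asks for:

```haskell
-- An infinite list of even numbers; nothing is computed until demanded.
evens :: [Integer]
evens = filter even [1 ..]

main :: IO ()
main = do
  -- Only the first five even numbers are ever generated.
  print (take 5 evens)                         -- prints [2,4,6,8,10]
  -- The same map/filter pipeline, cut off after five elements.
  print (sum (map (`mod` 13) (take 5 evens)))  -- prints 30
```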

Short circuiting for free

Lots of languages define && and || operators which stop evaluation early, e.g.:

foo() && bar()

bar is only called if foo returns true. Haskell works the same way, but these operators aren't special; they just use laziness!

False && _ = False
True && x = x

True || _ = True
False || x = x

This even scales up to functions working on lists of values, such as and, or, all, and any.
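A short sketch of that behavior: andAlso below is a hypothetical copy of && written as an ordinary function, and the list examples terminate even though the lists are infinite, because evaluation stops as soon as the answer is known.

```haskell
-- An ordinary function with the same definition as (&&).
andAlso :: Bool -> Bool -> Bool
andAlso False _ = False
andAlso True  x = x

main :: IO ()
main = do
  -- The second argument is never evaluated, so undefined is harmless.
  print (False `andAlso` undefined)             -- prints False
  -- any stops at the first element greater than 10.
  print (any (> 10) [1 :: Integer ..])          -- prints True
  -- and stops at the first False, which appears at element 5.
  print (and (map (< 5) [1 :: Integer ..]))     -- prints False
```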

Other downsides

There's one other downside to laziness, and a historical artifact. Laziness means that exceptions can be hiding inside any thunk. This is also known as partial values and partial functions. For example, what does this mean?

head []

Generally speaking, partiality is frowned upon, and you should use total functions in Haskell.

The historical artifact is that many bad functions are still easily available, and they should be avoided. head is arguably an example of that. Another is the lazy left fold function, foldl. In virtually all cases, you should replace it with a strict left fold foldl'.
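A minimal sketch of the alternatives: safeHead and total are hypothetical helper names for this example, while foldl' is the real strict left fold from Data.List.

```haskell
import Data.List (foldl')

-- A total replacement for head: the empty case is explicit in the type.
safeHead :: [a] -> Maybe a
safeHead []      = Nothing
safeHead (x : _) = Just x

-- Strict left fold: the accumulator is evaluated at each step,
-- so no chain of thunks builds up.
total :: [Int] -> Int
total = foldl' (+) 0

main :: IO ()
main = do
  print (safeHead ([] :: [Int]))   -- prints Nothing
  print (safeHead [1, 2, 3 :: Int]) -- prints Just 1
  print (total [1 .. 1000000])     -- prints 500000500000
```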

Summary of laziness

Advantages

  • More composable code
  • Get efficient results from combining high level functions
  • Short-circuiting like && and || is no longer a special case

Disadvantages

  • Need to worry about space leaks
  • Exceptions can be hiding in many places
  • Unfortunately some bad functions like foldl still hanging around

Side note: there's a major overlap with Python generators or Rust iterators, but laziness in Haskell is far more pervasive than these other approaches.


Due to time constraints, I'm not going to be able to go into detail on a bunch of other examples I wanted to talk about. Let me just throw out some quick thoughts on them.

Parser (and other) DSLs

  • Operator overloading!
  • Abstract type classes like Applicative and Alternative are a natural fit, e.g.: parseXMLElement <|> parseXMLText.
  • Able to reuse huge number of existing library functions, e.g. optional, many
  • General purpose do-notation is great

data Time = Time Hour Minutes Seconds (Maybe AmPm)
data AmPm = Am | Pm

parseAmPm :: Parser Time
parseAmPm = Time
  <$> decimal
  <*> (":" *> decimal)
  <*> (":" *> decimal)
  <*> optional (("AM" $> Am) <|> ("PM" $> Pm))

c/o @queertypes
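The snippet above relies on a parser library and string literal overloading that aren't shown. For a version that runs with nothing but base, here is a sketch using Text.ParserCombinators.ReadP; the Int field types, the decimal helper, and the parseTime wrapper are assumptions made for this example, not part of the original code:

```haskell
import Control.Applicative (optional, (<|>))
import Data.Char (isDigit)
import Data.Functor (($>))
import Text.ParserCombinators.ReadP (ReadP, eof, munch1, readP_to_S, string)

data AmPm = Am | Pm deriving (Eq, Show)

-- Assumption: hours, minutes and seconds are plain Ints here.
data Time = Time Int Int Int (Maybe AmPm) deriving (Eq, Show)

-- One or more digits, read as an Int.
decimal :: ReadP Int
decimal = read <$> munch1 isDigit

-- Same Applicative/Alternative shape as the parser above.
time :: ReadP Time
time = Time
  <$> decimal
  <*> (string ":" *> decimal)
  <*> (string ":" *> decimal)
  <*> optional ((string "AM" $> Am) <|> (string "PM" $> Pm))

-- Succeeds only if the whole input is consumed by exactly one parse.
parseTime :: String -> Maybe Time
parseTime input = case readP_to_S (time <* eof) input of
  [(t, "")] -> Just t
  _         -> Nothing

main :: IO ()
main = do
  print (parseTime "10:30:00PM")
  print (parseTime "10:30:00")
  print (parseTime "not a time")
```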

Advanced techniques

  • Free monads
  • Monad transformer stacks
  • Lens, conduit, pipes, ...
  • Lots of ways to do things in Haskell!
  • It's a plus and a minus
  • Recommendation: choose a useful subset of Haskell and its libraries, and define some best practices


  • Haskell combines a lot of uncommon features
  • Very few of those features are unique
  • Combining those features allows you to write code very differently than in other languages
  • If you want readable, robust, easy to maintain code: I think it's a great choice
  • Be aware of the sharp edges: they do exist!


December 17, 2017 08:00 AM

December 16, 2017

Robert Harper

A proof by contradiction is not a proof that derives a contradiction

It is well-known that constructivists renounce “proof by contradiction”, and that classicists scoff at the critique.  “Those constructivists,” the criticism goes, “want to rule out proofs by contradiction.  How absurd!  Look, Pythagoras showed that the square root of two is irrational by deriving a contradiction from the assumption that it is rational.  There is nothing wrong with this.  Ignore them!”

On examination that sort of critique fails, because a proof by contradiction is not a proof that derives a contradiction.  Pythagoras’s  proof is valid, one of the eternal gems of mathematics.  No one questions the validity of that argument, even if they question proof by contradiction.

Pythagoras’s Theorem expresses a negation: it is not the case that the square root of two can be expressed as the ratio of two integers.  Assume that it can be so represented.  A quick deduction shows that this is impossible.  So the assumption is false.  Done.  This is a direct proof of a negative assertion; it is not a “proof by contradiction”.

What, then, is a proof by contradiction?  It is the affirmation of a positive statement by refutation of its denial.  It is a direct proof of the negation of a negated assertion that is then pressed into service as a direct proof of the assertion, which it is not.  Anyone is free to ignore the distinction for the sake of convenience, as a philosophical issue, or as a sly use of “goto” in a proof, but the distinction nevertheless exists and is important.  Indeed, part of the beauty of constructive mathematics is that one can draw such distinctions. Once drawn, such distinctions can be disregarded; once blurred, forever blurred, a pure loss of expressiveness.

For the sake of explanation, let me rehearse a standard example of a genuine proof by contradiction.  The claim is that there exist irrationals a and b such that a to the b power is rational.  Here is an indirect proof, a true proof by contradiction.  Let us prove instead that it is impossible that any two irrationals a and b are such that a to the b is irrational.  This is a negative statement, so of course one proves it by deriving a contradiction from assuming that which is negated.  Suppose, for a contradiction, that for every two irrationals a and b, the exponentiation a to the b power is irrational.  We know from Pythagoras that root two is irrational, so plug it in for both a and b, and conclude that root two to the root two power is irrational.  Now use the assumption again, taking a to be root two to the root two, and b to be root two.  Calculate a to the power of b: it is two, which is eminently rational.  Contradiction.
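The final calculation can be written out explicitly. With a equal to root two to the root two and b equal to root two, the exponent law $(x^y)^z = x^{yz}$ for positive real $x$ gives:

```latex
\[
  \left(\sqrt{2}^{\sqrt{2}}\right)^{\sqrt{2}}
    = \sqrt{2}^{\,\sqrt{2}\cdot\sqrt{2}}
    = \sqrt{2}^{\,2}
    = 2 .
\]
```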

We have now proved that it is not the case that every pair of irrationals, when exponentiated, gives an irrational.  There is nothing questionable about this proof.  But it does not prove that there are two irrationals whose exponentiation is rational!  If you think it does, then I ask you, please name them for me.  That information is not in this proof (there are other proofs that do name them, but that is not relevant for my purposes).  You may, if you wish, disregard the distinction I am drawing, that is your prerogative, and neither I nor anyone has any problem with that.  But you cannot claim that it is a direct proof; it is rather an indirect proof that proceeds by refuting the negative of the intended assertion.

So why am I writing this?  Because I have learned, to my dismay, that in U.S. computer science departments–of all places!–students are being taught, erroneously, that any proof that derives a contradiction is a “proof by contradiction”.  It is not.  Any proof of a negative must proceed by contradiction.  A proof by contradiction in the long-established sense of the term is, contrarily, an indirect proof of a positive by refutation of the negative.  This distinction is important, even if you want to “mod out” by it in your work, for it is only by drawing the distinction that one can even define the equivalence with which to quotient.

That’s my main point.  But for those who may not be familiar with the distinction between direct and indirect proof, let me take the opportunity to comment on why one might care to draw such a distinction.  It is a matter of honesty, of a sort: the information content of the foregoing indirect proof does not fulfill the expectation stated in the theorem.  It is a kind of boast, an overstatement, to claim otherwise.  Compare the original statement with the reformulation used in the proof.  The claim that it is not the case that every pair of irrationals exponentiate to an irrational is uncontroversial.  The proof proves it directly, and there is nothing particularly surprising about it.  One would even wonder why anyone would bother to state it.  Yet the supposedly equivalent claim stated at the outset appears much more fascinating, because most people cannot easily think up an example of two irrationals that exponentiate to rationals.  Nor does the proof provide one. Once, when shown the indirect proof, a student of mine blurted out “oh that’s so cheap.”  Precisely.

Why should you care?  Maybe you don’t, but there are nice benefits to keeping the distinction, because it demarcates the boundary between constructive proofs, which have direct interpretation as functional programs, and classical proofs, which have only an indirect such interpretation (using continuations, to be precise, and giving up canonicity).  Speaking as a computer scientist, this distinction matters, and it’s not costly to maintain.  May I ask that you adhere to it?

Edit: rewrote final paragraph, sketchy and irrelevant, and improved prose throughout. Word-smithing, typos.



by Robert Harper at December 16, 2017 07:20 PM

December 15, 2017

Mark Jason Dominus

Wasteful and frugal proofs in Ramsey theory

This question asks how to show that, among any 11 integers, one can find a subset of exactly six that add up to a multiple of 6. Let's call this “Ebrahimi’s theorem”.

This was the last thing I read before I put away my phone and closed my eyes for the night, and it was a race to see if I would find an answer before I fell asleep. Sleep won the race this time. But the answer is not too hard.

  1. First, observe that among any five numbers there are three that sum to a multiple of 3: Consider the remainders of the five numbers upon division by 3. There are three possible remainders. If all three remainders are represented, then the remainders are $\{0, 1, 2\}$ and the sum of their representatives is a multiple of 3. Otherwise there is some remainder with three representatives, and the sum of these three is a multiple of 3.

  2. Now take the 11 given numbers. Find a group of three whose sum is a multiple of 3 and set them aside. From the remaining 8 numbers, do this a second time. From the remaining 5 numbers, do it a third time.

  3. We now have three groups of three numbers that each sum to a multiple of 3. Two of these sums must have the same parity. The six numbers in those two groups have an even sum that is a multiple of 3, and we win.

Here is a randomly-generated example:

$$3\quad 17\quad 35\quad 42\quad 44\quad 58\quad 60\quad 69\quad 92\quad 97\quad 97$$

Looking at the first 5 numbers $3, 17, 35, 42, 44$ we see that on division by 3 these have remainders $0, 2, 2, 0, 2$. The remainder $2$ is there three times, so we choose those three numbers $17, 35, 44$, whose sum is a multiple of 3, and set them aside.

Now we take the leftover $3$ and $42$ and supplement them with three more unused numbers $58, 60, 69$. The remainders are $0, 0, 1, 0, 0$ so we take $3, 42, 60$ and set them aside as a second group.

Then we take the five remaining unused numbers $58, 69, 92, 97, 97$. The remainders are $1, 0, 2, 1, 1$. The first three have all different remainders, so let's use those as our third group.

The three groups are now $\{17, 35, 44\}, \{3, 42, 60\}, \{58, 69, 92\}$. The first one has an even sum and the second has an odd sum. The third group has an odd sum, which matches the second group, so we choose the second and third groups, and that is our answer:

$$3\qquad 42\qquad 60\qquad 58 \qquad 69 \qquad 92$$

The sum of these is $324 = 6 \cdot 54$.
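The three steps of the proof are constructive, so they can be run as a program. The following is a minimal Haskell sketch, not code from the original post; the names `tripleMod3` and `sixMod6` are made up for illustration, and the search for a triple is done by brute force over the ten 3-element subsets rather than by inspecting remainders.

```haskell
import Data.List (subsequences, (\\))

-- Step 1: among five integers, find three whose sum is a multiple of 3.
-- The pigeonhole argument guarantees that such a triple exists.
tripleMod3 :: [Int] -> [Int]
tripleMod3 five =
  case [t | t <- subsequences five, length t == 3, sum t `mod` 3 == 0] of
    (t : _) -> t
    []      -> error "tripleMod3: needs at least five integers"

-- Steps 2 and 3: peel three such triples off 11 inputs, then join two
-- triples whose sums have the same parity.
sixMod6 :: [Int] -> [Int]
sixMod6 xs =
  case [a ++ b | (a, b) <- pairs, even (sum a + sum b)] of
    (six : _) -> six
    []        -> error "sixMod6: unreachable for 11 inputs"
  where
    g1 = tripleMod3 (take 5 xs)
    r1 = xs \\ g1                 -- 8 numbers left
    g2 = tripleMod3 (take 5 r1)
    r2 = r1 \\ g2                 -- 5 numbers left
    g3 = tripleMod3 (take 5 r2)
    pairs = [(g1, g2), (g1, g3), (g2, g3)]
```

Applied to the eleven example numbers, `sixMod6 [3,17,35,42,44,58,60,69,92,97,97]` returns `[3,42,60,58,69,92]`, the same six numbers chosen by hand.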

This proves that 11 input numbers are sufficient to produce one output set of 6 whose sum is a multiple of 6. Let's write $E(n, k)$ to mean that $n$ inputs are enough to produce $k$ outputs. That is, $E(n, k)$ means “any set of $n$ numbers contains $k$ distinct 6-element subsets whose sum is a multiple of 6.” Ebrahimi’s theorem, which we have just proved, states that $E(11, 1)$ is true, and obviously it also proves $E(n, 1)$ for all larger $n$.

I would like to consider the following questions:

  1. Does this proof suffice to show that $E(10, 1)$ is false?
  2. Does this proof suffice to show that $E(11, 2)$ is false?

I am specifically not asking whether $E(10, 1)$ or $E(11, 2)$ are actually false. There are easy counterexamples that can be found without reference to the proof above. What I want to know is if the proof, as given, contains nontrivial information about these questions.

The reason I think this is interesting is that I think, upon more careful examination, that I will find that the proof above does prove at least one of these, perhaps with a very small bit of additional reasoning. But there are many similar proofs that do not work this way. Here is a famous example. Let $W(n, k)$ be shorthand for the following claim:

Let the integers from 1 to $n$ be partitioned into two sets. Then one of the two sets contains $k$ distinct subsets of three elements of the form $\{a, a+d, a+2d\}$ for integers $a, d$.


Van der Waerden's theorem: $W(325, 1)$ is true.

$W$, like $E$, is monotonic: van der Waerden's theorem trivially implies $W(n, 1)$ for all $n$ larger than 325. Does it also imply that $W(n, 1)$ is false for smaller $n$? No, not at all; this is actually untrue. Does it also imply that $W(325, k)$ is false for $k > 1$? No, this is false also.

Van der Waerden's theorem takes 325 inputs (the integers) and among them finds one output (the desired set of three). But this is extravagantly wasteful. A better argument shows that only 9 inputs were required for the same output, and once we know this it is trivial that 325 inputs will always produce at least 36 outputs, and probably a great many more.
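The claim that 9 inputs already suffice is small enough to check by exhaustive search. The sketch below is an illustration added here, not part of the original argument; the function names are invented. It enumerates every two-colouring of 1..n and looks for a three-term arithmetic progression that lies entirely in one colour class.

```haskell
import Control.Monad (replicateM)

-- All three-term arithmetic progressions [a, a+d, a+2d] inside [1..n].
progressions :: Int -> [[Int]]
progressions n = [[a, a + d, a + 2 * d] | d <- [1 .. n], a <- [1 .. n - 2 * d]]

-- A colouring is a list of n Bools; index i-1 holds the colour of i.
hasMonoAP :: Int -> [Bool] -> Bool
hasMonoAP n colouring = any mono (progressions n)
  where
    colour i = colouring !! (i - 1)
    mono [a, b, c] = colour a == colour b && colour b == colour c
    mono _ = False

-- True when every two-colouring of [1..n] contains such a progression.
everyColouringHasAP :: Int -> Bool
everyColouringHasAP n = all (hasMonoAP n) (replicateM n [False, True])
```

`everyColouringHasAP 9` is `True`, while `everyColouringHasAP 8` is `False`: splitting 1..8 into {1, 2, 5, 6} and {3, 4, 7, 8} leaves neither part with a three-term progression.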

Proofs of theorems in Ramsey theory are noted for being extravagant in exactly this way. But the proof of Ebrahimi's theorem is different. It is not only frugal, it is optimally so. It uses no more inputs than are absolutely necessary.
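The optimality can be seen on two small multisets. These are my own choice of counterexamples, not ones given in the post, and subsets are counted by position so that repeated values count separately. The sketch counts the 6-element sub-collections whose sum is a multiple of 6.

```haskell
import Data.List (subsequences)

-- Number of 6-element sub-collections whose sum is a multiple of 6.
countGood :: [Int] -> Int
countGood xs =
  length [s | s <- subsequences xs, length s == 6, sum s `mod` 6 == 0]

-- Ten inputs with no good subset: a good subset would need six 0s or six 1s.
tenInputs :: [Int]
tenInputs = replicate 5 0 ++ replicate 5 1

-- Eleven inputs with exactly one good subset: the six 0s.
elevenInputs :: [Int]
elevenInputs = replicate 6 0 ++ replicate 5 1
```

Here `countGood tenInputs` is `0` and `countGood elevenInputs` is `1`, so 10 inputs do not always yield an output and 11 inputs do not always yield two.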

What is different about these cases? What is the source of the frugality of the proof of Ebrahimi’s theorem? Is there a way that we can see from examination of the proof that it will be optimally frugal?

Ebrahimi’s theorem shows $E(11, 1)$. Suppose instead we want to show $E(n, 2)$ for some $n$. From Ebrahimi’s theorem itself we immediately get $E(22, 2)$ and indeed $E(17, 2)$. Is this the best we can do? (That is, is $E(16, 2)$ false?) I bet it isn't. If it isn't, what went wrong? Or rather, what went right in the $k = 1$ case that stopped working when $k > 1$?

I don't know.

by Mark Dominus at December 15, 2017 05:09 PM

Ken T Takusagawa

[agobrown] Longest games of chomp

What Chomp starting positions offer the longest games, perhaps the most possibilities for interesting games?  Among rectangular starting positions, good starting positions are 13x12, 12x11, 10x9, 9x8, 11x6, 7x6, 8x5, 6x5, 5x4.  Missing from the pattern of (N)x(N-1) are 11x10 and 8x7.  (Chomp is weird in how there aren't simple patterns.  It might be a good candidate for machine learning.)

We assumed 3 types of positions in Chomp are instantly known lost (P positions):

  1. L-shaped positions with both arms of the L having unit width and same lengths
  2. 2-row positions of the form [a,a-1]
  3. 3-row positions of the form [a,a-2,2]

The 3-row [a,a-2,2] class of positions is noted in Proposition 2 of "Three-Rowed Chomp" by Doron Zeilberger.  The winning strategy from such a position is as follows:

The base case is [4,2,2] (which looks kind of like a pistol).  If the opponent moves to [3,2,2], then respond moving to [3,2] and follow the 2-row strategy (or move to [3,1,1] and L-shaped strategy).  If [2,2,2] then 2-row strategy vertically.  If [4,1,1] then [3,1,1] and L-shaped strategy.  If [4,2,1] then [2,2,1] and 2-row strategy vertically.  If [4,2] then 2-row strategy.

For larger 3-row positions [a,a-2,2], if the opponent moves in the first 2 rows, leaving at least 4 in the first row and at least 2 in the second row, then restore the position to the shape [b,b-2,2].  If [3,3,2] then [3,1,1] and L-shaped strategy.  If [a,1,1] then [3,1,1] and L-shaped strategy.  If the opponent moves on the third row to [a,a-2,1] then [2,2,1] and follow the 2-row strategy vertically.  If [a,a-2], then 2-row strategy.
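The claims above can be checked mechanically on small boards. The following is an independent minimal sketch, not the author's program: it only decides win or loss (no "in N moves" depth), it does not treat any position as instantly known, and the names `moves`, `positionsIn` and `losing` are invented for illustration. A position is written as its row lengths, as in the text.

```haskell
import qualified Data.Map as Map

-- A position is the list of row lengths, non-increasing and all positive.
-- The poisoned square is column 0 of row 0.
type Position = [Int]

-- Every position reachable by one bite. Biting square (r, c) trims row r
-- and all later rows to length at most c. Biting the poisoned square
-- (0, 0) is not offered, so [1] has no moves and is lost.
moves :: Position -> [Position]
moves p =
  [ takeWhile (> 0) [if i >= r then min len c else len | (i, len) <- zip [0 ..] p]
  | (r, rowLen) <- zip [0 :: Int ..] p
  , c <- [0 .. rowLen - 1]
  , (r, c) /= (0, 0)
  ]

-- All positions that fit inside a box of the given height and width.
positionsIn :: Int -> Int -> [Position]
positionsIn 0 _ = [[]]
positionsIn rows width =
  [] : [len : rest | len <- [1 .. width], rest <- positionsIn (rows - 1) len]

-- True when the player to move loses under perfect play. The table is a
-- lazily evaluated map that refers to itself, so each position is solved once.
losing :: Position -> Bool
losing [] = False
losing start@(top : _) = table Map.! start
  where
    table = Map.fromList
      [ (p, all (not . (table Map.!)) (moves p))
      | p <- positionsIn (length start) top ]
```

For example, `losing [3,2]` and `losing [4,2,2]` are both `True`, matching the 2-row and 3-row families listed above, while `losing [4,4]` is `False` because the player to move can bite down to `[4,3]`.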

Here is complete output of all positions within board size 13x13 and Haskell source code.  A selection of some positions and their game values are also given below.  Computing board size 12 required 8.5 GB of RAM on a machine with 16 GB of RAM.  (Haskell programs tend to use a lot of memory unless one puts effort into conserving memory, which we did not do.)

For computing board size 13, we allowed swapping to virtual memory on SSD on a machine with 8 GB of physical RAM.  The output of /usr/bin/time was:

5751.60user 86808.57system 39:48:33elapsed 64%CPU (0avgtext+0avgdata 7192640maxresident)k
10410518744inputs+8outputs (184956202major+316491058minor)pagefaults 0swaps

This suggests a slowdown factor of about 25 for using virtual memory on SSD compared to RAM for this program which made heavy use of Data.Map.  Polling "ps xu" saw a maximum virtual memory usage of 39 GB.  For the output of the board size 13 at the link above, we omitted saving the "Win_in 1" positions to save disk space.

There are only 3 "Lose in 2" positions: [6,3,3]; [5,5,3]; and [5,2,1,1].  Memorize them to get an edge against opponents.  One could also memorize the 7 "Lose in 4" positions, 14 "Lose in 6", 26 "Lose in 8"...

There seem to be some more patterns that lose: [odd,2,1,1,1,...]; [even,3,1,1,1,...]; [even,2,2,2,1,1,1,...]; [even,2,2,1,1,1,...]; [odd,4,1,1,1,...].  These deserve future investigation.  Andries Brouwer's web site suggests that losing families of positions exist in 3-row chomp for [a+11,a+7,5]; [?,?,7]; [?,?,9]; [?,?,11]; [?,?,14] (not 13, once again breaking what seemed to be a simple pattern of odd third rows).  It still needs to be explicitly articulated how to win after giving your opponent these losing positions.  Work by Steven Byrnes suggests the game values of all 3-row Chomp positions can be rapidly computed, though probably not by a human in his or her head.  Future versions of the code should bound not by board size but number of pieces, to investigate thin positions and roughly L-shaped positions.

(Position [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12, 5], Win_in 103)
(Position [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 5], Win_in 103)
(Position [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], Win_in 101)
(Position [12, 12, 12, 12, 12, 12, 12, 12, 12, 10, 7], Lose_in 86)
(Position [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12], Win_in 79)
(Position [11, 11, 11, 10, 10, 10, 10, 10, 2], Win_in 57)
(Position [11, 11, 11, 10, 10, 10, 10, 9, 2], Win_in 57)
(Position [11, 11, 11, 11, 11, 9, 9, 7, 1, 1], Win_in 57)
(Position [11, 11, 11, 11, 11, 9, 9, 9, 1, 1], Win_in 57)
(Position [11, 11, 11, 11, 11, 11], Win_in 43)
(Position [11, 11, 11, 11, 11, 11, 11], Win_in 41)
(Position [11, 11, 11, 11, 11, 11, 11, 11], Win_in 39)
(Position [11, 11, 11, 11, 11, 11, 11, 11, 11], Win_in 37)
(Position [11, 11, 11, 11, 11], Win_in 35)
(Position [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], Win_in 21)
(Position [10, 10, 10, 10, 10, 10, 10, 10, 4], Lose_in 56)
(Position [10, 10, 10, 10, 10, 10, 10, 10, 10], Win_in 55)
(Position [9, 9, 9, 9, 9, 9, 9, 9], Win_in 41)
(Position [8, 8, 8, 8, 8], Win_in 23)
(Position [8, 8, 8, 8, 8, 8], Win_in 15)
(Position [8, 8, 8, 8, 8, 8, 8], Win_in 13)
(Position [7, 7, 7, 7, 7, 7], Win_in 21)
(Position [6, 6, 6, 6, 2], Win_in 13)
(Position [6, 6, 6, 6, 6], Win_in 9)
(Position [5, 5, 5, 5], Win_in 5)
(Position [4, 4, 4, 4], Win_in 1)
(Position [4, 4, 4], Win_in 1)
(Position [4, 4], Win_in 1)
(Position [4], Win_in 1)

(Position [5, 2, 1, 1], Lose_in 2)
(Position [5, 5, 3], Lose_in 2)
(Position [6, 3, 3], Lose_in 2)

(Position [5, 3, 3, 2], Lose_in 4)
(Position [5, 5, 2, 2], Lose_in 4)
(Position [6, 2, 2, 1, 1], Lose_in 4)
(Position [6, 2, 2, 2], Lose_in 4)
(Position [6, 3, 1, 1, 1], Lose_in 4)
(Position [7, 2, 1, 1, 1, 1], Lose_in 4)
(Position [7, 4, 3], Lose_in 4)

(Position [6, 4, 3, 3, 2], Lose_in 6)
(Position [7, 2, 2, 2, 2], Lose_in 6)
(Position [7, 3, 2, 1, 1, 1], Lose_in 6)
(Position [7, 3, 2, 2], Lose_in 6)
(Position [7, 3, 3, 1, 1], Lose_in 6)
(Position [7, 3, 3, 2, 1, 1], Lose_in 6)
(Position [7, 4, 1, 1, 1], Lose_in 6)
(Position [7, 5, 3, 2], Lose_in 6)
(Position [7, 7, 4], Lose_in 6)
(Position [8, 2, 2, 1, 1, 1, 1], Lose_in 6)
(Position [8, 2, 2, 2, 1, 1], Lose_in 6)
(Position [8, 3, 1, 1, 1, 1, 1], Lose_in 6)
(Position [8, 4, 4], Lose_in 6)
(Position [9, 2, 1, 1, 1, 1, 1, 1], Lose_in 6)

(Position [6, 4, 4, 3, 3], Lose_in 8)
(Position [6, 6, 3, 3, 3], Lose_in 8)
(Position [6, 6, 4, 3, 2], Lose_in 8)
(Position [7, 3, 3, 3, 2, 2], Lose_in 8)
(Position [7, 4, 2, 2, 2, 2], Lose_in 8)
(Position [7, 4, 4, 2], Lose_in 8)
(Position [7, 5, 3, 3, 1, 1], Lose_in 8)
(Position [7, 7, 3, 3], Lose_in 8)
(Position [8, 3, 2, 2, 2], Lose_in 8)
(Position [8, 3, 3, 3], Lose_in 8)
(Position [8, 4, 2, 1, 1, 1], Lose_in 8)
(Position [8, 4, 2, 2], Lose_in 8)
(Position [8, 5, 1, 1, 1], Lose_in 8)
(Position [8, 5, 4, 2], Lose_in 8)
(Position [9, 2, 2, 2, 2, 1, 1], Lose_in 8)
(Position [9, 2, 2, 2, 2, 2], Lose_in 8)
(Position [9, 3, 2, 1, 1, 1, 1, 1], Lose_in 8)
(Position [9, 3, 2, 2, 1, 1, 1], Lose_in 8)
(Position [9, 4, 1, 1, 1, 1, 1], Lose_in 8)
(Position [9, 4, 4, 1, 1], Lose_in 8)
(Position [9, 5, 3, 1, 1, 1, 1], Lose_in 8)
(Position [9, 5, 4], Lose_in 8)
(Position [10, 2, 2, 1, 1, 1, 1, 1, 1], Lose_in 8)
(Position [10, 2, 2, 2, 1, 1, 1, 1], Lose_in 8)
(Position [10, 3, 1, 1, 1, 1, 1, 1, 1], Lose_in 8)
(Position [11, 2, 1, 1, 1, 1, 1, 1, 1, 1], Lose_in 8)

by Ken at December 15, 2017 08:05 AM

December 14, 2017

Mike Izbicki

how to cheat at settlers by loading the dice

(and prove it with p-values)

posted on 2017-12-14

tl;dr This post shows how to create loaded dice, and how to use these dice to gain between 5 and 15 additional resource cards per game of Settlers of Catan. Surprisingly, we’ll prove that standard scientific tests are not powerful enough to determine that the dice are unfair while playing a game. This essentially means that it’s impossible for your opponents to scientifically prove that you’re cheating. This impossibility is due to methodological defects in the current state of scientific practice, and we’ll highlight some ongoing work to fix these defects.

Loading the dice

My copy of Settlers of Catan came with two normal wooden dice. To load these dice, I placed them in a small plate of water overnight, leaving the 6 side exposed.

The submerged area absorbed water, becoming heavier. My hope was that when rolled, the heavier wet sides would be more likely to land face down, and the lighter dry side would be more likely to land face up. So by leaving the 6 exposed, I was hoping to create dice that roll 6’s more often.

This effect is called the bias of the dice. To measure this bias, my wife and I spent the next 7 days rolling dice while eating dinner. (She must love me a lot!)

In total, we rolled the dice 4310 times. The raw results are shown below.

face             1      2      3      4      5      6
number of rolls  622    698    650    684    666    812
probability      0.151  0.169  0.157  0.165  0.161  0.196

Looking at the data, it’s “obvious” that our dice are biased: The 6 gets rolled more times than any of the other numbers. Before we prove this bias formally, however, let’s design a strategy to exploit this bias while playing Settlers of Catan.

A strategy for loaded dice

The key to winning at Settlers of Catan is to get a lot of resources. We want to figure out how many extra resources we can get using our biased dice.

First, let’s quickly review the rules. Each settlement is placed on the corner of three tiles, and each tile has a number token. Whenever the dice are rolled, if they add up to one of the numbers on the tokens, you collect the corresponding resource card. For example:

A good settlement will be placed next to numbers that will be rolled often.

To make strategizing easier, the game designers put helpful dots on each token below the number. These dots count the ways to roll that token’s number using two dice.

We can use these dots to calculate the probability of rolling each number. For example, a \(4\) can be rolled in three ways. If we name our two dice \(A\) and \(B\), then the possible combinations are \((A=1,B=3)\), \((A=2,B=2)\), \((A=3,B=1)\). To calculate the probability of rolling a 4, we calculate the probability of each of these rolls and add them together. For fair dice, the probability of every roll is the same \((1/6)\), so the calculation is:

\[\begin{align} Pr(A+B = 4) &= Pr(A = 1)Pr(B=3) + Pr(A=2)Pr(B=2) + Pr(A=3)Pr(B=1) \\ &= (1/6)(1/6) + (1/6)(1/6) + (1/6)(1/6) \\ &= 1/12 \\ &\approx 0.08333 \end{align}\]

For our biased dice, the probability of each roll is different. Using the numbers from the table above, we get:

\[\begin{align} Pr(A+B = 4) &= Pr(A = 1)Pr(B=3) + Pr(A=2)Pr(B=2) + Pr(A=3)Pr(B=1) \\ &= (0.151)(0.157) + (0.169)(0.169) + (0.157)(0.151) \\ &= 0.07597 \end{align}\]

So rolling a \(4\) is now less likely with our biased dice. Performing this calculation for each possible number gives us the following chart.

All the numbers below \(7\) are now less likely, and the numbers above 7 are now more likely. The shift is small, but it has important strategic implications.
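The calculation for every total can be sketched in a few lines of Haskell. This is an illustration added here, not code from the original post; the names `fair`, `biased` and `prSum` are invented, and the biased probabilities are the rounded values from the table above.

```haskell
-- Per-face probabilities for one die, faces 1..6.
fair, biased :: [Double]
fair   = replicate 6 (1 / 6)
biased = [0.151, 0.169, 0.157, 0.165, 0.161, 0.196]

-- Probability that two independent dice with the given face
-- probabilities add up to `total`.
prSum :: [Double] -> Int -> Double
prSum die total =
  sum [pa * pb | (a, pa) <- faces, (b, pb) <- faces, a + b == total]
  where
    faces = zip [1 ..] die
```

`prSum fair 4` gives 1/12 and `prSum biased 4` gives about 0.07597, matching the two hand calculations; `map (prSum biased) [2 .. 12]` gives the whole distribution.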

Consider the two initial settlement placements below.

The naughty player knows that the dice are biased and puts her settlements on locations with high numbers, but the nice player doesn’t know the dice are biased and puts her settlements on locations with low numbers. Notice that if the dice were fair, both settlement locations would be equally good because they have the same number of dots.

The following formula calculates the average number of cards a player receives on each dice roll:

\[ \text{expected cards per roll} = \sum_{\text{adjacent tokens}} Pr(A+B=\text{token value}) \]

Substituting the appropriate values gives us the following results.

expected cards per roll
             naughty  nice
fair dice    0.500    0.500
biased dice  0.543    0.457

So the difference between the naughty and nice player is \(0.086\) cards per roll of the biased dice. A typical game of Settlers contains about 60 dice rolls (about 15 turns per player in a 4 player game), so this results in \(0.086*60=5.16\) more cards for the naughty player.
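The formula for expected cards per roll can be sketched as follows. The token lists here are hypothetical, chosen only so that the two placements mirror each other around 7; they are not the placements pictured in the post, so the numbers will differ from the table above. All names are invented for illustration.

```haskell
-- Per-face probabilities for one die, faces 1..6 (rounded table values).
fair, biased :: [Double]
fair   = replicate 6 (1 / 6)
biased = [0.151, 0.169, 0.157, 0.165, 0.161, 0.196]

-- Probability that two independent dice add up to `total`.
prSum :: [Double] -> Int -> Double
prSum die total =
  sum [pa * pb | (a, pa) <- faces, (b, pb) <- faces, a + b == total]
  where
    faces = zip [1 ..] die

-- Expected cards per roll: the sum over adjacent tokens of the
-- probability of rolling that token's number.
expectedCards :: [Double] -> [Int] -> Double
expectedCards die tokens = sum (map (prSum die) tokens)

-- Hypothetical placements: same number of dots, high versus low numbers.
naughtyTokens, niceTokens :: [Int]
naughtyTokens = [8, 9, 10]
niceTokens    = [6, 5, 4]
```

With fair dice both placements are worth the same, since `expectedCards fair naughtyTokens` and `expectedCards fair niceTokens` are both 1/3. With the biased dice the high-number placement comes out ahead, which is the effect the naughty player exploits.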

And this is only considering the two starting settlements. As the game progresses, more settlements will be built, and some settlements will be upgraded to cities (which receive two cards per roll instead of one). Calculating the exact effect of these additional sources of cards is difficult because these improvements will be built at random points throughout the game. We’ll have to make some additional assumptions.

If we assume that the naughty player gets 0.043 more cards per roll per settlement/city than the nice player (this exact number will vary depending on the quality of the settlement), and that both players build settlement/cities at turns 10,20,25,30,35,40,45, and 50, then the naughty player will on average receive 15.050 more cards than the nice player.

To summarize, the naughty player will receive somewhere between 5 and 15 more resource cards depending on how their future settlements and cities are built. This advantage can’t guarantee a victory, but it’ll definitely help.

A scientific analysis

Now we’re going to do some simple statistics to prove two things:
  1. The dice really are biased. So the fact that the 6 was rolled more times than the other numbers wasn’t just due to random chance.
  2. There are not enough dice rolls in a game of Settlers for our opponents to scientifically prove that the dice are biased. So it’s scientifically impossible for our opponents to know that we’re cheating.

To show that the dice are biased, we will use a standard scientific technique called null hypothesis significance testing. We begin by assuming a hypothesis that we want to disprove. In our case, we assume that the dice are not biased. In other words, we assume that each number on the dice has a \(1/6\approx 0.167\) chance of being rolled. Our goal is to show that under this assumption, the number of 6’s rolled above is very unlikely. We therefore conclude that our hypothesis is also unlikely, and that the dice probably are in fact biased.

More formally, we let \(X\) be a random variable that represents the total number of 6’s we would roll if we were to repeat our initial experiment with fair dice. Then \(X\) follows a binomial distribution whose density is plotted below.

The \(p\)-value for our experiment is defined informally to be the probability of getting results similar to the results we observed if the dice are not biased. The formal definition and formula is \[ p\text{-value}= Pr(X\ge k) = 1-\sum_{i=0}^{k-1} {n\choose i} q^i(1-q)^{n-i} , \]

where \(n\) is the total number of dice rolls (4310), \(k\) is the number of 6’s actually rolled (812), and \(q\) is the assumed probability of rolling a 6 (1/6). Substituting these numbers gives us \[ p\text{-value}= Pr(X\ge k) \approx 0.0000884 . \] In other words, if we repeated this experiment one million times with fair dice, we would expect to get results similar to the results we actually got only 88 times. Since this is so unlikely, we conclude that our original assumption (that the dice are not biased) is probably false. Most science classes teach that \(p\)-values less than 0.05 are “significant.” We are very far below that threshold, so our result is “very significant.”
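The tail probability can be evaluated exactly with rational arithmetic and rounded only at the end. The sketch below is an illustration, not the code used for the post; `binomials` and `upperTail` are invented names, and it sums the upper tail directly instead of subtracting the lower tail from 1, which gives the same quantity.

```haskell
import Data.Ratio ((%))

-- Binomial coefficients C(n,0), C(n,1), ..., C(n,n), built incrementally
-- from C(n,i+1) = C(n,i) * (n-i) / (i+1).
binomials :: Integer -> [Integer]
binomials n = scanl (\c i -> c * (n - i) `div` (i + 1)) 1 [0 .. n - 1]

-- Pr(X >= k) for X ~ Binomial(n, q), summed exactly over i = k..n
-- and converted to a Double at the end.
upperTail :: Integer -> Integer -> Rational -> Double
upperTail n k q = fromRational (sum (drop (fromIntegral k) terms))
  where
    terms =
      [ fromIntegral c * q ^ i * (1 - q) ^ (n - i)
      | (i, c) <- zip [0 ..] (binomials n) ]
```

Under these assumptions the value for the full experiment is `upperTail 4310 812 (1 % 6)`, and the same function applies to any other `n` and `k`, for instance the number of rolls in a single game.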

Our \(p\)-value is so low because the number of trials we conducted was very large \((n=4310)\). In a typical game of Settlers, however, there will be many fewer trials. This makes it hard for our opponents to prove that we’re cheating.

We said before that there are 60 dice rolls in a typical game. Since we have two dice, that means \(n=120\). To keep the math simple, we’ll assume that we roll an average number of 6’s. That is, the number of sixes rolled during the game is \[ k=812\cdot \frac{120}{4310}\approx23. \] Substituting into our formula for the \(p\)-value, we get \[ p\text{-value}=P(X\ge k) \approx 0.265 . \] In words, this means that if the dice were actually fair, then we would still roll at least this number of 6’s \(26.5\%\) of the time. Since this probability is so high, the standard scientific protocol tells us to conclude that we have no “significant” evidence that the dice are biased. (Notice that this is subtly different from having evidence that the dice are not biased! Confusing these two statements is a common mistake, even for trained PhD scientists, and especially for medical doctors.)

So how many games can we play without getting caught? It turns out that if we play 6 games (so \(n=6*120=720\), and \(k=812\cdot(720/4310)\approx136\)), then the resulting \(p\)-value is 0.05. In other words, as long as we play fewer than 6 games, then our opponents won’t have enough data to conclude that their measurements of the biased dice are “significant.” The standard scientific method won’t prove we’re cheating.

Some flaws with the \(p\)-value and “significance”

The \(p\)-value argument above is how most scientists currently test their hypotheses. But there are some major flaws with this approach. For example:

  1. The \(p\)-value test doesn’t use all the available information. In particular, our opponents may have other reasons to believe that the dice are loaded. If you look closely at the dice, you’ll notice some slight discoloration where they were submerged in water.

    This discoloration was caused because the water spread the ink on the die’s face. If you see similar discoloration on the dice in your game, it makes sense to be extra suspicious about the dice’s bias.

    Unfortunately, there’s no way to incorporate this suspicion into the \(p\)-value analysis we conducted above. An alternative to the \(p\)-value called the Bayes factor can incorporate this prior evidence. So if our opponent uses a Bayes factor analysis, they may be able to determine that we’re cheating. The Bayes factor is more complicated than the \(p\)-value, however, and so it is not widely taught to undergraduate science majors. It is rarely even used in PhD-level scientific publications, and many statisticians are calling for increased use of these more sophisticated analysis techniques.

  2. Another weakness of the \(p\)-value test is that false positives are very common. Using the standard significance threshold of \(p\le0.05\) means that 5 of every 100 games will have “significant” evidence that the dice are biased to roll 6’s. Common sense, however, tells us that cheating at Settlers of Catan is almost certainly not this common because most people just don’t want to cheat. But when you run many experiments, some of them will give “significant” results just by random chance. This is one of the many reasons why some scientists have concluded that most published research is false. This effect is thought to be one of the reasons that evidence of extrasensory perception (ESP) continues to be published in scientific journals. Some less scrupulous scientists exploit this deficiency in a process called p-hacking to make their research seem more important.

    To alleviate the problem of false positives, a group of statisticians is proposing a new significance threshold of \(p\le0.005\) for a result to qualify as “significant”. While this reduces the risk of false positives, it also makes detecting true effects harder. Under this new criterion, we’d have to play 16 games (for \(n=1920\) dice rolls) to get statistically significant evidence that the dice are biased.

At this point, you might be feeling overwhelmed at the complexity of statistical analysis. And this is just for the toy problem of detecting loaded dice in a game. Real world problems like evaluating the effectiveness of chemotherapy drugs are much more complicated, and so require much more complicated statistical analyses. Doing science is hard!

Edit after peer review: Vijay Lulla sent me the following message:

The blog mentions that you rolled the dice 4310 times and all your calculations are based on it, but the frequency table adds up to 4312.

Whooops! It looks like I messed up my addition. Fortunately, this mistake is small enough that it won’t affect any of the numbers in the article by much.

A lot of people mistakenly think that peer review is where other scientists repeat an experiment to test the conclusion. But that’s not the case. The purpose of peer review is for scientists like Vijay to just do a sanity check on the whole procedure to make sure obvious mistakes like this get caught. Sadly, another commonly made mistake in science is that researchers don’t publish their data, so there’s no way for checks like this to be performed.

If this were a real publication in a scientific journal, I would redo all the calculations. But since it’s not, I’ll leave the mistake for posterity.

Edit 2: There’s a good discussion on reddit’s /r/statistics. This discussion provides a much more nuanced view about significance testing than my discussion above, and a few users point out ways that I might be overstating some conclusions.

December 14, 2017 12:00 AM

December 13, 2017

FP Complete

Software Release Management Best Practices

What is software release management?

At its most general, “releasing software” is the process by which software is delivered from the engineers creating it to its users. This can take such forms as:

by Michael Snoyman at December 13, 2017 09:11 PM

December 12, 2017

Neil Mitchell

Benchmarking strchr vs memchr

Summary: memchr is faster, but the obvious implementation seems to beat the builtin versions.

There are two related C functions for finding the next character in a string - strchr which assumes the string has a NUL character at the end, and memchr which takes the string length as an argument. For strings where you have the size and a NUL terminator, which is fastest? Using gcc 6.2.0 64bit MSYS2 on Windows 10, searching for a single byte 10M bytes along a string, the times were (fastest to slowest):

Trying on 3 different Windows computers, the results are all similar (but scaled).

Given the choice, you should prefer memchr over strchr.

Surprise result

The optimised implementations shipped with GCC are slower than the obvious C implementations taken from a wiki. I have absolutely no idea why. From what I can tell, the builtin versions are coded in assembly, operating on multiple bytes at a time, using SSE instructions. In contrast, the C variants operate on a single byte at a time, and aren't vectorised by the optimiser according to Godbolt. If anyone has an explanation I'd be keen to hear it.

Benchmark Code

To benchmark the variants I wrote a Haskell program using criterion. The full code and build instructions are available in this gist. I compiled the C code with -O3, using the gcc shipped with GHC 8.2.1. I've reproduced the Haskell code below, with some comments:

-- Import all the necessary pieces
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import Criterion.Main
import Foreign
import Foreign.C.Types
import Data.Monoid

-- Make all the C imports
foreign import ccall unsafe "string.h memchr" memchr_std :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe "string.h strchr" strchr_std :: Ptr Word8 -> CInt -> IO (Ptr Word8)
foreign import ccall unsafe memchr_c :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe strchr_c :: Ptr Word8 -> CInt -> IO (Ptr Word8)

-- Method for ignoring the size when using strchr
ignoreSize f a b _ = f a b

-- Build a suitable string with an interesting character i bytes along
cstr i = BS.replicate i 32 <> BS.singleton 64 <> BS.replicate i 32 <> BS.singleton 0

-- The functions to benchmark
funs =
    [("memchr_std", memchr_std)
    ,("strchr_std", ignoreSize strchr_std)
    ,("memchr_c", memchr_c)
    ,("strchr_c", ignoreSize strchr_c)]

-- The main function, using Criterion
main = defaultMain
    [ seq bs $ bench (show i ++ " " ++ name) $ whnfIO $ test fun bs
    | i <- [1,10,100,1000,10000,100000,1000000,10000000]
    , let bs = cstr i
    , (name, fun) <- funs]

-- The function under test and input string
{-# NOINLINE test #-}
test fun bs =
    BS.unsafeUseAsCStringLen bs $ \(ptr,len) ->
        fun (castPtr ptr) 64 (fromIntegral len)

by Neil Mitchell at December 12, 2017 04:56 PM

December 11, 2017

Jeremy Gibbons

Streaming Arithmetic Coding

In the previous post we saw the basic definitions of arithmetic encoding and decoding, and a proof that decoding does indeed successfully retrieve the input. In this post we go on to show how both encoding and decoding can be turned into streaming processes.

Producing bits

Recall that

\displaystyle  \begin{array}{@{}l} \mathit{encode}_0 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow \mathit{Rational} \\ \mathit{encode}_0\;m = \mathit{pick} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \vrule width0pt depth2ex \\ \mathit{decode}_0 :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_0\;m\;x = \mathit{unfoldr}\;\mathit{step}\;(m,x) \end{array}

Encoding and decoding work together. But they work only in batch mode: encoding computes a fraction, and yields nothing until the last step, and so decoding cannot start until encoding has finished. We really want encoding to yield as the encoded text a list of bits representing the fraction, rather than the fraction itself, so that we can stream the encoded text and the decoding process. To this end, we replace {\mathit{pick} :: \mathit{Interval} \rightarrow \mathit{Rational}} by {\mathit{pick}_2 = \mathit{fromBits} \cdot \mathit{toBits}}, where

\displaystyle  \begin{array}{@{}l} \mathbf{type}\;\mathit{Bit} = \mathit{Integer} - \mbox{\quad 0 or 1 only} \vrule width0pt depth2ex \\ \mathit{toBits} :: \mathit{Interval} \rightarrow [\mathit{Bit}] \\ \mathit{fromBits} :: [\mathit{Bit}] \rightarrow \mathit{Rational} \end{array}

The obvious definitions have {\mathit{toBits}\;i} yield the shortest binary expansion of any fraction within {i}, and {\mathit{fromBits}} evaluate this binary expansion. However, we don’t do quite this—it turns out to prevent the streaming condition from holding—and instead arrange for {\mathit{toBits}} to yield the bit sequence that when extended with a 1 yields the shortest expansion of any fraction within {i} (and indeed, the shortest binary expansion necessarily ends with a 1), and {\mathit{fromBits}} compute the value with this 1 appended.

\displaystyle  \begin{array}{@{}l} \mathit{fromBits} = \mathit{foldr}\;\mathit{pack}\;(\frac 1 2) \vrule width0pt depth2ex \\ \mathit{pack} :: \mathit{Bit} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{pack}\;b\;x = (b + x) / 2 \vrule width0pt depth2ex \\ \mathit{toBits} = \mathit{unfoldr}\;\mathit{nextBit} \vrule width0pt depth2ex \\ \mathit{nextBit} :: \mathit{Interval} \rightarrow \mathsf{Maybe}\;(\mathit{Bit}, \mathit{Interval}) \\ \mathit{nextBit}\;(l,r) \\ \begin{array}[t]{@{\quad}clcl} | & r \le \frac 1 2 &=& \mathit{Just}\;(0, (0, \frac 1 2) \mathbin{\triangleleft} (l,r)) \\ | & \frac 1 2 \le l &=& \mathit{Just}\;(1, (\frac 1 2,1) \mathbin{\triangleleft} (l,r)) \\ | & \mathbf{otherwise} &=& \mathit{Nothing} \end{array} \end{array}

Thus, if {r \le \frac 1 2} then the binary expansion of any fraction within {[l,r)} starts with 0; and similarly, if {\frac 1 2 \le l}, the binary expansion starts with 1. Otherwise, the interval {[l,r)} straddles {\frac 1 2}; the shortest binary expansion within it is the expansion of {\frac 1 2}, so we yield the empty bit sequence.
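The definitions above can be transcribed into plain Haskell as a minimal sketch. The helper name scale and the explicit definition of widen (the operator written with a left-pointing triangle) are assumptions restated here so that the block is self-contained; note also that the mixed Integer/Rational arithmetic in pack needs an explicit conversion.

```haskell
import Data.List (unfoldr)

type Interval = (Rational, Rational)  -- read as the half-open interval [l, r)
type Bit      = Integer               -- 0 or 1 only

-- Assumed helper: maps the interval (l, r) back onto the unit interval.
scale :: Interval -> Rational -> Rational
scale (l, r) x = (x - l) / (r - l)

-- Assumed definition of widening: rescale both endpoints of the second interval.
widen :: Interval -> Interval -> Interval
widen i (p, q) = (scale i p, scale i q)

nextBit :: Interval -> Maybe (Bit, Interval)
nextBit (l, r)
  | r <= 1/2  = Just (0, widen (0, 1/2) (l, r))
  | 1/2 <= l  = Just (1, widen (1/2, 1) (l, r))
  | otherwise = Nothing

toBits :: Interval -> [Bit]
toBits = unfoldr nextBit

pack :: Bit -> Rational -> Rational
pack b x = (fromInteger b + x) / 2

-- The seed 1/2 plays the role of the final 1 that toBits leaves implicit.
fromBits :: [Bit] -> Rational
fromBits = foldr pack (1/2)

pick2 :: Interval -> Rational
pick2 = fromBits . toBits
```

For the interval from 1/4 to 3/8, toBits yields the bits 0, 1, 0, and fromBits turns them back into 5/16 (binary 0.0101), which lies inside the interval.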

Note that {\mathit{pick}_2 = \mathit{fromBits} \cdot \mathit{toBits}} is a hylomorphism, so we have

\displaystyle  \begin{array}{@{}l} \mathit{pick}_2\;(l,r) \\ \begin{array}[t]{@{\quad}clcl} | & r \le \frac 1 2 &=& \mathit{pick}_2\;((0,\frac 1 2) \mathbin{\triangleleft} (l,r)) / 2 \\ | & \frac 1 2 \le l &=& (1 + \mathit{pick}_2\;((\frac 1 2,1) \mathbin{\triangleleft} (l,r))) / 2 \\ | & \mathbf{otherwise} &=& \frac 1 2 \end{array} \end{array}

Moreover, it is clear that {\mathit{toBits}} yields a finite bit sequence for any non-empty interval (since the interval doubles in width at each step, and the process stops when it includes {\frac 1 2}); so this equation serves to uniquely define {\mathit{pick}_2}. In other words, {\mathit{nextBit}} is a recursive coalgebra. Then it is a straightforward exercise to prove that {i \ni \mathit{pick}_2\;i}; so although {\mathit{pick}} and {\mathit{pick}_2} differ, they are sufficiently similar for our purposes.
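As a quick sanity check rather than a proof, the recursive equation for pick2 can be written directly in Haskell and tested for containment on a few intervals. In this sketch the two widening steps are expanded by hand into explicit arithmetic (doubling, and doubling then subtracting 1), and within is a hypothetical name for the membership test on half-open intervals.

```haskell
type Interval = (Rational, Rational)  -- read as the half-open interval [l, r)

-- Direct transcription of the hylomorphism equation, with the widening
-- by (0, 1/2) and (1/2, 1) expanded into explicit arithmetic.
pick2 :: Interval -> Rational
pick2 (l, r)
  | r <= 1/2  = pick2 (2 * l, 2 * r) / 2
  | 1/2 <= l  = (1 + pick2 (2 * l - 1, 2 * r - 1)) / 2
  | otherwise = 1/2

-- Hypothetical helper: membership in a half-open interval.
within :: Interval -> Rational -> Bool
within (l, r) x = l <= x && x < r
```

For example, the interval from 1/3 to 2/5 is doubled to (2/3, 4/5), then mapped to (1/3, 3/5), which straddles 1/2; unwinding gives (1 + 1/2) / 2 / 2 = 3/8, which is indeed between 1/3 and 2/5.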

Now we redefine encoding to yield a bit sequence rather than a fraction, and decoding correspondingly to consume that bit sequence:

\displaystyle  \begin{array}{@{}l} \mathit{encode}_1 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Bit}] \\ \mathit{encode}_1\;m = \mathit{toBits} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \vrule width0pt depth2ex \\ \mathit{decode}_1 :: \mathit{Model} \rightarrow [\mathit{Bit}] \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_1\;m = \mathit{decode}_0\;m \cdot \mathit{fromBits} \end{array}

That is, we move the {\mathit{fromBits}} part of {\mathit{pick}_2} from the encoding stage to the decoding stage.

Streaming encoding

Just like {\mathit{encode}_0}, the new version {\mathit{encode}_1} of encoding consumes all of its input before producing any output, so does not work for encoding infinite inputs, nor for streaming execution even on finite inputs. However, it is nearly in the right form to be a metamorphism—a change of representation from lists of symbols to lists of bits. In particular, {\mathit{narrow}} is associative, and {\mathit{unit}} is its unit, so we can replace the {\mathit{foldr}} with a {\mathit{foldl}}:

\displaystyle  \mathit{encode}_1\;m = \mathit{unfoldr}\;\mathit{nextBit} \cdot \mathit{foldl}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m

Now that {\mathit{encode}_1} is in the right form, we must check the streaming condition for {\mathit{narrow}} and {\mathit{nextBit}}. We consider one of the two cases in which {\mathit{nextBit}} is productive, and leave the other as an exercise. When {r \le \frac 1 2}, and assuming {\mathit{unit} \supseteq (p,q)}, we have:

\displaystyle  \begin{array}{@{}cl} & \mathit{nextBit}\;((l,r) \mathbin{\triangleright} (p,q)) \\ = & \qquad \{ \mathit{narrow} \} \\ & \mathit{nextBit}\;(\mathit{weight}\;(l,r)\;p, \mathit{weight}\;(l,r)\;q) \\ = & \qquad \{ (l,r) \ni \mathit{weight}\;(l,r)\;q \mbox{, so in particular } \mathit{weight}\;(l,r)\;q < r \le \frac 1 2 \} \\ & \mathit{Just}\;(0, (0, \frac 1 2) \mathbin{\triangleleft} ((l,r) \mathbin{\triangleright} (p,q))) \\ = & \qquad \{ \mathit{widen} \mbox{ associates with } \mathit{narrow} \mbox{ (see below)} \} \\ & \mathit{Just}\;(0, ((0, \frac 1 2) \mathbin{\triangleleft} (l,r)) \mathbin{\triangleright} (p,q)) \end{array}

as required. The last step is a kind of associativity property:

\displaystyle  i \mathbin{\triangleleft} (j \mathbin{\triangleright} k) = (i \mathbin{\triangleleft} j) \mathbin{\triangleright} k

whose proof is left as another exercise. Therefore the streaming condition holds, and we may fuse the {\mathit{unfoldr}} with the {\mathit{foldl}}, defining

\displaystyle  \begin{array}{@{}l} \mathit{encode}_2 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Bit}] \\ \mathit{encode}_2\;m = \mathit{stream}\;\mathit{nextBit}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \end{array}

which streams the encoding process: the initial bits are output as soon as they are fully determined, even before all the input has been read. Note that {\mathit{encode}_1} and {\mathit{encode}_2} differ, in particular on infinite inputs (the former diverges, whereas the latter does not); but they coincide on finite symbol sequences.
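The difference between the batch and the streaming form can be seen in a small Haskell sketch. The definition of stream below is the usual one for metamorphisms (produce while possible, otherwise consume), restated as an assumption since it is not repeated in this section; scale and the concrete definitions of weight, narrow and widen are likewise assumptions. To stay independent of any particular model, the functions take a list of intervals, standing in for the output of encodeSyms m.

```haskell
import Data.List (unfoldr)

type Interval = (Rational, Rational)
type Bit      = Integer

unit :: Interval
unit = (0, 1)

weight, scale :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x
scale  (l, r) x = (x - l) / (r - l)

narrow, widen :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)
widen  i (p, q) = (scale i p, scale i q)

nextBit :: Interval -> Maybe (Bit, Interval)
nextBit (l, r)
  | r <= 1/2  = Just (0, widen (0, 1/2) (l, r))
  | 1/2 <= l  = Just (1, widen (1/2, 1) (l, r))
  | otherwise = Nothing

-- Produce output while the state allows it; otherwise consume one input.
stream :: (b -> Maybe (c, b)) -> (b -> a -> b) -> b -> [a] -> [c]
stream g f b xs = case g b of
  Just (c, b') -> c : stream g f b' xs
  Nothing      -> case xs of
    a : xs' -> stream g f (f b a) xs'
    []      -> []

-- Batch form (as in encode_1) and streaming form (as in encode_2).
batchBits, streamBits :: [Interval] -> [Bit]
batchBits  = unfoldr nextBit . foldl narrow unit
streamBits = stream nextBit narrow unit
```

On the finite input consisting of the intervals (0, 1/2), (1/2, 1), (0, 1/2), both forms yield the bits 0, 1, 0. On an infinite input such as cycle [(0, 1/2)], only streamBits produces anything: its first three bits are all 0, whereas batchBits never returns.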

Streaming decoding

Similarly, we want to be able to stream decoding, so that we don’t have to wait for the entire encoded text to arrive before starting decoding. Recall that we have so far

\displaystyle  \mathit{decode}_1\;m = \mathit{decode}_0\;m \cdot \mathit{fromBits}

where {\mathit{decode}_0} is an {\mathit{unfoldr}} and {\mathit{fromBits}} a {\mathit{foldr}}. The first obstacle to streaming is that {\mathit{foldr}}, which we need to be a {\mathit{foldl}} instead. We have

\displaystyle  \textstyle \mathit{fromBits} = \mathit{foldr}\;\mathit{pack}\;(\frac 1 2)

Of course, {\mathit{pack}} is not associative; it doesn’t even have the right type for that. But we can view each bit in the input as a function on the unit interval: bit 0 is represented by the function {\mathit{weight}\;(0,\frac 1 2)} that focusses into the lower half of the unit interval, and bit 1 by the function {\mathit{weight}\;(\frac 1 2, 1)} that focusses into the upper half. The fold itself composes a sequence of such functions; and since function composition is associative, this can be written equally well as a {\mathit{foldr}} or a {\mathit{foldl}}. Having assembled the individual focussers into one composite function, we finally apply it to {\frac 1 2}. (This is in fact an instance of a general trick for turning a {\mathit{foldr}} into a {\mathit{foldl}}, or vice versa.) Thus, we have:

\displaystyle  \textstyle \mathit{fromBits}\;bs = \mathit{foldl}\;\mathit{focusf}\;\mathit{id}\;bs\;(\frac 1 2) \quad\mathbf{where}\; \mathit{focusf}\;h\;b = h \cdot \mathit{weight}\;(\mathit{half}\;b)

where {\mathit{half}} yields either the lower or the upper half of the unit interval:

\displaystyle  \begin{array}{@{}lcl} \multicolumn{3}{@{}l}{\mathit{half} :: \mathit{Bit} \rightarrow \mathit{Interval}} \\ \mathit{half}\;0 &=& (0, \frac 1 2) \\ \mathit{half}\;1 &=& (\frac 1 2, 1) \end{array}

In fact, not only may the individual bits be seen as focussing functions {\mathit{weight}\;(0, \frac 1 2)} and {\mathit{weight}\;(\frac 1 2, 1)} on the unit interval, so too may compositions of such functions:

\displaystyle  \begin{array}{@{}lcl} \mathit{id} &=& \mathit{weight}\;\mathit{unit} \\ \mathit{weight}\;i \cdot \mathit{weight}\;j &=& \mathit{weight}\;(i \mathbin{\triangleright} j) \end{array}
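Both laws follow by unfolding definitions. The calculation below assumes that {\mathit{weight}\;(l,r)\;x = l + (r-l) \times x}, which is not restated in this section, and that narrowing is defined as in the streaming-condition calculation above. The first law is immediate, since {\mathit{weight}\;(0,1)\;x = 0 + (1-0) \times x = x}. For the second:

\displaystyle  \begin{array}{@{}cl} & \mathit{weight}\;(l,r)\;(\mathit{weight}\;(p,q)\;x) \\ = & \qquad \{ \mathit{weight} \mbox{, twice} \} \\ & l + (r-l) \times (p + (q-p) \times x) \\ = & \qquad \{ \mbox{arithmetic} \} \\ & (l + (r-l) \times p) + ((l + (r-l) \times q) - (l + (r-l) \times p)) \times x \\ = & \qquad \{ \mathit{weight} \} \\ & \mathit{weight}\;(\mathit{weight}\;(l,r)\;p, \mathit{weight}\;(l,r)\;q)\;x \\ = & \qquad \{ \mathit{narrow} \} \\ & \mathit{weight}\;((l,r) \mathbin{\triangleright} (p,q))\;x \end{array}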

So any such composition is of the form {\mathit{weight}\;i} for some interval {i}, and we may represent it concretely by {i} itself, and retrieve the function via {\mathit{weight}}:

\displaystyle  \textstyle \mathit{fromBits}\;bs = \mathit{weight}\;(\mathit{foldl}\;\mathit{focus}\;\mathit{unit}\;bs)\;(\frac 1 2) \quad\mathbf{where}\; \mathit{focus}\;i\;b = i \mathbin{\triangleright} \mathit{half}\;b
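The three formulations of fromBits seen so far can be compared side by side in a short Haskell sketch. The suffixed names fromBitsR, fromBitsF and fromBitsI are hypothetical, introduced only to keep the versions apart, and the definitions of weight and narrow are assumptions restated for self-containment.

```haskell
type Interval = (Rational, Rational)
type Bit      = Integer

unit :: Interval
unit = (0, 1)

weight :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x

narrow :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)

half :: Bit -> Interval
half 0 = (0, 1/2)
half _ = (1/2, 1)   -- any non-zero bit is treated as 1 in this sketch

pack :: Bit -> Rational -> Rational
pack b x = (fromInteger b + x) / 2

-- Version 1: the original right fold.
fromBitsR :: [Bit] -> Rational
fromBitsR = foldr pack (1/2)

-- Version 2: a left fold that composes focussing functions.
focusf :: (Rational -> Rational) -> Bit -> (Rational -> Rational)
focusf h b = h . weight (half b)

fromBitsF :: [Bit] -> Rational
fromBitsF bs = foldl focusf id bs (1/2)

-- Version 3: a left fold over intervals that represent those functions.
focus :: Interval -> Bit -> Interval
focus i b = narrow i (half b)

fromBitsI :: [Bit] -> Rational
fromBitsI bs = weight (foldl focus unit bs) (1/2)
```

On the bits 0, 1, 0 the interval version focusses the unit interval to (0, 1/2), then (1/4, 1/2), then (1/4, 3/8), whose midpoint is 5/16, the same value the right fold computes.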

So we now have

\displaystyle  \textstyle \mathit{decode}_1\;m = \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}\;m \cdot \mathit{foldl}\;\mathit{focus}\;\mathit{unit} \quad\mathbf{where}\; \mathit{prepare}\;m\;i = (m, \mathit{weight}\;i\;(\frac 1 2))

This is almost in the form of a metamorphism, except for the occurrence of the adapter {\mathit{prepare}\;m} in between the unfold and the fold. It is not straightforward to fuse that adapter with either the fold or the unfold; fortunately, however, we can split it into the composition

\displaystyle  \mathit{prepare}\;m = \mathit{prepare}_2 \cdot \mathit{prepare}_1\;m

of two parts, where

\displaystyle  \begin{array}{@{}lcl} \multicolumn{3}{@{}l}{\mathit{prepare}_1 :: \mathit{Model} \rightarrow \mathit{Interval} \rightarrow (\mathit{Model}, \mathit{Interval})} \\ \mathit{prepare}_1\;m\;i &=& (m,i) \vrule width0pt depth2ex \\ \multicolumn{3}{@{}l}{\mathit{prepare}_2 :: (\mathit{Model}, \mathit{Interval}) \rightarrow (\mathit{Model}, \mathit{Rational})} \\ \mathit{prepare}_2\;(m,i) &=& (m, \mathit{weight}\;i\;(\frac 1 2)) \end{array}

in such a way that the first part {\mathit{prepare}_1} fuses with the fold and the second part {\mathit{prepare}_2} fuses with the unfold. For fusing the first half of the adapter with the fold, we just need to carry around the additional value {m} with the interval being focussed:

\displaystyle  \mathit{prepare}_1\;m \cdot \mathit{foldl}\;\mathit{focus}\;i = \mathit{foldl}\;\mathit{mfocus}\;(m,i)


\displaystyle  \begin{array}{@{}l} \mathit{mfocus} :: (\mathit{Model}, \mathit{Interval}) \rightarrow \mathit{Bit} \rightarrow (\mathit{Model}, \mathit{Interval}) \\ \mathit{mfocus}\;(m,i)\;b = (m, \mathit{focus}\;i\;b) \end{array}

For fusing the second half of the adapter with the unfold, let us check the fusion condition. We have (exercise!):

\displaystyle  \begin{array}{@{}l} \mathit{step}\;(\mathit{prepare}_2\;(m, i)) = \mathit{fmap}\;\mathit{prepare}_2\;(\mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s \mathbin{\triangleleft} i))) \\ \qquad\mathbf{where}\;s = \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\frac 1 2)) \end{array}

where the {\mathit{fmap}} is the functorial action for the base functor of the {\mathsf{List}} datatype, applying just to the second component of the optional pair. We therefore define

\displaystyle  \begin{array}{@{}l} \mathit{stepi} :: (\mathit{Model}, \mathit{Interval}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model}, \mathit{Interval})) \\ \mathit{stepi}\;(m,i) = \mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s \mathbin{\triangleleft} i)) \\ \qquad\mathbf{where}\;s = \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\frac 1 2)) \end{array}

and have

\displaystyle  \mathit{step}\;(\mathit{prepare}_2\;(m, i)) = \mathit{fmap}\;\mathit{prepare}_2\;(\mathit{stepi}\;(m,i))

and therefore

\displaystyle  \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2 = \mathit{unfoldr}\;\mathit{stepi}

Note that the right-hand side will eventually lead to intervals that exceed the unit interval. When {j \supseteq i}, it follows that {\mathit{unit} \supseteq j \mathbin{\triangleleft} i}; but the unfolding process keeps widening the interval without bound, so it will necessarily eventually exceed the unit bounds. We return to this point shortly.

We have therefore concluded that

\displaystyle  \begin{array}{@{}lcl} \mathit{decode}_1\;m &=& \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}\;m \cdot \mathit{foldl}\;\mathit{focus}\;\mathit{unit} \\ &=& \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2 \cdot \mathit{prepare}_1\;m \cdot \mathit{foldl}\;\mathit{focus}\;\mathit{unit} \\ &=& \mathit{unfoldr}\;\mathit{stepi} \cdot \mathit{foldl}\;\mathit{mfocus}\;(m,\mathit{unit}) \end{array}
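To see this form of decoding run, here is a minimal Haskell sketch over a toy static model. Everything about the model is an assumption made for illustration: Model is taken to be an association list from symbols to intervals, toyModel assigns 'a', 'b' and 'c' the intervals (0, 1/2), (1/2, 3/4) and (3/4, 1), and newModel leaves the model unchanged. The helper scale and the concrete definitions of weight, narrow and widen are likewise assumed.

```haskell
import Data.List (unfoldr)

type Interval = (Rational, Rational)
type Bit      = Integer
type Symbol   = Char
type Model    = [(Symbol, Interval)]   -- toy static model (assumption)

unit :: Interval
unit = (0, 1)

weight, scale :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x
scale  (l, r) x = (x - l) / (r - l)

narrow, widen :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)
widen  i (p, q) = (scale i p, scale i q)

toyModel :: Model
toyModel = [('a', (0, 1/2)), ('b', (1/2, 3/4)), ('c', (3/4, 1))]

encodeSym :: Model -> Symbol -> Interval
encodeSym m s = case lookup s m of
  Just i  -> i
  Nothing -> error "encodeSym: unknown symbol"

decodeSym :: Model -> Rational -> Symbol
decodeSym m x = case [s | (s, (l, r)) <- m, l <= x, x < r] of
  s : _ -> s
  []    -> error "decodeSym: value outside the unit interval"

newModel :: Model -> Symbol -> Model
newModel m _ = m   -- static model: never adapts

half :: Bit -> Interval
half 0 = (0, 1/2)
half _ = (1/2, 1)

mfocus :: (Model, Interval) -> Bit -> (Model, Interval)
mfocus (m, i) b = (m, narrow i (half b))

stepi :: (Model, Interval) -> Maybe (Symbol, (Model, Interval))
stepi (m, i) = Just (s, (newModel m s, widen (encodeSym m s) i))
  where s = decodeSym m (weight i (1/2))

decode1 :: Model -> [Bit] -> [Symbol]
decode1 m = unfoldr stepi . foldl mfocus (m, unit)
```

With this model the text "ab" encodes to the bits 0, 1, 0, and take 5 (decode1 toyModel [0, 1, 0]) gives "abbaa": the encoded text followed by further symbols, since stepi is always productive. The intervals along the way are (1/4, 3/8), (1/2, 3/4), (0, 1), (-2, 2), (-4, 4), which shows the widening beyond the unit interval mentioned above.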

Now we need to check the streaming condition for {\mathit{mfocus}} and {\mathit{stepi}}. Unfortunately, this is never going to hold: {\mathit{stepi}} is always productive, so {\mathit{stream}\;\mathit{stepi}\;\mathit{mfocus}} will only take production steps and never consume any input. The problem is that {\mathit{unfoldr}\;\mathit{stepi}} is too aggressive, and we need to use the more cautious flushing version of streaming instead. Informally, the streaming process should be productive from a given state {(m,i)} only when the whole of interval {i} maps to the same symbol in model {m}, so that however {i} is focussed by subsequent inputs, that symbol cannot be invalidated.

More formally, note that

\displaystyle  \mathit{unfoldr}\;\mathit{stepi} = \mathit{apo}\;\mathit{safestepi}\;(\mathit{unfoldr}\;\mathit{stepi})


\displaystyle  \begin{array}{@{}l} \mathit{safestepi} :: (\mathit{Model}, \mathit{Interval}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model}, \mathit{Interval})) \\ \mathit{safestepi}\;(m,i) \\ \begin{array}[t]{@{\quad}clcl} | & \mathit{safe}\;(m,i) &=& \mathit{stepi}\;(m,i) \\ | & \mathbf{otherwise} &=& \mathit{Nothing} \end{array} \end{array}


\displaystyle  \begin{array}{@{}l} \mathit{safe} :: (\mathit{Model},\mathit{Interval}) \rightarrow \mathit{Bool} \\ \mathit{safe}\;(m, i) = \mathit{encodeSym}\;m\;s \supseteq i \quad\mathbf{where}\; s = \mathit{decodeSym}\;m\;(\mathit{midpoint}\;i) \end{array}

That is, the interval {i} is “safe” for model {m} if it is fully included in the encoding of some symbol {s}; then all elements of {i} decode to {s}. Then, and only then, we may commit to outputting {s}, because no further input bits could lead to a different first output symbol.

Note now that the interval remains bounded by the unit interval during the streaming phase, because of the safety check in {\mathit{safestepi}}, although it will still exceed the unit interval during the flushing phase. However, at this point we can undo the fusion we performed earlier, “fissioning” {\mathit{unfoldr}\;\mathit{stepi}} into {\mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2} again: this manipulates rationals rather than intervals, so there is no problem with intervals getting too wide. We therefore have:

\displaystyle  \mathit{decode}_1\;m = \mathit{apo}\;\mathit{safestepi}\;(\mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2) \cdot \mathit{foldl}\;\mathit{mfocus}\;(m,\mathit{unit})

Now let us check the streaming condition for {\mathit{mfocus}} and the more cautious {\mathit{safestepi}}. Suppose that {(m,i)} is a productive state, so that {\mathit{safe}\;(m,i)} holds, that is, all of interval {i} is mapped to the same symbol in {m}, and let

\displaystyle  \begin{array}{@{}lcl} s &=& \mathit{decodeSym}\;m\;(\mathit{midpoint}\;i) \\ m' &=& \mathit{newModel}\;m\;s \\ i' &=& \mathit{encodeSym}\;m\;s \mathbin{\triangleleft} i \end{array}

so that {\mathit{safestepi}\;(m,i) = \mathit{Just}\;(s, (m',i'))}. Consuming the next input {b} leads to state {(m, \mathit{focus}\;i\;b)}. This too is a productive state, because {i \supseteq \mathit{focus}\;i\;b} for any {b}, and so the whole of the focussed interval is also mapped to the same symbol {s} in the model. In particular, the midpoint of {\mathit{focus}\;i\;b} is within interval {i}, and so the first symbol produced from the state after consumption coincides with the symbol {s} produced from the state before consumption. That is,

\displaystyle  \mathit{safestepi}\;(\mathit{mfocus}\;(m,i)\;b) = \mathit{Just}\;(s, \mathit{mfocus}\;(m', i')\;b)

as required. We can therefore rewrite decoding as a flushing stream computation:

\displaystyle  \begin{array}{@{}l} \mathit{decode}_2 :: \mathit{Model} \rightarrow [\mathit{Bit}] \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_2\;m = \mathit{fstream}\;\mathit{safestepi}\;(\mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2)\;\mathit{mfocus}\;(m,\mathit{unit}) \end{array}

That is, initial symbols are output as soon as they are completely determined, even before all the input bits have been read. This agrees with {\mathit{decode}_1} on finite bit sequences.
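The flushing stream can be sketched in Haskell along the same lines. The definition of fstream is the usual one (like stream, but handing the final state to a flushing function when the input runs out) and is restated here as an assumption; step is reconstructed from the fusion condition above; includes is a hypothetical name for interval containment; and the toy static model, with 'a', 'b' and 'c' assigned (0, 1/2), (1/2, 3/4) and (3/4, 1), is purely illustrative.

```haskell
import Data.List (unfoldr)

type Interval = (Rational, Rational)
type Bit      = Integer
type Symbol   = Char
type Model    = [(Symbol, Interval)]   -- toy static model (assumption)

unit :: Interval
unit = (0, 1)

weight, scale :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x
scale  (l, r) x = (x - l) / (r - l)

narrow, widen :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)
widen  i (p, q) = (scale i p, scale i q)

-- Hypothetical name for interval containment (the superset relation).
includes :: Interval -> Interval -> Bool
includes (l, r) (p, q) = l <= p && q <= r

toyModel :: Model
toyModel = [('a', (0, 1/2)), ('b', (1/2, 3/4)), ('c', (3/4, 1))]

encodeSym :: Model -> Symbol -> Interval
encodeSym m s = case lookup s m of
  Just i  -> i
  Nothing -> error "encodeSym: unknown symbol"

decodeSym :: Model -> Rational -> Symbol
decodeSym m x = case [s | (s, (l, r)) <- m, l <= x, x < r] of
  s : _ -> s
  []    -> error "decodeSym: value outside the unit interval"

newModel :: Model -> Symbol -> Model
newModel m _ = m   -- static model: never adapts

half :: Bit -> Interval
half 0 = (0, 1/2)
half _ = (1/2, 1)

mfocus :: (Model, Interval) -> Bit -> (Model, Interval)
mfocus (m, i) b = (m, narrow i (half b))

-- Reconstructed from the fusion condition; works on rationals, not intervals.
step :: (Model, Rational) -> Maybe (Symbol, (Model, Rational))
step (m, x) = Just (s, (newModel m s, scale (encodeSym m s) x))
  where s = decodeSym m x

prepare2 :: (Model, Interval) -> (Model, Rational)
prepare2 (m, i) = (m, weight i (1/2))

stepi :: (Model, Interval) -> Maybe (Symbol, (Model, Interval))
stepi (m, i) = Just (s, (newModel m s, widen (encodeSym m s) i))
  where s = decodeSym m (weight i (1/2))

safe :: (Model, Interval) -> Bool
safe (m, i) = encodeSym m s `includes` i
  where s = decodeSym m (weight i (1/2))

safestepi :: (Model, Interval) -> Maybe (Symbol, (Model, Interval))
safestepi st
  | safe st   = stepi st
  | otherwise = Nothing

-- Like stream, but flushes the final state with h when the input is exhausted.
fstream :: (b -> Maybe (c, b)) -> (b -> [c]) -> (b -> a -> b) -> b -> [a] -> [c]
fstream g h f b xs = case g b of
  Just (c, b') -> c : fstream g h f b' xs
  Nothing      -> case xs of
    a : xs' -> fstream g h f (f b a) xs'
    []      -> h b

decode2 :: Model -> [Bit] -> [Symbol]
decode2 m = fstream safestepi (unfoldr step . prepare2) mfocus (m, unit)
```

On the bits 0, 1, 0 the decoder emits 'a' after the first bit and 'b' after the third, then flushes; take 6 of the result is "abbaaa". On the infinite input cycle [0] it emits 'a' after every bit, so take 3 of the result is "aaa", which the batch decoder could never deliver.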

Fixed-precision arithmetic

We will leave arithmetic coding at this point. There is actually still quite a bit more arithmetic required: in particular, for competitive performance it is important to use only fixed-precision arithmetic, restricting attention to rationals within the unit interval with denominator {2^k} for some fixed {k}. In order to be able to multiply two numerators using 32-bit integer arithmetic without the risk of overflow, we can have at most {k=15}. Interval narrowing now needs to be approximate, rounding down both endpoints to integer multiples of {2^{-k}}. Care needs to be taken so that this rounding never makes the two endpoints of an interval coincide. Still, encoding can be written as an instance of {\mathit{stream}}. Decoding appears to be more difficult: the approximate arithmetic means that we no longer have interval widening as an exact inverse of narrowing, so the approach above no longer works. Instead, our 2002 lecture notes introduce a “destreaming” operator that simulates and inverts streaming: the decoder works in sympathy with the encoder, performing essentially the same interval arithmetic but doing the opposite conversions. Perhaps I will return to complete that story some time…

by jeremygibbons at December 11, 2017 01:32 PM

Philip Wadler

Simplicity and Michelson


Only once in my life have I encountered a programming language that was too simple to use. That was Lispkit Lisp, developed by Peter Henderson, Geraint Jones, and Simon Jones, which I saw while serving as a postdoc at Oxford, 1983–87, and which despite its simplicity was used to implement an entire operating system. It is an indictment of the field of programming languages that I have not since encountered another system that I consider too simple. Until today. I can now add a second system to the list of those that are too simple, the appropriately-titled Simplicity, developed by Russell O'Connor of Blockstream. It is described by a paper here and a website here.
The core of Simplicity consists of just nine combinators: three for products (pair, take, and drop), three for sums (injl, injr, and case), one for unit (unit), and two for plumbing (iden and comp). It is thoroughly grounded in ideas from the functional programming, programming language, and formal methods communities.
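To give a feel for how little is there, the nine combinators can be modelled as ordinary Haskell functions. This is only a rough sketch of one reading of their typing rules, not the paper's formal semantics: the name caseS stands in for case (a Haskell keyword), and notS is an illustrative example rather than code taken from the paper.

```haskell
import Prelude hiding (take, drop)

-- Plumbing
iden :: a -> a
iden = id

comp :: (a -> b) -> (b -> c) -> a -> c
comp s t = t . s

-- Unit
unit :: a -> ()
unit _ = ()

-- Sums
injl :: (a -> b) -> a -> Either b c
injl t = Left . t

injr :: (a -> c) -> a -> Either b c
injr t = Right . t

-- 'case' is reserved in Haskell, so the combinator is renamed here.
caseS :: ((a, c) -> d) -> ((b, c) -> d) -> (Either a b, c) -> d
caseS s _ (Left a,  c) = s (a, c)
caseS _ t (Right b, c) = t (b, c)

-- Products
pair :: (a -> b) -> (a -> c) -> a -> (b, c)
pair s t a = (s a, t a)

take :: (a -> c) -> (a, b) -> c
take t (a, _) = t a

drop :: (b -> c) -> (a, b) -> c
drop t (_, b) = t b

-- A bit is a sum of two units; negation swaps the two injections.
type Bit = Either () ()

notS :: Bit -> Bit
notS = comp (pair iden unit) (caseS (injr unit) (injl unit))
```

Here notS (Left ()) evaluates to Right () and vice versa. Even negating a single bit takes seven combinator occurrences, which hints at why a 256-bit adder runs to tens of thousands.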
When I call Simplicity too simple it is intended as a compliment. It is delightful to see full adders and cryptographic hash functions cobbled together using just products, sums, and units. It is eye-opening to see how far one can get without recursion or iteration, and how this enables simple analyses of the time and space required to execute a program. It is a confirmation to see a system with foundations in category theory and sequent calculus. Now I know what to say when developers respond to my talk "Categories for the Working Hacker" by asking "But how can we use this in practice?"
The system is accompanied by a proof of its correctness in Coq, which sets a high bar for competing systems. O'Connor even claims to have a proof in Coq that the Simplicity implementation of SHA-256 matches the reference specification provided by Andrew Appel's Verified Software Toolchain project (VST), which VST proved corresponds to the OpenSSL implementation of SHA-256 in C.
At IOHK, I have been involved in the design of Plutus Core, our own smart contract scripting language, working with Darryl McAdams, Duncan Coutts, Simon Thompson, Pablo Lamela Seijas, and Grigore Rosu and his semantics team. We have a formal specification which we are preparing for release. O'Connor's work on Simplicity has caused us to rethink our own work: what can we do to make it simpler? Thank you, Russell!
That said, Simplicity is still too simple, and despite its emphasis on rigour there are some gaps in its description.


A 256-bit full adder is expressed with 27,348 combinators, meaning addition in Simplicity requires several orders of magnitude more work than the four 64-bit addition instructions one would normally use. Simplicity proposes a solution: any commonly used sequence of instructions may be abbreviated as a "jet", and implemented in any equivalent manner. Hence, the 27,348 combinators for the 256-bit full adder can be ignored, and replaced by the equivalent four 64-bit additions.
All well and good, but this is where it gets too simple. No one can afford to be inefficient by several orders of magnitude. Hence, any programmer will need to know what jets exist and to exploit them whenever possible. In this sense, Simplicity is misleadingly simple. It would be clearer and cleaner to define each jet as an opcode. Each opcode could still be specified by its equivalent in the other combinators of Simplicity, but programs would be more compact, faster to execute, and—most important—easier to read, understand, and analyse accurately. If one ignores jets, the analyses of time and space required to execute a program, given toward the end of the paper, will be useless—off by orders of magnitude. The list of defined jets is given nowhere in the paper. Nor could I spot additional information on Simplicity linked to from its web page or findable by a web search. More needs to be done before Simplicity can be used in practice.


It's not just the definition of jets which is absent from the paper, and cannot be found elsewhere on the web. Lots more remains to be supplied.
  • Sections 2.4, 2.5, 3.2 claim proofs in Coq, but apart from defining the semantics of the nine combinators in Appendix A, no Coq code is available for scrutiny.
  • Section 2.5 claims a representation of Simplicity terms as a dag, but it is not specified. Lacking this, there is no standard way to exchange code written in Simplicity.
  • Section 4.4 defines an extended semantics for Simplicity that can read the signature of the current transaction, support Merklised abstract syntax trees, and fail when a transaction does not validate. It also lifts meanings of core (unextended) Simplicity programs to the extended semantics. However, it says nothing about how the seven combinators that combine smaller Simplicity programs into bigger ones act in the extended semantics! It's not hard to guess the intended definitions, but worrying that they were omitted from a paper that aims for rigour.
  • Section 3 provides a Bit Machine to model the space and time required to execute Simplicity. The model is of limited use, since it ignores the several orders of magnitude improvement offered by jets. Further, the Bit Machine has ten instructions, enumerated on pages 10–12, but the list omits the vital "case" instruction which appears in Figure 2. Again, it's not hard to guess, but worrying it was omitted.


A second language for scripting blockchains is Michelson. It is described by a paper here and a website here. (Oddly, the website fails to link to the paper.)
I will offer just one word on Michelson. The word is: "Why?"
Michelson takes many ideas from the functional programming community, including higher-order functions, data structures such as lists and maps, and static type safety. Currently, it is also much more thoroughly described and documented than Simplicity. All of this is to be commended.
But Michelson is an inexplicably low-level language, requiring the programmer to explicitly manipulate a stack. Perhaps this was done so that there is an obvious machine model, but Simplicity offers a far superior solution: a high-level model for programming, which compiles to a low-level model (the Bit Machine) to explicate time and space costs.
Or perhaps Michelson is low-level to improve efficiency. Most of the cost of evaluating a smart contract is in cryptographic primitives. The rest is cheap, whether compiled or interpreted. Saving a few pennies of electricity by adopting an error prone language—where there is a risk of losing millions of dollars in an exploit—is a false economy indeed. Premature optimisation is the root of all evil.
The language looks a bit like all the bad parts of Forth and Lisp, without the unity that makes each of those languages a classic. Lisp idioms such as CAAR and CDADAR are retained, with new ones like DUUP, DIIIIP, and PAAIAIAAIR thrown in.
There is a fair set of built-in datatypes, including strings, signed and unsigned integers, unit, product, sum, options, lists, sets, maps, and higher-order functions. But there is no way for users to define their own data types. There is no way to name a variable or a routine; everything must be accessed by navigating a data structure on the stack.
Some operations are specified formally, but others are left informal. For lists, we are given formal rewriting rules for the first three operators (CONS, NIL, IF_CONS) but not the last two (MAP, REDUCE). Type rules are given in detail, but the process of type inference is not described, leaving me with some questions about which programs are well typed and which are not. It reminds me of a standard problem one sees in early work by students—the easy parts are thoroughly described, but the hard parts are glossed over.
If I have understood correctly, the inference rules assign types that are monomorphic, meaning each term has exactly one type. This omits one of the most successful ideas in functional programming, polymorphic routines that act on many types. It means back to the bad old days of Pascal, where one has to write one routine to sort a list of integers and a different routine to sort a list of strings.
Several of these shortcomings are also shared by Simplicity. But whereas Simplicity is intended as a compilation target, not to be read by humans, the Michelson documentation includes a large collection of examples suggesting it is intended for humans to write and read.
Here is one of the simpler examples from the paper.
  { DUP ; CDAAR ; # T
IF { DUP ; CDADR ; # N
{ DUP ; CDDDR ; # B
PAIR } }
{ DUP ; CDDAR ; # A
PAIR } }
The comment # T is inserted as a reminder that CDAAR extracts variable T, and similarly for the other variables N, B, and A. This isn't the 1950s. Why don't we write T when we mean T, instead of CDAAR? WHY ARE WE WRITING IN ALL CAPS?
In short, Michelson is a bizarre mix of some of the best and worst of computing.


It is exciting to see ideas from the functional programming, programming languages, and formal methods communities gaining traction among cryptocurrencies and blockchains. While there are shortcomings, it is fantastic to see an appreciation of how these techniques can be applied to increase reliability—something which the multi-million dollar exploits against Ethereum show is badly needed. I look forward to participating in the conversations that ensue!


The conversation has begun! Tezos have put up a page to explain Why Michelson. I've also learned there is a higher-level language intended to compile into Michelson, called Liquidity.

by Philip Wadler at December 11, 2017 09:37 AM

December 09, 2017

Mark Jason Dominus

Legal Nerdsniping

The Volokh Conspiracy is a frequently-updated blog about legal issues. It reports on interesting upcoming court cases and recent court decisions and sometimes carries thoughtful and complex essays on legal theory. It is hosted by, but not otherwise affiliated with, the Washington Post.

Volokh periodically carries a “roundup of recent federal court decisions”, each with an intriguing one-paragraph summary and a link to the relevant documents, usually to the opinion itself. I love reading federal circuit court opinions. They are almost always carefully thought out and clearly-written. Even when I disagree with the decision, I almost always concede that the judges have a point. It often happens that I read the decision and say “of course that is how it must be decided, nobody could disagree with that”, and then I read the dissenting opinion and I say exactly the same thing. Then I rub my forehead and feel relieved that I'm not a federal circuit court judge.

This is true of U.S. Supreme Court decisions also. Back when I had more free time I would sometimes visit the listing of all recent decisions and pick out some at random to read. They were almost always really interesting. When you read the newspaper about these decisions, the newspaper always wants to make the issue simple and usually tribal. (“Our readers are on the (Red / Blue) Team, and the (Red / Blue) Team loves mangel-wurzels. Justice Furter voted against mangel-wurzels, that is because he is a very bad man who hates liberty! Rah rah team!”) The actual Supreme Court is almost always better than this.

For example we have Clarence Thomas's wonderful dissent in the case of Gonzales v. Raich. Raich was using marijuana for his personal medical use in California, where medical marijuana had been legal for years. The DEA confiscated and destroyed his supplier's plants. But the Constitution only gives Congress the right to regulate interstate commerce. This marijuana had been grown in California by a Californian, for use in California by a Californian, in accordance with California law, and had never crossed any state line. In a 6–3 decision, the court found that the relevant laws were nevertheless a permitted exercise of Congress's power to regulate commerce. You might have expected Justice Thomas to vote against marijuana. But he did not:

If the majority is to be taken seriously, the Federal Government may now regulate quilting bees, clothes drives, and potluck suppers throughout the 50 States. This makes a mockery of Madison’s assurance to the people of New York that the “powers delegated” to the Federal Government are “few and defined,” while those of the States are “numerous and indefinite.”

Thomas may not be a fan of marijuana, but he is even less a fan of federal overreach and abuse of the Commerce Clause. These nine people are much more complex than the newspapers would have you believe.

But I am digressing. Back to Volokh's federal court roundups. I have to be careful not to look at these roundups when I have anything else that must be done, because I inevitably get nerdsniped and read several of them. If you enjoy this kind of thing, this is the kind of thing you will enjoy.

I want to give some examples, but can't decide which sound most interesting, so here are three chosen at random from the most recent issue:

  • Warden at Brooklyn, N.Y., prison declines prisoner’s request to keep stuffed animals. A substantial burden on the prisoner’s sincere religious beliefs?

  • Online reviewer pillories Newport Beach accountant. Must Yelp reveal the reviewer’s identity?

  • With no crosswalks nearby, man jaywalks across five-lane avenue, is struck by vehicle. Is the church he was trying to reach negligent for putting its auxiliary parking lot there?

Check it out.

[ Addendum 20171213: Volokh has just left the Washington Post, and moved to Reason, citing changes in the Post's paywall policies. ]

by Mark Dominus at December 09, 2017 08:05 PM

Edward Z. Yang

Systems ML workshop panel

  • JG: Joseph Gonzalez
  • GG: Garth Gibson (CMU)
  • DS: Dawn Song (UC Berkeley)
  • JL: John Langford (Microsoft NY)
  • YJ: Yangqing Jia (Facebook)
  • SB: Sarah Bird
  • M: Moderator
  • A: Audience

M: This workshop is bringing together ML and systems. Can you put your place on that spectrum? Who is your home community?

YJ: Right in the middle. I'd like to move more towards systems side, but Berkeley Parallel Labs kicked me out. ML is my home base.

JL: ML is where I come from, and where I will be, but I'm interested in systems. My home is NIPS and ICML.

DS: My area is AI and security, did computer security in the past, now moving into AI.

GG: Systems.

JG: I started out in ML, working on probabilistic methods. I basically, in middle of PhD, looked at systems. Now I'm moving to being a systems person that does ML.

M: We've seen a proliferation of deep learning / ML frameworks that require a lot of dev effort, money, and time to put in. Question: what is the role of academia in doing research in this area? What kind of large-scale ML can you do?

GG: I liked YJ's answer last time.

YJ: The thing that is astonishing is that academia is the source of so many innovations. With all due respect, we did very good work in Google, but then Alex came out with 2 GPUs and nuked the field. Academia is the amazing place where we find all of the new ideas, and industry scale it out.

JL: Some examples. If you're coming from academia, maybe you don't have research at a big company, but it's an advantage as you will spend time on the right algorithm for solving it efficiently. And that's what will win in the long run. Short term, they'll brute force with AutoML. Long run, the learning algorithms are going to be designed so that they won't have parameters. A common ML paper is "we eliminate this hyperparameter". When they're more automatic, more efficient, great things will happen. There's an advantage in being resource constrained, as you will solve things in the right way.

Another example is, the study of machine learning tells us that in the future we will regard any model that you just learned and deployed as inherently broken and buggy, as data collection is not part of the process of training and deploying. It will decay and become irrelevant. The overall paradigm of ML where you're interacting with the world, and learning, that can be studied easily in academia, and that has huge implications about how you're going to design systems.

DS: People often talk about in a startup, the best thing is to not raise a ton of money; if you're resource constrained you're more focused and creative. ML is really broad, there's lots of problems. Right now we learn from lots of data, but lots of talks at NIPS, humans have amazing ability to learn from very few examples. These are problems for academia to tackle, given unique resource constraints.

GG: I'll say, it's difficult to concentrate on top accuracy if you don't have enough data, and the data available to students is stuff like DAWNbench which tends to lag. In academia, we build relationships with industry, send students for internships, they get the ability to do big data, while exploring first principles in university. It's a challenge, but open publishing and open sharing of code make the world more bearable.

JG: The one thing I've struggled with is focusing on human resources. I have grad students; good students, focus on a key problem can make a lot of progress. We struggle with a lot of data. Struggle with RL really is here, we can build simulators to build at this scale. Being able to use simulation to get data; be creative, find new and interesting problems.

M: Follow-up on process. I think a lot of you have tried to publish ML in your communities. Are they equipped to appreciate work properly; what is a common reason they don't appreciate.

JG: Publishing ML in systems, or vice versa, is hard. It goes both ways. These communities are not equipped to evaluate work in the other field. ML in systems, where if you saw here, it was surprising. Or vice versa, wouldn't have done well in systems venue as systems. The failure mode I see, is systems community doesn't appreciate extreme complexity. In ML, I have this very sophisticated thing, and reducing them to their essential components. ML tries to overextend their complexity as an innovation. More broadly, each of these communities has their own biases how they look at research. One thing I've noticed, it's gotten better. Systems is better at evaluating, and at this workshop, people are pushing research in an advanced way.

GG: I'm old, so I've seen the creation of conferences before. So, you start off with an overlap of areas. In my prior life, it was the notion of storage as a research area, rather than an application of devices. You start off, send a submission in. The PC has two people that know anything about it, and they aren't assigned, and the reviews are sloppy, and you get one conference that does a little better, but other conferences don't read it. I faced this with fault tolerance, database, OS communities, they don't read each other's stuff. You get enough mass, get a conference that focuses in the middle; reviewing and PC that have seen most of the good work in the area. That's hard, but we're on the edge of doing it in SysML. We're doing the right thing to do competitive, on top of state of the art.

M: Is that the only solution, or can we mix up PCs?

GG: I've seen a lot of experiments to try it. You can end up with permanently fractured communities.

JL: Joey and Dawn are area chairs at ICML. I have found the ML community to be friendly to systems-type things. There's an area chair for systems. Hopefully papers get assigned appropriately.

M: We're not good about that at systems.

DS: About ML and security, we have this problem. In security, we also have very small percentage of ML, and the committee, if you submit ML, it's very hard to find people who can review the paper, and as a consequence, the review quality varies highly. Similar in terms of security in ML, similar problems. It's interesting to think about why this happens and how to solve the problem. In general, sometimes the most interesting work is the interdisciplinary areas. ML and systems, security, and examples I see, including machine learning in systems... so, one thing I actually can understand is, within each community, even though the review quality varies, I can see from committee's perspective, really what they want is papers that are more meaningful to community, help people get exposed to this new area, fostering new exploration. That's part of natural progression. As time goes on, there's more cross-pollination.

JG: We are launching a SysML conference. I had a little bit of reservations: ML is getting better at systems, but now I have to decide where I'm going to send a paper. A lot of papers we see in ML is going to have systems.

GG: When you have a new conference area, not all work is sent there. Overlapping, you have a favorite conference, your heroes, and you'll send your most exciting work to that root conference. No problem.

YJ: SysML is great, and this is how it comes out. New fields, it warrants new conferences.

M: Do you think ML expert needs to also be a systems expert? Does such a person who lies at that intersection have a different way of looking? Or you come up with a nice algorithm, and you

JL: It's not OK to have a wall.

There are many ways learning algorithms can be changed. The problem with having a wall, if you don't understand, throw engineer. But if you can bridge to understand, they're not artifacts, you can break open and modify. That can let you achieve much better solutions.

GG: Agreed, but what happens initially is you reach over to other side, you put it into system, and it's my innovation that redundancy makes fault tolerance, even though it's fairly pedestrian from the other side. If it is a substantial improvement, it is worth doing. We all grow up.

JG: We need a wall, but we're going to constantly tear it down. Matlab in grad school, we made jokes about it, and MKL community would make it fast. Then they said we are going to build ML for distributed computing algorithms, and ML would write class algorithms for system. That waned in the dev of pytorch, TF, etc., which leveled up abstraction. The stack is building up again; systems community to make more efficient. Well, fp could change, and that could affect algorithm. So we're tearing it down again. But systems is about designing the wall.

YJ: It's more like a bar stool. It's a barrier, but we don't have to be both to do anything, but you need it to make it efficient. A story: a training system we looked at, SGD. That person found a very nicely rounded number: 100. But people frown, you should round to 128. Understanding and improving the common core for CS and engineering, that helps a lot for people to have good sense for how to design ML algorithms.

M: There's a lot of talk about democratizing AI, and all of you have helped that process. What is a truly democratic AI landscape look like, and how far are we from that world.

YJ: I plead guilty in participating in framework wars. When reading CS history, one thing that's pretty natural, when field is starting, there's all sorts of standards, protocols. FTP, Gopher, and now in the end HTTP took over, and everything runs on HTTP. Right now, there's all kinds of different abstractions; boiling it down, everyone is doing computation graph, optimization. I look forward to when we have one really nice graph representation, protocol for optimizing graphs. It's not a rosy dream, because in compilers we have that solution, LLVM. I don't know if we'll reach that state but I think one day we'll get there.

JL: You have AI/ML democratized when anyone can use it. What does that mean, a programmer has a library, or language constructs, that they use routinely and easily; no issues of data getting mismatched or confused or biased. All the bugs people worry about in data science; those are removed from the system because the system is designed right and easy to use. The level beyond that is when somebody is using a system, that system is learning to adapt to you. There's huge room for improvement in how people interact. I don't know how often there's a rewrite rule driving me crazy; why can't it rewrite the way I want. People can signal info to a learning algorithm, and when those can be used effectively to assist people, you have democratized AI.

DS: I have a very different view of democratizing AI. I think it's interesting to think about what democratization here really means. For systems people, it's about making it easier for people to do learning, to use these libraries, platforms. But that's really just providing them with tools. For me, I give talks on democratizing AI, we are looking at it from a completely different perspective. Code: even, whoever controls AI will control the world. So who controls AI? Even if you give everyone the tools, push a button, but they don't have the data to do the training. So who controls the AI today, and tomorrow? It's Facebook, Microsoft, Google... so for me, democratization means something totally different. Today, they collect data, train models, and they control who has access to the model, and users can get recommendations, but not direct access to models. We have a project to actually democratize AI, where users can control their data. Combining blockchain and AI, where users can donate their data to a smart contract, where the smart contract will specify the terms; e.g., if you train a model, the user can use the model, and if the model produces profits, the user can get part of the profits. The smart contract can specify various incentive terms; e.g., if the data is better than others, they can get more profits, and other mechanisms. A developer will supply the ML training algorithm, and get benefits when it is trained well. We are decentralizing the power of AI; users will be able to get direct access to models and use them. In this case, I hope for an alternate future, where big companies can continue with business, but users by pooling their data in a decentralized fashion, will see actual true democratization of AI; they will access the power of AI. Not just use tools.


GG: I think that a lot of what's meant in democratizing AI is how can you move from a small number of people innovating, to a large number. Tool development and standards. We're close to being there. There was an example in the past, was VLSI paint boxes. Up until a certain point, only an EE could really develop hardware at all. They took a lot of effort and time to make sure it could make it through every part without very much crosstalk. A group came together and thought, well, there are some design rules. This lets you build hardware pretty easily. I could paint green/red boxes, hardware months later, worked. It never worked as fast as that EE guy, so there would always be a place for it, but it would let us build a RISC computer, and ship it. We were in the game, we could innovate, and do it. The tools we're trying to build right now can build on statistical.

JG: When I started PhD, we did integrals and derivatives by hand. Automatic differentiation was a huge step forward. I blame that for the explosion of papers. A first year can build something far more complex than what I could do. That's moving AI forward, on algorithms side.

The data side is interesting, and that is one where I think about in systems. There's a lot of opportunities to think about how security interacts, leveraging hardware to protect it, markets to sell/buy data from sources, and protect the data across a lot of places. I would argue we're making a substantial amount of progress in how we think about algorithms.

M: When I think about democratizing pervasive AI, recent questions that have been consuming our minds, interpretability, fairness, etc. Can you share... any experience where things like interpretability came up and became a problem, issue, do we have to worry about a lot more in ML, or systems-ML.

JG: My grad students come to me and say the models stop working. I don't know how to fix that; the process is very experimental. Tracking experiments is a big part of the process. We cared a lot about interpretable models, and that meant something very particular. Now it's explainable; we don't need to know what it did exactly, but there needs to be some connection to what we did. Interpretable, explain computation, it could be related or unrelated to the decision. That's two answers about explainability, and how we debug these systems.

GG: SOSP just happened, and they have ten years of... good copies of everything they submitted. At the end of the conference, Peter Chen took all the PDF files, and did a naive bayes classifier, and saw how well he would predict that it would be accepted. And half the things it predicted to be accepted, would be accepted.

So what did they do? They made a detector for popular authors. And so what you did is those who had succeeded, they will follow behind. I recognize this problem. You might think that you found a good way, but it's actually Nickolai Zeldovich's paper.

DS: There's a big debate. Some think it's really important, and sometimes, as long as the model works, it's fine. Our brain, we can't really explain how we arrive at certain decisions, but it works fine. And it depends on application. Some applications have stronger requirements for explainability; e.g., law and healthcare, whereas in others it's less required. Also as a whole community, there's a lot we don't understand. We can talk about causality, transparency, all related. As a whole community, we don't really understand what explainability means. Not a good definition. All these concepts are related, we're trying to figure out what's the real core. That's a really good open question.

JL: There's two different interpretations. Can you explain to a person? And that's limited; there's no explainable vision models. The other definition is debuggability. If you want to create complex systems, they need to be debuggable. This is nontrivial with a distributed system, it's nontrivial with ML. If you want to create nontrivial ML systems, you have to figure out why they're not behaving the way you want them to.

DS: Do we debug our brains?

JL: Evolution has done this the hard way for a very long time... a lot of people have bugs in their brains. I know I have bugs. I get an ocular migraine sometimes... very annoying. No, we don't debug our brains, and it's a problem.

YJ: I'm sure there's bugs in my brain; I chased chickens in my grandma's house; the chicken has one spot in its back that if you press it, it just ducks and sits there. It shuts off because of fear. We humans don't do that. But these bugs are in our brain as well. Chasing for interpretability helps understand how things work. The old days, deep dream; this line of work started with figuring out what the gradients do, and we propagated back, and we found that direct gradient doesn't work; then we added L1 priors, and then we got pictures. This curiosity has led to the fact that convnets with random weights are codifying the local correlation; we are hardcoding the structured info in CNNs which we didn't know before. So maybe we will not achieve full interpretability, but some amount of interpretability and creativity will help.

(audience questions)

A: I'd really like to hear what Jeff said about ML for systems. As systems, I'm interested in it, but people have said, you can get far with heuristics.

JL: I think it's exciting.

GG: The index databases, when I read it for reviewing, I went, "Wow! Is that possible?" I think things like that will change the way we do systems. The novelty of the application opens a lot of people's minds. Right now we think of the machine learning tools as being expensive things that repeat what humans do easily that computers don't do well. But that's not what DB index is. We can execute it, but we're not better. But to get it half the size and twice the speed, throwing in another way of thinking about compression through a predictor is a fabulous insight.

JG: I tried to publish in this area for a while. For a while, systems didn't like the idea of complex algorithms in the middle of their system. Now, these days, Systems is like, "ML is cool." But it's easier to have success where a good prediction improves the system, but a bad prediction doesn't break the system. So scheduling, that's good. Where models can boost performance but not hurt. The work in ML to solve systems is successful.

DS: ML for systems is super exciting. I'm personally very excited about this domain, esp. for people who have done systems work, and are interested in AI. ML for systems is an amazing domain of ML. I wouldn't be surprised, I would hope to see, in five years, our systems are more ML driven. A lot of systems have a lot of knobs to tune, trial and error setting, where exactly ML can help. On these amazing techniques, RL, bandits, instead of using bandits to serve ads, we can try to autotune systems. Just like we are seeing AI transforming a lot of application domains, and a lot more intelligent system, old systems, the one we built, should be more intelligent. It's a prediction: I think we are going to see a lot of work in this domain. I think it will transform systems.
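
To make the "bandits to autotune systems" idea concrete, here is a minimal sketch of an epsilon-greedy bandit choosing a knob setting. It is an illustration only, not something any panelist described: the knob (a thread count), the simulated latency function and all constants are assumptions.

```python
import random

def autotune(settings, measure, rounds=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: find the knob setting with the lowest mean cost."""
    rng = random.Random(seed)
    counts = {s: 0 for s in settings}
    means = {s: 0.0 for s in settings}
    for _ in range(rounds):
        untried = [s for s in settings if counts[s] == 0]
        if untried:
            choice = untried[0]                    # try every setting once
        elif rng.random() < epsilon:
            choice = rng.choice(settings)          # explore a random setting
        else:
            choice = min(settings, key=means.get)  # exploit the best so far
        cost = measure(choice, rng)
        counts[choice] += 1
        means[choice] += (cost - means[choice]) / counts[choice]  # running mean
    return min(settings, key=means.get), means

def simulated_latency(threads, rng):
    # Hypothetical workload: latency is lowest at 8 threads, plus Gaussian noise.
    return abs(threads - 8) + 5.0 + rng.gauss(0.0, 1.0)

best, estimates = autotune([1, 2, 4, 8, 16, 32], simulated_latency)
```

In a real system `measure` would run the workload with the chosen setting and report an observed cost, which is where the data-efficiency concerns raised in this discussion come in.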

M: I work in this quite a bit. We have some successes with bandits in some settings, but there are settings that are really tough: stateful, choices, decisions influence the future, it makes it hard to apply RL, or the RL techniques take a lot of data. There are challenges, but there are successes. There are a lot of papers that apply RL in caching, resource allocation. The real question is why it's not used in production? I don't know if we have an answer to that, papers do it, it seems to be really good, but it's not that mainstream, esp. having RL all over the place. Why isn't it pervasive. That I don't see.

A: Isn't it because it's not verifiable. You want some kind of verification analysis.

GG: It's called a regression sweep. If you deploy on a lot of systems. There's a lot of money, it has to work. If it falls over, that's a lawsuit. I hired a VP of software. OK, now that I'm in charge, things are going to slow down. Every LoC is bugs, if I want low bug, I stop programmers from writing code, by making the bar very high. This is the thing Joey was talking about; they need a really compelling reason with no downsides, and then they have to pass tests before they pass. So anything stochastic has a high bar.

SB: Another thing that is happening, there aren't that many people who have understanding in both areas. It's really hard to do ML in systems without deep expertise in systems. You really need to understand to explain it.

GG: It wasn't that long since we didn't have hosted services.

M: Guardrails, you constrain the ML system to not suggest something bad. We have a scenario in MS, machines are unresponsive. How long to wait? You can do it in ML. The choices are reasonable, they're never more than the max you'd want to wait.
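
A guardrail of this kind can be as small as a clamp around the model's output. The sketch below is an assumption about what such a wrapper might look like; the bounds and the choice to fall back to the maximum on a NaN prediction are hypothetical, not details from the panel.

```python
def guarded_wait_seconds(predicted, min_wait=1.0, max_wait=600.0):
    """Clamp a learned wait-time prediction into an operator-approved range."""
    if predicted != predicted:   # NaN from a misbehaving model
        return max_wait          # hypothetical policy: fall back to the old fixed timeout
    return max(min_wait, min(predicted, max_wait))
```

Whatever the model suggests, the caller never waits longer than `max_wait`, so a bad prediction can cost time but cannot produce an unreasonable decision.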

A: On democratization. There's been a lot of talk about optimizing the models so they can bear the cost. Another is decentralizing data... but there's two very big constraints for systems and models. They cost a lot of money, and there's big variance. Because of cost, if some guy gets into programming, and does research, he won't have resources to do it. So they won't go into engineering; they'll intern at Amazon instead. So if there is some community going into lowering the barrier, democratizing, what solution is there to get people much more easily? Because there's huge economic costs. People are trying to make huge amounts of money, startups, but there's no... systems have faults with decentralization... there's just a big problem colliding and ML.

JG: We teach data, I teach data science at Berkeley. The summary is, what about the costs of getting into DL? There's cost to train models, GPUs, data, how do I get a freshman in college who is excited about this, chromebook, they can do research and explore opportunities. At Berkeley we have exactly this problem. I teach 200 students, a lot of them are freshmen, chromebook ipad as primary computer. We've built tools using Azure... we run a cloud in Azure, and on these devices they can experiment with models. They get to use pretrained models and appreciate how to ... Someone built a Russian Twitterbot detector, and saw value and opportunity in those. And then they got involved in research projects where they had more funds and tools.

JL: The right interfaces make a huge difference, because they prevent you from having bugs that prevent you from doing things. Also, DL, is all the rage, but framing the problem is more important than the representation you do. If you have the right problem, and a dumb representation, you'll still do something interesting. Otherwise, it's just not going to work very well at all.

YJ: As industry, don't be afraid of industry and try it out. Back at Berkeley, when Berkeley AI was using GPUs, the requirement was that you have one project per GPU. We students, framed ten different projects, and we just asked for ten GPUs. NVIDIA came to us and asked, what are you doing. We'll just give you 40 GPUs and do research on that. Nowadays, FAIR has residency, and Google AI has residency, all of these things are creating very nice collaborations between industry and academia, and I want to encourage people to try it out. Industry has funds, academia has talent, marrying those together is an everlasting theme.

A: Going back to where do we go forward in terms of conferences, the future of this workshop; has any decision been made, where we go?

SB: This is work in progress. We're interested in feedback and what you think. We've had this workshop evolving for 10 yrs, with NIPS and ICML. Then we did one with SOSP, exciting. We are now doing a separate conference at Stanford in February. We think there's really an important role to play with workshops colocated with NIPS and ICML. We're still planning to continue this series of workshops. There's also a growing amount of systems work in ICML and NIPS, natural expansion to accept that work. The field is growing, and we're going to try several venues, and form a community. If people have ideas.

JG: More people should get involved.

M: We plan to continue this; audience is great, participation is great.

It's a panel, so I have to ask you to predict the future. Tell me something you're really excited... 50-100yrs from now. If you're alive then, I will find you and see if your prediction panned out. Or say what you hope will happen...

YJ: Today we write in Python. Hopefully, we'll write every ML model in one line. Classifier, get a cat.

JL: Right now, people are in a phase where they're getting more and more knobs in learning. ML is all about having less knobs. I believe the ML vision of less knobs. I also believe in democratizing AI. You are constantly turning ... around you, and devs can incorporate learning algorithms into systems. It will be part of tech. It's part of hype cycle. NIPS went through a phase transition. At some point it's gotta go down. When it becomes routine, we're democratizing things.

DS: It's hard to give predictions... I guess, right now, we see ML as an example, we see the waves. Not so long ago, there was the wave of NNs, graphical models, now we're back to NNs. I think... I hope that we... there's a plateauing. Even this year, I have been talking to a lot of great ML researchers, even though one can say there has been more papers written this year, when you hear what people talk about in terms of milestones, many people mentioned milestones from past years. AlexNet, ResNet, ... I do hope that we will see new innovation beyond deep learning. I do teach a DL class, but I hope that we see something beyond DL that can bring us... we need something more, to bring us to the next level.

GG: I'm tempted to point out DL is five years ago, and dotcom era was not more than five years... I think, I'm looking forward to a change in the way CS, science in general, does business, having learned from statistical AI. My favorite one is overfitting. I poorly understood overfitting, in vague stories, until ML hammered what this said. I look forward to the time when students tell me, they stopped writing code, because they were adding parameters... and they added a decent random, iid process for testing code. We're nowhere near there, but I think it's coming.

JG: I'm looking forward to the return of graphical models... actually not. When we're democratizing AI, but what ultimately happens, we're democratizing technology. I can walk up to Alexa and teach it. Or I can teach my Tesla how to park more appropriately. Tech that can adapt to us because it can learn; when I can explain to a computer what I want. (Star Trek but without a transporter.)

by Edward Z. Yang at December 09, 2017 02:17 AM

December 08, 2017

Edward Z. Yang

Accelerating Persistent Neural Networks at Datacenter Scale (Daniel Lo)

The below is a transcript of a talk by Daniel Lo on BrainWave, at the ML Systems Workshop at NIPS'17.

Deploy and serve accelerated DNNs at cloud scale. As we've seen, DNNs have enabled amazing applications. Architectures achieve SoTA on computer vision, language translation and speech recognition. But this is challenging to serve in large-scale interactive because there are latency, cost and power constraints. Also, DNNs are growing larger in size and complexity.

We've seen a Cambrian explosion in startups to solve this problem. Research groups have produced DNN processing units, DPUs, custom hardware solutions to provide high throughput efficient serving of DNNs. We categorize them into two categories: hard DPUs, where the algorithms and applications have to be fixed at design time, because they're fabbing an ASIC, or a soft DPU, FPGA. But for soft DPUs, we haven't seen them deployed at scale.

To address this, we've been working on Project BrainWave. Solution to deploy large scale DNNs with FPGA-acceleration. We've designed it to be fast, flexible and friendly. High throughput, low latency acceleration using FPGAs. Flexibility with adaptive numerical precision, update to latest AI algorithms with reconfigurable FPGAs. And it's user friendly, because we have a full stack solution, compile CNTK/Caffe/TF and compile them down. This is deployed on our configurable cloud, an outer layer of CPUs, a data center that puts everything together, and a layer of reconfigurable FPGAs.

We've deployed DNN models. LSTM model that takes tens to hundreds of milliseconds on CPU. What we see is the 99th percentile for latency; even at 99 we are able to achieve sub-millisecond latencies. When you get to these levels of acceleration, it's negligible in the E2E pipeline.

Next I'll dive into details. It's a full stack solution. Starting with a compiler and runtime that takes models in high level frameworks and compiles them down to our architecture. A flexible ISA for serving DNNs. We have high throughput, low latency serving. We do this all with persistency at scale, to keep models pinned in FPGA memories. Deployed on our wide deployment of Intel FPGAs using hardware microservices.

To begin with, let's talk about hardware microservices. This is something we presented at Micro. The architecture of reconfigurable cloud is FPGAs sit between CPU and network. CPU can use FPGA locally for acceleration, but because FPGAs are connected over network, they can distribute between them. We have a proprietary network protocol for low latency compute.

We've disaggregated FPGA compute plane from CPU. So we can aggregate FPGAs together to form larger accelerators, and you don't have to match the rate of FPGAs to CPUs. You can serve a large number of CPUs with a small cluster of FPGAs, or vice versa.

Next I'll talk about the compiler and runtime. Goal is to make it very easy for ML specialists to do this. The typical ML specialist doesn't know how to program this. Models are developed in high level frameworks, and we compile them down to our architecture. We compile them down first into an intermediate graph-based representation. We split them into portions that run on FPGAs, and portions that run on CPU. When we execute, we also have a runtime that handles orchestration and scheduling between the parts.
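
As a rough illustration of the CPU/FPGA split, the sketch below partitions a model into runs of operators placed on the same device. It is not the BrainWave compiler: it assumes a linear chain of operators instead of a general graph, and the supported-operator set is an assumption drawn from the instruction types named in this talk (with batchnorm left on CPU, as the Q&A notes).

```python
# Assumed set of operator kinds the accelerator can run (hypothetical names).
FPGA_OPS = {"matmul", "conv", "activation", "embedding"}

def partition(ops):
    """Group a linear sequence of (name, kind) ops into maximal same-device runs."""
    segments = []
    for name, kind in ops:
        device = "fpga" if kind in FPGA_OPS else "cpu"
        if segments and segments[-1][0] == device:
            segments[-1][1].append(name)       # extend the current run
        else:
            segments.append((device, [name]))  # start a new run on the other device
    return segments

model = [("embed", "embedding"), ("fc1", "matmul"), ("bn1", "batchnorm"),
         ("fc2", "matmul"), ("relu", "activation")]
plan = partition(model)
```

Each boundary between segments is a point where the runtime would have to move activations between devices, which is why a scheduler is needed alongside the compiler.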

There are two main categories of DNNs we have to optimize for. DNNs that have very high compute to data ratio, convnets, these are well studied. I'm going to focus on the other class of DNNs, those with less compute to data ratio, e.g. dense layers and RNNs.

The conventional approach to accelerating DNNs on FPGAs, you keep all model parameters in DRAM. When a request comes in, you're going to stream the model parameters out of DRAM, and return a response. The issue with this is when you have DNN layers that are memory bandwidth bound, you're limited in how fast you can run this by memory bandwidth; you're not getting full compute capabilities of FPGA. Typically the way to solve this is with batching; you send a number of requests and use the model parameters for all requests. While you may achieve good throughput, latency will increase. For realtime services, this violates your SLA. What we want to do is provide high performance at low or no batching.

The way we do this is with persistent DNNs. FPGAs have lots of memory on chip: 10MB memory. Since they're on chip, it's high bandwidth. So we're going to keep the model parameters on the chip, so that when we get one request in, we distribute it across the entire FPGA chip.

The obvious question is, what happens if your model doesn't fit on chip? We take advantage of the hardware microservices. We'll distribute a single model over multiple FPGAs in the datacenter.
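
The batching trade-off can be made concrete with a back-of-the-envelope model. The sketch below assumes a simple roofline-style cost (service time is the larger of weight streaming and compute) plus the time the first request waits for its batch to fill. Every number in it is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def serve_batch(batch, param_bytes=50e6, flops_per_request=5e7,
                mem_bw=20e9, peak_flops=1e12, arrival_rate=2000.0):
    """Return (worst-case latency in seconds, throughput in requests/second)."""
    fill_wait = (batch - 1) / arrival_rate     # first request waits for the rest
    stream = param_bytes / mem_bw              # weights are read once per batch
    compute = batch * flops_per_request / peak_flops
    service = max(stream, compute)             # memory bound until compute catches up
    return fill_wait + service, batch / service

latency_1, throughput_1 = serve_batch(1)
latency_32, throughput_32 = serve_batch(32)
```

With these made-up figures a batch of 1 takes 2.5 ms, all of it spent streaming weights, while a batch of 32 delivers 32 times the throughput but about 18 ms of latency for the first request in the batch.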
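
A quick sizing sketch for the multi-FPGA case follows. It treats the 10MB figure above as usable capacity per chip and splits a weight matrix by contiguous row ranges; the talk does not say how BrainWave actually partitions a model, so the row-wise split is an assumption, as is the example model size.

```python
import math

ON_CHIP_BYTES = 10 * 1024 * 1024   # "10MB" from the talk, assumed fully usable

def fpgas_needed(num_params, bytes_per_param):
    """Smallest number of chips whose combined on-chip memory holds the model."""
    return math.ceil(num_params * bytes_per_param / ON_CHIP_BYTES)

def shard_rows(num_rows, num_devices):
    """Split matrix rows into contiguous, near-equal ranges, one per device."""
    base, extra = divmod(num_rows, num_devices)
    ranges, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

chips = fpgas_needed(25_000_000, 2)   # hypothetical 25M-parameter model at 16 bits
row_ranges = shard_rows(10, 3)
```

Narrower number formats reduce `bytes_per_param`, so the same model fits on fewer chips.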

Let's look at the architecture and microarchitecture of the processing unit we developed. The BrainWave DPU is a software programmable processor, programmed in single-threaded C, but we've added a number of instructions for serving DNNs, e.g., matrix multiply, convolution, nonlinear activations, embeddings. The processor is designed to use narrow precision format (float16) and easily flexible for extending to newer algorithms.

The microarchitecture of the processor, main portion is dedicated to matrix vector unit; matrix vector multiply, consisting of a number of kernels on a tile of a larger matrix. Tiling gives us flexibility while maintaining performance. Other compute units are multifunction units; vector-vector operations, such as element-wise multiply, add and activation functions. Tying it all together is an on-chip network that lets us keep all the compute together at time.

Most of the chip is dedicated to matrix vector unit. It's composed of hundreds of multilane dot product units. Each of these dot product units consists of tens of adds and muls. To keep them fed with data, each dot product unit is fed by a set of dedicated block RAMs.
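
The tiling idea can be sketched in software: a large matrix-vector product is computed as a sum of small fixed-size tile products, so one fixed-size kernel handles any matrix shape. This is a plain-Python illustration of the general technique, not the hardware design; the tile sizes are arbitrary.

```python
def dot(a, b):
    """Dot product of two equal-length sequences (one 'dot product unit')."""
    return sum(x * y for x, y in zip(a, b))

def tiled_matvec(matrix, vector, tile_rows=2, tile_cols=2):
    """Compute matrix @ vector one tile at a time, accumulating partial sums."""
    rows, cols = len(matrix), len(vector)
    out = [0.0] * rows
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            v_slice = vector[c0:c0 + tile_cols]
            for r in range(r0, min(r0 + tile_rows, rows)):
                # Each tile contributes a partial dot product to its output rows.
                out[r] += dot(matrix[r][c0:c0 + tile_cols], v_slice)
    return out

m = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15]]
v = [1, 0, 2, 0, 1]
result = tiled_matvec(m, v)
```

The matrix dimensions here are deliberately not multiples of the tile size, to show that ragged edge tiles fall out of the slicing.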

Next, I'd like to show performance results for this architecture. Two years ago, we had a deployment of Stratix V FPGAs. It shows the effective teraflops of this format. 16 bit integer.. we've been playing with our own format Microsoft Floating Point. 4.5Tflops at MSFP5.8. These Stratix are pretty old.

(Demo for latest generation of FPGAs)

Looking at throughput oriented DPU, the latency is 65.81ms. With BrainWave, latency is 0.98ms. Under 1 millisecond.

This was done on initial engineering silicon. For production silicon, we're expecting to get 12TOps at 16-bit integer. 90TOps for MSFP8. One question is how numeric precision affects accuracy. Here is the normalized accuracy for three in-house text models, using GRU and LSTM. The orange bar shows what happens when you go to MSFP9, but we've developed a way to fine tune networks for this precision, and you see we recover our accuracy. We're working with MSFP8 and see similar results.
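
The talk does not describe how the MSFP formats are laid out, so the following is not MSFP. It is a generic shared-exponent (block floating point) quantizer, included only to show how narrow formats trade precision for size; the block contents and the 4-bit magnitude are arbitrary assumptions.

```python
import math

def quantize_block(values, mantissa_bits=4):
    """Quantize a block to small signed integers sharing one power-of-two scale."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return [0] * len(values), 0
    exponent = math.frexp(largest)[1]      # largest < 2 ** exponent
    shift = exponent - mantissa_bits       # shared scale is 2 ** shift
    limit = 2 ** mantissa_bits - 1
    scale = 2.0 ** shift
    ints = [max(-limit, min(limit, round(v / scale))) for v in values]
    return ints, shift

def dequantize_block(ints, shift):
    """Recover approximate floats from the integers and the shared scale."""
    return [i * 2.0 ** shift for i in ints]

ints, shift = quantize_block([0.5, -0.25, 0.1, 0.0])
approx = dequantize_block(ints, shift)
```

Small values in a block with a large maximum lose the most precision (0.1 comes back as 0.125 here), which is one reason fine-tuning a network for the target precision can help recover accuracy.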

Project BrainWave is our project for accelerating DNNs at cloud scale. We hope it will be fast, friendly and cloud-scale, and expand capabilities of AI in the cloud, providing a way to run higher dimensional RNN networks for NLP and other great applications. We're planning to release to third parties, stay tuned.

Q: When you decrease batch size, what hardware are you evaluating? Hardware utilization as we decrease?

A: We stay highly utilized even as we decrease batch size; even at high batch size, we're still sending requests one by one. (Only one step will be processed?) Right.

Q: Regarding the FP9 and FP8, nine and eight being the number of bits used? (Yes) Is it in any way related to Flexpoint at Intel?

A: We developed this independently of flexpoint, and I'm not able to talk about our numeric format.

Q: In MS, do you really write Verilog for your FPGA, or do you use high level synthesis tool?

A: For this, we are writing System Verilog

Q: Batchnorm layers, which require batch computation; how do you put that onto the FPGA?

A: Part of the work of the compiler is to do splitting between CPU and FPGA. So things that are not amenable to FPGA, including batchnorm, we're still running them on CPU.

by Edward Z. Yang at December 08, 2017 08:08 PM

MOCHA: Federated Multi-Task Learning (Virginia Smith)

The below is a transcript of a talk by Virginia Smith on MOCHA, at the ML Systems Workshop at NIPS'17.

The motivation for this work is that the way we think about solving ML problems in practice is changing. The typical ML workflow looks like this. You start with a dataset and a problem to solve. Say you want to build a classifier to identify high quality news articles. Next step is to select an ML model to solve the problem. Under the hood, to fit the model to your data, you have to select an optimization algorithm. The goal is to find an optimal model that minimizes some function over your data.

In practice, there's a very important part of the workflow that is missing. For new datasets and interesting systems, the system and its properties play a large role in the optimization algorithm we select to fit the model. To give an example, in the past several years we've seen data that is so large that it must be distributed over multiple machines, in a datacenter environment. I've been thinking about how to perform fast distributed optimization in this setting, when data is so large.

But more and more frequently, data is not coming nicely packaged in datacenter. It's coming from mobile phones, devices, distributed across country and globe. Training ML in this setting is challenging. For one, whereas in datacenter you have hundreds to thousands, here you have millions and billions. Also, in datacenter, devices are similar capability; here, you have phones that are old, low battery, not connected to wifi. This can change ability to perform computation at any given iteration.

Additionally, there's heterogeneity in data itself. For privacy and computation reasons, data can become very unbalanced in the network. And it can be non-IID, so much so that there can be interesting underlying structure to the data at hand. I'm excited because these challenges break down into both systems and statistical challenges. The one second summary of this work, thinking about both systems and statistical in this federated setting: the punchline is that the systems setting plays a role not only in the optimization algorithm but also in the model we select to fit. It plays a more important role in this overall workflow.

I'm going to go through how we holistically tackle systems and statistical challenges.

Starting with statistical. The goal is we have a bunch of devices generating data, could be unbalanced; some devices have more data than others. One approach used in past is fit a single model across all of this data. All of the data can be aggregated; you find one model that best achieves accuracy across all of the data simultaneously. The other extreme is you find a model for each of the data devices, and not share information. From systems point of view this is great, but statistically, you might have devices that are only ... that are poor in practice. What we're proposing is something between these two extremes. We want to find local models for each device, but share information in a structured way. This can be captured in a framework called multitask learning.

The goal is to fit a separate loss function for each device. These models can be aggregated in this matrix W, and the function of the regularizer, is to force some structure omega on it. This omega is a task relationship matrix, capturing interesting relationships, e.g., all the tasks are related and you want to learn weights, or most of the tasks are related and there are a few outliers, or there are clusters and groups, or there are more sophisticated relationships like asymmetric relationships. These can all be captured in multitask.
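For reference, one common way to write such a multi-task objective is sketched below. The notation is an assumption chosen for illustration, not taken from the slides: there are m devices, device t holds n_t examples (x_t^i, y_t^i), w_t is the t-th column of W, \ell_t is the loss for device t, and \mathcal{R} is the regularizer that ties W to the task relationship matrix \Omega.

```latex
\min_{W,\,\Omega} \; \sum_{t=1}^{m} \sum_{i=1}^{n_t}
  \ell_t\!\left(w_t^{\top} x_t^{i},\; y_t^{i}\right) \;+\; \mathcal{R}(W, \Omega)
```

Different choices of \mathcal{R} would correspond to the different relationship structures listed above (all tasks related, clusters, outliers, and so on).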

We developed a benchmarking set of real federated data. This includes trying to predict human activity from mobile phone, predict if eating or drinking, land mine, and vehicle sensor; distributed sensor to determine if a vehicle is passing by.

For these various datasets, we compared global, local and MTL. The goal is to fit an SVM model. For each data set, we looked at the average error across tasks, where each model is a task. What you can see is that the average error for MTL is significantly lower than for the global and local approaches. This makes sense because MTL is much more expressive; it lets you go between these extremes. What's interesting is that in these real data sets, it really helps. Reduction by half. This is a significant improvement in practice.

Given that we'd like to use multitask learning to model data in the federated environment, the next problem is to figure out how to train this in a distributed setting, thinking about massively distributed. In particular, the goal is to solve the following optimization objective. In looking at how to solve this objective, we note that it's common to solve for W and omega in an alternating fashion. When you solve for omega, it's done centrally; you just need access to the models. But the W update must be distributed because data is stored across devices. The key component of how to solve this in practice is the W update. The challenge of doing this is that communication is extremely expensive. And because of heterogeneity, you may have massive problems with stragglers and fault tolerance; e.g., someone who turns their phone off.

The high level idea for how we're doing this, take a communication efficient method that works well in data center, and modify it to work in federated setting. It will handle MTL as well as stragglers and fault tolerance.

What is the method we're using? The method we're using is CoCoA, which is a state of the art method for empirical risk minimization problems. The thing that's nice about CoCoA is it spans prior work of mini-batch and one-shot communication, by making communication a first class parameter of the method. Make it as flexible as possible. It does it by not solving the primal formulation, but the dual. The dual is nice because we can easily approximate it by forming a quadratic approximation to the objective; and this more easily decomposes across machines.

To bring this to the federated setting, a key challenge is figuring out how to generalize it to the MTL framework. A second challenge: in CoCoA, the subproblems are assumed to be solved to some accuracy theta. This is nice because theta varies from 0 to 1, where 0 is an exact solve, and 1 is inexact. This can be thought of as how much time you spend on local computation versus communication. However, in fact, this is not as flexible as it should be in the federated setting. There is only one theta that is set for all iterations, all nodes. And because theta cannot be set exactly to one, it cannot handle fault tolerance, where there's no work performed at any iteration. We need to make this communication parameter much more flexible in practice.

How are we doing this? We developed MOCHA. The goal is to solve the multitask learning framework; W and Omega in an alternating fashion. In particular, we're able to form the following dual formulation, similar to CoCoA, so it decomposes. In comparison, we make a much more flexible assumption on the subproblem parameter. This is important because of stragglers: for statistical reasons (unbalance, different distributions), it can be very different how difficult it is to solve subproblems. Additionally, there can be stragglers due to systems issues. And issues of fault tolerance. So this looks like a simple fix: we make this accuracy parameter more flexible: allow it to vary by node and iteration t, and let it be exactly 1. The hard thing is showing it converges to the optimal solution.

Following this new assumption, and you can't have a device go down every single round, we show the following convergence guarantee. For L-Lipschitz loss, we get a convergence at 1/epsilon; for smooth models (logistic regression) we get a linear rate.

How does this perform in practice? The method is quite simple. The assumption is we have data stored at m different devices. We alternate between solving Omega, and W stored on each. While we're solving w update, it works by defining these local subproblems for machines, and calling solver that does approximate solution. This is flexible because it can vary by node and iteration.

In terms of comparing this to other methods, what we've seen is the following. We compared MOCHA to CoCoA, Mb-SDCA and Mb-SGD. We ran a simulation with real data to see what would happen if we did it on wifi. We have simulated time and how close we are to optimal. What you can see is that MOCHA is converging much more quickly to the optimal solution, because MOCHA doesn't have the problem of statistical heterogeneity, and it's not bogged down by stragglers. This is true for all of the different types of networks, LTE and 3G. The blue line and MOCHA and CoCoA, they work well in high communication settings, because they are more flexible. But compared to CoCoA, MOCHA is much more robust to statistical heterogeneity.

What's interesting is that if we impose some systems heterogeneity, some devices are slower than others, we looked at imposing low and high systems heterogeneity, MOCHA with this additional heterogeneity, it's a two orders of magnitude speedup to reach optimal solution.

And for MOCHA in particular, we looked at the issue of fault tolerance. What we're showing here, we're increasing the probability a device will drop out at any iteration. Going up until it's half the devices, we're still fairly robust, with MOCHA converging in almost the same amount of time. But what we see with the green dotted line: if the same device drops out every iteration, it doesn't converge. This shows the assumption we made makes sense in practice.

The punchline is that in terms of thinking about this new setting, training ML on these massive networks of devices, this is both a statistical and systems issue. We've addressed it in a holistic manner. Code is available online. I also want to reiterate the SysML conference in February.

Q: When you compare global and local? Why is it always better than global?

A: The motivation why you want to use a local model over a global model is that if you have a lot of local data, you might perform better. It boosts the overall sample size. I have some additional experiments where we took the original data, and skewed it even further than it already was. We took the local data, and there was less data locally, and they have global approaches. That's just a function of the data in the devices.

Q: I really like how your method has guarantees, but I'm wondering about an approach where you create a metalearning algorithm locally and have it work locally?

A: That's worth looking into empirically, since you can do fine tuning locally. What we were trying to do first was converge to exact optimal solution, but you might want to just work empirically well, would be good to compare to this setting.

by Edward Z. Yang at December 08, 2017 06:15 PM

A Machine Learning Approach to Database Indexes (Alex Beutel)

The below is a transcript of a talk by Alex Beutel on machine learning database indexes, at the ML Systems Workshop at NIPS'17.

DB researchers think about their research differently. You have a system that needs to work for all cases. Whereas in ML, we have a unique circumstance, I'll build a model that works well. In DB, you have to fit all.

An example of this is a B-tree. A B-tree works for range queries. We have records and keys, and we want to find all records for a range of keys. Say 0-1000: you build a tree on top of the sorted array, to quickly look up the starting point of the range. What if all my data, all of the keys, run from zero to a million... it becomes clear, you don't need the whole tree above. You can use the key itself as an offset into the array. Your lookup is O(1), O(1) memory, no need for extra data structure.
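The contrast can be sketched in a few lines of Python. This is only an illustration: the record format and function names are made up, and binary search over a sorted list stands in for the tree. When the keys are exactly 0..n-1, the key already is the position.

```python
import bisect

keys = list(range(0, 1000))                  # dense keys 0..999, stored sorted
records = [f"record-{k}" for k in keys]      # hypothetical record payloads


def lookup_generic(key):
    """Generic index: binary search over the sorted keys, O(log n)."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return records[pos]
    return None


def lookup_dense(key):
    """Specialised index: the key is itself the offset, O(1), no extra structure."""
    if 0 <= key < len(records):
        return records[key]
    return None


print(lookup_generic(42), lookup_dense(42))  # record-42 record-42
```

The specialised version only works because of what we know about the key distribution, which is the point the talk builds on.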

Now, we can't go for each app, we can't make a custom implementation to make use of some pattern. DB scale to any application, we don't want to rebuild it any time.

But ML excels in this situation. It works well for a wide variety of distributions, learn and make use of them effectively.

This is the key insight we came to. Traditional data structures make no assumptions about your data. They work under any distribution, and generally scale O(n). Interestingly, learning, these data distributions, can offer a huge win. What we're trying to go to, is instead of scaling to size of data, we scale to complexity of it. With linear data, it's O(1). For other distributions, can we leverage this?

There are three data structures underlying databases. There are B-Trees; range queries, similarity search. Main index. Hash maps for point lookups; individual records. This is more common throughout CS. And bloom filters are really common for set-inclusion queries. Do I have a key. If your record is stored on disk, checking first if there's a record with that key is worthwhile. We're going to focus entirely on B-trees.

B-trees take a tree like structure with high branching factor. What makes it really effective is that it's cache efficient. You can store top level nodes in your cache where it's fast to look it up, maybe others in main memory, and the actual data on disk. Caching the hierarchy appropriately makes it efficient. At a high level, a B-tree maps a key to a page, some given place in memory. Once it finds that page, it will do some local search to find the particular range of that key. That could be a scan or binary search; we know the range will be the position from start of page to page size.

At an abstract level, the B-tree is just a model. It's taking the key, and trying to estimate the position. What we have in this case, we want to search in this error range to find the ultimate record. At a high level, it would seem that we can't use just any model. We need err_min and err_max. But we have all the data. At index construction time, you know all the data you're executing against, and you can calculate what the model's min and max error is.
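As a rough sketch of this idea, the following toy index fits a straight line from key to position and measures err_min and err_max over the stored keys at construction time. The class name and the use of a least-squares line are assumptions for illustration; the talk's models are small neural networks. A lookup predicts a position and then binary-searches only inside the error window.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a least-squares line from key to position, plus error bounds."""

    def __init__(self, sorted_keys):
        # Assumes a non-empty, sorted list of keys.
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(sorted_keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # All keys are known at construction time, so the worst-case errors
        # of the model can be measured exactly.
        errors = [p - self._predict(k) for p, k in enumerate(sorted_keys)]
        self.err_min = min(errors)
        self.err_max = max(errors)

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(guess + self.err_min, 0), n)
        hi = max(lo, min(guess + self.err_max + 1, n))
        # Binary search restricted to the window the error bounds allow.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return None


index = LinearLearnedIndex([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
print(index.lookup(11), index.lookup(4))  # 4 None
```

For any stored key the true position is guaranteed to fall inside the window, because the bounds were computed over exactly those keys; keys that were never stored simply fail the final equality check.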

One interesting thing is this is just a regression problem. What you're really modeling is just the CDF. On the X axis on this plot here, the X axis is your keys, Ys your position. This is modeling where your probability mass is located; where your data is in the keyspace. CDFs are studied somewhat, but not a ton, in the literature. This is a nice new implication of research.

We thought, OK, let's try this out straightaway. Train a model, see how fast it is. We looked at 200M server logs, timestamp key, 2 layer NN, 32-width, relatively small by ML. We train to predict position, square error. A B-Tree executes in 300ns. Unfortunately, with the model, it takes 80000ns. By most ML model speeds, this is great. If you're looking at executing on server, great. But this doesn't work for a database.

There are a bunch of problems baked into this. TF is really designed for large models. Think about translation or superresolution images; these are hefty tasks. We need to make this fast for database level speed. Second, b-trees are great for overfitting. There's no risk of over-fitting in this context. They're also cache efficient; that's not looked at in ML. The last thing is local search in the end. Is that really the most effective way of ultimately finding that key? I'm skipping that part because it's fairly detailed, I'll focus on first three.

The first part is just the raw speed of execution of the ML model. This was built really by Tim, this Learning Index Framework program. What it does is it lets you create different indexes under different configurations. For one thing, it lets you do code compilation for TF, ideas from Tupleware, where you can take a linear model and execute it extremely quickly. We can also train simple models. Use TF for more complex gradient descent based learning; extract weights, and have inference graph be codegenned. And we can do a lot of autotuning, to find what the best model architecture is. We know ahead of time what the best training is. We can make pretty smart decisions about what works best.

The next problem is accuracy and speed. If I have 100M records, I narrow down quickly from 1.5M to 24K, with each step down this tree. Each one of those steps is 50-60 cycles to look through that page, and to find what the right branch is. So we have to get to an accuracy of 12000, within 500 mul/add, to beat these levels of hierarchy, which are in cache. This is a steep task. The question is what is the right model? A really wide network? Single hidden layer? This scales nicely, we can fit in 256 layer reasonably. We could go deeper... the challenge is we have width^2, which need to be parallelized somehow. The challenge is, how do we effectively scale this. We want to add capacity to the model, make it more and more accurate, with increased size, without becoming too...

We took a different approach, based on a mixture of experts. We'll have a key, have a really simple classifier. We get an estimate. Then we can use that estimate to find it at the next stage. Narrow down the CDF range, and try to be more accurate in the subset of space. It will still get key as input; given key, give position, but more narrow space of keys. We build this down, and we'll walk down this hierarchy. This decouples model size and complexity. We have a huge model, overfitting, but we don't have to execute all of the sparsity that you would have to do from a pure ML view. We can decouple it usefully. The nice thing we can do is fall back to B-trees for subsets that are difficult to learn in a model. The LIF framework lets us substitute it in easily. In the worst case, B-tree. Best case, more efficient.
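A heavily simplified sketch of such a hierarchy follows. Everything here is an assumption made for illustration: two stages only, least-squares lines instead of neural networks, hypothetical names and thresholds, and a plain binary search over the whole array standing in for the B-tree fallback. The root model only picks a leaf; each leaf model predicts a position within its own error bounds.

```python
import bisect

FALLBACK = object()  # marker for leaves that are too hard to model


def fit_line(pairs):
    """Least-squares line through (key, position) pairs; returns (slope, intercept)."""
    n = len(pairs)
    mean_k = sum(k for k, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    var_k = sum((k - mean_k) ** 2 for k, _ in pairs)
    cov = sum((k - mean_k) * (p - mean_p) for k, p in pairs)
    slope = cov / var_k if var_k else 0.0
    return slope, mean_p - slope * mean_k


class TwoStageIndex:
    def __init__(self, sorted_keys, num_leaves=4, max_error=8):
        # Assumes a non-empty, sorted list of keys.
        self.keys = sorted_keys
        self.num_leaves = num_leaves
        self.root = fit_line([(k, p) for p, k in enumerate(sorted_keys)])
        buckets = [[] for _ in range(num_leaves)]
        for p, k in enumerate(sorted_keys):
            buckets[self._route(k)].append((k, p))
        self.leaves = []
        for pairs in buckets:
            if not pairs:
                self.leaves.append(None)        # no key was routed here
                continue
            slope, intercept = fit_line(pairs)
            errors = [p - round(slope * k + intercept) for k, p in pairs]
            if max(errors) - min(errors) > max_error:
                self.leaves.append(FALLBACK)    # hard to learn: use plain search
            else:
                self.leaves.append((slope, intercept, min(errors), max(errors)))

    def _route(self, key):
        """Stage one: the root model's estimate selects which leaf model to use."""
        slope, intercept = self.root
        leaf = int((slope * key + intercept) * self.num_leaves / len(self.keys))
        return min(max(leaf, 0), self.num_leaves - 1)

    def lookup(self, key):
        """Stage two: the leaf narrows the search window; returns position or None."""
        n = len(self.keys)
        leaf = self.leaves[self._route(key)]
        if leaf is None:
            return None
        if leaf is FALLBACK:
            lo, hi = 0, n
        else:
            slope, intercept, err_min, err_max = leaf
            guess = round(slope * key + intercept)
            lo = min(max(guess + err_min, 0), n)
            hi = max(lo, min(guess + err_max + 1, n))
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        return pos if pos < n and self.keys[pos] == key else None
```

Routing uses the same function at build time and at lookup time, so a stored key always reaches the leaf whose error bounds were measured on it. Lowering `max_error` pushes more leaves onto the fallback path, which mirrors the worst case described above.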

The quick results version here is that we have four different data sets. Most are integer data sets; the last one is a string data set. We're trying to save memory and speed; we save memory hugely; these are really simple models. Linear with simple layer, with possibly two stages. We're able to get a significant speedup in these cases. Server logs one is interesting. It looks at a high level very linear, but there's actually daily patterns to this data accessed. Maps is more linear; it's longitudes of spaces. We created synthetic data that's log normal, and here we see we can model it effectively. Strings is an interesting challenge going forward; your data is larger and more complicated, building models that are efficient over a really long string is different; the overall patterns are harder to have intuition about. One thing really worth noting here, it's not using GPUs or TPUs; it's purely a CPU comparison. Apples-to-apples.

This is mostly going into the B-tree part. This is a regression model looking at CDF of data. We can use these exact same models for hash maps. With bloom filters, you can use binary classifiers. I have a bunch of results in the poster in the back.

A few minutes to talk about room for improvement. There are a bunch of directions that we're excited to explore. Obvious one is GPUs/TPUs. It's CPUs because that's when B-trees are most effective; but scaling is all about ML. Improving throughput and latency for models with GPUs, exciting going forward. The models themselves; there's no reason to believe hierarchy of models is the right or best choice; it's interesting to build model structures that match your hardware. Memory efficient, underlying architecture of GPUs. In the scale of ns we need for database. Multidimensional indexes; ML excels in high numbers of dimension; most things are not looking at a single integer feature. There's interesting question about how you map to multidimensional indexes that are difficult to scale. If we have a CDF, you can approximately sort it right there. And inserts and updates; we assumed read-only databases. Large class of systems, but we get more data. How do we balance overfitting with accuracy; can we add some extra auxiliary data structures to balance this out?

Q: One thing is that when... this problem, we solved pretty well without ML. When we introduce ML, we should introduce new metrics. We shouldn't make our system more fragile, because distribution changes. What would be the worst case when distribution changes?

A: As the data becomes updated... in the case of inference and updates, there's a question about generalization. I think you could look at it from the ML point of view: statistically, test the model today on tomorrow's inserts. (It's a method. If I use this method, and then train it with data that I don't yet have... and do.) The typical extrapolation to future generalization of ML. Guarantees are hard. There will be a worst case that is awful... but the flip side, that's the ML side... generalization. There's also a point of view where I couple this with a classic data structure. We coupled modeling with classic data structures: search, bloom filter case, so you don't actually have this work. You catch the worst case.

Let me add to that. If you assume that the inserts follow the same distribution as trained model, then the inserts become all one operation. They're even better. Suppose they don't follow the same distribution? you can still do delta indexing. Most systems do do delta indexing. So inserts are not a big problem.

Q: (Robert) Most of the inputs were one or two real numbers, and outputs are a single real number. how does it work if you use a low degree polynomial, or a piecewise linear classifier on the different digits?

A: In the case of strings, it's not a single input. (Treat it as integer?) Well, it's possibly a thousand characters long. It's not the best representation. Different representations work really well. The last thing I want to say, piecewise linear could work, but when you run 10k, 100k submodels, it's slow. Hierarchy helps. Polynomials are interesting, depends on data source.

Q: Can you comment how bad your worst case is? Average numbers?

A: We specifically always have a spillover. The worst case is defaulting to typical database. We haven't had a case where you do worse, because we'll default to B-tree. (Deterministic execution?) Not inference time.

by Edward Z. Yang at December 08, 2017 06:11 PM

Ray: A Distributed Execution Framework for Emerging AI Applications (Ion Stoica)

The below is a transcript of a talk by Ion Stoica on Ray, at the ML Systems Workshop at NIPS'17.

We've been working on it at Berkeley for more than one year. Over the past years, there's been tremendous progress in AI. Ad targeting, image&speech, many more. Many applications are based on supervised learning with DNNs. Supervised plus unsupervised are the two dominant approaches.

However, the next generation of AI applications will be very different. They're deployed in mission critical scenarios, need to continually learn from a rapidly changing env. Robotics, self driving cars, unmanned drones, dialogue systems. Implementing this new generation of AI applications requires a broader range of techniques. Stochastic optimization, parallel simulations, many more.

Ray provides a unified platform for implementing these approaches. To motivate Ray, I'll use reinforcement learning. RL learns by interacting with env: a policy mapping from state/observation to action that maximizes a certain reward. What are the reqs of RL? First, many applications exhibit nested parallelism: search, where they use data parallel SGD, which then calls a component that does policy evaluation with a model to simulate, that runs in parallel on multiple CPUs. Second, these workloads can be highly heterogeneous in hardware and time. Many of these computations require not only CPUs, but GPUs, TPUs and FPGAs. Also, this computation can take wildly different times. Simulate a chess game: 3 moves to lose, or 50 moves to win or draw. And in robotics, we need to process in real time, processing the data from sensors in parallel, tens of ms.

Meeting these requirements is not easy. To meet these requirements, you need a system that is flexible and performant. Flexible: it should create and schedule tasks dynamically, and support arbitrary dependencies. Perf: it should scale to hundreds of nodes, sub-millisecond latency, millions of tasks, and efficiently share numeric data.

Next, I'm going to say how we meet these requirements. Flexibility? We provide a very flexible model: dynamic task graphs. On top of this, we give two models: parallel tasks and actors.

To talk about parallel tasks, here is Python code: one function reads an array from a file, and the other adds two arrays. The code is simple: it creates two arrays a and b from file1 and file2, and sums them up. Parallelizing this program is quite easy. To parallelize a function, we add a ray.remote decorator to it. When we invoke these functions, we need to invoke the remote method. Remote doesn't return the object itself, just the object id. This is very similar to the futures abstraction. To get the actual object, you must invoke ray.get on the object id.

To get a better idea of how Ray is executing, let's execute a simple program. Assume the files are stored on different nodes. When read_array is invoked on file1, Ray schedules read_array on the appropriate node. The remote call returns immediately, before the actual read finishes. This allows the driver to run the second task in parallel, running on the node with file2, and launch the add remote function. All functions have been scheduled remotely, but none of them have finished. To actually get the result, you have to call ray.get on the result. This is a blocking call; you'll wait for the entire computation graph to be executed.
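Since the talk compares this to futures, the same shape can be sketched with Python's standard concurrent.futures module. This is an analogy only, not Ray's API: `submit` plays the role of a remote call that returns a handle immediately, and `result` plays the role of ray.get. The `read_array` stand-in copies a list instead of reading a file, and unlike the flow described in the talk, this sketch waits for both inputs before submitting `add`.

```python
from concurrent.futures import ThreadPoolExecutor


def read_array(values):
    """Stand-in for reading an array from a file; here it just copies a list."""
    return list(values)


def add(a, b):
    """Element-wise sum of two equal-length arrays."""
    return [x + y for x, y in zip(a, b)]


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a future immediately, before the work has finished,
    # much as the talk describes a remote call returning an object id.
    fut_a = pool.submit(read_array, [1, 2, 3])
    fut_b = pool.submit(read_array, [10, 20, 30])
    # result() blocks until the value is ready, playing the role of ray.get.
    fut_sum = pool.submit(add, fut_a.result(), fut_b.result())
    total = fut_sum.result()

print(total)  # [11, 22, 33]
```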

Tasks are very general, but they are not enough. Consider that you want to run a simulator, and this simulator is closed source. In this case, you do not have access to the state. You have state, action, simulations, but to set up state in the simulator, you cannot do it. There is another use case, where the state is too expensive to create. For example, DNNs on GPUs: in this case, you want to initialize it once, and not reinitialize for each simulation.

In order to address these use cases, we add the Actor abstraction. An actor is just a remote class. If you have a Counter, you mark it ray.remote, and then when you create the class or invoke methods, you use the remote keyword. This is a computation graph for this very simple example. Notice the method invocations also return object identifiers. To get the results, you need to call ray.get on object identifiers. Ray also allows you to specify the number of resources, for actors and tasks.
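To illustrate the shape of the actor abstraction without depending on Ray itself, here is a standard-library sketch. It is an analogy under assumptions, not Ray's implementation: the class name and the internal call queue are invented for illustration. The state lives in one worker thread, method calls are processed one at a time, and each call returns a future rather than the value.

```python
import queue
import threading
from concurrent.futures import Future


class CounterActor:
    """A counter whose state lives in one worker thread; calls are processed in order."""

    def __init__(self):
        self._value = 0
        self._calls = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Handle one call at a time, so method invocations are serialized.
        while True:
            future = self._calls.get()
            if future is None:      # sentinel: shut the worker down
                break
            self._value += 1
            future.set_result(self._value)

    def increment(self):
        """Enqueue an increment; returns a future rather than the value itself."""
        future = Future()
        self._calls.put(future)
        return future

    def stop(self):
        self._calls.put(None)
        self._thread.join()


counter = CounterActor()
futures = [counter.increment() for _ in range(3)]
print([f.result() for f in futures])  # [1, 2, 3]
counter.stop()
```

Because the state is created once and kept inside the worker, this also shows why actors suit state that is expensive to initialize.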

To put things together, and provide a more realistic example: evolution strategies, a scalable form of RL, by Salimans et al. at OpenAI. In a nutshell, evolution strategies try lots of policies, and see which runs best. This is highly parallel. So here is pseudocode for evolution strategies. A worker does simulation and returns the reward; create twenty workers, then do 200 simulations, and update the policy. Again, if you want to parallelize this code, we have to add a bunch of remote, and now on the right hand side, you'll notice I'm also showing the computation graph. When you invoke Worker.remote, you create 20 remote workers to do it in parallel. And you invoke with the remote keyword. Again, notice that in this case, the results are not the rewards themselves, but they're ids to the reward objects. In order to get the rewards to update the policy, you have to call ray.get.

This hopefully gives you a flavor of how to program in Ray. Next, I'll switch gears and present the system design of Ray; how Ray gets high performance and scalability.

Like many classic computing frameworks, it has a driver, and a bunch of workers. Driver runs a program, worker runs task remotely. You can run and write a bunch of actors. The drivers and actors on the same node share data through shared memory, and the workers and actors across nodes share through the distributed object store we built. Each node has a local scheduler, so when a driver wants to run another task, the local scheduler tries to schedule it locally. If it cannot schedule it locally, it invokes the global scheduler, which will schedule it on another node that has resources. Actor, remote method. Finally, what we do, and one essential part of the design, is we have a Global Control State. It takes all of the state of the system, and centralizes it. The metadata for the objects, in objects table, function. This allows the rest of the system to be stateless. All these other components can fail, you can bring them up, get the most recent data from the global control state. It also allows us to parallelize the global scheduler, because these replicas are going to share the same state in the GCS.
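The local-first scheduling described above can be sketched as follows. This is a toy model under assumptions: the class and function names are hypothetical, resources are reduced to a single CPU count, and the global scheduler simply picks the first node with capacity. It is not Ray's actual scheduler.

```python
class Node:
    def __init__(self, name, cpus):
        self.name = name
        self.free_cpus = cpus
        self.tasks = []

    def try_schedule(self, task, cpus):
        """Local scheduler: accept the task only if this node has capacity."""
        if self.free_cpus >= cpus:
            self.free_cpus -= cpus
            self.tasks.append(task)
            return True
        return False


class GlobalScheduler:
    def __init__(self, nodes):
        self.nodes = nodes

    def schedule(self, task, cpus):
        """Place the task on any node with enough free resources."""
        for node in self.nodes:
            if node.try_schedule(task, cpus):
                return node.name
        return None  # no node can take the task right now


def submit(task, cpus, local_node, global_scheduler):
    """Try the submitting node first; spill to the global scheduler otherwise."""
    if local_node.try_schedule(task, cpus):
        return local_node.name
    return global_scheduler.schedule(task, cpus)


node_a, node_b = Node("a", cpus=1), Node("b", cpus=2)
scheduler = GlobalScheduler([node_a, node_b])
print(submit("t1", 1, node_a, scheduler))  # a
print(submit("t2", 1, node_a, scheduler))  # b
print(submit("t3", 2, node_a, scheduler))  # None
```

The first task fits locally, the second spills over to node b, and the third finds no node with two free CPUs.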

Another nice effect of having a GCS is that it makes it easy to build a bunch of profiling and debugging tools.

This design is highly scalable. Let me try to convince you why this is. To make the GCS scalable, we just shard it. All these keys are pseudorandom, so it's easy to shard and load balance. The scheduler as you see is distributed; each node has a local scheduler, and Ray tries to schedule tasks which are spawned by a worker/driver locally, on the same node. If the global scheduler becomes a bottleneck, we can also replicate it. Finally, in some systems, even if the scheduler is super scalable, as in Spark, there's another bottleneck: only the driver can launch new tasks. In order to get around that, we allow in Ray the workers and actors to launch tasks. Really, there is no single bottleneck point.
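A minimal sketch of sharding a key-value table by key is shown below. It is an illustration only: the class name is hypothetical, in-memory dictionaries stand in for the real store, and the keys are hashed here to pick a shard, whereas the talk notes that the real keys are already pseudorandom.

```python
import hashlib


class ShardedTable:
    """Key-value table split across shards by a hash of the key."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash spreads keys roughly evenly across shards.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


table = ShardedTable(num_shards=4)
for i in range(100):
    table.put(f"object-{i}", {"size": i})
print(table.get("object-7"))  # {'size': 7}
```

Each key is owned by exactly one shard, so reads and writes for different keys can be served independently.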

A few words about implementation. The GCS is implemented with Redis. For object store, we leverage Apache Arrow. For fault tolerance, we use lineage based fault tolerance like Spark. Actors are part of task graph; methods are treated as tasks, so we have a uniform model for providing fault tolerance.

So now some evaluation results. This plot represents the number of tasks per second, and you can see the number of nodes; it scales linearly. You can schedule over 1.8 M/s. Latency of local task execution is 300us, the latency of remote task is 1ms. This plot illustrates fault tolerance. You may ask why you care about fault tolerance? The problem is you need to handle in your program that the simulation may not finish; this makes the program far more complicated, even if you're willing to ignore some results. Here, on this axis, you have the time in seconds, you have two y axes, number of nodes in system, and the throughput. As you can see, the number of nodes is starting at 50, then 25, then to 10, and goes back to 50. In the red area, you see the number of tasks per second; it follows, as you may expect, the number of nodes in the system. If you look a little bit, there are some drops; every time, you have a drop in the number of tasks. It turns out this is because of the object reconstruction. When some nodes go away, you lose the objects on the node, so you have to reconstruct them. Ray and Spark reconstruct them transparently. With blue, you can see the re-executed tasks. If you add them, you get a very nice filling curve.

Finally, for evolution strategies, we compared with reference ES from... we followed the OpenAI, and on the X axis, you have number of CPUs, mean time to solve the particular problem; simulator, learning to run, there are three points to notice. One is, as expected, as you add more CPUs, the time to solve goes down. The second is that Ray is actually better than the reference ES, better results, even though the reference ES is specialized for beating. Third, for a very large number of CPUs, ref couldn't do it, but Ray could do better and better. I should add that Ray takes half the amount of code, and was implemented in a couple of hours.

Related work: look, in this area, there are a huge number of systems, that's why you are here, lots of systems. Ray is complementary to TF, MXNet, PyTorch, etc. We use these systems to implement DNNs. We integrate with TF and PyT. There are more general systems, like MPI and Spark; these have limited support for nested parallelism; computation model, and they have much coarser grained tasks.

To conclude, Ray is a system for high performance and flexibility and scalability. We have two libraries on top of Ray: RLlib and Ray Tune. It's open source, please try, we'd love your feedback. Robert, Philip, Alex, Stephanie, Richard, Eric, Heng, William, and many thanks to my colleague Michael Jordan.

Q: In your system, you also use actor; actor is built up on shared memory. Do you have separate mailbox for actors? How do you do that?

A: No, the actors communicate by passing the argument to the shared object store.

Q: What is the granularity of parallelism? Is it task atomic, or do you split task?

A: The task granularity is given by what is the overhead for launching a task and scheduling the task. The task you see, we are targeting task, low and few ms. The task is not implementing something like activation function. We leave that job to much better frameworks. And a task is executing atomically, a method, in the actors, are serialized.

Q: Question about fault tolerance: in Spark, when you don't have a response for some time, it says this node died. Here, the task is much more, because NN, something like that. So we don't have the same time.

A: We do not do speculation; implicit speculation in Ray, for the reason you mentioned.

Q: Can you give me more details on the reference implementation, doesn't scale

A: The reference implementation, it's the OpenAI implementation, Robert here can provide you a lot more detailed answers to that question.

by Edward Z. Yang at December 08, 2017 06:07 PM

Mark Jason Dominus

The Aeropress

I drink a lot of coffee at work. Folks there often make a pot of coffee and leave it on the counter to share, but they never make decaf and I drink a lot of decaf, so I make a lot of single cups of decaf, which is time-consuming. More and more people swear by the AeroPress, which they say makes single cups of excellent coffee very quickly. It costs about $30. I got one and tried it out.

The AeroPress works like this: There is a cylinder, open at the top, closed but perforated at the bottom. You put a precut circle of filter paper into the bottom and add ground coffee on top of it. You put the cylinder onto your cup, then pour hot water into the cylinder.

So far this is just a regular single-cup drip process. But after a minute, you insert a plunger into the cylinder and push it down gently but firmly. The water is forced through the grounds and the filter into the cup.

In theory the press process makes better coffee than drip, because there is less opportunity to over-extract. The AeroPress coffee is good, but I did not think it tasted better than drip. Maybe someone else, fussier about coffee than I am, would be more impressed.

Another of the selling points is that the process fully extracts the grounds, but much more quickly than a regular pourover cone, because you don't have to wait for all the dripping. One web site boasts:

Aeropress method shortens brew time to 20 seconds or less.

It does shorten the brew time. But you lose all the time again washing out the equipment. The pourover cone is easier to clean and dry. I would rather stand around watching the coffee drip through the cone than spend the same amount of time washing the coffee press.

The same web site says:

Lightweight, compact design saves on storage space.

This didn't work for me. I can't put it in my desk because it is still wet and it is difficult to dry. So it sits on a paper towel on top of my desk, taking up space and getting in the way. The cone dries faster.

The picture above makes it look very complicated, but the only interesting part is the press itself, shown at upper left. All the other stuff is unimportant. The intriguing hexagon thing is a funnel you can stick in the top of the cylinder if you're not sure you can aim the water properly. The scoop is a scoop. The flat thing is for stirring the coffee in the cylinder, in case you don't know how to use a spoon. I threw mine away. The thing on the right is a holder for the unused paper filters. I suspect they were afraid people wouldn't want to pay $30 for just the press, so they bundled in all this extra stuff to make it look like you are getting more than you actually are. In the computer biz we call this “shovelware”.

My review: The AeroPress gets a solid “meh”. You can get a drip cone for five bucks. The advantages of the $30 AeroPress did not materialize for me, and are certainly not worth paying six times as much.

by Mark Dominus at December 08, 2017 02:13 PM

December 07, 2017

FP Complete

Michael Snoyman

Stack and Nightly breakage

I'm sure a number of readers have already seen something about the situation around Stack and Stackage Nightly/GHC 8.2. I tried to clarify how this happened on the relevant GitHub issue, plus the GHC trac ticket, but thought I'd reshare as a blog post for others who are interested.

EDIT Right after publishing, I saw that Stack 1.6.1 was released, so you should probably just run stack upgrade. Keep reading if you're curious about the bug.

The problem

When the first releases of Stackage Nightly for GHC 8.2.1 started coming out some months back, they did not work with Stack 1.5.0, due to an issue with the ghc.cabal file on Hackage. The reason for this is explained below. We made a point release (Stack 1.5.1) which worked around the issue temporarily, until Stack 1.6 was released with the complete fix.

In the interim, GHC 8.2.2 was released, and Stackage Nightly switched over to it. Contrary to my initial claims: this was a red herring and unrelated to anything.

On December 4, integer-gmp- was uploaded to Hackage, which reintroduced all of the breakage we had with Stack 1.5.0. Since our point release had a very targeted workaround (specifically for ghc.cabal), it did not work around the same bug occurring for integer-gmp.cabal. Therefore, all versions of Stack before 1.6 will fail to build a Stackage release with GHC 8.2.

The workaround

The best "workaround" is just a new release: Stack 1.6 was fortunately already in release candidate mode, and as I type this up it's going through the standard release process. By the time I hit publish, the workaround may be to run stack upgrade.

If that's not the case, you can upgrade to the release candidate by running:

stack upgrade --binary-version

Cabal background

In order to understand the explanation, you should be aware of a few different things that are all called Cabal:

  • cabal-install, the build tool. This is not relevant to the explanation below
  • Cabal the library. This is a normal Haskell library which Stack depends on, and is used for (among other things) parsing cabal files.
  • Cabal the file format. If you open up virtually any cabal file you'll see a cabal-version: >= 1.10 looking field. This is stating which version of the Cabal file format is being used. New versions of Cabal-the-library may add new features to the Cabal file format. The version of the format tracks the library version it was released with, so that a cabal file stating cabal-version: >= 1.24 can only be parsed by Cabal-the-library 1.24 or later.

There was an addition made to Cabal-the-file-format 2.0: a ^>= operator. This operator is not parseable by older versions of Cabal the library (meaning: Cabal 1.24 or earlier). Stack 1.5 was built against Cabal-the-library 1.24, and therefore cannot parse any Cabal files using this new operator.

The Stackage build process prevents any such Cabal files from being used yet to give tooling (like Stack) a chance to upgrade, something I've requested of Hackage as well. However, there are some packages which ship with GHC itself, and which Stackage has no control over in the creation of a snapshot. This includes packages like base, ghc, and integer-gmp.

Original breakage

There's a short explanation (and some code to demonstrate it!) for the original breakage with GHC 8.2.1 in the pull request:

Prior to Stack 1.6, there was a bug where Stack would try to get some metadata about libraries that shipped with GHC from their cabal files instead of directly from the package database. Historically, this has never been a problem, which is why it's survived in Stack for so long. The reason is that, historically, GHC-shipped packages did not use bleeding-edge features in their cabal files.

When GHC 8.2.1 was released, the ghc.cabal file uploaded to Hackage did something new: it used a feature of the newly released Cabal 2.0 file format (the ^>= operator) and required the new Cabal 2.0 library to parse it. This occurred before Stack had a chance to upgrade to Cabal-the-library 2.0, and for that matter before cabal-install 2.0 was released. In other words: at the time the file was placed on Hackage, no officially released version of any common tool supported it.

For unrelated reasons, I'd already fixed this bug on master as part of a refactoring. Strangely enough, that refactoring had to do with problems with revisions. Thanks to the revision system, it's not possible to rely on cabal files on Hackage to tell you anything about GHC-installed packages, since we can't know for certain which revision was used to build the package. (We'll get to integer-gmp in a moment, which is slightly worse in this regard.)

The behavior of Stack at this time with regard to GHC-shipped packages was the following (and this is a bug):

  • If the cabal file cannot be found: ignore the package entirely. This is necessary for packages like rts.
  • If the cabal file is found: try to parse it, and fail if the parse fails.

It was this second bullet which caused a problem. When we discovered this, we released an emergency patch release of Stack to work around this situation and simply ignore parse failures from ghc.cabal. We did not embark on a bigger fix because:

  1. A bigger fix would involve much more code change, introducing the chance for regressions
  2. We already had a fix on master, and knew that Stack 1.6 would be released before GHC 8.4

This went out the door, and all users who upgraded to Stack 1.5.1 were able to use the new Stackage Nightly snapshots based on GHC 8.2.2.

December 4, 2017

One of the packages that ships with GHC 8.2 is integer-gmp- Until December 4, this package was not uploaded to Hackage. As a result, Stack 1.5.1 simply ignored the package entirely, which worked fine. However, something we didn't anticipate happened:

  • Months after the GHC 8.2.1 release, integer-gmp- was uploaded to Hackage
  • The cabal file that was uploaded was manually modified to use Cabal-the-format 2.0 features (again, the ^>= operator).

You can compare the file on Hackage with the file on GitHub. It's unclear what the motivation was behind this modification, but this modification is what broke Stack 1.5.1 with GHC 8.2.

Before this upload, the missing integer-gmp.cabal file was simply ignored by Stack. Once it was uploaded, Stack (again, as a bug) tries to parse it, fails, and gives up.
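To make the sequence of events concrete, here is a toy model of the behavior described above. It is only a sketch and not Stack's actual code: the types `CabalFile` and `Outcome` and the functions `stack150` and `stack151` are hypothetical names invented for illustration.

```haskell
-- Toy model only: these types and functions are hypothetical and do not
-- correspond to anything in the real Stack code base.

-- What Hackage offers for a GHC-shipped package: no cabal file at all,
-- or a file whose parse either fails (Left) or yields metadata (Right).
data CabalFile = Missing | Present (Either String String)

data Outcome
  = Ignored              -- package skipped entirely (as for rts)
  | Metadata String      -- cabal file parsed fine
  | BuildFailure String  -- parse failed, so the whole build plan fails
  deriving (Eq, Show)

-- The buggy behavior of Stack 1.5.0, following the two bullets above.
stack150 :: String -> CabalFile -> Outcome
stack150 _   Missing                = Ignored
stack150 _   (Present (Right meta)) = Metadata meta
stack150 pkg (Present (Left err))   = BuildFailure (pkg ++ ": " ++ err)

-- The 1.5.1 workaround: ignore parse failures, but only for ghc.cabal.
stack151 :: String -> CabalFile -> Outcome
stack151 "ghc" (Present (Left _)) = Ignored
stack151 pkg   file               = stack150 pkg file
```

In this model, `stack151 "integer-gmp" Missing` is `Ignored`, which matches the situation before December 4. Once a file that the old parser rejects is present, the same call yields a `BuildFailure`, because the workaround only special-cases `ghc`.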

The future

Obviously there was a bug in Stack that needed to be fixed, and has been fixed. However, the irregularities around the ghc.cabal and integer-gmp.cabal files are a little troubling, and make it difficult to predict future behavior. Hopefully some new policies from GHC HQ will address these concerns.

And while this case is a bug in Stack, I want to clarify a general point. It is entirely expected that over time, older releases of Stack will not be able to use newer Stackage snapshots. At some point in the future, Stackage will allow Cabal 2.0-formatted cabal files into snapshots, and then by design Stack 1.5 and earlier will be unable to parse those files. That's unfortunate, but expected. What's unexpected in this case was that

  1. These cabal files slipped into a snapshot through the back door (GHC's package database) so quickly, before Stack 1.6 was out the door
  2. That actions taken post-GHC release (a new upload of integer-gmp.cabal) could affect existing snapshots.

Both points will hopefully be addressed by the fixes that landed in Stack 1.6, which ensure less eager parsing of cabal files, and by changes in GHC HQ policy.


  1. There's a bug in Stack, triggered by new behavior not seen before by GHC
  2. That bug affects reproducibility, because an upload to Hackage in the future (or a revision for that matter) can break existing build plans
  3. This bug is fixed on master fully (AFAICT, we've added an integration test to check for regressions)
  4. Instead of putting out another emergency Stack 1.5 patch for integer-gmp.cabal, we're going to get Stack 1.6 out the door ASAP

I hope that clarifies. This is definitely an unfortunate situation, and I know it's screwed up people's development, so my apologies on that front. I hope for all our sakes (mine included!) that the situation is more stable going forward.

December 07, 2017 04:00 AM

Mark Jason Dominus

Shitpost roundup, 2017-11

As I mentioned before, I have started another blog, called Content-type: text/shitpost. While I don't recommend that you read it regularly, you might want to scan over this list of the articles from November 2017 to see if anything catches your eye.

I plan to continue to post monthly summaries here.

by Mark Dominus at December 07, 2017 02:03 AM

Jasper Van der Jeugt

Video: Getting things done in Haskell

Someone alerted me that the video of my talk at the Skills Matter Haskell eXchange 2017 is now available. You can watch it on their website.

The slides can be found here.

It’s a talk aimed towards beginners. If you are writing a medium-sized Haskell application for the very first time, you will typically end up with three modules: Types.hs, Utils.hs and Main.hs. While this is a very clear split, it typically doesn’t scale very well as applications become larger.

I try to answer some questions like:

  • When is it a good idea to use something like Monad/Applicative (and when is it not)?
  • When is it a good idea to invent my own typeclass (and when is it not)?
  • How do I design interfaces and services like in OOP?

Thanks again to Skills Matter for putting together this excellent conference.

by Jasper Van der Jeugt at December 07, 2017 12:00 AM

December 06, 2017

FP Complete

Techniques for Success with Offshore Software Development

When I ran Microsoft’s engineering tools group in the late 1990’s, remote engineering was uncommon and challenging. We had spent millions to relocate engineers to a central headquarters, and when that wasn’t enough to meet all our needs, we had to invent a lot of our own tools to keep offshore projects on track. Since then, the industry has evolved better methods and tools and, more recently, cloud DevOps systems. Reliable remote engineering is now available to everyone.

by Aaron Contorer at December 06, 2017 10:07 PM

December 05, 2017

Jeremy Gibbons

Arithmetic Coding

This post is about the data compression method called arithmetic coding, by which a text is encoded as a subinterval of the unit interval, which is then represented as a bit sequence. It can often encode more effectively than Huffman encoding, because it doesn’t have the restriction of Huffman that each symbol be encoded as a positive whole number of bits; moreover, it readily accommodates adaptive models of the text, which “learn” about the text being encoded while encoding it. It is based on lecture notes that I wrote in 2002 with Richard Bird, although the presentation here is somewhat simplified; it is another application of streaming. There’s quite a lot to cover, so in this post I’ll just set up the problem by implementing a basic encoder and decoder. In the next post, I’ll show how they can both be streamed. (We won’t get into the intricacies of restricting to fixed-precision arithmetic—perhaps I can cover that in a later post.)

The basic idea behind arithmetic coding is essentially to encode an input text as a subinterval of the unit interval, based on a model of the text symbols that assigns them to a partition of the unit interval into non-empty subintervals. For the purposes of this post, we will deal mostly with half-open intervals, so that the interval {[l,r)} contains values {x} such that {l \le x < r}, where {l,r,x} are rationals.

For example, with just two symbols “a” and “b”, and a static model partitioning the unit interval into {[0, \frac 1 3)} for “a” and {[\frac 1 3, 1)} for “b”, the symbols in the input text “aba” successively narrow the unit interval to {[0,\frac 1 3), [\frac 1 9, \frac 1 3), [\frac 1 9, \frac 5 {27})}, and the latter interval is the encoding of the whole input. And in fact, it suffices to pick any single value in this final interval, as long as there is some other way to determine the end of the encoded text (such as the length, or a special end-of-text symbol).


We introduce the following basic definitions for intervals:

\displaystyle  \begin{array}{@{}l} \mathbf{type}\;\mathit{Interval} = (\mathit{Rational}, \mathit{Rational}) \vrule width0pt depth2ex \\ \mathit{unit} :: \mathit{Interval} \\ \mathit{unit} = (0,1) \vrule width0pt depth2ex \\ \mathit{contains} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Bool} \\ \mathit{contains}\;(l,r)\;x = l \le x \land x < r \vrule width0pt depth2ex \\ \mathit{includes} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Bool} \\ \mathit{includes}\;(l,r)\;(p,q) = l \le p \land q \le r \end{array}

We’ll write “{i \ni x}” for {\mathit{contains}\;i\;x}, and “{i \supseteq j}” for {\mathit{includes}\;i\;j}.

A crucial operation on intervals is narrowing of one interval by another, where {\mathit{narrow}\;i\;j} is to {i} as {j} is to the unit interval:

\displaystyle  \begin{array}{@{}l} \mathit{narrow} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Interval} \\ \mathit{narrow}\;i\;(p,q) = (\mathit{weight}\;i\;p, \mathit{weight}\;i\;q) \vrule width0pt depth2ex \\ \mathit{weight} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{weight}\;(l,r)\;x = l + (r-l) \times x \end{array}

We’ll write “{i \mathbin{\triangleright} j}” for {\mathit{narrow}\;i\;j}. Thus, {\mathit{weight}\;(l,r)\;x} is “proportionately {x} of the way between {l} and {r}”, and we have

\displaystyle  \begin{array}{@{}lcl} i \ni \mathit{weight}\;i\;x & \Leftarrow& \mathit{unit} \ni x \\ i \supseteq i \mathbin{\triangleright} j &\Leftarrow& \mathit{unit} \supseteq j \end{array}

Conversely, we can widen one interval by another:

\displaystyle  \begin{array}{@{}l} \mathit{widen} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Interval} \\ \mathit{widen}\;i\;(p,q) = (\mathit{scale}\;i\;p, \mathit{scale}\;i\;q) \vrule width0pt depth2ex \\ \mathit{scale} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{scale}\;(l,r)\;x = (x-l)/(r-l) \end{array}

We’ll write “{i \mathbin{\triangleleft} j}” for {\mathit{widen}\;i\;j}. Note that {\mathit{scale}} is inverse to {\mathit{weight}}, in the sense

\displaystyle  y = \mathit{weight}\;i\;x \Leftrightarrow \mathit{scale}\;i\;y = x

and consequently widening is inverse to narrowing:

\displaystyle  i \mathbin{\triangleleft} (i \mathbin{\triangleright} j) = j
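For readers who want to experiment, the interval definitions above can be transcribed into plain Haskell more or less directly. The following is a sketch of such a transcription; the name `exampleAba` is an addition of this sketch, and it replays the “aba” example from earlier by narrowing the unit interval step by step.

```haskell
import Data.Ratio ((%))

-- A half-open interval [l, r) with rational endpoints.
type Interval = (Rational, Rational)

unit :: Interval
unit = (0, 1)

contains :: Interval -> Rational -> Bool
contains (l, r) x = l <= x && x < r

includes :: Interval -> Interval -> Bool
includes (l, r) (p, q) = l <= p && q <= r

-- weight (l, r) x is proportionately x of the way between l and r.
weight :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x

narrow :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)

-- Inverse of weight; only meaningful for non-empty intervals (l < r),
-- since it divides by the width of the interval.
scale :: Interval -> Rational -> Rational
scale (l, r) x = (x - l) / (r - l)

widen :: Interval -> Interval -> Interval
widen i (p, q) = (scale i p, scale i q)

-- The "aba" example: successively narrow the unit interval by the
-- intervals for 'a', 'b' and 'a' under the static model from the text.
exampleAba :: [Interval]
exampleAba = drop 1 (scanl narrow unit [a, b, a])
  where
    a = (0, 1 % 3)
    b = (1 % 3, 1)
```

Evaluating `exampleAba` gives the three intervals from the example, ending in `(1 % 9, 5 % 27)`, and `widen i (narrow i j)` returns `j` for any non-empty `i`.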


We work with inputs consisting of sequences of symbols, which might be characters or some higher-level tokens:

\displaystyle  \mathbf{type}\;\mathit{Symbol} = \mathit{Char}

The type {\mathit{Model}} then must provide the following operations:

  • a way to look up a symbol, obtaining the corresponding interval:

    \displaystyle  \mathit{encodeSym} :: \mathit{Model} \rightarrow \mathit{Symbol} \rightarrow \mathit{Interval}

  • conversely, a way to decode a value, retrieving a symbol:

    \displaystyle  \mathit{decodeSym} :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow \mathit{Symbol}

  • an initial model:

    \displaystyle  \mathit{initial} :: \mathit{Model}

  • a means to adapt the model on seeing a new symbol:

    \displaystyle  \mathit{newModel} :: \mathit{Model} \rightarrow \mathit{Symbol} \rightarrow \mathit{Model}

The central property is that encoding and decoding are inverses, in the following sense:

\displaystyle  \mathit{decodeSym}\;m\;x = s \quad \Leftrightarrow \quad \mathit{encodeSym}\;m\;s \ni x

There are no requirements on {\mathit{initial}} and {\mathit{newModel}}, beyond the latter being a total function.

For example, we might support adaptive coding via a model that counts the occurrences seen so far of each of the symbols, represented as a histogram:

\displaystyle  \mathbf{type}\;\mathit{Model} = [(\mathit{Symbol},\mathit{Integer})]

This naive implementation works well enough for small alphabets. One might maintain the histogram in decreasing order of counts, so that the most likely symbols are at the front and are therefore found quickest. For larger alphabets, it is better to maintain the histogram as a binary search tree, ordered alphabetically by symbol, and caching the total counts of every subtree.
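As an illustration, here is one possible naive realisation of the four model operations for the histogram representation. It is a sketch under some assumptions of its own: a fixed alphabet of lowercase letters (the hypothetical `alphabet`), every count starting at 1 so that no subinterval is empty, and a helper `ranges` that is not part of the interface above.

```haskell
import Data.Ratio ((%))

type Interval = (Rational, Rational)
type Symbol = Char
type Model = [(Symbol, Integer)]

-- Assumption of this sketch: a small fixed alphabet.
alphabet :: [Symbol]
alphabet = ['a' .. 'z']

-- Every symbol starts with count 1, so each gets a non-empty subinterval.
initial :: Model
initial = [(s, 1) | s <- alphabet]

-- Bump the count of the symbol just seen; symbols outside the model are
-- ignored, which keeps newModel total.
newModel :: Model -> Symbol -> Model
newModel m s = [(c, if c == s then n + 1 else n) | (c, n) <- m]

-- Helper: partition the unit interval in proportion to the counts.
ranges :: Model -> [(Symbol, Interval)]
ranges m = zip (map fst m) (zip bounds (drop 1 bounds))
  where
    total  = sum (map snd m)
    bounds = map (% total) (scanl (+) 0 (map snd m))

-- Partial: fails on symbols that are not in the model.
encodeSym :: Model -> Symbol -> Interval
encodeSym m s = case lookup s (ranges m) of
  Just i  -> i
  Nothing -> error ("encodeSym: symbol not in model: " ++ show s)

-- Partial: fails on values outside the unit interval.
decodeSym :: Model -> Rational -> Symbol
decodeSym m x = case [s | (s, (l, r)) <- ranges m, l <= x, x < r] of
  (s : _) -> s
  []      -> error "decodeSym: value outside the unit interval"
```

Because the subintervals in `ranges` are disjoint, non-empty and cover the unit interval, `decodeSym m x` picks exactly the symbol whose interval contains `x`, which is the central property required above. Both lookups are linear in the alphabet size, which is why this only suits small alphabets.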


Now encoding is straightforward to define. The function {\mathit{encodeSyms}} takes an initial model and a list of symbols, and returns the list of intervals obtained by looking up each symbol in turn, adapting the model at each step:

\displaystyle  \begin{array}{@{}l} \mathit{encodeSyms} :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Interval}] \\ \mathit{encodeSyms}\; m = \mathit{map}\;\mathit{snd} \cdot \mathit{tail} \cdot \mathit{scanl}\;\mathit{next}\;(m,\mathit{unit}) \\ \quad \mathbf{where}\; \begin{array}[t]{@{}l} \mathit{next} :: (\mathit{Model},\mathit{Interval}) \rightarrow \mathit{Symbol} \rightarrow (\mathit{Model},\mathit{Interval}) \\ \mathit{next}\;(m,i)\;s = (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s) \end{array} \end{array}

That is,

\displaystyle  \begin{array}{@{}lcl} \mathit{encodeSyms}\;m\;[\,] &=& [\,] \\ \mathit{encodeSyms}\;m\;(s:ss) &=& \mathit{encodeSym}\;m\;s : \mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss \end{array}

We then narrow the unit interval by each of these subintervals, and pick a single value from the resulting interval:

\displaystyle  \begin{array}{@{}l} \mathit{encode}_0 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow \mathit{Rational} \\ \mathit{encode}_0\;m = \mathit{pick} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \end{array}

All we require of {\mathit{pick} :: \mathit{Interval} \rightarrow \mathit{Rational}} is that {i \ni \mathit{pick}\;i}; then {\mathit{encode}_0} yields a fraction in the unit interval. For example, we might set {\mathit{pick} = \mathit{midpoint}}, where

\displaystyle  \textstyle \mathit{midpoint}\;i = \mathit{weight}\;i\;(\frac 1 2)
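The encoder can be tried out on the earlier “aba” example. The sketch below is self-contained and makes one simplifying assumption: instead of an adaptive histogram it uses a static two-symbol model, represented as a lookup table, so `newModel` ignores its symbol argument. The name `encode0` stands for {\mathit{encode}_0}.

```haskell
import Data.Ratio ((%))

type Interval = (Rational, Rational)
type Symbol = Char

-- Assumption of this sketch: a static model, stored as a lookup table.
type Model = [(Symbol, Interval)]

unit :: Interval
unit = (0, 1)

weight :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x

narrow :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)

-- The two-symbol model from the example: [0,1/3) for 'a', [1/3,1) for 'b'.
initial :: Model
initial = [('a', (0, 1 % 3)), ('b', (1 % 3, 1))]

-- A static model never adapts.
newModel :: Model -> Symbol -> Model
newModel m _ = m

-- Partial: fails on symbols that are not in the model.
encodeSym :: Model -> Symbol -> Interval
encodeSym m s = case lookup s m of
  Just i  -> i
  Nothing -> error ("encodeSym: symbol not in model: " ++ show s)

-- Look up each symbol in turn, threading the model through a scan.
encodeSyms :: Model -> [Symbol] -> [Interval]
encodeSyms m = map snd . drop 1 . scanl next (m, unit)
  where
    next (m', _) s = (newModel m' s, encodeSym m' s)

midpoint :: Interval -> Rational
midpoint i = weight i (1 % 2)

-- Narrow the unit interval by each symbol's interval, then pick a value.
encode0 :: Model -> [Symbol] -> Rational
encode0 m = midpoint . foldr narrow unit . encodeSyms m
```

With these definitions, `encode0 initial "aba"` is `4 % 27`, the midpoint of the final interval {[\frac 1 9, \frac 5 {27})} from the example, and `encode0 initial ""` is `1 % 2`.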


So much for encoding; how do we retrieve the input text? In fact, we can retrieve the first symbol simply by using {\mathit{decodeSym}}. Expanding the encoding of a non-empty text, we have:

\displaystyle  \begin{array}{@{}cl} & \mathit{encode}_0\;m\;(s:ss) \\ = & \qquad \{ \mathit{encode}_0 \mbox{ and } \mathit{encodeSyms} \mbox{, as above; let } i = \mathit{encodeSym}\;m\;s \} \\ & \mathit{pick}\;(\mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(i : \mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss)) \\ = & \qquad \{ \mbox{fold} \} \\ & \mathit{pick}\;(i \mathbin{\triangleright} \mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(\mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss)) \\ = & \qquad \{ \mathit{pick}\;(i \mathbin{\triangleright} j) = \mathit{weight}\;i\;(\mathit{pick}\;j) \mbox{ (see below)} \} \\ & \mathit{weight}\;i\;(\mathit{pick}\;(\mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(\mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss))) \\ = & \qquad \{ \mathit{encode}_0 \mbox{ and } \mathit{encodeSyms} \mbox{ again} \} \\ & \mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss) \end{array}

The proof obligation, left as an exercise, is to show that

\displaystyle  \mathit{pick}\;(i \mathbin{\triangleright} j) = \mathit{weight}\;i\;(\mathit{pick}\;j)

which holds when {\mathit{pick}\;i} is of the form {\mathit{weight}\;i\;x} for some {x}.

It remains to check that {\mathit{decodeSym}} retrieves the first symbol from this encoding:

\displaystyle  \begin{array}{@{}cl} & \mathit{decodeSym}\;m\;(\mathit{encode}_0\;m\;(s:ss)) = s \\ \Leftrightarrow & \qquad \{ \mbox{expansion of } \mathit{encode}_0 \mbox{, as above; let } i = \mathit{encodeSym}\;m\;s \} \\ & \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss)) = s \\ \Leftrightarrow & \qquad \{ \mbox{requirement on models} \} \\ & i \ni \mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss) \\ \Leftarrow & \qquad \{ \mathit{weight} \} \\ & \mathit{unit} \ni \mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss \end{array}

and indeed, encoding yields a fraction in the unit interval, so this recovers the first symbol correctly. This is the foothold that allows the decoding process to make progress; having obtained the first symbol using {\mathit{decodeSym}}, it can adapt the model in precisely the same way that the encoding process does, then retrieve the second symbol using that adapted model, and so on. The only slightly tricky part is that when decoding an initial value {x}, having obtained the first symbol {s}, decoding should continue on some modified value {x'}; what should the modification be? It turns out that the right thing to do is to scale {x} by the interval associated in the model with symbol {s}, since scaling is the inverse operation to the {\mathit{weight}}s that take place during encoding. That is, we define:

\displaystyle  \begin{array}{@{}l} \mathit{decode}_0 :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_0\;m\;x = \mathit{unfoldr}\;\mathit{step}\;(m,x) \vrule width0pt depth2ex \\ \mathit{step} :: (\mathit{Model}, \mathit{Rational}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model},\mathit{Rational})) \\ \mathit{step}\;(m,x) = \mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{scale}\;(\mathit{encodeSym}\;m\;s)\;x)) \\ \quad \mathbf{where}\;s = \mathit{decodeSym}\;m\;x \end{array}

(Of course, {\mathit{encodeSym}\;m\;s \ni x}, by the inverse requirement on models, and so the new scaled value is again within the unit interval.)

Note that decoding yields an infinite list of symbols; the function {\mathit{step}} is always productive. Nevertheless, that infinite list starts with the encoded text, as we shall now verify. Define the round-trip function

\displaystyle  \mathit{round}_0\;m = \mathit{decode}_0\;m \cdot \mathit{encode}_0\;m

Then we have:

\displaystyle  \begin{array}{@{}cl} & \mathit{round}_0\;m\;(s:ss) \\ = & \qquad \{ \mbox{definition of } \mathit{round}_0 \} \\ & \mathit{decode}_0\;m\;(\mathit{encode}_0\;m\;(s:ss)) \\ = & \qquad \{ \mathit{encode}_0 \mbox{; let } i = \mathit{encodeSym}\;m\;s, m' = \mathit{newModel}\;m\;s \} \\ & \mathit{decode}_0\;m\;(\mathit{weight}\;i\;(\mathit{encode}_0\;m'\;ss)) \\ = & \qquad \{ \mathit{decode}_0 \mbox{; first decoded symbol is correct, as above} \} \\ & s : \mathit{decode}_0\;m'\;(\mathit{scale}\;i\;(\mathit{weight}\;i\;(\mathit{encode}_0\;m'\;ss))) \\ = & \qquad \{ \mathit{scale}\;i\;(\mathit{weight}\;i\;x) = x \} \\ & s : \mathit{decode}_0\;m'\;(\mathit{encode}_0\;m'\;ss) \\ = & \qquad \{ \mbox{definition of } \mathit{round}_0 \} \\ & s : \mathit{round}_0\;m'\;ss \end{array}

From this it follows that indeed the round-trip recovers the initial text, in the sense that {\mathit{round}_0\;m\;ss} yields an infinite sequence that starts with {ss}; in fact,

\displaystyle  \mathit{round}_0\;m\;ss = ss \mathbin{{+}\!\!\!{+}} \mathit{round}_0\;(\mathit{foldl}\;\mathit{newModel}\;m\;ss)\;[\,]

yielding the original input followed by some junk, the latter obtained by decoding the fraction {\frac 1 2} (the encoding of {[\,]}) from the final model {\mathit{foldl}\;\mathit{newModel}\;m\;ss} that results from adapting the initial model to each symbol in {ss} in turn. To actually retrieve the input text with no junk suffix, one could transmit the length separately (although that doesn’t sit well with streaming), or append a distinguished end-of-text symbol.
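The round trip can be checked on a small example. The sketch below again assumes a static two-symbol model stored as a lookup table rather than an adaptive one, writes `encode0` and `decode0` for {\mathit{encode}_0} and {\mathit{decode}_0}, and adds a hypothetical helper `roundTrip` that cuts off the junk suffix by passing the length of the input out of band.

```haskell
import Data.List (unfoldr)
import Data.Ratio ((%))

type Interval = (Rational, Rational)
type Symbol = Char

-- Assumption of this sketch: a static model, stored as a lookup table.
type Model = [(Symbol, Interval)]

unit :: Interval
unit = (0, 1)

weight :: Interval -> Rational -> Rational
weight (l, r) x = l + (r - l) * x

scale :: Interval -> Rational -> Rational
scale (l, r) x = (x - l) / (r - l)

narrow :: Interval -> Interval -> Interval
narrow i (p, q) = (weight i p, weight i q)

initial :: Model
initial = [('a', (0, 1 % 3)), ('b', (1 % 3, 1))]

-- A static model never adapts.
newModel :: Model -> Symbol -> Model
newModel m _ = m

-- Partial: fails on symbols that are not in the model.
encodeSym :: Model -> Symbol -> Interval
encodeSym m s = case lookup s m of
  Just i  -> i
  Nothing -> error ("encodeSym: symbol not in model: " ++ show s)

-- Partial: fails on values outside the unit interval.
decodeSym :: Model -> Rational -> Symbol
decodeSym m x = case [s | (s, (l, r)) <- m, l <= x, x < r] of
  (s : _) -> s
  []      -> error "decodeSym: value outside the unit interval"

encode0 :: Model -> [Symbol] -> Rational
encode0 m = pick . foldr narrow unit . encodeSyms m
  where
    pick i = weight i (1 % 2)
    encodeSyms _  []       = []
    encodeSyms m' (s : ss) = encodeSym m' s : encodeSyms (newModel m' s) ss

-- Decoding never stops: step always returns Just.
decode0 :: Model -> Rational -> [Symbol]
decode0 m x = unfoldr step (m, x)

step :: (Model, Rational) -> Maybe (Symbol, (Model, Rational))
step (m, x) = Just (s, (newModel m s, scale (encodeSym m s) x))
  where
    s = decodeSym m x

-- Hypothetical helper: drop the junk suffix using the known input length.
roundTrip :: [Symbol] -> [Symbol]
roundTrip ss = take (length ss) (decode0 initial (encode0 initial ss))
```

Here `roundTrip "aba"` returns `"aba"`. Taking more than three symbols from `decode0 initial (encode0 initial "aba")` exposes the junk that results from continuing to decode {\frac 1 2}. All arithmetic is on exact rationals, so no rounding can disturb the round trip.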

What’s next

So far we have an encoder and a decoder, and a proof that the decoder successfully decodes the encoded text. In the next post, we’ll see how to reimplement both as streaming processes.

by jeremygibbons at December 05, 2017 03:58 PM

Joachim Breitner

Finding bugs in Haskell code by proving it

Last week, I wrote a small nifty tool called bisect-binary, which semi-automates answering the question “To what extent can I fill this file up with zeroes and still have it working”. I wrote it in Haskell, and part of the Haskell code, in the Intervals.hs module, is a data structure for “subsets of a file” represented as a sorted list of intervals:

data Interval = I { from :: Offset, to :: Offset }
newtype Intervals = Intervals [Interval]

The code is the kind of Haskell code that I like to write: A small local recursive function, a few guards to do case analysis, and I am done:

intersect :: Intervals -> Intervals -> Intervals
intersect (Intervals is1) (Intervals is2) = Intervals $ go is1 is2
  where
    go _ [] = []
    go [] _ = []
    go (i1:is1) (i2:is2)
        -- reorder for symmetry
        | to i1 < to i2 = go (i2:is2) (i1:is1)
        -- disjoint
        | from i1 >= to i2 = go (i1:is1) is2
        -- subset
        | to i1 == to i2 = I f' (to i2) : go is1 is2
        -- overlapping
        | otherwise = I f' (to i2) : go (i1 { from = to i2} : is1) is2
      where f' = max (from i1) (from i2)

But clearly, the code is already complicated enough so that it is easy to make a mistake. I could have put in some QuickCheck properties to test the code, but I was in proving mood...

Now available: Formal Verification for Haskell

Ten months ago I complained that there was no good way to verify Haskell code (and created the nifty hack ghc-proofs). But things have changed since then, as a group at UPenn (mostly Antal Spector-Zabusky, Stephanie Weirich and myself) has created hs-to-coq: a translator from Haskell to the theorem prover Coq.

We have used hs-to-coq on various examples, as described in our CPP'18 paper, but it is high time to use it for real. The easiest way to use hs-to-coq at the moment is to clone the repository, copy one of the example directories (e.g. examples/successors), place the Haskell file to be verified there and put the right module name into the Makefile. I also commented out parts of the Haskell file that would drag in non-base dependencies.

Massaging the translation

Often, hs-to-coq translates Haskell code without a hitch, but sometimes, a bit of help is needed. In this case, I had to specify three so-called edits:

  • The Haskell code uses Intervals both as a name for a type and for a value (the constructor). This is fine in Haskell, which has separate value and type namespaces, but not for Coq. The line

    rename value Intervals.Intervals = ival

    changes the constructor name to ival.

  • I use the Int64 type in the Haskell code. The Coq version of Haskell’s base library that comes with hs-to-coq does not support that yet, so I change that via

    rename type GHC.Int.Int64 = GHC.Num.Int

    to the normal Int type, which itself is mapped to Coq’s Z type. This is not a perfect fit, and my verification would not catch problems that arise due to the boundedness of Int64. Since none of my code does arithmetic, only comparisons, I am fine with that.

  • The biggest hurdle is the recursion of the local go functions. Coq requires all recursive functions to be obviously (i.e. structurally) terminating, and the go above is not. For example, in the first case, the arguments to go are simply swapped. It is very much not obvious why this is not an infinite loop.

    I can specify a termination measure, i.e. a function that takes the arguments xs and ys and returns a “size” of type nat that decreases in every call: Add the lengths of xs and ys, multiply by two and add one if the first interval in xs ends before the first interval in ys.

    If the problematic function were a top-level function I could tell hs-to-coq about this termination measure and it would use this information to define the function using Program Fixpoint.

    Unfortunately, go is a local function, so this mechanism is not available to us. If I care more about the verification than about preserving the exact Haskell code, I could easily change the Haskell code to make go a top-level function, but in this case I did not want to change the Haskell code.

    Another way out offered by hs-to-coq is to translate the recursive function using an axiom unsafeFix : forall a, (a -> a) -> a. This looks scary, but as I explain in the previous blog post, this axiom can be used in a safe way.

    I should point out it is my dissenting opinion to consider this a valid verification approach. The official stand of the hs-to-coq author team is that using unsafeFix in the verification can only be a temporary state, and eventually you’d be expected to fix (heh) this, for example by moving the functions to the top-level and using hs-to-coq’s support for Program Fixpoint.

With these edits in place, hs-to-coq spits out a faithful Coq copy of my Haskell code.

Time to prove things

The rest of the work is mostly straight-forward use of Coq. I define the invariant I expect to hold for these lists of intervals, namely that they are sorted, non-empty, disjoint and non-adjacent:

Fixpoint goodLIs (is : list Interval) (lb : Z) : Prop :=
  match is with
    | [] => True
    | (I f t :: is) => (lb <= f)%Z /\ (f < t)%Z /\ goodLIs is t
  end.

Definition good is := match is with
  ival is => exists n, goodLIs is n end.

and I give them meaning using Coq’s type for sets, Ensemble:

Definition range (f t : Z) : Ensemble Z :=
  (fun z => (f <= z)%Z /\ (z < t)%Z).

Definition semI (i : Interval) : Ensemble Z :=
  match i with I f t => range f t end.

Fixpoint semLIs (is : list Interval) : Ensemble Z :=
  match is with
    | [] => Empty_set Z
    | (i :: is) => Union Z (semI i) (semLIs is)
  end.

Definition sem is := match is with
  ival is => semLIs is end.

Now I prove for every function that it preserves the invariant and that it corresponds to the, well, corresponding function, e.g.:

Lemma intersect_good : forall (is1 is2 : Intervals),
  good is1 -> good is2 -> good (intersect is1 is2).
Proof. … Qed.

Lemma intersection_spec : forall (is1 is2 : Intervals),
  good is1 -> good is2 ->
  sem (intersect is1 is2) = Intersection Z (sem is1) (sem is2).
Proof. … Qed.

Even though I punted on the question of termination while defining the functions, I do not get around it while verifying them, so I formalize the termination argument above:

Definition needs_reorder (is1 is2 : list Interval) : bool :=
  match is1, is2 with
    | (I f1 t1 :: _), (I f2 t2 :: _) => (t1 <? t2)%Z
    | _, _ => false
  end.

Definition size2 (is1 is2 : list Interval) : nat :=
  (if needs_reorder is1 is2 then 1 else 0) + 2 * length is1 + 2 * length is2.

and use it in my inductive proofs.
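The same measure can be sanity-checked on the Haskell side. The sketch below transcribes needs_reorder and size2 and follows the chain of recursive calls that go would make, checking that the measure strictly drops at every step. The `Interval` type is assumed, and `nextCall`, `decreases` and `samples` are hypothetical helpers that are not part of the original development.

```haskell
module Main where

import Data.List (subsequences)

-- Assumed definition: the post does not show this type.
data Interval = I { from :: Int, to :: Int } deriving (Eq, Show)

-- Haskell transcriptions of needs_reorder and size2.
needsReorder :: [Interval] -> [Interval] -> Bool
needsReorder (i1 : _) (i2 : _) = to i1 < to i2
needsReorder _        _        = False

size2 :: [Interval] -> [Interval] -> Int
size2 xs ys =
  (if needsReorder xs ys then 1 else 0) + 2 * length xs + 2 * length ys

-- The arguments of the recursive call that `go` would make, if any.
-- This mirrors the guards of `go` but drops the produced intervals.
nextCall :: [Interval] -> [Interval] -> Maybe ([Interval], [Interval])
nextCall _ [] = Nothing
nextCall [] _ = Nothing
nextCall (i1 : is1) (i2 : is2)
  | to i1 < to i2    = Just (i2 : is2, i1 : is1)
  | from i1 >= to i2 = Just (i1 : is1, is2)
  | to i1 == to i2   = Just (is1, is2)
  | otherwise        = Just (i1 { from = to i2 } : is1, is2)

-- True if the measure strictly decreases along the whole call chain.
decreases :: [Interval] -> [Interval] -> Bool
decreases xs ys = case nextCall xs ys of
  Nothing         -> True
  Just (xs', ys') -> size2 xs' ys' < size2 xs ys && decreases xs' ys'

-- All well-formed interval lists with endpoints in [0..6].
samples :: [[Interval]]
samples = [ pairUp bs | bs <- subsequences [0 .. 6], even (length bs) ]
  where
    pairUp (f : t : rest) = I f t : pairUp rest
    pairUp _              = []

main :: IO ()
main = print (and [ decreases xs ys | xs <- samples, ys <- samples ])
```

The swap case is the one that needs the extra bit: the lengths stay the same, but the reorder flag goes from 1 to 0. In every other case at least one list gets shorter, which outweighs the flag possibly going back to 1.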

As I intend this to be a write-once proof, I happily copy’n’pasted proof scripts and did not do any cleanup. Thus, the resulting Proof file is big, ugly and repetitive. I am confident that judicious use of Coq tactics could greatly condense this proof.

Using Program Fixpoint after the fact?

These proofs are also an experiment in how I can actually do induction over a locally defined recursive function without too ugly proof goals (hence the line match goal with [ |- context [unsafeFix ?f _ _] ] => set (u := f) end.). One could improve upon this approach by following these steps:

  1. Define copies (say, intersect_go_witness) of the local go using Program Fixpoint with the above termination measure. The termination argument needs to be made only once, here.

  2. Use this function to prove that the argument f in go = unsafeFix f actually has a fixed point:

    Lemma intersect_go_sound:

    f intersect_go_witness = intersect_go_witness

    (This requires functional extensionality). This lemma indicates that my use of the axioms unsafeFix and unsafeFix_eq is actually sound, as discussed in the previous blog post.

  3. Still prove the desired properties for the go that uses unsafeFix, as before, but using the functional induction scheme for intersect_go! This way, the actual proofs are free from any noisy termination arguments.

    (The trick to define a recursive function just to throw away the function and only use its induction rule is one I learned in Isabelle, and is very useful to separate the meat from the red tape in complex proofs. Note that the induction rule for a function does not actually mention the function!)

Maybe I will get to this later.

Update: I experimented a bit in that direction, and it does not quite work as expected. In step 2 I am stuck because Program Fixpoint does not create a fixpoint-unrolling lemma, and in step 3 I do not get the induction scheme that I was hoping for. Both problems would not exist if I used the Function command, although that needs some trickery to support a termination measure on multiple arguments. The induction lemma is not quite as polished as I was hoping for, so the resulting proof is still somewhat ugly, and it requires copying code, which does not scale well.

Efforts and gains

I spent exactly 7 hours working on these proofs, according to arbtt. I am sure that writing these functions took me much less time, but I cannot calculate that easily, as they were originally in the Main.hs file of bisect-binary.

I did find and fix three bugs:

  • The intersect function would not always retain the invariant that the intervals would be non-empty.
  • The subtract function would prematurely advance through the list of intervals in the second argument, which can lead to a genuinely wrong result. (This occurred twice.)

Conclusion: Verification of Haskell code using Coq is now practically possible!

Final rant: Why is the Coq standard library so incomplete (compared to, say, Isabelle’s), and why does it require me to prove so many lemmas about basic functions on Ensembles?

by Joachim Breitner at December 05, 2017 02:17 PM

December 04, 2017

Roman Cheplyaka

Introduction to golden testing

Golden tests are like unit tests, except the expected output is stored in a separate file. I learned about them in 2010 from Max Grigorev at ZuriHac.

Let’s say you want to test Python’s json module. One way to do that would be to encode an object and compare the result to a reference string:

import json

assert(json.dumps([1,2,3]) == "[1, 2, 3]")

Alternatively, you could create a file with contents

[1, 2, 3]

and read it to know the expected output:

import json

with open("example1.json", "r") as ex1_file:
    ex1 = ex1_file.read()
    assert(json.dumps([1,2,3]) == ex1)

The file example1.json is called a golden file.

Here are some advantages of golden tests over ordinary unit tests:

  1. If the expected output is large in size, it may be impractical to put it inside the source code.
  2. No need to escape quotes or binary data in the expected output.
  3. When you add a new test, your testing framework can generate the missing golden file from the current output of the function.

    It is best if you can write down the expected output without looking at the actual output, but it is not always possible. The output may be too big to type it character by character, or it may be hard to predict. For instance, in the json example, you couldn’t tell in advance whether there would be spaces between array elements or not. So often what you do is launch an interactive interpreter (if your language of choice even has one), run the function, and then copy-paste its output into the test code.

    This process can be easily automated if you use golden files.

  4. The expected output can be automatically updated.

    Say you changed your json module to replace some of the spaces with newlines to make the output more aesthetically pleasing. You have 40 test cases that need updating. Can you imagine doing this by hand?

    With golden tests, you can tell your test framework to update all golden files from the current outputs, then check git diff to ensure that all changes are valid, and commit them.
  5. If some of your tests suddenly start failing, you can use diff or other such tools to compare the golden file to the actual file and figure out what exactly changed. Perhaps your testing framework could even show the diff automatically on test failure?
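Advantages 3 and 4 come down to a small amount of logic around the comparison. The following is a minimal hand-rolled sketch of that logic, not how tasty-golden is implemented; `goldenCheck`, `Outcome` and the temporary file name are all made up for the illustration.

```haskell
module Main where

import Control.Exception (evaluate)
import Control.Monad (when)
import System.Directory (doesFileExist, getTemporaryDirectory, removeFile)
import System.FilePath ((</>))

data Outcome = Created | Passed | Failed | Accepted
  deriving (Eq, Show)

-- | Hypothetical helper (not tasty-golden's API): compare the output of
-- an action with a golden file, creating or updating the file on demand.
goldenCheck :: Bool -> FilePath -> IO String -> IO Outcome
goldenCheck accept goldenPath action = do
  actual <- action
  exists <- doesFileExist goldenPath
  if not exists
    then do
      -- Advantage 3: generate the missing golden file.
      writeFile goldenPath actual
      return Created
    else do
      expected <- readFile goldenPath
      -- Finish the lazy read so the file is closed before any rewrite.
      _ <- evaluate (length expected)
      if expected == actual
        then return Passed
        else if accept
          -- Advantage 4: overwrite the golden file with the new output.
          then writeFile goldenPath actual >> return Accepted
          else return Failed

main :: IO ()
main = do
  tmp <- getTemporaryDirectory
  let path = tmp </> "golden-sketch-example.json"
  stale <- doesFileExist path
  when stale (removeFile path)                    -- start from a clean slate
  goldenCheck False path (return "[1, 2, 3]")   >>= print  -- no golden file yet
  goldenCheck False path (return "[1, 2, 3]")   >>= print  -- same output
  goldenCheck False path (return "[1,\n2,\n3]") >>= print  -- changed output
  goldenCheck True  path (return "[1,\n2,\n3]") >>= print  -- changed, accepted
  removeFile path
```

The four calls in `main` walk through the life cycle of a golden file: created, matched, mismatched, and updated on request.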

While advantages 1-2 are automatic, 3-5 require special support from your testing framework. The rest of this article will be focused on a Haskell testing framework tasty and its add-on package for golden tests, tasty-golden.

Basic usage

To illustrate how tasty-golden works, consider this yaml-to-json conversion module:

{-# LANGUAGE TypeApplications #-}
module YamlToJson where

import qualified Data.Yaml as Y
import Data.Aeson as J
import qualified Data.ByteString.Lazy as LBS

yamlToJson :: LBS.ByteString -> LBS.ByteString
yamlToJson = J.encode . Y.decode @Value . LBS.toStrict

Because JSON contains quotes and YAML spans multiple lines, it is not very practical to store them as string literals in the source code file. Instead, you will keep them both in files.

Note that the name “golden file” only refers to the file containing the output, not the input. There is no requirement that the input is stored in a file or that there even is any “input” at all; but in practice it is often convenient to store them both in files so that there is an input file for every output file and vice versa.

import Test.Tasty (defaultMain, TestTree, testGroup)
import Test.Tasty.Golden (goldenVsString, findByExtension)
import qualified Data.ByteString.Lazy as LBS
import YamlToJson (yamlToJson)
import System.FilePath (takeBaseName, replaceExtension)

main :: IO ()
main = defaultMain =<< goldenTests

goldenTests :: IO TestTree
goldenTests = do
  yamlFiles <- findByExtension [".yaml"] "."
  return $ testGroup "YamlToJson golden tests"
    [ goldenVsString
        (takeBaseName yamlFile) -- test name
        jsonFile -- golden file path
        (yamlToJson <$> LBS.readFile yamlFile) -- action whose result is tested
    | yamlFile <- yamlFiles
    , let jsonFile = replaceExtension yamlFile ".json"
    ]

This is all the code you need to support one, two, or a thousand test cases. When run, this code will:

  1. find all .yaml files in the current directory
  2. for each .yaml file, construct a golden test that evaluates yamlToJson on the input read from file and compares the result to the golden file, which has the same name and the .json extension
  3. put all individual tests in a test group and pass it to defaultMain for execution
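The naming in step 2 relies on two functions from System.FilePath. A small sketch of what they compute for each input path follows; `describeCase` and the second example path are made up for the illustration.

```haskell
module Main where

import System.FilePath (replaceExtension, takeBaseName)

-- For an input path, the test name and the golden file path that the
-- list comprehension in goldenTests would compute.
describeCase :: FilePath -> (String, FilePath)
describeCase yamlFile =
  (takeBaseName yamlFile, replaceExtension yamlFile ".json")

main :: IO ()
main = mapM_ (print . describeCase) ["./fruits.yaml", "./nested/veg.yaml"]
-- prints:
-- ("fruits","./fruits.json")
-- ("veg","./nested/veg.json")
```

The golden file therefore always sits next to its input, while the test name drops both the directory and the extension.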

To see how this works in practice, create an input file, fruits.yaml, with the following contents:

- orange
- apple
- banana

Now run your test suite (note: in a proper cabalized project, you’d run cabal test or stack test instead):

% stack runghc test.hs
YamlToJson golden tests
  fruits: OK
    Golden file did not exist; created

All 1 tests passed (0.00s)

tasty-golden realized that this is a new test case because the golden file was absent, so it went ahead and initialized the golden file based on the function’s output. You can now examine the file to see if it makes sense:

% cat fruits.json
["orange","apple","banana"]

If you are happy with it, check in both input and output files to git. This is important so that your collaborators can run the tests, but it also helps when dealing with failing tests, as you’ll see next.

% git add fruits.yaml fruits.json && git commit -m "fruits test case"

Dealing with test failures

Occasionally, your tests will fail. A test that cannot fail is a useless test.

A golden test fails when the actual output does not match the contents of the golden file. You then need to figure out whether this is a bug or an intentional code change.

Let’s say you decide that the output of yamlToJson should end with a newline.

The new function definition is

yamlToJson = (<> "\n") . J.encode . Y.decode @Value . LBS.toStrict

Now run the test suite:

% stack runghc test.hs
YamlToJson golden tests
  fruits: FAIL
    Test output was different from './fruits.json'. It was: "[\"orange\",\"apple\",\"banana\"]\n"

1 out of 1 tests failed (0.00s)

Ok, this is not very helpful. There are two main ways to get better diagnostics. One is to use the goldenVsStringDiff function as an alternative to goldenVsString. This will include the diff right in the tasty output.
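A minimal sketch of the first option is shown below. It assumes the tasty and tasty-golden packages are installed and that a diff executable is on the PATH. The constant byte string stands in for the real yamlToJson output so that the example does not depend on the YamlToJson module, and `diffCmd` is a name chosen here.

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as LBS8
import Test.Tasty (defaultMain)
import Test.Tasty.Golden (goldenVsStringDiff)

-- Command line used to compare the golden file (first argument) with a
-- temporary file holding the new output (second argument).
diffCmd :: FilePath -> FilePath -> [String]
diffCmd ref new = ["diff", "-u", ref, new]

main :: IO ()
main = defaultMain $
  goldenVsStringDiff
    "fruits"       -- test name
    diffCmd        -- how to produce the diff shown on failure
    "fruits.json"  -- golden file path
    -- Stand-in for (yamlToJson <$> LBS.readFile "fruits.yaml").
    (return (LBS8.pack "[\"orange\",\"apple\",\"banana\"]\n"))
```

Compared to goldenVsString, the only new argument is the function that builds the diff command.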

But my preferred workflow is to use git for this. First, rerun the tests and pass the --accept option. This will update the golden files with the new output:

% stack runghc -- test.hs --accept
YamlToJson golden tests
  fruits: OK
    Accepted the new version

All 1 tests passed (0.00s)

Now, because your golden file is tracked by git, you can examine the differences between the old and new golden files with git diff:

% git diff
diff --git fruits.json fruits.json
index c244c0a..ed447d4 100644
--- fruits.json
+++ fruits.json
@@ -1 +1 @@
-["orange","apple","banana"]
\ No newline at end of file
+["orange","apple","banana"]

Because this is the change you expected, you can now commit the updated file to git.

This workflow lets you use all the powerful git diff options like --color-words, or even launch a graphical diff tool like kdiff3 with git difftool.

See also

Golden tests are tasty by Kwang Yul Seo

December 04, 2017 08:00 PM

Manuel M T Chakravarty

Here is the video of my Functional Conf 2017 talk Haskell...

Here is the video of my Functional Conf 2017 talk Haskell SpriteKit — a Purely Functional API for a Stateful Animation System and Physics Engine. In this talk, I am explaining how to wrap an OOish game engine API based on a mutable scene graph into a purely functional API based on an immutable algebraic data type.

December 04, 2017 05:23 AM

December 01, 2017

Mark Jason Dominus

Slaughter electric needle injector

[ This article appeared yesterday on Content-type: text/shitpost but I decided later there was nothing wrong with it, so I have moved it here. Apologies if you are reading it twice. ]

At the end of the game Portal, one of the AI cores you must destroy starts reciting GLaDOS's cake recipe. Like GLaDOS herself, it starts reasonably enough, and then goes wildly off the rails. One of the more memorable ingredients from the end of the list is “slaughter electric needle injector”.

I looked into this a bit and I learned that there really is a slaughter electric needle injector. It is not nearly as ominous as it sounds. The needles themselves are not electric, and it has nothing to do with slaughter. Rather, it is a handheld electric-powered needle injector tool that happens to be manufactured by the Slaughter Instrument Company, Inc, founded more than a hundred years ago by Mr. George Slaughter.

Slaughter Co. manufactures tools for morticians and embalmers preparing bodies for burial. The electric needle injector is one such tool; they also manufacture a cordless electric needle injector, mentioned later as part of the same cake recipe.

The needles themselves are quite benign. They are small, with delicate six-inch brass wires attached, and cost about twenty-five cents each. The needles and the injector are used for securing a corpse's mouth so that it doesn't yawn open during the funeral. One needle is injected into the upper jaw and one into the lower, and then the wires are twisted together, holding the mouth shut. The mortician clips off the excess wire and tucks the ends into the mouth. Only two needles are needed per mouth.

There are a number of explanatory videos on YouTube, but I was not able to find any actual demonstrations.

by Mark Dominus at December 01, 2017 04:10 PM

November 30, 2017

FP Complete

NAT Gateways in Amazon GovCloud

So you’re deploying your government-sensitive data and services on GovCloud, or planning to, and you want your data to be protected against third-party access, so you configure your subnets as private resources, without internet access. In other AWS regions, you could then add a managed NAT Gateway and instances would have, once configured, egress available for internet access. This allows them to update their software and run smoothly, pulling in necessary external information.

by Yghor Kerscher at November 30, 2017 10:25 PM

Douglas M. Auclair (geophf)

November 2017 1HaskellADay problems and solutions

by geophf at November 30, 2017 09:58 PM

November 29, 2017

Tweag I/O

Making two garbage collectors be good neighbours (using linear types)

Facundo Domínguez and Mathieu Boespflug

Foreign function interfaces (FFI) allow fast interop between languages. Unlike other approaches, like performing RPC calls between different components written in different languages, using the FFI allows for all manner of data to be shared between each language runtime, in the same address space. This reduces memory consumption and obviates marshalling costs. But when two garbage-collected languages share references to the same values, each garbage collector (GC) needs to be careful to not collect these values while the other language has references to them. This is a problem we ran into when building both inline-r and inline-java. In this post, we'll survey this very generic problem in all fast language interop, using Java interop as a case study.

Bonus: we'll show you how linear types can help solve the problem safely.

Unsafe bindings to Java

The Java Virtual Machine (JVM) offers a foreign interface to manipulate Java objects, known as the Java Native Interface (JNI). This is a C interface, which we can readily bind in Haskell using inline-c or similar. This is what the jni package does.

The JNI is a low-level interface that is painful to use. No programmer wants to invoke Java methods through the JNI using stringly typed class names, method names and argument types. Doing so is very error-prone and verbose. So we built higher-level abstractions on top, jvm and inline-java, that run every method invocation through the Java type checker as well as the Haskell type checker. Think of inline-java as a pretty good typo detector.

In fact, inline-java does even more than that. It checks that Haskell types and Java types line up. It catches at compile time many common bugs that could cause the program to crash or fail, but a few remain. Notably,

  • it is possible to use references to Java objects by mistake after they have been collected, and
  • it is possible to accidentally retain large amounts of memory in the Java heap with references that live in the memory managed by Haskell.

Here's a case study: the conversion of Java Iterators to Haskell Streams (as defined in the streaming package).

import Foreign.JNI
import Language.Java as Java
import Language.Java.Inline
import Streaming

iteratorToStream
  :: Reify a
  => J ('Iface "java.util.Iterator")
  -> IO (Stream (Of a) IO ())
iteratorToStream it = do
    return $ Streaming.untilRight $ do
      [java| $it.hasNext() |] >>= \case
        False -> return (Right ())
        True -> do
          obj <- [java| $it.next() |]
          Left <$> Java.reify obj

See previous posts for an intro to inline-java, but here's the gist. The input to this function is any Java object that conforms to the java.util.Iterator interface. The output is a Stream yielding values of some type a. The Java objects are pulled from the iterator as the stream is consumed. The constraint Reify a states that we know how to convert Java objects to Haskell values of type a. We do this on the last line by calling reify.

Like in Java, it and obj above are actually references to objects. But it's a special type of reference provided by the JNI, which can be used by foreign code (such as C or Haskell). These JNI references need to be deleted explicitly once they are no longer needed, otherwise JVM objects cannot be reclaimed by the JVM GC.

The above implementation of iteratorToStream is not deleting the references to Java objects. That's a leak! Indeed, an object reference acts as a root in the graph of all objects in the heap, as far as the JVM garbage collector is concerned. Adding to the problem, the JVM can't deal very well with large and unknown amounts of references. The JNI expects native calls to use only a few references and expects the programmer to say in advance how many references will be needed. Failing to do so affects performance and can lead to failures.

A straightforward fix to this situation is to delete the reference after the Haskell value has been obtained.

    bracket [java| $it.next() |]
            JNI.deleteLocalRef
            (\jNext -> Left <$> Java.reify jNext)

There are two problems with this approach:

  • this puts the burden on the programmer to remember to delete the reference and to be careful not to use it afterwards (or risk a segfault). Moreover,
  • JNI references are usually local, meaning that they are only valid on the thread that created them. So the programmer has to be careful to not share them with other threads.

Could we possibly ask the compiler to perform these checks?

Garbage Collector Finalizers

One way to avoid needing these checks in the first place is to just let the Haskell GC delete Java references automatically when they become unreachable. We attach to each reference a finalizer that deletes it, which is going to be called by the Haskell GC. Such references are no longer local references, but global references. Unlike local references, a global reference can be used in any thread and it is not destroyed when control returns to Java. Since the JNI provides a facility to promote any local reference to a global one, couldn't we just turn all local references into global ones and then have them be managed by the GC? A global reference is more expensive than a local one, so performance suffers. But it mostly works. Until you run out of memory...

A major problem with letting the GC run the show completely is that, counterintuitively, sometimes memory might never be reclaimed, even when many objects are long dead. Suppose that the Java heap is crowded, the garbage collector of the JVM is desperate to kick some objects out of existence, and yet there is a good chunk of references from Haskell-land to the Java heap. The Haskell portion of the application is already done with the references, but since there is plenty of space in the Haskell heap, Haskell's garbage collector is basking in the sun, with no pressure to run the finalizers that would delete the unused references.

Sometimes, the application is lucky and the Haskell GC runs the finalizers just in time, which lets the Java GC clean the Java heap. Unfortunately, sometimes, the Haskell GC won't run and the JVM will fail with an OutOfMemory exception.

Dynamic scopes

Another solution is to define dynamic scopes. When a program's control flow enters a scope, we open a new buffer. We keep track of all newly created references in the buffer, until the control flow leaves the scope, at which point we discard all recorded references all at once. In general, scopes are not allowed to overlap arbitrarily, but they can be nested.

In Haskell, the resourcet package neatly encapsulates this idea. The JNI natively supports a similar idea using pushLocalFrame and popLocalFrame. pushLocalFrame (n :: Int) creates a new scope in which at least n local references can be created. Exceeding the given capacity might cause performance issues or errors. popLocalFrame j copies the reference j to the parent frame and deletes the current frame, which causes all references of the frame to be deleted.
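The scope discipline can be sketched without a JVM. In the following minimal model, `Ref`, `Scope`, `newRef`, `deleteRef` and `withScope` are all hypothetical stand-ins rather than functions from jni or resourcet. A scope is a buffer of references that is emptied when control leaves it, even if the body throws.

```haskell
module Main where

import Control.Exception (bracket)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Stand-in for a JNI local reference; the Int is just an identifier.
newtype Ref = Ref Int deriving (Eq, Show)

-- A scope is a buffer of the references created while it was open.
newtype Scope = Scope (IORef [Ref])

-- Hypothetical allocation: records the new reference in the scope.
newRef :: Scope -> Int -> IO Ref
newRef (Scope buf) n = do
  let r = Ref n
  modifyIORef' buf (r :)
  return r

-- Hypothetical release; a real binding would call into the JVM here.
deleteRef :: Ref -> IO ()
deleteRef r = putStrLn ("deleting " ++ show r)

-- Open a scope, run the body, then delete everything it recorded,
-- even if the body throws an exception.
withScope :: (Scope -> IO a) -> IO a
withScope = bracket (Scope <$> newIORef []) release
  where
    release (Scope buf) = readIORef buf >>= mapM_ deleteRef

main :: IO ()
main =
  withScope $ \outer -> do
    _ <- newRef outer 1
    withScope $ \inner -> do
      _ <- newRef inner 2
      _ <- newRef inner 3
      putStrLn "leaving inner scope"
    putStrLn "leaving outer scope"
```

Nothing in these types stops a `Ref` from being returned out of `withScope` and used after its scope has released it.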

We are still running the risk of accidentally using a local reference after deletion, or of using it in threads where it is invalid. But programmers no longer need to remember to delete individual local references. Still, in practice we found it difficult to find a hierarchy of nested scopes that keeps the counts of local references low. It is a problem that worsens with the size of the application. When building a complex server application that made many invocations to Java, we started with a scope per client request, and then a scope per test, and then we added scopes within the scopes when we were creating more local references than anticipated. Eventually, it became very difficult for multiple teams of programmers of varying experience levels to be sure that the number of extant references stayed bounded for all possible code paths and inputs.

Linear Types

We would really prefer to delete a reference exactly when we know it to be no longer useful. In this way, memory becomes reclaimable by the Java GC immediately. The problem is that it's easy to forget to do so at all, leading to multiple leaks in an application. The key invariant we want checked by the compiler is that once we have a reference, it should be deleted exactly once, and never referred to after that. That is, we want to use references linearly.

What if we used the GHC proposal for linear types to treat our local references linearly? It would look something like this:

import Foreign.JNI
import Language.Java as Java
import Language.Java.Inline as Inline
import Streaming

iteratorToStream
  :: Reify a
  => J ('Iface "java.util.Iterator" <> [Interp a])
  ->. IOL (Stream (Of a) IOL ())
iteratorToStream itLocal = do
    return $ Streaming.untilRight $ do
      [java| $it.hasNext() |] >>= \case
        False -> return (Right ())
        True -> do
          obj0 <- [java| $it.next() |]
          (obj1, Unrestricted a) <- Java.reify obj0
          JNI.deleteLocalRef obj1
          return a

Java.reify :: J (Interp a) ->. IOL (J (Interp a), Unrestricted a)

-- | A linear value of type `Unrestricted a` holds a value of
-- type `a` which can be used non-linearly or unrestrictly.
data Unrestricted a where
  Unrestricted :: a -> Unrestricted a

We are assuming that we have a restricted form of the IO monad, called IOL, with the following operations.

return :: a ->. IOL a
(>>=) :: IOL a ->. (a ->. IOL b) ->. IOL b

liftIO :: IO a -> IOL a

data IOL a where
  IOL :: IO a -> IOL a

runIOL :: IOL (Unrestricted a) -> IO a
runIOL (IOL io) = do
    Unrestricted a <-
      bracket_ (JNI.pushLocalFrame capacity)
               (JNI.popLocalFrame JNI.jnull)
               io
    return a
  where
    capacity = ...

Compared to dynamic scopes, the major feature of IOL is that programmers can delete local references promptly, inside a single global scope, when they are no longer needed. The programmer doesn't have to be concerned with guessing a scope hierarchy anymore.

IOL introduces local references as linear values. Operations that do not delete the reference, like reify, now have to return a copy of it, and the operations that delete the value, like deleteLocalRef, produce no copy. This means both that references cannot be used after they are deleted (since they can't be used more than once), and that the compiler will require them to be deleted eventually (they must be used at least once). Finally, local references cannot be allowed to escape the scope of runIOL, as they become invalid before runIOL returns. This is achieved by constraining its argument to yield an unrestricted value Unrestricted a. Local references are released promptly even if an exception arises, thanks to the bracket inside runIOL and the fact that there is no way to catch exceptions in IOL.
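The "exactly once" discipline can be tried out in a pure setting with the LinearTypes extension that later shipped in GHC 9.0, where a linear arrow is written `%1 ->` rather than the `->.` used in this post. The sketch below is an illustration under that assumption and leaves out IOL entirely. `Ref`, `readRef`, `deleteRef`, `finish`, `afterDelete`, `readAndDelete` and `printValue` are hypothetical names, with `readRef` playing the role of reify and `deleteRef` that of deleteLocalRef.

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE LinearTypes #-}

module Main where

-- Stand-in for a local reference. The field is declared with an ordinary
-- arrow, so the Int inside may be used freely once the Ref is unpacked.
data Ref where
  Ref :: Int -> Ref

-- Same shape as the post's Unrestricted: the payload is not linear.
data Unrestricted a where
  Unrestricted :: a -> Unrestricted a

-- Like reify: reads the value and hands the reference back.
readRef :: Ref %1 -> (Ref, Unrestricted Int)
readRef (Ref n) = (Ref n, Unrestricted n)

-- Like deleteLocalRef: consumes the reference and returns no copy.
deleteRef :: Ref %1 -> ()
deleteRef (Ref _) = ()

-- Deletes the returned reference and keeps only the unrestricted value.
finish :: (Ref, Unrestricted Int) %1 -> Unrestricted Int
finish (r1, Unrestricted n) = afterDelete (deleteRef r1) n

afterDelete :: () %1 -> Int -> Unrestricted Int
afterDelete () n = Unrestricted n

-- Reads the value, then deletes the reference exactly once.
readAndDelete :: Ref %1 -> Unrestricted Int
readAndDelete r0 = finish (readRef r0)

-- Both of these are expected to be rejected by the type checker:
--
-- leak :: Ref %1 -> ()
-- leak _ = ()                                     -- never consumed
--
-- useAfterDelete :: Ref %1 -> ((), (Ref, Unrestricted Int))
-- useAfterDelete r = (deleteRef r, readRef r)     -- consumed twice

printValue :: Unrestricted Int -> IO ()
printValue (Unrestricted n) = print n

main :: IO ()
main = printValue (readAndDelete (Ref 42))
```

Only the value wrapped in `Unrestricted` can outlive the reference, which mirrors how runIOL constrains its result.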

Admittedly, if exceptions need to be caught, it has to be done by the caller of runIOL. In our experience, many applications need to catch exceptions in a few places only, so this is a modest price to pay.


Each of the local and global references we create via the JNI is effectively a GC root for the Java GC. The JNI was designed with the assumption that programmers ensure that very few such roots are in flight at any one time. The R native interface and others make similar assumptions. In this post, we discussed the tension that arises between releasing early and frequently, and doing so safely without increasing the risk of use-after-free bugs. With linear types, we can get both.

A competing approach that we haven't discussed is the lightweight monadic regions of Kiselyov and Shan. This is an incarnation of dynamic scopes that, like linear types, has the type checker guarantee that resources aren't used after being released and that they aren't used in other threads. However, monadic regions still require the programmer to insert neither too many nor too few scopes.

Some have suggested introducing affine types instead of linear types in Haskell. But for the particular use case discussed in this post, affine types would do no better than these monadic regions. That's because affine types provide a weaker guarantee to the caller: we can return to the caller having used the argument at most once, but also never at all. We'd need nested scopes all over again to ensure that references do get disposed of in a timely fashion.

In our discussion of linear types, we brought streams to a linear monad without delving into the details of whether it is possible and how it would work. This will be the topic for a future post.

November 29, 2017 12:00 AM

November 27, 2017

Gabriel Gonzalez

Compare Nix derivations using nix-diff

I'm announcing a small nix-diff utility I wrote for comparing Nix derivations. This post will walk through two use cases for how you might use this utility.


This section provides some required background for understanding this post if you're new to Nix.

There are three stages to a Nix build:

  • Nix source code (i.e. *.nix files)
    • This corresponds to a source distribution in a typical package manager
  • Nix derivations (i.e. /nix/store/*.drv files)
    • This is the stage that caching works at
  • Nix build products (i.e. /nix/store/* files that are not derivations)
    • This corresponds to a binary distribution in a typical package manager

You can convert between these stages using the following command-line tools:

  • nix-instantiate converts Nix source code to Nix derivations
    • i.e. *.nix → /nix/store/*.drv
  • nix-store --realise converts Nix derivations to Nix build products
    • i.e. /nix/store/*.drv → /nix/store/*
  • nix-build is a convenience utility which combines the two preceding steps to go straight from source code to build products
    • i.e. *.nix → /nix/store/*

Nix supports caching binary build products so if you try to build the same derivation twice then the second build will reuse the result of the first build (i.e. a "cache hit"). If the derivation changes in any way, you get a "cache miss" and you need to build the derivation.

Carefully note that caching works at the level of Nix derivations and not at the level of Nix source code. For example, the following two Nix files differ at the source code level:

$ cat example0.nix 
let
  pkgs = import <nixpkgs> { };

in
  pkgs.hello

$ cat example1.nix
(import <nixpkgs> { }).hello

... but they produce the exact same derivation file:

$ nix-instantiate example0.nix 

$ nix-instantiate example1.nix

... which means that if you try to build both example0.nix and example1.nix the build will only occur once since they share the same derivation.

You can think of the derivation file as a language-independent description of how to build something:

$ fold /nix/store/ajypjz54a8rn1qxsnhyr8m87w6hd7ghp-hello-2.10.drv 
orks/CoreFoundation.framework/CoreFoundation /dev/zero /dev/random /dev/urandom
__sandboxProfile","(allow file-read* (literal \"/usr/lib/libncurses.5.4.dylib\")
)\n(import \"/nix/store/\")\

These *.drv files use the ATerm file format and are Nix-independent. Conceptually, Nix is just a domain-specific language for generating these ATerm files. That means, for example, that you could replace Nix with any front-end language or tool that can generate these ATerm files. In fact, this is how Guix works, by replacing Nix with Guile Scheme as the front-end language.

Understanding how Nix derivations work is fundamental to understanding the Nix ecosystem. nix-diff is one tool that aids this learning process as the following sections will illustrate.

Cache misses

nix-diff is a tool that I wish I had back when Awake Security first adopted Nix. We frequently ran into cache misses when using Nix because of subtle differences in Nix derivations in different development environments.

We can understand why we got cache misses by referring back to the three stages of a Nix build:

  • Nix source code (i.e. *.nix files)
  • Nix derivations (i.e. /nix/store/*.drv files)
  • Nix build products (i.e. /nix/store/* files that are not derivations)

For production we prefer to distribute Nix build products (i.e. binary distributions), but internally for development we distribute Nix source code. We prefer Nix code internally because this gives developers complete control over all of their transitive dependencies. For example, a developer can easily patch the systemd executable used on the virtual machine that runs their integration tests.

However, this flexibility comes at a price: if you don't know what you are doing you can easily accidentally change the derivation. This is because Nix and Nixpkgs are customizable to a fault and they have all sorts of "impure" defaults that change depending on the development environment. If you trip over one of these pitfalls you end up with a cache miss, which is a poor user experience.

The most common pitfalls we ran into early on in our Nix adoption were:

  • Not pinning nixpkgs
    • Note: We publicly shared our recipe for pinning nixpkgs here
  • Not pinning the system field for a derivation
    • This field defaults to the impure builtins.currentSystem in many cases
  • Impure surprises in nixpkgs

Let's motivate this with a real example. Suppose that I have the following derivation to build the Glasgow Haskell compiler (ghc):

$ cat example0.nix
let
  pkgs = import <nixpkgs> { };

in
  pkgs.ghc

This Nix expression is "impure" because the expression depends on the ambient nixpkgs channel that the user has installed. Compare this to the following expression which pins nixpkgs to a specific revision protected by a hash:

$ cat example1.nix
let
  fetchNixpkgs = import ./fetchNixpkgs.nix;

  nixpkgs = fetchNixpkgs {
    rev = "76d649b59484607901f0c1b8f737d8376a904019";
    sha256 = "01c2f4mj4ahir0sxk9kxbymg2pki1pc9a3y6r9x6ridry75fzb8h";
  };

  pkgs = import nixpkgs { };

in
  pkgs.ghc

Let's instantiate the two expressions to compute their derivations:

$ nix-instantiate example0.nix 
$ nix-instantiate example1.nix

Note that you may get a different result for the first derivation depending on what version of the nixpkgs channel you have installed.

Visually comparing the two derivation files is tedious and time-consuming:

$ fold /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv | head

$ fold /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv | head

If we use nix-diff, then we can pull out the differences immediately:

$ nix-diff /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv 
- /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv:{out}
+ /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv:{out}
The set of outputs do not match:
+ {man}
The builders do not match
- /nix/store/hsk82g493i7r496ghs0y61m6yvknxcml-bash-4.4-p5/bin/bash
+ /nix/store/axikcsz4wh2qpi5zmlfsmm4jx8wm8s1g-bash-4.4-p12/bin/bash
The set of input names do not match:
- bash-4.4-p5
- clang-wrapper-3.7.1
- coreutils-8.26
- gmp-6.1.1
- perl-5.22.3
- python2.7-Sphinx-1.5.2
+ bash-4.4-p12
+ clang-wrapper-4.0.1
+ coreutils-8.28
+ gmp-6.1.2
+ perl-5.24.3
+ python2.7-Sphinx-1.6.5

Now we can see at a glance that the versions of several dependencies changed and GHC has split out its man pages into a new man output for better granularity of the build graph.

Note that these are not the only differences between the two derivations. However, all of the other differences are downstream of the above differences. For example, the two derivations have different out paths, but we expect them to differ for any two derivations that are not identical so there's no point including that in the diff. nix-diff makes an effort to highlight the root cause of the difference.

Understanding differences

Nix is more than just a package manager. You can use Nix to build and deploy an entire machine, which is how NixOS (the Nix operating system) works. The machine configuration is a Nix expression that you can instantiate and build like any other Nix expression.

This means that we can also use nix-diff to compare two machine configurations and understand how they differ. For example, when we change our production systems at Awake Security we sometimes run the change through nix-diff during code review to ensure that reviewers understand every change being made to the system.

We can illustrate this with a small example comparing two NixOS system specifications. The first system specification is a mostly blank system:

$ cat example0.nix
let
  nixos = import <nixpkgs/nixos> {
    system = "x86_64-linux";

    configuration = {
      boot.loader.grub.devices = [ "/dev/sda" ];

      fileSystems."/" = {
        device = "/dev/sda";
      };
    };
  };

in
  nixos.system

... and the second specification enables Kafka on the system:

$ cat example1.nix
let
  nixos = import <nixpkgs/nixos> {
    system = "x86_64-linux";

    configuration = {
      boot.loader.grub.devices = [ "/dev/sda" ];

      fileSystems."/" = {
        device = "/dev/sda";
      };

      services.apache-kafka.enable = true;
    };
  };

in
  nixos.system

We can diff the two derivations in one step like this:

$ nix-diff $(nix-instantiate example0.nix) $(nix-instantiate example1.nix)
- /nix/store/6z9nr5pzs4j1v9mld517dmlcz61zy78z-nixos-system-nixos-18.03pre119245.
+ /nix/store/k05ibijg0kknvwrgfyb7dxwjrs8qrlbj-nixos-system-nixos-18.03pre119245.
The input named `etc` differs
- /nix/store/05c0v10pla0v8rfl44rs744m6wr729jy-etc.drv:{out}
+ /nix/store/8waqvzjg7bazzfzr49m89q299kz972wv-etc.drv:{out}
The input named `dbus-1` differs
- /nix/store/a16j2snzz25dhh96jriv3p6cgkc0vhxr-dbus-1.drv:{out}
+ /nix/store/mliabzdkqaayya67xiwfhwkg4gs9k0cg-dbus-1.drv:{out}
The input named `system-path` differs
- /nix/store/jcf6q7na01j8k9xcmqxykl62k4x6zwiv-system-path.drv:{out}
+ /nix/store/kh4kgsms24d02bxlrxb062pgsbs3riws-system-path.drv:{out}
The set of input names do not match:
+ apache-kafka-2.12-
The input named `system-path` differs
These two derivations have already been compared
The input named `system-units` differs
- /nix/store/yqnqdajd4664rvycrnwxwaj0mxp7602c-system-units.drv:{out}
+ /nix/store/2p5c4arwqphdz5wsvz6dbrgv0vhgf5qh-system-units.drv:{out}
The set of input names do not match:
+ unit-apache-kafka.service
The input named `user-units` differs
- /nix/store/x34dqw5y34dq6fj5brj2b5qf0nvglql9-user-units.drv:{out}
+ /nix/store/4iplnk260q2dpr8b8ajrjkrn44yk06aq-user-units.drv:{out}
The input named `unit-dbus.service` differs
- /nix/store/fd6j972zn1hfvqslxc8c64xxaf1wg475-unit-dbus.service.drv:{out}
+ /nix/store/s7rpgwbald9qx8rwlw4v276wj2x3ld8r-unit-dbus.service.drv:{out}
The input named `dbus-1` differs
These two derivations have already been compared
The input named `system-path` differs
These two derivations have already been compared
The input named `users-groups.json` differs
- /nix/store/x6c7pqx40wfdzwf96jfi1l0hzxjgypri-users-groups.json.drv:{out}
+ /nix/store/gk5yyjw579hgyxgwbrh1kzb3hbdbzgbq-users-groups.json.drv:{out}
The environments do not match:
"mutableUsers":true,"users":[{"createHome":false,"description":"→Apache Kafka
daemon user","group":"nogroup","hashedPassword":null,"home":"/tmp/kafka-logs","i
/bin/nologin","uid":169},{"createHome":false,"description":"→D-Bus system mess

However, this doesn't do the diff justice because the actual output is colorized.

From the diff we can see that:

  • This change adds Kafka executables to the system PATH
  • This change adds a new apache-kafka systemd service
  • This change adds a new apache-kafka user to the system

Note how nix-diff does more than diffing the two root derivations. If the two derivations differ on a shared input then nix-diff will descend into that input and diff that and repeat the process until the root cause of the change is found. This works because Nix's dependency graph is complete and reachable from the root derivation.
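The descent described above can be sketched in a few lines of Haskell. This is only a toy model and not nix-diff's actual implementation: the `Drv`, `Store`, and `diffDrv` names are hypothetical, a derivation is reduced to a builder plus a map of named inputs, and the messages merely imitate the output shown earlier.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Toy model of a derivation (hypothetical; a real .drv file carries much more).
type Path = String

data Drv = Drv
  { builder :: String               -- path of the builder executable
  , inputs  :: Map.Map String Path  -- input name -> path of the input derivation
  }

type Store = Map.Map Path Drv
type Seen = Set.Set (Path, Path)

-- Report differences between two derivations, descending into shared inputs
-- whose paths differ. `seen` records pairs that were already compared.
diffDrv :: Store -> Path -> Path -> Seen -> ([String], Seen)
diffDrv store l r seen
  | l == r = ([], seen)
  | Set.member (l, r) seen = (["These two derivations have already been compared"], seen)
  | otherwise = case (Map.lookup l store, Map.lookup r store) of
      (Just dl, Just dr) -> (header ++ builderDiff ++ nameDiff ++ inputDiff, seen1)
        where
          header = ["- " ++ l, "+ " ++ r]
          builderDiff
            | builder dl == builder dr = []
            | otherwise = ["The builders do not match", "- " ++ builder dl, "+ " ++ builder dr]
          namesL = Map.keysSet (inputs dl)
          namesR = Map.keysSet (inputs dr)
          nameDiff
            | namesL == namesR = []
            | otherwise =
                "The set of input names do not match:"
                  : map ("- " ++) (Set.toList (Set.difference namesL namesR))
                  ++ map ("+ " ++) (Set.toList (Set.difference namesR namesL))
          -- Inputs present on both sides, paired up by name.
          shared = Map.toList (Map.intersectionWith (,) (inputs dl) (inputs dr))
          (inputDiff, seen1) = foldl step ([], Set.insert (l, r) seen) shared
          step (out, s) (name, (pl, pr))
            | pl == pr = (out, s)
            | otherwise =
                let (sub, s') = diffDrv store pl pr s
                in (out ++ ["The input named `" ++ name ++ "` differs"] ++ sub, s')
      _ -> (["derivation missing from the toy store"], seen)

main :: IO ()
main = do
  let store = Map.fromList
        [ ("old-system", Drv "bash" (Map.fromList [("system-path", "old-path")]))
        , ("new-system", Drv "bash" (Map.fromList [("system-path", "new-path")]))
        , ("old-path", Drv "bash" Map.empty)
        , ("new-path", Drv "bash" (Map.fromList [("apache-kafka", "kafka")]))
        , ("kafka", Drv "bash" Map.empty)
        ]
  mapM_ putStrLn (fst (diffDrv store "old-system" "new-system" Set.empty))
```

In this toy store the two root derivations differ only through `system-path`, so the sketch descends into that input and reports the extra `apache-kafka` input there, which is the same shape as the Kafka example above.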


You can find the nix-diff utility on Hackage or GitHub if you would like to use this in your own development workflow. Hopefully nix-diff will help you better understand how Nix works under the hood and also help you pin Nix derivations more robustly.

by Gabriel Gonzalez at November 27, 2017 03:58 PM

November 26, 2017

Neil Mitchell

Haskell exceptions and FFI wrappers

Summary: If you create a C function pointer from a Haskell function with "wrapper", and it throws an exception, bad things happen.

The Haskell FFI is incredibly powerful, allowing you to convert Haskell functions into C function pointers. In this post I'll give a quick example, then go into what happens if the Haskell function throws an exception. First, let's define a C function (and put it in a file called c.c):

int apply(int(*f)(int), int x)
{
    return f(x);
}

The piece int(*f)(int) says f is a function of type Int -> Int. The function apply is equivalent to $, restricted to int - it applies the first argument f to the second argument x and returns the result. We can call that in Haskell with:

foreign import ccall apply :: FunPtr (CInt -> IO CInt) -> CInt -> IO CInt
foreign import ccall "wrapper" wrap :: (CInt -> IO CInt) -> IO (FunPtr (CInt -> IO CInt))

main :: IO ()
main = do
    f <- wrap $ \x -> return $ x + 20
    res <- apply f 22
    print res

On the first line we wrap apply into a Haskell definition, turning a C function pointer into FunPtr. In the second we define a special "wrapper" FFI definition - the name "wrapper" is a specific string which is part of the FFI spec - it converts a Haskell function into a C function pointer. In main we put these pieces together, and other than the pervasive IO, it looks like the equivalent Haskell.

Note: In real code you should always call freeHaskellFunPtr after you have finished using a "wrapper" function, usually using bracket.
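As a minimal sketch of that note, the following pairs "wrapper" with freeHaskellFunPtr using bracket. To keep the example free of a C file, it calls the function pointer back through a "dynamic" import instead of the C apply function; that substitution, and the names withWrapped and callFunPtr, are assumptions of this sketch rather than part of the code above.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Exception (bracket)
import Foreign.C.Types (CInt)
import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

-- Turn a Haskell function into a C function pointer.
foreign import ccall "wrapper"
  wrap :: (CInt -> IO CInt) -> IO (FunPtr (CInt -> IO CInt))

-- Stand-in for the C `apply`: a "dynamic" import calls a C function pointer.
foreign import ccall "dynamic"
  callFunPtr :: FunPtr (CInt -> IO CInt) -> CInt -> IO CInt

-- Allocate the wrapper, run the action, and always free the wrapper afterwards.
withWrapped :: (CInt -> IO CInt) -> (FunPtr (CInt -> IO CInt) -> IO a) -> IO a
withWrapped f = bracket (wrap f) freeHaskellFunPtr

main :: IO ()
main = do
  res <- withWrapped (\x -> return (x + 20)) (\fp -> callFunPtr fp 22)
  print res  -- prints 42
```

The function pointer must not escape the bracket, since it is invalid once freeHaskellFunPtr has run.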

Consequences of Exceptions

What happens if the function we pass to wrap throws an exception? If you read the GHC manual, you'll find an incomplete link to the FFI spec, which stays silent on the subject. Thinking it through, Haskell has exceptions, but C does not - if the Haskell function throws an exception it can't be passed back through C. Haskell can't provide a return value, so it can never resume the C code that called it. The GHC runtime can block indefinitely or kill the thread, both of which are fairly fatal for a program. As a consequence, I strongly recommend never throwing an exception from a function generated by "wrapper" - but what if we do?

Suggestion: most of the FFI addendum should probably be reproduced in the GHC manual with details around corner cases and exceptions.

Testing Exceptions

First, let's change our wrapped function to wrap $ \x -> fail "finish". Running that prints out:

bug.exe: user error (finish)

That seems like a standard exception. However, let's go further and put the entire program inside a finally, to show we have a normal Haskell exception:

main = flip finally (print "done") $ do

The output doesn't change - we never print out "done". It seems the exception thrown inside wrap aborts the program rather than bubbling up.

Suggestion: This error looks like a normal exception, but really isn't. It should say you have violated the wrapper invariant and your program has been violently aborted.

We've encountered bad behaviour, but can we go worse? Yes we can, by adding threads:

main = do
    replicateM_ 100 $ do
        forkIO $ do
            ff <- wrap $ \_ -> fail "die"
            print =<< apply ff 12
    threadDelay 10000000

Here we spawn 100 threads, each of which does an apply with an exception, then we wait for 10 seconds. The output is:

bug.exe: user error (die)
bug.exe: user error (die)
bug.exe: warning: too many hs_exit()s

It looks like there is a race condition with the exit path, causing two fatal wrapper exceptions to try and take down the runtime twice.

Suggestion: The hs_exit bug should be fixed.

Avoiding Exceptions

Now we know we need to avoid throwing exceptions inside "wrapper" functions, the obvious approach is to wrap them in a catch, e.g.:

wrap $ \x -> ... `catch` \(_ :: SomeException) -> return (-1)

Namely catch all exceptions, and replace them with -1. As usual with catch, it is important to force evaluation of the ... inside the catch (e.g. using catchDeep from safe-exceptions). If you want to recover the original exception you can capture it in an IORef and throw it after leaving C:

ref <- newIORef Nothing
f <- wrap $ \x -> ... `catch` \(e :: SomeException) -> do
    writeIORef ref $ Just e
    return (-1)
res <- apply f 22
whenJustM (readIORef ref) throwIO
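For reference, here is one way the fragment might look as a complete program. It is a sketch under a few assumptions: the helper name applySafely is hypothetical, the C apply function is replaced by a "dynamic" import so no C file is needed, and only base is used (so maybe stands in for whenJustM).

```haskell
{-# LANGUAGE ForeignFunctionInterface, ScopedTypeVariables #-}
import Control.Exception (SomeException, bracket, catch, evaluate, throwIO, try)
import Data.IORef (newIORef, readIORef, writeIORef)
import Foreign.C.Types (CInt)
import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

foreign import ccall "wrapper"
  wrap :: (CInt -> IO CInt) -> IO (FunPtr (CInt -> IO CInt))

-- Stand-in for the C `apply`: a "dynamic" import calls a C function pointer.
foreign import ccall "dynamic"
  callFunPtr :: FunPtr (CInt -> IO CInt) -> CInt -> IO CInt

-- Catch everything inside the wrapper, then rethrow once we are back in Haskell.
applySafely :: (CInt -> IO CInt) -> CInt -> IO CInt
applySafely f x = do
  ref <- newIORef Nothing
  -- `evaluate` forces the result inside the catch, so lazy errors are caught too.
  let guarded y = (f y >>= evaluate) `catch` \(e :: SomeException) -> do
        writeIORef ref (Just e)
        return (-1)
  res <- bracket (wrap guarded) freeHaskellFunPtr (\fp -> callFunPtr fp x)
  readIORef ref >>= maybe (return res) throwIO

main :: IO ()
main = do
  ok <- applySafely (\x -> return (x + 20)) 22
  print ok
  bad <- try (applySafely (\_ -> fail "finish") 22)
  case bad of
    Left (e :: SomeException) -> putStrLn ("rethrown in Haskell: " ++ show e)
    Right v -> print v
```

The first call returns normally, while the second surfaces the failure as an ordinary Haskell exception after the foreign call has returned.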

However, what if there is an asynchronous exception after we leave the catch but before we return to C? From my experiments, this doesn't appear to be possible. Even though getMaskingState returns Unmasked, exceptions thrown to the function inside wrapper appear to be deferred until the C code returns.

Suggestion: The documentation should clarify if my experiments are correct. Should getMaskingState return MaskedUninterruptible?

by Neil Mitchell at November 26, 2017 09:39 PM

November 25, 2017

Joachim Breitner

Existence and Termination

I recently had some intense discussions that revolved around issues of existence and termination of functions in Coq, about axioms and what certain proofs actually mean. We came across some interesting questions and thoughts that I’ll share with those of my blog readers with an interest in proofs and interactive theorem proving.


  • It can be meaningful to assume the existence of a function in Coq, and under that assumption prove its termination and other properties.
  • Axioms and assumptions are logically equivalent.
  • Unsound axioms do not necessarily invalidate a theory development, when additional meta-rules govern their use.


Our main running example is the infamous Collatz series. Starting at any natural number, the next is calculated as follows:

Require Import Coq.Arith.Arith.

Definition next (n : nat) :nat :=
  if Nat.even n then n / 2 else 3*n + 1.

If you start with some positive number, you are going to end up reaching 1 eventually. Or are you? So far nobody has found a number where that does not happen, but we also do not have a proof that it never happens. It is one of the great mysteries of Mathematics, and if you can solve it, you’ll be famous.

A failed definition

But assume we had an idea on how to prove that we are always going to reach 1, and tried to formalize this in Coq. One attempt might be to write

Fixpoint good (n : nat) : bool :=
  if n <=? 1
    then true
    else good (next n).

Theorem collatz: forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

Unfortunately, this does not work: Coq rejects this recursive definition of the function good, because it does not see how that is a terminating function, and Coq requires all such recursive function definitions to be obviously terminating – without this check there would be a risk of Coq’s type checking becoming incomplete or its logic being unsound.

The idiomatic way to avoid this problem is to state good as an inductive predicate... but let me explore another idea here.

Working with assumptions

What happens if we just assume that the function good, described above, exists, and then perform our proof:

Theorem collatz
  (good : nat -> bool)
  (good_eq : forall n,
     good n = if n <=? 1 then true else good (next n))
  : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

Would we accept this as a proof of Collatz’ conjecture? Or did we just assume what we want to prove, in which case the theorem is vacuously true, but we just performed useless circular reasoning?

Upon close inspection, we find that the assumptions of the theorem (good and good_eq) are certainly satisfiable:

Definition trivial (n: nat) : bool := true.

Lemma trivial_eq: forall n,
  trivial n = if n <=? 1 then true else trivial (next n).
Proof. intro; case (n <=? 1); reflexivity. Qed.

Lemma collatz_trivial: forall n, trivial n = true.
Proof. apply (collatz trivial trivial_eq). Qed.

So clearly there exists a function of type nat -> bool that satisfies the assumed equation. This is good, because it means that the collatz theorem is not simply assuming False!

Some (including me) might already be happy with this theorem and proof, as it clearly states: “Every function that follows the Collatz series eventually reaches 1”.

Others might still not be at ease with such a proof. Above we have seen that we cannot define the real collatz series in Coq. How can the collatz theorem say something that is not definable?

Classical reasoning

One possible way of getting some assurance is to define good as a classical function. The logic of Coq can be extended with the law of the excluded middle without making it inconsistent, and with that axiom, we can define a version of good that is pretty convincing (sorry for the slightly messy proof):

Require Import Coq.Logic.ClassicalDescription.
Require Import Omega.
Definition classical_good (n:nat) : bool :=
  if excluded_middle_informative (exists m, Nat.iter m next n <= 1)
  then true else false.

Lemma iter_shift:
  forall a f x (y:a), Nat.iter x f (f y) = f (Nat.iter x f y).
 intros. induction x. reflexivity. simpl. rewrite IHx. reflexivity. Qed.

Lemma classical_good_eq: forall n,
  classical_good n = if n <=? 1 then true else classical_good (next n).
Proof.
  intros.
  unfold classical_good at 1.
  destruct (Nat.leb_spec n 1).
  * destruct (excluded_middle_informative _); try auto.
    contradict n0. exists 0. simpl. assumption.
  * unfold classical_good.
    destruct (Nat.eqb_spec (next n) 0); try auto.
    destruct (excluded_middle_informative _), (excluded_middle_informative _); auto.
    - contradict n0.
      destruct e0.
      destruct x; simpl in *. omega.
      exists x. rewrite iter_shift. assumption.
    - contradict n0.
      destruct e0.
      exists (S x). simpl. rewrite iter_shift in H0. assumption.
Qed.

Lemma collatz_classical: forall n, classical_good n = true.
Proof. apply (collatz classical_good classical_good_eq). Qed.

The point of this is not so much to use this particular definition of good, but merely to convince ourselves that the assumptions of the collatz theorem above encompass “the” Collatz series, and thus constitutes a proof of the Collatz conjecture.

The main take-away so far is that existence and termination of a function are two separate issues, and it is possible to assume the former, prove the latter, and not have done a vacuous proof.

The ice gets thinner


Starting with the above Theorem collatz, there is another train of thought I invite you to follow along.

Probably the “genius idea” proof will be more than a few lines long, and we probably want to be able to declare helper lemmas and other things along the way. Doing all that in the body of the collatz proof is not very convenient, so instead of using assumptions, we might write

Section collatz.
Variable good : nat -> bool.
Variable good_eq : forall n,
  good n = if n <=? 1 then true else good (next n).

Theorem collatz2 : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.
End collatz.

So far so good: Clearly, I just refactored my code a bit, but did not make any significant change. The theorems collatz2 and collatz are equivalent.

Sound axioms

But note that we do not really intend to instantiate collatz2. We know that the assumptions are satisfiable (e.g. since we can define trivial or classical_good). So maybe, we would rather avoid the Section mechanism and simply write

Axiom good : nat -> bool.
Axiom good_eq : forall n,
  good n = if n <=? 1 then true else good (next n).

Theorem collatz3 : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

I assume this will make a few of my readers’ eyebrows go up: How can I dare to start with such Axioms? Do they not invalidate my whole development?

On the other hand, all that a Coq axiom is doing is saying “the following theorems are under the assumption that the axiom holds”. In that sense, collatz3 and collatz2 are essentially equivalent.

Unsound axioms

Let me take it one step further, and change that to:

Axiom unsafeFix : forall a, (a -> a) -> a.
Axiom unsafeFix_eq : forall f, unsafeFix f = f (unsafeFix f).
Definition good : nat -> bool :=
  unsafeFix (fun good n => if n <=? 1 then true else good (next n)).

Theorem collatz4 : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

At this point, the majority of my readers will cringe. The axiom unsafeFix is so blatantly unsound (in Coq), how do I even dare to think of using it? But bear with me for a moment: I did not change the proof. So maybe the collatz4 theorem is still worth something?

I want to argue that it is: Both unsafeFix and unsafeFix_eq are unsound in their full generality. But as long as I instantiate them only with functions f which have a fixpoint, then I cannot prove False this way. So while “Coq + unsafeFix” is unsound, “Coq + unsafeFix + unsafeFix_eq + metarule that these axioms are only called with permissible f” is not.
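For readers coming from Haskell, the shape of unsafeFix may look familiar: fix from Data.Function satisfies the same unfolding equation, fix f = f (fix f), and Haskell tolerates it because every type already admits non-terminating values. The following sketch, which is only an analogy and not part of the Coq development, defines good that way and evaluates it on a finite range. A finite check like this says nothing about all n; it merely shows that the equation can be run.

```haskell
import Data.Function (fix)

-- One step of the Collatz series, mirroring the Coq definition of next.
next :: Integer -> Integer
next n = if even n then n `div` 2 else 3 * n + 1

-- Same shape as the Coq definition, with fix in place of unsafeFix.
good :: Integer -> Bool
good = fix (\self n -> if n <= 1 then True else self (next n))

main :: IO ()
main = print (all good [1 .. 1000])  -- prints True
```

If the Collatz conjecture were false for some input, good would simply loop on it, which is exactly the behaviour Coq has no room for.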

In that light, my collatz4 proof carries the same meaning as the collatz3 proof, it is just less convenient to check: If I were to check the validity of collatz3, I have to maybe look for uses of admit, or some misleading use of syntax or other tricks, or other smells. When I have to check the validity of collatz4, I also have to additionally check the meta-rule -- tedious, but certainly possible (e.g. by inspecting the proof term).

Beyond Collatz

The questions discussed here did not come up in the context of the Collatz series (for which I unfortunately do not have a proof), but rather the verification of Haskell code in Coq using hs-to-coq. I started with the idiomatic Haskell definition of “Quicksort”:

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = quicksort lesser ++ [p] ++ quicksort greater
    where (lesser, greater) = partition (<p) xs

This function is not terminating in a way that is obvious to the Coq type checker. Conveniently, hs-to-coq can optionally create the Coq code using the unsafeFix axiom above, producing (roughly):

Definition quicksort {a} `{Ord a} : list a -> list a :=
  unsafeFix (fun quicksort xs =>
    match xs with
      | nil => nil
      | p :: xs => match partition (fun x => x <? p) xs with
         | (lesser, greater) => quicksort lesser ++ [p] ++ quicksort greater
         end
    end).

I then proved (roughly)

Theorem quicksort_sorted:
  forall a `(Ord a) (xs : list a), StronglySorted (quicksort xs).


Theorem quicksort_permutation:
  forall a `(Ord a) (xs : list a), Permutation (quicksort xs) xs.

These proofs proceed by well-founded induction on the length of the argument xs, and hence encompass a termination proof of quicksort. Note that with an only partially correct but non-terminating definition of quicksort (e.g. quicksort := unsafeFix (fun quicksort xs => quicksort xs)) I would not be able to conclude these proofs.
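To make the two statements concrete on the Haskell side, here is a small sketch that checks them on a handful of sample inputs. The helpers isSorted and isPermutationOf are hypothetical runtime stand-ins for StronglySorted and Permutation, and testing a few inputs is of course much weaker than the Coq proofs.

```haskell
import Data.List (partition, sort)

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = quicksort lesser ++ [p] ++ quicksort greater
  where (lesser, greater) = partition (< p) xs

-- Runtime stand-in for StronglySorted: every adjacent pair is in order.
isSorted :: Ord a => [a] -> Bool
isSorted ys = and (zipWith (<=) ys (drop 1 ys))

-- Runtime stand-in for Permutation: same elements with the same multiplicities.
isPermutationOf :: Ord a => [a] -> [a] -> Bool
isPermutationOf ys xs = sort ys == sort xs

main :: IO ()
main = do
  let samples = ["haskell", "", "aaa", "quicksort"]
  print (all (isSorted . quicksort) samples)                      -- prints True
  print (all (\xs -> quicksort xs `isPermutationOf` xs) samples)  -- prints True
```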

My (not undisputed) claim about the meaning of these theorems is therefore

If the Haskell equations for quicksort actually have a fixed point, then the use of unsafeFix in its definition does not introduce any inconsistency. Under this assumption, we showed that quicksort always terminates and produces a sorted version of the input list.

Do you agree?

by Joachim Breitner at November 25, 2017 08:54 PM

November 24, 2017

Functional Jobs

Software Engineer (Haskell, Full Stack) at Capital Match (Full-time)

About Us: Capital Match is a leading P2P lending platform based in Singapore, founded in 2014, backed by an alternative investment management firm with more than US$ 5 bn AUM. We are looking for experienced developers to lead our tech growth in the Fintech space, expand into surrounding countries and develop new products on the platform.

Job Description: We are inviting developers with a minimum of 5 years coding experience. The candidate should have functional programming experience as well as experience in developing server and web applications. An interest in all aspects of the creation, growth and operations of a secure web-based platform: front-to-back feature development, distributed deployment and automation in the cloud, build and test automation, is highly desirable. A background in fintech and especially the lending space would be an advantage (but not essential).

Job Requirements: Our platform is primarily developed in Haskell with a ClojureScript frontend. Candidates should ideally have production experience with Haskell, or strong experience with at least one other functional programming language. (For example: OCaml/F#/Scala/Clojure/Lisp/Erlang)

We use Docker containers and standard cloud infrastructure systems to manage our production rollouts, so familiarity with Linux systems, command-line environments and cloud-based deployments is highly desirable. Exposure to and understanding of XP practices such as TDD, CI, Emergent Design, Refactoring, Peer Review and Continuous Improvement is highly desirable.

We are inviting developers with at least 5 years of software engineering experience.

Offer: We offer a combination of salary and equity depending on experience and skills of the candidate. Most expats who relocate to Singapore do not have to pay their home country taxes and the local tax rate in Singapore is more or less 5%. Visa sponsorship will be provided. Singapore is a great place to live, a vibrant city rich with diverse cultures, a very strong financial sector and a central location in Southeast Asia.

Get information on how to apply for this position.

November 24, 2017 08:40 AM

Sandy Maguire

Gentle Theorems: Difference of Squares


November 24, 2017 · math

I have a (not very controversial) feeling that people don’t feel as though algebra is actually a thing you can use for stuff. I fall into this trap myself often, despite being someone who does math for a living, and so I suspect this is a pretty wide-spread phenomenon. Let me explain.

For example, consider the equation:

\[ (x + y)(x - y) = x^2 - y^2 \]

This is known as the difference of squares. Let’s work through the derivation of it together:

\[ \begin{align*} (x + y)(x - y) &= (x + y)(x - y) \\ &= x^2 + xy - xy - y^2 \\ &= x^2 + \cancel{xy - xy} - y^2 \\ &= x^2 - y^2 \end{align*} \]

Recall that we can use the FOIL method to get from the first line to the second.

I implore you to read through this proof carefully, and convince yourself of its truthfulness – even if you don’t consider yourself a “math” person. Believe it or not, there’s a point I’m getting to.

Anyway – by all accounts, this difference of squares thing is a pretty humdrum theorem. Who really cares, right? Let’s switch gears for a bit and talk about something more interesting.

Recall that \(20 \times 20 = 400\). As an interesting question, without actually computing it, let’s think about the product \(19 \times 21\). What does this equal? It seems like it could also be \(400\) – after all, all we did was take one away from the left side of the times and move it to the right.

In fact, if you work it out, \(19 \times 21 = 399\). That’s kind of interesting: somehow we lost a \(1\) by shuffling around the things we were multiplying.

This seems to not be an isolated incident:

\[ \begin{align*} 5 \times 5 &= 25 \\ \text{but,}\quad4 \times 6 &= 24 \end{align*} \]

\[ \begin{align*} 10 \times 10 &= 100 \\ \text{but,}\quad9 \times 11 &= 99 \end{align*} \]

An intriguing question to ask yourself is whether this is always true, or whether we’ve just gotten lucky with the examples we looked at.

But the more interesting question, in my opinion, is what happens if we go from \(19 \times 21 = 399\) to \(18\times22\). Will we lose another \(1\) when we fiddle with it? Or will something else happen? Form an opinion on what the answer will be before continuing.

\[ \begin{align*} 20 \times 20 &= 400 \\ \text{but,}\quad 21 \times 19 &= 399 \\ \text{but,}\quad 22 \times 18 &= 396 \end{align*} \]

Weird – somehow we lost \(3\) that time. What’s happened here?

If you’re confused (and I was, when I first saw this), don’t despair. As it happens, you already know the answer!

So, what’s going on here? Well, we’ve actually just been dealing with differences of squares the whole time – probably without even realizing it!

Most people, I think, fail to connect the algebraic fact that \((x+y)(x-y)=x^2-y^2\) to the fact that \(22\times18=396\). If you still don’t see why, we can explicitly fill in our variables:

\[ \begin{align*} 22\times18&=(20+2)(20-2)\\ &=20^2-2^2 \\ &= 400 - 4 \\ &= 396 \end{align*} \]

Neat, right? Even if you carefully read through the proof of the difference of squares earlier, you might not have noticed that we’ve been playing with them the entire time! I blame western math education for this; too often are equations presented only to be solved, and never to be thought about. It’s a disservice we’ve done to ourselves.

The takeaway of all of this, in my opinion, is that we should spend some time thinking about the notion of equality, about the \(=\) symbol. Ever since looking at this difference of squares thing, I’ve started viewing \(=\) not as the symbol which separates the left side of an equation from the right, but as a transformation. The \(=\) sign transforms something we can experience into something we can manipulate, and back again.

What I mean by that is that it’s a lot easier to conceptualize \(22\times18\) than it is to think about \((x+y)(x-y)\). The numeric representation is better suited for human minds to experience, while the algebraic expression is better at twiddling. We know how to twiddle algebra, but twiddling numbers themselves is rather meaningless.

In terms of everyday usefulness, this isn’t particularly helpful, except that it’s often easier to compute a difference of squares than it is to do the multiplication naively. If you can recognize one, you could probably impress someone with your mental arithmetic – but, again, it’s not going to revolutionize your life in any way.
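If you'd like to poke at the pattern yourself, here is a small Haskell sketch. The function name mentalProduct is made up for illustration; it rewrites a product as a difference of squares whenever the two factors sit symmetrically around a whole-number midpoint.

```haskell
-- Rewrite a * b as m^2 - k^2, where m is the midpoint of a and b
-- and k is the distance from the midpoint to either factor.
mentalProduct :: Integer -> Integer -> Maybe Integer
mentalProduct a b
  | even (a + b) = Just (m * m - k * k)
  | otherwise = Nothing  -- no whole-number midpoint, so the trick does not apply directly
  where
    m = (a + b) `div` 2
    k = (a - b) `div` 2

main :: IO ()
main = do
  print (mentalProduct 22 18)  -- prints Just 396
  -- How much we "lose" relative to 20 * 20 as the factors move apart:
  print [20 * 20 - (20 + k) * (20 - k) | k <- [0 .. 4 :: Integer]]  -- prints [0,1,4,9,16]
  -- The identity itself, checked on a small grid of integers:
  print (and [(x + y) * (x - y) == x * x - y * y | x <- [-20 .. 20], y <- [-20 .. 20 :: Integer]])  -- prints True
```

The second line of output shows the answer to the earlier puzzle: the amount lost is always the square of the distance moved.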

All of this is to say that math is neat. Even if you don’t see any practical value in this stuff, hopefully you’ll agree that there might be interesting puzzles to be found here. And, as it turns out, algebra can be a satisfying tool for solving these puzzles.

Thanks to Matt Parsons for proof-reading an early version of this post.


November 24, 2017 12:00 AM

November 22, 2017

Robert Harper

Sequentiality as the Essence of Parallelism

I recently thought of a nice way to structure a language for parallel programming around the concept of sequential composition.  Think of parallelism as the default—evaluate everything in parallel unless the semantics of the situation precludes it: sums are posterior to summands, but the summands can be evaluated simultaneously.  You need a way to express the necessary dependencies without introducing any spurious ones.

There’s a tool for that, called lax logic, introduced by Fairtlough and Mendler and elaborated by Davies and Pfenning, which I use extensively in PFPL.  The imperative language Modernized Algol is formulated in the lax style, distinguishing two modes, or levels, of syntax, the (pure) expressions and the (impure) commands.  The lax modality, which links the two layers, behaves roughly like a monad, but, all the hype notwithstanding, it is not the central player.  It’s the modes, not the modality, that matter.  (See the Commentary on PFPL for more.)

The lax modality is just the ticket for expressing parallelism.  Rather than separate expressions from commands, here we distinguish between values and computations.  The names are important, to avoid possible confusion.  Values are fully evaluated; they are not a source of parallelism.  (If values were called “pure”, it would be irresistible to think otherwise.)  Computations have yet to be evaluated; they engender parallelism by sequential composition.  What?  No, you didn’t nod off! Let me explain.

Parallelism is all about the join points.  If parallel execution is the default, then the job of the programmer is not to induce parallelism, but to harness it.  And you do that by saying, “this computation depends on these others.”  Absent that, there is nothing else to say, just go for it.  No sub-languages.  No program analysis.  No escaping the monad.  Just express the necessary dependencies, and you’re good to go.

So, what are the join points?  They are the elimination forms for two parallel modalities.  They generalize the sequential case to allow for statically and dynamically determined parallelism.  A value of parallel product type is a tuple of unevaluated computations, a kind of “lazy” tuple (but not that kind of laziness; here I just mean unevaluated components).  The elimination form evaluates all of the component computations in parallel, creates a value tuple from their values, and passes it to the body of the form.  Similarly, a value of parallel sequence type is a generator consisting of two values, a natural number n indicating its size, and a function determining the ith component computation for each 0≤i<n.  The elimination form activates all n component computations, binds their values to a value sequence, and passes it to the body of the form.
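
As a loose analogy only (Haskell's IO is not the modal language described here), the elimination form for a parallel sequence can be sketched with threads from base. The name joinAll is hypothetical, and the sketch ignores exceptions: if a component throws, the join will not complete normally.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- | Activate every component computation, wait for all of their values,
-- then pass the resulting value sequence to the body.
joinAll :: [IO a] -> ([a] -> IO b) -> IO b
joinAll computations body = do
  slots  <- mapM spawn computations   -- start all components
  values <- mapM takeMVar slots       -- the join point: wait for each value
  body values
  where
    -- Run one computation on its own thread, delivering its value to a slot.
    spawn computation = do
      slot <- newEmptyMVar
      _ <- forkIO (computation >>= putMVar slot)
      pure slot
```

The only dependency expressed is the one the text describes: the body depends on all of the components, and the components do not depend on each other.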

The join point effects a change of type, from encapsulated computations to evaluated values, neatly generalizing sequential composition from a unary to a multiway join.  If you’d like, the parallel products and parallel sequences are “generalized monads” that encapsulate not just one, but many, unevaluated computations.  But they are no more monads than they are in any other functional language: the categorial equational laws need not hold in the presence of, say, divergence, or exceptions.

The dynamics assigns costs to computations, not to values, whose cost of creation has already been paid.  The computation that just returns a value has unit work and span.  Primitive operations take unit work and span.  The sequential composition of a parallel product with n components induces span one more than the maximum span of the constituents, and induces work one more than the sum of their work.  The dynamics of sequential composition for parallel sequences is similar, with the “arity” being determined dynamically rather than statically.
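
The cost rules above can be sketched as a small calculation. This is only an illustration of the arithmetic, with hypothetical names (Cost, unitCost, joinCost), and it accounts for the join itself, not for the body that consumes the resulting tuple.

```haskell
-- | Work (total operations) and span (critical-path length).
data Cost = Cost { costWork :: Int, costSpan :: Int }
  deriving (Eq, Show)

-- | Returning a value, or running a primitive operation: unit work and span.
unitCost :: Cost
unitCost = Cost 1 1

-- | Cost induced by joining a parallel product whose components have the
-- given costs: work is one more than the sum of the components' work,
-- span is one more than the maximum of their spans.
joinCost :: [Cost] -> Cost
joinCost components = Cost
  { costWork = 1 + sum (map costWork components)
  , costSpan = 1 + maximum (0 : map costSpan components)
  }
```

For example, joining components with costs Cost 1 1, Cost 5 3 and Cost 2 2 gives work 9 and span 4: the work adds up, while the span follows only the slowest component.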

Programming in this style means making the join points explicit.  If you don’t like that, you can easily define derived forms—and derived costs—for constructs that do it for you.    For example, a pair of computations might be rendered as activating a parallel pair of its components, then returning the resulting value pair.  And so on and so forth.  It’s no big deal.

En passant the modal formulation of parallelism solves a nasty technical problem in a substitution-based cost semantics that does not make the modal distinction.  The issue is, how to distinguish between the creation of a value, and the many re-uses of it arising from substitution?  It’s not correct to charge again and again for creating the value each time you see it (this cost can be asymptotically significant), but you do have to charge for creating it somewhere (it’s not free, and it can matter).  And, anyway, how is one to account for the cost of assessing whether an expression is, in fact, a value?  The usual move is to use an environment semantics to manage sharing.  But you don’t have to, the modal framework solves the problem, by distinguishing between a value per se; the computation that returns it fully created; and the computation that incrementally constructs it from its constituent parts.  It’s the old cons-vs-dotted pair issue, neatly resolved.

Please see Section 10 of the Commentary on PFPL for a fuller account.  The main idea is to generalize a type of single unevaluated computations, which arises in lax logic, to types of statically- and dynamically many unevaluated computations.  The bind operation becomes a join operation for these computations, turning a “lazy” tuple or sequence into eager tuples or sequences.

Updates: word-smithing, added cite to Davies-Pfenning, replaced cite of course notes with reference to commentary.

Filed under: Programming, Research, Teaching Tagged: functional programming, parallelism, programming languages, semantics

by Robert Harper at November 22, 2017 06:18 PM

FP Complete

Lambda Conference and Haskell Survey


Functional programmers are a unique breed of software development professionals. They have decided that the traditional methods of solving problems are not good enough. In their quest to find the most efficient way to find solutions they eventually stumble upon functional programming. Functional programmers also know they are a minority among their programming peers and don't enjoy the cornucopia of resources available to imperative language developers. That's why Lambda World is such an important conference. Michael Snoyman, our VP of Engineering, spoke to his functional programming peers when he discussed "Everything you didn’t want to know about Monad transformer state".

by Robert Bobbett at November 22, 2017 06:16 PM

November 21, 2017

The GHC Team

GHC 8.2.2 is available

The GHC Team is pleased to announce a new minor release of GHC. This release builds on the performance and stability improvements of 8.2.1, fixing a variety of correctness bugs, improving error messages, and making the compiler more portable.

Notable bug-fixes include

  • A correctness issue resulting in segmentation faults in some FFI-users (#13707, #14346)
  • A correctness issue resulting in undefined behavior in some programs using STM (#14171)
  • A bug which may have manifested in segmentation faults in out-of-memory conditions (#14329)
  • clearBit of Natural no longer bottoms (#13203)
  • A specialisation bug resulting in exponential blowup of compilation time in some specialisation-intensive programs (#14379)
  • ghc-pkg now works even in environments with misconfigured NFS mounts (#13945)
  • GHC again supports production of position-independent executables (#13702)

A thorough list of the changes in the release can be found in the release notes.

How to get it

This release can be downloaded from

For older versions see

We supply binary builds in the native package format for many platforms, and the source distribution is available from the same place.


Haskell is a standard lazy functional programming language.

GHC is a state-of-the-art programming suite for Haskell. Included is an optimising compiler generating efficient code for a variety of platforms, together with an interactive system for convenient, quick development. The distribution includes space and time profiling facilities, a large collection of libraries, and support for various language extensions, including concurrency, exceptions, and foreign language interfaces. GHC is distributed under a BSD-style open source license.

A wide variety of Haskell related resources (tutorials, libraries, specifications, documentation, compilers, interpreters, references, contact information, links to research groups) are available from the Haskell home page (see below).

On-line GHC-related resources

Relevant URLs on the World-Wide Web:

Supported Platforms

The list of platforms we support, and the people responsible for them, is here

Ports to other platforms are possible with varying degrees of difficulty. The Building Guide describes how to go about porting to a new platform.


We welcome new contributors. Instructions on accessing our source code repository, and getting started with hacking on GHC, are available from the GHC developers' site, run by Trac.

Community Resources

There are mailing lists for GHC users, developers, and monitoring bug tracker activity; to subscribe, use the Mailman web interface.

There are several other Haskell and GHC-related mailing lists on; for the full list, see the lists page.

Some GHC developers hang out on the #ghc and #haskell channels of the Freenode IRC network, too. See the Haskell wiki for details.

Please report bugs using our bug tracking system. Instructions on reporting bugs can be found here.

by Ben Gamari at November 21, 2017 10:06 PM

Yesod Web Framework

mega-sdist: the mega repo helper

Many years ago, I wrote a utility called mega-sdist to help me with managing mega repos (more on that below). I've been using it myself ever since, making some minor improvements over the years. But I realized recently that I never really announced it to others, and especially not to the people whom it would help the most: other Yesod contributors and maintainers. Consider this the (massively belated) announcement.

You can find the most up-to-date information in the project on Github. Below is the current content of that file, to help save you a click.

This is a utility written to address the specific needs in maintaining Haskell "mega-repos," or Git repositories containing multiple Cabal projects. It is intended to ease the process of deciding which packages need to be released and tagging those releases appropriately.

It provides the following functionality:

  • Detect when local code has changed from what's on Hackage
    • Note that, due to Hackage revisions, sometimes this logic isn't perfect
  • Detect when a version number needs to be updated
  • Dump the difference between the Hackage version of your package and the local version

To install it... well, listen. This tool is intended for people authoring Haskell packages. Odds are, you already know how to do this. And if you don't know, this probably isn't a tool that will help you. Anyway, in order to install it, first install Stack and then run stack install mega-sdist, or just stack install inside this repository.

Opinionated tool

This utility is highly opinionated in some ways, e.g.:

  • It only supports one style of Git tag name: packagename/version. This may look weird in non-mega-repos, where v1.2.3 looks better than foo/1.2.3, but for mega-repos the former doesn't make sense.
  • It depends on Stack for both discovering all of your local packages, and for uploading to Hackage.

If you're OK with these opinions, keep reading for usage.

Have I changed anything?

Let's say I'm working on the monad-unlift megarepo (chosen as an example of a relatively small repo). I've merged some PRs recently, or at least think I have. But I don't remember which of the individual packages within the repo this affected. Instead of looking at the commit history like some caveman, I'll typically do:

$ git pull # make sure I have all latest changes
$ mega-sdist

The mega-sdist command will:

  • Build tarballs for all local packages
  • Check what the latest versions of my packages on Hackage are
  • Do a full diff on these two things and see if anything's changed

At the time of writing, here's the output from this repo:

The following packages from Hackage have not changed:

The following packages require a version bump:

What this means is:

  • The monad-unlift package I have locally is at version 0.2.0. And it perfectly matches that version on Hackage. No actions necessary.
  • The monad-unlift-ref package I have locally is at version 0.2.1. And it doesn't match the code on Hackage. Therefore, if I wanted to run stack upload monad-unlift-ref successfully, I'd need to bump the version number.

What did I change?

Well, again, if I wanted to see what changed, I could run (again, like a caveman):

$ git diff monad-unlift-ref/0.2.1 -- monad-unlift-ref

But that's long! mega-sdist's got your back. Just run:

$ mega-sdist monad-unlift-ref --get-diffs

This will print out the difference between the tarball uploaded to Hackage and what you have locally. Besides my tongue-in-cheek comment above, this is also useful if, for some reason, you either don't have or don't trust the tags in your Git repo.

One other thing: this diff is currently based on the pristine tarball from Hackage, ignoring cabal file revisions. So the difference may be slightly different from what you'd get from stack unpack monad-unlift-ref-0.2.1. But ¯\_(ツ)_/¯ that's revisions for you.

The default behavior of mega-sdist is to look at all packages specified in your stack.yaml. Targets can be any directory. And mega-sdist will automatically look at packages in any subdirectory, so that mega-sdist . is the same as mega-sdist at the root of your repo*.

* Assuming all of your packages are actually in your repo, but only crazy people would do otherwise.

Preparing a new release

OK, now I continue working on my project, and I've:

  • Made some changes to monad-unlift
  • Updated the cabal file's version number
    • And of course I also updated the changelog, I'm not some monster

From the root of my repo, I run:

$ mega-sdist monad-unlift

Or, equivalently, from inside the monad-unlift subdirectory I run:

$ mega-sdist .

Either way, I get:

The following new packages exist locally:

No version bumps required, good to go!

This tells me that my package has local changes, and the version number has been updated, so that stack upload monad-unlift will work. Neato! Now, you could just run stack upload ..., but here's what I usually do. First, I'll review the changes I'm about to upload and make sure there are no surprises:

$ mega-sdist --get-diffs .

The following new packages exist locally:
diff -r old/monad-unlift-0.2.0/ new/monad-unlift-0.2.1/
> ## 0.2.1
> * Silly changes
diff -r old/monad-unlift-0.2.0/Control/Monad/Trans/Unlift.hs new/monad-unlift-0.2.1/Control/Monad/Trans/Unlift.hs
> -- I just need some space
diff -r old/monad-unlift-0.2.0/monad-unlift.cabal new/monad-unlift-0.2.1/monad-unlift.cabal
< version:             0.2.0
> version:             0.2.1

No version bumps required, good to go!

OK, that's what I wanted. Time to release. Next, I'm going to use mega-sdist to tag the release:

$ mega-sdist --gittag .

From the root of my repo, this would notice that monad-unlift-ref still requires a version bump, and refuse to proceed. But inside the monad-unlift directory, it notices that all necessary version bumps are done, and happily tags:

$ mega-sdist --gittag .
The following new packages exist locally:

No version bumps required, good to go!
Raw command: git tag monad-unlift/0.2.1

And suddenly I notice something new:

$ ls tarballs/

Neat, mega-sdist left behind tarballs I can upload! To do so, I run:

$ stack upload tarballs/*

Note that this will work whether I'm trying to upload just one package, or all of the updated packages in my repo. Finally, I need to push the new tags to Github (or wherever):

$ git push --tags

And in fact, this upload sequence is so common that I have a shell alias set up:

$ alias upload
alias upload='mega-sdist --gittag . && stack upload tarballs/* && git push --tags'

So there you have it: a convenient little utility to help manage repos with lots of packages in them.

November 21, 2017 03:15 PM

Philip Wadler

Pay what you want for Java Generics and Collections

Humble Book Bundle is selling off a passel of Java books, including Java Generics and Collections by Naftalin and Wadler, on a pay-what-you-want basis (USD $1 minimum), DRM-free. You choose what proportion of the profits go to Humble and what goes to the charity Code for America. A great deal!

by Philip Wadler at November 21, 2017 12:16 PM

November 18, 2017

Sandy Maguire

Type-Directed Code Generation

<article> <header>


November 18, 2017

aka “Type-Level Icing Sugar”


At work recently I’ve been working on a library to get idiomatic gRPC support in our Haskell project. I’m quite proud of how it’s come out, and thought it’d make a good topic for a blog post. The approach demonstrates several type-level techniques that in my opinion are under-documented and exceptionally useful in using the type-system to enforce external contracts.

Thankfully the networking side of the library had already been done for me by Awake Security, but the interface feels like a thin wrapper on top of C bindings. I’m very, very grateful that it exists, but I wouldn’t expect myself to be able to use it in anger without causing an uncaught type error somewhere along the line. I’m sure I’m probably just using it wrong, but the library’s higher-level bindings all seemed to be targeted at Awake’s implementation of protobuffers.

We wanted a version that would play nicely with proto-lens, which, at time of writing, has no official support for describing RPC services via protobuffers. If you’re not familiar with proto-lens, it generates Haskell modules containing idiomatic types and lenses for protobuffers, and can be used directly in the build chain.

So the task was to add support to proto-lens for generating interfaces to RPC services defined in protobuffers.

My first approach was to generate the dumbest possible thing that could work – the idea was to generate records containing fields of the shape Request -> IO Response. Of course, with a network involved there is a non-negligible chance of things going wrong, so this interface should expose some means of dealing with errors. However, the protobuffer spec is agnostic about the actual RPC backend used, and so it wasn’t clear how to continue without assuming anything about the particulars behind errors.

More worrisome, however, was that RPCs can be marked as streaming – on the side of the client, server, or both. This means, for example, that a method marked as server-streaming has a different interface on either side of the network:

serverSide :: Request -> (Response -> IO ()) -> IO ()
clientSide :: Request -> (IO (Maybe Response) -> IO r) -> IO r

This is problematic. Should we generate different records corresponding to which side of the network we’re dealing with? An early approach I had was to parameterize the same record based on which side of the network, and use a type family to get the correct signature:

{-# LANGUAGE DataKinds #-}

data NetworkSide = Client | Server

data MyService side = MyService
  { runServerStreaming :: ServerStreamingType side Request Response
  }

type family ServerStreamingType (side :: NetworkSide) input output where
  ServerStreamingType Server input output =
      input -> (output -> IO ()) -> IO ()

  ServerStreamingType Client input output =
      forall r. input -> (IO (Maybe output) -> IO r) -> IO r

This seems like it would work, but in fact the existence of the forall on the client-side is “illegally polymorphic” in GHC’s eyes, and it will refuse to compile such a thing. Giving it up would mean we wouldn’t be able to return arbitrarily-computed values on the client-side while streaming data from the server. Users of the library might be able to get around it by invoking IORefs or something, but it would be ugly and non-idiomatic.

So that, along with wanting to be backend-agnostic, made this approach a no-go. Luckily, my brilliant coworker Judah Jacobson (who is coincidentally also the author of proto-lens), suggested we instead generate metadata for RPC services in proto-lens, and let backend library code figure it out from there.

With all of that context out of the way, we’re ready to get into the actual meat of the post. Finally.

Generating Metadata

According to the spec, a protobuffer service may contain zero or more RPC methods. Each method has a request and response type, either of which might be marked as streaming.

While we could represent this metadata at the term-level, that won’t do us any favors in terms of getting type-safe bindings to this stuff. And so, we instead turn to TypeFamilies, DataKinds and GHC.TypeLits.

For reasons that will become clear later, we chose to represent RPC services via types, and methods in those services as symbols (type-level strings). The relevant typeclasses look like this:

class Service s where
  type ServiceName    s :: Symbol

class HasMethod s (m :: Symbol) where
  type MethodInput       s m :: *
  type MethodOutput      s m :: *
  type IsClientStreaming s m :: Bool
  type IsServerStreaming s m :: Bool

For example, the instances generated for the RPC service:

service MyService {
  rpc BiDiStreaming(stream Request) returns (stream Response);
}

would look like this:

data MyService = MyService

instance Service MyService where
  type ServiceName    MyService = "myService"

instance HasMethod MyService "biDiStreaming" where
  type MethodInput       MyService "biDiStreaming" = Request
  type MethodOutput      MyService "biDiStreaming" = Response
  type IsClientStreaming MyService "biDiStreaming" = 'True
  type IsServerStreaming MyService "biDiStreaming" = 'True

You’ll notice that these typeclasses perfectly encode all of the information we had in the protobuffer definition. The idea is that with all of this metadata available to them, specific backends can generate type-safe interfaces to these RPCs. We’ll walk through the implementation of the gRPC bindings together.
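
To see how a backend might read this metadata back out, here is a minimal sketch that compiles on its own. The classes are copied from above (with Type in place of *); Request and Response are stand-ins for generated message types, and serviceName and methodLabel are hypothetical helpers, not part of proto-lens.

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, FlexibleInstances, KindSignatures,
             MultiParamTypeClasses, ScopedTypeVariables, TypeFamilies #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

class Service s where
  type ServiceName s :: Symbol

class HasMethod s (m :: Symbol) where
  type MethodInput       s m :: Type
  type MethodOutput      s m :: Type
  type IsClientStreaming s m :: Bool
  type IsServerStreaming s m :: Bool

-- Stand-ins for the message types that would normally be generated.
data Request  = Request
data Response = Response

data MyService = MyService

instance Service MyService where
  type ServiceName MyService = "myService"

instance HasMethod MyService "biDiStreaming" where
  type MethodInput       MyService "biDiStreaming" = Request
  type MethodOutput      MyService "biDiStreaming" = Response
  type IsClientStreaming MyService "biDiStreaming" = 'True
  type IsServerStreaming MyService "biDiStreaming" = 'True

-- | Hypothetical helper: recover the type-level service name as a String.
serviceName :: forall s. KnownSymbol (ServiceName s) => Proxy s -> String
serviceName _ = symbolVal (Proxy :: Proxy (ServiceName s))

-- | Hypothetical helper: a "service/method" label built from the metadata.
methodLabel
  :: forall s m. (HasMethod s m, KnownSymbol (ServiceName s), KnownSymbol m)
  => Proxy s -> Proxy m -> String
methodLabel ps pm = serviceName ps ++ "/" ++ symbolVal pm
```

In GHCi, methodLabel (Proxy :: Proxy MyService) (Proxy :: Proxy "biDiStreaming") evaluates to "myService/biDiStreaming". Asking for a method with no HasMethod instance is rejected at compile time.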

The Client Side

The client side of things is relatively easy. We can use the HasMethod instance directly:

runNonStreamingClient
    :: HasMethod s m
    => s
    -> Proxy m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))
runNonStreamingClient = undefined  -- call the underlying gRPC code

runServerStreamingClient
    :: HasMethod s m
    => s
    -> Proxy m
    -> MethodInput s m
    -> (IO (Either GRPCError (Maybe (MethodOutput s m))) -> IO r)
    -> IO r
runServerStreamingClient = undefined  -- call the underlying gRPC code

-- etc

This is a great start! We’ve got the interface we wanted for the server-streaming code, and our functions are smart enough to require the correct request and response types.

However, there’s already some type-unsafety here; namely that nothing stops us from calling runNonStreamingClient on a streaming method, or other such silly things.

Thankfully the fix is quite easy – we can use type-level equality to force callers to be attentive to the streaming-ness of the method:

runNonStreamingClient
    :: ( HasMethod s m
       , IsClientStreaming s m ~ 'False
       , IsServerStreaming s m ~ 'False
       )
    => s
    -> Proxy m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))

runServerStreamingClient
    :: ( HasMethod s m
       , IsClientStreaming s m ~ 'False
       , IsServerStreaming s m ~ 'True
       )
    => s
    -> Proxy m
    -> MethodInput s m
    -> (IO (Either GRPCError (Maybe (MethodOutput s m))) -> IO r)
    -> IO r

-- et al.

Would-be callers attempting to use the wrong function for their method will now be warded off by the type-system, due to the equality constraints being unable to be discharged. Success!
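
Here is a small self-contained sketch of the equality-constraint trick, with the network taken out. callNonStreaming is a hypothetical stand-in for runNonStreamingClient that only reports which method it was asked for, and the "unary" method is invented for the example.

```haskell
{-# LANGUAGE DataKinds, FlexibleInstances, KindSignatures,
             MultiParamTypeClasses, TypeFamilies, TypeOperators #-}

import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

class HasMethod s (m :: Symbol) where
  type MethodInput       s m :: Type
  type MethodOutput      s m :: Type
  type IsClientStreaming s m :: Bool
  type IsServerStreaming s m :: Bool

-- Stand-ins for generated message types.
data Request  = Request
data Response = Response

data MyService = MyService

instance HasMethod MyService "unary" where
  type MethodInput       MyService "unary" = Request
  type MethodOutput      MyService "unary" = Response
  type IsClientStreaming MyService "unary" = 'False
  type IsServerStreaming MyService "unary" = 'False

instance HasMethod MyService "biDiStreaming" where
  type MethodInput       MyService "biDiStreaming" = Request
  type MethodOutput      MyService "biDiStreaming" = Response
  type IsClientStreaming MyService "biDiStreaming" = 'True
  type IsServerStreaming MyService "biDiStreaming" = 'True

-- | Hypothetical stand-in for a non-streaming client call. The two equality
-- constraints only hold for methods that stream in neither direction.
callNonStreaming
  :: ( HasMethod s m
     , IsClientStreaming s m ~ 'False
     , IsServerStreaming s m ~ 'False
     , KnownSymbol m
     )
  => s -> Proxy m -> String
callNonStreaming _ method = "non-streaming call to " ++ symbolVal method

-- Accepted: "unary" streams in neither direction.
accepted :: String
accepted = callNonStreaming MyService (Proxy :: Proxy "unary")

-- Rejected at compile time if uncommented, because
-- IsClientStreaming MyService "biDiStreaming" is 'True, not 'False:
-- rejected = callNonStreaming MyService (Proxy :: Proxy "biDiStreaming")
```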

The actual usability of this code leaves much to be desired (it requires being passed a proxy, and the type errors are absolutely disgusting), but we’ll circle back on improving it later. As it stands, this code is type-safe, and that’s good enough for us for the time being.

The Server Side

Method Discovery

Prepare yourself (but don’t panic!): the server side of things is significantly more involved.

In order to run a server, we’re going to need to be able to handle any sort of request that can be thrown at us. That means we’ll need an arbitrary number of handlers, depending on the service in question. An obvious thought would be to generate a record we could consume that would contain handlers for every method, but there’s no obvious place to generate such a thing. Recall: proto-lens can’t, since such a type would be backend-specific, and so our only other strategy down this path would be Template Haskell. Yuck.

Instead, recall that we have an instance of HasMethod for every method on Service s – maybe we could exploit that information somehow? Unfortunately, without Template Haskell, there’s no way to discover typeclass instances.

But that doesn’t mean we’re stumped. Remember that we control the code generation, and so if the representation we have isn’t powerful enough, we can change it. And indeed, the representation we have isn’t quite enough. We can go from a HasMethod s m to its Service s, but not the other way. So let’s change that.

We change the Service class slightly:

class Service s where
  type ServiceName    s :: Symbol
  type ServiceMethods s :: [Symbol]

If we ensure that the ServiceMethods s type family always contains an element for every instance of HasMethod, we’ll be able to use that info to discover our instances. For example, our previous MyService will now get generated thusly:

data MyService = MyService

instance Service MyService where
  type ServiceName    MyService = "myService"
  type ServiceMethods MyService = '["biDiStreaming"]

instance HasMethod MyService "biDiStreaming" where
  type MethodInput       MyService "biDiStreaming" = Request
  type MethodOutput      MyService "biDiStreaming" = Response
  type IsClientStreaming MyService "biDiStreaming" = 'True
  type IsServerStreaming MyService "biDiStreaming" = 'True

and we would likewise add the m for any other HasMethod MyService m instances if they existed.

It seems we can now use ServiceMethods s to get a list of methods, and then somehow type-level map over them to get the HasMethod s m constraints we want.

And we almost can, except that we haven’t told the type-system that ServiceMethods s relates to HasMethod s m instances in this way. We can add a superclass constraint to Service to do this:

class HasAllMethods s (ServiceMethods s) => Service s where
  -- as before

But what is this HasAllMethods thing? It’s a specialized type-level map which turns our list of methods into a bunch of constraints proving we have HasMethod s m for every m in that promoted list.

class HasAllMethods s (xs :: [Symbol])

instance HasAllMethods s '[]
instance (HasMethod s x, HasAllMethods s xs) => HasAllMethods s (x ': xs)

We can think of xs here as the list of constraints we want. Obviously if we don’t want any constraints (the '[] case), we trivially have all of them. The other case is induction: if we have a non-empty list of constraints we’re looking for, that’s the same as looking for the tail of the list, and having the constraint for the head of it.

Read through these instances a few times; make sure you understand the approach before continuing, because we’re going to keep using this technique in scarier and scarier ways.

With this HasAllMethods superclass constraint, we can now convince ourselves (and, more importantly, GHC), that we can go from a Service s constraint to all of its HasMethod s m constraints. Cool!
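
If the base-case/induction-case pattern is new to you, it may help to see it carry a term-level method. The sketch below is an illustrative variation, not the library's code: KnownMethods and methodNames are hypothetical names, and the class walks a promoted list of symbols exactly the way HasAllMethods does, collecting the names as Strings along the way.

```haskell
{-# LANGUAGE DataKinds, FlexibleInstances, KindSignatures,
             ScopedTypeVariables, TypeOperators #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- | Hypothetical class: every symbol in the list can be turned into a String.
class KnownMethods (xs :: [Symbol]) where
  methodNames :: Proxy xs -> [String]

-- Base case: no methods, so no names.
instance KnownMethods '[] where
  methodNames _ = []

-- Induction: the name of the head, followed by the names of the tail.
instance (KnownSymbol x, KnownMethods xs) => KnownMethods (x ': xs) where
  methodNames _ =
    symbolVal (Proxy :: Proxy x) : methodNames (Proxy :: Proxy xs)
```

In GHCi, methodNames (Proxy :: Proxy '["biDiStreaming", "unary"]) evaluates to ["biDiStreaming","unary"]. GHC picks the instance by looking at the shape of the list, which is the same mechanism that lets HasAllMethods demand a HasMethod s m for every m.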

Typing the Server

We return to thinking about how to actually run a server. As we’ve discussed, such a function will need to be able to handle every possible method, and, unfortunately, we can’t pack them into a convenient data structure.

Our actual implementation of such a thing might take a list of handlers. But recall that each handler has different input and output types, as well as different shapes depending on which bits of it are streaming. We can make this approach work by existentializing away all of the details.

While it works as far as the actual implementation of the underlying gRPC goes, we’re left with a great sense of uneasiness. We have no guarantees that we’ve provided a handler for every method, and the very nature of existentialization means we have absolutely no guarantees that any of these things are the right type.
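
To make that uneasiness concrete, here is a toy sketch of what an existentialized handler list might look like. Everything in it is hypothetical: this AnyHandler is not the type used by the real gRPC bindings, and Read/Show on Strings stand in for protobuffer decoding and encoding.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Text.Read (readMaybe)

-- | Hypothetical existential wrapper: the input and output types are hidden,
-- leaving only a method name and the ability to parse and print payloads.
data AnyHandler
  = forall i o. (Read i, Show o) => AnyHandler String (i -> IO o)

handlerName :: AnyHandler -> String
handlerName (AnyHandler name _) = name

-- | Find a handler by name and run it on a textual payload. A missing
-- handler or an ill-typed payload is only discovered at runtime.
dispatch :: [AnyHandler] -> String -> String -> IO (Maybe String)
dispatch handlers name payload =
  case filter ((== name) . handlerName) handlers of
    AnyHandler _ run : _ ->
      case readMaybe payload of
        Just input -> Just . show <$> run input
        Nothing    -> pure Nothing   -- payload had the wrong type
    [] -> pure Nothing               -- nobody registered this method

exampleHandlers :: [AnyHandler]
exampleHandlers =
  [ AnyHandler "double" (\n -> pure (2 * n :: Int))
  , AnyHandler "shout"  (\s -> pure (s ++ "!"))
  ]
```

Nothing in the type [AnyHandler] says which methods are covered or what they accept, so forgetting a handler compiles just fine. That gap is what the typed facade is meant to close.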

Our only recourse is to somehow use our Service s constraint to put a prettier facade in front of this ugly-if-necessary implementation detail.

The actual interface we’ll eventually provide will, for example, for a service with two methods, look like this:

runServer :: HandlerForMethod1 -> HandlerForMethod2 -> IO ()

Of course, we can’t know a priori how many methods there will be (or what type their handlers should have, for that matter). We’ll somehow need to extract this information from Service s – which is why we previously spent so much effort on making the methods discoverable.

The technique we’ll use is the same one you’ll find yourself using again and again when you’re programming at the type-level. We’ll make a typeclass with an associated type family, and then provide a base case and an induction case.

class HasServer s (xs :: [Symbol]) where
  type ServerType s xs :: *

We need to make the methods xs explicit as parameters in the typeclass, so that we can reduce them. The base case is simple – a server with no more handlers is just an IO action:

instance HasServer s '[] where
  type ServerType s '[] = IO ()

The induction case, however, is much more interesting:

instance ( HasMethod s x
         , HasMethodHandler s x
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs) = MethodHandler s x -> ServerType s xs

The idea is that as we pull methods x off our list of methods to handle, we build a function type that takes a value of the correct type to handle method x, which will take another method off the list until we’re out of methods to handle. This is exactly a type-level fold over a list.

The only remaining question is “what is this MethodHandler thing?” It’s going to have to be a type family that will give us back the correct type for the handler under consideration. Such a type will need to dispatch on the streaming variety as well as the request and response, so we’ll define it as follows, and go back and fix HasServer later.

class HasMethodHandler input output cs ss where
  type MethodHandler input output cs ss :: *

cs and ss refer to whether we’re looking for client-streaming and/or server-streaming types, respectively.

Such a thing could be a type family, but isn’t because we’ll need its class-ness later in order to actually provide an implementation of all of this stuff. We provide the following instances:

-- non-streaming
instance HasMethodHandler input output 'False 'False where
  type MethodHandler input output 'False 'False =
    input -> IO output

-- server-streaming
instance HasMethodHandler input output 'False 'True where
  type MethodHandler input output 'False 'True =
    input -> (output -> IO ()) -> IO ()

-- etc for client and bidi streaming

With MethodHandler now powerful enough to give us the types we want for handlers, we can go back and fix HasServer so it will compile again:

instance ( HasMethod s x
         , HasMethodHandler (MethodInput       s x)
                            (MethodOutput      s x)
                            (IsClientStreaming s x)
                            (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput       s x)
                      (MethodOutput      s x)
                      (IsClientStreaming s x)
                      (IsServerStreaming s x)
     -> ServerType s xs

It’s not pretty, but it works! We can convince ourselves of this by asking ghci:

ghci> :kind! ServerType MyService (ServiceMethods MyService)

(Request -> (Response -> IO ()) -> IO ()) -> IO () :: *

and, if we had other methods defined for MyService, they’d show up here with the correct handler type, in the order they were listed in ServiceMethods MyService.
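
For anyone who wants to poke at this without the rest of the library, here is a self-contained sketch of the type-level half. It assumes a hypothetical Demo service with one non-streaming and one server-streaming method, includes only the two MethodHandler instances shown above, and trims Service down to the method list. The Refl at the end only type-checks if ServerType reduces to the function type we expect.

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, FlexibleInstances, KindSignatures,
             MultiParamTypeClasses, TypeFamilies, TypeOperators,
             UndecidableInstances #-}

import Data.Kind (Type)
import Data.Type.Equality ((:~:) (Refl))
import GHC.TypeLits (Symbol)

class Service (s :: Type) where
  type ServiceMethods s :: [Symbol]

class HasMethod (s :: Type) (m :: Symbol) where
  type MethodInput       s m :: Type
  type MethodOutput      s m :: Type
  type IsClientStreaming s m :: Bool
  type IsServerStreaming s m :: Bool

class HasMethodHandler (input :: Type) (output :: Type)
                       (cs :: Bool) (ss :: Bool) where
  type MethodHandler input output cs ss :: Type

-- non-streaming
instance HasMethodHandler input output 'False 'False where
  type MethodHandler input output 'False 'False = input -> IO output

-- server-streaming
instance HasMethodHandler input output 'False 'True where
  type MethodHandler input output 'False 'True =
    input -> (output -> IO ()) -> IO ()

class HasServer (s :: Type) (xs :: [Symbol]) where
  type ServerType s xs :: Type

instance HasServer s '[] where
  type ServerType s '[] = IO ()

instance ( HasMethod s x
         , HasMethodHandler (MethodInput s x) (MethodOutput s x)
                            (IsClientStreaming s x) (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput s x) (MethodOutput s x)
                      (IsClientStreaming s x) (IsServerStreaming s x)
     -> ServerType s xs

-- A hypothetical two-method service, standing in for generated code.
data Request
data Response
data Demo

instance Service Demo where
  type ServiceMethods Demo = '["unary", "serverStream"]

instance HasMethod Demo "unary" where
  type MethodInput       Demo "unary" = Request
  type MethodOutput      Demo "unary" = Response
  type IsClientStreaming Demo "unary" = 'False
  type IsServerStreaming Demo "unary" = 'False

instance HasMethod Demo "serverStream" where
  type MethodInput       Demo "serverStream" = Request
  type MethodOutput      Demo "serverStream" = Response
  type IsClientStreaming Demo "serverStream" = 'False
  type IsServerStreaming Demo "serverStream" = 'True

-- Compiles only if the fold produces one handler argument per method,
-- in the order given by ServiceMethods.
serverTypeCheck
  :: ServerType Demo (ServiceMethods Demo)
     :~: ( (Request -> IO Response)
        -> (Request -> (Response -> IO ()) -> IO ())
        -> IO () )
serverTypeCheck = Refl
```

Swapping the order of the names in ServiceMethods Demo swaps the order of the handler arguments, and serverTypeCheck stops compiling until its signature is updated to match.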

Implementing the Server

Our ServerType family now expands to a function type which takes a handler value (of the correct type) for every method on our service. That turns out to be more than half the battle – all we need to do now is to provide a value of this type.

The generation of such a value is going to need to proceed in perfect lockstep with the generation of its type, so we add to the definition of HasServer:

class HasServer s (xs :: [Symbol]) where
  type ServerType s xs :: *
  runServerImpl :: [AnyHandler] -> ServerType s xs

What is this [AnyHandler] thing, you might ask. It’s an explicit accumulator for existentialized handlers we’ve collected during the fold over xs. It’ll make sense when we look at the induction case.

For now, however, the base case is trivial as always:

instance HasServer s '[] where
  type ServerType s '[] = IO ()
  runServerImpl handlers = runGRPCServer handlers

where runGRPCServer is the underlying server provided by Awake’s library.

We move to the induction case:

instance ( HasMethod s x
         , HasMethodHandler (MethodInput       s x)
                            (MethodOutput      s x)
                            (IsClientStreaming s x)
                            (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput       s x)
                      (MethodOutput      s x)
                      (IsClientStreaming s x)
                      (IsServerStreaming s x)
     -> ServerType s xs
  runServerImpl handlers f = runServerImpl (existentialize f : handlers)

where existentialize is a new class method we add to HasMethodHandler. We will elide it here because it is just a function MethodHandler i o cs ss -> AnyHandler and is not particularly interesting if you’re familiar with existentialization.

It’s evident here what I meant by handlers being an explicit accumulator – our recursion adds the parameters it receives into this list so that it can pass them eventually to the base case.

There’s a problem here, however. Reading through this implementation of runServerImpl, you and I both know what the right-hand side means; unfortunately, GHC isn’t as clever as we are. If you try to compile it right now, GHC will complain about the non-injectivity of HasServer as implied by the call to runServerImpl (and also about HasMethodHandler and existentialize, but for the exact same reason).

The problem is that there’s nothing constraining the type variables s and xs on runServerImpl. I always find this error confusing (and I suspect everyone does), because in my mind it’s perfectly clear from the HasServer s xs in the instance constraint. However, because ServerType is a type family without any injectivity declarations, we can’t learn s and xs from ServerType s xs.

Let’s see why. For a very simple example, let’s look at the following type family:

type family NotInjective a where
  NotInjective Int  = ()
  NotInjective Bool = ()

Here we have NotInjective Int ~ () and NotInjective Bool ~ (), which means even if we know NotInjective a ~ () it doesn’t mean that we know what a is – it could be either Int or Bool.
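If you’d like to convince yourself of this, here is a minimal sketch (the names sameResult and main are mine, not part of the library) that asks GHC to accept a proof that the two applications are the very same type:

```haskell
{-# LANGUAGE TypeFamilies, TypeOperators #-}

import Data.Type.Equality ((:~:) (..))

type family NotInjective a where
  NotInjective Int  = ()
  NotInjective Bool = ()

-- Both sides reduce to (), so Refl type checks: the result of the
-- family says nothing about which argument it was applied to.
sameResult :: NotInjective Int :~: NotInjective Bool
sameResult = Refl

main :: IO ()
main = case sameResult of
  Refl -> putStrLn "NotInjective Int and NotInjective Bool are the same type"
```

Because this compiles, knowing only the result type can never tell GHC which argument was intended.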

This is the exact problem we have with runServerImpl: even though we know what type runServerImpl has (it must be ServerType s xs, so that the type on the left-hand of the equality is the same as on the right), that doesn’t mean we know what s and xs are! The solution is to explicitly tell GHC via a type signature or type application:

instance ( HasMethod s x
         , HasMethodHandler (MethodInput       s x)
                            (MethodOutput      s x)
                            (IsClientStreaming s x)
                            (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput       s x)
                      (MethodOutput      s x)
                      (IsClientStreaming s x)
                      (IsServerStreaming s x)
     -> ServerType s xs
  runServerImpl handlers f = runServerImpl @s @xs (existentialize f : handlers)

(For those of you playing along at home, you’ll need to type-apply the monstrous MethodInput and friends to the existentialize as well.)
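If the real thing is too noisy to follow, the same shape can be reproduced in miniature. The following is only a sketch under my own assumptions: Collect, Collector and collectImpl are hypothetical names standing in for HasServer, ServerType and runServerImpl, and the "handlers" are plain Strings. It shows both the explicit accumulator and the type application that resolves the ambiguity:

```haskell
{-# LANGUAGE AllowAmbiguousTypes, DataKinds, FlexibleInstances, KindSignatures,
             ScopedTypeVariables, TypeApplications, TypeFamilies, TypeOperators #-}

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- Hypothetical stand-in for HasServer: one String argument per list element.
class Collect (xs :: [Symbol]) where
  type Collector xs :: Type
  collectImpl :: [String] -> Collector xs

-- Base case: nothing left to ask for, so hand back what was accumulated.
instance Collect '[] where
  type Collector '[] = [String]
  collectImpl acc = reverse acc

-- Induction case: take one more argument and push it onto the accumulator.
-- Without @xs, GHC cannot work out which instance the recursive call means.
instance Collect xs => Collect (x ': xs) where
  type Collector (x ': xs) = String -> Collector xs
  collectImpl acc s = collectImpl @xs (s : acc)

main :: IO ()
main = print (collectImpl @'["a", "b"] [] "first" "second")
-- prints ["first","second"]
```

Removing the @xs from the induction case should bring back the same kind of ambiguity error described above.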

And finally, we’re done! We can slap a prettier interface in front of this runServerImpl to fill in some of the implementation details for us:

runServer
    :: forall s
     . ( Service s
       , HasServer s (ServiceMethods s)
       )
    => s
    -> ServerType s (ServiceMethods s)
runServer _ = runServerImpl @s @(ServiceMethods s) []

Sweet and typesafe! Yes!

Client-side Usability

Sweet and typesafe all of this might be, but the user-friendliness on the client-side leaves a lot to be desired. As promised, we’ll address that now.

Removing Proxies

Recall that the runNonStreamingClient function and its friends require a Proxy m parameter in order to specify the method you want to call. However, m has kind Symbol, and thankfully we have some new extensions in GHC for turning Symbols into values.

We can define a new type, isomorphic to Proxy, but which packs the fact that it is a KnownSymbol (something we can turn into a String at runtime):

data WrappedMethod (sym :: Symbol) where
  WrappedMethod :: KnownSymbol sym => WrappedMethod sym

We change our run*Client friends to take this WrappedMethod m instead of the Proxy m they used to:

runNonStreamingClient
    :: ( HasMethod s m
       , IsClientStreaming s m ~ 'False
       , IsServerStreaming s m ~ 'False
       )
    => s
    -> WrappedMethod m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))

and, with this change in place, we’re ready for the magic syntax I promised earlier.

import GHC.OverloadedLabels

instance ( KnownSymbol sym
         , sym ~ sym'
         ) => IsLabel sym (WrappedMethod sym') where
  fromLabel _ = WrappedMethod

This sym ~ sym' thing is known as the constraint trick for instances, and is necessary here to convince GHC that this can be the only possible instance of IsLabel that will give you back WrappedMethods.
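For anyone who wants to try the label trick in isolation, here is a self-contained sketch. It assumes GHC 8.2 or later, where fromLabel takes no proxy argument (unlike the instance above), and methodName is a hypothetical helper of mine that just reads the symbol back out:

```haskell
{-# LANGUAGE DataKinds, FlexibleInstances, GADTs, KindSignatures,
             MultiParamTypeClasses, OverloadedLabels, ScopedTypeVariables,
             TypeFamilies #-}

import Data.Proxy (Proxy (..))
import GHC.OverloadedLabels (IsLabel (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

data WrappedMethod (sym :: Symbol) where
  WrappedMethod :: KnownSymbol sym => WrappedMethod sym

-- The instance head matches any WrappedMethod sym'; only afterwards does
-- the sym ~ sym' constraint force sym' to be the label that was written.
instance (KnownSymbol sym, sym ~ sym') => IsLabel sym (WrappedMethod sym') where
  fromLabel = WrappedMethod

-- Hypothetical helper: recover the method name at runtime.
methodName :: forall sym. WrappedMethod sym -> String
methodName WrappedMethod = symbolVal (Proxy :: Proxy sym)

main :: IO ()
main = putStrLn (methodName #biDiStreaming)
-- prints biDiStreaming
```

Note that methodName #biDiStreaming leaves sym completely undetermined at the call site; it is the constraint trick that lets GHC pick the instance first and fill in sym afterwards.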

Now turning on the {-# LANGUAGE OverloadedLabels #-} pragma, we’ve changed the syntax to call these client functions from the ugly:

runBiDiStreamingClient MyService (Proxy @"biDiStreaming")

into the much nicer:

runBiDiStreamingClient MyService #biDiStreaming

Better “Wrong Streaming Variety” Errors

The next step in our journey to delightful usability is remembering that the users of our library are only human, and at some point they are going to call the wrong run*Client function on their method with a different variety of streaming semantics.

At the moment, the errors they’re going to get when they try that will be a few stanzas long, the most informative of which will be something along the lines of unable to match 'False with 'True. Yes, it’s technically correct, but it’s entirely useless.

Instead, we can use the TypeError machinery from GHC.TypeLits to make these error messages actually helpful to our users. If you aren’t familiar with it, if GHC ever encounters a TypeError constraint it will die with an error message of your choosing.

We will introduce the following type family:

type family RunNonStreamingClient (cs :: Bool) (ss :: Bool) :: Constraint where
  RunNonStreamingClient 'False 'False = ()
  RunNonStreamingClient 'False 'True = TypeError
      ( Text "Called 'runNonStreamingClient' on a server-streaming method."
   :$$: Text "Perhaps you meant 'runServerStreamingClient'."
      )
  RunNonStreamingClient 'True 'False = TypeError
      ( Text "Called 'runNonStreamingClient' on a client-streaming method."
   :$$: Text "Perhaps you meant 'runClientStreamingClient'."
      )
  RunNonStreamingClient 'True 'True = TypeError
      ( Text "Called 'runNonStreamingClient' on a bidi-streaming method."
   :$$: Text "Perhaps you meant 'runBiDiStreamingClient'."
      )

The :$$: type operator stacks messages vertically, while :<>: joins them horizontally.

We can change the constraints on runNonStreamingClient:

runNonStreamingClient
    :: ( HasMethod s m
       , RunNonStreamingClient (IsClientStreaming s m)
                               (IsServerStreaming s m)
       )
    => s
    -> WrappedMethod m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))

and similarly for our other client functions. Reduction of the resulting boilerplate is left as an exercise to the reader.

With all of this work out of the way, we can test it:

runNonStreamingClient MyService #biDiStreaming
Main.hs:45:13: error:
    • Called 'runNonStreamingClient' on a bidi-streaming method.
      Perhaps you meant 'runBiDiStreamingClient'.
    • In the expression: runNonStreamingClient MyService #biDiStreaming


Better “Wrong Method” Errors

The other class of errors we expect our users to make is to attempt to call a method that doesn’t exist – either because they made a typo, or are forgetful of which methods exist on the service in question.

As it stands, users are likely to get about six stanzas of error messages, from No instance for (HasMethod s m) to Ambiguous type variable 'm0', and other terrible things that leak our implementation details. Our first thought might be to somehow emit a TypeError constraint if we don’t have a HasMethod s m instance, but I’m not convinced such a thing is possible.

But luckily, we can actually do better than any error messages we could produce in that way. Since our service is driven by a value (in our example, the data constructor MyService), by the time things go wrong we do have a Service s instance in scope. Which means we can look up our ServiceMethods s and give some helpful suggestions about what the user probably meant.

The first step is to implement a ListContains type family so we can determine if the method we’re looking for is actually a real method.

type family ListContains (n :: k) (hs :: [k]) :: Bool where
  ListContains n '[]       = 'False
  ListContains n (n ': hs) = 'True
  ListContains n (x ': hs) = ListContains n hs

In the base case, we have no list to look through, so our needle is trivially not in the haystack. If the head of the list is the thing we’re looking for, then it must be in the list. Otherwise, take off the head of the list and continue looking. Simple really, right?
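If you don’t trust the recursion, you can have GHC check it for you. This is a small sketch of my own (the names found and missing are hypothetical): each signature only type checks if ListContains reduces to the stated Bool.

```haskell
{-# LANGUAGE DataKinds, PolyKinds, TypeFamilies, TypeOperators #-}

import Data.Type.Equality ((:~:) (..))

type family ListContains (n :: k) (hs :: [k]) :: Bool where
  ListContains n '[]       = 'False
  ListContains n (n ': hs) = 'True
  ListContains n (x ': hs) = ListContains n hs

-- The needle is in the middle of the haystack.
found :: ListContains "b" '["a", "b", "c"] :~: 'True
found = Refl

-- The needle is nowhere to be seen, so we fall through to the base case.
missing :: ListContains "z" '["a", "b", "c"] :~: 'False
missing = Refl

main :: IO ()
main = putStrLn "both ListContains checks hold"
```

Swapping 'True and 'False in either signature should turn the file into a compile error, which is exactly the behaviour we’re about to rely on.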

We can now use this thing to generate an error message in the case that the method we’re looking for is not in our list of methods:

type family RequireHasMethod s (m :: Symbol) (found :: Bool) :: Constraint where
  RequireHasMethod s m 'False = TypeError
      ( Text "No method "
   :<>: ShowType m
   :<>: Text " available for service '"
   :<>: ShowType s
   :<>: Text "'."
   :$$: Text "Available methods are: "
   :<>: ShowType (ServiceMethods s)
      )
  RequireHasMethod s m 'True = ()

If found ~ 'False, then the method m we’re looking for is not part of the service s. We produce a nice error message informing the user about this (using ShowType to expand the type variables).

We will provide a type alias to perform this lookup:

type HasMethod' s m =
  ( RequireHasMethod s m (ListContains m (ServiceMethods s))
  , HasMethod s m
  )

Our new HasMethod' s m has the same shape as HasMethod, but will expand to our custom type error if we’re missing the method under scrutiny.

Replacing all of our old HasMethod constraints with HasMethod' works fantastically:

Main.hs:54:15: error:
    • No method "missing" available for service 'MyService'.
      Available methods are: '["biDiStreaming"]

Damn near perfect! That list of methods is kind of ugly, though, so we can write a quick pretty printer for showing promoted lists:

type family ShowList (ls :: [k]) :: ErrorMessage where
  ShowList '[]  = Text ""
  ShowList '[x] = ShowType x
  ShowList (x ': xs) = ShowType x :<>: Text ", " :<>: ShowList xs

Replacing our final ShowType with ShowList in RequireHasMethod now gives us error messages of the following:

Main.hs:54:15: error:
    • No method "missing" available for service 'MyService'.
      Available methods are: "biDiStreaming"

Absolutely gorgeous.
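To play with these pieces outside of the gRPC library, here is a standalone sketch. Everything service-related in it is an assumption of mine: ServiceMethods is modelled as a plain open type family, the method list for MyService is made up, and methodName is a hypothetical function that stands in for the real client calls.

```haskell
{-# LANGUAGE AllowAmbiguousTypes, DataKinds, FlexibleContexts, PolyKinds,
             ScopedTypeVariables, TypeApplications, TypeFamilies,
             TypeOperators, UndecidableInstances #-}

import Data.Kind (Constraint, Type)
import Data.Proxy (Proxy (..))
import GHC.TypeLits

-- Hypothetical stand-ins for the real service and its method list.
data MyService = MyService

type family ServiceMethods (s :: Type) :: [Symbol]
type instance ServiceMethods MyService = '["biDiStreaming", "ping"]

type family ListContains (n :: k) (hs :: [k]) :: Bool where
  ListContains n '[]       = 'False
  ListContains n (n ': hs) = 'True
  ListContains n (x ': hs) = ListContains n hs

type family ShowList (ls :: [k]) :: ErrorMessage where
  ShowList '[]  = Text ""
  ShowList '[x] = ShowType x
  ShowList (x ': xs) = ShowType x :<>: Text ", " :<>: ShowList xs

type family RequireHasMethod (s :: Type) (m :: Symbol) (found :: Bool) :: Constraint where
  RequireHasMethod s m 'False = TypeError
      ( Text "No method " :<>: ShowType m
   :<>: Text " available for service '" :<>: ShowType s :<>: Text "'."
   :$$: Text "Available methods are: " :<>: ShowList (ServiceMethods s)
      )
  RequireHasMethod s m 'True = ()

-- Hypothetical client function: only compiles for methods that exist.
methodName
    :: forall m s
     . ( KnownSymbol m
       , RequireHasMethod s m (ListContains m (ServiceMethods s))
       )
    => s -> String
methodName _ = symbolVal (Proxy :: Proxy m)

main :: IO ()
main = putStrLn (methodName @"ping" MyService)
-- prints ping
```

Changing @"ping" to a method that isn’t in the list should make GHC reject the program with the custom message rather than a wall of unsolved constraints.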


This is where we stop. We’ve used type-level metadata to generate client- and server-side bindings to an underlying library. Everything we’ve made is entirely typesafe, and provides gorgeous, helpful error messages if the user does anything wrong. We’ve found a practical use for many of these seemingly-obscure type-level features, and learned a few things in the process.

In the words of my coworker Renzo Carbonara [1]:

“It is up to us, as people who understand a problem at hand, to try and teach the type system as much as we can about that problem. And when we don’t understand the problem, talking to the type system about it will help us understand. Remember, the type system is not magic, it is a logical reasoning tool.”

This resounds so strongly in my soul, and maybe it will in yours too. If so, I encourage you to go forth and find uses for these techniques to improve the experience and safety of your own libraries.

  1. Whose article “Opaleye’s sugar on top” was a strong inspiration on me, and subsequently on this post.


November 18, 2017 12:00 AM

November 16, 2017

Tweag I/O

Parallelising your array code

Manuel M T Chakravarty

This is the fifth post in a series about array programming in Haskell — you might be interested in the first, second, third, and fourth, too.

A recurring theme in array programming is performance. After all, many algorithms in numerical computing and data science are computationally intensive. Once the sequential implementation of an array program has been fully optimised, the natural next step is to use one or multiple forms of parallelism to achieve further performance improvements. This can be parallelism within one computational core (SIMD parallelism), multicore parallelism, or distributed multi-machine parallelism. Unfortunately, at this point matters become much more complicated, because parallel programming comes with its own set of serious challenges.

In this post, we will focus on multicore parallelism for computations operating on multi-dimensional arrays. In other words, in relation to the vector package, which we discussed in the last post, we have two new ingredients. Firstly, instead of one-dimensional Int-indexed arrays, we have multi-dimensional Int-indexed arrays. Secondly, the collective operations provided on these arrays come with parallel implementations. In fact, the library API is designed to favour collective operations that have good parallel implementations. Similarly, the move to explicitly multi-dimensional arrays is motivated by being able to provide parallel implementations that take the array shape into account, wherever that is an advantage.

To make matters concrete, we will discuss the Repa library. Internally it uses many of the same techniques as vector, including strictness, unboxing, and a two-phase initialisation strategy. However, it uses a second array fusion strategy in addition to vector’s stream fusion. More precisely, Repa internally uses vector to represent plain boxed and unboxed arrays and to execute sequential computations on those, which still benefit from stream fusion. However, Repa introduces additional array representations, such as delayed arrays, to also achieve fusion across parallel computations.

This additional complication is necessary as stream fusion, by itself, tends to turn parallel into sequential code. In other words, one of the challenges of high-performance parallel array implementations that are built on collective operations is to apply fusion while preserving parallelism. To really get good performance, we need to simultaneously optimize along two orthogonal dimensions: get more done simultaneously, by parallelizing, but also make each sequential unit of work run faster.

A second consequence of targeting a parallelisation-friendly API is a very limited use of mutable arrays. Mutable structures generally interact badly with concurrency and parallelism, opening the door to a whole range of hard to diagnose faults. In fact, the focus on immutable arrays for parallel programming is one of the most compelling conceptual improvements of functional over imperative parallel array programming. (To be precise, Repa’s API does provide access to the mutable array structures used to implement two-phase initialisation, but it is usually not necessary to use them directly.)

Multiple dimensions

The obvious choice for indexing multi-dimensional Int-indexed arrays is tuples of Ints. However, they come with two severe drawbacks: (1) they force us to fix the dimensionality of all functions over arrays and (2) they are not sufficient to characterise operations on lower-dimensional subarrays of an array (e.g., a two-dimensional plane within a three-dimensional cube).

As an example of the first drawback, consider a fold function that given a three-dimensional cube, reduces it along, say, the x-axis to a two-dimensional plane of sums. The only difference of that operation compared to a fold that sums a two-dimensional plane across one axis to a one-dimensional vector is the number of dimensions that we do not reduce along. Now, we could have a family of fold functions (fold1, fold2, and so on), one for each possible dimension of argument array. But that is hardly satisfactory.

Instead, Repa uses a custom datatype for indexing. Index types are built from the infix constructor (:.) and the constant Z, representing a zero-dimensional array (which is the special case of a singleton array). For example, the type of two-dimensional indices is Z :. Int :. Int and one of its values is Z :. 3 :. 5. By using a type variable instead of Z, we can denote indices with a particular minimum dimensionality. For instance, sh :. Int has at least one dimension, but it might have more, depending on how the type variable sh is instantiated; in any case, instances of sh need to be drawn from the class Shape. On the basis of this index representation, we can capture the entire family of multi-dimensional fold functions in a single type:

foldS :: (Shape sh, Source r a, Unbox a)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> Array U sh a

The function foldS implements a sequential, multi-dimensional reduction; hence, the S suffix. It gets three arguments:

  1. a -> a -> a is the type of the binary reduction function, which needs to be associative,
  2. a is the reduction function’s neutral element (i.e., together they form a monoid), and
  3. Array r (sh :. Int) a is an at least one-dimensional array of elements of type a, which the type constraint Unbox a requires to be a type that has an associated unboxed representation.

Finally, the result of type Array U sh a has one dimension less than the argument array, but contains elements of the same type a. This leaves us wondering about the meaning of the first type argument of Array (r and U, respectively), as well as the type constraint Source r a.
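As a small illustration of the dimension-reducing behaviour of foldS, consider the following sketch. It assumes the repa package is installed; the matrix contents and the name rowSums are made up for the example.

```haskell
import Data.Array.Repa

main :: IO ()
main = do
  -- A 2x3 matrix in row-major order: [[1,2,3],[4,5,6]].
  let m = fromListUnboxed (Z :. 2 :. 3) [1 .. 6] :: Array U DIM2 Int
      -- foldS reduces along the innermost dimension, so the result has
      -- shape Z :. 2: one sum per row.
      rowSums = foldS (+) 0 m
  print (toList rowSums)
  -- prints [6,15]
```

The argument has shape sh :. Int with sh instantiated to Z :. Int, and the result has exactly that sh as its shape.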

Indexed arrays

The first type argument of Array determines the array representation. The available representations include boxed (V) and unboxed (U) representations, but also delayed (D) and cursored (C) representations. The latter are guaranteed to be removed by fusion, but can lead to the superfluous recomputation of array elements that are used more than once. Repa makes the choice of representation explicit to place it under programmer control — experience shows that compiler heuristics for automatic representation selection tend to be fragile and unreliable.

A consequence of a representation that is fused away, such as delayed D and cursored C, is that it can only be a data Source of a computation. Hence, the type class of the same name provides elementary array access functions for arrays. The opposite, a Target, provides the functionality to fill an array as part of two-phase initialisation and is only available to manifest representations, such as the boxed V and unboxed U representation. A manifest representation is one which, in contrast to a fused-away delayed representation, is actually stored in memory.

In addition to concrete representations, Repa representation tags can also include meta information, such as the interleaving hint I. An array tagged I U uses an unboxed interleaved representation, which improves parallel load balancing in parallel computations where the amount of work strongly varies between different regions in the parallel array. A standard example is computing a Mandelbrot set, where black pixels are significantly more expensive than others.


As we saw above with foldS, Repa follows the convention of adding an S to sequential array operations. Similarly, it uses a P as a suffix for parallel functions. For example, we have

foldP :: (Shape sh, Source r a, Unbox a, Monad m)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> m (Array U sh a)

for the parallel version of fold. The distinction between sequential and parallel functions is an important one, since Repa does not support nested parallelism. That is, a parallel function (e.g., foldP) cannot use another parallel function as an argument (e.g., as the combination function).

In addition to the suffix, the parallel fold distinguishes itself from the sequential by the use of a not further specified monad. The purpose of this monad is to ensure the one-by-one execution of pipelines of parallel computations. This is important to prevent inadvertent nesting of parallel computations as Haskell is a lazy language and we might otherwise feed a suspended (i.e., not yet evaluated) parallel computation into another parallel computation.

Parallel matrix multiplication

As a simple example of a parallel computation, consider the multiplication of two matrices arr and brr of type Array U DIM2 Double (two-dimensional, unboxed arrays), where type DIM2 = Z :. Int :. Int:

mmultP :: Monad m
       => Array U DIM2 Double
       -> Array U DIM2 Double
       -> m (Array U DIM2 Double)
mmultP arr brr
  = do trr <- transpose2P brr
       -- dotp is bound inside the do block so that trr is in scope
       let dotp ix = sumAllS $
                       zipWith (*)
                         (slice arr (Any :. (row ix) :. All))
                         (slice trr (Any :. (col ix) :. All))
       computeP (fromFunction (Z :. h1 :. w2) dotp)
  where
    (Z :. h1  :. _)  = extent arr
    (Z :. _   :. w2) = extent brr

We assume the existence of a helper function transpose2P, which transposes a matrix in parallel (for example, by using Repa’s backpermute function). Then, we generate the manifest result array by computing all elements of fromFunction (Z :. h1 :. w2) dotp in parallel with computeP. The shape (i.e., the size of the dimensions) of the result is h1 times w2, and fromFunction turns a function, which takes an array index to the corresponding array element, into a delayed array:

fromFunction :: sh -> (sh -> a) -> Array D sh a

At each index ix of the resulting array, we evaluate dotp, which only involves a sequential computation. Its sequential nature is important for two reasons. Firstly, as mentioned, Repa does not support nested parallelism, so the computations on each result array index triggered by computeP in parallel may not themselves be parallel. Secondly, the work complexity of matrix multiplication is n^3; that is the number of scalar multiplications that need to be performed. Performing them all in parallel would lead to (a) too much and (b) too fine-grained parallelism. Both too much parallelism and parallel workloads that each do too little work lead to bad performance, as they result in too much administrative overhead.

In contrast, the sequential computation performed by dotp obtains a row of the matrix arr and a column of brr (actually, a row of the transposed brr, which is trr) with slice, which extracts an entire subarray from an array. Then, it multiplies the row and column pointwise with zipWith (*) and sums up the products with sumAllS, where

zipWith :: (Shape sh, Source r1 a, Source r2 b)
        => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c
sumAllS :: (Shape sh, Source r a, Num a) => Array r sh a -> a

This example highlights how reasoning about the decomposition of an algorithm into parallel and sequential components is crucial for good parallel performance. This is assisted by Repa’s clear distinction between sequential and parallel operations.
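The interplay of fromFunction and computeP used in mmultP can also be seen in isolation. The following is a minimal sketch, assuming the repa package is available; the element function and the names delayed and manifest are chosen only for illustration. To actually run the computation on several cores, the program would additionally need to be built with -threaded and started with a suitable +RTS -N option.

```haskell
import Data.Array.Repa

main :: IO ()
main = do
  -- A delayed array: nothing is stored yet, only the index function.
  let delayed :: Array D DIM2 Int
      delayed = fromFunction (Z :. 2 :. 3) (\(Z :. i :. j) -> 10 * i + j)
  -- computeP evaluates all elements and fills a manifest unboxed array.
  manifest <- computeP delayed :: IO (Array U DIM2 Int)
  print (toList manifest)
  -- prints [0,1,2,10,11,12]
```

Here the per-element work is a trivial sequential function, which matches the requirement that the function handed to a parallel operation must not itself be parallel.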

Further reading

Repa went through three major iterations before arriving at the current interface. The underlying concepts are described and supported by benchmarks in the papers Regular, shape-polymorphic, parallel arrays in Haskell, Efficient Parallel Stencil Convolution in Haskell, and Guiding Parallel Array Fusion with Indexed Types, respectively. In addition, Data Flow Fusion with Series Expressions in Haskell proposes a further improvement to the fusion system. However, this has not been integrated into the main package.

November 16, 2017 12:00 AM

November 15, 2017

Jeremy Gibbons

The Digits of Pi

In the previous post we were introduced to metamorphisms, which consist of an unfold after a fold—typically on lists, and the fold part typically a {\mathit{foldl}}. A canonical example is the conversion of a fraction from one base to another. For simplicity, let’s consider here only infinite fractions, so we don’t have to deal with the end of the input and flushing the state:

\displaystyle  \begin{array}{@{}lcl@{}} \multicolumn{3}{@{}l}{\mathit{stream} :: (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow (\beta \rightarrow \alpha \rightarrow \beta) \rightarrow \beta \rightarrow [\alpha] \rightarrow [\gamma]} \\ \mathit{stream}\;g\;f\;b\;x &=& \mathbf{case}\;g\;b\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{Just}\;(c,b') &\rightarrow& c : \mathit{stream}\;g\;f\;b'\;x \\ \mathit{Nothing} &\rightarrow& \mathbf{case}\;x\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} a:x' &\rightarrow& \mathit{stream}\;g\;f\;(f\;b\;a)\;x' \end{array} \end{array} \end{array}

So for example, we can convert an infinite fraction in base 3 to one in base 7 with

\displaystyle  \mathit{stream}\;\mathit{next}\;\mathit{stepl}\;(0,1)

where

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{next}\;(u,v) &=& \begin{array}[t]{@{}l} \mathbf{let}\;y = \lfloor{7 \times u \times v}\rfloor\;\mathbf{in} \\ \mathbf{if}\;\lfloor{y}\rfloor = \lfloor{7 \times (u+1) \times v}\rfloor\;\begin{array}[t]{@{}l@{\;}l}\mathbf{then}&\mathit{Just}\;(y,(u - y/(v \times 7), v \times 7))\\\mathbf{else}&\mathit{Nothing} \\ \end{array} \end{array} \\ \mathit{stepl}\;(u,v)\;d &=& (u \times 3 + d, v / 3) \end{array}

In this post, we’ll see another number conversion problem, which will deliver the digits of {\pi}. For more details, see my paper—although the presentation here is simpler now.

Series for pi

Leibniz showed that

\displaystyle  \displaystyle \frac{\pi}{4} = \sum_{i=0}^{\infty} \frac{(-1)^i}{2i+1}

From this, using Euler’s convergence-accelerating transformation, one may derive

\displaystyle  \pi = \sum_{i=0}^{\infty} \frac{(i!)^2\,2^{i+1}}{(2i+1)!}

or equivalently

\displaystyle  \pi = 2 + \frac{1}{3} \times \biggl(2 + \frac{2}{5}\times \biggl(2 + \frac{3}{7}\times \biggl( \cdots \biggl(2 + \frac{i}{2i+1}\times \biggl(\cdots\biggr)\biggr)\biggr)\biggr)\biggr)

This can be seen as the number {(2;2,2,2...)} in a funny mixed-radix base {(\frac{1}{3}, \frac{2}{5}, \frac{3}{7}...)}, just as the usual decimal expansion

\displaystyle  \pi = 3 + \frac{1}{10} \times \biggl(1 + \frac{1}{10}\times \biggl(4 + \frac{1}{10}\times \biggl( \cdots\biggr)\biggr)\biggr)

is represented by the number {(3;1,4,1...)} in the fixed-radix base {(\frac{1}{10},\frac{1}{10},\frac{1}{10}...)}. Computing the decimal digits of {\pi} is then a matter of conversion from the mixed-radix base to the fixed-radix base.

Conversion from a fixed base

Let’s remind ourselves of how it should work, using a simpler example: conversion from one fixed base to another. We are given an infinite-precision fraction in the unit interval

\displaystyle  x = \frac{1}{m} \times \biggl(x_0 + \frac{1}{m}\times \biggl(x_1 + \frac{1}{m}\times \biggl( \cdots\biggr)\biggr)\biggr)

in base {m}, in which {0 \le x_i < m} for each digit {x_i}. We are to convert it to a similar representation

\displaystyle  x = \frac{1}{n} \times \biggl(y_0 + \frac{1}{n}\times \biggl(y_1 + \frac{1}{n}\times \biggl( \cdots\biggr)\biggr)\biggr)

in base {n}, in which {0 \le y_j < n} for each output digit {y_j}. The streaming process maintains a state {(u,v)}, a pair of rationals; the invariant is that after consuming {i} input digits and producing {j} output digits, we have

\displaystyle  x = \frac{1}{n} \times \biggl(y_0 + \cdots + \frac{1}{n}\times \biggl(y_{j-1} + v \times (u + \frac{1}{m} \times \biggl( x_i + \frac{1}{m} \times \biggl(x_{i+1} + \cdots \biggr)\biggr)\biggr)\biggr)\biggr)

so that {(u,v)} represents a linear function {(v\times) \cdot (u+)} that should be applied to the value represented by the remaining input.

We can initialize the process with {i=0, j=0, u=0, v=1}. At each step, we first try to produce another output digit. The remaining input digits {x_i, x_{i+1},...} represent a value in the unit interval; so if {n \times v \times (u+0)} and {n \times v \times (u+1)} have the same integer part, then that must be the next output digit, whatever the remaining input digits are. Let {y_j = \lfloor n \times v \times u \rfloor} be that integer. Now we need to find {(u',v')} such that

\displaystyle  \frac{1}{n} \times \biggl(y_j + v' \times (u' + r)\biggr) = v \times (u + r)

for any remainder {r}; then we can increment {j} and set {(u,v)} to {(u',v')} and the invariant is maintained. A little algebra shows that we should take {v' = n \times v} and {u' = u - y_j/v'}.

If {n \times v \times u} and {n \times v \times (u+1)} have different integer parts, we cannot yet tell what the next output digit should be, so we must consume the next input digit instead. Now we need to find {(u',v')} such that

\displaystyle  v \times \biggl(u + \frac{1}{m} \times \biggl(x_i + r\biggr)\biggr) = v' \times (u' + r)

for any remainder {r}; then we can increment {i} and set {(u,v)} to {(u',v')} and the invariant is again maintained. Again, algebraic manipulation leads us to {v' = v/m} and {u' = m \times u + x_i}.
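For readers who want to see the algebra spelled out, both updates follow from matching the coefficient of {r} and the constant term on the two sides of the respective equation. This is just a worked restatement of the two conditions above, using the same symbols:

```latex
\begin{align*}
\text{produce:}\quad
  \frac{1}{n}\bigl(y_j + v'(u' + r)\bigr) = v\,(u + r)
  &\;\Longrightarrow\; \frac{v'}{n} = v, \quad \frac{y_j + v'u'}{n} = v\,u \\
  &\;\Longrightarrow\; v' = n \times v, \quad u' = \frac{n\,v\,u - y_j}{v'} = u - \frac{y_j}{v'} \\[1ex]
\text{consume:}\quad
  v\Bigl(u + \frac{1}{m}(x_i + r)\Bigr) = v'(u' + r)
  &\;\Longrightarrow\; \frac{v}{m} = v', \quad v\,u + \frac{v\,x_i}{m} = v'u' \\
  &\;\Longrightarrow\; v' = \frac{v}{m}, \quad u' = m \times u + x_i
\end{align*}
```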

For example, {\frac{1}{4} = 0.020202..._3 = 0.151515..._7}, and the conversion starts as follows:

\displaystyle  \begin{array}{c|c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}cc} x_i & & 0 && 2 && 0 && && 2 && && 0 && 2 \\ \hline (u,v) & \bigl(\frac{0}{1},\frac{1}{1}\bigr) && \bigl(\frac{0}{1},\frac{1}{3}\bigr) && \bigl(\frac{2}{1},\frac{1}{9}\bigr) && \bigl(\frac{6}{1},\frac{1}{27}\bigr) && \bigl(\frac{15}{7},\frac{7}{27}\bigr) && \bigl(\frac{59}{7},\frac{7}{81}\bigr) && \bigl(\frac{8}{49},\frac{49}{81}\bigr) && \bigl(\frac{24}{49},\frac{49}{243}\bigr) && \bigl(\frac{170}{49},\frac{49}{729}\bigr) & \cdots \vrule height 2.5ex depth 1.5ex width 0pt \\ \hline y_j & & && && && 1 && && 5 \end{array}

That is, the initial state is {u_0=\frac{0}{1}, v_0=\frac{1}{1}}. This state does not yet determine the first output digit, so we consume the first input digit 0 to yield the next state {u_1 = \frac{0}{1}, v_1 = \frac{1}{3}}. This state still does not determine the first output, and nor will the next; so we consume the next two input digits 2 and 0, yielding state {u_3 = \frac{6}{1}, v_3 = \frac{1}{27}}. This state does determine the next digit: {v_3 \times u_3 = 0.020_3 = 0.136..._7} and {v_3 \times (u_3+1) = 0.021_3 = 0.154..._7} both start with a 1 in base 7. So we can produce a 1 as the first output digit, yielding state {u_4 = \frac{15}{7}, v_4 = \frac{7}{27}}. And so on.

The process tends to converge. Each production step widens the non-empty window {[n \times v \times u, n \times v \times (u+1))} by a factor of {n}, so it will eventually contain multiple integers; therefore we cannot produce indefinitely. Each consumption step narrows the window by a factor of {m}, so it will tend towards eventually producing the next output digit. However, this doesn’t always work. For example, consider converting {0.333..._{10}} to base 3:

\displaystyle  \begin{array}{c|c@{}c@{}c@{}c@{}c@{}c@{}cc} x_i & & 3 && 3 && 3 & \\ \hline (u,v) & \bigl(\frac{0}{1},\frac{1}{1}\bigr) && \bigl(\frac{3}{1},\frac{1}{10}\bigr) && \bigl(\frac{33}{1},\frac{1}{100}\bigr) && \bigl(\frac{333}{1},\frac{1}{1000}\bigr) & \cdots \vrule height 2.5ex depth 1.5ex width 0pt \\ \hline y_j & & && \end{array}

The first output digit is never determined: if the first non-3 in the input is less than 3, the value is less than a third, and the first output digit should be a 0; if the first non-3 is greater than 3, then the value is definitely greater than a third, and it is safe to produce a 1 as the first output digit; but because the input is all 3s, we never get to make this decision. This problem will happen whenever the value being represented has a finite representation in the output base.
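The two checks used in this section can be sketched in a few lines of Haskell. This is only an illustration, assuming the state {(u,v)} is held as a pair of Rationals; the names consume, window, quarterState, thirdStates and stuck are invented here and do not appear in the original post.

```haskell
-- A state (u, v) stands for the function \r -> v * (u + r), as in the text.
type State = (Rational, Rational)

-- Consume one input digit d in input base m: u' = m * u + d, v' = v / m.
consume :: Integer -> State -> Integer -> State
consume m (u, v) d = (fromInteger m * u + fromInteger d, v / fromInteger m)

-- Integer parts of both ends of the window for the next digit in output base n.
-- The digit is determined only when the two components agree.
window :: Integer -> State -> (Integer, Integer)
window n (u, v) = (floor (n' * v * u), floor (n' * v * (u + 1)))
  where n' = fromInteger n

-- The state after consuming the base-3 digits 0, 2, 0 of 1/4.
quarterState :: State
quarterState = foldl (consume 3) (0, 1) [0, 2, 0]

-- States reached after consuming 0, 1, 2, ... digits of 0.333... in base 10.
thirdStates :: [State]
thirdStates = scanl (consume 10) (0, 1) (repeat 3)

-- True when none of the first k states determines the first base-3 digit.
stuck :: Int -> Bool
stuck k = all (\(lo, hi) -> lo /= hi) (map (window 3) (take k thirdStates))
```

Here window 7 quarterState has equal components, so a base-7 digit can be produced, whereas stuck k holds for every k tried: the window for 0.333... in base 10 always straddles the boundary between the digits 0 and 1.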

Conversion from a mixed base

Let’s return now to computing the digits of {\pi}. We have the input

\displaystyle  \pi = 2 + \frac{1}{3} \times \biggl(2 + \frac{2}{5}\times \biggl(2 + \frac{3}{7}\times \biggl( \cdots \biggl(2 + \frac{i}{2i+1}\times \biggl(\cdots\biggr)\biggr)\biggr)\biggr)\biggr)

which we want to convert to decimal. The streaming process maintains a pair {(u,v)} of rationals—but this time representing the linear function {(u+) \cdot (v\times)}, since this time our expression starts with a sum rather than a product. The invariant is similar: after consuming {i-1} input digits and producing {j} output digits, we have

\displaystyle  \pi = y_0 + \frac{1}{10} \times \biggl(\cdots y_{j-1} + \frac{1}{10} \times \biggl(u + v \times \biggl(x_i + \frac{i}{2i+1} \times \biggl(x_{i+1} + \frac{i+1}{2i+3} \times \cdots\biggr)\biggr)\biggr)\biggr)

Note that the output base is fixed at 10; but more importantly, the input digits {x_i} are all fixed at 2, and it is the input base that varies from digit to digit.

We can initialize the process with {i=1, j=0, u=0, v=1}. At each step, we first try to produce an output digit. What value might the remaining input

\displaystyle  r = 2 + \frac{i}{2i+1} \times \biggl(2 + \frac{i+1}{2i+3} \times \cdots \biggr)

represent? Each of the bases is at least {\frac{1}{3}}, so it is clear that {r_{\mathrm{min}} \le r}, where

\displaystyle  r_{\mathrm{min}} = 2 + \frac{1}{3} \times r_{\mathrm{min}}

which has unique solution {r_{\mathrm{min}} = 3}. Similarly, each of the bases is less than {\frac{1}{2}}, so it is clear that {r < r_{\mathrm{max}}}, where

\displaystyle  r_{\mathrm{max}} = 2 + \frac{1}{2} \times r_{\mathrm{max}}

which has unique solution {r_{\mathrm{max}} = 4}. So we consider the bounds {\lfloor u + v \times 3 \rfloor} and {\lfloor u + v \times 4 \rfloor}; if these have the same integer part {y_j}, then that is the next output digit. Now we need to find {(u',v')} such that

\displaystyle  y_j + \frac{1}{10} \times (u' + v' \times r) = u + v \times r

for any remainder {r}, so we pick {u' = 10 \times (u - y_j)} and {v' = 10 \times v}. Then we can increment {j} and set {(u,v)} to {(u',v')}, and the invariant is maintained.

If the two bounds have different integer parts, we must consume the next input digit instead. Now we need to find {(u',v')} such that

\displaystyle  u' + v' \times r = u + v \times \biggl(x_i + \frac{i}{2i+1} \times r\biggr)

for all {r}, so we pick {u' = u + v \times x_i} and {v' = v \times i / (2i+1)}. Then we can increment {i} and set {(u,v)} to {(u',v')}, and again the invariant is maintained.

The conversion starts as follows:

\displaystyle  \begin{array}{c|c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}cc} x_i & & 2 && && 2 && 2 && && 2 && 2 \\ \hline (u,v) & \bigl(\frac{0}{1},\frac{1}{1}\bigr) && \bigl(\frac{2}{1},\frac{1}{3}\bigr) && \bigl(\frac{-10}{1},\frac{10}{3}\bigr) && \bigl(\frac{-10}{3},\frac{4}{3}\bigr) && \bigl(\frac{-2}{3},\frac{4}{7}\bigr) && \bigl(\frac{-50}{3},\frac{40}{7}\bigr) && \bigl(\frac{-110}{21},\frac{160}{63}\bigr) && \bigl(\frac{-10}{63},\frac{800}{693}\bigr) & \cdots \vrule height 2.5ex depth 1.5ex width 0pt \\ \hline y_j & & && 3 && && && 1 && && \end{array}
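The sequence of states can also be generated mechanically from the two update rules. The following is a small sketch under the assumptions already made in the text (all input digits are 2, output base 10, bounds 3 and 4 on the remainder); the names step and states are illustrative only.

```haskell
-- The state is (i, u, v): the index of the next input digit and the pair (u, v).
type State = (Integer, Rational, Rational)

-- One step: produce a digit when both bounds agree, otherwise consume the digit 2.
step :: State -> (Maybe Integer, State)
step (i, u, v)
  | lo == hi  = (Just lo, (i, 10 * (u - fromInteger lo), 10 * v))
  | otherwise = (Nothing, (i + 1, u + 2 * v, v * fromInteger i / fromInteger (2 * i + 1)))
  where
    lo, hi :: Integer
    lo = floor (u + 3 * v)   -- lower bound, from r >= 3
    hi = floor (u + 4 * v)   -- upper bound, from r < 4

-- The (u, v) components of the first k states, starting from i = 1, u = 0, v = 1.
states :: Int -> [(Rational, Rational)]
states k = [ (u, v) | (_, u, v) <- take k (iterate (snd . step) (1, 0, 1)) ]
```

Evaluating states 8 gives the eight pairs of the conversion, which makes it easy to check a hand calculation against the rules.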

Happily, non-termination ceases to be a problem: the value being represented does not have a finite representation in the output base, being irrational.


We can plug these definitions straight into the {\mathit{stream}} function above:

\displaystyle  \mathit{piDigits} = \mathit{stream}\;g\;f\;(1,0,0\%1,1\%1)\;(\mathit{repeat}\;2)


\displaystyle  g\;(i,j,u,v) = \begin{array}[t]{@{}l} \mathbf{if}\;y == \mathit{floor}\;(u + v \times 4) \\ \mathbf{then}\;\mathit{Just}\;(y, (i,j+1, 10 \times (u - \mathit{fromIntegral}\;y), 10 \times v)) \\ \mathbf{else}\;\mathit{Nothing} \\ \mathbf{where}\;y = \mathit{floor}\;(u + v \times 3) \end{array}


\displaystyle  f\;(i,j,u,v)\;x = \begin{array}[t]{@{}l} (i+1,j,u + v \times \mathit{fromIntegral}\;x, v \times i' / (2 \times i' + 1)) \\ \mathbf{where}\;i' = \mathit{fromIntegral}\;i \end{array}

(The {\%}s make rational numbers in Haskell, and force the ambiguous fractional type to be {\mathit{Rational}} rather than {\mathit{Double}}.)

In fact, this program can be considerably simplified, by inlining the definitions. In particular, the input digits are all 2, so we need not supply them. Moreover, the {j} component of the state is never used, because we treat each output digit in the same way (in contrast to the input digits); so that may be eliminated. Finally, we can eliminate some of the numeric coercions if we represent the {i} component as a rational in the first place:

\displaystyle  \mathit{piDigits} = \begin{array}[t]{@{}l} \mathit{go}\;((1,0,1) :: (\mathit{Rational},\mathit{Rational},\mathit{Rational}))\;\mathbf{where} \\ \qquad \mathit{go}\;(i,u,v) = \begin{array}[t]{@{}ll} \mathbf{if} & y == \mathit{floor}\;(u+v \times 4) \\ \mathbf{then} & y : \mathit{go}\;(i,10 \times (u-\mathit{fromIntegral}\;y),10 \times v) \\ \mathbf{else} & \mathit{go}\;(i+1,u+2 \times v, (v \times i) / (2 \times i+1)) \\ \multicolumn{2}{@{}l}{\qquad \mathbf{where}\; y = \mathit{floor}\;(u+v \times 3)} \end{array} \end{array}

Then we have

\displaystyle  \mathit{piDigits} = [3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3...
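For readers who want to try this, here is a plain-text transcription of the simplified program as a sketch; it differs from the typeset version only in using guards and an explicit type signature on the helper.

```haskell
-- An unbounded list of the decimal digits of pi, following the simplified program.
piDigits :: [Integer]
piDigits = go (1, 0, 1)
  where
    go :: (Rational, Rational, Rational) -> [Integer]
    go (i, u, v)
      -- Both bounds agree: emit the digit and rescale the state by 10.
      | y == floor (u + v * 4) = y : go (i, 10 * (u - fromInteger y), 10 * v)
      -- Otherwise consume the next input digit (always 2) in base i / (2i + 1).
      | otherwise              = go (i + 1, u + 2 * v, (v * i) / (2 * i + 1))
      where
        y = floor (u + v * 3)
```

Because the list is infinite, it should be consumed lazily, for instance with take 16 piDigits.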

by jeremygibbons at November 15, 2017 05:22 PM


It appears that I have insufficient time, or at least insufficient discipline, to contribute to this blog, except when I am on sabbatical. Which I now am… so let’s see if I can do better.


I don’t think I’ve written about them yet in this series—another story, for another day—but hylomorphisms consist of a fold after an unfold. One very simple example is the factorial function: {n!} is the product of the predecessors {[n,...,1]} of {n}. The predecessors can be computed with an unfold:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{preds} &::& \mathit{Integer} \rightarrow [\mathit{Integer}] \\ \mathit{preds} &=& \mathit{unfoldr}\;\mathit{step} \; \mathbf{where} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{step}\;0 &=& \mathit{Nothing} \\ \mathit{step}\;n &=& \mathit{Just}\;(n, n-1) \end{array} \end{array}

and the product as a fold:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{prod} &::& [\mathit{Integer}] \rightarrow \mathit{Integer} \\ \mathit{prod} &=& \mathit{foldr}\;(\times)\;1 \end{array}

and then factorial is their composition:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{factorial} &::& \mathit{Integer} \rightarrow \mathit{Integer} \\ \mathit{factorial} &=& \mathit{prod} \cdot \mathit{preds} \end{array}

Another example is a tree-based sorting algorithm that resembles Hoare’s quicksort: from the input list, grow a binary search tree, as an unfold, and then flatten that tree back to a sorted list, as a fold. This is a divide-and-conquer algorithm; in general, these can be modelled as unfolding a tree of subproblems by repeatedly dividing the problem, then collecting the solution to the original problem by folding together the solutions to subproblems.

An unfold after a fold

This post is about the opposite composition, an unfold after a fold. Some examples:

  • {\mathit{regroup}\;n = \mathit{group}\;n \cdot \mathit{concat}} to reformat a list of lists to a given length;
  • {\mathit{heapsort} = \mathit{flattenHeap} \cdot \mathit{buildHeap}} to sort a list;
  • {\mathit{baseconv}\;(b,c) = \mathit{toBase}\;b \cdot \mathit{fromBase}\;c} to convert a fraction from base {c} to base {b};
  • {\mathit{arithCode} = \mathit{toBits} \cdot \mathit{narrow}} to encode a text in binary by “arithmetic coding”.

In each of these cases, the first phase is a fold, which consumes some structured representation of a value into an intermediate unstructured format, and the second phase is an unfold, which generates a new structured representation. Their composition effects a change of representation, so we call them metamorphisms.

Hylomorphisms always fuse, and one can deforest the intermediate virtual data structure. For example, one need not construct the intermediate list in the factorial function; since each cell gets constructed in the unfold only to be immediately deconstructed in the fold, one can cut to the chase and go straight to the familiar recursive definition. For the base case, we have:

\displaystyle  \begin{array}{ll} & \mathit{factorial}\;0 \\ = & \qquad \{ \mathit{factorial} \} \\ & \mathit{prod}\;(\mathit{preds}\;0) \\ = & \qquad \{ \mathit{preds} \} \\ & \mathit{prod}\;[\,] \\ = & \qquad \{ \mathit{prod} \} \\ & 1 \end{array}

and for non-zero argument {n}, we have:

\displaystyle  \begin{array}{ll} & \mathit{factorial}\;n \\ = & \qquad \{ \mathit{factorial} \} \\ & \mathit{prod}\;(\mathit{preds}\;n) \\ = & \qquad \{ \mathit{preds} \} \\ & \mathit{prod}\;(n : \mathit{preds}\;(n-1)) \\ = & \qquad \{ \mathit{prod} \} \\ & n \times \mathit{prod}\;(\mathit{preds}\;(n-1)) \\ = & \qquad \{ \mathit{factorial} \} \\ & n \times \mathit{factorial}\;(n-1) \\ \end{array}
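In plain Haskell, the unfused and fused definitions can be put side by side. This is a sketch; the name factorialFused is introduced here for comparison and is not from the original post. Like the typeset version, both definitions assume a non-negative argument and do not terminate otherwise.

```haskell
import Data.List (unfoldr)

-- The predecessors [n, n-1 .. 1] of n, as an unfold.
preds :: Integer -> [Integer]
preds = unfoldr step
  where
    step 0 = Nothing
    step n = Just (n, n - 1)

-- The product of a list, as a fold.
prod :: [Integer] -> Integer
prod = foldr (*) 1

-- Factorial as a hylomorphism: a fold after an unfold.
factorial :: Integer -> Integer
factorial = prod . preds

-- The fused form obtained by the two calculations above; no list is built.
factorialFused :: Integer -> Integer
factorialFused 0 = 1
factorialFused n = n * factorialFused (n - 1)
```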

In contrast, metamorphisms only fuse under certain conditions. However, when they do fuse, they also allow infinite representations to be processed, as we shall see.

Fusion seems to depend on the fold being tail-recursive; that is, a {\mathit{foldl}}:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{foldl} &::& (\beta \rightarrow \alpha \rightarrow \beta) \rightarrow \beta \rightarrow [\alpha] \rightarrow \beta \\ \mathit{foldl}\;f\;b\;(a:x) &=& \mathit{foldl}\;f\;(f\;b\;a)\;x \\ \mathit{foldl}\;f\;b\;[\,] &=& b \end{array}

For the unfold phase, we will use the usual list unfold:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{unfoldr} &::& (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow \beta \rightarrow [\gamma] \\ \mathit{unfoldr}\;g\;b &=& \mathbf{case}\;g\;b\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{Just}\;(c,b') &\rightarrow& c : \mathit{unfoldr}\;g\;b' \\ \mathit{Nothing} &\rightarrow& [\,] \end{array} \end{array}

We define a metamorphism as their composition:

\displaystyle  \begin{array}{l} \mathit{meta} :: (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow (\beta \rightarrow \alpha \rightarrow \beta) \rightarrow \beta \rightarrow [\alpha] \rightarrow [\gamma] \\ \mathit{meta}\;g\;f\;b = \mathit{unfoldr}\;g \cdot \mathit{foldl}\;f\;b \end{array}

This transforms input of type {[\alpha]} to output of type {[\gamma]}: in the first phase, {\mathit{foldl}\;f\;b}, it consumes all the input into an intermediate value of type {\beta}; in the second phase, {\mathit{unfoldr}\;g}, it produces all the output.
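As a small illustration in plain Haskell, here is the definition together with a toy instance. The instance (decimal digits in, binary digits out, least significant bit first) and the names fromDecimal, toBinary and decimalToBinary are hypothetical, chosen only to show the two phases.

```haskell
import Data.List (unfoldr)

-- A metamorphism: an unfold after a (left) fold.
meta :: (b -> Maybe (c, b)) -> (b -> a -> b) -> b -> [a] -> [c]
meta g f b = unfoldr g . foldl f b

-- Fold step: absorb one decimal digit into the number read so far.
fromDecimal :: Integer -> Integer -> Integer
fromDecimal n d = 10 * n + d

-- Unfold step: emit the lowest binary digit, or stop at zero.
toBinary :: Integer -> Maybe (Integer, Integer)
toBinary 0 = Nothing
toBinary n = Just (n `mod` 2, n `div` 2)

-- Decimal digits (most significant first) to binary digits (least significant first).
decimalToBinary :: [Integer] -> [Integer]
decimalToBinary = meta toBinary fromDecimal 0
```

Note that no output digit can be produced until the whole input has been read, since the lowest binary digit depends on the last decimal digit; this particular instance is therefore not a candidate for streaming.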


Under certain conditions, it is possible to fuse these two phases: this time, not in order to eliminate an intermediate data structure (after all, the intermediate type {\beta} need not be structured), but rather in order to allow some production steps to happen before all the consumption steps are complete.

To that end, we define the {\mathit{stream}} function as follows:

\displaystyle  \begin{array}{@{}lcl@{}} \multicolumn{3}{@{}l}{\mathit{stream} :: (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow (\beta \rightarrow \alpha \rightarrow \beta) \rightarrow \beta \rightarrow [\alpha] \rightarrow [\gamma]} \\ \mathit{stream}\;g\;f\;b\;x &=& \mathbf{case}\;g\;b\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{Just}\;(c,b') &\rightarrow& c : \mathit{stream}\;g\;f\;b'\;x \\ \mathit{Nothing} &\rightarrow& \mathbf{case}\;x\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} a:x' &\rightarrow& \mathit{stream}\;g\;f\;(f\;b\;a)\;x' \\ {[\,]} &\rightarrow& [\,] \end{array} \end{array} \end{array}

This takes the same arguments as {\mathit{meta}}. It maintains a current state {b}, and produces an output element {c} when it can; and when it can’t produce, it consumes an input element instead. In more detail, it examines the current state {b} using function {g}, which is like the body of an unfold; this may produce a first element {c} of the result and a new state {b'}; when it yields no element, the next element {a} of the input is consumed using function {f}, which is like the body of a {\mathit{foldl}}; and when no input remains either, we are done.

The streaming condition for {f} and {g} is that

\displaystyle  g\;b = \mathit{Just}\;(c,b') \quad\Rightarrow\quad \forall a \mathbin{.} g\;(f\;b\;a) = \mathit{Just}\;(c, f\;b'\;a)

Consider a state {b} from which the body {g} of the unfold is productive, yielding some {\mathit{Just}\;(c,b')}. From here we have two choices: we can either produce the output {c}, move to intermediate state {b'}, then consume the next input {a} to yield a final state {f\;b'\;a}; or we can consume first to get the intermediate state {f\;b\;a}, and again try to produce. The streaming condition says that this intermediate state {f\;b\;a} will again be productive, and will yield the same output {c} and the same final state {f\;b'\;a}. That is, instead of consuming all the inputs first, and then producing all the outputs, it is possible to produce some of the outputs early, without jeopardizing the overall result. Provided that the streaming condition holds for {f} and {g}, then

\displaystyle  \mathit{stream}\;g\;f\;b\;x = \mathit{meta}\;g\;f\;b\;x

for all finite lists {x}.

As a simple example, consider the ‘buffering’ process {\mathit{meta}\;\mathit{uncons}\;(\mathbin{{+}\!\!\!{+}})\;[\,]}, where

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{uncons}\;x &=& \mathbf{case}\;x\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} [\,] &\rightarrow& \mathit{Nothing} \\ c:x' &\rightarrow& \mathit{Just}\;(c,x') \end{array} \end{array}

Note that {\mathit{unfoldr}\;\mathit{uncons} = \mathit{id}}, so {\mathit{meta}\;\mathit{uncons}\;(\mathbin{{+}\!\!\!{+}})\;[\,]} is just a complicated way of writing {\mathit{concat}} as a {\mathit{foldl}}. But the streaming condition holds for {\mathbin{{+}\!\!\!{+}}} and {\mathit{uncons}} (as you may check), so {\mathit{concat}} may be streamed. Operationally, the streaming version of {\mathit{concat}} consumes one list from the input list of lists, then peels off and produces its elements one by one; when they have all been delivered, it consumes the next input list, and so on.
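The operational behaviour is easy to observe in plain Haskell. This is a sketch: uncons' and concatS are names chosen here (the prime avoids a clash with the library function of the same name in Data.List).

```haskell
-- Produce when possible; otherwise consume; stop when neither is possible.
stream :: (b -> Maybe (c, b)) -> (b -> a -> b) -> b -> [a] -> [c]
stream g f b x = case g b of
  Just (c, b') -> c : stream g f b' x
  Nothing      -> case x of
    a : x' -> stream g f (f b a) x'
    []     -> []

-- Split a list into its head and tail, if it has one.
uncons' :: [c] -> Maybe (c, [c])
uncons' []       = Nothing
uncons' (c : x') = Just (c, x')

-- Streaming concatenation: the state is the buffer of elements not yet emitted.
concatS :: [[c]] -> [c]
concatS = stream uncons' (++) []
```

Because production happens as soon as the buffer is non-empty, concatS also delivers a prefix of the result for an infinite list of lists, for example take 5 (concatS (map (\n -> [n, n]) [1 ..])).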


The streaming version of {\mathit{concat}} is actually rather special, because the production steps can always completely exhaust the intermediate state. In contrast, consider the ‘regrouping’ example {\mathit{regroup}\;n = \mathit{meta}\;(\mathit{chunk}\;n)\;(\mathbin{{+}\!\!\!{+}})\;[\,]} where

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{chunk}\;n\;[\,] &=& \mathit{Nothing} \\ \mathit{chunk}\;n\;x &=& \mathit{Just}\;(\mathit{splitAt}\;n\;x) \end{array}

from the introduction (here, {\mathit{splitAt}\;n\;x} yields {(y,z)} where {y \mathbin{{+}\!\!\!{+}} z = x}, with {\mathit{length}\;y=n} when {n \le \mathit{length}\;x} and {y=x} otherwise). This transforms an input list of lists into an output list of lists, where each output ‘chunk’ except perhaps the last has length {n}; if the content doesn’t divide up evenly, then the last chunk is short. One might hope to be able to stream {\mathit{regroup}\;n}, but it doesn’t quite work with the formulation so far. The problem is that {\mathit{chunk}} is too aggressive, and will produce short chunks when there is still some input to consume. (Indeed, the streaming condition does not hold for {\mathbin{{+}\!\!\!{+}}} and {\mathit{chunk}\;n}; why not?) One might try the more cautious producer {\mathit{chunk'}}:

\displaystyle  \begin{array}{@{}lclcl@{}} \mathit{chunk'}\;n\;x &\mid& n \le \mathit{length}\;x &=& \mathit{Just}\;(\mathit{splitAt}\;n\;x) \\ &\mid& \mathbf{otherwise} &=& \mathit{Nothing} \end{array}

But this never produces a short chunk, and so if the content doesn’t divide up evenly then the last few elements will not be extracted from the intermediate state and will be lost.
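The difference between the two producers can be checked directly. The sketch below tests the streaming condition at a single state and a single input; the helper holdsAt is hypothetical and only examines one instance, so it can refute the condition but not prove it.

```haskell
-- The aggressive producer: emits a short chunk whenever anything is buffered.
chunk :: Int -> [a] -> Maybe ([a], [a])
chunk _ [] = Nothing
chunk n x  = Just (splitAt n x)

-- The cautious producer: emits only full chunks.
chunk' :: Int -> [a] -> Maybe ([a], [a])
chunk' n x
  | n <= length x = Just (splitAt n x)
  | otherwise     = Nothing

-- Does the streaming condition for (++) and producer g hold at state b and input a?
holdsAt :: Eq a => ([a] -> Maybe ([a], [a])) -> [a] -> [a] -> Bool
holdsAt g b a = case g b of
  Nothing      -> True                              -- nothing produced: vacuous
  Just (c, b') -> g (b ++ a) == Just (c, b' ++ a)   -- same output, same final state
```

For instance, holdsAt (chunk 3) [1,2] [3,4] is False: producing first yields the short chunk [1,2], whereas consuming first yields [1,2,3]. The corresponding checks with chunk' 3 succeed.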

We need to combine these two producers somehow: the streaming process should behave cautiously while there is still remaining input, which might influence the next output; but it should then switch to a more aggressive strategy once the input is finished, in order to flush out the contents of the intermediate state. To achieve this, we define a more general flushing stream operator:

\displaystyle  \begin{array}{@{}lcl@{}} \multicolumn{3}{@{}l@{}}{\mathit{fstream} :: (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow (\beta \rightarrow [\gamma]) \rightarrow (\beta \rightarrow \alpha \rightarrow \beta) \rightarrow \beta \rightarrow [\alpha] \rightarrow [\gamma]} \\ \mathit{fstream}\;g\;h\;f\;b\;x &=& \mathbf{case}\;g\;b\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{Just}\;(c,b') &\rightarrow& c : \mathit{fstream}\;g\;h\;f\;b'\;x \\ \mathit{Nothing} &\rightarrow& \mathbf{case}\;x\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} a:x' &\rightarrow& \mathit{fstream}\;g\;h\;f\;(f\;b\;a)\;x' \\ {[\,]} &\rightarrow& h\;b \end{array} \end{array} \end{array}

This takes an additional argument {h :: \beta \rightarrow [\gamma]}; when the cautious producer {g} is unproductive, and there is no remaining input to consume, it uses {h} to flush out the remaining output elements from the state. Clearly, specializing to {h\;b=[\,]} retrieves the original {\mathit{stream}} operator.

The corresponding metamorphism uses an apomorphism in place of the unfold. Define

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{apo} &::& (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow (\beta \rightarrow [\gamma]) \rightarrow \beta \rightarrow [\gamma] \\ \mathit{apo}\;g\;h\;b &=& \mathbf{case}\;g\;b\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{Just}\;(c,b') &\rightarrow& c : \mathit{apo}\;g\;h\;b' \\ \mathit{Nothing} &\rightarrow& h\;b \end{array} \end{array}

Then {\mathit{apo}\;g\;h} behaves like {\mathit{unfoldr}\;g}, except that if and when {g} stops being productive it finishes up by applying {h} to the final state. Similarly, define flushing metamorphisms:

\displaystyle  \mathit{fmeta}\;g\;h\;f\;b = \mathit{apo}\;g\;h \cdot \mathit{foldl}\;f\;b

Then we have

\displaystyle  \mathit{fstream}\;g\;h\;f\;b\;x = \mathit{fmeta}\;g\;h\;f\;b\;x

for all finite lists {x} if the streaming condition holds for {f} and {g}. In particular,

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{regroup}\;n\;\mathit{xs} &=& \mathit{fmeta}\;(\mathit{chunk'}\;n)\;(\mathit{unfoldr}\;(\mathit{chunk}\;n))\;(\mathbin{{+}\!\!\!{+}})\;[\,]\;\mathit{xs} \\ &=& \mathit{fstream}\;(\mathit{chunk'}\;n)\;(\mathit{unfoldr}\;(\mathit{chunk}\;n))\;(\mathbin{{+}\!\!\!{+}})\;[\,]\;\mathit{xs} \end{array}

on finite inputs {\mathit{xs}}: the streaming condition does hold for {\mathbin{{+}\!\!\!{+}}} and the more cautious {\mathit{chunk'}\;n}, and once the input has been exhausted, the process can switch to the more aggressive {\mathit{chunk}\;n}.
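Put together in plain Haskell, the flushing stream and the regrouping example look as follows. This is a sketch that assumes a chunk size of at least 1; with a chunk size of 0 the cautious producer would emit empty chunks forever.

```haskell
import Data.List (unfoldr)

-- Like stream, but flush the remaining state with h when the input runs out.
fstream :: (b -> Maybe (c, b)) -> (b -> [c]) -> (b -> a -> b) -> b -> [a] -> [c]
fstream g h f b x = case g b of
  Just (c, b') -> c : fstream g h f b' x
  Nothing      -> case x of
    a : x' -> fstream g h f (f b a) x'
    []     -> h b

-- Aggressive producer, used only for flushing.
chunk :: Int -> [a] -> Maybe ([a], [a])
chunk _ [] = Nothing
chunk n x  = Just (splitAt n x)

-- Cautious producer, used while input may still arrive.
chunk' :: Int -> [a] -> Maybe ([a], [a])
chunk' n x
  | n <= length x = Just (splitAt n x)
  | otherwise     = Nothing

-- Streaming regroup: full chunks while input remains, then flush the rest.
regroup :: Int -> [[a]] -> [[a]]
regroup n = fstream (chunk' n) (unfoldr (chunk n)) (++) []
```

For example, regroup 3 [[1,2],[3,4,5],[6,7]] produces two full chunks and then flushes the final short one.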

Infinite input

The main advantage of streaming is that it can allow the change-of-representation process also to work on infinite inputs. With the plain metamorphism, this is not possible: the {\mathit{foldl}} will yield no result on an infinite input, and so the {\mathit{unfoldr}} will never get started. The {\mathit{stream}}, by contrast, may be able to produce some outputs before having consumed all the inputs. For example, the streaming version of {\mathit{regroup}\;n} also works for infinite lists, provided that the input does not end with an infinite tail of empty lists. And of course, if the input never runs out, then there is no need ever to switch to the more aggressive flushing phase.

As a more interesting example, consider converting a fraction from base 3 to base 7:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{fromBase3} &=& \mathit{foldr}\;\mathit{stepr}\;0 \quad \mathbf{where}\;\mathit{stepr}\;d\;x = (d+x)/3 \\ \mathit{toBase7} &=& \mathit{unfoldr}\;\mathit{next} \quad \mathbf{where}\; \begin{array}[t]{@{}lcl@{}} \mathit{next}\;0 &=& \mathit{Nothing} \\ \mathit{next}\;x &=& \mathbf{let}\;y=7\times x\;\mathbf{in}\;\mathit{Just}\;(\lfloor y\rfloor, y-\lfloor y\rfloor) \end{array} \end{array}

We assume that the input digits are all either 0, 1 or 2, so that the number being represented is in the unit interval.

The fold in {\mathit{fromBase3}} is of the wrong kind; but we have also

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{fromBase3} &=& \mathit{extract} \cdot \mathit{foldl}\;\mathit{stepl}\;(0,1) \quad \mathbf{where}\; \mathit{stepl}\;(u,v)\;d = (u \times 3 + d, v / 3) \end{array}

Here, the intermediate state {(u,v)} can be seen as a defunctionalized representation of the function {(v\times) \cdot (u+)}, and {\mathit{extract}} applies this function to {0}:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{apply}\;(u,v)\;x &=& v \times (u + x) \\ \mathit{extract}\;(u,v) &=& \mathit{apply}\;(u,v)\;0 \end{array}

Now there is an extra function {\mathit{extract}} between the {\mathit{foldl}} and the {\mathit{unfoldr}}; but that’s no obstacle, because it fuses with the {\mathit{unfoldr}}:

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{toBase7} \cdot \mathit{extract} &=& \mathit{unfoldr}\;\mathit{next'} \quad \mathbf{where}\; \begin{array}[t]{@{}lcl@{}} \mathit{next'}\;(0,v) &=& \mathit{Nothing} \\ \mathit{next'}\;(u,v) &=& \begin{array}[t]{@{}l} \mathbf{let}\;y = \lfloor{7 \times u \times v}\rfloor\;\mathbf{in} \\ \mathit{Just}\;(y,(u - y/(v \times 7), v \times 7)) \end{array} \end{array} \end{array}

However, the streaming condition does not hold for {\mathit{stepl}} and {\mathit{next'}}. For example,

\displaystyle  \begin{array}{@{}lcl@{}} \mathit{next'}\;(1,{{}^{1\!}/_{\!3}}) &=& \mathit{Just}\;(2, ({{}^{1\!}/_{\!7}},{{}^{7\!}/_{\!3}})) \\ \mathit{next'}\;(\mathit{stepl}\;(1,{{}^{1\!}/_{\!3}})\;1) &=& \mathit{next'}\;(4,{{}^{1\!}/_{\!9}}) \\ &=& \mathit{Just}\;(3,({{}^{1\!}/_{\!7}},{{}^{7\!}/_{\!9}})) \end{array}

That is, {0.1_3 \simeq 0.222_7}, but {0.11_3 \simeq 0.305_7}, so it is premature to produce the first digit 2 in base 7 having consumed only the first digit 1 in base 3. The producer {\mathit{next'}} is too aggressive; it should be more cautious while input remains that might invalidate a produced digit.
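The counterexample can be replayed with exact rational arithmetic. In this sketch the names b0, produceFirst and consumeFirst are introduced for illustration; stepl and next' follow the definitions in the text.

```haskell
import Data.Ratio ((%))

-- A state (u, v) stands for the function \r -> v * (u + r).
type State = (Rational, Rational)

-- Consume one base-3 digit.
stepl :: State -> Integer -> State
stepl (u, v) d = (u * 3 + fromInteger d, v / 3)

-- The aggressive base-7 producer.
next' :: State -> Maybe (Integer, State)
next' (0, _) = Nothing
next' (u, v) = Just (y, (u - fromInteger y / (v * 7), v * 7))
  where y = floor (7 * u * v)

-- The state after reading the single base-3 digit 1, i.e. 0.1 in base 3.
b0 :: State
b0 = (1, 1 % 3)

-- Produce a digit and then consume the input digit 1 ...
produceFirst :: Maybe (Integer, State)
produceFirst = fmap (\(c, b') -> (c, stepl b' 1)) (next' b0)

-- ... versus consume the input digit 1 and then produce.
consumeFirst :: Maybe (Integer, State)
consumeFirst = next' (stepl b0 1)
```

The streaming condition would require produceFirst and consumeFirst to be equal; they differ already in the digit produced (2 versus 3).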

Fortunately, on the assumption that the input digits are all 0, 1, or 2, the unconsumed input—a tail of the original input—again represents a number in the unit interval; so from the state {(u,v)} the range of possible unproduced outputs represents a number between {\mathit{apply}\;(u,v)\;0} and {\mathit{apply}\;(u,v)\;1}. If these both start with the same digit in base 7, then (and only then) is it safe to produce that digit. So we define

\displaystyle  \mathit{next''}\;(u,v) = \mathbf{if}\;\lfloor{u \times v \times 7}\rfloor = \lfloor{(u+1) \times v \times 7}\rfloor\;\mathbf{then}\;\mathit{next'}\;(u,v)\;\mathbf{else}\;\mathit{Nothing}

and we have

\displaystyle  \mathit{unfoldr}\;\mathit{next'} = \mathit{apo}\;\mathit{next''}\;(\mathit{unfoldr}\;\mathit{next'})

Now, the streaming condition holds for {\mathit{stepl}} and {\mathit{next''}} (as you may check), and therefore

\displaystyle  \mathit{toBase7}\;(\mathit{fromBase3}\;x) = \mathit{fstream}\;\mathit{next''}\;(\mathit{unfoldr}\;\mathit{next'})\;\mathit{stepl}\;(0,1)\;x

on finite digit sequences {x} in base 3. Moreover, the streaming program works also on infinite digit sequences, where the original does not.
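The whole conversion fits in a short program. The following is a sketch assembled from the definitions in the text, with base3to7 as an illustrative name; input digits are assumed to be 0, 1 or 2.

```haskell
import Data.List (unfoldr)

type State = (Rational, Rational)

-- Produce while safe, consume otherwise, and flush with h at the end of the input.
fstream :: (b -> Maybe (c, b)) -> (b -> [c]) -> (b -> a -> b) -> b -> [a] -> [c]
fstream g h f b x = case g b of
  Just (c, b') -> c : fstream g h f b' x
  Nothing      -> case x of
    a : x' -> fstream g h f (f b a) x'
    []     -> h b

-- Consume one base-3 digit.
stepl :: State -> Integer -> State
stepl (u, v) d = (u * 3 + fromInteger d, v / 3)

-- Aggressive producer: always emits the next base-7 digit of v * u.
next' :: State -> Maybe (Integer, State)
next' (0, _) = Nothing
next' (u, v) = Just (y, (u - fromInteger y / (v * 7), v * 7))
  where y = floor (7 * u * v)

-- Cautious producer: emits only when both ends of the window agree on the digit.
next'' :: State -> Maybe (Integer, State)
next'' (u, v)
  | floor (u * v * 7) == (floor ((u + 1) * v * 7) :: Integer) = next' (u, v)
  | otherwise                                                 = Nothing

-- Streaming conversion of a base-3 fraction to base 7.
base3to7 :: [Integer] -> [Integer]
base3to7 = fstream next'' (unfoldr next') stepl (0, 1)
```

Both results may be infinite, so they should be consumed lazily: take 3 (base3to7 (cycle [0, 2])) gives the first base-7 digits of one quarter from an infinite input, and take 3 (base3to7 [1]) gives the first digits of one third, whose base-7 expansion does not terminate.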

(Actually, the only way this could possibly produce a finite output in base 7 would be for the input to be all zeroes. Why? If we are happy to rule out this case, we could consider only the case of taking infinite input to infinite output, and not have to worry about reaching the end of the input or flushing the state.)

by jeremygibbons at November 15, 2017 12:30 PM

Manuel M T Chakravarty

November 14, 2017

Functional Jobs

Backend Ruby and Haskell engineer at Health eFilings (Full-time)

Our backend engineering team manages the ingestion and normalization of data sets, from data extraction through to product delivery. We want to work smarter instead of harder, and create domain specific languages, meta-programming etc. where possible.

Our current code base is written in Ruby and CoffeeScript, but some new modules are being written in Haskell. You will be on the front lines of creating a Haskell-based infrastructure that is maintainable and can scale to support our needs as we grow.

We currently expect that about 80% of your work will be in Ruby/CoffeeScript, and 20% in Haskell, but that ratio will shift toward Haskell over time as we move more of our functionality to Haskell. (The faster you can work to migrate functionality to Haskell, the more Haskell you will be doing.)


You will have ownership of an entire module, including responsibility for:

  • Creating new features in a clean and maintainable way
  • Re-factoring existing code to ensure that we stay agile
  • Reviewing teammates’ code and providing feedback
  • Keeping yourself focused and your projects on track
  • An “I can run through walls” mentality to ensure that goals are met
  • Answering questions from our implementation team and squashing bugs on a monthly support rotation

We are a small team (four engineers), and so it is critical that you be a team player, willing to pitch in and help out your colleagues.


  • Autonomy to solve problems in the way you best see fit
  • A manager who is accountable for ensuring you meet your professional goals
  • A team who helps each other and always strives to improve
  • The time to focus on creating the right solution, instead of the easiest one


  • Professional experience as a software engineer
  • Experience with Haskell and Ruby
  • A desire for continual self-improvement
  • An understanding of best practices regarding maintainability and scalability
  • Must have US work authorization and be located in the US (we cannot sponsor visas at this time)
  • There are no formal education requirements for this position


  • Experience with data scraping and parsing


This is expected to be a remote position, although our Madison, Wisconsin office is also available as a work location.

Get information on how to apply for this position.

November 14, 2017 07:26 PM

November 13, 2017

Tim Docker

Algebraic Data Types in Java

At Helix we often code backend services in java. I find modern java acceptable as a language for getting things done. As a long time haskell developer, however, I find java’s facilities for data types frustrating indeed. These frustrations are twofold. Java lacks support for algebraic data types (ADTs), and requires large amounts of boilerplate to define even simple types.

When designing systems, I place great value in applying the "make illegal states unrepresentable" principle [1]. Using ADTs to more accurately model data is an excellent step in this direction. However, it’s a burden to do in languages like java that lack support for sum types.

Even for regular product types (i.e. records of fields) java can be tedious. Defining a record of a few fields should really only take a corresponding few lines of code. Yet for a useful value type in java one will generally need to write: constructors, accessors, a comparison function, a hash implementation, serialisation logic etc. It’s common in the java world to use IDEs to automatically generate this kind of boilerplate, but subtle bugs can creep in over time as the once-generated code isn’t manually updated to reflect subsequent changes in the data model.

Hence, at Helix we now often use my ADL language to define data types, and generate the corresponding java code from them. As a tiny example, these adl definitions (see complete file here):

    struct Rectangle
    {
        Double width;
        Double height;
    };

    union Picture
    {
        Circle circle;
        Rectangle rectangle;
        Vector<Picture> composed;
        Translated<Picture> translated;
    };

result in the corresponding generated java classes. These two definitions alone correspond to 280 lines of java code (that you really don’t want to write and maintain). As can be seen in the Translated<> type, parametric polymorphism is supported.
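For comparison, here is roughly how the same shapes might be written directly in haskell. This is only a sketch: the post does not show the fields of Circle or Translated, so those two definitions (and the constructor names and the shapeCount function) are assumptions made for illustration, not the output of the ADL compiler.

```haskell
-- Product type: a record of two fields.
data Rectangle = Rectangle
  { width  :: Double
  , height :: Double
  } deriving (Eq, Show)

-- Assumed shape: the post does not show Circle's fields.
newtype Circle = Circle { radius :: Double }
  deriving (Eq, Show)

-- Assumed shape: a value of type a shifted by an offset.
data Translated a = Translated
  { xoffset :: Double
  , yoffset :: Double
  , object  :: a
  } deriving (Eq, Show)

-- Sum type: one constructor per branch of the ADL union.
data Picture
  = PCircle Circle
  | PRectangle Rectangle
  | PComposed [Picture]
  | PTranslated (Translated Picture)
  deriving (Eq, Show)

-- Count the primitive shapes in a picture by case analysis on the sum type.
shapeCount :: Picture -> Int
shapeCount (PCircle _)     = 1
shapeCount (PRectangle _)  = 1
shapeCount (PComposed ps)  = sum (map shapeCount ps)
shapeCount (PTranslated t) = shapeCount (object t)
```

The whole model is about twenty lines, and a function such as shapeCount must handle every branch of the union, which is the kind of checking that sum types make possible.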

I find that being able to define data types concisely encourages me to build more accurate data models, resulting in systems that are more robust and better reflect the problem domain. And ADL’s multi-language support (java, haskell, typescript) allows us to easily serialize and transfer the corresponding data values between our java services, and our typescript web and mobile UIs.

  1. Attributed to Yaron Minsky.

by Tim Docker at November 13, 2017 09:53 PM

Functional Jobs

Scala Developer at LeadIQ (Full-time)

Are you the type of engineer who punches juke boxes to make the music start? Do you consider riding your motorcycle off into the sunset a personal hobby? Is architecting a system from the ground up no big deal to you? We're looking for a full-time Scala developer to make this happen.

The Product

We are on a mission to revolutionize the sales industry using data science. Our product helps our customers collect and enrich their target prospects. Our internal data processing combines human intelligence and data science to enable our customers to find perfect contact information and save it to their existing platforms, such as Salesforce.

The Challenge

  • We are at an exciting stage in our growth. We are getting traction with big customers, scaling out, and solving increasingly complex engineering problems.

  • Our systems are mostly written in Scala. We use Kafka as the backbone for communication between our API server and microservices. Smart architecture design is crucial to guarantee that our microservice-based systems run smoothly and reliably.

  • We're looking for someone who can drive our product's backend integration features, refactor existing code for faster responses, and become an important asset to the rest of the engineering team.

  • Data quality is one of the critical factors in making our product successful. We often need to process third-party data and clean existing data using Spark, so you need to be comfortable writing Spark scripts.

  • We have very complex integrations with third-party systems like Salesforce. These integrations are core to what we offer our customers. We're looking for someone who is willing to listen to customer feedback to improve existing features and provide new features for customer success.

The Stack

Scala, Kafka, Spark, MongoDB, ElasticSearch, Docker, Vue.js

The Team

We want team members with attributes like:

  • Focus on delivering value to the customer
  • Strong belief in collaboration
  • Passion that drives you to execute and innovate
  • Ability to self-manage and take ownership of a feature
  • Ability to juggle many projects and responsibilities
  • Extremely entrepreneurial and self-driven
  • Not afraid of a steep learning curve
  • Passionate about building a big business that transforms the sales industry
  • Exceptional at writing scalable, production-ready code
  • Thrive in a fast-paced environment
  • Avoid over-engineering
  • Simple designs and fast execution
  • Discipline in following process and documenting your work

These personality traits define the culture we are building and are more important to us than a particular set of technical skills.

The Responsibilities

If you join LeadIQ, you will learn a lot. On the technical side, there are many cool tools, technologies, patterns and great developers that will sharpen your skills. On the personal side, you will be given the chance to step up, lead and make your mark in a growing startup as we tackle the challenges of our next phase of growth.

On the technical front, we need you skilled in:

  • Scala (but experience in another functional language helps, e.g. Haskell or Clojure)
  • Play framework
  • Concurrency (futures, actors, basic understanding of threads)

So if you feel like you're a good fit for us, drop us a line! We love meeting developers who are excited by our product!

Get information on how to apply for this position.

November 13, 2017 02:46 PM

November 12, 2017

Michael Snoyman

Future proofing test suites

I'll start with the specific case I've seen pop up a few times recently, and then expand to the general. If you're a package author who has been affected by this, please note: I'm putting this information into a blog post since it's easier to state this once and link to it rather than rewrite an explanation on lots of different bug trackers.

hlint is a great tool for getting advice on improving your Haskell codebase (another great Neil Mitchell product). And as such tools go, hlint has new versions which improve its ability to provide useful advice. This means that, sometimes, code which triggered no hlint warnings previously may suddenly present with such warnings under a new hlint version.

Twice recently in my Stackage curation, I've seen a number of test suites fail, even though the code for those packages was unmodified. It turns out that the upgrade to a new version of hlint caused a previously successful test suite to now fail. Clearly the code isn't suddenly broken because a new version of hlint has been released, but as far as the diagnostics of test suite failures are concerned, that's exactly what happened.


I do strongly recommend projects use hlint to get code improvements. And I've seen some great results with using it as part of the CI process, such as on Stack. (For the record: it wasn't my idea and I didn't implement it. I was just pleasantly surprised when my PRs failed because I had some style errors.) However, making the test suite for the entire package fail because of a new version of hlint is too much. Therefore:

  • DO Have some way to run hlint from your CI process, if you want these warnings to block PRs. There are two approaches I can think of:

    • The way Stack does it: have a separate part of the build matrix just for style errors. The cabal file for the project itself knows nothing about hlint.
    • Via a test suite in your cabal file which is disabled by default. Then: turn on that test suite with a flag from your CI configuration.
  • DON'T Set up your package which is uploaded to Hackage/built by Stackage such that it will fail if a style-based error occurs.

General recommendation

The general takeaway from this is: when you're building your code on CI, be as strict as you want. Set high standards, block PRs, call master broken, for whatever trivial or non-trivial issues you deem worthy. Turn on -Wall -Werror, respect hlint, error out if someone uses tabs* or includes trailing whitespace. That's all good.

* Cue necessary tabs-vs-spaces argument

However, when you're releasing your code elsewhere, make the tests as lenient as possible on optional features. If the code fails to build: that's a problem. If the code builds, but returns incorrect runtime results: that's a problem. These should stop build systems like Stackage from including your package. But stylistic issues, or newly introduced warnings from the compiler, or myriad other issues, should not trigger a failure for downstream consumers of your package.

November 12, 2017 05:00 PM

November 10, 2017

Neil Mitchell

Ghcid with VS Code

Summary: New versions of Ghcid and the VS Code extension work even better together.

I've just released Ghcid v0.6.8 and the associated VS Code extension haskell-ghcid v0.2.0. Together they vastly simplify the Ghcid VS Code experience.

Ghcid reads .ghcid files

A new feature in Ghcid is that if there is a .ghcid file in the current directory it will load it as additional arguments. For example, in the Shake repo I have a .ghcid file:

-c "ghci -fno-code -ferror-spans"

Which tells ghcid to not guess at the command (e.g. using stack if you have a .stack-work) but always run ghci -fno-code -ferror-spans. This command works because I have a .ghci file which loads all the necessary files, while -fno-code speeds up compilation and -ferror-spans gives better error highlighting.

Ghcid VS Code starts ghcid

A new feature in the VS Code extension is the action Start Ghcid which starts a new ghcid terminal, writes the output to a temporary file, and uses that output to populate the Problems pane. Importantly, the extension runs ghcid with no command line arguments, so having a sensible .ghcid lets you control what it does.

The effect of these changes is that starting ghcid in VS Code now takes a few keystrokes, whereas before it required special flags, opening files, running commands etc.

by Neil Mitchell at November 10, 2017 11:06 PM

Tweag I/O

Nix on the Windows Subsystem for Linux

Jonas Chevalier

Nix on Windows: does it run yet? That's the question I wondered about while testing the latest NixOS release, version 17.09. To that end, I had the idea of running the Nix installation process from inside the Windows Subsystem for Linux (WSL) to see if it worked. And it worked! Success!

So what does this mean?

You might remember that the Windows NT kernel used to have a POSIX layer. Unfortunately, the POSIX layer always had compatibility issues with BSD and Linux software, because typical applications seldom fit completely and entirely within the confines of an age-old API. Nevertheless, the NT kernel was designed from the start to support different subsystems, not just Win32, and the POSIX layer of old was a step in the right direction. The WSL is a revival of that idea but with a specific focus on the Linux ABI. It means that it is now possible to run Linux software natively on Windows. Think of it as reverse Wine. Linux software can execute Windows software and vice versa.

It's not perfect yet. I/O and symlink resolution seem to be slow and not all Linux syscalls have been implemented yet. This is more about the promised land that Microsoft is showing. WSL is not available on the server edition yet, but it looks like they are going to deliver on it.

At Tweag, we often use Nix to declaratively specify reproducible build environments for our projects and those of our clients. Nix is a good fit for projects that mix different languages. It works really well at providing reproducible builds and composing the various parts of the project with external dependencies. Unfortunately, it is also not supported on Windows, so we have to decide upfront whether to use it based in part on whether Windows is going to become a target platform or not. Thanks to WSL it looks like we will have an escape hatch, at least for non-graphical applications.

Another potential use-case that I see is for Haskell development. Today, a lot of good software is being developed directly on top of Linux and macOS. For some of these projects Windows is not a prime target environment anymore. The Glasgow Haskell Compiler (GHC) is actually quite well behaved on Windows when compiling pure Haskell code. But as soon as C library dependencies are involved, the story gets a lot more complicated. In that case, deploying via WSL might just be easier than aiming for a native Windows port.

How to install

Enable and install WSL following these instructions:

Make sure to have the latest version of Windows 10 installed. I had this version at the time of install:

  • Windows Edition: Windows 10 Pro
  • Windows Version: 1703
  • Windows OS Build: 15063.540
  • System Type: 64-bit operating system

Start the “Bash On Ubuntu On Windows” program and type curl | sh.

Known issues

WSL is still an experimental subsystem. At this point in time, there are still important issues to know about. Here are the workarounds I came up with:

  • curl is hanging. Hit Ctrl+C and retry.
  • Nix installation crash. Older versions of WSL didn't support all the syscalls needed by Nix. Update Windows and try again.
  • nix-shell is broken. It fails with a synchronous I/O disk error. Here's a workaround: edit /etc/nix/nix.conf and add use-sqlite-wal=false
  • It’s slow. Yes, especially I/O and symlinks seem to be quite slow. The only solution here is to wait for Microsoft to optimise their syscalls.
  • Nix environment is not started in new logins. Workaround: Run source ~/.profile


For now, it's just a technology preview that opens new possibilities. Hopefully in the future, when the performance of I/O operations improves, it will also be enjoyable to develop Linux programs under WSL directly. Meanwhile, Microsoft has put out useful resources to go further with WSL:

November 10, 2017 12:00 AM

November 05, 2017

Douglas M. Auclair (geophf)

October 2017 1Liner 1HaskellADay problems and solutions

  • October 20th, 2017:
    You have a list of numbers: [1,2,3,4]
    You have a list of the same length of number fns: [succ, id, id, succ]
    You want: [2,2,3,5]
    •  🇪🇺 Cλément D  🌈  🐇 @clementd zipWith (flip ($)) ?
      •  he adds: `zipWith (flip id)` is a bit shorter tho
    • Simon Courtenage @SCourtenage zipWith ($) [succ,id,id,succ] [1,2,3,4]
    • lukasz @lukaszklekot getZipList $ ZipList [succ, id, id, succ] <*> ZipList [1, 2, 3, 4]
    • Alexey Radkov @sheshanaag (map (uncurry ($)) .) . zip
  • October 5th, 2017: "reverse the sequencing"
    You have [[(1,2),(1,3),(1,7)],[(9,2)],[(11,3)]]
    You want [(1,[2,3,7]),(9,[2]),(11,[3])]
    • bazzargh @bazzargh map ((,) <$> head.(map fst) <*> (map snd))
    • bazzargh @bazzargh map ((first head).unzip)
    • Chris Martin @chris__martin \x -> [(a, b : fmap snd xs) | Just ((a, b) :| xs) <- fmap nonEmpty x]
    • Simon Courtenage @SCourtenage fmap (\x -> (fst . head $ x, fmap snd x))
      • Denis Stoyanov  🐜 @xgrommx Your solution nice) but u can do it with point free style like
        • fmap(fst.head &&& fmap snd)
    • Denis Stoyanov  🐜 @xgrommx My solution is ugly, but I wanna to solve it with traverse)
      • fmap(first head . traverse (first (:[])))
    • Andreas Källberg @Anka213 map$fst.head&&&map snd
    • Scott Fleischman @scottfleischman
        $ _1
              [y] -> Just y
              _ -> Nothing
          . nub
        . unzip
        :: [[(Int, Int)]] -> Maybe [(Int, [Int])]
    • Scott Fleischman @scottfleischman
      let sing [] = Left "Too few"
          sing [x] = Right x
          sing (_ : _) = Left "Too many"
          valid = sing . nub
          go = _1 valid . unzip
      in traverse go
    • matt @themattchan map ((head *** id ) . unzip)
  • October 3rd, 2017:
    you have [(1,[2,3,4]),(10,[5,6,7])]
    you want [(1,2),(1,3),(1,4),(10,5),(10,6),(10,7)]

    or, generally: [(a,[b])] -> [(a,b)]


    • bazzargh @bazzargh (uncurry (zip . repeat) =<<)
    • Bruno @Brun0Cad (=<<) sequence
    • Denis Stoyanov  🐜 @xgrommx fmap (uncurry (liftA2(,) . (:[])))
      • Darren G @Kludgy I like that this doesn't unnecessarily implicate the sequentiality of bind.
    • Darren G @Kludgy Funny this same product came up at work last week.
      concatMap $ \(a,bs) -> fmap (\b -> (a,b)) bs
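
For readers who want to try these in GHCi, here is a small sketch that gives one answer per problem a name and a type. The names applyEach, regroup, flatten and flattenSeq are made up for this example; the bodies are taken from the replies above (Simon Courtenage's zipWith, the fst.head &&& map snd form, Darren G's concatMap, and Bruno's sequence).

```haskell
import Control.Arrow ((&&&))

-- October 20th: apply each function to the number at the same position.
applyEach :: [a -> b] -> [a] -> [b]
applyEach = zipWith ($)

-- October 5th: collapse each group into (shared key, all values).
-- `head` is partial, so this assumes every inner list is non-empty.
regroup :: [[(a, b)]] -> [(a, [b])]
regroup = map (fst . head &&& map snd)

-- October 3rd: pair each key with every element of its list.
flatten :: [(a, [b])] -> [(a, b)]
flatten = concatMap (\(a, bs) -> map (\b -> (a, b)) bs)

-- The same thing via the Traversable instance for pairs.
flattenSeq :: [(a, [b])] -> [(a, b)]
flattenSeq = (sequence =<<)
```

Unlike Scott Fleischman's answers, regroup does not check that all keys within a group agree; it simply takes the first one.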

by geophf at November 05, 2017 04:35 AM

November 04, 2017

Neil Mitchell

Understanding HLint rules

Summary: I added a degenerate foldr to map rule in the new version of HLint. Here I describe how it works.

I've just released HLint 2.0.10, which includes a rule to recognise uses of foldr that should really be map. As an example:

foldr (\curr acc -> (+1) curr : acc) []

Can be rewritten as:

map (\curr -> (+1) curr)

Which is much more readable (and then subsequently HLint will suggest map (+1), which is vastly clearer than the initial foldr). The change required to HLint was to add a rule to the hlint.yaml saying:

- warn: {lhs: "foldr (\\c a -> x : a) []", rhs: "map (\\c -> x)"}

You can read this statement as saying if you see foldr (\c a -> x : a) [], suggest map (\c -> x) as a warning. The HLint matching engine then applies that template to every subexpression in your program. In the rest of the post I'll talk through the steps HLint performs.
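
As a quick sanity check that the rewrite preserves behaviour, the three forms can be written side by side. The names viaFoldr, viaMap and viaMapShort are invented for this sketch, and the type is fixed to [Int] only to keep the example concrete.

```haskell
-- The shape the rule matches: a foldr whose step only conses onto the accumulator.
viaFoldr :: [Int] -> [Int]
viaFoldr = foldr (\curr acc -> (+1) curr : acc) []

-- The rewrite produced by the rule.
viaMap :: [Int] -> [Int]
viaMap = map (\curr -> (+1) curr)

-- The follow-up suggestion once the lambda is simplified away.
viaMapShort :: [Int] -> [Int]
viaMapShort = map (+1)
```

All three should agree on any finite list, for example viaFoldr [1,2,3] and viaMapShort [1,2,3].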

Step 1: Unification

The first step is to try unifying the template foldr (\c a -> x : a) [] against the user's subexpression, namely foldr (\curr acc -> (+1) curr : acc) []. HLint is trying to find assignments for the single-letter variables in the template (namely c, a and x) which cause it to match the subexpression. Unification proceeds top-down, and if it finds anything concrete that does not match (e.g. the user had written foldl) then it fails. In this case the unification succeeds with the bindings:

  • c = curr (from the first argument to the lambda)
  • a = acc (from the second argument to the lambda)
  • x = (+1) curr (from before the cons)
  • a = acc (from after the cons)

An example of a subexpression that would have failed unification is foldl (\curr acc -> (+1) curr : acc) [].
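
To make the idea concrete, here is a minimal sketch of top-down unification over a toy expression type. It is not HLint's code: HLint works on a full Haskell AST, whereas Expr, unify, template and subject below are hypothetical names for this example, and the section (+1) is kept as an opaque name.

```haskell
import Data.Char (isLower)

-- Toy expression type, assumed for illustration only.
data Expr
  = Var String        -- variable, constructor or operator name
  | App Expr Expr     -- function application
  | Lam String Expr   -- single-argument lambda
  deriving (Eq, Show)

type Bindings = [(String, Expr)]

-- Single lowercase letters in the template act as unification variables.
isUnifyVar :: String -> Bool
isUnifyVar [c] = isLower c
isUnifyVar _   = False

-- Walk template and subexpression together, collecting bindings.
-- Any concrete mismatch makes the whole match fail with Nothing.
unify :: Expr -> Expr -> Maybe Bindings
unify (Var v) e | isUnifyVar v = Just [(v, e)]
unify (Var v) (Var w) | v == w = Just []
unify (App f x) (App g y) = (++) <$> unify f g <*> unify x y
unify (Lam p b) (Lam q c)
  | isUnifyVar p = ((p, Var q) :) <$> unify b c
  | p == q       = unify b c
unify _ _ = Nothing

-- foldr (\c a -> x : a) []
template :: Expr
template =
  App (App (Var "foldr")
           (Lam "c" (Lam "a" (App (App (Var ":") (Var "x")) (Var "a")))))
      (Var "[]")

-- fn (\curr acc -> (+1) curr : acc) [], with the outer function name as a parameter
subject :: String -> Expr
subject fn =
  App (App (Var fn)
           (Lam "curr" (Lam "acc"
             (App (App (Var ":") (App (Var "(+1)") (Var "curr"))) (Var "acc")))))
      (Var "[]")
```

With these definitions, unify template (subject "foldr") yields the four bindings listed above in the same order, while unify template (subject "foldl") is Nothing.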

Step 2: Validity

The next step is to check that any value which has been bound more than once is equal in all bindings. In our case only a has been used twice, and it always binds to acc, so the unification is valid.

An example of a subexpression that would have failed validity is foldr (\curr acc -> (+1) curr : xs) [].
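
A minimal sketch of this check, assuming bindings are represented as a list of (variable, expression) pairs with expressions kept as plain strings for brevity. The name validSubst mirrors the one used in the full code further down, but this body is an illustration, not HLint's implementation.

```haskell
import Data.List (nub)

-- Expressions are plain strings here purely to keep the example short.
type Bindings = [(String, String)]

-- Succeeds (and removes duplicate pairs) only when every variable
-- is bound to a single expression across all of its occurrences.
validSubst :: Bindings -> Maybe Bindings
validSubst bs
  | all consistent names = Just (nub bs)
  | otherwise            = Nothing
  where
    names = nub (map fst bs)
    consistent n = length (nub [e | (m, e) <- bs, m == n]) == 1
```

For the running example, a is bound to acc twice and the check passes; if the second binding were xs instead, the result would be Nothing.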

Step 3: Substitution

Now we've got some bindings, we can substitute them into the RHS, namely map (\c -> x). We replace c and x using the bindings above. Note that a isn't mentioned on the RHS, so we don't use it. After substitution we get:

map (\curr -> (+1) curr)
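
Substitution can be sketched over the same kind of toy expression type. Everything here (Expr, substitute, bindings, rhs) is a hypothetical stand-in rather than HLint's real definitions, and (+1) is again treated as an opaque name.

```haskell
import Data.Maybe (fromMaybe)

-- Toy expression type, assumed for illustration only.
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  deriving (Eq, Show)

-- Replace every template variable that has a binding; leave other names alone.
-- A lambda parameter is renamed when it is bound to a plain variable.
substitute :: [(String, Expr)] -> Expr -> Expr
substitute bs (Var v)   = fromMaybe (Var v) (lookup v bs)
substitute bs (App f x) = App (substitute bs f) (substitute bs x)
substitute bs (Lam p b) = Lam (rename p) (substitute bs b)
  where
    rename n = case lookup n bs of
      Just (Var m) -> m
      _            -> n

-- Bindings from the unification step (a is present but unused by the RHS).
bindings :: [(String, Expr)]
bindings =
  [ ("c", Var "curr")
  , ("a", Var "acc")
  , ("x", App (Var "(+1)") (Var "curr"))
  ]

-- map (\c -> x)
rhs :: Expr
rhs = App (Var "map") (Lam "c" (Var "x"))
```

Evaluating substitute bindings rhs gives the tree for map (\curr -> (+1) curr).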

Step 4: Free variable check

Consider the expression foldr (\curr acc -> f acc : acc) []. Using the rules above we'd end up with map (\curr -> f acc), which is terrible, since we've gone from referring to a locally bound acc to whatever acc is in scope (if any). To solve that, we check that the result doesn't introduce any new free variables:

(freeVars result \\ freeVars hintRuleRHS) `isSubsetOf` freeVars original

Specifically any free variables introduced in the result, which weren't in the RHS (excluding the fake unification variables), must have been in the original subexpression.
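
Here is a small sketch of that check on a toy expression type, using lists as sets. It follows the simplified formula shown above rather than the real implementation, and the names Expr, freeVars and noNewFreeVars are assumptions made for this example.

```haskell
import Data.List (nub, (\\))

-- Toy expression type, assumed for illustration only.
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  deriving (Eq, Show)

-- Names used in an expression that are not bound by an enclosing lambda.
-- The result never contains duplicates, so list difference acts as set difference.
freeVars :: Expr -> [String]
freeVars (Var v)   = [v]
freeVars (App f x) = nub (freeVars f ++ freeVars x)
freeVars (Lam p b) = filter (/= p) (freeVars b)

-- Every free variable of the result that did not come from the rule's RHS
-- must already have been free in the original subexpression.
noNewFreeVars :: Expr -> Expr -> Expr -> Bool
noNewFreeVars original rhs result =
  all (`elem` freeVars original) (freeVars result \\ freeVars rhs)

-- map (\c -> x)
ruleRHS :: Expr
ruleRHS = App (Var "map") (Lam "c" (Var "x"))

-- foldr (\curr acc -> body : acc) [] for a given body
foldrWith :: Expr -> Expr
foldrWith body =
  App (App (Var "foldr")
           (Lam "curr" (Lam "acc" (App (App (Var ":") body) (Var "acc")))))
      (Var "[]")

-- map (\curr -> body) for a given body
mapWith :: Expr -> Expr
mapWith body = App (Var "map") (Lam "curr" body)
```

With body f acc, the candidate result mentions acc as a free variable that was bound in the original, so noNewFreeVars returns False and the hint is dropped. With body (+1) curr it returns True.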

With that, for foldr, we're done. There are a handful of other steps that apply in some cases.

Step A: Dot expansion in the template

If you write a hint map f (map g x) ==> map (f . g) x then HLint notices that also implies the rule map f . map g ==> map (f . g) and adds it. As a result, you shouldn't write your HLint rules in point-free style.

Step B: Dot/dollar expansion in the subexpression

When matching a subexpression HLint will expand f $ x and (f . g) x if doing so results in a match. These operators are used commonly enough that they are often treated more like brackets than functions.

Step C: Scope matching

When unifying qualified function names, HLint uses the active imports to guess whether they match. If you have import qualified Data.Vector as V then the subexpression V.length will unify with Data.Vector.length. Since HLint doesn't have complete import information it uses a few heuristics to figure out matching.

Step D: Scope moving

Similarly to scope matching on the LHS of a rule, after matching, HLint tries to requalify any necessary values on the RHS. As an example, assuming we are producing Data.Vector.null, if we know about import qualified Data.Vector as V then we suggest V.null.

Full code

To see the full code and all supporting definitions go to the HLint source, which defines matchIdea - here I show a gently simplified version. Given scope information, a rule (LHS and RHS) and a subexpression, we optionally produce a resulting expression after substitution.

matchIdea :: Scope -> HintRule -> Exp_ -> Maybe Exp_
matchIdea s HintRule{..} original = do
    u <- unifyExp hintRuleLHS original
    u <- validSubst u
    -- need to check free vars before unqualification, but after subst (with e)
    -- need to unqualify before substitution (with res)
    let result = substitute u hintRuleRHS
    guard $ (freeVars result Set.\\ Set.filter (not . isUnifyVar) (freeVars hintRuleRHS))
            `Set.isSubsetOf` freeVars original
        -- check no unexpected new free variables
    return result

by Neil Mitchell at November 04, 2017 12:07 PM

Gabriel Gonzalez

Semantic integrity checks are the next generation of semantic versioning


The Dhall configuration language just added support for "semantic integrity checks". This post explains what "semantic integrity check" means, motivates the new feature, and compares to semantic versioning.

The problem

I added this feature in response to user concerns about code injection in Dhall configuration files.

We'll illustrate the problem using the following example.dhall configuration file which derives a summary of student information from a list of students:

    -- Example of an expression imported by URL
let map =

-- Example of an expression imported by path
in let students = ./students.dhall

in let getName = λ(student : { name : Text, age : Natural }) → student.name

in { classSize = List/length { name : Text, age : Natural } students
, names = map { name : Text, age : Natural } Text getName students
}

This configuration imports a helper function named map from the Dhall Prelude by URL:

    let map =

in ...

... and that URL currently hosts a text file encoding the following Dhall function:

$ curl -L
{-
Tranform a list by applying a function to each element


./map Natural Bool Natural/even ([+2, +3, +5] : List Natural)
= [True, False, False] : List Bool

./map Natural Bool Natural/even ([] : List Natural)
= [] : List Bool
-}
let map : ∀(a : Type) → ∀(b : Type) → (a → b) → List a → List b
= λ(a : Type)
→ λ(b : Type)
→ λ(f : a → b)
→ λ(xs : List a)
→ List/build
b
( λ(list : Type)
→ λ(cons : b → list → list)
→ List/fold a xs list (λ(x : a) → cons (f x))
)

in map

Similarly, our example configuration imports student data from another configuration file by path:


in let students = ./students.dhall


... and we'll assume that file contains the following list of student records:

[ { name = "Jane Doe"    , age = +19 }
, { name = "John Rivera" , age = +18 }
, { name = "Alice O'Hare", age = +19 }
]

Values, functions, and types are all Dhall expressions, so we can inject all of them in our code via URLs or paths. When we interpret a Dhall configuration file these imports get substituted with their contents and then we evaluate the fully resolved configuration file as an expression in a functional language:

$ dhall <<< './example.dhall'  | dhall-format
{ classSize : Natural, names : List Text }

{ classSize = +3
, names = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

Users were concerned that these imports could be compromised, resulting in malicious code injection.

The solution

The latest release of Dhall added support for import integrity checks to address user concerns about malicious tampering. We can use these integrity checks to "freeze" our imports by adding a SHA-256 hash after each import.

First, we ask the dhall-hash utility to compute the current hash for our imports:

$ dhall-hash <<< ''
$ dhall-hash <<< './students.dhall'

... and then we append the hash after each import to freeze the import:

    let map = sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb

in let students =
./students.dhall sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

in let getName = λ(student : { name : Text, age : Natural }) → student.name

in { classSize = List/length { name : Text, age : Natural } students
, names = map { name : Text, age : Natural } Text getName students
}

Once you add these integrity checks the Dhall interpreter will enforce them when resolving imports. In this case, the example configuration still successfully evaluates to the same result after adding the integrity checks:

$ dhall <<< './example.dhall'  | dhall-format
{ classSize : Natural, names : List Text }

{ classSize = +3
, names = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

The integrity check passes because we haven't yet modified any of our imports.

Semantic integrity

Once you freeze an import with a hash, Dhall guarantees that the meaning of the import never changes. These are semantic hashes, not textual hashes.

For example, suppose that we modify ./students.dhall to add a comment, reorder record fields, and modify the formatting, like this:

-- Class of 2017

[ { age = +19, name = "Jane Doe" },
{ name = "John Rivera" , age = +18 },
{ name = "Alice O'Hare", age = +19 } ]

These changes do not affect the computed hash of the file and the interpreter still accepts the ./students.dhall import that we protected with an integrity check:

$ dhall <<< './example.dhall'  | dhall-format  # Still succeeds
{ classSize : Natural, names : List Text }

{ classSize = +3
, names = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

The Dhall interpreter accepted the import of ./students.dhall because the semantic hash never changed:

$ dhall-hash <<< './students.dhall' 

However, now suppose we try to change the substance of the file by modifying John's age:

-- Class of 2017

[ { age = +19, name = "Jane Doe" },
{ name = "John Rivera" , age = +20 },
{ name = "Alice O'Hare", age = +19 } ]

Now the semantic integrity check fails:

$ dhall <<< './example.dhall'

Error: Import integrity check failed

Expected hash:

↳ 6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

Actual hash:

↳ 808d921914de5349f50ac656bed93c2894dfe35401991e1ca0c89861834023fb

Dhall recognizes that this is no longer the same expression and rejects the import. Only an import that represents the same value can pass the check.

This means, for example, that malicious users cannot tamper with our imports, even if we were to distribute the imported code over an insecure channel. The worst that an attacker can do is cause our configuration to reject the import, but they cannot trick the configuration into silently accepting the wrong expression.


We can use these integrity checks to do more than just secure code. We can also repurpose these checks to assert that our code refactors are safe and behavior-preserving.

For example, suppose that we change the student list to:

-- Class of 2017

let double = λ(x : Natural) → x * +2

in [ { name = "Jane Doe" , age = +19 }
, { name = "John Rivera" , age = double +9 }
, { name = "Alice O'Hare", age = +19 }
]

This will still pass the integrity check because the student list still evaluates to the same expected result.

We can also refactor our project layout, too. For example, we could modify the student list to import the double function from another file:

-- Class of 2017

[ { name = "Jane Doe" , age = +19 }
, { name = "John Rivera" , age = ./double.dhall +9 }
, { name = "Alice O'Hare", age = +19 }
]

... where ./double.dhall has the following contents:

λ(x : Natural) → x * +2

... and the integrity check would still pass.

I originally introduced semantic integrity checks to protect against malicious code modification then later realized that they can also be used to protect against non-malicious modifications (such as a refactor gone wrong).

Textual hashes

The semantic hash provides more information than a textual hash of the import. For example, suppose we changed our ./double.dhall function to triple the argument:

λ(x : Natural) → x * +3

A textual hash of the ./students.dhall import would not detect this change because the real change took place in the text of another file that ./students.dhall imported. However, a semantic hash can follow these imports to detect transitive changes to dependencies.

The semantic hash is also more flexible than a textual hash because the semantic hash does not change when we make cosmetic changes like refactoring, reformatting, or commenting code.


Dhall's semantic integrity checks can reject some behavior-preserving changes to functions. Dhall only attempts to detect if two functions are β-equivalent (i.e. the same if fully β-reduced).

For example, the following two functions are equivalent, but will not produce the same hash:

λ(x : Bool) → x
λ(x : Bool) → if x then True else False

Similarly, Dhall's semantic hash cannot detect that these two functions are the same:

λ(x : Natural) → x * +2
λ(x : Natural) → x + x

On the other hand, Dhall will (almost) never give two semantically distinct expressions the same hash. Only an astronomically improbable hash collision can cause this and at the time of this writing there is no known vulnerability in the SHA-256 hash algorithm.

Dhall will support other hash algorithms should SHA-256 ever be broken. This is why Dhall prefixes the hash with the algorithm to leave the door open for new hash algorithms.
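
To illustrate why normalizing before hashing accepts some refactors and rejects others, here is a toy model in Haskell. It is not Dhall's implementation: the Expr type, the naive substitution, and the use of show on the normal form as a stand-in for a SHA-256 hash are all assumptions made for this sketch.

```haskell
-- A tiny expression language with naturals, lambdas and arithmetic.
data Expr
  = Nat Integer
  | Var String
  | Lam String Expr
  | App Expr Expr
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Naive substitution: assumes bound names are distinct, so no capture can occur.
subst :: String -> Expr -> Expr -> Expr
subst v e expr =
  case expr of
    Nat n   -> Nat n
    Var w   -> if w == v then e else Var w
    Lam p b -> if p == v then Lam p b else Lam p (subst v e b)
    App f x -> App (subst v e f) (subst v e x)
    Add a b -> Add (subst v e a) (subst v e b)
    Mul a b -> Mul (subst v e a) (subst v e b)

-- β-reduce applications and fold arithmetic on literals; nothing cleverer.
normalize :: Expr -> Expr
normalize expr =
  case expr of
    App f x ->
      case normalize f of
        Lam p b -> normalize (subst p (normalize x) b)
        f'      -> App f' (normalize x)
    Lam p b -> Lam p (normalize b)
    Add a b -> arith Add (+) (normalize a) (normalize b)
    Mul a b -> arith Mul (*) (normalize a) (normalize b)
    _       -> expr
  where
    arith _ op (Nat m) (Nat n) = Nat (op m n)
    arith con _ a b            = con a b

-- Stand-in for a semantic hash: the printed normal form.
semanticKey :: Expr -> String
semanticKey = show . normalize

-- λ(x : Natural) → x * +2, with types omitted
double :: Expr
double = Lam "x" (Mul (Var "x") (Nat 2))

-- λ(x : Natural) → x + x, with types omitted
doubleViaAdd :: Expr
doubleViaAdd = Lam "x" (Add (Var "x") (Var "x"))
```

In this model, App double (Nat 9) and Nat 18 share a key, which corresponds to the refactor of John's age passing the check. The keys of double and doubleViaAdd differ even though the functions agree on every input, which is the limitation described above.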

Semantic versioning

You might wonder how semantic integrity checks compare to semantic versioning. I like to think of semantic integrity checks and semantic versions as two special cases of the following abstract interface:

  • a package publishes a version string for each official release
  • you can compare two version strings to detect a breaking change to the package

Semantic versioning is one special case of that abstract interface where:

  • the version string has a major number and minor number
  • a difference in major version numbers signals a breaking change

Some variations on semantic versioning propose independently versioning each exported function/value/type instead of versioning the package as a whole. Also, some languages (like Elm) mechanically enforce semantic versioning by detecting API changes programmatically and forcing a major version bump if there is a breaking change.

A semantic integrity check is another special case of that abstract interface where:

  • the version string is a SHA-256 hash
  • if two hashes are different then that signals a breaking change
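
The shared interface and its two special cases can be sketched as a Haskell type class. The names Version, breaking, SemVer and SemanticHash are invented for this illustration, and the hash is represented as an opaque string.

```haskell
-- The abstract interface: compare two version strings to detect a breaking change.
class Version v where
  breaking :: v -> v -> Bool

-- Semantic versioning: only a change in the major number signals a break.
data SemVer = SemVer { major :: Int, minor :: Int }
  deriving (Eq, Show)

instance Version SemVer where
  breaking old new = major old /= major new

-- Semantic integrity check: any difference in the hash signals a break.
newtype SemanticHash = SemanticHash String
  deriving (Eq, Show)

instance Version SemanticHash where
  breaking old new = old /= new
```

For example, breaking (SemVer 1 2) (SemVer 1 3) is False, while two different SemanticHash values always count as breaking.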

The key difference between semantic versioning and semantic integrity checks is how we define "a breaking change". Semantic version numbers (usually) treat changes to types as breaking changes whereas semantic integrity checks treat changes to values as breaking changes. (To be totally pedantic: semantic integrity checks treat changes to expressions as breaking changes, and in a language like Dhall everything is an expression, including types).

This does not imply that semantic integrity checks are better than semantic version numbers. Sometimes you want to automatically pick up small changes or improvements from your dependencies without adjusting a hash. In cases like those you want the expected type to be the contract with your dependency and you don't want to pin the exact value.

For example, we could "simulate" semantic versioning in Dhall by attaching a type annotation to our ./students.dhall import like this:

    let map = sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb

in let students =
./students.dhall : List { name : Text, age : Natural }

in let getName = λ(student : { name : Text, age : Natural }) → student.name

in { classSize = List/length { name : Text, age : Natural } students
, names = map { name : Text, age : Natural } Text getName students
}

... and now we can add or remove students from our imported list without breaking anything. We've used the type system as a coarser integrity check to state that certain changes to our configuration file's meaning are okay.


You can think of a semantic integrity check as a "value annotation" (i.e. the term-level equivalent of a type annotation). Instead of declaring an expected type we declare an expected value summarized as a hash.

This is why the title of this post declares that "semantic integrity checks are the next generation of semantic versioning". If you think of a semantic version as a concise summary of an imported package's type, then a semantic integrity check is a concise summary of an imported package's value.


by Gabriel Gonzalez at November 04, 2017 03:45 AM

Keegan McAllister

On depression, privilege, and online activism

Update (November 2017): I'm leaving this up as a snapshot of how I felt at the time. Since then a lot has changed in my life, I'm much less angry in general and I no longer give a shit what the toxic assholes think of me, which is pretty great!

[Content warning: depression, privilege, online activism]

This isn't a general account of my experiences with depression. Many people have written about that, and I don't have much to add. But there's one aspect that I don't hear about very often. It's something that bothers me a lot, and others have told me that it bothers them too.

The thing is, I'm not just a person with a mental illness. I'm also a well-off white guy, and I enjoy a whole set of unearned privileges from that. Every day people around the world are harassed, abused, and killed over things I never have to worry about. Even in mundane daily life, most everyone is playing on a higher difficulty setting than I ever will.

I've thought about this a lot over the past few years, and I'm trying to understand how I can help make the world more fair and less oppressive. So I give money and I volunteer a little and I speak up when it seems useful, but mostly I listen. I listen to the experiences of people who are different from me. I try to get some understanding of how they feel and why.

How is this related to depression? Because the reality of privilege and oppression is fucking depressing. Of course it's depressing to those who are directly harmed. That's a lot of what I read about, and some of the despair transfers to me. But my profiting from the suffering of others in a way that I mostly can't change is also depressing, at least if I make an attempt not to ignore it.

And my distress over my role in systems of oppression brings its own layer of guilt. People are actually suffering and I feel sorry for myself because I'm dimly aware of it? But this comes from the voice that has always taunted me about depression. “How can you be sad? Your life is great. If you had real problems you wouldn't be so pathetic. You're not really sick. You're just a whiner.”

All of which is part of the disease. I need to own it and work on it every day. But it seems like every time I read an online discussion about social justice, I take a huge step backwards.

It's hard to shrug off the “men are horrible” comments when I spend so much effort trying to convince myself that I'm not horrible. When I hear people gloating about delicious white male tears, I think about all the times when I would come home from work and collapse in bed crying. Is this what they want my life to be?

I can't give myself permission to tune out, because the same people lecture constantly about my obligation to be a good ally, which mostly takes the form of “shut up and listen.” And then when I'm upset by the things they say, the response is “This isn't for you! Why are you listening?”

A local group, one that had recently invited me to hang out as a guest, retweeted a member's declaration to would-be allies: “We're not friends. Fuck you.” Can you see why it feels like they're trying to hurt me?

Let me be clear: I truly don't care if people in a room somewhere are talking about how men are the worst. I don't feel oppressed by it, and I have no desire to argue with it. But I can't handle direct exposure.

And don't tell me that I'm too stupid to understand why they say these things. I know intellectually that it's not about me. I understand the need to vent and the importance of building solidarity. None of that matters on the emotional level where these comments register like a punch to the gut. I do feel this way, even if I shouldn't and I wish I didn't.

I'm talking about mental health, triggers, and unintentionally hurtful speech. Does that sound familiar? One reason I was drawn to intersectional feminism is that it seemed to have a good set of ground rules for how to treat everyone decently. But now I feel like I'm excluded from protection. “Men are horrible” is apparently the one form of speech where intent is all that matters, and I'm a bad person if it triggers something. I've been told it's offensive that I would even try to describe my experience in those terms.

It hurts a whole lot to try and really feel someone's pain, and then realize they don't even slightly give a shit about me. It hurts even more when they'll bend over backwards for anyone except me.

Look, I get it. You argue all the time with trolls who claim that men have it just as bad as women and will shout “what about the men” as a way to disrupt any discussion. When you're engaged in meme warfare, you can't show them any human empathy. They certainly wouldn't return the favor. And if my voice sounds a little like theirs, that's just too bad for me.

I know that this article will serve as ammunition for some people with views I find disgusting. That sucks, but I'm done using political strategy as a reason to stay silent. I understand tone policing as a derailing tactic, and I understand the need to call it out. But at this point it seems there's no room for a sincere request for kindness, especially coming from someone who doesn't get much benefit of the doubt. (The Geek Feminism Wiki basically says that asking for kindness is tone policing if and only if you're a man.)

I'm not trying to silence anyone here. I'm not jumping in and derailing an existing conversation. I'm writing on my own blog, on my own schedule, about my own feelings. But I'm told that even this is crossing a line.

I know that I can't dictate how others feel about our fucked-up world. Does that mean I must absolutely suppress the way I feel? Even when we agree about the substance of what's wrong? I know that if I ask someone to share their life experiences, they have a right to express anger. When does expressing anger become sustained, deliberate cruelty?

“People are being oppressed and you're asking us to care about your feelings?” Yes, I am asking you to care. Just a little bit. I don't claim that my feelings should be a top priority. I hope it wouldn't come up very often. But according to the outspoken few who set the tone, I'm never allowed to bring it up. I don't deserve to ask them to be nice.

And that's why I can no longer have anything to do with this movement. It's really that simple. I guess it says something about my state of mind that I felt the need to attach 1,700 words of preemptive defenses.

The truth is, when I'm not allowed to say or even think “not all men,” part of me hears “Yes, all men, especially you.” And if I'm ever confused about whether I'm allowed to say “not all men,” there are a dozen unprompted reminders every day. Little jokes, repeated constantly to set the climate about what will and won't be tolerated.

When you treat me like one of the trolls, I start to believe that I am one. Guys who say “I support feminism but sometimes they go too far” are usually trying to excuse sexist behavior. So what do I conclude about myself when I have the same thought?

I get that “ally” is not a label you self-apply, it's a thing you do, and the label comes from others. The problem is, if a hundred people say I'm a good ally, and one person says I'm a sexist asshole, who do you think I'm going to believe?

I'm not allowed to stand up for myself, because doing so is automatically an act of oppression. If a woman treats me like shit, and she's being “more feminist” than me, I conclude that I deserve to be treated like shit. That is the model I've learned of a good ally.

I'm not a good ally, or even a bad one. I'm collateral damage.

If the point of all this is to give me a tiny little taste of the invalidation that others experience on a regular basis, then congratulations, it worked. You've made your point. Now that you've broken me, how can I possibly help you, when it seems like I'm part of the problem just by existing? It feels like all I can do is engage in emotional self-harm to repay the debt of how I was born.

I can't just take a break “until I feel better.” My depressive symptoms will always come and go, and some thoughts will reliably bring them back. I spent years reading about how the most important thing I can do, as a winner of the birth lottery, is to be an ally to marginalized people. And now I've realized that I'm too sick and weak to do it.

Even if I give up on being an ally, I can't avoid this subject. It affects a lot of my friends, and I feel even worse when I ask them not to talk about it around me. I don't want to silence anyone. At least I've mostly stopped using Twitter.

So this is how I feel, but I'm not sure anyone else can do anything about it. Really, most of the people I've talked to have been sympathetic. Maybe I need to learn not to let bullies get to me, even when they're bullying in service of a cause I support. They don't seem to get much pushback from the wider community, at any rate.

What gives me hope is, I recognize that my participation in the endless shouting online wasn't really useful to anyone. If I can let myself ignore all that, maybe I can recover some of my energy for other activities that actually help people.

That's all I have to say right now. Thank you for listening to me.

by keegan ( at November 04, 2017 12:42 AM

November 03, 2017

Brent Yorgey

Sum of heights in a binary tree

Executive summary: every year when teaching data structures I always forget how to analyze the cost of building a binary heap, which amounts to summing the heights of all the nodes in a full binary tree. So I’m writing down the (lovely) proof here in the hopes that I will remember it next time.

Suppose you have a full binary tree and you do an operation on every node, where the cost of the operation is proportional to the height of that node. That is, the cost for each of the n/2 leaves is 0, for each of the n/4 nodes in the next level up the cost is 1, and so on. We can visualize the scenario like this:

As a function of the total number of nodes n, how expensive is this? We can see that O(n \lg n) is an upper bound, since there are n nodes and the height of each node is at most \lg n. But it seems like it might actually be faster than this in reality, since, intuitively, most of the nodes have a height which is much smaller than \lg n.

(One specific motivation for this scenario is that we can build a binary heap from an arbitrary set of data by looping over the nodes from the bottom up and calling reheapDown on each; in the worst case reheapDown takes time proportional to the height of the node, as in this scenario. But it doesn’t matter if you don’t know about binary heaps.)

Let’s take the same tree and put a dollar at every node, for a total of $n:

Now imagine sliding all the money as far up and to the right as it will go. That is, we take each dollar, and keep moving it up as long as it is a left child. As soon as we reach a node which is a right child we stop. The tree ends up looking like this:

Now take each pile of money and move it up one step to its parent, except the money at the root of the tree, which you can put in your pocket.

And voilà! We now have exactly enough money at each node to pay for the cost of the operations, and we even have a bit left over (which we can use to buy coffee). But we started with $n and only shuffled money around; this shows that the total cost is actually O(n).
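As a sanity check on the argument, here is a small Haskell sketch that sums the node heights of a tree directly. It assumes a perfect tree (every level completely filled, so n = 2^(h+1) - 1 for a root of height h), and the names Tree, perfect, size and heightSum are made up for this illustration.

```haskell
-- A binary tree carrying no data; only the shape matters here.
-- A Leaf is a node of height 0.
data Tree = Leaf | Node Tree Tree

-- Build the perfect tree whose root has height h.
perfect :: Int -> Tree
perfect 0 = Leaf
perfect h = Node t t
  where t = perfect (h - 1)

-- Total number of nodes, counting leaves.
size :: Tree -> Int
size Leaf       = 1
size (Node l r) = 1 + size l + size r

-- Returns (height of this node, sum of heights over the whole subtree).
heights :: Tree -> (Int, Int)
heights Leaf       = (0, 0)
heights (Node l r) = (h, h + sl + sr)
  where
    (hl, sl) = heights l
    (hr, sr) = heights r
    h        = 1 + max hl hr

-- Sum of the heights of all nodes.
heightSum :: Tree -> Int
heightSum = snd . heights
```

For root heights 0 through 4 the trees have 1, 3, 7, 15 and 31 nodes, and the height sums come out as 0, 1, 4, 11 and 26. In each case that is n - (h + 1), which stays below n, in line with the O(n) bound.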

Exercise for the reader: what does this have to do with the number of bit flips needed to count from 1 to n with a binary counter?

by Brent at November 03, 2017 03:06 PM

November 02, 2017

Robert Harper

PFPL Commentary

I am building a web page devoted to the 2nd edition of Practical Foundations for Programming Languages, recently published by Cambridge University Press.  Besides an errata, the web site features a commentary on the text explaining major design decisions and suggesting alternatives.  I also plan to include additional exercises and to make sample solutions available to faculty teaching from the book.

The purpose of the commentary is to provide the “back story” for the development, which is often only hinted at, or is written between the lines, in PFPL itself.  To emphasize enduring principles over passing fads, I have refrained from discussing particular languages in the book.  But this makes it difficult for many readers to see the relevance.  One purpose of the commentary is to clarify these connections by explaining why I said what I said.

As a starting point, I explain why I ignore the familiar concept of a “paradigm” in my account of languages.  The idea seems to have been inspired by Kuhn’s (in)famous book The Structure of Scientific Revolutions, and was perhaps a useful device at one time.  But by now the idea of a paradigm is just too vague to be useful, and there are many better ways to explain and systematize language structure.  And so I have avoided it.

I plan for the commentary to be a living document that I will revise and expand as the need arises.  I hope for it to provide some useful background for readers in general, and teachers in particular.  I wish for the standard undergraduate PL course to evolve from a superficial taxonomy of the weird animals in the language zoo to a systematic study of the general theory of computation.  Perhaps PFPL can contribute to effecting that change.

Update: As I had hoped, I have been making many new additions to the commentary, exposing alternatives, explaining decisions, and expanding on topics in PFPL.  There are also a few errors noted in the errata; so far, nothing major has come up.  (The sections on safety are safely sound.)

Filed under: Research, Teaching

by Robert Harper at November 02, 2017 07:45 PM

It Is What It Is (And Nothing Else)

A recent discussion of introductory computer science education led to the topic of teaching recursion.  I was surprised to learn that students are being taught that recursion requires understanding something called a “stack” that is nowhere in evidence in their code.  Few, if any, students master the concept, which is usually “covered” only briefly.  Worst, they are encouraged to believe that recursion is a mysterious bit of esoterica that is best ignored.

And thus is lost one of the most important and beautiful concepts in computing.

The discussion then moved on to the implementation of recursion in certain inexplicably popular languages for teaching programming.  As it turns out, the compilers mis-implement recursion, causing unwarranted space usage in common cases.  Recursion is dismissed as problematic and unimportant, and the compiler error is elevated to a “design principle” — to be serpentine is to do it wrong.

And thus is lost one of the most important and beautiful concepts in computing.

And yet, for all the stack-based resistance to the concept, recursion has nothing to do with a stack.  Teaching recursion does not need any mumbo-jumbo about “stacks”.  Implementing recursion does not require a “stack”.  The idea that the two concepts are related is simply mistaken.

What, then, is recursion?  It is nothing more than self-reference, the ability to name a computation for use within the computation itself.  Recursion is what it is, and nothing more.  No stacks, no tail calls, no proper or improper forms, no optimizations, just self-reference pure and simple.  Recursion is not tied to “procedures” or “functions” or “methods”; one can have self-referential values of all types.

Somehow these very simple facts, which date back to the early 1930’s, have been replaced by damaging myths that impede teaching and using recursion in programs.  It is both a conceptual and a practical loss.  For example, the most effective methods for expressing parallelism in programs rely heavily on recursive self-reference; much would be lost without it.  And the allegation that “real programmers don’t use recursion” is beyond absurd: the very concept of a digital computer is grounded in recursive self-reference (the cross-connection of gates to form a latch).  (Which, needless to say, does not involve a stack.)  Not only do real programmers use recursion, there could not even be programmers were it not for recursion.

I have no explanation for why this terrible misconception persists.  But I do know that when it comes to programming languages, attitude trumps reality every time.  Facts?  We don’t need no stinking facts around here, amigo.  You must be some kind of mathematician.

If all the textbooks are wrong, what is right?  How should one explain recursion?  It’s simple.  If you want to refer to yourself, you need to give yourself a name.  “I” will do, but so will any other name, by the miracle of α-conversion.  A computation is given a name using a fixed point (not fixpoint, dammit) operator:  fix x is e stands for the expression e named x for use within e.  Using it, the textbook example of the factorial function is written thus:

fix f is fun n : nat in case n {zero => 1 | succ(n') => n * f n'}.

Let us call this whole expression fact, for convenience.  If we wish to evaluate it, perhaps because we wish to apply it to an argument, its value is

fun n : nat in case n {zero => 1 | succ(n') => n * fact n'}.

The recursion has been unrolled one step ahead of execution.  If we reach fact again, as we will for a positive argument,  fact is evaluated again, in the same way, and the computation continues.  There are no stacks involved in this explanation.
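The same idea can be sketched in Haskell with fix from Data.Function. This is only an analogy: Haskell's fix is an ordinary library function in a lazy language, not the fix x is e construct used above, and the names fact and factUnrolled are chosen for this illustration.

```haskell
import Data.Function (fix)
import Numeric.Natural (Natural)

-- `fix g` is a value x satisfying x = g x. Here it gives the function a
-- name, `self`, for use within its own body.
fact :: Natural -> Natural
fact = fix (\self n -> if n == 0 then 1 else n * self (n - 1))

-- Unrolling the self-reference one step by hand: the occurrence of
-- `self` is replaced by the whole named computation, `fact`.
factUnrolled :: Natural -> Natural
factUnrolled n = if n == 0 then 1 else n * fact (n - 1)
```

Both definitions compute the same function; the second is just the first with one step of the self-reference written out.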

Nor is there a stack involved in the implementation of fixed points.  It is only necessary to make sure that the named computation does indeed name itself.  This can be achieved by a number of means, including circular data structures (non-well-founded abstract syntax), but the most elegant method is by self-application.  Simply arrange that a self-referential computation has an implicit argument with which it refers to itself.  Any use of the computation unrolls the self-reference, ensuring that the invariant is maintained.  No storage allocation is required.

Consequently, self-referential functions such as

fix f is fun (n : nat, m:nat) in case n {zero => m | succ(n') => f (n',n*m)}

execute without needing any asymptotically significant space.  It is quite literally a loop, and no special arrangement is required to make sure that this is the case.  All that is required is to implement recursion properly (as self-reference), and you’re done.  There is no such thing as tail-call optimization.  It’s not a matter of optimization, but of proper implementation.  Calling it an optimization suggests it is optional, or unnecessary, or provided only as a favor, when it is more accurately described as a matter of getting it right.
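A rough Haskell counterpart of the accumulating version, again using fix from Data.Function. The name factAcc is made up here, and the bang pattern is an addition specific to Haskell: because the language is lazy, the accumulator has to be forced at each step, otherwise a chain of unevaluated multiplications would build up.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Function (fix)
import Numeric.Natural (Natural)

-- The self-call is the last thing the body does, so nothing is left
-- pending when it is made. The bang on m keeps the accumulator evaluated.
factAcc :: Natural -> Natural -> Natural
factAcc = fix (\self n !m -> if n == 0 then m else self (n - 1) (n * m))
```

Calling factAcc n 1 gives the factorial of n. The second argument plays the role of m in the definition above.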

So what, then, is the source of the confusion?  The problem seems to be a too-close association between compound expressions and recursive functions or procedures.  Consider the classic definition of factorial given earlier.  The body of the definition involves the expression

n * fact n'

where there is a pending multiplication to be accounted for.  Once the recursive call (to itself) completes, the multiplication can be carried out, and it is necessary to keep track of this pending obligation.  But this phenomenon has nothing whatsoever to do with recursion.  If you write

n * square n'

then it is equally necessary to record where the external call is to return its value.  In typical accounts of recursion, the two issues get confused, a regrettable tragedy of error.

Really, the need for a stack arises the moment one introduces compound expressions.  This can be explained in several ways, none of which need pictures or diagrams or any discussion about frames or pointers or any extra-linguistic concepts whatsoever.  The best way, in my opinion, is to use Plotkin’s structural operational semantics, as described in my Practical Foundations for Programming Languages (Second Edition) on Cambridge University Press.

There is no reason, nor any possibility, to avoid recursion in programming.  But folk wisdom would have it otherwise.  That’s just the trouble with folk wisdom, everyone knows it’s true, even when it’s not.

Update: Dan Piponi and Andreas Rossberg called attention to a pertinent point regarding stacks and recursion.  The conventional notion of a run-time stack records two distinct things, the control state of the program (such as subroutine return addresses, or, more abstractly, pending computations, or continuations), and the data state of the program (a term I just made up because I don’t know a better one, for managing multiple simultaneous activations of a given procedure or function).  Fortran (back in the day) didn’t permit multiple activations, meaning that at most one instance of a procedure can be in play at a given time.  One consequence is that α-equivalence can be neglected: the arguments of a procedure can be placed in a statically determined spot for the call.  As a member of the Algol-60 design committee Dijkstra argued, successfully, for admitting multiple procedure activations (and hence, with a little extra arrangement, recursive/self-referential procedures).  Doing so requires that α-equivalence be implemented properly; two activations of the same procedure cannot share the same argument locations.  The data stack implements α-equivalence using de Bruijn indices (stack slots); arguments are passed on the data stack using activation records in the now-classic manner invented by Dijkstra for the purpose.  It is not self-reference that gives rise to the need for a stack, but rather re-entrancy of procedures, which can arise in several ways, not just recursion.  Moreover, recursion does not always require re-entrancy—the so-called tail call optimization is just the observation that certain recursive procedures are not, in fact, re-entrant.  (Every looping construct illustrates this principle, albeit on an ad hoc basis, rather than as a general principle.)

Filed under: Programming, Teaching

by Robert Harper at November 02, 2017 06:24 PM

November 01, 2017

Tweag I/O

The Exodus to Streamgard, an epic poem

Yves Parès

If Haskell was a god, often would he be depicted with the ravens Modularity and Abstraction flying above him, hovering over the world and reporting to him every detail of our whereabouts. Haskell would sit on the Throne of Purity and look upon the world with an eye full of wisdom. And in his hand, the mighty Haskell would wield the Spear of Lazy Lists, which is said to have the power to tackle each and every problem the world might have to face. And to honour him, we would code and abstract everything with lazy lists. For millennia would lists be used to map, filter, separate, merge, group, and so forth.

But, one day, the Real-World Serpent, son of the wicked Foldr, would come. And the Real-World Serpent carries an eternal hatred towards lazy lists. Oh, that dreaded Serpent, that will throw everything it can muster to prevent us from staying within the warm comfort of abstraction and laziness. The Serpent will assemble its minions, Early-close and Strictness of effects, and unleash its wrath upon our world. Foldl, son of Haskell and brother of Foldr, would lead humanity to its last bastion, Streamgard, and organize the final fight...

So, long story short, streaming is a library that allows you to leverage the insights you have gained while manipulating lazy lists in Haskell to handle effectful streams of data. We already talked about streaming on this blog, with this post discussing the IO part and this one comparing it to pipes and conduit. Here, we will be using streaming for highly efficient data processing and filtering. To this effect, we will use it conjointly with another library, foldl, which gives us an Applicative interface to the usual list functions. In this blog post we will apply them to the task of computing some statistics about a distribution of data. We want to be able to:

  • process the input data stream only once (aka in one pass),
  • never repeat the effects that were used to produce that data stream,
  • maintain the possibility to use the input stream as if it were a list, for instance by splitting it into two subparts, sending each subpart to be processed by a specific function.

So let's imagine that the statistics I want to compute on my input data distributions take the shape of a simple summary. This is what I want to obtain in the end:

data Summary v a = Summary
  { summaryLength :: Int
  , summaryMins :: [a]
  , summaryMaxes :: [a]
  , summaryMean :: v
  , summaryStdDev :: v
  }
  deriving (Show)

Nothing too fancy here, I just want to be able to compute the length, the n smallest elements, the n' biggest elements, the mean and the standard deviation of my distribution. We distinguish the types a and v here because our input distribution does not have to be numerical, as long as we have a projection a -> v available. This way, we can compute a summary of a stream of (Double, String) tuples, for instance, if the projection is just fst.

So let's have a little reminder of our conditions. We want to be able to read the input data only once. But, we still want modularity and reusability. We do not want to have to recode our Summary-computing function every time we want to add a new field, and we would like to reuse already existing functions computing these statistics. And this is where the foldl package comes in.

This package defines a type Fold as follows:

data Fold a b = forall acc. Fold (acc -> a -> acc) acc (acc -> b)

You might recognize here the typical arguments of the classical foldl function of the Prelude: a is the type of each element of the input stream we consume, the first field (acc -> a -> acc) is an accumulation function and the second field acc is the initial value of the accumulator. The new component is the b type parameter and the last field (acc -> b). This one is called extract. It is used to extract the final value out of the accumulator. This is necessary so that Fold a can be a Functor and therefore an Applicative. See the original blog post and this talk by Gabriel Gonzalez for more detail, though be aware that Fold had a different shape back then.

One of the central ideas of the foldl library is that Fold implements the Applicative type class:

instance Applicative (Fold a)

Crucially, this instance combines two Folds into a guaranteed one-pass traversal of the data. Therefore we can safely decompose the computation of a Summary as follows:

import qualified Control.Foldl as L
import Data.Function (on)

summarizeBy :: (Floating v, Ord v)
            => (a -> v) -> Int -> Int -> L.Fold a (Summary v a)
summarizeBy f nMins nMaxes = Summary
  <$> L.length
  <*> collect ((>=) `on` f) nMins
  <*> collect ((<=) `on` f) nMaxes
  <*> L.premap f L.mean
  <*> L.premap f L.std

What's happening here? We are using a few of the functions already present in the foldl package and a new one, so let's delve into it a bit. The function summarizeBy takes a projection f, which we talked about earlier, the number of smallest elements we want to collect and the number of biggest elements. Then our five statistics are computed:

  • L.length :: L.Fold a Int gives us the number of elements in the input.
  • collect, which we will define a bit later, accumulates either the mins or the maxes given a comparison function.
  • L.mean gives us the average. We use L.premap f to turn it into a fold that will work on our projection f.
  • L.std gives us the standard deviation.

The combination of the above gives us a Fold a (Summary v a), something that will consume a stream of a's and output a summary. At this point, nothing is consumed, we have only composed folds together, and a Fold is agnostic of the exact nature of the input. Running it on any Foldable datatype for instance is just a matter of calling:

L.fold (summarizeBy id 3 3) [1..100]
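To see why combining folds applicatively still yields a single pass, here is a sketch that needs only base. It re-declares a Fold type of the same shape locally and gives it plausible Functor and Applicative instances; this is an illustration under that assumption, not the foldl package's actual source, and runFold, Pair, sumF, lengthF and meanF are names invented here.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl')

-- Same shape as the Fold type quoted above, re-declared locally.
data Fold a b = forall acc. Fold (acc -> a -> acc) acc (acc -> b)

-- Run a fold over a list in a single left-to-right pass.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin extract) = extract . foldl' step begin

-- Mapping over the result only post-composes with the extract function.
instance Functor (Fold a) where
  fmap f (Fold step begin extract) = Fold step begin (f . extract)

-- Strict pair, so both accumulators are forced at every step.
data Pair x y = Pair !x !y

-- Two folds are combined by pairing their accumulators: each input
-- element is fed to both step functions during the same traversal.
instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (\() -> b)
  Fold stepF beginF extractF <*> Fold stepX beginX extractX =
    Fold step (Pair beginF beginX) extract
    where
      step (Pair f x) a  = Pair (stepF f a) (stepX x a)
      extract (Pair f x) = extractF f (extractX x)

sumF :: Num a => Fold a a
sumF = Fold (+) 0 id

lengthF :: Fold a Int
lengthF = Fold (\n _ -> n + 1) 0 id

-- The mean, assembled from two simpler folds but computed in one pass.
meanF :: Fold Double Double
meanF = (/) <$> sumF <*> fmap fromIntegral lengthF
```

With these definitions, runFold meanF [1, 2, 3, 4] traverses the list once and yields 2.5. The extract field is what lets the division happen after the traversal rather than inside the accumulator.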

The only function not provided by the foldl package is the collect function. Defining it as a brand new Fold is simple:

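import Data.Sequence (ViewR(..), viewr)
import qualified Data.Sequence as Seq

-- Keep at most n elements, ordered so that the one to evict is always last.
collect :: (a -> a -> Bool) -> Int -> L.Fold a [a]
collect skipPred n = L.Fold insertPop Seq.empty (L.fold L.list)
  where
    insertPop acc x
      | Seq.length acc < n = insert x acc
      | otherwise          = pop (insert x acc)
    -- Insert x after every element that skipPred says should precede it.
    insert x s = let (before, after) = Seq.spanl (skipPred x) s
                 in before <> Seq.singleton x <> after
    -- Drop the last element, if there is one.
    pop s = case viewr s of
              s' :> _ -> s'
              _ -> s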

Here we manually defined a new Fold from the three elements we mentioned earlier: an accumulation function (insertPop), an initial accumulator value (Seq.empty) and an extract function (L.fold L.list, which itself uses a Fold to turn the final sequence into a plain list).
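To get a feel for what the accumulator does, the same insert-then-pop logic can be run over a plain list with foldl'. This sketch uses only base and containers; keepBest is a name invented for the illustration and is not part of foldl.

```haskell
import Data.Foldable (toList)
import Data.List (foldl')
import Data.Sequence (ViewR (..), viewr)
import qualified Data.Sequence as Seq

-- Keep at most n elements of the input, ordered by skipPred, evicting
-- the last element of the sequence whenever the bound is exceeded.
keepBest :: (a -> a -> Bool) -> Int -> [a] -> [a]
keepBest skipPred n = toList . foldl' insertPop Seq.empty
  where
    insertPop acc x
      | Seq.length acc < n = insert x acc
      | otherwise          = pop (insert x acc)
    -- Place x after the prefix of elements that should come before it.
    insert x s =
      let (before, after) = Seq.spanl (skipPred x) s
      in before <> Seq.singleton x <> after
    -- Remove the last element, leaving an empty sequence untouched.
    pop s = case viewr s of
      s' :> _ -> s'
      EmptyR  -> s
```

With (>=) as the predicate the sequence stays in ascending order and the largest element is evicted, so keepBest (>=) 3 [5, 1, 4, 2, 3] gives [1, 2, 3]. With (<=) the order is descending and the smallest is evicted, so keepBest (<=) 2 [5, 1, 4, 2, 3] gives [5, 4].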

Now, the astute reader will notice we left streaming aside. Let's get back to it. Let's use as an input the classic Titanic dataset:

1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

We want to get two different summaries for the fares: one for the passengers that survived and one for those who did not. First, let's load the CSV into a stream by using the streaming-cassava and streaming-bytestring packages:

{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (mzero)
import qualified Data.ByteString.Streaming as BS
import Data.Csv (Parser, (.:))
import Streaming
import Streaming.Cassava

data Passenger = Passenger { name :: !String, fare :: !Double, survived :: !Bool }
  deriving (Show)

instance FromNamedRecord Passenger where
  parseNamedRecord m =
    Passenger <$> m .: "Name" <*> m .: "Fare" <*> (toBool =<< (m .: "Survived"))
    where
      -- The CSV encodes survival as 0 or 1; anything else is a parse failure.
      toBool :: Int -> Parser Bool
      toBool 0 = return False
      toBool 1 = return True
      toBool _ = mzero

streamCsv :: (MonadResource m) => Stream (Of Passenger) m ()
streamCsv = decodeByName (BS.readFile ".../titanic.csv")

Nothing too fancy here, just a bit of required boilerplate to be able to read Passengers from the CSV file. MonadResource is necessary to track the files opened by our program. The type Stream (Of Passenger) m () means that we will be manipulating a stream whose elements are Passengers, that will run some effects in a monad m and return no result in the end.

Now, let's split that input in two different substreams:

import qualified Streaming.Prelude as S

aliveDead :: (MonadResource m) => Stream (Of Passenger) (Stream (Of Passenger) m) ()
aliveDead = S.partition survived streamCsv

Let's look at the type of aliveDead: it is a Stream over another Stream. Stream (Of a) is actually a monad transformer, and the partitioning works by creating two layers: one for the live passengers and one for the dead ones. It's not exactly a tuple of two streams (as it would be with Data.List.partition), but it has the same advantages: each layer can be processed by different functions which don't have to know where the stream they process lies in the monad stack. Therefore, each one of these functions can be expressed as:

summarizePassengers
  :: (Monad m) => Stream (Of Passenger) m a -> m (Of (Summary Double Passenger) a)
summarizePassengers = L.purely S.fold (summarizeBy fare 3 3)

where m can be any monad. This can be the bottom MonadResource or another Stream, summarizePassengers does not mind and does not have to! Of behaves like a tuple, so it simply means that we return both the newly computed Summary and an a (a may just be (), but here we have to be a little more general). S.fold is the basic folding function for streams. L.purely fn f "unpacks" a Fold f and calls a folding function fn. So now, getting our summaries is just a matter of

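runAll = runResourceT $ do
  (summaryAlive :> summaryDead :> ()) <-
    summarizePassengers $ summarizePassengers aliveDead
  -- A do block must end with an expression; hand both summaries back.
  return (summaryAlive, summaryDead)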

So in the end, we split the input file into two substreams and computed various statistics twice, and despite all this, streaming and foldl guarantee that the input will be read only once, in bounded memory.

These techniques are currently being applied by Tweag I/O in the context of a project with Novadiscovery. Novadiscovery is a consulting company for in silico clinical trials, namely simulation of virtual patients through biomodeling. Parts of this blog post are actual code from the tools we develop with them.

November 01, 2017 12:00 AM

October 31, 2017

Douglas M. Auclair (geophf)

October 2017 1HaskellADay problems and solutions

by geophf ( at October 31, 2017 09:32 PM