Planet Haskell

January 20, 2018

Mark Jason Dominus

I forgot there are party conventions

Yesterday I made a huge mistake in my article about California's bill requiring presidential candidates to disclose their tax returns. I asked:

Did I miss anything?

Yes, yes I did. I forgot that party nominees are picked not by per-state primary elections, but by national conventions. Even had Ronnie won the California Republican primary election, rather than Trump, that would not be enough the get him on the ballot in the general election.

In the corrected version of my scenario, California sends many Ronnie supporters to the Republican National Convention, where the other delegates overwhelmingly nominate Trump anyway. Trump then becomes the Republican party candidate and appears on the California general election ballot in November. Whoops.

I had concluded that conceding California couldn't hurt Trump, and it could actually a huge benefit to him. After correcting my mistake, most of the benefit evaporates.

I wonder if Trump might blow off California in 2020 anyway. The upside would be that he could spend the resources in places that might give him some electoral votes. (One reader pointed out that Trump didn't blow off California in the 2016 election, but of course the situation was completely different. In 2016 he was not the incumbent, he was in a crowded field, and he did not yet know that he was going to lose California by 30%.)

Traditionally, candidates campaign even in states they expect to lose. One reason is to show support for candidates in local elections. I can imagine many California Republican candidates would prefer that Trump didn't show up. Another reason is to preserve at least a pretense that they are representatives of the whole people. Newly-elected Presidents have often upheld the dignity of the office by stating the need for unity after a national election and promising (however implausibly) to work for all Americans. I don't care to write the rest of this paragraph.

The major downside that I can think of (for Trump) is that Republican voters in California would be infuriated, and while they can't directly affect the outcome of the presidential election, they can make it politically impossible for their congressional representatives to work with Trump once he is elected. A California-led anti-Trump bloc in Congress would probably cause huge problems for him.

Wild times we live in.

by Mark Dominus ( at January 20, 2018 03:45 PM

January 19, 2018

Brandon Simmons

In defense of partial functions in the haskell Prelude

…because I’m trying to blog more, and this sounds like a fun argument to try to make.

One of the most universally-maligned parts of Haskell is the inclusion of partial functions in its standard library called Prelude. These include head and tail which are undefined for the empty list:

head :: [a] -> a
head (a:_) = a
head [] = error "empty list"

It’s generally understood that the inclusion of these sorts of functions are a wart (that the type of head should be [a] -> Maybe a) that has motivated the proliferation of Prelude alternatives, few of which are used by anyone besides their authors (and fewer still have enthusiastic advocates).

I’ve heard a handful of allusions to tricky production bugs that involved some deep dive to find the source of a "*** Exception: Prelude.head: empty list", but personally I can recall only one instance of such a bug in the code I’ve worked on professionally and it was trivial to track down. I can’t recall flagging a use of head in a code review either, or raising an eyebrow at some use of the function in some library source I was perusing.

But most of the time the argument goes that partial functions should be removed for the sake of new users, who will become quickly disillusioned when their first function blows up with an exception. But you said haskell was safe!

It would be unfortunate if this caused a new user to give up, and maybe this is a real problem for the community, but here’s what I think really happens to most of us:

  • Your homework doesn’t work; this doesn’t matter.
  • You use Google and quickly learn that partial functions (will forever) exist, and that they’re bad
  • You ask yourself “Hm, come to think of it what did I expect to happen…?”

And so you learn an important lesson, early on and in the most painless way possible, you acquire a nose for inferring which functions must be partial, an appreciation for compiler warnings that help prevent accidentally-partial functions, etc.

Would I recommend designing a standard library around this weird sort of tough-love? Probably not, but I think the haskell library ecosystem and pedagogy have benefited from this wart.

The problems that get the most (and most passionate) attention are usually not the ones that are the most important, but the ones that are the most easily understood. I think in the proliferation of Preludes and the discussion around partial functions (and the fact that they haven’t been excised yet) we see evidence of both the Law of Triviality, and a healthy language pedagogy.

January 19, 2018 08:22 PM

Mark Jason Dominus

Presidential tax return disclosure

The California state legislature passed a bill that would require presidential candidates to disclose their past five tax returns in order to qualify for California primary elections. The bill was vetoed by Governor Brown, but what if it had become law?

Suppose Donald Trump ran for re-election in 2020, as seems likely, barring his death or expulsion. And suppose he declined once again to disclose his tax returns, and was excluded from the California Republican primary election. I don't see how this could possibly hurt Trump, and it could benefit him.

It doesn't matter to Trump whether he enters the primary or wins the primary. Trump lost California by 30% in 2016. Either way he would be just as certain to get the same number of electors: zero. So he would have no incentive to comply with the law by releasing his tax returns.

Most candidates would do it anyway, because they try to maintain a pretense of representing the entire country they are campaigning to lead, but Trump is really different in this way. I can easily imagine he might simply refuse to campaign in California, instead dismissing the entire state with some vulgar comment. If there is a downside for Trump, I don't see what it could be.

Someone else (call them “Ronnie”) would then win the California Republican primary. Certainly Ronnie is better-qualified and more competent than Trump, and most likely Ronnie is much more attractive to the California electorate.

Ronnie might even be more attractive than the Democratic candidate, and might defeat them in the general election, depriving Trump's challenger of 55 electoral votes and swinging the election heavily in Trump's favor.

Did I miss anything?

[ Addendum 20180120: Yeah, I forgot that after the primary there is a convention that nominates a national party candidate. Whooops. Further discussion. ]

by Mark Dominus ( at January 19, 2018 05:40 PM

January 18, 2018

FP Complete

Signs Your Business Needs a DevOps Consultant

How to Know if Hiring a DevOps Consultant is Right for Your Business

DevOps is a set of tools and methods to get your applications from Dev through Deployment to Operations. It automates a reliable path from my app runs on one machine to my app is online for all users and data, secure, scalable, managed, and maintainable.

by Aaron Contorer ( at January 18, 2018 11:06 PM

Comonad Reader

Computational Quadrinitarianism (Curious Correspondences go Cubical)

Back in 2011, in an influential blog post [1], Robert Harper coined the term "computational trinitarianism" to describe an idea that had been around a long time — that the connection between programming languages, logic, and categories, most famously expressed in the Curry-Howard-Lambek correspondence — should guide the practice of researchers into computation. In particular "any concept arising in one aspect should have meaning from the perspective of the other two". This was especially satisfying to those of us trying learning categorical semantics and often finding it disappointing how little it is appreciated in the computer science community writ large.

1. Categories

Over the years I've thought about trinitarianism a lot, and learned from where it fails to work as much as where it succeeds. One difficulty is that learning to read a logic like a type theory, or vice versa, is almost a definitional trick, because it occurs at the level of reinterpretation of syntax. With categories it is typically not so easy. (There is a straightforward version of categorical semantics like this — yielding "syntactic categories" — but it is difficult to connect it to the broader world of categorical semantics, and often it is sidestepped in favor of deeper models.)

One thing I came to realize is that there is no one notion of categorical semantics — the way in which the simply typed lambda calculus takes models in cartesian closed categories is fundamentally unlike the way in which linear logics take models in symmetric monoidal categories. If you want to study models of dependent type theories, you have a range of approaches, only some of which have been partially unified by Ahrens, Lumsdaine and Voevodsky in particular [2]. And then there are the LCCC models pioneered by Seely for extensional type theory, not to mention the approach that takes semantics directly in toposes, or in triposes (the latter having been invented to unify a variety of structures, and in the process giving rise to still more questions). And then there is the approach that doesn't use categories at all, but multicategories.

Going the other way, we also run into obstacles: there is a general notion, opposite to the "syntactic category" of a type theory, which is the "internal logic" of a category. But depending on the form of category, "internal logic" can take many forms. If you are in a topos, there is a straightforward internal logic called the Mitchell–Bénabou language. In this setting, most "logical" operations factor through the truth-lattice of the subobject classifier. This is very convenient, but if you don't have a proper subobject classifier, then you are forced to reach for other interpretations. As such, it is not infrequently the case that we have a procedure for deriving a category from some logical theory, and a procedure for constructing a logical theory from some category, but there is no particular reason to expect that where we arrive, when we take the round-trip, is close to, much less precisely, where we began.

2. Spaces, Logics

Over the past few years I've been in a topos theory reading group. In the course of this, I've realized at least one problem with all the above (by no means the only one) — Harper's holy trinity is fundamentally incomplete. There is another structure of interest — of equal weight to categories, logics, and languages — which it is necessary to understand to see how everything fits. This structure is spaces. I had thought that it was a unique innovation of homotopy type theory to consider logics (resp. type theories) that took semantics in spaces. But it turns out that I just didn't know the history of constructive logic very well. In fact, in roughly the same period that Curry was exploring the relationship of combinatory algebras to logic, Alfred Tarski and Marshall Stone were developing topological models for intuitionistic logic, in terms of what we call Heyting Algebras [3] [4]. And just as, as Harper explained, logic, programming and category theory give us insights into implication in the form of entailment, typing judgments, and morphisms, so to, as we will see, do spaces.

A Heyting algebra is a special type of distributive lattice (partially ordered set, equipped with meet and join operations, such that meet and join distribute over one another) which has an implication operation that satisfies curry/uncurry adjointness — i.e. such that c ∧ a ≤ b < -> c ≤ a → b. (Replace meet here by "and" (spelled "*"), and ≤ by ⊢ and we have the familiar type-theoretic statement that c * a ⊢ b < -> c ⊢ a → b).

If you haven't encountered this before, it is worth unpacking. Given a set, we equip it with a partial order by specifying a "≤" operation, such that a ≤ a, if a ≤ b and b ≤ a, then a = b, and finally that if a ≤ b and b ≤ c, then a ≤ c. We can think of such things as Hasse diagrams — a bunch of nodes with some lines between them that only go upwards. If a node b is reachable from a node a by following these upwards lines, then a ≤ b. This "only upwards" condition is enough to enforce all three conditions. We can define ∨ (join) as a binary operation that takes two nodes, and gives a node a ∨ b that is greater than either node, and furthermore is the uniquely least node greater than both of them. (Note: A general partial order may have many pairs of nodes that do not have any node greater than both of them, or may that may have more than one incomparable node greater than them.) We can define ∧ (meet) dually, as the uniquely greatest node less than both of them. If all elements of a partially ordered set have a join and meet, we have a lattice.

It is tempting to read meet and join as "and" and "or" in logic. But these logical connectives satisfy an additional important property — distributivity: a & (b | c) = (a & b) | (a & c). (By the lattice laws, the dual property with and swapped with or is also implied). Translated for lattices this reads: a ∧ (b ∨ c) = (a ∧ b) ∨ (a ∧ c). Rather than thinking just about boolean logic, we can think about lattices built from sets — with meets as union, join as intersection, and ≤ given by inclusion. It is easy to verify that such lattices are distributive. Furthermore, every distributive lattice can be given (up to isomorphism) as one built out of sets in this way. While a partially ordered set can have a Hasse diagram of pretty arbitrary shape, a lattice is more restrictive — I imagine it as sort of the tiled diamonds of an actual lattice like one might use in a garden, but with some nodes and edges possibly removed.

Furthermore, there's an amazing result that you can tell if a lattice is distributive by looking for just two prototypical non-distributive lattices as sublattices. If neither is contained in the original lattice, then the lattice is distributed. These tell us how distribution can fail in two canonical ways. The first is three incomparable elements, all of which share a common join (the top) and meet (the bottom). The join of anything but their bottom element with them is therefore the top. Hence if we take the meet of two joins, we still get the top. But the meet of any two non-top elements is the bottom and so, if we take the join of any element with the meet of any other two, we get back to the first element, not all the way to the top, and the equality fails. The second taboo lattice is constructed by having two elements in an ordered relationship, and another incomparable to them — again augmented with a bottom and top. A similar argument shows that if you go one way across the desired entity, you pick out the topmost of the two ordered elements, and the other way yields the bottommost. (The wikipedia article on distributive lattices has some very good diagrams to visualize all this). So a distributive lattice has even more structure than before — incomparable elements must have enough meets and joins to prevent these sublattices from appearing, and this forces even more the appearance of a tiled-diamond like structure.

To get us to a Heyting algebra, we need more structure still — we need implication, which is like an internal function arrow, or an internal ≤ relation. Recall that the equation we want to satisfy is "c ∧ a ≤ b < -> c ≤ a → b". The idea is that we should be able to read ≤ itself as an "external implication" and so if c and a taken together imply b, "a implies b" is the portion of that implication if we "only have" c. We can see it as a partial application of the external implication. If we have a lattice that permits infinite joins (or just a finite lattice such that we don't need them), then it is straightforward to see how to construct this. To build a → b, we just look at every possible choice of c that satisfies c ∧ a ≤ b, and then take the join of all of them to be our object a → b. Then, by construction, a → b is necessarily greater than or equal to any c that satisfies the left hand side of the equation. And conversely, any element that a → b is greater than is necessarily one that satisfies the left hand side, and the bi-implication is complete. (This, by the way, gives a good intuition for the definition of an exponential in a category of presheaves). Another way to think of a → b is as the greatest element of the lattice such that a → b ∧ a ≤ b (exercise: relate this to the first definition). It is also a good exercise to explore what happens in certain simple cases — what if a is 0 (false)? What if it is 1? The same as b? Now ask the same questions of b.

So why is a Heyting algebra a topological construct? Consider any topological space as given by a collection of open sets, satisfying the usual principles (including the empty set and the total set, and closed under union and finite intersection). These covers have a partial ordering, given by containment. They have unions and intersections (all joins and meets), a top and bottom element (the total space, and the null space). Furthermore, they have an implication operation as described above. As an open set, a → b is given by the meet of all opens c for which a ∧ c ≤ b. (We can think of this as "the biggest context, for which a ⊢ b"). In fact, the axioms for open sets feel almost exactly like the rules we've described for Heyting algebras. It turns out this is only half true — open sets always give Heyting algebras, and we can turn every Heyting algebra into a space. However, in both directions the round trip may take us to somewhere slightly different than where we started. Nonetheless it turns out that if we take complete Heyting algebras where finite meets distribute over infinite joins, we get something called "frames." And the opposite category of frames yields "locales" — a suitable generalization of topological spaces, first named by John Isbell in 1972 [5]. Spaces that correspond precisely to locales are called sober, and locales that correspond precisely to spaces are said to have "enough points" or be "spatial locales".

In fact, we don't need to fast-forward to 1972 to get some movement in the opposite direction. In 1944, McKinsey and Tarski embarked on a program of "The Algebra of Topology" which sought to describe topological spaces in purely algebraic (axiomatic) terms [6]. The resultant closure algebras (these days often discussed as their duals, interior algebras) provided a semantics for S4 modal logic. [7] A further development in this regard came with Kripke models for logic [8] (though arguably they're really Beth models [9]).

Here's an easy way to think about Kripke models. Start with any partial ordered set. Now, for each object, instead consider instead all morphisms into it. Since each morphism from any object a to any object b exists only if a ≤ b, and we consider such paths unique (if there are two "routes" showing a ≤ b, we consider them the same in this setting) this amounts to replacing each element a with the set of all elements ≤ a. (The linked pdf does this upside down, but it doesn't really matter). Even though the initial setting may not have been Heyting algebra, this transformed setting is a Heyting algebra. (In fact, by a special case of the Yoneda lemma!). This yields Kripke models.

Now consider "collapsings" of elements in the initial partial order — monotone downwards maps taken by sending some elements to other elements less than them in a way that doesn't distort orderings. (I.e. if f(a) ≤ f(b) in the collapsed order, then that means that a ≤ b in the original order). Just as we can lift elements from the initial partial order into their downsets (sets of elements less than them) in the kripkified Heyting Algebra, we can lift our collapsing functions into collapsing functions in our generated Heyting Algebra. With a little work we can see that collapsings in the partial order also yield collapsings in the Heyting Algebra.

Furthermore, it turns out, more or less, that you can generate every closure algebra in this way. Now if we consider closure algebras a bit (and this shouldn't surprise us if we know about S4), we see that we can always take a to Ca, that if we send a → b, then we can send Ca → Cb, and furthermore that CCa → Ca in a natural way (in fact, they're equal!). So closure algebras have the structure of an idempotent monad. (Note: the arrows here should not be seen as representing internal implication — as above they represent the logical turnstile ⊢ or perhaps, if you're really in a Kripke setting, the forcing turnstile ⊩).

Now we have a correspondence between logic and computation (Curry-Howard), logic and categories (Lambek-Scott), and logic and spaces (Tarski-Stone). So maybe, instead of Curry-Howard-Lambek, we should speak of Curry-Howard-Lambek-Scott-Tarski-Stone! (Or, if we want to actually bother to say it, just Curry-Howard-Lambek-Stone. Sorry, Tarski and Scott!) Where do the remaining correspondences arise from? A cubical Kan operation, naturally! But let us try to sketch in a few more details.

3. Spaces, Categories

All this about monads and Yoneda suggests that there's something categorical going on. And indeed, there is. A poset is, in essence, a "decategorified category" — that is to say, a category where any two objects have at most one morphism between them. I think of it as if it were a balloon animal that somebody let all the air out of. We can pick up the end of our poset and blow into it, inflating the structure back up, and allowing multiple morphisms between each object. If we do so, something miraculous occurs — our arbitrary posets turn into arbitrary categories, and the induced Heyting algebra from their opens turns into the induced category of set-valued presheaves of that category. The resultant structure is a presheaf topos. If we "inflate up" an appropriate notion of a closure operator we arrive at a Grothendieck topos! And indeed, the internal language of a topos is higher-order intuitionistic type theory [10].

4. Spaces, Programming Languages

All of this suggests a compelling story: logic describes theories via algebraic syntax. Equipping these theories with various forms of structural operations produces categories of one sort or another, in the form of fibrations. The intuition is that types are spaces, and contexts are also spaces. And furthermore, types are covered by the contexts in which their terms may be derived. This is one sense in which we it seems possible to interpret the Meillies/Zeilberger notion of a type refinement system as a functor [11].

But where do programming languages fit in? Programming languages, difficult as it is to sometimes remember, are more than their type theories. They have a semantic of computation as well. For example, a general topos does not have partial functions, or a fixed point combinator. But computations, often, do. This led to one of the first applications of topology to programming languages — the introduction of domain theory, in which terms are special kinds of spaces — directed complete partial orders — and functions obey a special kind of continuity (preservation of directed suprema) that allows us to take their fixed points. But while the category of dcpos is cartesian closed, the category of dcpos with only appropriately continuous morphisms is not. Trying to resolve this gap, one way or another, seems to have been a theme of research in domain theory throughout the 80s and 90s [12].

Computations can also be concurrent. Topological and topos-theoretic notions again can play an important role. In particular, to consider two execution paths to be "the same" one needs a notion of equivalence. This equivalence can be seen, stepwise, as a topological "two-cell" tracing out at each step an equivalence between the two execution paths. One approach to this is in Joyal, Nielson and Winskel's treatment of open maps [13]. I've also just seen Patrick Schultz and David I. Spivak's "Temporal Type Theory" which seems very promising in this regard [14].

What is the general theme? Computation starts somewhere, and then goes somewhere else. If it stayed in the same place, it would not "compute". A computation is necessarily a path in some sense. Computational settings describe ways to take maps between spaces, under a suitable notion of topology. To describe the spaces themselves, we need a language — that language is a logic, or a type theory. Toposes are a canonical place (though not the only one) where logics and spaces meet (and where, to a degree, we can even distinguish their "logical" and "spatial" content). That leaves categories as the ambient language in which all this interplay can be described and generalized.

5. Spaces, Categories

All the above only sketches the state of affairs up to roughly the mid '90s. The connection to spaces starts in the late 30s, going through logic, and then computation. But the categorical notion of spaces we have is in some sense impoverished. A topos-theoretic generalization of a space still only describes, albeit in generalized terms, open sets and their lattice of subobject relations. Spaces have a whole other structure built on top of that. From their topology we can extract algebraic structures that describe their shape — this is the subject of algebraic topology. In fact, it was in axiomatizing a branch of algebraic topology (homology) that category theory was first compelled to be invented. And the "standard construction" of a monad was first constructed in the study of homology groups (as the Godement resolution).

What happens if we turn the tools of categorical generalization of algebraic topology on categories themselves? This corresponds to another step in the "categorification" process described above. Where to go from "0" to "1" we took a partially ordered set and allowed there to be multiple maps between objects, to go from "1" to "2" we can now take a category, where such multiple maps exist, and allow there to be multiple maps between maps. Now two morphisms, say "f . g" and "h" need not merely be equal or not, but they may be "almost equal" with their equality given by a 2-cell. This is just as two homotopies between spaces may themselves be homotopic. And to go from "2" to "3" we can continue the process again. This yields n-categories. An n-category with all morphisms at every level invertible is an (oo,0)-category, or an infinity groupoid. And in many setups this is the same thing as a topological space (and the question of which setup is appropriate falls under the name "homotopy hypothesis" [15]). When morphisms at the first level (the category level) can have direction (just as in normal categories) then those are (oo,1)-categories, and the correspondence between groupoids and spaces is constructed as an equivalence of such categories. These too have direct topological content, and one setting in which this is especially apparent is that of quasi-categories, which are (oo,1)-categories that are built directly from simplicial sets — an especially nice categorical model of spaces (the simplicial sets at play here are those that satisfy a "weak" Kan condition, which is a way of asking that composition behave correctly).

It is in these generalized (oo,1)-toposes that homotopy type theory takes its models. And, it is hypothesized that a suitable version of HoTT should in fact be the initial model (or "internal logic") of an "elementary infinity topos" when we finally figure out how to describe what such a thing is.

So perhaps it is not that we should be computational trinitarians, or quadrinitarians. Rather, it is that the different aspects which we examine — logic, languages, categories, spaces — only appear as distinct manifestations when viewed at a low dimensionality. In the untruncated view of the world, the modern perspective is, perhaps, topological pantheism — spaces are in all things, and through spaces, all things are made as one.

Thanks to James Deikun and Dan Doel for helpful technical and editorial comments


by Gershom Bazerman at January 18, 2018 08:15 PM

Mark Jason Dominus

England in a pother in 1533

In autumn 2014 I paid for something and got $15.33 in change. I thought I'd take the hint from the Universe and read Wikipedia's article on the year 1533. This turned out unexpectedly exciting. 1533 was a big year in English history. Here are the highlights:

  • January 25: King Henry VIII marries Anne Boleyn
  • March 30: Thomas Cranmer becomes Archbishop of Canterbury
  • May 23: Cranmer annuls Henry's marriage to Catherine of Aragon
  • June 1: Boleyn crowned queen consort by Cranmer
  • July 11: Henry excommunicated by Pope Clement VII
  • September 7: Princess Elizabeth, later Queen Elizabeth I, is born to Boleyn

A story clearly emerges here, the story of Henry's frantic response to Anne Boleyn's surprise pregnancy.

The first thing to notice is that Elizabeth was born only seven months after Henry married Boleyn. The next thing to notice is that Henry was still married to Catherine when he married Boleyn. He had to get Cranmer to annul the marriage, issuing a retroactive decree that not only was Henry not married to Catherine, but he had never been married to her.

In 2014 I imagined that Henry appointed Cranmer to be Archbishop on condition that he get the annulment, and eventually decided that was not the case. Looking at it now, I'm not sure why I decided that.

Wikipedia says:

While Cranmer was following Charles through Italy, he received a royal letter dated 1 October 1532 informing him that he had been appointed the new Archbishop of Canterbury, following the death of archbishop William Warham [on 22 August]. Cranmer was ordered to return to England. The appointment had been secured by the family of Anne Boleyn, who was being courted by Henry.

Cranmer had been working on that annulment since at least 1527. In 1532 he was ambassador to Charles V, the Holy Roman Emperor, who was the nephew of Henry's current wife Catherine. I suppose a large part of Cranmer's job was trying to persuade Charles to support the annulment. (He was unsuccessful.) When Charles conveniently went to Rome (what for? Wikipedia doesn't say) Cranmer followed him and tried to drum up support there for the annulment. (He was unsuccessful in that too.)

Fortunately there was a convenient vacancy, and Henry called him back to fill it, and got his annulment that way. In 2014 I thought Warham's death was just a little too convenient, but I decided that he died too early for it to have been arranged by Henry. Now I'm less sure — Henry was certainly capable of such cold-blooded planning — but I can't find any mention of foul play, and The Divorce of Henry VIII: The Untold Story from Inside the Vatican describes the death as “convenient though entirely natural”.

[ Addendum: This article used to say that Elizabeth was born “less than seven months” after Henry and Boleyn's marriage. Daniel Holtz has pointed out that this was wrong. The exact amount is 225 days, or 32 weeks plus one day. The management regrets the error. ]

by Mark Dominus ( at January 18, 2018 07:16 AM

January 16, 2018

Mark Jason Dominus

Plutonium collection

In an earlier article, I said:

If I were in charge of keeping plutonium out of the wrong hands, I would still worry about [people extracting it from plutonium-fueled pacemakers].

This turns out to be no worry at all. The isotope in the pacemaker batteries is Pu-238, which is entirely unsuitable for making weapons. Pu-238 is very dangerous, being both radioactive and highly poisonous, but it is not fissile. In a fission chain reaction, an already-unstable atomic nucleus is hit by a high-energy neutron, which causes it to fragment into two lighter nuclei. This releases a large amount of nuclear binding energy, and more neutrons which continue the reaction. The only nuclei that are unstable enough for this to work have an odd number of neutrons (for reasons I do not understand), and Pu-238 does not fit the bill (Z=94, N=144). Plutonium fission weapons are made from Pu-241 (N=147), and this must be carefully separated from the Pu-238, which tends to impede the chain reaction. Similarly, uranium weapons are made from U-235, and this must be painstakingly extracted from the vastly more common U-238 with high-powered centrifuges.

But I did not know this when I spent part of the weekend thinking about the difficulties of collecting plutonium from pacemakers, and discussing it with a correspondent. It was an interesting exercise, so I will publish it anyway.

While mulling it over I tried to identify the biggest real risks, and what would be the most effective defenses against them. An exercise one does when considering security problems is to switch hats: if I were the bad guy, what would I try? What problems would I have to overcome, and what measures would most effectively frustrate me? So I put on my Black Hat and tried to think about it from the viewpoint of someone, let's call him George, who wants to build a nuclear weapon from pacemaker batteries.

I calculated (I hope correctly) that a pacemaker had around 0.165 mg of plutonium, and learned online that one needs 4–6 kg to make a plutonium bomb. With skill and experience one can supposedly get this down to 2 kg, but let's take 25,000 pacemakers as the number George would need. How could he get this much plutonium?

(Please bear in mind that the following discussion is entirely theoretical, and takes place in an imaginary world in which plutonium-powered pacemakers are common. In the real world, they were never common, and the last ones were manufactured in 1974. And this imaginary world exists in an imaginary universe in which plutonium-238 can sustain a chain reaction.)

Obviously, George's top target would be the factory where the pacemakers are made. Best of all is to steal the plutonium before it is encapsulated, say just after it has been delivered to the factory. But equally obviously, this is where the security will be the most concentrated. The factory is not as juicy a target as it might seem at first. Plutonium is radioactive and toxic, so they do not want to have to store a lot of it on-site. They will have it delivered as late as possible, in amounts as small as possible, and use it up as quickly as possible. The chances of George getting a big haul of plutonium by hitting the factory seem poor.

Second-best is for George to steal the capsules in bulk before they are turned into pacemakers. Third-best is for him to steal cartons of pacemakers from the factory or from the hospitals they are delivered to. But bulk theft is not something George can pull off over and over. The authorities will quickly realize that someone is going after pacemakers. And after George's first heist, everyone will be looking for him.

If the project gets to the point of retrieving pacemakers after they are implanted, George's problems multiply enormously. It is impractical to remove a pacemaker from of a living subject. George would need to steal them from funeral homes or crematoria. These places are required to collect the capsules for return to Oak Ridge, and conceivably might sometimes have more than one on hand at a time, but probably not more than a few. It's going to be a long slog, and it beggars belief that George would be able to get enough pacemakers this way without someone noticing that something was up.

The last resort is for George to locate people with pacemakers, murder, and dissect them. Even if George somehow knows whom to kill, he'd have to be Merlin to arrange the murder of 25,000 people without getting caught. Merlin doesn't need plutonium; he can create nuclear fireballs just by waving his magic wand.

If George does manage to collect the 25,000 capsules, his problems get even worse. He has to open the titanium capsules, already difficult because they are carefully made to be hard to open — you wouldn't want the plutonium getting out, would you? He has to open them without spilling the plutonium, or inhaling it, or making any sort of mess while extracting it. He has to do this 25,000 times without messing up, and without ingesting the tiniest speck of plutonium, or he is dead.

He has to find a way to safely store the plutonium while he is accumulating it. He has to keep it hidden not only from people actively looking for him — and they will be, with great yearning — but also from every Joe Blow who happens to be checking background radiation levels in the vicinity.

And George cannot afford to take his time and be cautious. He is racing against the clock, because every 464 days, 1% of his accumulated stock, however much that is, will turn into U-234 and be useless. The more he accumulates, the harder it is to keep up. If George has 25,000 pacemakers in a warehouse, ready for processing, one pacemaker-worth of Pu-238 will be going bad every two days.

In connection with this, my correspondent brought up the famous case of the Radioactive Boy Scout, which I had had in mind. (The RBS gathered a recklessly large amount of americium-241 from common household smoke detectors.) Ignoring again the unsuitability of americium for fission weapons (an even number of neutrons again), the project is obviously much easier. At the very least, you can try calling up a manufacturer of smoke alarms, telling them you are building an apartment complex in Seoul, and that you need to bulk-order 2,000 units or whatever. You can rob the warehouse at Home Depot. You can even buy them online.

by Mark Dominus ( at January 16, 2018 04:32 PM

January 14, 2018

Mark Jason Dominus

How do plutonium-powered pacemakers work?

I woke up in the middle of the night wondering: Some people have implanted medical devices, such as pacemakers, that are plutonium-powered. How the hell does that work? The plutonium gets hot, but what then? You need electricity. Surely there is not a tiny turbine generator in there!

There is not, and the answer turns out to be really interesting, and to involve a bunch of physics I didn't know.

If one end of a wire is hotter than the other, electrons tend to diffuse from the hot end to the cold end; the amount of diffusion depends on the material and the temperature. Take two wires of different metals and join them into a loop. (This is called a thermocouple.) Make one of the joints hotter than the other. Electrons will diffuse from the hot joint to the cold joint. If there were only one metal, this would not be very interesting. But the electrons diffuse better through one wire (say wire A) than through the other (B), and this means that there will be net electron flow from the hot side down through wire A and then back up through B, creating an electric current. This is called the Seebeck effect. The potential difference between the joints is proportional to the temperature difference, on the order of a few hundred microvolts per kelvin. Because of this simple proportionality, the main use of the thermocouple is to measure the temperature difference by measuring the voltage or current induced in the wire. But if you don't need a lot of power, the thermocouple can be used as a current source.

In practice they don't use a single loop, but rather a long loop of alternating metals, with many junctions:

A long row of conductors alternately of two different materials,   each joined in series to the end of the next, snaking back and forth   so that all the A-B junctions are on top and all the B-A junctions   on the bottom.  Between the top and bottom is an insulating layer.   The top set of junctions is heated.  Heat flows from top to bottom   and creates a current in the series of   conductors.

This is called a thermopile; when the heat source is radioactive material, as here, the device is called a radioisotope thermoelectric generator (RTG). The illustration shows the thermocouples strung out in a long line, but in an actual RTG you put the plutonium in a capsule and put the thermocouples in the wall of the capsule, with the outside joints attached to heat sinks. The plutonium heats up the inside joints to generate the current.

RTGs are more commonly used to power spacecraft, but there are a few dozen people still in the U.S. with plutonium-powered thermopile batteries in their pacemakers.

In pacemakers, the plutonium was sealed inside a titanium capsule, which was strong enough to survive an accident (such as a bullet impact or auto collision) or cremation. But Wikipedia says the technique was abandoned because of worries that the capsule wouldn't be absolutely certain to survive a cremation. (Cremation temperatures go up to around 1000°C; titanium melts at 1668°C.) Loose plutonium in the environment would be Very Bad.

(I wondered if there wasn't also worry about plutonium being recovered for weapons use, but the risk seems much smaller: you need several kilograms of plutonium to make a bomb, and a pacemaker has only around 135 mg, if I did the conversion from curies correctly. Even so, if I were in charge of keeping plutonium out of the wrong hands, I would still worry about this. It does not seem totally out of the realm of possibility that someone could collect 25,000 pacemakers. Opening 25,000 titanium capsules does sound rather tedious.)

Earlier a completely different nuclear-powered pacemaker was tried, based on promethium-powered betavoltaics. This is not a heat-conversion process. Instead, a semiconductor does some quantum physics magic with the electrons produced by radioactive beta decay. This was first demonstrated by Henry Moseley in 1913. Moseley is better-known for discovering that atoms have an atomic number, thus explaining the periodic table. The periodic table had previously been formulated in terms of atomic mass, which put some of the elements in the wrong order. Scientists guessed they were in the wrong order, because the periodicity didn't work, but they weren't sure why. Moseley was able to compute the electric charge of the atomic nucleus from spectrographic observations. I have often wondered what else Moseley would have done if he had not been killed in the European war at the age of 27.

It took a while to gather the information about this. Several of Wikipedia's articles on the topic are not up to their usual standards. The one about the radioisotope thermoelectric generator is excellent, though.

Thermopile illustration is by FluxTeq (Own work) CC BY-SA 4.0, via Wikimedia Commons.

[ Addendum 20180115: Commenters on Hacker News have pointed out that my concern about the use of plutonium in fission weapons is easily satisfied: the fuel in the batteries is Pu-238, which is not fissile. The plutonium to use in bombs is Pu-241, and indeed, when building a plutonium bomb you need to remove as much Pu-238 as possible, to prevent its non-fissile nuclei from interfering with the chain reaction. Interestingly, you can tell this from looking at the numbers: atomic nuclei with an odd number of neutrons are much more fissile than those with an even number. Plutonium is atomic number 94, so Pu-238 has an even number of neutrons and is not usable. The other isotope commonly used in fission is U-235, with 141 neutrons. I had planned to publish a long article today detailing the difficulties of gathering enough plutonium from pacemakers to make a bomb, but now I think I might have to rewrite it as a comedy. ]

[ Addendum 20170116: I published it anyway, with some editing. ]

by Mark Dominus ( at January 14, 2018 12:00 AM

January 11, 2018

Yesod Web Framework

Upcoming Yesod breaking changes

With all of the talk I've had about breaking changes in my libraries, I definitely didn't want the Yesod world to feel left out. We've been stable at yesod-core version 1.4 since 2014. But the changes going through my package ecosystem towards MonadUnliftIO are going to affect Yesod as well. The question is: how significantly?

For those not aware, MonadUnliftIO is an alternative typeclass to both MonadBaseControl and the MonadCatch/MonadMask classes in monad-control and exceptions, respectively. I've mentioned the advantages of this new approach in a number of places, but the best resource is probably the release announcement blog post.

At the simplest level, the breaking change in Yesod would consist of:

  • Modifying WidgetT's internal representation. This is necessary since, currently, it's implemented as a WriterT. Instead, to match with MonadUnliftIO, it needs to be a ReaderT holding an IORef. This is just about as minor a breaking change as I can imagine, since it only affects internal modules. (Said another way: it could even be argued to be a non-breaking change.)
  • Drop the MonadBaseControl and MonadCatch/MonadMask instances. This isn't strictly necessary, but has two advantages: it allows reduces the dependency footprint, and further encourages avoiding dangerous behavior, like using concurrently with a StateT on top of HandlerT.
  • Switch over to the new versions of the dependent libraries that are changing, in particular conduit and resourcet. (That's not technically a breaking change, but I typically consider dropping support for a major version of a dependency a semi-breaking change.)
  • A number of minor cleanups that have been waiting for a breaking changes. This includes things like adding strictness annotations in a few places, and removing the defunct GoogleEmail and BrowserId modules.

This is a perfectly reasonable set of changes to make, and we can easily call this Yesod 1.5 (or 2.0) and ship it. I'm going to share one more slightly larger change I've experimented with, and I'd appreciated feedback on whether it's worth the breakage to users of Yesod.

Away with transformers!

NOTE All comments here, as is usually the case in these discussions, refer to code that must be in IO anyway. Pure code gets a pass.

You can check out the changes (which appear larger than they actually are) in the no-transformers branch. You'll see shortly that that's a lie, but it does accurately indicate intent. If you look at the pattern of the blog posts and recommended best practices I've been discussing for the past year, it ultimately comes down to a simple claim: we massively overuse monad transformers in modern Haskell.

The most extreme response to this claim is that we should get rid of all transformers, and just have our code live in IO. I've made a slight compromise to this for ergonomics, and decided it's worth keeping reader capabilities, because it's a major pain (or at least perceived major pain) to pass extra stuff around for, e.g., simple functions like logInfo.

The core data type for Yesod is HandlerT, with code that looks like getHomeR :: HandlerT App IO Html. Under the surface, HandlerT looks something like:

newtype HandlerT site m a = HandlerT (HandlerData site -> m a)

Let's ask a simple question: do we really need HandlerT to be a transformer? Why not simply rewrite it to be:

newtype HandlerFor site a = HandlerFor (HandlerData site -> IO a)

All we've done is replaced the m type parameter with a concrete selection of IO. There are already assumptions all over the place that your handlers will necessarily have IO as the base monad, so we're not really losing any generality. But what we gain is:

  • Slightly clearer error messages
  • Less type constraints, such as MonadUnliftIO m, floating around
  • Internally, this actually simplifies quite a few ugly things around weird type families

We can also regain a lot of backwards compatibility with a helper type synonym:

type HandlerT site m = HandlerFor site

Plus, if you're using the Handler type synonym generated by the Template Haskell code, the new version of Yesod would just generate the right thing. Overall, this is a slight improvement, and we need to weigh the benefit of it versus the cost of breakage. But let me throw one other thing into the mix.

Handling subsite (yes, transformers)

I lied, twice: the new branch does use transformers, and HandlerT is more general than HandlerFor. In both cases, this has to do with subsites, which have historically been a real pain to write (using them hasn't been too bad). In fact, the entire reason we have HandlerT today is to try and make subsites work in a nicely layered way (which I think I failed at). Those who have been using Yesod long enough likely remember GHandler as a previous approach for this. And anyone who has played with writing a subsite, and the hell which ensues when trying to use defaultLayout, will agree that the situation today is not great.

So cutting through all of the crap: when writing a subsite, almost everything is the same as writing normal handler code. The following differences pop up:

  • When you call getYesod, you get the master site's app data (e.g. App in a scaffolded site). You need some way to get the subsite's data as well (e.g., the Static value in yesod-static).
  • When you call getCurrentRoute, it will give you a route in the master site. If you're inside yesod-auth, for instance, you don't want to deal with all of the possible routes in the parent, but instead get a route for the subsite itself.
  • If I'm generated URLs, I need some way to convert the routes for a subsite into the parent site.

In today's Yesod, we provide these differences inside the HandlerT type itself. This ends up adding some weird complications around special-casing the base (and common) case where m is IO. Instead, in the new branch, we have just one layer of ReaderT sitting on top of HandlerFor, providing these three pieces of functionality. And if you want to get a better view of this, check out the code.

What to do?

Overall, I think this design is more elegant, easier to understand, and simplifies the codebase. In reality, I don't think it's either a major departure from the past, or a major improvement, which is what leaves me on the fence about the no transformer changes.

We're almost certainly going to have a breaking change in Yesod in the near future, but it need not include this change. If it doesn't, the breaking change will be the very minor one mentioned above. If the general consensus is in favor of this change, then we may as well throw it in at the same time.

January 11, 2018 09:29 PM

January 10, 2018

FP Complete

Big Data vs Business Intelligence: What’s the difference?

Data analytics systems are the cutting edge of modern corporate computing. While many people may feel they are behind the “state of the art” they read about, the truth is these are projects we’re implementing currently for prominent companies in life sciences, finance, healthcare, Internet services, and aerospace, They have a lot in common with each other, and likely even with your computing environment.

by Aaron Contorer ( at January 10, 2018 08:58 PM

January 09, 2018

Michael Snoyman

Breaking changes, dependency trees

My previous blog post discussed a possible upcoming breaking change to the conduit library: dropping finalizers. This is one of a number of other breaking changes I have planned. Another one is switching over from MonadBaseControl to MonadUnliftIO, for reasons I've discussed at length before and spoken about too.

Beyond this change, I have a number of others planned out as well, some more solidly than others. I've started a document describing some of these, and I wanted to bring up one point in this design space for some user feedback: conduit dependency trees.


The situation today is that we have a dependency graph that looks something like the following:

  • resourcet is at the base of the hierarchy, and defines some non-conduit-specific types and functions used throughout the conduit ecosystem. It currently depends on a number of packages, like monad-control, but that number will naturally drop as we move over to MonadUnliftIO exclusively.
  • conduit is designed to provide basic conduit functionality with fewer dependencies. It does depend on resourcet, and packages like monad-control. But it does not depend on bytestring, text, or vector, even though these are almost always wanted with conduit. It provides the Data.Conduit.List set of combinators, which are not the best ones out there.
  • conduit-extra adds lots of dependencies, including things like attoparsec, and provides a nicer set of helpers around bytestring and text.
  • And finally, at the top of the tree (or our tree for today), we've got conduit-combinators, which provides the combinators I actually recommend people use in the Data.Conduit.Combinator module. This has lots of dependencies, since it inherits from conduit-extra and also adds in some extra things like mwc-random.


  • You can use resourcet without touching the conduit ecosystem at all
  • You can use conduit without pulling in lots of resources
  • Data.Conduit.Combinators is fully loaded


  • The current dependency footprint even at the base is higher than I'd like, though that's getting fixed soon regardless.
  • The conduit package is not super useful on its own due to lack of bytestring, text, and vector support.
  • To get the functionality you want in either conduit-extra or conduit-combinators, you end up with a much larger dependency footprint.

Plans for the next version

I have a number of different ideas in mind. I'll start off with the most conservative plan, and mention some variants below.

  • As already mentioned, resourcet drops a bunch of dependencies. Nothing too interesting there.
  • conduit adds a dependency on bytestring, text, and vector as basic libraries everyone should be using anyway. We move over Data.Conduit.Combinators and provide most of its functionality in conduit itself, and start recommending against Data.Conduit.List, Data.Conduit.Binary, and Data.Conduit.Text.
  • conduit-extra basically remains as-is
  • conduit-combinators retains the extra functionality not present in the new conduit


  • The conduit package now provides most of the functionality you'll want on a day-to-day basis
  • The dependency footprint for the Data.Conduit.Combinators module is much reduced
  • We can finally get away from the not-well-named functions in Data.Conduit.List

There aren't necessarily downsides to this approach, as I think it's simply better than what we have today already. But I want to list out the alternatives, which will make clear some things that could be possibly better still.

  • What do we do with the mono-traversable package? It's currently a dependency of conduit-combinators, and the simplest path forward for the above is to make conduit depend on mono-traversable. However, this is a slightly heavier dependency footprint, requiring adding in unordered-containers and vector-algorithms. Alternatives: Strip down mono-traversable to have less deps Redefine parts of mono-traversable needed for conduit in conduit itself Going crazy: really move mono-traversable into conduit and swap the dependency tree around My inclination: minimize mono-traversable's dependencies a bit more (like dropping the split package, and maybe vector-algorithms) and make it a dependency of conduit.
  • Do we really need conduit-combinators as well as conduit-extra? It's just adding a few extra pieces of functionality over conduit-extra, and perhaps those should be folded into conduit-extra itself.
  • Some people may not like the heavier dep footprint of conduit now. Should we split off a conduit-core package providing the core data types, functions, and operators, and have conduit depend on that?
  • It feels almost silly to have the ResourceT data type live in a separate package. If we have conduit-core, that could be a logical place to put it, since it won't have any extra dependencies versus the resourcet package itself, and then we can turn resourcet into a backwards compatibility layer. Or it may be logical to place ResourceT in the unliftio-core package, since both concepts help with resource cleanup in monad transformers. The former is necessary for continuation-based monads, while the latter (MonadUnliftIO) works for simpler monads.

If people have feedback, I'm happy to hear about it. I've spent an unfortunate amount of time bouncing around between these different options, so hopefully writing it all down and hearing some outside opinions can help move this forward.

January 09, 2018 12:00 PM

January 08, 2018

Ken T Takusagawa

[iblofees] Calendar Facts

Here is a machine readable version of xkcd #1930 "Calendar Facts", perhaps useful for followup projects like creating a random fact generator.  Further notes follow.

Sequence [Atom "Did you know that",Choice [Sequence [Atom "the",Choice [Sequence [Choice [Atom "fall",Atom "spring"],Atom "equinox"],Sequence [Choice [Atom "winter",Atom "summer"],Choice [Atom "solstice",Atom "Olympics"]],Sequence [Choice [Atom "earliest",Atom "latest"],Choice [Atom "sunrise",Atom "sunset"]]]],Sequence [Atom "Daylight",Choice [Atom "Saving",Atom "Savings"],Atom "Time"],Sequence [Atom "leap",Choice [Atom "day",Atom "year"]],Atom "Easter",Sequence [Atom "the",Choice [Atom "Harvest",Atom "super",Atom "blood"],Atom "moon"],Atom "Toyota Truck Month",Atom "Shark Week"],Choice [Sequence [Atom "happens",Choice [Atom "earlier",Atom "later",Atom "at the wrong time"],Atom "every year"],Sequence [Atom "drifts out of sync with the",Choice [Atom "sun",Atom "moon",Atom "zodiac",Sequence [Choice [Atom "Gregorian",Atom "Mayan",Atom "lunar",Atom "iPhone"],Atom "calendar"],Atom "atomic clock in Colorado"]],Sequence [Atom "might",Choice [Atom "not happen",Atom "happen twice"],Atom "this year"]],Atom "because of",Choice [Sequence [Atom "time zone legislation in",Choice [Atom "Indiana",Atom "Arizona",Atom "Russia"]],Atom "a decree by the Pope in the 1500s",Sequence [Choice [Atom "precession",Atom "libration",Atom "nutation",Atom "libation",Atom "eccentricity",Atom "obliquity"],Atom "of the",Choice [Atom "moon",Atom "sun",Atom "earth's axis",Atom "equator",Atom "prime meridian",Sequence [Choice [Atom "International Date",Atom "Mason-Dixon"],Atom "line"]]],Atom "magnetic field reversal",Sequence [Atom "an arbitrary decision by",Choice [Atom "Benjamin Franklin",Atom "Isaac Newton",Atom "FDR"]]],Atom "? ",Atom "Apparently",Choice [Atom "it causes a predictable increase in car accidents",Atom "that's why we have leap seconds",Atom "scientists are really worried",Sequence [Atom "it was even more extreme during the",Choice [Sequence [Choice [Atom "Bronze",Atom "Ice"],Atom "Age"],Atom "Cretaceous",Atom "1990s"]],Sequence [Atom "there's a proposal to fix it, but it",Choice [Atom "will never happen",Atom "actually make things worse",Atom "is stalled in Congress",Atom "might be unconstitutional"]],Atom "it's getting worse and no one knows why"],Atom ". ",Atom "While it may seem like trivia, it",Choice [Atom "causes huge headaches for software developers",Atom "is taken advantage of by high-speed traders",Atom "triggered the 2003 Northeast Blackout",Atom "has to be corrected for by GPS satellites",Atom "is now recognized as a major cause of World War I"],Atom "."]

Including the mouseover text, the grammar encodes 780,000 facts.

The above grammar is the output of "show" by a Haskell program where we typed the grammar slightly more compactly, using/abusing the OverloadedStrings language extension and a Num instance.  OverloadedStrings is a nice language extension for when we have a large number of literals in the source.  Excerpts of the full source code below:

{-# LANGUAGE OverloadedStrings #-}

data Grammar = Atom String | Choice [Grammar] | Sequence [Grammar] deriving (Show,Eq);

instance IsString Grammar where {
fromString = Atom;

instance Num Grammar where {
(+) x y = Choice [x,y];
(*) x y = Sequence [x,y];
abs = undefined;
signum = undefined;
fromInteger = undefined;
negate = undefined;

facts :: Grammar;
facts =
    "Did you know that"

by Ken ( at January 08, 2018 08:32 PM

Roman Cheplyaka

New patterns in tasty

When I wrote tasty in 2013, I borrowed the pattern language and its implementation from test-framework. I wasn’t fond of that pattern language, but it did the job most of the time, and the task of coming up with a better alternative was daunting.

Over the years, however, the pattern language and implementation received more feature requests and complaints than any other aspect of tasty.

  • Mikhail Glushenkov wanted to run all tests in a group except a selected few, e.g. --pattern 'test-group/**' --ignore 'test-group/test-3'.
  • Utku Demir wanted to specify a list of tests to run, such as “A and B, but not C and D”; Philipp Hausmann and Süleyman Özarslan requested something similar in the same issue.
  • Rob Stewart wanted a pattern that would match Bar but not FooBar.
  • Daniel Mendler reported that patterns with slashes didn’t work because slashes have special meaning in the patterns.
  • Levent Erkök, whose package sbv has more than 30k tasty tests, was concerned about the performance of pattern matching.
  • Allowing regular expressions in patterns would fulfill some (though not all) of these requests. However, using a regular expression library has its downsides. Carter Schonwald repeatedly complained to me that regex-tdfa takes ages to compile. Simon Jakobi noted that regex-tdfa brings along several transitive dependencies (including parsec), which further increases compilation time, makes the package more prone to breakages, and makes it harder for the maintainers of core packages to use tasty.

Every time someone filed an issue about tasty patterns, I would say something like “oh, this will be fixed once I get around to that issue from 2013 about the new patterns”. But I still had no idea what that new pattern language should look like.

The new pattern language had to be expressive, containing at least the boolean operators. It had to allow matching against the test name, the name of any of the groups containing the test, or the full test path, like Foo/Bar/Baz for a test named Baz in the test group Bar, which itself is contained in the top-level group Foo.

Finally, there was an issue of familiarity. Whatever ad-hoc DSL I would come up with, I had to document thoroughly its syntax and semantics, and then I had to convince tasty users to learn a new language and read the docs every time they wanted to filter their tests. (Not that the old patterns were particularly intuitive.)

The insight came to me last summer while I was spending time with my family and working remotely from a cabin in Poltava oblast, Ukraine. The language I needed already existed and was relatively well-known. It’s called AWK!

<figure> My workspace in Poltava oblast<figcaption>My workspace in Poltava oblast</figcaption> </figure>

In AWK, the variable1 $0 refers to the current line of input (called a “record”), and the variables $1, $2 etc. refer to the fields resulting from splitting the record on the field separator (like a tab or a comma).

The analogy with test names in tasty is straightforward: $0 denotes the full path of the test, $1 denotes the outermost test group name, $2 for the next group name, and so on. The test’s own name is $NF.

Then you can use these variables together with string, numeric, and boolean operators. Some examples:

  • $2 == "Two" — select the subgroup Two
  • $2 == "Two" && $3 == "Three" — select the test or subgroup named Three in the subgroup named Two
  • $2 == "Two" || $2 == "Twenty-two" — select two subgroups
  • $0 !~ /skip/ or ! /skip/ — select tests whose full names (including group names) do not contain the word skip
  • $NF !~ /skip/ — select tests whose own names (but not group names) do not contain the word skip
  • $(NF-1) ~ /QuickCheck/ — select tests whose immediate parent group name contains QuickCheck

The list of all supported functions and operators can be found in the README.

As a shortcut, if the -p/--pattern argument consists of letters, digits, and characters, it is matched against the full test path, so -p foo is equivalent to -p /foo/.

The subset of AWK recognized by tasty contains only expressions (no statements like loops or function definitions), no assignment operators, and no variables except NF. Other than that, the most salient deviation is that pattern matching (as in $3 ~ /foo/) does not use regular expressions, for the reasons stated above. Instead, a pattern match means a simple substring search — an idea suggested by Levent Erkök. So /foo/ in tasty means exactly the same as in AWK, while AWK’s /foo+/ cannot be expressed.

This allowed me to drop regex-tdfa as a dependency and significantly speed up the compilation time. An installation of tasty-1.0 (the new major release featuring AWK patterns) from scratch (a fresh cabal sandbox) takes 24 seconds on my laptop2, while an installation of tasty- (the previous version, which depends on regex-tdfa) takes 2 minutes 43 seconds.

The performance improved, too. I tried Levent’s example, which runs 30k dummy tests (j @?= j). When run with --quiet (so no time is wasted on output), tasty- takes 0.3 seconds to run all tests and 0.6 seconds to run a single test selected by a pattern (-p 9_2729). The new tasty-1.0 takes the same time without a pattern, and less than 0.1 seconds with a pattern (also -p 9_2729, which is equivalent to $0 ~ /9_2729/). The overhead of pattern matching, although it was pretty low already (0.3 seconds per 30k tests), became much smaller — so that it is now outweighed by the benefit of running fewer dummy tests. I haven’t done any performance optimization at all3, so I don’t even know where the speedup came from, exactly.

Earlier I said that I dropped regex-tdfa as a dependency for tasty and that regex-tdfa in turn depended on parsec; but didn’t I have to retain parsec or a similar library to parse the AWK syntax? No! We already have a perfectly fine parser combinators module in the base library, Text.ParserCombinators.ReadP. Its original purpose was to back the standard Read instances, but there is no reason it can’t be used for something more fun.

I did borrow one small module from megaparsec for parsing expressions (Text.Megaparsec.Expr), which I adapted to work with ReadP and to parse ternary operators. The expression parser originally comes from parsec, but Mark Karpov did a great job refactoring it, so I recommend you read Mark’s version instead. The expression parser is an ingenious algorithm deserving a separate blog post.

Enjoy the new tasty patterns!

<section class="footnotes">
  1. Actually, unlike in Perl, in AWK $0 is an expression: the operator $ applied to the number 0. Instead of $0 you could write $(1-1). The most practically relevant implication is that you can write $NF for the last field in the record, $(NF-1) for the second to last field and so on.

  2. Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz, 2 cores (4 virtual), SSD.

  3. I had an idea to do partial evaluation of the expression, so that the condition $2 == "X" would prune the whole test subgroup with name $2 == "Y" without having to consider each test individually. But after doing the measurement, I saw that matching is already fast enough. and the extra complexity is not justified.


January 08, 2018 08:00 PM

Functional Jobs

Software Developer - all levels at S&P Global (Full-time)

S&P Global is hiring functional programmers at various levels and locations (New York, Boston, Denver, Delhi)

We are the Analytics Product Development Team, a relatively small group of developers focused on building next generation financial models and applications that leverage S&P's world-class data sets. Last year we launched a re-imagined Portfolio Analytics product that helps investment managers of all types measure the efficacy of their investment strategies and identify areas of risk in their equity portfolios. Put your FP skills to use as we move on to multiple asset classes, intraday analytics, and strategy modeling.

Functional Programming has a relatively long history here at S&P Global. We build our back-end data calculation engine using purely-functional Scala in 2008 and have been building new models and expanding it ever since. We created Ermine, a Haskell-like language featuring row types that can run on the JVM. Ermine drives our templating, reporting, and database access, adding type-safety to user-generated layouts. The new Portfolio Analytics is a single page web application that makes extensive use of PureScript. All of this co-exists in a diverse tech ecosystem that includes the JVM, .NET, and SQL Server.

We have a few open positions, so we are looking for developers with varying levels of experience. Ideal candidates have FP experience, but we'd still like to talk with you if you are just getting started with FP and are looking to grow.

Sponsorship is not available for these positions.

S&P Global is an equal opportunity employer committed to making all employment decisions on the basis of merit, capability and equality of opportunity, and without regard to race/ethnicity, gender, pregnancy, gender identity or expression, color, creed, religion, national origin, age, disability, marital status (including domestic partnerships and civil unions), sexual orientation, military veteran status, unemployment status, or any other basis prohibited by federal, state or local law, or any other characteristic that has no bearing on a person’s ability to perform his or her job.

Only electronic job submissions will be considered for employment. If you need an accommodation during the application process due to a disability, please send an email to: EEO [dot] Compliance [at] spglobal [dot] com and your request will be forwarded to the appropriate person

Get information on how to apply for this position.

January 08, 2018 04:26 PM

January 06, 2018

Comonad Reader

The State Comonad

Is State a Comonad?

Not Costate or rather, Store as we tend to call it today, but actually State s itself?

Let's see!

Recently there was a post to reddit in which the author King_of_the_Homeless suggested that he might have a Monad for Store. Moreover, it is one that is compatible with the existing Applicative and ComonadApply instances. My knee-jerk reaction was to disbelieve the result, but I'm glad I stuck with playing with it over the last day or so.

In a much older post, I showed how to use the Co comonad-to-monad-transformer to convert Store s into State s, but this is a different beast, it is a monad directly on Store s.

{-# language DeriveFunctor #-}
import Control.Comonad
import Data.Semigroup
data Store s a = Store
  { peek :: s -> a, pos :: s }
  deriving Functor
instance Comonad (Store s) where
  extract (Store f s) = f s
  duplicate (Store f s) = Store (Store f) s
 (Semigroup s, Monoid s) =>
 Applicative (Store s) where
  pure a = Store (const a) mempty
  Store f s < *> Store g t = Store (\m -> f m (g m)) (mappend s t)
 Semigroup s =>
 ComonadApply (Store s) where
  Store f s < @> Store g t = Store (\m -> f m (g m)) (s <> t)
 (Semigroup s, Monoid s) =>
 Monad (Store s) where
  return = pure
  m >>= k = Store
    (\s -> peek (k (peek m s)) s)
    (pos m `mappend` pos (k (peek m mempty)))

My apologies for the Semigroup vs. Monoid, noise, as I'm still using GHC 8.2 locally. This will get a bit cleaner in a couple of months.

Also, peek here is flipped relative to the version in Control.Comonad.Store.Class so that I can use it directly as a field accessor.

As I noted, at first I was hesitant to believe it could work, but then I realized I'd already implemented something like this for a special case of Store, in the 'streams' library, which got me curious. Upon reflection, this feels like the usual Store comonad is using the ability to distribute (->) e out or (,) e in using a "comonoid", which is always present in Haskell, just like how the State monad does. But the type above seems to indicate we can go the opposite direction with a monoid.

So, in the interest of exploring duality, let's see if we can build a comonad instance for `State s`!

Writing down the definition for state:

newtype State s a = State
  { runState :: s -> (a, s) }
  deriving Functor
instance Applicative (State s) where
  pure a = State $ \s -> (a, s)
  State mf < *> State ma = State $ \s -> case mf s of
    (f, s') -> case ma s' of
      (a, s'') -> (f a, s'')
instance Monad (State s) where
  return = pure
  State m >>= k = State $ \s -> case m s of
    (a, s') -> runState (k a) s'

Given a monoid for out state, extraction is pretty obvious:

 Monoid s =>
 Comonad (State s) where
  extract m = fst $ runState m mempty

But the first stab we might take at how to duplicate, doesn't work.

  duplicate m = State $ \s ->
   (State $ \t -> runState m (mappend s t)
   , s

It passes the `extract . duplicate = id` law easily:

extract (duplicate m)
= extract $ State $ \s -> (State $ \t -> runState m (mappend s t), s)
= State $ \t -> runState m (mappend s mempty)
= State $ \t -> runState m t
= State $ runState m
= m

But fails the second law:

fmap extract (duplicate m)
= fmap extract $ State $ \s -> (State $ \t -> runState m (mappend s t), s)
= State $ \s -> (extract $ State $ \t -> runState m (mappend s t), s)
= State $ \s -> (fst $ runState m (mappend s mempty), s)
= State $ \s -> (evalState m s, s)

because it discards the changes in the state.

But the King_of_the_Homeless's trick from that post (and the Store code above) can be modified to this case. All we need to do is ensure that we modify the output state 's' as if we'd performed the action unmolested by the inner monoidal state that we can't see.

  duplicate m = State $ \s ->
    ( State $ \t -> runState m (mappend s t)
    , snd $ runState m s


extract (duplicate m)
= extract $ State $ \s -> (State $ \t -> runState m (mappend s t), snd $ runState m s)
= State $ \t -> runState m (mappend mempty t)
= State $ \t -> runState m t
= State $ runState m
= m

just like before, but the inner extraction case now works out and performs the state modification as expected:

fmap extract (duplicate m)
= fmap extract $ State $ \s -> (State $ \t -> runState m (mappend s t), snd $ runState m s)
= State $ \s -> (extract $ State $ \t -> runState m (mappend s t), snd $ runState m s)
= State $ \s -> (fst $ runState m (mappend s mempty), snd $ runState m s)
= State $ \s -> (fst $ runState m s, snd $ runState m s)
= State $ \s -> runState m s
= State $ runState m
= m

This is still kind of a weird beast as it performs the state action twice with different states, but it does pass at least the left and right unit laws.

Some questions:

1. Proving associativity is left as an exercise. It passes visual inspection and my gut feeling, but I haven't bothered to do all the plumbing to check it out. I've been wrong enough before, it'd be nice to check!

[Edit: Simon Marechal (bartavelle) has a coq proof of the associativity and other axioms.]

2. Does this pass the ComonadApply laws?

instance Monoid s => ComonadApply (State s) where
  (< @>) = (< *>)

[Edit: No.]

3. The streams code above suggests at least one kind of use-case, something like merging together changes of position in a stream, analogous to the "zipping monad" you have on infinite streams. But now the positions aren't just Integers, they are arbitrary values taken from any monoid you want. What other kind of spaces might we want to "zip" in this manner?

4. Is there an analogous construction possible for an "update monad" or "coupdate comonad"? Does it require a monoid that acts on a monoid rather than arbitrary state like the semi-direct product of monoids or a "twisted functor"? Is the result if it exists a twisted functor?

5. Does `Co` translate from this comonad to the monad on `Store`?

6. Is there a (co)monad transformer version of these?

Link: Gist

by Edward Kmett at January 06, 2018 08:00 PM

Russell O'Connor

Verifying Bech32 Checksums with Pen and Paper

Today we are going to learn how to verify a Bech32 checksum using only pen and paper. This is useful in those cases where you need to urgently validate the checksum of a Bech32 address, but the electricity has gone out in your home or office and you have lost your smartphone.

We are going to do a worked example of verifying the checksum of BC1SW50QA3JX3S, which is one of the test vectors from the Bech32 specification. However, before we begin, we need to make some preparations. We will need three tables.

The table of power
a xe86fe
c wt5v4t
d 4vljgv
e ukpcrk
f 0reszr
g a7vy57
h k5glc5
j 7xmfyx
k yfatwf
l t2ymv2
m 39zex9
n vmwajm
p ja45ka
q qqqqqq
r lwk4nw
s n4cgp4
t zs638s
u 5yjwly
v 832x73
w 2zf8mz
x hu9r0u
y 60xz20
z dnrp9n
0 clundl
2 sd093d
3 pgduhg
4 m8t7a8
5 f672t6
6 rchdsc
7 eh306h
8 9pshep
9 gjnkuj

The matrix of wisdom
acde fghj klmn pqrs tuvw xyz0 2345 6789
a q9sy 5420 tzxw ua7d kp3n melj hvgf 8r6c
c 9q4p 3s02 w8rt ecmg ny5k 7u6h jfdv zxla
d s4q5 y96l mjk7 vdwa x3pr tf0z 8uce hn2g
e yp5q s3wt 0xz2 ce6f j94h lamk ngvd r87u
f 53ys qp7m lkj6 gf2e z498 0dtx rcua nhwv
g 4s93 pql6 7hnm fgtc r5yx wv28 zeau jk0d
h 206w 7lq9 pgvy kh58 utme 3n4c axzr dfsj
j 02lt m69q ydfp nj3z ew7u 5ksa cr8x gv4h
k twm0 l7py qfd9 hk4x a26c sj5e u8rz vg3n
l z8jx khgd fqyv 7lu0 5rn3 emas 4w2t 9pc6
m xrkz jnvf dyqg 6mct s8h4 ale5 32w0 p9u7
n wt72 6myp 9vgq jnsr c0la 4h3u ezx8 fd5k
p uevc gfkn h76j qpz3 2ad0 89rw ts54 mlxy
q acde fghj klmn pqrs tuvw xyz0 2345 6789
r 7mw6 2t53 4ucs zrqn gl0d 98pv fjkh eayx
s dgaf ec8z x0tr 3snq mvu7 k5jl 6p9y 2wh4
t knxj zrue a5sc 2tgm qh89 d0fy p67l 34vw
u py39 45tw 2r80 aulv hqsj 6c7n kdfg xzme
v 35p4 9ym7 6nhl dv0u 8sqz 2gwr xaec kjtf
w nkrh 8xeu c34a 0wd7 9jzq g2vp ylm6 5sft
x m7tl 0w35 sea4 8x9k d62g qzyf vhnj ucpr
y eufa dvnk jmlh 9y85 0cg2 zqxt w43s 76rp
z l60m t24s 5ae3 rzpj f7wv yxqd gnhk cu98
0 jhzk x8ca es5u w0vl ynrp ftdq 976m 43g2
2 hj8n rzac u43e t2f6 pkxy vwg9 qml7 s5d0
3 vfug cexr 8w2z s3jp 6dal h4n7 mqy9 t0k5
4 gdcv uaz8 r2wx 54k9 7fem n3h6 lyqp 0tjs
5 fved aurx zt08 45hy lgc6 jskm 79pq w2n3
6 8zhr njdg v9pf m6e2 3xk5 u7c4 st0w qyal
7 rxn8 hkfv gp9d l7aw 4zjs c6u3 50t2 yqem
8 6l27 w0s4 3cu5 x8yh vmtf pr9g dkjn aeqz
9 cagu vdjh n67k y9x4 weft rp82 05s3 lmzq

The list of courage
bc1 rzqrrp
tb1 z5qrrp

Print out these tables and keep them with your emergency supplies so that you can find them when you need them.

Now we can begin. Split the message BC1SW50QA3JX3S into its prefix, BC1, and suffix SW50QA3JX3S. Take the suffix and write it vertically on a piece of paper, leaving a gap after each letter and then a line.













Find the prefix in the list of courage and write the associated word after the first letter, S, placing a diagonal line between each letter. For new prefixes, you may need to add them to the list of courage beforehand.



Take the last letter, which is p, and look it up in the table of power to find its associated word, which is ja45ka. Write this word in the gap under the Srzqrrp, extending the diagonal lines between each letter.


For each letter of this power word, we need to use the matrix of wisdom to add it to the letter above and to the left of it. For example, we look up row S and column j in the matrix of wisdom and we find the letter z. Write z after the W below the line, separating it with a diagonal line again.


We look up row r and column a in the matrix of wisom to find the number 7. We add 7 after the z, and keep doing this until every pair of letters is done. The matrix of wisdom is symmetric, so you do not have to worry about whether you are looking up by row/column or column/row.


We repeat this process with the next line. First we lookup 7 in the table of power to find eh306h and write it underneath.


Then, for each pair of letters, we add them using the matrix of wisdom.


We keep doing this until we go through all the letters of the suffix.


The final result should be pqqqqq, where q is the most powerful letter and p is the wisest letter. If you did not get this result, start over from the beginning because you probably made a mistake. Remember to mind your p's and q's.

After a couple years of practice doing this by hand, the operations become natural. For example, you learn that x and y equals z, and so forth.

Exercise for the reader: Create a variant of this procedure for computing Bech32 checksums.

P.S. This article is not meant to be taken seriously.

January 06, 2018 04:40 PM

January 05, 2018

Douglas M. Auclair (geophf)

December 2017 1HaskellADay 1Liners problems and solutions

  • December 29th, 2017:
    given f :: Monad m => n -> a -> m (Maybe b)
    define g :: Monad m => n -> a -> m (a, Maybe b)
    using f and ... arrows? Kleisli category?
    • Bazzargh @bazzargh (\n a->liftM ((,) a) (f n a)) ... according to, that's `liftM2 fmap (,) . f` but I can't pretend to get the transformation
  • December 29th, 2017:
    given f :: a -> b
    define g :: [a] -> [Maybe c] -> [(b, c)]

    >>> g [1,2,3] [Just 7, Nothing, Just 10]

    when f = show
    • matt @themattchan
      g = catMaybes ... zipWith (fmap . (,) . f)
      where (...) = (.).(.)
    • garrison @GarrisonLJ g a b = map (***fromJust) . filter (isJust . snd) $ zip a b
    • TJ Takei @karoyakani g = (catMaybes .) . zipWith ((<$>) . (,) . f)
  • December 29th, 2017: define f :: [(a,b)] -> ([a], [b])
    • Андреев Кирилл @nonaem00 and matt @themattchan unzip
    • Victoria C @ToriconPrime f = fmap fst &&& fmap snd
      • (in a vacuum, a more general type signature would be inferred, but the compiler limits itself as instruct)

by geophf ( at January 05, 2018 05:55 PM

January 04, 2018

FP Complete

DevOps Value: How to Measure the Success of DevOps

If you ask ten engineers how they measure the success of their DevOps strategy you are likely to get ten different answers. Success is measured differently by different people, so it’s not unusual to observe these different perspectives. What is the DevOps value proposition? This blog will explore a number of things to consider when you try to answer the question “How do YOU measure the success of DevOps?”

by Robert Bobbett ( at January 04, 2018 09:51 PM

Twan van Laarhoven

Type theory with indexed equality - the theory

In a previous post I introduced the TTIE language, along with a type checker and interpreter. My motivation for writing that (aside from it being fun!) was to explore the type system. At the time I started this project, formalizing this system as a shallow embedding in Agda was not easy. But with the addition of a rewriting mechanism, it has become much easier to use Agda without going insane from having to put substitutions everywhere. So, in this post I will formalize the TTIE type system.

This post is literate Agda, and uses my own utility library. The utility library mainly defines automatic rewrite rules like trans x (sym x) refl, which make life a bit more pleasant. All these rewrites use the standard library propositional equality , which I will call meta equality. . All these rewrites use the standard library propositional equality, which I will denote as and call meta equality.

{-# OPTIONS --rewriting #-}
module _ where
open import Util.Equality as Meta using (_∎) renaming (__ to __; refl to □; _≡⟨__ to ___; _≡⟨_⟩⁻¹_ to ___) open import Data.Product open import Data.Sum open import Data.Nat using (; zero; suc) open import Data.Vec open import Function open import Level renaming (zero to lzero; suc to lsuc)

First we postulate the existence of the interval. I will abbreviate the interval type as I.

postulate I : Set
postulate i₀ : I
postulate i₁ : I

The canonical eliminator for the interval needs equalities, to show that i₀ and i₁ are mapped to equal values. But we haven't defined those yet. However, there is one eliminator that we can define, namely into I, since values in I are always equal.

postulate icase : I  I  I  I
postulate icase-i₀ :  a b  icase a b i₀a
postulate icase-i₁ :  a b  icase a b i₁b
{-# REWRITE icase-i₀ icase-i₁ #-}

And with this icase construct, we can define conjunction, disjunction, and negation

_&&_ : I  I  I
i && j = icase i₀ j i
_||_ : I I I i || j = icase j i₁ i
inot : I I inot = icase i₁ i₀

We can define some extra computation rules based on the principle that when evaluating icase a b c, if we use the a branch then c = i₀, and similarly for b.

postulate icase-same :  (a b c : I  I) d  a i₀c i₀  b i₁c i₁
                      icase (a d) (b d) dc d
icase-const : a b icase a a ba icase-id : a icase i₀ i₁ aa icase-i₀-x : b icase i₀ b bb icase-i₁-x : b icase i₁ b bi₁ icase-x-i₀ : a icase a i₀ ai₀ icase-x-i₁ : a icase a i₁ aa
<details><summary class="comment">Show implementation</summary>icase-const a b = icase-same (const a) (const a) (const a) b □ □ icase-id a = icase-same (const i₀) (const i₁) id a □ □ icase-i₀-x b = icase-same (const i₀) id id b □ □ icase-i₁-x b = icase-same (const i₁) id (const i₁) b □ □ icase-x-i₀ a = icase-same id (const i₀) (const i₀) a □ □ icase-x-i₁ a = icase-same id (const i₁) id a □ □
{-# REWRITE icase-const #-} {-# REWRITE icase-id #-} {-# REWRITE icase-i₀-x #-} {-# REWRITE icase-i₁-x #-} {-# REWRITE icase-x-i₀ #-} {-# REWRITE icase-x-i₁ #-} </details>

The equality type

We can now define the indexed equality type

data Eq {a} (A : I  Set a) : A i₀  A i₁  Set a where
  refl :  (x : (i : I)  A i)  Eq A (x i₀) (x i₁)

For convenience we write the non-indexed object level equality as

__ :  {a} {A : Set a}  A  A  Set a
__ {A = A} x y = Eq (\_  A) x y

And now that we have equalities, we can write down the the general dependent eliminator for the interval,

postulate _^_ :  {a A x y}  Eq {a} A x y  (i : I)  A i
postulate ^-i₀   :  {a A x y} x≡y  _^_ {a} {A} {x} {y} x≡y i₀x
postulate ^-i₁   :  {a A x y} x≡y  _^_ {a} {A} {x} {y} x≡y i₁y
postulate ^-refl :  {a A} x  _^_ {a} {A} {x i₀} {x i₁} (refl x) ⟹ x
{-# REWRITE ^-i₀ ^-i₁ ^-refl #-}
infixl 6 _^_

At the same time, the _^_ operator also functions as an eliminator for Eq, projecting out the argument to refl. This also means that we have the following eta contraction rule

refl-eta :  {a A x y} (x≡y : Eq {a} A x y)  refl (\i  x≡y ^ i) ⟹ x≡y -- HIDE a
refl-eta (refl x) ={-# REWRITE refl-eta #-}

These definitions are enough to state some object level theorems, such as function extensionality

ext:  {a} {A B : Set a} {f g : A  B}  ( x  f x  g x)  f  g -- HIDE a
extf≡g = refl \i  \x  f≡g x ^ i


cong:  {a b} {A : Set a} {B : Set b} (f : A  B) {x y}  x  y  f x  f y -- HIDE a|b
congf x≡y = refl \i  f (x≡y ^ i)

and symmetry of ,

sym:  {a} {A : Set a} {x y : A}  x  y  y  x -- HIDE a
symx≡y = refl \i  x≡y ^ inot i

We can also define dependent versions of all of the above, which are the same, only with more general types. I'll leave these as an exercise for the reader.

<details><summary class="comment">spoiler</summary>sym :  {a} {A : I  Set a} {x y}  Eq A x y  Eq (A  inot) y x
sym x≡y = refl \i  x≡y ^ inot i


In general, to make full use of equalities, you would use substitution, also called transport. I will formalize this as

postulate tr :  {a} (A : I  Set a)  A i₀  A i₁ -- HIDE a

Where tr stands for transport, since we transport a value of type A i₀ along A, to a value of type A i₁. This should be possible, because there is a path between i₀ and i₁, that is, they are indistinguishable, and because functions are continuous. So A is a continuous path between A i₀ and A i₁. In a previous blog post I have used a more general cast primitive, which can be defined in terms of tr,

cast :  {a} (A : I  Set a)  (j₀ j₁ : I)  A j₀  A j₁ -- HIDE a
cast A j₀ j₁ = tr (\i  A (icase j₀ j₁ i))

And now we can define things like the usual substitution

subst :  {a b} {A : I  Set a} (B : {i : I}  A i  Set b) {x} {y}  Eq A x y  B x  B y -- HIDE a|b
subst B xy = tr (\i  B (xy ^ i))

and the J axiom

jay :  {A : Set} {x : A} (B : {y : A}  x  y  Set)  {y : A}  (x≡y : x  y)
     B (refl (\_  x))  B x≡y
jay B xy = tr (\i  B {xy ^ i} (refl \j  xy ^ (j && i)))

Yay, jay!

Evaluating transport

To be useful as a theory of computation, all primitives in our theory should reduce. In particular, we need to know how to evaluate tr, at least when it is applied to arguments without free variables. We do this by pattern matching on the first argument of tr, and defining transport for each type constructor.

The simplest case is if the type being transported along doesn't depend on the index at all

postulate tr-const :  {a} {A : Set a} {x}  tr (\_  A) xx -- HIDE a
{-# REWRITE tr-const #-}

Much more interesting is the case when the type is a function type. To cast function types, we first transport the argument 'back', apply the function, and then transport the result forward. First look at the non-dependent case, i.e. going from A i₀ B i₀ to A i₁ B i₁:

postulate tr-arrow :  {a b} {A : I  Set a} {B : I  Set b} {f} -- HIDE a|b
                    tr (\i  A i  B i) f
                   ⟹ (\x  tr B (f (cast A i₁ i₀ x)))

The dependent case is a bit more complicated, since the type of the result depends on the transported argument. The result of the function has type B i₀ (cast A i₁ i₀ x), and we have to transport this to B i₁ x. So as we go from i₀ to i₁, we want to "undo" the cast operation. We can do this by changing both i₀'s to i₁'s, to get a value of the type B i₁ (cast A i₁ i₁ x). Because cast A i₁ i₁ xx by icase-const and tr-const, this is equivalent to B i₁ x.

postulate tr-pi :  {a b} {A : I  Set a} {B : (i : I)  (A i)  Set b} {f} -- HIDE a|b
                 tr (\i  (x : A i)  B i x) f
                ⟹ (\x  tr (\i  B i (cast A i₁ i x)) (f (cast A i₁ i₀ x)))

Besides function/pi types, there are also product/sigma types. The idea here is similar: transport both parts of the pair independently. Again, the type of the second part can depend on the transported first part,

postulate tr-sigma :  {a b} {A : I  Set a} {B : (i : I)  A i  Set b} {x y} -- HIDE a|b
                       tr (\i  Σ (A i) (B i)) (x , y)
                      ⟹ (tr A x , tr (\i  B i (cast A i₀ i x)) y)

Finally, let's look at sum types, for which we use simple recursion,

postulate tr-sum₁ :  {a b} {A : I  Set a} {B : I  Set b} {x} -- HIDE a|b
                   tr (\i  A i  B i) (inj₁ x) ⟹ inj₁ (tr A x)
postulate tr-sum₂ :  {a b} {A : I  Set a} {B : I  Set b} {x} -- HIDE a|b
                   tr (\i  A i  B i) (inj₂ x) ⟹ inj₂ (tr B x)

Transport for equality types

The final type constructors in our language are equality types, and this is where things get more hairy. The idea is that a type like Eq A x y behaves like A in many respects. Its values will just be wrapped in a refl constructor.

Consider the case of equalities over (dependent) function types. The evaluation rule could look like

postulate tr-eq-pi
           :  {a b} {A : I  I  Set a} -- HIDE a|b
               {B :  i j  A i j  Set b} -- HIDE a|b
               {u :  i  (x : A i i₀)  B i i₀ x}
               {v :  i  (x : A i i₁)  B i i₁ x}
               {f₀ : Eq (\j  (x : A i₀ j)  B i₀ j x) (u i₀) (v i₀)}
            tr (\i  Eq (\j  (x : A i j)  B i j x) (u i) (v i)) f₀refl \j  \x 
             let x' = \i' j'  tr (\i  A (icase i₁ i' i) (icase j j' i)) x in
             (tr (\i  Eq (\j'  B i j' (x' i j')) (u i (x' i i₀)) (v i (x' i i₁)))
                 (refl \j'  (f₀ ^ j') (x' i₀ j'))) ^ j

Of course the A in Eq A x y could again be an equality type, and we would have to repeat the construction. To do this systematically, I start by collecting all the 'sides' of the equality type recursively. For example the sides of Eq (\i Eq (\j _) x y) u v) are eq (\i eq (\j done) x y) u v,

  data Sides {a} :  n (A : Vec I n  Set a)  Set (lsuc a) where
    done :  {A}  Sides zero A
    eq   :  {n A}
          (sides : (i : I)  Sides n (\is  A (i  is)))
          Eqs (sides i₀)
          Eqs (sides i₁)
          Sides (suc n) A
Eqs : {a n A} Sides {a} n A Set a Eqs {A = A} done = A [] Eqs {A = A} (eq sides x y) = Eq (\i Eqs (sides i)) x y

Since I A are the continuous functions out of the 1-dimensional interval, you can think of a Vec I n A as a continuous function out of the n-dimensional hypercube. So in geometric terms, we can draw such a function as assigning a value to all elements of the hypercube. Similarly, you can think of Sides {n = n} as a function out of the n-dimensional hypercube with the central cell removed, and Eqs as filling in that central cell.

Eqs 0 Sides 1 Eqs 1 Vec I 1 A Sides 2 Eqs 2 Vec I 2 A

I will spare you the details, see the source code of this post if you are interested. Suffice to say, that if we generalize _^_, icase, etc. from I to Vec I n and from Eq to Eqs, then we can generalize tr-eq-pi to arbitrarily deep Eqs.

tr-eqs-pi-rhs :  {a b n} {A : I  Vec I n  Set a} -- HIDE a|b
                {B : (i : I)  (is : Vec I n)  A i is  Set b} -- HIDE a|b
               (sides : (i : I)  Sides n (\js  (x : A i js)  B i js x))
               Eqs (sides i₀)
               Eqs (sides i₁)
postulate tr-eqs-pi :  {a b n}
                        {A : I  Vec I n  Set a}
                        {B : (i : I)  (is : Vec I n)  A i is  Set b}
                        (sides : (i : I)  Sides n (\js  (x : A i js)  B i js x))
                        (f₀ : Eqs (sides i₀))
                     tr (Eqs  sides) f₀tr-eqs-pi-rhs sides f₀

You can do a similar thing for sigma types, except that the types get even messier there because we need a dependently typed map function for Eqs and Sides.

This is the evaluation strategy implemented in the current TTIE interpreter. But it has two issues: 1) it is error prone and ugly 2) we still haven't defined tr (Eq Set u v)

What remains is to define tr (Eq Set u v).

A note about transitivity

Note that transitivity can be defined by transporting along an equality,

trans:  {a} {A : Set a} {x y z : A}  x  y  y  z  x  z -- HIDE a
trans′ {y = y} x≡y y≡z = tr (\i  (x≡y ^ inot i)  (y≡z ^ i)) (refl \_  y)

There are several ways to generalize this to dependent types. I'll use a variant that is explicit about the type

trans :  {a} (A : I  I  Set a) {x y z} -- HIDE a
       Eq (\i  A i₀ i) x y
       Eq (\i  A i i₁) y z
       Eq (\i  A i i) x z
trans A {y = y} x≡y y≡z = tr (\i  Eq (\j  A (icase i₀ i j) (icase (inot i) i₁ j)) (x≡y ^ inot i) (y≡z ^ i)) (refl \_  y)

Just as transitivity can be defined in terms of tr, the converse is also true. Instead of specifying transport for nested equality types, we could define tr for Eq types in terms of transitivity and symmetry.

The most general case of such a transport is

xy = fw (\i  Eq (\j  A i j) (ux ^ i) (vy ^ i)) uv


ux : Eq (\i  A i i₀) u x
vy : Eq (\i  A i i₁) v y
uv : Eq (\j  A i₀ j) u v

which we can draw in a diagram as u : A i₀ i₀ v : A i₀ i₁ x : A i₁ i₀ y : A i₁ i₁ uv ux vy

If you ignore the types for now, it seems obvious that

xy = trans (trans ((sym ux) uv) vy)

So, we could take

postulate tr-eq :  {a} {A : I  I  Set a} -- HIDE a
                    (ux :  i  A i i₀)
                    (vy :  i  A i i₁)
                    (uv : Eq (A i₀) (ux i₀) (vy i₀))
                 tr (\i  Eq (A i) (ux i) (vy i)) uvtrans (\i j  A (icase i₁ i j) (icase i i j))
                    (refl (ux  inot)) (trans A uv (refl vy))

I will stick to taking tr as primitive. However, this definition will come in handy for defining transport along paths between types.

Inductive types

It is straightforward to extend the theory with inductive types and higher inductive types. Here are some concrete examples, taken from the HoTT book.

The homotopy circle

postulate Circle : Set
postulate point  : Circle
postulate loop   : Eq (\_  Circle) point point
postulate Circle-elim :  {a} {A : Circle  Set a} -- HIDE a
                       (p : A point)
                       (l : Eq (\i  A (loop ^ i)) p p)
                       (x : Circle)  A x

with the computation rules

postulate elim-point :  {a A p l}  Circle-elim {a} {A} p l pointp -- HIDE a
postulate elim-loop  :  {a A p l i}  Circle-elim {a} {A} p l (loop ^ i) ⟹ l ^ i -- HIDE a
{-# REWRITE elim-point #-}
{-# REWRITE elim-loop #-}

Technically we would also need to specify elim for transitive paths (or paths constructed with tr). First the non-dependent version,

postulate Circle-elim′-tr-eq :  {a A p l} (x y : I  Circle) xy i -- HIDE a
             Circle-elim {a} {\_  A} p l (tr (\j  x j  y j) xy ^ i) -- HIDE atr (\j  Circle-elim {a} {\_  A} p l (x j) -- HIDE a
                       Circle-elim {a} {\_  A} p l (y j)) -- HIDE a
                  (refl \k  Circle-elim {a} {\_  A} p l (xy ^ k)) ^ i -- HIDE a

To write down the dependent version, it is helpful to first define a generalized version of transport over equality types. This generalized equality transport doesn't just give the final path, but also any of the sides, depending on the argument. Fortunately, it can be defined in terms of the existing transport primitive tr.

treq :  {a} (A : I  I  Set a) -- HIDE a
      (x :  i  A i i₀) (y :  i  A i i₁) (xy : Eq (\j  A i₀ j) (x i₀) (y i₀))
      (i j : I)  A i j
treq A x y xy i j = tr (\k  Eq (A (i && k)) (x (i && k)) (y (i && k))) xy ^ j

Note that we have

treq-i-i₀ :  {a} A x y xy i  treq {a} A x y xy i i₀x i -- HIDE a
treq-i-i₁ :  {a} A x y xy i  treq {a} A x y xy i i₁y i -- HIDE a
treq-i₀-j :  {a} A x y xy j  treq {a} A x y xy i₀ jxy ^ j -- HIDE a
treq-i₁-j :  {a} A x y xy j  treq {a} A x y xy i₁ jtr (\i  Eq (A i) (x i) (y i)) xy ^ j -- HIDE a

Now the dependent version of commuting Circle-elim for transitive paths looks like this:

postulate Circle-elim-tr-eq :  {a A p l} (x y : I  Circle) xy i -- HIDE a
             Circle-elim {a} {A} p l (tr (\j  x j  y j) xy ^ i)
            ⟹ tr (\j  Eq (\k  A (treq _ x y xy j k)) 
                          (Circle-elim {a} {A} p l (x j))
                          (Circle-elim {a} {A} p l (y j)))
                 (refl \k  Circle-elim {a} {A} p l (xy ^ k)) ^ i

We also need to continue this for higher paths, but that should be straightforward, if tedious.

<details><summary class="comment">tedious next step...</summary>postulate Circle-elim-tr-eq-eq :  {a A p ll} (x y : I  I  Circle) -- HIDE a
                                   (xy₀ :  k  x k i₀  y k i₀) (xy₁ :  k  x k i₁  y k i₁)
                                   xy i j
             Circle-elim {a} {A} p ll (tr (\k  Eq (\l  x k l  y k l) (xy₀ k) (xy₁ k)) xy ^ i ^ j)
            ⟹ tr (\k  Eq (\l  Eq (\m  A (tr (\k'  Eq (\l'  x (k && k') l'  y (k && k') l')
                                                         (xy₀ (k && k'))
                                                         (xy₁ (k && k'))) xy ^ l ^ m) )
                                   (Circle-elim {a} {A} p ll (x k l))
                                   (Circle-elim {a} {A} p ll (y k l)))
                          (refl \l  Circle-elim {a} {A} p ll (xy₀ k ^ l))
                          (refl \l  Circle-elim {a} {A} p ll (xy₁ k ^ l)))
                 (refl \k  refl \l  Circle-elim {a} {A} p ll (xy ^ k ^ l)) ^ i ^ j


postulate Truncate : Set  Set
postulate box  :  {A}  A  Truncate A
postulate same :  {A} x y  Eq (\_  Truncate A) x y
module _ {p} {A} {P : Truncate A Set p} -- HIDE p (b : (x : A) P (box x)) (s : {x y} (px : P x) (py : P y) Eq (\i P (same x y ^ i)) px py) where
postulate Truncate-elim : (x : Truncate A) P x
postulate elim-box : x Truncate-elim (box x) ⟹ b x postulate elim-same : x y i Truncate-elim (same x y ^ i) ⟹ s (Truncate-elim x) (Truncate-elim y) ^ i

Notice that in the eliminator for every path constructor, we expect an argument of type P "along that path constructor".

Quotient types

postulate _/_      : (A : Set)  (R : A  A  Set)  Set
postulate quot     :  {A R}  A  A / R
postulate eqn      :  {A R}  (x y : A)  R x y  Eq (\_  A / R) (quot x) (quot y)
postulate truncate :  {A R}  (x y : A / R)  (r s : Eq (\_  A / R) x y)  r  s
module _ {A R} {P : A / R Set} (q : (x : A) P (quot x)) (e : {x y} (r : R x y) Eq (\i P (eqn x y r ^ i)) (q x) (q y)) (t : {x y r s} (px : P x) (py : P y) (pr : Eq (\i P (r ^ i)) px py) (ps : Eq (\i P (s ^ i)) px py) Eq (\i Eq (\j P (truncate x y r s ^ i ^ j)) px py) pr ps) where
postulate /-elim : (x : A / R) P x
postulate elim-quot : x /-elim (quot x) ⟹ q x postulate elim-eqn : x y r i /-elim (eqn x y r ^ i) ⟹ e r ^ i postulate elim-truncate : x y r s i j /-elim (truncate x y r s ^ i ^ j) ⟹ t (/-elim x) (/-elim y) (refl \k /-elim (r ^ k)) (refl \k /-elim (s ^ k)) ^ i ^ j

Indexed types

One caveat to the support of inductive types are indexed types. These are the types with parameters whose value can depend on the constructor, written after the colon in Agda. An obvious example is the standard inductive equality type as it is defined in the standard library,

data __ {A : Set} (x : A) : A  Set where
  refl : xx

Another example are length indexed vectors,

data Vec (A : Set) :   Set where
  [] : Vec A zero
  __ :  {n}  A  Vec A n  Vec A (suc n)

Such inductive types introduce a new kind of equality, and we can't have that in TTIE.

Fortunately, outlawing such definitions is not a big limitation, since any indexed type can be rewritten to a normal inductive type by making the equalities explicit. For example

data Vec (A : Set) (n : ) : Set where
  [] : n  zero  Vec A n
  __ :  {m}  A  Vec A m  n  suc m  Vec A n


The final ingredient to turn TTIE into a homotopy type theory is the univalence axiom. A univalence primitive might look like this:

postulate univalence :  {a} {A B : Set a} -- HIDE a
                      (f : A  B)
                      (g : B  A)
                      (gf :  x  g (f x)  x)
                      (fg :  x  f (g x)  x)
                      (fgf :  x  congf (gf x)  fg (f x))
                      Eq (\_  Set a) A B -- HIDE a

By using an equality constructed with univalence in a transport, you can recover the forward and backward functions,

fw :  {a} {A B : Set a}  A  B  A  B -- HIDE a
fw A≡B = tr (_^_ A≡B)
bw : {a} {A B : Set a} A B B A -- HIDE a bw A≡B = tr (_^_ A≡B inot)

as well as the proofs of left and right-inverse,

bw∘fw :  {a} {A B : Set a}  (A≡B : A  B)   x  bw A≡B (fw A≡B x)  x -- HIDE a
bw∘fw A≡B x = refl \j  tr (\i  A≡B ^ icase (inot j) i₀ i)
                       (tr (\i  A≡B ^ icase i₀ (inot j) i) x)
fw∘bw : {a} {A B : Set a} (A≡B : A B) x fw A≡B (bw A≡B x) x -- HIDE a fw∘bw A≡B x = refl \j tr (\i A≡B ^ icase j i₁ i) (tr (\i A≡B ^ icase i₁ j i) x)

Here the trick is that when j = i₁, the transports become the identity, while otherwise they become fw and bw.

Getting out the adjunction fgf is a bit harder. You need to come up with an expression that reduces to f (gf x ^ k) when j = i₀ and that reduces to (fg (f x) ^ k) when j = i₁. The following does the trick

not-quite-fw∘bw∘fw :  {a} {A B : Set a}  (A≡B : A  B)   x -- HIDE a
                    cong′ (fw A≡B) (bw∘fw A≡B x)  fw∘bw A≡B (fw A≡B x)
not-quite-fw∘bw∘fw A≡B x = refl \j 
  refl \k  tr (\i  A≡B ^ icase                          (icase i₀ k j) i₁ i)
          $ tr (\i  A≡B ^ icase    (icase (inot k) i₁ j) (icase i₀ k j)    i)
          $ tr (\i  A≡B ^ icase i₀ (icase (inot k) i₁ j)                   i) x)

but the type is not right. We want an equality between two equalities, both of type fw (bw (fw x)) x. But instead we get a dependent equality type that mirrors the body of the definition.

To resolve this, we need to add another reduction rule to the language, which states that if you transport from i₀ to i and then to i₁, this is the same as going directly from i₀ to i₁. This should hold regardless of what i is.

postulate tr-tr :  {a} (A : I  Set a) i x  tr (A  icase i i₁) (tr (A  icase i₀ i) x) ⟹ tr A x -- HIDE a
postulate tr-tr-i₀ :  {a} A x  tr-tr {a} A i₀ x ⟹ □ -- HIDE a
postulate tr-tr-i₁ :  {a} A x  tr-tr {a} A i₁ x ⟹ □ -- HIDE a
{-# REWRITE tr-tr-i₀ tr-tr-i₁ #-}
fw∘bw∘fw :  {a} {A B : Set a}  (A≡B : A  B)   x 
          cong′ (fw A≡B) (bw∘fw A≡B x)  fw∘bw A≡B (fw A≡B x)
fw∘bw∘fw A≡B x = 
<details><summary class="comment">-- same as above, with ugly rewriting details...</summary>  Meta.subst id (cong-Eq
    (ext \j  cong-Eq □ □ (tr-tr (\i  A≡B ^ i) (j) x)) □ □)
    (refl \j  refl \k
           tr (\i  A≡B ^ icase                          (icase i₀ k j) i₁ i)
          $ tr (\i  A≡B ^ icase    (icase (inot k) i₁ j) (icase i₀ k j)    i)
          $ tr (\i  A≡B ^ icase i₀ (icase (inot k) i₁ j)                   i) x)

Computation rules

The computation rules are now obvious: when fw, bw, etc. are applied to a univalence primitive, return the appropriate field.

module _ {a} {A B} f g gf fg fgf (let AB = univalence {a} {A} {B} f g gf fg fgf) where -- HIDE a
  postulate tr-univalence-f :  x  tr (\i  AB ^ i) xf x
  postulate tr-univalence-g :  x  tr (\i  AB ^ inot i) xg x
  {-# REWRITE tr-univalence-f #-}
  {-# REWRITE tr-univalence-g #-}
postulate tr-univalence-gf : x j tr (\i AB ^ icase j i₀ i) (tr (\i AB ^ icase i₀ j i) x) ⟹ gf x ^ inot j postulate tr-univalence-fg : x j tr (\i AB ^ icase j i₁ i) (tr (\i AB ^ icase i₁ j i) x) ⟹ fg x ^ j {-# REWRITE tr-univalence-gf #-} {-# REWRITE tr-univalence-fg #-} -- tr-univalence-fgf ommitted

Ideally, we would be able to compute tr for AB ^ f i for any function f, and even

tr (\i  AB ^ f1 i)  tr (\i  AB ^ fn i)

But we quickly run into problems. Consider

  problem : I  I  A  B
  problem j k = tr (\i  AB ^ icase k i₁ i)
               tr (\i  AB ^ icase j k i)
               tr (\i  AB ^ icase i₀ j i)

When j=i₁, this reduces to

problem i₁ k = fg ^ k  f

and when k=i₀, it reduces to

problem j i₀ = f  gf ^ j

These two types look a lot like the adjunction fgf, but there are two differences: 1. For the two reductions of problem to be confluent, the two right hand sides should be equal in the meta language (judgementally equal). But an adjunction inside the theory doesn't guarantee this.

2. Even when using fgf, we can not get an expression for problem with the right reductions. The issue is that depending on j and k, problem can represent any of the following compositions

problem i₀ i₀ = f   id  id
problem i₀ i₁ = id  f   id
problem i₁ i₀ = f   g   f
problem i₁ i₁ = id  id  f

Transporting univalent paths

Finally, we also need to decide how to transport along equality types involving univalence. As I showed previously, transporting along equalities can be defined in terms of transitivity. So that is what we will do here. The idea is that to transport along trans AB BC, you first transport along AB, and then along BC. The same goes for other directions of using this transitive path (bw, fw∘bw, etc.)

module _ {a} {A B C : Set a} (A≡B : A  B) (B≡C : B  C) where
  trans-f : A  C
  trans-f = fw B≡C  fw A≡B
trans-g : C A trans-g = bw A≡B bw B≡C
trans-gf : x trans-g (trans-f x) x trans-gf x = cong′ (bw A≡B) (bw∘fw B≡C (fw A≡B x)) trans bw∘fw A≡B x
trans-fg : x trans-f (trans-g x) x trans-fg x = cong′ (fw B≡C) (fw∘bw A≡B (bw B≡C x)) trans fw∘bw B≡C x
postulate trans-fgf : x congtrans-f (trans-gf x) trans-fg (trans-f x) -- trans-fgf should be provable, but proof is omitted here
trans-equivalence : A C trans-equivalence = univalence trans-f trans-g trans-gf trans-fg trans-fgf

And we use this transitivity to define transport,

postulate tr-eq-Set :  {a} (A B : I  Set a) (A₀≡B₀ : A i₀  B i₀)
                     tr (\i  Eq (\_  Set a) (A i) (B i)) A₀≡B₀trans-equivalence (refl (A  inot)) (trans-equivalence A₀≡B₀ (refl B))
-- spacial case for fw tr-tr-eq-Set : {a} (A B : I Set a) (A₀≡B₀ : A i₀ B i₀) x tr (\j tr (\i Eq (\_ Set a) (A i) (B i)) A₀≡B₀ ^ j) xtr B (tr (_^_ A₀≡B₀) (tr (A inot) x)) tr-tr-eq-Set A B A₀≡B₀ x = Meta.cong (\A₁≡B₁ tr (_^_ A₁≡B₁) x) (tr-eq-Set A B A₀≡B₀)

Note that tr-eq-Set cannot be used as a rewrite rule. Agda incorrectly complains about universe levels, and when removing those the rule is accepted, but the file takes more than 10 minutes to type check.

Reduction rules spoiled by univalence

While we are at it, it would be nice if we could add some additional judgemental equalities to the type system. For instance, trans xy (sym xy) = refl \_ x should hold for all xy.

However, we can not add this as a reduction. The reason is that for paths build with univalence, transporting along the left hand side reduces to bw∘fw, and this is not necessarily the same as reflexivity. Here is an example

-- A path that flips the interval in one direction, but not in the other
-- so fw ∘ bw ≠ refl
flip-I : I  I
flip-I = univalence id inot
  (\i  refl (icase (inot i) i))
  (\i  refl (icase (inot i) i))
  (\i  refl \_  refl (icase (inot i) i))
module _ (trans-sym : {a A x y} xy trans′ {a} {A} {x} {y} xy (sym xy) -- hide {a} ⟹ (refl \_ x)) where problem2 : i₀i₁ problem2 = Meta.begin i₀ tr-tr-eq-Set (_^_ flip-I inot) (_^_ flip-I inot) (refl \_ I) i₁ tr (\i transflip-I (sym flip-I) ^ i) i₁ Meta.cong (\AB tr (\i AB ^ i) i₁) (trans-sym flip-I) i₁

tr (\i trans flip-I (sym flip-I) ^ i) i₁ evaluates to i₁ with tr-eq-Set, since we follow the equivalence backward and then forward. But according to trans-sym it is an identity path, and so this expression evaluates to i₀. So, we have a term that can evaluate to either i₀ or to i₁, depending on the evaluation order. In other words, reduction is no longer confluent.

This might not seem too bad, since i₀ i₁ inside the theory. But note that the reduction relation is not a homotopy equality. And it might even be untyped if we were using an untyped meta-theory, like the Haskell TTIE implementation. With a non-confluent reduction relation, it is easy to break the type system,

flip-Bool : Bool  Bool
flip-Bool = univalence not not not-not not-not not-not-not
bad : i₀i₁ (Bool , false) ⟹ (_,_ {B = id} Bool true) bad x = Meta.cong {B = Σ Set id} (\i flip-Bool ^ i , tr (\j flip-Bool ^ i && j) false) x
worse : i₀i₁ worse x with bad x ... | ()

So, trans-sym is out.

Another seemingly sensible reduction is that cong f (trans xy yz) trans (cong f xy) (cong f yz). But, if we also postulate that all paths over the interval can be defined in terms of icase, we end up in the same problematic situation.

module _ (trans-cong :  {a b A B x y z} (f : A  B) xy yz -- HIDE a|b
                      congf (trans′ {a} {A} {x} {y} {z} xy yz) -- HIDE atrans′ {b} (congf xy) (congf yz)) -- HIDE b
         (tr-eq-I :  (j k : I  I) jk₀  tr (\i  Eq (\_  I) (j i) (k i)) jk₀refl (icase (j i₁) (k i₁))) where
  trans-sym :  {a A x y} xy  trans′ {a} {A} {x} {y} xy (sym xy) ⟹ (refl \_  x) -- HIDE a
  trans-sym {x = x} xy =
      transxy (sym xy)
    ⟸ trans-cong (_^_ xy) (refl id) (refl inot) 
      cong′ (\i  xy ^ i) (trans′ (refl id) (refl inot))
    ⟹ Meta.cong (cong′ (_^_ xy)) (tr-eq-I inot inot (refl \_  i₁)) 
      refl (\_  x)
problem3 : i₀i₁ problem3 = problem2 trans-sym

I don't have any solution to these problems, aside from not adding the problematic reductions.

Reductions that do seem fine are those involving only a single path. For instance, things like trans xy (refl \_ y) ⟹ xy.


What I have presented is the type theory with indexed equality. As mentioned before, there is also a prototype implementation in Haskell.

The theory is quite similar to the cubical system, but it is developed mostly independently.

Some area's I haven't discussed or investigated yet, and some issues with the theory are:

1. Transitive paths involving HIT path constructors are not reduced, so trans loop (sym loop) is not the same as refl \_ point, however, the two are provably equal inside the theory. As with a general trans-sym rule, adding such a reduction would break confluence.

2. I have defined a function treq that generalizes tr (Eq ..). This could be taken as a primitive instead of tr. In that case we should further generalize it to take Sides, so that it also works for higher paths.

3. It is possible to combine transports to write terms that do not reduce, for example

x : A
AB : A  B
f : A  Set
y : f (bw AB (fw AB x))
tr (\i  f (tr (\j  AB ^ icase (inot i) i₀ j)
           (tr (\j  AB ^ icase i (inot i) j)
           (tr (\j  AB ^ icase i₀ i j) x)))) y

the tr-tr rule handles one such case, but more are possible. For well-behaved equalities flatting all this out is not a problem, but with univalence the intermediate steps become important.

4. I am not entirely happy with univalence breaking confluence in combination with trans-sym. It means that you have to be really careful about what, seemingly benign, reductions are allowed.

January 04, 2018 09:18 PM

January 03, 2018

Michael Snoyman

Drop Conduit's Finalizers?

tl;dr: I'm thinking of dropping finalizers from Conduit. If you have use cases that necessitate finalizers, please share them with me. Github issue for examples.

People who have followed the history of the Conduit library will likely be aware of how passionately I've argued for the necessity of a finalizer concept. For those unaware: when you have a Conduit pipeline like the following:

runConduitRes $ do
  sourceFile input .| takeCE 100 .| sinkFile output
  -- lots of other stuff

Without finalizers, the file input will remain open until we exit the runConduitRes call (unless the file is less than 100 bytes). The reason is that, in a pipeline, downstream is always in control. When takeCE stops consuming from sourceFile, sourceFile will never be called again. Finalizers allow sourceFile to tell downstream "hey, if you don't call me back, please close up my file." And so they're a Good Thing.

They're also a Complicated Thing. The internals of Conduit include a lot of complexity around finalizers. As a user, you hardly ever see it, which is by design. But if you start dealing with the guts of Conduit, it's there. (Have a look at this commit to get an idea.)

They also complicate the story around associativity of Conduit significantly. Or more to the point: without finalizers, we can get properly behaving associative and identity laws*. With finalizers, we have to start making up claims about reordering of finalizers. Which is almost always fine in practice, but clearly not a mathematical law.

* Caveats: I've never written the proof of it completely. Also, it relies on using the type paramter on the Pipe type to eliminate leftovers, but leftovers are not a topic I'm raising right now.

None of this is new information; so why am I writing this blog post now? The first is that I'm already working on a breaking change to Conduit to standardize naming and eliminate some legacy type and operator names. See the discussion and initial comparison for more details. This naturally leads me to ask related questions.

More to the point: after having worked with Conduit for years, my initial concerns about prompt finalization seem to have been overzealous. While the code above can happen, it doesn't happen that often in practice, and even that level of resource overholding isn't usually that bad. Regardless, if the situation really does call for guaranteed promptness, we can still get it (a trick I'm fairly certain I learned from Gabriel Gonzalez):

runConduitRes $ do
  withSourceFileResource input $ \src ->
    src .| takeCE 100 .| sinkFile output
  -- lots of other stuff

It's not quite as elegant, but works well enough. And again: I'm hard pressed to think of real life code I've written that would really warrant this. BIG REQUEST If you have examples of Conduit-using code where finalizers are vital, please send me examples.

While on the one hand, making this change would be to admit I've sucked up huge amounts of extra work in maintaining Conduit over the years, I'd be very happy to cut the dead weight now, unless someone out there wants to convince me otherwise. Feel free to discuss wherever desired, but the main discussion I'll be sure to follow will be conduit issue #343.

There are some natural follow-on questions that come from this also, which I'll likely broach in a later blog post. To give a taste and hopefully encourage some thoughts from others:

  • Should the Pipe types upstream finalizer concept disappear as well?
  • Can we remove the Leftover data constructor from Pipe, making Pipe a full category, and then move the tracking of leftovers to ConduitM and its codensity approach?

January 03, 2018 01:57 PM

January 02, 2018

Douglas M. Auclair (geophf)

December 2017 1HaskellADay problems and solutions

by geophf ( at January 02, 2018 05:04 PM

Gabriel Gonzalez

Dhall - Year in review (2017-2018)

<html xmlns=""><head> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/> <meta content="text/css" http-equiv="Content-Style-Type"/> <meta content="pandoc" name="generator"/> <style type="text/css">code{white-space: pre;}</style></head><body>

The Dhall programmable configuration language is now one year old and this post will review progress over the last year and the future direction of the language in 2018.

If you're not familiar with Dhall, you might want to visit the official GitHub project which is the recommended starting point. This post assumes familiarity with the Dhall language.

Also, I want to use this post to advertise a short survey that you can take if you are interested in the language and would like to provide feedback:


Standardization is currently the highest priority for the language since Dhall cannot be feasibly be ported to other languages until the standard is complete. In the absence of the standard parallel implementations in other languages are chasing a moving and poorly-defined target.

Fortunately, standardization made good progress this year. The completed parts are:

The one missing piece is the standard semantics for import resolution. Once that is complete then alternative implementations should have a well-defined and reasonably stable foundation to build upon.

Standardizing the language should help stabilize the language and increase the barrier to making changes. The goal is that implementations of Dhall in other languages can review and approve/reject proposed changes to the standard. I've already adopted this approach myself by submitting pull requests to change the standard before changing the Haskell implementation. For example:

This process gives interested parties an official place to propose and review changes to the language.

New integrations

One of the higher priorities for the language is integrations with other languages and configuration file formats in order to promote adoption. The integrations added over the past year were:

Also, an initial Dhall binding to JavaScript was prototyped, but has not yet been polished and published:

  • JavaScript (using GHCJS)
    • Supports everything except the import system

All of these integrations have one thing in common: they all build on top of the Haskell implementation of the Dhall configuration language. This is due to necessity: the Dhall language is still in the process of being standardized so depending on the Haskell API is the only realistic way to stay up to date with the language.

Dhall implementations in other languages are incomplete and abandoned, most likely due to the absence of a stable and well-defined specification to target:

The JSON integration has been the most successful integration by far. In fact, many users have been using the JSON integration as a backdoor to integrating Dhall into their favorite programming language until a language-native Dhall integration is available.

Some users have proposed that Dhall should emphasize the JSON integration and de-emphasize language-native Dhall bindings. In other words, they believe that Dhall should primarily market itself as an alternative to Jsonnet or HCL.

So far I've struck a middle ground, which is to market language-native bindings (currently only Haskell and Nix) when they exist and recommend going through JSON as a fallback. I don't want to lean for too long on the JSON integration because:

  • Going through JSON means certain Dhall features (like sum types and functions) are lost in translation.
  • I believe Dhall's future is healthier if the language integrates with a diverse range of configuration file formats and programming languages instead of relying on the JSON integration.
  • I want to avoid a "founder effect" problem in the Dhall community where JSON-specific concerns dwarf other concerns
  • I would like Dhall to eventually be a widely supported configuration format in its own right instead of just a preprocessor for JSON

New language features

Several new language features were introduced in 2017. All of these features were added based on user feedback:

Don't expect as many new language features in 2018. Some evolution is natural in the first year based on user feedback, but I will start rejecting more feature requests unless they have a very high power-to-weight ratio. My goal for 2018 is to slow down language evolution to give alternative implementations a stable target to aim for.

The one likely exception to this rule in 2018 is a widespread user request for "type synonym" support. However, I will probably won't get around to fixing this until after the first complete draft of the language standard.


The main tooling features that landed this year were:

Tooling is one of the easier things to build for Dhall since command-line tools can easily depend on the Haskell API, which is maintained, well-documented, and up-to-date with the language.

However, excessive dependence on tooling can sometimes indicate a flaw in the language. For example, the recent constructors keyword proposal originated as something that several people were originally implementing using their own tooling and was elevated to a language feature.

Haskell-specific improvements

Dhall also provides a Haskell API that you can build upon, which also landed some notable improvements:

The language core is simple enough that some people have begun using Dhall as a starting point for their implementing their own programming languages. I personally don't have any plans to market or use Dhall for this purpose but I still attempt to support this use case as long as these requests:

  • don't require changes to language standard
  • don't significantly complicate the implementation.
  • don't deteriorate performance


The language's official documentation over the last year has been the Haskell tutorial, but I've been slowly working on porting the tutorial to language-agnostic documentation on a GitHub wiki:

The most notable addition to the wiki is a self-contained JSON-centric tutorial for getting started with Dhall since that's been the most popular integration so far.

The wiki will eventually become the recommended starting point for new users after I flesh the documentation out more.

Robustness improvements

I've also made several "invisible" improvements under the hood:

I've been slowly whittling away at type-checking bugs both based on user reports and also from reasoning about the type-checking semantics as part of the standardization process.

The next step is to use proof assistants to verify the safety of the type-checking semantics but I doubt I will do that in 2018. The current level of assurance is probably good enough for most Dhall users for now. However, anybody interested in doing formal verification can take a stab at it now that the type-checking and normalization semantics have been formalized.

I also plan to turn the test suite into a standard conformance test that other implementations can use to verify that they implemented the standard correctly.


This section covers all the miscellaneous improvements to supporting infrastructure:


I'd also like to thank everybody who helped out over the past year, including:

... and also everybody who filed issues, reported bugs, requested features, and submitted pull requests! These all help improve the language and no contribution is too small.


In 2017 the focus was on responding to user feedback as people started to use Dhall "in anger". In 2018 the initial plan is to focus on standardization, language stabilization, and creating at least one complete alternative implementation native to another programming language.

Also, don't forget to take the language survey if you have time. I will review the feedback from the survey in a separate follow-up post and update the plan for 2018 accordingly.


by Gabriel Gonzalez ( at January 02, 2018 03:26 PM

FP Complete

Weakly Typed Haskell

I was recently doing a minor cleanup of a Haskell codebase. I started off with some code that looked like this:

by Michael Snoyman ( at January 02, 2018 02:00 PM

January 01, 2018

Michael Snoyman

Review of The Bridge strength program

Last week, I completed The Bridge, an 8 week strength program by Barbell Medicine. Since this program is significantly different than what I've done in the past, and what I've talked about on this blog previously, I wanted to share my thoughts and some results.

Summary This program was more complicated to follow than others I've tried, but given that I was looking for an intermediate instead of novice program, that's not surprising. I improved my actual 1 rep max numbers on all 4 major lifts. I'm planning on continuing my training with another cycle of the program.


I've been lifting for close to 2 years now, with a few years of bodyweight training before that. In the subset of that time that I've been seriously training, I've followed these programs:

  • Start Bodyweight
  • StrongLifts 5x5
  • A few months of a Push-Pull-Leg (PPL) routine I made up, which was a mistake and gave me no progress. Pretend it didn't happen.
  • 5/3/1, with the Boring But Big (BBB) accessories

Ignoring the silly PPL, all three of these programs helped me significantly. They also all share something: they're simple to follow. First you determine where to start, which is either dictated by the program or based on current abilities. Then you follow a simple set of rules on how to progress. This simplicity is very appealing.

Start Bodyweight got me to a decent strength level, but I was unhappy with weakness in my back leading to regular pain while sitting at my desk. I switch to StrongLifts when I decided I wanted to do a barbell program with deadlifts, and (as always happens with a novice program) eventually hit a wall.

When I switched to 5/3/1, I immediately saw my estimated 1 rep max (1RM) numbers going up. In fact, they went up significantly. Unfortunately, I found that my actual 1RM numbers were not budging. It seemed that 5/3/1 was giving me an increase in muscular endurance, but not necessarily strength. While the program was fun, easy to follow, and required relatively little time in the gym, I wanted more progress.

NOTE I know there are many variations of 5/3/1 out there, and likely some of them would have served me better. I'm not comparing all potential programs in the world, just the ones I've actually pursued myself.

Overview of The Bridge

Over the past half year or so, I've been regularly exposed—via YouTube videos and articles—to the team behind Barbell Medicine, mainly Drs. Jordan Feigenbaum and Austin Baraki. When I was in the market for a new program, I heard mention of The Bridge program, and downloaded it.

The Bridge is delivered as a PDF. The program itself takes up about 5 pages in this 36 page document. The rest of the material is a bit dry to get through, but immensely useful and informative. I really appreciate the way the authors have given a background on the concepts of stress, training volume, and intensity. If you're at all interested in strength training, give it a read.

The program focuses on the main barbell lifts (squat, overhead press, bench press, deadlift) with accessories (e.g., barbell row, pin squat, paused deadlift). Unlike other programs I'd followed, this program changes from week to week. That makes it more complicated to follow, but not significantly. The real curve ball is the Rate of Perceived Exertion scale, or RPE.

RPE based training

In a program like Strong Lifts, I go into the gym on a Tuesday, and I know that I'm going to try to squat X amount of weight for Y sets of Z reps. Not so with The Bridge. Instead, you'll see something like:

Squats: 5 @ 6 RPE, 5 @ 7 RPE, 5 @ 8 RPE for 3 sets

This means that, after my warmup sets, I need to start with a set of 5 squats at an exertion level of 6. The scale goes up to 10, and each number below 10 indicates how many more reps you could have possibly done. So RPE 6 means "I could have done 4 more reps." Therefore, "5 reps at 6 RPE" means "choose a weight that you can just barely do 9 reps for, then do 5 reps at that weight."

When I first saw this, I was dumbfounded. "How do I guess the magical weight number?" And in fact, that was the most complicated part of this program for all 8 weeks. There are charts in the PDF that help you compare against your 1 rep max. But overall, it was trial and error. There were definitely sets where I lifted more than I should have, and sets where I could have added more weight.

In contrast to a simple numeric guide like Strong Lifts or 5/3/1 delivers, an RPE based scale allows you to easily adjust training intensity to account for both good and bad days in the gym. A few times during this program, I had a bad night's sleep or a bit of a cold, and lifted less weight. A few times, I was feeling great and lifted more than I would have expected. RPE allowed this to happen. By contrast, with both SL and 5/3/1, there were days where the weight felt easy, and other days when it felt crushing.

Ultimately, my conclusion to all of this was: RPE is harder than a number based scale, but gives great results. Just accept the fact that you're going to screw up regularly.

Difficulty level

The program is broken up into weeks of different stress level, either low, moderate, or high. The weeks also tend to focus on either high volume or high intensity. For example, both weeks 4 and 7 are considered high stress weeks, but compare the first day's squat programming:

  • Week 4: 5 @ RPE 6, 5 @ 7, 5 @ 8 x 4 sets
  • Week 7: 1 @ RPE 8, 3 @ 8 x 4 sets

Week 4's day 1 ends up having 30 total reps of squat, whereas week 7 has 13. However, because of how the RPE scale works, you'll end up lifting much heavier weight on week 7. For example, "5 @ RPE 8" means a weight you could have done 7 reps at. "3 @ RPE 8" means you could have done 5. You can lift more weight for 5 reps than you can for 7, and therefore week 7 ends up with a lower volume at higher intensity.

Personally, I really liked the later weeks in the program. None of the other pgorams I'd tried ever got to low volume high intensity. But having read the PDF and its motivations, I understand why we need both the volume and intensity weeks, and appreciate the way the program is designed. In fact, having completed the program, I think I understand the design of the program much better than before.


Unlike previous programs I've done, this program included cardio. I considered this a good kick in the pants to start running and the eliptical again. If you're like me: find a good audio book to listen to, otherwise it will be 30 minutes of hell :)

(Not entirely accidentally, I also signed up for an Audible account around the same time I started this program.)

Time in gym

The earlier weeks in this program have high volume, with lots of sets. I spend a long time in the gym some weeks: 3 days of lifting for about 2 hours each, and 1-2 days doing about 45 minutes of cardio. You can reduce that by having shorter rest periods or supersetting your warmups for the next lift with your previous working sets, or you can answer emails and Slack messages. (Fortunately no one can smell your sweat over email.)


Way back in May of last year, my gym had a "powerlifting competition." I put that in quotes since there were four of us, I was the only one in my weight class (everyone else weighed 20kg more than me), and some of the other competitors half repped their squats. (I'm going to pray that I actually squatted to depth, since there were no videos and I don't trust the judges.) Between that time and starting this program at the end of October, I increased my estimated 1RM numbers significantly, but to my recollection barely, if at all, bested my powerlifting meet numbers with actual weight.

There are some minor confounding factors in these increases due to my addition of chalk to assist with my deadlift, and getting more comfortable with my lifting belt (the competition was the first time I ever used a lifting belt). Nonetheless, I think a good portion of these increases can be attributed to my time on The Bridge:

  • Squat: 20% increase
  • Overhead press: 16% increase*
  • Bench press: 11% increase
  • Deadlift: 19% increase

* If you're wondering: no, the powerlifting competition did not include an overhead press. I'm including a number from around the same time.

Definitely keep in mind that I do believe my time on 5/3/1 in the interim helped at the very least with my muscular endurance, and most likely primed me to be able to hit the volume weeks of The Bridge better than I would have been able to in May. I would not expect to make those kinds of increases in just 8 weeks.


This was my first time taking a foray into an intermediate program, intended for lifters who are no longer making linear progressions with novice programs like Strong Lifts. I can finally understand why Mark Rippetoe says you want to be a novice: simple programs with great results are much more fun.

If you've got the time to spend in the gym on higher volume routines, have the patience to figure out the RPE system, want to learn more about strength programming, and are looking for a well designed intermediate program, I recommend checking out The Bridge.

For myself: when it comes to health and fitness, I consider this at least half an experiment with sample size of 1, and therefore I like to try out many different things. I'm going to keep my eyes out for a new routine to try (and if you have recommendations, let me know). But given the great progress this program helped me achieve, I'm going to continue with at least one more cycle of it before moving on to something new.

If you want me to share more experience reports like this in the future, let me know. I can also include the nutrition side of things if people are interested, which has been possibly more volatile for me over the past year than the training itself.

January 01, 2018 10:45 AM

December 30, 2017

Douglas M. Auclair (geophf)

November 2017 1HaskellADay 1Liner problem and solutions

  • November 5th, 2017: f :: Map Int [a] -> [b] - > [(Int, b)]
    for, e.g.: f mapping bs
    length bs == length (concat (Map.elems mapping))
    define f
    • Andreas Källberg @Anka213 Using parallel list comprehensions:
      f mp bs = [ (k,b) | (k,as) <-assocs mp, a <- as | b <- bs]
    • Steve Trout @strout f = zip . foldMapWithKey (fmap . const)

by geophf ( at December 30, 2017 02:16 AM

December 28, 2017

FP Complete

Parsing command line arguments

There are many ways to make programs that use settings to customise their behavior. In this post, we provide an overview of these methods and some best practices.

by Syd Kerckhove ( at December 28, 2017 01:15 PM

December 25, 2017

Robert Harper

Proofs by contradiction, versus contradiction proofs

It is well-known that constructivists renounce “proof by contradiction”, and that classicists scoff at the critique.  “Those constructivists,” the criticism goes, “want to rule out proofs by contradiction.  How absurd!  Look, Pythagoras showed that the square root of two is irrational by deriving a contradiction from the assumption that it is rational.  There is nothing wrong with this.  Ignore them!”

On examination that sort of critique fails, because a proof by contradiction is not a proof that derives a contradiction.  Pythagoras’s  proof is valid, one of the eternal gems of mathematics.  No one questions the validity of that argument, even if they question proof by contradiction.

Pythagoras’s Theorem expresses a negation: it is not the case that the square root of two can be expressed as the ratio of two integers.  Assume that it can be so represented.  A quick deduction shows that this is impossible.  So the assumption is false.  Done.  This is a direct proof of a negative assertion; it is not a “proof by contradiction”.

What, then, is a proof by contradiction?  It is the affirmation of a positive statement by refutation of its denial.  It is a direct proof of the negation of a negated assertion that is then pressed into service as a direct proof of the assertion, which it is not.  Anyone is free to ignore the distinction for the sake of convenience, as a philosophical issue, or as a sly use of “goto” in a proof, but the distinction nevertheless exists and is important.  Indeed, part of the beauty of constructive mathematics is that one can draw such distinctions. Once drawn, such distinctions can be disregarded; once blurred, forever blurred, irrecoverably.

For the sake of explanation, let me rehearse a standard example of a genuine proof by contradiction.  The claim is that there exists irrationals a and b such that a to the b power is rational.  Here is an indirect proof, a true proof by contradiction.  Let us prove instead that it is impossible that any two irrationals a and b are such that a to the b is irrational.  This is a negative statement, so of course one proves it by deriving a contradiction from assuming that which is negated.  Suppose, for a contradiction, that for every two irrationals a and b, the exponentiation a to the b power is irrational.  We know from Pythagoras that root two is irrational, so plug it in for both a and b, and conclude that root two to the root two power is irrational.  Now use the assumption again, taking a to be root two to the root two, and b to be root two.  Calculate a to the power of b, it is two, which is eminently rational.  Contradiction.

We have now proved that it is not the case that every pair of irrationals, when exponentiated, give an irrational.  There is nothing questionable about this proof.  But it does not prove that there are two irrationals whose exponent is rational!  If you think it does, then I ask you, please name them for me.  That information is not in this proof (there are other proofs that do name them, but that is not relevant for my purposes).  You may, if you wish, disregard the distinction I am drawing, that is your prerogative, and neither I nor anyone has any problem with that.  But you cannot claim that it is a direct proof, it is rather an indirect proof, that proceeds by refuting the negative of the intended assertion.

So why am I writing this?  Because I have learned, to my dismay, that in U.S. computer science departments–of all places!–students are being taught, erroneously, that any proof that derives a contradiction is a “proof by contradiction”.  It is not.  Any proof of a negative must proceed by contradiction.  A proof by contradiction is, contrarily, a proof of a positive by refutation of the negative.  This distinction is important, even if you want to “mod out” by it in your work, for it is only by drawing the distinction that one can even define the equivalence with which to quotient.

That’s my main point.  But for those who may not be familiar with the distinction between direct and indirect proof, let me take the opportunity to comment on why one might care to draw such a distinction.  It is a matter of honesty, of a sort: the information content of the foregoing indirect proof does not fulfill the expectation stated in the theorem.  It is a kind of boast, an overstatement, to claim otherwise.  Compare the original statement with the reformulation used in the proof.  The claim that it is not the case that every pair of irrationals exponentiate to an irrational is uncontroversial.  The proof proves it directly, and there is nothing particularly surprising about it.  One would even wonder why anyone would bother to state it.  Yet the supposedly equivalent claim stated at the outset appears much more fascinating, because most people cannot easily think up an example of two irrationals that exponentiate to rationals.  Nor does the proof provide one. Once, when shown the indirect proof, a student of mine blurted out “oh that’s so cheap.”  Precisely.

Why should you care?  Maybe you don’t, but there are nice benefits to keeping the distinction, because it demarcates the boundary between constructive proofs, which have direct interpretation as functional programs, and classical proofs, which have only an indirect such interpretation (using continuations, to be precise, and giving up canonicity).  Speaking as a computer scientist, this distinction matters, and it’s not costly to maintain.  May I ask that you adhere to it?

Edit: rewrote final paragraph, sketchy and irrelevant, and improved prose throughout. Word-smithing, typos.


by Robert Harper at December 25, 2017 11:50 PM

Michael Snoyman

Dropped packages following LTS 10

tl;dr: Check this pull request to see if your package was just removed from Stackage Nightly.

Stackage Nightly maintains a set of upper bounds to give package maintainers a grace period between dependencies updating their APIs and users needing to support the new versions. Keeping these upper bounds in place indefinitely places a burden on the rest of the ecosystem needing to keep support for older versions of packages. Therefore, the Stackage Curator team will periodically drop these upper bounds, and in the process must temporarily drop some packages.

Over the years, we've standardized on doing this drop immediately following the release of a new major version of LTS Haskell. This allows maximum packages to be included in an LTS release without imposing "bleeding edge" requirements (something LTS Haskell tries to avoid doing).

And as you may have guessed: I'm writing this now since I just dropped a bunch of upper bounds and blocked a number of packages on Stackage Nightly :). If you'd like to see if your package was evicted, please check out the relevant pull request. Some notes:

  • The haskell-src-exts upgrade caused the most downstream breakage
  • There were a huge number of upper bounds for http-types. Since this is the second time a major version bump occurred recently, I left this upper bound in place. More information is available on the http-types issue.

Once dependencies are fixed, please send pull requests to reenable packages. Everyone is welcome to do so, whether you're unblocking your own package or someone else's.

PS: Merry Christmas to all those celebrating today.

December 25, 2017 03:47 PM

December 21, 2017

Tweag I/O

All about reflection:<br/> a tutorial

Arnaud Spiwack

An important device in the tool belt I carry around everyday is type class reflection. I don't reach for it often, but it can be very useful. Reflection is a little known device. And for some reason it is often spoken of with a hint of fear.

In this post, I want to convince you that reflection is not hard and that you ought to know about it. To that end, let me invite you to join me on a journey to sort a list:

sortBy :: (a->a->Ordering) -> [a] -> [a]

What is reflection?

Type class reflection is an extension of Haskell which makes it possible to use a value as a type class instance. There is a package on Hackage, implementing type class reflection for GHC, which I will use for this tutorial. Type class reflection being an extension of Haskell (that is, it can't be defined from other Haskell features), this implementation is GHC-specific and will probably not work with another compiler.

Literate Haskell

This blog post was generated from literate Haskell sources. You can find an extracted Haskell source file here.

There is a bit of boilerplate to get out of the way before we start.

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE UndecidableInstances #-}

module Reflection where

import Data.Proxy
import Data.Reflection

UndecidableInstances... scary, I know. It is unfortunately required. It means that we could technically send the type checker into an infinite loop. Of course, we will be careful not to introduce such loops.

Sorted lists

My goal, today, is to sort a list. In order to make the exercise a tiny bit interesting, I will use types to enforce invariants. I'll start by introducing a type of sorted lists.

newtype SortedList a = Sorted [a]

Obviously, a SortedList is a list: we can just forget about its sortedness.

forget :: SortedList a -> [a]
forget (Sorted l) = l

But how does one construct a sorted list? Well, at the very least, the empty lists and the lists of size 1 are always sorted.

nil :: SortedList a
nil = Sorted []

singleton :: a -> SortedList a
singleton a = Sorted [a]

What about longer lists though? We could go about it in several ways. Let's decide to take the union of two sorted list:

merge :: Ord a => SortedList a -> SortedList a -> SortedList a
merge (Sorted left0) (Sorted right0) = Sorted $ mergeList left0 right0
    -- 'mergeList l1 l2' returns a sorted permutation of 'l1++l2' provided
    -- that 'l1' and 'l2' are sorted.
    mergeList :: Ord a => [a] -> [a] -> [a]
    mergeList [] right = right
    mergeList left [] = left
    mergeList left@(a:l) right@(b:r) =
        if a <= b then
          a : (mergeList l right)
          b : (mergeList left r)

We need Ord a to hold in order to define merge. Indeed, type classes are global and coherent: there is only one Ord a instance, and it is guaranteed that merge always uses the same comparison function for a. This enforces that if Ord a holds, then SortedList a represents lists of a sorted according to the order defined by the unique Ord a instance. In contrast, a function argument defining an order is local to this function call. So if merge were to take the ordering as an extra argument, we could change the order for each call of merge; we couldn't even state that SortedList a are sorted.

If it weren't for you meddling type classes

That's it! we are done writing unsafe code. We can sort lists with the SortedList interface: we simply need to split the list in two parts, sort said parts, then merge them (you will have recognised merge sort).

fromList :: Ord a => [a] -> SortedList a
fromList [] = nil
fromList [a] = singleton a
fromList l = merge orderedLeft orderedRight
    orderedLeft = fromList left
    orderedRight = fromList right
    (left,right) = splitAt (div (length l) 2) l

Composing with forget, this gives us a sorting function

sort :: Ord a => [a] -> [a]
sort l = forget (fromList l)

Though that's not quite what we had set out to write. We wanted

sortBy :: (a->a->Ordering) -> [a] -> [a]

It is easy to define sort from sortBy (sort = sortBy compare). But we needed the type class for type safety of the SortedList interface. What to do? We would need to use a value as a type class instance. Ooh! What may have sounded excentric when I first brought it up is now exactly what we need!

As I said when I discussed the type of merge: one property of type classes is that they are globally attached to a type. It may seem impossible to implement sortBy in terms of sort: if I use sortBy myOrd :: [a] -> [a] and sortBy myOtherOrd :: [a] -> [a] on the same type, then I am creating two different instances of Ord a. This is forbidden.

So what if, instead, we created an entirely new type each time we need an order for a. Something like

newtype ReflectedOrd a = ReflectOrd a

Except that we can't do a newtype every time we call sortBy. So let's make one newtype once and for all, with an additional parameter.

newtype ReflectedOrd s a = ReflectOrd a

-- | Like `ReflectOrd` but takes a `Proxy` argument to help GHC with unification
reflectOrd :: Proxy s -> a -> ReflectedOrd s a
reflectOrd _ a = ReflectOrd a

unreflectOrd :: ReflectedOrd s a -> a
unreflectOrd (ReflectOrd a) = a

Now, we only have to create a new parameter s locally at each sortBy call. This is done like this:

reifyOrd :: (forall s. Ord (ReflectedOrd s a) => …) -> …

What is happening here? The reifyOrd function takes an argument which works for any s. In particular, if every time we called reifyOrd we were to actually use a different s then the program would be correctly typed. Of course, we're not actually creating types: but it is safe to reason just as if we were! For instance if you were to call reifyOrd (reifyOrd x) then x would have two distinct parameters s1 and s2: s1 and s2 behave as names for two different types. Crucially for us, this makes ReflectOrded s1 a and ReflectOrded s2 a two distinct types. Hence their Ord instance can be different. This is called a rank 2 quantification.

In order to export a single reify function, rather than one for every type class, the reflection package introduces a generic type class so that you have:

reify :: forall d r. d -> (forall s. Reifies s d => Proxy s -> r) -> r

Think of d as a dictionary for Ord, and Reifies s d as a way to retrieve that dictionary. The Proxy s is only there to satisfy the type-checker, which would otherwise complain that s does not appear anywhere. To reiterate: we can read s as a unique generated type which is valid only in the scope of the reify function. For completeness, here is the the Reifies type class, which just gives us back our d:

class Reifies s d | s -> d where
  reflect :: proxy s -> d

The | s -> d part is called a functional dependency. It is used by GHC to figure out which type class instance to use; we won't have to think about it.

Sorting with reflection

All that's left to do is to use reflection to give an Ord instance to ReflectedOrd. We need a dictionary for Ord: in order to build an Ord instance, we need an equality function for the Eq subclass, and a comparison function for the instance proper:

data ReifiedOrd a = ReifiedOrd {
  reifiedEq :: a -> a -> Bool,
  reifiedCompare :: a -> a -> Ordering }

Given a dictionary of type ReifiedOrd, we can define instances for Eq and Ord of ReflectedOrd. But since type class instances only take type class instances as an argument, we need to provide the dictionary as a type class. That is, using Reifies.

instance Reifies s (ReifiedOrd a) => Eq (ReflectedOrd s a) where
  (==) (ReflectOrd x) (ReflectOrd y) =
    reifiedEq (reflect (Proxy :: Proxy s)) x y

instance Reifies s (ReifiedOrd a) => Ord (ReflectedOrd s a) where
  compare (ReflectOrd x) (ReflectOrd y) =
    reifiedCompare (reflect (Proxy :: Proxy s)) x y

Notice that because of the Reifies on the left of the instances GHC does not know that it will for sure terminate during type class resolution (hence the use of UndecidableInstances). However, these are indeed global instances: by definition, they are the only way to have an Ord instances on the ReflectedOrd type! Otherwise GHC would complain.

We are just about done: if we reify a ReifiedOrd a, we have a scoped instance of Ord (ReflectedOrd s a) (for some locally generated s). To sort our list, we simply need to convert between [a] and ReflectedOrd s a.

sortBy :: (a->a->Ordering) -> [a] -> [a]
sortBy ord l =
  reify (fromCompare ord) $ \ p ->
    map unreflectOrd . sort . map (reflectOrd p) $ l

-- | Creates a `ReifiedOrd` with a comparison function. The equality function
--   is deduced from the comparison.
fromCompare :: (a -> a -> Ordering) -> ReifiedOrd a
fromCompare ord = ReifiedOrd {
  reifiedEq = \x y -> ord x y == EQ,
  reifiedCompare = ord }

Wrap up & further reading

We've reached the end of our journey. And we've seen along the way that we can enjoy the safety of type classes, which makes it safe to write function like merge in Haskell, while still having the flexibility to instantiate the type class from a function argument, such as options from the command line. Since type class instances are global, such local instances are defined globally for locally generated types. This is what type class reflection is all about.

If you want to delve deeper into the subject of type class reflection, let me, as I'm wrapping up this tutorial, leave you with a few pointers to further material:

  • A talk by Edward Kmett, the author of the reflection package, on the importance of the global coherence of type classes and about reflection
  • There is no built-in support for reflection in GHC, this tutorial by Austin Seipp goes over the very unsafe, internal compiler representation dependent, implementation of the library
  • John Wiegley discusses an application of reflection in relation with QuickCheck.
  • You may have noticed, in the definition of sortBy, that we map the reflectOrd and unreflectOrd in order to convert between a and ReflectedOrd s a. However, while, reflectOrd and unreflectOrd, have no computational cost, using them in combination with map will traverse the list. If you are dissatified with this situation, you will have to learn about the Coercible type class. I would start with this video from Simon Peyton Jones.

December 21, 2017 12:00 AM

December 17, 2017

Neil Mitchell

Announcing the 'debug' package

Haskell is a great language, but debugging Haskell is undoubtedly a weak spot. To help with that problem, I've just released the debug library. This library is intended to be simple and easy to use for a common class of debugging tasks, without solving everything. As an example, let's take a function we are interested in debugging, e.g.:

module QuickSort(quicksort) where
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (x:xs) = quicksort lt ++ [x] ++ quicksort gt
where (lt, gt) = partition (<= x) xs

Turn on the TemplateHaskell and ViewPatterns extensions, import Debug, indent your code and place it under a call to debug, e.g.:

{-# LANGUAGE TemplateHaskell, ViewPatterns #-}
module QuickSort(quicksort) where
import Data.List
import Debug

debug [d|
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (x:xs) = quicksort lt ++ [x] ++ quicksort gt
where (lt, gt) = partition (<= x) xs

We can now run our debugger with:

$ ghci QuickSort.hs
GHCi, version 8.2.1: :? for help
[1 of 1] Compiling QuickSort ( QuickSort.hs, interpreted )
Ok, 1 module loaded.
*QuickSort> quicksort "haskell"
*QuickSort> debugView

The call to debugView starts a web browser to view the recorded information, looking something like:

From there you can click around to explore the computation.

I'm interested in experiences using debug, and also have a lot of ideas for how to improve it, so feedback or offers of help most welcome at the bug tracker.

If you're interested in alternative debuggers for Haskell, you should check out the GHCi debugger or Hood/Hoed.

by Neil Mitchell ( at December 17, 2017 10:02 PM

Michael Snoyman

What Makes Haskell Unique

I gave a talk today at the F(by) 2017 conference in Minsk, Belarus. The conference was great, I would definitely recommend it in the future. Thank you very much to the organizers for the opportunity to present on Haskell.

I prepared for this talk differently than I've prepared for other talks in the past. I'm very comfortable writing up blog posts, but have always found slide preparation difficult. This time around, I wrote up the content in mostly-blog-post form first, and only created the slides after that was complete. Overall, this worked very well for me, and I'll try it again in the future. (If others want to share their approaches to preparing talks, I'd definitely be happy to hear them.)

As a result: I'm able to share the original write-up I did as well. For those who saw the live talk (or the video): you may want to skip towards the end, which covers some material that there wasn't time for in the talk itself.

If you'd like to follow with the slides, they're also available.

My name is Michael Snoyman. I work at a company called FP Complete. One of the things we do is help individuals and companies adopt Haskell, and functional programming in general. And that leads right in to the topic of my talk today:

What makes Haskell unique

Programmers today have a large number of languages to choose from when deciding what they will learn and use in their day to day coding. In order to make intelligent decisions about which languages to pursue, people need to be able to quickly learn and understand what distinguishes one language from another.

Given that this is a functional programming conference, it's probably no surprise to you that Haskell can be called a functional programming language. But there are lots of languages out there that can be called functional. Definitions vary, but let's take a particularly lax version of functional programming: first class functions, and higher order functions. Well, by this defintion, even a language like C counts! You may want to limit the definition further to include syntactic support for closures, or some other features. Regardless, the same point remains:

Haskell may be functional, but that doesn't make it unique

In fact, there's a long list of features I could rattle off that could be used to describe Haskell.

  • Functional
  • Statically typed
  • Pure
  • Lazy
  • Strongly typed
  • Green threads
  • Native executables
  • Garbage collected
  • Immutability

Some of these features, like being pure and lazy, are relatively rare in mainstream languages. Others, however, are common place. What I'm going to claim is that not one of these features is enough to motivate new people to Haskell—including people in this audience—to start using it. Instead:

It's the combination of these features that makes Haskell unique

As an example: the intersection of purity, strong typing, and functional programming style, for instance, lends itself to a high level form of expression which is simultaneously easy to write, easy to read, easy to modify, and efficient. I want to share some examples of some code examples in Haskell that demonstrate how the language encourages you to write code differently from other languages. And I'm going to try to claim that this "different" style is awesome, though it also has some downsides.

Async I/O and Concurrency

Let's start off with a use case that's pretty popular today. Look at this pseudocode and tell me what's wrong with it:

json1 := httpGet(url1)
json2 := httpGet(url2)
useJsonBodies(json1, json2)

Given the heading of this slide, you may have guessed it: this is blocking code. It will tie up an entire thread waiting for the response body from each of these requests to come back. Instead, we should be using asynchronous I/O calls to allow more efficient usage of system resources. One common approach is to use callbacks:

httpGetA(url1, |json1| =>
  httpGetA(url2, |json2| =>
    useJsonBodies(json1, json2)

You may recognize this coding style as "callback hell." There are plenty of techniques in common languages to work around that, usually around the idea of promises or futures. And you may have heard something about how Javascript futures are a monad, and expect me to be talking about how Haskell does monads better. But I'm not going to do that at all. Instead, I want to show you what the asynchronous version of the code looks like in Haskell

json1 <- httpGet url1
json2 <- httpGet url2
useJsonBodies json1 json2

This may surprise you, since this looks exactly like the blocking pseudocode I showed above. It turns out that Haskell has a powerful runtime system. It will automatically convert your blocking-style code into asynchronous system calls, and automatically handle all of the work of scheduling threads and waking them up when data is available.

This is pretty great, but it's hardly unique to Haskell. Erlang and Go, as two popular examples, both have this as well. If we want to see what makes Haskell different...

we have to go deeper.


It's pretty lame that we need to wait for our first HTTP request to complete before even starting our second. What we'd like to do is kick off both requests at the same time. You may be imagining some really hairy APIs with threads, and mutable variables, and locks. But here's how you do this in Haskell:

(json1, json2) <- concurrently
  (httpGet url1)
  (httpGet url2)
useJsonBodies json1 json2

Haskell has a green thread implementation which makes forking threads cheap. The async library provides a powerful, high level interface performing actions in parallel without bothering with the low level aspects of locking primitives and mutable variables. And this builds naturally on top of the async I/O system already described to be cheap about system resource usage.


What we've seen already is elegant in Haskell, but it's not terribly difficult to achieve in other languages. Let's take it to the next level. Instead of needing both JSON response bodies, we only need one: whichever one comes back first. In pseudocode, this might look like:

promise1 := httpGet(url1)
promise2 := httpGet(url2)
result := newMutex()
promise1.andThen(|json1| =>
promise2.andThen(|json2| =>

This code is tedious and error prone, but it gets the job done. As you can probably guess, there's a simple API for this in Haskell:

eitherJson <- race
  (httpGet url1)
  (httpGet url2)
case eitherJson of
  Left  json1 -> useJsonBody1 json1
  Right json2 -> useJsonBody2 json2

At first, this may seem like it's just a well designed API. But there's quite a bit more going on under the surface. The Haskell runtime system itself supports the idea of an asynchronous exception, which allows us to cancel any other running thread. This feature is vital to making race work.

And here's the final piece in the puzzle. All of the thread scheduing and canceling logic I've described doesn't just apply to async I/O calls. It works for CPU-intensive tasks as well. That means you can fork thousands of threads, and even if one of them is busy performing computation, other threads will not be starved. Plus, you can interrupt these long-running computations:

let tenSeconds = 10 * 1000 * 1000
timeout tenSeconds expensiveComputation

Summary: concurrency and async I/O


  • Cheap threads
  • Simple API
  • Highly responsive


  • Complicated runtime system
  • Need to be aware of async exceptions when writing code

Immutability and purity

Most programming languages out there default to mutability: a variable or field in a data structure can be changed at any time. Haskell is different in two ways:

  1. Values are immutable by default, and mutability must be explicitly indicated with a variable type
  2. Mutating a mutable variable is considered a side effect, and that mutable is tracked by the type system

For example, the following Haskell-like code is impossible:

let mut total = 0
    loop i =
      if i > 1000000
        then total
        else total += i; loop (i + 1)
 in loop 1

From pure code, we cannot create, read, or modify a mutable variable. We also need to say what kind of mutable variable we want:

total <- newIORef 0
let loop i =
      if i > 1000000
        then readIORef total
        else do
          modifyIORef total (+ i)
          loop (i + 1)
loop 1

This is a lot of ceremony for a simple algorithm. Of course, the recommended Haskell way of doing this would be to avoid mutable variables, and use a more natural functional style.

let loop i total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

Besides pushing us towards this supposedly better functional approach, why is immutable, pure code such a nice thing?

Reasoning about code

You'll often hear Haskellers throw around a phrase "reasoning about code." Personally, I think the phrase is used to mean too many different things. But let me give you an example that I think is accurate. Let's look at some pseudocode:

// scores.txt

func main() {
  results := readResultsFromFile("results.txt")
  print("First result was by: " + results[0].name)

func printScoreRange(results: Vector<TestResult>) {

If you look at the code above, what do you expect the output to be? I think it would be reasonable to guess something like:

Lowest: 22
Highest: 55
First result was by: Alice

However, now let's throw in another piece of information: the definition of printScoreRange:

func printScoreRange(results: Vector<TestResult>) {
  results.sortBy(|result| => result.score)
  print("Lowest: " + results[0].score)
  print("Highest: " + results[results.len() - 1].score)

Suddenly our assumptions change. We can see that this function mutates the results value passed to it. If we're passing mutable references to vectors in this made up language, then our output is going to look more like:

Lowest: 22
Highest: 55
First result was by: Charlie

Since the original results value in our main function has been modified. This is what I mean by hurting our ability to reason about the code: it's no longer sufficient to look at just the main function to understand what will be happening. Instead, we're required to understand what may possibly be occurring in the rest of our program to mutate our variables.

In Haskell, the code would instead look like:

main :: IO ()
main = do
  results <- readResultsFromFile "results.txt"
  printScoreRange results
  putStrLn $ "First result was by: " ++ name (head results)

printScoreRange :: [TestResult] -> IO ()
printScoreRange results = do
  let results' = sortBy score results
  putStrLn $ "Lowest: " ++ show (score (head results'))
  putStrLn $ "Highest: " ++ show (score (last results'))

We know that it's impossible for printScoreRange to modify the results value we have in the main function. Looking at only this bit of code in main is sufficient to know what will happen with the results value.

Data races

Even more powerful than the single threaded case is how immutability affects multithreaded applications. Ignoring the insanity of multiple threads trying to output to the console at the same time, we can easily parallelize our code:

main :: IO ()
main = do
  results <- readResultsFromFile "results.txt"
  concurrently_ printFirstResult printScoreRange

printFirstResult results =
  putStrLn $ "First result was by: " ++ name (head results)

printScoreRange results = do
  let results' = sortBy score results
  putStrLn $ "Lowest: " ++ show (score (head results'))
  putStrLn $ "Highest: " ++ show (score (last results'))

There's no need to worry about concurrent accesses to data structures. It's impossible for the other threads to alter our data. If you do want other threads to affect your local data, you'll need to be more explicit about it, which we'll get back to.

Mutability when needed

One thing you may be worried about is how this affects performance. For example, it's much more efficient to sort a vector using mutable access instead of only pure operations. Haskell has two tricks for that. The first is the ability to explicitly create mutable data structures, and mutate them in place. This breaks all of the guarantees I already mentioned, but if you need the performance, it's available. And unlike mutable-by-default approaches, you now know exactly which pieces of data you need to handle with care when coding to avoid tripping yourself up.

The other approach is to create a mutable copy of the original data, perform your mutable algorithm on it, and then freeze the new copy into an immutable version. With sorting, this looks something like:

sortMutable :: MutableVector a -> ST (MutableVector a)
sortMutable = ... -- normal sorting algorithm

sortImmutable :: Vector a -> Vector a
sortImmutable orig = runST $ do
  mutable <- newMutableVector (length orig)
  copyValues orig mutable
  sort mutable
  freeze mutable

ST is something we use to have temporary and local mutable effects. Because of how it's implemented, we know that none of the effects can be visible from outside of our function, and that for the same input, the sortImmutable function will always have the same output. While this approach requires an extra memory buffer and an extra copy of the elements in the vector, it avoids completely the worries of your data being changed behind your back.

Summary: immutability and purity


  • Easier to reason about code
  • Avoid many cases of data races
  • Functions are more reliable, returning the same output for the same input


  • Lots of ceremony if you actually want mutation
  • Some runtime performance hit for mutable algorithms

Software Transactional Memory

Let's say you actually need to be able to mutate some values. And for fun, let's say you want to do this from multiple threads. A common example of this is a bank. Let's again play with some pseudocode:

runServer (|request| => {
  from := accounts.lookup(request.from)
  to := accounts.lookup(
  accounts.set(request.from, from - request.amt)
  accounts.set(, to + request.amt)

This looks reasonable, except that if two requests come in at the same time for the same account, we can end up with a race condition. Consider something like this:

Thread 1: receive request: Alice gives $25
Thread 2: receive request: Alice receives $25
Thread 1: lookup that Alice has $50
Thread 2: lookup that Alice has $50
Thread 1: set Alice's account to $25
Thread 2: set Alice's account to $75

We know that we want Alice to end up with $50, but because of our data race, Alice ends up with $75. Or, if the threads ran differently, it could be $25. Neither of these is correct. In order to avoid this, we would typically deal with some kind of locking:

runServer (|request| => {
  // same code as before

Unfortunately, this leads to deadlocks! Consider this scenario:

Thread 1: receive request: $50 from Alice to Bob
Thread 2: receive request: $50 from Bob to Alice
Thread 1: lock Alice
Thread 2: lock Bob
Thread 1: try to lock Bob, but can't, so wait
Thread 2: try to lock Alice, but can't, so wait

This kind of problem is the bane of many concurrent programs. Let me show you another approach. As you may guess, here's some Haskell:

runServer $ \request -> atomically $ do
  let fromVar = lookup (from request) accounts
      toVar = lookup (to request) accounts
  origFrom <- readTVar fromVar
  writeTVar fromVar (origFrom - amt request)
  origTo <- readTVar toVar
  writeTVar toVar (origTo + amt request)

There are helper functions to make this shorter, but I wanted to do this the long way to prove a point. This looks like exactly the kind of race condition I described before. However, that atomically function is vital here. It ensures that only a complete transaction is ever committed. If any of the variables we touch are mutated by another thread before our transaction is complete, all of our changes are rolled back, and the transaction is retried. No need for explicit locking, and therefore many less worries about data races and deadlocks.

A TVar is a "transactional variable." It's an alternative to the IORef that I mentioned earlier. There are other kinds of mutable variables in Haskell, including channels and MVars which are like mutexes. This is what I meant when I said you need to be explicit about what kind of mutation you want in Haskell.

Purity's role

What do you think will happen with this program:

atomically $ do
  buyBitcoins 3 -- side effects on my bank account

  modifyTVar myBitcoinCount (+ 3)

Here, buyBitcoins is going off to some exchange a buying about $100,000 in bitcoin (or whatever ridiculous amount they're selling for now). I said before that, if the variables are modified while running, the transaction will be retried. It seems like this function is very dangerous, as it may result in me going about $10,000,000 into debt buying bitcoins!

This is where purity steps in. Inside atomically, you are not allowed to perform any side effects outside of STM itself. That means you can modify TVars, but you cannot read or write files, print to the console, fire the missiles, or place multi million dollar currency purchases. This may feel like a limitation, but the tradeoff is that it's perfectly safe for the runtime system to retry your transactions as many times as it wants.

Summary of STM


  • Makes concurrent data modification much easier
  • Bypass many race conditions and deadlocks


  • Depends on purity to work at all
  • Not really a disadvantage, you're already stuck with purity in Haskell
  • Not really any other disadvantages, so just use it!


It's a little cheeky of me to get this far into a talk about unique features of Haskell and ignore one of its most notable features: laziness. Laziness is much more of a double-edged sword than the other features I've talked about, and let me prove that by revisiting one of our previous examples.

let loop i total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

I didn't describe it before, but this function will sum up the numbers from 1 to 1,000,000. There are two problems with this function:

  1. There's a major performance bug in it
  2. It's much more cumbersome than it should be

Space leaks

The bane of laziness is space leaks, something you've probably heard about if you've read at all about Haskell. To understand this, let's look at how laziness is implemented. When you say something like:

let foo = 1 + 2

foo doesn't actually contain 3 right now. Instead, it contains an instruction to apply the operator + to the values 1 and 2. This kind of instruction is called a thunk. And as you might guess, storing the thunk is a lot more expensive than storing a simple integer. We'll see why this helps in a bit, but for now we just care about why it sucks. Let's look at what happens in our loop function:

let loop i total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

Each time we step through the loop, we have to compare i to the number 1,000,000. Therefore, we are forced to evaluate it, which means turning it into a simple integer. But we never look at the value of total. Instead of storing a simple integer, which would be cheap, we end up building a huge tree that looks like "add 1 to the result of add 2 to the result of ... to 1,000,000." This is really bad: it uses more memory and more CPU than we'd like.

We can work around this in Haskell by being explicit about which values should be evaluated. There are a few ways to do this, but in our case, the easiest is:

let loop i !total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

All I've done is added an exclamation point in front of the total argument. This is known as a bang pattern, and says "make sure this is evaluated before running the rest of this function." The need to do this in some cases is definitely a downside to Haskell's laziness. On the other hand, as we'll see shortly, you often don't need to bother if you use the right kinds of functions.

Laziness is awesome

Let's go back to pseudocode and rewrite our summation:

total := 0
for(i := 1; i <= 1000000; i++) {
  total += i

Pretty simple. But now let's modify this to only sum up the even numbers:

total := 0
for(i := 1; i <= 1000000; i++) {
  if (isEven(i)) {
    total += i

OK, that's fine. But now, let's sum up the indices modulus 13 (for some weird reason):

total := 0
for(i := 1; i <= 1000000; i++) {
  if (isEven(i)) {
    total += i % 13

Each of these modifications is fine on its own, but at this point it's getting harder to see the forest for the trees. And fortunately each of these transformations was relatively simple. If some of the requirements were more complicated, fitting it into the for loop may be more challenging.

Let's go back to the beginning with Haskell. We saw how we could do it with a loop, but let's see the real way to sum the numbers from 1 to 1,000,000:

-- Bad
let loop i !total =
      if i > 1000000
        then total
        else loop (i + 1) (total + i)
 in loop 1 0

-- Awesome!
sum [1..1000000]

We use list range syntax to create a list with one million numbers in it. On its face, this looks terrible: we need to allocate about 8mb of data to hold onto these integers, when this should run in constant space. But this is exactly where laziness kicks in: instead of allocating all of these values immediately, we allocate a thunk. Each time we step through the list, our thunk generates one new integer and a new thunk for the rest of the list. We're never using more than a few machine words.

There are also other optimizations in GHC to avoid even allocating those thunks, but that's not something I'm going to cover today.

Anyway, let's continue. We can easily tweak this to only add up the even numbers:

sum (filter even [1..1000000])

This uses the filter higher order function, and likewise avoids allocating an entire list at once. And doing the silly modulus 13 trick:

sum (map (`mod` 13) (filter even [1..1000000]))

Laziness is definitely a mixed bag, but combined with the functional style of Haskell in general, it allows you to write higher level, declarative code, while keeping great performance.

Short circuiting for free

Lots of languages define && and || operators which stop evaluation early, e.g.:

foo() && bar()

bar is only called if foo returns true. Haskell works the same way, but these operators aren't special; they just use laziness!

False && _ = False
True && x = x

True || _ = True
False || x = x

This even scales up to functions working on lists of values, such as and, or, all, and any.

Other downsides

There's one other downside to laziness, and a historical artifact. Laziness means that exceptions can be hiding inside any thunk. This is also known as partial values and partial functions. For example, what does this mean?

head []

Generally speaking, partiality is frowned upon, and you should use total functions in Haskell.

The historical artifact is that many bad functions are still easily available, and they should be avoided. head is arguably an example of that. Another is the lazy left fold function, foldl. In virtually all cases, you should replace it with a strict left fold foldl'.

Summary of laziness


  • More composable code
  • Get efficient results from combining high level functions
  • Short-circuiting like && and || is no longer a special case


  • Need to worry about space leaks
  • Exceptions can be hiding in many places
  • Unfortunately some bad functions like foldl still hanging around

Side note There's a major overlap with Python generators or Rust iterators, but laziness in Haskell is far more pervasive than these other approaches.


Due to time constraints, I'm not going to be able to go into detail on a bunch of other examples I wanted to talk about. Let me just throw out some quick thoughts on them.

Parser (and other) DSLs

  • Operator overloading!
  • Abstract type classes like Applicative and Alternative a natural fit, e.g.: parseXMLElement <|> parseXMLText.
  • Able to reuse huge number of existing library functions, e.g. optional, many
  • General purpose do-notation is great
data Time = Time Hour Minutes Seconds (Maybe AmPm)
data AmPm = Am | Pm

parseAmPm :: Parser Time
parseAmPm = Time
  <$> decimal
  <*> (":" *> decimal)
  <*> (":" *> decimal)
  <*> optional (("AM" $> Am) <|> ("PM" $> Pm))

c/o @queertypes

Advanced techniques

  • Free monads
  • Monad transformer stacks
  • Lens, conduit, pipes, ...
  • Lots of ways to do things in Haskell!
  • It's a plus and a minus
  • Recommendation: choose a useful subset of Haskell and its libraries, and define some best practices


  • Haskell combines a lot of uncommon features
  • Very few of those features are unique
  • Combining those features allows you to write code very differently than in other languages
  • If you want readable, robust, easy to maintain code: I think it's a great choice
  • Be aware of the sharp edges: they do exist!


December 17, 2017 08:00 AM

December 15, 2017

Ken T Takusagawa

[agobrown] Longest games of chomp

What Chomp starting positions offer the longest games, perhaps the most possibilities for interesting games?  Among rectangular starting positions, good starting positions are 13x12, 12x11, 10x9, 9x8, 11x6, 7x6, 8x5, 6x5, 5x4.  Missing from the pattern of (N)x(N-1) are 11x10 and 8x7.  (Chomp is weird in how there aren't simple patterns.  It might be a good candidate for machine learning.)

We assumed 3 types of positions in Chomp are instantly known lost (P positions):

  1. L-shaped positions with both arms of the L having unit width and same lengths
  2. 2-row positions of the form [a,a-1]
  3. 3-row positions of the form [a,a-2,2]

The 3-row [a,a-2,2] class of positions is noted in Proposition 2 of "Three-Rowed Chomp" by Doron Zeilberger.  The winning strategy from such a position is as follows:

The base case is [4,2,2] (which looks kind of like a pistol).  If the opponent moves to [3,2,2], then respond moving to [3,2] and follow the 2-row strategy (or move to [3,1,1] and L-shaped strategy).  If [2,2,2] then 2-row strategy vertically.  If [4,1,1] then [3,1,1] and L-shaped strategy.  If [4,2,1] then [2,2,1] and 2-row strategy vertically.  If [4,2] then 2-row strategy.

For larger 3-row positions [a,a-2,2], if the opponent moves in the first 2 rows, leaving at least 4 in the first row and at least 2 in the second row, then restore the position to the shape [b,b-2,2].  If [3,3,2] then [3,1,1] and L-shaped strategy.  If [a,1,1] then [3,1,1] and L-shaped strategy.  If the opponent moves on the third row to [a,a-2,1] then [2,2,1] and follow the 2-row strategy vertically.  If [a,a-2], then 2-row strategy.

Here is complete output of all positions within board size 13x13 and Haskell source code.  A selection of some positions and their game values are also given below.  Computing board size 12 required 8.5 GB of RAM on a machine with 16 GB of RAM.  (Haskell programs tend to use a lot of memory unless one puts effort into conserving memory, which we did not do.)

For computing board size 13, we allowed swapping to virtual memory on SSD on a machine with 8 GB of physical RAM.  The output of /usr/bin/time was:

5751.60user 86808.57system 39:48:33elapsed 64%CPU (0avgtext+0avgdata 7192640maxresident)k
10410518744inputs+8outputs (184956202major+316491058minor)pagefaults 0swaps

This suggests a slowdown factor of about 25 for using virtual memory on SSD compared to RAM for this program which made heavy use of Data.Map.  Polling "ps xu" saw a maximum virtual memory usage of 39 GB.  For the output of the board size 13 at the link above, we omitted saving the "Win_in 1" positions to save disk space.

There are only 3 "Lose in 2" positions: [6,3,3]; [5,5,3]; and [5,2,1,1].  Memorize them to get an edge against opponents.  One could also memorize the 7 "Lose in 4" positions, 14 "Lose in 6", 26 "Lose in 8"...

There seem to be some more patterns that lose: [odd,2,1,1,1,...]; [even,3,1,1,1,...]; [even,2,2,2,1,1,1,...]; [even,2,2,1,1,1,...]; [odd,4,1,1,1,...].  These deserve future investigation.  Andries Brouwer's web site suggests that losing families of positions exist in 3-row chomp for [a+11,a+7,5]; [?,?,7]; [?,?,9]; [?,?,11]; [?,?,14] (not 13, once again breaking what seemed to be a simple pattern of odd third rows).  It still needs to be explicitly articulated how to win after giving your opponent these losing positions.  Work by Steven Byrnes suggests the game values of all 3-row Chomp positions can be rapidly computed, though probably not by a human in his or her head.  Future versions of the code should bound not by board size but number of pieces, to investigate thin positions and roughly L-shaped positions.

(Position [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12, 5], Win_in 103)
(Position [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 5], Win_in 103)
(Position [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], Win_in 101)
(Position [12, 12, 12, 12, 12, 12, 12, 12, 12, 10, 7], Lose_in 86)
(Position [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12], Win_in 79)
(Position [11, 11, 11, 10, 10, 10, 10, 10, 2], Win_in 57)
(Position [11, 11, 11, 10, 10, 10, 10, 9, 2], Win_in 57)
(Position [11, 11, 11, 11, 11, 9, 9, 7, 1, 1], Win_in 57)
(Position [11, 11, 11, 11, 11, 9, 9, 9, 1, 1], Win_in 57)
(Position [11, 11, 11, 11, 11, 11], Win_in 43)
(Position [11, 11, 11, 11, 11, 11, 11], Win_in 41)
(Position [11, 11, 11, 11, 11, 11, 11, 11], Win_in 39)
(Position [11, 11, 11, 11, 11, 11, 11, 11, 11], Win_in 37)
(Position [11, 11, 11, 11, 11], Win_in 35)
(Position [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], Win_in 21)
(Position [10, 10, 10, 10, 10, 10, 10, 10, 4], Lose_in 56)
(Position [10, 10, 10, 10, 10, 10, 10, 10, 10], Win_in 55)
(Position [9, 9, 9, 9, 9, 9, 9, 9], Win_in 41)
(Position [8, 8, 8, 8, 8], Win_in 23)
(Position [8, 8, 8, 8, 8, 8], Win_in 15)
(Position [8, 8, 8, 8, 8, 8, 8], Win_in 13)
(Position [7, 7, 7, 7, 7, 7], Win_in 21)
(Position [6, 6, 6, 6, 2], Win_in 13)
(Position [6, 6, 6, 6, 6], Win_in 9)
(Position [5, 5, 5, 5], Win_in 5)
(Position [4, 4, 4, 4], Win_in 1)
(Position [4, 4, 4], Win_in 1)
(Position [4, 4], Win_in 1)
(Position [4], Win_in 1)

(Position [5, 2, 1, 1], Lose_in 2)
(Position [5, 5, 3], Lose_in 2)
(Position [6, 3, 3], Lose_in 2)

(Position [5, 3, 3, 2], Lose_in 4)
(Position [5, 5, 2, 2], Lose_in 4)
(Position [6, 2, 2, 1, 1], Lose_in 4)
(Position [6, 2, 2, 2], Lose_in 4)
(Position [6, 3, 1, 1, 1], Lose_in 4)
(Position [7, 2, 1, 1, 1, 1], Lose_in 4)
(Position [7, 4, 3], Lose_in 4)

(Position [6, 4, 3, 3, 2], Lose_in 6)
(Position [7, 2, 2, 2, 2], Lose_in 6)
(Position [7, 3, 2, 1, 1, 1], Lose_in 6)
(Position [7, 3, 2, 2], Lose_in 6)
(Position [7, 3, 3, 1, 1], Lose_in 6)
(Position [7, 3, 3, 2, 1, 1], Lose_in 6)
(Position [7, 4, 1, 1, 1], Lose_in 6)
(Position [7, 5, 3, 2], Lose_in 6)
(Position [7, 7, 4], Lose_in 6)
(Position [8, 2, 2, 1, 1, 1, 1], Lose_in 6)
(Position [8, 2, 2, 2, 1, 1], Lose_in 6)
(Position [8, 3, 1, 1, 1, 1, 1], Lose_in 6)
(Position [8, 4, 4], Lose_in 6)
(Position [9, 2, 1, 1, 1, 1, 1, 1], Lose_in 6)

(Position [6, 4, 4, 3, 3], Lose_in 8)
(Position [6, 6, 3, 3, 3], Lose_in 8)
(Position [6, 6, 4, 3, 2], Lose_in 8)
(Position [7, 3, 3, 3, 2, 2], Lose_in 8)
(Position [7, 4, 2, 2, 2, 2], Lose_in 8)
(Position [7, 4, 4, 2], Lose_in 8)
(Position [7, 5, 3, 3, 1, 1], Lose_in 8)
(Position [7, 7, 3, 3], Lose_in 8)
(Position [8, 3, 2, 2, 2], Lose_in 8)
(Position [8, 3, 3, 3], Lose_in 8)
(Position [8, 4, 2, 1, 1, 1], Lose_in 8)
(Position [8, 4, 2, 2], Lose_in 8)
(Position [8, 5, 1, 1, 1], Lose_in 8)
(Position [8, 5, 4, 2], Lose_in 8)
(Position [9, 2, 2, 2, 2, 1, 1], Lose_in 8)
(Position [9, 2, 2, 2, 2, 2], Lose_in 8)
(Position [9, 3, 2, 1, 1, 1, 1, 1], Lose_in 8)
(Position [9, 3, 2, 2, 1, 1, 1], Lose_in 8)
(Position [9, 4, 1, 1, 1, 1, 1], Lose_in 8)
(Position [9, 4, 4, 1, 1], Lose_in 8)
(Position [9, 5, 3, 1, 1, 1, 1], Lose_in 8)
(Position [9, 5, 4], Lose_in 8)
(Position [10, 2, 2, 1, 1, 1, 1, 1, 1], Lose_in 8)
(Position [10, 2, 2, 2, 1, 1, 1, 1], Lose_in 8)
(Position [10, 3, 1, 1, 1, 1, 1, 1, 1], Lose_in 8)
(Position [11, 2, 1, 1, 1, 1, 1, 1, 1, 1], Lose_in 8)

by Ken ( at December 15, 2017 08:05 AM

December 14, 2017

Mike Izbicki

how to cheat at settlers by loading the dice

how to cheat at settlers by loading the dice
(and prove it with p-values)

posted on 2017-12-14

tl;dr This post shows how to create loaded dice, and how to use these dice to gain between 5-15 additional resource cards per game of Settlers of Catan. Surprisingly, we’ll prove that standard scientific tests are not powerful enough to determine that the dice are unfair while playing a game. This essentially means that it’s impossible for your opponents to scientifically prove that you’re cheating. This impossibility is due to methodological defects in the current state of scientific practice, and we’ll highlight some ongoing work to fix these defects.

Loading the dice

My copy of Settlers of Catan came with two normal wooden dice. To load these dice, I placed them in a small plate of water overnight, leaving the 6 side exposed.

The submerged area absorbed water, becoming heavier. My hope was that when rolled, the heavier wet sides would be more likely to land face down, and the lighter dry side would be more likely to land face up. So by leaving the 6 exposed, I was hoping to create dice that roll 6’s more often.

This effect is called the bias of the dice. To measure this bias, my wife and I spent the next 7 days rolling dice while eating dinner. (She must love me a lot!)

In total, we rolled the dice 4310 times. The raw results are shown below.

1 2 3 4 5 6
number of rolls 622 698 650 684 666 812
probability 0.151 0.169 0.157 0.165 0.161 0.196

Looking at the data, it’s “obvious” that our dice are biased: The 6 gets rolled more times than any of the other numbers. Before we prove this bias formally, however, let’s design a strategy to exploit this bias while playing Settlers of Catan.

A strategy for loaded dice

The key to winning at Settlers of Catan is to get a lot of resources. We want to figure out how many extra resources we can get using our biased dice.

First, let’s quickly review the rules. Each settlement is placed on the corner of three tiles, and each tile has a number token. Whenever the dice are rolled, if they add up to one of the numbers on the tokens, you collect the corresponding resource card. For example:

A good settlement will be placed next to numbers that will be rolled often.

To make strategizing easier, the game designers put helpful dots on each token below the number. These dots count the ways to roll that token’s number using two dice.

We can use these dots to calculate the probability of rolling each number. For example, a \(4\) can be rolled in three ways. If we name our two dice \(A\) and \(B\), then the possible combinations are \((A=1,B=3)\), \((A=2,B=2)\), \((A=3,B=1)\). To calculate the probability of rolling a 4, we calculate the probability of each of these rolls and add them together. For fair dice, the probability of every roll is the same \((1/6)\), so the calculation is:

\[\begin{align} Pr(A+B = 4) &= Pr(A = 1)Pr(B=3) + Pr(A=2)Pr(B=2) + Pr(A=3)Pr(B=1) \\ &= (1/6)(1/6) + (1/6)(1/6) + (1/6)(1/6) \\ &= 1/12 \\ &\approx 0.08333 \end{align}\]

For our biased dice, the probability of each roll is different. Using the numbers from the table above, we get:

\[\begin{align} Pr(A+B = 4) &= Pr(A = 1)Pr(B=3) + Pr(A=2)Pr(B=2) + Pr(A=3)Pr(B=1) \\ &= (0.151)(0.157) + (0.169)(0.169) + (0.157)(0.151) \\ &= 0.07597 \end{align}\]

So rolling a \(4\) is now less likely with our biased dice. Performing this calculation for each possible number gives us the following chart.

All the numbers below \(7\) are now less likely, and the numbers above 7 are now more likely. The shift is small, but it has important strategic implications.

Consider the two initial settlement placements below.

The naughty player knows that the dice are biased and puts her settlements on locations with high numbers, but the nice player doesn’t know the dice are biased and puts her settlements on locations with low numbers. Notice that if the dice were fair, both settlement locations would be equally good because they have the same number of dots.

The following formula calculates the average number of cards a player receives on each dice roll:

\[ \text{expected cards per roll} = \sum_{\text{adjacent tokens}} Pr(A+B=\text{token value}) \]

Substituting the appropriate values gives us the following results.

expected cards per roll
naughty nice
fair dice 0.500 0.500
biased dice 0.543 0.457

So the difference between the naughty and nice player is \(0.086\) cards per roll of the biased dice. A typical game of Settlers contains about 60 dice rolls (about 15 turns per player in a 4 player game), so this results in \(0.086*60=5.16\) more cards for the naughty player.

And this is only considering the two starting settlements. As the game progresses, more settlements will be built, and some settlements will be upgraded to cities (which receive two cards per roll instead of one). Calculating the exact effect of these additional sources of cards is difficult because these improvements will be built at random points throughout the game. We’ll have to make some additional assumptions.

If we assume that the naughty player gets 0.043 more cards per roll per settlement/city than the nice player (this exact number will vary depending on the quality of the settlement), and that both players build settlement/cities at turns 10,20,25,30,35,40,45, and 50, then the naughty player will on average receive 15.050 more cards than the nice player.

To summarize, the naughty player will receive somewhere between 5 and 15 more resource cards depending on how their future settlements and cities are built. This advantage can’t guarantee a victory, but it’ll definitely help.

A scientific analysis

Now we’re going to do some simple statistics to prove two things:
  1. The dice really are biased. So the fact that the 6 was rolled more times than the other numbers wasn’t just due to random chance.
  2. There are not enough dice rolls in a game of Settlers for our opponents to scientifically prove that the dice are biased. So it’s scientifically impossible for our opponents to know that we’re cheating.

To show that the dice are biased, we will use a standard scientific technique called null hypothesis significance testing. We begin by assuming a hypothesis that we want to disprove. In our case, we assume that the dice are not biased. In other words, we assume that each number on the dice has a \(1/6\approx 0.166\) chance of being rolled. Our goal is to show that under this assumption, the number of 6’s rolled above is very unlikely. We therefore conclude that our hypothesis is also unlikely, and that the dice probably are in fact biased.

More formally, we let \(X\) be a random variable that represents the total number of 6’s we would roll if we were to repeat our initial experiment with fair dice. Then \(X\) follows a binomial distribution whose density is plotted below.
The \(p\)-value for our experiment is defined informally to be the probability of getting results similar to the results we observed if the dice are not biased. The formal definition and formula is \[\begin{equation} p\text{-value}= Pr(X\ge k) = %1-\sum_{i=0}^k {4310\choose 812} (1/6)^i(1-1/6)^{n-i} 1-\sum_{i=0}^k {n\choose i} q^i(1-q)^{n-i} , \end{equation}\]

where \(n\) is the total number of dice rolls (4310), \(k\) is the number of 6’s actually rolled (812), and \(q\) is the assumed probability of rolling a 6 (1/6). Substituting these numbers gives us \[ p\text{-value}= Pr(X\ge k) \approx 0.0000884 . \] In other words, if we repeated this experiment one million times with fair dice, we would expect to get results similar to the results we actually got only 88 times. Since this is so unlikely, we conclude that our original assumption (that the dice are not biased) is probably false. Most science classes teach that \(p\)-values less than 0.05 are “significant.” We are very far below that threshold, so our result is “very significant.”

Our \(p\)-value is so low because the number of trials we conducted was very large \((n=4310)\). In a typical game of Settlers, however, there will be many fewer trials. This makes it hard for our opponents to prove that we’re cheating.

We said before that there are 60 dice rolls in a typical game. Since we have two dice, that means \(n=120\). To keep the math simple, we’ll assume that we role an average number of 6’s. That is, the number of sixes rolled during the game is \[ k=812\cdot \frac{120}{4310}\approx23. \] Substituting into our formula for the \(p\)-value, we get \[ p\text{-value}=P(X\ge k) \approx 0.265 . \] In words, this means that if the dice were actually fair, then we would still role this number of 6’s \(26.5\%\) of the time. Since this probability is so high, the standard scientific protocol tells us to conclude that we have no “significant” evidence that the dice are biased. (Notice that this is subtly different from having evidence that the dice are not biased! Confusing these two statements is a common mistake, even for trained phd scientists, and especially for medical doctors.)

So how many games can we play without getting caught? It turns out that if we play 6 games (so \(n=6*120=720\), and \(k=812\cdot(720/4310)\approx136\)), then the resulting \(p\)-value is 0.05. In other words, as long as we play fewer than 6 games, then our opponents won’t have enough data to conclude that their measurements of the biased dice are “significant.” The standard scientific method won’t prove we’re cheating.

Some flaws with the \(p\)-value and “significance”

The \(p\)-value argument above is how most scientists currently test their hypotheses. But there’s some major flaws with this approach. For example:

  1. The \(p\)-value test doesn’t use all the available information. In particular, our opponents may have other reasons to believe that the dice are loaded. If you look closely at the dice, you’ll notice some slight discoloration where it was submerged in water.

    This discoloration was caused because the water spread the ink on the die’s face. If you see similar discoloration on the dice in your game, it makes sense to be extra suspicious about the dice’s bias.

    Unfortunately, there’s no way to incorporate this suspicion into the \(p\)-value analysis we conducted above. An alternative to the \(p\)-value called the bayes factor can incorporate this prior evidence. So if our opponent uses a bayes factor analysis, they may be able to determine that we’re cheating. The bayes factor is more complicated than the \(p\)-value, however, and so it is not widely taught to undergraduate science majors. It is rarely even used in phd-level scientific publications, and many statisticians are calling for increased use of these more sophisticated analysis techniques.

  2. Another weakness of the \(p\)-value test is that false positives are very common. Using the standard significance threshold of \(p\le0.05\) means that 5 of every 100 games will have “significant” evidence that the dice are biased to role 6’s. Common sense, however, tells us that cheating at Settlers of Catan is almost certainly not this common because most people just don’t want to cheat. But when you run many experiments, some of them will give “significant” results just by random chance. This is one of the many reasons why some scientists have concluded that most published research is false. This effect is thought to be one of the reasons that evidence of extra sensorial perception (ESP) continues to be published in scientific journals. Some less scrupulous scientists exploit this deficiency in a process called p-hacking to make their research seem more important.

    To alleviate the problem of false positives, a group of statisticians is proposing a new significance threshold of \(p\le0.005\) for a result to qualify as “significant”. While this reduces the risk of false positives, it also makes detecting true effects harder. Under this new criterion, we’d have to play 16 games (for \(n=1920\) dice roles) to get statistically significant evidence that the dice are biased.

At this point, you might be feeling overwhelmed at the complexity of statistical analysis. And this is just for the toy problem of detecting loaded dice in a game. Real world problems like evaluating the effectiveness of chemotherapy drugs are much more complicated, and so require much more complicated statistical analyses. Doing science is hard!

Edit after peer review: Vijay Lulla sent me the following message:

The blog mentions that you rolled the dice 4310 times and all your calculations are based on it, but the frequency table adds up to 4312.

Whooops! It looks like a messed up my addition. Fortunately, this mistake is small enough that it won’t affect any of the numbers in the article by much.

A lot of people mistakenly think that peer review is where other scientists repeat an experiment to test the conclusion. But that’s not the case. The purpose for peer review is for scientists like Vijay to just do a sanity check on the whole procedure to make sure obvious mistakes like this get caught. Sadly, another commonly made mistake in science is that researchers don’t publish their data, so there’s no way for checks like this to be performed.

If this were a real publication in a scientific journal, I would redo all the calculations. But since it’s not, I’ll leave the mistake for posterity.

Edit 2: There’s a good discussion on reddit’s /r/statistics. This discussion provides a much more nuanced view about significance testing than my discussion above, and a few users point out ways that I might be overstating some conclusions.

December 14, 2017 12:00 AM

December 12, 2017

Neil Mitchell

Benchmarking strchr vs memchr

Summary: memchr is faster, but the obvious implement seems to beat the builtin versions.

There are two related C functions for finding the next character in a string - strchr which assumes the string has a NUL character at the end, and memchr which takes the string length as an argument. For strings where you have the size and a NUL terminator, which is fastest? Using gcc 6.2.0 64bit MSYS2 on Windows 10, searching for a single byte 10M bytes along a string, the times were (fastest to slowest):

Trying on 3 different Windows computers, the results are all similar (but scaled).

Given the choice, you should prefer memchr over strchr.

Surprise result

The optimised implementations shipped with GCC are slower than the obvious C implementations taken from a wiki. I have absolutely no idea why. From what I can tell, the builtin versions are coded in assembly, operating on multiple bytes at a time, using SSE instructions. In contrast, the C variants operate on a single byte at a time, and aren't vectorised by the optimiser according to Godbolt. If anyone has an explanation I'd be keen to hear it.

Benchmark Code

To benchmark the variants I wrote a Haskell program using criterion. The full code and build instructions are available in this gist. I compiled the C code with -O3, using the gcc shipped with GHC 8.2.1. I've reproduced the Haskell code below, with some comments:

-- Import all the necessary pieces
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import Criterion.Main
import Foreign
import Foreign.C.Types
import Data.Monoid

-- Make all the C imports
foreign import ccall unsafe "string.h memchr" memchr_std :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe "string.h strchr" strchr_std :: Ptr Word8 -> CInt -> IO (Ptr Word8)
foreign import ccall unsafe memchr_c :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe strchr_c :: Ptr Word8 -> CInt -> IO (Ptr Word8)

-- Method for ignoring the size when using strchr
ignoreSize f a b _ = f a b

-- Build a suitable string with an interesting character i bytes along
cstr i = BS.replicate i 32 <> BS.singleton 64 <> BS.replicate i 32 <> BS.singleton 0

-- The functions to benchmark
funs =
[("memchr_std", memchr_std)
,("strchr_std", ignoreSize strchr_std)
,("memchr_c", memchr_c)
,("strchr_c", ignoreSize strchr_c)]

-- The main function, using Criterion
main = defaultMain
[ seq bs $ bench (show i ++ " " ++ name) $ whnfIO $ test fun bs
| i <- [1,10,100,1000,10000,100000,1000000,10000000]
, let bs = cstr i
, (name, fun) <- funs]

-- The function under test and input string
{-# NOINLINE test #-}
test fun bs =
BS.unsafeUseAsCStringLen bs $ \(ptr,len) ->
fun (castPtr ptr) 64 (fromIntegral len)

by Neil Mitchell ( at December 12, 2017 04:56 PM

December 11, 2017

Jeremy Gibbons

Streaming Arithmetic Coding

In the previous post we saw the basic definitions of arithmetic encoding and decoding, and a proof that decoding does indeed successfully retrieve the input. In this post we go on to show how both encoding and decoding can be turned into streaming processes.

Producing bits

Recall that

\displaystyle  \begin{array}{@{}l} \mathit{encode}_0 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow \mathit{Rational} \\ \mathit{encode}_0\;m = \mathit{pick} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \vrule width0pt depth2ex \\ \mathit{decode}_0 :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_0\;m\;x = \mathit{unfoldr}\;\mathit{step}\;(m,x) \end{array}

Encoding and decoding work together. But they work only in batch mode: encoding computes a fraction, and yields nothing until the last step, and so decoding cannot start until encoding has finished. We really want encoding to yield as the encoded text a list of bits representing the fraction, rather than the fraction itself, so that we can stream the encoded text and the decoding process. To this end, we replace {\mathit{pick} :: \mathit{Interval} \rightarrow \mathit{Rational}} by {\mathit{pick}_2 = \mathit{fromBits} \cdot \mathit{toBits}}, where

\displaystyle  \begin{array}{@{}l} \mathbf{type}\;\mathit{Bit} = \mathit{Integer} - \mbox{\quad 0 or 1 only} \vrule width0pt depth2ex \\ \mathit{toBits} :: \mathit{Interval} \rightarrow [\mathit{Bit}] \\ \mathit{fromBits} :: [\mathit{Bit}] \rightarrow \mathit{Rational} \end{array}

The obvious definitions have {\mathit{toBits}\;i} yield the shortest binary expansion of any fraction within {i}, and {\mathit{fromBits}} evaluate this binary expansion. However, we don’t do quite this—it turns out to prevent the streaming condition from holding—and instead arrange for {\mathit{toBits}} to yield the bit sequence that when extended with a 1 yields the shortest expansion of any fraction within {i} (and indeed, the shortest binary expansion necessarily ends with a 1), and {\mathit{fromBits}} compute the value with this 1 appended.

\displaystyle  \begin{array}{@{}l} \mathit{fromBits} = \mathit{foldr}\;\mathit{pack}\;(\frac 1 2) \vrule width0pt depth2ex \\ \mathit{pack} :: \mathit{Bit} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{pack}\;b\;x = (b + x) / 2 \vrule width0pt depth2ex \\ \mathit{toBits} = \mathit{unfoldr}\;\mathit{nextBit} \vrule width0pt depth2ex \\ \mathit{nextBit} :: \mathit{Interval} \rightarrow \mathsf{Maybe}\;(\mathit{Bit}, \mathit{Interval}) \\ \mathit{nextBit}\;(l,r) \\ \begin{array}[t]{@{\quad}clcl} | & r \le \frac 1 2 &=& \mathit{Just}\;(0, (0, \frac 1 2) \mathbin{\triangleleft} (l,r)) \\ | & \frac 1 2 \le l &=& \mathit{Just}\;(1, (\frac 1 2,1) \mathbin{\triangleleft} (l,r)) \\ | & \mathbf{otherwise} &=& \mathit{Nothing} \end{array} \end{array}

Thus, if {r \le \frac 1 2} then the binary expansion of any fraction within {[l,r)} starts with 0; and similarly, if {\frac 1 2 \le l}, the binary expansion starts with 1. Otherwise, the interval {[l,r)} straddles {\frac 1 2}; the shortest binary expansion within is it the expansion of {\frac 1 2}, so we yield the empty bit sequence.

Note that {\mathit{pick}_2 = \mathit{fromBits} \cdot \mathit{toBits}} is a hylomorphism, so we have

\displaystyle  \begin{array}{@{}l} \mathit{pick}_2\;(l,r) \\ \begin{array}[t]{@{\quad}clcl} | & r \le \frac 1 2 &=& \mathit{pick}_2\;((0,\frac 1 2) \mathbin{\triangleleft} (l,r)) / 2 \\ | & \frac 1 2 \le l &=& (1 + \mathit{pick}_2\;((\frac 1 2,1) \mathbin{\triangleleft} (l,r))) / 2 \\ | & \mathbf{otherwise} &=& \frac 1 2 \end{array} \end{array}

Moreover, it is clear that {\mathit{toBits}} yields a finite bit sequence for any non-empty interval (since the interval doubles in width at each step, and the process stops when it includes {\frac 1 2}); so this equation serves to uniquely define {\mathit{pick}_2}. In other words, {\mathit{nextBit}} is a recursive coalgebra. Then it is a straightforward exercise to prove that {i \ni \mathit{pick}_2\;i}; so although {\mathit{pick}} and {\mathit{pick}_2} differ, they are sufficiently similar for our purposes.

Now we redefine encoding to yield a bit sequence rather than a fraction, and decoding correspondingly to consume that bit sequence:

\displaystyle  \begin{array}{@{}l} \mathit{encode}_1 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Bit}] \\ \mathit{encode}_1\;m = \mathit{toBits} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \vrule width0pt depth2ex \\ \mathit{decode}_1 :: \mathit{Model} \rightarrow [\mathit{Bit}] \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_1\;m = \mathit{decode}_0\;m \cdot \mathit{fromBits} \end{array}

That is, we move the {\mathit{fromBits}} part of {\mathit{pick}_2} from the encoding stage to the decoding stage.

Streaming encoding

Just like {\mathit{encode}_0}, the new version {\mathit{encode}_1} of encoding consumes all of its input before producing any output, so does not work for encoding infinite inputs, nor for streaming execution even on finite inputs. However, it is nearly in the right form to be a metamorphism—a change of representation from lists of symbols to lists of bits. In particular, {\mathit{narrow}} is associative, and {\mathit{unit}} is its unit, so we can replace the {\mathit{foldr}} with a {\mathit{foldl}}:

\displaystyle  \mathit{encode}_1\;m = \mathit{unfoldr}\;\mathit{nextBit} \cdot \mathit{foldl}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m

Now that {\mathit{encode}_1} is in the right form, we must check the streaming condition for {\mathit{narrow}} and {\mathit{nextBit}}. We consider one of the two cases in which {\mathit{nextBit}} is productive, and leave the other as an exercise. When {r \le \frac 1 2}, and assuming {\mathit{unit} \supseteq (p,q)}, we have:

\displaystyle  \begin{array}{@{}cl} & \mathit{nextBit}\;((l,r) \mathbin{\triangleright} (p,q)) \\ = & \qquad \{ \mathit{narrow} \} \\ & \mathit{nextBit}\;(\mathit{weight}\;(l,r)\;p, \mathit{weight}\;(l,r)\;q) \\ = & \qquad \{ (l,r) \ni \mathit{weight}\;(l,r)\;q \mbox{, so in particular } \mathit{weight}\;(l,r)\;q < r \le \frac 1 2 \} \\ & \mathit{Just}\;(0, (0, \frac 1 2) \mathbin{\triangleleft} ((l,r) \mathbin{\triangleright} (p,q))) \\ = & \qquad \{ \mathit{widen} \mbox{ associates with } \mathit{narrow} \mbox{ (see below)} \} \\ & \mathit{Just}\;(0, ((0, \frac 1 2) \mathbin{\triangleleft} (l,r)) \mathbin{\triangleright} (p,q)) \end{array}

as required. The last step is a kind of associativity property:

\displaystyle  i \mathbin{\triangleleft} (j \mathbin{\triangleright} k) = (i \mathbin{\triangleleft} j) \mathbin{\triangleright} k

whose proof is left as another exercise. Therefore the streaming condition holds, and we may fuse the {\mathit{unfoldr}} with the {\mathit{foldl}}, defining

\displaystyle  \begin{array}{@{}l} \mathit{encode}_2 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Bit}] \\ \mathit{encode}_2\;m = \mathit{stream}\;\mathit{nextBit}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \end{array}

which streams the encoding process: the initial bits are output as soon as they are fully determined, even before all the input has been read. Note that {\mathit{encode}_1} and {\mathit{encode}_2} differ, in particular on infinite inputs (the former diverges, whereas the latter does not); but they coincide on finite symbol sequences.

Streaming decoding

Similarly, we want to be able to stream decoding, so that we don’t have to wait for the entire encoded text to arrive before starting decoding. Recall that we have so far

\displaystyle  \mathit{decode}_1\;m = \mathit{decode}_0\;m \cdot \mathit{fromBits}

where {\mathit{decode}_0} is an {\mathit{unfoldr}} and {\mathit{fromBits}} a {\mathit{foldr}}. The first obstacle to streaming is that {\mathit{foldr}}, which we need to be a {\mathit{foldl}} instead. We have

\displaystyle  \textstyle \mathit{fromBits} = \mathit{foldr}\;\mathit{pack}\;(\frac 1 2)

Of course, {\mathit{pack}} is not associative—it doesn’t even have the right type for that. But we can view each bit in the input as a function on the unit interval: bit~0 is represented by the function {\mathit{weight}\;(0,\frac 1 2)} that focusses into the lower half of the unit interval, and bit~1 by the function {\mathit{weight}\;(\frac 1 2, 1)} that focusses into the upper half. The fold itself composes a sequence of such functions; and since function composition is associative, this can be written equally well as a {\mathit{foldr}} or a {\mathit{foldl}}. Having assembled the individual focussers into one composite function, we finally apply it to {\frac 1 2}. (This is in fact an instance of a general trick for turning a {\mathit{foldr}} into a {\mathit{foldl}}, or vice versa.) Thus, we have:

\displaystyle  \textstyle \mathit{fromBits}\;bs = \mathit{foldl}\;\mathit{focusf}\;\mathit{id}\;bs\;(\frac 1 2) \quad\mathbf{where}\; \mathit{focusf}\;h\;b = h \cdot \mathit{weight}\;(\mathit{half}\;b)

where {\mathit{half}} yields either the lower or the upper half of the unit interval:

\displaystyle  \begin{array}{@{}lcl} \multicolumn{3}{@{}l}{\mathit{half} :: \mathit{Bit} \rightarrow \mathit{Interval}} \\ \mathit{half}\;0 &=& (0, \frac 1 2) \\ \mathit{half}\;1 &=& (\frac 1 2, 1) \end{array}

In fact, not only may the individual bits be seen as focussing functions {\mathit{weight}\;(0, \frac 1 2)} and {\mathit{weight}\;(\frac 1 2, 1)} on the unit interval, so too may compositions of such functions:

\displaystyle  \begin{array}{@{}lcl} \mathit{id} &=& \mathit{weight}\;\mathit{unit} \\ \mathit{weight}\;i \cdot \mathit{weight}\;j &=& \mathit{weight}\;(i \mathbin{\triangleright} j) \end{array}

So any such composition is of the form {\mathit{weight}\;i} for some interval {i}, and we may represent it concretely by {i} itself, and retrieve the function via {\mathit{weight}}:

\displaystyle  \textstyle \mathit{fromBits}\;bs = \mathit{weight}\;(\mathit{foldl}\;\mathit{focus}\;\mathit{unit}\;bs)\;(\frac 1 2) \quad\mathbf{where}\; \mathit{focus}\;i\;b = i \mathbin{\triangleright} \mathit{half}\;b

So we now have

\displaystyle  \textstyle \mathit{decode}_1\;m = \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}\;m \cdot \mathit{foldl}\;\mathit{focus}\;\mathit{unit} \quad\mathbf{where}\; \mathit{prepare}\;m\;i = (m, \mathit{weight}\;i\;(\frac 1 2))

This is almost in the form of a metamorphism, except for the occurrence of the adapter {\mathit{prepare}\;m} in between the unfold and the fold. It is not straightforward to fuse that adapter with either the fold or the unfold; fortunately, however, we can split it into the composition

\displaystyle  \mathit{prepare}\;m = \mathit{prepare}_2 \cdot \mathit{prepare}_1\;m

of two parts, where

\displaystyle  \begin{array}{@{}lcl} \multicolumn{3}{@{}l}{\mathit{prepare}_1 :: \mathit{Model} \rightarrow \mathit{Interval} \rightarrow (\mathit{Model}, \mathit{Interval})} \\ \mathit{prepare}_1\;m\;i &=& (m,i) \vrule width0pt depth2ex \\ \multicolumn{3}{@{}l}{\mathit{prepare}_2 :: (\mathit{Model}, \mathit{Interval}) \rightarrow (\mathit{Model}, \mathit{Rational})} \\ \mathit{prepare}_2\;(m,i) &=& (m, \mathit{weight}\;i\;(\frac 1 2)) \end{array}

in such a way that the first part {\mathit{prepare}_1} fuses with the fold and the second part {\mathit{prepare}_2} fuses with the unfold. For fusing the first half of the adapter with the fold, we just need to carry around the additional value {m} with the interval being focussed:

\displaystyle  \mathit{prepare}_1\;m \cdot \mathit{foldl}\;\mathit{focus}\;i = \mathit{foldl}\;\mathit{mfocus}\;(m,i)


\displaystyle  \begin{array}{@{}l} \mathit{mfocus} :: (\mathit{Model}, \mathit{Interval}) \rightarrow \mathit{Bit} \rightarrow (\mathit{Model}, \mathit{Interval}) \\ \mathit{mfocus}\;(m,i)\;b = (m, \mathit{focus}\;i\;b) \end{array}

For fusing the second half of the adapter with the unfold, let us check the fusion condition. We have (exercise!):

\displaystyle  \begin{array}{@{}l} \mathit{step}\;(\mathit{prepare}_2\;(m, i)) = \mathit{fmap}\;\mathit{prepare}_2\;(\mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s \mathbin{\triangleleft} i))) \\ \qquad\mathbf{where}\;s = \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\frac 1 2)) \end{array}

where the {\mathit{fmap}} is the functorial action for the base functor of the {\mathsf{List}} datatype, applying just to the second component of the optional pair. We therefore define

\displaystyle  \begin{array}{@{}l} \mathit{stepi} :: (\mathit{Model}, \mathit{Interval}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model}, \mathit{Interval})) \\ \mathit{stepi}\;(m,i) = \mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s \mathbin{\triangleleft} i)) \\ \qquad\mathbf{where}\;s = \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\frac 1 2)) \end{array}

and have

\displaystyle  \mathit{step}\;(\mathit{prepare}_2\;(m, i)) = \mathit{fmap}\;\mathit{prepare}_2\;(\mathit{stepi}\;(m,i))

and therefore

\displaystyle  \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2 = \mathit{unfoldr}\;\mathit{stepi}

Note that the right-hand side will eventually lead to intervals that exceed the unit interval. When {j \supseteq i}, it follows that {\mathit{unit} \supseteq j \mathbin{\triangleleft} i}; but the unfolding process keeps widening the interval without bound, so it will necessarily eventually exceed the unit bounds. We return to this point shortly.

We have therefore concluded that

\displaystyle  \begin{array}{@{}lcl} \mathit{decode}_1\;m &=& \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}\;m \cdot \mathit{foldl}\;\mathit{focus}\;\mathit{unit} \\ &=& \mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2 \cdot \mathit{prepare}_1\;m \cdot \mathit{foldl}\;\mathit{focus}\;\mathit{unit} \\ &=& \mathit{unfoldr}\;\mathit{stepi} \cdot \mathit{foldl}\;\mathit{mfocus}\;(m,\mathit{unit}) \end{array}

Now we need to check the streaming condition for {\mathit{mfocus}} and {\mathit{stepi}}. Unfortunately, this is never going to hold: {\mathit{stepi}} is always productive, so {\mathit{stream}\;\mathit{stepi}\;\mathit{mfocus}} will only take production steps and never consume any input. The problem is that {\mathit{unfoldr}\;\mathit{stepi}} is too aggressive, and we need to use the more cautious flushing version of streaming instead. Informally, the streaming process should be productive from a given state {(m,i)} only when the whole of interval {i} maps to the same symbol in model {m}, so that however {i} is focussed by subsequent inputs, that symbol cannot be invalidated.

More formally, note that

\displaystyle  \mathit{unfoldr}\;\mathit{stepi} = \mathit{apo}\;\mathit{safestepi}\;(\mathit{unfoldr}\;\mathit{stepi})


\displaystyle  \begin{array}{@{}l} \mathit{safestepi} :: (\mathit{Model}, \mathit{Interval}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model}, \mathit{Interval})) \\ \mathit{safestepi}\;(m,i) \\ \begin{array}[t]{@{\quad}clcl} | & \mathit{safe}\;(m,i) &=& \mathit{stepi}\;(m,i) \\ | & \mathbf{otherwise} &=& \mathit{Nothing} \end{array} \end{array}


\displaystyle  \begin{array}{@{}l} \mathit{safe} :: (\mathit{Model},\mathit{Interval}) \rightarrow \mathit{Bool} \\ \mathit{safe}\;(m, i) = \mathit{encodeSym}\;m\;s \supseteq i \quad\mathbf{where}\; s = \mathit{decodeSym}\;m\;(\mathit{midpoint}\;i) \end{array}

That is, the interval {i} is “safe” for model {m} if it is fully included in the encoding of some symbol {s}; then all elements of {i} decode to {s}. Then, and only then, we may commit to outputting {s}, because no further input bits could lead to a different first output symbol.

Note now that the interval remains bounded by unit interval during the streaming phase, because of the safety check in {\mathit{safestepi}}, although it will still exceed the unit interval during the flushing phase. However, at this point we can undo the fusion we performed earlier, “fissioning{\mathit{unfoldr}\;\mathit{stepi}} into {\mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2} again: this manipulates rationals rather than intervals, so there is no problem with intervals getting too wide. We therefore have:

\displaystyle  \mathit{decode}_1\;m = \mathit{apo}\;\mathit{safestepi}\;(\mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2) \cdot \mathit{foldl}\;\mathit{mfocus}\;(m,\mathit{unit})

Now let us check the streaming condition for {\mathit{mfocus}} and the more cautious {\mathit{safestepi}}. Suppose that {(m,i)} is a productive state, so that {\mathit{safe}\;(m,i)} holds, that is, all of interval {i} is mapped to the same symbol in {m}, and let

\displaystyle  \begin{array}{@{}lcl} s &=& \mathit{decodeSym}\;m\;(\mathit{midpoint}\;i) \\ m' &=& \mathit{newModel}\;m\;s \\ i' &=& \mathit{encodeSym}\;m\;s \mathbin{\triangleleft} i \end{array}

so that {\mathit{safestepi}\;(m,i) = \mathit{Just}\;(s, (m',i'))}. Consuming the next input {b} leads to state {(m, \mathit{focus}\;i\;b)}. This too is a productive state, because {i \supseteq \mathit{focus}\;i\;b} for any {b}, and so the whole of the focussed interval is also mapped to the same symbol {s} in the model. In particular, the midpoint of {\mathit{focus}\;i\;b} is within interval {i}, and so the first symbol produced from the state after consumption coincides with the symbol {s} produced from the state before consumption. That is,

\displaystyle  \mathit{safestepi}\;(\mathit{mfocus}\;(m,i)\;b) = \mathit{Just}\;(s, \mathit{mfocus}\;(m', i')\;b)

as required. We can therefore rewrite decoding as a flushing stream computation:

\displaystyle  \begin{array}{@{}l} \mathit{decode}_2 :: \mathit{Model} \rightarrow [\mathit{Bit}] \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_2\;m = \mathit{fstream}\;\mathit{safestepi}\;(\mathit{unfoldr}\;\mathit{step} \cdot \mathit{prepare}_2)\;\mathit{mfocus}\;(m,\mathit{unit}) \end{array}

That is, initial symbols are output as soon as they are completely determined, even before all the input bits have been read. This agrees with {\mathit{decode}_1} on finite bit sequences.

Fixed-precision arithmetic

We will leave arithmetic coding at this point. There is actually still quite a bit more arithmetic required—in particular, for competitive performance it is important to use only fixed-precision arithmetic, restricting attention to rationals within the unit interval with denominator {2^k} for some fixed~{k}. In order to be able to multiply two numerators using 32-bit integer arithmetic without the risk of overflow, we can have at most {k=15}. Interval narrowing now needs to be approximate, rounding down both endpoints to integer multiples of {2^{-k}}. Care needs to be taken so that this rounding never makes the two endpoints of an interval coincide. Still, encoding can be written as an instance of {\mathit{stream}}. Decoding appears to be more difficult: the approximate arithmetic means that we no longer have interval widening as an exact inverse of narrowing, so the approach above no longer works. Instead, our 2002 lecture notes introduce a “destreaming” operator that simulates and inverts streaming: the decoder works in sympathy with the encoder, performing essentially the same interval arithmetic but doing the opposite conversions. Perhaps I will return to complete that story some time…

by jeremygibbons at December 11, 2017 01:32 PM

Philip Wadler

Simplicity and Michelson


Only once in my life have I encountered a programming language that was too simple to use. That was Lispkit Lisp, developed by Peter Henderson, Geraint Jones, and Simon Jones, which I saw while serving as a postdoc at Oxford, 1983–87, and which despite its simplicity was used to implement an entire operating system. It is an indightment of the field of programming languages that I have not since encountered another system that I consider too simple. Until today. I can now add a second system to the list of those that are too simple, the appropriately-titled Simplicity, developed by Russell O'Connor of Blockstream. It is described by a paper hereand a website here.
The core of Simplicity consists of just nine combinators: three for products (pair, take, and drop), three for sums (injl, injr, and case), one for unit (unit), and two for plumbing (iden and comp). It is throughly grounded in ideas from the functional programming, programming language, and formal methods communities.
When I call Simplicity too simple it is intended as a compliment. It is delightful to see full adders and cryptographic hash functions cobbled together using just products, sums, and units. It is eye-opening to see how far one can get without recursion or iteration, and how this enables simple analyses of the time and space required to execute a program. It is a confirmation to see a system with foundations in category theory and sequent calculus. Now I know what to say when developers respond to my talk "Categories for the Working Hacker" by asking "But how can we use this in practice?"
The system is accompanied by a proof of its correctness in Coq, which sets a high bar for competing systems. O'Connor even claims to have a proof in Coq that the Simplicity implementation of SHA-256 matches the reference specification provided by Andrew Appel's Verified Software Toolchain project (VST), which VST proved corresponds to the OpenSSL implementation of SHA-256 in C.
At IOHK, I have been involved in the design of Plutus Core, our own smart contract scripting language, working with Darryl McAdams, Duncan Coutts, Simon Thompson, Pablo Lamela Seijas, and Grigore Rosu and his semantics team. We have a formal specification which we are preparing for release. O'Connor's work on Simplicity has caused us to rethink our own work: what can we do to make it simpler? Thank you, Russell!
That said, Simplicity is still too simple, and despite its emphasis on rigour there are some gaps in its description.


A 256-bit full adder is expressed with 27,348 combinators, meaning addition in Simplicity requires several orders of magnitude more work than the four 64-bit addition instructions one would normally use. Simplicity proposes a solution: any commonly used sequence of instructions may be abbreviated as a "jet", and implemented in any equivalent matter. Hence, the 27,348 combinators for the 256-bit full adder can be ignored, and replaced by the equivalent four 64-bit additions.
All well and good, but this is where it gets too simple. No one can afford to be inefficient by several orders of magnitude. Hence, any programmer will need to know what jets exist and to exploit them whenever possible. In this sense, Simplicity is misleadingly simple. It would be clearer and cleaner to define each jet as an opcode. Each opcode could still be specified by its equivalent in the other combinators of Simplicity, but programs would be more compact, faster to execute, and—most important—easier to read, understand, and analyse accurately. If one ignores jets, the analyses of time and space required to execute a program, given toward the end of the paper, will be useless—off by orders of magnitude. The list of defined jets is given nowhere in the paper. Nor could I spot additional information on Simplicity linked to from its web page or findable by a web search. More needs to be done before Simplicity can be used in practice.


It's not just the definition of jets which is absent from the paper, and cannot be found elsewhere on the web. Lots more remains to be supplied.
  • Sections 2.4, 2.5, 3.2 claim proofs in Coq, but apart from defining the semantics of the nine combinators in Appendix A, no Coq code is available for scrutiny.
  • Section 2.5 claims a representation of Simplicity terms as a dag, but it is not specified. Lacking this, there is no standard way to exchange code written in Simplicity.
  • Section 4.4 defines an extended semantics for Simplicity that can read the signature of the current transaction, support Merklised abstract syntax trees, and fail when a transaction does not validate. It also lifts meanings of core (unextended) Simplicity programs to the extended semantics. However, it says nothing about how the seven combinators that combine smaller Simplicity programs into bigger ones act in the extended semantics! It's not hard to guess the intended definitions, but worrying that they were omitted from a paper that aims for rigour.
  • Section 3 provides a Bit Machine to model the space and time required to execute Simplicity. The model is of limited use, since it ignores the several orders of magnitude improvement offered by jets. Further, the Bit Machine has ten instructions, enumerated on pages 10–12, but the list omits the vital "case" instruction which appears in Figure 2. Again, it's not hard to guess, but worrying it was omitted.


A second language for scripting blockchains is Michelson. It is described by a paper hereand a website here. (Oddly, the website fails to link to the paper.)
I will offer just one word on Michelson. The word is: "Why?"
Michelson takes many ideas from the functional programming community, including higher-order functions, data structures such as lists and maps, and static type safety. Currently, it is also much more thoroughly described and documented than Simplicity. All of this is to be commended.
But Michelson is an inexplicably low-level language, requiring the programmer to explicitly manipulate a stack. Perhaps this was done so that there is an obvious machine model, but Simplicity offers a far superior solution: a high-level model for programming, which compiles to a low-level model (the Bit Machine) to explicate time and space costs.
Or perhaps Michelson is low-level to improve efficiency. Most of the cost of evaluating a smart contract is in cryptographic primitives. The rest is cheap, whether compiled or interpreted. Saving a few pennies of electricity by adopting an error prone language—where there is a risk of losing millions of dollars in an exploit—is a false economy indeed. Premature optimisation is the root of all evil.
The language looks a bit like all the bad parts of Forth and Lisp, without the unity that makes each of those languages a classic. Lisp idioms such as CAAR and CDADAR are retained, with new ones like DUUP, DIIIIP, and PAAIAIAAIR thrown in.
There is a fair set of built-in datatypes, including strings, signed and unsigned integers, unit, product, sum, options, lists, sets, maps, and higher-order functions. But there is no way for users to define their own data types. There is no way to name a variable or a routine; everything must be accessed by navigating a data structure on the stack.
Some operations are specified formally, but others are left informal. For lists, we are given formal rewriting rules for the first three operators (CONS, NIL, IF_CONS) but not the last two (MAP, REDUCE). Type rules are given in detail, but the process of type inference is not described, leaving me with some questions about which programs are well typed and which are not. It reminds me of a standard problem one sees in early work by students—the easy parts are thoroughly described, but the hard parts are glossed over.
If I have understood correctly, the inference rules assign types that are monomorphic, meaning each term has exactly one type. This omits one of the most successful ideas in functional programming, polymorphic routines that act on many types. It means back to the bad old days of Pascal, where one has to write one routine to sort a list of integers and a different routine to sort a list of strings.
Several of these shortcomings are also shared by Simplicity. But whereas Simplicity is intended as a compilation target, not to be read by humans, the Michelson documentation includes a large collection of examples suggesting it is intended for humans to write and read.
Here is one of the simpler examples from the paper.
  { DUP ; CDAAR ; # T
IF { DUP ; CDADR ; # N
{ DUP ; CDDDR ; # B
PAIR } }
{ DUP ; CDDAR ; # A
PAIR } }
The comment # T is inserted as a reminder that CDAAR extracts variable T, and similarly for the other variables N, B, and A. This isn't the 1950s. Why don't we write T when we mean T, instead of CDAAR? WHY ARE WE WRITING IN ALL CAPS?
In short, Michelson is a bizarre mix of some of the best and worst of computing.


It is exciting to see ideas from the functional programming, programming languages, and formal methods communities gaining traction among cryptocurrencies and blockchains. While there are shortcomings, it is fantastic to see an appreciation of how these techniques can be applied to increase reliability—something which the multi-million dollar exploits against Ethereum show is badly needed. I look forward to participating in the conversations that ensue!


The conversation has begun! Tezos have put up a page to explain Why Michelson. I've also learned there is a higher-level language intended to compile into Michelson, called Liquidity.

by Philip Wadler ( at December 11, 2017 09:37 AM

December 09, 2017

Edward Z. Yang

Systems ML workshop panel

  • JG: Joseph Gonzalez
  • GG: Garth Gibson (CMU)
  • DS: Dawn Song (UC Berkeley)
  • JL: John Langford (Microsoft NY)j
  • YQ: Yangqing Jia (Facebook)
  • SB: Sarah Bird
  • M: Moderator
  • A: Audience

M: This workshop is bringing together ML and systems. Can you put your place on that spectrum? Who is your home community?

YJ: Right in the middle. I'd like to move more towards systems side, but Berkeley Parallel Labs kicked me out. ML is my home base.

JL: ML is where I come from, and where I will be, but I'm interested in systems. My home is NIPS and ICML

DS: My area is AI and security, did computer security in the past, now moving into AI.

GG: Systems.

JG: I started out in ML, working on probabilistic methods. I basically, in middle of PhD, looked at systems. Now I'm moving to being a systems person that does ML.

M: We've seen a proliferation of deep learning / ML frameworks that require a lot of dev effort, money, time to put in. Q, what is the role of academia of doing research in this area. What kind of large scale ML learning can you do.

GG: I liked YJ's answer last time.

YJ: The thing that is astonishing is that academia is the source of so many innovations. With all due respect, we did very good work in Google, but then Alex came out with 2 GPUs and nuked the field. Academia is the amazing place where we find all of the new ideas, and industry scale it out.

JL: Some examples. If you're coming from academia, maybe you don't have research at big company, but it's an advantage as you will spend time about the right algorithm for solving it efficiently. And that's what will win in the long run. Short term, they'll brute force with AutoML. Long run, the learning algorithms are going to be designed where tjhey won't have parameters. A common ML paper is "we eliminate this hyperparameter". When they're more automatic, more efficient, great things will happen. There's an advantage in being resource constrained, as you will solve things in the right way.

Another example is, the study of machine learning tells us that in thefuture we will regard any model that u just learned and deploy as inherently broken adn buggy as data collection is not part of process of training, deploying. It will decay and become irrelevant. The overall paradagim of ML where you're interacting with the world, and learning, that can be studied easy in academia, and that has huge implications about how you're going to design systems,

DS: People often talk about in a startup, the best thing is to not raise a ton of money; if you're resource constrained you're more focused and creative. ML is really broad, there's lots of problems. Right now we learn from lots of data, but lots of talks at NIPS, humans have amazing ability to learn from very few example. These are problems for academia to tackle, given unique resource constraints.

GG: I'll say, it's difficult to concentrate on top accuracy if you don't have enough data, and the data available to students is stuff like DAWNbench which tends to lag. In academia, we build relationships with industry, send students for internships, they get the ability to do big data, while exploring first principles in university. IT's a challenge, but open publishing and open sharing of code world more berable.

JG: The one thing I've struggled with is focusing on human resources. I have grad students; good students, focus on a key problem can make a lot of progress. We struggle with a lot of data. Struggle with RL really is here, we can build simulators to build at this scale. Being able to use simualtion to get data; be creative, find new and interesting problems.

M: Follow-up on process. I think a lot of you have tried to publish ML in your communities. Are they equipped to appreciate work properly; what is a common reason they don't appreciate.

JG: Publishing ML in systems, or vice versa, is hard. It goes both ways. These communities are not equipped to evaluate work in other field. ML in systems, where if you saw here, it was surprising. Or vice versa, wouldn't have done well in systems venue as systems. The failure mode I see, is systems community doesn't appreciate extreme complexity. In ML, I have this very sophisticated thing, and reducing them to their essential components. ML tries to overextend their complexity as an innovation. MOre broadly, each of these communities has their own biases how they look at research. One thing I've noticed, it's gotten better. Systems is better at evaluating, and at this workshop, people are pushing research in an advanced way.

GG: I'm old, so I've seen creation of conference before. So, you start off with an overlap of areas. In my prior life, it was the notion of storage as a research area, rather than app of devices. You start off, send submission in. The PC has two people that know anything about it, and they aren't assigned, and the reviews are sloppy, and you get one conference that do a little better, but other conferences don't read it. I faced this with fault tolerance, database, OS communities, they don't read each other's stuff. You get enough mass, get a conference that focuses in the middle; reviewing and PC that have seen most of the good work in the area. That's hard, but we're on the edge of doing it in SysML. We're doing the right thing to do competitive, on top of state of the art.

M: Is that the only solution, or can we mix up PCs?

GG: I've seen a lot of experiments to try it. You can end up with permanently fractured communities.

JL: Joey and Dawn are an area chair at ICML. I have found the ML community to be friendly to system type things. There's an area chair systems. Hopefully papers get assigned appropriately.

M: We're not good about that at systems.

DS: About ML and security, we have this problem. In security, we also have very small percentage of ML, and the committee, if you submit ML, it's very hard to find people who can review the paper, and as a consequence, the review quality varies highly. Similar in terms of security in ML, similar problems. It's interesting to think about why this happens and how to solve the problem. In general, sometimes the most interesting work is the interdisciplinary areas. ML and systems, security, and examples I see, including machine learning in systems... so, one thing I actually can understand is, within each community, even though the review quality varies, I can see from committee's perspective, really what they want is papers that are more meaningful to community, help people get exposed to this new area, fostering new exploration. That's part of natural progression. As time goes on, there's more cross pollonization.

JG: We are launching a SysML conference. I had a little bit of reservations: ML is getting better at systems, but now I have to decide where I'm going to send a paper. A lot of papers we see in ML is going to have systems.

GG: When you have a new conference area, not all work is sent there. Overlapping, you have a favorite conference, your heros, and you'll send your most exciting work to that root conference. No problem.

YJ: SysML is great, and this is how it comes out. New fields, it warrants new conferences.

M: Do you think ML expert needs to also be a systems expert? Does such a person who lies at that intersection have a different way of looking? Or you come up with a nice algorithm, and you

JL: It's not OK to have a wall.

There's many way learning algorithms can be changed. The problem with having a wall, if you don't understand, throw engineer. But if you can bridge to understand, they're not artifacts, you can break open and modify. That can let you achieve much better solutions.

GG: AGreed, but what happens initially is you reach over to other side, you put it into system, and it's my innovation that redundancy makes fault tolerance, even though it's fairly pedestrian from the other side. If it is a substantial improvement, it is worth doing. We all grow up.

JG: We need a wall, but we're going to constantly tear it down. Matlab in grad school, we made jokes about it, and MKL community would make it fast. Then they said we are going to build ML for distributed computing algorithms, and ML would write class algorithms for system. That waned in the dev of pytorch, TF, etc., which leveled up abstraction. The stack is building up again; systems community to make more efficient. Well, fp could change, and that could affect algorithm. So we're tearing it down again. But systems is about designing the wall.

YJ: It's more like a bar stool. It's a barrier, but we don't have to be both to do anything, but you need it to make it efficient. A story: a training system we looked at, SGD. That person found a very nicely rounded number: 100. But people frown, you should round to 128. Understanding and improving the common core for CS and engineering, that helps a lot for people to have good sense for how to design ML algorithms.

M: There's a lot of talk about democratizing AI, and all of you have helped that process. What is a truly democratic AI landscape look like, and how far are we from that world.

YJ: I plead guilty in participating in framework wars. When reading CS history, one thing that's pretty natural, when field is strating, there's all sorts of standards, protocols. FTP, Gopher, and now in the end HTTP took over, and everything runs on HTTP. Right now, there's all kinds of different abstractions; boiling it down, everyone is doing computation graph, optimization. I look forward to when we have one really nice graph representation, protocol for optimizing graphs. It's not a rosy dream, because in compilers we have that solution, LLVM. I don't know if we'll reach that state but I think one day we'll get there.

JL: You have AI/ML democratized when anyone can use it. What does that mean, a programmer has a library, or language constructs, which that they use routinely and easily; no issues of data getting mismatched or confused or biased. All the bugs people worry about in data science; those are removed from the system because the system is designed right and easy to use. The level beyond that is when somebody is using a system, that system is learning to adapt to you. There's huge room for improvement in how people interact. I don't know how often there's a rewrite rule driving me crazy; why can't it rewrite the way I want. People can signal info to a learning algorithm, and when those can be used effectively tpo assist people, you have democratized AI.

DS: I have a very different view of democratizing AI. I think it's interesting to think about what democratization here really means. For systems people, it's about making it easier for people to do learning, to use these libraries, platforms. But that's really just providing them with tools. For me, I give talks on demccratizing AI, we are looking at it from a completely different perspective. Code: even, whoever controls AI will control the world. So who controls AI? Even if you give everyone the tools, push a button, but they don't have the data to do the training. So who controls the AI today, and tomorrow? It's Facebook, Microsoft, Google... so for me, democratization means something totally different. Today, they collect data, train models, and they control who has action to model, and users can get recommendations, but not direct access to models. We have a project to actually democratize AI, where users can control their data. Combining blockchain and AI, where users can donate their data to a smart contract, where the smart contract will specify the terms; e.g., if you train a model, the user can use the model, and if the model produces profits, the user can get part of the profits. The smart contract can specify various incentive terms; e.g., if the data is vbetter than others, they can get more profits, and other mechanisms. A developer will supply the ML training algorithm, and get benefits when it is trained well. We are decentralizing th epower of AI; users will be able to get direct access to models and use them. In this case, I hope for an alternate future, where big companies can continue with business, but users by pooling their data in a decentralized fashion, will see actual true democratization of AI; they will access the power of AI. Not just use tools.


GG: I think that a lot of what's meant in democratizing AI is how can you move from a small number of people innovating, to a large number. Tool development and standards. We're close to being there. There was an example in the past, was VSLI paint boxes. Up until a certain point, only an EE could really develop hardware at all. They took a lot of effort and time to make sure it could make it through very part without very much crosstalk. a group came together and thought, well, there are some design rules. This lets you build hardware pretty easily. I could paint green/red boxes, hardware months later, worked. It never worked as fast as that EE guy, so there would always be a place for it, but it would let us build a RISC computer, and ship it. We were in the game, we could innvoate, and do it. The tools we're trying to build right now can build on statistical.

JG: When I started PhD, we did integrals and derivatives by hand. Automatic differentiation was a huge step forward. I blame that for the explosion of papers. A first year can build something far more complex than what I could do. That's moving AI forward, on algorithms side.

The data side is interesting, and that is one where I think about in systems. There's a lot of opportunities to think about how security interacts, leveraging hardware to protect it, markets to sell/buy data from sources, and protect the data across a lot of places. I would argue we're making a substantial amount of progress in how we think about algorithms.

M: When I think about democratizing pervasive AI, recent questions that have been consuming our minds, interpretability, fairness, etc. Can you share... any experience where things like interpretability came up and became a problem, issue, do we have to worry about a lot more in ML, or systems-ML.

JG: My grad students come to me and say the models stop working. I don't know how to fix that; the process is very experimental. Tracking experiments is a big part of the process. We cared a lot about interpretable models, and that meant something very particular. Now it's explainable; we don't need to know what it did exactly, but there needs tob e some connection to what we did. Interpretable, explain computation, it could be related or unrelated to the decision. That's two answers about explainability, and how we debug these systems.

GG: SOSP just happened, and they have ten years of... good copies of everything they submitted. At the end of the conference, Peter Chen took all the PDF files, and did a naive bayes classifier, and saw how well he would predict that it would be accepted. And half the things it predicted to be accepted, would be accepted.

So what did they do? They made ad etector for popular authors. And so what you did is those who had succeeded, they will follow behind. I recognize this problem. You might think that you found a good way, but it's actually Nicolai Zeldovich's paper.

DS: There's a big debate. Some think it's really important, and sometimes, as long as the model works, it's fine. Our brain, we can't really explain how we arrive at certain decisions, but it works fine. And it depends on application. Some applications have stronger requirements for explainability; e.g., law and healthcare, whereas in others it's less required. Also as a whole community, there's a lot we don't understand. We can dtalk about causality, transparenty, all related. As a whole community, we don't really understand what explainability means. Not a good definition. All these concepts are related, we're trying to figure out what's the real core. That's a really good open question.

JL: There's two different interpretations. Can you explain to a person? And that's limited; there's no explainable vision models. The other definition is debuggability. If you want to create complex systems, they need to be debuggable. This is nontrivial with a distributed system, it's nomntriival with ML. If you want to create nontrivial ML systems, yo uhave to figure out why they're not behaving the way you want it to.

DS: Do we debug our brains?

JL: Evolution has done this the hard way for a very long way... a lot of people have bugs in their brains. I know I have bugs. I get an ocular migraine sometimes... very annoying. No, we don't debug our brains, and it's a problem.

YJ: I'm suire there's bugs in my brains; I chased chickens in my grandma's house; the chicken has one spot in its back that if you press it, it just ducks and sits there. It shuts off because of fear. WE humans don't do that. But these bugs, are in our brain as well. Chasing for interpretability helps understand how things work. The old days, deep dream; this line of work started with figuring out what the gradients do, and we propagated back, and we found that direct gradient doesn't work; then we added L1 priors, and then we got pictures. This curiosity has lead to the fact that convnets with random weights are codifying the local correlation; we are hardcoding the structured info in CNNs which we didn't know before. So maybe we will not achieve full interpretability, but some amount of interpretability and creativity will help.

(audience questions)

A: I'd really like to hear what Jeff said about ML for systems. As systems, I'm interested in it, but people have said, you can get far with heuristics.

JL: I think it's exciting.

GG: The index databases, when I read it for reviewing, I went, "Wow! Is that possible?" I think things like that will change the way we do systems. The novelty of the application opens a lot of people's minds. Right now we think of the machine learning tools as being expensive things that repeat what humans do easily that computers don't do well. But that's not what DB index is. We can execute it, but we're not better. But to get it half the size and twice the speed, throwing in another way of thinking about compression through a predictor is a fabulous insight.

JG: I tried to publish in this area for a while. For a while, systems didn't like the idea of complex algorithms in the middle of their system. Now, these days, Systems is like, "ML is cool." But where it's easier to have success, you prediction improves the system, but a bad prediction doesn't break the system. So scheduling, that's good. Where models can boost performance but not hurt. The work in ML to solve systems is successful.

DS: ML for systems is super exciting. I'm personally very excited about this domain, esp. for people who have done systems work, and are interested in AI. ML for systems is an amazing domain of ML. I wouldn't be surprised, I would hope to see, in five years, our systems are more ML driven. A lot of systems have a lot of knobs to tune, trial and error setting, where exactly ML can help. On these amazing techniques, RL, bandits, instead of using bandits to serve ads, we can try to autotune systems. Just like we are seeing AI transforming a lot of application domains, and a lot more intelligent system, old systems, the one we built, should be more intelligent. It's a prediction: It hink we are going to see a lot of work in this domain. I think it will transform systems.

M: I work in this quite a bit. We have some successes with bandits in some settings, but there are settings that are really tough: stateful, choices, decisions influence the future, it makes it hard to apply RL, or the RL techniques take a lot of data. There are challenges, but there are successes. There are a lot of papers that apply RL in caching, resource allocation. The real question is why it's not used in production? I don't know if we have an answer to that, papers do it, it seems to be really good, but it's not that mainstream, esp. having RL all over the place. Why isn't it pervasive. That I don't see.

A: Isn't it because it's not verifiable. You want some kind of verification analysis.

GG: It's called a regression sweep. If you deploy on a lot of systems. There's a lot of money, it has to work. If it falls over, that's a lawsuit. I hired a VP of software. OK, now that I'm in charge, things are going to slow down. Every LoC is bugs, if I want low bug, I stop programmers from writing code, by making the bar very high. This is the thing JOy was talking about; they need a really compelling reason with no downsides, and then they have to pass tests before the pass. So anything stochastic has a high bar.

SB: Another thing that is happening, there aren't that many people who have understanding in both areas. It's really hard to do ML in systems without deep expertise in systems. You really need to understand to explain it.

GG: It wasn't that long since we didn't have hosted services.

M: Guardrails, you constrain the ML system to not suggest something bad. We have a scenario in MS, machines are unresponsive. How long to wait? You can do it in ML. The choices are reasonable, they're never more than the max you'd want to wait.

A: On democratization. There's been a lot of talk about optimizing the models so they can bear the cost. Another is decentralizing data... but there's two very big constraints for systems and models. They cost a lot of money, and there's big variance. Because of cost, if some guy gets into programming, and does research, he won't have resources to do it. So they won't go into engineering; they'll intern at Amazon instead. So if there is some community going into lowering the barrier, demoratizing, what solution is there to get people much more easily? Because there's huge economic costs. People are trying to make huge amounts of money, startups, but there's no... systems have faults with decentralization... there's just a big problem colliding and ML.

JG: We teach data, I teach data science at Berkeley. The summary is, what about the costs of getting into DL? There's cost to train models, GPUs, data, how do I get a freshman in college who is excited about this, chromebook, they can do research and explore opportunities. At Berkeley we have exactly this problem. I teach 200 students, a lot of them are freshmen, chromebook ipad as primary computer. We've built tools using Azure... we run a cloud in Azure, and on these devices they can experiment with models. They get to use pretrained models and appreciate how to ... Someone built a Russian Twitterbot detector, and saw value and opportunity in those. And then they got involved in research projects where they had more funds and tools.

JL: The right interfaces make a huge difference, because they prevent you from having bugs that prevent you from doing things. Also, DL, is all the rage, but framing the problem is more important than the representation you do. If you have the right problem, and a dumb representation, you'll still do something interesting. otherwise, it's just not going to work very well at all.

YJ: As industry, don't be afraid of industry and try it out. Back at Berkeley, when Berkeley AI was using GPUs, the requirement was that you have one project per GPU. We students, framed ten different projects, and we just asked for ten GPUs. NVIDIA came to us and asked, what are you donig. We'll just give you 40 GPUs and do research on that. Nowadays, FAIR has residency, and Google AI has residency, all of these things are creating very nice collaborations between industry and academia, and I want to encourage people to try it out. Industry has funds, academia has talent, marrying those together is an everlasting theme.

A: Going back to where do we go forward in terms of conferences, the future of this workshop; has any decision been made, where we go?

SB: This is work in progress. We're interested in feedback and what you think. We've had this workshop evolving for 10 yrs, with NIPS and iCML. Then we did one with SOSP, excciting. We are now doing a separate conference at Stanford in February. We think there's really an important role to play with workshops colocated with NIPS and ICML. We're still planning to conitnue this series of workshops. There's also a growing amount of systems work in ICML and NIPS, natural expansion to accept that work. The field is growing, and we're going to try several venues, and form a community. If people have ideas.

JG: More people should get involved.

M: We plan to continue this; audience is great, participation is great.

It's a panel, so I have to ask you to predict the future. Tell me something you're really excited... 50-100yrs from now. If you're alive then, I will find you and see if your prediction panned out. Or say what you hope will happen...

YJ: Today we write in Python. Hopefully, we'll write every ML model in one line. Classifier, get a cat.

JL: Right now, people are in a phase where they're getting more and more knobs in learning. ML is all about having less knobs. I believe the ML vision of less knobs. I also believe in democratizing AI. You are constantly turning ... around you, and devs can incorporate learning algorithms into systems. It will be part of tech. It's part of hype cycle. NIPS went through a phase transition. At some point it's gotta go down. When it becomes routine, we're democratizing things.

DS: It's hard to give predictions... I guess, right now, we see ML as an example, we see the waves. Not so long ago, there was the wave of NNs, graphical models, now we're back to NNs. I think... I hope that we... there's a plateauing. Even this year, I have been talking to a lot of great ML researchers, even though one can say there has been more papers written this year, when you hear what people talk about in terms of milestones, many people mentioned milestones from past years. AlexNet, ResNet, ... I do hope that we will see new innovation beyond deep learning. I do teach a DL class, but I hope that we see something beyond DL that can bring us... we need something more, to bring us to the next level.

GG: I'm tempted to point out DL is five years ago, and dotcom era was not more than five years... I think, I'm looking forward to a change in the way CS, science in general, does business, having learned from statistical AI. My favorite one is overfitting. I poorly understood overfitting, in vague stories, until ML hammered what this said. I look forward to the time when students tell me, they stopped writing code, because they were adding parameters... and they added a decent random, iid process for testing code. We're no where near there, but I think it's coming.

JG: I'm looking forward to the return of graphical models... actually not. When we're democratizing AI, but what ultimately happens, we're democratizing technology. I can walk up to Alexa and teach it. Or I can teach my Tesla how to park more appropriately. Tech that can adapt to us because it can learn; when I can explain to a computer what I want. (Star Trek but without a transporter.)

by Edward Z. Yang at December 09, 2017 02:17 AM

December 08, 2017

Edward Z. Yang

Accelerating Persistent Neural Networks at Datacenter Scale (Daniel Lo)

The below is a transcript of a talk by Daniel Lo on BrainWave, at the ML Systems Workshop at NIPS'17.

Deploy and serve accelerated DNNs at cloud scale. As we've seen, DNNs have enabled amazing applications. Architectures achieve SoTA on computer vision, language translation and speech recognition. But this is challenging to serve in large-scale interactive because there are latency, cost and power constraints. Also, DNNs are growing larger in size and complexity.

We've seen a Cambrian explosion in startups to solve this problem. Research groups have produced DNN processing units, DPUs, custom hardware solutions to prove high throughput efficient serving of DNNs. We categorize them into two categories: fast DPUs, where the algorithms and applications have to be fixed in at design time, because they're fabbing an ASIC, or a soft DPU, FPGA. But for soft DPUs, we haven't seen them deployed at scale.

To address this, we've been working on Project BrainWave. Solution to deploy large scale DNNs with FPGA-acceleration. We've designed it to be fast, flexible and friendly. High throughput, low latency acceleration using FPGAs. Flexibility with adaptive numerical precision, update to latest AI algorithms with reconfigurable FPGAs. And it's user friendly, because we have a full stack solution, compile CNTK/Caffe/TF and compile them down. This is deployed on our configurable cloud, an outer layer of CPUs, a data center that puts everything together, and a layer of reconfigurable FPGAs.

We've been deployed DNN models. LSTM model that takes tens to hundreds of milliseconds CPU. What we see is the 99th percentile for latency; even at 99 we are able to achieve sub-millisecond latencies. When you get to these levels of acceleration, it's negligible in the E2E pipeline.

Next I'll dive into details. It's a full stack solution. starting with a compiler and runtime that takes model sin high level frameworks and compiles them down to our architecture. A flexible ISA for serving DNNs. We have a throughput, low latency serving. We do this all with persistency at scale, to keep models pinned in FPGA memories. Deployed on our wide deployment of Intel FPGAs using hardware microservices.

To begin with, let's talk about hardware microservices. This is something we presented at Micro. The architecture of reconfigurable cloud is FPGAs sit between CPU and network. CPU can use FPGA locally for acceleration, but because FPGAs are connected over network, they can distribute between them. We have a proprietary network protocol for low latency compute.

We'vec disaggregated FPGA compute plane from CPU. So we can aggregate FPGAs together to form larger accelerators, and you don't have to match the rate of FPGAs to CPUs. You can serve a large number of CPUs with a small cluster of FPGAs, or vice versa.

Next I'll talk about the compiler and runtime. Goal is to make it very easy for ML specialists to do this. The typical ML specialist doesn't know how to program this. Models developed in high level frameworks, compile them down to our architecture. If you compile them down first into an intermediate graph based representation. We split them into portions split on FPGAs, and portions on CPU. When we execute, we also have runtime that handles orchestration and scheduling that handles it between parts.

There are two main categories of DNNs we have to optimize for. DNNs that have very high compute to data ratio, convnets, these are well studied. I'm going to focus on the other class of DNNs, those with less compute to data ratio, e.g. dense layers and RNNs.

The conventional approach to accelerating DNNs on FPGAs, you keep all model parameters in DRAM. When a request comes in, you're going to stream the model parameters of DRAM, and return a request. The issue with this is when you have DNN layers that are memory bandwidth bound, you're limited in how fast you can run this by memory bandwidth; you're not getting full compute capabilities of FPGA. Typically the way to solve this is with batching; you send a number of requests and use the model parameters for all requests. WHile you may achieve good throughput, latency will increase. For realtime services, this violates your SLA. What we want to do is provide high performance at low or no batching.

The way we do this is with persisted Dnets. FPGAs have lots of memory on chip: 10MB memory. Since they're on chip, it's high bandwidth. So we're going to keep the model parameters on the chip, so that when we get one request in, we distribute it across the entire FPGA chip.

The obvious question is, what happens if your model doesn't fit on chip? We take advantage of the hardware microcenter. We'll distribute a single model over multiple FPGAs in the datacenter.

Let's look at the architecture and microarchitecture of the processing unit we developed. The BrainWave DPU is a software programmable processor, programmed in single-threaded C, but we've added a number of instructions for serving DNNs, e.g., matrix multiply, convolution, nonlinear activations, embeddings. The processor is designed to use narrow precision format (float16) and easily flexible for extending to newer algorithms.

The microarchitecture of the processor, main portion is dedicated to matrix vector unit; matrix vector multiply, consisting of a number kernels on a tile of a larger matrix. Tiling gives us flexibility while maintaining performance. Other compute units are multifunction units; vector-vector operations, such as element-wise multiply, add and activation functions. Tying it all together is an on-chip network that lets us keep all the compute together at time.

Most of the chip is dedicated to matrix vector unit. It's composed of hundreds of multilane dot product units. Each of these dot product units is consists of tens of adds and muls. To keep them fed with data, each dot product unit is fed by a set of dedicated block rams.

Next, I'd like to show performance results for this architecture. Two years ago, we had a deployment of Stratix V FPGAs. It shows the effective teraflops of this format. 16 bit integer.. we've been playing with our own format Microsoft Floating Point. 4.5Tflops at MSFP5.8. These Stratix are pretty old.

(Demo for latest generation of FPGAs)

Looking at throughput oriented DPU, the latency is 65.81ms. With brainwave, latency is 0.98ms. Under 1 millisecond.

This was done on initial engineering silicon. For production silicon, we're expecting to get 12TOps at 16-bit integer. 90TOps for MSFP8. One question is how does numeric output affects output. Here is the normalized accuracy for three in-house text models, using GRU and LSTM. The orange bar shows what happens when you go to MSFP9, but we've developed a way to fine tune networks for this precision, and you see we recover our accuracy. We're working with MSFP8 and see similar results.

Project BrainWave is our project for accelerating DNNs at cloud scale. We hope it will be fast, friendly and cloud-scale, and expand capabilities of AI in the cloud, providing a way to run higher dimensional RNN networks for NLP and other great applications. We're planning to release to third parties, stay tuned.

Q: When you decrease batch size, what hardware are you evaluating? Hardware utilization as we decrease?

A: We stay highly utilized even as we decrease batch size; even at high batch size, we're still sending requests one by one. (Only one step will be processed?) Right.

Q: Regarding the FP9 and FP8, nine and eight being the number of bits used? (Yes) Is it in any way related to Flexpoint at Intel?

A: We developed this independently of flexpoint, and I'm not able to talk about our numeric format.

Q: In MS, do you really write Verilog for your FPGA, or do you use high level synthesis tool?

A: For this, we are writing System Verilog

Q: Batchnorm layers, which require batch computation; how do you put that onto the FPGA?

A: Part of the work of the compiler is to do splitting between CPU and FPGA. So things that are not amenable to FPGA, including batchnorm, we're still running them on CPU.

by Edward Z. Yang at December 08, 2017 08:08 PM

MOCHA: Federated Multi-Tasks Learning (Virginia Smith)

The below is a transcript of a talk by Virginia Smith on MOCHA, at the ML Systems Workshop at NIPS'17.

The motivation for this work comes from the way we think about solving ML problems in practice is changing. The typical ML workflow looks like this. You start iwth dataset and problem to solve. Say you want to build a classifier to identify high quality news articles. Next step is to select an ML model to solve the problem. Under the hood, to fit the model to your data, you have to select an optimization algorithm. The goal is to find an optimal model that minimizes some function over your data.

In practice, there's a very important part of the workflow that is missing. For new datasets, interesting and systems, the system and properties of system, play a large role in the optimization algorithm we select to fix. To give an example, in the past several years, data that is so large that must be distributed over multiple machines, in a datacenter environment. I've been thinking about how to perform fast distributed optimization in this setting, when data is so large.

But more and more frequently, data is not coming nicely packaged in datacenter. It's coming from mobile phones, devices, distributed across country and globe. Training ML in this setting is challenging. For one, whereas in datacenter you have hundreds to thousands, here you have millions and billions. Also, in datacenter, devices are similar capability; here, you have phones that are old, low battery, not connected to wifi. This can change ability to perform computation at any given iteration.

Additionally, there's heterogeneity in data itself. For privacy and computation reasons, data can become very unbalanced in network. And it can be non-IID, so much so that there can be interesting underlying structure to the data at hand. I'm excited because these challenges break down into both systems and statistical challenges. The one second summary of this work, thinking about both systems and statistical in this federated setting; the punchline is that systems setting plays a role not only in optimization algorithm but also the model we select to fit. IT plays a more important role in this overall workflow.

I'm going to go through how we holistically tackle systems and statistical challenges.

Starting with statistical. The goal is we have a bunch of devices generating data, could be unbalanced; some devices have more data than others. One approach used in past is fit a single model across all of this data. All of the data can be aggregated; you find one model that best achieves accuracy across all of the data simultaneously. The other extreme is you find a model for each of the data devices, and not share information. From systems point of view this is great, but statistically, you might have devices that are only ... that are poor in practice. What we're proposing is something between these two extremes. We want to find local models for each device, but share information in a structured way. This can be captured in a framework called multitask learning.

The goal is to fit a separate loss function for each device. These models can be aggregated in this matrix W, and the function of the regularizer, is to force some structure omega on it. This omega is a task relationship matrix, capturing interesting relationships, e.g., all the tasks are related and you want to learn weights, or most of the tasks are related and there are a few outliers, or there are clusters and groups, or there are more sophisticated relationships like asymmetric relationships. These can all be captured in multitask.

We developed a benchmarking set of real federated data. This includes trying to predict human activity from mobile phone, predict if eating or drinking, land mine, and vehicle sensor; distributed sensor to determine if a vehicle is passing by.

For these various datasets, we compared global, local and MTL. The goal is to fit a SVD model. For each data set, we looked at the average error across tasks, where each model is a task. What you can see is average error, for SVD, is significantly lower than global and local approaches. This makes sense because MTL is much more expressive; it lets you go between these extremes. What's interesting is that in these real data sets, it really helps. Reduction by half. This is a significant improvement in practice.

Given that we like to be using multitask learning to model data in federated environment, the next problem is figure out how to train this in distributed setting, thinking about massive distributed. In particular, the goal is to solve the following optimization objective. In looking how to solve this objective, we note that it's often common to solve for W and omega in an alternating fashion. When you solve for omega, it's centrally, you just need access to models. But W must be distributed because data is solved across devices. The key component how to solve this in practice is the W update. The challenge of doing this is communication is extremely expensive. And because of heterogeneity, you may have massive problems with stragglers and fault tolerance; e.g., someone who turns their phone off.

The high level idea for how we're doing this, take a communication efficient method that works well in data center, and modify it to work in federated setting. It will handle MTL as well as stragglers and fault tolerance.

What is the method we're using? The method we're using is COCOA, which is a state of the art method for empirical risk minimization problems. The thing that's nice about COCOa is it spans prior work of mini-batch and one-shot communication, by making communication a first class parameter of the method. Make it flexible as possible. It does it by not solving the primal formulation, but the dual. The dual is nice because we can easily approximate it by forming a quadratic approximation to the objective; and this more easily decomposes across machines.

To distribute this to federate setting, a key challenge is figuring out how to generalize it to the MTL framework. A second challenge; in COCOA, the subproblems are assumed to be solved to some accuracy theta. This is nice because theta varies from 0 to 1, where 0 is exact solve, and 1 is inexact. This can be thought of as how much time you do local communication versus communication. However, in fact, this is not as flexible as it should be in the federated setting. There is only one theta that is set for all iterations, a ll nodes. And because theta cannot be set exactly to one, it cannot handle fault tolerance, where there's no work performed at any iteration. Making this communication parameter much more flexible in practice.

JHow are we doing this? we developed MOCHA. The goal is to solve multitask learning framework; W and Omega in an alternating fashion. In particular, we're able to form the following dual formulation, similar to COCOA, so it decomposes. In comparison, we make this much more flexible assumption on subproblem parameter. This is important because of stragglers: statistical reasons, unbalance, different distributions, it can be very different in how difficult it is to solve subproblems. Additionally, there can be stragglers due to systems issues. And issues of fault tolerance. So this looks like a simple fix: we make this accuracy parameter more flexible: allow it to vary by node and iteration t, and let it be exactly 1. The hard thing is showing it converges to optimal solution.

Following this new assumption, and you can't have a device go down every single round, we show the following convergence guarantee. For L-Lipschitz loss, we get a convergence at 1/epsilon; for smooth models (logistic regression) we get a linear rate.

How does this perform in practice? The method is quite simple. The assumption is we have data stored at m different devices. We alternate between solving Omega, and W stored on each. While we're solving w update, it works by defining these local subproblems for machines, and calling solver that does approximate solution. This is flexible because it can vary by node and iteration.

In terms of comparing this to other methods, what we've seen is the following. Comparing MOCHA to CoCoA, compared to Mb-SDCA and Mb-SGD. We had simulation, with real data to see what would happen if we do it on wifi. We have simulated time and how close are to optimal. What you can see is that MoCHA is converging much more quickly to optimal solution, because MoCHA doesn't have the problem of statistical heterogeneity, and it's not bogged down by stragglers. This is true for all of the different types of networks; LET and 3G. The blue line and MOCHA and CoCOA, they work well in high communication settings, because they are more flexible. But compared to CoCOA, MOCHA is much more robust to statistical heterogeneity.

What's interesting is that if we impose some systems heterogeneity, some devices are slower than others, we looked at imposing low and high systems heterogeneity, MOCHA with this additional heterogeneity, it's a two orders of magnitude speedup to reach optimal solution.

And for MOCHA in particular, we looked at issue of fault tolerance. What we're showing here, we're increasing the probability a device will drop out at any distribution. Going up until there's half devices, we're still fairly robust to MOCHA converging, in almost the same amount of time. But what we see with green dotted line, of the same device drops out every iteration, it doesn't converge. This shows the assumption we made makes sense in practice.

The punchline is that in terms of thinking this new setting, training ML on these massive networks of devices, this is both a statistical and systems issue. We've addressed it in a holistic matter. Code at I also want to reiterate about SysML conference in February.

Q: When you compare global and local? Why is it always better than global?

A: The motivation why you want to use local model over global model, is that if you have a local data a lot, you might perform better. It boosts the overall sample size. I have some additional experiments where we took the original data, and skewed it even further than it already was. We took the local data, and there was less data locally, and they have global approaches. That's just a function of the data in the devices.

Q: I really like how your method has guarantees, but I'm wondering about an approach where you create a metalearning algorithm locally and have it work locally?

A: That's worth looking into empirically, since you can do fine tuning locally. What we were trying to do first was converge to exact optimal solution, but you might want to just work empirically well, would be good to compare to this setting.

by Edward Z. Yang at December 08, 2017 06:15 PM

A Machine Learning Approach to Database Indexes (Alex Beutel)

The below is a transcript of a talk by Alex Beutel on machine learning database indexes, at the ML Systems Workshop at NIPS'17.

DB researchers think about there research differently. You have a system that needs to work for all cases. Where as in ML, we have a unique circumstance, I'll build a model that works well. In DB, you have to fit all.

To give an example of this is a B-tree. A B-tree works for range queries. We have records, key, we want to find all records for range of keys. 0-1000, you build tree on top of sorted array. To quickly look up starting point in range. What if all my data, all of the keys, from zero to million... it becomes clear, you don't need the whole tree above. You can use the key itself as an offset into the array. Your lookup is O(1), O(1) memory, no need for extra data structure.

Now, we can't go for each app, we can't make a custom implementation to make use of some pattern. DB scale to any application, we don't want to rebuild it any time.

But ML excels in this situation. It works well for a wide variety of distributions, learn and make use of them effectively.

This is the key insight we came to. Traditional data structures make no assumptions about your data. They work under any distribution, and generally scale O(n). Interestingly, learning, these data distributions, can offer a huge win. What we're trying to go to, is instead of scaling to size of data, we scale to complexity of it. With linear data, it's O(1). For other distributions, can we leverage this?

There are three dat structures underlying databases. There are B-Trees; range queries, similarity search. Main index. Hash maps for point lookups; individual records. This is more common throughout CS. And bloom filters, are really common for set-inclusion queries. Do I have a key. If your record is stored on disk, checking first if there's a record with that key is worthwhile. We're going to focus entirely on B-trees.

B-trees take a tree like structure with high branching factor. What makes it really effective is that it's cache efficient. You can store top level nodes in your cache where it's fast to look it up, maybe others in main memory, and the actual memory on disk. By caching the hierarchy appropriately, it makes it efficiently. At a high level, a B-tree maps a key to a page, some given place in memory. Once it finds that page, it will do some local search to find the particular range of that key. That could be a scan or binary search; we know the range will be the position from start of page to page size.

An abstract level, the Btree is just a model. It's taking the position of the key, and trying to estimate the position. What we have in this case, we want to search in this error range to find the ultimate record. At a high level, it would mean that we can't use any model. We need err_min and err_max. But we have all the data. If you have all the data, you know at index construction time, you know all the data you're executing against, and you can calculate what the model's min and max error is.

One interesting thing is this is just a regression problem. What you're really modeling is just the CDF. On the X axis on this plot here, the X axis is your keys, Ys your position. This is modeling where your probability mass is located; where your data is in the keyspace. CDFs are studied somewhat, but not a ton, in the literature. This is a nice new implication of research.

We thought, OK, let's try this out straightaway. Train a model, see how fast it is. We looked at 200M server logs, timestamp key, 2 layer NN, 32-width, relatively small by ML. We train to predict position, square error. A B-Tree executes in 300ns. Unfortunately, with the model, it takes 80000ns. By most ML model speeds, this is great. If you're looking at executing on server, great. But this doesn't work for a database.

There are a bunch of problems baked into this. TF is really designed for large models. Think about translation or superresolution images; these are hefty tasks. We need to make this fast for database level speed. Second, b-trees are great for overfitting. There's no risk of over-fitting in this context. They're also cache efficient; that's not looked at in ML. The last thing is local search in the end. Is that really the most effective way of ultimately finding that key? I'm skipping that part because it's fairly detailed, I'll focus on first three.

The first part is just the raw speed fo execution of ML model. This was built really by Tim, this Learning Index Framework program. What it does is it lets you create different indexes under different configurations. For one thing, it lets you do code compilation for TF, ideas from Tupleware, where you can take a linear model and execute it extremely quickly. We can also train simple models. Use TF for more complex gradient descent based learning; extract weights, and have inference graph be codegenned. And we can do a lot of autotuning, to find what the best model architecture is. We know ahead of time what the best training is. We can make pretty smart decisions about what works best.

The next problem is accuracy and sepeed. If I have 100M records, I narrow down quickly from 1.5M to 24K, with each step down this tree. Each one of those steps is 50-60 cycles to look through that page, and to find what the right branch is. So we have to get to an accurracy of 12000, within 500 mul/add, to beat these levels of hierarchy, which are in cache. This is a steep task. The question is what is the right model? a really wide network? Single hidden layer? This scales nicely, we can fit in 256 layer reasonably. We could go deeper... the challenge is we have width^2, which need to be parallelized somehow. The challenge is, how do we effectively scale this. We want to add capacity to the model, make it more and more accurate, with increased size, without becoming to.

We took a different approach, based on mixed experts. We'll have a key, have a really simple classifier. We get an estimate. Then we can use that estimate to find it at the next stage. Narrow down the CDF range, and try to be more accurate in the subset of space. It will still get key as input; given key, give position, but more narrow space of keys. We build this down, and we'll walk down this hierarchy. This decouples model size and complexity. We have a huge model, overfitting, but we don't have to execute all of the sparsity that you would have to do from a pure ML view. We can decouple it usefully. The nice thing we can do is fall back to B-trees for subsets that are difficult to learn in a model. The LIF framework lets us substitute it in easily. In the worst case, B-tree. Best case, more efficient.

The quick results version here, is we find we have four different data sets. Most are integer data sets; last one is string data set. We're trying to save memory and speed; we save memory hugely; these are really simple models. Linear with simple layer, with possibly two stages. We're able to get a significant speedup in these cases. Server logs one is interesting. It looks at a high level very linear, but there's actually daily patterns to this data accessed. Maps is more linear; it's longitudes of spaces. We created synthetic data that's log normal, and here we see we can model it effectively. Strings is an interesting challenge going forward; your data is larger and more complicated, building models that are efficient over a really long string is different; the overall patterns are harder to have intuition about. One thing really worth noting here, it's not using GPUs or TPUs; it's pureely CPU comparison. Apples-to-apples.

This is mostly going into the B-tree part. This is a regression model looking at CDF of data. We can use these exact same models for hash maps. With bloom filters, you can use binary classifiers. I have a bunch of results in the poster in the back.

A few minutes to talk about rooms for improvement. There are a bunch of directions that we're excited to explore. Obvious one is GPUs/TPUs. It's cPUs because that's when B-trees are most effective; but scaling is all about ML. Improving throughput and latency for models with GPUs, exciting going forward. Modeling themselves; there's no reason to believe hierarchy of models is the right or best choice; it's interesting to build model structures that match your hardware. Memory efficient, underlying architecture of GPUs. In the scale of ns we need for database. Multidimensional indexes; ML excels in high numbers of dimension; most things are not looking at a single integer feature. There's interesting question about how you map to multidimensional indexes that are difficult to scale. If we have a CDF, you can approximately sort it right there. And inserts and updates, assumed read-only databases. Large class of systems, but we get more data. How do we balance overfitting with accuracy; can we add some extra auxiliary data structures to balance this out?

Q: One thing is that when... this problem, we solved pretty well without ML. When we introduce ML, we should introduce new metrics. We shouldn't make our system more fragile, because distribution changes. What would be the worst case when distribution changes?

A: As the data becomes updated... in the case of inference and updates, there's a question about generalization. I think you could look at it from the ML point of view: statistically, test model today on tomorrows inserts. (It's a method. If I use this method, and then train it with data that I don't yet have... and do.) The typical extrapolation to future generalization of ML. Guarantees are hard. There will be a worst case that is awful... but the flip side, that's the ML side... generalization. There's also a point of view, I couple this with classic data structure. we coupled modeling with classic data structures: search, bloom filter case, so you don't actually have this work. You catch worst case.

Let me add to that. If you assume that the inserts follow the same distribution as trained model, then the inserts become all one operation. They're even better. Suppose they don't follow the same distribution? you can still do delta indexing. Most systems do do delta indexing. So inserts are not a big problem.

Q: (Robert) Most of the inputs were one or two real numbers, and outputs are a single real number. how does it work if you use a low degree polynomial, or a piecewise linear classifier on the different digits?

A: In the case of strings, it's not a single input. (Treat it as integer?) Well, it's possibly a thousand characters long. It's not the best representation. Different representations work really well. The last thing I want to say, piecewise linear could work, but when you run 10k, 100k submodels, it's slow. Hierarchy helps. Polynomials are interesting, depends on data source.

Q: Can you comment how bad your worst case is? Average numbers?

A: We specifically always have a spillover. The worst case is defaulting to typical database. We haven't had a case where you do worse, because we'll default to B-tree. (Deterministic execution?) Not inference time.

by Edward Z. Yang at December 08, 2017 06:11 PM

Ray: A Distributed Execution Framework for Emerging AI Applications (Ion Stoica)

The below is a transcript of a talk by Ion Stoica on Ray, at the ML Systems Workshop at NIPS'17.

We've been working on it at Berkeley for more than one year. Over the past years, there's been tremendous progress in AI. Ad targeting, image&speech, many more. Many applications are based on supervised learning with DNNs. Supervised plus unsupervised are the two dominant approaches.

However, the next generation of AI applications will be very different. They're deployed in mission critical scenarios, need to continually learn from a rapidly changing env. Robotics, self driving cars, unmanned drones, dialogue systems. Implementing this new generation of AI applications requires a broader range of techniques. Stochastic optimization, parallel simulations, many more.

Ray provides a unified platform for implementing these approaches. To motivate Ray, I'll use reinforcement learning. RL learns by interacting with env. A policy mapping from state/observation to action that maximizes a certain reward. What are the reqs of RL? Many applications exhibit nested parallelism: search, where they use data parallel SGD, which then calls a component that does policy evaluation with a model to simulate, that runs in parallel on multiple CPUs. Second, these workloads can be highly heterogenous in hardware and time. Many of these computations require not only CPUs, but GPUs TPUs and FPGAs. Second, this computation can take wildly different times. Simulate a chess game: 3 moves to lose, or 50 moves to win or draw. And in robotics, we need to process in real time, processing the data from sensors in parallel, tens of ms.

Meeting these requirements is not easy. To meet these requirements, you need a system that is flexible and performant. Flexible: it should create and schedule tasks dynamically, and support arbitrary dependencies. Perf: it should scale to hundreds of nodes, sub-millisecond latency, millions of task, and efficiently share numeric data.

Next, I'm going to say how we achieve these challenges. Flexibility? We provide a very flexible model: dynamic tasks graphs. On top of this, we give the two models: parallel tasks and actors.

To talk about parallel tasks, here is Python code: one reads an array from a file, and the other adds two arrays. The code is simple: it creates two arrays a and b from file1 and file2, and sum them up. So now, parallelizing this program is quite easy. If we want to parallelize a function, in order to do that, we need to add a ray.remote decorator to each function. When we invoke these functions, you need to invoke remote method. Remove doesn't return object itself, just the object id. This is very similar to the futures abstraction. To get the actual object, you must invoke ray.get on the object id.

To get a better idea of how Ray is executing, let's execute a simple program. Assumes files stored on different nodes. When read_array on file1, it schedules read_array on the appropriate node. The remote call returns immediately, before the actual read finishes. This allows the driver to run the second task in parallel, running on the node on file 2, and launch the add remote function. All functions have been scheduled remotely, but none of them have finished. To actually get the result, you have to call ray.get on the result. This is a blocking call, you'll wait for the entire computation graph to be executed.

Tasks are very general, but they are not enough. Consider that you want to run a simulator, and this simulator is closed source. In this case, you do not have access to the state. You have state, action, simulations, to set up state in simulator, you cannot do it. So to get around this, there is another use case, where the state is too expensive to create. For example, DNNs on GPUs, in this case, you want to initialize it once, and reinitialize for each simulation.

In order to address these use cases, we add Actor abstraction. An actor is just a remote class. If you have a Counter, you mark it ray.remote, and the when you create the class or invoke methods, you use remote keyword. This is a computation graph for this very simple example. Notice the method invocations also return object identifiers. To get the results, you need to call ray.get on object identifiers. Ray also allows you to specify the number of resources, for actors and tasks.

To put things together, and provide a more realistic example, evaluation strategy, a scalable form of RL, by Salimans et al in OpenAI. In a nutshell, evolution strategy, tries lots of policies, and tries to see which runs best. This is highly parallel. So here is pseudocode for parallel strategies. A worker that does simulation and returns the reward, create twenty workers, and then 200, do 200 simulations, update policy. Again, if you want to parallelize this code, we have to add a bunch of remote, and now on the right hand side, you'll notice I'm also sharing the computation graph. When you invoke now, the Worker.remote, you create 20 remote workers to do it in parallel. And you invoke with the remote keyword. Again, notice that in this case, the results are not the rewards themselves, but they're ids to the reward objects. In order to get the rewards to get policy, you have to call ray.get.

This hopefully gives you a flavor how to program in Ray. Next time, I switch gears, presents system design of Ray; how Ray gets high performance and scalability.

Like many classic computing frameworks, it has a driver, and a bunch of workers. Driver runs a program, worker runs task remotely. You can run and write a bunch of actors. The drivers actors on the same node, they share the data, on shared memory, and the workers and actors of cross nodes, share through distributed object store we built. Each node has a local scheduler, so when a driver wants to run another task, the local scheduler tries to schedule it locally. If it cannot schedule it locally, it invokes global scheduler, and it will schedule another node that has resources. Actor, remote method. Finally, what we do, and one essential part of the design, is we have a Global Control State. It takes all of the state of the system, and centralizes it. The metadata for the objects, in objects table, function. This allows system to be stateless. All these other components can fail, you can bring them up, get the most recent data from global control state. It also allows us to parallelize the global scheduler, because these replicas are going to share the same state in the GCS.

Another nice effect of having a GCS is that it makes it easy to build a bunch of profiling and debugging tools.

This design is highly scalable. Let me try to convince you why this is. To make GcS scalable, we just shard it. All these keys are pseudorandom, so it's easy to shard and load balance. The scheduler as you see is distributed; each node has a local scheduler, and Ray tries to schedule tasks which are spawned by a worker/driver on another task that is locally. The global scheduler, becomes a bottleneck, we can also replicate it. Finally, in systems, even if scheduler is super scalable, in Spark, there's another bottleneck: only the driver can launch new tasks. In order to get around that, we allow in Ray the workers and actors to launch tasks. Really, there is no single bottleneck point.

A few words about implementation. The GCS is implemented with Redis. For object store, we leverage Apache Arrow. For fault tolerance, we use lineage based fault tolerance like Spark. Actors are part of task graph; methods are treated as tasks, so we have a uniform model for providing fault tolerance.

So now some evaluation results. This plot represents the number of tasks per second, and you can see the number of nodes; it scales linearly. You can schedule over 1.8 M/s. Latency of local task execution is 300us, the latency of remote task is 1ms. This plot illustrates fault tolerance. You may ask why you care about fault tolerance? The problem is you need in your program that the simulation may not finish; this makes the program far more complicated, even if you're willing to ignore some results. Here, on this axis, you have the time in seconds, you have two y axes, number of nodes in system, and the throughput. As you can see, the number of nodes is starting at 50, then 25, then to 10, and goes back to 50. In the red area, you show the number of tasks per second; it follows as you may expect, the number of nodes in the system. If you look a little bit, there are some drops; every time, you have a drop in the number of tasks. It turns out this is because of the object reconstruction. When some nodes go away, you lose the objects on the node, so you have to reconstruct them. Ray and Spark reconstruct them transparently. With blue, you can see the re-executed tasks. If you add them, you get a very nice filling curve.

Finally, for evolution strategies, we compared with reference ES from... we followed the OpenAI, and on the X axis, you have number of CPUs, mean time to solve the particular problem; simulator, learning to run, there are three points to notice. One is, as expected, as you add more CPUs, the time to solve goes down. The second is that Ray is actually better than the reference ES, better results, even though the reference ES is specialized for beating. Third, for a very large number of CPUs, ref couldn't do it, but Ray could do better and better. I should add that Ray takes half the amount of code, and was implemented in a couple of hours.

Related work: look, in this area, there are a huge number of systems, that's why you are here, lots of systems. Ray is complimentary to TF, MXNet, PyTorch, etc. We use these systems to implement DNNs. We integrate with TF and PyT. There are more general systems, like MPI and Spark; these have limited support for nested parallelism; computation model, and they have much coarser grained tasks.

To conclude, Ray is a system for high performance and flexibility and scalability. We have two libraries on top of Ray: RLlib and Ray Tune. It's open source, please try, we'd love your feedback. Robert, Philip, Alex, Stephanie, Richard, Eric, Heng, William, and many thanks to my colleague Michael Jordan.

Q: In your system, you also use actor; actor is built up on shared memory. Do you have separate mailbox for actors? How do you do that?

A: No, the actors communicate by passing the argument to the shared object store.

Q: What is the granularity of parallelism? Is it task atomic, or do you split task?

A: The task granularity is given by what is the overhead for launching a task and scheduling the task. The task you see, we are targeting task, low and few ms. The task is not implementing something like activation function. we leave that job to much better frameworks. And a task is executing atomically, a method, in the actors, are serialized.

Q: Question about fault tolerance: in Spark, when you don't have a response for some time, it says this node died. Here, the task is much more, because NN, something like that. So we don't have the same time.

A: We do not do speculation; implicit speculation in Ray, for the reason you mentioned.

Q: Can you give me more details on the reference implementation, doesn't scale

A: The reference implementation, it's the OpenAI implementation, Robert here can provide you a lot more detailed answers to that question.

by Edward Z. Yang at December 08, 2017 06:07 PM

December 07, 2017

Jasper Van der Jeugt

Video: Getting things done in Haskell

Someone alerted me that the video of my talk at the Skills Matter Haskell eXchange 2017 is now available. You can watch it on their website.

The slides can be found here.

It’s a talk aimed towards beginners. If you are writing a medium-sized Haskell application for the very first time, you will typically end up with three modules: Types.hs, Utils.hs and Main.hs. While this is a very clear split, it typically doesn’t scale very well as applications become larger.

I try to answer some questions like:

  • When is it a good idea to use something like Monad/Applicative (and when is it not)?
  • When is it a good idea to invent my own typeclass (and when is it not)?
  • How do I design interfaces and services like in OOP?

Thanks again to Skills Matter for putting together this excellent conference.

by Jasper Van der Jeugt at December 07, 2017 12:00 AM

December 05, 2017

Jeremy Gibbons

Arithmetic Coding

This post is about the data compression method called arithmetic coding, by which a text is encoded as a subinterval of the unit interval, which is then represented as a bit sequence. It can often encode more effectively than Huffman encoding, because it doesn’t have the restriction of Huffman that each symbol be encoded as a positive whole number of bits; moreover, it readily accommodates adaptive models of the text, which “learn” about the text being encoded while encoding it. It is based on lecture notes that I wrote in 2002 with Richard Bird, although the presentation here is somewhat simplified; it is another application of streaming. There’s quite a lot to cover, so in this post I’ll just set up the problem by implementing a basic encoder and decoder. In the next post, I’ll show how they can both be streamed. (We won’t get into the intricacies of restricting to fixed-precision arithmetic—perhaps I can cover that in a later post.)

The basic idea behind arithmetic coding is essentially to encode an input text as a subinterval of the unit interval, based on a model of the text symbols that assigns them to a partition of the unit interval into non-empty subintervals. For the purposes of this post, we will deal mostly with half-open intervals, so that the interval {[l,r)} contains values {x} such that {l \le x < r}, where {l,r,x} are rationals.

For example, with just two symbols “a” and “b”, and a static model partitioning the unit interval into {[0, \frac 1 3)} for “a” and {[\frac 1 3, 1)} for “b”, the symbols in the input text “aba” successively narrow the unit interval to {[0,\frac 1 3), [\frac 1 9, \frac 1 3), [\frac 1 9, \frac 5 {27})}, and the latter interval is the encoding of the whole input. And in fact, it suffices to pick any single value in this final interval, as long as there is some other way to determine the end of the encoded text (such as the length, or a special end-of-text symbol).


We introduce the following basic definitions for intervals:

\displaystyle  \begin{array}{@{}l} \mathbf{type}\;\mathit{Interval} = (\mathit{Rational}, \mathit{Rational}) \vrule width0pt depth2ex \\ \mathit{unit} :: \mathit{Interval} \\ \mathit{unit} = (0,1) \vrule width0pt depth2ex \\ \mathit{contains} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Bool} \\ \mathit{contains}\;(l,r)\;x = l \le x \land x < r \vrule width0pt depth2ex \\ \mathit{includes} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Bool} \\ \mathit{includes}\;(l,r)\;(p,q) = l \le p \land q \le r \end{array}

We’ll write “{i \ni x}” for {\mathit{contains}\;i\;x}, and “{i \supseteq j}” for {\mathit{includes}\;i\;j}.

A crucial operation on intervals is narrowing of one interval by another, where {\mathit{narrow}\;i\;j} is to {i} as {j} is to the unit interval:

\displaystyle  \begin{array}{@{}l} \mathit{narrow} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Interval} \\ \mathit{narrow}\;i\;(p,q) = (\mathit{weight}\;i\;p, \mathit{weight}\;i\;q) \vrule width0pt depth2ex \\ \mathit{weight} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{weight}\;(l,r)\;x = l + (r-l) \times x \end{array}

We’ll write “{i \mathbin{\triangleright} j}” for {\mathit{narrow}\;i\;j}. Thus, {\mathit{weight}\;(l,r)\;x} is “proportionately {x} of the way between {l} and {r}“, and we have

\displaystyle  \begin{array}{@{}lcl} i \ni \mathit{weight}\;i\;x & \Leftarrow& \mathit{unit} \ni x \\ i \supseteq i \mathbin{\triangleright} j &\Leftarrow& \mathit{unit} \supseteq j \end{array}

Conversely, we can widen one interval by another:

\displaystyle  \begin{array}{@{}l} \mathit{widen} :: \mathit{Interval} \rightarrow \mathit{Interval} \rightarrow \mathit{Interval} \\ \mathit{widen}\;i\;(p,q) = (\mathit{scale}\;i\;p, \mathit{scale}\;i\;q) \vrule width0pt depth2ex \\ \mathit{scale} :: \mathit{Interval} \rightarrow \mathit{Rational} \rightarrow \mathit{Rational} \\ \mathit{scale}\;(l,r)\;x = (x-l)/(r-l) \end{array}

We’ll write “{i \mathbin{\triangleleft} j}” for {\mathit{widen}\;i\;j}. Note that {\mathit{scale}} is inverse to {\mathit{weight}}, in the sense

\displaystyle  y = \mathit{weight}\;i\;x \Leftrightarrow \mathit{scale}\;i\;y = x

and consequently widening is inverse to narrowing:

\displaystyle  i \mathbin{\triangleleft} (i \mathbin{\triangleright} j) = j


We work with inputs consisting of sequences of symbols, which might be characters or some higher-level tokens:

\displaystyle  \mathbf{type}\;\mathit{Symbol} = \mathit{Char}

The type {\mathit{Model}} then must provide the following operations:

  • a way to look up a symbol, obtaining the corresponding interval:

    \displaystyle  \mathit{encodeSym} :: \mathit{Model} \rightarrow \mathit{Symbol} \rightarrow \mathit{Interval}

  • conversely, a way to decode a value, retrieving a symbol:

    \displaystyle  \mathit{decodeSym} :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow \mathit{Symbol}

  • an initial model:

    \displaystyle  \mathit{initial} :: \mathit{Model}

  • a means to adapt the model on seeing a new symbol:

    \displaystyle  \mathit{newModel} :: \mathit{Model} \rightarrow \mathit{Symbol} \rightarrow \mathit{Model}

The central property is that encoding and decoding are inverses, in the following sense:

\displaystyle  \mathit{decodeSym}\;m\;x = s \quad \Leftrightarrow \quad \mathit{encodeSym}\;m\;s \ni x

There are no requirements on {\mathit{initial}} and {\mathit{newModel}}, beyond the latter being a total function.

For example, we might support adaptive coding via a model that counts the occurrences seen so far of each of the symbols, represented as a histogram:

\displaystyle  \mathbf{type}\;\mathit{Model} = [(\mathit{Symbol},\mathit{Integer})]

This naive implementation works well enough for small alphabets. One might maintain the histogram in decreasing order of counts, so that the most likely symbols are at the front and are therefore found quickest. For larger alphabets, it is better to maintain the histogram as a binary search tree, ordered alphabetically by symbol, and caching the total counts of every subtree.


Now encoding is straightforward to define. The function {\mathit{encodeSyms}} takes an initial model and a list of symbols, and returns the list of intervals obtained by looking up each symbol in turn, adapting the model at each step:

\displaystyle  \begin{array}{@{}l} \mathit{encodeSyms} :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow [\mathit{Interval}] \\ \mathit{encodeSyms}\; m = \mathit{map}\;\mathit{snd} \cdot \mathit{tail} \cdot \mathit{scanl}\;\mathit{next}\;(m,\mathit{unit}) \\ \quad \mathbf{where}\; \begin{array}[t]{@{}l} \mathit{next} :: (\mathit{Model},\mathit{Interval}) \rightarrow \mathit{Symbol} \rightarrow (\mathit{Model},\mathit{Interval}) \\ \mathit{next}\;(m,i)\;s = (\mathit{newModel}\;m\;s, \mathit{encodeSym}\;m\;s) \end{array} \end{array}

That is,

\displaystyle  \begin{array}{@{}lcl} \mathit{encodeSyms}\;m\;[\,] &=& [\,] \\ \mathit{encodeSyms}\;m\;(s:ss) &=& \mathit{encodeSym}\;m\;s : \mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss \end{array}

We then narrow the unit interval by each of these subintervals, and pick a single value from the resulting interval:

\displaystyle  \begin{array}{@{}l} \mathit{encode}_0 :: \mathit{Model} \rightarrow [\mathit{Symbol}] \rightarrow \mathit{Rational} \\ \mathit{encode}_0\;m = \mathit{pick} \cdot \mathit{foldr}\;\mathit{narrow}\;\mathit{unit} \cdot \mathit{encodeSyms}\;m \end{array}

All we require of {\mathit{pick} :: \mathit{Interval} \rightarrow \mathit{Rational}} is that {i \ni \mathit{pick}\;i}; then {\mathit{encode}_0} yields a fraction in the unit interval. For example, we might set {\mathit{pick} = \mathit{midpoint}}, where

\displaystyle  \textstyle \mathit{midpoint}\;i = \mathit{weight}\;i\;(\frac 1 2)


So much for encoding; how do we retrieve the input text? In fact, we can retrieve the first symbol simply by using {\mathit{decodeSym}}. Expanding the encoding of a non-empty text, we have:

\displaystyle  \begin{array}{@{}cl} & \mathit{encode}_0\;m\;(s:ss) \\ = & \qquad \{ \mathit{encode}_0 \mbox{ and } \mathit{encodeSyms} \mbox{, as above; let } i = \mathit{encodeSym}\;m\;s \} \\ & \mathit{pick}\;(\mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(i : \mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss)) \\ = & \qquad \{ \mbox{fold} \} \\ & \mathit{pick}\;(i \mathbin{\triangleright} \mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(\mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss)) \\ = & \qquad \{ \mathit{pick}\;(i \mathbin{\triangleright} j) = \mathit{weight}\;i\;(\mathit{pick}\;j) \mbox{ (see below)} \} \\ & \mathit{weight}\;i\;(\mathit{pick}\;(\mathit{foldr}\;\mathit{narrow}\;\mathit{unit}\;(\mathit{encodeSyms}\;(\mathit{newModel}\;m\;s)\;ss))) \\ = & \qquad \{ \mathit{encode}_0 \mbox{ and } \mathit{encodeSyms} \mbox{ again} \} \\ & \mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss) \end{array}

The proof obligation, left as an exercise, is to show that

\displaystyle  \mathit{pick}\;(i \mathbin{\triangleright} j) = \mathit{weight}\;i\;(\mathit{pick}\;j)

which holds when {\mathit{pick}\;i} is of the form {\mathit{weight}\;i\;x} for some {x}.


\displaystyle  \begin{array}{@{}cl} & \mathit{decodeSym}\;m\;(\mathit{encode}_0\;m\;(s:ss)) = s \\ \Leftrightarrow & \qquad \{ \mbox{expansion of } \mathit{encode}_0 \mbox{, as above; let } i = \mathit{encodeSym}\;m\;s \} \\ & \mathit{decodeSym}\;m\;(\mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss)) = s \\ \Leftrightarrow & \qquad \{ \mbox{requirement on models} \} \\ & i \ni \mathit{weight}\;i\;(\mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss) \\ \Leftarrow & \qquad \{ \mathit{weight} \} \\ & \mathit{unit} \ni \mathit{encode}_0\;(\mathit{newModel}\;m\;s)\;ss \end{array}

and indeed, encoding yields a fraction in the unit interval, so this recovers the first symbol correctly. This is the foothold that allows the decoding process to make progress; having obtained the first symbol using {\mathit{decodeSym}}, it can adapt the model in precisely the same way that the encoding process does, then retrieve the second symbol using that adapted model, and so on. The only slightly tricky part is that when decoding an initial value {x}, having obtained the first symbol {s}, decoding should continue on some modified value {x'}; what should the modification be? It turns out that the right thing to do is to scale {x} by the interval associated in the model with symbol {s}, since scaling is the inverse operation to the {\mathit{weight}}s that take place during encoding. That is, we define:

\displaystyle  \begin{array}{@{}l} \mathit{decode}_0 :: \mathit{Model} \rightarrow \mathit{Rational} \rightarrow [\mathit{Symbol}] \\ \mathit{decode}_0\;m\;x = \mathit{unfoldr}\;\mathit{step}\;(m,x) \vrule width0pt depth2ex \\ \mathit{step} :: (\mathit{Model}, \mathit{Rational}) \rightarrow \mathsf{Maybe}\;(\mathit{Symbol}, (\mathit{Model},\mathit{Rational})) \\ \mathit{step}\;(m,x) = \mathit{Just}\;(s, (\mathit{newModel}\;m\;s, \mathit{scale}\;(\mathit{encodeSym}\;m\;s)\;x)) \\ \quad \mathbf{where}\;s = \mathit{decodeSym}\;m\;x \end{array}

(Of course, {\mathit{encodeSym}\;m\;s \ni x}, by the inverse requirement on models, and so the new scaled value is again within the unit interval.)

Note that decoding yields an infinite list of symbols; the function {\mathit{step}} is always productive. Nevertheless, that infinite list starts with the encoded text, as we shall now verify. Define the round-trip function

\displaystyle  \mathit{round}_0\;m = \mathit{decode}_0\;m \cdot \mathit{encode}_0\;m

Then we have:

\displaystyle  \begin{array}{@{}cl} & \mathit{round}_0\;m\;(s:ss) \\ = & \qquad \{ \mbox{definition of } \mathit{round}_0 \} \\ & \mathit{decode}_0\;m\;(\mathit{encode}_0\;m\;(s:ss)) \\ = & \qquad \{ \mathit{encode}_0 \mbox{; let } i = \mathit{encodeSym}\;m\;s, m' = \mathit{newModel}\;m\;s \} \\ & \mathit{decode}_0\;m\;(\mathit{weight}\;i\;(\mathit{encode}_0\;m'\;ss)) \\ = & \qquad \{ \mathit{decode}_0 \mbox{; first decoded symbol is correct, as above} \} \\ & s : \mathit{decode}_0\;m'\;(\mathit{scale}\;i\;(\mathit{weight}\;i\;(\mathit{encode}_0\;m'\;ss))) \\ = & \qquad \{ \mathit{scale}\;i\;(\mathit{weight}\;i\;x) = x \} \\ & s : \mathit{decode}_0\;m'\;(\mathit{encode}_0\;m'\;ss) \\ = & \qquad \{ \mbox{definition of } \mathit{round}_0 \} \\ & s : \mathit{round}_0\;m'\;ss \end{array}

From this it follows that indeed the round-trip recovers the initial text, in the sense that {\mathit{round}_0\;m\;ss} yields an infinite sequence that starts with {ss}; in fact,

\displaystyle  \mathit{round}_0\;m\;ss = ss \mathbin{{+}\!\!\!{+}} \mathit{round}_0\;(\mathit{foldl}\;\mathit{newModel}\;m\;ss)\;[\,]

yielding the original input followed by some junk, the latter obtained by decoding the fraction {\frac 1 2} (the encoding of {[\,]}) from the final model {\mathit{foldl}\;\mathit{newModel}\;m\;ss} that results from adapting the initial model to each symbol in {ss} in turn. To actually retrieve the input text with no junk suffix, one could transmit the length separately (although that doesn’t sit well with streaming), or append a distinguished end-of-text symbol.

What’s next

So far we have an encoder and a decoder, and a proof that the decoder successfully decodes the encoded text. In the next post, we’ll see how to reimplement both as streaming processes.

by jeremygibbons at December 05, 2017 03:58 PM

Joachim Breitner

Finding bugs in Haskell code by proving it

Last week, I wrote a small nifty tool called bisect-binary, which semi-automates answering the question “To what extent can I fill this file up with zeroes and still have it working”. I wrote it it in Haskell, and part of the Haskell code, in the Intervals.hs module, is a data structure for “subsets of a file” represented as a sorted list of intervals:

data Interval = I { from :: Offset, to :: Offset }
newtype Intervals = Intervals [Interval]

The code is the kind of Haskell code that I like to write: A small local recursive function, a few guards to case analysis, and I am done:

intersect :: Intervals -> Intervals -> Intervals
intersect (Intervals is1) (Intervals is2) = Intervals $ go is1 is2
    go _ [] = []
    go [] _ = []
    go (i1:is1) (i2:is2)
        -- reorder for symmetry
        | to i1 < to i2 = go (i2:is2) (i1:is1)
        -- disjoint
        | from i1 >= to i2 = go (i1:is1) is2
        -- subset
        | to i1 == to i2 = I f' (to i2) : go is1 is2
        -- overlapping
        | otherwise = I f' (to i2) : go (i1 { from = to i2} : is1) is2
      where f' = max (from i1) (from i2)

But clearly, the code is already complicated enough so that it is easy to make a mistake. I could have put in some QuickCheck properties to test the code, I was in proving mood...

Now available: Formal Verification for Haskell

Ten months ago I complained that there was no good way to verify Haskell code (and created the nifty hack ghc-proofs). But things have changed since then, as a group at UPenn (mostly Antal Spector-Zabusky, Stephanie Weirich and myself) has created hs-to-coq: a translator from Haskell to the theorem prover Coq.

We have used hs-to-coq on various examples, as described in our CPP'18 paper, but it is high-time to use it for real. The easiest way to use hs-to-coq at the moment is to clone the repository, copy one of the example directories (e.g. examples/successors), place the Haskell file to be verified there and put the right module name into the Makefile. I also commented out parts of the Haskell file that would drag in non-base dependencies.

Massaging the translation

Often, hs-to-coq translates Haskell code without a hitch, but sometimes, a bit of help is needed. In this case, I had to specify three so-called edits:

  • The Haskell code uses Intervals both as a name for a type and for a value (the constructor). This is fine in Haskell, which has separate value and type namespaces, but not for Coq. The line

    rename value Intervals.Intervals = ival

    changes the constructor name to ival.

  • I use the Int64 type in the Haskell code. The Coq version of Haskell’s base library that comes with hs-to-coq does not support that yet, so I change that via

    rename type GHC.Int.Int64 = GHC.Num.Int

    to the normal Int type, which itself is mapped to Coq’s Z type. This is not a perfect fit, and my verification would not catch problems that arise due to the boundedness of Int64. Since none of my code does arithmetic, only comparisons, I am fine with that.

  • The biggest hurdle is the recursion of the local go functions. Coq requires all recursive functions to be obviously (i.e. structurally) terminating, and the go above is not. For example, in the first case, the arguments to go are simply swapped. It is very much not obvious why this is not an infinite loop.

    I can specify a termination measure, i.e. a function that takes the arguments xs and ys and returns a “size” of type nat that decreases in every call: Add the lengths of xs and ys, multiply by two and add one if the the first interval in xs ends before the first interval in ys.

    If the problematic function were a top-level function I could tell hs-to-coq about this termination measure and it would use this information to define the function using Program Fixpoint.

    Unfortunately, go is a local function, so this mechanism is not available to us. If I care more about the verification than about preserving the exact Haskell code, I could easily change the Haskell code to make go a top-level function, but in this case I did not want to change the Haskell code.

    Another way out offered by hs-to-coq is to translate the recursive function using an axiom unsafeFix : forall a, (a -> a) -> a. This looks scary, but as I explain in the previous blog post, this axiom can be used in a safe way.

    I should point out it is my dissenting opinion to consider this a valid verification approach. The official stand of the hs-to-coq author team is that using unsafeFix in the verification can only be a temporary state, and eventually you’d be expected to fix (heh) this, for example by moving the functions to the top-level and using hs-to-coq’s the support for Program Fixpoint.

With these edits in place, hs-to-coq splits out a faithful Coq copy of my Haskell code.

Time to prove things

The rest of the work is mostly straight-forward use of Coq. I define the invariant I expect to hold for these lists of intervals, namely that they are sorted, non-empty, disjoint and non-adjacent:

Fixpoint goodLIs (is : list Interval) (lb : Z) : Prop :=
  match is with
    | [] => True
    | (I f t :: is) => (lb <= f)%Z /\ (f < t)%Z /\ goodLIs is t

Definition good is := match is with
  ival is => exists n, goodLIs is n end.

and I give them meaning as Coq type for sets, Ensemble:

Definition range (f t : Z) : Ensemble Z :=
  (fun z => (f <= z)%Z /\ (z < t)%Z).

Definition semI (i : Interval) : Ensemble Z :=
  match i with I f t => range f t end.

Fixpoint semLIs (is : list Interval) : Ensemble Z :=
  match is with
    | [] => Empty_set Z
    | (i :: is) => Union Z (semI i) (semLIs is)

Definition sem is := match is with
  ival is => semLIs is end.

Now I prove for every function that it preserves the invariant and that it corresponds to the, well, corresponding function, e.g.:

Lemma intersect_good : forall (is1 is2 : Intervals),
  good is1 -> good is2 -> good (intersect is1 is2).
Proof. … Qed.

Lemma intersection_spec : forall (is1 is2 : Intervals),
  good is1 -> good is2 ->
  sem (intersect is1 is2) = Intersection Z (sem is1) (sem is2).
Proof. … Qed.

Even though I punted on the question of termination while defining the functions, I do not get around that while verifying this, so I formalize the termination argument above

Definition needs_reorder (is1 is2 : list Interval) : bool :=
  match is1, is2 with
    | (I f1 t1 :: _), (I f2 t2 :: _) => (t1 <? t2)%Z
    | _, _ => false

Definition size2 (is1 is2 : list Interval) : nat :=
  (if needs_reorder is1 is2 then 1 else 0) + 2 * length is1 + 2 * length is2.

and use it in my inductive proofs.

As I intend this to be a write-once proof, I happily copy’n’pasted proof scripts and did not do any cleanup. Thus, the resulting Proof file is big, ugly and repetitive. I am confident that judicious use of Coq tactics could greatly condense this proof.

Using Program Fixpoint after the fact?

This proofs are also an experiment of how I can actually do induction over a locally defined recursive function without too ugly proof goals (hence the line match goal with [ |- context [unsafeFix ?f _ _] ] => set (u := f) end.). One could improve upon this approach by following these steps:

  1. Define copies (say, intersect_go_witness) of the local go using Program Fixpoint with the above termination measure. The termination argument needs to be made only once, here.

  2. Use this function to prove that the argument f in go = unsafeFix f actually has a fixed point:

    Lemma intersect_go_sound:

    f intersect_go_witness = intersect_go_witness

    (This requires functional extensionality). This lemma indicates that my use of the axioms unsafeFix and unsafeFix_eq are actually sound, as discussed in the previous blog post.

  3. Still prove the desired properties for the go that uses unsafeFix, as before, but using the functional induction scheme for intersect_go! This way, the actual proofs are free from any noisy termination arguments.

    (The trick to define a recursive function just to throw away the function and only use its induction rule is one I learned in Isabelle, and is very useful to separate the meat from the red tape in complex proofs. Note that the induction rule for a function does not actually mention the function!)

Maybe I will get to this later.

Update: I experimented a bit in that direction, and it does not quite work as expected. In step 2 I am stuck because Program Fixpoint does not create a fixpoint-unrolling lemma, and in step 3 I do not get the induction scheme that I was hoping for. Both problems would not exist if I use the Function command, although that needs some tickery to support a termination measure on multiple arguments. The induction lemma is not quite as polished as I was hoping for, so he resulting proof is still somewhat ugly, and it requires copying code, which does not scale well.

Efforts and gains

I spent exactly 7 hours working on these proofs, according to arbtt. I am sure that writing these functions took me much less time, but I cannot calculate that easily, as they were originally in the Main.hs file of bisect-binary.

I did find and fix three bugs:

  • The intersect function would not always retain the invariant that the intervals would be non-empty.
  • The subtract function would prematurely advance through the list intervals in the second argument, which can lead to a genuinely wrong result. (This occurred twice.)

Conclusion: Verification of Haskell code using Coq is now practically possible!

Final rant: Why is the Coq standard library so incomplete (compared to, say, Isabelle’s) and requires me to prove so many lemmas about basic functions on Ensembles?

by Joachim Breitner ( at December 05, 2017 02:17 PM

December 04, 2017

Roman Cheplyaka

Introduction to golden testing

Golden tests are like unit tests, except the expected output is stored in a separate file. I learned about them in 2010 from Max Grigorev at ZuriHac.

Let’s say you want to test Python’s json module. One way to do that would be to encode an object and compare the result to a reference string:

import json

assert(json.dumps([1,2,3]) == "[1, 2, 3]")

Alternatively, you could create a file with contents

[1, 2, 3]

and read it to know the expected output:

import json

with open("example1.json", "r") as ex1_file:
    ex1 =
    assert(json.dumps([1,2,3]) == ex1)

The file example1.json is called a golden file.

Here are some advantages of golden tests over ordinary unit tests:

  1. If the expected output is large in size, it may be impractical to put it inside the source code.
  2. No need to escape quotes or binary data in the expected output.
  3. When you add a new test, your testing framework can generate the missing golden file from the current output of the function.

    It is best if you can write down the expected output without looking at the actual output, but it is not always possible. The output may be too big to type it character by character, or it may be hard to predict. For instance, in the json example, you couldn’t tell in advance whether there would be spaces between array elements or not. So often what you do is launch an interactive interpreter (if your language of choice even has one), run the function, and then copy-paste its output into the test code.

    This process can be easily automated if you use golden files.

  4. The expected output can be automatically updated.

    Say you changed your json module to replace some of the spaces with newlines to make the output more aesthetically pleasing. You have 40 test cases that need updating. Can you imagine doing this by hand?

    With golden tests, you can tell your test framework to update all golden files from the current outputs, then check git diff to ensure that all changes are valid, and commit them.
  5. If some of your tests suddently started failing, you can use diff or other such tools to compare the golden file to the actual file and figure out what exactly changed. Perhaps your testing framework could even show the diff automatically on test failure?

While advantages 1-2 are automatic, 3-5 require special support from your testing framework. The rest of this article will be focused on a Haskell testing framework tasty and its add-on package for golden tests, tasty-golden.

Basic usage

To illustrate how tasty-golden works, consider this yaml-to-json conversion module:

{-# LANGUAGE TypeApplications #-}
module YamlToJson where

import qualified Data.Yaml as Y
import Data.Aeson as J
import qualified Data.ByteString.Lazy as LBS

yamlToJson :: LBS.ByteString -> LBS.ByteString
yamlToJson = J.encode . Y.decode @Value . LBS.toStrict

Because JSON contains quotes and YAML spans multiple lines, it is not very practical to store them as string literals in the source code file. Instead, you will keep them both in files.

Note that the name “golden file” only refers to the file containing the output, not the input. There is no requirement that the input is stored in a file or that there even is any “input” at all; but in practice it is often convenient to store them both in files so that there is an input file for every output file and vice versa.

import Test.Tasty (defaultMain, TestTree, testGroup)
import Test.Tasty.Golden (goldenVsString, findByExtension)
import qualified Data.ByteString.Lazy as LBS
import YamlToJson (yamlToJson)
import System.FilePath (takeBaseName, replaceExtension)

main :: IO ()
main = defaultMain =<< goldenTests

goldenTests :: IO TestTree
goldenTests = do
  yamlFiles <- findByExtension [".yaml"] "."
  return $ testGroup "YamlToJson golden tests"
    [ goldenVsString
        (takeBaseName yamlFile) -- test name
        jsonFile -- golden file path
        (yamlToJson <$> LBS.readFile yamlFile) -- action whose result is tested
    | yamlFile <- yamlFiles
    , let jsonFile = replaceExtension yamlFile ".json"

This is all the code you need to support one, two, or a thousand test cases. When run, this code will:

  1. find all .yaml files in the current directory
  2. for each .yaml file, construct a golden test that evaluates yamlToJson on the input read from file and compares the result to the golden file, which has the name and the .json extension
  3. put all individual tests in a test group and pass it to defaultMain for execution

To see how this works in practice, create an input file, fruits.yaml, with the following contents:

- orange
- apple
- banana

Now run your test suite (note: in a proper cabalized project, you’d run cabal test or stack test instead):

% stack runghc test.hs
YamlToJson golden tests
  fruits: OK
    Golden file did not exist; created

All 1 tests passed (0.00s)

tasty-golden realized that this is a new test case because the golden file was absent, so it went ahead and initialized the golden file based on the function’s output. You can now examine the file to see if it makes sense:

% cat fruits.json

If you are happy with it, check in both input and output files to git. This is important so that your collaborators can run the tests, but it also helps when dealing with failing tests, as you’ll see next.

% git add fruits.yaml fruits.json && git commit -m "fruits test case"

Dealing with test failures

Occasionally, your tests will fail. A test that cannot fail is a useless test.

A golden test fails when the actual output does not match the contents of the golden file. You then need to figure out whether this is a bug or an intentional code change.

Let’s say you decide that the output of yamlToJson should end with a newline.

The new function definition is

yamlToJson = (<> "\n") . J.encode . Y.decode @Value . LBS.toStrict

Now run the test suite:

% stack runghc test.hs
YamlToJson golden tests
  fruits: FAIL
    Test output was different from './fruits.json'. It was: "[\"orange\",\"apple\",\"banana\"]\n"

1 out of 1 tests failed (0.00s)

Ok, this is not very helpful. There are two main ways to get better diagnostics. One is to use the goldenVsStringDiff function as an alternative to goldenVsString. This will include the diff right in the tasty output.

But my preferred workflow is to use git for this. First, rerun the tests and pass the --accept option. This will update the golden files with the new output:

% stack runghc -- test.hs --accept
YamlToJson golden tests
  fruits: OK
    Accepted the new version

All 1 tests passed (0.00s)

Now, because your golden file is tracked by git, you can examine the differences between the old and new golden files with git diff:

% git diff
diff --git fruits.json fruits.json
index c244c0a..ed447d4 100644
--- fruits.json
+++ fruits.json
@@ -1 +1 @@
\ No newline at end of file

Because this is the change you expected, you can now commit the updated file to git.

This workflow lets you use all the powerful git diff options like --color-words, or even launch a graphical diff tool like kdiff3 with git difftool.

See also

Golden tests are tasty by Kwang Yul Seo

December 04, 2017 08:00 PM

Manuel M T Chakravarty

Here is the video of my Functional Conf 2017 talk Haskell...

<iframe allow="autoplay; encrypted-media" allowfullscreen="allowfullscreen" frameborder="0" height="225" id="youtube_iframe" src=";enablejsapi=1&amp;origin=;wmode=opaque" width="400"></iframe>

Here is the video of my Functional Conf 2017 talk Haskell SpriteKit — a Purely Functional API for a Stateful Animation System and Physics Engine. In this talk, I am explaining how to wrap an OOish game engine API based on a mutable scene graph into a purely functional API based on an immutable algebraic data type.

December 04, 2017 05:23 AM

November 30, 2017

Douglas M. Auclair (geophf)

November 2017 1HaskellADay problems and solutions

by geophf ( at November 30, 2017 09:58 PM

November 29, 2017

Tweag I/O

Making two garbage collectors be good neighbours <br/> (using linear types)

Facundo Domínguez and Mathieu Boespflug

Foreign function interfaces (FFI) allow fast interop between languages. Unlike other approaches, like performing RPC calls between different components written in different languages, using the FFI allows for all manner of data to be shared between each language runtime, in the same address space. This reduces memory consumption and obviates marshalling costs. But when two garbaged-collected languages share references to the same values, each garbage collector (GC) needs to be careful to not collect these values while the other language has references to them. This is a problem we ran into when building both inline-r and inline-java. In this post, we'll survey this very generic problem in all fast language interop, using Java interop as a case study.

Bonus: we'll show you how linear types can help solve the problem safely.

Unsafe bindings to Java

The Java Virtual Machine (JVM) offers a foreign interface to manipulate Java objects, known as the Java Native Interface (JNI). This is a C interface, which we can readily bind in Haskell using inline-c or similar. This is what the jni package does.

The JNI is a low-level interface that is painful to use. No programmer wants to invoke Java methods through the JNI using stringly typed class names, method names and argument types. Doing so is very error-prone and verbose. So we built higher-level abstractions on top, jvm and inline-java, that run every method invocation through the Java type checker as well as the Haskell type checker. Think of inline-java as a pretty good typo detector.

In fact, inline-java does even more than that. It checks that Haskell types and Java types line up. It catches at compile time many common bugs that could cause the program to crash or fail, but a few remain. Notably,

  • it is possible to use references to Java objects by mistake after they have been collected, and
  • it is possible to accidentally retain large amounts of memory in the Java heap with references that live in the memory managed by Haskell.

Here's a case study: the conversion of Java Iterators to Haskell Streams (as defined in the streaming package).

import Foreign.JNI
import Language.Java as Java
import Language.Java.Inline
import Streaming

  :: Reify a
  => J ('Iface "java.util.Iterator")
  -> IO (Stream (Of a) IO ())
iteratorToStream it = do
    return $ Streaming.untilRight $ do
      [| $it.hasNext() |] >>= \case
        False -> return (Right ())
        True -> do
          obj <- [| $ |]
          Left <$> Java.reify obj

See previous posts for an intro to inline-java, but here's the gist. The input to this function is any Java object that conforms to the java.util.Iterator interface. The output is a Stream yielding values of some type a. The Java objects are pulled from the iterator as the stream is consumed. The constraint Reify a states that we know how to convert Java objects to Haskell values of type a. We do this on the last line by calling reify.

Like in Java, it and obj above are actually references to objects. But it's a special type of reference provided by the JNI, which can be used by foreign code (such as C or Haskell). These JNI references need to be deleted explicitly once they are no longer needed, otherwise JVM objects cannot be reclaimed by the JVM GC.

The above implementation of iteratorToStream is not deleting the references to Java objects. That's a leak! Indeed, an object reference acts as a root in the graph of all objects in the heap, as far as the JVM garbage collector is concerned. Adding to the problem, the JVM can't deal very well with large and unknown amounts of references. The JNI expects native calls to use only a few references and expects the programmer to say in advance how many references will be needed. Failing to do so affects performance and can lead to failures.

A straightforward fix to this situation is to delete the reference after the Haskell value has been obtained.

    bracket [| $ |]
            (\jNext -> Left <$> Java.reify jNext)

There are two problems with this approach:

  • this puts the burden on the programmer to remember to delete the reference and to be careful not to use it afterwards (or risk a segfault). Moreover,
  • JNI references are usually local, meaning that they are only valid on the thread that created them. So the programmer has to be careful to not share them with other threads.

Could we possibly ask the compiler to perform these checks?

Garbage Collector Finalizers

One way to avoid needing these checks in the first place is to just let the Haskell GC delete Java references automatically when they become unreachable. We attach to each reference a finalizer that deletes it, which is going to be called by the Haskell GC. Such references are no longer local references, but global references. Unlike local references, a global reference can be used in any thread and it is not destroyed when control returns to Java. Since the JNI provides a facility to promote any local reference to a global one, couldn't we just turn all local references into global ones and then have them be managed by the GC? A global reference is more expensive than a local one, so performance suffers. But it mostly works. Until you run out of memory...

A major problem with letting the GC run the show completely is that counter intuitively, sometimes memory might never be reclaimed, even when many objects are long dead. Suppose that the Java heap is crowded, the Garbage Collector of the JVM is desperate to kick some objects out of existence, and yet there is a good chunk of references from Haskell-land to the Java Heap. The Haskell portion of the application is already done with the references, but since there is plenty of space in the Haskell heap, the Haskell's Garbage Collector is basking in the sun, with no pressure to run the finalizers that would delete the unused references.

Sometimes, the application is lucky and the Haskell GC runs the finalizers just in time, which lets the Java GC clean the Java heap. Unfortunately, sometimes, the Haskell GC won't run and the JVM will fail with an OutOfMemory exception.

Dynamic scopes

Another solution is to define dynamic scopes. When a program's control flow enters a scope, we open a new buffer. We keep track of all newly created references in the buffer, until the control flow leaves the scope, at which point we discard all recorded references all at once. In general, scopes are not allowed to overlap arbitrarily, but they can be nested.

In Haskell, the resourcet package neatly encapsulates this idea. The JNI natively supports a similar idea with using pushLocalFrame and popLocalFrame. pushLocalFrame (n :: Int) creates a new scope in which at least n local references can be created. Exceeding the given capacity might cause performance issues or errors. popLocalFrame j copies the reference j to the parent frame and deletes the current frame, which causes all references of the frame to be deleted.

We are still running the risk of accidentally using a local reference after deletion, and to use it in threads where it is invalid. But programmers no longer need to remember to delete individual local references. Still, in practice we found difficult finding a hierarchy of nested scopes that keeps the counts of local references low. It is a problem that worsens with the size of the application. When building a complex server application that made many invocations to Java, we started with a scope per client request, and then a scope per test, and then we added scopes within the scopes when we were creating more local references than anticipated. Eventually, it did get very difficult for multiple teams of programmers of varying experience levels to be sure that the number of extant references stayed bounded for all possible code paths and inputs.

Linear Types

We would really prefer to delete a reference exactly when we know it to be no longer useful. In this way, memory becomes reclaimable by Java GC immediately. The problem is: it's easy to forget doing so at all, leading to multiple leaks in an application. The key invariant we want checked by the compiler is that once we have a reference, it should be deleted exactly once, and never referred to after that. That is, we want to use references linearly.

What if we used the GHC proposal for linear types to treat our local references linearly? It would look something like this:

import Foreign.JNI
import Language.Java as Java
import Language.Java.Inline as Inline.
import Streaming

  :: Reify a
  => J ('Iface "java.util.Iterator" <> [Interp a])
  ->. IOL (Stream (Of a) IOL ())
iteratorToStream itLocal = do
    return $ Streaming.untilRight $ do
      [| $it.hasNext() |] >>= \case
        False -> return (Right ())
        True -> do
          obj0 <- [| $ |]
          (obj1, Unrestricted a) <- Java.reify obj0
          JNI.deleteLocalRef obj1
          return a

Java.reify :: J (Interp a) ->. IOL (J (Interp a), Unrestricted a)

-- | A linear value of type `Unrestricted a` holds a value of
-- type `a` which can be used non-linearly or unrestrictly.
data Unrestricted a where
  Unrestricted :: a -> Unrestricted a

We are assuming that we have a restricted form of the IO monad, called IOL, with the following operations.

return :: a ->. IOL a
(>>=) :: IOL a ->. (a ->. IOL b) ->. IOL b

liftIO :: IO a -> IOL a

data IOL a where
  IOL :: IO a -> IOL a

runIOL :: IOL (Unrestricted a) -> IO a
runIOL (IOL io) =
    Unrestricted a <-
      bracket_ (JNI.pushLocalFrame capacity)
               (JNI.popLocalFrame JNI.jnull)
    return a
    capacity = ...

Compared to dynamic scopes, the major feature of IOL is that programmers can delete local references promptly, inside a single global scope, when they are no longer needed. The programmer doesn't have to be concerned with guessing a scope hierarchy anymore.

IOL introduces local references as linear values. Operations that do not delete the reference, like reify, now have to return a copy of it, and the operations that delete the value, like deleteLocalRef, produce no copy. This means both that references cannot be used after they are deleted (since they can't be used more than once), and that the compiler will require them to be deleted eventually (they must be used at least once). Finally, local references cannot be allowed to escape the scope of runIOL, as they become invalid before runIOL returns. This is achieved by constraining its argument to yield an unrestricted value Unrestricted a. Local references are released promptly even if an exception arises, thanks to the bracket inside runIOL and the fact that there is no way to catch exceptions in IOL.

Admittedly, if exceptions need to be caught, it has to be done by the caller of runIOL. In our experience, many applications need to catch exceptions in a few places only, so this is a modest price to pay.


Each the local and global references we create via the JNI is effectively a GC root for the Java GC. The JNI was designed with the assumption that programmers ensure that very few such roots are in flight at any one time. The R native interface and others make similar assumptions. In this post, we discussed the tension that arises between releasing early and frequently, and doing so safely without increasing the risk of use-after-free bugs. With linear types, we can get both.

A competing approach that we haven't discussed is the lightweight monadic regions of Kiselyov and Shan. This is an incarnation of dynamic scopes that, like linear types, have the type checker guarantee that resources aren't used after released and that they aren't used in other threads. However, they still demand from the programmer to not insert too many or too few scopes.

Some have suggested introducing affine types instead of linear types in Haskell. But for the particular use case discussed in this post, affine types would do no better than these monadic regions. That's because affine types provide a weaker guarantee to the caller: we can return to the caller having used the argument at most once, but also never at all. We'd need nested scopes all over again to ensure that references do get disposed of in a timely fashion.

In our discussion of linear types, we brought streams to a linear monad without delving into the details of whether it is possible and how it would work. This will be the topic for a future post.

November 29, 2017 12:00 AM

November 27, 2017

Gabriel Gonzalez

Compare Nix derivations using nix-diff

<head><meta charset="UTF-8"/></head>

I'm announcing a small nix-diff utility I wrote for comparing Nix derivations. This post will walk through two use cases for how you might use this utility.


This section provides some required background for understanding this post if you're new to Nix.

There are three stages to a Nix build:

  • Nix source code (i.e. *.nix files)
    • This corresponds to a source distribution in a typical package manager
  • Nix derivations (i.e. /nix/store/*.drv files)
    • This is the stage that caching works at
  • Nix build products (i.e. /nix/store/* files that are not derivations)
    • This corresponds to a binary distribution in a typical package manager

You can convert between these stages using the following command-line tools:

  • nix-instantiate converts Nix source code to Nix derivations
    • i.e. *.nix → /nix/store/*.drv
  • nix-store --realise converts Nix derivations to Nix build products
    • i.e. /nix/store/*.drv → /nix/store/*
  • nix-build is a convenience utility which combines the two preceding steps to go straight from source code to build products
    • i.e. *.nix → /nix/store/*

Nix supports caching binary build products so if you try to build the same derivation twice then the second build will reuse the result of the first build (i.e. a "cache hit"). If the derivation changes in any way, you get a "cache miss" and you need to build the derivation.

Carefully note that caching works at the level of Nix derivations and not at the level of Nix source code. For example, the following two Nix files differ at the source code level:

$ cat example0.nix 
pkgs = import <nixpkgs> { };


$ cat example1.nix
(import <nixpkgs> { }).hello

... but they produce the exact same derivation file:

$ nix-instantiate example0.nix 

$ nix-instantiate example1.nix

... which means that if you try to build both example0.nix and example1.nix the build will only occur once since they share the same derivation.

You can think of the derivation file as a language-independent description of how to build something:

$ fold /nix/store/ajypjz54a8rn1qxsnhyr8m87w6hd7ghp-hello-2.10.drv 
orks/CoreFoundation.framework/CoreFoundation /dev/zero /dev/random /dev/urandom
__sandboxProfile","(allow file-read* (literal \"/usr/lib/libncurses.5.4.dylib\")
)\n(import \"/nix/store/\")\

These *.drv files use the ATerm file format and are Nix-independent. Conceptually, Nix is just a domain-specific language for generating these ATerm files. That means, for example, that you could replace Nix with any front-end language or tool that can generate these ATerm files. In fact, this is how Guix works, by replacing Nix with Guile Scheme as the front-end language.

Understanding how Nix derivations work is fundamental to understanding the Nix ecosystem. nix-diff is one tool that aids this learning process as the following sections will illustrate.

Cache misses

nix-diff is a tool that I wish I had back when Awake Security first adopted Nix. We frequently ran into cache misses when using Nix because of subtle differences in Nix derivations in different development environments.

We can understand why we got cache misses by referring back to the three stages of a Nix build:

  • Nix source code (i.e. *.nix files)
  • Nix derivations (i.e. /nix/store/*.drv files)
  • Nix build products (i.e. /nix/store/* files that are not derivations)

For production we prefer to distribute Nix build products (i.e. binary distributions), but internally for development we distribute Nix source code. We prefer Nix code internally because this gives developers complete control over all of their transitive dependencies. For example, a developer can easily patch the systemd executable used on the virtual machine that runs their integration tests.

However, this flexibility comes at a price: if you don't know what you are doing you can easily accidentally change the derivation. This is because Nix and Nixpkgs are customizable to a fault and they have all sorts of "impure" defaults that change depending on the development environment. If you trip over one of these pitfalls you end up with a cache miss, which is a poor user experience.

The most common pitfalls we ran into early on in our Nix adoption were:

  • Not pinning nixpkgs
    • Note: We publicly shared our recipe for pinning nixpkgs here
  • Not pinning the system field for a derivation
    • This field defaults to the impure builtins.currentSystem in many cases
  • Impure surprises in nixpkgs

Let's motivate this with a real example. Suppose that I have the following derivation to build the Glasgow Haskell compiler (ghc):

$ cat example0.nix
pkgs = import <nixpkgs> { };


This Nix expression is "impure" because the expression depends on the ambient nixpkgs channel that the user has installed. Compare this to the following expression which pins nixpkgs to a specific revision protected by a hash:

$ cat example1.nix
fetchNixpkgs = import ./fetchNixpkgs.nix;

nixpkgs = fetchNixpkgs {
rev = "76d649b59484607901f0c1b8f737d8376a904019";
sha256 = "01c2f4mj4ahir0sxk9kxbymg2pki1pc9a3y6r9x6ridry75fzb8h";

pkgs = import nixpkgs { };


Let's instantiate the two expressions to compute their derivations:

$ nix-instantiate example0.nix 
$ nix-instantiate example1.nix

Note that you may get a different result for the first derivation depending on what version of the nixpkgs channel you have installed.

Visually comparing the two derivation files is tedious and time-consuming:

$ fold /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv | head

$ fold /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv | head

If we use nix-diff, then we can pull out the differences immediately:

$ nix-diff /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv 
- /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv:{out}
+ /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv:{out}
The set of outputs do not match:
+ {man}
The builders do not match
- /nix/store/hsk82g493i7r496ghs0y61m6yvknxcml-bash-4.4-p5/bin/bash
+ /nix/store/axikcsz4wh2qpi5zmlfsmm4jx8wm8s1g-bash-4.4-p12/bin/bash
The set of input names do not match:
- bash-4.4-p5
- clang-wrapper-3.7.1
- coreutils-8.26
- gmp-6.1.1
- perl-5.22.3
- python2.7-Sphinx-1.5.2
+ bash-4.4-p12
+ clang-wrapper-4.0.1
+ coreutils-8.28
+ gmp-6.1.2
+ perl-5.24.3
+ python2.7-Sphinx-1.6.5

Now we can see at a glance that the versions of several dependencies changed and GHC has split out its man pages into a new man output for better granularity of the build graph.

Note that these are not the only differences between the two derivations. However, all of the other differences are downstream of the above differences. For example, the two derivations have different out paths, but we expect them to differ for any two derivations that are not identical so there's no point including that in the diff. nix-diff makes an effort to highlight the root cause of the difference.

Understanding differences

Nix is more than just a package manager. You can use Nix to build and deploy an entire machine, which is how NixOS (the Nix operating system) works. The machine configuration is a Nix expression that you can instantiate and build like any other Nix expression.

This means that we can also use nix-diff to compare two machine configurations and understand how they differ. For example, when we change our production systems at Awake Security we sometimes run the change through nix-diff during code review to ensure that reviewers understand every change being made to the system.

We can illustrate this with a small example comparing two NixOS system specifications. The first system specification is a mostly blank system:

$ cat example0.nix
nixos = import <nixpkgs/nixos> {
system = "x86_64-linux";

configuration = {
boot.loader.grub.devices = [ "/dev/sda" ];

fileSystems."/" = {
device = "/dev/sda";


... and the second specification enables Kafka on the system:

$ cat example1.nix
nixos = import <nixpkgs/nixos> {
system = "x86_64-linux";

configuration = {
boot.loader.grub.devices = [ "/dev/sda" ];

fileSystems."/" = {
device = "/dev/sda";

services.apache-kafka.enable = true;


We can differentiate the two derivations in one step like this:

$ nix-diff $(nix-instantiate example0.nix) $(nix-instantiate example1.nix)
- /nix/store/6z9nr5pzs4j1v9mld517dmlcz61zy78z-nixos-system-nixos-18.03pre119245.
+ /nix/store/k05ibijg0kknvwrgfyb7dxwjrs8qrlbj-nixos-system-nixos-18.03pre119245.
The input named `etc` differs
- /nix/store/05c0v10pla0v8rfl44rs744m6wr729jy-etc.drv:{out}
+ /nix/store/8waqvzjg7bazzfzr49m89q299kz972wv-etc.drv:{out}
The input named `dbus-1` differs
- /nix/store/a16j2snzz25dhh96jriv3p6cgkc0vhxr-dbus-1.drv:{out}
+ /nix/store/mliabzdkqaayya67xiwfhwkg4gs9k0cg-dbus-1.drv:{out}
The input named `system-path` differs
- /nix/store/jcf6q7na01j8k9xcmqxykl62k4x6zwiv-system-path.drv:{out}
+ /nix/store/kh4kgsms24d02bxlrxb062pgsbs3riws-system-path.drv:{out}
The set of input names do not match:
+ apache-kafka-2.12-
The input named `system-path` differs
These two derivations have already been compared
The input named `system-units` differs
- /nix/store/yqnqdajd4664rvycrnwxwaj0mxp7602c-system-units.drv:{out}
+ /nix/store/2p5c4arwqphdz5wsvz6dbrgv0vhgf5qh-system-units.drv:{out}
The set of input names do not match:
+ unit-apache-kafka.service
The input named `user-units` differs
- /nix/store/x34dqw5y34dq6fj5brj2b5qf0nvglql9-user-units.drv:{out}
+ /nix/store/4iplnk260q2dpr8b8ajrjkrn44yk06aq-user-units.drv:{out}
The input named `unit-dbus.service` differs
- /nix/store/fd6j972zn1hfvqslxc8c64xxaf1wg475-unit-dbus.service.drv:{out}
+ /nix/store/s7rpgwbald9qx8rwlw4v276wj2x3ld8r-unit-dbus.service.drv:{out}
The input named `dbus-1` differs
These two derivations have already been compared
The input named `system-path` differs
These two derivations have already been compared
The input named `users-groups.json` differs
- /nix/store/x6c7pqx40wfdzwf96jfi1l0hzxjgypri-users-groups.json.drv:{out}
+ /nix/store/gk5yyjw579hgyxgwbrh1kzb3hbdbzgbq-users-groups.json.drv:{out}
The environments do not match:
"mutableUsers":true,"users":[{"createHome":false,"description":"→Apache Kafka
daemon user","group":"nogroup","hashedPassword":null,"home":"/tmp/kafka-logs","i
/bin/nologin","uid":169},{"createHome":false,"description":"→D-Bus system mess

However, this doesn't do the diff justice because the output is actually colorized, like this:

From the diff we can see that:

  • This change adds Kafka executables to the system PATH
  • This change adds a new apache-kafka systemd service
  • This change adds a new apache-kafka user to the system

Note how nix-diff does more than diffing the two root derivations. If the two derivations differ on a shared input then nix-diff will descend into that input and diff that and repeat the process until the root cause of the change is found. This works because Nix's dependency graph is complete and reachable from the root derivation.


You can find the nix-diff utility on Hackage or GitHub if you would like to use this in your own development workflow. Hopefully nix-diff will help you better understand how Nix works under the hood and also help you pin Nix derivations more robustly.

by Gabriel Gonzalez ( at November 27, 2017 03:58 PM

November 26, 2017

Neil Mitchell

Haskell exceptions and FFI wrappers

Summary: If you create a C function pointer from a Haskell function with "wrapper", and it throws an exception, bad things happen.

The Haskell FFI is incredibly powerful, allowing you to convert Haskell functions into C function pointers. In this post I'll give a quick example, then go into what happens if the Haskell function throws an exception. First, let's define a C function (and put it in a file called c.c):

int apply(int(*f)(int), int x)
return f(x);

The piece int(*f)(int) says f is a function of type Int -> Int. The function apply is equivalent to $, restricted to int - it applies the first argument f to the second argument x and returns the result. We can call that in Haskell with:

foreign import ccall apply :: FunPtr (CInt -> IO CInt) -> CInt -> IO CInt
foreign import ccall "wrapper" wrap :: (CInt -> IO CInt) -> IO (FunPtr (CInt -> IO CInt))

main :: IO ()
main = do
f <- wrap $ \x -> return $ x + 20
res <- apply f 22
print res

On the first line we wrap apply into a Haskell definition, turning a C function pointer into FunPtr. In the second we define a special "wrapper" FFI definition - the name "wrapper" is a specific string which is part of the FFI spec - it converts a Haskell function into a C function pointer. In main we put these pieces together, and other than the pervasive IO, it looks like the equivalent Haskell.

Note: In real code you should always call freeHaskellFunPtr after you have finished using a "wrapper" function, usually using bracket.

Consequences of Exceptions

What happens if the function we pass to wrap throws an exception? If you read the GHC manual, you'll find an incomplete link to the FFI spec, which stays silent on the subject. Thinking it through, Haskell has exceptions, but C does not - if the Haskell throws an exception it can't be passed back through C. Haskell can't provide a return value, so it can never resume the C code that called it. The GHC runtime can block indefinitely or kill the thread, both of which are fairly fatal for a program. As a consequence, I strongly recommend never throwing an exception from a function generated by "wrapper" - but what if we do?

Suggestion: most of the FFI addendum should probably be reproduced in the GHC manual with details around corner cases and exceptions.

Testing Exceptions

First, let's change our wrapped function to wrap $ \x -> fail "finish". Running that prints out:

bug.exe: user error (finish)

That seems like a standard exception. However, let's go further and put the entire program inside a finally, to show we have a normal Haskell exception:

main = flip finally (print "done") $ do

The output doesn't change - we never print out "done". It seems the exception thrown inside wrap aborts the program rather than bubbling up.

Suggestion: This error looks like a normal exception, but really isn't. It should say you have violated the wrapper invariant and your program has been violently aborted.

We've encountered bad behaviour, but can we go worse? Yes we can, by adding threads:

main = do
replicateM_ 100 $ do
forkIO $ do
ff <- wrap $ \_ -> fail "die"
print =<< apply ff 12
threadDelay 10000000

Here we spawn 100 threads, each of which does an apply with an exception, then we wait for 10 seconds. The output is:

bug.exe: user error (die)
bug.exe: user error (die)
bug.exe: warning: too many hs_exit()s

It looks like there is a race condition with the exit path, causing two fatal wrapper exceptions to try and take down the runtime twice.

Suggestion: The hs_exit bug should be fixed.

Avoiding Exceptions

Now we know we need to avoid throwing exceptions inside "wrapper" functions, the obvious approach is to wrap them in a catch, e.g.:

wrap $ \x -> ... `catch` \(_ :: SomeException) -> return (-1)

Namely catch all exceptions, and replace them with -1. As usual with catch, it is important to force evaluation of the ... inside the catch (e.g. using catchDeep from safe-exceptions). If you want to recover the original exception you can capture it in an IORef and throw it after leaving C:

ref <- newIORef Nothing
f <- wrap $ \x -> ... `catch` \(e :: SomeException) -> do
writeIORef ref $ Just e
return (-1)
res <- apply f 22
whenJustM (readIORef ref) throwIO

However, what if there is an asynchronous exception after we leave the catch but before we return to C? From my experiments, this doesn't appear to be possible. Even though getMaskingState returns Unmasked exceptions thrown to the function inside wrapper appear to be deferred until the C code returns.

Suggestion: The documentation should clarify if my experiments are correct. Should getMaskingState return MaskedUninterruptible?

by Neil Mitchell ( at November 26, 2017 09:39 PM

November 25, 2017

Joachim Breitner

Existence and Termination

I recently had some intense discussions that revolved around issues of existence and termination of functions in Coq, about axioms and what certain proofs actually mean. We came across some interesting questions and thoughts that I’ll share with those of my blog readers with an interest in proofs and interactive theorem proving.


  • It can be meaningful to assume the existence of a function in Coq, and under that assumption prove its termination and other properties.
  • Axioms and assumptions are logically equivalent.
  • Unsound axioms do not necessary invalidate a theory development, when additional meta-rules govern their use.


Our main running example is the infamous Collatz series. Starting at any natural number, the next is calculated as follow:

Require Import Coq.Arith.Arith.

Definition next (n : nat) :nat :=
  if Nat.even n then n / 2 else 3*n + 1.

If you start with some positive number, you are going to end up reaching 1 eventually. Or are you? So far nobody has found a number where that does not happen, but we also do not have a proof that it never happens. It is one of the great mysteries of Mathematics, and if you can solve it, you’ll be famous.

A failed definition

But assume we had an idea on how to prove that we are always going to reach 1, and tried to formalize this in Coq. One attempt might be to write

Fixpoint good (n : nat) : bool :=
  if n <=? 1
    then true
    else good (next n).

Theorem collatz: forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

Unfortunately, this does not work: Coq rejects this recursive definition of the function good, because it does not see how that is a terminating function, and Coq requires all such recursive function definitions to be obviously terminating – without this check there would be a risk of Coq’s type checking becoming incomplete or its logic being unsound.

The idiomatic way to avoid this problem is to state good as an inductive predicate... but let me explore another idea here.

Working with assumptions

What happens if we just assume that the function good, described above, exists, and then perform our proof:

Theorem collatz
  (good : nat -> bool)
  (good_eq : forall n,
     good n = if n <=? 1 then true else good (next n))
  : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

Would we accept this as a proof of Collatz’ conjecture? Or did we just assume what we want to prove, in which case the theorem is vacuously true, but we just performed useless circular reasoning?

Upon close inspection, we find that the assumptions of the theorem (good and good_eq) are certainly satisfiable:

Definition trivial (n: nat) : bool := true.

Lemma trivial_eq: forall n,
  trivial n = if n <=? 1 then true else trivial (next n).
Proof. intro; case (n <=? 1); reflexivity. Qed.

Lemma collatz_trivial: forall n, trivial n = true.
  apply (collatz trivial trivial_eq).

So clearly there exists a function of type nat -> bool that satisfies the assumed equation. This is good, because it means that the collatz theorem is not simply assuming False!

Some (including me) might already be happy with this theorem and proof, as it clearly states: “Every function that follows the Collatz series eventually reaches 1”.

Others might still not be at ease with such a proof. Above we have seen that we cannot define the real collatz series in Coq. How can the collatz theorem say something that is not definable?

Classical reasoning

One possible way of getting some assurance it to define good as a classical function. The logic of Coq can be extended with the law of the excluded middle without making it inconsistent, and with that axiom, we can define a version of good that is pretty convincing (sorry for the slightly messy proof):

Require Import Coq.Logic.ClassicalDescription.
Require Import Omega.
Definition classical_good (n:nat) : bool :=
  if excluded_middle_informative (exists m, Nat.iter m next n <= 1)
  then true else false.

Lemma iter_shift:
  forall a f x (y:a), Nat.iter x f (f y) = f (Nat.iter x f y).
 intros. induction x. reflexivity. simpl. rewrite IHx. reflexivity. Qed.

Lemma classical_good_eq: forall n,
  classical_good n = if n <=? 1 then true else classical_good (next n).
  unfold classical_good at 1.
  destruct (Nat.leb_spec n 1).
  * destruct (excluded_middle_informative _); try auto.
    contradict n0. exists 0. simpl. assumption.
  * unfold classical_good.
    destruct (Nat.eqb_spec (next n) 0); try auto.
    destruct (excluded_middle_informative _), (excluded_middle_informative _); auto.
    - contradict n0.
      destruct e0.
      destruct x; simpl in *. omega.
      exists x. rewrite iter_shift. assumption.
    - contradict n0.
      destruct e0.
      exists (S x). simpl. rewrite iter_shift in H0. assumption.

Lemma collatz_classical: forall n, classical_good n = true.
Proof. apply (collatz classical_good classical_good_eq). Qed.

The point of this is not so much to use this particular definition of good, but merely to convince ourselves that the assumptions of the collatz theorem above encompass “the” Collatz series, and thus constitutes a proof of the Collatz conjecture.

The main take-away so far is that existence and termination of a function are two separate issues, and it is possible to assume the former, prove the latter, and not have done a vacuous proof.

The ice gets thinner


Starting with the above Theorem collatz, there is another train of thought I invite to to follow along.

Probably the “genius idea” proof will be more than a few lines long, and we probably to be able to declare helper lemmas and other things along the way. Doing all that in the body of the collatz proof is not very convenient, so instead of using assumptions, we might write

Section collatz:
Variable good : nat -> bool.
Variable good_eq : forall n,
  good n = if n <=? 1 then true else good (next n)

Theorem collatz2 : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.
End collatz.

So far so good: Clearly, I just refactored my code a bit, but did not make any significant change. The theorems collatz2 and collatz are equivalent.

Sound axioms

But note that we do not really intend to instantiate collatz2. We know that the assumptions are satisfiable (e.g. since we can define trivial or classical_good). So maybe, we would rather avoid the Section mechanism and simply write

Axiom good : nat -> bool.
Axiom good_eq : forall n,
  good n = if n <=? 1 then true else good (next n)

Theorem collatz3 : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

I assume this will make a few of my readers’ eyebrows go up: How can I dare to start with such Axioms? Do they not invalidate my whole development?

On the other hand, all that a Coq axiom is doing is saying “the following theorems are under the assumption that the axiom holds”. In that sense, collatz3 and collatz2 are essentially equivalent.

Unsound axioms

Let me take it one step further, and change that to:

Axiom unsafeFix : forall a, (a -> a) -> a.
Axiom unsafeFix_eq : forall f, unsafeFix f = f (unsafeFix f).
Definition good : nat -> bool :=
  unsafeFix (fun good n => if n <=? 1 then true else good (next n)).

Theorem collatz4 : forall n, good n = true.
Proof. (* Insert genius idea here.*) Qed.

At this point, the majority of my readers will cringe. The axiom unsafeFix is so blatantly unsound (in Coq), how do I even dare to think of using it. But bear with me for a moment: I did not change the proof. So maybe the collatz4 theorem is still worth something?

I want to argue that it is: Both unsafeFix and unsafeFix_eq are unsound in their full generality. But as long as I instantiate them only with functions f which have a fixpoint, then I cannot prove False this way. So while “Coq + unsafeFix” is unsound, “Coq + unsafeFix + unsafeFix_eq + metarule that these axioms are only called with permissible f” is not.

In that light, my collatz4 proof carries the same meaning as the collatz3 proof, it is just less convenient to check: If I were to check the validity of collatz3, I have to maybe look for uses of admit, or some misleading use of syntax or other tricks, or other smells. When I have to check the validity of collatz4, I also have to additionally check the meta-rule -- tedious, but certainly possible (e.g. by inspecting the proof term).

Beyond Collatz

The questions discussed here did not come up in the context of the Collatz series (for which I unfortunately do not have a proof), but rather the verification of Haskell code in Coq using hs-to-coq. I started with the idiomatic Haskell definition of “Quicksort”:

quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = quicksort lesser ++ [p] ++ quicksort greater
    where (lesser, greater) = partition (<p) xs

This function is not terminating in a way that is obvious to the Coq type checker. Conveniently, hs-to-coq can optionally create the Coq code using the unsafeFix axiom above, producing (roughly):

Definition quicksort {a} `{Ord a} : list a -> list a :=
  unsafeFix (fun quicksort xs =>
    match xs with
      | nil => nil
      | p :: xs => match partition (fun x => x <? p) xs with
         | (lesser, greater) => quicksort lesser ++ [p] ++ quicksort greater

I then proved (roughly)

Theorem quicksort_sorted:
  forall a `(Ord a) (xs : list a), StronglySorted (quicksort xs).


Theorem quicksort_permutation:
  forall a `(Ord a) (xs : list a), Permutation (quicksort xs) xs.

These proofs proceed by well-founded induction on the length of the argument xs, and hence encompass a termination proof of quicksort. Note that with a only partially correct but non-terminating definition of quicksort (e.g. quicksort := unsafeFix (fun quicksort xs => quicksort xs)) I would not be able to conclude these proofs.

My (not undisputed) claim about the meaning of these theorems is therefore

If the Haskell equations for quicksort actually have a fixed point, then the use of unsafeFix in its definition does not introduce any inconsistency. Under this assumption, we showed that quicksort always terminates and produces a sorted version of the input list.

Do you agree?

by Joachim Breitner ( at November 25, 2017 08:54 PM

November 24, 2017

Functional Jobs

Software Engineer (Haskell, Full Stack) at Capital Match (Full-time)

About Us: Capital Match is a leading P2P lending platform based in Singapore, founded in 2014, backed by alternative investment management firm with more than US$ 5 bn AUM. We are looking for experienced developers to lead our tech growth in the Fintech space, expand into surrounding countries and develop new products on the platform.

Job Description: We are inviting developers with a minimum of 5 years coding experience. The candidate should have functional programming experience as well as experience in developing server and web applications. An interest in all aspects of the creation, growth and operations of a secure web-based platform: front-to-back feature development, distributed deployment and automation in the cloud, build and test automation, is highly desirable. A background in fintech and especially the lending space would be an advantage (but not essential).

Job Requirements: Our platform is primarily developed in Haskell with a ClojureScript frontend. Candidates should ideally have production experience with Haskell, or strong experience with at least one other functional programming language. (For example: OCaml/F#/Scala/Clojure/Lisp/Erlang)

We use Docker containers and standard cloud infrastructure systems to manage our production rollouts, so familiarity with Linux systems, command-line environments and cloud-based deployments is highly desirable. Exposure to and understanding of XP practices such as TDD, CI, Emergent Design, Refactoring, Peer Review and Continuous Improvement is highly desirable.

We are inviting developers with at least 5 years of software engineering experience.

Offer: We offer a combination of salary and equity depending on experience and skills of the candidate. Most expats who relocate to Singapore do not have to pay their home country taxes and the local tax rate in Singapore is more or less 5%. Visa sponsorship will be provided. Singapore is a great place to live, a vibrant city rich with diverse cultures, a very strong financial sector and a central location in Southeast Asia.

Get information on how to apply for this position.

November 24, 2017 08:40 AM

Sandy Maguire

Gentle Theorems: Difference of Squares

<article> <header>

Gentle Theorems: Difference of Squares


<time>November 24, 2017</time> math

I have a (not very controversial) feeling that people don’t feel as though algebra is actually a thing you can use for stuff. I fall into this trap myself often, despite being someone who does math for a living, and so I suspect this is a pretty wide-spread phenomenon. Let me explain.

For example, consider the equation:

\[ (x + y)(x - y) = x^2 - y^2 \]

This is known as the difference of squares. Let’s work through the derivation of it together:

\[ \begin{align*} (x + y)(x - y) &= (x + y)(x - y) \\ &= x^2 + xy - xy - y^2 \\ &= x^2 + \cancel{xy - xy} - y^2 \\ &= x^2 - y^2 \end{align*} \]

Recall that we can use the FOIL method to get from the first line to the second.

I implore you to read through this proof carefully, and convince yourself of its truthfulness – even if you don’t consider yourself a “math” person. Believe it or not, there’s a point I’m getting to.

Anyway – by all accounts, this difference of squares thing is a pretty humdrum theorem. Who really cares, right? Let’s switch gears for a bit and talk about something more interesting.

Recall that \(20 \times 20 = 400\). As an interesting question, without actually computing it, let’s think about the product \(19 \times 21\). What does this equal? It seems like it could also be \(400\) – after all, all we did was take one away from the left side of the times and move it to the right.

In fact, if you work it out, \(19 \times 21 = 399\). That’s kind of interesting: somehow we lost a \(1\) by shuffling around the things we were multiplying.

This seems to not be an isolated incident:

\[ \begin{align*} 5 \times 5 &= 25 \\ \text{but,}\quad4 \times 6 &= 24 \end{align*} \]

\[ \begin{align*} 10 \times 10 &= 100 \\ \text{but,}\quad9 \times 11 &= 99 \end{align*} \]

An intriguing question to ask yourself is whether this is always true, or whether we’ve just gotten lucky with the examples we looked at.

But the more interesting question, in my opinion, is what happens if we go from \(19 \times 21 = 399\) to \(18\times22\). Will we lose another \(1\) when we fiddle with it? Or will something else happen? Form an opinion on what the answer will be before continuing.

\[ \begin{align*} 20 \times 20 &= 400 \\ \text{but,}\quad 21 \times 19 &= 399 \\ \text{but,}\quad 22 \times 18 &= 396 \end{align*} \]

Weird – somehow we lost \(3\) that time. What’s happened here?

If you’re confused (and I was, when I first saw this), don’t despair. As it happens, you already know the answer!

So, what’s going on here? Well, we’ve actually just been dealing with differences of squares the whole time – probably without even realizing it!

Most people, I think, fail to connect the algebraic fact that \((x+y)(x-y)=x^2-y^2\) to the fact that \(22\times18=396\). If you still don’t see why, we can explicitly fill in our variables:

\[ \begin{align*} 22\times18&=(20+2)(20-2)\\ &=20^2-2^2 \\ &= 400 - 4 \\ &= 396 \end{align*} \]

Neat, right? Even if you carefully read through the proof of the difference of squares earlier, you might not have noticed that we’ve been playing with them the entire time! I blame western math education for this; too often are equations presented only to be solved, and never to be thought about. It’s a disservice we’ve done to ourselves.

The takeaway of all of this, in my opinion, is that we should spend some time thinking about the notion of equality, about the \(=\) symbol. Ever since looking at this difference of squares thing, I’ve started viewing \(=\) not as the symbol which separates the left side of an equation from the right, but as a transformation. The \(=\) sign transforms something we can experience into something we can manipulate, and back again.

What I mean by that is that it’s a lot easier to conceptualize \(22\times18\) than it is to think about \((x+y)(x-y)\). The numeric representation is better suited for human minds to experience, while the algebraic expression is better at twiddling. We know how to twiddle algebra, but twiddling numbers themselves is rather meaningless.

In terms of everyday usefulness, this isn’t particularly helpful, except that it’s often easier to compute a difference of squares than it is to do the multiplication naively. If you can recognize one, you could probably impress someone with your mental arithmetic – but, again, it’s not going to revolutionize your life in any way.

All of this is to say that math is neat. Even if you don’t see any practical value in this stuff, hopefully you’ll agree that there might be interesting puzzles to be found here. And, as it turns out, algebra can be a satisfying tool for solving these puzzles.

Thanks to Matt Parsons for proof-reading an early version of this post.


November 24, 2017 12:00 AM

November 22, 2017

Robert Harper

Sequentiality as the Essence of Parallelism

I recently thought of a nice way to structure a language for parallel programming around the concept of sequential composition.  Think of parallelism as the default—evaluate everything in parallel unless the semantics of the situation precludes it: sums are posterior to summands, but the summands can be evaluated simultaneously.  You need a way to express the necessary dependencies without introducing any spurious ones.

There’s a tool for that, called lax logic, introduced by Fairtlough and Mendler and elaborated by Davies and Pfenning, which I use extensively in PFPL.  The imperative language Modernized Algol is formulated in the lax style, distinguishing two modes, or levels, of syntax, the (pure) expressions and the (impure) commands.  The lax modality, which links the two layers, behaves roughly like a monad, but, all the hype notwithstanding, it is not the central player.  It’s the modes, not the modality, that matter.  (See the Commentary on PFPL for more.)

The lax modality is just the ticket for expressing parallelism.  Rather than separate expressions from commands, here we distinguish between values and computations.  The names are important, to avoid possible confusion.  Values are fully evaluated; they are not a source of parallelism.  (If values were called “pure”, it would be irresistible to think otherwise.)  Computations have yet to be evaluated; they engender parallelism by sequential composition.  What?  No, you didn’t nod off! Let me explain.

Parallelism is all about the join points.  If parallel execution is the default, then the job of the programmer is not to induce parallelism, but to harness it.  And you do that by saying, “this computation depends on these others.”  Absent that, there is nothing else to say, just go for it.  No sub-languages.  No program analysis.  No escaping the monad.  Just express the necessary dependencies, and you’re good to go.

So, what are the join points?  They are the elimination forms for two parallel modalities.  They generalize the sequential case to allow for statically and dynamically determined parallelism.   A value of parallel product type is a tuple of unevaluated computations, a kind of “lazy” tuple (but not that kind of laziness, here I just mean unevaluated components).  The elimination form evaluates all of the component computations in parallel, creates a value tuple from their values, and passes it to the body of the form.  Similarly, a value of parallel sequence type is a generator consisting of two values, a natural number n indicating its size, and a function determining the ith component computation for each 1≤i<n.  The elimination form activates all n component computations, binds their values to a value sequence, and passes it to the body of the form.

The join point effects a change of type, from encapsulated computations to evaluated values, neatly generalizing sequential composition from a unary to a multiway join.  If you’d like, the parallel products and parallel sequences are “generalized monads” that encapsulate not just one, but many, unevaluated computations.  But they are no more monads than they are in any other functional language: the categorial equational laws need not hold in the presence of, say, divergence, or exceptions.

The dynamics assigns costs to computations, not to values, whose cost of creation has already been paid.  The computation that just returns a value has unit work and span.  Primitive operations take unit work and span.  The sequential composition of a parallel product with n components induces span one more than the maximum span of the constituents, and induces work one more than the sum of their work.  The dynamics of sequential composition for parallel sequences is similar, with the “arity” being determined dynamically rather than statically.

Programming in this style means making the join points explicit.  If you don’t like that, you can easily define derived forms—and derived costs—for constructs that do it for you.    For example, a pair of computations might be rendered as activating a parallel pair of its components, then returning the resulting value pair.  And so on and so forth.  It’s no big deal.

En passant the modal formulation of parallelism solves a nasty technical problem in a substitution-based cost semantics that does not make the modal distinction.  The issue is, how to distinguish between the creation of a value, and the many re-uses of it arising from substitution?  It’s not correct to charge again and again for cresting the value each time you see it (this cost can be asymptotically significant), but you do have to charge for creating it somewhere (it’s not free, and it can matter).  And, anyway, how is one to account for the cost of assessing whether an expression is, in fact, a value?  The usual move is to use an environment semantics to manage sharing.  But you don’t have to, the modal framework solves the problem, by distinguishing between a value per se; the computation that returns it fully created; and the computation that incrementally constructs it from its constituent parts.  It’s the old cons-vs-dotted pair issue, neatly resolved.

Please see Section 10 of the Commentary on PFPL for a fuller account.  The main idea is to generalize a type of single unevaluated computations, which arises in lax logic, to types of statically- and dynamically many unevaluated computations.  The bind operation becomes a join operation for these computations, turning a “lazy” tuple or sequence into eager tuples or sequences.

Updates: word-smithing, added cite to Davies-Pfenning, replaced cite of course notes with reference to commentary.

by Robert Harper at November 22, 2017 06:18 PM

November 21, 2017

The GHC Team

GHC 8.2.2 is available

The GHC Team is pleased to announce a new minor release of GHC. This release builds on the performance and stability improvements of 8.2.1, fixing a variety of correctness bugs, improving error messages, and making the compiler more portable.

Notable bug-fixes include

  • A correctness issue resulting in segmentation faults in some FFI-users (#13707, #14346)
  • A correctness issue resulting in undefined behavior in some programs using STM (#14171)
  • A bug which may have manifested in segmentation faults in out-of-memory condition (#14329)
  • clearBit of Natural no longer bottoms (#13203)
  • A specialisation bug resulting in exponential blowup of compilation time in some specialisation-intensive programs (#14379)
  • ghc-pkg now works even in environments with misconfigured NFS mounts (#13945)
  • GHC again supports production of position-independent executables (#13702)

A thorough list of the changes in the release can be found in the release notes,

How to get it

This release can be downloaded from

For older versions see

We supply binary builds in the native package format for many platforms, and the source distribution is available from the same place.


Haskell is a standard lazy functional programming language.

GHC is a state-of-the-art programming suite for Haskell. Included is an optimising compiler generating efficient code for a variety of platforms, together with an interactive system for convenient, quick development. The distribution includes space and time profiling facilities, a large collection of libraries, and support for various language extensions, including concurrency, exceptions, and foreign language interfaces. GHC is distributed under a BSD-style open source license.

A wide variety of Haskell related resources (tutorials, libraries, specifications, documentation, compilers, interpreters, references, contact information, links to research groups) are available from the Haskell home page (see below).

On-line GHC-related resources

Relevant URLs on the World-Wide Web:

Supported Platforms

The list of platforms we support, and the people responsible for them, is here

Ports to other platforms are possible with varying degrees of difficulty. The Building Guide describes how to go about porting to a new platform.


We welcome new contributors. Instructions on accessing our source code repository, and getting started with hacking on GHC, are available from the GHC's developer's site run by Trac.

Community Resources

There are mailing lists for GHC users, develpoers, and monitoring bug tracker activity; to subscribe, use the Mailman web interface.

There are several other Haskell and GHC-related mailing lists on; for the full list, see the lists page.

Some GHC developers hang out on the #ghc and #haskell of the Freenode IRC network, too. See the Haskell wiki for details.

Please report bugs using our bug tracking system. Instructions on reporting bugs can be found here.

by Ben Gamari at November 21, 2017 10:06 PM

Yesod Web Framework

mega-sdist: the mega repo helper

Many years ago, I wrote a utility called mega-sdist to help me with managing mega repos (more on that below). I've been using it myself ever since, making some minor improvements over the years. But I realized recently that I never really announced it to others, and especially not to the people whom it would help the most: other Yesod contributors and maintainers. Consider this the (massively belated) announcement.

You can find the most up-to-date information in the project on Github. Below is the current content of that file, to help save you a click.

This is a utility written to address the specific needs in maintaining Haskell "mega-repos," or Git repositories containing multiple Cabal projects. It is intended to ease the process of deciding which packages need to be released and tagging those releases appropriately.

It provides the following functionality:

  • Detect when local code has changed from what's on Hackage
    • Note that, due to Hackage revisions, sometimes this logic isn't perfect
  • Detect when a version number needs to be updated
  • Dump the difference between the Hackage version of your package and the local version

To install it... well, listen. This tool is intended for people authoring Haskell packages. Odds are, you already know how to do this. And if you don't know, this probably isn't a tool that will help you. Anyway, in order to install it, first install Stack and then run stack install mega-sdist, or just stack install inside this repository.

Opinionated tool

This utility is highly opinionated in some ways, e.g.:

  • It only supports one style of Git tag name: packagename/version. This may look weird in non-mega-repos, where v1.2.3 looks better than foo/1.2.3, but for mega-repos the former doesn't make sense.
  • It depends on Stack for both discovering all of your local packages, and for uploading to Hackage.

If you're OK with these opinions, keep reading for usage.

Have I changed anything?

Let's say I'm working on the monad-unlift megarepo (chosen as an example of a relatively small repo). I've merged some PRs recently, or at least think I have. But I don't remember which of the individual packages within the repo this affected. Instead of looking at the commit history like some caveman, I'll typically do:

$ git pull # make sure I have all latest changes
$ mega-sdist

The mega-sdist command will:

  • Build tarballs for all local packages
  • Check what the latest versions of my packages on Hackage are
  • Do a full diff on these two things and see if anything's changed

At the time of writing, here's the output from this repo:

The following packages from Hackage have not changed:

The following packages require a version bump:

What this means is:

  • The monad-unlift package I have locally is at version 0.2.0. And it perfectly matches that version on Hackage. No actions necessary.
  • The monad-unlift-ref package I have locally is at version 0.2.1. And it doesn't match the code on Hackage. Therefore, if I wanted to run stack upload monad-unlift-ref successfully, I'd need to bump the version number.

What did I change?

Well, again, if I wanted to see what changed, I could run (again, like a caveman):

$ git diff monad-unlift-ref/0.2.1 -- monad-unlift-ref

But that's long! mega-sidst's got your back. Just run:

$ mega-sdist monad-unlift-ref --get-diffs

This will print out the difference between the tarball uploaded to Hackage and what you have locally. Besides my tongue-in-cheek comment above, this is also useful if, for some reason, you either don't have or don't trust the tags in your Git repo.

One other thing: this diff is currently based on the pristine tarball from Hackage, ignoring cabal file revisions. So the difference may be slightly different from what you'd get from stack unpack monad-unlift-ref-0.2.1. But ¯\_(ツ)_/¯ that's revisions for you.

The default behavior of mega-sdist is to look at all packages specified in your stack.yaml. Targets can be any directory. And mega-sdist will automatically look at packages in any subdirectory, so that mega-sdist . is the same as mega-sdist at the root of your repo*.

* Assuming all of your packages are actually in your repo, but only crazy people would do otherwise.

Preparing a new release

OK, now I continue working on my project, and I've:

  • Made some changes to monad-unlift
  • Updated the cabal file's version number
    • And of course I also updated the, I'm not some monster

From the root of my repo, I run:

$ mega-sdist monad-unlift

Or, equivalently, from inside the monad-unlift subdirectory I run:

$ mega-sdist .

Either way, I get:

The following new packages exist locally:

No version bumps required, good to go!

This tells me that my package has local changes, and the version number has been updated, so that stack upload monad-unlift will work. Neato! Now, you could just run stack upload ..., but here's what I usually do. First, I'll review the changes I'm about to upload and make sure there are no surprises:

$ mega-sdist --get-diffs .

The following new packages exist locally:
diff -r old/monad-unlift-0.2.0/ new/monad-unlift-0.2.1/
> ## 0.2.1
> * Silly changes
diff -r old/monad-unlift-0.2.0/Control/Monad/Trans/Unlift.hs new/monad-unlift-0.2.1/Control/Monad/Trans/Unlift.hs
> -- I just need some space
diff -r old/monad-unlift-0.2.0/monad-unlift.cabal new/monad-unlift-0.2.1/monad-unlift.cabal
< version:             0.2.0
> version:             0.2.1

No version bumps required, good to go!

OK, that's what I wanted. Time to release. Next, I'm going to use mega-sdist to tag the release:

$ mega-sdist --gittag .

From the root of my repo, this would notice that monad-unlift-ref still requires a version bump, and refuse to proceed. But inside the monad-unlift directory, it notices that all necessary version bumps are done, and happily tags:

$ mega-sdist --gittag .
The following new packages exist locally:

No version bumps required, good to go!
Raw command: git tag monad-unlift/0.2.1

And suddenly I notice something new:

$ ls tarballs/

Neat, mega-sdist left behind tarballs I can upload! To do so, I run:

$ stack upload tarballs/*

Note that this will work whether I'm trying to upload just one package, or all of the updated packages in my repo. Finally, I need to push the new tags to Github (or wherever):

$ git push --tags

And in fact, this upload sequence is so common that I have a shell alias set up:

$ alias upload
alias upload='mega-sdist --gittag . && stack upload tarballs/* && git push --tags'

So there you have it: convenient little utility to help manage repos with lots of packages in them.

November 21, 2017 03:15 PM

Philip Wadler

Pay what you want for Java Generics and Collections

Humble Book Bundle is selling off a passle of Java books, including Java Generics and Collection by Naftalin and Wadler, on a pay-what-you-want basis (USD $1 minimum), DRM-free. You choose what proportion of the profits go to Humble and what goes to the charity Code for America. A great deal!

by Philip Wadler ( at November 21, 2017 12:16 PM

November 18, 2017

Sandy Maguire

Type-Directed Code Generation

<article> <header>

Type-Directed Code Generation


<time>November 18, 2017</time>

aka “Type-Level Icing Sugar”


At work recently I’ve been working on a library to get idiomatic gRPC support in our Haskell project. I’m quite proud of how it’s come out, and thought it’d make a good topic for a blog post. The approach demonstrates several type-level techniques that in my opinion are under-documented and exceptionally useful in using the type-system to enforce external contracts.

Thankfully the networking side of the library had already been done for me by Awake Security, but the interface feels like a thin-wrapper on top of C bindings. I’m very, very grateful that it exists, but I wouldn’t expect myself to be able to use it in anger without causing an uncaught type error somewhere along the line. I’m sure I’m probably just using it wrong, but the library’s higher-level bindings all seemed to be targeted at Awake’s implementation of protobuffers.

We wanted a version that would play nicely with proto-lens, which, at time of writing, has no official support for describing RPC services via protobuffers. If you’re not familiar with proto-lens, it generates Haskell modules containing idiomatic types and lenses for protobuffers, and can be used directly in the build chain.

So the task was to add support to proto-lens for generating interfaces to RPC services defined in protobuffers.

My first approach was to generate the dumbest possible thing that could work – the idea was to generate records containing fields of the shape Request -> IO Response. Of course, with a network involved there is a non-negligible chance of things going wrong, so this interface should expose some means of dealing with errors. However, the protobuffer spec is agnostic about the actual RPC backend used, and so it wasn’t clear how to continue without assuming anything about the particulars behind errors.

More worrisome, however, was that RPCs can be marked as streaming – on the side of the client, server, or both. This means, for example, that a method marked as server-streaming has a different interface on either side of the network:

serverSide :: Request -> (Response -> IO ()) -> IO ()
clientSide :: Request -> (IO (Maybe Response) -> IO r) -> IO r

This is problematic. Should we generate different records corresponding to which side of the network we’re dealing with? An early approach I had was to parameterize the same record based on which side of the network, and use a type family to get the correct signature:

{-# LANGUAGE DataKinds #-}

data NetworkSide = Client | Server

data MyService side = MyService
  { runServerStreaming :: ServerStreamingType side Request Response

type family ServerStreamingType (side :: NetworkSide) input output where
  ServerStreamingType Server input output =
      input -> (output -> IO ()) -> IO ()

  ServerStreamingType Client input output =
      forall r. input -> (IO (Maybe output) -> IO r) -> IO r

This seems like it would work, but in fact the existence of the forall on the client-side is “illegally polymorphic” in GHC’s eyes, and it will refuse to compile such a thing. Giving it up would mean we wouldn’t be able to return arbitrarily-computed values on the client-side while streaming data from the server. Users of the library might be able to get around it by invoking IORefs or something, but it would be ugly and non-idiomatic.

So that, along with wanting to be backend-agnostic, made this approach a no-go. Luckily, my brilliant coworker Judah Jacobson (who is coincidentally also the author of proto-lens), suggested we instead generate metadata for RPC services in proto-lens, and let backend library code figure it out from there.

With all of that context out of the way, we’re ready to get into the actual meat of the post. Finally.

Generating Metadata

According to the spec, a protobuffer service may contain zero or more RPC methods. Each method has a request and response type, either of which might be marked as streaming.

While we could represent this metadata at the term-level, that won’t do us any favors in terms of getting type-safe bindings to this stuff. And so, we instead turn to TypeFamilies, DataKinds and GHC.TypeLits.

For reasons that will become clear later, we chose to represent RPC services via types, and methods in those services as symbols (type-level strings). The relevant typeclasses look like this:

class Service s where
  type ServiceName    s :: Symbol

class HasMethod s (m :: Symbol) where
  type MethodInput       s m :: *
  type MethodOutput      s m :: *
  type IsClientStreaming s m :: Bool
  type IsServerStreaming s m :: Bool

For example, the instances generated for the RPC service:

service MyService {
  rpc BiDiStreaming(stream Request) returns(stream Response);

would look like this:

data MyService = MyService

instance Service MyService where
  type ServiceName    MyService = "myService"

instance HasMethod MyService "biDiStreaming" where
  type MethodInput       MyService "biDiStreaming" = Request
  type MethodOutput      MyService "biDiStreaming" = Response
  type IsClientStreaming MyService "biDiStreaming" = 'True
  type IsServerStreaming MyService "biDiStreaming" = 'True

You’ll notice that these typeclasses perfectly encode all of the information we had in the protobuffer definition. The idea is that with all of this metadata available to them, specific backends can generate type-safe interfaces to these RPCs. We’ll walk through the implementation of the gRPC bindings together.

The Client Side

The client side of things is relatively easy. We can the HasMethod instance directly:

    :: HasMethod s m
    => s
    -> Proxy m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))
runNonStreamingClient =  -- call the underlying gRPC code

    :: HasMethod s m
    => s
    -> Proxy m
    -> MethodInput s m
    -> (IO (Either GRPCError (Maybe (MethodOutput s m)) -> IO r)
    -> IO r
runServerStreamingClient =  -- call the underlying gRPC code

-- etc

This is a great start! We’ve got the interface we wanted for the server-streaming code, and our functions are smart enough to require the correct request and response types.

However, there’s already some type-unsafety here; namely that nothing stops us from calling runNonStreamingClient on a streaming method, or other such silly things.

Thankfully the fix is quite easy – we can use type-level equality to force callers to be attentive to the streaming-ness of the method:

    :: ( HasMethod s m
       , IsClientStreaming s m ~ 'False
       , IsServerStreaming s m ~ 'False
    => s
    -> Proxy m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))

    :: ( HasMethod s m
       , IsClientStreaming s m ~ 'False
       , IsServerStreaming s m ~ 'True
    => s
    -> Proxy m
    -> MethodInput s m
    -> (IO (Either GRPCError (Maybe (MethodOutput s m)) -> IO r)
    -> IO r

-- et al.

Would-be callers attempting to use the wrong function for their method will now be warded off by the type-system, due to the equality constraints being unable to be discharged. Success!

The actual usability of this code leaves much to be desired (it requires being passed a proxy, and the type errors are absolutely disgusting), but we’ll circle back on improving it later. As it stands, this code is type-safe, and that’s good enough for us for the time being.

The Server Side

Method Discovery

Prepare yourself (but don’t panic!): the server side of things is significantly more involved.

In order to run a server, we’re going to need to be able to handle any sort of request that can be thrown at us. That means we’ll need an arbitrary number of handlers, depending on the service in question. An obvious thought would be to generate a record we could consume that would contain handlers for every method, but there’s no obvious place to generate such a thing. Recall: proto-lens can’t, since such a type would be backend-specific, and so our only other strategy down this path would be Template Haskell. Yuck.

Instead, recall that we have an instance of HasMethod for every method on Service s – maybe we could exploit that information somehow? Unfortunately, without Template Haskell, there’s no way to discover typeclass instances.

But that doesn’t mean we’re stumped. Remember that we control the code generation, and so if the representation we have isn’t powerful enough, we can change it. And indeed, the representation we have isn’t quite enough. We can go from a HasMethod s m to its Service s, but not the other way. So let’s change that.

We change the Service class slightly:

class Service s where
  type ServiceName    s :: Symbol
  type ServiceMethods s :: [Symbol]

If we ensure that the ServiceMethods s type family always contains an element for every instance of HasService, we’ll be able to use that info to discover our instances. For example, our previous MyService will now get generated thusly:

data MyService = MyService

instance Service MyService where
  type ServiceName    MyService = "myService"
  type ServiceMethods MyService = '["biDiStreaming"]

instance HasMethod MyService "biDiStreaming" where
  type MethodInput       MyService "biDiStreaming" = Request
  type MethodOutput      MyService "biDiStreaming" = Response
  type IsClientStreaming MyService "biDiStreaming" = 'True
  type IsServerStreaming MyService "biDiStreaming" = 'True

and we would likewise add the m for any other HasMethod MyService m instances if they existed.

This seems like we can now use ServiceMethods s to get a list of methods, and then somehow type-level map over them to get the HasMethod s m constraints we want.

And we almost can, except that we haven’t told the type-system that ServiceMethods s relates to HasService s m instances in this way. We can add a superclass constraint to Service to do this:

class HasAllMethods s (ServiceMethods s) => Service s where
  -- as before

But was is this HasAllMethods thing? It’s a specialized type-level map which turns our list of methods into a bunch of constraints proving we have HasMethod s m for every m in that promoted list.

class HasAllMethods s (xs :: [Symbol])

instance HasAllMethods s '[]
instance (HasMethod s x, HasAllMethods s xs) => HasAllMethods s (x ': xs)

We can think of xs here as the list of constraints we want. Obviously if we don’t want any constraints (the '[] case), we trivially have all of them. The other case is induction: if we have a non-empty list of constraints we’re looking for, that’s the same as looking for the tail of the list, and having the constraint for the head of it.

Read through these instances a few times; make sure you understand the approach before continuing, because we’re going to keep using this technique in scarier and scarier ways.

With this HasAllMethods superclass constraint, we can now convince ourselves (and, more importantly, GHC), that we can go from a Service s constraint to all of its HasMethod s m constraints. Cool!

Typing the Server

We return to thinking about how to actually run a server. As we’ve discussed, such a function will need to be able to handle every possible method, and, unfortunately, we can’t pack them into a convenient data structure.

Our actual implementation of such a thing might take a list of handlers. But recall that each handler has different input and output types, as well as different shapes depending on which bits of it are streaming. We can make this approach work by existentializing away all of the details.

While it works as far as the actual implementation of the underlying gRPC goes, we’re left with a great sense of uneasiness. We have no guarantees that we’ve provided a handler for every method, and the very nature of existentialization means we have absolutely no guarantees that any of these things are the right ype.

Our only recourse is to somehow use our Service s constraint to put a prettier facade in front of this ugly-if-necessary implementation detail.

The actual interface we’ll eventually provide will, for example, for a service with two methods, look like this:

runServer :: HandlerForMethod1 -> HandlerForMethod2 -> IO ()

Of course, we can’t know a priori how many methods there will be (or what type their handlers should have, for that matter). We’ll somehow need to extract this information from Service s – which is why we previously spent so much effort on making the methods discoverable.

The technique we’ll use is the same one you’ll find yourself using again and again when you’re programming at the type-level. We’ll make a typeclass with an associated type family, and then provide a base case and an induction case.

class HasServer s (xs :: [Symbol]) where
  type ServerType s xs :: *

We need to make the methods xs explicit as parameters in the typeclass, so that we can reduce them. The base case is simple – a server with no more handlers is just an IO action:

instance HasServer s '[] where
  type ServerType s '[] = IO ()

The induction case, however, is much more interesting:

instance ( HasMethod s x
         , HasMethodHandler s x
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs) = MethodHandler s x -> ServerType s xs

The idea is that as we pull methods x off our list of methods to handle, we build a function type that takes a value of the correct type to handle method x, which will take another method off the list until we’re out of methods to handle. This is exactly a type-level fold over a list.

The only remaining question is “what is this MethodHandler thing?” It’s going to have to be a type family that will give us back the correct type for the handler under consideration. Such a type will need to dispatch on the streaming variety as well as the request and response, so we’ll define it as follows, and go back and fix HasServer later.

class HasMethodHandler input output cs ss where
  type MethodHandler input output cs ss :: *

cs and ss refer to whether we’re looking for client-streaming and/or server-streaming types, respectively.

Such a thing could be a type family, but isn’t because we’ll need its class-ness later in order to actually provide an implementation of all of this stuff. We provide the following instances:

-- non-streaming
instance HasMethodHandler input output 'False 'False where
  type MethodHandler input output 'False 'False =
    input -> IO output

-- server-streaming
instance HasMethodHandler input output 'False 'False where
  type MethodHandler input output 'False 'True =
    input -> (output -> IO ()) -> IO ()

-- etc for client and bidi streaming

With MethodHandler now powerful enough to give us the types we want for handlers, we can go back and fix HasServer so it will compile again:

instance ( HasMethod s x
         , HasMethodHandler (MethodInput       s x)
                            (MethodOutput      s x)
                            (IsClientStreaming s x)
                            (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput       s x)
                      (MethodOutput      s x)
                      (IsClientStreaming s x)
                      (IsServerStreaming s x)
     -> ServerType s xs

It’s not pretty, but it works! We can convince ourselves of this by asking ghci:

ghci> :kind! ServerType MyService (ServiceMethods MyService)

(Request -> (Response -> IO ()) -> IO ()) -> IO () :: *

and, if we had other methods defined for MyService, they’d show up here with the correct handler type, in the order they were listed in ServiceMethods MyService.

Implementing the Server

Our ServerType family now expands to a function type which takes a handler value (of the correct type) for every method on our service. That turns out to be more than half the battle – all we need to do now is to provide a value of this type.

The generation of such a value is going to need to proceed in perfect lockstep with the generation of its type, so we add to the definition of HasServer:

class HasServer s (xs :: [Symbol]) where
  type ServerType s xs :: *
  runServerImpl :: [AnyHandler] -> ServerType s xs

What is this [AnyHandler] thing, you might ask. It’s an explicit accumulator for existentialized handlers we’ve collected during the fold over xs. It’ll make sense when we look at the induction case.

For now, however, the base case is trivial as always:

instance HasServer s '[] where
  type ServerType s '[] = IO ()
  runServerImpl handlers = runGRPCServer handlers

where runGRPCServer is the underlying server provided by Awake’s library.

We move to the induction case:

instance ( HasMethod s x
         , HasMethodHandler (MethodInput       s x)
                            (MethodOutput      s x)
                            (IsClientStreaming s x)
                            (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput       s x)
                      (MethodOutput      s x)
                      (IsClientStreaming s x)
                      (IsServerStreaming s x)
     -> ServerType s xs
  runServerImpl handlers f = runServerImpl (existentialize f : handlers)

where existentialize is a new class method we add to HasMethodHandler We will elide it here because it is just a function MethodHandler i o cs mm -> AnyHandler and is not particularly interesting if you’re familiar with existentialization.

It’s evident here what I meant by handlers being an explicit accumulator – our recursion adds the parameters it receives into this list so that it can pass them eventually to the base case.

There’s a problem here, however. Reading through this implementation of runServerImpl, you and I both know what the right-hand-side means, unfortunately GHC isn’t as clever as we are. If you try to compile it right now, GHC will complain about the non-injectivity of HasServer as implied by the call to runServerImpl (and also about HasMethodHandler and existentialize, but for the exact same reason.)

The problem is that there’s nothing constraining the type variables s and xs on runServerImpl. I always find this error confusing (and I suspect everyone does), because in my mind it’s perfectly clear from the HasServer s xs in the instance constraint. However, because SeverType is a type family without any injectivity declarations, it means we can’t learn s and xs from ServerType s xs.

Let’s see why. For a very simple example, let’s look at the following type family:

type family NotInjective a where
  NotInjective Int  = ()
  NotInjective Bool = ()

Here we have NotInjective Int ~ () and NotInjective Bool ~ (), which means even if we know NotInjective a ~ () it doesn’t mean that we know what a is – it could be either Int or Bool.

This is the exact problem we have with runServerImpl: even though we know what type runServerImpl has (it must be ServerType s xs, so that the type on the left-hand of the equality is the same as on the right), that doesn’t mean we know what s and xs are! The solution is to explicitly tell GHC via a type signature or type application:

instance ( HasMethod s x
         , HasMethodHandler (MethodInput       s x)
                            (MethodOutput      s x)
                            (IsClientStreaming s x)
                            (IsServerStreaming s x)
         , HasServer s xs
         ) => HasServer s (x ': xs) where
  type ServerType s (x ': xs)
      = MethodHandler (MethodInput       s x)
                      (MethodOutput      s x)
                      (IsClientStreaming s x)
                      (IsServerStreaming s x)
     -> ServerType s xs
  runServerImpl handlers f = runServerImpl @s @xs (existentialize f : handlers)

(For those of you playing along at home, you’ll need to type-apply the monstrous MethodInput and friends to the existentialize as well.)

And finally, we’re done! We can slap a prettier interface in front of this runServerImpl to fill in some of the implementation details for us:

    :: forall s
     . ( Service s
       , HasServer s (ServiceMethods s)
    => s
    -> ServerType s (ServiceMethods s)
runServer _ = runServerImpl @s @(ServiceMethods s) []

Sweet and typesafe! Yes!

Client-side Usability

Sweet and typesafe all of this might be, but the user-friendliness on the client-side leaves a lot to be desired. As promised, we’ll address that now.

Removing Proxies

Recall that the runNonStreamingClient function and its friends require a Proxy m parameter in order to specify the method you want to call. However, m has kind Symbol, and thankfully we have some new extensions in GHC for turning Symbols into values.

We can define a new type, isomorphic to Proxy, but which packs the fact that it is a KnownSymbol (something we can turn into a String at runtime):

data WrappedMethod (sym :: Symbol) where
  WrappedMethod :: KnownSymbol sym => WrappedMethod sym

We change our run*Client friends to take this WrappedMethod m instead of the Proxy m they used to:

    :: ( HasMethod s m
       , IsClientStreaming s m ~ 'False
       , IsServerStreaming s m ~ 'False
    => s
    -> WrappedMethod m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))

and, with this change in place, we’re ready for the magic syntax I promised earlier.

import GHC.OverloadedLabel

instance ( KnownSymbol sym
         , sym ~ sym'
         ) => IsLabel sym (WrappedMethod sym') where
  fromLabel _ = WrappedMethod

This sym ~ sym' thing is known as the constraint trick for instances, and is necessary here to convince GHC that this can be the only possible instance of IsLabel that will give you back WrappedMethods.

Now turning on the {-# LANGUAGE OverloadedLabels #-} pragma, we’ve changed the syntax to call these client functions from the ugly:

runBiDiStreamingClient MyService (Proxy @"biDiStreaming")

into the much nicer:

runBiDiStreamingClient MyService #biDiStreaming

Better “Wrong Streaming Variety” Errors

The next step in our journey to delightful usability is remembering that the users of our library are only human, and at some point they are going to call the wrong run*Client function on their method with a different variety of streaming semantics.

At the moment, the errors they’re going to get when they try that will be a few stanza long, the most informative of which will be something along the lines of unable to match 'False with 'True. Yes, it’s technically correct, but it’s entirely useless.

Instead, we can use the TypeError machinery from GHC.TypeLits to make these error messages actually helpful to our users. If you aren’t familiar with it, if GHC ever encounters a TypeError constraint it will die with a error message of your choosing.

We will introduce the following type family:

type family RunNonStreamingClient (cs :: Bool) (ss :: Bool) :: Constraint where
  RunNonStreamingClient 'False 'False = ()
  RunNonStreamingClient 'False 'True = TypeError
      ( Text "Called 'runNonStreamingClient' on a server-streaming method."
   :$$: Text "Perhaps you meant 'runServerStreamingClient'."
  RunNonStreamingClient 'True 'False = TypeError
      ( Text "Called 'runNonStreamingClient' on a client-streaming method."
   :$$: Text "Perhaps you meant 'runClientStreamingClient'."
  RunNonStreamingClient 'True 'True = TypeError
      ( Text "Called 'runNonStreamingClient' on a bidi-streaming method."
   :$$: Text "Perhaps you meant 'runBiDiStreamingClient'."

The :$$: type operator stacks message vertically, while :<>: stacks it horizontally.

We can change the constraints on runNonStreamingClient:

    :: ( HasMethod s m
       , RunNonStreamingClient (IsClientStreaming s m)
                               (IsServerStreaming s m)
    => s
    -> WrappedMethod m
    -> MethodInput s m
    -> IO (Either GRPCError (MethodOutput s m))

and similarly for our other client functions. Reduction of the resulting boilerplate is left as an exercise to the reader.

With all of this work out of the way, we can test it:

runNonStreamingClient MyService #biDiStreaming
Main.hs:45:13: error:
    • Called 'runNonStreamingClient' on a bidi-streaming method.
      Perhaps you meant 'runBiDiStreamingClient'.
    • In the expression: runNonStreamingClient MyService #bidi


Better “Wrong Method” Errors

The other class of errors we expect our users to make is to attempt to call a method that doesn’t exist – either because they made a typo, or are forgetful of which methods exist on the service in question.

As it stands, users are likely to get about six stanzas of error messages, from No instance for (HasMethod s m) to Ambiguous type variable 'm0', and other terrible things that leak our implementation details. Our first thought might be to somehow emit a TypeError constraint if we don’t have a HasMethod s m instance, but I’m not convinced such a thing is possible.

But luckily, we can actually do better than any error messages we could produce in that way. Since our service is driven by a value (in our example, the data constructor MyService), by the time things go wrong we do have a Service s instance in scope. Which means we can look up our ServiceMethods s and given some helpful suggestions about what the user probably meant.

The first step is to implement a ListContains type family so we can determine if the method we’re looking for is actually a real method.

type family ListContains (n :: k) (hs :: [k]) :: Bool where
  ListContains n '[]       = 'False
  ListContains n (n ': hs) = 'True
  ListContains n (x ': hs) = ListContains n hs

In the base case, we have no list to look through, so our needle is trivially not in the haystack. If the head of the list is the thing we’re looking for, then it must be in the list. Otherwise, take off the head of the list and continue looking. Simple really, right?

We can now use this thing to generate an error message in the case that the method we’re looking for is not in our list of methods:

type family RequireHasMethod s (m :: Symbol) (found :: Bool) :: Constraint where
  RequireHasMethod s m 'False = TypeError
      ( Text "No method "
   :<>: ShowType m
   :<>: Text " available for service '"
   :<>: ShowType s
   :<>: Text "'."
   :$$: Text "Available methods are: "
   :<>: ShowType (ServiceMethods s)
  RequireHasMethod s m 'True = ()

If found ~ 'False, then the method m we’re looking for is not part of the service s. We produce a nice error message informing the user about this (using ShowType to expand the type variables).

We will provide a type alias to perform this lookup:

type HasMethod' s m =
  ( RequireHasMethod s m (ListContains m (ServiceMethods s)
  , HasMethod s m

Our new HasMethod' s m has the same shape as HasMethod, but will expand to our custom type error if we’re missing the method under scrutiny.

Replacing all of our old HasMethod constraints with HasMethod' works fantastically:

Main.hs:54:15: error:
    • No method "missing" available for service 'MyService'.
      Available methods are: '["biDiStreaming"]

Damn near perfect! That list of methods is kind of ugly, though, so we can write a quick pretty printer for showing promoted lists:

type family ShowList (ls :: [k]) :: ErrorMessage where
  ShowList '[]  = Text ""
  ShowList '[x] = ShowType x
  ShowList (x ': xs) = ShowType x :<>: Text ", " :<>: ShowList xs

Replacing our final ShowType with ShowList in RequireHasMethod now gives us error messages of the following:

Main.hs:54:15: error:
    • No method "missing" available for service 'MyService'.
      Available methods are: "biDiStreaming"

Absolutely gorgeous.


This is where we stop. We’ve used type-level metadata to generate client- and server-side bindings to an underlying library. Everything we’ve made is entirely typesafe, and provides gorgeous, helpful error messages if the user does anything wrong. We’ve found a practical use for many of these seemingly-obscure type-level features, and learned a few things in the process.

In the words of my coworker Renzo Carbonara1:

“It is up to us, as people who understand a problem at hand, to try and teach the type system as much as we can about that problem. And when we don’t understand the problem, talking to the type system about it will help us understand. Remember, the type system is not magic, it is a logical reasoning tool.”

This resounds so strongly in my soul, and maybe it will in yours too. If so, I encourage you to go forth and find uses for these techniques to improve the experience and safety of your own libraries.

  1. Whose article “Opaleye’s sugar on top” was a strong inspiration on me, and subsequently on this post.


November 18, 2017 12:00 AM

November 16, 2017

Tweag I/O

Parallelising your array code

Manuel M T Chakravarty

This is the fifth post in a series about array programming in Haskell — you might be interested in the first, second, third, and fourth, too.

A recurring theme in array programming is performance. After all, many algorithms in numerical computing and data science are computationally intensive. Once the sequential implementation of an array program has been fully optimised, the natural next step is to use one or multiple forms of parallelism to achieve further performance improvements. This can be parallelism within one computational core (SIMD parallelism), multicore parallelism, or distributed multi-machine parallelism. Unfortunately, at this point matters become much more complicated, because parallel programming comes with its own set of serious challenges.

In this post, we will focus on multicore parallelism for computations operating on multi-dimensional arrays. In other words, in relation to the vector package, which we discussed in the last post, we have two new ingredients. Firstly, instead of one-dimensional Int-indexed arrays, we have multi-dimensional Int-indexed arrays. Secondly, the collective operations provided on these arrays come with parallel implementations. In fact, the library API is designed to favour collective operations that have good parallel implementations. Similarly, the move to explicitly multi-dimensional arrays is motivated by being able to provide parallel implementations that take the array shape into account, wherever that is an advantage.

To make matters concrete, we will discuss the Repa library. Internally it uses many of the same techniques as vector, including strictness, unboxing, and a two-phase initialisation strategy. However, it uses a second array fusion strategy in addition to vector’s stream fusion. More precisely, Repa internally uses vector to represent plain boxed and unboxed arrays and to execute sequential computations on those, which still benefit from stream fusion. However, Repa introduces additional array representations, such as delayed arrays, to also achieve fusion across parallel computations.

This additional complication is necessary as stream fusion, by itself, tends to turn parallel into sequential code. In other words, one of the challenges of high-performance parallel array implementations that are built on collective operations is to apply fusion while preserving parallelism. To really get good performance, we need to simultaneously optimize along two orthogonal dimensions: get more done simultaneously, by parallelizing, but also make each sequential unit of work run faster.

A second consequence of targeting a parallelisation-friendly API is a very limited use of mutable arrays. Mutable structures generally interact badly with concurrency and parallelism, opening the door to a whole range of hard to diagnose faults. In fact, the focus on immutable arrays for parallel programming is one of the most compelling conceptual improvements of functional over imperative parallel array programming. (To be precise, Repa’s API does provide access to the mutable array structures used to implement two-phase initialisation, but it is usually not necessary to use them directly.)

Multiple dimensions

The obvious structure for indexing multi-dimensional Int-indexed arrays are tuples of Ints. However, they come with two severe drawbacks: (1) they force us to fix the dimensionality of all functions over arrays and (2) they are not sufficient to characterise operations on lower-dimensional subarrays of an array (e.g., a two-dimensional plane within a three-dimensional cube).

As an example of the first drawback, consider a fold function that given a three-dimensional cube, reduces it along, say, the x-axis to a two-dimensional plane of sums. The only difference of that operation compared to a fold that sums a two-dimensional plane across one axis to a one-dimensional vector is the number of dimensions that we do not reduce along. Now, we could have a family of fold functions (fold1, fold2, and so on), one for each possible dimension of argument array. But that is hardly satisfactory.

Instead, Repa uses a custom datatype for indexing. Index types are built from the infix constructor (:.) and the constant Z, representing a zero-dimensional array (which is the special cases of a singleton array). For example, the type of two-dimensional indices is Z :. Int :. Int and one of its values is Z :. 3 :. 5. By using a type variable instead of Z, we can denote indices with a particular minimum dimensionality. For instance, sh :. Int has at least one dimension, but it might have more, depending on how the type variable sh is instantiated — in any case, instances of sh need to be drawn from the class Shape. On the basis of this index representation, we can capture the entire family of multi-dimensional fold functions in a single type:

foldS :: (Shape sh, Source r a, Unbox a)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> Array U sh a

The function foldS implements a sequential, multi-dimensional reduction; hence, the S suffix. It gets three arguments:

  1. a -> a -> a is the type of the binary reduction function, which needs to be associative,
  2. a is the reduction function’s neutral (i.e, together they form a monoid), and
  3. Array e (sh :. Int) a is an at least one-dimensional array of elements of type a, which the type constraint Unbox a requires to be a type that has an associated unboxed representation.

Finally, the result of type Array U sh a has one dimension less than the argument array, but contains elements of the same type a. This leaves us with wondering about the meaning of the first type argument of Arrayr and U, respectively— as well as the type constraint Source r a.

Indexed arrays

The first type argument of Array determines the array representation. The available representations include boxed (V) and unboxed (U) representations, but also delayed (D) and cursored (C) representations. The latter are guaranteed to be removed by fusion, but can lead to the superfluous recomputation of array elements that are used more than once. Repa makes the choice of representation explicit to place it under programmer control — experience shows that compiler heuristics for automatic representation selection tend to be fragile and unreliable.

A consequence of a representation that is fused away, such as delayed D and cursored C, is that it can only be a data Source of a computation. Hence, the type class of the same name provides elementary array access functions for arrays. The opposite, a Target, provides the functionality to fill an array as part of two-phase initialisation and is only available to manifest representations, such as the boxed V and unboxed U representation. A manifest representation is one which, in contrast to a fused-away delayed representation, is actually stored in memory.

In addition to concrete representations, Repa representation tags can also include meta information, such as the interleaving hint I. An array tagged I U uses an unboxed interleaved representation, which improves parallel load balancing in parallel computations where the amount of work strongly varies between different regions in the parallel array. A standard example is computing a Mandelbrot set, where black pixels are significantly more expensive than others.


As we saw above with foldS, Repa follows the convention of adding an S to sequential array operations. Similarly, it uses a P as a suffix for parallel functions. For example, we have

foldP :: (Shape sh, Source r a, Unbox a, Monad m)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> m (Array U sh a)

for the parallel version of fold. The distinction between sequential and parallel functions is an important one, since Repa does not support nested parallelism. That is, a parallel function (e.g., foldP) cannot use another parallel function as an argument (e.g., as the combination function).

In addition to the suffix, the parallel fold distinguishes itself from the sequential by the use of a not further specified monad. The purpose of this monad is to ensure the one-by-one execution of pipelines of parallel computations. This is important to prevent inadvertent nesting of parallel computations as Haskell is a lazy language and we might otherwise feed a suspended (i.e., not yet evaluated) parallel computation into another parallel computation.

Parallel matrix multiplication

As a simple example of a parallel computation, consider the multiplication of two matrices arr and brr of type Array U DIM2 Double (two-dimensional, unboxed arrays), where type DIM2 = Z :. Int :. Int:

mmultP :: Monad m
       => Array U DIM2 Double
       -> Array U DIM2 Double
       -> m (Array U DIM2 Double)
mmultP arr brr
  = do trr <- transpose2P brr
       computeP (fromFunction (Z :. h1 :. w2) dotp)
    (Z :. h1  :. _)  = extent arr
    (Z :. _   :. w2) = extent brr

    dotp ix = sumAllS $
                zipWith (*)
                  (slice arr (Any :. (row ix) :. All))
                  (slice trr (Any :. (col ix) :. All))

We assume the existence of a helper function transpose2P, which transposes a matrix in parallel — for example, by using Repa’s backpermute function. Then, we generate the manifest result array by computing all elements of fromFunction (Z :. h1 :. w2) dotpin parallel with computeP. The shape (i.e., the size of the dimensions) of the result is h1 times w2, and fromFunction turns a function, which takes an array index to the corresponding array element , into a delayed array:

fromFunction :: sh -> (sh -> a) -> Array D sh a

At each index ix of the resulting array, we evaluate dotp, which only involves a sequential computation. It’s sequential nature is important for two reasons. Firstly, as mentioned, Repa does not support nested parallelism, so the computations on each result array index triggered by computeP in parallel may themselves not be parallel. Secondly, the work complexity of matrix multiplication is n^3 — that is the number of scalar multiplications that need to be performed. Performing them all in parallel would lead to (a) too much and (b) too fine-grained parallelism. Both too much parallelism and parallel workloads that are each too little work lead to bad performance as they result in too much administrative overhead.

In contrast, the sequential computation performed by dotp obtains a row of the matrix arr and a column of brr (actually, a row of the transposed brr, which is trr) with slice, which extracts an entire subarray from an array. Then, it multiples the row and column pointwise with zipWith (*) and sums up the products with sumAllS, where

zipWith :: (Shape sh, Source r1 a, Source r2 b)
        => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c
sumAllS :: (Shape sh, Source r a, Num a) => Array r sh a -> a

This example highlights how reasoning about the decomposition of an algorithm into parallel and sequential components is crucial for good parallel performance. This is assisted by Repa’s clear distinction between sequential and parallel operations.

Further reading

Repa went through three major iterations before arriving at the current interface. The underlying concepts are described and supported by benchmarks in the papers Regular, shape-polymorphic, parallel arrays in Haskell, Efficient Parallel Stencil Convolution in Haskell, and Guiding Parallel Array Fusion with Indexed Types, respectively. In addition, Data Flow Fusion with Series Expressions in Haskell proposes a further improvement to the fusion system. However, this has not been integrated into the main package.

November 16, 2017 12:00 AM

November 15, 2017

Jeremy Gibbons

The Digits of Pi

In the previous post we were introduced to metamorphisms, which consist of an unfold after a fold—typically on lists, and the fold part typically a {\mathit{foldl}}. A canonical example is the conversion of a fraction from one base to another. For simplicity, let’s consider here only infinite fractions, so we don’t have to deal with the end of the input and flushing the state:

\displaystyle  \begin{array}{@{}lcl@{}} \multicolumn{3}{@{}l}{\mathit{stream} :: (\beta \rightarrow \mathsf{Maybe}\;(\gamma,\beta)) \rightarrow (\beta \rightarrow \alpha \rightarrow \beta) \rightarrow \beta \rightarrow [\alpha] \rightarrow [\gamma]} \\ \mathit{stream}\;g\;f\;b\;x &=& \mathbf{case}\;g\;b\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} \mathit{Just}\;(c,b') &\rightarrow& c : \mathit{stream}\;g\;f\;b'\;x \\ \mathit{Nothing} &\rightarrow& \mathbf{case}\;x\;\mathbf{of} \\ & & \quad \begin{array}[t]{@{}lcl@{}} a:x' &\rightarrow& \mathit{stream}\;g\;f\;(f\;b\;a)\;x' \end{array} \end{array} \end{array}

So for example, we can convert an infinite fraction in base 3 to one in base 7 with

\displaystyle  \mathit{stream}\;\mathit{next}\;\mathit{step}\;(0,1)


\displaystyle  \begin{array}{@{}lcl@{}} \mathit{next}\;(u,v) &=& \begin{array}[t]{@{}l} \mathbf{let}\;y = \lfloor{7 \times u \times v}\rfloor\;\mathbf{in} \\ \mathbf{if}\;\lfloor{y}\rfloor = \lfloor{7 \times (u+1) \times v}\rfloor\;\begin{array}[t]{@{}l@{\;}l}\mathbf{then}&\mathit{Just}\;(y,(u - y/(v \times 7), v \times 7))\\\mathbf{else}&\mathit{Nothing} \\ \end{array} \end{array} \\ \mathit{stepl}\;(u,v)\;d &=& (u \times 3 + d, v / 3) \end{array}

In this post, we’ll see another number conversion problem, which will deliver the digits of {\pi}. For more details, see my paper—although the presentation here is simpler now.

Series for pi

Leibniz showed that

\displaystyle  \displaystyle \frac{\pi}{4} = \sum_{i=0}^{\infty} \frac{(-1)^i}{2i+1}

From this, using Euler’s convergence-accelerating transformation, one may derive

\displaystyle  \pi = \sum_{i=0}^{\infty} \frac{(i!)^2\,2^{i+1}}{(2i+1)!}

or equivalently

\displaystyle  \pi = 2 + \frac{1}{3} \times \biggl(2 + \frac{2}{5}\times \biggl(2 + \frac{3}{7}\times \biggl( \cdots \biggl(2 + \frac{i}{2i+1}\times \biggl(\cdots\biggr)\biggr)\biggr)\biggr)\biggr)

This can be seen as the number {(2;2,2,2...)} in a funny mixed-radix base {(\frac{1}{3}, \frac{2}{5}, \frac{3}{7}...)}, just as the usual decimal expansion

\displaystyle  \pi = 3 + \frac{1}{10} \times \biggl(1 + \frac{1}{10}\times \biggl(4 + \frac{1}{10}\times \biggl( \cdots\biggr)\biggr)\biggr)

is represented by the number {(3;1,4,1...)} in the fixed-radix base {(\frac{1}{10},\frac{1}{10},\frac{1}{10}...)}. Computing the decimal digits of {\pi} is then a matter of conversion from the mixed-radix base to the fixed-radix base.

Conversion from a fixed base

Let’s remind ourselves of how it should work, using a simpler example: conversion from one fixed base to another. We are given an infinite-precision fraction in the unit interval

\displaystyle  x = \frac{1}{m} \times \biggl(x_0 + \frac{1}{m}\times \biggl(x_1 + \frac{1}{m}\times \biggl( \cdots\biggr)\biggr)\biggr)

in base {m}, in which {0 \le x_i < m} for each digit {x_i}. We are to convert it to a similar representation

\displaystyle  x = \frac{1}{n} \times \biggl(y_0 + \frac{1}{n}\times \biggl(y_1 + \frac{1}{n}\times \biggl( \cdots\biggr)\biggr)\biggr)

in base {n}, in which {0 \le y_j < n} for each output digit {y_j}. The streaming process maintains a state {(u,v)}, a pair of rationals; the invariant is that after consuming {i} input digits and producing {j} output digits, we have

\displaystyle  x = \frac{1}{n} \times \biggl(y_0 + \cdots + \frac{1}{n}\times \biggl(y_{j-1} + v \times (u + \frac{1}{m} \times \biggl( x_i + \frac{1}{m} \times \biggl(x_{i+1} + \cdots \biggr)\biggr)\biggr)\biggr)\biggr)

so that {(u,v)} represents a linear function {(v\times) \cdot (u+)} that should be applied to the value represented by the remaining input.

We can initialize the process with {i=0, j=0, u=0, v=1}. At each step, we first try to produce another output digit. The remaining input digits {x_i, x_{i+1},...} represent a value in the unit interval; so if {n \times v \times (u+0)} and {n \times v \times (u+1)} have the same integer part, then that must be the next output digit, whatever the remaining input digits are. Let {y_j = \lfloor n \times v \times u \rfloor} be that integer. Now we need to find {(u',v')} such that

\displaystyle  \frac{1}{n} \times \biggl(y_j + v' \times (u' + r)\biggr) = v \times (u + r)

for any remainder {r}; then we can increment {j} and set {(u,v)} to {(u',v')} and the invariant is maintained. A little algebra shows that we should take {v' = n \times v} and {u' = u - y_j/v'}.

If {v \times u} and {v \times (u+1)} have different integer parts, we cannot yet tell what the next output digit should be, so we must consume the next input digit instead. Now we need to find {(u',v')} such that

\displaystyle  v \times \biggl(u + \frac{1}{m} \times \biggl(x_i + r\biggr)\biggr) = v' \times (u' + r)

for any remainder {r}; then we can increment {i} and set {(u,v)} to {(u',v')} and the invariant is again maintained. Again, algebraic manipulation leads us to {v' = v/m} and {u' = m \times u + x_i}.

For example, {\frac{1}{4} = 0.020202..._3 = 0.151515..._7}, and the conversion starts as follows:

\displaystyle  \begin{array}{c|c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}cc} x_i & & 0 && 2 && 0 && && 2 && && 0 && 2 \\ \hline (u,v) & \bigl(\frac{0}{1},\frac{1}{1}\bigr) && \bigl(\frac{0}{1},\frac{1}{3}\bigr) && \bigl(\frac{2}{1},\frac{1}{9}\bigr) && \bigl(\frac{6}{1},\frac{1}{27}\bigr) && \bigl(\frac{15}{7},\frac{7}{27}\bigr) && \bigl(\frac{59}{7},\frac{7}{81}\bigr) && \bigl(\frac{8}{49},\frac{49}{81}\bigr) && \bigl(\frac{24}{49},\frac{49}{243}\bigr) && \bigl(\frac{170}{49},\frac{49}{729}\bigr) & \cdots \vrule height 2.5ex depth 1.5ex width 0pt \\ \hline y_j & & && && && 1 && && 5 \end{array}

That is, the initial state is {u_0=\frac{0}{1}, v_0=\frac{1}{1}}. This state does not yet determine the first output digit, so we consume the first input digit 0 to yield the next state {u_1 = \frac{0}{1}, v_1 = \frac{1}{3}}. This state still does not determine the first output, and nor will the next; so we consume the next two input digits 2 and 0, yielding state {u_3 = \frac{6}{1}, v_3 = \frac{1}{27}}. This state does determine the next digit: {v_3 \times u_3 = 0.020_3 = 0.136..._7} and {v_3 \times (u_3+1) = 0.021_3 = 0.154..._7} both start with a 1 in base 7. So we can produce a 1 as the first output digit, yielding state {u_4 = \frac{15}{7}, v_4 = \frac{7}{27}}. And so on.

The process tends to converge. Each production step widens the non-empty window {[n \times v \times u, n \times v \times (u+1))} by a factor of {n}, so it will eventually contain multiple integers; therefore we cannot produce indefinitely. Each consumption step narrows the window by a factor of {m}, so it will tend towards eventually producing the next output digit. However, this doesn’t always work. For example, consider converting {0.333..._{10}} to base 3:

\displaystyle  \begin{array}{c|c@{}c@{}c@{}c@{}c@{}c@{}cc} x_i & & 3 && 3 && 3 & \\ \hline (u,v) & \bigl(\frac{0}{1},\frac{1}{1}\bigr) && \bigl(\frac{3}{1},\frac{1}{10}\bigr) && \bigl(\frac{33}{1},\frac{1}{100}\bigr) && \bigl(\frac{333}{1},\frac{1}{1000}\bigr) & \cdots \vrule height 2.5ex depth 1.5ex width 0pt \\ \hline y_j & & && \end{array}

The first output digit is never determined: if the first non-3 in the input is less than 3, the value is less than a third, and the first output digit should be a 0; if the first non-3 is greater than 3, then the value is definitely greater than a third, and it is safe to produce a 1 as the first output digit; but because the input is all 3s, we never get to make this decision. This problem will happen whenever the value being represented has a finite representation in the output base.

Conversion from a mixed base

Let’s return now to computing the digits of {\pi}. We have the input

\displaystyle  \pi = 2 + \frac{1}{3} \times \biggl(2 + \frac{2}{5}\times \biggl(2 + \frac{3}{7}\times \biggl( \cdots \biggl(2 + \frac{i}{2i+1}\times \biggl(\cdots\biggr)\biggr)\biggr)\biggr)\biggr)

which we want to convert to decimal. The streaming process maintains a pair {(u,v)} of rationals—but this time representing the linear function {(u+) \cdot (v\times)}, since this time our expression starts with a sum rather than a product. The invariant is similar: after consuming {i-1} input digits and producing {j} output digits, we have

\displaystyle  \pi = y_0 + \frac{1}{10} \times \biggl(\cdots y_{j-1} + \frac{1}{10} \times \biggl(u + v \times \biggl(x_i + \frac{i}{2i+1} \times \biggl(x_{i+1} + \frac{i+1}{2i+3} \times \cdots\biggr)\biggr)\biggr)\biggr)

Note that the output base is fixed at 10; but more importantly, the input digits {x_i} are all fixed at 2, and it is the input base that varies from digit to digit.

We can initialize the process with {i=1, j=0, u=0, v=1}. At each step, we first try to produce an output digit. What value might the remaining input

\displaystyle  r = 2 + \frac{i}{2i+1} \times \biggl(2 + \frac{i+1}{2i+3} \times \cdots \biggr)

represent? Each of the bases is at least {\frac{1}{3}}, so it is clear that {r_{\mathrm{min}} \le r}, where

\displaystyle  r_{\mathrm{min}} = 2 + \frac{1}{3} \times r_{\mathrm{min}}

which has unique solution {r_{\mathrm{min}} = 3}. Similarly, each of the bases is less than {\frac{1}{2}}, so it is clear that {r < r_{\mathrm{max}}}, where

\displaystyle  r_{\mathrm{max}} = 2 + \frac{1}{2} \times r_{\mathrm{max}}

which has unique solution {r_{\mathrm{max}} = 4}. So we consider the bounds {\lfloor u + v \times 3 \rfloor} and {\lfloor u + v \times 4 \rfloor}; if these have the same integer part {y_j}, then that is the next output digit. Now we need to find {(u',v')} such that

\displaystyle  y_j + \frac{1}{10} \times (u' + v' \times r) = u + v \times r

for any remainder {r}, so we pick {u' = 10 \times (u - y_j)} and {v' = 10 \times v}. Then we can increment {j} and set {(u,v)} to {(u',v')}, and the invariant is maintained.

If the two bounds have different integer parts, we must consume the next input digit instead. Now we need to find {(u',v')} such that

\displaystyle  u' + v' \times r = u + v \times \biggl(x_i + \frac{i}{2i+1} \times r\biggr)

for all {r}, so we pick {u' = u + v \times x_i} and {v' = v \times i / (2i+1)}. Then we can increment {i} and set {(u,v)} to {(u',v')}, and again the invariant is maintained.

The conversion starts as follows:

\displaystyle  \begin{array}{c|c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}cc} x_i & & 2 && && 2 && 2 && && 2 && 2 \\ \hline (u,v) & \bigl(\frac{0}{1},\frac{1}{1}\bigr) && \bigl(\frac{2}{1},\frac{1}{3}\bigr) && \bigl(\frac{-10}{1},\frac{10}{3}\bigr) && \bigl(\frac{10}{3},\frac{4}{3}\bigr) && \bigl(\frac{-2}{3},\frac{4}{7}\bigr) && \bigl(\frac{-50}{3},\frac{40}{7}\bigr) && \bigl(\frac{-110}{21},\frac{160}{63}\bigr) && \bigl(\frac{-10}{63},\frac{800}{693}\bigr) & \cdots \vrule height 2.5ex depth 1.5ex width 0pt \\ \hline y_j & & && 3 && && && 1 && && \end{array}

Happily, non-termination ceases to be a problem: the value being represented does not have a finite representation in the output base, being irrational.


We can plug these definitions straight into the {\mathit{stream}} function above:

\displaystyle  \mathit{piDigits} = \mathit{stream}\;g\;f\;(1,0,0\%1,1\%1)\;(\mathit{repeat}\;2)


\displaystyle  g\;(i,j,u,v) = \begin{array}[t]{@{}l} \mathbf{if}\;y == \mathit{floor}\;(u + v \times 4) \\ \mathbf{then}\;\mathit{Just}\;(y, (i,j+1, 10 \times (u - \mathit{fromIntegral}\;y), 10 \times v)) \\ \mathbf{else}\;\mathit{Nothing} \\ \mathbf{where}\;y = \mathit{floor}\;(u + v \times 3) \end{array}


\displaystyle  f\;(i,j,u,v)\;x = \begin{array}[t]{@{}l} (i+1,j,u + v \times \mathit{fromIntegral}\;x, v \times i' / (2 \times i' + 1)) \\ \mathbf{where}\;i' = \mathit{fromIntegral}\;i \end{array}

(The {\%}s make rational numbers in Haskell, and force the ambiguous fractional type to be {\mathit{Rational}} rather than {\mathit{Double}}.)

In fact, this program can be considerably simplified, by inlining the definitions. In particular, the input digits are all 2, so we need not supply them. Moreover, the {j} component of the state is never used, because we treat each output digit in the same way (in contrast to the input digits); so that may be eliminated. Finally, we can eliminate some of the numeric coercions if we represent the {i} component as a rational in the first place:

\displaystyle  \mathit{piDigits} = \begin{array}[t]{@{}l} \mathit{go}\;((1,0,1) :: (\mathit{Rational},\mathit{Rational},\mathit{Rational}))\;\mathbf{where} \\ \qquad \mathit{go}\;(i,u,v) = \begin{array}[t]{@{}ll} \mathbf{if} & y == \mathit{floor}\;(u+v \times 4) \\ \mathbf{then} & y : \mathit{go}\;(i,10 \times (u-\mathit{fromIntegral}\;y),10 \times v) \\ \mathbf{else} & \mathit{go}\;(i+1,u+2 \times v, (v \times i) / (2 \times i+1)) \\ \multicolumn{2}{@{}l}{\qquad \mathbf{where}\; y = \mathit{floor}\;(u+v \times 3)} \end{array} \end{array}

Then we have

\displaystyle  \mathit{piDigits} = [3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3...

by jeremygibbons at November 15, 2017 05:22 PM