I planned to publish this last week sometime but then I wrote a line of code with three errors and that took over the blog.
A few years ago I mentioned in passing that in the 1990s I had constructed a listing of all the anagrams in Webster's Second International dictionary. (The Webster's headword list was available online.)
This was easy to do, even at the time, when the word list itself, at 2.5 megabytes, was a file of significant size. Perl and its cousins were not yet common; in those days I used Awk. But the task is not very different in any reasonable language:
# Process word list: group each word under its normal form
my %anagrams;
while (my $word = <>) {
    chomp $word;
    my $sorted = join "", sort split //, $word;    # normal form
    push @{$anagrams{$sorted}}, $word;
}
for my $words (values %anagrams) {
    print "@$words\n" if @$words > 1;
}
The key technique is to reduce each word to a normal form so that two words have the same normal form if and only if they are anagrams of one another. In this case we do this by sorting the letters into alphabetical order, so that both megalodon and moonglade become adeglmnoo.
Then we insert the words into a (hash | associative array | dictionary), keyed by their normal forms, and two or more words are anagrams if they fall into the same hash bucket. (There is some discussion of this technique in Higher-Order Perl pages 218–219 and elsewhere.)
(The thing you do not want to do is to compute every permutation of the letters of each word, looking for permutations that appear in the word list. That is akin to sorting a list by computing every permutation of the list and looking for the one that is sorted. I wouldn't have mentioned this, but someone on StackExchange actually asked this question.)
Anyway, I digress. This article is about how I was unhappy with the results of the simple procedure above. From the Webster's Second list, which contains about 234,000 words, it finds about 14,000 anagram sets (some with more than two words), consisting of 46,351 pairs of anagrams. The list starts with
aal ala
and ends with
zolotink zolotnik
which exemplify the problems with this simple approach: many of the 46,351 anagrams are obvious, uninteresting or even trivial. There must be good ones in the list, but how to find them?
I looked in the list to find the longest anagrams, but they were also disappointing:
cholecystoduodenostomy duodenocholecystostomy
(Webster's Second contains a large amount of scientific and medical jargon. A cholecystoduodenostomy is a surgical operation to create a channel between the gall bladder (cholecysto-) and the duodenum (duodeno-). A duodenocholecystostomy is the same thing.)
This example made clear at least one of the problems with boring anagrams: it's not that they are too short, it's that they are too simple. Cholecystoduodenostomy and duodenocholecystostomy are 22 letters long, but the anagrammatic relation between them is obvious: chop cholecystoduodenostomy into three parts:
cholecysto duodeno stomy
and rearrange the first two:
duodeno cholecysto stomy
and there you have it.
This gave me the idea to score a pair of anagrams according to how many chunks one had to be cut into in order to rearrange it to make the other one. On this plan, the “cholecystoduodenostomy / duodenocholecystostomy” pair would score 3, just barely above the minimum possible score of 2. Something even a tiny bit more interesting, say “abler / blare” would score higher, in this case 4. Even if this strategy didn't lead me directly to the most interesting anagrams, it would be a big step in the right direction, allowing me to eliminate the least interesting.
This rule would judge both “aal / ala” and “zolotink / zolotnik” as being uninteresting (scores 2 and 4 respectively), which is a good outcome. Note that some other boring-anagram problems can be seen as special cases of this one. For example, short anagrams never need to be cut into many parts: no four-letter anagrams can score higher than 4. The trivial anagramming of a word to itself always scores 1, and nontrivial anagrams always score more than this.
So what we need to do is: for each anagram pair, say acrididae (grasshoppers) and cidaridae (sea urchins), find the smallest number of chunks into which we can chop acrididae so that the chunks can be rearranged into cidaridae.
One could do this with a clever algorithm, if one were available. There is a clever algorithm, based on finding maximal independent sets in a certain graph. (More about this tomorrow.) I did not find this algorithm at the time; nor did I try. Instead, I used a brute-force search. Or rather, I used a very small amount of cleverness to reduce the search space, and then used brute-force search to search the reduced space.
Let's consider an example, scoring the anagram “abscise / scabies”. You do not have to consider every possible permutation of abscise. Rather, there are only two possible mappings from the letters of abscise to the letters of scabies. You know that the C must map to the C, the A must map to the A, and so forth. The only question is whether the first S of abscise maps to the first or to the second S of scabies. The first mapping gives us

ab sc i s e → sc ab i e s

with five chunks, and the second gives us six chunks, because the S and the C no longer go to adjoining positions. So the minimum number of chunks is 5, and this anagram pair gets a score of 5.
To fully analyze cholecystoduodenostomy by this method required considering 7680 mappings. (120 ways to map the five O's × 2 ways to map the two C's × 2 ways to map the two D's, etc.) In the 1990s this took a while, but not prohibitively long, and it worked well enough that I did not bother to try to find a better algorithm. In 2016 it would probably still run quicker than implementing the maximal independent set algorithm. Unfortunately I have lost the code that I wrote then, so I can't compare.
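Since that code is lost, here is a small Haskell sketch of the brute-force idea (the names and the details are mine, not the lost program's): enumerate every letter-consistent mapping from positions of one word to positions of the other, and take the minimum chunk count.

```haskell
-- Brute-force chunk scoring. Assumes the two words are anagrams.
-- A mapping matches each letter of `a` with a distinct occurrence of
-- the same letter in `b`.
mappings :: String -> String -> [[Int]]
mappings []       _ = [[]]
mappings (c : cs) b =
  [ j : rest
  | rest <- mappings cs b
  , j    <- [ i | (i, c') <- zip [0 ..] b, c' == c ]
  , j `notElem` rest ]

-- One chunk, plus one more for every adjacent pair of positions that
-- does not stay adjacent under the mapping.
chunkCount :: [Int] -> Int
chunkCount m = 1 + length [ () | (x, y) <- zip m (tail m), y /= x + 1 ]

-- The score of an anagram pair is the fewest chunks over all mappings.
score :: String -> String -> Int
score a b = minimum (map chunkCount (mappings a b))
```

With these definitions, score "abscise" "scabies" evaluates to 5 and score "aal" "ala" to 2, agreeing with the scored list below; words with many repeated letters blow up combinatorially, which is exactly the 7680-mapping behavior described above.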
Assigning scores in this way produced a scored anagram list which began
2 aal ala
and ended
4 zolotink zolotnik
and somewhere in the middle was
3 cholecystoduodenostomy duodenocholecystostomy
all poor scores. But sorted by score, there were treasures at the end, and the clear winner was

14 cinematographer megachiropteran

I declare this the single best anagram in English. It is 15 letters long, and the only letters that stay together are the E and the R. “Cinematographer” is as familiar as a 15-letter word can be, and “megachiropteran” means a giant bat. GIANT BAT! DEATH FROM ABOVE!!!
And there is no serious competition. There was another 14-pointer, but both its words are Webster's Second jargon that nobody knows:
14 rotundifoliate titanofluoride
There are no score 13 pairs, and the score 12 pairs are all obscure. So this is the winner, and a deserving winner it is.
I think there is something in the list to make everyone happy. If you are the type of person who enjoys anagrams, the list rewards casual browsing. A few examples:
7 admirer married
7 admires sidearm
8 negativism timesaving
8 peripatetic precipitate
8 scepters respects
8 shortened threnodes
8 soapstone teaspoons
9 earringed grenadier
9 excitation intoxicate
9 integrals triangles
9 ivoriness revisions
9 masculine calumnies
10 coprophagist topographics
10 chuprassie haruspices
10 citronella interlocal
11 clitoridean directional
11 dispensable piebaldness
11 endometritria intermediator
“Clitoridean / directional” has been one of my favorites for years. But my favorite of all, although it scores only 6, is
6 yttrious touristy
I think I might love it just because the word yttrious is so delightful. (What a debt we owe to Ytterby, Sweden!)
I also rather like
5 notaries senorita
which shows that even some of the low-scorers can be worth looking at. Clearly my chunk score is not the end of the story, because “notaries / senorita” should score better than “abets / baste” (which is boring) or “Acephali / Phacelia” (whatever those are), also 5-pointers. The length of the words should be worth something, and the familiarity of the words should be worth even more.
Here are the results:
In former times there was a restaurant in Philadelphia named “Soupmaster”. My best unassisted anagram discovery was noticing that this is an anagram of “mousetraps”.
[ Addendum: There is a followup article, which will become available on 22 February 2017. ]
The other day, Anders Claesson wrote a very nice blog post explaining a more combinatorial way to understand multiplicative inverses of virtual species (as opposed to the rather algebraic way I explained it in my previous post). In the middle of his post he makes an offhanded assumption which I stubbornly refused to take at face value; after thinking about it for a while and discussing it with Anders, I’m very glad I did, because there’s definitely more going on here than meets the eye and it’s given me a lot of interesting things to think about.
Recall that E denotes the species of sets, defined by E[U] = {U}, that is, the only E-structure on a given label set U is the set of labels itself. Recall also that the exponential generating function of a species F is given by

F(x) = Σ_{n ≥ 0} f_n · x^n / n!

where f_n counts the number of labelled F-structures of size n. In the case of E, we have f_n = 1 for all n, so

E(x) = Σ_{n ≥ 0} x^n / n! = e^x
(This is why E is such a good name for the species of sets—though in a fantastic coincidence, it seems to originally come from the French word for set, ensemble, rather than from the fact that E(x) = e^x (though on the other hand calling it a “coincidence” is probably too strong, since Joyal must surely have picked the notation with the generating function already in mind!).)
Now, from my previous post we know that

1/E = Σ_{k ≥ 0} (−1)^k (E_+)^k.

Let’s first consider Σ_{k ≥ 0} (E_+)^k (without the (−1)^k). This means that we have, for some k ≥ 0, a k-ary product of E_+ structures—in other words, a list of nonempty sets. This is the species of ballots, also known as ordered set partitions, and can also be written L ∘ E_+. As an example, here is a ballot on a set of labels:
The order of the parts matters, so this is a different ballot:
But the order of labels within each part doesn’t matter (since each part is a set). As another example, here is the complete collection of ballot structures on a set of three labels:
We can see that there are 13 in total: six where the labels are each in their own separate part (corresponding to the six possible permutations of the labels); six where two labels share a part and the other label is a singleton part (corresponding to the three ways to choose the solitary label, times the two ways to order the parts); and one final ballot where all three labels are grouped in the same part. (As an exercise, can you verify that there are 75 different ballot structures on a set of four labels?)
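These counts can be double-checked mechanically. A small Haskell sketch (the function names are mine) that counts ballots by choosing the nonempty first part and recursing on the rest:

```haskell
-- Number of ballots (ordered set partitions) on n labels: choose the
-- k >= 1 labels forming the first part, then arrange the remaining n - k.
ballots :: Integer -> Integer
ballots 0 = 1
ballots n = sum [ binomial n k * ballots (n - k) | k <- [1 .. n] ]
  where
    binomial m k = product [m - k + 1 .. m] `div` product [1 .. k]
```

Indeed ballots 3 is 13 and ballots 4 is 75, matching the counts above.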
Returning to 1/E = Σ_{k ≥ 0} (−1)^k (E_+)^k, we can see that it consists of signed ballots, where the sign of a ballot is the parity of its number of parts, that is, a ballot with k parts has sign (−1)^k. The second half of Anders’ post gives a nice combinatorial proof that E · (1/E) = 1, via a sign-reversing involution: if we consider E · (1/E)-structures, i.e. pairs of sets and signed ballots, there is a natural¹ way to pair them up, matching positive and negative structures so everything cancels (except in the case of the empty label set, which is why we get 1 instead of 0).
However, Anders is trying to do more than that. Note first that since multiplication of species corresponds to multiplication of EGFs, the EGF for 1/E is of course 1/e^x = e^{−x}. But this ought to also be the EGF for the virtual species E(−X), and the rest of his post hinges on identifying 1/E and E(−X). As Anders and I discovered, however, this is precisely the point where it is worth being more careful.
First of all, what is E(−X)? Intuitively, an E(−X) structure consists of a set of negative atoms; since each set can be thought of as an (unordered) product of atoms, the whole set acquires a sign given by the parity of the number of atoms. In other words, intuitively it seems that E(−X) should be the species of signed sets, where an even-sized set is considered positive and an odd-sized set negative. That is,

E(−X) = Σ_{n ≥ 0} (−1)^n E_n,

where E_n denotes the species of sets of size exactly n. As a sanity check, this makes sense as an EGF equation too, since the EGF of E_n is just x^n/n! and indeed

Σ_{n ≥ 0} (−1)^n x^n / n! = e^{−x}.
But hold on a minute, what does E(−X) really mean, formally? It is the composition of the species E with the virtual species −X, and it turns out that it is not at all a priori obvious how to define composition for virtual species! We can find the definition on p. 127 of Bergeron et al. A special case (which is enough for our present purposes) expresses the composition in terms of sums, products, and Cartesian products of ordinary species, where X and Y are two sorts of atoms and × denotes Cartesian product of species. In our case, the definition simplifies considerably, since E is the identity for Cartesian product (overlaying an additional E-structure on a set of labels does not add any structure, since there is only one possible E-structure).
All of this is to say, E(−X) is actually defined as 1/E! So at first glance it may seem we actually have nothing to prove: E(−X) and 1/E are the same by definition, end of story. …but in fact, all we have done is shift the burden of proof elsewhere: now it is our intuitive idea of E(−X) representing signed sets that requires proof!

To sum up, we know that 1/E is the species of signed ballots, with sign given by parity of the number of parts; and intuitively, we also believe that E(−X) should correspond to parity-signed sets, Σ_{n ≥ 0} (−1)^n E_n. So, is there a nice combinatorial proof showing the correspondence between signed sets and signed ballots?
One can use the law of excluded middle to show that the answer must be “yes”: suppose the answer were “no”; but then I would not be writing this blog post, which is a contradiction since I am writing this blog post. But is there a constructive proof? Fear not! This blog post has gotten long enough, so I will stop here for now and let interested readers puzzle over it; in my next post I will explain what I came up with, along with some musings on linear orders and naturality.
I am indeed using the word natural in a technical, categorical sense here! This will play an important role in my second post…↩
Ideas for desirable features of a Haskell IDE:
Good: IDE pops up the type signature of the library function or symbol under the point. Emacs haskell-mode can do this.
Better: IDE is aware of static scoping, let binding, and imports to really know what function you are referring to. However, if you forgot to import, it still tries to be helpful, guessing at a library function and offering its signature as well as a reminder that you need to import it.
Better: If the function does not have an explicit type signature, the IDE does type inference to figure it out.
Better: If the type is polymorphic, the IDE also provides the type of the function as instantiated where it is used, instead of just the polymorphic type where it was declared.
In the Theory Lunch of the last week, James Chapman talked about the MU puzzle from Douglas Hofstadter’s book Gödel, Escher, Bach. This puzzle is about a string rewriting system. James presented a Haskell program that computes derivations of strings. Inspired by this, I wrote my own implementation, with the goal of improving efficiency. This blog post presents this implementation. As usual, it is available as a literate Haskell file, which you can load into GHCi.
Let me first describe the MU puzzle shortly. The puzzle deals with strings that may contain the characters M, I, and U. We can derive new strings from old ones using the following rewriting system:

1. If a string ends in I, we may append a U: xI becomes xIU.
2. We may double the part of a string after a leading M: Mx becomes Mxx.
3. We may replace III by U: xIIIy becomes xUy.
4. We may drop UU: xUUy becomes xy.

The question is whether it is possible to turn the string MI into the string MU using these rules.
You may want to try to solve this puzzle yourself, or you may want to look up the solution on the Wikipedia page.
The code is not only concerned with deriving MU from MI, but with derivations as such.
We import Data.List:
import Data.List
We define the type Sym of symbols and the type Str of symbol strings:
data Sym = M | I | U deriving Eq
type Str = [Sym]
instance Show Sym where
show M = "M"
show I = "I"
show U = "U"
showList str = (concatMap show str ++)
Next, we define the type Rule of rules as well as the list rules that contains all rules:
data Rule = R1 | R2 | R3 | R4 deriving Show
rules :: [Rule]
rules = [R1,R2,R3,R4]
We first introduce a helper function that takes a string and returns the list of all splits of this string. Thereby, a split of a string str is a pair of strings str1 and str2 such that str1 ++ str2 == str. A straightforward implementation of splitting is as follows:
splits' :: Str -> [(Str,Str)]
splits' str = zip (inits str) (tails str)
The problem with this implementation is that walking through the result list takes quadratic time, even if the elements of the list are left unevaluated. The following implementation solves this problem:
splits :: Str -> [(Str,Str)]
splits str = zip (map (flip take str) [0 ..]) (tails str)
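To see what the function computes, here is a copy specialised to ordinary lists (the name splitsDemo is mine, to avoid clashing with the definition above):

```haskell
import Data.List (tails)

-- Same definition as above, specialised to ordinary lists: all ways to
-- split a list into a prefix and the matching suffix.
splitsDemo :: [a] -> [([a], [a])]
splitsDemo str = zip (map (`take` str) [0 ..]) (tails str)
```

For example, splitsDemo "MI" yields [("","MI"),("M","I"),("MI","")].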
Next, we define a helper function replace. An expression replace old new str yields the list of all strings that can be constructed by replacing the string old inside str by new.
replace :: Str -> Str -> Str -> [Str]
replace old new str = [front ++ new ++ rear |
(front,rest) <- splits str,
old `isPrefixOf` rest,
let rear = drop (length old) rest]
We are now ready to implement the function apply, which performs rule application. This function takes a rule and a string and produces all strings that can be derived from the given string using the given rule exactly once.
apply :: Rule -> Str -> [Str]
apply R1 str | not (null str) && last str == I = [str ++ [U]]
apply R2 (M : tail) = [M : tail ++ tail]
apply R3 str = replace [I,I,I] [U] str
apply R4 str = replace [U,U] [] str
apply _ _ = []
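To experiment with rule application without the Sym machinery, here is a self-contained replica on plain Strings (the names are mine; the rule numbering follows R1 to R4 above):

```haskell
import Data.List (inits, tails, isPrefixOf)

-- Replace every single occurrence of `old` in `str` by `new`.
replaceAll :: String -> String -> String -> [String]
replaceAll old new str =
  [ front ++ new ++ drop (length old) rest
  | (front, rest) <- zip (inits str) (tails str)
  , old `isPrefixOf` rest ]

-- The four MIU rules, numbered like R1..R4 above.
applyRule :: Int -> String -> [String]
applyRule 1 str | not (null str) && last str == 'I' = [str ++ "U"]
applyRule 2 ('M' : rest) = ['M' : rest ++ rest]
applyRule 3 str = replaceAll "III" "U" str
applyRule 4 str = replaceAll "UU" "" str
applyRule _ _ = []
```

For example, applyRule 2 "MI" gives ["MII"], and applyRule 3 "MIIII" gives ["MUI","MIU"], one result per occurrence of III.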
Now we want to build derivation trees. A derivation tree for a string str has the following properties:

- Its root is labeled with str.
- The subtrees of its root are the derivation trees of all strings that can be derived from str by a single rule application.

We first define types for representing derivation trees:
data DTree = DTree Str [DSub]
data DSub = DSub Rule DTree
Now we define the function dTree that turns a string into its derivation tree:
dTree :: Str -> DTree
dTree str = DTree str [DSub rule subtree |
rule <- rules,
subStr <- apply rule str,
let subtree = dTree subStr]
A derivation is a sequence of strings with rules between them such that each rule takes the string before it to the string after it. We define types for representing derivations:
data Deriv = Deriv [DStep] Str
data DStep = DStep Str Rule
instance Show Deriv where
show (Deriv steps goal) = " " ++
concatMap show steps ++
show goal ++
"\n"
showList derivs
= (concatMap ((++ "\n") . show) derivs ++)
instance Show DStep where
show (DStep origin rule) = show origin ++
"\n-> (" ++
show rule ++
") "
Now we implement a function derivs that converts a derivation tree into the list of all derivations that start with the tree’s root label. The function derivs traverses the tree in breadth-first order.
derivs :: DTree -> [Deriv]
derivs tree = worker [([],tree)] where
worker :: [([DStep],DTree)] -> [Deriv]
worker tasks = rootDerivs tasks ++
worker (subtasks tasks)
rootDerivs :: [([DStep],DTree)] -> [Deriv]
rootDerivs tasks = [Deriv (reverse revSteps) root |
(revSteps,DTree root _) <- tasks]
subtasks :: [([DStep],DTree)] -> [([DStep],DTree)]
subtasks tasks = [(DStep root rule : revSteps,subtree) |
(revSteps,DTree root subs) <- tasks,
DSub rule subtree <- subs]
Finally, we implement the function derivations, which takes two strings and returns the list of those derivations that turn the first string into the second:
derivations :: Str -> Str -> [Deriv]
derivations start end
= [deriv | deriv@(Deriv _ goal) <- derivs (dTree start),
goal == end]
You may want to enter

derivations [M,I] [M,U,I]

at the GHCi prompt to see the derivations function in action. You can also enter

derivations [M,I] [M,U]

to get an idea about the solution to the MU puzzle.
Generic programming is a powerful way to define a function that works in an analogous way for a class of types. In this article, I describe the latest approach to generic programming that is implemented in GHC. This approach goes back to the paper A Generic Deriving Mechanism for Haskell by José Pedro Magalhães, Atze Dijkstra, Johan Jeuring, and Andres Löh.
This article is a writeup of a Theory Lunch talk I gave on 4 February 2016. As usual, the source of this article is a literate Haskell file, which you can download, load into GHCi, and play with.
Parametric polymorphism allows you to write functions that deal with values of any type. An example of such a function is the reverse function, whose type is [a] -> [a]. You can apply reverse to any list, no matter what types the elements have.
However, parametric polymorphism does not allow your functions to depend on the structure of the concrete types that are used in place of type variables. So values of these types are always treated as black boxes. For example, the reverse function only reorders the elements of the given list. A function of type [a] -> [a] could also drop elements (like the tail function does) or duplicate elements (like the cycle function does), but it could never invent new elements (except for ⊥) or analyze elements.
Now there are situations where a function is suitable for a class of types that share certain properties. For example, the sum function works for all types that have a notion of binary addition. Haskell uses type classes to support such functions. For example, the Num class provides the method (+), which is used in the definition of sum, whose type Num a => [a] -> a contains a respective class constraint.
The methods of a class have to be implemented separately for every type that is an instance of the class. This is reasonable for methods like (+), where the implementations for the different instances differ fundamentally. However, it is unfortunate for methods that are implemented in an analogous way for most of the class instances. An example of such a method is (==), since there is a canonical way of checking values of algebraic data types for equality. It works by first comparing the outermost data constructors of the two given values and, if they match, the individual fields. Only when the data constructors and all the fields match are the two values considered equal.
For several standard classes, including Eq, Haskell provides the deriving mechanism to generate instances with default method implementations whose precise functionality depends on the structure of the type. Unfortunately, there is no possibility in standard Haskell to extend this deriving mechanism to user-defined classes. Generic programming is a way out of this problem.
For generic programming, we need several language extensions. The good thing is that only one of them, DeriveGeneric, is specific to generic programming. The other ones have uses in other areas as well. Furthermore, DeriveGeneric is a very small extension. So the generic programming approach we describe here can be considered very lightweight.
We state the full set of necessary extensions with the following pragma:
{-# LANGUAGE DefaultSignatures,
DeriveGeneric,
FlexibleContexts,
TypeFamilies,
TypeOperators #-}
Apart from these language extensions, we need the module GHC.Generics:
import GHC.Generics
As our running example, we pick serialization and deserialization of values. Serialization means converting a value into a bit string, and deserialization means parsing a bit string in order to get back a value.
We introduce a type Bit for representing bits:
data Bit = O | I deriving (Eq, Show)
Furthermore, we define the class of all types that support serialization and deserialization as follows:
class Serializable a where
put :: a -> [Bit]
get :: [Bit] -> (a, [Bit])
There is a canonical way of serializing values of algebraic data types. It works by first encoding the data constructor of the given value as a sequence of bits and then serializing the individual fields. To show this approach in action, we define an algebraic data type Tree, which is a type of labeled binary trees:
data Tree a = Leaf | Branch (Tree a) a (Tree a) deriving Show
An instantiation of Serializable for Tree that follows the canonical serialization approach can be carried out as follows:
instance Serializable a => Serializable (Tree a) where
put Leaf = [O]
put (Branch left root right) = [I] ++
put left ++
put root ++
put right
get (O : bits) = (Leaf, bits)
get (I : bits) = (Branch left root right, bits''') where
(left, bits') = get bits
(root, bits'') = get bits'
(right, bits''') = get bits''
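As a quick check that put and get are mutually inverse, here is a condensed, self-contained round trip: the Tree instance above together with a unary Int label instance (the same encoding as the hand-written Int instance near the end of this article):

```haskell
data Bit = O | I deriving (Eq, Show)

class Serializable a where
  put :: a -> [Bit]
  get :: [Bit] -> (a, [Bit])

data Tree a = Leaf | Branch (Tree a) a (Tree a) deriving (Eq, Show)

-- the canonical instance from above, condensed
instance Serializable a => Serializable (Tree a) where
  put Leaf           = [O]
  put (Branch l x r) = [I] ++ put l ++ put x ++ put r
  get (O : bits) = (Leaf, bits)
  get (I : bits) = (Branch l x r, bits3)
    where (l, bits1) = get bits
          (x, bits2) = get bits1
          (r, bits3) = get bits2

-- unary encoding, so that we have a label type to test with
instance Serializable Int where
  put n    = replicate n I ++ [O]
  get bits = (length ones, rest)
    where (ones, O : rest) = span (== I) bits
```

For example, put (Branch Leaf (2 :: Int) Leaf) yields [I,O,I,I,O,O], and get maps it back to the original tree with no bits left over.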
Of course, it quickly becomes cumbersome to provide such an instance declaration for every algebraic data type that should use the canonical serialization approach. So we want to implement the canonical approach once and for all and make it easily usable for arbitrary types that are amenable to it. Generic programming makes this possible.
An algebraic data type is essentially a sum of products where the terms “sum” and “product” are understood as follows:
A sum is a variant type. In Haskell, Either is the canonical type constructor for binary sums, and the empty type Void from the void package is the nullary sum.

A product is a tuple type. In Haskell, (,) is the canonical type constructor for binary products, and () is the nullary product.
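For example, Bool, with its two nullary constructors, is a sum of two nullary products and is therefore isomorphic to Either () (). A tiny sketch of this correspondence (the names are mine):

```haskell
-- Bool viewed as a sum of two nullary products: Either () ().
toRep :: Bool -> Either () ()
toRep False = Left ()
toRep True  = Right ()

fromRep :: Either () () -> Bool
fromRep (Left  ()) = False
fromRep (Right ()) = True
```

The two functions are mutually inverse, which is what "up to isomorphism" means here.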
The key idea of generic programming is to map types to representations that make the sum-of-products structure explicit and to implement canonical behavior based on these representations instead of the actual types.
The GHC.Generics module defines a number of type constructors for constructing representations:
data V1 p
infixr 5 :+:
data (:+:) f g p = L1 (f p) | R1 (g p)
data U1 p = U1
infixr 6 :*:
data (:*:) f g p = f p :*: g p
newtype K1 i a p = K1 { unK1 :: a }
newtype M1 i a f p = M1 { unM1 :: f p }
All of these type constructors take a final parameter p. This parameter is relevant only when dealing with higher-order classes. In this article, however, we only discuss generic programming with first-order classes. In this case, the parameter p is ignored. The different type constructors play the following roles:

- V1 is for the nullary sum.
- (:+:) is for binary sums.
- U1 is for the nullary product.
- (:*:) is for binary products.
- K1 is a wrapper for fields of algebraic data types. Its parameter i used to provide some information about the field at the type level, but is now obsolete.
- M1 is a wrapper for attaching meta information at the type level. Its parameter i denotes the kind of the language construct the meta information refers to, and its parameter a provides access to the meta information.
The GHC.Generics module furthermore introduces a class Generic, whose instances are the types for which a representation exists. Its definition is as follows:
class Generic a where
type Rep a :: * -> *
from :: a -> (Rep a) p
to :: (Rep a) p -> a
A type Rep a is the representation of the type a. The methods from and to convert from values of the actual type to values of the representation type and vice versa.
To see all this in action, we make Tree a an instance of Generic:
instance Generic (Tree a) where
type Rep (Tree a) =
M1 D D1_Tree (
M1 C C1_Tree_Leaf U1
:+:
M1 C C1_Tree_Branch (
M1 S NoSelector (K1 R (Tree a))
:*:
M1 S NoSelector (K1 R a)
:*:
M1 S NoSelector (K1 R (Tree a))
)
)
from Leaf = M1 (L1 (M1 U1))
from (Branch left root right) = M1 (
R1 (
M1 (
M1 (K1 left)
:*:
M1 (K1 root)
:*:
M1 (K1 right)
))
)
to (M1 (L1 (M1 U1))) = Leaf
to (M1 (
R1 (
M1 (
M1 (K1 left)
:*:
M1 (K1 root)
:*:
M1 (K1 right)
))
)) = Branch left root right
The types D1_Tree, C1_Tree_Leaf, and C1_Tree_Branch are type-level representations of the type constructor Tree, the data constructor Leaf, and the data constructor Branch, respectively. We declare them as empty types:
data D1_Tree
data C1_Tree_Leaf
data C1_Tree_Branch
We need to make these types instances of the classes Datatype and Constructor, which are part of GHC.Generics as well. These classes provide a link between the type-level representations of type and data constructors and the meta information related to them. This meta information particularly covers the identifiers of the type and data constructors, which are needed when implementing canonical implementations for methods like show and read. The instance declarations for the Tree-related types are as follows:
instance Datatype D1_Tree where
datatypeName _ = "Tree"
moduleName _ = "Main"
instance Constructor C1_Tree_Leaf where
conName _ = "Leaf"
instance Constructor C1_Tree_Branch where
conName _ = "Branch"
Instantiating the Generic class as shown above is obviously an extremely tedious task. However, it is possible to instantiate Generic completely automatically for any given algebraic data type, using the deriving syntax. This is what the DeriveGeneric language extension makes possible.
So instead of making Tree a an instance of Generic by hand, as we have done above, we could have declared the Tree type as follows in the first place:
data Tree a = Leaf | Branch (Tree a) a (Tree a)
deriving (Show, Generic)
As mentioned above, we implement canonical behavior based on representations. Let us see how this works in the case of the Serializable class.

We introduce a new class Serializable' whose methods provide serialization and deserialization for representation types:
class Serializable' f where
put' :: f p -> [Bit]
get' :: [Bit] -> (f p, [Bit])
We instantiate this class for all the representation types:
instance Serializable' U1 where
put' U1 = []
get' bits = (U1, bits)
instance (Serializable' r, Serializable' s) =>
Serializable' (r :*: s) where
put' (rep1 :*: rep2) = put' rep1 ++ put' rep2
get' bits = (rep1 :*: rep2, bits'') where
(rep1, bits') = get' bits
(rep2, bits'') = get' bits'
instance Serializable' V1 where
put' _ = error "attempt to put a void value"
get' _ = error "attempt to get a void value"
instance (Serializable' r, Serializable' s) =>
Serializable' (r :+: s) where
put' (L1 rep) = O : put' rep
put' (R1 rep) = I : put' rep
get' (O : bits) = let (rep, bits') = get' bits in
(L1 rep, bits')
get' (I : bits) = let (rep, bits') = get' bits in
(R1 rep, bits')
instance Serializable' r => Serializable' (M1 i a r) where
put' (M1 rep) = put' rep
get' bits = (M1 rep, bits') where
(rep, bits') = get' bits
instance Serializable a => Serializable' (K1 i a) where
put' (K1 val) = put val
get' bits = (K1 val, bits') where
(val, bits') = get bits
Note that in the case of K1, the context mentions Serializable, not Serializable', and the methods put' and get' call put and get, not put' and get'. The reason is that the value wrapped in K1 has an ordinary type, not a representation type.
We can now apply canonical behavior to ordinary types using the methods from and to from the Generic class. For example, we can implement functions defaultPut and defaultGet that provide canonical serialization and deserialization for all instances of Generic:
defaultPut :: (Generic a, Serializable' (Rep a)) =>
a -> [Bit]
defaultPut = put' . from
defaultGet :: (Generic a, Serializable' (Rep a)) =>
[Bit] -> (a, [Bit])
defaultGet bits = (to rep, bits') where
(rep, bits') = get' bits
We can use these functions in instance declarations for Serializable. For example, we can make Tree a an instance of Serializable in the following way:
instance Serializable a => Serializable (Tree a) where
put = defaultPut
get = defaultGet
Compared to the instance declaration we had initially, this one is a real improvement, since we do not have to implement the desired behavior of put and get by hand anymore. However, it still contains boilerplate code in the form of the trivial method declarations. It would be better to establish defaultPut and defaultGet as defaults in the class declaration:
class Serializable a where
put :: a -> [Bit]
put = defaultPut
get :: [Bit] -> (a, [Bit])
get = defaultGet
However, this is not possible, since the types of defaultPut and defaultGet are less general than the types of put and get, as they put additional constraints on the type a. Luckily, GHC supports the language extension DefaultSignatures, which allows us to give default implementations that have less general types than the actual methods (and consequently work only for those instances that are compatible with these less general types). Using DefaultSignatures, we can declare the Serializable class as follows:
class Serializable a where
put :: a -> [Bit]
default put :: (Generic a, Serializable' (Rep a)) =>
a -> [Bit]
put = defaultPut
get :: [Bit] -> (a, [Bit])
default get :: (Generic a, Serializable' (Rep a)) =>
[Bit] -> (a, [Bit])
get = defaultGet
With this class declaration in place, we can make Tree a an instance of Serializable as follows:
instance Serializable a => Serializable (Tree a)
With the minor extension DeriveAnyClass, which is provided by GHC starting from version 7.10, we can even use the deriving keyword to instantiate Serializable for Tree a. As a result, we only have to write the following in order to define the Tree type and make it an instance of Serializable:
data Tree a = Leaf | Branch (Tree a) a (Tree a)
deriving (Show, Generic, Serializable)
So finally, we can use our own classes like the Haskell standard classes regarding the use of deriving clauses, except that we additionally have to derive an instance declaration for Generic.
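The DefaultSignatures mechanism is independent of Generic and can be seen in isolation. The following toy class uses a default method whose type is constrained by Show, although the class itself imposes no such constraint; Describable and Color are made-up names for this sketch, not part of the serialization example:

```haskell
{-# LANGUAGE DefaultSignatures #-}

-- A toy class whose default method requires Show, while the class itself
-- does not. Describable and Color are hypothetical names for illustration.
class Describable a where
  describe :: a -> String
  default describe :: Show a => a -> String
  describe = show

data Color = Red | Green deriving Show

-- The empty instance picks up the Show-based default.
instance Describable Color
```

Here describe Green yields "Green" via the default, exactly as put and get pick up defaultPut and defaultGet above.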
Usually, not all instances of a class should or even can be generated by means of generic programming, but some instances have to be crafted by hand. For example, making Int
an instance of Serializable
requires manual work, since Int
is not an algebraic data type.
However, there is no problem with this, since we still have the opportunity to write explicit instance declarations, despite the presence of a generic solution. This is in line with the standard deriving mechanism: you can make use of it, but you are not forced to do so. So we can have the following instance declaration, for example:
instance Serializable Int where
put val = replicate val I ++ [O]
get bits = (length is, bits') where
(is, O : bits') = span (== I) bits
Of course, the serialization approach we use here is not very efficient, but the instance declaration illustrates the point we want to make.
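To see the unary encoding at work outside the class, here is a small self-contained sketch; putInt and getInt are standalone stand-ins for the methods above, and the Bit type is assumed to have the constructors O and I:

```haskell
-- Standalone versions of the unary Int serialization from the instance above.
-- Bit is assumed to be defined like this:
data Bit = O | I deriving (Show, Eq)

putInt :: Int -> [Bit]
putInt val = replicate val I ++ [O]   -- n is encoded as n ones and a zero

getInt :: [Bit] -> (Int, [Bit])
getInt bits = (length is, bits') where
  (is, O : bits') = span (== I) bits  -- count leading ones, drop the zero
```

For example, putInt 3 produces [I, I, I, O], and getInt reads it back, returning the remaining bits for any subsequent fields.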
There are Haskell types that have an associated monad structure, but cannot be made instances of the Monad class. The reason is typically that the return or the bind operation of such a type m has a constraint on the type parameter of m. As a result, all the nice library support for monads is unusable for such types. This problem is called the constrained-monad problem.
In my article The Constraint kind, I described a solution to this problem, which involved changing the Monad class. In this article, I present a solution that works with the standard Monad class. This solution has been developed by Neil Sculthorpe, Jan Bracker, George Giorgidze, and Andy Gill. It is described in their paper The Constrained-Monad Problem and implemented in the constrained-normal package.
This article is a write-up of a Theory Lunch talk I gave quite some time ago. As usual, the source of this article is a literate Haskell file, which you can download, load into GHCi, and play with.
We have to enable a couple of language extensions:
{-# LANGUAGE ConstraintKinds,
ExistentialQuantification,
FlexibleInstances,
GADTSyntax,
Rank2Types #-}
Furthermore, we need to import some modules:
import Data.Set hiding (fold, map)
import Data.Natural hiding (fold)
These imports require the packages containers and natural-numbers to be installed.
The Set type has an associated monad structure, consisting of a return and a bind operation:
returnSet :: a -> Set a
returnSet = singleton
bindSet :: Ord b => Set a -> (a -> Set b) -> Set b
bindSet sa g = unions (map g (toList sa))
We cannot make Set an instance of Monad, though, since bindSet has an Ord constraint on the element type of the result set, which is caused by the use of unions.
For a solution, let us first look at how monadic computations on sets would be expressed if Set were an instance of Monad. A monadic expression would be built from non-monadic expressions and applications of return and (>>=). For every such expression, there would be a normal form of the shape
s_1 >>= \x_1 ->
s_2 >>= \x_2 ->
…
s_n >>= \x_n ->
return r
where the s_i would be non-monadic expressions of type Set. The existence of a normal form would follow from the monad laws.
We define a type UniSet of such normal forms:
data UniSet a where
ReturnSet :: a -> UniSet a
AtmBindSet :: Set a -> (a -> UniSet b) -> UniSet b
We can make UniSet an instance of Monad, where the monad operations build expressions and normalize them on the fly:
instance Monad UniSet where
return a = ReturnSet a
ReturnSet a >>= f = f a
AtmBindSet sa h >>= f = AtmBindSet sa h' where
h' a = h a >>= f
Note that these monad operations are analogous to operations on lists, with return corresponding to singleton construction and (>>=) corresponding to concatenation. Normalization happens in (>>=) by applying the left-identity and the associativity laws for monads.
We can use UniSet as an alternative set type, representing a set by a normal form that evaluates to this set. This way, we get a set type that is an instance of Monad. For this to be sane, we have to hide the data constructors of UniSet, so that different normal forms that evaluate to the same set cannot be distinguished.
Now we need functions that convert between Set and UniSet. Conversion from Set to UniSet is simple:
toUniSet :: Set a -> UniSet a
toUniSet sa = AtmBindSet sa ReturnSet
Conversion from UniSet to Set is expression evaluation:
fromUniSet :: Ord a => UniSet a -> Set a
fromUniSet (ReturnSet a) = returnSet a
fromUniSet (AtmBindSet sa h) = bindSet sa g where
g a = fromUniSet (h a)
The type of fromUniSet constrains the element type to be an instance of Ord. This single constraint is enough to make all invocations of bindSet throughout the conversion legal. The reason is our use of normal forms. Since normal forms are “right-leaning”, all applications of (>>=) in them have the same result type as the whole expression.
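Putting the pieces together, the following self-contained sketch can be compiled on its own. Note that recent GHC versions require Functor and Applicative instances alongside Monad; the article's code predates that change, so those instances are added here:

```haskell
{-# LANGUAGE GADTs #-}
import qualified Data.Set as S

-- Normal forms of monadic set expressions, as in the article.
data UniSet a where
  ReturnSet  :: a -> UniSet a
  AtmBindSet :: S.Set a -> (a -> UniSet b) -> UniSet b

instance Functor UniSet where
  fmap f m = m >>= (ReturnSet . f)

instance Applicative UniSet where
  pure = ReturnSet
  mf <*> ma = mf >>= \f -> fmap f ma

instance Monad UniSet where
  ReturnSet a     >>= f = f a
  AtmBindSet sa h >>= f = AtmBindSet sa (\a -> h a >>= f)

toUniSet :: S.Set a -> UniSet a
toUniSet sa = AtmBindSet sa ReturnSet

-- Evaluation of normal forms; the single Ord constraint suffices.
fromUniSet :: Ord a => UniSet a -> S.Set a
fromUniSet (ReturnSet a)     = S.singleton a
fromUniSet (AtmBindSet sa h) = S.unions (map (fromUniSet . h) (S.toList sa))

-- Example: sums of pairs drawn from two sets, via ordinary do-notation.
pairSums :: S.Set Int
pairSums = fromUniSet $ do
  x <- toUniSet (S.fromList [1, 2])
  y <- toUniSet (S.fromList [10, 20])
  return (x + y)
```

Here pairSums evaluates to the set {11, 12, 21, 22}.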
Let us now look at a different monad, the multiset monad.
We represent a multiset as a function that maps each value of the element type to its multiplicity in the multiset, with a multiplicity of zero denoting absence of this value:
newtype MSet a = MSet { mult :: a -> Natural }
Now we define the return operation:
returnMSet :: Eq a => a -> MSet a
returnMSet a = MSet ma where
ma b | a == b = 1
| otherwise = 0
For defining the bind operation, we need to define a class Finite of finite types, whose sole method enumerates all the values of the respective type:
class Finite a where
values :: [a]
The implementation of the bind operation is as follows:
bindMSet :: Finite a => MSet a -> (a -> MSet b) -> MSet b
bindMSet msa g = MSet mb where
mb b = sum [mult msa a * mult (g a) b | a <- values]
Note that the multiset monad differs from the set monad in its use of constraints. The set monad imposes a constraint on the result element type of bind, while the multiset monad imposes a constraint on the first argument element type of bind and another constraint on the result element type of return.
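The multiset operations can likewise be tried in isolation. The sketch below uses Integer multiplicities instead of the Natural type from natural-numbers, and a Finite instance for Bool, so it needs no extra packages:

```haskell
-- Self-contained sketch of the multiset operations, with Integer standing
-- in for the Natural type of the natural-numbers package.
newtype MSet a = MSet { mult :: a -> Integer }

class Finite a where
  values :: [a]

instance Finite Bool where
  values = [False, True]

returnMSet :: Eq a => a -> MSet a
returnMSet a = MSet (\b -> if a == b then 1 else 0)

bindMSet :: Finite a => MSet a -> (a -> MSet b) -> MSet b
bindMSet msa g = MSet mb where
  mb b = sum [mult msa a * mult (g a) b | a <- values]

-- The multiset {False, False, True} as a multiplicity function:
example :: MSet Bool
example = MSet (\b -> if b then 1 else 2)

-- Binding with negation yields {True, True, False}:
negated :: MSet Bool
negated = bindMSet example (returnMSet . not)
```

Binding the multiset {False, False, True} with negation swaps the multiplicities: negated assigns 2 to True and 1 to False.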
Like in the case of sets, we define a type of monadic normal forms:
data UniMSet a where
ReturnMSet :: a -> UniMSet a
AtmBindMSet :: Finite a =>
MSet a -> (a -> UniMSet b) -> UniMSet b
The key difference to UniSet is that UniMSet involves the constraint of the bind operation, so that normal forms must respect this constraint. Without this restriction, it would not be possible to evaluate normal forms later.
The Monad–UniMSet instance declaration is analogous to the Monad–UniSet instance declaration:
instance Monad UniMSet where
return a = ReturnMSet a
ReturnMSet a >>= f = f a
AtmBindMSet msa h >>= f = AtmBindMSet msa h' where
h' a = h a >>= f
Now we define conversion from MSet to UniMSet:
toUniMSet :: Finite a => MSet a -> UniMSet a
toUniMSet msa = AtmBindMSet msa ReturnMSet
Note that we need to constrain the element type in order to fulfill the constraint incorporated into the UniMSet type.
Finally, we define conversion from UniMSet to MSet:
fromUniMSet :: Eq a => UniMSet a -> MSet a
fromUniMSet (ReturnMSet a) = returnMSet a
fromUniMSet (AtmBindMSet msa h) = bindMSet msa g where
g a = fromUniMSet (h a)
Here we need to impose an Eq constraint on the element type. Note that this single constraint is enough to make all invocations of returnMSet throughout the conversion legal. The reason is again our use of normal forms.
The solutions to the constrained-monad problem for sets and multisets are very similar. It is certainly not good if we have to write almost the same code for every new constrained monad that we want to make accessible via the Monad class. Therefore, we define a generic type that covers all such monads:
data UniMonad c t a where
Return :: a -> UniMonad c t a
AtmBind :: c a =>
t a -> (a -> UniMonad c t b) -> UniMonad c t b
The parameter t of UniMonad is the underlying data type, like Set or MSet, and the parameter c is the constraint that has to be imposed on the type parameter of the first argument of the bind operation.
For every c and t, we make UniMonad c t an instance of Monad:
instance Monad (UniMonad c t) where
return a = Return a
Return a >>= f = f a
AtmBind ta h >>= f = AtmBind ta h' where
h' a = h a >>= f
We define a function lift that converts from the underlying data type to UniMonad and thus generalizes toUniSet and toUniMSet:
lift :: c a => t a -> UniMonad c t a
lift ta = AtmBind ta Return
Evaluation of normal forms is just folding with the return and bind operations of the underlying data type. Therefore, we implement a fold operator for UniMonad:
fold :: (a -> r)
-> (forall a . c a => t a -> (a -> r) -> r)
-> UniMonad c t a
-> r
fold return _ (Return a) = return a
fold return atmBind (AtmBind ta h) = atmBind ta g where
g a = fold return atmBind (h a)
Note that fold does not need to deal with constraints, neither with constraints on the result type parameter of return (like Eq in the case of MSet), nor with constraints on the result type parameter of bind (like Ord in the case of Set). This is because fold works with any result type r.
Now let us implement Monad-compatible sets and multisets based on UniMonad.
In the case of sets, we face the problem that UniMonad takes a constraint for the type parameter of the first bind argument, but bindSet does not have such a constraint. To solve this issue, we introduce a type class Unconstrained, of which every type is an instance:
class Unconstrained a
instance Unconstrained a
The implementation of Monad-compatible sets is now straightforward:
type UniMonadSet = UniMonad Unconstrained Set
toUniMonadSet :: Set a -> UniMonadSet a
toUniMonadSet = lift
fromUniMonadSet :: Ord a => UniMonadSet a -> Set a
fromUniMonadSet = fold returnSet bindSet
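Here is the generic machinery assembled into a self-contained sketch, again with the Functor and Applicative instances that recent GHC versions require, and with the quantified type variable in fold renamed to b to avoid shadowing:

```haskell
{-# LANGUAGE ConstraintKinds, GADTs, FlexibleInstances, Rank2Types #-}
import qualified Data.Set as S

-- Generic normal forms, parameterized by the bind constraint c
-- and the underlying data type t.
data UniMonad c t a where
  Return  :: a -> UniMonad c t a
  AtmBind :: c a => t a -> (a -> UniMonad c t b) -> UniMonad c t b

instance Functor (UniMonad c t) where
  fmap f m = m >>= (Return . f)

instance Applicative (UniMonad c t) where
  pure = Return
  mf <*> ma = mf >>= \f -> fmap f ma

instance Monad (UniMonad c t) where
  Return a     >>= f = f a
  AtmBind ta h >>= f = AtmBind ta (\a -> h a >>= f)

class Unconstrained a
instance Unconstrained a

lift :: c a => t a -> UniMonad c t a
lift ta = AtmBind ta Return

fold :: (a -> r)
     -> (forall b . c b => t b -> (b -> r) -> r)
     -> UniMonad c t a
     -> r
fold ret _       (Return a)     = ret a
fold ret atmBind (AtmBind ta h) = atmBind ta (\a -> fold ret atmBind (h a))

type UniMonadSet = UniMonad Unconstrained S.Set

fromUniMonadSet :: Ord a => UniMonadSet a -> S.Set a
fromUniMonadSet = fold S.singleton bindSet where
  bindSet sa g = S.unions (map g (S.toList sa))

-- Cartesian products via ordinary do-notation:
products :: S.Set Int
products = fromUniMonadSet $ do
  x <- lift (S.fromList [2, 3])
  y <- lift (S.fromList [5, 7])
  return (x * y)
```

Here products evaluates to the set {10, 14, 15, 21}.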
The implementation of Monad-compatible multisets does not need any utility definitions, but can be given right away:
type UniMonadMSet = UniMonad Finite MSet
toUniMonadMSet :: Finite a => MSet a -> UniMonadMSet a
toUniMonadMSet = lift
fromUniMonadMSet :: Eq a => UniMonadMSet a -> MSet a
fromUniMonadMSet = fold returnMSet bindMSet
More than two years ago, my colleague Denis Firsov and I gave a series of three Theory Lunch talks about the MIU string rewriting system from Douglas Hofstadter’s MU puzzle. The first talk was about a Haskell implementation of MIU, the second talk was an introduction to the functional logic programming language Curry, and the third talk was about a Curry implementation of MIU. The blog articles MIU in Haskell and A taste of Curry are write-ups of the first two talks. However, a write-up of the third talk has never seen the light of day so far. This is changed with this article.
As usual, this article is written using literate programming. The article source is a literate Curry file, which you can load into KiCS2 to play with the code.
I want to thank all the people from the Curry mailing list who have helped me improve the code in this article.
We import the module SearchTree:
import SearchTree
We define the type Sym of symbols and the type Str of symbol strings:
data Sym = M | I | U
showSym :: Sym -> String
showSym M = "M"
showSym I = "I"
showSym U = "U"
type Str = [Sym]
showStr :: Str -> String
showStr str = concatMap showSym str
Next, we define the type Rule of rules:
data Rule = R1 | R2 | R3 | R4
showRule :: Rule -> String
showRule R1 = "R1"
showRule R2 = "R2"
showRule R3 = "R3"
showRule R4 = "R4"
So far, the Curry code is basically the same as the Haskell code. However, this is going to change below.
Rule application becomes a lot simpler in Curry. In fact, we can code the rewriting rules almost directly to get a rule application function:
applyRule :: Rule -> Str -> Str
applyRule R1 (init ++ [I]) = init ++ [I, U]
applyRule R2 ([M] ++ tail) = [M] ++ tail ++ tail
applyRule R3 (pre ++ [I, I, I] ++ post) = pre ++ [U] ++ post
applyRule R4 (pre ++ [U, U] ++ post) = pre ++ post
Note that we do not return a list of derivable strings, as we did in the Haskell solution. Instead, we use the fact that functions in Curry are nondeterministic.
Furthermore, we do not need the helper functions splits and replace that we used in the Haskell implementation. Instead, we use the ++ operator in conjunction with functional patterns to achieve the same functionality.
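The Haskell splits helper is not shown in this article; one plausible definition (a sketch, not necessarily the article's actual code) pairs every prefix of a list with the corresponding suffix:

```haskell
import Data.List (inits, tails)

-- All ways to split a list into a front part and a rear part.
splits :: [a] -> [([a], [a])]
splits list = zip (inits list) (tails list)
```

For example, splits "ab" gives [("","ab"),("a","b"),("ab","")], which is exactly what the functional pattern front ++ rear enumerates nondeterministically in Curry.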
Now we implement a utility function applyRules for repeated rule application. Our implementation uses a similar trick as the famous Haskell implementation of the Fibonacci sequence:
applyRules :: [Rule] -> Str -> [Str]
applyRules rules str = tail strs where
strs = str : zipWith applyRule rules strs
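For readers who do not know it, the Fibonacci definition alluded to here is the classic self-referential Haskell stream, and the same trick works for iterated function application; applyAll below is a hypothetical deterministic analogue of applyRules, written in Haskell:

```haskell
-- The famous self-referential Fibonacci stream.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

-- The same trick for applying a list of functions step by step:
-- each intermediate result feeds the next application.
applyAll :: [a -> a] -> a -> [a]
applyAll fs x = tail xs where
  xs = x : zipWith ($) fs xs
```

For example, applyAll [(+ 1), (* 2)] 3 yields [4, 8].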
The Haskell implementation does not need the applyRules function, but it needs a lot of code about derivation trees instead. In the Curry solution, derivation trees are implicit, thanks to nondeterminism.
A derivation is a sequence of strings with rules between them such that each rule takes the string before it to the string after it. We define types for representing derivations:
data Deriv = Deriv [DStep] Str
data DStep = DStep Str Rule
showDeriv :: Deriv -> String
showDeriv (Deriv steps goal) = " " ++
concatMap showDStep steps ++
showStr goal ++
"\n"
showDerivs :: [Deriv] -> String
showDerivs derivs = concatMap ((++ "\n") . showDeriv) derivs
showDStep :: DStep -> String
showDStep (DStep origin rule) = showStr origin ++
"\n-> (" ++
showRule rule ++
") "
Now we implement a function derivation that takes two strings and returns the derivations that turn the first string into the second:
derivation :: Str -> Str -> Deriv
derivation start end
| start : applyRules rules start =:= init ++ [end]
= Deriv (zipWith DStep init rules) end where
rules :: [Rule]
rules free
init :: [Str]
init free
Finally, we define a function printDerivations that explicitly invokes a breadth-first search to compute and ultimately print derivations:
printDerivations :: Str -> Str -> IO ()
printDerivations start end = do
searchTree <- getSearchTree (derivation start end)
putStr $ showDerivs (allValuesBFS searchTree)
You may want to enter
printDerivations [M, I] [M, I, U]
at the KiCS2 prompt to see the derivation function in action.
Curry is a programming language that integrates functional and logic programming. Last week, Denis Firsov and I had a look at Curry, and Thursday, I gave an introductory talk about Curry in the Theory Lunch. This blog post is mostly a write-up of my talk.
Like Haskell, Curry has support for literate programming. So I wrote this blog post as a literate Curry file, which is available for download. If you want to try out the code, you have to install the Curry system KiCS2. The code uses the functional patterns language extension, which is only supported by KiCS2, as far as I know.
The functional fragment of Curry is very similar to Haskell. The only fundamental difference is that Curry does not support type classes.
Let us do some functional programming in Curry. First, we define a type whose values denote me and some of my relatives.
data Person = Paul
| Joachim
| Rita
| Wolfgang
| Veronika
| Johanna
| Jonathan
| Jaromir
Now we define a function that yields the father of a given person, if this father is covered by the Person type.
father :: Person -> Person
father Joachim = Paul
father Rita = Joachim
father Wolfgang = Joachim
father Veronika = Joachim
father Johanna = Wolfgang
father Jonathan = Wolfgang
father Jaromir = Wolfgang
Based on father, we define a function for computing grandfathers. To keep things simple, we only consider fathers of fathers to be grandfathers, not fathers of mothers.
grandfather :: Person -> Person
grandfather = father . father
Logic programming languages like Prolog are able to search for variable assignments that make a given proposition true. Curry, on the other hand, can search for variable assignments that make a certain expression defined.
For example, we can search for all persons that have a grandfather according to the above data. We just enter
grandfather person where person free
at the KiCS2 prompt. KiCS2 then outputs all assignments to the person variable for which grandfather person is defined. For each of these assignments, it additionally prints the result of the expression grandfather person.
Functions in Curry can actually be non-deterministic; that is, they can return multiple results. For example, we can define a function element that returns any element of a given list. To achieve this, we use overlapping patterns in our function definition. If several equations of a function definition match a particular function application, Curry takes all of them, not only the first one as Haskell does.
element :: [el] -> el
element (el : _) = el
element (_ : els) = element els
Now we can enter
element "Hello!"
at the KiCS2 prompt, and the system outputs six different results.
We have already seen how to combine functional and logic programming with Curry. Now we want to do pure logic programming. This means that we only want to search for variable assignments, but are not interested in expression results. If you are not interested in results, you typically use a result type with only a single value. Curry provides the type Success with the single value success for doing logic programming.
Let us write some example code about routes between countries. We first introduce a type of some European and American countries.
data Country = Canada
| Estonia
| Germany
| Latvia
| Lithuania
| Mexico
| Poland
| Russia
| USA
Now we want to define a relation called borders that tells us which country borders which other country. We implement this relation as a function of type
Country -> Country -> Success
that has the trivial result success if the first country borders the second one, and has no result otherwise.
Note that this approach of implementing a relation is different from what we do in functional programming. In functional programming, we use Bool as the result type and signal falsity by the result False. In Curry, however, we signal falsity by the absence of a result.
Our borders relation only relates countries with those neighbouring countries whose names come later in alphabetical order. We will soon compute the symmetric closure of borders to also get the opposite relationships.
borders :: Country -> Country -> Success
Canada `borders` USA = success
Estonia `borders` Latvia = success
Estonia `borders` Russia = success
Germany `borders` Poland = success
Latvia `borders` Lithuania = success
Latvia `borders` Russia = success
Lithuania `borders` Poland = success
Mexico `borders` USA = success
Now we want to define a relation isConnected that tells whether two countries can be reached from each other via a land route. Clearly, isConnected is the equivalence relation that is generated by borders. In Prolog, we would write clauses that directly express this relationship between borders and isConnected. In Curry, on the other hand, we can write a function that generates an equivalence relation from any given relation and therefore does not only work with borders.
We first define a type alias Relation for the sake of convenience.
type Relation val = val -> val -> Success
Now we define what reflexive, symmetric, and transitive closures are.
reflClosure :: Relation val -> Relation val
reflClosure rel val1 val2 = rel val1 val2
reflClosure rel val val = success
symClosure :: Relation val -> Relation val
symClosure rel val1 val2 = rel val1 val2
symClosure rel val2 val1 = rel val1 val2
transClosure :: Relation val -> Relation val
transClosure rel val1 val2 = rel val1 val2
transClosure rel val1 val3 = rel val1 val2 &
transClosure rel val2 val3
where val2 free
The operator & used in the definition of transClosure has the type
Success -> Success -> Success
and denotes conjunction.
We define the function for generating equivalence relations as a composition of the above closure operators. Note that it is crucial that the transitive closure operator is applied after the symmetric closure operator, since the symmetric closure of a transitive relation is not necessarily transitive.
equivalence :: Relation val -> Relation val
equivalence = reflClosure . transClosure . symClosure
The implementation of isConnected is now trivial.
isConnected :: Country -> Country -> Success
isConnected = equivalence borders
Now we let KiCS2 compute which countries I can reach from Estonia without a ship or plane. We do so by entering
Estonia `isConnected` country where country free
at the prompt.
We can also implement a nondeterministic function that turns a country into the countries connected to it. For this, we use a guard that is of type Success. Such a guard succeeds if it has a result at all, which can only be success, of course.
connected :: Country -> Country
connected country1
| country1 `isConnected` country2 = country2
where country2 free
Curry has a predefined operator
=:= :: val -> val -> Success
that stands for equality.
We can use this operator, for example, to define a nondeterministic function that yields the grandchildren of a given person. Again, we keep things simple by only considering relationships that solely go via fathers.
grandchild :: Person -> Person
grandchild person
| grandfather grandkid =:= person = grandkid
where grandkid free
Note that grandchild is the inverse of grandfather.
Functional patterns are a language extension that allows us to use ordinary functions in patterns, not just data constructors. Functional patterns are implemented by KiCS2.
Let us look at an example again. We want to define a function split that nondeterministically splits a list into two parts.^{1} Without functional patterns, we can implement splitting as follows.
split' :: [el] -> ([el],[el])
split' list | front ++ rear =:= list = (front,rear)
where front, rear free
With functional patterns, we can implement splitting in a much simpler way.
split :: [el] -> ([el],[el])
split (front ++ rear) = (front,rear)
As a second example, let us define a function sublist that yields the sublists of a given list.
sublist :: [el] -> [el]
sublist (_ ++ sub ++ _) = sub
In the grandchild example, we showed how we can define the inverse of a particular function. We can go further and implement a generic function inversion operator.
inverse :: (val -> val') -> (val' -> val)
inverse fun val' | fun val =:= val' = val where val free
With this operator, we could also implement grandchild as inverse grandfather.
Inverting functions can make our lives a lot easier. Consider the example of parsing. A parser takes a string and returns a syntax tree. Writing a parser directly is a non-trivial task. However, generating a string from a syntax tree is just a simple functional programming exercise. So we can implement a parser in a simple way by writing a converter from syntax trees to strings and inverting it.
We show this for the language of all arithmetic expressions that can be built from addition, multiplication, and integer constants. We first define types for representing abstract syntax trees. These types resemble a grammar that takes precedence into account.
type Expr = Sum
data Sum = Sum Product [Product]
data Product = Product Atom [Atom]
data Atom = Num Int | Para Sum
Now we implement the conversion from abstract syntax trees to strings.
toString :: Expr -> String
toString = sumToString
sumToString :: Sum -> String
sumToString (Sum product products)
= productToString product ++
concatMap ((" + " ++) . productToString) products
productToString :: Product -> String
productToString (Product atom atoms)
= atomToString atom ++
concatMap ((" * " ++) . atomToString) atoms
atomToString :: Atom -> String
atomToString (Num num) = show num
atomToString (Para sum) = "(" ++ sumToString sum ++ ")"
Implementing the parser is now extremely simple.
parse :: String -> Expr
parse = inverse toString
KiCS2 uses a depth-first search strategy by default. However, our parser implementation does not work with depth-first search. So we switch to breadth-first search by entering
:set bfs
at the KiCS2 prompt. Now we can try out the parser by entering parse "2 * (3 + 4)".
Note that our split function is not the same as the split function in Curry’s List module.
Over the past couple of days I've written about how I committed a syntax error on a cron script, and a co-worker had to fix it on Saturday morning. I observed that I should have remembered to check the script for syntax errors before committing it, and several people wrote to point out to me that this is the sort of thing one should automate.
(By the way, please don't try to contact me on Twitter. It won't work. I have been on Twitter Vacation for months and have no current plans to return.)
Git has a “pre-commit hook” feature, which means that you can set up a program that will be run every time you attempt a commit, and which can abort the commit if it doesn't like what it sees. This is the natural place to put an automatic syntax check. Some people suggested that it should be part of the CI system, or even the deployment system, but I don't control those, and anyway it is much better to catch this sort of thing as early as possible. I decided to try to implement a pre-commit hook to check syntax.
Unlike some of the git hooks, the pre-commit hook is very simple to use. It gets run when you try to make a commit, and the commit is aborted if the hook exits with a nonzero status.
I made one mistake right off the bat: I wrote the hook in Bourne shell, even though I swore years ago to stop writing shell scripts. Everything that I want to write in shell should be written in Perl instead or in some equivalently good language like Python. But the sample pre-commit hook was written in shell and when I saw it I went into automatic shell scripting mode and now I have yet another shell script that will have to be replaced with Perl when it gets bigger. I wish I would stop doing this.
Here is the hook, which, I should say up front, I have not yet tried in day-to-day use. The complete and current version is on github.
#!/bin/bash
function typeof () {
filename=$1
case $filename in
*.pl | *.pm) echo perl; exit ;;
esac
line1=$(head -1 $1)
case $line1 in '#!'*perl )
echo perl; exit ;;
esac
}
Some of the sample programs people showed me decided which files needed to be checked based only on the filename. This is not good enough. My most important Perl programs have filenames with no extension. This typeof function decides which set of checks to apply to each file, and the minimal demonstration version here can do that based on the filename or by looking for the #!...perl line in the first line of the file contents. I expect that this function will expand to include other file types; for example
*.py ) echo python; exit ;;
is an obvious next step.
if [ ! -z $COMMIT_OK ]; then
exit 0;
fi
This block is an escape hatch. One day I will want to bypass the hook and make a commit without performing the checks, and then I can COMMIT_OK=1 git commit …. There is actually a --no-verify flag to git-commit that will skip the hook entirely, but I am unlikely to remember it.
(I am also unlikely to remember COMMIT_OK=1. But I know from experience that I will guess that I might have put an escape hatch into the hook. I will also guess that there might be a flag to git-commit that does what I want, but that will seem less likely to be true, so I will look in the hook program first. This will be a good move because my hook is much shorter than the git-commit man page. So I will want the escape hatch, I will look for it in the best place, and I will find it. That is worth two lines of code. Sometimes I feel like the guy in Memento. I have not yet resorted to tattooing COMMIT_OK=1 on my chest.)
exec 1>&2
This redirects the standard output of all subsequent commands to standard error instead. It makes it more convenient to issue error messages with echo and the like. All the output this hook produces is diagnostic, so it is appropriate for it to go to standard error.
allOK=true
badFiles=
for file in $(git diff --cached --name-only | sort) ; do
allOK is true if every file so far has passed its checks. badFiles is a list of files that failed their checks. The git diff --cached --name-only command interrogates the Git index for a list of the files that have been staged for commit.
type=$(typeof "$file")
This invokes the typeof function from above to decide the type of the current file.
BAD=false
When a check discovers that the current file is bad, it will signal this by setting BAD to true.
echo
echo "## Checking file $file (type $type)"
case $type in
perl )
perl -cw $file || BAD=true
[ -x $file ] || { echo "File is not executable"; BAD=true; }
;;
* )
echo "Unknown file type: $file; no checks"
;;
esac
This is the actual checking. To check Python files, we would add a python) … ;; block here. The * ) case is a catchall. The perl checks run perl -cw, which does syntax checking without executing the program. It then checks to make sure the file is executable, which I am sure is a mistake, because these checks are run for .pm files, which are not normally supposed to be executable. But I wanted to test it with more than one kind of check.
if $BAD; then
allOK=false;
badFiles="$badFiles;$file"
fi
done
If the current file was bad, the allOK flag is set false, and the commit will be aborted. The current filename is appended to badFiles for a later report. Bash has array variables but I don't remember how they work and the manual made it sound gross. Already I regret not writing this in a real language.
After the modified files have been checked, the hook exits successfully if they were all okay, and prints a summary if not:
if $allOK; then
exit 0;
else
echo ''
echo '## Aborting commit. Failed checks:'
for file in $(echo $badFiles | tr ';' ' '); do
echo " $file"
done
exit 1;
fi
This hook might be useful, but I don't know yet; as I said, I haven't really tried it. But I can see ahead of time that it has a couple of drawbacks. Of course it needs to be built out with more checks. A minor bug is that I'd like to apply that is-executable check to Perl files that do not end in .pm, but that will be an easy fix.
But it does have one serious problem I don't know how to fix yet. The hook checks the versions of the files that are in the working tree, but not the versions that are actually staged for the commit!
The most obvious problem this might cause is that I might try to commit some files, and then the hook properly fails because the files are broken. Then I fix the files, but forget to add the fixes to the index. But because the hook is looking at the fixed versions in the working tree, the checks pass, and the broken files are committed!
A similar sort of problem, but going the other way, is that I might make several changes to some file, use git add -p to add the part I am ready to commit, but then the commit hook fails, even though the commit would be correct, because the incomplete changes are still in the working tree.
I did a little tinkering with git stash save -k to try to stash the unstaged changes before running the checks, something like this:
git stash save -k "pre-commit stash" || exit 2
trap "git stash pop" EXIT
but I wasn't able to get anything to work reliably. Stashing a modified index has never worked properly for me, perhaps because there is something I don't understand. Maybe I will get it to work in the future. Or maybe I will try a different method; I can think of several offhand:
The hook could copy each file to a temporary file and then run the check on the temporary file. But then the diagnostics emitted by the checks would contain the wrong filenames.
It could move each file out of the way, check out the currently-staged version of the file, check that, and then restore the working tree version. (It can skip this process for files where the staged and working versions are identical.) This is not too complicated, but if it messes up it could catastrophically destroy the unstaged changes in the working tree.
Check out the entire repository and modified index into a fresh working tree and check that, then discard the temporary working tree. This is probably too expensive.
This one is kind of weird. It could temporarily commit the current
index (using --no-verify), stash the working tree changes, and
check the files. When the checks are finished, it would unstash the
working tree changes, use git-reset --soft to undo the temporary
commit, and proceed with the real commit if appropriate.
Come to think of it, this last one suggests a much better version of
the same thing: instead of a pre-commit hook, use a post-commit
hook. The post-commit hook will stash any leftover working tree
changes, check the committed versions of the files, unstash the
changes, and, if the checks failed, undo the commit with git-reset --soft.
Right now the last one looks much the best but perhaps there's something straightforward that I didn't think of yet.
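One more possibility occurs to me: git can print the staged contents of a file directly with git show :path, which sidesteps the stashing problem entirely because the working tree is never touched. Here is a minimal sketch of a hook along those lines, written in Python for illustration; the checker command is a stand-in for whatever checks the hook actually runs.

```python
"""Sketch of a pre-commit hook that checks the *staged* version of each
file rather than the working-tree copy.  `git show :path` prints the
contents of `path` as recorded in the index, so working-tree edits that
were never staged cannot affect the result.  The default checker command
(perl -cw on stdin) is illustrative."""
import subprocess


def staged_files():
    # Files added, copied, or modified in the pending commit.
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True).stdout
    return out.splitlines()


def check_staged(path, checker=("perl", "-cw", "-")):
    # Read the staged (index) version of the file...
    staged = subprocess.run(["git", "show", f":{path}"],
                            capture_output=True, check=True).stdout
    # ...and feed those bytes to the checker on stdin.
    return subprocess.run(checker, input=staged,
                          capture_output=True).returncode == 0


def run_hook():
    bad = [f for f in staged_files() if not check_staged(f)]
    if bad:
        print("## Aborting commit. Failed checks:")
        for f in bad:
            print(f"  {f}")
        return 1
    return 0
```

A real hook script would end by exiting with the value of run_hook(). One drawback remains: checkers that read stdin may report the wrong filename in their diagnostics, so this shares a little of the temporary-file approach's problem.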
[ Thanks to Adam Sjøgren, Jeffrey McClelland, and Jack Vickeridge for discussing this with me. Jeffrey McClelland also suggested that syntax checks could be profitably incorporated as a post-receive hook, which is run on the remote side when new commits are pushed to a remote. I said above that running the checks in the CI process seems too late, but the post-receive hook is earlier and might be just the thing. ]
[ Addendum: Daniel Holz wrote to tell me that the Yelp pre-commit framework handles the worrisome case of unstaged working tree changes. The strategy is different from the ones I suggested above. If I'm reading this correctly, it records the unstaged changes in a patch file, which it sticks somewhere, and then checks out the index. If all the checks succeed, it completes the commit and then tries to apply the patch to restore the working tree changes. The checks in Yelp's framework might modify the staged files, and if they do, the patch might not apply; in this case it rolls back the whole commit. Thank you M. Holz! ]
Let's write a really silly, highly inefficient (my favorite kind!) program that connects to multiple HTTP servers and sends a very simple request. Using the network package, this is really straightforward:
#!/usr/bin/env stack
-- stack --install-ghc --resolver lts-8.0 runghc --package network -- -Wall -Werror
import Control.Monad (forM, forM_)
import Network (PortID (PortNumber), PortNumber, connectTo)
import System.IO (hClose, hPutStrLn)
dests :: [(String, PortNumber)]
dests =
  [ ("localhost", 80)
  , ("localhost", 8080)
  , ("10.0.0.138", 80)
  ]
main :: IO ()
main = do
  handles <- forM dests $ \(host, port) -> connectTo host (PortNumber port)
  forM_ handles $ \h -> hPutStrLn h "GET / HTTP/1.1\r\n\r\n"
  forM_ handles hClose
We have our destinations. We open a connection to each of them, send our data,
and then close the connection. You may have plenty of objections to how I've
written this: we shouldn't be using String, we should be flushing the Handle,
etc. Just ignore that for now. I'm going to run this on my local system, and
get the following output:
$ ./foo.hs
foo.hs: connect: does not exist (Connection refused)
Riddle me this: which of the destinations above did the connection fail for?
Answer: without changing our program, we have no idea. And that's the point of
this blog post: all too often in the Haskell world, we get error messages from
a program without nearly enough information to debug it. Prelude.undefined,
Prelude.read: no parse, and Prelude.head: empty list are all infamous
examples where a nice stack trace would save lots of pain. I'm talking about
something slightly different.
When you throw an exception in your code, whether it be via throwIO,
returning Left, using fail, or using error, please give us some
context. During development, it's a pain to have to dive into the code, add
some trace statements, figure out what the actual problem is, and then remove
the trace statements. When running in production, that extra information can be
the difference between a two-minute operations-level fix (like opening a port
in the firewall) and a multi-hour debugging excursion.
Concretely, here's an example of how I'd recommend collecting more information
from connectTo:
#!/usr/bin/env stack
-- stack --install-ghc --resolver lts-5.10 runghc --package network -- -Wall -Werror
{-# LANGUAGE DeriveDataTypeable #-}
import Control.Exception (Exception, IOException, catch, throwIO)
import Control.Monad (forM, forM_)
import Data.Typeable (Typeable)
import Network (HostName, PortID (PortNumber), PortNumber, connectTo)
import System.IO (Handle, hClose, hPutStrLn)
data ConnectException = ConnectException HostName PortID IOException
  deriving (Show, Typeable)
instance Exception ConnectException
connectTo' :: HostName -> PortID -> IO Handle
connectTo' host port = connectTo host port `catch`
  \e -> throwIO (ConnectException host port e)
dests :: [(String, PortNumber)]
dests =
  [ ("localhost", 80)
  , ("localhost", 8080)
  , ("10.0.0.138", 80)
  ]
main :: IO ()
main = do
  handles <- forM dests $ \(host, port) -> connectTo' host (PortNumber port)
  forM_ handles $ \h -> hPutStrLn h "GET / HTTP/1.1\r\n\r\n"
  forM_ handles hClose
Notice how the ConnectException datatype provides plenty of information about
the context that connectTo' was called from (in fact, all available
information). If I run this program, the problem is immediately obvious:
$ ./bar.hs
bar.hs: ConnectException "localhost" (PortNumber 80) connect: does not exist (Connection refused)
My web server isn't running locally on port 80. My ops team can now go kick the nginx/Warp process or do whatever other magic they need to do to get things running. All without bothering me at 2am :)
You may be thinking that this extra data type declaration is a lot of boilerplate overhead. While it does add some tedium, the benefit of being able to not only catch the exact exception we care about, but also easily extract the relevant context information, can pay off in completely unexpected ways in the future. I highly recommend it.
Since no Haskell blog post about exceptions is complete without it, let me cover some controversy: however you choose to report errors (Left, ExceptT, impure exceptions, etc), be kind to your users and provide this extra context information. A good example of doing this thoroughly is an HttpException type with lots of data constructors.
Also, I left it out for brevity, but including a displayException method in
your Exception instance can allow programs to display much more user-friendly
error messages to end users.
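For what it's worth, the pattern translates directly to other languages. Here is a minimal Python sketch of the same idea as ConnectException above: catch the bare low-level error and rethrow it with the host and port attached. The names are illustrative, not from any particular library.

```python
import socket


class ConnectException(Exception):
    """Wraps a low-level connection error with the context it occurred in."""

    def __init__(self, host, port, cause):
        # The message now names the destination that failed.
        super().__init__(f"ConnectException {host!r} {port}: {cause}")
        self.host = host
        self.port = port
        self.cause = cause


def connect_to(host, port, timeout=5.0):
    # Analogue of connectTo': catch the bare OSError and rethrow it
    # with the host/port context attached.
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError as e:
        raise ConnectException(host, port, e) from e
```

With this wrapper, "Connection refused" arrives labeled with the destination that refused it, and callers can still get at the underlying error via the cause attribute.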
While nothing I've said here is revolutionary, it's a small tweak to a library author's development style that can have a profound impact on users of the library, both at the dev level and those running the executable itself.
Takt is seeking data engineers to help develop our flagship product. Our platform learns and adapts to people's preferences, habits, and feedback—orchestrating highly relevant experiences that are truly unique to each person. Our vision will change the way people engage across multiple industries, be it retail, finance, or healthcare.
We share your passion for using data to solve complex problems. You understand that legacy code is the work you did yesterday. You'll work in small, self-sufficient teams with a common goal: deliver excellent software anchored in an agile culture of quality, delivery, and innovation.
Get information on how to apply for this position.
Takt is seeking Systems Infrastructure Engineers to support the development of our flagship product. Our platform learns and adapts to people's preferences, habits, and feedback—orchestrating highly relevant experiences that are truly unique to each person. Our vision will change the way people engage with their favorite brands across multiple industries, be it retail, finance, or healthcare.
As a Systems and Infrastructure Engineer at Takt you understand that legacy code is the work you did yesterday. You’re well versed in modern technologies and select tools based on what is best for the team, product and organization. If those tools don’t exist, you’ll roll up your sleeves and build them. At Takt “DevOps” isn’t a role, but an approach to collaboration. We work in small, self-sufficient teams with the shared goal of delivering excellent software anchored in an agile culture of quality, delivery, and innovation.
Get information on how to apply for this position.
Yesterday I wrote, in great irritation, about a line of code I had written that contained three errors.
I said:
What can I learn from this? Most obviously, that I should have tested my code before I checked it in.
Afterward, I felt that this was inane, and that the matter required a little more reflection. We do not test every single line of every program we write; in most applications that would be prohibitively expensive, and in this case it would have been excessive.
The change I was making was in the format of the diagnostic that the program emitted as it finished to report how long it had taken to run. This is not an essential feature. If the program does its job properly, it is of no real concern if it incorrectly reports how long it took to run. Two of my errors were in the construction of the message. The third, however, was a syntax error that prevented the program from running at all.
Having reflected on it a little more, I have decided that I am only really upset about the last one, which necessitated an emergency Saturday-morning repair by a co-worker. It was quite acceptable not to notice ahead of time that the report would be wrong, to notice it the following day, and to fix it then. I would have said “oops” and quietly corrected the code without feeling like an ass.
The third problem, however, was serious. And I could have prevented it with a truly minimal amount of effort, just by running:
perl -cw the-script
This would have diagnosed the syntax error, and avoided the main problem at hardly any cost. I think I usually remember to do something like this. Had I done it this time, the modified script would have gone into production, would have run correctly, and then I could have fixed the broken timing calculation on Monday.
In the previous article I showed the test program that I wrote to test the time calculation after the program produced the wrong output. I think it was reasonable to postpone writing this until after program ran and produced the wrong output. (The program's behavior in all other respects was correct and unmodified; it was only its report about its running time that was incorrect.) To have written the test ahead of time might be an excess of caution.
There has to be a tradeoff between cautious preparation and risk. Here I put everything on the side of risk, even though a tiny amount of caution would have eliminated most of the risk. In my haste, I made a bad trade.
[ Addendum 20170216: I am looking into automating the perl -cw check. ]
Back in 2015, there were two proposals made for securing package distribution in Haskell. The Stackage team proposed and implemented a solution using HTTPS and Git, which was then used as the default in Stack. Meanwhile, the Hackage team moved ahead with hackage-security. Over the past few weeks, I've been working on moving Stack over to hackage-security (more on motivation below). The current status of the overall hackage-security roll-out is:
One upside to this is more reliable package index download time. We have had complaints from some firewalled users of slow Git clone time, so this is a good thing. We're still planning on maintaining the Git-based package indices for people using them (to my knowledge they are still being used by Nix, and all-cabal-metadata is still used to power a lot of the information on stackage.org).
However, there's one significant downside I've encountered in the current implementation that I want to discuss.
Quick summary of how hackage-security works: there is a 01-index.tar
file, the contents of which I'll discuss momentarily. This is the file
which is downloaded by Stack/cabal-install when you "update your
index." It is signed by a cryptographic algorithm specified within the
hackage-security project, and whenever a client does an update, it
must verify the signature. In theory, when that signature is verified,
we know that the contents of the 01-index.tar file are unmodified.
Within this file are two (relevant) kinds of files: the .cabal files
for every upload to Hackage (including revisions), and .json files
containing metadata about the package tarballs
themselves. Importantly, this includes a SHA256 checksum and the size
of the tarball. Using these already-validated-to-be-correct JSON
files, we can download and verify a package tarball, even over an
insecure connection.
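The verification step this enables is pleasantly small. A sketch in Python of the check (the function shape is mine; hackage-security's actual schema and wire format differ in detail):

```python
import hashlib


def verify_tarball(tarball_bytes, expected_size, expected_sha256):
    """Check a downloaded package tarball against trusted metadata.

    Because the size and SHA256 come from the already-signature-verified
    01-index.tar, a match means the tarball can be trusted even if it
    was fetched over an insecure connection."""
    # Cheap check first: a size mismatch means we can skip hashing.
    if len(tarball_bytes) != expected_size:
        return False
    digest = hashlib.sha256(tarball_bytes).hexdigest()
    return digest == expected_sha256
```

The size check is not just an optimization: rejecting wrong-sized downloads early also bounds the work an attacker can force a client to do before the hash comparison fails.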
The alternative Git-based approach that the Stackage team proposed has an almost-identical JSON file concept in the all-cabal-hashes repo. Originally, these were generated by downloading tarballs from https://hackage.haskell.org (note the HTTPS). However, a number of months back it became known that the connection between the CDN in front of Hackage and Hackage itself was not TLS-secured, and therefore reliance on HTTPS was not possible. We now rely on the JSON files provided by hackage-security to generate the JSON files used in the Git repo.
With that background, the bug is easy to describe: sometimes the
.json files are missing from the 01-index.tar file. This was
originally opened in April 2016 (for Americans: on tax day no less), and then
I rediscovered the issue three weeks ago when working on Stack.
Over the weekend, another .json file went missing, resulting in
the FP Complete mirror not receiving updates until I
manually updated the list of missing index files.
Due to the inability to securely generate the .json file in the
all-cabal-hashes Git repo without the file existing upstream, that
file is now missing in all-cabal-hashes, causing downstream issues
to the Nix team.
There are a number of outcomes to be aware of from this issue. One of them is
the .json file now missing from all-cabal-hashes. I can't speak to the Nix team's
internal processes, and cannot therefore assess how big an impact that is.
Overall, I'm still very happy that we've moved Stack over to hackage-security.
At work we had this script that was trying to report how long it had
taken to run, and it was using DateTime::Duration:
my $duration = $end_time->subtract_datetime($start_time);
my ( $hours, $minutes, $seconds ) =
$duration->in_units( 'hours', 'minutes', 'seconds' );
log_info "it took $hours hours $minutes minutes and $seconds seconds to run"
This looks plausible, but because DateTime::Duration is shit,
it didn't work. Typical output:
it took 0 hours 263 minutes and 19 seconds to run
I could explain to you why it does this, but it's not worth your time.
I got tired of seeing 0 hours 263 minutes show up in my cron email
every morning, so I went to fix it. Here's what I changed it to:
my $duration = $end_time->subtract_datetime_absolute($start_time)->seconds;
my ( $hours, $minutes, $minutes ) = (int(duration/3600), int($duration/60)%60, $duration%3600);
I was at some pains to get that first line right, because getting
DateTime
to produce a useful time interval value is a tricky
proposition. I did get the first line right. But the second line is
just simple arithmetic, I have written it several times before, so I
dashed it off, and it contains a syntax error, that duration/3600 is
missing its dollar sign, which caused the cron job to crash the next
day.
A co-worker got there before I did and fixed it for me. While he was
there he also fixed the $hours, $minutes, $minutes that should have
been $hours, $minutes, $seconds.
I came in this morning and looked at the cron mail and it said
it took 4 hours 23 minutes and 1399 seconds to run
so I went back to fix the third error, which is that $duration%3600
should have been $duration%60. The thrice-corrected line has
my ( $hours, $minutes, $seconds ) = (int($duration/3600), int($duration/60)%60, $duration%60);
What can I learn from this? Most obviously, that I should have tested my code before I checked it in. Back in 2013 I wrote:
Usually I like to draw some larger lesson from this sort of thing. … “Just write the tests, fool!”
This was a “just write the tests, fool!” moment if ever there was one. Madame Experience runs an expensive school, but fools will learn in no other.
I am not completely incorrigible. I did at least test the fixed code before I checked that in. The test program looks like this:
sub dur {
my $duration = shift;
my ($hours, $minutes, $seconds ) = (int($duration/3600), int($duration/60)%60, $duration%60);
sprintf "%d:%02d:%02d", $hours, $minutes, $seconds;
}
use Test::More;
is(dur(0), "0:00:00");
is(dur(1), "0:00:01");
is(dur(59), "0:00:59");
is(dur(60), "0:01:00");
is(dur(62), "0:01:02");
is(dur(122), "0:02:02");
is(dur(3599), "0:59:59");
is(dur(3600), "1:00:00");
is(dur(10000), "2:46:40");
done_testing();
It was not necessary to commit the test program, but it was necessary to write it and to run it. By the way, the test program failed the first two times I ran it.
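Incidentally, the error-prone part of that one-liner is the mixed /3600, /60%60, %60 arithmetic. A chain of divmod calls, sketched here in Python for comparison, computes the same thing with only the number 60 appearing, which leaves less room for exactly the %3600-for-%60 slip:

```python
def dur(seconds):
    """Format a number of seconds as H:MM:SS, as in the Perl test above."""
    minutes, secs = divmod(seconds, 60)   # split off the trailing seconds
    hours, mins = divmod(minutes, 60)     # split the remaining minutes into hours
    return f"{hours}:{mins:02d}:{secs:02d}"
```

For example, dur(10000) gives "2:46:40", agreeing with the Test::More cases.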
Three errors in one line isn't even a personal worst. In 2012 I posted here about getting four errors into a one-line program.
[ Addendum 20170215: I have some further thoughts on this. ]