1, 2, 3, Go!

OK, the title is an easy joke, but that's what happens when you give a silly name to a language… So I've started looking at Golang, since it seems to be the newish (ok, ok, not that new) kid on the block. And in fact the name is not an issue: I usually type "golang" to prefix my searches on the web and it works pretty well.

- I installed Go, and the first striking thing is the instruction to read "how to write Go code". Having one way to structure the code (one workspace containing source control repositories, each containing packages) makes it very easy to get started - compared, say, to Java, where different IDEs may have different approaches to what a "project" is. I wonder how this structure works when you write Go professionally and want to distinguish open-source dependencies from proprietary code. Also, coming from Java, where you're used to structuring classes into packages, seeing Go repositories with a huge number of files inside one folder makes me shudder, but we'll see over time.
- I started, of course, with the Tour of Go. In a couple of evenings I went through it, which shows how small and approachable the base language is. Encouraging!
- I then got my hands dirty by writing a little app that hooks together JSON web services, Cassandra, Elasticsearch and Docker. It was fairly easy to find libraries and put them to use. I'll probably push that code to GitHub at some stage, after some cleaning up.

So Go delivers on its promise of being a language that is easy to pick up and get started with. You can go from zero to productive in a few hours.

These are the features that stood out for me:

- Lack of generics. It's a trope in the computing world: "Go doesn't have generics, ha ha ha". But straight away you see that arrays/slices and maps are typed, so the basic use case for generics (type safety in collections) is already taken care of. You also have functions as first-class citizens and interfaces, so there are plenty of techniques to work around the lack of generics, and I'm not sure it's such a huge problem.
- Interfaces. It's interesting that there is no explicit declaration that a type implements an interface. If you define a Duck interface with a Quack method, every type that serves as the receiver for an implementation of a Quack function is considered a Duck. This is nice, but tooling support to find all the types that implement a given interface will be a must ("OK, this function expects anything implementing Duck; what implementation can I use?").
- Pointers. I'm at the core a Java developer, so the idea of pointers makes me a bit uneasy. But I breathed easier when I saw "no pointer arithmetic". In fact, pointers in Go just make explicit whether you pass a struct by value or by reference. I feel this adds a level of clarity that is worth the extra syntax and the initial repulsion the word "pointer" may awaken in some.
- Errors. Go recommends a style where a function that may fail returns a pair: the normal value and an error. The calling code can decide to ignore the error or deal with it. This is reminiscent of Either in Haskell. I'm still not sold that this is a better system than checked exceptions (though I seem to be the only one who likes checked exceptions), but I'll need more practice before I make up my mind.
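For comparison, the Either analogy can be sketched in a few lines of Haskell (divide is a made-up example, not from any particular library):

```haskell
-- A fallible function returns Left on error, Right on success,
-- much like Go's (value, err) pair -- but the type forces a choice.
divide :: Int -> Int -> Either String Int
divide _ 0 = Left "division by zero"
divide x y = Right (x `div` y)
```

Unlike Go's convention, the compiler will not let you touch the result without acknowledging the error case.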

So all in all I like a lot of the design decisions that were made in Go, and it certainly makes for a pleasant programming experience. I ran into a few gotchas that I may cover in later posts, but nothing too stressful. So far, I have to say I enjoyed the experience!

Happy Go Hacking!

Validating Big Data Jobs - Stopping Failures before Production (w/ Spark, BEAM, & friends!) -- now with Scooters @ Scala eXchange 2018

Come join me on Friday 14 December at 16:00 at Scala eXchange 2018 in London, UK for Validating Big Data Jobs - Stopping Failures before Production (w/ Spark, BEAM, & friends!) -- now with Scooters. I'll update this post with the slides soon. Come see the talk or comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

Holden @ Kiwi Code Mania: talk title TBD @ Code Mania

Come join me on Wednesday 15 May 2019 at Code Mania 2019 in Auckland, New Zealand for Holden @ Kiwi Code Mania: talk title TBD. I'll update this post with the slides soon. Come see the talk or comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

Thoughts on bootstrapping GHC

I am returning from the Reproducible Builds Summit 2018 in Paris. The latest, hottest thing within the reproducible-builds project seems to be bootstrapping: how can we build a whole operating system from nothing but source code, using very little, or even no, binary seeds or auto-generated files? This is a concern that is somewhat orthogonal to reproducibility: bootstrappable builds help me trust programs that I built, while reproducible builds help me trust programs that others built.

And while they are making good progress bootstrapping a full system from just a C compiler written in Scheme and a Scheme interpreter written in C that can build each other (Janneke’s mes project), with plans to build that on top of stage0, which starts with just 280 bytes of binary, the situation looks pretty bad when it comes to Haskell.

Unreachable GHC

The problem is that contemporary Haskell has only one viable implementation, GHC. And GHC, written in contemporary Haskell, needs GHC to be built. So essentially everybody out there either just downloads a binary distribution of GHC, or builds GHC from source using a possibly older (but not much older) version of GHC that they already have. Even distributions like Debian do nothing different: when they build the GHC package, the builders use, well, the GHC package.

There are other Haskell implementations out there. But those that are mature and actively developed are implemented in Haskell themselves, often even using advanced features that only GHC provides. And even those are insufficient to build GHC itself, let alone some old and abandoned Haskell implementations.

In all these cases, at some point an untrusted binary is used. This is very unsatisfying. What can we do? I don’t have the answers, but please allow me to outline some avenues of attack.

Retracing history

Obviously, even GHC has not existed since the beginning of time, and the first versions surely were built using something other than GHC. The oldest version of GHC for which we can find a release on the GHC web page is version 0.29 from July 1996. But the installation instructions say:

GHC 0.26 doesn't build with HBC. (It could, but we haven't put in the effort to maintain it.)

GHC 0.26 is best built with itself, GHC 0.26. We heartily recommend it. GHC 0.26 can certainly be built with GHC 0.23 or 0.24, and with some earlier versions, with some effort.

GHC has never been built with compilers other than GHC and HBC.

So it seems that besides GHC, only HBC was ever used to compile GHC. HBC is a Haskell compiler of which we can find the sources of only one random version, thanks to archive.org. Parts of it are written in C, so I looked into this: compile HBC, use it to compile GHC 0.29, and then, step by step, build every (major) version of GHC until today.

The problem is that it is non-trivial to build software from the 90s using today's compilers. I briefly looked at the HBC code base and had to change some files from using varargs.h to stdarg.h, and this is surely just one of many similar stumbling blocks in trying to build those tools. Oh, and the hbc sources even state:

# To get everything done: make universe
# It is impossible to make from scratch.
# You must have a running lmlc, to
# recompile it (of course).

So I learned that actually most of it is written in LML, and the LML compiler is written in LML. So this is a dead end. (Thanks to Lennart for clearing up a misunderstanding on my side here.)

Going back, but doing it differently

Another approach is to go back in time, to some old version of GHC, but maybe not all the way to the beginning, and then try to use another, officially unsupported, Haskell compiler to build GHC. This is what rekado tried to do in 2017: he used the most contemporary implementation of Haskell in C, the Hugs interpreter. Using this, he compiled nhc98 (yet another abandoned Haskell implementation), with the hope of building GHC with nhc98. He made impressive progress back then, but ran into a problem where the runtime crashed. Maybe someone is interested in picking up from there?

Removing, simplifying, extending, in the present.

Both approaches so far focus on building an old version of GHC. This adds complexity: other tools (the shell, make, yacc etc.) may behave differently now, in ways that cause hard-to-debug problems. So maybe it is more fun and more rewarding to focus on today’s GHC? (At this point I am starting to hypothesize.)

I said before that no other existing Haskell implementation can compile today’s GHC code base, because of features like mutually recursive modules, the foreign function interface etc. Also, other existing Haskell implementations often come with a different, smaller set of standard libraries, while GHC assumes base, so we would have to build that as well...

But we don’t need to build it all. Surely there is much code in base that is not used by GHC, and much code in GHC that is not needed to build GHC itself. By removing that, we reduce the amount of Haskell code that we need to feed to the other implementation.

The remaining code might use some features that are not supported by our bootstrapping implementation. Mutually recursive modules could be manually merged. GADTs that are only used for additional type safety could be replaced by normal ones, which might make some pattern matches incomplete. Syntactic sugar can be desugared. By simplifying the code base in this way, one might be able to produce a fork of GHC that is within reach of the likes of Hugs or nhc98.
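As a hypothetical illustration of the GADT step (the Expr types below are invented for this sketch, not taken from GHC’s code base):

```haskell
{-# LANGUAGE GADTs #-}

-- A GADT used only for extra type safety...
data Expr a where
  IntE  :: Int  -> Expr Int
  BoolE :: Bool -> Expr Bool

-- ...replaced by a plain ADT that a simpler bootstrapping compiler
-- could handle. Pattern matches that were total on the GADT may
-- now be incomplete, so we make the failure case explicit.
data Expr' = IntE' Int | BoolE' Bool deriving (Eq, Show)

evalInt :: Expr' -> Maybe Int
evalInt (IntE' n) = Just n
evalInt _         = Nothing
```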

And if there are features that are hard to remove, maybe we can extend the bootstrapping compiler or interpreter to support them? For example, it was mostly trivial to extend Hugs with support for the # symbol in names – and we can be pragmatic and just allow it always, since we don’t need a standards-conforming implementation, but merely one that works on the GHC code base. But how much would we have to implement? Probably this will be more fun in Haskell than in C, so maybe extending nhc98 would be more viable?

Or maybe it is time to create a new Haskell compiler from scratch, written in something other than Haskell? Maybe some other language that is reasonably pleasant to write a compiler in (OCaml? Scala?), but that has the bootstrappability story already sorted out somehow.

But in the end, all variants come down to the same problem: Writing a Haskell compiler for full, contemporary Haskell as used by GHC is hard and really a lot of work – if it were not, there would at least be implementations in Haskell out there. And as long as nobody comes along and does that work, I fear that we will continue to be unable to build our nice Haskell ecosystem from scratch. Which I find somewhat dissatisfying.

Michael Snoyman

I passed around a version of this document for initial feedback. I’m now proposing this publicly. I say this below, but I want to reiterate here: I’m only interested in doing this if there’s real demand and interest, as well as others to participate. If an improved Commercial Haskell is something with little interest, I’ll drop the topic completely.

Premise

We started the Commercial Haskell Special Interest Group (SIG) as an informal organization for those interested in using Haskell in a commercial setting. So far, it has provided a few concrete activities:

• Ability to sign up as an individual or a company to be “part of” the group
• A central place for some documentation, which honestly hasn’t taken off too significantly
• A Github organization for placing some community projects, like Stack, Stackage, and others

The premise of this proposal is: can and should we do more? Are there additional activities that the Commercial Haskell SIG should be responsible for? Should the decision making process and meaning of membership be clarified?

And then, most importantly, is this something that others are interested in participating in? If this is just a group with a handful of active members, there’s no reason to formalize things at all.

Possible goals

First of all, I believe we should be defining clearly what the goal of the Commercial Haskell SIG is, and some potential subgoals. As a headline, I think the overriding goal should be:

I can think of some concrete subgoals which are in line with the above.

• Encourage more serious consideration of security vulnerabilities in the Haskell ecosystem, such as:
• Provide forums for commercial users to discuss issues and collaborate on solutions
• Put together marketing-style material for Haskell in industry
• Increase the set of documentation provided by Commercial Haskell, and probably host on commercialhaskell.com
• One easy, concrete, and hopefully beneficial step for the ecosystem: I propose to remove haskell-lang.org, and instead put its documentation on commercialhaskell.com
• Concretely, this will hopefully help rectify a mistake I made, and provide a clear vision for why this site exists and is separate from haskell.org (providing opinionated, commercially-geared documentation, without needing to distance from haskell.org)
• Establish and encourage a Code of Conduct. This has been an ongoing discussion across the Haskell ecosystem.

Possible problems

I’ve debated this off-and-on for a while now. Here’s a brain dump of potential hurdles:

• “Yet another committee” overhead. We have enough of these already, and they are not well loved in general.
• If we establish clear committee rules, they may be open to abuse. This would depend on organizational structure and how we determine voting occurs.
• There’s a non-trivial time and energy investment in this, and I personally don’t have a lot of either right now. This will only work if others are truly engaged.
• Maybe no one cares about any of this, and formalizing things is strictly worse than the status quo.
• Maybe there’s not alignment on what the Commercial Haskell SIG should be doing, and all we’ll be doing is creating a new public fight.
• Maybe some people who are signed up on the repo don’t believe in our goals, and will object to any clarification of direction.

Organizational structure

Some basic ideas; I haven’t given this much thought yet.

Option A: Typical committee, a group of 7-ish people, privately choose members, no real accountability

Option B:

• Individual initiatives have a single owner or small group of cooperating owners
• Approved list of members of Commercial Haskell SIG
• Question: how do we bootstrap that list?
• What is the criterion for membership? Individuals? Companies?
• If there’s an objection to actions by the owner, can take a vote to override
• Simple majority? 60%? Two-thirds?

Option C (proposed by a reviewer): A Benevolent Dictator for Life (BDFL) model

Discussion

I’ve created a Github issue for discussion, and will be participating there. Discussion is of course free to occur elsewhere, but I’m going to focus my energies on that Github issue. For real-time chat (if relevant), the commercialhaskell Gitter channel is a great place. If there’s anything sensitive people want to raise, I’m also available for private communications.

GHC: From Bug to Merge

Summary: The story of a bug, from report, proposal, patch, revisions, to merge.

A while ago I found a bug in GHC. Like all good bugs, this one starts with playing around, in this case seeing if it was possible to eliminate String from a base-like library. There are two interesting extensions:

• OverloadedStrings changes each string literal to be wrapped in fromString, so "hello" in the source code becomes fromString "hello" when desugared.
• RebindableSyntax means that when GHC sees fromString or fail (or lots of other functions) it uses the version in scope, rather than the version in the base library.

Taking a look at the code:

foo x = do
    Just y <- x
    return y

In GHC 8.6 it desugars to something equivalent to:

foo x = x >>= \v ->
    case v of
        Just y -> return y
        _ -> fail ("pattern match error" :: String)

In particular, it does not insert a fromString around the generated "pattern match error". If you are using the built-in fail, that's fine, since the built-in fail wants a String. But if we define a custom fail function which doesn't use String, but which takes the output of fromString, then we can get type errors with the definitions:

newtype Text = Text String
fail (Text x) = error x
fromString = Text

The solution is fairly simple - desugar adding a fromString - which is what GHC 8.8 will do. This post is the story of getting that change integrated.
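To see why the fromString wrapping matters, here is a minimal sketch (the type T is invented for illustration) of a non-String text type that can absorb the generated message once it goes through fromString:

```haskell
import Data.String (IsString (..))

-- A hypothetical non-String text type, in the spirit of the post's example.
newtype T = T String deriving (Eq, Show)

instance IsString T where
  fromString = T

-- With the GHC 8.8 behaviour, the generated literal is wrapped in
-- fromString, so a custom fail expecting T now type-checks:
msg :: T
msg = fromString "pattern match error"
```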

Raise a GHC ticket

The first step after finding a bug is to raise a GHC ticket, which I did as Trac 15645, also offering to fix it. After discussing a bit with the GHC developers, we came to the conclusion that the essential property of the bug is:

Overloaded strings should imply even generated strings are overloaded, if they are passed to user-controlled functions.

I thought (and still think) that this ticket is a pure bug fix, partly because fixing it does not require changing the user manual in any way. However, GHC developers quite reasonably disagreed, and suggested I go through the GHC proposal process.

Raise a GHC proposal

The GHC Proposal process involves creating a proposal, having people discuss it, and then passing it via a committee. At this point I got Shayne Fletcher involved, who actually ran with most of the steps from here onwards. We raised GHC Proposal 168, setting out the case for why the change should be made. We @-mentioned people on GitHub and posted to Twitter to solicit opposing views, but only 4 people made any comment, and none of them disagreed. We submitted to the committee after the discussion was verifiably dead (not that it was ever super alive), and had the proposal accepted.

Write the code

After having the proposal accepted the hard work of writing the code began as Phab D5251. The code itself wasn't simple, but neither was it very complex. An initial version was reviewed with lots of good suggestions from Simon Peyton Jones. Shayne made some revisions, including adding a regression test, and the patch was ready. Sometime later the code was merged to GHC master.

Timeline

The whole process took a long time - 14 Sep to 12 Dec (fortunately all 2018), so nearly 3 months. That timeline was significantly extended because GHC is in the process of changing hosting for their code, which both makes it harder to merge stuff and involves the people who would normally be merging stuff doing other work. However, I thought it instructive to show where the time went.

• 14 Sep - 14 Sep: Open bug and discuss it
• 14 Sep - 17 Sep: Writing the proposal
• 17 Sep - 20 Sep: Discussing the proposal
• 20 Sep - 24 Sep: Confirming the discussion had died down
• 24 Sep - 21 Oct: Waiting for the committee decision
• 21 Oct - 22 Oct: Opening the Phab ticket
• 22 Oct - 25 Oct: Discussion on the code review
• 29 Oct - 31 Oct: Additional review and discussion
• 1 Nov - 26 Nov: Waiting for people to give a final approval
• 26 Nov - 11 Dec: Waiting to merge in

I think the interesting thing is that of the 3 months, over 2 months was spent waiting for confirmations of a mechanical nature. However, every time real in-depth thoughtful discussion was required (code review, discussing the bug) it happened almost immediately.

FP Complete's opinion

I see an attitude expressed on forums often. This has happened for years, and I usually ignore it. But I’m responding to it today because I think the opinion is negative for the Haskell community and—hitting a bit closer to home for me—negatively impacts my coworkers.

I’m an opinionated person. In my experience, most people who end up writing Haskell are to some extent. I’m not afraid to share my opinions on lots of topics, and I think that’s healthy. Most people who read my blog or Twitter feed know my opinions on language design, API design, build tool design, nutrition, and exercise, for example.

Here’s the two sentence summary of everything I’m going to say in this blog post:

• Just because I believe something does not make it FP Complete’s belief.
• Just because I believe something does not mean that everyone at FP Complete believes it.

I think I’ve interviewed every engineer who works for FP Complete. When interviewing, I do look for certain criteria. For example, if I’m hiring for a Haskell development position, I look for people who like and know Haskell. That’s the closest thing to an “FP Complete opinion” you can find: things which are intrinsic to the job. I’ve never tried to hire someone who has identical opinions to me on all topics. Not only would that be impossible to achieve, but it would be bad for business: we want a diversity of opinions, not a monoculture.

Someone on our sales team recently brought a question to the engineering team: “Why does FP Complete hate Nix?” He showed me the comment in question. I won’t point to the comment; it’s neither informative nor unique. Here’s the answer: FP Complete has no emotions. It’s a company. We work on projects, and we have people who do that work. I don’t think anyone on our team “hates” Nix. I certainly don’t, even if I’m usually one of the people on the team saying we shouldn’t use Nix for most projects (I can explain why that happens separately, but it’s off topic here). However:

We have technical disagreements on our team. You’ve probably seen members of the FP Complete team espousing those differing viewpoints publicly. We regularly have technical debates about how to approach things. This isn’t just in the realm of Haskell. We debate DevOps tooling, front end development, and even business objectives.

When someone joins the FP Complete team, they are agreeing to fulfill the terms of their contract: do the work we ask from them, contribute to our internal technical discussions, and so on. There is no pledge of obedience to the “FP Complete will.” Because it doesn’t exist.

I feel almost silly writing up this blog post, because everything I’m saying should be automatic. However, it seems like something like this needed to be said.

If you want to criticize something I’ve said, go for it. I’ve proven over the years I’m always up for a technical debate. But couching it as “FP Complete believes X” is lazy (in the bad way) and wrong. That goes for generalizing from statements of other members of the team too.

New user empathy

This blog post will focus on the situation for Haskell, though the ideas likely generalize well to other languages, or even to non-programming disciplines. I’m simply speaking to the topic I have the most experience with.

Many people will make claims that a certain decision needs to be made “for new users.” These recommendations can range anywhere from sensible, to confusing, to insulting. New users are thought of anywhere from smart people who need decent information provided, to individuals hungry for knowledge, to idiots who don’t know any better.

I’ve discussed this situation over time with people, and have spent significant time personally and professionally looking for ways to improve the situation for “new users.” And I believe that, by far, the most important trait we need to do this is empathy. To quote Wikipedia:

Empathy is the capacity to understand or feel what another person is experiencing from within their frame of reference

Let me start off with the most extreme example of where we, as Haskellers, have failed historically: monad tutorials.

Haskellers will occasionally glibly state “a monad is just a monoid in the category of endofunctors, what’s the problem.” I strongly believe the vast majority, if not all cases, of this statement are meant as self-mockery. But let’s pretend for a moment that this information was stated in this tone to a new user, who has never seen monads, monoids, categories, or endofunctors before. This massively fails the empathy test:

• The new user is unaware of these terms, and cannot understand them without further information
• Their frame of reference is probably “I heard that monads are necessary to do I/O in Haskell, why is that?” The answer provides obfuscation in place of elucidation
• The tone, especially “just” and “what’s the problem,” is dismissive and derisive. If I’d heard a statement like that, I’d probably want to walk away from Haskell

Note that none of these statements require us to believe anything negative about a new user. I’ve not implied that they are unintelligent, lazy, weak-willed, or unsuitable to Haskell. My mental model of their frame of reference is fairly straightforward here: they have not been exposed to this material previously, they’re trying to get an answer to a specific question, and they don’t want to be made to feel inferior.

The monad tutorial fallacy in general (aka “monads are like burritos”) is a more subtle form of this. The author of such a tutorial, in their mind, has decided that monads are somehow like burritos. This is ignoring the frame of reference of the reader, who likely has no reason to believe monads actually are like burritos, and cannot understand the analogy. Sharing the analogy as if it will help, when in reality it doesn’t, can make a reader feel frustrated, confused, and inferior, and prevent them from wanting to go further. Again, this is not a failing in the new user!

My mental model for a new user

I’ve built up a mental model for how I believe a new user to Haskell will typically be thinking. This is by no means universal; many people are exceptions. But I’ve found that the following traits generally apply to someone trying out Haskell for the first time. And making these assumptions about new users who don’t fit this description perfectly doesn’t usually cause any kind of major negative impact:

• Skeptical: they don’t yet know if Haskell will live up to the hype and actually solve their problems
• Impatient: there are dozens of languages that they could be investigating, and they want to quickly figure out if Haskell is one of the select few worth spending more time on
• Curious: few people pick Haskell as a language to look into without some level of curiosity about what makes this language so cool
• Smart, but distracted: I assume that, despite this curiosity, impatience may win out, and new users will probably be juggling a few other things concurrently. Maybe they’re testing out other languages, or have some deadline at work, or who knows what.
• Eager to succeed: most people don’t want to fail. They want their efforts to pay off and be rewarded.

There are others attributes too, most of which I probably haven’t fully identified even to myself. But this is a decent starting point.

Exceptions

Regarding exceptions to this rule: some people are not skeptical. Maybe they’ve been a Scala developer for years, have seen Haskell at conferences for a long time, and are totally bought into the language before they even start. Great! Treating them as skeptical initially may involve giving them some cool motivating cases for Haskell (my personal favorite being something like STM and the Concurrently type). Providing this extra information can, at worst, result in them telling me “hey, I get it, can we just skip to the part where you teach me something?”

The same applies with impatient. I optimize my Haskell teaching to get something working quickly so that people know the investment has a good chance of paying off. But maybe someone has decided that, come hell or high water, they’re going to spend the next 2 weeks learning all the ins and outs of Haskell. They want to learn the Lambda calculus from the ground up, understand some type theory, and then write Hello World. Great! In those cases, they can easily skip my “concurrent HTTP downloads in 5 minutes” tutorial. The other way around—skipping the “the Lambda calculus for math majors”—happens less frequently.

As a side note, and just to prove the point: my wife has started learning Haskell (though our home renovations have recently gotten in the way of that). She fulfills neither of these criteria: she’s watched me using Haskell for 10 years, and isn’t at all skeptical of the language. And she’s learning the language without any specific impatience to get things working. To my knowledge, my standard mental modeling and new user onboarding wouldn’t have negatively impacted her acquisition of Haskell, despite this mismatch.

Practical outcomes

Alright, so given that I’ve built up this mental model, how does this affect how I try to onboard new users?

1. Ensure there’s a clear start location to point them to
2. Reduce the number of choices they need to make to get started
3. Include a compelling example from the beginning that will get them interested
4. Include exercises they can work on to keep that curiosity high, but don’t make them too complicated at first
5. Provide links to go “off the beaten track” if someone has drastically different goals than provided

New users are different!

This is vital to keep in mind. When I started using Haskell, Hackage had barely started, cabal-install didn’t really exist at all, there were barely any libraries worth using, and there was almost no educational material. I was absolutely a “new user,” but I didn’t fit the mental model described above at all. Making any assumption about what other new users are willing to go through based on my own experience would be wrong.

What are our goals?

As I’ve recently stated, my goal is to “increase the adoption of Haskell.” For me, it’s vital to improve the new user experience significantly, and this drives a lot of what I work on in Haskell. Not everyone shares that goal, and that’s fine. But identifying these differences can help us within the Haskell community to have more constructive discussions about how to proceed.

One example of such a discussion (that I was not really involved in) was the Foldable Traversable Proposal (FTP) debate. There was a lot of discussion about “new users.” I strongly believe that different parties had wildly different ideas of who these new users were, and what their frame of reference was. If we had the language and perspective to openly discuss these differences, I think we could have had a more meaningful discussion.

The same applies to many discussions I am involved in, such as tooling. You can trace many of the more opinionated pieces of Stack to the mental model I’ve described above. For example, on the impatient side, Stack is optimized for a use case of impatience: you download a Stack executable, and it will automatically handle all of the other tooling installation you need. It may not do it in exactly the way every user wants (e.g. shared vs user-local GHC installations), but that’s not the goal.

Similarly, it’s optimized for skeptical users. From personal experience, running into dependency hell out of the gate is a major turnoff for skeptical users. Ensuring that the default of the tool is to use pretested Stackage snapshots knocks down this major obstacle to skeptical users. This also plays into “smart, but distracted”: from experience, new users won’t spend time reading all of the detailed instructions you put on your site. They’re impatient to get started quickly, distracted by 15 other tabs, and will become frustrated when line 27 of your instructions would have fixed the problem they ran into, but they just happened to miss it.

Is this a good way to do JSON validation?

At work, where we use Clojure, we’ve been improving our error messages in the public API to

1. return as many errors as possible in a response, and
2. be in humanly readable English.

If one adopts spec, as we have, one gets the former for free, but the output of spec can hardly be called humanly readable. For the latter part we chose to use phrase.

Given that I’d like to see Haskell used more at work (currently there are 2 minor services written in Haskell and around a score in Clojure) I thought I’d take a look at JSON validation in Haskell. I ended up being less than impressed. We have at least one great library for parsing JSON, aeson, and there are probably a few more that I haven’t noticed. It’s of course possible to mix validation into the parsing, but parsers, and this is true for aeson’s parser too, tend to be monads, and that means that item 1 above, finding as many errors as possible, isn’t on the table.
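The short-circuiting problem isn’t specific to Haskell’s parsers. As a language-neutral illustration (a sketch in Rust with made-up check functions, not taken from any of the libraries mentioned), a monadic-style pipeline stops at the first failure, while accumulation has to collect every error explicitly:

```rust
// Sketch: short-circuiting (monadic) vs. accumulating validation.
fn check_name(s: &str) -> Result<(), String> {
    if s.is_empty() { Err("name is empty".into()) } else { Ok(()) }
}

fn check_age(age: i64) -> Result<(), String> {
    if age < 0 { Err("age is negative".into()) } else { Ok(()) }
}

// Short-circuits: only the first failure is ever reported.
fn validate_short_circuit(name: &str, age: i64) -> Result<(), String> {
    check_name(name)?; // stops here on failure; check_age never runs
    check_age(age)?;
    Ok(())
}

// Accumulates: all failures are reported.
fn validate_accumulate(name: &str, age: i64) -> Result<(), Vec<String>> {
    let errs: Vec<String> = vec![check_name(name), check_age(age)]
        .into_iter()
        .filter_map(|r| r.err())
        .collect();
    if errs.is_empty() { Ok(()) } else { Err(errs) }
}

fn main() {
    println!("{:?}", validate_short_circuit("", -1)); // one error
    println!("{:?}", validate_accumulate("", -1));    // both errors
}
```

The second function is exactly the “as many errors as possible” behaviour the post is after.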

A quick look at Hackage showed that:

- there is a package called aeson-better-errors that looked promising but didn’t fit my needs (I explain at the end why it isn’t passing muster)
- the support for JSON Schema in Haskell is lacking: hjsonschema is deprecated, and aeson-schema only supports version 3 of the draft (the current version is 7), and the authors claim that hjsonschema is more modern and more actively maintained

So, a bit disappointed, I started playing with the problem myself and found that, just as stated in the description of the validation library, I want something that’s isomorphic to Either but accumulates on the error side. That is, something like:

```haskell
data JSONValidationResult
  = JVRInvalid [JSONValidationFailure]
  | JVRValid
  deriving (Eq, Show)

instance Semigroup JSONValidationResult where
  (JVRInvalid es0) <> (JVRInvalid es1) = JVRInvalid $ es0 <> es1
  JVRValid <> r = r
  r <> JVRValid = r
```

I decided it was all right to limit validation to proper JSON expressions, i.e. a validator could have the type Value -> JSONValidationResult. I want to combine validators, so I decided to wrap them in a newtype and write a Semigroup instance for it as well:

```haskell
newtype JSONValidator = JV (A.Value -> JSONValidationResult)

instance Semigroup JSONValidator where
  (JV v0) <> (JV v1) = JV $ \val -> v0 val <> v1 val
```

The function to actually run the validation is rather straightforward:

```haskell
runJSONValidator (JV validator) val = validator val
```

After writing a few validators I realised a few patterns emerged and the following functions simplified things a bit:

```haskell
mapInvalid _ JVRValid = JVRValid
mapInvalid f (JVRInvalid es) = JVRInvalid $ map f es

valid = JVRValid
invalid s = JVRInvalid [JVFDesc s]
```

With this in place I started writing validators for the basic JSON types:

```haskell
isNumber = JV go
  where
    go (A.Number _) = valid
    go _ = invalid "not a number"

isString = JV go
  where
    go (A.String _) = valid
    go _ = invalid "not a string"

isBool = JV go
  where
    go (A.Bool _) = valid
    go _ = invalid "not a bool"

isNull = JV go
  where
    go A.Null = valid
    go _ = invalid "not 'null'"
```

The number type in JSON is a float (well, in aeson it’s a Scientific), so to check for an integer a bit more than the above is needed:

```haskell
isInt = JV go
  where
    go (A.Number i) =
      if i == fromInteger (round i)
        then valid
        else invalid "not an integer"
    go _ = invalid "not an integer"
```

as well as functions that check for the presence of a specific key:

```haskell
reqKey n v = JV go
  where
    go (A.Object obj) =
      case HM.lookup n obj of
        Nothing -> invalid $ "required key '" <> n <> "' is missing"
        Just val -> mapInvalid (JVFPath n) $ runJSONValidator v val
    go _ = invalid "not an object"

optKey n v = JV go
  where
    go (A.Object obj) =
      case HM.lookup n obj of
        Nothing -> valid
        Just val -> mapInvalid (JVFPath n) $ runJSONValidator v val
    go _ = invalid "not an object"
```

With this in place I can now create a validator for a person with a name and an age:

```haskell
vPerson = reqKey "name" isString <>
          reqKey "age" isInt
```

and run it on a Value:

```
*> runJSONValidator vPerson <$> (decode "{\"name\": \"Alice\", \"age\": 32}" :: Maybe Value)
Just JVRValid
```

and all failures are picked up:

```
*> runJSONValidator vPerson <$> (decode "{\"name\": \"Alice\", \"age\": \"foo\"}" :: Maybe Value)
Just (JVRInvalid [JVFPath "age" (JVFDesc "not an integer")])

*> runJSONValidator vPerson <$> (decode "{\"name\": \"Alice\"}" :: Maybe Value)
Just (JVRInvalid [JVFDesc "required key 'age' is missing"])

*> runJSONValidator vPerson <$> (decode "{\"nam\": \"Alice\"}" :: Maybe Value)
Just (JVRInvalid [JVFDesc "required key 'name' is missing",JVFDesc "required key 'age' is missing"])
```

Reflections

1. I quickly realised I wanted slightly more complex validation of course, so all the validators for basic JSON types above have a version taking a custom validator of type a -> JSONValidationResult (where a is the Haskell type contained in the particular Value).
2. I started out thinking that I wanted an Applicative for my validations, but slowly I relaxed that to Semigroup. I’m still not sure about this decision, because I can see a real use for or, which I don’t really have now. Maybe that means I should switch back towards Applicative, just so I can implement an Alternative instance for validators.
3. Well, I simply don’t know if this is even a good way to implement validators. I’d love to hear suggestions both for improvements and for completely different ways of tackling the problems.
4. I would love to find out that there already is a library that does all this in a much better way. Please point me in its direction!
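As a cross-language sanity check on the accumulate-on-the-error-side design (a sketch I wrote for illustration, with invented names, not from any library), the same Semigroup-style combine is easy to express elsewhere; in Rust, Valid acts as the identity and Invalid accumulates:

```rust
// Sketch: an Either-like result that accumulates failures when combined.
#[derive(Debug, PartialEq)]
enum Validation {
    Valid,
    Invalid(Vec<String>),
}

impl Validation {
    // Semigroup-style combine: failures accumulate, Valid is the identity.
    fn and(self, other: Validation) -> Validation {
        match (self, other) {
            (Validation::Invalid(mut a), Validation::Invalid(b)) => {
                a.extend(b);
                Validation::Invalid(a)
            }
            (Validation::Invalid(a), _) | (_, Validation::Invalid(a)) => {
                Validation::Invalid(a)
            }
            _ => Validation::Valid,
        }
    }
}

fn main() {
    let v = Validation::Invalid(vec!["not a string".into()])
        .and(Validation::Valid)
        .and(Validation::Invalid(vec!["not an integer".into()]));
    println!("{:?}", v); // both failures survive the combine
}
```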

Appendix: A look at aeson-better-errors

The issue with aeson-better-errors is easiest to illustrate using the same example as in its announcement:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Play where

import           Data.Aeson
import           Data.Aeson.BetterErrors

data Person = Person String Int
  deriving (Show)

asPerson :: Parse e Person
```
```
$ cd interval
```

We’re going to put the code for our Interval into a separate module. First, put the following code into src/interval.rs:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{sleep, spawn};
use std::time::Duration;

#[derive(Clone)]
pub struct Interval {
    counter: Arc<AtomicUsize>,
    still_running: Arc<AtomicBool>,
}

impl Drop for Interval {
    fn drop(&mut self) {
        println!("Interval thread shutting down");
        self.still_running.store(false, Ordering::SeqCst);
    }
}

impl Interval {
    pub fn from_millis(millis: u64) -> Interval {
        let duration = Duration::from_millis(millis);
        let counter = Arc::new(AtomicUsize::new(0));
        let counter_clone = counter.clone();
        let still_running = Arc::new(AtomicBool::new(true));
        let still_running_clone = still_running.clone();

        spawn(move || {
            println!("Interval thread launched");
            while still_running_clone.load(Ordering::SeqCst) {
                sleep(duration);
                let old = counter_clone.fetch_add(1, Ordering::SeqCst);
                println!("Interval thread still alive, value was: {}", old);
            }
        });

        Interval {
            counter,
            still_running,
        }
    }

    pub fn get_counter(&self) -> usize {
        self.counter.load(Ordering::SeqCst)
    }
}
```

Next, let’s provide a minimal src/main.rs which uses this Interval data type:

```rust
mod interval;

use self::interval::Interval;

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let duration = std::time::Duration::from_millis(2000); // 2 seconds

    for i in 1..11 {
        println!("Iteration number {}, counter is {}", i, interval.get_counter());
        std::thread::sleep(duration);
    }
}
```

You should see something like the following:

```
Iteration number 1, counter is 0
Interval thread launched
Interval thread still alive, value was: 0
Interval thread still alive, value was: 1
Interval thread still alive, value was: 2
Iteration number 2, counter is 3
Interval thread still alive, value was: 3
Interval thread still alive, value was: 4
...
Interval thread still alive, value was: 33
Interval thread still alive, value was: 34
Iteration number 10, counter is 35
Interval thread still alive, value was: 35
Interval thread still alive, value was: 36
Interval thread still alive, value was: 37
Interval thread still alive, value was: 38
Interval thread shutting down
```

Hurrah, we have some concurrent communication.

Problems with this approach

The first thing that jumps out as a problem is that we’re missing some updates in the main thread. Notice how the counter jumps from 0 to 3. This is obviously a problem with the interval set in the main thread: we’re delaying for 2 seconds instead of half a second. Let’s instead delay for a tenth of a second (100ms) in the main thread, and check if the value has changed since last time.

NOTE It’s still possible we’ll miss some updates this way, since sleep guarantees a thread will sleep at least a given amount of time, but may sleep longer. However, by having such a large difference, we’re fairly certain we’ll catch all of the updates.

```rust
fn main() {
    let interval = Interval::from_millis(500); // half a second
    let duration = std::time::Duration::from_millis(100); // 0.1 seconds
    let mut last = interval.get_counter();

    for i in 1..51 {
        let curr = interval.get_counter();
        if curr != last {
            last = curr;
            println!("Iteration number {}, counter is {}", i, curr);
        }
        std::thread::sleep(duration);
    }
}
```

I had to increase the number of iterations to 50, because so many of our main thread iterations end up showing no change in the counter.
Here’s an example of running this on my machine:

```
Interval thread launched
Interval thread still alive, value was: 0
Iteration number 6, counter is 1
Interval thread still alive, value was: 1
Iteration number 11, counter is 2
Interval thread still alive, value was: 2
Iteration number 16, counter is 3
Interval thread still alive, value was: 3
Iteration number 21, counter is 4
Interval thread still alive, value was: 4
Iteration number 26, counter is 5
Interval thread still alive, value was: 5
Iteration number 31, counter is 6
Interval thread still alive, value was: 6
Iteration number 36, counter is 7
Interval thread still alive, value was: 7
Iteration number 41, counter is 8
Interval thread still alive, value was: 8
Iteration number 46, counter is 9
Interval thread still alive, value was: 9
Interval thread shutting down
```

We didn’t lose any counter updates here, but from the bumps in interval thread numbers, we can see that we’re wasting a lot of time in the main thread checking numbers that aren’t changing.

Another problem that’s less obvious is that we’re dedicating an entire OS thread to this sleep-and-check iteration. In our simple program, that’s not a big deal. But imagine we decided we wanted to have 50 different similar tasks going on. It would require 49 extra threads, most of which would sit around sleeping the majority of the time. That’s highly wasteful. We should be able to do better.

Finally, and perhaps least important for the moment, this is all rather ad hoc. It seems like a common need to be able to abstract over “this thing will produce a value in the future.” Even though this seems like the least important problem, we’ll start by solving it first.

The Future trait

Who’d have thought that our meandering would naturally lead to one of the topics mentioned in this lesson’s title!
You may have noticed a pattern that developed in the main thread’s loop:

- Check if a new value is available
- Use it if it is available
- Skip if it isn’t available

That’s exactly what the Future trait allows us to do, with one addition: it also allows for error handling. We’re not going to worry about that for now, since my code doesn’t have any errors :).

We’ll start by adding the futures crate as a dependency. In Cargo.toml:

```toml
[dependencies]
futures = "0.1"
```

Next, let’s add a new module to provide a struct that will provide a Future implementation. Behold, src/future.rs:

```rust
extern crate futures;

use super::interval::Interval;
use futures::prelude::*;

pub struct IntervalFuture {
    interval: Interval,
    last: usize,
}

impl IntervalFuture {
    pub fn new(interval: Interval) -> IntervalFuture {
        let last = interval.get_counter();
        IntervalFuture { interval, last }
    }
}

impl Future for IntervalFuture {
    type Item = usize;
    type Error = ();

    fn poll(&mut self) -> Poll<Self::Item, Self::Error> {
        let curr = self.interval.get_counter();
        if curr == self.last {
            Ok(Async::NotReady)
        } else {
            self.last = curr;
            Ok(Async::Ready(curr))
        }
    }
}
```

We’re going to own an Interval and the last value we provided, just like our main loop used to. The new method is fairly straightforward. For the impl Future, we need to define three things:

- The thing which will be returned by this type when ready. In our case, it’s the counter value, which is a usize.
- The type of errors that can occur. We don’t have any errors, so we use (). (The Haskeller in me screams that we should use Void. Soon enough, we’ll be able to use never.)
- A function poll, which returns a Result. In the error case, this will be our Self::Error. In the success case, this is an Async enum type. As we can see in our method body, this is either a Ready variant with the value, or NotReady.

The logic in our function is the same as before, so I won’t comment on the implementation.
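To strip the idea down to its essentials, here is a std-only toy (no futures crate; all names are invented for illustration) of the poll contract: each call either yields Ready with a value or NotReady, and the caller decides when to try again.

```rust
// Toy version of the Ready/NotReady polling contract.
#[derive(Debug, PartialEq)]
enum Async<T> {
    Ready(T),
    NotReady,
}

struct Delayed {
    remaining_polls: u32, // pretend the value is ready after this many polls
    value: u32,
}

impl Delayed {
    fn poll(&mut self) -> Async<u32> {
        if self.remaining_polls == 0 {
            Async::Ready(self.value)
        } else {
            self.remaining_polls -= 1;
            Async::NotReady
        }
    }
}

fn main() {
    let mut d = Delayed { remaining_polls: 2, value: 42 };
    loop {
        match d.poll() {
            Async::Ready(v) => {
                println!("ready: {}", v);
                break;
            }
            Async::NotReady => println!("not ready yet"),
        }
    }
}
```

The real trait adds error handling and a notification mechanism on top of this shape, but the Ready/NotReady dance is the core of it.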
Back in our src/main.rs, instead of playing around with the curr/last logic, we can just pattern match on the result of poll()ing:

```rust
extern crate futures;

mod future;
mod interval;

use self::interval::Interval;
use self::future::IntervalFuture;
use futures::prelude::*;

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let mut interval_future = IntervalFuture::new(interval);
    let duration = std::time::Duration::from_millis(100); // 0.1 seconds

    for i in 1..51 {
        match interval_future.poll() {
            Ok(Async::Ready(curr)) => {
                println!("Iteration number {}, counter is {}", i, curr);
            }
            Ok(Async::NotReady) => (),
            Err(()) => unreachable!(),
        }
        std::thread::sleep(duration);
    }
}
```

Arguably a minor improvement on the previous code, though nothing major. But congratulations, you’re now officially using the futures crate!

The Poll type definition

Just a minor helper to mention. The type Result<Async<Self::Item>, Self::Error> may look a bit unwieldy to you. If so, you’ll be happy to learn about the Poll type definition, which lets you replace the above with Poll<Self::Item, Self::Error>. Not a big deal, but important to recognize as you’re reading other code.

The tokio executor

Right now, we’re running our own executor in our main function: we’re manually looping, delaying, etc. Besides tedium, we’ve already mentioned some downsides to this above:

- We need a single thread per task we wish to perform
- We need to implement some kind of guess-and-check thread sleeping

It’s time to pull out the big guns, and tokio this thing. We’ll end up losing some functionality first, and then we’ll build it back. First, add tokio = "0.1" to your Cargo.toml.
Now, let’s try using the tokio executor by calling tokio::run on a Future:

```rust
extern crate futures;
extern crate tokio;

mod future;
mod interval;

use self::interval::Interval;
use self::future::IntervalFuture;
use tokio::prelude::*;

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let mut interval_future = IntervalFuture::new(interval);

    tokio::run(interval_future)
}
```

This fails with a compilation error:

```
error[E0271]: type mismatch resolving `<future::IntervalFuture as futures::Future>::Item == ()`
  --> src/main.rs:15:5
   |
15 |     tokio::run(interval_future)
   |     ^^^^^^^^^^ expected usize, found ()
   |
   = note: expected type `usize`
              found type `()`
   = note: required by `tokio::run`
```

The tokio::run function expects a Future where the Item is (), but ours is usize. This kind of makes sense anyway: don’t we want to write some code to actually do something when we get a value? We’re going to fix this, first the overly painful way, and then the pleasant way. That will also help you appreciate why lesson 5 spent so much time on closures.

Define an adapter Future

Remember how you can define Iterators that consume other Iterators and compose more powerful streams? Well, you can do the same thing with Futures. Let’s define a new type that will:

- Wrap around an IntervalFuture
- Print the new value whenever it’s ready

We’ll put this in src/main.rs for now, it won’t last long anyway.
```rust
extern crate futures;
extern crate tokio;

mod future;
mod interval;

use self::interval::Interval;
use self::future::IntervalFuture;
use tokio::prelude::*;

struct IntervalPrinter(IntervalFuture);

impl Future for IntervalPrinter {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<Self::Item, Self::Error> {
        match self.0.poll() {
            Ok(Async::Ready(curr)) => {
                println!("Counter is: {}", curr);
                Ok(Async::Ready(()))
            }
            Ok(Async::NotReady) => Ok(Async::NotReady),
            Err(e) => Err(e),
        }
    }
}

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let interval_future = IntervalFuture::new(interval);
    let interval_printer = IntervalPrinter(interval_future);

    tokio::run(interval_printer)
}
```

Compile the code, but don’t run it yet. This is relatively straightforward given all of the types and traits we’ve seen, but it’s obviously tedious. Let’s start off with a minor simplification.

The try_ready macro

The body of poll spends a lot of lines of code to:

- Pattern match
- If it’s NotReady, return NotReady
- If it’s an Err, return the Err

This is a repetitive pattern, and is pretty similar to the error handling we saw previously in lesson 3. The futures crate provides a macro, try_ready!, to deal with this annoyance. Add the following above the extern crate futures; in src/main.rs:

```rust
#[macro_use]
```

And then your implementation of poll can be simplified to:

```rust
let curr = try_ready!(self.0.poll());
println!("Counter is: {}", curr);
Ok(Async::Ready(()))
```

Nice! Compile, and again, don’t run. (Getting curious yet why I keep saying that? We’ll find out soon, just one more pitstop first.)

I need some closure

It’s amazing I’ve made it to lesson 7 in this crash course without making that pun. Obviously, defining an entire struct and Future implementation is a bit overkill to just print a line. Fortunately, the authors of the futures crate noticed this too. There are a number of combinators built into the Future trait that make it easy to chain things together.
We’re going to use the and_then method:

```rust
extern crate futures;
extern crate tokio;

mod future;
mod interval;

use self::interval::Interval;
use self::future::IntervalFuture;
use tokio::prelude::*;

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let interval_future = IntervalFuture::new(interval);
    let interval_printer = interval_future.and_then(|curr| {
        println!("Counter is: {}", curr);
        futures::future::ok(())
    });

    tokio::run(interval_printer)
}
```

That’s much nicer! If you’re anything like me though, the futures::future::ok(()) is bothering you. What purpose does it serve there? This is a vital part of the design of futures, which we’ll be taking advantage of quite a bit going forward. It allows us to create chains of actions to run as each bit of async I/O completes. For now, we don’t want to do anything else after we print the first value from the counter, so we just return futures::future::ok(()), which means “don’t do anything, and return the item ()”.

Exercise 1 There’s another method, .map, which is actually a better choice for us here than .and_then. Try rewriting the code above to use .map. Note: no solution provided to this one.

Out of curiosity, what’s the type of interval_printer? Let’s use the dumb trick from before of giving it the wrong type. Insert something silly like let interval_printer: bool = ... and try compiling, you’ll get some type like:

```
futures::AndThen<
    future::IntervalFuture,
    futures::FutureResult<(), ()>,
    [closure@src/main.rs:14:59: 17:6]
>
```

If this is starting to look a bit like Iterator types, that’s by design. Just like Iterators, Futures capture a large amount of information in the types themselves about what they’ll do. This allows Futures to compile down to highly efficient code, living up to the Rust mantra of zero-cost abstractions.

Finally, run it!

Is the suspense killing you? Alright, your moment has arrived. What happens when you run cargo run?

```
$ cargo run
Interval thread still alive, value was: 0
...
```


And that’s it. It hangs. It doesn’t print out the message we painstakingly added with and_then. I’m a complete failure, my life is in ruins.

After an hour to myself to contemplate my life and a few shots of whiskey, things started to get clear. In our original implementation, we had a really wasteful loop in the main function. It kept checking if the counter had changed. We did that with our Future implementation too at first, sleeping and then checking again for a NotReady. But tokio probably isn’t doing that, right? (The answer is yes, right.)

Instead of doing the silly wasteful thing, the futures crate is far smarter. It has a mechanism to:

- determine which task is trying to get access to the data provided by this future, and then
- notify that task that new data is available

The futures::task::current() function gives us the current task being run, as a Task struct. That struct has a method, notify, to let the task know that more data is available.

In our program, we have the logic split between Interval and IntervalFuture. IntervalFuture will need to be responsible for calling the current() function (give some thought as to why that’s the case). The changes needed to make this work are:

- Add a new field to Interval to hold an Arc<Mutex<Option<Task>>> (yes, that’s a mouthful), and initialize it correctly.
- Each time we call fetch_add and update the counter, also call notify() on that task, if it’s there.
- Provide a method set_task to set the Task on an Interval.
- When returning NotReady in IntervalFuture, call set_task.

Exercise 2 Take a stab at implementing these four changes before looking at the solution below. A hint on some features of Rust we haven’t covered yet: you’ll end up wanting to pattern match on the Option held inside the Mutex. You’ll want to pattern match by reference, which will require some code that looks like Some(ref task) =>. And the final output should look like:

```
Interval thread launched
Interval thread still alive, value was: 0
Counter is: 1
```


If you want to be sure, you can see the initial version of the code on Github.

Solution 2

You can check out the diff and full solution on Github. Here’s the diff included inline:

```diff
diff --git a/src/future.rs b/src/future.rs
index 9aaee3c..e231e7b 100644
--- a/src/future.rs
+++ b/src/future.rs
@@ -22,6 +22,8 @@ impl Future for IntervalFuture {
     fn poll(&mut self) -> Result<Async<Self::Item>, Self::Error> {
         let curr = self.interval.get_counter();
         if curr == self.last {
+            let task = futures::task::current();
+            self.interval.set_task(task);
             Ok(Async::NotReady)
         } else {
             self.last = curr;
diff --git a/src/interval.rs b/src/interval.rs
index 044e2ca..8013ac6 100644
--- a/src/interval.rs
+++ b/src/interval.rs
@@ -1,12 +1,14 @@
 use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
-use std::sync::Arc;
+use std::sync::{Arc, Mutex};
 use std::thread::{sleep, spawn};
 use std::time::Duration;
+use futures::task::Task;
 
 #[derive(Clone)]
 pub struct Interval {
     counter: Arc<AtomicUsize>,
     still_running: Arc<AtomicBool>,
+    task: Arc<Mutex<Option<Task>>>,
 }
 
 impl Drop for Interval {
@@ -26,22 +28,37 @@ impl Interval {
         let still_running = Arc::new(AtomicBool::new(true));
         let still_running_clone = still_running.clone();
 
+        let task: Arc<Mutex<Option<Task>>> = Arc::new(Mutex::new(None));
+        let task_clone = task.clone();
+
         spawn(move || {
             println!("Interval thread launched");
             while still_running_clone.load(Ordering::SeqCst) {
                 sleep(duration);
                 let old = counter_clone.fetch_add(1, Ordering::SeqCst);
                 println!("Interval thread still alive, value was: {}", old);
+                match *task_clone.lock().unwrap() {
+                    Some(ref task) => task.notify(),
+                    None => (),
+                };
             }
         });
 
         Interval {
             counter,
             still_running,
+            task,
         }
     }
 
     pub fn get_counter(&self) -> usize {
         self.counter.load(Ordering::SeqCst)
     }
+
+    pub fn set_task(&mut self, task: Task) {
+        let mut guard = self.task.lock().unwrap();
+        *guard = Some(task);
+    }
 }
```


Hurrah, we finally have a working tokio program!

In case you’re worried about how complex that was, don’t be. Notification is a vital aspect of how tokio works internally. However, in most cases, you won’t be creating your own primitive Futures, but instead dealing with existing ones provided by tokio or other libraries. Those existing Futures will provide the necessary notification logic. You’ll simply need to obey this one rule:

Only return a NotReady from a poll function if you received a NotReady from an underlying Future.
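A std-only toy (invented names, not tokio’s API) shows what obeying that rule looks like in practice: the wrapper’s poll simply propagates the inner NotReady, trusting that the inner future already arranged its own wakeup, rather than manufacturing a NotReady of its own.

```rust
// Toy wrapper demonstrating "only return NotReady if the inner poll did".
#[derive(Debug, PartialEq)]
enum Async<T> {
    Ready(T),
    NotReady,
}

struct Inner {
    polls_left: u32, // pretend: ready after this many polls
}

impl Inner {
    fn poll(&mut self) -> Async<u32> {
        if self.polls_left == 0 {
            Async::Ready(7)
        } else {
            self.polls_left -= 1;
            Async::NotReady
        }
    }
}

struct Doubler(Inner);

impl Doubler {
    fn poll(&mut self) -> Async<u32> {
        match self.0.poll() {
            // Propagate: the inner future is responsible for the wakeup.
            Async::NotReady => Async::NotReady,
            Async::Ready(v) => Async::Ready(v * 2),
        }
    }
}

fn main() {
    let mut d = Doubler(Inner { polls_left: 1 });
    println!("{:?}", d.poll()); // NotReady, straight from the inner poll
    println!("{:?}", d.poll()); // Ready(14)
}
```

If Doubler instead returned NotReady on its own initiative, nothing would ever have registered a wakeup, and a real executor would hang exactly like our program just did.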

Just one value?

It’s a bit disappointing that our wonderful long running counter only ends up printing a single value. Can we create some kind of a loop? A simple approach like the following doesn’t work:

```rust
let interval_printer = interval_future.and_then(|curr| {
    println!("Counter is: {}", curr);
    interval_printer
});
```


This isn’t Haskell; we can’t recursively refer to the interval_printer we’re in the middle of defining. Go ahead and take a few other stabs at doing something like that, and you’ll eventually get frustrated and go back to the whiskey. Digging through the futures docs, a helper function like loop_fn looks promising, but I didn’t see a simple way to make it work in this case. (Please let me know if I missed something!) I ended up with something wonky like this before stopping:

```rust
fn main() {
    let interval = Interval::from_millis(500); // half a second
    let interval_future = Arc::new(Mutex::new(IntervalFuture::new(interval)));
    let interval_printer = loop_fn(interval_future, |interval_future| {
        let interval_future_clone = interval_future.clone();
        interval_future.lock().unwrap().and_then(|curr| {
            println!("Counter: {}", curr);
            futures::future::ok(Continue(interval_future_clone))
        })
    });

    tokio::run(interval_printer)
}
```


Another struct!

Like before, we’re going to define another helper type to implement this concept of looping. Then we’ll see that this problem has already been solved better in the futures crate itself, but we’ll get there soon.

We want to define a new struct, KeepPrinting, which is a newtype around an IntervalFuture. It’s going to:

- Have a Future implementation
- Have Item = ()
- Use a loop in its implementation
- Use the try_ready! macro

Exercise 3 Try implementing KeepPrinting and using it in the main function. Solution follows immediately, but try not to cheat!

```rust
#[macro_use]
extern crate futures;
extern crate tokio;

mod future;
mod interval;

use self::future::IntervalFuture;
use self::interval::Interval;
use tokio::prelude::*;

struct KeepPrinting(IntervalFuture);

impl Future for KeepPrinting {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<(), ()> {
        loop {
            let curr = try_ready!(self.0.poll());
            println!("Counter: {}", curr);
        }
    }
}

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let interval_future = IntervalFuture::new(interval);
    let keep_printing = KeepPrinting(interval_future);

    tokio::run(keep_printing)
}
```


And, just like that, we get an infinitely looping program. This almost looks like something that we could have done with Iterators. Which makes me wonder… is there something like Iterators in futures?

Streams

A Future is an action with a delayed single result. A Stream is a stream of results, like an Iterator, with a delay between each value.
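Before diving in, here is a std-only toy (invented names, no futures crate) of the Stream polling contract: like a Future’s poll, but the Ready payload is an Option, with None marking the end of the stream.

```rust
// Toy Stream contract: Ready(Some(x)) per element, Ready(None) at the end.
#[derive(Debug, PartialEq)]
enum Async<T> {
    Ready(T),
    NotReady,
}

struct CountTo {
    next: u32,
    max: u32,
}

impl CountTo {
    fn poll(&mut self) -> Async<Option<u32>> {
        if self.next > self.max {
            Async::Ready(None) // stream is exhausted
        } else {
            self.next += 1;
            Async::Ready(Some(self.next - 1))
        }
    }
}

fn main() {
    let mut s = CountTo { next: 1, max: 3 };
    let mut items = Vec::new();
    loop {
        match s.poll() {
            Async::Ready(Some(x)) => items.push(x),
            Async::Ready(None) => break,
            Async::NotReady => (), // a real executor would wait here
        }
    }
    println!("{:?}", items);
}
```

The real Stream trait adds the error type and the notification machinery, but this Option-in-Ready shape is the whole difference from Future.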

In src/main.rs, add mod stream; and then edit src/stream.rs. The file will end up looking remarkably similar to src/future.rs, except:

- Call the struct IntervalStream instead of IntervalFuture
- Provide an impl Stream for IntervalStream instead of impl Future
- Follow the compiler errors to fix it

Within the main function, instead of using KeepPrinting or anything else, we’ll want to create an IntervalStream value. However, tokio::run needs a Future, not a Stream, to run. Fortunately, there’s a helper function, for_each, that runs a given closure on each value in the stream.

Exercise 4 Try to implement src/stream.rs and src/main.rs. Solution to follow.

The trickiest bit for me when first learning for_each was to realize that, like and_then, it needs to end with a Future. I don’t know if that was just my own shortcoming, or a common issue. In any event, if you struggled to realize you needed something like future::ok(()) at the end of your closure, you’re in good company.

In addition, the poll function for a Stream is slightly different, in returning an Option<Item>. This is similar to how Iterator works. In our case, we have an infinite stream, so we never provide the None case.

Anyway, here’s src/stream.rs:

```rust
extern crate futures;

use super::interval::Interval;
use futures::prelude::*;

pub struct IntervalStream {
    interval: Interval,
    last: usize,
}

impl IntervalStream {
    pub fn new(interval: Interval) -> IntervalStream {
        let last = interval.get_counter();
        IntervalStream { interval, last }
    }
}

impl Stream for IntervalStream {
    type Item = usize;
    type Error = ();

    fn poll(&mut self) -> Poll<Option<Self::Item>, Self::Error> {
        let curr = self.interval.get_counter();
        if curr == self.last {
            // Register our task so the interval thread can wake us up.
            self.interval.set_task(futures::task::current());
            Ok(Async::NotReady)
        } else {
            self.last = curr;
            Ok(Async::Ready(Some(curr)))
        }
    }
}
```


And src/main.rs:

```rust
extern crate futures;
extern crate tokio;

mod interval;
mod stream;

use self::interval::Interval;
use self::stream::IntervalStream;
use tokio::prelude::*;

fn main() {
    let interval = Interval::from_millis(500); // half a second
    let interval_stream = IntervalStream::new(interval);
    let future = interval_stream.for_each(|curr| {
        println!("Counter: {}", curr);
        futures::future::ok(())
    });

    tokio::run(future)
}
```


Exercise 5 Like Iterators, Streams have helper methods that you can use to build up more complex things. For example, try throwing in map and take to print only the first 10 counter values, but double them before printing. (No solution provided.)

This is all beginning to fit together nicely! While there are still details to learn in the futures crate, you’ve got most of the big ideas down. The next bit is to get familiar with the API in tokio, but relatively speaking this is less mind-bending. To hammer home what we’ve done so far, we’ll hit a few exercises, and then continue with tokio.

Exercise 6

Define a new struct MyOk such that this main function works:

```rust
fn main() {
    let name = String::from("Alice");
    let future = MyOk::new(name).and_then(|name| {
        println!("Name: {}", name);
        MyOk::new(())
    });

    tokio::run(future)
}
```


Hint: before cheating and looking at the solution, here’s one piece of help: you’ll want an Option inside the MyOk newtype, and it’s invalid to call poll on it twice.

Solution 6

```rust
struct MyOk<T>(Option<T>);

impl<T> MyOk<T> {
    fn new(t: T) -> MyOk<T> {
        MyOk(Some(t))
    }
}

impl<T> Future for MyOk<T> {
    type Item = T;
    type Error = ();

    fn poll(&mut self) -> Poll<T, ()> {
        // take() ensures a second poll would panic rather than misbehave.
        Ok(Async::Ready(self.0.take().unwrap()))
    }
}
```


Exercise 7

Use iter_ok to convert the range 1..11 to a Stream, and then collect it as a Vec and print it.

Solution 7

```rust
fn main() {
    tokio::run(stream::iter_ok(1..11).collect().and_then(|x| {
        println!("{:?}", x);
        future::ok(())
    }))
}
```


Async I/O

We’ve played around with the futures crate by creating a fake async I/O source of data (the Interval). We’ve built up Futures and Streams in that world. And we’ve used tokio’s executor to run these things. It’s now time to use some real async I/O.

Most async I/O we care about will end up being network traffic. Filesystem operations don’t always play nicely with async I/O at an operating system level. That said, to get our feet wet, let’s play with a filesystem based example.
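For comparison with the async version we’re about to build, this is what the ordinary blocking approach looks like with std::fs::read_dir (list_dir is a helper name invented for this sketch):

```rust
use std::fs;
use std::io;

// Blocking directory listing: collect each entry's path as a String.
fn list_dir(path: &str) -> io::Result<Vec<String>> {
    let mut paths = Vec::new();
    for entry in fs::read_dir(path)? {
        paths.push(entry?.path().display().to_string());
    }
    Ok(paths)
}

fn main() {
    for p in list_dir(".").expect("read_dir failed") {
        println!("{}", p);
    }
}
```

The thread simply waits on each filesystem call; the tokio version below does the same work without parking an OS thread.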

You’ll want to look at the docs quite a bit; you can find them on docs.rs.

List files in a directory

If you look through the docs above, you may find the function read_dir. It takes a path, and returns a ReadDirFuture. This is a standard approach in tokio, like we had with Iterators: simple wrapper functions providing access to the structs that do the heavy lifting. One thing to get used to in tokio is how to read these docs.

Click through on the ReadDirFuture struct. It has a Future implementation, where Item is ReadDir, and Error is std::io::Error. Before we deal with that ReadDir, let’s just get something that compiles. Since this is still a crash course, we’ll bash our heads against a brick wall each step of the way.

First, to call read_dir, we need a directory. Let’s use "." (the current directory). We’ll use command line arguments later. Here’s a naive implementation:

```rust
extern crate tokio;

use tokio::prelude::*;
use tokio::fs;

fn main() {
    let future = fs::read_dir(".");

    tokio::run(future)
}
```


This gives us a somewhat intimidating error message:

```
error[E0271]: type mismatch resolving `<tokio_fs::read_dir::ReadDirFuture<&str> as tokio::prelude::Future>::Item == ()`
 --> src/main.rs:8:5
  |
8 |     tokio::run(future)
  |     ^^^^^^^^^^ expected struct `tokio_fs::read_dir::ReadDir`, found ()
  |
  = note: expected type `tokio_fs::read_dir::ReadDir`
             found type `()`
  = note: required by `tokio::run`

error[E0271]: type mismatch resolving `<tokio_fs::read_dir::ReadDirFuture<&str> as tokio::prelude::Future>::Error == ()`
 --> src/main.rs:8:5
  |
8 |     tokio::run(future)
  |     ^^^^^^^^^^ expected struct `std::io::Error`, found ()
  |
  = note: expected type `std::io::Error`
             found type `()`
  = note: required by `tokio::run`
```


However, let me narrow that down for you:

expected type tokio_fs::read_dir::ReadDir, found type ()
expected type std::io::Error, found type ()


Oh, right! tokio::run requires that we have Item and Error as (). We can modify the Error with map_err. Let’s just print out the error if one occurs:

let future = fs::read_dir(".")
.map_err(|e| eprintln!("Error reading directory: {}", e))
;


That knocked out the first compilation error. Let’s also throw in a .and_then:

.and_then(|readdir| {
})


Uh oh, we got this compilation error. Can you figure out how to solve it?

error[E0277]: the trait bound (): tokio::prelude::Future is not satisfied


In my experience, when you see that, it almost always means: “I forgot to add future::ok(()).” Remember, and_then needs to end with the next Future to run. Add that line, and your code should compile. Running produces the output:

ReadDir(ReadDir("."))


Cool! Now it’s time to look at the docs for ReadDir. Instead of a Future implementation, this has a Stream. Let’s shove a for_each in there and see what happens.

Challenge: try to guess the errors in the code below before you compile it.

.and_then(|readdir| {
    readdir
        .for_each(|entry| {
            println!("{:?}", entry.path());
        })
})


There are two problems with this code:

1. It leaves an error type of std::io::Error
2. It doesn’t include future::ok(()) at the end of the closure provided to for_each

Go ahead and fix those problems. To be sure we’re on the same page, here’s my solution which compiles and runs successfully:

extern crate tokio;

use tokio::prelude::*;
use tokio::fs;

fn main() {
    let future = fs::read_dir(".")
        .map_err(|e| eprintln!("Error reading directory: {}", e))
        .and_then(|readdir| {
            readdir
                .map_err(|e| eprintln!("Error reading directory: {}", e))
                .for_each(|entry| {
                    println!("{:?}", entry.path());
                    future::ok(())
                })
        })
        ;
    tokio::run(future)
}


Duplicated error handling

It’s a bit irritating that we have two identical map_err calls. We have two different sources of errors: the initial read_dir Future, and then streaming the individual DirEntrys from it. However, the type of the errors in both cases is the same: std::io::Error. Therefore, we can move our error handling to the end, and just do it once:

fn main() {
    let future = fs::read_dir(".")
        .and_then(|readdir| {
            readdir
                .for_each(|entry| {
                    println!("{:?}", entry.path());
                    future::ok(())
                })
        })
        .map_err(|e| eprintln!("Error reading directory: {}", e))
        ;
    tokio::run(future)
}


Flattening

It turns out that it’s common enough to have a Future that generates another Future, which we then want to run, that there’s a helper method for it: flatten(). There’s also a flatten_stream() that does the same thing when a Future gives us a Stream.
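If it helps to see the shape of the idea outside of futures, std’s Iterator has an analogous flatten: a nested structure collapses one level. This is a std-only sketch of the analogy, not tokio code:

```rust
fn main() {
    // An iterator of iterators, analogous to a Future producing another
    // Future (or a Stream): flatten collapses one level of nesting.
    let nested = vec![vec![1, 2], vec![3], vec![4, 5]];
    let flat: Vec<i32> = nested.into_iter().flatten().collect();
    assert_eq!(flat, vec![1, 2, 3, 4, 5]);
    println!("{:?}", flat);
}
```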

Exercise 8 Rewrite the code above to use flatten_stream. You should end up with no calls to and_then. Solution follows immediately:

extern crate tokio;

use tokio::prelude::*;
use tokio::fs;

fn main() {
    let future = fs::read_dir(".")
        .flatten_stream()
        .for_each(|entry| {
            println!("{:?}", entry.path());
            future::ok(())
        })
        .map_err(|e| eprintln!("Error reading directory: {}", e))
        ;
    tokio::run(future)
}


Command line arguments

It’s somewhat boring to always print out what’s in the current directory. Instead, let’s take all of the command line arguments (skipping the first, which is the executable name), and list the directory contents. We’ll use stream::iter_ok to convert the Args Iterator into a Stream:

extern crate tokio;

use tokio::prelude::*;
use tokio::fs;
use std::env::args;

fn main() {
    let future = stream::iter_ok(args())
        .skip(1)
        .for_each(|dir| {
            fs::read_dir(dir)
                .flatten_stream()
                .for_each(|entry| {
                    println!("{:?}", entry.path());
                    future::ok(())
                })
        })
        .map_err(|e| eprintln!("Error reading directory: {}", e))
        ;
    tokio::run(future)
}


Unfortunately, this doesn’t compile. The full error message is large (I encourage you to check it out yourself), but the first few lines are sufficient to find the problem:

error[E0277]: std::env::Args cannot be sent between threads safely
--> src/main.rs:20:5
|
20 |     tokio::run(future)
|     ^^^^^^^^^^ std::env::Args cannot be sent between threads safely


Oh, darn. Args isn’t thread safe, and so cannot be converted into a Stream. Fair enough: vectors to the rescue!

Exercise 9 Create a vector of arguments before the definition of future and use that.

Solution:

extern crate tokio;

use tokio::prelude::*;
use tokio::fs;
use std::env::args;

fn main() {
    let args: Vec<String> = args().skip(1).collect();
    let future = stream::iter_ok(args)
        .for_each(|dir| {
            fs::read_dir(dir)
                .flatten_stream()
                .for_each(|entry| {
                    println!("{:?}", entry.path());
                    future::ok(())
                })
        })
        .map_err(|e| eprintln!("Error reading directory: {}", e))
        ;
    tokio::run(future)
}


Where’s the concurrency?

If you provide this program two different directories with a large number of files, you may notice that it processes these directories sequentially: it will print all of the files in the first directory, and then all of the files in the second directory. Given that async I/O and concurrency usually go hand-in-hand, that may be a bit surprising.

So far, we’ve only ever had a single task at a time. Our program streams out the value of args, and for each one provides a Future. That Future is run to completion, and then the next value from args is processed.

What if we want to process each directory concurrently? To do that, we need to spawn another task, the same way we would spawn a new thread. Like tokio::run, tokio::spawn takes a Future where both Item and Error are (). Here’s a more concurrent version of our program:

let future = stream::iter_ok(args)
    .for_each(|dir| {
        let future = fs::read_dir(dir)
            .flatten_stream()
            .map_err(|e| eprintln!("Error reading directory: {}", e))
            .for_each(|entry| {
                println!("{:?}", entry.path());
                future::ok(())
            })
            ;
        tokio::spawn(future);
        future::ok(())
    })
    ;


Notice how I’ve put future::ok(()) after the tokio::spawn(future); call. It turns out that’s not needed: spawn returns a Spawn value, which behaves like future::ok(()) (via its IntoFuture implementation). So just remove future::ok and the semicolon after spawn, and your code will still work.

NOTE You may not notice the concurrency unless you have a large number of files in each directory.

Skipping the vector

One final thing that annoyed me above is that Vec. It really seems like we should be able to get away without it. We can’t convert Args into a Stream, because that would require sending the value between threads. But now we’ve got a new trick up our sleeves: spawning. What if we never send the Args anywhere, but just spawn a bunch of tasks?

When I was first learning tokio, I’ll admit that I spent way more time trying to figure out the trick I’m about to show you than I’m proud of. We need to create a Future, then run a for loop inside of it. How do we create a Future that lets us run some code without waiting for anything else? We can use future::ok(()) to create a dummy Future, and then chain the next action together with and_then.

let future = future::ok(()).and_then(|()| {
    for dir in args().skip(1) {
        let future = fs::read_dir(dir)
            .flatten_stream()
            .map_err(|e| eprintln!("Error reading directory: {}", e))
            .for_each(|entry| {
                println!("{:?}", entry.path());
                future::ok(())
            })
            ;
        tokio::spawn(future);
    }
    future::ok(())
});


Another approach, if you’re so inclined, is to use the future::poll_fn helper function. This takes a zero-argument function which returns a Result<Async<Item>, Error>, just like the poll method of Future does.

Exercise 10

Rewrite our program above to use future::poll_fn. Your program should not use and_then at all.

Solution:

extern crate tokio;

use tokio::prelude::*;
use tokio::fs;
use std::env::args;

fn main() {
    let future = future::poll_fn(|| {
        for dir in args().skip(1) {
            let future = fs::read_dir(dir)
                .flatten_stream()
                .map_err(|e| eprintln!("Error reading directory: {}", e))
                .for_each(|entry| {
                    println!("{:?}", entry.path());
                    future::ok(())
                })
                ;
            tokio::spawn(future);
        }
        Ok(Async::Ready(()))
    });
    tokio::run(future)
}


Is all of this ceremony worth it for file system operations? Probably not. Let’s get into something more interesting: network communications!

TCP client

Rust’s standard library already provides really nice support for TCP communications out of the box. For example, the following code will print the full response headers and body from an HTTP request to http://httpbin.org/json:

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> Result<(), Box<std::error::Error>> {
    let mut stream = TcpStream::connect("httpbin.org:80")?;
    stream.write_all(b"GET /json HTTP/1.1\r\nHost: httpbin.org\r\nConnection: close\r\n\r\n")?;

    let mut buffer = vec![];
    stream.read_to_end(&mut buffer)?;

    println!("{}", std::str::from_utf8(&buffer)?);
    Ok(())
}


There are some simplifying assumptions here, like using Connection: close so that we can use read_to_end, and assuming the response body is properly UTF-8 encoded. But that’s not really an indictment of the TCP support in the standard library.

The real problem is the same one we’ve been talking about throughout this lesson: it hogs an entire OS thread blocking on a successful write to and subsequent read from the network. Let’s look at how tokio can help.

It looks like there’s a TcpStream type in tokio::net::TcpStream. That looks like a good place to start. It takes a SocketAddr, which we can probably make easily enough. And it returns a ConnectFuture. Let’s start with some code that simply establishes a connection:

extern crate tokio;

use std::net::ToSocketAddrs;
use tokio::net::TcpStream;
use tokio::prelude::*;

fn main() {
    let mut addr_iter = "httpbin.org:80".to_socket_addrs().unwrap();
    let addr = match addr_iter.next() {
        Some(addr) => addr,
        None => panic!("DNS resolution failed"),
    };
    let future = TcpStream::connect(&addr)
        .map_err(|e| eprintln!("Error connecting: {:?}", e))
        .map(|stream| {
            println!("Got a stream: {:?}", stream);
        });
    tokio::run(future)
}


The to_socket_addrs business isn’t our focus right now, so I’m going to ignore it. Feel free as an exercise to improve the error handling of that bit of code.

We’ve got all of the familiar pieces here: define a Future, handle errors, and use map to chain together the action to take with the open connection.

Let’s look a bit more closely at that stream value passed to the closure. It comes from a ConnectFuture, so we need to look at the Item associated type there. And sure enough, if you check the docs, you’ll see that it’s TcpStream. Great.

We used write_all in our original, non-async, blocking code. If I search for write_all in tokio, I find that there’s such a helper function which returns a WriteAll Future. Something interesting:

• write_all takes two parameters: an AsyncWrite and an AsRef<[u8]>. This will work out to be our TcpStream and the data to send.
• The Item for WriteAll is a pair of the variables A and T, which turns out to be exactly the same as the parameters we passed in.

Giving us back the stream we originally provided is vital to continuing the connection. I didn’t see any documentation clarifying the point of returning the byte buffer, but I believe it’s returned so that, if desired, you can reuse a mutable buffer.
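That buffer-reuse idea is easy to see with plain blocking I/O from std. This is a sketch using Cursor as a stand-in for a socket, not tokio code:

```rust
use std::io::{Cursor, Read};

fn main() {
    let mut buffer = Vec::new();

    // First read fills the buffer.
    Cursor::new(b"hello".to_vec()).read_to_end(&mut buffer).unwrap();
    assert_eq!(buffer, b"hello");

    // Clear it and reuse the same allocation for the next read,
    // rather than allocating a fresh Vec each time.
    buffer.clear();
    Cursor::new(b"world".to_vec()).read_to_end(&mut buffer).unwrap();
    assert_eq!(buffer, b"world");
    println!("buffer reused, capacity = {}", buffer.capacity());
}
```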

Anyway, let’s put this together and make our request:

extern crate tokio;

use std::net::ToSocketAddrs;
use tokio::net::TcpStream;
use tokio::io::write_all;
use tokio::prelude::*;

const REQ_BODY: &[u8] = b"GET /json HTTP/1.1\r\nHost: httpbin.org\r\nConnection: close\r\n\r\n";

fn main() {
    let mut addr_iter = "httpbin.org:80".to_socket_addrs().unwrap();
    let addr = match addr_iter.next() {
        Some(addr) => addr,
        None => panic!("DNS resolution failed"),
    };
    let future = TcpStream::connect(&addr)
        .and_then(|stream| {
            write_all(stream, REQ_BODY)
                .map(|(stream, _body)| println!("Write succeeded: {:?}", stream))
        })
        .map_err(|e| eprintln!("Error occurred: {:?}", e))
        ;
    tokio::run(future)
}


Notice how I replaced the previous map with an and_then call, so that I could provide another Future to be performed after the connection was established.

Would it be too much to ask to also get a read_to_end function in tokio? Nope, not at all.

Exercise 11 Use read_to_end to consume the entire response into a Vec<u8>, and then print that out, using std::str::from_utf8 and being as careless with error handling as you like.

Alright, solution:

extern crate tokio;

use std::net::ToSocketAddrs;
use tokio::io::{read_to_end, write_all};
use tokio::net::TcpStream;
use tokio::prelude::*;

const REQ_BODY: &[u8] = b"GET /json HTTP/1.1\r\nHost: httpbin.org\r\nConnection: close\r\n\r\n";

fn main() {
    let mut addr_iter = "httpbin.org:80".to_socket_addrs().unwrap();
    let addr = match addr_iter.next() {
        Some(addr) => addr,
        None => panic!("DNS resolution failed"),
    };
    let future = TcpStream::connect(&addr)
        .and_then(|stream| {
            write_all(stream, REQ_BODY)
        }).and_then(|(stream, _body)| {
            let buffer = vec![];
            read_to_end(stream, buffer)
        }).map(|(_stream, buffer)| {
            let s = std::str::from_utf8(&buffer).unwrap();
            println!("{}", s);
        }).map_err(|e| eprintln!("Error occurred: {:?}", e));
    tokio::run(future)
}


Streaming to a file

I’ll repeat for maximum annoyance: tokio is not intended for asynchronous file operations. That said, there is a tokio::fs::File struct which we can use. Let’s try to write the response contents to httpbin.json instead:

let future = TcpStream::connect(&addr)
    .and_then(|stream| {
        write_all(stream, REQ_BODY)
    }).and_then(|(stream, _body)| {
        let buffer = vec![];
        read_to_end(stream, buffer)
    }).and_then(|(_stream, buffer)| {
        File::create("httpbin.json").and_then(|file| {
            write_all(file, &buffer).map(|_| ())
        })
    }).map_err(|e| eprintln!("Error occurred: {:?}", e));


Unfortunately, the compiler doesn’t like this too much:

error[E0277]: the trait bound std::fs::File: tokio::io::AsyncWrite is not satisfied
--> src/main.rs:25:17
|
25 |                 write_all(file, &buffer).map(|_| ())
|                 ^^^^^^^^^ the trait tokio::io::AsyncWrite is not implemented for std::fs::File
|
= note: required by tokio::io::write_all


Well, I guess that makes sense: you can’t asynchronously write to a File, so tokio::io::write_all isn’t going to work. Fortunately, File does implement the Write trait, which provides a blocking write_all, which is sufficient for our purposes.

Exercise 12 Rewrite the code above to successfully write httpbin.json.

First solution, ignoring anything close to proper error handling:

let future = TcpStream::connect(&addr)
    .and_then(|stream| {
        write_all(stream, REQ_BODY)
    }).and_then(|(stream, _body)| {
        let buffer = vec![];
        read_to_end(stream, buffer)
    }).and_then(|(_stream, buffer)| {
        File::create("httpbin.json").map(|mut file| {
            file.write_all(&buffer).unwrap()
        })
    }).map_err(|e| eprintln!("Error occurred: {:?}", e));


But ideally, we’d like to avoid that unwrap() and instead promote an I/O error here to be handled by the map_err below. It turns out that there’s a surprisingly trivial change to make that happen:

File::create("httpbin.json").and_then(|mut file| {
    file.write_all(&buffer)
})


Instead of using map, we use and_then, which requires that we return some value that implements Future. But fortunately, Result itself implements Future! The Ok variant becomes the Item for that Future, and the Err variant becomes its Error. Problem solved!
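The same and_then-propagates-errors shape already exists on plain Result in blocking code, which is why the snippet reads so naturally. Here’s a std-only sketch (the filename and function name are just for illustration):

```rust
use std::fs::File;
use std::io::Write;

// Blocking analogue of the snippet above: Result::and_then runs the
// write only if the create succeeded, and propagates the first error.
fn write_file(path: &str, buffer: &[u8]) -> std::io::Result<()> {
    File::create(path).and_then(|mut file| file.write_all(buffer))
}

fn main() {
    match write_file("httpbin.json", b"{}") {
        Ok(()) => println!("wrote file"),
        Err(e) => eprintln!("Error occurred: {:?}", e),
    }
}
```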

Exercise 13

We haven’t taken advantage of tokio here at all! Let’s make this program concurrent. Write a program that takes command line arguments to determine HTTP requests to make and files to store them to. To simplify the implementation, we’ll have it take input that looks like the following:

$ cargo run httpbin.org:80 /json httpbin.json example.com:80 / homepage.html


Feel free to handle invalid command line arguments however’s easiest.

Solution:

extern crate tokio;

use std::net::ToSocketAddrs;
use tokio::io::{read_to_end, write_all};
use tokio::net::TcpStream;
use tokio::prelude::*;
use std::fs::File;

fn download(host: String, path: String, filename: String) -> impl Future<Item=(), Error=()> {
    let mut addr_iter = host.to_socket_addrs().unwrap();
    let addr = match addr_iter.next() {
        None => panic!("DNS resolution failed"),
        Some(addr) => addr,
    };
    let req_body = format!(
        "GET {} HTTP/1.1\r\nHost: {}:80\r\nConnection: close\r\n\r\n",
        path,
        host,
    );
    TcpStream::connect(&addr)
        .and_then(|stream| {
            write_all(stream, req_body).and_then(|(stream, _body)| {
                let buffer = vec![];
                read_to_end(stream, buffer).and_then(|(_stream, buffer)| {
                    File::create(filename).and_then(|mut file| {
                        file.write_all(&buffer)
                    })
                })
            })
        }).map_err(|e| eprintln!("Error occurred: {:?}", e))
}

fn main() {
    tokio::run(future::poll_fn(|| {
        let mut args = std::env::args().skip(1);
        loop {
            match (args.next(), args.next(), args.next()) {
                (Some(host), Some(path), Some(filename)) => {
                    tokio::spawn(download(host, path, filename));
                }
                _ => return Ok(Async::Ready(())),
            }
        }
    }))
}


Nicer error handling

We’re just panic!ing when we have a bad address. Let’s do a little better. First, I’ll define a helper function to return a nice Result. We’ll use a String for the Err variant, though ideally we should define an enum instead:

fn resolve_addr(host: &str) -> Result<SocketAddr, String> {
    let mut addr_iter = match host.to_socket_addrs() {
        Ok(addr_iter) => addr_iter,
        Err(e) => return Err(format!("Invalid host name {:?}: {:?}", host, e)),
    };
    match addr_iter.next() {
        None => Err(format!("No addresses found for host: {:?}", host)),
        Some(addr) => Ok(addr),
    }
}


Inside download, we could continue panic!ing with:

let addr = resolve_addr(&host).unwrap();


But let’s do better. Using ? won’t work, since we aren’t returning a Result. One idea would be to use return to return early:

let addr = match resolve_addr(&host) {
    Ok(addr) => addr,
    Err(e) => {
        eprintln!("Error resolving address: {}", e);
        return future::err(());
    }
};


However, we get an interesting error message from the compiler:

error[E0308]: mismatched types
--> src/main.rs:34:5
|
34 | /     TcpStream::connect(&addr)
35 | |         .and_then(|stream| {
36 | |             write_all(stream, req_body).and_then(|(stream, _body)| {
37 | |                 let buffer = vec![];
...  |
43 | |             })
44 | |         }).map_err(|e| eprintln!("Error occurred: {:?}", e))
   | |___________________________________________________________^ expected struct tokio::prelude::future::FutureResult, found struct tokio::prelude::future::MapErr


In order to make things work, we need to ensure that we always return the same type. We’ve so far used impl Future to say “we’ll return some type which is a Future,” but we haven’t told the compiler what that type is; the compiler has inferred it. But now we have two different types. One approach would be dynamic dispatch, such as using Box<Future>. But there’s a better way: the Either helper type. This type is used when we have two different types of Futures which both have the same Item and Error.
Let’s see how we can rewrite our code above to use Either:

fn download(host: String, path: String, filename: String) -> impl Future<Item=(), Error=()> {
    let addr = match resolve_addr(&host) {
        Ok(addr) => addr,
        Err(e) => {
            eprintln!("Error resolving address: {}", e);
            return future::Either::A(future::err(()));
        }
    };
    let req_body = format!(
        "GET {} HTTP/1.1\r\nHost: {}:80\r\nConnection: close\r\n\r\n",
        path,
        host,
    );
    future::Either::B(TcpStream::connect(&addr)
        .and_then(|stream| {
            write_all(stream, req_body).and_then(|(stream, _body)| {
                let buffer = vec![];
                read_to_end(stream, buffer).and_then(|(_stream, buffer)| {
                    File::create(filename).and_then(|mut file| {
                        file.write_all(&buffer)
                    })
                })
            })
        }).map_err(|e| eprintln!("Error occurred: {:?}", e)))
}


Exercise 14

Implement your own Either data type and use it in the code above.

Solution:

enum Either<A, B> {
    A(A),
    B(B),
}

impl<A, B> Future for Either<A, B>
where
    A: Future<Item=B::Item, Error=B::Error>,
    B: Future,
{
    type Item = A::Item;
    type Error = A::Error;

    fn poll(&mut self) -> Poll<Self::Item, Self::Error> {
        match self {
            Either::A(a) => a.poll(),
            Either::B(b) => b.poll(),
        }
    }
}


TCP server

Having been so successful with our TCP client, let’s move over to the server side. Conceptually, we want to:

1. Bind a listening socket
2. Accept connections from that socket
3. Copy all data from the input side of the socket to the output side of the socket

Binding a listening socket is going to be a blocking call to bind, taking our old friend SocketAddr. Since we’re not playing around with DNS resolution anymore, we can be a bit lazier about how we do this:

extern crate tokio;

use tokio::prelude::*;
use tokio::net::TcpListener;

fn main() {
    let addr = "127.0.0.1:3000".parse().expect("couldn't parse address");
    let listener = TcpListener::bind(&addr).expect("couldn't bind address");
    println!("It worked! {:?}", listener);
}


We now have a TcpListener.
Unlike other types we’ve seen, this doesn’t have an implementation of either Future or Stream. However, it does have a method called incoming(), which returns an Incoming, which has a Stream implementation, where Item is TcpStream. That looks promising!

let future = listener
    .incoming()
    .for_each(|socket| {
        println!("Accepted a connection! {:?}", socket);
        future::ok(())
    })
    .map_err(|e| eprintln!("An error occurred: {:?}", e))
    ;
tokio::run(future)


And just like that, we’ve implemented points (1) and (2) above. We’re just left with point (3): copying all of the data. Let’s search tokio for something to do the copying. It looks like tokio::io::copy will do. We need to provide it both a reader and a writer. Since we’re reading from and writing to the same socket, let’s just provide the same value for both:

let future = listener
    .incoming()
    .for_each(|socket| {
        copy(socket, socket)
            .map(|_| println!("Connection closed"))
    })
    .map_err(|e| eprintln!("An error occurred: {:?}", e))
    ;


Are you already laughing at my comically silly mistake?

error[E0382]: use of moved value: socket
--> src/main.rs:13:26
|
13 |             copy(socket, socket)
|                  ------  ^^^^^^ value used here after move
|                  |
|                  value moved here


Of course we can’t use the same value in both positions. Fortunately, when designing Streams, the authors provided a method called split to give us a read end and a write end of the stream.
With that in hand, our echo server becomes trivial:

extern crate tokio;

use tokio::prelude::*;
use tokio::net::TcpListener;
use tokio::io::copy;

fn main() {
    let addr = "127.0.0.1:3000".parse().expect("couldn't parse address");
    let listener = TcpListener::bind(&addr).expect("couldn't bind address");
    let future = listener
        .incoming()
        .for_each(|socket| {
            let (reader, writer) = socket.split();
            copy(reader, writer)
                .map(|_| println!("Connection closed"))
        })
        .map_err(|e| eprintln!("An error occurred: {:?}", e))
        ;
    tokio::run(future)
}


Writing directly

Using copy kind of ignores the gory details of what’s going on under the surface. Let’s start off by writing some arbitrary message to the writer side of things, using the write_all function.

NOTE The tokio::io::write_all function takes an AsyncWrite and returns a WriteAll Future. Don’t be confused by the presence of a write_all method, which in fact is a blocking call. I wasted about 5 minutes fumbling with that while writing this tutorial.

extern crate tokio;

use tokio::prelude::*;
use tokio::net::TcpListener;
use tokio::io::{copy, write_all};

fn main() {
    let addr = "127.0.0.1:3000".parse().expect("couldn't parse address");
    let listener = TcpListener::bind(&addr).expect("couldn't bind address");
    let future = listener
        .incoming()
        .for_each(|socket| {
            let (reader, writer) = socket.split();
            write_all(writer, b"Welcome to the echo server\r\n")
                .map(|_| println!("Connection closed"))
        })
        .map_err(|e| eprintln!("An error occurred: {:?}", e))
        ;
    tokio::run(future)
}


Exercise 15

Modify the code above so that, after printing “Welcome to the echo server”, it proceeds to actually echo content sent in. Solution:

write_all(writer, b"Welcome to the echo server\r\n")
    .and_then(|(writer, _)| {
        copy(reader, writer)
            .map(|_| println!("Connection closed"))
    })


Codecs

Reading from the reader directly is slightly trickier than writing to the writer. We could go play around with the underlying polling read functions, but we’re not going to here.
(Feel free to read the official tokio tutorial for more information.) Instead, we’re going to introduce a new concept, codecs.

So far, we’ve implicitly been working with the AsyncRead and AsyncWrite traits, which essentially provide the raw polling functions we’d need for building up our own Futures (as we did way back at the beginning of this lesson). However, we often don’t want to work at that level of abstraction. Instead, we’d like to deal with some kind of framed (or chunked) data. The new abstraction will instead be a Sink, which is “a value into which other values can be sent, asynchronously.” We’ll continue to use the Stream trait for the read side, which we’re already quite familiar with.

Let’s contrive an example. Our echo server currently provides slightly weird output:

Hello
Hello
There
There
World
World


It’s hard to tell what I typed in, and what the server responded. Instead, I’d like each line sent back from the server to begin with “You said: “. Doing that with the abstractions we’ve seen so far would be fairly painful: we’d need to grab chunks of data, look for the newline character, break up the input, and splice in the “You said: “ message. I know this is a crash course and all, but I’d rather not crash into that. Instead, let’s jump straight to the better solution.

I want to treat our TCP stream as a stream of lines of data. If I search tokio for the word “lines” (and this is actually how I learned about codecs), I end up with LinesCodec. It provides a method new(), as well as new_with_max_length. We’ll use new here, but I recommend reading the docs to see why that’s a terrible idea in any kind of security-sensitive context. The only other method on the type is max_length, which doesn’t look like it’s going to help us actually deal with a TCP socket as a stream of lines. So let’s look down at the trait implementations. We’ve got all of our usual suspects: Clone, PartialOrd, etc. But two new ones stick out: Decoder and Encoder.
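To build some intuition for what a line-based decoder does, here’s a hand-rolled, std-only sketch of framing a byte buffer into lines. This is an illustration of the idea, not tokio’s actual LinesCodec implementation:

```rust
// Hand-rolled "decoder": split a byte buffer into newline-delimited frames,
// returning complete lines and keeping any trailing partial line buffered.
fn decode_lines(buffer: &mut Vec<u8>) -> Vec<String> {
    let mut lines = Vec::new();
    while let Some(pos) = buffer.iter().position(|&b| b == b'\n') {
        let mut line: Vec<u8> = buffer.drain(..=pos).collect();
        line.pop(); // drop the '\n'
        if line.last() == Some(&b'\r') {
            line.pop(); // drop an optional '\r'
        }
        lines.push(String::from_utf8_lossy(&line).into_owned());
    }
    lines
}

fn main() {
    let mut buffer = b"Hello\r\nWorld\npartial".to_vec();
    let lines = decode_lines(&mut buffer);
    assert_eq!(lines, vec!["Hello".to_string(), "World".to_string()]);
    // The incomplete frame stays buffered until more bytes arrive.
    assert_eq!(buffer, b"partial");
    println!("{:?}", lines);
}
```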
Well, that certainly looks interesting. Reading through the docs on Decoder, it provides a method called framed, whose description is great. (Please, take a second to follow that link and read the docs.) Without further ado, let’s try adding our LinesCodec to our echo server:

extern crate tokio;

use tokio::prelude::*;
use tokio::net::TcpListener;
use tokio::codec::{Decoder, LinesCodec};

fn main() {
    let addr = "127.0.0.1:3000".parse().expect("couldn't parse address");
    let listener = TcpListener::bind(&addr).expect("couldn't bind address");
    let future = listener
        .incoming()
        .for_each(|socket| {
            let lines_codec = LinesCodec::new();
            let socket = lines_codec.framed(socket);
            socket
                .send(String::from("Welcome to the echo server"))
                .map(|_| println!("Connection closed"))
        })
        .map_err(|e| eprintln!("An error occurred: {:?}", e))
        ;
    tokio::run(future)
}


You may have noticed that we no longer have a newline sequence at the end of the “Welcome” string. That’s because our lines codec handles that automatically. Additionally, we now need to use String::from, since the Item for this Sink is String.
We can also use split to isolate the Sink from the Stream:

let (sink, stream) = lines_codec.framed(socket).split();
sink
    .send(String::from("Welcome to the echo server"))
    .map(|_| println!("Connection closed"))


And we can use for_each on the Stream side to get a stream of the lines:

extern crate tokio;

use tokio::prelude::*;
use tokio::net::TcpListener;
use tokio::codec::{Decoder, LinesCodec};

fn main() {
    let addr = "127.0.0.1:3000".parse().expect("couldn't parse address");
    let listener = TcpListener::bind(&addr).expect("couldn't bind address");
    let future = listener
        .incoming()
        .for_each(|socket| {
            let lines_codec = LinesCodec::new();
            let (sink, stream) = lines_codec.framed(socket).split();
            sink
                .send(String::from("Welcome to the echo server"))
                .and_then(|sink| {
                    stream
                        .for_each(|line| {
                            println!("Received a line: {}", line);
                            future::ok(())
                        })
                        .map(|_| println!("Connection closed"))
                })
        })
        .map_err(|e| eprintln!("An error occurred: {:?}", e))
        ;
    tokio::run(future)
}


We’re almost done here: we just need to send the lines back to the sink instead of to stdout. Unfortunately, using the send method we’ve seen so far is going to be tricky, since we’ll end up consuming the sink in each iteration of for_each. We could figure out a way to make that all work, but instead, let’s just cut to the chase and use send_all.

Exercise 16

Modify the code above so that, instead of printing the lines to standard output, they get sent back to the client with the message “You said: “.
You’ll want to look at send_all.

Solution:

extern crate tokio;

use tokio::prelude::*;
use tokio::net::TcpListener;
use tokio::codec::{Decoder, LinesCodec};

fn main() {
    let addr = "127.0.0.1:3000".parse().expect("couldn't parse address");
    let listener = TcpListener::bind(&addr).expect("couldn't bind address");
    let future = listener
        .incoming()
        .for_each(|socket| {
            let lines_codec = LinesCodec::new();
            let (sink, stream) = lines_codec.framed(socket).split();
            sink
                .send(String::from("Welcome to the echo server"))
                .and_then(|sink| {
                    let stream = stream
                        .map(|line| format!("You said: {}", line))
                        ;
                    sink.send_all(stream)
                        .map(|_| println!("Connection closed"))
                })
        })
        .map_err(|e| eprintln!("An error occurred: {:?}", e))
        ;
    tokio::run(future)
}


Next time

Whew, that was a big lesson! At this point, you should have a very solid footing in the ins and outs of tokio. It’s time to get lots more experience using the library, and related higher-level libraries for doing things like HTTP servers and clients.

Depending on reader feedback, the next lesson may either go deeper into tokio and related libraries, or go back to more fundamental aspects of Rust like lifetimes. From the tokio side, we’d play with:

• Message passing between tasks
• UDP communications
• Recursive directory traversal
• Parallel downloading of files

December 02, 2018

Mark Jason Dominus

Necklaces and bracelets

There are combinatorial objects called necklaces and bracelets, and I can never remember which is which. Both are finite sequences of things (typically symbols) where the start point is not important. So the bracelet ABCDE is considered to be the same as the bracelets BCDEA, CDEAB, DEABC, and EABCD. One of the two also disregards the direction you go, so that ABCDE is also considered the same as EDCBA (and so also DCBAE, etc.). But which? I have to look it up every time. I have finally thought of a mnemonic.
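The distinction can be made concrete in code: a canonical form for a necklace is its lexicographically smallest rotation, while a bracelet also gets to consider the reversal. A quick sketch (the function names are mine):

```rust
// Canonical form of a necklace: the lexicographically smallest rotation.
fn necklace_canonical(s: &str) -> String {
    let n = s.len();
    let doubled: Vec<char> = s.chars().chain(s.chars()).collect();
    (0..n)
        .map(|i| doubled[i..i + n].iter().collect::<String>())
        .min()
        .unwrap()
}

// Canonical form of a bracelet: the smallest over rotations of the
// sequence and of its reversal.
fn bracelet_canonical(s: &str) -> String {
    let rev: String = s.chars().rev().collect();
    std::cmp::min(necklace_canonical(s), necklace_canonical(&rev))
}

fn main() {
    // ABCDE and CDEAB are the same necklace...
    assert_eq!(necklace_canonical("ABCDE"), necklace_canonical("CDEAB"));
    // ...but ABCDE and EDCBA are different necklaces,
    assert_ne!(necklace_canonical("ABCDE"), necklace_canonical("EDCBA"));
    // while as bracelets they are the same.
    assert_eq!(bracelet_canonical("ABCDE"), bracelet_canonical("EDCBA"));
    println!("ok");
}
```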
In a necklace, the direction is important, because to reverse an actual necklace you have to pull it off over the wearer's head, turn it over, and put it back on. But in a bracelet the direction is not important, because it could be on either wrist, and a bracelet on the left wrist is the same as the reversed bracelet on the right wrist.

Okay, silly, maybe, but I think it's going to work.

Another day, another bug. No, four bugs.

I'm working on a large and wonderful project called “Greenlight”. It's a Git branch merging service that implements the following workflow:

1. Submitter submits a branch to Greenlight (greenlight submit my-topic-branch)
2. Greenlight analyzes the branch to decide if it changes anything that requires review and signoff
3. If so, it contacts the authorized reviewers, who then inform Greenlight that they approve the changes (greenlight approve 03a46dc1)
4. Greenlight merges the branch to master and publishes the result to the central repository

Of course, there are many details elided here. Multiple instances of Greenlight share a local repository, but to avoid confusion each has its own working tree. In Git you can configure these by setting the GIT_DIR and GIT_WORK_TREE environment variables, respectively. When Greenlight needs to run a Git command, it does so like this:

with env_var("GIT_DIR", self.repo_dir):
    with env_var("GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)


The env_var here is a Python context manager that saves the old environment, sets the new environment variable, and then, when the body of the block is complete, restores the environment to the way it was. This worked in testing every time. But the first time a beta tester ran the approve command, Greenlight threw a fatal exception. It was trying to run git checkout --quiet --detach, and this was failing, with Git saying:

fatal: this operation must be run in a work tree


Where was the GIT_WORK_TREE setting going? I still don't know.
But in the course of trying to track the problem down, I changed the code above to:

    with env_var("GIT_DIR", self.repo_dir,
                 "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)

and the problem, whatever it was, no longer manifested. But this revealed a second bug: Greenlight no longer failed in the approval phase. It went ahead and merged the branch, and then tried to publish the merge with git push origin .... But the push was rejected. This is because the origin repository had an update hook that ran on every push, which performed the same review analysis that Greenlight was performing; one of Greenlight's main purposes is to be a replacement for this hook. To avoid tying up the main repository for too long, this hook had a two-minute timeout, after which it would die and reject the push. This had only happened very rarely in the past, usually when someone was inadvertently trying to push a malformed branch. For example, they might have rebased all of master onto their topic branch. In this case, however, the branch really was legitimately enormous; it contained over 2,900 commits. “Oh, right,” I said. “I forgot to add the exception to the hook that tells it that it can immediately approve anything pushed by Greenlight.” The hook can assume that if the push comes from Greenlight, it has already been checked and authorized. Pushes happen via SSH, and Greenlight has its own SSH identity, which is passed to the hook in the GL_USERNAME variable. Modifying the hook was easy: I just added:

    if environ["GL_USERNAME"] == 'greenlight':
        exit(0)

This didn't work. My first idea was that Greenlight's public SSH key had not been installed in the authorized_keys file in the right place. When I grepped for greenlight in the authorized_keys file, there were no matches. The key was actually there, but in GitLab the authorized_keys file doesn't have actual usernames in it.
It has internal userids, which are then mapped to GL_USERNAME variables by some other entity. So I chased that wild goose for a while. Eventually I determined that the key was in the right place, but that the name of the Greenlight identity on the receiving side was not greenlight but bot-greenlight, which I had forgotten. So I changed the exception to say:

    if environ["GL_USERNAME"] == 'bot-greenlight':
        exit(0)

and it still didn't work. I eventually discovered that when Greenlight did the push, the GL_USERNAME was actually set to mjd. “Oh, right,” I said. “I forgot to have Greenlight use its own SSH credentials in the ssh connection.” The way you do this is to write a little wrapper program that obtains the correct credentials and runs ssh, and then you set GIT_SSH to point to the wrapper. It looks like this:

    #!/usr/bin/env bash
    export -n SSH_CLIENT SSH_TTY SSH_AUTH_SOCK SSH_CONNECTION
    exec /usr/bin/ssh -i$HOME/.ssh/identity "$@"

But wait, why hadn't I noticed this before? Because, apparently, every single person who had alpha-tested Greenlight had had their own credentials stored in ssh-agent, and every single one had had agent-forwarding enabled, so that when Greenlight tried to use ssh to connect to the Git repository, SSH duly forwarded their credentials along and the pushes succeeded. Amazing. With these changes, the publication went through. I committed the changes to the SSH credential stuff, and some other unrelated changes, and I looked at what was left to see what had actually fixed the original bug. Every change but one was to add diagnostic messages and logging. The fix for the original bug had been to replace the nested context managers with a single context manager. This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded.
By then I was five and a half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know. If you'd like to play along at home, the context manager looks like this, and did not change during the debugging process:

    from contextlib import contextmanager

    @contextmanager
    def env_var(*args):
        # Save old values of environment variables in old
        # A saved value of None means that the variable was not there before
        old = {}
        for i in range(len(args)//2):
            (key, value) = (args[2*i : 2*i+2])
            old[key] = None
            if key in os.environ:
                old[key] = os.environ[str(key)]
            if value is None:
                os.environ.pop(str(key), "dummy")
            else:
                os.environ[str(key)] = str(value)
        yield
        # Undo changes from versions saved in old
        for (key, value) in old.items():
            if value is None:
                os.environ.pop(str(key), "dummy")
            else:
                os.environ[str(key)] = value

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet. This computer stuff is amazingly complicated. I don't know how anyone gets anything done. [ Addendum 20181204: I figured it out. ]

Philip Wadler

Programming Language Foundations in Agda

Wen Kokke and I are pleased to announce the availability of the textbook:

    Programming Language Foundations in Agda
    plfa.inf.ed.ac.uk
    github.com/plfa/plfa.github.io/

It is written as a literate script in Agda, and available at the above URLs. The book has so far been used for teaching at the Universities of Edinburgh and Vermont, and at Google Seattle. Please send your comments and pull requests! The book was presented in a paper (of the same title) at the XXI Brazilian Symposium on Formal Methods, 28–30 Nov 2018, and is available here: http://homepages.inf.ed.ac.uk/wadler/topics/agda.html#sbmf The paper won the SBMF 2018 Best Paper Award, 1st Place.
December 01, 2018

Roman Cheplyaka

Tasty 1.2 supports dependencies between tests

Tasty (GitHub, Hackage) is a versatile and extensible testing framework for Haskell. Using tasty, you can quickly create a test suite consisting of unit tests, property tests and golden tests. Users of tasty benefit from automatic parallelism, a rich filtering language to choose which tests to run, sharing resources between tests and more. Tasty 1.2 adds another long-awaited feature: dependencies between tests.

If the tests only ran sequentially, dependencies wouldn’t be an issue: you could always arrange your tests in the desired order. However, tasty runs all tests in parallel by default (using a user-specified or automatically determined number of worker threads). While it’s always been possible to force sequential execution by setting the number of threads to 1, sometimes you want to take advantage of the parallelism while also making sure that some tests run after some other tests due to the tests’ side effects. This is now possible using the after combinator:

    testGroup "Tests accessing the same resource"
      [ testCase "Test A" $ ...
      , after AllFinish "Test A" $ testCase "Test B" $ ...
      ]

In this example, Test B will only start running after Test A has finished—successfully or not. Alternatively, you can tell Test B to run only when Test A succeeds (notice AllSucceed instead of AllFinish):

    testGroup "Tests creating and using a resource"
      [ testCase "Test A" $ ...
      , after AllSucceed "Test A" $
          testCase "Test B" $ ...
      ]

The "Test A" argument of after in these examples is not simply a test name—it is a pattern in tasty’s AWK-like pattern language, so a test can depend simultaneously on a combination of different tests from different subtrees. To learn more about dependencies in tasty, please refer to the README and the haddock documentation.

November 30, 2018

Functional Jobs

Software Engineer - Functional Programming at Klarna (Full-time)

Ever wondered why banking is so customer-unfriendly? The UX is a nightmare; modern features with quick logins and integrations seem non-existent. Feature rollouts take months, and that’s just not smoooth. At Klarna we aim to fix that. We want to create a banking offer that is both a pleasure to use as a consumer and one that you as a developer can use for your Fintech startup. Starting a new bank? Use our platform. Want to launch your own credit card? Use our platform. We do all the heavy lifting for you, providing a secure and compliant set of APIs which allow developers to focus on adding value through building their own financial products and services. By being based on a modern architecture in the cloud, utilizing tools such as Erlang, Haskell, Clojure and Scala combined with a serverless architecture, we will deliver on that expectation. We're expanding several of our financial platform engineering teams in Stockholm: scaling up the backend of Klarna Checkout and ensuring our highly available purchase-accepting system can handle ever-increasing traffic volumes. We're also adding new features and functionality to the backend of our payment processing systems and building a modern banking platform.

What you’ll get to do

• Own your services end to end. Decide how best to build, test, deploy and monitor.
• Work with large-scale, highly available and resilient modern financial systems.
• Be an integral part of a team, in addition to its culture and ways of working.
Common practices include agile methodologies, pair and mob programming.
• Work with automated deployment, enabling code release multiple times a day.
• Succeed, fail, and learn together with other talented people. We believe in an environment that provides an opportunity for growth, and see education as an outcome of failure that gets us closer to the next breakthrough.
• Work with modern tools and languages that excite you. You'll get the chance to work with functional programming languages including Erlang, Scala (typelevel stack) and JVM tooling. We are also utilising Clojure, PureScript, and Haskell; Mnesia and a range of OTP tools; Kafka, Serverless, PostgreSQL, DynamoDB, LevelDB; Prometheus, Splunk, Ansible, Terraform, Jenkins, Docker, Kubernetes.
• Previous functional programming experience isn’t essential, but you should be open to learning and working with any of the following languages, as the choices available may change over time.

What we offer

Culture - You'll have an opportunity to work with people from 45 different countries in our English-speaking office in Stockholm city centre.

Learning - We have a learning and development focused environment with an emphasis on knowledge sharing, training, and regular internal technical talks.

Compensation - You’ll get an attractive salary, pension, and insurance plans, along with 30 days annual leave. We recognise that life is more than work, and offer benefits for gym memberships, marathons, and all sorts of activities that promote physical health. We also have generous parental leave (for men and women).

Relocation - We can offer full relocation support to Stockholm for senior candidates. Remote work is not something we can accommodate.

We know that diverse teams are strong teams, and welcome those with alternative identities, backgrounds, and experiences. Our team includes women, men, mothers, fathers, the self-taught, the college-educated, and people from all over the world.
We also believe in making contributions back to the open source community. You can find some of our work at https://github.com/klarna. Interested in finding out more? Send over a CV or LinkedIn profile in English. Get information on how to apply for this position.

Mark Jason Dominus

How many kinds of polygonal loops? (part 2)

And I said I thought there were nine analogous figures with six points. Rahul Narain referred me to a recent discussion of almost this exact question on Math Stackexchange. (Note that the discussion there considers two figures different if they are reflections of one another; I consider them the same.) The answer turns out to be OEIS A000940. I had said:

    … for n = 6, I found it hard to enumerate. I think there are nine shapes but I might have missed one, because I know I kept making mistakes in the enumeration …

I missed three. The nine I got were:

And the three I missed are:

I had tried to break them down by the arrangement of the outside ring of edges, which can be described by a composition. The first two of these have type (which I missed completely; I thought there were none of this type) and the other has type , the same as the one in the lower right of the previous diagram. I had ended by saying:

    I would certainly not trust myself to hand-enumerate the shapes.

Good call, Mr. Dominus! I considered filing this under “oops” but I decided that although I had gotten the wrong answer, my confidence in it had been adequately low. On one level it was a mistake, but on a higher and more important level, I did better. I am going to try the (Cauchy-Frobenius-)Burnside-(Redfield-)Pólya lemma on it next and see if I can get the right answer. Thanks again to Rahul Narain for bringing this to my attention.

November 29, 2018

Mark Jason Dominus

How many kinds of polygonal loops?

Take n equally-spaced points on a circle. Now connect them in a loop: each point should be connected to exactly two others, and each point should be reachable from the others.
How many geometrically distinct shapes can be formed? For example, when n = 5, these four shapes can be formed:

(I phrased this like a geometry problem, but it should be clear it's actually a combinatorics problem. But it's much easier to express as a geometry problem; to talk about the combinatorics I have to ask you to consider a permutation where blah blah blah…)

For n < 6 it's easy. When n = 3 it is always a triangle. When n = 4 there are only two shapes: a square and a bow tie. But for n = 6, I found it hard to enumerate. I think there are nine shapes but I might have missed one, because I know I kept making mistakes in the enumeration, and I am not sure they are all corrected. It seems like it ought not to be hard to enumerate and count these, but so far I haven't gotten a good handle on it. I produced the display above by considering the compositions of the number 6:

    Composition      How many loops?
    6                1
    1+5
    2+4              1
    3+3              1
    1+1+4
    1+2+3            1
    2+2+2            2
    1+1+1+3
    1+1+2+2          1
    1+2+1+1          1
    1+1+1+1+2
    1+1+1+1+1+1      1
    Total            9 (?)

(Actually it's the compositions, modulo bracelet symmetries — that is, modulo the action of the dihedral group.) But this is fraught with opportunities for mistakes in both directions. I would certainly not trust myself to hand-enumerate the shapes.

[ Addendum: For n = 6 there are 12 figures, not 9. For n = 7, there are 39. Further details. ]

November 28, 2018

Michael Snoyman

Lifetimes and Slices - Rust Crash Course lesson 6 - exercise solutions

Below are the solutions to the exercises from the last Rust Crash Course lesson, “Lifetimes and slices.” This post is part of a series based on teaching Rust at FP Complete. If you’re reading this post outside of the blog, you can find links to all posts in the series at the top of the introduction post. You can also subscribe to the RSS feed.
Exercise 1

If you try just throwing in the ref keyword like this:

    match person.age {
        Some(ref age) => {
            println!("Age is {}", age);
            *age += 1;
        }
        None => println!("No age provided"),
    }

You’ll get an error message:

    error[E0594]: cannot assign to immutable borrowed content *age
      --> src/main.rs:16:13
       |
    14 |         Some(ref age) => {
       |              ------- help: use a mutable reference instead: ref mut age
    15 |             println!("Age is {}", age);
    16 |             *age += 1;
       |             ^^^^^^^^^ cannot borrow as mutable

Instead, you need to say ref mut age. And if you’re like me and regularly type in mut ref age instead of ref mut age, don’t worry, the compiler’s got your back:

    error: the order of mut and ref is incorrect
      --> src/main.rs:14:14
       |
    14 |         Some(mut ref age) => {
       |              ^^^^^^^ help: try switching the order: ref mut

    error: aborting due to previous error

Exercise 2

You need to provide mutable references for the two arguments to swap. Additionally, in order to get a mutable reference to res, res itself needs to be mutable.

    fn next(&mut self) -> Option<T> {
        let mut res = None;
        std::mem::swap(&mut res, &mut self.next);
        res
    }

Exercise 3

We need to have two different parameters, and ensure that ret and the return value have the same lifetime parameter:

    fn message_and_return<'a, 'b>(msg: &'a String, ret: &'b String) -> &'b String {
        println!("Printing the message: {}", msg);
        ret
    }

Exercise 4

Since the data is stored in the program executable itself, it lives for the entire program execution.
Therefore, the lifetime is 'static:

    fn main() {
        let bytearray1: &'static [u8; 22] = b"Hello World in binary!";
        let bytearray2: &'static [u8] = b"Hello World in binary!";
        println!("{:?}", bytearray1);
        println!("{:?}", bytearray2);
    }

Exercise 5

This is a great use case for the iterator method count:

    fn main() {
        for arg in std::env::args() {
            println!(
                "arg: {}, characters: {}, bytes: {}",
                arg,
                arg.chars().count(),
                arg.bytes().count(),
            );
        }
    }

November 26, 2018

FP Complete

Haskell and Rust

FP Complete is known for our best-in-class DevOps automation tooling in addition to Haskell. We use industry standards like Kubernetes and Docker. We’re always looking for new developments in software that help us empower our clients.

Functional Jobs

Senior Haskell / Full Stack Developer at PRODA Ltd (Full-time)

Position summary

We are looking for senior Haskell engineers to join our team in London or work remotely. We want you to be someone who is looking to really help shape the future of the development team, have real impact on strategy and architecture, and be the lead on our most important projects. You will expand our application’s capabilities in data ingestion, data exports, data standardisation and machine learning. We work in a functional programming stack in Haskell and Elm, together with machine learning components written in Python. Our development process includes CI/CD and automated tests. All developers work closely with our Chief Product Officer.

Skills & experience required

• Significant experience in a functional programming language, for instance: Haskell, Scala, F#, OCaml, Clojure, etc.
• Extensive knowledge working with a major RDBMS: PostgreSQL, MySQL, MS SQL Server.
• Significant exposure to JavaScript (node.js) and SPAs.
• Good understanding of Linux and shell scripting in Bash.
• Experience with basic machine learning algorithms, scikit-learn, Python, NumPy, and pandas is highly beneficial.
• Experience with AWS Cloud services, particularly EC2, RDS, CodeDeploy, CodeBuild, CodePipeline, S3.
• Knowledge of build tools: Webpack, Makefiles.
• Experience with web servers: basics of load balancers, Nginx configuration, etc.
• Familiarity with Git and GitHub.
• Familiarity with Docker.

About the Company

PRODA is a PropTech start-up founded in January 2017. We focus on solving core data processing pain points in real estate, and believe that unstructured data is a key barrier to digital transformation in the industry. PRODA is combining real estate expertise with data science to develop machine learning-enabled software, with the aim of truly automating many analysis and reporting tasks for real estate professionals. Our software automatically captures property data from spreadsheets, before standardising, error-checking and securely storing it. This enables users to standardise and access all their data, regardless of the country, asset class or company that it originated from. Once standardised, many exciting machine-learning analyses, change tracking and machine-written reports are possible. Please find press releases and the PRODA whitepaper in the PRODA news section.

Based in a smart WeWork on London's South Bank, PRODA has:

• A modern, open office with communal areas, quiet spaces, table games, unlimited teas and coffees, and beers in the evenings
• A focus on bringing in specialists, encouraging meetups, conferences, training and learning for software engineers
• Regular team meals and activities
• A culture of smart individuals who collaborate to solve interesting problems using interesting technologies

Get information on how to apply for this position.

Lysxia's blog

Surgery for data types

November 23, 2018

Neil Mitchell

Counting the Cost of Colons in Haskell

Summary: Haskell uses :: as the type operator. That was a mistake that costs us over 1 million characters of source code.

Haskell uses :: for type annotations, e.g. (1 :: Int).
Most other FP languages with types use :, including Scala, OCaml, Agda, Idris and Elm. Haskell uses : for list cons, so you can write:

    myList = 1:2:[] :: [Int]

Whereas in other languages you write:

    myList = 1::2::[] : [Int]

Moreover, the reason Haskell preferred :: was the belief that if the cons operator were :: then people would quite naturally insert spaces around it, giving:

    myList = 1 :: 2 :: [] : [Int]

The final program is now noticeably longer than the original. Back when people first invented Haskell, I imagine programs were mainly list manipulation operations, whereas now plenty of libraries work mainly at the type level. That raises the question: would Hackage be shorter or longer if we used : for types?

Method

I downloaded the latest version of every Hackage package. For each .hs file, I excluded comments and strings, then counted the number of instances of : and ::, also noting whether there were spaces around the :. The code I used can be found here.

Results

• Instances of :: = 1,991,631
• Instances of : = 265,880 (of which 79,109 were surrounded by two spaces, and 26,931 had one space)

Discussion

Assuming we didn't add/remove any spaces, switching the symbols would have saved 1,725,751 characters. If we assume that everyone would have written :: with spaces, that saving drops to 1,379,140 characters. These numbers are fairly small, representing about 0.17% of the total 993,793,866 characters on Hackage.

Conclusion

The Haskell creators were wrong in their assumptions - the type operator should have been :.

Update: the first version of this post had numbers that were too low due to a few bugs, now fixed.

November 22, 2018

Neil Mitchell

Downloading all of Hackage

Summary: I wanted to download the latest version of every package in Hackage. Here's a script and explanation of how to do it.

Imagine you want the latest version of every package on Hackage.
I found two tools that mirror every version of every package:

• Using hackage-mirror you can do hackage-mirror --from="http://hackage.haskell.org" --to="C:/hackage-mirror". But this project is long deprecated and doesn't actually work anymore.
• Using hackage-mirror-tool you might be able to do it, but it requires a new Cabal, isn't on Hackage, doesn't seem to work on Windows and doesn't say whether it downloads to disk or not.

Given it's a fairly simple problem, after investigating these options for an hour, I decided to cut my losses and write a script myself. Writing the script took a lot less than an hour, and I even wrote this blog post while the download was running. The complete script is at the bottom of this post, but I thought it might be instructive to explain how I went about developing it.

Step 0: Set up my working environment

I created a file named Download.hs where I was writing the source code, used ghcid Download.hs in a VS Code terminal to get fast error feedback using Ghcid, and opened another terminal to execute runhaskell Download.hs for testing.

Step 1: Find where a download link is

You can download a package from Hackage at http://hackage.haskell.org/package/shake/shake-0.17.tar.gz. You can also use https, but for my purposes of bulk downloading I figured http was fine. I hunted around to find a link which didn't contain the version number (as then I wouldn't have to compute the version number), but failed.

Step 2: Find a list of package versions

Looking at the cabal tool I found the cabal list --simple command, which prints a big list of packages in the form:

    foo 1.0
    foo 2.1
    bar 1.0

For each package on Hackage I get all versions sequentially, with the highest version number last. I can execute this command using systemOutput_ "cabal list --simple" (where systemOutput_ comes from the extra library).

Step 3: Generate the list of URLs

Now I have the data as a big string, I want to convert it into a list of URLs.
The full pipeline is:

    map (toUrl . last) . groupOn fst . map word1 . lines

Reading from right to left, I split the output into a list of lines with lines, then split each line on its first space (using word1 from the extra library). I then use groupOn fst so that I get consecutive runs of each package (no points for guessing where groupOn comes from). For each list of versions for a package I take the last (since I know that's the highest one) and transform it into the URL using:

    let toUrl (name, ver) = "http://hackage.haskell.org/package/" ++ name ++ "/" ++ name ++ "-" ++ ver ++ ".tar.gz"

Step 4: Download the URLs

I could make multiple calls to wget, but that's very slow, so instead I write them to a file and make a single call:

    writeFile "_urls.txt" $ unlines urls
    system_ "wget --input-file=_urls.txt"

I use the name _urls.txt so I can spot that special file in amongst all the .tar.gz files this command produces.

Step 5: Putting it all together

The complete script is:

    import Data.List.Extra
    import System.Process.Extra

    main :: IO ()
    main = do
        let toUrl (name, ver) = "http://hackage.haskell.org/package/" ++ name ++ "/" ++ name ++ "-" ++ ver ++ ".tar.gz"
        urls <- map (toUrl . last) . groupOn fst . map word1 . lines <$> systemOutput_ "cabal list --simple"
        writeFile "_urls.txt" $ unlines urls
        system_ "wget --input-file=_urls.txt"
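For comparison, the Step 3 pipeline can be sketched in Python as well (my transcription, not code from the post; like groupOn, itertools.groupby relies on a package's versions being listed consecutively, which cabal list --simple guarantees):

```python
from itertools import groupby

def package_urls(simple_listing):
    """Map `cabal list --simple` output to latest-version tarball URLs."""
    pairs = [line.split(None, 1) for line in simple_listing.splitlines()]
    urls = []
    for name, group in groupby(pairs, key=lambda p: p[0]):
        _, ver = list(group)[-1]  # versions are listed lowest to highest
        urls.append("http://hackage.haskell.org/package/"
                    f"{name}/{name}-{ver}.tar.gz")
    return urls

sample = "foo 1.0\nfoo 2.1\nbar 1.0"
print(package_urls(sample))
# ['http://hackage.haskell.org/package/foo/foo-2.1.tar.gz',
#  'http://hackage.haskell.org/package/bar/bar-1.0.tar.gz']
```

The structure mirrors the Haskell exactly: split lines, split on the first space, group consecutive runs per package, keep the last version, build the URL.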

After waiting 46 minutes I had 13,258 packages weighing in at 861Mb.

Update: In the comments Janek Stolarek suggested the simpler alternative of cabal list --simple | cut -d' ' -f1 | sort | uniq | xargs cabal get (I had missed the existence of cabal get).

How to design co-programs

I recently attended the Matthias Felleisen Half-Time Show, a symposium held in Boston on 3rd November in celebration of Matthias’s 60th birthday. I was honoured to be able to give a talk there; this post is a record of what I (wished I had) said.

Matthias Felleisen

Matthias is known for many contributions to the field of Programming Languages. He received the SIGPLAN Programming Languages Achievement Award in 2012, the citation for which states:

He introduced evaluation contexts as a notation for specifying operational semantics, and progress-and-preservation proofs of type safety, both of which are used in scores of research papers each year, often without citation. His other contributions include small-step operational semantics for control and state, A-normal form, delimited continuations, mixin classes and mixin modules, a fully-abstract semantics for Sequential PCF, web programming techniques, higher-order contracts with blame, and static typing for dynamic languages.

Absent from this list, perhaps because it wasn’t brought back into the collective memory until 2013, is a very early presentation of the idea of using effects and handlers for extensible language design.

However, perhaps most prominent among Matthias’s contributions to our field is a long series of projects on teaching introductory programming, from TeachScheme! through Program By Design to a latest incarnation in the form of the How to Design Programs textbook (“HtDP”), co-authored with Robby Findler, Matthew Flatt, and Shriram Krishnamurthi, and now in its second edition. The HtDP book is my focus in this post; I have access only to the First Edition, but the text of the Second Edition is online. I make no apologies for using Haskell syntax in my examples, but at least that gave Matthias something to shout at!

Design Recipes

One key aspect of HtDP is the emphasis on design recipes for solving programming tasks. A design recipe is a template for the solution to a problem: a contract for the function, analogous to a type signature (but HtDP takes an untyped approach, so this signature is informal); a statement of purpose; a function header; example inputs and outputs; and a skeleton of the function body. Following the design recipe entails completing the template—filling in a particular contract, etc—then fleshing out the function body from its skeleton, and finally testing the resulting program against the initial examples.

The primary strategy for problem solving in the book is via analysis of the structure of the input. When the input is composite, like a record, the skeleton should name the available fields as likely ingredients of the solution. When the input has “mixed data”, such as a union type, the skeleton should enumerate the alternatives, leading to a case analysis in the solution. When the input is of a recursive type, the skeleton encapsulates structural recursion—a case analysis between the base case and the inductive case, the latter case entailing recursive calls.

So, the design recipe for structural recursion looks like this:

Phase: Data Analysis and Design
Goal: to formulate a data definition
Activity: develop a data definition for mixed data with at least two alternatives; one alternative must not refer to the definition; explicitly identify all self-references in the data definition

Phase: Contract, Purpose and Header
Goal: to name the function; to specify its classes of input data and its class of output data; to describe its purpose; to formulate a header
Activity: name the function, the classes of input data, the class of output data, and specify its purpose:

    ;; name : in1 in2 … -> out
    ;; to compute … from x1 …
    (define (name x1 x2 …) …)

Phase: Examples
Goal: to characterize the input-output relationship via examples
Activity: create examples of the input-output relationship; make sure there is at least one example per subclass

Phase: Template
Goal: to formulate an outline
Activity: develop a cond-expression with one clause per alternative; add selector expressions to each clause; annotate the body with natural recursions; Test: the self-references in this template and the data definition match!

Phase: Body
Goal: to define the function
Activity: formulate a Scheme expression for each simple cond-line; explain for all other cond-clauses what each natural recursion computes according to the purpose statement

Phase: Test
Goal: to discover mistakes (“typos” and logic)
Activity: apply the function to the inputs of the examples; check that the outputs are as predicted

The motivating example for structural recursion is Insertion Sort: recursing on the tail of a non-empty list and inserting the head into the sorted subresult.

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{insertsort} :: [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{insertsort}\;[\,] &= [\,] \\ \mathit{insertsort}\;(a:x) &= \mathit{insert}\;a\;(\mathit{insertsort}\;x) \medskip \\ \multicolumn{2}{@{}l}{\mathit{insert} :: \mathit{Integer} \rightarrow [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{insert}\;b\;[\,] &= [b] \\ \mathit{insert}\;b\;(a:x) \\ \qquad \mid b \le a &= b : a : x \\ \qquad \mid b > a &= a : \mathit{insert}\;b\;x \end{array}$
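For readers following along without Haskell, the same two functions can be transcribed into Python (a sketch, not from the original post):

```python
def insert(b, xs):
    # Structural recursion on xs: a base case for the empty list,
    # and a natural recursion on the tail otherwise.
    if not xs:
        return [b]
    a, rest = xs[0], xs[1:]
    return [b] + xs if b <= a else [a] + insert(b, rest)

def insertsort(xs):
    # Recurse on the tail, insert the head into the sorted subresult.
    if not xs:
        return []
    return insert(xs[0], insertsort(xs[1:]))

print(insertsort([3, 1, 4, 1, 5]))  # [1, 1, 3, 4, 5]
```

The shape of both definitions is exactly the structural-recursion template: one clause per alternative of the list type, with a recursive call at each self-reference.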

A secondary, more advanced, strategy is to use generative recursion, otherwise known as divide-and-conquer. The skeleton in this design recipe incorporates a test for triviality; in the non-trivial cases, it splits the problem into subproblems, recursively solves the subproblems, and assembles the subresults into an overall result. The motivating example for generative recursion is QuickSort (but not the fast in-place version): dividing a non-empty input list into two parts using the head as the pivot, recursively sorting both parts, and concatenating the results with the pivot in the middle.

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{quicksort} :: [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{quicksort}\;[\,] & = [\,] \\ \mathit{quicksort}\;(a:x) &= \mathit{quicksort}\;y \mathbin{{+}\!\!\!{+}} [a] \mathbin{{+}\!\!\!{+}} \mathit{quicksort}\;z \\ & \qquad \mathbf{where}\; \begin{array}[t]{@{}ll} y &= [ b \mid b \leftarrow x, b \le a ] \\ z &= [ b \mid b \leftarrow x, b > a ] \end{array} \end{array}$
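A Python transcription of this QuickSort (a sketch, not from the original post) makes the generative-recursion skeleton visible:

```python
def quicksort(xs):
    # Test for triviality: the empty list is already sorted.
    if not xs:
        return []
    # Split the non-trivial input into subproblems, using the head as pivot...
    a, rest = xs[0], xs[1:]
    y = [b for b in rest if b <= a]
    z = [b for b in rest if b > a]
    # ...recursively solve them, and assemble the subresults.
    return quicksort(y) + [a] + quicksort(z)

print(quicksort([3, 1, 4, 1, 5]))  # [1, 1, 3, 4, 5]
```

Unlike insertion sort, the recursive calls here are not on structural components of the input: the sublists y and z are generated by the program itself, which is what makes this generative rather than structural recursion.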

As far as I can see, no other program structures than structural recursion and generative recursion are considered in HtDP. (Other design recipes are considered, in particular accumulating parameters and imperative features. But these do not determine the gross structure of the resulting program. In fact, I believe that the imperative recipe has been dropped in the Second Edition.)

Co-programs

My thesis is that HtDP has missed an opportunity to reinforce its core message, that data structure determines program structure. Specifically, I believe that the next design recipe to consider after structural recursion, in which the shape of the program is determined by the shape of the input, should be structural corecursion, in which the shape of the program is determined instead by the shape of the output.

More concretely, a function that generates “mixed output”—whether that is a union type, or simply a boolean—might be defined by case analysis over the output. A function that generates a record might be composed of subprograms that generate each of the fields of that record. A function that generates a recursive data structure from some input data might be defined with a case analysis as to whether the result is trivial, and for non-trivial cases with recursive calls to generate substructures of the result. HtDP should present explicit design recipes to address these possibilities, as it does for program structures determined by the input data.

For an example of mixed output, consider a program that may fail, such as division, guarded so as to return an alternative value in that case:

$\displaystyle \begin{array}{@{}lll} \multicolumn{3}{@{}l}{\mathit{safeDiv} :: \mathit{Integer} \rightarrow \mathit{Integer} \rightarrow \mathsf{Maybe}\;\mathit{Integer}} \\ \mathit{safeDiv}\;x\;y & \mid y = 0 & = \mathit{Nothing} \\ & \mid \mathbf{otherwise} & = \mathit{Just}\;(x \mathbin{\underline{\smash{\mathit{div}}}} y) \end{array}$

The program performs a case analysis, and of course the analysis depends on the input data; but the analysis is not determined by the structure of the input, only its value. So a better explanation of the program structure is that it is determined by the structure of the output data.

For an example of generating composite output, consider the problem of extracting a date, represented as a record:

$\displaystyle \mathbf{data}\;\mathit{Date} = \mathit{Date} \{ \mathit{day} :: \mathit{Day}, \mathit{month} :: \mathit{Month}, \mathit{year} :: \mathit{Year} \}$

from a formatted string. The function is naturally structured to match the output type:

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{readDate} :: \mathit{String} \rightarrow \mathit{Date}} \\ \mathit{readDate}\;s & = \mathit{Date}\;\{ \mathit{day} = d, \mathit{month} = m, \mathit{year} = y \} \\ & \qquad \mathbf{where}\; \begin{array}[t]{@{}ll} d & = ... s ... \\ m & = ... s ... \\ y & = ... s ... \end{array} \end{array}$

For an example of corecursion, consider the problem of “zipping” together two input lists to a list of pairs—taking ${[1,2,3]}$ and ${[4,5,6]}$ to ${[(1,4),(2,5),(3,6)]}$, and for simplicity let’s say pruning the result to the length of the shorter input. One can again solve the problem by case analysis on the input, but the fact that there are two inputs makes that a bit awkward—whether to do case analysis on one list in favour of the other, or to analyse both in parallel. For example, here is the outcome of case analysis on the first input, followed if that is non-empty by a case analysis on the second input:

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{zip} :: [\alpha] \rightarrow [\beta] \rightarrow [(\alpha,\beta)]} \\ \mathit{zip}\;x\;y &= \begin{array}[t]{@{}l} \mathbf{if}\; \mathit{null}\;x \;\mathbf{then}\; [\,] \;\mathbf{else} \\ \qquad \mathbf{if}\; \mathit{null}\;y \;\mathbf{then}\; [\,] \;\mathbf{else} \\ \qquad \qquad (\mathit{head}\;x,\mathit{head}\;y) : \mathit{zip}\;(\mathit{tail}\;x)\;(\mathit{tail}\;y) \end{array} \end{array}$

Case analysis on both inputs would lead to four cases rather than three, which would not be an improvement. One can instead solve the problem by case analysis on the output—and it is arguably more natural to do so, because there is only one output rather than two. When is the output empty? (When either input is empty.) If it isn’t empty, what is the head of the output? (The pair of input heads.) And from what data is the tail of the output recursively constructed? (The pair of input tails.)

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{zip} :: [\alpha] \rightarrow [\beta] \rightarrow [(\alpha,\beta)]} \\ \mathit{zip}\;x\;y &= \begin{array}[t]{@{}l} \mathbf{if}\; \mathit{null}\;x \lor \mathit{null}\;y \;\mathbf{then}\; [\,] \;\mathbf{else} \\ \qquad (\mathit{head}\;x,\mathit{head}\;y) : \mathit{zip}\;(\mathit{tail}\;x)\;(\mathit{tail}\;y) \end{array} \end{array}$

And whereas Insertion Sort is a structural recursion over the input list, inserting elements one by one into a sorted intermediate result, Selection Sort is a structural corecursion towards the output list, repeatedly extracting the minimum remaining element as the next element of the output: When is the output empty? (When the input is empty.) If the output isn’t empty, what is its head? (The minimum of the input.) And from what data is the tail recursively generated? (The input without this minimum element.)

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{selectSort} :: [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{selectSort}\;x & = \mathbf{if}\; \mathit{null}\;x \;\mathbf{then}\; [\,] \;\mathbf{else}\; a : \mathit{selectSort}\;(x \mathbin{\backslash\!\backslash} [a]) \\ & \qquad \mathbf{where}\; a = \mathit{minimum}\;x \end{array}$

(here, ${x \mathbin{\backslash\!\backslash} y}$ denotes list ${x}$ with the elements of list ${y}$ removed).
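The three questions above map directly onto Data.List.unfoldr, so the corecursive structure can be made explicit in code; a minimal sketch:

```haskell
import Data.List (delete, unfoldr)

-- Selection Sort as an explicit structural corecursion: unfoldr asks
-- whether the output is finished (Nothing), and otherwise produces the
-- next output element together with the data from which the rest of the
-- output is generated (Just).
selectSort :: [Integer] -> [Integer]
selectSort = unfoldr next
  where
    next [] = Nothing                -- output is empty when input is empty
    next x  = Just (a, delete a x)   -- head: the minimum; seed: the rest
      where a = minimum x
```

Here `delete a x` removes the first occurrence of `a` from `x`, playing the role of ${x \mathbin{\backslash\!\backslash} [a]}$ above.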

I’m not alone in making this assertion. Norman Ramsey wrote a nice paper On Teaching HtDP; his experience led him to the lesson that

Last, and rarely, you could design a function’s template around the introduction form for the result type. When I teach [HtDP] again, I will make my students aware of this decision point in the construction of a function’s template: should they use elimination forms, function composition, or an introduction form? They should use elimination forms usually, function composition sometimes, and an introduction form rarely.

He also elaborates on test coverage:

Check functional examples to be sure every choice of input is represented. Check functional examples to be sure every choice of output is represented. This activity is especially valuable for functions returning Booleans.

(his emphasis). We should pay attention to output data structure as well as to input data structure.

Generative recursion, aka divide-and-conquer

Only once this dual form of program structure has been explored should students be encouraged to move on to generative recursion, because this exploits both structural recursion and structural corecursion. For example, the QuickSort algorithm that is used as the main motivating example is really structured as a corecursion ${\mathit{build}}$ to construct an intermediate tree, followed by structural recursion ${\mathit{flatten}}$ over that tree to produce the resulting list:

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{quicksort} :: [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{quicksort} & = \mathit{flatten} \cdot \mathit{build} \medskip \\ \multicolumn{2}{@{}l}{\mathbf{data}\;\mathit{Tree} = \mathit{Empty} \mid \mathit{Node}\;\mathit{Tree}\;\mathit{Integer}\;\mathit{Tree}} \medskip\\ \multicolumn{2}{@{}l}{\mathit{build} :: [\mathit{Integer}] \rightarrow \mathit{Tree}} \\ \mathit{build}\;[\,] & = \mathit{Empty} \\ \mathit{build}\;(a:x) &= \mathit{Node}\;(\mathit{build}\;y)\;a\;(\mathit{build}\;z) \\ & \qquad \mathbf{where}\; \begin{array}[t]{@{}ll} y &= [ b \mid b \leftarrow x, b \le a ] \\ z &= [ b \mid b \leftarrow x, b > a ] \end{array} \medskip \\ \multicolumn{2}{@{}l}{\mathit{flatten} :: \mathit{Tree} \rightarrow [\mathit{Integer}]} \\ \mathit{flatten}\;\mathit{Empty} & = [\,] \\ \mathit{flatten}\;(\mathit{Node}\;t\;a\;u) &= \mathit{flatten}\;t \mathbin{{+}\!\!\!{+}} [a] \mathbin{{+}\!\!\!{+}} \mathit{flatten}\;u \end{array}$

The structure of both functions ${\mathit{build}}$ and ${\mathit{flatten}}$ is determined by the structure of the intermediate ${\mathit{Tree}}$ datatype, the first as structural corecursion and the second as structural recursion. A similar explanation applies to any divide-and-conquer algorithm; for example, MergeSort is another divide-and-conquer sorting algorithm, with the same intermediate tree shape, but this time with a simple splitting phase and all the comparisons in the recombining phase:

$\displaystyle \begin{array}{@{}ll} \multicolumn{2}{@{}l}{\mathit{mergesort} :: [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{mergesort} & = \mathit{mergeAll} \cdot \mathit{splitUp} \medskip \\ \multicolumn{2}{@{}l}{\mathit{splitUp} :: [\mathit{Integer}] \rightarrow \mathit{Tree}} \\ \mathit{splitUp}\;[\,] & = \mathit{Empty} \\ \mathit{splitUp}\;(a:x) &= \mathit{Node}\;(\mathit{splitUp}\;y)\;a\;(\mathit{splitUp}\;z) \\ & \qquad \mathbf{where}\; (y,z) = \mathit{halve}\;x \medskip \\ \multicolumn{2}{@{}l}{\mathit{halve} :: [\mathit{Integer}] \rightarrow ([\mathit{Integer}],[\mathit{Integer}])} \\ \mathit{halve}\;[\,] & = ( [\,], [\,]) \\ \mathit{halve}\;[a] & = ( [a], [\,]) \\ \mathit{halve}\;(a:b:x) & = (a:y,b:z) \;\mathbf{where}\; (y,z) = \mathit{halve}\;x \medskip \\ \multicolumn{2}{@{}l}{\mathit{mergeAll} :: \mathit{Tree} \rightarrow [\mathit{Integer}]} \\ \mathit{mergeAll}\;\mathit{Empty} & = [\,] \\ \mathit{mergeAll}\;(\mathit{Node}\;t\;a\;u) &= \mathit{merge}\;(\mathit{mergeAll}\;t)\;(\mathit{merge}\;[a]\;(\mathit{mergeAll}\;u)) \medskip \\ \multicolumn{2}{@{}l}{\mathit{merge} :: [\mathit{Integer}] \rightarrow [\mathit{Integer}] \rightarrow [\mathit{Integer}]} \\ \mathit{merge}\;[\,]\;y & = y \\ \mathit{merge}\;x\;[\,] & = x \\ \mathit{merge}\;(a:x)\;(b:y) & = \begin{array}[t]{@{}l@{}l@{}l} \mathbf{if}\; a \le b\; & \mathbf{then}\; & a : \mathit{merge}\;x\;(b:y) \\ & \mathbf{else}\; & b : \mathit{merge}\;(a:x)\;y \end{array} \end{array}$

(Choosing this particular tree type is a bit clunky for Merge Sort, because of the two calls to ${\mathit{merge}}$ required in ${\mathit{mergeAll}}$. It would be neater to use non-empty externally labelled binary trees, with elements at the leaves and none at the branches:

$\displaystyle \begin{array}{@{}l} \mathbf{data}\;\mathit{Tree}_2 = \mathit{Tip}\;\mathit{Integer} \mid \mathit{Bin}\;\mathit{Tree}_2\;\mathit{Tree}_2 \end{array}$

Then you define the main function to work only for non-empty lists, and provide a separate case for sorting the empty list. This clunkiness is a learning opportunity: to realise the problem, come up with a fix (no node labels in the tree), rearrange the furniture accordingly, then replay the development and compare the results.)
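One possible rendering of this neater variant, as a sketch (the names `Tree2`, `splitUp2` and `mergeAll2` are mine, halving is done with `splitAt` rather than the interleaving ${\mathit{halve}}$, and `merge` is as in the text):

```haskell
-- Leaf-labelled binary trees: elements at the tips, none at the branches.
data Tree2 = Tip Integer | Bin Tree2 Tree2

splitUp2 :: [Integer] -> Tree2          -- defined for non-empty lists only
splitUp2 [a] = Tip a
splitUp2 x   = Bin (splitUp2 y) (splitUp2 z)
  where (y, z) = splitAt (length x `div` 2) x

mergeAll2 :: Tree2 -> [Integer]
mergeAll2 (Tip a)   = [a]
mergeAll2 (Bin t u) = merge (mergeAll2 t) (mergeAll2 u)  -- one merge per node

merge :: [Integer] -> [Integer] -> [Integer]
merge [] y = y
merge x [] = x
merge (a:x) (b:y)
  | a <= b    = a : merge x (b:y)
  | otherwise = b : merge (a:x) y

mergesort2 :: [Integer] -> [Integer]
mergesort2 [] = []                      -- separate case for the empty list
mergesort2 x  = mergeAll2 (splitUp2 x)
```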

Having identified the two parts, structural recursion and structural corecursion, they may be studied separately; separation of concerns is a crucial lesson in introductory programming. Moreover, the parts may be put together in different ways. The divide-and-conquer pattern is known in the MPC community as a hylomorphism, an unfold to generate a call tree followed by a fold to consume that tree. As the QuickSort example suggests, the tree can always be deforested—it is a virtual data structure. But the converse pattern, of a fold from some structured input to some intermediate value, followed by an unfold to a different structured output, is also interesting—you can see this as a change of structured representation, so I called it a metamorphism. One simple application is to convert a number from an input base (a sequence of digits in that base), via an intermediate representation (the represented number), to an output base (a different sequence of digits). More interesting applications include encoding and data compression algorithms, such as arithmetic coding.
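The base-conversion metamorphism can be sketched in a few lines (the function names `fromBase`, `toBase` and `convert` are mine): a fold consumes the input digits into the represented number, and an unfold regenerates digits in the output base.

```haskell
import Data.List (foldl', unfoldr)

-- Metamorphism: a fold over the input structure to an intermediate value,
-- then an unfold to the output structure.  Digits are most-significant first.
fromBase :: Integer -> [Integer] -> Integer
fromBase b = foldl' (\n d -> n * b + d) 0

toBase :: Integer -> Integer -> [Integer]
toBase _ 0 = [0]
toBase b n = reverse (unfoldr step n)   -- unfoldr yields least-significant first
  where step 0 = Nothing
        step m = Just (m `mod` b, m `div` b)

convert :: Integer -> Integer -> [Integer] -> [Integer]
convert bIn bOut = toBase bOut . fromBase bIn
```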

Laziness

Although I have used Haskell as a notation, nothing above depends on laziness; it would all work as well in ML or Scheme. It is true that the mathematical structures underlying structural recursion and structural corecursion are prettier when you admit infinite data structures—the final coalgebra of the base functor for lists is the datatype of finite and infinite lists, and without admitting the infinite ones some recursive definitions have no solution. But that sophistication is beyond the scope of introductory programming, and it suffices to restrict attention to finite data structures.

HtDP already stipulates a termination argument in the design recipe for generative recursion; the same kind of argument should be required for structural corecursion (and is easy to make for the sorting examples given above). Of course, structural recursion over finite data structures is necessarily terminating. Laziness is unnecessary for co-programming.

Structured programming

HtDP actually draws on a long tradition of relating data structure and program structure, a theme dear to my own heart (and indeed, the motto of this blog). Tony Hoare wrote in his Notes on Data Structuring in 1972:

There are certain close analogies between the methods used for structuring data and the methods for structuring a program which processes that data.

Fred Brooks put it pithily in The Mythical Man-Month in 1975:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

which was modernized by Eric Raymond in his 1997 essay The Cathedral and the Bazaar to:

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious.

HtDP credits Jackson Structured Programming as partial inspiration for the design recipe approach. As Jackson wrote, also in 1975:

The central theme of this book has been the relationship between data and program structures. The data provides a model of the problem environment, and by basing our program structures on data structures we ensure that our programs will be intelligible and easy to maintain.

and

The structure of a program must be based on the structures of all of the data it processes.

(my emphasis). In a retrospective lecture in 2001, he clarified:

program structure should be dictated by the structure of its input and output data streams

—the JSP approach was designed for processing sequential streams of input records to similar streams of output records, and the essence of the approach is to identify the structures of each of the data files (input and output) in terms of sequences, selections, and iterations (that is, as regular languages), to refine them all to a common structure that matches all of them simultaneously, and to use that common data structure as the program structure. So even way back in 1975 it was clear that we need to pay attention to the structure of output data as well as to that of input data.

HtDP apologizes that generative recursion (which it identifies with the field of algorithm design)

is much more of an ad hoc activity than the data-driven design of structurally recursive functions. Indeed, it is almost better to call it inventing an algorithm than designing one. Inventing an algorithm requires a new insight—a “eureka”.

It goes on to suggest that mere programmers cannot generally be expected to have such algorithmic insights:

In practice, new complex algorithms are often developed by mathematicians and mathematical computer scientists; programmers, though, must th[o]roughly understand the underlying ideas so that they can invent the simple algorithms on their own and communicate with scientists about the others.

I say that this defeatism is a consequence of not following through on the core message, that data structure determines program structure. QuickSort and MergeSort are not ad hoc. Admittedly, they do require some insight in order to identify structure that is not present in either the input or the output. But having identified that structure, there is no further mystery, and no ad-hockery required.

Upcoming stackage LTS 13 snapshot with ghc-8.6.2

The stackage curator team has started preparations for an LTS major bump which will include ghc-8.6.2.

To give everyone some time to prepare, we are announcing the planned release date as approximately two to four weeks from the date of this post.

For more insight into how stackage curators handle these sorts of things, see: the LTS major bump process.

Validating Big Data Jobs - Stopping Failures before Production (w/ Spark, BEAM, & friends!) @ Big Data Spain

Thanks for joining me on 2018-11-14 at Big Data Spain 2018 Madrid, Spain for Validating Big Data Jobs - Stopping Failures before Production (w/ Spark, BEAM, & friends!). The slides are at https://www.slideshare.net/slideshow/embed_code/key/waR8PdjCYv8U4h. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

Big Data w/Python on Kubernetes (PySpark on K8s) @ Big Data Spain

Thanks for joining me on 2018-11-15 for Big Data w/Python on Kubernetes (PySpark on K8s). The slides are at https://www.slideshare.net/slideshow/embed_code/key/FaOvUqcsvJcFBk. And some related links: recorded demos, setup livestream. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

A Basic Introduction to PySpark Dataframes by exploring ASF Gender Diversity Data - workshop @ @pyconca

Thanks for joining us (@holdenkarau, @math_foo) on 2018-11-10 at @pyconca 2018 Toronto, ON, Canada for A Basic Introduction to PySpark Dataframes by exploring ASF Gender Diversity Data - workshop.

You can find the code for this talk at https://github.com/holdenk/diversity-analytics/. The slides are at http://bit.ly/2RNhG1g. There is a related video you might want to check out.

The video is at https://www.youtube.com/embed/lXrgN9-9jV8&t=12s. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

Thinking with Types

After five months of work, I'm happy to announce that it's done.

Hendrix teams at ACM ICPC

Today I took five students from Hendrix to Fort Smith to compete in the Mid-Central USA Regional of the ACM International Collegiate Programming Contest (ICPC). Both teams solved five out of ten problems, and finished second and fourth out of 18 teams at our site (23rd and 25th out of 121 teams in the whole region). I’m really proud of them!

A lot of fun was had by all. Thanks to Andrew Mackey and the rest of the folks at UAFS for hosting a great contest!

The Trouble with Typed Errors

You, like me, program in either Haskell, or Scala, or F#, or Elm, or PureScript, and you don’t like runtime errors. They’re awful and nasty! You have to debug them, and they’re not represented in the types. Instead, we like to use Either (or something isomorphic) to represent stuff that might fail:

data Either l r = Left l | Right r


Either has a Monad instance, so you can short-circuit an Either l r computation with an l value, or bind it to a function on the r value.

So, we take our unsafe, runtime failure functions:

head   :: [a] -> a
lookup :: k -> Map k v -> v
parse  :: String -> Integer


and we use informative error types to represent their possible failures:

data HeadError = ListWasEmpty

data LookupError = KeyWasNotPresent

lookup :: k -> Map k v -> Either LookupError v

data ParseError
= UnexpectedChar Char String
| RanOutOfInput

parse :: String -> Either ParseError Integer


Except, we don’t really use types like HeadError or LookupError. There’s only one way that head or lookup could fail. So we just use Maybe instead. Maybe a is just like using Either () a – there’s only one possible Left () value, and there’s only one possible Nothing value. (If you’re unconvinced, write newtype Maybe a = Maybe (Either () a), derive all the relevant instances, and try and detect a difference between this Maybe and the stock one).
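That exercise can be sketched as follows (`Maybe'`, `nothing'`, `just'` and `headMay'` are names I've made up for it):

```haskell
-- The exercise from the text: a Maybe built from Either () — every
-- operation on the stock Maybe has an exact counterpart here.
newtype Maybe' a = Maybe' (Either () a)
  deriving (Eq, Show)

nothing' :: Maybe' a
nothing' = Maybe' (Left ())   -- the single possible failure value

just' :: a -> Maybe' a
just' = Maybe' . Right

-- A safe head in this representation, behaving exactly like a
-- headMay :: [a] -> Maybe a would with the stock type:
headMay' :: [a] -> Maybe' a
headMay' []    = nothing'
headMay' (a:_) = just' a
```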

But, Maybe isn’t great – we’ve lost information! Suppose we have some computation:

foo :: String -> Maybe Integer
foo str = do
    c <- head str
    r <- lookup str strMap
    eitherToMaybe (parse (c : r))


Now, we try it on some input, and it gives us Nothing back. Which step failed? We actually can’t know that! All we can know is that something failed.

So, let’s try using Either to get more information on what failed. Can we just write this?

foo :: String -> Either ??? Integer
foo str = do
    c <- head str
    r <- lookup str strMap
    parse (c : r)


Unfortunately, this gives us a type error. We can see why by looking at the type of >>=:

(>>=) :: (Monad m) => m a -> (a -> m b) -> m b


The type variable m must be an instance of Monad, and the type m must be exactly the same for the value on the left and the function on the right. Either LookupError and Either ParseError are not the same type, and so this does not type check.

Instead, we need some way of accumulating these possible errors. We’ll introduce a utility function mapLeft that helps us:

mapLeft :: (l -> l') -> Either l r -> Either l' r
mapLeft f (Left l)  = Left (f l)
mapLeft _ (Right r) = Right r


Now, we can combine these error types:

foo :: String
    -> Either
        (Either HeadError (Either LookupError ParseError))
        Integer
foo str = do
    c <- mapLeft Left (head str)
    r <- mapLeft (Right . Left) (lookup str strMap)
    mapLeft (Right . Right) (parse (c : r))


There! Now we can know exactly how and why the computation failed. Unfortunately, that type is a bit of a monster. It’s verbose and all the mapLeft boilerplate is annoying.

At this point, most application developers will create a “application error” type, and they’ll just shove everything that can go wrong into it.

data AllErrorsEver
    = AllParseError ParseError
    | AllLookupError LookupError
    | AllHeadError HeadError
    | AllWhateverError WhateverError
    | FileNotFound FileNotFoundError
    | etc...


Now, this slightly cleans up the code:

foo :: String -> Either AllErrorsEver Integer
foo str = do
    c <- mapLeft AllHeadError (head str)
    r <- mapLeft AllLookupError (lookup str strMap)
    mapLeft AllParseError (parse (c : r))


However, there’s a pretty major problem with this code. foo is claiming that it can “throw” all kinds of errors – it’s being honest about parse errors, lookup errors, and head errors, but it’s also claiming that it will throw if files aren’t found, “whatever” happens, etc. There’s no way that a call to foo will result in FileNotFound, because foo can’t even do IO! It’s absurd. The type is too large! And I have written about keeping your types small and how wonderful it can be for getting rid of bugs.

Suppose we want to handle foo’s error. We call the function, and then write a case expression like good Haskellers:

case foo "hello world" of
    Right val ->
        pure val
    Left err ->
        case err of
            AllParseError parseError ->
                handleParseError parseError
            AllLookupError lookupErr ->
                handleLookupError lookupErr
            _ ->
                error "impossible?!?!?!"


Unfortunately, this code is brittle to refactoring! We’ve claimed to handle all errors, but we’re really not handling many of them. We currently “know” that these are the only errors that can happen, but there’s no compiler guarantee that this is the case. Someone might later modify foo to throw another error, and this case expression will break. Any case expression that evaluates any result from foo will need to be updated.

The error type is too big, and so we introduce the possibility of mishandling it. There’s another problem. Let’s suppose we know how to handle a case or two of the error, but we must pass the rest of the error cases upstream:

bar :: String -> Either AllErrorsEver Integer
bar str =
    case foo str of
        Right val -> Right val
        Left err ->
            case err of
                AllParseError pe ->
                    Right (handleParseError pe)
                _ ->
                    Left err


We know that AllParseError has been handled by bar, because – just look at it! However, the compiler has no idea. Whenever we inspect the error content of bar, we must either a) “handle” an error case that has already been handled, perhaps dubiously, or b) ignore the error, and desperately hope that no underlying code ever ends up throwing the error.

Are we done with the problems with this approach? No! There’s no guarantee that I throw the right error!

head :: [a] -> Either AllErrorsEver a
head []      = Left (AllLookupError KeyWasNotPresent)
head (a : _) = Right a


This code typechecks, but it’s wrong, because LookupError is only supposed to be thrown by lookup! It’s obvious in this case, but in larger functions and codebases, it won’t be so clear.

So, having a monolithic error type has a ton of problems. I’m going to make a claim here:

All error types should have a single constructor

That is, no sum types for errors. How can we handle this?

Let’s maybe see if we can make Either any nicer to use. We’ll define a few helpers:

type (+) = Either
infixr 5 +

l :: l -> Either l r
l = Left

r :: r -> Either l r
r = Right


Now, let’s refactor that uglier Either code with these new helpers:

foo :: String
    -> Either
        (HeadError + LookupError + ParseError)
        Integer
foo str = do
    c <- mapLeft l (head str)
    rest <- mapLeft (r . l) (lookup str strMap)
    mapLeft (r . r) (parse (c : rest))


Well, the syntax is nicer. We can case over the nested Either in the error branch to eliminate single error cases. It’s easier to ensure we don’t claim to throw errors we don’t – after all, GHC will correctly infer the type of foo, and if GHC infers a type variable for any +, then we can assume that we’re not using that error slot, and can delete it.
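To make “casing over the nested Either” concrete, here is one possible shape for such a handler (`catchHead` is my own name, not from any library):

```haskell
{-# LANGUAGE TypeOperators #-}

type (+) = Either

data HeadError = ListWasEmpty

-- Peeling one case off the front of a nested error sum: the result type
-- no longer mentions HeadError, so the compiler knows that case really
-- has been handled.
catchHead :: a -> Either (HeadError + e) a -> Either e a
catchHead def (Left (Left _))  = Right def   -- HeadError handled here
catchHead _   (Left (Right e)) = Left e      -- remaining errors passed on
catchHead _   (Right a)        = Right a
```

Because the handled case is removed from the type, forgetting to handle it elsewhere becomes a compile-time error rather than a runtime surprise.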

Unfortunately, there’s still the mapLeft boilerplate. And expressions which you’d really want to be equal, aren’t –

x :: Either (HeadError + LookupError) Int
y :: Either (LookupError + HeadError) Int


The values x and y are isomorphic, but we can’t use them in a do block because they’re not exactly equal. If we add errors, then we must revise all mapLeft code, as well as all case expressions that inspect the errors. Fortunately, these are entirely compiler-guided refactors, so the chance of messing them up is small. However, they contribute significant boilerplate, noise, and busywork to our program.

Boilerplate be gone!

Well, turns out, we can get rid of the order dependence and boilerplate with type classes! The most powerful approach is to use “classy prisms” from the lens package. Let’s translate our types from concrete values to prismatic ones:

-- Concrete:
head :: [a] -> Either HeadError a

-- Prismatic:
head :: (AsHeadError err) => [a] -> Either err a

-- Concrete:
lookup :: k -> Map k v -> Either LookupError v

-- Prismatic:
lookup
    :: (AsLookupError err)
    => k -> Map k v -> Either err v


Now, type class constraints don’t care about order – (Foo a, Bar a) => a and (Bar a, Foo a) => a are exactly the same thing as far as GHC is concerned. The AsXXX type classes will automatically provide the mapLeft stuff for us, so now our foo function looks a great bit cleaner:

foo :: (AsHeadError err, AsLookupError err, AsParseError err)
    => String -> Either err Integer
foo str = do
    c <- head str
    r <- lookup str strMap
    parse (c : r)


This appears to be a significant improvement over what we’ve had before! And, most of the boilerplate with the AsXXX classes is taken care of via Template Haskell:

makeClassyPrisms ''ParseError
-- this line generates the following:

class AsParseError a where
    _ParseError :: Prism' a ParseError
    _UnexpectedChar :: Prism' a (Char, String)
    _RanOutOfInput :: Prism' a ()

instance AsParseError ParseError where
    -- etc...


However, we do have to write our own boilerplate when we eventually want to concretely handle these types. We may end up writing a huge AppError that all of these errors get injected into.

There’s one major, fatal flaw with this approach. While it composes very nicely, it doesn’t decompose at all! There’s no way to catch a single case and ensure that it’s handled. The machinery that prisms give us doesn’t allow us to separate out a single constraint, so we can’t pattern match on a single error.

Once again, our types become ever larger, with all of the problems that entails.

Generics to the rescue!

What we really want is:

• Order independence
• No boilerplate
• Easy composition
• Easy decomposition

In PureScript or OCaml, you can use open variant types to do this flawlessly. Haskell doesn’t have open variants, and the attempts to mock them end up quite clumsy to use in practice.

I’m happy to say that the entire job is handled quite nicely with the amazing generic-lens package. I created a gist that demonstrates its use, but the magic comes down to this simple fact: there’s an instance of the prismatic AsType class for Either, which allows you to “pluck” a constraint off. This satisfies all of the things I wanted in my list, and we can consider representing errors mostly solved.

Mostly?

Well, ExceptT e IO a still imposes a significant runtime performance hit, and asynchronous exceptions aren’t considered here. A bifunctor IO type like newtype BIO err a = BIO (IO a) which carries the type class constraints of the errors it contains is promising, but I haven’t been able to write a satisfying interface to this yet.

I also haven’t used this technique in a large codebase yet, and I don’t know how it scales. And the technique does require you to be comfortable with lens, which is a fairly high bar for training new folks on. I’m sure that API improvements could be made to make this style more accessible and remove some of the lens knowledge prerequisites.

Capability and Suitability

Gary Bernhardt has a fantastic talk on Capability vs Suitability, where he separates advances in software engineering into two buckets:

• Capability: The ability to do new things!
• Suitability: The ability to do things well.

Capability is progressive and daring, while suitability is conservative and boring. Capability wants to create entirely new things, while suitability wants to refine existing things.

This post is going to explore a metaphor with bicycles, specifically bike tires, while we think about capability and suitability. When you get a bike, you have so many options. Tire size is one of them. You can opt for a super narrow road tire – a mere 19mm in width! Or, on the other end of the scale, you can opt for a truly fat tire at around 5” in width. What’s the difference?

Narrower tires are less capable – there is less terrain you can cover on a narrow tire. However, they’re more suitable for the terrain they can cover – a 19mm tire will be significantly lighter and faster than a 5” tire. A good 19mm tire weighs around 200g, while a 5” tire might weigh around 1,800g. Across two tires, that’s an extra 7lbs of rubber to lug around, which takes a lot of energy! Additionally, all that rubber is going to have a lot of rolling resistance – it’ll be harder to push across the ground on smooth surfaces where the 19mm tire excels.

So, most cyclists don’t use fat tire bikes. But they also don’t use 19mm skinny tires. Most road cyclists have moved up to 25 or 28mm tires. While the 19mm tires work fantastically on a perfectly smooth surface, they start suffering when the road gets bumpy. All the bumps and rough surfaces call for a slightly more capable tire. The wider tires can run lower air pressure, which lets them float over bumps rather than being bumped up and down.

So, we have two competing forces in bike tires:

• The speed and comfort on the terrain you ride most frequently
• The speed and comfort on the worst terrain you encounter regularly

You want enough capability to handle the latter, while keeping a tire that’s as suitable as possible for the former.

In computer programming, we tend to reach for the most capable thing we can get our hands on. Dynamically typed, impure, and Turing complete programming languages like Ruby, JavaScript, and Python are immensely popular. Statically typed languages are often seen as stifling, and pure languages even more so. There simply aren’t many languages that are Turing incomplete, that’s how little we like them!

Yet, these extra gains in capability are often unnecessary. There’s very little code that’s difficult to statically type with a reasonable type system. Impurity seems convenient, until you realize that you need to look at every single method call to see why the code that renders an HTML page is making an N+1 query and ruining performance. Indeed, even Turing completeness is overrated – a Turing incomplete language permits dramatically more optimizations and static analysis for bug prevention, and very few programs actually require Turing completeness.

In this sense, programmers are like cyclists that pick up the 5” tire fat bikes and then wonder why they’re moving so slow. They may ride in the snow or deep sand once or twice a year, and they stick with the 5” tire for that reason alone. Programmers that are willing to give up the capability they don’t need in order to purchase suitability they could use tend to go faster, as you might expect. Folks that learn Haskell and become sufficiently familiar with purely functional and statically typed programming tend to take those practices with them, even in impure or dynamically typed languages.

It is easier to understand what you did when you limit what you can do.

Hi Glaukon.

Hi Glaukon. There’s a bunch of stuff related to my teaching at https://drive.google.com/drive/folders/0B-qIu_nqxaMoYTQ4ODZjMjMtNWRiOC00OWZiLWI3MzMtOGIxODIyZDkxODBk?usp=sharing It’s in varying degrees of completion, but it includes an outline of units and lessons, descriptions of themes, worksheets and handouts, and some (slightly out-of-date) presentation slides, all of which have been used in past classes. There’s also a comic book that is mostly complete, that I sometimes give out as just a cool add-on for students in the class. Finally, I’ve just finished completely writing (at least a first draft of) the first unit of the online guide, which can be found at http://code.world when you first visit; or under the Guide button at the bottom of the page.

Sekiso's Functions

Sekiso said, "Pure functions exist beyond time and space. The results we achieve will be achieved always." Ummon held up his fan and replied "If you yourself do not exist beyond time and space, then how can you be so sure? Functions only have meaning when the hardware is running, and when it is not, there is space, but no time." The master overhearing replied, "Last Tuesday I typed "cabal install cabal-install" and nothing happened."

TChan vs TQueue: What's the difference?

I always forget the difference between a TChan and a TQueue. They appear to have an almost identical API, so whenever I need a concurrent message thing, I spend some time working out what the difference is. I’ve done this a few times now, and it’s about time that I write it out so that I don’t need to keep reconstructing it.

Aside: Please don’t use TChan or TQueue. These types are unbounded, which means that you can run into unbounded memory use if your producer is faster than your consumer. Instead, use TBChan or TBQueue, which allow you to set a bound. I have run into issues with livelock with the standard stm and stm-chans types, and have found that unagi-chan package has better performance in all cases, so I usually reach for that when I have a need for a high performance concurrent channel. Unfortunately, the unagi-chan variants don’t operate in STM, which can be a dealbreaker depending on your workflow.
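For example, here is a minimal sketch of the bounded alternative (the capacity of 3 is arbitrary, and demo is just an illustrative name, not an API from the post):

```haskell
import Control.Concurrent.STM
import Control.Monad (replicateM)

-- Push three values through a queue bounded at 3 elements. writeTBQueue
-- retries (blocks) once the queue is full, so a fast producer gets
-- back-pressure instead of unbounded memory growth.
demo :: IO [Int]
demo = do
  q <- newTBQueueIO 3
  atomically $ mapM_ (writeTBQueue q) [1, 2, 3]
  atomically $ replicateM 3 (readTBQueue q)

main :: IO ()
main = demo >>= print  -- [1,2,3]
```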

tl;dr: Use a channel when you want all readers to receive each message. Use a queue when you want only one reader to receive each message.

The docs for a TChan are concise:

TChan is an abstract type representing an unbounded FIFO channel.

A TQueue is like a TChan, with two important differences:

• it has faster throughput than both TChan and Chan (although the costs are amortised, so the cost of individual operations can vary a lot).
• it does not provide equivalents of the dupTChan and cloneTChan operations.

The implementation is based on the traditional purely-functional queue representation that uses two lists to obtain amortised $O(1)$ enqueue and dequeue operations.

So the docs say that TQueue is faster, but has fewer operations. Presumably, we should use a TQueue unless we need these operations. What do dupTChan and cloneTChan do? Let’s look at the Haddocks:

dupTChan :: TChan a -> STM (TChan a)


Duplicate a TChan: the duplicate channel begins empty, but data written to either channel from then on will be available from both. Hence this creates a kind of broadcast channel, where data written by anyone is seen by everyone else.

cloneTChan :: TChan a -> STM (TChan a)


Clone a TChan: similar to dupTChan, but the cloned channel starts with the same content available as the original channel.

So, what’s the point of these? Let’s write some code and see what happens.

test2 :: IO ()
test2 = do
  c <- newTChanIO
  forkIO $ do
    for_ [1..5] $ \n -> do
      i <- atomically $ readTChan c
      putStrLn $ "First thread received: " ++ show i ++ " on #: " ++ show n

  forkIO $ do
    for_ [6..10] $ \n -> do
      i <- atomically $ readTChan c
      putStrLn $ "Second thread received: " ++ show i ++ " on #: " ++ show n

  for_ [1..10] $ \i -> do
    threadDelay 10000
    atomically $ writeTChan c i


This creates a new TChan, then forks two threads. Each thread sits and waits on the TChan to have a value, and then it prints the value out. Finally we stuff the numbers 1 through 10 into the channel, with a slight delay between writes.

This is the output:

Beginning test...


Alright, so the two threads mostly just interleave their work. The values 1 through 10 are each printed exactly once, split between the two threads - a value read by one thread is not seen by the other. Let’s try using dupTChan and see what happens:

test3 :: IO ()
test3 = do
  c <- newTChanIO
  forkIO $ do
    for_ [1..5] $ \n -> do
      i <- atomically $ readTChan c
      putStrLn $ "First thread received: " ++ show i ++ " on #: " ++ show n

  forkIO $ do
    c' <- atomically $ dupTChan c
    for_ [6..10] $ \n -> do
      i <- atomically $ readTChan c'
      putStrLn $ "Second thread received: " ++ show i ++ " on #: " ++ show n

  for_ [1..10] $ \i -> do
    atomically $ writeTChan c i


This is basically the same code, but we’ve duplicated the TChan in the second thread. Here’s the new output:

Beginning test...
First thread received: 1 on #: 1
Second thread received: 1 on #: 6
Second thread received: 2 on #: 7
First thread received: 2 on #: 2
First thread received: 3 on #: 3
Second thread received: 3 on #: 8
First thread received: 4 on #: 4
Second thread received: 4 on #: 9
First thread received: 5 on #: 5
Second thread received: 5 on #: 10

Interesting! So the duplicated channel is able to receive a writeTChan, and both threads are able to see the values. So, a TChan with a dupTChan call is suitable for when you want all copies of the TChan to receive a value. A TQueue will only permit a value to be seen once, by a single thread.

There’s a variant of newTChan called newBroadcastTChan. How does it differ? The docs explain:

Create a write-only TChan. More precisely, readTChan will retry even after items have been written to the channel. The only way to read a broadcast channel is to duplicate it with dupTChan. Consider a server that broadcasts messages to clients:

serve :: TChan Message -> Client -> IO loop
serve broadcastChan client = do
    myChan <- dupTChan broadcastChan
    forever $ do
        message <- readTChan myChan
        send client message


The problem with using newTChan to create the broadcast channel is that if it is only written to and never read, items will pile up in memory. By using newBroadcastTChan to create the broadcast channel, items can be garbage collected after clients have seen them.

The last paragraph is the important part. A standard TChan will accumulate messages until that copy of the TChan is read. A broadcast TChan will not accumulate messages on the write end of the channel. This points to an important performance concern:

• If you dupTChan a channel, you must read from every duplicate of the TChan in order to avoid a memory leak.
• If you intend on having a write-only end, you must use newBroadcastTChan for any channel that you won’t read from.

TQueue avoids this problem as it cannot be duplicated.
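To make the write-only behaviour concrete, here is a minimal sketch (demo is an illustrative name, not from the stm docs):

```haskell
import Control.Concurrent.STM

-- One broadcast write end, two duplicated read ends. Both readers see the
-- same message, and nothing accumulates on the write end itself because it
-- was created with newBroadcastTChanIO.
demo :: IO (String, String)
demo = do
  bc <- newBroadcastTChanIO
  r1 <- atomically $ dupTChan bc
  r2 <- atomically $ dupTChan bc
  atomically $ writeTChan bc "hello"
  m1 <- atomically $ readTChan r1
  m2 <- atomically $ readTChan r2
  return (m1, m2)

main :: IO ()
main = demo >>= print  -- ("hello","hello")
```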

Haskell Developer Position at Karamaan Group (Full-time)

Karamaan Group is an investment firm based in Manhattan. We are looking for an outstanding software developer to develop tools for financial analysis and knowledge management. Since we are investors and not traders, our focus is not on building trading systems, but on business and data analysis. We strongly value people who take pride in their work and have a craftsman’s dedication to building bespoke products. Our ideal candidate is an experienced software developer with a penchant for abstraction and analysis, a solid sense of organization, and an ability to communicate and persuade. While we are looking for a candidate who demonstrates an intense focus on quality, she also needs a master engineer’s intuition for how to make trade offs that are a necessary part of day-to-day software development.

Candidates should have some formal training in a quantitative field and a keen interest in building robust and elegant software. This is a high-impact, high-visibility position where successful candidates will be entrusted with a lot of responsibility for products that have a direct effect on the P&L of the firm and influence on our workflow.

The ideal candidate will have production experience with Haskell or another functional language, relational databases and relational data modeling, Python, and Java. Additionally, as this is a startup environment, the ideal candidate will be comfortable filling many different technology roles, and contributing to the success of the firm wherever they can, not just within the confines of their job description.

Karamaan Group is an investment adviser and manager catering to family offices and institutional investors. We value innovative thinking, encourage an entrepreneurial atmosphere, and enjoy spirited discussions.

Get information on how to apply for this position.

Announcing Profiterole - GHC Profile Viewer

Summary: profiterole reformats GHC profile reports so they are easier to read.

Do you often work with GHC time profiling reports? Do you find them excessively long and hard to navigate? Profiterole reads standard GHC .prof files and generates both textual and HTML reports which are typically more than 10x smaller. As an example compare HLint profile input to HLint Profiterole output.

Usage

To run, first install (cabal update && cabal install profiterole), generate a GHC profile the normal way, then run:

profiterole myprogram.prof

Profiterole will generate myprogram.profiterole.txt and myprogram.profiterole.html - both contain the same information, but the HTML has hyperlinks. There are three columns of numbers:

• TOT is the total time spent in any item under this code, what GHC calls inherited time.
• INH is the total time spent in the items that Profiterole did not move out to the top level.
• IND is the individual time, just like GHC profiles.

For large programs, using +RTS -P (instead of the common -p) will give more accurate results.

How it works

Profiterole aims to make the profile shorter by combining common subtrees and lifting them to the root - e.g. if you call parseFile from 7 places in the code, instead of having 7 pieces of parseFile profiling, Profiterole will give you one. With only 1 place containing parseFile, it's easier to optimise parseFile, and it's easier to read the code calling it without getting lost in the internals.

How to profile

Given profile data, different ways of looking at it reveal different insights, and the ones discovered by Profiterole have definitely had value. I tend to use:

• I first use Profiteur to get an overall sense of where the time is going visually. Profiteur lets me orientate myself, but tends to be a little difficult to drill into the details and repeat experiments.
• I then use Profiterole to see if there were any repeated pieces of code Profiteur missed, and then dig into the details using Profiterole.
• Only if I'm really going into the details do I go to the GHC .prof output - it's pretty rare.

Running from the past

Preface

Functional programming encourages us to program without mutable state. Instead we compose functions that can be viewed as state transformers. It's a change of perspective that can have a big impact on how we reason about our code. But it's also a change of perspective that can be useful in mathematics and I'd like to give an example: a really beautiful technique that allows you to sample from the infinite limit of a probability distribution without needing an infinite number of operations. (Unless you're infinitely unlucky!)

Markov Chains

A Markov chain is a sequence of random states where each state is drawn from a random distribution that possibly depends on the previous state, but not on any earlier state. So it is a sequence $X_0, X_1, X_2, \ldots$ such that $P(X_{n+1} = x \mid X_0, X_1, \ldots, X_n) = P(X_{n+1} = x \mid X_n)$ for all $n$. A basic example might be a model of the weather in which each day is either sunny or rainy but where it's more likely to be rainy (or sunny) if the previous day was rainy (or sunny). (And to be technically correct: having information about two days ago or earlier doesn't help us if we know yesterday's weather.)

Like imperative code, this description is stateful. The state at step $n+1$ depends on the state at step $n$. Probability is often easier to reason about when we work with independent identically drawn random variables, and our $X_i$ aren't of this type. But we can eliminate the state from our description using the same method used by functional programmers.

Let's choose a Markov chain to play with. I'll pick one with 3 states called $A$, $B$ and $C$, and with transition probabilities given by $P(X_{n+1} = j \mid X_n = i) = T_{ij}$ where

$$T = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{pmatrix}$$

Here's a diagram illustrating our states:

Implementation

First some imports:

> {-# LANGUAGE LambdaCase #-}
> {-# LANGUAGE TypeApplications #-}
> import Data.Sequence(replicateA)
> import System.Random
> import Control.Monad.State
> import Control.Monad
> import Data.List
> import Data.Array
And now the type of our random variable:

> data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)
We are now in a position to simulate our Markov chain. First we need some random numbers drawn uniformly from [0, 1]:

> uniform :: (RandomGen gen, MonadState gen m) => m Double
> uniform = state random
And now the code to take a single step in the Markov chain:

> step :: (RandomGen gen, MonadState gen m) => ABC -> m ABC
> step A = do
>     a <- uniform
>     if a < 0.5
>         then return A
>         else return B
> step B = do
>     a <- uniform
>     if a < 1/3.0
>         then return A
>         else if a < 2/3.0
>             then return B
>             else return C
> step C = do
>     a <- uniform
>     if a < 0.5
>         then return B
>         else return C
Notice how the step function generates a new state at random in a way that depends on the previous state. The m ABC in the type signature makes it clear that we are generating random states at each step.

We can simulate the effect of taking steps with a function like this:

> steps :: (RandomGen gen, MonadState gen m) => Int -> ABC -> m ABC
> steps 0 i = return i
> steps n i = do
>     i <- steps (n-1) i
>     step i
We can run for 100 steps, starting with $A$, with a line like so:

*Main> evalState (steps 100 A) gen
B
The starting state of our random number generator is given by gen.

Consider the distribution of states after taking $n$ steps. For Markov chains of this type, we know that as $n$ goes to infinity the distribution of the $n$th state approaches a limiting "stationary" distribution. There are frequently times when we want to sample from this final distribution. For a Markov chain as simple as this example, you can solve exactly to find the limiting distribution. But for real world problems this can be intractable. Instead, a popular solution is to pick a large $n$ and hope it's large enough. As $n$ gets larger the distribution gets closer to the limiting distribution. And that's the problem I want to solve here - sampling from the limit. It turns out that by thinking about random functions instead of random states we can actually sample from the limiting distribution exactly.

Some random functions

Here is a new version of our random step function:

> step' :: (RandomGen gen, MonadState gen m) => m (ABC -> ABC)
> step' = do
>     a <- uniform
>     return $ \case
>         A -> if a < 0.5 then A else B
>         B -> if a < 1/3.0
>                 then A
>                 else if a < 2/3.0 then B else C
>         C -> if a < 0.5 then B else C

In many ways it's similar to the previous one. But there's one very big difference: the type signature m (ABC -> ABC) tells us that it's returning a random function, not a random state. We can simulate the result of taking 10 steps, say, by drawing 10 random functions, composing them, and applying the result to our initial state:

> steps' :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps' n = do
>     fs <- replicateA n step'
>     return $ foldr (flip (.)) id fs
Notice the use of flip. We want to compose functions $f_{n-1} \circ \cdots \circ f_1 \circ f_0$, each time composing on the left by the new $f$. This means that for a fixed seed gen, each time you increase $n$ by 1 you get the next step in a single simulation: (BTW I used replicateA instead of replicateM to indicate that these are independent random draws. It may be well known that you can use Applicative instead of Monad to indicate independence but I haven't seen it written down.)

*Main> [f A | n <- [0..10], let f = evalState (steps' n) gen]
[A,A,A,B,C,B,A,B,A,B,C]
When I first implemented this I accidentally forgot the flip. So maybe you're wondering what effect removing the flip has? The effect is about as close to a miracle as I've seen in mathematics. It allows us to sample from the limiting distribution in a finite number of steps!

Here's the code:

> steps_from_past :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps_from_past n = do
>   fs <- replicateA n step'
>   return $ foldr (.) id fs

We end up building $f_0 \circ f_1 \circ \cdots \circ f_{n-1}$. This is still a composition of independent identically distributed functions and so it's still drawing from exactly the same distribution as steps'. Nonetheless, there is a difference: for a particular choice of seed, steps_from_past n no longer gives us a sequence of states from a Markov chain. Running with argument $n$ draws a random composition of $n$ functions. But if you increase $n$ by 1 you don't add a new step at the end. Instead you effectively restart the Markov chain with a new first step generated by a new random seed. Try it and see:

*Main> [f A | n <- [0..10], let f = evalState (steps_from_past n) gen]
[A, A, A, A, A, A, A, A, A, A]

Maybe that's surprising. It seems to get stuck in one state. In fact, we can try applying the resulting function to all three states.

*Main> [fmap f [A, B, C] | n <- [0..10], let f = evalState (steps_from_past n) gen]
[[A,B,C],[A,A,B],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A]]

In other words, for large enough $n$ we get the constant function. Think of it this way: if f isn't injective then it's possible that two states get collapsed to the same state. If you keep picking random f's it's inevitable that you will eventually collapse down to the point where all arguments get mapped to the same state. Once this happens, we'll get the same result no matter how large we take $n$. If we can detect this then we've found the limit of $f_0 \circ f_1 \circ \cdots \circ f_{n-1}$ as $n$ goes to infinity. But because we know composing forwards and composing backwards lead to draws from the same distribution, the limiting backward composition must actually be a draw from the same distribution as the limiting forward composition. That flip can't change what probability distribution we're drawing from - just the dependence on the seed.

So the value the constant function takes is actually a draw from the limiting stationary distribution. We can code this up:

> all_equal :: (Eq a) => [a] -> Bool
> all_equal [] = True
> all_equal [_] = True
> all_equal (a : as) = all (== a) as

> test_constant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool
> test_constant f =
>     all_equal $ map f $ enumFromTo minBound maxBound

This technique is called coupling from the past. It's "coupling" because we've arranged that different starting points coalesce. And it's "from the past" because we're essentially answering the question of what the outcome of a simulation would be if we started infinitely far in the past.

> couple_from_past :: (RandomGen gen, MonadState gen m, Enum a, Bounded a, Eq a) =>
>     m (a -> a) -> (a -> a) -> m (a -> a)
> couple_from_past step f = do
>     if test_constant f
>         then return f
>         else do
>             f' <- step
>             couple_from_past step (f . f')

We can now sample from the limiting distribution a million times, say:

*Main> let samples = map ($ A) $ evalState (replicateA 1000000 (couple_from_past step' id)) gen

We can now count how often A appears:

*Main> fromIntegral (length $ filter (== A) samples) / 1000000
0.285748
That's a pretty good approximation to $2/7$, the exact answer that can be found by finding the eigenvector of the transition matrix corresponding to an eigenvalue of 1.
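As a quick sanity check, we can also reach $2/7$ by brute force: iterate the transition matrix $T$ from earlier on an initial distribution (a sketch; stepDist is an ad hoc helper name, and 100 iterations is an arbitrary but ample cutoff):

```haskell
-- The transition matrix T from the text: rows index the current state
-- (A, B, C), columns the next state.
t :: [[Double]]
t = [[1/2, 1/2, 0], [1/3, 1/3, 1/3], [0, 1/2, 1/2]]

-- One step of the distribution: p'_j = sum_i p_i * T_ij.
stepDist :: [Double] -> [Double]
stepDist p = [sum [pi_ * (row !! j) | (pi_, row) <- zip p t] | j <- [0, 1, 2]]

main :: IO ()
main = print (iterate stepDist [1, 0, 0] !! 100)  -- approaches [2/7, 3/7, 2/7]
```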

> gen = mkStdGen 669

Notes

The technique of coupling from the past first appeared in a paper by Propp and Wilson. The paper Iterated Random Functions by Persi Diaconis gave me a lot of insight into it. Note that the code above is absolutely not how you'd implement this for real. I wrote the code that way so that I could switch algorithm with the simple removal of a flip. In fact, with some clever tricks you can make this method work with state spaces so large that you couldn't possibly hope to enumerate all starting states to detect if convergence has occurred. Or even with uncountably large state spaces. But I'll let you read the Propp-Wilson paper to find out how.

What’s the Difference? video and slides

At ICFP in St. Louis I presented my paper with Kenny Foner, What’s the Difference? A Functional Pearl on Subtracting Bijections. Many people seemed to enjoy the talk and we got lots of great feedback afterwards. We will probably write up an extended version of the paper for JFP, to include extra stuff we could not fit in the paper and to incorporate things we learned at ICFP.

(Fun fact: this was actually my first ICFP talk—though not my first ICFP paper. I am glad I have a bunch of experience giving talks at this point; I can remember a time in my life when giving a talk on that stage would have been terrifying.)

Here is the video of the talk:

You can download my slides here. This version of the slides includes speaking notes—it’s not exactly what I ended up saying, but it should give you a decent idea if you just want to read the slides without watching the video.

Is Rust functional?

In the past few months, and in particular in the past two weeks, I’ve gotten a number of people asking me the question: Is Rust a functional programming language? This makes sense: I’m a big advocate of functional programming, I work at a company with FP in its name, my primary programming language is Haskell, and yet I use and enjoy Rust. So is Rust consistent with everything else in my FP-centric world, or is it my dirty vice to get away from it all?

October 17, 2018

FP Complete

Our monthly webinar series is really starting to take off, and judging by the registration numbers we might be forced into allowing more than 500 people to register for our next webinar, which could be about Rust - don't hold us to that ;) Roman's webinar, Development Workflows in Haskell, had 434 registrants, which makes it our most popular webinar so far.

Announcing Shake 0.17

I'm delighted to announce Shake 0.17. As always, the full changelog is on GitHub, but I'd like to highlight three areas that have seen most attention.

Error Messages

Error messages now support GHC's HasCallStack feature, giving code locations in error messages. As an example, let's define rules for both *.txt and overlap.*, then try and build overlap.txt. With Shake 0.17 we get the far more informative error:

Error when running Shake build system:
  at Example.hs:50:46-55:
* Depends on: overlap.txt
* Raised the exception:
Build system error - key matches multiple rules:
  Key type:       FileQ
  Key value:      overlap.txt
  Rules matched:  2
  Rule 1:         "overlap.*" %> at Example.hs:21:94-106:
  Rule 2:         "*.txt" %> at Example.hs:24:94-106:
Modify your rules so only one can produce the above key

We can see where the dependency was introduced (line 50), where the rules were defined (lines 21 and 24), and what their patterns were.

The Database module

The new module Development.Shake.Database provides operations for working with open Shake databases - meaning you can now open the database, run some actions against it, and shut it down. Unlike before, you can now run against an open database repeatedly, and query the resulting database for live or erroneous files. When combined with the new feature that /dev/null for shakeFiles results in no on-disk representation of Shake, you can create an in-memory database, execute it many times, then throw it away. These features aren't targeted at build systems, but allow reuse of Shake in other domains.

If you are using the Database module, and have a way to observe changes interactively, the deprioritize function may be of use, to cause Shake to build some unimportant rules last.

This work was supported by Digital Asset.

Changes to Builtin Rules

Most users shouldn't need to define their own types of rules, but for those who do, the biggest improvement is probably the better documentation in Development.Shake.Rule, complete with a worked example. At the same time, the builtin rules have changed slightly in incompatible ways - the docs explain the new state. These changes are intended to set the stage for Cloud Shake, following the pattern described in Build Systems a la Carte. I hope that a forthcoming release of Shake will provide an actual implementation of Cloud Shake.

Zipping streams

Writing the following is easy after glancing through the documentation for conduit:

foo = let src = mapM_ C.yield [0..9 :: Int]
          p0 = CC.map (\ i -> ("p0", succ i))
          p1 = CC.filter odd .| CC.map (\ i -> ("p1", i))
          p = C.getZipConduit $ C.ZipConduit p0 <* C.ZipConduit p1
          sink = CC.mapM_ print
      in C.runConduit $ src .| p .| sink

Neither pipes nor streaming make it as easy to figure out. I must be missing something! What functions should I be looking at?

[jhomitso] Haskell pack int to bytes

The following Haskell code demonstrates packing a 32-bit unsigned integer (Word32) into a (lazy) ByteString with big-endian byte ordering, using the Put monad of Data.Binary in the binary package.  Essentially, runPut . putWord32be is the Word32 -> ByteString conversion function.

module Main(main) where {
import qualified Data.Binary.Put;
import Data.Word(Word32);
import qualified Data.ByteString.Lazy;

value :: Word32;
value = 65539; -- 3 + 2^16

main :: IO();
main = out_bytestring $ Data.Binary.Put.runPut $ Data.Binary.Put.putWord32be value;
{- expected output: 0 1 0 3 -}

out_bytestring :: Data.ByteString.Lazy.ByteString -> IO();
out_bytestring = putStrLn . unwords . map show . Data.ByteString.Lazy.unpack;
}

This code is analogous to Perl's pack function.  Here is the mapping from Perl's pack template characters to Haskell functions:

L = putWord32host
N = putWord32be (network, big-endian)
V = putWord32le (x86, vax, little-endian)

To convert from lazy to strict ByteString, do Strict.concat . Lazy.toChunks
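To see the N vs V byte orderings side by side, here is a small sketch packing the same value both ways (packWith is an ad hoc helper name, not part of the binary package):

```haskell
import qualified Data.Binary.Put as Put
import qualified Data.ByteString.Lazy as Lazy
import Data.Word (Word8, Word32)

value :: Word32
value = 65539  -- 3 + 2^16, as in the example above

-- Run a Put serializer on value and return the raw bytes.
packWith :: (Word32 -> Put.Put) -> [Word8]
packWith f = Lazy.unpack (Put.runPut (f value))

main :: IO ()
main = do
  print (packWith Put.putWord32be)  -- "N", big-endian:    [0,1,0,3]
  print (packWith Put.putWord32le)  -- "V", little-endian: [3,0,1,0]
```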

Chris Smith 2

I gave a brief informal talk about fixpoints and their uses in functional programming at the Norcross Haskathon yesterday. I ended up promising to write it up as a blog post, as well. This is that post.

Just to set expectations from the beginning:

• This is not research; it’s just exposition of things that are already widely known in programming language theory. If you eat denotational semantics for breakfast, nothing here will be new.
• This is also not an applied post. You won’t learn (at least directly) how to build better software in Haskell. You might, though, learn a new way of thinking about what your existing Haskell code means.

If you’re still with me, let’s go!

What is a fixpoint?

If you have a function from any set to itself, then a fixpoint of that function is any input that maps to itself. On a standard graph, you can think of these as the values that fall on the line y = x.

<figure><figcaption>Fixpoints of a simple function (from Wikimedia Commons)</figcaption></figure>

Here are a few examples:

• Consider the function f(x) = x². It has two fixpoints: 0 and 1. These are the two real numbers that don’t change when you square them. (Negative numbers become positive, so they can’t work. Numbers between 0 and 1 become smaller. Numbers larger than 1 get larger.)
• Now consider the cosine function, with its argument in radians. You may have unknowingly calculated the fixpoint of this function when you were bored in high school math class. If you start with any number, and repeatedly press the cosine button on your calculator, you’ll notice that the answer converges to a number around 0.739. This is the only fixpoint of the cosine function.
• Finally, consider g(x) = x + 1. This function has no fixpoints, since if g(x) = x, then subtracting x from both sides, you could conclude that 0 = 1.
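The calculator game from the cosine example is easy to replay in code (a sketch; 100 iterations is an arbitrary but ample cutoff, and dottie is just a descriptive name):

```haskell
-- Repeatedly applying cos converges to its unique fixpoint, sometimes
-- called the Dottie number, roughly 0.739085.
dottie :: Double
dottie = iterate cos 1.0 !! 100

main :: IO ()
main = do
  print dottie
  print (abs (cos dottie - dottie) < 1e-9)  -- True: numerically a fixpoint
```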

In general, deciding whether a function has fixpoints or not can be difficult. So there are many different fixpoint theorems. For example, Brouwer’s fixpoint theorem says that any continuous function from a closed ball into itself in Euclidean space must have a fixpoint. Another, which will be important in a bit, is the Knaster-Tarski theorem, which says that any order-preserving function on a complete lattice has a complete lattice of fixpoints. But more on that in a bit.

That’s what a fixpoint is, but why should we care about it? Are fixpoints more interesting than a children’s amusement with the cosine button of a calculator? In fact, they are.

A really beautiful sequence of the MIT text Structure and Interpretation of Computer Programs deals with progressively more abstract representations of algorithms for computing a square root. Heron of Alexandria proposed an algorithm as follows. To compute the square root of n: (a) start with a guess, a; (b) compare a with n / a; (c) if they are close enough to each other, then stop; otherwise, average the two and repeat with that as your guess.

For example, if we start by guessing that the square root of 10 is 5, then we go through these steps:

1. a = 5, but n / a = 2. Those are not close, so our new guess is 7/2.
2. a = 7/2 (3.5), but n / a = 20/7 (about 2.86). Those are not close, so our new guess is 89/28.
3. a = 89/28 (about 3.18), but n / a = 280/89 (about 3.15). Those are not quite close enough, so our new guess is 15761/4984.
4. a = 15761/4984 (about 3.16), and n / a = 49840/15761 (also about 3.16). Those are pretty close, so we can stop.

See what’s happening? Just like with the cosine button, we’re repeating a computation to make a guess better until it converges. The point where it converges is a fixpoint. But this time, the computation was chosen so that its fixpoint is the square root of n, giving a technique for calculating square roots.

To express this, we can write a Haskell implementation built around the notion of searching for a fixpoint.

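A minimal sketch of such an implementation (the names fixpoint and sqrt' are my own; the CodeWorld version may differ):

```haskell
-- Iterate f from a starting guess until successive values are within a
-- tolerance; the point where the iteration settles is (approximately)
-- a fixpoint of f.
fixpoint :: (Double -> Double) -> Double -> Double
fixpoint f guess
  | abs (next - guess) < 1e-9 = next
  | otherwise                 = fixpoint f next
  where next = f guess

-- Heron's method: the square root of n is the fixpoint of \a -> (a + n/a)/2.
sqrt' :: Double -> Double
sqrt' n = fixpoint (\a -> (a + n / a) / 2) 1

main :: IO ()
main = print (sqrt' 10)
```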

You can play with this code using CodeWorld, and compute your own square roots.

This is sort of a funky way to compute a fixpoint, though.

• First, it requires a starting value, which is a wart in the API. A function’s fixpoints don’t really depend on any particular starting value.
• Second, it relies on the function to converge — to produce a value closer to a fixpoint in its result than its input. This is the reason the average is there at all, instead of just using n/a, which has the right fixpoint but doesn’t converge to that fixpoint.
• Third, the function may not even have a fixpoint at all! Here, a fixpoint exists for Double because it has finite precision. But the function f(x) = (x + 10/x) / 2 has no fixpoint in the rational numbers, so there’s no chance this would also work with Rational.

These problems are hard to overcome in the current form, but it turns out that by creatively modifying some definitions, we can distill the essence of this into a more abstract combinator, which is Haskell’s fix.

In the Data.Function module of Haskell’s base library, there’s a surprising function.

fix :: (a -> a) -> a

This function computes a fixpoint of any Haskell function. It does it without any of the disadvantages of the version above, and even without any constraints on the types! But how can this be? First of all, even demonstrating the existence of fixpoints is so difficult that it requires heavy mathematical theorems. And even worse, some functions (like g(x) = x + 1, above) have no fixpoints at all!

The trick is that Haskell functions aren’t just ordinary functions on ordinary sets. Haskell functions can throw exceptions, and they can loop indefinitely. To reconcile this with the world of mathematical functions, we say that a Haskell function works not just on the set of ordinary values of its type, but on the set of all partially defined values.

Here’s how that works. In math, the set of integers is {0, 1, -1, 2, -2, …}. That’s it! There is nothing else in that set. But in Haskell, we say that the type Integer includes one more value: ⊥. This special value is used to indicate that the function doesn’t produce a true number at all, instead either running forever in an infinite loop, or throwing an exception. In essence, ⊥ means “I don’t know.”

Once ⊥ is taken into account, g(x) = x + 1 does actually have a fixpoint, because g(⊥) = ⊥. That is, if you don’t know what number is passed in, you also don’t know what number comes back out. And this is, in fact, the fixpoint you get from fix. The expression fix g loops forever instead of producing a value, so we say its value is ⊥.

In fact, any time that f(⊥) = ⊥, fix f will produce ⊥. (This is unfortunate if we’re looking for actual numbers as fixpoints like we are above. But later, we’ll be able to recover our original notion of fixpoint in terms of this new one!) This might make you wonder why fix is useful at all! After all, if you don’t know the input to a function, can you ever know the output? Are there any functions for which ⊥ is not a fixpoint? Well, yes there are, and it’s all because of laziness. Consider a constant function, like h(x) = 42. Because it’s a non-strict language, Haskell will produce an output of 42 without even looking at the input. In other words, h(⊥) = 42. ⊥ is not a fixpoint. And, in fact, fix h is 42 for this function.
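We can check this behavior directly with a minimal example of my own:

```haskell
import Data.Function (fix)

-- A constant function never inspects its argument, so laziness lets
-- fix return an answer immediately: fix h = h (fix h) = 42.
h :: Int -> Int
h _ = 42

main :: IO ()
main = print (fix h)  -- prints 42
```

Because h ignores its input, h(⊥) = 42, and 42 is the least (indeed the only) fixpoint.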

The definedness ordering

With functions on simple types like Double, these are the only two possibilities: if the function is constant, then fix will produce the constant value, and otherwise it will produce ⊥. (The word for these simple types is “flat”; either you know nothing about the value, or you know everything.) But the result is more interesting when the types have more structure. Given a linked list, for example, there are many possible values where different parts of the value are unknown:

• ⊥ means that nothing about the list is known.
• ⊥ : ⊥ means that the list is known to be non-empty, but neither the first element nor the rest of the list are known. This is more information than you have about ⊥ (which might be an empty list).
• 3 : ⊥ means that the list starts with a 3, but you don’t know what (if anything) comes after that. This is strictly more information than ⊥ : ⊥.
• ⊥ : [] (also known as [⊥]) means that the list is known to only contain one element, but it’s not known what that element is. This is again strictly more information than ⊥ : ⊥, but it’s incomparable to 3 : ⊥. They are both more info than just whether the list is empty, but neither is strictly more informative than the other. They provide different information.
• 3 : [] (also known as [3]) is strictly more information than either of ⊥ : [] or 3 : ⊥.

This defines a partial order on Haskell values. It’s different from the order given by Ord, which is the usual meaningful order for that type. This order sorts the values by how much we know about them. So in this order, neither of 3 or 4 is less than the other (they are both completely defined, just different); but ⊥ is less than ⊥ : ⊥, which is in turn less than either 3 : ⊥ or ⊥ : [], and each of these is in turn less than 3 : [].

In fact, any Haskell type has a complete semilattice of values in this definedness order, which just means that given any nonempty collection of values, there is always some unique greatest lower bound: the most-defined possible value that is still no more defined than — and hence compatible with — each of them. For example, for lists of numbers, the greatest lower bound of the set {[3, 2], [3, 1, 5], [3, 4, 5, ⊥]} is 3 : ⊥ : ⊥, since all three lists are compatible with this value, but making it any more defined would have to exclude some of them.

An important attribute of Haskell functions is that they preserve this definedness order. In other words, if x is less than or equal to y in this order — meaning that x is compatible with y but possibly less defined — then f(x) is also less than or equal to f(y). If you think about it, this makes sense. If knowing a little bit about the input gives you some knowledge about the output, then learning more about the input might tell you more about the output, but it should never contradict or take away from what you already knew.

Now here’s where everything starts to come together. The fix combinator always produces the least fixpoint in this definedness ordering. This least fixpoint will be guaranteed to exist by the Knaster-Tarski theorem mentioned earlier, which says that any order-preserving function on a complete semilattice must also have a complete semilattice of fixpoints — and in particular, there must be a least one of them. Definedness is a complete semilattice, and all Haskell functions are order-preserving, and that’s good enough to guarantee that the least fixpoint exists.

Let’s look at another example. Define threeAnd list = 3 : list. A fixpoint here is a list that is not changed at all by prepending a 3 onto it. It can’t be ⊥, because 3 : ⊥ is definitely not the same as ⊥. The answer is an infinite list of 3s, so fix threeAnd gives you exactly that: an infinite list of 3s. We can check this with GHCi.

ghci> import Data.Function
ghci> fix (3 :)
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ^CInterrupted
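Since fix threeAnd is an infinite list, a safer way to inspect it is to take a finite prefix (my own example, leaning on the same laziness discussed above):

```haskell
import Data.Function (fix)

-- Prepending a 3 leaves an infinite list of 3s unchanged,
-- so that list is the (least non-trivial) fixpoint.
threeAnd :: [Int] -> [Int]
threeAnd list = 3 : list

main :: IO ()
main = print (take 5 (fix threeAnd))  -- prints [3,3,3,3,3]
```

take forces only the first five cons cells, so the program terminates even though the fixpoint itself is infinite.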

Fixpoints and recursion

The reason that fixpoints play such a dominant role in functional programming is that they are intricately related to recursion, and that relationship is an important bridge between the operational realm — understanding what the program does — and the denotational realm — understanding what the program means.

In the operational sense, Haskell’s fixpoints can be defined using recursion, like this:

fix f = x  where x = f x

This definition seems almost too cute to work, but it does! It just starts with an x, and then keeps replacing every x with f(x). After an infinite number of substitutions, we find that the least fixpoint of f is just f(f(f(f(f(f(f(f(…)))))))). In other words, any time we need to refer to the input of the function, we just substitute another function application instead.

In the opposite direction, though, general recursion can be defined in terms of the fixpoint combinator. Suppose you have a programming language with no direct recursion allowed, but you are given a fixpoint combinator. Then there’s a simple syntactic sugar for recovering general recursion. Just take your recursive definition, like x = ... x ..., and rewrite it with an extra parameter that takes the place of recursive uses: x1 x0 = ... x0 .... Notice that x1 is not defined recursively. But setting x = fix x1 satisfies the equation given for x.
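For example (my own illustration, not from the article), a recursive factorial can be rewritten this way:

```haskell
import Data.Function (fix)

-- Recursive definition: factorial n = if n <= 1 then 1 else n * factorial (n - 1)
-- Pull the recursive use out into the extra parameter `self`;
-- note that factorialF is not defined recursively.
factorialF :: (Integer -> Integer) -> (Integer -> Integer)
factorialF self n = if n <= 1 then 1 else n * self (n - 1)

-- Tie the knot: fix factorialF satisfies the original recursive equation.
factorial :: Integer -> Integer
factorial = fix factorialF

main :: IO ()
main = print (factorial 5)  -- prints 120
```

Here factorial = fix factorialF plays the role of x = fix x1 in the rewriting described above.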

In the untyped lambda calculus, fix is known as the Y combinator, and it’s a crucial step to showing that the untyped lambda calculus is computationally universal. That’s because lambda calculus doesn’t have general recursion as an a priori language feature. But Y combinators can be built within the untyped lambda calculus itself, so that general recursion can be simulated with fixpoints as described above.

Once you add types, things change a little bit. It’s impossible to define a Y combinator directly in most sound typed lambda calculi, because it would allow for nontermination, which typed lambda calculi usually try to avoid. When these calculi are used as the basis for general programming languages, the first step is usually to introduce some kind of general recursion (such as a recursive let) to allow for nonterminating programs to be written.

There’s still an important place for a fixpoint combinator, though! If we care only about what our programs do, then adding recursion might be sufficient. But recursion literally means defining things in terms of themselves, and that’s not the safest thing to do if you would like words to keep having meanings that make sense. So what does a recursive definition even mean?

Fortunately, the Knaster-Tarski theorem guarantees that any order-preserving function has a unique least fixpoint in the definedness order! So, at least for the pure fragment of Haskell, it’s okay to just define the meaning of a recursive definition to be the least fixpoint of the related non-recursive function obtained by pulling recursive uses into a parameter, exactly as we did above. This ensures that any recursive definition can be given a solid meaning, even though at face value it’s just a circular definition.

(Incidentally, fixpoints are a nice way out of all kinds of self-referential paradoxes like this — what Hofstadter called “strange loops”. So next time you’re debating your science fiction geek friends about time travel and the integrity of the space-time continuum, remember that there’s no paradox in time travel as long as the universe is a fixpoint of the function obtained by pulling the back references out as parameters.)

Looping back, then, we could now redefine the fixpoint function from the introduction — using iterative refinement to close in on the right value of the square root — using the least fixpoint of a Haskell function.
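As a sketch of what that could look like (my own code, not the article's; the convergence tolerance is arbitrary), the iterative refinement loop can itself be written as the least fixpoint of a non-recursive step function:

```haskell
import Data.Function (fix)

-- The loop body, with the recursive call abstracted as `loop`.
-- fix supplies the recursion, so nothing here is defined in terms
-- of itself.
sqrtVia :: Double -> Double
sqrtVia n = fix step 1
  where
    step loop x
      | abs (x - next) < 1e-9 = next
      | otherwise             = loop next
      where
        next = (x + n / x) / 2

main :: IO ()
main = print (sqrtVia 10)
```

The numeric fixpoint of Heron's averaging step is recovered as a byproduct of the least fixpoint (in the definedness order) of step.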

A fun puzzle

Putting together all that we’ve talked about, here’s a question. Suppose you get frustrated with fixing your Haskell code, and type the following into GHCi:

ghci> import Data.Function
ghci> fix error

What happens next? Give it some thought before reading on.

To solve the puzzle, let’s first explore the types. The error function in Haskell has the type String -> a. It cannot reasonably have a fixpoint unless the domain and codomain are the same, so the type specializes to String -> String. Then fix error has the type String.

What is this magical string that fixes all errors? Well, it must be a fixpoint of the error function, but the result of the error function is always ⊥. That is its whole purpose in life! So fix error must be ⊥, too. Not so special after all?

Ah, but on the operational side, there are many kinds of ⊥. Some are just infinite loops, while others throw exceptions. If they throw exceptions, then there’s an exception value to think about. These are all invisible to the pure fragment of Haskell, but immensely important to what the program does! To understand what’s going on here, recall that fix error = error (error (error …)), an infinite nesting of error functions. The result is quite interesting.

ghci> fix error
"*** Exception: *** Exception: *** Exception: *** Exception: *** Exception: *** Exception: *** Exception: *** Exception: *** Exception: *** Exception: ^CInterrupted

Let’s just say it did not fix any errors. Instead, it threw an exception, with an associated exception message that itself throws an exception when printed, and that exception has another exploding error message, and so on as long as you allow it to continue!

Picking up new rows in a DB

I’ve been playing around a little with postgresql-simple in combination with streaming for picking up inserted rows in a DB table. What I came up with was

eventStreamer :: Connection -> Int -> Stream (Of (Int, Int)) IO ()
eventStreamer conn ver = do
  res <- liftIO $ query conn "SELECT * FROM events WHERE id > (?) ORDER BY id" (Only ver)
  case res of
    [] -> do
      liftIO $ threadDelay 1000000
      eventStreamer conn ver
    _ -> do
      let ver' = maximum (map fst res)
      each res
      eventStreamer conn ver'

It seems almost too simple, so I suspect I missed something.

• Am I missing something?
• Should I be using streaming, or is pipes or conduit better (in case they are, in what way)?

October 12, 2018

Gabriel Gonzalez

Detailed walkthrough for a beginner Haskell program

This post walks through the development of a small Haskell program for aligning equals symbols in a block of text. This walkthrough targets a beginning programmer by describing several steps and concepts in extra detail.

Note that this post describes how to author, compile, and run single-file Haskell programs for ease of experimentation and learning.
For larger Haskell projects you will want to use cabal or stack to scaffold and run the project and to share your work with others. I mainly introduce things this way because it provides a lightweight way to get started and try the language.

Background

I obsess about the visual cleanliness of the code I write because I optimize for ease of reading rather than ease of writing. One way I try to improve appearance is aligning equals signs. For example, the code might begin like this:

address = "192.168.0.44"
port = 22
hostname = "wind"

… and then I manually indent the equals signs to all align with each other:

address  = "192.168.0.44"
port     = 22
hostname = "wind"

My editor is vim, and you can install support for that using the Tabular plugin. However, I thought this would be an instructive example to implement from scratch to illustrate how to program in a functional style.

One neat feature of vim is that you can transform text inside your editor using any command-line program. For example, I can select text using visual mode and then type:

:!some-command

… and vim will feed the selected text as standard input to the command-line program named some-command and replace the selected text with whatever the command emits to standard output. This means that I only need to write a program that takes text to align on standard input and then emits the aligned text on standard output. I’ll call this program: align-equals.

Development environment

The command line is my “IDE”, so I will usually keep up to three terminal windows open:

• One window edits text using vim
• One window displays type errors using ghcid
• One window tests the Haskell code I write in a REPL

I also use Nix, specifically a nix-shell, to provision my development tools. I prefer Nix for provisioning development tools because I don’t want to accumulate globally installed cruft on my system. Using Nix, I can provision any development tool or library transiently using nix-shell.
I will run all of the following examples inside of the following Nix shell (created for each new terminal):

nix-shell --packages 'haskellPackages.ghcWithHoogle (pkgs: [ pkgs.text pkgs.safe ])' haskellPackages.ghcid

That creates a transient shell that provides ghc, ghci, ghcid, and hoogle with the text and safe Haskell packages available. If I need to change the set of available Haskell packages I edit the command line and recreate the shell.

I turn on live type-checking by running:

ghcid --command='ghci align-equals.hs'

… which will then automatically reload align-equals.hs every time the file changes and display any errors or warnings that the Haskell compiler has found.

In another terminal, I open the code I’m editing inside the ghci REPL so that I can interactively test the functions that I am writing:

$ ghci align-equals.hs
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
[1 of 1] Compiling Main             ( test.hs, interpreted )
Ok, one module loaded.
*Main>

Then in the third terminal, I actually edit the file:

vi align-equals.hs

The program

First, let’s describe what we plan to do using English prose, before we translate that to code: we want to find the longest prefix preceding an = sign and then add spaces to the end of all other prefixes to match the length of the longest prefix.

Prefix length

In order to do that, we first need a function that computes the number of characters preceding an = sign for a given line. This function should have the following type:

import Data.Text (Text)

prefixLength :: Text -> Int

You can read that type signature as saying: “prefixLength is a function whose input is a value of type Text (i.e. a line of input) and whose output is an Int (the number of characters preceding the first = sign)”. We can even add comments to that effect:

prefixLength
    :: Text -- ^ Line of input
    -> Int  -- ^ Number of characters preceding the first @=@ sign

I import Data.Text because I prefer not to use the default String type provided by Haskell’s Prelude, which is inefficient. The text package provides a high-performance alternative type named Text along with many useful utilities.

The implementation of the prefixLength function directly follows the description:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

import qualified Data.Text

prefixLength :: Text -> Int
prefixLength line = Data.Text.length prefix
  where
    (prefix, suffix) = Data.Text.breakOn "=" line

As the name suggests, the prefixLength is the length of the prefix. The only non-trivial part is just knowing how to discover the Data.Text.breakOn function created for this purpose. I usually browse the online documentation for Haskell packages by doing a Google search for “hackage {package name}” (i.e. “hackage text”). That’s how I found the breakOn function here:

However, some people also prefer to use hoogle, which indexes and searches Haskell functions by name and by type. For example, if I’m looking for a function that splits a Text value into a prefix and suffix, I can run:

hoogle 'Text -> (Text, Text)'
Data.Text breakOn :: Text -> Text -> (Text, Text)
Data.Text breakOnEnd :: Text -> Text -> (Text, Text)
Data.Text.Lazy breakOn :: Text -> Text -> (Text, Text)
Data.Text.Lazy breakOnEnd :: Text -> Text -> (Text, Text)
Data.Text transpose :: [Text] -> [Text]
Data.Text.Lazy transpose :: [Text] -> [Text]
Data.Text intercalate :: Text -> [Text] -> Text
Data.Text.Lazy intercalate :: Text -> [Text] -> Text
Data.Text splitOn :: Text -> Text -> [Text]
Data.Text.Lazy splitOn :: Text -> Text -> [Text]
-- plus more results not shown, pass --count=20 to see more

I can also verify that my function works as expected by testing it in the long-running REPL I have open:

*Main> :reload
Ok, one module loaded.
*Main> :set -XOverloadedStrings
*Main> Data.Text.breakOn "=" "foo = 1"
("foo ","= 1")
*Main> prefixLength "foo = 1"
4

I have to enable the OverloadedStrings extension because I’m not using the default String type from the Haskell Prelude. This extension allows other packages to override how string literals behave to work with alternative implementations like Text.

One thing that’s neat about Haskell is how order-insensitive the language is. You can define things out of order and the compiler doesn’t mind. That’s why we can write our code as if it says: “prefixLength is the length of the prefix … and by the way you can compute the prefix and suffix by splitting the string using breakOn”.

prefixLength line = Data.Text.length prefix
  where
    (prefix, suffix) = Data.Text.breakOn "=" line

This out-of-order coding style also jives well with lazy evaluation. Haskell is a “lazy” language, meaning that the program can evaluate things out of order, too, or even not at all if they are never used. For example, our prefixLength function never actually uses the suffix, so the program will never actually bother to compute or allocate that value.
The more you program in Haskell, the less you think about your program as a sequence of statements and the more you think about your program as a graph of dependent computations.

Indenting a line

Now we need a function that pads a prefix with spaces up to a desired number of characters:

adjustLine :: Int -> Text -> Text

… or with comments:

adjustLine
    :: Int  -- ^ The desired prefix length
    -> Text -- ^ The line to pad
    -> Text -- ^ The padded line

This function is slightly longer, but still pretty straightforward:

adjustLine :: Int -> Text -> Text
adjustLine desiredPrefixLength oldLine = newLine
  where
    (prefix, suffix) = Data.Text.breakOn "=" oldLine

    actualPrefixLength = Data.Text.length prefix

    additionalSpaces = desiredPrefixLength - actualPrefixLength

    spaces = Data.Text.replicate additionalSpaces " "

    newLine = Data.Text.concat [ prefix, spaces, suffix ]

Note that we can reorder all the lines after the where with no change to the program’s behavior. However, to keep things readable we order them so that the reader can understand them from top to bottom:

• Split the line into a prefix and suffix
• Compute the actual length of the prefix
• Compute the number of spaces to pad by subtracting the actual length from the desired length
• Create the padding by repeating a space the specified number of times
• Create a new line by inserting the padding in between the prefix and suffix

Note that this reads a lot like a function defined in an imperative language.
For example, the analogous Python code is not much different:

def adjustLine(desiredPrefixLength, oldLine):
    (prefix, suffix) = oldLine.split("=", 1)
    actualPrefixLength = len(prefix)
    additionalSpaces = desiredPrefixLength - actualPrefixLength
    spaces = " " * additionalSpaces
    # Python's split strips the =, so we have to reintroduce it
    newLine = "".join([ prefix, spaces, "=", suffix ])
    return newLine

In general, you can port functional programs to imperative programs in this way if the functional program sticks to simple types (i.e. strings, numbers, records, etc.). For these simple programs, the functional program is essentially an imperative program where you forbid reassigning a value (i.e. “mutation”), which is generally good practice anyway for ease of program comprehension.

We can confirm that our function works by saving the complete program so far:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

import qualified Data.Text

prefixLength :: Text -> Int
prefixLength line = Data.Text.length prefix
  where
    (prefix, suffix) = Data.Text.breakOn "=" line

adjustLine :: Int -> Text -> Text
adjustLine desiredPrefixLength oldLine = newLine
  where
    (prefix, suffix) = Data.Text.breakOn "=" oldLine
    actualPrefixLength = Data.Text.length prefix
    additionalSpaces = desiredPrefixLength - actualPrefixLength
    spaces = Data.Text.replicate additionalSpaces " "
    newLine = Data.Text.concat [ prefix, spaces, suffix ]

… and then :reloading that program in our REPL:

*Main> :reload
Ok, one module loaded.
*Main> adjustLine 10 "foo = 1"
"foo       = 1"

Indenting multiple lines

Now we can define a function to indent multiple lines:

import Safe

adjustText :: Text -> Text
adjustText oldText = newText
  where
    oldLines = Data.Text.lines oldText

    prefixLengths = map prefixLength oldLines

    newLines =
        case Safe.maximumMay prefixLengths of
            Nothing -> []
            Just desiredPrefixLength ->
                map (adjustLine desiredPrefixLength) oldLines

    newText = Data.Text.unlines newLines

This function takes advantage of two convenient utilities
called lines and unlines. Data.Text.lines splits a block of Text into a list of lines:

*Main> :type Data.Text.lines
Data.Text.lines :: Text -> [Text]

… and Data.Text.unlines combines a list of lines back into a block of Text:

*Main> :type Data.Text.unlines
Data.Text.unlines :: [Text] -> Text

You can use these two utilities to simplify line-oriented Text transformations in Haskell:

• Split the block of Text into a list of lines
• Process the list of lines into a new list of lines
• Join the new list of lines back into a block of Text

The interesting part of our adjustText function is how we process the list of lines:

    prefixLengths = map prefixLength oldLines

    newLines =
        case Safe.maximumMay prefixLengths of
            Nothing -> []
            Just desiredPrefixLength ->
                map (adjustLine desiredPrefixLength) oldLines

You can read the above code as saying:

• Apply (“map”) the prefixLength function over each element of our list of lines to get a list of prefix lengths
• Find the maximum length
• If there is no maximum length, return an empty list of lines
• If there is a maximum length, pad each line using that length

You might wonder: “Why would there not be a maximum length?” Well, consider the case where we begin with 0 lines of input: what is the maximum value of an empty list? The maximumMay function doesn’t throw an exception or return a bogus sentinel value that might be confused for real data. Instead, the maximumMay function returns an optional result:

data Maybe a = Just a | Nothing

maximumMay :: Ord a => [a] -> Maybe a

The a in the type of maximumMay can be any type that is comparable (i.e. implements Ord), and in the context of our code that type is Int, so we can pretend that the maximumMay function actually has this more specific type:

maximumMay :: [Int] -> Maybe Int

In other words, given a list of Ints, the maximumMay function will Maybe return an Int (i.e. it might return an Int or it might not). This means that the result will either be a Nothing (i.e.
no result) or an Int value wrapped inside of a Just. We consume the result of maximumMay by “pattern matching” on the result, like this:

    case Safe.maximumMay prefixLengths of
        Nothing -> ...                   -- First branch
        Just desiredPrefixLength -> ...  -- Second branch

The first branch covers the case where the list is empty. Here, the desiredPrefixLength is not in scope, so if we try to use that value we will get a type error. This provides a nice safeguard so that we can’t attempt to access a result that does not exist. In other languages this might be caught at runtime as a java.lang.NullPointerException or AttributeError: 'NoneType' object has no attribute 'x', but with pattern matching the type-checker can detect these sorts of bugs before we even attempt to run the program.

The second branch covers the case where the list is non-empty and does have a sensible maximum length. There we use that length to adjust each line.

The nice thing about pattern matching is that you can’t forget to handle these cases. If you try to use the result of maximumMay directly as an Int then you will get a type error. The maximumMay function wraps its own result in a Maybe to force downstream users to carefully consider how to handle the case where the list may be empty.

Tying it all together

So far all of our functions are “pure” functions, meaning that they are deterministic conversions from their inputs to their outputs that don’t mutate any variables and don’t have any side effects that we care about.

Carefully note the key phrase: “side effects that we care about”. These functions do technically have side effects, including:

• Allocating memory/registers
• Taking a non-zero amount of time to compute

Sometimes we do care about these types of side effects in certain contexts, like cryptography (where secure information can leak through these side channels) or embedded programming (where programs require careful attention to time/memory).
However, for our simple utility these functions are effectively “pure”.

We can’t use our utility functions from the command line, though, unless we wrap them in a main function which a program can run, like this:

import qualified Data.Text.IO

main :: IO ()
main = Data.Text.IO.interact adjustText

The interact function converts any pure Text transformation into a runnable program that transforms standard input into standard output using the supplied transformation:

*Main> :type Data.Text.IO.interact
Data.Text.IO.interact :: (Text -> Text) -> IO ()

This is an example of a “higher-order” function: a function whose input is another function. The input to the interact function is another function of type Text -> Text. Conveniently, our adjustText function has exactly the correct type to be supplied as an argument to the interact function:

adjustText :: Text -> Text

Data.Text.IO.interact adjustText :: IO ()

… and then any value of type IO () can be assigned to main, which is what our program will run when invoked as a command-line executable.
If we save the following complete example to align-equals.hs:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import Data.Text (Text)

import qualified Data.Text
import qualified Data.Text.IO
import qualified Safe

prefixLength :: Text -> Int
prefixLength line = Data.Text.length prefix
  where
    (prefix, suffix) = Data.Text.breakOn "=" line

adjustLine :: Int -> Text -> Text
adjustLine desiredPrefixLength oldLine = newLine
  where
    (prefix, suffix) = Data.Text.breakOn "=" oldLine
    actualPrefixLength = Data.Text.length prefix
    additionalSpaces = desiredPrefixLength - actualPrefixLength
    spaces = Data.Text.replicate additionalSpaces " "
    newLine = Data.Text.concat [ prefix, spaces, suffix ]

adjustText :: Text -> Text
adjustText oldText = newText
  where
    oldLines = Data.Text.lines oldText
    prefixLengths = map prefixLength oldLines
    newLines =
        case Safe.maximumMay prefixLengths of
            Nothing -> []
            Just desiredPrefixLength ->
                map (adjustLine desiredPrefixLength) oldLines
    newText = Data.Text.unlines newLines

main :: IO ()
main = Data.Text.IO.interact adjustText

… then we can compile that using:

ghc -O2 align-equals.hs

… and verify that the executable works using:

$ ./align-equals
foo = 1
a = 2
asdf = 3
<Ctrl-D>
foo  = 1
a    = 2
asdf = 3

That means that I can now use ./align-equals to transform my text buffer by selecting a text block like this:

address = "192.168.0.44"
port = 22
hostname = "wind"

… and running :!./align-equals to align the block:

address  = "192.168.0.44"
port     = 22
hostname = "wind"

… and now I don’t have to tediously align the code manually.

Conclusion

Hopefully this illustrates one way to learn the Haskell language in the course of building small but practical programs. The language introduces many cool features and concepts and this post only touches the tip of the iceberg.
