Planet Haskell

January 01, 2001

Alessandro Vermeulen

About

Alessandro Vermeulen is a Computer Science (MSc) graduate from Utrecht University, specialized in Software Technology. Alessandro obtained his honours Bachelor's degree in Computer Science, also at Utrecht University. Among his core interests are functional programming (in both Haskell and Scala), compilers, and automatic program analysis. Alessandro is currently employed at ING, where he focuses on incorporating functional programming, reactive systems, and other software-technology best practices into daily practice.

January 01, 2001 12:00 AM

February 20, 2019

Tweag I/O

clodl: build, copy, and run

Facundo Dominguez, Mathieu Boespflug

Have you ever needed to execute a binary you built on another computer, only to find that it doesn't work because it depends on shared libraries that aren't installed? Or even worse, you don't have admin rights on the machine, so you cannot install the correct dependencies?

In this post we present clodl, a simple tool to build self-contained closures of dynamically linked executables and shared libraries. The closure is a file that can be copied elsewhere and executed or loaded as a library without depending on any pre-installed library.

Deploying binaries

While copying executables is the simplest way to deploy a binary, it depends strongly on a correctly configured computing environment. Other options, such as copying all the dependencies into a Docker container, or building a fully static executable instead, alleviate this problem. However, there are cases where these methods are not appropriate.

One such case is feeding native code to the Java Virtual Machine (JVM), since it can only load shared libraries. If we are to deploy a hybrid application requiring both the JVM and native code, we need to be sure that all the native code dependencies are shipped, or the JVM won't be able to load the code.

A possibility is creating a Docker image with the JVM and all the libraries. This can work as long as we have control over how the application is launched, which was not the case in one of our projects.

In this project, we wanted to integrate Haskell programs with Spark, a system for running distributed computations. Among other things, Spark interfaces with systems which manage resources in a cluster and orchestrates the creation of JVM instances to run distributed computations. Computations are submitted to Spark as self-contained JAR files containing the code to execute. In this scenario, we must be able to encapsulate dependencies.

Another situation where clodl has been useful for us is when we need a self-contained (static-like) executable, but we are using a compiler that doesn't build fully static executables by default (e.g. GHC). Depending on the application, doing so might not be trivial. The clodl approach offers a stop-gap solution, or even a permanent one if fully static builds don't work (some dependencies often make this difficult).

What is a closure?

In essence, the closure of a shared library or executable is a file containing this initial library or executable and its dependencies. In the case of clodl this takes the form of a zip archive, though any container format could be used.

In addition, the closure must have some provisions for executing or loading the code contained within. For example, a Java application could uncompress the closure and load the initial shared library. However, this can cause problems: there is the danger that the dependencies would be searched for at other locations, where incompatible libraries could be found. Unix-like systems have a flexible mechanism for finding the dependencies of a library, but this mechanism is also hard to override when we want to force the library to use a specific set of dependencies.

Because of this, the closure contains a wrapper library created solely for the purpose of loading it. This wrapper library is the result of dynamically linking the initial library against all of its dependencies, and therefore all the libraries in the closure are direct dependencies of it. Moreover, the runtime library path of this wrapper library instructs the dynamic linker to search for dependencies at the same location where the wrapper is.

If we want the closure to be executed rather than loaded, we create a wrapper executable instead of a wrapper library, which is linked in the same way.

Creating closures with clodl

clodl is implemented with a mix of bash scripts and Bazel rules. Bazel is a build system which is able to express dependencies between source files in various languages (e.g. Haskell and Java). While we have been using clodl with it, it would be possible to implement the same approach in other tools or even as a standalone shell script.

In order to build a closure, one has to write rules in a BUILD file which can be fed to Bazel. For instance:

# bring into scope the library_closure rule
load(
    "@io_tweag_clodl//clodl:clodl.bzl",
    "library_closure",
)

# create a C library
cc_library(
    name = "lib-cc",
    srcs = ["lib.c"],
)

# create a shared library from the C library
cc_binary(
    name = "libhello-cc.so",
    # include a main function if you intend to make
    # the closure executable
    srcs = ["main.c"],
    deps = ["lib-cc"],
    linkshared = 1,
)

# create the closure of the shared library
library_closure(
    name = "clotest-cc",
    srcs = ["libhello-cc.so"],
)

The library_closure rule here produces a file clotest-cc.zip containing all the shared libraries needed to load libhello-cc.so. clodl finds the dependencies using whatever version of ldd is on the PATH when Bazel runs the rules.

We can make the closure executable by replacing the library_closure rule with:

binary_closure(
    name = "clotestbin-cc",
    src = "libhello-cc.so",
)

This will produce a file clotestbin-cc.sh consisting of the closure zip file prepended with a bash script that invokes the main function in main.c. The clodl repository has full examples. At the time of this writing, we have only tested clodl on Linux.

Tools for deployment

clodl is not the first tool on Linux designed to deploy binaries. There are Flatpak and Snap, for instance. These are build tools that start from the source code to build cross-Linux packages. Another such tool is nix-bundle, which requires the application to be packaged with Nix in order to produce a binary that bundles all of its dependencies.

These tools require specifying the dependencies, the source code and any build settings. clodl, in contrast, starts from the executable binary or shared library, written in any language, built by any compiler.

However, there are tools that follow this approach as well. One of them is a commercial tool called ermine, which produces a self-contained executable from a dynamically linked application. Unfortunately, it provides no clues about the structure of the final executable. clodl produces an easy-to-handle self-extracting zip archive. JAR files are also zip files, which means the output of clodl can also be loaded in the JVM.

Finally, there is a tool called exodus which produces a tar archive from a dynamically linked executable, its dependencies, and the dynamic linker. Most notably, it also includes a launcher program that invokes the linker with command-line flags constraining the directories in which to search for libraries. This approach is rather frugal: unlike clodl, it doesn't need to build a wrapper executable. However, exodus will not work for building closures of libraries, because when we load libraries in the JVM, we have no control over the flags passed to the dynamic linker that the JVM is using.

Summary

In this post we have presented clodl, a tool to build closures of dynamically linked executables and shared libraries. These closures are easy to copy and then load or execute on other hosts. We expect these capabilities to be useful to various projects. You are welcome to get in touch if you would like to contribute or to see clodl's features extended.

February 20, 2019 12:00 AM

February 19, 2019

Holden Karau

PyData Hong Kong - Making the Big Data ecosystem work together with Python: Apache Arrow, Spark, Flink, Beam, and Dask @ PyData Hong Kong

Come join me on Tuesday 19 February @ 02:00 at PyData Hong Kong 2019 for "Making the Big Data ecosystem work together with Python: Apache Arrow, Spark, Flink, Beam, and Dask". I'll update this post with the slides soon. Come to the talk or comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

by Holden Karau (noreply@blogger.com) at February 19, 2019 02:02 AM

February 18, 2019

Michael Snoyman

Shutting down haskell-lang.org

Early this week, I merged a commit which essentially shuts down the haskell-lang.org website. Besides a few rarely viewed pages without obvious replacements, visiting pages on https://haskell-lang.org will automatically redirect you to an appropriate page on https://haskell.fpcomplete.com. Also, consider this an announcement that there’s a new site around, https://haskell.fpcomplete.com!

The site is still being refined. However, to avoid confusion, someone requested that I make an announcement now. That request makes sense. In this post, I want to explain:

  • Why I shut down haskell-lang.org
  • What the goal of haskell.fpcomplete.com is
  • What makes this new site different from haskell-lang.org
  • Plans for the future

Onward!

Shutting down haskell-lang.org

I’ve been playing with the idea of shutting down haskell-lang.org for a while now. I didn’t want to simply turn it off, since there’s a lot of useful content there that I regularly use myself and point others to for training (we’ll get to that in a bit). It wasn’t a huge amount of work to put together the replacement site, but it did take some dedicated writing time. I got that time last week and did most of the work then.

I first publicly proposed shutting down the site in December of last year. I put together a proposal to get a bunch of people together to work on a new Commercial Haskell site. There wasn’t much interest in such collaboration, so I went with option B, which I’ll explain in the next section.

As for why haskell-lang.org should be shut down, I’ll quote the history section of the new site’s README:

This website replaces some previous false starts (and mistakes) at trying to create an opinionated hub of Haskell information. Previously, some FP Complete members participated in haskell-lang.org as another vision of an opinionated Haskell site. Later, we sought out collaborators for creating a more complete Commercial Haskell website.

haskell-lang.org was a mistake, and there was insufficient interest in commercialhaskell.com. Given that the branding for haskell-lang.org was incorrect, there was little interest from others in the general concept of creating an opinionated site, and we at FP Complete were still passionate about this topic, we decided to focus efforts on a site under our own branding and control.

This site is unapologetically opinionated, and follows what we have found to be the best route towards getting productive with Haskell quickly.

The need for a site

haskell-lang.org has been collecting solid tutorials on general Haskell techniques and specific libraries. The collection is opinionated, so that I can use it for training courses I give, and point new users to it without needing to give any caveats about which approach to follow. And the site is backed by a Git repository using Markdown files, which I’m a huge fan of.

So the first goal of this site would be: retain the technical content and educational approach provided by haskell-lang.org, without the bad history that goes along with that name.

Other goals

At FP Complete, we’ve done a few surveys of the Haskell community to get an idea of what we can do to most help more companies adopt Haskell. From our last survey, it seems that more educational content is top of the list, followed by helping companies generally feel comfortable adopting Haskell. I covered the education aspect above, and we’ll continue to put efforts into improving that situation.

On the more nebulous adoption point, we’re adding two more goals to the new site:

  • Provide promotion material: content that explains what makes Haskell a good choice for businesses, together with some potential drawbacks people should be aware of.
  • Introduce a Success Program at FP Complete, providing affordable commercial Haskell support including training and mentoring. We believe this may help more companies adopt Haskell.

Enrolling in the Success Program is a paid service at FP Complete (though we are pricing it as low as we can, to maximize adoption without actually losing money). We’re hoping that the presence of a clearly defined and publicly priced commercial support package will help reduce perceived risk with Haskell further and allow more adoption.

The future

The new site is still a work in progress. Overall, the styling still needs a bit of work, and I want to refine the language some (and likely scale back the prose). I also want to refresh a bunch of the technical content to be in line with our current recommendations. This will also affect the Applied Haskell training I'll be giving later this year. (Feel free to check out the current version of the course online.)

I still have some questions up in the air, like whether we'll host a blog on this site or simply continue all FP Complete blogging activity on our corporate site. I've started putting together a bit of a philosophy page explaining the FP Complete approach to commercial Haskell development, and that needs expanding. And I'd like to get more content on the contribute page to allow people to find a project they can cut their teeth on.

I hope that this new site not only allows for the creation of and access to great Haskell content. I also hope that this is taken as a positive message from the rest of the community and an indication of how we at FP Complete, and I personally, intend to interact with the broader Haskell community going forward. We'll remain opinionated in our technical recommendations, as I believe we should be. But hopefully this change in the naming and purpose of the site removes any adversarial nature from sharing these opinions.

February 18, 2019 07:33 PM

Monday Morning Haskell

Upgrading My Development Setup!

In the last year or so, I haven't actually written as much Haskell as I'd like to. There are a few reasons for this. First, I switched to a non-Haskell job, meaning I was now using my Windows laptop for all my Haskell coding. Even on my previous work laptop (a Mac), things weren't easy. Haskell doesn't have a real IDE, like IntelliJ for Java or XCode for iOS development.

Besides not having an IDE, Windows presents extra pains compared to the more developer-friendly Mac. And it finally dawned on me. If I, as an experienced developer, was having this much friction, it must be a nightmare for beginners. Many people must be trying to both learn the language AND fight against their dev setup. So I decided to take some time to improve how I program so that it'll be easier for me to actually do it.

I wanted good general functionality, but also some Haskell-specific functions. I did a decent amount of research and settled on Atom as my editor of choice. In this article, we'll explore the basics of setting up Atom, what it offers, and the Haskell tools we can use within it. If you're just starting out with Haskell, I hope you can take these tips to make your Haskell Journey easier.

Note that many tips in this article won't work without the Haskell platform! To start with Haskell, download our Beginners Checklist, or read our Liftoff Series!

Goals

It's always good to begin with the end in mind. So before we start out, let's establish some goals for our development environment. A lot of these are basic items we should have regardless of what language we're using.

  1. Autocomplete. Must have for terms within the file. Nice to have for extra library functions and types.
  2. Syntax highlighting.
  3. Should be able to display at least two code files side-by-side, should also be able to open extra files in tabs.
  4. Basic file actions should only need the keyboard. These include opening new files to new tabs or splitting the window and opening a new file in the pane.
  5. Should be able to build code using the keyboard only. Should be able to examine terminal output and code at the same time.
  6. Should be able to format code automatically (using, for instance, Hindent)
  7. Some amount of help filling in library functions and basic types. Should be able to coordinate types from other files.
  8. Partial compilation. If I make an obvious mistake, the IDE should let me know immediately.
  9. Vim keybindings (depends on your preference of course)

With these goals in mind, let's go ahead and see how Atom can help us.

Basics of Atom

Luckily, the installation process for Atom is pretty painless. The Windows installer came off without a hitch for me. Out of the box, Atom fulfills most of the basic requirements we'd have for an IDE. In fact, we get goals 1-4 without putting in any effort. The trick is that we have to learn a few keybindings. The following are what you'll need to open files.

  1. Ctrl+P - Open a new tab with a file using fuzzy find
  2. Ctrl+K + Direction (left/right/up/down arrow) - Open a new pane (will initially have the same file as before).
  3. Ctrl+K + Ctrl+Direction - Switch pane focus

Those commands solve requirements 3 and 4 from our goals list.

Another awesome thing about Atom is the extensive network of easy-to-install plugins. We'll look at some Haskell specific items below. But to start, we can use the package manager to install vim-mode-improved. This allows most Vim keybindings, fulfilling requirement 9 from above. There are a few things to re-learn with different keystrokes, but it works all right.

Adding Our Own Keybindings

Since Atom is so hackable, you can also add your own keybindings and change ones you don't like. We'll do one simple example here, but you can also check out the documentation for some more ideas. One thing we'll need for goal #5 is an easier way to bring up the bottom panel within Atom. This is where terminal output goes when we run a command. You'll first want to open up keymap.cson, which you can do by going to the File menu and clicking Keymap….

Then you can add the following lines at the bottom:

'atom-workspace':
  'ctrl-shift-down': 'window:toggle-bottom-dock'
  'ctrl-shift-up': 'window:toggle-bottom-dock'

First, we scope the command to the entire Atom workspace. (We'll see an example below of a command with a more limited scope.) Then we assign the Ctrl+Shift+Down Arrow key combination to toggle the bottom dock. Since it's a toggle command, we could repeat the same combination to move it both up and down. But this isn't very intuitive, so we add the second line, letting us also use the up arrow to bring it up.

A super helpful tool is the key binding resolver. At any point, you can use ctrl+. (control key plus the period key) to bring up the resolver. Then pressing any key combination will bring up the commands Atom will run for it. It will highlight the one it will pick in case of conflicts. This is great for finding unassigned key combinations!

Haskell Mode in Atom

Now let's start looking at adding some Haskell functionality to our editor. We'll start by installing a few different Haskell-related packages in Atom. You don't need all these, but this is a list of the core packages suggested in the Atom documentation.

language-haskell
ide-haskell
ide-haskell-cabal
haskell-ghc-mod
autocomplete-haskell

The trickier part of getting Haskell functionality is the binary dependencies. A couple of the packages we added depend on having a couple of programs installed. The most prominent of these is ghc-mod, which exposes some functionality of GHC. You'll also want a code formatter, such as hindent or stylish-haskell, installed.

At the most basic level, it's easy to install these programs with Stack. You can run the command:

stack install ghc-mod stylish-haskell

However, ghc-mod matches up with a specific version of GHC. The command above installs the binaries at a system-wide level. This means you can only have the version for one GHC version installed at a time. So imagine you have one project using GHC 8.0, and another project using GHC 8.2. You won't be able to get Haskell features for each one at the same time using this approach. You would need to re-install the proper version whenever you switched projects.

As a note, there are a couple ways to ensure you know what version you've installed. First, you can run the stack install ghc-mod command from within the particular project directory. This will use that project's LTS to resolve which version you need. You can also modify the install command like so:

stack --resolver lts-9 install ghc-mod

There is an approach where you can install different, compiler-specific versions of the binary on your system, and have Atom pick the correct one. I haven't been able to make this approach work yet. But you can read about that approach in Alexis King's blog post here.

Keybinding for Builds

Once we have that working, we'll have met most of our feature goals. We'll have partial compilation and some Haskell specific autocompletion. There are other packages, such as haskell-hoogle that you can install for even more features.

There's one more feature we want though, which is to be able to build our project from the keyboard. When we installed our Haskell packages, Atom added a "Haskell IDE" menu at the top. We can use this to build our project with "Haskell IDE" -> "Builder" -> "Build Project". We can add a keybinding for this command like so.

'atom-text-editor[data-grammar~="haskell"]':
  ...
  'ctrl-alt-shift-b': 'ide-haskell-cabal:build'

Notice that we added a namespace here, so this command will only run on Haskell files. Now we can build our project at any time with Ctrl+Shift+Alt+B, which will really streamline our development!

Weaknesses

The biggest weakness with Atom Haskell-mode is binary dependencies and GHC versions. The idea behind Stack is that switching to a different project with a different compiler shouldn't be hard. But there are a lot of hoops to jump through to get editor support. To be fair though, these problems are not exclusive to Atom.

Another weakness is that the Haskell plugins for Atom currently only support up through LTS 9 (GHC 8). This is a big weakness if you're looking to use new features from the cutting edge of GHC development. So Atom Haskell-mode might not be fully-featured for industry projects or experimental work.

As a further note, the Vim mode in Atom doesn't give all the keybindings you might expect from Vim. For example, I could no longer use the colon key plus a number to jump to a line. Of course, Atom has its own bindings for these things. But it takes a little while to re-learn the muscle memory.

Alternatives

There are, of course, alternatives to the approach I've laid out in this article. Many plugins/packages exist enabling you to get good Haskell features with Emacs and Vim. For Emacs, you should look at haskell-mode. For Vim, I made the most progress following this article from Stephen Diehl. I'll say for my part that I haven't tried the Emacs approach, and ran into problems a couple times with Vim. But with enough time and effort, you can get them to work!

If you use Visual Studio Code, there are a couple of packages for Haskell: Haskelly and Haskero. I haven't used either of these, but they both seem to provide a lot of nice features.

Conclusion

Having a good development environment is one of the keys to programming success. More popular languages have full-featured IDE's that make programming a lot easier. Haskell doesn't have this level of support. But there's enough community help that you can use a hackable editor like Atom to get most of what you want. Since I fixed this glaring weakness, I've been able to write Haskell much more efficiently. If you're starting out with the language, this can make or break your experience! So it's worth investing at least a little bit of time and effort to ensure you've got a smooth system to work with.

Of course, having an editor setup for Haskell is meaningless if you've never used the language! Download our Beginners Checklist or read our Liftoff Series to get going!

by James Bowen at February 18, 2019 03:30 PM

Sandy Maguire

Freer Monads: Too Fast, Too Free

The venerable Lyxia had this to say about my last post on freer monads:

I agree the performance argument is way less important than the frequency at which it's thrown around makes it seem. The reason freer performance sucks is that you're repeatedly constructing and deconstructing trees at runtime. However, that is only a consequence of the implementation of freer as a GADT (initial encoding). I bet the final encoding can do wonders:

newtype Freer f a = Freer (forall m. Monad m => (forall t. f t -> m t) -> m a)

I spent a few days working through the implications of this, and it turns out to be a particularly compelling insight. Behold the microbenchmarks between freer-simple and an equivalent program written against mtl:

benchmarking freer-simple
time                 745.6 μs   (741.9 μs .. 749.4 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 745.1 μs   (742.2 μs .. 748.5 μs)
std dev              10.68 μs   (8.167 μs .. 14.23 μs)

benchmarking mtl
time                 10.96 μs   (10.93 μs .. 10.98 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.95 μs   (10.92 μs .. 10.99 μs)
std dev              119.3 ns   (93.42 ns .. 153.7 ns)

Not so good; freer-simple is roughly 70x slower in this case! But the same program again, when written in this final encoding, is pretty damn fast:

benchmarking too-fast-too-free
time                 24.23 μs   (24.10 μs .. 24.37 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 24.27 μs   (24.15 μs .. 24.40 μs)
std dev              448.8 ns   (355.8 ns .. 586.1 ns)

It's roughly 2x slower than mtl, which is to say roughly 30x faster than freer-simple. This is pretty sweet, and it comes with the benefit of getting to keep the underlying semantics of freer-simple.

So without further ado, I'd like to share my work-in-progress with you, tentatively named too-fast-too-free. This is ready for prime time, but I'd prefer to merge it into someone's existing package upstream rather than pollute Hackage with yet another free(r) monad extensible-effects package.

I'll do it if I have to, but the performance is fair game for anyone who wants it. If I don't hear from anyone by next week, I'll publish a new package to Hackage and begin the freer monad revolution we've all been waiting for.

What the Heck Is Any of this Stuff Anyway?

Let's investigate this finally-encoded type and see where this performance comes from:

newtype Freer f a = Freer
  { runFreer :: forall m. Monad m => (forall t. f t -> m t) -> m a
  }

The type of runFreer is saying "if you give me a Freer f a and a natural transformation from f to some monad m, then I can give you back an m a." Sounds promising, right?

Freer's instance for Monad is written in terms of this final m, and so short of shunting around some functions, we're not really paying any cost for binds compared to just writing in terms of m:

instance Monad (Freer f) where
  Freer ma >>= f = Freer $ \k -> do
    a <- ma k
    runFreer (f a) k

Compare this to the approach used by freer-simple, which needs to allocate leaves in a tree for every bind (and for every fmap and ap!). That's a huge win already.

Turning Freer into Eff uses the same trick as freer-simple---let Eff r be Freer (Union r), where Union r is a value that can be any effect in r. A natural transformation forall m. Monad m => (forall t. Union r t -> m t) must therefore handle every possible effect in r, and so we haven't lost any capabilities with our new encoding.

The challenging part was figuring out how to plumb state through the encoding of Freer f a---after all, many interesting interpreters are necessarily stateful.

Fortunately there's a trick. Because Eff (e ': r) a can be interpreted in terms of any Monad m, we can choose m ~ StateT s (Eff r), and get our statefulness from StateT directly. Because StateT's bind is written in terms of its underlying monad, this trick doesn't cost us anything more than shunting another few functions around.
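
To make that concrete, here is a minimal sketch (mine, not code from the package) of a stateful interpreter built this way; it assumes a State effect GADT and a decomp with the usual freer-simple shape, decomp :: Union (e ': r) t -> Either (Union r t) (e t):

import Control.Monad.Trans.Class (lift)
import qualified Control.Monad.Trans.State.Strict as S

data State s a where
  Get :: State s s
  Put :: s -> State s ()

-- Interpret State by choosing m ~ StateT s (Eff r): Get and Put are
-- discharged into get and put, every other effect is relayed via k.
runState :: s -> Eff (State s ': r) a -> Eff r (a, s)
runState s0 (Freer m) = Freer $ \k -> flip S.runStateT s0 $ m $ \u ->
  case decomp u of
    Left  other    -> lift $ k other
    Right Get      -> S.get
    Right (Put s') -> S.put s'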

We can achieve short-circuiting interpreters similarly by evaluating them via ExceptT (Eff r). In fact, this pattern turns out to be pretty common---and it generalizes thusly:

transform
    :: ( MonadTrans t
       , MFunctor t
       , forall m. Monad m => Monad (t m)
       )
    => (forall m. Monad m => t m a -> m b)
       -- ^ The strategy for getting out of the monad transformer.
    -> (eff ~> t (Eff r))
    -> Eff (eff ': r) a
    -> Eff r b
transform lower f (Freer m) = Freer $ \k -> lower $ m $ \u ->
  case decomp u of
    Left  x -> lift $ k x
    Right y -> hoist (usingFreer k) $ f y

Admittedly the type is a little terrifying, but library code can specialize it down to less offensive combinators.

At the end of the day, this final encoding means that Freer code specializes down to its eventual result anyway, giving us the "fusion" of fused-effects without the boilerplate.

Hopefully these results are enough to finally put the "free monads have bad performance" argument to rest. I'll have some promising results on the bracket front as well that require some refinement, but hopefully those will come sooner than later.

¡Viva la freer monad revolucion!

February 18, 2019 02:56 PM

Douglas M. Auclair (geophf)

April/May 2018 1HaskellADay 1Liners

  • May 27th, 2018:
    data F = F { a, b, c :: String }
    data Stamped a = Stamped { time :: Day, stamped :: a }

    f :: Stamped F -> String

    output of f x is  "time a b c" for the respective values of time, a, b, c

    Is there some monadic / applicative elegant definition that does this?
    • Nickolay Kudasov @crazy_fizruk
      f = intercalate " " . sequence [show . time, a . stamped, b . stamped, c . stamped]
    • Nickolay Kudasov @crazy_fizruk
      f = intercalate " " <$> pure . show . time <> (sequence [a, b, c]) . stamped
      A bit trickier, but shorter and uses stamped once.
  • April 13th, 2018: given f :: [a] -> b -> [c]
    where c is derived from g :: a -> b -> c

    You have [a] and [b]

    Write a function h :: [a] -> [b] -> [c] from f

by geophf (noreply@blogger.com) at February 18, 2019 08:52 AM

January 2019 1HaskellADay Problems and Solutions

by geophf (noreply@blogger.com) at February 18, 2019 08:27 AM

February 17, 2019

Neil Mitchell

Quadratic "deriving Generic" Compile Times

Summary: For large data types, deriving Generic can take a long time to compile.

I was building GHC using Hadrian, and part of that process involves compiling Cabal multiple times - once to build Hadrian itself, and then once per GHC stage thereafter (typically 3 stages). Compiling Cabal takes quite a long time, but one thing I spotted was that compiling the LicenseId file alone was taking 30 seconds! Given that it seemed to be on the critical path, and usually gets compiled 4 times, that seemed worth investigating.

Looking at the file, it defines an enumeration with 354 license types in it. Pulling apart the file, I found that with only the data type and deriving Generic it was taking about 9 seconds without optimisation or 22 seconds with optimisation, using GHC 8.6.3. I then wrote a test program that generated and compiled instances of the program:

{-# LANGUAGE DeriveGeneric #-}
module Sample where
import GHC.Generics
data Sample = Ctor1 | Ctor2 | ...
    deriving Generic

The resulting graph shows that compilation time is quadratic in the number of constructors:


This phenomenon could either be because the deriving Generic itself is quadratic, or more likely that the output generated by deriving Generic provokes quadratic behaviour in some subsequent part of the compiler. I'm going to report this as a GHC bug.

Update: There's already GHC bug 5642 which explains that it might just be a fundamental property of how Generic instances look.

For Cabal the fix might be somewhat simpler - the deriving Generic is only used to define a Binary instance, which can be written in 2 lines for an Enum anyway - discussion on the issue tracker. Hopefully this change will speed up compiling Cabal, and thus speed up compiling GHC.
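
For illustration, such a two-line instance might look like the following sketch; License here is a stand-in I made up for Cabal's 354-constructor LicenseId, not the actual Cabal code:

import Data.Binary (Binary (..))

data License = MIT | BSD3 | GPL3 -- stand-in for the real LicenseId
    deriving (Show, Enum)

-- Serialise just the constructor index, avoiding Generic entirely.
instance Binary License where
    put = put . fromEnum
    get = toEnum <$> get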

As I watched the GHC output go by I noticed that both Parser and DynFlags in GHC itself were also very slow, with the latter taking nearly 2 minutes. I'm sure there's more scope for optimising GHC builds, so Andrey Mokhov and I have proposed a Google Summer of Code project to tackle other low-hanging fruit.

The quick program I wrote for generating the data in this blog post is below:

import Control.Monad
import Data.List
import System.Time.Extra
import System.Process.Extra
import Numeric.Extra

main = do
    forM_ [1..] $ \n -> do
        writeFile "Sample.hs" $ unlines $
            ["{-# LANGUAGE DeriveGeneric #-}"
            ,"module Sample where"
            ,"import GHC.Generics"
            ,"data Sample = " ++ intercalate " | " ["Ctor" ++ show i | i <- [1..n]]
            ," deriving Generic"
            ]
        xs <- forM [0..2] $ \o ->
            fmap fst $ duration $ systemOutput_ $ "ghc -fforce-recomp Sample.hs -O" ++ show o
        putStrLn $ unwords $ show n : map (showDP 2) xs

by Neil Mitchell (noreply@blogger.com) at February 17, 2019 08:04 PM

Magnus Therning

Choosing a conduit randomly

Lately I've been playing around with conduit. One thing I wanted to try out was to set up processing where one step is chosen at random from a number of components, based on weights. In short, I guess I wanted a function with a type something like this:
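
Presumably something along these lines (a hedged sketch; the name pickConduit and the Weight alias are made up for illustration):

import Control.Monad.IO.Class (MonadIO)
import Data.Conduit (ConduitT)

type Weight = Int

-- Choose one of the weighted conduits and behave as it.
pickConduit :: MonadIO m => [(Weight, ConduitT i o m r)] -> ConduitT i o m r
pickConduit = undefined -- exactly the part I didn't know how to write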

I have to admit I don't even know where to start writing such a function1 but after a little bit of thinking I realised I could get the same effect by controlling how chunks of data are routed. That is, instead of choosing a component randomly, I can choose a route randomly. It would look something like this when choosing from three components:

                        +---------+   +----------+   +-------------+
                        | Filter  |   | Drop tag |   | Component A |
                    +-->| Value-0 |-->|          |-->|             |--+
                    |   +---------+   +----------+   +-------------+  |
+----------------+  |   +---------+   +----------+   +-------------+  |
| Choose random  |  |   | Filter  |   | Drop tag |   | Component B |  |
| value based on +----->| Value-1 |-->|          |-->|             |----->
| weights        |  |   +---------+   +----------+   +-------------+  |
+----------------+  |   +---------+   +----------+   +-------------+  |
                    |   | Filter  |   | Drop tag |   | Component C |  |
                    +-->| Value-2 |-->|          |-->|             |--+
                        +---------+   +----------+   +-------------+


That is

  1. For each chunk that comes in, choose a value randomly based on weights and tag the chunk with the chosen value, then
  2. split the processing into one route for each component,
  3. in each route filter out chunks tagged with a single value, and
  4. remove the tag, then
  5. pass the chunk to the component, and finally
  6. bring the routes back together again.

Out of these steps all but the very first one are already available in conduit:
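
Roughly, and as a hedged sketch of my own (the name routeTagged and the use of ZipConduit are my choices): ZipConduit handles the splitting and merging (steps 2 and 6), filterC the filtering (step 3), and mapC snd the untagging (step 4):

import Conduit (ConduitT, ZipConduit (..), filterC, mapC, (.|))
import Data.Foldable (traverse_)

-- Fan the tagged stream out to every route, let each route keep only
-- chunks with its own tag, drop the tag, run the component, and merge
-- the outputs again.
routeTagged :: Monad m => [(Int, ConduitT a b m ())] -> ConduitT (Int, a) b m ()
routeTagged comps = getZipConduit $ traverse_ route comps
  where
    route (i, c) = ZipConduit $ filterC ((== i) . fst) .| mapC snd .| c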

What's left is the beginning. I started with a function to pick a value at random based on weights2:
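
Something like this (a sketch; pickByWeight is a made-up name, and like the original it is modelled on QuickCheck's frequency):

import Control.Monad.IO.Class (MonadIO, liftIO)
import System.Random (randomRIO)

-- Pick a value by weight: draw a number in [1, total weight] and walk
-- the list until the running weight covers it.
pickByWeight :: MonadIO m => [(Int, a)] -> m a
pickByWeight xs = do
    n <- liftIO $ randomRIO (1, sum (map fst xs))
    pure (go n xs)
  where
    go n ((w, x) : rest)
        | n <= w    = x
        | otherwise = go (n - w) rest
    go _ [] = error "pickByWeight: empty list"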

Using that I then made a component that tags chunks:
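
Again as a sketch (tagByWeight is a made-up name, built on the pickByWeight above):

import Conduit (ConduitT, awaitForever, yield)
import Control.Monad.IO.Class (MonadIO)

-- Tag every chunk with a route index (0, 1, ...) chosen by weight.
tagByWeight :: MonadIO m => [Int] -> ConduitT a (Int, a) m ()
tagByWeight ws = awaitForever $ \x -> do
    i <- pickByWeight (zip ws [0 ..])
    yield (i, x)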

I was rather happy with this…


  1. Except maybe by using Template Haskell to generate the code I did come up with.

  2. I used Quickcheck’s frequency as inspiration for writing it.

February 17, 2019 12:00 AM

February 16, 2019

Brent Yorgey

Worstsort

Thanks for the responses to my previous post about finding roots of polynomials; I now have some new avenues to explore. But today I want to write about something completely different. I recently stumbled across this fun paper by Miguel Lerma. I realized a Haskell implementation would be very elegant, and I couldn't pass up the opportunity to share.

Badsort

This is badsort.

> import Data.List (permutations, insert)
> 
> badsort :: Ord a => Integer -> [a] -> [a]
> badsort 0 = foldr insert []
> badsort k = head . badsort (k-1) . permutations

Claim: badsort k is a correct sorting algorithm for any natural number k. Before reading further I recommend staring at that until you understand what it’s doing and why it works.

How badsort works

Badsort is very bad. Here’s how it works:

  • badsort 0 is just plain old insertion sort.

  • badsort k xs creates the list of all permutations of xs, sorts them into lexicographic order, and selects the first. This works because the lexicographically smallest permutation of xs is, in fact, the one which is sorted.

    Oh, and of course, sorting the permutations lexicographically is done by a recursive call to badsort (k-1). (As an aside, I like how seamless this is in Haskell with polymorphic recursion—each recursive call is at a different type.)

Here are a few examples to show that it works:

ghci> badsort 0 [3,1,2]
  [1,2,3]

ghci> badsort 1 [3,1,2]  -- generates 6 permutations
  [1,2,3]

ghci> badsort 2 [3,1,2]  -- generates 720 permutations of 6 permutations
  [1,2,3]

badsort 3 [3,1,2], if we tried it (not recommended!!), would generate all possible permutations of the list of 720 permutations of the list of 6 permutations of [3,1,2]. The number of such permutations is, of course, 720!, which has 1747 decimal digits; there is literally not enough space in the universe to store all those permutations.

In general, badsort k is a correct sorting algorithm1 which takes time \Omega((n!^{(k)})^2), where n!^{(k)} denotes the k-fold iterated factorial of n, that is, n!^{(0)} = n and n!^{(k+1)} = (n!^{(k)})!. (This doesn’t even take into account the time for accessing memory; at this scale we certainly can’t assume memory access takes constant time. Fetching memory from a data center in another galaxy takes a while, you know? =)

It gets worse

> worstsort :: Ord a => (Integer -> Integer) -> [a] -> [a]
> worstsort f xs = badsort (f n) xs
>   where
>     n = fromIntegral $ length xs

Worstsort is parameterized by a function on natural numbers, and calls badsort with a recursion depth given by the function f applied to the length of the list. Oh my.

Just for fun, let’s try, oh, say, the Ackermann function.

> ack :: Integer -> Integer -> Integer
> ack 0 n = n+1
> ack m 0 = ack (m-1) 1
> ack m n = ack (m-1) (ack m (n-1))
> 
> diabolicalsort :: Ord a => [a] -> [a]
> diabolicalsort = worstsort (\n -> ack n n)

Here are some fun properties of diabolicalsort (and any other instantiation of worstsort):

  • It will provably terminate in a finite amount of time for any input! Although probably the words “terminate” and “finite” should be in scare quotes.

  • In some sense I can’t quite define formally but still believe in my heart, it “doesn’t cheat” in the sense that it is always “making real progress” towards sorting the input list. If you are trying to design a slow sorting algorithm, it would be cheating, for example, to make an algorithm that spins in a useless loop for a thousand years and then does insertion sort.

  • It works in practice on lists of length 1 or 2, but length 3 is completely hopeless. ack 3 3 = 61, so we are looking at the 61-fold iterated factorial of 3, which is a… rather large number.

  • ack 4 4 is 2^{2^{2^{65536}}} - 3; there are not enough atoms in the universe to even write down this number in base 10. And then of course we take that number and iterate factorial that many times on 4. Sheesh.

  • Let us not even speak of lists of length 5.

The upshot of this, in the end, is that it is possible to make a “non-cheating” sorting algorithm whose running time grows faster than any computable function you care to choose (proof: take your chosen computable function and substitute it for f).


  1. It might be a fun exercise to prove this formally using a proof assistant.↩

by Brent at February 16, 2019 08:36 PM

February 14, 2019

Functional Jobs

Senior Haskell / Full Stack Developer at PRODA Ltd (Full-time)

Position summary

We are looking for a full-time senior software engineer to join our team in London or work remotely.

We want you to be someone who is looking to really help shape the future of the development team, have a real impact on strategy and architecture, and be the lead on our most important projects.

You will expand PRODA's capabilities in data ingestion, data exports, data standardisation and machine learning. We work in a functional programming stack in Haskell and Elm, together with machine learning components written in Python. Our development process includes CI/CD and automated tests. All developers work closely with our Chief Product Officer.

Skills & experience required

  • Significant experience in a functional programming language, for instance: Haskell, Scala, F#, OCaml, Clojure, etc.
  • Extensive knowledge working with a major RDBMS: PostgreSQL, MySQL, MS SQL Server.
  • Significant exposure to Javascript (node.js) and SPAs.
  • Good understanding of Linux and Shell Scripting in Bash.
  • Experience with basic machine learning algorithms, scikit-learn, Python, numPy, and pandas is highly beneficial.
  • Experience with AWS Cloud services, particularly EC2, RDS, Codedeploy, Codebuild, Codepipelines, S3.
  • Knowledge of build tools: Webpack, Makefiles.
  • Experience with web servers: basics of load balancers, Nginx configuration, etc.
  • Familiarity with Git and GitHub.
  • Familiarity with Docker.

About the Company

PRODA is a PropTech start-up founded in January 2017, backed by large PropTech players and real estate investors. We focus on solving core data processing pain points in commercial real estate, and believe that unstructured data is a key barrier to digital transformation in the industry.

PRODA is combining real estate expertise with data science to develop machine learning-enabled software, with the aim of truly automating many analysis and reporting tasks for real estate professionals.

Our software automatically captures property data from spreadsheets, before standardising, error-checking & securely storing it. This enables users to standardise and access all their data, regardless of the country, asset class or company that it originated from.

Once standardised, many exciting machine-learning analyses, change tracking, and machine-written reports become possible.

Please find press releases and the PRODA Whitepaper in the PRODA news section

Based in a smart WeWork on London's South Bank, PRODA has:

  • Modern, open office with communal areas, quiet spaces, table games, unlimited teas and coffees, and beers in the evenings
  • Focus on getting in specialists, encouraging meetups, conferences, training and learning for software engineers
  • Regular team meals and activities
  • Culture of smart individuals who collaborate to solve interesting problems using interesting technologies

Get information on how to apply for this position.

February 14, 2019 10:27 AM

February 13, 2019

Brent Yorgey

Finding roots of polynomials in Haskell?

tl;dr: There are several Haskell packages one can use to find an individual root of a function on a certain interval. But I’ve had less luck finding a suitable package for finding all the roots of a given polynomial. This blog post consists of my notes and a plea for help.

The diagrams-solve package contains miscellaneous numeric solving routines used in diagrams, including tridiagonal and cyclic tridiagonal linear system solvers (used for generating cubic splines and emulating Metafont paths) as well as functions for finding roots of low-dimensional polynomials (quadratic, cubic, and quartic). Solving quadratics is used in a bunch of places; solving cubics is needed specifically for doing interior/exterior testing on closed loops built from cubic Bezier curves; thankfully we have never needed to solve a quartic or higher.

Unfortunately, the polynomial root solvers are pretty naive: I simply transcribed some formulas (which is why it only goes up to quartics). This works OK a lot of the time, but the formulas are very numerically unstable, and it’s not hard to come up with example inputs where the returned roots are off by quite a bit. In fact, people regularly run into this when running the test suite. I am not specifically aware of any diagrams bugs that have arisen in actual practice due to the cubic solutions being off, but it’s probably just a matter of time.

So I decided it was finally time to look into better root-finding methods. This blog post is both a plea for help and a place to write down some of the things I’ve learned so far.

There’s root finding, and then there’s root finding

The first thing to get out of the way is that when you talk about “root finding” there are (at least!) two pretty different things you could mean:

  1. Given a function f and some particular interval, or an initial guess, find a value x in the interval/close to the initial guess for which f(x) = 0.
  2. Given a polynomial with {real, complex} coefficients, find all its {real, complex} roots.

If you want to do (1), there are several nice Haskell packages you could use. The Numeric.RootFinding module from the math-functions package is probably your best bet; it implements both Ridders’ method and the Newton-Raphson method, which both attempt to find a single root of a function on a given interval. They both work on any continuous Double -> Double function, not just polynomials (Newton-Raphson also needs to know the first derivative). But they don’t work well if you don’t already have an interval in mind to search; and if you want to find all the roots you have to call these multiple times (somehow coming up with an appropriate interval each time).

As for (2), I haven’t been able to find anything that would work well for diagrams-solve. Here are my notes:

  • The dsp package has a module Polynomial.Roots containing an implementation of Laguerre’s method, which finds all the (complex) roots of a polynomial. However, the dsp package is a rather heavyweight dependency to pull in just for the root-finding code; it’s also licensed under the GPL and I’d like to avoid having to “infect” the entire diagrams ecosystem with the GPL.

  • Laguerre’s method seems like it should be fairly easy to implement myself—but writing my own solver from scratch is how I got here in the first place; I’d really like to avoid it if possible. I am far from being an expert on numerical analysis, floating-point computation, etc.

  • The hmatrix-gsl package has the Numeric.GSL.Polynomials module, which has an interface to a root finding algorithm from GSL (apparently using something called “balanced-QR reduction”), but I'd like to avoid pulling in a C library as a dependency, and also, again, GPL.

  • From the Wikipedia page for Laguerre's method I learned that the Jenkins-Traub algorithm is another widely used method for polynomial root-finding, and often preferred over Laguerre's method. However, it seems rather complex, and the only Haskell implementation of Jenkins-Traub I have been able to find is this one, which seems to be just a toy implementation; I don't even know if it works correctly.

If you know of a good place where I can find polynomial solving code in Haskell, can you point me to it? Or if you know more about numerics than me, could you maybe whip up a quick implementation of Laguerre’s method and put it up on Hackage?

by Brent at February 13, 2019 10:19 PM

FP Complete

Randomization Testing for an SQL Translator - FP Complete

Not all SQLs are created equal. I'll say even more, none of the SQL dialects are even close to being equal. In particular, when talking about Microsoft SQL Server and PostgreSQL, their syntax might look similar. However, in their semantics, they are mountains apart. Today I will describe…

by Alexey Kuleshevich (alexey@fpcomplete.com) at February 13, 2019 04:29 PM

Sandy Maguire

Freer Monads, More Better Programs

If you consider yourself a Haskell beginner, this post is not aimed at you! You're going to want to understand DataKinds and RankNTypes in order to get things done. Feel free to read anyway, but keep in mind that the technical solutions described here are tricky.

Every two weeks in the functional programming slack, we get into a big argument about "the right way to structure programs." The debate goes around and around in circles; names get called; feelings get hurt. We never get anywhere, and the whole process starts again 14 days later. Frankly, it's exhausting.

As best I can tell, the community roughly fragments itself along four lines---those who like mtl, those who say "just do everything in Reader IO", those who like the three layer cake, and those who think free(r) monads are worth their weight in gold.

Being in the latter camp is frustrating, because everyone has strongly negative opinions on freer monads, and as best I can tell, nobody seems to have ever actually used them.

As one of the few people in the Haskell-sphere who has actually used freer monads in both anger and production, I wanted to set the record straight. So today, let's talk about freer monads---what they are, what they buy you, how they work, and what wide-scale adoption could buy us. Yes, I'll also talk about what their weaknesses are.

Criminally Misunderstood

Freer monads are amazingly powerful. Much more so than I think anyone realizes---including many of the people who maintain the libraries. There's a lot of free, super-common, crazy-generic functionality that exists, but isn't anywhere useful.

Freer monads are so much more than just a different way of expressing monad transformers. They're a completely orthogonal means of composition that doesn't really exist anywhere else in the Haskell ecosystem. By not using them, you are condemning yourself to writing significantly more, and significantly more complicated, code than you need to.

Traditional monad stacks can be understood as "a small, established toolbox for side effects." You say "I want some state, so I will add a StateT transformer," with the understanding that this is isomorphic to s -> (s, a). That means it only ever does local state.

I'd suggest we instead think about freer monads as "implementation mix-ins," or equivalently, as "compiler passes." Code written against freer monads is exceptionally high-level and doesn't mix concerns. It's all "business logic", where the implementation details are handled in layers, each one performed by a pass that simplifies the representation.

These passes are type-safe, independent and composable. You can mix-and-match which ones you want---which means testing is often just swapping in a test pass somewhere near the bottom of the stack. By mocking out different layers, it's easy to get 100% test coverage, without ever needing to write full-scale integration tests.

The beauty of this system is that the testability itself also composes. If your program runs properly under the test pass, and you can prove that both your test and real pass are also correct, this correctness composes to the entire program.

Having an exceptionally high-level representation of your program's intent is valuable in another way. Let's take a State capability as an example. This thing might correspond to local state, or a database, or maybe even just GET/POST HTTP requests. Who knows? But also, who cares?

Most of the people reading the code, most of the time, don't actually care what are the semantics behind the state. Those semantics are implementation details, interesting only to the people who care about the implementation. If you're tracing a program flow, and aren't interested in the database side of things, it's a lot nicer to not need to wade through a bunch of irrelevant database code.

In short, freer monads let you separate the high-level "what am I trying to do" from the low-level "how to actually do it."

Understanding Freer Monads

The Eff monad is parameterized by a type-level list of effects (or capabilities, as I will also call them). This list is kept polymorphic, and constraints are enforced on it to ensure that certain effects are available.

For example, the type StateT String (ReaderT Int IO) Bool is analogous to Eff '[State String, Reader Int, IO] Bool.

However, the type (MonadState String m, MonadReader Int m, MonadIO m) => m Bool in the mtl style also has an analogue: (Member (State String) r, Member (Reader Int) r, Member IO r) => Eff r Bool.

Freer monads are extensible in their effects---that means you can write your own, and use them completely interchangeably with existing effects. It's trivial to write a new effect, as they're just GADTs:

data Teletype a where
  GetLine ::           Teletype String
  PutLine :: String -> Teletype ()

This is all it takes to define a new effect. We now have a Teletype effect, and we can use it after a small amount of (freely derivable) boilerplate:

getLine :: Member Teletype r => Eff r String
getLine = send GetLine

putLine :: Member Teletype r => String -> Eff r ()
putLine = send . PutLine

Notice that the a in Teletype a describes the type you get back from calling this operation.

Our new Teletype effect corresponds to a domain specific language that can talk about reading and writing lines on a teletype. It's important to keep in mind that there is no meaning associated with this effect. We get no semantics, other than an implicit, unverified guarantee that "it probably does what you expect."

However, this lack of pre-established semantics is a feature, rather than a bug. The semantics are given after the fact by interpretations of the effects. One interpretation of Teletype might be to perform it in IO, interacting directly with the console. Another might be in the form of POSTing putLines to an HTTP server, and returning the results of a GET for getLine. Another could just do everything purely in memory.

Freer monads are extensible not only in their effects, but also in their interpretations. You can give new interpretations for existing effects, and for your own.
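
For instance, the console interpretation mentioned above can be written with freer-simple's interpretM; this is a sketch of mine rather than library code (qualified, because our getLine and putLine shadow the Prelude's):

import Control.Monad.Freer (Eff, LastMember, interpretM)
import qualified Prelude

-- Discharge Teletype directly into IO, talking to the console.
runTeletypeIO :: LastMember IO r => Eff (Teletype ': r) a -> Eff r a
runTeletypeIO = interpretM $ \case
  GetLine     -> Prelude.getLine
  PutLine msg -> Prelude.putStrLn msg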

freer-simple offers several combinators for constructing new effects, which we'll explore in the example below.

Solving Problems with Freer Monads

Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.

-- Melvin Conway

Freer monads are a data representation of your program, which then gets interpreted at finer-and-finer grained resolution until it's just code.

In other words, they enforce a clean boundary between "saying what you mean" and "saying how to do it." They let you focus on writing business logic, and relegate the implementation details to library code.

They give you composable, plug-and-play functionality for transforming a high-level business logic spec into an actual implementation.

As an example of how this works on a real-life application, let's write a program that fetches a CSV file from FTP, decrypts it, streams its contents to an external pipeline, and tracks its stats in Redis.

ingest
    :: ( Member (Input Record) r
       , Member (Output [Record]) r
       , Member (Output Stat) r
       )
    => Eff r ()
ingest = nextInput >>= \case
  Nothing     -> pure ()
  Just record -> do
    output [record]
    output ProcessedRecordStat
    ingest

Done!1 Pretty easy, hey?

"Now hold on a minute, Sandy! That program you wrote doesn't do what you said! It doesn't fetch files from FTP, it doesn't decrypt them, and it doesn't do anything with Redis."

That's right. It doesn't. What this does is exactly what the business people say they want---it moves data from one place to somewhere else, and lets you know about it. The rest are implementation details, and aren't relevant to anyone except the particular engineers responsible for this piece of the system. Engineers on other teams in your organization probably don't even care about the implementation details.

Written like this, it's easy for people to get a sense of what you're trying to accomplish, without needing to know the nitty-gritty details of connection management, credential management, performance enhancements, error handling, or the database layout. It concisely describes the goal, and leaves the irrelevant bits out of sight and out of mind.

Of course, not everyone wants this high-level picture. The people responsible for this service really and truly do care about how the thing actually works. At least, at some level of abstraction. The people whose job it is to ingest data probably care about the service's performance and error handling, but likely don't have strong opinions on the semantics of fetching data, the encryption schemes used, the database layout, or the choice of the external streaming pipeline. They probably don't even care that they're ingesting CSV files---they'd just as happily consume real-time JSON requests.

The goal is to make it easy for people to analyze the pieces they understand and are responsible for, and hide the noise of the underlying details to someone else.

So, how do we get from our high-level description to a real program? We transform it into a slightly less-high-level program. For example, in order to get our Input of Records, we do actually need to parse a CSV file. You'll notice that such a problem really doesn't care where the file comes from; it just wants something to read.

So we write an interpretation of Input Record in terms of CSV files. This suggests we have some sort of FileProvider capability, whose job it is to actually get us the file in question.
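
None of these effects is special, by the way; each is just a data type with a send-based smart constructor, exactly like Teletype. Here is one plausible set of definitions for Input, Output, and FileProvider, reconstructed from how they're used below (a sketch; the exact shapes are assumptions):

data Input i a where
  NextInput :: Input i (Maybe i)

nextInput :: Member (Input i) r => Eff r (Maybe i)
nextInput = send NextInput

data Output o a where
  Output :: o -> Output o ()

output :: Member (Output o) r => o -> Eff r ()
output = send . Output

data FileProvider a where
  GetFile :: FilePath -> FileProvider ByteString

getFile :: Member FileProvider r => FilePath -> Eff r ByteString
getFile = send . GetFile

With those in hand, the CSV interpretation looks like this: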

csvInput
    :: ( Member FileProvider r
       , FromCSVRow i
       )
    => FilePath
    -> Eff (Input i ': r) a
    -> Eff r a
csvInput file m = do
    contents <- getFile file
    let csvData = toList $ parseCSV contents
    handleRelayS csvData (const pure) bind m  -- discussed immediately after
  where
    --  bind :: [i] -> Input i x -> ([i] -> x -> Eff r a) -> Eff r a
    bind (row : rows) NextInput k = k rows $ Just row
    bind rows@[]      NextInput k = k rows Nothing

csvInput takes a file name, reads that file in terms of some abstract FileProvider capability, and then returns one row of the result every time nextInput is called in the higher-level application.

This function is implemented in terms of the handleRelayS combinator. You can think of handleRelayS as being parameterized on what return and (>>=) mean for the effect being interpreted. In addition, it allows you to thread a piece of state between binds and the final return.

Our implementation of bind is to return a new row of the CSV file for every subsequent call to nextInput in the original program. We accomplish this by returning the head of the list of rows, and then passing the tail as the next piece of state.

In effect, we've described what it means to have an Input capability in terms of what it means to have a FileProvider capability. Notice that this isn't the only interpretation of Input---it could just as well be implemented by reading from a streaming source, or by always giving back the same result, or by cycling through a static list.

The point is that the people writing the service don't care where this data is coming from. All they care is that they can read it and pipe it to the right place. In fact, they might want to test that their service works by calling it on a constant stream of data---so instead they can interpret it purely:

pureInput
    :: [i]
    -> Eff (Input i ': r) a
    -> Eff r a
pureInput is = handleRelayS is (const pure) bind
  where
    --  bind :: [i] -> Input i x -> ([i] -> x -> Eff r a) -> Eff r a
    bind (row : rows) NextInput k = k rows $ Just row
    bind rows@[]      NextInput k = k rows Nothing

(for bonus points, you can implement csvInput in terms of pureInput)
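
For instance, that bonus might look something like this (a sketch; csvInput' is just a fresh name to avoid clashing with the version above):

csvInput'
    :: ( Member FileProvider r
       , FromCSVRow i
       )
    => FilePath
    -> Eff (Input i ': r) a
    -> Eff r a
csvInput' file m = do
    contents <- getFile file
    pureInput (toList $ parseCSV contents) m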

Ok, great! The next step towards a working program is to give an interpretation of a FileProvider. We'll write two---one in terms of a lower level FTP capability, and one in terms of regular everyday IO:

ftpFileProvider
    :: Member FTP r
    => Eff (FileProvider ': r) a
    -> Eff r a
ftpFileProvider = interpret $ \(GetFile filename) ->
  ftpGet filename

localFileProvider
    :: Member IO r
    => Eff (FileProvider ': r) a
    -> Eff r a
localFileProvider = interpret $ \(GetFile filename) ->
  send $ Data.ByteString.readFile filename

Often you just want to reinterpret an effect in terms of some other effect you already have (here, FTP and IO, respectively). In this case, it's sufficient to use the interpret combinator, which implements your interpretation via a natural transformation (something of the form forall x. effect x -> Eff r x).

For testing, you might also want a mock filesystem---pureFileProvider :: Map FilePath ByteString -> _.
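
Fleshed out, that mock might look something like this (a sketch, assuming Data.Map for the fake filesystem; what to do about missing files is a free choice, here just an error):

pureFileProvider
    :: Map FilePath ByteString
    -> Eff (FileProvider ': r) a
    -> Eff r a
pureFileProvider files = interpret $ \(GetFile filename) ->
  case Map.lookup filename files of
    Just contents -> pure contents
    Nothing       -> error $ "pureFileProvider: no such file: " <> filename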

Our program can now provide an Input capability via a FileProvider capability, via IO directly or via an FTP capability. You get the picture.

Something we haven't yet handled is file decryption. It's worth noting that this concern is largely orthogonal to FileProviders; we'd like to be able to mix-in the capability to deal with encrypted files regardless of what the actual mechanism for files looks like.

For that, freer-simple exposes yet another combinator for writing interpretations: interpose. This combinator allows us to interpret a capability in terms of itself, which means we can intercept calls to a capability without necessarily handling them. Providing decrypted files is a good use case for this---we can intercept requests for files, and silently decrypt them before giving them back.

decryptFileProvider
    :: Member Encryption r
    => Eff (FileProvider ': r) a
    -> Eff (FileProvider ': r) a
decryptFileProvider =
  interpose $ \(GetFile filename) -> do
    cyphertext <- getFile filename
    decrypt cyphertext

We've gained the ability to inject logic around other interpretations!

Assuming we have an FTP implementation, the Input side of the coin is done. Now to deal with the Outputs of our ingestion program. Remember, we want to put our records into some external streaming service. We can naively provide an interpreter that POSTs these records against our service.

postOutput
    :: ( Member HTTP r
       , ToJSON i
       )
    => (i -> HttpRequest 'POST)
    -> Eff (Output i ': r) a
    -> Eff r a
postOutput mkReq = interpret $ \(Output i) ->
  postHttp $ mkReq i

Assuming we have another interpretation HTTP ~> IO, we're now good to go!

This works, but accounting comes back a few days later and complains---our streaming bill is suddenly really big. Apparently we pay per API call. Uh oh. The good news is that the API can handle up to 500 records per POST. So, we can just write another interpreter that batches writes before posting them.

batch
    :: Int
    -> Eff (Output [i] ': r) a
    -> Eff (Output [i] ': r) a
batch size = interposeS (0, []) finalize bind
  where
    -- finalize :: (Int, [i]) -> a -> Eff (Output [i] ': r) a
    finalize (_, acc) a = do
      output acc
      pure a

    -- bind
    --     :: (Int, [i])
    --     -> Output [i] x
    --     -> ((Int, [i]) -> x -> Eff (Output [i] ': r) a)
    --     -> Eff (Output [i] ': r) a
    bind (nacc, acc) (Output o) k = do
      let no     = length o
          total  = acc <> o
          ntotal = nacc + no
      if (ntotal >= size)
        then do
          let (emit, acc') = splitAt size total
          output emit
          k (ntotal - size, acc') ()
        else k (ntotal, total) ()

Cool. Now sticking a batch 500 pass before postOutput will batch all of our transactions sent to the API. Again, our business logic doesn't change, because it doesn't need to care about this implementation detail.

We could continue on, but at this point you've seen most of the machinery freer monads give us. At the end of the day, main will end up looking like this:

main :: IO ()
main = runM
     . runRedis
     . runFTP
     . runHTTP
     . runEncryption
     . redisOutput @Stat  mkRedisKey
     . postOutput @Record mkApiCall
     . batch      @Record 500
     . ftpFileProvider
     . decryptFileProvider
     . csvInput "file.csv"
     $ ingest

It composes nicely, and the compiler will yell at you if you forget to handle any of the required capabilities.

Behavior can be mixed in at will; some other common things you might want include retry-with-backoff, service discovery, chaos-injection, etc.

Over time and scale, you'll realize that most of your application code is the same crap over and over again---read configuration, connect to a database, deal with retry, shuffle data from one buffer to another. It's often hard to see this when it's written with a traditional monad stack, because traditional monad stacks don't give you the tools to abstract it away.

As you get into the habit of building new effects and interpretations for those effects, you'll see that new applications are often ready to ship after 25 lines of business logic and another 25 lines of choosing the right interpretations for it.

Bad Arguments Against Freer Monads

There are several arguments against freer monads, some of which are good, but most of which are terrible.

Free Monads Have Bad Performance

Free monads suffer from O(n²) complexity when used naively, which is unfortunately what you get by default. Freer monads are optimized via a queue which provides constant-time construction in the default case.

Yes, freer monads are today somewhere around 30x slower than the equivalent mtl code. That's roughly on par with Python, but be honest, you've deployed Python services in the past and they were fast enough. And besides, the network speed already dominates your performance---you're IO-bound anyway.

If you are writing real-time services maybe this will be an issue, but you're probably not. And if you are, optimizing Haskell is likely a skill you already have.

A subtle point to notice is that it's the monadic bits of the code that are 30x slower. Not "your program is 30x slower if you import Control.Monad.Freer"---but simply that you will spend more time in binds than you would in another monad. But your program isn't only monadic in Eff; it also needs to compute expressions and wait for IO and all of that stuff.

If it makes you feel better, I recently got a 15% performance increase by just more aggressively inlining some of the combinators. This suggests there are a lot of low-hanging optimization wins for anyone willing to put in the work to pluck them.

In short: worry about writing good code first, and deal with performance if it becomes an issue.

Purescript Abandoned Eff

Purescript had a thing called Eff, but it was not the same as this. From the purescript-eff readme:

As of PureScript 0.12 the default type for handling effects is Effect from purescript-effect. This differs from Eff by removing the row of effect types. This decision was made as getting the effect rows to line up was sometimes quite tricky, without providing a great deal of benefit.

There is also purescript-run now, which uses a similar effect row mechanic but provides true algebraic effect handling. [emphasis mine]

The Eff described in this document is equivalent to purescript-run.

Reasonably Good Arguments Against Freer Monads

ContT is Not an Algebraic Effect

I never really understood this one as stated---I've never actually used ContT in a real monad stack. Have you?

But the sentiment behind this argument is better stated in human terms as "Eff is unable to model resource bracketing." Which is to say, it's hard to make sure an Eff program calls all of its finalizers.

The good news is that there's a solution if your allocation and cleanup code only requires IO---you can just interpret your entire Eff monad directly into ResourceT:

bracket
    :: Member (ResourceT IO) r
    => IO a
    -> (a -> IO ())
    -> (a -> Eff r b)
    -> Eff r b
bracket alloc dealloc m = do
  (key, a) <- send $ allocate alloc dealloc
  result   <- m a
  send $ release key
  pure result

Specialize bracket with your own first two parameters to taste.
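
For example, a file-handle helper might look like this (a sketch, assuming System.IO's openFile and hClose):

withFileEff
    :: Member (ResourceT IO) r
    => FilePath
    -> (Handle -> Eff r b)
    -> Eff r b
withFileEff path = bracket (openFile path ReadMode) hClose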

More annoyingly, the lack of ContT-support means that it's hard to write effects that imply asynchronicity. That's not to say it's impossible, merely that it doesn't compose in the same nice way that synchronous effects do.

This is bad, but not disastrously so. You can spin up a thread pool elsewhere, and add a capability that sends effects to it:

data AsyncEff capabilities a where
  AsyncEff
      :: Members capabilities r
      => Eff r a
      -> AsyncEff capabilities ()


startAsyncTaskSlave
    :: Members capabilities r
    => (forall x. Eff r x -> IO x)
       -- ^ An interpretation stack from `Eff r` into `IO`
    -> IO (InChan (AsyncEff capabilities))
startAsyncTaskSlave runEff = do
  (inChan, outChan) <- newChan 10
  void . async . forever $ do
    AsyncEff m <- readChan outChan
    void . async $ runEff m
  pure inChan


asyncEff
    :: Member IO r
    => InChan (AsyncEff capabilities)
    -> Eff (AsyncEff capabilities ': r) a
    -> Eff r a
asyncEff chan = interpret $ send . writeChan chan

Changing the interface to fill an MVar upon completion of the task and make it available to the original Eff program is an exercise left to the reader.

The Error Messages Are Bad / It's Too Complicated

This has historically been true. While freer-simple makes the situation significantly better, there is definitely room for improvement on this front.

First things first, Eff eschews the functional dependencies that mtl has. This means you can have multiple Writer effects in the same stack in Eff (but not in mtl) at the cost of type-inference.

This is both a feature, and, I won't lie to you, amazingly annoying at times. It's a feature because lots of things are just Writer effects. It's annoying as heck because polymorphism makes it eat shit.

For example, consider the following innocuous looking program:

foo :: Member (Writer Int) r => Eff r ()
foo = tell 15

Seems fine, right? Wrong. Because 15 is actually fromInteger 15 :: Num a => a, this program will complain about not having a Writer a capability. You as a human know what should happen here, but the compiler is stupid.

Thankfully the solution is simple, but it requires knowing what's wrong and how to fix it.

foo' :: Member (Writer Int) r => Eff r ()
foo' = tell @Int 15

If you're going to be doing a lot of work with polymorphic effects, a low-energy solution is to just provide a locally-bound monomorphic type:

foo'' :: Member (Writer Int) r => Eff r ()
foo'' = do
  let tellInt = tell @Int
  tellInt 1
  tellInt 2
  tellInt 3

All of this is much less user-friendly than it should be. However, in my experience, people quickly learn how to debug problems like this. It was enough to have an "Eff mentor" on our team, whose job it was to promptly reply to "I don't know why this doesn't work."

Jesus Help Me There Are A Lot of Unmaintained Free(r) Monad Packages

Tell me about it. Even as someone who is keenly interested in this stuff I have a hard time keeping up with the situation.

Here's the skinny---I'd strongly recommend freer-simple. Failing that, if you really, really, really need the performance, take a look at fused-effects.

Ignore the other ones.2

Conclusion

Freer monads are fucking sick and you'd be foolish to not at least consider them for your next project.

Furthermore, if you're going to continue insisting on saying that $technology is better, I strongly encourage you to write up a similar argument stating your case. My mind is open on this; if you make a strong argument, I'm more than happy to denounce this article and jump on the $technology train too.

It's worth keeping in mind that despite our small differences, we're all on the same team here. We all love functional programming and want to do our best to make it more popular. As best I can tell, the best strategy towards that aim is to come up with a consensus on how to do things, and to stop the needless infighting.

One love.


  1. Input and Output are called Reader and Writer respectively in freer-simple. I decided to not use this terminology in order to prevent people from thinking that these are the same monads they're used to.

  2. If you're the maintainer of another effects package and want me to include it here, shoot me an email and make an argument!

February 13, 2019 11:40 AM

Tweag I/O

The types got you

Mark Karpov

Haskell is in a way a unique language. It is, to be clear, a language people and businesses use for building “serious stuff”, yet it is also a language which remains a platform for experimentation with functional programming. This means that the teams which develop in Haskell have access to expressivity and flexibility few other languages allow.

As the GHC compiler evolves and grows its feature set, one can't help but wonder about the connection between the power provided by a programming language and the process of making design decisions when developing software.

We could start by stating that power and freedom are dangerous. Indeed, in human societies these have been continuously restricted or contained on different levels with tools like law, tradition, culture and religion. Which is not at all bad, because for most people it's hard to work, live, and coexist without preset constraints. Freedom is generally confusing. The statement holds in software systems as well: a task is always more feasible if some questions have been answered definitively before you start working. With each question answered for you, a possibility to screw up is removed.

Haskellers know and seem to actively acknowledge that limiting expressivity and freedom is good for them. They like DSLs and can argue endlessly about how to model monadic effects so that a piece of code can only do exactly what is necessary and no more. So why not take a step back and look at the bigger picture? And by “bigger picture” I mean the language as a whole, as implemented by GHC.

No, I'm not suggesting removing features from the compiler. But with great power comes great responsibility. Or in our case, rather a need for great caution. Using advanced features of the type system takes a fair bit of judgment to get right. Judgment comes from experience. But who really has this experience? Who are we but a relatively small group of engineers trying to build production software using tools that relatively few people have used for this before, and even fewer people have used successfully?

Some features may be so new that no one yet truly knows what to do with them. And by “what to do with them” I don't mean that people don't know how to make fancy stuff compile. I assure you, that much is accessible to many. The problem is predicting what use of these features will give you in the long term. Is it worth it? Will it actually slow down development? Will it make it harder for new people on the team to start working on your code? Are you actually catching more bugs at compile time, or do you just think you do? Are you thinking about practical results, or about the intricate niceties of your code?

Developing in Haskell is hard because it's easy to take a wrong turn that may be fatal. I'm writing this on the basis of my experience as a consultant. The anecdata of a single person might not be convincing to you, but if there were to be just one common theme across many of our projects at Tweag in the past year, it would be this: we have witnessed the high cost of entirely incidental complexity many times over. Our data suggests this issue is real and very common.

Unfortunately, the problem is not limited to code in the language itself. There seems to be a trend for disregarding common engineering practices, reinventing solutions, preferring solutions written in Haskell, or solutions which just seem to be “nicer” without doing unbiased comparison and analysis.

In the essay The bipolar Lisp programmer, Mark Tarver made the point that the Lisp programming language attracts a certain kind of personality. I think it is generally true that there is a connection between personalities and the tools that people choose. Perhaps this is most prominently revealed in niche communities that are made up of enthusiastic people, not just people who have to pay the bills.

Haskell does attract a certain kind of personality. Who in their sane mind would be entirely oblivious to the many barbaric and funny words that form our lingua franca in documentation, tutorials and books today? Probably someone who is either deeply convinced of the benefits of typed FP, or someone who has, let's admit it, a somewhat adventurous and curious mind. It is easy to forget the timeless principles of robust engineering in this intellectual pursuit, perhaps without even noticing it.

Good music producers listen. They judge everything by how it sounds, because when the result is rendered and saved, people won't know how many effects, plugins, advanced techniques from some magazine were used. Nope, they really could not care less about that. They only care about the sound. For a beginner though, it is tempting to throw in a lot of processing and clever tricks, and in doing so they often overdo and produce a track that sucks.

The amateur producer from our example tried to make a record better, but he has not yet understood that his focus is misplaced. I think the situation is often the same with development in Haskell.

At the most basic level, development is about efficiently turning money into software that does something, while remaining modifiable and maintainable. Haskell98 with some “benign” extensions is often the optimal solution in the trade-off between safety and simplicity. With conventional Haskell, developers can effortlessly stay in the pit of success.

Even though Haskell is a wonderful language by default, there are ways to get out of the pit. Ironically, these ways are exceptionally attractive to Haskellers for exactly the same reason Haskell itself is attractive to them: exploration and the endless pursuit of correctness. There are probably far more people who will go with anything that promises an improvement with respect to type-safety than those who are capable of sober estimation and balancing.

In the end, the main thought here is the following: build using simple proven techniques even if you're using a technology like Haskell. For anything extra (like dependent types, formal verification, etc.) you might want to think twice and thrice. And still remain uncertain.

February 13, 2019 12:00 AM

February 12, 2019

Functional Jobs

Full Stack Sr. Software Engineer (Haskell) at Interos (Full-time)

Interos Solutions, Inc. is a fast-growing venture-backed company with an AI-driven SaaS application that delivers our commercial and government customers unique insight into their ever-changing business ecosystems. The core of our platform is the compilation, analysis and visualization of dynamically changing big data collected across open source, proprietary and public data sources. By continuously analyzing our "real-time" data stream we provide our customers with unique insights into their business relations, supply chains and other third-party activities. To accelerate the growth of our application, we are putting together a team of Haskell engineers, data analysts, data scientists, UI/UX professionals and product managers. We are committed to building a world-class product organization that leverages tools like Haskell and the latest machine learning techniques to achieve outsized results as individuals and as a team.

Summary 

You will design, code, implement and maintain both front-end and back-end technologies. As one of our first Full-stack Engineers, you will have room to shape your work and have an outsized impact on our product and our culture. You will work across our tech stack to develop our enterprise-grade application, help institute effective processes for ensuring our products have quality code and minimal defects, collaborate with our VP of Engineering, founder and our customers on product features, and more. We're looking for engineers who relish solving hard technical and analytics challenges and diving into the subtle details that make products amazing. Being part of and leading an innovative engineering team, you'll need solid hands-on experience in Haskell as well as web technologies (HTML/CSS/JS, REST, JSON/XML) and databases (Relational and NoSQL). Experience with machine learning, big data and cloud infrastructure are a plus.

Responsibilities include:

  • Working with the management team and customers to understand desired features, capabilities and testing scenarios
  • Developing and maintaining the Interos customer-facing platform, machine learning capabilities and infrastructure
  • Working within and across Agile teams to design, develop, test, implement, and support technical solutions across a full-stack of development tools and technologies
  • Leading the craftsmanship, availability, resilience, and scalability of Interos' application
  • Bringing a passion to stay current with technology, experimenting with and learning new technologies as well as a desire to train developers without Haskell experience

Requirements/Qualifications: (This is who we are looking to hire)

  • At least 5 years of hands on experience in designing and architecting applications in a functional style and a strong working foundation in machine learning, big data and AWS infrastructure
  • Bachelor's degree in Computer Science, Engineering, Mathematics, Physics or another scientific field is preferred
  • Experience working within a distributed micro-services architecture
  • Enough experience with Haskell to be comfortable with things like: Monad Transformers, Rank-N Types, Functional Dependencies, Type Families, GADTs, Lenses, Type-Level Programming
  • Experience with either of the following would be a plus: Reflex / Reflex.Dom / Obelisk, Data Visualization
  • Experience and interest in training developers without Haskell/FP experience
  • Completed projects that you can discuss in depth or code samples (ideally in Haskell) that you can share are a big plus

Location: Arlington, VA (15 minutes from downtown DC)

We realize this is a lot to ask and we are hiring for a lot of similar roles soon so if this opportunity sounds interesting to you, feel free to contact us even if you don't fit the bill exactly.

Get information on how to apply for this position.

February 12, 2019 08:18 PM

Full Stack Software Engineer (Haskell experience preferred) at Interos Solutions (Full-time)

Interos Solutions, Inc. is a fast-growing venture-backed company with an AI-driven SaaS application that delivers our commercial and government customers unique insight into their ever-changing business ecosystems. The core of our platform is the compilation, analysis and visualization of dynamically changing big data collected across open source, proprietary and public data sources. By continuously analyzing our "real-time" data stream we provide our customers with unique insights into their business relations, supply chains and other third-party activities. To accelerate the growth of our application, we are putting together a team of Haskell engineers, data analysts, data scientists, UI/UX professionals and product managers. We are committed to building a world-class product organization that leverages tools like Haskell and the latest machine learning techniques to achieve outsized results as individuals and as a team.

Summary 

You will code, implement and maintain both front-end and back-end technologies. Being part of an innovative engineering team, you'll need solid experience in web technologies (HTML/CSS/JS, REST, JSON/XML) and databases (Relational and NoSQL). Experience with machine learning, big data, and (AWS) cloud infrastructure are a plus. If you have experience in Haskell, even better; if not, you must have a desire to learn Haskell in "real time".

Responsibilities include:

  • Working with the management team and customers to understand desired features, capabilities and testing scenarios
  • Developing and maintaining the Interos customer-facing platform, machine learning capabilities and infrastructure
  • Working within and across Agile teams to design, develop, test, implement, and support technical solutions across a full-stack of development tools and technologies
  • Driving the craftsmanship, availability, resilience, and scalability of the Interos application
  • Bringing a passion to stay current with technology, experiment with and learn new technologies such as Haskell

Requirements/Qualifications: (This is who we are looking to hire)

  • At least 2 years of hands on experience and a strong foundation in coding, big data and AWS infrastructure

  • Bachelor's degree in Computer Science, Engineering, Mathematics, Physics or another scientific field is preferred

  • Passionate about software engineering and comfortable with front-end and back-end development

  • Interested (if not experienced) in Haskell and Functional Programming

  • Experienced with a variety of SQL and NoSQL databases

  • Experience with any of the following would be a plus: Haskell or other FP languages (PureScript, Elm, OCaml, Clojure, etc) Python, Reflex / Reflex.Dom / Obelisk, Data Visualization

  • Experience working within a distributed micro-services architecture

  • Working understanding of machine learning concepts and AWS.

  • Ability to work in a team as well as independently and deliver on aggressive schedules and goals

  • Completed projects that you can discuss in depth or code samples that you can share are a big plus

Location: Arlington, VA (15 min from downtown DC)

Get information on how to apply for this position.

February 12, 2019 08:18 PM

FP Complete

WebAssembly in Rust: A Primer

We are excited that we have been able to keep up with the pace of our monthly webinar series. More importantly, we have been able to offer a variety of topics, and this month's webinar certainly illustrates our commitment to webinar diversity. We are pleased that Aniket Deshpande (Software and DevOps Engineer at FP Complete) was able to discuss and demonstrate "WebAssembly in Rust: A Primer" for this webinar. We had 216 people registered for the event, which aired on Wednesday, Jan 30th, 2019 at 10:00 am PST. As we always do, we recorded the event to enable anyone who could not tune in live to enjoy the webinar at their own pace. Check it out below.

by Robert Bobbett (robert@fpcomplete.com) at February 12, 2019 04:05 PM

Yesod Web Framework

Deprecating GoogleEmail2

As I'm sure many readers are aware, Google+ is shutting down. This affects Yesod's authentication system. In particular, any users of the Yesod.Auth.GoogleEmail2 module will need to migrate.

Fortunately for all of us, Patrick Brisbin has not only written the code for using Google Sign-in, but has also put together a great migration blog post. If you're affected, please check that out for a relatively painless migration.

Earlier today, I migrated Haskellers.com in two commits: the first did all of the work, and the second made things more future-proof.

As you may be able to tell from the GoogleEmail2 module, this isn't the first time we've had to migrate between Google authentication APIs. Hopefully this one will stick.

February 12, 2019 03:49 PM

Michael Snoyman

Call for new Stack issue triager

For those unfamiliar with it, Stack has a team called the “Stack Issue Triagers.” If you’ve been interacting on the Stack issue tracker recently, it’s likely that those are the people you’ve seen answering questions. We’ve been slowly ramping up the team, and it seems to have been quite successful so far at helping Stack users, as well as escalating issues and getting them resolved as necessary.

We’re looking to expand the team again. This team is a volunteer effort, with each person taking a week at a time of responding to issues. We have a lot more information on this in the team process Stack documentation. Working with this team is not only a great service to the community, but also gives you a chance to learn more about maintaining a large open source Haskell codebase, and gives you many opportunities to interact with the Stack maintainers.

For now, we’re adding 1 new person to the team, though we’ll likely add another one in three months time, and continue the ramp up process from there. If you’re interested in taking part, please send me a private email to michael at snoyman dot com.

I’ll collect applicants and discuss with the team members. If you have questions about the process, please feel free to ask on the Gitter Stack chat!

February 12, 2019 08:35 AM

Gabriel Gonzalez

Haskell command-line utility using GHC generics

cli-twitter

Today, Justin Woo wrote a post about writing a simple Haskell command-line utility with minimal dependencies. The utility is a small wrapper around the nix-prefetch-git command.

In the post he called out people who recommend overly complex solutions on Twitter:

Nowadays if you read about Haskell on Twitter, you will quickly find that everyone is constantly screaming about some “advanced” techniques and trying to flex on each other

However, I hope to show that we can simplify his original solution by taking advantage of just one feature: Haskell’s support for generating code from data-type definitions. My aim is to convince you that this Haskell feature improves code clarity without increasing the difficulty. If anything, I consider this version less difficult both to read and write.

Without much ado, here is my solution to the same problem (official Twitter edition):

{-# LANGUAGE DeriveAnyClass        #-}
{-# LANGUAGE DeriveGeneric         #-}
{-# LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE OverloadedStrings     #-}
{-# LANGUAGE RecordWildCards       #-}

import Data.Aeson (FromJSON, ToJSON)
import Data.Text (Text)
import Options.Generic (Generic, ParseRecord)

import qualified Data.Aeson
import qualified Data.ByteString.Lazy
import qualified Data.Text.Encoding
import qualified Data.Text.IO
import qualified Options.Generic
import qualified Turtle

data Options = Options
    { branch :: Bool
    , fetchgit :: Bool
    , hashOnly :: Bool
    , owner :: Text
    , repo :: Text
    , rev :: Maybe Text
    } deriving (Generic, ParseRecord)

data NixPrefetchGitOutput = NixPrefetchGitOutput
    { url :: Text
    , rev :: Text
    , date :: Text
    , sha256 :: Text
    , fetchSubmodules :: Bool
    } deriving (Generic, FromJSON)

data GitTemplate = GitTemplate
    { url :: Text
    , sha256 :: Text
    } deriving (Generic, ToJSON)

data GitHubTemplate = GitHubTemplate
    { owner :: Text
    , repo :: Text
    , rev :: Text
    , sha256 :: Text
    } deriving (Generic, ToJSON)

main :: IO ()
main = do
    Options {..} <- Options.Generic.getRecord "Wrapper around nix-prefetch-git"

    let revisionFlag = case (rev, branch) of
            (Just r , True ) -> "--rev origin/" <> r
            (Just r , False) -> "--rev " <> r
            (Nothing, _    ) -> ""

    let url = "https://github.com/" <> owner <> "/" <> repo <> ".git/"

    let command =
            "GIT_TERMINAL_PROMPT=0 nix-prefetch-git " <> url <> " --quiet " <> revisionFlag

    text <- Turtle.strict (Turtle.inshell command Turtle.empty)

    let bytes = Data.Text.Encoding.encodeUtf8 text

    NixPrefetchGitOutput {..} <- case Data.Aeson.eitherDecodeStrict bytes of
        Left string -> fail string
        Right result -> return result

    if hashOnly
        then Data.Text.IO.putStrLn sha256
        else if fetchgit
            then Data.ByteString.Lazy.putStr (Data.Aeson.encode (GitTemplate {..}))
            else Data.ByteString.Lazy.putStr (Data.Aeson.encode (GitHubTemplate {..}))

This solution takes advantage of two libraries:

  • optparse-generic

    This is a library I authored which auto-generates a command-line interface (i.e. argument parser) from a Haskell datatype definition.

  • aeson

    This is a library that generates JSON encoders/decoders from Haskell datatype definitions.

Both libraries take advantage of GHC’s support for generating code statically from datatype definitions. This support is known as “GHC generics”. While a bit tricky for a library author to support, it’s very easy for a library user to consume.

All a user has to do is enable two extensions:
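
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric  #-}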

… and then they can auto-generate an instance for any typeclass that implements GHC generics support by adding a line like this to the end of their data type:
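
    deriving (Generic, SomeTypeClass)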

You can see that in the above example, replacing SomeTypeClass with FromJSON, ToJSON, and ParseRecord.

And that’s it. There’s really not much more to it than that. The result is significantly shorter than the original example (which still omitted quite a bit of code) and (in my opinion) easier to follow because actual program logic isn’t diluted by superficial encoding/decoding concerns.

I will note that the original solution only requires using libraries that are provided as part of a default GHC installation. However, given that the example is a wrapper around nix-prefetch-git then that implies that the user already has Nix installed, so they can obtain the necessary libraries by running this command:
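
$ nix-shell -p "haskellPackages.ghcWithPackages (pkgs: with pkgs; [ aeson optparse-generic turtle ])"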

… which is one of the reasons I like to use Nix.

by Gabriel Gonzalez (noreply@blogger.com) at February 12, 2019 02:28 AM

February 11, 2019

Monday Morning Haskell

Haskell Data Types Review!

This week we're taking a quick break from new content. We've added our new series on Haskell's data system to our permanent collection. You can find it under the beginners panel or check it out here! This series had five parts. Let's take a quick review:

  1. In part 1 we reviewed the basic way to construct data types in Haskell. We compared this to the syntax of other languages like Java and Python.
  2. Part 2 showed the simple way we can extend our Haskell types to make them sum types! We saw that this is a more difficult process in other languages. In fact, we resorted to making different inherited types in object oriented languages.
  3. Next, we demonstrated the concept of parametric types in part 3. We saw how little we needed to add to Haskell's definitions to make this work. Again, we looked at comparable examples in other languages as well.
  4. In part 4, we delved into Haskell's typeclasses. We compared them against inherited types from OO languages and noted some pros and cons.
  5. Finally, in part 5 we concluded the series by exploring the idea of type families. Our code was more complicated than we'd need in other languages. And yet, our code contains a lot more behavioral guarantees in Haskell than it does elsewhere. And we achieved this while still having a good deal of flexibility. Type families have a definite learning curve, but they're a useful concept to know.

As always, keep coming back every Monday morning for some new Haskell content! For more updates and our monthly newsletter, make sure you Subscribe! This will also give you access to our Subscriber Resources!

by James Bowen at February 11, 2019 03:30 PM

February 10, 2019

JP Moresmau

Rust helped me escape Cthulhu!

I just downloaded the game Lovecraft Quest on my Android tablet, and was enjoying the puzzles, until I got to the final one, and that got me, ahem, *frustrated*. I found a walk-through where the reviewers admitted to more or less clicking randomly on the screen for one hour until it worked.

I actually managed to solve the puzzle the first time quickly with sheer luck, but on the second run I was stumped. So I followed the only reasonable course of action: let the computer solve the puzzle for me!

The puzzle is as follows: there is a square (4 rows, 4 columns) of gold bars that can be either in horizontal or vertical position. You can make one bar pivot, but when you do that all the bars in the same row and column pivot too! You need to get all bars to the vertical position for the door to open to reveal... Cthulhu itself!

So let us use Rust for this puzzle, and see if a brute force approach can work. 
Let's first define a few types. A grid is a 4x4 array of booleans and a position is a tuple. We'll have positions ranging from (0,0) to (3,3). A grid is solved if all elements are true (so true means vertical position for a bar).
Then let's implement the swapping operation. We'll take one Grid reference and return a new Grid with the proper bars swapped.
Note that we swap again the position we chose, since we swapped it twice already, once for the columns and once for the rows. Not very elegant, maybe, but simple.

The main solving function is as follows: we're given a grid and we return a list of positions to swap to solve the puzzle. If the grid is already solved we return an empty vec. Otherwise we keep two structures: a map of grids to a vector of positions (showing how we got to that particular grid from the start grid) and a list of grids to process. We use a Deque here so we can put new grids at the end but process grids from the start, to do a breadth-first traversal of all possible grids.
The try_all function will do all possible swaps on the given grid and store all the results with the path used to get to them. We store for all grids the shortest path we used to get to it. New grids get added at the end of the todo list. We return the path if we end up with a winning grid.
To simplify calling the program, we'll just pass a string made of 16 characters, with 1s and 0s indicating the position of the bars we see on the screen. Parsing this string into a Grid instance is easy enough with the help of chunks().
And we can then pass that string as an argument to our binary program, and print the result.
In my case, the program ran and gave me the solution: only 12 moves!

This Rust code is probably fairly naive in places, since I only have a couple of weeks of evenings learning it. I'm not too happy with the number of clone() calls I had to add to make the borrow checker happy, on top of the ones I thought I needed anyway. Maybe immutable structures would in fact be easier to use here! Of course this code could be made parallel but since it gave me my answer in a few seconds I didn't need to improve it further.

Happy Rust Hacking!


by JP Moresmau (noreply@blogger.com) at February 10, 2019 03:49 PM

Magnus Therning

Using stack to get around upstream bugs

Recently I bumped into a bug in amazonka.1 I can’t really sit around waiting for Amazon to fix it, and then for amazonka to use the fixed documentation to generate the code and make another release.

Luckily stack contains features that make it fairly simple to work around this bug until it's properly fixed. Here's how.

  1. Put the upstream code in a git repository of your own. In my case I simply forked the amazonka repository on github (my fork is here).
  2. Fix the bug and commit the change. My change to amazonka-codepipeline was simply to remove the missing fields – it was easier than trying to make them optional (i.e. wrapping them in Maybes).
  3. Tell stack to use the code from your modified git repository. In my case I added the following to my stack.yaml:

     extra-deps:
       - github: magthe/amazonka
         commit: 1543b65e3a8b692aa9038ada68aaed9967752983
         subdirs:
           - amazonka-codepipeline

That’s it!


  1. The guilty party is Amazon, not amazonka, though I was a little surprised that there doesn’t seem to be any established way to modify the Amazon API documentation before it’s used to autogenerate the Haskell code.

February 10, 2019 12:00 AM

February 09, 2019

JP Moresmau

Pedal to the Metal: starting with Rust


After hearing about Rust for years, I finally decided to give it a go. This is my journey so far.

I'm currently using Visual Studio Code with the Rust extension. It's sufficient for the little exercises I've been doing so far, but I noticed that code completion is not very accurate and does not seem to give me everything that's possible. I've installed the IntelliJ Rust extension in my IDEA Community Edition; I'll see how well it works when I start doing bigger projects.

Rust is obviously a different kettle of fish than the last language I picked up, which was Go. It's a lot more complex and rich and hopefully powerful, and consequently takes a lot more time to master. It feels a bit like a cross between C for the low level aspects and Haskell for the structural and functional parts. I was a bit surprised straight away to read about variable shadowing since it seems to take away the security somehow (you cannot mutate a variable but you can redefine it), but from what I read it doesn't seem to be an issue practically since the compiler holds your hand at all time.

Most languages I've worked with have a Garbage Collector, so I'd say dealing with the borrow checker is going to be fun in larger code bases, but so far in the small exercises I've done it hasn't been an issue.

Hopefully I can get productive with Rust quickly, I'd like to contribute to projects like Rusty-Machine to do neural networks and ML with Rust, which would seem a perfect match with its performance and safety. We'll see!

Happy Rust Hacking!


by JP Moresmau (noreply@blogger.com) at February 09, 2019 10:44 AM

February 07, 2019

Sandy Maguire

How to Write Technical Posts (so people will read them)

One of today's best bloggers in the functional programming space is Chris Penner. Despite this, Chris is criminally underrated---nobody I talk to seems to know of his work. Go check him out as soon as you finish reading this post!

This is a reasonably pervasive phenomenon in our subculture; there's lots of fantastic work being produced, but for one reason or another, it falls between the cracks. I'd propose that the biggest obstacle is that most FP writing isn't optimized to be read.

So I'd like to take some time today and talk about common failure modes in technical writing. If you don't have a checklist you run through before publishing a post, you'll probably get a lot of benefit from internalizing this advice.

The value? You'll make it easier for people to understand what you're trying to tell them. Which is why you're writing in the first place, right? The good news is that none of this is magic---just some simple guidelines for structuring your content.

The Guiding Principle

Here's the biggest thing to keep in mind: your reader doesn't really care what you have to say. You get maybe four sentences to convince them that your essay is worth their time. If you haven't made your case by then, you've probably lost them to the next tab they binge-opened.

Even after you've convinced them that it's a valuable read, you need to continue assuring them that you're not wasting their time. If you take too long to get to the point, or if you make the information too hard to find on a quick skim, you've lost them again.

As a result, alongside your technical content, you have two primary goals to focus on:

  • Provide strong, concise motivations for everything you want to talk about.
  • Make it easy to skim.

If you can do these two things, you're already most of the way to better engagement.

Writing Strong Motivations

People care about solutions to problems they have, or expect to have one day. They care about other things too, but it's not really relevant to us today.

With this in mind, you want to tailor your motivations in terms of solutions to problems.

Here are some good examples of problems that you've probably run into (and links to their solutions for you to read later):

Bad examples of motivating documents are things that assume you care simply because they exist. This is pretty common of libraries' tutorials. The same document that convinces you to use a technology should also show you how.

As a more general rule, focusing on why is going to be more valuable than on how.

Why should someone care, and why your solution is a good one.

"How" without a "why" suggests a contrived solution to a made-up problem. In other words, it's easily read as "who cares?"

Understanding How People Read

People who spend lots of time reading have good heuristics for skipping lots of text. If you understand these heuristics, you can make it easy for people to find what they're looking for. And remember: they're looking for reasons to continue reading.

There are two behaviors here. The first is what I call "skimming for concepts"---which is that people will attempt to determine what text is about at a high-level. They're looking for what you're trying to tell them, as opposed to what you're actually saying.

When skimming, people are likely to read only the headings and the first two sentences of a paragraph. If they're convinced they know what you have to say, they'll skip to the next paragraph. If several paragraphs in a row don't seem relevant, they'll skip to the next heading.

If the next heading also fails to grab their attention, they'll probably give up.

The solution is to structure your argument as a tree. Headings must provide enough information to let someone know whether or not they care about the following prose.

This also means that the first sentence of each paragraph should be sufficient to understand the rest of the paragraph. The remaining sentences are only allowed to reinforce or give details of the first sentence. One paragraph equals one idea.

Roughly speaking, the hierarchy of your document should look like this:

  • Document
    • Heading
      • First sentence
        • Rest of paragraph

Maybe you feel like it's hard to know how much knowledge to assume of your readership. Understanding how people read makes this a non-issue. Present as much information as is necessary to understand your point, but make it easy to skip if people already know it. Those who don't yet know will learn, those who do can skip past, and both camps will appreciate it.

After you've convinced your reader that they care what you have to say, they're more willing to read what you actually have to say. When the reader is ready to dive in, that's when you can make the finer points of your argument.

It's unlikely that anyone is going to read every word of your essay. It's even less likely that they'll read them all in order.

Stumbling Blocks

Now that you've got people ready to listen to you, it's important to keep them on the same page as you. You really need to stay on top of anything that might break their focus.

Use short sentences. If it's too hard to parse, people won't.

Make sure your spelling and grammar have no egregious problems. You don't need to have perfect command of the language, but you need to be able to signal that you're trustworthy-enough to listen to. Don't underestimate how much credibility you'll lose from blatantly bad spelling and grammar.

This stuff is relatively common-sense.

What might be less so is that you also need to stay on top of conceptual stumbling blocks. If your argument makes a jump that feels poorly motivated or references something potentially unfamiliar, expect that your reader will experience vertigo.

Most of the ideas we talk about in functional programming are not easy, and it doesn't do anyone any favors to pretend this isn't so. Expect that your readers will be pretty similar to you; if you had a problem understanding a piece of your topic, call that out. Point out the obstacle. Point out what they might be thinking, and then very explicitly show them what they should be thinking instead.

For example, if your code sample uses a functional dependency, it might be worth a short sentence saying "that funny bar thing is a functional dependency." Give a tiny summary of its purpose, and provide a link for them to learn more if they need to.

Another illustration: if you originally got confused that an "algebra in recursion schemes" is not the same thing as the algebra you learned in high-school, your readers probably will too. The first time you say "algebra" in the new context, say "this is a misleading term. This has nothing to do with solving equations like you did in high-school."

More important than what you're saying is what you're not saying. If you're giving examples of something that fits a pattern, make sure you give examples of things that do not fit the pattern. A concept that is applicable everywhere is useful nowhere.

Give lots of examples. Nobody I know learns via mathematical statements, nor do they learn via long, wordy, abstract prose. People learn by seeing lots of examples, and being told explicitly how these examples relate to one another. The mathematical statements are only useful after you have the intuition, so save them for an appropriate time.

Conclusion

The takeaway advice from this essay is that if you want lots of readers, you must make it easy for them to read.

To that end, pay lots of attention to motivation. Why should people care what you have to say? What value does it give them?

Focus your energy on the beginnings---both of the essay as a whole, and of each paragraph. People who are unconvinced by your essay's value will skim their way through it, and they will do that by reading only the beginnings of things.

Make your information easy to find. Structure your argument in a tree, and make sure it supports random-access. Nobody is going to read the whole thing from start to finish. Instead they're going to jump around, ignoring the pieces they already know, and looking for the bits they don't.

Use lots of short paragraphs. A paragraph should correspond to a single idea.

Anticipate which parts of your argument will be difficult for your readers, and be proactive in trying to assuage those difficulties. Give lots of examples of what you're talking about, and more importantly, what you're not talking about.

Point out where you went into the weeds while learning this stuff. Steer readers away from common mistakes and misconceptions.

And finally, end on a high note. Leave people with a good feeling, and it will incentivize them to get to the bottom of your next piece. Inspirational calls to action are a good way to go out.

Tell them what you're going to tell them. Then tell them. Finally, tell them what you told them.

-Unknown

Funny and poignant quotes are too.

February 07, 2019 01:04 PM

February 06, 2019

Chris Smith 2

CodeWorld Update — February, 2019

I’ve been quiet on social media lately, but it’s been an active time for the CodeWorld project. Here’s a brief summary of some of the things happening in the last few months.

Platform features

There have been some significant updates lately in the CodeWorld platform. Here are some of the big ones.

The Requirements Framework

Together with Ashish Myles and with some good advice from Fernando Alegre, I started building the CodeWorld Requirements Framework. This is a language embedded into CodeWorld for expressing requirements for assigned student code. For example, you might ask students to write at least 4 functions in their project, to use all five built-in picture transformations (translation, rotation, dilation, non-uniform scaling, and coloring), and to make sure there are no unused parameters in their functions. The requirements language lets you express all of these.

The framework implements a few one-off rules for some common requirements, but based on some inspired advice from Fernando, it also includes a pattern matcher over Haskell syntax, which lets you define patterns that should (or should not) occur in the code. By abusing Template Haskell syntax, you can write a pattern like f $var = sqrt $any, and this will match any function called f whose body is defined by applying sqrt to any expression. It's surprising how many rules can be expressed in this form. Ashish and I alpha-tested the framework when teaching functions to middle school students, and it was very successful.

That said, though, the Requirements Framework is still in experimental stages. There are a lot more rule types I’d like to implement, including common lint rules, UX improvements, and even some ambitious projects like integrating runtime validation testing and compile-time inspection testing into the framework. The status of the project is at https://github.com/google/codeworld/projects/9.

Tooling work

Artem Kanev has recently been contributing to the tool support in CodeWorld, and has made a lot of progress. You might notice that CodeWorld’s education mode now offers documentation for known variables when you hover over them, or when you select them in the autocomplete UI. There’s also a lot more work done to include symbols defined in the current module in these features. These simple changes make a huge difference to students as they use the platform.

I know Artem has big plans coming up, to continue working on the student programming tools and turn the CodeWorld environment into a world-class user interface (but without sacrificing its essential simplicity!). Coming soon are continuous highlighting of compile errors, better error message presentation, and deferred type errors. After that, who knows? You can definitely try out the new features and send bugs and feature requests with your feedback.

API updates

As we learn more from teaching with CodeWorld, we continue to tweak the exported API so that future classes go better. A few API changes are now in progress.

First, CodeWorld is moving toward using pattern synonyms for a lot more constants. This includes the built-in colors, and the constant pi, for example. The reason for this is that there’s a nice educational progression from listing inputs and outputs of a function in a table, then writing them in a sequence of equations for specific inputs, and finally generalizing to a pattern with a formal parameter. But that doesn’t work when the input isn’t a literal. Bidirectional pattern synonyms act like literals in most ways, so students can write equations like f(Blue) = 5, and they act as they are supposed to.
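
As a concrete illustration, here is a hedged sketch of what such a bidirectional pattern synonym can look like in plain GHC Haskell; the RGB representation is an assumption made for the example, not CodeWorld's actual definition:

{-# LANGUAGE PatternSynonyms #-}

-- A toy color type; CodeWorld's real Color type is richer.
data Color = RGB Double Double Double
  deriving (Eq, Show)

-- A bidirectional pattern synonym: Blue works both as an expression
-- (a value to pass around) and as a pattern on the left-hand side of
-- an equation, just like a literal would.
pattern Blue :: Color
pattern Blue = RGB 0 0 1

-- So a student can write equations over "constants":
f :: Color -> Int
f Blue = 5
f _    = 0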

This is a big change, especially for colors, and the old names for colors will be around for at least a year and a half, and possibly longer. But I encourage you to teach with the new names in the future.

Documentation updates

At the same time, I’m continuing to update documentation and curriculum. I’ve taken another pass at the beginning of the guide over the last several months, and expanded much of the content. There are also new folders in the CodeWorld Drive folder with worksheets and resources I’ve created for my new classes. This is still very much a work in progress, but things are getting better.

I’m also resuming work on the CodeWorld comic book. This was a cute project I began many years ago, to convey some of the humor and programming culture, as well as introduce the language of error messages, in a fun and casual way. The story is finished, but we’re filling out the book with creative and personality-filled reference pages that students can use as they build their own projects. Stay tuned!

Classes and community

Of course, none of this would be worth it without the kids! I’m also continuing to teach. I finished a class for 8th grade students last semester, and I’m taking on 9th graders for the first time this semester. It’s a new experience for me, so keep your fingers crossed for me.

Others have also continued their teaching. In particular, I’m still working with Louisiana State University to investigate results measured in our classes, and I know others are teaching everywhere from elementary school through university, around the world.

Getting involved

To stay up to date on CodeWorld, feel free to subscribe to the codeworld-discuss mailing list. We’ve got a lot of cool things coming up. I’m hopeful that we can work with another student or two in the upcoming Google Summer of Code season, and share some of our research results over the next several months. Stay tuned!

by Chris Smith at February 06, 2019 01:15 AM

Tweag I/O

Mapping a Universe of Open Source Software

Matthias Meschede

Introduction

The repositories of distributions such as Debian and Nixpkgs are among the largest collections of open source (and some unfree) software. They are complex systems that connect and organize many interdependent packages. Interestingly, these systems grow out of the individual design choices of thousands of contributors, few of whom have larger design goals in mind.

Is it possible to capture the large scale features of such a repository in an image? Are there common design choices of the contributors? Did they lead to any emergent structure?

Plenty of studies (e.g. here, here, or here) have examined these questions from different viewpoints and for various software ecosystems. In this blog post I'll try to shed some light on them from the perspective of Nixpkgs, mostly with visualizations of its complete dependency graph but also with a very short examination of its statistical properties (its degree distribution). But wait, you might think: "Visualizing a graph with 50000 nodes and 300000 edges? All you are going to see is black!" Well, it turns out that this is not the case, because large scale structures appear when we look at the graph in the right way.

An overview of the Nixpkgs dependency graph

Before explaining in detail what I did, let's start with a teaser. Check out the structure in this fascinating image where each little dot represents a single software package in the Nixpkgs repository:

What looks remotely like a section of a mouse brain actually represents around 46000 software packages. Each dot in this image corresponds to a different package, more precisely to a Nix derivation in the Nixpkgs repository. Each package is invisibly linked to those that it depends on (outgoing dependencies) and to those that depend on it (ingoing dependencies). These packages are the nodes, and their invisible links the incoming and outgoing edges of the directed dependency graph of Nixpkgs.

The total number of dependencies that a node has is called the node's degree. It is subdivided into the in degree, which counts the ingoing dependencies (packages that depend on the node), and the out degree, which counts the outgoing dependencies (packages that the node depends on). I have colored and sized the nodes in the above image by their in degree: a brighter, larger node has more derivations depending on it (high in degree) than a darker, smaller node (low in degree). Size and color are scaled non-linearly and should only be interpreted qualitatively. We will see later from the distribution of these degrees why this makes sense.
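
To make these definitions concrete, here is a small sketch (not code from the article) that computes both degrees from a list of dependency edges:

import qualified Data.Map.Strict as Map

-- An edge (a, b) means: package a depends on package b.
type Edge = (String, String)

-- The out degree of a node counts its dependencies;
-- the in degree counts its dependents.
outDegree, inDegree :: [Edge] -> Map.Map String Int
outDegree es = Map.fromListWith (+) [ (a, 1) | (a, _) <- es ]
inDegree  es = Map.fromListWith (+) [ (b, 1) | (_, b) <- es ]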

The graph was laid out and visualized with the program gephi, using a force simulation algorithm to project each node to a display position in the image. This force model consists of attractive and repulsive forces that drag the nodes into the final positions that you can see in the image.

The attractive force pulls nodes towards their dependencies. Therefore nodes with similar dependencies end up close to each other. Looking back at the image, we see the effect of this force in the form of node clusters with visible gaps between them. The repulsive force pushes nodes away from dependencies that have high degree. High degree nodes therefore appear with a lot of space around them. Low degree nodes, on the other hand, are tightly packed.

As a final addition, a gravitational force keeps the whole graph together and squeezes it into roughly circular form. Check out the appendix at the end of this post for a detailed explanation, including the code and the data that are required to reproduce this image yourself.

The following image of the dependency graph was laid out in the same way, but shows the package name of each node explicitly. As before, the package names are scaled non-linearly with the in degree of the node. The names of the largest nodes - we'll call them hub nodes - are easily readable, but to read the labels of the smaller nodes you should click on the image and use your browser's zoom functionality.

Most central to the Nixpkgs repository are the primary hub nodes bash and the Nix stdenv derivation. Unsurprisingly, most derivations depend on them - at least at build time - because bash is one of the most commonly used Unix shells and stdenv is a Nix package that groups very fundamental build dependencies such as GNU make. Both are used as default environments in Nix.

Slightly smaller secondary hub nodes can also be identified: curl and mirrors-list seem to be central to the upper part of the graph and indicate that it has something to do with the internet. On the right side, Python2.7 and Python3.6 are visible as distinct clusters, grouped around a central hook derivation that Nix uses to set up a Python environment. On the left side, different versions of Perl can be seen. At the bottom, the Linux library manager tool pkg-config forms a little aura that matches its importance for the open source ecosystem.

The hub nodes indicate that some clusters in this graph correspond to the environments of different programming languages. Naturally packages in these environments will have a common set of dependencies and therefore appear close to each other in the graph.

Ego networks: exploring hub node environments

Let's dive further into this. A hub node forms a little subgraph together with its ingoing and outgoing dependencies that is sometimes called its ego network (this term seems to originate in studies of social networks). In the following plot, the ego networks of a few selected hub nodes are colored in green. The corresponding hub node's name is indicated above each subplot. It illustrates nicely how clusters form around a compiler, an interpreter or just an important tool:

The ego networks confirm our previous intuition that some clusters correspond to different programming languages. They also provide basic orientation on this graph. Here is a short description of the different ego networks:

Most of the upper part of the graph is dominated by curl. Closer inspection shows that most of these derivations are archives such as .tar.gz files or language specific archives such as .gem files. These archives are downloaded from the internet before other packages are derived from them. Another large, but less tightly packed region is the pkg-config zone. It dominates the bottom of the graph and it is characterized by lots of smaller hub nodes indicating that pkg-config organizes many packages that are important for other packages. The ego network of cmake is mixed into a similar region of the graph.

Python2.7 and Python3.6 form neat clusters with three central hub nodes that correspond to the Python interpreter, the package management tool pip, and the build and installation tool setuptools. Python2.7 is used more often by packages outside of its cluster than Python3.6. In contrast to Python, Perl has only a single hub node for each specific version, but it forms similarly dense clusters. Of the different Perl versions, Perl-5.28.0 is reused the most outside of its own cluster, by a lot of other packages in the pkg-config zone. This indicates that the specific revision of the Nixpkgs repository that I evaluated uses this particular Perl version as the default for other packages.

Lisp (sbcl), Haskell (ghc), Ruby, OCaml, Go, and Rust derivations also accumulate in tight clusters around their respective compiler.

Ego networks can only give a glimpse into the structure of the graph of Nixpkgs. Not all environments are organized around a central hub. Some packages are self-contained and independent. For example, the Go packages form a tight independent group because they often include copies of third party packages that they depend on. There are also many smaller clusters that correspond to particular tools such as terraform, maven or even the vim plugins. If you have time and patience, you can try to find them in the labelled graph image above.

The degree distribution of the Nixpkgs dependency graph

Another perspective on the global properties of the Nixpkgs repository is the degree distribution of its nodes. It shows how many nodes exist with a certain in or out degree (as a reminder: the in degree is the number of packages that depend on the node; the out degree is the number of packages that the node depends on). Certainly there are a lot more nodes with low in degree than with high in degree. More interesting is how the number of nodes with a certain degree changes between the extremes.

A whole bunch of studies (for example here, here) have found that the in degree distribution of software repository dependencies follows a power-law distribution, similar to the degree distributions of networks in geology, biology, economics, or the Internet (here is a short introduction on power-laws). Such a power-law distribution can be generated by several mechanisms. The hottest candidate mechanism in the case of software repositories seems to be preferential attachment, a.k.a. "the rich get richer": a new package is more likely to depend on a package that many others already depend on than on one with few dependents. Other mechanisms such as self-organized criticality might also play a role.

The (complementary) cumulative in and out degree distributions show how many packages exist above a certain degree. Looking at the cumulative distribution is typically better than looking at its derivative, the degree distribution: as an integral, the cumulative distribution behaves better where the density of packages gets low, at high degrees. We plot it here for the in and out degrees in log-log scale, so that a power-law shows up as a straight line:
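
As a sketch of the definition, given the list of node degrees the cumulative counts are simply:

-- For every degree k, the number of nodes with degree >= k.
-- Plotted in log-log scale, a power-law shows up as a straight line.
ccdf :: [Int] -> [(Int, Int)]
ccdf degrees =
  [ (k, length (filter (>= k) degrees)) | k <- [1 .. maximum degrees] ]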

The cumulative in degree distribution is a very clean power-law decay with an exponent of -0.9. The in degree distribution itself has an exponent of -1.9, because it is the derivative of the cumulative distribution. The value that we observe falls in the range of values that have been observed for the Debian repository (around -2.0).

Interestingly, a power-law distribution is also a scale-free distribution. For example, at the largest scale of Nixpkgs, a node with an approximate (order of magnitude) in degree of 2000, such as python2.7 (3033), perl-5.28 (2406), or git (2394), is about 80 times (=10^1.9) less common than a node with in degree 200, which could be an important library such as python2.7-mock (212), python3.6-six (321), or freetype (303). In turn, such a node with degree 200 is also about 80 times less common than a node with approximate in degree 20, such as python3.6-flake8 (23), ruby2.5.3-nokogiri (20), or erlang (19). In some sense, the large scale ecosystems around python, perl, or git are therefore scaled-up versions of the small scale ecosystems around a minor programming language or library. This shows that there is more to a software repository such as Nixpkgs than just a few primary and secondary hub nodes. It is a complex, growing ecosystem.
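
In formulas: if the in degree distribution follows a power-law p(k) ∝ k^(−γ) with γ ≈ 1.9, the relative frequency of nodes at two degrees depends only on their ratio, which is where the factor of "about 80" per factor of 10 in degree comes from:

\frac{p(k_1)}{p(k_2)} = \left(\frac{k_2}{k_1}\right)^{\gamma},
\qquad
\frac{p(200)}{p(2000)} = 10^{1.9} \approx 79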

The out degree of a package counts how many other packages the node depends on. Its cumulative distribution looks a bit different from the one for the in degrees. It has a cutoff at around degree 3, which means that only a few nodes depend on three or fewer packages. At degree 4, the distribution drops markedly, which means that there are lots of packages that depend on exactly 4 other packages. The decay afterwards is also not quite as regular as for the in degrees. If we interpret this decay as a power-law, the decay rate corresponds to an exponent of around -2.9. This means that a node that depends on 200 packages is about 800 times (=10^2.9) less common than a node that depends on 20 packages. Although this gives some intuition about the out degree distribution, it is unclear if a power-law is really a good model for it. Preferential attachment is less plausible as a generating mechanism behind the out degree distribution, because the number of packages that a package depends on is often chosen directly by the developers. Other distributions, such as multiple overlapping log-normals, could potentially fit the distribution equally well and also account for different mechanisms.

Conclusions

The graph visualizations give an overview of the structure of the enormous Nixpkgs repository, which can be difficult to grasp in its entirety otherwise. The degree distribution of the nodes is another macro perspective on the repository that is difficult to obtain by other means. It would also be interesting to examine whether the degree distributions of different language environments differ. Taking a step back, I hope that this blog post can raise some awareness of the complexity and beauty of a system that was generated by thousands of individual contributions but that ultimately becomes more than the sum of its parts.

Appendix: do it yourself

If you want to explore the Nixpkgs dependency graph yourself, you can download the gephi project file that will allow you to start immediately. The Nixpkgs graph in dot format can be used with graphviz and many other libraries. Finally, I have also uploaded the Nixpkgs graph as a sparse adjacency matrix.

You can also regenerate the graph from the beginning:

A Nix derivation is a recipe with all the meta information required to build a package. This includes links to other required derivations, hashed source code, shell commands, compiler flags, environment variables, and so on. Derivations are not written by hand but generated automatically by evaluating code written in the Nix language. The Nixpkgs repository is a collection of such Nix code and generates by default around 46000 derivation files with build recipes. These derivation files are stored as human-readable text files with the extension .drv in the Nix store folder (usually under /nix/store).

To build the dependency graph, I traversed and evaluated the whole Nixpkgs repository with a few lines of Nix code. This process is quite fast (<5min) because no package has to actually be built; the derivation files only gather the meta information that is needed to build them. I then extracted the package dependencies from each derivation file in the store with a small Haskell program and saved them as a graph .dot file (here is the Nix and Haskell code). Due to the graph's size, I merged derivations with the same package and version name, ignoring other build characteristics that are typically identified with a unique hash by Nix.
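
For the curious, here is a rough sketch of that extraction step. It is not the author's actual code, and it naively treats every quoted .drv path inside a derivation file as a dependency edge:

import Data.List (isSuffixOf, nub)
import System.Directory (listDirectory)

-- Dependencies of a derivation appear inside its .drv file as quoted
-- store paths ending in ".drv".
drvRefs :: String -> [String]
drvRefs contents = nub (filter (".drv" `isSuffixOf`) (words cleaned))
  where cleaned = map (\c -> if c == '"' then ' ' else c) contents

main :: IO ()
main = do
  let store = "/nix/store/"
  drvs <- filter (".drv" `isSuffixOf`) <$> listDirectory store
  putStrLn "digraph nixpkgs {"
  mapM_ (\d -> do
           refs <- map (drop (length store)) . drvRefs
                     <$> readFile (store ++ d)
           mapM_ (\r -> putStrLn ("  " ++ show d ++ " -> " ++ show r ++ ";")) refs)
        drvs
  putStrLn "}"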

The graph was laid out and visualized with the program gephi, using a force simulation algorithm to project the graph nodes to display positions in the image. The simulation starts by putting the nodes at random initial positions. Afterwards it updates the node positions stepwise according to a force model until they settle. The force model, called Force Atlas 2, consists of an attractive and a repulsive force between connected nodes. The attractive force is proportional to the distance between the connected nodes, and the repulsive force is inversely proportional to it. In addition, the repulsive force is proportional to the product of the degrees of the connected nodes, that is, to the number of edges incident on them. The nodes are therefore pushed to an equilibrium distance where the forces balance. The equilibrium distance is smaller for nodes with low degrees and larger for nodes with high degrees.
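
In a toy form (ignoring the algorithm's many refinements), the two forces between a pair of connected nodes look roughly like this:

-- Attraction grows linearly with distance; repulsion is proportional
-- to the product of the two degrees and inversely proportional to the
-- distance. The two balance at dist = sqrt (degA * degB), so pairs
-- with higher degrees settle farther apart, as described above.
attraction :: Double -> Double
attraction dist = dist

repulsion :: Int -> Int -> Double -> Double
repulsion degA degB dist = fromIntegral (degA * degB) / dist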

Getting a nice visualization is then a matter of fine-tuning the corresponding parameters of the force simulation in gephi, together with the color scheme and the node sizes, until the structure of the graph becomes most evident.

February 06, 2019 12:00 AM

February 05, 2019

Holden Karau

Validating Big Data Jobs An exploration with @ApacheSpark & @ApacheAirflow (+ friends) @ FOSDEM

Thanks for joining me on 2019-02-03 at FOSDEM 2019 Brussels, Belgium for Validating Big Data Jobs An exploration with @ApacheSpark & @ApacheAirflow (+ friends). I'll update this post with the slides soon. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

by Holden Karau (noreply@blogger.com) at February 05, 2019 02:50 PM

Introducing @Kubeflow (w. Special Guests Tensorflow and @ApacheSpark) @ FOSDEM

Thanks for joining us (@holdenkarau, @rawkintrevo) on 2019-02-03 at FOSDEM 2019 Brussels, Belgium for Introducing @Kubeflow (w. Special Guests Tensorflow and @ApacheSpark). I'll update this post with the slides soon. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

by Holden Karau (noreply@blogger.com) at February 05, 2019 02:41 PM

Functional Jobs

OCaml server-side developer at Ahrefs (Full-time)

What we need

Ahrefs is looking for a backend developer with a deep understanding of networks, distributed systems, OS fundamentals, and a taste for simple and efficient architectural designs. Our backend is implemented mostly in OCaml with some C++, so proficiency in OCaml is very much appreciated; otherwise, a strong inclination to learn OCaml intensively in a short time will be required. An understanding of functional programming in general and/or experience with other FP languages (F#, Haskell, Scala, Scheme, etc.) will help a lot. Knowledge of C++ and/or Rust is a plus.

Every day the candidate will have to deal with:

  • 20+ petabytes of live data
  • OCaml
  • linux
  • git

The ideal candidate is expected to:

  • Independently deal with bugs, schedule tasks and investigate code
  • Make well-argued technical choices and take responsibility for them
  • Understand the whole technology stack at all levels : from network and userspace code to OS internals and hardware
  • Handle the full development cycle of a single component - i.e. formalize the task, write code and tests, set up and support production (devops), resolve user requests
  • Approach problems with a practical mindset and suppress perfectionism when time is a priority
  • Write flexible, maintainable code and adapt to post-launch requirement tweaks

These requirements stem naturally from our approach to development with fast feedback cycle, highly-focused personal areas of responsibility and strong tendency to vertical component splitting.

Who we are

Ahrefs runs an internet-scale bot that crawls the whole Web 24/7, storing huge volumes of information to be indexed and structured in a timely fashion. The backend system is powered by a custom petabyte-scale distributed key-value storage to accommodate all that data coming in at high speed. The storage system is implemented in OCaml with a thin performance-critical low-level part in C++. On top of that, Ahrefs is building various analytical services for end-users.

We are a small team and strongly believe in better technology leading to better solutions for real-world problems. We worship functional languages and static typing, extensively employ code generation and meta-programming, value code clarity and predictability, and are constantly seeking to automate repetitive tasks and eliminate boilerplate, guided by DRY and following KISS. If there is any new technology that will make our life easier - no doubt, we'll give it a try. We rely heavily on open source code (as the only viable way to build a maintainable system) and contribute back, see e.g. https://github.com/ahrefs . It goes without saying that our team is all passionate and experienced OCaml programmers, ready to lend a hand and explain that intricate ocamlbuild rule or track down a CPU bug.

Our motto is "first do it, then do it right, then do it better".

What you get

We provide:

  • Competitive salary
  • Informal and thriving atmosphere
  • First-class workplace equipment (hardware, tools)
  • Medical insurance

Locations

Remote

Singapore: modern office in CBD

Get information on how to apply for this position.

February 05, 2019 02:21 PM

Neil Mitchell

Announcing ghc-lib

On behalf of Digital Asset I'm delighted to announce ghc-lib, a repackaging of the GHC API to allow it to be used on different GHC versions. The GHC API allows you to use the GHC compiler as a library, so you can parse, analyze and compile Haskell code.

The GHC API comes pre-installed with GHC, and is tied to that GHC version - if you are using GHC 8.6.3, you get version 8.6.3 of the API, and can't change it. The ghc-lib package solves that problem, letting you mix and match versions of the GHC compiler and GHC API. Why might you want that?

  • Imagine you are writing a tool to work with several versions of the GHC compiler. The GHC API changes significantly between each version, so doing this would require writing a lot of C preprocessor code to support it. An alternative is to use one version of ghc-lib which works across multiple versions of GHC.
  • Imagine you are modifying the GHC API or want features from GHC HEAD. With ghc-lib you can depend on the revised GHC API, without upgrading the compiler used to build everything, speeding up iteration.

While ghc-lib provides the full GHC API, it doesn't contain a runtime system, nor does it create a package database. That means you can't run code produced by ghc-lib (no runtime), and compiling off-the-shelf code is very hard (no package database containing the base library). What you can do is parse, analyze, and otherwise process Haskell source code.

The package ghc-lib is released on Hackage, and can be used like any normal package, e.g. cabal install ghc-lib. Since ghc-lib conflicts perfectly with the GHC API and template-haskell, you may wish to ghc-pkg hide ghc-lib and use the language extension PackageImports to do import "ghc-lib" GHC. There will be two release streams within the ghc-lib name:

  • Version 8.8.1 will be the version of ghc-lib produced against the released GHC 8.8.1, once GHC 8.8.1 is released. There is no release against GHC 8.6.3 because we had to make changes to GHC to enable ghc-lib, which were only upstreamed in the last few months.
  • Version 0.20190204 is the version of ghc-lib using GHC HEAD on the date 2019-02-04.

We've been developing and using ghc-lib internally at Digital Asset for the last six months. The use of ghc-lib has enabled us to track GHC HEAD for certain projects, and develop improvements to GHC itself, and then integrate them without requiring us to rebuild all Haskell source code on every step. Smoothing out that development loop has been a massive productivity boon to us.

While this is Digital Asset's first open source project in a while, we have been making lots of contributions behind the scenes - it's no coincidence several of my recent posts involve my colleague Shayne. In particular, our engineers have been behind several GHC proposals.

While I'm announcing the project, much of the work has been done by Shayne Fletcher and Jussi Maki, but neither of them have active blogs just yet!

by Neil Mitchell (noreply@blogger.com) at February 05, 2019 01:44 PM

February 04, 2019

Monday Morning Haskell

Why Haskell V: Type Families

Welcome to the conclusion of our series on Haskell data types! We've gone over a lot of things in this series that demonstrated Haskell's simplicity. We compared Haskell against other languages where we saw more cumbersome syntax. In this final part, we'll see something a bit more complicated though. We'll do a quick exploration of the idea of type families. We'll start by tracing the evolution of some related type ideas, and then look at a quick example.

Type families are a rather advanced concept. But if you're more of a beginner, we've got plenty of other resources to help you out! Take a look at our Getting Started Checklist or our Liftoff Series!

Different Kinds of Type Holes

In this series so far, we've seen a couple different ways to "plug in a hole", as far as a type or class definition goes. In the third part of this series we explored parametric types. These have type variables as part of their definition. We can view each type variable as a hole we need to fill in with another type.

Then in the fourth part, we explored the concept of typeclasses. For any instance of a typeclass, we're plugging in the holes of the function definitions of that class. We fill in each hole with an implementation of the function for that particular type.

This week, we're going to combine these ideas to get type families! A type family is an enhanced class where one or more of the "holes" we fill in is actually a type! This allows us to associate different types with each other. The result is that we can write special kinds of polymorphic functions.

A Basic Logger

First, here's a contrived example to use throughout this article. We want to have a logging typeclass. We'll call it MyLogger. We'll have two main functions in this class. We should be able to get all the messages in the log in chronological order. Then we should be able to log a new message while sending some sort of effect. A first pass at this class might look like this:

class MyLogger logger where
  prevMessages :: logger -> [String]
  logString :: String -> logger -> logger

We can make a slight change that would use the State monad instead of passing the logger as an argument:

class MyLogger logger where
  prevMessages :: logger -> [String]
  logString :: String -> State logger ()

But this class is deficient in an important way. We won't be able to have any effects associated with our logging. What if we want to save the log message in a database, send it over network connection, or log it to the console? We could allow this, while still keeping prevMessages pure like so:

class MyLogger logger where
  prevMessages :: logger -> [String]
  logString :: String -> StateT logger IO ()

Now our logString function can use arbitrary effects. But this has the obvious downside that it forces us to introduce the IO monad in places where we don't need it. If our logger doesn't need IO, we don't want it. So what do we do?

Type Family Basics

One answer is to make our class use an associated type family. We do this with the type keyword in the class definition. First, we need a few language pragmas to allow this:

{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE AllowAmbiguousTypes #-}

Now we'll make a type within our class that refers to the monadic effect type of the logString function. We have to describe the "kind" of the type with the definition. Since it's a monad, its kind is * -> *. This indicates that it requires another type parameter. Here's what our definition looks like:

class MyLogger logger where
  type LoggerMonad logger :: * -> *
  prevMessages :: logger -> [String]
  logString :: String -> (LoggerMonad logger) ()

Some Simple Instances

Now that we have our class, let's make an instance that does NOT involve IO. We'll use a simple wrapper type for our logger. Our "monad" will contain the logger in a State. Then all we do when logging a string is change the state!

newtype ListWrapper = ListWrapper [String]
instance MyLogger ListWrapper where
  type (LoggerMonad ListWrapper) = State ListWrapper
  prevMessages (ListWrapper msgs) = reverse msgs
  logString s = do
    (ListWrapper msgs) <- get
    put $ ListWrapper (s : msgs)

Now we can make a version of this that starts involving IO, but without any extra "logging" effects. Instead of using a list for our state, we'll use a mapping from timestamps to the messages. When we log a string, we'll use IO to get the current time and store the string in the map with that time.

newtype StampedMessages = StampedMessages (Data.Map.Map UTCTime String)
instance MyLogger StampedMessages where
  type (LoggerMonad StampedMessages) = StateT StampedMessages IO
  prevMessages (StampedMessages msgs) = Data.Map.elems msgs
  logString s = do
    (StampedMessages msgs) <- get
    currentTime <- lift getCurrentTime
    put $ StampedMessages (Data.Map.insert currentTime s msgs)

More IO

Now for a couple examples that use IO in a traditional logging way while also storing the messages. Our first example is a ConsoleLogger. It will save the message in its State but also log the message to the console.

newtype ConsoleLogger = ConsoleLogger [String]
instance MyLogger ConsoleLogger where
  type (LoggerMonad ConsoleLogger) = StateT ConsoleLogger IO
  prevMessages (ConsoleLogger msgs) = reverse msgs
  logString s = do
    (ConsoleLogger msgs) <- get
    lift $ putStrLn s
    put $ ConsoleLogger (s : msgs)

Another option is to write our messages to a file! We'll store the file name as part of our state, though we could use the Handle if we wanted.

newtype FileLogger = FileLogger (String, [String])
instance MyLogger FileLogger where
  type (LoggerMonad FileLogger) = StateT FileLogger IO
  prevMessages (FileLogger (_, msgs)) = reverse msgs
  logString s = do
    (FileLogger (filename, msgs)) <- get
    handle <- lift $ openFile filename AppendMode
    lift $ hPutStrLn handle s
    lift $ hClose handle
    put $ FileLogger (filename, s : msgs)

And we can imagine that we would have a similar situation if we wanted to send the logs over the network. We would use our State to store information about the destination server. Or else we could add something like Servant's ClientM monad to our stack in the type definition.

Using Our Logger

By defining our class like this, we can now write a polymorphic function that will work with any of our loggers!

runComputations :: (MyLogger logger, Monad (LoggerMonad logger)) => InputType -> (LoggerMonad logger) ResultType
runComputations input = do
  logString "Starting Computation!"
  let x = firstFunction input
  logString "Finished First Computation!"
  let y = secondFunction x
  logString "Finished Second Computation!"
  return y

This is awesome because our code is now abstracted away from the needed effects. We could call this with or without the IO monad.
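
As a hedged sketch of what a call site can look like: because LoggerMonad is a non-injective type family, the compiler cannot infer logger from the monad alone, so we pick the logger with a type application (InputType, ResultType, and runComputations are the placeholders from above):

{-# LANGUAGE TypeApplications #-}

-- Run the computation purely, with the ListWrapper logger...
runPure :: InputType -> (ResultType, ListWrapper)
runPure input =
  runState (runComputations @ListWrapper input) (ListWrapper [])

-- ...or in IO, with the ConsoleLogger.
runWithConsole :: InputType -> IO (ResultType, ConsoleLogger)
runWithConsole input =
  runStateT (runComputations @ConsoleLogger input) (ConsoleLogger [])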

Comparing to Other Languages

Now, to be fair, this is one area of Haskell's type system that makes it a bit more difficult to use than other languages. Arbitrary effects can happen anywhere in Java or Python. Because of this, we don't have to worry about matching up effects with types.

But let's not forget about the benefits! For all parts of our code, we know what effects we can use. This lets us determine at compile time where certain problems can arise.

And type families give us the best of both worlds! They allow us to write polymorphic code that can work either with or without IO effects!

Conclusion

That's all for our series on Haskell's data system! We've now seen a wide range of elements, from the simple to the complex. We compared Haskell against other languages. Again, the simplicity with which one can declare data in Haskell and use it polymorphically was a key selling point for me!

Hopefully this series has inspired you to get started with Haskell if you haven't already! Download our Getting Started Checklist or read our Liftoff Series to get going!

by James Bowen at February 04, 2019 03:30 PM

February 03, 2019

Holden Karau

Predicting areas for PR Comments based on Code Vectors & Mailing List Data @ FOSDEM

Thanks for joining us (@holdenkarau, @krisnova) on 2019-02-03 at FOSDEM 2019 Brussels, Belgium for Predicting areas for PR Comments based on Code Vectors & Mailing List Data. I'll update this post with the slides soon. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

by Holden Karau (noreply@blogger.com) at February 03, 2019 12:21 PM

Apache Spark on Kubernetes -- Avoiding the pain of YARN @ Pre-FOSDEM Belgium Kubernetes Meetup

Thanks for joining me on 2019-02-02 for Apache Spark on Kubernetes -- Avoiding the pain of YARN. The talk covered:

Apache Spark is one of the most popular big data tools, and starting last year has had integrated support for running on Kubernetes. This talk will introduce some of the use cases of Apache Spark quickly (machine learning, ETL, etc.) and then look at the current cluster managers Spark runs on and their limitations. Most of the focus will be around running non-Java code, and the challenges associated with dependencies, along with general challenges like scale-up & down. It's not all sunshine and roses though; I will talk about some of the limitations of our current approach and the work being done to improve this.

I'll update this post with the slides soon. Comment below to join in the discussion :). Talk feedback is appreciated at http://bit.ly/holdenTalkFeedback

by Holden Karau (noreply@blogger.com) at February 03, 2019 12:21 PM

February 02, 2019

Magnus Therning

The ReaderT design pattern or tagless final?

The other week I read V. Kevroletin’s Introduction to Tagless Final and realised that a couple of my projects, both at work and at home, would benefit from a refactoring to that approach. All in all I was happy with the changes I made, even though I haven’t taken the approach all the way; in particular, I could further improve the tests in a few places by adding more typeclasses. For now it’s good enough, and I’ve clearly gotten some value out of it.

I found Mr. Kevroletin’s article to be a good introduction, so I’ve been passing it on when people on the Functional Programming Slack bring up questions about how to organize their code as applications grow, in particular if they mention that they’re using monad transformers. I did exactly that just the other day, when @solomon wrote

so i’ve created a rats nest of IO where almost all the functions in my program are in ReaderT Env IO () and I’m not sure how to purify everything and move the IO to the edge of the program

I proposed tagless final and passed the URL on, and then I got a pointer to the article The ReaderT Design Pattern, which I hadn’t seen before.

The two approaches are similar, at least to me, and I can’t really judge if one’s better than the other. Just to get a feel for it I thought I’d try to rewrite the example in the ReaderT article in a tagless final style.

A slightly changed example of ReaderT design pattern

I decided to make a few changes to the example in the article:

  • I removed the modify function, instead the code uses the typeclass function modifyBalance directly.
  • I separated the instances needed for the tests spatially in the code just to make it easier to see what’s “production” code and what’s test code.
  • I combined the main functions from the various examples so that both an example (main0) and the test (main1) are run.
  • I switched from Control.Concurrent.Async.Lifted.Safe (from monad-control) to UnliftIO.Async (from unliftio)

After that the code looks like this

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}

import           Control.Concurrent.STM
import           Control.Monad.Reader
import qualified Control.Monad.State.Strict as State
import           Say
import           Test.Hspec
import           UnliftIO.Async

data Env = Env
  { envLog :: !(String -> IO ())
  , envBalance :: !(TVar Int)
  }

class HasLog a where
  getLog :: a -> (String -> IO ())

instance HasLog Env where
  getLog = envLog

class HasBalance a where
  getBalance :: a -> TVar Int

instance HasBalance Env where
  getBalance = envBalance

class Monad m => MonadBalance m where
  modifyBalance :: (Int -> Int) -> m ()

instance (HasBalance env, MonadIO m) => MonadBalance (ReaderT env m) where
  modifyBalance f = do
    env <- ask
    liftIO $ atomically $ modifyTVar' (getBalance env) f

logSomething :: (MonadReader env m, HasLog env, MonadIO m) => String -> m ()
logSomething msg = do
  env <- ask
  liftIO $ getLog env msg

main0 :: IO ()
main0 = do
  ref <- newTVarIO 4
  let env = Env { envLog = sayString , envBalance = ref }
  runReaderT
    (concurrently_
      (modifyBalance (+ 1))
      (logSomething "Increasing account balance"))
    env
  balance <- readTVarIO ref
  sayString $ "Final balance: " ++ show balance

instance HasLog (String -> IO ()) where
  getLog = id

instance HasBalance (TVar Int) where
  getBalance = id

instance Monad m => MonadBalance (State.StateT Int m) where
  modifyBalance = State.modify

main1 :: IO ()
main1 = hspec $ do
  describe "modify" $ do
    it "works, IO" $ do
      var <- newTVarIO (1 :: Int)
      runReaderT (modifyBalance (+ 2)) var
      res <- readTVarIO var
      res `shouldBe` 3
    it "works, pure" $ do
      let res = State.execState (modifyBalance (+ 2)) (1 :: Int)
      res `shouldBe` 3
  describe "logSomething" $
    it "works" $ do
      var <- newTVarIO ""
      let logFunc msg = atomically $ modifyTVar var (++ msg)
          msg1 = "Hello "
          msg2 = "World\n"
      runReaderT (logSomething msg1 >> logSomething msg2) logFunc
      res <- readTVarIO var
      res `shouldBe` (msg1 ++ msg2)

main :: IO ()
main = main0 >> main1

I think the distinguishing features are

  • The application environment, Env, contains configuration values (none in this example), state (envBalance), and functions we might want to vary (envLog)
  • There is no explicit type representing the execution context
  • Typeclasses are used to abstract over application environment, HasLog and HasBalance
  • Typeclasses are used to abstract over operations, MonadBalance
  • Typeclasses are implemented for both the application environment, HasLog and HasBalance, and the execution context, MonadBalance

In the end this makes for code with very loose coupling; there’s not really any single concrete type that implements all the constraints to work in the “real” main function (main0). I could of course introduce a type synonym for it

type App = ReaderT Env IO  -- for example; any stack satisfying the constraints

but it brings no value – it wouldn’t be used explicitly anywhere.

A tagless final version

In order to compare the ReaderT design pattern to tagless final (as I understand it) I made an attempt to translate the code above. The code below is the result.1

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies #-}

import           Control.Concurrent.STM
import qualified Control.Monad.Identity as Id
import           Control.Monad.Reader
import qualified Control.Monad.State.Strict as State
import           Say
import           Test.Hspec
import           UnliftIO (MonadUnliftIO)
import           UnliftIO.Async

newtype Env = Env {envBalance :: TVar Int}

newtype AppM a = AppM {unAppM :: ReaderT Env IO a}
  deriving (Functor, Applicative, Monad, MonadIO, MonadReader Env, MonadUnliftIO)

runAppM :: Env -> AppM a -> IO a
runAppM env app = runReaderT (unAppM app) env

class Monad m => ModifyM m where
  mModify :: (Int -> Int) -> m ()

class Monad m => LogSomethingM m where
  mLogSomething :: String -> m ()

instance ModifyM AppM where
  mModify f = do
    ref <- asks envBalance
    liftIO $ atomically $ modifyTVar' ref f

instance LogSomethingM AppM where
  mLogSomething = liftIO . sayString

main0 :: IO ()
main0 = do
  ref <- newTVarIO 4
  let env = Env ref
  runAppM env
    (concurrently_
      (mModify (+ 1))
      (mLogSomething "Increasing account balance"))
  balance <- readTVarIO ref
  sayString $ "Final balance: " ++ show balance

newtype ModifyAppM a = ModifyAppM {unModifyAppM :: State.StateT Int Id.Identity a}
  deriving (Functor, Applicative, Monad, State.MonadState Int)

runModifyAppM :: Int -> ModifyAppM a -> (a, Int)
runModifyAppM s app = Id.runIdentity $ State.runStateT (unModifyAppM app) s

instance ModifyM ModifyAppM where
  mModify = State.modify'

newtype LogAppM a = LogAppM {unLogAppM :: ReaderT (TVar String) IO a}
  deriving (Functor, Applicative, Monad, MonadIO, MonadReader (TVar String))

runLogAppM :: TVar String -> LogAppM a -> IO a
runLogAppM env app = runReaderT (unLogAppM app) env

instance LogSomethingM LogAppM where
  mLogSomething msg = do
    var <- ask
    liftIO $ atomically $ modifyTVar var (++ msg)

main1 :: IO ()
main1 = hspec $ do
  describe "mModify" $ do
    it "works, IO" $ do
      var <- newTVarIO 1
      runAppM (Env var) (mModify (+ 2))
      res <- readTVarIO var
      res `shouldBe` 3
    it "works, pure" $ do
      let (_, res) = runModifyAppM 1 (mModify (+ 2))
      res `shouldBe` 3
  describe "mLogSomething" $
    it "works" $ do
      var <- newTVarIO ""
      runLogAppM var (mLogSomething "Hello" >> mLogSomething "World!")
      res <- readTVarIO var
      res `shouldBe` "HelloWorld!"

main :: IO ()
main = main0 >> main1

The steps for the “real” part of the program were

  1. Introduce an execution type, AppM, with a convenience function for running it, runAppM
  2. Remove the log function from the environment type, envLog in Env
  3. Remove all the HasX classes
  4. Create a new operations typeclass for logging, LogSomethingM
  5. Rename the operations typeclass for modifying the balance to match the naming found in the tagless article a bit better, ModifyM
  6. Implement instances of both operations typeclasses for AppM

For testing the steps were

  1. Define an execution type for each test, ModifyAppM and LogAppM, with some convenience functions for running them, runModifyAppM and runLogAppM
  2. Write instances for the operations typeclasses, one for each

So I think the distinguishing features are

  • There’s both an environment type, Env, and an execution type AppM that wraps it
  • The environment holds only configuration values (none in this example), and state (envBalance)
  • Typeclasses are used to abstract over operations, LogSomethingM and ModifyM
  • Typeclasses are only implemented for the execution type

This version has slightly more coupling: the execution type specifies the environment to use, and the operations are tied directly to the execution type. However, this coupling doesn’t really make a big difference – looking at the pure modify test, the amount of code doesn’t differ by much.

A short note (mostly to myself)

I did write it using monad-control first, and then I needed an instance for MonadBaseControl IO. Deriving it automatically requires UndecidableInstances and I didn’t really dare turn that on, so I ended up writing the instance. After some help on haskell-cafe it ended up looking like this (in essence, the standard instances for a newtype over ReaderT Env IO):

instance MonadBase IO AppM where
  liftBase = AppM . liftBase

instance MonadBaseControl IO AppM where
  type StM AppM a = a
  liftBaseWith f = AppM $ liftBaseWith $ \run -> f (run . unAppM)
  restoreM = AppM . restoreM

Conclusion

My theoretical knowledge isn’t anywhere near good enough to say anything objectively about the difference in expressiveness of the two design patterns. That means that my conclusion comes down to taste: do you like the ReaderT pattern or tagless final better?

I like the slightly looser coupling I get with the ReaderT pattern. Loose coupling is (almost) always a desirable goal. However, I can see that tying the typeclass instances directly to a concrete execution type results in the intent being communicated a little more clearly. Clearly communicating intent in code is also a desirable goal. In particular I suspect it’ll result in more actionable error messages when making changes to the code – the error will tell me that my execution type lacks an instance of a specific typeclass, instead of it telling me that a particular transformer stack does. On the other hand, in the ReaderT pattern that stack is very shallow.

One possibility would be that one pattern is better suited for libraries and the other for applications. I don’t think that’s the case though as in both cases the library code would be written in a style that results in typeclass constraints on the caller and providing instances for those typeclasses is roughly an equal amount of work for both styles.


  1. Please do point out any mistakes I’ve made in this, in particular if they stem from me misunderstanding tagless final completely.

February 02, 2019 12:00 AM

February 01, 2019

Functional Jobs

Elixir Developers & Functional Programmers at CareDox (Full-time)

Love functional programming? Want to keep millions of children healthy? CareDox is hiring Elixir developers and functional programmers interested in learning Elixir.

We're a diverse, collaborative team of thoughtful, friendly engineers driven by CareDox's mission of working with schools and school nurses to immunize all children -- even the uninsured! -- against diseases like the flu and to help children with chronic illnesses (like asthma) manage their care.

We take pride in our Elixir code and use continuous integration tools to guarantee that all code going into production has high test coverage and clean compiler, Dialyzer, Credo, and Sobelow results.

We're conveniently located between Bryant Park and Times Square, walking distance from both Grand Central and Penn Station.

Requirements:

  • Functional programming experience (Haskell, Elm, Clojure, Lisp, Scheme, F#, Scala, OCaml, etc.)
  • Strong unit testing skills

Bonus points for experience with:

  • Elixir, Phoenix, Erlang, and Ecto
  • The actor model (esp. OTP, Akka, or similar)
  • React & JavaScript (we use React)
  • SQL and data modeling
  • Modern DevOps (CI/CD deployment, Docker, Kubernetes, etc.)
  • Working with Redis, Redshift, Elasticsearch, DynamoDB, GraphQL

Get information on how to apply for this position.

February 01, 2019 04:36 PM

January 31, 2019

Matt Parsons

Implementing Union in Esqueleto I

We use the SQL UNION operator at IOHK in one of our beam queries, and esqueleto does not support it. To make porting the IOHK SQL code more straightforward, I decided to implement UNION. This blog post series will delve into implementing this feature, in a somewhat stream-of-thought manner.

Background

esqueleto is a SQL library that builds on the persistent library for database definitions and simple queries. It attempts to provide an embedded DSL that allows you to use SQL and Haskell together. In my opinion, it has less complicated types than beam and an easier to learn UX than opaleye. The persistent quasiquoter model definitions save a bunch of boilerplate, too.

esqueleto is implemented in a somewhat convoluted manner – we have a type class Esqueleto query expr backend that everything is defined in terms of. However, the functional dependencies on the class essentially only permit a single instance. The query type is always fixed to SqlQuery, a WriterT [Clauses] (StateT IdentInfo) monad. The expr type is always SqlExpr, which is a GADT that provides a structure for SQL expressions.

It’s kind of a tagless final encoding paired with a GADT initial encoding. Neat.

Goal

Alright, so let’s start with the SQL. We want to be able to write a Haskell thing that translates to this SQL:

SELECT name
FROM person
UNION
SELECT title
FROM blog_post

Let’s just write the syntax out that esqueleto usually uses, and see where that takes us:

unionTest = ( 
  select $ 
  from $ \person ->
  return (person ^. PersonName)
  ) `union` (
  select $
  from $ \blog ->
  return (blog ^. BlogPostTitle)
  )

This is a pleasing looking API! Can it work? What type would union need to have?

Well, probably not. Let’s look at the type of select:

select 
  :: (SqlSelect a r, MonadIO m)
  => SqlQuery a 
  -> SqlReadT m [r]

Once something has become a SqlReadT value, we can’t really introspect on the query structure anymore. So we can’t have this syntax :(

Let’s try something else:

unionTest = 
  select $
  union 
    ( from $ \person -> do
      pure (person ^. PersonName)
    )
    ( from $ \blog -> do
      pure (blog ^. BlogPostTitle)
    )

This means that union will end up returning a SqlQuery a. It takes two arguments, each of which is a query returning the same thing. We have our first attempt at a type to implement!

union
  :: SqlQuery a
  -> SqlQuery a
  -> SqlQuery a
union query0 query1 = undefined

First Attempt

Alright, so, uh, how do we make values of type SqlQuery? Let’s first look at what the type actually is:

-- | SQL backend for @esqueleto@ using 'SqlPersistT'.
newtype SqlQuery a = Q 
  { unQ :: W.WriterT SideData (S.State IdentState) a 
  }

-- | Side data written by 'SqlQuery'.
data SideData = SideData 
  { sdDistinctClause :: !DistinctClause
  , sdFromClause     :: ![FromClause]
  , sdSetClause      :: ![SetClause]
  , sdWhereClause    :: !WhereClause
  , sdGroupByClause  :: !GroupByClause
  , sdHavingClause   :: !HavingClause
  , sdOrderByClause  :: ![OrderByClause]
  , sdLimitClause    :: !LimitClause
  , sdLockingClause  :: !LockingClause
  }

-- | List of identifiers already in use and supply of temporary
-- identifiers.
newtype IdentState = IdentState 
  { inUse :: HS.HashSet T.Text 
  }

initialIdentState :: IdentState
initialIdentState = IdentState mempty

So, we use the WriterT SideData to accumulate information about the query we’re building. And then we use IdentState to keep track of identifiers in use.

Let’s look at some things that return SqlQuery values. I searched through the Database.Esqueleto.Internal.Sql module for -> SqlQuery and got some interesting results. The only function that returns a SqlQuery value in the whole module is this:

-- line 497
  withNonNull :: PersistField typ
              => SqlExpr (Value (Maybe typ))
              -> (SqlExpr (Value typ) -> SqlQuery a)
              -> SqlQuery a
  withNonNull field f = do
    where_ $ not_ $ isNothing field
    f $ veryUnsafeCoerceSqlExprValue field

Okay, so where_ is a SqlQuery function. Let's look for its definition:

class (Monad query) =>
  Esqueleto query expr backend 
    | query -> expr backend
    , expr -> query backend 
    where
-- snip...

-- in Database.Esqueleto.Internal.Language, line 93
  -- | @WHERE@ clause: restrict the query's result.
  where_ :: expr (Value Bool) -> query ()

The class definition has functional dependencies that basically make it so you can determine any type variable from any other. Since persistent uses the SqlBackend type for the backend, you end up having to pick exactly Esqueleto SqlQuery SqlExpr SqlBackend, and you can't vary any of those types.

Okay, let’s find the instance definition:

-- line 452 in Database.Esqueleto.Internal.Sql
  where_ expr = Q $ W.tell mempty { sdWhereClause = Where expr }

  on expr = Q $ W.tell mempty { sdFromClause = [OnClause expr] }

  groupBy expr = Q $ W.tell mempty { sdGroupByClause = GroupBy $ toSomeValues expr }

  having expr = Q $ W.tell mempty { sdHavingClause = Where expr }

  locking kind = Q $ W.tell mempty { sdLockingClause = Monoid.Last (Just kind) }

There’s actually a bunch, and they mostly just tell about a part of the query we’re building. Cool. This may be useful soon, but it’s not immediately obvious to me how.

I spent some time perusing the rest of the library, and I found another combinator that takes a SqlQuery value and produces something else:

sub 
  :: PersistField a 
  => Mode 
  -> SqlQuery (SqlExpr (Value a)) 
  -> SqlExpr (Value a)
sub mode query = ERaw Parens $ \info -> toRawSql mode info query

This looks useful! Let’s look at the ERaw constructor, which is from the SqlExpr datatype:

  -- Raw expression: states whether parenthesis are needed
  -- around this expression, and takes information about the SQL
  -- connection (mainly for escaping names) and returns both an
  -- string ('TLB.Builder') and a list of values to be
  -- interpolated by the SQL backend.
  ERaw     
    :: NeedParens 
    -> (IdentInfo -> (TLB.Builder, [PersistValue])) 
    -> SqlExpr (Value a)

Okay, so we can start with this approach and just generate the raw SQL we need. This is probably the wrong approach, but it might work, and working is better than imaginary.

union
  :: PersistField a
  => SqlQuery (SqlExpr (Value a))
  -> SqlQuery (SqlExpr (Value a))
  -> SqlQuery (SqlExpr (Value a))
union query0 query1 =
  pure $ ERaw Parens $ \info ->
    let
      (q0, v0) = toRawSql SELECT info query0
      (q1, v1) = toRawSql SELECT info query1
     in
      (q0 <> " UNION " <> q1, v0 <> v1)

This is basically what sub does. We just concatenate them with the UNION operator in between. Let’s write a test and see how it works!

Testing the First Approach

I hop into the esqueleto test suite and start writing my test:

testCaseUnion :: Run -> Spec
testCaseUnion run = do
  describe "union" $ do
    it "works" $ do
      run $ do
        let
          names =
            [ "john", "joe", "jordan", "james" ]
          blogs =
            [ "asdf", "qwer", "berty", "nopex" ]
        (pid:_) <- forM names $ \name ->
          insert (Person name Nothing Nothing 3)

        forM_ blogs $ \blog ->
          insert (BlogPost blog pid)

        res <- select $
          ( from $ \person -> do
            pure (person ^. PersonName)
          )
          `union`
          ( from $ \blog -> do
            pure (blog ^. BlogPostTitle)
          )

        liftIO $
          L.sort (map unValue res)
            `shouldBe`
              L.sort (names <> blogs)

We insert four blogs and people into the database, and then get the UNION of their names and titles. Does it work?

No :(

  test/Common/Test.hs:1421:11:
  1) Tests that are common to all backends, union, works
       expected: ["asdf","berty","james","joe","john","jordan","nopex","qwer"]
        but got: ["asdf"]

Okay, that’s a bit weird. Why does it only pick the first? Let’s test our understanding and try a raw query. I’ll add a line that runs the raw SQL and then I’m going to error out to see the output in the test suite:

-- ...
res' <- rawSql 
  (  "SELECT Person.name FROM Person "
  <> "UNION "
  <> "SELECT BlogPost.title FROM BlogPost" 
  )
  []
error (show $ map unSingle (res' :: [Single String]))
-- ...

Now, we get an error that shows what that query returned:

  test/Common/Test.hs:1417:9:
  1) Tests that are common to all backends, union, works
       uncaught exception: ErrorCall
       ["asdf","berty","james","joe","john","jordan","nopex","qwer"]
       CallStack (from HasCallStack):
         error, called at test/Common/Test.hs:1417:9 in main:Common.Test

Okay, that’s exactly what I expected to see! So there’s something weird about how the query is being generated. I want to get a textual representation of the query, and toRawSql is the function to do that. I’m going to make a wrapper around it:

renderQuery
  :: (Monad m, EI.SqlSelect a r)
  => SqlQuery a
  -> SqlPersistT m TL.Text
renderQuery q = do
  conn <- ask
  pure (queryToText conn q)

queryToText :: EI.SqlSelect a r => SqlBackend -> SqlQuery a -> TL.Text
queryToText conn q =
  let (tlb, _) = EI.toRawSql EI.SELECT (conn, EI.initialIdentState) q
  in TLB.toLazyText tlb

And we’ll render the query:

-- ...snip
        let
          q =
            ( from $ \person -> do
              pure (person ^. PersonName)
            )
            `union`
            ( from $ \blog -> do
              pure (blog ^. BlogPostTitle)
            )
        res <- select q
        error . show =<< renderQuery q
-- snip...

Now, what do we get?

  test/Common/Test.hs:1415:9:
  1) Tests that are common to all backends, union, works
       uncaught exception: ErrorCall
       "SELECT (SELECT \"Person\".\"name\"\nFROM \"Person\"\n UNION SELECT \"BlogPost\".\"title\"\nFROM \"BlogPost\"\n)\n"
       CallStack (from HasCallStack):
         error, called at test/Common/Test.hs:1415:9 in main:Common.Test

Okay, so it’s doing something that we don’t want. We want this:

SELECT name
FROM person
UNION
SELECT title
FROM blog_post

And it’s doing this:

SELECT (
  SELECT name
  FROM person
  UNION
  SELECT title
  FROM blog_post
)

Which explains our problem! We actually need it to do SELECT * FROM (the union query), or to drop the outer SELECT entirely. So this suggests that this isn’t the right approach.

Next post, I’ll attempt to find another way to implement it, and write down the stream-of-thought process on how I got there.

January 31, 2019 12:00 AM

January 29, 2019

FP Complete

Defining exceptions in Haskell - FP Complete

Let’s talk about exceptions. Programs do a thing successfully all the time, except sometimes when things didn’t work out. So we have “exceptions”, which, like anything fun in programming languages, were invented in Lisp in the 60s.

by Chris Done (chrisdone@fpcomplete.com) at January 29, 2019 09:16 PM


JP Moresmau

Go Gotchas

As I wrote my first non-trivial project in Go, I had a few head-scratching moments. Nothing that a quick search couldn't clear up, but worth keeping in mind...

Nil is not nil

You may run into this one quickly enough in tests. An instance of an interface can have a nil value, but since it carries the interface type information, it does not equal nil! A lot has been written about this, so make sure to read up on it.
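
A minimal sketch of the trap (the types here are made up for illustration):

package main

import "fmt"

type myErr struct{}

func (*myErr) Error() string { return "boom" }

func get() *myErr { return nil } // returns a typed nil

func main() {
    var err error = get() // a nil *myErr boxed into a non-nil interface value
    fmt.Println(err == nil) // prints "false"!
}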

Redeclarations

The fact that you can use := to declare variables easily is a double-edged sword: declaring a variable with the same name as a previously declared one is allowed if it's in a different scope, and it shadows the original variable. So:
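
For example, in a minimal sketch like this (mayFail stands in for any call that returns an error):

package main

import (
    "errors"
    "fmt"
)

func mayFail() (int, error) {
    return 0, errors.New("boom")
}

func main() {
    var err error
    if true {
        // := declares a brand-new err here, shadowing the outer one
        v, err := mayFail()
        fmt.Println(v, err == nil) // prints "0 false"
    }
    // the outer err was never assigned, so it is still nil
    fmt.Println(err == nil) // prints "true"
}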



This code will only print True at the end. The "err" variable inside the if block has been redefined so even though it's not nil inside the if block, it IS nil once you exit it! So if you want to chain several operations that may return errors and then at the end handle whatever error occurred in a consistent way, you have to be careful.

Closures

This one gave me some Javascript PTSD:
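
Roughly this shape (a minimal sketch):

package main

import "fmt"

func main() {
    var funcs []func()
    for i := 1; i < 4; i++ {
        funcs = append(funcs, func() {
            fmt.Println(i) // every closure captures the same loop variable
        })
    }
    for _, f := range funcs {
        f() // the loop has finished by now, so i is 4 in every closure
    }
}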



This prints:
4
4
4

Which was not what I expected! One easy fix is to add a line saying i := i before creating the func in the first loop, which by virtue of the previous gotcha redeclares a variable in the scope of the for loop body.


Errors

I'm not convinced in the end by the standard error reporting in Go. A lot has been written on that topic too. I feel that the code is peppered with error handling that gets in the way of readability and of implementing the business logic cleanly. I'm sure there are some design tricks you can use, but you shouldn't have to resort to tricks to do clean error handling in a language, because at the end of the day it's how you handle errors that makes up the quality of your software, not only how you handle the happy path.

These things you can watch out for and probably code around pretty easily, but they do impact the ease of use of Go a bit, even though I still think it's a simple, elegant language in which you can quickly become productive.

Happy Go Hacking!

by JP Moresmau (noreply@blogger.com) at January 29, 2019 09:05 PM

January 28, 2019

Douglas M. Auclair (geophf)

December 2018 1HaskellADay Problems and Solutions

by geophf (noreply@blogger.com) at January 28, 2019 05:43 PM

Monday Morning Haskell

Why Haskell IV: Typeclasses vs. Inheritance


Welcome to part four of our series comparing Haskell's data types to other languages. As I've expressed before, the type system is one of the key reasons I enjoy programming in Haskell. And this week, we're going to get to the heart of the matter. We'll compare Haskell's typeclass system with the idea of inheritance used by object oriented languages.

If Haskell's simplicity inspires you as well, try it out! Download our Beginners Checklist and read our Liftoff Series to get going!

Typeclasses Review

Before we get started, let's do a quick review of the concepts we're discussing. First, let's remember how typeclasses work. A typeclass describes a behavior we expect. Different types can choose to implement this behavior by creating an instance.

One of the most common classes is the Functor typeclass. The behavior of a functor is that it contains some data, and we can map a type transformation over that data.

In the raw code definition, a typeclass is a series of function names with type signatures. There's only one function for Functor: fmap:

class Functor f where
  fmap :: (a -> b) -> f a -> f b

A lot of different container types implement this typeclass. For example, lists implement it with the basic map function:

instance Functor [] where
  fmap = map

But now we can write a function that assumes nothing about one of its inputs except that it is a functor:

stringify :: (Functor f) => f Int -> f String

We could pass a list of ints, an IO action returning an Int, or a Maybe Int if we wanted. This function would still work! This is the core idea of how we can get polymorphic code in Haskell.

Inheritance Basics

As we saw in previous parts, object oriented languages like Java, C++, and Python tend to use inheritance to achieve polymorphism. With inheritance, we make a new class that extends the functionality of a parent class. The child class can access the fields and functions of the parent. We can call functions from the parent class on the child object. Here's an example:

public class Person {
  public String firstName;
  public String lastName;
  public int age;

  public Person(String fn, String ln, int age) {
    this.firstName = fn;
    this.lastName = ln;
    this.age = age;
  }

  public String getFullName() {
    return this.firstName + " " + this.lastName;
  }
}

public class Employee extends Person {
  public String company;
  public String email;
  public int salary;

  public Employee(String fn,
                  String ln,
                  int age,
                  String company,
                  String em,
                  int sal) {
    super(fn, ln, age);
    this.company = company;
    this.email = em;
    this.salary = sal;
  }
}

Inheritance expresses an "Is-A" relationship. An Employee "is a" Person. Because of this, we can create an Employee, but pass it to any function that expects a Person. We can also call the getFullName function from Person on our Employee type.

public void printPerson(Person p) {
  ...
}

public static void main(String[] args) {
  Employee e = new Employee("Michael", "Smith", 23, "Google", "msmith@google.com", 100000);
  printPerson(e);
  String s = e.getFullName();
}

Here's another trick. We can put items constructed as either Person or Employee in the same array, if that array has type Person[]:

public static void main(String[] args) {
  Employee e = new Employee("Michael", "Smith", 23, "Google", "msmith@google.com", 100000);
  Person p = new Person("Katie", "Johnson", 25);
  Person[] people = {e, p};
}

This provides a useful kind of polymorphism we can't get in Haskell.

Benefits

Inheritance does have a few benefits. It allows us to reuse code. The Employee class can use the getFullName function without having to define it. If we wanted, we could override the definition in the Employee class, but we don't have to.

Inheritance also allows a degree of polymorphism, as we saw in the code examples above. If the circumstances only require us to use a Person, we can use an Employee or any other subclass of Person we make.

We can also use inheritance to hide variables away when they aren't needed by subclasses. In our example above, we made all our instance variables public. This means an Employee function can still call this.firstName. But if we make them private instead, the subclasses can't use them in their functions. This helps to encapsulate our code.

Drawbacks

Inheritance is not without its downsides though. One unpleasant consequence is that it creates a tight coupling between classes. If we change the parent class, we run the risk of breaking all child classes. If the interface to the parent class changes, we'll have to change any subclass that overrides the function.

Another potential issue is that your interface could deform to accommodate child classes. There might be some parameters only a certain child class needs, and some only the parent needs. But you'll end up having all parameters in all versions because the API needs to match.

A final problem comes from trying to understand source code. There's a yo-yo effect that can happen when you need to hunt down which function definition your code is using. For example, your child class can call a parent function. That parent function might call another function in its interface. But if the child has overridden it, you'd have to go back to the child. And this pattern can continue, making it difficult to keep track of what's happening. It gets even worse the more levels of hierarchy you have.

I was a mobile developer for a couple years, using Java and Objective C. These kinds of flaws were part of what turned me off OO-focused languages.

Typeclasses as Inheritance

Now, Haskell doesn't allow you to "subclass" a type. But we can still get some of the same effects of inheritance by using typeclasses. Let's see how this works with the Person example from above. Instead of making a separate Person data type, we can make a Person typeclass. Here's one approach:

class Person a where
  firstName :: a -> String
  lastName :: a -> String
  age :: a -> Int
  getFullName :: a -> String

data Employee = Employee
  { employeeFirstName :: String
  , employeeLastName :: String
  , employeeAge :: Int
  , company :: String
  , email :: String
  , salary :: Int
  }

instance Person Employee where
  firstName = employeeFirstName
  lastName = employeeLastName
  age = employeeAge
  getFullName e = employeeFirstName e ++ " " ++ employeeLastName e

We can make one interesting observation here: multiple inheritance is now trivial. After all, a type can implement as many typeclasses as it wants. Python and C++ allow multiple inheritance, but it presents enough conceptual pains that languages like Java and Objective C do not allow it.

Looking at this example though, we can see a big drawback. We won't get much code reusability out of this. Every new type will have to define getFullName. That will get tedious. A different approach could be to only have the data fields in the interface. Then we could have a library function as a default implementation:

class Person a where
  firstName :: a -> String
  lastName :: a -> String
  age :: a -> Int

getFullName :: (Person a) => a -> String
getFullName p = firstName p ++ " " ++ lastName p

data Employee = ...

instance Person Employee where
  ...

This allows code reuse. But it does not allow overriding, which the first example did. So you'd have to choose on a one-off basis which approach makes more sense for your type. And no matter what, we can't place different types into the same array, as we could in Java.
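
It's worth noting that typeclass default methods offer a middle ground here: keep getFullName in the class, but give it a default body. Instances then inherit the shared definition for free and can still override it if they want. A sketch:

class Person a where
  firstName :: a -> String
  lastName :: a -> String
  age :: a -> Int

  getFullName :: a -> String
  getFullName p = firstName p ++ " " ++ lastName p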

So while we could do inheritance in Haskell, it's a pattern you should avoid. Stick to using typeclasses in the intended way.

Comparisons

Object oriented inheritance has some interesting uses. But at the end of the day, I found the warts very annoying. Tight coupling between classes seems to defeat the purpose of abstraction. Meanwhile, restrictions like single inheritance feel like a code smell to me. The existence of that restriction suggests a design flaw. Finally, the issue of figuring out which version of a function you're using can be quite tricky. This is especially true when your class hierarchy is large.

Typeclasses express behaviors. And as long as our types implement those behaviors, we get access to a lot of useful code. It can be a little tedious to flesh out a new instance of a class for every type you make. But there are all kinds of ways to derive instances, and this can reduce the burden. I find typeclasses a great deal more intuitive and less restrictive. Whenever I see a requirement expressed through a typeclass, it feels clean and not clunky. This distinction is one of the big reasons I prefer Haskell over other languages.

Conclusion

That wraps up our comparison of typeclasses and inheritance! There's one more topic I'd like to cover in this series. It goes a bit beyond the "simplicity" of Haskell into some deeper ideas. We've seen concepts like parametric types and typeclasses. These force us to fill in "holes" in a type's definition. We can expand on this idea by looking at type families. Next week, we'll explore this more advanced concept and see what it's useful for.

If you want to stay up to date with our blog, make sure to subscribe! That will give you access to our subscriber only resources page!

by James Bowen at January 28, 2019 03:30 PM

Magnus Therning

A missing piece in my Emacs/Spacemacs setup for Haskell development

With the help of a work mate I’ve finally found this gem that’s been missing from my Spacemacs setup:

(with-eval-after-load 'intero
  (flycheck-add-next-checker 'intero '(warning . haskell-hlint))
  (flycheck-add-next-checker 'intero '(warning . haskell-stack-ghc)))

January 28, 2019 12:00 AM

January 27, 2019

Brandon Simmons

Hacking around on an FV-1 based guitar pedal

This is a diary-style post where I chronicle what I learn digging into a digital guitar effects pedal for the first time. I’m not sure how useful or interesting it will be to others.


I picked up an oddball modulation pedal from a “boutique” maker (being coy here) and decided there were a bunch of things I could improve about it, so I thought I’d try to tweak it. I know almost nothing about electronics.

I took the guts out of the enclosure and first googled the chips on the board. The largest was labeled “Spin Semiconductor FV-1”. This is a “batteries included” DSP chip and the only real product from Spin.

IYI (IF YOU’RE INTERESTED): KEITH BARR AND THE FV-1:

One of the pleasures of this project has been learning a little about the designer of the FV-1, Keith Barr. He passed away in 2010 and some of his contributions are summarized in this tribute: cofounder of MXR and creator of the Phase 90, founder of Alesis, pioneer in bringing digital recording and then digital reverb to the masses, etc.

The FV-1 was developed and released in the mid-2000s and is responsible for the boom in “boutique” reverb and (along with the PT2399) non-BBD delay pedals, being used by the likes of: Old Blood Noise, Catalinbread, Neunaber, Earthquaker, Red Panda, Keeley, Dr. Scientist, Walrus Audio, etc, etc.

I’m new to the space but it seems like Keith was the only person doing this sort of work (cheap DSP chip that acts like another analog component, accessible to hobbyists) and with his death there is sadly no “next-gen” FV-1 coming along.

I also noticed what turned out to be an EEPROM chip, and a spot marked “pgm” where a 5-pin connector header could be soldered, presumably for programming. I guessed the EEPROM was where the code itself was stored (having read this was a capability).

I traced the “pgm” solder pads carefully (using the setting on my multimeter that beeps when a circuit is formed), and found they all connected to the EEPROM. I drew up a little picture showing how the header and EEPROM chip relate.

At this point I was feeling pretty hopeful that I might be able to both write and possibly dump/decompile code from the EEPROM and tweak the functionality to my liking (maybe… eventually).

Dumping the dang ROM

I didn’t have any kind of special programmer, but found that reading and writing to an EEPROM is easily done with an arduino.

However I got concerned looking at the hardware setup: they connect 5v to pin A2, but all of mine are grounded.

I looked at the data sheet for the EEPROM and was surprised to find it really readable even for someone without much embedded background! I learned (sec 5.0 of the manual) these pins were for configuring the I²C address for the chip (this is clear in the tutorial, but I didn’t get it at first). So my EEPROM was hardwired to communicate at a different address (0x50).

Another thing that confused me for a second: the manual states the EEPROM’s write-protect pin needs to be connected to either Vcc or Vss (ground), but it didn’t seem to be connected to either (according to my multimeter). After looking closer I found it was connected to Vcc via a 1k resistor (I guess too much resistance for my multimeter to see continuity). I think the idea is that the corresponding pin from the header can be connected to ground, and current flows there, overriding the write-protect. I’m sure this is probably a common technique.

I have an older “Duemilanove” arduino and had to look up the SDA and SCL pins on my board (pretty easy to find on the arduino site).

I hoped the pedal’s circuitboard would have the necessary pullup resistors soldered on so I didn’t have to use a breadboard at all but that didn’t seem to be the case. The arduino’s ATmega’s built-in pull-up resistors also apparently won’t work reliably for i2c communication, so I just wired things up like the SparkFun article above shows.

Here’s what that looked like (mostly for my own future reference); mystery pedal on the left, Duemilanove in the background:

wired up for flashing EEPROM

(the dangling green wire can be connected to Gnd on the arduino to make the EEPROM writeable).

I used the arduino sketches from the sparkfun tutorial above, just tweaking some basics:

  • modified EEPROM_ADR (zeroed out the least significant bits, since all three address pins on my EEPROM were grounded, hardwiring it to 000)
  • loop over 32k bits with #define EEPROM_SIZE 4096

For developing and uploading the arduino sketch I used this vim plugin. I just had to add to my .vimrc:

let g:arduino_board = 'arduino:avr:diecimila'

…then an :ArduinoUpload in VIM builds and uploads the sketch to the board.

And the following seemed to be the simplest way to listen on the arduino serial port for the dumped EEPROM bytes. Note the baud passed to -s here should match Serial.begin(<baud>) in the arduino code:

$ busybox microcom -s 115200 /dev/ttyUSB0 > pedal.rom

After running this for the first time and inspecting the output with hexdump, the project went off the rails for a while: the output repeated every 512 bytes (so was this actually a 4096-bit EEPROM? Microchip’s docs describe it as “a single block of 4K x 8-bit memory”).

IYI: in fact “4k” here means 4096; they also write “kbit” to mean 1024 bits, which is not what I understand the term to mean. Bafflingly nowhere in the datasheet are they unambiguous about the actual size of the EEPROM. Also AFAIU there is no way to determine the size of an EEPROM without writing to it and reading back.

Or was it communicating with the arduino’s internal EEPROM on the same i2c address or something? (I unplugged one of the lines going to the EEPROM and got garbage, so ruled this out)

Or is the reader program actually buggy and skipping every 8 bytes, and then wrapping?

After inspecting the code I became more and more convinced the latter was what was happening:

  • Arduino’s Wire library doesn’t look like i2c at all (in fact it’s communicating with a separate chip that does i2c asynchronously (although the Wire library doesn’t take advantage of this))
  • the library is an undocumented mess, though this helped a little
  • most of the EEPROM code I tried didn’t actually match the documented spec as far as I could tell (e.g. no repeated START)

In short fertile ground for wasting a bunch of time chasing red herrings…

What was really happening is that the chip actually contained 8 banks of identical programs, which is what the FV-1 in fact expects (they can be eight different programs, obviously). Had I done a little more basic initial research about the FV-1, or taken the time to quickly test whether the EEPROM dump was correct even though I thought that unlikely (easily done by writing to the last byte and reading it back, which is what I eventually did), I would have saved myself a lot of time. This is like Bayesian reasoning, and it’s really easy to not do.

Also, oops. I had 5v running to the EEPROM (connected to the FV-1), although FV-1 takes 3.3v. Luckily this didn’t seem to fry the board…

Assembling / Disassembling

Someone named Igor has written an FV1 decompiler presumably as a PHP script. Unfortunately the code doesn’t seem to be open source, and I wasn’t immediately sure how to contact the author. After uploading my ROM I got back a legitimate looking but at-this-point-totally-meaningless-to-me bunch of annotated assembly, like:

   CHO     RDAL, SIN0      ; { ACC = SIN0 } ; Get LFO value
   WRAX    REG7 , 0        ; { REG7 = ACC; ACC = 0 * ACC ; PACC = ACC } ; * Note: C=0(16 bit) - s1.14 format
   OR      0x20000         ; { ACC = ACC | %00000000000000100000000000000000 }
   RDAX    REG7 , 1        ; { ACC = ACC + REG7 * 1 ; PACC = ACC } ; * Note: C=1(16 bit) - s1.14 format

So that’s cool. Now I need to get an assembler and I’m off to the races.

The standard SpinASM assembler is Windows-only (I haven’t tried it with Wine yet), so I tried an alternative that github user ndf-zz has written in Python, called asfv1:

$ pip3 install asfv1
$ asfv1 -b pedal.rom.disassembled pedal.rom.reassembled

I got some errors which seemed related to a documented quirk that I didn’t totally understand. After adding decimals to the problematic literals I was able to successfully assemble my disassembled program!

Unfortunately, comparing the first 512 bytes dumped from the EEPROM with my disassembled/reassembled output from asfv1 showed some differences. I uploaded the new ROM to the disassembler service again and looked at a diff: many immediate values of 1 had been turned into e.g. 6.103515625E-5 or 0.001953125 after going through asfv1 (and the mystery blackbox disassembler).

I re-read the asfv1 README more carefully (I’d read “fixed-point” as “floating point”), did a little research, and looked at a diff of the hexdumps of the ROMs; what was happening was pretty obvious:

asfv1 compiled the first line of assembly below to 0x0001, while the binary output for the original dumped ROM was achieved by using the decimal literal, as in the second line below:

                 `1` as unsigned 16-bit int

                            -
                          /   \
RDAX   ADCL , 1           00 01 02 84 ...etc
RDAX   ADCL , 1.0         40 00 02 84 ...etc
                          \   /
                            - 

               `1` as s1.14 fixed point value:
                   0b0100000000000000
                       \ fractional /

I wasn’t familiar with the notation “s1.14” used in the asfv1 and FV-1 ASM docs, but it quite simply means a fixed-point real value represented by: a sign bit, followed by 1 integer bit, followed by 14 fractional bits (totalling 16 bits).
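
Concretely, here's a quick Haskell sketch of that encoding (purely illustrative; this isn't code from any FV-1 tooling):

import Data.Bits ((.&.))
import Data.Word (Word16)

-- Encode a real value in [-2, 2) as a 16-bit s1.14 fixed-point word.
toS114 :: Double -> Word16
toS114 x = fromIntegral (round (x * 2 ^ (14 :: Int)) :: Int) .&. 0xffff

-- toS114 1.0 == 0x4000, matching the "RDAX ADCL, 1.0" encoding above.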

I dug into the asfv1 code and tweaked things so that, for real arguments, we treat decimal literals as real literals and values entered in hexadecimal or binary notation as a raw bit value.

With my fork I successfully assembled the output I got from the blackbox disassembler, and miraculously the output from the new asfv1 matches the original program we dumped from the EEPROM (head -c 512 pedal.rom)!

$ md5sum * | sort                                                                                                    
38092c4673ff63f561ad3413c732db43  pedal.rom.reassembled_with_my_asfv1_fork
38092c4673ff63f561ad3413c732db43  pedal.rom.1
9d13dcb79754603e85eca19cbf405c4a  pedal.rom.reassembled
...

I did a quick hack job on asfv1 without understanding the code or SpinASM very deeply, so beware, but if you want to try out the fork you can do:

$ pip3 install git+https://github.com/jberryman/asfv1.git

Building SpinCAD Designer

I get the impression many if not most pedal makers are developing their algorithms with the GPL’d drag-and-drop editor SpinCAD Designer.

It seems difficult to build, but luisfcorreia has a fork where they’ve made it buildable with maven. Here’s what I had to do to get it to build and run successfully on my debian stretch box:

$ git clone https://github.com/HolyCityAudio/SpinCAD-Designer.git
$ apt install maven
$ git branch maven
$ git pull https://github.com/luisfcorreia/SpinCAD-Designer.git dev
$ cd spincad-designer  # seems this work was done on a copy of the codebase to a different dir in the repo...?
$ _JAVA_OPTIONS=-Djdk.net.URLClassPath.disableClassPathURLCheck=true  mvn package
$ java -classpath ./target/spincad-1.0-SNAPSHOT.jar:./lib/elmGen-0.5.jar  com.holycityaudio.SpinCAD.SpinCADFrame

SpinCAD is able to import “Spin Hex” which apparently means the Intel HEX encoded ROM data. This is I guess a common format to feed to EEPROM writers, programmers, etc.

After some trial and error I was able to convert the binary image into HEX in such a way that “File > Open Hex” in SpinCAD didn’t choke:

$ sudo apt install srecord
$ srec_cat pedal.rom.1 -binary -output pedal.rom.1.hex -Intel --line-length=19

I was curious if SpinCAD would be able to disassemble and recognize the idioms from a spin ROM but unsurprisingly it does not. I probably won’t be using SpinCAD for this project, but the library of “modules” in the source code might be really valuable to learn from, or maybe I’ll build a visual “disassembler” myself using them some day.


Appendix: links, etc

A nice write-up/experience report from a first-time FV-1 developer, with good advice and links:

Easy DIY dev boards for FV-1 (why aren’t there more of these? What is the cheapest commercial FV-1 based pedal available?)

Alternatives to the FV-1, and some history:

Great resources for DIY pedal building generally to be found here. I found discussions of the FV-1 on all of these, though most builders focusing on analog electronics:

January 27, 2019 01:56 AM

January 26, 2019

Roman Cheplyaka

Surprises of the Haskell module system (part 2)

It’s been 6 years since I published Surprises of the Haskell module system (part 1), so I thought it was time to follow through on the implicit promise and publish part 2.

Exporting constructors without data type

Is it possible to export a data constructor without exporting its parent data type? To be clear, there’s probably no good reason to want to do that, but it’s a fun puzzle, and we’ll learn something while trying to solve it.

Surprisingly, the answer depends on how our constructor and type are named. First, consider the case when they are named differently, as in
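
data T = C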

The naive attempt
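
module M (C) where

data T = C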

does not work. The Haskell 2010 standard says this about data types in the export list:

An algebraic datatype T declared by a data or newtype declaration may be named in one of three ways:
  • The form T names the type but not the constructors or field names. The ability to export a type without its constructors allows the construction of abstract datatypes (see Section 5.8).
  • The form T(c1, …, cn) names the type and some or all of its constructors and field names.
  • The abbreviated form T(..) names the type and all its constructors and field names that are currently in scope (whether qualified or not).

Therefore, there seems to be no way to export a constructor without exporting its parent type at the same time. But here’s the trick. First, let’s move the data type definition from M to a different module, Internal:
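
module Internal where

data T = C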

Then, in our module M, import Internal like this:
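
import Internal hiding (T)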

This will import C but not T. Here’s Haskell 2010 on the semantics of hiding lists:

Entities can be excluded by using the form hiding (import1, …, importn), which specifies that all entities exported by the named module should be imported except for those named in the list. Data constructors may be named directly in hiding lists without being prefixed by the associated type. Thus, in

  import M hiding (C)

any constructor, class, or type named C is excluded. In contrast, using C in an import list names only a class or type.

So now, within module M, we have only C in scope but not T. But M does not export any of them yet, and it seems like we need T in scope in order to add C to the export list.

Here we use a second trick: instead of naming C specifically, we export our Internal module as a whole:
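
module M (module Internal) where

import Internal hiding (T)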

And this achieves the desired result: M now exports the data constructor C but not the data type T. See Part 1 for more discussion of module exports.

This all works because the data type and its constructor are named differently. If we have
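
data T = T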

this won’t work, because
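
import Internal hiding (T)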

will hide both the data type and the data constructor. I actually don’t know a way to export the constructor T without the type T; if you do, please let me know.

Exporting fields without constructor

A similar question is that of exporting a field name without exporting its parent constructor, as in
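
module M (T, field) where

data T = C { field :: Bool }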

The above code works. But what does it mean? Field names can be used as functions (e.g. field :: T -> Bool), but they can also be used in the record construction (C { field = True }) and record update (x { field = True }) syntaxes.

Since we don’t export the data constructor, clearly record construction won’t work. But what about record update? Does field lose its magic and become a simple function when exported this way—either because it wasn’t exported as T(field), or because the data constructor is not in scope? The answer is no, it is still a valid field name, and x { field = True } still works. Here’s Haskell 2010:

It makes no difference to an importing module how an entity was exported. For example, a field name f from data type T may be exported individually (f, item (1) above); or as an explicitly-named member of its data type (T(f), item (2)); or as an implicitly-named member (T(..), item(2)); or by exporting an entire module (module M, item (5)).

And there’s no requirement that the constructor be in scope in order to use the record update syntax—even though its desugaring would require all of the constructors to be in scope.

Some authors use this to their advantage: for instance, Michael Snoyman suggests exporting fields but not constructors for settings types.

But more often than not, we want the opposite: to make the type abstract and export a getter for it. If you think the module M above exports an abstract type T and a function field :: T -> Bool, then you may believe you are free to change the implementation of T without your users noticing. So in the next version you may change T to
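
module M (T, field) where

data T = C Bool

field :: T -> Bool
field (C b) = b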

This new module still exports a type T and a function field :: T -> Bool, but that function is no longer a record field. If someone was using field in a record update, x { field = True }, their code is going to break.

There’s no way to export a field as a plain function or otherwise strip it of its magic. You can either define a getter yourself, as in
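
module M (T, getField) where

data T = C { field :: Bool }

-- the getter name is chosen for illustration
getField :: T -> Bool
getField = field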

or create a copy of a field, as in
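
module M (T, field) where

data T = C { field_ :: Bool }

-- field_ is the real record field; field is a plain function copy
field :: T -> Bool
field = field_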

Hidden types and type synonyms

Consider this example:
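
module M (x) where

type T = Int

x :: T
x = 1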

What type does x have in a module that imports M? On the one hand, T is Int, so x :: Int. On the other hand, since T is not in scope, there’s no way to know that T = Int, so it might seem that x is now an opaque value of some abstract type T that we know nothing about. In fact, the former view is correct and the latter isn’t.

The type of an exported entity is unaffected by non-exported type synonyms. For example, in

  module M(x) where  
    type T = Int  
    x :: T  
    x = 1

the type of x is both T and Int; these are interchangeable even when T is not in scope. That is, the definition of T is available to any module that encounters it whether or not the name T is in scope. The only reason to export T is to allow other modules to refer it by name; the type checker finds the definition of T if needed whether or not it is exported.

Similarly, you may wonder what it means to export a function but hide some of the types that are part of its signature. Here’s what Haskell 2010 has to say about that:

[E]ntities that the compiler requires for type checking or other compile time analysis need not be imported if they are not mentioned by name. The Haskell compilation system is responsible for finding any information needed for compilation without the help of the programmer. That is, the import of a variable x does not require that the datatypes and classes in the signature of x be brought into the module along with x unless these entities are referenced by name in the user program. The Haskell system silently imports any information that must accompany an entity for type checking or any other purposes. Such entities need not even be explicitly exported: the following program is valid even though T does not escape M1:

  module M1(x) where  
    data T = T  
    x = T  
 
  module M2 where  
    import M1(x)  
    y = x

In this example, there is no way to supply an explicit type signature for y since T is not in scope. Whether or not T is explicitly exported, module M2 knows enough about T to correctly type check the program.

Self-exporting module

By default—when no export list is present in a module—all values, types and classes defined in the module are exported, but not those that are imported. With an explicit export list, we can export a mix of entities from the module itself and the modules it imports. But now you have to explicitly list everything defined in the current module and remember to update the list when you define a new function or type. Or do you?

There’s a nice trick to have an explicit export list and automatically export everything from the current module at the same time: just let the module export itself.
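
module M (module M) where

data T = C

f :: T
f = C

-- both T and f are exported, without being listed explicitly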

Thanks to the clear semantics of module exports (which we discussed in Part 1), self-exporting is well-defined and doesn’t involve any recursion.

January 26, 2019 08:00 PM

January 25, 2019

Neil Mitchell

HLint Unused Extension Hints

Summary: HLint detects unused extensions in LANGUAGE pragmas, including over 17,000 on Hackage.

HLint has detected unused LANGUAGE pragmas for a while - where you enable an extension (e.g. {-# LANGUAGE EmptyDataDecls #-}) but don't use it. HLint v2.1.13 includes some improvements from Yair and myself making these hints even more powerful. As a result, I thought it worth showing some examples of what HLint can do in this area. I started by running HLint on all of Hackage, which found 17,718 "Unused LANGUAGE pragma" hints, including the examples in this post.

Detecting unused extensions

For extensions that show up as syntax (e.g. EmptyDataDecls, ViewPatterns etc), HLint has rules saying which constructs require which extensions. For extensions that aren't syntax directed (e.g. AllowAmbiguousTypes or IncoherentInstances), HLint can't detect whether they are used or not. In all, HLint has rules for detecting 36 different unused extensions. Taking a look at some examples from Hackage:

abcBridge-0.15\src\Data\ABC\AIG.hs:18:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE EmptyDataDecls #-}
Perhaps you should remove it.

mallard-0.6.1.1\lib\Database\Mallard\Validation.hs:4:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE TemplateHaskell #-}
Perhaps you should remove it.

scholdoc-texmath-0.1.0.1\src\Text\TeXMath\Writers\TeX.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE GeneralizedNewtypeDeriving, ViewPatterns, GADTs #-}
Perhaps:
{-# LANGUAGE GeneralizedNewtypeDeriving, GADTs #-}

As we can see, HLint can spot entirely redundant extension declarations, and also prune those that are partly redundant.

Duplicate extensions

Sometimes extensions are simply duplicated, and HLint detects these, either between two separate pragmas, or within a single pragma.

ghcjs-base-stub-0.2.0.0\src\GHCJS\Marshal\Pure.hs:3:1: Warning: Use fewer LANGUAGE pragmas
Found:
{-# LANGUAGE DefaultSignatures #-}
{-# LANGUAGE DefaultSignatures #-}
Perhaps:
{-# LANGUAGE DefaultSignatures #-}

abstract-deque-tests-0.3\Data\Concurrent\Deque\Tests.hs:1:1: Warning: Use fewer LANGUAGE pragmas
Found:
{-# LANGUAGE BangPatterns, RankNTypes, CPP, BangPatterns #-}
Perhaps:
{-# LANGUAGE BangPatterns, RankNTypes, CPP #-}

Implied extensions

The new feature for v2.1.13 is that extensions are detected as redundant if they are implied by other extensions. For example, if you have PolyKinds defined then that implies KindSignatures. HLint now features a list of such implications, which it uses to detect redundant extensions.

AERN-RnToRm-0.5.0.1\src\Data\Number\ER\RnToRm\UnitDom\Base.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE MultiParamTypeClasses #-}
Perhaps you should remove it.
Note: Extension MultiParamTypeClasses is implied by FunctionalDependencies

attoparsec-0.13.2.2\Data\Attoparsec\ByteString\Char8.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE BangPatterns, CPP, FlexibleInstances, TypeFamilies,
TypeSynonymInstances, GADTs #-}
Perhaps:
{-# LANGUAGE BangPatterns, CPP, FlexibleInstances, TypeFamilies,
GADTs #-}
Note: Extension TypeSynonymInstances is implied by FlexibleInstances

Redundant extensions that imply non-redundant extensions

Sometimes there is an extension that you can tell is unused (e.g. RecordWildCards), which implies an extension that is either being used or can't be detected (e.g. DisambiguateRecordFields). In such cases HLint gives a note that the implied extension might now need to be provided explicitly, although usually it won't be necessary. As examples:

gogol-maps-engine-0.3.0\gen\Network\Google\Resource\MapsEngine\Projects\List.hs:7:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE RecordWildCards #-}
Perhaps you should remove it.
Note: may require `{-# LANGUAGE DisambiguateRecordFields #-}` adding to the top of the file

manifolds-0.5.0.1\Data\Function\Affine.hs:14:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE FunctionalDependencies #-}
Perhaps you should remove it.
Note: may require `{-# LANGUAGE MultiParamTypeClasses #-}` adding to the top of the file

Being wrong

Finally, sometimes HLint gets it a bit wrong. As an example:

shake-0.17.1\src\Development\Shake\Internal\FileInfo.hs:1:1: Warning: Unused LANGUAGE pragma
Found:
{-# LANGUAGE GeneralizedNewtypeDeriving, DeriveDataTypeable, CPP,
ForeignFunctionInterface #-}
Perhaps:
{-# LANGUAGE GeneralizedNewtypeDeriving, DeriveDataTypeable, CPP
#-}

Here HLint has detected that ForeignFunctionInterface is not used, but in fact it is, although only under one branch of an #ifdef. To fix the hint we can put the extension itself under CPP, adjust the CPP definitions given to HLint, or ignore the hint.

by Neil Mitchell (noreply@blogger.com) at January 25, 2019 12:19 AM

Donnacha Oisín Kidney

A New Ring Solver for Agda

Posted on January 25, 2019
Tags: Agda

I’m finally at the point where I feel like I can make the project I’ve been working on for the past few months public: A Ring Solver for Agda. The focus of the project is ergonomics and ease-of-use: hopefully the interface to the solver is simpler and more friendly than the one that’s already there. It can do step-by-step solutions (like Wolfram Alpha). It’s also asymptotically faster than the old solver (and actually faster! The usual optimizations you might apply don’t actually work here, so this bit definitely took the most work).

Anyway, this work is all for my undergrad final year project, but I’m hoping to submit it to a conference or something in the next few weeks.

by Donnacha Oisín Kidney at January 25, 2019 12:00 AM

January 24, 2019

Michael Snoyman

Mismatched global packages

This blog post covers an issue with how packages are shipped with GHC, and an edge case of how this can negatively interact with Stack. The overall point about the package contents mismatch will apply more broadly, but I’m mostly focused in this post on how to better handle this in Stack.

Intro

GHC ships with a few wired in packages which cannot be reinstalled. Examples are ghc, base, and template-haskell. The situation with these for all build tools is simple: the GHC version determines the package version.

There are also packages which are not shipped with GHC at all. All build tools get to determine whatever version of these packages they want, assuming compatibility with the GHC version and the packages that are wired in with it.

There’s an in-between category: packages like process and transformers are shipped with GHC, but can also be easily reinstalled from Hackage. Some tools may decide to eagerly recompile the latest version of them. Stackage takes a different approach: in order to avoid invalidating the ghc package, it avoids recompiling any of these packages, and sticks by default with the global version. Stack is designed to work with that.

When recompilation happens

Suppose you have the following stack.yaml file:

resolver: ghc-8.6.3
packages:
- .

And suppose your local package depends on the unix package. On non-Windows systems, unix ships with GHC. Running stack build will therefore result in only building the local package, not unix, since unix is in the global package database.

Now let’s suppose that, instead of using a ghc-8.6.3 resolver, you use an LTS Haskell resolver:

resolver: lts-13.4
packages:
- .

Same thing: LTS Haskell snapshots never specify versions of global packages. Instead, they are implicitly inherited from the GHC global package database. In a few specific cases (like the stack init command), we want to avoid requiring that GHC is already installed, so we instead use a global hints file to specify the versions of packages which ship with GHC.

The unix package depends on the time package. Let’s suppose your stack.yaml file instead said:

resolver: lts-13.4
packages:
- .
extra-deps:
- time-1.9.2

And suppose your local package depends on both unix and time. If you run stack build, Stack has three basic options:

  1. Only recompile time, but keep the original unix. That’s a bad idea: you’ll now have two different versions of the datatypes in the time package floating around. People may remember extremely confusing error messages about “cannot match ByteString with ByteString” or similar. It comes from this kind of a build plan.
  2. Fail to compile at all. The unix in the global database is “invalidated” by the presence of a newly compiled time, and your package depends on unix. Just give up.
  3. Automatically add an extra-dep version of unix to the build plan, and recompile both time and unix.

(1) is a really bad idea, and (2) is pretty annoying. So today, Stack follows (3), and automatically/implicitly adds the invalidated global packages to the build plan.

Where do global packages come from?

Look at the following simplified output of a command describing the global process package shipped with GHC 8.6.3:

$ stack --resolver ghc-8.6.3 exec -- ghc-pkg describe process
name: process
version: 1.6.3.0
...
depends:
    base-4.12.0.0 deepseq-1.4.4.0 directory-1.3.3.0 filepath-1.4.2.1
    unix-2.7.2.2

Important bits:

  • We’re using process-1.6.3.0
  • We’re depending on base-4.12.0.0

Now check out the process.cabal file for process-1.6.3.0 on Hackage and on Github. You’ll notice the following bound:

    build-depends: base      >= 4.4 && < 4.12,

These pieces of information are contradictory: the cabal file says base < 4.12, but GHC ships with this version of process and base-4.12.0.0! How is this possible?

Well, it turns out, when GHC says it has process-1.6.3.0, it’s not code that’s downloaded from Hackage. Instead, it’s established via a submodule. Specifically, you can see on Github that GHC 8.6.3 is using process at Git commit 36a3ad5. By contrast, the v1.6.3.0 tag points to Git commit 7c0b5814.

This isn’t a recent revelation, nor mine alone. I peripherally knew about this, but I’ve heard of three different people discovering this problem in the past month. I’m writing this blog post to point out how it negatively affects Stack, and what to do about it.

How it breaks Stack

process depends on the directory package. Let’s use an extra-dep to provide a new version of directory, and then try to build:

$ cat stack.yaml 
resolver: ghc-8.6.3
packages: []
extra-deps:
- directory-1.3.3.1
$ stack build process

Error: While constructing the build plan, the following exceptions were encountered:

In the dependencies for process-1.6.3.0:
    base-4.12.0.0 from stack configuration does not match >=4.4 && <4.12  (latest matching version is 4.11.1.0)
needed since process is a build target.

Well that’s darn confusing. I changed a version of directory, and suddenly process has a bounds error?!? This makes perfect sense given the above (though it’s certainly unsatisfying):

  • GHC ships with a different version of process-1.6.3.0
  • process needs to be recompiled when we change the directory package
  • Stack assumes that it can simply use process-1.6.3.0 to replace the global process package
  • However, the process on Hackage differs slightly from the one shipped with GHC

This is a relatively simple problem which results in an annoying bounds error, and can be easily fixed by adding a new process extra-dep (spelled out below the list). However, other problems are much more severe:

  • What if GHC is compiling one of these packages with a special cabal flag? We won’t be able to reproduce that. In fact, there are comments in the Stack code base about this already.
  • What if the code on Hackage is fundamentally different from what’s shipped with GHC? We could get silent but significant behavior changes.
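
To spell out the simple fix mentioned above: pin a process release whose bounds admit the base shipped with GHC (the exact version below is illustrative):

resolver: ghc-8.6.3
packages: []
extra-deps:
- directory-1.3.3.1
- process-1.6.4.0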

Solution: make no assumptions

The core problem here is that Stack assumes process-1.6.3.0 provided by GHC is the same as what’s on Hackage. That assumption plays itself out in our selection of option (3) above. I propose: get rid of the assumption! Instead, move to option (2): if we replace a global package, prune all global packages that depend on it, and require adding those as explicit extra-dep entries.

This will break some existing build plans, but I’m proposing this change for Stack 2.0, which is already allowing some slight breakage along those lines. This will add some annoyance to users, but with the result of more clarity on what’s happening in a build plan. And perhaps we can even make an error message for these cases which points right at this post!

January 24, 2019 12:30 PM

January 23, 2019

Well-Typed.Com

An in-depth look at quickcheck-state-machine

Stateful APIs are everywhere: file systems, databases, widget libraries, the list goes on. Automated testing of such APIs requires generating sequences of API calls, and when we find a failing test, ideally shrinking such a sequence to a minimal test case. Neither the generation nor the shrinking of such sequences is trivial. After all, it is the very nature of stateful systems that later calls may depend on earlier calls: we can only add rows to a database table after we create it, we can only write to a file after we open it, etc. Such dependencies need to be tracked carefully. Moreover, in order to verify the responses we get back from the system, the test needs to maintain some kind of internal representation of what it thinks the internal state of the system is: when we read from a file, we need to know what was in the file in order to be able to verify if the response was correct or not.

In this blog post we will take an in-depth look at quickcheck-state-machine, a library for testing stateful code. Our running example will be the development of a simple mock file system that should behave identically to a real file system. Although simple, the example will be large enough to give us an opportunity to discuss how we can verify that our generator is producing all test cases we need, and how we can inspect whether the shrinker is doing a good job; in both cases, test case labelling will turn out to be essential. Throughout we will also discuss design patterns for quickcheck-state-machine tests which improve separation of concerns and reduce duplication. It should probably be pointed out that this is an opinionated piece: there are other ways to set things up than we present here.

We will not show the full development in this blog post, and instead focus on explaining the underlying concepts. If you want to follow along, the code is
available for download. We will assume version 0.6 of quickcheck-state-machine, which was recently released. If you are using an older version, it is recommended to upgrade, since the newer version includes some important bug fixes, especially in the shrinker.

Introducing the running example

Our running example will be the development of a simple mock file system; the intention is that its behaviour is identical to the real file system, within the confines of what it needs to support. We will represent the state of the file system as

data Mock = M {
    dirs  :: Set Dir
  , files :: Map File String
  , open  :: Map MHandle File
  , next  :: MHandle
  }

type MHandle = Int

emptyMock :: Mock
emptyMock = M (Set.singleton (Dir [])) Map.empty Map.empty 0

We record which directories (folders) exist, the contents of all files on the system, the currently open handles (where mock handles are just integers), and the next available mock handle. To avoid confusion between files and directories we do not use FilePath but instead use

data Dir  = Dir [String]
data File = File {dir :: Dir, name :: String}

As one example, here is the mock equivalent of readFile:

type MockOp a = Mock -> (Either Err a, Mock)

mRead :: File -> MockOp String
mRead f m@(M _ fs hs _)
  | alreadyOpen               = (Left Busy         , m)
  | Just s <- Map.lookup f fs = (Right s           , m)
  | otherwise                 = (Left DoesNotExist , m)
  where
    alreadyOpen = f `List.elem` Map.elems hs

We first check if there is an open handle to the file; if so, we disallow reading this file (“resource busy”); if the file exists, we return its content; otherwise we report a “does not exist” error. The implementation of the other mock functions is similar; the full API is

mMkDir :: Dir               -> MockOp ()
mOpen  :: File              -> MockOp MHandle
mWrite :: MHandle -> String -> MockOp ()
mClose :: MHandle           -> MockOp ()
mRead  :: File              -> MockOp String

Finally, we should briefly talk about errors; the errors that the mock file system can report are given by

data Err = AlreadyExists | DoesNotExist | HandleClosed | Busy

and they capture a subset of the IO exceptions1

fromIOError :: IOError -> Maybe Err
fromIOError e =
    case ioeGetErrorType e of
      GHC.AlreadyExists    -> Just AlreadyExists
      GHC.NoSuchThing      -> Just DoesNotExist
      GHC.ResourceBusy     -> Just Busy
      GHC.IllegalOperation -> Just HandleClosed
      _otherwise           -> Nothing

Testing

Typically we are developing some stateful code, and then write a pure (mock) implementation of the same thing to test it, making sure that the stateful implementation and the simpler pure model compute the same things. Here we are doing the opposite: we are adjusting the model (the mock file system) to match what the real file system does. Either way, the process is the same: we write tests, execute them both in the real thing and in the model, and compare results.

If we were writing unit tests, we might write tests such as

  • Write to two different files
  • Write to a file and then read it
  • etc.

However, as John Hughes of QuviQ – and one of the original authors of QuickCheck – likes to point out, as systems grow, it becomes impossible to write unit tests that test the interaction between all the features of the system. So, don’t write unit tests. Instead, generate tests, and verify properties.

To generate tests for our mock file system, we have to generate sequences of calls into the API; “open this file, open that file, write to the first file we opened, …”. We then execute this sequence both against the mock file system and against the real thing, and compare results. If something goes wrong, we end up with a test case that we can inspect. Ideally, we should then try to reduce this test to try and construct a minimal test case that illustrates the bug. We have to be careful when shrinking: for example, when we remove a call to open from a test case, then any subsequent writes that used that file handle must also be removed. A library such as quickcheck-state-machine can be used both to help with generating such sequences and, importantly, with shrinking them.

Reifying the API

It is important that we generate the test before executing it. In other words, the test generation should not depend on any values that we only get when we run the test. Such a dependency makes it impossible to re-run the same test multiple times (no reproducible test cases) or shrink tests to obtain minimal examples. In order to do this, we need to reify the API: we need to define a data type whose constructors correspond to the API calls:

data Cmd h =
    MkDir Dir
  | Open File
  | Write h String
  | Close h
  | Read File

Cmd is polymorphic in the type of handles h; this is important, because we should be able to execute commands both against the mock file system and against the real file system:

runMock ::       Cmd MHandle -> Mock -> (.. Mock)
runIO   :: .. -> Cmd IO.Handle -> IO ..

What should the return type of these functions be? After all, different functions return different things: Open returns a new handle, Read returns a string, the other functions return unit. To solve this problem we will simply introduce a union type2 for successful responses

data Success h = Unit () | Handle h | String String

A response is then either a successful response or an error:

newtype Resp h = Resp (Either Err (Success h))

It is now easy to implement runMock: we just map all the constructors in Cmd to the corresponding API calls, and wrap the result in the appropriate constructor of Success:

runMock :: Cmd MHandle -> Mock -> (Resp MHandle, Mock)
runMock (MkDir d)   = first (Resp . fmap Unit)   . mMkDir d
runMock (Open  f)   = first (Resp . fmap Handle) . mOpen  f
runMock (Write h s) = first (Resp . fmap Unit)   . mWrite h s
runMock (Close h)   = first (Resp . fmap Unit)   . mClose h
runMock (Read  f)   = first (Resp . fmap String) . mRead  f

where first :: (a -> b) -> (a, x) -> (b, x) comes from Data.Bifunctor.

We can write a similar interpreter for IO; it will take a FilePath as an additional argument that it will use as a prefix for all paths; we will use this to run the IO test in some temporary directory.

runIO :: FilePath -> Cmd IO.Handle -> IO (Resp IO.Handle)

References

Our interpreter for IO takes real IO handles as argument; but will not have any real handles until we actually run the test. We need a way to generate commands that run in IO but don’t use real handles (yet). Here is where we see the first bit of infrastructure provided by quickcheck-state-machine, references:

data Reference a r = Reference (r a)

where we will instantiate that r parameter to either Symbolic or Concrete:3

data Symbolic a = Symbolic Var
data Concrete a = Concrete a

In other words, a Reference a Concrete is really just a wrapper around an a; indeed, quickcheck-state-machine provides

reference :: a -> Reference a Concrete
concrete  :: Reference a Concrete -> a

However, a Reference a Symbolic is a variable:

newtype Var = Var Int

An example of a program using symbolic references is

openThenWrite :: [Cmd (Reference IO.Handle Symbolic)]
openThenWrite = [
      Open (File (Dir []) "a")
    , Open (File (Dir []) "b")
    , Write (Reference (Symbolic (Var 0))) "Hi"
    ]

This program corresponds precisely to our example from earlier: “open this file, open that file, then write to the first file we opened”. Commands can return as many symbolic references in their result values as they want4; in our simple example, only Open creates a new reference, and so Var 0 refers to the handle returned by the first call to Open.

When we execute the test, those variables will be instantiated to their real values, turning symbolic references into concrete references. We will of course not write programs with symbolic references in them by hand; as we will see later, quickcheck-state-machine provides infrastructure for doing so.

Since we will frequently need to instantiate Cmd and Resp with references to handles, we will introduce some special syntax for this:

newtype At f r = At (f (Reference IO.Handle r))
type    f :@ r = At f r

For example, here is a wrapper around runIO that we will need; it executes a command with concrete references:

semantics :: FilePath -> Cmd :@ Concrete -> IO (Resp :@ Concrete)
semantics root (At c) = (At . fmap reference) <$>
                          runIO root (concrete <$> c)

This is really just a call to runIO, with some type wrapping and unwrapping.

Relating the two implementations

When we run our tests, we will execute the same set of commands against the mock implementation and in real IO, and compare the responses we get after each command. In order to compare, say, a command “write to this MHandle” against the mock file system to a command “write to this IOHandle” in IO, we need to know the relation between the mock handles and the IO handles. As it turns out, the most convenient way to store this mapping is as a mapping from references to the IO handles (either concrete or symbolic) to the corresponding mock handles.

type HandleRefs r = [(Reference IO.Handle r, MHandle)]

(!) :: Eq k => [(k, a)] -> k -> a
env ! r = fromJust (lookup r env)

Then to compare the responses from the mock file system to the responses from IO we need to keep track of the state of the mock file system and this mapping; we will refer to this as the model for our test:

data Model r = Model Mock (HandleRefs r)

initModel :: Model r
initModel = Model emptyMock []

The model must be polymorphic in r: during test generation we will instantiate r to Symbolic, and during test execution we will instantiate r to Concrete.

Stepping the model

We want to work towards a function

transition :: Eq1 r => Model r -> Cmd :@ r -> Resp :@ r -> Model r

to step the model; we will gradually build up towards this. First, we can use the model to translate from commands or responses in terms of references to the corresponding commands or responses against the mock file system:

toMock :: (Functor f, Eq1 r) => Model r -> f :@ r -> f MHandle
toMock (Model _ hs) (At fr) = (hs !) <$> fr

Specifically, this can be instantiated to

toMock :: Eq1 r => Model r -> Cmd :@ r -> Cmd MHandle

which means that if we have a command in terms of references, we can translate that command to the corresponding command for the mock file system and execute it:

step :: Eq1 r => Model r -> Cmd :@ r -> (Resp MHandle, Mock)
step m@(Model mock _) c = runMock (toMock m c) mock

In order to construct the full new model however we also need to know how to extend the handle mapping. We can compute this by comparing the response we get from the “real” semantics (Resp :@ r) to the response we get from the mock semantics (from step), and simply zip the handles from both responses together to obtain the new mapping. We wrap all this up into an event:

data Event r = Event {
    before   :: Model  r
  , cmd      :: Cmd :@ r
  , after    :: Model  r
  , mockResp :: Resp MHandle
  }

and we construct an event from a model, the command we executed, and the response we got from the real implementation:

lockstep :: Eq1 r
         => Model   r
         -> Cmd  :@ r
         -> Resp :@ r
         -> Event   r
lockstep m@(Model _ hs) c (At resp) = Event {
      before   = m
    , cmd      = c
    , after    = Model mock' (hs <> hs')
    , mockResp = resp'
    }
  where
    (resp', mock') = step m c
    hs' = zip (toList resp) (toList resp')

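(The zip over toList in lockstep relies on Resp, and hence Success, having Foldable instances. These are among the type class instances deferred to the full code; a sketch of the derivations presumably involved:)

{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}

data Success h = Unit () | Handle h | String String
  deriving (Eq, Show, Functor, Foldable, Traversable)

newtype Resp h = Resp (Either Err (Success h))
  deriving (Eq, Show, Functor, Foldable, Traversable)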
The function we mentioned at the start of this section is now easily derived:

transition :: Eq1 r => Model r -> Cmd :@ r -> Resp :@ r -> Model r
transition m c = after . lockstep m c

as well as a function that compares the mock response and the response from the real file system and checks that they are the same:

postcondition :: Model   Concrete
              -> Cmd  :@ Concrete
              -> Resp :@ Concrete
              -> Logic
postcondition m c r = toMock (after e) r .== mockResp e
  where
    e = lockstep m c r

We will pass this function to quickcheck-state-machine to be run after every command it executes to make sure that the model and the real system do indeed return the same responses; it therefore does not need to be polymorphic in r. (Logic is a type introduced by quickcheck-state-machine; think of it as a boolean with some additional information, somewhat similar to QuickCheck’s Property type.)

Events will also be very useful when we label our tests; more on that later.

Constructing symbolic responses

We mentioned above that we will not write programs with symbolic references in them by hand. Instead what will happen is that we execute commands in the mock file system, and then replace any of the generated handles by new variables. Most of this is handled behind the scenes by quickcheck-state-machine, but we do need to give it this function to construct symbolic responses:

symbolicResp :: Model Symbolic
             -> Cmd :@ Symbolic
             -> GenSym (Resp :@ Symbolic)
symbolicResp m c = At <$> traverse (const genSym) resp
  where
    (resp, _mock') = step m c

This function does what we just described: we use step to execute the command in the mock model, and then traverse the response, constructing a new (fresh) variable for each handle. GenSym is a monad defined in quickcheck-state-machine for the sole purpose of generating variables; we won’t use it anywhere else except in this function.

Generating commands

To generate commands, quickcheck-state-machine requires a function that produces the next command given the current model; this function will be a standard QuickCheck generator. For our running example, the generator is easy to write:

generator :: Model Symbolic -> Maybe (Gen (Cmd :@ Symbolic))
generator (Model _ hs) = Just $ QC.oneof $ concat [
      withoutHandle
    , if null hs then [] else withHandle
    ]
  where
    withoutHandle :: [Gen (Cmd :@ Symbolic)]
    withoutHandle = [
          fmap At $ MkDir <$> genDir
        , fmap At $ Open  <$> genFile
        , fmap At $ Read  <$> genFile
        ]

    withHandle :: [Gen (Cmd :@ Symbolic)]
    withHandle = [
          fmap At $ Write <$> genHandle <*> genString
        , fmap At $ Close <$> genHandle
        ]

    genDir :: Gen Dir
    genDir = do
        n <- QC.choose (0, 3)
        Dir <$> replicateM n (QC.elements ["x", "y", "z"])

    genFile :: Gen File
    genFile = File <$> genDir <*> QC.elements ["a", "b", "c"]

    genString :: Gen String
    genString = QC.sized $ \n -> replicateM n (QC.elements "ABC")

    genHandle :: Gen (Reference IO.Handle Symbolic)
    genHandle = QC.elements (map fst hs)

A few comments on the generator:

  • When we generate paths, we choose from a very small set of directory and file names. We are not really interested in testing that, for example, our implementation is Unicode-safe here; by picking from a small known set we generate tests that are more likely to contain multiple references to the same file, without having to add special provisions to do so.
  • We cannot, of course, generate handles out of nowhere; fortunately, the model tells us which handles we have available, and so genHandle just picks one at random from that. We do not limit the generator to picking open handles: it is important to also test the behaviour of the system in the error case where a program attempts to write to a closed handle.
  • The elements function from QuickCheck, which picks a random element from a list, is partial: it will throw an error if the list is empty. We must be careful not to generate any commands requiring handles when we have no handles yet.
  • The reason for the Maybe in the type signature is that in some circumstances a generator might be unable to generate any commands at all given a particular model state. We don’t need to take advantage of this feature and therefore always return Just a generator.

We also need to define a shrinker for commands, but we can start with one that says that individual commands cannot be shrunk:

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ _ = []

Putting it all together

In order to actually start generating and running tests, quickcheck-state-machine wants us to bundle all of this functionality up in a single record:

sm :: FilePath -> StateMachine Model (At Cmd) IO (At Resp)
sm root = StateMachine {
      initModel     = initModel
    , transition    = transition
    , precondition  = precondition
    , postcondition = postcondition
    , invariant     = Nothing
    , generator     = generator
    , distribution  = Nothing
    , shrinker      = shrinker
    , semantics     = semantics root
    , mock          = symbolicResp
    }

We have defined all of these functions above, with the exception of precondition. When quickcheck-state-machine finds a failing test, it will try to shrink it to produce a smaller test case. It does this by trying to remove commands from the program and then checking if the resulting program still “makes sense”: does the precondition of all commands still hold?

The precondition should be as liberal as possible; for instance, the precondition should not require that a file handle is open before writing to it, because we should also check that errors are handled correctly. Typically, the only thing the precondition should verify is that commands do not refer to non-existent handles (i.e., if we remove a call to open, then subsequent uses of the handle returned by that call to open simply cannot be executed anymore). Thus, we will define:

precondition :: Model Symbolic -> Cmd :@ Symbolic -> Logic
precondition (Model _ hs) (At c) =
    forall (toList c) (`elem` map fst hs)

Aside. Although we do not need it for our running example, one other thing a precondition might do is rule out test cases that we explicitly don’t want to test. For example, if our mock file system throws an error in some cases because a particular combination of arguments to a call is simply not supported, then we don’t want to flag this as a bug. A clean way to implement this is to extend the error type with a field that marks an error as an intentional limitation; then the precondition can simply rule out any commands that (in the model) would return an error with this mark. This keeps the definition of the precondition itself simple, and the logic of what is and what isn’t an intentional limitation lives in the implementation of the model itself.

Running the tests

Apart from some additional type class instances (see the code), we are now ready to define and execute the actual test:

prop_sequential :: FilePath -> QC.Property
prop_sequential tmpDir =
    forAllCommands (sm rootUnused) Nothing $ \cmds ->
      QC.monadicIO $ do
        tstTmpDir <- liftIO $ createTempDirectory tmpDir "QSM"
        let sm' = sm tstTmpDir
        (hist, _model, res) <- runCommands sm' cmds
        prettyCommands sm' hist
          $ checkCommandNames cmds
          $ res QC.=== Ok

rootUnused :: FilePath
rootUnused = error "root not used during command generation"

All functions prefixed QC are standard QuickCheck functions. The others come from quickcheck-state-machine:

  • forAllCommands uses QuickCheck’s forAllShrink and instantiates it with the command generation and shrinking infrastructure from quickcheck-state-machine.
  • runCommands then executes the generated commands, validating the postcondition at every step.
  • prettyCommands renders those commands in case of a test failure.
  • checkCommandNames adds some statistics about the distribution of commands in the generated tests.

We can run the test in ghci:

> quickCheck (prop_sequential "./tmp")
+++ OK, passed 100 tests:
56% coverage

 3% [("MkDir",1),("Read",1)]
 3% []
 2% [("MkDir",1),("Open",1),("Read",1)]
 2% [("MkDir",1)]
 1% [("Close",1),("MkDir",1),("Open",5),("Read",4),("Write",2)]
...

It tells us that all tests passed, and gives us some statistics about the tests that were executed: in 3% of the cases, they contained a single MkDir and a single Read, 3% were completely empty, 2% contained one call to MkDir, one call to Open, one call to Read, and so on.

Labelling

At this point you might conclude that you’re done. We have the real implementation, we have the mock implementation, they return identical results for all tests, what else could we want?

Let’s think back to those unit tests we were on the verge of writing, but stopped just in time because we remembered that we should generate unit tests, not write them:

  • Write to two different files
  • Write to a file and then read it

How do we know that our generated tests include these two cases (and all the other unit tests that we would have written)? We get some statistics from quickcheck-state-machine, but it’s not terribly helpful. For example, the first line above tells us that 3% of our test cases contain one call to MkDir and one call to Read; but we know that that call to Read must fail, because if these are the only two commands we execute, there aren’t any files for that Read to read.

The solution is to label tests. For example, we might introduce labels, or tags, that correspond to the two unit tests above:

data Tag = OpenTwo | SuccessfulRead

We then need to write a function that looks at a particular test case and checks which labels apply. It is important to realize that this does not mean we are bringing back the same unit tests under a different guise: programs that will be labelled OpenTwo must write to two different files, but may also do a whole bunch of other things.

We can use the foldl package to write such a labelling function in a convenient manner. The labelling function will look at the Events and produce Tags. To check if we open (at least) two different files, we keep track of all the successful calls to open; for SuccessfulRead we simply remember if we have seen a call to Read with a non-error result:

tag :: [Event Symbolic] -> [Tag]
tag = Foldl.fold $ catMaybes <$> sequenceA [
      openTwo
    , successfulRead
    ]
  where
    openTwo :: Fold (Event Symbolic) (Maybe Tag)
    openTwo = Fold update Set.empty extract
      where
        update :: Set File -> Event Symbolic -> Set File
        update opened ev =
            case (cmd ev, mockResp ev) of
              (At (Open f), Resp (Right _)) ->
                Set.insert f opened
              _otherwise ->
                opened

        extract :: Set File -> Maybe Tag
        extract opened = do
            guard (Set.size opened >= 2)
            return $ OpenTwo

    successfulRead :: Fold (Event Symbolic) (Maybe Tag)
    successfulRead = Fold update False extract
      where
        update :: Bool -> Event Symbolic -> Bool
        update didRead ev = didRead ||
            case (cmd ev, mockResp ev) of
              (At (Read _), Resp (Right _)) ->
                True
              _otherwise ->
                False

        extract :: Bool -> Maybe Tag
        extract didRead = do
            guard didRead
            return SuccessfulRead

(For a read to be successful, we must have created the file first – this is a property of the semantics, we don’t need to enforce this in the labeller.)

The commands we get back from the forAllCommands function of quickcheck-state-machine are of type Commands. This is a simple wrapper around a list of Command; Command in turn bundles a command (Cmd) along with its response (and the variables in that response). We can therefore easily turn this into a list of events:

execCmd :: Model Symbolic
        -> Command (At Cmd) (At Resp)
        -> Event Symbolic
execCmd model (Command cmd resp _vars) =
    lockstep model cmd resp

execCmds :: Commands (At Cmd) (At Resp) -> [Event Symbolic]
execCmds = \(Commands cs) -> go initModel cs
  where
    go :: Model Symbolic
       -> [Command (At Cmd) (At Resp)]
       -> [Event Symbolic]
    go _ []       = []
    go m (c : cs) = e : go (after e) cs
      where
        e = execCmd m c

We can then define a variant on prop_sequential above that replaces checkCommandNames with our own labelling function (you could also use both, if you wanted to):

prop_sequential' :: FilePath -> QC.Property
prop_sequential' tmpDir =
    forAllCommands (sm rootUnused) Nothing $ \cmds ->
      QC.monadicIO $ do
        tstTmpDir <- liftIO $ createTempDirectory tmpDir "QSM"
        let sm' = sm tstTmpDir
        (hist, _model, res) <- runCommands sm' cmds
        prettyCommands sm' hist
          $ QC.tabulate "Tags" (map show $ tag (execCmds cmds))
          $ res QC.=== Ok

Here tabulate is the standard QuickCheck function for adding multiple labels to a test case. If we now re-run our tests, we get

> quickCheck (prop_sequential' "./tmp")
+++ OK, passed 100 tests.

Tags (57 in total):
63% OpenTwo
37% SuccessfulRead

The numbers here are a bit awkward to interpret: it says that across all tests a total of 57 labels were found, and of those 57 labels, 63% were OpenTwo and 37% were SuccessfulRead. In other words, 63% * 57 = 36 tags were OpenTwo (36 tests out of 100 were labelled as OpenTwo), and 37% * 57 = 21 tags were SuccessfulRead (21 tests out of 100 were labelled as SuccessfulRead). Note that it is perfectly possible for these two sets of tests to overlap (i.e., a single program can be tagged as both OpenTwo and SuccessfulRead).

Inspecting the labelling function

So we have a generator, and we have labels for the unit tests that we didn’t write; have we covered our bases now? Well, how do we know that our labelling function is correct? We might seem to descend into an infinite regress here, testing the tests that test the tests… but it would be good to at least have a look at some example test cases and how they are labelled. Fortunately, this is functionality that QuickCheck provides out of the box through a function called labelledExamplesWith. We can define a simple wrapper around it that lets us specify a particular seed (which will be useful once we start working on shrinking):

showLabelledExamples :: Maybe Int -> IO ()
showLabelledExamples mReplay = do
    replaySeed <- case mReplay of
                    Nothing   -> getStdRandom $ randomR (1, 999999)
                    Just seed -> return seed

    putStrLn $ "Using replaySeed " ++ show replaySeed

    let args = QC.stdArgs {
            QC.maxSuccess = 10000
          , QC.replay     = Just (QC.mkQCGen replaySeed, 0)
          }

    QC.labelledExamplesWith args $
      forAllCommands (sm rootUnused) Nothing $ \cmds ->
        repeatedly QC.collect (tag . execCmds $ cmds) $
          QC.property True

Instead of tabulate we must use collect to label the examples here; collect takes a single label at a time though, so we use my all-time favourite combinator to repeatedly call collect for all tags:

repeatedly :: (a -> b -> b) -> ([a] -> b -> b)
repeatedly = flip . List.foldl' . flip

(I wonder if I can patent this function?)
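To see concretely what it does, here is a tiny illustration of my own: repeatedly f xs threads an accumulator through f for each element of xs in turn.

-- repeatedly f [x, y] z  evaluates to  f y (f x z)
example :: [Int]
example = repeatedly (:) [1, 2, 3] []
-- example == [3,2,1]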

When we run this, we see something like

> showLabelledExamples Nothing
Using replaySeed 288187
*** Found example of OpenTwo
Commands [
    .. Open (File (Dir []) "b")
  , .. Open (File (Dir []) "a")
  ]

*** Found example of SuccessfulRead
Commands [
    .. Open (File (Dir []) "b")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Read (File (Dir []) "b")
  ]

(I’ve tidied up the output a bit and removed some redundant information.)

This is looking good: both of these look like the kind of examples we would have written ourselves for these tags.

Standard shrinking

In fact, looking at those test cases at the end of the previous section, you might think that they look surprisingly good: not only are they indeed instances of those tags, they are also very small test cases: pretty much the unit tests that we would have written if we were so misguided as to think we need to write unit tests. Were we simply lucky?

No, we were not. QuickCheck’s labelledExamples not only searches for labelled test cases, it also tries to shrink them when it finds one, and continues to shrink them until it can shrink them no further without losing the label. The shrinker itself is implemented in quickcheck-state-machine; if we disable it altogether and re-run the search for labelled examples, we might find an example such as the following for SuccessfulRead, where for clarity I’ve marked all failing commands with an asterisk (*):

Commands [
    .. Open (File (Dir ["z", "x", "y"]) "c")    (*)
  , .. MkDir (Dir ["z", "x")                    (*)
  , .. Read (File (Dir ["z", "y", "y"]) "c")    (*)
  , .. Read (File (Dir []) "c")                 (*)
  , .. MkDir (Dir [ "x" , "x" ])                (*)
  , .. Read (File (Dir ["x", "z", "x"]) "b")    (*)
  , .. MkDir (Dir ["y"])                       
  , .. MkDir (Dir [])                           (*)
  , .. Open (File (Dir ["z"]) "b")              (*)
  , .. Open (File (Dir []) "a")                
  , .. MkDir (Dir [])                           (*)
  , .. Open (File (Dir ["x"]) "b")              (*)
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Open (File (Dir ["x", "y"]) "b")         (*)
  , .. Open (File (Dir []) "b")                
  , .. MkDir (Dir ["y"])                        (*)
  , .. Read (File (Dir []) "a")                
  , .. MkDir (Dir ["z"])                       
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Close (Reference (Symbolic (Var 1)))    
  , .. Open (File (Dir ["z" , "z" , "y"]) "a")  (*)
  , .. Open (File (Dir ["x" , "x" , "z"]) "c")  (*)
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Open (File (Dir ["y", "y"]) "b")         (*)
  , .. Read (File (Dir ["x", "y", "x"]) "a")    (*)
  ]

This is looking significantly less ideal! If there was a bug in read, then this would certainly not be a very good minimal test case, and not something you would want to debug. So how does quickcheck-state-machine shrink tests? The basic approach is quite simple: it removes commands from the program. If the resulting program contains commands whose precondition is not satisfied (remember, for our running example this means that those commands would use handles that are no longer created) then it discards it as a possible shrink candidate; otherwise, it reruns the test, and if it still fails, repeats the process.

The large percentage of commands that are unsuccessful can easily be removed by quickcheck-state-machine:

Commands [
    .. MkDir (Dir ["y"])                       
  , .. Open (File (Dir []) "a")                
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Open (File (Dir []) "b")                
  , .. Read (File (Dir []) "a")                
  , .. MkDir (Dir ["z"])                       
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Close (Reference (Symbolic (Var 1)))    
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Close (Reference (Symbolic (Var 0)))    
  ]

Both calls to MkDir can easily be removed, and the resulting program would still be tagged as SuccessfulRead; and the same is true for repeated closing of the same handle:

Commands [
    .. Open (File (Dir []) "a")                
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Open (File (Dir []) "b")                
  , .. Read (File (Dir []) "a")                
  , .. Close (Reference (Symbolic (Var 1)))    
  ]

At this point the shrinker cannot remove the second call to Open because the second call to Close depends on it, but it can first remove that second call to Close and then remove that second call to Open, and we end up with the minimal test case that we saw in the previous section:

Commands [
    .. Open (File (Dir []) "a")                
  , .. Close (Reference (Symbolic (Var 0)))    
  , .. Read (File (Dir []) "a")                
  ]

Perfect.

Improving shrinking

Unfortunately, this does not mean that we can depend on quickcheck-state-machine to solve all our shrinking problems. Consider this run of showLabelledExamples:

> showLabelledExamples (Just 166205)
Using replaySeed 166205
*** Found example of OpenTwo
Commands [
    .. MkDir (Dir ["x"])
  , .. Open (File (Dir [])    "c")
  , .. Open (File (Dir ["x"]) "a")
  ]

This is indeed an example of a program in which we open at least two files; however, why is that call to MkDir still there? After all, if there were a bug in opening more than one file, a “minimal” test case that includes a call to MkDir would be a red herring, one which might send the person debugging the problem down the wrong path.

The reason that quickcheck-state-machine did not remove the call to MkDir, of course, is that without it the second call to Open would fail: that call creates a file in directory x, and the directory must exist for it to succeed. To fix this, we need to tell quickcheck-state-machine how to shrink individual commands; so far we have been using

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ _ = []

which says that individual commands cannot be shrunk at all. So how might we shrink an Open call? One idea might be to shrink the directory, replacing for instance /x/y/a by /x/a or indeed just /a. We can implement this using

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ (At cmd) =
    case cmd of
      Open (File (Dir d) f) ->
        [At $ Open (File (Dir d') f) | d' <- QC.shrink d]
      _otherwise ->
        []

If we use this shrinker and run showLabelledExamples a number of times, we will find that all the examples of OpenTwo are now indeed minimal… until it finds an example that isn’t:

> showLabelledExamples (Just 980402)
Using replaySeed 980402
*** Found example of OpenTwo
Commands [
    .. MkDir (Dir ["x"]))
  , .. Open (File (Dir [])    "a")
  , .. Open (File (Dir ["x"]) "a")
  ]

In this example we cannot shrink the directory to [] because the resulting program would try to open the same file twice, which is not allowed (“resource busy”). We need a better way to shrink this program.

What we want to implement is “try to replace the file path by some file in the (local) root”. It is important to realize however that a shrinker such as

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ (At cmd) =
    case cmd of
      Open _ -> -- BAD EXAMPLE
        [At $ Open (File (Dir []) f') | f' <- ["t1", "t2"]]
      _otherwise ->
        []

is a bad idea. This shrinker tries to replace any file path with either /t1 or /t2. However, this means that shrinking now never stops:

Open (File (Dir [] "t1"))

can be shrunk to

Open (File (Dir [] "t2"))

which can then be shrunk back to

Open (File (Dir [] "t1"))

and QuickCheck will loop when trying to shrink the test case. It is important that there is a clear direction to shrinking.

An approach that does work is the following: any file path that doesn’t start with a t can be replaced by the path /t100; moreover, any /tN (for some number N) can be replaced by /tN' for some number N' < N:

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ (At cmd) =
    case cmd of
      Open (File (Dir []) ('t' : n)) ->
        [openTemp n' | n' <- QC.shrink (read n)]
      Open _ ->
        [openTemp 100]
      _otherwise ->
        []
  where
    openTemp :: Int -> Cmd :@ Symbolic
    openTemp n = At $ Open (File (Dir []) ('t' : show n))

Now

Commands [
    .. MkDir (Dir ["x"])
  , .. Open (File (Dir [])    "a")
  , .. Open (File (Dir ["x"]) "a")
  ]

can shrink to

Commands [
    .. MkDir (Dir ["x"])
  , .. Open (File (Dir []) "a")
  , .. Open (File (Dir []) "t100")
  ]

at which point the call to MkDir can be removed; this will eventually shrink down to

Commands [
    .. Open (File (Dir []) "t0")
  , .. Open (File (Dir []) "t1")
  ]

Dependencies between commands

It’s been a long road, but we are almost there. The last thing we need to discuss is how to shrink programs with dependencies. The “open at least two” example above was relatively easy to shrink because we could shrink one command at a time. Sometimes however there are dependencies between commands. For example, consider this “minimal” example of “successful read”:

> showLabelledExamples (Just 617213)
Using replaySeed 617213
*** Found example of SuccessfulRead
Commands [
    .. MkDir (Dir [ "z" ])
  , .. Open (File (Dir ["z"]) "b")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Read (File (Dir ["z"]) "b")
  ]

As before, we have a call to MkDir which should ideally be removed. However, if we tried to change the path in the call to Open, the call to Read would fail: both of these commands should refer to the same path. But shrinking can only change one command at a time, and this is important to keep the computational complexity (runtime cost) of shrinking down. What to do?

The problem is that we want that call to Read to use the same path as the call to Open, but we have no way to express this. The solution is to make this expressible. After all, we can already express “the handle returned by that call to Open”; all we need to do is introduce a second kind of reference: a reference to a path.

The language of commands changes to

data Cmd fp h = Read (Expr fp) | -- rest as before

Cmd gets an additional type argument fp to record the types of paths, and instead of a File, Read now takes an Expr as argument:

data Expr fp = Val File | Var fp

We can of course still use a concrete path, as before, but we can also use a variable: “use the same path as used in that call to open”. This means that Open must return that reference, so Success and Resp get adjusted accordingly:

data Success fp h = Handle fp h | -- rest as before
newtype Resp fp h = Resp (Either Err (Success fp h))

Just as was the case for handles, by the time we actually run the code all variables have been resolved, so the interpreter isn’t any more difficult:

runMock :: Cmd File MHandle -> Mock -> (Resp File MHandle, Mock)
runMock (Open f) = first (Resp . fmap (Handle f)) . mOpen f
runMock (Read e) = first (Resp . fmap String)     . mRead (eval e)
-- rest as before

eval :: Expr File -> File
eval (Val f) = f
eval (Var f) = f

The IO interpreter is modified similarly. Most of the rest of the changes to the code are minor and mostly automatic. For example, At must now instantiate both parameters

newtype At f r = At (f (Reference File r) (Reference IO.Handle r))

the model must record the mapping from file references now too

type FileRefs r = [(Reference File r, File)]
data Model    r = Model Mock (FileRefs r) (HandleRefs r)

See Version2.hs in the example package for details.

Crucially, we can now take advantage of this in the shrinker: when we see a call to Read with a file path that we have seen before, we can shrink that to use a variable instead:

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker (Model _ fs _) (At cmd) =
    case cmd of
      Read (Val f) ->
        [At $ Read (Var r) | r <- mapMaybe (matches f) fs]
      -- other cases as before
  where
    matches :: File -> (r, File) -> Maybe r
    matches f (r, f') | f == f'   = Just r
                      | otherwise = Nothing

This means that the problematic example [5]

> showLabelledExamples (Just 617213)
Using replaySeed 617213
*** Found example of SuccessfulRead
Commands [
    .. MkDir (Dir [ "z" ])
  , .. Open (File (Dir ["z"]) "b")
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Read (Val (File (Dir ["z"]) "b"))
  ]

can now shrink to

Commands [
    .. MkDir (Dir [ "z" ])
  , .. Open (File (Dir ["z"]) "b")
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Read (Var 0)
  ]

At this point the shrinking for Open that we defined above can kick in, replacing z/b with /t100, making the call to MkDir redundant, and the example can ultimately shrink to

Commands [
    .. Open (File (Dir []) "t0")
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Read (Var 0)
  ]

Conclusions

Writing unit tests by hand is problematic for at least two reasons. First, as the system grows in complexity the number of interactions between the various features will be impossible to capture in hand written unit tests. Second, unit tests can only exercise the system in ways that the author of the unit tests foresees; often, bugs arise when the system is exercised in ways that were not foreseen. It is therefore important not to write unit tests, but rather generate them. We have seen how a library such as quickcheck-state-machine can assist in generating and shrinking meaningful sequences of API calls.

We have also seen why it is important to label test cases, how to inspect the labelling function, and how to use labelling to improve shrinking. Without writing a shrinker for individual commands, the standard quickcheck-state-machine shrinker already does a really good job at removing redundant commands. However, if we are serious about producing minimal test cases without red herrings that might lead debugging efforts astray (and we should be serious about that), then we also need to put some thought into shrinking individual commands.

Finally, we have seen how we can improve shrinking by making dependencies between commands explicit. This also serves as an example of a language with multiple types of references; the approach we put forth in this blog post essentially scales to an arbitrary number of types of references without much difficulty.

We have not talked about running parallel tests at all in this blog post. This is an interesting topic in its own right, but affects only the very outermost level of the test: prop_sequential would get an analogue prop_parallel, and nothing else would be affected; the quickcheck-state-machine documentation shows how to do this; the original paper describing the approach (in Erlang) by John Hughes and others is also well worth a read. Finally, quickcheck-state-machine is not the only library providing this functionality in Haskell; in particular, hedgehog, an alternative to QuickCheck, does also.

The way we set up the tests in this blog post is not the only way possible, but is one that we believe leads to a clean design. The running example in this blog post is a simplified version of a mock file system that we use in the new Ouroboros consensus layer, where we use it to simulate file system errors when testing a blockchain database. The tests for that blockchain database in turn also use the design patterns laid out in this blog post, and we have used those design patterns also in a test we added to quickcheck-state-machine itself to test the shrinker. Of course, being Haskellers, we would prefer to turn design patterns into actual code; indeed, we believe this to be possible, but it may require some more advanced type system features. This is something we want to investigate further.

Footnotes


  1. Mapping from IllegalOperation to HandleClosed is a bit too coarse, but suffices for the sake of this blog post.

  2. This has a somewhat unpleasant untyped feel to it. However, if the design patterns proposed in this blog post are used, this barely affects any code that we write: we never (or almost never) have to pattern match on that Success type.

  3. I’m eliding a Typeable constraint here.

  4. Technically, quickcheck-state-machine uses a function Resp :@ Symbolic -> [Var] to find all the references “bound” by a command.

  5. The Close call now refers to Var 1 instead of Var 0, because Open now creates two variables: the reference to the path and the reference to the handle.

by edsko at January 23, 2019 03:44 PM

Tweag I/O

Towards Interactive Data Science in Haskell:
Haskell in JupyterLab

Matthias Meschede, Juan Simões

Introduction

Haskell and data science - at first sight a great match: native function composition, lazy evaluation, fast execution times, and lots of code checks. These sound like ingredients for scalable, production-ready data transformation pipelines. What is missing then? Why is Haskell not widely used in data science?

One of the reasons is that Haskell lacks a standardized data analysis environment. For example, Python has a de facto standard library set with numpy, pandas and scikit-learn that form the backbone, and many other well-supported specialized libraries such as keras and tensorflow that are easily accessible. These libraries are distributed with user-friendly package managers and explained in a plethora of tutorials, Stack Overflow questions and millions of Jupyter notebooks. Most problems from beginner to advanced level can be solved by adapting and combining these existing solutions.

This post presents Jupyter and JupyterLab - both important ingredients of the Python ecosystem - and shows how we can take advantage of these tools for interactive data analysis in Haskell.

Jupyter and exploratory data analysis

Project Jupyter became famous through the browser-based notebook app that allows users to execute code in various compute environments and to interlace it with text and media elements (example gallery).

More generally, Project Jupyter standardizes the interactions between Jupyter frontends such as the notebook and Jupyter kernels, the compute environments. Typically a kernel receives a message from a frontend, executes some code, and responds with a rich media message. The frontend can render the response message as text, images, videos or small applications. All exchanged messages are standardized by the Jupyter protocol and therefore independent of the specific frontend or kernel that is used. Various frontends exist, as well as kernels for many different languages, such as Python, Haskell, R, C++ and Julia.

Quick REPL-like interaction of a user with a compute environment via a frontend is very useful for exploratory data analysis. The user interrogates and transforms the data with little code snippets and receives immediate feedback through rich media responses. Different algorithms (expressed as short code snippets) can rapidly be prototyped and visualized. Long multistep dialogues with the kernel can be assembled into a sequential notebook. Notebooks mix explanatory text, code and media elements and can therefore be used as human-readable reports (such as this blogpost). This REPL workflow has become one of the most popular ways for exploratory data analysis.

Conversations with a Jupyter kernel

IHaskell is the name of the Jupyter kernel for Haskell. It contains a little executable ihaskell that can receive and respond to messages in the Jupyter protocol (via ZeroMQ). Here is a little dialogue that is initiated by sending the following code snippet from the notebook frontend to ihaskell:

take 10 $ (^2) <$> [1..]

And here is the rendered answer that the frontend received from the kernel

[1,4,9,16,25,36,49,64,81,100]

In Jupyter parlance the above dialogue corresponds to the following execute_request:

>> shell.execute_request (8be63d5c-1170-495d-82da-e56272052faf) <<

header: {username: "", version: "5.2",
         session: "32fe9cd0-8c37-450e-93c0-6fbd45bfdcd9",
         msg_id: "8be63d5c-1170-495d-82da-e56272052faf",
         msg_type: "execute_request"}
parent_header: Object
channel: "shell"
content: {silent: false, store_history: true, user_expressions: Object,
          allow_stdin: true, stop_on_error: true,
          code: "take 10 $ (^2) <$> [1..]"}   <<<<< LOOK HERE
metadata: Object
buffers: Array[0]

and to the following display_data message that is received as a response

<< iopub.display_data (68cce1e7-4d60-4a20-a707-4bf352c4d8d2) >>

header: {username: "", msg_type: "display_data", version: "5.0"
         msg_id: "68cce1e7-4d60-4a20-a707-4bf352c4d8d2",
         session: "32fe9cd0-8c37-450e-93c0-6fbd45bfdcd9",
         date: "2018-08-02T08:14:10.245877Z"}
msg_id: "68cce1e7-4d60-4a20-a707-4bf352c4d8d2"
msg_type: "display_data"
parent_header: {username: "", msg_type: "execute_request", version: "5.0",
                msg_id: "8be63d5c-1170-495d-82da-e56272052faf",
                session: "32fe9cd0-8c37-450e-93c0-6fbd45bfdcd9"}
metadata: Object
content: {data: {text/plain: "[1,4,9,16,25,36,49,64,81,100]"},  <<<<< LOOK HERE
          metadata: {output_type: "display_data"}}
buffers: Array[0]
channel: "iopub"

ihaskell can import Haskell libraries dynamically and has some special commands to enable language extensions, print type information or to use Hoogle.
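For example, cells such as the following work (a small selection of directives, from memory, so treat the exact spellings as approximate):

:ext OverloadedStrings   -- enable a language extension
:type take 10            -- print type information
:hoogle foldMap          -- query Hoogle from the notebook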

JupyterLab

JupyterLab is the newest animal in the Jupyter frontend zoo, and it is arguably the most powerful: console, notebook, terminal, text editor, or image viewer, JupyterLab integrates these data science building blocks into a single web-based user interface. JupyterLab is set up as a modular system that can be extended. A module assembles the base elements and changes or adds features to build an IDE, a classical notebook, or even a GUI where all interactions with the underlying execution kernels are hidden behind graphical elements.

How can Haskell take advantage of JupyterLab's capacities? To begin with, JupyterLab provides plenty of out-of-the-box renderers that Haskell can use for free. Of the default renderers, the most interesting is probably Vega plotting, but geojson, plotly and many other formats are available from the list of extensions, which will certainly grow. Another use case might be to use the JupyterLab extension system to build a simple UI that interacts with an execution environment. Finally, Jupyter and its associated workflows are known by a large community. Using Haskell through these familiar environments softens the barrier that many encounter when exploring Haskell for serious data science.

Let's get into a small example that shows how to use the JupyterLab VEGA renderer with IHaskell.

Wordclouds using Haskell, Vega and JupyterLab

We will use here the word content of all blog posts of tweag.io, which are written as markdown text files. The following little code cell reads all .md files in the posts folder and concatenates them into a single long string, from which we remove some punctuation characters. This code cell is then sent to the ihaskell kernel, which responds to the final take expression with a simple text response.

:ext QuasiQuotes
import System.Directory
import Data.List

fnames <- getDirectoryContents "../../posts"
paths = ("../../posts/"++) <$> fnames
md_files = filter (isSuffixOf ".md") paths
text <- mconcat (readFile <$> md_files)
cleanedText = filter (not . (`elem` "\n,.?!-:;\"\'")) text
take 50 cleanedText
"title Were hiring<br>(Software engineer / devops)p"

The VEGA visualization specification for a wordcloud can be defined as a JSON string that is filled with our text data. A convenient way to write longer multiline strings in Haskell is QuasiQuotes. We use the fString QuasiQuoter from the PyF package. Note that {} fills in template data and {{ corresponds to an escaped {.
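As a tiny illustration of these escaping rules (my own example, using the same quasiquoter):

let n = 10 in [fString|{{"count": {n}}}|]
"{\"count\": 10}"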

import Formatting
import PyF
import Data.String.QQ

let vegaString = [fString|{{
  "$schema": "https://vega.github.io/schema/vega/v4.json",
  "width": 800,
  "height": 400,
  "padding": 0,

  "data": [
    {{
      "name": "table",
      "values": [
         "{take 20000 cleanedText}"
      ],
      "transform": [
        {{
          "type": "countpattern",
          "field": "data",
          "case": "upper",
          "pattern": "[\\\\w']{{3,}}",
          "stopwords": "(\\\\d+|youll|looking|like|youre|etc|yet|need|cant|ALSO|STILL|ISNT|Want|Lots|HTTP|HTTPS|i|me|my|myself|we|us|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|whose|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|will|would|should|can|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|upon|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|say|says|said|shall)"
        }},
        {{
          "type": "formula", "as": "angle",
          "expr": "[0, 90][~~(random() * 3)]"
        }},
        {{
          "type": "formula", "as": "weight",
          "expr": "if(datum.text=='VEGA', 600, 300)"
        }}
      ]
    }}
  ],

  "scales": [
    {{
      "name": "color",
      "type": "ordinal",
      "range": ["#3e4593", "#bc3761", "#39163d", "#2a1337"]
    }}
  ],

  "marks": [
    {{
      "type": "text",
      "from": {{"data": "table"}},
      "encode": {{
        "enter": {{
          "text": {{"field": "text"}},
          "align": {{"value": "center"}},
          "baseline": {{"value": "alphabetic"}},
          "fill": {{"scale": "color", "field": "text"}}
        }},
        "update": {{
          "fillOpacity": {{"value": 1}}
        }},
        "hover": {{
          "fillOpacity": {{"value": 0.5}}
        }}
      }},
      "transform": [
        {{
          "type": "wordcloud",
          "size": [800, 400],
          "text": {{"field": "text"}},
          "rotate": {{"field": "datum.angle"}},
          "font": "Helvetica Neue, Arial",
          "fontSize": {{"field": "datum.count"}},
          "fontWeight": {{"field": "datum.weight"}},
          "fontSizeRange": [12, 56],
          "padding": 2
        }}
      ]
    }}
  ]
}}|]

We display this JSON string with the native JupyterLab JSON renderer here for convenience. The Display.json function from IHaskell annotates the content of the display message as application/json and ihaskell sends the annotated display message to JupyterLab. In consequence, JupyterLab knows that it should use its internal application/json renderer to display the message in the frontend.

import qualified IHaskell.Display as D
D.Display [D.json vegaString]

[Image: Vega wordcloud]

In a similar way, we can annotate the display message with the application/vnd.vegalite.v2+json MIME renderer type. The VEGA string that we have generated earlier is then rendered with the internal JupyterLab javascript VEGA code:

D.Display [D.vegalite vegaString]

[Image: Vega wordcloud]

Conclusion

JupyterLab provides a REPL-like workflow that is convenient for quick exploratory data analysis and reporting. Haskell can benefit in particular from JupyterLab's rendering capacities, its pleasant user interface and its familiarity in the data science community. If you want to try it yourself, this blog post has been written directly as a notebook that is stored in this folder (with a dataset from Wikipedia). The IHaskell setup with this example notebook is available as a docker image that you can run with:

docker run -p 8888:8888 tweag/jupyterwith:latest

Alternatively, if you are a Nix user, you can try our declarative JupyterLab-on-Nix setup that is the subject of our next post.

January 23, 2019 12:00 AM

January 22, 2019

Neil Mitchell

Release delays with Stackage

Summary: There are two steps that delay new versions of packages in Stackage.

I aim to get the latest version of my software out to as many people as quickly as possible. Older versions have bugs, new versions have new features - that's why I release new versions. Unfortunately there are two steps in Stackage that slow down this process.

Taking an example, HLint depends on haskell-src-exts, and is tightly coupled to it, so (to a first approximation) every 0.1 bump to haskell-src-exts requires changing HLint. There are also lots of other packages that depend on haskell-src-exts. This situation leads to two delays in getting HLint to Stackage users, both of which are on display in bug 4214:

Issue 1: Reluctance to remove packages

Stackage has a policy that if a new package (e.g. haskell-src-exts) is released which breaks your package (e.g. haskell-src-meta) you have an unspecified amount of time to release an update. My experience is either packages are updated quickly (all upgrades on that ticket happened within 12 days) or the package maintainers never reply (46 days later no other maintainer has even left a comment).

It used to be the case that there were hard time limits (maximum one month), but my experience was those were never enforced. Unfortunately this lag can cause a significant delay until Stackage Nightly picks up an upgrade. It seems like a more mechanical rule (e.g. after 2 weeks with no update, or 6 weeks total) might keep the process ticking faster. I appreciate it's hard to break people's work, which is why making it come down to human judgement seems to lengthen the process significantly.

Delay imposed: up to 2 months, and sometimes requires chasing.

Issue 2: Existence of Stackage LTS

While the first issue is very much a trade off, the second one is (in my view) just a bad design of Stackage, as I've said before. There is Stackage Nightly which has the latest code. There is Stackage LTS which has older and therefore buggier code, up to 2-3 months older. Having two options is fine, but the stack tool and documentation direct people towards LTS as a preference. LTS is useful if you view the act of upgrading between 0.0.1 versions as low risk (which it isn't) or you find it easier to fix multiple distinct breaking changes when they are overlapped (which it isn't). Unfortunately Stackage LTS users won't get a new version of HLint until a new Stackage LTS version is created, even after it gets merged. On the plus side, this process happens automatically without any intervention by package authors.

Delay imposed: 2-3 months.

PS. While I criticise Stackage, that's because I want to make it better, since it is a very useful distribution channel for many people, and I'm grateful for the work the Stackage maintainers do to keep the process ticking along.

by Neil Mitchell (noreply@blogger.com) at January 22, 2019 09:27 PM

January 21, 2019

Monday Morning Haskell

Why Haskell III: Parametric Types

templating.jpg

Welcome back to our series on the simplicity of Haskell's data declarations. Last week, we looked at how to express sum types in different languages. We saw that they fit very well within Haskell's data declaration system. For Java and Python, we ended up using inheritance, which presents some interesting benefits and drawbacks. We'll explore those more next week. But first, we should wrap our heads around one more concept: parametric types.

We'll see how each of these languages allows for the concept of parametric types. In my view, Haskell has the cleanest syntax. But other compiled languages do pretty well incorporating the concept. Dynamic languages, though, provide insufficient guarantees for my liking.

This all might seem a little wild if you haven't done any Haskell at all yet! Read our Liftoff Series to get started!

Haskell Parametric Types

Let's remember how easy it is to do parametric types in Haskell. When we want to parameterize a type, we'll add a type variable after its name in the definition. Then we can use this variable as we would any other type. Remember our Person type from the first week? Here's what it looks like if we parameterize the occupation field.

data Person o = Person
  { personFirstName :: String
  , personLastName :: String
  , personEmail :: String
  , personAge :: Int
  , personOccupation :: o
  }

We add the o at the start, and then we can use o in place of our former String type. Now whenever we use the Person type, we have to specify a type parameter to complete the definition.

data Occupation = Lawyer | Doctor | Engineer

person1 :: Person String
person1 = Person "Michael" "Smith" "msmith@gmail.com" 27 "Lawyer"

person2 :: Person Occupation
person2 = Person "Katie" "Johnson" "kjohnson@gmail.com" 26 Doctor

When we define functions, we can use a specific version of our parameterized type if we want to constrain it. We can also use a generic type if it doesn't matter.

salesMessage :: Person Occupation -> String
salesMessage p = case personOccupation p of
  Lawyer -> "We'll get you the settlement you deserve"
  Doctor -> "We'll get you the care you need"
  Engineer -> "We'll build that app for you"

fullName :: Person o -> String
fullName p = personFirstName p ++ " " ++ personLastName p

Last of all, we can use a typeclass constraint on the parametric type if we only need certain behaviors:

sortOnOcc :: (Ord o) => [Person o] -> [Person o]
sortOnOcc = sortBy (\p1 p2 -> compare (personOccupation p1) (personOccupation p2))
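As an aside, the same function can be written more compactly with comparing from Data.Ord:

import Data.List (sortBy)
import Data.Ord (comparing)

sortOnOcc :: (Ord o) => [Person o] -> [Person o]
sortOnOcc = sortBy (comparing personOccupation)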

Java Generic Types

Java has a comparable concept called generics. The syntax for defining generic types is pretty clean. We define a type variable in brackets. Then we can use that variable as a type freely throughout the class definition.

public class Person<T> {
    private String firstName;
    private String lastName;
    private String email;
    private int age;
    private T occupation;

    public Person(String fn, String ln, String em, int age, T occ) {
        this.firstName = fn;
        this.lastName = ln;
        this.email = em;
        this.age = age;
        this.occupation = occ;
    }

    public T getOccupation() { return this.occupation; }
    public void setOccupation(T occ) { this.occupation = occ; }
    ...
}

There's a bit of a wart in how we pass constraints. This comes from Java's distinction between interfaces and classes. Normally, when you define a class that extends another class, you use the extends keyword. But when your class implements an interface, you use the implements keyword.

But with generic type constraints, you only use extends. You can chain constraints together with &. But if one of the constraints is a subclass, it must come first.

public class Person<T extends Number & Comparable & Serializable> {

In this example, our template type T must be a subclass of Number. It must then implement the Comparable and Serializable interfaces. If we mix the order up and put an interface before the parent class, it will not compile:

public class Person<T extends Comparable & Number & Serializable> {

C++ Templates

For the first time in this series, we'll reference a little bit of C++ code. C++ has the idea of "template types" which are very much like Java's generics. Here's how we can create our user type as a template:

template <class T>
class Person {
public:
  string firstName;
  string lastName;
  string email;
  int age;
  T occupation;

  bool compareOccupation(const T& other);
};

There's a bit more overhead with C++ though. C++ function implementations are typically defined outside the class definition. Because of this, you need an extra leading line for each of these stating that T is a template. This can get a bit tedious.

template <class T>
bool Person<T>::compareOccupation(const T& other) {
  ...
}

One more thing I'll note from my experience with C++ templates. The error messages from template types can be verbose and difficult to parse. For example, you could forget the template line above. This alone could cause a very confusing message. So there's definitely a learning curve. I've always found Haskell's error messages easier to deal with.

Python - The Wild West!

Since Python isn't compiled, there aren't type constraints when you construct an object. Thus, there is no need for type parameters. You can pass whatever object you want to a constructor. Take this example with our user and occupation:

class Person(object):

  # This definition hasn't changed!
  def __init__(self, fn, ln, em, age, occ):
    self.firstName = fn
    self.lastName = ln
    self.email = em
    self.age = age
    self.occupation = occ

stringOcc = "Lawyer"
person1 = Person(
    "Michael",
    "Smith",
    "msmith@gmail.com",
    27,
    stringOcc)

class Occupation(object):
  …

classOcc = Occupation()

# Still works!
person2 = Person(
  "Katie",
  "Johnson",
  "kjohnson@gmail.com",
  26,
  classOcc)

Of course, with this flexibility comes great danger. If you expect there are different types you might pass for the occupation, your code must handle them all! Without a compiler checking the types, it can be tricky to know whether it really does. So while you can write polymorphic code in Python, you're more limited. You shouldn't get too carried away, because it is more likely to blow up in your face.

Conclusion

Now that we know about parametric types, we have more intuition for the idea of filling in type holes. This will come in handy next week as we look at Haskell's typeclass system for sharing behaviors. We'll compare the object oriented notion of inheritance and Haskell's typeclasses. This distinction gets to the core of why I've come to prefer Haskell as a language. You won't want to miss it!

If these comparisons have intrigued you, you should give Haskell a try! Download our Beginners Checklist to get started!

by James Bowen at January 21, 2019 03:30 PM

January 20, 2019

Magnus Therning

Tagless final and Scotty

For a little while I’ve been playing around with event sourcing in Haskell using Conduit and Scotty. I’ve come far enough that the basic functionality I’m after is there together with all those little bits that make it a piece of software that’s fit for deployment in production (configuration, logging, etc.). There’s just one thing that’s been nagging me, testability.

The app is built of two main parts, a web server (Scotty) and a pipeline of stream processing components (Conduit). The part using Scotty is utilising a simple monad stack, ReaderT Config IO, and the Conduit part is using Conduit In Out IO. This means that in both parts the outer edge, the part dealing with the outside world, is running in IO directly. Something that isn’t really aiding in testing.

I started out thinking that I'd rewrite what I have using a free monad with a bunch of interpreters. Then I remembered that I've had “check out tagless final” on my to-do list for a while. This post is a record of the small experiments I did to see how to use it with Scotty to achieve (and actually improve on) what I have in my production-ready code.

1 - Use tagless final with Scotty

As a first little experiment I wrote a tiny web server that prints a string to stdout when it receives a GET request to /route0.

The printing to stdout is the operation I want to make abstract.
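Something like this minimal class (all names in these sketches are illustrative; they are not necessarily the ones from the original code):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text.Lazy (Text)

-- the single operation to abstract over: printing a line of text
class Monad m => MonadPrinter m where
    mPutStrLn :: Text -> m ()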

I then created an application type that is an instance of that class.
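For example, starting with plain IO underneath:

import qualified Data.Text.Lazy.IO as T
import Control.Monad.IO.Class (MonadIO)

newtype AppM a = AppM { unAppM :: IO a }
    deriving (Functor, Applicative, Monad, MonadIO)

instance MonadPrinter AppM where
    mPutStrLn = AppM . T.putStrLn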

Then I added a bit of Scotty boilerplate. It’s not strictly necessary, but does make the code a bit nicer to read.
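For instance, a pair of aliases that pin scotty's error type to Text:

import Web.Scotty.Trans (ActionT, ScottyT)

type Server  = ScottyT Text AppM ()
type Handler = ActionT Text AppM ()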

With that in place the web server itself is just a matter of tying it all together.
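A sketch (scottyT takes a port, a function for running the custom monad down to IO, and the routes):

import Web.Scotty.Trans (get, scottyT)
import Control.Monad.Trans.Class (lift)

main :: IO ()
main = scottyT 3000 unAppM server
  where
    server :: Server
    server = get "/route0" (lift (mPutStrLn "Hello there!"))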

That was simple enough.

2 - Add configuration

In order to try out how to deal with configuration I added a class for doing some simple logging:
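-- a sketch; the method name is illustrative
class Monad m => MonadLogger m where
    mLog :: Text -> m ()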

The straightforward way to deal with configuration is to create a monad stack with ReaderT, and since it's logging I want to do, the configuration consists of a single LoggerSet (from fast-logger).

That means the class instance can be implemented like this:
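-- a sketch using fast-logger's pushLogStrLn; AppM is rebuilt on ReaderT
import Control.Monad.Reader (MonadReader, ReaderT, ask)
import Control.Monad.IO.Class (MonadIO, liftIO)
import System.Log.FastLogger (LoggerSet, pushLogStrLn, toLogStr)

newtype AppM a = AppM { unAppM :: ReaderT LoggerSet IO a }
    deriving (Functor, Applicative, Monad, MonadIO, MonadReader LoggerSet)

instance MonadLogger AppM where
    mLog msg = do
        ls <- ask
        liftIO (pushLogStrLn ls (toLogStr msg))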

Of course the server function has to be changed too, and it becomes a little easier with a wrapper for runReaderT and unAppM.

With that in place the printing to stdout can be replaced by writing to the log:
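-- a sketch; newStdoutLoggerSet and defaultBufSize come from fast-logger
import Control.Monad.Reader (runReaderT)
import System.Log.FastLogger (defaultBufSize, newStdoutLoggerSet)

runAppM :: LoggerSet -> AppM a -> IO a
runAppM ls = flip runReaderT ls . unAppM

main :: IO ()
main = do
    ls <- newStdoutLoggerSet defaultBufSize
    scottyT 3000 (runAppM ls) server
  where
    server :: Server
    server = get "/route0" (lift (mLog "Hello there!"))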

Not really a big change, I’d say. Extending the configuration is clearly straight forward too.

3 - Per-request configuration

At work we use correlation IDs1 and I think that the most convenient way to deal with them is to put the correlation ID into the configuration after extracting it. That is, I want to modify the configuration on each request. Luckily it turns out to be possible to do that, despite using ReaderT for holding the configuration.

I can’t be bothered with a full implementation of correlation ID for this little experiment, but as long as I can get a new AppM by running a function on the configuration it’s just a matter of extracting the correct header from the request. For this experiment it’ll do to just modify an integer in the configuration.

I start by defining a type for the configuration and changing AppM:
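-- field names are illustrative
data Cfg = Cfg
    { cfgLoggerSet :: LoggerSet
    , cfgCount     :: Int  -- stands in for a real correlation ID
    }

newtype AppM a = AppM { unAppM :: ReaderT Cfg IO a }
    deriving (Functor, Applicative, Monad, MonadIO, MonadReader Cfg)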

The logger instance has to be changed accordingly, of course:
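-- a sketch: append the integer so each route's value shows up in the log
instance MonadLogger AppM where
    mLog msg = do
        Cfg ls n <- ask
        liftIO (pushLogStrLn ls (toLogStr msg <> toLogStr (' ' : show n)))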

The get function that comes with scotty isn’t going to cut it, since it has no way of modifying the configuration, so I’ll need a new one.
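Something along these lines (getC is an invented name):

import Web.Scotty.Trans (RoutePattern)

getC :: (Cfg -> Cfg) -> RoutePattern -> Handler -> Server
getC f pat action = get pat (withCfg f action)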

The tricky bit is in the withCfg function. It's indeed not very easy to read, I think:
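-- A sketch against scotty 0.11's internals: ActionT (exposed in
-- Web.Scotty.Internal.Types) wraps ExceptT over ReaderT over StateT,
-- so we map through all three layers and apply local to our Cfg.
-- (Use the strict or lazy mapStateT, whichever matches scotty's stack.)
import Web.Scotty.Internal.Types (ActionT (ActionT, runAM))
import Control.Monad.Reader (local)
import Control.Monad.Trans.Except (mapExceptT)
import Control.Monad.Trans.Reader (mapReaderT)
import Control.Monad.Trans.State.Strict (mapStateT)

withCfg :: (Cfg -> Cfg) -> Handler -> Handler
withCfg f = ActionT . mapExceptT (mapReaderT (mapStateT (local f))) . runAM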

Basically it reaches into the guts of scotty’s ActionT type (the details are exposed in Web.Scotty.Internal.Types, thanks for not hiding it completely), and modifies the ReaderT Config I’ve supplied.

The new server has two routes: the original one and a new one at GET /route1.
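A sketch, with the runner updated to take the full Cfg:

runAppM :: Cfg -> AppM a -> IO a
runAppM cfg = flip runReaderT cfg . unAppM

main :: IO ()
main = do
    ls <- newStdoutLoggerSet defaultBufSize
    scottyT 3000 (runAppM (Cfg ls 0)) server
  where
    server :: Server
    server = do
        getC id                         "/route0" (lift (mLog "Hello there!"))
        getC (\c -> c { cfgCount = 1 }) "/route1" (lift (mLog "Hello there!"))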

It’s now easy to verify that the original route, GET /route0, logs a string containing the integer ‘0’, while the new route, GET /route1, logs a string containing the integer ‘1’.


  1. If you don’t know what it is you’ll find multiple sources by searching for “http correlation-id”. A consistent approach to track correlation IDs through microservices is as good a place to start as any.

January 20, 2019 12:00 AM

January 18, 2019

Michael Snoyman

Kids Coding, Part 4

Previous lessons have all been focused on teaching our ten and eight year olds some coding, since our six year old (Yakov) is still working on reading and writing in English. However, Yakov’s been home sick all week, and he asked me today to teach him some programming. So I came up with a simplified version focused solely on the GHCi prompt.

I started off with:

> x = 32
> x + x

And then I asked him what he thought the answer would be. He quickly came back with 64. Then I typed in:

> y = 5
> x + y

And he got 37 pretty quickly. We played around with a few more bits of simple arithmetic (he hasn’t really mastered multiplication yet). I introduced defining variables in terms of other variables:

> x = 2 * 3
> y = x * 2
> y + 3

This took him a bit longer, but entirely due to the multiplication! This showed me a few things:

  1. Reassigning variables was not confusing for him in the least
  2. Variable replacement was similarly trivial

Finally, I decided to push things just a bit further and introduce functions:

> f x = x + 3
> f 7

This confused him at first, but once I explained that this was applying the function f to the number 7, he got it, and said “oh, it’s the +3 function.” (Remember from last time that he’s been playing the function game for a while.) Next I hit:

> x = f 0
> f x

This was easy: it’s 6! Finally I gave him two more challenging ones:

> f (f 0)
> f (f (f 10))

I fully expected confusion about parentheses. I was shocked: he wasn’t bothered by them at all. He immediately got both answers, and was very proud of himself.

Total time: less than 10 minutes, probably closer to 5. Which is good, because he’s got a short attention span and wanted to play some Nintendo with me too. Overall, I was very happy with how many concepts he was able to absorb.

(Thanks to my ~/.ghc/ghci_history file for reminding me what we covered today.)

January 18, 2019 11:31 AM

January 17, 2019

Neil Mitchell

Ignoring HLint

Summary: HLint now has more ways to ignore hints you don't like.

HLint makes suggestions about how to improve your Haskell code. But not everyone likes all the suggestions all the time, so HLint comes with ways of ignoring those annoying hints, and HLint 2.1.11 provides even more mechanisms. Without further ado, let's take a quick tour - full details are in the HLint README.

Method 1: the --default flag

To ignore all hints your code currently generates, run hlint as normal but pass the --default flag, which will generate a config file with all hints that fire set to ignore. Typically, when approaching a new code base to run HLint on, I start by doing:

hlint . --default > .hlint.yaml

After that, it's easy to remove ignored hints from .hlint.yaml one by one and fix the code.

Method 2: Add -ignore directives

In the .hlint.yaml file you can write:

- ignore: {name: Eta reduce}

This directive ignores the named hint, and is what --default generates. There are also more refined ways of ignoring a hint in certain modules, or ignoring all hints in certain modules (see the README).
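For example (module names here are illustrative):

- ignore: {name: Eta reduce, within: [Foo.Bar, Foo.Baz]} # only in these modules
- ignore: {within: Foo.Internal} # ignore all hints in this module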

Method 3: Add a {- comment -}

Method 3 actually has three sub-methods; you can write any of:

  • {-# ANN module "HLint: ignore Eta reduce" #-}
  • {-# HLINT ignore "Eta reduce" #-}
  • {- HLINT ignore "Eta reduce" -}

For ANN pragmas it is important to put them after any import statements. If you have the OverloadedStrings extension enabled you will need to give an explicit type to the annotation, e.g. {-# ANN module ("HLint: ignore Eta reduce" :: String) #-}. The ANN pragmas can also increase compile times or cause more recompilation than otherwise required, since they are evaluated by TemplateHaskell.

For {-# HLINT #-} pragmas GHC may give a warning about an unrecognised pragma, which can be suppressed with -Wno-unrecognised-pragmas.

For {- HLINT -} comments they are likely to be treated as comments in syntax highlighting, which can lead to them being overlooked.

My current preference is {- HLINT -}, but I think GHC should just special case {-# HLINT #-} and then in a few GHC releases we could use that. Unfortunately, other people disagree with me, so {- HLINT -} is the best we have.

Method 4: Using the C Pre Processor

hlint defines the __HLINT__ preprocessor definition (with value 1), so problematic definitions (including those that don't parse) can be hidden with:

#ifndef __HLINT__
foo = ( -- HLint would fail to parse this
#endif

by Neil Mitchell (noreply@blogger.com) at January 17, 2019 10:27 PM

January 16, 2019

Gabriel Gonzalez

Dhall - Year in review (2018-2019)

dhall-2018

The Dhall configuration language is now two years old and this post will review progress in 2018 and the future direction of the language in 2019.

If you’re not familiar with Dhall, you might want to visit the official website for the language, which is the recommended starting point. This post assumes familiarity with the language.

Also, I want to use this post to advertise a short survey that you can take if you are interested in the language and would like to provide feedback:

Progress in 2018

This section will review the highlights of what we accomplished over the last year. These highlights are not exhaustive and I focus on improvements that might encourage people to revisit the language if they were on the fence a year ago.

If you’re already familiar with recent progress in the language and you are more interested in where the language is going then you can jump to the Future direction section.

New language bindings

Several contributors stepped up to the plate to begin three new actively-maintained language bindings to Dhall.

Of these three the Clojure bindings are the ones closest to completion:

The Clojure bindings are sufficiently close to completion that they currently get an official vote on proposed changes to the language standard, giving them an equal voice in the language evolution process.

This is a complete reimplementation of the language entirely in Clojure that allows you to marshal Dhall expressions, including Dhall functions, directly into Clojure:

The Clojure bindings pave the way for making the Dhall configuration language a first class citizen on the JVM.

Additionally, two other language bindings have gotten pretty far along:

These latter two language bindings haven’t announced yet as they are works in progress, but I still wanted to recognize their work so far.

Also, I want to mention that adding a conformance test suite (thanks to Fabrizio Ferrai) helped drive parallel implementations by providing implementors with a tangible measure of progress towards the goal of 100% standard coverage.

Haskell - Cabal

Thanks to the work of Oliver Charles you can generate .cabal files with Dhall by using dhall-to-cabal.

This part of the project’s README sold me on the motivation for doing so:

We can go beyond Cabal files. If Cabal is a domain specific language for building Haskell projects, what does a domain specific language for building Haskell web applications look like? Does the separation of library, executable, and test-suite make sense here? Maybe we'd rather:

… and have this take care of some other details.

When you think about this it makes perfect sense: Haskell programmers use Cabal/Hackage to package and distribute Haskell code, but then what do they use to package and distribute “Cabal code”? The answer is a language like Dhall that builds in its own code distribution mechanism instead of relying on a separate build tool. This closes the loop so that you don’t need to maintain a growing tower of build tools as your project expands.

I’m pretty sure Cabal was the first “heavy duty” configuration format tested with Dhall because this project prompted the first swell of feature requests related to interpreter performance and usability improvements for working with giant schemas.

Also, the dhall-to-cabal project includes the entire Cabal schema encoded as a Dhall type, which you can find here:

This comes in handy when you want a systematic listing of all Cabal configuration features. If you have the dhall interpreter installed you can also view the normal form of the schema in all its glory by running:

You can also migrate an existing project using cabal-to-dhall, a tool which converts a .cabal file to the equivalent .dhall file.

Eta

Javier Neira with the support of TypeLead added Dhall as a supported file format for configuring Eta packages by building on top of the dhall-to-cabal project.

That project has also produced work-in-progress Eta and Java bindings to Dhall along the way, by using Eta to compile the Haskell implementation of Dhall to the JVM. When those are complete you will have yet another option for using the Dhall configuration language on the JVM.

If you are interested, you can follow the progress on those bindings via this GitHub issue:

Kubernetes

Dhall is commonly used for ops, and the first project to systematically integrate Dhall into a widely used ops tool is the dhall-kubernetes project, thanks to the work of Arian van Putten, Fabrizio Ferrai, and Thomas Scholtes.

I’ve never used Kubernetes, but everybody tells me that Kubernetes configurations are large, repetitive, and error-prone YAML files, which are the perfect use case for Dhall.

PureScript - Spago

The Spago project builds on top of psc-package to assemble a PureScript package set to build, using Dhall as the configuration format.

This tool takes advantage of Dhall’s import system so that the package set can be split over multiple files, which you can see in the top-level package set here:

… and users can import that and easily override packages locally without dealing with the headache of rebasing their local changes whenever the upstream package set changes.

Complete language standard

Last year I promised to upstream all features from the Haskell implementation into the language standard, since at the time a few import-related features were implementation-defined. This was a top priority because I didn’t want to treat other language bindings as second-class citizens.

This year we successfully standardized everything, meaning that there should no longer be any implementation-defined features. Additionally, all new functionality now begins with a change to the standard followed by a change to each implementation of the language, meaning that the Haskell implementation is no longer treated as a distinguished implementation.

Unsigned Natural literals

Earlier this year Greg Pfeil proposed to fix the obligatory + sign preceding Natural number literals, which bothered a lot of newcomers to the language. He proposed requiring a leading + for non-negative Integer literals instead of Natural numbers.

We knew this would be a highly breaking change, but we were all tired of the + signs which littered our code. The Natural type is much more natural to use (pun intended) than the Integer type, so why not optimize the syntax for Natural numbers?

So we made the change and now instead of writing an expression like:
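+2 + +3  -- illustrative: every Natural literal needed a leading +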

… you instead write:
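2 + 3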

This change also improved code comprehension, because before this change an expression like this:
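f +2 +3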

… could be misconstrued as adding f to various numbers, but after this change:
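f 2 3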

… the reader can more easily discern that f is being applied as a function to numeric arguments.

Type synonyms

Previously, users couldn’t create new types using let expressions and had to work around this limitation by using the import system to reuse types, like this:

Now users can define new types inline within the same file using an ordinary let expression:
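-- the same illustrative schema, now defined inline
let Person = { name : Text, age : Natural }

in  [ { name = "John Doe", age = 24 } ] : List Person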

Simpler Optional literals

Optional literals used to resemble lists, like this:
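[ 1 ] : Optional Natural  -- present

[]    : Optional Natural  -- absent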

Now you can use Some and None instead, like this:
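Some 1        -- present; the element type is inferred

None Natural  -- absent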

In particular, a present Optional literal no longer requires a type since Some can infer the type from the provided argument. This simplifies the common idiom of overriding an absent Optional value within a record of defaults:
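-- illustrative, using the // record-override operator
{ port = None Natural, logLevel = None Text } // { port = Some 8080 }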

The old list-like syntax is still supported but is deprecated and will be dropped. dhall lint will also automatically replace the old list-like Optional literals with their new Some/None equivalents.

Union constructors

Unions used to be one of the major pain points when using the language, due to having to specify all alternatives for a union literal, like this:
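< Left = 1 | Right : Text >  -- illustrative: every alternative must be listed, even unused ones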

My first attempt to improve this introduced the constructors keyword which took a union type as an argument and returned a record of functions to create each constructor. This changed the above code example to:
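let Example = constructors < Left : Natural | Right : Text >

in  Example.Left 1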

However, this initial solution introduced two new problems:

  • A lot of constructors-related boilerplate at the beginning of Dhall files
  • Performance issues due to these large intermediate records of constructors

A follow-up change resolved both issues by overloading the . operator to also access constructor functions directly from a union type (as if it were already a record), like this:
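< Left : Natural | Right : Text >.Left 1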

This solved both the performance issue (by eliminating the need for an intermediate record of constructors) and eliminated the constructors keyword boilerplate. Also, this simplification plays nicely with the auto-complete support provided by the dhall repl since you can now auto-complete constructors using the . operator.

Import caching

In 2017 Dhall added support for semantic integrity checks, where you tag an import with a SHA256 hash of a standard binary encoding of an expression’s normal form. This integrity check protects against tampering by rejecting any expression with a different normal form, guaranteeing that the import would never change.

Several astute users pointed out that you could locally cache any import protected by such a check indefinitely. Even better, the SHA256 hash makes for a natural lookup key within that cache.

We standardized and implemented exactly that idea and now any import protected by an integrity check is permanently cached locally using the standard directory prescribed by the XDG Base Directory Specification.

For example, you can now import the entire Prelude as a package protected by an integrity check:
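-- schematically (the actual hash is elided here)
let Prelude =
      https://raw.githubusercontent.com/dhall-lang/dhall-lang/master/Prelude/package.dhall
      sha256:…

in  Prelude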

The first time you resolve the Prelude the import may take a bit (~7 seconds on my machine) to fetch the entire package, but the normal form is then stored locally in a 5 KB file:

… and then subsequent attempts to import the same Prelude resolve much more quickly (~80 ms on my machine).

This means that you can now cheaply import the entire Prelude in every file instead of separately importing each function that you use.

Alternative imports

The language now provides support for fallback imports using the ? operator if import resolution fails.

For example, you can use this feature to import an environment variable if present but gracefully fallback to another value if absent:
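env:WORKERS ? 4  -- use the WORKERS environment variable if set, otherwise 4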

Or you can use the ? operator to provide alternative locations for obtaining an imported expression:
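https://example.com/config.dhall ? ./config.dhall  -- try the remote copy first, then a local fallback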

Multi-let expressions

You can now define multiple values within a single let expression instead of nesting let expressions. In other words, instead of this:
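let x = 1 in
let y = 2 in
x + y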

… you can now write this:
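let x = 1
let y = 2
in  x + y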

dhall lint will also automatically simplify any code using the old nested let style to use the new “multi-let” style.

Statically linked executables

The Haskell implementation of Dhall strives to be like the “Bash of typed functional programming”, but in order to do so the implementation needs to be small, statically linked, and portable so that sysadmins don't object to widely installing Dhall. In fact, if the executable satisfies those criteria then you don't even need your sysadmin's permission to try Dhall out within your own workspace.

Niklas Hambüchen made this possible through his general-purpose work on fully static Haskell executables built using Nix. Now Dhall's continuous integration system produces small (< 3 MB) Linux executables that have no dependency footprint whatsoever.

Major performance improvements

The Haskell implementation of Dhall has made dramatic strides in performance improvements over the last year, motivated by projects with very large schemas, such as:

… as well as Formation’s internal use of Dhall which has led to them upstreaming many performance improvements to handle large Dhall programs.

Thanks to the work of Fintan Halpenny, Greg Pfeil, @quasicomputational, and others, the Haskell implementation is between one and three orders of magnitude faster than it was a year ago, depending on the configuration file that you benchmark.

We’re also not done improving performance! We continue to improve as new projects continue to stretch the boundaries of what the language can do.

Type diffs

Large projects like these also led to usability improvements when working with gigantic types. The Haskell implementation now displays concise “type diffs” whenever you get a type mismatch so that you can quickly narrow down the problem no matter how much your configuration schema grows. This works no matter how deeply nested the error is.

For example, the following contrived example introduces four deeply nested errors in a gigantic schema (where the type is over 6000 lines long) and the error message still zeroes in on every error:

dhall repl

The Haskell implementation also added a REPL contributed by Oliver Charles that you can use to interactively interpret Dhall code, including sophisticated auto-completion support contributed by Basile Henry:

The REPL comes in handy when exploring large values or types, as illustrated by the dhall-nethack tutorial which uses the REPL:

dhall lint

The Haskell implementation also provides a useful dhall lint subcommand that you can use to not only format code but to also automatically improve the code in non-controversial ways.

For example, dhall lint will automatically remove unused let bindings and will simplify nested let expressions to instead take advantage of the newest multi-let feature.

dhall resolve --dot

Basile Henry also contributed support for visualizing the dependency tree of a Dhall expression like this:

The following tweet illustrates how to use this feature along with example output:

dhall freeze

Thanks to Tobias Pflug you can also automatically take advantage of Dhall’s semantic integrity checks using the dhall freeze subcommand. This command fetches all imports within a Dhall expression and then automatically tags all of them with semantic integrity checks.

For example:

dhall-lang.org

A while ago, Neuman Vong advised me that if you want your open source project to take off, you need a logo, a website, and a live demo in the browser.

So I took that advice to heart and now Dhall has all three! You can try out the language live in your browser by visiting:

This allows people to “try before they buy” and the site links to several other useful resources, such as the …

Dhall wiki

The Dhall wiki contains several useful educational resources for learning the language. The organization of the wiki closely follows the guidelines from this handy post on writing documentation:

The main thing that is missing is to migrate the Haskell tutorial into a language-agnostic tutorial.

Twitter account

You can also now follow the official Twitter account for the language:

This account regularly posts news and tips about the language and ecosystem that you can use to stay abreast of recent progress.

Switch from IPFS to GitHub

Early on in the language history we used IPFS to distribute the Dhall Prelude, but due to reliability issues we’ve switched to using GitHub for hosting Dhall code.

There’s even a convenient link you can use to browse the Prelude:

Future direction

Phew! That was a lot to recap and I’m grateful to all the contributors who made that possible. Now we can review where the language is going.

First, I’m no longer benevolent dictator-for-life of the language. Each new reimplementation of the language gets a vote on the language standard and now that the Clojure implementation of Dhall is essentially complete they get an equal say on the evolution of the language. Similarly, once the PureScript bindings and Python bindings are close to complete they will also get a vote on the language standard, too.

However, I can still use this post to outline my opinion of where the language should go.

Crossing the Chasm

A colleague introduced me to the book Crossing the Chasm, which heavily influenced my approach to designing and marketing the language. The book was originally written for startups trying to gain mainstream adoption, but it also strongly resonated with my experience doing open source evangelism (first for Haskell, and now Dhall).

The book explains that you need to first build a best-in-class solution for a narrowly-defined market. This in turn requires that you think carefully about what market you are trying to address and strategically allocate your limited resources to address that market.

So what “market” should Dhall try to address?

YAML

One of the clearest signals I’ve gotten from users is that Dhall is “the YAML killer”, for the following reasons:

  • Dhall solves many of the problems that pervade enterprise YAML configuration, including excessive repetition and templating errors

  • Dhall still provides many of the good parts of YAML, such as multi-line strings and comments, except with a sane standard

  • Dhall can be converted to YAML using a tiny statically linked executable, which provides a smooth migration path for “brownfield” deployments

Does that mean that Dhall is clearly the best-in-class solution for people currently using YAML?

Not quite. The key thing Dhall is missing for feature parity with YAML is a wide array of native language bindings for interpreting Dhall configuration files. Many people would prefer to use Dhall without having to invoke an external executable to convert their Dhall configuration file to YAML.

This is one of the reasons I’ve slowed down the rate of evolution of the standard so that many of the new language bindings have an opportunity to implement the full standard. Also, I expect that once more language bindings have votes on the standard evolution that will further stabilize the language since new features proposals will have a higher bar to clear.

That’s not to say that we will freeze the language, but instead we will focus on strategically spending our “complexity budget” on features that help displace YAML. If we spend our complexity budget on unrelated features then we will increase the difficulty of porting Dhall to new languages without addressing the initial use case that will help Dhall gain mainstream traction.

JSON integration

One of YAML’s features is that all JSON is also valid YAML, by definition. In fact, some people use YAML just for the fact that it supports both JSON and comments.

This suggests that Dhall, like YAML, should also natively support JSON in some way. Dhall’s issue tracker contains a few issues along these lines and the one I would most like to see completed this year is adding support for importing JSON files as Dhall expressions:

Editor support

Another thing Dhall is missing compared to YAML is widespread editor support. This is why another one of my goals for this year is to create a Dhall language server so that any editor that supports the language server protocol (basically all of them) would get Dhall support for free.

Ops

We can actually narrow down Dhall’s “market” further if we really want to be selective about what we work on. Dhall has also grown in popularity for simplifying ops-related configurations, providing several features that ops engineers care about:

  • Strong normalization

    Ops commonly suffers from the dilemma that too much repetition is error prone, but too much abstraction is also error prone if readers of the code can’t effectively audit what is going on. One of Dhall’s unique features is that all code is strongly normalizing, meaning that every expression can be reduced to an abstraction-free normal form. This is made possible by the fact that Dhall is not Turing-complete (another feature favored by Ops).

  • Absolute type safety

    Ops engineers care about reliability since they maintain the software that the rest of their company relies on and any outages can have devastating effects on both the product and the productivity of other engineers.

    This is one of the reasons for the Ops flight from Turing-complete languages to inert configuration files like JSON/YAML: Turing-complete languages give you the tools to shoot yourself in the foot. However, Dhall strikes a balance: it is programmable while still not being Turing-complete, and its type system has no escape hatches, so you're incapable of shooting yourself in the foot.

  • Built-in support for importing code

    Another reason that Ops people hate programmable configuration files is that the programming language they pick typically comes with an external build tool for the language that adds one more layer to the tower of build tools that they have to maintain. Now they've just replaced one problem (a repetitive configuration file for their infrastructure) with a new problem (a repetitive configuration file for the build tool of the programming language they used to reduce the original repetition).

    Dhall solves this problem well by providing built-in language support for importing other code (similar to Bash and Nix, both also heavily used for Ops use cases). This means that Dhall provides a solid foundation for their tower of automation because they don't need to introduce another tool to support a growing Dhall codebase.

  • Dhall displaces YAML well

    YAML configuration files are incredibly common in Ops and “infrastructure as code”. Example tools that use a YAML configuration are:

    • Kubernetes
    • Docker Compose
    • Concourse
    • Ansible
    • Travis

    YAML is so common that Ops engineers sometimes half-jokingly refer to themselves as “YAML engineers”.

    As already mentioned above, Dhall provides a sane alternative to YAML.

We’ve already seen one Dhall integration for an Ops tool emerge last year with the dhall-kubernetes project and this year I hope we continue along those lines and add at least one more Ops-related integration.

I think the next promising integration is the dhall-terraform project which is still a work in progress that would benefit from contributions.

Funding

Finally, I would like to experiment with various ways to fund open source work on Dhall now that the language has a growing userbase. In particular, I’d like to fund:

  • additional language bindings
  • better editor support
  • adding CI support for statically linked Windows and OS X binaries
  • packaging Dhall for various software distributions (e.g. .rpm/.deb)

… and I’d like to provide some way to reward the work of people who contribute beyond just acknowledging their work in posts like this one.

That’s why one of the survey questions for this year asks for suggestions on what would be the most appropriate (non-proprietary) funding model for that sort of work.

Conclusion

Hopefully that gives people a sense of where I think the language is going. If you have any thoughts on the direction of the language this would be a good time to take the survey:

Like last year, I will follow up a month from now with another post reviewing the feedback from the survey.

by Gabriel Gonzalez (noreply@blogger.com) at January 16, 2019 04:06 PM

January 15, 2019

Donnacha Oisín Kidney

A Binomial Urn

Posted on January 15, 2019
Tags: Haskell

When we started the series, we wanted to find a “better” fold: one that was more balanced than either foldl or foldr (in its placement of parentheses). Both of these are about as unbalanced as you can get:
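foldr f z [a, b, c, d] = f a (f b (f c (f d z)))
foldl f z [a, b, c, d] = f (f (f (f z a) b) c) d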

The first better fold I found was Jon Fairbairn’s simple treeFold:
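-- a sketch in the same spirit; Fairbairn's original differs in detail
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairs xs)
  where
    -- combine adjacent elements, halving the list on each pass
    pairs (x : y : rest) = f x y : pairs rest
    pairs rest           = rest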

Already this function was kind of magical: if your binary operator merges two sorted lists, foldr will give you insertion sort, whereas treeFold will give you merge sort; for summing floats, treeFold has a lower error growth than sum. By dividing up the work better, we were able to improve the characteristics of many algorithms automatically. We also saw that it could easily be made parallel:

In the next post, we saw how we could make the fold incremental, by using binary number representations for data structures. This let us do two things: it meant the fold was structurally terminating, so it would pass the termination checker (efficiently) in languages like Agda or Idris, and it meant we could write scanl using the fold. The scanl was also efficient: you could run the fold at any point in O(log n) time, and work would be shared between subsequent runs. Effectively, this let us use it to solve greedy optimization problems. We also saw how it was effectively constructing an implicit binomial priority queue under the hood, and how it exploited laziness to get sharing.

I’ve gotten huge mileage out of this fold and the general ideas about it, and today I’m going to show one more use of it. We’re going to improve some of the asymptotics of the data structure presented in Lampropoulos, Spector-Zabusky, and Foner (2017).

A Random Urn

The paper opens with the problem:

Suppose you have an urn containing two red balls, four green balls, and three blue balls. If you take three balls out of the urn, what is the probability that two of them are green?

If you were to take just one ball out of the urn, calculating the associated probabilities would be easy. Once you get to the second, though, you have to update the previous probability based on what ball was removed. In other words, we need to be able to dynamically update the distribution.

Using lists, this would obviously become an O(n) operation. In the paper, an almost-perfect binary tree is used. This turns the operation into one that's O(log n). The rest of the operations have the following complexities:

Operation   Complexity
insert      O(log n)
remove      O(log n)
fromList    O(n)

As a quick spoiler, the improved version presented here has these complexities:

Operation   Complexity
insert      O(1)
remove      O(log n)
merge       O(log n)
fromList    O(n)

We add another operation (merge), which means that the new structure is viable as an instance of Alternative, Monad, and so on, making it an efficient monad for weighted backtracking search.

Priority Queues

The key thing to notice in the paper which will let us improve the structure is that what they’re designing is actually a priority queue. Well, a weird looking priority queue, but a priority queue nonetheless.

Think about it like a max-priority queue (pop returns the largest element first), with a degree of “randomization”. In other words, when you go to do a pop, all of the comparisons between the ordering keys (the weights in this case) sprinkle some randomness into the equation, meaning that instead of 1 < 2 returning True, it returns True 2/3 of the time, and False the other 1/3.

This way of doing things means that not every priority queue is suitable: we want to run comparisons at pop time (not insert), so a binary heap (for instance) won't do. At branches (non-leaves), the queue will only be allowed to store summaries of the data, not the “max element”.

The one presented in the paper is something like a Braun priority queue: the O(n) fromList implementation is reminiscent of the one in Okasaki (1997).

So what priority queue can we choose to get us the desired efficiency? Why, a binomial one of course!

The Data Structure

The urn structure itself looks a lot like a binomial heap:
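-- a sketch consistent with the description below; the constructors and
-- field layout are guesses, not the post's actual definitions
data Tree a
  = Leaf !Word a                  -- a value and its weight
  | Node !Word (Tree a) (Tree a)  -- only the left subtree's weight is cached

data Urn a
  = Urn !Word            -- total weight of the urn
        [(Int, Tree a)]  -- (number of zeroes before this bit, tree)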

By avoiding the usual Skip constructors you often see in a binomial heap we save a huge amount of space. Instead, we store the “number of zeroes before this bit”. Another thing to point out is that only left branches in the trees store their weight: the same optimization is made in the paper.

Insertion is not much different from insertion for a usual binomial priority queue, although we don’t need to do anything to merge the trees:

We could potentially get insertion from amortized O(1) to worst-case O(1) by using skew binary instead of binary (in fact I am almost sure it's possible), but then I think we'd lose the efficient merge. I'll leave exploring that for another day.

To get randomness, we’ll write a very simple class that encapsulates only what we need:
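-- class and method names invented for this sketch
class Functor f => Sampler f where
  -- pick a Word uniformly from the inclusive range [1 .. w]
  uniform :: Word -> f Word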

You can later instantiate this to whatever random monad you end up using. (The same approach was taken in the paper, although we only require Functor here, not Monad).

Sampling (with replacement) first randomly chooses a tree from the top-level list, and then we drill down into that tree with binary search.

So we’re off to a good start, but remove is a complex operation. We take the same route taken in the paper: first, we perform an “uncons�-like operation, which pops out the last inserted element. Then, we randomly choose a point in the tree (using the same logic as in sample), and replace it with the popped element1.

Merge is the same as on binomial heaps:

Finger Trees

Again, the cleverness of all the tree folds is that they intelligently batch summarizing operations, allowing you to efficiently do prefix-scan-like operations that exploit sharing.

The bare-bones version just uses binary numbers: you can upgrade the cons operation to worst-case constant-time if you use skew binary. Are there other optimizations? Yes! What if we wanted to stick something on to the other end, for instance? What if we wanted to reverse?

If you figure out a way to do all these optimizations, and put them into one big data structure, you get the mother-of-all “batching” data structures: the finger tree. This is the basis for Haskell's Data.Sequence, but it can also implement priority queues, urns (I'd imagine), Fenwick-tree-like structures, and more.

Uses and Further Work

First and foremost, I should test the above implementations! I’m pretty confident the asymptotics are correct, but I’m certain the implementations have bugs.

The efficient merge is intriguing: it means that Urn could conceivably be Alternative, MonadPlus, etc. I have yet to see a use for that, but it's interesting nonetheless! I'm constantly looking for a way to express something like Dijkstra's algorithm algebraically, using the usual Alternative combinators; I don't know if this is related.

The other interesting point is that, for this to be an instance of Applicative, it would need some analogue for multiplication for the weights. I’m not sure what that should be.

This is inherently max-priority. It’s not obvious how to translate what we have into a min-priority queue version.

Finally, it might be worth trying out different priority queues (a pairing heap is very similar in structure to this). Also, we could rearrange the weights so that larger ones are higher in each tree: this might give a performance boost.

Lampropoulos, Leonidas, Antal Spector-Zabusky, and Kenneth Foner. 2017. “Ode on a Random Urn (Functional Pearl).” 26–37. ACM Press. doi:10.1145/3122955.3122959. https://www.cis.upenn.edu/~llamp/pdf/urns.pdf.

Okasaki, Chris. 1997. “Three Algorithms on Braun Trees.” Journal of Functional Programming 7 (6) (November): 661–666. doi:10.1017/S0956796897002876. https://www.eecs.northwestern.edu/~robby/courses/395-495-2013-fall/three-algorithms-on-braun-trees.pdf.


  1. There’s one extra step I haven’t mentioned: we also must allow the first element (the last inserted) to be chosen, so we run the random-number generator once to check if that’s the element we want to choose.↩

by Donnacha Oisín Kidney at January 15, 2019 12:00 AM