Planet Haskell

April 20, 2014

Roman Cheplyaka

Setting up Samsung Wireless Printer on Linux

Here’s a complete guide for setting up a wireless Samsung printer on Linux, where by “setting up” I mean making it connect to your wireless network.

It worked for me with Samsung ML-2165W on Debian GNU/Linux «jessie», but should work for other models and distributions, too.

Connecting Samsung printer to a wireless network

  1. Create a new, temporary user. We’ll use it to launch Samsung’s wireless setup utility. This is optional, but it provides an additional layer of security (who knows what those utilities from Samsung do behind the scenes) and ensures that nothing will mess with your system.

    We add the new user to the lp group, so that it can talk to the printer.

    user$ sudo useradd --create-home --shell /bin/bash --groups lp samsung
  2. Allow the new user to use our display. (Samsung’s utility is graphical.)

    user$ xhost +local:samsung
  3. Now, time to switch to our temporary user.

    user$ sudo su - samsung
  4. Download Samsung’s PSU (“Printer Settings Utility”) archive from their website. Unpack it and go to the wirelesssetup directory.

    samsung$ wget http://downloadcenter.samsung.com/content/DR/201110/20111019151150392/PSU_1.01.tar.gz
    samsung$ tar xzf PSU_1.01.tar.gz
    samsung$ cd cdroot/Linux/wirelesssetup
  5. Check if there are any missing dynamic libraries:

    samsung$ ldd bin/wirelesssetup  | grep 'not found'

    (Note: this is for a 32-bit system. On a 64-bit system, replace bin with bin64.)

    In my case, the output was

    libnetsnmp.so.10 => not found

    This particular library is included in the PSU archive, so we load it by

    samsung$ export LD_PRELOAD=$PWD/../psu/share/lib/libnetsnmp.so.10.0.2

    (Likewise, replace lib with lib64 on a 64-bit system.)

    If there are more missing libraries, first see if your distribution ships them. The major versions must match! E.g. Debian jessie ships libnetsnmp.so.30.0.2, which has the major version number 30, so that won’t do.

    If your distribution doesn’t have the right version, use a resource like http://rpm.pbone.net/ to find a package that has one. Unpack it (do not install!) and set LD_PRELOAD and/or LD_LIBRARY_PATH so that they are found.

  6. Now connect the printer via a USB cable to the Linux machine and run

    samsung$ bin/wirelesssetup /dev/usb/lp0

    A graphical window should appear, where you’ll be able to choose your wireless network and enter the password to it.

  7. Once you've made the printer connect to the wireless network, you can log out and remove the temporary user. Note that the command below will remove that user's home directory.

    user$ sudo userdel --remove samsung

Troubleshooting

Please do not ask me about the problems you may have with your printer. Either try to solve them yourself, or use the usual venues (forums, mailing lists, Samsung support etc.) to ask for help.

However, if you solved your problems, and they were related to the instructions above, please do contact me so that I can fix/update the instructions.

If this article helped you and you want to say thanks, that’s fine, too :-)

April 20, 2014 09:00 PM

Edward Kmett

CUFP 2014 Call For Presentations


Workshop for
Commercial Users of Functional Programming 2014
Sponsored by SIGPLAN
[CUFP 2014](http://cufp.org/conference)
Co-located with ICFP 2014
Gothenburg, Sweden
Sep 4-6
Talk Proposal Submission Deadline: 27 June 2014

CUFP 2014 Presentation Submission Form

The annual CUFP workshop is a place where people can see how others are using functional programming to solve real world problems; where practitioners meet and collaborate; where language designers and users can share ideas about the future of their favorite language; and where one can learn practical techniques and approaches for putting functional programming to work.

Giving a CUFP Talk

If you have experience using functional languages in a practical setting, we invite you to submit a proposal to give a talk at the workshop. We're looking for two kinds of talks:

Experience reports are typically 25 minutes long, and aim to inform participants about how functional programming plays out in real-world applications, focusing especially on lessons learned and insights gained. Experience reports don't need to be highly technical; reflections on the commercial, management, or software engineering aspects are, if anything, more important.

Technical talks are also 25 minutes long, and should focus on teaching the audience something about a particular technique or methodology, from the point of view of someone who has seen it play out in practice. These talks could cover anything from techniques for building functional concurrent applications, to managing dynamic reconfigurations, to design recipes for using types effectively in large-scale applications. While these talks will often be based on a particular language, they should be accessible to a broad range of programmers.

We strongly encourage submissions from people in communities that are underrepresented in functional programming, including but not limited to women; people of color; people in gender, sexual and romantic minorities; people with disabilities; people residing in Asia, Africa, or Latin America; and people who have never presented at a conference before. We recognize that inclusion is an important part of our mission to promote functional programming. So that CUFP can be a safe environment in which participants openly exchange ideas, we abide by the SIGPLAN Conference Anti-Harassment Policy.

If you are interested in offering a talk, or nominating someone to do
so, please submit your presentation before 27 June 2014 via the

CUFP 2014 Presentation Submission Form

You do not need to submit a paper, just a short proposal for your talk! There will be a short scribe's report of the presentations and discussions but not of the details of individual talks, as the meeting is intended to be more a discussion forum than a technical interchange.

Nevertheless, presentations will be video taped and presenters will be expected to sign an ACM copyright release form.

Note that we will need all presenters to register for the CUFP workshop and travel to Gothenburg at their own expense.

Program Committee

More information

For more information on CUFP, including videos of presentations from
previous years, take a look at the CUFP website at cufp.org. Note that presenters, like other attendees, will need to register for the event. Presentations will be video taped and presenters will be expected to sign an ACM copyright release form. Acceptance and rejection letters will be sent out by July 16th.

Guidance on giving a great CUFP talk

Focus on the interesting bits: Think about what will distinguish your talk, and what will engage the audience, and focus there. There are a number of places to look for those interesting bits.

  • Setting: FP is pretty well established in some areas, including formal verification, financial processing and server-side web services. An unusual setting can be a source of interest. If you're deploying FP-based mobile UIs or building servers on oil rigs, then the challenges of that scenario are worth focusing on. Did FP help or hinder in adapting to the setting?
  • Technology: The CUFP audience is hungry to learn about how FP techniques work in practice. What design patterns have you applied, and to what areas? Did you use functional reactive programming for user interfaces, or DSLs for playing chess, or fault-tolerant actors for large scale geological data processing? Teach us something about the techniques you used, and why we should consider using them ourselves.
  • Getting things done: How did you deal with large software development in the absence of the myriad pre-existing support tools that are often expected in larger commercial environments (IDEs, coverage tools, debuggers, profilers) and without larger, proven bodies of libraries? Did you hit any brick walls that required support from the community?
  • Don't just be a cheerleader: It's easy to write a rah-rah talk about how well FP worked for you, but CUFP is more interesting when the talks also spend time on what _doesn't_ work. Even when the results were all great, you should spend more time on the challenges along the way than on the parts that went smoothly.

by Edward Kmett at April 20, 2014 07:13 PM

Johan Tibell

Announcing cabal 1.20

On behalf of all cabal contributors, I'm proud to announce cabal 1.20. This is quite a big release, with 404 commits since 1.18. To install:

cabal update
cabal install Cabal-1.20.0.0 cabal-install-1.20.0.0

New features

Since there are 404 commits since cabal 1.18, there are too many changes to give all of them a just treatment here. I've cherry-picked some that I thought you would find interesting.

To see the full list of commits, run:

git log Cabal-v1.18.1.3..Cabal-v1.20.0.0

in the cabal repo.

Dependency freezing

If you're building an application that you're delivering to customers, either as binary or as something that runs on your servers, you want to make sure unwanted changes don't sneak in between releases.

For example, say you've found a bug in the just released 1.0.0.0 version and you want to release 1.0.0.1, which contains a fix. If you build the binary on a different machine than you built the release on, you risk building it against a slightly different set of dependencies, which means that you're introducing untested code into your application, possibly causing new bugs.

cabal freeze lets developers of applications freeze the set of dependencies used to make builds reproducible. cabal freeze computes a valid set of package versions, just like cabal install does, and stores the result in cabal.config. You can then check cabal.config into your source control system to make sure that all developers that work on the application build against the exact same set of dependencies.

Here's the contents of the cabal.config file created by running cabal freeze in the cabal-install repo:

constraints: ...
             HTTP ==4000.2.8,
             array ==0.4.0.1,
             base ==4.6.0.1,
             bytestring ==0.10.0.2,
             ...

If you later want to update the set of dependencies either:

  • manually edit cabal.config or
  • delete (parts of) cabal.config and re-run cabal freeze.

Note that you should only use cabal freeze on applications you develop in-house, not on packages you intend to upload to Hackage.

Parallel cabal build

Starting with 7.8, GHC now accepts a -j flag to allow using multiple CPU cores to build a single package. This build performance enhancement is now exposed as a -j flag on cabal build (and also cabal test/bench/run). Build time improvements are modest, but free.

Flag to temporarily ignore upper bounds

When a new major version of a package P is released, it usually takes a little while for packages that depend on P to update their version bounds to allow for the new version. This can be frustrating if you just want to test whether your own package, which depends on both some of these other packages and on P, builds with the new P.

The --allow-newer=P flag tells the dependency solver to ignore all upper version bounds on P, allowing you to try to compile all packages against this newer version.

Unnecessary re-linking avoidance

Before 1.20, cabal would sometimes re-link build targets that hadn't changed. For example, if you had several test suites that tested different parts of your library, every test suite would be re-linked when you ran cabal test, even if no source file that the test suite depends on was changed. The same thing would happen for executables and benchmarks.

Now cabal doesn't re-link executables (of any kind) unless something changed.

Streaming cabal test output

cabal test can now stream its output to stdout, making it easier to see progress for long-running tests. Streaming output is enabled by passing --show-details=streaming to cabal test and is off by default (for now.)

New cabal exec command

Cabal sandboxes have almost completely replaced previous sandbox implementations. There was one use case that wasn't completely captured by the integrated sandbox implementation, namely starting new shells where the environment was set up to automatically use the sandbox for all GHC commands.

cabal exec allows you to launch any binary in an environment where the sandbox package DB is used by default. In particular you can launch a new shell using cabal exec [ba]sh.

Haddock configuration options

Haddock options can now be set in ~/.cabal/config. Here are the options and their default values:

haddock
     -- keep-temp-files: False
     -- hoogle: False
     -- html: False
     -- html-location:
     -- executables: False
     -- tests: False
     -- benchmarks: False
     -- all:
     -- internal: False
     -- css:
     -- hyperlink-source: False
     -- hscolour-css:
     -- contents-location:

How to use cabal

While not strictly related to this release, I thought I'd share how we expect users to use cabal. Using cabal this way makes sure that the features work well for you, now and in the future.

The main message is this: to build a package, use build, not install.

Building packages using cabal install comes from a time when

  • installing dependencies was more difficult,
  • depending on non-published packages was difficult (i.e. no sandbox add-source), and
  • using the other commands required manual configure-ing.

My intention is to remove the need for install from the development workflow altogether. Today the recommended way to build a package is to run this once:

cabal sandbox init
cabal install --only-dep  # optionally: --enable-tests

The development workflow is then just

cabal build/test/bench/run

No configuring (or manual rebuilding) needed. build implies configure and test/bench/run imply build.

Soon build will also imply the above dependency installation, when running in a sandbox.

Credits

Here are the contributors for this release, ordered by number of commits:

  • Mikhail Glushenkov
  • Johan Tibell
  • Duncan Coutts
  • Thomas Tuegel
  • Ian D. Bollinger
  • Ben Armston
  • Niklas Hambüchen
  • Daniel Trstenjak
  • Tuncer Ayaz
  • Herbert Valerio Riedel
  • Tillmann Rendel
  • Liyang HU
  • Dominic Steinitz
  • Brent Yorgey
  • Ricky Elrod
  • Geoff Nixon
  • Florian Hartwig
  • Bob Ippolito
  • Björn Peemöller
  • Wojciech Danilo
  • Sergei Trofimovich
  • Ryan Trinkle
  • Ryan Newton
  • Roman Cheplyaka
  • Peter Selinger
  • Niklas Haas
  • Nikita Karetnikov
  • Nick Smallbone
  • Mike Craig
  • Markus Pfeiffer
  • Luke Iannini
  • Luite Stegeman
  • John Lato
  • Jens Petersen
  • Jason Dagit
  • Gabor Pali
  • Daniel Velkov
  • Ben Millwood

Apologies if someone was left out. Once in a while we have to make commits on behalf of an author. Those authors are not captured above.

by Johan Tibell (noreply@blogger.com) at April 20, 2014 11:36 AM

Douglas M. Auclair (geophf)

'M' is for burrito. No, really! (Monads, a burrito-esque introduction)


'M' is for Burrito. No, really.
(A little post about monads)

This one's easy.

A monad is a burrito: a nice, soft-shell wrapper hiding all that yummy, but complicated stuff inside (green peppers and onions? who was the first one to think of that?)

A monad is a burrito, wrapping all that complicated stuff like an integer:

Just 3

under its soft-shell wrapper:

[1,2,3]

I just showed you two monads: the first one is the 'Maybe' monad, representing semideterminism (we're certain, in this case, that the underlying unit is 'Just' 3, but we could, in a different situation, be equally certain that the underlying unit was 'Nothing'. Those two cases (two = semi) give us our certainty (determinism) one way or the other); the second is the 'List' monad, representing nondeterminism (you could have the empty list [] (no answers, no = non), or, in the above case, a list with a set of different answers (one or several)).

So, monad = burrito. Simple, right?

Class dismissed.

"Excuse me, Professor geophf," the wormy student in the back of the class wearing b-c glasses and who always has to ask annoying questions in his whiny voice at the very end of class, making everybody late for their next class and me late for my latte, and you don't want to make me late for my latte! He continued, oblivious: "What about monads that don't contain anything at all? What is that? An empty burrito?"

He giggled at his own joke, thinking himself rather smart and drôle.

"What?" I roared, "An empty burrito? Such sacrilege cannot be! Besides, what's the point of an empty monad? That's just ridiculous!"

I was rather peeved that this young upstart would trespass on the beauty of my burrito-analogue. It was simple and easy to understand. Why mess with it?

"But," he continued in his whiny voice, "what about the empty list and 'Nothing' in the Maybe monad. They carry no value but are essential to each of those monads, how do you explain these monadic values as burritos? Are they churros?"

I sputtered in fury. He had me there.

So I did what every tyrant (er, teacher) does: I relied on my position of authority.

"No, monads are burritos! Get that through your head!" And then: "Class dismissed!"

Everybody fled. I do cut quite the imposing figure when I have that angry glint in my eye and my flinty face set in stone.

Okay, so monads aren't burritos. I admit it. (But not to Mr. Wormwood. Never, never!) There is nothing that says a monad has to carry a value, it's just a convenience to think of monads like that: a wrapper around an 'ordinary' object. But monads are objects, themselves, and, in fact, more 'ordinary,' that is: regular, than plain-old objects like integers.

The problem of plain-old objects, the problem that monads address, each in their own way (and there are many species of monads, not just maybes and lists, but eithers, and states and ... stuff! Lots of wild and weird different stuff!), is that plain-old objects admit the uninhabited instance, ⊥ ('bottom'), in their type. So the type of natural numbers is 0, 1, 2, 3, ... but it's also ⊥. Programming languages that realize this (Haskell, for example) deal with it gracefully. Programming languages that realize this, but then do nothing about it, have a systemic problem: the program goes into an unstable state when it encounters an uninhabited instance.

Something that causes Algol-child language programmers to shake in their booties: the 'null-pointer exception.'

Question: how can Java throw a 'NullPointerException,' when it doesn't have the pointer construct in its language, at all? Am I the first to call this one out?

So, in these languages, which have ⊥ but do not handle it properly, you must always admit that you're not really working with functions at all, for:

x + y

is x plus y, but if either is uninhabited, the answer is KABOOM!

A program crash. You just destroyed your entire system. And you didn't even do it: the language designers did.

Monads were an answer to this problem. A little dude with a long, bushy beard by the name of Kleisli developed the work around what came to be known as the Kleisli categories, which have monads as their objects and (what came to be known as) Kleisli arrows as their morphisms, or functions.

The guarantee of the Kleisli category was that, indeed, every object is a monad, so there's no uninhabited type. This means Kleisli functions always get you back into a Kleisli category (yes: once you enter the monadic domain, you never leave it. That's the guarantee of a Kleisli category: you have to be a monad (or a monadic function, or a monad category ...) to get in, and no monads exist outside the Kleisli categories. The monadic domain is the Hotel California of categories).

Okay, you're stuck. And the guarantee of the monad is that you don't care what's under the soft-shell, there's no 'getting a value out' of a monad, because, as my annoying student, Mr. Wormwood, pointed out, there are monads with no underlying values. So what do you do with them?

Monads have a special function associated with them, called the 'join' operator.

Monads, you see, are a special type in that a join of a monad of a monad is just a monad, so, for example,

join (Just (Just 3))

gives you 

Just 3

What does that do for us?

Well, the structure of a monadic function is this:

f :: Monad m ⇒ a → m b

That is, the function takes an ordinary type a and 'lifts' it into the monadic category, giving the (monadic) answer of type m b (m is the monadic type and b is the type carried by the monad).

Okay, but monadic functions can't dig out the underlying type in a monad of its category, so how do even the functions work at all?

Just as you can lift an unit type into a monad:

return 3 = Just 3

You can also lift an ordinary function into the monadic domain:

liftM succ :: Monad m ⇒ m Int → m Int (for some monadic type m)

Okay, so, but where does that even get us? We're still stuck, right?

Well, the thing is, you can even lift monadic functions into the monadic-monadic domain:

liftM f (of type Monad m ⇒ a → m b) = f' :: Monad m ⇒ m a → m (m b)

What does that get us?

Well, combining join with a lifted monadic function gets us a monad from monadic input:

join (liftM f (Just 3)) = some monadic answer dependent on the resultant type of f.
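
To make that concrete, here is a minimal sketch in the Maybe monad, using a made-up monadic function half (not from the original post):

import Control.Monad (join, liftM)

-- A hypothetical monadic function: halving only succeeds for even numbers.
half :: Int -> Maybe Int
half n = if even n then Just (n `div` 2) else Nothing

-- Lift the monadic function, apply it, then join the nested result.
example :: Maybe Int
example = join (liftM half (Just 4))   -- Just 2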

This whole lifting-into-the-monadic-domain-and-then-joining-the-result is so common when working in monad categories that a special operator was invented, à la Haskell, named '>>=' (pronounced: 'bind').

Now, with bind, the true power of monads comes out:

m >>= f >>= g >>= h >>= ... etc.

What you see here is a monad being bound, in turn, to the monadic functions f, g, h, etc.

So, okay, you've got a short-hand for monadically binding functions together. Great. Good on you, and have a latte.

The thing here is this.

In 'plain-old' programming, you have this ever-present fear that a NullPointerException is going to ruin your life, so you can't write:

obj.getA().getB().getC().thenDoSomething()

getting an A from obj, then getting a B from that A, then getting a C from that B, then you do something with that C.

I mean, you can write that code, but if any of those objects is null (obj, A, B, C), your whole statement explodes.

No, worse than that: your entire system blows up, and you have to deal with irate customers wondering why they got a 505-Service down error when they submitted their credit card information but forgot to enter their zip code, or some such.

So you have to wrap that statement in a try-block, and try it and see if it works, and if it doesn't you have this whole envelope, this whole burrito shell of code you have to write to handle the uninhabited case.

Now let's look at the monadic function binding again:

m >>= f >>= g >>= h >>= ... etc.

What happens if m is null, or the result of f m is ...

No. Wait. Monads don't have nulls. Monads are just monads, so the above binding ...

... okay, wait for it, ...

just.

works.

In fact, in the Kleisli category, it is guaranteed to work.

It. just. works.
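
Here is a minimal Haskell sketch of that chain, with hypothetical types and accessors standing in for obj, getA, getB and getC:

-- Hypothetical types; each accessor may fail, so it returns a Maybe.
data Obj = Obj { objA :: Maybe A }
data A   = A   { aB   :: Maybe B }
data B   = B   { bC   :: Maybe C }
data C   = C   { cVal :: Int }

-- A Nothing at any step simply falls through to the end:
-- no null checks, no try-blocks, no crash.
thenDoSomething :: Obj -> Maybe Int
thenDoSomething obj = objA obj >>= aB >>= bC >>= (Just . cVal)

Whether the chain yields Just a value or Nothing, the result is just a value; nothing explodes.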

For you, not a Java(script) programmer, you're like: "Well, duh, yeah, that's how it's supposed to be! 1 + 1 = 2, not 'error' or whatever."

You can say that, and expect that, in a pure mathematical domain (with monads implicit in the mathematics for you. You're welcome.) But a Java programmer, with any experience, any hard-won experience should be (rightly) awed.

"Wait. You mean, it just works? How is that possible?"

Yeah. How is that possible? That Java code just works?

It doesn't. But I translate the Kleisli category down into a Java implementation, and then lift every Java object into that category, so it's not a Plain-old Java Object (POJO) any more but a monad in the Kleisli category, and I operate on it as such from then on.

Well, then, it just works.

Some people look at my monadic harness and say I'm unnecessarily complicating things.

I beg to differ.

"But, geophf, I just want the start date; all this stuff just complicates things!"

But what do you want the start date ... for? If you just want the start date, then

pojo.getStartDate()

will get you there, and you're done.

But if you want to base any logic off it? Is your start date the start of a work-flow? So if the start date is null, your system, that you unit tested with start date values, now crashes in production and you have no idea why, because you unit tested it, and the code looks good.

But that one null wrecks everything.

I don't have any nulls. So I lift my start date up into a monad, and start my work flow, and my work flow works, and if the start date is null, then the start date isn't lifted, the functions fall through, because they're not working with a monad, and I just go onto the next thing.

And I coded zero lines of code around that.

Your try-catch blocks looking for nulls everywhere ...

Well, what's more complicated now? What code works in production, regardless of dirty or bad data?

Monads are a guarantee, and, as burritos go, they are quite tasty. Give them a try!

p.s.:

Now there's the whole conversation around monads that carry no value intentionally throughout the entire monad. So, for example, if the monad type is 'Quark' then the monadic inhabited types are up, down, strange, charmed, top and bottom, and their values are ... nothing at all: their instance is the value, it has no underlying value it carries (unless you want to get into the guts of spin and charge ...), it's the particular monad instance, or the particular quark, we care about, not what's inside at all, something like monads as enumerated values, but ... eh. Monads-as-burritos is a limited view, it will trip you up, but for the most part it works. When you do get to the point of treating monads-as-monads, and not caring, anymore, about the underlying value, you can tap yourself on the shoulder with monadic-Nirvana. It's a really good feeling, and gives an entirely new understanding of monads and their utility. "Good to know," as it were, but monad-as-burrito works for the most part and is tasty, too, to boot.

by geophf (noreply@blogger.com) at April 20, 2014 08:33 AM

'O' is for ontology.


'O' is for Ontology.

What is an ontology? A knowledge-base? Sure, if that's simpler to grasp, but only insofar as 'knowledge-base' doesn't mean 'collection of facts' or 'data set.'

An ontology is more than that. But what, precisely, is an ontology?

Well, actually, there is a precise meaning to 'ontology.' And 'meaning,' itself, is central to ontology. Because what these data mean is what an ontology is about. It's not just a listing of facts; it's also the relationship between the facts in the ontology that makes it what it is. The data, the facts, of an ontology have meaning, not only intrinsically, but also explicitly, or: the meaning is usable, or can be processed, itself, as information, and used in the handling of the underlying information.

So:

Milk
Eggs
Bread
(Ho-hos)
Oranges
Lettuce

An ontology? Absolutely! You hand that to your husband, and he knows exactly what it is and he knows exactly how to use it. He even, helpfully, penciled in the missing item (ho-hos, just as a 'fer instance') onto your shopping list for you.

See?

Now, ontology, per se?  Not so much. But if you explicitly titled it "Shopping List," now you're talking!

Format it as XML or JSON or OWL and then your computer will do your shopping for you, just as well as your husband would.

Even better, as it won't add those pesky ho-hos your husband always 'helpfully' adds to your list for you.

Your computer does need arms and legs and artificial intelligence, but I'm sure Cyberdyne Systems will be happy to help you there ... and also hand the terminator that used to be your computer a fully automatic plasma rifle.

Whoopsie! Better give your husband the list, and live with the ho-hos ... Ho-hos are, relatively speaking, better than global thermonuclear war and the extinction of all mankind.

But I digress.

As always.

by geophf (noreply@blogger.com) at April 20, 2014 08:09 AM

'P' is for Predicate

Hey, all, 'P' day.

'P' should be for π, but that's redundant, right? since π (pronounced 'pee' in math circles) is just the letter 'p' in Greek.

And it's 3.14159265358979323846264338327950288419716939937510582097494459230

Yeah, ... I memorized that in high school. For real. People didn't believe me. Every time. So I'd recite it for them, after I wrote it down on a piece of paper so they could verify that I wasn't just spouting digits.

Yeah.

That.

So, but, this post is not about π.

No. Wait. Stop using 22/7 to represent π. That's, like, way off. Way. Use 355/113 instead. And thank Chuck Moore whenever you do.

Okay, done with that aside. Onto this post. This post is about the predicate.

So, what's a predicate?

A predicate is a statement of truth, that you arrive at from other predicates.

Yeah, math and logic are like that: a thing is a thing from other things just like it. Take 'number' for example, any number, x, is just all the previous numbers before x with some starter seed (usually zero or some other representation of null).

Well, that's what a predicate is.

First of all, a predicate is more than a proposition. A proposition is something that you've been seeing already in the alphabet-soup I've been writing, so the proposition for the type of the continuation function is:

(p → r) → r

So, this says, 'given a function that takes a p and gives an r, I can get you an r,' right? Remember that?

Well, a predicate is of this form:

p |- q

Or, if you're feeling your Prolog:

p :- q

A p is true, depending on whether q is true.

(Here we go again with dependent types).

So, what's the difference? I mean, if you just reverse the arrows you could say you have:

q → p

And you could say that, but there is a difference, in that the propositional logic of

p → r

Is all very deterministic whereas the predicate logic of

p :- q

is all very non-deterministic. Predicate logic allows you to base the truth of your consequence on none, one, or several conditions.

So, you could have:

p :- q, r, s.

Which means that p is true if q and r and s are all true, as well.

These are universals, but you can get specific with it, too:

p(x) :- q(x, y), r(y, z).

Which says that there is some x in p, giving a truth, dependent upon the same x in q (associated with some y) being true, along with the same y in r being true with some z.

And, to prove that statement you have to find the (x, y, z) values that satisfy the conditions of that statement. And you keep looking until you find it.
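
For the Haskell-inclined, here is a sketch of that search in the list monad, with made-up fact tables standing in for q and r:

import Control.Monad (guard)

-- Hypothetical fact tables standing in for q(x, y) and r(y, z).
qFacts, rFacts :: [(Int, Int)]
qFacts = [(1, 2), (2, 4), (3, 6)]
rFacts = [(2, 5), (4, 7)]

-- All x satisfying p(x) :- q(x, y), r(y, z): try every combination,
-- keeping only those where the shared y matches.
p :: [Int]
p = do
  (x, y)  <- qFacts
  (y', _) <- rFacts
  guard (y == y')
  return x
-- p == [1, 2]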

What does that sound like? ... We'll get back to that in a moment.

The neater thing about predicate logic is that the clauses of a predicate can, themselves, be consequences of other predicates:

q(x, y) :- s(y), t(x)
r(a, b) :- t(a), w(b)

And predicates can have multiple statements to arrive at that truth:

q(1, 2).
q(2, 4).
q(3, 6).

The above is saying: "q(1, 2) is true, regardless of anything." And you can take that truth (that fact) and plug it in anywhere. The fact propagates throughout the system.

So, those three facts, along with the relation, q(x, y) :- s(y), t(x), form a predicate. A statement of truth throughout the system, covering all possibilities.

A predicate is a program.

Yeah. Wow.

In Java or Haskell or Ruby a program is thousands, tens of thousands of lines, but in a logic programming language, like Prolog, for example, a program could be one line or several ... more than a few lines, actually, and you're going off the deep end. And how you program in Prolog is that you string these predicates, these programs, together to create a Universe of related truths, an ontology, in which you can ... play, by which I mean, you can query the system, get an answer (if there is one) or get back several answers, and explore this Universe you've created, or that someone has created for you, as an inquiry into truth.

*Sigh* I miss coding in Prolog. When done neatly, it was so neat to use.

But I don't miss coding in Prolog. It was all dynamically typed, so you could put in anything for your inquiry, and it would try it out, looking for the string "Mama meetzaballs!" even though this particular predicate is about arithmoquines. And I don't miss it for the fact that statements of fact were easy to formulate, but the functional constructs ... ugh, they were terrible.

I need me a programming language that does logic typefully and handles functional logic beautifully, like Haskell does. Maybe ... Idris? ... but Idris doesn't work on my decade-old laptop. Birthday present for moi, then?

Hm.

by geophf (noreply@blogger.com) at April 20, 2014 08:08 AM

April 19, 2014

Douglas M. Auclair (geophf)

'Q' is for Quine

So, yesterday ('P' is for Predicate) I forgot to mention to whom we owe our gratitude, and that is Gottlob Frege. After Wittgenstein, Frege is considered to have made the most important contributions to philosophy through his formalization of predicate logic.

I mention this, in fronting (as opposed to 'in passing') because this post is about 'quining.' And the word 'Quine' comes from the American philosopher Willard 'Van' Quine.

Quining is a neat thing. A quine is a program that outputs itself, that is, its own source code.

That sounds simple enough, but there's a problem here. If you were using BASIC, for example, then you might start off with:

10 PRINT "10 PRINT \"10 PRINT \" ...

but then, whoops! We've just run into a problem, haven't we?

So how do we solve that? With a GOTO statement? Even though the "GOTO Statement Considered Harmful"?

Crusty BASIC programmer: HECK! I've used GOTO zillions of times, and it never smacked me once!

Yeah. Okay. Whatevs.

But, okay, let's give it a try:

10 PRINT "10 PRINT \""
20 GOTO 10

The problem here is that it prints:

10 PRINT "
10 PRINT "
10 PRINT "
... blah, blah, blah ...

but it never gets to the

20 GOTO 10

That solved a whole ton of nothing.

Writing quines in non-functional languages is a big-ole pile-o-pain! But totally doable, in tons of languages (and since languages have no weight, that's saying a lot)!

Here are the ones for BASIC (the first one is totally cheating, by the way) (or totally clever) (I'm not sure which).

Now, in functional programming languages, writing a quine is trivial.

Here's one in combinatory logic:

MM

because, as you recall,

M = λx → xx

So, MM is M applied to M: substituting M for x gives back MM, so MM evaluates to itself.

Piece of cake!
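
In Haskell itself, a complete quine is only a little longer. This is a classic formulation (save it as a one-line module, with a trailing newline and nothing else, and its output matches its source):

main = putStrLn (s ++ show s) where s = "main = putStrLn (s ++ show s) where s = "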

Now, since, usually, functional languages are in an evaluator, that gives instant feedback to what you just entered, for example:

3 + 4

returns

7

right away.

Well, then, simply evaluating a value that evaluates to itself is easy enough:

()

gives:

()

(the 'unit' value)

And

[]

gives:

[]

(the empty list)

And even:

7

gives:

7

... and we can skip all this 3 + 4 nonsense.

So, there's that.

Same thing for any programming language that has an evaluator, like Prolog, (SWI, for example, or interpreted on the web) (ooh, look at the cute little 'witch' example on the online Prolog interpreter!):

? X = 3

X = 3

So, there are those that might consider this cheating, too, like that BASIC program that uses a BASIC command to print itself as its own list. I say P-SHAW!

But also, as the quine is there to get you thinking about recursion and bootstrapping, maybe those blow-hards saying you can't use the evaluator to write a quine may be onto something.

Quine was a man who exercised his brain, and his work led to some serious reexamination of what it is to be true: that there may be things we consider 'obviously' true that should be examined again under the light of our experiential knowledge.

A little quine program just may do the same thing for me and my ideas.

p.s. Happy Easter!

by geophf (noreply@blogger.com) at April 19, 2014 10:47 PM

Roman Cheplyaka

JSON validation combinators

Why write a custom JSON validator?

At Signal Vine we have a JSON-based language for describing text messaging campaigns. We may design a better surface syntax for this language in the future, but the current one gets the job done and there are certain existing systems that depend on it.

Anyway, the problem with this language is that it is too easy to make mistakes — including errors in JSON syntax, structural errors (plugging an array where an object is expected), or name errors (making a typo in a field name).

So the first thing I did was write a validator for our JSON format.

There are several projects of «JSON schemas» around, but there were many reasons against using them.

  1. I don’t know about the quality of the tools that support such schemas (i.e. the quality of error messages they generate), and the expressivity of the schemas themselves (whether they’d let us express the structure of our JSON documents and the constraints we’d like to enforce). So, though it may seem that using an existing solution is «free», it is not — I’d have to spend time learning and evaluating these existing solutions.

  2. I remember that we went through this in our team at Barclays, and eventually decided to create a custom JSON schema language, although I was not involved in the evaluation process, so can’t share the details.

  3. I was almost certain that no existing «generic JSON schema» solution can provide the power of a custom one. For instance, some of the JSON strings contain expressions in another in-house mini-language. Ideally, I’d like to parse those expressions while I am parsing the enclosing JSON structure, and give locations of possible errors as they appear in the JSON file.

  4. I’d need a parser for the language anyway. Maintaining a schema separately from the parser would mean one more thing to keep in sync and worry about.

I couldn’t use an existing JSON parsing library either. Of course, aeson was out of the question, being notorious for its poor error messages (since it’s based on attoparsec and optimized for speed). json, though, is based on parsec, so its error messages are better.

But there is a deeper reason why a JSON parsing library is inadequate for validation. All of the existing JSON libraries first parse into a generic JSON structure, and only then do they try to recognize the specific format and convert to a value of the target Haskell type.

Which means that during parsing, only JSON syntax errors will be detected, but not the other kinds of errors described above. Granted, they all can be detected sooner or later. But what differentiates sooner from later is that once we’re out of the parsing monad, we no longer have access to the position information (unless, of course, our JSON parsing library does extra work to store locations in the parsed JSON structure — which it typically doesn’t). And not having such position information severely impacts our ability to produce good error messages.

To summarize, in order to provide good diagnostics, it is important to parse exactly the language we expect (and not some superset thereof), and to perform all the checks in the parsing monad, where we have access to location information.

JSON parsing combinators

Even though I couldn’t re-use an existing JSON parser or schema, I still wanted my parser to be high-level, and ideally to resemble a JSON schema, just embedded in Haskell.

The rest of this article describes the JSON schema combinators I wrote for this purpose.

Strings

As I mentioned before, the json package uses parsec underneath, so I was able to reuse some basic definitions from there — most notably, p_string, which parses a JSON string. This is fortunate, because handling escape sequences is not straightforward, and I’d rather use a well-tested existing implementation.

string :: Parser String
string = {- copied from Text.JSON.Parsec -}

I introduced one other combinator, theString, which parses a given string:

theString :: String -> Parser ()
theString str = (<?> "\"" ++ str ++ "\"") $ try $ do
  str' <- string
  if str == str'
    then return ()
    else empty
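
The snippets below also rely on a tok helper for lexeme parsing, which this post doesn't show. Here is a minimal sketch of the assumed behaviour (the real definition may differ, e.g. it may skip leading rather than trailing whitespace):

-- Assumed helper: run a parser, then skip any whitespace that follows,
-- so the combinators below don't have to deal with spacing themselves.
tok :: Parser a -> Parser a
tok p = p <* spaces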

Objects

Objects are an interesting case because we know what set of fields to expect, but not the order in which they come (it may be arbitrary). Such syntax is known as a «permutation phrase», and can be parsed as described in the classical paper Parsing Permutation Phrases by Arthur Baars, Andres Löh and Doaitse Swierstra.

There are surprisingly many implementations of permutation parsing on hackage, including one in parsec itself. Most of them suffer from one or both of the following issues:

  1. they use custom combinators, which, despite being similar to Applicative and Alternative operators, have their quirks and require learning

  2. they don’t support permutation phrases with separators, which is obviously required to parse JSON objects. (The technique to parse permutation phrases with separators was described in the original paper, too.)

On the other hand, the action-permutations library by Ross Paterson addresses both of these issues. It provides the familiar Applicative interface to combine permutation elements (or atoms, as it calls them), and includes the function runPermsSep to parse phrases with separators. The interface is also very generic, requiring the underlying functor to be just Alternative.

Below are the combinators for parsing JSON objects. field parses a single object field (or member, as it’s called in the JSON spec), using the supplied parser to parse the field’s value. optField is similar, except it returns Nothing if the field is absent (in which case field would produce an error message). Finally, theField is a shortcut to parse a field with the fixed contents. It is useful when there’s a tag-like field identifying the type/class of the object, for instance

data Item
  = Book
      String -- writer
  | Song
      String -- composer
      String -- singer

item =
  (try . object $
    Book <$
    theField "type" "book" <*>
    field "writer" string)
  <|>
  (try . object $
    Song <$
    theField "type" "song" <*>
    field "composer" string <*>
    field "singer" string)

(Note: examples in this article have nothing to do with the JSON schema we actually use at Signal Vine.)

One thing to pay attention to is how field parsers (field, theField and optField) have a different type from the ordinary parsers. This makes it much easier to reason about what actually gets permuted.

object :: Perms Parser a -> Parser a
object fields = (<?> "JSON object") $
  between (tok (char '{')) (tok (char '}')) $
    runPermsSep (tok (char ',')) fields
  where
    field key value = (,) <$> (theString key <* tok (char ':')) <*> value

-- common function used by field and optField
field' 
  :: String -- key name
  -> Parser a -- value parser
  -> Parser a
field' key value = theString key *> tok (char ':') *> value

field
  :: String -- key name
  -> Parser a -- value parser
  -> Perms Parser a
field key value = atom $ field' key value

theField
  :: String -- key name
  -> String -- expected value
  -> Perms Parser ()
theField key value = () <$ field key (theString value)

optField
  :: String -- key name
  -> Parser a -- value parser
  -> Perms Parser (Maybe a)
optField key value = maybeAtom $ field' key value
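
As a quick usage sketch, here is how one might run the item parser from above. It assumes Parser is parsec's CharParser (as in Text.JSON.Parsec) and that the definitions above are in scope:

import Control.Applicative ((<*))
import Text.Parsec (ParseError, eof, parse)

demo :: Either ParseError Item
demo = parse (item <* eof) "<demo>"
         "{\"type\":\"book\",\"writer\":\"Leo Tolstoy\"}"
-- demo should evaluate to Right (Book "Leo Tolstoy")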

Aside: correct separator parsing

There was only one issue I ran into with action-permutations, and it is interesting enough that I decided to describe it here in more detail.

Consider, for example, the expression runPermsSep sep (f <$> atom a <*> atom b <*> atom c)

It would expand to

(flip ($) <$> a <*> (
  (sep *> (flip ($) <$> b <*>
  (sep *> (flip ($) <$> c <*>
    pure (\xc xb xa -> f xc xb xa)))))
  <|>
  (sep *> (flip ($) <$> c <*>
  (sep *> (flip ($) <$> b <*>
    pure (\xc xb xa -> f xb xc xa)))))
))
<|>
(flip ($) <$> b <*> (
  ...
))
<|>
(flip ($) <$> c <*> (
  ...
))

See the problem? Suppose the actual order of the atoms in the input stream is a, c, b. At the beginning the parser is lucky to enter the right branch (the one starting from flip ($) <$> a <*> ...) on the first guess. After that, it has two alternatives: b-then-c, or c-then-b. First it enters the b-then-c branch (i.e. the wrong one) and fails. However, it fails after having consumed some input (namely, the separator) — which in libraries like parsec and trifecta means that the other branch (the right one) won’t be considered.

We cannot even work around this outside of the library by using try, because we can’t insert it in the right place. E.g. wrapping the separator in try won’t work. The right place to insert try would be around the whole alternative

  (sep *> (flip ($) <$> b <*>
  (sep *> (flip ($) <$> c <*>
    pure (\xc xb xa -> f xc xb xa)))))

but this piece is generated by the library and, as a library user, we have no control over it.

The usage of try inside the library itself is unsatisfactory, too. Remember, the interface only assumes the Alternative instance, which has no notion of try. If we had to make it less generic by imposing a Parsing constraint, that would be really unfortunate.

Fortunately, once identified, this problem is not hard to fix properly — and no usage of try is required! All we need is to change runPermsSep so that it expands to the tree where separator parsing is factored out:

(flip ($) <$> a <*> sep *> (
  (flip ($) <$> b <*> sep *>
  (flip ($) <$> c <*>
    pure (\xc xb xa -> f xc xb xa)))
  <|>
  (flip ($) <$> c <*> sep *>
  (flip ($) <$> b <*>
    pure (\xc xb xa -> f xb xc xa)))
))
<|>
(flip ($) <$> b <*> sep *> (
  ...
))
<|>
(flip ($) <$> c <*> sep *> (
  ...
))

Now, all alternatives start with atoms, so we have full control over whether they consume any input.

Mathematically, this demonstrates that <*> does not distribute over <|> for some backtracking parsers. Note that such distributive property is not required by the Alternative class.

Even for parser monads that allow backtracking by default (attoparsec, polyparse) and for which there’s no semantic difference between the two versions, this change improves efficiency by sharing separator parsing across branches.

My patch fixing the issue has been incorporated into the version 0.0.0.1 of action-permutations.

Arrays

Arrays should be easier to parse than objects in that the order of the elements is fixed. Still, we need to handle separators (commas) between array elements.

If we interpreted arrays as lists, then the schema combinator for arrays might look like

array
  :: Parser a -- parser for a single element
  -> Parser [a] -- parser for the array

Implementation would be straightforward, too.

However, in our JSON schema we use arrays as tuples rather than lists. That is, we typically expect an array of a fixed number of heterogeneous elements. Thus we’d like to combine these tuple elements into a single parser using the applicative interface.

Let’s say we expect a 2-tuple of a string (a person’s name) and an object (that person’s address).

data Address -- = ...
data Person =
  Person
    String -- name
    Address

Written by hand, the parser may look like

address :: Parser Address
address = _

personAddress = 
  between (tok (char '[')) (tok (char ']')) $
    Person <$> string <* sep <*> address
  where sep = tok $ char ','

It makes sense to move brackets parsing to the array combinator:

array :: Parser a -> Parser a
array p = (<?> "JSON array") $
  between (tok (char '[')) (tok (char ']')) p

But what should we do with the commas? Manually interspersing elements with separators is error-prone and doesn’t correspond to my view of a high-level JSON schema description.

Inserting comma parsers automatically isn’t impossible — after all, it is done in the action-permutations package and we used it to parse object fields, which are comma-separated, too. But it cannot be as easy as adding a separator to every element, since there’s one fewer separator than there are elements. We somehow have to detect the last element and not expect a separator after it.

A nice and simple way to achieve this is with a free applicative functor. A free applicative functor will allow us to capture the whole applicative expression and postpone the decision on where to insert separator parsers until we can tell which element is the last one. In this case we’ll use Twan van Laarhoven’s free applicative, as implemented in the free package.

element :: Parser a -> Ap Parser a
element = liftAp

theArray :: Ap Parser a -> Parser a
theArray p = between (tok (char '[')) (tok (char ']')) $ go p
  where
    go :: Ap Parser a -> Parser a
    go (Ap p (Pure f)) = f <$> p
    go (Ap p1 pn) = flip id <$> p1 <* tok (char ',') <*> go pn

Like object fields, array elements have a special type which makes it clear which pieces exactly are comma-separated.

In fact, the applicative functor Perms is essentially the free applicative functor Ap plus branching.
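
With these combinators, the hand-written personAddress parser from above can be re-expressed as a one-liner; the commas between elements are now inserted automatically:

personAddress :: Parser Person
personAddress = theArray $ Person <$> element string <*> element address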

Optional array elements

Now comes the twist. Some of the array elements may be optional — in the same way as positional function arguments in some languages may be optional. Since the elements are positional, if one of them is omitted, all subsequent ones have to be omitted, too — otherwise we won’t be able to tell which one was omitted, exactly.

For that reason, all optional elements should come after all the non-optional ones; if not, then we’ve made a mistake while designing (or describing) our schema. Ideally, our solution should catch such mistakes, too.

So, how can the above solution be adapted to handle optional arguments?

Attempt #1:
optElement :: Parser a -> Ap Parser (Maybe a)
optElement p = element $ optional p

Here optional is a combinator defined in Control.Applicative as

optional v = Just <$> v <|> pure Nothing

This won’t work at all, as it doesn’t give us any information about whether it’s an optional element or just a parser that happens to return a Maybe type.

Attempt #2:

Well, let’s just add a flag indicating whether the element was created using optElement, shall we?

data El a =
  El
    Bool -- is optional?
    (Parser a)

element :: Parser a -> Ap El a
element = liftAp . El False

optElement :: Parser a -> Ap El (Maybe a)
optElement = liftAp . El True . optional

Now we can check that optional arguments come after non-optional ones. If an element’s parse result is Nothing, we also know whether that element is an optional one, and whether we should stop trying to parse the subsequent elements.

Still, there are two related issues preventing this version from working:

  • How do we actually know when a parser returns Nothing? Once we lift a parser into the free applicative, its return type becomes existentially quantified, i.e. we should treat it as polymorphic and cannot assume it has form Maybe a (by pattern-matching on it), even if we can convince ourselves by looking at the Bool flag that it indeed must be of that form.

  • Similarly, once we’ve detected an absent optional element (assuming for a second that it is possible), we have to force all the remaining optional parsers to return Nothing without parsing anything. But again, we cannot convince the compiler that Nothing is an acceptable return value of those parsers.

Attempt #3:

So, we need certain run-time values (the optionality flag) to introduce type-level information (namely, that the parser’s return type has form Maybe a). That’s exactly what GADTs do!

data El a where
  El :: Parser a -> El a
  OptEl :: Parser a -> El (Maybe a)

El’s two constructors are a more powerful version of our old Bool flag. They let us see whether the element is optional, and if so, guarantee that its parser’s return type is Maybeish.

And here’s the code for the parsing functions:

element :: Parser a -> Ap El a
element = liftAp . El

optElement :: Parser a -> Ap El (Maybe a)
optElement = liftAp . OptEl

theArray :: Ap El a -> Parser a
theArray p = between (tok (char '[')) (tok (char ']')) $
  go True False False p
  where
    go :: Bool -> Bool -> Bool -> Ap El a -> Parser a
    go _ _ _ (Pure x) = pure x
    go isFirst optionalOccurred optionalOmitted (Ap el1 eln) =
      let
        eltSequenceError :: a
        eltSequenceError =
          error "theArray: a non-optional element after an optional one"

        !_check =
          case el1 of
            El {} | optionalOccurred -> eltSequenceError
            _ -> ()
      in
        if optionalOmitted
          then
            case el1 of
              El {} -> eltSequenceError
              OptEl {} -> go False True True eln <*> pure Nothing
          else do
            let
              sep = if isFirst then pure () else () <$ tok (char ',')
            case el1 of
              El p1 ->
                flip id <$ sep <*> p1 <*> go False False False eln
              OptEl p1 -> do
                r1 <- optional $ sep *> p1
                go False True (isNothing r1) eln <*> pure r1

theArray is a state machine with three pieces of state: isFirst, optionalOccurred and optionalOmitted. isFirst and optionalOmitted are used to guide actual parsing, while optionalOccurred is needed to check the proper arrangement of optional vs non-optional arguments.
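
As a usage sketch, here is a tuple whose second element may be omitted; if the address is absent, the parser simply yields Nothing for it:

personMaybeAddress :: Parser (String, Maybe Address)
personMaybeAddress = theArray $
  (,) <$> element string <*> optElement address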

Conclusions

Although the standard approach to JSON parsing is to parse into a generic JSON representation first, the article shows that an alternative approach — parsing the expected structure directly — is also viable and can be employed to improve error reporting.

Of course, the tricks and ideas described here are not specific to JSON. Understanding how they work and how to use them may become handy in a variety of parsing situations.

April 19, 2014 09:00 PM

Philip Wadler

Silicon Valley could force NSA reform, tomorrow. What's taking so long?

[Photo: CEOs from Yahoo to Dropbox and Microsoft to Zynga met at the White House, but are they just playing for the cameras? Photograph: Kevin Lamarque / Reuters]
Trevor Timm asks a key question in The Guardian:
The CEOs of the major tech companies came out of the gate swinging 10 months ago, complaining loudly about how NSA surveillance has been destroying privacy and ruining their business. They still are. Facebook founder Mark Zuckerberg recently called the US a "threat" to the Internet, and Eric Schmidt, chairman of Google, called some of the NSA tactics "outrageous" and potentially "illegal". They and their fellow Silicon Valley powerhouses – from Yahoo to Dropbox and Microsoft to Apple and more – formed a coalition calling for surveillance reform and had conversations with the White House.
But for all their talk, the public has come away empty handed. The USA Freedom Act, the only major new bill promising real reform, has been stalled in the Judiciary Committee. The House Intelligence bill may be worse than the status quo. Politico reported on Thursday that companies like Facebook are now "holding fire" on the hill when it comes to pushing for legislative reform.
...
We know it's worked before. Three years ago, when thousands of websites participated in an unprecedented response to internet censorship legislation, the Stop Online Piracy Act (Sopa), the public stopped a once-invincible bill in its tracks. If they really, truly wanted to do something about it, the online giants of Silicon Valley and beyond could design their systems so that even the companies themselves could not access their users' messages by making their texting and instant messaging clients end-to-end encrypted.
But the major internet outfits were noticeably absent from this year's similar grassroots protest – dubbed The Day We Fight Back – and refused to alter their websites à la Sopa. If they really believed the NSA was the threat so many of them have claimed, they'd have blacked out their websites in protest already.

by Philip Wadler (noreply@blogger.com) at April 19, 2014 08:59 AM

I Shall Vote No


Spotted on Bella Caledonia.
[After Christopher Logue, I Shall Vote Labour (1966)
By A.R. Frith
I shall vote No because, without Westminster, We’d never have got rid of the Poll Tax
I shall vote No because eight hundred thousand Scots live in England, and there are no jobs here to match their talents and meet their aspirations
I shall vote No, because my grandmother was a MacDougall
I shall vote No in case Shell and BP leave and take their oil with them
I shall vote No because otherwise we would have to give back the pandas
I shall vote No because I am feart
I shall vote No because the people who promised us a better deal if we voted No in 79, and warned us of the dire consequences of devolution in 97, tell us we should
I shall vote No so as not to let down my fellow socialists in Billericay and Basildon
and
I shall vote No, because if we got rid of Trident and stopped taking part in illegal wars we would be a target for terrorism
I shall vote No because if I lived under a government that listened to me and had policies I agreed with, I wouldn’t feel British
I shall vote No because the RAF will bomb our airports if we are a separate country
I shall vote No because to vote Yes dishonours the Dead of the Great War, who laid down their lives for the rights of small nations
I shall vote No, lest being cut off from England turns Red Leicester cheese and Lincolnshire sausages into unobtainable foreign delicacies, like croissants, or bananas
I shall vote No, because, as a progressive, I have more in common with Billy Bragg or Tariq Ali, who aren’t Scottish, than some toff like Lord Forsyth, who is.
I shall vote No, because the certainty of billions of pounds worth of spending cuts to come is preferable to the uncertainty of wealth
I shall vote No, because it is blindingly obvious that Scotland's voice at the UN, and other international bodies, will be much diminished if we are a member-state
I shall vote No because having a parliament with no real power, and another which is run by people we didn't vote for, is the best of both worlds
I shall vote No because I trust and admire Nick Clegg, who is promising us Federalism when the Liberals return to office
I shall vote No, because Emma Thompson would vote No, and her Dad did The Magic Roundabout
I shall vote No, because A.C. Grayling would vote No, and his Mum was born on Burns Night
I shall vote No because David Bowie asked Kate Moss to tell us to, and he lives in New York and used to be famous
I shall vote No, because nobody ever asks me what I think
I shall vote No, because a triple-A credit rating is vital in the modern world
I shall vote No because things are just fine as they are
I shall vote No because the English say they love us,
and that if we vote Yes, they will wreck our economy.

by Philip Wadler (noreply@blogger.com) at April 19, 2014 08:35 AM

April 17, 2014

Well-Typed.Com

Haskell gets static typing right

The following blog post originally appeared as a guest post on the Skills Matter blog. Well-Typed are regularly teaching both introductory and advanced Haskell courses at Skills Matter, and we will also be running a special course on Haskell’s type system.

Statically typed languages are often seen as a relic of the past – old and clunky. Looking at languages such as C and Java, we’re used to writing down a lot of information in a program that just declares certain variables to be of certain types. And what do we get in return? Not all that much. Yes, granted, some errors are caught at compile time. But the price is high: we’ve cluttered up the code with noisy declarations. Often, code has to be duplicated or written in a more complicated way, just to satisfy the type checker. And then, we still have a significant risk of run-time type errors, because type casting is common-place and can fail at unexpected moments.

So it isn’t a big surprise that dynamically typed languages are now very fashionable. They promise to achieve much more in less time, simply by getting rid of static type checking.

However, I want to argue that we shouldn’t be too keen on giving up the advantages of static types, and instead start using programming languages that get static typing right. Many functional languages such as Scala, F#, OCaml and in particular Haskell are examples of programming languages with strong static type systems that try not to get in the way, but instead guide and help the programmer.

In the rest of this post, I want to look at a few of the reasons why Haskell’s type system is so great. Note that some of the features I’m going to discuss are exclusive to Haskell, but most are not. I’m mainly using Haskell as a vehicle to advertise the virtues of good static type systems.

1. Type inference

Type inference makes the compiler apply common sense to your programs. You no longer have to declare the types of your variables, the compiler looks at how you use them and tries to determine what type they have. If any of the uses are inconsistent, then a type error is reported. This removes a lot of noise from your programs, and lets you focus on what’s important.

Of course, you are still allowed to provide explicit type signatures, and encouraged to do so in places where it makes sense, for example, when specifying the interface of your code.
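
As a tiny illustration (the names are made up for this post), consider two definitions given without any type signatures; GHC infers general types for them:

    double x = x + x        -- inferred: double :: Num a => a -> a
    swap (x, y) = (y, x)    -- inferred: swap :: (a, b) -> (b, a)

Asking ghci with :type confirms the inferred signatures.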

2. Code reuse

Nothing is more annoying than having to duplicate code. In the ancient days of statically typed programming, you had to write the same function several times if you wanted it to work for several types. These days, most languages have “generics” that allow you to abstract over type parameters. In Haskell, you just write a piece of code that works for several types, and type inference will tell you that it does, by inferring a type that is “polymorphic”. For example, write code that reverses all the elements of a data structure, and type inference will tell you that your code is independent of the type of elements of the data structure, so it’ll just work regardless of what element type you use. If you write code that sorts a data structure, type inference will figure out that all you require to know about the elements is that they admit an ordering.
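
As a sketch of what this looks like in practice (again with made-up names), neither definition below carries a signature, yet the first is inferred to work for any element type and the second picks up exactly the ordering constraint it needs:

    import Data.List (sortBy)

    -- Inferred type: rotate :: [a] -> [a]
    rotate []     = []
    rotate (x:xs) = xs ++ [x]

    -- Inferred type: sortDescending :: Ord a => [a] -> [a]
    sortDescending xs = sortBy (flip compare) xs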

3. No run-time type information by default

Haskell erases all type information after type checking. You may think that this is mainly a performance issue, but it’s much more than that. The absence of run-time type information means that code that’s polymorphic (i.e., type-agnostic, see above) cannot access certain values. This can be a powerful safety net. For example, just the type signature of a function can tell you that the function could reorder, delete or duplicate elements in a data structure, but not otherwise touch them, modify them or operate on them in any other way. Whereas in the beginning of this post I complained that bad static type systems don’t allow you to do what you want because they’re not powerful enough, here we can deliberately introduce restrictions to save us (as well as colleagues) from accidental mistakes. So polymorphism turns out to be much more than just a way to reduce code duplication.

By the way, Haskell gives you various degrees of selective run-time typing. If you really need it in places, you can explicitly attach run-time type information to values and then make type-based decisions. But you say where and when, making a conscious choice that you gain flexibility at the cost of safety.
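
Here is a small sketch of both sides (function names invented for the example): the first function is fully polymorphic and therefore cannot inspect its elements at all, while the second explicitly opts in to run-time type information via the Typeable machinery:

    import Data.Typeable (Typeable, cast)

    -- Polymorphic: may drop, duplicate or reorder elements,
    -- but can never look inside an 'a'.
    keepEveryOther :: [a] -> [a]
    keepEveryOther (x:_:rest) = x : keepEveryOther rest
    keepEveryOther xs         = xs

    -- Selective run-time typing: ask whether the value is a String.
    describeIfString :: Typeable a => a -> String
    describeIfString x =
      case cast x :: Maybe String of
        Just s  -> "a string: " ++ s
        Nothing -> "something else"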

4. Introducing new datatypes made easy

In Haskell, it’s a one-liner to define a new datatype with a new name that has the same run-time representation as an existing type, yet is treated as distinct by the type system. (This may sound trivial, but surprisingly many statically typed languages get it wrong.) So for example it’s easy to define lots of different types that are all integers internally: counters, years, quantities, … In Haskell, this simple feature is often used to define safe boundaries: a specific type for URLs, a specific type for SQL queries, a specific type for HTML documents, and so on. Each of these types then comes with specific functions to operate on them. All such operations guarantee that whenever you have a value of this type, it’s well-formed, and whenever you render a value of this type, it’s syntactically correct and properly escaped.
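
For example (with invented names), two such one-liners and a function that the type checker keeps honest:

    newtype URL      = URL String
    newtype SQLQuery = SQLQuery String

    -- Accepts only SQLQuery values; passing a URL here is a compile-time
    -- error, even though both wrap a plain String at run time.
    runQuery :: SQLQuery -> IO ()
    runQuery (SQLQuery q) = putStrLn ("pretending to run: " ++ q)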

5. Explicit effects

In virtually all programming languages, a function that performs some calculations on a few numbers and another function that performs the same calculations, but additionally sends a million spam emails to addresses all over the world, have exactly the same type, and therefore the same interface. Not so in Haskell. If a function writes to the screen, reads from the disk, sends messages over the network, accesses the system time, or makes use of any other so-called side effect, this is visible in its type. This has two advantages: first, it makes it much easier to rely on other people’s code. If you look at the interface and a function is effect-free, then you automatically know, for example, that it is also thread-safe. Second, the language facilitates a design where side effects are isolated into relatively small parts of the code base. This may seem difficult to achieve for highly stateful systems, but surprisingly, it usually is not: even interactive systems can usually be described as pure functions reacting to a series of requests with appropriate responses, and a separate driver that does the actual communication. Such separation makes it not only easier to test the program, but also facilitates the evolution of the program, for example to adapt it to run in a different environment. Haskell’s type system therefore encourages good design.
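
A minimal sketch of the difference (names invented for the example): the first type promises that no effects can happen, while the IO in the second makes the effect visible to every caller.

    -- Pure: same inputs, same result, no side effects possible.
    addTax :: Double -> Double -> Double
    addTax rate amount = amount * (1 + rate)

    -- Effectful: the IO in the type says this may talk to the outside world.
    reportTotal :: Double -> IO ()
    reportTotal total = putStrLn ("Total: " ++ show total)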

6. Types as a guide in program development

If you only ever see types as a source of errors, and therefore as enemies on your path of getting your program accepted, you’re doing them injustice. Types as provided by Haskell are an element of program design. If you give your program precise types and follow systematic design principles, your program almost writes itself. Programming with a strong type system is comparable to playing a puzzle game, where the type system removes many of the wrong choices and helpfully guides you to take the right path. This style of programming is supported by a new language extension called “Typed Holes” where you leave parts of your program unspecified during development, and obtain feedback from the development environment about what type has to go into the hole, and what program fragments you have available locally to construct a value of the desired type. Playing this kind of puzzle game is actually quite fun!
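
For instance, leaving a hole (written _) in an otherwise finished definition makes GHC report the type that is needed at that spot, along with relevant bindings in scope. A deliberately unfinished sketch:

    pairUp :: [a] -> [b] -> [(a, b)]
    pairUp xs ys = zipWith _ xs ys
    -- GHC reports a hole of type a -> b -> (a, b),
    -- which points us straight at the pairing function (,).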

7. Programming on the type level

Haskell’s type system provides many advanced features that you don’t usually have to know about, but that can help you if you want to ensure that some complicated invariants hold in your program. Scarily named concepts such as “higher-ranked polymorphism”, “generalized algebraic datatypes” and “type families” essentially provide you with a way to write programs that compute with types. The possibilities are nearly endless. From playful things such as writing a C-printf-style function where the first argument determines the number of arguments that are expected afterwards as well as their types, you can go on to code that provides useful guarantees such as that mutable references that are available within one thread of control are guaranteed not to be accessed in a completely different context, arrays that can adapt to different internal representations depending on what type of values they contain, working with lists that are guaranteed to be of a specific length, or with trees that are guaranteed to be balanced, or with heterogeneous lists (where each element can be of a different type) in a type-safe way. The goal is always to make illegal inputs impossible to construct. If they’re impossible to construct by the type system, you can isolate sanity tests at the boundary of your code, rather than having to do them over and over again. The good thing is that these features are mostly optional, and often hardly affect the interface of libraries. So as a user, you can benefit from libraries employing such features and having extra safety guarantees internally. As a library writer, you can choose whether you’re happy with the normal level of Haskell type safety (which is already rather a lot), or if you want to spend some extra effort and get even more.
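
To give just a taste of the flavour, here is a sketch of the length-indexed lists mentioned above, using the GADTs and DataKinds extensions; safeHead has no case for the empty vector, and the type checker knows that none is needed:

    {-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

    data Nat = Zero | Succ Nat

    data Vec (n :: Nat) a where
      Nil  :: Vec 'Zero a
      Cons :: a -> Vec n a -> Vec ('Succ n) a

    -- Only accepts vectors that are provably non-empty.
    safeHead :: Vec ('Succ n) a -> a
    safeHead (Cons x _) = x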

If my overview has tempted you and you now want to learn more about Haskell, you’re welcome to follow one of my introductory or advanced Haskell courses that I (together with my colleagues at Well-Typed) regularly teach at Skills Matter. These courses do not just focus on the type system of Haskell (although that’s a significant part). They introduce the entire language in a hands-on way with lots of examples and exercises, as well as providing guidelines on how to write idiomatic Haskell and how to follow good development practices.

If you already know some Haskell and are particularly interested in the advanced type system features mentioned in point 7, we also offer a new one-day course on Haskell’s type system that specifically focuses on how far you can push it.

by andres at April 17, 2014 04:04 PM

Douglas M. Auclair (geophf)

'N' is for No Nix Nada Nil Null Not Nothing


'N' is for got Nuttin'!

I was thinking of covering the universal combinator, the Nightingale, with this post. It's a combinator that gives you any other combinator in the combinatory logic:

N = λx -> (xS)K

or

N = λx -> xKSK

or an infinite many other combinations of combinators that when combined repeatedly reduce either to the S or the K combinators, because the SK-basis has been shown to be Turing-complete.

But then I thought: eh! Maybe you've read enough about combinators this month, or to date, at any rate, so there it is: the N-combinator, the Nightingale, when combined with only itself (in certain desired ways), can give you any computable form. Yay!

So, let's talk about nothing for a sec here.

... hm, hm, hm, ...

Nice talk?

The problem with nothing is that there's nothing to talk about ... about it. But people tend to talk quite a bit, and if you reduce what they've just been saying to you and to everyone else who will listen (and quite a few of them won't), then you find that they haven't been really talking about anything at all, and, insult to injury, even they would find their own words boring, obnoxious, offensive, thoughtless, careless, if you played them back to them and forced them to listen to every word that came out of their mouths.

Ever listen to yourself speaking? Educational experience. Sometimes, I stop and listen to myself. Most times, I find, it would be better if I shut myself up sooner rather than later.

geophf, shut up! I tell myself.

Sometimes I listen when I tell myself that.

It feels better when I shut myself up when I'm speaking and have nothing to say.

You know: THINK before you speak.



If it's not every one of those things, then why did I just open my mouth.

Here's another gauge.

Bad people talk about people.
Good people talk about things.
Great people talk about ideas and ideals.

What are you talking about?

Okay, but that's really 'something' that I've been talking about, and that is what people talk about, which is usually 'nothing good nor kind,' but it's still something, even if that something is vicious or inane or banal.

So let's talk about nothing.

The problem of nothing is that there's none of it. And here's why.

Get yourself to a pure vacuum. Is it inside a glass tube? No, because, actually, you've sucked most of the air out of it, but guess what? There's still quite a bit of light in there. That's not nothing. That's quite a bit of energy.

Drat! Bummer! So, turning off the light doesn't do anything for you. Visible light is gone but then you have infrared and ultraviolet (ooh! vampires!) — so you've got to go somewhere where there's no light, no light, on either end of the visible spectrum.

Okay, spaaaaace!

Quite a lot of dust out there, star dust, and solar winds.

Hm, anywhere where we can go where there's nothing?

How about next to a black hole?

Now, that's interesting.

Now, if we can find a place somewhere around the black hole where it's not sucking up mass nor jetting out X-rays (and, granted, that would be hard to find, ... how about a black hole away from any nearby galaxy in spaaaace! Aw, the poor lonely black hole!) (Nobody cares about black holes being all alone, they're just like: oh! black hole! black holes are evil! I hate you, black hole! But does anybody ever consider the black hole's feelings when they're making these pronouncements? Noooo! They just say these things to the black hole's face, and then wonder why black holes are always so mean to them!)

There's still a problem. It's called the Hawking radiation.

You see, even in a 'pure vacuum' ... it isn't. Nature abhors a vacuum, so what happens is that there's so much energy around a black hole that it creates quantum particles (the Higgs field?) sucking most of them back into itself, but some little quarks escape, at around the speed of light (quark: 'Imma getting me away from that there black hole! NAOW!') and so Hawking predicted, correctly, that even black holes emit radiation.

Bummer, dude! There doesn't seem to be anywhere where you can find yourself some nothing to be all sad and morose and all alone by yourself! You just have to buck up and accept the John Donne commandment: No quark is an island.

Or something like that.

BUT THEN! There's mathematics. You can invent a mathematical space where there's nothing in it, just it, and you (which makes it not nothing, but you get some quiet-time, finally, some alone time to catch up on your reading without having to worry about if the dishes were done).

What happens there? Besides nothing? What does it look like? Besides really, really empty?

Here's an interesting factoid (factoid, n.: geophf's interesting declarations that may or may not be somewhat related to some real reality), you're not the first visitor there.

There was this tall, thin dude in a white lab coat, by the name of Mach (no, not Mac the Knife, okay? Different Mac), and yes, the measure of velocity greater than the speed of sound was named after him (you know: mach 1, mach 2, mach 5+ for SR-71s), but besides that (and that's more than most people have ever done with their lives, but did he rest on his laurels? No. Besides, ... he still needed to earn his bread, so he could bake it, and then enjoy it with a latte) (lattes are important to physicists and mathematicians, don't you know.)

(But I digress.)

(As usual.)

Besides that, he did this thought-experiment. He created this mostly empty space, just him and the star Alpha Centauri in the distance. That was it in the entire Universe. And he spun himself, arms outstretched.

How did he know he was spinning? Easy, Alpha Centauri came into view, crossing it, then disappeared behind him as he spun, and he saw the star in an arc.

Next, he removed Alpha Centauri.

Now. Was he spinning?

He had no way to tell. You can tell you're spinning, because you get dizzy and then sick, but why? Because of gravity. Earth's gravity (mostly). But now there's not Earth.

There's not even Another Earth. So gravity is not now giving you a frame of reference, and you could tell you were spinning because there Alpha Centuri was in the distance, giving you your (only) point of reference, but now it's gone, too.

You could be moving a million miles per hour, you could be spinning and doing backward summersaults (or winter-saults for all you know: no seasons; no nothing. No nothing but nothing in your created Universe), and you'd have no frame of reference for you to make that determination.

Are you spinning in Mach's empty Universe?

The answer to that is: mu.

The answer to that question is that that's not a question to be asking. The answer to that question is that it has no sensible answer: whether you're spinning or not, you have no way to tell, or, no way to measure it. What's your coordinate system? How do you measure your speed or your spin? You don't, because if the XY-axis extends from in front of you, then it's always oriented to your front, no matter which way you face.

Anyway, where is time in all this? There's nothing. Space is bounded by the things in it, otherwise there's not space. 'Space' is emptiness, nothing there, but the space is defined by the no-thing between things.

There are no things in your nothing-Universe.

No space, no space-time. No space-time, no time. Spin needs time to measure.

If you had Alpha Centuri, then you would have light from four years ago shining on you, but now you have nothing, no light, no space, no time. No spin.

This post was about nothing, nothing at all. I hope you see when you get to that point where there is indeed nothing, no dust, no light, no nothing, you really, really do have nothing, not even space (the space between things: there are no things), not even time, not even spin.

Not even weight; weight needs gravity. You have no gravity in your nothing-Universe.

Enjoy the slim new-you, but the thing is, it'll be a lonely victory, so no bragging rights: nobody to brag to.

Okay, I'm going to go enjoy a latte with my busy, noisy family, and enjoy my not-nothing, not-solitude.

Nothing's not all cracked up to be what it's supposed to be.

by geophf (noreply@blogger.com) at April 17, 2014 06:10 AM

April 16, 2014

Jan Stolarek

Autocomplete command-line flags with GHC 7.8.2

GHC 7.8.2 was released just a few days ago [1]. This is the first official release of GHC that contains my contributions. Most of them are improvements in the code generator and are thus practically invisible to most users. But I also implemented one very nice feature that will be useful to the average Haskeller. GHC now has a --show-options flag that lists all command-line flags. This can be used to auto-complete command-line flags in shells that support such completion. To enable auto-completion in Bash, add this code snippet to your ~/.bashrc file:

# Autocomplete GHC commands
_ghc()
{
    local envs=`ghc --show-options`
    # get the word currently being completed
    local cur=${COMP_WORDS[$COMP_CWORD]}
 
    # the resulting completions should be put into this array
    COMPREPLY=( $( compgen -W "$envs" -- $cur ) )
}
complete -F _ghc -o default ghc

From my experience the first completion is a bit slow but once the flags are cached things work fast.

  1. Please ignore the 7.8.1 release. It shipped with a bug that caused some valid programs to be rejected.

by Jan Stolarek at April 16, 2014 02:48 PM

Philip Wadler

Team Scotland

What happens after Yes? It's not the SNP, it's the people. A great post on Bella Caledonia.
The UK chattering classes have been wondering what a real, mass grass-roots campaign might look like in modern, professionalised politics. Impotent is their usual conclusion. Well come on up and we’ll show you. The old feudal dance where councilor doths cap to MP, MP to Minister, Minister to Prime Minister and Prime Minister to corporate CEO may well continue apace even here in Scotland. But it’s not winning Scotland.

by Philip Wadler (noreply@blogger.com) at April 16, 2014 12:19 PM

Help ORG restart the debate about internet filters

The Open Rights Group is starting a campaign opposed to the default filtering now imposed by all providers in the UK---de facto censorship. You can fund it via IndieGogo.

by Philip Wadler (noreply@blogger.com) at April 16, 2014 11:59 AM

April 15, 2014

Silk

Writing admin interfaces with Fay using fay-builder

We are in the process of open sourcing a lot of the general purpose libraries we’ve built at Silk under the BSD3 license. We’re planning to write a series of blog posts introducing them so please stay tuned!

First off is one of the smaller packages, fay-builder (github).

Fay is both a proper subset of Haskell that compiles to JavaScript and the compiler for that subset. I’ve been helping out with the project since its first release in July 2012.

We have several Haskell backend services with admin interfaces written in HTML and JavaScript that communicate with the running server using REST APIs. It seemed like a good idea to write their admin interfaces in Haskell and we have been experimenting with using Fay for this. Since Fay supports Cabal packages we can use our usual tools and upload these packages to the public hackage or our private instance.

fay-builder lets you store options needed to compile Fay code along with an executable. This is done by using custom cabal flags that can be read either by Setup.hs (but see caveats below) or from within the application at startup, on request, or at some other point.

If you want to try it out you can install it by running cabal install fay-builder.

This post demonstrates how we build Haskell code for our admin interfaces.
Join our team if you enjoy this stuff too, we’re hiring!

The Cabal file

Here’s how you can define a Cabal file that uses fay-builder for a project:

name:                my-program
version:             0.1

Use a custom build type

build-type:          Custom

extra-source-files specifies files that should be included in a source distribution. The Fay files are added here so they don’t get lost when running cabal sdist.

extra-source-files: fay/*.hs

data-files are also included, with the addition that Cabal will generate a paths module that you can use to fetch the resources during runtime. We probably want to precompile the fay files so the program can just serve them.

data-files: my-program.cabal, js/*.js

x-foo specifies custom flags that Cabal itself ignores, but we can use it for extra configuration options.

x-fay-packages:      fay-jquery, fay-text
x-fay-root-modules:  Index, Misc
x-fay-include-paths: src
x-fay-output-dir:    data/js
x-fay-source-dir:    fay

executable my-program
  main-is:           Main.hs
  hs-source-dirs:    src
  default-language:  Haskell2010
  ghc-options:       -Wall

This is the Haskell module that Cabal generates to let us access data-files.

other-modules:     Paths_my_program

To make sure we can build the Fay code we add all the dependencies it has here along with the GHC libraries dependencies.

One thing to note is that we can’t depend on base and fay-base at the same time since they have clashing module names. But if we depend on some other Fay package we get a transitive dependency on fay-base, so we know it will be available.

build-depends:     base, Cabal, fay, fay-builder, fay-jquery, fay-text

We have a few options for when to build the Fay code.

Building as a Cabal hook

With a custom Setup.hs we can add a postBuildHook that will compile the Fay code after the executable has been compiled, but before an sdist is created. This is nice because it will generate the needed .js files before the tarball is created. The disadvantage is that Setup.hs cannot have explicit dependencies so you won’t be able to sdist your program unless you already have all dependencies installed.

It is possible to do this, and fay-builder includes a defaultFayHook that you can use in Setup.hs like this:

import Fay.Builder (defaultFayHook)
main = defaultFayHook

It’s pretty cool, but maybe not very useful because of the dependency issue. Just make sure to have fay-builder installed before running cabal configure.

Building inside the program

For one application I started out with the above Setup.hs approach, but soon realized that this would not work out since sdisting wasn’t possible in a clean package DB. I instead chose to add a --compile-fay flag that would compile all Fay sources when the process starts during development. The live setup won’t do this and instead reads the data-files.

The trick here is that we added the .cabal file as a data-file to let us access it from other modules.

import qualified Paths_my_program as Paths
import qualified Fay.Builder as Builder

compileFay :: IO ()
compileFay = do
  pkgDb       <- getPackageDb
  packageDesc <- Paths.getDataFileName "my-program.cabal"
                 >>= Builder.readPackageDescription
  Builder.build packageDesc pkgDb
    where
      -- Some clever way to fetch the current package DB
      -- if you are using a sandbox. As a simple placeholder we fall
      -- back to no extra package DB:
      getPackageDb :: IO (Maybe FilePath)
      getPackageDb = return Nothing

If we had added the settings elsewhere, such as in a custom configurator file like snaplet-fay does, we wouldn’t be able to read it from within Setup.hs so the cabal file might be the only place to put this configuration if we want both features at once.

One of the biggest advantages in using fay-builder is that all dependencies can be listed in one cabal file instead of having to split the client and server code into multiple packages (which would turn into three different packages if there’s shared code).

April 15, 2014 09:38 AM

April 14, 2014

FP Complete

The heartbleed bug and FP Haskell Center

If you haven't heard about the heartbleed bug and related security issues, this article provides a good explanation.

Applications developed in FPHC and deployed using the FP Application servers are not vulnerable to this bug.

Services we use from Amazon and GitHub were affected. SSL connections to our servers go through Amazon software, and we use SSL to connect to GitHub repositories. Both Amazon and GitHub have already updated their services to remove the vulnerability. FP Complete has followed GitHub's suggestions by changing our passwords and OAuth tokens. You can read those guidelines at github.

While we have no evidence that any sensitive information was leaked from FPHC, we recommend changing your password for FPHC as soon as possible, just in case.

Other measures to increase security never hurt. Things like using two-factor authentication on sites for which it is available, and using password locker software that will generate strong passwords unique to each site, will help prevent people breaking into your accounts. This event provides a good time to consider adding these extra security measures if you aren't already using them.

April 14, 2014 02:25 PM

Danny Gratzer

Continuations and Exceptions

Posted on April 14, 2014

Continuations are useful things. They provide a nice way to manually handle control flow. This fact makes them very useful for compilers: an often-used alternative to SSA is an intermediate language in which every function call is a tail call and every expression has been converted to continuation-passing style.

Often, however, this isn’t enough. In a language with exceptions, we don’t have just a single continuation, since every expression can do one of two things:

  1. Continue the rest of the program normally
  2. Throw an exception and run an alternative program, the exception handler

To represent this, we can imagine having two continuations. Instead of

    newtype Cont r a = Cont {runCont :: (a -> r) -> r}

We have

    {-# LANGUAGE DeriveFunctor #-}
    import Control.Monad
    
    newtype Throws r e a = Throws {runThrows :: (e -> r) -> (a -> r) -> r}
      deriving (Functor)

Now we have two continuations, where e -> r represents the composition of exception handlers.

We can write a trivial monad instance similar to Cont

    instance Monad (Throws r e) where
      return a = Throws $ \ex cont -> cont a
      (Throws c) >>= f = Throws $ \ ex cont ->
        c ex $ \a -> runThrows (f a) ex cont

So >>= maintains the exception handler between computations and otherwise acts exactly like Cont.

To actually take advantage of our exception handlers, we need two things: a throw- and catch-like pair of functions. Let’s start with throw since it’s the simpler of the two.

    throw :: e -> Throws r e a
    throw e = Throws $ \ex cont -> ex e

This is pretty straightforward: when we’re given an exception to throw, we simply feed it to our exception-handler continuation. Since we never use cont, we don’t care what type of value it expects, so we can universally quantify over a.

Next up is handle. We’ll represent an exception handler as a function e -> Maybe a. If we return Nothing, we can’t handle the exception at this level and we’ll just pass it to the existing exception handler.

So our handle is

    handle :: Throws r e a -> (e -> Maybe a) -> Throws r e a
    handle (Throws rest) handler = Throws $ \ex cont ->
      rest (\e -> maybe (ex e) cont (handler e)) cont

Notice the clever bit here: each handler actually contains both the success and failure continuations! If we can’t handle the exception we fail, otherwise we can resume exactly where we were before.

No post would be complete without a demonstration!

    data Ex = Boom | Kaboom | Splat String
            deriving Show
    
    throwEx1 = throw Boom
    throwEx2 = throw Kaboom
    throwEx3 = throw (Splat "splat")
    test = do
      result <- handle throwEx1 $ \e -> case e of
        Boom -> Just "foo"
        _    -> Nothing
      result2 <- handle throwEx2 $ \e -> case e of
        Boom   -> Just "what"
        Kaboom -> Just "huh?"
        _      -> Nothing
      result3 <- handle throwEx3 $ \e -> case e of
        Splat s -> Just s
        _       -> Nothing
      return (unwords [result, result2, result3])

We can run this with

    runThrows test (error . ("Toplevel fail "++)) id

which returns

    "foo huh? splat"

A Note on Either

Now, we already have a perfectly good monadic exception-like thing in the form of Either.

It might be interesting to note that what we’ve written is in fact isomorphic to Either. (e -> r) -> (a -> r) -> r is just the Church encoding of Either e a.

We can even go all the way and change Throws to

    newtype Throws e a = Throws {runThrows :: forall r. (e -> r) -> (a -> r) -> r}

So there you are, an interesting realization that one of the classic representations of a language like SML is in fact a really complicated version of Either :)
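
For the rank-2 version, the two directions of the isomorphism are pleasantly short (a quick sketch):

    toEither :: Throws e a -> Either e a
    toEither t = runThrows t Left Right

    fromEither :: Either e a -> Throws e a
    fromEither (Left e)  = Throws $ \ex _    -> ex e
    fromEither (Right a) = Throws $ \_  cont -> cont a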

April 14, 2014 12:00 AM

April 10, 2014

Manuel M T Chakravarty

As a spin off from teaching programming to my 10 year old son...



As a spin-off from teaching programming to my 10-year-old son and his friends, we have published a sprite and pixel-art editor for the iPad, called BigPixel, which you can get from the App Store. (It has a feature set similar to the earlier Haskell version, but is much prettier!)

April 10, 2014 11:47 AM

Alejandro Cabrera

Migration Complete: Welcome to blog.cppcabrera.com!

The migration is now complete. Currently, I have:

* Atom feed support
* Hosted on https:// (with heartbleed patched)
* Sources hosted on github
* Main blog over at https://blog.cppcabrera.com
* All content from here ported over
* All posts tagged appropriately

I've documented the process as my first new post at the new location: Learning Hakyll and Setting Up

At this point, the one feature I'd like to add to my static blog soon is the ability to have a Haskell-only feed. I'll be working on that over the coming week.

Thanks again for reading, and I hope you'll enjoy visiting the new site!

by Alejandro Cabrera (noreply@blogger.com) at April 10, 2014 07:27 AM

Robert Harper

Parallelism and Concurrency, Revisited

To my delight, I still get compliments on and criticisms of my post from three years ago (can it possibly be that long?) on parallelism and concurrency.  In that post I offered a “top down” argument to the effect that these are different abstractions with different goals: parallelism is about exploiting computational resources to maximize efficiency, concurrency is about non-deterministic composition of components in a system.  Parallelism never introduces bugs (the semantics is identical to the sequential execution), but concurrency could be said to be the mother lode of all bugs (the semantics of a component changes drastically, without careful provision, when composed concurrently with other components).  The two concepts just aren’t comparable, yet somehow the confusion between them persists.  (Not everyone agrees with me on this distinction, but neither have I seen a comparable analysis that shows them to be the same concept.  Most complaints seem to be about my use of the words “parallelism” and “concurrency” , which is an unavoidable problem, or about my temerity in trying to define two somewhat ill-defined concepts, a criticism that I’ll just have to accept.)

I’ve recently gotten an inkling of why it might be that many people equate the two concepts (or see no point in distinguishing them).  This post is an attempt to clear up what I perceive to be a common misunderstanding that seems to explain it.  It’s hard for me to say whether it really is all that common of a misunderstanding, but it’s the impression I’ve gotten, so forgive me if I’m over-stressing an obvious point.  In any case I’m going to try for a “bottom up” explanation that might make more sense to some people.

The issue is scheduling.

The naive view of parallelism is that it’s just talk for concurrency, because all you do when you’re programming in parallel is fork off some threads, and then do something with their results when they’re done.  I’ve previously argued that this is the wrong way to think about parallelism (it’s really about cost), but let’s just let that pass.  It’s unarguably true that a parallel computation does consist of a bunch of, well, parallel computations.  So, the argument goes, it’s nothing but concurrency, because concurrency is, supposedly, all about forking off some threads and waiting to see what they do, and then doing something with it.  I’ve argued that that’s not a good way to think about concurrency either, but we’ll let that pass too.  So, the story goes, concurrency and parallelism are synonymous, and bullshitters like me are just trying to confuse people and make trouble.

Being the troublemaker that I am, my response is, predictably, no. Just no.  Sure, it’s kinda sorta right, as I’ve already acknowledged, but not really, and here’s why: scheduling as you learned about it in OS class (for example) is an altogether different thing than scheduling for parallelism.  And this is the heart of the matter, from a “bottom-up” perspective.

There are two aspects of OS-like scheduling that I think are relevant here.  First, it is non-deterministic, and second, it is competitive.  Non-deterministic, because you have little or no control over what runs when or for how long.  A beast like the Linux scheduler is controlled by a zillion “voodoo parameters” (a turn of phrase borrowed from my queueing theory colleague, Mor Harchol-Balter), and who the hell knows what is going to happen to your poor threads once they’re in its clutches.  Second, and more importantly, an OS-like scheduler is allocating resources competitively.  You’ve got your threads, I’ve got my threads, and we both want ours to get run as soon as possible.  We’ll even pay for the privilege (priorities) if necessary.  The scheduler, and the queueing theory behind it (he says optimistically) is designed to optimize resource usage on a competitive basis, taking account of quality of service guarantees purchased by the participants.  It does not matter whether there is one processor or one thousand processors, the schedule is unpredictable.  That’s what makes concurrent programming hard: you have to program against all possible schedules.  And that’s why you can’t prove much about the time or space complexity of your program when it’s implemented concurrently.

Parallel scheduling is a whole ‘nother ball of wax.  It is (usually, but not necessarily) deterministic, so that you can prove bounds on its efficiency (Brent-type theorems, as I discussed in my previous post and in PFPL).  And, more importantly, it is cooperative in the sense that all threads are working together for the same computation towards the same ends.  The threads are scheduled so as to get the job (there’s only one) done as quickly and as efficiently as possible.  Deterministic schedulers for parallelism are the most common, because they are the easiest to analyze with respect to their time and space bounds.  Greedy schedulers, which guarantee to maximize use of available processors, never leaving any idle when there is work to be done, form an important class for which the simple form of Brent’s Theorem is obvious.

Many deterministic greedy scheduling algorithms are known, of which I will mention p-DFS and p-BFS, which do p-at-a-time depth- and breadth-first search of the dependency graph, and various forms of work-stealing schedulers, pioneered by Charles Leiserson at MIT.  (Incidentally, if you don’t already know what p-DFS or p-BFS are, I’ll warn you that they are a little trickier than they sound.  In particular p-DFS uses a data structure that is sort of like a stack but is not a stack.)  These differ significantly in their time bounds (for example, work stealing usually involves expectation over a random variable, whereas the depth- and breadth traversals do not), and differ dramatically in their space complexity.  For example, p-BFS is absolutely dreadful in its space complexity.  For a full discussion of these issues in parallel scheduling, I recommend Dan Spoonhower’s PhD Dissertation.  (His semantic profiling diagrams are amazingly beautiful and informative!)

So here’s the thing: when you’re programming in parallel, you don’t just throw some threads at some non-deterministic competitive scheduler.  Rather, you generate an implicit dependency graph that a cooperative scheduler uses to maximize efficiency, end-to-end.  At the high level you do an asymptotic cost analysis without considering platform parameters such as the number of processors or the nature of the interconnect.  At the low level the implementation has to validate that cost analysis by using clever techniques to ensure that, once the platform parameters are known, maximum use is made of the computational resources to get your job done for you as fast as possible.  Not only are there no bugs introduced by the mere fact of being scheduled in parallel, but even better, you can prove a theorem that tells you how fast your program is going to run on a real platform.  Now how cool is that?


Filed under: Programming, Research Tagged: concurrency, parallelism

by Robert Harper at April 10, 2014 03:18 AM

April 09, 2014

Dominic Steinitz

Gibbs Sampling in R, Haskell, Jags and Stan

Introduction

It’s possible to do Gibbs sampling in most languages and since I am doing some work in R and some work in Haskell, I thought I’d present a simple example in both languages: estimating the mean from a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or STAN to do the actual sampling and to do pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and STAN as well as Haskell.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module Gibbs (
>     main
>   , m
>   , Moments(..)
>   ) where
> 
> import qualified Data.Vector.Unboxed as V
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import qualified Control.Foldl as L
> 
> import Diagrams.Backend.Cairo.CmdLine
> 
> import LinRegAux
> 
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )

The length of our chain and the burn-in.

> nrep, nb :: Int
> nb   = 5000
> nrep = 105000

Data generated from {\cal{N}}(10.0, 5.0).

> xs :: [Double]
> xs = [
>     11.0765808082301
>   , 10.918739177542
>   , 15.4302462747137
>   , 10.1435649220266
>   , 15.2112705014697
>   , 10.441327659703
>   , 2.95784054883142
>   , 10.2761068139607
>   , 9.64347295100318
>   , 11.8043359297675
>   , 10.9419989262713
>   , 7.21905367667346
>   , 10.4339807638017
>   , 6.79485294803006
>   , 11.817248658832
>   , 6.6126710570584
>   , 12.6640920214508
>   , 8.36604701073303
>   , 12.6048485320333
>   , 8.43143879537592
>   ]

A Bit of Theory

Gibbs Sampling

For a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions.

Referring back to the explanation of the metropolis algorithm, let us describe the state by its parameters i \triangleq \boldsymbol{\theta}^{(i)} \triangleq (\theta^{(i)}_1,\ldots, \theta^{(i)}_n) and the conditional posteriors by \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big) where {\boldsymbol{\theta}}^{(i)}_{-k} = \big(\theta_1^{(i)},\ldots,\theta_{k-1}^{(i)},\theta_{k+1}^{(i)}\ldots\theta_n^{(i)}\big) then

\displaystyle   \begin{aligned}  \frac{\pi\big(\boldsymbol{\theta}^{(j)}\big)q\big(\boldsymbol{\theta}^{(j)}, \boldsymbol{\theta}^{(i)}\big)}  {\pi(\boldsymbol{\theta}^{(i)})q(\boldsymbol{\theta}^{(i)}, \boldsymbol{\theta}^{(j)})}  &=  \frac{  \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  }  {  \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(i)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(i)}\big)  } \\  &=  \frac{  \pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  }  {  \pi\big({\theta}_{k}^{(i)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\boldsymbol{\theta}}_{-k}^{(j)}\big)\pi\big({\theta}_{k}^{(j)} \,\big|\, {\boldsymbol{\theta}}_{-k}^{(j)}\big)  } \\  &= 1  \end{aligned}

where we have used the rules of conditional probability and the fact that \boldsymbol{\theta}_i^{(-k)} = \boldsymbol{\theta}_j^{(-k)}

Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters.

Normal Distribution with Unknown Mean and Variance

It is fairly standard to use an improper prior

\displaystyle   \begin{aligned}  \pi(\mu, \tau) \propto \frac{1}{\tau} & & -\infty < \mu < \infty\, \textrm{and}\, 0 < \tau < \infty  \end{aligned}

The likelihood is

\displaystyle   p(\boldsymbol{x}\,|\,\mu, \sigma) = \prod_{i=1}^n \bigg(\frac{1}{\sigma\sqrt{2\pi}}\bigg)\exp{\bigg( -\frac{(x_i - \mu)^2}{2\sigma^2}\bigg)}

re-writing in terms of precision

\displaystyle   p(\boldsymbol{x}\,|\,\mu, \tau) \propto \prod_{i=1}^n \sqrt{\tau}\exp{\bigg( -\frac{\tau}{2}{(x_i - \mu)^2}\bigg)} = \tau^{n/2}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}

Thus the posterior is

\displaystyle   p(\mu, \tau \,|\, \boldsymbol{x}) \propto \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)}

We can re-write the sum in terms of the sample mean \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i and variance s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 using

\displaystyle   \begin{aligned}  \sum_{i=1}^n (x_i - \mu)^2 &= \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 \\  &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2\sum_{i=1}^n (x_i - \bar{x})(\bar{x} - \mu) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\  &= \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x}) + \sum_{i=1}^n (\bar{x} - \mu)^2 \\  &= (n - 1)s^2 + n(\bar{x} - \mu)^2  \end{aligned}

Thus the conditional posterior for \mu is

\displaystyle   \begin{aligned}  p(\mu \,|\, \tau, \boldsymbol{x}) &\propto \exp{\bigg( -\frac{\tau}{2}\bigg(\nu s^2 + \sum_{i=1}^n{(\mu - \bar{x})^2}\bigg)\bigg)} \\  &\propto \exp{\bigg( -\frac{n\tau}{2}{(\mu - \bar{x})^2}\bigg)} \\  \end{aligned}

which we recognise as a normal distribution with mean of \bar{x} and a variance of (n\tau)^{-1}.

The conditional posterior for \tau is

\displaystyle   \begin{aligned}  p(\tau \,|\, \mu, \boldsymbol{x}) &\propto \tau^{n/2 -1}\exp\bigg(-\tau\frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)  \end{aligned}

which we recognise as a gamma distribution with a shape of n/2 and a scale of \frac{1}{2}\sum_{i=1}^n{(x_i - \mu)^2}

In this particular case, we can calculate the marginal posterior of \mu analytically. Writing z = \frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2} we have

\displaystyle   \begin{aligned}  p(\mu \,|\, \boldsymbol{x}) &= \int_0^\infty p(\mu, \tau \,|\, \boldsymbol{x}) \textrm{d}\tau \\  &\propto \int_0^\infty \tau^{n/2 - 1}\exp{\bigg( -\frac{\tau}{2}\sum_{i=1}^n{(x_i - \mu)^2}\bigg)} \textrm{d}\tau \\  &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \int_0^\infty z^{n/2 - 1}\exp{-z}\textrm{d}\tau \\  &\propto \bigg( \sum_{i=1}^n{(x_i - \mu)^2} \bigg)^{-n/2} \\  \end{aligned}

Finally we can calculate

\displaystyle   \begin{aligned}  p(\mu \,|\, \boldsymbol{x}) &\propto \bigg( (n - 1)s^2 + n(\bar{x} - \mu)^2 \bigg)^{-n/2} \\  &\propto \bigg( 1 + \frac{n(\mu - \bar{x})^2}{(n - 1)s^2} \bigg)^{-n/2} \\  \end{aligned}

This is the non-standardized Student’s t-distribution t_{n-1}(\bar{x}, s^2/n).

Alternatively the marginal posterior of \mu is

\displaystyle   \frac{\mu - \bar{x}}{s/\sqrt{n}}\bigg|\, x \sim t_{n-1}

where t_{n-1} is the standard t distribution with n - 1 degrees of freedom.

The Model in Haskell

Following up on a comment from a previous blog post, let us try using the foldl package to calculate the length, the sum and the sum of squares while traversing the list only once. This is an improvement on creating your own strict record and using foldl’, but maybe it is not suitable for some methods, e.g. calculating the skewness and kurtosis incrementally; see below.

> x2Sum, xSum, n :: Double
> (x2Sum, xSum, n) = L.fold stats xs
>   where
>     stats = (,,) <$>
>             (L.premap (\x -> x * x) L.sum) <*>
>             L.sum <*>
>             L.genericLength

And re-writing the sample variance s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 using

\displaystyle   \begin{aligned}  \sum_{i=1}^n (x_i - \bar{x})^2 &= \sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2) \\  &= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + \sum_{i=1}^n \bar{x}^2 \\  &= \sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\  &= \sum_{i=1}^n x_i^2 - n\bar{x}^2 \\  \end{aligned}

we can then calculate the sample mean and variance using the sums we have just calculated.

> xBar, varX :: Double
> xBar = xSum / n
> varX = n * (m2Xs - xBar * xBar) / (n - 1)
>   where m2Xs = x2Sum / n

In random-fu, the Gamma distribution is specified by the rate parameter, \beta.

> beta, initTau :: Double
> beta = 0.5 * n * varX
> initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)

Our sampler takes an old value of \tau and creates new values of \mu and \tau.

> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double))
> gibbsSampler oldTau = do
>   newMu <- sample (Normal xBar (recip (sqrt (n * oldTau))))
>   let shape = 0.5 * n
>       scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar)
>   newTau <- sample (Gamma shape (recip scale))
>   return $ Just ((newMu, newTau), newTau)

From which we can create an infinite stream of samples.

> gibbsSamples :: [(Double, Double)]
> gibbsSamples = evalState (ML.unfoldrM gibbsSampler initTau) (pureMT 1)

As our chains might be very long, we calculate the mean, variance, skewness and kurtosis using an incremental method.

> data Moments = Moments { mN :: !Double
>                        , m1 :: !Double
>                        , m2 :: !Double
>                        , m3 :: !Double
>                        , m4 :: !Double
>                        }
>   deriving Show
> moments :: [Double] -> Moments
> moments xs = foldl' f (Moments 0.0 0.0 0.0 0.0 0.0) xs
>   where
>     f :: Moments -> Double -> Moments
>     f m x = Moments n' m1' m2' m3' m4'
>       where
>         n = mN m
>         n'  = n + 1
>         delta = x - (m1 m)
>         delta_n = delta / n'
>         delta_n2 = delta_n * delta_n
>         term1 = delta * delta_n * n
>         m1' = m1 m + delta_n
>         m4' = m4 m +
>               term1 * delta_n2 * (n'*n' - 3*n' + 3) +
>               6 * delta_n2 * m2 m - 4 * delta_n * m3 m
>         m3' = m3 m + term1 * delta_n * (n' - 2) - 3 * delta_n * m2 m
>         m2' = m2 m + term1

In order to examine the posterior, we create a histogram.

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = xBar - 2.0 * sqrt varX
>     upper = xBar + 2.0 * sqrt varX

And fill it with the specified number of samples preceded by a burn-in.

> hist :: Histogram V.Vector BinD Double
> hist = fillBuilder hb (take (nrep - nb) $ drop nb $ map fst gibbsSamples)

Now we can plot this.

And calculate the skewness and kurtosis.

> m :: Moments
> m = moments (take (nrep - nb) $ drop nb $ map fst gibbsSamples)
ghci> import Gibbs
ghci> putStrLn $ show $ (sqrt (mN m)) * (m3 m) / (m2 m)**1.5
  8.733959917065126e-4

ghci> putStrLn $ show $ (mN m) * (m4 m) / (m2 m)**2
  3.451374739494607

We expect a skewness of 0 and a kurtosis of 3 + 6 / (\nu - 4) = 3.4 for \nu = 19. Not too bad.
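
The two ghci expressions above can be wrapped up as ordinary functions on Moments; a small sketch:

    skewness, kurtosis :: Moments -> Double
    skewness m = sqrt (mN m) * m3 m / (m2 m) ** 1.5
    kurtosis m = mN m * m4 m / (m2 m) ** 2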

The Model in JAGS

JAGS is a mature, declarative, domain specific language for building Bayesian statistical models using Gibbs sampling.

Here is our model as expressed in JAGS. Somewhat terse.

model {
	for (i in 1:N) {
		x[i] ~ dnorm(mu, tau)
	}
	mu ~ dnorm(0, 1.0E-6)
	tau <- pow(sigma, -2)
	sigma ~ dunif(0, 1000)
}

To run it and examine its results, we wrap it up in some R

## Import the library that allows R to inter-work with jags.
library(rjags)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

jags <- jags.model('example1.bug',
                   data = list('x' = fn[,1], 'N' = 20),
                   n.chains = 4,
                   n.adapt = 100)

## Burnin for 10000 samples
update(jags, 10000);

mcmc_samples <- coda.samples(jags, variable.names=c("mu", "sigma"), n.iter=20000)

png(file="diagrams/jags.png",width=400,height=350)
plot(mcmc_samples)
dev.off()

And now we can look at the posterior for \mu.

The Model in STAN

STAN is a domain-specific language for building Bayesian statistical models, similar to JAGS but newer; it allows variables to be re-assigned and so cannot really be described as declarative.

Here is our model as expressed in STAN. Again, somewhat terse.

data {
  int<lower=0> N;
  real x[N];
}

parameters {
  real mu;
  real<lower=0,upper=1000> sigma;
}
model{
  x     ~ normal(mu, sigma);
  mu    ~ normal(0, 1000);
}

Just as with JAGS, to run it and examine its results, we wrap it up in some R.

library(rstan)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

## Running the model
fit1 <- stan(file = 'Stan.stan',
             data = list('x' = fn[,1], 'N' = 20),
             pars=c("mu", "sigma"),
             chains=3,
             iter=30000,
             warmup=10000)

png(file="diagrams/stan.png",width=400,height=350)
plot(fit1)
dev.off()

Again we can look at the posterior although we only seem to get medians and 80% intervals.

PostAmble

Write the histogram produced by the Haskell code to a file.

> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
>   mainRender ( DiagramOpts (Just 900) (Just 700) fn
>              , DiagramLoopOpts False Nothing 0
>              )
> main :: IO ()
> main = do
>   displayHeader "diagrams/DataScienceHaskPost.png"
>     (barDiag
>      (zip (map fst $ asList hist) (map snd $ asList hist)))

The code can be downloaded from github.


by Dominic Steinitz at April 09, 2014 11:43 AM

April 08, 2014

Ketil Malde

Some not-so-grand challenges in bioinformatics

Bioinformatics is a field in the intersection of biology, statistics, and computer science. Broadly speaking, biologists will plan an experiment, bioinformaticians will run the analysis, and the biologists will interpret the results. Sometimes it can be rather frustrating to be the bioinformatician squeezed into the middle, realizing that the results are not nearly as robust as they should be.

Here, I try to list a number of areas in bioinformatics that I think are problematic. They are not in any way “grand challenges”, like computing protein structure or identifying non-trivial factors from GWAS analysis. These are more your everyday challenges that bioinformaticians need to be aware of, but are too polite to mention.

Importance of bioinformatics

High throughput sequencing (HTS) has rapidly become an indispensable tool in medicine, biology, and ecology. As a technology, it is crucial in addressing many of our most important challenges: health and disease, food supply and food production, and ecology. Scientifically, it has allowed us to reveal the mechanisms of living organisms in unprecedented detail and scale.

New technology invariably poses new challenges to analysis, and in addition to a thorough understanding of relevant biology, robust study design and analysis based on HTS requires a solid understanding of statistics and bioinformatics. A lack of this background often results in study design that is overly optimistic and ambitious, and in analysis that fails to take the many sources of noise and error into account.

Specific examples

The typical bioinformatics pipeline is (as implied by the name) structured as a chain of operations. The tools used will of course not be perfect, and they will introduce errors, artifacts, and biases. Unfortunately, the next stage in the pipeline usually works from the assumption that the results from the previous stage are correct, and errors thus propagate and accumulate - and in the end, it is difficult to say whether results represent real, biological phenomena, or are simply artifacts of the analysis. What makes matters worse, is that the pipeline is often executed by scientists with a superficial knowledge of the errors that can occur. (A computer said it, so it must be true.)

Gene prediction and transcript assembly

One important task in genomics is identifying the genes of new organisms. This can be performed with a variety of tools, and in one project, we used both ab initio gene prediction (MAKER and Augustus) as well as RNAseq assembly (CLC, abyss). The results varied substantially, with predicted gene counts ranging from 16000 to 45000. To make things worse, clustering revealed that there was not a simple many-to-one relationship between the predicted genes. Clearly, results from analysis of e.g. GO-enrichment or expansion and contraction of gene families will be heavily dependent on which set of transcripts one uses.

Expression analysis

Quantifying the expression of specific genes is an important tool in many studies. In addition to the challenges of acquiring a good set of transcripts, expression analysis based on RNAseq introduces several other problems.

The most commonly used measure, reads (or fragments) per kilobase per million (RPKM), has some statistical problems.

In addition, mapping is challenging, and in particular, mapping reads from transcripts to a genome will struggle with intron/exon boundaries. Of course, you can map to the transcriptome instead, but as we have seen, the reliability of reference transcriptomes is not always what we would like.

Reference genomes

The main goal of a de novo genome project is to produce a correct reference sequence for the genome. Although in many ways a simpler task than the corresponding transcriptome assembly, there are many challenges to overcome. In reality, most genome sequences are fragmented and contain errors. But even in the optimal case, a reference genome may not be representative.

For instance, a reference genome from a single individual is clearly unlikely to accurately model sex chromosomes of the other sex (but on the other hand, similarities between different sex chromosomes will likely lead to a lot of incorrectly mapped reads). But a single individual is often not representative even for other individuals of the same sex. E.g., an obvious method for identifying a sex chromosome is to compare sequence data from one sex to the draft reference (which in our case happened to come from an individual of the opposite sex). Unfortunately, it turned out that the marker we found to be absent in the reference (and thus inferred to be specific to the sex we were comparing) also occurred in 20% of individuals of the reference's sex; it just wasn't present in the one individual that had been sequenced.

So while reference genomes may work fairly well for some species, it is less useful for species with high genomic variation (or with B-chromosomes), and simply applying methods developed for mammals to different species will likely give misleading and/or incorrect results.

Genome annotation

I recently attended an interesting talk about pseudogenes in various species. One thing that stuck in my mind, was that the number of pseudogenes in a species is highly correlated with the institution responsible for the gene prediction. Although this is just an observation and not a scientific result, it seems very likely that different institutes use in-house pipelines with varying degrees of sensitivity/specificity. Thus what one pipeline would identify as a real gene, a more conservative pipeline would ignore as a pseudogene. So here too, our precious curated resources may be less valuable than you thought.

Population genomics

In population genomics, there are many different measures of diversity, but the most commonly used is FST. Unfortunately, the accuracy of estimating FST for a SNP is highly dependent on coverage and error rate, and a simulation study I did showed that with the coverage commonly used in projects, the estimated FST has a large variance.

In addition, the usual problems apply: the genome may not be complete, correct or representative. And even if it were, reads could still be mapped incorrectly. And it is more likely to fail in variable regions, meaning you get less accurate results exactly in the places it matters most.

Variant analysis

Variant analysis suffers from many of the same challenges as population genomics, genome assembly and mapping being what they are. Again, estimated allele frequencies tend to be wildly inaccurate.

Structural variants (non-trivial indels, transpositions, copy number variation, etc) are very common, but difficult to detect accurately from mapped reads. In many cases, such features will just result in lower mapping coverage, further reducing your chances of identifying them.

Curation is expensive

Analysis is challenging even under the best of conditions – with high quality data and using a good reference genome that is assembled and annotated and curated by a reputable institution or heavyweight consortium. But in many cases, data quality is poor and we don’t even have the luxury of good references.

A de novo genome project takes years and costs millions. This is acceptable for species crucial to medicine, like human or mouse, and for food production, like cow or wheat. These are sequenced by large multi-national consortia at tremendous expense. But while we will eventually get there for the more important species, the large majority of organisms – the less commercially or scientifically interesting ones – will never have anything close to this.

Similarly, the databases with unannotated sequences (e.g. UniProt) grow at a much quicker pace than the curated databases (e.g. SwissProt), simply because identifying the function of a protein in the lab takes approximately one post doc., and nobody has the incentives or manpower to do this.

Conclusions

So, what is the take-home message from all of this? Let’s take a moment to sum up:

Curation is less valuable than you think

Everybody loves and anticipates finished genomes, complete with annotated transcripts and proteins. But there's a world of difference between the finished genomes of widely studied species and the draft genomes of the more obscure ones. Often, the reference resources are incomplete and partially incorrect, and sometimes the best thing that can be said for them is that, as long as we all use them, our results will be comparable because we all make the same errors.

Our methods aren’t very robust

Genome assembly, gene annotation, transcriptome assembly, and functional annotation are difficult. Anybody can run some assembler or gene predictor and get a result, but if you run two of them, you will get two different results.

Sequence mapping is not very complicated (at a conference, somebody claimed to have counted over ninety programs for short read mapping), and works well if you have a good reference genome, if you don’t have too many repeats, and if you don’t have too much variation in your genome. In other words, everywhere you don’t really need it.

Dealing with the limitations

Being everyday challenges, I don’t think any of this is insurmountable, but I’ll save that for a later post. Or grant application. In the meantime, do let me know what you think.

April 08, 2014 05:00 PM

Yesod Web Framework

Proposal: Changes to the PVP

As I mentioned two posts ago, there was a serious discussion on the libraries mailing list about the Package Versioning Policy (PVP).

This blog post presents some concrete changes I'd like to see to the PVP to make it better for both general consumers of Hackage, and for library authors as well. I'll start off with a summary of the changes, and then give the explanations:

  1. The goal of the PVP needs to be clarified. Its purpose is not to ensure reproducible builds of non-published software, but rather to provide for more reliable builds of libraries on Hackage. Reproducible builds should be handled exclusively through version freezing, the only known technique to actually give the necessary guarantees.

  2. Upper bounds should not be included on non-upgradeable packages, such as base and template-haskell (are there others?). Alternatively, we should establish some accepted upper bound on these packages, e.g. many people place base < 5 on their code.

  3. We should be distinguishing between mostly-stable packages and unstable packages. For a package like text, if you simply import Data.Text (Text, pack, reverse), or some other sane subset, there's no need for upper bounds.

    Note that this doesn't provide a hard-and-fast rule like the current PVP, but is rather a matter of discretion. Communication between library authors and users (via documentation or other means) would be vital to making this work well.

  4. For a package version A.B.C, a bump in A or B indicates some level of breaking change. As an opt-in approach, package authors are free to associate meaning to A and B beyond what the PVP requires. Libraries which use these packages are free to rely on the guarantees provided by package authors when placing upper bounds.

    Note that this is very related to point (3).

While I (Michael Snoyman) am the author of this proposal, the following people have reviewed the proposal and support it:

  • Bryan O'Sullivan
  • Felipe Lessa
  • Roman Cheplyaka
  • Vincent Hanquez

Reproducible builds

There are a number of simple cases that can result in PVP-compliant code not being buildable. These aren't just hypothetical cases; in my experience as both a package author and Stackage maintainer, I've seen these come up.

  • Package foo version 1.0 provides an instance for MonadFoo for IO and Identity. Version 1.1 removes the IO instance for some reason. Package bar provides a function:

    bar :: MonadFoo m => Int -> m Double

    Package bar compiles with both version 1.0 and 1.1 of foo, and therefore (following the PVP) adds a constraint to its cabal file foo >= 1.0 && < 1.2.

    Now a user decides to use the bar package. The user never imports anything from foo, and therefore has no listing for foo in the cabal file. The user code depends on the IO instance for MonadFoo. When compiled with foo 1.0, everything works fine. However, when compiled with foo 1.1, the code no longer compiles.

  • Similarly, instead of typeclass instances, the same situation can occur with module export lists. Consider version 1.0 of foo which provides:

    module Foo (foo1, foo2) where

    Version 1.1 removes the foo2 export. The bar package reexports the entire Foo module, and then a user package imports the module from bar. If the user package uses the foo2 function, it will compile when foo-1.0 is used, but not when foo-1.1 is used.

In both of these cases, the issue is the same: transitive dependencies are not being clamped down. The PVP makes an assumption that the entire interface for a package can be expressed in its version number, which is not true. I see three possible solutions to this:

  1. Try to push even more of a burden onto package authors, and somehow make them guarantee that their interface is completely immune to changes elsewhere in the stack. This kind of change was proposed on the libraries list. I'm strongly opposed to some kind of change like this: it makes authors' lives harder, and makes it very difficult to provide backwards compatibility in libraries. Imagine if transformers 0.4 adds a new MonadIO instance; the logical extreme of this position would be to disallow a library from working with both transformers 0.3 and 0.4, which will split Hackage in two.

  2. Modify the PVP so that instead of listing just direct dependencies, authors are required to list all transitive dependencies as well. So it would be a violation to depend on bar without explicitly listing foo in the dependency list. This would work, but it would be incredibly difficult to maintain. It will also greatly increase the time it takes for a new version of a deep dependency to be usable due to the number of authors who will have to bump version bounds.

  3. Transfer responsibility for this to package users: if you first built your code against foo 1.0, you should freeze that information and continue building against foo 1.0, regardless of the presence of new versions of foo. Not only does this increase reproducibility, it's just common sense: it's entirely possible that new versions of a library will introduce a runtime bug, performance regression, or even fix a bug that your code depends on. Why should the reliability of my code base be dependent on the actions of some third party that I have no control over?

Non-upgradeable packages

There are some packages which ship with GHC and cannot be upgraded. I'm aware of at least base and template-haskell, though perhaps there are others (haskell98 and haskell2010?). In the past, there was good reason to place upper bounds on base, specifically with the base 3/4 split. However, we haven't had that experience in a while, and don't seem to be moving towards doing that again. In today's world, we end up with the following options:

  • Place upper bounds on base to indicate "I haven't tested this with newer versions of GHC." This then makes it difficult for users to test out that package with newer versions of GHC.
  • Leave off upper bounds on base. Users may then try to install a package onto a version of GHC on which the package hasn't been tested, which will result in either (1) everything working (definitely the common case based on my Stackage maintenance), or (2) getting a compilation error.

I've heard two arguments to push us in the direction of keeping the upper bounds in this case, so I'd like to address them both:

  • cabal error messages are easier to understand than GHC error messages. I have two problems with that:
    • I disagree: cabal error messages are terrible. (I'm told this will be fixed in the next version of cabal.) Take the following output as a sample:

      cabal: Could not resolve dependencies:
      trying: 4Blocks-0.2
      rejecting: base-4.6.0.1/installed-8aa... (conflict: 4Blocks => base>=2 && <=4)
      rejecting: base-4.6.0.1, 4.6.0.0, 4.5.1.0, 4.5.0.0, 4.4.1.0, 4.4.0.0, 4.3.1.0,
      4.3.0.0, 4.2.0.2, 4.2.0.1, 4.2.0.0, 4.1.0.0, 4.0.0.0, 3.0.3.2, 3.0.3.1 (global
      constraint requires installed instance)

      I've seen a number of users file bug reports not understanding that this message means "you have the wrong version of GHC."

    • Even if the error messages were more user-friendly, they make it more difficult to fix the actual problem: the code doesn't compile with the new version of GHC. Often times, I've been able to report an error message to a library author and, without necessarily even downloading the new version of GHC, he/she has been able to fix the problem.

  • Using upper bounds in theory means that cabal will be able to revert to an older version of the library that is compatible with the new version of GHC. However, I find it highly unlikely that there's often, if ever, a case where an older version of a library is compatible with a later version of GHC.

Mostly-stable, and finer-grained versioning

I'll combine the discussion of the last two points. I think the heart of the PVP debates really comes from mostly-stable packages. Let's contrast with the extremes. Consider a library which is completely stable, never has a breaking change, and has stated with absolute certainty that it never will again. Does anyone care about upper bounds on this library? They're irrelevant! I'd have no problem with including an upper bound, and I doubt even the staunchest PVP advocates would really claim it's a problem to leave it off.

On the other hand, consider an extremely unstable library, which is releasing massively breaking changes on a weekly basis. I would certainly agree in that case that an upper bound on that library is highly prudent.

The sticking point is the middle ground. Consider the following code snippet:

import Data.Text (Text, pack)

foo :: Text
foo = pack "foo"

According to the PVP as it stands today, this snippet requires an upper bound of < 1.2 on the text package. But let's just play the odds here: does anyone actually believe there's a real chance that the next iteration of text will break this code snippet? I highly doubt it; this is a stable subset of the text API, and I doubt it will ever be changing. The same can be said of large subsets of many other packages.

By putting in upper bounds in these cases, we run a very real risk of bifurcating Hackage into "those demanding the new text version for some new feature" vs "those who haven't yet updated their upper bounds to allow the new version of text."

The PVP currently takes an extremely conservative viewpoint on this, with the goal of solving just one problem: making sure code that compiles now continues to compile. As I demonstrated above, it doesn't actually solve that problem completely. And in addition, in this process, it has created other problems, such as this bifurcation.

So my proposal is that, instead of creating rigid rules like "always put an upper bound no matter what," we allow some common sense into the process, and also let package authors explicitly say "you can rely on this API not changing."

April 08, 2014 03:55 PM

Noam Lewis

Generate Javascript classes for your .NET types

We open-sourced another library: ClosureExterns.NET (on github and nuget). It generates Javascript classes from .NET-based backend types, to preserve type “safety” (as safe as Javascript gets) across front- and backend. As a bonus you get Google closure annotations. The type annotations are understood by WebStorm (and other editors) and improve your development experience. Also, if you use Google Closure to compile or verify your code, it will take these types into account. We use it extensively with C#. We haven’t tried it with F#, but it’s supposed to work with any .NET type.

ClosureExterns.NET makes it easier to keep your frontend models in sync with your backend. The output is customizable – you can change several aspects of the generated code. For example you can change the constructor function definition, to support inheritance from some other Javascript function. For more details see ClosureExternOptions.

Getting Started

First, install it. Using nuget, install the package ClosureExterns.

Then, expose a method that generates your externs. For example, a console application:

public static class Program
{
    public static void Main()
    {
        var types = ClosureExternsGenerator.GetTypesInNamespace(typeof(MyNamespace.MyType));
        var output = ClosureExternsGenerator.Generate(types);
        Console.Write(output);
    }
}

You can also customize the generation using a ClosureExternsOptions object.

Example input/output

Input

class B
{
    public int[] IntArray { get; set; }
}

Output

var Types = {};

// ClosureExterns.Tests.ClosureExternsGeneratorTest+B
/** @constructor
*/
Types.B = function() {};
/** @type {Array.<number>} */
Types.B.prototype.intArray = null;

For a full example see the tests.


Tagged: .NET, c#, Javascript

by sinelaw at April 08, 2014 03:36 PM

Danny Gratzer

Bargain Priced Coroutines

Posted on April 8, 2014

The other day I was reading the 19th issue of the Monad.Reader and there was a fascinating post on coroutines.

While reading some of the code I noticed that it, like most things in Haskell, can be reduced to 5 lines with a library that Edward Kmett has written.

Consider the type of a trampoline as described in this article

    newtype Trampoline m a = Tramp {runTramp :: m (Either (Trampoline m a) a)}

So a trampoline is a monadic computation of some sort returning either a result, a, or another computation to run to get the rest.

Now this looks strikingly familiar. A computation returning Trampoline m a is really a computation returning a tree of Tramp m a’s terminating in a pure value.

This sounds like a free monad!

    import Control.Monad.Trans.Free
    import Control.Monad.Identity
    
    type Trampoline = FreeT Identity

Recall that FreeT is defined as

    data FreeF f a b = Pure a | Free (f b)
    newtype FreeT f m a = FreeT {runFreeT :: m (FreeF f a (FreeT f m a))}

This is isomorphic to what we were looking at before. As an added bonus, we've spared ourselves the tedium of defining our own monad and applicative instances for Trampoline.

We can now implement bounce and pause to define our trampolines. bounce must take a computation and unwrap it by one level, leaving either a value or another computation.

This is just a matter of rejiggering the FreeF into an Either

    bounce :: Functor m => Trampoline m a -> m (Either (Trampoline m a) a)
    bounce = fmap toEither . runFreeT
      where toEither (Pure a) = Right a
            toEither (Free m) = Left $ runIdentity m

pause requires some thought. The trick is to realize that if we wrap a computation in one layer of Free, then when it's unwrapped by bounce we'll get back the rest of the computation.

Therefore,

    pause :: Monad m => Trampoline m ()
    pause = FreeT $ return (Free . Identity $ return ())
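
As a quick sanity check, here's a minimal usage sketch (the names hello and runAll are mine, not from the article; it assumes lift from Control.Monad.Trans.Class is in scope, as in the examples further down):

    -- A trampolined computation that pauses between two side effects.
    hello :: Trampoline IO ()
    hello = do
      lift $ putStrLn "Hello"
      pause
      lift $ putStrLn "World"

    -- Repeatedly bounce until we reach the final value.
    runAll :: Trampoline IO () -> IO ()
    runAll t = bounce t >>= either runAll return

Running runAll hello prints Hello and then World; each pause shows up as one extra bounce, i.e. one more point at which control comes back to the caller of bounce.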

So that’s 6 lines of code for trampolines. Let’s move on to generators.

A generator doesn’t yield just another computation, it yields a pair of a computation and a freshly generated value. We can account for this by changing that Identity functor.

    type Generator c = FreeT ((,) c)

Again we get free functor, applicative and monad instances. We need two functions, yield and runGen. yield is going to take one value and stick it into the first element of the pair.

    yield :: Monad m => g -> Generator g m ()
    yield g = FreeT . return $ Free (g, return ())

This just sticks a good old boring m () in the second element of the pair.

Now runGen should take a generator and produce a m (Maybe c, Generator c m a). This can be done again by pattern matching on the underlying FreeF.

    runGen :: (Monad m, Functor m) => Generator g m a -> m (Maybe g, Generator g m a)
    runGen = fmap toTuple . runFreeT
      where toTuple (Pure a)         = (Nothing, return a)
            toTuple (Free (g, rest)) = (Just g, rest)

Now, last but not least, let’s build consumers. These wait for a value rather than generating one, so -> looks like the right functor.

    type Consumer c = FreeT ((->) c)

Now we want await and runCon: await waits for a value and runCon supplies one. These are both fairly mechanical. (The runConsumer helper below simply keeps feeding the same value until the consumer finishes.)

    runConsumer :: Monad m => c -> Consumer c m a -> m a
    runConsumer c = (>>= go) . runFreeT
      where go (Pure a) = return a
            go (Free f) = runConsumer c $ f c

    runCon :: (Monad m, Functor m)
        => Maybe c
        -> Consumer c m a
        -> m (Either a (Consumer c m a))
    runCon food c = runFreeT c >>= go
      where go (Pure a) = return . Left $ a
            go (Free f) = do
              result <- runFreeT $ f food
              return $ case result of
                Pure a -> Left                   $ a
                free   -> Right . FreeT . return $ free

runCon is a bit more complex than I’d like. This is to essentially ensure that if we had some code like

    Just a <- await
    lift $ do
      foo
      bar
      baz
    Just b <- await

We want foo, bar, and baz to run with just one await. You’d expect that we’d run as much as possible with each call to runCon. Thus we unwrap not one, but two layers of our FreeT and run them, then rewrap the lower layer. The trick is that we make sure never to duplicate side effects by using good old return.

We can sleep easy that this is sound since return a >>= f is f a by the monad laws. Thus, our call to return can’t do anything detectable or too interesting.

While this is arguably more intuitive, I don’t particularly like it so we can instead write

    runCon :: (Monad m, Functor m)
        => Maybe c
        -> Consumer c m a
        -> m (Either a (Consumer c m a))
    runCon food = fmap go . runFreeT
      where go (Pure a) = Left a
            go (Free f) = Right (f food)

Much simpler, but now our above example wouldn’t run foo and friends until the second call of runCon.

Now we can join generators to consumers in a pretty naive way,

    (>~>) :: (Functor m, Monad m) => Generator c m () -> Consumer c m a -> m a
    gen >~> con = do
      (cMay, rest) <- runGen gen
      case cMay of
        Nothing -> starve con
        Just c  -> runCon (Just c) con >>= use rest
      where use _    (Left a)  = return a
            use rest (Right c) = rest >~> c

And now we can use it!

    addGen :: Generator Int IO ()
    addGen = do
      lift $ putStrLn "Yielding 1"
      yield 1
      lift $ putStrLn "Yielding 2"
      yield 2
    
    addCon :: Consumer Int IO ()
    addCon = do
      lift $ putStrLn "Waiting for a..."
      Just a <- await
      lift $ putStrLn "Waiting for b..."
      Just b <- await
      lift . print $ a + b

    main = addGen >~> addCon

When run this prints

    Yielding 1
    Waiting for a...
    Yielding 2
    Waiting for b...
    3

Now, this all falls out of playing with what functor we give to FreeT. So far, we’ve gotten trampolines out of Identity, generators out of (,) a, and consumers out of (->) a.


April 08, 2014 12:00 AM

April 05, 2014

Lennart Augustsson

Haskell error reporting with locations, update

Since some people (I'm among them) dislike impure features in Haskell I thought I'd present a slight variation on the error location feature that is "pure".

First, the __LOCATION__ variable gets an abstract type. So

  data Location
  __LOCATION__ :: Location

It's defined in the Prelude and always in scope. The type cannot be compared, shown, or anything. There's just one thing that can be done, namely:

  extractLocation :: Location -> IO String

The error function needs a new exception to throw:

  data ErrorCallLoc = ErrorCallLoc Location String

  {-# LOCATIONTRANSPARENT error #-}
  error :: String -> a
  error s = throw (ErrorCallLoc __LOCATION__ s)

This means that the location string cannot be used when we throw the error. But it can be used where the error is caught, since this can only be done in the IO monad.

Under the hood everything is just as before: Location is just a string. It just can't be manipulated except in the IO monad, so we can pretend it's pure.

  newtype Location = Location String
  extractLocation (Location s) = return s

It now looks a lot like Michael Snoyman's proposal.

by augustss (noreply@blogger.com) at April 05, 2014 07:53 PM

Dan Piponi (sigfpe)

This is a test

Introduction

I like to see connections between different branches of mathematics. Here's an application of the same technique in two completely different areas that I haven't seen in the literature. It might take a bit of work to convince you that they are in fact the same thing, so here goes.

The dual of an optimization problem

Consider the optimization problem
$$ \begin{aligned} \mbox{min} & f(x) \\ \mbox{subject to} & x\le a\\ \end{aligned} $$
illustrated below and where $f$ is convex:

The essential features are there in this problem despite it being the most basic problem with an inequality.

Now consider the optimum given as a function of $a$. So $f_+(a)=\inf_{x\le a}f(x)$. You can think of $f_+$ as representing the "best value of $f$ so far". Its graph looks like this:

It's basically like the original $f$ except that we only have the descending parts and none of the ascending parts.

Now we're going to construct $f_+$ by a slightly roundabout way, not by considering the function itself, but its epigraph, the set of points above the graph. Here's the epigraph for the original $f$:

Notice I've drawn a bunch of lines just touching the epigraph, i.e., its subtangents. In general, any convex shape can be expressed as the intersection of half-planes. Those subtangents indicate a handful of the boundaries of those half-planes. We can construct those subtangents by considering the lines $y=\lambda x$ and then pushing those lines up or down so each is as high as possible without crossing $f$. We can find out how much pushing we need by solving the optimisation:
$$ \begin{aligned} \mbox{maximise} & \lambda x-f(x) \end{aligned} $$
For any $\lambda$ we call the optimal value $f^\ast(\lambda)$. $f^\ast$ is also known as the Legendre transform or Fenchel dual. It's basically telling us which half-planes define our epigraph.

Interestingly, reconstructing the function $f$ from $f^\ast$ requires exactly the same expression with $f^\ast$ substituted for $f$. So the Fenchel dual also takes a collection of half-planes and gives you back the original function.
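
Spelled out as suprema, the pair of transforms is

$$f^\ast(\lambda)=\sup_x\,\bigl(\lambda x-f(x)\bigr), \qquad f(x)=\sup_\lambda\,\bigl(\lambda x-f^\ast(\lambda)\bigr),$$

where the second equality holds for convex $f$ (strictly speaking, convex and lower semicontinuous; that caveat is mine and isn't needed for the pictures here).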

Now we construct the epigraph of $f_+$. We just want the descending parts of $f$. It's easy to construct the epigraph, we take the intersection of just those half-planes whose boundaries are decreasing. Here's a picture:

So here's a process for constructing $f_+$: we take the Fenchel dual of $f$, we throw away the "half" where $\lambda>0$ (the ascending subtangents), and now take the inverse Fenchel dual (which is in fact just the Fenchel dual).

Now let's go back to the original problem and construct its dual as described in numerous textbooks. We form the Lagrangian
$$L(a,x,\lambda)=f(x)+\lambda (x-a)$$
We then form the function
$$g(a,\lambda)=\min_x L(a,x,\lambda)$$
This is basically (modulo a sign) the Fenchel dual of $f$ with an extra $-\lambda a$ term. The dual problem, as in the textbooks, is
$$ \begin{aligned} \mbox{maximise} & g(a,\lambda)\\ \mbox{such that} & \lambda\ge0\\ \end{aligned} $$
Remembering that $-\lambda a$ term, that's almost the inverse Fenchel dual except we've maximised over just "half" of the $\lambda$s, those for which $\lambda\ge0$. When we compute the optimum of the dual problem we get $h(a)=\max_{\lambda\ge0} g(a,\lambda)$. If the problem is convex, we've just computed $f_+$ by another path. So this gives what I think is a clear picture of what the dual problem is. Forming the dual problem is (modulo a vertical shift) converting the original problem into finding the subtangents. Solving the dual problem is converting that back to a function again, but using only the downward sloping lines. So this illustrates the primal-dual theorem. We get the same result in both the primal and dual problems as the dual problem is simply solving our original problem via epigraphs.

The Riemann-Hilbert problem and the Hilbert Transform

Here's my latest toy:
It's a software defined radio I bought for $13 on Amazon. Here's what it does.

You have an electromagnetic field in the vicinity of the antenna. Let's pretend such fields, $\phi$, are scalars. Just about any kind of radio uses a carrier modulated in some way. For example AM radio encodes the audio (represented as a function of time, $A$) as something like
$$\phi=A(t)\cos(\omega t)$$

by Dan Piponi (noreply@blogger.com) at April 05, 2014 01:05 AM

April 04, 2014

FP Complete

Calculating the Minimum Variance Portfolio in R, Pandas and IAP

Introduction

As part of producing a demo for FP Complete's new IAP product, I wound up implementing the Minimum Variance Portfolio calculation for a stock portfolio in R, then in Haskell for the IAP, and finally in Python using the NumPy and SciPy extension libraries. I want to look at the process of writing each of these, and the resulting code in this article. I am not an expert in these things, so if you know a better way to do them, please let me know. In particular, the R example was borrowed from someone else.

The function

First, we find a version of the formula for MVP that can be converted into those systems. I like the one used by Yue Kuen KWOK:

ω = Ω⁻¹ 1 / (1 ⋅ Ω⁻¹ 1)

where Ω is the covariance matrix of the returns for the stocks in question and 1 is a vector of ones.
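
In more conventional matrix notation (my transcription of the same formula):

$$\omega = \frac{\Omega^{-1}\mathbf{1}}{\mathbf{1}^{\top}\,\Omega^{-1}\mathbf{1}}$$

The denominator is the scalar dot product of the vector of ones with Ω⁻¹1, which is the den term in the R and Python versions below and the tvu <.> tv term in the Haskell version.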

R version

The R version is fairly straightforward - we just need to put the pieces together:

minvar <- function (px){
    Rt <- px[-1,]/px[-nrow(px),]-1
    cov.Rt <- cov(Rt)
    one.vec <- rep(1,nrow(cov.Rt))
    num <- solve(cov.Rt) %*% one.vec
    den <- as.numeric(t(one.vec) %*% num)
    return(num/den)
}

Calculating returns can be done with standard matrix operations and slicing. The covariance function is built in, as is matrix inversion (solve). Since the numerator Ω⁻¹ 1 appears in the denominator, I reuse its value there.

All the array operations were documented in the same place. That I only needed one unit vector was a bit of a surprise, but R sized it dynamically to work. That I had to transpose the unit vector and use the cross product operator (%*%) to get a dot product was a a less pleasant surprise, but is apparently a standard R idiom.

Python version

The Python version is almost a direct copy of the R version.

def minvar(prices):
    cov = np.cov((prices[1:] / prices[:-1] - 1).transpose())
    vu = np.array(cov.shape[1] * [1], float)
    num = np.dot(np.linalg.inv(cov), vu)
    den = np.dot(vu, num)
    return num / den

In this case, I passed the returns matrix to the covariance function directly. The NumPy dot function performs both cross products and dot products, and again the unit vector adopts its length dynamically.

Documentation was a bit more scattered. Being a mix of traditional imperative and object-oriented, some functions are functions in the module, and others are object methods. The biggest surprise was that the returns matrix needed to be transposed before the covariance was calculated.

Haskell version

Haskell is not quite as obvious a translation of the R and Python versions, but is a more straightforward translation of the original formula - once you notice that Ω⁻¹ 1 has been factored into tv. It reads from top to bottom like the original, with the main formula at the top and the various terms defined below.

Returns again use the standard Haskell idiom for slicing the array. This is a bit more verbose than either R or Python, as they are functions rather than special syntax.

minVariance prices = tv / scalar (tvu <.> tv)
    where rets = dropRows 1 prices / takeRows (rows prices - 1) prices - 1
          (_, cov) = meanCov $ rets
          vu = constant 1.0 (cols cov)
          tvu = constant 1.0 (rows cov)
          tv = inv cov <> vu

The documentation was again straightforward, with everything being a function in the hmatrix package. In this case, both unit vectors were needed, as Haskell does not scale them automatically. It was the least surprising in at least one way: it used a distinct dot product operator for the two vectors rather than transposing the unit vector in a cross product (whether implicitly, as in Python, or explicitly, as in R).

Performance

While performance comparisons with IAP aren't very useful, as it runs in the cloud so doing comparisons on identical systems may be impossible, Haskell does have one interesting advantage.

All three of these systems have the same basic architecture - a high-level language running in an interactive environment, with bindings to low-level, fast implementations of the matrix manipulations. Haskell adds the ability to compile your program into native x86_64 code. Doing so reduced the wall clock time of this short demo by roughly two orders of magnitude.

Summary

I found the IAP version a little easier to deal with. Having custom operators and functions for everything (and this will only get better with time), along with being able to use the mathematical layout of the where clause, just made things a little simpler. While not having unit vectors that automatically size themselves (or transpose into matrices) is a little inconvenient, this also exposed a problem in the original R version, in that the unit vector's length was initially set wrong. I'm not sure that will make any real difference, but the thought that the language can catch such errors for me is comforting.

April 04, 2014 08:07 PM

Lennart Augustsson

Haskell error reporting with locations

Error reporting in GHC is not always the nicest. For example, I often develop code by using undefined as a placeholder for code I've not written yet. Here's a simple example:
import System.Environment

main = do
    args <- getArgs
    if null args then
      undefined
     else
      undefined
Running this looks like this:
$ ./H
  H: Prelude.undefined
Which undefined caused that? Looking at the error message we have no idea. Wouldn't it be nice with some location information?

We can actually get location information by using Control.Exception.assert:

import Control.Exception(assert)
import System.Environment

main = do
    args <- getArgs
    if null args then
      assert False undefined
     else
      assert False undefined
Now running it is much more informative:
$ ./H
  H: H.hs:7:9-14: Assertion failed
How is assert able to report the location? If we dig deep enough we discover that it's because the ghc compiler contains a special hack to recognize this function and give it location information.

A generalized hack

In a Haskell compiler that I've implemented I've taken this compiler hack and extended it so it can be used for any function.  It comes in two parts, location information and location transparent definitions.

__LOCATION__

The __LOCATION__ identifier is always defined and utterly magical. Its value is a string that describes the location of that very identifier. This is the very opposite of a referentially transparent name. In fact its value varies with where it is placed in the code! So it's definitely not for purists. But I'm a practical man, so I sometimes have to resort to the ugliness of reality. And in reality we want to report locations in errors.

Enough philosophy, here's an example:

main = do
    print __LOCATION__
    print   __LOCATION__
And running it prints:
"test/Test.hs:2:11"
  "test/Test.hs:3:13"
And to illustrate the impurity:
main = do
    let loc = __LOCATION__
    print loc
    print loc
And running this:
"test/Test.mu:2:15"
  "test/Test.mu:2:15"

Location transparency

The __LOCATION__ identifier gives the location of itself. This is of little use on its own. Imagine the definition we could give for undefined. Somewhere in the Prelude module it could say something like:
undefined = error ("undefined: " ++ __LOCATION__)
But if we use this, all it will tell us is where the definition of undefined is, not where it was used.

To get the point of use instead of the definition I've introduced location transparent definitions. In a location transparent definition the __LOCATION__ identifier will not refer to its own position, but to the position of the reference to the definition. Location transparency is introduced with a pragma.

{-# LOCATIONTRANSPARENT undefined #-}
undefined = error ("undefined: " ++ __LOCATION__)
With this definition our initial example looks like this when we run it:
undefined: test/H.hs:6:9
In fact, the real definition of undefined doesn't look like that. The __LOCATION__ identifier is only used in the definition of error, so it looks something like this:
{-# LOCATIONTRANSPARENT error #-}
error :: String -> a
error s = throw (ErrorCall (__LOCATION__ ++ ": " ++ s))

{-# LOCATIONTRANSPARENT undefined #-}
undefined = error "undefined"
Since both error and undefined are transparent any use of undefined will be reported with the location of the use.

Furthermore, we can make a few more functions location transparent, e.g.,

{-# LOCATIONTRANSPARENT head #-}
head :: [a] -> a
head [] = error "Empty list"
head (x:xs) = x
A simple example:
main = putStr (head [])
Which will print:
test/Head.hs:1:16: Empty list
which is the location where head was called.

Implementation

There are different ways to implement this feature, and I'm going to sketch two of them.

First: Every function that has the LOCATIONTRANSPARENT pragma will be inlined at the point of use, and the __LOCATION__ identifier in the inlined code will be updated to reflect the call site. The definitions must be processed in a bottom-up fashion for this to work. It's fairly simple to implement, but will cause some code bloat due to inlining.

Second: Every function that has LOCATIONTRANSPARENT pragma will be rewritten (by the compiler) to have an extra location argument, and each use of this function will be rewritten to pass in the current location. For example (using $$f for the location version of f):

main = putStr ($$head __LOCATION__ [])

$$head __LOCATION__ [] = $$error __LOCATION__ "Empty list"
$$head __LOCATION__ (x:xs) = x
$$error __LOCATION__ s = throw (ErrorCall (__LOCATION__ ++ ": " ++ s))
This should be fairly straightforward to implement, but I've not tried it. (It's somewhat like dynamic binding, so maybe ghc could reuse that mechanism for locations.)

And, of course, the global __LOCATION__ identifier has to be recognized by the compiler and replaced by a string that is its location.

Conclusion

I implemented the __LOCATION__ hack quite a while ago, and I like the much improved reporting of error locations. I hope someone will add it to ghc as well.

by augustss (noreply@blogger.com) at April 04, 2014 07:28 AM

April 03, 2014

Thiago Negri

Premature optimization: it's not just about code

When we talk about performance, we shouldn't look only at the performance of code execution. There are other kinds we should care about, like time to market, development time and processes. Actually, we shouldn't look at any of these performance measures from a single perspective. As I see it, all these measures sum up to constitute a single value of business performance.

There are lots of things that can slow down your business, and if you have a slow business, you'll have a hard time keeping pace with your rivals. As you discover that some areas of your company may be slowing you down, you may become tempted to add key performance indicators to each step, trying to squeeze all possible value out of each part of your processes. This can bring up a whole new set of problems: your business can be seen as a complex system, and it will adapt itself to render good results even in chaotic scenarios. [1]

So, to avoid going nuts with indicators all over the place, you decide to avoid bottlenecks from the start. For each and every new thing you need to deliver, you'll start a precise planning routine, choosing among a plethora of providers, technologies and unicorns to avoid, at all costs, having a new bottleneck in the future. This is what I like to call premature optimization of business performance.

Simply put: you can't have a business if you don't have a business. How can you possibly know all the events that will happen to have an impact on the decision you take today? You can't. I'm not saying that planning is wrong or a Bad Thing™. What I really want to avoid is losing energy on stuff that won't make a difference. You may spend a whole year choosing between apples and oranges, and in the end be crushed by some weirdo playing around with random technology. Why? Because he was not worried about future bottlenecks. He was really trying to do something cool, and he did it while you were arguing in endless and purposeless meetings.

You're always trading performance measurements. For example, everyone is talking about service oriented architecture (SOA), but what does it really mean in terms of business? It means a tradeoff between code execution performance and other benefits to the business as a whole, like decentralized growth and continuous delivery. Or, you may look at the difference between Apple's App Store and Google's Play Store model. There is a huge tradeoff of quality assurance and time to market between these two players: one offers their customers (smartphone owners) quality assured apps; the other offers their customers (operating system users) fast time to market for their apps.

The big question really is: what are you trying to deliver to your customer? Are you trying to deliver software over time or value over time? I bet most of you are trying to deliver value over time, so why bother with premature optimization of technologies or processes? Let the bad stuff die on its own: failing fast is orders of magnitude better than not doing anything at all. And let the good stuff accompany you to where you want to go.

Don't bother guessing the future, embrace the uncertainty of life: you can't predict how a complex system will behave, you can't know how your systems, services or whatever you are doing will be used in the real world. Put it into the wild, take the complaints and tune it (or throw it away) afterwards, this is a constant process. [2]

Start measuring how much you are delivering over time and stop over-planning. Your end goal is to deliver stuff. The world couldn't care less about what you can do, it only cares about what you do.

References

  1. Complex Systems and Key Performance Indicators - Márcio Geovani Jasinski
  2. Complaint-Driven Development - Jeff Atwood

by Thiago Negri (noreply@blogger.com) at April 03, 2014 04:46 PM

Daniil Frumin

Heads up: scotty-hastache 0.2.1

To accommodate the release of hastache-0.6, I've updated the scotty-hastache package to version 0.2.1.


Tagged: haskell, hastache, scotty, web

by Dan at April 03, 2014 01:59 PM

Lennart Augustsson

A small Haskell extension

The extension

In Haskell you can give a type to an expression by writing expr ::  type.  To an untrained eye the :: looks just like an infix operator, even if it is described as a special syntactical construct in the Haskell report.  But let's think about it as an infix operator for a moment.

For an infix operator you can form a section, i.e., a use of the operator with one operand left out.  For instance (* 2) leaves out the first operand, and Haskell defines this to be the same as (\ x -> x * 2).  Regarding :: as an operator we should be able to write (:: type) and it should have the obvious meaning (\ x -> x :: type).

I suggest, and I plan on sending this to the haskell-prime mailing list, that Haskell should adopt this small extension.

Why?

First, the extension is very light weight and has almost no extra intellectual weight for anyone learning Haskell.  I'd argue it makes the language simpler because it allows :: to be treated more like an infix operator.  But without use cases this would probably not be enough of an argument.

Example 1

We want to make a function, canonDouble, that takes a string representing a Double and changes it to the standard Haskell string representing this Double.  E.g. canonDouble "0.1e1" == "1.0".  A first attempt might look like this:

  canonDouble :: String -> String
  canonDouble = show . read         -- WRONG!

This is, of course, wrong since the compiler cannot guess that the type between read and show should be a Double.  We can convey this type information in different ways, e.g.:

  canonDouble :: String -> String
  canonDouble = show . asDouble . read
    where asDouble :: Double -> Double
          asDouble x = x

This is somewhat clumsy.  Using my proposed extension we can instead write:

  canonDouble :: String -> String
  canonDouble = show . (:: Double) . read

This has the obvious meaning, and succinctly describes what we want.

Example 2

In ghc 7.8 there is a new, better implementation of Data.Typeable.  It used to be (before ghc 7.8) that to get a TypeRep for some type you would have to have a value of that type.  E.g., typeOf True gives the TypeRep for the Bool type.  If we don't have a value handy of the type, then we will have to make one, e.g., by using undefined.  So we could write typeOf (undefined :: Bool).

This way of using undefined is rather ugly, and relies on non-strictness to work.  Ghc 7.8 add a new, cleaner way of doing it.

  typeRep :: proxy a -> TypeRep

The typeRep function does not need an actual value, but just a proxy for the value.  A common proxy is the Proxy type from Data.Proxy:

  data Proxy a = Proxy

Using this type we can now get the TypeRep of a Bool by writing typeRep (Proxy :: Proxy Bool).  Note that in the type signature of typeRep the proxy is a type variable.  This means we can use other values as proxies, e.g., typeRep ([] :: [Bool]).

We can in fact use anything as a proxy that has a structure that unifies with proxy a.  For instance, if we want a proxy for the type T we could use T -> T, which is the same as (->) T T.  The (->) T part makes up the proxy and the last T makes up the a.
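
For example, here is a small ghci sketch of my own using that trick with the existing syntax (before the proposed extension):

  ghci> import Data.Typeable
  ghci> typeRep (id :: Bool -> Bool)
  Bool

Here id :: Bool -> Bool plays the role of the T -> T proxy.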

The extension I propose provides an easy way to write a function of type T -> T, just write (:: T).  So to get a TypeRep for Bool we can simply write typeRep (:: Bool).  Doesn't that look (deceptively) simple?

In fact, my driving force for coming up with this language extension was to get an easy and natural way to write type proxies, and I think using (:: T) for a type proxy is a as easy and natural as it gets (even if the reason it works is rather peculiar).

Implementation

I've implemented the extension in one Haskell compiler and it was very easy to add and it works as expected.  Since it was so easy, I'll implement it for ghc as well, and the ghc maintainers can decide if they want to merge it.  I suggest this new feature be made available using the language extension name SignatureSections.

Extensions

Does it make sense to do a left section of ::?  I.e., does (expr ::) make sense?  In current Haskell that does not make sense, since it would be an expression that lacks an argument that is a type.  Haskell doesn't currently allow explicit type arguments, but if it ever will this could be considered.

With the definition that (:: T) is the same as (\ x -> x :: T) any use of quantified or qualified types as T will give a type error.  E.g., (:: [a]), which is (\ x -> x :: [a]), is a type error.  You could imagine a different desugaring of (:: T), namely (id :: T -> T).  Now (:: [a]) desugars to (id :: [a] -> [a]) which is type correct.  In general, we have to keep quantifiers and qualifiers at the top, i.e., (:: forall a . a) turns into (id :: forall a . a -> a).

Personally, I'm not convinced this more complex desugaring is worth the extra effort.

by augustss (noreply@blogger.com) at April 03, 2014 07:40 AM

Yesod Web Framework

Consolidation progress: shakespeare, conduit, http-client

My last blog post detailed a number of changes I was going to be making for package consolidation. A number of those have gone through already, this blog post is just a quick summary of the changes.

shakespeare

shakespeare is now a single package. hamlet, shakespeare-css, shakespeare-js, shakespeare-i18n, shakespeare-text, and servius have all been merged in and marked as deprecated. I've also uploaded new, empty versions of those deprecated packages. This means that, in order to support both the old and new versions of shakespeare, you just need to ensure that you have both the shakespeare and deprecated packages listed in your cabal file. In other words, if previously you depended on hamlet, now you should depend on hamlet and shakespeare. When you're ready to drop backwards compatibility, simply put a lower bound of >= 2.0 on shakespeare and remove the deprecated packages.

(Note: this method for dealing with deprecated packages is identical for all future deprecations, I won't detail the steps in the rest of this blog post.)

conduit

conduit-extra now subsumes attoparsec-conduit, blaze-builder-conduit, network-conduit, and zlib-conduit. It also includes three modules that used to be in conduit itself: .Text, .Binary, and .Lazy. To deal with this change, simply adding conduit-extra to your dependencies should be sufficient.

The other changes have to do with resourcet. In particular:

  • Data.Conduit no longer reexports identifiers from resourcet and monad-control. These should be imported directly from their sources.
  • Instead of defining its own MonadThrow typeclass, resourcet now uses the MonadThrow typeclass from the exceptions package. For backwards compatibility, Control.Monad.Trans.Resource provides monadThrow as an alias for the new throwM function.
  • The Resource monad had a confusing name, in that it wasn't directly related to the ResourceT transformer. I've renamed it to Acquire, and put it in its own module (Data.Acquire).
    • I'm actually very happy with Acquire, and think it's a great alternative to hard-coding either the bracket pattern or resourcet into libraries. I'm hoping to add better support to WAI for Acquire, and blog a bit more about the usage of Acquire (there's a small sketch of the idea just after this list).
  • MonadUnsafeIO has been removed entirely. All of its functionality can be replaced with MonadPrim and MonadBase (for example, see the changes to blaze-builder-conduit).
  • MonadActive, which is only needed for Data.Conduit.Lazy, has been moved to that module.
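
To make the Acquire idea concrete, here is a minimal sketch (my example, not taken from the post or the resourcet docs); it assumes Data.Acquire exports mkAcquire and with, and the file name is made up:

import Data.Acquire (Acquire, mkAcquire, with)
import System.IO (Handle, IOMode (ReadMode), hClose, hGetLine, openFile)

-- Pair an acquire action with its release action.
fileHandle :: FilePath -> Acquire Handle
fileHandle fp = mkAcquire (openFile fp ReadMode) hClose

-- 'with' gives the usual bracket behaviour: the handle is closed
-- even if the inner action throws.
main :: IO ()
main = with (fileHandle "input.txt") $ \h ->
  hGetLine h >>= putStrLn

The nice part is that fileHandle is a first-class value: a library can hand it out without hard-coding either bracket or ResourceT.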

http-client

http-client-multipart has been merged into http-client. In addition, instead of using the failure package, http-client now uses the exceptions package.

http-client-conduit has been merged into http-conduit. I've also greatly expanded the Network.HTTP.Client.Conduit module to contain what I consider its next-gen API. In particular:

  • No usage of ResumableSource.
  • Instead of explicit ResourceT usage, it uses the Acquire monad and bracket pattern (acquireResponse, withResponse).
  • Instead of explicitly passing around a Manager, it uses MonadReader and the HasHttpManager typeclass.

I'm curious how people like the new API. I have no plans on removing or changing the current Network.HTTP.Conduit module, this is merely an alternative approach.

Updated yesod-platform

I've also released a new version of yesod-platform that uses the new versions of the packages above. A number of packages on Hackage still depend on conduit 1.0, but I've sent quite a few pull requests in the past few days to get things up-to-date. Thankfully, maintaining compatibility with both 1.0 and 1.1 is pretty trivial.

April 03, 2014 07:07 AM

April 02, 2014

Functional Jobs

Server Game Developer at Quark Games (Full-time)

Quark Games was established in 2008 with the mission to create hardcore games for the mobile and tablet platforms. By focusing on making high quality, innovative, and engaging games, we aim to redefine mobile and tablet gaming as it exists today.

We seek to gather a group of individuals who are ambitious but humble professionals who are relentless in their pursuit of learning and sharing knowledge. We're looking for people who share our passion for games, aren’t afraid to try new and different things, and inspire and push each other to personal and professional success.

As a Server Game Developer, you’ll be responsible for implementing server related game features. You’ll be working closely with the server team to create scalable infrastructure as well as the client team for feature integration. You’ll have to break out of your toolset to push boundaries on technology to deliver the most robust back end to our users.

What you’ll do every day

  • Develop and maintain features and systems necessary for the game
  • Collaborate with team members to create and manage scalable architecture
  • Work closely with Client developers on feature integration
  • Solve real time problems at a large scale
  • Evaluate new technologies and products

What you can bring to the role

  • Ability to get stuff done
  • Desire to learn new technologies and design patterns
  • Care about creating readable, reusable, well documented, and clean code
  • Passion for designing and building systems to scale
  • Excitement for building and playing games

Bonus points for

  • Experience with a functional language (Erlang, Elixir, Haskell, Scala, Julia, Rust, etc.)
  • Experience with a concurrent language (Erlang, Elixir, Clojure, Go, Scala, etc.)
  • Being a polyglot programmer and having experience with a wide range of languages (Ruby, C#, and Objective-C)
  • Experience with database integration and management for NoSQL systems (Riak, Couchbase, Redis, etc.)
  • Experience with server operations, deployment, and with tools such as Chef or Puppet
  • Experience with system administration

Get information on how to apply for this position.

April 02, 2014 11:36 PM

Mateusz Kowalczyk

Haddock 2.14.2

Posted on April 2, 2014 by Fūzetsu

This is just a quick follow-up to my previous post. We have now released Haddock 2.14.2, which contains a few minor changes. The reason for this release is to get a few quick patches in. No fancy overview today, just quick mentions. Here is the relevant part of the changelog:

Changes in version 2.14.2

  • Always drop the --split-objs GHC flag for performance reasons (#292)

  • Print kind signatures for GADTs (#85)

  • Drop single leading whitespace when reasonable from @-style blocks (#201)

  • Fix crashes associated with exporting data family record selectors (#294)

#201 was the annoying aesthetics bug I mentioned last time, and that is now fixed.

#294 was a bug we’re glad to have gotten rid of: it was only reported recently, but I imagine more and more projects would have started to hit it.

#292 should improve performance considerably in some special cases, such as when Template Haskell is being used.

#85 was just a quick resolution of a years-old ticket; I think you’ll find it useful.

I predict that this is the version that will ship with GHC 7.8.1 and I don’t think we’ll have any more 2.14.x releases.

Ideally I’d like to get well under 100 open tickets for the next release (there are currently 117 open).

Some things I will be concentrating on next are splitting up Haddock into a few packages and working on the Hoogle back-end. The Hoogle back-end is incredibly broken, which is a shame considering Hoogle is a very useful service. We want to make the maintainer’s life easier.

Splitting up Haddock into a few packages will be of great advantage to people wishing to use (parts of) Haddock as a library without adding a dependency on a specific version of GHC to their program. It should also become much easier to implement and maintain your own back-ends.

If you are interested in helping out with Haddock, we’d love to have you. Pop into #haddock on Freenode, make some noise and wait for someone to respond. Alternatively, contact me through other means.

PS: While I realise that some of my posts make it on reddit, I myself do not use it. You’re welcome to discuss these but if you leave questions or messages to me on reddit, I will almost certainly not see them. If you want my attention, please either use e-mail or IRC. Thanks!

April 02, 2014 11:03 PM

Dominic Steinitz

Student’s T and Space Leaks

Introduction

The other speaker at the Machine Learning Meetup at which I gave my talk on automatic differentiation gave a very interesting talk on A/B testing. Apparently this is big business these days as attested by the fact I got 3 ads above the wikipedia entry when I googled for it.

It seems that people tend to test with small sample sizes and to do so very often, resulting in spurious results. Of course readers of XKCD will be well aware of some of the pitfalls.

I thought a Bayesian approach might circumvent some of the problems and set out to write a blog article, only to discover that there was no Haskell library for sampling from Student’s t. Actually there was one, but it is currently an unreleased part of random-fu. So I set about fixing this shortfall.

I thought I had better run a few tests so I calculated the sampled mean, variance, skewness and kurtosis.

I wasn’t really giving this my full attention and as a result ran into a few problems with space. I thought these were worth sharing and that is what this blog post is about. Hopefully, I will have time soon to actually blog about the Bayesian equivalent of A/B testing.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> 
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> 
> module StudentTest (
>     main
>   ) where
> 
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Data.Random.Distribution.T
> import Control.Monad.State
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List

Space Analysis

Let’s create a reasonable number of samples as the higher moments converge quite slowly.

> nSamples :: Int
> nSamples = 1000000

An arbitrary seed for creating the samples.

> arbSeed :: Int
> arbSeed = 8

Student’s t only has one parameter, the number of degrees of freedom.

> nu :: Integer
> nu = 6

Now we can do our tests by calculating the sampled values.

> ts :: [Double]
> ts =
>   evalState (replicateM nSamples (sample (T nu)))
>             (pureMT $ fromIntegral arbSeed)
> mean, variance, skewness, kurtosis :: Double
> mean = (sum ts) / fromIntegral nSamples
> variance = (sum (map (**2) ts)) / fromIntegral nSamples
> skewness = (sum (map (**3) ts) / fromIntegral nSamples) / variance**1.5
> kurtosis = (sum (map (**4) ts) / fromIntegral nSamples) / variance**2

This works fine for small sample sizes but not for the number we have chosen.

./StudentTest +RTS -hc
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

It seems a shame that the function in the Prelude has this behaviour, but never mind; let us ensure that we consume values strictly (they are being produced lazily).

> mean' = (foldl' (+) 0 ts) / fromIntegral nSamples
> variance' = (foldl' (+) 0 (map (**2) ts)) / fromIntegral nSamples
> skewness' = (foldl' (+) 0 (map (**3) ts) / fromIntegral nSamples) / variance'**1.5
> kurtosis' = (foldl' (+) 0 (map (**4) ts) / fromIntegral nSamples) / variance'**2

We now have a space leak on the heap, as a ghc heap profile shows. What went wrong?

If we only calculate the mean using foldl' then all is well. Instead of 35M we only use 45K.

Well, that gives us a clue. The garbage collector cannot reclaim the samples as they are needed for the other calculations. What we need to do is calculate all the moments strictly, in a single pass.

Let’s create a strict record to do this.

> data Moments = Moments { m1 :: !Double
>                        , m2 :: !Double
>                        , m3 :: !Double
>                        , m4 :: !Double
>                        }
>   deriving Show

And calculate the results strictly.

> 
> m  = foldl' (\m x -> Moments { m1 = m1 m + x
>                              , m2 = m2 m + x**2
>                              , m3 = m3 m + x**3
>                              , m4 = m4 m + x**4
>                              }) (Moments 0.0 0.0 0.0 0.0) ts
> 
> mean''       = m1 m / fromIntegral nSamples
> variance''   = m2 m / fromIntegral nSamples
> skewness''   = (m3 m / fromIntegral nSamples) / variance''**1.5
> kurtosis'' = (m4 m / fromIntegral nSamples) / variance''**2

Now we have what we want; the program runs in small constant space.

> main :: IO ()
> main = do
>   putStrLn $ show mean''
>   putStrLn $ show variance''
>   putStrLn $ show skewness''
>   putStrLn $ show kurtosis''

Oh and the moments give the expected answers.

ghci> mean''
  3.9298418844289093e-4

ghci> variance''
  1.4962681916693004

ghci> skewness''
  1.0113188204317015e-2

ghci> kurtosis''
  5.661776268997382

Running the Code

To run this you will need my version of random-fu. The code for this article is here. You will need to compile everything with profiling, something like

ghc -O2 -main-is StudentTest StudentTest.lhs -prof
    -package-db=.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d

Since you need all the packages to be built with profiling, you will probably want to build using a sandbox as above. The only slightly tricky aspect is building random-fu so it is in your sandbox.

runghc Setup.lhs configure --enable-library-profiling
--package-db=/HasBayes/.cabal-sandbox/x86_64-osx-ghc-7.6.2-packages.conf.d
--libdir=/HasBayes/.cabal-sandbox/lib

by Dominic Steinitz at April 02, 2014 10:17 AM

Lee Pike

Embedded Security in the Mainstream

Science Friday had an interesting podcast with Bruce Schneier worth a listen.  It discussed security in embedded systems (or as is hip to say now, the Internet of Things).  The idea of security in embedded systems seemed pretty obscure just a few years ago, so it’s nice to see it being featured on a general-audience radio show.

So who’s going to listen to it from their phone, connected to their car’s audio over Bluetooth, on the way to the store to buy a Nest smoke alarm?


by Lee Pike at April 02, 2014 04:21 AM

April 01, 2014

Neil Mitchell

Exceptional Testing

Summary: Failing properties should throw exceptions, not return False.

When testing properties in Haskell with QuickCheck we usually write a predicate that takes some arguments, and returns a boolean. For example:

import Test.QuickCheck
main = quickCheck $ \x -> exp (log (abs x)) == abs (x :: Double)

Here we are checking that for all positive Double values, applying log then exp is the identity. This statement is incorrect for Double due to floating point errors. Running main we get:

*** Failed! Falsifiable (after 6 tests and 2 shrinks):
3.0

QuickCheck is an incredibly powerful tool - it first found a failing example, and then automatically simplified it. However, it's left out some important information - what is the value of exp (log 3)? For my tests I usually define === and use it instead of ==:

a === b | a == b    = True
        | otherwise = error $ show a ++ " /= " ++ show b

Now when running main we get:

*** Failed! Exception: '3.0000000000000004 /= 3.0'
(after 2 tests and 2 shrinks):
3.0

We can immediately see the magnitude of the error introduced, giving a kick-start to debugging.

Do we still need Bool?

When writing functions returning Bool there are three interesting cases:

  • The function returns True, the test passes, continue on.
  • The function returns False, the test fails, with no information beyond the input arguments.
  • The function throws an exception, the test fails, but with a human readable error message about the point of failure.

Of these, returning False is the least useful, and entirely subsumed by exceptions. It is likely there was additional information available before reducing to False, which has now been lost, and must first be recovered before debugging can start.

Given that the only interesting values are True and exception, we can switch to using the () type, where passing tests return () and failing tests throw an exception. However, working with exceptions in pure code is a bit ugly, so I typically define:

import Control.Monad
(===) :: (Show a, Eq a) => a -> a -> IO ()
a === b = when (a /= b) $ error $ show a ++ " /= " ++ show b

Now === is an action, and passing tests do nothing, while failing tests raise an error. This definition forces all tests to end up in IO, which is terrible for "real" Haskell code, where pure and IO computations should be segregated. However, for tests, as long as the test is repeatable (doesn't store some temporary state) then I don't worry.

To test the IO () returning property with QuickCheck we can define:

import System.IO.Unsafe (unsafePerformIO)

instance Testable () where
    property () = property True
instance Testable a => Testable (IO a) where
    property = property . unsafePerformIO

Now QuickCheck can work with this === definition, but also any IO property. I have found that testing IO properties with QuickCheck is very valuable, even without using ===.

Non-QuickCheck tests

Whilst I have argued for exceptions in the context of QuickCheck tests, I use the same exception pattern for non-parameterised/non-QuickCheck assertions. As an example, taken from Shake:

test = do
    dropDirectory1 "aaa/bbb" === "bbb"
    dropDirectory1 "aaa/" === ""
    dropDirectory1 "aaa" === ""
    dropDirectory1 "" === ""

Using IO () properties allows for trivial sequencing. As a result, the tests are concise, and the effort to add additional tests is low.

(===) in QuickCheck

In QuickCheck-2.7 the === operator was introduced (which caused name clashes in both Shake and Hoogle - new versions of both have been released). The === operator uses the Property data type in QuickCheck to pass both the Bool value and additional information together. I prefer my definition of === because it's simpler, doesn't rely on QuickCheck and makes it clear how to pass additional information beyond just the arguments to ===. As an example, we could also return the log value:

\x -> when (exp (log (abs x)) /= abs x) $
    error $ show (x, log $ abs x, exp $ log $ abs x)

However, the QuickCheck === is a great improvement over ==, and should be a very popular addition.
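
For completeness, here is the opening example again using QuickCheck 2.7's ===; on failure it reports both sides of the comparison, much like the definition above:

import Test.QuickCheck

-- Same property as before, but (===) builds a Property that records
-- both values when the test fails.
main :: IO ()
main = quickCheck $ \x -> exp (log (abs x)) === abs (x :: Double)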

by Neil Mitchell (noreply@blogger.com) at April 01, 2014 08:57 PM

Well-Typed.Com

Fixing foldl

The foldl function is broken. Everyone knows it’s broken. It’s been broken for nearly a quarter of a century. We should finally fix it!

Today I am proposing that Prelude.foldl be redefined using the implementation currently known as Data.List.foldl'.

foldl is broken!

I’m sure you knew that already, but just in case…

Have you ever noticed that Haskellers usually recommend using either foldr or foldl' but not foldl? For example Real World Haskell has this to say:

Due to the thunking behaviour of foldl, it is wise to avoid this function in real programs: even if it doesn’t fail outright, it will be unnecessarily inefficient. Instead, import Data.List and use foldl'.

In the online version of the book the first user comments on that paragraph are

Why isn’t Data.List foldl implementation placed in Prelude?

I second the question: Why isn’t foldl' the default?

Good question.

Ok, so obviously we’re talking about the difference between foldl and foldl':

foldl :: (b -> a -> b) -> b -> [a] -> b
foldl f a []     = a
foldl f a (x:xs) = foldl f (f a x) xs

foldl' :: (b -> a -> b) -> b -> [a] -> b
foldl' f a []     = a
foldl' f a (x:xs) = let a' = f a x in a' `seq` foldl' f a' xs

The dry technical difference is that foldl' evaluates the call to f before making the next recursive call. The consequences of that are perhaps not immediately obvious, so we’ll take a step back and look at a slightly bigger picture.

Folding left and right

When we first learn Haskell we learn that there are two ways to fold a list, from the left or right.

foldl f z [x1, x2, ..., xn] = (...((z `f` x1) `f` x2) `f`...) `f` xn

foldr f z [x1, x2, ..., xn] = x1 `f` (x2 `f` ... (xn `f` z)...)

Saying “from the left” or “from the right” is a description of what foldl and foldr calculate, with the parenthesis nesting to the left or to the right. At runtime of course we always have to start from the left (front) of the list.

We later learn other ways of thinking about left and right folds, that a left fold can be used like a classic loop where we go through the whole list eagerly, while a right fold can be used like a demand-driven iterator. For the left fold that means using a function that is strict in its first argument (like (+)) while for the right fold that means using a function that is not strict in its second argument (like (:)).

Indeed when looking at whether we want foldl or foldr in any particular case our choice is usually governed by whether we want “all at once” behaviour (foldl) or if we want incremental or short-cut behaviour (foldr).

Accumulating thunks

Again, as we are learning Haskell, we get told that foldl has this crazy behaviour

foldl (+) 0 (1:2:3:[])
          =  foldl (+) (0 + 1)             (2:3:[])
          =  foldl (+) ((0 + 1) + 2)       (3:[])
          =  foldl (+) (((0 + 1) + 2) + 3) []
          =            (((0 + 1) + 2) + 3)

when what we had in mind when we thought of an accumulating loop was

foldl' (+) 0 (1:2:3:[])
          =  foldl' (+) 1 (2:3:[])
          =  foldl' (+) 3 (3:[])
          =  foldl' (+) 6 []
          =             6

Of course that’s just what foldl' does, it evaluates the call to + before making the next recursive call.
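
To make the difference tangible, here is a small experiment you can run yourself (compiled without optimisation, or in GHCi; the exact failure mode of the lazy version, stack exhaustion versus merely a lot of memory, depends on your GHC version and RTS settings):

import Data.List (foldl')

-- Summing a big list: foldl' keeps the accumulator evaluated and runs in
-- small constant space, while the lazy foldl first builds a chain of ten
-- million (+) thunks before any addition happens.
main :: IO ()
main = do
  print (foldl' (+) 0 [1 .. 10^7 :: Integer])
  print (foldl  (+) 0 [1 .. 10^7 :: Integer])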

When is foldl (rather than foldl') useful?

The short answer is “almost never”.

As beginners we often assume that foldl must still make sense in some cases (or why have both?) but it turns out that’s not really the case.

When the f argument of foldl is a strict function then delaying the evaluation does not gain us anything as it all has to be evaluated at the end anyway. The only time when delaying the evaluation could save us anything is when the f function is not strict in its first argument – in which case you either don’t care or probably should have been using foldr in the first place.

In fact even if our f function is non-strict in its first argument, we probably do not gain anything from delaying evaluation and it usually does no harm to evaluate earlier. Remember that we still have to traverse the whole list, we don’t get any short-cutting behaviour like with foldr.

We can, if we think about it, construct examples where foldl' would be too strict. We could define last and last' like this:

last  = foldl  (\_ y -> y) (error "empty list")

last' = foldl' (\_ y -> y) (error "empty list")

Now if we try

> last [1,undefined,3]
3
> last' [1,undefined,3]
*** Exception: Prelude.undefined

This is because our accumulator is always the latest element that we’ve seen but we don’t actually want to evaluate the elements (except the last one).

So it’s true that foldl' fails in this case, but it’s also a silly definition, the usual definition is a lot clearer

last [x]    = x
last (_:xs) = last xs
last []     = error "empty list"

That goes for pretty much all the other examples you might be able to think of where foldl would work but foldl' would not: the examples are either artificial or are clearer written in other ways.

People sometimes point out that sum is defined using foldl and not foldl' and claim that this is something to do with Haskell’s designers wanting to allow Num instances where (+) might be lazy. This is pretty much nonsense. If that were the case then sum would have been defined using foldr rather than foldl to properly take advantage of short-cutting behaviour. A much simpler explanation is that foldl' was not available in early versions of Haskell when sum was defined.

In nearly 15 years as a Haskell programmer I think I’ve specifically needed foldl rather than foldl' about three times. I say “about” because I can only actually remember one. That one case was in a slightly cunning bit of code for doing cache updates in a web server. It would almost certainly have been clearer as a local recursion but I was amused to find a real use case for foldl and couldn’t help myself from using it just for fun. Of course it needed a comment to say that I was using it on purpose rather than by mistake!

So why do we have foldl and foldl'?

If foldl is almost always a mistake (or merely benign) then why do we have it in the first place?

I don’t know for sure, but here’s my guess…

When Haskell 1.0 was published on this day 24 years ago there was no seq function at all, so there was no choice but to define foldl in the “classic” way.

Eventually, six years later after much discussion, we got the seq function in Haskell 1.3. Though actually in Haskell 1.3 seq was part of an Eval class, so you couldn’t use it just anywhere, such as in foldl. In Haskell 1.3 you would have had to define foldl' with the type:

foldl' :: Eval b => (b -> a -> b) -> b -> [a] -> b

Haskell 1.4 and Haskell 98 got rid of the Eval class constraint for seq but foldl was not changed. Hugs and GHC and other implementations added the non-standard foldl'.

I suspect that people then considered it a compatibility and inertia issue. It was easy enough to add a non-standard foldl' but you can’t so easily change the standard.

I suspect that if we had had seq from the beginning then we would have defined foldl using it.

Miranda, one of Haskell’s predecessor languages, already had seq 5 years before Haskell 1.0.

A strict foldl in Orwell!

Orwell is an interesting case. Orwell was another Haskell predecessor, very similar to Miranda and early Haskell. An informed source told me that Orwell had defined its foldl in the way that we now define foldl', ie with strict evaluation. Information on Orwell is a little hard to get ahold of online these days so I asked Philip Wadler. Phil very kindly fished out the manuals and looked up the definitions for me.

In the original version:

An Introduction to Orwell
(DRAFT)
Philip Wadler
1 April 1985

In the standard prelude:

lred f a []  =  a
lred f a (x:xs) = lred f (f a x) xs

But five years later, by the time Haskell 1.0 is being published…

An Introduction to Orwell 6.00
by Philip Wadler
revised by Quentin Miller
Copyright 1990 Oxford University Computing Lab

In the standard prelude:

foldl :: (a -> b -> a) -> a -> [b] -> a
foldl f a []  =  a
foldl f a (x:xs)  =  strict (foldl f) (f a x) xs

Note the use of strict. Presumably Orwell’s strict function was defined as (or equivalent to)

strict :: (a -> b) -> a -> b
strict f x = x `seq` f x

(These days in Haskell we call this function ($!).)

So my source was right, Orwell did change foldl to be the strict version!

I contend that this was and is the right decision, and that it was just a consequence of the late arrival of seq in Haskell and inertia and fears about backwards compatibility that have kept us from fixing foldl.

Just do it!

It’d help all of us who are sometimes tempted to use foldl because we can’t be bothered to import Data.List. It’d help confused beginners. It’d save teachers from the embarrassment of having to first explain foldl and then why you should never use it.

Orwell fixed this mistake at least 24 years ago, probably before Haskell 1.0 was released. Just because it’s an old mistake doesn’t mean we shouldn’t fix it now!

A postscript: which foldl'?

I hate to complicate a simple story but I should admit that there are two plausible definitions of foldl' and I’ve never seen any serious discussion of why we use one rather than the other (I suspect it’s another historical accident).

So the version above is the “standard” version, perhaps more clearly written using bang patterns as

foldl' :: (b -> a -> b) -> b -> [a] -> b
foldl' f a []     = a
foldl' f a (x:xs) = let !a' = f a x
                     in foldl' f a' xs

But an equally plausible version is

foldl'' :: (b -> a -> b) -> b -> [a] -> b
foldl'' f !a []     = a
foldl'' f !a (x:xs) = foldl'' f (f a x) xs

The difference is this: in the first version we evaluate the new accumulating parameter before making the recursive call, while in the second version we ensure that the accumulating parameter is always evaluated (to WHNF) on entry into each call.

These two ways of forcing the evaluation have almost the same effect. It takes a moment or two to find a case where it makes a difference, but here’s one:

foldl'  (\_ y -> y) undefined [1] = 1
foldl'' (\_ y -> y) undefined [1] = undefined

The standard foldl' ensures that all the new accumulating parameter values are evaluated, but still allows the initial value to be unevaluated. The alternative version simply says that the accumulating parameter is always evaluated.

The second version is attractive from a code generation point of view. One of the clever things that GHC can do with foldl' (and strict function arguments generally) is to unbox the values, so for example an Int can be unboxed to an Int# and passed in registers rather than on the heap. With the standard foldl' it needs a special case for the first iteration of the loop, where the initial value of the accumulating parameter might not be evaluated yet. With foldl'', that's not a problem: we can unbox things right from the start. In practice, GHC can often see that the initial value is evaluated anyway, but not always.

(Don Stewart and I noticed this a few years back when we were working on stream fusion for lists. We had defined foldl' on streams in a way that corresponded to the second form above and then got a test failure when doing strictness testing comparing against the standard list functions.)

So if we’re going to fix foldl to be the strict version, then perhaps it should be the fully strict version, not just the “strict after the first iteration” version.

by duncan at April 01, 2014 10:38 AM

Edward Z. Yang

Calculating Shanten in Mahjong

Move aside, poker! While the probabilities of various poker hands are well understood and tabulated, the Chinese game of chance Mahjong [1] enjoys a far more intricate structure of expected values and probabilities. [2] This is largely due in part to the much larger variety of tiles available (136 tiles, as opposed to the standard playing card deck size of 52), as well as the turn-by-turn game play, which means there is quite a lot of strategy involved with what is ostensibly a game of chance. In fact, the subject is so intricate, I’ve decided to write my PhD thesis on it. This blog post is a condensed version of one chapter of my thesis, considering the calculation of shanten, which we will define below. I’ll be using Japanese terms, since my favorite variant of mahjong is Riichi Mahjong; you can consult the Wikipedia article on the subject if you need to translate.

Calculating Shanten

The basic gameplay of Mahjong involves drawing a tile into a hand of thirteen tiles, and then discarding another tile. The goal is to form a hand of fourteen tiles (that is, after drawing, but before discarding a tile) which is a winning configuration. There are a number of different winning configurations, but most winning configurations share a similar pattern: the fourteen tiles must be grouped into four triples and a single pair. Triples are either three of the same tile, or three tiles in a sequence (there are three “suits” which can be used to form sequences); the pair is two of the same tiles. Here is an example:

[Tile images (Wikimedia Commons): MJw1 MJw2 MJw3 MJw5 MJw5, MJt2 MJt3 MJt4 MJt7 MJt8 MJt9, MJs4 MJs5 MJs6]

Represented numerically, this hand consists of the triples and pairs 123 55 234 789 456.

One interesting quantity that is useful to calculate given a mahjong hand is the shanten number, that is, the number of tiles away from winning you are. This can be used to give you the most crude heuristic of how to play: discard tiles that get you closer to tenpai. The most widely known shanten calculator is this one on Tenhou’s website [3]; unfortunately, the source code for this calculator is not available. There is another StackOverflow question on the subject, but the “best” answer offers only a heuristic approach with no proof of correctness! Can we do better?

Naïvely, the shanten number is a breadth first search on the permutations of a hand. When a winning hand is found, the algorithm terminates and indicates the depth the search had gotten to. Such an algorithm is obviously correct; unfortunately, with 136 tiles, one would have to traverse ((136 − 13) × 14)^n hands (choices of new tiles times choices of discard) while searching for a winning hand that is n-shanten away. If you are four tiles away, you will have to traverse over six trillion hands. We can reduce this number by avoiding redundant work if we memoize the shanten associated with hands: however, the total number of possible hands is roughly 136 choose 13, or 59 bits. Though we can fit (via a combinatorial number system) a hand into a 64-bit integer, the resulting table is still far too large to hope to fit in memory.

The trick is to observe that shanten calculation for each of the suits is symmetric; thus, we can dynamic program over a much smaller space of the tiles 1 through 9 for some generic suit, and then reuse these results when assembling the final calculation. (9 × 4) choose 13 is still rather large, so we can take advantage of the fact that because there are four copies of each tile, an equivalent representation is a 9-vector of the numbers zero to four, with the constraint that the sum of these numbers is 13. Even without the constraint, the count 5^9 is only two million, which is quite tractable. At a byte per entry, that’s 2MB of memory; less than your browser is using to view this webpage. (In fact, we want the constraint to actually be that the sum is less than or equal to 13, since not all hands are single-suited, so the number of tiles in a hand is less.)

The breadth-first search for solving a single suit proceeds as follows (a rough Haskell sketch follows the list):

  1. Initialize a table A indexed by tile configuration (a 9-vector of 0..4).
  2. Initialize a todo queue Q of tile configurations.
  3. Initialize all winning configurations in table A with shanten zero (this can be done by enumeration), recording these configurations in Q.
  4. While the todo queue Q is not empty, pop the front element, mark the shanten of all adjacent uninitialized nodes as one greater than that node, and push those nodes onto the todo queue.
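
Here is a rough Haskell sketch of that table fill. Enumerating the winning single-suit configurations and the adjacency relation (configurations reachable by adding or removing one tile) are the mahjong-specific parts, so they are left as parameters here rather than spelled out:

import qualified Data.Map.Strict as M
import           Data.Sequence (ViewL (..), viewl, (|>))
import qualified Data.Sequence as Seq

type Config = [Int]  -- a 9-vector of tile counts, each in 0..4

-- Breadth-first fill of the per-suit shanten table (steps 1-4 above).
shantenTable :: [Config]             -- winning configurations (shanten 0)
             -> (Config -> [Config]) -- configurations one tile away
             -> M.Map Config Int
shantenTable winning neighbours =
    go (M.fromList [ (c, 0) | c <- winning ]) (Seq.fromList winning)
  where
    go table queue = case viewl queue of
      EmptyL    -> table
      c :< rest ->
        let d      = table M.! c
            new    = [ n | n <- neighbours c, not (n `M.member` table) ]
            table' = foldr (\n -> M.insert n (d + 1)) table new
        in  go table' (foldl (|>) rest new)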

With this information in hand, we can assemble the overall shanten of a hand. It suffices to try every distribution of the triples and the pair over the four types of tiles (also including null tiles), consult the shanten of the requested shape, and return the minimum over all these configurations. There are 4 × ((4 + 4 − 1) choose 4) (by stars and bars) combinations, for a total of 140 configurations. Computing the shanten of each configuration is a constant-time lookup into the table generated by the per-suit calculation. A true shanten calculator must also accommodate the rare other hands which do not follow this configuration, but those winning configurations are usually highly constrained, and it is quite easy to (separately) compute their shanten.

With a shanten calculator, there are a number of other quantities which can be calculated. Uke-ire refers to the number of possible draws which can reduce the shanten of your hand: one strives for high uke-ire because it means a higher probability that you will draw a tile which moves your hand closer to winning. Given a hand, it's very easy to calculate its uke-ire: just look at all adjacent hands and count the number of hands which have lower shanten.
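
Continuing the sketch above (same imports, Config type and table), the per-suit version of that count looks like this; the hand-level version works the same way over whole hands:

-- Uke-ire of a configuration: how many single-tile changes lead to a
-- configuration with strictly lower shanten. Assumes c is in the table.
ukeIre :: M.Map Config Int -> (Config -> [Config]) -> Config -> Int
ukeIre table neighbours c =
    length [ n | n <- neighbours c
               , Just d <- [M.lookup n table]
               , d < table M.! c ]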

Further extensions

Suppose that you are trying to design an AI which can play Mahjong. Would the above shanten calculator provide a good evaluation metric for your hand? Not really: it has a major drawback, in that it does not consider the fact that some tiles are simply unavailable (they were discarded). For example, if all four “nine stick” tiles are visible on the table, then no hand configuration containing a nine stick is actually reachable. Adjusting for this situation is actually quite difficult, for two reasons: first, we can no longer precompute a shanten table, since we need to adjust at runtime what the reachability metric is; second, the various suits are no longer symmetric, so we have to do three times as much work. (We can avoid an exponential blowup, however, since there is no inter-suit interaction.)

Another downside of the shanten and uke-ire metrics is that they are not direct measures of “tile efficiency”: that is, they do not directly dictate a strategy for discards which minimizes the expected time before you get a winning hand. Consider, for example, a situation where you have the tiles 233, and only need to make another triple in order to win. You have two possible discards: you can discard a 2 or a 3. In both cases, your shanten is zero, but discarding a 2, you can only win by drawing a 3, whereas discarding a 3, you can win by drawing a 1 or a 4. Maximizing efficiency requires considering the lifetime uke-ire of your hands.

Even then, perfect tile efficiency is not enough to see victory: every winning hand is associated with a point-score, and so in many cases it may make sense to go for a lower-probability hand that has higher expected value. Our decomposition method completely falls apart here, as while the space of winning configurations can be partitioned, scoring has nonlocal effects, so the entire hand has to be considered as a whole. In such cases, one might try for a Monte Carlo approach, since the probability space is too difficult to directly characterize. However, in the Japanese Mahjong scoring system, there is yet another difficulty with this approach: the scoring system is exponential. Thus, we are in a situation where the majority of samples will be low scoring, but exponentially few samples have exponential payoff. In such cases, it’s difficult to say if random sampling will actually give a good result, since it is likely to miscalculate the payoff, unless exponentially many samples are taken. (On the other hand, because these hands are so rare, an AI might do considerably well simply ignoring them.)

To summarize, Mahjong is a fascinating game, whose large state space makes it difficult to accurately characterize the probabilities involved. In my thesis, I attempt to tackle some of these questions; please check it out if you are interested in more.


[1] No, I am not talking about the travesty that is mahjong solitaire.

[2] To be clear, I am not saying that poker strategy is simple—betting strategy is probably one of the most interesting parts of the game—I am simply saying that the basic game is rather simple, from a probability perspective.

[3] Tenhou is a popular Japanese online mahjong client. The input format for the Tenhou calculator is 123m123p123s123z, where numbers before m indicate man tiles, p pin tiles, s sou tiles, and z honors (in order, they are: east, south, west, north, white, green, red). Each entry indicates which tile you can discard to move closer to tenpai; the next list is of uke-ire (and the number of tiles which move the hand further).

by Edward Z. Yang at April 01, 2014 08:20 AM

Well-Typed.Com

Fun and Profit with Strongly-Typed Data Schemas

Over the past few months, Duncan and I have been working with Chris Dornan and Alfredo Di Napoli on api-tools, a DSL for specifying data schemas for REST-like APIs in Haskell. If you’re interested in the real-world use of Haskell, static types and DSLs, why not come along to hear Chris talk about it?

Wednesday 9th April, 6:30pm, London

Find out more and register for free over at Skills Matter:

Typical business apps store structured data, transform it and send it hither and thither. They are typically made of multiple components that have to agree on the schema of the data they exchange. So a large part of what it means to be “flexible” is for it to be easy to modify and extend the schema of the data that the system handles.

Strong typing can help with this, ensuring that the code that accesses the data is consistent with the schema. One idea that has been tried with databases is to generate Haskell types from the database schema, or generate the database schema from the Haskell types, or both from some standalone schema.

In this talk we will describe how we have applied this same idea but for REST APIs. We have a DSL that we use to specify the schema of the data available via a REST API. With a machine description of the schema we can generate the code and documentation for many parts of the system, and of course those parts are then easily adaptable when the schema changes. We generate Haskell types, JSON conversion functions, REST API documentation, disk persistence and more. We also have a story for managing schema changes and migrating data. This system is in use at Iris Connect and is now available on Hackage.

This talk will also discuss further applications we have found for these tools and some of the experience from the development project at Iris Connect, including problems we have had to solve building the Haskell components of the system and the wider challenge of integrating it into the overall development project.

And if that isn’t enough for you, Well-Typed’s Haskell courses are back at the end of April, with an all-new course on advanced features of the Haskell type system. Stay tuned for more events coming soon…

by adam at April 01, 2014 07:24 AM

Christopher Done

Haskell structured diffs

Project-request: someone please implement a program that will diff Haskell in a cleverer way than lines.

In an effort to rein in my incessant work on Haskell tooling1, I’m outlining a tool that I’d personally like, and I welcome people to implement it. Otherwise it serves as a motivating problem description for the next time I come around to it myself with free time.

Before anyone emails me saying “lines/words are simple, other things are hard, that’s why it’s not been done yet. People undervalue the simple solution …” with a long lecture, spare me!

The concrete diff

The concrete diff is the line-based, and sometimes character-based, diff that we all know and love. There’s no reason to throw this away. You will need to keep this as an optional backend for when you are unable to parse a Haskell file.

Pros: simple to implement. You produce the necessary lines to delete and insert to create the change from A to B.

Cons: doesn’t know about syntactic redundancy where some changes don’t mean anything, and where the actual important change occurs. For example:

main = do putStrLn "Write your name!"
          name <- getLine
          print name

Now you change this to:

main = do args <- getArgs
          case args of
            [] -> do putStrLn "Write your name!"
                     name <- getLine
                     print name
            _ -> runWithArgs args

The diff will look like this:

@@ -5,3 +5,6 @@ module Main where
-main = do putStrLn "Write your name!"
-          name <- getLine
-          print name
+main = do args <- getArgs
+          case args of
+            [] -> do putStrLn "Write your name!"
+                     name <- getLine
+                     print name
+            _ -> runWithArgs args

But it’s clear to observe that this is not the change we made in spirit, it’s just one line-based way to achieve it. In actual fact, our do putStrLn … was moved into a case, un-changed. At this size, it’s not a big deal. When the code is more interesting, it’s important to know what was really changed, and what remains the same.

The abstract syntax diff

Enter the syntactic diff. We show the difference between two syntactic trees. How this is to be achieved in a readable way is the rub, but here are some ideas.

Take our example above, one approach can be to label nodes.

Before:

¹{main = ²{do putStrLn "Write your name!"
              name <- getLine
              print name}}

After:

¹{main = do args <- getArgs
            case args of
              [] -> ²{do putStrLn "Write your name!"
                         name <- getLine
                         print name}
              _ -> runWithArgs args}

Now, at least at a superficial glance, you don’t even need this explained to you. You can see exactly what has happened: the code before has changed to the code after, but we can see that node ² has just moved to inside the case.

Where the trickiness arises is taking this to its logical conclusion and applying it generally. What’s displayed if you also change the string in the putStrLn? Good question. Here’s an idea:

¹{main = ²{do putStrLn -{"Write your name!"}
              name <- getLine
              print name}}

Because the node "Write your name" has now been lost, we don’t need to reference it any longer. So one way to show that it has been removed could be to put -{…}. And then to show what replaced it, put in +{…}, a la classic diffs:

¹{main = +{do args <- getArgs
              case args of
                [] -> ²{do putStrLn +{"Write your name!"}
                           name <- getLine
                           print name}
                _ -> runWithArgs args}}

In reality this rule would insert more -{…} and +{…} than I’ve written here, but I’m writing these examples manually so take them with a grain of salt. Let’s take it further and say that the string has actually been moved. Then we should indeed give it a number to reference it later:

Before:

¹{main = ²{do putStrLn ³{"Write your name!"}
              name <- getLine
              print name}}

After:

¹{main = +{do args <- getArgs
              case args of
                [] -> ²{do putStrLn +{greeting}
                           name <- getLine
                           print name}
                _ -> runWithArgs args}
    +{where greeting = ³{"Write your name!"}}}

Again, I don’t think anybody is going to find this confusing. Node ³ has moved into a where clause, which has been named greeting and referenced in place of its original occurrence.

Am I making obvious sense, here? It’s not a particularly novel display, it states what happened syntactically, precisely. With a UI, you could expand/collapse nodes in a nested fashion or “explode” all the pieces into a flat list of numbered or +’d or -’d nodes, or just narrow down to one specific interesting expression, like

²{do putStrLn +{greeting}
     name <- getLine
     print name}

If you’re sufficiently nerd-sniped to find this interesting and do-able, then I invite you to go ahead and give it a go. I’d love to see a prototype. I don’t plan on implementing this in the near or distant future, so we won’t be toe stepping.
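
As a tiny seed for such a prototype, parsing both versions is the easy part; haskell-src-exts (my assumption, any Haskell parser would do) gets you from text to trees, and all of the interesting work described above lives in the branch marked below:

import Language.Haskell.Exts (ParseResult (..), parseFileContents)

main :: IO ()
main = do
  old <- readFile "Before.hs"
  new <- readFile "After.hs"
  case (parseFileContents old, parseFileContents new) of
    -- Note: this naive Eq also compares source locations, so it only
    -- detects byte-identical modules; the real tree diff goes here.
    (ParseOk m1, ParseOk m2)
      | m1 == m2  -> putStrLn "no syntactic change"
      | otherwise -> putStrLn "syntactic change: tree diff goes here"
    _             -> putStrLn "parse failed: fall back to the concrete diff"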

The reduced semantic diff

If you’re still reading by this point, let me try to entice you with ambitious ideas. Take the above approach, everything we just laid out, but let’s put an additional step in there: instead of diffing Haskell’s abstract syntax tree, diff the Core.

If you compile the below with GHC,

main = case Just () of
         Just () -> print "Hey!"

The external core is:

%module main:Main
  main:main5 :: (ZMZN Char) = unpackCStringzh ("Hey!"::Addrzh);
  main:main4 :: (ZMZN Char) = ZC @ Char base:zdfShowChar1 (ZMZN @ Char);
  main:main3 :: (ZMZN Char) = base:showLitString main:main5 main:main4;
  main:main2 :: (ZMZN Char) = ZC @ Char base:zdfShowChar1 main:main3;
  main:main1 :: (Statezh RealWorld) -> (Z2H ((Statezh RealWorld)) Z0T) =
    \ (etaB1::(Statezh RealWorld)) ->
      base:hPutStr2 base:stdout main:main2 True etaB1;
  main:main :: (IO Z0T) = %cast (main:main1) (%sym ((NTCoZCIO Z0T)));
  main:main6 :: (Statezh RealWorld) -> (Z2H ((Statezh RealWorld)) Z0T) =
    \ (etaXb::(Statezh RealWorld)) ->
      base:runMainIO1 @ Z0T (%cast (main:main1) (%sym ((NTCoZCIO Z0T))))
                       etaXb;
  main:ZCmain :: (IO Z0T) = %cast (main:main6) (%sym ((NTCoZCIO Z0T)));

You can see that the pointless case has been removed. This is the bread and butter of Core simplification. But if I remove the case myself, the Core is exactly the same. This is redundant semantic content, which is why GHC removed it.

If someone made a change like this in a real codebase which removed some redundant semantic content, not just syntactical redundancy, your diff could show it like that. In other words, nothing important semantically actually happened here.

In fact, if I refactored a bunch of code, re-organized a bit, does my next colleague really want to read through all the syntax tree just to see the crux of what changed? Sometimes, but not always. Sometimes, they just want to see the precise thing that will change at runtime.

It might actually be insane, with big blow ups in code difference for minute high-level changes, or it might be great for teams caring about performance. Difficult to know until you try it. You can also do a source-mapping back to the original Haskell source, for a more interesting display.

If you want to implement this, I would love to see any results.

The typed diff

Okay, you’re still reading so you’re pretty easily nerd sniped. Let’s continue with the ideas. Another type of difference between two sources is the types of expressions in there. Consider:

main = let x = [1,2,3]
       in print (x <> x)

Now you change the code to:

main = let x = myFancyMonoid
       in print (x <> x)

Our structural diff laid out earlier will show this:

main = let x = -{[1,2,3]}
       in print (x <> x)

After:

main = let x = +{myFancyMonoid}
       in print (x <> x)

But actually, more things have changed here. As a result of the different monoid instance, the print (x <> x) will do something different. Maybe it’s a * rather than +, maybe it’s a number, whatever. Maybe that expression is used in a more interesting way than merely printing it. What’s the real diff?

main = let {x::[Integer]} = -{[1,2,3]}
       in print {{(x <> x)}::[Integer]}

After:

main = let {x::MyFancyMonoid} = +{myFancyMonoid}
       in print {{(x <> x)}::MyFancyMonoid}

Or something like that. I’m being hand-wavey in the display, here. The real difference is that we’ve changed the type of x. It’s an important change, which has semantic meaning. My ideas are more vague here. I haven’t thought through many scenarios of how to represent this. But one thing is clear: a diff of types can actually be useful and interesting.

The editing diff

The diffs above are all examples of “cold” diffs. Calculating the difference between two files as-is. If you’re in a structured editor like Lamdu, then you don’t have to do cold diffs and figure out and guess at what happened. You know exactly what happened. This node was raised here, this variable was renamed there, etc. But if you want to work on that, you pretty much have to work on Lamdu.

Summary

In summary I’ve intentionally listed increasingly more wacky diff ideas, from the familiar to the fairly novel. My general approach to tooling is progressive: start with the simplest working implementation then step up. Structured-haskell-mode is an example of this. It’s no Lamdu, and it’s no vanilla text-based mode. It’s a stepping stone inbetween. The impedance to try SHM is lower.

In the same way, maybe we can start with the abstract syntax diff, let people become acclimatized to it, let it stabilize, get it integrated into things like Git, and then work our way up from there.

If nobody bothers trying out these ideas, I’ll probably end up doing them myself eventually, but I thought I’d put the suggestion out there first.


  1. In favour of writing programs that concern themselves with things other than Haskell for once!

April 01, 2014 12:00 AM

March 31, 2014

Functional Jobs

Infrastructure Engineer at Zeta Project Germany GmbH (Full-time)

About us

We are an early-stage product company. At Zeta everyone has a voice and we expect you to change our world. If you like coming up with creative solutions then this is definitely the place for you.

We’ve built a team of passionately experienced, curious engineers and designers who are dedicated, hardworking and wholeheartedly embrace technical and design challenges of all types.

We are located in Berlin Mitte and are looking for talented individuals who share our passion.

Job Description

This position is for someone to work in the Infrastructure team, to contribute to the evolution of our software platform.

The team is responsible for developing and supporting the company's core systems and infrastructure, which is predominately written in Haskell.

You will work directly with other teams within the company, including client, backend, and QA engineers, ensuring that the infrastructure runs correctly, reliably and at scale.

The ability to educate others on the best use of the software resources the team provides is crucial, as is the ability to use feedback to continually improve our tooling and processes.

Initially we would like you to own our Jenkins setup and drive best practices for the automation of various teams' Continuous Integration and Deployment workflows, with future work planned to improve problem areas such as development workflow, monitoring, alerting, provisioning, and performance.

Skills & Requirements

  • Practical experience using various Amazon Web Services
  • Comfortable with admin tasks, like partitioning and formatting filesystems, Debian packaging, debugging live processes, and general Linux best practices
  • Working knowledge of Cassandra, Redis, or similar databases
  • Proactive with the ability to own and oversee projects
  • Extensive familiarity with Jenkins and Continuous Integration, for technologies such as Objective-C, C++ and Haskell
  • Good programmer with experience programming Ruby, Python, Erlang or other languages
  • Experience with various Configuration Management technologies like Ansible, Chef, Puppet, etc.
  • Interested in learning Haskell
  • Experience working as a Linux System Administrator

Get information on how to apply for this position.

March 31, 2014 06:14 PM

Alejandro Cabrera

Migrating to Static, Self-Hosted Blog

Hello, readers.

Soon, I'll be moving away from the Blogger platform to a statically hosted blog. I've been playing with hakyll for some time now, and am pleased with its support for syntax highlighting, markup formats, RSS/Atom generation, and packaged Warp.

It'll be great to develop my blog with emacs, leveraging git for backup and rsync for deployment.

Expect:

* Less latency
* A home-grown design
* More code samples, with pretty highlighting

With possibly:

* More frequent posts

Stay tuned. I'll be releasing the new blog by the end of the week!

by Alejandro Cabrera (noreply@blogger.com) at March 31, 2014 03:50 PM

Daniil Frumin

ANN: hastache 0.6

Announcing: hastache 0.6

Hastache is a Haskell implementation of the mustache templating system.

Quick start

cabal update
cabal install hastache

A simple example:

import Text.Hastache 
import Text.Hastache.Context 
import qualified Data.Text.Lazy.IO as TL

main = hastacheStr defaultConfig (encodeStr template) (mkStrContext context)
    >>= TL.putStrLn

template = "Hello, {{name}}!\n\nYou have {{unread}} unread messages." 

context "name" = MuVariable "Haskell"
context "unread" = MuVariable (100 :: Int)

Read Mustache documentation for template syntax; consult README for more details.

What’s new in 0.6?

The interface of the library has been switched from ByteString to (lazy) Text. That means, for example, that the type of the hastacheStr function is now the following:

hastacheStr :: MonadIO m => MuConfig m -> Text -> MuContext m -> m Text

The generic context generation (mkGenericContext) now supports functions of the types Text -> Text and Text -> m Text, as well as types with multiple constructors. That is, given a Haskell data structure

data A = A { str :: String }
       | B { num :: Int }

it is possible to write a template like this:

{{#A}}
A : {{str}}
{{/A}}
{{#B}}
B : {{num}}
{{/B}}

Please take a look at the multiple constructors example if you are interested.
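
For readers who want something runnable, here is a rough self-contained version of the example above; the exact rendering behaviour of mkGenericContext is assumed from the description in this post, so treat it as a sketch:

{-# LANGUAGE DeriveDataTypeable #-}
import           Data.Data (Data, Typeable)
import qualified Data.Text.Lazy.IO as TL
import           Text.Hastache
import           Text.Hastache.Context

data A = A { str :: String }
       | B { num :: Int }
       deriving (Data, Typeable)

template :: String
template = concat
  [ "{{#A}}A : {{str}}{{/A}}"
  , "{{#B}}B : {{num}}{{/B}}" ]

-- Render the same template against each constructor; only the matching
-- section should produce output.
main :: IO ()
main = do
  hastacheStr defaultConfig (encodeStr template) (mkGenericContext (A "hello"))
    >>= TL.putStrLn
  hastacheStr defaultConfig (encodeStr template) (mkGenericContext (B 42))
    >>= TL.putStrLn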

Additionally, a couple of new MuVar instances have been added.

Acknowledgments

Special thanks to Mark Lentczner, Alexander Voikov and others who reported issues, tested the library and provided feedback.


Tagged: haskell, hastache

by Dan at March 31, 2014 07:49 AM

March 30, 2014

Christopher Done

One step forward, two steps back

The issue is that programming languages don’t go forward, they move sideways or diagonally, or sometimes backwards.

A new car comes out, and it has some cool feature: Hey, it has road surface detection and changes the steering accordingly! But it should also come with all the old stuff that you come to expect. Comfy seats, seatbelts, airconditioner, heated windows, wipers, proximity detection, power steering, cruise control, etc.

With new programming languages, what you tend to get is a chassis, engine and steering wheel, and the road surface detection.

Here is a list of cool ideas that have been discovered and implemented in programming languages, but which do not in their whole make up any existing language:

The moral is, if you’re inventing a new general purpose programming language and you have some clue that it’s going to be adopted, I implore you to thoroughly explore all of the features above within the existing languages that do them well, and think about adding it to your language.

As a follow-up post I might make a matrix of the top, say, 30, general purpose programming languages and all the features that they tick off.


  1. Also known as quotation, quasiquotes, macros, templating, mixins, etc.

  2. So, numbers, strings, vectors, patterns, etc.

  3. Preferablly static. Also known as implicit parameters, contexts, etc.

  4. Also known as “hot-swapping”, “live update”, “plugins”, etc.

March 30, 2014 12:00 AM

March 29, 2014

ERDI Gergo

My first computer

tl;dr: I've built a computer on a Xilinx FPGA using Kansas Lava, based on a virtual machine design from the mid seventies.

I would be lying if I said I always wanted to build my own computer. For a very long time, hardware didn't tickle my curiosity at all; and even today, I prefer staying away from dangling wires and soldering irons. I like my computing platforms to Just Work™, and hardware problems are just a huge hassle. But then in 2010 some coworkers of mine started getting into electronics and I learned from them just enough to start hooking some ICs up on a breadboard, and it seemed like a very fun diversion from all the high-level, abstract, softwarey stuff. In some sense, it filled the same void in me that assembly programming probably would have. But even back in the days of the Commodore 64, I was looking for more BASIC extensions instead of going downwards to assembly / machine code.

One thing led to another and I was creating a Brainfuck CPU on a Papilio FPGA. It was a very satisfying experience, plus, I got to learn about a completely new world (that of digital hardware design and hardware description languages). So ever since then, I had a voice in the back of my head saying I should go all in, and implement a proper full computer, with I/O and all. But jumping straight into replicating a real computer seemed like trying to run before I can walk.

I can't remember now how I bumped into the CHIP-8 platform, but when I read about it in detail, I realized this is something I, a self-taught HDL guy, could realistically implement. To recap, it's a virtual machine spec from the mid seventies intended to be implemented on home computers of the time. It is super low performance, meaning I could implement everything in the most naïve way possible and still get something workable out of it: graphics is 64×32 black & white, RAM is short of 4KBytes, CPU speed is not really part of the spec, and the only timers provided run at 60 Hz.

I can do this!

Setting the scene

The FPGA board that I targeted is a Papilio One 500K, which has 500K... somethings called "system gates", and about 40KByte of SRAM. The Arcade MegaWing provides a D-sub 15 VGA connector and two PS/2 ports, among some other stuff that I am not using for this project.

The home computer targeted by the original CHIP-8 spec only had a four-by-four keypad, so I am not 100% happy about using a PS/2 keyboard for the input. Maybe a later version could use a bespoke keypad connected to the unused pins of the LogicStart MegaWing. However, by using a PS/2 keyboard, I could not worry about the hardware side of things and just implement a decoder for the PS/2 protocol.

In terms of software, there are several games available for the CHIP-8. Maybe even dozens! But just for fun, after finishing the machine itself, I ended up writing my own game as well: a crappy port of the currently trending 2048 game.

Understanding the CHIP-8

I would say the CPU itself is a RISC architecture, in that it has 16 registers and not a whole lot of operations. But I'm not sure it would be called RISC back in its day.

The aforementioned 64×32 one-bit graphics is manipulated via a sprite interface: there's a special CPU opcode for xor-blitting an 8-pixel-wide, variable-height sprite onto any part of the screen. I went with the obvious way of implementing it as a 2048-bit frame buffer.
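
For readers who haven't met CHIP-8 sprites before, here is a minimal software sketch of what that opcode has to do. This is my own illustration (the frame buffer as an unboxed vector of 256 bytes, and the name blitSprite), not the Kansas Lava circuit from the project:

import Data.Bits (shiftL, shiftR, xor, (.&.))
import Data.Word (Word16, Word8)
import qualified Data.Vector.Unboxed as V

-- The 64×32 one-bit display stored as 32 rows of 8 bytes (2048 bits).
type FrameBuffer = V.Vector Word8

-- Xor-blit an 8-pixel-wide sprite at (x, y), wrapping around the edges.
-- Returns the updated frame buffer and whether any lit pixel was erased,
-- which CHIP-8 reports back to the program in the VF register.
blitSprite :: Int -> Int -> [Word8] -> FrameBuffer -> (FrameBuffer, Bool)
blitSprite x y sprite fb =
    (fb V.// [(i, (fb V.! i) `xor` b) | (i, b) <- pieces], collision)
  where
    collision = or [ (fb V.! i) .&. b /= 0 | (i, b) <- pieces ]
    pieces    = concat (zipWith rowPieces [y ..] sprite)
    -- A sprite row generally straddles two bytes, so split it through a
    -- 16-bit window; fragments that are all zero are simply dropped.
    rowPieces row bits =
      [ (index row (x `div` 8 + k), fromIntegral frag)
      | (k, frag) <- zip [0, 1] [wide `shiftR` 8, wide .&. 0xff]
      , frag /= 0 ]
      where wide = (fromIntegral bits `shiftL` 8) `shiftR` (x `mod` 8) :: Word16
    index r c = (r `mod` 32) * 8 + (c `mod` 8)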

To validate my understanding, I first created a software emulator in Haskell. That unearthed a ton of edge cases and situations where I was not reading the spec with enough attention. Once I had the emulator working well enough to play the sample games I found, I enumerated the modules I'll need to implement.

The emulator running a Tetris game

A TODO list

Since I've already done the Brainfuck CPU, I didn't foresee too much difficulty in implementing the CPU proper (oh boy was I wrong). However, the world of peripherals was a new one to me.

I've never looked into VGA signal timings in detail before, and for some reason I just assumed that it's going to be just as complicated as PAL, about which I knew just enough to know that you have to generate all kinds of elaborate sync patterns. So actually reading the VGA spec was a relief, and I quickly came up with a scheme where my CHIP-8 computer would be running at 50 MHz, so that it can easily implement the 25 MHz pixel clock needed for 640×480@60 Hz. I went with this mode because it has several advantages:

  • It refreshes the screen at 60 Hz, so I can synchronize the CHIP-8 timers to it
  • It has square pixel aspect ratio
  • It is a "default" mode, so I can be confident that if I generate the correct signal, modern LCDs will be able to display it. In fact, for testing, I later got a 7" CCTV LCD with a VGA input on the cheap, so that I didn't have to make room for the 21" TFT I had lying around. The CCTV monitor supports three VGA video modes, and 640×480@60 Hz is one of them.
  • By dividing both the horizontal and vertical resolution by 8, I can get 80×60, which doesn't leave too much unused border around 64×32. I could use all horizontal screen real estate if I divided by 10 instead, to get 64×48, but I have no idea how I could divide my horizontal/vertical position counter by 10; whereas dividing by 8 is a simple matter of shifting to the right by 3 bits (see the sketch right after this list).
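
To make the divide-by-8 concrete, here is the mapping as plain Haskell. The function name and the centring offsets are my own illustration of the arithmetic; the real thing is a counter circuit in Lava:

import Data.Bits (shiftR)

-- Map a 640×480 VGA pixel position onto a centred 64×32 CHIP-8 grid by
-- dropping the low 3 bits (divide by 8) and subtracting the border offset.
-- Returns Nothing for pixels that fall in the unused border.
toChip8 :: (Int, Int) -> Maybe (Int, Int)
toChip8 (px, py)
  | cx >= 0 && cx < 64 && cy >= 0 && cy < 32 = Just (cx, cy)
  | otherwise                                = Nothing
  where
    cx = (px `shiftR` 3) - 8    -- (80 - 64) / 2 = 8 blocks of left border
    cy = (py `shiftR` 3) - 14   -- (60 - 32) / 2 = 14 blocks of top border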

The Papilio One itself has a clock of 32 MHz, and I first hoped that 32 MHz and 25 MHz are close enough that I could just generate a signal using the 32 MHz as the pixel clock. Turns out that's not quite how signal timings work.

Luckily, I've found out that the Papilio One also has something called a DCM which can divide and multiply the clock signal. I was able to go to 25 or 50 MHz easily with it. It's a bit of a black box component: I had to run a wizard in the Xilinx IDE to get a small binary file describing my DCM parameters, and then I integrated it into my build process by running another Xilinx program on it which spits out some magic bitstream.

The PS/2 protocol is a simple serial protocol with a slow (~10 kHz) clock and one parity bit per 8 data bits. Decoding it into a stream of bytes was a straightforward thing once I hunted down an actual PS/2 keyboard, since it turned out those USB to PS/2 dongles don't really work with regular USB keyboards; rather, the keyboard itself has to be able to "speak" PS/2 over the dongle. So I ended up getting a proper PS/2 keyboard from Mustafa Centre (thanks to Viki's sister Dori for picking it up for me); god only knows why they still had one in stock.
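
For reference, a PS/2 frame is eleven bits: a start bit (0), eight data bits sent LSB first, an odd parity bit, and a stop bit (1). Here is a plain-Haskell sketch of the per-frame check the decoder has to perform; the name decodeFrame is mine, and the real decoder is of course a Lava circuit sampling the clock line:

import Data.Bits (setBit)
import Data.Word (Word8)

-- Decode one 11-bit PS/2 frame, given the bits sampled at the device's
-- clock edges. Returns Nothing on a framing or parity error.
decodeFrame :: [Bool] -> Maybe Word8
decodeFrame (start : rest)
  | not start
  , (dataBits, [parity, stop]) <- splitAt 8 rest
  , stop
  , odd (length (filter id dataBits) + fromEnum parity)
  = Just (foldl (\acc (i, b) -> if b then setBit acc i else acc) 0
                (zip [0 ..] dataBits))
decodeFrame _ = Nothing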

Tooling

The Xilinx tool suite seems to be aimed at being used from the IDE. This has several drawbacks: version control is complicated because generated artifacts litter the source tree; the editor itself in the IDE is of course crap compared to Emacs; and most importantly, I didn't intend to manually write either VHDL or Verilog. As I've mentioned before in my post about the Brainfuck CPU, I've found both of these hardware description languages to be lacking in the same way mainstream imperative programming languages are: the tools you, the developer, are given to introduce your own abstractions are very poor. So I planned to use Kansas Lava, an HDL embedded in Haskell.

Now, the first thing to note about Kansas Lava is, as nice as it is, the software package itself is bit-rotten. The latest version available on Hackage cannot even be compiled with GHC 7.6. While the change to fix that is trivial, after that, you can easily bump into bugs in the Lava compiler itself. But more on that later. Later versions available from GitHub are not even self-consistent between the various dependencies. I've put my patched-up version of Kansas Lava and some of the packages it depends on on GitHub and I'm trying to get Andy Gill to allow me to upload the bundle as a 0.2.4.1 update to Hackage. Don says I should maybe say fuck it and create my own Singapore Lava fork, just to stay with the Lava tradition of a twisty little maze of incompatible forks, all alike.

However, when it all works, it is amazing. I was able to extend the library of Papilio-specific Lava code that I created for the Brainfuck project with reusable modules for the VGA signal generator and the PS/2 decoder in such a way that they should be very easy to be re-used for any other future projects. And it's all type-safe, so for example, the Papilio Arcade MegaWing VGA port is statically enforced to be 4+4+4 bits whereas the LogicStart MegaWing is 3+3+2.

But there was one bug that left a bitter aftertaste in my mouth. Once I had both the CPU and the VGA parts working and I started running some test programs on the thing, I noticed that the framebuffer blitting was exhibiting "or" behaviour instead of "xor". Running the same code in the Lava simulator and dumping the contents of the framebuffer, it showed the correct, "xor" behaviour. After weeks of frustration and a complete rework of the communication system between the CPU, the VGA signal generator and the frame buffer to add memory read bussing, I finally took a look at the Lava compiler's source code to find that it simulates the xor2 primitive as xor but compiles it to or. How do you not notice this bug? Has the Kansas Lava suite been used by anyone for anything at all in the history of ever???

End result

The final result, after much trial and error, looking at the Lava simulator's output, and poring over the code, is available here. I'm reasonably happy with the code base, except for the implementation of the CPU itself, which feels a bit spaghetti to me. Especially around the parts where it's waiting for results to come back from main RAM or the video framebuffer.

Below are some photographs and videos of it running two games: the memory puzzle game Hidden and the 2048 clone mentioned above. Unfortunately, the Tetris game from the screenshot above seems to have a bug in its input handling, in that it samples the keyboard at an unthrottled frequency; thus, even the shortest keypress sends the falling piece to the edge of the wall. I'll eventually need to disassemble the game and fix this.

The VGA signal generator is not as neat as it could be, because it doesn't do pre-fetching. This means that by the time the current CHIP-8 pixel's value is read from the framebuffer, it is already too late: the first of the 8 real pixels for the given virtual CHIP-8 pixel should already have been drawn. This results in some ghosting. But since I know what's causing it, I will hopefully be able to fix this in a future version. But right now I'm too happy to have the whole thing just working.

The setup

Detail of the FPGA board itself

Video: http://www.youtube.com/v/SbdEzmiFJhk

Video: http://www.youtube.com/v/cOle-tX91b8

March 29, 2014 01:06 PM

Vincent Hanquez

Listing licenses with cabal-db

Following discussions with fellow haskellers regarding the need to be careful when adding packages that could depend on GPL or proprietary code, it turns out it’s not easy to get your dependencies’ licenses listed.

It would be convenient to be able to ask the hackage database about those things, and this is exactly the data cabal-db works with.

cabal-db, for those who missed the previous announcement, is a simple program that, using the index.tar.gz file, is able to recursively display or search packages and their dependencies.

The license subcommand mainly gives a summary of the licenses of a package and its dependencies, but it can also display the tree of licenses. Once a package has been listed, it will not appear again in the tree, even if another package depends on it.

A simple example is better than many words:

$ cabal-db license -s -t BNFC
BNFC: GPL
  process: BSD3
    unix: BSD3
      time: BSD3
        old-locale: BSD3
          base: BSD3
        deepseq: BSD3
          array: BSD3
      bytestring: BSD3
    filepath: BSD3
    directory: BSD3
  pretty: BSD3
  mtl: BSD3
    transformers: BSD3
  containers: BSD3
== license summary ==
BSD3: 14
GPL: 1

cabal-db only uses the license listed in the license field of cabal files, so if the field is set incorrectly, cabal-db has no way to know.

March 29, 2014 12:00 AM

March 28, 2014

Bryan O'Sullivan

Where credit belongs for Hack

It’s been exciting to see the positive reception in the tech press and the open source community to Hack, the new programming language that my team at Facebook released last week.

At the same time, one part of that coverage has made me wince: a couple of articles give the impression that I created Hack, but I didn’t. What I do is manage the Hack team, and since I’m incredibly proud of the work they do, I’d like to shine a little light on the people who really deserve the credit.

A few years ago, Julien Verlaguet and Alok Menghrajani had a notion that Facebook could improve some aspects of how it ships software. They developed a proof-of-concept of a system that could detect certain kinds of logic error in a program, and showed it to a few colleagues. Buoyed by the positive feedback they received, they decided to invest more effort in the project, which would in due course become Hack.

This process underscores one of the aspects I find most appealing about engineering at Facebook: engineers are welcome to try out bold ideas, and to push on with them if they look like they’ve got a good prospect of paying off.

The work that Julien and Alok did to get the project off the ground was far from simple. They had to design not just a type system, but one where the type checking algorithms could scale to run efficiently and incrementally across hundreds of thousands of source files. Not just a static type system, but one that could interoperate seamlessly with PHP’s dynamic types. Even the notionally simple act of watching the filesystem to quickly find changed files is impressively difficult; just ask anyone who’s had to wrestle with race conditions involving Linux’s inotify subsystem.

When presented with the early prototype of Hack, Drew Paroski went beyond simply offering support: he immediately signed up to enrich Hack by building the Collections system, a set of clean, modern APIs for dealing with bulk data. While the face that Collections presents to the programmer is easy to understand, there’s once again a lot of subtle work going on behind the scenes (including a lot of runtime support in HHVM, to make them efficient). Collections interoperate seamlessly with PHP arrays, both when a program is being typechecked and while it’s running.

While designing a language and building a typechecker are substantial efforts in their own right, a big reason that we’re excited about Hack is that it is already a success within Facebook. This is in large part due to the work of two engineers, Gabe Levi and Josh Watzman, who converted a large part of our codebase to use Hack’s type annotations. Pulling this off is a surprisingly subtle task. In a large dynamically typed codebase, there can be both latent type errors and code that is more elegantly left dynamically typed. Introducing static types on a large scale involves a series of judgment calls: is what we’re looking at an error; a piece of code that we can refactor so that we can express its types statically; or something that’s just fine as it stands?

As the rest of the team proceeded with the language design, engineering, and conversion work, Joel Marcey built out the HHVM and Hack documentation, while Eugene Letuchy both added features to the language (e.g. richer support for traits) and made significant contributions to our PHP-to-Hack conversion work.

I’m proud to be associated with such skilled people who put so much passion and care into their work. When it comes to Hack, I’m simply the messenger, and these engineers have all along been the ones making it happen.

by Bryan O'Sullivan at March 28, 2014 06:03 PM

Leon P Smith

Announcing postgresql-simple-0.4.2

With 19 new releases, postgresql-simple has seen substantial development since my announcement of version 0.3.1 nearly a year ago. Compared to my last update, the changes are almost exclusively bugfixes and new features. Little code should break as a result of these changes, and most if not all of the code that does break should fail at compile time.

As it stands today, postgresql-simple certainly has the best support for postgres-specific functionality of anything on hackage. So, for a few highlights:

Parametrized Identifiers

Thanks to contributions from Tobias Florek, perhaps the most exciting change for many is that postgresql-simple now properly supports parametrized identifiers, including column, table, and type names. These are properly escaped via libpq. So for example, you can now write:

execute conn "DROP TABLE IF EXISTS ? CASCADE"
              (Only ("schema.table" :: QualifiedIdentifier))

Of course, this particular example is still potentially very insecure, but it shouldn’t be possible to create a SQL injection vulnerability this way.

The downside of this change is that postgresql-simple now requires libpq version 9.0 or later. However, community support for 8.4 is coming to an end this July. Also, it is possible to use newer versions of libpq to connect to older versions of postgres, so you don’t have to upgrade your server. (In fact, in one particular situation I’m still using postgresql-simple to connect to postgresql 8.1, although some features such as non-builtin types don’t work.)

Copy In and Copy Out support

While the postgresql-libpq binding has supported COPY FROM STDIN and COPY TO STDOUT for some time, postgresql-simple now supports these directly without having to muck around with postgresql-libpq calls via the Internal module.
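
Here is a rough sketch of what copy-in looks like through the new Copy module; the table and its columns are made up, and I’m assuming the copy_ / putCopyData / putCopyEnd entry points as documented for this release:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import Data.Int (Int64)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Copy

-- Stream a couple of CSV rows into a (made-up) table with COPY FROM STDIN,
-- using the Copy module directly rather than the Internal module.
copyInExample :: Connection -> IO Int64
copyInExample conn = do
  copy_ conn "COPY things (id, name) FROM STDIN (FORMAT csv)"
  mapM_ (putCopyData conn . B.pack) ["1,foo\n", "2,bar\n"]
  putCopyEnd conn    -- finishes the COPY and returns the number of rows copied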

If you are interested in streaming data to and from postgres, you may also be interested in higher-level COPY bindings for pipes and io-streams. This is available in Oliver Charles’s pipes-postgresql-simple, which also supports cursors, and my own unreleased postgresql-simple-streams, which also supports cursors and large objects.

Out-of-box support for JSON, UUID, and Scientific

Thanks to contributions from Bas van Dijk, postgresql-simple now provides FromField and ToField instances for aeson‘s value type, Antoine Latter’s uuid, as well as scientific.
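
For instance, something along these lines should now work out of the box (the table and columns here are hypothetical, just to illustrate):

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Value)
import Data.UUID (UUID)
import Database.PostgreSQL.Simple

-- With the new FromField instances, json and uuid columns decode straight
-- into aeson Values and UUIDs.
fetchEvents :: Connection -> IO [(UUID, Value)]
fetchEvents conn = query_ conn "SELECT event_id, payload FROM events"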

Savepoints and nestable folds

Thanks to contributions from Joey Adams, postgresql-simple now has higher-level support for savepoints in the Transaction module, as well as the ability to nest the fold operator. Previously, postgresql-simple had assigned a static name to the cursor that underlies fold, and now every connection has a counter used to generate temporary names.
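
A sketch of how the savepoint support can be used (the table names are invented, and I’m assuming withSavepoint :: Connection -> IO a -> IO a from the Transaction module):

{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (SomeException, try)
import Data.Int (Int64)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Transaction (withSavepoint)

-- If the inner statement fails, only the work since the savepoint is rolled
-- back and the surrounding transaction can still commit.
insertWithFallback :: Connection -> IO ()
insertWithFallback conn = withTransaction conn $ do
  _ <- execute_ conn "INSERT INTO audit_log (note) VALUES ('attempt')"
  r <- try (withSavepoint conn (execute_ conn "INSERT INTO flaky (x) VALUES (1)"))
  case (r :: Either SomeException Int64) of
    Left _  -> putStrLn "inner insert failed; rolled back to the savepoint"
    Right _ -> return ()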

Parametrized VALUES expressions

While executeMany and returning already support common use cases of this new feature, the new Values type allows you to parameterize more than just a single VALUES table literal.

For example, these situations commonly arise when dealing with writable common table expressions. Let’s say we have a table of things with an associated table of attributes:

CREATE TABLE things (
   id   SERIAL NOT NULL,
   name TEXT NOT NULL,
   PRIMARY KEY (id)
  );

CREATE TABLE attributes (
   id    INT NOT NULL REFERENCES things,
   key   TEXT   NOT NULL,
   value BIGINT NOT NULL
  );

Then we can populate both a thing and its attributes with a single query, returning the id generated by the database:

query conn [sql|
    WITH thing AS (
        INSERT INTO things (name) VALUES (?) RETURNING id
    ), newattrs AS (
        INSERT INTO attributes
            SELECT thing.id, a.*
            FROM thing JOIN ? a
    )
      SELECT id FROM thing;
  |]  ("bob", Values   [ "text" ,"int8" ]
                     [ ( "foo"  , 42    )
                     , ( "bar"  , 60    ) ])

The empty case is also dealt with correctly; see the documentation for details.

Improved support for outer joins

A long standing challenge with postgresql-simple is dealing with outer joins in a sane way. For example, with the schema above, let’s say we want to fetch all the things along with their attributes, whether or not any given thing has any attributes at all. Previously, we could write:

getAllTheThings :: Connection -> IO [(Text, Maybe Text, Maybe Int64)]
getAllTheThings conn = do
    query conn [sql|
        SELECT name, key, value
          FROM things LEFT OUTER JOIN attributes
            ON things.id = attributes.id
      |]

Now, the columns from the attributes table are not nullable, so normally we could avoid the Maybe constructors; however, the outer join changes that. Since both of these columns are not nullable, they are always both null or both not null, which is an invariant not captured by the type. And a separate Maybe for each column gets awkward to deal with, especially when more columns are involved.

What we would really like to do is change the type signature to:

getAllTheThings :: Connection -> IO [Only Text :. Maybe (Text, Int64)]

And now we can, and it will just work! Well, almost. The caveat is that there is a separate instance FromRow (Maybe ...) for most of the provided FromRow instances. This won’t work with your own FromRow instances unless you also declare a second instance. What’s really desired is a generic instance:

instance FromRow a => FromRow (Maybe a)

This instance would return Nothing if all the columns that would normally be consumed are null, and attempt a full conversion otherwise. This would reduce code bloat and repetition, and improve polymorphism and compositionality.

But alas, it’s not possible to define such an instance without changing the FromRow interface, and quite probably breaking everybody’s FromRow instances. Which I’m totally willing to do, once somebody comes up with a way to do it.


by lpsmith at March 28, 2014 05:09 PM

Robert Harper

Old neglected theorems are still theorems

I have very recently been thinking about the question of partiality vs totality in programming languages, a perennial topic in PL’s that every generation thinks it discovers for itself. And this got me to remembering an old theorem that, it seems, hardly anyone knows ever existed in the first place. What I like about the theorem is that it says something specific and technically accurate about the sizes of programs in total languages compared to those in partial languages. The theorem provides some context for discussion that does not just amount to opinion or attitude (and attitude always seems to abound when this topic arises).

The advantage of a total programming language such as Goedel’s T is that it ensures, by type checking, that every program terminates, and that every function is total. There is simply no way to have a well-typed program that goes into an infinite loop. This may seem appealing, until one considers that the upper bound on the time to termination can be quite large, so large that some terminating programs might just as well diverge as far as we humans are concerned. But never mind that, let us grant that it is a virtue of  T that it precludes divergence.

Why, then, bother with a language such as PCF that does not rule out divergence? After all, infinite loops are invariably bugs, so why not rule them out by type checking? (Don’t be fooled by glib arguments about useful programs, such as operating systems, that “run forever”. After all, infinite streams are programmable in the language M of inductive and coinductive types in which all functions terminate. Computing infinitely does not mean running forever, it just means “for as long as one wishes, without bound.”)  The notion does seem appealing until one actually tries to write a program in a language such as T.

Consider computing the greatest common divisor (GCD) of two natural numbers. This can be easily programmed in PCF by solving the following equations using general recursion:

\begin{array}{rcl}    \textit{gcd}(m,0) & = & m \\    \textit{gcd}(0,m) & = & m \\    \textit{gcd}(m,n) & = & \textit{gcd}(m-n,n) \quad \text{if}\ m>n \\    \textit{gcd}(m,n) & = & \textit{gcd}(m,n-m) \quad \text{if}\ m<n    \end{array}

The type of \textit{gcd} defined in this manner has partial function type (\mathbb{N}\times \mathbb{N})\rightharpoonup \mathbb{N}, which suggests that it may not terminate for some inputs. But we may prove by induction on the sum of the pair of arguments that it is, in fact, a total function.

Now consider programming this function in T. It is, in fact, programmable using only primitive recursion, but the code to do it is rather painful (try it!). One way to see the problem is that in T the only form of looping is one that reduces a natural number by one on each recursive call; it is not (directly) possible to make a recursive call on a smaller number other than the immediate predecessor. In fact one may code up more general patterns of terminating recursion using only primitive recursion as a primitive, but if you examine the details, you will see that doing so comes at a significant price in performance and program complexity. Program complexity can be mitigated by building libraries that codify standard patterns of reasoning whose cost of development should be amortized over all programs, not just one in particular. But there is still the problem of performance. Indeed, the encoding of more general forms of recursion into primitive recursion means that, deep within the encoding, there must be a “timer” that “goes down by ones” to ensure that the program terminates. The result will be that programs written with such libraries will not be nearly as fast as they ought to be. (It is actually quite fun to derive “course of values” recursion from primitive recursion, and then to observe with horror what is actually going on, computationally, when using this derived notion.)
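
To make the “timer that goes down by ones” concrete, here is a sketch of the idea. It is mine, not from the post, and it is written in Haskell rather than T, so it only illustrates the shape of such an encoding:

-- Drive the recursion with a "fuel" counter that goes down by exactly one
-- per step, supplying m + n fuel up front, which is enough to guarantee
-- that the subtractive gcd terminates.
gcdFuelled :: Integer -> Integer -> Integer
gcdFuelled m n = go (m + n) m n
  where
    go _    a 0 = a
    go _    0 b = b
    go 0    a _ = a   -- fuel exhausted; unreachable when fuel starts at m + n
    go fuel a b
      | a > b     = go (fuel - 1) (a - b) b
      | otherwise = go (fuel - 1) a (b - a)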

But, one may argue, T is simply not a serious language. A more serious total programming language would admit sophisticated patterns of control without performance penalty. Indeed, one could easily envision representing the natural numbers in binary, rather than unary, and allowing recursive calls to be made by halving to achieve logarithmic complexity. This is surely possible, as are numerous other such techniques. Could we not then have a practical language that rules out divergence?

We can, but at a cost.  One limitation of total programming languages is that they are not universal: you cannot write an interpreter for T within T (see Chapter 9 of PFPL for a proof).  More importantly, this limitation extends to any total language whatever.  If this limitation does not seem important, then consider the Blum Size Theorem (BST) (from 1967), which places a very different limitation on total languages.  Fix any total language, L, that permits writing functions on the natural numbers. Pick any blowup factor, say 2^{2^n}, or however expansive you wish to be.  The BST states that there is a total function on the natural numbers that is programmable in L, but whose shortest program in L is larger by the given blowup factor than its shortest program in PCF!

The underlying idea of the proof is that in a total language the proof of termination of a program must be baked into the code itself, whereas in a partial language the termination proof is an external verification condition left to the programmer. Roughly speaking, there are, and always will be, programs whose termination proof is rather complicated to express, if you fix in advance the means by which it may be proved total. (In T it was primitive recursion, but one can be more ambitious, yet still get caught by the BST.)  But if you leave room for ingenuity, then programs can be short, precisely because they do not have to embed the proof of their termination in their own running code.

There are ways around the BST, of course, and I am not saying otherwise. For example, the BST merely guarantees the existence of a bad case, so one can always argue that such a case will never arise in practice. Could be, but I did mention the GCD in T problem for a reason: there are natural problems that are difficult to express in a language such as T. By fixing the possible termination arguments in advance, one is tempting fate, for there are many problems, such as the Collatz Conjecture, for which the termination proof of a very simple piece of code has been an open problem for decades, and has resisted at least some serious attempts on it. One could argue that such a function is of no practical use. I agree, but I point out the example not to say that it is useful, but to say that it is likely that its eventual termination proof will be quite nasty, and that this will have to be reflected in the program itself if you are limited to a T-like language (rendering it, once again, useless). For another example, there is no inherent reason why termination need be assured by means similar to that used in T. We got around this issue in NuPRL by separating the code from the proof, using a type theory based on a partial programming language, not a total one. The proof of termination is still required for typing in the core theory (but not in the theory with “bar types” for embracing partiality). But it’s not baked into the code itself, affecting its run-time; it is “off to the side”, large though it may be.

Updates: word smithing, fixed bad link, corrected gcd, removed erroneous parenthetical reference to Coq, fixed LaTeX problems.


Filed under: Programming, Research, Teaching Tagged: functional programming, primitive recursion, programming languages, type theory, verification

by Robert Harper at March 28, 2014 01:26 AM

March 26, 2014

Ketil Malde

Parallel SNP identification

Basics

Lately, I’ve been trying to do SNP identification. Basically, we take samples from different populations (geographically separated, or based on some phenotype or property), sequence them in pools, and then we want to identify variants that are more specific to certain populations. The basic method is to map sequences to a reference genome, and examine the sites where the mappings disagree - either with the reference, or with each other.

There are many different implementations of this; most require individual genotypes (that is, each position is identified for each individual as homo- or heterozygotic), and most are geared towards finding variant positions rather than finding diagnostic variants. Many methods - at least those I’ve tested - are also quite slow, and they give output that is harder to use (i.e. pairwise scores, rather than an overall score).

In my case, we have pooled sequences (i.e. not individual data), we want diagnostic variants (i.e. those that differ most between populations), and we only need candidates that will be tested on individuals later, so some errors are okay. So I implemented my own tool for this, and ..uh.. invented some new measures for estimating diversity. (Yes, I know, I know. But on the other hand, it does look pretty good; in my brief testing I get less than a 0.06% false positive rate - and plenty of true positives. But if things work out, you’ll hear more about that later.)

Parallel implementation

Anyway - my program is reasonably fast, but ‘samtools mpileup’ is a bit faster. So I thought I’d parallelize the execution. After all, we live in the age of multicore and functional programming, meaning plenty of CPUs, and an abundance of parallel programming paradigms at our disposal. Since the program basically is a simple function applied line by line, it should be easy to parallelize with ‘parMap’ or something similarly straightforward. The problem with ‘parMap’ is that - although it gets a speedup for short inputs - it builds a list with one MVar per input (or IVars if you use the Par monad) strictly. So for my input - three billion records or so - this will be quite impractical.

After toying a bit with different models, I ended up using a rather quick and dirty hack. Avert your eyes, or suffer the consequences:

import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever, liftM2, replicateM)
import System.IO.Unsafe (unsafeInterleaveIO)

parMap :: (b -> a) -> [b] -> IO [a]
parMap f xs = do
  outs <- replicateM t newEmptyMVar
  ins  <- replicateM t newEmptyMVar
  sequence_ [forkIO (worker i o) | (i,o) <- zip ins outs]
  _ <- forkIO $ sequence_ [putMVar i (f x) | (i,x) <- zip (cycle ins) xs]
  sequence' [takeMVar o | (o,_) <- zip (cycle outs) xs]
  where
    t = 8  -- number of worker threads; assumed here, set to taste in the original
    -- lazy 'sequence', as described below: delay each action with
    -- unsafeInterleaveIO so the results are produced on demand
    sequence' = foldr (\m ms -> unsafeInterleaveIO (liftM2 (:) m ms)) (return [])

worker :: MVar a -> MVar a -> IO b
worker imv omv = forever $ do
  ex <- takeMVar imv
  ex `seq` putMVar omv ex   -- force the result before handing it back

Basically, I use worker threads, each with two MVars for input and output, arranged in circular queues. The input MVars are populated by a separate thread, and then the outputs are collected in sequence. I don’t bother with terminating threads or cleaning up, so it’s likely this would leak space or have other problems. But it seems to work okay for my purposes. And sequence' is the normal sequence, but lazy, i.e. it uses unsafeInterleaveIO to delay execution.

Benchmarks

To test the parallel performance, I first ran samtools mpileup on six BAM files, and then extracted the first ten million records (lines) from the output file. I then processed the result, calculating confidence intervals, a chi-squared test, and Fst, with the results in the table below.

Table 1: Benchmark results for different numbers of threads.
Threads  wall time  time (s)  size (MB)  cpu (s)  sys (s)
      1       8:00       480       32.0      472        7
      2       4:40       280       39.7      525       19
      4       2:33       153       40.1      550       32
      8       1:43       103       40.7      689       57
     12       1:50       110       36.9      978      106
     16       1:33        93       40.3     1047      125
     24       1:17        77       40.7     1231      152

I would guess that system time is Linux scheduling overhead.

I thought the time with 12 threads was a fluke, but then I reran it thrice: 1:52, 1:50, 1:52. Strange? The only explanation I can see, is that the machine has 12 real cores (hyperthreading giving the pretense of 24), and that running the same number of threads gives some undesirable effect. Running for t={10,11,13,14} gives

Table 2: A more detailed look at performance scaling.
Threads  time
     10  1:45
     11  1:44
     12  1:51
     13  1:45
     14  1:45

So: apparently a fluke in the middle there, and 11 seems like a local minimum. So let’s try also with 23 threads:

% /usr/bin/time varan 10m.mpile.out --conf --chi --f-st > /dev/null +RTS -N23
1165.14user 137.80system 1:15.10elapsed 1734%CPU (0avgtext+0avgdata 41364maxresident)k
0inputs+0outputs (0major+10518minor)pagefaults 0swaps

Sure enough, slightly faster than 24, and 5% less work overall. Of course, the GHC scheduler is constantly being improved, so this will likely change. These numbers were with 7.6.3, I also have a 32-core machine with 7.0.4, but currently, that refuses to compile some of the dependencies.

Conclusions

Memory stays roughly constant (and quite low), scalability is okay, but starts to suffer with increasing parallelism, and gains are small beyond about eight threads.

Genomes vary enormously in size, from a handful of megabases to multiple gigabases (each position corresponding to a record here), but scaling this up to human-sized genomes at about three gigabases indicates a processing time of roughly two days for a single-threaded run, reduced to about ten hours if you can throw four CPUs at it. This particular difference is important, as you can start the job as you leave work, and have the results ready the next morning (as opposed to waiting over the weekend). Anyway, the bottleneck is now samtools. I can live with that.
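
To spell out the arithmetic behind that estimate: three gigabases is roughly 300 times the 10-million-record benchmark above, so the single-threaded figure of 480 s scales to about 300 × 480 s ≈ 144,000 s, or around 40 hours, and the 4-thread figure of 153 s to about 46,000 s, or around 13 hours - the same ballpark as the two-day and ten-hour figures once some overhead is allowed for.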

March 26, 2014 06:00 AM

March 25, 2014

Yesod Web Framework

Package consolidation

A few weeks ago, there was a pretty heated debate about the PVP on the libraries mailing list. I've seen a few different outcomes from that discussion. One is to reduce dependency footprints to try and avoid some of the dependency problems. (Another one is concrete proposals for changes to the PVP; I intend to post a proposal as well in the coming days, but wanted to get the easier stuff out of the way first.)

This blog post is about some plans I have for consolidating multiple packages together, hopefully to result in simpler dependency trees without causing users to have unneeded package dependencies - at least not too often. The reason I'm writing up this blog post is to let the community know about these changes in advance, so that people can let me know if any of these changes will cause them problems. Also, if I list a package as deprecated, and you'd like to take over maintainership, please let me know.

One of the guiding principles I've used in setting this up is that I believe some dependencies should be considered incredibly cheap. Depending on process, for example, is a cheap dependency, since it comes with GHC. (Unless you place restrictive bounds on process, which can instead cause dependency problems.) For more information on these principles, please read my description.

I'll start at the bottom of the dependency tree and build up.

streaming-commons, zlib-bindings, text-stream-decode, conduit and pipes

Currently, there are about six core conduit packages: conduit, zlib-conduit, attoparsec-conduit, etc. In addition, for some of these packages, I've split off helper packages providing functionality to be used by other streaming data libraries as well, such as zlib-bindings and text-stream-decode.

I want to collapse that into just three packages. All of the streaming helpers will end up in a new package, streaming-commons. I've talked with Gabriel Gonzalez about this, and we'll be collaborating together on creating a high-quality, low-dependency library. This library will also include some features currently baked into conduit itself, like lock-free Windows file reading and directory traversals.

All of the conduit core packages on top of that would then be merged into a new package, conduit-extra. So we'd end up with conduit, conduit-extra, and streaming-commons. The only downside is that, if you only needed zlib support, you'll now get a few extra packages as well. However, following the principle I listed above, these extra dependencies should all be coming from the "basically free" dependency category.

Crazier ideas for streaming-commons

This may be taking the idea too far, but we could include some even more advanced tooling in streaming-commons. This could include not only the data types from xml-types and json-types - which provide both streaming and tree-based data structures for those data formats - but also attoparsec parsers and blaze-builder renderers. This could allow quite a bit of the xml-conduit codebase to be shared by the pipes world, for example.

I'm curious if people think this is a cool idea, or too radical (or both!).

Deprecate failure, attempt, and safe-failure

This one's been on my list for a while, pending some details being worked out with Edward Kmett. The goal is to completely deprecate failure and attempt in favor of the exceptions package, and within exceptions split out MonadThrow from MonadCatch.

This will also mean removing some redundant functionality from resourcet. It will be nice to be rid of the custom MonadThrow and MonadUnsafeIO defined there.

http-client-*

A few simple moves: merge http-client-multipart into http-client, and merge http-client-conduit into http-conduit. The latter change will mean that it's a bit more difficult to use http-client in conduit without depending on tls, but that's not a use case anyone has expressed interest in to me.

Another change I'm planning to do at the same time is to add a new module to http-conduit, with an alternate API. There are a few places where I'm dissatisfied with the current API, and this module will work as an experimental next-gen http-conduit. I'm planning on keeping both versions of the API around for quite a while for backwards compatibility, however. The changes I'm looking at are:

  • Avoiding ResumableSource
  • Using with-semantics (like http-client) to avoid accidentally keeping connections open.
  • Don't bake in the need for ResourceT
  • Possibly use lenses for field modifiers on the Request data type.

shakespeare

This change is pretty simple: collapse shakespeare, shakespeare-css, shakespeare-js, shakespeare-text, shakespeare-i18n, and hamlet into a single package. It made sense to keep these separate when APIs were changing rapidly, but things are basically stable now.

wai

Since they don't add any extra dependencies, I'd like to merge wai-test and wai-eventsource into wai-extra. Once again, since we're dealing with stable APIs, this shouldn't cause too much trouble.

I'm also considering deprecating wai-handler-devel, since it's no longer used at all by yesod devel.

Deprecate pool-conduit

pool-conduit used to be a resource pool implementation based on code in conduit. Since then:

  • The actual resource pool implementation is provided by resource-pool.
  • The code I used from conduit has moved to resourcet.

At this point, pool-conduit doesn't really have much in it. If there's code that people are using from it, I'd like to get it merged into resource-pool itself.

yesod

The next iteration of Yesod will have a significantly simpler dispatch system. This new code doesn't really make sense as a general-purpose routing tool, so I'm planning on moving that code into yesod-core itself and deprecating yesod-routes. I know there are other users of yesod-routes; it may make sense to do something like merging yesod-routes into wai-routes, as yesod-routes has no Yesod-specifics in it.

Another minor change: merge yesod-eventsource into yesod-core. No extra dep, and a stable API.

Finally, the big (and somewhat controversial) one: merge most of the yesod-* core packages into the yesod package itself. A number of years ago, we did precisely the opposite. However, given API stability, I think the time has come to simplify our dependency list again here. This will have the added benefit that when a user reports "I'm using Yesod version 1.2.3", it will give us more information.

xml

I'm planning on deprecating dtd, uri-conduit, and xml-catalog, as I don't use them any more and have no time to maintain them.

Another idea I'm playing with is merging html-conduit into xml-conduit. This would add a tagstream-conduit dependency. Alternatively, perhaps tagstream-conduit could be merged into xml-conduit as well.

classy-prelude

Merge classy-prelude-conduit in with classy-prelude. Downside: classy-prelude will depend on conduit-combinators, but that doesn't actually add more than 3 extra packages.

Next step: trimming dependencies in general

I'm not planning on rushing any of these changes. The goal is to take them slowly and methodically, and release changes incrementally. For example, after the conduit changes are done, I'd release new versions of wai, yesod, etc, that are compatible with both the new and old versions. Hopefully the user facing changes will simply be tweaking import lists and cabal files, but users will be able to take their time on this.

Ultimately, I'm planning on releasing new version of persistent (2.0) and Yesod (1.4). You can see the Persistent 2.0 goals. The Yesod 1.4 release doesn't actually have any breaking changes planned, aside from upgrading its dependencies.

One other thing I'm going to be doing in this process is a more general trimming down of dependencies. I'll be going through yesod-platform to find packages we don't need. I also want to avoid redundant packages if possible (e.g., we only need one cryptographic hash package). In many cases, as a community, we have multiple package options available. I think a great move for the community would be to start consolidating those options as well where it makes sense. If I have concrete ideas on that, I'll certainly share them.

March 25, 2014 08:07 AM

March 24, 2014

Mateusz Kowalczyk

New Haddock released! A visual guide to changes.

Posted on March 24, 2014 by Fūzetsu

We’ve just uploaded Haddock 2.14.1 and while you can view the CHANGES file, here I’ll attempt to present all new features added since 2.13.2.1. A quick note that while 2.14.0 is in the CHANGES file, it was never officially released to the public. Consider it an internal release if you will. This basically covers 2.14.0 and 2.14.1. I am posting this now as I hear GHC 7.8.1 is supposed to come out in a few hours and this is the version that you’ll be getting. I had only just realised this but this integrates the changes I have made over the last GSoC into a stable GHC release. FYI, I’m using GHC 7.8-rc2 for any code snippets presented here. Last thing to mention is that any ticket numbers you see here are the tickets as seen on Haddock Trac. We’re actually planning to move everything to GitHub soon so keep that in mind if you’re reading this further in the future. Note that pretty much everything here is described in Haddock documentation (although without nice examples) so please refer to that if you need further information.

Let’s begin!

  • Print entities with missing documentation (#258)

    This adds a --print-missing-docs flag to Haddock. Given a file like this:

    module Foo where
    
    f :: ()
    f = ()
    
    -- | Doc for 'g'
    g :: ()
    g = ()
    
    class FooClass a where

    we can ask Haddock to tell us which docs are missing:

    $ haddock Foo.hs -h -o /tmp --print-missing-docs
    Haddock coverage:
      25% (  1 /  4) in 'Foo'
      Missing documentation for:
        Module header
        f (Foo.hs:3)
        FooClass (Foo.hs:10)

    There has been a suggestion to make this flag default. I’m personally not against it. What do you think?

  • Print a warning message when given -optghc instead of --optghc (#5)

    This is just a quick fix to a long-standing feature request. The problem was that -optghc actually means --odir=ptghc which is probably not what you wanted. We now warn when we see -optghc in the flags. The warning is:

    Warning: `-optghc' means `-o ptghc', did you mean `--optghc'?

  • Add --compatible-interface-versions (#231)

    This simply prints the versions of the .haddock interface files that your Haddock binary knows how to work with.

    $ haddock --compatible-interface-versions
    25

    We had some fairly big changes to the interface file so current Haddock can only work with a single version: this means it can’t re-use .haddock files that your previous versions might have generated.

  • Allow to generate latex documentation for FFI declarations (#247)

    Fairly self-explanatory. Note that I don’t encourage actually trying to use the LaTeX back-end; it is not maintained and has many bugs. It is meant to serve the sole purpose of generating the Haskell Report when that time comes. If you are interested in using this back-end and are willing to put in some time to breathe some life into it, we’d love to have you; contact us!

  • Add copyright and license information to generated documentation

    We let you document modules with a comment containing some special fields. The header is documented here. Consider the following module:

    {-|
    Module      : W
    Description : Short description
    Copyright   : (c) Some Guy, 2013
                      Someone Else, 2014
    License     : GPL-3
    Maintainer  : sample@email.com
    Stability   : experimental
    Portability : POSIX
    
    Here is a longer description of this module, containing some
    commentary with @some markup@.
    -}
    module W where

    Here’s how it renders using 2.13.2:

    Old module info box

    and here is how it renders with 2.14.1:

    New module info box

    As you can see, the copyright holders could perhaps be presented better. Maybe in the next release each one will be on its own line; see ticket #279.

  • Improved Unicode support

    Unicode support previously was very finicky. We now have a new parser which can handle unicode much better. Here’s an example comment with a single definition list:

    -- | [灼眼のシャナ] ℕ ℤ ℚ
    f :: ()
    f = ()

    Here’s how 2.13.2 renders it:

    Old unicode rendering

    and here’s how 2.14.1 renders it:

    New unicode rendering

    Much better! Notice a character missing in the old rendering.

  • Bold markup support

    I have covered this one in the past so here’s only a brief mention. Double underscores are used to denote that something is bold.

    -- | This is bold: __Hello world. Underscores_are_allowed__
    f :: ()
    f = ()

    Bold support

    Note that just like with other such markup (emphasis), we do not allow the user to stretch it over multiple lines.

    -- | This is not bold: __Hello world.
    -- Underscores_are_allowed__
    f :: ()
    f = ()

    No multiline support

    This is by design. We feel that the extra complexity of implementation, and the fact that it changes how 2.13.2 behaved, do not warrant such support. See ticket #126 for minor discussion.

  • Nested paragraphs

    This is a pretty big addition and if you are the type of person that tries to format their comments so that they look nice in source, you’ll probably need to pay attention. Basically, we allow something like what Markdown allows: nesting things under list elements (such as more list elements and so on). A simple example would be nesting a code snippet and another list under some other list. I’m actually showing off two features here. Consider

    {-|
    * This is some list
    
        @
        This
        is
        code
        @
    
        * Another list
        * Second element of inner list, not separated by line break.
    -}
    f :: ()
    f = ()

    2.13.2 makes a mess out of it:

    Old nested lists

    but 2.14.1 does what you might expect:

    New nested lists

    The rule is that everything to be nested under a list element is to be indented 4 spaces from the start of the comment. Note that this is not 4 spaces relative to the start of the previous list item. You also have to make sure that the nested paragraph is separated by a line break so that Haddock doesn’t simply think it’s the continuation of the list.

    A double nesting will therefore look like this:

    {-|
    * Top level
    
        * First nested
    
            * Second nested
    -}
    f :: ()
    f = ()

    Twice nested

    Those with sharp eyes will notice that I have two list elements not broken up by a line break in the initial example. This is now allowed as long as the list elements are of the same type:

    This is now fine:

    -- |
    -- * foo
    -- * bar

    but this is not fine:

    -- |
    -- * foo
    -- 1. bar

    Haddock will think it’s just a single list element and it will look something like this:

    Different type no break

    Please refer to list section of the docs for details. These changes mean that you can write much nicer docs but they also mean that if you wrote something that wasn’t exactly model Haddock before, it might now look radically different! I know that even GHC is guilty of this.

  • Better escaping

    We now have much better escaping behaviour. Consider

    -- | < http:/haskell.org Haskell.org loves \>\>= >

    2.13.2 messes up:

    Old link escape

    But 2.14.1 works as we’d like it to:

    New link escape

    It is actually impossible to have the > character in the link or alt text even with HTML escapes because we don’t accept markup there so it won’t get converted.

    If you don’t need the alt text, we now even automatically try to convert text to links. Consider

    -- | http://haskell.org is cool
    f :: ()
    f = ()

    2.13.2 doesn’t do what we want at all and even swallows up the forward slashes because it thinks it sees (empty) emphasis:

    Old autolink

    2.14.1 does something much more reasonable:

    New autolink

    You should notice that escaping things is much more reasonable now.

  • Header markup

    Headers in regular comments (rather than just for sections) are now allowed. The syntax is multiple = characters, from 1 up to 6. Each back-end decides how to render the different header levels itself.

    {-|
    = Top level
    * Hello
    * World
    
    == Subheader
    === Subsubheader
    @More stuff!@
    -}
    f :: ()
    f = ()

    Headers

    Note that headers have to be at the beginning of a paragraph but we do allow a paragraph to follow without a line break right after it. This allows you to write down things like lists and another header straight after.

  • Parser should no longer fail to parse any markup

    We now aim to be able to parse everything. This means that you should never see a parse failure caused by bad Haddock syntax. For example

    -- | [ hello
    f :: ()
    f = ()

    fails on 2.13.2 with a parse error: doc comment parse failed: [ hello. This will render as you’d expect on 2.14.1:

    No parse error

    This means that if you had documentation that failed to parse due to such an error before, it will now (silently) succeed.

    Important: please note that you can still have a different kind of parse error. If your comment is at a place where we don’t expect it, that’s an error. For example, the following will throw a parse error:

    data F = F () -- ^ Doc for first ()
               () -- ^ Doc for second ()

    gives us W.hs:18:12: parse error on input ‘(’ because we don’t support documentation of each parameter to the constructors.

    Please do not report these as bugs! If you do get a doc comment parse failed then report that, you should not be seeing any of these anymore.

  • {-# OPTIONS_HADDOCK show-extensions #-} pragma will show the GHC extensions enabled in the module.

    I think this is a pretty nifty one. Consider

    {-# LANGUAGE UnicodeSyntax #-}
    {-# LANGUAGE TypeFamilies #-}
    {-# LANGUAGE FunctionalDependencies #-}
    {-# LANGUAGE DataKinds #-}
    {-# LANGUAGE TypeOperators #-}
    {-# LANGUAGE FlexibleInstances #-}
    {-# OPTIONS_HADDOCK show-extensions #-}
    module W where

    You can now ask Haddock to list all enabled extensions (even those implicit ones) with the Haddock pragma that I show above. This particular example renders like this:

    Ext pragma

    If you have a Haskell98/2010/whatever pragma too, that will also get shown. Any extension implied by the current language (H98,2010) is not shown.

    I decided to show all the extensions, including the ones pulled in by stronger ones to discourage enabling the most powerful extensions without a good reason.

    This option is not a default. Do you think it should be?

  • Properly render License field (#271)

    There was a bug where we rendered the wrong thing in the License field. I can’t show you because it already has been patched up. I simply mention this for completeness.

  • Print type/data family instances (for exported types only)

    Fairly self explanatory, your type/data family instances now get shown in the documentation.

    This example is a pretty big one because there’s a fair amount of stuff going into it. This is actually a stripped down version used by Haddock for testing.

    {-# LANGUAGE TypeFamilies, UndecidableInstances, PolyKinds, TypeOperators,
                 DataKinds, MultiParamTypeClasses, GADTs #-}
    module W where
    
    -- | Doc for: data X
    data X
      = X   -- ^ Doc for: X
      | XX  -- ^ Doc for: XX
      | XXX -- ^ Doc for: XXX
    
    -- | Doc for: data Y
    data Y
    
    -- | Doc for: class Test a
    class Test a
    
    -- | Doc for: instance Test X
    instance Test X
    -- | Doc for: instance Test Y
    instance Test Y
    
    -- | Doc for: type family Foo a
    type family Foo a :: k
    
    -- | Doc for: type instance Foo X = Y
    type instance Foo X = Y
    -- | Doc for: type instance Foo Y = X
    type instance Foo Y = X
    
    -- | Doc for: class Assoc a
    class Assoc a where
      -- | Doc for: data AssocD a
      data AssocD a :: *
      -- | Doc for: type AssocT a
      type AssocT a :: *
    
    -- | Doc for: instance Assoc X
    instance Assoc X where
      -- | Doc for: data AssocD X = AssocX
      data AssocD X = AssocX -- ^ Doc for: AssocX
      -- | Doc for: type AssocT X = Foo X
      type AssocT X = Foo X
    
    -- | Doc for: instance Assoc Y
    instance Assoc Y where
      -- | Doc for: data AssocD Y = AssocY
      data AssocD Y = AssocY -- ^ Doc for: AssocY

    and here’s part of how it looks

    Type families

  • Fix display of poly-kinded type operators (#189)

    Old poly-kinded rendering

    New poly-kinded rendering

    We’re still unsure how to display this to the user but at least now it’s not completely wrong. Suggestions are most welcome, please comment on #189.

  • PatternSynonyms support

    GHC 7.8 now has support for pattern synonyms. Here’s an example right from Haddock test-suite.

    {-# LANGUAGE PatternSynonyms, PolyKinds, TypeOperators #-}
    
    -- | Testing some pattern synonyms
    module W where
    -- | FooType doc
    data FooType x = FooCtor x
    
    -- | Pattern synonym for 'Foo' x
    pattern Foo x = FooCtor x
    
    -- | Pattern synonym for 'Bar' x
    pattern Bar x = FooCtor (Foo x)
    
    -- | Pattern synonym for (':<->')
    pattern x :<-> y = (Foo x, Bar y)
    
    -- | Doc for ('><')
    data (a :: *) >< b = Empty
    
    -- | Pattern for 'Empty'
    pattern E = Empty

    The rendering is still pretty experimental so suggestions are welcome!

    Pattern Synonyms

  • Fix display of implicit parameters (#260)

    {-# LANGUAGE RankNTypes #-}
    {-# LANGUAGE ImplicitParams #-}
    module W where
    
    data Configuration
    
    c :: String -> ((?configuration :: Configuration) => IO b) -> IO b
    c = undefined

    Broken implicit params rendering

    Fixed implicit params rendering

  • Fix rendering of Contents when links are present (#276)

    Given

    module W where
    
    -- * Section header with 'f' link
    
    -- | f doc
    f :: ()
    f = ()

    We used to have a problem where a link in the header would break the Contents box rendering.

    Old contents

    That is now fixed. Note that you can no longer click on ‘f’ in the Contents box to be taken there. I feel that this is the expected behaviour.

    New contents

  • Fix documentation duplication on record fields (#195)

    I think this is going to be a pretty controversial one. Consider

    module W where
    
    data F = FOne { field :: () -- ^ Doc for FOne field
                  }
           | FTwo { field :: () -- ^ Doc for FTwo field
                  }

    As ‘field’ is actually the same function, in the past Haddock would join the comments (it’s in the weird order due to an unfixed bug):

    Old record doc rendering

    We now instead take the doc of the first field to occur. Note that this is the case even if the first field has no comment and the others do.

    New record doc rendering

    See ticket #195 if you want to discuss this change. Both behaviours are weird but I think no one intentionally used the old behaviour.
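
    If I read the change correctly, that means that in a hypothetical record like the one below (my own sketch, not taken from the Haddock test-suite), the comment on the second occurrence of the shared field is now dropped, because the first occurrence carries no documentation:

    module W where
    
    data G = GOne { gfield :: ()
                  }
           | GTwo { gfield :: () -- ^ Doc for GTwo gfield (dropped, since GOne's gfield has no doc)
                  }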

  • Add --source-entity-line for exact line links (e.g. things defined inside TH splices) (#79)

    This allows HsColour to insert anchors for TH declarations. Nothing to show here, check the ticket for details.

  • Display fixity information for names with nonstandard fixities

    There’s now a mechanism in place to display the fixity of (type) operators and infix functions, including exotic things like type families and pattern synonyms. No code example here because there’s nothing special you have to do: your new docs should automagically have fixities shown. A small hypothetical sketch follows the screenshot below.

    Fixity rendering
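
    As a hypothetical illustration (my own sketch, not from the Haddock test-suite), the infixr 5 declaration below is a nonstandard fixity that should now show up next to the operator’s documentation:

    module W where
    
    -- | Doc for: (+++)
    (+++) :: [a] -> [a] -> [a]
    (+++) = (++)
    
    infixr 5 +++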

  • Bird tracks specified like “> code” no longer suffer from an extra leading space in the code output

    Pretty self-explanatory. We strip a leading space from code blocks generated by bird tracks.

    -- |
    -- > hello
    -- > world
    f :: ()
    f = ()

    Old bird tracks

    New bird tracks

    This is also planned for ‘@’-style code blocks, which should be implemented in the next Haddock release, most likely 2.15.0, coming out with GHC 7.8.2.

  • Render * and -> with their UnicodeSyntax equivalents if -U is enabled

    Replaces * and -> in more places than 2.13.2 did; a small hypothetical example follows the screenshots below.

    Old unicode syntax

    New unicode syntax
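
    As a hypothetical example (my own sketch, not from the Haddock test-suite), with -U enabled both the kind annotation and the arrows below should now be rendered with their Unicode equivalents:

    {-# LANGUAGE KindSignatures #-}
    module W where
    
    -- | Doc for: Proxy
    data Proxy (a :: *) = Proxy
    
    -- | Doc for: mapProxy
    mapProxy :: (a -> b) -> Proxy a -> Proxy b
    mapProxy _ _ = Proxy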

  • Display minimal complete definitions for type classes

    I feel this is a nice feature. GHC now supports MINIMAL pragmas and we are now able to display them in the docs. Another example right out of the Haddock test-suite:

    module W
      ( Foo(..)
      , Weird(..)
      , NoMins(..)
      , FullMin(..)
      , PartialMin(ccc)
      , EmptyMin(..)
      ) where
    
    class Foo a where
      -- | Any two of these are required...
      foo, bar, bat :: a
    
      -- | .. or just this
      fooBarBat :: (a,a,a)
    
      {-# MINIMAL (foo, bar) | (bar, bat) | (foo, bat) | fooBarBat #-}
    
    class Weird a where
      a,b,c,d,e,f,g :: a
    
      {-# MINIMAL ((a, b), c | (d | (e, (f | g)))) #-}
    
    class NoMins a where
      x,y,z :: a
    
      -- | Has a default implementation!
      z = x
    
    class FullMin a where
      aaa,bbb :: a
    
    class PartialMin a where
      ccc,ddd :: a
    
    class EmptyMin a where
      eee,fff :: a
      eee = fff
      fff = undefined

    Minimal pragma

    Again, I ask you to ignore the silly ordering of some grouped functions; this is the aforementioned old bug. Hopefully we can fix it by the next release.

  • Hide right hand side of TF instances with hidden names on the RHS

    This changes slightly which type family RHSs are hidden. It is a change between 2.14.0 and 2.14.1 and is only mentioned for completeness.

This is it for all the changes I can think of, but I’m sure I missed something! There was some other minor stuff fixed up that doesn’t deserve a mention on its own (such as fixing bullet-point rendering in constructor docs, #281), so I encourage you to read the commit history if you need to know all the little details.

While I’d love to end it here, I do have to admit that there’s a regression in this release which we don’t get to fix until GHC 7.8.2.

Namely, if you have a (very common) comment like this:

-- |
-- @
-- some code
-- goes here
-- @
f :: ()
f = ()

2.13.2 will render it like this:

Old codeblock rendering

and 2.14.1 like this:

New codeblock rendering

The problem is that while Haskellers are used to putting a space after the comment marker --, that space is actually part of the comment, so we end up with an extra ‘empty’ line which actually contains a single space: the line with the closing @ on it.

All of the following let you workaround the problem:

-- |
-- > some code
-- > goes here
f :: ()
f = ()

-- |
--@
--some code
--goes here
--@
g :: ()
g = ()

{-|
@
some code
goes here
@
-}
i :: ()
i = ()

Surprisingly, that second form is allowed. If you care a lot about the extra line, please use one of the workarounds for now, and I’m sorry! If you don’t mind that it looks a bit on the ugly side for a while, we’ll have a fix in the next release, most likely to ship with GHC 7.8.2.

Thanks!

March 24, 2014 03:33 PM

March 23, 2014

Christopher Done

Emacs support for debugging Haskell

One odd thing about the Haskell community is that we don’t use debuggers. We don’t even use stack traces. I think for several reasons:

  1. Haskell code goes wrong less often.
  2. Due to that, people are content with once in a while sticking in printf statements.
  3. Lazy evaluation is known to make debugging tricky.
  4. Haskell has no decent editor support for debugging.

It’s at least my experience that when my Haskell code goes wrong, I don’t fret too much because I put my code into small functions. If something’s not working as expected then I run that function in GHCi with typical inputs and check that the result is as desired.

If I’m in a codebase that makes it hard to do that, then I’ll insert some error or trace calls and re-run the whole program.

It’s also my experience that I don’t care about GHCi’s debugging support if I have to manually set breakpoints myself and step through things manually. Who wants to bother doing that? It’s like running git commands manually instead of using magit.

So, I thought, as an experiment, I’d whip up a simple magit-like interface to GHCi’s debugging facilities, based upon my (limited) understanding from the manual and the tutorials about it, which should help me answer the following questions:

  1. Is GHCi’s debugger any good? I.e. is it useful, rather than quirky or obtuse?
  2. Is it practical? I.e. does it work on real project code?
  3. Is lazy evaluation as problematic as suspected for real code?
  4. Does Haskell lend itself to debugging with GHCi enough that I’d reach for it as part of my problem-solving workflow?

Removing the “it’s hard to use” impedance from the debugger puts me in a good position to evaluate the above questions. Is it that we as a community are missing out on sweet tooling because we just don’t have a convenient way to use it?

Here is a video demonstrating a trivial usage of it, i.e. the one that I’ve been testing with. I was too lazy to commentate it, but you can pause the video to see the display. I just set a breakpoint on fib and step into the main function and keep stepping until evaluation is complete. At one point I go to the REPL to inspect the local binding x and then go back to stepping.

In the display you can see the current expression being evaluated, and the values of the local bindings and their types below. I think there’s room for in-place expansion of values, here. But I need to experiment with more interesting data structures.

I invite overenthusiastic early adopters to try pulling from the haskell-mode repo to play around with it and patch up obvious deficiencies. You have to be using haskell-interactive-mode’s REPL, (require 'haskell-debug) and then run M-x haskell-debug from a Haskell file or the REPL, like I do in the video. The rest should be fairly obvious from the buffer’s helpful messages. But if it doesn’t work as you expected because I don’t know the first thing about how debuggers are supposed to be used, don’t blame me. Just fix it.

I’m completely ambivalent about whether a debugger is really useful; I’ve never really used one properly. So I’ll try to use it to solve some problems—once I can find some decent use-cases—and report back in a few weeks. For science!

March 23, 2014 12:00 AM

March 22, 2014

Ken T Takusagawa

[dvxmfpfy] Breaking a list into chunks

Breaking a list into chunks is an unfold:

list_chunks :: Int -> [a] -> [[a]];
list_chunks chunk_size = unfoldr $ \l -> if null l then Nothing else Just $ splitAt chunk_size l;

The output tuple of splitAt is conveniently just the right format for input to unfoldr. Is there a more elegant way to write if null l then Nothing else Just ...? The function listToMaybe doesn't quite work, as it assumes a single-element list.

> list_chunks 3 "hello world"
["hel","lo ","wor","ld"]

Here is a ByteString version:

import qualified Data.ByteString.Lazy as BS;
bytestring_blocks :: Int64 -> BS.ByteString -> [BS.ByteString];
bytestring_blocks block_size = unfoldr $ \b -> if BS.null b then Nothing else Just $ BS.splitAt block_size b;

After getting recursion beat into our skulls in introductory computer science courses, there's a tendency in functional programming to eschew explicit recursion in favor of higher-order functions like unfoldr, which capture the recursive computational structure.

by Ken (noreply@blogger.com) at March 22, 2014 12:52 AM

March 20, 2014

Dominic Steinitz

Bayesian Analysis: A Conjugate Prior and Markov Chain Monte Carlo

Introduction

This is meant to be a shorter blog post than normal, with the expectation that the material will be developed further in future posts.

A Bayesian will have a prior view of the distribution of some data and then, based on the data, update that view. Usually the updated distribution, the posterior, will not be expressible as an analytic function, and sampling via Markov Chain Monte Carlo (MCMC) is the only way to determine it.

In some special cases, when the posterior is in the same family of distributions as the prior, the posterior is available analytically and we call the posterior and prior conjugate. It turns out that the normal or Gaussian distribution is conjugate with respect to a normal likelihood distribution.

This gives us the opportunity to compare MCMC against the analytic solution and give ourselves more confidence that MCMC really does deliver the goods.

Some points of note:

  • Since we want to display the posterior (and the prior for that matter), for histograms we use the histogram-fill package.

  • Since we are using Monte Carlo we can use all the cores on our computer via one of Haskell’s parallelization mechanisms.

Preamble

> {-# OPTIONS_GHC -Wall                      #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing   #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults    #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind   #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods  #-}
> {-# OPTIONS_GHC -fno-warn-orphans          #-}
> {-# LANGUAGE NoMonomorphismRestriction     #-}
> module ConjMCMCSimple where
> 
> import qualified Data.Vector.Unboxed as V
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import qualified Data.Histogram as H
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import Control.Parallel.Strategies
> 
> import Diagrams.Backend.Cairo.CmdLine
> 
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )
> 
> import LinRegAux

A Simple Example

Analytically

Suppose the prior is \mu \sim \cal{N}(\mu_0, \sigma_0), that is

\displaystyle   \pi(\mu) \propto \exp{\bigg( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\bigg)}

Our data is IID normal, x_i \sim \cal{N}(\mu, \sigma), where \sigma is known, so the likelihood is

\displaystyle   p(x\,|\,\mu, \sigma) \propto \prod_{i=1}^n \exp{\bigg( -\frac{(x_i - \mu)^2}{2\sigma^2}\bigg)}

The assumption that \sigma is known is unlikely but the point of this post is to demonstrate MCMC matching an analytic formula.

This gives a posterior of

\displaystyle   \begin{aligned}  p(\mu\,|\, \boldsymbol{x}) &\propto \exp{\bigg(  -\frac{(\mu - \mu_0)^2}{2\sigma_0^2}  - \frac{\sum_{i=1}^n(x_i - \mu)^2}{2\sigma^2}\bigg)} \\  &\propto \exp{\bigg[-\frac{1}{2}\bigg(\frac{\mu^2 \sigma^2 -2\sigma^2\mu\mu_0 - 2\sigma_0^2n\bar{x}\mu + \sigma_0^2 n\mu^2}{\sigma^2\sigma_0^2}\bigg)\bigg]} \\  &= \exp{\bigg[-\frac{1}{2}\bigg(\frac{ (n\sigma_0^2 + \sigma^2)\mu^2 - 2(\sigma^2\mu_0 + \sigma_0^2n\bar{x})\mu}{\sigma^2\sigma_0^2}\bigg)\bigg]} \\  &= \exp{\Bigg[-\frac{1}{2}\Bigg(\frac{ \mu^2 - 2\mu\frac{(\sigma^2\mu_0 + \sigma_0^2n\bar{x})}{(n\sigma_0^2 + \sigma^2)}}{\frac{\sigma^2\sigma_0^2}{(n\sigma_0^2 + \sigma^2)}}\Bigg)\Bigg]} \\  &\propto \exp{\Bigg[-\frac{1}{2}\Bigg(\frac{\big(\mu - \frac{(\sigma^2\mu_0 + \sigma_0^2n\bar{x})}{(n\sigma_0^2 + \sigma^2)}\big)^2}{\frac{\sigma^2\sigma_0^2}{(n\sigma_0^2 + \sigma^2)}}\Bigg)\Bigg]}  \end{aligned}

In other words

\displaystyle   \mu\,|\, \boldsymbol{x} \sim {\cal{N}}\bigg(\frac{\sigma^2\mu_0 + n\sigma_0^2\bar{x}}{n\sigma_0^2 + \sigma^2}, \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2} \bigg)

Writing

\displaystyle   \sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2}

we get

\displaystyle   \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}

Thus the precision (the inverse of the variance) of the posterior is the precision of the prior plus the precision of the data scaled by the number of observations. This gives a nice illustration of how Bayesian statistics improves our beliefs.
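
As a quick numerical sanity check (my own aside, using the values \sigma_0 = 2, \sigma = 1 and n = 10 that appear in the code below):

\displaystyle   \frac{1}{\sigma_n^2} = \frac{10}{1^2} + \frac{1}{2^2} = 10.25 \quad \Rightarrow \quad \sigma_n^2 \approx 0.098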

Writing

\displaystyle   \mu_n = \frac{\sigma^2\mu_0 + n\sigma_0^2\bar{x}}{n\sigma_0^2 + \sigma^2}

and

\displaystyle   \lambda = 1 / \sigma^2, \, \lambda_0 = 1 / \sigma_0^2, \, \lambda_n = 1 / \sigma_n^2

we see that

\displaystyle   \mu_n = \frac{n\bar{x}\lambda + \mu_0\lambda_0}{\lambda_n}

Thus the mean of the posterior is a weighted sum of the prior mean and the sample mean, with weights given by the precision of the prior and by the precision of the data scaled by the number of observations.

Rather arbitrarily let us pick a prior mean of

> mu0 :: Double
> mu0 = 11.0

and express our uncertainty about it with a largish prior variance

> sigma_0 :: Double
> sigma_0 = 2.0

And, also arbitrarily, let us pick the known variance for the samples as

> sigma :: Double
> sigma = 1.0

We can sample from this in a way that looks very similar to STAN and JAGS:

> hierarchicalSample :: MonadRandom m => m Double
> hierarchicalSample = do
>   mu <- sample (Normal mu0 sigma_0)
>   x  <- sample (Normal mu sigma)
>   return x

and we didn’t need to write a new language for this.

Again arbitrarily let us take

> nSamples :: Int
> nSamples = 10

and use

> arbSeed :: Int
> arbSeed = 2

And then actually generate the samples.

> simpleXs :: [Double]
> simpleXs =
>   evalState (replicateM nSamples hierarchicalSample)
>             (pureMT $ fromIntegral arbSeed)

Using the formulae derived above, we can calculate the posterior

> mu_1, sigma1, simpleNumerator :: Double
> simpleNumerator = fromIntegral nSamples * sigma_0**2 + sigma**2
> mu_1 = (sigma**2 * mu0 + sigma_0**2 * sum simpleXs) / simpleNumerator
> sigma1 = sigma**2 * sigma_0**2 / simpleNumerator

and then compare it against the prior

The red posterior shows we are a lot more certain now that we have some evidence.

Via Markov Chain Monte Carlo

The theory behind MCMC is described in a previous post. We need to generate some proposed steps for the chain; we sample from the normal distribution, but we could have used e.g. the gamma.

> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
>   evalState (replicateM nIters (sample (Normal 0.0 sigma)))
>   (pureMT $ fromIntegral seed)

We also need samples from the uniform distribution

> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
>   evalState (replicateM nIters (sample stdUniform))
>   (pureMT $ fromIntegral seed)

And now we can calculate the (un-normalised) prior, likelihood and posterior

> prior :: Double -> Double
> prior mu = exp (-(mu - mu0)**2 / (2 * sigma_0**2))
> 
> likelihood :: Double -> [Double] -> Double
> likelihood mu xs = exp (-sum (map (\x -> (x - mu)**2 / (2 * sigma**2)) xs))
> 
> posterior :: Double -> [Double] -> Double
> posterior mu xs = likelihood mu xs * prior mu

The Metropolis algorithm tells us that we always jump to a better place but only sometimes jump to a worse place. We count the number of acceptances as we go.
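
In symbols (my own aside, matching the acceptanceProb function below), the probability of accepting a proposed move from \mu to \mu' given the data \boldsymbol{x} is

\displaystyle   \alpha(\mu, \mu') = \min\bigg(1, \frac{p(\mu'\,|\,\boldsymbol{x})}{p(\mu\,|\,\boldsymbol{x})}\bigg)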

> acceptanceProb :: Double -> Double -> [Double] -> Double
> acceptanceProb mu mu' xs = min 1.0 ((posterior mu' xs) / (posterior mu xs))
> oneStep :: (Double, Int) -> (Double, Double) -> (Double, Int)
> oneStep (mu, nAccs) (proposedJump, acceptOrReject) =
>   if acceptOrReject < acceptanceProb mu (mu + proposedJump) simpleXs
>   then (mu + proposedJump, nAccs + 1)
>   else (mu, nAccs)

Now we can actually run our simulation. We set the number of jumps and a burn-in, but do not do any thinning.

> nIters, burnIn :: Int
> nIters = 300000
> burnIn = nIters `div` 10

Let us start our chain at

> startMu :: Double
> startMu = 10.0

and set the variance of the jumps to

> jumpVar :: Double
> jumpVar = 0.4
> test :: Int -> [(Double, Int)]
> test seed =
>   drop burnIn $
>   scanl oneStep (startMu, 0) $
>   zip (normalisedProposals seed jumpVar nIters)
>       (acceptOrRejects seed nIters)

We put the data into a histogram

> numBins :: Int
> numBins = 400
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = startMu - 1.5*sigma_0
>     upper = startMu + 1.5*sigma_0
> 
> hist :: Int -> Histogram V.Vector BinD Double
> hist seed = fillBuilder hb (map fst $ test seed)

Not bad but a bit lumpy. Let’s try a few runs and see if we can smooth things out.

> hists :: [Histogram V.Vector BinD Double]
> hists = parMap rpar hist [3,4..102]
> emptyHist :: Histogram V.Vector BinD Double
> emptyHist = fillBuilder hb (replicate numBins 0)
> 
> smoothHist :: Histogram V.Vector BinD Double
> smoothHist = foldl' (H.zip (+)) emptyHist hists

Quite nice, and it had my machine running at 750% CPU with +RTS -N8.

Comparison

Let’s create the same histogram but from the posterior created analytically.

> analPosterior :: [Double]
> analPosterior =
>   evalState (replicateM 100000 (sample (Normal mu_1 (sqrt sigma1))))
>   (pureMT $ fromIntegral 5)
> 
> histAnal :: Histogram V.Vector BinD Double
> histAnal = fillBuilder hb analPosterior

And then compare them. Because they overlap so well, we show the MCMC, the analytic, and both together on separate charts.

Postamble

Normally with BlogLiteratelyD, we can generate diagrams on the fly. However, here we want to run the simulations in parallel so we need to actually compile something.

ghc -O2 ConjMCMCSimple.lhs -main-is ConjMCMCSimple -threaded -fforce-recomp
> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
>   mainRender ( DiagramOpts (Just 900) (Just 700) fn
>              , DiagramLoopOpts False Nothing 0
>              )
> main :: IO ()
> main = do
>   displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/HistMCMC.png"
>     (barDiag MCMC
>      (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
> 
>   displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/HistMCMCAnal.png"
>     (barDiag MCMCAnal
>      (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
> 
>   displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/HistAnal.png"
>     (barDiag Anal
>      (zip (map fst $ asList (hist 2)) (map snd $ asList (hist 2)))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))
> 
>   displayHeader "http://idontgetoutmuch.files.wordpress.com/2014/03/SmoothHistMCMC.png"
>     (barDiag MCMC
>      (zip (map fst $ asList smoothHist) (map snd $ asList smoothHist))
>      (zip (map fst $ asList histAnal) (map snd $ asList histAnal)))

by Dominic Steinitz at March 20, 2014 03:28 PM

Yesod Web Framework

Efficient directory traversals

I've just released a new version of conduit-combinators which adds two new functions: sourceDirectory and sourceDirectoryDeep. The former lists all of the children of a given directory, and the latter traverses deeply into a directory tree and lists all of the files present.

To see how this is used, consider the following simple example, which prints out the total number of files in the current directory tree. (Note: the False parameter means not to traverse symlinks to directories.)

{-# LANGUAGE OverloadedStrings #-}
import Conduit

main :: IO ()
main = runResourceT (sourceDirectoryDeep False "." $$ lengthC) >>= print

Note that this is equivalent to running find . -type f | wc -l.

This new function supersedes Data.Conduit.Filesystem.traverse, which uses more memory. To give you an idea of the difference, for a directory structure with 3361 files, traverse uses a maximum residency of 957KB, whereas sourceDirectoryDeep uses a maximum of 48KB.

In the implementation of traverse, the entire contents of a directory are read into memory as a list, and then each entry is analyzed one at a time. If the entry is a file, it is yielded, and then can be garbage collected. But if the entry is a directory, that directory must then be traversed, at which point both the remaining contents from the parent directory, and the contents of the new directory, are in memory simultaneously. By contrast, in sourceDirectoryDeep, only a single file path is read into memory at a given time.

Even if you're just doing shallow traversals, you can get a memory improvement by using sourceDirectory instead of getDirectoryContents.

Deprecating filesystem-conduit

At this point, I'm also deprecating filesystem-conduit, as all of its functionality is represented in conduit-combinators. I'm actually hoping to consolidate a few other packages over the coming weeks in an effort to simplify dependency trees a bit. I'll post on the subject when I have some more concrete plans.

March 20, 2014 10:23 AM