I've gotten into the habit of using tabs, via tab-bar, to organise my buffers
when I have multiple projects open at once. Each project has its own tab.
There's nothing fancy here (yet), I simply open a new tab manually before
opening a new project.
A while ago I added bufferlo to my config to help with getting consult-buffer
to organise buffers (somewhat) by tab. I copied the configuration from the
bufferlo README and started using it. It took me a little while to notice that
the behaviour wasn't quite what I wanted. It seemed like one buffer "leaked"
from another tab.
Figure 1: Example of buffer leakage
In the image above all files in ~/.emacs.d should be listed under Other
Buffers, but one has been brought over into the tab for the Sider project.
After a bit of experimenting I realised that
1. the buffer that leaks is the one I’m in when creating the new tab, and
2. my function for creating a new tab doesn’t work the way I thought.
My function for creating a new tab looked like this
(lambda () (interactive) (tab-new) (dashboard-open))
and it turns out that tab-new shows the current buffer in the new tab, which in
turn caused bufferlo to associate it with the wrong tab. From what I can see
there's no way to tell tab-new to open a specific buffer in the newly created
tab. I tried the following
Welcome back to our Rust vs. Haskell comparison series, featuring some of the most common LeetCode questions. We’ve done a couple graph problems the last two weeks, involving DFS and BFS.
Today we’ll do a graph problem involving a slightly more complicated algorithm. We’ll also use a couple data structures we haven’t seen in this series yet, and we’ll see how tricky it can get to have multiple mutable structures in a Haskell algorithm.
To learn all the details of managing your data structures in Haskell, check out Solve.hs, our problem solving course. You’ll learn all the key APIs, important algorithms, and you’ll get a lot of practice with LeetCode style questions!
The Problem
Today’s problem is called Course Schedule. We are given a number of courses, and a list of prerequisites among those courses. For a prerequisite pair (A,B), we cannot take Course A until we have taken Course B. Our job is to determine, in a sense, if the prerequisite list is well-defined. We want to see whether or not the list would actually allow us to take all the courses.
As an example, suppose we had these inputs:
Number Courses: 4
Prerequisites: [(2, 0), (1,0), (3,1), (3,2)]
This is a well-defined set of courses. In order to take courses 1 and 2, we must take course 0. Then in order to take course 3, we have to take courses 1 and 2. So if we have the ordering 0->1->2->3, we can take all the courses, and we would return True.
However, if we were to add (1,3) there, we would not be able to take all the courses. We could take courses 0 and 2, but then we would be stuck because 1 and 3 have a mutual dependency. So we would return False with this list.
We are guaranteed that the course indices in the prerequisites list are in the range [0, numCourses - 1]. We are also guaranteed that all prerequisites are unique.
The Algorithm
For our algorithm, we will imagine these courses as living in a directed graph. If course A is a prerequisite of Course B, there should be a directed edge from A to B. This problem essentially boils down to determining if this graph has a cycle or not.
There are many ways to approach this, including relying on DFS or BFS as we discussed in the past two weeks! However, to introduce a new idea, we’ll solve this problem using the idea of topological sorting.
We can think of nodes as having “in degrees”. The “in degree” of a node is the number of directed edges coming into it. We are particularly concerned with nodes that have an in degree of 0. These are courses with no prerequisites, which we can take immediately.
Each time we “take” a course, we can increment a count of the courses we’ve taken, and then we can “remove” that node from the graph by decrementing the in degrees of all nodes that it is pointing to. If any of these nodes have their in degrees drop to 0 as a result of this, we can then add them to a queue of “0 degree nodes”.
If, once the queue is exhausted, we’ve taken every course, then we have proven that we can satisfy all the requirements! If not, then there must be a cycle preventing some nodes from ever having in-degree 0.
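As a quick sanity check, here is how this plays out on the example above, with 4 courses and prerequisites [(2, 0), (1,0), (3,1), (3,2)]:
Initial in-degrees: [0, 1, 1, 2], queue: [0], taken: 0
Take 0 (courses 1 and 2 drop to in-degree 0): queue: [1, 2], taken: 1
Take 1 (course 3 drops to in-degree 1): queue: [2], taken: 2
Take 2 (course 3 drops to in-degree 0): queue: [3], taken: 3
Take 3: queue: [], taken: 4 == numCourses, so we return True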
Rust Solution
We’ll start with a Rust solution. We need to manage a few different structures in this problem. The first two will be vectors giving us information about each course. We want to know the current “in degree” as well as having a list of the courses “unlocked” by each course.
Each “prerequisite” pair gives the unlocked course first, and then the prerequisite course. We’ll call these “post” and “pre”, respectively. We increase the in-degree of “post” and add “post” to the list of courses unlocked by “pre”:
pub fn can_finish(num_courses: i32, prerequisites: Vec<Vec<i32>>) -> bool {
// More convenient to use usize
let n = num_courses as usize;
let mut inDegrees = Vec::with_capacity(n);
inDegrees.resize(n, 0);
// Maps from “pre” course to “post” course
let mut unlocks: Vec<Vec<usize>> = Vec::with_capacity(n);
unlocks.resize(n, Vec::new());
for req in prerequisites {
let post = req[0] as usize;
let pre = req[1] as usize;
inDegrees[post] += 1;
unlocks[pre].push(post);
}
...
}
Now we need to make a queue of 0-degree nodes. This uses VecDeque from last time. We’ll go through the initial in-degrees list and add all the nodes that are already 0. Then we’ll set up our loop to pop the front element until empty:
pub fn can_finish(num_courses: i32, prerequisites: Vec<Vec<i32>>) -> bool {
let n = num_courses as usize;
...
// Make a queue of 0 degree
let mut queue: VecDeque<usize> = VecDeque::new();
for i in 0..(num_courses as usize) {
if inDegrees[i] == 0 {
queue.push_back(i);
}
}
let mut numSatisfied = 0;
while let Some(course) = queue.pop_front() {
...
}
return numSatisfied == num_courses;
}
All we have to do now is process the course at the front of the queue each time. We always increment the number of courses satisfied, since de-queuing a course indicates we are taking it. Then we loop through unlocks and decrement each of their in degrees. If reducing an in-degree takes it to 0, then we add this unlocked course to the back of the queue:
pub fn can_finish(num_courses: i32, prerequisites: Vec<Vec<i32>>) -> bool {
let n = num_courses as usize;
...
let mut numSatisfied = 0;
while let Some(course) = queue.pop_front() {
numSatisfied += 1;
for post in &unlocks[course] {
inDegrees[*post] -= 1;
if (inDegrees[*post] == 0) {
queue.push_back(*post);
}
}
}
return numSatisfied == num_courses;
}
This completes our solution! Here is the full Rust implementation:
pub fn can_finish(num_courses: i32, prerequisites: Vec<Vec<i32>>) -> bool {
let n = num_courses as usize;
// Make a vector with inDegree Count
let mut inDegrees = Vec::with_capacity(n);
inDegrees.resize(n, 0);
// Make a vector of "unlocks"
let mut unlocks: Vec<Vec<usize>> = Vec::with_capacity(n);
unlocks.resize(n, Vec::new());
for req in prerequisites {
let post = req[0] as usize;
let pre = req[1] as usize;
inDegrees[post] += 1;
unlocks[pre].push(post);
}
// Make a queue of 0 degree
let mut queue: VecDeque<usize> = VecDeque::new();
for i in 0..(num_courses as usize) {
if inDegrees[i] == 0 {
queue.push_back(i);
}
}
let mut numSatisfied = 0;
while let Some(course) = queue.pop_front() {
numSatisfied += 1;
for post in &unlocks[course] {
inDegrees[*post] -= 1;
if (inDegrees[*post] == 0) {
queue.push_back(*post);
}
}
}
return numSatisfied == num_courses;
}
Haskell Solution
In Haskell, we can follow this same approach. However, this is a somewhat challenging algorithm for Haskell beginners, because there are a lot of data structure “modifications” occurring, and expressions in Haskell are immutable! So we’ll organize our solution into three different parts:
Initializing our structures
Writing loop modifiers
Writing the loop
This solution will introduce two data structures we haven’t used in this series so far: the IntMap and the Sequence (Seq), which we’ll use qualified like so:
import qualified Data.IntMap.Lazy as IM
import qualified Data.Sequence as Seq
The IntMap type works more or less exactly like a normal Map, with the same API. However, it assumes we have Int as our key type, which makes certain operations more efficient than a generic ordered map.
Then Seq is the best thing to use for a FIFO queue. We would have used this last week if we implemented BFS from scratch.
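For a quick taste of the API (the push and pop helpers here are just illustrations, not part of the solution), using a Seq as a FIFO queue means appending at the back and viewing from the front:
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, ViewL(..), (|>))

-- Enqueue at the back of the sequence.
push :: Seq a -> a -> Seq a
push q x = q |> x

-- Dequeue from the front, if the sequence is non-empty.
pop :: Seq a -> Maybe (a, Seq a)
pop q = case Seq.viewl q of
  EmptyL -> Nothing
  (x :< xs) -> Just (x, xs)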
We’ll also make a few type aliases, since we’ll be combining these structures and frequently using them in type signatures:
type DegCount = IM.IntMap Int
type CourseMaps = (DegCount, IM.IntMap [Int])
type CourseState = (Int, Seq.Seq Int, DegCount)
The setup to our problem is fairly simple. Our function takes the number of courses as an integer, and the prerequisites as a list of tuples. We’ll write a number of helper functions beneath this top level definition, but for additional clarity, we’ll show them independently as we write them.
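canFinishCourses :: Int -> [(Int, Int)] -> Bool
canFinishCourses numCourses prereqs = ...
  where
    ...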
Recall that the first part of our Rust solution focused on populating 3 structures:
The list of in-degrees (per node)
The list of “unlocks” (per node)
The initial queue of 0-degree nodes
We use IntMaps for the first two (and use the alias DegCount for the first). These are easier to modify than vectors in Haskell. The other noteworthy fact is that we want to create these together (this is why we have the CourseMaps alias combining them). We process each prerequisite pair, updating both of these maps. This means we want to write a folding function like so:
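processPrereq :: (Int, Int) -> CourseMaps -> CourseMaps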
For this function, we want to define two more helpers. One that will make it easier to increment the key of a degree value, and one that will make it easy to append a new unlock for the other mapping.
incKey :: Int -> DegCount -> DegCount
appendUnlock :: Int -> Int -> IM.IntMap [Int] -> IM.IntMap [Int]
These two helpers are straightforward to implement. In each case, we check for the key existing. If it doesn’t exist, we insert the default value (either 1 or a singleton list). If it exists, we either increment the value for the degree, or we append the new unlocked course to the existing list.
incKey :: Int -> DegCount -> DegCount
incKey k mp = case IM.lookup k mp of
Nothing -> IM.insert k 1 mp
Just x -> IM.insert k (x + 1) mp
appendUnlock :: Int -> Int -> IM.IntMap [Int] -> IM.IntMap [Int]
appendUnlock pre post mp = case IM.lookup pre mp of
Nothing -> IM.insert pre [post] mp
Just prev -> IM.insert pre (post : prev) mp
Now it’s very tidy to implement our folding function, and apply it to get these initial values:
processPrereq :: (Int, Int) -> CourseMaps -> CourseMaps
processPrereq (post, pre) (inDegrees', unlocks') =
(incKey post inDegrees', appendUnlock pre post unlocks')
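(inDegrees, unlocks) = foldr processPrereq (IM.empty, IM.empty) prereqs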
Now we want to build our initial queue as well. For this, we just want to loop through the possible course numbers, and add any that are not in the map for inDegrees (we never insert something with a value of 0).
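queue = Seq.fromList (filter (`IM.notMember` inDegrees) [0..numCourses-1])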
Now we have to consider what structures are going to be part of our “loop” and how we’re going to modify them. The type alias CourseState already expresses our loop state. We want to track the number of courses satisfied so far, the queue of 0-degree nodes, and the remaining in-degree values.
The key modification is that we can reduce the in-degrees of remaining courses. When we do this, we want to know immediately if we reduced the in-degree to 0. So let’s write a function that decrements the value, except that it deletes the key entirely if it drops to 0. We’ll return a boolean indicating if the key no longer exists in the map after this process:
decKey :: Int -> DegCount -> (DegCount, Bool)
decKey key mp = case IM.lookup key mp of
Nothing -> (mp, True)
Just x -> if x <= 1
then (IM.delete key mp, True)
else (IM.insert key (x - 1) mp, False)
Now what’s the core function of the loop? When we “take” a course, we loop through its unlocks, reduce all their degrees, and track which ones are now 0. Since this is a loop that updates state (the remaining inDegrees), we want to write a folding function for it:
decDegree :: Int -> (DegCount, [Int]) -> (DegCount, [Int])
First we perform the decrement. Then if decKey returns True, we’ll add the course to our new0s list.
decDegree :: Int -> (DegCount, [Int]) -> (DegCount, [Int])
decDegree post (inDegrees', new0s) =
let (inDegrees'', removed) = decKey post inDegrees'
in (inDegrees'', if removed then (post : new0s) else new0s)
Writing the Loop
With all these helpers at our disposal, we can finally write our core loop. Recall the 3 parts of our loop state: the number of courses taken so far, the queue of 0-degree courses, and the in-degree values. This loop should just return the number of courses completed:
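loop :: CourseState -> Int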
If the queue is empty, we just return our accumulated number. While we’re at it, the final action is to simply compare this loop result to the total number of courses to get our final result:
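loop (numSatisfied, queue', inDegrees') = case Seq.viewl queue' of
  Seq.EmptyL -> numSatisfied
  ...

canFinishCourses numCourses prereqs = loop (0, queue, inDegrees) == numCourses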
When we “pop” the first course off of the queue, we first get the list of “post” courses that could now be unlocked by this course. Then we can apply our decDegree helper to get the final inDegrees’’ map and the “new 0’s”. Finally, we push these new 0-degree courses onto the back of the queue and recurse with an incremented count:
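loop :: CourseState -> Int
loop (numSatisfied, queue', inDegrees') = case Seq.viewl queue' of
  Seq.EmptyL -> numSatisfied
  (course Seq.:< rest) ->
    let posts = fromMaybe [] (IM.lookup course unlocks)
        (inDegrees'', new0s) = foldr decDegree (inDegrees', []) posts
        queue'' = foldl (Seq.|>) rest new0s
    in loop (numSatisfied + 1, queue'', inDegrees'')
Here is the complete Haskell solution: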
import Data.Maybe (fromMaybe)
import qualified Data.IntMap.Lazy as IM
import qualified Data.Sequence as Seq

type DegCount = IM.IntMap Int
type CourseMaps = (DegCount, IM.IntMap [Int])
type CourseState = (Int, Seq.Seq Int, DegCount)
canFinishCourses :: Int -> [(Int, Int)] -> Bool
canFinishCourses numCourses prereqs = loop (0, queue, inDegrees) == numCourses
where
incKey :: Int -> DegCount -> DegCount
incKey k mp = case IM.lookup k mp of
Nothing -> IM.insert k 1 mp
Just x -> IM.insert k (x + 1) mp
appendUnlock :: Int -> Int -> IM.IntMap [Int] -> IM.IntMap [Int]
appendUnlock pre post mp = case IM.lookup pre mp of
Nothing -> IM.insert pre [post] mp
Just prev -> IM.insert pre (post : prev) mp
processPrereq :: (Int, Int) -> CourseMaps -> CourseMaps
processPrereq (post, pre) (inDegrees', unlocks') =
(incKey post inDegrees', appendUnlock pre post unlocks')
(inDegrees, unlocks) = foldr processPrereq (IM.empty, IM.empty) prereqs
queue = Seq.fromList
(filter (`IM.notMember` inDegrees) [0..numCourses-1])
decKey :: Int -> DegCount -> (DegCount, Bool)
decKey key mp = case IM.lookup key mp of
Nothing -> (mp, True)
Just x -> if x <= 1
then (IM.delete key mp, True)
else (IM.insert key (x - 1) mp, False)
decDegree :: Int -> (DegCount, [Int]) -> (DegCount, [Int])
decDegree post (inDegrees', new0s) =
let (inDegrees'', removed) = decKey post inDegrees'
in (inDegrees'', if removed then (post : new0s) else new0s)
loop :: CourseState -> Int
loop (numSatisfied, queue', inDegrees') = case Seq.viewl queue' of
Seq.EmptyL -> numSatisfied
(course Seq.:< rest) ->
let posts = fromMaybe [] (IM.lookup course unlocks)
(inDegrees'', new0s) = foldr decDegree (inDegrees', []) posts
queue'' = foldl (Seq.|>) rest new0s
in loop (numSatisfied + 1, queue'', inDegrees'')
Conclusion
This problem showed the challenge of working with multiple mutable types in Haskell loops. You have to be very diligent about tracking what pieces are mutable, and you often need to write a lot of helper functions to keep your code clean. In our course, Solve.hs, you’ll learn about writing compound data structures to help you solve problems more cleanly. A Graph is one example, and you’ll also learn about occurrence maps, which we could have used in this problem.
That’s all for graphs right now. In the next couple weeks, we’ll cover the Trie, a compound data structure that can help with some very specific problems.
We sat down with Phil Wadler, one of the most influential folks in the Haskell community, functional programming, and programming languages, responsible for type classes, monads, and much more. We take a stroll down memory lane, starting from Haskell's inception. We talk about the difference between research and Phil's work on impactful industrial projects and standards, specifically XML and the design of generics in Java, as well as Phil's teaching at the University of Edinburgh using Agda. Phil is a fountain of great ideas and stories, and this conversation could have gone on for hours. As it is, we hope you enjoy the hour that we had as much as we did.
The Moonbit team recently published a blog post claiming their language runs "30% faster than Rust" for FFT workloads. This is a lie by omission. They benchmarked against a deliberately crippled Rust implementation that no competent programmer would write.
The Moonbit FFT benchmark used a crippled Rust baseline to claim their language was faster than Rust.
My corrected Rust implementation is 3.2–3.4× faster than Moonbit on the same benchmark.
In 5 minutes of prompting GPT-5, I produced a Rust version already 2.33× faster than Moonbit.
The Moonbit devs are programming language developers who have aggressively marketed their language on the basis of performance for a while now; they know better than this.
Moonbit should retract or clearly amend their blog post with corrected Rust baseline results, including the qualification that their benchmark is a naive Cooley-Tukey FFT and nothing more.
One night, while drifting off to sleep (or failing to), I solved a conundrum that has puzzled me since 1987.
Before Haskell there was Orwell. In Orwell equations were checked to ensure order is unimportant (similar to Agda today). When an equation was to match only if no previous equation applied, it was to be preceded by ELSE. Thus, equality on lists would be defined as follows:
We pondered whether to include this restriction in Haskell. Further, we wondered whether Haskell should insist that order is unimportant in a sequence of conditionals, unless ELSE was included. Thus, equality on an abstract type Shape would be defined as follows:
(==) :: Shape -> Shape -> Bool
x == y | circle x && circle y = radius x == radius y
| square x && square y = side x == side y
ELSE
| otherwise = False
In Orwell and early Haskell, guards were written at the end of an equation and preceded by the keyword if or the end of an equation could be labelled otherwise. (Miranda was similar, but lacked the keywords.) Here I use the guard notation, introduced later by Paul Hudak, where otherwise is a variable bound to True.
Sometimes two equations or two guards not separated by ELSE might both be satisfied. In that case, we thought the semantics should ensure that both corresponding right-hand sides returned the same value, indicating an error otherwise. Thus, the following:
plus :: Thing -> Thing -> Thing
plus x y | zero x = y
| zero y = x
ELSE
| otherwise = ...
would be equivalent to:
plus :: Thing -> Thing -> Thing
plus x y | zero x && zero y && x == y = x
| zero x && zero y && x /= y = error "undefined"
| zero x && not (zero y) = y
| not (zero x) && zero y = x
| not (zero x) && not (zero y) = ...
Here the code checks that if x and y are both zero then they are the same. (I will consider a refinement to the check for sameness later.) Of course, the compiler would issue code that performs the tests zero x, zero y, and x == y at most once.
We didn’t pursue this design in Haskell for two reasons. First, because we thought it might be too unfamiliar. Second, because the ELSE on a line by itself was syntactically awkward. It would be especially annoying if one ever wanted the usual cascading behaviour:
f :: Thing -> Thing
f x | p x = ...
ELSE
| q x = ...
ELSE
| r x = ...
Here each guard is tested in turn, and we take the first that succeeds.
Today, the first problem is perhaps no longer quite so strong an issue. Many applications using Haskell would welcome the extra assurance from flagging any cases where order of the equations is significant. But the syntactic awkwardness of ELSE remains considerable. It was syntax about which I had an insight while tossing in bed.
Above otherwise is a variable bound to True in the standard prelude. But say we were to treat otherwise as a keyword, and to give it the meaning that the equation applies only if no previous equation applies, and to allow it to optionally be followed by a further guard. Then our first example becomes:
(==) :: Shape -> Shape -> Bool
x == y | circle x && circle y = radius x == radius y
| square x && square y = side x == side y
| otherwise = False
And our third example becomes:
plus :: Thing -> Thing -> Thing
plus x y | zero x = y
| zero y = x
| otherwise = ...
If one doesn’t want to invoke the equality test in the case that both zero x and zero y hold then one would instead write:
plus :: Thing -> Thing -> Thing
plus x y | zero x = y
| otherwise zero y = x
| otherwise = ...
Similarly, the cascading example becomes:
f :: Thing -> Thing
f x | p x = ...
| otherwise q x = ...
| otherwise r x = ...
That’s it! The syntactic awkwardness is greatly reduced.
The proposed notation depends upon Paul’s clever insight to move the guard from the end of the equation to the middle, so evaluation works strictly left to right. But we’ve had guards in that position for quite a while now. Goodness knows why none of us hit upon this proposal thirty-odd years ago.
Of course, the change is not backward compatible. Changes to guards could be made backward compatible (with added ugliness) by using a different symbol than ‘|’ to mark guards with the new semantics. But now the old definition of (==) should not be accepted without an otherwise, and I cannot think of how to introduce that new semantics with a backward compatible syntax.
The solution, as with so much of Haskell nowadays, is to activate the new semantics with a pragma. Manual porting of legacy code would not be hard in most cases, and it would also be easy to write a tool that adds otherwise whenever the equations are not easily shown to be independent of order.
John Hughes suggests a further refinement to the above. Using equality to check that the value of two equations is the same may not be appropriate if the values are computed lazily. Instead, he suggests that the plus example should translate as follows:
plus :: Thing -> Thing -> Thing
plus x y | zero x && zero y = x `meet` y
| zero x && not (zero y) = y
| not (zero x) && zero y = x
| not (zero x) && not (zero y) = ...
Here we presume a type class
class Meet a where
meet :: a -> a -> a
which confirms that the two arguments are the same and returns a value that is the same as both the arguments. For strict data types, two arguments are the same if they are equal.
instance Meet Integer where
x `meet` y | x == y = x
| otherwise = error "undefined"
For lazy data types, we check that they are the same lazily.
If the compiler could not verify that equations are disjoint, it would require that their right-hand sides have a type belonging to the class Meet.
In most cases, one would hope the compiler could verify that equations are disjoint, and hence would not have to resort to meet or additional checks. One might wish to allow a pragma to declare disjointness, permitting the compiler to assume, for instance, that x < y and x >= y are disjoint. An SMT solver could do much of the work of checking for disjointness.
In general, equations not separated with otherwise would be checked to ensure they are disjoint or all give equivalent results. For example,
g :: Thing -> Thing
g x | p x = a x
| q x = b x
| otherwise r x = c x
| s x = d x
| otherwise t x = e x
would be equivalent to
g :: Thing -> Thing
g x | p x && q x = a x `meet` b x
| p x && not (q x) = a x
| q x && not (p x) = b x
| otherwise r x && s x = c x `meet` d x
| r x && not (s x) = c x
| s x && not (r x) = d x
| otherwise t x = e x
On the other hand, if we declared that p x and q x are disjoint, and the same for s x and r x, then the first code would instead compile to something equivalent to Haskell’s current behaviour,
g :: Thing -> Thing
g x | p x = a x
| otherwise q x = b x
| otherwise r x = c x
| otherwise s x = d x
| otherwise t x = e x
One drawback of this proposal is that the source code doesn’t directly indicate when extra tests and the use of meet are required. An IDE might provide feedback to make explicit which tests are performed, or one might also add pragmas or additional syntax to reflect that information in the source.
I hope some reader might be keen to take this forward. What do you think?
The GHC developers are very pleased to announce the availability of the
second alpha prerelease of GHC 9.14.1. Binary distributions, source
distributions, and documentation are available at downloads.haskell.org.
GHC 9.14 will bring a number of new features and improvements, including:
Significant improvements in specialisation:
The SPECIALISE pragma now allows use of type application syntax
The SPECIALISE pragma can be used to specialise for expression arguments
as well as type arguments.
Specialisation is now considerably more reliable in the presence of
newtypes
Significant improvements in the GHCi debugger
Record fields can be defined to be non-linear when LinearTypes is enabled.
RequiredTypeArguments can now be used in more contexts
SSE/AVX2 support in the x86 native code generator backend
A major update of the Windows toolchain
… and many more
A full accounting of changes can be found in the release notes. Given the
many specialisation improvements and their potential for regression, we would
very much appreciate testing and performance characterisation on downstream
workloads.
Observant readers of these prerelease announcements will note that polymorphic
specialisation has been dropped from alpha 2. This measure was taken out of an
abundance of caution after finding a miscompilation during testing of alpha 1.
While this bug will be fixed in the next alpha, we expect to keep polymorphic
specialisation disabled by default in the final release. Users needing more
aggressive specialisation can explicitly enable this feature with the
-fpolymorphic-specialisation flag. Depending upon our experience with 9.14.1,
we may enable this feature by default in a later minor release.
This is the second of three expected alpha prereleases. We expect the next
(third) alpha will come 23 Sept. 2025, with the release candidate coming 7 Oct.
2025.
We would like to thank the Zw3rk stake pool, Well-Typed, Mercury, Channable,
Tweag I/O, Serokell, SimSpace, the Haskell Foundation, and other anonymous
contributors whose on-going financial and in-kind support has facilitated GHC
maintenance and release management over the years. Finally, this release would
not have been possible without the hundreds of open-source contributors whose
work have made the Haskell ecosystem what it is today.
As always, do give this release a try and open a ticket if you see
anything amiss.
Liquid Haskell (LH) is a formal verification tool for Haskell programs, with the
potential to prove correctness with considerably less friction than approaches
that aim to make code correct by construction using dependent types—often at
the cost of heavy refactoring (as argued in a previous post). It
has come a long way towards becoming a usable tool by adding quality-of-life
features to foster its adoption. Think optimization of spec verification and
improved user experience.
During my GSoC 2025 Haskell.org project with Tweag, I worked on a seemingly
small but impactful feature: allowing LH’s type and predicate aliases to be written
in qualified form.
That is, being able to write Foo.Nat instead of just Nat, as we can for regular Haskell type aliases.
In this post, I introduce these annotations and their uses, walk through some of
the design decisions, and share how I approached the implementation.
Aliasing refinement types
Type and predicate aliases in LH help users abstract over refinement type
annotations, making them easier to reuse and more concise. A type alias refines
an existing type. For instance, LH comes with built-in aliases like Nat and
Odd, which refine Int to represent natural and odd numbers, respectively.
{-@ type Nat = {v: Int | v >= 0 } @-}
{-@ type Odd = {v: Int | (v mod 2) = 1 } @-}
Predicate aliases, by contrast, capture only the predicate part of a refinement
type. For example, we might define aliases for positive and negative numerical
values.
-- Value parameters in aliases are specified in uppercase,
-- while lowercase is reserved for type parameters.
{-@ predicate Neg N = N < 0 @-}
{-@ predicate Pos N = N > 0 @-}
Enter the subtle art of giving descriptive names so that our specifications
read more clearly. Consider declaring aliases for open intervals
with freely oriented boundaries.
{-@ predicate InOpenInterval A B X =
      (A != B) &&
      ((X > A && X < B) || (X > B && X < A)) @-}
{-@ type OpenInterval A B = { x:Float | InOpenInterval A B x } @-}
These aliases can then be used to prove, for instance, that an implementation
of an affine transformation, fromUnitInterval below, from the open unit interval to an
arbitrary interval is a bijection. The proof proceeds by supplying an inverse
function (toUnitInterval) and specifying1 that their composition is the identity.
The example shows one half of the proof; the other half is straightforward
and left to the reader.
type Bound = Float

{-@ inline fromUnitInterval @-}
{-@ fromUnitInterval :: a : Bound
                     -> { b : Bound | a != b }
                     -> x : OpenInterval 0 1
                     -> v : OpenInterval a b @-}
fromUnitInterval :: Bound -> Bound -> Float -> Float
fromUnitInterval a b x = a + x * (b - a)

{-@ inline toUnitInterval @-}
{-@ toUnitInterval :: a : Bound
                   -> { b : Bound | a != b }
                   -> x : OpenInterval a b
                   -> v : OpenInterval 0 1 @-}
toUnitInterval :: Bound -> Bound -> Float -> Float
toUnitInterval a b x = (x - a) / (b - a)

{-@ intervalId :: a : Bound
               -> { b : Bound | a != b }
               -> x : OpenInterval a b
               -> {v : OpenInterval a b | x = v} @-}
intervalId :: Bound -> Bound -> Float -> Float
intervalId a b x = (fromUnitInterval a b . toUnitInterval a b) x
Another case: refining a Map type to a fixed length allows us to enforce that
a function can only grant access privileges to a bounded number of users at any
call site.
type Password = String
type Name = String

{-@ type FixedMap a b N = { m : Map a b | len m = N } @-}

{-@ giveAccess :: Name
               -> Password
               -> FixedMap Name Password 3
               -> Bool @-}
giveAccess :: Name -> Password -> Map Name Password -> Bool
giveAccess name psswd users = Map.lookup name users == Just psswd
None of these specifications strictly require aliases, but they illustrate the
practical convenience they bring.
A crowded name space
When we try to be simple and reasonable about such aliases, it becomes quite
likely for other people to converge on the same names to describe similar
types. Even a seemingly standard type such as Nat is not safe: someone
with a historically informed opinion might want to define it as strictly positive
numbers2, or may just prefer to refine Word8 instead of Int.
Naturally, this is the familiar problem of name scope, for which established
solutions exist, such as modules and local scopes. Yet for LH and its Nat, it
was the case that one would have to either invent a non-conflicting name,
exclude assumptions for the base package, or avoid
importing the Prelude altogether. It might be argued that having to invent
alternative names is a minor nuisance, but also that it can quickly lead to
unwieldy and convoluted naming conventions once multiple dependencies expose
their own specifications.
Simply stated, the problem was that LH imported all aliases from transitive
dependencies into a flat namespace. After my contribution, LH still accumulates
aliases transitively, but users gain two key capabilities: (i) to disambiguate
occurrences by qualifying an identifier, and (ii) to overwrite an imported alias
without conflict. In practice, this prevents spurious verification failures
and gives the user explicit means to resolve clashes when they matter.
Consider the following scenario. Module A defines alias Foo. Two other
modules, B and B', both define an alias Bar and import A.
module A where
{-@ type Foo = { ... } @-}

module B where
import A
{-@ type Bar = { ... } @-}

module B' where
import A
{-@ type Bar = { ... } @-}
A module C that imports B and B' will now see Foo in scope unambiguously,
while any occurrence of Bar must be qualified in the usual Haskell manner.
Previously, this would have caused C to fail verification with a conflicting
definitions error, even if Bar was never used.
examples/B.hs:3:10: error:
    Multiple definitions of TypeAlias `Bar`
    Conflicting definitions at
    .* examples/B.hs:3:10-39
    .* examples/B'.hs:3:10-39
  |
3 | {-@ type Bar = { ... } @-}
  |          ^^^^^^^^^^^^^^
This error is now only triggered when the alias is defined multiple times within
the same module. And instead, when an ambiguous type alias is found, the user is
prompted to choose among the matching names in scope and directed to the
offending symbol.
examples/C.hs:6:19: error:
    Ambiguous specification symbol `Bar` for type alias
    Could refer to any of the names
    .* Bar imported from module B defined at examples/B.hs:3:10-39
    .* Bar imported from module B' defined at examples/B'.hs:3:10-39
  |
6 | {-@ baz :: Foo -> Bar @-}
  |                   ^^^
The precise behavior is summarized in a set of explicit rules
that I proposed, which specify how aliases are imported and exported under
this scheme.
The initial name resolution flow
The project goals were initially put forward on a GitHub issue as a
spin-off from a recent refactoring of the codebase that changed the
internal representation of names to a structured LHName type that
distinguishes between resolved and unresolved names and stores information about
where the name originates, so that names are resolved only once for each compiled
module.
Name resolution has many moving parts, but in broad terms its implementation is
divided into two phases: The first handles names corresponding to entities GHC
knows of—data and type constructors, functions, and annotation binders of
aliases, measures, and data constructors—and uses its
global reader environment to look them up. The resolution of logical
entities (i.e. those found in logical expressions) is left for the second
phase, where the names resolved during the first phase are used to build custom
lookup environments.
Occurrences of type and predicate aliases were resolved by looking them up in an
environment indexed by their unqualified name. When two or more dependencies
(possibly transitive) defined the same alias, resolution defaulted to whichever
definition happened to be encountered first during collection. This accidental
choice was effectively irrelevant, however, since a later duplicate-name check
would short-circuit with the aforementioned error. Locally defined aliases
were recorded in the module’s interface file after verification, and LH
assembled the resolution environment by accumulating the aliases from the
interface files of all transitive dependencies.
The reason a module import brings all aliases from transitive dependencies
into scope is that no mechanism exists to declare which aliases a module exports
or imports. Implementing such a mechanism exceeded the project’s allocated time,
so a trade-off was called for. On the importing side, Haskell’s qualifying
directives could be applied, but an explicit defaulting mechanism was needed to
determine what aliases a module exposes. This left us with at least
three possibilities:
Export no aliases, so that they would be local to each module alone. This
no-op solution would allow the user to use any names she wants, but quickly
becomes inconvenient, as an alias would have to be redefined in each module
where she intends to use it.
Export only those locally defined, so that only aliases from direct
dependencies would be in scope for any given module. This could leave out
aliases used to specify re-exported functions, so we would end up in a
similar situation as before.
Export all aliases from transitive dependencies, avoiding the need to ever
duplicate an alias definition.
The chosen option (3) reflects the former behavior and, complemented by
the ability to qualify and overwrite aliases, was deemed the most effective
solution.
Qualifying type aliases
Type aliases are resolved during the first phase, essentially because they are
parsed as type constructors, which are resolved uniformly across the input
specification. Two changes had to be made to qualify them: include module import
information in the resolution environment to discern which module aliases can be
used to qualify an imported type alias, and make sure transitively imported
aliases are stored in the interface file along with the locally defined type
aliases.
Careful examination of the code revealed that we could reuse environments built
for other features of LH that could be qualified already! And as a
bonus, their lookup function returns close-match alternatives in case of failure.
Factoring this out almost did the
trick. In addition, I had to add some provisions to give precedence to locally
defined aliases during lookups.
Qualifying predicate aliases
Two aspects of the code made predicate aliases somewhat hard to reason about.
First, predicate aliases are conflated in environments with
Haskell entities lifted by inline and define annotations.
The rationale is to use a single mechanism to expand these definitions in
logical expressions.
Second, the conflated environments were redundantly gathered twice with different
purposes: to resolve Haskell function names in logical
expressions, and afterwards again to resolve occurrences of predicate aliases.
Neither was straightforward to deduce from the code. These facts,
together with some code comments from the past about predicate aliases being the
last names that remained “unhandled”, pointed the way.
The surgical change, then, was to sieve out predicate aliases from the lifted
Haskell functions as they were stored together in interface files, and include
these predicate aliases in the environment used to resolve qualified names for
other features.
Alias expansion
Although the problem I set out to solve was primarily about name resolution, the
implementation also required revisiting another process: alias expansion. For a
specification to be ready for constraint generation, all aliases must be fully
expanded (or unfolded), since liquid-fixpoint3 has no notion of aliases.
Uncovering this detail was crucial to advance with the implementation. It
clarified why Haskell functions lifted
with inline or define are eventually converted into predicate aliases: doing
so allows for every aliasing annotation to be expanded consistently in a single
pass wherever they appear in a specification. With qualified aliases, the
expansion mechanism needed some adjustments, as the alias names were now more
structured (LHName).
An additional complication was that the logic to expand type
aliases was shared with predicate aliases, and since I did qualification of type
aliases first, I needed to have different behavior for type and predicate
aliases. In the end, I opted for duplicating the expansion logic for each case
during the transition, and unified it again after implementing qualification of
predicate aliases.
Closing remarks
My determination to understand implementation details was rewarded by
insights that allowed me to refactor my way to a solution. For perspective,
my contribution consisted of a 210 LOC addition for the feature implementation
alone, after familiarizing myself with 2,150 LOC out of the 25,000 LOC making up
the LH plugin.
The bulk of this work is contained in two merged PRs (#2550 and
#2566), which include detailed source documentation and tests.
The qualified aliases support and the explicit rules that govern it
are a modest addition, but hopefully one of a positive impact on user experience.
LH tries to be as close as possible to Haskell, but refinement type aliases
still mark the boundary between the two worlds. Perhaps the need for an ad hoc
mechanism for importing and exporting logic entities will be revisited
if LH gets integrated into GHC (which sounds good to me!).
This project taught me about many language features and introduced me to the
GHC API; knowledge I will apply in future projects and to further contribute to
the Haskell ecosystem. I am grateful to Facundo Domínguez for his generous and
insightful mentoring, which kept a creative flow going throughout the project.
Working on Liquid Haskell was lots of fun!
Note that, in this example, the inline annotation is used to translate
the Haskell definitions into the logic so Liquid Haskell can unfold calls to these
functions when verifying specifications.↩
It took humanity quite a while to think clearly about a null quantity,
and further still for it to play a fundamental role as a placeholder for
positional number notation.↩
liquid-fixpoint is the component of Liquid Haskell that
transforms a module’s specification into a set of constraints for an external
SMT solver.↩
The GHC developers are very pleased to announce the availability
of the final release for GHC 9.10.3. Binary distributions, source
distributions, and documentation are available at downloads.haskell.org and
via GHCup.
GHC 9.10.3 is a bug-fix release fixing many issues of a variety of
severities and scopes, including:
Fix a number of crashes in the compiler frontend (#25960, #25004, #25056)
A fix for a segfault in the RTS when running certain code involving STM (#26205)
And many more!
A full accounting of these fixes can be found in the release notes. As
always, GHC’s release status, including planned future releases, can be found on
the GHC Wiki's status page.
We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable,
Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors
whose on-going financial and in-kind support has facilitated GHC maintenance
and release management over the years. Finally, this release would not have
been possible without the hundreds of open-source contributors whose work
comprises this release.
As always, do give this release a try and open a ticket if you see
anything amiss.
In the face of sweeping funding cuts in the US, David Samuel Shiffman defends the value of scientific curiosity in American Scientist. Spotted via Boing Boing.
For last week’s problem we started learning about graph algorithms, focusing on depth-first-search. Today we’ll do a problem from an old board game that will require us to use breadth-first-search. We’ll also learn about a special library in Haskell that lets us solve these types of problems without needing to implement all the details of these algorithms.
To learn more about this library and graph algorithms in Haskell, you should check out our problem solving course, Solve.hs! Module 3 of the course focuses on algorithms, with a special emphasis on graph algorithms!
The Problem
Today’s problem comes from a kids’ board game called Snakes and Ladders, which will take a little bit to explain. We imagine we have a square board in an N x N grid, where each cell is numbered 1 to N^2. The bottom left corner is always “1”, and numbers increase in a snake-like fashion: first they increase from left to right along the bottom row, then they go from right to left in the next row, before reversing again. Here’s what the numbers look like for a 6x6 board:
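36 35 34 33 32 31
25 26 27 28 29 30
24 23 22 21 20 19
13 14 15 16 17 18
12 11 10  9  8  7
 1  2  3  4  5  6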
The “goal” is to reach the highest numbered tile, which is either in the top left (for even grid sizes) or the top right (for odd grid sizes). One moves by rolling a 6-sided die. Given the number on the die, you are entitled to move that many spaces. The ordinary path of movement is following the increasing numbers.
As is, the game is a little boring. You just always want to roll the highest number you can. However, various cells on the grid are equipped with “snakes” or “ladders”, which can move you around the grid if your die roll would cause your turn to end where these items start. Ladders typically move you closer to the goal, snakes typically move you away from the goal. Here’s an illustrative picture of a board:
We can represent such a board by putting an integer on each cell. The integer -1 represents an ordinary cell, where you would simply proceed to the next cell in order. However, we can represent the start of each snake and ladder with a number corresponding to the cell number where you end up if your die roll lands you there. Here’s an example:
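-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 35 -1 -1 13 -1
-1 -1 -1 -1 -1 -1
-1 15 -1 -1 -1 -1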
This grid has two ladders. The first can take you from position 2 to position 15 (see the bottom left corner). The second can take you from position 14 to position 35. There is also a snake that will take you back from position 17 to position 13. Note that no matter the layout, you can only follow one snake or ladder on a turn. If you end your turn at the beginning of a ladder that takes you to the beginning of another ladder, you do not take the second ladder on that turn.
Our objective is to find the smallest number of dice rolls possible to reach the goal cell (which will always have -1). In this case, the answer is 4. Various combinations of 3 rolls can land us on 14, which will take us to 35. Then rolling 1 would take us to the goal.
It is possible to contrive a board where it is impossible to reach the goal! We need to handle these cases. In these situations we must return -1. Here is such a grid, with many snakes, all leading back to the start!
1 1 -1
1 1 1
-1 1 1
The Algorithm
This is a graph search problem where each step we take carries the same weight (one turn), and we are trying to find the shortest path. This makes it a canonical example of a Breadth First Search problem (BFS).
We solve BFS by maintaining a queue of search states. In our case, the search state might consist simply of our location, though we may also want to track the number of steps we needed to reach that location as part of the state.
We’ll have a single primary loop, where we remove the first element in our queue. We’ll find all its “neighbors” (the states reachable from that node), and place these on the end of the queue. Then we’ll continue processing.
BFS works out so that states with a lower “cost” (i.e. number of turns) will all be processed before any states with higher cost. This means that the first time we dequeue a goal state from our queue, we can be sure we have found the shortest path to that goal state.
As with last week’s problem, we’ll spend a fair amount of effort on our “neighbors” function, which is often the core of a graph solution. Once we have that in place, the mechanics of the graph search generally become quite easy.
Rust Solution
Once again we’ll start with Rust, because we’ll use a special trick in Haskell. As stated, we want to start with our neighbors function. We’ll represent a single location just using the integer representing it on the board, not its grid coordinates. So at root, we’re taking one usize and returning a vector of usize values. But we’ll also take the board (a 2D vector of integers) so we can follow the snakes and ladders. Finally, we’ll pass the size of the board (just N, since our board is always square) and the “goal” location so that we don’t have to recalculate these every time:
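pub fn neighbors(n: usize, goal: usize, board: &Vec<Vec<i32>>, loc: usize) -> Vec<usize> {
    ...
}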
The basic idea of this function is that we’ll loop through the possible die rolls (1 to 6) and return the resulting location from each roll. If we find that the roll would take us past the goal, then we can safely break:
pub fn neighbors(n: usize, goal: usize, board: &Vec<Vec<i32>>, loc: usize) -> Vec<usize> {
let mut results = Vec::new();
for i in 1..=6 {
if loc + i > goal {
break;
}
...
}
return results;
}
How do we actually get the resulting location? We need to use the board, but in order to use the board, we have to convert the location into 2D coordinates. So let’s just write the frame for a function converting a location into coordinates. We’ll fill it in later:
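pub fn convert(n: usize, loc: usize) -> (usize, usize) {
    ...
}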
Assuming we have this function, the rest of our neighbors logic is easy. We check the corresponding value for the location in board. If it is -1, we just use our prior location added to the die roll. Otherwise, we use the location given in the cell:
pub fn neighbors(n: usize, goal: usize, board: &Vec<Vec<i32>>, loc: usize) -> Vec<usize> {
let mut results = Vec::new();
for i in 1..=6 {
if loc + i > goal {
break;
}
let (row, col) = convert(n, loc + i);
let next = board[row][col];
if next == -1 {
results.push(loc + i);
} else {
results.push(next as usize);
}
}
return results;
}
So let’s fill in this conversion function. It’s tricky because of the snaking order of the board and because we start from the bottom (highest row index) and not the top. Nonetheless, we want to start by getting the quotient and remainder of our location with the side-length. (We subtract 1 since our locations are 1-indexed).
To get the final row, we simply take n - rowBase - 1. The column is trickier. We need to consider if the row base is even or odd. If it is even, the row is going from left to right. Otherwise, it goes from right to left. In the first case, the modulo for the column gives us the right column. In the second case, we need to subtract from n like we did with rows.
pub fn convert(n: usize, loc: usize) -> (usize, usize) {
let rowBase = (loc - 1) / n;
let colBase = (loc - 1) % n;
let row = n - rowBase - 1;
let col =
if rowBase % 2 == 0 {
colBase
} else {
n - colBase - 1
};
return (row, col);
}
But that’s all we need for conversion!
Now that our neighbors function is closed up, we can finally write the core solution. For the Rust solution, we’ll define our “search state” as including the location and the number of steps we took to reach it, so a tuple (usize, usize). We’ll create a VecDeque of these, which is Rust’s structure for a queue, and insert our initial state (location 1, count 0):
use std::collections::VecDeque;
pub fn snakes_and_ladders(board: Vec<Vec<i32>>) -> i32 {
let n = board.len();
let goal = board.len() * board[0].len();
let mut queue: VecDeque<(usize, usize)> = VecDeque::new();
queue.push_back((1,0));
...
}
We also want to track the locations we’ve already visited. This will be a hash set of the locations but not the counts. This is necessary to prevent infinite loops. Once we’ve visited a location there is no advantage to considering it again on a later branch (with this problem at least). We’ll also follow the practice of considering a cell “visited” once it is enqueued.
use std::collections::VecDeque;
use std::collections::HashSet;
pub fn snakes_and_ladders(board: Vec<Vec<i32>>) -> i32 {
let n = board.len();
let goal = board.len() * board[0].len();
let mut queue: VecDeque<(usize, usize)> = VecDeque::new();
queue.push_back((1,0));
let mut visited = HashSet::new();
visited.insert(1);
...
}
Now we’ll run a loop popping the front of the queue and finding the “neighboring” locations. If our queue is empty, this indicates no path was possible, so we return -1.
use std::collections::VecDeque;
use std::collections::HashSet;
pub fn snakes_and_ladders(board: Vec<Vec<i32>>) -> i32 {
let n = board.len();
let goal = board.len() * board[0].len();
let mut queue: VecDeque<(usize, usize)> = VecDeque::new();
queue.push_back((1,0));
let mut visited = HashSet::new();
visited.insert(1);
while let Some((idx, count)) = queue.pop_front() {
let ns = neighbors(n, goal, &board, idx);
...
}
return -1;
}
Now processing each neighbor is simple. First, if the neighbor is the goal, we’re done! Just return the dequeued count plus 1. Otherwise, check if we’ve visited the neighbor before. If not, push it to the back of the queue, along with an increased count:
pub fn snakes_and_ladders(board: Vec<Vec<i32>>) -> i32 {
let mut queue: VecDeque<(usize, usize)> = VecDeque::new();
queue.push_back((1,0));
let n = board.len();
let goal = board.len() * board[0].len();
let mut visited = HashSet::new();
visited.insert(1);
while let Some((idx, count)) = queue.pop_front() {
let ns = neighbors(n, goal, &board, idx);
for next in ns {
if next == goal {
return (count + 1) as i32;
}
if !visited.contains(&next) {
queue.push_back((next, count + 1));
visited.insert(next);
}
}
}
return -1;
}
This completes our BFS solution! Here is the complete code:
use std::collections::VecDeque;
use std::collections::HashSet;
pub fn convert(n: usize, loc: usize) -> (usize, usize) {
let rowBase = (loc - 1) / n;
let colBase = (loc - 1) % n;
let row = n - rowBase - 1;
let col =
if rowBase % 2 == 0 {
colBase
} else {
n - colBase - 1
};
return (row, col);
}
pub fn neighbors(n: usize, goal: usize, board: &Vec<Vec<i32>>, loc: usize) -> Vec<usize> {
let mut results = Vec::new();
for i in 1..=6 {
if loc + i > goal {
break;
}
let (row, col) = convert(n, loc + i);
let next = board[row][col];
if next == -1 {
results.push(loc + i);
} else {
results.push(next as usize);
}
}
return results;
}
pub fn snakes_and_ladders(board: Vec<Vec<i32>>) -> i32 {
let mut queue: VecDeque<(usize, usize)> = VecDeque::new();
queue.push_back((1,0));
let n = board.len();
let goal = board.len() * board[0].len();
let mut visited = HashSet::new();
visited.insert(1);
while let Some((idx, count)) = queue.pop_front() {
let ns = neighbors(n, goal, &board, idx);
for next in ns {
if next == goal {
return (count + 1) as i32;
}
if !visited.contains(&next) {
queue.push_back((next, count + 1));
visited.insert(next);
}
}
}
return -1;
}
Haskell Solution
For our Haskell solution, we’re going to use a special shortcut. We’ll make use of the Algorithm.Search library to handle the mechanics of the BFS for us. The function we’ll use has this type signature (slightly simplified):
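bfs :: Ord state
    => (state -> [state]) -- neighbors function
    -> (state -> Bool)    -- goal check
    -> state              -- initial state
    -> Maybe [state]      -- shortest path to a goal, if one exists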
We provide 3 inputs. First is the “neighbors” function, taking one state and returning its neighbors. Second is the “goal” function, telling us if a state is our final goal state. Finally we give it the initial state. If a goal is reachable, we receive a path to that goal. If not, we receive Nothing. Since this library provides the full path for us automatically, we won’t track the number of steps in our state. Our “state” will simply be the location. So let’s begin by framing out our function:
snakesAndLadders :: A.Array (Int, Int) Int -> Int
snakesAndLadders board = ...
where
((minRow, minCol), (maxRow, _)) = A.bounds board
n = maxRow - minRow + 1
goal = n * n
convert :: Int -> (Int, Int)
neighbor :: Int -> Int
neighbors :: Int -> [Int]
Let’s start with convert. This follows the same rules we used in our Rust solution, so there’s not much to say here. We just have to make sure we account for non-zero start indices in Haskell arrays by adding minRow and minCol.
snakesAndLadders :: A.Array (Int, Int) Int -> Int
snakesAndLadders board = ...
where
((minRow, minCol), (maxRow, _)) = A.bounds board
n = maxRow - minRow + 1
goal = n * n
convert :: Int -> (Int, Int)
convert loc =
let (rowBase, colBase) = (loc - 1) `quotRem` n
row = minRow + (n - rowBase - 1)
col = minCol + if even rowBase then colBase else n - colBase - 1
in (row, col)
Now we’ll write a neighbor helper that converts a single location. This just makes our neighbors function a lot cleaner. We use the same logic of checking for -1 in the board, or else using the value we find there.
snakesAndLadders :: A.Array (Int, Int) Int -> Int
snakesAndLadders board = ...
where
((minRow, minCol), (maxRow, _)) = A.bounds board
n = maxRow - minRow + 1
goal = n * n
convert = ...
neighbor :: Int -> Int
neighbor loc =
let coord = convert loc
onBoard = board A.! coord
in if onBoard == -1 then loc else onBoard
Now we can write neighbors with a simple list comprehension. We look through each roll of 1-6, add it to the current location, filter if this location is past the goal, and then calculate the neighbor.
snakesAndLadders :: A.Array (Int, Int) Int -> Int
snakesAndLadders board = ...
where
((minRow, minCol), (maxRow, _)) = A.bounds board
n = maxRow - minRow + 1
goal = n * n
convert = ...
neighbor = ...
neighbors :: Int -> [Int]
neighbors loc =
[neighbor (loc + i) | i <- [1..6], loc + i <= goal]
Now for the coup de grâce. We call bfs with our neighbors function. The “goal” function is just (== goal), and the starting state is just 1. It will return our shortest path, so we just return its length:
snakesAndLadders :: A.Array (Int, Int) Int -> Int
snakesAndLadders board = case bfs neighbors (== goal) 1 of
Nothing -> (-1)
Just path -> length path
where
((minRow, minCol), (maxRow, _)) = A.bounds board
n = maxRow - minRow + 1
goal = n * n
convert :: Int -> (Int, Int)
convert loc =
let (rowBase, colBase) = (loc - 1) `quotRem` n
row = minRow + (n - rowBase - 1)
col = minCol + if even rowBase then colBase else n - colBase - 1
in (row, col)
neighbor :: Int -> Int
neighbor loc =
let coord = convert loc
onBoard = board A.! coord
in if onBoard == -1 then loc else onBoard
neighbors :: Int -> [Int]
neighbors loc =
[neighbor (loc + i) | i <- [1..6], loc + i <= goal]
And that’s our complete Haskell solution!
Conclusion
If you take our Solve.hs course, Module 3 is your go-to for learning about graph algorithms! You’ll implement BFS from scratch in Haskell, and learn how to apply other helpers from Algorithm.Search. In next week’s article, we’ll do one more graph problem that goes beyond the basic ideas of DFS and BFS.
PT2’s dominant internal representation, FX graphs, do not directly support control flow (if statements, while loops): they only represent straight-line basic blocks. Most of our graph capture mechanisms are tracing based (fx.symbolic_trace, make_fx, Dynamo), which means that we expect to be able to linearize all conditionals we encounter into a straight-line program. Sometimes, you want to work with code that has control flow while working with the compiler stack. There is no silver bullet; instead there are a lot of different options with different tradeoffs.
Regional compilation
We have a perfectly good general-purpose language that supports control flow: Python. To handle control flow, compile only regions/submodules of your program that have no internal control flow, and then string them together with standard Python control flow constructs. PT2 compiled regions are compositional with non-compiled regions; “it works.” A sketch of this pattern follows the pros and cons below.
Pros:
Simple: requires no major model changes
Universal: it always works (including data dependent flow, calling into third-party libraries, making an HTTP request, anything!)
Cons:
You will not get a full graph this way; you will only get graphs for each region. In particular, you will not be able to do truly global optimizations, nor will you be able to serialize a self-contained Python-less representation of the entire model
It can sometimes be inconvenient to structure your program so all the regions you want are compilable. Suppose you have this call graph between modules: A -> B -> C. C is compilable; A is compilable except for its call to B, which is what does the control flow. It’s easy to compile C, but you can’t directly compile A, as it has a B-shaped bit that can’t be compiled. What to do? If you split A so it is pipelined as A1, B, A2, you can then compile A1 and A2, but not B. Dynamo also supports “graph breaks” to automatically perform this split for you, in which case you just disable compilation on B, but graph-break-generated graphs can be difficult to reason about as the inputs to A2 are implicitly inferred.
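As a minimal sketch of the regional compilation pattern (the module and layer names here are hypothetical; only torch.compile is real API), the compiled regions sit inside ordinary eager-mode Python control flow:
import torch
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Compile only the control-flow-free regions.
        self.encoder = torch.compile(nn.Sequential(nn.Linear(16, 32), nn.ReLU()))
        self.head_a = torch.compile(nn.Linear(32, 8))
        self.head_b = torch.compile(nn.Linear(32, 8))

    def forward(self, x, use_a: bool):
        h = self.encoder(x)        # compiled region
        # Plain Python (eager) control flow between compiled regions:
        if use_a:
            return self.head_a(h)  # compiled region
        return self.head_b(h)      # compiled region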
When the control flow is controlled by arguments that are known ahead of time (not data-dependent), you can also compile at the top level and get the flattened straight-line program for the particular branching taken in that case. Because Dynamo is a symbolic bytecode interpreter, it can automatically determine which inputs were used as part of control flow, and generate guards to validate that we would take the same paths again. If those values change, we will recompile the program at the new values. We dispatch between all the different unrollings of the program we have generated.
Pros:
Simple: requires no major model changes
You get a full graph for a particular unrolling of loops / conditionals, so global optimizations are possible
Cons:
Doesn’t work with data-dependent shapes.
You will end up with a graph for every unrolling; for example, if you have a loop that ranges from 1 to 32, you will end up with 32 different graphs. This will increase compile time.
Black box via custom operator
An FX graph just calls operators. The operator internally can have whatever control flow in them they want. So you can always black box a problematic region of your model into an operator and preserve compilation for everything else.
Pros:
You get a single, full graph that works for all possible branches
Cons:
A custom operator only supports inputs/outputs that fall inside our type system, which means you can only pass simple types like Tensor, int, bool (or pytree-able containers containing these things). There is some in progress work to relax this to allow more opaque types.
You have to explicitly declare all the inputs/outputs for the custom operator. This can be tiresome if the black boxed region represents a Module, since all the parameters also have to be directly passed in as well. The larger the region you black box, the bigger the arguments are.
You don’t actually get to see the inside of the custom operator from the outside graph, so no optimization over both inside and outside of the custom operator is possible. (Of course, you can always special case this operator in a pass on the outer graph.)
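To make this concrete, here is a rough sketch of the custom-operator approach (the op name mylib::halve_until_small and its body are invented for illustration; torch.library.custom_op and register_fake are the registration APIs used):
import torch

# Hypothetical op whose body contains arbitrary data-dependent Python control flow.
@torch.library.custom_op("mylib::halve_until_small", mutates_args=())
def halve_until_small(x: torch.Tensor, threshold: float) -> torch.Tensor:
    y = x.clone()
    while float(y.abs().max()) > threshold:  # invisible to the FX graph
        y = y / 2
    return y

# A fake/meta implementation so the compiler can reason about output shapes.
@halve_until_small.register_fake
def _(x, threshold):
    return torch.empty_like(x)

@torch.compile(fullgraph=True)
def f(x):
    # The graph only sees a single call to the black-boxed operator.
    return halve_until_small(x, 1.0) + 1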
Do you really, really need a conditional? If you’re doing an if-branch, can you instead rewrite it so that you run both branches and use torch.where to select between the results? If you’re doing a while-loop, can you unroll it to the max number of iterations and rely on dynamic shapes to make the extra iterations no-ops once you’re done? Basically, this option is to rewrite your model so it doesn’t have Python-level control flow anymore (the conditional can be done either host-side or GPU-side).
Pros:
You get a single, full graph that works for all possible branches
You are able to optimize inside and outside of the control flow
Cons:
You have to rewrite your model
For unrolling, if you are close to being CPU-dispatch bound, unrolling and running with zero size could push you over the brink (as zero size dispatches are still not free)
For conditionals, unconditionally running both branches increases the compute you need to do, which can be bad if you are compute-bound.
Control flow HOP
torch has special structured control flow operators that avoid unrolling large loops or needing to execute both branches of a control flow statement. If you’re familiar with JAX, these are very similar to the JAX equivalents. They have specific constraints that allow them to be directly compilable by torch.compile. For example, torch.cond accepts two functions (a true_fn and a false_fn) for the two branches and requires that outputs of each function must have the same properties (e.g. shape, dtype).
So far, we have the following “higher-order” operators (HOPs), including cond, while_loop, and scan.
Pros:
You get a single, full graph that works for all possible branches
You are able to optimize inside and outside of the control flow
Cons:
You have to rewrite your model.
The control flow HOPs are structured: they have specific constraints on the functions (true_fn, false_fn (cond) or body_fn (while_loop)) that can be passed to them. One such constraint is that these functions may not mutate any of their inputs. This may make rewrites difficult because you have to think about code in a “functional”, JAX-like way.
Still WIP and they have some quirks especially for training. For example, the backward pass of torch.scan currently requires re-computing the forward pass (instead of just saving intermediates from each iteration of scan).
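As a rough illustration of the shape of these APIs (shown for torch.cond; exact constraints and import locations may vary across versions):
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile(fullgraph=True)
def f(x):
    # Both branches take the same operands and must produce outputs with
    # matching shape and dtype; the predicate may be a data-dependent tensor.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))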
CFG over FX graphs
If FX graphs give you basic blocks, you can use them as building blocks for a language that does support conditionals, stringing the basic blocks together into a control-flow graph. In fact, Helion, a kernel DSL, does exactly this, as it is common to need to write data-dependent conditionals and loops directly when writing kernels (it otherwise uses all PyTorch API functions, similar to conventional FX graphs). To do this, you would need to write your own Python frontend that parses Python directly to generate the CFG. TorchScript also does this, but the TorchScript frontend is unmaintained and we don’t recommend using it (and it also doesn’t generate FX graphs by default).
Pros:
You get a single graph that works for all possible branches
You are able to optimize inside and outside of control flow
In principle, you can write exactly the control flow you want
Cons:
You have to write the frontend; we don’t have one ready for you (TorchScript is not it, your princess is in another castle)
If your language looks too much like Python and too general purpose, prepare to get on the endless treadmill of feature requests for adding “just one more Python feature” (can we have lists? dataclasses? etc etc) in the frontend (it is more tractable for Helion, as it’s not a general purpose language.)
Getting an accurate and precise backtrace is the key to debugging unexpected exceptions in Haskell programs.
We recently implemented a family of functions that enable the user to push user-defined annotations to the native Haskell stack.
The native stack decoder can display this information to the user when an unexpected
exception is thrown.
This facility offers a number of advantages over the existing backtrace collection
mechanisms:
It is not necessary to modify the function API (unlike HasCallStack)
A “continuous chain” of modifications is not necessary (unlike HasCallStack)
The annotations work in all ways of compilation (unlike cost centre stacks)
The backtrace is expressed in terms of predictable source locations (unlike some IPE backtraces)
In this post we will introduce the API for stack annotation, give some examples of
how to use the annotation functions and discuss some trade-offs we have noticed with the design.
We’re interested in feedback from users on this feature. We’re expecting it
to be available from GHC 9.16, as our implementation already landed in GHC HEAD (!14538).
Annotation stack frames
The core of the design is a new primop, annotateStack#, which when executed pushes an “annotation stack-frame” to
the stack. Semantically, the frame is a no-op, but the payload contains a pointer to an arbitrary user-defined annotation.
When decoding the native Haskell stack the annotation can be rendered to
provide the user with additional context about the current location of the program.
The primop annotateStack# is exposed to the user via an IO-based API in
GHC.Stack.Annotation.Experimental from the ghc-experimental package:1
annotateStackIO :: (Typeable a, StackAnnotation a) => a -> IO b -> IO b
This will push the annotation value a onto the stack for the duration of the IO b action. The constraints allow the value to be rendered to a string or have its type inspected, similarly to the Exception class.
There are also specialised variants:
annotateCallStackIO :: HasCallStack => IO b -> IO b   -- Annotate with the current source location
annotateStackStringIO :: String -> IO b -> IO b       -- Annotate with an arbitrary String
annotateStackShowIO :: Show a => a -> IO b -> IO b    -- Annotate with the result of 'show' on a value
In addition, there are “pure” variants for use in non-IO code. However, these
tend to be less intuitive due to the combination of lazy evaluation and
imprecise exceptions, so the IO versions will generally produce better stack
traces more reliably.
Note, annotateStack# is heavily inspired by annotated-exception
and can be used together with annotated-exception for even better stack traces.
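As a small sketch of how these compose in practice (processOrder here is a hypothetical function, not part of the API):
import GHC.Stack.Annotation.Experimental (annotateCallStackIO, annotateStackStringIO)

-- If anything inside the action throws, both annotations below will be visible
-- when the native stack is decoded.
processOrder :: Int -> IO ()
processOrder orderId =
  annotateStackStringIO ("processing order " ++ show orderId) $
    annotateCallStackIO $ do
      -- ... work that might throw ...
      pure ()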
Example of the status quo
Let’s use the annotation functions to improve the backtrace for a program reported
in a GHC ticket (#26040).
The program implements a simple REST API using servant. When the endpoint is requested with
a parameter which is larger than or equal to 100, the endpoint will error.
topHandler catches all exceptions thrown by the handler and turns them into an HTTP 500 error.
Finally, the exception handler prints any exceptions that might be thrown by the endpoint.
main :: IO ()
main = do
  setBacktraceMechanismState IPEBacktrace True
  run 8086 mkServer

type Api = Capture "x" Int :> Get '[PlainText] Text

mkServer :: Application
mkServer = serve (Proxy @Api) (hoistServer (Proxy @Api) topHandler api)

topHandler :: IO a -> Handler a
topHandler action = do
  result <- liftIO $ (Right <$> action) `catch` \(exc :: SomeException) -> do
    liftIO $ putStrLn $ "Exception: " <> displayExceptionWithInfo exc
    pure $ Left err500
  either throwError pure result

api :: ServerT Api IO
api = handler

handler :: Int -> IO Text
handler x =
  if x >= 100
    then throw $ ErrorCall "Oh no!"
    else pure (pack "handler")
With the current version of GHC, when calling this API via http://localhost:8086/105, this stack trace is printed:
Exception: ghc-internal:GHC.Internal.Exception.ErrorCall:
Oh no!
IPE backtrace:
Main.liftIO (src/Servant/Server/Internal/Handler.hs:30:36-42)
Servant.Server.Internal.Delayed.runHandler' (src/Servant/Server/Internal/Handler.hs:27:31-41)
Control.Monad.Trans.Resource.runResourceT (./Control/Monad/Trans/Resource.hs:(192,14)-(197,18))
Network.Wai.Handler.Warp.HTTP1.processRequest (./Network/Wai/Handler/Warp/HTTP1.hs:195:20-22)
Network.Wai.Handler.Warp.HTTP1.processRequest (./Network/Wai/Handler/Warp/HTTP1.hs:(195,5)-(203,31))
Network.Wai.Handler.Warp.HTTP1.http1server.loop (./Network/Wai/Handler/Warp/HTTP1.hs:(141,9)-(157,42))
HasCallStack backtrace:
collectExceptionAnnotation, called at libraries/ghc-internal/src/GHC/Internal/Exception.hs:170:37 in ghc-internal:GHC.Internal.Exception
toExceptionWithBacktrace, called at libraries/ghc-internal/src/GHC/Internal/Exception.hs:90:42 in ghc-internal:GHC.Internal.Exception
throw, called at app/Main.hs:42:10 in backtrace-0.1.0.0-inplace-server:Main
In this example there are two different backtraces:
The “IPE backtrace” is constructed by decoding the Haskell stack, using information stored in the binary by -finfo-table-map, where each
frame is automatically associated with a source location. (The compiler option -finfo-table-map was originally introduced for profiling.)
On the other hand, the “HasCallStack backtrace” is built using the implicitly passed HasCallStack
constraints, which are automatically supplied by the type-checker, provided HasCallStack appears in the type.
The HasCallStack backtrace seems the most useful, telling us exactly where our program went wrong.
However, the backtrace is very brief, as the rest of the program doesn’t have any HasCallStack constraints.
As such, this stack trace might be unhelpful in larger programs, if the call to error was placed behind
many layers of abstraction.
The IPE backtrace looks impressive, but doesn’t even show us where the exception is thrown!
We get more intermediate source locations, but not the source of the exception.
The function from which the exception is thrown is not even listed.
The reason the IPE backtrace may be unhelpful lies in the way the Haskell call stack works.
We show the IPE info for each stack frame, which doesn’t relate precisely to the original source code and the resulting stack trace feels unintuitive.
One reason for this is many function calls are tail-calls which don’t result in stack frames.
The IPE backtrace can be improved by manually annotating important parts of the
program which should always appear in a backtrace.
For example, we always want to know which handler the exception was thrown in, so
the handler function is annotated with annotateCallStackIO.
Further, we annotate the location where the exception is thrown.
handler :: Int -> IO Text
handler x = annotateCallStackIO $ do
  if x >= 100
    then annotateCallStackIO $ throw $ ErrorCall "Oh no!"
    else pure (pack "handleIndex")
When running this program again, the stack trace will now contain the source location of the handler the exception was thrown from:
Exception: ghc-internal:GHC.Internal.Exception.ErrorCall:
Oh no!
IPE backtrace:
annotateCallStackIO, called at app/Main.hs:42:10 in backtrace-0.1.0.0-inplace-server:Main
annotateCallStackIO, called at app/Main.hs:40:13 in backtrace-0.1.0.0-inplace-server:Main
Main.handler (app/Main.hs:(40,1)-(43,30))
Main.liftIO (src/Servant/Server/Internal/Handler.hs:30:36-42)
Servant.Server.Internal.Delayed.runHandler' (src/Servant/Server/Internal/Handler.hs:27:31-41)
Control.Monad.Trans.Resource.runResourceT (./Control/Monad/Trans/Resource.hs:(192,14)-(197,18))
Network.Wai.Handler.Warp.HTTP1.processRequest (./Network/Wai/Handler/Warp/HTTP1.hs:195:20-22)
Network.Wai.Handler.Warp.HTTP1.processRequest (./Network/Wai/Handler/Warp/HTTP1.hs:(195,5)-(203,31))
Network.Wai.Handler.Warp.HTTP1.http1server.loop (./Network/Wai/Handler/Warp/HTTP1.hs:(141,9)-(157,42))
HasCallStack backtrace:
collectExceptionAnnotation, called at libraries/ghc-internal/src/GHC/Internal/Exception.hs:170:37 in ghc-internal:GHC.Internal.Exception
toExceptionWithBacktrace, called at libraries/ghc-internal/src/GHC/Internal/Exception.hs:90:42 in ghc-internal:GHC.Internal.Exception
throw, called at app/Main.hs:42:32 in backtrace-0.1.0.0-inplace-server:Main
Note the first two entries of the IPE backtrace:
annotateCallStackIO, called at app/Main.hs:42:10 in backtrace-0.1.0.0-inplace-server:Main
annotateCallStackIO, called at app/Main.hs:40:13 in backtrace-0.1.0.0-inplace-server:Main
These have been added due to our manual annotation of our source program via annotateCallStackIO!
They give us the precise source location where the exception is thrown, making the IPE backtrace just as useful
as the HasCallStack backtrace.
However, note, we did not have to change the type signature of handler at all to get a much more informative stack trace.
throwIO vs throw vs error
Some readers may have noticed that we used throw instead of error, which is usually the go-to function for throwing
example errors (or for throwing from within pure code).
At the moment, throw and error produce noticeably different stack traces, because
error evaluates the exception annotations more lazily than throw, which can lead
to the call stack not being captured when the exception is thrown. This should be possible to resolve; see GHC issue
#25430.
On the other hand, throwIO behaves more predictably within IO code, and the IPE backtrace then includes the source location where the exception is thrown.
This means that how the exception is thrown is important to get reasonable stack traces.
Unsurprisingly, you should use throwIO whenever you are within the IO monad.
Summary
Annotation stack frames are a lightweight way to add extra information to stack traces.
By modifying the execution stack, the information is always available and can be used
by the native stack decoder to display informative backtraces to users. We’re
interested to hear what users think about this feature and how libraries will be
adapted to take advantage of the new annotation frames.
This work has been performed in collaboration with Mercury, who
have a long-term commitment to the scalability and robustness of the Haskell
ecosystem.
Well-Typed are always interested in projects and looking for funding to improve
GHC and other Haskell tools. Please contact info@well-typed.com if we
might be able to work with you!
The ghc-experimental package ships with GHC, but is distinct from base, and has weaker stability guarantees. This allows new APIs to be introduced and fine-tuned before eventually being stabilised and added to base.↩︎
A dependency graph is a representation of how different parts of a software project rely on each other.
Understanding the dependency graph helps a software engineer see the bigger picture of how their component fits into the whole project
and why certain changes might affect other areas.
It’s a useful tool for organizing, debugging, and improving the source code.
Engineers responsible for managing the development and build environments also benefit greatly
from understanding dependency graph concepts and how they are used by the build system.
This knowledge is crucial for optimizing build times since it allows engineers to identify opportunities
to parallelize and improve the incrementality of builds.
Understanding the dependency graph also helps in troubleshooting build failures, managing changes safely,
and ensuring that updates or refactors do not worsen the overall design of the codebase.
In this blog post, we’ll take a fresh look at dependency graphs, starting from the basic concepts
and building up from there.
You will learn what a dependency graph is, some terminology required to be successful in managing it,
and what it is used for.
What is a dependency graph?
A dependency graph is a visual map that explains the connectivity between parts of a software project.
Let’s use a contrived example of a dependency graph in a tiny codebase and lay out some key terminology.
Nodes and edges
A node in a dependency graph represents an individual item
which can be a software package, a module, or a component.
The edges (connections) between nodes represent dependencies,
meaning one node relies on another to function or build correctly.
Dependencies
appA depends on libX directly therefore libX is a direct dependency of appA.
For example, if you import the requests package in your Python module,
this would be that module’s direct dependency.
appB depends on commons via libY therefore commons is a transitive dependency of appB.
For example, if your C++ program depends on libcurl, then it also depends (transitively)
on every external library that libcurl depends on
such as OpenSSL or zlib.
Dependents
libX and libY directly depend on commons.
This could also be reversed — commons has two direct dependents: libX and libY.
In fact, the dependents are often called reverse dependencies.
Similarly, secrets has two reverse dependencies: one direct (appB) and one transitive (testB).
Shape and orientation
A simple dependency graph can sometimes look like a tree,
with one common base component at the root,
supporting multiple dependents (components pointing back towards the root),
which in turn are depended on by the leaves (components with no further dependents).
However, dependency graphs are usually more complex than trees
and belong to a more general family of graphs
known as directed acyclic graphs (DAG),
where you can only follow the arrows in one direction,
and you can never end up back at the same node you started from.
We’ll talk about the word “acyclic” in more detail later in the post.
When describing this project, we could emphasize that commons is foundational -
the root that everything else builds upon.
Libraries and apps become the trunk and branches, with tests as leaves.
Without clearly defining how arrows show dependencies,
we might easily draw all arrows pointing the opposite way (a reverse dependency graph1):
This makes terms like “roots” or “leaves” potentially confusing,
but it’s important to be aware of them as you will likely hear them being used
when talking about graphs.
What is it used for?
Dependency graph concepts have lots of applications:
In artifact-based build systems such as Bazel,
a dependency graph is used to determine the order in which different parts of a project should be built.
Having access to this allows building only what is necessary and in the correct sequence.
GNU Make uses a dependency graph implicitly through its rules:
each target specifies its dependencies, and Make constructs a graph to determine the order in which to build targets.
Native programming language build tools use the dependency graph to fetch and build modules in the correct order, e.g.,
in Go, it is used to maintain a cache of passing test results
(where go test checks whether any of the transitive dependencies of the tests have changed since the last run).
Graph theory applications
Graph theory is a branch of mathematics focused on networks of connected items.
Understanding some graph theory ideas can make managing dependencies much smarter.
Being familiar with the terminology also helps to find relevant tooling,
for instance, knowing that part of the graph is called subgraph
would let you find more relevant results when searching for algorithms to extract a part of the graph.
Connected Components
A connected component is a group of nodes
where each one can reach every other by following edges in either direction.
In a dependency graph, this means a set of source code modules that are all linked together by a dependency link
(or a reverse dependency link) — what’s important is that there is some sort of connection.
When two applications share modules in the same connected component, they become indirectly connected
which might make it hard to test or deploy them separately.
In a worse scenario, if the modules of these apps actually import from each other,
then code changes in one app can unexpectedly break another.
Applications with isolated dependencies are much easier to extract and move to separate repositories.
In the example below, the configuration is shared among three
applications making them part of the same connected component.
That is, you can’t move any of the applications along with the shared configuration out of the codebase.
This could be refactored by splitting the shared configuration into separate configurations for each application.
Making changes specific to appA in the shared-config no longer triggers rebuilds of all applications
and running all their tests.
One connected component:
Three connected components:
Isolated nodes (nodes without any edges) are also connected components,
and they may represent software units that are no longer needed.
For instance, a program might have once used a third-party library,
but later stopped using its functionality.
If nothing else in the codebase depends on that library, it is now isolated,
and can be removed to avoid rebuilding.
Cut Points and Bridges
A cut point (also called a “cut vertex” or “articulation point”) is a node
that, if removed, would split the graph into separate components.
A bridge is an edge whose removal would split its connected component in two.
In the example below, if we stop depending on the third-party library third-party-lib,
we would stop depending transitively on all those third-party libraries
that third-party-lib brought into the dependency graph of our project.
To remove a “cut point” like third-party-lib, you can replace its functionality with an existing dependency or reimplement it yourself.
This can make builds faster (fewer downloads), more secure, and more reliable.
The npm left-pad incident shows
how third-party dependencies can cause problems.
Creating isolated groups in the dependency graph is often a good thing as it means those modules can now evolve,
be tested, and deployed independently, reducing risk and complexity.
However, in a large dependency graph, the hard part is to identify the best cut points
as often breaking the dependency between two modules might still leave the part of the dependency
graph you are concerned about connected to the rest of the codebase.
Breaking appA -> config1 (incorrectly assuming that it is a bridge)
would still leave appA connected to the rest of the codebase via the libX connection.
Identifying that libX might still lead to the rest of the codebase via a chain of connections is not trivial,
and to refactor the dependency graph into something one can reason about,
it is often necessary to use advanced dependency graph querying and visualization tooling.
To estimate how much work it would be to break a connection, one can list all paths between your module
and the undesired dependency, which will be discussed later.
Subgraphs
A subgraph is just a smaller part of the whole graph, focusing on a subset of nodes and their connections.
Depending on the complexity and shape of your dependency graph, it might only make sense to interact with
a subgraph of it.
Take a look at the dependency graphs of the microservices at tech giants
to appreciate the complexity of their dependency management.
Visualizing or analyzing a subgraph (e.g., all dependencies of a single service) helps you zoom in on what matters for your project.
If the dependencies of a program are complicated,
it may make sense to extract only its direct dependencies and their direct dependencies.
In graph theory terms, this means focusing on nodes that are at most two degrees away from the program node.
The degree of a node refers to the number of direct connections (dependencies) it has.
We can extract a subgraph by limiting our view to nodes within a certain depth (in this case, a depth of two).
By controlling the depth, you avoid being overwhelmed by the entire transitive chain of dependencies.
With the same dependency graph we had seen in the very first graph of the post,
we can extract the subgraph containing dependencies with depth of 2 for appB:
Transitivity
The transitive closure of a node in a graph is the set of all nodes
that can be reached from that node by following edges.
In the context of a dependency graph, the transitive closure2 of a module
is the entire “tree” of things required for that module to work.
In this dependency graph,
both appA and appB depend on secrets (directly) and cloud (directly and transitively).
In this cluttered visualization of the graph, the direct dependency edge between appA/appB and cloud
could be removed for clarity as we already know that they are connected:
The process of simplifying the graph by removing edges that are implied by other edges is called transitive reduction.
Keep in mind that you would not normally want to do this for any other reason than clearer visualization of the graph.
If your build tool tracks node dependencies by reading build metadata (stored in files maintained by engineers),
this information must stay up-to-date so the build system can correctly identify necessary build steps.
Imagine that at some point in time appA used to import some code from cloud, however, after some refactoring,
it doesn’t depend on it directly any longer:
Now, what if in the build metadata files, the direct dependencies of appA are still [cloud, secrets]?
The stale build metadata information such as a redundant declaration of the direct dependency won’t be an issue
from the build system’s perspective: cloud will ultimately end up in the transitive closure of appA.
However, if after further refactorings, appA no longer depends on secrets, we end up with this graph used by the build system:
Since appA depends on cloud, it becomes dependent on the transitive closure of cloud
which might lead to slower build times (all resources that cloud depends on now need to be downloaded to build appA).
Paths
Finding paths between arbitrary modules in a dependency graph helps understand
how different parts of your system are connected.
In this context, we are primarily interested in finding simple paths — paths where all nodes visited are distinct.
By finding a path from module A to module B, you can see if changes in A might affect B (or vice versa).
This helps estimate the risk of changes and debug issues that propagate through dependencies.
For example, if a module contains source code under a specific license,
you might want to ensure no paths from applications with incompatible licenses lead to it,
preventing its inclusion in the application bundle.
With this contrived example of a dependency graph,
there are two paths from appA to commons:
appA -> libX -> libY -> commons
appA -> secrets -> commons
In a large, highly connected dependency graph, there may be hundreds of paths between two modules.
When listing paths, shortest paths help to understand the minimal set of dependencies connecting two modules.
In contrast, the longest path between two modules tells you how deep the dependency chains are.
The higher the average number of nodes in all paths in the graph, the more interconnected your codebase is.
Having a very interconnected dependency graph might be problematic because it becomes hard to reason about
how changes will propagate and a change in a low-level module can ripple through many layers,
increasing the risk of unexpected breakages.
Topological sort
Topological sort (or order) is a way of ordering the nodes in a dependency graph
so that every node comes after all the nodes it depends on.
A build system might use topological sort to determine what must be built first
and which targets can be built in parallel.
Having access to this contrived dependency graph,
and oversimplifying what a modern build system would do with this dependency graph,
we could produce a parallelizable list of build actions.
In order to build a particular node (say, produce a binary executable), we need to first build all nodes
that this node depends on (transitively).
For instance, let’s say we want to build appA:
To build appA, we need to first build its direct dependency, libX.
To build libX, we need to first build its direct dependencies, commons and secrets.
commons and secrets can be built immediately as they do not have any dependencies.
This means that our dependency graph nodes would be sorted like this:
[secrets, commons], libX, appA
secrets and commons can be built in parallel, and once both of them are built,
we can start building libX, and, thereafter, appA.
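As a tiny illustration of this batching (using Python’s standard graphlib module rather than any particular build system; the module names are the ones from our example):
from graphlib import TopologicalSorter

# appA depends on libX, which depends on commons and secrets.
deps = {
    "appA": {"libX"},
    "libX": {"commons", "secrets"},
    "commons": set(),
    "secrets": set(),
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = ts.get_ready()          # nodes whose dependencies are all built
    print("build in parallel:", sorted(ready))
    ts.done(*ready)                 # mark them as built

# Prints:
# build in parallel: ['commons', 'secrets']
# build in parallel: ['libX']
# build in parallel: ['appA']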
Parallelism emerges only when the graph has branches, that is, multiple independent subgraphs
that can be built concurrently once their dependencies are satisfied.
Practically, this means that flattening overly nested or serial dependencies can unlock better
parallelism leading to faster builds.
In an extreme case, if your graph is in the shape of a linked list
such as app -> lib -> secrets -> commons, no parallelism can be achieved
because every node would need to wait for its dependency to be built first.
However, even when components must be built sequentially due to their dependencies,
parallelism can still occur within each component,
for instance, compiling multiple source files simultaneously within a single library.
Cycles
Cycles in a dependency graph mean that some components depend on each other in a loop,
making it impossible to determine the order in the dependency chain.
Build systems like Bazel require the dependency graph to be a directed graph without cycles
(commonly known as Directed Acyclic Graph, or DAG)
because cycles would lead to infinite build loops
and prevent the system from knowing which component to build first.
With this graph having a cycle (libA -> libB -> libC -> libA), it is unclear
in what order dependencies of app should be built:
When adopting a build system that needs to construct a DAG out of your dependency graph,
you might need to make refactorings in the codebase to break cycles.
This is particularly true for legacy codebases written in Python, JavaScript, or Ruby
where native build tools might tolerate cycles in the dependency graph.
A DAG is a very common data structure used by various build systems such as Bazel,
Pants, and Buck2,
process orchestration software such as Dagster, Flyte, and Airflow,
and software engineering tooling such as Git.
In this post, we have reviewed the basic principles related to graph theory and talked about dependency graphs
that consist of modules in a codebase.
In sophisticated build systems, you’ll find that more kinds of graphs exist, with differences between them.
In Bazel, there is a build graph (what we have called dependency graph in this post for simplicity)
and an action graph that breaks down each component into specific actions (like compiling a file or generating code)
that need to be executed.
There are some more advanced kinds of graphs you might run into
such as the evaluation graph (Skyframe graph) representing Bazel’s internal state
(see skyscope to learn more)
and the shadow dependency graph which is created
when aspects are used.
In the next blog post, we will cover common problems associated with managing project dependencies
and share best practices for keeping a large dependency graph healthy over time.
The reversed dependency graph concept is useful in scenarios like impact analysis
(e.g., “If changes are made to this core library, what other components will be affected?”).↩
You won’t see this term often, but the transitive closure that also includes the node itself
from which we start the search is called a reflexive transitive closure.↩
Back in March, with version 4.17.0, Lean introduced partial_fixpoint, a new way to define recursive functions. I had drafted a blog post for the official Lean FRO blog back then, but forgot about it, and with the Lean FRO blog discontinued, I’ll just publish it here, better late than never.
With the partial_fixpoint mechanism we can model possibly partial functions (so those returning an Option) without an explicit termination proof, and still prove facts about them. See the corresponding section in the reference manual for more details.
On the Lean Zulip, I was asked if we can use this feature to define the McCarthy 91 function and prove it to be total. This function is a well-known tricky case for termination proofs.
First let us have a brief look at why this function is tricky to define in a system like Lean. A naive definition like
def f91 (n : Nat) : Nat :=
if n > 100
then n - 10
else f91 (f91 (n + 11))
does not work; Lean is not able to prove termination of this function by itself.
Even using well-founded recursion with an explicit measure (e.g. termination_by 101 - n) is doomed, because we would have to prove facts about the function’s behaviour (namely that f91 n = f91 101 = 91 for 90 ≤ n ≤ 100) and at the same time use that fact in the termination proof that we have to provide while defining the function. (The Wikipedia page spells out the proof.)
We can make well-founded recursion work if we change the signature and use a subtype on the result to prove the necessary properties while we are defining the function. Lean by Example shows how to do it, but for larger examples this approach can be hard or tedious.
With partial_fixpoint, we can define the function as a partial function without worrying about termination. This requires a change to the function’s signature, returning an Option Nat:
def f91 (n : Nat) : Option Nat :=
if n > 100
then pure (n - 10)
else f91 (n + 11) >>= f91
partial_fixpoint
From the point of view of the logic, Option.none is then used for those inputs for which the function does not terminate.
This function definition is accepted and the function runs fine as compiled code:
#eval f91 42
prints some 91.
The crucial question is now: can we prove anything about f91? In particular, can we prove that this function is actually total?
Since we now have the f91 function defined, we can start proving auxiliary theorems, using whatever induction schemes we need. In particular we can prove that f91 is total and always returns 91 for n ≤ 100:
theorem f91_spec_high (n : Nat) (h : 100 < n) : f91 n = some (n - 10) := by
unfold f91; simp [*]
theorem f91_spec_low (n : Nat) (h₂ : n ≤ 100) : f91 n = some 91 := by
unfold f91
rw [if_neg (by omega)]
by_cases n < 90
· rw [f91_spec_low (n + 11) (by omega)]
simp only [Option.bind_eq_bind, Option.some_bind]
rw [f91_spec_low 91 (by omega)]
· rw [f91_spec_high (n + 11) (by omega)]
simp only [Nat.reduceSubDiff, Option.some_bind]
by_cases h : n = 100
· simp [f91, *]
· exact f91_spec_low (n + 1) (by omega)
theorem f91_spec (n : Nat) : f91 n = some (if n ≤ 100 then 91 else n - 10) := by
by_cases h100 : n ≤ 100
· simp [f91_spec_low, *]
· simp [f91_spec_high, Nat.lt_of_not_le ‹_›, *]
-- Generic totality theorem
theorem f91_total (n : Nat) : (f91 n).isSome := by simp [f91_spec]
(Note that theorem f91_spec_low is itself recursive in a somewhat non-trivial way, but Lean can figure that out all by itself. Use termination_by? if you are curious.)
This is already a solid start! But what if we want a function of type f91! (n : Nat) : Nat, without the Option? We can then derive that from the partial variant, as we have just proved it to be total:
def f91! (n : Nat) : Nat := (f91 n).get (f91_total n)
theorem f91!_spec (n : Nat) : f91! n = if n ≤ 100 then 91 else n - 10 := by
simp [f91!, f91_spec]
Using partial_fixpoint one can decouple the definition of a function from a termination proof, or even model functions that are not terminating on all inputs. This can be very useful in particular when using Lean for program verification, such as with the aeneas package, where such partial definitions are used to model Rust programs.
For a few weeks now, we’ve been tackling problems related to data structures, with a sprinkling of algorithmic ideas in there. Last week, we covered sets and heaps. Prior to that, we considered Matrices and the binary search algorithm.
This week, we’ll cover our first graph problem! Graph problems often build on a lot of fundamental layers. You need to understand the algorithm itself. Then you need to use the right data structures to apply it. And you’ll also still need the core problem solving patterns at your disposal. These 3 areas correspond to the first 3 modules of Solve.hs, our Haskell problem solving course! Check out that course to level up your Haskell skills!
The Problem
Today’s problem is called Number of Islands. We are given a 2D array as input, where every cell is either “land” or “sea” (either represented as the characters 1 and 0, or True and False). We want to find the number of distinct islands in this grid. Two “land” cells are part of the same island if we can draw a path from one cell to the other that only uses other land cells and only travels up, down, left, and right (but not diagonally).
Let’s suppose we have this example:
111000
100101
100111
110000
000110
This grid has 3 islands. The island in the top left corner comprises 7 connected cells. Then there’s a small island in the bottom right with only 2 cells. Finally, we have a third island in the middle right with 5 tiles. While it is diagonally adjacent to the first island, we do not count this as a connection.
The Algorithm
This is one of the most basic questions you’ll see that requires a graph search algorithm, like Depth-First-Search (DFS) or Breadth-First-Search (BFS). The basic principle is that we will select a starting coordinate for a search. We will use one of these algorithms to find all the land cells that are part of that cell’s island. We’ll then increment a counter for having found this island.
We need to track all the cells that are part of this island. We’ll then keep iterating for new start locations to find new islands, but we have to exclude any locations that have already been explored.
While BFS is certainly possible, the solutions we’ll write here will use DFS. Our solution will consist of 3 components:
A “neighbors” function that finds all adjacent land tiles to a given tile.
A “visit” function that will take a starting coordinate and populate a “visited” set with all of the cells on the same island as the starting coordinate.
A core “loop” that will consider each coordinate as a possible starting value for an island.
This ordering represents more of a “bottom up” approach to solving the problem. Going “top down” also works, and may be easier if you’re unfamiliar with graph algorithms. But as you get more practice with them, you’ll get a feel for knowing the bottom layers you need right away.
Rust Solution
To write our solution, we’ll write the components in our bottom up order. We’ll start with our neighbors function. This will take the island grid (LeetCode supplies us with a Vec<Vec<char>>) and the current location. We’ll represent locations as tuples, (usize, usize). This function will return a vector of locations.
We’ll start this function by defining a few values. We want to know the length and width of the grid, as well as defining r and c to quickly reference the current location.
use std::collections::HashSet;
pub fn neighbors(
grid: &Vec<Vec<char>>,
loc: &(usize,usize)) -> Vec<(usize,usize)> {
let m = grid.len();
let n = grid[0].len();
let r = loc.0;
let c = loc.1;
let mut result: Vec<(usize,usize)> = Vec::new();
...
}
Now we just have to look in each of the four directions. Each direction is included as long as it is a land tile and that it is not out of bounds. We’ll do our “visited” checks elsewhere.
pub fn neighbors(
grid: &Vec<Vec<char>>,
loc: &(usize,usize)) -> Vec<(usize,usize)> {
let m = grid.len();
let n = grid[0].len();
let r = loc.0;
let c = loc.1;
let mut result: Vec<(usize,usize)> = Vec::new();
if (r > 0 && grid[r - 1][c] == '1') {
result.push((r - 1, c));
}
if (c > 0 && grid[r][c - 1] == '1') {
result.push((r, c - 1));
}
if (r + 1 < m && grid[r + 1][c] == '1') {
result.push((r + 1, c));
}
if (c + 1 < n && grid[r][c + 1] == '1') {
result.push((r, c + 1));
}
return result;
}
Now let’s write the visit function. Remember, this function’s purpose is to populate the visited set starting from a certain location. We’ll use a HashSet of tuples for the visited set.
All we have to do now is find the neighbors of this location, and recursively “visit” each one of them!
pub fn visit(
grid: &Vec<Vec<char>>,
visited: &mut HashSet<(usize,usize)>,
loc: &(usize,usize)) {
if (visited.contains(loc)) {
return;
}
visited.insert(*loc);
let ns = neighbors(grid, loc);
for n in ns {
visit(grid, visited, &n);
}
}
We’re not quite done, as we have to loop through our grid to call this function on each possible start. This isn’t so bad though. We start our function by defining key terms.
pub fn num_islands(grid: Vec<Vec<char>>) -> i32 {
let m = grid.len();
let n = grid[0].len();
let mut visited: HashSet<(usize,usize)> = HashSet::new();
let mut islandCount = 0;
...
// islandCount will be our final result
return islandCount;
}
Now we’ll “loop” through each possible starting location:
pub fn num_islands(grid: Vec<Vec<char>>) -> i32 {
let m = grid.len();
let n = grid[0].len();
let mut visited: HashSet<(usize,usize)> = HashSet::new();
let mut islandCount = 0;
for row in 0..m {
for col in 0..n {
...
}
}
// islandCount will be our final result
return islandCount;
}
The last question is, what do we do for each location? If the location is land AND it is still unvisited, we treat it as the start of a new island. This means we increase the island count and then “visit” the location. When we consider other cells on this island, they’re already visited, so we won’t increase the island count when we find them!
pub fn num_islands(grid: Vec<Vec<char>>) -> i32 {
let m = grid.len();
let n = grid[0].len();
let mut visited: HashSet<(usize,usize)> = HashSet::new();
let mut islandCount = 0;
for row in 0..m {
for col in 0..n {
let loc: (usize,usize) = (row,col);
if grid[row][col] == '1' && !(visited.contains(&loc)) {
islandCount += 1;
visit(&grid, &mut visited, &loc);
}
}
}
return islandCount;
}
Here’s our complete solution:
use std::collections::HashSet;
pub fn neighbors(
grid: &Vec<Vec<char>>,
loc: &(usize,usize)) -> Vec<(usize,usize)> {
let m = grid.len();
let n = grid[0].len();
let r = loc.0;
let c = loc.1;
let mut result: Vec<(usize,usize)> = Vec::new();
if (r > 0 && grid[r - 1][c] == '1') {
result.push((r - 1, c));
}
if (c > 0 && grid[r][c - 1] == '1') {
result.push((r, c - 1));
}
if (r + 1 < m && grid[r + 1][c] == '1') {
result.push((r + 1, c));
}
if (c + 1 < n && grid[r][c + 1] == '1') {
result.push((r, c + 1));
}
return result;
}
pub fn visit(
grid: &Vec<Vec<char>>,
visited: &mut HashSet<(usize,usize)>,
loc: &(usize,usize)) {
if (visited.contains(loc)) {
return;
}
visited.insert(*loc);
let ns = neighbors(grid, loc);
for n in ns {
visit(grid, visited, &n);
}
}
pub fn num_islands(grid: Vec<Vec<char>>) -> i32 {
let m = grid.len();
let n = grid[0].len();
let mut visited: HashSet<(usize,usize)> = HashSet::new();
let mut islandCount = 0;
for row in 0..m {
for col in 0..n {
let loc: (usize,usize) = (row,col);
if grid[row][col] == '1' && !(visited.contains(&loc)) {
islandCount += 1;
visit(&grid, &mut visited, &loc);
}
}
}
return islandCount;
}
Haskell Solution
The structure of this solution translates well to Haskell, since it’s a recursive solution at its root. We’ll just use a couple of folds to handle the other loops. Let’s outline the solution:
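Here’s the skeleton we’ll fill in, with the type signatures of our three helpers (the body of numberOfIslands is elided for now); it matches the final solution shown at the end:
numberOfIslands :: A.Array (Int, Int) Bool -> Int
numberOfIslands grid = ...
  where
    ((minRow, minCol), (maxRow, maxCol)) = A.bounds grid
    neighbors :: (Int, Int) -> [(Int, Int)]
    visit :: (Int, Int) -> HS.HashSet (Int, Int) -> HS.HashSet (Int, Int)
    loop :: (Int, Int) -> (Int, HS.HashSet (Int, Int)) -> (Int, HS.HashSet (Int, Int))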
Since we’re writing our functions “in line”, we don’t need to pass the grid around like we did in our Rust solution (though inline functions are also possible there). What you should observe immediately is that visit and loop have a similar structure. They both fit into the a -> b -> b pattern we want for foldr! We’ll use this to great effect!
But first, let’s fill in neighbors. Each of the 4 directions requires the same two conditions we used before. We make sure it’s not out of bounds, and that the next tile is “land”. Here’s how we check the “up” direction:
numberOfIslands :: A.Array (Int, Int) Bool -> Int
numberOfIslands grid = ...
where
((minRow, minCol), (maxRow, maxCol)) = A.bounds grid
neighbors :: (Int, Int) -> [(Int, Int)]
neighbors (row, col) =
let up = if row > minRow && grid A.! (row - 1, col) then Just (row - 1, col) else Nothing
...
We return Nothing if it is not a valid neighbor. Then we just combine the four directional options with catMaybes to complete this helper:
numberOfIslands :: A.Array (Int, Int) Bool -> Int
numberOfIslands grid = ...
where
((minRow, minCol), (maxRow, maxCol)) = A.bounds grid
neighbors :: (Int, Int) -> [(Int, Int)]
neighbors (row, col) =
let up = if row > minRow && grid A.! (row - 1, col) then Just (row - 1, col) else Nothing
left = if col > minCol && grid A.! (row, col - 1) then Just (row, col - 1) else Nothing
down = if row < maxRow && grid A.! (row + 1, col) then Just (row + 1, col) else Nothing
right = if col < maxCol && grid A.! (row, col + 1) then Just (row, col + 1) else Nothing
in catMaybes [up, left, down, right]
Now we start the visit function by checking if we’ve already visited the location, and add it to the set if not:
numberOfIslands :: A.Array (Int, Int) Bool -> Int
numberOfIslands grid = ...
where
((minRow, minCol), (maxRow, maxCol)) = A.bounds grid
visit :: (Int, Int) -> HS.HashSet (Int, Int) -> HS.HashSet (Int, Int)
visit coord visited = if HS.member coord visited then visited else
let visited' = HS.insert coord visited
...
Now we have to get the neighbors and “loop” through the neighbors so that we keep the visited set updated. This is where we’ll apply our first fold. We’ll recursively fold over visit on each of the possible neighbors, which will give us the final visited set from this process. That’s all we need for this helper!
numberOfIslands :: A.Array (Int, Int) Bool -> Int
numberOfIslands grid = ...
where
((minRow, minCol), (maxRow, maxCol)) = A.bounds grid
visit :: (Int, Int) -> HS.HashSet (Int, Int) -> HS.HashSet (Int, Int)
visit coord visited = if HS.member coord visited then visited else
let visited' = HS.insert coord visited
    ns = neighbors coord
in foldr visit visited' ns
Now our loop function will consider only a single coordinate. We think of this as having two pieces of state. First, the number of accumulated islands (the Int). Second, we have the visited set. So we check if the coordinate is unvisited land. If so, we increase the count, and get our “new” visited set by calling visit on it. If not, we return the original inputs.
Now for the final flourish. Our loop also has the structure for foldr. So we’ll loop over all the indices of our array, which will give us the final number of islands and the visited set. Our final answer is just the fst of these:
numberOfIslands :: A.Array (Int, Int) Bool -> Int
numberOfIslands grid = fst (foldr loop (0, HS.empty) (A.indices grid))
where
((minRow, minCol), (maxRow, maxCol)) = A.bounds grid
neighbors :: (Int, Int) -> [(Int, Int)]
neighbors (row, col) =
let up = if row > minRow && grid A.! (row - 1, col) then Just (row - 1, col) else Nothing
left = if col > minCol && grid A.! (row, col - 1) then Just (row, col - 1) else Nothing
down = if row < maxRow && grid A.! (row + 1, col) then Just (row + 1, col) else Nothing
right = if col < maxCol && grid A.! (row, col + 1) then Just (row, col + 1) else Nothing
in catMaybes [up, left, down, right]
visit :: (Int, Int) -> HS.HashSet (Int, Int) -> HS.HashSet (Int, Int)
visit coord visited = if HS.member coord visited then visited else
let visited' = HS.insert coord visited
ns = neighbors coord
in foldr visit visited' ns
loop :: (Int, Int) -> (Int, HS.HashSet (Int, Int)) -> (Int, HS.HashSet (Int, Int))
loop coord (count, visited) = if grid A.! coord && not (HS.member coord visited)
then (count + 1, visit coord visited)
else (count, visited)
A Note on the Graph Algorithm
It seems like we solved this without even really applying a “graph” algorithm! We just did a loop and a recursive call and everything worked out! There are a couple of elements of this problem that make it one of the easiest graph problems out there.
Normally, we have to use some kind of structure to store a search state, telling us the next nodes in our graph to search. For BFS, this is a queue. For Dijkstra or A*, it is a heap. For DFS it is normally a stack. However, we are using the call stack to act as the stack for us!
When we make a recursive call to “visit” a location, we don’t need to keep track of which node we return to after we’re done. The function returns, and the prior node is already sitting there on the call stack.
The other simplifying factor is that we don’t need to do any backtracking or state restoration. Sometimes with a DFS, you need to “undo” some of the steps you took if you don’t find your goal. But this algorithm is just a space-filling algorithm. We are just trying to populate our “visited” set, and we never take nodes out of this set once we have visited them.
Conclusion
We’ve got a couple more graph problems coming up next. If you want to learn more about applying graph algorithms in Haskell (including implementing them for yourself!) check out Solve.hs, where Module 3 will teach you about algorithms including DFS, BFS, A* and beyond!
Minimax is a general algorithm for finding optimal strategies.
It’s not meant to be efficient or practical. It is more of a
basic concept of game theory, and a reference against which
to compare other game-solving algorithms.
We consider a simple model of two-player games.
They take turns playing moves until reaching an end
state with a final score. One player’s goal is to maximize the
score, whereas the other player’s goal is to minimize it.
Let us call these players Max and Min respectively, short for Maximizer
and Minimizer.
We represent such a game by its game tree, which is made up of
three constructors:
a Max (resp. Min) node represents a game state where Max (resp. Min)
chooses the next move, each move resulting in a new game state,
and an End leaf represents an end state as its score.
Note that Max and Min nodes must have at least one possible move.
You may be wondering about games that end when one player can no longer play:
instead of an empty Min or Max node, such game states simply correspond
to an End leaf, making the final score explicit.
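In Haskell, one way to write this type looks like the following (the original may well use non-empty lists for Max and Min to enforce that invariant):
data Game score
  = Max [Game score]  -- Max picks one of the subgames to move to
  | Min [Game score]  -- Min picks one of the subgames to move to
  | End score         -- the game is over, with this final score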
Most real games just have a win/tie/lose end condition.
They naturally correspond to applying Game to a type with three possible scores:
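For instance, with a score type like this (ordered from Max’s point of view; the exact names are a guess):
data Score = Loss | Tie | Win
  deriving (Eq, Ord, Show)
-- such games are then values of type Game Score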
In practice, chess engines don’t work with the whole game tree
since it is too massive. Instead, they build approximations by
pruning certain branches of the tree and replacing them with leaves.
The score on each leaf is a number which estimates how favorable
the game state is to either player. So we end up with
Game ℝ, or Game Double.
In general, the type Game represents two-player games
with complete information and zero-sum objectives.
We shall assume that score is a totally ordered set. This requirement
corresponds to a constraint Ord score in Haskell. In that case,
there exists an “optimal strategy” for each player which guarantees
them an “optimal score” m in the sense that as long as one player
sticks to their “optimal strategy”, the other player cannot
score better than m.
This situation is what we call a Nash equilibrium in game theory.
For win/tie/lose games, the existence of a Nash equilibrium
means that either there is a winning strategy for one of the players,
or they must tie by playing optimally.
The “optimal score” m is unique, and can be computed by a fold of the game tree,
replacing Max and Min constructors with the functions maximum and minimum.
This is the minimax algorithm:
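In code, this is a direct fold over the game tree (a sketch, assuming the Game type sketched above):
minimax :: Ord score => Game score -> score
minimax (End s)  = s
minimax (Max ts) = maximum (map minimax ts)  -- Max picks the best score for Max
minimax (Min ts) = minimum (map minimax ts)  -- Min picks the best score for Min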
minimax is quite an inefficient algorithm:
it must traverse the whole game tree. Indeed, maximum
and minimum must traverse the whole list to find
the maximum or minimum element.
Often, we can do much better. For instance, consider the following tree:
Max [ End 0,
Min [ End (-1),
t ] ]
The minimax of that tree does not depend on the subtree t.
Indeed, minimum [-1, minimax t] is guaranteed to be at most -1,
so the maximum between that value and 0 is guaranteed to be 0.
Thus we can compute the minimax without inspecting the subtree t,
which may be arbitrarily large.
That idea leads to a more efficient algorithm to compute the minimax.
Alpha-beta
The alpha-beta pruning algorithm1 is a modification of
minimax with an extra pair of arguments:
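For now, only its type matters (the implementation itself is the "standard exercise" discussed below):
alphabeta :: Ord score => Game score -> (score, score) -> score
-- the (alpha, beta) pair is the "relevance interval" described next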
The pair (alpha, beta) represents a “relevance interval” which
relaxes the possible outputs of alphabeta.
Either alphabeta t (alpha, beta) produces a score within that interval,
in which case it is guaranteed to be equal to minimax. Otherwise,
alphabeta t (alpha, beta) produces a value outside of the interval,
in which case its exact value does not matter; it only has to be on
the same side of the interval as minimax t. More rigorously:
if alpha < minimax t < beta, then alphabeta t (alpha, beta) = minimax t;
if minimax t <= alpha, then alphabeta t (alpha, beta) <= alpha;
if beta <= minimax t, then beta <= alphabeta t (alpha, beta).
Leaving the value of alphabeta underspecified when outside of the
interval allows the implementation to short-circuit:
we can stop searching through Max nodes as soon as we can guarantee a score greater than beta,
and we can stop searching through Min nodes as soon as we can guarantee a score smaller than alpha.
We can then use alphabeta to redefine minimax:
-- Minimax using alpha-beta pruning
minimaxAB :: (Ord score, Bounded score) => Game score -> score
minimaxAB t = alphabeta t (minBound, maxBound)
assuming that score is Bounded with extreme values
minBound :: score and maxBound :: score.
It’s possible to avoid the Bounded constraint by changing
the interval type (score, score) to (Maybe score, Maybe score),
which amounts to adding distinguished top and bottom elements.
We’ll stick with Bounded to keep things a bit simpler.
Implementing alphabeta is a standard exercise.
It is even easier when you have a formal specification like the above
to guide the implementation.
But still, it is at least a little finicky and tedious to make sure that
you haven’t mixed your alphas and betas.
As we will see in this post,
we can streamline the implementation of alpha-beta pruning
by factoring the short-circuiting logic out of the “minimax” logic.
Generalized minimax
Remark that minimax only uses min and max
(via minimum and maximum), rather than the comparison
functions of Ord (compare, (<=), etc.).
We can reduce the dependency footprint of minimax by
defining a new class with only the necessary operations,
the class of lattices:
class Lattice a where
  -- Join, least upper bound, max
  (\/) :: a -> a -> a
  -- Meet, greatest lower bound, min
  (/\) :: a -> a -> a
In mathematics, lattices are algebraic structures with two operations
(\/) (“join”) and (/\) (“meet”)
satisfying commutativity, associativity, as well as the absorption laws:
x \/ (x /\ y) = x
x /\ (x \/ y) = x
In this post, we will only be looking at lattices that arise
out of total orders,
so this class is rather just a way of saying that we only
depend on min and max.
Binary operations can be iterated to combine lists of arguments,
similarly to the maximum and minimum functions:
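A sketch of both pieces; the names joins and meets are ours, and minimaxL is minimax rewritten against Lattice:
joins :: Lattice a => [a] -> a  -- iterated (\/), playing the role of maximum
joins = foldr1 (\/)

meets :: Lattice a => [a] -> a  -- iterated (/\), playing the role of minimum
meets = foldr1 (/\)

-- foldr1 is safe here: Max and Min nodes always have at least one move.
minimaxL :: Lattice score => Game score -> score
minimaxL (End s)  = s
minimaxL (Max ts) = joins (map minimaxL ts)
minimaxL (Min ts) = meets (map minimaxL ts)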
minimaxL generalizes minimax since every decidable total order is a lattice
(because you can use (<=) to define min/max).
Ideally this fact would be made explicit by making Lattice into
a superclass of Ord. Unfortunately in Haskell this would require us
to modify Ord or redefine it.
Another way to express the relation between Lattice and Ord is through a newtype.
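For instance, a sketch of such a newtype (the names OrdLattice and unOrdLattice are the ones used in the proof at the end of this post):
newtype OrdLattice a = OrdLattice { unOrdLattice :: a }

instance Ord a => Lattice (OrdLattice a) where
  OrdLattice x \/ OrdLattice y = OrdLattice (max x y)
  OrdLattice x /\ OrdLattice y = OrdLattice (min x y)

-- minimax recovered from minimaxL by going through the newtype
minimaxO :: Ord score => Game score -> score
minimaxO = unOrdLattice . minimaxL . fmap OrdLattice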
Focus on the type (score, score) -> score which appears in the signature of alphabeta.
More specifically, we are interested in a subset of those functions that
we shall call clamping functions.
Intuitively, a clamping function f is a delayed representation of a constant s:
the goal of f is to compute s, but it may also stop early with an approximation
if it’s not necessary to know the exact value of s.
The name “clamping function” is a reference to the clamp function:
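Here is a sketch of clamp, with the value as the first argument so that clamp s can be partially applied (this argument order is an assumption, and differs from Data.Ord.clamp):
clamp :: Ord score => score -> (score, score) -> score
clamp s (alpha, beta) = max alpha (min beta s)  -- force s into the interval [alpha, beta]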
We can think of the partially applied function clamp s as an encoding of the constant s,
which may or may not be output depending on the interval (alpha, beta).
More formally, a clamping function with value s is a function f :: (score, score) -> score
that satisfies the following, for all (alpha, beta) such that alpha < beta:
if alpha < s < beta, then f (alpha, beta) = s;
if s <= alpha, then f (alpha, beta) <= alpha;
if beta <= s, then beta <= f (alpha, beta).
Two clamping functions with the same value s are considered equal.
In particular, as clamping functions, const s is equal to clamp s.
Making the notion of equality explicit is necessary to make sense of equations
(laws for lattices, homomorphisms, and isomorphisms).
We enshrine the definition of clamping functions in a newtype:
-- Type of clamping functions, satisfying the properties above.
newtype Clamping score = Clamping ((score, score) -> score)

unClamping :: Clamping score -> (score, score) -> score
unClamping (Clamping f) = f
For any value s, we can construct the constant clamping function:
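In code (a sketch):
clamping :: score -> Clamping score
clamping s = Clamping (\_ -> s)  -- the constant clamping function with value s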
Note that \_ -> s and clamp s are both clamping functions with value s,
so both are valid definitions of clamping s.
We prefer the constant function \_ -> s because it does less work.
Conversely, we can project clamping functions back into their values
by passing the whole interval (minBound, maxBound):
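A corresponding sketch:
declamp :: Bounded score => Clamping score -> score
declamp f = unClamping f (minBound, maxBound)  -- the whole interval, so the exact value is returned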
Those two functions form an isomorphism between score and Clamping score,
meaning that they satisfy the following equations:
declamp . clamping = id
clamping . declamp = id
We now get to the secret sauce of this post:
the maximum of two clamping functions (as well as the minimum).
This operation can be defined in two ways.
First is the naive definition, for reference:
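A sketch of the pointwise definitions (using the names maxC and minC that appear below):
maxC, minC :: Ord score => Clamping score -> Clamping score -> Clamping score
maxC f g = Clamping (\i -> max (unClamping f i) (unClamping g i))
minC f g = Clamping (\i -> min (unClamping f i) (unClamping g i))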
Second is the lazy definition: if f (alpha, beta) is greater
than the given upper bound beta, then the max of f and g will
be even greater:
beta <= f (alpha, beta) <= max (f (alpha, beta)) (g (alpha, beta))
In that case, the maximum of f and g is allowed to output
f (alpha, beta) without looking at g.
Otherwise we must evaluate g, but we can tighten the interval by
updating the lower bound to max alpha (f (alpha, beta)).
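Here is one way to write the lazy versions, following that recipe (a sketch; lazyMinC mirrors lazyMaxC with the roles of the two bounds swapped):
lazyMaxC, lazyMinC :: Ord score => Clamping score -> Clamping score -> Clamping score
lazyMaxC f g = Clamping (\(alpha, beta) ->
  let x = unClamping f (alpha, beta)
  in if beta <= x
       then x  -- short-circuit: the maximum is at least beta, g is irrelevant
       else max x (unClamping g (max alpha x, beta)))
lazyMinC f g = Clamping (\(alpha, beta) ->
  let x = unClamping f (alpha, beta)
  in if x <= alpha
       then x  -- short-circuit: the minimum is at most alpha, g is irrelevant
       else min x (unClamping g (alpha, min beta x)))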
These “naive” and “lazy” functions denote the same value
(maxC = lazyMaxC and minC = lazyMinC),
but lazyMaxC and lazyMinC may do less work,
either by ignoring their second argument or by applying it to a smaller interval than expected.
The point is that these “lazy” functions embody
the short-circuiting logic of alpha-beta pruning exactly.
All that’s left to do is to plug them into minimax.
The lattice of clamping functions
With the lazy min and max that we just defined, we get a lattice:
instance Ord score => Lattice (Clamping score) where
  (\/) = lazyMaxC
  (/\) = lazyMinC
Specialize minimax in the lattice of clamping functions:
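That is, instantiate minimaxL at the type of clamping functions (the name minimaxC is ours):
minimaxC :: Ord score => Game (Clamping score) -> Clamping score
minimaxC = minimaxL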
This doesn’t look like much, but we have actually implemented the
alpha-beta pruning algorithm.
With a tiny bit of plumbing, we can redefine the function alphabeta
from earlier:
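A sketch, composing the pieces defined above:
alphabeta' :: Ord score => Game score -> (score, score) -> score
alphabeta' = unClamping . minimaxL . fmap clamping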
Then we want to partially apply alphabeta' to the interval (minBound, maxBound).
This amounts to replacing unClamping with declamp in the body of alphabeta'.
Behold our final implementation of minimax by alpha-beta pruning:
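A sketch, obtained from alphabeta' by replacing unClamping with declamp, exactly as described:
minimaxAB' :: (Ord score, Bounded score) => Game score -> score
minimaxAB' = declamp . minimaxL . fmap clamping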
To sum up, we implemented alpha-beta pruning as a simple combination of:
minimax, generalized from orders to lattices (minimaxL);
the lattice of clamping functions (Lattice (Clamping score)).
This alternative approach does not completely absolve you from effort:
you still have to juggle alphas and betas correctly to implement the lattice
(lazyMinC and lazyMaxC).
But unlike in the original alphabeta,
you don’t have to do all that juggling in the middle of a recursive function.
The logic of alpha-beta pruning is neatly decomposed into bite-sized pieces.
Correctness for free
Since we just reused the code of minimax, it’s also easier to prove that
that alpha-beta pruning yields the same result:
minimax = minimaxAB'
As we are about to see, this is a direct consequence of
the free theorem2 for minimaxL:
any function of type forall s. Lattice s => Game s -> s,
such as minimaxL, commutes with any lattice homomorphism3 f,
in the following sense:
f . minimaxL = minimaxL . fmap f
We can picture that equation as a commutative diagram:
(To be pedantic, the above proof
conflates minimaxL with minimax/minimaxO,
which relies on pretending that Lattice is a superclass of Ord.
Below is another proof that doesn’t take that shortcut,
by going through the OrdLattice newtype explicitly,
so this proof applies more directly to the Haskell definitions as written here.)
A somewhat more rigorous proof
We want to prove that the alpha-beta-pruning minimaxAB'
is equivalent to the naive minimax:
minimax = minimaxAB'
Recall the free theorem of minimaxL. For any lattice isomorphism (f, f⁻¹):
minimaxL = f⁻¹ . minimaxL . fmap f
Replace (f, f⁻¹) with the lattice isomorphism (clamping . unOrdLattice, OrdLattice . declamp)
between the lattices OrdLattice score and Clamping score.
The above is only a proof of functional correctness:
minimax and minimaxAB' compute the same result.
To verify that minimaxAB' does so more efficiently
is another problem for another day. For now, we can test it.
Strictness check
We test that our “fancy” implementation of alpha-beta (minimaxAB') has the same
strictness as the “classical” implementation (minimaxAB),
which we presume to be much lazier than minimax.
We use StrictCheck for property-testing of strictness behaviors in Haskell.
The following test checks that minimaxAB and minimaxAB' have the
same demand on random inputs.
We use the function observe1 from StrictCheck to observe the demand
of a function f: observe1 applies f to an instrumented copy
of the provided input g, it forces the output (f g of type Int)
using the provided forcing function (`seq` ()),
and finally returns the demand on the input tree g that was observed
by forcing the instrumented copy of g.
I came up with this idea a while back on Stack Overflow, as an answer to
Alpha-beta pruning with recursion schemes.
My understanding of alpha-beta pruning changed overnight from a somewhat tricky algorithm
to a completely trivial solution.
Getting to reuse minimax is not only a satisfying achievement in refactoring,
it enables a neat proof of correctness by parametricity (via free theorems).
The role of laziness should also be underscored.
If you try to do the same thing in a call-by-value language,
the implementation of “generalized minimax” must
explicitly delay computations, obscuring the point:
Alpha-beta pruning is just minimax in a lattice of clamping functions.
When training large scale LLMs, there is a large assortment of parallelization strategies which you can employ to scale your training runs to work on more GPUs. There are already a number of good resources for understanding how to parallelize your models: I particularly recommend How To Scale Your Model and The Ultra-Scale Playbook. The purpose of this blog post is to discuss parallelization strategies in a more schematic way by focusing only on how they affect your device mesh. The device mesh is an abstraction used by both PyTorch and JAX that takes your GPUs (however many of them you've got in your cluster!) and organizes them into an N-D tensor that expresses how the devices communicate with each other. When we parallelize computation, we shard a tensor along one dimension of the mesh, and then do collectives along that dimension when there are nontrivial dependencies between shards. Being able to explain why a device mesh is set up the way it is for a collection of parallelization strategies is a good check for seeing if you understand how the parallelization strategies work in the first place! (Credit: This post was influenced by Visualizing 6D Mesh Parallelism.)
Prologue: Why device mesh? Before we jump into the zoo, why do we have multi-dimensional meshes in the first place? One intuition is that the dimensions of the device mesh are a reflection of the physical constraints of networking between GPUs (there's a reason why all of the scaling books talk extensively about how the networking for GPUs works; you can't reason about what parallelization strategy you should use without knowing about this!) Let's imagine you have 1024 NVIDIA GPUs. You don't want to treat these 1024 GPUs as an undifferentiated blob of GPUs. Physically, these GPUs are grouped into nodes of eight, which have much faster NVLink connections compared to cross-node communication, which is done on a slower Infiniband connection. Intuitively, you will want to do something different depending on whether you're doing intra-node communication or inter-node communication.
The device mesh imposes structure on this collection of GPUs. A mesh is typically specified as a tensor size (e.g., (128, 8)) as well as string axis names ala named tensor (e.g., ["dp", "tp"]), and is simply an N-D tensor over a range of GPU indices (typically [0, 1, 2, 3, ...] for GPUs, and a mostly ascending but occasionally permuted sequence for TPUs). We typically think of 2D and 3D tensors as grids and cubes, but I find it is more helpful (especially in higher dimensions) to think of the device mesh as imposing some self-similar (fractal) structure on the GPUs. In the simplest 2D mesh that accounts for intra versus inter node communication, GPUs are first organized into nodes on the inner-most dimension, and then the nodes are collected together in the outer-most dimension to form the cluster. (The self-similar nature of the nodes is important because it tells us how communication occurs across the cluster: to communicate over the outer-most mesh dimension, all the GPU 0s on each node talk to each other, all the GPU 1s, etc.) This is only the very simplest mesh we can create, however; with more complicated parallelization strategies we may impose extra levels of structure, e.g., we may organize nodes into pods of two and four, or we might further divide the eight GPUs of a single node. In other words, the mesh tells us about which GPUs communicate to which other GPUs. This is important to know, because when I want to parallelize our model, I am making choices about how to shard tensors across my GPUs. The mesh tells me which GPUs have the other shards of my tensor; in other words, they are who I have to communicate with when I am doing a computation that requires information about the full tensor and cannot be done with the local shards only.
In the zoo, when we talk about a parallelism strategy, we will talk about how it typically relates to other parallelization strategies in the model, and the device mesh will tell us if it is orthogonal to other parallelisms (a new dimension), multiplexed with another strategy (a reused dimension) or perhaps a completely different hierarchy of communication (multiple meshes in the same model that don't factor into the other).
Without further ado, here is the zoo!
Data parallelism (DP). Data parallelism predates the concept of device meshes, since you don't actually need any nontrivial mesh structure to do data parallelism: if you are only doing data parallel, you just shard your input on the batch axis for however many devices you have. This sharding propagates through forwards and backwards until you allreduce to compute the final global gradient for a parameter. If you did make a 1D device mesh (this is useful to think about, because most higher dimensional parallelisms will include some form of data parallelism), you'd probably name your mesh ["dp"], ["ddp"] or perhaps ["batch"].
Let's talk briefly about how people tend to name device mesh axes. In the PyTorch world, it's most common to name the axis after the parallelism that it is responsible for, so either "dp" or "ddp" (you really shouldn't call it ddp, but the DataParallel taboo in PyTorch is very real!) The batch name is common in JAX, and is very natural there because when you annotate the sharding of your input, you need to say for each tensor dimension what mesh dim it is sharded over. So when you shard the batch dimension over the batch mesh dim, it looks just like you're labeling the batch dimension of your tensor as batch, e.g., P("batch", None). (This situation doesn't happen in PyTorch because shardings of a tensor are specified per device mesh dim, but that's a story for another day!)
Fully-sharded data parallel (FSDP). This is best understood as an augmentation over DP where weights are also sharded over all GPUs and you just all-gather weights before performing operations (and reduce-scatter in backwards). Because this all-gather is also among all devices, you don't need another axis in your mesh, and your mesh might also be called ["dp"] in this case, even though you're actually doing FSDP. Occasionally, you'll see people name their mesh ["fsdp"] in this case.
Hybrid sharded data parallel (HSDP). HSDP is an extension of FSDP where you shard weights (FSDP) up to the point where you can't actually do a giant all-gather/reduce-scatter over every GPU, and then replicate these shards to cover the rest of your cluster (DP). It's also amenable to fault tolerance techniques that make the modeling assumption that it's OK to lose samples of your batch if a replica fails (you won't model this with device mesh though!). This is probably the first time you will encounter a 2D device mesh (indeed, the DeviceMesh tutorial in PyTorch specifically uses hybrid sharding as its motivating example), since HSDP doesn't require any extra model changes on top of FSDP. There are a few common ways to name the mesh axes for HSDP. One way to think about it is that it is FSDP on the inner dimension and DP on the outer dimension, in which case you would say ["dp", "fsdp"]. Another way is to think about what happens to parameters at the various layers of the mesh: the inner dimension shards, while the outer dimension replicates, so you would say ["replicate", "shard"] or perhaps ["dp_replicate", "dp_shard"] to make it clear that you are still doing data parallelism across both of these device mesh dims (in particular, when you split your batches, you split on both the dp_replicate and dp_shard dims--although, to get the final gradients, you can do the reduction hierarchically by first doing a reduce-scatter on "dp_shard" and then doing an allreduce on "dp_replicate").
Tensor parallelism (TP). Depending on who you ask, tensor parallelism is either about letting you reduce your effective batch size for training or moving you towards reducing the memory usage of activations in your model. In the "reduce effective batch size" framing, the idea behind TP is that you can only scale up DP until your cluster is as large as your batch size. From a modeling perspective, it can be undesirable to have a batch size that is too large, so you can't just keep increasing your batch size to get more parallelism. Instead, TP allows us to get some extra scaling by sharding over the feature dimension of our matrix multiplies [1] (you can shard over either the columns or the rows of your weight matrix, so we will frequently specify if a TP Linear is column-wise or row-wise; in attention, column-wise linear effectively parallelizes the attention computation over attention heads). The communication needed to do TP is fairly exposed (unless you're doing async tensor parallel), so you typically want to keep the communications for it within a single node. This leads to this classic 2D device mesh for DP+TP: ["dp", "tp"] (or, if you're a JAXer, you might write ["batch", "model"], where model is used to indicate the inner feature dimension of the model weights being parallelized over.) When someone says 2D parallelism, they're usually referring to this combo of parallelisms (although I do not recommend using this term--as you can see, it is obviously ambiguous!) Note that tp is the inner mesh dimension, since it benefits the most from the high bandwidth network between GPUs on a single node.
You don't have to stop with DP+TP, however. If you're using FSDP with tensor parallelism (remember, "dp" can mean FSDP!), intra-node TP doesn't improve the amount of inter-node FSDP communication you have to do: however much TP you do, within one TP node you only have one slice of the model and have to talk to everyone else to get their slices. You could solve this by expanding TP to also cross nodes, but in practice mixed intra/inter-node collectives are a lot slower than pure inter-node collectives. This limits the scaling you can get from TP, and so if you're still hitting limits on FSDP, it can still be useful to apply HSDP to avoid running collectives that are too large. In that case, you'd end up with a mesh like ["dp_replicate", "dp_shard", "tp"].
Sequence parallelism (SP). For this section, we specifically take the definition of sequence parallelism from the Ultrascale Playbook (as distinguished from context parallelism). Although we said that TP is the first step towards reducing the memory usage of activations [2], if you literally implement DP+TP based on my descriptions above, you will still end up with more memory spent on activations than you want, because there are still parts of the model around the FFN, like the LayerNorm, that need the full hidden dimension to compute mean and variance [3]. To reduce the memory usage in these segments, you need to shard on something else. So typically what you will see is that the model will alternate between TP (hidden dimension is sharded) and SP (sequence dimension is sharded). Consequently, if you look at the device mesh for a model using DP+TP+SP, it will typically still look like ["dp", "tp"], and instead the tp dimension is multiplexed to be used both for TP and SP. Because TP and SP never occur at the same time, you don't need a separate dimension for them.
Ulysses sequence parallelism. Ulysses sequence parallelism from DeepSpeed Ulysses is another sequence parallelism strategy that is implemented by verl (because verl is forked so often, it shows up quite prominently if you are looking for examples of init_device_mesh on GitHub code search). It aims to alleviate memory pressure from extremely long sequences, so sequences are sharded on input, and only when attention needs to be computed is an alltoall issued to re-shard on the attention heads rather than the sequence (doing another alltoall to restore the sequence sharding after the attention is done). Importantly, this means it competes with TP for sharding on the attention heads, which is why you also see people use it to replace TP in MoE models, since it has much less communication than TP (at the cost of having to replicate the attention weights). In verl, you will just see a device mesh ["dp", "sp"] when you are using their FSDP backend (which is what supports Ulysses).
Context parallelism (CP). Context parallelism is another form of "sequence" parallelism. Like Ulysses sequence parallelism, sequences are sharded on input; the difference, however, is instead of using an alltoall to re-shard on attention heads, you just do a (distributed) attention on the entire context. You can do this the easy way by just using allgather to get the full context (as was done in llama4) or you can use a fancy kernel like ring attention, which carefully overlaps communication and computation when performing attention. A popular implementation of context parallelism lives in Megatron, which doesn't directly use PyTorch's native DeviceMesh abstraction but has an analogous HyperCommGrid. The mesh we see here will be something like ["dp", "cp"] or more commonly ["dp", "cp", "tp"]. Notice that we can have a dedicated mesh dim for CP: CP operates very similarly to SP outside of the attention calls (as it is just plain data parallelism when there is no cross-token dependency), but because it never shards on attention heads, it doesn't compete with TP and can be used completely orthogonally to TP (TP shards hidden, CP shards sequence).
CP has a pretty interesting interaction with FSDP. Both DP and CP shard the input data (on batch and sequence respectively). It's pretty common when you do FSDP to just shard over both "dp" ("dp_shard" in HSDP) and "cp". In torchtitan, we create a flattened mesh dim "dp_shard_cp" specifically for FSDP sharding (a flattened mesh dim is what happens if you take your mesh and "forget" about some of the structure; e.g., if you were to do an all-gather, you just all-gather over all the flattened axes). In the HSDP world, "dp_cp" is still a useful concept because this is the combination of axes you want to all-reduce over to, e.g., compute the global average loss.
Pipeline parallelism (PP). Pipeline parallelism is kind of an ugly duckling and people tend to hate on it because you have to rewrite your models to introduce pipeline stages, and you can't really use things like DTensor with it (unless you do really strange things like how the GSPMD paper "supports" pipeline parallelism--the general consensus is automatic parallelism does not like PP). PP still goes in the device mesh, because it affects how you are organizing your GPUs, but, for example, torchtitan solely uses it to set up PGs for doing the point-to-point communications. I've seen both ["dp", "pp", ...] and ["pp", "dp", ...] for meshes with PP, but the order probably doesn't make too much of a difference as you are likely solidly inter-node at this point. Pipeline parallelism bandwidth use is very low, and latency can be covered up as you can immediately start processing the next batch after triggering an asynchronous send of the previous batch.
Expert parallelism (EP). EP is its own kettle of fish. Expert parallelism only applies over the expert computation of the model, but within this region, we are not sharding parameters as FSDP conventionally sees it: we will commonly have the entire expert's weights on our node. torchtitan's WIP expert parallelism implementation, when it has ALL parallelisms on, would look like ["pp", "dp_replicate", "dp_shard_mod_ep", "dp_shard_in_ep", "cp", "tp"], where dp_shard has been split into two mesh dimensions (DP shard modulo EP, and DP shard in EP). dp_shard_mod_ep is conventionally one, but when it is not it represents further FSDP-style sharding of expert weights inside of the expert region (there's some complication here if you have shared experts alongside your EP-sharded experts). But then dp_shard_in_ep, cp and optionally tp are combined together to give you the expert parallel dimension. It's actually more intuitive to imagine that you have two distinct meshes: ["pp", "dp_replicate", "dp_shard", "cp", "tp"] and ["pp", "dp_shard_mod_ep", "ep", "tp"]. The keen-eyed may also notice that there is no intrinsic reason the tp mesh size inside and outside of the expert parallel region has to match, but this is not easily done if you have to have a single global device mesh for everything. In fact, there is a WIP PR to have two meshes, one for inside the expert region and one for outside: https://github.com/pytorch/torchtitan/pull/1660
Conclusion. The general concept behind mesh parallelism is that you can compose parallelization strategies without too much fuss. Indeed, the use of, e.g., TP to improve scaling is precisely because it lets you cover your device space without having to expand DP beyond the batch size you want to do. However, as you can see from these concrete examples, it's not always quite as simple as just stacking all of the parallelisms together one on top of each other. In the end, all the device mesh is doing is creating PGs behind groups of devices as defined by the mesh, so if you want some weird setup where you're swapping between two device meshes, PyTorch's general philosophy has been to say, have fun!
Thanks to Horace He, Tianyu Liu and Natalia Gimelshein for helping fact check this post. Any remaining errors are mine!
[1]
One more subtlety I want to point out: while we tend to think of TP as sharding the feature dimension of parameters, when we "propagate" this sharding through the network, other intermediate tensors end up getting sharded on the TP dimension as well. In particular, in a transformer block, you will typically have a column-wise linear followed by a row-wise linear, and the intermediate activation will be temporarily sharded on the TP dimension before the row-wise linear runs.
[2]
I am very carefully using "activation memory" here and not total memory, because total memory usage (what you actually care about) is also a function of peak memory usage, which is subject to transient peaks such as when FSDP does an all-gather to collect parameters. In fact, even without SP, TP will improve your peak memory usage, because unlike FSDP, it's not necessary to all-gather the full weight matrix to actually perform the matrix multiply. TP's peak memory usage occurs when it all-gathers activations.
[3]
You will get a little improvement between the column-wise and row-wise linear, since the activations there are sharded. You can turn this into a big improvement by using selective activation checkpointing and forcing recomputation of activations that aren't sharded! (Plain activation checkpointing tends not to work so well because of the all-gather of the activations.)
At Standard Chartered Bank, Haskell is used in a core software library
supporting the entire Markets division – a business line with 3 billion USD
operating income in 2023. Typed functional programming is used across the entire
tech stack, including foundational APIs and CLIs for deal valuation and risk
analysis, server-side components for long-running batches or sub-second RESTful
services, and end-user GUIs. Thousands of users across Markets interact with
software built using functional programming, and over one hundred write
functional code.
invest in the maintenance and future development of the core Haskell toolchain,
access Well-Typed’s team of Haskell experts for private development or technical support, and
fund the Haskell Foundation to sustain key community infrastructure.
You can read more about the toolchain maintenance activities these packages fund
in our regular reports. Many thanks to
Standard Chartered, to the existing Haskell Ecosystem Supporters, and to our
other clients who fund open-source development work, for making this possible.
If your company relies on Haskell, and depends on its core toolchain and vibrant
open-source ecosystem, why not read more about our offer?
In the previous blog post of this series, I talked about CodeQL,
a static analyzer from GitHub that performs semantic search queries on source code to extract structured data.
I described how I wrote my first CodeQL query and how I executed it locally.
In this second blog post, I want to go beyond that.
I will cover aspects that are required for putting custom queries into production. I’ll explain:
how CodeQL sources are organized,
what query metadata is,
how to run CodeQL in GitHub Actions, and
how to visualize results.
While the first two topics are specific to teams that need to write their own queries,
the last two are applicable both to teams that write their own queries
and to teams relying on the default queries shipped with CodeQL (which do capture a vast number of issues already).
I won’t dive deep on any topic,
but rather give an overview of the features you will most likely need to put your own CodeQL queries into production.
I’ll often link to GitHub’s official documentation,
so that you have quick access to the documentation most useful to you.
Finding what you need can be a bit of a challenge,
because CodeQL’s documentation is spread over both https://docs.github.com/en/code-security
and https://codeql.github.com/docs/.
Structure of CodeQL sources
There are four main types of CodeQL file:
*.ql files are query files. A query is an executable request and a query file must contain exactly one query.
I will describe the query syntax below. A query file cannot be imported by other files.
*.qll files are library files. A library file can contain types and predicates, but it cannot contain a query. Library files can be imported.
*.qls files are YAML files describing query suites. They are used to select queries, based on various filters such as a query’s filename, name, or metadata. Query suites are documented in detail in the official documentation.
*.qlpack files are YAML files describing packs. Packs are containers for the three previous kinds of files. A pack can either be a query pack, containing queries to be run; a library pack, containing code to be reused; or a model pack, which is an experimental kind of pack meant to extend existing CodeQL rules. Packs are described in detail here.
When developing custom queries, I need to wrap them in a query pack in order to declare on what parts of the CodeQL standard library my queries depend (here’s an example to show how to depend on the Java standard library).
Queries in *.ql files have the following structure (as explained in more detail in the official documentation):
from /* ... variable declarations ... */
where /* ... logical formula ... */
select /* ... expressions ... */
This can be understood like an SQL query:
First, the from clause declares typed variables that can be referenced in the rest of the query.
Because types define predicates, this clause already constrains the possible instances returned
by the where clause that follows.
The where clause constrains the query to only return the variables that satisfy the logical
formula it contains. It can be omitted, in which case all instances of variables with the type
specified in the from clause are returned.
The select clause limits the query to operate on the variables
declared in the from clause. The select clause can also contain
formatting instructions,
so that the results of the query are more human readable.
To give an example of a query, if I need to write a query to track
tainted data in Java, in a file named App.java, I’ll write this to start somewhere
and will refine the where clause iteratively, based on the query’s result:
from DataFlow::Node node // A node in the syntax tree
where node.getLocation().getFile().toString() = "App" // .java extension is stripped
select node, "node in App"
select clauses must obey the following constraints with respect to the number of columns selected:
A problem query (see below) must select an even number of columns.
The format is supposed to be: select var1, formatting_for_var1, var2, formatting_for_var2, ...
where formatting_for_var* must be an expression returning a string, as described earlier in the select paragraph.
If you omit the formatting, the query is executed, but a warning is issued.
A path-problem query must select four columns, the first three referring to syntax nodes and the fourth
one a string describing the issue. This assumption is required by the CodeQL Query Results view in VSCode
to show the results as paths (using the alerts style in the drop down):
Query metadata
The header of a query defines a set of properties called query metadata:
/**
* @name Code injection
* @description Interpreting unsanitized user input as code allows a malicious user to perform arbitrary
* code execution.
* @kind path-problem
* @problem.severity error
* ...
*/
Query metadata is documented in detail in CodeQL’s official documentation. I don’t want to repeat GitHub’s documentation here,
so I’m focusing on the important information:
@kind can take two values: problem and path-problem. The former is for queries that
flag one specific location, while the latter is for queries that track tainted data flow from a source to a sink.
Severity of issues is defined through two means, depending on whether the query is considered a security-related
one or not 🤷
@problem.severity is used for queries that don’t have @tags security. @problem.severity can be one
of error, warning, or recommendation.
@security-severity is a score between 0.0 and 10.0, for queries with @tags security.
Metadata is most useful for filtering queries in qls files.
This is used extensively in queries shipped with CodeQL itself, as visible for example in
security-experimental-selectors.yml1. To give an idea of the filtering capability, here is an excerpt of this file that declares filtering criteria:
- include:
    kind:
      - problem
      - path-problem
    precision:
      - high
      - very-high
    tags contain:
      - security
- exclude:
    query path:
      - Metrics/Summaries/FrameworkCoverage.ql
      - /Diagnostics/Internal/.*/
- exclude:
    tags contain:
      - modeleditor
      - modelgenerator
To smooth the introduction of CodeQL (and security tools in general), I recommend starting small and only
reporting the most critical alerts at first (in other words: filtering aggressively).
This helps to convince teammates that CodeQL reports useful insights, and
it doesn’t make the task of fixing security vulnerabilities look insurmountable.
Once the most critical alerts are fixed, I advise loosening the filtering,
so that pressing — but not critical — issues can be addressed.
Running CodeQL in GitHub Actions
The following GitHub Actions are required to run CodeQL:
github/codeql-action/init installs CodeQL and creates the database. It can be customized
to specify the list of programming languages to analyze, as well as many other options.
Customization is done in the YAML workflow file, or via an external YAML configuration file, as explained in
the customize advanced setup documentation.
github/codeql-action/autobuild is required if you are analyzing a compiled language
(such as C# or Java, as opposed to Python). This action can either work out of the box, guessing
what to do based on the presence of the build files that are idiomatic in your programming
language’s ecosystem. I must admit this is not very principled — you need to look up the
corresponding documentation
to see how CodeQL is going to behave for your programming language and platform.
If the automatic behavior doesn’t work out of the box,
you can manually specify the build commands to perform.
github/codeql-action/analyze runs the queries. Its results are used
to populate the Security tab, as shown below.
Since the actions work out of the box on GitHub, replicating them in another CI/CD system
is non-trivial: you will have to build your own solution.
Visualizing results
Once CodeQL executes successfully in CI, GitHub’s UI picks up its results automatically
and shows them in the Security tab:
You may wonder why you cannot see the Security tab on
the repository used to create this post’s screenshots yourself.
This is because, as GitHub’s documentation explains, security alerts are only visible to people
with the necessary rights to the repository. The required rights depend on whether the repository is owned
by a user or an organisation. In any case, security alerts cannot be made visible to people
who do not have at least some rights to the relevant repository.
Clicking on View alerts brings up the main CodeQL view:
As visible in the screenshot, this view allows you to filter the alerts in multiple ways,
as well as to select the branch from which the alerts are shown.
Conclusion
In this post, I covered multiple aspects that you need to know to put your custom queries
in production. I described how CodeQL codebases are organized and the constraints that individual queries
must obey. I described queries’ metadata and how metadata is used.
I concluded by showing how to run queries in CI and how everyone in a team can visualize the
alerts found. Equipped with this knowledge, I think you are ready to experiment with CodeQL
and later pitch it to your stakeholders, as part of your security posture 😉
Amp is a coding agent which I’ve been working on the last six months at Sourcegraph. And in the last couple of weeks, I’ve been building on a testing rig inspired by Deterministic Simulation Testing (DST) to test the most crucial parts of the system. DST is closely related to fuzzing and property-based testing.
The goal is to get one of Amp’s most central pieces, the ThreadWorker, under heavy scrutiny. We’ve had a few perplexing bug reports, where users experienced corrupted threads, LLM API errors from invalid tool calls, and more vague issues like “it seems like it’s spinning forever.” Reproducing such problems manually is usually somewhere between impractical and impossible. I want to reproduce them deterministically, and in a way where we can debug and fix them. And beyond the known ones, I’d like to find the currently unknown ones before our users hit them.
Generative testing to the rescue!
Approach: Lightweight DST in TypeScript
Amp is written in TypeScript, which is an ecosystem currently not drowning in fuzzing tools. My starting point was using jsfuzz, which I hadn’t used before but it looked promising. However, I had a bunch of problems getting it to run together with our Bun stack. One could use fast-check, but as far as I can tell, the model-based testing they support doesn’t fit with our needs. We don’t have a model of the system, and we need to generate values in multiple places as the test runs. So, I decided to build something from scratch for our purposes.
I borrowed an idea I got from matklad last year: instead of passing a seeded PRNG to generate test input, we generate an entropy Buffer with random contents, and track our position in that array with a cursor. Drawing a random byte consumes the byte at the current position and increments the cursor. We don’t know up-front how many bytes we need for a given fuzzer, so the entropy buffer grows dynamically when needed, appending more random bytes. This, together with a bunch of methods for drawing different types of values, is packaged up in an Entropy class:
class Entropy {
  random(count: number): Uint8Array { ... }
  randomRange(minIncl: number, maxExcl: number): number { ... }
  // ... lots of other stuff
}
A fuzzer is an ES module written in TypeScript, exporting a single function:
export async function fuzz(entropy: Entropy) {
  // test logic here
}
Any exception thrown by fuzz is considered a test failure. We use the node:assert module for our test assertions, but it could be anything.
Another program, the fuzz runner, imports a built fuzzer module and runs as many tests as it can before a given timeout. If it finds a failure, it prints out the command to reproduce that failure:
Why use this Entropy rather than a seed? More about that at the end of the post!
The ThreadWorker Fuzzer
In the fuzzer for our ThreadWorker, we stub out all IO and other nondeterministic components, and we install fake timers to control when and how asynchronous code is run. In effect, we have determinism and simulation to run tests in, so I guess it qualifies as DST.
The test simulates a sequence of user actions (send message, cancel, resume, and wait). Similarly, it simulates responses from tool calls (like the agent reading a file) and from inference backends (like the Anthropic API). We inject faults and delays in both tool calls and inference requests to test our error handling and possible race conditions.
After all user actions have been executed, we make sure to approve any pending tool calls that require confirmation. Next, we tell the fake timer to run all outstanding timers until the queue is empty; like fast-forwarding until there’s nothing left to do. Finally, we check that the thread is idle, i.e. that there’s no ongoing inference and that all tool calls have terminated. This is a liveness property.
After the liveness property, we check a bunch of safety properties:
all messages posted by the user are present in the thread
all message pairs involving tools calls are valid according to Anthropic’s API specification
all tool calls have settled in expected terminal states
Some of these are targeted at specific known bugs, while some are more general but have found bugs we did not expect.
Given I’ve been working on this for about a week in total, I’m very happy with the outcome. Here are some issues the fuzzer found:
Corrupted thread due to eagerly starting tool calls during streaming
While streaming tool use blocks from the Anthropic API, we invoked tools eagerly, while not all of them were finished streaming. This, in combination with how state was managed, led to tool results being incorrectly split across messages. Anthropic’s API would reject any further requests, and the thread would essentially be corrupted. This was reported by a user and was the first issue we found and fixed using the fuzzer.
Another variation, which the fuzzer also found, was a race condition where user messages interfered at a particular timing with ongoing tool calls, splitting them up incorrectly.
Subagent tool calls not terminating when subthread tool calls were rejected
Due to a recent change in behavior, where we don’t run inference automatically after tool call rejection, subagents could end up never signalling their termination, which led to the main thread never reaching an idle state.
I confirmed this in both VSCode and the CLI: infinite spinners, indeed.
Tool calls blocked on user not getting cancelled after user message
Due to how some tool calls require confirmation, like reading files outside the workspace or running some shell commands, in combination with how we represent and track termination of tools, there's a possibility for such tools to be resumed and then, after an immediate user cancellation, not be properly cancelled. This leads to incorrect mutations of the thread data.
I’ve not yet found the cause of this issue, but it’s perfectly reproducible, so that’s a start.
Furthermore, we were able to verify an older bug fix, where Anthropic’s API would send an invalid message with an empty tool use block array. That used to get the agent into an infinite loop. With the fuzzer, we verified and improved the old fix which had missed another case.
How about number of test runs and timeouts? Most of these bugs were found almost immediately, i.e. within a second. The last one in the list above takes longer, around a minute normally. We run a short version of each fuzzer in every CI build, and longer runs on a nightly basis. This is up for a lot of tuning and experimentation.
Why the Entropy Buffer?
So why the entropy buffer instead of a seeded PRNG? The idea is to use that buffer to mutate the test input, instead of just bombarding with random data every time. If we can track which parts of the entropy were used where, we can make those slices "smaller" or "bigger." We can use something like gradient descent or simulated annealing to optimize inputs, maximizing some objective function set by the fuzzer. Finally, we might be able to minimize inputs by manipulating the entropy.
In case the JavaScript community gets some powerful fuzzing framework like AFL+, that could also just be plugged in. Who knows, but I find this an interesting approach that’s worth exploring. I believe the entropy buffer approach is also similar to how Hypothesis works under the hood. Someone please correct me if that’s not the case.
Anyhow, that’s today’s report from the generative testing mines. Cheers!
In software development today, security can no longer be treated as an afterthought. The earlier vulnerabilities are identified, the cheaper and safer they are to fix. This is why Static Application Security Testing (SAST) has become a cornerstone of modern DevSecOps practices.
Among the available tools, GitHub’s CodeQL has emerged as a standout. It doesn’t just look for known patterns—it analyzes code as if it were a database, uncovering subtle and complex vulnerabilities that other tools often miss without increasing the false positive findings. But for all its power, integrating CodeQL into large organizations, especially those with monorepos and heterogeneous CI/CD platforms, can be a real challenge.
To solve this, we built two complementary solutions: the codeql-wrapper and the codeql-wrapper-pipelines. Together, they make CodeQL adoption significantly easier for development teams and security organizations alike.
Highlights of codeql-wrapper:
While GitHub’s CodeQL works well for single-project repositories, monorepos present unique challenges that break the standard workflow.
In a monorepo, teams need to:
Analyze multiple projects, each with different languages, build systems, and configurations
Run scans only on changed code during PR reviews for efficiency
Maintain consistent security policies across diverse projects
The problems with standard CodeQL Actions in monorepos:
No dynamic project discovery – You can’t generate a job matrix from runtime data, forcing you to hard-code every project
Inflexible configuration – Each project may need different CodeQL queries, build commands, or language-specific settings
# This doesn't work - matrix can't be built from runtime discovery
jobs:
  analyze:
    strategy:
      matrix:
        project: ${{ steps.discover.outputs.projects }} # Not supported, we would need a `step` before to populate this variable
    steps:
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ project.programming_language }}
      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v3
        with:
          category: "/language:${{ project.programming_language }}"
With the wrapper, all of this boils down to the one-liner codeql-wrapper analyze ./monorepo --monorepo, shown in context in the Getting Started section below.
In the rest of this post, we detail the problems that codeql-wrapper solves
and how it does so.
Understanding CodeQL: Why It Matters
At its core, CodeQL is a semantic code analysis engine. It transforms source code into a relational database, allowing queries to detect vulnerable patterns, logic flaws, and unsafe data flows. This capability is powerful because it lets organizations:
Automate security checks – CodeQL integrates directly into CI/CD pipelines, scanning every pull request or commit.
Catch complex vulnerabilities – By modeling data and control flows, CodeQL goes far beyond simple regex-based scanners. For the academically inclined reader, this capability is called inter-procedural analysis. Its upside is that it can detect vulnerabilities spanning multiple functions and source code files. Its downside is that it can be slow, and so one needs to be careful how CodeQL models and queries are written, so that performance stays acceptable.
Run variant analysis – Once you identify a bug, you can find every similar instance across a codebase in minutes. This is particularly important to scale your findings to large organizations.
Customize queries – Security teams can author their own CodeQL queries tailored to the organization’s specific risk profile. We explained how to write custom queries in our first post in this CodeQL series.
Cover multiple languages – CodeQL supports a wide set of languages, fitting polyglot environments.
Integrate with GitHub Advanced Security – Findings are uploaded as SARIF reports and appear directly in GitHub’s security dashboards and pull requests.
For companies under compliance requirements (PCI DSS, FedRAMP, HIPAA) or simply those scaling rapidly, CodeQL is a way to shift security left—catching problems earlier, reducing remediation costs, and strengthening overall security posture.
The CodeQL Wrapper: Make CodeQL Easy on Every Codebase
Despite these benefits, running CodeQL at scale isn’t always straightforward. Complex repositories, custom build processes, and multiple CI/CD environments introduce friction.
This is where the CodeQL Wrapper we developed comes in. It’s a universal Python CLI that abstracts away much of the setup pain and provides a consistent way to run CodeQL across projects.
Codeql Wrapper streamlines the usage of CodeQL in many different ways:
CI/CD agnostic: Works with GitHub Actions, Jenkins, Azure Pipelines, CircleCI, Harness, and more.
SARIF uploads: Built-in support for sending results directly to GitHub Advanced Security.
Auto-installation: Fetches and installs the correct version of the CodeQL CLI automatically.
Flexible .codeql.json config: Lets teams define build modes, queries, and paths centrally.
In particular, Codeql Wrapper is designed to gracefully handle monorepos, addressing both
their polyglot nature (monorepos tend to feature multiple programming languages)
and the need for performance (because monorepos tend to be big, analysis on the entire repository doesn’t scale):
Smart language detection: The idiomatic GitHub Action requires you to specify the programming language to analyse. This doesn't work in monorepos, which are typically polyglot (i.e. they mix multiple programming languages). CodeQL Wrapper lifts this burden: you don't need to configure target languages manually.
Performance tuned: Supports parallel processing to keep scans efficient in large repositories.
Typically, language detection gets cumbersome in monorepos:
Constant updates: Every new project requires workflow modifications
Language mixing: Projects often contain multiple languages (e.g., Java backend with JavaScript tests)
Human error: Forgetting to add new languages leads to incomplete security coverage
Maintenance overhead: Teams spend time updating CI/CD configurations instead of writing code
Project-Level Parallelization
The primary parallelization strategy operates at the project level in monorepos. Multiple projects are analyzed simultaneously using Python’s ProcessPoolExecutor.
Language-Level Parallelization
Within each project, multiple languages are analyzed concurrently. If a project contains Python, JavaScript, and Java code, all three languages are processed in parallel rather than sequentially.
Adaptive Worker Allocation
The wrapper automatically determines optimal concurrency based on available CPU cores and system memory.
You can also override this manually with the --max-workers parameter for fine-tuning.
Performance Benefits
This multi-level approach delivers significant speedups:
Monorepo analysis: 5+ projects analyzed simultaneously instead of sequentially
Mixed-language projects: Python + JavaScript + Java processed in parallel
Resource efficiency: Maximizes CPU utilization while respecting memory constraints
Scalable: Performance scales with available hardware resources
For large monorepos with dozens of projects, this can reduce analysis time from hours to minutes.
Getting Started
Install from PyPI:
pip install codeql-wrapper
This will install the tool in your environment. Once installed, you’ll have access to two commands:
install – Installs CodeQL CLI and query packs (you can pin a specific version if needed).
analyze – Runs CodeQL analysis on your project(s), with automatic project detection and parallelized scans.
If you only need the CodeQL CLI (and don’t want to run a scan yet), just run:
codeql-wrapper install
This will pull the latest version of the CLI and query packs. If you need a different version you can use the --version argument.
The real magic happens with the analyze command. It:
Checks if CodeQL is installed (and installs it if it’s missing).
Detects your project structure and languages.
Creates CodeQL databases.
Runs analysis in parallel, making the most of your CPU.
Here are some ways to run it:
Run a basic analysis:
# Check if CodeQL is installed, if not it will install it
# Create a CodeQL database for each language detected
# Analyze the databases using the default queries and build-mode: none
# You can use `--upload-sarif` to upload the results to GitHub
codeql-wrapper analyze ./repo
For monorepos:
# Like above, and detect projects (each folder inside the monorepo folder will be treated as a project)
codeql-wrapper analyze ./monorepo --monorepo
You can also customize the behavior using a .codeql.json file as follows:
As seen in the .codeql.json snippet above, you can also specify the query to execute
in the field queries of a project.
Benefits of This Approach
This approach gives you the flexibility to:
Use different queries per project in a monorepo
Maintain consistent query policies across different CI/CD platforms
Override queries without modifying pipeline files
Changed Files Detection Algorithm
The CodeQL wrapper implements a sophisticated Git-based differential analysis algorithm to detect which files have changed, enabling efficient analysis of only modified code in pull requests and monorepo scenarios.
Algorithm Overview
The changed files detection works through these key steps:
Git Reference Resolution: Determines the base commit for comparison using CI/CD environment variables or manual specification
Repository Fetching: Ensures the latest commit data is available using git fetch
Diff Calculation: Computes the differences between the base and current commits
Project Filtering: Maps changed files to specific projects in monorepo configurations
This unified approach eliminates the need for platform-specific implementations while maintaining robust change detection across all major CI/CD systems.
Files Detection Benefits
This algorithm provides significant performance improvements:
Selective Analysis: Only projects containing changed files are analyzed
Of course, simplifying local execution is only half the battle. Enterprises need repeatable, automated CI/CD integrations. That’s why Modus Create also released the codeql-wrapper-pipelines repository.
This companion project provides ready-to-use templates for popular CI/CD systems, demonstrating exactly how to embed the wrapper in automated workflows.
Why this matters:
CodeQL’s standard implementation is tightly coupled to GitHub Actions. The official github/codeql-action assumes you’re running on GitHub-hosted runners with specific environment variables and GitHub APIs available. Adapting CodeQL to other CI/CD platforms typically requires:
Manually installing the CodeQL CLI and managing versions
Reimplementing authentication and SARIF upload logic
Handling language detection and build automation from scratch
Writing platform-specific scripts for each CI/CD system
By decoupling CodeQL from GitHub Actions, our wrapper enables true CI/CD portability.
Supported platforms include:
GitHub Actions – first-class support with PR integration
Azure Pipelines – tested YAML templates for enterprise environments
CircleCI – workflows with CodeQL scanning built in
Harness – templates for modern CI/CD pipelines
Each platform template handles unique authentication, artifact management, and reporting requirements while maintaining a consistent interface. This means security teams can enforce the same CodeQL policies whether code is built in GitHub, Azure DevOps, or any other platform.
By providing these blueprints, Modus helps teams move from “we should scan with CodeQL” to “we are scanning every PR with CodeQL” in a matter of hours—not weeks.
Conclusion
It’s our experience working with global enterprises adopting GitHub - and GitHub Advanced Security - that made us create
codeql-wrapper and codeql-wrapper-pipelines.
These tools are the result of our field experience working with monorepos:
They encapsulate lessons learned from real-world enterprise rollouts.
They reduce friction for developers and security engineers alike.
They standardize how CodeQL is run across organizations, improving consistency and reliability.
These tools are accelerators for companies that want to strengthen their DevSecOps posture without burdening engineering teams. Try them out,
and reach out to us on the GitHub repositories if you need support!
In the last few articles, we’ve spent some time doing manual algorithms with binary trees. But most often, you won’t have to work with binary trees yourself, as they are built into the operations of other data structures.
Today, we’ll solve a problem that relies on using the “set” data structure as well as the “heap” data structure. To learn more details about using these structures in Haskell, sign up for our Solve.hs course, where you’ll spend an entire module on data structures!
The Problem
Today’s problem is Find k pairs with smallest sums. Our problem input consists of two sorted arrays of integers as well as a number k. Our job is to return the k pairs of numbers that have the smallest sum, where a “pair” consists of one number from the first array, and one number from the second array.
Observe that we are returning the numbers themselves in pairs, rather than the sums or the indices of the numbers.
While the (implicit) indices of each pair must be unique, the numbers do not need to be unique. For example, if we have the lists [1,1] and [5, 7], and k is 2, we should return [(1,5), (1,5)], where both of the 1’s from the first list are paired with the 5 from the second list.
The Algorithm
At the center of our algorithm is a min-heap. We want to store elements that contain pairs of numbers. But we want those elements to be ordered by their sum. We also want each element to contain the corresponding index of each number within its array.
We’ll start by making a pair of the first element from each array, and inserting that into our heap, because this pair must be the smallest. Then we’ll do a loop where we extract the minimum from the heap, and then try adding its “neighbors” into the heap. A “neighbor” of the index pair (i1, i2) comes from incrementing one of the two indices, so either (i1 + 1, i2) or (i1, i2 + 1). We continue until we have extracted k elements (or our heap is empty).
Each time we insert a pair of numbers into the heap, we’ll insert the pair of indices into a set. This will allow us to avoid double counting any pairs.
So using the first example in the section above, here are the first few steps of our process. Inside the heap is a 3-tuple with the sum, the indices (a 2-tuple), and the values (another 2-tuple).
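Using the lists [1,1] and [5,7] with k = 2 from above, we start with the heap containing just (6, (0,0), (1,5)) and the visited set containing (0,0). We extract (6, (0,0), (1,5)) and add (1,5) to our output. Its neighbors are the index pairs (1,0), with sum 1 + 5 = 6, and (0,1), with sum 1 + 7 = 8, so we insert (6, (1,0), (1,5)) and (8, (0,1), (1,7)) into the heap and add both index pairs to the visited set. The next extraction yields (6, (1,0), (1,5)), giving us the second (1,5).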
And so we would continue this process until we gathered k outputs.
Haskell Solution
The core structure of this problem isn’t hard. We define some initial terms, such as our data structures and outputs, and then we have a single loop. We’ll express this single loop as a recursive function. It will take the number of remaining values, the visited set, the heap, and the accumulated output, and ultimately return the output. Throughout this problem we’ll use the type alias I2 to refer to a tuple of two integers.
import qualified Data.Heap as H
import qualified Data.Set as S
import qualified Data.Vector as V -- needed for the V.Vector type below
type I2 = (Int, Int)
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = ...
where
n1 = V.length arr1
n2 = V.length arr2
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f remaining visited heap acc = ...
Let’s fill in our loop. We can start with the edge cases. If the k is 0, or if our heap is empty, we should return the results list (in reverse).
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = ...
where
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f 0 _ _ acc = reverse acc
f remaining visited heap acc = case H.view heap of
Nothing -> reverse acc
Just ((_, (i1, i2), (v1, v2)), restHeap) -> ...
Our primary case now results from extracting the min element of the heap. We don’t actually need its sum. We just put that in the first position of the tuple so that it is the primary sorting value for the heap. Let’s define the next possible coordinate pairs from adding to i1 and i2, as well as the new sums we get from using those indices:
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = ...
where
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f 0 _ _ acc = reverse acc
f remaining visited heap acc = case H.view heap of
Nothing -> reverse acc
Just ((_, (i1, i2), (v1, v2)), restHeap) ->
let c1 = (i1 + 1, i2)
c2 = (i1, i2 + 1)
inc1 = arr1 V.! (i1 + 1) + v2
inc2 = v1 + arr2 V.! (i2 + 1)
in ...
Now we need to try adding these values to the remaining heap. If the index is too large, or if we’ve already visited the coordinate, we don’t add the new value, returning the old heap. Otherwise we insert it into our heap. For the second value, we just use heap1 (from trying to add the first value) as the baseline.
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = ...
where
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f 0 _ _ acc = reverse acc
f remaining visited heap acc = case H.view heap of
Nothing -> reverse acc
Just ((_, (i1, i2), (v1, v2)), restHeap) ->
let c1 = (i1 + 1, i2)
c2 = (i1, i2 + 1)
inc1 = arr1 V.! (i1 + 1) + v2
inc2 = v1 + arr2 V.! (i2 + 1)
heap1 = if i1 + 1 < n1 && S.notMember c1 visited
then H.insert (inc1, c1, (arr1 V.! (i1 + 1), v2)) restHeap else restHeap
heap2 = if i2 + 1 < n2 && S.notMember c2 visited
then H.insert (inc2, c2, (v1, arr2 V.! (i2 + 1))) heap1 else heap1
in ...
Now we complete our recursive loop function by adding these new indices to the visited set and making a recursive call. We decrement the remaining number and add the extracted values to our accumulated list.
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = ...
where
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f 0 _ _ acc = reverse acc
f remaining visited heap acc = case H.view heap of
Nothing -> reverse acc
Just ((_, (i1, i2), (v1, v2)), restHeap) ->
let c1 = (i1 + 1, i2)
c2 = (i1, i2 + 1)
inc1 = arr1 V.! (i1 + 1) + v2
inc2 = v1 + arr2 V.! (i2 + 1)
heap1 = if i1 + 1 < n1 && S.notMember c1 visited
then H.insert (inc1, c1, (arr1 V.! (i1 + 1), v2)) restHeap else restHeap
heap2 = if i2 + 1 < n2 && S.notMember c2 visited
then H.insert (inc2, c2, (v1, arr2 V.! (i2 + 1))) heap1 else heap1
visited' = foldr S.insert visited ([c1, c2] :: [I2])
in f (remaining - 1) visited' heap2 ((v1,v2) : acc)
To complete the function, we define our initial heap and make the first call to our loop function:
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = f k (S.singleton (0,0)) initialHeap []
where
val1 = arr1 V.! 0
val2 = arr2 V.! 0
initialHeap = H.singleton (val1 + val2, (0,0), (val1, val2))
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f = ...
Now we’re done! Here’s our complete solution:
type I2 = (Int, Int)
findKPairs :: V.Vector Int -> V.Vector Int -> Int -> [I2]
findKPairs arr1 arr2 k = f k (S.singleton (0,0)) initialHeap []
where
val1 = arr1 V.! 0
val2 = arr2 V.! 0
initialHeap = H.singleton (val1 + val2, (0,0), (val1, val2))
n1 = V.length arr1
n2 = V.length arr2
f :: Int -> S.Set I2 -> H.MinHeap (Int, I2, I2) -> [I2] -> [I2]
f 0 _ _ acc = reverse acc
f remaining visited heap acc = case H.view heap of
Nothing -> reverse acc
Just ((_, (i1, i2), (v1, v2)), restHeap) ->
let c1 = (i1 + 1, i2)
c2 = (i1, i2 + 1)
inc1 = arr1 V.! (i1 + 1) + v2
inc2 = v1 + arr2 V.! (i2 + 1)
heap1 = if i1 + 1 < n1 && S.notMember c1 visited
then H.insert (inc1, c1, (arr1 V.! (i1 + 1), v2)) restHeap else restHeap
heap2 = if i2 + 1 < n2 && S.notMember c2 visited
then H.insert (inc2, c2, (v1, arr2 V.! (i2 + 1))) heap1 else heap1
visited' = foldr S.insert visited ([c1, c2] :: [I2])
in f (remaining - 1) visited' heap2 ((v1,v2) : acc)
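As a quick sanity check, calling the function on the example from the problem statement (with Data.Vector imported qualified as V) returns both pairs:
findKPairs (V.fromList [1,1]) (V.fromList [5,7]) 2
-- [(1,5),(1,5)]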
Rust Solution
Now, on to our Rust solution. We’ll start by defining our terms. These follow the pattern laid out in our algorithm and the Haskell solution:
pub fn k_smallest_pairs(nums1: Vec<i32>, nums2: Vec<i32>, k: i32) -> Vec<Vec<i32>> {
let mut heap: BinaryHeap<Reverse<(i32, (usize,usize), (i32,i32))>> = BinaryHeap::new();
let val1 = nums1[0];
let val2 = nums2[0];
heap.push(Reverse((val1 + val2, (0,0), (val1, val2))));
let mut visited: HashSet<(usize, usize)> = HashSet::new();
visited.insert((0,0));
let mut results = Vec::new();
let mut remaining = k;
let n1 = nums1.len();
let n2 = nums2.len();
...
return results;
}
The most interesting of these is the heap. We parameterize the type of the BinaryHeap using the same kind of tuple we had in Haskell. But in order to make it a “Min” heap, we have to wrap our values in the Reverse type.
Now let’s define the outline of our loop. We keep going until remaining is 0. We will also break if we can’t pop a value from the heap.
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashSet};

pub fn k_smallest_pairs(nums1: Vec<i32>, nums2: Vec<i32>, k: i32) -> Vec<Vec<i32>> {
let mut heap: BinaryHeap<Reverse<(i32, (usize,usize), (i32,i32))>> = BinaryHeap::new();
let val1 = nums1[0];
let val2 = nums2[0];
heap.push(Reverse((val1 + val2, (0,0), (val1, val2))));
let mut visited: HashSet<(usize, usize)> = HashSet::new();
visited.insert((0,0));
let mut results = Vec::new();
let mut remaining = k;
let n1 = nums1.len();
let n2 = nums2.len();
while remaining > 0 {
if let Some(Reverse((_, (i1, i2), (v1, v2)))) = heap.pop() {
let c1 = (i1 + 1, i2);
let c2 = (i1, i2 + 1);
if i1 + 1 < n1 && !visited.contains(&c1) {
let inc1 = nums1[i1 + 1] + v2;
visited.insert(c1);
heap.push(Reverse((inc1, c1, (nums1[i1 + 1], v2))));
}
if i2 + 1 < n2 && !visited.contains(&c2) {
let inc2 = v1 + nums2[i2 + 1];
visited.insert(c2);
heap.push(Reverse((inc2, c2, (v1, nums2[i2 + 1]))));
}
results.push(vec![v1,v2]);
} else {
break;
}
remaining -= 1;
}
return results;
}
Conclusion
Next time, we’ll start exploring some graph problems, which also rely on data structures like we used here!
I didn’t explain too much in this article about the details of using these various data structures. If you want an in-depth exploration of how data structures work in Haskell, including the “common API” that helps you use almost all of them, you should sign up for our Solve.hs course! Module 2 is completely dedicated to teaching you about data structures, and you’ll get a lot of practice working with these structures in sample problems.
Today’s guest is Jurriaan Hage. Jurriaan is a professor at Heriot-Watt University in Edinburgh who’s worked with and on Haskell for many years. He’s known for the Helium Haskell compiler, specifically designed for teaching, and he has plenty of other projects related to Haskell, including improvements to the type system, the generation of better error messages, and the detection of plagiarism.
A Fast Bytecode VM for Arithmetic: The Virtual Machine
Introduction
AST interpreters are well known to be slow because of how AST nodes are represented in the computer’s memory. The AST nodes contain pointers to other nodes, which may be anywhere in the memory. So while interpreting an AST, the interpreter jumps all over the memory, causing a slowdown. One solution to this is to convert the AST into a more compact and optimized representation known as Bytecode.
Bytecode is a flattened and compact representation of a program, usually manifested as a byte array. Bytecode is essentially an Instruction Set (IS), but custom-made to be executed by a Virtual Machine (VM), instead of a physical machine. Each bytecode instruction is one byte in size (that’s where it gets its name from). A bytecode and its VM are created in synergy so that the execution is as efficient as possible1. Compiling source code to bytecode and executing it in a VM also allows the program to be run on all platforms that the VM supports without the developer caring much about portability concerns. The most popular combo of bytecode and VM is probably the Java bytecode and the Java virtual machine.
VMs can be stack-based or register-based. In a stack-based VM, all values created during the execution of a program are stored only in a stack data structure residing in memory. In a register-based VM, there is also an additional, fixed set of registers that are used to store values in preference to the stack2. Register-based VMs are usually faster, but stack-based VMs are usually simpler to implement. For our purpose, we choose to implement a stack-based VM.
We are going to write a compiler that compiles our expression AST to bytecode. But first, let’s design the bytecode for our stack-based VM.
Let’s figure out the right bytecode for each case. First, we create Opcodes for each bytecode instruction, which are sort of mnemonics for the actual bytecode. Think of them as what Assembly is to Machine Code.
Num
For a number literal, we need to put it directly in the bytecode so that we can use it later during the execution. We also need an opcode to push it on the stack. Let’s call it OPush with an Int16 parameter.
BinOp
Binary operations recursively use Expr for their operands. To evaluate a binary operation, we need its operands to be evaluated before, so we compile them first to bytecode. After that, all we need is an opcode per operator. Let’s call them OAdd, OSub, OMul, and ODiv for Add, Sub, Mul, and Div operators respectively.
Var and Let
Variables and Let expressions are more complex3. In the AST interpreter we chucked the variables in a map, but we cannot do that in a VM. There is no environment map in a VM, and all values must reside in the stack. How do we have variables at all then? Let’s think for a bit.
Each expression, after being evaluated in the VM, must push exactly one value on the stack: its result. Num expressions are a trivial case. When a binary operation is evaluated, first its left operand is evaluated. That pushes one value on the stack. Then its right operand is evaluated, and that pushes another value on the stack. Finally, the operation pops the two values from the top of the stack, does its thing, and pushes the resultant value back on the stack—again one value for the entire BinOp expression.
A Let expression binds a variable’s value to its name, and then the variable can be referred to from the body of the expression. But how can we refer to a variable when the stack contains only values, not names? Let’s imagine that we are in the middle of evaluating a large expression, wherein we encounter a Let expression. First we evaluate its assignment expression, and that pushes a value on the top of the stack. Let’s say that the stack has n values at this point. After this we get to evaluate the body expression. At all times when we are doing that, the value from the assignment stays at the same point in the stack because evaluating sub-expressions, no matter how complicated, only adds new values to the stack, without popping an existing value from before. Therefore, we can use the stack index of the assignment value (n−1) to refer to it from within the body expression. So, we encode Var as an opcode and an integer index into the stack.
We choose to use a Word8 to index the stack, limiting us to a stack depth of 256. We encode the variable references with an opcode OGet, which when executed gets the value from the stack at the given index and pushes it on the stack.
For a Let expression, after we compile its assignment and body expressions, we need to make sure that the exactly-one-value invariant holds. Evaluating the assignment and body pushes two values on the stack, but we can have only one! So we overwrite the assignment value with the body value, and pop the stack to remove the body value. We invent a new opcode OSwapPop to do this, called so because its effect is equivalent to swapping the topmost two values on the stack, and then popping the new top value4.
Putting all the opcodes together, we have the Opcode ADT:
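-- Reconstructed from the byte values used by the compiler and disassembler
-- below; the deriving clause is assumed.
data Opcode
  = OPush Int16 -- 0, followed by a two-byte little-endian operand
  | OSwapPop    -- 1
  | OGet Word8  -- 2, followed by a one-byte stack index
  | OAdd        -- 3
  | OSub        -- 4
  | OMul        -- 5
  | ODiv        -- 6
  deriving (Show, Eq)
For example, with these opcodes, let x = 4 in x + 1 compiles to the sequence OPush 4, OGet 0, OPush 1, OAdd, OSwapPop.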
Notice that we also assigned bytecodes—that is, a unique byte value—to each Opcode above, which are just their ordinals. Now we are ready to write the compiler.
The Compiler
The compiler takes an expression with the bytecode size, and compiles it to a strict ByteString of that size. Recall that in the previous post, we wrote our parser such that the bytecode size for each AST node was calculated while parsing it. This allows us to pre-allocate a bytestring of required size before compiling the AST. We compile to actual bytes here, and don’t use the opcodes.
type Bytecode = BS.ByteString

compile :: SizedExpr -> Result Bytecode
compile = compile' defaultStackSize

compile' :: Int -> SizedExpr -> Result Bytecode
compile' stackSize (expr, bytecodeSize) =
  uncurry (fmap . const) . BSI.unsafeCreateUptoN' bytecodeSize $ \fp -> do
    (bytecodeSize,)
      <$> fmapRight (compileIO bytecodeSize stackSize fp fp expr >>= checkSize fp . TS.fst)
      `catch` (pure . Left)
  where
    checkSize fp ip = do
      let actualBytecodeSize = ip `minusPtr` fp
      unless (actualBytecodeSize == bytecodeSize) $
        throwIO . ErrorCompile $
          "Compiled bytecode size " <> show actualBytecodeSize
            <> " is not same as expected size: " <> show bytecodeSize

compileIO :: Int -> Int -> Ptr Word8 -> Ptr Word8 -> Expr -> IO (Pair (Ptr Word8) Int)
compileIO bytecodeSize stackSize fp ip = go Map.empty 0 ip
  where
    ep = fp `plusPtr` bytecodeSize

    go env !sp !ip = \case
      Num n | sp + 1 <= stackSize -> do
        let !lb = fromIntegral $ n .&. 0xff
            !mb = fromIntegral $ ((fromIntegral n :: Word16) .&. 0xff00) `shiftR` 8
        writeByte ip 0 -- OPush
        writeByte (ip `plusPtr` 1) lb
        writeByte (ip `plusPtr` 2) mb
        pure (ip `plusPtr` 3 :!: sp + 1)
      Num _ -> throwCompileError "Stack overflow"
      BinOp op a b -> do
        (ip' :!: sp') <- go env sp ip a
        (ip'' :!: sp'') <- go env sp' ip' b
        writeByte ip'' $ translateOp op
        pure (ip'' `plusPtr` 1 :!: sp'' - 1)
      Let x assign body -> do
        (ip' :!: sp') <- go env sp ip assign
        (ip'' :!: sp'') <- go (Map.insert x sp env) sp' ip' body
        writeByte ip'' 1 -- OSwapPop
        pure (ip'' `plusPtr` 1 :!: sp'' - 1)
      Var x | sp + 1 <= stackSize -> case Map.lookup x env of
        Nothing -> throwCompileError $ "Unknown variable: " <> show x
        Just varScope
          | varScope < stackSize && varScope < fromIntegral (maxBound @Word8) -> do
              writeByte ip 2 -- OGet
              writeByte (ip `plusPtr` 1) $ fromIntegral varScope
              pure (ip `plusPtr` 2 :!: sp + 1)
        Just _ -> throwCompileError "Stack overflow"
      Var _ -> throwCompileError "Stack overflow"

    writeByte :: Ptr Word8 -> Word8 -> IO ()
    writeByte !ip !val
      | ip < ep = poke ip val
      | otherwise =
          throwCompileError $
            "Instruction index " <> show (ip `minusPtr` fp)
              <> " out of bound " <> show (bytecodeSize - 1)

    translateOp = \case
      Add -> 3 -- OAdd
      Sub -> 4 -- OSub
      Mul -> 5 -- OMul
      Div -> 6 -- ODiv

    throwCompileError = throwIO . ErrorCompile

defaultStackSize :: Int
defaultStackSize = 256
ArithVMLib.hs
We use the unsafeCreateUptoN' function from the Data.ByteString.Internal module that allocates enough memory for the provided bytecode size, and gives us a pointer to the allocated memory. We call this pointer fp for frame pointer. Then we traverse the AST recursively, writing bytes for opcodes and arguments for each case. We use pointer arithmetic and the poke function to write the bytes. Int16 numbers are encoded as two bytes in little endian fashion.
In the recursive traversal function go, we pass and return the current stack pointer sp and instruction pointer ip. We update these correctly for each case5. We also take care of checking that the pointers stay in the right bounds, failing which we throw appropriate errors.
We also pass an env parameter that is similar to the variable names to values environment we use in the AST interpreter, but this one tracks variable names to stack indices at which they reside. We update this information before compiling the body of a Let expression to capture the stack index of its assignment value. When compiling a Var expression, we use the env map to lookup the variable’s stack index, and encode it in the bytecode.
At the end of compilation, we check that the entire bytestring is filled with bytes till the very end, failing which we throw an error. This check is required because otherwise the bytestring may have garbage bytes, and may fail inexplicably during execution.
All the errors are thrown in the IO monad using the throwIO function, and are caught after compilation using the catch function. The final result or error is returned wrapped into Result.
$ echo -n "let x = 4 in let y = 5 in x + y" | arith-vm compile | hexdump -C
00000000 00 04 00 00 05 00 02 00 02 01 03 01 01 |.............|
0000000d
You can verify that the resultant bytes are indeed correct. I assume that it is difficult for you to read raw bytes. We’ll fix this in a minute. Meanwhile, let’s ponder upon some performance characteristics of our compiler.
Compiling, Fast and Slow
You may be wondering why I chose to write the compiler in this somewhat convoluted way of pre-allocating a bytestring and using pointers. The answer is: performance. I didn’t actually start with pointers. I iterated through many different data and control structures to find the fastest one.
The table below shows the compilation times for a benchmark expression file when using different data structures to implement the compileIO function:
Data structure       Time (ms)   Incremental speedup   Overall speedup
List                 4345        1x                    1x
Seq                  523         8.31x                 8.31x
DList                486         1.08x                 8.94x
BS Builder           370         1.31x                 11.74x
Pre-allocated BS     54          6.85x                 80.46x
Bytearray            52          1.02x                 83.55x
I started with the bread-and-butter data structure of Haskellers, the humble and known to be slow List, which was indeed quite slow. Next, I moved on to Seq and thereafter DList, which are known to be faster at concatenation/consing. Then I abandoned the use of intermediate data structures completely, choosing to use a bytestring Builder to create the bytestring. Finally, I had the epiphany that the bytestring size was known at compile time, and rewrote the function to pre-allocate the bytestring, thereby reaching the fastest solution.
I also tried using Bytearray, which has more-or-less the same performance as bytestring, but it is inconvenient to use because there are no functions to do IO with bytearrays. So I’d anyway need to use bytestrings for reading from STDIN or writing to STDOUT, and converting to-and-fro between bytearray and bytestring is a performance killer. Thus, I decided to stick to bytestrings.
The pre-allocated bytestring approach is 80 times faster than using lists, and almost 10 times faster than using Seq. For such gain, I’m okay with the complications it brings to the code. Here are the numbers in a chart (smaller is better):
Compilation time using different data-structures
The other important data structure used here is the map (or dictionary) in which we add the mappings from identifiers to their stack indices. This data structure needs to be performant because we do a lookup for each variable we encounter while compiling. I benchmarked compilation for some data structures:
Strict hashmap turns out to be the fastest one, but interestingly, linked list is a close second. Mutable hashtable is the slowest even though I expected it to be the fastest. Here are the times in a chart (smaller is better):
Compilation time using different map data-structures
Another choice I had to make was how to write the go function. I ended up passing and returning pointers and environment map, and throwing errors in IO, but a number of solutions are possible. I tried out some of them, and noted the compilation times for the benchmark expression file:
Control structure                  Time (ms)   Slowdown
IO                                 57.4        1.00x
IO + IORef                         65.0        1.13x
IO + ReaderT                       60.9        1.06x
IO + StateT                        65.6        1.14x
IO + ExceptT                       65.9        1.15x
IO + ReaderT + ExceptT             107.1       1.87x
IO + StateT + ExceptT              383.9       6.69x
IO + StateT + ReaderT              687.5       11.98x
IO + StateT + ReaderT + ExceptT    702.0       12.23x
IO + CPS                           78.2        1.36x
IO + DCPS                          78.4        1.37x
IO + ContT                         76.5        1.33x
I tried putting the pointer in IORefs and StateT state instead of passing them back-and-forth. I tried putting the environment in a ReaderT config. I tried using ExceptT for throwing errors instead of using IO errors. Then I tried various combinations of these monad transformers.
Finally, I also tried converting the go function to be tail-recursive by using Continuation-passing style (CPS), and then defunctionalizing the continuations, as well as, using the ContT monad transformer. All of these approaches resulted in slower code. The times are interesting to compare (smaller is better):
Compilation time using different control-structures
There is no reason to use IORefs here because they result in slower and uglier code. Using one monad transformer at a time results in slight slowdowns, which may be worth the improvement in the code. But using more than one of them degrades performance by a lot. Also, there is no improvement from CPS conversion, because GHC is smart enough to optimize the non-tail-recursive code to be faster than the handwritten tail-recursive one that allocates a lot of closures (or objects in the case of defunctionalization).
Moving on …
The Decompiler
It is a hassle to read raw bytes in the compiler output. Let’s write a decompiler to aid us in debugging and testing the compiler. First, a disassembler that converts bytes to opcodes:
type Program = Seq Opcode

disassemble :: Bytecode -> Result Program
disassemble bytecode = go 0 Seq.empty
  where
    !size = BS.length bytecode
    go !ip !program
      | ip == size = pure program
      | otherwise = case readInstr bytecode ip of
          0 | ip + 2 < size ->
                go (ip + 3) $ program |> OPush (readInstrArgInt16 bytecode ip)
          0 -> throwIPOOBError $ ip + 2
          1 -> go (ip + 1) $ program |> OSwapPop
          2 | ip + 1 < size ->
                go (ip + 2) $ program |> OGet (readInstrArgWord8 bytecode ip)
          2 -> throwIPOOBError $ ip + 1
          3 -> go (ip + 1) $ program |> OAdd
          4 -> go (ip + 1) $ program |> OSub
          5 -> go (ip + 1) $ program |> OMul
          6 -> go (ip + 1) $ program |> ODiv
          n ->
            throwDisassembleError $
              "Invalid bytecode: " <> show n <> " at: " <> show ip
    throwIPOOBError ip =
      throwDisassembleError $
        "Instruction index " <> show ip <> " out of bound " <> show (size - 1)
    throwDisassembleError = throwError . ErrorDisassemble
ArithVMLib.hs
A disassembled program is a sequence of opcodes. We simply go over each byte of the bytecode, and append the right opcode for it to the program, along with any parameters it may have. Note that we do not verify that the disassembled program is correct.
Here are the helpers that read instruction bytes and their arguments from a bytestring:
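-- A minimal sketch of these helpers, inferred from the little-endian encoding
-- used by the compiler above; the post's actual definitions may differ (for
-- example, by using unchecked indexing).
readInstr :: Bytecode -> Int -> Word8
readInstr bytecode ip = BS.index bytecode ip

readInstrArgWord8 :: Bytecode -> Int -> Word8
readInstrArgWord8 bytecode ip = BS.index bytecode (ip + 1)

readInstrArgInt16 :: Bytecode -> Int -> Int16
readInstrArgInt16 bytecode ip =
  let lb = fromIntegral (BS.index bytecode (ip + 1)) :: Word16
      mb = fromIntegral (BS.index bytecode (ip + 2)) :: Word16
   in fromIntegral (lb .|. (mb `shiftL` 8)) -- low byte first, then high byte
With these in place, we can go one step further and decompile a disassembled program back into an expression: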
decompile :: Program -> Result Expr
decompile program = do
  stack <- go Seq.empty program
  checkStack Decompile maxBound $ length stack
  let ast :<| _ = stack
  pure ast
  where
    go stack = \case
      Seq.Empty -> pure stack
      opcode :<| rest ->
        case opcode of
          OPush n -> go (stack |> Num n) rest
          OAdd -> decompileBinOp Add >>= flip go rest
          OSub -> decompileBinOp Sub >>= flip go rest
          OMul -> decompileBinOp Mul >>= flip go rest
          ODiv -> decompileBinOp Div >>= flip go rest
          OGet i -> go (stack |> Var (mkIdent $ mkName $ fromIntegral i)) rest
          OSwapPop -> decompileLet >>= flip go rest
        where
          decompileBinOp op = case stack of
            stack' :|> a :|> b -> pure $ stack' |> BinOp op a b
            _ ->
              throwDecompileError $
                "Not enough elements to decompile binary operation: " <> show op
          decompileLet = case stack of
            stack' :|> a :|> b ->
              pure $ stack' |> Let (mkIdent $ mkName $ length stack - 2) a b
            _ -> throwDecompileError "Not enough elements to decompile let"
    mkName i = names `Seq.index` i
    names = Seq.fromList $ tail $ combinations 2
    combinations = \case
      0 -> [""]
      n ->
        let prev = combinations (n - 1)
         in prev <> [x : xs | x <- ['a' .. 'z'], xs <- prev]
    throwDecompileError = throwError . ErrorDecompile

checkStack :: (MonadError Error m) => Pass -> Int -> Int -> m ()
checkStack pass stackSize = \case
  1 -> pure ()
  0 -> throwError $ Error pass "Final stack has no elements"
  n | n > stackSize -> throwError . Error pass $ "Stack overflow"
  n | n > 1 -> throwError . Error pass $ "Final stack has more than one element"
  _ -> throwError . Error pass $ "Stack underflow"
ArithVMLib.hs
Decompilation is the opposite of compilation. While compiling there is an implicit stack of expressions that are yet to be compiled. We make that stack explicit here, capturing expressions as they are decompiled from opcodes. For compound expressions, we inspect the stack and use the already decompiled expressions as the operands of the expression being decompiled. This way we build up larger expressions from smaller ones, culminating in the single top-level expression at the end7. Finally, we check the stack to make sure that there is only one expression left in it. Note that like the disassembler, we do not verify that the decompiled expression is correct.
There is one tricky thing in decompilation: we lose the names of the variables when compiling, and are left with only stack indices. So while decompiling, we generate variable names from their stack indices by indexing a list of unique names. Let’s see it in action:
$ echo -n "1 + 2 - 3 * 4" | arith-vm compile | arith-vm disassemble
OPush 1
OPush 2
OAdd
OPush 3
OPush 4
OMul
OSub
$ echo -n "1 + 2 - 3 * 4" | arith-vm compile | arith-vm decompile
( ( 1 + 2 ) - ( 3 * 4 ) )
$ echo -n "let x = 4 in let y = 5 in x + y" | arith-vm compile | arith-vm disassemble
OPush 4
OPush 5
OGet 0
OGet 1
OAdd
OSwapPop
OSwapPop
$ echo -n "let x = 4 in let y = 5 in x + y" | arith-vm compile | arith-vm decompile
( let a = 4 in ( let b = 5 in ( a + b ) ) )
That’s all for compilation and decompilation. Now, we use them together to make sure that everything works.
Testing the Compiler
We write some unit tests for the compiler, targeting both success and failure cases:
compilerSpec :: Spec
compilerSpec = describe "Compiler" $ do
  forM_ compilerSuccessTests $ \(input, result) ->
    it ("compiles: \"" <> BSC.unpack input <> "\"") $ do
      parseCompile input `shouldBe` Right (Seq.fromList result)
  forM_ compilerErrorTests $ \(input, err) ->
    it ("fails for: \"" <> BSC.unpack input <> "\"") $ do
      parseCompile input `shouldSatisfy` \case
        Left (ErrorCompile msg) | err == msg -> True
        _ -> False
  it "fails for greater sized expr" $ do
    compile (Num 1, 4) `shouldSatisfy` \case
      Left (ErrorCompile "Compiled bytecode size 3 is not same as expected size: 4") -> True
      _ -> False
  it "fails for lesser sized expr" $ do
    compile (Num 1, 2) `shouldSatisfy` \case
      Left (ErrorCompile "Instruction index 2 out of bound 1") -> True
      _ -> False
  where
    parseCompile = parseSized >=> compile' 4 >=> disassemble

compilerSuccessTests :: [(BSC.ByteString, [Opcode])]
compilerSuccessTests =
  [ ("1", [OPush 1]),
    ( "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1",
      [ OPush 1, OPush 2, OAdd, OPush 3, OPush 4, OMul, OSub, OPush 5, OPush 6,
        ODiv, OPush 1, ODiv, OAdd, OPush 1, OAdd
      ]
    ),
    ( "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)",
      [ OPush 1, OPush 2, OPush 3, OSub, OPush 4, OMul, OAdd, OPush 5, OPush 6,
        ODiv, OPush 1, OPush 1, OAdd, ODiv, OAdd
      ]
    ),
    ("let x = 4 in x + 1", [OPush 4, OGet 0, OPush 1, OAdd, OSwapPop]),
    ( "let x = 4 in let y = 5 in x + y",
      [OPush 4, OPush 5, OGet 0, OGet 1, OAdd, OSwapPop, OSwapPop]
    ),
    ( "let x = 4 in let x = x + 1 in x + 2",
      [OPush 4, OGet 0, OPush 1, OAdd, OGet 1, OPush 2, OAdd, OSwapPop, OSwapPop]
    ),
    ( "let x = let y = 3 in y + y in x * 3",
      [OPush 3, OGet 0, OGet 0, OAdd, OSwapPop, OGet 0, OPush 3, OMul, OSwapPop]
    ),
    ( "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3",
      [ OPush 1, OPush 2, OGet 1, OGet 1, OMul, OSwapPop, OAdd, OGet 0, OPush 1,
        OAdd, OSwapPop, OGet 0, OPush 3, OMul, OSwapPop
      ]
    ),
    ("1/0", [OPush 1, OPush 0, ODiv]),
    ("-32768 / -1", [OPush (-32768), OPush (-1), ODiv])
  ]

compilerErrorTests :: [(BSC.ByteString, String)]
compilerErrorTests =
  [ ("x", "Unknown variable: x"),
    ("let x = 4 in y + 1", "Unknown variable: y"),
    ("let x = y + 1 in x", "Unknown variable: y"),
    ("let x = x + 1 in x", "Unknown variable: x"),
    ("let x = 4 in let y = 1 in let z = 2 in y + x", "Stack overflow"),
    ("let x = 4 in let y = 5 in x + let z = y in z * z", "Stack overflow"),
    ("let a = 0 in let b = 0 in let c = 0 in let d = 0 in d", "Stack overflow")
  ]
ArithVMSpec.hs
In each test, we parse and compile an expression, and then disassemble the compiled bytes, which we match against the expected list of opcodes, or against an expected error message.
Let’s put these tests with the parser tests, and run them:
main :: IO ()
main = hspec $ do
  parserSpec
  astInterpreterSpec
  compilerSpec
ArithVMSpec.hs
Output of the test run
$ cabal test -O2
Running 1 test suites...
Test suite specs: RUNNING...
Parser
parses: "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1" [✔]
parses: "1+2-3*4+5/6/0+1" [✔]
parses: "1 + -1" [✔]
parses: "let x = 4 in x + 1" [✔]
parses: "let x=4in x+1" [✔]
parses: "let x = 4 in let y = 5 in x + y" [✔]
parses: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
parses: "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z" [✔]
parses: "let x=4in 2+let y=x-5in x+let z=y+1in z/2" [✔]
parses: "let x = (let y = 3 in y + y) in x * 3" [✔]
parses: "let x = let y = 3 in y + y in x * 3" [✔]
parses: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
fails for: "" [✔]
fails for: "1 +" [✔]
fails for: "1 & 1" [✔]
fails for: "1 + 1 & 1" [✔]
fails for: "1 & 1 + 1" [✔]
fails for: "(" [✔]
fails for: "(1" [✔]
fails for: "(1 + " [✔]
fails for: "(1 + 2" [✔]
fails for: "(1 + 2}" [✔]
fails for: "66666" [✔]
fails for: "-x" [✔]
fails for: "let 1" [✔]
fails for: "let x = 1 in " [✔]
fails for: "let let = 1 in 1" [✔]
fails for: "let x = 1 in in" [✔]
fails for: "let x=1 inx" [✔]
fails for: "letx = 1 in x" [✔]
fails for: "let x ~ 1 in x" [✔]
fails for: "let x = 1 & 2 in x" [✔]
fails for: "let x = 1 inx" [✔]
fails for: "let x = 1 in x +" [✔]
fails for: "let x = 1 in x in" [✔]
fails for: "let x = let x = 1 in x" [✔]
AST interpreter
interprets: "1" [✔]
interprets: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
interprets: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
interprets: "1 + -1" [✔]
interprets: "1 * -1" [✔]
interprets: "let x = 4 in x + 1" [✔]
interprets: "let x = 4 in let x = x + 1 in x + 2" [✔]
interprets: "let x = 4 in let y = 5 in x + y" [✔]
interprets: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
interprets: "let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z" [✔]
interprets: "let x = let y = 3 in y + y in x * 3" [✔]
interprets: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
fails for: "x" [✔]
fails for: "let x = 4 in y + 1" [✔]
fails for: "let x = y + 1 in x" [✔]
fails for: "let x = x + 1 in x" [✔]
fails for: "1/0" [✔]
fails for: "-32768 / -1" [✔]
Compiler
compiles: "1" [✔]
compiles: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
compiles: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
compiles: "let x = 4 in x + 1" [✔]
compiles: "let x = 4 in let y = 5 in x + y" [✔]
compiles: "let x = 4 in let x = x + 1 in x + 2" [✔]
compiles: "let x = let y = 3 in y + y in x * 3" [✔]
compiles: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
compiles: "1/0" [✔]
compiles: "-32768 / -1" [✔]
fails for: "x" [✔]
fails for: "let x = 4 in y + 1" [✔]
fails for: "let x = y + 1 in x" [✔]
fails for: "let x = x + 1 in x" [✔]
fails for: "let x = 4 in let y = 1 in let z = 2 in y + x" [✔]
fails for: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
fails for: "let a = 0 in let b = 0 in let c = 0 in let d = 0 in d" [✔]
fails for greater sized expr [✔]
fails for lesser sized expr [✔]
Finished in 0.0147 seconds
73 examples, 0 failures
Test suite specs: PASS
Awesome, it works! That’s it for this post. Let’s update our checklist:
In the next part, we write a virtual machine that runs our compiled bytecode, and do some benchmarking.
If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!
There are VMs that execute hardware ISs instead of bytecode. Such VMs are also called Emulators because they emulate actual CPU hardware. Some examples are QEMU and video game console emulators.↩︎
VMs use virtual registers instead of actual CPU registers, which are often represented as a fixed size array of 1, 2, 4 or 8 byte elements.↩︎
I call them variables here but they do not actually vary. A better name is let bindings.↩︎
We could have used two separate opcodes here: OSwap and OPop. That would result in same final result when evaluating an expression, but we’d have to execute two instructions instead of one for Let expressions. Using a single OSwapPop instruction speeds up execution, not only because we reduce the number of instructions, but also because we don’t need to do a full swap, only a half swap is enough because we pop the stack anyway after the swap. This also shows how we can improve the performance of our VMs by inventing specific opcodes for particular operations.↩︎
Notice the use of strict Pairs here, for performance reasons.↩︎
One of the intriguing features of Swift is its distinction between value types and reference types. Conceptually, value types are always copied in assignments and passed by value in function calls — i.e., they are semantically immutable. In contrast, for reference types, Swift only copies a pointer to an object on an assignment, and they are passed by reference to functions. If such an object gets mutated, it changes for all references. While most languages feature both value and reference types, Swift is unique in that (1) it makes it easy to define and use both flavours of types and (2) it supports fine-grained mutability control.
For large values, such as arrays, frequent copying carries a significant performance penalty. Hence, the Swift compiler goes to great lengths to avoid copying whenever it is safe. For large values, this effectively boils down to a copy-on-write strategy, where a large value is only copied when it actually is being mutated (on one code path). Swift also makes it possible for user-defined value types to adopt this copy-on-write strategy.
In this talk, I will explain the semantic difference between value and reference types, and I will illustrate how this facilitates safe and robust coding practices in Swift. Moreover, I will explain how the copy-on-write strategy for large values works and how it interacts with Swift’s memory management system. Finally, I will demonstrate how you can define your own copy-on-write large value types.
I want to talk about one of the many pretty areas of number theory. This involves the notion of an
arithmetic function and related concepts. A few relatively simple concepts will allow us to produce
a variety of useful functions and theorems. This provides only a glimpse of the start of the
field of analytic number theory, though many of these techniques are used in other places as we’ll
also start to see.
As some notation, I’ll write |\mathbb N_+| for the set of positive naturals, and |\mathbb P| for
the set of primes. |\mathbb N| will contain |0|. Slightly atypically, I’ll write |[n]| for the set
of numbers from |1| to |n| inclusive, i.e. |a \in [n]| if and only if |1 \leq a \leq n|.
I find that the easiest way to see results in number theory
is to view a positive natural number as a multiset of primes which is uniquely given by
factorization. Coprime numbers are ones where these multisets are disjoint. Multiplication unions
the multisets. The greatest common divisor is multiset intersection. |n| divides |m| if and only
if |n| corresponds to a sub-multiset of |m|, in which case |m/n| corresponds to the multiset
difference. The multiplicity of an element of a multiset is the number of occurrences.
For a multiset |P|, |\mathrm{dom}(P)| is the set of elements of the multiset |P|, i.e. those
with multiplicity greater than |0|. For a finite multiset |P|, |\vert P\vert| will be the sum of
the multiplicities of the distinct elements, i.e. the number of elements (with duplicates) in the
multiset.
We can represent a multiset of primes as a function |\mathbb P \to \mathbb N| which maps an
element to its multiplicity. A finite multiset would then be such a function that is |0| at all
but finitely many primes. Alternatively, we can represent the multiset as a partial function
|\mathbb P \rightharpoonup \mathbb N_+|. It will be finite when it is defined for only finitely
many primes. Equivalently, when it is a finite subset of |\mathbb P\times\mathbb N_+| (which is
also a functional relation).
Unique factorization provides a bijection between finite multisets of primes and positive natural
numbers. Given a finite multiset |P|, the corresponding positive natural number is
|n_P = \prod_{(p, k) \in P} p^k|.
I will refer to this view often in the following.
Arithmetic Functions
An arithmetic function is just a function
defined on the positive naturals. Usually, they’ll land in (not necessarily positive) natural
numbers, but that isn’t required.
In most cases, we’ll be interested in the specific subclass of multiplicative arithmetic functions.
An arithmetic function, |f|, is multiplicative if |f(1) = 1| and |f(ab) = f(a)f(b)| whenever
|a| and |b| are coprime. We also have the notion of a completely multiplicative arithmetic
function for which |f(ab) = f(a)f(b)| always. Obviously, completely multiplicative functions are
multiplicative. Analogously, we also have a notion of (completely) additive where
|f(ab) = f(a) + f(b)|. Warning: In other mathematical contexts, “additive” means |f(a+b)=f(a)+f(b)|.
An obvious example of a completely additive function is the logarithm. Exponentiating an additive
function will produce a multiplicative function.
For an additive function, |f|, we automatically get |f(1) = 0| since |f(1) = f(1\cdot 1) = f(1) + f(1)|.
Lemma: The product of two multiplicative functions |f| and |g| is multiplicative. Proof: For |a| and |b| coprime, |f(ab)g(ab) = f(a)f(b)g(a)g(b) = f(a)g(a)f(b)g(b)|. |\square|
A parallel statement holds for completely multiplicative functions.
It’s also clear that a completely multiplicative function is entirely determined by its action on
prime numbers. Since |p^n| is coprime to |q^n| whenever |p| and |q| are coprime, we see
that a multiplicative function is entirely determined by its action on powers of primes. To this
end, I’ll often define multiplicative/additive functions by their action on prime powers and
completely multiplicative/additive functions by their action on primes.
Multiplicative functions aren’t closed under composition, but we do have that if |f| is completely
multiplicative and |g| is multiplicative, then |f \circ g| is multiplicative when that composite
makes sense.
Here are some examples. Not all of these will be used in the sequel.
The power function |({-})^z| for any |z|, not necessarily an integer, is completely multiplicative.
Choosing |z=0| in the previous, we see the constantly one function |\bar 1(n) = 1| is completely
multiplicative.
The identity function is clearly completely multiplicative and is also the |z=1| case of the above.
The Kronecker delta function |\delta(n) = \begin{cases}1, & n = 1 \\ 0, & n \neq 1\end{cases}|
is completely multiplicative. Often written |\varepsilon| in this context.
Define a multiplicative function via |\mu(p^n) = \begin{cases} -1, & n = 1 \\ 0, & n > 1\end{cases}|
where |p| is prime. This is the Möbius function.
More holistically, |\mu(n)| is |0| if |n| has any square factors, otherwise |\mu(n) = (-1)^k|
where |k| is the number of (distinct) prime factors.
Define a completely multiplicative function via |\lambda(p) = -1|. |\lambda(n) = \pm 1|
depending on whether there is an even or odd number of prime factors (including duplicates).
This function is known as the Liouville function.
|\lambda(n) = (-1)^{\Omega(n)}| where |\Omega(n)| is the completely additive function which
counts the number of prime factors of |n| including duplicates. |\Omega(n_P) = \vert P\vert|.
Define a multiplicative function via |\gamma(p^n) = -1|. |\gamma(n) = \pm 1| depending on
whether there is an even or odd number of distinct prime factors.
|\gamma(n) = (-1)^{\omega(n)}| where |\omega(n)| is the additive function which counts the
number of distinct prime factors of |n|. See Prime omega function.
We also see that |\omega(n_P) = \vert\mathrm{dom}(P)\vert|.
The completely additive function for |q\in\mathbb P|, |\nu_q(p) = \begin{cases}1,&p=q\\0,&p\neq q\end{cases}|
is the p-adic valuation.
It follows that the |p|-adic absolute value |\vert r\vert_p = p^{-\nu_p(r)}| is completely
multiplicative. It can be characterized on naturals by
|\vert p\vert_q = \begin{cases}p^{-1},&p=q\\1,&p\neq q\end{cases}|.
|\gcd({-}, k)| for a fixed |k| is multiplicative. Given any multiplicative function |f|,
|f \circ \gcd({-},k)| is multiplicative. This essentially “restricts” |f| to only see the prime
powers that divide |k|. Viewing the finite multiset of primes |P| as a function |\mathbb P\to\mathbb N|,
|f(\gcd(p^n,n_P)) = \begin{cases}f(p^n),&n\leq P(p)\\f(p^{P(p)}),&n>P(p)\end{cases}|.
The multiplicative function characterized by |a(p^n) = p(n)|, where |p(n)| is the partition function,
counts the number of abelian groups of the given order. That this function is multiplicative is a
consequence of the fundamental theorem of finite abelian groups.
The Jacobi symbol |\left(\frac{a}{n}\right)|
where |a\in\mathbb Z| and |n| is an odd positive integer is a completely multiplicative
function with either |a| or |n| fixed. When |n| is an odd prime, it reduces to the
Legendre symbol. For |p| an odd prime, we have
|(\frac{a}{p}) = a^{\frac{p-1}{2}} \pmod p|. This will always be in
|\{-1, 0, 1\}| and can be alternately defined as
|\left(\frac{a}{p}\right) = \begin{cases}0,&p\mid a\\1,&p\nmid a\text{ and }\exists x.x^2\equiv a\pmod p\\-1,&\not\exists x.x^2\equiv a\pmod p\end{cases}|.
Therefore, |\left(\frac{a}{p}\right)=1| (|=0|) when |a| is a (trivial) quadratic residue
mod |p|.
An interesting example which is neither multiplicative nor additive is the arithmetic derivative.
Let |p\in\mathbb P|. Define |\frac{\partial}{\partial p}(n)| via |\frac{\partial}{\partial p}(p) = 1|,
|\frac{\partial}{\partial p}(q) = 0| for |q\neq p| and |q\in\mathbb P|, and
|\frac{\partial}{\partial p}(nm) = \frac{\partial}{\partial p}(n)m + n\frac{\partial}{\partial p}(m)|.
We then have |D_S = \sum_{p\in S}\frac{\partial}{\partial p}| for non-empty |S\subseteq\mathbb P|
which satisfies the same product rule identity. This perspective views a natural number (or, more
generally, a rational number) as a monomial in infinitely many variables labeled by prime numbers.
A Dirichlet character of modulus |m| is,
by definition, a completely multiplicative function |\chi| satisfying |\chi(n + m) = \chi(n)|
and |\chi(n)| is non-zero if and only if |n| is coprime to |m|. The Jacobi symbol
|\left(\frac{({-})}{m}\right)| is a Dirichlet character of modulus |m|. |\bar 1| is the
Dirichlet character of modulus |1|.
Dirichlet Series
Given an arithmetic function |f|, we define the Dirichlet series:
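\[\mathcal D[f](s) = \sum_{n=1}^\infty \frac{f(n)}{n^s} = \sum_{n=1}^\infty f(n)n^{-s}\]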
When |f| is a Dirichlet character, |\chi|, this is referred to as the (Dirichlet) |L|-series
of the character, and the analytic continuation is the (Dirichlet) |L|-function and is written
|L(s, \chi)|.
We’ll not focus much on when such a series converges. See this section
of the above Wikipedia article for more details. Alternatively, we could talk about
formal Dirichlet series.
We can clearly see that if |s = 0|, then we get
the sum |\sum_{n=1}^\infty f(n)| which clearly won’t converge for, say, |f = \bar 1|. We can say
that if |f| is asymptotically bounded by |n^k| for some |k|, i.e. |f \in O(n^k)|, then the series
will converge absolutely when the real part of |s| is greater than |k+1|. For |\bar 1|, it follows
that |\mathcal D[\bar 1](x + iy)| is defined when |x > 1|. We can use analytic continuation
to go beyond these limits.
Why is this interesting in this context? Let’s consider two arithmetic functions |f| and |g| and
multiply their corresponding Dirichlet series. We’ll get:
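\[\mathcal D[f](s)\mathcal D[g](s)
= \left(\sum_{n=1}^\infty f(n)n^{-s}\right)\left(\sum_{m=1}^\infty g(m)m^{-s}\right)
= \sum_{n=1}^\infty h(n)n^{-s}\]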
where now we need to figure out what |h(n)| is. But |h(n)| is going to be the sum of all the terms
of the form |f(a)a^{-s}g(b)b^{-s} = f(a)g(b)(ab)^{-s}| where |ab = n|. We can thus write:
\[h(n) = \sum_{ab=n} f(a)g(b) = \sum_{d\mid n} f(d)g(n/d)\] We’ll write this more compactly as
|h = f \star g| which we’ll call Dirichlet convolution.
We have thus shown a convolution theorem of the form
\[\mathcal D[f]\mathcal D[g] = \mathcal D[f \star g]\]
The Kronecker delta serves as a unit to this operation which is reflected by
|\mathcal D[\delta](s) = 1|.
In the same way we can view a sum of the form |\sum_{a+b=n}f(a)g(b)| that arises in “normal”
convolution as a sum along the line |y = n - x|, we can view the sum |\sum_{ab=n}f(a)g(b)| as
a sum along a hyperbola of the form |y = n/x|. For all of
|\sum_{n=1}^\infty\sum_{k=1}^\infty f(n)g(k)|, |\sum_{n=1}^\infty\sum_{k=1}^{n-1} f(k)g(n-k)|,
and |\sum_{n=1}^\infty\sum_{k\mid n}f(k)g(n/k)| we’re including |f(a)g(b)| for every
|(a,b)\in\mathbb N_+\times\mathbb N_+| in the sum exactly once. The difference is whether
we’re grouping the internal sum by rows, diagonals, or hyperbolas. This idea of summing hyperbolas
can be expanded to a computational technique for sums of multiplicative functions called the
Dirichlet hyperbola method.
Since we will primarily be interested in multiplicative functions, we should check that
|f \star g| is a multiplicative function when |f| and |g| are.
Lemma: Assume |a| and |b| are coprime, and |f| and |g| are multiplicative. Then
|(f \star g)(ab) = (f \star g)(a)(f \star g)(b)|.
Proof: Since |a| and |b| are coprime, they share no divisors besides |1|. This means every |d|
such that |d \mid ab| factors as |d = d_a d_b| where |d_a \mid a| and |d_b \mid b|. More
strongly, write |D_n = \{ d \in \mathbb N_+ \mid d \mid n\}|, then for any coprime pair of
numbers |i| and |j|, we have |D_{ij} \cong D_i \times D_j| and that every pair
|(d_i, d_j) \in D_i \times D_j| are coprime1. Thus,
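\[\begin{flalign}
(f \star g)(ab)
& = \sum_{d \mid ab} f(d)g(ab/d) \\
& = \sum_{(d_a, d_b) \in D_a \times D_b} f(d_a d_b)g((a/d_a)(b/d_b)) \\
& = \sum_{d_a \mid a}\sum_{d_b \mid b} f(d_a)f(d_b)g(a/d_a)g(b/d_b) \\
& = \left(\sum_{d_a \mid a} f(d_a)g(a/d_a)\right)\left(\sum_{d_b \mid b} f(d_b)g(b/d_b)\right) \\
& = (f \star g)(a)(f \star g)(b)
\end{flalign}\]
|\square|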
It is not the case that the Dirichlet convolution of two completely multiplicative functions is
completely multiplicative.
We can already start to do some interesting things with this. First, we see that
|\mathcal D[\bar 1] = \zeta|, the Riemann zeta function.
Now consider |(\bar 1 \star \bar 1)(n) = \sum_{k \mid n} 1 = d(n)|.
|d(n)| is the divisor function
which counts the number of divisors of |n|. We see that
|\mathcal D[d](s) = \zeta(s)^2|. A simple but useful fact is
|\zeta(s - z) = \mathcal D[(-)^z](s)|. This directly generalizes the result for
|\mathcal D[\bar 1]| and also implies |\mathcal D[\operatorname{id}](s) = \zeta(s - 1)|.
Generalizing in a different way, we get the family of functions
|\sigma_k = ({-})^k \star \bar 1|. |\sigma_k(n) = \sum_{d \mid n} d^k|.
From the above, we see |\mathcal D[\sigma_k](s) = \zeta(s - k)\zeta(s)|.
As a simple corollary, for a completely multiplicative |f|,
|f \star f = f(\bar 1 \star \bar 1) = fd|.
Euler Product Formula
However, the true power of this is unlocked by the following theorem:
Theorem (Euler product formula):
Given a multiplicative function |f| which doesn’t grow too fast, e.g. is |O(n^k)| for some |k > 0|,
\[\mathcal D[f](s)
= \sum_{n=1}^\infty f(n)n^{-s}
= \prod_{p \in \mathbb P}\sum_{n=0}^\infty f(p^n)p^{-ns}
= \prod_{p \in \mathbb P}\left(1 + \sum_{n=1}^\infty f(p^n)p^{-ns}\right)
\]
where the series converges.
Proof: The last equality is simply using the fact that |f(p^0)p^0 = f(1) = 1| because |f| is
multiplicative. The idea for the main part is similar to how we derived Dirichlet convolution.
When we start to distribute out the infinite product, each term will correspond to the product of
selections of a term from each series. When all but finitely many of those selections select the |1|
term, we get |\prod_{(p, k) \in P}f(p^k)(p^k)^{-s}| where |P| is some finite multiset of
primes induced by those selections. Therefore,
|\prod_{(p, k) \in P}f(p^k)(p^k)^{-s} = f(n_P)n_P^{-s}|. Thus, by unique factorization,
|f(n)n^{-s}| for every positive natural occurs in the sum produced by distributing the right-hand
side exactly once.
In the case where |P| is not a finite multiset, we’ll have \[
\frac{\prod_{(p, k) \in P}f(p^k)}{\left(\prod_{(p, k) \in P}p^k\right)^s}\]
The denominator of this expression goes to infinity when the real part of |s| is greater than |0|.
As long as the numerator doesn’t grow faster than the denominator (perhaps after restricting the
real part of |s| to be greater than some bound), then this product goes to |0|. Therefore, the only
terms that remain are these corresponding to the Dirichlet series on the left-hand side. |\square|
If we assume |f| is completely multiplicative, we can further simplify Euler’s product formula
via the usual sum of a geometric series, |\sum_{n=0}^\infty x^n = (1-x)^{-1}|, to:
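\[\mathcal D[f](s)
= \prod_{p \in \mathbb P}\sum_{n=0}^\infty f(p)^n p^{-ns}
= \prod_{p \in \mathbb P}\left(1 - f(p)p^{-s}\right)^{-1}\]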
Now let’s put this to work. The first thing we can see is
|\zeta(s) = \mathcal D[\bar 1](s) = \prod_{p\in\mathbb P}(1 - p^{-s})^{-1}|. But this lets
us write |1/\zeta(s) = \prod_{p\in\mathbb P}(1 - p^{-s})|.
If we look for a multiplicative function that would produce the right-hand side, we see that it must
send a prime |p| to |-1| and |p^n| for |n > 1| to |0|. In other words, it’s the Möbius function
|\mu| we defined before. So |\mathcal D[\mu](s) = 1/\zeta(s)|.
Using |\mathcal D[d](s) = \zeta(s)^2|, we see that
\[\begin{flalign}
\zeta(s)^2
& = \prod_{p\in\mathbb P}\left(\sum_{n=0}^\infty p^{-ns}\right)^{2} \\
& = \prod_{p\in\mathbb P}\sum_{n=0}^\infty (n+1)p^{-ns} \\
& = \prod_{p\in\mathbb P}\sum_{n=0}^\infty d(p^n)p^{-ns} \\
& = \mathcal D[d](s)
\end{flalign}\]
Therefore, |d(p^n) = n + 1|. This intuitively makes sense because the only divisors of |p^n| are
|p^k| for |k = 0, \dots, n|, and for |a| and |b| coprime
|d(ab) = \vert D_{ab} \vert = \vert D_a \times D_b\vert = \vert D_a\vert\vert D_b\vert = d(a)d(b)|.
Another result leveraging the theorem is given any multiplicative function |f|, we can define a new
multiplicative function via
|f^{[k]}(p^n) = \begin{cases}f(p^m), & km = n\textrm{ for }m\in\mathbb N \\ 0, & k \nmid n\end{cases}|.
Lemma: The operation just defined has the property that
|\mathcal D[f^{[k]}](s) = \mathcal D[f](ks)|. Proof:
\[\begin{flalign}
\mathcal D[f^{[k]}](s)
& = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f^{[k]}(p^n)p^{-ns} \\
& = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f^{[k]}(p^{kn})p^{-nks} \\
& = \prod_{p \in \mathbb P}\sum_{n=0}^\infty f(p^n)p^{-nks} \\
& = \mathcal D[f](ks)
\end{flalign}\]
|\square|
Möbius Inversion
We can write a sum over some function, |f|, of the divisors of a given natural |n| as
|(f \star \bar 1)(n) = \sum_{d \mid n} f(d)|. Call this |g(n)|. But then we have
|\mathcal D[f \star \bar 1] = \mathcal D[f]\mathcal D[\bar 1] = \mathcal D[f]\zeta| and thus
|\mathcal D[f] = \mathcal D[f]\zeta/\zeta = \mathcal D[(f \star \bar 1) \star \mu]|.
Therefore, if we only have the sums |g(n) = \sum_{d \mid n} f(d)| for some unknown |f|, we can
recover |f| via |f(n) = (g \star \mu)(n) = \sum_{d\mid n}g(d)\mu(n/d)|.
This is Möbius inversion.
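To make this concrete, here is a small Haskell sketch (the helper names are my own and are not from any library) that builds the divisor sums of a function and then recovers the function by Möbius inversion:
divisors :: Int -> [Int]
divisors n = [d | d <- [1..n], n `mod` d == 0]

-- Möbius function by trial division: 0 if a square divides n, otherwise
-- (-1)^(number of prime factors).
mu :: Int -> Int
mu = go 2
  where
    go _ 1 = 1
    go p n
      | p * p > n            = -1                      -- remaining n is prime
      | n `mod` (p * p) == 0 = 0
      | n `mod` p == 0       = negate (go (p + 1) (n `div` p))
      | otherwise            = go (p + 1) n

-- divisorSum f is the divisor sum of f, i.e. f ⋆ 1̄.
divisorSum :: (Int -> Int) -> Int -> Int
divisorSum f n = sum [f d | d <- divisors n]

-- Möbius inversion recovers f from its divisor sums: f = (f ⋆ 1̄) ⋆ μ.
recovered :: (Int -> Int) -> Int -> Int
recovered f n = sum [divisorSum f d * mu (n `div` d) | d <- divisors n]

-- For example, map (recovered id) [1..20] should equal map id [1..20].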
As a simple example, we clearly have |\zeta(s)/\zeta(s) = 1 = \mathcal D[\delta](s)| so
|\bar 1 \star \mu = \delta| or |\sum_{d \mid n}\mu(d) = 0| for |n > 1| and |1| when |n = 1|.
We also get generalized Möbius inversion via
|\delta(n) = \delta(n)n^k = (\mu\star\bar 1)(n)n^k = (({-})^k\mu\star({-})^k)(n)|, which
is to say that if |g(n) = \sum_{d\mid n}d^k f(n/d)| then |f(n) = \sum_{d\mid n} \mu(d)d^kg(n/d)|.
By considering logarithms, we also get a multiplicative form of (generalized) Möbius inversion:
\[g(n) = \prod_{d\mid n}f(n/d)^{d^k} \iff f(n) = \prod_{d\mid n}g(n/d)^{\mu(d)d^k}\]
Theorem: As another guise of Möbius inversion, given any completely multiplicative function |h|,
let |g(m) = \sum_{n=1}^\infty f(mh(n))|. Assuming these sums make sense, we can recover |f(k)|
via |f(k) = \sum_{m=1}^\infty \mu(m)g(kh(m))|.
This will often show up in the form of |r(x^{1/n})| or |r(x^{1/n})/n|, i.e. with |h(n)=n^{-1}|
and |f_x(k) = r(x^k)| or |f_x(k) = kr(x^k)|. Typically, we’ll then be computing
|f_x(1) = r(x)|.
Lambert Series
Given an arithmetic function |a|, Lambert series are series of the form:
\[
\sum_{n=1}^\infty a(n) \frac{x^n}{1-x^n}
= \sum_{n=1}^\infty a(n) \sum_{k=1}^\infty x^{kn}
= \sum_{n=1}^\infty (a \star \bar 1)(n) x^n
\]
This leads to:
\[\sum_{n=1}^\infty \mu(n) \frac{x^n}{1-x^n} = x\]
and:
\[\sum_{n=1}^\infty \varphi(n) \frac{x^n}{1-x^n} = \frac{x}{(1-x)^2}\]
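As a quick numeric sanity check of the second identity, here is a rough Haskell sketch (the helper names are mine; |\varphi| is computed by brute force as a coprime count) that truncates the Lambert series and compares it against |x/(1-x)^2|:
totient :: Int -> Int
totient n = length [k | k <- [1..n], gcd k n == 1]

-- Truncation of sum_{n >= 1} φ(n) x^n / (1 - x^n); the tail decays geometrically for |x| < 1.
lambertPhi :: Double -> Double
lambertPhi x = sum [fromIntegral (totient n) * x ^ n / (1 - x ^ n) | n <- [1..200]]

-- lambertPhi 0.5 should be very close to 0.5 / (1 - 0.5)^2 == 2.0.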
Inclusion-Exclusion
The Möbius and |\zeta| functions can be generalized to
incidence algebras, where this form comes from the
incidence algebra induced by the divisibility order2.
A notable and relevant example of a Möbius
function for another, closely related, incidence algebra is the one for the incidence algebra
induced by finite multisets with the inclusion ordering. For a finite multiset |T|, we get
|\mu(T) = \begin{cases}0,&T\text{ has repeated elements}\\(-1)^{\vert T\vert},&T\text{ is a set}\end{cases}|.
Since we can view a natural number as a finite multiset of primes, and we can always relabel the
elements of a finite multiset with distinct primes, this is equivalent to the Möbius function we’ve
been using.
This leads to a nice and compact way of describing the principle of inclusion-exclusion.
Let |A| and |S| be (finite) multisets with |S \subseteq A| and assume we have |f| and |g| defined
on the set of sub-multisets of |A|. If \[g(A) = \sum_{S\subseteq A} f(S)\] then
\[f(A) = \sum_{S\subseteq A}\mu(A\setminus S)g(S)\] and this is Möbius inversion for this
notion of Möbius function. We can thus take a different perspective on Möbius inversion. If
|P| is a finite multiset of primes, then
\[g(n_P) = \sum_{Q\subseteq P}f(n_Q) \iff f(n_P) = \sum_{Q\subseteq P}\mu(P\setminus Q)g(n_Q)\]
recalling that |Q\subseteq P \iff n_Q \mid n_P| and |n_{P\setminus Q} = n_P/n_Q| when
|Q\subseteq P|.
We get traditional inclusion-exclusion by noting that |\mu(T)=(-1)^{\vert T\vert}| when |T| is a
set, i.e. all elements have multiplicity at most |1|. Let |I| be a finite set and assume we have a
family of finite sets, |\{T_i\}_{i\in I}|. Write |T = \bigcup_{i\in I}T_i| and define
|\bigcap_{i\in\varnothing}T_i = T|.
Define
\[f(J) = \left\vert\bigcap_{i\in I\setminus J}T_i\setminus\bigcup_{i \in J}T_i\right\vert\]
for |J\subseteq I|. In particular, |f(I) = 0|.
|f(J)| is then the number of elements shared by all |T_i| for |i\notin J| and no |T_j| for
|j\in J|. Every |x \in \bigcup_{i\in I}T_i| is thus associated to exactly one such subset
of |I|, namely |\{j\in I\mid x\notin T_j\}|. Formally,
|x \in \bigcap_{i\in I\setminus J}T_i\setminus\bigcup_{i \in J}T_i \iff J = \{j\in I\mid x\notin T_j\}|
so each |\bigcap_{i\in I\setminus J}T_i\setminus\bigcup_{i \in J}T_i| is disjoint and
\[g(J)
= \sum_{S\subseteq J}f(S)
= \left\vert\bigcup_{S\subseteq J}\left(\bigcap_{i\in I\setminus S}T_i\setminus\bigcup_{i \in S}T_i\right)\right\vert
= \left\vert\bigcap_{i\in I\setminus J}T_i\right\vert
\]
for |J \subseteq I|. In particular, |g(I) = \vert\bigcup_{i\in I}T_i\vert|.
By the Möbius inversion formula for finite sets, we thus have:
\[f(J) = \sum_{S\subseteq J}(-1)^{\vert J\vert - \vert S\vert}g(S)\]
which for |J = I| gives:
\[
0
= \sum_{J\subseteq I}(-1)^{\vert I\vert - \vert J\vert}\left\vert\bigcap_{i\in I\setminus J}T_i\right\vert
= \left\vert\bigcup_{i\in I}T_i\right\vert + \sum_{J\subsetneq I}(-1)^{\vert I\vert - \vert J\vert}\left\vert\bigcap_{i\in I\setminus J}T_i\right\vert
\]
which is equivalent to the more usual form:
\[\left\vert\bigcup_{i\in I}T_i\right\vert
= \sum_{J\subsetneq I}(-1)^{\vert I\vert - \vert J\vert - 1}\left\vert\bigcap_{i\in I\setminus J}T_i\right\vert
= \sum_{\varnothing\neq J\subseteq I}(-1)^{\vert J\vert + 1}\left\vert\bigcap_{i\in J}T_i\right\vert
\]
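To see the usual form in action, here is a tiny Haskell sketch (the helper is my own) that evaluates the right-hand side over all nonempty |J\subseteq I| so it can be compared against the size of the union computed directly:
import qualified Data.Set as Set
import Data.List (subsequences)

-- Right-hand side of the usual inclusion-exclusion formula.
inclusionExclusion :: Ord a => [Set.Set a] -> Int
inclusionExclusion ts =
  sum [ (-1) ^ (length js + 1) * Set.size (foldr1 Set.intersection js)
      | js <- subsequences ts, not (null js) ]

-- For example, with ts = map Set.fromList [[1,2,3],[2,3,4],[3,5]] :: [Set.Set Int],
-- both inclusionExclusion ts and Set.size (Set.unions ts) are 5.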
|\varphi|
An obvious thing to explore is to apply Möbius inversion to various arithmetic functions.
A fairly natural first step is applying Möbius inversion to the identity function. From the above
results, we know that this unknown function |\varphi| will satisfy
|\mathcal D[\varphi](s) = \zeta(s-1)/\zeta(s) = \mathcal D[\operatorname{id}\star\mu](s)|.
We also immediately have the property that |n = \sum_{d \mid n}\varphi(d)|. Using Euler’s
product formula we have:
\[\begin{flalign}
\zeta(s-1)/\zeta(s)
& = \prod_{p \in \mathbb P} \frac{1 - p^{-s}}{1 - p^{-s+1}} \\
& = \prod_{p \in \mathbb P} \frac{1 - p^{-s}}{1 - pp^{-s}} \\
& = \prod_{p \in \mathbb P} (1 - p^{-s})\sum_{n=0}^\infty p^n p^{-ns} \\
& = \prod_{p \in \mathbb P} \left(\sum_{n=0}^\infty p^n p^{-ns}\right) - \left(\sum_{n=0}^\infty p^n p^{-s} p^{-ns}\right) \\
& = \prod_{p \in \mathbb P} \left(\sum_{n=0}^\infty p^n p^{-ns}\right) - \left(\sum_{n=0}^\infty p^n p^{-(n + 1)s}\right) \\
& = \prod_{p \in \mathbb P} \left(1 + \sum_{n=1}^\infty p^n p^{-ns}\right) - \left(\sum_{n=1}^\infty p^{n-1} p^{-ns}\right) \\
& = \prod_{p \in \mathbb P} \left(1 + \sum_{n=1}^\infty (p^n - p^{n-1}) p^{-ns}\right) \\
& = \prod_{p \in \mathbb P} \left(1 + \sum_{n=1}^\infty \varphi(p^n) p^{-ns}\right) \\
& = \mathcal D[\varphi](s)
\end{flalign}\]
So |\varphi| is the multiplicative function defined by |\varphi(p^n) = p^n - p^{n-1}|.
For |p^n|, we can see that this counts the number of positive integers less than or equal to |p^n|
which are coprime to |p^n|. There are |p^n| positive integers less than or equal to |p^n|, and
every |p|th one is a multiple of |p| so |p^n/p = p^{n-1}| are not coprime to |p^n|. All the
remainder are coprime to |p^n| since they don’t have |p| in their prime factorizations and |p^n|
only has |p| in its. We need to verify that this interpretation is multiplicative. To be clear, we
know that |\varphi| is multiplicative and that this interpretation works for |p^n|. The question
is whether |\varphi(n)| for general |n| meets the above description, i.e. whether the number of
positive integers at most |n| and coprime to |n| is a multiplicative function of |n|.
Theorem: The number of positive integers at most |n| and coprime to |n| is multiplicative and is equal to |\varphi(n)|.
Proof: |\varphi = \mu\star\operatorname{id}|. We have:
\[\varphi(n_P) = \sum_{d\mid n_P}\mu(d)\frac{n_P}{d} = \sum_{Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert}\frac{n_P}{n_Q}\]
since |\mu(d) = 0| unless |d = n_Q| for some set |Q\subseteq\mathrm{dom}(P)|.
We can see an inclusion-exclusion
pattern. Specifically, let |C_k = \{ c \in [k] \mid \gcd(c, k) = 1\}| be the numbers less than
or equal to |k| and coprime to |k|. Let |S_{k,m} = \{ c \in [k] \mid m \mid c\}|. We have
|S_{k,a} \cap S_{k,b} = S_{k,\operatorname{lcm}(a,b)}|. Also, when |c \mid k|, then
|\vert S_{k,c}\vert = k/c|. |C_{n_P} = [n_P] \setminus \bigcup_{p \in \mathrm{dom}(P)} S_{n_P,p}|
because every number not coprime to |n_P| shares some prime factor with it. Applying
inclusion-exclusion to the union yields
\[\begin{align}
\vert C_{n_P}\vert
& = n_P - \sum_{\varnothing\neq Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert+1}\left\vert \bigcap_{p\in Q}S_{n_P,p}\right\vert \\
& = n_P + \sum_{\varnothing\neq Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert}\frac{n_P}{\prod_{p\in Q}p} \\
& = \sum_{Q\subseteq\mathrm{dom}(P)}(-1)^{\vert Q\vert}\frac{n_P}{n_Q}
\end{align}\]
which is exactly the expression for |\varphi(n_P)| above. |\square|
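To tie the two characterizations together computationally, here is a small Haskell sketch (the names are mine): one function counts the integers coprime to |n| directly, the other uses multiplicativity together with |\varphi(p^n) = p^n - p^{n-1}| on a trial factorization, and the two agree on small inputs:
-- φ(n) as the count of 1 <= k <= n with gcd(k, n) == 1.
totientByCount :: Int -> Int
totientByCount n = length [k | k <- [1..n], gcd k n == 1]

-- φ(n) from the prime factorization, using φ(p^a) = p^a - p^(a-1) and multiplicativity.
totientByFactors :: Int -> Int
totientByFactors = go 2
  where
    go _ 1 = 1
    go p n
      | p * p > n      = n - 1                         -- remaining n is prime
      | n `mod` p == 0 = let (a, m) = strip p n
                         in (p ^ a - p ^ (a - 1)) * go (p + 1) m
      | otherwise      = go (p + 1) n
    strip p n
      | n `mod` p == 0 = let (a, m) = strip p (n `div` p) in (a + 1, m)
      | otherwise      = (0, n)

-- map totientByCount [1..200] == map totientByFactors [1..200] evaluates to True.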
The book Combinatorial Species and Tree-Like Structures
has many examples where Dirichlet convolutions and Möbius inversion come up.
A combinatorial species is a functor |\operatorname{Core}(\mathbf{FinSet})\to\mathbf{FinSet}|.
Any permutation on a finite set can be decomposed into a collection of cyclic permutations.
Let |U| be a finite set of cardinality |n| and |\pi : U \cong U| a permutation of |U|.
For any |u\in U|, there is a smallest |k\in\mathbb N_+| such that |\pi^k(u) = u| where
|\pi^{k+1} = \pi \circ \pi^k| and |\pi^0 = \operatorname{id}|. The |k| elements
|\mathcal O(u)=\{\pi^{i-1}(u)\mid i\in[k]\}| make up a cycle of length |k|, and |\pi|
restricted to |U\setminus\mathcal O(u)| is a permutation on this smaller set. We can just inductively pull
out another cycle until we run out of elements. Write |\pi_k| for the number of cycles of length
|k| in the permutation |\pi|. We clearly have |n = \sum_{k=1}^\infty k\pi_k| as every cycle
has |k| elements in it.
Write |\operatorname{fix}\pi| for the number of fixed points of |\pi|, i.e. the cardinality of
the set |\{u\in U\mid \pi(u) = u\}|. Clearly, every element that is fixed by |\pi^k| needs
to be in a cycle whose length divides |k|. This leads to the equation:
\[\operatorname{fix}\pi^k = \sum_{d\mid k}d\pi_d\]
Since |F(\pi^k) = F(\pi)^k| for a combinatorial species |F|, Möbius inversion, as explicitly
stated in Proposition 2.2.3 of Combinatorial Species and Tree-Like Structures, leads to:
\[k\,(F(\pi))_k = \sum_{d\mid k}\mu\left(\frac{k}{d}\right)\operatorname{fix}F(\pi^d)\]
If we Dirichlet convolve both sides of this with |\operatorname{id}|, replacing |F(\pi)| with
|\beta| as it doesn’t matter that this permutation comes from an action of a species, we get:
\[k\sum_{d\mid k}\beta_d = \sum_{d\mid k}\varphi\left(\frac{k}{d}\right)\operatorname{fix}\beta^d\]
This is just using |\varphi = \operatorname{id}\star\mu|. If we choose |m| such that
|\beta^m = \operatorname{id}|, then we get |\sum_{d\mid m} \beta_d = \sum_{k=1}^\infty \beta_k|
because |\beta_k| will be |0| for all the |k| which don’t divide |m|.
This makes the previous equation into equation 2.2 (34) in the book.
Since we know |n = \sum_{k=1}^\infty k\pi_k| for any permutation |\pi|, we also get:
\[\vert F([n])\vert
= \sum_{k=1}^\infty\sum_{d\mid k}\mu\left(\frac{k}{d}\right)\operatorname{fix}F(\pi^d)
= \sum_{k=1}^\infty(\mu\star(d\mapsto\operatorname{fix}F(\pi^d)))(k)\]
These equations give us a way to compute some of these divisor sums by looking at the number of
fixed points and cycles of the action of species, and vice versa. For example, 2.3 (49) is a
series of Dirichlet convolutions connected to weighted species.
Example 12 from this book presents a nice and perhaps surprising identity. The core of it can be
written as: \[\sum_{k=1}^\infty\ln(1-ax^k) = \sum_{k=1}^\infty\rho_k(a)\ln(1-x^k)\]
where |\rho_k(a) = k^{-1}\sum_{d\mid k}\varphi(k/d)a^d|. We can rewrite this definition as
the characterization |k\rho_k(a) = (\varphi\star a^{({-})})(k)|. Recalling that
|\varphi = \mu \star \operatorname{id}| and |\ln(1-x) = -\sum_{n=1}^\infty x^n/n|, we get
the following derivation:
\[\begin{flalign}
\sum_{k=1}^\infty\rho_k(a)\ln(1-x^k)
& = -\sum_{k=1}^\infty\frac{(\varphi\star a^{({-})})(k)}{k}\sum_{m=1}^\infty\frac{x^{km}}{m} \\
& = -\sum_{N=1}^\infty\frac{x^N}{N}(\bar 1\star\varphi\star a^{({-})})(N) \\
& = -\sum_{N=1}^\infty\frac{x^N}{N}(\operatorname{id}\star a^{({-})})(N) \\
& = -\sum_{N=1}^\infty\frac{x^N}{N}\sum_{d\mid N}d\,a^{N/d} \\
& = -\sum_{d=1}^\infty\sum_{e=1}^\infty\frac{(ax^d)^e}{e} \\
& = \sum_{d=1}^\infty\ln(1-ax^d)
\end{flalign}\]
Theorem: \[\sum_{k=1}^\infty\ln(1-ax^k) = \sum_{k=1}^\infty\rho_k(a)\ln(1-x^k)\]
where |\rho_k(a) = k^{-1}\sum_{d\mid k}\varphi(k/d)a^d|.
Taking the logarithmic derivative of a Dirichlet series leads to the identity |\frac{d}{ds}\ln\mathcal D[f](s) = \mathcal D[f]’(s)/\mathcal D[f](s) = -\mathcal D[f\ln \star f^{-1}](s)|.
For example, we have |-\zeta’(s)/\zeta(s) = \mathcal D[\ln \star \mu](s)|. Using the Euler
product formula, we have |\ln\zeta(s) = -\sum_{p\in\mathbb P}\ln(1-p^{-s})|. Differentiating
this gives
\[\begin{flalign}
\frac{d}{ds}\ln\zeta(s)
& = -\sum_{p\in\mathbb P} p^{-s}\ln p/(1 - p^{-s}) \\
& = -\sum_{p\in\mathbb P} \sum_{k=1}^\infty \ln p (p^k)^{-s} \\
& = -\sum_{n=1}^\infty \Lambda(n) n^{-s} \\
& = -\mathcal D[\Lambda](s)
\end{flalign}\]
where |\Lambda(n) = \begin{cases}\ln p,&p\in\mathbb P\land\exists k\in\mathbb N_+.n=p^k \\ 0, & \text{otherwise}\end{cases}|.
|\Lambda|, which is neither a multiplicative nor an additive function, is known as the von Mangoldt function.
Just to write it explicitly, the above implies |\Lambda = \ln \star \mu|, i.e. |\Lambda| is the
Möbius inversion of |\ln|. This can be generalized for arbitrary completely multiplicative
functions besides |\bar 1| to get |\mathcal D[f]’/\mathcal D[f] = -\mathcal D[f\Lambda]|.
We now have multiple perspectives on |\Lambda| which is a kind of “indicator function” for prime
powers.
Dirichlet Inverse
Let’s say we’re given an arithmetic function |f|, and we want to find an arithmetic function |g|
such that |f \star g = \delta| which we’ll call the Dirichlet inverse of |f|.
We immediately get |(f \star g)(1) = f(1)g(1) = 1 = \delta(1)|.
So, supposing |f(1)\neq 0|, we can define |g(1) = 1/f(1)|. We then get a recurrence relation for
all the remaining values of |g| via:
\[0 = (f \star g)(n) = f(1)g(n) + \sum_{d \mid n, d\neq 1} f(d)g(n/d)\]
for |n > 1|. Solving for |g(n)|, we have:
\[g(n) = -f(1)^{-1}\sum_{d\mid n,d\neq 1}f(d)g(n/d)\]
where the right-hand side only requires |g(k)| for |k < n|. If |f| is multiplicative, then
|f(1) = 1| and the inverse of |f| exists.
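As an illustration of this recurrence, here is a deliberately naive Haskell sketch (my own names, no memoization, exact arithmetic via Rational):
divisors :: Integer -> [Integer]
divisors n = [d | d <- [1..n], n `mod` d == 0]

-- g = dirichletInverse f satisfies (f ⋆ g)(n) = δ(n), assuming f 1 /= 0.
dirichletInverse :: (Integer -> Rational) -> Integer -> Rational
dirichletInverse f = g
  where
    g 1 = 1 / f 1
    g n = negate (recip (f 1)) * sum [f d * g (n `div` d) | d <- divisors n, d /= 1]

-- For example, map (dirichletInverse (const 1)) [1..10] gives the Möbius function
-- values 1,-1,-1,0,-1,1,-1,0,0,1 (as Rationals), since μ is the inverse of 1̄.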
If |f| is completely multiplicative, its Dirichlet inverse is |\mu f|. This follows easily from
|f \star \mu f = (\bar 1 \star \mu)f = \delta f = \delta|. As an example, |({-})^z| is
completely multiplicative so its inverse is |({-})^z\mu|. Since the inverse of a Dirichlet
convolution is the convolution of the inverses, we get |\varphi^{-1}(n) = \sum_{d\mid n}d\mu(d)|.
Not to be confused with |\varphi(n) = (\operatorname{id}\star\mu)(n) = \sum_{d\mid n} d\mu(n/d)|.
Less trivially, the inverse of a multiplicative function is also a multiplicative function.
We can prove it by complete induction on |\mathbb N_+| using the formula for |g| from above.
Theorem: If |f\star g = \delta|, then |g| is multiplicative when |f| is.
Proof: Let |n = ab| where |a| and |b| are coprime. If |a| (or, symmetrically, |b|) is equal to
|1|, then since |g(1) = 1/f(1) = 1|, we have |g(1n) = g(1)g(n) = g(n)|. Now assume neither |a| nor
|b| are |1| and, as the induction hypothesis, assume that |g| is multiplicative on all numbers less
than |n|. We have:
\[\begin{flalign}
g(ab)
& = -\sum_{d\mid ab,d\neq 1}f(d)g(ab/d) \\
& = -\sum_{d_a \mid a}\sum_{d_b \mid b,d_a d_b \neq 1}f(d_ad_b)g(ab/(d_ad_b)) \\
& = -\sum_{d_a \mid a}\sum_{d_b \mid b,d_a d_b \neq 1}f(d_a)f(d_b)g(a/d_a)g(b/d_b) \\
& = -\sum_{d_b \mid b,d_b \neq 1}f(d_b)g(a)g(b/d_b)
- \sum_{d_a \mid a,d_a \neq 1}\sum_{d_b \mid b}f(d_a)f(d_b)g(a/d_a)g(b/d_b) \\
& = -g(a)\sum_{d \mid b,d \neq 1}f(d)g(b/d)
- \sum_{d_a \mid a,d_a \neq 1}f(d_a)g(a/d_a)\sum_{d_b \mid b}f(d_b)g(b/d_b) \\
& = g(a)g(b) - \sum_{d_a \mid a,d_a \neq 1}f(d_a)g(a/d_a) (f \star g)(b) \\
& = g(a)g(b) - \delta(b)\sum_{d_a \mid a,d_a \neq 1}f(d_a)g(a/d_a) \\
& = g(a)g(b)
\end{flalign}\] |\square|
Assuming |f| has a Dirichlet inverse, we also have:
\[\mathcal D[f^{-1}](s) = \mathcal D[f](s)^{-1}\]
immediately from the convolution theorem.
As an example, |\eta(s) = (1 - 2^{1-s})\zeta(s) = \mathcal D[f](s)| where
|f(n) = \begin{cases}-1,&n\text{ even}\\1,&n\text{ odd}\end{cases}|.
Alternatively, |f(n) = \mu(\gcd(n, 2))| and we can apply the above formula to see:
\[\begin{flalign}
\mathcal D[\mu(\gcd({-},2))]
& = \zeta(s)(1-2^{-s})\left(\frac{\mu(2)2^{-2s}}{1 - 2^{-s}} + \sum_{n=0}^1 \mu(2^n)2^{-ns}\right) \\
& = \zeta(s)(1-2^{-s})\left(\frac{-2^{-2s}}{1 - 2^{-s}} + 1 - 2^{-s}\right) \\
& = \zeta(s)(-2^{-2s} + (1 - 2^{-s})^2) \\
& = \zeta(s)(1 - 2^{1-s})
\end{flalign}\]
|\lambda| and |\gamma|
Recall that |\lambda| is completely multiplicative and is characterized by |\lambda(p) = -1|.
We can show that |\mathcal D[\lambda](s) = \zeta(2s)/\zeta(s)| which is equivalent to saying
|\bar 1^{[2]} \star \mu = \lambda| or |\lambda\star\bar 1 = \bar 1^{[2]}|.
This implies that |(\gamma\star\mu)(p^n) = \begin{cases}-2, & n=1 \\ 0, & n > 1 \end{cases}|.
Indicator Functions
Let |1_{\mathbb P}| be the indicator function for the primes.
We have |\omega = 1_{\mathbb P}\star\bar 1| or |1_{\mathbb P} = \omega\star\mu|. Directly,
|\mathcal D[1_{\mathbb P}](s) = \sum_{p\in\mathbb P}p^{-s}| so we have
|\mathcal D[\omega](s)/\zeta(s) = \sum_{p\in\mathbb P} p^{-s}|.
Let |1_{\mathcal P}| be the indicator function for prime powers.
|\Omega = 1_{\mathcal P}\star\bar 1| or |1_{\mathcal P} = \Omega\star\mu|.
|\mathcal D[1_{\mathcal P}](s) = \sum_{p\in\mathbb P}\frac{p^{-s}}{1 - p^{-s}}| so we have
|\mathcal D[\Omega](s)/\zeta(s) = \sum_{p\in\mathbb P}\frac{p^{-s}}{1 - p^{-s}}|.
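Here is a tiny brute-force Haskell sketch (my own helpers) making these divisor-sum identities concrete: summing the prime indicator over the divisors of |n| counts the distinct prime factors, while summing the prime-power indicator counts prime factors with multiplicity:
isPrime :: Int -> Bool
isPrime n = n > 1 && null [d | d <- [2 .. n - 1], n `mod` d == 0]

isPrimePower :: Int -> Bool
isPrimePower n = any (\p -> isPrime p && isPowerOf p n) [2 .. n]
  where
    isPowerOf p m = m == 1 || (m `mod` p == 0 && isPowerOf p (m `div` p))

-- ω(n) = (1_P ⋆ 1̄)(n) and Ω(n) = (1_𝒫 ⋆ 1̄)(n), computed as divisor sums.
littleOmega, bigOmega :: Int -> Int
littleOmega n = length [d | d <- [1 .. n], n `mod` d == 0, isPrime d]
bigOmega    n = length [d | d <- [1 .. n], n `mod` d == 0, isPrimePower d]

-- For example, littleOmega 12 == 2 (the primes 2 and 3 divide 12) and
-- bigOmega 12 == 3 (the prime powers 2, 4, and 3 divide 12).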
Lemma: |\mathcal D[1_{\mathcal P}](s)=\sum_{n=1}^\infty \frac{\varphi(n)}{n}\ln\zeta(ns)| Proof: This is quite similar to the previous proof.
\[\begin{align}
\sum_{n=1}^\infty \frac{\varphi(n)}{n}\ln\zeta(ns)
& = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}\sum_{N=kn}\varphi(n) \\
& = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N}(\varphi\star\bar 1)(N) \\
& = \sum_{p\in\mathbb P}\sum_{N=1}^\infty \frac{p^{-Ns}}{N} N \\
& = \sum_{p\in\mathbb P}\sum_{N=1}^\infty p^{-Ns} \\
& = \mathcal D[1_{\mathcal P}](s)
\end{align}\] |\square|
Summatory Functions
One thing we’ve occasionally been taking for granted is that the operator |\mathcal D| is
injective. That is, |\mathcal D[f] = \mathcal D[g]| if and only if |f = g|. To show this, we’ll
use the fact that we can (usually) invert
the Mellin transform which can be viewed roughly as a version of |\mathcal D| that operates on
continuous functions.
Before talking about the Mellin transform, we’ll talk about summatory functions as this will ease
our later discussion.
We will turn a sum into a continuous function via a zero-order hold, i.e. we will take the floor
of the input. Thus |\sum_{n\leq x} f(n)| is constant on any interval of the form |[k,k+1)|. It
then (potentially) has jump discontinuities at integer values. The beginning of the sum is at |n=1|
so for all |x<1|, the sum up to |x| is |0|. We will need a slight tweak to better deal with these
discontinuities. This will be indicated by a prime on the summation sign.
For non-integer values of |x|, we have:
\[\sum_{n \leq x}’ f(n) = \sum_{n \leq x} f(n)\]
For |m| an integer, we have:
\[
\sum_{n \leq m}’ f(n)
= \frac{1}{2}\left(\sum_{n<m} f(n) + \sum_{n \leq m} f(n)\right)
= \sum_{n\leq m} f(n) - f(m)/2
\]
This kind of thing should be familiar to those who’ve worked with things like Laplace transforms of
discontinuous functions.
(Not for no reason…)
One reason for introducing these summation functions is they are a little easier to
work with. Arguably, we want something like |\frac{d}{dx}\sum_{n\leq x}f(n) = \sum_{n=1}^\infty f(n)\delta(n-x)|,
but that means we end up with a bunch of distribution nonsense and even more improper integrals.
The summation function may be discontinuous, but it at least has a finite value everywhere.
Of course, another reason for introducing these functions is that they often are values we’re
interested in.
Several important functions arise as continuous “sums” of arithmetic functions of this form.
Let’s consider the arithmetic function |\Lambda/\ln| whose Dirichlet series is |\ln\zeta|.
We have the summation function |\sum_{n\leq x}’ \Lambda(n)/\ln(n)|, but |\Lambda(n)| is |0|
except when |n=p^k| for some |p\in\mathbb P| and |k\in\mathbb N_+|. Therefore, we have
\[\begin{align}
\sum_{n\leq x}’ \frac{\Lambda(n)}{\ln(n)}
& = \sum_{k=1}^\infty\sum_{p^k\leq x, p\in\mathbb P}’ \frac{\Lambda(p^k)}{\ln(p^k)} \\
& = \sum_{k=1}^\infty\sum_{p^k\leq x, p\in\mathbb P}’ \frac{\ln(p)}{k\ln(p)} \\
& = \sum_{k=1}^\infty\sum_{p^k\leq x, p\in\mathbb P}’ \frac{1}{k} \\
& = \sum_{k=1}^\infty \frac{1}{k} \sum_{p^k\leq x, p\in\mathbb P}’ 1 \\
& = \sum_{k=1}^\infty \frac{1}{k} \sum_{p\leq x^{1/k}, p\in\mathbb P}’ 1 \\
& = \sum_{k=1}^\infty \frac{\pi(x^{1/k})}{k} \\
\end{align}\]
|\ln\zeta(s) = s\mathcal M[\Pi_0](-s)=\mathcal D[\Lambda/\ln](s)|
where |\mathcal M| is the Mellin transform, and the connection to Dirichlet series is described
in the following section.
The contour integral is intended to mean the vertical line with real part |c| traversed from
negative to positive imaginary values. Modulo the opposite sign of |s| and the extra factor of |x|,
this is quite similar to a continuous version of a Dirichlet series.
There are side conditions on the convergence of |\mathcal D[f]| for these formulas to be
justified. See the links.
Many of the operations we’ve described on Dirichlet series follow from Mellin transform properties.
For example, we have |\mathcal M[f]’(s) = \mathcal M[f\ln](s)| generally.
Dirichlet convolution forms a commutative ring with it as the multiplication, |\delta| as the
multiplicative unit and the usual additive structure. This is to say that Dirichlet convolution
is commutative, associative, unital, and bilinear.
For |f| completely multiplicative, |f(g\star h) = fg \star fh|.
Dirichlet Inverse
For any |f| such that |f(1)\neq 0|, there is a |g| such that |f\star g = \delta|. In particular,
the set of multiplicative functions forms a subgroup of this multiplicative group, i.e. the
Dirichlet convolution of multiplicative functions is multiplicative.
If |f(1) \neq 0|, then |f \star g = \delta| where |g| is defined by the following recurrence:
\[g(1) = 1/f(1),\qquad g(n) = -f(1)^{-1}\sum_{d\mid n,d\neq 1}f(d)g(n/d)\text{ for }n > 1\]
This means that from a divisor sum |g(n) = \sum_{d\mid n}f(d) = (f\star\bar 1)(n)| for each |n|, we can
recover |f| via |g\star\mu = f\star\bar 1\star\mu = f|, which is to say
|f(n)=\sum_{d\mid n}g(d)\mu(n/d)|.
This can be generalized via |({-})^k\mu\star({-})^k = \delta|. In sums, this means when
|g(n)=\sum_{d\mid n}d^k f(n/d)|, then |f(n)=\sum_{d\mid n}\mu(d)d^k g(n/d)|.
Let |h| be a completely multiplicative function.
Given |g(m) = \sum_{n=1}^\infty f(mh(n))|, then |f(n) = \sum_{m=1}^\infty \mu(m)g(nh(m))|.
Using the Möbius function for finite multisets and their inclusion ordering, we can recast
Möbius inversion of naturals as Möbius inversion of finite multisets (of primes) a la:
\[\varphi(n_P)
= \sum_{Q\subseteq P}\mu(P\setminus Q)n_Q
= \sum_{Q\subseteq P}\mu(n_P/n_Q)n_Q
= \sum_{d\mid n_P}\mu(n_P/d)d
\]
As a nice result, we have:
\[\sum_{n=1}^\infty\ln(1-ax^n) = \sum_{n=1}^\infty\rho_n(a)\ln(1-x^n)\]
where |n\rho_n(a) = (\varphi \star a^{({-})})(n)|.
Given an arithmetic function |a|, Lambert series are series of the form:
\[
\sum_{n=1}^\infty a(n) \frac{x^n}{1-x^n} = \sum_{n=1}^\infty (a \star \bar 1)(n) x^n
\]
Defining a function by |f(p^n)=\cdots| specifies a multiplicative/additive function, while defining it
by |f(p)=\cdots| specifies a completely multiplicative/additive function.
|p^z| for |z\in\mathbb C| is completely multiplicative. This includes the identity function
(|z=1|) and |\bar 1| (|z=0|). For any multiplicative |f|, |f\circ \gcd({-},k)| is multiplicative.
|\ln| is completely additive.
Important but neither additive nor multiplicative are the indicator functions for primes
|1_{\mathbb P}| and prime powers |1_{\mathcal P}|.
The following functions are (completely) multiplicative unless otherwise specified.
Viewing natural numbers as multisets, |D_n| is
the set of all sub-multisets of |n|. The isomorphism described is then simply the fact that given
any sub-multiset of the union of two disjoint multisets, we can sort the elements into their
original multisets producing two sub-multisets of the disjoint multisets.↩︎
Incidence algebras are a decategorification
of the notion of a category algebra.↩︎
CuTe is a C++ library that aims to make dealing with complicated indexing easier. A key part of how it does this is by defining a Layout type, which specifies how to map from logical coordinates to physical locations (CuTe likes to say layouts are "functions from integers to integers.") In fact, CuTe layouts are a generalization of PyTorch strides, which say you always do this mapping by multiplying each coordinate with its respective stride and summing them together, e.g., i0 * s0 + i1 * s1 + .... Although NVIDIA's docs don't spell it out, the CuTe's generalization here is actually very natural, and in this blog post I'd like to explain how you could have invented it (on a good day).
First, a brief recap about strides. PyTorch views allow us to reinterpret the physical layout of a tensor in different ways, changing how we map logical coordinates into physical locations. For example, consider this 2-D tensor:
The physical memory reads 0, 1, 2, 3, and if I want to know what the value at coordinate (0, 1) is (row 0, col 1), I compute 0 * 2 + 1 * 1, which tells me I should read out the value at index 1 in physical memory. If I change the strides, I can change the order I read out the physical locations. For example, if I transpose I have:
The physical memory hasn't changed, but now when we read out coordinate (0, 1), we compute 0 * 1 + 1 * 2, which tells me I should read the value at index 2 (which is indeed what I see at this coordinate!)
PyTorch also allows us to "flatten" dimensions of a tensor, treating them as a 1D tensor. Intuitively, a 2-D tensor flattened into a 1-D one involves just concatenating all the rows together into one line:
We should be able to do this for the transpose too, getting tensor([0, 2, 1, 3]), but instead, this is what you get:
>>> torch.arange(4).view(2, 2).T.view(-1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
The dreaded "use reshape instead" error! The error is unavoidable under PyTorch striding: there is no stride we can select that will cause us to read the elements in this order (0, 2, 1, 3); after all, i0 * s0 is a pretty simple equation, we can't simultaneously have 1 * s0 == 2 and 2 * s0 == 1.
Upon learning this, an understandable reaction is to just shrug, assume that this is impossible to fix, and move on with your life. But today, you are especially annoyed by this problem, because you were only trying to flatten N batch dimensions into a single batch dimension so that you could pass it through a function that only works with one batch dimension, with the plan of unflattening it when you're done. It doesn't matter that this particular layout is inexpressible with strides; you aren't going to rely on the layout in any nontrivial way, you just care that you can flatten and then unflatten back to the original layout.
Imagine we're dealing with a tensor of size (2, 2, 2) where the strides for dim 0 and dim 1 were transposed as (2, 4, 1). It should be OK to flatten this into a tensor (4, 2) and then unflatten it back to (2, 2, 2). Intuitively, I'd like to "remember" what the original sizes and strides are, so that I can go back to them. Here's an idea: let's just store the original size/stride as a nested entry in our size tuple. So instead of the size (4, 2), we have ((2, 2), 2); and now analogously the stride can simply be ((2, 4), 1). When I write (2, 2) as the "size" of a dimension, I really just mean the product 4, but there is some internal structure that affects how I should index its inside, namely, the strides (2, 4). If I ask for the row at index 2, I first have to translate this 1D coordinate into a 2D coordinate (1, 0), and then apply the strides to it like before.
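To make that idea concrete, here is a minimal sketch written in Haskell with made-up names (this is not CuTe's API): a layout is a tree whose leaves carry a size and a stride, and indexing first splits a flat coordinate according to the nested sizes and then applies the strides.
-- A layout is a tree of (size, stride) leaves.
data Layout = Leaf Int Int     -- size, stride
            | Node [Layout]    -- nested structure

size :: Layout -> Int
size (Leaf n _) = n
size (Node ls)  = product (map size ls)

-- Map a flat coordinate to a physical offset. For a nested node we peel off
-- coordinates left to right (plain lexicographic order here, unlike CuTe's
-- colexicographic default).
index :: Layout -> Int -> Int
index (Leaf _ stride) i = i * stride
index (Node ls) i = go ls i
  where
    go [] _ = 0
    go (l : rest) j =
      let restSize = product (map size rest)
          (q, r)   = j `divMod` restSize
      in index l q + go rest r

-- The transposed 2x2 from earlier, flattened: size (2, 2) with strides (1, 2)
-- reads physical memory in the order [0,2,1,3].
flatTransposed :: [Int]
flatTransposed = [index (Node [Leaf 2 1, Leaf 2 2]) i | i <- [0 .. 3]]

-- The ((2, 2), 2) size with ((2, 4), 1) strides: asking for "row 2, column 0"
-- (flat coordinate 4) lands at physical offset 2, as described above.
nested :: Layout
nested = Node [Node [Leaf 2 2, Leaf 2 4], Leaf 2 1]
-- index nested 4 == 2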
Well, it turns out, this is exactly how CuTe layouts work! In CuTe, sizes/strides are hierarchical: a size is actually a tree of ints, where the hierarchy denotes internal structure of a dimension that you can address linearly (in fact, everything by default can be addressed in a 1-D linear way, even if it's an N-D object.) The documentation of Layout does say this... but I actually suffered a lot extracting out the high level intuition of this blog post, because CuTe uses co-lexicographic ordering when linearizing (it iterates over coordinates (0,0), (1,0), (2,0), etc. rather than in the more normal lexicographic order (0,0), (0,1), (0,2)). This leads to some truly deranged example code where they print a 2D matrix in conventional lexicographic ordering, and then turn around and say, "But wait, if I have the layout take care of translating the 1D coordinate into an ND coordinate, it is colexicographic!!":
In any case, if you want to engage with the documentation, s2xh4 is the important example to pay attention to for understanding the nested semantics. However, note the example is smeared across like five sections and also you need to know about the co-lexicographic thing to understand why the examples print the way they do.
In a post from a year
ago,
I explored how to prove decidable equality in Agda of a particular
indexed data type. Recently, I discovered a different way to
accomplish the same thing, without resorting to embedded sigma types.
This post is literate Agda; you can download it
here
if you want to play along. I tested everything here with Agda version
2.6.4.3 and version 2.0 of the standard library. (I assume it would
also work with more recent versions, but haven’t tested it.)
Background
This section is repeated from my previous
post,
which I assume no one remembers.
First, some imports and a module declaration. Note that the entire
development is parameterized by some abstract set B of base types,
which must have decidable equality.
We’ll work with a simple type system containing base types, function
types, and some distinguished type constructor □. So far, this is
just to give some context; it is not the final version of the code we
will end up with, so we stick it in a local module so it won’t end up
in the top-level namespace.
module Unindexed where

  data Ty : Set where
    base : B → Ty
    _⇒_  : Ty → Ty → Ty
    □_   : Ty → Ty
For example, if \(X\) and \(Y\) are base types, then we could write down a
type like \(\square ((\square \square X \to Y) \to \square Y)\):
  infixr 2 _⇒_
  infix 30 □_

  postulate BX BY : B

  X : Ty
  X = base BX

  Y : Ty
  Y = base BY

  example : Ty
  example = □ ((□ □ X ⇒ Y) ⇒ □ Y)
However, for reasons that would take us too far afield in this blog
post, I don’t want to allow immediately nested boxes, like \(\square \square X\). We can still have multiple boxes in a type, and even
boxes nested inside of other boxes, as long as there is at least one
arrow in between. In other words, I only want to rule out boxes
immediately applied to another type with an outermost box. So we
don’t want to allow the example type given above (since it contains
\(\square \square X\)), but, for example, \(\square ((\square X \to Y) \to \square Y)\) would be OK.
Two encodings
In my previous blog
post,
I ended up with the following encoding of types indexed by a Boxity,
which records the number of top-level boxes. Since the boxity of the
arguments to an arrow type do not matter, we make them sigma types
that package up a boxity with a type having that boxity. I was then
able to define decidable equality for ΣTy and Ty by mutual
recursion.
data Boxity : Set where
  ₀ : Boxity
  ₁ : Boxity

variable
  b b₁ b₂ b₃ b₄ : Boxity

module WithSigma where

  ΣTy : Set
  data Ty : Boxity → Set

  ΣTy = Σ Boxity Ty

  data Ty where
    □_   : Ty ₀ → Ty ₁
    base : B → Ty ₀
    _⇒_  : ΣTy → ΣTy → Ty ₀
The problem is that working with this definition of Ty is really
annoying! Every time we construct or pattern-match on an arrow type,
we have to package up each argument type into a dependent pair with
its Boxity; this introduces syntactic clutter, and in many cases we
know exactly what the Boxity has to be, so it’s not even
informative. The version we really want looks more like this:
data Ty : Boxity → Set where
  base : B → Ty ₀
  _⇒_  : {b₁ b₂ : Boxity} → Ty b₁ → Ty b₂ → Ty ₀
  □_   : Ty ₀ → Ty ₁

infixr 2 _⇒_
infix 30 □_
In this version, the boxities of the arguments to the arrow
constructor are just implicit parameters of the arrow constructor
itself. Previously, I was unable to get decidable equality to go
through for this version… but just the other day, I finally realized
how to make it work!
Path-dependent equality
The key trick that makes everything work is to define a
path-dependent equality type. I learned this from Martín
Escardó.
The idea is that we can express equality between two indexed things
with different indices, as long as we also have an equality between
the indices.
_≡⟦_⟧_ : {A : Set} {B : A → Set} {a₀ a₁ : A} → B a₀ → a₀ ≡ a₁ → B a₁ → Set
b₀ ≡⟦ refl ⟧ b₁ = b₀ ≡ b₁
That’s exactly what we need here: the ability to express
equality between Ty values, which may be indexed by different
boxities—as long as we know that the boxities are equal.
Decidable equality for Ty
We can now use this to directly encode decidable equality for Ty.
First, we can easily define decidable equality for Boxity.
Here is the type of the decision procedure: given two Ty values
which may have different boxities, we decide whether or not we can
produce a witness to their equality. Such a witness consists of a
pair of (1) a proof that the boxities are equal, and (2) a proof
that the types are equal, depending on (1). We would really like to
write this as Σ (b₁ ≡ b₂) λ p → σ ≡⟦ p ⟧ τ, but for some reason Agda
requires us to fill in some extra implicit arguments before it is
happy that everything is unambiguous, requiring some ugly syntax.
Ty-≟′ : (σ : Ty b₁) → (τ : Ty b₂) → Dec (Σ (b₁ ≡ b₂) λ p → _≡⟦_⟧_ {_} {Ty} σ p τ)
Before showing the definition of Ty-≟′, let’s see that we can use it
to easily define both a boxity-homogeneous version of decidable
equality for Ty, as well as decidable equality for Σ Boxity Ty:
A lot of pattern matching on refl and everything falls out quite easily.
And now the definition of Ty-≟′. It looks complicated, but it is
actually not very difficult. The most interesting case is when
comparing two arrow types for equality: we must first compare the
boxities of the arguments, then consider the arguments themselves once
we know the boxities are equal.
Ty-≟′ (□ σ) (□ τ) with Ty-≟′ σ τ
... | yes (refl , refl) = yes (refl , refl)
... | no σ≢τ = no λ { (refl , refl) → σ≢τ (refl , refl) }
Ty-≟′ (base S) (base T) with ≟B S T
... | yes refl = yes (refl , refl)
... | no S≢T = no λ { (refl , refl) → S≢T refl }
Ty-≟′ (_⇒_ {b₁} {b₂} σ₁ σ₂) (_⇒_ {b₃} {b₄} τ₁ τ₂)
  with Boxity-≟ b₁ b₃ | Boxity-≟ b₂ b₄ | Ty-≟′ σ₁ τ₁ | Ty-≟′ σ₂ τ₂
... | no b₁≢b₃ | _ | _ | _ = no λ { (refl , refl) → b₁≢b₃ refl }
... | yes _ | no b₂≢b₄ | _ | _ = no λ { (refl , refl) → b₂≢b₄ refl }
... | yes _ | yes _ | no σ₁≢τ₁ | _ = no λ { (refl , refl) → σ₁≢τ₁ (refl , refl) }
... | yes _ | yes _ | yes _ | no σ₂≢τ₂ = no λ { (refl , refl) → σ₂≢τ₂ (refl , refl) }
... | yes _ | yes _ | yes (refl , refl) | yes (refl , refl) = yes (refl , refl)
Ty-≟′ (□ _) (base _) = no λ ()
Ty-≟′ (□ _) (_ ⇒ _) = no λ ()
Ty-≟′ (base _) (□ _) = no λ ()
Ty-≟′ (base _) (_ ⇒ _) = no λ { (refl , ()) }
Ty-≟′ (_ ⇒ _) (□ _) = no λ ()
Ty-≟′ (_ ⇒ _) (base _) = no λ { (refl , ()) }
Let’s prove in Haskell (in one line) that these two statements, taken
together, imply that I am my own baby.
The normal proof
The normal proof using propositional logic goes as follows:
If everyone loves Baby, Baby must love baby. (instantiate axiom 1 with \(x =
\text{Baby}\)).
If baby loves someone, that someone must be me. (axiom 2)
Therefore, because baby loves baby, baby must be me. (instantiate axiom 2
with axiom 1 with \(x = \text{Baby}\))
Haskell as a Theorem Prover
First, some background: when using Haskell as a theorem prover, you represent
the theorem as a type, and proving it involves constructing a
value of that type — you create an inhabitant of that type.
Using the Curry-Howard correspondence (often also called the Curry-Howard
isomorphism), we can pair some simple logical connectives with types:
Logical “and” corresponds to tupling (or records of values). If
(a, b) is inhabited, it means that both a and
b are inhabited.
Logical “or” corresponds to sums, Either a b being inhabited
implies that either a or b is inhabited. They might
both be inhabited, but Either a b requires the “proof” of only
one.
Constructivist logical implication is a function: If a -> b
is inhabited, it means that an inhabitant of a can be used to
create an inhabitant of b.
Any type with a constructor is “true”: (), Bool,
String, etc.; any type with no constructor (data Void)
is “false” because it has no inhabitants.
Introducing type variables (forall a.) corresponds to…well, for
all. forall a. Either a () means that Either a ()
is “true” (inhabited) for all possible a. This one is represented
logically as \(\forall x. x \lor
\text{True}\).
You can see that, by chaining together those primitives, you can translate a
lot of simple proofs. For example, the proof of “If x and
y together imply z, then x implies that
y implies z”:
\[
\forall x y z. ((x \wedge y) \implies z) \implies (x \implies (y \implies z))
\]
can be expressed as:
curry :: forall a b c. ((a, b) -> c) -> a -> b -> c
curry f x y = f (x, y)
Or maybe, “If either x or y imply z, then x implies z and y implies z,
independently:”
\[
\forall x y z. ((x \lor y) \implies z) \implies ((x \implies z) \land (y
\implies z))
\]
In Haskell:
unEither :: (Either a b -> c) -> (a -> c, b -> c)
unEither f = (f . Left, f . Right)
And, we have a version of negation: if a -> Void is
inhabited, then a must be uninhabited (the principle of
explosion). Let’s prove that “‘x or y’ being false implies both x and y are
false”: \(\forall x y. \neg(x \lor y)
\implies (\neg x \wedge \neg y)\)
deMorgan :: (Either a b -> Void) -> (a -> Void, b -> Void)
deMorgan f = (f . Left, f . Right)
(Maybe surprisingly, that’s the same proof as unEither!)
We can also think of “type functions” (type constructors that take arguments)
as “parameterized propositions”:
data Maybe a = Nothing | Just a
Maybe a (like \(\text{Maybe}(x)\)) is the proposition that \(\text{True} \lor x\): Maybe a is
always inhabited, because “True or X” is always True. Even
Maybe Void is inhabited, as Nothing :: Maybe Void.
The sky is the limit if we use GADTs. We can create arbitrary propositions by
restricting what types constructors can be called with. For example, we can
create a proposition that x is an element of a list:
data Elem :: k -> [k] -> Type where
  Here  :: Elem x (x : xs)
  There :: !(Elem x ys) -> Elem x (y : ys)
Read this as “Elem x xs is true (inhabited) if either
x is the first item, or if x is an elem of the tail of
the list”. So for example, Elem 5 [1,5,6] is inhabited but
Elem 7 [1,5,6] is not:1
itsTrue :: Elem 5 [1,5,6]
itsTrue = There Here

itsNotTrue :: Elem 7 [1,5,6] -> Void
itsNotTrue = \case {}  -- GHC is smart enough to know both cases are invalid
We can create a two-argument proposition that two types are equal,
a :~: b:
data (:~:) :: k -> k -> Type where
  Refl :: a :~: a
The proposition a :~: b is only inhabited if a is
equal to b, since Refl is its only constructor.
Of course, this whole correspondence assumes we aren’t ever touching bottom
(things like undefined or let x = x in x). For this
exercise, we are working in a total subset of Haskell.
The Baby Paradox
Now we have enough. Let’s parameterize it over a proposition
loves, where loves a b being inhabited means that
a loves b.
We can express our axiom as a record of propositions in terms of the atoms
loves, me, and baby:
data BabyAxioms loves me baby = BabyAxioms
  { everybodyLovesMyBaby :: forall x. loves x baby
  , myBabyOnlyLovesMe    :: forall x. loves baby x -> x :~: me
  }
The first axiom everybodyLovesMyBaby means that for any x, loves x baby must be “true” (inhabited). The second
axiom myBabyOnlyLovesMe means that if we have a
loves baby x (if my baby loves someone), then it must be that
x ~ me: we must be able to derive that the person the baby loves is
indeed me.
The expression of the baby paradox then relies on writing the function
babyParadox :: BabyAxioms loves me baby -> me :~: baby
And indeed if we play around with GHC enough, we’ll get this typechecking
implementation:
babyParadox :: BabyAxioms loves me baby -> me :~: baby
babyParadox BabyAxioms{everybodyLovesMyBaby, myBabyOnlyLovesMe} =
  myBabyOnlyLovesMe everybodyLovesMyBaby
Using x & f = f x from Data.Function, this becomes
a bit smoother to read:
babyParadox :: BabyAxioms loves me baby -> me :~: baby
babyParadox BabyAxioms{everybodyLovesMyBaby, myBabyOnlyLovesMe} =
  everybodyLovesMyBaby & myBabyOnlyLovesMe
And we have just proved it! It ended up being a one-liner. So, given the
BabyAxioms loves me baby, it is possible to prove that
me must be equal to baby. That is, it is
impossible to create any BabyAxioms without me and
baby being the same type.
The actual structure of the proof goes like this:
First, we instantiated everybodyLovesMyBaby with
x ~ baby, to get loves baby baby.
Then, we used myBabyOnlyLovesMe, which normally takes
loves baby x and returns x :~: me. Because we give it
loves baby baby, we get a baby :~: me!
And that’s exactly the same structure of the original symbolic proof.
What is Love?
We made BabyAxioms parametric over loves,
me, and baby, which means that these apply in
any universe where love, me, and baby follow the rules of the song
lyrics.
Essentially this means that for any binary relationship
Loves x y, if that relationship follows these axioms, it
must be true that me is baby. No matter what that relationship actually
is, concretely.
That being said, it might be fun to play around with what this might look
like in concrete realizations of love, me, and my baby.
First, we could imagine that Love is completely mundane, and can be created
between any two operands without any extra required data or constraints —
essentially, a proxy
between two phantoms:
data Love a b = Love
In this case, it’s impossible to create a BabyAxioms where
me and baby are different:
data Love a b = Love

-- | me ~ baby is a constraint required by GHC
proxyLove :: (me ~ baby) => BabyAxioms Love me baby
proxyLove = BabyAxioms
  { everybodyLovesMyBaby = Love
  , myBabyOnlyLovesMe    = \_ -> Refl
  }
The me ~ baby constraint being required by GHC is actually an
interesting manifestation of the paradox itself, without an explicit proof
required on our part. Alternatively, and more traditionally, we can write
proxyLove :: BabyAxioms Love baby baby or
proxyLove :: BabyAxioms Love me me to mean the same thing.
We can imagine another concrete universe where it is only possible to love my
baby, and my baby is the singular recipient of love in this entire universe:
data LoveOnly :: k -> k -> k -> Type where
  LoveMyBaby :: LoveOnly baby x baby

onlyBaby :: BabyAxioms (LoveOnly baby) me baby
onlyBaby = BabyAxioms
  { everybodyLovesMyBaby = LoveMyBaby
  , myBabyOnlyLovesMe    = \case LoveMyBaby -> Refl
  }
Now we get both axioms fulfilled for free! Basically if we ever have a
LoveOnly baby x me, the only possible constructor is
LoveMyBaby :: LoveOnly baby x baby, so me must be
baby!
Finally, we could imagine that love has no possible construction, with no way
to construct or realize. In this case, love is the uninhabited
Void:
data Love a b
In this universe, we can finally fulfil myBabyOnlyLovesMe
without me being baby, because “my baby don’t love
nobody but me” is vacuously true if there is no possible love. However, we
cannot fulfil everybodyLovesMyBaby because no love is possible,
except in the case that the universe of people (k) is also empty.
But GHC doesn’t have any way to encode empty kinds, I believe (I would love to
hear of any techniques if you knew of any), so we cannot realize these axioms
even if forall (x :: k) is truly empty.
Note that we cannot fully encode the axioms purely as a GADT in Haskell — our
LoveOnly was close, but it is too restrictive: in a fully general
interpretation of the song, we want to be able to allow other recipients of love
besides baby. Basically, Haskell GADTs cannot express the eliminators necessary
to encode myBabyOnlyLovesMe purely structurally, as far as I am
aware. But I could be wrong.
Why
Nobody who listens to this song seriously believes that the speaker is
intending to convey that they are their own baby, or attempting to tantalize the
listener with an unintuitive tautology. However, this is indeed a common
homework assignment in predicate logic classes, and I wasn’t able to find anyone
covering this yet in Haskell, so I thought I might as well be the first.
Sorry, teachers of courses that teach logic through Haskell.
I’ve also been using this paradox as one of my go-to LLM stumpers, and it’s
actually only recently (with GPT 5) that it’s been able to get this right. Yay
the future? Before this, it would get stuck on trying to define a
Loves GADT, which is a dead end as previously discussed.
I’m pretty sure nobody has ever used it for anything useful, but
I wrote the entire decidable library
around manipulating propositions like this.↩︎
The GHC developers are very pleased to announce the availability of the
first alpha prerelease of GHC 9.14.1. Binary distributions, source
distributions, and documentation are available at downloads.haskell.org.
GHC 9.14 will bring a number of new features and improvements, including:
Significant improvements in specialisation:
The SPECIALISE pragma now allows use of type application syntax
The SPECIALISE pragma can be used to specialise for expression arguments
as well as type arguments.
Specialisation is now considerably more reliable in the presence of
newtypes
The specialiser is now able to produce specialisations with
polymorphic typeclass constraints, considerably broadening its scope.
Significant improvements in the GHCi debugger
Record fields can be defined to be non-linear when LinearTypes is enabled.
RequiredTypeArguments can now be used in more contexts
SSE/AVX support in the x86 native code generator backend
A major update of the Windows toolchain
… and many more
A full accounting of changes can be found in the release notes. Given the
many specialisation improvements and their potential for regression, we would
very much appreciate testing and performance characterisation on downstream
workloads.
Due to unexpected complications, this initial prerelease comes a bit later than
expected. Consequently, we expect to run a condensed schedule of three alphas in total prior to the
release candidate, rather than the cadence originally planned. We expect the next alpha
will come the week of 9 Sept. 2025, while the third will come 23 Sept. 2025,
with the release candidate coming 7 Oct. 2025.
We would like to thank the Zw3rk stake pool,
Well-Typed, Mercury, Channable, Tweag I/O, Serokell, SimSpace, the Haskell
Foundation, and other anonymous contributors whose on-going financial
and in-kind support has facilitated GHC maintenance and release
management over the years. Finally, this release would not have been
possible without the hundreds of open-source contributors whose work
comprise this release.
As always, do give this release a try and open a ticket if you see
anything amiss.
In our last article, we explored how to perform an in-order traversal of a binary search tree. Today we’ll do one final binary tree problem to solidify our understanding of some common tree patterns, as well as the tricky syntax for dealing with a binary tree in Rust.
If you want some interesting challenge problems using Haskell data structures, you should take our Solve.hs course. In particular, you’ll learn how to write a self-balancing binary tree to use for an ordered set!
The Problem
Today we will solve Zigzag Level Order Traversal. For any binary tree, we can think about it in terms of “levels” based on the number of steps from the root. So given this tree:
         45
        /  \
      32    50
     /  \     \
    5    40    100
        /  \
      37    43
We can visually see that there are 4 levels. So a normal level order traversal would return a list of 4 lists, where each list is a single level, ordered from left to right, visually speaking:
[45]
[32, 50]
[5, 40, 100]
[37, 43]
However, with a zigzag level order traversal, every other level is reversed. So we should get the following result for the input tree:
[45]
[50, 32]
[5, 40, 100]
[43, 37]
So we can imagine that we do the first level from left to right and then zigzag back to get the second level from right to left. Then we do left to right again for the third level, and so on.
The Algorithm
For our in-order traversal, we used a kind of depth-first search (DFS), and this approach is more common for tree-based problems. However, for a level-order problem, we want more of a breadth-first search (BFS). In a BFS, we explore states in order of their distance to the root. Since all nodes in a level have the same distance to the root, this makes sense.
Our general idea is that we’ll store a list of all the nodes from the prior level. Initially, this will just contain the root node. We’ll loop through this list, and create a new list of the values from the nodes in this list. This gets appended to our final result list.
While we’re doing this loop, we’ll also compose the list for the next level. The only trick is knowing whether to add each node’s left or right child to the next-level list first. This flips each iteration, so we’ll need a boolean tracking it that flips each time.
Once we encounter a level that produces no numbers (i.e. it only contains Nil nodes), we can stop iterating and return our list of lists.
Rust Solution
Now that we’re a bit more familiar with manipulating Rc RefCells, we’ll start with the Rust solution, framing it according to the two-loop structure in our algorithm. We’ll define stack1, which is the iteration stack, and stack2, where we accumulate the new nodes for the next layer. We also define our final result vector, a list of lists.
pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
let mut result: Vec<Vec<i32>> = Vec::new();
let mut stack1: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
stack1.push(root.clone());
let mut stack2: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
let mut leftToRight = true;
...
return result;
}
Our initial loop will continue until stack1 no longer contains any elements. So our basic condition is while (!stack1.is_empty()). However, there’s another important element here.
After we accumulate the new nodes in stack2, we want to flip the meanings of our two stacks. We want our accumulated nodes referred to by stack1, and stack2 to be an empty list to accumulate. We accomplish this in Rust by clearing stack1 at the end of our loop, and then using std::mem::swap to flip their meanings:
pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
let mut result: Vec<Vec<i32>> = Vec::new();
let mut stack1: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
stack1.push(root.clone());
let mut stack2: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
let mut leftToRight = true;
while (!stack1.is_empty()) {
let mut thisLayer = Vec::new(); // Values from this level
...
leftToRight = !leftToRight;
stack1.clear();
mem::swap(&mut stack1, &mut stack2);
}
return result;
}
In C++ we could accomplish something like this using std::move, but only because we want stack2 to be left in an empty state after the move:
stack1 = std::move(stack2);
Also, observe that we flip our boolean flag at the end of the iteration.
Now let’s get to work on the inner loop. This will actually go through stack1, add values to thisLayer, and accumulate the next layer of nodes for stack2. An interesting finding is that whether we’re going left to right or vice versa, we want to loop through stack1 in reverse. This means we’re treating it like a true stack instead of a vector, first accessing the last node to be added.
A left-to-right pass will add lefts and then rights. This means the right-most node in the next layer is on “top” of the stack, at the end of the vector. A right-to-left pass will first add the right child for a node before its left. This means the left-most node of the next layer is at the end of the vector.
Let’s frame up this loop, and also add the results of this layer to our final result vector.
pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
...
while (!stack1.is_empty()) {
let mut thisLayer = Vec::new(); // Values from this level
for node in stack1.iter().rev() {
...
}
if (!thisLayer.is_empty()) {
result.push(thisLayer);
}
leftToRight = !leftToRight;
stack1.clear();
mem::swap(&mut stack1, &mut stack2);
}
return result;
}
Note that we do not add the values array if it is empty. We allow ourselves to accumulate None nodes in our stack. The final layer we encounter will actually consist of all None nodes, and we don’t want this layer to add an empty list.
Now all we need to do is populate the inner loop. We only take action if the node from stack1 is Some instead of None. Then we follow a few simple steps:
Borrow the TreeNode from this RefCell
Push its value onto thisLayer.
Add its children (using clone) to stack2, in the right order.
Here’s the code:
pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
...
while (!stack1.is_empty()) {
let mut thisLayer = Vec::new();
for node in stack1.iter().rev() {
if let Some(current) = node {
let currentTreeNode = current.borrow();
thisLayer.push(currentTreeNode.val);
if leftToRight {
stack2.push(currentTreeNode.left.clone());
stack2.push(currentTreeNode.right.clone());
} else {
stack2.push(currentTreeNode.right.clone());
stack2.push(currentTreeNode.left.clone());
}
}
}
...
}
return result;
}
And now we’re done! Here’s the full solution:
use std::rc::Rc;
use std::cell::RefCell;
use std::mem;
pub fn zigzag_level_order(root: Option<Rc<RefCell<TreeNode>>>) -> Vec<Vec<i32>> {
let mut result: Vec<Vec<i32>> = Vec::new();
let mut stack1: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
stack1.push(root.clone());
let mut stack2: Vec<Option<Rc<RefCell<TreeNode>>>> = Vec::new();
let mut leftToRight = true;
while (!stack1.is_empty()) {
let mut thisLayer = Vec::new();
for node in stack1.iter().rev() {
if let Some(current) = node {
let currentTreeNode = current.borrow();
thisLayer.push(currentTreeNode.val);
if leftToRight {
stack2.push(currentTreeNode.left.clone());
stack2.push(currentTreeNode.right.clone());
} else {
stack2.push(currentTreeNode.right.clone());
stack2.push(currentTreeNode.left.clone());
}
}
}
if (!thisLayer.is_empty()) {
result.push(thisLayer);
}
leftToRight = !leftToRight;
stack1.clear();
mem::swap(&mut stack1, &mut stack2);
}
return result;
}
Haskell Solution
While our Rust solution was better described from the outside in, it’s easy to build the Haskell solution from the inside out. We have two loops, and we can start by defining the inner loop (we’ll call it the stack loop).
The goal of this loop is to take stack1 and turn it into stack2 (the next layer) and the numbers for this layer, while also tracking the direction of iteration. Both outputs are accumulated as lists, so we have inputs for them as well; you can see its full signature in the final implementation below.
When stack1 is empty, we return our result from this loop. Because of list accumulation order, we reverse nums when giving the result. However, we don’t reverse stack2, because we want to iterate starting from the “top”. This seems like the opposite of what we did in Rust, because Rust uses a vector for its stack type, instead of a singly linked list!
Observe also a second edge case…for Nil nodes in stack1, we just recurse on the rest of the list. Now for the main case, we just define the new stack2, which adds the child nodes in the correct order. Then we recurse while also adding x to nums.
zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = ...
where
stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
stackLoop _ [] stack2 nums = (stack2, reverse nums)
stackLoop isLeftToRight (Nil : rest) stack2 numbers = stackLoop isLeftToRight rest stack2 numbers
stackLoop isLeftToRight (Node x left right : rest) stack2 nums =
let stack2' = if isLeftToRight then right : left : stack2 else left : right : stack2
in stackLoop isLeftToRight rest stack2' (x : nums)
...
Now we’ll define the outer loop, which we’ll call the layerLoop. This takes the direction flag and stack1, plus the accumulator list for the results. It also has a simple base case to reverse the results list once stack1 is empty.
Now in the recursive case, we call the stackLoop to get our new numbers and the stack for the next layer (which we now think of as our new stack1). We then recurse, flipping the boolean flag and adding these new numbers to our results, but only if the list is not empty.
The last step, as you can see below, is calling layerLoop from the start with root. We’re done! Here’s our final implementation:
zigzagOrderTraversal :: TreeNode -> [[Int]]
zigzagOrderTraversal root = layerLoop True [root] []
where
stackLoop :: Bool -> [TreeNode] -> [TreeNode] -> [Int] -> ([TreeNode], [Int])
stackLoop _ [] stack2 nums = (stack2, reverse nums)
stackLoop isLeftToRight (Nil : rest) stack2 numbers = stackLoop isLeftToRight rest stack2 numbers
stackLoop isLeftToRight (Node x left right : rest) stack2 nums =
let stack2' = if isLeftToRight then right : left : stack2 else left : right : stack2
in stackLoop isLeftToRight rest stack2' (x : nums)
layerLoop :: Bool -> [TreeNode] -> [[Int]] -> [[Int]]
layerLoop _ [] allNums = reverse allNums
layerLoop isLeftToRight stack1 allNums =
let (stack1', newNums) = stackLoop isLeftToRight stack1 [] []
in layerLoop (not isLeftToRight) stack1' (if null newNums then allNums else newNums : allNums)
Conclusion
That’s all we’ll do for binary trees right now. In the coming articles we’ll continue to explore more data structures as well as some common algorithms. If you want to learn more about data structures and algorithms in Haskell, check out our course Solve.hs. Modules 2 & 3 are filled with this sort of content, including lots of practice problems.
The GHC developers are very pleased to announce the availability
of the fourth release candidate for GHC 9.10.3. Binary distributions, source
distributions, and documentation are available at downloads.haskell.org and
via GHCup.
GHC 9.10.3 is a bug-fix release fixing over 50 issues of a variety of
severities and scopes. A full accounting of these fixes can be found in the
release notes. As always, GHC’s release status, including planned future
releases, can be found on the GHC Wiki's status page.
The changes from the first release candidate are:
A fix for a rare segfault with code involving STM (#26205)
A fix for naturalAndNot returning bogus results (#26205)
This release candidate will have a two-week testing period. If all goes well
the final release will be available the week of 1 September 2025.
We would like to thank Well-Typed, Tweag I/O, Juspay, QBayLogic, Channable,
Serokell, SimSpace, the Haskell Foundation, and other anonymous contributors
whose on-going financial and in-kind support has facilitated GHC maintenance
and release management over the years. Finally, this release would not have
been possible without the hundreds of open-source contributors whose work
comprises this release.
As always, do give this release a try and open a ticket if you see
anything amiss.
The context behind this post is that my partner asked me how to
implement type inference for plain data structures (e.g. JSON or YAML)
which was awfully convenient because this is something I’ve done a
couple of times already and there is a pretty elegant trick for this I
wanted to share.
Now, normally type inference
and unification
are a bit tricky to implement in a programming language with functions,
but they’re actually fairly simple to implement if all you have to work
with is plain data. To illustrate this, I’ll implement and walk through
a simple type inference algorithm for JSON-like expressions.
For this post I’ll use the Value type from Haskell’s
aeson package, which represents a JSON value¹:
data Value
    = Object (KeyMap Value)  -- { "key₀": value₀, "key₁": value₁, … }
    | Array (Vector Value)   -- [ element₀, element₁, … ]
    | String Text            -- e.g. "example string"
    | Number Scientific      -- e.g. 42.0
    | Bool Bool              -- true or false
    | Null                   -- null
I’ll also introduce a Type datatype to represent the
type of a JSON value, which is partially inspired by TypeScript:
import Data.Aeson.KeyMap (KeyMap)

data Type
    = ObjectType (KeyMap Type)  -- { "key₀": type₀, "key₁": type₁, … }
    | ArrayType Type            -- type[]
    | StringType                -- string
    | NumberType                -- number
    | BoolType                  -- boolean
    | Optional Type             -- null | type
    | Never                     -- never, the subtype of all other types
    | Any                       -- any, the supertype of all other types
    deriving (Show)
… and the goal is that we want to implement an infer
function that has this type:
import Data.Aeson (Value(..))

infer :: Value -> Type
I want to walk through a few test cases before diving into the
implementation, otherwise it might not be clear what the
Type constructors are supposed to represent:
>>> -- I'll use the usual `x : T` syntax to denote "`x` has type `T`"
>>> -- I'll also use TypeScript notation for the types

>>> -- "example string" : string
>>> infer (String "example string")
StringType

>>> -- true : boolean
>>> infer (Bool True)
BoolType

>>> -- false : boolean
>>> infer (Bool False)
BoolType

>>> -- 42 : number
>>> infer (Number 42)
NumberType

>>> -- [ 2, 3, 5 ] : number[]
>>> infer (Array [Number 2, Number 3, Number 5])
ArrayType NumberType

>>> -- [ 2, "hello" ] : any[]
>>> -- To keep things simple, we'll differ from TypeScript and not infer
>>> -- a type like (number | string)[].  That's an exercise for the reader.
>>> infer (Array [Number 2, String "hello"])
ArrayType Any

>>> -- [] : never[]
>>> infer (Array [])
ArrayType Never

>>> -- { "key₀": true, "key₁": 42 } : { "key₀": boolean, "key₁": number }
>>> infer (Object [("key₀", Bool True), ("key₁", Number 42)])
ObjectType (fromList [("key₀", BoolType), ("key₁", NumberType)])

>>> -- [{ "key₀": true }, { "key₁": 42 }] : { "key₀": null | boolean, "key₁": null | number }[]
>>> infer (Array [Object [("key₀", Bool True)], Object [("key₁", Number 42)]])
ArrayType (ObjectType (fromList [("key₀", Optional BoolType), ("key₁", Optional NumberType)]))

>>> -- null : null | never
>>> infer Null
Optional Never

>>> -- [ null, true ] : (null | boolean)[]
>>> infer (Array [Null, Bool True])
ArrayType (Optional BoolType)
Some of those test cases correspond almost 1-to-1 with the
implementation of infer, which we can begin to
implement:
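Here are the scalar cases, lifted from the complete program given at the end of the post:

infer :: Value -> Type
infer (String _) = StringType
infer (Bool _) = BoolType
infer (Number _) = NumberType
infer Null = Optional Never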
The main two non-trivial cases are the implementation of
infer for Objects and Arrays.
We’ll start with Objects since that’s the easier case to
infer. To infer the type of an object we infer the type of each field
and then collect those field types into the final object type:
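That case is a one-liner (again taken from the full program below):

infer (Object fields) = ObjectType (fmap infer fields)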
Arrays are the trickier case, because there can only be a single element type for the whole
array. We can infer the type of each element, but if those element types
don’t match then we need some way to unify those element types into a
single element type representing the entire array. In other words, we
need a function with this type:
unify :: Vector Type -> Type
… because if we had such a function then we could write:
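infer (Array elements) = ArrayType (unify (fmap infer elements))

(This is exactly the Array case that appears in the final program at the end of the post.)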
The trick to doing this is that we need to implement a
Monoid instance and Semigroup instance for
Type, which is the same as saying that we need to define
two functions:
-- The default type `unify` returns if our list is empty
mempty :: Type

-- Unify two types into one
(<>) :: Type -> Type -> Type
… because if we implement those two functions then our
unify function becomes … fold!
Given a structure with elements whose type is a Monoid, combine them via the monoid’s (<>) operator.
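So, as in the complete program below, unify is nothing more than:

unify :: Vector Type -> Type
unify = fold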
Laws
There are a few rules we need to be aware of when implementing
mempty and (<>) which will help ensure
that our implementation of unification is well-behaved.
First, mempty and (<>) must obey the
“Monoid laws”, which require that:
-- Left identity
mempty <> x = x

-- Right identity
x <> mempty = x

-- Associativity
x <> (y <> z) = (x <> y) <> z
Second, mempty and (<>) must
additionally obey the following unification laws:
mempty is a subtype of x, for all
x
x <> y is a supertype of both x and
y
Unification
mempty is easy to implement since according to the
unification laws mempty must be the universal subtype,
which is the Never type:
instance Monoid Type where
    mempty = Never
(<>) is the more interesting function to
implement, and we’ll start with the easy cases:
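These scalar cases come straight from the complete program at the end of the post:

instance Semigroup Type where
    StringType <> StringType = StringType
    NumberType <> NumberType = NumberType
    BoolType <> BoolType = BoolType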
If we unify any scalar type with itself, we get back the same type.
That’s pretty self-explanatory.
The next two cases are also pretty simple:
Never <> other = other
other <> Never = other
If we unify the Never type with any other
type, then we get the other type because Never is a subtype
of every other type.
The next case is slightly more interesting:
ArrayType left <> ArrayType right = ArrayType (left <> right)
If we unify two array types, then we unify their element types. But
what about Optional types?
Optional left <> Optional right = Optional (left <> right)
Optional left <> right          = Optional (left <> right)
left          <> Optional right = Optional (left <> right)
If we unify two Optional types, then we unify their
element types, but we also handle the case where only one or the other
type is Optional, too.
The last complex data type is objects, which has the most interesting
implementation:
ObjectType left <> ObjectType right =
    ObjectType (KeyMap.alignWith adapt left right)
  where
    adapt (This (Optional a)) = Optional a
    adapt (That (Optional b)) = Optional b
    adapt (This a) = Optional a
    adapt (That b) = Optional b
    adapt (These a b) = a <> b
You can read that as saying “to unify two objects, unify the types of
their respective fields, and if either object has an extra field not
present in the other object then wrap the field’s type in
Optional”.
Finally, we have the case of last resort:
_ <> _ = Any
If we try to unify two types that could not unify via the previous
rules, then fall back to Any (the supertype of all other
types).
This gives us our final program (which I’ll include in its entirety
here):
import Data.Aeson (Value(..))
import Data.Aeson.KeyMap (KeyMap)
import Data.Foldable (fold)
import Data.These (These(..))
import Data.Vector (Vector)

import qualified Data.Aeson.KeyMap as KeyMap

data Type
    = ObjectType (KeyMap Type)  -- { "key₀": type₀, "key₁": type₁, … }
    | ArrayType Type            -- type[]
    | StringType                -- string
    | NumberType                -- number
    | BoolType                  -- boolean
    | Optional Type             -- null | type
    | Never                     -- never, the subtype of all other types
    | Any                       -- any, the supertype of all other types
    deriving (Show)

infer :: Value -> Type
infer (String _) = StringType
infer (Bool _) = BoolType
infer (Number _) = NumberType
infer Null = Optional Never
infer (Object fields) = ObjectType (fmap infer fields)
infer (Array elements) = ArrayType (unify (fmap infer elements))

unify :: Vector Type -> Type
unify = fold

instance Monoid Type where
    mempty = Never

instance Semigroup Type where
    StringType <> StringType = StringType
    NumberType <> NumberType = NumberType
    BoolType <> BoolType = BoolType
    Never <> other = other
    other <> Never = other
    ArrayType left <> ArrayType right = ArrayType (left <> right)
    Optional left <> Optional right = Optional (left <> right)
    Optional left <> right = Optional (left <> right)
    left <> Optional right = Optional (left <> right)
    ObjectType left <> ObjectType right =
        ObjectType (KeyMap.alignWith adapt left right)
      where
        adapt (This (Optional a)) = Optional a
        adapt (That (Optional b)) = Optional b
        adapt (This a) = Optional a
        adapt (That b) = Optional b
        adapt (These a b) = a <> b
    _ <> _ = Any
Pretty simple! That’s the complete implementation of type inference and unification.
Unification laws
I mentioned that our implementation should satisfy the
Monoid laws and unification laws, so I’ll include some
quick proof sketches (albeit not full formal proofs), starting with the
unification laws.
Let’s start with the first unification law:
mempty is the subtype of x, for all
x
This is true because we define mempty = Never and
Never is the subtype of all other types.
Next, let’s show that the implementation of (<>)
satisfies the other unification law:
x <> y is a supertype of both x and
y
The first case is:
StringType <> StringType = StringType
This satisfies the unification law because if we replace both
x and y with StringType we
get:
StringType <> StringType is a supertype of both
StringType and StringType
… and since StringType <> StringType = StringType
that simplifies down to:
StringType is a supertype of both
StringType and StringType
… and every type is a supertype of itself, so this satisfies the
unification law.
We’d prove the unification law for the next two cases in the exact
same way (just replacing StringType with
NumberType or BoolType):
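NumberType <> NumberType = NumberType
BoolType <> BoolType = BoolType

The next case to check is the first Never rule (repeated here from the implementation):

Never <> other = other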
Well, if we take our unification law and replace x with
Never and replace y with other we
get:
Never <> other is a supertype of
Never and other
… and since Never <> other = other that simplifies
to:
other is a supertype of Never and
other
… which is true because:
other is a supertype of Never (because
Never is the universal subtype)
other is a supertype of other (because
every type is a supertype of itself)
We’d prove the next case in the exact same way (just swapping
Never and other):
other <> Never = other
For the next case:
ArrayType left <> ArrayType right = ArrayType (left <> right)
The unification law becomes:
ArrayType (left <> right) is a supertype of both
ArrayType left and ArrayType right
… which is true because ArrayType is covariant
and by induction left <> right is a supertype of both
left and right.
We’d prove the first case for Optional in the exact same
way (just replace Array with Optional):
Optional left <> Optional right = Optional (left <> right)
The next case for Optional is more interesting:
Optional left <> right = Optional (left <> right)
Here the unification law would be:
Optional (left <> right) is a supertype of
Optional left and right
… which is true because:
Optional (left <> right) is a supertype of
Optional left
This is true because Optional is covariant and
left <> right is a supertype of
left
Optional (left <> right) is a supertype of
right
This is true because:
Optional (left <> right) is a supertype of
Optional right
Optional right is a supertype of
right
Therefore, by transitivity,
Optional (left <> right) is a supertype of
right
We’d prove the next case in the same way, just switching
left and right:
left <> Optional right = Optional (left <> right)
The case for objects is the most interesting case:
ObjectType left <> ObjectType right =
    ObjectType (KeyMap.alignWith adapt left right)
  where
    adapt (This (Optional a)) = Optional a
    adapt (That (Optional b)) = Optional b
    adapt (This a) = Optional a
    adapt (That b) = Optional b
    adapt (These a b) = a <> b
I won’t prove this case as formally, but the basic idea is that this
is true because a record type (A) is a supertype of another
record type (B) if and only if:
for each field k they share in common, A.k
is a supertype of B.k
for each field k present only in A,
A.k is a supertype of Optional Never
there are no fields present only in B
… and given that definition of record subtyping then the above
implementation satisfies the unification law.
Monoid laws
The first two Monoid laws are trivial to prove:
mempty <> x = x

x <> mempty = x
… because we defined:
mempty = Never
… and if we replace mempty with Never in
those laws:
Never <> x = x

x <> Never = x
… that is literally what our code defines (except replacing
x with other):
Never <> other = other
other <> Never = other
The last law, associativity, is pretty tedious to prove in full:
(x <> y) <> z = x <> (y <> z)
… but I’ll do a few cases to show the basic gist of how the proof
works.
First, the associativity law is easy to prove for the case where any
of x, y, or z is
Never. For example, if x = Never, then we
get:
(Never <> y) <> z = Never <> (y <> z)

-- Never <> other = other
y <> z = y <> z
… which is true. The other two cases for y = Never and
z = Never are equally simple to prove.
Associativity is also easy to prove when any of x,
y, or z is Any. For example, if
x = Any, then we get:
(Any <> y) <> z = Any <> (y <> z)

-- Any <> other = Any
Any <> z = Any

-- Any <> other = Any
Any = Any
… which is true. The other two cases for y = Any and
z = Any are equally simple to prove.
Now we can prove associativity if any of x,
y or z is StringType. The reason
why is that these are the only relevant cases in the implementation of
unification for StringType:
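(These equations are reconstructed from the full Semigroup instance above, setting aside the Never and Any cases that were already handled.)

StringType <> StringType = StringType
StringType <> _ = Any
_ <> StringType = Any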
That means that there are only seven cases we need to consider to
prove the associativity law if at least one of x,
y, and z is StringType (using
_ below to denote "any type other than
StringType"):
-- true: both sides evaluate to StringType
(StringType <> StringType) <> StringType = StringType <> (StringType <> StringType)

-- all other cases below are also true: they all evaluate to `Any`
(StringType <> StringType) <> _ = StringType <> (StringType <> _)
(StringType <> _) <> StringType = StringType <> (_ <> StringType)
(StringType <> _) <> _ = StringType <> (_ <> _)
(_ <> StringType) <> StringType = _ <> (StringType <> StringType)
(_ <> StringType) <> _ = _ <> (StringType <> _)
(_ <> _) <> StringType = _ <> (_ <> StringType)
We can similarly prove associativity for all cases involving at least
one NumberType or BoolType.
The proof for ArrayType is almost the same as the proof
for
StringType/NumberType/BoolType.
The only relevant cases are:
ArrayType left <> ArrayType right = ArrayType (left <> right)
ArrayType left <> Never = ArrayType left
Never <> ArrayType right = ArrayType right
ArrayType left <> _ = Any
_ <> ArrayType right = Any
Just like before, we can ignore the case where either argument is
Never because we already proved associativity for that.
That just leaves:
ArrayType left <> ArrayType right = ArrayType (left <> right)
ArrayType left <> _ = Any
_ <> ArrayType right = Any
Just like before, there are only seven cases we have to prove (using
_ below to denote "any type other than
ArrayType"):
ArrayType x <> (ArrayType y <> ArrayType z) = (ArrayType x <> ArrayType y) <> ArrayType z

-- … simplifies to:
ArrayType (x <> (y <> z)) = ArrayType ((x <> y) <> z)
-- … which is true because unification of the element types is associative

-- all other cases below are also true: they all evaluate to `Any`
(ArrayType x <> ArrayType y) <> _ = ArrayType x <> (ArrayType y <> _)
(ArrayType x <> _) <> ArrayType z = ArrayType x <> (_ <> ArrayType z)
(ArrayType x <> _) <> _ = ArrayType x <> (_ <> _)
(_ <> ArrayType y) <> ArrayType z = _ <> (ArrayType y <> ArrayType z)
(_ <> ArrayType y) <> _ = _ <> (ArrayType y <> _)
(_ <> _) <> ArrayType z = _ <> (_ <> ArrayType z)
The proofs for the Optional and Object
cases are longer and more laborious so I’ll omit them. They’re an
exercise for the reader because I am LAZY.
¹ I’ve inlined all the type synonyms and removed
strictness annotations, for clarity. ↩︎
The purpose of this post is to sum up, in one place, the state of torch.compile for training as of August 2025. Nothing here is something you couldn't already learn elsewhere on the Internet, but we rarely put everything together in one place. The target audience for this document is teams who are evaluating the use of torch.compile for large scale training runs.
First, the basics. torch.compile (also known as PT2) is a compiler for PyTorch eager programs for both inference and training workloads. Speedups from 1.5-2x compared to eager code are typical, and torch.compile also makes it possible to do global optimizations for memory (e.g., automatic activation checkpointing) and distributed communications (e.g., async tensor parallelism).
What is torch.compile's functionality?
The headline functionality of torch.compile is a decorator you can attach to a function to compile it:
@torch.compile()
def f(x, y):
...
Here are some non-functional properties of compile which are important to know:
Just-in-time compilation. We don't actually compile the function until it is called for the first time, and execution blocks until compilation completes. There is both local and remote caching to skip compilation cost when you rerun the model. (Ahead-of-time compilation is possible for inference with AOTInductor, and is being worked on for training.)
Compositional with Eager. PyTorch's original success comes from the extreme hackability of eager mode, and torch.compile seeks to preserve this. The function can be as big or as small part of your training loop as you like; compiled functions compose with autograd, DDP, FSDP and other PyTorch subsystems. (This composition is sometimes imperfect, e.g., in the case of double backwards (not supported), tensor subclasses (requires specific support from the subclass), autograd (differentiating with respect to intermediates returned from a compiled region does not work).) If compilation doesn't work on a region, you can disable it entirely with torch.compiler.disable() and fall back to eager.
Gradient updates are delayed to the end of compiled regions. This arises because PyTorch eager autograd does not support streaming gradients incrementally from a large backward node. (This can be solved by using compiled autograd, but this requires that the entirety of your backwards be compileable.)
Graphs may be recompiled. We aggressively specialize on all non-Tensor arguments/globals used in the function to ensure we always generate straight-line computation graphs with no control flow. If those arguments/globals change we will recompile the graph. (Recompilations can be banned with torch._dynamo.config.error_on_recompile = True.)
Static by default, recompile to dynamic shapes. We aggressively specialize all sizes to static. However, if we discover that a size varies over time, on the first recompile we will attempt to generate a single compiled region that handles dynamic shapes. We are not guaranteed to be able to compile a model with dynamic shapes. (You can use mark_dynamic to force an input shape to be dynamic, and you can use mark_unbacked to error if we specialize.)
Graph breaks transparently bypass non-capturable code. By default, if the compiler encounters a line of code that it is not able to handle, it will trigger a graph break, disabling compilation for that line of code, but still attempting to compile regions before and after it. (This behavior can be banned with fullgraph=True.)
Function calls are inlined and loops are unrolled by default. If you have many copies of a Transformer block in your model, your compile time will scale with the number of Transformer blocks. (You can reduce compile time by doing "regional compilation", where you only compile the Transformer block instead of compiling the entire model.)
NOT bitwise equivalent with eager PyTorch. The biggest divergence with eager PyTorch is that when float16/bfloat16 operations are fused together, we do not insert redundant down/up-conversions. (This can be disabled with torch._inductor.config.emulate_precision_casts = True; you can also rewrite eager code to perform operations in higher precision with the understanding that torch.compile will optimize it. XLA has a similar config xla_allow_excess_precision which JAX enables by default.) However, we may also make decisions to swap out, e.g., matmul implementations, and there may also be slight divergences that arise from differences in reduction ordering that are unavoidable when compilation occurs. We support ablating the graph capture frontend separately from the compiler backend to help diagnose these kinds of problems.
Distributed collectives and DTensor can be compiled, but are unoptimized by default. We are able to capture c10d collectives and also programs that handle DTensors, but we don't apply optimizations to collectives by default. (There are experimental optimizations that can be enabled, but this is active work in progress.) We generally do not expect to be able to trace through highly optimized distributed framework code.
State of advanced parallelism
For large scale training runs, torch.compile faces stiff competition from (1) PyTorch native distributed frameworks which embrace eager mode and implement all optimizations by hand (e.g., megatron), (2) custom "compiler" stacks which reuse our tracing mechanisms (e.g., symbolic_trace and make_fx) but implement their desired passes by hand, (3) JAX, which has always been XLA first and is years ahead in compile-driven parallelism techniques.
Here is where we currently are for advanced parallelism (with an emphasis on comparing with JAX):
DTensor, a "global tensor" abstraction for representing sharded tensors. DTensor is a tensor subclass which allows us to represent tensors which are sharded over an SPMD device mesh. The shape of a DTensor reflects the global shape of the original full tensor, but it only stores locally a shard of the data according to the placement. Here are some important details:
Shard placements. Unlike JAX placements, DTensor placements are "device mesh" oriented; that is to say, you conventionally specify a device mesh dim size list of placements, and Shard(i) indicates that the ith dimension of a tensor is sharded. This is opposite of JAX, which is "tensor" oriented. For example, given a 2-D mesh ["dp", "tp"], a tensor with [Replicate, Shard(0)] in DTensor placement (or {"dp": Replicate, "tp": Shard(0)} with named device mesh axes), would correspond to a JAX placement of P("tp", None). The reason for this is that DTensor supports a Partial placement, which indicates that an axis on the device mesh has a pending reduction. Partial shows up ubiquitously from matrix multiplies, and it isn't associated with any particular tensor axis, making it more convenient to represent in a device-mesh oriented formulation. The tradeoff is that device-mesh oriented placements don't naively support specifying sharding ordering, e.g., suppose I want to shard a 1-D tensor on tp and then dp, in JAX I'd represent this as P(("tp", "dp"),) but this order cannot be disambiguated from [Shard(0), Shard(0)] and in fact DTensor always forces left-to-right sharding. There is currently a proposal to extend our sharding specification to support ordering to bring us to parity with JAX expressiveness, but it is not yet implemented.
Autograd. DTensor is directly differentiable; we run autograd on programs that have DTensors (as opposed to desugaring a DTensor program to one with regular Tensors and differentiating it). This ensures that the sharding strategy of a primal and its corresponding tangent can diverge. This is parity with JAX.
Python subclass of Tensor. Unlike JAX, DTensor is a separate subclass from Tensor. However, Tensor and DTensor interoperate fine; a Tensor can simply be thought of as a DTensor that is replicated on all dimensions. DTensor is implemented in Python, which makes it easy to modify and debug but imposes quite a bit of overhead (for example, FSDP2 does not directly accumulate gradients into DTensor, because with thousands of parameters, performing detach and add operations on DTensor is a bottleneck). Still, despite this overhead, DTensor was designed for good eager performance, and extensively caches the results of sharding propagation so that in the fastpath, it only needs to lookup what redistribute it should perform and then directly dispatches to the local eager operation. However, this caching strategy means that overhead can be quite high for workloads with dynamic shapes, as the cache requires exact matches of all input shapes.
Compilation. DTensor is compilable by torch.compile, and doing so will desugar it into its underlying collectives and eliminate any eager mode DTensor overhead (even if you do not perform any other optimizations.) However, DTensor with dynamic shapes in compile is not well supported, see http://github.com/pytorch/pytorch/issues/159635 (we don't think this is currently critical path for any critical use cases, so a relatively junior engineer has been chipping away at it.)
Greedy propagation. Because DTensor must work in eager mode, it only implements greedy shard propagation, where for every eager operation we greedily pick whatever output shard minimizes the collective costs of an operation. It is work in progress to support backward propagation of sharding with the assistance of a compiler-like framework.
Operator coverage. DTensor requires sharding propagation rules to work for operations. If a sharding propagation rule is not implemented, DTensor will fail rather than trigger an inefficient allgather to run the operator under replication. We don't currently have full coverage of all operators, but important operators for transformer models like llama3 are all covered (sharding rules are defined here). You can write custom shardings for user defined operators.
Jagged sharding. We do not support a "jagged sharding" concept which would be necessary for expert parallelism with imbalanced routing. However, we believe that our existing sharding rules could largely be reused to support such an idea. As dynamism would only be exposed in the local tensor for the jagged shard, jagged shards don't suffer from the dynamic shapes problems mentioned in the compilation section.
Ecosystem. We are committed to DTensor as the standard representation for sharded tensors, and DTensor is integrated with checkpointing, FSDP2, SimpleFSDP, AutoParallel, torchtitan, among others.
Functional collectives. If you don't like DTensor, we also support "functional collectives", which are non-mutating versions of collective operations that can be used to manually implement SPMD operations in a compiler-friendly way without needing DTensor. (In fact, if you use traditional collective APIs and compile them, we will silently translate them into functional collectives for compiler passes.) When compiled, functional collectives don't necessarily force allocation of the output buffer as they can be re-inplaced. Importantly, functional collectives currently do NOT support autograd, see https://discuss.pytorch.org/t/supporting-autograd-for-collectives/219430
Graph capture. There are two particularly popular graph capture mechanisms which people have used to perform distributed optimizations separate from model code. All graph capture mechanisms produce FX graphs, which are a simple Python basic block IR representation with no control flow, which is entirely unopinionated about what actual operator set can occur in the graph.
Symbolic_trace. This was the original graph capture mechanism and is quite popular, despite its limitations. It is implemented entirely with Python operator overloading and will give you exactly whatever operations are overloadable in the graph. We consider this largely a legacy pipeline as you are unable to trace code involving conditionals on shapes and you end up with a graph that has no useful metadata about the shapes/dtypes of intermediate values. For example, PiPPY, a legacy stack for performing pipeline parallelism, was built on top of symbolic_trace graph capture.
make_fx/torch.export. This graph capture mechanism works by actually sending (fake) tensors through your program and recording ATen operators. There are a number of different variants: e.g., whether or not it is a Python tracing approach ala JAX jit, or whether it uses sophisticated bytecode analysis ala Dynamo; similarly, there are various levels of IR you can extract (pre-dispatch, post-dispatch; also, operators can be decomposed or kept as single units). Our compiler parallelism efforts are built on top of this capture mechanism, but there is nothing stopping you per se from writing your own graph pass on top of this IR. In practice, this can be difficult without PyTorch expertise, because (1) integrating a traced graph into PyTorch's autograd system so it can interoperate with other code is quite complicated to do in full generality, (2) the exact operator sets you get at various phases of compilation are undocumented and in practice very tied to the Inductor lowering stack, and it is poorly documented on how to prevent operators from getting decomposed before your pass gets to them.
Not SPMD compiler by default. torch.compile does not assume the program being compiled is SPMD by default, which means it will not do things like drop unused collectives (you can change this behavior with a config flag). Additionally, the default mode of use for torch.compile is to compile in parallel on all nodes, which means care has to be taken to ensure that every instance of the compiler compiles identically (only one rank recompiling, or compilers making different decisions, can lead to NCCL timeout). We ultimately think that we should compile a program once and send it to all nodes, but as this is not currently implemented, the general approach people have taken to solve this problem is to either (1) eliminate all sources of divergent behavior from ranks, e.g., don't allow the compiler to look at the actual size for dynamic inputs when making compiler decisions, or (2) introducing extra collectives to the compiler to communicate decisions that must be made consistently across all ranks.
Our vision for the future of advanced parallelism, spearheaded by the in-progress SimpleFSDP and AutoParallel, is that users should write single-node programs that express mathematically what they want to do. These are then transformed into efficient distributed programs in two steps: (1) first, collectives are inserted into the graph in a naive way (i.e., simply to express what the sharding of all intermediates should be), and (2) the collectives are optimized to handle scheduling concerns such as pre-fetching and bucketing. AutoParallel sets a GSPMD style goal of automatically determining a good enough sharding for a program--it should be able to rediscover data parallel, tensor parallel, even expert parallel(!)--but SimpleFSDP sets a smaller goal of just inserting collectives in the pattern that FSDP would mandate, and then writing FSDP-specific optimization passes for recovering FSDP2's performance. It is very common to write domain specific optimizations; for example, async tensor parallelism is also implemented as a pass that detects TP patterns and rewriting them to async TP operations. Unlike JAX, which started with a very generic solver and has needed to add more manual escape hatches over time, PyTorch has started with writing all of the distributed patterns exactly by hand, and we are only recently adding more automatic mechanisms as an alternative to doing everything by hand.
State of optimization
torch.compile performs many optimizations, but here are some particularly important ones to know about:
Inductor. Inductor is our backend for torch.compile that generates Triton kernels for PyTorch programs. It has very good coverage of PyTorch's operator set and can do fusions of pointwise and reductions, including in the patterns that typically occur for backwards. It also is able to fuse pointwise operations into matmuls and autotune different matmul backends (including cuBlas, cutlass and Triton) to select the best one for any given size. When people talk about torch.compile speeding up their programs, they are conventionally talking about Inductor; however, you don't have to use torch.compile with Inductor; for example, you could run with AOTAutograd only and skip Inductor compilation.
CUDA graphs. Inductor builds in support for CUDA graphing models. Compared to manual CUDA graphs application, we can give better soundness guarantees (e.g., catching mistakes like forgetting to copy in all input buffers, or CPU compute inside the CUDA graph region). torch.compile CUDA graphs is typically used with Inductor, but we also offer an eager-only cudagraphs integration (that is less well exercised).
Automatic activation checkpointing. With torch.compile, we can globally optimize the memory-compute tradeoff, much better than the activation checkpointing APIs that eager PyTorch supports (and require the user to manually feed in what they want checkpointed or not). However, some folks have reported that it can be quite miserable tuning the hyperparameter for AC; we have also found bugs in it.
FP8 optimizations. One big success story for traditional compilation was adding support for a custom FP8 flavor. With torch.compile, they didn't have to write manual kernels for their variant. This has since been upstreamed to torchao.
Flex attention. Flex attention usage continues to grow, with 632 downstream repo users in OSS (vs 125 in Jan '25). It has been used to enable chunked attention, document masking and context parallelism in llama family models. It is a really good research tool, although sometimes people complain about slight numerical differences.
Helion. Helion is an actively developed project aiming to go beta in October this year which offers a higher level interface for programming Triton kernels that looks just like writing PyTorch eager code. It relies heavily on autotuning to explore the space of possible structural choices of kernels to find the best one. It is not production ready but it is worth knowing that it is coming soon.
State of compile time
torch.compile is a just-in-time compiler and as such, in its default configuration, compilation will occur on your GPU cluster (preventing you from using the GPUs to do other useful work!) In general, most pathological compile times arise from repeated recompilation (often due to dynamic shapes, but sometimes not). In Transformer models, compile time can also be improved by only compiling the Transformer block (which can then be compiled only once, instead of having to be compiled N times for each Transformer block in the model).
We don't think caching is an ideal long-term solution for large scale training runs, and we have been working on precompile to solve the gap here. Precompile simply means having compilation be an ahead-of-time process which produces a binary which you can directly run from your training script to get the compiled model. The compilation products are built on top of our ABI stable interface (developed for AOTInductor) which allows the same binaries to target multiple PyTorch versions, even though PyTorch the library does not offer ABI compatibility from version to version.
How do I get started?
The most typical pattern we see for people who want to make use of torch.compile for large-scale training is to fork torchtitan and use this codebase as the basis for your training stack. torchtitan showcases PyTorch native functionality, including torch.compile--in effect, it shows you how to use features in PyTorch together in a way that lets you do large-scale training. From there, swap out the components you are opinionated about and keep the things you don't care about.
The performance of a system is critical for the user experience. Whether it’s a website, mobile app, or service, users demand fast response times and seamless functionality.
Performance testing is a non-functional testing technique that evaluates the speed, responsiveness, and stability of a system under different workloads for different purposes.
The primary goal of performance testing is to identify and eliminate performance bottlenecks to ensure that the system meets the expected performance criteria.
It is crucial for understanding the performance of the system under various conditions and ensuring that it can handle real-world usage scenarios effectively.
From my experience, performance testing is usually underestimated and overlooked, as it is generally only run after big feature releases, architectural changes, or when preparing for promotional events.
In this post, I want to explain the foundations of performance testing for the wider engineering community.
In a future post, I’ll talk about continuous performance testing.
Performance testing helps in:
Validating System Performance: Ensuring that the system performs well under expected load conditions.
Identifying Bottlenecks: Detecting performance issues that could degrade the user experience.
Ensuring Scalability: Verifying that the system can scale up to accommodate increased load, and scale back down when load decreases.
Improving User Experience: Providing a consistently smooth and responsive experience for end-users, which builds loyalty.
Performance Testing process
Like other software development activities, performance testing needs to follow a process to be effective.
The process requires collaboration with other teams such as business, DevOps, system, and development teams.
Let’s explain the process with a real-world scenario.
Imagine Wackadoo Corp wants to implement performance testing because they’ve noticed their e-commerce platform slows down dramatically during peak sales events, leading to frustrated customers and lost revenue.
When this issue is raised to the performance engineers, they suspect it could be due to inadequate server capacity or inefficient database queries under heavy load and recommend running performance tests to pinpoint the problem.
The engineers begin by gathering requirements, such as simulating 10,000 concurrent users while maintaining response times under 2 seconds, and then create test scripts to mimic real user behavior, like browsing products and completing checkouts.
A testing environment mirroring production is set up, and the scripts are executed while the system is closely monitored to ensure it handles the expected load.
After the first test run, the engineers analyze the results and identify slow database queries as the primary bottleneck.
They optimize the queries, add caching, and re-run the tests, repeating this process until the system meets all performance criteria.
Once satisfied, they publish the final results, confirming the platform can now handle peak traffic smoothly, improving both customer experience and sales performance.
How to Apply Performance Testing
Like functional testing, performance testing should be integrated at every level of the system, starting from the unit level up.
The test pyramid traditionally illustrates functional testing, with unit tests at the base, integration tests in the middle, and end-to-end or acceptance tests at the top.
However, the non-functional aspect of testing—such as performance testing—often remains less visible within this structure. It is essential to apply appropriate non-functional tests at each stage to ensure a comprehensive evaluation.
By conducting tailored performance tests across different levels, we can obtain early and timely feedback, enabling continuous assessment and improvement of the system’s performance.
Types of Performance Testing
There are several types of performance tests, each designed to evaluate different aspects of system performance.
We can basically categorize performance testing with three main criteria:
Load; for example, the number of virtual users
The strategy for varying the load over time
How long we apply the load (the test duration)
The following illustrates the different types of performance testing with regard to these three main criteria.
The three main criteria are a good starting point, but they don’t completely characterize the types of performance tests.
For example, we can also vary the type of load (for example, to test CPU-bound or I/O-heavy tasks) or the testing environment (for example, whether the system is allowed to scale up the number of instances).
Load Testing
Load testing is a basic form of performance testing that evaluates how a system behaves when subjected to a specific level of load.
This specific load represents the optimal or expected amount of usage the system is designed to handle under normal conditions.
The primary goal of load testing is to verify whether the system can deliver the expected responses while maintaining stability over an extended period.
By applying this consistent load, performance engineers can observe the system’s performance metrics, such as response times, resource utilization, and throughput, to ensure it functions as intended.
Basic and widely known form of performance testing
Load tests are run under the optimum load of the system
Load tests give a result close to what real users might face in production
Easiest type to run in a CI/CD pipeline
Let’s make it clearer by again looking at Wackadoo Corp.
Wackadoo Corp wants to test that a new feature is performing similarly to the system in production.
The business team and performance engineers have agreed that the new feature should meet the following requirements while handling 5,000 concurrent users:
It can handle 1,000 requests per second (rps)
95% of the response times are less than 1,000 ms
The longest responses are less than 2,000 ms
0% error rate
The test server does not exceed 70% CPU usage with 4GB of RAM
With these constraints in place, Wackadoo Corp can deploy the new
feature in a testing environment and observe how it performs.
Stress Testing
Stress testing evaluates a system’s upper limits by pushing it beyond normal operation to simulate extreme conditions like high traffic or data processing.
It identifies breaking points and assesses the system’s ability to recover from failures.
This testing uncovers weaknesses, ensuring stability and performance during peak demand, and improves reliability and fault tolerance.
Tests the upper limits of the system
Requires more resources than load testing, to create more virtual users, etc.
The boundary of the system should be investigated during the stress test
Stress tests can break the system
Stress tests can give us an idea about the performance of the system under heavy loads, such as promotional events like Black Friday
Hard to run in a CI/CD pipeline since the system is intentionally prone to fail
Wackadoo Corp wants to investigate the system behavior when exceeding the optimal users/responses so it decides to run a stress test.
Performance engineers have the metrics for the upper limit of the system, so during the tests the load will be increased gradually until the peak level.
The system can handle up to 10,000 concurrent users.
The expectation is that the system will continue to respond, but the response metrics will degrade within the following expected limits:
It can handle 800 requests per second (rps)
95% of the response times are less than 2,500 ms
The longest responses are less than 5,000 ms
10% error rate
The test server is around 95% of CPU usage with 4GB of RAM
If any of these limits are exceeded when monitoring in the test
environment, then Wackadoo Corp knows it has a decision to make
about resource scaling and its associated costs, if no further
efficiencies can be made.
Spike Testing
A spike test is a type of performance test designed to evaluate how a system behaves when there is a sudden and significant increase or decrease in the amount of load it experiences.
The primary objective of this test is to identify potential system failures or performance issues that may arise when the load changes unexpectedly or reaches levels that are outside the normal operating range.
By simulating these abrupt fluctuations in load, the spike test helps to uncover weaknesses in the system’s ability to handle rapid changes in demand.
This type of testing is particularly useful for understanding how the system responds under stress and whether it can maintain stability and functionality when subjected to extreme variations in workload.
Ultimately, the spike test provides valuable insights into the system’s resilience and helps ensure it can manage unexpected load changes without critical failures.
Spike tests give us an idea about the behavior of the system under unexpected increases and decreases in load
We can get an idea about how fast the system can scale-up and scale-down
They can require additional performance testing tools, as not all tools support this load profile
Good for some occasions like simulating push notifications, or critical announcements
Very hard to run in a CI/CD pipeline since the system is intentionally prone to fail
Let’s look at an example again: Wackadoo Corp wants to send push notifications to 20% of its mobile users at 3pm for Black Friday.
They want to investigate the system behavior when the number of users increases and decreases suddenly, so they want to run a spike test.
The system can handle up to 10,000 concurrent users, so the load will be increased to this amount in 10 seconds and then decreased to 5,000 users in 10 seconds.
The expectation is that the system keeps responding, but the response metrics increase within the following expected limits:
Maximum latency is 500ms
95% of the response times are less than 5,000 ms
The longest responses are less than 10,000 ms
15% error rate
The test server is around 95% of CPU usage but it should decrease when the load decreases
Again, if any of these expectations are broken, it may suggest to
Wackadoo Corp that its resources are not sufficient.
Endurance Testing (Soak Testing)
An endurance test focuses on evaluating the upper boundary of a system over an extended period of time.
This test is designed to assess how the system behaves under sustained high load and whether it can maintain stability and performance over a prolonged duration.
The goal is to identify potential issues such as memory leaks, resource exhaustion, or degradation in performance that may occur when the system is pushed to its limits for an extended time.
By simulating long-term usage scenarios, endurance testing helps uncover hidden problems that might not be evident during shorter tests.
This approach ensures that the system remains reliable and efficient even when subjected to continuous high demand over an extended period.
Soak tests run for a prolonged time
They check the system stability when the load does not decrease for a long time
Soak testing can give a better idea about the performance of the system for campaigns like Black Friday than the other tests, hence the need for a diverse testing strategy
Hard to run in a CI/CD pipeline since it aims to test for a long period, which goes against the expected short feedback loop
This time, Wackadoo Corp wants to send push notifications to 10% of users every hour, from 10am until 10pm, on Black Friday to increase sales for a one-day 50%-off promotion.
They want to investigate the system behavior when the number of users increases but the load stays stable between nominal and the upper boundary for a long time, so they want to run an endurance test.
The system can handle up to 10,000 concurrent users, so the load will be increased to 8,000 users in 30 seconds and held there.
The expectation is that the system keeps responding, but the response metrics increase within the following expected limits:
Maximum latency is 300ms
95% of the response times are less than 2,000 ms
The longest responses are less than 3,000 ms
5% error rate
The test server is around 90% of CPU usage
Scalability Testing
Scalability testing is a critical type of performance testing that evaluates how effectively a system can manage increased load by incorporating additional resources, such as servers, databases, or other infrastructure components.
This testing determines whether the system can efficiently scale up to accommodate higher levels of demand as user activity or data volume grows.
By simulating scenarios where the load is progressively increased, scalability testing helps identify potential bottlenecks, resource limitations, or performance issues that may arise during expansion.
This process ensures that the system can grow seamlessly to meet future requirements without compromising performance, stability, or user experience.
Ultimately, scalability testing provides valuable insights into the system’s ability to adapt to growth, helping organizations plan for and support increasing demands over time.
Scalability tests require collaboration for system monitoring and scaling
They can require more load generators, depending on the performance testing tool (i.e. load the system, then spike it)
They aim to check the behavior of the system during the scaling
Very hard to run in a CI/CD pipeline since it requires the scaling to be orchestrated
Performance engineers at Wackadoo Corp want to see how the system scales when the loads exceed the upper boundary, so they perform a scalability test.
The system can handle up to 10,000 concurrent users for one server, so this time the load will be increased gradually starting from 5,000 users, and every 2 minutes 1,000 users will join the system.
The expectation is that the system keeps responding, and the response metrics increase with the load (as before) until after 10,000 users, when a new server should join the system, at which point we should observe the response metrics starting to decrease.
Once scaling up is tested, we can continue with testing the scaling down by decreasing the number of users under the upper limit.
Volume Testing
Volume testing assesses the system’s behavior when it is populated with a substantial amount of data.
The purpose of this testing is to evaluate how well the system performs and maintains stability under conditions of high data volume.
By simulating scenarios where the system is loaded with large datasets, volume testing helps identify potential issues related to data handling, storage capacity, and processing efficiency.
This type of testing is particularly useful for uncovering problems such as slow response times, data corruption, or system crashes that may occur when managing extensive amounts of information. Additionally, volume testing ensures that the system can effectively store, retrieve, and process large volumes of data without compromising its overall performance or reliability.
Volume tests simulate the system behavior when huge amounts of data are received
They check if databases have any issue with indexing data
For example, in a Black Friday sale scenario, with a massive surge of new users accessing the website simultaneously, they ensure that no users experience issues such as failed transactions, slow response times, or an inability to access the system
Very hard to run in a CI/CD pipeline since the system is intentionally prone to fail
Wackadoo Corp wants to attract more customers, so they implemented an “invite your friend” feature. The company plans to give a voucher to both members and invited members, which will result in a huge amount of database traffic.
Performance engineers want to run a volume test, which mostly includes scenarios like inviting, registering, checking voucher code state, and loading the checkout page.
During the test, the load will increase to 5,000 users by adding 1,000 users every 2 minutes, and they should simulate normal user behaviors.
After that, heavy write operations can start.
As a result, we should expect the following:
Maximum latency is 500ms
95% of the response times are less than 3,000 ms
The longest responses are less than 5,000 ms
0% error rate
The test server is around 90% of CPU usage
A failure here might suggest to Wackadoo Corp that its database
service is a bottleneck.
Conclusion
Performance testing plays a crucial role in shaping the overall user experience because an application that performs poorly can easily lose users and damage its reputation.
When performance problems are not detected and resolved early, the cost of fixing them later can increase dramatically, impacting both time and resources.
Moreover, collaboration between multiple departments, including development, operations, and business teams, is essential to ensure that the testing process aligns with real-world requirements and produces meaningful, actionable insights.
Without this coordinated effort and knowledge base, performance testing may fail to deliver valuable outcomes or identify critical issues.
There are many distinct types of performance testing, each designed to assess the system’s behavior from a specific angle and under different conditions.
Load testing can be easily adapted to the CI/CD pipeline; the other performance testing types can be more challenging, but they can still provide a lot of benefits.
In my next blog post, I will talk about my experiences on how we can apply performance testing continuously.
Imagine this, you get a report from your bug tracker:
Sophie got an error when viewing the diff after her most recent push
to her contribution to the @unison/cloud project on Unison
Share
(BTW, contributions are like pull requests, but for Unison code)
Okay, this is great, we have something to start with, let's go look
up that contribution and see if any of the data there is suspicious.
Uhhh, okay, I know the error is related to one of Sophie's
contributions, but how do I actually find it?
I know Sophie's username from the bug report, that helps, but I don't
know which project she was working on, or what the contribution ID is,
which branches are involved, etc. Okay no problem, our data is
relational, so I can dive in and figure it out with a query:
> SELECT contribution.*
    FROM contributions AS contribution
    JOIN projects AS project ON contribution.project_id = project.id
    JOIN users AS unison_user ON project.owner = unison_user.id
    JOIN users AS contribution_author ON contribution.author_id = contribution_author.id
    JOIN branches AS source_branch ON contribution.source_branch = source_branch.id
    WHERE contribution_author.username = 'sophie'
      AND project.name = 'cloud'
      AND unison_user.username = 'unison'
    ORDER BY source_branch.updated_at DESC

-[ RECORD 1 ]-------+----------------------------------------------------
id                  | C-4567
project_id          | P-9999
contribution_number | 21
title               | Fix bug
description         | Prevent the app from deleting the User's hard drive
status              | open
source_branch       | B-1111
target_branch       | B-2222
created_at          | 2025-05-28 13:06:09.532103+00
updated_at          | 2025-05-28 13:54:23.954913+00
author_id           | U-1234
It's not the worst query I've ever had to write out, but if you're
doing this a couple times a day on a couple different tables, writing
out the joins gets pretty old real fast, especially
if you're writing it in a CLI interface where it's a royal pain to
edit the middle of a query.
Even after we get the data, we're left with a very ID-heavy view of what's
going on: what's the actual project name? What are the branch names?
Etc.
We can solve both of these problems by writing a bunch of joins
ONCE by creating a debugging view over the table we're
interested in. Something like this:
CREATE VIEW debug_contributions AS
SELECT contribution.id AS contribution_id,
       contribution.project_id,
       contribution.contribution_number,
       contribution.title,
       contribution.description,
       contribution.status,
       contribution.source_branch AS source_branch_id,
       source_branch.name AS source_branch_name,
       source_branch.updated_at AS source_branch_updated_at,
       contribution.target_branch AS target_branch_id,
       target_branch.name AS target_branch_name,
       target_branch.updated_at AS target_branch_updated_at,
       contribution.created_at,
       contribution.updated_at,
       contribution.author_id,
       author.username AS author_username,
       author.display_name AS author_name,
       project.name AS project_name,
       '@' || project_owner.username || '/' || project.name AS project_shorthand,
       project.owner AS project_owner_id,
       project_owner.username AS project_owner_username
  FROM contributions AS contribution
  JOIN projects AS project ON contribution.project_id = project.id
  JOIN users AS author ON contribution.author_id = author.id
  JOIN users AS project_owner ON project.owner = project_owner.id
  JOIN branches AS source_branch ON contribution.source_branch = source_branch.id
  JOIN branches AS target_branch ON contribution.target_branch = target_branch.id;
Okay, that's a lot to write out at once, but we never need to write
that again. Now if we need to answer the same question we did above we
do:
SELECT *
FROM debug_contributions
WHERE author_username = 'sophie'
  AND project_shorthand = '@unison/cloud'
ORDER BY source_branch_updated_at DESC;
Which is considerably easier on both my brain and my
fingers. I also get all the information I could possibly want in the
result!
You can craft one of these debugging views for whatever your needs are,
for each and every table you work with, and since it's just a view, it's
trivial to update or delete, and it doesn't take up any space in the DB
itself.
Obviously querying over
project_shorthand = '@unison/cloud' isn't going to be able
to use an index, so isn't going to be the most performant query; but
these are one off queries, so it's not a concern (to me at least). If
you care about that sort of thing you can leave out the computed columns
so you won't have to worry about that.
Anyways, that's it, that's the whole trick. Go make some debugging
views and save your future self some time.
Hopefully you learned something 🤞! Did you know I'm currently writing a book? It's all about Lenses and Optics! It takes you all the way from beginner to optics-wizard and it's currently in early access! Consider supporting it, and more posts like this one, by pledging on my Patreon page! It takes quite a bit of work to put
these things together, so if I managed to teach you something or even just entertain you for a minute or two,
maybe send a few bucks my way for a coffee? Cheers!
In this episode, we’re joined by Michael Snoyman, author of Yesod, Conduit, Stackage and many other popular Haskell libraries. We discuss newcomer friendliness, being a Rustacean vs a Haskellasaur, how STM is Haskell’s best feature and how laziness can be a vice.
This post will introduce a simple caching strategy, with a small
twist, which depending on your app may help you not only improve
performance, but might also drastically reduce the memory residency of
your program.
I had originally written this post in 2022, but it looks like I got busy
and failed to release it, so just pretend you're reading this in 2022,
okay? It was a simpler time.
In case you're wondering, we've continued to optimize storage since then, and
modern UCM uses even less memory than it did back in 2022 😎.
Spoiler warning, with about 80 lines of code, I was able to reduce
both the memory residency and start-up times by a whopping ~95%! From
90s -> 4s startup time, and from 2.73GB -> 148MB. All of these
gains were realized by tweaking our app to enforce sharing
between identical objects in memory.
Case Study
I help build the Unison
Language. One unique thing about the language is that programmers
interact with the language through the Unison Codebase Manager (a.k.a.
ucm), which is an interactive shell. Some users have
started to amass larger codebases, and lately we've been noticing that
the memory usage of ucm was growing to unacceptable
levels.
Loading one specific codebase, which I'll use for testing throughout
this article, required 2.73GB and took about 90
seconds to load from SQLite. This is far larger and slower than
we'd like.
There are two facets of how Unison stores code that are important to
know as we go forward, and that will help you understand whether
this technique might work for you.
Unison codebases are append-only, and codebase definitions
are referenced by a content-based hash.
A Unison codebase is a tree with many branches; each branch contains
many definitions and also has references to its history. In Unison, once a
definition is added to the codebase it is immutable. This is similar to
how commits work in git: commits can be built upon, and branches can
change which commit they point to, but once a commit is created it
cannot be changed and is uniquely identified by its hash.
A given Unison codebase is likely to refer to subtrees of
code like libraries many times across different Unison branches. E.g.
most projects contain a reference to the base
library.
A Unison project can pull in the libraries it depends on by simply
mounting that dependency into its lib namespace. Doing so
is inexpensive because in effect we simply copy the hash which refers to
a given snapshot of the library, we don't need to make copies of any of
the underlying code. However, when loading the codebase into memory
ucm was hydrating each and every library reference into a
full in-memory representation of that code. No good!
What is sharing and why do I want it?
Sharing is a very simple concept at its core: rather than having
multiple copies of the same identical object in memory, we should just
have one. It's dead simple if you say it like that, but there are many
ways we can end up with duplicates of values in memory. For example, if
I load the same codebase from SQLite several times then SQLite won't
know that the object I'm loading already exists in memory and will make
a whole new copy.
In a language where data is mutable by default you'll want to think
long and hard about whether sharing is sensible or even possible for
your use-case, but luckily for me, everything in Haskell is immutable by
default so there's absolutely no reason to make copies of identical
values.
There's an additional benefit to sharing beyond just saving memory:
equality checks may be optimized! Some Haskell types like
ByteStrings include an
optimization in their Eq instance which short circuits
the whole check if the two values are pointer-equal. Testing
equality on string-like values is typically most expensive when
the two strings are actually equal, since the check must examine every
single byte to see if any of them differ. By interning our values using
a cache we can reduce these checks to a single pointer-equality
check rather than an expensive byte-by-byte comparison.
Implementation
One issue with caches like this is that they can grow to
consume unbounded amounts of memory; we certainly don't want every value
we've ever cached to stay there forever. Haskell is a garbage collected
language, so naturally the ideal situation would be for a value to live
in the cache up until it is garbage collected, but how can we know
that?
GHC implements weak
pointers! This nifty feature allows us to do two helpful things:
We can attach a finalizer to the values we return from the cache,
such that values will automatically evict themselves
from the cache when they're no longer reachable.
Weak references don't prevent the value they're pointing to from
being garbage collected. This means that if a value is only
referenced by a weak pointer in a cache then it will still be garbage
collected.
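To make those two properties concrete, here's a tiny standalone sketch using System.Mem.Weak directly (illustrative only, not from the post; GC timing is nondeterministic, so the finalizer isn't guaranteed to fire at this exact point):

import System.Mem (performGC)
import System.Mem.Weak (deRefWeak, mkWeakPtr)

weakDemo :: IO ()
weakDemo = do
  let value = [1 .. 10 :: Int]
  -- (1) attach a finalizer that runs when 'value' is garbage collected
  wk <- mkWeakPtr value (Just (putStrLn "value collected; evict it from the cache"))
  -- (2) the weak pointer doesn't keep 'value' alive, but while it's still
  --     reachable we can dereference it
  deRefWeak wk >>= print
  -- once nothing else references 'value', a later GC may collect it and run the finalizer
  performGC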
As a result, there's really no downside to this form of caching
except a very small amount of compute and memory used to maintain the
cache itself. Your mileage may vary, but as the numbers show, in our
case this cost was very much worth it when compared to
the gains.
Here's an implementation of a simple Interning Cache:
module InternCache
  ( InternCache,
    newInternCache,
    lookupCached,
    insertCached,
    intern,
    hoist,
  )
where

import Control.Monad.IO.Class (MonadIO (..))
import Data.HashMap.Strict (HashMap)
import Data.HashMap.Strict qualified as HashMap
import Data.Hashable (Hashable)
import System.Mem.Weak
import UnliftIO.STM

-- | Parameterized by the monad in which it operates, the key type,
-- and the value type.
data InternCache m k v = InternCache
  { lookupCached :: k -> m (Maybe v),
    insertCached :: k -> v -> m ()
  }

-- | Creates an 'InternCache' which uses weak references to only
-- keep values in the cache for as long as they're reachable by
-- something else in the app.
--
-- This means you don't need to worry about a value not being
-- GC'd because it's in the cache.
newInternCache :: forall m k v. (MonadIO m, Hashable k) => m (InternCache m k v)
newInternCache = do
  var <- newTVarIO mempty
  pure $
    InternCache
      { lookupCached = lookupCachedImpl var,
        insertCached = insertCachedImpl var
      }
  where
    lookupCachedImpl :: TVar (HashMap k (Weak v)) -> k -> m (Maybe v)
    lookupCachedImpl var ch = liftIO $ do
      cache <- readTVarIO var
      case HashMap.lookup ch cache of
        Nothing -> pure Nothing
        Just weakRef -> do
          deRefWeak weakRef

    insertCachedImpl :: TVar (HashMap k (Weak v)) -> k -> v -> m ()
    insertCachedImpl var k v = liftIO $ do
      wk <- mkWeakPtr v (Just $ removeDeadVal var k)
      atomically $ modifyTVar' var (HashMap.insert k wk)

    -- Use this as a finalizer to remove the key from the map
    -- when its value gets GC'd
    removeDeadVal :: TVar (HashMap k (Weak v)) -> k -> IO ()
    removeDeadVal var k = liftIO do
      atomically $ modifyTVar' var (HashMap.delete k)

-- | Changing the monad in which the cache operates with a natural transformation.
hoist :: (forall x. m x -> n x) -> InternCache m k v -> InternCache n k v
hoist f (InternCache lookup' insert') =
  InternCache
    { lookupCached = f . lookup',
      insertCached = \k v -> f $ insert' k v
    }
Now you can create a cache for any values you like! You can maintain
a cache within the scope of a given chunk of code, or you can make a
global cache for your entire app using unsafePerformIO like
this:
-- An in memory cache for interning hashes.
-- This allows us to avoid creating multiple in-memory instances of the same hash bytes;
-- but also has the benefit that equality checks for equal hashes are O(1) instead of O(n), since
-- they'll be pointer-equal.
hashCache :: (MonadIO m) => InternCache m Hash Hash
hashCache = unsafePerformIO $ hoist liftIO <$> IC.newInternCache @IO @Hash @Hash
{-# NOINLINE hashCache #-}
And here's an example of what it looks like to use the cache in
practice:
expectHash :: HashId -> Transaction Hash
expectHash h =
  -- See if we've got the value in the cache
  lookupCached hashCache h >>= \case
    Just hash -> pure hash
    Nothing -> do
      hash <- queryOneCol [sql| SELECT base32 FROM hash WHERE id = :h |]
      -- Since we didn't have it in the cache, add it now
      insertCached hashCache h hash
      pure hash
For things like Hashes, the memory savings are more modest, but in
the cases of entire subtrees of code the difference for us was
substantial. Not only did we save memory, but we saved a ton of time
re-hydrating subtrees of code from SQLite that we already had.
We can even get the benefits of a cache like this when we don't have
a separate key for the value, as long as the value itself has a
Hashable or Ord instance (if you swap the
InternCache to use a regular Map). We can use the value as its own key;
this doesn't help us avoid the computational cost of creating the
value, but it still gives us the memory savings:
-- | When a value is its own key, this ensures that the given value
-- is in the cache and always returns the single canonical in-memory
-- instance of that value, garbage collecting any others.
intern :: (Hashable k, Monad m) => InternCache m k k -> k -> m k
intern cache k = do
  mVal <- lookupCached cache k
  case mVal of
    Just v -> pure v
    Nothing -> do
      insertCached cache k k
      pure k
Conclusion
An approach like this doesn't work for every app; it's much easier to
use when working with immutable values. But if there's a
situation in your app where it makes sense, I recommend giving it a try!
I'll reiterate that for us, we dropped our codebase load times from 90s
down to 4s, and our resting memory usage from 2.73GB down to 148MB.
Philip Wadler is a man who wears many different hats. Both literally: fedoras, trilbys, even the occasional straw hat, and metaphysically: recently retired Professor of theoretical computer science at the University of Edinburgh; Fellow of the Royal Society; senior researcher at the blockchain infrastructure company IOHK; Lambda Man; often-times favourite lecturer of the first year computer science students; and, occasionally, stand-up comedian. It is the latter role that leads me to ask Phil if he will participate in a Q&A.
[Previous post repeated below.]
Following two sell-out shows at the Fringe last year, I'm on at the Fringe again:
Shows are under the banner of The Provocateurs (formerly Cabaret of Dangerous Ideas). Tickets go on sale Wednesday 7 May, around noon. The official blurb is brief:
Professor Philip Wadler (The University of Edinburgh) separates the hopes and threats of AI from the chatbot bullshit.
Here is a longer blurb, from my upcoming appearance at Curious, run by the RSE, in September.
Brave New Bullshit
In an AI era, who wins and who loses?
Your future workday might look like this:
You write bullet points.
You ask a chatbot to expand them into a report.
You send it to your boss ...
Who asks a chatbot to summarise it to bullet points.
Will AI help you to do your job or take it from you? Is it fair for AI to be trained on copyrighted material? Will any productivity gains benefit everyone or only a select few?
Join Professor Philip Wadler’s talk as he looks at the hopes and threats of AI, exploring who wins and who loses.
This article is about a code-transformation technique I used to get
100x-300x performance improvements on a particularly slow bit of code
which was loading Unison code from Postgres in Unison Share. I haven't
seen it documented anywhere else, so wanted to share the trick!
It's a perennial annoyance when I'm programming that often the most
readable way to write some code is also directly at odds with being
performant. A lot of data has a tree structure, and so working with this
data is usually most simply expressed as a series of nested function
calls. Nested function calls are a reasonable approach when executing
CPU-bound tasks, but in webapps we're often querying or fetching data
along the way. In a nested function structure we'll naturally end up
interleaving a lot of one-off data requests. In most cases these data
requests will block further execution until a round-trip to the database
fetches the data we need to proceed.
In Unison Share, I often need to hydrate an ID into an AST structure
which represents a chunk of code, and each reference in that code will
often contain some metadata or information of its own. We split off
large text blobs and external code references from the AST itself, so
sometimes these fetches will proceed in layers, e.g. fetch the AST, then
fetch the text literals referenced in the tree, then fetch the metadata
for code referenced by the tree, etc.
When hydrating a large batch of code definitions, if each definition
takes N database calls, loading M definitions is NxM database
round-trips, NxM query plans, and potentially NxM index or table scans!
If you make a call for each text ID or external reference individually,
then this scales even worse.
This post details a technique for using traversals to
iteratively evolve linear, nested codepaths into
similar functions which work on batches of data
instead. Critically, it lets you keep the same nested code structure,
avoiding the need to restructure the whole codebase and allowing you to
introduce batching progressively without shipping a whole rewrite at once. It also
provides a trivial mechanism for deduplicating data
requests, and even allows using the exact same codepath for loading 0,
1, or many entities in a typesafe way. First a quick explanation of how
I ended up in this situation.
Case study: Unison Share definition loading
I'm in charge of the Unison
Share code-hosting and collaboration platform. The codebase for this
webapp started its life by collecting bits and pieces of code from the
UCM CLI application. UCM uses SQLite, so the first iteration was a minimal
rewrite which simply replaced SQLite queries with the equivalent
Postgres queries, while the codepaths themselves were left largely the
same.
SQLite operates in-process and loads everything from memory or disk,
so for our intents and purposes in UCM it has essentially no latency. As
a result, most code for loading definitions from the user's codebase in
UCM was written simply and linearly, loading the data only as it is
needed. E.g. we may have a method
loadText :: TextId -> Sqlite.Transaction Text, and when
we needed to load many text references it was perfectly reasonable to
just traverse loadText over a list of IDs.
However, not all databases have the same trade-offs! In the Unison
Share webapp we use Postgres, which means the database has a network
call and round-trip latency for each and every query. We now pay a fixed
round-trip latency cost on every query that simply wasn't a factor
before. Something simple like traverse loadText textIds is
now performing hundreds of sequential database
calls and individual text index lookups! Postgres doesn't know anything
about which query we'll run next, so it can't optimize this at all
(aside from warming up caches). That's clearly not good.
To optimize for Postgres we'd much prefer to make one large database
call which takes an array of TextIds and returns
all the Text results in a single query. This allows
Postgres to save a lot of work by finding all the text values in a single
scan, and means we only incur a single round-trip delay rather than one
per text.
Here's a massively simplified sketch of what the original naive
linear code looked like:
loadTerm :: TermReference -> Transaction (AST TermInfo Text)
loadTerm ref = do
  ast <- loadAST ref
  bitraverse loadTermInfo loadText ast

loadTermInfo :: TermReference -> Transaction TermInfo
loadTermInfo ref =
  queryOneRow [sql| SELECT name, type FROM terms WHERE ref = #{ref} |]

loadText :: TextId -> Transaction Text
loadText textId =
  queryOneColumn [sql| SELECT text FROM texts WHERE id = #{textId} |]
We really want to load all the Texts in a single query, but the
TextIds aren't just sitting in a nice list, they're nested
within the AST structure.
Here's some pseudocode for fetching these as a batch:
batchLoadASTTexts :: AST TermReference TextId -> Transaction (AST TermReference Text)
batchLoadASTTexts ast = do
  let textIds = Foldable.toList ast
  texts <- fetchTexts textIds
  for ast \textId ->
    case Map.lookup textId texts of
      Nothing -> throwError $ MissingText textId
      Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <-
        queryListColumns
          [sql| SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds}) |]
      pure $ Map.fromList resolvedTexts
This solves the biggest problem, most importantly it reduces N
queries down to a single batch query which is already a huge
improvement! However, it is a bit of boilerplate, and we'd need to write
a custom version of this for each container we want to batch load texts
from.
Clever folks will realize that we actually don't care about the
AST structure at all, we only need a container which is
Traversable, so we can generalize over that:
batchLoadTexts :: Traversable t => t TextId -> Transaction (t Text)
batchLoadTexts textIds = do
  resolvedTexts <- fetchTexts (Foldable.toList textIds)
  for textIds \textId ->
    case Map.lookup textId resolvedTexts of
      Nothing -> throwError $ MissingText textId
      Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <-
        queryListColumns
          [sql| SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds}) |]
      pure $ Map.fromList resolvedTexts
This is much better, now we can use this on any form of Traversable,
meaning we can now batch load from ASTs, lists, vectors, Maps, and can
even just use Identity to re-use our query logic for a
single ID like this:
loadText :: TextId -> Transaction Text
loadText textId = do
  Identity text <- batchLoadTexts (Identity textId)
  pure text
This approach does still require that the IDs you want to batch load
are the focus of some Traversable instance. What if instead your
structure contains a half-dozen different ID types, or is arranged such
that it's not in the Traversable slot of your type parameters?
Bitraversable can handle up to two parameters, but after that you're
back to writing bespoke functions for your container types.
For instance, how would we use this technique to batch load our
TermInfo from the AST's TermReferences?
-- Assume we've written these batched term and termInfo loaders:
batchLoadTexts :: Traversable t => t TextId -> Transaction (t Text)
batchLoadTermInfos :: Traversable t => t TermReference -> Transaction (t TermInfo)

loadTerm :: TermReference -> Transaction (AST TermInfo Text)
loadTerm termRef = do
  ast <- loadAST termRef
  astWithText <- batchLoadTexts ast
  ??? astWithText -- How do we load the TermInfos in here?
We're getting closer, but Traversable instances just aren't very
adaptable: the relevant ID must always be in the final type parameter of the
container. In this case you could get by using a Flip
wrapper, but it's not going to be very readable, and the technique
doesn't scale past two parameters.
We need some way to define and compose bespoke Traversable instances
for any given situation.
Custom Traversals
In its essence, the Traversable type class is just a way to easily
provide a canonical implementation of traverse for a given
type:
traverse :: Applicative f => (a -> f b) -> t a -> f (t b)
As it turns out, we don't need a type class in order to construct and
pass functions of this type around, we can define them ourselves.
With this signature it's still requiring that the elements being
traversed are the final type parameter of the container t;
we need a more general version. We can use this instead:
type Traversal s t a b = forall f. Applicative f => (a -> f b) -> s -> f t
It looks very similar, but note that s and
t are now concrete types of kind *, they don't
take a parameter, which means we can pick any fully parameterized type
we like for s and t which focus some other
type a and convert or hydrate it into b.
E.g. If we want a traversal to focus the TermReferences
in an AST and convert them to TermInfos, we
can write:
Traversal (AST TermReference text) (AST TermInfo text) TermReference TermInfo

-- Which expands to the function type:
Applicative f => (TermReference -> f TermInfo) -> AST TermReference text -> f (AST TermInfo text)
If you've ever worked with optics or the lens library
before this should be looking mighty familiar, we've just derived
lens's Traversal
type!
Most optics are essentially just traversals, we can write one-off
traversals for any situation we might need, and can trivially compose
small independent traversals together to create more complex
traversals.
Let's rewrite our batch loaders to take an explicit Traversal
argument.
import Control.Lens qualified as Lens
import Data.Functor.Contravariant

-- Take a traversal, then a structure 's', and replace all TextIds with Texts to
-- transform it into a 't'
batchLoadTextsOf :: Lens.Traversal s t TextId Text -> s -> Transaction t
batchLoadTextsOf traversal s = do
  let textIds = toListOf (traversalToFold traversal) s
  resolvedTexts <- fetchTexts textIds
  Lens.forOf traversal s $ \textId ->
    case Map.lookup textId resolvedTexts of
      Nothing -> throwError $ MissingText textId
      Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <-
        queryListColumns
          [sql| SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds}) |]
      pure $ Map.fromList resolvedTexts

traversalToFold :: (Applicative f, Contravariant f) => Lens.Traversal s t a b -> Lens.LensLike' f s a
traversalToFold traversal f s = phantom $ traversal (phantom . f) s
The *Of naming convention comes from the
lens library: a combinator ending in Of takes
a traversal as an argument.
It's a bit unfortunate that we need traversalToFold,
it's just a quirk of how Traversals and Folds are implemented in the
lens library, but don't worry we'll replace it with something better
soon.
Now we can pass any custom traversal we like into
batchLoadTexts and it will batch up the IDs and hydrate
them in-place.
Let's write the AST traversals we need:
astTexts :: Traversal (AST TermReference TextId) (AST TermReference Text) TextId Text
astTexts = traverse

astTermReferences :: Traversal (AST TermReference text) (AST TermInfo text) TermReference TermInfo
astTermReferences f = bitraverse f pure
Here we can just piggy-back on existing traverse and
bitraverse implementations, but if you need to write your
own, I included a small guide on writing your own custom Traversals with
the traversal
method in the lens library, go check that out.
With this, we can now batch load both the texts and term infos from
an AST in one pass each.
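For example, hydrating a single term might now look like this (a sketch; batchLoadTermInfosOf is assumed to be defined analogously to batchLoadTextsOf, and isn't shown in the post):

loadTerm :: TermReference -> Transaction (AST TermInfo Text)
loadTerm termRef = do
  ast <- loadAST termRef
  -- One batched query for all the text literals in the AST...
  astWithTexts <- batchLoadTextsOf astTexts ast
  -- ...and one for all the term references.
  batchLoadTermInfosOf astTermReferences astWithTexts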
Okay now we're cooking, we've reduced the number of queries per term
from 1 + numTexts + numTermRefs down to a flat
3 queries per term, which is a huge improvement, but
there's more to do.
What if we need to load a whole batch of asts at once? Here's a first
attempt:
-- Assume these batch loaders are in scope:
batchLoadTermASTs :: Traversal s t TermReference (AST TermReference TextId) -> s -> Transaction t
batchLoadTermInfos :: Traversal s t TermReference TermInfo -> s -> Transaction t
batchLoadTexts :: Traversal s t TextId Text -> s -> Transaction t

batchLoadTerms :: Map TermReference TextId -> Transaction (Map TermReference (AST TermInfo Text))
batchLoadTerms termsMap = do
  termASTsMap <- batchLoadTermASTs traverse termsMap
  for termASTsMap \ast -> do
    astWithTexts <- batchLoadTexts astTexts ast
    hydratedAST <- batchLoadTermInfos astTermReferences astWithTexts
    pure hydratedAST
This naive approach loads the ASTs in a batch, but then traverses
over the resulting ASTs, batch loading the terms and texts. This is
better than no batching at all, but we're still running queries in a
loop: 2 queries for each term in the map is still O(N)
queries, and we can do better.
Luckily, Traversals are easily composable! We can effectively
distribute the for loop into our batch calls by
composing an additional traverse so each traversal is
applied to every element of the outer map. In case you're not familiar
with optics, just note that traversals compose from outer to inner,
left to right, using the . operator.
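It looks something like this (a sketch reconstructed from the naive batchLoadTerms above; only the two inner calls change, and the original post's exact code may differ):

batchLoadTerms termsMap = do
  termASTsMap <- batchLoadTermASTs traverse termsMap
  -- Composing an outer 'traverse' means each custom traversal hits every AST
  -- in the map, so each loader runs as one batched query over the whole map.
  astsWithTexts <- batchLoadTexts (traverse . astTexts) termASTsMap
  batchLoadTermInfos (traverse . astTermReferences) astsWithTexts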
It was a small change, but it performs much better at
scale: we went from O(N) queries to O(1)
queries. That is, we now run EXACTLY 3 queries no matter how many terms
we're loading, which is pretty cool. In fact, the latter two queries have no
data-dependencies on each other, so you can also pipeline them if your
DB supports that, but I'll leave that as an exercise (or come ask me on
bluesky).
That's basically the technique, the next section will show a few
tweaks which help me to use it at application scale.
Additional tips
Let's revisit the database layer where we actually make the batch
query:
import Control.Lens qualified as Lens
import Data.Functor.Contravariant

-- Take a traversal, then a structure 's', and replace all TextIds with Texts to
-- transform it into a 't'
batchLoadTextsOf :: Lens.Traversal s t TextId Text -> s -> Transaction t
batchLoadTextsOf traversal s = do
  let textIds = toListOf (traversalToFold traversal) s
  resolvedTexts <- fetchTexts textIds
  Lens.forOf traversal s $ \textId ->
    case Map.lookup textId resolvedTexts of
      Nothing -> throwError $ MissingText textId
      Just text -> pure text
  where
    fetchTexts :: [TextId] -> Transaction (Map TextId Text)
    fetchTexts textIds = do
      resolvedTexts <-
        queryListColumns
          [sql| SELECT id, text FROM texts WHERE id = ANY(#{toArray textIds}) |]
      pure $ Map.fromList resolvedTexts

traversalToFold :: (Applicative f, Contravariant f) => Lens.Traversal s t a b -> Lens.LensLike' f s a
traversalToFold traversal f s = phantom $ traversal (phantom . f) s
This pattern is totally fine, but it does involve materializing and
sorting a Map of all the results, which also requires an Ord instance on
the database key we use. Here's an alternative approach:
import Control.Lens qualified as Lens
import Data.Functor.Contravariant

-- Take a traversal, then a structure 's', and replace all TextIds with Texts to
-- transform it into a 't'
batchLoadTextsOf :: Lens.Traversal s t TextId Text -> s -> Transaction t
batchLoadTextsOf traversal s = do
  s & unsafePartsOf traversal %%~ \textIds -> do
    let orderedIds = zip [0 :: Int32 ..] textIds
    queryListColumns
      [sql|
        WITH text_ids(ord, id) AS (
          SELECT * FROM unnest(#{toArray orderedIds}) AS ids(ord, id)
        )
        SELECT texts.text
          FROM texts
          JOIN text_ids ON texts.id = text_ids.id
          ORDER BY text_ids.ord ASC
      |]
Using unsafePartsOf allows us to act on the foci of a
traversal as though they were in a simple list. The
unsafe bit is that it will crash if we don't return a list
with the exact same number of elements, so be aware of that, but it's
the same crash we'd have gotten in our old version if an ID was missing
a value.
This also allows us to avoid the song-and-dance for converting the
incoming traversal into a fold.
We need the ord column simply because SQL doesn't
guarantee any specific result order unless we specify one. This pairs
up result rows piecewise with the input IDs, and so it doesn't
require any Ord instance.
We can wrap unsafePartsOf with our own combinator to add
a few additional features.
Here's a version which will deduplicate IDs in the input list, will
skip the action if the input list is empty, and will provide a nice
error with a callstack if anything goes sideways.
asListOf :: (HasCallStack, Ord a) => Traversal s t a b -> Traversal s t [a] [b]
asListOf trav f s =
  s & unsafePartsOf trav %%~ \case
    -- No point making a database call which will return no results
    [] -> pure []
    inputs -> do
      -- First, deduplicate the inputs as a self indexed map.
      let asMap = Map.fromList (zip inputs inputs)
      asMap
        -- Call the action with the list of deduped inputs
        & unsafePartsOf traversed f
        <&> \resultMap ->
          -- Now map the result for each input in the original list to its result value
          let resultList = mapMaybe (\k -> Map.lookup k resultMap) inputs
              aLength = length inputs
              bLength = length resultList
           in if aLength /= bLength
                -- Better error message if our query is bad and returns the wrong number of elements.
                then
                  error $
                    "asListOf: length mismatch, expected "
                      ++ show aLength
                      ++ " elements, got "
                      ++ show bLength
                      <> " elements"
                else resultList
Using a tool like this has caveats: it's very easy to cause runtime
crashes if your query isn't written to always return the same
number of results as it was given inputs, and skipping the action on
empty lists could cause some confusion.
Conclusion
I've gotten a ton of use out of this technique in Unison Share, and
managed to speed things up by 2 orders of magnitude. I was also able to
perform a fully batched rewrite of heavily nested code without needing
to re-arrange the code-graph. This was particularly useful because it
allowed me to migrate large portions of the codebase in smaller pieces,
using batched methods with a simple id Traversal, and
plain traverse on methods I hadn't rewritten yet.
You may not get such huge gains if your code isn't pessimistically
linear to begin with, but even then this is a nice, composable way to
write batched code in the first place.
Anyways, give it a go and let me know what you think of it!
Well-Typed was strongly represented at this year’s ZuriHac, with our team of Haskell experts giving
eight talks across ZuriHac itself and the Haskell Ecosystem and Implementors’ Workshops. We’re
pleased to report that the recordings are now available.
ZuriHac Beginners Track
Andres hosted the Beginners Track at ZuriHac, delivering a four-hour tutorial that covers all
the fundamentals of the Haskell language. It’s an excellent starting point for anyone
interested in learning Haskell, taught by one of the community’s most experienced Haskell
educators.
Haskell Ecosystem Workshop
Matt was lucky to be invited to give a talk about our work on memory profiling over the last five years.
Profiling and observability have been a key focus for Well-Typed. We have developed tooling which allows
easy and powerful introspection into the runtime performance of Haskell programs. You can read more about our work
in this area in posts tagged with profiling.
Haskell Implementors Workshop
The Haskell Implementors Workshop was a great opportunity to share our progress on
improvements to GHC over the last year. It’s always nice to take a moment to reflect
on the progress we’ve made and the work we’ve done.
Ben and Andreas kicked things off with the annual GHC status report. This report
provides a summary of the essential maintenance and community stewardship work which
Well-Typed performs for the GHC project.
Hannes introduced recent improvements to GHCi to support multi-unit sessions natively. This is the latest in our long-running work to improve the ecosystem support for project-based workflows with many different packages being developed in parallel.
Rodrigo showcased his work on a standalone step-through debugger for GHC. We have implemented a GHC API application which uses the Debug Adapter Protocol to communicate with any debugger frontend. We look forward to releasing this work to the public in the near future, which will give Haskell programmers access to a maintained and powerful debugger.
Matt presented the work on Explicit Level Imports which aims to make it clear what exactly is needed
by Template Haskell (during compilation) and what is needed during runtime. An important stepping stone to improving the developer experience
for projects relying on both cross compilation and Template Haskell.
Finally, there were two more research-oriented presentations.
Matt presented some joint work with his collaborator Ellis Kesteron on a possible improvement to the desugaring of
Typed Template Haskell quotations, which would make it easier to perform well-typed intensional syntax analysis.
Andreas presented his idea for expressing strictness properties of a function at the type level. His talk explored different ideas about
how these annotations may affect unboxing and optimisation passes such as the worker-wrapper transformation.
Conclusion
Well-Typed offer Haskell Ecosystem Support Packages
in partnership with the Haskell Foundation, to provide commercial
users with support from Well-Typed’s experts, while investing in the Haskell
community and its technical ecosystem.
These projects were made possible by funding from our clients, notably Mercury, who
are improving the experience for Haskell developers by supporting foundational work on Haskell tools.
It was great to meet everyone who attended the workshops and asked interesting
questions during and after the talks. We hope to see you all again next year!
A Fast Bytecode VM for Arithmetic: The Virtual Machine
Introduction
The language that we are going to work with is that of basic arithmetic expressions, with integer values, and addition, subtraction, multiplication and integer division operations. However, our expression language has a small twist: it is possible to introduce a variable using a let binding and use the variable in the expressions in the body of let1. Furthermore, we use the same syntax for let as Haskell does. Here are some examples of valid expressions in our language:
1 + 2 - 3 * 4 + 5 / 6 / 0 + 1
let x = 4 in x + 1
let x = 4 in let y = 5 in x + y
let x = 4 in let y = 5 in x + let z = y in z * z
let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z
let x = (let y = 3 in y + y) in x * 3
let x = let y = 3 in y + y in x * 3
let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3
The only gotcha here is that the body of a let expression extends as far as possible while accounting for nested lets. It becomes clear when we look at parsed expressions later.
The eventual product is a command-line tool that can run different commands. Let’s start with a demo of the tool:
$ arith-vm -h
Bytecode VM for Arithmetic written in Haskell
Usage: arith-vm COMMAND
Available options:
-h,--help Show this help text
Available commands:
parse Parse expression to AST
compile Parse and compile expression to bytecode
disassemble Disassemble bytecode to opcodes
decompile Disassemble and decompile bytecode to expression
interpret-ast Parse expression and interpret AST
interpret-bytecode Parse, compile and assemble expression, and
interpret bytecode
run Run bytecode
generate Generate a random arithmetic expression
$ arith-vm parse -h
Usage: arith-vm parse [FILE]
Parse expression to AST
Available options:
FILE Input file, pass - to read from STDIN (default)
-h,--help Show this help text
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm parse
( let x = 1 in ( let y = 2 in ( y + ( x * 3 ) ) ) )
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm compile > a.tbc
$ hexdump -C a.tbc
00000000 00 01 00 00 02 00 03 01 03 00 00 03 00 06 04 02 |................|
00000010 01 02 01 |...|
00000013
$ arith-vm disassemble a.tbc
OPush 1
OPush 2
OGet 1
OGet 0
OPush 3
OMul
OAdd
OSwap
OPop
OSwap
OPop
$ arith-vm decompile a.tbc
( let v0 = 1 in ( let v1 = 2 in ( v1 + ( v0 * 3 ) ) ) )
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm interpret-ast
5
$ echo -n "let x = 1 in let y = 2 in y + x * 3" | arith-vm interpret-bytecode
5
$ arith-vm run a.tbc
5
$ arith-vm generate
(
(
(
( let nD =
( 11046 - -20414 ) in
( let xqf = ( -15165 * nD ) in nD )
) * 26723
) /
(
( let phMuOI =
( let xQ = ( let mmeBy = -28095 in 22847 ) in 606 ) in 25299
) *
( let fnoNQm = ( let mzZaZk = 29463 in 18540 ) in ( -2965 / fnoNQm ) )
)
) * 21400
)
We can parse an expression, or compile it to bytecode. We can also disassemble bytecode to opcodes, or decompile it back to an expression. We can interpret an expression either as an AST or as bytecode. We can also run a bytecode file directly. Finally, we have a handy command to generate random expressions for testing/benchmarking purposes2.
Let’s start.
Expressions
Since this is Haskell, we start with listing many language extensions and imports:
data Expr
  = Num !Int16
  | Var !Ident
  | BinOp !Op Expr Expr
  | Let !Ident Expr Expr
  deriving (Eq, Generic)

newtype Ident = Ident BS.ByteString
  deriving (Eq, Ord, Generic, Hashable)

data Op = Add | Sub | Mul | Div
  deriving (Eq, Enum, Generic)

instance NFData Expr

instance Show Expr where
  show = \case
    Num n -> show n
    Var (Ident x) -> BSC.unpack x
    BinOp op a b ->
      "(" <> show a <> " " <> show op <> " " <> show b <> ")"
    Let (Ident x) a b ->
      "(let " <> BSC.unpack x <> " = " <> show a <> " in " <> show b <> ")"

instance NFData Ident

instance Show Ident where
  show (Ident x) = BSC.unpack x

mkIdent :: String -> Ident
mkIdent = Ident . BSC.pack

instance NFData Op

instance Show Op where
  show = \case
    Add -> "+"
    Sub -> "-"
    Mul -> "*"
    Div -> "/"
ArithVMLib.hs
We add Show instances for ADTs so that we can pretty-print the parsed AST3. Now, we can start parsing.
expr     ::= term | term space* ("+" | "-") term
term     ::= factor | factor space* ("*" | "/") factor
factor   ::= space* (grouping | num | var | let)
grouping ::= "(" expr space* ")"
num      ::= "-"? [0-9]+
var      ::= ident
ident    ::= ([a-z] | [A-Z])+
let      ::= "let" space+ ident space* "=" expr space* "in" space+ expr space*
space    ::= " " | "\t" | "\n" | "\f" | "\r"
The expr, term, factor, and grouping productions take care of having the right precedence of arithmetic operations. The num and var productions are trivial. Our language is fairly oblivious of whitespaces; we allow zero-or-more spaces at most places.
The let expressions grammar is pretty standard, except we require one-or-more spaces after the let and in keywords to make them unambiguous.
We use the parser combinator library attoparsec for creating the parser. attoparsec works directly with bytestrings so we don’t incur the cost of decoding unicode characters45.
We write the parser in a top-down recursive-descent fashion, same as the grammar, starting with the expr parser:
type SizedExpr = (Expr, Int)

-- expr ::= term | term space* ("+" | "-") term
exprParser :: P.Parser SizedExpr
exprParser = chainBinOps termParser $ \case
  '+' -> pure Add
  '-' -> pure Sub
  op -> fail $ "Expected '+' or '-', got: " <> show op

-- term ::= factor | factor space* ("*" | "/") factor
termParser :: P.Parser SizedExpr
termParser = chainBinOps factorParser $ \case
  '*' -> pure Mul
  '/' -> pure Div
  op -> fail $ "Expected '*' or '/', got: " <> show op

chainBinOps :: P.Parser SizedExpr -> (Char -> P.Parser Op) -> P.Parser SizedExpr
chainBinOps operandParser operatorParser = operandParser >>= rest
  where
    rest (!expr, !size1) =
      ( do
          P.skipSpace
          c <- P.anyChar
          operator <- operatorParser c
          (operand, !size2) <- operandParser
          rest (BinOp operator expr operand, size1 + size2 + 1)
      )
        <|> pure (expr, size1)
{-# INLINE chainBinOps #-}
ArithVMLib.hs
One small complication: our parsers not only return the parsed expressions, but also the number of bytes they occupy when compiled to bytecode. We gather this information while building the AST in parts, and propagate it upward in the tree. We use the bytecode size later in the compilation pass6.
Both exprParser and termParser chain the right higher precedence parsers with the right operators between them7 using the chainBinOps combinator.
-- factor ::= space* (grouping | num | var | let)
factorParser :: P.Parser SizedExpr
factorParser = do
  P.skipSpace
  P.peekChar' >>= \case
    '(' -> groupingParser
    '-' -> numParser
    c | P.isDigit c -> numParser
    c | c /= 'l' -> varParser
    _ -> varParser <|> letParser

-- grouping ::= "(" expr space* ")"
groupingParser :: P.Parser SizedExpr
groupingParser = P.char '(' *> exprParser <* P.skipSpace <* P.char ')'
ArithVMLib.hs
factorParser uses lookahead to dispatch between one of the primary parsers, which is faster than using backtracking. groupingParser simply skips the parenthesis, and recursively calls exprParser.
-- num ::= "-"? [0-9]+
numParser :: P.Parser SizedExpr
numParser = do
  n <- P.signed P.decimal P.<?> "number"
  if validInt16 n
    then pure (Num $ fromIntegral n, 3)
    else fail $ "Expected a valid Int16, got: " <> show n
  where
    validInt16 :: Integer -> Bool
    validInt16 i =
      fromIntegral (minBound @Int16) <= i
        && i <= fromIntegral (maxBound @Int16)
ArithVMLib.hs
numParser uses the signed and decimal parsers from the attoparsec library to parse an optionally signed integer. We restrict the numbers to 2-byte integers (-32768–32767 inclusive)8. The <?> helper from attoparsec names parsers so that the error message shown in case of failures point to the right parser.
varParser and identParser are straightforward. We restrict identifiers to upper-and-lowercase ASCII alphabetic letters. We also check that our reserved keywords (let and in) are not used as identifiers.
In letParser we use identParser to parse the variable name, and recursively call exprParser to parse the assignment and body expressions, while making sure to correctly parse the spaces. The helper parser expect is used to parse known string tokens (let, = and in), and provide good error messages in case of failures. Talking about error messages …
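Here is one possible shape of those remaining parsers, reconstructed from the grammar, the test expectations, and the bytecode sizes visible in the disassembly above (the 2-byte variable size and the per-let +2 are inferences, and the real ArithVMLib.hs code may differ):

-- ident ::= ([a-z] | [A-Z])+
identParser :: P.Parser Ident
identParser = do
  name <- P.takeWhile1 isAsciiAlpha P.<?> "identifier"
  if name == "let" || name == "in"
    then fail $ "Expected identifier, got: " <> show name <> ", which is a reversed keyword"
    else pure $ Ident name
  where
    isAsciiAlpha c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

-- var ::= ident
varParser :: P.Parser SizedExpr
varParser = (\i -> (Var i, 2)) <$> identParser -- a variable compiles to a 2-byte OGet

-- Parses a known string token ("let", "=", "in") with a named error on failure.
expect :: BSC.ByteString -> P.Parser ()
expect tok = (() <$ P.string tok) P.<?> BSC.unpack tok

-- space ::= " " | "\t" | "\n" | "\f" | "\r"
space1 :: P.Parser ()
space1 = (() <$ P.satisfy P.isSpace) P.<?> "space"

-- let ::= "let" space+ ident space* "=" expr space* "in" space+ expr space*
letParser :: P.Parser SizedExpr
letParser = do
  expect "let" *> space1
  x <- identParser
  P.skipSpace *> expect "="
  (assign, !size1) <- exprParser
  P.skipSpace *> expect "in" *> space1
  (body, !size2) <- exprParser
  -- each let adds two trailing clean-up opcodes (OSwap, OPop) to the bytecode
  pure (Let x assign body, size1 + size2 + 2)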
Error Handling
Let’s figure out an error handling strategy. We use an Error type wrapped in Either to propagate the errors in our program:
The Error type also captures the Pass in which the error is thrown. Result is a type alias that represents either an error or a result. Finally, we put all the parsers together to write the parse function.
The Parser
Our parseSized function uses the parse function from attoparsec to run the exprParser over an input.
The processResult function deals with intricacies of how attoparsec returns the parsing result. Basically, we inspect the returned result and throw appropriate errors with useful error messages. We use throwError from the MonadError typeclass that works with all its instances, which Either is one of.
Finally, we throw away the bytecode size from the result of parseSized in the parse function.
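A simplified sketch of those two functions (assumption: the real processResult produces the friendlier "Expected: …, got: …" messages seen in the tests; this version only covers the basic cases):

parseSized :: BSC.ByteString -> Result SizedExpr
parseSized input = processResult (P.parse exprParser input)
  where
    processResult :: P.Result SizedExpr -> Result SizedExpr
    processResult = \case
      -- feed an empty chunk to signal end-of-input to a partial parse
      P.Partial cont -> processResult (cont BSC.empty)
      P.Done leftover result
        | BSC.null leftover -> pure result
        | otherwise -> throwError . ErrorParse $ "Leftover input: " <> show leftover
      P.Fail _ _ err -> throwError (ErrorParse err)

parse :: BSC.ByteString -> Result Expr
parse = fmap fst . parseSized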
The parser is done. But as good programmers, we must make sure that it works correctly. Let’s write some unit tests.
Testing the Parser
We use the hspec library to write unit tests for our program. Each test is written as a spec9.
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import ArithVMLib
import Control.Arrow ((>>>))
import Control.Monad (forM_, (>=>))
import Data.ByteString.Char8 qualified as BSC
import Data.Int (Int16)
import Data.Sequence qualified as Seq
import Test.Hspec
import Test.Hspec.QuickCheck
import Test.QuickCheck qualified as Q

parserSpec :: Spec
parserSpec = describe "Parser" $ do
  forM_ parserSuccessTests $ \(input, result) ->
    it ("parses: \"" <> BSC.unpack input <> "\"") $ do
      (show <$> parse input) `shouldBe` Right result
  forM_ parserErrorTests $ \(input, err) ->
    it ("fails for: \"" <> BSC.unpack input <> "\"") $ do
      parse input `shouldSatisfy` \case
        Left (ErrorParse msg) | err == msg -> True
        _ -> False

parserSuccessTests :: [(BSC.ByteString, String)]
parserSuccessTests =
  [ ("1 + 2 - 3 * 4 + 5 / 6 / 0 + 1", "((((1 + 2) - (3 * 4)) + ((5 / 6) / 0)) + 1)"),
    ("1+2-3*4+5/6/0+1", "((((1 + 2) - (3 * 4)) + ((5 / 6) / 0)) + 1)"),
    ("1 + -1", "(1 + -1)"),
    ("let x = 4 in x + 1", "(let x = 4 in (x + 1))"),
    ("let x=4in x+1", "(let x = 4 in (x + 1))"),
    ("let x = 4 in let y = 5 in x + y", "(let x = 4 in (let y = 5 in (x + y)))"),
    ("let x = 4 in let y = 5 in x + let z = y in z * z", "(let x = 4 in (let y = 5 in (x + (let z = y in (z * z)))))"),
    ("let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z", "(let x = 4 in ((let y = 5 in (x + 1)) + (let z = 2 in (z * z))))"),
    ("let x=4in 2+let y=x-5in x+let z=y+1in z/2", "(let x = 4 in (2 + (let y = (x - 5) in (x + (let z = (y + 1) in (z / 2))))))"),
    ("let x = (let y = 3 in y + y) in x * 3", "(let x = (let y = 3 in (y + y)) in (x * 3))"),
    ("let x = let y = 3 in y + y in x * 3", "(let x = (let y = 3 in (y + y)) in (x * 3))"),
    ("let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3", "(let x = (let y = (1 + (let z = 2 in (z * z))) in (y + 1)) in (x * 3))")
  ]

parserErrorTests :: [(BSC.ByteString, String)]
parserErrorTests =
  [ ("", "Not enough input"),
    ("1 +", "Leftover input: \"+\""),
    ("1 & 1", "Leftover input: \"& 1\""),
    ("1 + 1 & 1", "Leftover input: \"& 1\""),
    ("1 & 1 + 1", "Leftover input: \"& 1 + 1\""),
    ("(", "Not enough input"),
    ("(1", "Expected: ')', got: end-of-input"),
    ("(1 + ", "Expected: ')', got: \"+\""),
    ("(1 + 2", "Expected: ')', got: end-of-input"),
    ("(1 + 2}", "Expected: ')', got: \"}\""),
    ("66666", "Expected a valid Int16, got: 66666"),
    ("-x", "Expected: number, got: \"-x\""),
    ("let 1", "Expected: identifier, got: \"1\""),
    ("let x = 1 in ", "Not enough input"),
    ("let let = 1 in 1", "Expected identifier, got: \"let\", which is a reversed keyword"),
    ("let x = 1 in in", "Expected identifier, got: \"in\", which is a reversed keyword"),
    ("let x=1 inx", "Expected: space, got: \"x\""),
    ("letx = 1 in x", "Leftover input: \"= 1 in x\""),
    ("let x ~ 1 in x", "Expected: \"=\", got: \"~\""),
    ("let x = 1 & 2 in x", "Expected: \"in\", got: \"&\""),
    ("let x = 1 inx", "Expected: space, got: \"x\""),
    ("let x = 1 in x +", "Leftover input: \"+\""),
    ("let x = 1 in x in", "Leftover input: \"in\""),
    ("let x = let x = 1 in x", "Expected: \"in\", got: end-of-input")
  ]
ArithVMSpec.hs
We have a bunch of tests for the parser, testing both success and failure cases. Notice how spaces are treated in the expressions. Also notice how the let expressions are parsed. We’ll add property-based tests for the parser in the next post.
There is not much we can do with the parsed ASTs at this point. Let’s write an interpreter to evaluate our ASTs.
The AST Interpreter
The AST interpreter is a standard and short recursive interpreter with an environment mapping variables to their values:
interpretAST :: Expr -> Result Int16
interpretAST = go Map.empty
  where
    go env = \case
      Num n -> pure n
      Var x -> case Map.lookup x env of
        Just v -> pure v
        Nothing -> throwInterpretError $ "Unknown variable: " <> show x
      BinOp op a b -> do
        !a' <- go env a
        !b' <- go env b
        case op of
          Add -> pure $! a' + b'
          Sub -> pure $! a' - b'
          Mul -> pure $! a' * b'
          Div | b' == 0 -> throwInterpretError "Division by zero"
          Div | b' == (-1) && a' == minBound -> throwInterpretError "Arithmetic overflow"
          Div -> pure $! a' `div` b'
      Let x assign body -> do
        !val <- go env assign
        go (Map.insert x val env) body
    throwInterpretError = throwError . ErrorInterpretAST
ArithVMLib.hs
This interpreter serves both as a performance baseline for the bytecode VM we write later, and as a definitional interpreter for testing the VM10. We are careful in detecting division-by-zero and arithmetic overflow errors, but we ignore possible integer overflow/underflow errors that may be caused by the arithmetic operations.
Testing the Interpreter
We write some unit tests for the interpreter following the same pattern as the parser:
astInterpreterSpec :: Spec
astInterpreterSpec = describe "AST interpreter" $ do
  forM_ astInterpreterSuccessTests $ \(input, result) ->
    it ("interprets: \"" <> BSC.unpack input <> "\"") $ do
      parseInterpret input `shouldBe` Right result
  forM_ astInterpreterErrorTests $ \(input, err) ->
    it ("fails for: \"" <> BSC.unpack input <> "\"") $ do
      parseInterpret input `shouldSatisfy` \case
        Left (ErrorInterpretAST msg) | err == msg -> True
        _ -> False
  where
    parseInterpret = parse >=> interpretAST

astInterpreterSuccessTests :: [(BSC.ByteString, Int16)]
astInterpreterSuccessTests =
  [ ("1", 1),
    ("1 + 2 - 3 * 4 + 5 / 6 / 1 + 1", -8),
    ("1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)", -3),
    ("1 + -1", 0),
    ("1 * -1", -1),
    ("let x = 4 in x + 1", 5),
    ("let x = 4 in let x = x + 1 in x + 2", 7),
    ("let x = 4 in let y = 5 in x + y", 9),
    ("let x = 4 in let y = 5 in x + let z = y in z * z", 29),
    ("let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z", 13),
    ("let x = let y = 3 in y + y in x * 3", 18),
    ("let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3", 18)
  ]

astInterpreterErrorTests :: [(BSC.ByteString, String)]
astInterpreterErrorTests =
  [ ("x", "Unknown variable: x"),
    ("let x = 4 in y + 1", "Unknown variable: y"),
    ("let x = y + 1 in x", "Unknown variable: y"),
    ("let x = x + 1 in x", "Unknown variable: x"),
    ("1/0", "Division by zero"),
    ("-32768 / -1", "Arithmetic overflow")
  ]
ArithVMSpec.hs
Now, we can run the parser and interpreter tests to make sure that everything works correctly.
main :: IO ()
main = hspec $ do
  parserSpec
  astInterpreterSpec
ArithVMSpec.hs
Output of the test run
$ cabal test -O2
Running 1 test suites...
Test suite specs: RUNNING...
Parser
parses: "1 + 2 - 3 * 4 + 5 / 6 / 0 + 1" [✔]
parses: "1+2-3*4+5/6/0+1" [✔]
parses: "1 + -1" [✔]
parses: "let x = 4 in x + 1" [✔]
parses: "let x=4in x+1" [✔]
parses: "let x = 4 in let y = 5 in x + y" [✔]
parses: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
parses: "let x = 4 in (let y = 5 in x + 1) + let z = 2 in z * z" [✔]
parses: "let x=4in 2+let y=x-5in x+let z=y+1in z/2" [✔]
parses: "let x = (let y = 3 in y + y) in x * 3" [✔]
parses: "let x = let y = 3 in y + y in x * 3" [✔]
parses: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
fails for: "" [✔]
fails for: "1 +" [✔]
fails for: "1 & 1" [✔]
fails for: "1 + 1 & 1" [✔]
fails for: "1 & 1 + 1" [✔]
fails for: "(" [✔]
fails for: "(1" [✔]
fails for: "(1 + " [✔]
fails for: "(1 + 2" [✔]
fails for: "(1 + 2}" [✔]
fails for: "66666" [✔]
fails for: "-x" [✔]
fails for: "let 1" [✔]
fails for: "let x = 1 in " [✔]
fails for: "let let = 1 in 1" [✔]
fails for: "let x = 1 in in" [✔]
fails for: "let x=1 inx" [✔]
fails for: "letx = 1 in x" [✔]
fails for: "let x ~ 1 in x" [✔]
fails for: "let x = 1 & 2 in x" [✔]
fails for: "let x = 1 inx" [✔]
fails for: "let x = 1 in x +" [✔]
fails for: "let x = 1 in x in" [✔]
fails for: "let x = let x = 1 in x" [✔]
AST interpreter
interprets: "1" [✔]
interprets: "1 + 2 - 3 * 4 + 5 / 6 / 1 + 1" [✔]
interprets: "1 + (2 - 3) * 4 + 5 / 6 / (1 + 1)" [✔]
interprets: "1 + -1" [✔]
interprets: "1 * -1" [✔]
interprets: "let x = 4 in x + 1" [✔]
interprets: "let x = 4 in let x = x + 1 in x + 2" [✔]
interprets: "let x = 4 in let y = 5 in x + y" [✔]
interprets: "let x = 4 in let y = 5 in x + let z = y in z * z" [✔]
interprets: "let x = 4 in (let y = 5 in x + y) + let z = 2 in z * z" [✔]
interprets: "let x = let y = 3 in y + y in x * 3" [✔]
interprets: "let x = let y = 1 + let z = 2 in z * z in y + 1 in x * 3" [✔]
fails for: "x" [✔]
fails for: "let x = 4 in y + 1" [✔]
fails for: "let x = y + 1 in x" [✔]
fails for: "let x = x + 1 in x" [✔]
fails for: "1/0" [✔]
fails for: "-32768 / -1" [✔]
Finished in 0.0058 seconds
54 examples, 0 failures
Test suite specs: PASS
Awesome, it works! That’s it for this post. Let’s update our checklist:
In the next part, we write a bytecode compiler for our expression AST.
If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!
Variables are scoped to the body of the let expressions they are introduced in, that is, our language has lexical scoping. Also, variables with same name in inner lets shadow the variables in outer lets.↩︎
If you are wondering why do this at all, when we can directly run the expressions while parsing, I think this is a great little project to learn how to write performant bytecode compilers and VMs in Haskell.↩︎
Bangs (!) that enforce strictness are placed in the Expr ADT (and also in the later code) at the right positions that provide performance benefits. The right positions were found by profiling the program. A bang placed at a wrong position (for example in front of Expr inside BinOp) may ruin the compiler provided optimizations and make the overall program slower.↩︎
attoparsec is very fast, but there are faster parsing libraries in Haskell. On the other hand, attoparsec does not provide great error messages. If the user experience were a higher priority, I’d use the megaparsec library. I find attoparsec to have the right balance of performance, developer experience and user experience. Handwritten parsers from scratch could be faster, but they’d be harder to maintain and use.↩︎
I wrote the first version of the parser using the ReadP library that comes with Haskell standard library. I rewrote it to use attoparsec and found that the rewritten parser was more than 10x faster.↩︎
You don’t need to think about the bytecode size of expressions right now. It’ll become clear when we go over compilation in the next post.↩︎
Certain functions such as chainBinOps are inlined using the INLINE pragma to improve the program performance. The functions to inline were chosen by profiling.↩︎
Since the numbers need to be encoded into bytes when we compile to bytecode, we need to choose some encoding for them. For simpler code, we choose 2-byte integers.↩︎
Testing your parsers is crucial because the parser is your programming language's interface to its users, and also because writing (fast) parsers is difficult and error-prone. Most of the bugs I found in this program were in the parser.↩︎
Again, notice the carefully placed bangs to enforce strictness. Try to figure out why they are placed at some places and not at others.↩︎
Twentyseven
is a Rubik’s cube solver and one of my earliest projects in Haskell.
The first commit dates from January 2014, and version 0.0.0 was uploaded on Hackage in March 2016.
I first heard of Haskell in a course on lambda calculus in 2013.
A programming language with lazy evaluation sounded
like a crazy idea, so I gave it a try.
Since then, I have kept writing in Haskell as my favorite language.
For me it is the ideal blend of programming and math.
And a Rubik’s cube solver is a great excuse for doing group theory.
Twentyseven 1.0.0 is more of a commemorative release for myself,
with the goal of making it compile with the current version of GHC (9.12).
There was surprisingly little breakage:
Aside from that, the code is basically just as it was 9 years ago,
including design decisions that I would find questionable today.
For example, I use unsafePerformIO to read precomputed tables
into top-level constants, but the location of the files to read from
can be configured by command-line arguments, so I better make sure that
the tables are not forced before the location is set…
How Twentyseven works
The input of the program is a string enumerating the 54 facelets
of a Rubik’s cube, each character represents one color.
The facelets follow the order pictured below. They are grouped
by faces (up, left, front, right, back, top), and in each face
they are listed in top-down, left-right order.
The output is a sequence of moves to solve that cube.
U L B' L R2 D R U2 F U2 L2 B2 U B2 D' B2 U' R2 U L2 R2 U
The implementation of Twentyseven is based on Herbert Kociemba’s notes
about Cube Explorer, a program written in Pascal!
The search algorithm is iterative deepening A*, or IDA*. Like A*, IDA* finds
the shortest path between two vertices in a graph.
A conventional A* is not feasible because the state space of a Rubik’s cube is massive (43 252 003 274 489 856 000 states,
literally billions of billions).
Instead, we run a series of depth-first searches
with a maximum allowed number of moves that increases for each search.
As it is based on depth-first search,
IDA* only needs memory for the current path,
which is super cheap.
IDA* relies on an estimate of the number of moves remaining
to reach the solved state. We obtain such an estimate by
projecting the Rubik’s cube state into a simpler puzzle.
For example, we can consider only the permutation of corners,
ignoring their orientation.
We can pre-compute a table mapping each corner permutation
(there are 8! = 40320) to the minimum
number of moves to put the corners back to their location.
This is a lower bound on the number of moves to actually solve a Rubik’s cube.
Different projections yield different lower bounds (for example, by
looking at the permutation of edges instead, or their orientation),
and we can combine lower bounds into their maximum,
yielding a more precise lower bound, and thus a faster IDA*.
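To make the search concrete, here is a minimal, generic sketch of IDA* (my own illustration, not the actual Twentyseven code; in Twentyseven the estimate would be the maximum over several precomputed projection tables):

import Control.Applicative ((<|>))

-- A generic IDA*: repeated depth-first searches with an increasing move limit.
-- 'estimate' must be a lower bound on the number of moves remaining.
idaStar
  :: (s -> Bool)      -- is this the solved state?
  -> (s -> Int)       -- lower bound on remaining moves
  -> (s -> [(m, s)])  -- moves applicable from a state, with their results
  -> s                -- start state
  -> [m]
idaStar solved estimate moves s0 = go (estimate s0)
  where
    go limit = maybe (go (limit + 1)) reverse (dfs s0 [] 0 limit)
    dfs s path depth limit
      | solved s                   = Just path
      | depth + estimate s > limit = Nothing
      | otherwise =
          foldr (\(m, s') acc -> dfs s' (m : path) (depth + 1) limit <|> acc)
                Nothing
                (moves s)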
Putting all that together, we obtain an optimal solver for Rubik’s cubes.
But even with these heuristics, Twentyseven can take hours to solve a random cube optimally.
Kociemba’s Cube Explorer is apparently much faster
(I’ve never tried it myself).
My guess is that the difference is due to a better selection of projections,
yielding better heuristics.
But I haven’t gotten around to figuring out whether I’ve misinterpreted
his notes or whether those improvements can only be found in the code.
A faster alternative is Kociemba’s two phase algorithm.
It is suboptimal, but it solves Rubik’s cubes in a fraction of a second
(1000 cubes per minute).
The first phase puts cubies into a “common orientation”
and “separates” the edges into two groups.
In other words, we reach a state where the permutation
of 12 edges can be decomposed into two disjoint
permutations of 4 and 8 edges respectively.
In the second phase, we restrict the possible moves:
quarter- and half-turns on the top and bottom faces,
half-turns only on the other faces.
These restricted moves preserve the “common orientation” of edges and corners
from phase 1,
and the edges in the middle slice stay in their slice.
Each phase thus performs an IDA* search in a much smaller space
than the full Rubik’s cube state space (2 217 093 120 and 19 508 428 800
states respectively).
Today, 2025-07-23, at 1830 UTC (11:30 am PDT, 2:30 pm EDT, 7:30 pm GMT, 20:30 CET, …)
we are streaming the 47th episode of the Haskell Unfolder live on YouTube.
“Pure parallelism” refers to the execution of pure Haskell functions on multiple CPU cores, (hopefully) speeding up the computation. Since we are still dealing with pure functions, however, we get none of the problems normally associated with concurrent execution: no non-determinism, no need for locks, etc. In this episode we will develop a pure but parallel implementation of linear regression. We will briefly recap how linear regression works, before discussing the two primitive functions that Haskell offers for pure parallelism: par and pseq.
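As a small taste of what those primitives look like (a toy example of my own, not taken from the episode): par sparks the evaluation of its first argument in parallel, while pseq makes the evaluation order explicit.

import Control.Parallel (par, pseq)  -- from the parallel package

-- Sum two lists on (potentially) two cores: spark 'a', evaluate 'b', then combine.
parSum :: [Int] -> [Int] -> Int
parSum xs ys = a `par` (b `pseq` (a + b))
  where
    a = sum xs
    b = sum ys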
About the Haskell Unfolder
The Haskell Unfolder is a YouTube series about all things Haskell hosted by
Edsko de Vries and Andres Löh, with episodes appearing approximately every two
weeks. All episodes are live-streamed, and we try to respond to audience
questions. All episodes are also available as recordings afterwards.
Continuing a series of
posts
on techniques for calculating range queries, today I will present
the sparse table data structure, for doing fast range queries on a
static sequence with an idempotent combining operation.
Motivation
In my previous
post,
we saw that if we have a static sequence and a binary operation with a
group structure (i.e. every element has an inverse), we can
precompute a prefix sum table in \(O(n)\) time, and then use it to answer
arbitrary range queries in \(O(1)\) time.
What if we don’t have inverses? We can’t use prefix sums, but can we
do something else that still allows us to answer range queries in
\(O(1)\)? One thing we could always do would be to construct an \(n \times n\) table storing the answer to every possible range
query—that is, \(Q[i,j]\) would store the value of the range \(a_i \diamond \dots \diamond a_j\). Then we could just look up the answer to
any range query in \(O(1)\). Naively computing the value of each
\(Q[i,j]\) would take \(O(n)\) time, for a total of \(O(n^3)\) time to fill
in each of the entries in the table (we only have to fill in \(Q[i,j]\)
where \(i < j\), but this is still about \(n^2/2\) entries), though it’s not
too hard to fill in the table in \(O(n^2)\) total time, spending only
\(O(1)\) to fill in each entry—I’ll leave this to you as an exercise.
However, \(O(n^2)\) is often too big. Can we do better? More
generally, we are looking for a particular subset of range queries
to precompute, such that the total number is asymptotically less than
\(n^2\), but we can still compute the value of any arbitrary range query
by combining some (constant number of) precomputed ranges. In the case
of a group structure, we were able to compute the values for only
prefix ranges of the form \(1 \dots k\), then compute the value of an arbitrary
range using two prefixes, via subtraction.
A sparse table is exactly such a scheme for precomputing a subset of
ranges (in fact, I believe, but do not know for sure, that this is
where the name “sparse table” comes from—it is “sparse” in the sense
that it only stores a sparse subset of range values). Rather than only
a linear number of ranges, as with prefix sums, we have to compute
\(O(n \lg n)\) of them, but that’s still way better than \(O(n^2)\). Note,
however, that a sparse table only works when the combining operation
is idempotent, that is, when \(x \diamond x = x\) for all \(x\). For
example, we can use a sparse table with combining operations such as
\(\max\) or \(\gcd\), but not with \(+\) or \(\times\). Let’s see how it works.
Sparse tables
The basic idea behind a sparse table is that we precompute a series of
“levels”, where level \(i\) stores values for ranges of length \(2^i\). So level
\(0\) stores “ranges of length \(1\)”—that is, the elements of the
original sequence; level \(1\) stores ranges of length \(2\); level
\(2\) stores ranges of length \(4\); and so on. Formally, \(T[i,j]\)
stores the value of the range of length \(2^i\) starting at index \(j\).
That is,
\[T[i,j] = a_j \diamond a_{j+1} \diamond \dots \diamond a_{j + 2^i - 1}.\]
We can see that \(i\) only needs to go from \(0\) up to \(\lfloor \lg n \rfloor\); above that and the stored ranges would be larger than
the entire sequence. So this table has size \(O(n \lg n)\).
Two important questions remain: how do we compute this table in the
first place? And once we have it, how do we use it to answer arbitrary
range queries in \(O(1)\)?
Computing the table is easy: each range on level \(i\), of length \(2^i\), is the
combination of two length-\(2^{i-1}\) ranges from the previous level. That is,
\[T[i,j] = T[i-1, j] \diamond T[i-1, j+2^{i-1}]\]
The zeroth level just consists of the elements of the original
sequence, and we can compute each subsequent level using values from
the previous level, so we can fill in the entire table in \(O(n \lg n)\)
time, doing just a single combining operation for each value in the table.
Once we have the table, we can compute the value of an arbitrary
range \([l,r]\) as follows:
Compute the biggest power of two that fits within the range, that
is, the largest \(k\) such that \(2^k \leq r - l + 1\). We can compute
this simply as \(\lfloor \lg (r - l + 1) \rfloor\).
Look up two range values of length \(2^k\), one for the range which begins at \(l\)
(that is, \(T[k, l]\)) and one for the range which ends at \(r\) (that is, \(T[k, r - 2^k + 1]\)). These two ranges overlap; but because the combining
operation is idempotent, combining the values of the ranges yields
the value for our desired range \([l,r]\).
This is why we require the combining operation to be idempotent:
otherwise the values in the overlap would be overrepresented in the
final, combined value.
Haskell code
Let’s write some Haskell code! First, a little module for idempotent
semigroups. Note that we couch everything in terms of semigroups,
not monoids, because we have no particular need of an identity
element; indeed, some of the most important examples like \(\min\) and
\(\max\) don’t have an identity element. The IdempotentSemigroup
class has no methods, since as compared to Semigroup it only adds a
law. However, it’s still helpful to signal the requirement. You
might like to convince yourself that all the instances listed below
really are idempotent.
module IdempotentSemigroup where

import Data.Bits
import Data.Semigroup

-- | An idempotent semigroup is one where the binary operation
-- satisfies the law @x <> x = x@ for all @x@.
class Semigroup m => IdempotentSemigroup m

instance Ord a => IdempotentSemigroup (Min a)
instance Ord a => IdempotentSemigroup (Max a)
instance IdempotentSemigroup All
instance IdempotentSemigroup Any
instance IdempotentSemigroup Ordering
instance IdempotentSemigroup ()
instance IdempotentSemigroup (First a)
instance IdempotentSemigroup (Last a)
instance Bits a => IdempotentSemigroup (And a)
instance Bits a => IdempotentSemigroup (Ior a)
instance (IdempotentSemigroup a, IdempotentSemigroup b) => IdempotentSemigroup (a, b)
instance IdempotentSemigroup b => IdempotentSemigroup (a -> b)
Now, some code for sparse tables. First, a few imports.
{-# LANGUAGE TupleSections #-}

module SparseTable where

import Data.Array (Array, array, (!))
import Data.Bits (countLeadingZeros, finiteBitSize, (!<<.))
import IdempotentSemigroup
The sparse table data structure itself is just a 2D array over some
idempotent semigroup m. Note that UArray would be more efficient,
but (1) that would make the code for building the sparse table more
annoying (more on this later), and (2) it would require a bunch of
tedious additional constraints on m.
newtype SparseTable m = SparseTable (Array (Int, Int) m)
  deriving (Show)
We will frequently need to compute rounded-down base-two logarithms,
so we define a function for it. A straightforward implementation
would be to repeatedly shift right by one bit and count the number of
shifts needed to reach zero; however, there is a better way, using
Data.Bits.countLeadingZeros. It has a naive default implementation
which counts right bit shifts, but in most cases it compiles down to
much more efficient machine instructions.
-- | Logarithm base 2, rounded down to the nearest integer. Computed
-- efficiently using primitive bitwise instructions, when available.
lg :: Int -> Int
lg n = finiteBitSize n - 1 - countLeadingZeros n
Now let’s write a function to construct a sparse table, given a
sequence of values. Notice how the sparse table array st is defined
recursively.
This works because the Array type is lazy in the stored values, with
the added benefit that only the array values we end up actually
needing will be computed. However, this comes with a decent amount of
overhead. If we wanted to use an unboxed array instead, we wouldn’t
be able to use
the recursive definition trick; instead, we would have to use an
STUArray
and fill in the values in a specific order. The code for this would
be longer and much more tedious, but could be faster if we end up
needing all the values in the array anyway.
-- | Construct a sparse table which can answer range queries over the
-- given list in $O(1)$ time. Constructing the sparse table takes
-- $O(n \lg n)$ time and space, where $n$ is the length of the list.
fromList :: IdempotentSemigroup m => [m] -> SparseTable m
fromList ms = SparseTable st
 where
  n = length ms
  lgn = lg n
  st =
    array ((0, 0), (lgn, n - 1)) $
      zip ((0,) <$> [0 ..]) ms
        ++ [ ((i, j), st ! (i - 1, j) <> st ! (i - 1, j + 1 !<<. (i - 1)))
           | i <- [1 .. lgn]
           , j <- [0 .. n - 1 !<<. i]
           ]
Finally, we can write a function to answer range queries.
-- | $O(1)$. @range st l r@ computes the range query which is the
-- @sconcat@ of all the elements from index @l@ to @r@ (inclusive).
range :: IdempotentSemigroup m => SparseTable m -> Int -> Int -> m
range (SparseTable st) l r = st ! (k, l) <> st ! (k, r - (1 !<<. k) + 1)
 where
  k = lg (r - l + 1)
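As a quick usage example (my own sanity check, not from the original post):

λ> st = fromList (map Min [5, 3, 8, 6, 2 :: Int])
λ> range st 1 3
Min {getMin = 3}
λ> range st 2 4
Min {getMin = 2}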
Applications
Most commonly, we can use a sparse table to find the minimum or
maximum values on a range, \(\min\) and \(\max\) being the quintessential
idempotent operations. For example, this plays a key role in a
solution to the (quite tricky) problem
Ograda. (At first it
seemed like that problem should be solvable with some kind of sliding
window approach, but I couldn’t figure out how to make it work!)
What if we want to find the index of the minimum or maximum value in
a given range (see, for example, Worst Weather)? We can easily accomplish this using the semigroup Min (Arg m i) (or Max (Arg m i)), where m is the type of the values and i is
the index type. Arg, from Data.Semigroup, is just a pair which uses only the first value
for its Eq and Ord instances, and carries along the second value
(which is also exposed via Functor, Foldable, and Traversable
instances). In the example below, we can see that the call to range st 0 3 returns both the max value on the range (4) and its index
(2) which got carried along for the ride:
λ> :m +Data.Semigroup
λ> st = fromList (map Max (zipWith Arg [2, 3, 4, 2, 7, 4, 9] [0..]))
λ> range st 0 3
Max {getMax = Arg 4 2}
Finally, I will mention that being able to compute range minimum
queries is one way to compute lowest common ancestors for a (static,
rooted) tree. First, walk the tree via a depth-first search and
record the depth of each node encountered in sequence, a so-called
Euler tour (note
that you must record every visit to a node—before visiting any of
its children, in between each child, and after visiting all the
children). Now the minimum depth recorded between visits to any two
nodes will correspond to their lowest common ancestor.
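As a rough sketch of how that fits with the code above (hypothetical names, not from the post): if eulerTour is the list of (depth, node) pairs recorded during the walk, and firstVisit gives each node’s first index in the tour, then an LCA query is just a range minimum over the tour.

import Data.Semigroup (Arg (..), Min (..))

-- Build a sparse table over the Euler tour of (depth, node) pairs.
lcaTable :: [(Int, Int)] -> SparseTable (Min (Arg Int Int))
lcaTable eulerTour = fromList [Min (Arg d v) | (d, v) <- eulerTour]

-- The LCA of u and v is the node of minimum depth between their first visits.
lca :: SparseTable (Min (Arg Int Int)) -> (Int -> Int) -> Int -> Int -> Int
lca st firstVisit u v = node
  where
    i = min (firstVisit u) (firstVisit v)
    j = max (firstVisit u) (firstVisit v)
    Min (Arg _depth node) = range st i j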
Here are a few problems that involve computing least common ancestors
in a tree, though note there are also other techniques for computing
LCAs (such as binary jumping) which I plan to write about eventually.
The Stackage team is happy to announce that
Stackage LTS version 24 was finally
released a couple of days ago, based on GHC stable version 9.10.2.
LTS 24 includes many
package changes, and over
3400 packages! Thank you for all your nightly contributions that made this
release possible: the initial release was prepared by Mihai Maruseac. The
closest nightly snapshot to lts-24.0 is
nightly-2025-07-13.
At the same time we are excited to move Stackage
Nightly to GHC 9.12.2: the initial snapshot
release is nightly-2025-07-15.
Current nightly has over 3100 packages, and we expect that number to grow over
the coming weeks and months: we welcome your contributions and help with this.
This initial release build was made by Jens Petersen (31 commits).
A number of packages have been disabled, with the switch to a new GHC version.
You can see all the
changes
made relative to the preceding nightly snapshot (the last one based on GHC 9.10).
Apart from trying to build yourself, the easiest way to understand why
particular packages are disabled is to look for their < 0 lines in
build-constraints.yaml,
particularly under the "Library and exe bounds failures" section.
We also have some
tracking issues
still open related to 9.12 core boot libraries.
Thank you to all those who have already done work updating their packages for ghc-9.12.
Today, 2025-07-09, at 1830 UTC (11:30 am PDT, 2:30 pm EDT, 7:30 pm GMT, 20:30 CET, …)
we are streaming the 46th episode of the Haskell Unfolder live on YouTube.
In this episode targeted at beginners, we show the end-to-end application development process, starting from an empty directory. We’ll consider package configuration, taking advantage of editor integration, how to deal with dependencies, organizing code into modules, and parsing command line arguments. We will use this to write a simple but useful application.
About the Haskell Unfolder
The Haskell Unfolder is a YouTube series about all things Haskell hosted by
Edsko de Vries and Andres Löh, with episodes appearing approximately every two
weeks. All episodes are live-streamed, and we try to respond to audience
questions. All episodes are also available as recordings afterwards.
Mike and Andres speak to Alex McLean who created the TidalCycles system for electronic music - implemented in Haskell of course. We talk about how Alex got into Haskell coming from Perl, how types helped him think about the structure of music and patterns, the architecture and evolution of TidalCycles, about art, community and making space for new ideas, and lots of things in between.
the Builder type in bytestring produce lazy bytestrings.
At the time I was happy to see that attoparsec seemed to support strict and lazy
bytestrings equally well.
To get on with things I also wrote the simplest function I could come up with
for sending and receiving data over the network – I used send and recv from
Network.Socket.ByteString.Lazy in network. The function was really simple
import Network.Socket.ByteString.Lazy qualifiedas SB
sendCmd :: Conn->Command r->IO (Result r)sendCmd(Conn p)(Command k cmd) = withResource p $ \sock ->do
_ <- SB.send sock $ toWireCmd cmd
resp <- SB.recv sock 4096case decode resp of
Left err -> pure $ Left $ RespError "decode"(TL.pack err)
Right r -> pure $ k <$> fromWireResp cmd r
I knew I'd have to revisit this function, it was naïve to believe that a call to
recv would always result in as single complete response. It was however good
enough to get going. When I got to improving sendCmd I was a little surprised
to find that I'd also have to switch to using strict bytestrings in the parser.
Interlude on the Redis serialisation protocol (RESP3)
The Redis protocol has some defining attributes
It's somewhat of a binary protocol. If you stick to keys and values that fall
within the set of ASCII strings, then the protocol is human-readable and you
can rather easily use netcat or telnet as a client. However, you aren't
limited to storing only readable strings.
It's somewhat of a type-length-value style protocol. Some of the data types
include their length in bytes, e.g. bulk strings and verbatim strings.
Other types include the number of elements, e.g. arrays and maps. A large
number of them have no length at all, e.g. simple strings, integers, and
doubles.
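To make that concrete, here are a few RESP3 encodings (informal examples based on the protocol description, with \r\n as the terminator; this is not code from the client library):

*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n    -- an array of two bulk strings, i.e. the command GET foo
$5\r\nhello\r\n                     -- a bulk string, prefixed by its length in bytes
+OK\r\n                             -- a simple string, no length
:42\r\n                             -- an integer, no length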
I suspect there are good reasons; I gather a lot of it has to do with speed. It
does however cause one issue when writing a client: it's not possible to read a
whole response without parsing it.
Rewriting sendCmd
With that extra information about the RESP3 protocol the naïve implementation
above falls short in a few ways
The read buffer may contain more than one full message and, given the definition
of decode above, any remaining bytes are simply dropped.1
The read buffer may contain less than one full message and then decode will
return an error.2
Surely this must be solvable, because in my mind running the parser results in
one of three things:
Parsing is done and the result is returned, together with any input that
wasn't consumed.
The parsing is not done due to lack of input, this is typically encoded as a
continuation.
The parsing failed so the error is returned, together with input that wasn't
consumed.
So, I started looking in the documentation for the module
Data.Attoparsec.ByteString.Lazy in attoparsec. I was a little surprised to find
that the Result type lacked a way to feed more input to a parser – it only
has two constructors, Done and Fail:
data Result r
    = Fail ByteString [String] String
    | Done ByteString r
I'm guessing the idea is that the function producing the lazy bytestring in the
first place should be able to produce more chunks of data on demand. That's
likely what the lazy variant of recv does, but at the same time it also
requires choosing a maximum length, and that doesn't fit well with RESP3. The lazy
recv isn't quite lazy in the way I needed it to be.
When looking at the parser for strict bytestrings I calmed down. This parser
follows what I've learned about parsers (it's not defined exactly like this;
it's parameterised in its input but for the sake of simplicity I show it with
ByteString as input):
data Result r
    = Fail ByteString [String] String
    | Partial (ByteString -> Result r)
    | Done ByteString r
Then to my delight I found that there's already a function for handling exactly
my problem
parseWith :: Monad m => m ByteString -> Parser a -> ByteString -> m (Result a)
I only needed to rewrite the existing parser to work with strict bytestrings and
work out how to write a function using recv (for strict bytestrings) that
fulfils the requirements to be used as the first argument to parseWith. The
first part wasn't very difficult due to the similarity between attoparsec's
APIs for lazy and strict bytestrings. The second only had one complication. It
turns out recv is blocking, but of course that doesn't work well with
parseWith. I wrapped it in timeout based on the idea that timing out means
there's no more data and the parser should be given an empty string so it
finishes. I also decided to pass the parser as an argument, so I could use the
same function for receiving responses for individual commands as well as for
pipelines. The full receiving function is
import Data.ByteString qualified as BS
import Data.Text qualified as T
import Network.Socket.ByteString qualified as SB

recvParse :: S.Socket -> Parser r -> IO (Either Text (BS.ByteString, r))
recvParse sock parser = do
    parseWith receive parser BS.empty >>= \case
        Fail _ [] err -> pure $ Left (T.pack err)
        Fail _ ctxs err -> pure $ Left $ T.intercalate " > " (T.pack <$> ctxs) <> ": " <> T.pack err
        Partial _ -> pure $ Left "impossible error"
        Done rem result -> pure $ Right (rem, result)
  where
    receive =
        timeout 100_000 (SB.recv sock 4096) >>= \case
            Nothing -> pure BS.empty
            Just bs -> pure bs
Then I only needed to rewrite sendCmd and I wanted to do it in such a way that
any remaining input data could be used by the next call to sendCmd.3 I
settled for modifying the Conn type to hold an IORef ByteString together
with the socket and then the function ended up looking like this
sendCmd :: Conn -> Command r -> IO (Result r)
sendCmd (Conn p) (Command k cmd) = withResource p $ \(sock, remRef) -> do
    _ <- SBL.send sock $ toWireCmd cmd
    rem <- readIORef remRef
    recvParse sock rem resp >>= \case
        Left err -> pure $ Left $ RespError "recv/parse" err
        Right (newRem, r) -> do
            writeIORef remRef newRem
            pure $ k <$> fromWireResp cmd r
What's next?
I've started looking into pub/sub, and basically all of the work described in
this post is a prerequisite for that. It's not very difficult on the protocol
level, but I think it's difficult to come up with a design that allows maximal
flexibility. I'm not even sure it's worth the complexity.
I'm sure that whatever size of buffer I choose to use there'll be someone
out there who's storing values that are larger. Then there's pipelining that
makes it even more of an issue.
To be honest I'm not totally convinced there'll ever be any remaining input.
Unless a single Conn is used by several threads – which would lead to much
pain with the current implementation – or pub/sub is used – which isn't
supported yet.
In a previous blog
post
I categorized a number of different techniques for calculating range queries.
Today, I will discuss one of those techniques which is simple but frequently
useful.
Precomputing prefix sums
Suppose we have a static sequence of values \(a_1, a_2, a_3, \dots, a_n\) drawn from some
group (that is,
there is an associative binary operation with an identity element, and
every element has an inverse), and want
to be able to compute the total value (according to the group
operation) of any contiguous subrange. That is, given a range
\([i,j]\), we want to compute \(a_i \diamond a_{i+1} \diamond \dots \diamond a_j\) (where \(\diamond\) is the group operation). For example,
we might have a sequence of integers and want to compute the sum, or
perhaps the bitwise xor (but not the maximum) of all the values in any particular
subrange.
Of course, we could simply compute \(a_i \diamond \dots \diamond a_j\)
directly, but that takes \(O(n)\) time. With some simple preprocessing,
it’s possible to compute the value of any range in constant time.
The key idea is to precompute an array \(P\) of prefix sums, so \(P_i = a_1 \diamond \dots \diamond a_i\). This can be computed in linear time
via a scan; for example:
import Data.Array
import Data.List (scanl')

prefix :: Monoid a => [a] -> Array Int a
prefix a = listArray (0, length a) $ scanl' (<>) mempty a
Actually, I would typically use an unboxed array, which is
faster but slightly more limited in its uses: import
Data.Array.Unboxed, use UArray instead of Array, and add an
IArray UArray a constraint.
Note that we set \(P_0 = 0\) (or whatever the identity element is for
the group); this is why I had the sequence of values indexed starting
from \(1\), so \(P_0\) corresponds to the empty sum, \(P_1 = a_1\), \(P_2 = a_1 \diamond a_2\), and so on.
Now, for the value of the range \([i,j]\), just compute \(P_j \diamond P_{i-1}^{-1}\)—that is, we start with a prefix that ends at the right place, then
cancel or “subtract” the prefix that ends right before the range we
want. For example, to find the sum of the integers \(a_5 + \dots + a_{10}\), we can compute \(P_{10} - P_4\).
range :: Group a => Array Int a -> Int -> Int -> a
range p i j = p ! j <> inv (p ! (i - 1))
That’s why this only works for groups but not for general monoids:
only in a group can we cancel unwanted values. So, for example,
this works for finding the sum of any range, but not the maximum.
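The Group class isn’t shown above; a minimal version consistent with the range code might look like the following (the groups package on Hackage provides an equivalent class, with the inverse method named invert):

import Data.Monoid (Sum (..))

-- A group is a monoid in which every element has an inverse.
class Monoid a => Group a where
  inv :: a -> a

-- For example, numbers under addition form a group:
instance Num a => Group (Sum a) where
  inv = Sum . negate . getSum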
Practice problems
Want to practice? Here are a few problems that can be solved using
techniques discussed in this post:
It is possible to generalize this scheme to 2D—that is, to compute
the value of any subrectangle of a 2D grid of values from some
group in only \(O(1)\) time. I will leave you the fun of figuring out
the details.
If you’re looking for an extra challenge, here are a few harder
problems which use techniques from this post as an important
component, but require some additional nontrivial ingredients:
Niki and Mike talked to Daniele Micciancio who is a professor at UC San Diego. He's been using Haskell for 20 years, and works in lattice cryptography. We talked to him about how he got into Haskell, using Haskell for teaching theoretical computer science and of course for his research and the role type systems and comonads could play in the design of cryptographic algorithms. Along the way, he gave an accessible introduction to post-quantum cryptography which we really enjoyed. We hope you do, too.
Suppose we have a sequence of values, which is static in the sense
that the values in the sequence will never change, and we want to
perform range queries, that is, for various ranges we want to
compute the total of all consecutive values in the range, according to
some binary combining operation. For example, we might want to
compute the maximum, sum, or product of all the consecutive values in
a certain subrange. We have various options depending on the kind of
ranges we want and the algebraic properties of the operation.
If we want ranges corresponding to a sliding window, we can use
an amortized queue
structure
to find the total of each range in \(O(1)\), for an arbitrary
monoid.
If we want arbitrary ranges but the operation is a group, the
solution is relatively straightforward: we can precompute all
prefix sums, and subtract to find the result for an arbitrary
range in \(O(1)\).
If the operation is an idempotent semigroup (that is, it has the
property that \(x \diamond x = x\) for all \(x\)), we can use a sparse
table, which takes \(O(n \lg n)\) time and space for precomputation,
and then allows us to answer arbitrary range queries in \(O(1)\).
If the operation is an arbitrary monoid, we can use a sqrt tree,
which uses \(O(n \lg \lg n)\) precomputation time and space, and allows
answering arbitrary range queries in \(O(\lg \lg n)\). I will write
about this in a future post.
Dynamic range queries
What if we want dynamic range queries, that is, we want to be able
to interleave range queries with arbitrary updates to the values of
the sequence?
If the operation is an arbitrary monoid, we can use a segment
tree.
If the operation is a group, we can use a Fenwick tree.
I published a paper about Fenwick
trees,
which also discusses segment trees, but I should write more about
them here!
Table
Here’s a table summarizing the above classification scheme. I plan to
fill in links as I write blog posts about each row.
An intriguing talk by Gabriella Gonzalez, delivered at Haskell Love 2020. Based largely on the famous marketing book, Crossing the Chasm. Gonzalez argues that marketing is not about hype, it is about setting priorities: what features and markets are you going to ignore? The key to adoption is to be able to solve a problem that people need solved today and where existing mainstream tools are inadequate. Joe Armstrong will tell you that the key to getting Erlang used was to approach failing projects and ask "Would you like us to build you a prototype?" Gonzalez makes a strong case that Haskell should first aim to capture the interpreters market. He points out that the finance/blockchain market may be another possibility. Recommended to me at Lambda Days by Pedro Abreu, host of the Type Theory Forall podcast.
Arriving at a type for Redis commands required a bit of exploration. I had some
ideas early on that I, for various reasons, ended up dropping along the way. This is
a post about my travels; hopefully someone finds it worth reading.
The protocol
The Redis Serialization Protocol (RESP) initially reminded me of JSON and I
thought that following the pattern of aeson might be a good idea. I decided
up-front that I'd only support the latest version of RESP, i.e. version 3. So, I
thought of a data type, Resp with a constructor for each RESP3 data type, and
a pair of type classes, FromResp and ToResp for converting between Haskell
types and RESP3. Then after some more reflection I realised that converting to
RESP is largely pointless. The main reason to convert anything to RESP3 is to
assemble a command, with its arguments, to send to Redis, but all commands are
arrays of bulk strings so it's unlikely that anyone will actually use
ToResp.1 So I scrapped the idea of ToResp. FromResp looked like this
class FromResp a where
    fromResp :: Value -> Either FromRespError a
When I started defining commands I didn't like the number of ByteString
arguments that resulted, so I defined a data type, Arg, and an accompanying
type class for arguments, ToArg.
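The definitions themselves aren't shown in this post, but judging by how unArg and toArg are used further down, they presumably have roughly this shape (a guess on my part, not the actual code):

import Data.ByteString (ByteString)

-- Roughly the shape of Arg/ToArg as used below (hypothetical reconstruction).
newtype Arg = Arg {unArg :: [ByteString]}

instance Semigroup Arg where
    Arg a <> Arg b = Arg (a <> b)

class ToArg a where
    toArg :: a -> Arg

instance ToArg ByteString where
    toArg = Arg . pure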
Later on I saw that it might also be nice to have a type class specifically for
keys, ToKey, though that's a wrapper for a single ByteString.
Implementing the functions to encode/decode the protocol was a straightforward
application of attoparsec and bytestring (using its Builder).
A command is a function in need of a sender
Even though supporting pipelining was one of the goals I felt a need to make
sure I'd understood the protocol so I started off with single commands. The
protocol is a simple request/response protocol at the core so I settled on this
type for commands
type Cmd a = forall m. (Monad m) => (ByteString -> m ByteString) -> m (Either FromRespError a)
that is, a command is a function accepting a sender and returning an a.
I wrote a helper function for defining commands, sendCmd
sendCmd :: (Monad m, FromResp a) => [ByteString] -> (ByteString -> m ByteString) -> m (Either FromRespError a)
sendCmd cmdArgs send = do
    let cmd = encode $ Array $ map BulkString cmdArgs
    send cmd <&> decode >>= \case
        Left desc -> pure $ Left $ FromRespError "Decode" (Text.pack desc)
        Right v -> pure $ fromValue v
which made it easy to define commands. Here are two examples, append and mget:
append :: (ToArg a, ToArg b) => a -> b -> Cmd Int
append key val = sendCmd $ ["APPEND"] <> unArg (toArg key <> toArg val)

-- | https://redis.io/docs/latest/commands/mget/
mget :: (ToArg a, FromResp b) => NE.NonEmpty a -> Cmd (NE.NonEmpty b)
mget ks = sendCmd $ ["MGET"] <> unArg (foldMap1 toArg ks)
The function to send off a command and receive its response, sendAndRecieve,
was just a call to send followed by a call to recv in network (the variants
for lazy bytestrings).
I sort of liked this representation – there's always something pleasant about
finding a way to represent something as a function. There's a very big problem
with it though: it's difficult to implement pipelining!
Yes, Cmd is a functor since (->) r is a functor, and thus it's possible to
make it an Applicative, e.g. using free. However, to implement pipelining it's
necessary to
encode all commands, then
concatenate them all into a single bytestring and send it
read the response, which is a concatenation of the individual commands'
responses, and
convert each separate response from RESP3.
That isn't easy when each command contains its own encoding and decoding. The
sender function would have to relinquish control after encoding the command, and
resume again later to decode the response. I suspect it's doable using
continuations, or monad-coroutine, but it felt complicated and rather than
travelling down that road I asked for ideas on the Haskell Discourse. The
replies led me to a paper, Free delivery, and a bit later a package,
monad-batcher. When I got the pointer to the package I'd already read the paper
and started implementing the ideas in it, so I decided to save exploring
monad-batcher for later.
A command for free delivery
The paper Free delivery is a perfect match for pipelining in Redis, and my
understanding is that it proposes a solution where
Commands are defined as a GADT, Command a.
Two functions are defined to serialise and deserialise a Command a. In the
paper they use String as the serialisation, so show and read are used.
A type, ActionA a, is defined that combines a command with a modification
of its a result. It implements Functor.
A free type, FreeA f a is defined, and made into an Applicative with the
constraint that f is a Functor.
A function, serializeA, is defined that traverses a FreeA ActionA a
serialising each command.
A function, deserializeA, is defined that traverses a FreeA ActionA a
deserialising the response for each command.
I defined a command type, Command a, with only three commands in it, echo,
hello, and ping. I then followed the recipe above to verify that I could get
it working at all. The Haskell used in the paper is showing its age, and there
seems to be a Functor instance missing, but it was still straightforward and
I could verify that it worked against a locally running Redis.
Then I made a few changes…
I renamed the command type to Cmd so I could use Command for what the
paper calls ActionA.
data Cmd r where
    Echo :: Text -> Cmd Text
    Hello :: Maybe Int -> Cmd ()
    Ping :: Maybe Text -> Cmd Text

data Command a = forall r. Command !(r -> a) !(Cmd r)

instance Functor Command where
    fmap f (Command k c) = Command (f . k) c
toWireCmd :: Cmd r -> ByteString
toWireCmd (Echo msg) = _
toWireCmd (Hello ver) = _
toWireCmd (Ping msg) = _

fromWireResp :: Cmd r -> Resp -> Either RespError r
fromWireResp (Echo _) = fromResp
fromWireResp (Hello _) = fromResp
fromWireResp (Ping _) = fromResp
(At this point I was still using FromResp.)
I also replaced the free applicative defined in the paper and started using
free. A couple of type aliases make it a little easier to write nice signatures
type Pipeline a = Ap Command a
type PipelineResult a = Validation [RespError] a
and defining individual pipeline commands turned into something rather
mechanical. (I also swapped the order of the arguments to build a Command so I
can use point-free style here.)
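The wrappers themselves aren't shown in the post; given the Command type above and liftAp from Control.Applicative.Free, they presumably look something like this (a sketch of mine, using the constructor order exactly as shown earlier):

import Control.Applicative.Free (liftAp)

-- Lift a single command into a pipeline (hypothetical reconstruction).
echo :: Text -> Pipeline Text
echo = liftAp . Command id . Echo

ping :: Maybe Text -> Pipeline Text
ping = liftAp . Command id . Ping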
On the other hand deserialisation became a little more involved, but it's not
too bad
fromWirePipelineResp :: Pipeline a -> [Resp] -> PipelineResult a
fromWirePipelineResp (Pure a) _ = pure a
fromWirePipelineResp (Ap (Command k c) p) (r : rs) =
    fromWirePipelineResp p rs <*> (k <$> liftError singleton (fromWireResp c r))
fromWirePipelineResp _ _ = Failure [RespError "fromWirePipelineResp" "Unexpected wire result"]
Everything was working nicely and I started adding support for more commands. I
used the small service from work to guide my choice of what commands to add.
First out was del, then get and set. After adding lpush I was pretty much ready
to try to replace hedis in the service from work.
data Cmd r where
    -- echo, hello, ping
    Del :: (ToKey k) => NonEmpty k -> Cmd Int
    Get :: (ToKey k, FromResp r) => k -> Cmd r
    Set :: (ToKey k, ToArg v) => k -> v -> Cmd Bool
    Lpush :: (ToKey k, ToArg v) => k -> NonEmpty v -> Cmd Int
However, when looking at the above definition I started thinking.
Was it really a good idea to litter Cmd with constraints like that?
Would it make sense to keep the Cmd type a bit closer to the actual Redis
commands?
Also, maybe FromResp wasn't such a good idea after all, what if I remove it?
That brought me to the third version of the type for Redis commands.
Converging and simplifying
While adding new commands and writing instances of FromResp I slowly realised
that my initial thinking of RESP3 as somewhat similar to JSON didn't really pan
out. I had quickly dropped ToResp and now the instances of FromResp didn't
sit right with me. They obviously had to "follow the commands", so to speak, but
at the same time allow users to bring their own types. For instance, LPUSH
returns the number of pushed messages, but at the same time GET should be able
to return an Int too. This led to Int's FromResp looking like this
instance FromResp Int where
    fromResp (BulkString bs) =
        case parseOnly (AC8.signed AC8.decimal) bs of
            Left s -> Left $ RespError "FromResp" (TL.pack s)
            Right n -> Right n
    fromResp (Number n) = Right $ fromEnum n
    fromResp _ = Left $ RespError "FromResp" "Unexpected value"
I could see this becoming worse. Take the instance for Bool (sketched
below): I'd have to consider that
for MOVE, Integer 1 means True and Integer 0 means False
for SET, SimpleString "OK" means True
users would justifiably expect a bunch of bytestrings to be True, e.g.
BulkString "true", BulkString "TRUE", BulkString "1", etc.
However, it's impossible to cover all ways users can encode a Bool in a
ByteString, so no matter what I do users will end up having to wrap their
Bool in a newtype and implement a fitting FromResp. On top of that, even
though I haven't found any example of it yet, I fully expect there to be,
somewhere in the large set of Redis commands, at least two commands each wanting
an instance of a basic type that simply can't be combined into a single
instance, meaning that the client library would need to do some newtype
wrapping too.
No, I really didn't like it! So, could I get rid of FromResp and still offer
users an API where they can use their own types as the result of commands?
To be concrete I wanted this
data Cmd r where
    -- other commands
    Get :: (ToKey k) => k -> Cmd (Maybe ByteString)
and I wanted the user to be able to conveniently turn a Cmd r into a Cmd s.
In other words, I wanted a Functor instance. Making Cmd itself a functor
isn't necessary and I just happened to already have a functor type that wraps
Cmd, the Command type I used for pipelining. If I were to use that I'd need
to write wrapper functions for each command though, but if I did that then I
could also remove the ToKey/ToArg constraints from the constructors of Cmd
r and put them on the wrapper instead. I'd get
data Cmd r where
    -- other commands
    Get :: Key -> Cmd (Maybe ByteString)

get :: (ToKey k) => k -> Command (Maybe ByteString)
get = Command id . Get . toKey
I'd also have to rewrite fromWireResp so it's more specific for each command.
Instead of
fromWireResp :: Cmd r -> Resp -> Either RespError r
fromWireResp (Get _) = fromResp
...
I had to match up exactly on the possible replies to GET
fromWireResp :: Cmd r -> Resp -> Either RespError r
fromWireResp _ (SimpleError err desc) = Left $ RespError (T.decodeUtf8 err) (T.decodeUtf8 desc)
fromWireResp (Get _) (BulkString bs) = Right $ Just bs
fromWireResp (Get _) Null = Right Nothing
...
fromWireResp _ _ = Left $ RespError "fromWireResp" "Unexpected value"
Even though it was more code I liked it better than before, and I think it's
slightly simpler code. I also hope it makes the use of the API a bit simpler
and clearer.
Here's an example from the code for the service I wrote for work. It reads a UTC
timestamp stored in timeKey; the timestamp is a JSON string, so it needs to be
decoded.
readUTCTime :: Connection -> IO (Maybe UTCTime)
readUTCTime conn =
    sendCmd conn (maybe Nothing decode <$> get timeKey) >>= \case
        Left _ -> pure Nothing
        Right datum -> pure datum
What's next?
I'm pretty happy with the command type for now, though I have a feeling I'll
have to revisit Arg and ToArg at some point.
I've just turned the Connection type into a pool using resource-pool, and I
started looking at pub/sub. The latter thing, pub/sub, will require some thought
and experimentation I think. Quite possibly it'll end up in a post here too.
Of course one could use RESP3 as the serialisation format for storing values
in Redis. Personally I think I'd prefer using something more widely used, and
easier to read, such as JSON or BSON.
A couple of weeks ago I needed a small, hopefully temporary, service at work. It
bridges a gap in functionality provided by a legacy system and the functionality
desired by a new system. The legacy system is cumbersome to work with, so we
tend to prefer building anti-corruption layers rather than changing it directly,
and sometimes we implement them as separate services.
This time it was good enough to run the service as a cronjob, but it did need to
keep track of when it ran the last time. It felt silly to spin up a separate DB
just to keep a timestamp, and using another service's DB is something I really
dislike and avoid.1 So, I ended up using the Redis instance that's used as a
cache by an OSS service we host.
The last time I had a look at the options for writing a Redis client in Haskell
I found two candidates, hedis and redis-io. At the time I wrote a short note
about them. This time around I found that nothing much has changed: they are still
the only two contenders and they still suffer from the same issues
hedis still has the same API and I still find it awkward.
redis-io still requires a logger.
I once again decided to use hedis and wrote the service for work in a couple
of days, but this time I thought I'd see what it would take to remove the
requirement on tinylog from redis-io. I spent a few evenings on it, though I
spent most of the time on "modernising" the dev setup, using Nix to build, re-formatting
using fourmolu, etc. I did the same for redis-resp, the main dependency of
redis-io. The result of that can be found on my gitlab account:
At the moment I won't take that particular experiment any further and given that
the most recent change to redis-io was in 2020 (according to its git repo)
I don't think there's much interest upstream either.
Making the changes to redis-io and redis-resp made me a little curious about
the Redis protocol so I started reading about it. It made me start thinking
about implementing a client lib myself. How hard could it be?
I'd also asked a question about Redis client libs on r/haskell and a response
led me to redis-schema. It has a very good README, and its section on
transactions makes the observation that Redis transactions are a perfect match
for Applicative. This pushed me even closer to starting to write a client lib.
What pushed me over the edge was the realisation that pipelining also is a
perfect match for Applicative.
For the last few weeks I've spent some of my free time reading and experimenting
and I'm enjoying it very much. We'll see where it leads, but hopefully I'll at
least have a bit more to write about it.
In January 2009, while just a baby first-year PhD student, I wrote a
blog post titled Abstraction, intuition, and the “monad tutorial
fallacy”.
In it, I made the argument that humans tend to learn best by first
grappling with concrete examples, and only later proceeding to
higher-level intuition and analogies; hence, it’s a mistake to
think that clearly presenting your intuition for a topic will help
other people understand it. Analogies and intuition can help, but
only when accompanied by concrete examples and active engagement. To
illustrate the point, I made up a fictitious programmer with a
fictitious analogy.
But now Joe goes and writes a monad tutorial called “Monads are
Burritos,” under the well-intentioned but mistaken assumption that
if other people read his magical insight, learning about monads will
be a snap for them. “Monads are easy,” Joe writes. “Think of them as
burritos.” Joe hides all the actual details about types and such
because those are scary, and people will learn better if they can
avoid all that difficult and confusing stuff. Of course, exactly
the opposite is true, and all Joe has done is make it harder for
people to learn about monads…
My intention was to choose a fictitious analogy which was obviously
ridiculous and silly, as a parody of many of the monad tutorials which
existed at the time (and still do). Mark Jason Dominus
then wrote a blog post, Monads are like
burritos, pointing out
that actually, monads are kinda like burritos. It’s really funny,
though I don’t think it’s actually a very good analogy, and my guess
is that Mark would agree: it was clearly written as a silly joke and
not as a real way to explain monads.
In any case, from that point the “monads are burritos” meme took on a
life of its own. For example:
So, to set the record straight: “monads are burritos” is not a helpful
analogy! (Yes, I am writing a blog post because People Are Wrong On
The Internet, and I know it probably won’t
make any difference, but here we are.)
The burrito analogy strongly implies that a value of type m a
somehow “contains” a value (or values) of type a. But that is not
true for all monads (e.g. there is no sense in which a value of type
IO String contains a String).
Relatedly, the analogy also implies that a value of type m a can
be “unwrapped” to get an a, but this is impossible for many monads.
It is not actually very easy to take a burrito containing a burrito
and merge it into a single-level burrito. At least this is not in
any sense a natural operation on burritos. Perhaps you could argue
that it is always easy to remove outer tortilla layers (but not the
innermost one since the food will all fall out), but this is a bad
analogy, since in general join does not just “remove” an outer
layer, but somehow merges the effects of two layers into one.
Actually, burritos are a great analogy for the Identity monad!
…but not much beyond that.
On a more positive note, my sense is that the average
pedagogical quality of Haskell materials, and monad tutorials in
particular, has indeed gone up significantly since 2009. I’d love to
think this can be at least partially attributed to my original blog
post, though of course it’s impossible to know that for sure.
(Updated June 2025 for PenroseKiteDart version 1.4)
PenroseKiteDart is a Haskell package with tools to experiment with finite tilings of Penrose’s Kites and Darts. It uses the Haskell Diagrams package for drawing tilings. As well as providing drawing tools, this package introduces tile graphs (Tgraphs) for describing finite tilings. (I would like to thank Stephen Huggett for suggesting planar graphs as a way to represent the tilings).
This document summarises the design and use of the PenroseKiteDart package.
PenroseKiteDart package is now available on Hackage.
In figure 1 we show a dart and a kite. All angles are multiples of 36° (a tenth of a full turn). If the shorter edges are of length 1, then the longer edges are of length φ, where φ is the golden ratio.
Figure 1: The Dart and Kite Tiles
Aperiodic Infinite Tilings
What is interesting about these tiles is:
It is possible to tile the entire plane with kites and darts in an aperiodic way.
Such a tiling is non-periodic and does not contain arbitrarily large periodic regions or patches.
The possibility of aperiodic tilings with kites and darts was discovered by Sir Roger Penrose in 1974. There are other shapes with this property, including a chiral aperiodic monotile discovered in 2023 by Smith, Myers, Kaplan, Goodman-Strauss. (See the Penrose Tiling Wikipedia page for the history of aperiodic tilings)
This package is entirely concerned with Penrose’s kite and dart tilings also known as P2 tilings.
Legal Tilings
In figure 2 we add a temporary green line marking purely to illustrate a rule for making legal tilings. The purpose of the rule is to exclude the possibility of periodic tilings.
If all tiles are marked as shown, then whenever tiles come together at a point, they must all be marked or must all be unmarked at that meeting point. So, for example, each long edge of a kite can be placed legally on only one of the two long edges of a dart. The kite wing vertex (which is marked) has to go next to the dart tip vertex (which is marked) and cannot go next to the dart wing vertex (which is unmarked) for a legal tiling.
Figure 2: Marked Dart and Kite
Correct Tilings
Unfortunately, having a finite legal tiling is not enough to guarantee you can continue the tiling without getting stuck. Finite legal tilings which can be continued to cover the entire plane are called correct and the others (which are doomed to get stuck) are called incorrect. This means that decomposition and forcing (described later) become important tools for constructing correct finite tilings.
2. Using the PenroseKiteDart Package
You will need the Haskell Diagrams package (See Haskell Diagrams) as well as this package (PenroseKiteDart). When these are installed, you can produce diagrams with a Main.hs module. This should import a chosen backend for diagrams such as the default (SVG) along with Diagrams.Prelude.
module Main (main) where

import Diagrams.Backend.SVG.CmdLine
import Diagrams.Prelude
For Penrose’s Kite and Dart tilings, you also need to import the PKD module and (optionally) the TgraphExamples module.
import PKD
import TgraphExamples
Then to output the someExample figure
fig :: Diagram B
fig = someExample

main :: IO ()
main = mainWith fig
Note that the token B is used in the diagrams package to represent the chosen backend for output. So a diagram has type Diagram B. In this case B is bound to SVG by the import of the SVG backend. When the compiled module is executed it will generate an SVG file. (See Haskell Diagrams for more details on producing diagrams and using alternative backends).
3. Overview of Types and Operations
Half-Tiles
In order to implement operations on tilings (decompose in particular), we work with half-tiles. These are illustrated in figure 3 and labelled RD (right dart), LD (left dart), LK (left kite), RK (right kite). The join edges where left and right halves come together are shown with dotted lines, leaving one short edge and one long edge on each half-tile (excluding the join edge). We have shown a red dot at the vertex we regard as the origin of each half-tile (the tip of a half-dart and the base of a half-kite).
The labels are actually data constructors introduced with type operator HalfTile which has an argument type (rep) to allow for more than one representation of the half-tiles.
data HalfTile rep
  = LD rep -- Left Dart
  | RD rep -- Right Dart
  | LK rep -- Left Kite
  | RK rep -- Right Kite
  deriving (Show, Eq)
Tgraphs
We introduce tile graphs (Tgraphs) which provide a simple planar graph representation for finite patches of tiles. For Tgraphs we first specialise HalfTile with a triple of vertices (positive integers) to make a TileFace such as RD(1,2,3), where the vertices go clockwise round the half-tile triangle starting with the origin.
type TileFace = HalfTile (Vertex, Vertex, Vertex)
type Vertex = Int -- must be positive
The function
makeTgraph :: [TileFace] -> Tgraph
then constructs a Tgraph from a TileFace list after checking the TileFaces satisfy certain properties (described below). We also have
faces :: Tgraph -> [TileFace]
to retrieve the TileFace list from a Tgraph.
As an example, the fool (short for fool’s kite and also called an ace in the literature) consists of two kites and a dart (= 4 half-kites and 2 half-darts):
fool :: Tgraph
fool = makeTgraph [ RD (1,2,3), LD (1,3,4) -- right and left dart
                  , LK (5,3,2), RK (5,2,7) -- left and right kite
                  , RK (5,4,3), LK (5,6,4) -- right and left kite
                  ]
To produce a diagram, we simply draw the Tgraph
foolFigure :: Diagram B
foolFigure = draw fool
which will produce the diagram on the left in figure 4.
Alternatively,
foolFigure :: Diagram B
foolFigure = labelled drawj fool
will produce the diagram on the right in figure 4 (showing vertex labels and dashed join edges).
Figure 4: Diagram of fool without labels and join edges (left), and with (right)
When any (non-empty) Tgraph is drawn, a default orientation and scale are chosen based on the lowest numbered join edge. This is aligned on the positive x-axis with length 1 (for darts) or length φ (for kites).
Tgraph Properties
Tgraphs are actually implemented as
newtype Tgraph = Tgraph [TileFace]
  deriving (Show)
but the data constructor Tgraph is not exported to avoid accidentally by-passing checks for the required properties. The properties checked by makeTgraph ensure the Tgraph represents a legal tiling as a planar graph with positive vertex numbers, and that the collection of half-tile faces are both connected and have no crossing boundaries (see note below). Finally, there is a check to ensure two or more distinct vertex numbers are not used to represent the same vertex of the graph (a touching vertex check). An error is raised if there is a problem.
Note: If the TileFaces are faces of a planar graph there will also be exterior (untiled) regions, and in graph theory these would also be called faces of the graph. To avoid confusion, we will refer to these only as exterior regions, and unless otherwise stated, face will mean a TileFace. We can then define the boundary of a list of TileFaces as the edges of the exterior regions. There is a crossing boundary if the boundary crosses itself at a vertex. We exclude crossing boundaries from Tgraphs because they prevent us from calculating relative positions of tiles locally and create touching vertex problems.
For convenience, in addition to makeTgraph, we also have
The first of these (performing no checks) is useful when you know the required properties hold. The second performs the same checks as makeTgraph except that it omits the touching vertex check. This could be used, for example, when making a Tgraph from a sub-collection of TileFaces of another Tgraph.
Main Tiling Operations
There are three key operations on finite tilings, namely decompose, force, and compose.
Decomposition (also called deflation) works by splitting each half-tile into either 2 or 3 new (smaller scale) half-tiles, to produce a new tiling. The fact that this is possible is used to establish the existence of infinite aperiodic tilings with kites and darts. Since our Tgraphs have abstracted away from scale, the result of decomposing a Tgraph is just another Tgraph. However, if we wish to compare before and after with a drawing, the latter should be scaled by a factor of 1/φ times the scale of the former, to reflect the change in scale.
Figure 5: fool (left) and decompose fool (right)
We can, of course, iterate decompose to produce an infinite list of finer and finer decompositions of a Tgraph
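For instance, along these lines (the package provides a function of this kind; the exact name here is only assumed):

decompositions :: Tgraph -> [Tgraph]
decompositions = iterate decompose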
Force works by adding any TileFaces on the boundary edges of a Tgraph which are forced. That is, where there is only one legal choice of TileFace addition consistent with the seven possible vertex types. Such additions are continued until either (i) there are no more forced cases, in which case a final (forced) Tgraph is returned, or (ii) the process finds the tiling is stuck, in which case an error is raised indicating an incorrect tiling. [In the latter case, the argument to force must have been an incorrect tiling, because the forced additions cannot produce an incorrect tiling starting from a correct tiling.]
An example is shown in figure 6. When forced, the Tgraph on the left produces the result on the right. The original is highlighted in red in the result to show what has been added.
Figure 6: A Tgraph (left) and its forced result (right) with the original shown red
Compose
Composition (also called inflation) is an opposite to decompose but this has complications for finite tilings, so it is not simply an inverse. (See Graphs, Kites and Darts and Theorems for more discussion of the problems). Figure 7 shows a Tgraph (left) with the result of composing (right) where we have also shown (in pale green) the faces of the original that are not included in the composition – the remainder faces.
Figure 7: A Tgraph (left) and its (part) composed result (right) with the remainder faces shown pale green
Under some circumstances composing can fail to produce a Tgraph because there are crossing boundaries in the resulting TileFaces. However, we have established that
If g is a forced Tgraph, then compose g is defined and it is also a forced Tgraph.
Try Results
It is convenient to use types of the form Try a for results where we know there can be a failure. For example, compose can fail if the result does not pass the connected and no crossing boundary check, and force can fail if its argument is an incorrect Tgraph. In situations when you would like to continue some computation rather than raise an error when there is a failure, use a try version of a function.
We define Try as a synonym for Either ShowS (which is a monad) in module Tgraph.Try.
type Try a = Either ShowS a
(Note ShowS is String -> String). Successful results have the form Right r (for some correct result r) and failure results have the form Left (s<>) (where s is a String describing the problem as a failure report).
The function
runTry:: Try a -> a
runTry = either error id
will retrieve a correct result but raise an error for failure cases. This means we can always derive an error raising version from a try version of a function by composing with runTry.
force = runTry . tryForce
compose = runTry . tryCompose
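For example, if we would rather fall back to the original Tgraph than raise an error when forcing fails, we can fold over the Try result directly (a hedged sketch; forceOrKeep is a hypothetical helper, not a library function):
-- Hypothetical helper: keep the original Tgraph when forcing fails.
forceOrKeep :: Tgraph -> Tgraph
forceOrKeep g = either (const g) id (tryForce g)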
Elementary Tgraph and TileFace Operations
The module Tgraph.Prelude defines elementary operations on Tgraphs relating vertices, directed edges, and faces. We describe a few of them here.
When we need to refer to particular vertices of a TileFace we use
originV :: TileFace -> Vertex -- the first vertex - red dot in figure 2
oppV :: TileFace -> Vertex -- the vertex at the opposite end of the join edge from the origin
wingV :: TileFace -> Vertex -- the vertex not on the join edge
A directed edge is represented as a pair of vertices.
type Dedge =(Vertex,Vertex)
So (a,b) is regarded as a directed edge from a to b.
When we need to refer to particular edges of a TileFace we use
joinE :: TileFace -> Dedge -- shown dotted in figure 2
shortE :: TileFace -> Dedge -- the non-join short edge
longE :: TileFace -> Dedge -- the non-join long edge
which are all directed clockwise round the TileFace. In contrast, joinOfTile is always directed away from the origin vertex, so is not clockwise for right darts or for left kites:
joinOfTile:: TileFace -> Dedge
joinOfTile face =(originV face, oppV face)
In the special case that a list of directed edges is symmetrically closed [(b,a) is in the list whenever (a,b) is in the list] we can think of this as an edge list rather than just a directed edge list.
For example,
internalEdges :: Tgraph ->[Dedge]
produces an edge list, whereas
boundary :: Tgraph ->[Dedge]
produces single directions. Each directed edge in the resulting boundary will have a TileFace on the left and an exterior region on the right. The function
dedges :: Tgraph ->[Dedge]
produces all the directed edges obtained by going clockwise round each TileFace, so not every edge in the list has an inverse in the list.
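To illustrate the relationship between these edge lists (a hedged sketch of the idea, not the library's implementation): a boundary directed edge is exactly a clockwise face edge whose reverse does not occur among the face edges.
-- Sketch only: recover boundary directed edges from the clockwise face edges
-- by dropping every edge whose inverse is also present.
boundaryFromDedges :: [Dedge] -> [Dedge]
boundaryFromDedges des = filter (\(a,b) -> (b,a) `notElem` des) des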
Note: There is now a class HasFaces (introduced in version 1.4) which includes instances for both Tgraph and [TileFace] and others. This allows some generalisations. In particular the more general types of the above three functions are now
internalEdges :: HasFaces a => a ->[Dedge]
boundary :: HasFaces a => a ->[Dedge]
dedges :: HasFaces a => a ->[Dedge]
Patches (Scaled and Positioned Tilings)
Behind the scenes, when a Tgraph is drawn, each TileFace is converted to a Piece. A Piece is another specialisation of HalfTile using a two dimensional vector to indicate the length and direction of the join edge of the half-tile (from the originV to the oppV), thus fixing its scale and orientation. The whole Tgraph then becomes a list of located Pieces called a Patch.
type Piece = HalfTile (V2 Double)
type Patch = [Located Piece]
Piece drawing functions derive vectors for other edges of a half-tile piece from its join edge vector. In particular (in the TileLib module) we have
drawPiece :: Piece -> Diagram B
dashjPiece :: Piece -> Diagram B
fillPieceDK :: Colour Double -> Colour Double -> Piece -> Diagram B
where the first draws the non-join edges of a Piece, the second does the same but adds a dashed line for the join edge, and the third takes two colours – one for darts and one for kites, which are used to fill the piece as well as using drawPiece.
Patch is an instance of class Transformable, so a Patch can be scaled, rotated, and translated.
Vertex Patches
It is useful to have an intermediate form between Tgraphs and Patches, that contains information about both the location of vertices (as 2D points), and the abstract TileFaces. This allows us to introduce labelled drawing functions (to show the vertex labels) which we then extend to Tgraphs. We call the intermediate form a VPatch (short for Vertex Patch).
type VertexLocMap = IntMap.IntMap (Point V2 Double)
data VPatch = VPatch {vLocs :: VertexLocMap, vpFaces :: [TileFace]} deriving Show
and
makeVP :: Tgraph -> VPatch
calculates vertex locations using a default orientation and scale.
VPatch is made an instance of class Transformable so a VPatch can also be scaled and rotated.
One essential use of this intermediate form is to be able to draw a Tgraph with labels, rotated but without the labels themselves being rotated. We can simply convert the Tgraph to a VPatch, and rotate that before drawing with labels.
labelled draw (rotate someAngle (makeVP g))
We can also align a VPatch using vertex labels.
alignXaxis ::(Vertex, Vertex)-> VPatch -> VPatch
So if g is a Tgraph with vertex labels a and b we can align it on the x-axis with a at the origin and b on the positive x-axis (after converting to a VPatch), instead of accepting the default orientation.
labelled draw (alignXaxis (a,b)(makeVP g))
Another use of VPatches is to share the vertex location map when drawing only subsets of the faces (see Overlaid examples in the next section).
4. Drawing in More Detail
Class Drawable
There is a class Drawable with instances Tgraph, VPatch, and Patch. When the token B is in scope (standing for a fixed backend), we can assume
draw :: Drawable a => a -> Diagram B -- draws non-join edges
drawj :: Drawable a => a -> Diagram B -- as with draw but also draws dashed join edges
fillDK :: Drawable a => Colour Double -> Colour Double -> a -> Diagram B -- fills with colours
where fillDK clr1 clr2 will fill darts with colour clr1 and kites with colour clr2 as well as drawing non-join edges.
These are the main drawing tools. However they are actually defined for any suitable backend b so have more general types.
(Update Sept 2024) As of version 1.1 of PenroseKiteDart, these will be
draw :: (Drawable a, OKBackend b) => a -> Diagram b
drawj :: (Drawable a, OKBackend b) => a -> Diagram b
fillDK :: (Drawable a, OKBackend b) => Colour Double -> Colour Double -> a -> Diagram b
where the class OKBackend is a check to ensure a backend is suitable for drawing 2D tilings with or without labels.
In these notes we will generally use the simpler description of types using B for a fixed chosen backend for the sake of clarity.
The drawing tools are each defined via the class function drawWith using Piece drawing functions.
class Drawable a where
  drawWith :: (Piece -> Diagram B) -> a -> Diagram B

draw = drawWith drawPiece
drawj = drawWith dashjPiece
fillDK clr1 clr2 = drawWith (fillPieceDK clr1 clr2)
To design a new drawing function, you only need to implement a function to draw a Piece (let us call it newPieceDraw)
newPieceDraw :: Piece -> Diagram B
This can then be elevated to draw any Drawable (including Tgraphs, VPatches, and Patches) by applying the Drawable class function drawWith:
newDraw :: Drawable a => a -> Diagram B
newDraw = drawWith newPieceDraw
Class DrawableLabelled
Class DrawableLabelled is defined with instances Tgraph and VPatch, but Patch is not an instance (because this does not retain vertex label information).
class DrawableLabelled a where
labelColourSize :: Colour Double -> Measure Double ->(Patch -> Diagram B)-> a -> Diagram B
So labelColourSize c m modifies a Patch drawing function to add labels (of colour c and size measure m). Measure is defined in Diagrams.Prelude with pre-defined measures tiny, verySmall, small, normal, large, veryLarge, huge. For most of our diagrams of Tgraphs, we use red labels and we also find small is a good default size choice, so we define
labelSize :: DrawableLabelled a => Measure Double ->(Patch -> Diagram B)-> a -> Diagram B
labelSize = labelColourSize red
labelled :: DrawableLabelled a =>(Patch -> Diagram B)-> a -> Diagram B
labelled = labelSize small
and then labelled draw, labelled drawj, labelled (fillDK clr1 clr2) can all be used on both Tgraphs and VPatches, as well as (for example) labelSize tiny draw, or labelColourSize blue normal drawj.
Further drawing functions
There are a few extra drawing functions built on top of the above ones. The function smart is a modifier to add dashed join edges only when they occur on the boundary of a Tgraph
smart ::(VPatch -> Diagram B)-> Tgraph -> Diagram B
So smart vpdraw g will draw dashed join edges on the boundary of g before applying the drawing function vpdraw to the VPatch for g. For example the following all draw dashed join edges only on the boundary for a Tgraph g
smart draw g
smart (labelled draw) g
smart (labelSize normal draw) g
When using labels, the function rotateBefore allows a Tgraph to be drawn rotated without rotating the labels.
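Given the VPatch tools above, a hedged sketch of what rotateBefore amounts to (an assumed definition, not necessarily the library's exact code) is:
-- Assumed definition: convert to a VPatch, rotate that, then apply the
-- (possibly labelled) VPatch drawing function, leaving labels unrotated.
rotateBefore :: (VPatch -> Diagram B) -> Angle Double -> Tgraph -> Diagram B
rotateBefore vpdraw angle = vpdraw . rotate angle . makeVP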
There is also restrictSmart, where restrictSmart g vpdraw vp uses the given vp for drawing boundary joins and drawing faces of g (with vpdraw) rather than converting g to a new VPatch. This assumes vp has locations for vertices in g.
Overlaid examples (location map sharing)
The function
drawForce :: Tgraph -> Diagram B
will (smart) draw a Tgraph g in red overlaid (using <>) on the result of force g as in figure 6. Similarly
drawPCompose :: Tgraph -> Diagram B
applied to a Tgraph g will draw the result of a partial composition of g as in figure 7. That is a drawing of compose g but overlaid with a drawing of the remainder faces of g shown in pale green.
Both these functions make use of sharing a vertex location map to get correct alignments of overlaid diagrams. In the case of drawForce g, we know that a VPatch for force g will contain all the vertex locations for g since force only adds to a Tgraph (when it succeeds). So when constructing the diagram for g we can use the VPatch created for force g instead of starting afresh. Similarly, for drawPCompose g, the VPatch for g contains locations for all the vertices of compose g, so compose g is drawn using the VPatch for g instead of starting afresh.
The location map sharing is done with
subVP :: VPatch ->[TileFace]-> VPatch
so that subVP vp fcs is a VPatch with the same vertex locations as vp, but replacing the faces of vp with fcs. [Of course, this can go wrong if the new faces have vertices not in the domain of the vertex location map so this needs to be used with care. Any errors would only be discovered when a diagram is created.]
For cases where labels are only going to be drawn for certain faces, we need a version of subVP which also gets rid of vertex locations that are not relevant to the faces. For this situation we have
restrictVP:: VPatch ->[TileFace]-> VPatch
which filters out un-needed vertex locations from the vertex location map. Unlike subVP, restrictVP checks for missing vertex locations, so restrictVP vp fcs raises an error if a vertex in fcs is missing from the keys of the vertex location map of vp.
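As a small hedged usage sketch (highlightFaces is a hypothetical helper, and fcs is assumed to be a sub-collection of the faces of g): we can draw chosen faces in red on top of the full drawing, sharing a single vertex location map so the two layers align.
-- Hypothetical helper: draw the faces fcs of g in red over the full drawing
-- of g, sharing one vertex location map via subVP so the layers align.
highlightFaces :: [TileFace] -> Tgraph -> Diagram B
highlightFaces fcs g = lc red (draw (subVP vp fcs)) <> draw vp
  where vp = makeVP g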
5. Forcing in More Detail
The force rules
The rules used by our force algorithm are local and derived from the fact that there are seven possible vertex types as depicted in figure 8.
Figure 8: Seven vertex types
Our rules are shown in figure 9 (omitting mirror symmetric versions). In each case the TileFace shown yellow needs to be added in the presence of the other TileFaces shown.
Figure 9: Rules for forcing
Main Forcing Operations
To make forcing efficient we convert a Tgraph to a BoundaryState to keep track of boundary information of the Tgraph, and then calculate a ForceState which combines the BoundaryState with a record of awaiting boundary edge updates (an update map). Then each face addition is carried out on a ForceState, converting back when all the face additions are complete. It makes sense to apply force (and related functions) to a Tgraph, a BoundaryState, or a ForceState, so we define a class Forcible with instances Tgraph, BoundaryState, and ForceState.
This allows us to define
force :: Forcible a => a -> a
tryForce :: Forcible a => a -> Try a
The first will raise an error if a stuck tiling is encountered. The second uses a Try result which produces a Left failure report for failures and a Right a for a successful result a.
There are several other operations related to forcing including
stepForce :: Forcible a => Int -> a -> a
tryStepForce :: Forcible a => Int -> a -> Try a
addHalfDart, addHalfKite :: Forcible a => Dedge -> a -> a
tryAddHalfDart, tryAddHalfKite :: Forcible a => Dedge -> a -> Try a
The first two force (up to) a given number of steps (=face additions) and the other four add a half dart/kite on a given boundary edge.
Update Generators
An update generator is used to calculate which boundary edges can have a certain update. There is an update generator for each force rule, but also a combined (all update) generator. The force operations mentioned above all use the default all update generator (defaultAllUGen) but there are more general (with) versions that can be passed an update generator of choice. For example
forceWith :: Forcible a => UpdateGenerator -> a -> a
tryForceWith :: Forcible a => UpdateGenerator -> a -> Try a
In fact we defined
force = forceWith defaultAllUGen
tryForce = tryForceWith defaultAllUGen
We can also define
wholeTiles :: Forcible a => a -> a
wholeTiles = forceWith wholeTileUpdates
where wholeTileUpdates is an update generator that just finds boundary join edges to complete whole tiles.
In addition to defaultAllUGen there is also allUGenerator which does the same thing apart from how failures are reported. The reason for keeping both is that they were constructed differently and so are useful for testing.
In fact UpdateGenerators are functions that take a BoundaryState and a focus (list of boundary directed edges) to produce an update map. Each Update is calculated as either a SafeUpdate (where two of the new face edges are on the existing boundary and no new vertex is needed) or an UnsafeUpdate (where only one edge of the new face is on the boundary and a new vertex needs to be created for a new face).
type UpdateGenerator = BoundaryState ->[Dedge]-> Try UpdateMap
type UpdateMap = Map.Map Dedge Update
data Update = SafeUpdate TileFace
| UnsafeUpdate (Vertex -> TileFace)
Completing (executing) an UnsafeUpdate requires a touching vertex check to ensure that the new vertex does not clash with an existing boundary vertex. Using an existing (touching) vertex would create a crossing boundary so such an update has to be blocked.
Forcible Class Operations
The Forcible class operations are higher order and designed to allow for easy additions of further generic operations. They take care of conversions between Tgraphs, BoundaryStates and ForceStates.
class Forcible a where
tryFSOpWith :: UpdateGenerator ->(ForceState -> Try ForceState)-> a -> Try a
tryChangeBoundaryWith :: UpdateGenerator ->(BoundaryState -> Try BoundaryChange)-> a -> Try a
tryInitFSWith :: UpdateGenerator -> a -> Try ForceState
For example, given an update generator ugen and any f:: ForceState -> Try ForceState , then f can be generalised to work on any Forcible using tryFSOpWith ugen f. This is used to define both tryForceWith and tryStepForceWith.
We also specialize tryFSOpWith to use the default update generator
tryFSOp :: Forcible a =>(ForceState -> Try ForceState)-> a -> Try a
tryFSOp = tryFSOpWith defaultAllUGen
Similarly given an update generator ugen and any f:: BoundaryState -> Try BoundaryChange , then f can be generalised to work on any Forcible using tryChangeBoundaryWith ugen f. This is used to define tryAddHalfDart and tryAddHalfKite.
We also specialize tryChangeBoundaryWith to use the default update generator
tryChangeBoundary :: Forcible a =>(BoundaryState -> Try BoundaryChange)-> a -> Try a
tryChangeBoundary = tryChangeBoundaryWith defaultAllUGen
Note that the type BoundaryChange contains a resulting BoundaryState, the single TileFace that has been added, a list of edges removed from the boundary (of the BoundaryState prior to the face addition), and a list of the (3 or 4) boundary edges affected around the change that require checking or re-checking for updates.
The class function tryInitFSWith will use an update generator to create an initial ForceState for any Forcible. If the Forcible is already a ForceState it will do nothing. Otherwise it will calculate updates for the whole boundary. We also have the special case
tryInitFS :: Forcible a => a -> Try ForceState
tryInitFS = tryInitFSWith defaultAllUGen
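As with the other try versions, an error-raising variant can be derived by composing with runTry (a hedged sketch; initFS here is a hypothetical helper rather than a documented library function):
-- Hypothetical helper: initialise a ForceState, raising an error on failure.
initFS :: Forcible a => a -> ForceState
initFS = runTry . tryInitFS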
Efficient chains of forcing operations
Note that (force . force) does the same as force, but we might want to chain other force related steps in a calculation.
For example, consider the following combination which, after decomposing a Tgraph, forces, then adds a half dart on a given boundary edge (d) and then forces again.
combo :: Dedge -> Tgraph -> Tgraph
combo d = force . addHalfDart d . force . decompose
Since decompose :: Tgraph -> Tgraph, the instances of force and addHalfDart d will have type Tgraph -> Tgraph, so each of these operations will begin and end with conversions between Tgraph and ForceState. We would do better to avoid these wasted intermediate conversions by working only with ForceStates, keeping the necessary conversions only at the beginning and end of the whole sequence.
This can be done using tryFSOp. To see this, let us first re-express the forcing sequence using the Try monad, so
force . addHalfDart d . force
becomes
tryForce <=< tryAddHalfDart d <=< tryForce
Note that (<=<) is the Kleisli arrow which replaces composition for monads (defined in Control.Monad). (We could also have expressed this right to left sequence with a left to right version tryForce >=> tryAddHalfDart d >=> tryForce). The definition of combo becomes
combo :: Dedge -> Tgraph -> Tgraph
combo d = runTry . (tryForce <=< tryAddHalfDart d <=< tryForce) . decompose
This has no performance improvement, but now we can pass the sequence to tryFSOp to remove the unnecessary conversions between steps.
The sequence actually has type Forcible a => a -> Try a, but when passed to tryFSOp it specialises to type ForceState -> Try ForceState. This ensures the sequence works on a ForceState and any conversions are confined to the beginning and end of the sequence, avoiding unnecessary intermediate conversions.
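So a version of combo along these lines (a hedged sketch based on the types above) would be:
-- Sketch of the more efficient combo: the whole forcing sequence runs on a
-- single ForceState via tryFSOp, so conversions happen only at the start and end.
combo :: Dedge -> Tgraph -> Tgraph
combo d = runTry . tryFSOp (tryForce <=< tryAddHalfDart d <=< tryForce) . decompose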
A limitation of forcing
To avoid creating touching vertices (or crossing boundaries) a BoundaryState keeps track of locations of boundary vertices. At around 35,000 face additions in a single force operation the calculated positions of boundary vertices can become too inaccurate to prevent touching vertex problems. In such cases it is better to use
recalibratingForce :: Forcible a => a -> a
tryRecalibratingForce :: Forcible a => a -> Try a
These work by recalculating all vertex positions at 20,000 step intervals to get more accurate boundary vertex positions. For example, the Tgraph produced by 6 decompositions of the kingGraph has 2,906 faces. Applying force to this should result in 53,574 faces but will go wrong before it reaches that. This can be fixed by calculating either
recalibratingForce (decompositions kingGraph !!6)
or using an extra force before the decompositions
force (decompositions (force kingGraph) !!6)
In the latter case, the final force only needs to add 17,864 faces to the 35,710 produced by decompositions (force kingGraph) !!6.
6. Advanced Operations
Guided comparison of Tgraphs
Asking if two Tgraphs are equivalent (the same apart from the choice of vertex numbers) is an NP-complete problem. However, we do have an efficient guided way of comparing Tgraphs. In the module Tgraph.Relabelling we have
sameGraph ::(Tgraph,Dedge)->(Tgraph,Dedge)-> Bool
The expression sameGraph (g1,d1) (g2,d2) asks if g2 can be relabelled to match g1 assuming that the directed edge d2 in g2 is identified with d1 in g1. Hence the comparison is guided by the assumption that d2 corresponds to d1.
The comparison is implemented with tryRelabelToMatch, where tryRelabelToMatch (g1,d1) (g2,d2) will either fail with a Left report if a mismatch is found when relabelling g2 to match g1, or will succeed with Right g3 where g3 is a relabelled version of g2. The successful result g3 will match g1 in a maximal tile-connected collection of faces containing the face with edge d1 and have vertices disjoint from those of g1 elsewhere. The comparison tries to grow a suitable relabelling by comparing faces one at a time starting from the face with edge d1 in g1 and the face with edge d2 in g2. (This relies on the fact that Tgraphs are connected with no crossing boundaries, and hence tile-connected.)
There is also an operation which tries to find the union of two Tgraphs guided by a directed edge identification. However, there is extra complexity arising from the fact that the Tgraphs might overlap in more than one tile-connected region. After calculating one overlapping region, the full union uses some geometry (calculating vertex locations) to detect further overlaps.
Finally, there is an operation which finds common regions of overlapping faces of two Tgraphs, guided by a directed edge identification. The resulting common faces will be a sub-collection of faces from the first Tgraph. These are returned as a list as they may not be a connected collection of faces and are therefore not necessarily a Tgraph.
Empires and SuperForce
In Empires and SuperForce we discussed forced boundary coverings which were used to implement both a superForce operation
superForce:: Forcible a => a -> a
and operations to calculate empires.
We will not repeat the descriptions here other than to note that
forcedBoundaryECovering:: Tgraph ->[Tgraph]
finds boundary edge coverings after forcing a Tgraph. That is, forcedBoundaryECovering g will first force g, then (if it succeeds) finds a collection of (forced) extensions to force g such that
each extension has the whole boundary of force g as internal edges.
each possible addition to a boundary edge of force g (kite or dart) has been included in the collection.
(Possible here means: not leading to a stuck Tgraph when forced.) There is also
forcedBoundaryVCovering:: Tgraph ->[Tgraph]
which does the same except that the extensions have all boundary vertices internal rather than just the boundary edges.
Combinations and Explicitly Forced
We introduced a new type Forced (in v 1.3) to enable a Forcible to be explicitly labelled as being forced. For example
forceF :: Forcible a => a -> Forced a
tryForceF :: Forcible a => a -> Try (Forced a)
forgetF :: Forced a -> a
This allows us to restrict certain functions which expect a forced argument by making this explicit.
composeF :: Forced Tgraph -> Forced Tgraph
The definition makes use of theorems established in Graphs, Kites and Darts and Theorems that composing a forced Tgraph does not require a check (for connectedness and no crossing boundaries) and that the result is also forced. This can then be used to define efficient combinations such as
compForce :: Tgraph -> Forced Tgraph  -- compose after forcing
compForce = composeF . forceF
allCompForce :: Tgraph -> [Forced Tgraph]  -- iterated (compose after force) while not emptyTgraph
maxCompForce :: Tgraph -> Forced Tgraph  -- last item in allCompForce (or emptyTgraph)
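For example, a hedged usage sketch (maxComposed is a hypothetical helper, not a library function):
-- Hypothetical helper: the maximally composed Tgraph of g as an ordinary
-- Tgraph, discarding the explicit Forced label with forgetF.
maxComposed :: Tgraph -> Tgraph
maxComposed = forgetF . maxCompForce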
Tracked Tgraphs
The type
data TrackedTgraph = TrackedTgraph
{ tgraph :: Tgraph
, tracked ::[[TileFace]]}deriving Show
has proven useful in experimentation as well as in producing artwork with darts and kites. The idea is to keep a record of sub-collections of faces of a Tgraph when doing both force operations and decompositions. A list of the sub-collections forms the tracked list associated with the Tgraph. We make TrackedTgraph an instance of class Forcible by having force operations only affect the Tgraph and not the tracked list. The significant idea is the implementation of decomposeTracked.
Decomposition of a Tgraph involves introducing a new vertex for each long edge and each kite join. These are then used to construct the decomposed faces. For decomposeTracked we do the same for the Tgraph, but when it comes to the tracked collections, we decompose them re-using the same new vertex numbers calculated for the edges in the Tgraph. This keeps a consistent numbering between the Tgraph and tracked faces, so each item in the tracked list remains a sub-collection of faces in the Tgraph.
The function
drawTrackedTgraph ::[VPatch -> Diagram B]-> TrackedTgraph -> Diagram B
is used to draw a TrackedTgraph. It uses a list of functions to draw VPatches. The first drawing function is applied to a VPatch for any untracked faces. Subsequent functions are applied to VPatches for the tracked list in order. Each diagram is beneath later ones in the list, with the diagram for the untracked faces at the bottom. The VPatches used are all restrictions of a single VPatch for the Tgraph, so will be consistent in vertex locations. When labels are used, there are also drawTrackedTgraphRotated and drawTrackedTgraphAligned for rotating or aligning the VPatch prior to applying the drawing functions.
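As a hedged usage sketch (the colour choices are arbitrary and drawFirstTracked is a hypothetical helper): draw the untracked faces plainly and fill the first tracked sub-collection.
-- Hypothetical helper: untracked faces drawn plainly (bottom layer), the
-- first tracked sub-collection filled with arbitrary dart/kite colours.
drawFirstTracked :: TrackedTgraph -> Diagram B
drawFirstTracked = drawTrackedTgraph [draw, fillDK darkmagenta indigo]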
Note that the result of calculating empires (see Empires and SuperForce) is represented as a TrackedTgraph. The result is actually the common faces of a forced boundary covering, but a particular element of the covering (the first one) is chosen as the background Tgraph with the common faces as a tracked sub-collection of faces.
Diagrams for Penrose Tiles – the first blog introduced drawing Pieces and Patches (without using Tgraphs) and provided a version of decomposing for Patches (decompPatch).
Graphs, Kites and Darts introduced Tgraphs. This gave more details of implementation and results of early explorations. (The class Forcible was introduced subsequently.)
Empires and SuperForce – these new operations were based on observing properties of boundaries of forced Tgraphs.
Have you ever wished you could browse all the Haskell packages
together in your IDE, with full navigation using go-to-definition
and find-references? Here’s a demo of something I hacked together
while at ZuriHac 2025 over the weekend:
In the previous post I talked about
how to index all of Hackage (actually Stackage, strictly speaking,
because it’s not in general possible to build all of Hackage together)
using Glean. Since that post I made some
more progress on the indexer:
The indexer now indexes
types. You can
see type-on-hover working in the demo. The types are similar to what
you see in the Haddock-generated hyperlinked source, except that
here it’s always using the type of the definition and not the type
at the usage site, which might be more specific. That’s a TODO for
later.
Fixed a bunch of things, enriched the index with details about
constructors, fields and class methods, and made indexing more
efficient.
The DB size including types is now about 850MB, and it takes
just under 8 minutes on my 9-year-old laptop to index the nearly
3000 packages in my stackage LTS 21.21 snapshot. (Note: the figures
here were updated on 12-06-2025 when I redid the measurements).
Hooking it up to VS Code
The architecture looks like this:
The LSP server is a modified version of
static-ls, which is
already designed to provide an LSP service based on static
information. I just reimplemented a few of its handlers to make calls
to Glass instead of the existing hie/hiedb implementations. You can
see the changes on my fork of
static-ls. Of
course, these changes are still quite hacky and not suitable for
upstreaming.
Glass
is a “Language-agnostic Symbol Server”. Essentially it provides an API
abstraction over Glean with operations that are useful for code
navigation and search.
Where to next?
There remain a few issues to solve before this can be useful.
Make Glean more easily installable. There’s a general consensus that
cabal install glean would lower the barrier to entry
significantly; in order to do this we need to build the folly
dependency using Cabal.
Clean up and ship the LSP server, somehow. Once Glean is
cabal-installable, we can depend on it from an LSP server package.
Think about continuous integration to build the Glean
DB. Perhaps this can piggyback off the stackage CI infra? If we
can already build a complete stackage snapshot, and Glean is
easily installable, then indexing would be fairly
straightforward. I’d love to hear suggestions on how best to do
this.
And looking forwards a bit further:
Think about how to handle multiple packages versions. There’s no
fundamental problem with indexing multiple package versions, except
that Glass’s SymbolID format currently doesn’t include the package
version but that’s easily fixable. We could for example build
multiple stackage LTS instances and index them all in a single Glean
DB. There would be advantages to doing this: if, for instance, there
were packages in common between two Stackage instances, the Glean DB
would only contain a single copy. A lot of the type structure would be
shared too.
Provide search functionality in the LSP. Glean can provide
simple textual search for names, and with some work could also
provide Hoogle-like type search.
Think about how to index local projects and local changes. Glean
supports stacked and
incremental DBs, so we
could build a DB for a local project stacked on top of the full
Stackage DB. You would be able to go-to-definition directly from
a file in your project to the packages it depends on in
Stackage. We could re-index new .hie files as they are
generated, rather like how static-ls currently handles changes.
Integrate with HLS? Perhaps Glean could be used to handle
references outside of the current project, switching seamlessly
from GHC-based navigation to Glean-based navigation if you jump
into a non-local package.
More use cases?
I talked with a few people at ZuriHac about potential use cases for
Glean within the Haskell ecosystem. Using it in haskell.org came up
a few times, as a way to power search, navigation and analysis. Also
mentioned was the possibility of using it as a Hoogle
backend. Potentially we could replace the Haddock-generated
hyperlinked sources on haskell.org with a Glean-based browser, which
would allow navigating links between packages and find-references.
Another use case that came up was the possibility of doing impact
analysis for core library changes (or any API changes really). Some of
this is already possible using find-references, but more complex cases
such as finding instances that override certain methods aren’t
possible yet until we extend the indexer to capture richer
information.
If you’re interested in using Glean for something, why not jump on the
Glean discord server and tell us about it!